...

  • enwiki.tar.gz (1.67GB)
  • simplewiki.tar.gz (88.9MB)
  • fullwiki_files.tar.gz (1.66GB)
  • simplewiki_files.tar.gz (88.9MB)
  • basefiles.tar.gz (135MB)

Processing the Simple/English Wikipedia revision streams was done in three phases. First, articles of interest were identified and their corresponding revisions were extracted from wiki dumps.
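
As a rough illustration of this first phase, the sketch below pulls the revision texts for a set of target articles out of a MediaWiki XML dump. The dump path, the target titles, and the reliance on the standard export element names are assumptions for illustration, not details of the original pipeline.

```python
# Hedged sketch: stream a MediaWiki XML dump and collect the revision texts
# of the articles of interest. Element names follow the standard export
# schema; the titles and path below are placeholders.
import xml.etree.ElementTree as ET

TARGET_TITLES = {"Example article"}  # hypothetical articles of interest


def local(tag):
    """Strip the XML namespace prefix that export dumps attach to tags."""
    return tag.rsplit("}", 1)[-1]


def extract_revisions(dump_path, targets=TARGET_TITLES):
    """Yield (title, [revision texts in dump order]) for each target article."""
    title, revisions = None, []
    for _, elem in ET.iterparse(dump_path, events=("end",)):
        tag = local(elem.tag)
        if tag == "title":
            title = elem.text
        elif tag == "text":          # one <text> element per revision
            revisions.append(elem.text or "")
        elif tag == "page":
            if title in targets:
                yield title, list(revisions)
            title, revisions = None, []
            elem.clear()             # keep memory bounded on large dumps
```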

For each article (a set of revisions), a list of all unique sentences in the article was generated.

Each sentence was given an id, and each revision of the article was then represented as a sequence of SENTENCE_IDs that belonged to it. This was a reasonably compact representation of the stream.
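
Concretely, the representation can be pictured with the small sketch below; the period-based sentence splitter is a stand-in assumption, since the original tokenization is not described here.

```python
# Sketch of the sentence-ID representation: one table of unique sentences per
# article, and each revision stored as the sequence of IDs of its sentences.
def split_sentences(revision_text):
    # naive splitter used only for illustration
    return [s.strip() for s in revision_text.split(".") if s.strip()]


def encode_article(revision_texts):
    """Return (id -> sentence table, revisions encoded as ID sequences)."""
    sentence_ids = {}            # sentence string -> integer SENTENCE_ID
    encoded_revisions = []
    for text in revision_texts:
        ids = []
        for sent in split_sentences(text):
            if sent not in sentence_ids:
                sentence_ids[sent] = len(sentence_ids)
            ids.append(sentence_ids[sent])
        encoded_revisions.append(ids)
    id_to_sentence = {i: s for s, i in sentence_ids.items()}
    return id_to_sentence, encoded_revisions
```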

Afterward, sentences in adjacent revisions of an article were aligned using a TF-IDF weighting scheme in which document frequency was computed by treating each revision of the article as a document and the set of all of the article's revisions as the corpus.
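
The sketch below shows one way such an alignment can be computed, assuming token-level TF-IDF vectors compared by cosine similarity and a greedy best-match pairing; the exact weighting variant and tokenizer of the original processing are not specified here.

```python
# Hedged sketch of TF-IDF sentence alignment between two adjacent revisions.
# Document frequency is computed over the article's revisions (one revision
# per "document"); whitespace tokenization is an assumption.
import math
from collections import Counter


def idf_over_revisions(revision_token_lists):
    n = len(revision_token_lists)
    df = Counter(tok for toks in revision_token_lists for tok in set(toks))
    return {tok: math.log(n / df[tok]) for tok in df}


def tfidf(tokens, idf):
    tf = Counter(tokens)
    return {tok: tf[tok] * idf.get(tok, 0.0) for tok in tf}


def cosine(u, v):
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0


def align(prev_sents, next_sents, idf, threshold=0.3):
    """Pair each sentence of the earlier revision with its best match in the later one."""
    next_vecs = [tfidf(s.split(), idf) for s in next_sents]
    pairs = []
    for sent in prev_sents:
        vec = tfidf(sent.split(), idf)
        best, j = max(((cosine(vec, nv), j) for j, nv in enumerate(next_vecs)),
                      default=(0.0, -1))
        if best > threshold:
            pairs.append((sent, next_sents[j], best))
    return pairs
```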

Finally, phrases were extracted from aligned sentences by finding a single differing segment in each of the two sentences. If the changes were large, the differing segment could be the entire sentence; if the changes were small, it could be a single word from each sentence.
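
One simple way to find such a segment, assuming token-level comparison, is to strip the longest common prefix and suffix of the two aligned sentences and keep whatever remains from each:

```python
# Sketch of phrase extraction from an aligned sentence pair: the single
# differing segment is what is left after removing the shared prefix and
# suffix. Token-level granularity is an assumption.
def differing_segments(source_sentence, target_sentence):
    src, tgt = source_sentence.split(), target_sentence.split()
    # longest common prefix
    p = 0
    while p < len(src) and p < len(tgt) and src[p] == tgt[p]:
        p += 1
    # longest common suffix that does not overlap the prefix
    s = 0
    while s < len(src) - p and s < len(tgt) - p and src[-1 - s] == tgt[-1 - s]:
        s += 1
    return " ".join(src[p:len(src) - s]), " ".join(tgt[p:len(tgt) - s])

# Identical sentences yield two empty segments; completely different sentences
# yield the full sentences, matching the large/small-change cases above.
```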

basefiles.tar.gz:

These are the files that were used directly to create the list of simplifications; they aggregate the output of the original Hadoop processing.
Every file contains sentence pairs that were aligned between adjacent revisions using the TF-IDF-style alignment with a score above the 0.3 threshold, and where the source phrase and the final phrase both have a length of at most 5.
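
A sketch of that filter, assuming the score is the TF-IDF alignment similarity and that phrase length is measured in tokens:

```python
# Hedged sketch of the selection criteria behind basefiles.tar.gz:
# alignment score above 0.3 and both extracted phrases at most 5 tokens long.
def keep_pair(score, source_phrase, final_phrase, threshold=0.3, max_len=5):
    return (score > threshold
            and len(source_phrase.split()) <= max_len
            and len(final_phrase.split()) <= max_len)
```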

...