...

  • enwiki.tar.gz (1.67GB)
  • simplewiki.tar.gz (88.9MB)
  • fullwiki_files.tar.gz (1.66GB)
  • simplewiki_files.tar.gz (88.9MB)
  • basefiles.tar.gz (135MB) 
  • simple.ids.titles
  • full.ids.titles.sid

Processing the Simple/English Wikipedia revision streams was done in three phases. First, articles of interest were identified and their corresponding revisions were extracted from the wiki dumps. Each article has a unique id, ARTICLE_ID, and contains a set of revisions. Each revision has a unique id, REVISION_ID, and an optional comment describing the transition from the previous state of the article to the current revision (the first comment represents the creation of the article).
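
For concreteness, the extracted records can be pictured as the following structures (a sketch only; the class and field names are illustrative and not taken from the released files):

    from dataclasses import dataclass, field
    from typing import List, Optional

    @dataclass
    class Revision:
        revision_id: int               # REVISION_ID, unique per revision
        text: str                      # article text at this revision
        comment: Optional[str] = None  # optional edit comment; the first
                                       # one marks the article's creation

    @dataclass
    class Article:
        article_id: int                # ARTICLE_ID, unique per article
        title: str
        revisions: List[Revision] = field(default_factory=list)  # ordered by time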

...

Next, sentences in adjacent revisions of an article were aligned using a TF-IDF weighting scheme in which document frequency was computed by treating each revision of an article as a document, with all revisions of that article as the corpus. This alignment produces an ALIGNMENT_SCORE for each sentence pair.
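
A minimal sketch of this revision-scoped TF-IDF alignment, assuming scikit-learn, cosine similarity as the scoring function, and greedy best-match pairing (the similarity measure and pairing strategy are assumptions, not details from the original scripts):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    def align_sentences(old_sents, new_sents, all_revision_texts):
        # Fit IDF over the article's own history: each revision is a
        # "document", and the corpus is all revisions of that article.
        vectorizer = TfidfVectorizer()
        vectorizer.fit(all_revision_texts)

        old_vecs = vectorizer.transform(old_sents)
        new_vecs = vectorizer.transform(new_sents)
        scores = cosine_similarity(old_vecs, new_vecs)

        # Greedily pair each old sentence with its best-scoring new sentence.
        alignments = []
        for i, sent in enumerate(old_sents):
            j = scores[i].argmax()
            alignments.append((sent, new_sents[j], scores[i, j]))  # ALIGNMENT_SCORE
        return alignments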

Finally, phrases were extracted from aligned sentences by finding the single differing segment in each of the two sentences (PHRASE). If the changes were large, the differing segment could be the entire sentence; if the changes were small, it could be a single word from each sentence.
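
One simple way to isolate the single differing segment, shown here as a sketch rather than the released extraction code, is to strip the longest common prefix and suffix of the two token sequences:

    def differing_segments(old_tokens, new_tokens):
        # Strip the longest common prefix.
        p = 0
        while (p < len(old_tokens) and p < len(new_tokens)
               and old_tokens[p] == new_tokens[p]):
            p += 1
        # Strip the longest common suffix (without crossing the prefix).
        s = 0
        while (s < len(old_tokens) - p and s < len(new_tokens) - p
               and old_tokens[-1 - s] == new_tokens[-1 - s]):
            s += 1
        # What remains in each sentence is its single differing segment.
        return (old_tokens[p:len(old_tokens) - s],
                new_tokens[p:len(new_tokens) - s])

    # e.g. differing_segments("the big cat sat".split(), "the small cat sat".split())
    # -> (['big'], ['small'])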

...

These files are provided for three types of data:
fullwiki: an extraction of articles from the English Wikipedia that were also found in the Simple English Wikipedia.
    This is not all shared articles, but the first 80% found while sequentially searching the full Wikipedia dump; older articles appear earlier in the dump.
simplewiki: an extraction of articles from the Simple English Wikipedia that were also found in the English Wikipedia.
simplewiki.all: an extraction of all articles found in the Simple English Wikipedia.

...