...

We plan to complete a final version of this README later, but in the meantime wanted to provide enough detail here for interested parties to start making use of the data.

All data files can be downloaded from the homepage linked from this page (data for lexical simplification experiments).

  • The most "processed" data can be found in basefiles.tar.gz (135MB)

We also provide the following supplementary files:

  • enwiki.tar.gz (1.67GB)
  • simplewiki.tar.gz (88.9MB)
  • fullwiki_files.tar.gz (1.66GB)
  • simplewiki_files.tar.gz (88.9MB)
  • simple.ids.titles
  • full.ids.titles.sid

Overview

Processing the Simple/English Wikipedia revision streams was done in three phases. Initially, articles of interest were identified and their corresponding revisions were extracted from wiki dumps. Each article has a unique id, ARTICLE_ID, and contains a set of revisions. Each revision has a unique id, REVISION_ID. Also, each revision has an optional comment representing the transition from the previous state to the current revision of the article (the first comment represents the creation of the article).
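As a rough illustration of this logical structure (not the distributed file format; the class and field names below are only illustrative), an extracted article and its revisions could be modeled as:

    from dataclasses import dataclass, field
    from typing import List, Optional

    @dataclass
    class Revision:
        revision_id: int               # REVISION_ID, unique per revision
        comment: Optional[str] = None  # optional edit comment; the first comment
                                       # describes the creation of the article

    @dataclass
    class Article:
        article_id: int                            # ARTICLE_ID, unique per article
        revisions: List[Revision] = field(default_factory=list)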

For each article (a set of revisions), a list of all unique sentences in the revision stream of this article was generated. Each sentence was given an id, and each revision of the article was then represented as a sequence of SENTENCE_IDs that belonged to it. This was a reasonably compact representation of the stream.
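For example (a purely hypothetical sketch; the actual file names and on-disk layout may differ), the per-article sentence table and the SENTENCE_ID encoding of revisions could be used like this:

    # sentences: SENTENCE_ID -> sentence text (one table per article)
    # revisions: ordered list of (REVISION_ID, [SENTENCE_ID, ...])
    sentences = {
        0: "Lexical simplification replaces difficult words with simpler ones.",
        1: "This sentence was later removed.",
        2: "A new sentence added in a later revision.",
    }

    revisions = [
        (1001, [0, 1]),   # first revision of the article
        (1002, [0, 2]),   # later revision: sentence 1 dropped, sentence 2 added
    ]

    def reconstruct(sentence_ids, sentence_table):
        """Rebuild the text of one revision from its SENTENCE_ID sequence."""
        return " ".join(sentence_table[sid] for sid in sentence_ids)

    for rev_id, sids in revisions:
        print(rev_id, reconstruct(sids, sentences))

Because each unique sentence is stored once and every revision is just a list of IDs, the full revision stream of an article can be stored and replayed compactly.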

...