This page consists of our draft README for the data release related to our paper,

For the sake of simplicity: Unsupervised extraction of lexical simplifications from Wikipedia
Mark Yatskar, Bo Pang, Cristian Danescu-Niculescu-Mizil and Lillian Lee
Proceedings of the NAACL, 2010 (short paper).

We plan to complete a final version of the README later, but wanted to quickly provide enough details here for interested parties to be able to start making use of the data beforehand.

The data can be downloaded from the homepage for the lexical simplification experiments data.

The files are:

Processing the Simple/English Wikipedia revision streams was done in three phases. First, articles of interest were identified and their corresponding revisions were extracted from wiki dumps.

For each article (a set of revisions), a list of all unique sentences in the article was generated.

Each sentence was given an id, and each revision of the article was then represented as a sequence of SENTENCE_IDs that belonged to it. This was a reasonably compact representation of the stream.
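The representation described above can be illustrated with a minimal sketch. The function name and the choice of consecutive integer ids are assumptions made for illustration; the released data may assign ids differently.

```python
def encode_revisions(revisions):
    """Map each unique sentence to an id and encode revisions as id sequences.

    `revisions` is a list of revisions, each given as a list of sentence strings.
    Ids are assigned in order of first appearance (an assumption for this sketch).
    """
    sentence_ids = {}  # sentence text -> id
    streams = []       # one list of ids per revision
    for revision in revisions:
        stream = []
        for sentence in revision:
            if sentence not in sentence_ids:
                sentence_ids[sentence] = len(sentence_ids)
            stream.append(sentence_ids[sentence])
        streams.append(stream)
    return sentence_ids, streams


# Toy article with two revisions; the first sentence is shared by both.
revisions = [
    ["The cat sat.", "It was happy."],
    ["The cat sat.", "It was glad.", "It purred."],
]
sentence_ids, streams = encode_revisions(revisions)
```

Because a sentence shared by many revisions is stored only once, the per-revision streams stay short even for articles with long histories.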

Afterward, sentences in adjacent revisions of an article were aligned using a TF-IDF weighting scheme. Document frequency was computed per article: the set of revisions in an article was considered a document and the corpus was all revisions of an article.
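As a rough illustration of this kind of alignment, the sketch below scores sentence pairs by TF-IDF cosine similarity. It is not the code used to build the release. Several choices are assumptions: each revision is treated as one document when counting document frequency, tokens are whitespace-separated and lowercased, the IDF is smoothed so that words present in every revision keep a nonzero weight, and each old sentence is matched greedily to its best-scoring new sentence. All function names are hypothetical.

```python
import math
from collections import Counter


def document_frequency(documents):
    """Count, for each token, how many documents contain it."""
    df = Counter()
    for doc in documents:
        df.update(set(doc))
    return df


def tfidf_vector(tokens, df, n_docs):
    """Term frequency times a smoothed IDF (the smoothing is an assumption)."""
    tf = Counter(tokens)
    return {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1.0)
            for t, c in tf.items()}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0


def align(old_sents, new_sents, df, n_docs, threshold=0.3):
    """Pair each old sentence with its best new sentence if the score is high enough."""
    new_vecs = [tfidf_vector(s.lower().split(), df, n_docs) for s in new_sents]
    pairs = []
    for i, sent in enumerate(old_sents):
        u = tfidf_vector(sent.lower().split(), df, n_docs)
        scores = [cosine(u, v) for v in new_vecs]
        if not scores:
            continue
        j = max(range(len(scores)), key=scores.__getitem__)
        if scores[j] > threshold:
            pairs.append((i, j, scores[j]))
    return pairs


# Toy article with two adjacent revisions.
revisions = [
    ["the cat sat on the mat", "it was happy"],
    ["the cat sat on the rug", "it purred"],
]
docs = [[tok for s in rev for tok in s.lower().split()] for rev in revisions]
df = document_frequency(docs)
pairs = align(revisions[0], revisions[1], df, len(docs))
```

In this toy example only the first sentence of each revision is aligned; the second pair shares a single common word and scores below the threshold.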

Finally, phrases from aligned sentences were extracted by finding a single differing segment in each of the two sentences. If the changes were large, the single differing segment could be the entire sentence; if the changes were small, it could be a single word from each sentence.
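One simple way to find a single differing segment is to strip the longest common prefix and suffix from the two token sequences and keep whatever remains. The sketch below shows that approach; it is an assumption about how such a segment could be computed, not a description of the exact procedure used for the release, and the function name is hypothetical.

```python
def differing_segments(source_tokens, target_tokens):
    """Strip the longest common prefix and suffix; return what remains of each side."""
    limit = min(len(source_tokens), len(target_tokens))

    # Length of the shared prefix.
    prefix = 0
    while prefix < limit and source_tokens[prefix] == target_tokens[prefix]:
        prefix += 1

    # Length of the shared suffix, not overlapping the prefix.
    suffix = 0
    while (suffix < limit - prefix
           and source_tokens[-1 - suffix] == target_tokens[-1 - suffix]):
        suffix += 1

    return (source_tokens[prefix:len(source_tokens) - suffix],
            target_tokens[prefix:len(target_tokens) - suffix])


source = "the council will commence work soon".split()
target = "the council will start work soon".split()
segments = differing_segments(source, target)
# segments == (["commence"], ["start"])
```

When two sentences share no prefix or suffix, both segments are the full sentences, which matches the large-change case described above.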


These are the files that were directly used for creating a list of simplifications.
They contain data aggregated over the original Hadoop processing.
Every file contains sentences that were aligned between adjacent revisions with a TF-IDF style alignment score greater than 0.3, and where the source phrase and the final phrase each had a length of at most 5.
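The selection criteria above can be written as a small predicate. This is a sketch under assumptions: phrase length is taken to be the number of whitespace-separated words, and the function and parameter names are hypothetical rather than taken from the release.

```python
def keep_pair(score, source_phrase, final_phrase, min_score=0.3, max_len=5):
    """Return True if an aligned pair satisfies the criteria described above.

    Length is counted in whitespace-separated words (an assumption).
    """
    return (score > min_score
            and len(source_phrase.split()) <= max_len
            and len(final_phrase.split()) <= max_len)


kept = keep_pair(0.78, "commence", "start")
dropped_low_score = keep_pair(0.25, "commence", "start")
dropped_long = keep_pair(0.9, "a b c d e f", "x")
```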

There are three classes of files, each of which is in a tab-separated format:

These files are provided for three types of data:
fullwiki: an extraction of articles from the English Wikipedia that were also found in the Simple Wikipedia.
    This does not cover all shared articles, only the first 80% found while searching sequentially through the full Wikipedia dump. Older articles appear earlier in the dump.
simplewiki: an extraction of articles from the Simple Wikipedia that were also found in the English Wikipedia.
simplewiki.all: an extraction of all articles that were found in the Simple Wikipedia.


This is the output from a Hadoop process geared toward processing history dumps. There are 109 different 'parts' to the processing.
For more information about 'parts', see the Hadoop tutorial.

There are five types of files here:

Two fields need additional explanation:
SENTENCE_STREAM: This is a comma-separated string of SENTENCE_IDs in the order they appear in the revision, with the character "P" used to indicate paragraph breaks.
UNSIMPLE_FLAG: This is a flag for whether or not the revision was tagged with the "UNSIMPLE" group. Editors had the option of marking a file with this tag to indicate that it needed further simplification.
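A SENTENCE_STREAM value can be split back into paragraphs with a few lines of code. The sketch below assumes that "P" appears as its own comma-separated token and keeps the ids as strings, since the exact id format is not described here; the function name and the sample value are hypothetical.

```python
def parse_sentence_stream(stream):
    """Split a SENTENCE_STREAM string into paragraphs, each a list of sentence ids."""
    paragraphs = [[]]
    for token in stream.split(","):
        token = token.strip()
        if not token:
            continue  # tolerate stray commas or an empty field
        if token == "P":
            paragraphs.append([])  # paragraph break: start a new paragraph
        else:
            paragraphs[-1].append(token)
    # Drop empty paragraphs produced by leading, trailing, or repeated "P" markers.
    return [p for p in paragraphs if p]


paragraphs = parse_sentence_stream("12,13,P,14,P,15,16")
# paragraphs == [["12", "13"], ["14"], ["15", "16"]]
```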


The same as simplewiki.tar.gz, except for the English Wikipedia.


This is the same information as simplewiki.tar.gz, but with each article in its own folder and the index files discarded.
Each folder contains df, sentid_sent, revid_sentid, and direction_sentid_sentid_w files, which correspond only to that one article.
This is convenient for browsing documents.


The same as simplewiki_files.tar.gz, except for the English Wikipedia.


A map for simple article titles


A map from English article ids to titles to simple article ids.