Simplification data README draft

This page contains our draft README for the data release accompanying our paper,

For the sake of simplicity: Unsupervised extraction of lexical simplifications from Wikipedia
Mark Yatskar, Bo Pang, Cristian Danescu-Niculescu-Mizil and Lillian Lee
Proceedings of the NAACL, 2010 (short paper).

We plan to complete a final version of the README later, but wanted to provide enough detail here for interested parties to start using the data in the meantime.

The data can be downloaded from the homepage, "data for lexical simplification experiments".

The files are:

enwiki.tar.gz (1.67GB)
simplewiki.tar.gz (88.9MB)
fullwiki_files.tar.gz (1.66GB)
simplewiki_files.tar.gz (88.9MB)
basefiles.tar.gz (135MB)

basefiles.tar.gz:

These are the files that were used directly to create the list of simplifications.
They contain data aggregated from the output of the original Hadoop processing.
Every file contains sentences that were aligned between revisions with a TF-IDF style alignment score greater
than 0.3, and where the source phrase and the final phrase each have a length of at most 5.

There are four classes of files, each of which is in a tab-separated format:

*.extra: ARTICLE_ID, REVISION1_ID, REVISION2_ID, SENTENCE1_ID, SENTENCE2_ID, ALIGNMENT_SCORE, PHRASE1_LENGTH, PHRASE1, PHRASE2_LENGTH, PHRASE2, SENTENCE1, SENTENCE2, COMMENT
*.sp: the same as *.extra, but filtered for words with identical Soundex codes.
*.cut3: ARTICLE_ID, PHRASE1, PHRASE2
*.simpl: the same as *.extra, but restricted to entries whose comments contain "simpl*".
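
As a starting point for working with these files, the following is a minimal Python sketch that reads lines in the *.extra layout described above. It assumes there is no header row, that fields are separated by single tabs with no quoting, and that the score and length fields parse as numbers; none of this has been checked against the released files. The function name read_extra and the sample line are made up for illustration.

```python
import io

# Field order copied from the *.extra description in this README.
EXTRA_FIELDS = [
    "ARTICLE_ID", "REVISION1_ID", "REVISION2_ID",
    "SENTENCE1_ID", "SENTENCE2_ID", "ALIGNMENT_SCORE",
    "PHRASE1_LENGTH", "PHRASE1", "PHRASE2_LENGTH", "PHRASE2",
    "SENTENCE1", "SENTENCE2", "COMMENT",
]


def read_extra(handle):
    """Yield one dict per well-formed line of a *.extra-style file."""
    for line in handle:
        parts = line.rstrip("\r\n").split("\t")
        if len(parts) != len(EXTRA_FIELDS):
            # Assumption: lines with an unexpected field count are skipped.
            continue
        record = dict(zip(EXTRA_FIELDS, parts))
        # Assumption: these three fields are numeric in the released data.
        record["ALIGNMENT_SCORE"] = float(record["ALIGNMENT_SCORE"])
        record["PHRASE1_LENGTH"] = int(record["PHRASE1_LENGTH"])
        record["PHRASE2_LENGTH"] = int(record["PHRASE2_LENGTH"])
        yield record


# Made-up example line, not taken from the released data.
sample = "\t".join([
    "12", "100", "101", "3", "4", "0.75",
    "1", "big", "1", "large",
    "It is big .", "It is large .", "simplify wording",
]) + "\n"

for record in read_extra(io.StringIO(sample)):
    print(record["PHRASE1"], "->", record["PHRASE2"], record["ALIGNMENT_SCORE"])
```

To read a real file, pass an opened file object instead of the io.StringIO wrapper. The character encoding of the files is not stated here, so the encoding argument to open() would also be an assumption.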

These files are provided for three types of data:
fullwiki: an extraction of articles from the English Wikipedia that were also found in the Simple Wikipedia.
This does not cover all shared articles, only the first 80% that were found by sequentially searching the full Wikipedia dump. Older articles appear earlier in the dump.
simplewiki: an extraction of articles from the Simple Wikipedia that were also found in the English Wikipedia.
simplewiki.all: an extraction of all articles that were found in the Simple Wikipedia.

simplewiki.tar.gz:

The output from a Hadoop process geared toward processing history dumps. There are 109 different 'parts' to the processing.
For more information about 'parts', refer to the Hadoop tutorial.

There are five types of files here:

*.df: PAGE_ID, WORD, FREQ, with the exception of the first line for each article, which lists the article's name.
*.sentid_sent: PAGE_ID, SENTID, SENTENCE
*.revid_sentid: PAGE_ID, REVID, UNSIMPLE_FLAG, SENTENCE_COUNT, SENTENCE_STREAM, COMMENT
*.directed_sentid_sentid_w: PAGE_ID, REVISION1_ID, REVISION2_ID, SENTENCE1_ID, SENTENCE2_ID, ALIGNMENT_SCORE, PHRASE1_LENGTH, PHRASE1, PHRASE2_LENGTH, PHRASE2
*.index: This is an index for looking up the information for a particular page. There is one index for each of the file types listed above, and all of them share the same format: PAGE_ID, END_LINE_NUMBER, NUMBER_OF_LINES.
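
The index format suggests a simple way to pull out the lines belonging to one page. The sketch below is only an illustration under stated assumptions: that index files are tab-separated, and that END_LINE_NUMBER is the 1-based number of the last line of the page's block in the corresponding data file. Neither point is confirmed above, so check a few entries by hand before relying on it. The function names load_index and page_lines are made up.

```python
from itertools import islice


def load_index(index_lines):
    """Map PAGE_ID -> (END_LINE_NUMBER, NUMBER_OF_LINES)."""
    index = {}
    for line in index_lines:
        # Assumption: three tab-separated fields per index line.
        page_id, end_line, n_lines = line.rstrip("\r\n").split("\t")
        index[page_id] = (int(end_line), int(n_lines))
    return index


def page_lines(data_lines, index, page_id):
    """Return the data lines for one page.

    Assumption: END_LINE_NUMBER is 1-based and inclusive, so the block
    covers 0-based offsets [end - count, end).
    """
    end_line, n_lines = index[page_id]
    start = end_line - n_lines
    return list(islice(data_lines, start, end_line))


# Made-up miniature data file and index, for illustration only.
data = ["p1 line a\n", "p1 line b\n", "p2 line a\n", "p2 line b\n", "p2 line c\n"]
index = load_index(["p1\t2\t2\n", "p2\t5\t3\n"])
print(page_lines(data, index, "p2"))
```

Because islice works on any iterable, data_lines can also be an opened file object, which avoids loading a whole part into memory.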

Two fields need additional explanation:
SENTENCE_STREAM: This is a comma-separated string of SENTID values in the order they appear in the revision, with the character "P" used to indicate paragraph breaks.
UNSIMPLE_FLAG: This is a flag for whether or not the revision was tagged with the "UNSIMPLE" group. Editors had the option of marking a file with this tag to indicate that it needed further simplification.
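
To make the SENTENCE_STREAM layout concrete, here is a small Python sketch that splits a stream into paragraphs of sentence ids. It assumes that "P" always appears as its own comma-separated item and that sentence ids can be kept as plain strings; the function name split_paragraphs and the example stream are made up.

```python
def split_paragraphs(sentence_stream):
    """Split a SENTENCE_STREAM string into a list of paragraphs.

    Each paragraph is a list of SENTID strings in revision order.
    Assumption: a paragraph break is a standalone "P" item.
    """
    paragraphs = [[]]
    for token in sentence_stream.split(","):
        token = token.strip()
        if not token:
            continue  # tolerate empty items such as a trailing comma
        if token == "P":
            # Start a new paragraph, ignoring repeated or leading breaks.
            if paragraphs[-1]:
                paragraphs.append([])
        else:
            paragraphs[-1].append(token)
    if not paragraphs[-1]:
        paragraphs.pop()  # drop an empty paragraph left by a trailing "P"
    return paragraphs


# Made-up stream, for illustration only.
print(split_paragraphs("4,7,P,9,P,12,13"))
```

The resulting ids can then be looked up in the matching *.sentid_sent file to recover the text of each revision.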

enwiki.tar.gz:

The same as simplewiki.tar.gz, but for the English Wikipedia.

simplewiki_files.tar.gz:

This contains the same information as simplewiki.tar.gz, but each article is in its own folder and the index files are omitted.
Each folder contains the df, sentid_sent, revid_sentid, and direction_sentid_sentid_w files for that one article only.
This is convenient for browsing documents.

enwiki_files.tar.gz:

The same as simplewiki_files.tar.gz, but for the English Wikipedia.

simple.ids.titles:

A map from Simple Wikipedia page ids to article titles:
PAGE_ID, SIMPLE_TITLE

full.ids.titles.sid:

A map from English article ids to titles to simple article ids:
ENGLISH_PAGE_ID, TITLE, SIMPLE_PAGE_ID
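
The mapping above can be used to connect an English article to its Simple Wikipedia counterpart. The following Python sketch loads it into a dictionary. It assumes the file is tab-separated with exactly three fields per line, which is not stated above, and the function name load_english_to_simple and the example rows are made up.

```python
def load_english_to_simple(lines):
    """Map ENGLISH_PAGE_ID -> (TITLE, SIMPLE_PAGE_ID).

    Assumption: one tab-separated triple per line; other lines are skipped.
    Ids are kept as strings so they match the ids read from other files.
    """
    mapping = {}
    for line in lines:
        parts = line.rstrip("\r\n").split("\t")
        if len(parts) != 3:
            continue
        english_id, title, simple_id = parts
        mapping[english_id] = (title, simple_id)
    return mapping


# Made-up rows, for illustration only.
rows = ["1001\tExample article\t17\n", "1002\tAnother article\t42\n"]
mapping = load_english_to_simple(rows)
print(mapping["1001"])
```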
