...

All data files can be downloaded from this page (see "Data for lexical simplification experiments").

  • The most "processed" data can be found in basefiles.tar.gz (135MB)
  • We also provide the following supplementary files:
    • enwiki.tar.gz (1.67GB)
    • simplewiki.tar.gz (88.9MB)
    • fullwiki_files.tar.gz (1.66GB)
    • simplewiki_files.tar.gz (88.9MB)
    • simple.ids.titles
    • full.ids.titles.sid

Overview

Processing the Simple/English Wikipedia revision streams was done in three phases. Initially, articles of interest were identified and their corresponding revisions were extracted from the wiki dumps. Each article has a unique id (ARTICLE_ID) and contains a set of revisions. Each revision has a unique id (REVISION_ID) and an optional comment describing the transition from the previous state of the article to the current revision (the first comment describes the creation of the article).
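Purely as an illustration of this data model (the class and field names below are hypothetical and not part of the released files), the structure can be sketched as:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Revision:
    revision_id: int            # REVISION_ID, unique per revision
    comment: Optional[str]      # optional edit comment; for the first revision
                                # it describes the creation of the article

@dataclass
class Article:
    article_id: int             # ARTICLE_ID, unique per article
    revisions: List[Revision] = field(default_factory=list)
```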

...

Finally, phrases were extracted from aligned sentence pairs by finding the single differing segment in each of the two sentences (PHRASE). If the changes were large, the single differing segment could be an entire sentence; if the changes were small, it could be a single word from each sentence.
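One way to isolate the single differing segment is to strip the longest common prefix and suffix of the two token sequences; the sketch below illustrates this idea and is not necessarily the exact procedure used in the original processing:

```python
def differing_segment(sentence1: str, sentence2: str):
    """Return the single differing segment (PHRASE1, PHRASE2) of two sentences.

    Strips the longest common prefix and suffix of the token sequences;
    whatever remains on each side is that sentence's phrase.
    """
    s1, s2 = sentence1.split(), sentence2.split()
    # longest common prefix
    p = 0
    while p < len(s1) and p < len(s2) and s1[p] == s2[p]:
        p += 1
    # longest common suffix that does not overlap the prefix
    q = 0
    while q < len(s1) - p and q < len(s2) - p and s1[-1 - q] == s2[-1 - q]:
        q += 1
    return s1[p:len(s1) - q], s2[p:len(s2) - q]

# Example: differing_segment("a large dog", "a big dog") -> (["large"], ["big"])
```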

basefiles.tar.gz:

These are the files that were used directly for creating the list of simplifications; the data is aggregated from the original Hadoop processing.
The files contain the lexical edit instances (PHRASE1 -> PHRASE2) used in our paper, together with the aligned sentence pairs (aligned between adjacent revisions) from which these instances were extracted. We include only sentence pairs that satisfy the following criteria (a sketch of this filter appears after the list):

  • ALIGNMENT_SCORE > .3 using the TF-IDF-based alignment described above; and
  • both PHRASE1 and PHRASE2 extracted from this sentence pair contain no more than 5 words.
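A minimal sketch of this filter, assuming phrase length is counted in whitespace-separated words (the function below is illustrative only):

```python
MIN_ALIGNMENT_SCORE = 0.3   # ALIGNMENT_SCORE threshold
MAX_PHRASE_WORDS = 5        # maximum phrase length, in words

def keep_sentence_pair(alignment_score: float, phrase1: str, phrase2: str) -> bool:
    """True if an aligned sentence pair satisfies both criteria above."""
    return (alignment_score > MIN_ALIGNMENT_SCORE
            and len(phrase1.split()) <= MAX_PHRASE_WORDS
            and len(phrase2.split()) <= MAX_PHRASE_WORDS)
```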

There are four classes of files, each in a tab-separated format (a short parsing sketch follows the list):

  • *.extra: ARTICLE_ID, REVISION1_ID, REVISION2_ID, SENTENCE1_ID, SENTENCE2_ID, ALIGNMENT_SCORE, PHRASE1_LENGTH, PHRASE1, PHRASE2_LENGTH, PHRASE2, SENTENCE1, SENTENCE2, COMMENT
  • *.sp: a subset of *.extra: an instance is filtered out if PHRASE1 and PHRASE2 have identical soundex encodings.
  • *.cut3: ARTICLE_ID, PHRASE1, PHRASE2
  • *.simpl: a subset of *.extra containing only instances whose COMMENT contains "simpl*".
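For example, a *.extra record could be read as follows (the file name is hypothetical; the column names follow the listing above):

```python
EXTRA_FIELDS = [
    "ARTICLE_ID", "REVISION1_ID", "REVISION2_ID", "SENTENCE1_ID", "SENTENCE2_ID",
    "ALIGNMENT_SCORE", "PHRASE1_LENGTH", "PHRASE1", "PHRASE2_LENGTH", "PHRASE2",
    "SENTENCE1", "SENTENCE2", "COMMENT",
]

# "simplewiki.extra" is a hypothetical file name, used only for illustration.
with open("simplewiki.extra", encoding="utf-8") as f:
    for line in f:
        fields = line.rstrip("\n").split("\t")
        record = dict(zip(EXTRA_FIELDS, fields))
        print(record["PHRASE1"], "->", record["PHRASE2"])
```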

These files are provided for three types of data:

  • fullwiki: an extraction of articles from the English Wikipedia that were also found in the Simple Wikipedia. This is not the complete set of shared articles, but the first 80% found while sequentially scanning the full Wikipedia dump; older articles appear earlier in the dump.
  • simplewiki: an extraction of articles from the Simple Wikipedia that were also found in the English Wikipedia.
  • simplewiki.all: an extraction of all articles found in the Simple Wikipedia.

simplewiki.tar.gz:

Intermediate files with basic pre-processing of revision data, produced by a Hadoop process over the history dumps. The output is split into 109 different 'parts', each containing a subset of the data; for more information about 'parts', see the Hadoop tutorial.

There are five types of files here:

  • *.df: PAGE_ID, WORD, FREQ (the first line for each article is an exception: it lists the article's name). For a given article, document frequency is computed over its revisions, where each revision is treated as a "document" and the corpus consists of all revisions of that article.
  • *.sentid_sent: PAGE_ID, SENTENCE_ID, SENTENCE
  • *.revid_sentid: PAGE_ID, REVISION_ID, UNSIMPLE_FLAG, SENTENCE_COUNT, SENTENCE_STREAM, COMMENT
  • *.directed_sentid_sentid_w: PAGE_ID, REVISION1_ID, REVISION2_ID, SENTENCE1_ID, SENTENCE2_ID, ALIGNMENT_SCORE, PHRASE1_LENGTH, PHRASE1, PHRASE2_LENGTH, PHRASE2
  • *.index: an index for looking up the information for a particular page. There is one index file for each of the types listed above, and each takes the same format: PAGE_ID, END_LINE_NUMBER, NUMBER_OF_LINES. (A lookup sketch follows this list.)
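A minimal sketch of using an index file to pull out one page's lines; the file names are hypothetical, and END_LINE_NUMBER is assumed here to be 1-based and inclusive (check this against the actual files):

```python
import itertools

def load_index(index_path):
    """Map PAGE_ID -> (END_LINE_NUMBER, NUMBER_OF_LINES) from an *.index file."""
    index = {}
    with open(index_path, encoding="utf-8") as f:
        for line in f:
            page_id, end_line, n_lines = line.rstrip("\n").split("\t")
            index[page_id] = (int(end_line), int(n_lines))
    return index

def read_page_lines(data_path, index, page_id):
    """Return the lines of a data file that belong to page_id."""
    end_line, n_lines = index[page_id]      # assumed 1-based, inclusive
    start_line = end_line - n_lines + 1
    with open(data_path, encoding="utf-8") as f:
        return list(itertools.islice(f, start_line - 1, end_line))

# Hypothetical usage:
# idx = load_index("part-00000.sentid_sent.index")
# lines = read_page_lines("part-00000.sentid_sent", idx, "12345")
```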

Two fields need additional explanation:
SENTENCE_STREAM: a comma-separated string of SENTENCE_IDs in the order they appear in the revision, with the character "P" used to indicate paragraph breaks.
UNSIMPLE_FLAG: a flag for whether or not the revision was tagged with the "UNSIMPLE" group. Editors had the option of marking a page with this tag to indicate that it needed further simplification.
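For illustration, a SENTENCE_STREAM value could be unpacked into paragraphs of SENTENCE_IDs as sketched below (the example stream is made up; the format follows the description above):

```python
def parse_sentence_stream(stream: str):
    """Split a SENTENCE_STREAM value into paragraphs of SENTENCE_IDs.

    Example: "12,7,P,3,44" -> [["12", "7"], ["3", "44"]]
    """
    paragraphs, current = [], []
    for token in stream.split(","):
        if token == "P":            # "P" marks a paragraph break
            if current:
                paragraphs.append(current)
                current = []
        else:
            current.append(token)
    if current:
        paragraphs.append(current)
    return paragraphs
```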

enwiki.tar.gz

Similar to simplewiki.tar.gz: intermediate files generated when processing the English Wikipedia.

simplewiki_files.tar.gz

...