...

These are files that were directly used for creating a list of simplifications.
They contain data aggregated over the original Hadoop processing.
Every file contains sentences that were aligned between revisions using a TF-IDF style alignment with a score
greater than 0.3, and where the source phrase and the final phrase both had a length of 5 or less.

There are three classes of files, each of which is in a tab-separated format:

  • *.extra: ARTICLE_ID, REVISION1_ID, REVISION2_ID, SENTENCE1_ID, SENTENCE2_ID, ALIGNMENT_SCORE, PHRASE1_LENGTH, PHRASE1, PHRASE2_LENGTH, PHRASE2, SENTENCE1, SENTENCE2, COMMENT
  • *.sp: same fields as *.extra, but filtered on words with identical Soundex codes.
  • *.cut3: ARTICLE_ID, PHRASE1, PHRASE2
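
As a rough illustration of the layout above, the following sketch reads a *.extra or *.sp file into dictionaries keyed by the listed field names. It assumes the files have no header row and that ALIGNMENT_SCORE and the two length fields are numeric; neither is stated here. The function name read_extra and the sample row are made up for the example.

```python
import csv
import io

# Column order for *.extra and *.sp files, as listed above.
EXTRA_FIELDS = [
    "ARTICLE_ID", "REVISION1_ID", "REVISION2_ID", "SENTENCE1_ID", "SENTENCE2_ID",
    "ALIGNMENT_SCORE", "PHRASE1_LENGTH", "PHRASE1", "PHRASE2_LENGTH", "PHRASE2",
    "SENTENCE1", "SENTENCE2", "COMMENT",
]


def read_extra(handle):
    """Yield one dict per row of a tab-separated *.extra or *.sp file."""
    # QUOTE_NONE: treat quote characters in sentences as ordinary text.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != len(EXTRA_FIELDS):
            continue  # assumption: skip malformed rows instead of failing
        record = dict(zip(EXTRA_FIELDS, row))
        # Assumption: these three fields are numeric.
        record["ALIGNMENT_SCORE"] = float(record["ALIGNMENT_SCORE"])
        record["PHRASE1_LENGTH"] = int(record["PHRASE1_LENGTH"])
        record["PHRASE2_LENGTH"] = int(record["PHRASE2_LENGTH"])
        yield record


# Made-up sample row for illustration only; it is not taken from the data.
sample = "\t".join([
    "1", "10", "11", "0", "0", "0.75", "1", "utilize", "1", "use",
    "They utilize it.", "They use it.", "simplify",
]) + "\n"
rows = list(read_extra(io.StringIO(sample)))
```

For a real file, the same function can be given a handle opened with newline="" and the appropriate encoding, which is not specified here.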

...

These files are provided for three types of data:
fullwiki: an extraction of articles from English Wikipedia that were also found in Simple Wikipedia.
    This is not all shared articles, but the first 80% that were found by sequentially searching the full Wikipedia dump. Older articles appear earlier in the dump.
simplewiki: an extraction of articles from Simple Wikipedia that were also found in English Wikipedia.
simplewiki.all: an extraction of all articles that were found in Simple Wikipedia.

...

The output from a Hadoop process geared toward processing history dumps. There are 109 different 'parts' to the processing.
For more information about 'parts', see the Hadoop tutorial.

There are five types of files here:

...

Two fields need additional explanation:
SENTENCE_STREAM: This is a comma-separated string of SENTID values in the order they appear in the revision, with the character "P" used to indicate paragraph breaks.
UNSIMPLE_FLAG: This is a flag for whether or not the revision was tagged with the "UNSIMPLE" group. Editors had the option of marking a file with this tag to indicate that it needed further simplification.
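
A SENTENCE_STREAM value can be turned back into paragraphs with a few lines of code. The sketch below assumes that "P" appears as its own comma-separated token (for example "4,7,P,9"), which is an interpretation of the description above rather than a documented detail. Sentence IDs are kept as strings, and split_sentence_stream is a hypothetical helper name.

```python
def split_sentence_stream(stream):
    """Group sentence IDs into paragraphs, splitting on the "P" marker."""
    paragraphs = [[]]
    for token in stream.split(","):
        token = token.strip()
        if not token:
            continue  # ignore empty tokens such as a trailing comma
        if token == "P":
            # Start a new paragraph, but never emit an empty one.
            if paragraphs[-1]:
                paragraphs.append([])
        else:
            paragraphs[-1].append(token)
    if not paragraphs[-1]:
        paragraphs.pop()  # drop the empty group left by a trailing "P"
    return paragraphs


# Made-up stream: two sentences, a paragraph break, then one sentence.
example = split_sentence_stream("4,7,P,9")
```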

...

full.ids.titles.sid

A map from English article IDs to titles to Simple article IDs.
ENGLISH_PAGE_ID, TITLE, SIMPLE_PAGE_ID
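
A minimal sketch of loading this map into lookup tables follows. It assumes the file is tab-separated with one mapping per line and no header row; the description above does not state the delimiter for this file. The name load_id_map and the sample line are illustrative only.

```python
import io


def load_id_map(handle):
    """Return (english_to_simple, english_to_title) dicts keyed by ENGLISH_PAGE_ID."""
    english_to_simple = {}
    english_to_title = {}
    for line in handle:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 3:
            continue  # assumption: skip lines that do not have exactly three fields
        english_id, title, simple_id = parts
        english_to_simple[english_id] = simple_id
        english_to_title[english_id] = title
    return english_to_simple, english_to_title


# Made-up sample line for illustration only; the IDs are not real.
sample = "100\tExample title\t200\n"
to_simple, to_title = load_id_map(io.StringIO(sample))
```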