...

  • We also provide the following supplementary files:
    • enwiki.tar.gz (1.67GB); simplewiki.tar.gz (88.9MB)
    • fullwiki_files.tar.gz (1.66GB); simplewiki_files.tar.gz (88.9MB)
    • simple.ids.titles
    • full.ids.titles.sid
    • output.threshold
    • output.translation

Overview

Processing the Simple/English Wikipedia revision streams was done in three phases. Initially, articles of interest were identified and their corresponding revisions were extracted from wiki dumps. Each article has a unique id, ARTICLE_ID, and contains a set of revisions. Each revision has a unique id, REVISION_ID. Also, each revision has an optional comment representing the transition from the previous state to the current revision of the article (the first comment represents the creation of the article).

...

A map from English Wikipedia article ids to titles to Simple English Wikipedia article ids.
ENGLISH_PAGE_ID, TITLE, SIMPLE_PAGE_ID
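A minimal sketch of reading one record of this map, assuming (this is not stated above) that fields are whitespace-separated, that both ids are integers, and that the title may itself span several tokens. The function name parse_id_map_line and the sample record are hypothetical; adjust the splitting if the file turns out to be comma- or tab-delimited.

```python
from typing import Tuple


def parse_id_map_line(line: str) -> Tuple[int, str, int]:
    """Parse 'ENGLISH_PAGE_ID TITLE SIMPLE_PAGE_ID' (assumed whitespace-separated)."""
    fields = line.split()
    if len(fields) < 3:
        raise ValueError(f"expected at least 3 fields, got {len(fields)}: {line!r}")
    english_id = int(fields[0])        # first token: English page id
    simple_id = int(fields[-1])        # last token: Simple page id
    title = " ".join(fields[1:-1])     # everything in between is the title
    return english_id, title, simple_id


# Made-up record, for illustration only.
record = parse_id_map_line("12 Sample Title 34")
```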

output.threshold

Full output of the simplification system used for evaluation. For each source word, we report its top simplification, provided that at least one candidate simplification for that word has a probability of at least 45%. The entries are sorted by the probability that the source word needs simplification.

SOURCE_WORD TARGET_WORD
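A minimal sketch of loading these pairs, assuming one whitespace-separated pair per line (the delimiter is an assumption). The function name load_threshold_pairs and the sample words are hypothetical. Because Python dicts preserve insertion order, the file's ordering by probability is kept.

```python
from typing import Dict, Iterable


def load_threshold_pairs(lines: Iterable[str]) -> Dict[str, str]:
    """Map each SOURCE_WORD to its TARGET_WORD, keeping file order."""
    pairs: Dict[str, str] = {}
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        source, target = fields
        pairs.setdefault(source, target)  # keep the first entry per source word
    return pairs


# Made-up lines, for illustration only.
sample = ["utilize use", "", "commence begin"]
pairs = load_threshold_pairs(sample)
```

With a real file, the same function can be fed an open file handle, for example load_threshold_pairs(open("output.threshold", encoding="utf-8")).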

output.translation

The full translation table derived from our simplification model. Note that the paper reports 1079 pairs; there are in fact 1078, because one line was printed spuriously.

PROBABILITY_OF_SIMPLIFICATION SOURCE_WORD NUM_TARGET_WORDS (TARGET_WORD TARGET_PROBABILITY)+
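The record layout above can be parsed roughly as follows. This is a sketch under the assumption that fields are whitespace-separated and that NUM_TARGET_WORDS gives the number of (TARGET_WORD TARGET_PROBABILITY) pairs that follow. The names TranslationEntry and parse_translation_line, and the sample line, are hypothetical.

```python
from typing import List, NamedTuple, Tuple


class TranslationEntry(NamedTuple):
    probability: float                 # PROBABILITY_OF_SIMPLIFICATION
    source: str                        # SOURCE_WORD
    targets: List[Tuple[str, float]]   # (TARGET_WORD, TARGET_PROBABILITY) pairs


def parse_translation_line(line: str) -> TranslationEntry:
    """Parse one whitespace-separated record of the translation table."""
    fields = line.split()
    if len(fields) < 5:
        raise ValueError(f"too few fields: {line!r}")
    probability = float(fields[0])
    source = fields[1]
    num_targets = int(fields[2])
    rest = fields[3:]
    # The declared count must match the number of pairs actually present.
    if len(rest) != 2 * num_targets:
        raise ValueError(f"expected {num_targets} target pairs: {line!r}")
    targets = [(rest[i], float(rest[i + 1])) for i in range(0, len(rest), 2)]
    return TranslationEntry(probability, source, targets)


# Made-up record, for illustration only.
entry = parse_translation_line("0.75 utilize 2 use 0.5 employ 0.25")
```

Lines that do not fit the layout raise ValueError, so a caller can catch the exception and skip them when counting pairs.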