...
- We also provide the following supplementary files:
- enwiki.tar.gz (1.67GB); simplewiki.tar.gz (88.9MB)
- fullwiki_files.tar.gz (1.66GB); simplewiki_files.tar.gz (88.9MB)
- simple.ids.titles
- full.ids.titles.sid
- output.threshold
- output.translation
Overview
Processing the Simple/English Wikipedia revision streams was done in three phases. Initially, articles of interest were identified and their corresponding revisions were extracted from wiki dumps. Each article has a unique id, ARTICLE_ID, and contains a set of revisions. Each revision has a unique id, REVISION_ID. Also, each revision has an optional comment representing the transition from the previous state to the current revision of the article (the first comment represents the creation of the article).
...
A map from English Wikipedia article ids to titles to Simple English Wikipedia article ids.
ENGLISH_PAGE_ID, TITLE, SIMPLE_PAGE_ID
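A minimal sketch of reading one record in this layout. The field separator is not stated here, so this assumes whitespace-separated fields with integer ids as the first and last tokens and a title that may itself contain spaces. The function name parse_id_title_line is hypothetical, and the sample record in the usage note is made up.

```python
from typing import Tuple


def parse_id_title_line(line: str) -> Tuple[int, str, int]:
    """Split one record into (english_page_id, title, simple_page_id).

    Assumption: fields are separated by whitespace, the two ids are
    integers, and everything between them is the title.
    """
    stripped = line.strip()
    # Peel the first token off the left and the last token off the right;
    # whatever remains in the middle is treated as the title.
    english_id, rest = stripped.split(None, 1)
    title, simple_id = rest.rsplit(None, 1)
    return int(english_id), title, int(simple_id)
```

For example, parse_id_title_line("12 Albert Einstein 34") yields (12, "Albert Einstein", 34). If the file turns out to be comma- or tab-delimited, splitting on that delimiter instead would be the safer choice.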
output.threshold
Full output of the simplification system used for evaluation. For each source word, we report its top simplification, provided that at least one candidate simplification for that word is at least 45% likely. The file is sorted by the probability that the source word needs simplification.
SOURCE_WORD TARGET_WORD
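A minimal sketch of loading these pairs, assuming one whitespace-separated pair per line and words that contain no internal whitespace. The function name parse_threshold_lines is hypothetical. A list is returned rather than a dict so that the file's sort order is preserved.

```python
from typing import Iterable, List, Tuple


def parse_threshold_lines(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Return (source_word, target_word) pairs in file order."""
    pairs = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != 2:
            raise ValueError(f"expected 2 fields, got {len(fields)}: {line!r}")
        pairs.append((fields[0], fields[1]))
    return pairs
```

Typical use would be to pass an open file handle, for example parse_threshold_lines(open("output.threshold", encoding="utf-8")); the encoding is also an assumption.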
output.translation
The full translation table derived from our simplification model. Note that the paper reports 1079 pairs; there are in fact 1078, because one line was printed spuriously.
PROBABILITY_OF_SIMPLIFICATION SOURCE_WORD NUM_TARGET_WORDS (TARGET_WORD TARGET_PROBABILITY)+
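A minimal sketch of parsing one line of this layout, assuming whitespace-separated fields and target words without internal whitespace. The names TranslationEntry and parse_translation_line are hypothetical, and the sample line in the usage note is made up rather than taken from the data.

```python
from typing import List, NamedTuple, Tuple


class TranslationEntry(NamedTuple):
    probability: float                # PROBABILITY_OF_SIMPLIFICATION
    source: str                       # SOURCE_WORD
    targets: List[Tuple[str, float]]  # (TARGET_WORD, TARGET_PROBABILITY) pairs


def parse_translation_line(line: str) -> TranslationEntry:
    """Parse one translation-table record."""
    fields = line.split()
    if len(fields) < 3:
        raise ValueError(f"too few fields: {line!r}")
    probability = float(fields[0])
    source = fields[1]
    num_targets = int(fields[2])
    rest = fields[3:]
    # NUM_TARGET_WORDS must match the number of (word, probability) pairs.
    if len(rest) != 2 * num_targets:
        raise ValueError(f"expected {num_targets} target pairs: {line!r}")
    targets = [(rest[i], float(rest[i + 1])) for i in range(0, len(rest), 2)]
    return TranslationEntry(probability, source, targets)
```

For example, parse_translation_line("0.75 utilize 2 use 0.9 employ 0.1") gives an entry with source "utilize" and two target pairs. Checking the declared count against the actual number of pairs is one way to catch a malformed record such as the spuriously printed line mentioned above.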