...

  • *.df: PAGE_ID, WORD, FREQ, except for the first line of each article, which lists the article's name. This lists the document frequency of each word, where a document is a revision and the corpus is all revisions of an article.
  • *.sentid_sent: PAGE_ID, SENTENCE_ID, SENTENCE
  • *.revid_sentid: PAGE_ID, REVISION_ID, UNSIMPLE_FLAG, SENTENCE_COUNT, SENTENCE_STREAM, COMMENT
  • *.directed_sentid_sentid_w: PAGE_ID, REVISION1_ID, REVISION2_ID, SENTENCE1_ID, SENTENCE2_ID, ALIGNMENT_SCORE, PHRASE1_LENGTH, PHRASE1, PHRASE2_LENGTH, PHRASE2
  • *.index: An index for looking up the lines that belong to a particular page. There is one index for each of the file types listed above, and all of them share the same format: PAGE_ID, END_LINE_NUMBER, NUMBER_OF_LINES.
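
The index format above can be turned into a line range per page. The following is a minimal sketch, not the tooling used to build the data. It assumes that fields are tab-separated, that line numbers are 1-based, and that END_LINE_NUMBER is inclusive; none of these is stated here, so check them against the actual files. The function names parse_index and page_lines are hypothetical.

```python
def parse_index(lines, delimiter="\t"):
    """Map PAGE_ID -> (first_line, last_line).

    Assumes each row is PAGE_ID, END_LINE_NUMBER, NUMBER_OF_LINES,
    tab-separated, with 1-based line numbers and an inclusive end line.
    """
    ranges = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        page_id, end_line, n_lines = line.split(delimiter)
        end_line, n_lines = int(end_line), int(n_lines)
        ranges[page_id] = (end_line - n_lines + 1, end_line)
    return ranges


def page_lines(data_lines, line_range):
    """Return the lines of a data file that fall inside line_range."""
    first, last = line_range
    return [
        line
        for number, line in enumerate(data_lines, start=1)
        if first <= number <= last
    ]
```

With real files, data_lines would be an open file handle for one of the data files and lines the handle of its matching index, for example `parse_index(open(index_path, encoding="utf-8"))`, where index_path is whatever path the index was extracted to.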

...

The same as simplewiki.tar.gz, except for the English Wikipedia.

simplewiki_files.tar.gz

This is the same information as simplewiki.tar.gz, but each article is in its own folder and the index files are omitted.
Each folder contains df, sentid_sent, revid_sentid, and directed_sentid_sentid_w files that correspond only to that one article.
This is convenient for browsing documents.

enwiki_files.tar.gz

The same as simplewiki_files.tar.gz, except for the English Wikipedia.

simple.ids.titles

A map from simple article ids to titles.
PAGE_ID, SIMPLE_TITLE

full.ids.titles.sid

A map from English article ids to titles to simple article ids.
ENGLISH_PAGE_ID, TITLE, SIMPLE_PAGE_ID
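
The two maps can be loaded into dictionaries to go from an English page id to the matching simple article. This is a rough sketch under the assumption that both files are tab-separated with one record per line, which is not stated here. The function names are hypothetical.

```python
def load_simple_titles(lines, delimiter="\t"):
    """Read PAGE_ID, SIMPLE_TITLE rows into {page_id: title}."""
    titles = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        page_id, title = line.split(delimiter, 1)
        titles[page_id] = title
    return titles


def load_english_to_simple(lines, delimiter="\t"):
    """Read ENGLISH_PAGE_ID, TITLE, SIMPLE_PAGE_ID rows.

    Returns {english_page_id: (title, simple_page_id)}.
    """
    mapping = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # Split the ids off both ends so the title is whatever is in between.
        english_id, rest = line.split(delimiter, 1)
        title, simple_id = rest.rsplit(delimiter, 1)
        mapping[english_id] = (title, simple_id)
    return mapping
```

Both functions accept any iterable of lines, so an open handle for simple.ids.titles or full.ids.titles.sid can be passed in directly.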