
...

*.extra: ARTICLE_ID, REVISION1_ID, REVISION2_ID, SENTENCE1_ID, SENTENCE2_ID, ALIGNMENT_SCORE, PHRASE1_LENGTH, PHRASE1, PHRASE2_LENGTH, PHRASE2, SENTENCE1, SENTENCE2, COMMENT
*.sp: same fields as *.extra, but filtered where PHRASE1 and PHRASE2 have identical Soundex codes.
*.cut3: ARTICLE_ID, PHRASE1, PHRASE2
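As an illustration of the Soundex comparison behind the *.sp files, here is a minimal sketch in Python. It assumes the classic American Soundex variant and compares multi-word phrases word by word; the page does not say which variant or phrase handling was actually used, and the names soundex and same_soundex are hypothetical.

```python
# Hypothetical helper: classic American Soundex (assumed variant).
_CODES = {}
for _letters, _digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(word):
    """Return the 4-character Soundex code of a word ('' if it has no letters)."""
    letters = [c for c in word.upper() if c.isalpha()]
    if not letters:
        return ""
    out = [letters[0]]
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        digit = _CODES.get(ch, "")
        if digit and digit != prev:
            out.append(digit)
        prev = digit  # vowels reset prev, so a repeated code after a vowel is kept
    return ("".join(out) + "000")[:4]


def same_soundex(phrase1, phrase2):
    """Assumption: phrases match when their per-word Soundex codes are equal."""
    codes1 = [soundex(w) for w in phrase1.split()]
    codes2 = [soundex(w) for w in phrase2.split()]
    return codes1 == codes2
```

For example, same_soundex("their house", "there house") is true under these assumptions, so such a row would satisfy the identical-Soundex condition.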

...

*.df: PAGE_ID, WORD, FREQ, except for the first line corresponding to an article, which lists the article's name. This file lists the document frequency of each word, where a document is a revision and the corpus is all revisions of an article.
*.sentid_sent: PAGE_ID, SENTENCE_ID, SENTENCE
*.revid_sentid: PAGE_ID, REVISION_ID, UNSIMPLE_FLAG, SENTENCE_COUNT, SENTENCE_STREAM, COMMENT
*.directed_sentid_sentid_w: PAGE_ID, REVISION1_ID, REVISION2_ID, SENTENCE1_ID, SENTENCE2_ID, ALIGNMENT_SCORE, PHRASE1_LENGTH, PHRASE1, PHRASE2_LENGTH, PHRASE2
*.index: An index for looking up the lines that belong to a particular page. There is one index for each of the file types listed above, and all of them share the same format: PAGE_ID, END_LINE_NUMBER, NUMBER_OF_LINES.
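The index format can be used to pull out only the lines for one page. The sketch below is a rough Python illustration, not the original tooling. It assumes tab-separated fields and that END_LINE_NUMBER is a 1-based, inclusive line number in the corresponding data file; neither is stated on this page. The function names load_index and page_lines are hypothetical.

```python
def load_index(index_lines, sep="\t"):
    """Map PAGE_ID -> (start_line, end_line), both 1-based and inclusive.

    Assumption: fields are separated by `sep` and END_LINE_NUMBER is the
    last line of the page's block in the data file.
    """
    index = {}
    for line in index_lines:
        line = line.rstrip("\n")
        if not line:
            continue
        page_id, end_line, n_lines = line.split(sep)[:3]
        end = int(end_line)
        count = int(n_lines)
        index[page_id] = (end - count + 1, end)
    return index


def page_lines(data_lines, index, page_id):
    """Return the data-file lines that belong to one page."""
    start, end = index[page_id]
    return data_lines[start - 1:end]  # convert 1-based inclusive to a slice
```

With an index entry of (15, 5, 2), for example, this reading selects lines 4 and 5 of the data file for page 15. For large files, the same start and end values could drive a streaming read instead of loading every line into memory.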

Two fields need additional explanation:
SENTENCE_STREAM: This is a comma-separated string of SENTENCE_IDs in the order they appear in the revision, with the character "P" used to indicate paragraph breaks.
UNSIMPLE_FLAG: This is a flag for whether or not the revision was tagged with the "UNSIMPLE" group. Editors had the option of marking a file with this tag to indicate that it needed further simplification.
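
A SENTENCE_STREAM value can be split back into paragraphs of sentence IDs. The following is a minimal Python sketch that assumes each "P" marker appears as its own comma-separated token, which this page does not confirm; parse_sentence_stream is a hypothetical name.

```python
def parse_sentence_stream(stream):
    """Split a SENTENCE_STREAM string into paragraphs of SENTENCE_IDs.

    Assumption: "P" is a standalone token between commas. IDs are kept
    as strings, in the order they appear in the revision.
    """
    paragraphs = [[]]
    for token in stream.split(","):
        token = token.strip()
        if not token:
            continue
        if token == "P":
            paragraphs.append([])  # start a new paragraph
        else:
            paragraphs[-1].append(token)
    # Drop empty paragraphs caused by leading, trailing, or repeated markers.
    return [p for p in paragraphs if p]
```

Under that assumption, the stream "1,2,P,3,P,4,5" describes three paragraphs containing sentences 1 and 2, sentence 3, and sentences 4 and 5.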

...
