...

These are intermediate files containing revision data after basic pre-processing. Each "part" holds a subset of the data (a consequence of Hadoop processing) and contains information about many pages.

There are five types of files (where * stands for part001, part002, etc.):

  • *.df: PAGE_ID, WORD, FREQ (the first line corresponding to an article lists its name). As noted earlier, for a given article, document frequency is computed over its revisions, where each revision is considered as a "document".
  • *.sentid_sent: PAGE_ID, SENTENCE_ID, SENTENCE
  • *.revid_sentid: PAGE_ID, REVISION_ID, UNSIMPLE_FLAG, SENTENCE_COUNT, SENTENCE_STREAM, COMMENT
  • *.directed_sentid_sentid_w: PAGE_ID, REVISION1_ID, REVISION2_ID, SENTENCE1_ID, SENTENCE2_ID, ALIGNMENT_SCORE, PHRASE1_LENGTH, PHRASE1, PHRASE2_LENGTH2, PHRASE2
  • *.index: An index for looking up a particular page's info within the part files. There is one index for each of the file types listed above, and all of them share the same format: PAGE_ID, END_LINE_NUMBER, NUMBER_OF_LINES.
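
As an illustration of how the index might be used, here is a minimal Python sketch that turns index entries into line ranges and pulls one page's lines out of a part file. It rests on several assumptions that the description above does not confirm: fields are tab-separated, line numbers are 1-based, and END_LINE_NUMBER is the last line of the page (inclusive). The function names load_index and page_lines are hypothetical helpers, not part of the dataset.

```python
from itertools import islice
from typing import Dict, Iterable, List, Tuple


def load_index(index_lines: Iterable[str], sep: str = "\t") -> Dict[str, Tuple[int, int]]:
    """Map PAGE_ID -> (first_line, last_line), both 1-based and inclusive.

    Assumes each index line is PAGE_ID, END_LINE_NUMBER, NUMBER_OF_LINES,
    separated by `sep` (the tab default is an assumption).
    """
    spans: Dict[str, Tuple[int, int]] = {}
    for raw in index_lines:
        raw = raw.rstrip("\n")
        if not raw:
            continue  # skip blank lines
        page_id, end_line, n_lines = raw.split(sep)[:3]
        end, count = int(end_line), int(n_lines)
        # Assumed convention: END_LINE_NUMBER is the page's last line.
        spans[page_id] = (end - count + 1, end)
    return spans


def page_lines(part_lines: Iterable[str], span: Tuple[int, int]) -> List[str]:
    """Return the lines of a part file that fall inside the given span."""
    first, last = span
    # islice uses 0-based offsets, so shift the 1-based first line by one.
    return [line.rstrip("\n") for line in islice(part_lines, first - 1, last)]
```

With real files, the same functions could be given open file handles, for example load_index(open(index_path)) followed by page_lines(open(part_path), spans[page_id]), where index_path and part_path are whatever paths the matching index and part file have locally. If the looked-up lines start one line early or late, the inclusive-end assumption is the first thing to check.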

...