...

Processing the Simple/English Wikipedia revision streams was done in three phases. First, articles of interest were identified and their corresponding revisions were extracted from the wiki dumps. Each article has a unique id, ARTICLE_ID, and contains a set of revisions. Each revision has a unique id, REVISION_ID. Each revision also has an optional comment describing the transition from the previous state of the article to the current revision (the first comment represents the creation of the article).

For each article (a set of revisions), a list of all unique sentences in the article was generated. Each sentence was given an id, SENTENCE_ID, and each revision of the article was then represented as the sequence of SENTENCE_IDs belonging to it. This was a reasonably compact representation of the stream.
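The compact representation described above can be sketched as follows. This is an illustrative helper, not the original pipeline code; the function name and input shape (each revision as a list of sentence strings) are assumptions.

```python
def compact_revisions(revisions):
    """Assign each unique sentence an integer SENTENCE_ID and encode
    every revision as the sequence of ids belonging to it."""
    sentence_ids = {}  # sentence text -> SENTENCE_ID
    encoded = []
    for revision in revisions:  # each revision: a list of sentences
        seq = []
        for sentence in revision:
            if sentence not in sentence_ids:
                sentence_ids[sentence] = len(sentence_ids)
            seq.append(sentence_ids[sentence])
        encoded.append(seq)
    return sentence_ids, encoded
```

Because unchanged sentences keep their ids across revisions, each revision costs only one integer per sentence rather than the full text.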

Afterward, sentences in adjacent revisions of an article were aligned using a TF-IDF weighting scheme in which document frequency was computed by treating each revision of an article as a document and the set of all revisions of that article as the corpus. This produced an ALIGNMENT_SCORE for each candidate sentence pair.
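A minimal sketch of such an alignment score, assuming tokenized sentences and cosine similarity over TF-IDF vectors (the smoothing and similarity measure are assumptions, not details from the source):

```python
import math
from collections import Counter

def alignment_score(sent_a, sent_b, revisions):
    """Cosine similarity of the TF-IDF vectors of two sentences.
    Document frequency treats each revision (a list of tokens) as one
    document; the corpus is all revisions of the article."""
    n_docs = len(revisions)

    def idf(term):
        df = sum(1 for rev in revisions if term in rev)
        return math.log((n_docs + 1) / (df + 1)) + 1  # smoothed IDF

    def tfidf(tokens):
        return {t: c * idf(t) for t, c in Counter(tokens).items()}

    va, vb = tfidf(sent_a), tfidf(sent_b)
    dot = sum(w * vb.get(t, 0.0) for t, w in va.items())
    norm = (math.sqrt(sum(w * w for w in va.values()))
            * math.sqrt(sum(w * w for w in vb.values())))
    return dot / norm if norm else 0.0
```

Identical sentences score 1.0 and sentences with no shared terms score 0.0, so a threshold on this score can decide which sentence pairs in adjacent revisions count as aligned.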

Finally, phrases were extracted from aligned sentences by finding a single differing segment in each of the two sentences (PHRASE). If the changes were large, the differing segment could span an entire sentence; if the changes were small, it could be a single word from each sentence.
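One common way to find a single differing segment is to strip the longest common prefix and suffix of the two token sequences; the sketch below illustrates that idea and is an assumption about the method, not the original implementation.

```python
def differing_segments(tokens_a, tokens_b):
    """Return the single differing segment of each sentence by
    removing the longest common prefix and suffix of the two
    token sequences."""
    # longest common prefix
    p = 0
    while p < min(len(tokens_a), len(tokens_b)) and tokens_a[p] == tokens_b[p]:
        p += 1
    # longest common suffix of what remains after the prefix
    s = 0
    while (s < min(len(tokens_a), len(tokens_b)) - p
           and tokens_a[len(tokens_a) - 1 - s] == tokens_b[len(tokens_b) - 1 - s]):
        s += 1
    return tokens_a[p:len(tokens_a) - s], tokens_b[p:len(tokens_b) - s]
```

For two sentences that share nothing, the segments are the whole sentences; for a one-word edit, each segment is that single word, matching the two extremes described above.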

...