  1. BERT vs hand features, controversy paper
  2. Word embeddings

    Question/proposal: where is the word embedding version of LIWC? ("Can we BERT LIWC?" -- word pieces!)

    Some work in this direction:

    Fast, Ethan; Chen, Binbin; Bernstein, Michael S. 2017. Lexicons on demand: Neural word embeddings for large-scale text analysis. IJCAI.

    Abstract: Human language is colored by a broad range of topics, but existing text analysis tools only focus on a small number of them. We present Empath, a tool that can generate and validate new lexical categories on demand from a small set of seed terms (like “bleed” and “punch” to generate the category violence). Empath draws connotations between words and phrases by learning a neural embedding across billions of words on the web. Given a small set of seed words that characterize a category, Empath uses its neural embedding to discover new related terms, then validates the category with a crowd-powered filter. Empath also analyzes text across 200 built-in, pre-validated categories we have generated such as neglect, government, and social media. We show that Empath’s data-driven, human validated categories are highly correlated (r=0.906) with similar categories in LIWC.
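
    A rough sketch of the mechanism described in that abstract, assuming nothing beyond the seed-expansion idea itself: rank other vocabulary items by similarity to the mean embedding of the seed words. The tiny hand-made vectors below are placeholders standing in for real learned embeddings; the seeds come from the abstract's violence example.

    import numpy as np

    # Toy stand-in embeddings; a real system would use vectors learned from a large corpus.
    embeddings = {
        "bleed":  np.array([0.90, 0.10, 0.00]),
        "punch":  np.array([0.80, 0.20, 0.10]),
        "wound":  np.array([0.85, 0.15, 0.05]),
        "ballot": np.array([0.10, 0.90, 0.20]),
        "senate": np.array([0.05, 0.95, 0.10]),
    }

    def cosine(u, v):
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

    def expand_category(seeds, k=3):
        """Rank non-seed words by similarity to the mean of the seed vectors."""
        centroid = np.mean([embeddings[w] for w in seeds], axis=0)
        scored = [(w, cosine(centroid, vec)) for w, vec in embeddings.items() if w not in seeds]
        return sorted(scored, key=lambda x: x[1], reverse=True)[:k]

    # Seeds from the abstract's example; "wound" should come out on top.
    print(expand_category(["bleed", "punch"]))

    Empath's crowd-powered filter would then prune the ranked candidates before the category is used for analysis.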

    Overview references:

    Smith, Noah A. 2019. Contextual word representations: A contextual introduction. arXiv:1902.06006, version 2, dated Feb 19, 2019.
    Twitter commentary regarding the history as recounted in the above (Naftali Tishby and yours truly are among the "& co." referred to by Robert Munro): [1] [2] [3]

    Goldberg, Yoav. 2017. Neural network methods for natural language processing. Morgan & Claypool. Earlier, shorter, open-access journal version: A primer on neural network models for natural language processing. JAIR 57:345--420, 2016.

  3. Language modeling = the bridge?

    Note that the basic units might be characters or byte pairs instead of words.
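
    As a reminder of what "byte pairs" means in practice, here is a toy sketch of the byte-pair-encoding merge loop; the word counts and the number of merges below are made up for illustration.

    from collections import Counter

    def pair_counts(vocab):
        """Count adjacent symbol pairs over a {space-separated word: frequency} vocabulary."""
        counts = Counter()
        for word, freq in vocab.items():
            symbols = word.split()
            for a, b in zip(symbols, symbols[1:]):
                counts[(a, b)] += freq
        return counts

    def merge(pair, vocab):
        """Replace every occurrence of the pair with its concatenation."""
        a, b = pair
        return {word.replace(f"{a} {b}", f"{a}{b}"): freq for word, freq in vocab.items()}

    # Words start out as sequences of characters plus an end-of-word marker.
    vocab = {"l o w </w>": 5, "l o w e r </w>": 2, "n e w e s t </w>": 6, "w i d e s t </w>": 3}
    for _ in range(10):                      # ten merges, chosen arbitrarily
        counts = pair_counts(vocab)
        if not counts:
            break
        best = max(counts, key=counts.get)   # most frequent adjacent pair becomes a new unit
        vocab = merge(best, vocab)
    print(vocab)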

    Recommendations from Jack Hessel and Yoav Artzi, Cornell:

    Thanks to Jack Hessel and Yoav Artzi for the below. Paraphrasing errors are my own.

    The best off-the-shelf language model right now (caveat: this is a very fast-moving field) is GPT-2, where GPT stands for Generative Pre-Training. It seems to transfer well via fine-tuning to small new datasets. [code] [https://openai.com/blog/better-language-models/]
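
    A minimal usage sketch, assuming the Hugging Face transformers port of GPT-2 rather than the original OpenAI code linked above (that library choice is my own, not part of the recommendation):

    import torch
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    # Download the (smallest) pretrained GPT-2 and its byte-pair tokenizer.
    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")
    model.eval()

    prompt = "Language modeling may serve as a bridge because"
    input_ids = tokenizer.encode(prompt, return_tensors="pt")
    with torch.no_grad():
        output_ids = model.generate(input_ids, max_length=40, do_sample=True, top_k=40)
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))

    Fine-tuning on a small new dataset then amounts to continuing to train this same model, with the usual language-modeling loss, on the new text.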

    References:

    Radford, Alec; Wu, Jeffrey; Child, Rewon; Luan, David; Amodei, Dario; Sutskever, Ilya. 2019. Language models are unsupervised multitask learners. Manuscript.

  4.