  1. BERT vs hand features, controversy paper
  2. Word embeddings

    Question/proposal: where is the word embedding version of LIWC? ("Can we BERT LIWC?" -- word pieces!)

    Some work in this direction:

    Fast, Ethan; Chen, Binbin; Bernstein, Michael S. 2017. Lexicons on demand: Neural word embeddings for large-scale text analysis. IJCAI.

    Abstract: Human language is colored by a broad range of topics, but existing text analysis tools only focus on a small number of them. We present Empath, a tool that can generate and validate new lexical categories on demand from a small set of seed terms (like “bleed” and “punch” to generate the category violence). Empath draws connotations between words and phrases by learning a neural embedding across billions of words on the web. Given a small set of seed words that characterize a category, Empath uses its neural embedding to discover new related terms, then validates the category with a crowd-powered filter. Empath also analyzes text across 200 built-in, pre-validated categories we have generated such as neglect, government, and social media. We show that Empath’s data-driven, human validated categories are highly correlated (r=0.906) with similar categories in LIWC.
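
    A rough sketch of the mechanism described in that abstract, assuming nothing beyond the seed-expansion idea itself: rank other vocabulary items by similarity to the mean embedding of the seed words. The tiny hand-made vectors below are placeholders standing in for real learned embeddings; the seeds come from the abstract's violence example.

    import numpy as np

    # Toy stand-in embeddings; a real system would use vectors learned from a large corpus.
    embeddings = {
        "bleed":  np.array([0.90, 0.10, 0.00]),
        "punch":  np.array([0.80, 0.20, 0.10]),
        "wound":  np.array([0.85, 0.15, 0.05]),
        "ballot": np.array([0.10, 0.90, 0.20]),
        "senate": np.array([0.05, 0.95, 0.10]),
    }

    def cosine(u, v):
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

    def expand_category(seeds, k=3):
        """Rank non-seed words by similarity to the mean of the seed vectors."""
        centroid = np.mean([embeddings[w] for w in seeds], axis=0)
        scored = [(w, cosine(centroid, vec)) for w, vec in embeddings.items() if w not in seeds]
        return sorted(scored, key=lambda x: x[1], reverse=True)[:k]

    # Seeds from the abstract's example; "wound" should come out on top.
    print(expand_category(["bleed", "punch"]))

    Empath's crowd-powered filter would then prune the ranked candidates before the category is used for analysis.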

    Overview references:

    Smith, Noah A. 2019. Contextual word representations: A contextual introduction. arXiv:1902.06006, version 2, dated Feb 19, 2019.
    Twitter commentary regarding the history as recounted in the above (Naftali Tishby and yours truly are among the "& co." referred to by Robert Munro): [1] [2] [3]

    Goldberg, Yoav. 2017. Neural network methods for natural language processing. Morgan & Claypool. Earlier, shorter, open-access journal version: A primer on neural network models for natural language processing. JAIR 57:345--420, 2016.

  3. Language modeling = the bridge?

    Note that the basic units might be characters or byte pairs instead of words.
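
    As a reminder of what "byte pairs" means in practice, here is a toy sketch of the byte-pair-encoding merge loop; the word counts and the number of merges below are made up for illustration.

    from collections import Counter

    def pair_counts(vocab):
        """Count adjacent symbol pairs over a {space-separated word: frequency} vocabulary."""
        counts = Counter()
        for word, freq in vocab.items():
            symbols = word.split()
            for a, b in zip(symbols, symbols[1:]):
                counts[(a, b)] += freq
        return counts

    def merge(pair, vocab):
        """Replace every occurrence of the pair with its concatenation."""
        a, b = pair
        return {word.replace(f"{a} {b}", f"{a}{b}"): freq for word, freq in vocab.items()}

    # Words start out as sequences of characters plus an end-of-word marker.
    vocab = {"l o w </w>": 5, "l o w e r </w>": 2, "n e w e s t </w>": 6, "w i d e s t </w>": 3}
    for _ in range(10):                      # ten merges, chosen arbitrarily
        counts = pair_counts(vocab)
        if not counts:
            break
        best = max(counts, key=counts.get)   # most frequent adjacent pair becomes a new unit
        vocab = merge(best, vocab)
    print(vocab)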

    Recommendations from Jack Hessel and Yoav Artzi, Cornell:

    Thanks to Jack Hessel and Yoav Artzi for the below. Paraphrasing errors are my own.

    The best off-the-shelf language model right now (caveat: this is a very fast-moving field) is GPT-2, where GPT stands for Generative Pre-Training. It seems to transfer well via fine-tuning to small new datasets. [code] [https://openai.com/blog/better-language-models/]
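
    A minimal usage sketch, assuming the Hugging Face transformers port of GPT-2 rather than the original OpenAI code linked above (that library choice is my own, not part of the recommendation):

    import torch
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    # Download the (smallest) pretrained GPT-2 and its byte-pair tokenizer.
    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")
    model.eval()

    prompt = "Language modeling may serve as a bridge because"
    input_ids = tokenizer.encode(prompt, return_tensors="pt")
    with torch.no_grad():
        output_ids = model.generate(input_ids, max_length=40, do_sample=True, top_k=40)
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))

    Fine-tuning on a small new dataset then amounts to continuing to train this same model, with the usual language-modeling loss, on the new text.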

    References:

    Radford, Alec; Wu, Jeffrey; Child, Rewon; Luan, David; Amodei, Dario; Sutskever, Ilya. 2019. Language models are unsupervised multitask learners. Manuscript.

  4.