Lillian Lee, Choice 2019 Symposium "Wisdom from Words: Insight from Language and Text Analysis", draft/work in progress
This URL: https://confluence.cornell.edu/display/~ljl2/Choice2019
Setting: what makes language type A different from type B?
...
- etc
For various reasons, including an eye toward deploying applications, we ultimately evaluate our hypotheses via prediction, even though our personal interest and investment lie in understanding what underlies the phenomenon being considered.
Some features/technologies I like
- The Cornell Conversational Analysis Toolkit. Features for: linguistic coordination, politeness strategies, conversation motifs, conversation graphs. Datasets: Wikipedia talk page conversations that do (or do not) become derailed by personal attacks; dialogs from movie scripts; UK Parliamentary question-answer pairs; Supreme Court oral arguments; Wikipedia talk page conversations; post-tennis-match press interviews; reddit conversations.
- Chenhao Tan's list of hedging phrases, such as "I suspect" and "raising the possibility". This is in the long line of LIWC-like lexicons. [README] [list itself]
- Language models, which assign probabilities P(x) to words, sentences, or text units after being trained on some language sample. These are great for similarity, distinctiveness, and visualization.
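To make "assign probabilities P(x)" concrete, here is a minimal sketch of a bigram language model with add-one smoothing, on an invented toy corpus (the corpus and function names are my own, for illustration only; real systems use far larger samples and better smoothing):

```python
from collections import Counter

def train_bigram_lm(tokens):
    """Count unigrams and bigrams; return a conditional probability P(word | prev)."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    vocab = len(unigrams)
    def prob(prev, word):
        # Add-one (Laplace) smoothing gives unseen bigrams nonzero mass.
        return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab)
    return prob

corpus = "the cat sat on the mat the cat ate".split()
p = train_bigram_lm(corpus)
# "the cat" occurs twice and "the mat" once, so P(cat | the) > P(mat | the).
```

Chaining such conditional probabilities over a whole sentence yields the P(x) used for the similarity and distinctiveness analyses mentioned above.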
- Distributional similarity (word embeddings are the modern version). [Figure from 1997 illustrating ideas from the early 90's.] For references, see the word embeddings section later in this document.
...
Distributional similarity
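The core idea, sketched on an invented toy corpus (all names here are hypothetical): words are similar to the extent that their co-occurrence contexts are similar, which we can measure with cosine similarity over context-count vectors.

```python
import math
from collections import defaultdict, Counter

def cooccurrence_vectors(tokens, window=2):
    """Map each word to counts of words appearing within +/- window positions."""
    vecs = defaultdict(Counter)
    for i in range(len(tokens)):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if i != j:
                vecs[tokens[i]][tokens[j]] += 1
    return vecs

def cosine(u, v):
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

toks = "i drink cold beer . i drink cold juice . i read a book".split()
vecs = cooccurrence_vectors(toks)
# "beer" and "juice" share contexts (drink, cold), so they come out far more
# similar to each other than either is to "book".
```

Word embeddings replace these sparse count vectors with learned dense vectors, but the similarity computation is the same in spirit.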
Type/token ratio
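The type/token ratio is simple enough to state in one line: distinct word types divided by total tokens, a rough measure of lexical diversity. A minimal sketch on invented examples:

```python
def type_token_ratio(tokens):
    """Lexical diversity: number of distinct word types divided by total tokens."""
    return len(set(tokens)) / len(tokens)

repetitive = "the dog saw the dog and the dog ran".split()  # 5 types / 9 tokens
varied = "a quick fox vaulted over one sleepy hound".split()  # 8 types / 8 tokens
# More repetition yields a lower ratio.
```

Note the raw ratio is sensitive to text length (longer texts repeat more), so comparisons are usually made on same-length samples.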
... and one feature that I both like and that drives me crazy: length
It represents a null hypothesis that is intuitively somewhat ridiculous but often works surprisingly well as a feature, most likely because it correlates with many other features of interest. Examples: (to be inserted)
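As a sketch of why length alone can be a serviceable baseline: a one-feature "classifier" that just thresholds token count, tuned on invented toy data (texts, labels, and function names below are hypothetical, for illustration only):

```python
def length_feature(text):
    """The 'null hypothesis' feature: just the token count."""
    return len(text.split())

def best_length_threshold(texts, labels):
    """Pick the length threshold that best separates two classes on training data."""
    lengths = [length_feature(t) for t in texts]
    return max(
        (sum(int(l > thr) == y for l, y in zip(lengths, labels)), thr)
        for thr in sorted(set(lengths))
    )

# Toy data: pretend long replies (label 1) differ from short ones (label 0).
texts = [
    "fine",
    "sounds good to me",
    "let me walk through each of the reasons in detail here",
    "first, the data; second, the model; third, the evaluation protocol",
]
labels = [0, 0, 1, 1]
```

Any real study would of course control for length before crediting a more interesting feature, which is exactly the crazy-making point.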
A feature-effectiveness test that's caught my eye
Wang, Zhao and Aron Culotta, When do Words Matter? Understanding the Impact of Lexical Choice on Audience Perception using Individual Treatment Effect Estimation. AAAI 2019. [code]
How do we proceed during the age of deep learning, where, for prediction, we don't need to (aren't supposed to) worry about features anymore?
Comparison of hand-crafted features against deep learning (BERT) on predicting controversial social-media posts, from the controversy paper. Star = best in column; circle = performance within 1% of the best in column. Columns: different sub-reddits.
Fast, Ethan, Binbin Chen, and Michael S. Bernstein. Lexicons on demand: Neural word embeddings for large-scale text analysis. IJCAI 2017.
Abstract: Human language is colored by a broad range of topics, but existing text analysis tools only focus on a small number of them. We present Empath, a tool that can generate and validate new lexical categories on demand from a small set of seed terms (like "bleed" and "punch" to generate the category violence). Empath draws connotations between words and phrases by learning a neural embedding across billions of words on the web. Given a small set of seed words that characterize a category, Empath uses its neural embedding to discover new related terms, then validates the category with a crowd-powered filter. Empath also analyzes text across 200 built-in, pre-validated categories we have generated such as neglect, government, and social media. We show that Empath's data-driven, human validated categories are highly correlated (r=0.906) with similar categories in LIWC.
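The seed-expansion step the abstract describes can be sketched in a few lines: rank candidate words by mean cosine similarity to the seed set in embedding space. This is not Empath's implementation; the 3-d "embeddings" below are invented toy values (a real system learns them from billions of words), and all names are hypothetical.

```python
import math

# Toy embeddings, invented for illustration only.
emb = {
    "punch": [0.90, 0.10, 0.00],
    "bleed": [0.80, 0.20, 0.10],
    "kick": [0.85, 0.15, 0.05],
    "stab": [0.90, 0.05, 0.10],
    "picnic": [0.00, 0.90, 0.30],
    "senate": [0.10, 0.20, 0.90],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def expand_lexicon(seeds, k=2):
    """Rank non-seed words by mean cosine similarity to the seed set."""
    candidates = [w for w in emb if w not in seeds]
    scored = sorted(
        candidates,
        key=lambda w: -sum(cosine(emb[w], emb[s]) for s in seeds) / len(seeds),
    )
    return scored[:k]

# Seeds "punch" and "bleed" should pull in "kick" and "stab" before "picnic" or "senate".
```

Empath then adds the crowd-powered validation step on top of this kind of ranking.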
Smith, Noah A. 2019. Contextual word representations: A contextual introduction. arXiv:1902.06006, version 2, dated Feb 19, 2019. Twitter commentary regarding the history as recounted in the above (Naftali Tishby and yours truly are among the "& co." referred to by Robert Munro): [1] [2] [3]
Goldberg, Yoav. 2017. Neural network methods for natural language processing. Morgan & Claypool. Earlier, shorter, open-access journal version: A primer on neural network models for natural language processing. JAIR 57:345-420, 2016.
Language modeling = the bridge?
Note that the basic units might be characters or unicode code points ("names of characters") instead of words.
Thanks to Jack Hessel and Yoav Artzi for the below. Paraphrasing errors are my own.
- The best off-the-shelf language model right now (caveat: this is a very fast-moving field) is the 12-or-so-layer GPT-2, where GPT stands for Generative Pre-Training. [code] [(infamous) announcement] [hugging face's reimplementation of pre-trained GPT-2]
- But a single-layer LSTM trained from scratch, with carefully chosen hyperparameters, is still often a very strong baseline, especially with small data (around 10K samples).
- Both BERT and GPT seem to transfer well via fine-tuning to small new datasets, at least in expert hands. [code] [Colab] [hugging face's reimplementation of pre-trained BERT] [announcement]
- The Giant Language model Test Room (GLTR) can be used to analyze what a neural LM is doing, although its stated purpose is to enable detection of automatically generated text.
Zhang, Tianyi, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. April 21, 2019. BERTScore: Evaluating text generation with BERT. arXiv version 1. [code]
Belinkov, Yonatan and James Glass. 2019. Analysis methods in neural language processing: A survey. TACL 7:49-72. [supplementary materials]