TRANSCRIPT
A brief intro to natural language processing
Pamela Toman
16 August 2016
1
At 11 am, you will be able to…
• Competence (able to perform on your own, with varying levels of perfection)
• Convert your raw data into numeric representations suitable for statistical modeling
• Build bag-of-words models
• Establish model performance and business process baselines to exceed as part of evaluation
• Choose a good problem on which to apply NLP and apply NLP iteratively
• Exposure (aware)
• Familiar with vocabulary related to preprocessing (tokenization, stemming/lemmatization, stop words), representing text as numbers (one-hot encoding, word embedding), and model development (bag-of-words models, n-grams, skip-grams, feature engineering, smoothing)
• Familiar with some NLP modeling gotchas: training data must match testing data, unknown words must be addressed, testing data must be entirely new to be trusted, human performance is itself rarely 100%
• Familiar with toolsets: software (sklearn, nltk, CoreNLP) and resources (GloVe, WordNet)
• Familiar with the huge variety of already-solved tasks & open-source tools available
• Aware that vanilla RNNs fail at long-term dependencies and that LSTM/GRU units succeed
• Aware of the meaning of, and reason for, attention mechanisms in neural networks
2
What kind of text data can you access?
3
Why do we want NLP?
• There’s an awful lot of text…
• It’s awfully nuanced…
On Tuesday, September 15, 2015, about 7:03 a.m. local time, a 47-passenger 2009 International school bus, operated by the Houston Independent School District (HISD) and occupied by a 44-year-old female driver and four HISD students, crashed into …
The Rock is destined to be the 21st century's new “Conan” and that he's going to make a splash even greater than Arnold Schwarzenegger, Jean-Claud van Damme or Steven Segal.
4
Common tasks & off-the-shelf solutions
• Well-performing, free, open-source / off-the-shelf solutions exist in lots of areas:
• Speech-to-text (CMUSphinx, Kaldi) – internet helps
• Text-to-speech (Windows SAPI, OSX NSSS, *nix ESpeak)
• Stemming (NLTK Snowball English)
• Sentiment analysis (Stanford Recursive Neural Tensor Net)
• Named entity recognition (Stanford CoreNLP)
• Coreference resolution (Stanford CoreNLP)
• Relation extraction (agent-action-patient) (Stanford CoreNLP)
• Search / info retrieval (Lucene/elasticsearch/Solr)
• You can improve them (nltk, sklearn, Java)
• Don’t ignore paid services – they may be worth the cost
• Many projects will combine existing and custom methods
5
Sentiment analysis
• Goal: Determine attitude overall and/or re: a topic
• Usually positive or negative
• Uses: Marketing, customer service, recommendations, finance, understanding individuals’ perspectives, …
• The problem is hard: human inter-annotator agreement of ~80%
Effective but too-tepid biopic.
If you sometimes like to go to the movies to have fun, Wasabi is a good place to start.
Perhaps no picture ever made has more literally showed that the road to hell is paved with good intentions.
The film provides some great insight into the neurotic mindset of all comics—even those who have reached the absolute top of the game.
Pang and Lee 2005
6
Task structure: What is the unit? What is the label?
… The actor is virtually unrecognizable until the final third, where he is finally unmasked.
“Beyond” is undoubtedly messy, like a Starfleet ship that’s taken its fair share of beatings, but it is frequently a reminder of how good the series can be when all its engines are in working order. Opening a box left behind by Commander Spock, Spock the younger finds a photo of the original crew. That nostalgia, both for its source material and time gone by, is both reverent and earned. …
Labels: +1 / -1
7
Evaluating performance
• Define how you will evaluate before you begin
• You always want 1+ technical baseline:
• Human performance?
• An extremely naïve model?
• State-of-the-art computational model?
• All of the above?
• Common evaluation metrics for classification:
• Precision & recall
• Confusion matrices
• F1 score
Precision/Recall – Wikipedia
8
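These classification metrics can be computed directly with scikit-learn. A minimal sketch, using made-up gold labels and predictions (the +1/-1 labels follow the sentiment convention used elsewhere in these slides):

```python
# Minimal sketch of precision, recall, F1, and a confusion matrix.
# y_true / y_pred are hypothetical, for illustration only.
from sklearn.metrics import precision_score, recall_score, f1_score, confusion_matrix

y_true = [1, 1, 1, -1, -1, -1, 1, -1]   # gold labels (made up)
y_pred = [1, 1, -1, -1, -1, 1, 1, -1]   # model predictions (made up)

precision = precision_score(y_true, y_pred)   # of predicted +1s, how many are right?
recall = recall_score(y_true, y_pred)         # of gold +1s, how many did we find?
f1 = f1_score(y_true, y_pred)                 # harmonic mean of precision and recall
cm = confusion_matrix(y_true, y_pred)         # rows = gold class, columns = predicted
```

F1 balances the two error types; the confusion matrix shows exactly where the model confuses classes.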
Preprocessing
• Learning all the surface variants from limited data can be extremely hard
• Texts have punctuation (→ tokenization)
• Texts are capitalized (→ lowercasing)
• Texts have morphology like affixes (e.g., plural –s) (→ stemming or lemmatization)
• With limited data, it may help to preprocess the data
• Preprocessing improves recall at the expense of precision
• Preprocessing tools work well out-of-the-box for almost all purposes
• Given more data, it becomes more possible to use the raw data
9
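A minimal preprocessing sketch covering all three steps. The regex tokenizer here is a stand-in to keep the example self-contained; NLTK's Snowball English stemmer is the one the slides reference:

```python
# Tokenize, lowercase, and stem. The regex tokenizer is a simplification
# (real pipelines would use a proper tokenizer); the stemmer is NLTK's
# Snowball English stemmer, as mentioned on the tools slide.
import re
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")

def preprocess(text):
    tokens = re.findall(r"[a-z0-9-]+(?:'[a-z]+)?", text.lower())  # tokenize + lowercase
    return [stemmer.stem(t) for t in tokens]                      # strip affixes

print(preprocess("Cats are running!"))
```

Note how "cats" and "cat" now map to the same feature: recall goes up, at some cost to precision, exactly as described above.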
Word frequencies are related through the power law
• “Stop words” are very common words that some people filter out, especially for information retrieval
Zipf’s law & frequency lists
Word    Count
you     1,222,421
I       1,052,546
to      823,661
the     770,161
a       563,578
and     480,214
that    413,389
it      388,320
of      332,038
me      312,326
10
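Building a frequency list and filtering stop words takes only a few lines. A sketch with a toy corpus and a toy stop-word list (real stop-word lists, e.g. NLTK's, are much longer):

```python
# Build a Zipf-style frequency list with collections.Counter and
# filter out a (toy) stop-word list.
from collections import Counter

corpus = "the movie was the best movie of the year".split()
counts = Counter(corpus)                     # word -> frequency
stop_words = {"the", "of", "was"}            # tiny illustrative subset
content = [w for w in corpus if w not in stop_words]

print(counts.most_common(1))   # the most frequent word dominates, per Zipf's law
```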
Final labeled dataset
Unit of analysis (cleaned) → Target
the rock is destined to be the 21st century's new " conan " and that he's going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal . → +1
the gorgeously elaborate continuation of " the lord of the rings " trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson's expanded vision of j . r . r . tolkien's middle-earth . → +1
effective but too-tepid biopic → +1
simplistic , silly and tedious . → -1
it's so laddish and juvenile , only teenage boys could possibly find it funny . → -1
exploitative and largely devoid of the depth or sophistication that would make watching such a graphic treatment of the crimes bearable . → -1
… → …
11
We can encode text with one-hot vectors
• A “word embedding” is how we represent a word numerically (in vector space)
• To create numbers from text, it’s common to use one-hot encodings
12
the <1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...>
rock <0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...>
is <0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...>
dinosaur <0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...>
the rock is <1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...>
(the document vector for "the rock is" is the element-wise sum of its words' one-hot vectors)
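The one-hot and bag-of-words encodings above can be sketched in plain Python over a toy four-word vocabulary:

```python
# One-hot encoding over a toy vocabulary: each word gets a vector with a
# single 1 at its vocabulary index; a document is the combination of its words.
vocab = ["the", "rock", "is", "dinosaur"]
index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    vec = [0] * len(vocab)
    vec[index[word]] = 1
    return vec

def bag_of_words(text):
    # presence/absence vector for the whole document
    vec = [0] * len(vocab)
    for w in text.split():
        vec[index[w]] = 1
    return vec

print(one_hot("rock"))            # second slot is hot
print(bag_of_words("the rock is"))
```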
After encoding the texts, we can learn weights for classification
13
[Figure: a document-term matrix X (rows = Unit 1 … Unit 5, …; columns = words such as nasty, excellent, amazing, wonderful, terrible, awful, …; cells = 0/1 presence) is multiplied by a learned weight matrix W and squashed (S) to produce predictions Ŷ ≈ Y, the +/- labels.]
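The weight-learning step can be sketched with scikit-learn: `CountVectorizer` builds the X matrix and `LogisticRegression` learns the weights W. The four training texts are made up and far too few for real use:

```python
# Bag-of-words features + a linear classifier, as in the X . W = Y-hat picture.
# Training data is a tiny hypothetical sample, purely for illustration.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["excellent amazing wonderful", "nasty terrible awful",
         "amazing wonderful film", "awful nasty mess"]
labels = [1, -1, 1, -1]

model = make_pipeline(CountVectorizer(), LogisticRegression())
model.fit(texts, labels)                       # learn per-word weights
print(model.predict(["wonderful amazing"]))    # should come out positive
```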
We could also use dense word vectors
• One-hot vectors are arbitrary & uninformative about meaning
• Dense vectors carry more meaning
• We derive them from counts or predictions
• Usually they are ~50-500 dimensions
treadmills <0.4307 -0.31399 0.60878 -0.10931 -0.38425 -0.4796 0.41749 -0.95494 0.65878
-0.24547 -0.21854 -0.2505 -0.50729 0.047186 -0.47561 0.28034 -0.28351 1.2879 -0.50432
-1.2837 0.047272 -0.14955 -0.10071 -0.06754 0.25839 0.6971 0.14029 -0.16382 0.54242 ...>
nauseating <0.53624 -0.99272 -0.77127 -0.48285 0.41019 -0.17711 0.94572 0.13201 0.020268
1.0641 -0.24506 -0.22863 0.31128 0.34609 0.26537 -0.35245 -0.071506 0.33989 0.17206
-0.54792 -0.5268 0.032567 0.32413 -0.096092 0.41635 0.24512 -0.73399 1.431 1.0965 ...>
, <0.013441 0.23682 -0.16899 0.40951 0.63812 0.47709 -0.42852 -0.55641 -0.364 -0.23938
0.13001 -0.063734 -0.39575 -0.48162 0.23291 0.090201 -0.13324 0.078639 -0.41634 -0.15428
0.10068 0.48891 0.31226 -0.1252 -0.037512 -1.5179 0.12612 -0.02442 -0.042961 ...>
14
Dense word vectors have real meaning
• Dense word vectors include GloVe and word2vec
• Nearby words have similar meaning
• Direction & distance are meaningful (analogy tests)
GloVe talk
15
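"Nearby words have similar meaning" is usually measured with cosine similarity. A sketch with 4-dimensional made-up vectors (real GloVe vectors are ~50-300 dimensions and would be loaded from a distribution file, e.g. glove.6B.50d.txt, one token per line followed by its floats):

```python
# Cosine similarity between dense word vectors.
# These 4-d vectors are invented for illustration, NOT real GloVe values.
import math

vectors = {
    "good":  [0.9, 0.1, 0.3, 0.0],
    "great": [0.8, 0.2, 0.4, 0.1],
    "awful": [-0.7, 0.9, -0.2, 0.3],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))   # dot / (|u| * |v|)

# nearby words (good/great) score higher than distant ones (good/awful)
print(cosine(vectors["good"], vectors["great"]) >
      cosine(vectors["good"], vectors["awful"]))
```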
Synonym datasets are another resource
• WordNet (starting 1985) is a lexical database of synsets
• Thesaurus + dictionary
• Links words by meaning
• Links words by semantic relationship
• Synset = a particular meaning (could have many lemmas / lexical forms)
• It is super-exhaustive (“to cat is to whip”) but not technical or jargony
• It’s available through nltk
Princeton; Coursera
16
A naïve “bag of words” model is a good baseline
• n-grams are extremely easy and perform surprisingly well:
• Bigrams & especially trigrams get pretty sparse
• We also may use skip-grams to capture context
Effective but too-tepid biopic.
unigrams: effective | but | too-tepid | biopic | .
bigrams: START effective | effective but | but too-tepid | too-tepid biopic | biopic . | . END
trigrams: START START effective | START effective but | effective but too-tepid | but too-tepid biopic | too-tepid biopic . | biopic . END | . END END
1-skip-bigrams: START effective | effective but | but too-tepid | too-tepid biopic | biopic . | . END | START but | effective too-tepid | but biopic | too-tepid . | biopic END
17
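The padded n-grams and skip-grams shown above can be generated with a few lines (START/END padding as on the slide):

```python
# n-grams with START/END padding, plus 1-skip-bigrams that also pair
# tokens one position further apart to capture a little more context.
def ngrams(tokens, n):
    padded = ["START"] * (n - 1) + tokens + ["END"] * (n - 1)
    return [tuple(padded[i:i + n]) for i in range(len(padded) - n + 1)]

def skip_bigrams(tokens, k=1):
    padded = ["START"] + tokens + ["END"]
    pairs = []
    for gap in range(1, k + 2):   # gap 1 = ordinary bigram, gap 2 = skip one token
        pairs += [(padded[i], padded[i + gap]) for i in range(len(padded) - gap)]
    return pairs

toks = "effective but too-tepid biopic .".split()
print(ngrams(toks, 2))        # the 6 bigrams from the slide
print(skip_bigrams(toks))     # the 11 1-skip-bigrams from the slide
```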
We use smoothing to deal with unseen words
• We’ll need to deal with unknown words
• Some methods (especially Naïve Bayes) struggle if the probability of an observed feature is estimated to be literally 0
• We may also want to leverage biases in the language
• Two very easy approaches to smoothing:
• We relabel seen-once words as UNK and estimate the characteristics of UNK; new out-of-vocab words map to UNK
• We create UNK from no observations, and then we add one to each count (add-one/Laplacian smoothing)
• Statistically better but more complicated schemes are available
• Data size can help inform which approach to use
18
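The second approach above (an UNK created from no observations, then add-one/Laplacian smoothing) can be sketched over unigram counts:

```python
# Add-one (Laplace) smoothing: reserve an UNK slot with zero observations,
# then add one to every count so no word has probability exactly 0.
from collections import Counter

counts = Counter("the movie was the best movie ever".split())
vocab = set(counts) | {"UNK"}
total = sum(counts.values())

def smoothed_prob(word):
    w = word if word in counts else "UNK"      # out-of-vocab maps to UNK
    return (counts[w] + 1) / (total + len(vocab))

print(smoothed_prob("zygote") > 0)   # unseen words now get probability mass
```

This keeps Naïve Bayes-style models from zeroing out an entire document's probability because of one unseen word.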
Now that we’ve incorporated smoothed n-grams, we can start thinking about smarter features
• Remember we’re just building a statistical model
• We can go beyond features indicating the presence/absence of each word
• Smarter features mean smarter models
• One possible straightforward feature change is negation: let’s differentiate “good” and “not good”
• Many other features are possible – we’re now into your domain expertise!
• You can use parts of speech, synsets, grammatical relationships, …
interesting , but not compelling . → interesting , but not NOT_compelling .
Das and Chen, 2001; Pang, Lee, and Vaithyanathan, 2002
19
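One simple way to implement the negation feature: prefix tokens after a negation word with NOT_ until the next punctuation mark. The negator list below is a toy subset, for illustration:

```python
# Negation marking: "good" and "not good" become distinct features.
# NEGATORS is a small illustrative set, not an exhaustive list.
import re

NEGATORS = {"not", "no", "never", "n't"}

def mark_negation(tokens):
    out, negating = [], False
    for tok in tokens:
        if tok in NEGATORS:
            out.append(tok)
            negating = True                    # start marking following tokens
        elif re.fullmatch(r"[.,;:!?]", tok):
            out.append(tok)
            negating = False                   # punctuation ends the negated span
        else:
            out.append("NOT_" + tok if negating else tok)
    return out

print(mark_negation("interesting , but not compelling .".split()))
```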
We can go even more abstract than words
• We can derive features from the GloVe vectors: How close is each word in the paragraph to the “negative” words vs. the “positive” words?
[Figure: clusters of positive words and negative words in vector space]
20
Recurrent Neural Networks
21
Recurrent neural networks are neat because…
• They process sequences
• They have a memory
• As in vanilla neural networks:
• Compositions of nonlinear functions provide huge expressive power
• They learn their own features
• Similar tools: Hidden Markov Models, Conditional Random Fields
22
Recurrent neural networks operate on sequences
Karpathy 2015
Input state (unit) (one hot or dense vector)
Hidden state (unit)
Output state (unit)
* Length is unbounded! (in principle)
Example tasks: image classification, image captioning, sentiment analysis, machine translation, frame-level classification
23
Recurrent units have a memory
• Each arrow represents a different weight matrix
• The weight matrices are learned during training
• The green recurrent transformation has a memory
• We can stack multiple green layers to form a deep recurrent network
Karpathy 2015, Olah 2015
24
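The recurrent transformation can be sketched as one step of a vanilla RNN in NumPy. The weights here are random and purely illustrative (a real network learns them during training), but the structure shows where the memory lives: the hidden state h is fed back in at each step:

```python
# One step of a vanilla RNN cell: three weight matrices (one per arrow),
# with the hidden-to-hidden matrix W_hh carrying the "memory".
import numpy as np

rng = np.random.default_rng(0)
W_xh = rng.normal(size=(3, 4)) * 0.1   # input  -> hidden
W_hh = rng.normal(size=(4, 4)) * 0.1   # hidden -> hidden (the recurrent arrow)
W_hy = rng.normal(size=(4, 2)) * 0.1   # hidden -> output

def rnn_step(x, h):
    h_new = np.tanh(x @ W_xh + h @ W_hh)   # new state mixes input and memory
    y = h_new @ W_hy                       # output is read off the hidden state
    return y, h_new

h = np.zeros(4)
for x in [np.eye(3)[0], np.eye(3)[1], np.eye(3)[2]]:   # 3-step one-hot sequence
    y, h = rnn_step(x, h)
print(h.shape, y.shape)
```

Stacking several such layers gives a deep recurrent network; replacing the tanh update with gated cells gives the LSTM/GRU variants discussed next.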
The memory in recurrent units varies
• In practice we use LSTMs or GRUs to deal with vanishing gradients / allow us to train with longer-term dependencies
• LSTMs add a cell layer that persists from one time to the next, until it learns that it is appropriate to forget
• GRUs use gating that interpolates between retaining old information and admitting new information – they are newer and simpler
Olah 2015
[Diagrams: RNN, LSTM, and GRU cell internals]
25
Attention mechanisms also help with long-term dependencies
• Attention mechanisms let us “attend to” different parts of the input while generating each output unit
• The network learns what to focus on based on (a) the input and (b) the output it has generated so far
• Each output depends on a weighted combo of all inputs – not just the current input
• Since this leads to a huge proliferation of parameters (N inputs per output), we often use reinforcement learning to approximate what to attend to
• Visualizing attention is one of the (few) ways that neural networks for text are interpretable
26
Recursive neural networks for sentiment
• Combine word vectors at each node of the parse tree and try to get the parent node’s sentiment score right
• Train through gradient descent over tensors that get concatenated and operated on recursively
• Works best of all for sentiment analysis
• Making this feasible required the framing insight, collecting more data, the model insight, and the computational resources for training
• It’s available to use in Stanford CoreNLP now, so you don’t have to touch it
Socher et al. 2013
27
Neural networks vs. traditional models
• Neural networks are a good idea if you have substantial amounts of labeled data & you’re at an impasse for what features to use after many attempts
• If you have limited amounts of data – you’ll do better using theory/getting more data
• If you have unlabeled data – you need to label it
• If you haven’t captured everything that might matter – you want feature exploration
• If you have theory – you should test & benefit from it until it’s exhausted
28
Closing Thoughts
29
NLP is only as good as the training data
• Your performance will be bounded by how similar the training data is to the test data
• Shorthand, jargon, unexpected grammar degrade performance
• “l/m” – what’s that?
• Amazon Mechanical Turk can be helpful for labelling data with target variables
• Budget 5-10¢ per response
• Requires substantial planning
30
How do I know if NLP is good for my problem?
• Are people working through texts by hand? Start here!
• It’s attainable
• Leadership cares
• Maybe there is labeled data!
• If you don’t already have people doing it, it’s your turn:
• Look at some texts
• Figure out what you would do with them
• Code up examples by hand
• Identify whether NLP can help (and even if not, you’ve still done something useful)
31
“Imperfect” doesn’t mean “not useful”
• Humans are imperfect too
• It’s easy to get hung up on specific failure cases
• Best to measure a baseline before and after, as well as model performance – is there improvement in what we care about?
• Time for a human to accomplish task
• Proportion of fields populated
• Proportion of fields correct (requires blind scoring)
• Defining what the business actually cares about can be challenging – but it’s very helpful
32
At 11 am, you are able to…
• Competence (able to perform on your own, with varying levels of perfection)
• Convert your raw data into numeric representations suitable for statistical modeling
• Build bag-of-words models
• Establish model performance and business process baselines to exceed as part of evaluation
• Choose a good problem on which to apply NLP and apply NLP iteratively
• Exposure (aware)
• Familiar with vocabulary related to preprocessing (tokenization, stemming/lemmatization, stop words), representing text as numbers (one-hot encoding, word embedding), and model development (bag-of-words models, n-grams, skip-grams, feature engineering, smoothing)
• Familiar with some NLP modeling gotchas: training data must match testing data, unknown words must be addressed, testing data must be entirely new to be trusted, human performance is itself rarely 100%
• Familiar with toolsets: software (sklearn, nltk, CoreNLP) and resources (GloVe, WordNet)
• Familiar with the huge variety of already-solved tasks & open-source tools available
• Aware that vanilla RNNs fail at long-term dependencies and that LSTM/GRU units succeed
• Aware of the meaning of, and reason for, attention mechanisms in neural networks
33
Backup
34
Backpropagation works across time
• We can perform backpropagation across time
[Diagram: an unrolled RNN with weights shared across time – input weights U, recurrent weights W, and output weights V at every step; the outputs ŷ0 ŷ1 ŷ2 ŷ3 ŷ4 are compared against labels, giving errors/losses E0 E1 E2 E3 E4]
Britz 2015
35
Vanilla RNNs struggle with long-range dependencies
• “Vanishing gradients” occur in deep networks (like deep CNNs and RNNs)
• Long-term dependencies are very common – so for training to work, we need a mechanism to ensure gradients don’t vanish
On Tuesday, September 15, 2015, about 7:03 a.m. local time, a 47-passenger 2009 International school bus, operated by the Houston Independent School District (HISD) and occupied by a 44-year-old female driver and four HISD students, crashed into …
[Diagram: the same unrolled RNN – the gradient of a late loss (e.g., E3) must flow backward through many repeated applications of W, so it can vanish]
Britz 2015
36