TRANSCRIPT
A brief intro to natural language processing
Pamela Toman
16 August 2016
1
At 11 am, you will be able to…
• Competence (able to perform on your own, with varying levels of perfection)
• Convert your raw data into numeric representations suitable for statistical modeling
• Build bag-of-words models
• Establish model performance and business process baselines to exceed as part of evaluation
• Choose a good problem on which to apply NLP and apply NLP iteratively
• Exposure (aware)
• Familiar with vocabulary related to preprocessing (tokenization, stemming/lemmatization, stop words), representing text as numbers (one-hot encoding, word embedding), and model development (bag-of-words models, n-grams, skip-grams, feature engineering, smoothing)
• Familiar with some NLP modeling gotchas: training data must match testing data, unknown words must be addressed, testing data must be entirely new to be trusted, human performance is itself rarely 100%
• Familiar with toolsets: software (sklearn, nltk, CoreNLP) and resources (GloVe, WordNet)
• Familiar with the huge variety of already-solved tasks & open-source tools available
• Aware that vanilla RNNs fail at long-term dependencies and that LSTM/GRU units succeed
• Aware of the meaning of, and reason for, attention mechanisms in neural networks
2
What kind of text data can you access?
3
Why do we want NLP?
• There’s an awful lot of text…
• It’s awfully nuanced…
On Tuesday, September 15, 2015, about 7:03 a.m. local time, a 47-passenger 2009 International school bus, operated by the Houston Independent School District (HISD) and occupied by a 44-year-old female driver and four HISD students, crashed into …
The Rock is destined to be the 21st century's new “Conan” and that he's going to make a splash even greater than Arnold Schwarzenegger, Jean-Claud van Damme or Steven Segal.
4
Common tasks & off-the-shelf solutions
• Well-performing, free, open-source / off-the-shelf solutions exist in lots of areas:
• Speech-to-text (CMUSphinx, Kaldi) – internet helps
• Text-to-speech (Windows SAPI, OSX NSSS, *nix ESpeak)
• Stemming (NLTK Snowball English)
• Sentiment analysis (Stanford Recursive Neural Tensor Net)
• Named entity recognition (Stanford CoreNLP)
• Coreference resolution (Stanford CoreNLP)
• Relation extraction (agent-action-patient) (Stanford CoreNLP)
• Search / info retrieval (Lucene/elasticsearch/Solr)
• You can improve them (nltk, sklearn, Java)
• Don’t ignore paid services – they may be worth the cost
• Many projects will combine existing and custom methods
5
Sentiment analysis
• Goal: Determine attitude overall and/or re: a topic
• Usually positive or negative
• Uses: Marketing, customer service, recommendations, finance, understanding individuals’ perspectives, …
• The problem is hard: human inter-annotator agreement of ~80%
Effective but too-tepid biopic.
If you sometimes like to go to the movies to have fun, Wasabi is a good place to start.
Perhaps no picture ever made has more literally showed that the road to hell is paved with good intentions.
The film provides some great insight into the neurotic mindset of all comics—even those who have reached the absolute top of the game.
Pang and Lee 2005
6
Task structure: What is the unit? What is the label?
… The actor is virtually unrecognizable until the final third, where he is finally unmasked.
“Beyond” is undoubtedly messy, like a Starfleet ship that’s taken its fair share of beatings, but it is frequently a reminder of how good the series can be when all its engines are in working order. Opening a box left behind by Commander Spock, Spock the younger finds a photo of the original crew. That nostalgia, both for its source material and time gone by, is both reverent and earned. …
Labels: +1 / -1
7
Evaluating performance
• Define how you will evaluate before you begin
• You always want 1+ technical baseline:
• Human performance?
• An extremely naïve model?
• State-of-the-art computational model?
• All of the above?
• Common evaluation metrics for classification:
• Precision & recall
• Confusion matrices
• F1 score
Precision/Recall – Wikipedia
8
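These classification metrics can be computed directly with scikit-learn. A minimal sketch, using made-up gold labels and predictions (the +1/-1 labels follow the sentiment convention used elsewhere in these slides):

```python
# Minimal sketch of precision, recall, F1, and a confusion matrix.
# y_true / y_pred are hypothetical, for illustration only.
from sklearn.metrics import precision_score, recall_score, f1_score, confusion_matrix

y_true = [1, 1, 1, -1, -1, -1, 1, -1]   # gold labels (made up)
y_pred = [1, 1, -1, -1, -1, 1, 1, -1]   # model predictions (made up)

precision = precision_score(y_true, y_pred)   # of predicted +1s, how many are right?
recall = recall_score(y_true, y_pred)         # of gold +1s, how many did we find?
f1 = f1_score(y_true, y_pred)                 # harmonic mean of precision and recall
cm = confusion_matrix(y_true, y_pred)         # rows = gold class, columns = predicted
```

F1 balances the two error types; the confusion matrix shows exactly where the model confuses classes.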
Preprocessing
• Learning all the surface variants from limited data can be extremely hard
• Texts have punctuation (→ tokenization)
• Texts are capitalized (→ lowercasing)
• Texts have morphology like affixes (e.g., plural –s) (→ stemming or lemmatization)
• With limited data, it may help to preprocess the data
• Preprocessing improves recall at the expense of precision
• Preprocessing tools work well out-of-the-box for almost all purposes
• Given more data, it becomes more possible to use the raw data
9
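A minimal preprocessing sketch covering all three steps. The regex tokenizer here is a stand-in to keep the example self-contained; NLTK's Snowball English stemmer is the one the slides reference:

```python
# Tokenize, lowercase, and stem. The regex tokenizer is a simplification
# (real pipelines would use a proper tokenizer); the stemmer is NLTK's
# Snowball English stemmer, as mentioned on the tools slide.
import re
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")

def preprocess(text):
    tokens = re.findall(r"[a-z0-9-]+(?:'[a-z]+)?", text.lower())  # tokenize + lowercase
    return [stemmer.stem(t) for t in tokens]                      # strip affixes

print(preprocess("Cats are running!"))
```

Note how "cats" and "cat" now map to the same feature: recall goes up, at some cost to precision, exactly as described above.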
Word frequencies are related through the power law
• “Stop words” are very common words that some people filter out, especially for information retrieval
Zipf’s law & frequency lists
Word    Count
you     1,222,421
I       1,052,546
to      823,661
the     770,161
a       563,578
and     480,214
that    413,389
it      388,320
of      332,038
me      312,326
10
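Building a frequency list and filtering stop words takes only a few lines. A sketch with a toy corpus and a toy stop-word list (real stop-word lists, e.g. NLTK's, are much longer):

```python
# Build a Zipf-style frequency list with collections.Counter and
# filter out a (toy) stop-word list.
from collections import Counter

corpus = "the movie was the best movie of the year".split()
counts = Counter(corpus)                     # word -> frequency
stop_words = {"the", "of", "was"}            # tiny illustrative subset
content = [w for w in corpus if w not in stop_words]

print(counts.most_common(1))   # the most frequent word dominates, per Zipf's law
```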
Final labeled dataset
Unit of analysis (cleaned) → Target
the rock is destined to be the 21st century's new " conan " and that he's going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal . → +1
the gorgeously elaborate continuation of " the lord of the rings " trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson's expanded vision of j . r . r . tolkien's middle-earth . → +1
effective but too-tepid biopic → +1
simplistic , silly and tedious . → -1
it's so laddish and juvenile , only teenage boys could possibly find it funny . → -1
exploitative and largely devoid of the depth or sophistication that would make watching such a graphic treatment of the crimes bearable . → -1
… → …
11
We can encode text with one-hot vectors
• A “word embedding” is how we represent a word numerically (in vector space)
• To create numbers from text, it’s common to use one-hot encodings
12
the <1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...>
rock <0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...>
is <0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...>
dinosaur <0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...>
the rock is <1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...>
(the document vector for "the rock is" is the element-wise sum of its words' one-hot vectors)
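The one-hot and bag-of-words encodings above can be sketched in plain Python over a toy four-word vocabulary:

```python
# One-hot encoding over a toy vocabulary: each word gets a vector with a
# single 1 at its vocabulary index; a document is the combination of its words.
vocab = ["the", "rock", "is", "dinosaur"]
index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    vec = [0] * len(vocab)
    vec[index[word]] = 1
    return vec

def bag_of_words(text):
    # presence/absence vector for the whole document
    vec = [0] * len(vocab)
    for w in text.split():
        vec[index[w]] = 1
    return vec

print(one_hot("rock"))            # second slot is hot
print(bag_of_words("the rock is"))
```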
After encoding the texts, we can learn weights for classification
13
[Figure: a document-term matrix X (rows = Unit 1 … Unit 5, …; columns = words such as nasty, excellent, amazing, wonderful, terrible, awful, …; cells = 0/1 presence) is multiplied by a learned weight matrix W and squashed (S) to produce predictions Ŷ ≈ Y, the +/- labels.]
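The weight-learning step can be sketched with scikit-learn: `CountVectorizer` builds the X matrix and `LogisticRegression` learns the weights W. The four training texts are made up and far too few for real use:

```python
# Bag-of-words features + a linear classifier, as in the X . W = Y-hat picture.
# Training data is a tiny hypothetical sample, purely for illustration.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["excellent amazing wonderful", "nasty terrible awful",
         "amazing wonderful film", "awful nasty mess"]
labels = [1, -1, 1, -1]

model = make_pipeline(CountVectorizer(), LogisticRegression())
model.fit(texts, labels)                       # learn per-word weights
print(model.predict(["wonderful amazing"]))    # should come out positive
```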
We could also use dense word vectors
• One-hot vectors are arbitrary & uninformative about meaning
• Dense vectors carry more meaning
• We derive them from counts or predictions
• Usually they are ~50-500 dimensions
treadmills <0.4307 -0.31399 0.60878 -0.10931 -0.38425 -0.4796 0.41749 -0.95494 0.65878
-0.24547 -0.21854 -0.2505 -0.50729 0.047186 -0.47561 0.28034 -0.28351 1.2879 -0.50432
-1.2837 0.047272 -0.14955 -0.10071 -0.06754 0.25839 0.6971 0.14029 -0.16382 0.54242 ...>
nauseating <0.53624 -0.99272 -0.77127 -0.48285 0.41019 -0.17711 0.94572 0.13201 0.020268
1.0641 -0.24506 -0.22863 0.31128 0.34609 0.26537 -0.35245 -0.071506 0.33989 0.17206
-0.54792 -0.5268 0.032567 0.32413 -0.096092 0.41635 0.24512 -0.73399 1.431 1.0965 ...>
, <0.013441 0.23682 -0.16899 0.40951 0.63812 0.47709 -0.42852 -0.55641 -0.364 -0.23938
0.13001 -0.063734 -0.39575 -0.48162 0.23291 0.090201 -0.13324 0.078639 -0.41634 -0.15428
0.10068 0.48891 0.31226 -0.1252 -0.037512 -1.5179 0.12612 -0.02442 -0.042961 ...>
14
Dense word vectors have real meaning
• Dense word vectors include GloVe and word2vec
• Nearby words have similar meaning
• Direction & distance are meaningful (analogy tests)
GloVe talk
15
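"Nearby words have similar meaning" is usually measured with cosine similarity. A sketch with 4-dimensional made-up vectors (real GloVe vectors are ~50-300 dimensions and would be loaded from a distribution file, e.g. glove.6B.50d.txt, one token per line followed by its floats):

```python
# Cosine similarity between dense word vectors.
# These 4-d vectors are invented for illustration, NOT real GloVe values.
import math

vectors = {
    "good":  [0.9, 0.1, 0.3, 0.0],
    "great": [0.8, 0.2, 0.4, 0.1],
    "awful": [-0.7, 0.9, -0.2, 0.3],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))   # dot / (|u| * |v|)

# nearby words (good/great) score higher than distant ones (good/awful)
print(cosine(vectors["good"], vectors["great"]) >
      cosine(vectors["good"], vectors["awful"]))
```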
Synonym datasets are another resource
• WordNet (starting 1985) is a lexical database of synsets
• Thesaurus + dictionary
• Links words by meaning
• Links words by semantic relationship
• Synset = a particular meaning (could have many lemmas / lexical forms)
• It is super-exhaustive (“to cat is to whip”) but not technical or jargony
• It’s available through nltk
Princeton; Coursera
16
A naïve “bag of words” model is a good baseline
• n-grams are extremely easy and perform surprisingly well:
• Bigrams & especially trigrams get pretty sparse
• We also may use skip-grams to capture context
Effective but too-tepid biopic.
unigrams: effective | but | too-tepid | biopic | .
bigrams: START effective | effective but | but too-tepid | too-tepid biopic | biopic . | . END
trigrams: START START effective | START effective but | effective but too-tepid | but too-tepid biopic | too-tepid biopic . | biopic . END | . END END
1-skip-bigrams: START effective | effective but | but too-tepid | too-tepid biopic | biopic . | . END | START but | effective too-tepid | but biopic | too-tepid . | biopic END
17
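The padded n-grams and skip-grams shown above can be generated with a few lines (START/END padding as on the slide):

```python
# n-grams with START/END padding, plus 1-skip-bigrams that also pair
# tokens one position further apart to capture a little more context.
def ngrams(tokens, n):
    padded = ["START"] * (n - 1) + tokens + ["END"] * (n - 1)
    return [tuple(padded[i:i + n]) for i in range(len(padded) - n + 1)]

def skip_bigrams(tokens, k=1):
    padded = ["START"] + tokens + ["END"]
    pairs = []
    for gap in range(1, k + 2):   # gap 1 = ordinary bigram, gap 2 = skip one token
        pairs += [(padded[i], padded[i + gap]) for i in range(len(padded) - gap)]
    return pairs

toks = "effective but too-tepid biopic .".split()
print(ngrams(toks, 2))        # the 6 bigrams from the slide
print(skip_bigrams(toks))     # the 11 1-skip-bigrams from the slide
```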
We use smoothing to deal with unseen words
• We’ll need to deal with unknown words
• Some methods (especially Naïve Bayes) struggle if the probability of an observed feature is estimated to be literally 0
• We may also want to leverage biases in the language
• Two very easy approaches to smoothing:
• We relabel seen-once words as UNK and estimate the characteristics of UNK; new out-of-vocab words map to UNK
• We create UNK from no observations, and then we add one to each count (add-one/Laplacian smoothing)
• Statistically better but more complicated schemes are available
• Data size can help inform which approach to use
18
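The second approach above (an UNK created from no observations, then add-one/Laplacian smoothing) can be sketched over unigram counts:

```python
# Add-one (Laplace) smoothing: reserve an UNK slot with zero observations,
# then add one to every count so no word has probability exactly 0.
from collections import Counter

counts = Counter("the movie was the best movie ever".split())
vocab = set(counts) | {"UNK"}
total = sum(counts.values())

def smoothed_prob(word):
    w = word if word in counts else "UNK"      # out-of-vocab maps to UNK
    return (counts[w] + 1) / (total + len(vocab))

print(smoothed_prob("zygote") > 0)   # unseen words now get probability mass
```

This keeps Naïve Bayes-style models from zeroing out an entire document's probability because of one unseen word.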
Now that we’ve incorporated smoothed n-grams, we can start thinking about smarter features
• Remember we’re just building a statistical model
• We can go beyond features indicating the presence/absence of each word
• Smarter features mean smarter models
• One possible straightforward feature change is negation: let’s differentiate “good” and “not good”
• Many other features are possible – we’re now into your domain expertise!
• You can use parts of speech, synsets, grammatical relationships, …
interesting , but not compelling . → interesting , but not NOT_compelling .
Das and Chen, 2001; Pang, Lee, and Vaithyanathan, 2002
19
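One simple way to implement the negation feature: prefix tokens after a negation word with NOT_ until the next punctuation mark. The negator list below is a toy subset, for illustration:

```python
# Negation marking: "good" and "not good" become distinct features.
# NEGATORS is a small illustrative set, not an exhaustive list.
import re

NEGATORS = {"not", "no", "never", "n't"}

def mark_negation(tokens):
    out, negating = [], False
    for tok in tokens:
        if tok in NEGATORS:
            out.append(tok)
            negating = True                    # start marking following tokens
        elif re.fullmatch(r"[.,;:!?]", tok):
            out.append(tok)
            negating = False                   # punctuation ends the negated span
        else:
            out.append("NOT_" + tok if negating else tok)
    return out

print(mark_negation("interesting , but not compelling .".split()))
```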
We can go even more abstract than words
• We can derive features from the GloVe vectors: How close is each word in the paragraph to the “negative” words vs. the “positive” words?
[Figure: clusters of positive words and negative words in vector space]
20
Recurrent Neural Networks
21
Recurrent neural networks are neat because…
• They process sequences
• They have a memory
• As in vanilla neural networks:
• Compositions of nonlinear functions provide huge expressive power
• They learn their own features
• Similar tools: Hidden Markov Models, Conditional Random Fields
22
Recurrent neural networks operate on sequences
Karpathy 2015
Input state (unit) (one hot or dense vector)
Hidden state (unit)
Output state (unit)
* Length is unbounded! (in principle)
Example tasks: image classification, image captioning, sentiment analysis, machine translation, frame-level classification
23
Recurrent units have a memory
• Each arrow represents a different weight matrix
• The weight matrices are learned during training
• The green recurrent transformation has a memory
• We can stack multiple green layers to form a deep recurrent network
Karpathy 2015, Olah 2015
24
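The recurrent transformation can be sketched as one step of a vanilla RNN in NumPy. The weights here are random and purely illustrative (a real network learns them during training), but the structure shows where the memory lives: the hidden state h is fed back in at each step:

```python
# One step of a vanilla RNN cell: three weight matrices (one per arrow),
# with the hidden-to-hidden matrix W_hh carrying the "memory".
import numpy as np

rng = np.random.default_rng(0)
W_xh = rng.normal(size=(3, 4)) * 0.1   # input  -> hidden
W_hh = rng.normal(size=(4, 4)) * 0.1   # hidden -> hidden (the recurrent arrow)
W_hy = rng.normal(size=(4, 2)) * 0.1   # hidden -> output

def rnn_step(x, h):
    h_new = np.tanh(x @ W_xh + h @ W_hh)   # new state mixes input and memory
    y = h_new @ W_hy                       # output is read off the hidden state
    return y, h_new

h = np.zeros(4)
for x in [np.eye(3)[0], np.eye(3)[1], np.eye(3)[2]]:   # 3-step one-hot sequence
    y, h = rnn_step(x, h)
print(h.shape, y.shape)
```

Stacking several such layers gives a deep recurrent network; replacing the tanh update with gated cells gives the LSTM/GRU variants discussed next.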
The memory in recurrent units varies
• In practice we use LSTMs or GRUs to deal with vanishing gradients / allow us to train with longer-term dependencies
• LSTMs add a cell layer that persists from one time to the next, until it learns that it is appropriate to forget
• GRUs use gating that interpolates between retaining old information and admitting new information – they are newer and simpler
Olah 2015
[Diagrams: RNN, LSTM, and GRU cell internals]
25
Attention mechanisms also help with long-term dependencies
• Attention mechanisms let us “attend to” different parts of the input while generating each output unit
• The network learns what to focus on based on (a) the input and (b) the output it has generated so far
• Each output depends on a weighted combo of all inputs – not just the current input
• Since this leads to a huge proliferation of parameters (N inputs per output), we often use reinforcement learning to approximate what to attend to
• Visualizing attention is one of the (few) ways that neural networks for text are interpretable
26
Recursive neural networks for sentiment
• Combine word vectors at each node of the parse tree and try to get the parent node’s sentiment score right
• Train through gradient descent over tensors that get concatenated and operated on recursively
• Works best of all for sentiment analysis
• Making this feasible required the framing insight, collecting more data, the model insight, and the computational resources for training
• It’s available to use in Stanford CoreNLP now, so you don’t have to touch it
Socher et al. 2013
27
Neural networks vs. traditional models
• Neural networks are a good idea if you have substantial amounts of labeled data & you’re at an impasse for what features to use after many attempts
• If you have limited amounts of data – you’ll do better using theory/getting more data
• If you have unlabeled data – you need to label it
• If you haven’t captured everything that might matter – you want feature exploration
• If you have theory – you should test & benefit from it until it’s exhausted
28
Closing Thoughts
29
NLP is only as good as the training data
• Your performance will be bounded by how similar the training data is to the test data
• Shorthand, jargon, unexpected grammar degrade performance
• “l/m” – what’s that?
• Amazon Mechanical Turk can be helpful for labelling data with target variables
• Budget 5-10¢ per response
• Requires substantial planning
30
How do I know if NLP is good for my problem?
• Are people working through texts by hand? Start here!
• It’s attainable
• Leadership cares
• Maybe there is labeled data!
• If you don’t already have people doing it, it’s your turn:
• Look at some texts
• Figure out what you would do with them
• Code up examples by hand
• Identify whether NLP can help (and even if not, you’ve still done something useful)
31
“Imperfect” doesn’t mean “not useful”
• Humans are imperfect too
• It’s easy to get hung up on specific failure cases
• Best to measure a baseline before and after, as well as model performance – is there improvement in what we care about?
• Time for a human to accomplish task
• Proportion of fields populated
• Proportion of fields correct (requires blind scoring)
• Defining what the business actually cares about can be challenging – but it’s very helpful
32
At 11 am, you are able to…
• Competence (able to perform on your own, with varying levels of perfection)
• Convert your raw data into numeric representations suitable for statistical modeling
• Build bag-of-words models
• Establish model performance and business process baselines to exceed as part of evaluation
• Choose a good problem on which to apply NLP and apply NLP iteratively
• Exposure (aware)
• Familiar with vocabulary related to preprocessing (tokenization, stemming/lemmatization, stop words), representing text as numbers (one-hot encoding, word embedding), and model development (bag-of-words models, n-grams, skip-grams, feature engineering, smoothing)
• Familiar with some NLP modeling gotchas: training data must match testing data, unknown words must be addressed, testing data must be entirely new to be trusted, human performance is itself rarely 100%
• Familiar with toolsets: software (sklearn, nltk, CoreNLP) and resources (GloVe, WordNet)
• Familiar with the huge variety of already-solved tasks & open-source tools available
• Aware that vanilla RNNs fail at long-term dependencies and that LSTM/GRU units succeed
• Aware of the meaning of, and reason for, attention mechanisms in neural networks
33
Backup
34
Backpropagation works across time
• We can perform backpropagation across time
[Diagram: an unrolled RNN with weights shared across time – input weights U, recurrent weights W, and output weights V at every step; the outputs ŷ0 ŷ1 ŷ2 ŷ3 ŷ4 are compared against labels, giving errors/losses E0 E1 E2 E3 E4]
Britz 2015
35
Vanilla RNNs struggle with long-range dependencies
• “Vanishing gradients” occur in deep networks (like deep CNNs and RNNs)
• Long-term dependencies are very common – so for training to work, we need a mechanism to ensure gradients don’t vanish
On Tuesday, September 15, 2015, about 7:03 a.m. local time, a 47-passenger 2009 International school bus, operated by the Houston Independent School District (HISD) and occupied by a 44-year-old female driver and four HISD students, crashed into …
[Diagram: the same unrolled RNN – the gradient of a late loss (e.g., E3) must flow backward through many repeated applications of W, so it can vanish]
Britz 2015
36