Intro to Natural Language Processing - Pamela Toman


Page 1

A brief intro to natural language processing

Pamela Toman

16 August 2016

1

Page 2

At 11 am, you will be able to…

• Competence (able to perform on your own, with varying levels of perfection)

• Convert your raw data into numeric representations suitable for statistical modeling

• Build bag-of-words models

• Establish model performance and business process baselines to exceed as part of evaluation

• Choose a good problem on which to apply NLP and apply NLP iteratively

• Exposure (aware)

• Familiar with vocabulary related to preprocessing (tokenization, stemming/lemmatization, stop words), representing text as numbers (one-hot encoding, word embedding), and model development (bag-of-words models, n-grams, skip-grams, feature engineering, smoothing)

• Familiar with some NLP modeling gotchas: training data must match testing data, unknown words must be addressed, testing data must be entirely new to be trusted, human performance is itself rarely 100%

• Familiar with toolsets: software (sklearn, nltk, CoreNLP) and resources (GloVe, WordNet)

• Familiar with the huge variety of already-solved tasks & open-source tools available

• Aware that vanilla RNNs fail at long-term dependencies and that LSTM/GRU units succeed

• Aware of the meaning and reason for attention mechanisms in neural networks

Page 3

What kind of text data can you access?

3

Page 4

Why do we want NLP?

• There’s an awful lot of text…

• It’s awfully nuanced…

On Tuesday, September 15, 2015, about 7:03 a.m. local time, a 47-passenger 2009 International school bus, operated by the Houston Independent School District (HISD) and occupied by a 44-year-old female driver and four HISD students, crashed into …

The Rock is destined to be the 21st century's new “Conan” and that he's going to make a splash even greater than Arnold Schwarzenegger, Jean-Claud van Damme or Steven Segal.

4

Page 5

Common tasks & off-the-shelf solutions

• Well-performing, free, open-source / off-the-shelf (OSS/OTS) solutions exist in lots of areas:

• Speech-to-text (CMUSphinx, Kaldi) – internet helps

• Text-to-speech (Windows SAPI, OSX NSSS, *nix ESpeak)

• Stemming (NLTK Snowball English)

• Sentiment analysis (Stanford Recursive Neural Tensor Net)

• Named entity recognition (Stanford CoreNLP)

• Coreference resolution (Stanford CoreNLP)

• Relation extraction (agent-action-patient) (Stanford CoreNLP)

• Search / info retrieval (Lucene/elasticsearch/Solr)

• You can improve them (nltk, sklearn, Java)

• Don’t ignore paid services – they may be worth the cost

• Many projects will combine existing and custom methods

5

Page 6

Sentiment analysis

• Goal: Determine attitude overall and/or re: a topic

• Usually positive or negative

• Uses: Marketing, customer service, recommendations, finance, understanding individuals’ perspectives, …

• The problem is hard: human inter-annotator agreement of ~80%

Effective but too-tepid biopic.

If you sometimes like to go to the movies to have fun, Wasabi is a good place to start.

Perhaps no picture ever made has more literally showed that the road to hell is paved with good intentions.

The film provides some great insight into the neurotic mindset of all comics—even those who have reached the absolute top of the game.

Pang and Lee 2005

6

Page 7

Task structure: What is the unit? What is the label?

… The actor is virtually unrecognizable until the final third, where he is finally unmasked.

“Beyond” is undoubtedly messy, like a Starfleet ship that’s taken its fair share of beatings, but it is frequently a reminder of how good the series can be when all its engines are in working order. Opening a box left behind by Commander Spock, Spock the younger finds a photo of the original crew. That nostalgia, both for its source material and time gone by, is both reverent and earned. …

Labels: +1 / -1

7

Page 8

Evaluating performance

• Define how you will evaluate before you begin

• You always want at least one technical baseline:

• Human performance?

• An extremely naïve model?

• State-of-the-art computational model?

• All of the above?

• Common evaluation metrics for classification:

• Precision & recall

• Confusion matrices

• F1 score

Precision/Recall – Wikipedia
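
As a quick illustration, here is a minimal sketch of computing these metrics with scikit-learn; the gold labels and predictions below are made up for illustration.

```python
# Minimal sketch: the classification metrics named above, via scikit-learn.
# The y_true / y_pred values are made-up placeholders.
from sklearn.metrics import precision_score, recall_score, f1_score, confusion_matrix

y_true = [1, 1, 0, 1, 0, 0, 1, 0]   # gold labels (1 = positive, 0 = negative)
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]   # model predictions

print("precision:", precision_score(y_true, y_pred))
print("recall:   ", recall_score(y_true, y_pred))
print("F1:       ", f1_score(y_true, y_pred))
print("confusion matrix:")
print(confusion_matrix(y_true, y_pred))
```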

Page 9

Preprocessing

• Learning all the surface variants from limited data can be extremely hard

• Texts have punctuation (→ tokenization)

• Texts are capitalized (→ lowercasing)

• Texts have morphology like affixes (e.g., plural -s) (→ stemming or lemmatization)

• With limited data, it may help to preprocess the data

• Preprocessing improves recall at the expense of precision

• Preprocessing tools work well out-of-the-box for almost all purposes

• Given more data, it becomes more possible to use the raw data
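
As a concrete illustration of the preprocessing bullets above, here is a minimal sketch using nltk; it assumes the 'punkt' tokenizer models have been downloaded.

```python
# Minimal preprocessing sketch with nltk: tokenize, lowercase, stem.
# Assumes the 'punkt' tokenizer models are available (nltk.download('punkt')).
import nltk
from nltk.stem.snowball import SnowballStemmer

text = "Effective but too-tepid biopic."
tokens = nltk.word_tokenize(text)            # tokenization
tokens = [t.lower() for t in tokens]         # lowercasing
stemmer = SnowballStemmer("english")
stems = [stemmer.stem(t) for t in tokens]    # stemming
print(stems)                                 # e.g., ['effect', 'but', 'too-tepid', 'biopic', '.']
```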

9

Page 10

Word frequencies follow a power law

• “Stop words” are very common words that some people filter out, especially for information retrieval

Zipf’s law & frequency lists

Word Count

you 1,222,421

I 1,052,546

to 823,661

the 770,161

a 563,578

and 480,214

that 413,389

it 388,320

of 332,038

me 312,326
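
A minimal sketch of counting word frequencies and filtering stop words in Python; the example sentence is a shortened fragment of the review used earlier, and it assumes nltk's English stop-word list has been downloaded.

```python
# Minimal sketch: word frequencies with collections.Counter, then stop-word
# filtering with nltk's list (assumes nltk.download('stopwords') has been run).
from collections import Counter
from nltk.corpus import stopwords

tokens = "the rock is destined to be the new conan".split()
print(Counter(tokens).most_common(3))        # the most frequent tokens are function words

stop = set(stopwords.words("english"))
content = [t for t in tokens if t not in stop]
print(content)                               # ['rock', 'destined', 'new', 'conan']
```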

10

Page 11

Final labeled dataset

Unit of analysis (cleaned) | Target

the rock is destined to be the 21st century's new " conan " and that he's going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal . | +1

the gorgeously elaborate continuation of " the lord of the rings " trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson's expanded vision of j . r . r . tolkien's middle-earth . | +1

effective but too-tepid biopic | +1

simplistic , silly and tedious . | -1

it's so laddish and juvenile , only teenage boys could possibly find it funny . | -1

exploitative and largely devoid of the depth or sophistication that would make watching such a graphic treatment of the crimes bearable . | -1

… | …

11

Page 12

We can encode text with one-hot vectors

• A “word embedding” is how we represent a word numerically (in vector space)

• To create numbers from text, it’s common to use one-hot encodings

12

the <1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...>

rock <0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...>

is <0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...>

dinosaur <0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...>

the + rock + is <1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...>
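
A minimal sketch of building these one-hot and summed bag-of-words vectors by hand, over a toy four-word vocabulary; a real pipeline would more likely use something like sklearn's CountVectorizer.

```python
# Minimal sketch: one-hot vectors over a toy four-word vocabulary, and the
# summed bag-of-words vector for "the rock is".
import numpy as np

vocab = ["the", "rock", "is", "dinosaur"]
index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    v = np.zeros(len(vocab))
    v[index[word]] = 1
    return v

bow = sum(one_hot(w) for w in ["the", "rock", "is"])   # sum of one-hot vectors
print(bow)                                             # [1. 1. 1. 0.]
```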

Page 13

After encoding the texts, we can learn weights for classification

13

[Figure: a document-term matrix X (rows = Units 1–5, columns = indicator features for words such as nasty, excellent, amazing, wonderful, terrible, awful, …) is multiplied by a learned weight matrix W to give predictions Ŷ, which we want to approximate the true labels Y (positive / negative).]
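
A minimal sketch of the same idea in scikit-learn: CountVectorizer builds the document-term matrix X and LogisticRegression learns the weights; the tiny training set is made up for illustration.

```python
# Minimal sketch: learn weights for classification from bag-of-words counts.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

texts = ["amazing and wonderful", "excellent film", "nasty and awful", "terrible mess"]
labels = [1, 1, -1, -1]                      # +1 positive, -1 negative

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)          # the document-term matrix
clf = LogisticRegression().fit(X, labels)    # one learned weight per vocabulary word

print(clf.predict(vectorizer.transform(["wonderful and excellent"])))   # likely [1]
```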

Page 14

We could also use dense word vectors

• One-hot vectors are arbitrary & uninformative about meaning

• Dense vectors carry more meaning

• We derive them from counts or predictions

• Usually they are ~50-500 dimensions

treadmills <0.4307 -0.31399 0.60878 -0.10931 -0.38425 -0.4796 0.41749 -0.95494 0.65878

-0.24547 -0.21854 -0.2505 -0.50729 0.047186 -0.47561 0.28034 -0.28351 1.2879 -0.50432

-1.2837 0.047272 -0.14955 -0.10071 -0.06754 0.25839 0.6971 0.14029 -0.16382 0.54242 ...>

nauseating <0.53624 -0.99272 -0.77127 -0.48285 0.41019 -0.17711 0.94572 0.13201 0.020268

1.0641 -0.24506 -0.22863 0.31128 0.34609 0.26537 -0.35245 -0.071506 0.33989 0.17206

-0.54792 -0.5268 0.032567 0.32413 -0.096092 0.41635 0.24512 -0.73399 1.431 1.0965 ...>

, <0.013441 0.23682 -0.16899 0.40951 0.63812 0.47709 -0.42852 -0.55641 -0.364 -0.23938

0.13001 -0.063734 -0.39575 -0.48162 0.23291 0.090201 -0.13324 0.078639 -0.41634 -0.15428

0.10068 0.48891 0.31226 -0.1252 -0.037512 -1.5179 0.12612 -0.02442 -0.042961 ...>

14

Page 15

Dense word vectors have real meaning

• Dense word vectors include GloVe and word2vec

• Nearby words have similar meaning

• Direction & distance are meaningful (analogy tests)

GloVe talk
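
A minimal sketch of loading pre-trained GloVe vectors and probing them with cosine similarity; it assumes a local copy of a GloVe file such as glove.6B.50d.txt.

```python
# Minimal sketch: load pre-trained GloVe vectors and probe them with cosine
# similarity. Assumes a local copy of a GloVe file such as 'glove.6B.50d.txt'.
import numpy as np

vectors = {}
with open("glove.6B.50d.txt", encoding="utf-8") as f:
    for line in f:
        parts = line.rstrip().split(" ")
        vectors[parts[0]] = np.array(parts[1:], dtype=float)

def cosine(a, b):
    return a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))

def nearest(vec, k=5):
    return sorted(vectors, key=lambda w: -cosine(vec, vectors[w]))[:k]

print(nearest(vectors["wonderful"]))                                # similar words
print(nearest(vectors["king"] - vectors["man"] + vectors["woman"])) # analogy: near "queen"
```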

Page 16

Synonym datasets are another resource

• WordNet (starting 1985) is a lexical database of synsets

• Thesaurus + dictionary

• Links words by meaning

• Links words by semantic relationship

• Synset = a particular meaning (could have many lemmas / lexical forms)

• It is super-exhaustive (“to cat is to whip”) but not technical or jargony

• It’s available through nltk

Princeton: Coursera
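
A minimal sketch of querying WordNet through nltk (assuming the WordNet data has been downloaded).

```python
# Minimal sketch: querying WordNet through nltk.
# Assumes the WordNet data is available (nltk.download('wordnet')).
from nltk.corpus import wordnet as wn

for synset in wn.synsets("cat", pos=wn.VERB):          # "to cat is to whip"
    print(synset.name(), "-", synset.definition())

print(wn.synsets("movie")[0].lemma_names())            # one synset (meaning), many lemmas
```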

16

Page 17

A naïve “bag of words” model is a good baseline

• n-grams are extremely easy and perform surprisingly well:

• Bigrams & especially trigrams get pretty sparse

• We may also use skip-grams to capture context

Effective but too-tepid biopic.

unigrams: effective | but | too-tepid | biopic | .

bigrams: START effective | effective but | but too-tepid | too-tepid biopic | biopic . | . END

trigrams: START START effective | START effective but | effective but too-tepid | but too-tepid biopic | too-tepid biopic . | biopic . END | . END END

1-skip-bigrams: START effective | effective but | but too-tepid | too-tepid biopic | biopic . | . END | START but | effective too-tepid | but biopic | too-tepid . | biopic END
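
A minimal sketch of generating these n-grams and skip-grams with nltk's ngrams and skipgrams helpers.

```python
# Minimal sketch: padded bigrams and 1-skip-bigrams for the example sentence,
# using nltk.util's ngrams and skipgrams helpers.
from nltk.util import ngrams, skipgrams

tokens = ["effective", "but", "too-tepid", "biopic", "."]

bigrams = list(ngrams(tokens, 2, pad_left=True, pad_right=True,
                      left_pad_symbol="START", right_pad_symbol="END"))
print(bigrams)                                 # ('START', 'effective'), ..., ('.', 'END')

skip_bigrams = list(skipgrams(tokens, 2, 1))   # bigrams that may skip one token
print(skip_bigrams)                            # includes ('effective', 'too-tepid'), ('but', 'biopic'), ...
```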

17

Page 18

We use smoothing to deal with unseen words

• We’ll need to deal with unknown words

• Some methods (especially Naïve Bayes) struggle if the estimated probability of a feature seen at test time is literally 0

• We may also want to leverage biases in the language

• Two very easy approaches to smoothing:

• We relabel seen-once words as UNK and estimate the characteristics of UNK; new out-of-vocab words map to UNK

• We create UNK from no observations, and then we add one to each count (add-one / Laplace smoothing)

• Statistically better but more complicated schemes are available

• Data size can help inform which approach to use
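
A minimal sketch of the two easy approaches above: relabel rare words as UNK, then apply add-one smoothing so no estimate is ever exactly zero; the toy corpus is made up for illustration.

```python
# Minimal sketch: relabel rare words as UNK, then apply add-one (Laplace)
# smoothing so no probability estimate is ever exactly zero.
from collections import Counter

train_tokens = "the rock is fun the movie is fun fun".split()
raw_counts = Counter(train_tokens)

keep = {w for w, c in raw_counts.items() if c > 1}     # words seen more than once
def normalize(token):
    return token if token in keep else "UNK"           # rare/unseen words become UNK

counts = Counter(normalize(t) for t in train_tokens)
total, vocab_size = sum(counts.values()), len(counts)

def prob(token):
    return (counts[normalize(token)] + 1) / (total + vocab_size)   # add-one smoothing

print(prob("fun"), prob("zombie"))    # "zombie" was never seen; it falls back to UNK
```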

18

Page 19

Now that we’ve incorporated smoothed n-grams, we can start thinking about smarter features

• Remember we’re just building a statistical model

• We can go beyond features indicating the presence/absence of each word

• Smarter features mean smarter models

• One possible straightforward feature change is negation: let’s differentiate “good” and “not good”

• Many other features are possible – we’re now into your domain expertise!

• You can use parts of speech, synsets, grammatical relationships, …

interesting , but not compelling . → interesting , but not NOT_compelling .

Das and Chen 2001; Pang, Lee, and Vaithyanathan 2002
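
A minimal sketch of this negation feature: after a negation word, subsequent tokens are prefixed with NOT_ until the next punctuation mark; the negator and punctuation lists are simplified placeholders.

```python
# Minimal sketch of the negation feature: after a negation word, prefix tokens
# with NOT_ until the next punctuation mark.
NEGATORS = {"not", "no", "never", "n't"}
PUNCT = {".", ",", ";", "!", "?"}

def mark_negation(tokens):
    negated, out = False, []
    for tok in tokens:
        if tok in PUNCT:
            negated = False
            out.append(tok)
        elif tok in NEGATORS:
            negated = True
            out.append(tok)
        else:
            out.append("NOT_" + tok if negated else tok)
    return out

print(mark_negation("interesting , but not compelling .".split()))
# ['interesting', ',', 'but', 'not', 'NOT_compelling', '.']
```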

Page 20

We can go even more abstract than words

• We can derive features from the GloVe vectors: How close is each word in the paragraph to the “negative” words vs. the “positive” words?

[Figure: GloVe word vectors projected into two dimensions, with a cluster of positive words and a cluster of negative words]
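
A minimal sketch of such a feature: for each word in the unit, measure its similarity to small hand-picked seed sets of positive and negative words in GloVe space. The seed lists and the GloVe file name are assumptions of this sketch.

```python
# Minimal sketch: for each word in a unit, how similar is it (at best) to small
# seed sets of positive and negative words in GloVe space?
import numpy as np

vectors = {}
with open("glove.6B.50d.txt", encoding="utf-8") as f:
    for line in f:
        parts = line.rstrip().split(" ")
        vectors[parts[0]] = np.array(parts[1:], dtype=float)

def cosine(a, b):
    return a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))

positive_seeds = ["excellent", "amazing", "wonderful"]
negative_seeds = ["nasty", "terrible", "awful"]

def polarity_features(tokens):
    pos, neg = [], []
    for tok in tokens:
        if tok not in vectors:
            continue                                       # skip out-of-vocabulary tokens
        pos.append(max(cosine(vectors[tok], vectors[s]) for s in positive_seeds))
        neg.append(max(cosine(vectors[tok], vectors[s]) for s in negative_seeds))
    return np.mean(pos), np.mean(neg)                      # two features per unit

print(polarity_features("effective but too-tepid biopic .".split()))
```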

20

Page 21

Recurrent Neural Networks

21

Page 22

Recurrent neural networks are neat because…

• They process sequences

• They have a memory

• As in vanilla neural networks:

• Compositions of nonlinear functions provide huge expressive power

• They learn their own features

• Similar tools: Hidden Markov Models, Conditional Random Fields

22

Page 23

Recurrent neural networks operate on sequences

[Figure: RNN input/output configurations, from one-to-one through sequence-to-sequence. Each diagram shows input units (one-hot or dense vectors), hidden units, and output units. Example tasks: image classification (one-to-one), image captioning (one-to-many), sentiment analysis (many-to-one), machine translation (many-to-many), frame-level classification (synced many-to-many). Sequence length is unbounded, in principle. (Karpathy 2015)]

23

Page 31

Recurrent units have a memory

• Each arrow represents a different weight matrix

• The weight matrices are learned during training

• The green recurrent (hidden-to-hidden) transformation has a memory

• We can stack multiple recurrent layers to form a deep recurrent network

Karpathy 2015; Olah 2015
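
A minimal numpy sketch of the recurrence that gives the hidden layer its memory; the weight matrices and inputs are random placeholders, not a trained model.

```python
# Minimal numpy sketch of the recurrence that gives the hidden layer its memory:
# h_t depends on the current input and on h_{t-1}.
import numpy as np

vocab_size, hidden_size, output_size = 10, 8, 2
rng = np.random.default_rng(0)
W_xh = rng.normal(size=(hidden_size, vocab_size))    # input -> hidden
W_hh = rng.normal(size=(hidden_size, hidden_size))   # hidden -> hidden (the memory)
W_hy = rng.normal(size=(output_size, hidden_size))   # hidden -> output

def forward(one_hot_inputs):
    h = np.zeros(hidden_size)
    for x in one_hot_inputs:                 # one step per token
        h = np.tanh(W_xh @ x + W_hh @ h)     # new state mixes the input with the old state
    return W_hy @ h                          # e.g., sentiment scores from the final state

tokens = [np.eye(vocab_size)[i] for i in (3, 1, 4)]   # three one-hot "words"
print(forward(tokens))
```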

Page 32

The memory in recurrent units varies

• In practice we use LSTMs or GRUs to deal with vanishing gradients and allow training with longer-term dependencies

• LSTMs add a cell state that persists from one time step to the next, until the network learns that it is appropriate to forget

• GRUs use a gating calculation that trades off between retaining old information and admitting new information – they are newer and simpler than LSTMs

[Figure: RNN, LSTM, and GRU cell diagrams (Olah 2015)]
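
For flavor, a minimal sketch of an LSTM-based many-to-one classifier in PyTorch (a framework choice of this note, not of the original deck); sizes and inputs are illustrative.

```python
# Minimal sketch of an LSTM-based many-to-one classifier in PyTorch.
import torch
import torch.nn as nn

class LSTMSentiment(nn.Module):
    def __init__(self, vocab_size=5000, embed_dim=100, hidden_dim=128, num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, num_classes)

    def forward(self, token_ids):             # token_ids: (batch, seq_len) word indices
        embedded = self.embed(token_ids)      # (batch, seq_len, embed_dim)
        _, (h_last, _) = self.lstm(embedded)  # final hidden state carries the "memory"
        return self.out(h_last[-1])           # class scores per example

model = LSTMSentiment()
logits = model(torch.randint(0, 5000, (4, 12)))   # a batch of 4 fake 12-token reviews
print(logits.shape)                               # torch.Size([4, 2])
```

nn.GRU is a near drop-in alternative; it returns the final hidden state directly rather than a (hidden, cell) pair.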

25

Page 33

Attention mechanisms also help with long-term dependencies

• Attention mechanisms let us “attend to” different parts of the input while generating each output unit

• The network learns what to focus on based on (a) the input and (b) the output it has generated so far

• Each output depends on a weighted combo of all inputs – not just the current input

• Since this leads to a huge proliferation of parameters (N inputs per output), we often use reinforcement learning to approximate what to attend to

• Visualizing attention is one of the (few) ways that neural networks for text are interpretable

26

Page 34

Recursive NN for sentiment

• Combine word representations at each layer of the parse tree and try to get the parent node’s sentiment score right

• Train through gradient descent; child representations are concatenated and combined through tensor operations, applied recursively

• Works best of all the approaches discussed for sentiment analysis

• Making this feasible required the framing insight, collecting more data, the modeling insight, and the computational resources for training

• It’s available to use in Stanford CoreNLP now, so you don’t have to touch it

Socher et al. 2013

Page 35

Neural networks vs. traditional models

• Neural networks are a good idea if you have substantial amounts of labeled data & you’re at an impasse for what features to use after many attempts

• If you have limited amounts of data – you’ll do better using theory/getting more data

• If you have unlabeled data – you need to label it

• If you haven’t captured everything that might matter – you want feature exploration

• If you have theory – you should test & benefit from it until it’s exhausted

28

Page 36

Closing Thoughts

29

Page 37

NLP is only as good as the training data

• Your performance will be bounded by how similar the training data is to the test data

• Shorthand, jargon, unexpected grammar degrade performance

• “l/m” – what’s that?

• Amazon Mechanical Turk can be helpful for labelling data with target variables

• Budget 5-10¢ per response

• Requires substantial planning

30

Page 38

How do I know if NLP is good for my problem?

• Are people working through texts by hand? Start here!

• It’s attainable

• Leadership cares

• Maybe there is labeled data!

• If you don’t already have people doing it, it’s your turn!

• Look at some texts

• Figure out what you would do with them

• Code up examples by hand

• Identify whether NLP can help (and even if not, you’ve still done something useful)

31

Page 39

“Imperfect” doesn’t mean “not useful”

• Humans are imperfect too

• It’s easy to get hung up on specific failure cases

• Best to measure a baseline before and after, as well as model performance – is there improvement in what we care about?

• Time for a human to accomplish task

• Proportion of fields populated

• Proportion of fields correct (requires blind scoring)

• Defining what the business actually cares about can be challenging – but it’s very helpful

32

Page 40

At 11 am, you are able to…

• Competence (able to perform on your own, with varying levels of perfection)

• Convert your raw data into numeric representations suitable for statistical modeling

• Build bag-of-words models

• Establish model performance and business process baselines to exceed as part of evaluation

• Choose a good problem on which to apply NLP and apply NLP iteratively

• Exposure (aware)

• Familiar with vocabulary related to preprocessing (tokenization, stemming/lemmatization, stop words), representing text as numbers (one-hot encoding, word embedding), and model development (bag-of-words models, n-grams, skip-grams, feature engineering, smoothing)

• Familiar with some NLP modeling gotchas: training data must match testing data, unknown words must be addressed, testing data must be entirely new to be trusted, human performance is itself rarely 100%

• Familiar with toolsets: software (sklearn, nltk, CoreNLP) and resources (GloVe, WordNet)

• Familiar with the huge variety of already-solved tasks & open-source tools available

• Aware that vanilla RNNs fail at long-term dependencies and that LSTM/GRU units succeed

• Aware of the meaning and reason for attention mechanisms in neural networks

Page 41

Backup

34

Page 42

Backpropagation works across time

• We can perform backpropagation across time

[Figure: an RNN unrolled over five time steps. Each input enters the hidden layer through weights U, the hidden state carries forward to the next step through weights W, and each hidden state produces a prediction ŷ0 … ŷ4 through weights V. Comparing each prediction against its label gives the losses E0 … E4, and gradients flow backward through the unrolled graph. (Britz 2015)]

35

Page 52

Vanilla RNNs struggle with long-range dependencies

• “Vanishing gradients” occur in deep networks (like deep CNNs and RNNs)

• Long-term dependencies are very common – so for training to work, we need a mechanism to ensure gradients don’t vanish

On Tuesday, September 15, 2015, about 7:03 a.m. local time, a 47-passenger 2009 International school bus, operated by the Houston Independent School District (HISD) and occupied by a 44-year-old female driver and four HISD students, crashed into …

[Figure: the same unrolled RNN (inputs → hidden states via U, hidden → hidden via W, hidden → predictions ŷ0 … ŷ4 via V, with losses E0 … E4). The gradient of a late loss such as E3 must flow back through many repeated applications of W, so it can vanish. (Britz 2015)]

36