Natural Language Processing
Info 159/259, Lecture 7: Language models 2 (Feb 11, 2020)
David Bamman, UC Berkeley
Office hours
• TAs will hold extra office hours for homework 2 tomorrow 2/12 11am-noon, 107 South Hall
My favorite food is …
• … is the food I love you too
• is the best food I’ve ever had
• is the worst I can get
• is the best food places in the area
• is the best beer
• is in my coffee
• is here and there are a lot better options
• is a good friend
Language Model
• Vocabulary 𝒱 is a finite set of discrete symbols (e.g., words, characters); V = | 𝒱 |
• 𝒱+ is the infinite set of sequences of symbols from 𝒱; each sequence ends with STOP
• x ∈ 𝒱+
Language modeling is the task of estimating P(w)
Language Model
Language Model
• arms man and I sing
• Arms, and the man I sing [Dryden]
• I sing of arms and a man [arma virumque cano]
• When we have choices to make about different ways to say something, a language model gives us a formal way of operationalizing their fluency (in terms of how likely they are to exist in the language)
Markov assumption

bigram model (first-order Markov):

$$\prod_{i=1}^{n} P(w_i \mid w_{i-1}) \cdot P(\text{STOP} \mid w_n)$$

trigram model (second-order Markov):

$$\prod_{i=1}^{n} P(w_i \mid w_{i-2}, w_{i-1}) \cdot P(\text{STOP} \mid w_{n-1}, w_n)$$
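To make the bigram case concrete, here is a minimal sketch (not from the slides) of scoring a sequence under a maximum-likelihood bigram model; the toy counts are invented for illustration.

```python
from collections import defaultdict

# Toy bigram counts; in practice these would be estimated from a corpus.
bigram_counts = defaultdict(int, {("START", "the"): 12, ("the", "dog"): 10,
                                  ("the", "cat"): 7, ("dog", "barked"): 5,
                                  ("barked", "STOP"): 4})
context_counts = defaultdict(int, {"START": 15, "the": 20, "dog": 8, "barked": 5})

def bigram_prob(word, prev):
    # Maximum-likelihood estimate of P(word | prev); 0 if the context is unseen.
    if context_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / context_counts[prev]

def sentence_prob(words):
    # prod_i P(w_i | w_{i-1}) * P(STOP | w_n), with START as the initial context.
    prob, prev = 1.0, "START"
    for w in words + ["STOP"]:
        prob *= bigram_prob(w, prev)
        prev = w
    return prob

print(sentence_prob(["the", "dog", "barked"]))  # 0.2 with these toy counts
```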
Classification
𝓧 = set of all documents 𝒴 = {english, mandarin, greek, …}
A mapping h from input data x (drawn from instance space 𝓧) to a label (or labels) y from some enumerable output space 𝒴
x = a single document y = ancient greek
Classification
𝒴 = {the, of, a, dog, iphone, …}
A mapping h from input data x (drawn from instance space 𝓧) to a label (or labels) y from some enumerable output space 𝒴
x = (context) y = word
Logistic regression
output space: Y = {0, 1}

$$P(y = 1 \mid x, \beta) = \frac{1}{1 + \exp\left(-\sum_{i=1}^{F} x_i \beta_i\right)}$$
x = feature vector

| Feature | Value |
|---|---|
| the | 0 |
| and | 0 |
| bravest | 0 |
| love | 0 |
| loved | 0 |
| genius | 0 |
| not | 0 |
| fruit | 1 |
| BIAS | 1 |
β = coefficients

| Feature | β |
|---|---|
| the | 0.01 |
| and | 0.03 |
| bravest | 1.4 |
| love | 3.1 |
| loved | 1.2 |
| genius | 0.5 |
| not | -3.0 |
| fruit | -0.8 |
| BIAS | -0.1 |
$$P(Y = 1 \mid X = x; \beta) = \frac{\exp(x^{\top}\beta)}{1 + \exp(x^{\top}\beta)} = \frac{\exp(x^{\top}\beta)}{\exp(x^{\top}0) + \exp(x^{\top}\beta)}$$
Logistic regression
output space: Y = {1, …, K}

$$P(Y = y \mid X = x; \beta) = \frac{\exp(x^{\top}\beta_y)}{\sum_{y' \in Y} \exp(x^{\top}\beta_{y'})}$$
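A small numeric sketch of this softmax computation (the feature values and coefficients below are invented for illustration):

```python
import numpy as np

def softmax_prob(x, betas):
    # x: feature vector (F,); betas: coefficient matrix (K, F), one row per class y.
    # Returns P(Y = y | X = x; beta) for every class y.
    scores = betas @ x                            # x^T beta_y for each class
    exp_scores = np.exp(scores - scores.max())    # subtract max for numerical stability
    return exp_scores / exp_scores.sum()

x = np.array([0.0, 0.0, 1.0, 1.0])                # e.g., "fruit" and BIAS active
betas = np.array([[0.59, -0.76, 0.93, -1.92],
                  [0.03, 0.24, -0.70, 0.94]])     # two classes, made-up values
print(softmax_prob(x, betas))                     # two probabilities summing to 1
```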
x = feature vector

| Feature | Value |
|---|---|
| the | 0 |
| and | 0 |
| bravest | 0 |
| love | 0 |
| loved | 0 |
| genius | 0 |
| not | 0 |
| fruit | 1 |
| BIAS | 1 |
β = coefficients

| Feature | β1 | β2 | β3 | β4 | β5 |
|---|---|---|---|---|---|
| the | 1.33 | -0.80 | -0.54 | 0.87 | 0 |
| and | 1.21 | -1.73 | -1.57 | -0.13 | 0 |
| bravest | 0.96 | -0.05 | 0.24 | 0.81 | 0 |
| love | 1.49 | 0.53 | 1.01 | 0.64 | 0 |
| loved | -0.52 | -0.02 | 2.21 | -2.53 | 0 |
| genius | 0.98 | 0.77 | 1.53 | -0.95 | 0 |
| not | -0.96 | 2.14 | -0.71 | 0.43 | 0 |
| fruit | 0.59 | -0.76 | 0.93 | 0.03 | 0 |
| BIAS | -1.92 | -0.70 | 0.94 | -0.63 | 0 |
Language Model

• We can use multiclass logistic regression for language modeling by treating the vocabulary as the output space: Y = 𝒱
Unigram LM
| Feature | βthe | βof | βa | βdog | βiphone |
|---|---|---|---|---|---|
| BIAS | -1.92 | -0.70 | 0.94 | -0.63 | 0 |
• A unigram language model here would have just one feature: a bias term.
Bigram LM

x = feature vector

| Feature | Value |
|---|---|
| wi-1=the | 1 |
| wi-1=and | 0 |
| wi-1=bravest | 0 |
| wi-1=love | 0 |
| wi-1=loved | 0 |
| wi-1=dog | 0 |
| wi-1=not | 0 |
| wi-1=fruit | 0 |
| BIAS | 1 |

β = coefficients

| Feature | βthe | βof | βa | βdog | βiphone |
|---|---|---|---|---|---|
| wi-1=the | -0.78 | -0.80 | -0.54 | 0.87 | 0 |
| wi-1=and | 1.21 | -1.73 | -1.57 | -0.13 | 0 |
| wi-1=bravest | 0.96 | -0.05 | 0.24 | 0.81 | 0 |
| wi-1=love | 1.49 | 0.53 | 1.01 | 0.64 | 0 |
| wi-1=loved | -0.52 | -0.02 | 2.21 | -2.53 | 0 |
| wi-1=dog | 0.98 | 0.77 | 1.53 | -0.95 | 0 |
| wi-1=not | -0.96 | 2.14 | -0.71 | 0.43 | 0 |
| wi-1=fruit | 0.59 | -0.76 | 0.93 | 0.03 | 0 |
| BIAS | -1.92 | -0.70 | 0.94 | -0.63 | 0 |
$$P(w_i = \text{dog} \mid w_{i-1} = \text{the})$$
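As an illustration of how this probability falls out of the tables above, here is a hedged sketch using the wi-1=the and BIAS rows of the coefficient table (the only active features in this example):

```python
import numpy as np

# Columns correspond to candidate next words; the values are the wi-1=the and
# BIAS rows of the coefficient table above.
words = ["the", "of", "a", "dog", "iphone"]
beta_prev_the = np.array([-0.78, -0.80, -0.54, 0.87, 0.0])
beta_bias     = np.array([-1.92, -0.70, 0.94, -0.63, 0.0])

scores = beta_prev_the + beta_bias            # x^T beta_y with two features equal to 1
probs = np.exp(scores) / np.exp(scores).sum() # softmax over the candidate words
print(dict(zip(words, probs.round(3))))       # includes P(w_i = dog | w_{i-1} = the)
```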
Trigram LM
| Feature | Value |
|---|---|
| wi-2=the ^ wi-1=the | 0 |
| wi-2=and ^ wi-1=the | 1 |
| wi-2=bravest ^ wi-1=the | 0 |
| wi-2=love ^ wi-1=the | 0 |
| wi-2=loved ^ wi-1=the | 0 |
| wi-2=genius ^ wi-1=the | 0 |
| wi-2=not ^ wi-1=the | 0 |
| wi-2=fruit ^ wi-1=the | 0 |
| BIAS | 1 |

$$P(w_i = \text{dog} \mid w_{i-2} = \text{and}, w_{i-1} = \text{the})$$
Smoothing
second-order features:

| Feature | Value |
|---|---|
| wi-2=the ^ wi-1=the | 0 |
| wi-2=and ^ wi-1=the | 1 |
| wi-2=bravest ^ wi-1=the | 0 |
| wi-2=love ^ wi-1=the | 0 |

first-order features:

| Feature | Value |
|---|---|
| wi-1=the | 1 |
| wi-1=and | 0 |
| wi-1=bravest | 0 |
| wi-1=love | 0 |
| BIAS | 1 |

$$P(w_i = \text{dog} \mid w_{i-2} = \text{and}, w_{i-1} = \text{the})$$
L2 regularization
• We can do this by changing the function we’re trying to optimize, adding a penalty for coefficients β that are large in magnitude
• This is equivalent to saying that each β element is drawn from a Normal distribution centered on 0.
• η controls how much of a penalty to pay for coefficients that are far from 0 (optimize on development data)
$$\ell(\beta) = \underbrace{\sum_{i=1}^{N} \log P(y_i \mid x_i, \beta)}_{\text{we want this to be high}} \;-\; \eta \underbrace{\sum_{j=1}^{F} \beta_j^2}_{\text{but we want this to be small}}$$
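A minimal sketch (assuming binary labels and NumPy; not code from the slides) of evaluating this regularized objective:

```python
import numpy as np

def l2_regularized_log_likelihood(X, y, beta, eta):
    # X: (N, F) feature matrix; y: (N,) labels in {0, 1}; beta: (F,) coefficients.
    z = X @ beta
    # log P(y_i | x_i, beta): -log(1 + exp(-z)) for y=1, -log(1 + exp(z)) for y=0
    log_p = y * -np.log1p(np.exp(-z)) + (1 - y) * -np.log1p(np.exp(z))
    # High likelihood, but pay eta * sum(beta^2) for large coefficients.
    return log_p.sum() - eta * np.sum(beta ** 2)
```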
L1 regularization
• L1 regularization encourages coefficients to be exactly 0.
• η again controls how much of a penalty to pay for coefficients that are far from 0 (optimize on development data)
$$\ell(\beta) = \underbrace{\sum_{i=1}^{N} \log P(y_i \mid x_i, \beta)}_{\text{we want this to be high}} \;-\; \eta \underbrace{\sum_{j=1}^{F} |\beta_j|}_{\text{but we want this to be small}}$$
Richer representations
• Log-linear models give us the flexibility of encoding richer representations of the context we are conditioning on.
• We can reason about any observations from the entire history and not just the local context.
“JACKSONVILLE, Fla. — Stressed and exhausted families across the Southeast were assessing the damage from Hurricane Irma on Tuesday, even as flooding from the storm continued to plague some areas, like Jacksonville, and the worst of its wallop was being revealed in others, like the Florida Keys.
Officials in Florida, Georgia and South Carolina tried to prepare residents for the hardships of recovery from the ______________”
https://www.nytimes.com/2017/09/12/us/irma-jacksonville-florida.html
“WASHINGTON — Senior Justice Department officials intervened to overrule front-line prosecutors and will recommend a more lenient sentencing for Roger J. Stone Jr., convicted last year of impeding investigators in a bid to protect his longtime friend President _________”
https://www.nytimes.com/2020/02/11/us/politics/roger-stone-sentencing.html
| feature classes | example |
|---|---|
| ngrams (wi-1, wi-2:wi-1, wi-3:wi-1) | wi-2=“friend”, wi=“donald” |
| gappy ngrams | w1=“stone” and w2=“donald” |
| spelling, capitalization | wi-1 is capitalized and wi is capitalized |
| class/gazetteer membership | wi-1 in list of names and wi in list of names |
“WASHINGTON — Senior Justice Department officials intervened to overrule front-line prosecutors and will recommend a more lenient sentencing for Roger J. Stone Jr., convicted last year of impeding investigators in a bid to protect his longtime friend President _________”
Classes
Goldberg 2017
| bigram | count |
|---|---|
| black car | 100 |
| blue car | 37 |
| red car | 0 |

| bigram | count |
|---|---|
| <COLOR> car | 137 |
Tradeoffs

• Richer representations = more parameters, higher likelihood of overfitting

• Much slower to train than estimating the parameters of a generative model

$$P(Y = y \mid X = x; \beta) = \frac{\exp(x^{\top}\beta_y)}{\sum_{y' \in Y} \exp(x^{\top}\beta_{y'})}$$
Neural LM
x = [v(w1); . . . ; v(wk)]
Simple feed-forward multilayer perceptron (e.g., one hidden layer)
input x = vector concatenation of a conditioning context of fixed size k
Bengio et al. 2003
x = [v(w1); . . . ; v(wk)]
[Figure: each one-hot encoded word (w1 = tried, w2 = to, w3 = prepare, w4 = residents) is mapped to a low-dimensional distributed representation v(w1), …, v(w4); these vectors are concatenated to form x.]
Bengio et al. 2003
$$x = [v(w_1); \ldots; v(w_k)]$$

$$h = g(xW^1 + b^1) \qquad W^1 \in \mathbb{R}^{kD \times H},\; b^1 \in \mathbb{R}^{H}$$

$$y = \text{softmax}(hW^2 + b^2) \qquad W^2 \in \mathbb{R}^{H \times V},\; b^2 \in \mathbb{R}^{V}$$
Bengio et al. 2003
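A minimal sketch of this feed-forward computation, with invented dimensions k, D, H, V and random parameters standing in for trained ones:

```python
import numpy as np

k, D, H, V = 4, 50, 100, 10000
rng = np.random.default_rng(0)

E  = rng.normal(size=(V, D))                  # embedding table: v(w) = E[w]
W1 = rng.normal(size=(k * D, H)); b1 = np.zeros(H)
W2 = rng.normal(size=(H, V));     b2 = np.zeros(V)

def next_word_distribution(context_ids):
    # context_ids: k word ids; x = [v(w1); ...; v(wk)]
    x = np.concatenate([E[w] for w in context_ids])
    h = np.tanh(x @ W1 + b1)                      # h = g(xW1 + b1)
    scores = h @ W2 + b2
    scores -= scores.max()                        # numerical stability
    return np.exp(scores) / np.exp(scores).sum()  # y = softmax(hW2 + b2)

p = next_word_distribution([11, 42, 7, 1999])
print(p.shape, p.sum())                           # (10000,) 1.0
```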
Softmax
$$P(Y = y \mid X = x; \beta) = \frac{\exp(x^{\top}\beta_y)}{\sum_{y' \in Y} \exp(x^{\top}\beta_{y'})}$$
Neural LM

conditioning context: “tried to prepare residents for the hardships of recovery from the” → predict y
Recurrent neural network
• RNNs allow arbitrarily sized conditioning contexts; they condition on the entire sequence history.
Recurrent neural network
Goldberg 2017
• Each time step has two inputs:
• xi (the observation at time step i); one-hot vector, feature vector or distributed representation.
• si-1 (the output of the previous state); base case: s0 = 0 vector
Recurrent neural network
Recurrent neural network

$$s_i = R(x_i, s_{i-1})$$
$$y_i = O(s_i)$$

• R computes the output state as a function of the current input and previous state

• O computes the output as a function of the current output state
Continuous bag of words
$$s_i = R(x_i, s_{i-1}) = s_{i-1} + x_i$$
$$y_i = O(s_i) = s_i$$

Each state is the sum of all previous inputs; the output is the identity.
“Simple” RNN
$$s_i = R(x_i, s_{i-1}) = g(s_{i-1}W^s + x_iW^x + b)$$
$$y_i = O(s_i) = s_i$$

$$W^s \in \mathbb{R}^{H \times H}, \quad W^x \in \mathbb{R}^{D \times H}, \quad b \in \mathbb{R}^{H}$$

Different weight matrices W transform the previous state and the current input before combining them; g = tanh or ReLU.
Elman 1990, Mikolov 2012
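A sketch of one step of this simple RNN with g = tanh; the dimensions and random parameters are invented for illustration:

```python
import numpy as np

H, D = 64, 50
rng = np.random.default_rng(1)
Ws = rng.normal(size=(H, H)) * 0.1
Wx = rng.normal(size=(D, H)) * 0.1
b  = np.zeros(H)

def rnn_step(x_i, s_prev):
    # s_i = g(s_{i-1} W^s + x_i W^x + b), with g = tanh
    return np.tanh(s_prev @ Ws + x_i @ Wx + b)

s = np.zeros(H)                       # base case: s_0 = 0 vector
for x_i in rng.normal(size=(5, D)):   # five toy input vectors
    s = rnn_step(x_i, s)
```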
RNN LM
Elman 1990, Mikolov 2012
• The output state si is an H-dimensional real vector; we can transform that into a probability distribution over the vocabulary by passing it through an additional linear transformation followed by a softmax:

$$y_i = O(s_i) = \text{softmax}(s_iW^o + b^o)$$
![Page 42: Natural Language Processingpeople.ischool.berkeley.edu/~dbamman/nlp20_slides/7_LM_2.pdfloved 0 genius 0 not 0 fruit 1 BIAS 1 x = feature vector 11 Feature β ... wi-1=loved 0 wi-1=dog](https://reader034.vdocuments.site/reader034/viewer/2022050501/5f93c15f6bc792425b382fb8/html5/thumbnails/42.jpg)
Training RNNs
• Given this definition of an RNN:

$$s_i = R(x_i, s_{i-1}) = g(s_{i-1}W^s + x_iW^x + b)$$
$$y_i = O(s_i) = \text{softmax}(s_iW^o + b^o)$$

• We have five sets of parameters to learn: $W^s, W^x, W^o, b, b^o$
Training RNNs

we tried to prepare residents
• At each time step, we make a prediction and incur a loss; we know the true y (the word we see in that position)
we tried to prepare residents
• Training here is standard backpropagation, taking the derivative of the loss we incur at step t with respect to the parameters we want to update
$$\frac{\partial L(\hat{y}_1)}{\partial W^s}, \quad \frac{\partial L(\hat{y}_2)}{\partial W^s}, \quad \frac{\partial L(\hat{y}_3)}{\partial W^s}, \quad \frac{\partial L(\hat{y}_4)}{\partial W^s}, \quad \frac{\partial L(\hat{y}_5)}{\partial W^s}$$
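A hedged sketch of the per-timestep loss being summed here; in practice a framework such as PyTorch would compute the gradients with respect to W^s and the other parameters by backpropagation:

```python
import numpy as np

def sequence_loss(predicted_distributions, true_word_ids):
    # predicted_distributions: list of (V,) arrays, one softmax output y_i per time step
    # true_word_ids: the observed next word at each position
    # Total loss = sum over time steps of -log P(true word); gradients of this loss
    # with respect to W^s, W^x, W^o, b, b^o are what backpropagation computes.
    return sum(-np.log(dist[w])
               for dist, w in zip(predicted_distributions, true_word_ids))
```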
• As we sample, the words we generate form the new context we condition on
Generation

| context1 | context2 | generated word |
|---|---|---|
| START | START | The |
| START | The | dog |
| The | dog | walked |
| dog | walked | in |
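A minimal sketch of this sampling loop; next_word_distribution and vocab are hypothetical placeholders for a trained model’s predictive distribution and vocabulary:

```python
import numpy as np

def generate(next_word_distribution, vocab, max_len=20):
    # Start with two START symbols as context; sample until STOP or max_len.
    context = ["START", "START"]
    output = []
    rng = np.random.default_rng()
    for _ in range(max_len):
        probs = next_word_distribution(context[-2], context[-1])  # P(w | context1, context2)
        word = rng.choice(vocab, p=probs)
        if word == "STOP":
            break
        output.append(word)
        context.append(word)   # the sampled word becomes part of the new context
    return " ".join(output)
```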
![Page 46: Natural Language Processingpeople.ischool.berkeley.edu/~dbamman/nlp20_slides/7_LM_2.pdfloved 0 genius 0 not 0 fruit 1 BIAS 1 x = feature vector 11 Feature β ... wi-1=loved 0 wi-1=dog](https://reader034.vdocuments.site/reader034/viewer/2022050501/5f93c15f6bc792425b382fb8/html5/thumbnails/46.jpg)
Generation
Conditioned generation

• In a basic RNN, the input at each timestep is a representation of the word at that position:

$$s_i = R(x_i, s_{i-1}) = g(s_{i-1}W^s + x_iW^x + b)$$

• But we can also condition on any arbitrary context (topic, author, date, metadata, dialect, etc.) by concatenating it to the word input:

$$s_i = R(x_i, s_{i-1}) = g(s_{i-1}W^s + [x_i; c]W^x + b)$$
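A sketch of the conditioned update, where c is a fixed context vector concatenated onto each word input (the function name and shapes are illustrative, not from the slides):

```python
import numpy as np

def conditioned_rnn_step(x_i, s_prev, c, Ws, Wx, b):
    # s_i = g(s_{i-1} W^s + [x_i; c] W^x + b); W^x now has (D + len(c)) input rows.
    xc = np.concatenate([x_i, c])
    return np.tanh(s_prev @ Ws + xc @ Wx + b)
```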
Conditioned generation

• What information could you condition on in order to predict the next word? Example feature classes: location, addressee.
we tried to prepare residents
Each state i encodes information seen until time i and its structure is optimized to predict the next word
Character LM

• Vocabulary 𝒱 is a finite set of discrete characters
• When the output space is small, you’re putting a lot of the burden on the structure of the model
• Encode long-range dependencies (suffixes depend on prefixes, word boundaries etc.)
Character LM
http://karpathy.github.io/2015/05/21/rnn-effectiveness/
Drawbacks

• Very expensive to train (especially for a large vocabulary, though tricks exist; cf. hierarchical softmax)
• Backpropagation through long histories leads to vanishing gradients (cf. LSTMs in a few weeks).
• But they consistently have some of the strongest performances in perplexity evaluations.
Mikolov 2010
Melis, Dyer and Blunsom 2017
Distributed representations
• Some of the greatest power in neural network approaches is in the representation of words (and contexts) as low-dimensional vectors
• We’ll talk much more about that on Thursday.
[Figure: a one-hot vector mapped to a low-dimensional dense vector, e.g., [0.7, 1.3, -4.5].]
You should feel comfortable:
• Calculating the probability of a sentence given a trained model

• Estimating an n-gram (e.g., trigram) language model

• Evaluating perplexity on held-out data (see the sketch below)

• Sampling a sentence from a trained model
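For reference, a minimal sketch of computing perplexity from per-token log probabilities on held-out data:

```python
import numpy as np

def perplexity(log_probs):
    # log_probs: log P(w_i | context) for every token in the held-out data,
    # including STOP; perplexity = exp(-average log probability).
    return float(np.exp(-np.mean(log_probs)))

# e.g., perplexity(np.log([0.2, 0.1, 0.05, 0.4]))
```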
Tools
• SRILM: http://www.speech.sri.com/projects/srilm/

• KenLM: https://kheafield.com/code/kenlm/

• Berkeley LM: https://code.google.com/archive/p/berkeleylm/