Lecture 3: Language Model Smoothing
Kai-Wei Chang
CS @ University of Virginia
Course webpage: http://kwchang.net/teaching/NLP16
This lecture
Zipf’s law
Dealing with unseen words/n-grams
Add-one smoothing
Linear smoothing
Good-Turing smoothing
Absolute discounting
Kneser-Ney smoothing
Recap: Bigram language model
Training corpus:
<S> I am Sam </S>
<S> I am legend </S>
<S> Sam I am </S>

Let P(<S>) = 1
P(I | <S>) = 2/3    P(am | I) = 1
P(Sam | am) = 1/3   P(</S> | Sam) = 1/2
P(<S> I am Sam </S>) = 1 × 2/3 × 1 × 1/3 × 1/2
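To make this recap concrete, here is a minimal sketch (mine, not from the slides) that builds the bigram counts from the toy corpus above and recomputes the sentence probability:

```python
# Bigram MLE on the toy corpus: P(w | prev) = c(prev, w) / c(prev).
from collections import Counter

corpus = ["<S> I am Sam </S>", "<S> I am legend </S>", "<S> Sam I am </S>"]

unigrams, bigrams = Counter(), Counter()
for sent in corpus:
    tokens = sent.split()
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))

def p_mle(w, prev):
    return bigrams[(prev, w)] / unigrams[prev]

# P(<S> I am Sam </S>) = P(I|<S>) * P(am|I) * P(Sam|am) * P(</S>|Sam)
tokens = "<S> I am Sam </S>".split()
prob = 1.0
for prev, w in zip(tokens, tokens[1:]):
    prob *= p_mle(w, prev)
print(prob)  # 2/3 * 1 * 1/3 * 1/2 = 0.111...
```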
More examples: Berkeley Restaurant Project sentences
can you tell me about any good cantonese restaurants close by
mid priced thai food is what i’m looking for
tell me about chez panisse
can you give me a listing of the kinds of food that are available
i’m looking for a good place to eat breakfast
when is caffe venezia open during the day
Raw bigram counts
Out of 9222 sentences
Raw bigram probabilities
Normalize by unigrams:
Result:
Zeros
Training set:
… denied the allegations
… denied the reports
… denied the claims
… denied the request

Test set:
… denied the offer
… denied the loan

P("offer" | denied the) = 0
Smoothing
There are more principled smoothing methods, too. We’ll look next at log-linear models, which are a good and popular general technique.
But the traditional methods are easy to implement, run fast, and will give you intuitions about what you want from a smoothing method.
This dark art is why NLP is taught in the engineering school.
Credit: the following slides are adapted from Jason Eisner’s NLP course
What is smoothing?
[Figure: distributions estimated from samples of size 20, 200, 2000, and 2,000,000]
ML 101: bias-variance tradeoff
Different samples of size 20 vary considerably, though on average, they give the correct bell curve!
[Figure: four different samples of size 20]
Overfitting
The perils of overfitting
N-grams only work well for word prediction if the test corpus looks like the training corpus. In real life, it often doesn't. We need to train robust models that generalize!
One kind of generalization: Zeros! Things that don't ever occur in the training set but occur in the test set.
The intuition of smoothing
When we have sparse statistics:
Steal probability mass to generalize better
P(w | denied the): 3 allegations, 2 reports, 1 claims, 1 request (7 total)
Smoothed P(w | denied the): 2.5 allegations, 1.5 reports, 0.5 claims, 0.5 request, 2 other (7 total)
[Figure: bar charts of the two distributions over words such as allegations, reports, claims, request, attack, man, outcome, …]
Credit: Dan Klein
Add-one estimation (Laplace smoothing)
Pretend we saw each word one more time than we did. Just add one to all the counts!

MLE estimate:
P_MLE(w_i | w_{i-1}) = c(w_{i-1}, w_i) / c(w_{i-1})

Add-1 estimate (V = vocabulary size):
P_Add-1(w_i | w_{i-1}) = (c(w_{i-1}, w_i) + 1) / (c(w_{i-1}) + V)
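As a minimal sketch (my own illustration, reusing the toy corpus from the recap), the two estimates differ only in the +1 and +V terms:

```python
from collections import Counter

corpus = ["<S> I am Sam </S>", "<S> I am legend </S>", "<S> Sam I am </S>"]
unigrams, bigrams = Counter(), Counter()
for sent in corpus:
    tokens = sent.split()
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))
V = len(unigrams)  # vocabulary size (6 types here)

def p_add1(w, prev):
    # P_Add-1(w | prev) = (c(prev, w) + 1) / (c(prev) + V)
    return (bigrams[(prev, w)] + 1) / (unigrams[prev] + V)

print(p_add1("Sam", "am"))    # seen bigram: (1 + 1) / (3 + 6)
print(p_add1("legend", "I"))  # unseen bigram: 1 / (3 + 6) instead of 0
```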
Add-One Smoothing
event      obs. count  obs. prob   add-1 count  add-1 prob
xya        100         100/300     101          101/326
xyb        0           0/300       1            1/326
xyc        0           0/300       1            1/326
xyd        200         200/300     201          201/326
xye        0           0/300       1            1/326
…
xyz        0           0/300       1            1/326
Total xy   300         300/300     326          326/326
Berkeley Restaurant Corpus: Laplace-smoothed bigram counts
Laplace-smoothed bigrams (V = 1446 in the Berkeley Restaurant Project corpus)
Reconstituted counts: compare with the raw bigram counts
Problem with Add-One Smoothing
We've been considering just 26 letter types …

event      obs. count  obs. prob  add-1 count  add-1 prob
xya        1           1/3        2            2/29
xyb        0           0/3        1            1/29
xyc        0           0/3        1            1/29
xyd        2           2/3        3            3/29
xye        0           0/3        1            1/29
…
xyz        0           0/3        1            1/29
Total xy   3           3/3        29           29/29
Problem with Add-One Smoothing
Suppose we’re considering 20000 word types
event            obs. count  obs. prob  add-1 count  add-1 prob
see the abacus   1           1/3        2            2/20003
see the abbot    0           0/3        1            1/20003
see the abduct   0           0/3        1            1/20003
see the above    2           2/3        3            3/20003
see the Abram    0           0/3        1            1/20003
…
see the zygote   0           0/3        1            1/20003
Total            3           3/3        20003        20003/20003
“Novel event” = event never happened in training data.
Here: 19998 novel events, with total estimated probability 19998/20003.
Add-one smoothing thinks we are extremely likely to see novel events, rather than words we’ve seen.
Infinite Dictionary?
In fact, aren’t there infinitely many possible word types?
event          obs. count  obs. prob  add-1 count  add-1 prob
see the aaaaa  1           1/3        2            2/(∞+3)
see the aaaab  0           0/3        1            1/(∞+3)
see the aaaac  0           0/3        1            1/(∞+3)
see the aaaad  2           2/3        3            3/(∞+3)
see the aaaae  0           0/3        1            1/(∞+3)
…
see the zzzzz  0           0/3        1            1/(∞+3)
Total          3           3/3        ∞+3          (∞+3)/(∞+3)
Add-Lambda Smoothing
A large dictionary makes novel events too probable.
To fix: instead of adding 1 to all counts, add λ = 0.01?
This gives much less probability to novel events.
But how to pick the best value for λ? That is, how much should we smooth?
Add-0.001 Smoothing
Doesn’t smooth much (estimated distribution has high variance)
event      obs. count  obs. prob  add-0.001 count  smoothed prob
xya        1           1/3        1.001            0.331
xyb        0           0/3        0.001            0.0003
xyc        0           0/3        0.001            0.0003
xyd        2           2/3        2.001            0.661
xye        0           0/3        0.001            0.0003
…
xyz        0           0/3        0.001            0.0003
Total xy   3           3/3        3.026            1
Add-1000 Smoothing
Smooths too much (estimated distribution has high bias)
event      obs. count  obs. prob  add-1000 count  smoothed prob
xya        1           1/3        1001            1/26
xyb        0           0/3        1000            1/26
xyc        0           0/3        1000            1/26
xyd        2           2/3        1002            1/26
xye        0           0/3        1000            1/26
…
xyz        0           0/3        1000            1/26
Total xy   3           3/3        26003           1
Add-Lambda Smoothing
But how to pick the best value for λ? That is, how much should we smooth?
E.g., how much probability to "set aside" for novel events?
Depends on how likely novel events really are!
Which may depend on the type of text, size of training corpus, …
Can we figure it out from the data?
We'll look at a few methods for deciding how much to smooth.
Setting Smoothing Parameters
How to pick the best value for λ (in add-λ smoothing)?
Try many values & report the one that gets best results?
How to measure whether a particular λ gets good results?
Is it fair to measure that on test data (for setting λ)?
Moral: selective reporting on test data can make a method look artificially good. So it is unethical.
Rule: test data cannot influence system development. No peeking! Use it only to evaluate the final system(s). Report all results on it.
Feynman's Advice: "The first principle is that you must not fool yourself, and you are the easiest person to fool."
Instead, hold out 20% of the training data as a development set:
Pick the λ that gets best results on this 20% … when we collect counts from the other 80% and smooth them using add-λ smoothing.
Now use that λ to get smoothed counts from all 100% of the training data, and report results of that final model on test data.
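Here is a minimal sketch of that procedure (my own illustration; the tiny corpus, the 80/20 split, and the λ grid are placeholders): train add-λ bigram counts on 80% of the data and keep the λ with the lowest cross-entropy on the held-out 20%.

```python
import math
from collections import Counter

def train(sents):
    unigrams, bigrams = Counter(), Counter()
    for toks in sents:
        unigrams.update(toks)
        bigrams.update(zip(toks, toks[1:]))
    return unigrams, bigrams

def cross_entropy(sents, unigrams, bigrams, lam, V):
    logp, n = 0.0, 0
    for toks in sents:
        for prev, w in zip(toks, toks[1:]):
            p = (bigrams[(prev, w)] + lam) / (unigrams[prev] + lam * V)
            logp += math.log2(p)
            n += 1
    return -logp / n  # bits per bigram; lower is better

sents = [s.split() for s in [
    "<S> I am Sam </S>", "<S> I am legend </S>",
    "<S> Sam I am </S>", "<S> Sam I am Sam </S>",
    "<S> I am </S>",  # last sentence held out as dev
]]
train_part, dev_part = sents[:4], sents[4:]
unigrams, bigrams = train(train_part)
V = len(unigrams)

best = min((0.001, 0.01, 0.1, 1.0, 10.0),
           key=lambda lam: cross_entropy(dev_part, unigrams, bigrams, lam, V))
print("best lambda:", best)
```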
Large or small Dev set?
Here we held out 20% of our training set for development.
Would like to use more than 20% for dev: 20% is not enough to reliably assess λ.
Would like to use more than 80% for training: the best λ for smoothing 80% ≠ the best λ for smoothing 100%.
Cross-Validation
Try 5 training/dev splits as below. Pick the λ that gets best average performance.
[Figure: 5 folds, each holding out a different 20% as dev, plus a separate test set]
Tests on all 100% of the data as dev, so we can more reliably assess λ.
Still picks a λ that's good at smoothing the 80% size, not 100%.
But now we can grow that 80% without trouble.
N-fold Cross-Validation ("Leave One Out")
Test each sentence with a smoothed model trained on the other N-1 sentences.
Still tests on all 100% of the data as dev, so we can reliably assess λ.
Trains on nearly 100% of the data ((N-1)/N) to measure whether λ is good for smoothing that much data.
Surprisingly fast: why? It is usually easy to retrain by adding/subtracting one sentence's counts.
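A minimal sketch (my own) of why leave-one-out is fast: keep global count tables and, for each held-out sentence, subtract its counts, score it, then add them back in O(sentence length).

```python
import math
from collections import Counter

sentences = [s.split() for s in [
    "<S> I am Sam </S>", "<S> I am legend </S>", "<S> Sam I am </S>",
]]

unigrams, bigrams = Counter(), Counter()
for toks in sentences:
    unigrams.update(toks)
    bigrams.update(zip(toks, toks[1:]))
V = len(unigrams)

def loo_log_likelihood(lam):
    """Total add-lambda log-likelihood, leaving one sentence out at a time."""
    total = 0.0
    for toks in sentences:
        unigrams.subtract(toks)                 # remove this sentence ...
        bigrams.subtract(zip(toks, toks[1:]))
        for prev, w in zip(toks, toks[1:]):
            total += math.log((bigrams[(prev, w)] + lam) /
                              (unigrams[prev] + lam * V))
        unigrams.update(toks)                   # ... and put it back
        bigrams.update(zip(toks, toks[1:]))
    return total

for lam in (0.01, 0.1, 1.0):
    print(lam, loo_log_likelihood(lam))
```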
More Ideas for Smoothing
Remember, we're trying to decide how much to smooth.
E.g., how much probability to "set aside" for novel events?
Depends on how likely novel events really are.
Which may depend on the type of text, size of training corpus, …
Can we figure this out from the data?
How likely are novel events?
Counts in two samples of 300 tokens each (vocabulary of 20000 types):

word       sample 1  sample 2
a          150       0
both       18        0
candy      0         1
donuts     0         2
every      50        0
???        0         0
grapes     0         1
his        38        0
ice cream  0         7
…

The mystery word ??? has count 0/300 in both samples: which zero would you expect is really rare?
The mystery word is farina: 0 in both samples. The frequent words in sample 1 (a, both, every, his) are determiners: a closed class. The words seen in sample 2 (candy, donuts, grapes, ice cream) are (food) nouns: an open class. So a novel noun like farina is much more plausible in sample 2 than a novel determiner would be in sample 1.
Zipf's law
The r-th most common word w_r has P(w_r) ∝ 1/r.
A few words are very frequent.
http://wugology.com/zipfs-law/
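As a quick empirical check, here is a minimal sketch (not from the slides) that compares observed rank-frequency counts with the 1/r prediction; the file path is a placeholder for any large text:

```python
from collections import Counter

with open("corpus.txt") as f:          # placeholder path
    words = f.read().lower().split()

counts = sorted(Counter(words).values(), reverse=True)
C = counts[0]  # if count(r) ∝ 1/r, then count(r) ≈ C / r with C = count at rank 1
for r in (1, 2, 5, 10, 100, 1000):
    if r <= len(counts):
        print(f"rank {r}: observed {counts[r - 1]}, Zipf predicts ~{C // r}")
```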
How common are novel events?
Let N_r be the number of word types that occurred exactly r times. Example types from each class:
N0 (novel): abaringe, Babatinde, cabaret …
N1 (singletons): aback, Babbitt, cabanas …
N2 (doubletons): Abbas, babel, Cabot …
N3: abdominal, Bach, cabana …
N4: aberrant, backlog, cabinets …
N5: abdomen, bachelor, Caesar …
At the far end of the histogram, bars of height 1 mark single extremely frequent types such as "the" and EOS.
Counts from the Brown Corpus (N ≈ 1 million tokens):
N0: novel words (in dictionary but never occur)
N1: singletons (occur once)
N2: doubletons (occur twice)
N2 = # doubleton types; N2 × 2 = # doubleton tokens
Σ_r N_r = total # types = T (purple bars)
Σ_r (N_r × r) = total # tokens = N (all bars)
[Figure: histogram of N_r × r for r = 0, 1, 2, …, 6]
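These quantities are easy to compute; a minimal sketch (my own, on a made-up token list):

```python
from collections import Counter

tokens = "the cat sat on the mat the cat ran".split()  # toy data
type_counts = Counter(tokens)             # c(w) for each type
Nr = Counter(type_counts.values())        # N_r = # types occurring r times

T = sum(Nr.values())                      # total # types
N = sum(r * n for r, n in Nr.items())     # total # tokens
print(Nr)    # Counter({1: 4, 2: 1, 3: 1}): 4 singletons, 1 doubleton, 1 tripleton
print(T, N)  # T = 6 types, N = 9 tokens
```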
Witten-Bell Smoothing Idea
44
0 5000 10000 15000 20000 25000
0
1
2
3
4
5
6
N0 *
N1 *
N2 *
N3 *
N4 *
N5 *
N6 *
novel words
singletons
doubletons
If T/N is large, we’ve seen lots of novel types in the past, so we expect lots more.
unsmoothed smoothed
2/N 2/(N+T)
1/N 1/(N+T)
0/N (T/(N+T)) / N0
CS6501 Natural Language Processing
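A minimal sketch (my own) of this unigram version of the Witten-Bell idea; the 20000-word dictionary used to set N0 is an assumption:

```python
from collections import Counter

tokens = "the cat sat on the mat the cat ran".split()
counts = Counter(tokens)
T = len(counts)             # observed types
N = sum(counts.values())    # tokens
N0 = 20000 - T              # assumed # of novel (never-seen) types

def p_wb(w):
    if counts[w] > 0:
        return counts[w] / (N + T)   # seen: discounted count
    return (T / (N + T)) / N0        # novel: equal share of reserved mass

print(p_wb("the"), p_wb("dog"))
# sanity check: the distribution sums to 1
print(sum(p_wb(w) for w in counts) + N0 * p_wb("dog"))
```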
Good-Turing Smoothing Idea
Partition the type vocabulary into classes (novel, singletons, doubletons, …) by how often they occurred in training data.
Use the observed total probability of class r+1 to estimate the total probability of class r:
r/N = (N_r × r/N) / N_r is replaced by (N_{r+1} × (r+1)/N) / N_r

unsmoothed → smoothed
2/N → (N3 × 3/N) / N2   (observed p(tripleton), e.g. 1.2%, estimates total p(doubleton))
1/N → (N2 × 2/N) / N1   (observed p(doubleton), e.g. 1.5%, estimates total p(singleton))
0/N → (N1 × 1/N) / N0   (observed p(singleton), e.g. 2%, estimates total p(novel))
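A minimal sketch (my own) of this per-class estimate; the 20000-word dictionary used to set N0 is again an assumption:

```python
from collections import Counter

tokens = "the cat sat on the mat the cat ran a dog".split()
counts = Counter(tokens)
Nr = Counter(counts.values())   # count-of-counts
N = sum(counts.values())
Nr[0] = 20000 - len(counts)     # assumed # of novel types

def p_gt(w):
    r = counts[w]
    # observed total prob of class r+1, shared by the N_r types of class r
    return (Nr[r + 1] * (r + 1) / N) / Nr[r]

print(p_gt("sat"))     # singleton: estimated from the doubletons
print(p_gt("wombat"))  # novel: estimated from the singletons
print(p_gt("the"))     # 0.0! nothing above the top class; see the problem below
```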
Witten-Bell vs. Good-Turing
Estimate p(z | xy) using just the tokens we've seen in context xy. Might be a small set …
Witten-Bell intuition: if those tokens were distributed over many different types, then novel types are likely in future.
Good-Turing intuition: if many of those tokens came from singleton types, then novel types are likely in future.
Very nice idea (but a bit tricky in practice).
See the paper "Good-Turing smoothing without tears".
Good-Turing Reweighting
Problem 1: what about "the"? (k = 4417) If N_4418 = 0, Good-Turing gives "the" zero probability.
For small k, N_k > N_{k+1}.
For large k, too jumpy.
Problem 2: we don't observe events for every k.
[Diagram: each N_k is re-estimated from the class above it: N1 → N0, N2 → N1, N3 → N2, …, N3511 → N3510, N4417 → N4416]
Simple Good-Turing [Gale and Sampson]: replace the empirical N_k with a best-fit regression (e.g., a power law) once the counts of counts get unreliable:
f(k) = a + b log(k); find a, b such that f(k) ≈ N_k
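A minimal sketch (my own) of the regression idea: fit a line in log-log space to made-up count-of-counts and read off a smoothed N_k even where the empirical value is jumpy or missing:

```python
import math

Nk = {1: 120, 2: 40, 3: 24, 4: 15, 5: 10, 7: 5, 10: 2, 50: 1}  # toy, jumpy data

xs = [math.log(k) for k in Nk]
ys = [math.log(n) for n in Nk.values()]
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
    sum((x - mx) ** 2 for x in xs)
a = my - b * mx                  # fit: log N_k ≈ a + b log k (a power law)

def smoothed_Nk(k):
    return math.exp(a + b * math.log(k))

for k in (1, 2, 6, 50, 4417):    # works even where N_k was never observed
    print(k, smoothed_Nk(k))
```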
Backoff and interpolation
Why are we treating all novel events as the same?
Backoff and interpolation
p(zombie | see the) vs. p(baby | see the)
What if count(see the zygote) = count(see the baby) = 0?
"baby" beats "zygote" as a unigram.
"the baby" beats "the zygote" as a bigram.
"see the baby" beats "see the zygote"? (Even if both have the same count, such as 0.)
Backoff and interpolation
Condition on less context for contexts you haven't learned much about.
Backoff: use trigram if you have good evidence, otherwise bigram, otherwise unigram.
Interpolation: mixture of unigram, bigram, trigram (etc.) models.
Smoothing + backoff
Basic smoothing (e.g., add-λ, Good-Turing, Witten-Bell):
Holds out some probability mass for novel events.
Divides it up evenly among the novel events.
Backoff smoothing:
Holds out the same amount of probability mass for novel events.
But divides it up unevenly, in proportion to the backoff prob.
Smoothing + backoff
When defining p(z | xy), the backoff prob for novel z is p(z | y).
Even if z was never observed after xy, it may have been observed after y (why?).
Then p(z | y) can be estimated without further backoff. If not, we back off further to p(z).
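A minimal sketch of this backoff chain; note it uses the unnormalized "stupid backoff" scoring of Brants et al. rather than a properly normalized scheme, and the toy corpus and penalty alpha = 0.4 are placeholders:

```python
from collections import Counter

tokens = "the cat sat on the mat and the cat ran".split()
tri = Counter(zip(tokens, tokens[1:], tokens[2:]))
bi = Counter(zip(tokens, tokens[1:]))
uni = Counter(tokens)
N = len(tokens)

def score(z, x, y, alpha=0.4):
    if tri[(x, y, z)] > 0:
        return tri[(x, y, z)] / bi[(x, y)]   # good trigram evidence
    if bi[(y, z)] > 0:
        return alpha * bi[(y, z)] / uni[y]   # back off to p(z | y)
    return alpha * alpha * uni[z] / N        # back off further to p(z)

print(score("sat", "the", "cat"))  # seen trigram: direct estimate
print(score("mat", "and", "the"))  # unseen trigram: falls back to p(mat | the)
```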
Linear Interpolation
Jelinek-Mercer smoothing
Use a weighted average of backed-off naïve models:
p_average(z | xy) = λ3 p(z | xy) + λ2 p(z | y) + λ1 p(z)
where λ3 + λ2 + λ1 = 1 and all λ ≥ 0
The weights can depend on the context xy.
E.g., we can make λ3 large if count(xy) is high.
Learn the weights on held-out data w/ jackknifing.
Use a different λ3 when xy is observed 1 time, 2 times, 5 times, …
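A minimal sketch (my own) with fixed weights; the refinement on the slide would let the λ's depend on count(xy), e.g. via binning:

```python
from collections import Counter

tokens = "the cat sat on the mat and the cat ran".split()
tri = Counter(zip(tokens, tokens[1:], tokens[2:]))
bi = Counter(zip(tokens, tokens[1:]))
uni = Counter(tokens)
N = len(tokens)
L3, L2, L1 = 0.6, 0.3, 0.1   # lambda3 + lambda2 + lambda1 = 1, all >= 0

def p_interp(z, x, y):
    p3 = tri[(x, y, z)] / bi[(x, y)] if bi[(x, y)] else 0.0
    p2 = bi[(y, z)] / uni[y] if uni[y] else 0.0
    p1 = uni[z] / N
    return L3 * p3 + L2 * p2 + L1 * p1

print(p_interp("sat", "the", "cat"))  # strong trigram evidence dominates
print(p_interp("mat", "and", "the"))  # unseen trigram: bigram + unigram carry it
```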
Absolute Discounting Interpolation
Save ourselves some time and just subtract 0.75 (or some d)!

P_AbsoluteDiscounting(w_i | w_{i-1}) = [c(w_{i-1}, w_i) - d] / c(w_{i-1}) + λ(w_{i-1}) P(w_i)

The first term is the discounted bigram estimate; λ(w_{i-1}) is the interpolation weight; P(w_i) is the unigram.
But should we really just use the regular unigram P(w)?
Absolute discounting: just subtract a little from each count
How much to subtract (the discount d)?
Church and Gale (1991)'s clever idea:
Divide data into training and held-out sets.
For each bigram in the training set, see the actual count in the held-out set!

Bigram count in training   Bigram count in held-out set
0                          0.0000270
1                          0.448
2                          1.25
3                          2.24
4                          3.23
5                          4.21
6                          5.23

It sure looks like c* = (c - .75)
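A minimal sketch (my own) of absolute-discounting interpolation as in the formula above, with d = 0.75 and λ(w_{i-1}) chosen so each conditional distribution sums to one; the toy corpus is a placeholder:

```python
from collections import Counter

tokens = "the cat sat on the mat and the cat ran".split()
bi = Counter(zip(tokens, tokens[1:]))
uni = Counter(tokens)
N = len(tokens)
D = 0.75

def p_abs(w, prev):
    c_prev = uni[prev]
    if c_prev == 0:
        return uni[w] / N                            # unknown context
    discounted = max(bi[(prev, w)] - D, 0) / c_prev  # (c - d) / c(prev)
    n_types = sum(1 for (a, _) in bi if a == prev)   # distinct continuations
    lam = D * n_types / c_prev                       # mass freed by discounting
    return discounted + lam * (uni[w] / N)           # interpolate with unigram

print(p_abs("cat", "the"))  # seen bigram, lightly discounted
print(p_abs("ran", "the"))  # unseen bigram: only the interpolated unigram share
# normalization check over the observed vocabulary:
print(sum(p_abs(w, "the") for w in uni))  # -> 1.0
```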