Lecture 3: Language Model Smoothing
Kai-Wei Chang
CS @ University of Virginia
Course webpage: http://kwchang.net/teaching/NLP16
This lecture
Zipf’s law
Dealing with unseen words/n-grams
Add-one smoothing
Linear smoothing
Good-Turing smoothing
Absolute discounting
Kneser-Ney smoothing
Recap: Bigram language model
Training corpus:
<S> I am Sam </S>
<S> I am legend </S>
<S> Sam I am </S>

Let P(<S>) = 1
P(I | <S>) = 2/3    P(am | I) = 1
P(Sam | am) = 1/3   P(</S> | Sam) = 1/2
P(<S> I am Sam </S>) = 1 × 2/3 × 1 × 1/3 × 1/2
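To make this recap concrete, here is a minimal sketch (mine, not from the slides) that builds the bigram counts from the toy corpus above and recomputes the sentence probability:

```python
# Bigram MLE on the toy corpus: P(w | prev) = c(prev, w) / c(prev).
from collections import Counter

corpus = ["<S> I am Sam </S>", "<S> I am legend </S>", "<S> Sam I am </S>"]

unigrams, bigrams = Counter(), Counter()
for sent in corpus:
    tokens = sent.split()
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))

def p_mle(w, prev):
    return bigrams[(prev, w)] / unigrams[prev]

# P(<S> I am Sam </S>) = P(I|<S>) * P(am|I) * P(Sam|am) * P(</S>|Sam)
tokens = "<S> I am Sam </S>".split()
prob = 1.0
for prev, w in zip(tokens, tokens[1:]):
    prob *= p_mle(w, prev)
print(prob)  # 2/3 * 1 * 1/3 * 1/2 = 0.111...
```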
More examples: Berkeley Restaurant Project sentences
can you tell me about any good cantonese restaurants close by
mid priced thai food is what i’m looking for
tell me about chez panisse
can you give me a listing of the kinds of food that are available
i’m looking for a good place to eat breakfast
when is caffe venezia open during the day
Raw bigram counts
Out of 9222 sentences
Raw bigram probabilities
Normalize by unigrams:
Result:
Zeros
Training set:
… denied the allegations
… denied the reports
… denied the claims
… denied the request

Test set:
… denied the offer
… denied the loan

P("offer" | denied the) = 0
Smoothing
There are more principled smoothing methods, too. We’ll look next at log-linear models, which are a good and popular general technique.
But the traditional methods are easy to implement, run fast, and will give you intuitions about what you want from a smoothing method.
This dark art is why NLP is taught in the engineering school.
Credit: the following slides are adapted from Jason Eisner’s NLP course
What is smoothing?
[Figure: distributions estimated from samples of size 20, 200, 2000, and 2,000,000]
ML 101: bias-variance tradeoff
Different samples of size 20 vary considerably, though on average, they give the correct bell curve!
[Figure: four different samples of size 20]
Overfitting
The perils of overfitting
N-grams only work well for word prediction if the test corpus looks like the training corpus. In real life, it often doesn't. We need to train robust models that generalize!
One kind of generalization: Zeros! Things that don't ever occur in the training set but occur in the test set.
The intuition of smoothing
When we have sparse statistics:
Steal probability mass to generalize better
P(w | denied the): 3 allegations, 2 reports, 1 claims, 1 request (7 total)
Smoothed P(w | denied the): 2.5 allegations, 1.5 reports, 0.5 claims, 0.5 request, 2 other (7 total)
[Figure: bar charts of the two distributions over words such as allegations, reports, claims, request, attack, man, outcome, …]
Credit: Dan Klein
Add-one estimation (Laplace smoothing)
Pretend we saw each word one more time than we did. Just add one to all the counts!

MLE estimate:
P_MLE(w_i | w_{i-1}) = c(w_{i-1}, w_i) / c(w_{i-1})

Add-1 estimate (V = vocabulary size):
P_Add-1(w_i | w_{i-1}) = (c(w_{i-1}, w_i) + 1) / (c(w_{i-1}) + V)
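As a minimal sketch (my own illustration, reusing the toy corpus from the recap), the two estimates differ only in the +1 and +V terms:

```python
from collections import Counter

corpus = ["<S> I am Sam </S>", "<S> I am legend </S>", "<S> Sam I am </S>"]
unigrams, bigrams = Counter(), Counter()
for sent in corpus:
    tokens = sent.split()
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))
V = len(unigrams)  # vocabulary size (6 types here)

def p_add1(w, prev):
    # P_Add-1(w | prev) = (c(prev, w) + 1) / (c(prev) + V)
    return (bigrams[(prev, w)] + 1) / (unigrams[prev] + V)

print(p_add1("Sam", "am"))    # seen bigram: (1 + 1) / (3 + 6)
print(p_add1("legend", "I"))  # unseen bigram: 1 / (3 + 6) instead of 0
```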
Add-One Smoothing
event      obs. count  obs. prob   add-1 count  add-1 prob
xya        100         100/300     101          101/326
xyb        0           0/300       1            1/326
xyc        0           0/300       1            1/326
xyd        200         200/300     201          201/326
xye        0           0/300       1            1/326
…
xyz        0           0/300       1            1/326
Total xy   300         300/300     326          326/326
Berkeley Restaurant Corpus: Laplace-smoothed bigram counts
Laplace-smoothed bigrams (V = 1446 in the Berkeley Restaurant Project corpus)
Reconstituted counts: compare with the raw bigram counts
Problem with Add-One Smoothing
We've been considering just 26 letter types …

event      obs. count  obs. prob  add-1 count  add-1 prob
xya        1           1/3        2            2/29
xyb        0           0/3        1            1/29
xyc        0           0/3        1            1/29
xyd        2           2/3        3            3/29
xye        0           0/3        1            1/29
…
xyz        0           0/3        1            1/29
Total xy   3           3/3        29           29/29
Problem with Add-One Smoothing
Suppose we’re considering 20000 word types
event            obs. count  obs. prob  add-1 count  add-1 prob
see the abacus   1           1/3        2            2/20003
see the abbot    0           0/3        1            1/20003
see the abduct   0           0/3        1            1/20003
see the above    2           2/3        3            3/20003
see the Abram    0           0/3        1            1/20003
…
see the zygote   0           0/3        1            1/20003
Total            3           3/3        20003        20003/20003
“Novel event” = event never happened in training data.
Here: 19998 novel events, with total estimated probability 19998/20003.
Add-one smoothing thinks we are extremely likely to see novel events, rather than words we’ve seen.
Infinite Dictionary?
In fact, aren’t there infinitely many possible word types?
event          obs. count  obs. prob  add-1 count  add-1 prob
see the aaaaa  1           1/3        2            2/(∞+3)
see the aaaab  0           0/3        1            1/(∞+3)
see the aaaac  0           0/3        1            1/(∞+3)
see the aaaad  2           2/3        3            3/(∞+3)
see the aaaae  0           0/3        1            1/(∞+3)
…
see the zzzzz  0           0/3        1            1/(∞+3)
Total          3           3/3        ∞+3          (∞+3)/(∞+3)
Add-Lambda Smoothing
A large dictionary makes novel events too probable.
To fix: instead of adding 1 to all counts, add λ = 0.01?
This gives much less probability to novel events.
But how to pick the best value for λ? That is, how much should we smooth?
Add-0.001 Smoothing
Doesn’t smooth much (estimated distribution has high variance)
event      obs. count  obs. prob  add-0.001 count  smoothed prob
xya        1           1/3        1.001            0.331
xyb        0           0/3        0.001            0.0003
xyc        0           0/3        0.001            0.0003
xyd        2           2/3        2.001            0.661
xye        0           0/3        0.001            0.0003
…
xyz        0           0/3        0.001            0.0003
Total xy   3           3/3        3.026            1
Add-1000 Smoothing
Smooths too much (estimated distribution has high bias)
event      obs. count  obs. prob  add-1000 count  smoothed prob
xya        1           1/3        1001            1/26
xyb        0           0/3        1000            1/26
xyc        0           0/3        1000            1/26
xyd        2           2/3        1002            1/26
xye        0           0/3        1000            1/26
…
xyz        0           0/3        1000            1/26
Total xy   3           3/3        26003           1
Add-Lambda Smoothing
But how to pick the best value for λ? That is, how much should we smooth?
E.g., how much probability to "set aside" for novel events?
Depends on how likely novel events really are!
Which may depend on the type of text, size of training corpus, …
Can we figure it out from the data?
We'll look at a few methods for deciding how much to smooth.
Setting Smoothing Parameters
How to pick the best value for λ (in add-λ smoothing)?
Try many values & report the one that gets best results?
How to measure whether a particular λ gets good results?
Is it fair to measure that on test data (for setting λ)?
Moral: selective reporting on test data can make a method look artificially good. So it is unethical.
Rule: test data cannot influence system development. No peeking! Use it only to evaluate the final system(s). Report all results on it.
Feynman's Advice: "The first principle is that you must not fool yourself, and you are the easiest person to fool."
Instead, hold out 20% of the training data as a development set:
Pick the λ that gets best results on this 20% … when we collect counts from the other 80% and smooth them using add-λ smoothing.
Now use that λ to get smoothed counts from all 100% of the training data, and report results of that final model on test data.
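Here is a minimal sketch of that procedure (my own illustration; the tiny corpus, the 80/20 split, and the λ grid are placeholders): train add-λ bigram counts on 80% of the data and keep the λ with the lowest cross-entropy on the held-out 20%.

```python
import math
from collections import Counter

def train(sents):
    unigrams, bigrams = Counter(), Counter()
    for toks in sents:
        unigrams.update(toks)
        bigrams.update(zip(toks, toks[1:]))
    return unigrams, bigrams

def cross_entropy(sents, unigrams, bigrams, lam, V):
    logp, n = 0.0, 0
    for toks in sents:
        for prev, w in zip(toks, toks[1:]):
            p = (bigrams[(prev, w)] + lam) / (unigrams[prev] + lam * V)
            logp += math.log2(p)
            n += 1
    return -logp / n  # bits per bigram; lower is better

sents = [s.split() for s in [
    "<S> I am Sam </S>", "<S> I am legend </S>",
    "<S> Sam I am </S>", "<S> Sam I am Sam </S>",
    "<S> I am </S>",  # last sentence held out as dev
]]
train_part, dev_part = sents[:4], sents[4:]
unigrams, bigrams = train(train_part)
V = len(unigrams)

best = min((0.001, 0.01, 0.1, 1.0, 10.0),
           key=lambda lam: cross_entropy(dev_part, unigrams, bigrams, lam, V))
print("best lambda:", best)
```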
Large or small Dev set?
Here we held out 20% of our training set for development.
Would like to use more than 20% for dev: 20% is not enough to reliably assess λ.
Would like to use more than 80% for training: the best λ for smoothing 80% ≠ the best λ for smoothing 100%.
Cross-Validation
Try 5 training/dev splits as below. Pick the λ that gets best average performance.
[Figure: 5 folds, each holding out a different 20% as dev, plus a separate test set]
Tests on all 100% of the data as dev, so we can more reliably assess λ.
Still picks a λ that's good at smoothing the 80% size, not 100%.
But now we can grow that 80% without trouble.
N-fold Cross-Validation ("Leave One Out")
Test each sentence with a smoothed model trained on the other N-1 sentences.
Still tests on all 100% of the data as dev, so we can reliably assess λ.
Trains on nearly 100% of the data ((N-1)/N) to measure whether λ is good for smoothing that much data.
Surprisingly fast: why? It is usually easy to retrain by adding/subtracting one sentence's counts.
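A minimal sketch (my own) of why leave-one-out is fast: keep global count tables and, for each held-out sentence, subtract its counts, score it, then add them back in O(sentence length).

```python
import math
from collections import Counter

sentences = [s.split() for s in [
    "<S> I am Sam </S>", "<S> I am legend </S>", "<S> Sam I am </S>",
]]

unigrams, bigrams = Counter(), Counter()
for toks in sentences:
    unigrams.update(toks)
    bigrams.update(zip(toks, toks[1:]))
V = len(unigrams)

def loo_log_likelihood(lam):
    """Total add-lambda log-likelihood, leaving one sentence out at a time."""
    total = 0.0
    for toks in sentences:
        unigrams.subtract(toks)                 # remove this sentence ...
        bigrams.subtract(zip(toks, toks[1:]))
        for prev, w in zip(toks, toks[1:]):
            total += math.log((bigrams[(prev, w)] + lam) /
                              (unigrams[prev] + lam * V))
        unigrams.update(toks)                   # ... and put it back
        bigrams.update(zip(toks, toks[1:]))
    return total

for lam in (0.01, 0.1, 1.0):
    print(lam, loo_log_likelihood(lam))
```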
More Ideas for Smoothing
Remember, we're trying to decide how much to smooth.
E.g., how much probability to "set aside" for novel events?
Depends on how likely novel events really are.
Which may depend on the type of text, size of training corpus, …
Can we figure this out from the data?
How likely are novel events?
Counts in two samples of 300 tokens each (vocabulary of 20000 types):

word       sample 1  sample 2
a          150       0
both       18        0
candy      0         1
donuts     0         2
every      50        0
???        0         0
grapes     0         1
his        38        0
ice cream  0         7
…

The mystery word ??? has count 0/300 in both samples: which zero would you expect is really rare?
The mystery word is farina: 0 in both samples. The frequent words in sample 1 (a, both, every, his) are determiners: a closed class. The words seen in sample 2 (candy, donuts, grapes, ice cream) are (food) nouns: an open class. So a novel noun like farina is much more plausible in sample 2 than a novel determiner would be in sample 1.
Zipf's law
The r-th most common word w_r has P(w_r) ∝ 1/r.
A few words are very frequent.
http://wugology.com/zipfs-law/
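As a quick empirical check, here is a minimal sketch (not from the slides) that compares observed rank-frequency counts with the 1/r prediction; the file path is a placeholder for any large text:

```python
from collections import Counter

with open("corpus.txt") as f:          # placeholder path
    words = f.read().lower().split()

counts = sorted(Counter(words).values(), reverse=True)
C = counts[0]  # if count(r) ∝ 1/r, then count(r) ≈ C / r with C = count at rank 1
for r in (1, 2, 5, 10, 100, 1000):
    if r <= len(counts):
        print(f"rank {r}: observed {counts[r - 1]}, Zipf predicts ~{C // r}")
```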
How common are novel events?
Let N_r be the number of word types that occurred exactly r times. Example types from each class:
N0 (novel): abaringe, Babatinde, cabaret …
N1 (singletons): aback, Babbitt, cabanas …
N2 (doubletons): Abbas, babel, Cabot …
N3: abdominal, Bach, cabana …
N4: aberrant, backlog, cabinets …
N5: abdomen, bachelor, Caesar …
At the far end of the histogram, bars of height 1 mark single extremely frequent types such as "the" and EOS.
Counts from the Brown Corpus (N ≈ 1 million tokens):
N0: novel words (in dictionary but never occur)
N1: singletons (occur once)
N2: doubletons (occur twice)
N2 = # doubleton types; N2 × 2 = # doubleton tokens
Σ_r N_r = total # types = T (purple bars)
Σ_r (N_r × r) = total # tokens = N (all bars)
[Figure: histogram of N_r × r for r = 0, 1, 2, …, 6]
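These quantities are easy to compute; a minimal sketch (my own, on a made-up token list):

```python
from collections import Counter

tokens = "the cat sat on the mat the cat ran".split()  # toy data
type_counts = Counter(tokens)             # c(w) for each type
Nr = Counter(type_counts.values())        # N_r = # types occurring r times

T = sum(Nr.values())                      # total # types
N = sum(r * n for r, n in Nr.items())     # total # tokens
print(Nr)    # Counter({1: 4, 2: 1, 3: 1}): 4 singletons, 1 doubleton, 1 tripleton
print(T, N)  # T = 6 types, N = 9 tokens
```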
Witten-Bell Smoothing Idea
44
0 5000 10000 15000 20000 25000
0
1
2
3
4
5
6
N0 *
N1 *
N2 *
N3 *
N4 *
N5 *
N6 *
novel words
singletons
doubletons
If T/N is large, we’ve seen lots of novel types in the past, so we expect lots more.
unsmoothed smoothed
2/N 2/(N+T)
1/N 1/(N+T)
0/N (T/(N+T)) / N0
CS6501 Natural Language Processing
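A minimal sketch (my own) of this unigram version of the Witten-Bell idea; the 20000-word dictionary used to set N0 is an assumption:

```python
from collections import Counter

tokens = "the cat sat on the mat the cat ran".split()
counts = Counter(tokens)
T = len(counts)             # observed types
N = sum(counts.values())    # tokens
N0 = 20000 - T              # assumed # of novel (never-seen) types

def p_wb(w):
    if counts[w] > 0:
        return counts[w] / (N + T)   # seen: discounted count
    return (T / (N + T)) / N0        # novel: equal share of reserved mass

print(p_wb("the"), p_wb("dog"))
# sanity check: the distribution sums to 1
print(sum(p_wb(w) for w in counts) + N0 * p_wb("dog"))
```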
Good-Turing Smoothing Idea
Partition the type vocabulary into classes (novel, singletons, doubletons, …) by how often they occurred in training data.
Use the observed total probability of class r+1 to estimate the total probability of class r:
r/N = (N_r × r/N) / N_r is replaced by (N_{r+1} × (r+1)/N) / N_r

unsmoothed → smoothed
2/N → (N3 × 3/N) / N2   (observed p(tripleton), e.g. 1.2%, estimates total p(doubleton))
1/N → (N2 × 2/N) / N1   (observed p(doubleton), e.g. 1.5%, estimates total p(singleton))
0/N → (N1 × 1/N) / N0   (observed p(singleton), e.g. 2%, estimates total p(novel))
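A minimal sketch (my own) of this per-class estimate; the 20000-word dictionary used to set N0 is again an assumption:

```python
from collections import Counter

tokens = "the cat sat on the mat the cat ran a dog".split()
counts = Counter(tokens)
Nr = Counter(counts.values())   # count-of-counts
N = sum(counts.values())
Nr[0] = 20000 - len(counts)     # assumed # of novel types

def p_gt(w):
    r = counts[w]
    # observed total prob of class r+1, shared by the N_r types of class r
    return (Nr[r + 1] * (r + 1) / N) / Nr[r]

print(p_gt("sat"))     # singleton: estimated from the doubletons
print(p_gt("wombat"))  # novel: estimated from the singletons
print(p_gt("the"))     # 0.0! nothing above the top class; see the problem below
```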
Witten-Bell vs. Good-Turing
Estimate p(z | xy) using just the tokens we've seen in context xy. Might be a small set …
Witten-Bell intuition: if those tokens were distributed over many different types, then novel types are likely in future.
Good-Turing intuition: if many of those tokens came from singleton types, then novel types are likely in future.
Very nice idea (but a bit tricky in practice).
See the paper "Good-Turing smoothing without tears".
Good-Turing Reweighting
Problem 1: what about "the"? (k = 4417) If N_4418 = 0, Good-Turing gives "the" zero probability.
For small k, N_k > N_{k+1}.
For large k, too jumpy.
Problem 2: we don't observe events for every k.
[Diagram: each N_k is re-estimated from the class above it: N1 → N0, N2 → N1, N3 → N2, …, N3511 → N3510, N4417 → N4416]
Simple Good-Turing [Gale and Sampson]: replace the empirical N_k with a best-fit regression (e.g., a power law) once the counts of counts get unreliable:
f(k) = a + b log(k); find a, b such that f(k) ≈ N_k
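A minimal sketch (my own) of the regression idea: fit a line in log-log space to made-up count-of-counts and read off a smoothed N_k even where the empirical value is jumpy or missing:

```python
import math

Nk = {1: 120, 2: 40, 3: 24, 4: 15, 5: 10, 7: 5, 10: 2, 50: 1}  # toy, jumpy data

xs = [math.log(k) for k in Nk]
ys = [math.log(n) for n in Nk.values()]
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
    sum((x - mx) ** 2 for x in xs)
a = my - b * mx                  # fit: log N_k ≈ a + b log k (a power law)

def smoothed_Nk(k):
    return math.exp(a + b * math.log(k))

for k in (1, 2, 6, 50, 4417):    # works even where N_k was never observed
    print(k, smoothed_Nk(k))
```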
Backoff and interpolation
Why are we treating all novel events as the same?
Backoff and interpolation
p(zombie | see the) vs. p(baby | see the)
What if count(see the zygote) = count(see the baby) = 0?
"baby" beats "zygote" as a unigram.
"the baby" beats "the zygote" as a bigram.
"see the baby" beats "see the zygote"? (Even if both have the same count, such as 0.)
Backoff and interpolation
Condition on less context for contexts you haven't learned much about.
Backoff: use trigram if you have good evidence, otherwise bigram, otherwise unigram.
Interpolation: mixture of unigram, bigram, trigram (etc.) models.
Smoothing + backoff
Basic smoothing (e.g., add-λ, Good-Turing, Witten-Bell):
Holds out some probability mass for novel events.
Divides it up evenly among the novel events.
Backoff smoothing:
Holds out the same amount of probability mass for novel events.
But divides it up unevenly, in proportion to the backoff prob.
Smoothing + backoff
When defining p(z | xy), the backoff prob for novel z is p(z | y).
Even if z was never observed after xy, it may have been observed after y (why?).
Then p(z | y) can be estimated without further backoff. If not, we back off further to p(z).
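A minimal sketch of this backoff chain; note it uses the unnormalized "stupid backoff" scoring of Brants et al. rather than a properly normalized scheme, and the toy corpus and penalty alpha = 0.4 are placeholders:

```python
from collections import Counter

tokens = "the cat sat on the mat and the cat ran".split()
tri = Counter(zip(tokens, tokens[1:], tokens[2:]))
bi = Counter(zip(tokens, tokens[1:]))
uni = Counter(tokens)
N = len(tokens)

def score(z, x, y, alpha=0.4):
    if tri[(x, y, z)] > 0:
        return tri[(x, y, z)] / bi[(x, y)]   # good trigram evidence
    if bi[(y, z)] > 0:
        return alpha * bi[(y, z)] / uni[y]   # back off to p(z | y)
    return alpha * alpha * uni[z] / N        # back off further to p(z)

print(score("sat", "the", "cat"))  # seen trigram: direct estimate
print(score("mat", "and", "the"))  # unseen trigram: falls back to p(mat | the)
```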
Linear Interpolation
Jelinek-Mercer smoothing
Use a weighted average of backed-off naïve models:
p_average(z | xy) = λ3 p(z | xy) + λ2 p(z | y) + λ1 p(z)
where λ3 + λ2 + λ1 = 1 and all λ ≥ 0
The weights can depend on the context xy.
E.g., we can make λ3 large if count(xy) is high.
Learn the weights on held-out data w/ jackknifing.
Use a different λ3 when xy is observed 1 time, 2 times, 5 times, …
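A minimal sketch (my own) with fixed weights; the refinement on the slide would let the λ's depend on count(xy), e.g. via binning:

```python
from collections import Counter

tokens = "the cat sat on the mat and the cat ran".split()
tri = Counter(zip(tokens, tokens[1:], tokens[2:]))
bi = Counter(zip(tokens, tokens[1:]))
uni = Counter(tokens)
N = len(tokens)
L3, L2, L1 = 0.6, 0.3, 0.1   # lambda3 + lambda2 + lambda1 = 1, all >= 0

def p_interp(z, x, y):
    p3 = tri[(x, y, z)] / bi[(x, y)] if bi[(x, y)] else 0.0
    p2 = bi[(y, z)] / uni[y] if uni[y] else 0.0
    p1 = uni[z] / N
    return L3 * p3 + L2 * p2 + L1 * p1

print(p_interp("sat", "the", "cat"))  # strong trigram evidence dominates
print(p_interp("mat", "and", "the"))  # unseen trigram: bigram + unigram carry it
```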
Absolute Discounting Interpolation
Save ourselves some time and just subtract 0.75 (or some d)!

P_AbsoluteDiscounting(w_i | w_{i-1}) = [c(w_{i-1}, w_i) - d] / c(w_{i-1}) + λ(w_{i-1}) P(w_i)

The first term is the discounted bigram estimate; λ(w_{i-1}) is the interpolation weight; P(w_i) is the unigram.
But should we really just use the regular unigram P(w)?
Absolute discounting: just subtract a little from each count
How much to subtract (the discount d)?
Church and Gale (1991)'s clever idea:
Divide data into training and held-out sets.
For each bigram in the training set, see the actual count in the held-out set!

Bigram count in training   Bigram count in held-out set
0                          0.0000270
1                          0.448
2                          1.25
3                          2.24
4                          3.23
5                          4.21
6                          5.23

It sure looks like c* = (c - .75)
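A minimal sketch (my own) of absolute-discounting interpolation as in the formula above, with d = 0.75 and λ(w_{i-1}) chosen so each conditional distribution sums to one; the toy corpus is a placeholder:

```python
from collections import Counter

tokens = "the cat sat on the mat and the cat ran".split()
bi = Counter(zip(tokens, tokens[1:]))
uni = Counter(tokens)
N = len(tokens)
D = 0.75

def p_abs(w, prev):
    c_prev = uni[prev]
    if c_prev == 0:
        return uni[w] / N                            # unknown context
    discounted = max(bi[(prev, w)] - D, 0) / c_prev  # (c - d) / c(prev)
    n_types = sum(1 for (a, _) in bi if a == prev)   # distinct continuations
    lam = D * n_types / c_prev                       # mass freed by discounting
    return discounted + lam * (uni[w] / N)           # interpolate with unigram

print(p_abs("cat", "the"))  # seen bigram, lightly discounted
print(p_abs("ran", "the"))  # unseen bigram: only the interpolated unigram share
# normalization check over the observed vocabulary:
print(sum(p_abs(w, "the") for w in uni))  # -> 1.0
```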