
Page 1: Lecture 3: Language Model Smoothing

Kai-Wei Chang

CS @ University of Virginia

[email protected]

Course webpage: http://kwchang.net/teaching/NLP16

Page 2: This lecture

Zipf’s law

Dealing with unseen words/n-grams

Add-one smoothing

Linear smoothing

Good-Turing smoothing

Absolute discounting

Kneser-Ney smoothing

Page 3: Recap: Bigram language model

Training corpus:
<S> I am Sam </S>
<S> I am legend </S>
<S> Sam I am </S>

Let P(<S>) = 1
P(I | <S>) = 2/3        P(am | I) = 1
P(Sam | am) = 1/3       P(</S> | Sam) = 1/2
P(<S> I am Sam </S>) = 1 * 2/3 * 1 * 1/3 * 1/2
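To make the recap concrete, here is a minimal sketch (not from the slides; the function name is mine) of estimating these bigram probabilities from the three-sentence corpus and scoring the example sentence:

```python
from collections import defaultdict

def train_bigram_mle(sentences):
    """Count bigrams and unigram histories, then P(w2 | w1) = c(w1, w2) / c(w1)."""
    bigram, unigram = defaultdict(int), defaultdict(int)
    for s in sentences:
        tokens = ["<S>"] + s.split() + ["</S>"]
        for w1, w2 in zip(tokens, tokens[1:]):
            bigram[(w1, w2)] += 1
            unigram[w1] += 1
    return lambda w2, w1: bigram[(w1, w2)] / unigram[w1]

corpus = ["I am Sam", "I am legend", "Sam I am"]
p = train_bigram_mle(corpus)

# P(<S> I am Sam </S>) = P(I | <S>) * P(am | I) * P(Sam | am) * P(</S> | Sam)
prob = p("I", "<S>") * p("am", "I") * p("Sam", "am") * p("</S>", "Sam")
print(prob)  # 2/3 * 1 * 1/3 * 1/2 = 1/9, about 0.111
```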

Page 4: More examples: Berkeley Restaurant Project sentences

can you tell me about any good cantonese restaurants close by
mid priced thai food is what i'm looking for
tell me about chez panisse
can you give me a listing of the kinds of food that are available
i'm looking for a good place to eat breakfast
when is caffe venezia open during the day

Page 5: Raw bigram counts

Out of 9222 sentences

Page 6: Raw bigram probabilities

Normalize by unigrams:

Result:

Page 7: Zeros

Training set:
… denied the allegations
… denied the reports
… denied the claims
… denied the request

P("offer" | denied the) = 0

Test set:
… denied the offer
… denied the loan

Page 8: Smoothing

There are more principled smoothing methods, too. We’ll look next at log-linear models, which are a good and popular general technique.

But the traditional methods are easy to implement, run fast, and will give you intuitions about what you want from a smoothing method.

This dark art is why NLP is taught in the engineering school.

Credit: the following slides are adapted from Jason Eisner’s NLP course

Page 9: What is smoothing?

[Figure: relative-frequency estimates from samples of size 20, 200, 2000, and 2,000,000 tokens]

Page 10: ML 101: bias-variance tradeoff

Different samples of size 20 vary considerably, though on average they give the correct bell curve!

Page 11: Overfitting

Page 12: The perils of overfitting

N-grams only work well for word prediction if the test corpus looks like the training corpus.
In real life, it often doesn't.
We need to train robust models that generalize!
One kind of generalization: zeros! Things that don't ever occur in the training set, but occur in the test set.

Page 13: The intuition of smoothing

When we have sparse statistics:

Steal probability mass to generalize better

P(w | denied the): 3 allegations, 2 reports, 1 claims, 1 request (7 total)

P(w | denied the), smoothed: 2.5 allegations, 1.5 reports, 0.5 claims, 0.5 request, 2 other (7 total)

[Figure: bar charts of P(w | denied the) before and after smoothing, over allegations, reports, claims, request, attack, man, outcome. Credit: Dan Klein]

Page 14: Add-one estimation (Laplace smoothing)

Pretend we saw each word one more time than we did. Just add one to all the counts!

MLE estimate:
P_MLE(w_i | w_{i-1}) = c(w_{i-1}, w_i) / c(w_{i-1})

Add-1 estimate:
P_Add-1(w_i | w_{i-1}) = (c(w_{i-1}, w_i) + 1) / (c(w_{i-1}) + V)
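As a minimal sketch of the add-1 estimate above (a Counter-based helper of my own naming, not from the slides), with a toy check against the 26-letter-type table that appears a few slides later:

```python
from collections import Counter

def p_add1(w_prev, w, bigram_counts, unigram_counts, V):
    """P_Add-1(w | w_prev) = (c(w_prev, w) + 1) / (c(w_prev) + V)."""
    return (bigram_counts[(w_prev, w)] + 1) / (unigram_counts[w_prev] + V)

# Toy check: context "xy" followed by a letter, counts as on Page 20 (V = 26 letter types).
bigrams = Counter({("xy", "a"): 1, ("xy", "d"): 2})
unigrams = Counter({"xy": 3})
print(p_add1("xy", "a", bigrams, unigrams, V=26))  # (1 + 1) / (3 + 26) = 2/29
print(p_add1("xy", "b", bigrams, unigrams, V=26))  # (0 + 1) / (3 + 26) = 1/29
```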

Page 15: Add-One Smoothing

event      count   p_MLE     count+1   p_Add-1
xya        100     100/300   101       101/326
xyb        0       0/300     1         1/326
xyc        0       0/300     1         1/326
xyd        200     200/300   201       201/326
xye        0       0/300     1         1/326
…          …       …         …         …
xyz        0       0/300     1         1/326
Total xy   300     300/300   326       326/326

Page 16: Berkeley Restaurant Corpus: Laplace-smoothed bigram counts

Page 17: Laplace-smoothed bigrams

V=1446 in the Berkeley Restaurant Project corpus

Page 18: Reconstituted counts

Page 19: Compare with raw bigram counts

Page 20: Problem with Add-One Smoothing

We’ve been considering just 26 letter types …

event      count  p_MLE  count+1  p_Add-1
xya        1      1/3    2        2/29
xyb        0      0/3    1        1/29
xyc        0      0/3    1        1/29
xyd        2      2/3    3        3/29
xye        0      0/3    1        1/29
…          …      …      …        …
xyz        0      0/3    1        1/29
Total xy   3      3/3    29       29/29

Page 21: Problem with Add-One Smoothing

Suppose we're considering 20000 word types

event             count  p_MLE  count+1  p_Add-1
see the abacus    1      1/3    2        2/20003
see the abbot     0      0/3    1        1/20003
see the abduct    0      0/3    1        1/20003
see the above     2      2/3    3        3/20003
see the Abram     0      0/3    1        1/20003
…                 …      …      …        …
see the zygote    0      0/3    1        1/20003
Total             3      3/3    20003    20003/20003

Page 22: Problem with Add-One Smoothing (continued; same table as Page 21)

“Novel event” = event never happened in training data.

Here: 19998 novel events, with total estimated probability 19998/20003.

Add-one smoothing thinks we are extremely likely to see novel events, rather than words we’ve seen.

Page 23: Infinite Dictionary?

In fact, aren't there infinitely many possible word types?

event            count  p_MLE  count+1  p_Add-1
see the aaaaa    1      1/3    2        2/(∞+3)
see the aaaab    0      0/3    1        1/(∞+3)
see the aaaac    0      0/3    1        1/(∞+3)
see the aaaad    2      2/3    3        3/(∞+3)
see the aaaae    0      0/3    1        1/(∞+3)
…                …      …      …        …
see the zzzzz    0      0/3    1        1/(∞+3)
Total            3      3/3    ∞+3      (∞+3)/(∞+3)

Page 24: Add-Lambda Smoothing

A large dictionary makes novel events too probable.
To fix: instead of adding 1 to all counts, add λ = 0.01?
This gives much less probability to novel events.
But how to pick the best value for λ? That is, how much should we smooth?

Page 25: Add-0.001 Smoothing

Doesn’t smooth much (estimated distribution has high variance)

event      count  p_MLE  count+λ  p_Add-0.001
xya        1      1/3    1.001    0.331
xyb        0      0/3    0.001    0.0003
xyc        0      0/3    0.001    0.0003
xyd        2      2/3    2.001    0.661
xye        0      0/3    0.001    0.0003
…          …      …      …        …
xyz        0      0/3    0.001    0.0003
Total xy   3      3/3    3.026    1

Page 26: Add-1000 Smoothing

Smooths too much (estimated distribution has high bias)

event      count  p_MLE  count+λ  p_Add-1000
xya        1      1/3    1001     1/26
xyb        0      0/3    1000     1/26
xyc        0      0/3    1000     1/26
xyd        2      2/3    1002     1/26
xye        0      0/3    1000     1/26
…          …      …      …        …
xyz        0      0/3    1000     1/26
Total xy   3      3/3    26003    1

Page 27: Add-Lambda Smoothing

A large dictionary makes novel events too probable.
To fix: instead of adding 1 to all counts, add λ = 0.01?
This gives much less probability to novel events.
But how to pick the best value for λ? That is, how much should we smooth?

E.g., how much probability to “set aside” for novel events?

Depends on how likely novel events really are!

Which may depend on the type of text, size of training corpus, …

Can we figure it out from the data?

We’ll look at a few methods for deciding how much to smooth.

Page 28: Setting Smoothing Parameters

How to pick the best value for λ? (in add-λ smoothing)
Try many values & report the one that gets best results?
How to measure whether a particular λ gets good results?
Is it fair to measure that on test data (for setting λ)?
Moral: Selective reporting on test data can make a method look artificially good. So it is unethical.
Rule: Test data cannot influence system development. No peeking! Use it only to evaluate the final system(s). Report all results on it.

[Diagram: corpus split into Training and Test portions]

Page 29: Setting Smoothing Parameters (continued)

Feynman's Advice: “The first principle is that you must not fool yourself, and you are the easiest person to fool.”

Page 30: Setting Smoothing Parameters

How to pick the best value for λ? Try many values & report the one that gets best results?

[Diagram: training data split into an 80% training portion and a 20% dev portion, plus a separate test set]

Pick the λ that gets best results on this 20% … when we collect counts from this 80% and smooth them using add-λ smoothing. Now use that λ to get smoothed counts from all 100% … and report results of that final model on test data.
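As a concrete (hypothetical) version of this recipe, the sketch below grid-searches λ by dev-set log-likelihood of an add-λ bigram model; the λ grid and all function names are my own choices, not the slides':

```python
import math
from collections import Counter

def count_bigrams(sentences):
    big, uni = Counter(), Counter()
    for s in sentences:
        toks = ["<S>"] + s.split() + ["</S>"]
        for a, b in zip(toks, toks[1:]):
            big[(a, b)] += 1
            uni[a] += 1
    return big, uni

def add_lambda_prob(a, b, big, uni, V, lam):
    # P_add-lambda(b | a) = (c(a, b) + lam) / (c(a) + lam * V)
    return (big[(a, b)] + lam) / (uni[a] + lam * V)

def dev_log_likelihood(dev, big, uni, V, lam):
    ll = 0.0
    for s in dev:
        toks = ["<S>"] + s.split() + ["</S>"]
        for a, b in zip(toks, toks[1:]):
            ll += math.log(add_lambda_prob(a, b, big, uni, V, lam))
    return ll

def pick_lambda(train80, dev20, V, grid=(0.001, 0.01, 0.1, 1.0)):
    big, uni = count_bigrams(train80)   # counts from the 80%
    # After choosing lambda here, re-count on 100% of the training data and smooth with it.
    return max(grid, key=lambda lam: dev_log_likelihood(dev20, big, uni, V, lam))
```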

Page 31: Large or small Dev set?

Here we held out 20% of our training set (yellow) for development.
Would like to use > 20% yellow: 20% is not enough to reliably assess λ.
Would like to use > 80% blue: the best λ for smoothing 80% is not the best λ for smoothing 100%.

Page 32: Cross-Validation

Try 5 training/dev splits as below. Pick the λ that gets best average performance.
Tests on all 100% as yellow, so we can more reliably assess λ.
Still picks a λ that's good at smoothing the 80% size, not 100%.
But now we can grow that 80% without trouble.

[Diagram: five rotating 80% training / 20% dev splits, plus a held-out test set]

Page 33: N-fold Cross-Validation (“Leave One Out”)

Test each sentence with a smoothed model built from the other N-1 sentences.
Still tests on all 100% as yellow, so we can reliably assess λ.
Trains on nearly 100% blue data ((N-1)/N) to measure whether λ is good for smoothing that much data.

Page 34: N-fold Cross-Validation (“Leave One Out”)

Surprisingly fast: why?

Usually easy to retrain on blue by adding/subtracting 1 sentence’s counts
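Why is this fast? The smoothed model depends only on the count tables, so you can subtract one sentence's counts, score that sentence, and add the counts back, rather than re-counting from scratch. A rough sketch of leave-one-out scoring for an add-λ bigram model (the names and details are mine, not the slides'):

```python
import math
from collections import Counter

def bigram_pairs(sentence):
    toks = ["<S>"] + sentence.split() + ["</S>"]
    return list(zip(toks, toks[1:]))

def leave_one_out_loglik(sentences, V, lam):
    big, uni = Counter(), Counter()
    for s in sentences:                       # count once over the whole corpus
        for a, b in bigram_pairs(s):
            big[(a, b)] += 1
            uni[a] += 1
    total = 0.0
    for s in sentences:
        pairs = bigram_pairs(s)
        for a, b in pairs:                    # subtract this sentence's counts,
            big[(a, b)] -= 1
            uni[a] -= 1
        for a, b in pairs:                    # score it under add-lambda smoothing,
            total += math.log((big[(a, b)] + lam) / (uni[a] + lam * V))
        for a, b in pairs:                    # then restore the counts
            big[(a, b)] += 1
            uni[a] += 1
    return total
```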

Page 35: More Ideas for Smoothing

Remember, we're trying to decide how much to smooth.
E.g., how much probability to “set aside” for novel events?
Depends on how likely novel events really are.
Which may depend on the type of text, size of training corpus, …
Can we figure this out from the data?

Page 36: How likely are novel events?

(20000 word types; two samples of 300 tokens each)

word         sample 1   sample 2
a            150        0
both         18         0
candy        0          1
donuts       0          2
every        50         0
???          0          0
grapes       0          1
his          38         0
ice cream    0          7

0/300 versus 0/300: which zero would you expect is really rare?

Page 37: How likely are novel events?

(Same table as Page 36, with the mystery word revealed as “farina”.)

determiners: a closed class

Page 38: How likely are novel events?

(Same table as Page 36.)

(food) nouns: an open class

Page 39: Zipf's law

The r-th most common word w_r has P(w_r) ∝ 1/r

A few words are very frequent

http://wugology.com/zipfs-law/
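A quick, hand-rolled way to eyeball Zipf's law on any corpus (a sketch; the whitespace tokenizer and the file name are placeholders): sort types by frequency and check that count * rank stays roughly constant.

```python
from collections import Counter

def zipf_check(tokens, ranks=(1, 2, 5, 10, 100, 1000)):
    counts = [c for _, c in Counter(tokens).most_common()]   # frequencies, most common first
    for r in ranks:
        if r <= len(counts):
            # Under P(w_r) proportional to 1/r, count * rank is roughly constant across ranks.
            print(f"rank {r:>5}  count {counts[r - 1]:>8}  count*rank {counts[r - 1] * r}")

# zipf_check(open("corpus.txt").read().lower().split())   # hypothetical corpus file
```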

Page 40: (figure only)

Page 41: (figure only)

Page 42: How common are novel events?

[Figure: total tokens N_r · r contributed by word types seen r = 0, 1, 2, … times, with example types for each class (abaringe, aback, Abbas, abdominal, aberrant, abdomen, …); two individual types, “the” and the end-of-sentence marker EOS, tower over everything else (the two largest counts shown are 52108 and 69836).]

Page 43: How common are novel events?

[Figure: count-of-counts bars N_0*, N_1*, …, N_6* for the Brown Corpus]

Counts from the Brown Corpus (N ≈ 1 million tokens):
novel words (in dictionary but never occur)
singletons (occur once)
doubletons (occur twice)
N_2 = # doubleton types;  N_2 · 2 = # doubleton tokens
Σ_r N_r = total # types = T (purple bars)
Σ_r (N_r · r) = total # tokens = N (all bars)
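Computing these count-of-counts N_r (and T and N) from raw tokens is a one-liner with Counter; a small sketch with names of my own choosing:

```python
from collections import Counter

def count_of_counts(tokens, max_r=6):
    word_counts = Counter(tokens)             # c(w) for every observed type
    nr = Counter(word_counts.values())        # N_r = number of types seen exactly r times
    N = sum(word_counts.values())             # total tokens
    T = len(word_counts)                      # total observed types (N_0 needs a dictionary)
    for r in range(1, max_r + 1):
        print(f"N_{r} = {nr[r]} types = {nr[r] * r} tokens")
    print(f"T = {T} types, N = {N} tokens")
    return nr, N, T
```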

Page 44: Witten-Bell Smoothing Idea

[Figure: the same count-of-counts histogram, highlighting novel words (N_0), singletons (N_1), and doubletons (N_2)]

If T/N is large, we’ve seen lots of novel types in the past, so we expect lots more.

unsmoothed → smoothed
2/N → 2/(N+T)
1/N → 1/(N+T)
0/N → (T/(N+T)) / N_0
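A minimal sketch of this Witten-Bell-style unigram estimate (assuming a dictionary that tells us which types are possible but unseen; the helper name is mine): observed counts are discounted from c/N to c/(N+T), and the remaining T/(N+T) is split evenly over the N_0 unseen types.

```python
from collections import Counter

def witten_bell_unigram(tokens, dictionary):
    counts = Counter(tokens)
    N = sum(counts.values())                         # observed tokens
    T = len(counts)                                  # observed types
    N0 = sum(1 for w in dictionary if w not in counts)   # dictionary types never seen
    def p(w):
        if counts[w] > 0:
            return counts[w] / (N + T)               # c/N becomes c/(N+T)
        return (T / (N + T)) / N0                    # 0/N becomes (T/(N+T)) / N0
    # Sums to 1 over the seen types plus the dictionary's unseen types.
    return p
```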

Page 45: Good-Turing Smoothing Idea

Partition the type vocabulary into classes (novel, singletons, doubletons, …) by how often they occurred in training data

Use observed total probability of class r+1 to estimate total probability of class r

unsmoothed → smoothed
2/N → (N_3 · 3/N) / N_2    (est. total p(doubletons) = obs. total p(tripletons) ≈ 1.2%)
1/N → (N_2 · 2/N) / N_1    (est. total p(singletons) = obs. total p(doubletons) ≈ 1.5%)
0/N → (N_1 · 1/N) / N_0    (est. total p(novel) = obs. total p(singletons) ≈ 2%)

In general: r/N = (N_r · r/N) / N_r  →  (N_{r+1} · (r+1)/N) / N_r
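A minimal sketch of exactly this class-based reweighting (no smoothing of the N_r themselves, so it is only sensible for small r; N_0 has to come from a dictionary; the names are mine):

```python
from collections import Counter

def good_turing_class_probs(tokens, num_unseen_types, max_r=5):
    counts = Counter(tokens)
    N = sum(counts.values())
    nr = Counter(counts.values())                 # N_r for r >= 1
    nr[0] = num_unseen_types                      # N_0: types in the dictionary never observed
    probs = {}
    for r in range(max_r + 1):
        if nr[r] > 0 and nr[r + 1] > 0:
            # total mass of class r := observed mass of class r+1, shared by the N_r types
            probs[r] = (nr[r + 1] * (r + 1) / N) / nr[r]
    return probs   # note: these per-type estimates are not renormalized
```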

Page 46: Witten-Bell vs. Good-Turing

Estimate p(z | xy) using just the tokens we've seen in context xy. Might be a small set …
Witten-Bell intuition: if those tokens were distributed over many different types, then novel types are likely in future.
Good-Turing intuition: if many of those tokens came from singleton types, then novel types are likely in future.
Very nice idea (but a bit tricky in practice).
See the paper “Good-Turing smoothing without tears”.

Page 47: Good-Turing Reweighting

Problem 1: what about “the”? (k = 4417)
For small k, N_k > N_{k+1}.
For large k, the N_k are too jumpy.
Problem 2: we don't observe events for every k.

[Diagram: the mass of each count class is shifted down one class: N_1 re-estimates N_0, N_2 re-estimates N_1, N_3 re-estimates N_2, …, N_4417 re-estimates N_4416.]

Page 48: Good-Turing Reweighting

Simple Good-Turing [Gale and Sampson]: replace the empirical N_k with a best-fit regression (e.g., a power law) once the counts of counts get unreliable.

f(k) = a + b log(k)
Find a, b such that f(k) ≈ N_k
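For reference, Gale and Sampson's power-law fit is a least-squares line in log-log space (log N_k ≈ a + b log k); a bare-bones sketch with no library dependencies, which you could drop in once the raw N_k get jumpy:

```python
import math

def fit_power_law(nr):
    """Fit log N_k = a + b * log(k) by least squares over the observed (k, N_k) points."""
    pts = [(math.log(k), math.log(n)) for k, n in nr.items() if k > 0 and n > 0]
    mx = sum(x for x, _ in pts) / len(pts)
    my = sum(y for _, y in pts) / len(pts)
    b = sum((x - mx) * (y - my) for x, y in pts) / sum((x - mx) ** 2 for x, _ in pts)
    a = my - b * mx
    return lambda k: math.exp(a + b * math.log(k))   # smoothed stand-in for N_k, any k >= 1

# Usage: nr maps count -> count-of-counts; use fit_power_law(nr)(4417) instead of a raw N_4417.
```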

Page 49: Backoff and interpolation

Why are we treating all novel events as the same?

Page 50: Backoff and interpolation

p(zombie | see the) vs. p(baby | see the)

What if count(see the zygote) = count(see the baby) = 0?
baby beats zygote as a unigram
the baby beats the zygote as a bigram
see the baby beats see the zygote? (even if both have the same count, such as 0)

Page 51: Backoff and interpolation

Condition on less context for contexts you haven't learned much about.
Backoff: use trigram if you have good evidence, otherwise bigram, otherwise unigram.
Interpolation: mixture of unigram, bigram, trigram (etc.) models.

Page 52: Smoothing + backoff

Basic smoothing (e.g., add-λ, Good-Turing, Witten-Bell): holds out some probability mass for novel events, divided up evenly among the novel events.
Backoff smoothing: holds out the same amount of probability mass for novel events, but divides it up unevenly, in proportion to the backoff probability.

Page 53: Smoothing + backoff

When defining p(z | xy), the backoff prob for novel z is p(z | y).
Even if z was never observed after xy, it may have been observed after y (why?). Then p(z | y) can be estimated without further backoff. If not, we back off further to p(z).
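A crude sketch of the backoff chain itself, in the style of “stupid backoff” scores rather than a properly normalized scheme like Katz backoff (the fixed weight 0.4 and all names are my own choices): fall from trigram to bigram to unigram evidence.

```python
def backoff_score(z, x, y, tri, big, uni, N, alpha=0.4):
    """tri/big/uni are Counter tables of trigram/bigram/unigram counts; N = total tokens.
    Returns a back-off *score*, not a normalized probability."""
    if tri[(x, y, z)] > 0:
        return tri[(x, y, z)] / big[(x, y)]        # trigram evidence available
    if big[(y, z)] > 0:
        return alpha * big[(y, z)] / uni[y]        # back off to the bigram
    return alpha * alpha * uni[z] / N              # back off all the way to the unigram
```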

Page 54: Linear Interpolation

Jelinek-Mercer smoothing

Use a weighted average of backed-off naïve models:
p_average(z | xy) = λ3 p(z | xy) + λ2 p(z | y) + λ1 p(z)
where λ3 + λ2 + λ1 = 1 and all λ ≥ 0
The weights can depend on the context xy.
E.g., we can make λ3 large if count(xy) is high.
Learn the weights on held-out data w/ jackknifing.
Different λ3 when xy is observed 1 time, 2 times, 5 times, …
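A minimal sketch of the interpolated estimate with fixed weights (real systems bucket the λ's by count(xy) and learn them on held-out data, as the slide says; the weights and names below are illustrative only):

```python
def interpolated_prob(z, x, y, tri, big, uni, N, V, lambdas=(0.6, 0.3, 0.1)):
    """tri/big/uni are Counter tables; N = total tokens, V = vocabulary size."""
    l3, l2, l1 = lambdas                     # must sum to 1, all >= 0
    p3 = tri[(x, y, z)] / big[(x, y)] if big[(x, y)] > 0 else 0.0
    p2 = big[(y, z)] / uni[y] if uni[y] > 0 else 0.0
    p1 = (uni[z] + 1) / (N + V)              # add-1 unigram, so even novel z gets mass
    # If a context is unseen, its term contributes 0; context-dependent lambdas fix this.
    return l3 * p3 + l2 * p2 + l1 * p1
```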

Page 55: Absolute Discounting Interpolation

Save ourselves some time and just subtract 0.75 (or some d)!
But should we really just use the regular unigram P(w)?

P_AbsoluteDiscounting(w_i | w_{i-1}) = (c(w_{i-1}, w_i) - d) / c(w_{i-1})  +  λ(w_{i-1}) P(w_i)

(discounted bigram)  +  (interpolation weight) × (unigram)
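A minimal sketch of this interpolated absolute-discounting estimate (with the usual max(·, 0) guard, and λ(w_{i-1}) set to the mass freed by discounting so that the distribution sums to one; the helper names are mine):

```python
from collections import Counter

def absolute_discounting(bigrams, unigrams, d=0.75):
    """bigrams/unigrams are Counter tables; unigrams[w] is the count of w as a bigram history."""
    N = sum(unigrams.values())
    followers = Counter(prev for (prev, _nxt) in bigrams)   # distinct continuations per history
    def p(w, w_prev):
        c_prev = unigrams[w_prev]
        if c_prev == 0:
            return unigrams[w] / N                          # unseen history: plain unigram
        discounted = max(bigrams[(w_prev, w)] - d, 0) / c_prev
        lam = d * followers[w_prev] / c_prev                # mass freed by the discount
        return discounted + lam * (unigrams[w] / N)
    return p
```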

Page 56: Absolute discounting: just subtract a little from each count

How much to subtract?
Church and Gale (1991)'s clever idea: divide the data into training and held-out sets; for each bigram count in the training set, see the actual count in the held-out set!
It sure looks like c* = (c - .75).

Bigram count in training    Bigram count in held-out set
0                           0.0000270
1                           0.448
2                           1.25
3                           2.24
4                           3.23
5                           4.21
6                           5.23