*Introduction to Natural Language Processing (600.465)
Language Modeling (and the Noisy Channel)

Dr. Jan Hajič, CS Dept., Johns Hopkins Univ.
[email protected], www.cs.jhu.edu/~hajic


TRANSCRIPT

Page 1: *Introduction to  Natural Language Processing (600.465) Language Modeling  (and the Noisy Channel)

1

*Introduction to Natural Language Processing (600.465)

Language Modeling (and the Noisy Channel)

Dr. Jan Hajič

CS Dept., Johns Hopkins Univ.

[email protected]

www.cs.jhu.edu/~hajic

Page 2: *Introduction to  Natural Language Processing (600.465) Language Modeling  (and the Noisy Channel)

2

The Noisy Channel

• Prototypical case:

  Input  →  The channel (adds noise)  →  Output (noisy)
  0,1,1,1,0,1,0,1,...                    0,1,1,0,0,1,1,0,...

• Model: probability of error (noise):

• Example: p(0|1) = .3 p(1|1) = .7 p(1|0) = .4 p(0|0) = .6

• The Task:

known: the noisy output; want to know: the input (decoding)

Page 3: *Introduction to  Natural Language Processing (600.465) Language Modeling  (and the Noisy Channel)

3

Noisy Channel Applications

• OCR
  – straightforward: text → print (adds noise), scan → image

• Handwriting recognition
  – text → neurons, muscles (“noise”), scan/digitize → image

• Speech recognition (dictation, commands, etc.)
  – text → conversion to acoustic signal (“noise”) → acoustic waves

• Machine Translation
  – text in target language → translation (“noise”) → source language

• Also: Part-of-Speech Tagging
  – sequence of tags → selection of word forms → text

Page 4: *Introduction to  Natural Language Processing (600.465) Language Modeling  (and the Noisy Channel)

4

Noisy Channel: The Golden Rule of OCR, ASR, HR, MT, ...

• Recall:

  p(A|B) = p(B|A) p(A) / p(B)        (Bayes formula)

  A_best = argmax_A p(B|A) p(A)      (The Golden Rule)

• p(B|A): the acoustic/image/translation/lexical model
  – application-specific name

– will explore later

• p(A): the language model
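A minimal sketch of the Golden Rule decode in Python; the candidate set and the two model functions are assumptions for illustration, not part of the slides:

    def decode(B, candidates, channel_model, language_model):
        """Golden Rule: pick the A maximizing p(B|A) * p(A).
        channel_model(B, A) ~ p(B|A); language_model(A) ~ p(A)."""
        return max(candidates, key=lambda A: channel_model(B, A) * language_model(A))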

Page 5: *Introduction to  Natural Language Processing (600.465) Language Modeling  (and the Noisy Channel)

Dan Jurafsky

Probabilistic Language Models

Why?

Page 6: *Introduction to  Natural Language Processing (600.465) Language Modeling  (and the Noisy Channel)

Dan Jurafsky

Probabilistic Language Modeling

• Goal: compute the probability of a sentence or sequence of words:
  P(W) = P(w1,w2,w3,w4,w5,…,wn)

• Related task: probability of an upcoming word:
  P(w5|w1,w2,w3,w4)

• A model that computes either of these,
  P(W) or P(wn|w1,w2,…,wn-1), is called a language model.

• A better name would be “the grammar”, but “language model” or LM is standard

Page 7: *Introduction to  Natural Language Processing (600.465) Language Modeling  (and the Noisy Channel)

7

The Perfect Language Model

• Sequence of word forms [forget about tagging for the moment]

• Notation: A ~ W = (w1,w2,w3,...,wd)

• The big (modeling) question:

p(W) = ?

• Well, we know (Bayes/chain rule):

  p(W) = p(w1,w2,w3,...,wd) =
       = p(w1) × p(w2|w1) × p(w3|w1,w2) × ... × p(wd|w1,w2,...,wd-1)

• Not practical (even for short W, too many parameters)

Page 8: *Introduction to  Natural Language Processing (600.465) Language Modeling  (and the Noisy Channel)

8

Markov Chain

• Unlimited memory (cf. previous foil):
  – for wi, we know all its predecessors w1,w2,w3,...,wi-1

• Limited memory:– we disregard “too old” predecessors

– remember only k previous words: wi-k,wi-k+1,...,wi-1

– called “kth order Markov approximation”

• + stationary character (no change over time):

  p(W) ≅ ∏i=1..d p(wi|wi-k,wi-k+1,...,wi-1),   d = |W|

Page 9: *Introduction to  Natural Language Processing (600.465) Language Modeling  (and the Noisy Channel)

9

n-gram Language Models

• (n-1)th order Markov approximation → n-gram LM:

  p(W) =df ∏i=1..d p(wi|wi-n+1,wi-n+2,...,wi-1)
  (wi is the prediction; wi-n+1,...,wi-1 is the history)

• In particular (assume vocabulary |V| = 60k):

• 0-gram LM: uniform model, p(w) = 1/|V|, 1 parameter

• 1-gram LM: unigram model, p(w), 6×10^4 parameters

• 2-gram LM: bigram model, p(wi|wi-1), 3.6×10^9 parameters

• 3-gram LM: trigram model, p(wi|wi-2,wi-1), 2.16×10^14 parameters
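A quick check of the parameter counts above (a minimal sketch; the printed numbers match the slide):

    # one parameter per possible n-gram over a vocabulary of size V
    V = 60_000
    for n in range(1, 5):
        print(n, f"{float(V ** n):.4g}")   # 6e+04, 3.6e+09, 2.16e+14, 1.296e+19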

Page 10: *Introduction to  Natural Language Processing (600.465) Language Modeling  (and the Noisy Channel)

10

LM: Observations

• How large n?
  – nothing is enough (theoretically)
  – but anyway: as much as possible (→ close to the “perfect” model)
  – empirically: 3
    • parameter estimation? (reliability, data availability, storage space, ...)
    • 4 is too much: |V| = 60k → 1.296×10^19 parameters
    • but: 6-7 would be (almost) ideal (having enough data): in fact, one can recover the original text from 7-grams!

• Reliability ~ (1 / Detail)   (→ need a compromise; more detail = longer n-grams)

• For now, keep word forms (no “linguistic” processing)

Page 11: *Introduction to  Natural Language Processing (600.465) Language Modeling  (and the Noisy Channel)

11

Parameter Estimation

• Parameter: numerical value needed to compute p(w|h)
• From data (how else?)
• Data preparation:
  • get rid of formatting etc. (“text cleaning”)
  • define words (separate but include punctuation, call it “word”)
  • define sentence boundaries (insert “words” <s> and </s>)
  • letter case: keep, discard, or be smart:

– name recognition

– number type identification

[these are huge problems per se!]

• numbers: keep, replace by <num>, or be smart (form ~ pronunciation)

Page 12: *Introduction to  Natural Language Processing (600.465) Language Modeling  (and the Noisy Channel)

12

Maximum Likelihood Estimate

• MLE: Relative Frequency...

  – ...best predicts the data at hand (the “training data”)

• Trigrams from Training Data T (see the sketch below):
  – count sequences of three words in T: c3(wi-2,wi-1,wi)

  – [NB: notation: just saying that the three words follow each other]

  – count sequences of two words in T: c2(wi-1,wi):

    • either use c2(y,z) = ∑w c3(y,z,w)

    • or count differently at the beginning (& end) of data!

  p(wi|wi-2,wi-1) =est c3(wi-2,wi-1,wi) / c2(wi-2,wi-1)
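A minimal sketch of the MLE trigram estimate; the `tokens` list is assumed to be an already-cleaned word sequence as described on the previous slide:

    from collections import Counter

    def trigram_mle(tokens):
        """Relative-frequency estimate: p(w | u, v) = c3(u, v, w) / c2(u, v)."""
        c3 = Counter(zip(tokens, tokens[1:], tokens[2:]))
        c2 = Counter(zip(tokens, tokens[1:]))   # c2(y,z) = sum_w c3(y,z,w), up to the data edges
        return {(u, v, w): n / c2[(u, v)] for (u, v, w), n in c3.items()}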

Page 13: *Introduction to  Natural Language Processing (600.465) Language Modeling  (and the Noisy Channel)

13

Character Language Model

• Use individual characters instead of words:

• Same formulas etc.
• Might consider 4-grams, 5-grams or even more
• Good only for language comparison
• Transform cross-entropy between letter- and word-based models:
  HS(pc) = HS(pw) / (avg. # of characters per word in S)

  p(W) =df ∏i=1..d p(ci|ci-n+1,ci-n+2,...,ci-1)

Page 14: *Introduction to  Natural Language Processing (600.465) Language Modeling  (and the Noisy Channel)

14

LM: an Example

• Training data: <s> <s> He can buy the can of soda.
  – Unigram: p1(He) = p1(buy) = p1(the) = p1(of) = p1(soda) = p1(.) = .125
             p1(can) = .25
  – Bigram: p2(He|<s>) = 1, p2(can|He) = 1, p2(buy|can) = .5,
            p2(of|can) = .5, p2(the|buy) = 1, ...
  – Trigram: p3(He|<s>,<s>) = 1, p3(can|<s>,He) = 1,
             p3(buy|He,can) = 1, p3(of|the,can) = 1, ..., p3(.|of,soda) = 1.
  – (normalized for all n-grams) Entropy: H(p1) = 2.75, H(p2) = .25, H(p3) = 0 ← Great?!

Page 15: *Introduction to  Natural Language Processing (600.465) Language Modeling  (and the Noisy Channel)

Dan Jurafsky

Language Modeling Toolkits

• SRILM
  • http://www.speech.sri.com/projects/srilm/

Page 16: *Introduction to  Natural Language Processing (600.465) Language Modeling  (and the Noisy Channel)

Dan Jurafsky

Google N-Gram Release, August 2006

Page 17: *Introduction to  Natural Language Processing (600.465) Language Modeling  (and the Noisy Channel)

Dan Jurafsky

Google N-Gram Release

• serve as the incoming 92
• serve as the incubator 99
• serve as the independent 794
• serve as the index 223
• serve as the indication 72
• serve as the indicator 120
• serve as the indicators 45
• serve as the indispensable 111
• serve as the indispensible 40
• serve as the individual 234

http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html

Page 18: *Introduction to  Natural Language Processing (600.465) Language Modeling  (and the Noisy Channel)

Dan Jurafsky

Google Book N-grams

• http://ngrams.googlelabs.com/

Page 19: *Introduction to  Natural Language Processing (600.465) Language Modeling  (and the Noisy Channel)

Dan Jurafsky

Evaluation: How good is our model?

• Does our language model prefer good sentences to bad ones?
  • Assign higher probability to “real” or “frequently observed” sentences
    than to “ungrammatical” or “rarely observed” sentences

• We train parameters of our model on a training set.
• We test the model’s performance on data we haven’t seen.

• A test set is an unseen dataset that is different from our training set, totally unused.

• An evaluation metric tells us how well our model does on the test set.

Page 20: *Introduction to  Natural Language Processing (600.465) Language Modeling  (and the Noisy Channel)

Dan Jurafsky

Extrinsic evaluation of N-gram models

• Best evaluation for comparing models A and B
  • Put each model in a task
    • spelling corrector, speech recognizer, MT system
  • Run the task, get an accuracy for A and for B
    • How many misspelled words corrected properly
    • How many words translated correctly

• Compare accuracy for A and B

Page 21: *Introduction to  Natural Language Processing (600.465) Language Modeling  (and the Noisy Channel)

Dan Jurafsky

Difficulty of extrinsic (in-vivo) evaluation of N-gram models

• Extrinsic evaluation
  • Time-consuming; can take days or weeks

• So
  • Sometimes use intrinsic evaluation: perplexity
  • Bad approximation
    • unless the test data looks just like the training data
    • So generally only useful in pilot experiments

• But is helpful to think about.

Page 22: *Introduction to  Natural Language Processing (600.465) Language Modeling  (and the Noisy Channel)

Dan Jurafsky

Intuition of Perplexity

• The Shannon Game:
  • How well can we predict the next word?

• Unigrams are terrible at this game. (Why?)

• A better model of a text• is one which assigns a higher probability to the word that actually occurs

I always order pizza with cheese and ____

The 33rd President of the US was ____

I saw a ____

  mushrooms 0.1, pepperoni 0.1, anchovies 0.01, …, fried rice 0.0001, …, and 1e-100

[Photo: Claude Shannon]

Page 23: *Introduction to  Natural Language Processing (600.465) Language Modeling  (and the Noisy Channel)

Dan Jurafsky

Perplexity

• Perplexity is the inverse probability of the test set, normalized by the number of words:

  PP(W) = P(w1 w2 ... wN)^(-1/N)

• Chain rule:

  PP(W) = ( ∏i=1..N 1 / P(wi|w1...wi-1) )^(1/N)

• For bigrams:

  PP(W) = ( ∏i=1..N 1 / P(wi|wi-1) )^(1/N)

Minimizing perplexity is the same as maximizing probability

The best language model is one that best predicts an unseen test set
• Gives the highest P(sentence)
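A minimal sketch of the computation, assuming a hypothetical `log2prob` function that returns log2 P(sentence) under the model:

    def perplexity(test_sentences, log2prob):
        """PP = 2 ** (-(1/N) * sum of log2 P(sentence)), N = total word count."""
        N = sum(len(s) for s in test_sentences)
        total_log2p = sum(log2prob(s) for s in test_sentences)
        return 2 ** (-total_log2p / N)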

Page 24: *Introduction to  Natural Language Processing (600.465) Language Modeling  (and the Noisy Channel)

Dan Jurafsky

The Shannon Game intuition for perplexity

• From Josh Goodman
• How hard is the task of recognizing digits ‘0,1,2,3,4,5,6,7,8,9’?
  • Perplexity 10

• How hard is recognizing (30,000) names at Microsoft?
  • Perplexity = 30,000

• If a system has to recognize
  • Operator (1 in 4)
  • Sales (1 in 4)
  • Technical Support (1 in 4)
  • 30,000 names (1 in 120,000 each)
  • Perplexity is 53

• Perplexity is weighted equivalent branching factor

Page 25: *Introduction to  Natural Language Processing (600.465) Language Modeling  (and the Noisy Channel)

Dan Jurafsky

Perplexity as branching factor

• Let’s suppose a sentence consisting of random digits
• What is the perplexity of this sentence according to a model that assigns P = 1/10 to each digit?
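A worked answer, using the perplexity definition above:

    PP(W) = P(w_1 w_2 \dots w_N)^{-1/N}
          = \left( \left( \tfrac{1}{10} \right)^{N} \right)^{-1/N}
          = 10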

Page 26: *Introduction to  Natural Language Processing (600.465) Language Modeling  (and the Noisy Channel)

Dan Jurafsky

Lower perplexity = better model

• Training 38 million words, test 1.5 million words, WSJ

  N-gram Order:   Unigram   Bigram   Trigram
  Perplexity:         962      170       109

Page 27: *Introduction to  Natural Language Processing (600.465) Language Modeling  (and the Noisy Channel)

Dan Jurafsky

The Wall Street Journal

Page 28: *Introduction to  Natural Language Processing (600.465) Language Modeling  (and the Noisy Channel)

28

LM: an Example

• Training data: <s> <s> He can buy the can of soda.
  – Unigram: p1(He) = p1(buy) = p1(the) = p1(of) = p1(soda) = p1(.) = .125
             p1(can) = .25
  – Bigram: p2(He|<s>) = 1, p2(can|He) = 1, p2(buy|can) = .5,
            p2(of|can) = .5, p2(the|buy) = 1, ...
  – Trigram: p3(He|<s>,<s>) = 1, p3(can|<s>,He) = 1,
             p3(buy|He,can) = 1, p3(of|the,can) = 1, ..., p3(.|of,soda) = 1.
  – (normalized for all n-grams) Entropy: H(p1) = 2.75, H(p2) = .25, H(p3) = 0 ← Great?!

Page 29: *Introduction to  Natural Language Processing (600.465) Language Modeling  (and the Noisy Channel)

29

LM: an Example (The Problem)

• Cross-entropy:
  • S = <s> <s> It was the greatest buy of all. (test data)

• Even HS(p1) fails (= HS(p2) = HS(p3) = ∞), because:

– all unigrams but p1(the), p1(buy), p1(of) and p1(.) are 0.

– all bigram probabilities are 0.

– all trigram probabilities are 0.

• We want to make all probabilities non-zero: data-sparseness handling.

Page 30: *Introduction to  Natural Language Processing (600.465) Language Modeling  (and the Noisy Channel)

30

The Zero Problem

• “Raw” n-gram language model estimate:
  – necessarily, some zeros
    • many! trigram model → 2.16×10^14 parameters, data ~ 10^9 words
  – which are true 0?
    • optimal situation: even the least frequent trigram would be seen several times,
      in order to distinguish its probability from that of other trigrams
    • this optimal situation cannot happen, unfortunately (open question: how much data would we need?)
  – → we don’t know
  – we must eliminate the zeros

• Two kinds of zeros: p(w|h) = 0, or even p(h) = 0!

Page 31: *Introduction to  Natural Language Processing (600.465) Language Modeling  (and the Noisy Channel)

31

Why do we need Nonzero Probs?

• To avoid infinite Cross-Entropy:
  – happens when an event is found in test data which has not been seen in training data

  H(p) = ∞ prevents comparing data with ≥ 0 “errors”

• To make the system more robust
  – low count estimates:

• they typically happen for “detailed” but relatively rare appearances

– high count estimates: reliable but less “detailed”

Page 32: *Introduction to  Natural Language Processing (600.465) Language Modeling  (and the Noisy Channel)

32

Eliminating the Zero Probabilities: Smoothing

• Get new p’(w) (same Ω): almost p(w), but no zeros
• Discount w for (some) p(w) > 0: new p’(w) < p(w)

  ∑w∈discounted (p(w) − p’(w)) = D

• Distribute D to all w with p(w) = 0: new p’(w) > p(w)
  – possibly also to other w with low p(w)

• For some w (possibly): p’(w) = p(w)

• Make sure ∑w∈Ω p’(w) = 1

• There are many ways of smoothing

Page 33: *Introduction to  Natural Language Processing (600.465) Language Modeling  (and the Noisy Channel)

33

Smoothing by Adding 1 (Laplace)

• Simplest but not really usable:
  – Predicting words w from a vocabulary V, training data T:

    p’(w|h) = (c(h,w) + 1) / (c(h) + |V|)
    • for non-conditional distributions: p’(w) = (c(w) + 1) / (|T| + |V|)

  – Problem if |V| > c(h) (as is often the case; even >> c(h)!)

• Example: Training data: <s> what is it what is small ?     |T| = 8
  • V = { what, is, it, small, ?, <s>, flying, birds, are, a, bird, . },  |V| = 12
  • p(it) = .125, p(what) = .25, p(.) = 0
    p(what is it?) = .25^2 × .125^2 ≅ .001
    p(it is flying.) = .125 × .25 × 0^2 = 0
  • p’(it) = .1, p’(what) = .15, p’(.) = .05
    p’(what is it?) = .15^2 × .1^2 ≅ .0002
    p’(it is flying.) = .1 × .15 × .05^2 ≅ .00004

(assume word independence!)
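A minimal sketch of the non-conditional add-1 estimate on the toy data above:

    from collections import Counter

    def add_one(tokens, vocab):
        """p'(w) = (c(w) + 1) / (|T| + |V|)."""
        c, T, V = Counter(tokens), len(tokens), len(vocab)
        return {w: (c[w] + 1) / (T + V) for w in vocab}

    tokens = "<s> what is it what is small ?".split()                      # |T| = 8
    vocab = "what is it small ? <s> flying birds are a bird .".split()     # |V| = 12
    p = add_one(tokens, vocab)   # p['it'] == 0.1, p['what'] == 0.15, p['.'] == 0.05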

Page 34: *Introduction to  Natural Language Processing (600.465) Language Modeling  (and the Noisy Channel)

34

Adding less than 1

• Equally simple:
  – Predicting words w from a vocabulary V, training data T:

    p’(w|h) = (c(h,w) + λ) / (c(h) + λ|V|),   λ < 1
    • for non-conditional distributions: p’(w) = (c(w) + λ) / (|T| + λ|V|)

• Example: Training data: <s> what is it what is small ?     |T| = 8
  • V = { what, is, it, small, ?, <s>, flying, birds, are, a, bird, . },  |V| = 12
  • p(it) = .125, p(what) = .25, p(.) = 0
    p(what is it?) = .25^2 × .125^2 ≅ .001
    p(it is flying.) = .125 × .25 × 0^2 = 0

• Use λ = .1:
  • p’(it) ≅ .12, p’(what) ≅ .23, p’(.) ≅ .01
    p’(what is it?) = .23^2 × .12^2 ≅ .0007
    p’(it is flying.) = .12 × .23 × .01^2 ≅ .000003

Page 35: *Introduction to  Natural Language Processing (600.465) Language Modeling  (and the Noisy Channel)

Language Modeling

Advanced: Good Turing Smoothing

Page 36: *Introduction to  Natural Language Processing (600.465) Language Modeling  (and the Noisy Channel)

Reminder: Add-1 (Laplace) Smoothing

Page 37: *Introduction to  Natural Language Processing (600.465) Language Modeling  (and the Noisy Channel)

More general formulations: Add-k

Page 38: *Introduction to  Natural Language Processing (600.465) Language Modeling  (and the Noisy Channel)

Unigram prior smoothing

Page 39: *Introduction to  Natural Language Processing (600.465) Language Modeling  (and the Noisy Channel)

Advanced smoothing algorithms

• Intuition used by many smoothing algorithms
  – Good-Turing
  – Kneser-Ney
  – Witten-Bell

• Use the count of things we’ve seen once
  – to help estimate the count of things we’ve never seen

Page 40: *Introduction to  Natural Language Processing (600.465) Language Modeling  (and the Noisy Channel)

Notation: Nc = Frequency of frequency c

• Nc = the count of things we’ve seen c times

• Sam I am I am Sam I do not eat

  I 3   sam 2   am 2   do 1   not 1   eat 1

40

N1 = 3

N2 = 2

N3 = 1

Page 41: *Introduction to  Natural Language Processing (600.465) Language Modeling  (and the Noisy Channel)

Good-Turing smoothing intuition

• You are fishing (a scenario from Josh Goodman), and caught:
  – 10 carp, 3 perch, 2 whitefish, 1 trout, 1 salmon, 1 eel = 18 fish

• How likely is it that next species is trout?
  – 1/18

• How likely is it that next species is new (i.e. catfish or bass)?
  – Let’s use our estimate of things-we-saw-once to estimate the new things.
  – 3/18 (because N1 = 3)

• Assuming so, how likely is it that next species is trout?

– Must be less than 1/18 – discounted by 3/18!!

– How to estimate?

Page 42: *Introduction to  Natural Language Processing (600.465) Language Modeling  (and the Noisy Channel)

Good-Turing calculations

• Seen once (trout)
  • c = 1
  • MLE p = 1/18
  • c*(trout) = 2 × N2/N1 = 2 × 1/3 = 2/3
  • P*GT(trout) = (2/3) / 18 = 1/27

• Unseen (bass or catfish)
  – c = 0
  – MLE p = 0/18 = 0
  – P*GT(unseen) = N1/N = 3/18
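A minimal sketch of these Good-Turing re-estimates on the fishing example. Note that for the highest count, N_{c+1} = 0 and the naive c* collapses to 0; that is the instability Simple Good-Turing addresses later.

    from collections import Counter

    def good_turing(counts):
        """c* = (c+1) * N_{c+1} / N_c; P*_GT(unseen) = N_1 / N."""
        N = sum(counts.values())
        Nc = Counter(counts.values())                   # frequency of frequencies
        c_star = {w: (c + 1) * Nc.get(c + 1, 0) / Nc[c] for w, c in counts.items()}
        return c_star, Nc.get(1, 0) / N

    catch = {"carp": 10, "perch": 3, "whitefish": 2, "trout": 1, "salmon": 1, "eel": 1}
    c_star, p_unseen = good_turing(catch)   # c_star["trout"] == 2/3, p_unseen == 3/18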

Page 43: *Introduction to  Natural Language Processing (600.465) Language Modeling  (and the Noisy Channel)

Ney et al.’s Good Turing Intuition

43

Held-out words:

H. Ney, U. Essen, and R. Kneser, 1995. On the estimation of 'small' probabilities by leaving-one-out. IEEE Trans. PAMI. 17:12,1202-1212

Page 44: *Introduction to  Natural Language Processing (600.465) Language Modeling  (and the Noisy Channel)

Ney et al. Good Turing Intuition(slide from Dan Klein)

• Intuition from leave-one-out validation
  – Take each of the c training words out in turn
  – c training sets of size c−1, held-out of size 1
  – What fraction of held-out words are unseen in training?
    • N1/c
  – What fraction of held-out words are seen k times in training?
    • (k+1)Nk+1/c
  – So in the future we expect (k+1)Nk+1/c of the words to be those with training count k
  – There are Nk words with training count k
  – Each should occur with probability:
    • (k+1)Nk+1/c/Nk
  – ...or expected count:
    • k* = (k+1)Nk+1/Nk

[Figure: each training count column Nk (N1, N2, N3, ..., N3511, N4417) is paired with the held-out count for the same words (N0, N1, N2, ..., N3510, N4416).]

Page 45: *Introduction to  Natural Language Processing (600.465) Language Modeling  (and the Noisy Channel)

Good-Turing complications (slide from Dan Klein)

• Problem: what about “the”? (say c=4417)

– For small k, Nk > Nk+1

– For large k, too jumpy, zeros wreck estimates

– Simple Good-Turing [Gale and Sampson]: replace empirical Nk with a best-fit power law once counts get unreliable

[Plot: the empirical N1, N2, N3, ... counts, with the best-fit power law replacing them once counts get unreliable.]

Page 46: *Introduction to  Natural Language Processing (600.465) Language Modeling  (and the Noisy Channel)

Resulting Good-Turing numbers

• Numbers from Church and Gale (1991)• 22 million words of AP Newswire

  Count c   Good-Turing c*
  0         .0000270
  1         0.446
  2         1.26
  3         2.24
  4         3.24
  5         4.22
  6         5.19
  7         6.21
  8         7.24
  9         8.25

Page 47: *Introduction to  Natural Language Processing (600.465) Language Modeling  (and the Noisy Channel)

Language Modeling

Advanced:

Kneser-Ney Smoothing

Page 48: *Introduction to  Natural Language Processing (600.465) Language Modeling  (and the Noisy Channel)

Resulting Good-Turing numbers

• Numbers from Church and Gale (1991)• 22 million words of AP Newswire

• It sure looks like c* = (c - .75)

  Count c   Good-Turing c*
  0         .0000270
  1         0.446
  2         1.26
  3         2.24
  4         3.24
  5         4.22
  6         5.19
  7         6.21
  8         7.24
  9         8.25

Page 49: *Introduction to  Natural Language Processing (600.465) Language Modeling  (and the Noisy Channel)

Absolute Discounting Interpolation

• Save ourselves some time and just subtract 0.75 (or some d)!

– (Maybe keeping a couple extra values of d for counts 1 and 2)

• But should we really just use the regular unigram P(w)?

49

  P_AbsoluteDiscounting(wi|wi-1) = (c(wi-1,wi) − d) / c(wi-1)  +  λ(wi-1) P(wi)
                                    [discounted bigram]           [interpolation weight × unigram]

Page 50: *Introduction to  Natural Language Processing (600.465) Language Modeling  (and the Noisy Channel)

Kneser-Ney Smoothing I

• Better estimate for probabilities of lower-order unigrams!

– Shannon game: I can’t see without my reading___________?

– “Francisco” is more common than “glasses”

– … but “Francisco” always follows “San”

• The unigram is useful exactly when we haven’t seen this bigram!

• Instead of P(w): “How likely is w”

• Pcontinuation(w): “How likely is w to appear as a novel continuation?

– For each word, count the number of bigram types it completes

– Every bigram type was a novel continuation the first time it was seen


Page 51: *Introduction to  Natural Language Processing (600.465) Language Modeling  (and the Noisy Channel)

Kneser-Ney Smoothing II

• How many times does w appear as a novel continuation?

  Pcontinuation(w) ∝ |{ wi-1 : c(wi-1, w) > 0 }|

• Normalized by the total number of word bigram types:

  Pcontinuation(w) = |{ wi-1 : c(wi-1, w) > 0 }| / |{ (wj-1, wj) : c(wj-1, wj) > 0 }|

Page 52: *Introduction to  Natural Language Processing (600.465) Language Modeling  (and the Noisy Channel)

Kneser-Ney Smoothing III

• Alternative metaphor: the # of word types seen to precede w,

  |{ wi-1 : c(wi-1, w) > 0 }|

• normalized by the # of word types preceding all words:

  Pcontinuation(w) = |{ wi-1 : c(wi-1, w) > 0 }| / ∑w’ |{ w’i-1 : c(w’i-1, w’) > 0 }|

• A frequent word (Francisco) occurring in only one context (San) will have a low continuation probability
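A minimal sketch of the continuation probability; the `bigrams` iterable of (previous word, word) pairs is an assumed input:

    from collections import defaultdict

    def continuation_probability(bigrams):
        """P_continuation(w) = (# distinct left contexts of w) / (# distinct bigram types)."""
        left = defaultdict(set)
        types = set()
        for prev, w in bigrams:
            left[w].add(prev)
            types.add((prev, w))
        return {w: len(ctx) / len(types) for w, ctx in left.items()}

    # "Francisco" may be frequent overall, but if it only ever follows "San",
    # len(left["Francisco"]) == 1 and its continuation probability stays low.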

Page 53: *Introduction to  Natural Language Processing (600.465) Language Modeling  (and the Noisy Channel)

Kneser-Ney Smoothing IV

53

  P_KN(wi|wi-1) = max(c(wi-1,wi) − d, 0) / c(wi-1)  +  λ(wi-1) Pcontinuation(wi)

• λ is a normalizing constant: the probability mass we’ve discounted

  λ(wi-1) = (d / c(wi-1)) × |{ w : c(wi-1, w) > 0 }|

  – d / c(wi-1): the normalized discount
  – |{ w : c(wi-1, w) > 0 }|: the number of word types that can follow wi-1
    = # of word types we discounted = # of times we applied the normalized discount

Page 54: *Introduction to  Natural Language Processing (600.465) Language Modeling  (and the Noisy Channel)

Kneser-Ney Smoothing: Recursive formulation

54

  P_KN(wi|wi-n+1..i-1) = max(c_KN(wi-n+1..i) − d, 0) / c_KN(wi-n+1..i-1)  +  λ(wi-n+1..i-1) P_KN(wi|wi-n+2..i-1)

  where c_KN(•) = the ordinary count for the highest order, and the continuation count for lower orders.

  Continuation count = number of unique single-word contexts for •

Page 55: *Introduction to  Natural Language Processing (600.465) Language Modeling  (and the Noisy Channel)

Backoff and Interpolation

• Sometimes it helps to use less context
  – Condition on less context for contexts you haven’t learned much about

• Backoff:
  – use trigram if you have good evidence,
  – otherwise bigram, otherwise unigram

• Interpolation: – mix unigram, bigram, trigram

• Interpolation works better

Page 56: *Introduction to  Natural Language Processing (600.465) Language Modeling  (and the Noisy Channel)

56

Smoothing by Combination: Linear Interpolation

• Combine what?
  • distributions of various levels of detail vs. reliability

• n-gram models:
  • use (n−1)-gram, (n−2)-gram, ..., uniform
    (reliability grows toward the lower-order models; detail grows toward the higher-order models)

• Simplest possible combination: – sum of probabilities, normalize:

• p(0|0) = .8, p(1|0) = .2, p(0|1) = 1, p(1|1) = 0, p(0) = .4, p(1) = .6:

• p’(0|0) = .6, p’(1|0) = .4, p’(0|1) = .7, p’(1|1) = .3

• (p’(0|0) = 0.5p(0|0) + 0.5p(0))

Page 57: *Introduction to  Natural Language Processing (600.465) Language Modeling  (and the Noisy Channel)

57

Typical n-gram LM Smoothing

• Weight in less detailed distributions using λ = (λ0, λ1, λ2, λ3):

  p’(wi|wi-2,wi-1) = λ3 p3(wi|wi-2,wi-1) + λ2 p2(wi|wi-1) + λ1 p1(wi) + λ0/|V|

• Normalize: λi ≥ 0, ∑i=0..n λi = 1 is sufficient (λ0 = 1 − ∑i=1..n λi)   (n = 3)

• Estimation using MLE:
  – fix the p3, p2, p1 and |V| parameters as estimated from the training data
  – then find the {λi} which minimize the cross-entropy (maximize the probability of the data):
    −(1/|D|) ∑i=1..|D| log2(p’(wi|hi))
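A minimal sketch of the interpolated trigram probability, assuming p3/p2/p1 are dictionaries of MLE estimates (e.g. as in the trigram sketch earlier):

    def interpolated(w, h2, h1, p3, p2, p1, lambdas, V):
        """p'(w | h2, h1) = l3*p3 + l2*p2 + l1*p1 + l0/|V|; lambdas = (l0, l1, l2, l3)."""
        l0, l1, l2, l3 = lambdas
        return (l3 * p3.get((h2, h1, w), 0.0)
                + l2 * p2.get((h1, w), 0.0)
                + l1 * p1.get(w, 0.0)
                + l0 / V)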

Page 58: *Introduction to  Natural Language Processing (600.465) Language Modeling  (and the Noisy Channel)

58

Held-out Data

• What data to use? (to estimate λ)
  – (bad) try the training data T: but we will always get λ3 = 1
    • why? (let piT be an i-gram distribution estimated using relative freq. from T)
    • minimizing HT(p’λ) over a vector λ, p’λ = λ3 p3T + λ2 p2T + λ1 p1T + λ0/|V|
  – remember: HT(p’λ) = H(p3T) + D(p3T||p’λ); (p3T fixed → H(p3T) fixed, best)
  – which p’λ minimizes HT(p’λ)? Obviously, a p’λ for which D(p3T||p’λ) = 0
  – ...and that’s p3T (because D(p||p) = 0, as we know).
  – ...and certainly p’λ = p3T if λ3 = 1 (maybe in some other cases, too).
  – (p’λ = 1×p3T + 0×p2T + 0×p1T + 0/|V|)
  – thus: do not use the training data for estimation of λ!
    • must hold out part of the training data (heldout data, H):

• ...call the remaining data the (true/raw) training data, T

• the test data S (e.g., for comparison purposes): still different data!

Page 59: *Introduction to  Natural Language Processing (600.465) Language Modeling  (and the Noisy Channel)

59

The Formulas (for H)

• Repeat: minimizing −(1/|H|) ∑i=1..|H| log2(p’λ(wi|hi)) over λ

  p’λ(wi|hi) = p’λ(wi|wi-2,wi-1) = λ3 p3(wi|wi-2,wi-1) + λ2 p2(wi|wi-1) + λ1 p1(wi) + λ0/|V|

• “Expected Counts (of lambdas)”: j = 0..3   – next page

  c(λj) = ∑i=1..|H| (λj pj(wi|hi) / p’λ(wi|hi))

• “Next λ”: j = 0..3

  λj,next = c(λj) / ∑k=0..3 c(λk)


Page 60: *Introduction to  Natural Language Processing (600.465) Language Modeling  (and the Noisy Channel)

60

The (Smoothing) EM Algorithm

1. Start with some λ, such that λj > 0 for all j ∈ 0..3.

2. Compute the “Expected Counts” for each λj.

3. Compute the new set of λj, using the “Next λ” formula.

4. Start over at step 2, unless a termination condition is met.
   • Termination condition: convergence of λ.
     – Simply set an ε, and finish if |λj − λj,next| < ε for each j (step 3).

• Guaranteed to converge: follows from Jensen’s inequality, plus a technical proof.
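A minimal sketch of this EM loop; the held-out triples and the p3/p2/p1 MLE dictionaries are assumed to come from the earlier steps:

    def em_lambdas(heldout, p3, p2, p1, V, eps=1e-3):
        """heldout: list of (w, h1, h2) triples from H; returns (l0, l1, l2, l3)."""
        lam = [0.25, 0.25, 0.25, 0.25]                       # step 1: all lambda_j > 0
        while True:
            c = [0.0, 0.0, 0.0, 0.0]                         # step 2: expected counts
            for w, h1, h2 in heldout:
                ps = [1.0 / V, p1.get(w, 0.0),
                      p2.get((h1, w), 0.0), p3.get((h2, h1, w), 0.0)]
                denom = sum(l * p for l, p in zip(lam, ps))  # p'_lambda(w|h)
                for j in range(4):
                    c[j] += lam[j] * ps[j] / denom
            new = [cj / sum(c) for cj in c]                  # step 3: next lambda
            if max(abs(a - b) for a, b in zip(lam, new)) < eps:
                return new                                   # step 4: termination
            lam = new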

Page 61: *Introduction to  Natural Language Processing (600.465) Language Modeling  (and the Noisy Channel)

61

Simple Example

• Raw distribution (unigram only; smooth with uniform):
  p(a) = .25, p(b) = .5, p(α) = 1/64 for α ∈ {c..r}, = 0 for the rest: s, t, u, v, w, x, y, z

• Heldout data: baby; use one set of λ (λ1: unigram, λ0: uniform)

• Start with λ1 = .5;  p’(b) = .5 × .5 + .5/26 = .27
                       p’(a) = .5 × .25 + .5/26 = .14
                       p’(y) = .5 × 0 + .5/26 = .02

  c(λ1) = .5×.5/.27 + .5×.25/.14 + .5×.5/.27 + .5×0/.02 = 2.72
  c(λ0) = .5×.04/.27 + .5×.04/.14 + .5×.04/.27 + .5×.04/.02 = 1.28

  Normalize: λ1,next = .68, λ0,next = .32.

Repeat from step 2 (recompute p’ first for efficient computation, then c(i), ...)

Finish when new lambdas almost equal to the old ones (say, < 0.01 difference).

Page 62: *Introduction to  Natural Language Processing (600.465) Language Modeling  (and the Noisy Channel)

62

Some More Technical Hints

• Set V = {all words from training data}.
  • You may also consider V = T ∪ H, but it does not make the coding in any way simpler (in fact, harder).
  • But: you must never use the test data for your vocabulary!

• Prepend two “words” in front of all data:
  • avoids beginning-of-data problems
  • call these index −1 and 0: then the formulas hold exactly

• When cn(w) = 0:
  • Assign 0 probability to pn(w|h) where cn-1(h) > 0, but a uniform probability (1/|V|) to those pn(w|h) where cn-1(h) = 0
    [this must be done both when working on the heldout data during EM, and when computing cross-entropy on the test data!]

Page 63: *Introduction to  Natural Language Processing (600.465) Language Modeling  (and the Noisy Channel)

63

Introduction to Natural Language Processing (600.465)

Mutual Information and Word Classes (class n-gram)

Dr. Jan Hajič

CS Dept., Johns Hopkins Univ.

[email protected]

www.cs.jhu.edu/~hajic

Page 64: *Introduction to  Natural Language Processing (600.465) Language Modeling  (and the Noisy Channel)

64

The Problem

• Not enough data
• Language Modeling: we do not see “correct” n-grams
  – solution so far: smoothing

• suppose we see:
  – short homework, short assignment, simple homework

• but not:
  – simple assignment

• What happens to our (bigram) LM?
  – p(homework | simple) = high probability
  – p(assignment | simple) = low probability (smoothed with p(assignment))

– They should be much closer!

Page 65: *Introduction to  Natural Language Processing (600.465) Language Modeling  (and the Noisy Channel)

65

Word Classes

• Observation: similar words behave in a similar way– trigram LM:

– in the ... (all nouns/adj);

– catch a ... (all things which can be caught, incl. their accompanying adjectives);

– trigram LM, conditioning:
  – a ... homework (any attribute of homework: short, simple, late, difficult),

– ... the woods (any verb that has the woods as an object: walk, cut, save)

– trigram LM: both:– a (short,long,difficult,...) (homework,assignment,task,job,...)

Page 66: *Introduction to  Natural Language Processing (600.465) Language Modeling  (and the Noisy Channel)

66

Solution

• Use the Word Classes as the “reliability” measure
• Example: we see
  • short homework, short assignment, simple homework

  – but not:
    • simple assignment

  – Cluster into classes:
    • (short, simple) (homework, assignment)

  – covers “simple assignment”, too

• Gaining: realistic estimates for unseen n-grams
• Losing: accuracy (level of detail) within classes

Page 67: *Introduction to  Natural Language Processing (600.465) Language Modeling  (and the Noisy Channel)

67

The New Model

• Rewrite the n-gram LM using classes:
  – Was: [k = 1..n]
    • pk(wi|hi) = c(hi,wi) / c(hi)   [history: (k−1) words]

  – Introduce classes:

    pk(wi|hi) = p(wi|ci) × pk(ci|hi)
    • history: classes, too: [for trigram: hi = ci-2,ci-1; bigram: hi = ci-1]

  – Smoothing as usual
    • over pk(wi|hi), where each is defined as above (except the uniform distribution, which stays at 1/|V|)

Page 68: *Introduction to  Natural Language Processing (600.465) Language Modeling  (and the Noisy Channel)

68

Training Data

• Suppose we already have a mapping:
  – r: V → C assigning each word its class (ci = r(wi))

• Expand the training data:
  – T = (w1, w2, ..., w|T|) into
  – TC = (<w1,r(w1)>, <w2,r(w2)>, ..., <w|T|,r(w|T|)>)

• Effectively, we have two streams of data:
  – word stream: w1, w2, ..., w|T|
  – class stream: c1, c2, ..., c|T| (def. as ci = r(wi))

• Expand Heldout, Test data too

Page 69: *Introduction to  Natural Language Processing (600.465) Language Modeling  (and the Noisy Channel)

69

Training the New Model

• As expected, using ML estimates:
  – p(wi|ci) = p(wi|r(wi)) = c(wi) / c(r(wi)) = c(wi) / c(ci)
    • note: c(wi,ci) = c(wi)  [since ci is determined by wi]

– pk(ci|hi):

• p3(ci|hi) = p3(ci|ci-2 ,ci-1) = c(ci-2 ,ci-1,ci) / c(ci-2 ,ci-1)

• p2(ci|hi) = p2(ci|ci-1) = c(ci-1,ci) / c(ci-1)

• p1(ci|hi) = p1(ci) = c(ci) / |T|

• Then smooth as usual – not the p(wi|ci) nor pk(ci|hi) individually, but the pk(wi|hi)

Page 70: *Introduction to  Natural Language Processing (600.465) Language Modeling  (and the Noisy Channel)

70

Classes: How To Get Them

• We supposed the classes are given• Maybe there are in [human] dictionaries, but...

– dictionaries are incomplete

– dictionaries are unreliable

– do not define classes as equivalence relation (overlap)

– do not define classes suitable for LM
  • small, short... maybe; small and difficult?

• we have to construct them from data (again...)

Page 71: *Introduction to  Natural Language Processing (600.465) Language Modeling  (and the Noisy Channel)

71

Creating the Word-to-Class Map

• We will talk about bigrams from now on
• Bigram estimate:

  p2(ci|hi) = p2(ci|ci-1) = c(ci-1,ci) / c(ci-1) = c(r(wi-1),r(wi)) / c(r(wi-1))

• Form of the model (class bigram):
  – just the raw bigram for now:

  P(T) = ∏i=1..|T| p(wi|r(wi)) p2(r(wi)|r(wi-1))     (p2(c1|c0) =df p(c1))

• Maximize over r (given r → fixed p, p2):
  – define the objective L(r) = (1/|T|) ∑i=1..|T| log(p(wi|r(wi)) p2(r(wi)|r(wi-1)))
  – rbest = argmaxr L(r)   (L(r) = normalized log-probability of the training data, as usual; or negative cross-entropy)

Page 72: *Introduction to  Natural Language Processing (600.465) Language Modeling  (and the Noisy Channel)

72

Simplifying the Objective Function

• Start from L(r) = (1/|T|) ∑i=1..|T| log(p(wi|r(wi)) p2(r(wi)|r(wi-1))):

  (1/|T|) ∑i=1..|T| log(p(wi|r(wi)) p(r(wi)) p2(r(wi)|r(wi-1)) / p(r(wi))) =

  (1/|T|) ∑i=1..|T| log(p(wi,r(wi)) p2(r(wi)|r(wi-1)) / p(r(wi))) =

  (1/|T|) ∑i=1..|T| log(p(wi)) + (1/|T|) ∑i=1..|T| log(p2(r(wi)|r(wi-1)) / p(r(wi))) =

  −H(W) + (1/|T|) ∑i=1..|T| log(p2(r(wi)|r(wi-1)) p(r(wi-1)) / (p(r(wi-1)) p(r(wi)))) =

  −H(W) + (1/|T|) ∑i=1..|T| log(p(r(wi),r(wi-1)) / (p(r(wi-1)) p(r(wi)))) =

  −H(W) + ∑d,e∈C p(d,e) log( p(d,e) / (p(d) p(e)) ) =

  −H(W) + I(D,E)     (event E picks the class adjacent, to the right, of the one picked by D)

• Since H(W) does not depend on r, we ended up with the need to maximize I(D,E).

Page 73: *Introduction to  Natural Language Processing (600.465) Language Modeling  (and the Noisy Channel)

73

Maximizing Mutual Information(dependent on the mapping r)

• Result from previous foil:– Maximizing the probability of data amounts to

maximizing I(D,E), the mutual information of the adjacent classes.

• Good:– We know what a MI is, and we know how to maximize.

• Bad:
  – There is no way to maximize over so many possible partitionings: |V|^|V| mappings; no way to test them all.

Page 74: *Introduction to  Natural Language Processing (600.465) Language Modeling  (and the Noisy Channel)

74

The Greedy Algorithm

• Define a merging operation on the mapping r: V → C (a sketch follows below):
  – merge: R × C × C → R’ × C’ (with |C’| = |C| − 1): (r,k,l) → (r’,C’) such that
  – C’ = (C − {k,l}) ∪ {m}   (throw out k and l, add a new class m ∉ C)
  – r’(w) = m for w ∈ r⁻¹({k,l}),
            r(w) otherwise.

• 1. Start with each word in its own class (C = V), r = identity.

• 2. Merge two classes k,l into one, m, such that (k,l) = argmaxk,l Imerge(r,k,l)(D,E).

• 3. Set new (r,C) = merge(r,k,l).

• 4. Repeat 2 and 3 until |C| reaches predetermined size.
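A naive (and very slow) sketch of this greedy merging; it recomputes I(D,E) from scratch for every candidate merge, whereas a practical implementation would update the mutual information incrementally:

    from collections import Counter
    from math import log2

    def mutual_information(pairs, r):
        """I(D,E) over adjacent class pairs (d = r(w_{i-1}), e = r(w_i))."""
        joint = Counter((r[a], r[b]) for a, b in pairs)
        n = sum(joint.values())
        left, right = Counter(), Counter()
        for (d, e), c in joint.items():
            left[d] += c
            right[e] += c
        return sum((c / n) * log2((c / n) / ((left[d] / n) * (right[e] / n)))
                   for (d, e), c in joint.items())

    def greedy_classes(tokens, target_size):
        pairs = list(zip(tokens, tokens[1:]))
        r = {w: w for w in set(tokens)}                 # step 1: each word in its own class
        while len(set(r.values())) > target_size:       # step 4: until |C| reaches target
            classes = sorted(set(r.values()))
            k, l = max(((k, l) for i, k in enumerate(classes) for l in classes[i + 1:]),
                       key=lambda kl: mutual_information(
                           pairs, {w: (kl[0] if c in kl else c) for w, c in r.items()}))
            r = {w: (k if c == l else c) for w, c in r.items()}   # steps 2-3: merge k and l
        return r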

Page 75: *Introduction to  Natural Language Processing (600.465) Language Modeling  (and the Noisy Channel)

75

Word Classes in Applications

• Word Sense Disambiguation: context not seen [enough(-times)]

• Parsing: verb-subject, verb-object relations
• Speech recognition (acoustic model): need more instances of [rare(r)] sequences of phonemes
• Machine Translation: translation equivalent selection [for rare(r) words]

Page 76: *Introduction to  Natural Language Processing (600.465) Language Modeling  (and the Noisy Channel)

Spelling Correction and

the Noisy Channel

The Spelling Correction Task

Page 77: *Introduction to  Natural Language Processing (600.465) Language Modeling  (and the Noisy Channel)

Dan Jurafsky

Applications for spelling correction

77

Web search
Phones
Word processing

Page 78: *Introduction to  Natural Language Processing (600.465) Language Modeling  (and the Noisy Channel)

Dan Jurafsky

Spelling Tasks

• Spelling Error Detection
• Spelling Error Correction:
  • Autocorrect: hte → the
  • Suggest a correction
  • Suggestion lists

78

Page 79: *Introduction to  Natural Language Processing (600.465) Language Modeling  (and the Noisy Channel)

Dan Jurafsky

Types of spelling errors

• Non-word Errors
  • graffe → giraffe

• Real-word Errors
  • Typographical errors
    • three → there
  • Cognitive Errors (homophones)
    • piece → peace
    • too → two

79

Page 80: *Introduction to  Natural Language Processing (600.465) Language Modeling  (and the Noisy Channel)

Dan Jurafsky

Rates of spelling errors

26%: Web queries Wang et al. 2003

13%: Retyping, no backspace: Whitelaw et al. (English & German)

7%: Words corrected retyping on phone-sized organizer
2%: Words uncorrected on organizer: Soukoreff & MacKenzie 2003

1-2%: Retyping: Kane and Wobbrock 2007, Gruden et al. 1983

80

Page 81: *Introduction to  Natural Language Processing (600.465) Language Modeling  (and the Noisy Channel)

Dan Jurafsky

Non-word spelling errors

• Non-word spelling error detection:
  • Any word not in a dictionary is an error
  • The larger the dictionary the better

• Non-word spelling error correction:
  • Generate candidates: real words that are similar to the error
  • Choose the one which is best:
    • Shortest weighted edit distance
    • Highest noisy channel probability

81

Page 82: *Introduction to  Natural Language Processing (600.465) Language Modeling  (and the Noisy Channel)

Dan Jurafsky

Real word spelling errors

• For each word w, generate a candidate set:
  • Find candidate words with similar pronunciations
  • Find candidate words with similar spelling
  • Include w in the candidate set

• Choose the best candidate
  • Noisy Channel
  • Classifier

82

Page 83: *Introduction to  Natural Language Processing (600.465) Language Modeling  (and the Noisy Channel)

Spelling Correction and

the Noisy Channel

The Noisy Channel Model of Spelling

Page 84: *Introduction to  Natural Language Processing (600.465) Language Modeling  (and the Noisy Channel)

Dan Jurafsky

Noisy Channel Intuition

84

Page 85: *Introduction to  Natural Language Processing (600.465) Language Modeling  (and the Noisy Channel)

Dan Jurafsky

Noisy Channel

• We see an observation x of a misspelled word
• Find the correct word w:

  ŵ = argmaxw∈V P(w|x) = argmaxw∈V P(x|w) P(w)

85

Page 86: *Introduction to  Natural Language Processing (600.465) Language Modeling  (and the Noisy Channel)

Dan Jurafsky

History: Noisy channel for spelling proposed around 1990

• IBM
  • Mays, Eric, Fred J. Damerau and Robert L. Mercer. 1991. Context based spelling correction. Information Processing and Management, 23(5), 517–522.

• AT&T Bell Labs
  • Kernighan, Mark D., Kenneth W. Church, and William A. Gale. 1990. A spelling correction program based on a noisy channel model. Proceedings of COLING 1990, 205–210.

Page 87: *Introduction to  Natural Language Processing (600.465) Language Modeling  (and the Noisy Channel)

Dan Jurafsky

Non-word spelling error example

acress

87

Page 88: *Introduction to  Natural Language Processing (600.465) Language Modeling  (and the Noisy Channel)

Dan Jurafsky

Candidate generation

• Words with similar spelling
  • Small edit distance to the error

• Words with similar pronunciation
  • Small edit distance of pronunciation to the error

88

Page 89: *Introduction to  Natural Language Processing (600.465) Language Modeling  (and the Noisy Channel)

Dan Jurafsky

Damerau-Levenshtein edit distance

• Minimal edit distance between two strings, where edits are:
  • Insertion
  • Deletion
  • Substitution
  • Transposition of two adjacent letters

89
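A minimal sketch of the restricted Damerau-Levenshtein (optimal string alignment) distance with unit edit costs:

    def damerau_levenshtein(a, b):
        """Edits: insertion, deletion, substitution, transposition of two adjacent letters."""
        d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
        for i in range(len(a) + 1):
            d[i][0] = i
        for j in range(len(b) + 1):
            d[0][j] = j
        for i in range(1, len(a) + 1):
            for j in range(1, len(b) + 1):
                cost = 0 if a[i - 1] == b[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,          # deletion
                              d[i][j - 1] + 1,          # insertion
                              d[i - 1][j - 1] + cost)   # substitution (or match)
                if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                    d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)   # transposition
        return d[len(a)][len(b)]

    # damerau_levenshtein("acress", "caress") == 1   (transposition: ac -> ca)
    # damerau_levenshtein("acress", "actress") == 1  (one edit: the missing 't')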

Page 90: *Introduction to  Natural Language Processing (600.465) Language Modeling  (and the Noisy Channel)

Dan Jurafsky

Words within 1 of acress

  Error    Candidate Correction   Correct Letter   Error Letter   Type
  acress   actress                t                -              deletion
  acress   cress                  -                a              insertion
  acress   caress                 ca               ac             transposition
  acress   access                 c                r              substitution
  acress   across                 o                e              substitution
  acress   acres                  -                s              insertion
  acress   acres                  -                s              insertion

Page 91: *Introduction to  Natural Language Processing (600.465) Language Modeling  (and the Noisy Channel)

Dan Jurafsky

Candidate generation

• 80% of errors are within edit distance 1• Almost all errors within edit distance 2

• Also allow insertion of space or hyphen• thisidea this idea• inlaw in-law

91

Page 92: *Introduction to  Natural Language Processing (600.465) Language Modeling  (and the Noisy Channel)

Dan Jurafsky

Language Model

• Use any of the language modeling algorithms we’ve learned
  • Unigram, bigram, trigram
• Web-scale spelling correction (web-scale language modeling)
  • Stupid backoff

92

• “Stupid backoff” (Brants et al. 2007)
  • No discounting, just use relative frequencies (see the sketch below)
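A minimal sketch of stupid backoff; the n-gram `counts` dictionary (word tuples → counts, with counts[()] = total token count) is an assumed data structure, not from the slides:

    def stupid_backoff(w, context, counts, alpha=0.4):
        """Score S(w | context); context is a tuple of preceding words.
        Returns a relative-frequency score, not a normalized probability."""
        if not context:
            return counts.get((w,), 0) / counts[()]          # unigram relative frequency
        num = counts.get(context + (w,), 0)
        den = counts.get(context, 0)
        if num > 0 and den > 0:
            return num / den                                 # relative frequency, no discounting
        return alpha * stupid_backoff(w, context[1:], counts, alpha)   # back off to shorter context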

Page 93: *Introduction to  Natural Language Processing (600.465) Language Modeling  (and the Noisy Channel)

Dan Jurafsky

Unigram Prior probability

  word      Frequency of word   P(word)
  actress     9,321             .0000230573
  cress         220             .0000005442
  caress        686             .0000016969
  access     37,038             .0000916207
  across    120,844             .0002989314
  acres      12,874             .0000318463

Counts from 404,253,213 words in Corpus of Contemporary English (COCA)

Page 94: *Introduction to  Natural Language Processing (600.465) Language Modeling  (and the Noisy Channel)

Dan Jurafsky

Channel model probability

• Error model probability, edit probability
  • Kernighan, Church, Gale 1990

• Misspelled word x = x1, x2, x3, …, xm

• Correct word w = w1, w2, w3, …, wn

• P(x|w) = probability of the edit
  • (deletion / insertion / substitution / transposition)

94

Page 95: *Introduction to  Natural Language Processing (600.465) Language Modeling  (and the Noisy Channel)

Dan Jurafsky

Computing error probability: confusion matrix

del[x,y]:   count(xy typed as x)
ins[x,y]:   count(x typed as xy)
sub[x,y]:   count(x typed as y)
trans[x,y]: count(xy typed as yx)

Insertion and deletion conditioned on previous character

95
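A minimal sketch of the resulting single-edit channel probabilities in the style of Kernighan, Church & Gale (1990); the confusion matrices and the character/bigram counts are assumed to be given as dictionaries:

    def channel_probability(edit, del_m, ins_m, sub_m, trans_m, char_counts, bigram_counts):
        """edit is ('del', w_prev, w_i), ('ins', w_prev, x_i),
        ('sub', x_i, w_i) or ('trans', w_i, w_next)."""
        kind, a, b = edit
        if kind == 'del':                                   # "ab" typed as "a"
            return del_m[(a, b)] / bigram_counts[(a, b)]
        if kind == 'ins':                                   # "a" typed as "ab"
            return ins_m[(a, b)] / char_counts[a]
        if kind == 'sub':                                   # "b" typed as "a"
            return sub_m[(a, b)] / char_counts[b]
        if kind == 'trans':                                 # "ab" typed as "ba"
            return trans_m[(a, b)] / bigram_counts[(a, b)]
        raise ValueError(kind)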

Page 96: *Introduction to  Natural Language Processing (600.465) Language Modeling  (and the Noisy Channel)

Dan Jurafsky

Confusion matrix for spelling errors

Page 97: *Introduction to  Natural Language Processing (600.465) Language Modeling  (and the Noisy Channel)

Dan Jurafsky

Generating the confusion matrix

• Peter Norvig’s list of errors
• Peter Norvig’s list of counts of single-edit errors

97

Page 98: *Introduction to  Natural Language Processing (600.465) Language Modeling  (and the Noisy Channel)

Dan Jurafsky

Channel model

98

Kernighan, Church, Gale 1990

Page 99: *Introduction to  Natural Language Processing (600.465) Language Modeling  (and the Noisy Channel)

Dan Jurafsky

Channel model for acress

  Candidate Correction   Correct Letter   Error Letter   x|w     P(x|word)
  actress                t                -              c|ct    .000117
  cress                  -                a              a|#     .00000144
  caress                 ca               ac             ac|ca   .00000164
  access                 c                r              r|c     .000000209
  across                 o                e              e|o     .0000093
  acres                  -                s              es|e    .0000321
  acres                  -                s              ss|s    .0000342

Page 100: *Introduction to  Natural Language Processing (600.465) Language Modeling  (and the Noisy Channel)

Dan Jurafsky

Noisy channel probability for acress

  Candidate Correction   Correct Letter   Error Letter   x|w     P(x|word)     P(word)       10^9 × P(x|w)P(w)
  actress                t                -              c|ct    .000117       .0000231      2.7
  cress                  -                a              a|#     .00000144     .000000544    .00078
  caress                 ca               ac             ac|ca   .00000164     .00000170     .0028
  access                 c                r              r|c     .000000209    .0000916      .019
  across                 o                e              e|o     .0000093      .000299       2.8
  acres                  -                s              es|e    .0000321      .0000318      1.0
  acres                  -                s              ss|s    .0000342      .0000318      1.0

Page 101: *Introduction to  Natural Language Processing (600.465) Language Modeling  (and the Noisy Channel)

Dan Jurafsky

Using a bigram language model

• “a stellar and versatile acress whose combination of sass and glamour…”

• Counts from the Corpus of Contemporary American English with add-1 smoothing

• P(actress|versatile) = .000021    P(whose|actress) = .0010
• P(across|versatile)  = .000021    P(whose|across)  = .000006

• P(“versatile actress whose”) = .000021 × .0010    = 210 × 10^-10
• P(“versatile across whose”)  = .000021 × .000006  =   1 × 10^-10

101

Page 102: *Introduction to  Natural Language Processing (600.465) Language Modeling  (and the Noisy Channel)

Dan Jurafsky

Using a bigram language model

• “a stellar and versatile acress whose combination of sass and glamour…”

• Counts from the Corpus of Contemporary American English with add-1 smoothing

• P(actress|versatile) = .000021    P(whose|actress) = .0010
• P(across|versatile)  = .000021    P(whose|across)  = .000006

• P(“versatile actress whose”) = .000021 × .0010    = 210 × 10^-10
• P(“versatile across whose”)  = .000021 × .000006  =   1 × 10^-10

102

Page 103: *Introduction to  Natural Language Processing (600.465) Language Modeling  (and the Noisy Channel)

Dan Jurafsky

Evaluation

• Some spelling error test sets
  • Wikipedia’s list of common English misspellings
  • Aspell filtered version of that list
  • Birkbeck spelling error corpus
  • Peter Norvig’s list of errors (includes Wikipedia and Birkbeck, for training or testing)

103

Page 104: *Introduction to  Natural Language Processing (600.465) Language Modeling  (and the Noisy Channel)

Spelling Correction and

the Noisy Channel

Real-Word Spelling Correction

Page 105: *Introduction to  Natural Language Processing (600.465) Language Modeling  (and the Noisy Channel)

Dan Jurafsky

Real-word spelling errors

• …leaving in about fifteen minuets to go to her house.
• The design an construction of the system…
• Can they lave him my messages?
• The study was conducted mainly be John Black.

• 25-40% of spelling errors are real words (Kukich 1992)

105

Page 106: *Introduction to  Natural Language Processing (600.465) Language Modeling  (and the Noisy Channel)

Dan Jurafsky

Solving real-world spelling errors

• For each word in the sentence
  • Generate a candidate set
    • the word itself
    • all single-letter edits that are English words
    • words that are homophones

• Choose the best candidates
  • Noisy channel model
  • Task-specific classifier

106

Page 107: *Introduction to  Natural Language Processing (600.465) Language Modeling  (and the Noisy Channel)

Dan Jurafsky

Noisy channel for real-word spell correction

• Given a sentence w1,w2,w3,…,wn

• Generate a set of candidates for each word wi

• Candidate(w1) = {w1, w’1 , w’’1 , w’’’1 ,…}

• Candidate(w2) = {w2, w’2 , w’’2 , w’’’2 ,…}

• Candidate(wn) = {wn, w’n , w’’n , w’’’n ,…}

• Choose the sequence W that maximizes P(W)

Page 108: *Introduction to  Natural Language Processing (600.465) Language Modeling  (and the Noisy Channel)

Dan Jurafsky

Noisy channel for real-word spell correction

108

Page 109: *Introduction to  Natural Language Processing (600.465) Language Modeling  (and the Noisy Channel)

Dan Jurafsky

Noisy channel for real-word spell correction

109

Page 110: *Introduction to  Natural Language Processing (600.465) Language Modeling  (and the Noisy Channel)

Dan Jurafsky

Simplification: One error per sentence

• Out of all possible sentences with one word replaced
  • w1, w’’2, w3, w4     “two off thew”
  • w1, w2, w’3, w4      “two of the”
  • w’’’1, w2, w3, w4    “too of thew”

• …

• Choose the sequence W that maximizes P(W)

Page 111: *Introduction to  Natural Language Processing (600.465) Language Modeling  (and the Noisy Channel)

Dan Jurafsky

Where to get the probabilities

• Language model
  • Unigram
  • Bigram
  • Etc.

• Channel model
  • Same as for non-word spelling correction
  • Plus need a probability for no error, P(w|w)

111

Page 112: *Introduction to  Natural Language Processing (600.465) Language Modeling  (and the Noisy Channel)

Dan Jurafsky

Probability of no error

• What is the channel probability for a correctly typed word?
  • P(“the”|“the”)

• Obviously this depends on the application
  • .90  (1 error in 10 words)
  • .95  (1 error in 20 words)
  • .99  (1 error in 100 words)
  • .995 (1 error in 200 words)

112

Page 113: *Introduction to  Natural Language Processing (600.465) Language Modeling  (and the Noisy Channel)

Dan Jurafsky

Peter Norvig’s “thew” example

113

  x      w      x|w     P(x|w)      P(w)          10^9 × P(x|w)P(w)
  thew   the    ew|e    0.000007    0.02          144
  thew   thew           0.95        0.00000009    90
  thew   thaw   e|a     0.001       0.0000007     0.7
  thew   threw  h|hr    0.000008    0.000004      0.03
  thew   thwe   ew|we   0.000003    0.00000004    0.0001

Page 114: *Introduction to  Natural Language Processing (600.465) Language Modeling  (and the Noisy Channel)

Spelling Correction and

the Noisy Channel

State-of-the-art Systems

Page 115: *Introduction to  Natural Language Processing (600.465) Language Modeling  (and the Noisy Channel)

Dan Jurafsky

HCI issues in spelling

• If very confident in the correction
  • Autocorrect

• Less confident
  • Give the best correction

• Less confident
  • Give a correction list

• Unconfident
  • Just flag as an error

115

Page 116: *Introduction to  Natural Language Processing (600.465) Language Modeling  (and the Noisy Channel)

Dan Jurafsky

State of the art noisy channel

• We never just multiply the prior and the error model
  • Independence assumptions → the probabilities are not commensurate
• Instead: weigh them, e.g. ŵ = argmaxw P(x|w) P(w)^λ

• Learn λ from a development test set

116

Page 117: *Introduction to  Natural Language Processing (600.465) Language Modeling  (and the Noisy Channel)

Dan Jurafsky

Phonetic error model

• Metaphone, used in GNU aspell
  • Convert the misspelling to its Metaphone pronunciation
    • “Drop duplicate adjacent letters, except for C.”
    • “If the word begins with 'KN', 'GN', 'PN', 'AE', 'WR', drop the first letter.”
    • “Drop 'B' if after 'M' and if it is at the end of the word.”
    • …
  • Find words whose pronunciation is 1-2 edit distance from the misspelling’s
  • Score result list
    • Weighted edit distance of candidate to misspelling
    • Edit distance of candidate pronunciation to misspelling pronunciation

117

Page 118: *Introduction to  Natural Language Processing (600.465) Language Modeling  (and the Noisy Channel)

Dan Jurafsky

Improvements to channel model

• Allow richer edits (Brill and Moore 2000)
  • ent → ant
  • ph → f
  • le → al

• Incorporate pronunciation into channel (Toutanova and Moore 2002)

118

Page 119: *Introduction to  Natural Language Processing (600.465) Language Modeling  (and the Noisy Channel)

Dan Jurafsky

Channel model

• Factors that could influence p(misspelling|word)
  • The source letter
  • The target letter
  • Surrounding letters
  • The position in the word
  • Nearby keys on the keyboard
  • Homology on the keyboard
  • Pronunciations
  • Likely morpheme transformations

119

Page 120: *Introduction to  Natural Language Processing (600.465) Language Modeling  (and the Noisy Channel)

Dan Jurafsky

Nearby keys

Page 121: *Introduction to  Natural Language Processing (600.465) Language Modeling  (and the Noisy Channel)

Dan Jurafsky

Classifier-based methods for real-word spelling correction

• Instead of just a channel model and a language model
• Use many features in a classifier such as MaxEnt, CRF.
• Build a classifier for a specific pair like: whether/weather
  • “cloudy” within ±10 words
  • ___ to VERB
  • ___ or not

121