TRANSCRIPT
Estimating N-gram Probabilities
Language Modeling
Dan Jurafsky
Estimating bigram probabilities
An example
Small corpus:
<s> I am Sam </s>
<s> Sam I am </s>
<s> I do not like green eggs and ham </s>
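As a sanity check, the bigram estimates for this toy corpus can be reproduced with the maximum-likelihood estimate P(w_i | w_{i-1}) = count(w_{i-1}, w_i) / count(w_{i-1}). A minimal sketch:

```python
from collections import Counter

# The three training sentences from the slide, with sentence boundaries.
corpus = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]

unigram_counts = Counter()
bigram_counts = Counter()
for sentence in corpus:
    tokens = sentence.split()
    unigram_counts.update(tokens)
    bigram_counts.update(zip(tokens, tokens[1:]))

def bigram_prob(w_prev, w):
    # Maximum-likelihood estimate: count(w_prev, w) / count(w_prev)
    return bigram_counts[(w_prev, w)] / unigram_counts[w_prev]

print(bigram_prob("<s>", "I"))    # 2/3: "I" starts 2 of the 3 sentences
print(bigram_prob("I", "am"))     # 2/3
print(bigram_prob("Sam", "</s>")) # 1/2
```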
More examples: Berkeley Restaurant Project sentences
• can you tell me about any good cantonese restaurants close by
• mid priced thai food is what i’m looking for
• tell me about chez panisse
• can you give me a listing of the kinds of food that are available
• i’m looking for a good place to eat breakfast
• when is caffe venezia open during the day
Raw bigram counts
• Shows the bigram counts from the Berkeley Restaurant Project for a matrix of seven words selected at random. Note that the majority of the entries are zero.
Raw bigram probabilities
• Table 2: Shows the unigram counts.
• Table 3: Shows the bigram probabilities after normalization (each row of bigram counts is divided by the unigram count of that row's word).
Bigram estimates of sentence probabilities

Here are a few other useful probabilities:
• P(i|<s>) = 0.25
• P(english|want) = 0.0011
• P(food|english) = 0.5
• P(</s>|food) = 0.68

Compute the bigram probability of a sentence using the probabilities above and Table 3:
P(<s> i want english food </s>)
= P(i|<s>) × P(want|i) × P(english|want) × P(food|english) × P(</s>|food)
= 0.25 × 0.33 × 0.0011 × 0.5 × 0.68
= 0.000031
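The product above can be checked in a few lines (the probabilities are the ones listed on the slide; P(want|i) = 0.33 comes from Table 3):

```python
import math

# Bigram probabilities read off the slide and Table 3.
bigrams = {
    ("<s>", "i"): 0.25,
    ("i", "want"): 0.33,
    ("want", "english"): 0.0011,
    ("english", "food"): 0.5,
    ("food", "</s>"): 0.68,
}

sentence = ["<s>", "i", "want", "english", "food", "</s>"]

# Multiply the conditional probability of each word given its predecessor.
prob = math.prod(bigrams[pair] for pair in zip(sentence, sentence[1:]))
print(f"{prob:.6f}")  # prints 0.000031
```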
What kinds of knowledge?
• P(english|want) = .0011
• P(chinese|want) = .0065
• P(to|want) = .66
• P(eat|to) = .28
• P(food|to) = 0
• P(want|spend) = 0
• P(i|<s>) = .25

These estimates encode different kinds of knowledge:
• Facts about the world: P(chinese|want) > P(english|want) reflects what people are more likely to want to eat.
• Facts about grammar: "want" as a main verb requires an infinitive complement (to + verb), so P(to|want) is high.
• Another fact about grammar: two verbs in a row are not allowed in English, so P(want|spend) = 0.
Practical Issues
• Probabilities are (by definition) less than or equal to 1, so the more probabilities we multiply together, the smaller the product becomes. Multiplying enough N-grams together would result in numerical underflow.
• We therefore do everything in log space: log probabilities are not as vanishingly small as raw probabilities, and adding is faster than multiplying.
• log(p1 × p2 × p3 × p4) = log p1 + log p2 + log p3 + log p4
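A quick illustration of the underflow problem and the log-space fix (the sentence length of 200 and the probability 0.001 are made-up values for demonstration):

```python
import math

# A hypothetical long product of small bigram probabilities.
probs = [0.001] * 200

# Multiplying raw probabilities underflows a 64-bit float:
# 0.001 ** 200 = 1e-600, far below the smallest representable double.
raw = math.prod(probs)
print(raw)  # prints 0.0

# Summing log probabilities stays in a comfortable numeric range.
log_prob = sum(math.log(p) for p in probs)
print(log_prob)  # ≈ -1381.55
```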
Google N-Gram Release, August 2006
…
Google N-Gram Release
http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html
Evaluation and Perplexity
Language Modeling
Evaluation: How good is our model?
• Does our language model prefer good sentences to bad ones?
• It should assign higher probability to “real” or “frequently observed” sentences than to “ungrammatical” or “rarely observed” sentences.
• There are two types of evaluation:
• Intrinsic evaluation
• Extrinsic evaluation
Extrinsic evaluation of N-gram models
• How do we compare two N-gram models A and B? (An N-gram model uses a training set, or training corpus, such as the Google N-gram corpus, to compute its probabilities.)
1. Put each N-gram model (A and B) in a task such as
• a spelling corrector, speech recognizer, or MT system
2. Run the task and get an accuracy for model A and model B. How?
• By testing each model's performance on a test set the models haven't seen: data that is different from our training set and totally unused.
• Then use an evaluation metric that tells us how well each model does on the test set.
Extrinsic evaluation of N-gram models
• To get an accuracy for each of the N-gram models A and B:
1. Put each N-gram model (A and B) in a task such as
• a spelling corrector, speech recognizer, or MT system
2. Run the task and score each model. How?
• Count by hand how many misspelled words were corrected properly (spelling corrector).
• See which model gives the more accurate transcription (speech recognizer).
• Count by hand how many words were translated correctly (MT).
3. Compare the accuracies of model A and model B.
Difficulty of extrinsic evaluation of N-gram models
• Extrinsic evaluation is time-consuming; it can take days or weeks.
• So we sometimes use intrinsic evaluation instead.
• But intrinsic evaluation gives a bad approximation if the test data is part of the training set.
• Solve this by choosing test data that is as large as possible, not part of the training set, and unseen (unused), to avoid the bad approximation. (For example, we can divide a large corpus into a training set and a test set.)
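The split described here can be sketched as follows (the 80/20 ratio and the stand-in corpus are illustrative assumptions, not from the slides):

```python
# Stand-in corpus of distinct sentences, for illustration only.
sentences = [f"sentence {i}" for i in range(10_000)]

# Hold out the last 20% as an unseen test set; train on the rest.
split = int(0.8 * len(sentences))
train_set = sentences[:split]
test_set = sentences[split:]

# The model is estimated on train_set only; test_set stays unused
# until evaluation, so the two never overlap.
assert not set(train_set) & set(test_set)
print(len(train_set), len(test_set))  # prints 8000 2000
```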
More about Intrinsic evaluation
In Intrinsic evaluation:
In practice we don’t use raw probability as our metric for evaluating language models, but a variant called perplexity.
Perplexity of a language model on a test set (sometimes called PP for short) is the inverse probability of the test set, normalized by the number of words.
Perplexity
Perplexity is the inverse probability of the test set, normalized by the number of words:

PP(W) = P(w_1 w_2 \ldots w_N)^{-1/N} = \sqrt[N]{\frac{1}{P(w_1 w_2 \ldots w_N)}}

We can use the chain rule to expand the probability of W:

PP(W) = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_1 \ldots w_{i-1})}}

If we are computing the perplexity of W with a bigram language model, we get:

PP(W) = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_{i-1})}}

Minimizing perplexity is the same as maximizing probability.
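The definition can be turned into a small helper, computed in log space to avoid underflow (a sketch; the per-word probabilities reuse the "i want english food" bigram values from the earlier slide):

```python
import math

def perplexity(word_probs):
    """Perplexity of a test sequence given its per-word conditional probabilities.

    PP(W) = (product of P(w_i | history))^(-1/N), computed in log space.
    """
    n = len(word_probs)
    log_prob = sum(math.log(p) for p in word_probs)
    return math.exp(-log_prob / n)

# Bigram probabilities for "<s> i want english food </s>" from the earlier slide.
probs = [0.25, 0.33, 0.0011, 0.5, 0.68]
print(perplexity(probs))  # ≈ 7.98
```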
Perplexity as branching factor
• Let’s suppose a sentence consisting of random digits.
• What is the perplexity of this sentence according to a model that assigns P = 1/10 to each digit?
• PP(W) = ((1/10)^N)^{-1/N} = 10, so the perplexity equals the branching factor: at each position there are effectively 10 equally likely choices.
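Working this out numerically (any sentence length N gives the same answer, since the N-th root undoes the product):

```python
N = 50  # sentence length; the result is independent of N

# Each of the N digits gets probability 1/10 under the uniform model.
sentence_prob = (1 / 10) ** N

# Perplexity: inverse probability, normalized by the number of words.
pp = sentence_prob ** (-1 / N)
print(pp)  # ≈ 10.0, the branching factor
```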
Lower perplexity = better model
Ex:
• We trained unigram, bigram, and trigram grammars on 38 million words (training set) from the Wall Street Journal, using a 19,979-word vocabulary.
• We then computed the perplexity of each of these models on a test set of 1.5 million words.
N-gram order:  Unigram   Bigram   Trigram
Perplexity:        962      170       109