TRANSCRIPT
Estimating N-gram Probabilities
Language Modeling
Dan Jurafsky
Estimating bigram probabilities
An example
Small corpus:
<s> I am Sam </s>
<s> Sam I am </s>
<s> I do not like green eggs and ham </s>
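As a sanity check, the bigram estimates for this toy corpus can be reproduced with the maximum-likelihood estimate P(w_i | w_{i-1}) = count(w_{i-1}, w_i) / count(w_{i-1}). A minimal sketch:

```python
from collections import Counter

# The three training sentences from the slide, with sentence boundaries.
corpus = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]

unigram_counts = Counter()
bigram_counts = Counter()
for sentence in corpus:
    tokens = sentence.split()
    unigram_counts.update(tokens)
    bigram_counts.update(zip(tokens, tokens[1:]))

def bigram_prob(w_prev, w):
    # Maximum-likelihood estimate: count(w_prev, w) / count(w_prev)
    return bigram_counts[(w_prev, w)] / unigram_counts[w_prev]

print(bigram_prob("<s>", "I"))    # 2/3: "I" starts 2 of the 3 sentences
print(bigram_prob("I", "am"))     # 2/3
print(bigram_prob("Sam", "</s>")) # 1/2
```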
More examples: Berkeley Restaurant Project sentences
• can you tell me about any good cantonese restaurants close by
• mid priced thai food is what i’m looking for
• tell me about chez panisse
• can you give me a listing of the kinds of food that are available
• i’m looking for a good place to eat breakfast
• when is caffe venezia open during the day
Raw bigram counts
• Shows the bigram counts from the Berkeley Restaurant Project for a matrix of seven words selected at random. Note that the majority of the entries are zero.
Raw bigram probabilities
• Table 2: Shows the unigram counts.
• Table 3: Shows the bigram probabilities after normalization (each row of bigram counts is divided by the unigram count of that row's word).
Bigram estimates of sentence probabilities

Here are a few other useful probabilities:
• P(i|<s>) = 0.25
• P(english|want) = 0.0011
• P(food|english) = 0.5
• P(</s>|food) = 0.68

Compute the bigram probability of a sentence using the probabilities above and Table 3:
P(<s> i want english food </s>)
= P(i|<s>) × P(want|i) × P(english|want) × P(food|english) × P(</s>|food)
= 0.25 × 0.33 × 0.0011 × 0.5 × 0.68
= 0.000031
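The product above can be checked in a few lines (the probabilities are the ones listed on the slide; P(want|i) = 0.33 comes from Table 3):

```python
import math

# Bigram probabilities read off the slide and Table 3.
bigrams = {
    ("<s>", "i"): 0.25,
    ("i", "want"): 0.33,
    ("want", "english"): 0.0011,
    ("english", "food"): 0.5,
    ("food", "</s>"): 0.68,
}

sentence = ["<s>", "i", "want", "english", "food", "</s>"]

# Multiply the conditional probability of each word given its predecessor.
prob = math.prod(bigrams[pair] for pair in zip(sentence, sentence[1:]))
print(f"{prob:.6f}")  # prints 0.000031
```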
What kinds of knowledge?
• P(english|want) = .0011
• P(chinese|want) = .0065
• P(to|want) = .66
• P(eat|to) = .28
• P(food|to) = 0
• P(want|spend) = 0
• P(i|<s>) = .25

These estimates encode different kinds of knowledge:
• Facts about the world: P(chinese|want) > P(english|want) reflects what people are more likely to want to eat.
• Facts about grammar: "want" as a main verb requires an infinitive complement (to + verb), so P(to|want) is high.
• Another fact about grammar: two verbs in a row are not allowed in English, so P(want|spend) = 0.
Practical Issues
• Probabilities are (by definition) less than or equal to 1, so the more probabilities we multiply together, the smaller the product becomes. Multiplying enough N-grams together would result in numerical underflow.
• We therefore do everything in log space: log probabilities are not as vanishingly small as raw probabilities, and adding is faster than multiplying.
• log(p1 × p2 × p3 × p4) = log p1 + log p2 + log p3 + log p4
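A quick illustration of the underflow problem and the log-space fix (the sentence length of 200 and the probability 0.001 are made-up values for demonstration):

```python
import math

# A hypothetical long product of small bigram probabilities.
probs = [0.001] * 200

# Multiplying raw probabilities underflows a 64-bit float:
# 0.001 ** 200 = 1e-600, far below the smallest representable double.
raw = math.prod(probs)
print(raw)  # prints 0.0

# Summing log probabilities stays in a comfortable numeric range.
log_prob = sum(math.log(p) for p in probs)
print(log_prob)  # ≈ -1381.55
```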
Google N-Gram Release, August 2006
…
Google N-Gram Release
http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html
Evaluation and Perplexity
Language Modeling
Evaluation: How good is our model?
• Does our language model prefer good sentences to bad ones?
• It should assign higher probability to “real” or “frequently observed” sentences than to “ungrammatical” or “rarely observed” sentences.
• There are two types of evaluation:
• Intrinsic evaluation
• Extrinsic evaluation
Extrinsic evaluation of N-gram models
• How do we compare two N-gram models A and B? (An N-gram model uses a training set, or training corpus, such as the Google N-gram corpus, to compute its probabilities.)
1. Put each N-gram model (A and B) in a task such as
• a spelling corrector, speech recognizer, or MT system
2. Run the task and get an accuracy for model A and model B. How?
• By testing each model's performance on a test set the models haven't seen: data that is different from our training set and totally unused.
• Then use an evaluation metric that tells us how well each model does on the test set.
Extrinsic evaluation of N-gram models
• To get an accuracy for each of the N-gram models A and B:
1. Put each N-gram model (A and B) in a task such as
• a spelling corrector, speech recognizer, or MT system
2. Run the task and score each model. How?
• Count by hand how many misspelled words were corrected properly (spelling corrector).
• See which model gives the more accurate transcription (speech recognizer).
• Count by hand how many words were translated correctly (MT).
3. Compare the accuracies of model A and model B.
Difficulty of extrinsic evaluation of N-gram models
• Extrinsic evaluation is time-consuming; it can take days or weeks.
• So we sometimes use intrinsic evaluation instead.
• But intrinsic evaluation gives a bad approximation if the test data is part of the training set.
• Solve this by choosing test data that is as large as possible, not part of the training set, and unseen (unused), to avoid the bad approximation. (For example, we can divide a large corpus into a training set and a test set.)
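The split described here can be sketched as follows (the 80/20 ratio and the stand-in corpus are illustrative assumptions, not from the slides):

```python
# Stand-in corpus of distinct sentences, for illustration only.
sentences = [f"sentence {i}" for i in range(10_000)]

# Hold out the last 20% as an unseen test set; train on the rest.
split = int(0.8 * len(sentences))
train_set = sentences[:split]
test_set = sentences[split:]

# The model is estimated on train_set only; test_set stays unused
# until evaluation, so the two never overlap.
assert not set(train_set) & set(test_set)
print(len(train_set), len(test_set))  # prints 8000 2000
```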
More about Intrinsic evaluation
In Intrinsic evaluation:
In practice we don’t use raw probability as our metric for evaluating language models, but a variant called perplexity.
Perplexity of a language model on a test set (sometimes called PP for short) is the inverse probability of the test set, normalized by the number of words.
Perplexity
Perplexity is the inverse probability of the test set, normalized by the number of words:

PP(W) = P(w_1 w_2 \ldots w_N)^{-1/N} = \sqrt[N]{\frac{1}{P(w_1 w_2 \ldots w_N)}}

We can use the chain rule to expand the probability of W:

PP(W) = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_1 \ldots w_{i-1})}}

If we are computing the perplexity of W with a bigram language model, we get:

PP(W) = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_{i-1})}}

Minimizing perplexity is the same as maximizing probability.
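The definition can be turned into a small helper, computed in log space to avoid underflow (a sketch; the per-word probabilities reuse the "i want english food" bigram values from the earlier slide):

```python
import math

def perplexity(word_probs):
    """Perplexity of a test sequence given its per-word conditional probabilities.

    PP(W) = (product of P(w_i | history))^(-1/N), computed in log space.
    """
    n = len(word_probs)
    log_prob = sum(math.log(p) for p in word_probs)
    return math.exp(-log_prob / n)

# Bigram probabilities for "<s> i want english food </s>" from the earlier slide.
probs = [0.25, 0.33, 0.0011, 0.5, 0.68]
print(perplexity(probs))  # ≈ 7.98
```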
Perplexity as branching factor
• Let’s suppose a sentence consisting of random digits.
• What is the perplexity of this sentence according to a model that assigns P = 1/10 to each digit?
• PP(W) = ((1/10)^N)^{-1/N} = 10, so the perplexity equals the branching factor: at each position there are effectively 10 equally likely choices.
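Working this out numerically (any sentence length N gives the same answer, since the N-th root undoes the product):

```python
N = 50  # sentence length; the result is independent of N

# Each of the N digits gets probability 1/10 under the uniform model.
sentence_prob = (1 / 10) ** N

# Perplexity: inverse probability, normalized by the number of words.
pp = sentence_prob ** (-1 / N)
print(pp)  # ≈ 10.0, the branching factor
```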
Lower perplexity = better model
Ex:
• We trained unigram, bigram, and trigram grammars on 38 million words (training set) from the Wall Street Journal, using a 19,979-word vocabulary.
• We then computed the perplexity of each of these models on a test set of 1.5 million words.
N-gram order:  Unigram   Bigram   Trigram
Perplexity:        962      170       109