Estimating N-gram Probabilities (Language Modeling)


Page 1: Estimating N-gram Probabilities Language Modeling

Estimating N-gram Probabilities

Language Modeling

Page 2: Estimating N-gram Probabilities Language Modeling

Dan Jurafsky

Estimating bigram probabilities

Page 3: Estimating N-gram Probabilities Language Modeling

Dan Jurafsky

An example

Small corpus:
<s> I am Sam </s>
<s> Sam I am </s>
<s> I do not like green eggs and ham </s>
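A minimal sketch of the maximum-likelihood bigram estimate, P(w_i | w_{i-1}) = count(w_{i-1}, w_i) / count(w_{i-1}), applied to this toy corpus (Python is used here purely for illustration):

```python
from collections import Counter

# Toy corpus from the slide, with sentence-boundary markers included.
corpus = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]

unigram_counts = Counter()
bigram_counts = Counter()
for sentence in corpus:
    tokens = sentence.split()
    unigram_counts.update(tokens)
    bigram_counts.update(zip(tokens, tokens[1:]))

def bigram_prob(prev, word):
    """Maximum-likelihood estimate: count(prev, word) / count(prev)."""
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(bigram_prob("<s>", "I"))     # 0.666... (= 2/3)
print(bigram_prob("I", "am"))      # 0.666... (= 2/3)
print(bigram_prob("am", "Sam"))    # 0.5     (= 1/2)
print(bigram_prob("Sam", "</s>"))  # 0.5     (= 1/2)
```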

Page 4: Estimating N-gram Probabilities Language Modeling

Dan Jurafsky

More examples: Berkeley Restaurant Project sentences

• can you tell me about any good cantonese restaurants close by
• mid priced thai food is what i’m looking for
• tell me about chez panisse
• can you give me a listing of the kinds of food that are available
• i’m looking for a good place to eat breakfast
• when is caffe venezia open during the day

Page 5: Estimating N-gram Probabilities Language Modeling

Dan Jurafsky

Raw bigram counts

• Table 1 shows the bigram counts from the Berkeley Restaurant Project; note that the majority of the values are zero. The matrix is restricted to a randomly selected set of seven words.


Page 6: Estimating N-gram Probabilities Language Modeling

Dan Jurafsky

Raw bigram probabilities

• Table 2 shows the unigram counts.

• Table 3 shows the bigram probabilities after normalization: each row of bigram counts is divided by the unigram count of the row’s word.
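A tiny sketch of that row normalization, using assumed, illustrative counts for the row of the word “i”:

```python
# Divide each bigram count C("i", w) by the unigram count C("i").
# The counts below are illustrative, assumed values.
unigram_count_i = 2533
bigram_counts_from_i = {"want": 827, "eat": 9, "to": 0}

row_probs = {w: round(c / unigram_count_i, 4) for w, c in bigram_counts_from_i.items()}
print(row_probs)  # {'want': 0.3265, 'eat': 0.0036, 'to': 0.0}
```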


Page 7: Estimating N-gram Probabilities Language Modeling

Dan Jurafsky

Bigram estimates of sentence probabilities

Here are a few other useful probabilities:
• P(i|<s>) = 0.25
• P(english|want) = 0.0011
• P(food|english) = 0.5
• P(</s>|food) = 0.68

Compute the bigram probability of the sentence using the information above and Table 3:

P(<s> I want english food </s>) =

P(I|<s>) × P(want|I) × P(english|want) × P(food|english) × P(</s>|food) = 0.25 × 0.33 × 0.0011 × 0.5 × 0.68 ≈ 0.000031
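The same arithmetic as a quick sketch in Python; the probability values are the ones listed above (P(want|I) = 0.33 is read off Table 3):

```python
import math

# Bigram probabilities for "<s> I want english food </s>":
p = [
    0.25,    # P(I | <s>)
    0.33,    # P(want | I)
    0.0011,  # P(english | want)
    0.5,     # P(food | english)
    0.68,    # P(</s> | food)
]

print(math.prod(p))  # ~3.1e-05, i.e. about 0.000031
```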

Page 8: Estimating N-gram Probabilities Language Modeling

Dan Jurafsky

What kinds of knowledge?

• P(english|want) = .0011
• P(chinese|want) = .0065
• P(to|want) = .66
• P(eat | to) = .28
• P(food | to) = 0
• P(want | spend) = 0
• P(i | <s>) = .25

These numbers encode different kinds of knowledge:

• Facts about the world, e.g. P(chinese|want) > P(english|want).
• Facts about grammar, e.g. “want” as a main verb takes an infinitive complement (to + verb), so P(to|want) is high.
• Facts about grammar, e.g. two verbs in a row are not allowed in English, so P(want|spend) = 0.

Page 9: Estimating N-gram Probabilities Language Modeling

Dan Jurafsky

Practical Issues

• Since probabilities are (by definition) less than or equal to 1, the more probabilities we multiply together, the smaller the product becomes; multiplying enough N-grams together would cause numerical underflow.
• We therefore do everything in log space: log probabilities are not as vanishingly small, and multiplying probabilities becomes adding log probabilities (adding is also faster than multiplying).
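A small illustration of the log-space identity p1 × p2 × … × pn = exp(log p1 + log p2 + … + log pn), reusing the bigram probabilities from the sentence example above:

```python
import math

# With many more factors the raw product eventually underflows to 0.0,
# while the sum of logs stays in a comfortable numeric range.
probs = [0.25, 0.33, 0.0011, 0.5, 0.68]

raw_product = math.prod(probs)
log_sum = sum(math.log(p) for p in probs)

print(raw_product)        # ~3.1e-05
print(log_sum)            # ~-10.39 (log probability of the sentence)
print(math.exp(log_sum))  # same ~3.1e-05, recovered from log space
```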

Page 10: Estimating N-gram Probabilities Language Modeling

Dan Jurafsky

Google N-Gram Release, August 2006

Page 11: Estimating N-gram Probabilities Language Modeling

Dan Jurafsky

Google N-Gram Release
http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html

Page 12: Estimating N-gram Probabilities Language Modeling

Dan Jurafsky

Google Book N-grams

• http://ngrams.googlelabs.com/

Page 13: Estimating N-gram Probabilities Language Modeling

Evaluation and Perplexity

Language Modeling

Page 14: Estimating N-gram Probabilities Language Modeling

Dan Jurafsky

Evaluation: How good is our model?

• Does our language model prefer good sentences to bad ones?
• Does it assign higher probability to “real” or “frequently observed” sentences than to “ungrammatical” or “rarely observed” ones?

There are two types of evaluation:
• Intrinsic evaluation
• Extrinsic evaluation

Page 15: Estimating N-gram Probabilities Language Modeling

Dan Jurafsky


An Intrinsic Evaluation of N-gram models

• One way to compare N-gram models A and B. (An N-gram model uses a training set or training corpus, such as the Google N-gram corpus, to compute its probabilities.)

1. Train each N-gram model (A and B) on the training set.
2. Evaluate each model on a test set it has never seen: data that is different from the training set and totally unused.
3. Use an evaluation metric (such as perplexity, introduced below) that tells us how well each model does on the test set.

Page 16: Estimating N-gram Probabilities Language Modeling

Dan Jurafsky

Extrinsic evaluation of N-gram models

• Another way to compare N-gram models A and B:

1. Put each N-gram model (A and B) into a task, such as a spelling corrector, a speech recognizer, or an MT system.
2. Run the task and get an accuracy for model A and for model B. How?
• Count how many misspelled words were corrected properly (spelling corrector).
• See which model gives the more accurate transcription (speech recognizer).
• Count how many words were translated correctly (MT).
3. Compare the accuracies of model A and model B.

Page 17: Estimating N-gram Probabilities Language Modeling

Dan Jurafsky

Difficulty of extrinsic evaluation of N-gram models

• Extrinsic evaluation is time-consuming; it can take days or weeks.
• So we sometimes use intrinsic evaluation instead.
• But intrinsic evaluation is a bad approximation if the test data is part of the training set.
• To avoid that, choose test data that is as large as possible, not part of the training set, and completely unseen (unused). For example, we can divide a large corpus into a training portion and a test portion, as sketched below.
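A minimal sketch of such a split (the file name and the 90/10 ratio are arbitrary choices for illustration):

```python
import random

# Hold out a test portion so it stays unseen during training.
with open("corpus.txt") as f:          # assumed file with one sentence per line
    sentences = f.read().splitlines()

random.seed(0)                         # reproducible shuffle
random.shuffle(sentences)

cut = int(0.9 * len(sentences))        # arbitrary 90/10 train/test split
train_set, test_set = sentences[:cut], sentences[cut:]
```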

Page 18: Estimating N-gram Probabilities Language Modeling

Dan Jurafsky


More about Intrinsic evaluation

In intrinsic evaluation:

In practice we don’t use raw probability as our metric for evaluating language models, but a variant called perplexity.

The perplexity of a language model on a test set (sometimes written PP for short) is the inverse probability of the test set, normalized by the number of words.

Page 19: Estimating N-gram Probabilities Language Modeling

Dan Jurafsky

Perplexity

Perplexity is the inverse probability of the test set, normalized by the number of words:

We can use the chain rule to expand the probability of W; if we are computing the perplexity of W with a bigram language model, we get the last form below.

PP(W) = P(w_1 w_2 \ldots w_N)^{-1/N} = \sqrt[N]{\dfrac{1}{P(w_1 w_2 \ldots w_N)}}

PP(W) = \sqrt[N]{\prod_{i=1}^{N} \dfrac{1}{P(w_i \mid w_1 \ldots w_{i-1})}}    (chain rule)

PP(W) = \sqrt[N]{\prod_{i=1}^{N} \dfrac{1}{P(w_i \mid w_{i-1})}}    (bigram model)

Minimizing perplexity is the same as maximizing probability.
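A sketch of the same computation in code, done in log space for numerical stability; the probability list at the end is a hypothetical example:

```python
import math

def perplexity(word_probs):
    """PP(W) = (prod_i P(w_i | history))^(-1/N), via the average log probability."""
    n = len(word_probs)
    avg_log_prob = sum(math.log(p) for p in word_probs) / n
    return math.exp(-avg_log_prob)

# Hypothetical per-word bigram probabilities for a short test sequence:
print(perplexity([0.25, 0.33, 0.0011, 0.5, 0.68]))  # ~7.98
```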

Page 20: Estimating N-gram Probabilities Language Modeling

Dan Jurafsky

Perplexity as branching factor

• Let’s suppose a sentence consisting of random digits.
• What is the perplexity of this sentence according to a model that assigns P = 1/10 to each digit?
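Working this out from the perplexity definition on the previous slide:

PP(W) = P(w_1 w_2 \ldots w_N)^{-1/N} = \left( \left( \tfrac{1}{10} \right)^{N} \right)^{-1/N} = \left( \tfrac{1}{10} \right)^{-1} = 10

So the perplexity equals the branching factor: at every position the model is, in effect, choosing among 10 equally likely digits.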

Page 21: Estimating N-gram Probabilities Language Modeling

Dan Jurafsky

Lower perplexity = better model

Ex:
• We trained unigram, bigram, and trigram grammars on 38 million words (training set) from the Wall Street Journal, using a 19,979-word vocabulary.
• We then computed the perplexity of each of these models on a test set of 1.5 million words.

N-gram Order    Unigram    Bigram    Trigram
Perplexity      962        170       109