september 2003 1 basic techniques in statistical nlp word prediction n-grams smoothing

September 2003 1

BASIC TECHNIQUES IN STATISTICAL NLP

Word predictionn-gramssmoothing

September 2003 2

Statistical Methods in NLE

Two characteristics of NL make it desirable to endow programs with the ability to LEARN from examples of past use:

– VARIETY (no programmer can really take into account all possibilities)

– AMBIGUITY (need to have ways of choosing between alternatives)

In a number of NLE applications, statistical methods are very common

The simplest application: WORD PREDICTION

September 2003 3

We are good at word prediction

Stocks plunged this morning, despite a cut in interestStocks plunged this morning, despite a cut in interestrates by the Federal Reserve, as WallStocks plunged this morning, despite a cut in interestrates by the Federal Reserve, as WallStreet began ….

September 2003 4

Real Spelling Errors

They are leaving in about fifteen minuets to go to her house

The study was conducted mainly be John Black.

The design an construction of the system will take more than one year.

Hopefully, all with continue smoothly in my absence.

Can they lave him my messages?

I need to notified the bank of this problem.

He is trying to fine out.

September 2003 5

Handwriting recognition

From Woody Allen’s Take the Money and Run (1969)– Allen (a bank robber), walks up to the teller and

hands her a note that reads. "I have a gun. Give me all your cash."

The teller, however, is puzzled, because he reads "I have a gub." "No, it's gun", Allen says. "Looks like 'gub' to me," the teller says, then asks another teller to help him read the note, then another, and finally everyone is arguing over what the note means.

September 2003 6

Applications of word prediction

Spelling checkers Mobile phone texting Speech recognition Handwriting recognition Disabled users

September 2003 7

Statistics and word prediction

The basic idea underlying the statistical approach to word prediction is to use the probabilities of SEQUENCES OF WORDS to choose the most likely next word / correction of spelling error

I.e., to compute

For all words w, and predict as next word the one for which this (conditional) probability is highest.

P(w | W1 …. WN-1)

September 2003 8

Using corpora to estimate probabilities

But where do we get these probabilities? Idea: estimate them by RELATIVE FREQUENCY.

The simplest method: Maximum Likelihood Estimate (MLE). Count the number of words in a corpus, then count how many times a given sequence is encountered.

‘Maximum’ because doesn’t waste any probability on events not in the corpus

WWCWWP nn

)..()..( 1

September 2003 9

Maximum Likelihood Estimation for conditional probabilities

In order to estimate P(w|W1 … WN), we can use instead:

Cfr.: – P(A|B) = P(A&B) / P(B)

)..()..|(

nnn WWC

WWCWWWP

September 2003 10

Aside: counting words in corpora

Keep in mind that it’s not always so obvious what ‘a word’ is (cfr. yesterday)

In text:– He stepped out into the hall, was delighted to encounter a

brother. (From the Brown corpus.)

In speech:– I do uh main- mainly business data processing

LEMMAS: cats vs cat TYPES vs. TOKENS

September 2003 11

The problem: sparse data

In principle, we would like the n of our models to be fairly large, to model ‘long distance’ dependencies such as:– Sue SWALLOWED the large green …

However, in practice, most events of encountering sequences of words of length greater than 3 hardly ever occur in our corpora! (See below)

(Part of the) Solution: we APPROXIMATE the probability of a word given all previous words

September 2003 12

The Markov Assumption

The probability of being in a certain state only depends on the previous state:

P(Xn = Sk| X1 … Xn-1) = P(Xn = Sk|Xn-1)

This is equivalent to the assumption that the next state only depends on the previous m inputs, for m finite

(N-gram models / Markov models can be seen as probabilistic finite state automata)

September 2003 13

The Markov assumption for language: n-grams models

Making the Markov assumption for word prediction means assuming that the probability of a word only depends on the previous n words (N-GRAM model)

)..|()..|( 1111 nNnnnn WWWPWWWP

September 2003 14

Bigrams and trigrams

Typical values of n are 2 or 3 (BIGRAM or TRIGRAM models):

P(Wn|W1 ….. W n-1) ~ P(Wn|W n-2,W n-1)

P(W1,…Wn) ~ П P(Wi| W i-2,W i-1) What bigram model means in practice:

– Instead of P(rabbit|Just the other day I saw a)– We use P(rabbit|a)

Unigram: P(dog)Bigram: P(dog|big)Trigram: P(dog|the,big)

September 2003 15

The chain rule

So how can we compute the probability of sequences of words longer than 2 or 3? We use the CHAIN RULE:

E.g., – P(the big dog) = P(the) P(big|the) P(dog|the big)

Then we use the Markov assumption to reduce this to manageable proportions:

)..|()..|()|()()..( 112131211 nnn WWWPWWWPWWPWPWWP

)|()..|()|()()..( 122131211 nnnn WWWPWWWPWWPWPWWP

September 2003 16

Example: the Berkeley Restaurant Project (BERP) corpus

BERP is a speech-based restaurant consultant The corpus contains user queries; examples

include– I’m looking for Cantonese food– I’d like to eat dinner someplace nearby– Tell me about Chez Panisse– I’m looking for a good place to eat breakfast

September 2003 17

Computing the probability of a sentence

Given a corpus like BERP, we can compute the probability of a sentence like “I want to eat Chinese food”

September 2003 18

Bigram counts

September 2003 19

How the bigram probabilities are computed

Example of P(I,I):– C(“I”,”I”): 8– C(“I”): 8 + 1087 + 13 …. = 3437– P(“I”|”I”) = 8 / 3437 = .0023

September 2003 20

Bigram probabilities

September 2003 21

The probability of the example sentence

P(I want to eat Chinese food) P(I|”sentence start”) * P(want|I) * P(to|want) *

P(eat|to) * P(Chinese|eat) * P(food|Chinese) = .25 * .32 * .65 * .26 * .002 * .60 = .000016

September 2003 22

Examples of actual bigram probabilities computed using BERP

September 2003 23

Visualizing an n-gram based language model: the Shannon/Miller/Selfridge method

For unigrams:– Choose a random value r between 0 and 1– Print out w such that P(w) = r

For bigrams:– Choose a random bigram P(w|<s>)– Then pick up bigrams to follow as before

September 2003 24

The Shannon/Miller/Selfridge method trained on Shakespeare

September 2003 25

Approximating Shakespeare, cont’d

September 2003 26

A more formal evaluation mechanism

Entropy Cross-entropy

September 2003 27

The downside

The entire Shakespeare oeuvre consists of – 884,647 tokens (N)– 29,066 types (V)– 300,000 bigrams

All of Jane Austen’s novels (on Manning and Schuetze’s website): – N = 617,091 tokens– V = 14,585 types

September 2003 28

Comparing Austen n-grams: unigrams

In person

she was inferior to

1-gram P(.) P(.) P(.) P(.)

1 the .034 the .034 the .034 the .034

2 to .032 to .032 to .032 to .032

3 and .030 and .030 and .030

8 was .015 was .015

13 she .011

1701 inferior .00005

September 2003 29

Comparing Austen n-grams: bigrams

In person

she was inferior to

2-gram P(.|person) P(.|she) P(.|was) P(.inferior)

1 and .099 had .0141 not .065 to .212

2 who .099 was .122 a .052

23 she .009

inferior 0

September 2003 30

Comparing Austen n-grams: trigrams

In person

she was inferior to

3-gram P(.|In,person) P(.|person, she)

P(.|she,was)

P(.was,inferior)

1 UNSEEN did .05 not .057 UNSEEN

2 was .05 very .038

inferior 0

September 2003 31

Maybe with a larger corpus?

Words such as ‘ergativity’ unlikely to be found outside a corpus of linguistic articles

september 2003 1 basic techniques in statistical nlp word prediction n-grams smoothing

word prediction slide

corpus slide

word prediction stocks

pab pb slide

words w

statistical methods

conditional probabilities

number of words

Documents

exponential smoothing method - web.uettaxila.edu.pk€¦ ·...

n-grams: probabilistic language modeling - jarrar ·...

smoothing this dark art is why nlp is taught in the...

nlp practitioner heart of nlp - nlp courses

exponential smoothing methods.ppt -...

nlp master - landsiedel nlp

language models for text recognition: an...

applications - deep learninggoodfellow 2016) natural...

nlp uebungen der nlp ausbildung im nlp uebungsbuch

nlp: n-grams - dan garrette · nlp: n-grams dan garrette...

lecture 5: language models and...

triple beam balance - reading...triple beam balance 2 record...

nlp-automata in nlp

brownies recipe 4 eggs 200 grams flour 150 grams sugar 50...

speech & nlp (fall 2014): n-grams, n-gram computation, word...

language models & smoothing shallow processing techniques...

cmsc 723 / ling 645: intro to computational linguistics...

&->31@ %4;

fall 2004 1 basic techniques in statistical nlp word...

si485i : nlp set 4 smoothing language models fall 2012 :...