N-Gram Model Formulas



Page 1: N-Gram Model Formulas

N-Gram Model Formulas

• Word sequences

  $w_1^n = w_1 \ldots w_n$

• Chain rule of probability

  $P(w_1^n) = P(w_1)\,P(w_2 \mid w_1)\,P(w_3 \mid w_1^2) \cdots P(w_n \mid w_1^{n-1}) = \prod_{k=1}^{n} P(w_k \mid w_1^{k-1})$

• Bigram approximation

  $P(w_1^n) \approx \prod_{k=1}^{n} P(w_k \mid w_{k-1})$

• N-gram approximation

  $P(w_1^n) \approx \prod_{k=1}^{n} P(w_k \mid w_{k-N+1}^{k-1})$
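For instance, under the bigram approximation a sentence probability is just a product of conditional probabilities. The short Python sketch below illustrates this with a made-up probability table (the numbers and the <s>/</s> padding are assumptions for illustration, not values from the slides):

# Bigram approximation sketch: P(w_1^n) ~= product over k of P(w_k | w_{k-1}).
# The probability table is invented purely for illustration.
bigram_prob = {
    ("<s>", "i"): 0.25,
    ("i", "want"): 0.33,
    ("want", "food"): 0.05,
    ("food", "</s>"): 0.68,
}

def bigram_sentence_prob(words):
    # Pad with sentence-boundary symbols, then multiply bigram probabilities.
    padded = ["<s>"] + words + ["</s>"]
    prob = 1.0
    for prev, cur in zip(padded, padded[1:]):
        prob *= bigram_prob.get((prev, cur), 0.0)
    return prob

print(bigram_sentence_prob(["i", "want", "food"]))  # 0.25 * 0.33 * 0.05 * 0.68 ≈ 0.0028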

Page 2: N-Gram Model Formulas

Estimating Probabilities

• N-gram conditional probabilities can be estimated from raw text based on the relative frequency of word sequences.

• To have a consistent probabilistic model, append a unique start (<s>) and end (</s>) symbol to every sentence and treat these as additional words.

Bigram:

  $P(w_n \mid w_{n-1}) = \dfrac{C(w_{n-1}\, w_n)}{C(w_{n-1})}$

N-gram:

  $P(w_n \mid w_{n-N+1}^{n-1}) = \dfrac{C(w_{n-N+1}^{n-1}\, w_n)}{C(w_{n-N+1}^{n-1})}$
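A minimal Python sketch of this relative-frequency (maximum-likelihood) estimate for bigrams; the two-sentence corpus is a made-up example, with <s> and </s> already appended as described above:

from collections import Counter

# Toy corpus with sentence-boundary symbols appended; purely illustrative data.
corpus = [["<s>", "i", "want", "food", "</s>"],
          ["<s>", "i", "want", "tea", "</s>"]]

unigram_counts = Counter(w for sent in corpus for w in sent)
bigram_counts = Counter((a, b) for sent in corpus for a, b in zip(sent, sent[1:]))

def bigram_mle(w_prev, w):
    # P(w | w_prev) = C(w_prev w) / C(w_prev), the relative-frequency estimate.
    return bigram_counts[(w_prev, w)] / unigram_counts[w_prev]

print(bigram_mle("i", "want"))    # 2/2 = 1.0
print(bigram_mle("want", "tea"))  # 1/2 = 0.5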

Page 3: N-Gram Model Formulas

Perplexity

• Measure of how well a model “fits” the test data.

• Uses the probability that the model assigns to the test corpus.

• Normalizes for the number of words in the test corpus and takes the inverse.

  $PP(W) = \sqrt[N]{\dfrac{1}{P(w_1 w_2 \ldots w_N)}}$

• Measures the weighted average branching factor in predicting the next word (lower is better).
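A small sketch of the computation; working in log space avoids underflow on long test corpora, and the per-word log probabilities would come from whatever N-gram model is being evaluated (the 0.1 values here are invented):

import math

def perplexity(log_probs_per_word):
    # PP(W) = P(w_1 ... w_N) ** (-1/N), computed from per-word log probabilities.
    N = len(log_probs_per_word)
    avg_log_prob = sum(log_probs_per_word) / N
    return math.exp(-avg_log_prob)

# Four test words, each assigned probability 0.1 by the model:
print(perplexity([math.log(0.1)] * 4))  # ≈ 10.0, the average branching factor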

Page 4: N-Gram Model Formulas

Laplace (Add-One) Smoothing

• “Hallucinate” additional training data in which each possible N-gram occurs exactly once and adjust estimates accordingly.

where V is the total number of possible (N-1)-grams (i.e. the vocabulary size for a bigram model).

Bigram:

  $P(w_n \mid w_{n-1}) = \dfrac{C(w_{n-1}\, w_n) + 1}{C(w_{n-1}) + V}$

N-gram:

  $P(w_n \mid w_{n-N+1}^{n-1}) = \dfrac{C(w_{n-N+1}^{n-1}\, w_n) + 1}{C(w_{n-N+1}^{n-1}) + V}$

• Tends to reassign too much mass to unseen events, so it can be adjusted to add a fractional count δ with 0 < δ < 1 (normalized by δV instead of V).
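A sketch of the add-one estimate for bigrams, using invented toy counts; V is taken here to be the number of distinct word types seen in training:

from collections import Counter

# Toy counts; in practice these come from the training corpus as on the previous slide.
unigram_counts = Counter({"i": 2, "want": 2, "food": 1, "tea": 1})
bigram_counts = Counter({("i", "want"): 2, ("want", "food"): 1, ("want", "tea"): 1})
V = len(unigram_counts)  # vocabulary size for the bigram model

def bigram_laplace(w_prev, w):
    # P(w | w_prev) = (C(w_prev w) + 1) / (C(w_prev) + V)
    return (bigram_counts[(w_prev, w)] + 1) / (unigram_counts[w_prev] + V)

print(bigram_laplace("want", "tea"))  # (1 + 1) / (2 + 4) ≈ 0.333
print(bigram_laplace("want", "i"))    # unseen bigram: (0 + 1) / (2 + 4) ≈ 0.167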

Page 5: N-Gram Model Formulas

Interpolation

• Linearly combine estimates of N-gram models of increasing order.

• Learn proper values for the λi by training to (approximately) maximize the likelihood of an independent development (a.k.a. tuning) corpus.

Interpolated Trigram Model:

  $\hat{P}(w_n \mid w_{n-2}\, w_{n-1}) = \lambda_1 P(w_n \mid w_{n-2}\, w_{n-1}) + \lambda_2 P(w_n \mid w_{n-1}) + \lambda_3 P(w_n)$

Where:

  $\sum_i \lambda_i = 1$
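A sketch of the interpolated estimate; the component model functions and the λ weights below are placeholders (in practice the weights are tuned on the development corpus as described above):

# Interpolated trigram sketch; the weights must sum to 1 and are placeholders here.
lambdas = (0.6, 0.3, 0.1)

def interpolated_trigram(w, w_prev2, w_prev1, p_tri, p_bi, p_uni):
    # P_hat(w | w_prev2 w_prev1) = l1*P(w | w_prev2 w_prev1) + l2*P(w | w_prev1) + l3*P(w)
    l1, l2, l3 = lambdas
    return (l1 * p_tri(w, w_prev2, w_prev1)
            + l2 * p_bi(w, w_prev1)
            + l3 * p_uni(w))

# Example with constant stub models standing in for trained trigram/bigram/unigram models:
print(interpolated_trigram("food", "i", "want",
                           p_tri=lambda w, a, b: 0.2,
                           p_bi=lambda w, a: 0.1,
                           p_uni=lambda w: 0.01))  # 0.6*0.2 + 0.3*0.1 + 0.1*0.01 = 0.151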

Page 6: N-Gram Model Formulas


Formal Definition of an HMM

• A set of N+2 states S = {s0, s1, s2, …, sN, sF}
  – Distinguished start state: s0
  – Distinguished final state: sF

• A set of M possible observations V = {v1, v2, …, vM}

• A state transition probability distribution A = {aij}

• Observation probability distribution for each state j B={bj(k)}

• Total parameter set λ={A,B}

  $a_{ij} = P(q_{t+1} = s_j \mid q_t = s_i) \quad 1 \le i, j \le N \text{ and } i = 0,\ j = F$

  $b_j(k) = P(v_k \text{ at } t \mid q_t = s_j) \quad 1 \le j \le N,\ 1 \le k \le M$

  $\sum_{j=1}^{N} a_{ij} + a_{iF} = 1, \quad 0 \le i \le N$
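One possible way to hold λ = {A, B} in code; this is a sketch with an invented two-state example, where row/column 0 stands for the start state s0 and a separate vector holds the a_iF transitions into the final state sF:

# Toy HMM with N = 2 real states plus the distinguished start (s0) and final (sF) states.
# A[i][j] = P(q_{t+1} = s_j | q_t = s_i) for j = 1..N; column 0 is unused.
N = 2
A   = [[0.0, 0.5, 0.4],   # transitions out of s0
       [0.0, 0.6, 0.3],   # transitions out of s1
       [0.0, 0.2, 0.7]]   # transitions out of s2
a_F = [0.1, 0.1, 0.1]     # a_iF: transition from each state into sF
# B[j][k] = b_j(k) = P(v_k at t | q_t = s_j), with M = 2 observation symbols v0, v1.
B   = [None,              # s0 emits nothing
       [0.9, 0.1],
       [0.3, 0.7]]

# Check the constraint: sum_{j=1..N} a_ij + a_iF = 1 for every i in 0..N.
for i in range(N + 1):
    assert abs(sum(A[i][1:]) + a_F[i] - 1.0) < 1e-9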

Page 7: N-Gram Model Formulas

Forward Probabilities

• Let αt(j) be the probability of being in state j after seeing the first t observations (by summing over all initial paths leading to j).


  $\alpha_t(j) = P(o_1, o_2, \ldots, o_t,\ q_t = s_j \mid \lambda)$

Page 8: N-Gram Model Formulas

Computing the Forward Probabilities

• Initialization

  $\alpha_1(j) = a_{0j}\, b_j(o_1) \quad 1 \le j \le N$

• Recursion

  $\alpha_t(j) = \left[\sum_{i=1}^{N} \alpha_{t-1}(i)\, a_{ij}\right] b_j(o_t) \quad 1 \le j \le N,\ 1 < t \le T$

• Termination

  $P(O \mid \lambda) = \alpha_{T+1}(s_F) = \sum_{i=1}^{N} \alpha_T(i)\, a_{iF}$
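A direct Python transcription of these three steps, reusing the invented two-state parameters from the HMM-definition sketch; the observation sequence is also made up:

# Forward algorithm sketch (toy parameters repeated here so the snippet stands alone).
A   = [[0.0, 0.5, 0.4], [0.0, 0.6, 0.3], [0.0, 0.2, 0.7]]
a_F = [0.1, 0.1, 0.1]
B   = [None, [0.9, 0.1], [0.3, 0.7]]
N   = 2

def forward(obs):
    # Returns P(O | lambda); alpha[t][j] is indexed with t = 1..T and j = 1..N.
    T = len(obs)
    alpha = [[0.0] * (N + 1) for _ in range(T + 1)]
    for j in range(1, N + 1):                      # initialization
        alpha[1][j] = A[0][j] * B[j][obs[0]]
    for t in range(2, T + 1):                      # recursion
        for j in range(1, N + 1):
            alpha[t][j] = sum(alpha[t - 1][i] * A[i][j]
                              for i in range(1, N + 1)) * B[j][obs[t - 1]]
    return sum(alpha[T][i] * a_F[i] for i in range(1, N + 1))   # termination

print(forward([0, 1, 0]))  # probability of observing v0, v1, v0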

Page 9: N-Gram Model Formulas

Viterbi Scores

• Recursively compute the probability of the most likely subsequence of states that accounts for the first t observations and ends in state sj.


  $v_t(j) = \max_{q_0, q_1, \ldots, q_{t-1}} P(q_0, q_1, \ldots, q_{t-1},\ o_1, \ldots, o_t,\ q_t = s_j \mid \lambda)$

• Also record “backpointers” that subsequently allow backtracing the most probable state sequence: bt_t(j) stores the state at time t-1 that maximizes the probability that the system was in state sj at time t (given the observed sequence).

Page 10: N-Gram Model Formulas

Computing the Viterbi Scores

• Initialization

  $v_1(j) = a_{0j}\, b_j(o_1) \quad 1 \le j \le N$

• Recursion

  $v_t(j) = \max_{i=1}^{N} v_{t-1}(i)\, a_{ij}\, b_j(o_t) \quad 1 \le j \le N,\ 1 < t \le T$

• Termination

  $P^{*} = v_{T+1}(s_F) = \max_{i=1}^{N} v_T(i)\, a_{iF}$

Analogous to the Forward algorithm, except taking the max instead of the sum.

Page 11: N-Gram Model Formulas

Computing the Viterbi Backpointers

• Initialization

  $bt_1(j) = s_0 \quad 1 \le j \le N$

• Recursion

  $bt_t(j) = \operatorname*{argmax}_{i=1}^{N} v_{t-1}(i)\, a_{ij}\, b_j(o_t) \quad 1 \le j \le N,\ 1 < t \le T$

• Termination

  $q_T^{*} = bt_{T+1}(s_F) = \operatorname*{argmax}_{i=1}^{N} v_T(i)\, a_{iF}$

This gives the final state in the most probable state sequence; follow the backpointers back to the initial state to construct the full sequence.
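Putting the score, backpointer, and backtrace steps together, a sketch over the same invented two-state parameters; it returns both P* and the recovered state sequence:

# Viterbi sketch: identical structure to the forward algorithm, with max/argmax for sum,
# plus backpointers; toy parameters repeated so the snippet stands alone.
A   = [[0.0, 0.5, 0.4], [0.0, 0.6, 0.3], [0.0, 0.2, 0.7]]
a_F = [0.1, 0.1, 0.1]
B   = [None, [0.9, 0.1], [0.3, 0.7]]
N   = 2

def viterbi(obs):
    T = len(obs)
    v  = [[0.0] * (N + 1) for _ in range(T + 1)]   # Viterbi scores
    bt = [[0]   * (N + 1) for _ in range(T + 1)]   # backpointers (bt_1(j) = s0 = 0)
    for j in range(1, N + 1):                      # initialization
        v[1][j] = A[0][j] * B[j][obs[0]]
    for t in range(2, T + 1):                      # recursion
        for j in range(1, N + 1):
            scores = [v[t - 1][i] * A[i][j] * B[j][obs[t - 1]] for i in range(1, N + 1)]
            best = max(range(N), key=lambda k: scores[k])
            v[t][j], bt[t][j] = scores[best], best + 1
    # termination: best transition into sF, then follow backpointers to recover the path
    exit_scores = [v[T][i] * a_F[i] for i in range(1, N + 1)]
    last = max(range(N), key=lambda k: exit_scores[k]) + 1
    path = [last]
    for t in range(T, 1, -1):
        path.append(bt[t][path[-1]])
    path.reverse()
    return exit_scores[last - 1], path

print(viterbi([0, 1, 0]))  # ≈ (0.00198, [1, 2, 2]) for this toy model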

Page 12: N-Gram Model Formulas

Supervised Parameter Estimation

• Estimate state transition probabilities based on tag bigram and unigram statistics in the labeled data.

• Estimate the observation probabilities based on tag/word co-occurrence statistics in the labeled data.

• Use appropriate smoothing if training data is sparse.

  $a_{ij} = \dfrac{C(q_t = s_i,\ q_{t+1} = s_j)}{C(q_t = s_i)}$

  $b_j(k) = \dfrac{C(q_i = s_j,\ o_i = v_k)}{C(q_i = s_j)}$
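A counting sketch for both estimates over a tiny invented tagged corpus (word/tag pairs); real data would of course be much larger, and smoothing would be applied as noted above:

from collections import Counter

# Toy labeled data: each sentence is a sequence of (word, tag) pairs; purely illustrative.
tagged = [[("the", "DET"), ("dog", "N"), ("barks", "V")],
          [("the", "DET"), ("cat", "N"), ("sleeps", "V")]]

tag_counts = Counter(tag for sent in tagged for _, tag in sent)
tag_bigram_counts = Counter((t1, t2) for sent in tagged
                            for (_, t1), (_, t2) in zip(sent, sent[1:]))
tag_word_counts = Counter((tag, word) for sent in tagged for word, tag in sent)

def a_hat(tag_i, tag_j):
    # a_ij = C(q_t = s_i, q_{t+1} = s_j) / C(q_t = s_i)
    return tag_bigram_counts[(tag_i, tag_j)] / tag_counts[tag_i]

def b_hat(tag_j, word):
    # b_j(k) = C(q_i = s_j, o_i = v_k) / C(q_i = s_j)
    return tag_word_counts[(tag_j, word)] / tag_counts[tag_j]

print(a_hat("DET", "N"))  # 2/2 = 1.0
print(b_hat("N", "dog"))  # 1/2 = 0.5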

Page 13: N-Gram Model Formulas

Context Free Grammars (CFG)

• N, a set of non-terminal symbols (or variables)

• Σ, a set of terminal symbols (disjoint from N)

• R, a set of productions or rules of the form A→β, where A is a non-terminal and β is a string of symbols from (Σ ∪ N)*

• S, a designated non-terminal called the start symbol
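A minimal sketch of how such a grammar might be represented in code; the toy symbols and rules below are made up for illustration:

# A CFG as a 4-tuple (N, Sigma, R, S); the tiny grammar here is purely illustrative.
nonterminals = {"S", "NP", "VP", "DET", "N", "V"}
terminals = {"the", "dog", "barks"}                  # disjoint from the non-terminals
rules = [("S", ("NP", "VP")),                        # A -> beta, beta in (Sigma U N)*
         ("NP", ("DET", "N")),
         ("VP", ("V",)),
         ("DET", ("the",)), ("N", ("dog",)), ("V", ("barks",))]
start_symbol = "S"

# Sanity check: every rule rewrites a non-terminal into symbols from (Sigma U N)*.
for lhs, rhs in rules:
    assert lhs in nonterminals
    assert all(sym in nonterminals | terminals for sym in rhs)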

Page 14: N-Gram Model Formulas

Estimating Production Probabilities

• The set of production rules can be taken directly from the set of rewrites in the treebank.

• Parameters can be directly estimated from frequency counts in the treebank.

  $P(\alpha \to \beta \mid \alpha) = \dfrac{\text{count}(\alpha \to \beta)}{\sum_{\gamma} \text{count}(\alpha \to \gamma)} = \dfrac{\text{count}(\alpha \to \beta)}{\text{count}(\alpha)}$
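A sketch of that relative-frequency estimate, assuming the treebank rewrites have already been extracted as (left-hand side, right-hand side) pairs; the rules listed are invented:

from collections import Counter

# Toy list of productions read off a treebank; purely illustrative.
rules = [("S", ("NP", "VP")), ("NP", ("DET", "N")), ("NP", ("DET", "N")),
         ("NP", ("N",)), ("VP", ("V", "NP")), ("VP", ("V",))]

rule_counts = Counter(rules)
lhs_counts = Counter(lhs for lhs, _ in rules)

def rule_prob(lhs, rhs):
    # P(alpha -> beta | alpha) = count(alpha -> beta) / count(alpha)
    return rule_counts[(lhs, rhs)] / lhs_counts[lhs]

print(rule_prob("NP", ("DET", "N")))  # 2/3
print(rule_prob("VP", ("V",)))        # 1/2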