N-Gram Model Formulas
• Word sequences: $w_1^n = w_1, w_2, \ldots, w_n$
• Chain rule of probability:
  $P(w_1^n) = P(w_1)\,P(w_2 \mid w_1)\,P(w_3 \mid w_1^2) \cdots P(w_n \mid w_1^{n-1}) = \prod_{k=1}^{n} P(w_k \mid w_1^{k-1})$
• Bigram approximation:
  $P(w_1^n) \approx \prod_{k=1}^{n} P(w_k \mid w_{k-1})$
• N-gram approximation:
  $P(w_1^n) \approx \prod_{k=1}^{n} P(w_k \mid w_{k-N+1}^{k-1})$
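A minimal sketch of the bigram approximation in Python: the sentence probability is the product of each word's probability given its predecessor. The bigram_prob values here are invented for illustration, not estimates from any corpus; <s> and </s> padding is introduced on the next slide.

bigram_prob = {
    ("<s>", "i"): 0.25,
    ("i", "want"): 0.33,
    ("want", "food"): 0.05,
    ("food", "</s>"): 0.40,
}

def sentence_prob(words):
    """P(w_1^n) ~= product over k of P(w_k | w_{k-1}), with <s>/</s> padding."""
    padded = ["<s>"] + words + ["</s>"]
    p = 1.0
    for prev, curr in zip(padded, padded[1:]):
        p *= bigram_prob.get((prev, curr), 0.0)  # unseen bigram -> 0 under MLE
    return p

print(sentence_prob(["i", "want", "food"]))  # 0.25 * 0.33 * 0.05 * 0.40 = 0.00165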
Estimating Probabilities
• N-gram conditional probabilities can be estimated from raw text based on the relative frequency of word sequences.
• To have a consistent probabilistic model, append a unique start (<s>) and end (</s>) symbol to every sentence and treat these as additional words.
Bigram:
  $P(w_n \mid w_{n-1}) = \dfrac{C(w_{n-1} w_n)}{C(w_{n-1})}$

N-gram:
  $P(w_n \mid w_{n-N+1}^{n-1}) = \dfrac{C(w_{n-N+1}^{n-1} w_n)}{C(w_{n-N+1}^{n-1})}$
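A sketch of this relative-frequency estimation for bigrams, with <s>/</s> padding as described above; the two-sentence corpus is a toy example of my own.

from collections import Counter

corpus = [["i", "want", "food"], ["i", "want", "sleep"]]  # toy corpus

unigrams, bigrams = Counter(), Counter()
for sent in corpus:
    padded = ["<s>"] + sent + ["</s>"]
    unigrams.update(padded)
    bigrams.update(zip(padded, padded[1:]))

def p_bigram(curr, prev):
    """P(w_n | w_{n-1}) = C(w_{n-1} w_n) / C(w_{n-1})."""
    return bigrams[(prev, curr)] / unigrams[prev]

print(p_bigram("want", "i"))     # C(i want)/C(i) = 2/2 = 1.0
print(p_bigram("food", "want"))  # C(want food)/C(want) = 1/2 = 0.5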
Perplexity
• Measure of how well a model "fits" the test data.
• Uses the probability that the model assigns to the test corpus.
• Normalizes for the number of words in the test corpus and takes the inverse.
  $PP(W) = \sqrt[N]{\dfrac{1}{P(w_1 w_2 \ldots w_N)}}$
• Measures the weighted average branching factor in predicting the next word (lower is better).
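A sketch of the computation, reusing p_bigram from the estimation example above; working in log space avoids numeric underflow on long test corpora.

import math

def perplexity(test_corpus):
    """PP(W) = P(w_1 ... w_N) ** (-1/N), computed in log space."""
    log_prob, n_words = 0.0, 0
    for sent in test_corpus:
        padded = ["<s>"] + sent + ["</s>"]
        for prev, curr in zip(padded, padded[1:]):
            log_prob += math.log(p_bigram(curr, prev))
            n_words += 1
    return math.exp(-log_prob / n_words)

print(perplexity([["i", "want", "food"]]))  # ~1.19 on the toy model above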
Laplace (Add-One) Smoothing
• “Hallucinate” additional training data in which each possible N-gram occurs exactly once and adjust estimates accordingly.
where V is the total number of possible (N−1)-grams (i.e., the vocabulary size for a bigram model).
Bigram:
  $P(w_n \mid w_{n-1}) = \dfrac{C(w_{n-1} w_n) + 1}{C(w_{n-1}) + V}$

N-gram:
  $P(w_n \mid w_{n-N+1}^{n-1}) = \dfrac{C(w_{n-N+1}^{n-1} w_n) + 1}{C(w_{n-N+1}^{n-1}) + V}$
• Tends to reassign too much mass to unseen events, so it can be adjusted to add $0 < \delta < 1$ instead of 1 (normalized by $\delta V$ instead of $V$). A sketch of both variants follows.
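A sketch of add-one smoothing over the counts from the estimation example above, with a delta parameter so the same function also covers the add-δ variant; the vocabulary and data are the earlier toy values.

vocab = {w for sent in corpus for w in sent} | {"</s>"}
V = len(vocab)  # vocabulary size, i.e. possible contexts for a bigram model

def p_laplace(curr, prev, delta=1.0):
    """(C(w_{n-1} w_n) + delta) / (C(w_{n-1}) + delta * V); delta=1 is add-one."""
    return (bigrams[(prev, curr)] + delta) / (unigrams[prev] + delta * V)

print(p_laplace("food", "want"))   # seen bigram: (1+1)/(2+5) ~ 0.286
print(p_laplace("sleep", "food"))  # unseen bigram gets nonzero mass: 1/6 ~ 0.167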
Interpolation
• Linearly combine estimates of N-gram models of increasing order.
• Learn proper values for the $\lambda_i$ by training to (approximately) maximize the likelihood of an independent development (a.k.a. tuning) corpus.
Interpolated Trigram Model:
  $\hat{P}(w_n \mid w_{n-2} w_{n-1}) = \lambda_1 P(w_n \mid w_{n-2} w_{n-1}) + \lambda_2 P(w_n \mid w_{n-1}) + \lambda_3 P(w_n)$

Where: $\sum_i \lambda_i = 1$
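A sketch of the interpolated estimate; the component models are passed in as functions, and the lambda weights shown are arbitrary placeholders rather than tuned values.

def p_interpolated(w, prev2, prev1, p_tri, p_bi, p_uni,
                   lambdas=(0.6, 0.3, 0.1)):
    """lambda1*P(w | w_{n-2} w_{n-1}) + lambda2*P(w | w_{n-1}) + lambda3*P(w).
    The lambdas must sum to 1; in practice they are tuned on held-out data."""
    l1, l2, l3 = lambdas
    return (l1 * p_tri(w, prev2, prev1)
            + l2 * p_bi(w, prev1)
            + l3 * p_uni(w))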
Formal Definition of an HMM
• A set of N+2 states $S = \{s_0, s_1, s_2, \ldots, s_N, s_F\}$
  – Distinguished start state: $s_0$
  – Distinguished final state: $s_F$
• A set of M possible observations $V = \{v_1, v_2, \ldots, v_M\}$
• A state transition probability distribution $A = \{a_{ij}\}$
  $a_{ij} = P(q_{t+1} = s_j \mid q_t = s_i), \quad 1 \le i, j \le N, \text{ and } i = 0,\ j = F$
  $\sum_{j=1}^{N} a_{ij} + a_{iF} = 1, \quad 0 \le i \le N$
• Observation probability distribution for each state $j$: $B = \{b_j(k)\}$
  $b_j(k) = P(v_k \text{ at } t \mid q_t = s_j), \quad 1 \le j \le N,\ 1 \le k \le M$
• Total parameter set $\lambda = \{A, B\}$
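A minimal sketch of this parameter set as NumPy arrays, used by the Forward and Viterbi sketches below. The indexing convention (0 for the start state, 1..N for emitting states, N+1 for the final state) and all probability values are my own illustrative choices.

import numpy as np

N, M = 2, 3                    # N emitting states, M observation symbols
F = N + 1                      # index of the final state s_F
A = np.zeros((N + 2, N + 2))   # A[i, j] = P(q_{t+1} = s_j | q_t = s_i)
B = np.zeros((N + 1, M))       # B[j, k] = P(v_k | q_t = s_j), rows 1..N used

# Illustrative values; each row of A must satisfy sum_j a_ij + a_iF = 1.
A[0, 1:N + 1] = [0.7, 0.3]       # transitions out of the start state s_0
A[1, 1:F + 1] = [0.4, 0.4, 0.2]  # s_1 -> s_1, s_2, s_F
A[2, 1:F + 1] = [0.3, 0.5, 0.2]  # s_2 -> s_1, s_2, s_F
B[1] = [0.5, 0.3, 0.2]           # emissions from s_1
B[2] = [0.1, 0.2, 0.7]           # emissions from s_2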
Forward Probabilities
• Let $\alpha_t(j)$ be the probability of being in state $j$ after seeing the first $t$ observations (by summing over all initial paths leading to $j$).
  $\alpha_t(j) = P(o_1, o_2, \ldots, o_t,\ q_t = s_j \mid \lambda)$
Computing the Forward Probabilities
• Initialization
  $\alpha_1(j) = a_{0j}\, b_j(o_1), \quad 1 \le j \le N$
• Recursion
  $\alpha_t(j) = \left[ \sum_{i=1}^{N} \alpha_{t-1}(i)\, a_{ij} \right] b_j(o_t), \quad 1 \le j \le N,\ 1 < t \le T$
• Termination
  $P(O \mid \lambda) = \alpha_{T+1}(s_F) = \sum_{i=1}^{N} \alpha_T(i)\, a_{iF}$
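A sketch of these three steps in Python, using the A, B, and N defined in the HMM sketch above; obs is a list of observation indices in 0..M−1 (my own encoding).

import numpy as np

def forward(obs, A, B, N):
    """Return P(O | lambda) by the Forward algorithm."""
    T = len(obs)
    alpha = np.zeros((T + 1, N + 1))             # alpha[t, j] for t = 1..T
    for j in range(1, N + 1):                    # initialization
        alpha[1, j] = A[0, j] * B[j, obs[0]]
    for t in range(2, T + 1):                    # recursion
        for j in range(1, N + 1):
            alpha[t, j] = sum(alpha[t - 1, i] * A[i, j]
                              for i in range(1, N + 1)) * B[j, obs[t - 1]]
    F = N + 1                                    # termination
    return sum(alpha[T, i] * A[i, F] for i in range(1, N + 1))

print(forward([0, 2, 1], A, B, N))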
Viterbi Scores
• Recursively compute the probability of the most likely subsequence of states that accounts for the first t observations and ends in state sj.
$v_t(j) = \max_{q_0, q_1, \ldots, q_{t-1}} P(q_0, q_1, \ldots, q_{t-1},\ o_1, \ldots, o_t,\ q_t = s_j \mid \lambda)$
• Also record "backpointers" that subsequently allow backtracing the most probable state sequence: $bt_t(j)$ stores the state at time $t-1$ that maximizes the probability that the system was in state $s_j$ at time $t$ (given the observed sequence).
Computing the Viterbi Scores
• Initialization
  $v_1(j) = a_{0j}\, b_j(o_1), \quad 1 \le j \le N$
• Recursion
  $v_t(j) = \max_{i=1}^{N} v_{t-1}(i)\, a_{ij}\, b_j(o_t), \quad 1 \le j \le N,\ 1 < t \le T$
• Termination
  $P^* = v_{T+1}(s_F) = \max_{i=1}^{N} v_T(i)\, a_{iF}$
Analogous to the Forward algorithm, except taking the max instead of the sum.
Computing the Viterbi Backpointers
• Initialization
  $bt_1(j) = s_0, \quad 1 \le j \le N$
• Recursion
  $bt_t(j) = \operatorname*{argmax}_{i=1}^{N}\, v_{t-1}(i)\, a_{ij}\, b_j(o_t), \quad 1 \le j \le N,\ 1 < t \le T$
• Termination
  $q_T^* = bt_{T+1}(s_F) = \operatorname*{argmax}_{i=1}^{N}\, v_T(i)\, a_{iF}$
Final state in the most probable state sequence. Follow backpointers to initial state to construct full sequence.
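A sketch combining the scores and backpointers, again over the A, B, and N from the HMM sketch above; it returns the most probable state sequence for the observations.

import numpy as np

def viterbi(obs, A, B, N):
    """Return the most probable state sequence (indices 1..N) for obs."""
    T, F = len(obs), N + 1
    v = np.zeros((T + 1, N + 1))
    bt = np.zeros((T + 1, N + 1), dtype=int)
    for j in range(1, N + 1):                    # initialization
        v[1, j] = A[0, j] * B[j, obs[0]]
        bt[1, j] = 0                             # points back to s_0
    for t in range(2, T + 1):                    # recursion
        for j in range(1, N + 1):
            scores = [v[t - 1, i] * A[i, j] * B[j, obs[t - 1]]
                      for i in range(1, N + 1)]
            v[t, j] = max(scores)
            bt[t, j] = 1 + int(np.argmax(scores))
    final = [v[T, i] * A[i, F] for i in range(1, N + 1)]  # termination
    states = [1 + int(np.argmax(final))]         # q_T*, then backtrace
    for t in range(T, 1, -1):
        states.insert(0, int(bt[t, states[0]]))
    return states

print(viterbi([0, 2, 1], A, B, N))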
Supervised Parameter Estimation
• Estimate state transition probabilities based on tag bigram and unigram statistics in the labeled data.
• Estimate the observation probabilities based on tag/word co-occurrence statistics in the labeled data.
• Use appropriate smoothing if training data is sparse.
$a_{ij} = \dfrac{C(q_t = s_i,\ q_{t+1} = s_j)}{C(q_t = s_i)}$

$b_j(k) = \dfrac{C(q_i = s_j,\ o_i = v_k)}{C(q_i = s_j)}$
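A sketch of these two relative-frequency estimates from a tagged corpus; the two tagged sentences are invented toy data, and <s>/</s> stand in for the start and final states.

from collections import Counter

tagged = [[("the", "DT"), ("dog", "NN"), ("barks", "VBZ")],
          [("the", "DT"), ("cat", "NN"), ("sleeps", "VBZ")]]  # toy data

tag_count, tag_bigram, tag_word = Counter(), Counter(), Counter()
for sent in tagged:
    tags = ["<s>"] + [t for _, t in sent] + ["</s>"]
    tag_count.update(tags)
    tag_bigram.update(zip(tags, tags[1:]))
    tag_word.update((t, w) for w, t in sent)

def a_hat(si, sj):
    """a_ij = C(q_t = s_i, q_{t+1} = s_j) / C(q_t = s_i)."""
    return tag_bigram[(si, sj)] / tag_count[si]

def b_hat(sj, vk):
    """b_j(k) = C(q = s_j, o = v_k) / C(q = s_j)."""
    return tag_word[(sj, vk)] / tag_count[sj]

print(a_hat("DT", "NN"), b_hat("NN", "dog"))  # 1.0, 0.5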
Context Free Grammars (CFG)
• N, a set of non-terminal symbols (or variables)
• Σ, a set of terminal symbols (disjoint from N)
• R, a set of productions or rules of the form A → β, where A is a non-terminal and β is a string of symbols from (Σ ∪ N)*
• S, a designated non-terminal called the start symbol
Estimating Production Probabilities
• Set of production rules can be taken directly from the set of rewrites in the treebank.
• Parameters can be directly estimated from frequency counts in the treebank.
$P(A \to \beta \mid A) = \dfrac{\text{count}(A \to \beta)}{\sum_{\gamma} \text{count}(A \to \gamma)} = \dfrac{\text{count}(A \to \beta)}{\text{count}(A)}$
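A sketch of this estimate; the production counts below are invented stand-ins for frequencies read off a treebank.

from collections import Counter

# count(A -> beta) for each production, as if tallied from a treebank
rule_counts = Counter({
    ("S", ("NP", "VP")): 100,
    ("NP", ("DT", "NN")): 60,
    ("NP", ("NNP",)): 40,
    ("VP", ("VBZ", "NP")): 70,
    ("VP", ("VBZ",)): 30,
})

# count(A) = sum over gamma of count(A -> gamma)
lhs_counts = Counter()
for (lhs, _), c in rule_counts.items():
    lhs_counts[lhs] += c

def rule_prob(lhs, rhs):
    """P(A -> beta | A) = count(A -> beta) / count(A)."""
    return rule_counts[(lhs, rhs)] / lhs_counts[lhs]

print(rule_prob("NP", ("DT", "NN")))  # 60 / 100 = 0.6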