TRANSCRIPT
Hidden Markov Models (II): Log-Likelihood (Forward Algorithm)
CMSC 473/673, UMBC
October 11th, 2017
Course Announcement: Assignment 2
Due next Saturday, 10/21 at 11:59 PM (~10 days)
Any questions?
Recap from last time…
Expectation Maximization (EM)
0. Assume some value for your parameters
Two step, iterative algorithm
1. E-step: count under uncertainty, assuming these parameters
2. M-step: maximize log-likelihood, assuming these uncertain counts
estimated counts
Counting Requires Marginalizing
[Diagram: the event w split into four disjoint pieces: (z1 & w), (z2 & w), (z3 & w), (z4 & w)]
p(w) = p(z1, w) + p(z2, w) + p(z3, w) + p(z4, w) = Σ_{i=1}^{4} p(zi, w)
E-step: count under uncertainty, assuming these parameters
break into 4 disjoint pieces
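As a tiny sketch of this marginalization (the joint probability values here are invented for illustration; only the sum-over-disjoint-pieces structure comes from the slide):

```python
# Marginalizing a joint distribution: p(w) is the sum of the joint
# probabilities p(z_i, w) over the 4 disjoint pieces.
# (These joint values are made up for illustration.)
joint = {("z1", "w"): 0.10, ("z2", "w"): 0.05,
         ("z3", "w"): 0.20, ("z4", "w"): 0.15}

p_w = sum(joint[(z, "w")] for z in ("z1", "z2", "z3", "z4"))
print(p_w)  # 0.5
```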
Agenda
HMM Motivation (Part of Speech) and Brief Definition
What is Part of Speech?
HMM Detailed Definition
HMM Tasks
Parts of Speech
Adapted from Luke Zettlemoyer
Closed class words
  Determiners: a, the, every, what
  Pronouns: I, you, there
  Prepositions: in, under, top
  Conjunctions: and, or, if, because
  Particles: (set) up, (call) off, not, so (far)
  modals, auxiliaries: can, do, may
  Numbers: one, 1,324
Open class words
  Nouns: milk, cat, cats, UMBC, Baltimore, bread
  Verbs: speak, give, run (intransitive / transitive / ditransitive)
  Adjectives: wettest, large, happy, red, fake, would-be (subsective vs. non-subsective; Kamp & Partee (1995))
  Adverbs: recently, happily, then, there (location)
Penn Treebank Part of Speech
SLP3: Chapter 10
http://universaldependencies.org/
Hidden Markov Models: Part of Speech
p(British Left Waffles on Falkland Islands)
(i):  Adjective Noun Verb Prep Noun Noun
(ii): Noun Verb Noun Prep Noun Noun
Class-based model: p(wi | zi)
Bigram model of the classes: p(zi | zi-1)
Model all class sequences: Σ_{z1,…,zN} p(z1, w1, z2, w2, …, zN, wN)
Hidden Markov Model
Goal: maximize (log-)likelihood
In practice: we don't actually observe these z values; we just see the words w
p(z1, w1, z2, w2, …, zN, wN) = p(z1 | z0) p(w1 | z1) … p(zN | zN-1) p(wN | zN) = ∏_i p(wi | zi) p(zi | zi-1)
if we knew the probability parameters, then we could estimate z and evaluate likelihood… but we don't! :(
if we did observe z, estimating the probability parameters would be easy… but we don't! :(
Hidden Markov Model Terminology
Each zi can take the value of one of K latent states
Transition and emission distributions do not change
Q: How many different probability values are there with K states and V vocab items?
A: V·K emission values and K² transition values
p(z1, w1, z2, w2, …, zN, wN) = p(z1 | z0) p(w1 | z1) … p(zN | zN-1) p(wN | zN) = ∏_i p(wi | zi) p(zi | zi-1)
emission probabilities/parameters: p(wi | zi)
transition probabilities/parameters: p(zi | zi-1)
Hidden Markov Model Representation
p(z1, w1, z2, w2, …, zN, wN) = p(z1 | z0) p(w1 | z1) … p(zN | zN-1) p(wN | zN) = ∏_i p(wi | zi) p(zi | zi-1)
emission probabilities/parameters: p(wi | zi)
transition probabilities/parameters: p(zi | zi-1); p(z1 | z0) is the initial starting distribution ("BOS")
[Graphical model: a chain z1 → z2 → z3 → z4 → …, with each zi emitting wi; emission arcs p(w1 | z1) … p(w4 | z4), transition arcs p(z1 | z0), p(z2 | z1), p(z3 | z2), p(z4 | z3)]
Each zi can take the value of one of K latent states
Transition and emission distributions do not change
Graphical Models (see 478/678)
Example: 2-state Hidden Markov Model as a Lattice
[Trellis: states z1 … z4, each N or V; emission arcs p(wt | N), p(wt | V); transition arcs p(N | start), p(V | start), and p(N | N), p(V | N), p(N | V), p(V | V) between consecutive time steps]
Comparison of Joint Probabilities
Unigram Language Model:
p(w1, w2, …, wN) = p(w1) p(w2) … p(wN) = ∏_i p(wi)
Unigram Class-based Language Model ("K" coins):
p(z1, w1, z2, w2, …, zN, wN) = p(z1) p(w1 | z1) … p(zN) p(wN | zN) = ∏_i p(wi | zi) p(zi)
Hidden Markov Model:
p(z1, w1, z2, w2, …, zN, wN) = p(z1 | z0) p(w1 | z1) … p(zN | zN-1) p(wN | zN) = ∏_i p(wi | zi) p(zi | zi-1)
Agenda
HMM Motivation (Part of Speech) and Brief Definition
What is Part of Speech?
HMM Detailed Definition
HMM Tasks
Hidden Markov Model Tasks
Calculate the (log) likelihood of an observed sequence w1, β¦, wN
Calculate the most likely sequence of states (for an observed sequence)
Learn the emission and transition parameters
p(z1, w1, z2, w2, …, zN, wN) = p(z1 | z0) p(w1 | z1) … p(zN | zN-1) p(wN | zN) = ∏_i p(wi | zi) p(zi | zi-1)
(emission probabilities/parameters; transition probabilities/parameters)
HMM Likelihood Task
Marginalize over all latent sequence joint likelihoods:
p(w1, w2, …, wN) = Σ_{z1,…,zN} p(z1, w1, z2, w2, …, zN, wN)
Q: In a K-state HMM for a length N observation sequence, how many summands (different latent sequences) are there?
A: K^N
Goal: Find a way to compute this exponential sum efficiently (in polynomial time)
Like in language modeling, you need to model when to stop generating. This ending state is generally not included in "K."
2 (3)-State HMM Likelihood
[Trellis: states z1 … z4, each N or V; emission arcs p(wt | N), p(wt | V); transition arcs p(N | start), p(V | start), and p(N | N), p(V | N), p(N | V), p(V | V) between consecutive time steps]
Q: What are the latent sequences here (EOS excluded)?
A: (N, w1), (N, w2), (N, w3), (N, w4)
   (N, w1), (N, w2), (N, w3), (V, w4)
   (N, w1), (N, w2), (V, w3), (N, w4)
   (N, w1), (N, w2), (V, w3), (V, w4)
   (N, w1), (V, w2), (N, w3), (N, w4)
   (N, w1), (V, w2), (N, w3), (V, w4)
   (N, w1), (V, w2), (V, w3), (N, w4)
   (N, w1), (V, w2), (V, w3), (V, w4)
   (V, w1), (N, w2), (N, w3), (N, w4)
   (V, w1), (N, w2), (N, w3), (V, w4)
   … (six more)
2 (3)-State HMM Likelihood
[Trellis: states z1 … z4, each N or V; emission arcs p(wt | N), p(wt | V); transition arcs p(N | start), p(V | start), p(N | N), p(V | N), p(N | V), p(V | V)]
Transition probabilities:
           N     V    end
  start   .7    .2    .1
  N       .15   .8    .05
  V       .6    .35   .05
Emission probabilities:
          w1    w2    w3    w4
  N       .7    .2    .05   .05
  V       .2    .6    .1    .1
2 (3)-State HMM Likelihood
[Trellis: states z1 … z4, each N or V; emission arcs p(wt | N), p(wt | V); transition arcs p(N | start), p(V | start), p(N | N), p(V | N), p(N | V), p(V | V)]
Q: What's the probability of (N, w1), (V, w2), (V, w3), (N, w4)?
Transition probabilities:
           N     V    end
  start   .7    .2    .1
  N       .15   .8    .05
  V       .6    .35   .05
Emission probabilities:
          w1    w2    w3    w4
  N       .7    .2    .05   .05
  V       .2    .6    .1    .1
2 (3)-State HMM Likelihood
[Trellis with the path z1 = N, z2 = V, z3 = V, z4 = N highlighted]
Q: What's the probability of (N, w1), (V, w2), (V, w3), (N, w4)?
A: (.7*.7) * (.8*.6) * (.35*.1) * (.6*.05) = 0.000247
Transition probabilities:
           N     V    end
  start   .7    .2    .1
  N       .15   .8    .05
  V       .6    .35   .05
Emission probabilities:
          w1    w2    w3    w4
  N       .7    .2    .05   .05
  V       .2    .6    .1    .1
2 (3)-State HMM Likelihood
[Trellis with the path z1 = N, z2 = V, z3 = V, z4 = N highlighted]
Q: What's the probability of (N, w1), (V, w2), (V, w3), (N, w4) with ending included (unique ending symbol "#")?
A: (.7*.7) * (.8*.6) * (.35*.1) * (.6*.05) * (.05*1) = 0.00001235
Transition probabilities:
           N     V    end
  start   .7    .2    .1
  N       .15   .8    .05
  V       .6    .35   .05
Emission probabilities:
          w1    w2    w3    w4    #
  N       .7    .2    .05   .05   0
  V       .2    .6    .1    .1    0
  end      0     0     0     0    1
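The arithmetic in this answer can be checked with a few lines of Python (the dictionary layout is my choice; the numbers are the tables on this slide):

```python
# Probability of the single tagged path (N, w1), (V, w2), (V, w3), (N, w4),
# including the ending transition.
trans = {"start": {"N": 0.7, "V": 0.2, "end": 0.1},
         "N": {"N": 0.15, "V": 0.8, "end": 0.05},
         "V": {"N": 0.6, "V": 0.35, "end": 0.05}}
emit = {"N": {"w1": 0.7, "w2": 0.2, "w3": 0.05, "w4": 0.05},
        "V": {"w1": 0.2, "w2": 0.6, "w3": 0.1, "w4": 0.1}}

path = ["N", "V", "V", "N"]
obs = ["w1", "w2", "w3", "w4"]

p, prev = 1.0, "start"
for z, w in zip(path, obs):
    p *= trans[prev][z] * emit[z][w]
    prev = z
p *= trans[prev]["end"]  # p(end | N) = .05; emitting "#" has probability 1

print(p)  # ≈ 1.2348e-05, which the slide rounds to 0.00001235
```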
2 (3)-State HMM Likelihood
[Trellis: states z1 … z4, each N or V; emission arcs p(wt | N), p(wt | V); transition arcs p(N | start), p(V | start), p(N | N), p(V | N), p(N | V), p(V | V)]
Q: What's the probability of (N, w1), (V, w2), (N, w3), (N, w4)?
Transition probabilities:
           N     V    end
  start   .7    .2    .1
  N       .15   .8    .05
  V       .6    .35   .05
Emission probabilities:
          w1    w2    w3    w4
  N       .7    .2    .05   .05
  V       .2    .6    .1    .1
2 (3)-State HMM Likelihood
[Trellis with the path z1 = N, z2 = V, z3 = N, z4 = N highlighted]
Q: What's the probability of (N, w1), (V, w2), (N, w3), (N, w4)?
A: (.7*.7) * (.8*.6) * (.6*.05) * (.15*.05) = 0.0000529
Transition probabilities:
           N     V    end
  start   .7    .2    .1
  N       .15   .8    .05
  V       .6    .35   .05
Emission probabilities:
          w1    w2    w3    w4
  N       .7    .2    .05   .05
  V       .2    .6    .1    .1
2 (3)-State HMM Likelihood
[Trellis with the path z1 = N, z2 = V, z3 = N, z4 = N highlighted]
Q: What's the probability of (N, w1), (V, w2), (N, w3), (N, w4) with ending (unique symbol "#")?
A: (.7*.7) * (.8*.6) * (.6*.05) * (.15*.05) * (.05*1) = 0.000002646
Transition probabilities:
           N     V    end
  start   .7    .2    .1
  N       .15   .8    .05
  V       .6    .35   .05
Emission probabilities:
          w1    w2    w3    w4    #
  N       .7    .2    .05   .05   0
  V       .2    .6    .1    .1    0
  end      0     0     0     0    1
2 (3)-State HMM Likelihood
[Two highlighted paths through the trellis: (N, V, N, N) and (N, V, V, N), both sharing the prefix z1 = N, z2 = V]
Up until here, all the computation was the same
Let's reuse what computations we can
Reusing Computation: Attempt #1
[Trellis with the shared prefix z1 = N, z2 = V and its two continuations z3 = N and z3 = V highlighted]
a[1, N] = (.7*.7)
aN[2, V] = a[1, N] * (.8*.6)
aN[2, V] * (.35*.1)   (continuing with z3 = V)
aN[2, V] * (.6*.05)   (continuing with z3 = N)
These aP(t, s) values represent the probability of a state s at time t in a particular shared path P ("NV" is the path prefix)
Remember: these are only two of the 16 paths through the trellis
How do we handle the other 14 paths? Marginalize at each time step
Transition probabilities:
           N     V    end
  start   .7    .2    .1
  N       .15   .8    .05
  V       .6    .35   .05
Emission probabilities:
          w1    w2    w3    w4
  N       .7    .2    .05   .05
  V       .2    .6    .1    .1
Reusing Computation: Attempt #1
so far, we've only considered "particular shared path (AB), then *"
aAB(i, *) = aA(i-1, B) * p(* | B) * p(obs at i | *)
[Trellis over states A, B, C at times i-2, i-1, i, with the path A → B → * highlighted: a(i-2, A), aA(i-1, B), aAB(i, *)]
what about any of the other paths (AAB, ACB, etc.)?
Reusing Computation: Attempt #2
let's first consider "any shared path ending with B (AB, BB, or CB), then B"
marginalize across the previous hidden state values:
α(i, B) = Σ_s' α(i-1, s') · p(B | s') · p(obs at i | B)
(compare the single-path version: aAB(i, B) = aA(i-1, B) * p(B | B) * p(obs at i | B))
computing α at time i-1 will correctly incorporate paths through time i-2: we correctly obey the Markov property
[Trellis over states A, B, C at times i-2, i-1, i]
Forward Probability
α(i, B) = Σ_s' α(i-1, s') · p(B | s') · p(obs at i | B)
computing α at time i-1 will correctly incorporate paths through time i-2: we correctly obey the Markov property
α(i, B) is the total probability of all paths to that state B from the beginning
Forward Probability
α(i, s) is the total probability of all paths:
1. that start from the beginning
2. that end (currently) in s at step i
3. that emit the observation obs at i
α(i, s) = Σ_s' α(i-1, s') · p(s | s') · p(obs at i | s)
(what's the total probability up until now? what are the immediate ways to get into state s? how likely is it to get into state s this way?)
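One way to sanity-check this recurrence: in the two-state N/V example from these slides, α(2, V) computed by the recurrence should equal the explicit sum of the two prefix-path probabilities aN(2, V) + aV(2, V). A small sketch (variable names are mine; the probabilities are the slides' tables):

```python
# Check that marginalizing over the previous state reproduces the sum of
# the individual path-prefix probabilities.
trans = {"start": {"N": 0.7, "V": 0.2}, "N": {"N": 0.15, "V": 0.8},
         "V": {"N": 0.6, "V": 0.35}}
emit = {"N": {"w1": 0.7, "w2": 0.2}, "V": {"w1": 0.2, "w2": 0.6}}

# brute force: the two length-2 prefixes ending in V
a_NV = trans["start"]["N"] * emit["N"]["w1"] * trans["N"]["V"] * emit["V"]["w2"]
a_VV = trans["start"]["V"] * emit["V"]["w1"] * trans["V"]["V"] * emit["V"]["w2"]

# recurrence: α(2, V) = Σ_s' α(1, s') · p(V | s') · p(w2 | V)
alpha1 = {"N": trans["start"]["N"] * emit["N"]["w1"],
          "V": trans["start"]["V"] * emit["V"]["w1"]}
alpha2_V = sum(alpha1[s] * trans[s]["V"] * emit["V"]["w2"] for s in ("N", "V"))

print(a_NV + a_VV)  # ≈ 0.2436
print(alpha2_V)     # ≈ 0.2436 (identical)
```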
2 (3)-State HMM Likelihood with Forward Probabilities
[Trellis: states z1 … z4, each N or V, with all emission and transition arcs]
α[1, N] = (.7*.7) = .49
α[1, V] = (.2*.2) = .04
α[2, N] = α[1, N] * (.15*.2) + α[1, V] * (.6*.2) = .0195
α[2, V] = α[1, N] * (.8*.6) + α[1, V] * (.35*.6) = .2436
α[3, N] = α[2, N] * (.15*.05) + α[2, V] * (.6*.05)
α[3, V] = α[2, N] * (.8*.1) + α[2, V] * (.35*.1)
Transition probabilities:
           N     V    end
  start   .7    .2    .1
  N       .15   .8    .05
  V       .6    .35   .05
Emission probabilities:
          w1    w2    w3    w4
  N       .7    .2    .05   .05
  V       .2    .6    .1    .1
Use dynamic programming to build the α left-to-right
Forward Algorithm
α: a 2D table, (N+2) × K*
N: number of observations (+2 for the BOS & EOS symbols)
K*: number of states

α = double[N+2][K*]
α[0][*] = 1.0
for (i = 1; i ≤ N+1; ++i) {
  for (state = 0; state < K*; ++state) {
    pobs = pemission(obsi | state)
    for (old = 0; old < K*; ++old) {
      pmove = ptransition(state | old)
      α[i][state] += α[i-1][old] * pobs * pmove
    }
  }
}
we still need to learn these emission and transition parameters (EM)
Q: What do we return? (How do we return the likelihood of the sequence?)
A: α[N+1][end]
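The pseudocode above can be rendered as a short, runnable Python sketch. The dictionary-based tables and the function name `forward` are my choices; the probabilities are the slides' two-state (N/V) example, with "start" playing the role of BOS and "end" of EOS:

```python
# Forward algorithm: alpha[i][s] holds the total probability of all
# paths that end in state s after emitting the first i observations.
trans = {"start": {"N": 0.7, "V": 0.2, "end": 0.1},
         "N": {"N": 0.15, "V": 0.8, "end": 0.05},
         "V": {"N": 0.6, "V": 0.35, "end": 0.05}}
emit = {"N": {"w1": 0.7, "w2": 0.2, "w3": 0.05, "w4": 0.05},
        "V": {"w1": 0.2, "w2": 0.6, "w3": 0.1, "w4": 0.1}}

def forward(obs):
    states = ["N", "V"]
    alpha = [{s: 0.0 for s in states + ["start", "end"]}
             for _ in range(len(obs) + 2)]
    alpha[0]["start"] = 1.0
    for i, w in enumerate(obs, start=1):
        for s in states:
            p_obs = emit[s][w]
            for old in ["start"] + states:
                alpha[i][s] += alpha[i - 1][old] * trans[old][s] * p_obs
    # final step: transition into the end state (which emits nothing here)
    for old in states:
        alpha[len(obs) + 1]["end"] += alpha[len(obs)][old] * trans[old]["end"]
    return alpha

alpha = forward(["w1", "w2", "w3", "w4"])
print(alpha[1]["N"], alpha[1]["V"])  # ≈ 0.49, 0.04 (matches the slides)
print(alpha[2]["N"], alpha[2]["V"])  # ≈ 0.0195, 0.2436
print(alpha[5]["end"])               # the likelihood p(w1, w2, w3, w4)
```

Note that only N·K² multiplications are performed, versus the K^N summands of the brute-force marginalization.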
Interactive HMM Example
https://goo.gl/rbHEoc (Jason Eisner, 2002)
Original: http://www.cs.jhu.edu/~jason/465/PowerPoint/lect24-hmm.xls
Forward Algorithm in Log-Space
α = double[N+2][K*]
α[0][*] = 0.0
(all other cells start at -∞, i.e., log 0)
for (i = 1; i ≤ N+1; ++i) {
  for (state = 0; state < K*; ++state) {
    pobs = log pemission(obsi | state)
    for (old = 0; old < K*; ++old) {
      pmove = log ptransition(state | old)
      α[i][state] = logadd(α[i][state], α[i-1][old] + pobs + pmove)
    }
  }
}
logadd(la, lb) = log(exp(la) + exp(lb))
this can still overflow! (why?)
a numerically stable version:
logadd(la, lb) = la + log(1 + exp(lb - la))   if la ≥ lb
              = lb + log(1 + exp(la - lb))   if lb > la
scipy.misc.logsumexp (in current SciPy: scipy.special.logsumexp)
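A log-space rendering in Python, using the stable logadd above via `math.log1p` (table layout and function names are again my choices; the probabilities are the slides' two-state example). In log-space, "probability zero" cells must start at -inf, not 0:

```python
# Log-space forward algorithm with a numerically stable logadd.
import math

def logadd(la, lb):
    """log(exp(la) + exp(lb)) without exponentiating large values."""
    if la == -math.inf:
        return lb
    if lb == -math.inf:
        return la
    if la < lb:
        la, lb = lb, la
    return la + math.log1p(math.exp(lb - la))  # exp of a non-positive number

trans = {"start": {"N": 0.7, "V": 0.2, "end": 0.1},
         "N": {"N": 0.15, "V": 0.8, "end": 0.05},
         "V": {"N": 0.6, "V": 0.35, "end": 0.05}}
emit = {"N": {"w1": 0.7, "w2": 0.2, "w3": 0.05, "w4": 0.05},
        "V": {"w1": 0.2, "w2": 0.6, "w3": 0.1, "w4": 0.1}}

def log_forward(obs):
    states = ["N", "V"]
    alpha = [{s: -math.inf for s in states + ["start", "end"]}
             for _ in range(len(obs) + 2)]
    alpha[0]["start"] = 0.0  # log 1
    for i, w in enumerate(obs, start=1):
        for s in states:
            lp_obs = math.log(emit[s][w])
            for old in ["start"] + states:
                if alpha[i - 1][old] == -math.inf:
                    continue  # unreachable predecessor
                lp_move = math.log(trans[old][s])
                alpha[i][s] = logadd(alpha[i][s],
                                     alpha[i - 1][old] + lp_obs + lp_move)
    for old in states:
        alpha[-1]["end"] = logadd(alpha[-1]["end"],
                                  alpha[len(obs)][old] + math.log(trans[old]["end"]))
    return alpha[-1]["end"]

ll = log_forward(["w1", "w2", "w3", "w4"])
print(ll)            # log-likelihood of the sequence
print(math.exp(ll))  # same value as the probability-space forward pass
```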