The EM algorithm
LING 572
Fei Xia
03/01/07
What is EM?
• EM stands for “expectation maximization”.
• A parameter estimation method: it falls into the general framework of maximum-likelihood estimation (MLE).
• The general form was given in (Dempster, Laird, and Rubin, 1977), although the essence of the algorithm had appeared earlier in various forms.
Outline
• MLE
• EM
  – Basic concepts
  – Main ideas of EM
  – EM for PM models
  – An example: the forward-backward algorithm
MLE
What is MLE?
• Given
– A sample X={X1, …, Xn}
– A vector of parameters θ
• We define
  – Likelihood of the data: P(X | θ)
  – Log-likelihood of the data: L(θ) = log P(X | θ)
• Given X, find
  θ_ML = argmax_θ L(θ)
MLE (cont)
• Often we assume that the X_i's are independent and identically distributed (i.i.d.).
• Depending on the form of P(X | θ), solving this optimization problem can be easy or hard.
  θ_ML = argmax_θ L(θ)
       = argmax_θ log P(X | θ)
       = argmax_θ log P(X_1, …, X_n | θ)
       = argmax_θ log ∏_{i=1}^{n} P(X_i | θ)
       = argmax_θ Σ_{i=1}^{n} log P(X_i | θ)
An easy case
• Assume
  – A coin has probability p of being heads and 1 − p of being tails.
  – Observation: we toss the coin N times, and the result is a sequence of Hs and Ts with m Hs.
• What is the value of p based on MLE, given the observation?
An easy case (cont)
  L(θ) = log P(X | θ) = log ( p^m · (1 − p)^(N − m) ) = m · log p + (N − m) · log(1 − p)

  dL(θ)/dp = d( m · log p + (N − m) · log(1 − p) )/dp = m/p − (N − m)/(1 − p) = 0

  ⇒ p = m/N
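A quick numerical check of this result, with hypothetical data (N = 10 tosses, m = 7 heads), scanning candidate values of p for the one that maximizes the log-likelihood:

import math

m, N = 7, 10                    # hypothetical observation: 7 heads in 10 tosses

def log_likelihood(p):
    return m * math.log(p) + (N - m) * math.log(1 - p)

# scan candidate values of p; the maximum is at p = m/N = 0.7
best = max((k / 100 for k in range(1, 100)), key=log_likelihood)
print(best)                     # 0.7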
EM: basic concepts
Basic setting in EM
• X is a set of data points: the observed data.
• θ is a parameter vector.
• EM is a method to find θ_ML where
  θ_ML = argmax_θ L(θ) = argmax_θ log P(X | θ)
• Calculating P(X | θ) directly is hard.
• Calculating P(X, Y | θ) is much simpler, where Y is "hidden" data (or "missing" data).
The basic EM strategy
• Z = (X, Y)
  – Z: complete data ("augmented data")
  – X: observed data ("incomplete" data)
  – Y: hidden data ("missing" data)
The “missing” data Y
• Y is not necessarily missing in the practical sense of the word.
• It may just be a conceptually convenient technical device to simplify the calculation of P(X | θ).
• There could be many possible Ys.
Examples of EM
                 HMM                PCFG              MT                           Coin toss
  X (observed)   sentences          sentences         parallel data                head-tail sequences
  Y (hidden)     state sequences    parse trees       word alignments              coin id sequences
  θ              a_ij, b_ijk        P(A → B C)        t(f | e), d(a_j | j,l,m), …  p1, p2, λ
  Algorithm      Forward-backward   Inside-outside    IBM models                   N/A
The EM algorithm
• Consider a set of starting parameters
• Use these to “estimate” the missing data
• Use “complete” data to update parameters
• Repeat until convergence
Highlights
• General algorithm for missing data problems
• Requires “specialization” to the problem at hand
• Examples of EM:
  – Forward-backward algorithm for HMMs
  – Inside-outside algorithm for PCFGs
  – EM in the IBM MT models
EM: main ideas
Idea #1: find θ that maximizes the likelihood of training data
  θ_ML = argmax_θ L(θ) = argmax_θ log P(X | θ)
Idea #2: find the θ^t sequence
No analytical solution ⇒ iterative approach: find a sequence θ^0, θ^1, …, θ^t, … s.t.
  L(θ^0) ≤ L(θ^1) ≤ … ≤ L(θ^t) ≤ …
Idea #3: find θ^(t+1) that maximizes a tight lower bound of L(θ) − L(θ^t)

A tight lower bound:

  L(θ) − L(θ^t) ≥ Σ_{i=1}^{n} E_{P(y | x_i, θ^t)} [ log ( P(x_i, y | θ) / P(x_i, y | θ^t) ) ]
Idea #4: find θ^(t+1) that maximizes the Q function

  θ^(t+1) = argmax_θ Σ_{i=1}^{n} E_{P(y | x_i, θ^t)} [ log ( P(x_i, y | θ) / P(x_i, y | θ^t) ) ]
          = argmax_θ Σ_{i=1}^{n} E_{P(y | x_i, θ^t)} [ log P(x_i, y | θ) ]
Lower bound of L(θ) − L(θ^t)
The Q function
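One standard way to derive the bound above uses Jensen's inequality (the log of an average is at least the average of the log):

  L(θ) − L(θ^t) = Σ_{i=1}^{n} log ( P(x_i | θ) / P(x_i | θ^t) )
                = Σ_{i=1}^{n} log Σ_y [ P(y | x_i, θ^t) · P(x_i, y | θ) / ( P(y | x_i, θ^t) · P(x_i | θ^t) ) ]
                = Σ_{i=1}^{n} log Σ_y [ P(y | x_i, θ^t) · P(x_i, y | θ) / P(x_i, y | θ^t) ]
                ≥ Σ_{i=1}^{n} Σ_y P(y | x_i, θ^t) · log ( P(x_i, y | θ) / P(x_i, y | θ^t) )    (Jensen's inequality)

The bound is tight at θ = θ^t (both sides are 0), and the P(x_i, y | θ^t) terms inside the log do not depend on θ, so maximizing the right-hand side over θ is the same as maximizing Q(θ, θ^t) = Σ_i Σ_y P(y | x_i, θ^t) · log P(x_i, y | θ).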
The EM algorithm
• Start with initial estimate, θ0
• Repeat until convergence:
  – E-step: calculate
      Q(θ, θ^t) = Σ_{i=1}^{n} Σ_y P(y | x_i, θ^t) · log P(x_i, y | θ)
  – M-step: find
      θ^(t+1) = argmax_θ Q(θ, θ^t)
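As a concrete illustration, here is a minimal Python sketch of these two steps for the coin-toss case from the earlier table: each observed head-tail sequence was generated by coin 1 (with probability λ) or coin 2, the coin identity y is hidden, and θ = (p1, p2, λ). The data and starting values below are invented for illustration.

def em_two_coins(sequences, p1, p2, lam, n_iters=20):
    """EM for the two-coin mixture: each sequence was produced entirely by coin 1
    (probability lam) or by coin 2; the coin identity is the hidden y."""
    for _ in range(n_iters):
        # E-step: posterior P(y = coin 1 | x_i, theta^t) for each sequence
        post, stats = [], []
        for seq in sequences:
            h, n = seq.count('H'), len(seq)
            w1 = lam * (p1 ** h) * ((1 - p1) ** (n - h))
            w2 = (1 - lam) * (p2 ** h) * ((1 - p2) ** (n - h))
            post.append(w1 / (w1 + w2))
            stats.append((h, n))
        # M-step: maximize Q(theta, theta^t) -> posterior-weighted relative frequencies
        lam = sum(post) / len(post)
        heads1 = sum(g * h for g, (h, n) in zip(post, stats))
        tosses1 = sum(g * n for g, (h, n) in zip(post, stats))
        heads2 = sum((1 - g) * h for g, (h, n) in zip(post, stats))
        tosses2 = sum((1 - g) * n for g, (h, n) in zip(post, stats))
        p1, p2 = heads1 / tosses1, heads2 / tosses2
    return p1, p2, lam

# Invented head-tail sequences and arbitrary starting values:
data = ['HHHHT', 'HHHHH', 'TTHTT', 'HTTTT', 'HHHTH']
print(em_two_coins(data, p1=0.6, p2=0.5, lam=0.5))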
Important classes of EM problem
• Products of multinomial (PM) models
• Exponential families
• Gaussian mixture
• …
The EM algorithm for PM models
PM models
  p(x, y | Θ) = ∏_{k=1}^{m} ∏_j p_kj^{C_k(j, x, y)}

where Θ = (θ_1, …, θ_m) is a partition of all the parameters into multinomials, θ_k = (p_k1, p_k2, …), C_k(j, x, y) is the number of times p_kj is used in generating (x, y), and for every k:

  Σ_j p_kj = 1
HMM is a PM
  p(x, y | θ) = ∏_{i,j} a_ij^{c(s_i → s_j, x, y)} · ∏_{i,j,k} b_ijk^{c(s_i → s_j, w_k, x, y)}

where a_ij = p(s_j | s_i), b_ijk = p(w_k | s_i → s_j), c(s_i → s_j, x, y) is the number of times the transition s_i → s_j is used in (x, y), and c(s_i → s_j, w_k, x, y) is the number of times that transition emits w_k.
PCFG
• PCFG: each sample point (x, y):
  – x is a sentence
  – y is a possible parse tree for that sentence.

  P(x, y | θ) = ∏_{i=1}^{n} P(A_i → α_i | A_i)

For example, for the sentence "Jim sleeps":

  P(x, y | θ) = P(S → NP VP | S) · P(NP → Jim | NP) · P(VP → sleeps | VP)
PCFG is a PM
  p(x, y | θ) = ∏_{A→α} p(A → α)^{c(A → α, x, y)}

where c(A → α, x, y) is the number of times the rule A → α is used in the parse tree y, and for every nonterminal A:

  Σ_α p(A → α) = 1
Q-function for PM
  Q(θ, θ^t) = Σ_{i=1}^{n} Σ_y P(y | x_i, θ^t) · log P(x_i, y | θ)
            = Σ_{i=1}^{n} Σ_y P(y | x_i, θ^t) · log ( ∏_{k=1}^{m} ∏_j p_kj^{C_k(j, x_i, y)} )
            = [ Σ_{i=1}^{n} Σ_y P(y | x_i, θ^t) Σ_j C_1(j, x_i, y) · log p_1j ] + … + [ Σ_{i=1}^{n} Σ_y P(y | x_i, θ^t) Σ_j C_m(j, x_i, y) · log p_mj ]

The Q function therefore splits into one term per multinomial θ_k; call the k-th term Q_k(θ, θ^t). Each term can be maximized separately.
Maximizing the Q function
Maximize (one multinomial θ_k at a time; the index k is dropped below)

  Σ_{i=1}^{n} Σ_y P(y | x_i, θ^t) Σ_j C(j, x_i, y) · log p_j

subject to the constraint

  Σ_j p_j = 1

Use Lagrange multipliers:

  Q̂(θ, θ^t) = Σ_{i=1}^{n} Σ_y P(y | x_i, θ^t) Σ_j C(j, x_i, y) · log p_j + λ (1 − Σ_j p_j)
Optimal solution
Setting the derivative of Q̂(θ, θ^t) with respect to p_j to zero:

  ∂Q̂ / ∂p_j = Σ_{i=1}^{n} Σ_y P(y | x_i, θ^t) · C(j, x_i, y) / p_j − λ = 0

  ⇒ p_j = (1/λ) Σ_{i=1}^{n} Σ_y P(y | x_i, θ^t) · C(j, x_i, y)

where λ is a normalization factor (chosen so that Σ_j p_j = 1) and the sum is the expected count for p_j.
PCFG example
• Calculate expected counts
• Update parameters
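A minimal sketch of what this closed-form M-step looks like for the PCFG case: given expected rule counts from the E-step (the numbers below are invented for illustration), each rule probability is simply its expected count normalized over the rules sharing the same left-hand side.

from collections import defaultdict

# Hypothetical expected rule counts from the E-step: (LHS, RHS) -> expected count
expected = {('NP', 'Det N'): 12.4, ('NP', 'Jim'): 3.1,
            ('VP', 'V NP'): 8.0, ('VP', 'sleeps'): 2.0}

# M-step: p(A -> alpha) = expected count, normalized over rules with the same LHS A
totals = defaultdict(float)
for (lhs, _), c in expected.items():
    totals[lhs] += c
probs = {rule: c / totals[rule[0]] for rule, c in expected.items()}
print(probs)   # e.g. p(NP -> Det N) = 12.4 / 15.5 = 0.8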
EM: An example
Forward-backward algorithm
• An HMM is a PM (product of multinomials) model.
• The forward-backward algorithm is a special case of the EM algorithm for PM models.
• X (observed data): each data point is an observation sequence O_{1,T} = o_1 … o_T.
• Y (hidden data): the corresponding state sequence X_1 … X_{T+1}.
• θ (parameters): a_ij, b_ijk, π_i.
Expected counts
  count(s_i → s_j) = Σ_Y P(Y | X, θ) · count(s_i → s_j, X, Y)
                   = Σ_{X_{1,T+1}} P(X_{1,T+1} | O_{1,T}, θ) · count(s_i → s_j, O_{1,T}, X_{1,T+1})
                   = Σ_{t=1}^{T} P(X_t = i, X_{t+1} = j | O_{1,T}, θ)
Expected counts (cont)
  count(s_i → s_j, w_k) = Σ_Y P(Y | X, θ) · count(s_i → s_j, w_k, X, Y)
                        = Σ_{X_{1,T+1}} P(X_{1,T+1} | O_{1,T}, θ) · count(s_i → s_j, w_k, O_{1,T}, X_{1,T+1})
                        = Σ_{t: o_t = w_k} P(X_t = i, X_{t+1} = j | O_{1,T}, θ)
                        = Σ_{t=1}^{T} P(X_t = i, X_{t+1} = j | O_{1,T}, θ) · δ(o_t, w_k)
The inner loop for forward-backward algorithm
Given an input sequence O_{1,T} and an HMM (S, Σ, Π, A, B):

1. Calculate the forward probability:
   • Base case: α_i(1) = π_i
   • Recursive case: α_j(t+1) = Σ_i α_i(t) · a_ij · b_ij(o_t)

2. Calculate the backward probability:
   • Base case: β_i(T+1) = 1
   • Recursive case: β_i(t) = Σ_j a_ij · b_ij(o_t) · β_j(t+1)

3. Calculate the expected counts:
   p_t(i, j) = α_i(t) · a_ij · b_ij(o_t) · β_j(t+1) / Σ_{m=1}^{N} α_m(t) · β_m(t)

4. Update the parameters:
   â_ij = Σ_{t=1}^{T} p_t(i, j) / Σ_{j=1}^{N} Σ_{t=1}^{T} p_t(i, j)
   b̂_ijk = Σ_{t=1}^{T} p_t(i, j) · δ(o_t, w_k) / Σ_{t=1}^{T} p_t(i, j)
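A minimal NumPy sketch of this inner loop (one EM iteration for the arc-emission HMM defined on the next slide). Variable names mirror the slides; this is an illustrative implementation, not the course's own code, and it does no smoothing or zero-count handling.

import numpy as np

def forward_backward_step(obs, pi, A, B):
    """One forward-backward (EM) iteration for an arc-emission HMM.
    obs : observation sequence o_1..o_T as symbol indices (length T)
    pi  : initial state probabilities pi_i, shape (N,)
    A   : transition probabilities a_ij = P(s_j | s_i), shape (N, N)
    B   : emission probabilities b_ijk = P(w_k | s_i -> s_j), shape (N, N, M)
    Returns updated (pi, A, B) and the observation probability P(O)."""
    T, N = len(obs), len(pi)

    # 1. Forward: alpha[t-1, i] = alpha_i(t) = P(O_{1,t-1}, X_t = i)
    alpha = np.zeros((T + 1, N))
    alpha[0] = pi                                          # alpha_i(1) = pi_i
    for t in range(1, T + 1):                              # alpha_j(t+1) = sum_i alpha_i(t) a_ij b_ij(o_t)
        alpha[t] = alpha[t - 1] @ (A * B[:, :, obs[t - 1]])

    # 2. Backward: beta[t-1, i] = beta_i(t) = P(O_{t,T} | X_t = i)
    beta = np.zeros((T + 1, N))
    beta[T] = 1.0                                          # beta_i(T+1) = 1
    for t in range(T, 0, -1):                              # beta_i(t) = sum_j a_ij b_ij(o_t) beta_j(t+1)
        beta[t - 1] = (A * B[:, :, obs[t - 1]]) @ beta[t]

    P_O = alpha[T].sum()                                   # P(O) = sum_i alpha_i(T+1)

    # 3. Expected counts: p_t(i,j) = alpha_i(t) a_ij b_ij(o_t) beta_j(t+1) / P(O)
    p = np.zeros((T, N, N))
    for t in range(1, T + 1):
        p[t - 1] = alpha[t - 1][:, None] * A * B[:, :, obs[t - 1]] * beta[t][None, :] / P_O

    # 4. Update the parameters (no zero-count smoothing in this sketch)
    pi_new = p[0].sum(axis=1)                              # gamma_i(1)
    A_new = p.sum(axis=0) / p.sum(axis=(0, 2))[:, None]    # a_ij
    B_new = np.zeros_like(B)
    for t in range(T):
        B_new[:, :, obs[t]] += p[t]                        # sum_t p_t(i,j) * delta(o_t, w_k)
    B_new /= p.sum(axis=0)[:, :, None]
    return pi_new, A_new, B_new, P_O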
HMM
• An HMM is a tuple (S, Σ, Π, A, B):
  – A set of states S = {s_1, s_2, …, s_N}.
  – A set of output symbols Σ = {w_1, …, w_M}.
  – Initial state probabilities Π = {π_i}.
  – State transition probabilities A = {a_ij}.
  – Symbol emission probabilities B = {b_ijk}.
• State sequence: X_1 … X_{T+1}
• Output sequence: o_1 … o_T
Forward probability
The probability of producing O_{1,t−1} while ending up in state s_i:

  α_i(t) ≝ P(O_{1,t−1}, X_t = i)
Calculating forward probability
Initialization: α_i(1) = π_i

Induction:

  α_j(t+1) = P(O_{1,t}, X_{t+1} = j)
           = Σ_i P(O_{1,t−1}, X_t = i, o_t, X_{t+1} = j)
           = Σ_i P(O_{1,t−1}, X_t = i) · P(o_t, X_{t+1} = j | O_{1,t−1}, X_t = i)
           = Σ_i P(O_{1,t−1}, X_t = i) · P(o_t, X_{t+1} = j | X_t = i)
           = Σ_i α_i(t) · a_ij · b_ij(o_t)
Backward probability
• The probability of producing the sequence O_{t,T}, given that at time t we are in state s_i:

  β_i(t) ≝ P(O_{t,T} | X_t = i)
Calculating backward probability
Initialization: β_i(T+1) = 1

Induction:

  β_i(t) = P(O_{t,T} | X_t = i)
         = Σ_j P(o_t, O_{t+1,T}, X_{t+1} = j | X_t = i)
         = Σ_j P(o_t, X_{t+1} = j | X_t = i) · P(O_{t+1,T} | o_t, X_{t+1} = j, X_t = i)
         = Σ_j P(o_t, X_{t+1} = j | X_t = i) · P(O_{t+1,T} | X_{t+1} = j)
         = Σ_j a_ij · b_ij(o_t) · β_j(t+1)
Calculating the prob of the observation
  P(O) = Σ_{i=1}^{N} π_i · β_i(1)

  P(O) = Σ_{i=1}^{N} α_i(T+1)

  P(O) = Σ_{i=1}^{N} P(O, X_t = i) = Σ_{i=1}^{N} α_i(t) · β_i(t)    (for any t)
Estimating parameters
• The probability of traversing the arc s_i → s_j at time t, given O (denoted p_t(i, j) in M&S):

  p_t(i, j) = P(X_t = i, X_{t+1} = j | O, θ)
            = P(X_t = i, X_{t+1} = j, O | θ) / P(O | θ)
            = α_i(t) · a_ij · b_ij(o_t) · β_j(t+1) / Σ_{m=1}^{N} α_m(t) · β_m(t)

• The probability of being in state i at time t, given O:

  γ_i(t) = P(X_t = i | O, θ) = Σ_{j=1}^{N} P(X_t = i, X_{t+1} = j | O, θ) = Σ_{j=1}^{N} p_t(i, j)
Expected counts
Sum over the time index:
• Expected # of transitions from state i to j in O:
  Σ_{t=1}^{T} p_t(i, j)
• Expected # of transitions from state i in O:
  Σ_{t=1}^{T} γ_i(t) = Σ_{t=1}^{T} Σ_{j=1}^{N} p_t(i, j)
Update parameters
  π̂_i = expected frequency in state i at time 1 = γ_i(1)

  â_ij = (expected # of transitions from state i to j) / (expected # of transitions from state i)
       = Σ_{t=1}^{T} p_t(i, j) / Σ_{j=1}^{N} Σ_{t=1}^{T} p_t(i, j)

  b̂_ijk = (expected # of transitions from state i to j with w_k observed) / (expected # of transitions from state i to j)
        = Σ_{t=1}^{T} p_t(i, j) · δ(o_t, w_k) / Σ_{t=1}^{T} p_t(i, j)
Final formulae
  p_t(i, j) = α_i(t) · a_ij · b_ij(o_t) · β_j(t+1) / Σ_{m=1}^{N} α_m(t) · β_m(t)

  â_ij = Σ_{t=1}^{T} p_t(i, j) / Σ_{j=1}^{N} Σ_{t=1}^{T} p_t(i, j)

  b̂_ijk = Σ_{t=1}^{T} p_t(i, j) · δ(o_t, w_k) / Σ_{t=1}^{T} p_t(i, j)
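For instance, running the forward_backward_step sketch from the inner-loop slide on a toy two-state, two-symbol HMM (all numbers below are arbitrary) shows P(O) improving across iterations:

import numpy as np

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3],
              [0.4, 0.6]])                     # a_ij
B = np.array([[[0.9, 0.1], [0.5, 0.5]],
              [[0.2, 0.8], [0.3, 0.7]]])       # b_ijk = P(w_k | s_i -> s_j)
obs = [0, 1, 1, 0, 1]                          # o_1..o_T as symbol indices

for it in range(5):
    pi, A, B, P_O = forward_backward_step(obs, pi, A, B)
    print(it, P_O)                             # P(O) is non-decreasing across iterations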
Summary
Summary
• The EM algorithm
  – An iterative approach
  – L(θ) is non-decreasing at each iteration
  – An optimal solution in the M-step exists for many classes of problems.
• The EM algorithm for PM models
  – Simpler formulae
  – Three special cases:
    • Inside-outside algorithm
    • Forward-backward algorithm
    • IBM models for MT
Relations among the algorithms
• The generalized EM
  – The EM algorithm
    • EM for PM models: inside-outside, forward-backward, IBM models
    • Gaussian mixtures
Strengths of EM
• Numerical stability: each iteration of the EM algorithm is guaranteed not to decrease the likelihood of the observed data.
• EM handles parameter constraints gracefully.
Problems with EM
• Convergence can be very slow on some problems and is intimately related to the amount of missing information.
• It is guaranteed to improve the likelihood of the training corpus, which is different from directly reducing errors.
• It is not guaranteed to reach a global maximum (it can get stuck at a local maximum, a saddle point, etc.).
  – The initial estimate is therefore important.
Additional slides
The EM algorithm for PM models
// for each iteration:
//   reset all expected counts to zero
//   for each training example x_i:
//     for each possible y:
//       for each parameter p_j: add P(y | x_i, θ^t) · C(j, x_i, y) to its expected count
//   for each parameter p_j: set p_j to its expected count, normalized within its multinomial
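A minimal Python sketch of that loop for a generic PM model. The callables enumerate_y, posterior, and counts are hypothetical interfaces that a concrete model (HMM, PCFG, …) would have to supply; they are assumptions for illustration, not part of the original slides.

from collections import defaultdict

def em_for_pm(data, theta, enumerate_y, posterior, counts, n_iters=10):
    """Generic EM for a PM model.
    theta: dict mapping (multinomial k, outcome j) -> probability p_kj
    enumerate_y(x): iterates over the possible hidden values y for x   (hypothetical interface)
    posterior(y, x, theta): P(y | x, theta)                            (hypothetical interface)
    counts(x, y): dict mapping (k, j) -> C_k(j, x, y)                  (hypothetical interface)"""
    for _ in range(n_iters):                                 # for each iteration
        expected = defaultdict(float)                        # expected count per parameter
        for x in data:                                       # for each training example x_i
            for y in enumerate_y(x):                         # for each possible y
                w = posterior(y, x, theta)                   # P(y | x_i, theta^t)
                for kj, c in counts(x, y).items():           # for each parameter
                    expected[kj] += w * c
        # M-step: renormalize the expected counts within each multinomial k
        totals = defaultdict(float)
        for (k, j), v in expected.items():
            totals[k] += v
        theta = {(k, j): v / totals[k]                       # parameters with zero total are dropped
                 for (k, j), v in expected.items() if totals[k] > 0}
    return theta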