The EM algorithm
LING 572
Fei Xia
03/01/07
What is EM?
• EM stands for “expectation maximization”.
• A parameter estimation method: it falls into the general framework of maximum-likelihood estimation (MLE).
• The general form was given in (Dempster, Laird, and Rubin, 1977), although the essence of the algorithm had appeared earlier in various forms.
Outline
• MLE
• EM
  – Basic concepts
  – Main ideas of EM
  – EM for PM models
  – An example: the forward-backward algorithm
MLE
What is MLE?
• Given
– A sample X={X1, …, Xn}
– A vector of parameters θ
• We define
  – Likelihood of the data: P(X | θ)
  – Log-likelihood of the data: L(θ) = log P(X | θ)
• Given X, find
  θ_ML = argmax_θ L(θ)
MLE (cont)
• Often we assume that the X_i's are independent and identically distributed (i.i.d.).
• Depending on the form of P(X | θ), solving this optimization problem can be easy or hard.
  θ_ML = argmax_θ L(θ)
       = argmax_θ log P(X | θ)
       = argmax_θ log P(X_1, …, X_n | θ)
       = argmax_θ log ∏_{i=1}^{n} P(X_i | θ)
       = argmax_θ Σ_{i=1}^{n} log P(X_i | θ)
An easy case
• Assume
  – A coin has probability p of being heads and 1 − p of being tails.
  – Observation: we toss the coin N times, and the result is a sequence of Hs and Ts with m Hs.
• What is the value of p based on MLE, given the observation?
An easy case (cont)
  L(θ) = log P(X | θ) = log ( p^m · (1 − p)^(N − m) ) = m · log p + (N − m) · log(1 − p)

  dL(θ)/dp = d( m · log p + (N − m) · log(1 − p) )/dp = m/p − (N − m)/(1 − p) = 0

  ⇒ p = m/N
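A quick numerical check of this result, with hypothetical data (N = 10 tosses, m = 7 heads), scanning candidate values of p for the one that maximizes the log-likelihood:

import math

m, N = 7, 10                    # hypothetical observation: 7 heads in 10 tosses

def log_likelihood(p):
    return m * math.log(p) + (N - m) * math.log(1 - p)

# scan candidate values of p; the maximum is at p = m/N = 0.7
best = max((k / 100 for k in range(1, 100)), key=log_likelihood)
print(best)                     # 0.7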
EM: basic concepts
Basic setting in EM
• X is a set of data points: the observed data.
• θ is a parameter vector.
• EM is a method to find θ_ML where
  θ_ML = argmax_θ L(θ) = argmax_θ log P(X | θ)
• Calculating P(X | θ) directly is hard.
• Calculating P(X, Y | θ) is much simpler, where Y is "hidden" data (or "missing" data).
The basic EM strategy
• Z = (X, Y)
  – Z: complete data ("augmented data")
  – X: observed data ("incomplete" data)
  – Y: hidden data ("missing" data)
The “missing” data Y
• Y is not necessarily missing in the practical sense of the word.
• It may just be a conceptually convenient technical device to simplify the calculation of P(X | θ).
• There could be many possible Ys.
Examples of EM
                 HMM                PCFG              MT                           Coin toss
  X (observed)   sentences          sentences         parallel data                head-tail sequences
  Y (hidden)     state sequences    parse trees       word alignments              coin id sequences
  θ              a_ij, b_ijk        P(A → B C)        t(f | e), d(a_j | j,l,m), …  p1, p2, λ
  Algorithm      Forward-backward   Inside-outside    IBM models                   N/A
The EM algorithm
• Consider a set of starting parameters
• Use these to “estimate” the missing data
• Use “complete” data to update parameters
• Repeat until convergence
Highlights
• General algorithm for missing data problems
• Requires “specialization” to the problem at hand
• Examples of EM:
  – Forward-backward algorithm for HMMs
  – Inside-outside algorithm for PCFGs
  – EM in the IBM MT models
EM: main ideas
Idea #1: find θ that maximizes the likelihood of training data
  θ_ML = argmax_θ L(θ) = argmax_θ log P(X | θ)
Idea #2: find the θ^t sequence
No analytical solution ⇒ iterative approach: find a sequence θ^0, θ^1, …, θ^t, … s.t.
  L(θ^0) ≤ L(θ^1) ≤ … ≤ L(θ^t) ≤ …
Idea #3: find θ^(t+1) that maximizes a tight lower bound of L(θ) − L(θ^t)

A tight lower bound:

  L(θ) − L(θ^t) ≥ Σ_{i=1}^{n} E_{P(y | x_i, θ^t)} [ log ( P(x_i, y | θ) / P(x_i, y | θ^t) ) ]
Idea #4: find θ^(t+1) that maximizes the Q function

  θ^(t+1) = argmax_θ Σ_{i=1}^{n} E_{P(y | x_i, θ^t)} [ log ( P(x_i, y | θ) / P(x_i, y | θ^t) ) ]
          = argmax_θ Σ_{i=1}^{n} E_{P(y | x_i, θ^t)} [ log P(x_i, y | θ) ]
Lower bound of L(θ) − L(θ^t)
The Q function
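One standard way to derive the bound above uses Jensen's inequality (the log of an average is at least the average of the log):

  L(θ) − L(θ^t) = Σ_{i=1}^{n} log ( P(x_i | θ) / P(x_i | θ^t) )
                = Σ_{i=1}^{n} log Σ_y [ P(y | x_i, θ^t) · P(x_i, y | θ) / ( P(y | x_i, θ^t) · P(x_i | θ^t) ) ]
                = Σ_{i=1}^{n} log Σ_y [ P(y | x_i, θ^t) · P(x_i, y | θ) / P(x_i, y | θ^t) ]
                ≥ Σ_{i=1}^{n} Σ_y P(y | x_i, θ^t) · log ( P(x_i, y | θ) / P(x_i, y | θ^t) )    (Jensen's inequality)

The bound is tight at θ = θ^t (both sides are 0), and the P(x_i, y | θ^t) terms inside the log do not depend on θ, so maximizing the right-hand side over θ is the same as maximizing Q(θ, θ^t) = Σ_i Σ_y P(y | x_i, θ^t) · log P(x_i, y | θ).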
The EM algorithm
• Start with initial estimate, θ0
• Repeat until convergence:
  – E-step: calculate
      Q(θ, θ^t) = Σ_{i=1}^{n} Σ_y P(y | x_i, θ^t) · log P(x_i, y | θ)
  – M-step: find
      θ^(t+1) = argmax_θ Q(θ, θ^t)
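As a concrete illustration, here is a minimal Python sketch of these two steps for the coin-toss case from the earlier table: each observed head-tail sequence was generated by coin 1 (with probability λ) or coin 2, the coin identity y is hidden, and θ = (p1, p2, λ). The data and starting values below are invented for illustration.

def em_two_coins(sequences, p1, p2, lam, n_iters=20):
    """EM for the two-coin mixture: each sequence was produced entirely by coin 1
    (probability lam) or by coin 2; the coin identity is the hidden y."""
    for _ in range(n_iters):
        # E-step: posterior P(y = coin 1 | x_i, theta^t) for each sequence
        post, stats = [], []
        for seq in sequences:
            h, n = seq.count('H'), len(seq)
            w1 = lam * (p1 ** h) * ((1 - p1) ** (n - h))
            w2 = (1 - lam) * (p2 ** h) * ((1 - p2) ** (n - h))
            post.append(w1 / (w1 + w2))
            stats.append((h, n))
        # M-step: maximize Q(theta, theta^t) -> posterior-weighted relative frequencies
        lam = sum(post) / len(post)
        heads1 = sum(g * h for g, (h, n) in zip(post, stats))
        tosses1 = sum(g * n for g, (h, n) in zip(post, stats))
        heads2 = sum((1 - g) * h for g, (h, n) in zip(post, stats))
        tosses2 = sum((1 - g) * n for g, (h, n) in zip(post, stats))
        p1, p2 = heads1 / tosses1, heads2 / tosses2
    return p1, p2, lam

# Invented head-tail sequences and arbitrary starting values:
data = ['HHHHT', 'HHHHH', 'TTHTT', 'HTTTT', 'HHHTH']
print(em_two_coins(data, p1=0.6, p2=0.5, lam=0.5))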
Important classes of EM problem
• Products of multinomial (PM) models
• Exponential families
• Gaussian mixture
• …
The EM algorithm for PM models
PM models
  p(x, y | Θ) = ∏_{k=1}^{m} ∏_j p_kj^{C_k(j, x, y)}

where Θ = (θ_1, …, θ_m) is a partition of all the parameters into multinomials, θ_k = (p_k1, p_k2, …), C_k(j, x, y) is the number of times p_kj is used in generating (x, y), and for every k:

  Σ_j p_kj = 1
HMM is a PM
  p(x, y | θ) = ∏_{i,j} a_ij^{c(s_i → s_j, x, y)} · ∏_{i,j,k} b_ijk^{c(s_i → s_j, w_k, x, y)}

where a_ij = p(s_j | s_i), b_ijk = p(w_k | s_i → s_j), c(s_i → s_j, x, y) is the number of times the transition s_i → s_j is used in (x, y), and c(s_i → s_j, w_k, x, y) is the number of times that transition emits w_k.
PCFG
• PCFG: each sample point (x, y):
  – x is a sentence
  – y is a possible parse tree for that sentence.

  P(x, y | θ) = ∏_{i=1}^{n} P(A_i → α_i | A_i)

For example, for the sentence "Jim sleeps":

  P(x, y | θ) = P(S → NP VP | S) · P(NP → Jim | NP) · P(VP → sleeps | VP)
PCFG is a PM
  p(x, y | θ) = ∏_{A→α} p(A → α)^{c(A → α, x, y)}

where c(A → α, x, y) is the number of times the rule A → α is used in the parse tree y, and for every nonterminal A:

  Σ_α p(A → α) = 1
Q-function for PM
  Q(θ, θ^t) = Σ_{i=1}^{n} Σ_y P(y | x_i, θ^t) · log P(x_i, y | θ)
            = Σ_{i=1}^{n} Σ_y P(y | x_i, θ^t) · log ( ∏_{k=1}^{m} ∏_j p_kj^{C_k(j, x_i, y)} )
            = [ Σ_{i=1}^{n} Σ_y P(y | x_i, θ^t) Σ_j C_1(j, x_i, y) · log p_1j ] + … + [ Σ_{i=1}^{n} Σ_y P(y | x_i, θ^t) Σ_j C_m(j, x_i, y) · log p_mj ]

The Q function therefore splits into one term per multinomial θ_k; call the k-th term Q_k(θ, θ^t). Each term can be maximized separately.
Maximizing the Q function
Maximize (one multinomial θ_k at a time; the index k is dropped below)

  Σ_{i=1}^{n} Σ_y P(y | x_i, θ^t) Σ_j C(j, x_i, y) · log p_j

subject to the constraint

  Σ_j p_j = 1

Use Lagrange multipliers:

  Q̂(θ, θ^t) = Σ_{i=1}^{n} Σ_y P(y | x_i, θ^t) Σ_j C(j, x_i, y) · log p_j + λ (1 − Σ_j p_j)
Optimal solution
Setting the derivative of Q̂(θ, θ^t) with respect to p_j to zero:

  ∂Q̂ / ∂p_j = Σ_{i=1}^{n} Σ_y P(y | x_i, θ^t) · C(j, x_i, y) / p_j − λ = 0

  ⇒ p_j = (1/λ) Σ_{i=1}^{n} Σ_y P(y | x_i, θ^t) · C(j, x_i, y)

where λ is a normalization factor (chosen so that Σ_j p_j = 1) and the sum is the expected count for p_j.
PCFG example
• Calculate expected counts
• Update parameters
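A minimal sketch of what this closed-form M-step looks like for the PCFG case: given expected rule counts from the E-step (the numbers below are invented for illustration), each rule probability is simply its expected count normalized over the rules sharing the same left-hand side.

from collections import defaultdict

# Hypothetical expected rule counts from the E-step: (LHS, RHS) -> expected count
expected = {('NP', 'Det N'): 12.4, ('NP', 'Jim'): 3.1,
            ('VP', 'V NP'): 8.0, ('VP', 'sleeps'): 2.0}

# M-step: p(A -> alpha) = expected count, normalized over rules with the same LHS A
totals = defaultdict(float)
for (lhs, _), c in expected.items():
    totals[lhs] += c
probs = {rule: c / totals[rule[0]] for rule, c in expected.items()}
print(probs)   # e.g. p(NP -> Det N) = 12.4 / 15.5 = 0.8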
EM: An example
Forward-backward algorithm
• An HMM is a PM (product of multinomials) model.
• The forward-backward algorithm is a special case of the EM algorithm for PM models.
• X (observed data): each data point is an observation sequence O_{1,T} = o_1 … o_T.
• Y (hidden data): the corresponding state sequence X_1 … X_{T+1}.
• θ (parameters): a_ij, b_ijk, π_i.
Expected counts
  count(s_i → s_j) = Σ_Y P(Y | X, θ) · count(s_i → s_j, X, Y)
                   = Σ_{X_{1,T+1}} P(X_{1,T+1} | O_{1,T}, θ) · count(s_i → s_j, O_{1,T}, X_{1,T+1})
                   = Σ_{t=1}^{T} P(X_t = i, X_{t+1} = j | O_{1,T}, θ)
Expected counts (cont)
  count(s_i → s_j, w_k) = Σ_Y P(Y | X, θ) · count(s_i → s_j, w_k, X, Y)
                        = Σ_{X_{1,T+1}} P(X_{1,T+1} | O_{1,T}, θ) · count(s_i → s_j, w_k, O_{1,T}, X_{1,T+1})
                        = Σ_{t: o_t = w_k} P(X_t = i, X_{t+1} = j | O_{1,T}, θ)
                        = Σ_{t=1}^{T} P(X_t = i, X_{t+1} = j | O_{1,T}, θ) · δ(o_t, w_k)
The inner loop for forward-backward algorithm
Given an input sequence O_{1,T} and an HMM (S, Σ, Π, A, B):

1. Calculate the forward probability:
   • Base case: α_i(1) = π_i
   • Recursive case: α_j(t+1) = Σ_i α_i(t) · a_ij · b_ij(o_t)

2. Calculate the backward probability:
   • Base case: β_i(T+1) = 1
   • Recursive case: β_i(t) = Σ_j a_ij · b_ij(o_t) · β_j(t+1)

3. Calculate the expected counts:
   p_t(i, j) = α_i(t) · a_ij · b_ij(o_t) · β_j(t+1) / Σ_{m=1}^{N} α_m(t) · β_m(t)

4. Update the parameters:
   â_ij = Σ_{t=1}^{T} p_t(i, j) / Σ_{j=1}^{N} Σ_{t=1}^{T} p_t(i, j)
   b̂_ijk = Σ_{t=1}^{T} p_t(i, j) · δ(o_t, w_k) / Σ_{t=1}^{T} p_t(i, j)
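A minimal NumPy sketch of this inner loop (one EM iteration for the arc-emission HMM defined on the next slide). Variable names mirror the slides; this is an illustrative implementation, not the course's own code, and it does no smoothing or zero-count handling.

import numpy as np

def forward_backward_step(obs, pi, A, B):
    """One forward-backward (EM) iteration for an arc-emission HMM.
    obs : observation sequence o_1..o_T as symbol indices (length T)
    pi  : initial state probabilities pi_i, shape (N,)
    A   : transition probabilities a_ij = P(s_j | s_i), shape (N, N)
    B   : emission probabilities b_ijk = P(w_k | s_i -> s_j), shape (N, N, M)
    Returns updated (pi, A, B) and the observation probability P(O)."""
    T, N = len(obs), len(pi)

    # 1. Forward: alpha[t-1, i] = alpha_i(t) = P(O_{1,t-1}, X_t = i)
    alpha = np.zeros((T + 1, N))
    alpha[0] = pi                                          # alpha_i(1) = pi_i
    for t in range(1, T + 1):                              # alpha_j(t+1) = sum_i alpha_i(t) a_ij b_ij(o_t)
        alpha[t] = alpha[t - 1] @ (A * B[:, :, obs[t - 1]])

    # 2. Backward: beta[t-1, i] = beta_i(t) = P(O_{t,T} | X_t = i)
    beta = np.zeros((T + 1, N))
    beta[T] = 1.0                                          # beta_i(T+1) = 1
    for t in range(T, 0, -1):                              # beta_i(t) = sum_j a_ij b_ij(o_t) beta_j(t+1)
        beta[t - 1] = (A * B[:, :, obs[t - 1]]) @ beta[t]

    P_O = alpha[T].sum()                                   # P(O) = sum_i alpha_i(T+1)

    # 3. Expected counts: p_t(i,j) = alpha_i(t) a_ij b_ij(o_t) beta_j(t+1) / P(O)
    p = np.zeros((T, N, N))
    for t in range(1, T + 1):
        p[t - 1] = alpha[t - 1][:, None] * A * B[:, :, obs[t - 1]] * beta[t][None, :] / P_O

    # 4. Update the parameters (no zero-count smoothing in this sketch)
    pi_new = p[0].sum(axis=1)                              # gamma_i(1)
    A_new = p.sum(axis=0) / p.sum(axis=(0, 2))[:, None]    # a_ij
    B_new = np.zeros_like(B)
    for t in range(T):
        B_new[:, :, obs[t]] += p[t]                        # sum_t p_t(i,j) * delta(o_t, w_k)
    B_new /= p.sum(axis=0)[:, :, None]
    return pi_new, A_new, B_new, P_O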
HMM
• An HMM is a tuple (S, Σ, Π, A, B):
  – A set of states S = {s_1, s_2, …, s_N}.
  – A set of output symbols Σ = {w_1, …, w_M}.
  – Initial state probabilities Π = {π_i}.
  – State transition probabilities A = {a_ij}.
  – Symbol emission probabilities B = {b_ijk}.
• State sequence: X_1 … X_{T+1}
• Output sequence: o_1 … o_T
Forward probability
The probability of producing O_{1,t−1} while ending up in state s_i:

  α_i(t) ≝ P(O_{1,t−1}, X_t = i)
Calculating forward probability
Initialization: α_i(1) = π_i

Induction:

  α_j(t+1) = P(O_{1,t}, X_{t+1} = j)
           = Σ_i P(O_{1,t−1}, X_t = i, o_t, X_{t+1} = j)
           = Σ_i P(O_{1,t−1}, X_t = i) · P(o_t, X_{t+1} = j | O_{1,t−1}, X_t = i)
           = Σ_i P(O_{1,t−1}, X_t = i) · P(o_t, X_{t+1} = j | X_t = i)
           = Σ_i α_i(t) · a_ij · b_ij(o_t)
Backward probability
• The probability of producing the sequence O_{t,T}, given that at time t we are in state s_i:

  β_i(t) ≝ P(O_{t,T} | X_t = i)
Calculating backward probability
Initialization: β_i(T+1) = 1

Induction:

  β_i(t) = P(O_{t,T} | X_t = i)
         = Σ_j P(o_t, O_{t+1,T}, X_{t+1} = j | X_t = i)
         = Σ_j P(o_t, X_{t+1} = j | X_t = i) · P(O_{t+1,T} | o_t, X_{t+1} = j, X_t = i)
         = Σ_j P(o_t, X_{t+1} = j | X_t = i) · P(O_{t+1,T} | X_{t+1} = j)
         = Σ_j a_ij · b_ij(o_t) · β_j(t+1)
Calculating the prob of the observation
  P(O) = Σ_{i=1}^{N} π_i · β_i(1)

  P(O) = Σ_{i=1}^{N} α_i(T+1)

  P(O) = Σ_{i=1}^{N} P(O, X_t = i) = Σ_{i=1}^{N} α_i(t) · β_i(t)    (for any t)
Estimating parameters
• The probability of traversing the arc s_i → s_j at time t, given O (denoted p_t(i, j) in M&S):

  p_t(i, j) = P(X_t = i, X_{t+1} = j | O, θ)
            = P(X_t = i, X_{t+1} = j, O | θ) / P(O | θ)
            = α_i(t) · a_ij · b_ij(o_t) · β_j(t+1) / Σ_{m=1}^{N} α_m(t) · β_m(t)

• The probability of being in state i at time t, given O:

  γ_i(t) = P(X_t = i | O, θ) = Σ_{j=1}^{N} P(X_t = i, X_{t+1} = j | O, θ) = Σ_{j=1}^{N} p_t(i, j)
Expected counts
Sum over the time index:
• Expected # of transitions from state i to j in O:
  Σ_{t=1}^{T} p_t(i, j)
• Expected # of transitions from state i in O:
  Σ_{t=1}^{T} γ_i(t) = Σ_{t=1}^{T} Σ_{j=1}^{N} p_t(i, j)
Update parameters
  π̂_i = expected frequency in state i at time 1 = γ_i(1)

  â_ij = (expected # of transitions from state i to j) / (expected # of transitions from state i)
       = Σ_{t=1}^{T} p_t(i, j) / Σ_{j=1}^{N} Σ_{t=1}^{T} p_t(i, j)

  b̂_ijk = (expected # of transitions from state i to j with w_k observed) / (expected # of transitions from state i to j)
        = Σ_{t=1}^{T} p_t(i, j) · δ(o_t, w_k) / Σ_{t=1}^{T} p_t(i, j)
Final formulae
  p_t(i, j) = α_i(t) · a_ij · b_ij(o_t) · β_j(t+1) / Σ_{m=1}^{N} α_m(t) · β_m(t)

  â_ij = Σ_{t=1}^{T} p_t(i, j) / Σ_{j=1}^{N} Σ_{t=1}^{T} p_t(i, j)

  b̂_ijk = Σ_{t=1}^{T} p_t(i, j) · δ(o_t, w_k) / Σ_{t=1}^{T} p_t(i, j)
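For instance, running the forward_backward_step sketch from the inner-loop slide on a toy two-state, two-symbol HMM (all numbers below are arbitrary) shows P(O) improving across iterations:

import numpy as np

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3],
              [0.4, 0.6]])                     # a_ij
B = np.array([[[0.9, 0.1], [0.5, 0.5]],
              [[0.2, 0.8], [0.3, 0.7]]])       # b_ijk = P(w_k | s_i -> s_j)
obs = [0, 1, 1, 0, 1]                          # o_1..o_T as symbol indices

for it in range(5):
    pi, A, B, P_O = forward_backward_step(obs, pi, A, B)
    print(it, P_O)                             # P(O) is non-decreasing across iterations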
Summary
Summary
• The EM algorithm
  – An iterative approach
  – L(θ) is non-decreasing at each iteration
  – An optimal solution in the M-step exists for many classes of problems.
• The EM algorithm for PM models
  – Simpler formulae
  – Three special cases:
    • Inside-outside algorithm
    • Forward-backward algorithm
    • IBM models for MT
Relations among the algorithms
• The generalized EM
  – The EM algorithm
    • EM for PM models: inside-outside, forward-backward, IBM models
    • Gaussian mixtures
Strengths of EM
• Numerical stability: each iteration of the EM algorithm is guaranteed not to decrease the likelihood of the observed data.
• EM handles parameter constraints gracefully.
Problems with EM
• Convergence can be very slow on some problems and is intimately related to the amount of missing information.
• It is guaranteed to improve the likelihood of the training corpus, which is different from directly reducing errors.
• It is not guaranteed to reach a global maximum (it can get stuck at a local maximum, a saddle point, etc.).
  – The initial estimate is therefore important.
Additional slides
The EM algorithm for PM models
// for each iteration:
//   reset all expected counts to zero
//   for each training example x_i:
//     for each possible y:
//       for each parameter p_j: add P(y | x_i, θ^t) · C(j, x_i, y) to its expected count
//   for each parameter p_j: set p_j to its expected count, normalized within its multinomial
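A minimal Python sketch of that loop for a generic PM model. The callables enumerate_y, posterior, and counts are hypothetical interfaces that a concrete model (HMM, PCFG, …) would have to supply; they are assumptions for illustration, not part of the original slides.

from collections import defaultdict

def em_for_pm(data, theta, enumerate_y, posterior, counts, n_iters=10):
    """Generic EM for a PM model.
    theta: dict mapping (multinomial k, outcome j) -> probability p_kj
    enumerate_y(x): iterates over the possible hidden values y for x   (hypothetical interface)
    posterior(y, x, theta): P(y | x, theta)                            (hypothetical interface)
    counts(x, y): dict mapping (k, j) -> C_k(j, x, y)                  (hypothetical interface)"""
    for _ in range(n_iters):                                 # for each iteration
        expected = defaultdict(float)                        # expected count per parameter
        for x in data:                                       # for each training example x_i
            for y in enumerate_y(x):                         # for each possible y
                w = posterior(y, x, theta)                   # P(y | x_i, theta^t)
                for kj, c in counts(x, y).items():           # for each parameter
                    expected[kj] += w * c
        # M-step: renormalize the expected counts within each multinomial k
        totals = defaultdict(float)
        for (k, j), v in expected.items():
            totals[k] += v
        theta = {(k, j): v / totals[k]                       # parameters with zero total are dropped
                 for (k, j), v in expected.items() if totals[k] > 0}
    return theta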