em-talk
TRANSCRIPT
-
Expectation-Maximization Algorithm
and Applications
Eugene Weinstein
Courant Institute of Mathematical Sciences
Nov 14th, 2006
-
List of Concepts
Maximum-Likelihood Estimation (MLE)
Expectation-Maximization (EM)
Conditional Probability
Mixture Modeling
Gaussian Mixture Models (GMMs)
String edit-distance
Forward-backward algorithms
-
Overview
Expectation-Maximization
Mixture Model Training
Learning String Edit-Distance
-
One-Slide MLE Review
Say I give you a coin with heads probability θ
But I don't tell you the value of θ
Now say I let you flip the coin n times: you get h heads and (n - h) tails
What is the natural estimate of θ? This is θ̂ = h / n
More formally, the likelihood of θ is governed by a binomial distribution:
L(θ) = C(n, h) θ^h (1 - θ)^(n - h)
Can prove θ̂ = h / n is the maximum-likelihood estimate of θ:
differentiate log L(θ) with respect to θ, set equal to 0
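As a quick sanity check of this estimate, here is a minimal sketch (the data and names are illustrative, not from the talk) comparing the closed-form MLE with a brute-force search over the binomial log-likelihood:

import numpy as np

# Illustrative data: n coin flips with h heads
n, h = 100, 63

# Closed-form MLE: theta_hat = h / n
theta_hat = h / n

# Brute force: maximize the binomial log-likelihood over a grid of theta values
thetas = np.linspace(0.001, 0.999, 999)
log_lik = h * np.log(thetas) + (n - h) * np.log(1 - thetas)  # constant C(n, h) omitted
theta_grid = thetas[np.argmax(log_lik)]

print(theta_hat, theta_grid)  # both should be (approximately) 0.63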
-
EM Motivation
So, to solve any ML-type problem, do we just analytically maximize the likelihood function?
Seems to work for 1D Bernoulli (coin toss)
Also works for 1D Gaussian (find μ, σ²)
Not quite:
Distribution may not be well-behaved, or may have too many parameters
Say your likelihood function is a mixture of 1000 1000-dimensional Gaussians (~1M parameters)
Direct maximization is not feasible
Solution: introduce hidden variables to
Simplify the likelihood function (more common)
Account for actual missing data
-
Hidden and Observed Variables
Observed variables: directly measurable from the data, e.g.
The waveform values of a speech recording
Is it raining today?
Did the smoke alarm go off?
Hidden variables: influence the data, but are not trivial to measure, e.g.
The phonemes that produce a given speech recording
P(rain today | rain yesterday)
Is the smoke alarm malfunctioning?
-
Expectation-Maximization
Model dependent random variables:
Observed variable x
Unobserved (hidden) variable y that generates x
Assume probability distributions P(x, y | θ), where θ represents the set of all parameters of the distribution
Repeat until convergence:
E-step: Compute the expectation Q(θ, θ′) = E[ log P(x, y | θ) | x, θ′ ]
(θ′, θ: old, new distribution parameters)
M-step: Find the θ that maximizes Q(θ, θ′)
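As a minimal sketch of this loop (nothing model-specific; e_step and m_step are hypothetical placeholders for the computations the later slides spell out):

def em(x, theta_init, e_step, m_step, n_iter=100):
    """Generic EM loop (sketch): alternate E and M steps for a fixed number of iterations.

    e_step(x, theta) should return the expected sufficient statistics needed for Q;
    m_step(stats) should return the theta that maximizes Q given those statistics.
    Both are model-specific placeholders.
    """
    theta = theta_init
    for _ in range(n_iter):
        stats = e_step(x, theta)   # E-step: expectations of the hidden variables given x, theta
        theta = m_step(stats)      # M-step: re-estimate the parameters from those expectations
    return theta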
-
Conditional Expectation Review
Let X, Y be r.v.s drawn from the distributions P(x) and P(y)
Conditional distribution given by: P(y | x) = P(x, y) / P(x)
Then E[Y] = Σ_y y P(y)
For a function h(Y): E[h(Y)] = Σ_y h(y) P(y)
Given a particular value of X (X = x): E[h(Y) | X = x] = Σ_y h(y) P(y | x)
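A tiny numeric illustration of these formulas (the joint distribution below is made up for the example):

# Made-up joint distribution P(x, y) over X in {0, 1}, Y in {1, 2, 3}
P = {(0, 1): 0.10, (0, 2): 0.20, (0, 3): 0.10,
     (1, 1): 0.05, (1, 2): 0.25, (1, 3): 0.30}

def cond_expectation(h, x):
    """E[h(Y) | X = x] = sum_y h(y) P(y | x), with P(y | x) = P(x, y) / P(x)."""
    p_x = sum(p for (xi, _), p in P.items() if xi == x)
    return sum(h(y) * p / p_x for (xi, y), p in P.items() if xi == x)

print(cond_expectation(lambda y: y, x=1))       # E[Y | X = 1]
print(cond_expectation(lambda y: y ** 2, x=0))  # E[Y^2 | X = 0]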
-
Maximum Likelihood Problem
Want to pick the θ that maximizes the log-likelihood of the observed (x) and unobserved (y) variables given:
Observed variable x
Previous parameters θ′
Conditional expectation of log P(x, y | θ) given x and θ′ is
Q(θ, θ′) = E[ log P(x, y | θ) | x, θ′ ]
-
EM Derivation
Lemma (special case of Jensen's inequality): Let p(x), q(x) be probability distributions. Then
Σ_x p(x) log p(x) ≥ Σ_x p(x) log q(x)
Proof: rewrite as Σ_x p(x) log ( p(x) / q(x) ) ≥ 0
Interpretation: relative entropy (KL divergence) is non-negative
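One standard way to finish the rewritten form (the step the slide leaves implicit), using concavity of the logarithm:

\sum_x p(x)\log\frac{q(x)}{p(x)}
  \;\le\; \log\sum_x p(x)\,\frac{q(x)}{p(x)}
  \;=\; \log\sum_x q(x) \;=\; \log 1 \;=\; 0

so Σ_x p(x) log ( p(x) / q(x) ) ≥ 0, which is exactly the rewritten statement above.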
-
EM Derivation
EM Theorem:
If Q(θ, θ′) ≥ Q(θ′, θ′), then P(x | θ) ≥ P(x | θ′)
Proof:
By some algebra and the lemma,
log P(x | θ) − log P(x | θ′) ≥ Q(θ, θ′) − Q(θ′, θ′)
So, if Q(θ, θ′) − Q(θ′, θ′) is positive, so is log P(x | θ) − log P(x | θ′)
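A sketch of the omitted algebra, in the notation above (this is the standard EM argument, not text from the slide itself). Since P(x, y | θ) = P(x | θ) P(y | x, θ),

\log P(x \mid \theta) = \log P(x, y \mid \theta) - \log P(y \mid x, \theta)

Taking the expectation of both sides with respect to P(y | x, θ′):

\log P(x \mid \theta) = Q(\theta, \theta') - \sum_y P(y \mid x, \theta') \log P(y \mid x, \theta)

Subtracting the same identity evaluated at θ = θ′:

\log P(x \mid \theta) - \log P(x \mid \theta')
  = \bigl[ Q(\theta, \theta') - Q(\theta', \theta') \bigr]
  + \Bigl[ \sum_y P(y \mid x, \theta') \log P(y \mid x, \theta')
         - \sum_y P(y \mid x, \theta') \log P(y \mid x, \theta) \Bigr]

The second bracket is non-negative by the lemma, which gives the inequality used in the proof.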
-
EM Summary
Repeat until convergence:
E-step: Compute the expectation Q(θ, θ′) = E[ log P(x, y | θ) | x, θ′ ]
(θ′, θ: old, new distribution parameters)
M-step: Find the θ that maximizes Q(θ, θ′)
EM Theorem:
If Q(θ, θ′) ≥ Q(θ′, θ′), then P(x | θ) ≥ P(x | θ′)
Interpretation: as long as we can improve the expectation of the log-likelihood,
EM improves our model of the observed variable x
Actually, it's not necessary to maximize the expectation, just to make sure
that it increases; this is called Generalized EM
-
EM Comments
In practice, x is a series of data points x_1, ..., x_n
To calculate the expectation, can assume the points are i.i.d. and sum over all of them:
Q(θ, θ′) = Σ_i Σ_y P(y | x_i, θ′) log P(x_i, y | θ)
Problems with EM?
Local maxima
Need to bootstrap the training process (pick an initial θ)
When is EM most useful?
When the model distributions are easy to maximize (e.g., Gaussian mixture models)
EM is a meta-algorithm; it needs to be adapted to the particular application
-
Overview
Expectation-Maximization
Mixture Model Training
Learning String Edit-Distance
-
EM Applications: Mixture Models
Gaussian/normal distribution
Parameters: mean μ and variance σ²
In the multi-dimensional case, assume an isotropic Gaussian:
same variance in all dimensions
We can model arbitrary distributions with density mixtures
-
Density Mixtures
Combine m elementary densities to model a complex data distribution:
p(x | Θ) = Σ_{k=1..m} w_k p_k(x | θ_k), with Σ_k w_k = 1
kth Gaussian parametrized by θ_k = (μ_k, σ_k)
-
Density Mixtures
Combine m elementary densities to model a complex data distribution
Log-likelihood function of the data x given Θ:
log p(x | Θ) = Σ_{i=1..n} log Σ_{k=1..m} w_k p_k(x_i | θ_k)
Log of a sum is hard to optimize analytically!
Instead, introduce hidden variable y
y_i = k: x_i generated by Gaussian k
EM formulation: maximize Q(Θ, Θ′) = E[ log p(x, y | Θ) | x, Θ′ ]
-
Gaussian Mixture Model EM
Goal: maximize Q(Θ, Θ′)
n (observed) data points: x_1, ..., x_n
n (hidden) labels: y_i = k means x_i was generated by Gaussian k
Several pages of math later, we get:
E step: compute the likelihood P(y_i = k | x_i, Θ′) for each point i and each Gaussian k
M step: update w_k, μ_k, σ_k for each Gaussian k = 1..m
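A minimal NumPy sketch of these updates for a 1-D isotropic GMM (the function name and the simple initialization are illustrative choices, not the talk's):

import numpy as np

def gmm_em(x, m, n_iter=100, seed=0):
    """Fit an m-component isotropic GMM to 1-D data x with EM (illustrative sketch)."""
    x = np.asarray(x, dtype=float)
    rng = np.random.default_rng(seed)
    n = len(x)
    w = np.full(m, 1.0 / m)                      # mixture weights w_k
    mu = rng.choice(x, size=m, replace=False)    # bootstrap means from the data
    var = np.full(m, np.var(x))                  # one variance per component

    for _ in range(n_iter):
        # E step: responsibilities P(y_i = k | x_i, current parameters)
        dens = np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        resp = w * dens
        resp /= resp.sum(axis=1, keepdims=True)

        # M step: re-estimate w_k, mu_k, sigma_k^2 from the expected counts
        nk = resp.sum(axis=0)
        w = nk / n
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    return w, mu, var

For example, running gmm_em on data drawn from two well-separated Gaussians should recover means close to the true ones (up to label permutation).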
-
GMM-EM Discussion
Summary: EM is naturally applicable to training probabilistic models
EM is a generic formulation; you need to do some hairy math to get to an implementation
Problems with GMM-EM?
Local maxima
Need to bootstrap the training process (pick an initial Θ)
GMM-EM is applicable to an enormous number of pattern recognition tasks: speech, vision, etc.
Hours of fun with GMM-EM
-
Overview
Expectation-Maximization
Mixture Model Training
Learning String Edit-Distance
-
String Edit-Distance
Notation: operate on two strings x and y
Edit-distance: transform one string into another using edit operations, each with a cost
Substitution: kitten → bitten, at a substitution cost
Insertion: cop → crop, at an insertion cost
Deletion: learn → earn, at a deletion cost
Can compute efficiently with a recursive (dynamic-programming) algorithm
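A minimal sketch of that recursive computation (unit costs assumed here purely for illustration):

def edit_distance(x, y, sub_cost=1, ins_cost=1, del_cost=1):
    """Levenshtein-style edit distance between strings x and y via dynamic programming."""
    n, m = len(x), len(y)
    # d[i][j] = cost of transforming x[:i] into y[:j]
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i * del_cost
    for j in range(1, m + 1):
        d[0][j] = j * ins_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = d[i - 1][j - 1] + (0 if x[i - 1] == y[j - 1] else sub_cost)
            d[i][j] = min(sub, d[i - 1][j] + del_cost, d[i][j - 1] + ins_cost)
    return d[n][m]

# e.g. edit_distance("kitten", "bitten") == 1, edit_distance("learn", "earn") == 1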
-
Stochastic String Edit-Distance
Instead of setting costs, model the edit operation sequence as a random process
Edit operations selected according to a probability distribution δ
For an edit operation sequence z_1 ... z_n: P(z_1 ... z_n) = Π_i δ(z_i)
View string edit-distance as a memoryless stochastic transducer:
memoryless (Markov): each edit operation is chosen independently of the ones before it
stochastic: the random process is governed by a true probability distribution δ(·)
transducer: the edit sequence transforms an input string into an output string
-
Edit-Distance Transducer
Arc label a:b/0 means input a, output b, and weight 0
Assume
-
Two Distances
Define the yield of an edit sequence z_1 ... z_n as the set of all string pairs ⟨x, y⟩ such that z_1 ... z_n turns x into y
Viterbi edit-distance: negative log-likelihood of the most likely edit sequence turning x into y
Stochastic edit-distance: negative log-likelihood of the set of all edit sequences from x to y
-
Evaluating Likelihood
Viterbi: maximize over the edit sequences whose yield contains ⟨x, y⟩
Stochastic: sum over the edit sequences whose yield contains ⟨x, y⟩
Both require a calculation over all possible edit sequences
Exponentially many possibilities (three kinds of edit operations)
However, the memoryless assumption allows us to compute the likelihood efficiently
Use the forward-backward method!
-
Forward
Evaluation of forward probabilities α_{i,j}:
the likelihood of picking an edit sequence that generates the prefix pair ⟨x_1 .. x_i, y_1 .. y_j⟩
Memoryless assumption allows efficient recursive computation:
α_{0,0} = 1;  α_{i,j} = δ(x_i, y_j) α_{i-1,j-1} + δ(x_i, ε) α_{i-1,j} + δ(ε, y_j) α_{i,j-1}
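A minimal sketch of this recursion; the dictionaries p_sub, p_del, p_ins are an illustrative way of storing δ (and the end-of-string symbol used in the full Ristad-Yianilos model is ignored here):

import numpy as np

def forward(x, y, p_sub, p_del, p_ins):
    """alpha[i, j] = total probability of edit sequences generating prefixes x[:i], y[:j]."""
    n, m = len(x), len(y)
    alpha = np.zeros((n + 1, m + 1))
    alpha[0, 0] = 1.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            total = 0.0
            if i > 0 and j > 0:   # substitution x[i-1] -> y[j-1]
                total += p_sub.get((x[i - 1], y[j - 1]), 0.0) * alpha[i - 1, j - 1]
            if i > 0:             # deletion of x[i-1]
                total += p_del.get(x[i - 1], 0.0) * alpha[i - 1, j]
            if j > 0:             # insertion of y[j-1]
                total += p_ins.get(y[j - 1], 0.0) * alpha[i, j - 1]
            alpha[i, j] = total
    return alpha   # alpha[n, m] is P(x, y) under the memoryless model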
-
Backward
Evaluation of backward probabilities β_{i,j}:
the likelihood of picking an edit sequence that generates the suffix pair ⟨x_{i+1} .. x_n, y_{j+1} .. y_m⟩
Memoryless assumption allows efficient recursive computation:
β_{n,m} = 1;  β_{i,j} = δ(x_{i+1}, y_{j+1}) β_{i+1,j+1} + δ(x_{i+1}, ε) β_{i+1,j} + δ(ε, y_{j+1}) β_{i,j+1}
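The symmetric sketch for the backward pass, with the same illustrative parameter layout as the forward sketch above:

import numpy as np

def backward(x, y, p_sub, p_del, p_ins):
    """beta[i, j] = total probability of edit sequences generating suffixes x[i:], y[j:]."""
    n, m = len(x), len(y)
    beta = np.zeros((n + 1, m + 1))
    beta[n, m] = 1.0
    for i in range(n, -1, -1):
        for j in range(m, -1, -1):
            if i == n and j == m:
                continue
            total = 0.0
            if i < n and j < m:   # substitution x[i] -> y[j]
                total += p_sub.get((x[i], y[j]), 0.0) * beta[i + 1, j + 1]
            if i < n:             # deletion of x[i]
                total += p_del.get(x[i], 0.0) * beta[i + 1, j]
            if j < m:             # insertion of y[j]
                total += p_ins.get(y[j], 0.0) * beta[i, j + 1]
            beta[i, j] = total
    return beta   # beta[0, 0] should equal alpha[n, m]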
-
EM Formulation
Edit operations selected according to a probability distribution δ
So, EM has to update δ based on occurrence counts of each operation (similar to the coin-tossing example)
Idea: accumulate expected counts from the forward and backward variables
γ(z): expected count of edit operation z
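A minimal sketch of that accumulation for a single string pair, building on the forward/backward sketches above (again with illustrative names, and ignoring the end-of-string symbol):

from collections import defaultdict

def expected_counts(x, y, p_sub, p_del, p_ins):
    """Accumulate gamma(z), the expected count of each edit operation, for one string pair."""
    alpha = forward(x, y, p_sub, p_del, p_ins)
    beta = backward(x, y, p_sub, p_del, p_ins)
    n, m = len(x), len(y)
    total = alpha[n, m]                     # P(x, y) under the current model
    gamma = defaultdict(float)
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:             # substitution x[i-1] -> y[j-1]
                gamma[('sub', x[i - 1], y[j - 1])] += (
                    alpha[i - 1, j - 1] * p_sub.get((x[i - 1], y[j - 1]), 0.0) * beta[i, j]) / total
            if i > 0:                       # deletion of x[i-1]
                gamma[('del', x[i - 1])] += (
                    alpha[i - 1, j] * p_del.get(x[i - 1], 0.0) * beta[i, j]) / total
            if j > 0:                       # insertion of y[j-1]
                gamma[('ins', y[j - 1])] += (
                    alpha[i, j - 1] * p_ins.get(y[j - 1], 0.0) * beta[i, j]) / total
    return gamma

# M step (sketch): renormalize the expected counts into a new distribution delta,
# delta(z) = gamma(z) / sum_z' gamma(z'), with counts summed over all training string pairs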
-
EM Details
γ(z): expected count of edit operation z, accumulated cell by cell from the forward and backward probabilities (e.g., for the substitution, insertion, and deletion possible at each position)
-
References
A. P. Dempster, N. M. Laird, and D. B. Rubin, Maximum likelihood from incomplete data via the EM algorithm, Journal of the Royal Statistical Society B, 39(1), 1977, pp. 1-38.
C. F. J. Wu, On the Convergence Properties of the EM Algorithm, The Annals of Statistics, 11(1), Mar 1983, pp. 95-103.
F. Jelinek, Statistical Methods for Speech Recognition, 1997.
M. Collins, The EM Algorithm, 1997.
J. A. Bilmes, A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models, Technical Report TR-97-021, U.C. Berkeley, 1998.
E. S. Ristad and P. N. Yianilos, Learning string edit distance, IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(2), 1998, pp. 522-532.
L. R. Rabiner, A tutorial on HMM and selected applications in speech recognition, in Proc. IEEE, 77(2), 1989, pp. 257-286.
A. D'Souza, Using EM To Estimate A Probablity [sic] Density With A Mixture Of Gaussians.
M. Mohri, Edit-Distance of Weighted Automata, in Proc. Implementation and Application of Automata (CIAA) 2002, pp. 1-23.
J. Glass, Lecture Notes, MIT class 6.345: Automatic Speech Recognition, 2003.
Carlo Tomasi, Estimating Gaussian Mixture Densities with EM: A Tutorial, 2004.
Wikipedia.