em-talk
TRANSCRIPT
-
Expectation-Maximization Algorithm
and Applications
Eugene Weinstein
Courant Institute of Mathematical Sciences
Nov 14th, 2006
-
List of Concepts
Maximum-Likelihood Estimation (MLE)
Expectation-Maximization (EM)
Conditional Probability
Mixture Modeling
Gaussian Mixture Models (GMMs)
String edit-distance
Forward-backward algorithms
-
Overview
Expectation-Maximization
Mixture Model Training
Learning String Edit-Distance
-
One-Slide MLE Review
Say I give you a coin with heads probability θ
But I don't tell you the value of θ
Now say I let you flip the coin n times: you get h heads and (n - h) tails
What is the natural estimate of θ? This is θ̂ = h / n
More formally, the likelihood of θ is governed by a binomial distribution:
L(θ) = C(n, h) θ^h (1 - θ)^(n - h)
Can prove θ̂ = h / n is the maximum-likelihood estimate of θ:
differentiate log L(θ) with respect to θ, set equal to 0
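As a quick sanity check of this estimate, here is a minimal sketch (the data and names are illustrative, not from the talk) comparing the closed-form MLE with a brute-force search over the binomial log-likelihood:

import numpy as np

# Illustrative data: n coin flips with h heads
n, h = 100, 63

# Closed-form MLE: theta_hat = h / n
theta_hat = h / n

# Brute force: maximize the binomial log-likelihood over a grid of theta values
thetas = np.linspace(0.001, 0.999, 999)
log_lik = h * np.log(thetas) + (n - h) * np.log(1 - thetas)  # constant C(n, h) omitted
theta_grid = thetas[np.argmax(log_lik)]

print(theta_hat, theta_grid)  # both should be (approximately) 0.63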
-
EM Motivation
So, to solve any ML-type problem, do we just analytically maximize the likelihood function?
Seems to work for 1D Bernoulli (coin toss)
Also works for 1D Gaussian (find μ, σ²)
Not quite:
Distribution may not be well-behaved, or may have too many parameters
Say your likelihood function is a mixture of 1000 1000-dimensional Gaussians (~1M parameters)
Direct maximization is not feasible
Solution: introduce hidden variables to
Simplify the likelihood function (more common)
Account for actual missing data
-
Hidden and Observed Variables
Observed variables: directly measurable from the data, e.g.
The waveform values of a speech recording
Is it raining today?
Did the smoke alarm go off?
Hidden variables: influence the data, but are not trivial to measure, e.g.
The phonemes that produce a given speech recording
P(rain today | rain yesterday)
Is the smoke alarm malfunctioning?
-
Expectation-Maximization
Model dependent random variables:
Observed variable x
Unobserved (hidden) variable y that generates x
Assume probability distributions P(x, y | θ), where θ represents the set of all parameters of the distribution
Repeat until convergence:
E-step: Compute the expectation Q(θ, θ′) = E[ log P(x, y | θ) | x, θ′ ]
(θ′, θ: old, new distribution parameters)
M-step: Find the θ that maximizes Q(θ, θ′)
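As a minimal sketch of this loop (nothing model-specific; e_step and m_step are hypothetical placeholders for the computations the later slides spell out):

def em(x, theta_init, e_step, m_step, n_iter=100):
    """Generic EM loop (sketch): alternate E and M steps for a fixed number of iterations.

    e_step(x, theta) should return the expected sufficient statistics needed for Q;
    m_step(stats) should return the theta that maximizes Q given those statistics.
    Both are model-specific placeholders.
    """
    theta = theta_init
    for _ in range(n_iter):
        stats = e_step(x, theta)   # E-step: expectations of the hidden variables given x, theta
        theta = m_step(stats)      # M-step: re-estimate the parameters from those expectations
    return theta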
-
Conditional Expectation Review
Let X, Y be r.v.s drawn from the distributions P(x) and P(y)
Conditional distribution given by: P(y | x) = P(x, y) / P(x)
Then E[Y] = Σ_y y P(y)
For a function h(Y): E[h(Y)] = Σ_y h(y) P(y)
Given a particular value of X (X = x): E[h(Y) | X = x] = Σ_y h(y) P(y | x)
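A tiny numeric illustration of these formulas (the joint distribution below is made up for the example):

# Made-up joint distribution P(x, y) over X in {0, 1}, Y in {1, 2, 3}
P = {(0, 1): 0.10, (0, 2): 0.20, (0, 3): 0.10,
     (1, 1): 0.05, (1, 2): 0.25, (1, 3): 0.30}

def cond_expectation(h, x):
    """E[h(Y) | X = x] = sum_y h(y) P(y | x), with P(y | x) = P(x, y) / P(x)."""
    p_x = sum(p for (xi, _), p in P.items() if xi == x)
    return sum(h(y) * p / p_x for (xi, y), p in P.items() if xi == x)

print(cond_expectation(lambda y: y, x=1))       # E[Y | X = 1]
print(cond_expectation(lambda y: y ** 2, x=0))  # E[Y^2 | X = 0]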
-
Maximum Likelihood Problem
Want to pick the θ that maximizes the log-likelihood of the observed (x) and unobserved (y) variables given:
Observed variable x
Previous parameters θ′
Conditional expectation of log P(x, y | θ) given x and θ′ is
Q(θ, θ′) = E[ log P(x, y | θ) | x, θ′ ]
-
EM Derivation
Lemma (special case of Jensen's inequality): Let p(x), q(x) be probability distributions. Then
Σ_x p(x) log p(x) ≥ Σ_x p(x) log q(x)
Proof: rewrite as Σ_x p(x) log ( p(x) / q(x) ) ≥ 0
Interpretation: relative entropy (KL divergence) is non-negative
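One standard way to finish the rewritten form (the step the slide leaves implicit), using concavity of the logarithm:

\sum_x p(x)\log\frac{q(x)}{p(x)}
  \;\le\; \log\sum_x p(x)\,\frac{q(x)}{p(x)}
  \;=\; \log\sum_x q(x) \;=\; \log 1 \;=\; 0

so Σ_x p(x) log ( p(x) / q(x) ) ≥ 0, which is exactly the rewritten statement above.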
-
EM Derivation
EM Theorem:
If Q(θ, θ′) ≥ Q(θ′, θ′), then P(x | θ) ≥ P(x | θ′)
Proof:
By some algebra and the lemma,
log P(x | θ) − log P(x | θ′) ≥ Q(θ, θ′) − Q(θ′, θ′)
So, if Q(θ, θ′) − Q(θ′, θ′) is positive, so is log P(x | θ) − log P(x | θ′)
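A sketch of the omitted algebra, in the notation above (this is the standard EM argument, not text from the slide itself). Since P(x, y | θ) = P(x | θ) P(y | x, θ),

\log P(x \mid \theta) = \log P(x, y \mid \theta) - \log P(y \mid x, \theta)

Taking the expectation of both sides with respect to P(y | x, θ′):

\log P(x \mid \theta) = Q(\theta, \theta') - \sum_y P(y \mid x, \theta') \log P(y \mid x, \theta)

Subtracting the same identity evaluated at θ = θ′:

\log P(x \mid \theta) - \log P(x \mid \theta')
  = \bigl[ Q(\theta, \theta') - Q(\theta', \theta') \bigr]
  + \Bigl[ \sum_y P(y \mid x, \theta') \log P(y \mid x, \theta')
         - \sum_y P(y \mid x, \theta') \log P(y \mid x, \theta) \Bigr]

The second bracket is non-negative by the lemma, which gives the inequality used in the proof.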
-
EM Summary
Repeat until convergence:
E-step: Compute the expectation Q(θ, θ′) = E[ log P(x, y | θ) | x, θ′ ]
(θ′, θ: old, new distribution parameters)
M-step: Find the θ that maximizes Q(θ, θ′)
EM Theorem:
If Q(θ, θ′) ≥ Q(θ′, θ′), then P(x | θ) ≥ P(x | θ′)
Interpretation: as long as we can improve the expectation of the log-likelihood,
EM improves our model of the observed variable x
Actually, it's not necessary to maximize the expectation, just to make sure
that it increases; this is called Generalized EM
-
EM Comments
In practice, x is a series of data points x_1, ..., x_n
To calculate the expectation, can assume the points are i.i.d. and sum over all of them:
Q(θ, θ′) = Σ_i Σ_y P(y | x_i, θ′) log P(x_i, y | θ)
Problems with EM?
Local maxima
Need to bootstrap the training process (pick an initial θ)
When is EM most useful?
When the model distributions are easy to maximize (e.g., Gaussian mixture models)
EM is a meta-algorithm; it needs to be adapted to the particular application
-
Overview
Expectation-Maximization
Mixture Model Training
Learning String Edit-Distance
-
EM Applications: Mixture Models
Gaussian/normal distribution
Parameters: mean μ and variance σ²
In the multi-dimensional case, assume an isotropic Gaussian:
same variance in all dimensions
We can model arbitrary distributions with density mixtures
-
Density Mixtures
Combine m elementary densities to model a complex data distribution:
p(x | Θ) = Σ_{k=1..m} w_k p_k(x | θ_k), with Σ_k w_k = 1
kth Gaussian parametrized by θ_k = (μ_k, σ_k)
-
Density Mixtures
Combine m elementary densities to model a complex data distribution
Log-likelihood function of the data x given Θ:
log p(x | Θ) = Σ_{i=1..n} log Σ_{k=1..m} w_k p_k(x_i | θ_k)
Log of a sum is hard to optimize analytically!
Instead, introduce hidden variable y
y_i = k: x_i generated by Gaussian k
EM formulation: maximize Q(Θ, Θ′) = E[ log p(x, y | Θ) | x, Θ′ ]
-
Gaussian Mixture Model EM
Goal: maximize Q(Θ, Θ′)
n (observed) data points: x_1, ..., x_n
n (hidden) labels: y_i = k means x_i was generated by Gaussian k
Several pages of math later, we get:
E step: compute the likelihood P(y_i = k | x_i, Θ′) for each point i and each Gaussian k
M step: update w_k, μ_k, σ_k for each Gaussian k = 1..m
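A minimal NumPy sketch of these updates for a 1-D isotropic GMM (the function name and the simple initialization are illustrative choices, not the talk's):

import numpy as np

def gmm_em(x, m, n_iter=100, seed=0):
    """Fit an m-component isotropic GMM to 1-D data x with EM (illustrative sketch)."""
    x = np.asarray(x, dtype=float)
    rng = np.random.default_rng(seed)
    n = len(x)
    w = np.full(m, 1.0 / m)                      # mixture weights w_k
    mu = rng.choice(x, size=m, replace=False)    # bootstrap means from the data
    var = np.full(m, np.var(x))                  # one variance per component

    for _ in range(n_iter):
        # E step: responsibilities P(y_i = k | x_i, current parameters)
        dens = np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        resp = w * dens
        resp /= resp.sum(axis=1, keepdims=True)

        # M step: re-estimate w_k, mu_k, sigma_k^2 from the expected counts
        nk = resp.sum(axis=0)
        w = nk / n
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    return w, mu, var

For example, running gmm_em on data drawn from two well-separated Gaussians should recover means close to the true ones (up to label permutation).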
-
GMM-EM Discussion
Summary: EM is naturally applicable to training probabilistic models
EM is a generic formulation; you need to do some hairy math to get to an implementation
Problems with GMM-EM?
Local maxima
Need to bootstrap the training process (pick an initial Θ)
GMM-EM is applicable to an enormous number of pattern recognition tasks: speech, vision, etc.
Hours of fun with GMM-EM
-
Overview
Expectation-Maximization
Mixture Model Training
Learning String Edit-Distance
-
String Edit-Distance
Notation: operate on two strings x and y
Edit-distance: transform one string into another using edit operations, each with a cost
Substitution: kitten → bitten, at a substitution cost
Insertion: cop → crop, at an insertion cost
Deletion: learn → earn, at a deletion cost
Can compute efficiently with a recursive (dynamic-programming) algorithm
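A minimal sketch of that recursive computation (unit costs assumed here purely for illustration):

def edit_distance(x, y, sub_cost=1, ins_cost=1, del_cost=1):
    """Levenshtein-style edit distance between strings x and y via dynamic programming."""
    n, m = len(x), len(y)
    # d[i][j] = cost of transforming x[:i] into y[:j]
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i * del_cost
    for j in range(1, m + 1):
        d[0][j] = j * ins_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = d[i - 1][j - 1] + (0 if x[i - 1] == y[j - 1] else sub_cost)
            d[i][j] = min(sub, d[i - 1][j] + del_cost, d[i][j - 1] + ins_cost)
    return d[n][m]

# e.g. edit_distance("kitten", "bitten") == 1, edit_distance("learn", "earn") == 1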
-
Stochastic String Edit-Distance
Instead of setting costs, model the edit operation sequence as a random process
Edit operations selected according to a probability distribution δ
For an edit operation sequence z_1 ... z_n: P(z_1 ... z_n) = Π_i δ(z_i)
View string edit-distance as a memoryless stochastic transducer:
memoryless (Markov): each edit operation is chosen independently of the ones before it
stochastic: the random process is governed by a true probability distribution δ(·)
transducer: the edit sequence transforms an input string into an output string
-
Edit-Distance Transducer
Arc label a:b/0 means input a, output b, and weight 0
Assume
-
Two Distances
Define the yield of an edit sequence z_1 ... z_n as the set of all string pairs ⟨x, y⟩ such that z_1 ... z_n turns x into y
Viterbi edit-distance: negative log-likelihood of the most likely edit sequence turning x into y
Stochastic edit-distance: negative log-likelihood of the set of all edit sequences from x to y
-
Evaluating Likelihood
Viterbi: maximize over the edit sequences whose yield contains ⟨x, y⟩
Stochastic: sum over the edit sequences whose yield contains ⟨x, y⟩
Both require a calculation over all possible edit sequences
Exponentially many possibilities (three kinds of edit operations)
However, the memoryless assumption allows us to compute the likelihood efficiently
Use the forward-backward method!
-
Forward
Evaluation of forward probabilities α_{i,j}:
the likelihood of picking an edit sequence that generates the prefix pair ⟨x_1 .. x_i, y_1 .. y_j⟩
Memoryless assumption allows efficient recursive computation:
α_{0,0} = 1;  α_{i,j} = δ(x_i, y_j) α_{i-1,j-1} + δ(x_i, ε) α_{i-1,j} + δ(ε, y_j) α_{i,j-1}
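A minimal sketch of this recursion; the dictionaries p_sub, p_del, p_ins are an illustrative way of storing δ (and the end-of-string symbol used in the full Ristad-Yianilos model is ignored here):

import numpy as np

def forward(x, y, p_sub, p_del, p_ins):
    """alpha[i, j] = total probability of edit sequences generating prefixes x[:i], y[:j]."""
    n, m = len(x), len(y)
    alpha = np.zeros((n + 1, m + 1))
    alpha[0, 0] = 1.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            total = 0.0
            if i > 0 and j > 0:   # substitution x[i-1] -> y[j-1]
                total += p_sub.get((x[i - 1], y[j - 1]), 0.0) * alpha[i - 1, j - 1]
            if i > 0:             # deletion of x[i-1]
                total += p_del.get(x[i - 1], 0.0) * alpha[i - 1, j]
            if j > 0:             # insertion of y[j-1]
                total += p_ins.get(y[j - 1], 0.0) * alpha[i, j - 1]
            alpha[i, j] = total
    return alpha   # alpha[n, m] is P(x, y) under the memoryless model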
-
Backward
Evaluation of backward probabilities β_{i,j}:
the likelihood of picking an edit sequence that generates the suffix pair ⟨x_{i+1} .. x_n, y_{j+1} .. y_m⟩
Memoryless assumption allows efficient recursive computation:
β_{n,m} = 1;  β_{i,j} = δ(x_{i+1}, y_{j+1}) β_{i+1,j+1} + δ(x_{i+1}, ε) β_{i+1,j} + δ(ε, y_{j+1}) β_{i,j+1}
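The symmetric sketch for the backward pass, with the same illustrative parameter layout as the forward sketch above:

import numpy as np

def backward(x, y, p_sub, p_del, p_ins):
    """beta[i, j] = total probability of edit sequences generating suffixes x[i:], y[j:]."""
    n, m = len(x), len(y)
    beta = np.zeros((n + 1, m + 1))
    beta[n, m] = 1.0
    for i in range(n, -1, -1):
        for j in range(m, -1, -1):
            if i == n and j == m:
                continue
            total = 0.0
            if i < n and j < m:   # substitution x[i] -> y[j]
                total += p_sub.get((x[i], y[j]), 0.0) * beta[i + 1, j + 1]
            if i < n:             # deletion of x[i]
                total += p_del.get(x[i], 0.0) * beta[i + 1, j]
            if j < m:             # insertion of y[j]
                total += p_ins.get(y[j], 0.0) * beta[i, j + 1]
            beta[i, j] = total
    return beta   # beta[0, 0] should equal alpha[n, m]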
-
EM Formulation
Edit operations selected according to a probability distribution δ
So, EM has to update δ based on occurrence counts of each operation (similar to the coin-tossing example)
Idea: accumulate expected counts from the forward and backward variables
γ(z): expected count of edit operation z
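A minimal sketch of that accumulation for a single string pair, building on the forward/backward sketches above (again with illustrative names, and ignoring the end-of-string symbol):

from collections import defaultdict

def expected_counts(x, y, p_sub, p_del, p_ins):
    """Accumulate gamma(z), the expected count of each edit operation, for one string pair."""
    alpha = forward(x, y, p_sub, p_del, p_ins)
    beta = backward(x, y, p_sub, p_del, p_ins)
    n, m = len(x), len(y)
    total = alpha[n, m]                     # P(x, y) under the current model
    gamma = defaultdict(float)
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:             # substitution x[i-1] -> y[j-1]
                gamma[('sub', x[i - 1], y[j - 1])] += (
                    alpha[i - 1, j - 1] * p_sub.get((x[i - 1], y[j - 1]), 0.0) * beta[i, j]) / total
            if i > 0:                       # deletion of x[i-1]
                gamma[('del', x[i - 1])] += (
                    alpha[i - 1, j] * p_del.get(x[i - 1], 0.0) * beta[i, j]) / total
            if j > 0:                       # insertion of y[j-1]
                gamma[('ins', y[j - 1])] += (
                    alpha[i, j - 1] * p_ins.get(y[j - 1], 0.0) * beta[i, j]) / total
    return gamma

# M step (sketch): renormalize the expected counts into a new distribution delta,
# delta(z) = gamma(z) / sum_z' gamma(z'), with counts summed over all training string pairs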
-
EM Details
γ(z): expected count of edit operation z, accumulated cell by cell from the forward and backward probabilities (e.g., for the substitution, insertion, and deletion possible at each position)
-
References
A. P. Dempster, N. M. Laird, and D. B. Rubin, Maximum likelihood from incomplete data via the EM algorithm, Journal of the Royal Statistical Society B, 39(1), 1977, pp. 1-38.
C. F. J. Wu, On the Convergence Properties of the EM Algorithm, The Annals of Statistics, 11(1), Mar 1983, pp. 95-103.
F. Jelinek, Statistical Methods for Speech Recognition, 1997.
M. Collins, The EM Algorithm, 1997.
J. A. Bilmes, A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models, Technical Report TR-97-021, U.C. Berkeley, 1998.
E. S. Ristad and P. N. Yianilos, Learning string edit distance, IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(2), 1998, pp. 522-532.
L. R. Rabiner, A tutorial on HMM and selected applications in speech recognition, in Proc. IEEE, 77(2), 1989, pp. 257-286.
A. D'Souza, Using EM To Estimate A Probablity [sic] Density With A Mixture Of Gaussians.
M. Mohri, Edit-Distance of Weighted Automata, in Proc. Implementation and Application of Automata (CIAA) 2002, pp. 1-23.
J. Glass, Lecture Notes, MIT class 6.345: Automatic Speech Recognition, 2003.
Carlo Tomasi, Estimating Gaussian Mixture Densities with EM: A Tutorial, 2004.
Wikipedia.