
  • 1/31

    Expectation-Maximization Algorithm

    and Applications

    Eugene Weinstein, Courant Institute of Mathematical Sciences

    Nov 14th, 2006

  • 2/31

    List of Concepts

    Maximum-Likelihood Estimation (MLE)

    Expectation-Maximization (EM)

    Conditional Probability

    Mixture Modeling

    Gaussian Mixture Models (GMMs)

    String edit-distance

    Forward-backward algorithms

  • 3/31

    Overview

    Expectation-Maximization

    Mixture Model Training

    Learning String Edit-Distance

  • 4/31

    One-Slide MLE Review

    Say I give you a coin with P(heads) = θ

    But I don't tell you the value of θ

    Now say I let you flip the coin n times. You get h heads and (n - h) tails

    What is the natural estimate of θ? This is h/n

    More formally, the likelihood of θ is governed by a binomial distribution: L(θ) = C(n, h) θ^h (1 - θ)^(n - h)

    Can prove h/n is the maximum-likelihood estimate of θ

    Differentiate log L(θ) with respect to θ, set equal to 0
  • 5/31

    EM Motivation

    So, to solve any ML-type problem, we analytically maximize the likelihood function?

    Seems to work for 1D Bernoulli (coin toss)

    Also works for 1D Gaussian (find μ, σ²)

    Not quite

    Distribution may not be well-behaved, or may have too many parameters

    Say your likelihood function is a mixture of 1000 1000-dimensional Gaussians (1M parameters)

    Direct maximization is not feasible

    Solution: introduce hidden variables to

    Simplify the likelihood function (more common)

    Account for actual missing data

  • 6/31

    Hidden and Observed Variables

    Observed variables: directly measurable from the data, e.g.

    The waveform values of a speech recording

    Is it raining today?

    Did the smoke alarm go off?

    Hidden variables: influence the data, but not trivial to measure

    The phonemes that produce a given speech recording

    P(rain today | rain yesterday)

    Is the smoke alarm malfunctioning?

  • 7/31

    Expectation-Maximization

    Model dependent random variables:

    Observed variable x

    Unobserved (hidden) variable y that generates x

    Assume probability distributions P(x, y | θ), where θ represents the set of all parameters of the distribution

    Repeat until convergence

    E-step: Compute expectation of log P(x, y | θ) given x and the current parameters, Q(θ | θ')

    (θ', θ: old, new distribution parameters)

    M-step: Find θ that maximizes Q
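
    As a schematic sketch (not the speaker's code; e_step and m_step are placeholders to be filled in per application), the loop is:

        def em(x, theta, e_step, m_step, n_iters=100):
            """Generic EM loop: theta holds the current parameter estimate."""
            for _ in range(n_iters):
                Q = e_step(x, theta)   # E-step: build Q(. | theta), the expected complete-data log-likelihood
                theta = m_step(Q)      # M-step: pick the parameters that maximize Q
            return theta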

  • 8/31

    Conditional Expectation Review

    Let X, Y be r.v.s drawn from the distributions P(x) and P(y)

    Conditional distribution given by: P(y | x) = P(x, y) / P(x)

    Then E[Y] = Σ_y y P(y)

    For a function h(Y): E[h(Y)] = Σ_y h(y) P(y)

    Given a particular value of X (X = x): E[h(Y) | X = x] = Σ_y h(y) P(y | x)
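
    A small numeric illustration of these definitions (my own example, not from the slides):

        import numpy as np

        # Joint distribution P(x, y) over X in {0, 1} (rows) and Y in {0, 1, 2} (columns)
        P_xy = np.array([[0.10, 0.20, 0.10],
                         [0.15, 0.05, 0.40]])

        P_x = P_xy.sum(axis=1)             # marginal P(x)
        P_y_given_x = P_xy / P_x[:, None]  # conditional P(y | x)

        y_vals = np.array([0, 1, 2])
        h = lambda y: y ** 2               # any function h(Y)

        # E[h(Y) | X = x] = sum_y h(y) P(y | x), one value per x
        print(P_y_given_x @ h(y_vals))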

  • 9/31

    Maximum Likelihood Problem

    Want to pick θ that maximizes the log-likelihood of the observed (x) and unobserved (y) variables, given

    Observed variable x

    Previous parameters θ'

    Conditional expectation of log P(x, y | θ) given x and θ' is Q(θ | θ'), written out below
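
    In symbols (a reconstruction; this is the standard EM auxiliary function, consistent with Dempster et al. in the references):

        Q(\theta \mid \theta') = E\left[ \log P(x, y \mid \theta) \,\middle|\, x, \theta' \right]
                               = \sum_{y} P(y \mid x, \theta') \, \log P(x, y \mid \theta)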

  • 10/31

    EM Derivation

    Lemma (special case of Jensen's inequality): let p(x), q(x) be probability distributions. Then

    Σ_x p(x) log p(x) ≥ Σ_x p(x) log q(x)

    Proof: rewrite as Σ_x p(x) log(p(x)/q(x)) ≥ 0 (see the expansion below)

    Interpretation: relative entropy is non-negative
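
    The omitted step, reconstructed with the standard bound log t ≤ t − 1:

        \sum_x p(x) \log \frac{p(x)}{q(x)}
          = -\sum_x p(x) \log \frac{q(x)}{p(x)}
          \ge -\sum_x p(x) \left( \frac{q(x)}{p(x)} - 1 \right)
          = \sum_x p(x) - \sum_x q(x) = 0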

  • 11/31

    EM Derivation

    EM Theorem:

    If Q(θ | θ') ≥ Q(θ' | θ'), then P(x | θ) ≥ P(x | θ')

    Proof:

    By some algebra and the lemma, log P(x | θ) − log P(x | θ') ≥ Q(θ | θ') − Q(θ' | θ')

    So, if this quantity is positive, so is the change in log-likelihood, log P(x | θ) − log P(x | θ')
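
    The "some algebra" made explicit (a reconstruction of the standard argument, in the notation above): for every y,

        \log P(x \mid \theta) = \log P(x, y \mid \theta) - \log P(y \mid x, \theta)

    Taking the expectation of both sides under P(y | x, θ') and subtracting the same identity at θ = θ':

        \log P(x \mid \theta) - \log P(x \mid \theta')
          = Q(\theta \mid \theta') - Q(\theta' \mid \theta')
            + \sum_y P(y \mid x, \theta') \log \frac{P(y \mid x, \theta')}{P(y \mid x, \theta)}
          \ge Q(\theta \mid \theta') - Q(\theta' \mid \theta')

    since the last sum is a relative entropy, which is non-negative by the lemma.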

  • 12/31

    EM Summary

    Repeat until convergence

    E-step: Compute expectation of log P(x, y | θ) given x and the current parameters, Q(θ | θ')

    (θ', θ: old, new distribution parameters)

    M-step: Find θ that maximizes Q

    EM Theorem:

    If Q(θ | θ') ≥ Q(θ' | θ'), then P(x | θ) ≥ P(x | θ')

    Interpretation: As long as we can improve the expectation of the log-likelihood, EM improves our model of the observed variable x

    Actually, it's not necessary to maximize the expectation, just to make sure that it increases; this is called Generalized EM

  • 13/31

    EM Comments

    In practice, x is a series of data points

    To calculate the expectation, we can assume the points are i.i.d. and sum over all of them:

    Q(θ | θ') = Σ_i Σ_y P(y | x_i, θ') log P(x_i, y | θ)

    Problems with EM?

    Local maxima

    Need to bootstrap the training process (pick an initial θ)

    When is EM most useful?

    When the model distributions are easy to maximize (e.g., Gaussian mixture models)

    EM is a meta-algorithm, and needs to be adapted to the particular application

  • 14/31

    Overview

    Expectation-Maximization

    Mixture Model Training

    Learning String Edit-Distance

  • 15/31

    EM Applications: Mixture Models

    Gaussian/normal distribution

    Parameters: mean μ and variance σ²

    In the multi-dimensional case, assume an isotropic Gaussian: same variance in all dimensions

    We can model arbitrary distributions with density mixtures
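
    For reference, the isotropic d-dimensional Gaussian density (standard form, not transcribed from the slide):

        N(x; \mu, \sigma^2) = (2\pi\sigma^2)^{-d/2} \exp\left( -\frac{\lVert x - \mu \rVert^2}{2\sigma^2} \right)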

  • 16/31

    Density Mixtures

    Combine m elementary densities to model a complex data distribution:

    p(x | θ) = Σ_{k=1..m} w_k N(x; μ_k, σ_k²), with w_k ≥ 0 and Σ_k w_k = 1

    kth Gaussian parametrized by μ_k and σ_k

  • 17/31

    Density Mixtures

    Combine m elementary densities to model a complex data distribution

  • 18/31

    Density Mixtures

    Combine m elementary densities to model a complex data distribution

    Log-likelihood function of the data x given θ:

    L(θ) = Σ_i log Σ_k w_k N(x_i; μ_k, σ_k²)

    Log of sum: hard to optimize analytically!

    Instead, introduce hidden variable y

    y = k: x generated by Gaussian k

    EM formulation: maximize the expected complete-data log-likelihood (see the expression below)
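
    The maximized quantity, reconstructed in standard GMM-EM notation (the slide's exact symbols were lost in extraction):

        Q(\theta \mid \theta') = \sum_{i=1}^{n} \sum_{k=1}^{m}
            P(y_i = k \mid x_i, \theta') \, \log\left[ w_k \, N(x_i; \mu_k, \sigma_k^2) \right]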

  • 19/31

    Gaussian Mixture Model EM

    Goal: maximize Q(θ | θ')

    n (observed) data points: x_1, ..., x_n

    n (hidden) labels: y_i = k means x_i was generated by Gaussian k

    Several pages of math later, we get:

    E step: compute the likelihood of each label, P(y_i = k | x_i, θ')

    M step: update w_k, μ_k, σ_k for each Gaussian k = 1..m (a sketch of both steps follows)
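
    A minimal NumPy sketch of the two steps for isotropic Gaussians (illustrative only, not the speaker's code; all names are my own):

        import numpy as np

        def gmm_em(X, m, n_iters=100, seed=0):
            """EM for an isotropic Gaussian mixture. X: (n, d) data array, m: number of components."""
            rng = np.random.default_rng(seed)
            n, d = X.shape
            w = np.full(m, 1.0 / m)                  # mixture weights
            mu = X[rng.choice(n, m, replace=False)]  # means, initialized from the data
            var = np.full(m, X.var())                # one shared variance per component

            for _ in range(n_iters):
                # E step: responsibilities r[i, k] = P(y_i = k | x_i, theta')
                d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)      # squared distances, shape (n, m)
                log_p = np.log(w) - 0.5 * d * np.log(2 * np.pi * var) - d2 / (2 * var)
                log_p -= log_p.max(axis=1, keepdims=True)                 # numerical stability
                r = np.exp(log_p)
                r /= r.sum(axis=1, keepdims=True)

                # M step: re-estimate weights, means, variances from the responsibilities
                Nk = r.sum(axis=0)                                        # effective counts per component
                w = Nk / n
                mu = (r.T @ X) / Nk[:, None]
                d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)
                var = (r * d2).sum(axis=0) / (d * Nk)
            return w, mu, var

        # Example: w, mu, var = gmm_em(np.random.randn(500, 2), m=3)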

  • 20/31

    GMM-EM Discussion

    Summary: EM naturally applicable to training probabilistic models

    EM is a generic formulation; need to do some hairy math to get to an implementation

    Problems with GMM-EM?

    Local maxima

    Need to bootstrap the training process (pick an initial θ)

    GMM-EM applicable to an enormous number of pattern recognition tasks: speech, vision, etc.

    Hours of fun with GMM-EM

  • 21/31

    Overview

    Expectation-Maximization

    Mixture Model Training

    Learning String Edit-Distance

  • 22/31

    String Edit-Distance

    Notation:

    Operate on two strings: x and y

    Edit-distance: transform one string into the other using edit operations, each with a cost

    Substitution: kitten → bitten, at the substitution cost

    Insertion: cop → crop, at the insertion cost

    Deletion: learn → earn, at the deletion cost

    Can compute efficiently with a recursive (dynamic-programming) procedure; a sketch follows
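
    A minimal dynamic-programming sketch of this recursion (unit costs assumed; my own code, not from the slides):

        def edit_distance(x, y, c_sub=1, c_ins=1, c_del=1):
            """d[i][j] = minimum cost of turning x[:i] into y[:j]."""
            n, m = len(x), len(y)
            d = [[0] * (m + 1) for _ in range(n + 1)]
            for i in range(1, n + 1):
                d[i][0] = i * c_del
            for j in range(1, m + 1):
                d[0][j] = j * c_ins
            for i in range(1, n + 1):
                for j in range(1, m + 1):
                    sub = d[i - 1][j - 1] + (0 if x[i - 1] == y[j - 1] else c_sub)
                    d[i][j] = min(sub,
                                  d[i - 1][j] + c_del,   # delete x[i-1]
                                  d[i][j - 1] + c_ins)   # insert y[j-1]
            return d[n][m]

        print(edit_distance("kitten", "bitten"))  # 1
        print(edit_distance("learn", "earn"))     # 1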

  • 23/31

    Stochastic String Edit-Distance

    Instead of setting costs, model the edit operation sequence as a random process

    Edit operations selected according to a probability distribution

    For an edit operation sequence z_1, ..., z_n, view string edit-distance as

    memoryless (Markov): each edit operation is chosen independently of the ones before it

    stochastic: the random process is governed by a true probability distribution δ(·)

    transducer: each edit operation consumes symbols of x and emits symbols of y

  • 24/31

    Edit-Distance Transducer

    Arc label a:b/0 means input a, output b, and weight 0

    Assume

  • 25/31

    Two Distances

    Define the yield of an edit sequence z_1, ..., z_n as the set of all string pairs ⟨x, y⟩ such that the sequence turns x into y

    Viterbi edit-distance: negative log-likelihood of the most likely edit sequence

    Stochastic edit-distance: negative log-likelihood of all edit sequences from x to y
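
    In symbols (reconstructed; notation follows Ristad & Yianilos, cited in the references):

        d_{\mathrm{Viterbi}}(x, y)    = -\log \max_{z:\, \langle x, y \rangle \in \mathrm{yield}(z)} p(z)

        d_{\mathrm{stochastic}}(x, y) = -\log \sum_{z:\, \langle x, y \rangle \in \mathrm{yield}(z)} p(z)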

  • 26/31

    Evaluating Likelihood

    Viterbi: take the maximum over edit sequences

    Stochastic: take the sum over edit sequences

    Both require a calculation over all possible edit sequences

    Exponentially many possibilities (three kinds of edit operations)

    However, the memoryless assumption allows us to compute the likelihood efficiently

    Use the forward-backward method!

  • 27/31

    Forward

    Evaluation of forward probabilities α(i, j): likelihood of picking an edit sequence that generates the prefix pair (x_1..x_i, y_1..y_j)

    The memoryless assumption allows efficient recursive computation (sketch below)
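
    A sketch of the forward recursion under the memoryless model (the dictionary-of-probabilities interface is my own, not from the talk):

        import numpy as np

        def forward(x, y, p_sub, p_ins, p_del):
            """alpha[i, j] = total probability of edit sequences generating the prefixes (x[:i], y[:j]).
            p_sub[(a, b)], p_ins[b], p_del[a] hold the edit-operation probabilities."""
            n, m = len(x), len(y)
            alpha = np.zeros((n + 1, m + 1))
            alpha[0, 0] = 1.0
            for i in range(n + 1):
                for j in range(m + 1):
                    if i > 0:
                        alpha[i, j] += p_del.get(x[i - 1], 0.0) * alpha[i - 1, j]
                    if j > 0:
                        alpha[i, j] += p_ins.get(y[j - 1], 0.0) * alpha[i, j - 1]
                    if i > 0 and j > 0:
                        alpha[i, j] += p_sub.get((x[i - 1], y[j - 1]), 0.0) * alpha[i - 1, j - 1]
            # alpha[n, m] is the total likelihood of the pair (x, y) (termination probability omitted
            # for brevity); the stochastic edit-distance is then -log alpha[n, m]
            return alpha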

  • 28/31

    Backward

    Evaluation of backward probabilities β(i, j): likelihood of picking an edit sequence that generates the suffix pair (x_i+1..x_n, y_j+1..y_m)

    The memoryless assumption allows efficient recursive computation (sketch below)
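
    The mirror-image recursion, in the same hypothetical interface as the forward sketch:

        import numpy as np

        def backward(x, y, p_sub, p_ins, p_del):
            """beta[i, j] = total probability of edit sequences generating the suffixes (x[i:], y[j:])."""
            n, m = len(x), len(y)
            beta = np.zeros((n + 1, m + 1))
            beta[n, m] = 1.0
            for i in range(n, -1, -1):
                for j in range(m, -1, -1):
                    if i < n:
                        beta[i, j] += p_del.get(x[i], 0.0) * beta[i + 1, j]
                    if j < m:
                        beta[i, j] += p_ins.get(y[j], 0.0) * beta[i, j + 1]
                    if i < n and j < m:
                        beta[i, j] += p_sub.get((x[i], y[j]), 0.0) * beta[i + 1, j + 1]
            return beta   # beta[0, 0] equals alpha[n, m]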

  • 29/31

    EM Formulation

    Edit operations selected according to a probability distribution δ

    So, EM has to update δ based on occurrence counts of each operation (similar to the coin-tossing example)

    Idea: accumulate expected counts from the forward and backward variables

    γ(z): expected count of edit operation z (a sketch of the accumulation follows)
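
    A sketch of the E-step count accumulation and M-step renormalization, reusing the forward/backward sketches above (an illustrative reconstruction, not the speaker's code):

        from collections import Counter

        def expected_counts(x, y, p_sub, p_ins, p_del):
            """gamma[z] = expected number of times edit operation z is used in transforming x into y."""
            alpha = forward(x, y, p_sub, p_ins, p_del)
            beta = backward(x, y, p_sub, p_ins, p_del)
            total = alpha[len(x), len(y)]              # likelihood of the pair (x, y) under the current model
            gamma = Counter()
            for i in range(len(x) + 1):
                for j in range(len(y) + 1):
                    if i < len(x):                     # deletion of x[i], taken between (i, j) and (i+1, j)
                        gamma['del', x[i]] += alpha[i, j] * p_del.get(x[i], 0.0) * beta[i + 1, j] / total
                    if j < len(y):                     # insertion of y[j]
                        gamma['ins', y[j]] += alpha[i, j] * p_ins.get(y[j], 0.0) * beta[i, j + 1] / total
                    if i < len(x) and j < len(y):      # substitution x[i] -> y[j]
                        gamma['sub', x[i], y[j]] += (alpha[i, j] * p_sub.get((x[i], y[j]), 0.0)
                                                     * beta[i + 1, j + 1] / total)
            return gamma

        # M-step: set the new probability of each operation to its expected count
        # divided by the sum of all expected counts (as in the coin-tossing example).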

  • 30/31

    EM Details

    γ(z): expected count of edit operation z, e.g. for a substitution a → b, accumulate α(i, j) · δ(a, b) · β(i+1, j+1) / p(x, y) over all positions (i, j) with x_{i+1} = a and y_{j+1} = b

  • 31/31

    References

    A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B, 39(1), 1977, pp. 1-38.

    C. F. J. Wu. On the convergence properties of the EM algorithm. The Annals of Statistics, 11(1), Mar 1983, pp. 95-103.

    F. Jelinek. Statistical Methods for Speech Recognition, 1997.

    M. Collins. The EM Algorithm, 1997.

    J. A. Bilmes. A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models. Technical Report TR-97-021, U.C. Berkeley, 1998.

    E. S. Ristad and P. N. Yianilos. Learning string edit distance. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(5), 1998, pp. 522-532.

    L. R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. In Proc. IEEE, 77(2), 1989, pp. 257-286.

    A. D'Souza. Using EM To Estimate A Probablity [sic] Density With A Mixture Of Gaussians.

    M. Mohri. Edit-distance of weighted automata. In Proc. Implementation and Application of Automata (CIAA) 2002, pp. 1-23.

    J. Glass. Lecture notes, MIT class 6.345: Automatic Speech Recognition, 2003.

    C. Tomasi. Estimating Gaussian Mixture Densities with EM: A Tutorial, 2004.

    Wikipedia.