statistical guarantees for the em algorithm: from ... · an example (murray, 1977; wu, 1983) twelve...
TRANSCRIPT
![Page 1: Statistical Guarantees for the EM Algorithm: From ... · An Example (Murray, 1977; Wu, 1983) Twelve observations are collected from a bivariate normal distribution with mean , correlation](https://reader034.vdocuments.site/reader034/viewer/2022050405/5f82c15c9f60ac5c2e20301f/html5/thumbnails/1.jpg)
Statistical Guarantees for the EM Algorithm: From Population to
Sample-Based AnalysisAuthors: Balakrishnan, Wainwright, and Yu
Chenxi Zhou
Reading Group in Statistical Learning and Data MiningSeptember 5th, 2017
1
![Page 2: Statistical Guarantees for the EM Algorithm: From ... · An Example (Murray, 1977; Wu, 1983) Twelve observations are collected from a bivariate normal distribution with mean , correlation](https://reader034.vdocuments.site/reader034/viewer/2022050405/5f82c15c9f60ac5c2e20301f/html5/thumbnails/2.jpg)
Outline! Overview of Expectation-Maximization (EM) Algorithm" Population Analysis of First-Order EM Algorithm# Sample Analysis of First-Order EM Algorithm$ Example: Gaussian Mixture Model
2
![Page 3: Statistical Guarantees for the EM Algorithm: From ... · An Example (Murray, 1977; Wu, 1983) Twelve observations are collected from a bivariate normal distribution with mean , correlation](https://reader034.vdocuments.site/reader034/viewer/2022050405/5f82c15c9f60ac5c2e20301f/html5/thumbnails/3.jpg)
Overview of Expectation-Maximization Algorithm
3
![Page 4: Statistical Guarantees for the EM Algorithm: From ... · An Example (Murray, 1977; Wu, 1983) Twelve observations are collected from a bivariate normal distribution with mean , correlation](https://reader034.vdocuments.site/reader034/viewer/2022050405/5f82c15c9f60ac5c2e20301f/html5/thumbnails/4.jpg)
Estimation of Linkage in Genetics
» 197 animals are distributed multinomially into 5 categories
» Observed data:
with .
» Cell probabilities:
4
![Page 5: Statistical Guarantees for the EM Algorithm: From ... · An Example (Murray, 1977; Wu, 1983) Twelve observations are collected from a bivariate normal distribution with mean , correlation](https://reader034.vdocuments.site/reader034/viewer/2022050405/5f82c15c9f60ac5c2e20301f/html5/thumbnails/5.jpg)
Example (continued)
» Likelihood function:
and
» How to solve this type of incomplete data problem?
5
![Page 6: Statistical Guarantees for the EM Algorithm: From ... · An Example (Murray, 1977; Wu, 1983) Twelve observations are collected from a bivariate normal distribution with mean , correlation](https://reader034.vdocuments.site/reader034/viewer/2022050405/5f82c15c9f60ac5c2e20301f/html5/thumbnails/6.jpg)
What is the EM Algorithm?
Expectation-Maximization (EM) Algorithm is an iterative method that attempts to find the maximum likelihood estimator of a parameter of a parametric probability distribution in incomplete data problems.
Incompleteness:- Missing data- Censored or grouped data- Latent class and latent data structures-
6
![Page 7: Statistical Guarantees for the EM Algorithm: From ... · An Example (Murray, 1977; Wu, 1983) Twelve observations are collected from a bivariate normal distribution with mean , correlation](https://reader034.vdocuments.site/reader034/viewer/2022050405/5f82c15c9f60ac5c2e20301f/html5/thumbnails/7.jpg)
Basic Setup
» Let with the joint density function , where
»
» is non-empty, compact convex set
» Observe i.i.d. copies of ,
» are missing or latent
» Goal: Estimate by maximizing log-likelihood:
7
![Page 8: Statistical Guarantees for the EM Algorithm: From ... · An Example (Murray, 1977; Wu, 1983) Twelve observations are collected from a bivariate normal distribution with mean , correlation](https://reader034.vdocuments.site/reader034/viewer/2022050405/5f82c15c9f60ac5c2e20301f/html5/thumbnails/8.jpg)
EM Idea
Unfortunately, maximizing directly can be hard! But often the complete data log-likelihood
is easier to maximize. So we replace the complete data log-likelihood by its conditional expectation:
where expectation is computed with respect to current iterate .
8
![Page 9: Statistical Guarantees for the EM Algorithm: From ... · An Example (Murray, 1977; Wu, 1983) Twelve observations are collected from a bivariate normal distribution with mean , correlation](https://reader034.vdocuments.site/reader034/viewer/2022050405/5f82c15c9f60ac5c2e20301f/html5/thumbnails/9.jpg)
EM Algorithm
Starting with initial iterate , iterate the following steps for .
» Expectation Step: Compute EM surrogate :
» Maximization Step: Maximize EM surrogate:
9
![Page 10: Statistical Guarantees for the EM Algorithm: From ... · An Example (Murray, 1977; Wu, 1983) Twelve observations are collected from a bivariate normal distribution with mean , correlation](https://reader034.vdocuments.site/reader034/viewer/2022050405/5f82c15c9f60ac5c2e20301f/html5/thumbnails/10.jpg)
Source:
10
![Page 11: Statistical Guarantees for the EM Algorithm: From ... · An Example (Murray, 1977; Wu, 1983) Twelve observations are collected from a bivariate normal distribution with mean , correlation](https://reader034.vdocuments.site/reader034/viewer/2022050405/5f82c15c9f60ac5c2e20301f/html5/thumbnails/11.jpg)
Advantages !» Easy to implement
» Requires small storage space
» Low cost per iteration
» If is bounded, converges monotonically to , where is a stationary point
»
11
![Page 12: Statistical Guarantees for the EM Algorithm: From ... · An Example (Murray, 1977; Wu, 1983) Twelve observations are collected from a bivariate normal distribution with mean , correlation](https://reader034.vdocuments.site/reader034/viewer/2022050405/5f82c15c9f60ac5c2e20301f/html5/thumbnails/12.jpg)
Drawbacks !» Finding the exact maximizer in the M step can be hard
12
![Page 13: Statistical Guarantees for the EM Algorithm: From ... · An Example (Murray, 1977; Wu, 1983) Twelve observations are collected from a bivariate normal distribution with mean , correlation](https://reader034.vdocuments.site/reader034/viewer/2022050405/5f82c15c9f60ac5c2e20301f/html5/thumbnails/13.jpg)
As for the First Drawback...
» Generalized EM Algorithm: Just choose so that
» First-Order EM Algorithm: Assume is differentiable in the first argument at each iteration . Given a step size , the updates are
where the gradient is taken in the first argument of .
13
![Page 14: Statistical Guarantees for the EM Algorithm: From ... · An Example (Murray, 1977; Wu, 1983) Twelve observations are collected from a bivariate normal distribution with mean , correlation](https://reader034.vdocuments.site/reader034/viewer/2022050405/5f82c15c9f60ac5c2e20301f/html5/thumbnails/14.jpg)
Drawbacks !» Finding the exact maximizer in the M step can be hard
» No guarantees to converge to the global maximum of (depending on the choice of starting point)
14
![Page 15: Statistical Guarantees for the EM Algorithm: From ... · An Example (Murray, 1977; Wu, 1983) Twelve observations are collected from a bivariate normal distribution with mean , correlation](https://reader034.vdocuments.site/reader034/viewer/2022050405/5f82c15c9f60ac5c2e20301f/html5/thumbnails/15.jpg)
An Example (Murray, 1977; Wu, 1983)
Twelve observations are collected from a bivariate normal distribution with mean , correlation coefficient and variances ,
The likelihood function has
- two global maxima: , ; and
- a saddle point: , .
The EM algorithm starting at will return the saddle point.
15
![Page 16: Statistical Guarantees for the EM Algorithm: From ... · An Example (Murray, 1977; Wu, 1983) Twelve observations are collected from a bivariate normal distribution with mean , correlation](https://reader034.vdocuments.site/reader034/viewer/2022050405/5f82c15c9f60ac5c2e20301f/html5/thumbnails/16.jpg)
Drawbacks !» Finding the exact maximizer in the M step can be hard
» No guarantees to converge to the global maximum of (depending on the choice of starting point)
» , where is a stationary point, does NOT imply and Wu (1983) only established the conditions of
convergence of to a stationary point
16
![Page 17: Statistical Guarantees for the EM Algorithm: From ... · An Example (Murray, 1977; Wu, 1983) Twelve observations are collected from a bivariate normal distribution with mean , correlation](https://reader034.vdocuments.site/reader034/viewer/2022050405/5f82c15c9f60ac5c2e20301f/html5/thumbnails/17.jpg)
Contributions of Balakrishnan et al. (2017)
» Quantitative characterization of a basin of attraction around
» Where to choose the initialization to ensure
» Establishment of the convergence rate and the corresponding conditions
» Establishment of connections between population and sample analysis
17
![Page 18: Statistical Guarantees for the EM Algorithm: From ... · An Example (Murray, 1977; Wu, 1983) Twelve observations are collected from a bivariate normal distribution with mean , correlation](https://reader034.vdocuments.site/reader034/viewer/2022050405/5f82c15c9f60ac5c2e20301f/html5/thumbnails/18.jpg)
Population Version of EM Algorithm
E Step: Compute the following population version surrogate function
M Step:
» Standard EM:
» First-Order EM:
18
![Page 19: Statistical Guarantees for the EM Algorithm: From ... · An Example (Murray, 1977; Wu, 1983) Twelve observations are collected from a bivariate normal distribution with mean , correlation](https://reader034.vdocuments.site/reader034/viewer/2022050405/5f82c15c9f60ac5c2e20301f/html5/thumbnails/19.jpg)
Oracle Surrogate Function and Iterates
» Oracle Surrogate Function
» Oracle Iterates
19
![Page 20: Statistical Guarantees for the EM Algorithm: From ... · An Example (Murray, 1977; Wu, 1983) Twelve observations are collected from a bivariate normal distribution with mean , correlation](https://reader034.vdocuments.site/reader034/viewer/2022050405/5f82c15c9f60ac5c2e20301f/html5/thumbnails/20.jpg)
Oracle Surrogate Function and Iterates
Why do we need them?
» If satisfies strong concavity and smoothness, then the gradient ascent updates achieve geometric convergence rate to
» The population version first-order EM updates can be viewed as a perturbation of the oracle updates
» Therefore, in the population level analysis of EM algorithm, we need to control the quantity
20
![Page 21: Statistical Guarantees for the EM Algorithm: From ... · An Example (Murray, 1977; Wu, 1983) Twelve observations are collected from a bivariate normal distribution with mean , correlation](https://reader034.vdocuments.site/reader034/viewer/2022050405/5f82c15c9f60ac5c2e20301f/html5/thumbnails/21.jpg)
Population Analysis of First-Order EM Algorithm
21
![Page 22: Statistical Guarantees for the EM Algorithm: From ... · An Example (Murray, 1977; Wu, 1983) Twelve observations are collected from a bivariate normal distribution with mean , correlation](https://reader034.vdocuments.site/reader034/viewer/2022050405/5f82c15c9f60ac5c2e20301f/html5/thumbnails/22.jpg)
Recall the population first-order updates are
and the oracle updates are
Condition 1: Gradient Smoothness
For an appropriately small parameter ,
for all .
22
![Page 23: Statistical Guarantees for the EM Algorithm: From ... · An Example (Murray, 1977; Wu, 1983) Twelve observations are collected from a bivariate normal distribution with mean , correlation](https://reader034.vdocuments.site/reader034/viewer/2022050405/5f82c15c9f60ac5c2e20301f/html5/thumbnails/23.jpg)
Condition 2: -Strong Concavity
There is some such that
or, equivalently,
for all pairs .
23
![Page 24: Statistical Guarantees for the EM Algorithm: From ... · An Example (Murray, 1977; Wu, 1983) Twelve observations are collected from a bivariate normal distribution with mean , correlation](https://reader034.vdocuments.site/reader034/viewer/2022050405/5f82c15c9f60ac5c2e20301f/html5/thumbnails/24.jpg)
Condition 3: -Smoothness
There is some such that
or, equivalently,
for all pairs .
24
![Page 25: Statistical Guarantees for the EM Algorithm: From ... · An Example (Murray, 1977; Wu, 1983) Twelve observations are collected from a bivariate normal distribution with mean , correlation](https://reader034.vdocuments.site/reader034/viewer/2022050405/5f82c15c9f60ac5c2e20301f/html5/thumbnails/25.jpg)
25
![Page 26: Statistical Guarantees for the EM Algorithm: From ... · An Example (Murray, 1977; Wu, 1983) Twelve observations are collected from a bivariate normal distribution with mean , correlation](https://reader034.vdocuments.site/reader034/viewer/2022050405/5f82c15c9f60ac5c2e20301f/html5/thumbnails/26.jpg)
Theorem 1(General Population-Level Guarantee)
» For some radius and a triplet with such that -gradient smoothness, -strong concavity and -smoothness conditions hold;
» Choose the step size .
Then, given any , the population first-order EM iterates satisfy the bound
26
![Page 27: Statistical Guarantees for the EM Algorithm: From ... · An Example (Murray, 1977; Wu, 1983) Twelve observations are collected from a bivariate normal distribution with mean , correlation](https://reader034.vdocuments.site/reader034/viewer/2022050405/5f82c15c9f60ac5c2e20301f/html5/thumbnails/27.jpg)
Sample Analysis of First-Order EM Algorithm
27
![Page 28: Statistical Guarantees for the EM Algorithm: From ... · An Example (Murray, 1977; Wu, 1983) Twelve observations are collected from a bivariate normal distribution with mean , correlation](https://reader034.vdocuments.site/reader034/viewer/2022050405/5f82c15c9f60ac5c2e20301f/html5/thumbnails/28.jpg)
Recall that the sample first-order EM updates are
The analysis of the finite sample first-order EM algorithm depends on the empirical process
28
![Page 29: Statistical Guarantees for the EM Algorithm: From ... · An Example (Murray, 1977; Wu, 1983) Twelve observations are collected from a bivariate normal distribution with mean , correlation](https://reader034.vdocuments.site/reader034/viewer/2022050405/5f82c15c9f60ac5c2e20301f/html5/thumbnails/29.jpg)
For a given sample size and tolerance parameter , let be the smallest scalar such that
with probability at least .
29
![Page 30: Statistical Guarantees for the EM Algorithm: From ... · An Example (Murray, 1977; Wu, 1983) Twelve observations are collected from a bivariate normal distribution with mean , correlation](https://reader034.vdocuments.site/reader034/viewer/2022050405/5f82c15c9f60ac5c2e20301f/html5/thumbnails/30.jpg)
Theorem 2(General Sample-Level Guarantee)
» For some radius and a triplet with such that the -gradient smoothness, -strong concavity and -smoothness conditions hold;
» Choose the step size ;
» Suppose the sample size is large enough to ensure
30
![Page 31: Statistical Guarantees for the EM Algorithm: From ... · An Example (Murray, 1977; Wu, 1983) Twelve observations are collected from a bivariate normal distribution with mean , correlation](https://reader034.vdocuments.site/reader034/viewer/2022050405/5f82c15c9f60ac5c2e20301f/html5/thumbnails/31.jpg)
Theorem 2(General Sample-Level Guarantee) (continued)
Then, with probability at least , given any initialization , the finite-sample first-order EM iterates
satisfy the bound
for all
31
![Page 32: Statistical Guarantees for the EM Algorithm: From ... · An Example (Murray, 1977; Wu, 1983) Twelve observations are collected from a bivariate normal distribution with mean , correlation](https://reader034.vdocuments.site/reader034/viewer/2022050405/5f82c15c9f60ac5c2e20301f/html5/thumbnails/32.jpg)
Example: Gaussian Mixture Model
32
![Page 33: Statistical Guarantees for the EM Algorithm: From ... · An Example (Murray, 1977; Wu, 1983) Twelve observations are collected from a bivariate normal distribution with mean , correlation](https://reader034.vdocuments.site/reader034/viewer/2022050405/5f82c15c9f60ac5c2e20301f/html5/thumbnails/33.jpg)
Consider the following two-component Gaussian mixture model
where
and and are independent.
Key Quantity:
33
![Page 34: Statistical Guarantees for the EM Algorithm: From ... · An Example (Murray, 1977; Wu, 1983) Twelve observations are collected from a bivariate normal distribution with mean , correlation](https://reader034.vdocuments.site/reader034/viewer/2022050405/5f82c15c9f60ac5c2e20301f/html5/thumbnails/34.jpg)
To analyze the EM algorithm at the population level, i.e., apply Theorem 1, one needs to establish the gradient smoothness, -strong concavity and -smoothness.
Oracle Function:
where the weighting function is a smooth function.
It is easy to verify that is strongly-concave and smooth with parameters 1, i.e., .
34
![Page 35: Statistical Guarantees for the EM Algorithm: From ... · An Example (Murray, 1977; Wu, 1983) Twelve observations are collected from a bivariate normal distribution with mean , correlation](https://reader034.vdocuments.site/reader034/viewer/2022050405/5f82c15c9f60ac5c2e20301f/html5/thumbnails/35.jpg)
What about gradient smoothness?
Lemma 2
» Let for a sufficiently large ;
» Let the radius be .
Then, there is a constant with such that
35
![Page 36: Statistical Guarantees for the EM Algorithm: From ... · An Example (Murray, 1977; Wu, 1983) Twelve observations are collected from a bivariate normal distribution with mean , correlation](https://reader034.vdocuments.site/reader034/viewer/2022050405/5f82c15c9f60ac5c2e20301f/html5/thumbnails/36.jpg)
Corollary 1(Population result for the first-order EM algorithm for GMM)
» Let for a sufficiently large ;
» Let the radius ;
» Choose the step size .
36
![Page 37: Statistical Guarantees for the EM Algorithm: From ... · An Example (Murray, 1977; Wu, 1983) Twelve observations are collected from a bivariate normal distribution with mean , correlation](https://reader034.vdocuments.site/reader034/viewer/2022050405/5f82c15c9f60ac5c2e20301f/html5/thumbnails/37.jpg)
Corollary 1(Population result for the first-order EM algorithm for GMM) (continued)
Then, there is a contraction coefficient , where is a universal constant, such that for any initialization
, the population first-order EM iterates satisfy
the bound
for all
37
![Page 38: Statistical Guarantees for the EM Algorithm: From ... · An Example (Murray, 1977; Wu, 1983) Twelve observations are collected from a bivariate normal distribution with mean , correlation](https://reader034.vdocuments.site/reader034/viewer/2022050405/5f82c15c9f60ac5c2e20301f/html5/thumbnails/38.jpg)
Now, we go from the population to the sample-based analysis of this particular model.
At the sample level, we study the random variable
over the ball .
38
![Page 39: Statistical Guarantees for the EM Algorithm: From ... · An Example (Murray, 1977; Wu, 1983) Twelve observations are collected from a bivariate normal distribution with mean , correlation](https://reader034.vdocuments.site/reader034/viewer/2022050405/5f82c15c9f60ac5c2e20301f/html5/thumbnails/39.jpg)
Corollary 4(Sample-based result for first-order EM guarantees for GMM)
» Let for a sufficiently large ;
» Choose the radius ;
» Choose the step size ;
» Suppose the sample size is lower bounded by .
39
![Page 40: Statistical Guarantees for the EM Algorithm: From ... · An Example (Murray, 1977; Wu, 1983) Twelve observations are collected from a bivariate normal distribution with mean , correlation](https://reader034.vdocuments.site/reader034/viewer/2022050405/5f82c15c9f60ac5c2e20301f/html5/thumbnails/40.jpg)
Corollary 4(Sample-based result for first-order EM guarantees for GMM)(continued)
Then, there is a contraction coefficient , where is a
universal constant, such that, for any initialization ,
the first-order EM iterates satisfy the bound
with probability at least .
40
![Page 41: Statistical Guarantees for the EM Algorithm: From ... · An Example (Murray, 1977; Wu, 1983) Twelve observations are collected from a bivariate normal distribution with mean , correlation](https://reader034.vdocuments.site/reader034/viewer/2022050405/5f82c15c9f60ac5c2e20301f/html5/thumbnails/41.jpg)
Summary
» This paper advances our theoretical understanding of EM algorithm.
» This paper concentrates on how to obtain a near-optimal estimate of using EM algorithm.
» With the help of optimization theory, this paper establishes the size of the region of attraction where the initialization should be chosen and the rate of convergence of the EM algorithm.
» This paper also develops techniques to analyze other algorithms for solving non-convex problems.
» What's next...
41
![Page 42: Statistical Guarantees for the EM Algorithm: From ... · An Example (Murray, 1977; Wu, 1983) Twelve observations are collected from a bivariate normal distribution with mean , correlation](https://reader034.vdocuments.site/reader034/viewer/2022050405/5f82c15c9f60ac5c2e20301f/html5/thumbnails/42.jpg)
42