CSE 446: Machine Learning
Emily Fox
University of Washington
March 1, 2017

Expectation Maximization cont'd
Recap of EM so far…
Part 1: What if we knew the cluster parameters {πk, μk, Σk}?
Responsibilities in equations
- Responsibility cluster k takes for observation i
- Normalized over all possible cluster assignments
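In symbols, the responsibility is the posterior probability that cluster k generated observation i, computed from the current parameters (the standard mixture-of-Gaussians form):

$$r_{ik} = \frac{\pi_k \, N(x_i \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, N(x_i \mid \mu_j, \Sigma_j)}$$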
Part 1 summary
Desired soft assignments (responsibilities) are easy to compute when cluster parameters {πk, μk, Σk} are known
But, we don’t know these!
Part 2b: What can we do with just soft assignments rik?
Estimating cluster parameters from soft assignments
Instead of having a full observation xi in cluster k, just allocate a portion rik

xi divided across all clusters, as determined by rik
Maximum likelihood estimation from soft assignments
Just like in boosting with weighted observations…
R      G      B      ri1    ri2    ri3
x1[1]  x1[2]  x1[3]  0.30   0.18   0.52
x2[1]  x2[2]  x2[3]  0.01   0.26   0.73
x3[1]  x3[2]  x3[3]  0.002  0.008  0.99
x4[1]  x4[2]  x4[3]  0.75   0.10   0.15
x5[1]  x5[2]  x5[3]  0.05   0.93   0.02
x6[1]  x6[2]  x6[3]  0.13   0.86   0.01

Total weight in cluster (effective # of obs):  1.242   2.338   2.42

(e.g., for the first observation, 0.52 means a 52% chance this obs is in cluster 3)
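In symbols, the total weight in cluster k (the effective number of observations) is

$$\hat{N}_k = \sum_{i=1}^{N} r_{ik}$$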
Maximum likelihood estimation from soft assignments
R      G      B      Cluster 1 weights
x1[1]  x1[2]  x1[3]  0.30
x2[1]  x2[2]  x2[3]  0.01
x3[1]  x3[2]  x3[3]  0.002
x4[1]  x4[2]  x4[3]  0.75
x5[1]  x5[2]  x5[3]  0.05
x6[1]  x6[2]  x6[3]  0.13

R      G      B      Cluster 2 weights
x1[1]  x1[2]  x1[3]  0.18
x2[1]  x2[2]  x2[3]  0.26
x3[1]  x3[2]  x3[3]  0.008
x4[1]  x4[2]  x4[3]  0.10
x5[1]  x5[2]  x5[3]  0.93
x6[1]  x6[2]  x6[3]  0.86

R      G      B      Cluster 3 weights
x1[1]  x1[2]  x1[3]  0.52
x2[1]  x2[2]  x2[3]  0.73
x3[1]  x3[2]  x3[3]  0.99
x4[1]  x4[2]  x4[3]  0.15
x5[1]  x5[2]  x5[3]  0.02
x6[1]  x6[2]  x6[3]  0.01
Cluster-specific location/shape MLE
Compute cluster parameter estimates with weights on each row operation
R      G      B      Cluster 1 weights
x1[1]  x1[2]  x1[3]  0.30
x2[1]  x2[2]  x2[3]  0.01
x3[1]  x3[2]  x3[3]  0.002
x4[1]  x4[2]  x4[3]  0.75
x5[1]  x5[2]  x5[3]  0.05
x6[1]  x6[2]  x6[3]  0.13

Total weight in cluster k = effective # obs
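In symbols, the weighted maximum likelihood estimates for a Gaussian cluster take the standard closed forms, with each row contributing in proportion to its weight:

$$\hat{\mu}_k = \frac{1}{\hat{N}_k} \sum_{i=1}^{N} r_{ik}\, x_i, \qquad \hat{\Sigma}_k = \frac{1}{\hat{N}_k} \sum_{i=1}^{N} r_{ik}\, (x_i - \hat{\mu}_k)(x_i - \hat{\mu}_k)^T, \qquad \hat{N}_k = \sum_{i=1}^{N} r_{ik}$$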
MLE of cluster proportions
Estimate cluster proportions from relative weights
ri1    ri2    ri3
0.30   0.18   0.52
0.01   0.26   0.73
0.002  0.008  0.99
0.75   0.10   0.15
0.05   0.93   0.02
0.13   0.86   0.01

Total weight in cluster:  1.242   2.338   2.42
Total weight in dataset:  6  (# datapoints N)

Total weight in cluster k = effective # obs
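In symbols, and worked out with the column totals above:

$$\hat{\pi}_k = \frac{\hat{N}_k}{N} \quad\Rightarrow\quad \hat{\pi}_1 = \frac{1.242}{6} \approx 0.207, \quad \hat{\pi}_2 = \frac{2.338}{6} \approx 0.390, \quad \hat{\pi}_3 = \frac{2.42}{6} \approx 0.403$$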
Part 2b summary
Still straightforward to compute cluster parameter estimates from soft assignments
Expectation maximization (EM)
Expectation maximization (EM): An iterative algorithm
Motivates an iterative algorithm:
1. E-step: estimate cluster responsibilities given current parameter estimates
2. M-step: maximize likelihood over parameters given current responsibilities
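A minimal sketch of this loop in Python (assuming NumPy and SciPy are available; the function name em_gmm, the initialization arguments, and the convergence check are illustrative choices, not the course's reference implementation):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, pi, mu, Sigma, n_iters=100, tol=1e-6):
    """EM for a mixture of Gaussians, starting from initial pi, mu, Sigma."""
    N, _ = X.shape
    K = len(pi)
    prev_ll = -np.inf
    for _ in range(n_iters):
        # E-step: responsibilities r[i, k] given the current parameter estimates
        dens = np.column_stack([
            pi[k] * multivariate_normal.pdf(X, mean=mu[k], cov=Sigma[k])
            for k in range(K)
        ])
        r = dens / dens.sum(axis=1, keepdims=True)

        # M-step: weighted MLE given the current responsibilities
        Nk = r.sum(axis=0)                       # effective # of obs per cluster
        pi = Nk / N                              # cluster proportions
        mu = (r.T @ X) / Nk[:, None]             # weighted means
        Sigma = []
        for k in range(K):
            diff = X - mu[k]
            Sigma.append((r[:, k, None] * diff).T @ diff / Nk[k])  # weighted covariances

        # Assess convergence via the data log likelihood (under the E-step parameters)
        ll = np.log(dens.sum(axis=1)).sum()
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    return pi, mu, Sigma, r
```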
EM for mixtures of Gaussians in pictures – initialization
EM for mixtures of Gaussians in pictures – after 1st iteration
EM for mixtures of Gaussians in pictures – after 2nd iteration
EM for mixtures of Gaussians in pictures – converged solution
EM for mixtures of Gaussians in pictures – replay
The nitty gritty of EM
Convergence and initialization of EM
Convergence of EM
• EM is a coordinate-ascent algorithm
- Can equate the E- and M-steps with alternating maximizations of an objective function
• Converges to a local mode
• We will assess convergence via the (log) likelihood of the data under the current parameter and responsibility estimates
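The quantity typically monitored is the marginal log likelihood of the data under the current parameters:

$$\ell(\pi, \mu, \Sigma) = \sum_{i=1}^{N} \log \sum_{k=1}^{K} \pi_k \, N(x_i \mid \mu_k, \Sigma_k)$$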
Initialization
• Many ways to initialize the EM algorithm
• Important for convergence rates & quality of local mode found
• Examples:
- Choose K observations at random to define K “centroids”. Assign other observations to the nearest centroid to form initial parameter estimates.
- Pick centers sequentially to provide good coverage of the data, as in k-means++
- Initialize from a k-means solution (see the sketch below)
- Grow the mixture model by splitting (and sometimes removing) clusters until K clusters are formed
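A small sketch of the "initialize from k-means" option (assumes scikit-learn; the helper name init_from_kmeans and the choice to start every cluster at the overall data covariance are illustrative, not prescribed by the slides):

```python
import numpy as np
from sklearn.cluster import KMeans

def init_from_kmeans(X, K):
    # Run k-means to get hard cluster assignments and centers
    km = KMeans(n_clusters=K, n_init=10).fit(X)
    mu = km.cluster_centers_                              # initial means
    pi = np.bincount(km.labels_, minlength=K) / len(X)    # initial proportions
    # Start every cluster from the overall data covariance (one simple choice)
    Sigma = [np.cov(X, rowvar=False) for _ in range(K)]
    return pi, mu, Sigma
```

These initial estimates can then be handed to an EM routine such as the em_gmm sketch above.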
Potential of vanilla EM to overfit
Overfitting of MLE
Maximizing likelihood can overfit to data
Imagine a K=2 example with one obs assigned to cluster 1 and the others assigned to cluster 2
- What parameter values maximize the likelihood?

Set the center equal to that point and shrink the variance to 0.
Likelihood goes to ∞!
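A sketch of why: with μ1 = xi and Σ1 = σ²I, the density of that single point is

$$N(x_i \mid \mu_1 = x_i, \sigma^2 I) = \frac{1}{(2\pi\sigma^2)^{d/2}} \to \infty \quad \text{as } \sigma \to 0,$$

so the overall likelihood can be made arbitrarily large.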
Overfitting in high dims
Doc-clustering example: Imagine only 1 doc assigned to cluster k has word w (or all docs in cluster agree on count of word w)
Likelihood of any doc with different count on word w being in cluster k is 0!
Likelihood maximized by setting μk[w] = xi[w] and σ²w,k = 0
Simple regularization of M-step for mixtures of Gaussians
Simple fix: Don't let variances → 0!
Add small amount to diagonal of covariance estimate
Alternatively, take Bayesian approach & place prior on parameters.
Similar idea, but all parameter estimates are “smoothed” via cluster pseudo-observations.
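One common way to write the "small amount on the diagonal" fix (ε is a small constant, a notation assumed here rather than taken from the slides):

$$\hat{\Sigma}_k = \frac{1}{\hat{N}_k} \sum_{i=1}^{N} r_{ik}\,(x_i - \hat{\mu}_k)(x_i - \hat{\mu}_k)^T + \epsilon I$$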
Formally relating k-means and EM for mixtures of Gaussians
Relationship to k-means
Consider a Gaussian mixture model with

$$\Sigma = \sigma^2 I = \begin{bmatrix} \sigma^2 & & & \\ & \sigma^2 & & \\ & & \ddots & \\ & & & \sigma^2 \end{bmatrix} \quad \text{(spherically symmetric clusters)}$$

and let the variance parameter σ → 0.

- Spherical clusters w/ equal variances, so relative likelihoods are just a function of distance to the cluster center
- As variances → 0, the likelihood ratio becomes 0 or 1
- Responsibilities weigh in cluster proportions, but are dominated by the likelihood disparity

Datapoint gets fully assigned to the nearest center, just as in k-means
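In symbols, with equal spherical covariances the responsibilities reduce to (a standard derivation, not taken verbatim from the slides):

$$r_{ik} = \frac{\pi_k \exp\!\left(-\tfrac{1}{2\sigma^2}\lVert x_i - \mu_k\rVert^2\right)}{\sum_j \pi_j \exp\!\left(-\tfrac{1}{2\sigma^2}\lVert x_i - \mu_j\rVert^2\right)} \;\longrightarrow\; \begin{cases} 1 & \text{if } k = \arg\min_j \lVert x_i - \mu_j\rVert^2 \\ 0 & \text{otherwise} \end{cases} \quad \text{as } \sigma^2 \to 0$$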
Infinitesimally small variance EM = k-means
1. E-step: estimate cluster responsibilities given current parameter estimates
2. M-step: maximize likelihood over parameters given current responsibilities (hard assignments!)
Decision based on distance to the nearest cluster center, with infinitesimally small variance
Summary for the EM algorithm
What you can do now…
• Estimate soft assignments (responsibilities) given mixture model parameters
• Solve maximum likelihood parameter estimation using soft assignments (weighted data)
• Implement an EM algorithm for inferring soft assignments and cluster parameters
- Determine an initialization strategy
- Implement a variant that helps avoid overfitting issues
• Compare and contrast with k-means
- Soft vs. hard assignments
- k-means as a limiting special case of EM for mixtures of Gaussians
CSE 446: Machine Learning
Emily Fox
University of Washington
March 1, 2017

Bayes Optimal Classifier & Naïve Bayes
Classification
Learn: f: X → Y
- X – features
- Y – target classes

Suppose you know P(Y|X) exactly; how should you classify?
- Bayes optimal classifier:
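A standard way to write the Bayes optimal rule (predict the most probable class given the features; h* is notation introduced here for the decision rule):

$$h^*(x) = \arg\max_{y} P(Y = y \mid X = x)$$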
Recall: Bayes rule
Which is shorthand for:
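The rule itself, with the shorthand P(Y|X) = P(X|Y)P(Y)/P(X) expanded for specific values and the denominator written via the law of total probability:

$$P(Y = y \mid X = x) = \frac{P(X = x \mid Y = y)\, P(Y = y)}{\sum_{y'} P(X = x \mid Y = y')\, P(Y = y')}$$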
How hard is it to learn the optimal classifier?
• Data = (see table below)
• How do we represent these? How many parameters?
- Prior, P(Y):
  • Suppose Y is composed of k classes
- Likelihood, P(X|Y):
  • Suppose X is composed of d binary features
• Complex model! High variance with limited data!!!
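A standard count under these assumptions: the prior P(Y) over k classes has k − 1 free parameters, and a full joint likelihood P(X|Y) over d binary features has 2^d − 1 free parameters per class, so

$$\underbrace{(k-1)}_{\text{prior}} \;+\; \underbrace{k\,(2^d - 1)}_{\text{likelihood}}$$

parameters in total; the count is exponential in d, hence the high-variance warning.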
Sky    Temp  Humid   Wind    Water  Forecast  EnjoySpt
Sunny  Warm  Normal  Strong  Warm   Same      Yes
Sunny  Warm  High    Strong  Warm   Same      Yes
Rainy  Cold  High    Strong  Warm   Change    No
Sunny  Warm  High    Strong  Cold   Change    Yes