Christopher M. Bishop
Mixture Models and the EM Algorithm
Microsoft Research, Cambridge
2006 Advanced Tutorial Lecture Series, CUED
Applications of Machine Learning
• Web search, email spam detection, collaborative filtering, game player ranking, video games, real-time stereo, protein folding, image editing, jet engine anomaly detection, fluorescence in-situ hybridisation, signature verification, satellite scatterometer, cervical smear screening, human genome analysis, compiler optimization, handwriting recognition, breast X-ray screening, fingerprint recognition, fast spectral analysis, one-touch microwave oven, monitoring of premature babies, text/graphics discrimination, event selection in high energy physics, electronic nose, real-time tokamak control, crash log analysis, QSAR, backgammon, sleep EEG staging, fMRI analysis, speech recognition, natural language processing, face detection, data visualization, computer Go, satellite track removal, iris recognition, …
Three Important Developments
1. Adoption of a Bayesian framework
2. Probabilistic graphical models
3. Efficient techniques for approximate inference
Illustration: Bayesian Ranking
• Goal is to rank player skill from the outcome of games
• Conventional approach: Elo (used in chess)
– maintains a single strength value for each player
– cannot handle team games, or more than 2 players
Bayesian Ranking: TrueSkillTM
• Ralf Herbrich, Thore Graepel, Tom Minka
• Xbox 360 Live (November 2005)
– millions of players
– billions of service-hours
– hundreds of thousands of game outcomes per day
• First “planet-scale” application of Bayesian methods?
• NIPS (2006) oral
Expectation Propagation on a Factor Graph
New Book
• Springer (2006)
• 738 pages, hardcover
• Full colour
• Low price
• 431 exercises + solutions
• Matlab software and companion text with Ian Nabney
http://research.microsoft.com/~cmbishop/PRML
Mixture Models and EM
• K-means clustering
• Gaussian mixture model
• Maximum likelihood and EM
• Bayesian GMM and variational inference
Please ask questions!
Old Faithful
Old Faithful Data Set
(Figure: scatter plot of the Old Faithful data: duration of eruption (minutes) vs. time between eruptions (minutes).)
K-means Algorithm
• Goal: represent a data set in terms of K clusters, each of which is summarized by a prototype
• Initialize the prototypes, then iterate between two phases:
– E-step: assign each data point to its nearest prototype
– M-step: update each prototype to be the mean of its cluster
• Simplest version is based on Euclidean distance (a minimal code sketch follows below)
– re-scale the Old Faithful data
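The sketch referred to above: a minimal NumPy implementation of the two phases (function and variable names are illustrative, not from the lecture):

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Simple K-means with Euclidean distance.

    X: (N, D) data array; K: number of clusters.
    Returns prototypes mu (K, D) and assignments (N,).
    """
    rng = np.random.default_rng(seed)
    # Initialize prototypes as K randomly chosen data points
    mu = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iters):
        # E-step: assign each point to its nearest prototype
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)  # (N, K)
        assign = d2.argmin(axis=1)
        # M-step: move each prototype to the mean of its cluster
        new_mu = np.array([X[assign == k].mean(axis=0) if np.any(assign == k)
                           else mu[k] for k in range(K)])
        if np.allclose(new_mu, mu):  # assignments stable: converged
            break
        mu = new_mu
    return mu, assign

# Re-scale the data before clustering, as on the slide:
# X_std = (X - X.mean(axis=0)) / X.std(axis=0)
```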
(Figures: successive E-step and M-step iterations of K-means on the re-scaled Old Faithful data.)
Responsibilities
• Responsibilities $r_{nk} \in \{0, 1\}$ assign data points to clusters: $r_{nk} = 1$ if data point $n$ is assigned to cluster $k$, and 0 otherwise,

such that $\sum_{k=1}^{K} r_{nk} = 1$ for every $n$

• Example: 5 data points and 3 clusters, i.e. a $5 \times 3$ binary matrix with a single 1 in each row (entries illustrative):

$$R = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 0 & 1 \\ 0 & 1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 1 \end{pmatrix}$$
K-means Cost Function
$$J = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk}\, \|\mathbf{x}_n - \boldsymbol{\mu}_k\|^2$$

where the $\mathbf{x}_n$ are the data, the $r_{nk}$ are the responsibilities, and the $\boldsymbol{\mu}_k$ are the prototypes
Minimizing the Cost Function
• E-step: minimize $J$ w.r.t. the responsibilities $\{r_{nk}\}$, holding the prototypes fixed: assign each point to its nearest prototype
• M-step: minimize $J$ w.r.t. the prototypes $\{\boldsymbol{\mu}_k\}$, holding the responsibilities fixed:

$$\boldsymbol{\mu}_k = \frac{\sum_n r_{nk}\,\mathbf{x}_n}{\sum_n r_{nk}}$$

• Convergence is guaranteed: neither step can increase $J$, and there is only a finite number of possible settings for the responsibilities
Probabilistic Clustering
• Represent the probability distribution of the data as a mixture model
– captures uncertainty in cluster assignments
– gives a model for the data distribution
– Bayesian mixture model allows us to determine K
• Consider mixtures of Gaussians
The Gaussian Distribution
• Multivariate Gaussian

$$\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{1}{(2\pi)^{D/2}\, |\boldsymbol{\Sigma}|^{1/2}} \exp\left\{ -\tfrac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^{\mathrm{T}} \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}) \right\}$$

with mean $\boldsymbol{\mu}$ and covariance $\boldsymbol{\Sigma}$

(Figure: contours of a two-dimensional Gaussian in the $(x_1, x_2)$ plane; panels (a), (b), (c).)
Likelihood Function
• Data set $\mathbf{X} = \{\mathbf{x}_1, \ldots, \mathbf{x}_N\}$
• Consider first a single Gaussian
• Assume the observed data points are generated independently

$$p(\mathbf{X} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \prod_{n=1}^{N} \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}, \boldsymbol{\Sigma})$$

• Viewed as a function of the parameters, this is known as the likelihood function
Maximum Likelihood
• Set the parameters by maximizing the likelihood function
• Equivalently, maximize the log likelihood

$$\ln p(\mathbf{X} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \sum_{n=1}^{N} \ln \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}, \boldsymbol{\Sigma})$$
Maximum Likelihood Solution
• Maximizing w.r.t. the mean gives the sample mean

$$\boldsymbol{\mu}_{\mathrm{ML}} = \frac{1}{N} \sum_{n=1}^{N} \mathbf{x}_n$$

• Maximizing w.r.t. the covariance gives the sample covariance

$$\boldsymbol{\Sigma}_{\mathrm{ML}} = \frac{1}{N} \sum_{n=1}^{N} (\mathbf{x}_n - \boldsymbol{\mu}_{\mathrm{ML}})(\mathbf{x}_n - \boldsymbol{\mu}_{\mathrm{ML}})^{\mathrm{T}}$$
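Both estimators are one line each in NumPy (a sketch; `X` is an $N \times D$ data array):

```python
import numpy as np

def gaussian_ml(X):
    """Maximum-likelihood mean and covariance of a single Gaussian."""
    mu = X.mean(axis=0)                     # sample mean
    centered = X - mu
    Sigma = centered.T @ centered / len(X)  # sample covariance (divisor N, not N-1)
    return mu, Sigma
```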
Gaussian Mixtures
• Linear super-position of Gaussians

$$p(\mathbf{x}) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$$

• Normalization and positivity require

$$0 \leqslant \pi_k \leqslant 1, \qquad \sum_{k=1}^{K} \pi_k = 1$$

• Can interpret the mixing coefficients $\pi_k$ as prior probabilities
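A direct transcription of the mixture density (a sketch using SciPy; the parameter names `pis`, `mus`, `Sigmas` are illustrative):

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_density(x, pis, mus, Sigmas):
    """Evaluate p(x) = sum_k pi_k N(x | mu_k, Sigma_k)."""
    return sum(pi * multivariate_normal.pdf(x, mean=mu, cov=Sigma)
               for pi, mu, Sigma in zip(pis, mus, Sigmas))
```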
Example: Mixture of 3 Gaussians
Contours of Probability Distribution
Surface Plot
Sampling from the Gaussian
• To generate a data point:
– first pick one of the components $k$ with probability $\pi_k$
– then draw a sample $\mathbf{x}$ from that component’s Gaussian $\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$
• Repeat these two steps for each new data point
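These two steps, repeated for each new point, are a few lines of NumPy (a sketch; names are illustrative):

```python
import numpy as np

def sample_gmm(n, pis, mus, Sigmas, seed=0):
    """Ancestral sampling: pick a component, then sample from it."""
    rng = np.random.default_rng(seed)
    ks = rng.choice(len(pis), size=n, p=pis)   # step 1: component indices
    return np.array([rng.multivariate_normal(mus[k], Sigmas[k])  # step 2: draw x
                     for k in ks])
```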
Synthetic Data Set
Fitting the Gaussian Mixture
• We wish to invert this process: given the data set, find the corresponding parameters:
– mixing coefficients
– means
– covariances
• If we knew which component generated each data point, the maximum likelihood solution would involve fitting each component to the corresponding cluster
• Problem: the data set is unlabelled
• We shall refer to the labels as latent (= hidden) variables
Synthetic Data Set Without Labels
Posterior Probabilities
• We can think of the mixing coefficients as prior probabilities for the components
• For a given value of $\mathbf{x}$ we can evaluate the corresponding posterior probabilities, called responsibilities
• These are given from Bayes’ theorem by

$$\gamma_k(\mathbf{x}) \equiv p(k \mid \mathbf{x}) = \frac{\pi_k\, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\sum_{j=1}^{K} \pi_j\, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}$$
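The same formula, vectorized over a data matrix (a sketch; `pis`, `mus`, `Sigmas` are the illustrative parameter names used above):

```python
import numpy as np
from scipy.stats import multivariate_normal

def responsibilities(X, pis, mus, Sigmas):
    """gamma[n, k] = pi_k N(x_n | mu_k, Sigma_k) / sum_j pi_j N(x_n | mu_j, Sigma_j)."""
    num = np.stack([pi * multivariate_normal.pdf(X, mean=mu, cov=S)
                    for pi, mu, S in zip(pis, mus, Sigmas)], axis=1)  # (N, K)
    return num / num.sum(axis=1, keepdims=True)
```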
Posterior Probabilities (colour coded)
Latent Variables
Maximum Likelihood for the GMM
• The log likelihood function takes the form

$$\ln p(\mathbf{X} \mid \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \sum_{n=1}^{N} \ln \left\{ \sum_{k=1}^{K} \pi_k\, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \right\}$$

• Note: the sum over components appears inside the log
• There is no closed-form solution for maximum likelihood
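Because the sum sits inside the log, this quantity is best evaluated in log space; a sketch using SciPy’s `logsumexp` for numerical stability:

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def gmm_log_likelihood(X, pis, mus, Sigmas):
    """sum_n log sum_k pi_k N(x_n | mu_k, Sigma_k), computed stably."""
    log_terms = np.stack([np.log(pi) + multivariate_normal.logpdf(X, mean=mu, cov=S)
                          for pi, mu, S in zip(pis, mus, Sigmas)], axis=1)  # (N, K)
    return logsumexp(log_terms, axis=1).sum()
```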
Over-fitting in Gaussian Mixture Models
• Singularities in the likelihood function when a component ‘collapses’ onto a data point: set $\boldsymbol{\mu}_j = \mathbf{x}_n$ with $\boldsymbol{\Sigma}_j = \sigma_j^2 \mathbf{I}$,

then consider the limit $\sigma_j \to 0$: the factor $\mathcal{N}(\mathbf{x}_n \mid \mathbf{x}_n, \sigma_j^2 \mathbf{I}) \propto \sigma_j^{-D}$ diverges, so the likelihood is unbounded

• The likelihood function gets larger as we add more components (and hence parameters) to the model
– not clear how to choose the number K of components
Problems and Solutions
• How to maximize the log likelihood
– solved by the expectation-maximization (EM) algorithm
• How to avoid singularities in the likelihood function
– solved by a Bayesian treatment
• How to choose the number K of components
– also solved by a Bayesian treatment
EM Algorithm – Informal Derivation
• Let us proceed by simply differentiating the log likelihood
• Setting the derivative w.r.t. the means $\boldsymbol{\mu}_k$ to zero gives

$$\boldsymbol{\mu}_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma_{nk}\, \mathbf{x}_n, \qquad N_k = \sum_{n=1}^{N} \gamma_{nk}$$

where $\gamma_{nk}$ denotes the responsibility of component $k$ for data point $\mathbf{x}_n$
EM Algorithm – Informal Derivation
• Similarly, for the covariances

$$\boldsymbol{\Sigma}_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma_{nk}\, (\mathbf{x}_n - \boldsymbol{\mu}_k)(\mathbf{x}_n - \boldsymbol{\mu}_k)^{\mathrm{T}}$$

• For the mixing coefficients, use a Lagrange multiplier (enforcing $\sum_k \pi_k = 1$) to give

$$\pi_k = \frac{N_k}{N}$$
EM Algorithm – Informal Derivation
• The solutions are not closed form since they are coupled: the responsibilities depend on the parameters, and the parameter updates depend on the responsibilities
• This suggests an iterative scheme for solving them:
– make initial guesses for the parameters
– alternate between the following two stages:
1. E-step: evaluate the responsibilities using the current parameters
2. M-step: update the parameters using the ML results above
• Each EM cycle is guaranteed not to decrease the likelihood (a code sketch of the full loop follows)
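A compact sketch of the full loop built from the update equations above (the initialization and the small covariance ridge are illustrative choices, not from the slides):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iters=100, seed=0):
    """EM for a Gaussian mixture model. X: (N, D) data array."""
    N, D = X.shape
    rng = np.random.default_rng(seed)
    # Crude initialization: uniform weights, random data points as means
    pis = np.full(K, 1.0 / K)
    mus = X[rng.choice(N, size=K, replace=False)].copy()
    Sigmas = np.stack([np.cov(X.T) + 1e-6 * np.eye(D)] * K)
    for _ in range(n_iters):
        # E-step: responsibilities gamma[n, k] via Bayes' theorem
        gamma = np.stack([pi * multivariate_normal.pdf(X, mean=mu, cov=S)
                          for pi, mu, S in zip(pis, mus, Sigmas)], axis=1)
        gamma /= gamma.sum(axis=1, keepdims=True)
        # M-step: plug the responsibilities into the ML formulas above
        Nk = gamma.sum(axis=0)                     # effective number of points
        mus = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            d = X - mus[k]
            # small ridge guards against the singularities discussed earlier
            Sigmas[k] = (gamma[:, k, None] * d).T @ d / Nk[k] + 1e-6 * np.eye(D)
        pis = Nk / N
    return pis, mus, Sigmas
```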
(Figures: successive iterations of EM fitting a Gaussian mixture to the Old Faithful data.)
Relation to K-means
• Consider a GMM with common covariances $\boldsymbol{\Sigma}_k = \epsilon \mathbf{I}$
• Take the limit $\epsilon \to 0$
• The responsibilities become binary ($\gamma_{nk} \to r_{nk}$, as shown below)
• The EM algorithm is then precisely equivalent to K-means
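Concretely, with $\boldsymbol{\Sigma}_k = \epsilon \mathbf{I}$ the responsibilities take the form

$$\gamma_{nk} = \frac{\pi_k \exp\{-\|\mathbf{x}_n - \boldsymbol{\mu}_k\|^2 / 2\epsilon\}}{\sum_j \pi_j \exp\{-\|\mathbf{x}_n - \boldsymbol{\mu}_j\|^2 / 2\epsilon\}} \;\longrightarrow\; r_{nk} \quad \text{as } \epsilon \to 0$$

since the term with the smallest $\|\mathbf{x}_n - \boldsymbol{\mu}_k\|^2$ dominates, giving a hard assignment to the nearest prototype.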
Bayesian Mixture of Gaussians
• Include a prior distribution $p(\boldsymbol{\theta})$ over the parameters $\boldsymbol{\theta} = \{\boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma}\}$
• Make predictions by marginalizing over the parameters

$$p(\mathbf{x} \mid \mathbf{X}) = \int p(\mathbf{x} \mid \boldsymbol{\theta})\, p(\boldsymbol{\theta} \mid \mathbf{X})\, \mathrm{d}\boldsymbol{\theta}$$

– c.f. the point estimate $p(\mathbf{x} \mid \boldsymbol{\theta}_{\mathrm{ML}})$ from maximum likelihood
Bayesian Mixture of Gaussians
• Conjugate priors for the parameters:
– Dirichlet prior for the mixing coefficients

$$p(\boldsymbol{\pi}) = \mathrm{Dir}(\boldsymbol{\pi} \mid \boldsymbol{\alpha}_0) \propto \prod_{k=1}^{K} \pi_k^{\alpha_0 - 1}$$

– Normal-Wishart prior for the means and precisions $\boldsymbol{\Lambda}_k$

$$p(\boldsymbol{\mu}_k, \boldsymbol{\Lambda}_k) = \mathcal{N}\big(\boldsymbol{\mu}_k \mid \mathbf{m}_0, (\beta_0 \boldsymbol{\Lambda}_k)^{-1}\big)\, \mathcal{W}(\boldsymbol{\Lambda}_k \mid \mathbf{W}_0, \nu_0)$$

where the Wishart distribution is given by

$$\mathcal{W}(\boldsymbol{\Lambda} \mid \mathbf{W}, \nu) = B(\mathbf{W}, \nu)\, |\boldsymbol{\Lambda}|^{(\nu - D - 1)/2} \exp\big\{ -\tfrac{1}{2} \mathrm{Tr}(\mathbf{W}^{-1} \boldsymbol{\Lambda}) \big\}$$
Variational Inference
• Exact solution is intractable
• Variational inference:
– extension of EM
– alternate between updating the posterior over parameters and the posterior over latent variables
– again, convergence is guaranteed
Illustration: a single Gaussian
• Convenient to work with the precision $\tau = 1/\sigma^2$
• Likelihood function

$$p(\mathbf{x} \mid \mu, \tau) = \prod_{n=1}^{N} \mathcal{N}(x_n \mid \mu, \tau^{-1})$$

• Conjugate (Normal-Gamma) prior over the parameters

$$p(\mu, \tau) = \mathcal{N}\big(\mu \mid \mu_0, (\lambda_0 \tau)^{-1}\big)\, \mathrm{Gam}(\tau \mid a_0, b_0)$$
Variational Inference
• Goal is to find the true posterior distribution $p(\mu, \tau \mid \mathbf{x})$
• Factorized approximation

$$q(\mu, \tau) = q_\mu(\mu)\, q_\tau(\tau)$$

• Alternately update each factor to minimize a measure of closeness between the true and approximate distributions (made precise below)
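The “measure of closeness” is the Kullback-Leibler divergence $\mathrm{KL}(q \,\|\, p)$. Minimizing it with respect to one factor at a time, holding the others fixed, gives the general mean-field update (stated here in general form; $\mathbf{Z}$ collects all unobserved quantities and $\mathbf{Z}_j$ is the group being updated):

$$\ln q_j^{\star}(\mathbf{Z}_j) = \mathbb{E}_{i \neq j}\big[ \ln p(\mathbf{X}, \mathbf{Z}) \big] + \mathrm{const}$$

Each such update can only decrease the KL divergence, which is why convergence is guaranteed.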
Initial Configuration
(Figure: contours in the (µ, τ) plane; panel (a).)
After Updating
(Figure: contours in the (µ, τ) plane; panel (b).)
After Updating
(Figure: contours in the (µ, τ) plane; panel (c).)
Converged Solution
(Figure: contours in the (µ, τ) plane; panel (d).)
Variational Equations for GMM
Sufficient Statistics
• Small computational overhead compared to maximum likelihood EM
Bayesian Model Comparison
• Multiple models $\mathcal{M}_i$ (e.g. different values of K) with priors $p(\mathcal{M}_i)$
• Posterior probabilities

$$p(\mathcal{M}_i \mid \mathcal{D}) \propto p(\mathcal{D} \mid \mathcal{M}_i)\, p(\mathcal{M}_i)$$

• For equal priors, models are compared using the evidence $p(\mathcal{D} \mid \mathcal{M}_i)$
• Variational inference maximizes a lower bound $\mathcal{L}$ on $\ln p(\mathcal{D} \mid \mathcal{M}_i)$
Evidence vs. K for Old Faithful
Bayesian Model Complexity
Take-home Messages
• Maximum likelihood gives severe over-fitting
– singularities
– favours ever larger numbers of components
• Bayesian mixture of Gaussians
– no singularities
– determines the optimal number of components
• Variational inference
– effective solution for the Bayesian GMM
– little computational overhead compared to EM
Viewgraphs, tutorials and publications available from:
http://research.microsoft.com/~cmbishop