TRANSCRIPT
Expectation Maximization · Convergence of EM · Kernel Density Estimation
CPSC 540: Machine Learning
Expectation Maximization and Kernel Density Estimation
Mark Schmidt
University of British Columbia
Winter 2016
Admin
Assignment 2:
2 late days to hand it in now. Thursday is the last possible day.
Assignment 3:
Due in 2 weeks, start early. Some additional hints will be added.
Reading week:
No classes next week. I’m talking at Robson Square 6:30pm Wednesday February 17.
February 25:
Default is to not have class this day. Instead go to Rich Sutton’s talk in DMP 110:
“Reinforcement Learning And The Future of Artificial Intelligence”.
Last: Multivariate Gaussian
The multivariate normal distribution models the PDF of a vector x as

    p(x | µ, Σ) = 1 / ( (2π)^{d/2} |Σ|^{1/2} ) · exp( −(1/2) (x − µ)ᵀ Σ⁻¹ (x − µ) ),

where µ ∈ ℝᵈ, Σ ∈ ℝᵈˣᵈ, and Σ ≻ 0.

Closed-form MLE:

    µ = (1/n) ∑_{i=1}^{n} xᵢ,    Σ = (1/n) ∑_{i=1}^{n} (xᵢ − µ)(xᵢ − µ)ᵀ.

Closed under several operations: products of PDFs, marginalization, conditioning.
Uni-modal: probability strictly decreases as you move away from the mean.
“Light-tailed”: assumes all data is close to the mean;
not robust to outliers or data far away from the mean.
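The closed-form MLE above is direct to compute. A minimal NumPy sketch (illustrative code, not from the course; the function name is mine):

```python
import numpy as np

def gaussian_mle(X):
    """Closed-form MLE for a multivariate Gaussian fit to the rows of X (n x d)."""
    n, d = X.shape
    mu = X.mean(axis=0)          # mu = (1/n) * sum_i x_i
    Xc = X - mu                  # centered data
    Sigma = (Xc.T @ Xc) / n      # Sigma = (1/n) * sum_i (x_i - mu)(x_i - mu)^T
    return mu, Sigma
```

Note this is the MLE (dividing by n), not the unbiased estimator (dividing by n − 1).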
Last Time: Mixture Models
To make distributions more flexible, we introduced mixture models. An example is the mixture of Gaussians:

We have a set of k Gaussian distributions, each with a mean µc and covariance Σc.
We have a prior probability θc that a data point comes from distribution c.

How a mixture of Gaussians “generates” data:
1. Sample a cluster zi based on the prior probabilities θc (categorical distribution).
2. Sample the example xi based on the mean µc and covariance Σc of the sampled cluster c = zi.

The standard approach to fitting mixture models is expectation maximization:
a general method for fitting models with hidden variables.
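The two-step generative process above can be sketched directly in NumPy (an illustrative sketch, not course code; the function name and argument layout are my own):

```python
import numpy as np

def sample_mixture(n, theta, mus, Sigmas, rng=None):
    """Draw n samples from a mixture of Gaussians.
    theta: (k,) prior probabilities; mus: (k, d) means; Sigmas: (k, d, d) covariances."""
    rng = np.random.default_rng(rng)
    k, d = mus.shape
    # Step 1: sample the cluster z_i from the categorical prior theta.
    z = rng.choice(k, size=n, p=theta)
    # Step 2: sample x_i from the Gaussian of its cluster z_i.
    X = np.stack([rng.multivariate_normal(mus[c], Sigmas[c]) for c in z])
    return X, z
```

The clusters z are exactly the hidden variables EM will have to reason about: we observe X but not z.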
Last Time: Learning with Hidden Values
We often want to learn when some variables are unobserved/missing/hidden/latent.

For example, we could have a dataset

    X = [ N  33  5 ]       y = [ −1 ]
        [ F  10  1 ]           [ +1 ]
        [ F   ?  2 ] ,         [ −1 ]
        [ M  22  0 ]           [  ? ] .

Missing values are very common in real datasets.

We’ll focus on data that is missing at random (MAR):
the fact that a value is missing does not depend on the missing value.

In the case of mixture models, we’ll treat the clusters zi as missing values.
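As a small illustration (not from the slides), missing entries like the “?” above are commonly encoded as NaN, and under MAR we can compute simple statistics from the observed entries only:

```python
import numpy as np

# Hypothetical numeric features from the dataset above, with '?' encoded as NaN.
X = np.array([[33.0, 5.0],
              [10.0, 1.0],
              [np.nan, 2.0],
              [22.0, 0.0]])

# Under MAR, a per-feature mean over the observed entries is a sensible estimate.
col_means = np.nanmean(X, axis=0)
```

If the data were *not* MAR (e.g., large values are more likely to be missing), ignoring the missing entries like this would bias the estimates.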
More Motivation for EM: Semi-Supervised Learning
An important special case of hidden values is semi-supervised learning. Motivation:

Getting labels is often expensive, but getting unlabeled examples is usually cheap.
Can we train with labeled data (X, y) together with unlabeled data X̄?

[The slide shows a labeled dataset (X, y) alongside an unlabeled dataset X̄ whose labels ȳ are all “?”.]

If these are IID samples, then the ȳ values are MAR.

Classic approach: use a generative classifier and apply EM.
Probabilistic Approach to Learning with Hidden Variables

Let’s use O for our observed variables and H for our hidden variables.
For semi-supervised learning, O = {X, y, X̄} and H = {ȳ}.
We’ll use Θ for the parameters we want to optimize.

Since we don’t observe H, we’ll focus on the probability of the observed O,

    p(O | Θ) = ∑_H p(O, H | Θ),    (by the marginalization rule p(a) = ∑_b p(a, b)),

where we sum (or integrate) over all possible hidden values H.
This is nice because we only need the likelihood of the “complete” data (O, H).

But the negative log-likelihood has the form

    − log p(O | Θ) = − log ( ∑_H p(O, H | Θ) ),

which usually is not convex (due to the sum inside the log).
Even if − log p(O, H | Θ) is “nice” (closed-form, convex, etc.), maximizing the likelihood is typically hard.
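To see the sum-inside-the-log concretely, here is an illustrative sketch (my own code, not from the course) that evaluates the observed-data negative log-likelihood of a Gaussian mixture, where the sum over hidden cluster values sits inside the log:

```python
import numpy as np

def gmm_nll(X, theta, mus, Sigmas):
    """Observed-data NLL of a Gaussian mixture:
    -log p(x_i) = -log sum_c theta_c N(x_i | mu_c, Sigma_c)."""
    n, d = X.shape
    k = len(theta)
    log_pc = np.empty((n, k))
    for c in range(k):
        diff = X - mus[c]
        _, logdet = np.linalg.slogdet(Sigmas[c])
        maha = np.sum(diff @ np.linalg.inv(Sigmas[c]) * diff, axis=1)
        # log of theta_c * N(x_i | mu_c, Sigma_c)
        log_pc[:, c] = np.log(theta[c]) - 0.5 * (d * np.log(2 * np.pi) + logdet + maha)
    # log-sum-exp over the hidden cluster values (numerically stable)
    m = log_pc.max(axis=1, keepdims=True)
    ll = m[:, 0] + np.log(np.exp(log_pc - m).sum(axis=1))
    return -ll.sum()
```

With k = 1 the sum disappears and the NLL is convex in µ; for k > 1 the log of the sum generally destroys convexity.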
![Page 33: CPSC 540: Machine Learningschmidtm/Courses/540-W16/L11.pdfCPSC 540: Machine Learning Expectation Maximization and Kernel Density Estimation Mark Schmidt University of British Columbia](https://reader034.vdocuments.site/reader034/viewer/2022052613/5f27177a3732a834382e47cf/html5/thumbnails/33.jpg)
Expectation Maximization Convergence of EM Kernel Density Estimation
Probabilistic Approach to Learing with Hidden VariablesLet’s use O as observed variables and H as our hidden variables.
For semi-supervised learning, O = {X, y, X} and H = {y}.We’ll use Θ as the parameters we want to optimize.
Since we don’t observe H, we’ll focus on the probability of observed O,
p(O|Θ) =∑H
p(O,H|Θ), (by marginalization rule p(a) =∑b
p(a, b)),
where we sum (or integrate) over all possible hidden values.This is nice because we only need likelihood of “complete” data (O, h).
But the log-likelihood has the form
− log p(O|Θ) = − log
(∑h
p(O,H|Θ)
),
which usually is not convex (due to sum to sum inside the log).Even if − log p(O,H|Θ) is “nice” (closed-form, convex, etc.),maximizing the likelihood is typically hard
![Page 34: CPSC 540: Machine Learningschmidtm/Courses/540-W16/L11.pdfCPSC 540: Machine Learning Expectation Maximization and Kernel Density Estimation Mark Schmidt University of British Columbia](https://reader034.vdocuments.site/reader034/viewer/2022052613/5f27177a3732a834382e47cf/html5/thumbnails/34.jpg)
Expectation Maximization Convergence of EM Kernel Density Estimation
Probabilistic Approach to Learing with Hidden VariablesLet’s use O as observed variables and H as our hidden variables.
For semi-supervised learning, O = {X, y, X} and H = {y}.We’ll use Θ as the parameters we want to optimize.
Since we don’t observe H, we’ll focus on the probability of observed O,
p(O|Θ) =∑H
p(O,H|Θ), (by marginalization rule p(a) =∑b
p(a, b)),
where we sum (or integrate) over all possible hidden values.This is nice because we only need likelihood of “complete” data (O, h).But the log-likelihood has the form
− log p(O|Θ) = − log
(∑h
p(O,H|Θ)
),
which usually is not convex (due to sum to sum inside the log).
Even if − log p(O,H|Θ) is “nice” (closed-form, convex, etc.),maximizing the likelihood is typically hard
Expectation Maximization as Bound Optimization

Expectation maximization is a bound-optimization method: at each iteration we optimize a bound on the function.
In gradient descent, the bound came from Lipschitz continuity of the gradient.
In EM, the bound comes from an expectation over the hidden variables.
The bound will typically not be quadratic.
Probabilistic Approach to Learning with Hidden Variables

For example, in semi-supervised learning our complete-data NLL is

    −log p(X, y, X̄, ȳ | Θ) = −log ( (∏_{i=1}^n p(x^i, y^i|Θ)) (∏_{i=1}^t p(x̄^i, ȳ^i|Θ)) )
                            = −∑_{i=1}^n log p(x^i, y^i|Θ) − ∑_{i=1}^t log p(x̄^i, ȳ^i|Θ),

where (O, H) = ({X, y, X̄}, {ȳ}), the first sum is over the labeled examples, and the second is over the unlabeled examples with guesses ȳ^i. This is convex if −log p(x, y|Θ) is convex.

But since we don't observe the ȳ^i, our observed-data NLL is

    −log p(X, y, X̄ | Θ) = −log ( ∑_{ȳ^1} ∑_{ȳ^2} · · · ∑_{ȳ^t} ∏_{i=1}^n p(x^i, y^i|Θ) ∏_{i=1}^t p(x̄^i, ȳ^i|Θ) )
                         = −∑_{i=1}^n log p(x^i, y^i|Θ) − ∑_{i=1}^t log ( ∑_{ȳ^i} p(x̄^i, ȳ^i|Θ) ),

and the second term is non-convex even if −log p(x, y|Θ) is convex.

Notation: ∑_{ȳ^i} means "sum over all values of ȳ^i" (replace with an integral for continuous ȳ^i).
"Hard" Expectation Maximization and K-Means

A heuristic approach is to repeat the two steps of "hard" EM:
1. Imputation: replace the missing values with their most likely values.
2. Optimization: fit the model with these values.

This is nice because the second step uses the "nice" complete-data NLL.

With a mixture of Gaussians this would give:
1. Imputation: for each x^i, find the most likely cluster z^i.
2. Optimization: update the cluster means µ_c and covariances Σ_c.

This is k-means clustering when the covariances are shared across clusters.

Instead of a single assignment, EM takes a combination of all possible hidden values,

    −log p(O|Θ) = −log ( ∑_H p(O,H|Θ) ) ≈ −∑_H α_H log p(O,H|Θ).

The weights α_H are set so that minimizing the approximation decreases −log p(O|Θ).
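The two hard-EM steps can be sketched in a few lines. With shared spherical covariances the imputation step is just a nearest-mean assignment, so this is exactly k-means; the data are synthetic and there is no handling of empty clusters.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: two well-separated clusters in 2D.
X = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
k = 2
mu = X[rng.choice(len(X), k, replace=False)]  # initialize means at random points

for _ in range(20):
    # Imputation: assign each x^i to its most likely cluster z^i
    # (with shared spherical covariances this is the nearest mean).
    dist = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
    z = dist.argmin(axis=1)
    # Optimization: refit each cluster mean from its assigned points.
    mu = np.array([X[z == c].mean(axis=0) for c in range(k)])
```

This is Lloyd's algorithm; replacing the hard `argmin` assignment with the posterior weights discussed next turns it into soft EM.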
Expectation Maximization (EM)

EM is a local optimizer for cases where minimizing −log p(O,H|Θ) is easy.

Key idea: start with some Θ^0 and set Θ^{t+1} to minimize the upper bound

    −log p(O|Θ) ≤ −∑_H α_H^t log p(O,H|Θ) + const.,

where choosing α_H^t = p(H|O,Θ^t) guarantees that Θ^{t+1} decreases −log p(O|Θ).

This is typically written as two steps:
1. E-step: define the expectation of the complete-data log-likelihood given Θ^t,

    Q(Θ|Θ^t) = E_{H|O,Θ^t}[log p(O,H|Θ)] = ∑_H p(H|O,Θ^t) log p(O,H|Θ),

which is a weighted version of the "nice" log p(O,H|Θ) values (the weights are fixed).
2. M-step: maximize this expectation,

    Θ^{t+1} = argmax_Θ Q(Θ|Θ^t).
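To make the two steps concrete, here is EM for a small model that is not covered in the slides: a mixture of two biased coins, where each trial records the number of heads in m flips and the coin identity is hidden. All numbers are invented for illustration; the binomial coefficient cancels in the E-step normalization, so it is omitted.

```python
import numpy as np

rng = np.random.default_rng(1)
m = 20  # flips per trial
# Toy data: 60 trials, each from one of two hidden coins (biases 0.2 and 0.8);
# we record only the number of heads per trial.
coins = rng.random(60) < 0.5
heads = rng.binomial(m, np.where(coins, 0.8, 0.2))

pi, p = 0.5, np.array([0.4, 0.6])  # initial mixture weight and coin biases
for _ in range(100):
    # E-step: posterior weights p(H^i = c | heads^i, Theta^t).
    lik = np.stack([p[c] ** heads * (1 - p[c]) ** (m - heads) for c in (0, 1)],
                   axis=1)
    post = lik * np.array([pi, 1 - pi])
    post /= post.sum(axis=1, keepdims=True)
    # M-step: maximize Q(Theta | Theta^t); closed-form weighted MLE.
    pi = post[:, 0].mean()
    p = (post * heads[:, None]).sum(axis=0) / (post.sum(axis=0) * m)
```

The E-step computes the fixed weights α_H^t and the M-step solves the resulting weighted maximum-likelihood problem in closed form, exactly as in the two-step description above.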
Convergence Properties of Expectation Maximization

We can show that EM does not decrease the likelihood. In fact we have the slightly stronger result

    log p(O|Θ^{t+1}) − log p(O|Θ^t) ≥ Q(Θ^{t+1}|Θ^t) − Q(Θ^t|Θ^t),

that the increase in the log-likelihood is at least as big as the increase in the bound.

Does this imply convergence?
Yes, if the likelihood is bounded above.

Does this imply convergence to a stationary point?
No, although many papers imply that it does.
We could have a maximum of 3 but objective values of 1, 1 + 1/2, 1 + 1/2 + 1/4, . . . :
the algorithm might just asymptotically make less and less progress.

Almost nothing is known about the rate of convergence.
Bound on Progress of Expectation Maximization

The iterations of the EM algorithm satisfy

    log p(O|Θ^{t+1}) − log p(O|Θ^t) ≥ Q(Θ^{t+1}|Θ^t) − Q(Θ^t|Θ^t).

Key idea of the proof: −log(z) is a convex function, and for a convex f we have

    f ( ∑_i α_i z_i ) ≤ ∑_i α_i f(z_i),   for α_i ≥ 0 and ∑_i α_i = 1.

This generalizes f(αz_1 + (1−α)z_2) ≤ αf(z_1) + (1−α)f(z_2) to arbitrary convex combinations.

Proof:

    −log p(O|Θ) = −log ( ∑_H p(O,H|Θ) )
                = −log ( ∑_H α_H p(O,H|Θ)/α_H )
                ≤ −∑_H α_H log ( p(O,H|Θ)/α_H ).
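Jensen's inequality for −log can be checked numerically; the weights and values below are arbitrary example numbers.

```python
import numpy as np

# f(z) = -log(z) is convex, so f(sum_i alpha_i z_i) <= sum_i alpha_i f(z_i)
# for any convex combination (arbitrary example values).
alpha = np.array([0.2, 0.5, 0.3])   # alpha_i >= 0, summing to 1
z = np.array([0.1, 1.0, 4.0])       # z_i > 0

lhs = -np.log((alpha * z).sum())    # f of the convex combination
rhs = (alpha * -np.log(z)).sum()    # convex combination of f values
assert lhs <= rhs
```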
Bound on Progress of Expectation Maximization

The iterations of the EM algorithm satisfy

    log p(O|Θ^{t+1}) − log p(O|Θ^t) ≥ Q(Θ^{t+1}|Θ^t) − Q(Θ^t|Θ^t).

Continuing, and using α_H = p(H|O,Θ^t), we have

    −log p(O|Θ) ≤ −∑_H α_H log ( p(O,H|Θ)/α_H )
                = −∑_H α_H log p(O,H|Θ) + ∑_H α_H log α_H
                = −Q(Θ|Θ^t) − entropy(α),

where the first sum is −Q(Θ|Θ^t) and the second is the negative entropy of α.
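For a tiny discrete model we can verify this bound numerically, and see that it holds with equality when α_H is the posterior p(H|O,Θ); the joint probabilities below are toy numbers, not from the slides.

```python
import numpy as np

# Toy joint distribution p(O, H | Theta) for one observed value O and
# three possible hidden values H (arbitrary numbers).
p_joint = np.array([0.05, 0.15, 0.10])   # p(O, H = h | Theta), h = 1, 2, 3
p_obs = p_joint.sum()                    # p(O | Theta), by marginalization

def bound(alpha):
    # -sum_H alpha_H log(p(O, H | Theta) / alpha_H)
    return -(alpha * np.log(p_joint / alpha)).sum()

# The bound holds for any weights alpha that sum to one...
alpha_any = np.array([0.6, 0.3, 0.1])
assert -np.log(p_obs) <= bound(alpha_any)

# ...and is tight when alpha is the posterior p(H | O, Theta).
alpha_post = p_joint / p_obs
assert np.isclose(-np.log(p_obs), bound(alpha_post))
```

The tight case is exactly the identity log p(O|Θ^t) = Q(Θ^t|Θ^t) + entropy(α) used in the next step of the proof.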
We've bounded log p(O|Θ^{t+1}); now we need to bound log p(O|Θ^t). We have

    p(O|Θ^t) = p(O,H|Θ^t)/p(H|O,Θ^t)   (using p(a|b) = p(a,b)/p(b)).

Taking the expectation of the logarithm of both sides gives

    E_{α_H}[log p(O|Θ^t)] = E_{α_H}[log p(O,H|Θ^t) − log p(H|O,Θ^t)]
    ∑_H α_H log p(O|Θ^t) = ∑_H α_H log p(O,H|Θ^t) − ∑_H α_H log p(H|O,Θ^t).

And using the definition of α_H (so that ∑_H α_H = 1) we have

    log p(O|Θ^t) = Q(Θ^t|Θ^t) + entropy(α).
Thus we have the two bounds

    log p(O|Θ) ≥ Q(Θ|Θ^t) + entropy(α),
    log p(O|Θ^t) = Q(Θ^t|Θ^t) + entropy(α).

Using Θ = Θ^{t+1} and subtracting the second from the first gives the result.

Notes:
The bound says we can choose any Θ that increases Q over Θ^t, so approximate M-steps are ok.
The entropy of the hidden variables gives the tightness of the bound Q:
low entropy (hidden values are predictable) means the EM bound is tight;
high entropy (hidden values are unpredictable) means the EM bound is loose.
Expectation Maximization for Mixture of Gaussians

The classic mixture of Gaussians model uses a PDF of the form

    p(x^i|Θ) = ∑_{c=1}^k p(z^i = c|Θ) p(x^i|z^i = c, Θ),

where each mixture component is a multivariate Gaussian,

    p(x^i|z^i = c, Θ) = 1/((2π)^{d/2} |Σ_c|^{1/2}) exp ( −(1/2)(x^i − µ_c)^T Σ_c^{-1} (x^i − µ_c) ),

and we model the mixture probabilities as categorical,

    p(z^i = c|Θ) = θ_c.

Finding the optimal parameters Θ = {θ_c, µ_c, Σ_c}_{c=1}^k is NP-hard.
But the EM updates for improving the parameters use the analytic form of the Gaussian MLE.
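The mixture density above can be evaluated directly; here is a numpy-only sketch with invented parameters (a hypothetical 2-component model in d = 2).

```python
import numpy as np

def gmm_pdf(x, theta, mus, Sigmas):
    """Evaluate p(x | Theta) = sum_c theta_c * N(x; mu_c, Sigma_c)."""
    d = x.shape[0]
    total = 0.0
    for th, mu, S in zip(theta, mus, Sigmas):
        diff = x - mu
        norm = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(S))
        total += th * np.exp(-0.5 * diff @ np.linalg.solve(S, diff)) / norm
    return total

# Hypothetical 2-component model in two dimensions.
theta = [0.4, 0.6]
mus = [np.zeros(2), np.array([3.0, 3.0])]
Sigmas = [np.eye(2), 0.5 * np.eye(2)]
p = gmm_pdf(np.array([0.0, 0.0]), theta, mus, Sigmas)
```

Using `np.linalg.solve` avoids forming Σ_c^{-1} explicitly; in practice one would work in log space for numerical stability.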
Expectation Maximization for Mixture of Gaussians

The weights from the E-step are the responsibilities,

    r_c^i ≜ p(z^i = c|x^i, Θ^t) = p(x^i|z^i = c, Θ^t) p(z^i = c|Θ^t) / ∑_{c′=1}^k p(x^i|z^i = c′, Θ^t) p(z^i = c′|Θ^t).

The weighted MLE in the M-step is given by

    θ_c^{t+1} = (1/n) ∑_{i=1}^n r_c^i   (proportion of examples soft-assigned to cluster c),
    µ_c^{t+1} = ∑_{i=1}^n r_c^i x^i / (n θ_c^{t+1})   (mean of examples soft-assigned to c),
    Σ_c^{t+1} = ∑_{i=1}^n r_c^i (x^i − µ_c^{t+1})(x^i − µ_c^{t+1})^T / (n θ_c^{t+1})   (covariance of examples soft-assigned to c).

The derivation is tedious; I'll put a note on the webpage.
It uses the distributive law, probabilities summing to one, a Lagrangian, and the weighted Gaussian MLE.

This is k-means if the covariances are constant and we set r_c^i = 1 for the most likely cluster.
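Putting the E-step and M-step together, a compact numpy-only EM loop for a mixture of Gaussians might look like the following; the data are synthetic, the log-sum-exp trick is used for stability, and there are no safeguards against degenerate clusters.

```python
import numpy as np

rng = np.random.default_rng(2)
# Synthetic data: two well-separated Gaussian clusters in d = 2.
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(5, 1, (100, 2))])
n, d = X.shape
k = 2

theta = np.full(k, 1.0 / k)                      # mixture probabilities
mu = X[rng.choice(n, k, replace=False)].copy()   # means from random points
Sigma = np.array([np.eye(d) for _ in range(k)])  # covariances

def log_gauss(X, m, S):
    # log N(x^i; m, S) for every row x^i of X.
    diff = X - m
    sol = np.linalg.solve(S, diff.T).T
    return -0.5 * (d * np.log(2 * np.pi) + np.log(np.linalg.det(S))
                   + (diff * sol).sum(axis=1))

nlls = []
for _ in range(30):
    # E-step: responsibilities r_c^i via log p(x^i, z^i = c | Theta^t).
    logp = np.stack([np.log(theta[c]) + log_gauss(X, mu[c], Sigma[c])
                     for c in range(k)], axis=1)
    mx = logp.max(axis=1, keepdims=True)
    log_px = mx[:, 0] + np.log(np.exp(logp - mx).sum(axis=1))  # log p(x^i)
    nlls.append(-log_px.sum())                                 # track the NLL
    r = np.exp(logp - log_px[:, None])
    # M-step: weighted Gaussian MLE.
    nc = r.sum(axis=0)                  # n * theta_c^{t+1}
    theta = nc / n
    mu = (r.T @ X) / nc[:, None]
    for c in range(k):
        diff = X - mu[c]
        Sigma[c] = (r[:, c, None] * diff).T @ diff / nc[c]

# EM never increases the NLL (up to numerical tolerance).
assert all(b <= a + 1e-8 for a, b in zip(nlls, nlls[1:]))
```

The final assertion checks the monotonicity guarantee from the convergence section: each iteration's observed-data NLL is no larger than the previous one.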
Expectation Maximization for Mixture of Gaussians

EM for fitting a mixture of Gaussians in action:
https://www.youtube.com/watch?v=B36fzChfyGU
Discussion of EM for Mixtures of Gaussians

EM and Gaussian mixtures are used in a ton of applications; they are one of the default unsupervised learning methods.

EM usually doesn't reach the global optimum.
Restart the algorithm from different initializations.

The MLE for some clusters may not exist (e.g., a cluster responsible for only one point).
Use MAP estimates or remove these clusters.

How do you choose the number of mixtures k?
Use cross-validation or other model selection criteria.

Can you make it robust?
Use a mixture of Laplace or Student's t distributions.

Are there alternatives to EM?
We could use gradient descent on the NLL.
Spectral and other recent methods have some global guarantees.
A Non-Parametric Mixture Model

The classic parametric mixture model has the form

p(x) = \sum_{c=1}^{k} p(z = c)\, p(x \mid z = c).

A natural way to define a non-parametric mixture model is

p(x) = \sum_{i=1}^{n} p(z = i)\, p(x \mid z = i),

where we have one mixture for every training example i.
Common example: z is uniform and x|z is Gaussian with mean x_i,

p(x) = \frac{1}{n} \sum_{i=1}^{n} \mathcal{N}(x \mid x_i, \sigma^2 I),

and we use a shared covariance \sigma^2 I (and \sigma estimated by cross-validation).
A special case of kernel density estimation or Parzen window.
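As a sketch of how \sigma can be estimated by cross-validation (the slides don't fix a protocol; the train/test split, grid, and function names here are our assumptions), pick the \sigma maximizing held-out log-likelihood under the place-a-Gaussian-at-every-point model:

```python
import numpy as np

def heldout_loglik(X_train, X_test, sigma):
    # Mean held-out log-density under p(x) = (1/n) sum_i N(x | x_i, sigma^2 I)
    d = X_train.shape[1]
    sq = ((X_test[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=-1)
    logk = -sq / (2 * sigma ** 2) - d * np.log(sigma) - 0.5 * d * np.log(2 * np.pi)
    m = logk.max(axis=1, keepdims=True)          # log-sum-exp for stability
    return (m[:, 0] + np.log(np.exp(logk - m).mean(axis=1))).mean()

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(300, 2))
train, test = X[:200], X[200:]
grid = [0.05, 0.1, 0.25, 0.5, 1.0, 2.0]
sigma = max(grid, key=lambda s: heldout_loglik(train, test, s))
```

Too-small \sigma overfits (test points fall between the bumps), too-large \sigma oversmooths; the held-out likelihood trades these off.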
Parzen Window vs. Gaussian and Mixture of Gaussian
https://en.wikipedia.org/wiki/Kernel_density_estimation
Kernel Density Estimation

The 1D kernel density estimation (KDE) model uses

p(x) = \frac{1}{n} \sum_{i=1}^{n} k_h(x - x_i),

where the PDF k is the "kernel" and the parameter h is the "bandwidth".
In the previous slide we used the (normalized) Gaussian kernel,

k_1(x) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{x^2}{2}\right), \quad k_h(x) = \frac{1}{h\sqrt{2\pi}} \exp\left(-\frac{x^2}{2h^2}\right).

Note that we can add a bandwidth h to any PDF k_1, using

k_h(x) = \frac{1}{h}\, k_1\!\left(\frac{x}{h}\right),

which follows from the change of variables formula for probabilities.
Under common choices of kernels, KDEs can model any density.
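The 1D Gaussian KDE above is a few lines of NumPy (a minimal sketch; the bimodal test data and names are ours). Because each bump k_h is itself a PDF, the estimate integrates to one:

```python
import numpy as np

def kde(x_eval, data, h):
    # p(x) = (1/n) sum_i k_h(x - x_i), with Gaussian k_h(x) = (1/h) k_1(x/h)
    z = (x_eval[:, None] - data[None, :]) / h
    k = np.exp(-0.5 * z ** 2) / (h * np.sqrt(2 * np.pi))
    return k.mean(axis=1)

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(-2.0, 0.5, 100), rng.normal(2.0, 0.5, 100)])
grid = np.linspace(-6.0, 6.0, 1201)
p = kde(grid, data, h=0.3)
mass = p.sum() * (grid[1] - grid[0])   # Riemann sum; close to 1 since k_h is a PDF
```

A single Gaussian would put its mode between the two clusters; the KDE instead recovers the bimodal shape, matching the Parzen-window comparison on the previous slides.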
Efficient Kernel Density Estimation

KDE with the Gaussian kernel is slow at test time:
We need to compute the distance of the test point to every training point.
A common alternative is the Epanechnikov kernel,

k_1(x) = \frac{3}{4}\left(1 - x^2\right) I[|x| \leq 1].

This kernel has two nice properties:
Epanechnikov showed that it is asymptotically optimal in terms of squared error.
It can be much faster to use since it only depends on nearby points.
You can use fast methods for computing nearest neighbours.
The kernel is non-smooth at the boundaries, but many smooth approximations exist:
Quartic, triweight, tricube, cosine, etc.
https://en.wikipedia.org/wiki/Kernel_%28statistics%29
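A sketch of the Epanechnikov kernel and its compact support (illustrative names and data): points farther than h from every sample contribute exactly zero, which is what makes nearest-neighbour speedups possible.

```python
import numpy as np

def epan_kde(x_eval, data, h):
    # k_1(u) = (3/4)(1 - u^2) on |u| <= 1, zero elsewhere; k_h(u) = (1/h) k_1(u/h)
    u = (x_eval[:, None] - data[None, :]) / h
    k = np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u ** 2), 0.0) / h
    return k.mean(axis=1)

rng = np.random.default_rng(0)
data = rng.normal(0.0, 1.0, 500)
grid = np.linspace(-5.0, 5.0, 1001)
p = epan_kde(grid, data, h=0.5)
mass = p.sum() * (grid[1] - grid[0])   # still a PDF: integrates to about 1
# Compact support: the density is exactly zero farther than h from every sample,
# unlike the Gaussian kernel, which is positive everywhere.
far = epan_kde(np.array([data.max() + 1.0]), data, h=0.5)[0]
```

In a real implementation you would only sum over the training points within distance h of the query (e.g., found with a k-d tree), rather than over all n points as above.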
Multivariate Kernel Density Estimation

The general kernel density estimation (KDE) model uses

p(x) = \frac{1}{n} \sum_{i=1}^{n} k_\Sigma(x - x_i).

The most common kernel is again the Gaussian,

k_I(x) = \frac{1}{(2\pi)^{d/2}} \exp\left(-\frac{\|x\|^2}{2}\right).

By the multivariate change of variables formula we can add a bandwidth matrix H using

k_H(x) = \frac{1}{|H|}\, k_1(H^{-1}x) \quad \left(\text{generalizes } k_h(x) = \frac{1}{h}\, k_1\!\left(\frac{x}{h}\right)\right).

We get a multivariate Gaussian by using H = \Sigma^{1/2}.
To reduce the number of parameters, we typically:
Use a product of independent distributions and use H = hI for some h.
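The bandwidth-matrix formula can be sketched directly (a minimal check under our own names and test data): with H = hI it reduces to the isotropic Gaussian KDE with covariance h^2 I.

```python
import numpy as np

def gauss_k1(Z):
    # Standard multivariate Gaussian: k_1(z) = (2*pi)^(-d/2) exp(-||z||^2 / 2)
    d = Z.shape[1]
    return np.exp(-0.5 * (Z ** 2).sum(axis=1)) / (2 * np.pi) ** (d / 2)

def kde_H(x, data, H):
    # p(x) = (1/n) sum_i k_H(x - x_i), with k_H(u) = (1/|H|) k_1(H^{-1} u)
    Z = (x - data) @ np.linalg.inv(H).T
    return gauss_k1(Z).mean() / np.linalg.det(H)

rng = np.random.default_rng(0)
data = rng.normal(0.0, 1.0, size=(50, 2))
x = np.array([0.1, -0.2])
h = 0.5
p_H = kde_H(x, data, h * np.eye(2))
# Isotropic check: H = hI should give a Gaussian with covariance h^2 I.
sq = ((x - data) ** 2).sum(axis=1)
p_iso = (np.exp(-sq / (2 * h ** 2)) / (2 * np.pi * h ** 2)).mean()
```

A non-diagonal H would stretch the bumps along correlated directions, but as the slide notes, in practice one usually just tunes the single scalar h.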
Summary

Semi-supervised learning: Learning with labeled and unlabeled data.
Expectation maximization: Optimization with hidden variables, when knowing the hidden variables makes the problem easy.
Monotonicity of EM: EM is guaranteed not to decrease the likelihood.
Kernel density estimation: Non-parametric continuous density estimation method.
Next time: Probabilistic PCA and factor analysis.