TRANSCRIPT
Teacher: Gianni A. Di Caro
Lecture 29: Latent Variable Models, Gaussian Mixture Models, Expectation-Maximization
Introduction to Machine Learning, 10-315, Fall '19
Disclaimer: These slides can include material from different sources. I'll be happy to explicitly acknowledge a source if required. Contact me for requests.
All the following slides are from Piyush Rai (IIT)
Latent Variable Models
A Simple Generative Model
All observations {x1, . . . , xN} generated from a distribution p(x |θ)
Unknowns: Parameters θ of the assumed data distribution p(x |θ)
Many ways to estimate the parameters (MLE, MAP, or Bayesian inference)
Generative Model with Latent Variables
Assume each observation xn to be associated with a latent variable zn
In this "latent variable model" of data, the data x also depend on some latent variable(s) z
zn is akin to a latent representation or "encoding" of xn; it controls what the data "looks like". E.g.,
zn ∈ {1, . . . ,K} denotes the cluster xn belongs to
zn ∈ RK denotes a low-dimensional latent representation or latent “code” for xn
Unknowns: {z1, . . . , zN}, and (θ, φ). zn’s called “local” variables; (θ, φ) called “global” variables
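As a concrete illustration (not from the original slides), here is a minimal numpy sketch of how data could be generated under such a model when zn is a cluster indicator; all parameter values below are made up:

import numpy as np

rng = np.random.default_rng(0)

# Illustrative "global" parameters: phi = mixing weights, theta = per-cluster Gaussians
pi = np.array([0.5, 0.3, 0.2])                                    # p(z_n = k | phi)
mus = np.array([[-3.0, 0.0], [0.0, 3.0], [3.0, 0.0]])             # cluster means
Sigmas = np.array([0.5 * np.eye(2), np.eye(2), 0.8 * np.eye(2)])  # cluster covariances

N, K = 500, len(pi)
z = rng.choice(K, size=N, p=pi)                                   # "local" latent variable z_n
X = np.stack([rng.multivariate_normal(mus[k], Sigmas[k]) for k in z])  # x_n | z_n = k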
Brief Detour/Recap: Gaussian Parameter Estimation
MLE for Multivariate Gaussian
Multivariate Gaussian in D dimensions
p(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{D/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu)^\top \Sigma^{-1} (x - \mu) \right)
Goal: Given N i.i.d. observations \{x_n\}_{n=1}^{N} from this Gaussian, estimate parameters µ and Σ
MLE for the D × 1 mean µ ∈ R^D and the D × D p.s.d. covariance matrix Σ
\mu = \frac{1}{N} \sum_{n=1}^{N} x_n \quad \text{and} \quad \Sigma = \frac{1}{N} \sum_{n=1}^{N} (x_n - \mu)(x_n - \mu)^\top
Note: Σ depends on µ, but µ doesn’t depend on Σ ⇒ no need for alternating opt.
Note: log works nicely with exp of the Gaussian. Simplifies MLE expressions in this case
In general, when the distribution is an exponential family distribution, MLE is usually very easy
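As a quick illustration (not part of the original slides), these two closed-form estimates are essentially one line each in numpy:

import numpy as np

def gaussian_mle(X):
    """MLE for a multivariate Gaussian given an (N, D) data matrix X."""
    mu = X.mean(axis=0)                   # sample mean
    diffs = X - mu
    Sigma = diffs.T @ diffs / X.shape[0]  # MLE covariance (divides by N, not N-1)
    return mu, Sigma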
Brief Detour: Exponential Family Distributions
An exponential family distribution is of the form
p(x \mid \theta) = h(x) \exp\left[ \theta^\top \phi(x) - A(\theta) \right]
θ is called the natural parameter of the family
h(x), φ(x), and A(θ) are known functions. Note: Don’t confuse φ with kernel mappings!
φ(x) is called the sufficient statistics: knowing this is sufficient to estimate θ
Every exp. family distribution also has a conjugate distribution (often also in exp. family)
Many other nice properties (especially useful in Bayesian inference)
Also, MLE/MAP is usually quite simple (note that log p(x |θ) will typically have a simple form)
Many well-known distributions (Bernoulli, binomial, multinoulli, beta, gamma, Gaussian, etc.) are exponential family distributions
https://en.wikipedia.org/wiki/Exponential_family
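As a concrete worked example (not on the original slide), the Bernoulli distribution with mean µ can be written in exactly this form:

p(x \mid \mu) = \mu^{x} (1-\mu)^{1-x} = \exp\!\left[ x \log\tfrac{\mu}{1-\mu} + \log(1-\mu) \right]

so that h(x) = 1, \phi(x) = x, \theta = \log\frac{\mu}{1-\mu} (the natural parameter), and A(\theta) = \log(1 + e^{\theta}).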
MLE for Generative Classification with Gaussian Class-conditionals
Each class k modeled using a Gaussian with mean µk and covariance matrix Σk
Note: Can assume label yn to be one-hot and then ynk = 1 if yn = k, and ynk = 0 otherwise
Assuming p(yn = k) = πk, this model has parameters Θ = \{\pi_k, \mu_k, \Sigma_k\}_{k=1}^{K}
(We have done this before) Given \{x_n, y_n\}_{n=1}^{N}, MLE for Θ will be
\pi_k = \frac{1}{N} \sum_{n=1}^{N} y_{nk} = \frac{N_k}{N}, \qquad \mu_k = \frac{1}{N_k} \sum_{n=1}^{N} y_{nk}\, x_n, \qquad \Sigma_k = \frac{1}{N_k} \sum_{n=1}^{N} y_{nk} (x_n - \mu_k)(x_n - \mu_k)^\top
Basically estimating K Gaussians instead of just 1 (each using data only from that class)
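A minimal numpy sketch of these per-class estimates (illustrative; the function name and interface are my own):

import numpy as np

def class_conditional_mle(X, y, K):
    """MLE for {pi_k, mu_k, Sigma_k} from fully labelled data.
    X: (N, D) inputs; y: (N,) integer labels in {0, ..., K-1}."""
    N, D = X.shape
    pis = np.zeros(K)
    mus = np.zeros((K, D))
    Sigmas = np.zeros((K, D, D))
    for k in range(K):
        Xk = X[y == k]                    # only the data from class k
        Nk = Xk.shape[0]
        pis[k] = Nk / N                   # pi_k = N_k / N
        mus[k] = Xk.mean(axis=0)          # mu_k
        diffs = Xk - mus[k]
        Sigmas[k] = diffs.T @ diffs / Nk  # Sigma_k
    return pis, mus, Sigmas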
MLE for Generative Classification with Gaussian Class-conditionals
Let’s look at the “formal” procedure of deriving MLE in this case
MLE for Θ = \{\pi_k, \mu_k, \Sigma_k\}_{k=1}^{K} in this case can be written as (assuming i.i.d. data)
\hat{\Theta} = \arg\max_{\Theta} p(X, y \mid \Theta) = \arg\max_{\Theta} \prod_{n=1}^{N} p(x_n, y_n \mid \Theta) = \arg\max_{\Theta} \prod_{n=1}^{N} p(y_n \mid \Theta)\, p(x_n \mid y_n, \Theta)
= \arg\max_{\Theta} \prod_{n=1}^{N} \prod_{k=1}^{K} \left[ p(y_n = k \mid \Theta)\, p(x_n \mid y_n = k, \Theta) \right]^{y_{nk}}
= \arg\max_{\Theta} \log \prod_{n=1}^{N} \prod_{k=1}^{K} \left[ p(y_n = k \mid \Theta)\, p(x_n \mid y_n = k, \Theta) \right]^{y_{nk}}
= \arg\max_{\Theta} \sum_{n=1}^{N} \sum_{k=1}^{K} y_{nk} \left[ \log p(y_n = k \mid \Theta) + \log p(x_n \mid y_n = k, \Theta) \right]
= \arg\max_{\Theta} \sum_{n=1}^{N} \sum_{k=1}^{K} y_{nk} \left[ \log \pi_k + \log \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \right]
Given (X, y), optimizing it w.r.t. πk , µk ,Σk will give us the solution we saw on the previous slide
MLE When Labels Go Missing..
So the MLE problem for generative classification with Gaussian class-conditionals was
\hat{\Theta} = \arg\max_{\Theta} \log p(X, y \mid \Theta) = \arg\max_{\Theta} \sum_{n=1}^{N} \sum_{k=1}^{K} y_{nk} \left[ \log \pi_k + \log \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \right]
This problem has a nice separable structure, and a straightforward solution as we saw
What if we don’t know the label yn (i.e., don’t know if ynk is 0 or 1)? How to estimate Θ now?
When might we need to solve such a problem?
Mixture density estimation: Given N inputs x1, . . . , xN , model p(x) as a mixture of distributions
Probabilistic clustering: Same as density estimation; can get cluster ids once Θ is estimated
Semi-supervised generative classification: In training data, some yn’s are known, some not known
MLE When Labels Go Missing..
Recall the MLE problem for Θ when the labels are known
\hat{\Theta} = \arg\max_{\Theta} \sum_{n=1}^{N} \sum_{k=1}^{K} y_{nk} \left[ \log \pi_k + \log \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \right]
Will estimating Θ via MLE be as easy if the yn's are unknown? We only have X = \{x_1, \ldots, x_N\}. The MLE problem for Θ = \{\pi_k, \mu_k, \Sigma_k\}_{k=1}^{K} in this case would be (assuming i.i.d. data)
\hat{\Theta} = \arg\max_{\Theta} \log p(X \mid \Theta) = \arg\max_{\Theta} \log \prod_{n=1}^{N} p(x_n \mid \Theta) = \arg\max_{\Theta} \sum_{n=1}^{N} \log p(x_n \mid \Theta)
Computing each likelihood p(xn|Θ) in this case requires summing over all possible values of yn
p(x_n \mid \Theta) = \sum_{k=1}^{K} p(x_n, y_n = k \mid \Theta) = \sum_{k=1}^{K} p(y_n = k \mid \Theta)\, p(x_n \mid y_n = k, \Theta) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)
The MLE problem for Θ when the labels are unknown
\hat{\Theta} = \arg\max_{\Theta} \sum_{n=1}^{N} \log \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)
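For completeness, here is a sketch (not from the slides) of how this log-likelihood objective can be evaluated stably in log space; it assumes scipy is available:

import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def gmm_log_likelihood(X, pis, mus, Sigmas):
    """Evaluates sum_n log sum_k pi_k N(x_n | mu_k, Sigma_k)."""
    # log_joint[n, k] = log pi_k + log N(x_n | mu_k, Sigma_k)
    log_joint = np.stack(
        [np.log(pis[k]) + multivariate_normal.logpdf(X, mus[k], Sigmas[k])
         for k in range(len(pis))], axis=1)
    return logsumexp(log_joint, axis=1).sum()  # log-sum-exp over k, then sum over n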
MLE When Labels Go Missing..
So we saw that the MLE problem for Θ when the labels are unknown
\hat{\Theta} = \arg\max_{\Theta} \sum_{n=1}^{N} \log \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)
Solving this would enable us to learn a Gaussian Mixture Model (GMM)
Note: The Gaussian can be replaced by other distributions too (e.g., Poisson mixture model)
A small issue now: Log can’t go inside the summation. Expressions won’t be simple anymore
Note: Can still take (partial) derivatives and do GD/SGD etc., but these are iterative methods
Recall that we didn’t need GD/SGD etc when doing MLE with fully observed yn’s
One workaround: Can try doing alternating optimization
MLE for Gaussian Mixture Model using ALT-OPT
Based on the fact that MLE is simple when labels are known
Notation change: We will now use zn instead of yn and znk instead of ynk
MLE for Gaussian Mixture Model using ALT-OPT
1 Initialize Θ as \hat{\Theta}
2 For n = 1, . . . , N, find the best z_n:
\hat{z}_n = \arg\max_{k \in \{1,\ldots,K\}} p(x_n, z_n = k \mid \hat{\Theta}) = \arg\max_{k \in \{1,\ldots,K\}} p(z_n = k \mid x_n, \hat{\Theta})
3 Given \hat{Z} = \{\hat{z}_1, \ldots, \hat{z}_N\}, re-estimate Θ using MLE:
\hat{\Theta} = \arg\max_{\Theta} \sum_{n=1}^{N} \sum_{k=1}^{K} \hat{z}_{nk} \left[ \log \pi_k + \log \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \right]
4 Go to step 2 if not yet converged
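A compact numpy/scipy sketch of this ALT-OPT procedure (illustrative; the initialization, the fixed iteration budget in place of a convergence test, and the small covariance ridge are my own choices):

import numpy as np
from scipy.stats import multivariate_normal

def gmm_alt_opt(X, K, n_iters=50, seed=0):
    """Hard-assignment alternating optimization for a GMM."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    # Step 1: initialize Theta-hat
    pis = np.full(K, 1.0 / K)
    mus = X[rng.choice(N, size=K, replace=False)]
    Sigmas = np.stack([np.eye(D)] * K)
    for _ in range(n_iters):              # Step 4 replaced by a fixed iteration budget
        # Step 2: z_n = argmax_k p(z_n = k | x_n, Theta-hat)
        log_post = np.stack(
            [np.log(pis[k]) + multivariate_normal.logpdf(X, mus[k], Sigmas[k])
             for k in range(K)], axis=1)
        z = log_post.argmax(axis=1)
        # Step 3: re-estimate Theta by MLE on the "completed" data (X, z)
        for k in range(K):
            Xk = X[z == k]
            if len(Xk) == 0:              # guard against empty clusters
                continue
            pis[k] = len(Xk) / N
            mus[k] = Xk.mean(axis=0)
            diffs = Xk - mus[k]
            Sigmas[k] = diffs.T @ diffs / len(Xk) + 1e-6 * np.eye(D)
    return z, pis, mus, Sigmas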
Is ALT-OPT Doing The Correct Thing?
Our original problem was
\hat{\Theta} = \arg\max_{\Theta} \sum_{n=1}^{N} \log \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)
What ALT-OPT did was the following
\hat{\Theta} = \arg\max_{\Theta} \sum_{n=1}^{N} \sum_{k=1}^{K} \hat{z}_{nk} \left[ \log \pi_k + \log \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \right]
We clearly aren’t solving the original problem!
\arg\max_{\Theta} \log p(X \mid \Theta) \quad \text{vs} \quad \arg\max_{\Theta} \log p(X, \hat{Z} \mid \Theta)
Also, we updated zn as follows
\hat{z}_n = \arg\max_{k \in \{1,\ldots,K\}} p(z_n = k \mid x_n, \hat{\Theta})
Why choose zn to be this (makes intuitive sense, but is there a formal justification)?
It turns out (as we will see) that this ALT-OPT is an approximation of the Expectation-Maximization (EM) algorithm for GMM
Why Estimation is Difficult in LVMs?
Suppose we want to estimate parameters Θ. If we knew both xn and zn then we could do
\Theta_{\text{MLE}} = \arg\max_{\Theta} \sum_{n=1}^{N} \log p(x_n, z_n \mid \Theta) = \arg\max_{\Theta} \sum_{n=1}^{N} \left[ \log p(z_n \mid \phi) + \log p(x_n \mid z_n, \theta) \right]
Simple to solve (usually closed form) if p(zn|φ) and p(xn|zn, θ) are “simple” (e.g., exp-fam. dist.)
However, in LVMs where zn is “hidden”, the MLE problem will be the following
\Theta_{\text{MLE}} = \arg\max_{\Theta} \sum_{n=1}^{N} \log p(x_n \mid \Theta) = \arg\max_{\Theta} \log p(X \mid \Theta)
The form of p(xn|Θ) may not be simple since we need to sum over unknown zn’s possible values
p(x_n \mid \Theta) = \sum_{z_n} p(x_n, z_n \mid \Theta) \quad \text{... or, if } z_n \text{ is continuous:} \quad p(x_n \mid \Theta) = \int p(x_n, z_n \mid \Theta)\, dz_n
The summation/integral may be intractable and may lead to complex expressions for p(xn|Θ); in fact, the result is almost never an exponential family distribution. MLE for Θ won't have closed-form solutions!
An Important Identity
Define pz = p(Z|X,Θ) and let q(Z) be some distribution over Z
Assuming discrete Z, the identity below holds for any choice of the distribution q(Z)
log p(X|Θ) = L(q,Θ) + KL(q||pz)
\mathcal{L}(q, \Theta) = \sum_{Z} q(Z) \log \left\{ \frac{p(X, Z \mid \Theta)}{q(Z)} \right\}, \qquad \mathrm{KL}(q \,\|\, p_z) = -\sum_{Z} q(Z) \log \left\{ \frac{p(Z \mid X, \Theta)}{q(Z)} \right\}
(Exercise: Verify the above identity)
Since KL(q||pz) ≥ 0, L(q,Θ) is a lower-bound on log p(X|Θ)
log p(X|Θ) ≥ L(q,Θ)
Maximizing L(q,Θ) will also improve log p(X|Θ). Also, as we’ll see, it’s easier to maximize L(q,Θ)
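The suggested exercise can also be checked numerically; here is a tiny sketch (my own construction) with a single observation and one discrete latent variable:

import numpy as np

rng = np.random.default_rng(0)
K = 4

joint = rng.random(K)                  # p(X, Z = k | Theta) for one fixed observed X (arbitrary positive values)
log_px = np.log(joint.sum())           # log p(X | Theta): marginalize over Z
p_z = joint / joint.sum()              # p_z = p(Z | X, Theta)

q = rng.random(K)
q /= q.sum()                           # any distribution q(Z)

L_q = np.sum(q * np.log(joint / q))    # L(q, Theta) = sum_Z q(Z) log [p(X, Z | Theta) / q(Z)]
KL = -np.sum(q * np.log(p_z / q))      # KL(q || p_z)

print(np.isclose(log_px, L_q + KL))    # True: the identity holds for this (and any) q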
Maximizing L(q,Θ)
Note that L(q,Θ) depends on two things: q(Z) and Θ. Let's do ALT-OPT for these
First recall the identity we had: log p(X|Θ) = L(q,Θ) + KL(q||pz) with
\mathcal{L}(q, \Theta) = \sum_{Z} q(Z) \log \left\{ \frac{p(X, Z \mid \Theta)}{q(Z)} \right\} \quad \text{and} \quad \mathrm{KL}(q \,\|\, p_z) = -\sum_{Z} q(Z) \log \left\{ \frac{p(Z \mid X, \Theta)}{q(Z)} \right\}
Maximize L w.r.t. q with Θ fixed at Θ^old: since log p(X|Θ) will be a constant in this case,
\hat{q} = \arg\max_{q} \mathcal{L}(q, \Theta^{old}) = \arg\min_{q} \mathrm{KL}(q \,\|\, p_z) = p_z = p(Z \mid X, \Theta^{old})
Maximize L w.r.t. Θ with q fixed at \hat{q} = p(Z \mid X, \Theta^{old}):
\Theta^{new} = \arg\max_{\Theta} \mathcal{L}(\hat{q}, \Theta) = \arg\max_{\Theta} \sum_{Z} p(Z \mid X, \Theta^{old}) \log \frac{p(X, Z \mid \Theta)}{p(Z \mid X, \Theta^{old})} = \arg\max_{\Theta} \sum_{Z} p(Z \mid X, \Theta^{old}) \log p(X, Z \mid \Theta)
Therefore, \Theta^{new} = \arg\max_{\Theta} Q(\Theta, \Theta^{old}), where Q(\Theta, \Theta^{old}) = \mathbb{E}_{p(Z \mid X, \Theta^{old})}\left[ \log p(X, Z \mid \Theta) \right] is known as the expected complete-data log-likelihood (CLL)
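Putting the two steps together for the GMM gives the familiar EM updates; below is a hedged numpy/scipy sketch (the initialization, iteration budget, and covariance ridge are my own choices, and a convergence check is omitted):

import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def gmm_em(X, K, n_iters=100, seed=0):
    """EM for a Gaussian mixture: E step = responsibilities, M step = weighted MLE."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    pis = np.full(K, 1.0 / K)
    mus = X[rng.choice(N, size=K, replace=False)]
    Sigmas = np.stack([np.cov(X.T) + 1e-6 * np.eye(D)] * K)
    for _ in range(n_iters):
        # E step: gamma[n, k] = p(z_n = k | x_n, Theta_old), computed in log space
        log_joint = np.stack(
            [np.log(pis[k]) + multivariate_normal.logpdf(X, mus[k], Sigmas[k])
             for k in range(K)], axis=1)
        gamma = np.exp(log_joint - logsumexp(log_joint, axis=1, keepdims=True))
        # M step: maximize Q(Theta, Theta_old), i.e. responsibility-weighted MLE
        Nk = gamma.sum(axis=0)                    # "effective" counts per component
        pis = Nk / N
        mus = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diffs = X - mus[k]
            Sigmas[k] = (gamma[:, k, None] * diffs).T @ diffs / Nk[k] + 1e-6 * np.eye(D)
    return pis, mus, Sigmas, gamma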
What's Going On: A Visual Illustration..
Step 1: We set q = p(Z|X,Θold), so that L(q,Θ) touches log p(X|Θ) at Θold
Step 2: We maximize L(q,Θ) w.r.t. Θ (equivalent to maximizing Q(Θ,Θold))
[Figure: a sequence of frames alternating between "After updating q" and "After maximizing w.r.t. Θ", ending with "Local Maxima Found"]
What’s Going On: Another Illustration
The two-step alternating optimization scheme we saw can never decrease p(X|Θ) (a good thing)
To see this, consider both steps: (1) Optimize q given Θ = Θold; (2) Optimize Θ given this q
[Figure: two panels, one illustrating Step 1 and one illustrating Step 2]
Step 1 keeps Θ fixed, so p(X|Θ) obviously can't decrease (stays unchanged in this step)
Step 2 maximizes the lower bound L(q,Θ) w.r.t. Θ. Thus p(X|Θ) can't decrease!