TRANSCRIPT
Teacher: Gianni A. Di Caro
Lecture 29: Latent Variable Models, Gaussian Mixture Models, Expectation-Maximization
Introduction to Machine Learning, 10-315, Fall '19
Disclaimer: These slides can include material from different sources. I'll be happy to explicitly acknowledge a source if required. Contact me for requests.
All the following slides are from Piyush Rai (IIT)
Latent Variable Models
A Simple Generative Model
All observations {x1, . . . , xN} generated from a distribution p(x |θ)
Unknowns: Parameters θ of the assumed data distribution p(x |θ)
Many ways to estimate the parameters (MLE, MAP, or Bayesian inference)
Generative Model with Latent Variables
Assume each observation xn to be associated with a latent variable zn
In this "latent variable model" of data, the data x also depend on some latent variable(s) z
zn is akin to a latent representation or "encoding" of xn; it controls what the data "looks like". E.g.,
zn ∈ {1, . . . ,K} denotes the cluster xn belongs to
zn ∈ RK denotes a low-dimensional latent representation or latent “code” for xn
Unknowns: {z1, . . . , zN}, and (θ, φ). zn’s called “local” variables; (θ, φ) called “global” variables
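As a concrete illustration (not from the original slides), here is a minimal numpy sketch of how data could be generated under such a model when zn is a cluster indicator; all parameter values below are made up:

import numpy as np

rng = np.random.default_rng(0)

# Illustrative "global" parameters: phi = mixing weights, theta = per-cluster Gaussians
pi = np.array([0.5, 0.3, 0.2])                                    # p(z_n = k | phi)
mus = np.array([[-3.0, 0.0], [0.0, 3.0], [3.0, 0.0]])             # cluster means
Sigmas = np.array([0.5 * np.eye(2), np.eye(2), 0.8 * np.eye(2)])  # cluster covariances

N, K = 500, len(pi)
z = rng.choice(K, size=N, p=pi)                                   # "local" latent variable z_n
X = np.stack([rng.multivariate_normal(mus[k], Sigmas[k]) for k in z])  # x_n | z_n = k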
Brief Detour/Recap: Gaussian Parameter Estimation
MLE for Multivariate Gaussian
Multivariate Gaussian in D dimensions
p(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{D/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu)^\top \Sigma^{-1} (x - \mu) \right)
Goal: Given N i.i.d. observations \{x_n\}_{n=1}^{N} from this Gaussian, estimate parameters µ and Σ
MLE for the D × 1 mean µ ∈ R^D and the D × D p.s.d. covariance matrix Σ
\mu = \frac{1}{N} \sum_{n=1}^{N} x_n \quad \text{and} \quad \Sigma = \frac{1}{N} \sum_{n=1}^{N} (x_n - \mu)(x_n - \mu)^\top
Note: Σ depends on µ, but µ doesn’t depend on Σ ⇒ no need for alternating opt.
Note: log works nicely with exp of the Gaussian. Simplifies MLE expressions in this case
In general, when the distribution is an exponential family distribution, MLE is usually very easy
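As a quick illustration (not part of the original slides), these two closed-form estimates are essentially one line each in numpy:

import numpy as np

def gaussian_mle(X):
    """MLE for a multivariate Gaussian given an (N, D) data matrix X."""
    mu = X.mean(axis=0)                   # sample mean
    diffs = X - mu
    Sigma = diffs.T @ diffs / X.shape[0]  # MLE covariance (divides by N, not N-1)
    return mu, Sigma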
Brief Detour: Exponential Family Distributions
An exponential family distribution is of the form
p(x \mid \theta) = h(x) \exp\left[ \theta^\top \phi(x) - A(\theta) \right]
θ is called the natural parameter of the family
h(x), φ(x), and A(θ) are known functions. Note: Don’t confuse φ with kernel mappings!
φ(x) is called the sufficient statistics: knowing this is sufficient to estimate θ
Every exp. family distribution also has a conjugate distribution (often also in exp. family)
Many other nice properties (especially useful in Bayesian inference)
Also, MLE/MAP is usually quite simple (note that log p(x |θ) will typically have a simple form)
Many well-known distributions (Bernoulli, binomial, multinoulli, beta, gamma, Gaussian, etc.) are exponential family distributions
https://en.wikipedia.org/wiki/Exponential_family
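As a concrete worked example (not on the original slide), the Bernoulli distribution with mean µ can be written in exactly this form:

p(x \mid \mu) = \mu^{x} (1-\mu)^{1-x} = \exp\!\left[ x \log\tfrac{\mu}{1-\mu} + \log(1-\mu) \right]

so that h(x) = 1, \phi(x) = x, \theta = \log\frac{\mu}{1-\mu} (the natural parameter), and A(\theta) = \log(1 + e^{\theta}).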
MLE for Generative Classification with Gaussian Class-conditionals
Each class k modeled using a Gaussian with mean µk and covariance matrix Σk
Note: Can assume label yn to be one-hot and then ynk = 1 if yn = k, and ynk = 0 otherwise
Assuming p(yn = k) = πk, this model has parameters Θ = \{\pi_k, \mu_k, \Sigma_k\}_{k=1}^{K}
(We have done this before) Given \{x_n, y_n\}_{n=1}^{N}, MLE for Θ will be
\pi_k = \frac{1}{N} \sum_{n=1}^{N} y_{nk} = \frac{N_k}{N}, \qquad \mu_k = \frac{1}{N_k} \sum_{n=1}^{N} y_{nk}\, x_n, \qquad \Sigma_k = \frac{1}{N_k} \sum_{n=1}^{N} y_{nk} (x_n - \mu_k)(x_n - \mu_k)^\top
Basically estimating K Gaussians instead of just 1 (each using data only from that class)
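A minimal numpy sketch of these per-class estimates (illustrative; the function name and interface are my own):

import numpy as np

def class_conditional_mle(X, y, K):
    """MLE for {pi_k, mu_k, Sigma_k} from fully labelled data.
    X: (N, D) inputs; y: (N,) integer labels in {0, ..., K-1}."""
    N, D = X.shape
    pis = np.zeros(K)
    mus = np.zeros((K, D))
    Sigmas = np.zeros((K, D, D))
    for k in range(K):
        Xk = X[y == k]                    # only the data from class k
        Nk = Xk.shape[0]
        pis[k] = Nk / N                   # pi_k = N_k / N
        mus[k] = Xk.mean(axis=0)          # mu_k
        diffs = Xk - mus[k]
        Sigmas[k] = diffs.T @ diffs / Nk  # Sigma_k
    return pis, mus, Sigmas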
MLE for Generative Classification with Gaussian Class-conditionals
Let’s look at the “formal” procedure of deriving MLE in this case
MLE for Θ = \{\pi_k, \mu_k, \Sigma_k\}_{k=1}^{K} in this case can be written as (assuming i.i.d. data)
\hat{\Theta} = \arg\max_{\Theta} p(X, y \mid \Theta) = \arg\max_{\Theta} \prod_{n=1}^{N} p(x_n, y_n \mid \Theta) = \arg\max_{\Theta} \prod_{n=1}^{N} p(y_n \mid \Theta)\, p(x_n \mid y_n, \Theta)
= \arg\max_{\Theta} \prod_{n=1}^{N} \prod_{k=1}^{K} \left[ p(y_n = k \mid \Theta)\, p(x_n \mid y_n = k, \Theta) \right]^{y_{nk}}
= \arg\max_{\Theta} \log \prod_{n=1}^{N} \prod_{k=1}^{K} \left[ p(y_n = k \mid \Theta)\, p(x_n \mid y_n = k, \Theta) \right]^{y_{nk}}
= \arg\max_{\Theta} \sum_{n=1}^{N} \sum_{k=1}^{K} y_{nk} \left[ \log p(y_n = k \mid \Theta) + \log p(x_n \mid y_n = k, \Theta) \right]
= \arg\max_{\Theta} \sum_{n=1}^{N} \sum_{k=1}^{K} y_{nk} \left[ \log \pi_k + \log \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \right]
Given (X, y), optimizing it w.r.t. πk , µk ,Σk will give us the solution we saw on the previous slide
MLE When Labels Go Missing..
So the MLE problem for generative classification with Gaussian class-conditionals was
\hat{\Theta} = \arg\max_{\Theta} \log p(X, y \mid \Theta) = \arg\max_{\Theta} \sum_{n=1}^{N} \sum_{k=1}^{K} y_{nk} \left[ \log \pi_k + \log \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \right]
This problem has a nice separable structure, and a straightforward solution as we saw
What if we don’t know the label yn (i.e., don’t know if ynk is 0 or 1)? How to estimate Θ now?
When might we need to solve such a problem?
Mixture density estimation: Given N inputs x1, . . . , xN , model p(x) as a mixture of distributions
Probabilistic clustering: Same as density estimation; can get cluster ids once Θ is estimated
Semi-supervised generative classification: In training data, some yn’s are known, some not known
MLE When Labels Go Missing..
Recall the MLE problem for Θ when the labels are known
\hat{\Theta} = \arg\max_{\Theta} \sum_{n=1}^{N} \sum_{k=1}^{K} y_{nk} \left[ \log \pi_k + \log \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \right]
Will estimating Θ via MLE be as easy if the yn's are unknown? We only have X = \{x_1, \ldots, x_N\}. The MLE problem for Θ = \{\pi_k, \mu_k, \Sigma_k\}_{k=1}^{K} in this case would be (assuming i.i.d. data)
\hat{\Theta} = \arg\max_{\Theta} \log p(X \mid \Theta) = \arg\max_{\Theta} \log \prod_{n=1}^{N} p(x_n \mid \Theta) = \arg\max_{\Theta} \sum_{n=1}^{N} \log p(x_n \mid \Theta)
Computing each likelihood p(xn|Θ) in this case requires summing over all possible values of yn
p(x_n \mid \Theta) = \sum_{k=1}^{K} p(x_n, y_n = k \mid \Theta) = \sum_{k=1}^{K} p(y_n = k \mid \Theta)\, p(x_n \mid y_n = k, \Theta) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)
The MLE problem for Θ when the labels are unknown
\hat{\Theta} = \arg\max_{\Theta} \sum_{n=1}^{N} \log \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)
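For completeness, here is a sketch (not from the slides) of how this log-likelihood objective can be evaluated stably in log space; it assumes scipy is available:

import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def gmm_log_likelihood(X, pis, mus, Sigmas):
    """Evaluates sum_n log sum_k pi_k N(x_n | mu_k, Sigma_k)."""
    # log_joint[n, k] = log pi_k + log N(x_n | mu_k, Sigma_k)
    log_joint = np.stack(
        [np.log(pis[k]) + multivariate_normal.logpdf(X, mus[k], Sigmas[k])
         for k in range(len(pis))], axis=1)
    return logsumexp(log_joint, axis=1).sum()  # log-sum-exp over k, then sum over n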
MLE When Labels Go Missing..
So we saw that the MLE problem for Θ when the labels are unknown
\hat{\Theta} = \arg\max_{\Theta} \sum_{n=1}^{N} \log \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)
Solving this would enable us to learn a Gaussian Mixture Model (GMM)
Note: The Gaussian can be replaced by other distributions too (e.g., Poisson mixture model)
A small issue now: Log can’t go inside the summation. Expressions won’t be simple anymore
Note: Can still take (partial) derivatives and do GD/SGD etc., but these are iterative methods
Recall that we didn’t need GD/SGD etc when doing MLE with fully observed yn’s
One workaround: Can try doing alternating optimization
MLE for Gaussian Mixture Model using ALT-OPT
Based on the fact that MLE is simple when labels are known
Notation change: We will now use zn instead of yn and znk instead of ynk
MLE for Gaussian Mixture Model using ALT-OPT
1 Initialize Θ as \hat{\Theta}
2 For n = 1, . . . , N, find the best z_n:
\hat{z}_n = \arg\max_{k \in \{1,\ldots,K\}} p(x_n, z_n = k \mid \hat{\Theta}) = \arg\max_{k \in \{1,\ldots,K\}} p(z_n = k \mid x_n, \hat{\Theta})
3 Given \hat{Z} = \{\hat{z}_1, \ldots, \hat{z}_N\}, re-estimate Θ using MLE:
\hat{\Theta} = \arg\max_{\Theta} \sum_{n=1}^{N} \sum_{k=1}^{K} \hat{z}_{nk} \left[ \log \pi_k + \log \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \right]
4 Go to step 2 if not yet converged
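A compact numpy/scipy sketch of this ALT-OPT procedure (illustrative; the initialization, the fixed iteration budget in place of a convergence test, and the small covariance ridge are my own choices):

import numpy as np
from scipy.stats import multivariate_normal

def gmm_alt_opt(X, K, n_iters=50, seed=0):
    """Hard-assignment alternating optimization for a GMM."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    # Step 1: initialize Theta-hat
    pis = np.full(K, 1.0 / K)
    mus = X[rng.choice(N, size=K, replace=False)]
    Sigmas = np.stack([np.eye(D)] * K)
    for _ in range(n_iters):              # Step 4 replaced by a fixed iteration budget
        # Step 2: z_n = argmax_k p(z_n = k | x_n, Theta-hat)
        log_post = np.stack(
            [np.log(pis[k]) + multivariate_normal.logpdf(X, mus[k], Sigmas[k])
             for k in range(K)], axis=1)
        z = log_post.argmax(axis=1)
        # Step 3: re-estimate Theta by MLE on the "completed" data (X, z)
        for k in range(K):
            Xk = X[z == k]
            if len(Xk) == 0:              # guard against empty clusters
                continue
            pis[k] = len(Xk) / N
            mus[k] = Xk.mean(axis=0)
            diffs = Xk - mus[k]
            Sigmas[k] = diffs.T @ diffs / len(Xk) + 1e-6 * np.eye(D)
    return z, pis, mus, Sigmas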
Is ALT-OPT Doing The Correct Thing?
Our original problem was
\hat{\Theta} = \arg\max_{\Theta} \sum_{n=1}^{N} \log \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)
What ALT-OPT did was the following
\hat{\Theta} = \arg\max_{\Theta} \sum_{n=1}^{N} \sum_{k=1}^{K} \hat{z}_{nk} \left[ \log \pi_k + \log \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \right]
We clearly aren’t solving the original problem!
\arg\max_{\Theta} \log p(X \mid \Theta) \quad \text{vs} \quad \arg\max_{\Theta} \log p(X, \hat{Z} \mid \Theta)
Also, we updated zn as follows
\hat{z}_n = \arg\max_{k \in \{1,\ldots,K\}} p(z_n = k \mid x_n, \hat{\Theta})
Why choose zn to be this (makes intuitive sense, but is there a formal justification)?
It turns out (as we will see) that this ALT-OPT is an approximation of the Expectation-Maximization (EM) algorithm for GMM
Why Estimation is Difficult in LVMs?
Suppose we want to estimate parameters Θ. If we knew both xn and zn then we could do
\Theta_{\text{MLE}} = \arg\max_{\Theta} \sum_{n=1}^{N} \log p(x_n, z_n \mid \Theta) = \arg\max_{\Theta} \sum_{n=1}^{N} \left[ \log p(z_n \mid \phi) + \log p(x_n \mid z_n, \theta) \right]
Simple to solve (usually closed form) if p(zn|φ) and p(xn|zn, θ) are “simple” (e.g., exp-fam. dist.)
However, in LVMs where zn is “hidden”, the MLE problem will be the following
\Theta_{\text{MLE}} = \arg\max_{\Theta} \sum_{n=1}^{N} \log p(x_n \mid \Theta) = \arg\max_{\Theta} \log p(X \mid \Theta)
The form of p(xn|Θ) may not be simple since we need to sum over unknown zn’s possible values
p(x_n \mid \Theta) = \sum_{z_n} p(x_n, z_n \mid \Theta) \quad \text{... or, if } z_n \text{ is continuous:} \quad p(x_n \mid \Theta) = \int p(x_n, z_n \mid \Theta)\, dz_n
The summation/integral may be intractable and may lead to complex expressions for p(xn|Θ); in fact, the result is almost never an exponential family distribution. MLE for Θ won't have closed-form solutions!
An Important Identity
Define pz = p(Z|X,Θ) and let q(Z) be some distribution over Z
Assuming discrete Z, the identity below holds for any choice of the distribution q(Z)
log p(X|Θ) = L(q,Θ) + KL(q||pz)
\mathcal{L}(q, \Theta) = \sum_{Z} q(Z) \log \left\{ \frac{p(X, Z \mid \Theta)}{q(Z)} \right\}, \qquad \mathrm{KL}(q \,\|\, p_z) = -\sum_{Z} q(Z) \log \left\{ \frac{p(Z \mid X, \Theta)}{q(Z)} \right\}
(Exercise: Verify the above identity)
Since KL(q||pz) ≥ 0, L(q,Θ) is a lower-bound on log p(X|Θ)
log p(X|Θ) ≥ L(q,Θ)
Maximizing L(q,Θ) will also improve log p(X|Θ). Also, as we’ll see, it’s easier to maximize L(q,Θ)
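The suggested exercise can also be checked numerically; here is a tiny sketch (my own construction) with a single observation and one discrete latent variable:

import numpy as np

rng = np.random.default_rng(0)
K = 4

joint = rng.random(K)                  # p(X, Z = k | Theta) for one fixed observed X (arbitrary positive values)
log_px = np.log(joint.sum())           # log p(X | Theta): marginalize over Z
p_z = joint / joint.sum()              # p_z = p(Z | X, Theta)

q = rng.random(K)
q /= q.sum()                           # any distribution q(Z)

L_q = np.sum(q * np.log(joint / q))    # L(q, Theta) = sum_Z q(Z) log [p(X, Z | Theta) / q(Z)]
KL = -np.sum(q * np.log(p_z / q))      # KL(q || p_z)

print(np.isclose(log_px, L_q + KL))    # True: the identity holds for this (and any) q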
Maximizing L(q,Θ)
Note that L(q,Θ) depends on two things: q(Z) and Θ. Let's do ALT-OPT for these
First recall the identity we had: log p(X|Θ) = L(q,Θ) + KL(q||pz) with
\mathcal{L}(q, \Theta) = \sum_{Z} q(Z) \log \left\{ \frac{p(X, Z \mid \Theta)}{q(Z)} \right\} \quad \text{and} \quad \mathrm{KL}(q \,\|\, p_z) = -\sum_{Z} q(Z) \log \left\{ \frac{p(Z \mid X, \Theta)}{q(Z)} \right\}
Maximize L w.r.t. q with Θ fixed at Θ^old: since log p(X|Θ) will be a constant in this case,
\hat{q} = \arg\max_{q} \mathcal{L}(q, \Theta^{old}) = \arg\min_{q} \mathrm{KL}(q \,\|\, p_z) = p_z = p(Z \mid X, \Theta^{old})
Maximize L w.r.t. Θ with q fixed at \hat{q} = p(Z \mid X, \Theta^{old}):
\Theta^{new} = \arg\max_{\Theta} \mathcal{L}(\hat{q}, \Theta) = \arg\max_{\Theta} \sum_{Z} p(Z \mid X, \Theta^{old}) \log \frac{p(X, Z \mid \Theta)}{p(Z \mid X, \Theta^{old})} = \arg\max_{\Theta} \sum_{Z} p(Z \mid X, \Theta^{old}) \log p(X, Z \mid \Theta)
Therefore, \Theta^{new} = \arg\max_{\Theta} Q(\Theta, \Theta^{old}), where Q(\Theta, \Theta^{old}) = \mathbb{E}_{p(Z \mid X, \Theta^{old})}\left[ \log p(X, Z \mid \Theta) \right] is known as the expected complete-data log-likelihood (CLL)
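Putting the two steps together for the GMM gives the familiar EM updates; below is a hedged numpy/scipy sketch (the initialization, iteration budget, and covariance ridge are my own choices, and a convergence check is omitted):

import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def gmm_em(X, K, n_iters=100, seed=0):
    """EM for a Gaussian mixture: E step = responsibilities, M step = weighted MLE."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    pis = np.full(K, 1.0 / K)
    mus = X[rng.choice(N, size=K, replace=False)]
    Sigmas = np.stack([np.cov(X.T) + 1e-6 * np.eye(D)] * K)
    for _ in range(n_iters):
        # E step: gamma[n, k] = p(z_n = k | x_n, Theta_old), computed in log space
        log_joint = np.stack(
            [np.log(pis[k]) + multivariate_normal.logpdf(X, mus[k], Sigmas[k])
             for k in range(K)], axis=1)
        gamma = np.exp(log_joint - logsumexp(log_joint, axis=1, keepdims=True))
        # M step: maximize Q(Theta, Theta_old), i.e. responsibility-weighted MLE
        Nk = gamma.sum(axis=0)                    # "effective" counts per component
        pis = Nk / N
        mus = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diffs = X - mus[k]
            Sigmas[k] = (gamma[:, k, None] * diffs).T @ diffs / Nk[k] + 1e-6 * np.eye(D)
    return pis, mus, Sigmas, gamma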
What's Going On: A Visual Illustration..
Step 1: We set q = p(Z|X,Θold), so that L(q,Θ) touches log p(X|Θ) at Θold
Step 2: We maximize L(q,Θ) w.r.t. Θ (equivalent to maximizing Q(Θ,Θold))
[Figure: a sequence of frames alternating between "After updating q" and "After maximizing w.r.t. Θ", ending with "Local Maxima Found"]
What’s Going On: Another Illustration
The two-step alternating optimization scheme we saw can never decrease p(X|Θ) (a good thing)
To see this, consider both steps: (1) Optimize q given Θ = Θold; (2) Optimize Θ given this q
[Figure: two panels, one illustrating Step 1 and one illustrating Step 2]
Step 1 keeps Θ fixed, so p(X|Θ) obviously can't decrease (stays unchanged in this step)
Step 2 maximizes the lower bound L(q,Θ) w.r.t. Θ. Thus p(X|Θ) can't decrease!