Page 1:

Teacher: Gianni A. Di Caro

Lecture 29: Latent Variable Models, Gaussian Mixture Models, Expectation-Maximization

Introduction to Machine Learning, 10-315 Fall ‘19

Disclaimer: These slides can include material from different sources. I’ll be happy to explicitly acknowledge a source if required. Contact me for requests.

Page 2:


All the following slides are from Piyush Rai (IIT)

Page 3:

Latent Variable Models

Page 4:

A Simple Generative Model

All observations {x1, . . . , xN} generated from a distribution p(x |θ)

Unknowns: Parameters θ of the assumed data distribution p(x |θ)

Many ways to estimate the parameters (MLE, MAP, or Bayesian inference)
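For concreteness, the first two estimators can be written as follows (standard definitions, not spelled out on this slide):

\[
\hat\theta_{\mathrm{MLE}} = \arg\max_{\theta}\,\sum_{n=1}^{N} \log p(x_n \mid \theta),
\qquad
\hat\theta_{\mathrm{MAP}} = \arg\max_{\theta}\,\Big[\sum_{n=1}^{N} \log p(x_n \mid \theta) + \log p(\theta)\Big],
\]

while Bayesian inference keeps the whole posterior p(θ | x1, . . . , xN) ∝ p(θ) ∏_{n=1}^N p(xn | θ) instead of a point estimate.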

Page 5:

Generative Model with Latent Variables

Assume each observation xn to be associated with a latent variable zn

In this “latent variable model” of the data, x also depends on some latent variable(s) z

zn is akin to a latent representation or “encoding” of xn; it controls what the data “looks like”. E.g.,

zn ∈ {1, . . . ,K} denotes the cluster xn belongs to

zn ∈ R^K denotes a low-dimensional latent representation or latent “code” for xn

Unknowns: {z1, . . . , zN} and (θ, φ). The zn’s are called “local” variables; (θ, φ) are called “global” variables
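Concretely, one standard way to write such a latent variable model (consistent with the factorization p(zn | φ) p(xn | zn, θ) used later in these slides) is

\[
z_n \sim p(z \mid \phi), \qquad x_n \sim p(x \mid z_n, \theta), \qquad
p(x_n \mid \Theta) = \sum_{z_n} p(z_n \mid \phi)\, p(x_n \mid z_n, \theta)
\]

(with an integral instead of a sum when zn is continuous), where Θ collects (θ, φ).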

Page 6:

Brief Detour/Recap: Gaussian Parameter Estimation

Page 7:

MLE for Multivariate Gaussian

Multivariate Gaussian in D dimensions

p(x | µ, Σ) = (1 / ((2π)^{D/2} |Σ|^{1/2})) exp( −(1/2) (x − µ)^T Σ^{-1} (x − µ) )

Goal: Given N i.i.d. observations {xn}_{n=1}^N from this Gaussian, estimate parameters µ and Σ

MLE for the D × 1 mean µ ∈ R^D and D × D p.s.d. covariance matrix Σ:

µ = (1/N) ∑_{n=1}^N xn   and   Σ = (1/N) ∑_{n=1}^N (xn − µ)(xn − µ)^T

Note: Σ depends on µ, but µ doesn’t depend on Σ ⇒ no need for alternating opt.

Note: log works nicely with exp of the Gaussian. Simplifies MLE expressions in this case

In general, when the distribution is an exponential family distribution, MLE is usually very easy
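As an illustration, here is a minimal NumPy sketch of these closed-form estimates (the function and variable names are illustrative, not from the slides):

import numpy as np

def gaussian_mle(X):
    """MLE for a multivariate Gaussian; X is an (N, D) array of i.i.d. samples."""
    N = X.shape[0]
    mu = X.mean(axis=0)            # sample mean, shape (D,)
    diff = X - mu                  # centered data, shape (N, D)
    Sigma = diff.T @ diff / N      # MLE covariance (divides by N, not N - 1)
    return mu, Sigma

# Example: recover the parameters of a synthetic 2-D Gaussian
rng = np.random.default_rng(0)
X = rng.multivariate_normal(mean=[1.0, -2.0], cov=[[2.0, 0.3], [0.3, 0.5]], size=5000)
mu_hat, Sigma_hat = gaussian_mle(X)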

Page 8:

Brief Detour: Exponential Family Distributions

An exponential family distribution is of the form

p(x | θ) = h(x) exp[θ^T φ(x) − A(θ)]

θ is called the natural parameter of the family

h(x), φ(x), and A(θ) are known functions. Note: Don’t confuse φ with kernel mappings!

φ(x) is called the sufficient statistic: knowing it is sufficient to estimate θ

Every exp. family distribution also has a conjugate distribution (often also in exp. family)

Many other nice properties (especially useful in Bayesian inference)

Also, MLE/MAP is usually quite simple (note that log p(x |θ) will typically have a simple form)

Many well-known distributions (Bernoulli, binomial, multinoulli, beta, gamma, Gaussian, etc.) are exponential family distributions

https://en.wikipedia.org/wiki/Exponential_family
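As a concrete instance (not worked out on the slide), the Bernoulli distribution with mean µ can be put in this form:

\[
p(x \mid \mu) = \mu^{x}(1-\mu)^{1-x}
= \exp\!\Big[\,x\,\underbrace{\log\tfrac{\mu}{1-\mu}}_{\theta} \;-\; \underbrace{\log(1+e^{\theta})}_{A(\theta)}\Big],
\qquad h(x) = 1,\quad \phi(x) = x,
\]

so the natural parameter is the log-odds and the sufficient statistic is x itself.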

Page 9:

MLE for Generative Classification with Gaussian Class-conditionals

Each class k modeled using a Gaussian with mean µk and covariance matrix Σk

Note: Can assume the label yn to be one-hot, so that ynk = 1 if yn = k and ynk = 0 otherwise

Assuming p(yn = k) = πk, this model has parameters Θ = {πk, µk, Σk}_{k=1}^K

(We have done this before) Given {xn, yn}_{n=1}^N, MLE for Θ will be

πk = (1/N) ∑_{n=1}^N ynk = Nk/N

µk = (1/Nk) ∑_{n=1}^N ynk xn

Σk = (1/Nk) ∑_{n=1}^N ynk (xn − µk)(xn − µk)^T

Basically estimating K Gaussians instead of just 1 (each using data only from that class)
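A minimal NumPy sketch of these estimates from one-hot labels (names are illustrative; it assumes every class appears at least once):

import numpy as np

def gaussian_class_conditional_mle(X, Y):
    """X: (N, D) inputs; Y: (N, K) one-hot labels. Returns MLE of (pi, mu, Sigma) per class."""
    N, D = X.shape
    K = Y.shape[1]
    Nk = Y.sum(axis=0)                 # per-class counts N_k, shape (K,)
    pi = Nk / N                        # class priors pi_k
    mu = (Y.T @ X) / Nk[:, None]       # per-class means mu_k, shape (K, D)
    Sigma = np.empty((K, D, D))
    for k in range(K):
        diff = X - mu[k]               # center by the class-k mean
        Sigma[k] = (Y[:, k, None] * diff).T @ diff / Nk[k]
    return pi, mu, Sigma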

Page 10:

MLE for Generative Classification with Gaussian Class-conditionals

Let’s look at the “formal” procedure of deriving MLE in this case

MLE for Θ = {πk, µk, Σk}_{k=1}^K in this case can be written as (assuming i.i.d. data)

Θ̂ = arg max_Θ p(X, y | Θ) = arg max_Θ ∏_{n=1}^N p(xn, yn | Θ) = arg max_Θ ∏_{n=1}^N p(yn | Θ) p(xn | yn, Θ)

  = arg max_Θ ∏_{n=1}^N ∏_{k=1}^K [p(yn = k | Θ) p(xn | yn = k, Θ)]^{ynk}

  = arg max_Θ log ∏_{n=1}^N ∏_{k=1}^K [p(yn = k | Θ) p(xn | yn = k, Θ)]^{ynk}

  = arg max_Θ ∑_{n=1}^N ∑_{k=1}^K ynk [log p(yn = k | Θ) + log p(xn | yn = k, Θ)]

  = arg max_Θ ∑_{n=1}^N ∑_{k=1}^K ynk [log πk + log N(xn | µk, Σk)]

Given (X, y), optimizing it w.r.t. πk , µk ,Σk will give us the solution we saw on the previous slide

Page 11:

MLE When Labels Go Missing..

So the MLE problem for generative classification with Gaussian class-conditionals was

Θ̂ = arg max_Θ log p(X, y | Θ) = arg max_Θ ∑_{n=1}^N ∑_{k=1}^K ynk [log πk + log N(xn | µk, Σk)]

This problem has a nice separable structure, and a straightforward solution as we saw

What if we don’t know the label yn (i.e., don’t know if ynk is 0 or 1)? How to estimate Θ now?

When might we need to solve such a problem?

Mixture density estimation: Given N inputs x1, . . . , xN , model p(x) as a mixture of distributions

Probabilistic clustering: Same as density estimation; can get cluster ids once Θ is estimated

Semi-supervised generative classification: In training data, some yn’s are known, some not known

Page 12:

MLE When Labels Go Missing..

Recall the MLE problem for Θ when the labels are known

Θ̂ = arg max_Θ ∑_{n=1}^N ∑_{k=1}^K ynk [log πk + log N(xn | µk, Σk)]

Will estimating Θ via MLE be as easy if yn’s are unknown? We only have X = {x1, . . . , xN}

The MLE problem for Θ = {πk, µk, Σk}_{k=1}^K in this case would be (assuming i.i.d. data)

Θ̂ = arg max_Θ log p(X | Θ) = arg max_Θ log ∏_{n=1}^N p(xn | Θ) = arg max_Θ ∑_{n=1}^N log p(xn | Θ)

Computing each likelihood p(xn|Θ) in this case requires summing over all possible values of yn

p(xn | Θ) = ∑_{k=1}^K p(xn, yn = k | Θ) = ∑_{k=1}^K p(yn = k | Θ) p(xn | yn = k, Θ) = ∑_{k=1}^K πk N(xn | µk, Σk)

The MLE problem for Θ when the labels are unknown

Θ̂ = arg max_Θ ∑_{n=1}^N log ∑_{k=1}^K πk N(xn | µk, Σk)
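To see the “log of a sum” structure in code, here is a minimal sketch of the objective (it relies on scipy.special.logsumexp and scipy.stats.multivariate_normal for numerical stability; the other names are illustrative):

import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def gmm_log_likelihood(X, pi, mu, Sigma):
    """Incomplete-data log-likelihood sum_n log sum_k pi_k N(x_n | mu_k, Sigma_k).
    X: (N, D), pi: (K,), mu: (K, D), Sigma: (K, D, D)."""
    K = len(pi)
    # log[pi_k N(x_n | mu_k, Sigma_k)] for every (n, k), shape (N, K)
    log_joint = np.stack(
        [np.log(pi[k]) + multivariate_normal.logpdf(X, mu[k], Sigma[k]) for k in range(K)],
        axis=1)
    # the log stays outside the sum over k, hence logsumexp over the component axis
    return logsumexp(log_joint, axis=1).sum()

There is no way to push the log inside the inner sum, which is exactly what breaks the clean closed-form MLE of the fully-labelled case.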

Page 13:

MLE When Labels Go Missing..

So we saw that the MLE problem for Θ when the labels are unknown is

Θ̂ = arg max_Θ ∑_{n=1}^N log ∑_{k=1}^K πk N(xn | µk, Σk)

Solving this would enable us to learn a Gaussian Mixture Model (GMM)

Note: The Gaussian can be replaced by other distributions too (e.g., Poisson mixture model)

A small issue now: Log can’t go inside the summation. Expressions won’t be simple anymore

Note: Can still take (partial) derivatives and do GD/SGD etc., but these are iterative methods

Recall that we didn’t need GD/SGD etc when doing MLE with fully observed yn’s

One workaround: Can try doing alternating optimization

Page 14:

MLE for Gaussian Mixture Model using ALT-OPT

Based on the fact that MLE is simple when labels are known

Notation change: We will now use zn instead of yn and znk instead of ynk

MLE for Gaussian Mixture Model using ALT-OPT

1 Initialize Θ as Θ̂

2 For n = 1, . . . , N, find the best zn:

   ẑn = arg max_{k ∈ {1,...,K}} p(xn, zn = k | Θ̂) = arg max_{k ∈ {1,...,K}} p(zn = k | xn, Θ̂)

3 Given Ẑ = {ẑ1, . . . , ẑN}, re-estimate Θ using MLE:

   Θ̂ = arg max_Θ ∑_{n=1}^N ∑_{k=1}^K ẑnk [log πk + log N(xn | µk, Σk)]

4 Go to step 2 if not yet converged
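A minimal Python sketch of this ALT-OPT loop, under the stated hard-assignment scheme (the initialization, regularization, and names are my own choices, not from the slides):

import numpy as np
from scipy.stats import multivariate_normal

def gmm_alt_opt(X, K, n_iters=100, seed=0):
    """Hard-assignment ALT-OPT for a GMM; X is (N, D). Returns (pi, mu, Sigma, z)."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    mu = X[rng.choice(N, size=K, replace=False)]             # random data points as initial means
    Sigma = np.stack([np.cov(X.T) + 1e-6 * np.eye(D)] * K)   # shared initial covariance
    pi = np.full(K, 1.0 / K)
    z = np.full(N, -1)
    for _ in range(n_iters):
        # Step 2: assign each x_n to the most probable component under the current parameters
        log_post = np.stack(
            [np.log(pi[k]) + multivariate_normal.logpdf(X, mu[k], Sigma[k]) for k in range(K)],
            axis=1)
        z_new = log_post.argmax(axis=1)
        if np.array_equal(z_new, z):
            break                                            # assignments stopped changing
        z = z_new
        # Step 3: re-estimate (pi_k, mu_k, Sigma_k) by MLE on the hard-labelled data
        for k in range(K):
            Xk = X[z == k]
            Nk = max(len(Xk), 1)                             # guard against empty clusters
            pi[k] = Nk / N
            mu[k] = Xk.mean(axis=0) if len(Xk) else mu[k]
            diff = Xk - mu[k]
            Sigma[k] = diff.T @ diff / Nk + 1e-6 * np.eye(D)
    return pi, mu, Sigma, z

With full covariances this behaves like a probabilistic variant of K-means: hard assignments, but Gaussian log-likelihoods rather than plain Euclidean distances.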

Page 15:

Is ALT-OPT Doing The Correct Thing?

Our original problem was

Θ̂ = arg max_Θ ∑_{n=1}^N log ∑_{k=1}^K πk N(xn | µk, Σk)

What ALT-OPT did was the following

Θ̂ = arg max_Θ ∑_{n=1}^N ∑_{k=1}^K ẑnk [log πk + log N(xn | µk, Σk)]

We clearly aren’t solving the original problem!

arg max_Θ log p(X | Θ)   vs   arg max_Θ log p(X, Ẑ | Θ)

Also, we updated zn as follows

ẑn = arg max_{k ∈ {1,...,K}} p(zn = k | xn, Θ̂)

Why choose zn to be this (makes intuitive sense, but is there a formal justification)?

It turns out (as we will see) that this ALT-OPT is an approximation of the Expectation-Maximization (EM) algorithm for GMM

Page 16:

Why Is Estimation Difficult in LVMs?

Suppose we want to estimate parameters Θ. If we knew both xn and zn then we could do

Θ_MLE = arg max_Θ ∑_{n=1}^N log p(xn, zn | Θ) = arg max_Θ ∑_{n=1}^N [log p(zn | φ) + log p(xn | zn, θ)]

Simple to solve (usually closed form) if p(zn|φ) and p(xn|zn, θ) are “simple” (e.g., exp-fam. dist.)

However, in LVMs where zn is “hidden”, the MLE problem will be the following

Θ_MLE = arg max_Θ ∑_{n=1}^N log p(xn | Θ) = arg max_Θ log p(X | Θ)

The form of p(xn|Θ) may not be simple since we need to sum over unknown zn’s possible values

p(xn | Θ) = ∑_{zn} p(xn, zn | Θ)   ... or, if zn is continuous:   p(xn | Θ) = ∫ p(xn, zn | Θ) dzn

The summation/integral may be intractable and may lead to complex expressions for p(xn|Θ); in fact, the result is almost never an exponential family distribution. MLE for Θ won’t have closed-form solutions!

Page 17:

An Important Identity

Define pz = p(Z|X,Θ) and let q(Z) be some distribution over Z

Assuming Z is discrete, the identity below holds for any choice of the distribution q(Z)

log p(X|Θ) = L(q,Θ) + KL(q||pz)

L(q,Θ) = ∑_Z q(Z) log [ p(X, Z | Θ) / q(Z) ]

KL(q||pz) = −∑_Z q(Z) log [ p(Z | X, Θ) / q(Z) ]

(Exercise: Verify the above identity)

Since KL(q||pz) ≥ 0, L(q,Θ) is a lower-bound on log p(X|Θ)

log p(X|Θ) ≥ L(q,Θ)

Maximizing L(q,Θ) will also improve log p(X|Θ). Also, as we’ll see, it’s easier to maximize L(q,Θ)
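One way to verify the identity (a single line of algebra, added here since the slide leaves it as an exercise): add the two terms and use p(X, Z | Θ) = p(Z | X, Θ) p(X | Θ) together with ∑_Z q(Z) = 1:

\[
\mathcal{L}(q,\Theta) + \mathrm{KL}(q\|p_z)
= \sum_Z q(Z)\log\frac{p(X,Z\mid\Theta)}{q(Z)} - \sum_Z q(Z)\log\frac{p(Z\mid X,\Theta)}{q(Z)}
= \sum_Z q(Z)\log\frac{p(X,Z\mid\Theta)}{p(Z\mid X,\Theta)}
= \sum_Z q(Z)\log p(X\mid\Theta)
= \log p(X\mid\Theta).
\]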

Page 18:

Maximizing L(q,Θ)

Note that L(q,Θ) depends on two things: q(Z) and Θ. Let’s do ALT-OPT for these

First recall the identity we had: log p(X|Θ) = L(q,Θ) + KL(q||pz) with

L(q,Θ) = ∑_Z q(Z) log [ p(X, Z | Θ) / q(Z) ]   and   KL(q||pz) = −∑_Z q(Z) log [ p(Z | X, Θ) / q(Z) ]

Maximize L w.r.t. q with Θ fixed at Θold: Since log p(X|Θ) will be a constant in this case,

q̂ = arg max_q L(q, Θold) = arg min_q KL(q||pz) = pz = p(Z | X, Θold)

Maximize L w.r.t. Θ with q fixed at q̂ = p(Z | X, Θold):

Θnew = arg max_Θ L(q̂, Θ) = arg max_Θ ∑_Z p(Z | X, Θold) log [ p(X, Z | Θ) / p(Z | X, Θold) ] = arg max_Θ ∑_Z p(Z | X, Θold) log p(X, Z | Θ)

.. therefore, Θnew = arg max_Θ Q(Θ, Θold), where Q(Θ, Θold) = E_{p(Z|X,Θold)}[log p(X, Z | Θ)]

Q(Θ, Θold) = E_{p(Z|X,Θold)}[log p(X, Z | Θ)] is known as the expected complete-data log-likelihood (CLL)
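As a concrete instance (the standard EM updates for a GMM; not written out on these slides, but they follow from maximizing this Q), the expected CLL uses the responsibilities

\[
\gamma_{nk} = p(z_n = k \mid x_n, \Theta^{old})
= \frac{\pi_k^{old}\,\mathcal{N}(x_n \mid \mu_k^{old}, \Sigma_k^{old})}
       {\sum_{j=1}^{K} \pi_j^{old}\,\mathcal{N}(x_n \mid \mu_j^{old}, \Sigma_j^{old})},
\qquad
Q(\Theta,\Theta^{old}) = \sum_{n=1}^{N}\sum_{k=1}^{K}\gamma_{nk}\big[\log\pi_k + \log\mathcal{N}(x_n \mid \mu_k,\Sigma_k)\big],
\]

and maximizing Q gives πk = (1/N) ∑_n γnk, µk = ∑_n γnk xn / ∑_n γnk, Σk = ∑_n γnk (xn − µk)(xn − µk)^T / ∑_n γnk, i.e., the hard assignments ẑnk of ALT-OPT are replaced by soft responsibilities γnk.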

Page 19:

What’s Going On: A Visual Illustration..

Step 1: We set q = p(Z|X,Θold), L(q,Θ) touches log p(X|Θ) at Θold

Step 2: We maximize L(q,Θ) w.r.t. Θ (equivalent to maximizing Q(Θ,Θold))

[Figure: the lower bound L(q,Θ) touches log p(X|Θ) at Θold after q is updated; maximizing L w.r.t. Θ then moves the estimate to a higher point on log p(X|Θ); the two steps (“after updating q”, “after maximizing w.r.t. Θ”) are alternated until a local maximum is found]

Page 27:

What’s Going On: Another Illustration

The two-step alternating optimization scheme we saw can never decrease p(X|Θ) (a good thing)

To see this consider both steps: (1) Optimize q given Θ = Θold ; (2) Optimize Θ given this q


Step 1 keeps Θ fixed, so p(X|Θ) obviously can’t decrease (stays unchanged in this step)

Step 2 maximizes the lower bound L(q,Θ) w.r.t Θ. Thus p(X|Θ) can’t decrease!
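In symbols (a standard way to write this argument, using the identity and the q̂, Θnew defined earlier; not reproduced verbatim from the slide):

\[
\log p(X\mid\Theta^{old}) = \mathcal{L}(\hat q,\Theta^{old})
\;\le\; \mathcal{L}(\hat q,\Theta^{new})
\;\le\; \mathcal{L}(\hat q,\Theta^{new}) + \mathrm{KL}\!\big(\hat q \,\|\, p(Z\mid X,\Theta^{new})\big)
= \log p(X\mid\Theta^{new}).
\]

The first equality is Step 1 (the KL term is driven to zero), the first inequality is Step 2, and the last step is the identity together with KL ≥ 0.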
