lecture13 gaussian mixtures - mit csailpeople.csail.mit.edu/dsontag/courses/pgm13/slides/... ·...

Mixture Models & EM algorithm Probabilistic Graphical Models, Lecture 13

David Sontag New York University

Slides adapted from Carlos Guestrin, Dan Klein, Luke Zettlemoyer, Dan Weld, Vibhav Gogate, and Andrew Moore

Mixture of Gaussians model

[Figure: mixture components with means μ1 and μ2]

•  P(Y): There are k components

•  P(X|Y): Each component generates data from a multivariate Gaussian with mean μi and covariance matrix Σi

Each data point is sampled from a generative process:

1.  Choose component i with probability P(y=i)

2.  Generate datapoint ~ N(μi, Σi)
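The two-step generative process can be sketched in a few lines of NumPy. This is a minimal illustration rather than code from the lecture: the function name `sample_gmm` and the example weights, means, and covariances are assumptions chosen only for demonstration.

```python
import numpy as np

def sample_gmm(n, weights, means, covs, seed=0):
    """Draw n points from a Gaussian mixture via the two-step generative process."""
    rng = np.random.default_rng(seed)
    weights = np.asarray(weights, dtype=float)
    means = np.asarray(means, dtype=float)
    covs = np.asarray(covs, dtype=float)
    # Step 1: choose a component i with probability P(y = i).
    y = rng.choice(len(weights), size=n, p=weights)
    # Step 2: generate each datapoint from N(mu_i, Sigma_i).
    x = np.stack([rng.multivariate_normal(means[i], covs[i]) for i in y])
    return x, y

# Illustrative parameters (hypothetical, not from the slides): two 2-D components.
weights = [0.3, 0.7]
means = [[0.0, 0.0], [5.0, 5.0]]
covs = [np.eye(2), [[1.0, 0.8], [0.8, 1.0]]]
X, Y = sample_gmm(1000, weights, means, covs)
```

Here `Y` holds the (normally hidden) component labels and `X` the observed points; in the unsupervised setting only `X` would be available.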

Gaussian mixture model (GMM)

Multivariate Gaussians

Σ = arbitrary (semidefinite) matrix:

–  specifies rotation (change of basis)

–  eigenvalues specify relative elongation

$$P(X = x_j \mid Y = i) = \frac{1}{(2\pi)^{m/2}\,\lvert\Sigma_i\rvert^{1/2}} \exp\left[-\frac{1}{2}\,(x_j - \mu_i)^T\,\Sigma_i^{-1}\,(x_j - \mu_i)\right]$$
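As a rough sketch, the component density P(X = xj | Y = i) can be evaluated numerically as follows. The function name `gaussian_density` is hypothetical, and the code assumes Σ is strictly positive definite (invertible), which is slightly stronger than the semidefinite condition stated on the slide.

```python
import numpy as np

def gaussian_density(x, mu, sigma):
    """Evaluate the m-dimensional Gaussian density N(x; mu, sigma)."""
    x = np.asarray(x, dtype=float)
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    m = mu.shape[0]
    diff = x - mu
    # Solve Sigma * v = diff instead of forming the explicit inverse.
    quad = diff @ np.linalg.solve(sigma, diff)
    # Normalising constant (2*pi)^(m/2) * |Sigma|^(1/2).
    norm = (2.0 * np.pi) ** (m / 2.0) * np.sqrt(np.linalg.det(sigma))
    return np.exp(-0.5 * quad) / norm

# Density of a standard 2-D Gaussian at its mean, which equals 1 / (2*pi).
p = gaussian_density([0.0, 0.0], [0.0, 0.0], np.eye(2))
```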

Mixtures of Gaussians (1)

[Figure: Old Faithful data set. Axes: duration of last eruption, time to eruption.]

[Figure: Old Faithful data set, fitted with a single Gaussian and with a mixture of two Gaussians.]

Combine simple models into a complex model: a weighted sum of components, each with its own mixing coefficient.
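The combination can be written as a weighted sum of Gaussian densities. The slide's own equation did not survive extraction, so the form below is a sketch in standard notation; the symbol π_k for the mixing coefficient is an assumption, and in the notation used elsewhere in these slides it corresponds to P(Y = k).

```latex
p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k),
\qquad \pi_k \ge 0, \quad \sum_{k=1}^{K} \pi_k = 1
```

Each term N(x | μ_k, Σ_k) is a component, and π_k is its mixing coefficient.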

Unsupervised learning

Model data as a mixture of multivariate Gaussians

Shown is the posterior probability that a point was generated from the i-th Gaussian: Pr(Y = i | x)

ML estimation in supervised setting

•  Univariate Gaussian

•  Mixture of Multivariate Gaussians

ML estimate for each of the Multivariate Gaussians is given by:

Just sums over x generated from the k’th Gaussian

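$$\mu^{k}_{ML} = \frac{1}{n}\sum_{j=1}^{n} x_j \qquad \Sigma^{k}_{ML} = \frac{1}{n}\sum_{j=1}^{n} \left(x_j - \mu^{k}_{ML}\right)\left(x_j - \mu^{k}_{ML}\right)^T$$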
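When the labels are observed, the per-component estimates reduce to a mean and a covariance computed over each class separately. The sketch below assumes integer labels 0..k-1 and at least one point per class; `supervised_mle` and the toy data are hypothetical names and values used only for illustration.

```python
import numpy as np

def supervised_mle(X, y, k):
    """Per-class ML estimates of mean and covariance when labels y are observed."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    means, covs = [], []
    for c in range(k):
        Xc = X[y == c]            # only the points generated by component c
        mu = Xc.mean(axis=0)
        diff = Xc - mu
        # The ML estimate divides by the count, not by count - 1.
        covs.append(diff.T @ diff / len(Xc))
        means.append(mu)
    return np.array(means), np.array(covs)

# Toy labelled data (hypothetical): two points per class.
X = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 10.0], [10.0, 12.0]])
y = np.array([0, 0, 1, 1])
means, covs = supervised_mle(X, y, k=2)
```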

That was easy! But what if unobserved data?

•  MLE:

–  $\arg\max_\theta \prod_j P(y_j, x_j)$

–  θ: all model parameters

    •  e.g., class probs, means, and variances

•  But we don’t know the yj’s!!!

•  Maximize marginal likelihood:

–  $\arg\max_\theta \prod_j P(x_j) = \arg\max_\theta \prod_j \sum_{k=1}^{K} P(Y_j = k, x_j)$
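The marginal likelihood is usually evaluated in log form, summing over the hidden label inside the logarithm. The following is a minimal sketch for 1-D data with a shared, known standard deviation; the function name `marginal_log_likelihood` and its parameters are assumptions made for illustration.

```python
import numpy as np

def marginal_log_likelihood(x, weights, means, sigma=1.0):
    """Sum over j of log sum_k P(Y_j = k) N(x_j; mu_k, sigma^2) for 1-D data."""
    x = np.asarray(x, dtype=float)[:, None]            # shape (n, 1)
    means = np.asarray(means, dtype=float)[None, :]    # shape (1, K)
    log_w = np.log(np.asarray(weights, dtype=float))[None, :]
    # Log of the joint P(Y_j = k, x_j) for every point j and component k.
    log_joint = (log_w
                 - 0.5 * np.log(2.0 * np.pi * sigma**2)
                 - (x - means) ** 2 / (2.0 * sigma**2))
    # Log-sum-exp over k, stabilised by subtracting the row maximum.
    row_max = log_joint.max(axis=1, keepdims=True)
    log_px = row_max[:, 0] + np.log(np.exp(log_joint - row_max).sum(axis=1))
    return log_px.sum()

# With identical components the mixture collapses to a single Gaussian.
ll = marginal_log_likelihood([0.0], weights=[0.5, 0.5], means=[0.0, 0.0])
```

Because the sum over k sits inside the logarithm, the objective does not decompose per component the way the fully observed likelihood does.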

EM: Two Easy Steps

Objective: $\arg\max_\theta \log \prod_j \sum_{k=1}^{K} P(Y_j = k, x_j \mid \theta) = \arg\max_\theta \sum_j \log \sum_{k=1}^{K} P(Y_j = k, x_j \mid \theta)$

Data: {xj | j=1 .. n}

•  E-step: Compute expectations to “fill in” missing y values according to current parameters, θ

–  For all examples j and values k for Yj, compute: $P(Y_j = k \mid x_j, \theta)$

•  M-step: Re-estimate the parameters with “weighted” MLE estimates

–  Set $\theta = \arg\max_\theta \sum_j \sum_k P(Y_j = k \mid x_j, \theta) \log P(Y_j = k, x_j \mid \theta)$, where the weights $P(Y_j = k \mid x_j, \theta)$ are held fixed at the values computed in the E-step

Especially useful when the E and M steps have closed form solutions!!!

Simple example: learn means only!

Consider:

•  1D data

•  Mixture of k=2 Gaussians

•  Variances fixed to σ=1

•  Distribution over classes is uniform

•  Just need to estimate μ1 and μ2

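$$\prod_{j=1}^{n} \sum_{k=1}^{K} P(x_j, Y_j = k) \propto \prod_{j=1}^{n} \sum_{k=1}^{K} \exp\left[-\frac{1}{2\sigma^2}\,\lVert x_j - \mu_k \rVert^2\right] P(Y_j = k)$$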


EM for GMMs: only learning means

Iterate: On the t-th iteration let our estimates be $\lambda_t = \{\mu_1^{(t)}, \mu_2^{(t)}, \ldots, \mu_K^{(t)}\}$

E-step: Compute “expected” classes of all datapoints

$$P(Y_j = k \mid x_j, \mu_1 \ldots \mu_K) \propto \exp\left(-\frac{1}{2\sigma^2}\,\lVert x_j - \mu_k \rVert^2\right) P(Y_j = k)$$

M-step: Compute most likely new μs given class expectations

$$\mu_k = \frac{\sum_{j=1}^{n} P(Y_j = k \mid x_j)\, x_j}{\sum_{j=1}^{n} P(Y_j = k \mid x_j)}$$
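The means-only iteration can be sketched directly from the E-step and M-step formulas. This is an illustrative implementation under the stated assumptions (1-D data, fixed σ, uniform class prior); the function name `em_means_only`, the synthetic data, and the initial means are hypothetical.

```python
import numpy as np

def em_means_only(x, mu_init, sigma=1.0, n_iter=50):
    """EM for a 1-D mixture with fixed variance and uniform class prior; learns means only."""
    x = np.asarray(x, dtype=float)
    mu = np.asarray(mu_init, dtype=float).copy()
    for _ in range(n_iter):
        # E-step: P(Y_j = k | x_j) is proportional to exp(-(x_j - mu_k)^2 / (2 sigma^2));
        # the uniform prior P(Y_j = k) cancels in the normalisation.
        log_r = -((x[:, None] - mu[None, :]) ** 2) / (2.0 * sigma**2)
        log_r -= log_r.max(axis=1, keepdims=True)   # numerical stability
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: each mean is the responsibility-weighted average of the data.
        mu = (r * x[:, None]).sum(axis=0) / r.sum(axis=0)
    return mu, r

# Synthetic data (hypothetical): two well-separated clusters around -3 and +3.
rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(-3.0, 1.0, 200), rng.normal(3.0, 1.0, 200)])
mu_hat, resp = em_means_only(data, mu_init=[-1.0, 1.0])
```

Each row of `resp` is the soft assignment of one datapoint, so the rows sum to one; `mu_hat` should land near the two cluster centres when the clusters are well separated.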

Gaussian Mixture Example: Start

After first iteration

After 2nd iteration

After 3rd iteration

After 4th iteration

After 5th iteration

After 6th iteration

After 20th iteration

Jensen’s inequality

•  Theorem: log ∑z P(z) f(z) ≥ ∑z P(z) log f(z)

–  [Figure: binary case for a convex function f]
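The inequality holds in this direction because log is concave. A quick numerical check, using an arbitrary distribution P(z) and arbitrary positive values f(z) (both made up for illustration), can be sketched as:

```python
import numpy as np

rng = np.random.default_rng(0)
p = rng.random(5)
p /= p.sum()                  # a distribution P(z) over 5 values of z
f = rng.random(5) + 0.1       # arbitrary positive values f(z)

lhs = np.log(np.sum(p * f))   # log of the expectation
rhs = np.sum(p * np.log(f))   # expectation of the log
gap = lhs - rhs               # non-negative by Jensen's inequality
```

The gap closes to zero when f(z) is constant across z, which is the equality case of the theorem.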
