

CSCI567 Machine Learning (Fall 2014)

Drs. Sha & Liu

{feisha,yanliu.cs}@usc.edu

October 30, 2014


Administrative matters

Outline

1 Administrative matters
    Quiz 1 Grades
    Mini-project Out

2 Review of last lecture

3 Summary

4 Clustering


Administrative matters Quiz 1 Grades

Quiz 1 Grades finished and uploaded to Blackboard

Statistics: minimum: 31, maximum: 98, average: 66, median: 66, std: 14.28

Distribution: (histogram omitted)


Administrative matters Mini-project Out

Kaggle Competition on Click-Through Rate Prediction

Website https://www.kaggle.com/c/avazu-ctr-prediction

Group formation A group should consist of 1-5 people. One person can participate in only one group.


Review of last lecture

Outline

1 Administrative matters

2 Review of last lecture
    Neural networks
    Deep Neural Networks (DNNs)

3 Summary

4 Clustering


Review of last lecture Neural networks

Basic idea

Learning nonlinear basis functions and classifiers

Hidden layers are nonlinear mappings from input features to new representations

Output layers use the new representations for classification and regression

Learning parameters

Stochastic gradient descent

Large-scale computing


Review of last lecture Neural networks

Key steps (essentially, chain rule in calculus)

To compute ∂ℓ/∂w_ji, we compute

    ∂ℓ/∂w_ji = z_i · ∂ℓ/∂a_j

as w_ji affects only a_j.

Nonlinear pass-through

    ∂ℓ/∂a_j = (∂ℓ/∂z_j) · (∂z_j/∂a_j) = h′(a_j) · ∂ℓ/∂z_j

Recursion

    ∂ℓ/∂z_j = ∑_k (∂ℓ/∂a_k) · (∂a_k/∂z_j)
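To make these three rules concrete, here is a small illustrative NumPy sketch (not from the lecture) for a one-hidden-layer network with tanh units and squared loss; all variable names (W1, W2, a1, ...) are invented for this example, and the last lines check one gradient entry against a finite difference.

```python
import numpy as np

# Minimal sketch: one hidden layer, tanh nonlinearity h, squared loss.
# Notation follows the slide: a_j = sum_i w_ji z_i, z_j = h(a_j).
rng = np.random.default_rng(0)
x = rng.standard_normal(3)          # input features (the z's of the input layer)
y = rng.standard_normal(2)          # regression target
W1 = rng.standard_normal((4, 3))    # weights into the hidden layer
W2 = rng.standard_normal((2, 4))    # weights into the output layer

# Forward pass
a1 = W1 @ x          # pre-activations of hidden units
z1 = np.tanh(a1)     # hidden activations
a2 = W2 @ z1         # output pre-activations (identity output units)
loss = 0.5 * np.sum((a2 - y) ** 2)

# Backward pass, using the three displayed rules
dl_da2 = a2 - y                            # dl/da at the output layer
dl_dW2 = np.outer(dl_da2, z1)              # dl/dw_ji = z_i * dl/da_j
dl_dz1 = W2.T @ dl_da2                     # recursion: dl/dz_j = sum_k dl/da_k * da_k/dz_j
dl_da1 = (1 - np.tanh(a1) ** 2) * dl_dz1   # nonlinear pass-through: h'(a_j) * dl/dz_j
dl_dW1 = np.outer(dl_da1, x)               # dl/dw_ji = z_i * dl/da_j again

# Finite-difference check of one entry of dl/dW1
eps = 1e-6
W1p = W1.copy(); W1p[0, 0] += eps
loss_p = 0.5 * np.sum((W2 @ np.tanh(W1p @ x) - y) ** 2)
print(dl_dW1[0, 0], (loss_p - loss) / eps)   # the two numbers should agree closely
```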


Review of last lecture Deep Neural Networks (DNNs)

Basic idea behind DNNs

Architecturally, a big neural network (with a lot of variants)

in depth: 4-5 layers are common (Google LeNet uses more than 20)

in width: the number of hidden units in each layer can be a few thousand

the number of parameters: hundreds of millions, even billions

Algorithmically, many new things

Pre-training: do not do error backpropagation right away

Layer-wise greedy: train one layer at a time

...

Computing

Heavy computation: both in speed and in coping with a lot of data

Ex: fast Graphics Processing Units (GPUs) are almost indispensable


Review of last lecture Deep Neural Networks (DNNs)

Good references

Easy to find as DNNs are very popular these days

Many, many online video tutorials

Good open-source packages: Theano, cuDNN, Caffe, etc.

Examples:

Wikipedia entry on "Deep Learning" (http://en.wikipedia.org/wiki/Deep_learning) provides a decent portal to many topics, including deep belief networks and convolutional nets

A collection of tutorials and code for implementing them in Python: http://www.deeplearning.net/tutorial/


Summary

Outline

1 Administrative matters

2 Review of last lecture

3 Summary
    Supervised learning

4 Clustering


Summary Supervised learning

Summary of the course so far: a short list of important concepts

Supervised learning has been our focus

Setup: given a training dataset {x_n, y_n}_{n=1}^N, we learn a function h(x) to predict x's true value y (i.e., regression or classification)

Linear vs. nonlinear features
1 Linear: h(x) depends on w^T x
2 Nonlinear: h(x) depends on w^T φ(x), which in turn depends on a kernel function k(x_m, x_n) = φ(x_m)^T φ(x_n)

Loss functions
1 Squared loss: least squares for regression (minimizing the residual sum of errors)
2 Logistic loss: logistic regression
3 Exponential loss: AdaBoost
4 Margin-based loss: support vector machines

Principles of estimation
1 Point estimate: maximum likelihood, regularized likelihood


Summary Supervised learning

cont’d

Optimization
1 Methods: gradient descent, Newton's method
2 Convex optimization: global optimum vs. local optimum
3 Lagrange duality: primal and dual formulations

Learning theory
1 Difference between training error and generalization error
2 Overfitting, bias-variance tradeoff
3 Regularization: various regularized models


Clustering

Outline

1 Administrative matters

2 Review of last lecture

3 Summary

4 Clustering
    Gaussian mixture models
    EM Algorithm


Clustering

Clustering

Setup Given D = {x_n}_{n=1}^N and K, we want to output

{µ_k}_{k=1}^K: prototypes of the clusters

A(x_n) ∈ {1, 2, . . . , K}: the cluster membership, i.e., the cluster ID assigned to x_n

Example Cluster the data into two clusters.

(Figure: the same data shown before clustering (a) and after clustering into two groups (i); both axes range from −2 to 2.)


Clustering

K-means clustering

Intuition Data points assigned to cluster k should be close to µ_k, the prototype.

Distortion measure (clustering objective function, cost function)

    J = ∑_{n=1}^N ∑_{k=1}^K r_nk ‖x_n − µ_k‖_2^2

where r_nk ∈ {0, 1} is an indicator variable

    r_nk = 1 if and only if A(x_n) = k
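As a small aside (not part of the slides), the distortion J is straightforward to compute from hard assignments; the helper name distortion and the toy data below are made up for illustration.

```python
import numpy as np

def distortion(X, mu, assign):
    """J = sum_n ||x_n - mu_{A(x_n)}||^2 for hard assignments assign[n] in {0, ..., K-1}."""
    return np.sum((X - mu[assign]) ** 2)

# Tiny example with N = 4 points and K = 2 prototypes
X = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [1.1, 1.0]])
mu = np.array([[0.05, 0.0], [1.05, 1.0]])
assign = np.array([0, 0, 1, 1])
print(distortion(X, mu, assign))  # 0.01
```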


Clustering

Algorithm

Minimize the distortion measure by alternating optimization between {r_nk} and {µ_k}

Step 0 Initialize {µ_k} to some values

Step 1 Assume the current values of {µ_k} are fixed; minimize J over {r_nk}, which leads to the following cluster assignment rule

    r_nk = 1 if k = argmin_j ‖x_n − µ_j‖_2^2, and 0 otherwise

Step 2 Assume the current values of {r_nk} are fixed; minimize J over {µ_k}, which leads to the following rule for updating the prototypes of the clusters

    µ_k = ∑_n r_nk x_n / ∑_n r_nk

Step 3 Determine whether to stop or return to Step 1
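Below is a minimal NumPy sketch of Steps 0-3 (an illustration only, not the course's reference implementation); the function kmeans and its arguments are invented for this example.

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Alternate between the assignment rule (Step 1) and the mean update (Step 2)."""
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    mu = X[rng.choice(N, size=K, replace=False)]   # Step 0: initialize prototypes
    assign = np.zeros(N, dtype=int)
    for _ in range(n_iters):
        # Step 1: assign each point to the nearest prototype
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)   # N x K squared distances
        assign = d2.argmin(axis=1)
        # Step 2: recompute each prototype as the mean of its assigned points
        new_mu = np.array([X[assign == k].mean(axis=0) if np.any(assign == k) else mu[k]
                           for k in range(K)])
        # Step 3: stop when the prototypes no longer move
        if np.allclose(new_mu, mu):
            break
        mu = new_mu
    return mu, assign

# Usage on two well-separated blobs
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2, 0.3, size=(50, 2)), rng.normal(2, 0.3, size=(50, 2))])
mu, assign = kmeans(X, K=2)
print(mu)
```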


Clustering

Remarks

The prototype µ_k is the mean of the data points assigned to cluster k, hence the name K-means clustering.

The procedure terminates after a finite number of steps (in general, assuming there is no tie when comparing distances in Step 1), as the procedure reduces J in both Step 1 and Step 2. Since J is lower-bounded by 0, the procedure cannot run forever.

There is no guarantee that the procedure terminates at the global optimum of J — in most cases, the algorithm stops at a local optimum, which depends on the initial values in Step 0.


Clustering

Example of running K-means algorithm

(Figure: nine panels (a)-(i) showing successive iterations of K-means on the two-cluster data, alternating assignment and mean-update steps; both axes range from −2 to 2.)


Clustering

Application: vector quantization

We can replace our data points with the prototypes µ_k of the clusters they are assigned to. This is called vector quantization. In other words, we have compressed the data points into i) a codebook of all the prototypes; ii) a list of indices into the codebook, one per data point. This compression is obviously lossy, as information will be lost if we use a very small K.

Clustering the pixels in an image and vector-quantizing them. From left to right: original image, then quantized versions with a large K, a medium K, and a small K. Details go missing as the compression increases (smaller K).
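A small sketch of this idea (illustrative only; a synthetic random "image" stands in for a real one, and the variable names are made up): the pixels are clustered with a few K-means iterations, and the image is then represented by the codebook plus one index per pixel.

```python
import numpy as np

# Vector-quantize a synthetic "image": each pixel is an RGB vector in [0, 1].
rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))
pixels = image.reshape(-1, 3)          # N x 3 data points

K = 8                                  # codebook size
codebook = pixels[rng.choice(len(pixels), size=K, replace=False)]
for _ in range(20):                    # a few K-means iterations
    d2 = ((pixels[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    codes = d2.argmin(axis=1)          # index into the codebook for every pixel
    for k in range(K):
        if np.any(codes == k):
            codebook[k] = pixels[codes == k].mean(axis=0)

# Compressed form: the codebook plus one small integer per pixel.
quantized = codebook[codes].reshape(image.shape)   # lossy reconstruction
print(codebook.shape, codes.shape, quantized.shape)
```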


Clustering Gaussian mixture models

Probabilistic interpretation of clustering?

We can impose a probabilistic interpretation on the intuition that data points stay close to the centers of their clusters. This is just a statement about what p(x) looks like — we will see how to model this distribution.

(Figure (b): scatter plot of the data on [0, 1] × [0, 1].)

The data points seem to form 3 clusters. However, we cannot model p(x) with a simple, known distribution. For example, the data do not follow a single Gaussian distribution, as there are seemingly 3 regions where the data concentrate.


Clustering Gaussian mixture models

Gaussian mixture models: intuition

(Figure (a): the same data, with each of the 3 regions shown in a different color.)

Instead, we will model each region with a Gaussian distribution. This leads to the idea of Gaussian mixture models (GMMs), or mixtures of Gaussians (MoGs).

The problem we now face is that i) we do not know which (color) region a data point comes from, and ii) we do not know the parameters of the Gaussian distribution in each region. We need to estimate all of them from the unsupervised data D = {x_n}_{n=1}^N.


Clustering Gaussian mixture models

Gaussian mixture models: formal definition

A Gaussian mixture model has the following density function for x

    p(x) = ∑_{k=1}^K ω_k N(x | µ_k, Σ_k)

where

K: the number of Gaussians — they are called (mixture) components

µ_k and Σ_k: mean and covariance matrix of the k-th component

ω_k: mixture weights — they represent how much each component contributes to the final distribution. They satisfy two properties:

    ∀ k, ω_k > 0, and ∑_k ω_k = 1

These properties ensure that p(x) is a properly normalized probability density function.
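As an illustration (not from the slides), this density is easy to evaluate directly; the component parameters below are invented for the example.

```python
import numpy as np

def gaussian_pdf(x, mu, Sigma):
    """Multivariate normal density N(x | mu, Sigma) for a single point x."""
    d = len(mu)
    diff = x - mu
    inv = np.linalg.inv(Sigma)
    norm = 1.0 / np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))
    return norm * np.exp(-0.5 * diff @ inv @ diff)

def gmm_pdf(x, omegas, mus, Sigmas):
    """p(x) = sum_k omega_k N(x | mu_k, Sigma_k)."""
    return sum(w * gaussian_pdf(x, m, S) for w, m, S in zip(omegas, mus, Sigmas))

# Example mixture with K = 2 components in 2 dimensions
omegas = np.array([0.3, 0.7])                       # mixture weights, sum to 1
mus = [np.zeros(2), np.array([3.0, 3.0])]
Sigmas = [np.eye(2), 0.5 * np.eye(2)]
print(gmm_pdf(np.array([0.1, -0.2]), omegas, mus, Sigmas))
```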


Clustering Gaussian mixture models

GMM as the marginal distribution of a joint distribution

Consider the following joint distribution

    p(x, z) = p(z) p(x | z)

where z is a discrete random variable taking values between 1 and K. Denote

    ω_k = p(z = k)

and furthermore, assume the conditional distributions are Gaussian

    p(x | z = k) = N(x | µ_k, Σ_k)

Then the marginal distribution of x is

    p(x) = ∑_{k=1}^K ω_k N(x | µ_k, Σ_k)

namely, the Gaussian mixture model.
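This generative view also gives a simple recipe for sampling from a GMM, sketched below with made-up parameters: draw z from the categorical distribution with probabilities ω_k, then draw x from the corresponding Gaussian.

```python
import numpy as np

def sample_gmm(n, omegas, mus, Sigmas, seed=0):
    """Draw n samples by first sampling z ~ Categorical(omega), then x ~ N(mu_z, Sigma_z)."""
    rng = np.random.default_rng(seed)
    z = rng.choice(len(omegas), size=n, p=omegas)      # hidden component labels
    x = np.array([rng.multivariate_normal(mus[k], Sigmas[k]) for k in z])
    return x, z

omegas = [0.5, 0.3, 0.2]
mus = [np.array([0.0, 0.0]), np.array([4.0, 0.0]), np.array([0.0, 4.0])]
Sigmas = [np.eye(2) * 0.5] * 3
x, z = sample_gmm(500, omegas, mus, Sigmas)
print(x.shape, np.bincount(z) / len(z))   # empirical component frequencies ≈ omegas
```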


Clustering Gaussian mixture models

GMMs: example

(Figure (a): the data colored by region.)

The conditional distributions of x given z (representing color) are

    p(x | z = 'red') = N(x | µ_1, Σ_1)
    p(x | z = 'blue') = N(x | µ_2, Σ_2)
    p(x | z = 'green') = N(x | µ_3, Σ_3)

(Figure (b): the same data without the color labels.)

The marginal distribution is thus

    p(x) = p('red') N(x | µ_1, Σ_1) + p('blue') N(x | µ_2, Σ_2) + p('green') N(x | µ_3, Σ_3)


Clustering Gaussian mixture models

Parameter estimation for Gaussian mixture models

The parameters of a GMM are θ = {ω_k, µ_k, Σ_k}_{k=1}^K. To estimate them, consider the simple case first.

z is given If we assume z is observed for every x, then our estimation problem is easier to solve. In particular, our training data is augmented to

    D′ = {x_n, z_n}_{n=1}^N

Note that, for every x_n, we have a z_n denoting the region/color that the specific x_n comes from. We call D′ the complete data and D the incomplete data.

Given D′, the maximum likelihood estimate of θ is given by

    θ = argmax_θ log p(D′) = argmax_θ ∑_n log p(x_n, z_n)


Clustering Gaussian mixture models

Parameter estimation for GMMs: complete data

The likelihood — which we will refer to as the complete likelihood — is decomposable

    ∑_n log p(x_n, z_n) = ∑_n log p(z_n) p(x_n | z_n) = ∑_k ∑_{n: z_n = k} log p(z_n) p(x_n | z_n)

where we have grouped the data by the values of z_n. Let us introduce a binary variable γ_nk ∈ {0, 1} to indicate whether z_n = k. We can rewrite our decomposition as

    ∑_n log p(x_n, z_n) = ∑_k ∑_n γ_nk log p(z = k) p(x_n | z = k)

Note that we have used a "dummy" variable z to denote all the possible values x_n's true z_n can take — but only one of those values is given in D′.


Clustering Gaussian mixture models

Parameter estimation for GMMs: solution for complete data

Substituting our assumption about the conditional distributions, we have

    ∑_n log p(x_n, z_n) = ∑_k ∑_n γ_nk [log ω_k + log N(x_n | µ_k, Σ_k)]

Regrouping, we have

    ∑_n log p(x_n, z_n) = ∑_k ∑_n γ_nk log ω_k + ∑_k { ∑_n γ_nk log N(x_n | µ_k, Σ_k) }

Note that the term inside the braces depends only on the k-th component's parameters. It is now easy to show (left as a homework exercise) that the maximum likelihood estimates of the parameters are

    ω_k = ∑_n γ_nk / ∑_k ∑_n γ_nk,   µ_k = (1 / ∑_n γ_nk) ∑_n γ_nk x_n

    Σ_k = (1 / ∑_n γ_nk) ∑_n γ_nk (x_n − µ_k)(x_n − µ_k)^T


Clustering Gaussian mixture models

Intuition

Since γ_nk is binary, the previous solution is nothing but:

For ω_k: count the number of data points whose z_n is k and divide by the total number of data points (note that ∑_k ∑_n γ_nk = N)

For µ_k: take all the data points whose z_n is k and compute their mean

For Σ_k: take all the data points whose z_n is k and compute their covariance matrix

This intuition is going to help us develop an algorithm for estimating θ when we do not know z_n.
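A minimal sketch of these counting formulas when z_n is observed (hard, binary γ_nk); the function name mle_complete and the synthetic data are invented for illustration.

```python
import numpy as np

def mle_complete(X, z, K):
    """ML estimates of (omega_k, mu_k, Sigma_k) from complete data {x_n, z_n}."""
    N, d = X.shape
    omegas, mus, Sigmas = [], [], []
    for k in range(K):
        Xk = X[z == k]                           # all points whose z_n is k
        omegas.append(len(Xk) / N)               # fraction of points in component k
        mus.append(Xk.mean(axis=0))              # their mean
        diff = Xk - mus[-1]
        Sigmas.append(diff.T @ diff / len(Xk))   # their covariance (MLE, divides by N_k)
    return np.array(omegas), np.array(mus), np.array(Sigmas)

rng = np.random.default_rng(0)
z = rng.choice(2, size=200, p=[0.3, 0.7])
X = np.where(z[:, None] == 0, rng.normal(0, 1, (200, 2)), rng.normal(5, 1, (200, 2)))
omegas, mus, Sigmas = mle_complete(X, z, K=2)
print(omegas, mus, sep="\n")
```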


Clustering Gaussian mixture models

Parameter estimation for GMMs: incomplete data

When z_n is not given, we can guess which region/color x_n comes from by computing the posterior probability

    p(z_n = k | x_n) = p(x_n | z_n = k) p(z_n = k) / p(x_n)
                     = p(x_n | z_n = k) p(z_n = k) / ∑_{k′=1}^K p(x_n | z_n = k′) p(z_n = k′)

Note that, to compute the posterior probability, we need to know the parameters θ. Let us, for a second, pretend we know the values of the parameters, so that we can compute the posterior probability.

How is that going to help us?


Clustering Gaussian mixture models

Estimation with soft γnk

We are going to treat p(z_n = k | x_n) as γ_nk, which should be binary but is now regarded as a "soft" assignment of x_n to the k-th component. With that in mind, we have

    γ_nk = p(z_n = k | x_n)

    ω_k = ∑_n γ_nk / ∑_k ∑_n γ_nk

    µ_k = (1 / ∑_n γ_nk) ∑_n γ_nk x_n

    Σ_k = (1 / ∑_n γ_nk) ∑_n γ_nk (x_n − µ_k)(x_n − µ_k)^T

In other words, every data point x_n is assigned to a component fractionally according to p(z_n = k | x_n) — this quantity is sometimes also called the "responsibility".


Clustering Gaussian mixture models

Iterative procedure

Since we do not know θ to begin with, we cannot compute the soft γ_nk. However, we can invoke an iterative procedure and alternate between estimating γ_nk and using the estimated γ_nk to compute the parameters

Step 0: guess θ with initial values

Step 1: compute γ_nk using the current θ

Step 2: update θ using the just-computed γ_nk

Step 3: go back to Step 1

Questions: i) is this procedure correct, i.e., does it optimize a sensible criterion? ii) practically, will this procedure ever stop instead of iterating forever?

The answer lies in the EM algorithm — a powerful procedure for model estimation with hidden (unobserved) data.
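A compact sketch of this alternating procedure for GMMs (illustrative only, not numerically hardened; the function name em_gmm and the small covariance regularizer are choices made for this example):

```python
import numpy as np

def em_gmm(X, K, n_iters=50, seed=0):
    """Alternate E-step (responsibilities) and M-step (parameter updates) for a GMM."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    # Step 0: crude initial guesses for theta = (omega, mu, Sigma)
    omegas = np.full(K, 1.0 / K)
    mus = X[rng.choice(N, size=K, replace=False)]
    Sigmas = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(K)])
    log_likes = []
    for _ in range(n_iters):
        # E-step: gamma_nk = p(z_n = k | x_n; theta), an N x K matrix
        dens = np.empty((N, K))
        for k in range(K):
            diff = X - mus[k]
            inv = np.linalg.inv(Sigmas[k])
            quad = np.einsum('ni,ij,nj->n', diff, inv, diff)
            norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigmas[k]))
            dens[:, k] = omegas[k] * np.exp(-0.5 * quad) / norm
        log_likes.append(np.log(dens.sum(axis=1)).sum())   # incomplete log-likelihood
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # M-step: weighted versions of the complete-data formulas
        Nk = gamma.sum(axis=0)
        omegas = Nk / N
        mus = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mus[k]
            Sigmas[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(d)
    return omegas, mus, Sigmas, log_likes
```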


Clustering EM Algorithm

EM algorithm: motivation and setup

As a general procedure, EM is used to estimate the parameters of probabilistic models with hidden/latent variables. Suppose the model is given by a joint distribution

    p(x | θ) = ∑_z p(x, z | θ)

where x is the observed random variable and z is hidden.

We are given data containing only the observed variable, D = {x_n}, where the corresponding hidden variable values z_n are not included. Our goal is to obtain the maximum likelihood estimate of θ. Namely, we choose

    θ = argmax_θ ℓ(θ) = argmax_θ ∑_n log p(x_n | θ) = argmax_θ ∑_n log ∑_{z_n} p(x_n, z_n | θ)

The objective function ℓ(θ) is called the incomplete log-likelihood.


Clustering EM Algorithm

Expected (complete) log-likelihood

The difficulty with the incomplete log-likelihood is that it sums over all possible values that z_n can take and then takes the logarithm. This log-sum form makes the computation intractable. Instead, the EM algorithm uses a clever trick to change this into a sum-log form.

To this end, we define the following

    Q_q(θ) = ∑_n E_{z_n ∼ q(z_n)} log p(x_n, z_n | θ) = ∑_n ∑_{z_n} q(z_n) log p(x_n, z_n | θ)

which is called the expected (complete) log-likelihood with respect to q(z), where q(z) is a distribution over z. Note that Q_q(θ) takes the sum-log form, which turns out to be tractable.


Clustering EM Algorithm

Examples

Consider the previous model where x could come from 3 regions. We can choose q(z) to be any valid distribution; different choices lead to different Q_q(θ). Note that z here represents the different colors.

q(z = k) = 1/3 for each of the 3 colors. This gives rise to

    Q_q(θ) = ∑_n (1/3) [log p(x_n, 'red' | θ) + log p(x_n, 'blue' | θ) + log p(x_n, 'green' | θ)]

q(z = k) = 1/2 for 'red' and 'blue', 0 for 'green'. This gives rise to

    Q_q(θ) = ∑_n (1/2) [log p(x_n, 'red' | θ) + log p(x_n, 'blue' | θ)]


Clustering EM Algorithm

Which q(z) to choose?

We will choose a special q(z) = p(z | x; θ), i.e., the posterior probability of z. We define

    Q(θ) = Q_{z ∼ p(z|x;θ)}(θ)

and we will show that

    ℓ(θ) = Q(θ) + ∑_n H[p(z | x_n; θ)]

where H[p] is the entropy of the probability distribution p:

    H[p(x)] = −∫ p(x) log p(x) dx


Clustering EM Algorithm

Proof

    Q(θ) = ∑_n ∑_{z_n} p(z_n | x_n; θ) log p(x_n, z_n | θ)
         = ∑_n ∑_{z_n} p(z_n | x_n; θ) [log p(x_n | θ) + log p(z_n | x_n; θ)]
         = ∑_n ∑_{z_n} p(z_n | x_n; θ) log p(x_n | θ) + ∑_n ∑_{z_n} p(z_n | x_n; θ) log p(z_n | x_n; θ)
         = ∑_n log p(x_n | θ) ∑_{z_n} p(z_n | x_n; θ) − ∑_n H[p(z | x_n; θ)]
         = ∑_n log p(x_n | θ) − ∑_n H[p(z | x_n; θ)]
         = ℓ(θ) − ∑_n H[p(z | x_n; θ)]


Clustering EM Algorithm

A computable Q(θ)

As before, Q(θ) cannot be computed, as it depends on the unknown parameter values θ to compute the posterior probability p(z | x; θ). Instead, we will use a known value θ^old to compute the expected likelihood

    Q(θ, θ^old) = ∑_n ∑_{z_n} p(z_n | x_n; θ^old) log p(x_n, z_n | θ)

Note that, in the above, the variable is θ; θ^old is assumed to be known. By definition, Q(θ) = Q(θ, θ).

However, how does Q(θ, θ^old) relate to ℓ(θ)? We will show that

    ℓ(θ) ≥ Q(θ, θ^old) + ∑_n H[p(z | x_n; θ^old)]

Thus, in a way, Q(θ) is better than Q(θ, θ^old) (because we have equality there), except that we cannot compute the former.


Clustering EM Algorithm

Proof

    ℓ(θ) = ∑_n log ∑_{z_n} p(z_n | x_n; θ^old) · p(x_n, z_n | θ) / p(z_n | x_n; θ^old)
         ≥ ∑_n ∑_{z_n} p(z_n | x_n; θ^old) log [ p(x_n, z_n | θ) / p(z_n | x_n; θ^old) ]
         = ∑_n ∑_{z_n} p(z_n | x_n; θ^old) log p(x_n, z_n | θ) − ∑_n ∑_{z_n} p(z_n | x_n; θ^old) log p(z_n | x_n; θ^old)
         = Q(θ, θ^old) + ∑_n H[p(z | x_n; θ^old)]

The inequality (≥) holds because log is a concave function:

    log ∑_i w_i x_i ≥ ∑_i w_i log x_i,   ∀ w_i ≥ 0, ∑_i w_i = 1

and in our case the w_i are p(z_n | x_n; θ^old).


Clustering EM Algorithm

Putting things together: auxiliary function

So far we have shown a lower bound on the log-likelihood

    ℓ(θ) ≥ A(θ, θ^old) = Q(θ, θ^old) + ∑_n H[p(z | x_n; θ^old)]

We will call the right-hand side an auxiliary function.

This auxiliary function has an important property: when θ = θ^old,

    A(θ, θ) = ℓ(θ)


Clustering EM Algorithm

Use auxiliary function to increase log-likelihood

Suppose we have an initial guess θ^old; we then maximize the auxiliary function

    θ^new = argmax_θ A(θ, θ^old)

With the new guess, we have

    ℓ(θ^new) ≥ A(θ^new, θ^old) ≥ A(θ^old, θ^old) = ℓ(θ^old)

Repeating this process, we have

    ℓ(θ^even newer) ≥ ℓ(θ^new) ≥ ℓ(θ^old)

where

    θ^even newer = argmax_θ A(θ, θ^new)


Clustering EM Algorithm

Iterative and monotonic improvement

Thus, by maximizing the auxiliary function, we obtain a sequence of guesses

    θ^old, θ^new, θ^even newer, · · ·

that keeps increasing the likelihood. This process will eventually stop if the likelihood is bounded from above (i.e., less than +∞). This is the core of the EM algorithm.

Expectation-Maximization (EM)

Step 0: Initialize θ with θ^(0)

Step 1 (E-step): Compute the auxiliary function using the current value of θ: A(θ, θ^(t))

Step 2 (M-step): Maximize the auxiliary function:

    θ^(t+1) ← argmax_θ A(θ, θ^(t))

Step 3: Increase t to t + 1 and go back to Step 1; or stop if ℓ(θ^(t+1)) does not improve much over ℓ(θ^(t)).


Clustering EM Algorithm

Remarks

The EM procedure converges, but only to a local optimum; the global optimum is not guaranteed to be found.

The E-step depends on computing the posterior probability

    p(z_n | x_n; θ^(t))

The M-step does not depend on the entropy term, so we only need to do the following

    θ^(t+1) ← argmax_θ A(θ, θ^(t)) = argmax_θ Q(θ, θ^(t))

We often call the last term the Q-function.


Clustering EM Algorithm

Example: applying EM to GMMs

What is the E-step in GMM? We compute the responsibility

    γ_nk = p(z = k | x_n; θ^(t))

What is the M-step in GMM? The Q-function is

    Q(θ, θ^(t)) = ∑_n ∑_k p(z = k | x_n; θ^(t)) log p(x_n, z = k | θ)
                = ∑_n ∑_k γ_nk log p(x_n, z = k | θ)
                = ∑_k ∑_n γ_nk log p(z = k) p(x_n | z = k)
                = ∑_k ∑_n γ_nk [log ω_k + log N(x_n | µ_k, Σ_k)]

Hence, we have recovered the parameter estimation algorithm for GMMs seen previously. (We still need to do the maximization to get θ^(t+1) — left as homework.)
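As a quick illustration of EM's monotone improvement (assuming the em_gmm sketch from the earlier slide is in scope; it is not part of the course code), the incomplete log-likelihood recorded at each iteration should never decrease:

```python
import numpy as np

# Two well-separated Gaussian blobs as synthetic data
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(5, 1, (100, 2))])
omegas, mus, Sigmas, log_likes = em_gmm(X, K=2)   # em_gmm defined in the earlier sketch

# Monotonicity: each iteration's incomplete log-likelihood is at least the previous one
assert all(b >= a - 1e-6 for a, b in zip(log_likes, log_likes[1:]))
print(omegas.round(2), log_likes[0], log_likes[-1])
```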
