

CSCI567 Machine Learning (Fall 2014)

Drs. Sha & Liu

{feisha,yanliu.cs}@usc.edu

October 30, 2014


Administrative matters

Outline

1 Administrative matters
    Quiz 1 Grades
    Mini-project Out

2 Review of last lecture

3 Summary

4 Clustering


Administrative matters Quiz 1 Grades

Quiz 1 Grades finished and uploaded to Blackboard

Statistics: minimum: 31, maximum: 98, average: 66, median: 66, std: 14.28

Distribution: (histogram omitted)


Administrative matters Mini-project Out

Kaggle Competition on Click-Through Rate Prediction

Website https://www.kaggle.com/c/avazu-ctr-prediction

Group formation A group should consist of 1-5 people. One person can participate in only one group.


Review of last lecture

Outline

1 Administrative matters

2 Review of last lecture
    Neural networks
    Deep Neural Networks (DNNs)

3 Summary

4 Clustering


Review of last lecture Neural networks

Basic idea

Learning nonlinear basis functions and classifiers

Hidden layers are nonlinear mappings from input features to new representations

Output layers use the new representations for classification and regression

Learning parameters

Stochastic gradient descent

Large-scale computing


Review of last lecture Neural networks

Key steps (essentially, chain rule in calculus)

To compute ∂ℓ/∂w_ji, we compute

    ∂ℓ/∂w_ji = z_i · ∂ℓ/∂a_j

as w_ji affects only a_j.

Nonlinear pass-through

    ∂ℓ/∂a_j = (∂ℓ/∂z_j) · (∂z_j/∂a_j) = h′(a_j) · ∂ℓ/∂z_j

Recursion

    ∂ℓ/∂z_j = ∑_k (∂ℓ/∂a_k) · (∂a_k/∂z_j)
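To make these three rules concrete, here is a small illustrative NumPy sketch (not from the lecture) for a one-hidden-layer network with tanh units and squared loss; all variable names (W1, W2, a1, ...) are invented for this example, and the last lines check one gradient entry against a finite difference.

```python
import numpy as np

# Minimal sketch: one hidden layer, tanh nonlinearity h, squared loss.
# Notation follows the slide: a_j = sum_i w_ji z_i, z_j = h(a_j).
rng = np.random.default_rng(0)
x = rng.standard_normal(3)          # input features (the z's of the input layer)
y = rng.standard_normal(2)          # regression target
W1 = rng.standard_normal((4, 3))    # weights into the hidden layer
W2 = rng.standard_normal((2, 4))    # weights into the output layer

# Forward pass
a1 = W1 @ x          # pre-activations of hidden units
z1 = np.tanh(a1)     # hidden activations
a2 = W2 @ z1         # output pre-activations (identity output units)
loss = 0.5 * np.sum((a2 - y) ** 2)

# Backward pass, using the three displayed rules
dl_da2 = a2 - y                            # dl/da at the output layer
dl_dW2 = np.outer(dl_da2, z1)              # dl/dw_ji = z_i * dl/da_j
dl_dz1 = W2.T @ dl_da2                     # recursion: dl/dz_j = sum_k dl/da_k * da_k/dz_j
dl_da1 = (1 - np.tanh(a1) ** 2) * dl_dz1   # nonlinear pass-through: h'(a_j) * dl/dz_j
dl_dW1 = np.outer(dl_da1, x)               # dl/dw_ji = z_i * dl/da_j again

# Finite-difference check of one entry of dl/dW1
eps = 1e-6
W1p = W1.copy(); W1p[0, 0] += eps
loss_p = 0.5 * np.sum((W2 @ np.tanh(W1p @ x) - y) ** 2)
print(dl_dW1[0, 0], (loss_p - loss) / eps)   # the two numbers should agree closely
```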


Review of last lecture Deep Neural Networks (DNNs)

Basic idea behind DNNs

Architecturally, a big neural network (with a lot of variants)

in depth: 4-5 layers are common (Google LeNet uses more than 20)

in width: the number of hidden units in each layer can be a few thousand

the number of parameters: hundreds of millions, even billions

Algorithmically, many new things

Pre-training: do not do error backpropagation right away

Layer-wise greedy: train one layer at a time

...

Computing

Heavy computation: both in speed and in coping with a lot of data

Ex: fast Graphics Processing Units (GPUs) are almost indispensable


Review of last lecture Deep Neural Networks (DNNs)

Good references

Easy to find as DNNs are very popular these days

Many, many online video tutorials

Good open-source packages: Theano, cuDNN, Caffe, etc.

Examples:

Wikipedia entry on "Deep Learning" (http://en.wikipedia.org/wiki/Deep_learning) provides a decent portal to many topics, including deep belief networks and convolutional nets

A collection of tutorials and code for implementing them in Python: http://www.deeplearning.net/tutorial/


Summary

Outline

1 Administrative matters

2 Review of last lecture

3 Summary
    Supervised learning

4 Clustering


Summary Supervised learning

Summary of the course so far: a short list of important concepts

Supervised learning has been our focus

Setup: given a training dataset {x_n, y_n}_{n=1}^N, we learn a function h(x) to predict x's true value y (i.e., regression or classification)

Linear vs. nonlinear features
1 Linear: h(x) depends on w^T x
2 Nonlinear: h(x) depends on w^T φ(x), which in turn depends on a kernel function k(x_m, x_n) = φ(x_m)^T φ(x_n)

Loss functions
1 Squared loss: least squares for regression (minimizing the residual sum of errors)
2 Logistic loss: logistic regression
3 Exponential loss: AdaBoost
4 Margin-based loss: support vector machines

Principles of estimation
1 Point estimate: maximum likelihood, regularized likelihood


Summary Supervised learning

cont’d

Optimization
1 Methods: gradient descent, Newton's method
2 Convex optimization: global optimum vs. local optimum
3 Lagrange duality: primal and dual formulations

Learning theory
1 Difference between training error and generalization error
2 Overfitting, bias-variance tradeoff
3 Regularization: various regularized models


Clustering

Outline

1 Administrative matters

2 Review of last lecture

3 Summary

4 Clustering
    Gaussian mixture models
    EM Algorithm


Clustering

Clustering

Setup Given D = {x_n}_{n=1}^N and K, we want to output

{µ_k}_{k=1}^K: prototypes of the clusters

A(x_n) ∈ {1, 2, . . . , K}: the cluster membership, i.e., the cluster ID assigned to x_n

Example Cluster the data into two clusters.

(Figure: the same data shown before clustering (a) and after clustering into two groups (i); both axes range from −2 to 2.)


Clustering

K-means clustering

Intuition Data points assigned to cluster k should be close to µ_k, the prototype.

Distortion measure (clustering objective function, cost function)

    J = ∑_{n=1}^N ∑_{k=1}^K r_nk ‖x_n − µ_k‖_2^2

where r_nk ∈ {0, 1} is an indicator variable

    r_nk = 1 if and only if A(x_n) = k
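As a small aside (not part of the slides), the distortion J is straightforward to compute from hard assignments; the helper name distortion and the toy data below are made up for illustration.

```python
import numpy as np

def distortion(X, mu, assign):
    """J = sum_n ||x_n - mu_{A(x_n)}||^2 for hard assignments assign[n] in {0, ..., K-1}."""
    return np.sum((X - mu[assign]) ** 2)

# Tiny example with N = 4 points and K = 2 prototypes
X = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [1.1, 1.0]])
mu = np.array([[0.05, 0.0], [1.05, 1.0]])
assign = np.array([0, 0, 1, 1])
print(distortion(X, mu, assign))  # 0.01
```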


Clustering

Algorithm

Minimize the distortion measure by alternating optimization between {r_nk} and {µ_k}

Step 0 Initialize {µ_k} to some values

Step 1 Assume the current values of {µ_k} are fixed; minimize J over {r_nk}, which leads to the following cluster assignment rule

    r_nk = 1 if k = argmin_j ‖x_n − µ_j‖_2^2, and 0 otherwise

Step 2 Assume the current values of {r_nk} are fixed; minimize J over {µ_k}, which leads to the following rule for updating the prototypes of the clusters

    µ_k = ∑_n r_nk x_n / ∑_n r_nk

Step 3 Determine whether to stop or return to Step 1
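Below is a minimal NumPy sketch of Steps 0-3 (an illustration only, not the course's reference implementation); the function kmeans and its arguments are invented for this example.

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Alternate between the assignment rule (Step 1) and the mean update (Step 2)."""
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    mu = X[rng.choice(N, size=K, replace=False)]   # Step 0: initialize prototypes
    assign = np.zeros(N, dtype=int)
    for _ in range(n_iters):
        # Step 1: assign each point to the nearest prototype
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)   # N x K squared distances
        assign = d2.argmin(axis=1)
        # Step 2: recompute each prototype as the mean of its assigned points
        new_mu = np.array([X[assign == k].mean(axis=0) if np.any(assign == k) else mu[k]
                           for k in range(K)])
        # Step 3: stop when the prototypes no longer move
        if np.allclose(new_mu, mu):
            break
        mu = new_mu
    return mu, assign

# Usage on two well-separated blobs
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2, 0.3, size=(50, 2)), rng.normal(2, 0.3, size=(50, 2))])
mu, assign = kmeans(X, K=2)
print(mu)
```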


Clustering

Remarks

The prototype µ_k is the mean of the data points assigned to cluster k, hence the name K-means clustering.

The procedure terminates after a finite number of steps (in general, assuming there is no tie when comparing distances in Step 1), as the procedure reduces J in both Step 1 and Step 2. Since J is lower-bounded by 0, the procedure cannot run forever.

There is no guarantee that the procedure terminates at the global optimum of J — in most cases, the algorithm stops at a local optimum, which depends on the initial values in Step 0.


Clustering

Example of running K-means algorithm

(Figure: nine panels (a)-(i) showing successive iterations of K-means on the two-cluster data, alternating assignment and mean-update steps; both axes range from −2 to 2.)


Clustering

Application: vector quantization

We can replace our data points with the prototypes µ_k of the clusters they are assigned to. This is called vector quantization. In other words, we have compressed the data points into i) a codebook of all the prototypes; ii) a list of indices into the codebook, one per data point. This compression is obviously lossy, as information will be lost if we use a very small K.

Clustering the pixels in an image and vector-quantizing them. From left to right: original image, then quantized versions with a large K, a medium K, and a small K. Details go missing as the compression increases (smaller K).
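A small sketch of this idea (illustrative only; a synthetic random "image" stands in for a real one, and the variable names are made up): the pixels are clustered with a few K-means iterations, and the image is then represented by the codebook plus one index per pixel.

```python
import numpy as np

# Vector-quantize a synthetic "image": each pixel is an RGB vector in [0, 1].
rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))
pixels = image.reshape(-1, 3)          # N x 3 data points

K = 8                                  # codebook size
codebook = pixels[rng.choice(len(pixels), size=K, replace=False)]
for _ in range(20):                    # a few K-means iterations
    d2 = ((pixels[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    codes = d2.argmin(axis=1)          # index into the codebook for every pixel
    for k in range(K):
        if np.any(codes == k):
            codebook[k] = pixels[codes == k].mean(axis=0)

# Compressed form: the codebook plus one small integer per pixel.
quantized = codebook[codes].reshape(image.shape)   # lossy reconstruction
print(codebook.shape, codes.shape, quantized.shape)
```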


Clustering Gaussian mixture models

Probabilistic interpretation of clustering?

We can impose a probabilistic interpretation on the intuition that data points stay close to the centers of their clusters. This is just a statement about what p(x) looks like — we will see how to model this distribution.

(Figure (b): scatter plot of the data on [0, 1] × [0, 1].)

The data points seem to form 3 clusters. However, we cannot model p(x) with a simple, known distribution. For example, the data do not follow a single Gaussian distribution, as there are seemingly 3 regions where the data concentrate.


Clustering Gaussian mixture models

Gaussian mixture models: intuition

(Figure (a): the same data, with each of the 3 regions shown in a different color.)

Instead, we will model each region with a Gaussian distribution. This leads to the idea of Gaussian mixture models (GMMs), or mixtures of Gaussians (MoGs).

The problem we now face is that i) we do not know which (color) region a data point comes from, and ii) we do not know the parameters of the Gaussian distribution in each region. We need to estimate all of them from the unsupervised data D = {x_n}_{n=1}^N.


Clustering Gaussian mixture models

Gaussian mixture models: formal definition

A Gaussian mixture model has the following density function for x

    p(x) = ∑_{k=1}^K ω_k N(x | µ_k, Σ_k)

where

K: the number of Gaussians — they are called (mixture) components

µ_k and Σ_k: mean and covariance matrix of the k-th component

ω_k: mixture weights — they represent how much each component contributes to the final distribution. They satisfy two properties:

    ∀ k, ω_k > 0, and ∑_k ω_k = 1

These properties ensure that p(x) is a properly normalized probability density function.
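As an illustration (not from the slides), this density is easy to evaluate directly; the component parameters below are invented for the example.

```python
import numpy as np

def gaussian_pdf(x, mu, Sigma):
    """Multivariate normal density N(x | mu, Sigma) for a single point x."""
    d = len(mu)
    diff = x - mu
    inv = np.linalg.inv(Sigma)
    norm = 1.0 / np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))
    return norm * np.exp(-0.5 * diff @ inv @ diff)

def gmm_pdf(x, omegas, mus, Sigmas):
    """p(x) = sum_k omega_k N(x | mu_k, Sigma_k)."""
    return sum(w * gaussian_pdf(x, m, S) for w, m, S in zip(omegas, mus, Sigmas))

# Example mixture with K = 2 components in 2 dimensions
omegas = np.array([0.3, 0.7])                       # mixture weights, sum to 1
mus = [np.zeros(2), np.array([3.0, 3.0])]
Sigmas = [np.eye(2), 0.5 * np.eye(2)]
print(gmm_pdf(np.array([0.1, -0.2]), omegas, mus, Sigmas))
```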


Clustering Gaussian mixture models

GMM as the marginal distribution of a joint distribution

Consider the following joint distribution

    p(x, z) = p(z) p(x | z)

where z is a discrete random variable taking values between 1 and K. Denote

    ω_k = p(z = k)

and furthermore, assume the conditional distributions are Gaussian

    p(x | z = k) = N(x | µ_k, Σ_k)

Then the marginal distribution of x is

    p(x) = ∑_{k=1}^K ω_k N(x | µ_k, Σ_k)

namely, the Gaussian mixture model.
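This generative view also gives a simple recipe for sampling from a GMM, sketched below with made-up parameters: draw z from the categorical distribution with probabilities ω_k, then draw x from the corresponding Gaussian.

```python
import numpy as np

def sample_gmm(n, omegas, mus, Sigmas, seed=0):
    """Draw n samples by first sampling z ~ Categorical(omega), then x ~ N(mu_z, Sigma_z)."""
    rng = np.random.default_rng(seed)
    z = rng.choice(len(omegas), size=n, p=omegas)      # hidden component labels
    x = np.array([rng.multivariate_normal(mus[k], Sigmas[k]) for k in z])
    return x, z

omegas = [0.5, 0.3, 0.2]
mus = [np.array([0.0, 0.0]), np.array([4.0, 0.0]), np.array([0.0, 4.0])]
Sigmas = [np.eye(2) * 0.5] * 3
x, z = sample_gmm(500, omegas, mus, Sigmas)
print(x.shape, np.bincount(z) / len(z))   # empirical component frequencies ≈ omegas
```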


Clustering Gaussian mixture models

GMMs: example

(Figure (a): the data colored by region.)

The conditional distributions of x given z (representing color) are

    p(x | z = 'red') = N(x | µ_1, Σ_1)
    p(x | z = 'blue') = N(x | µ_2, Σ_2)
    p(x | z = 'green') = N(x | µ_3, Σ_3)

(Figure (b): the same data without the color labels.)

The marginal distribution is thus

    p(x) = p('red') N(x | µ_1, Σ_1) + p('blue') N(x | µ_2, Σ_2) + p('green') N(x | µ_3, Σ_3)


Clustering Gaussian mixture models

Parameter estimation for Gaussian mixture models

The parameters of a GMM are θ = {ω_k, µ_k, Σ_k}_{k=1}^K. To estimate them, consider the simple case first.

z is given If we assume z is observed for every x, then our estimation problem is easier to solve. In particular, our training data is augmented to

    D′ = {x_n, z_n}_{n=1}^N

Note that, for every x_n, we have a z_n denoting the region/color that the specific x_n comes from. We call D′ the complete data and D the incomplete data.

Given D′, the maximum likelihood estimate of θ is given by

    θ = argmax_θ log p(D′) = argmax_θ ∑_n log p(x_n, z_n)


Clustering Gaussian mixture models

Parameter estimation for GMMs: complete data

The likelihood — which we will refer to as the complete likelihood — is decomposable

    ∑_n log p(x_n, z_n) = ∑_n log p(z_n) p(x_n | z_n) = ∑_k ∑_{n: z_n = k} log p(z_n) p(x_n | z_n)

where we have grouped the data by the values of z_n. Let us introduce a binary variable γ_nk ∈ {0, 1} to indicate whether z_n = k. We can rewrite our decomposition as

    ∑_n log p(x_n, z_n) = ∑_k ∑_n γ_nk log p(z = k) p(x_n | z = k)

Note that we have used a "dummy" variable z to denote all the possible values x_n's true z_n can take — but only one of those values is given in D′.


Clustering Gaussian mixture models

Parameter estimation for GMMs: solution for complete data

Substituting our assumption about the conditional distributions, we have

    ∑_n log p(x_n, z_n) = ∑_k ∑_n γ_nk [log ω_k + log N(x_n | µ_k, Σ_k)]

Regrouping, we have

    ∑_n log p(x_n, z_n) = ∑_k ∑_n γ_nk log ω_k + ∑_k { ∑_n γ_nk log N(x_n | µ_k, Σ_k) }

Note that the term inside the braces depends only on the k-th component's parameters. It is now easy to show (left as a homework exercise) that the maximum likelihood estimates of the parameters are

    ω_k = ∑_n γ_nk / ∑_k ∑_n γ_nk,   µ_k = (1 / ∑_n γ_nk) ∑_n γ_nk x_n

    Σ_k = (1 / ∑_n γ_nk) ∑_n γ_nk (x_n − µ_k)(x_n − µ_k)^T


Clustering Gaussian mixture models

Intuition

Since γ_nk is binary, the previous solution is nothing but:

For ω_k: count the number of data points whose z_n is k and divide by the total number of data points (note that ∑_k ∑_n γ_nk = N)

For µ_k: take all the data points whose z_n is k and compute their mean

For Σ_k: take all the data points whose z_n is k and compute their covariance matrix

This intuition is going to help us develop an algorithm for estimating θ when we do not know z_n.
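A minimal sketch of these counting formulas when z_n is observed (hard, binary γ_nk); the function name mle_complete and the synthetic data are invented for illustration.

```python
import numpy as np

def mle_complete(X, z, K):
    """ML estimates of (omega_k, mu_k, Sigma_k) from complete data {x_n, z_n}."""
    N, d = X.shape
    omegas, mus, Sigmas = [], [], []
    for k in range(K):
        Xk = X[z == k]                           # all points whose z_n is k
        omegas.append(len(Xk) / N)               # fraction of points in component k
        mus.append(Xk.mean(axis=0))              # their mean
        diff = Xk - mus[-1]
        Sigmas.append(diff.T @ diff / len(Xk))   # their covariance (MLE, divides by N_k)
    return np.array(omegas), np.array(mus), np.array(Sigmas)

rng = np.random.default_rng(0)
z = rng.choice(2, size=200, p=[0.3, 0.7])
X = np.where(z[:, None] == 0, rng.normal(0, 1, (200, 2)), rng.normal(5, 1, (200, 2)))
omegas, mus, Sigmas = mle_complete(X, z, K=2)
print(omegas, mus, sep="\n")
```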


Clustering Gaussian mixture models

Parameter estimation for GMMs: incomplete data

When z_n is not given, we can guess which region/color x_n comes from by computing the posterior probability

    p(z_n = k | x_n) = p(x_n | z_n = k) p(z_n = k) / p(x_n)
                     = p(x_n | z_n = k) p(z_n = k) / ∑_{k′=1}^K p(x_n | z_n = k′) p(z_n = k′)

Note that, to compute the posterior probability, we need to know the parameters θ. Let us, for a second, pretend we know the values of the parameters, so that we can compute the posterior probability.

How is that going to help us?


Clustering Gaussian mixture models

Estimation with soft γnk

We are going to treat p(z_n = k | x_n) as γ_nk, which should be binary but is now regarded as a "soft" assignment of x_n to the k-th component. With that in mind, we have

    γ_nk = p(z_n = k | x_n)

    ω_k = ∑_n γ_nk / ∑_k ∑_n γ_nk

    µ_k = (1 / ∑_n γ_nk) ∑_n γ_nk x_n

    Σ_k = (1 / ∑_n γ_nk) ∑_n γ_nk (x_n − µ_k)(x_n − µ_k)^T

In other words, every data point x_n is assigned to a component fractionally according to p(z_n = k | x_n) — this quantity is sometimes also called the "responsibility".


Clustering Gaussian mixture models

Iterative procedure

Since we do not know θ to begin with, we cannot compute the soft γ_nk. However, we can invoke an iterative procedure and alternate between estimating γ_nk and using the estimated γ_nk to compute the parameters

Step 0: guess θ with initial values

Step 1: compute γ_nk using the current θ

Step 2: update θ using the just-computed γ_nk

Step 3: go back to Step 1

Questions: i) is this procedure correct, i.e., does it optimize a sensible criterion? ii) practically, will this procedure ever stop instead of iterating forever?

The answer lies in the EM algorithm — a powerful procedure for model estimation with hidden (unobserved) data.
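A compact sketch of this alternating procedure for GMMs (illustrative only, not numerically hardened; the function name em_gmm and the small covariance regularizer are choices made for this example):

```python
import numpy as np

def em_gmm(X, K, n_iters=50, seed=0):
    """Alternate E-step (responsibilities) and M-step (parameter updates) for a GMM."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    # Step 0: crude initial guesses for theta = (omega, mu, Sigma)
    omegas = np.full(K, 1.0 / K)
    mus = X[rng.choice(N, size=K, replace=False)]
    Sigmas = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(K)])
    log_likes = []
    for _ in range(n_iters):
        # E-step: gamma_nk = p(z_n = k | x_n; theta), an N x K matrix
        dens = np.empty((N, K))
        for k in range(K):
            diff = X - mus[k]
            inv = np.linalg.inv(Sigmas[k])
            quad = np.einsum('ni,ij,nj->n', diff, inv, diff)
            norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigmas[k]))
            dens[:, k] = omegas[k] * np.exp(-0.5 * quad) / norm
        log_likes.append(np.log(dens.sum(axis=1)).sum())   # incomplete log-likelihood
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # M-step: weighted versions of the complete-data formulas
        Nk = gamma.sum(axis=0)
        omegas = Nk / N
        mus = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mus[k]
            Sigmas[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(d)
    return omegas, mus, Sigmas, log_likes
```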


Clustering EM Algorithm

EM algorithm: motivation and setup

As a general procedure, EM is used to estimate the parameters of probabilistic models with hidden/latent variables. Suppose the model is given by a joint distribution

    p(x | θ) = ∑_z p(x, z | θ)

where x is the observed random variable and z is hidden.

We are given data containing only the observed variable, D = {x_n}, where the corresponding hidden variable values z_n are not included. Our goal is to obtain the maximum likelihood estimate of θ. Namely, we choose

    θ = argmax_θ ℓ(θ) = argmax_θ ∑_n log p(x_n | θ) = argmax_θ ∑_n log ∑_{z_n} p(x_n, z_n | θ)

The objective function ℓ(θ) is called the incomplete log-likelihood.


Clustering EM Algorithm

Expected (complete) log-likelihood

The difficulty with the incomplete log-likelihood is that it sums over all possible values that z_n can take and then takes the logarithm. This log-sum form makes the computation intractable. Instead, the EM algorithm uses a clever trick to change this into a sum-log form.

To this end, we define the following

    Q_q(θ) = ∑_n E_{z_n ∼ q(z_n)} log p(x_n, z_n | θ) = ∑_n ∑_{z_n} q(z_n) log p(x_n, z_n | θ)

which is called the expected (complete) log-likelihood with respect to q(z), where q(z) is a distribution over z. Note that Q_q(θ) takes the sum-log form, which turns out to be tractable.


Clustering EM Algorithm

Examples

Consider the previous model where x could come from 3 regions. We can choose q(z) to be any valid distribution; different choices lead to different Q_q(θ). Note that z here represents the different colors.

q(z = k) = 1/3 for each of the 3 colors. This gives rise to

    Q_q(θ) = ∑_n (1/3) [log p(x_n, 'red' | θ) + log p(x_n, 'blue' | θ) + log p(x_n, 'green' | θ)]

q(z = k) = 1/2 for 'red' and 'blue', 0 for 'green'. This gives rise to

    Q_q(θ) = ∑_n (1/2) [log p(x_n, 'red' | θ) + log p(x_n, 'blue' | θ)]


Clustering EM Algorithm

Which q(z) to choose?

We will choose a special q(z) = p(z | x; θ), i.e., the posterior probability of z. We define

    Q(θ) = Q_{z ∼ p(z|x;θ)}(θ)

and we will show that

    ℓ(θ) = Q(θ) + ∑_n H[p(z | x_n; θ)]

where H[p] is the entropy of the probability distribution p:

    H[p(x)] = −∫ p(x) log p(x) dx


Clustering EM Algorithm

Proof

    Q(θ) = ∑_n ∑_{z_n} p(z_n | x_n; θ) log p(x_n, z_n | θ)
         = ∑_n ∑_{z_n} p(z_n | x_n; θ) [log p(x_n | θ) + log p(z_n | x_n; θ)]
         = ∑_n ∑_{z_n} p(z_n | x_n; θ) log p(x_n | θ) + ∑_n ∑_{z_n} p(z_n | x_n; θ) log p(z_n | x_n; θ)
         = ∑_n log p(x_n | θ) ∑_{z_n} p(z_n | x_n; θ) − ∑_n H[p(z | x_n; θ)]
         = ∑_n log p(x_n | θ) − ∑_n H[p(z | x_n; θ)]
         = ℓ(θ) − ∑_n H[p(z | x_n; θ)]


Clustering EM Algorithm

A computable Q(θ)

As before, Q(θ) cannot be computed, as it depends on the unknown parameter values θ to compute the posterior probability p(z | x; θ). Instead, we will use a known value θ^old to compute the expected likelihood

    Q(θ, θ^old) = ∑_n ∑_{z_n} p(z_n | x_n; θ^old) log p(x_n, z_n | θ)

Note that, in the above, the variable is θ; θ^old is assumed to be known. By definition, Q(θ) = Q(θ, θ).

However, how does Q(θ, θ^old) relate to ℓ(θ)? We will show that

    ℓ(θ) ≥ Q(θ, θ^old) + ∑_n H[p(z | x_n; θ^old)]

Thus, in a way, Q(θ) is better than Q(θ, θ^old) (because we have equality there), except that we cannot compute the former.


Clustering EM Algorithm

Proof

    ℓ(θ) = ∑_n log ∑_{z_n} p(z_n | x_n; θ^old) · p(x_n, z_n | θ) / p(z_n | x_n; θ^old)
         ≥ ∑_n ∑_{z_n} p(z_n | x_n; θ^old) log [ p(x_n, z_n | θ) / p(z_n | x_n; θ^old) ]
         = ∑_n ∑_{z_n} p(z_n | x_n; θ^old) log p(x_n, z_n | θ) − ∑_n ∑_{z_n} p(z_n | x_n; θ^old) log p(z_n | x_n; θ^old)
         = Q(θ, θ^old) + ∑_n H[p(z | x_n; θ^old)]

The inequality (≥) holds because log is a concave function:

    log ∑_i w_i x_i ≥ ∑_i w_i log x_i,   ∀ w_i ≥ 0, ∑_i w_i = 1

and in our case the w_i are p(z_n | x_n; θ^old).


Clustering EM Algorithm

Putting things together: auxiliary function

So far we have shown a lower bound on the log-likelihood

    ℓ(θ) ≥ A(θ, θ^old) = Q(θ, θ^old) + ∑_n H[p(z | x_n; θ^old)]

We will call the right-hand side an auxiliary function.

This auxiliary function has an important property: when θ = θ^old,

    A(θ, θ) = ℓ(θ)


Clustering EM Algorithm

Use auxiliary function to increase log-likelihood

Suppose we have an initial guess θ^old; we then maximize the auxiliary function

    θ^new = argmax_θ A(θ, θ^old)

With the new guess, we have

    ℓ(θ^new) ≥ A(θ^new, θ^old) ≥ A(θ^old, θ^old) = ℓ(θ^old)

Repeating this process, we have

    ℓ(θ^even newer) ≥ ℓ(θ^new) ≥ ℓ(θ^old)

where

    θ^even newer = argmax_θ A(θ, θ^new)


Clustering EM Algorithm

Iterative and monotonic improvement

Thus, by maximizing the auxiliary function, we obtain a sequence of guesses

    θ^old, θ^new, θ^even newer, · · ·

that keeps increasing the likelihood. This process will eventually stop if the likelihood is bounded from above (i.e., less than +∞). This is the core of the EM algorithm.

Expectation-Maximization (EM)

Step 0: Initialize θ with θ^(0)

Step 1 (E-step): Compute the auxiliary function using the current value of θ: A(θ, θ^(t))

Step 2 (M-step): Maximize the auxiliary function:

    θ^(t+1) ← argmax_θ A(θ, θ^(t))

Step 3: Increase t to t + 1 and go back to Step 1; or stop if ℓ(θ^(t+1)) does not improve much over ℓ(θ^(t)).


Clustering EM Algorithm

Remarks

The EM procedure converges, but only to a local optimum; the global optimum is not guaranteed to be found.

The E-step depends on computing the posterior probability

    p(z_n | x_n; θ^(t))

The M-step does not depend on the entropy term, so we only need to do the following

    θ^(t+1) ← argmax_θ A(θ, θ^(t)) = argmax_θ Q(θ, θ^(t))

We often call the last term the Q-function.


Clustering EM Algorithm

Example: applying EM to GMMs

What is the E-step in GMM? We compute the responsibility

    γ_nk = p(z = k | x_n; θ^(t))

What is the M-step in GMM? The Q-function is

    Q(θ, θ^(t)) = ∑_n ∑_k p(z = k | x_n; θ^(t)) log p(x_n, z = k | θ)
                = ∑_n ∑_k γ_nk log p(x_n, z = k | θ)
                = ∑_k ∑_n γ_nk log p(z = k) p(x_n | z = k)
                = ∑_k ∑_n γ_nk [log ω_k + log N(x_n | µ_k, Σ_k)]

Hence, we have recovered the parameter estimation algorithm for GMMs seen previously. (We still need to do the maximization to get θ^(t+1) — left as homework.)
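As a quick illustration of EM's monotone improvement (assuming the em_gmm sketch from the earlier slide is in scope; it is not part of the course code), the incomplete log-likelihood recorded at each iteration should never decrease:

```python
import numpy as np

# Two well-separated Gaussian blobs as synthetic data
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(5, 1, (100, 2))])
omegas, mus, Sigmas, log_likes = em_gmm(X, K=2)   # em_gmm defined in the earlier sketch

# Monotonicity: each iteration's incomplete log-likelihood is at least the previous one
assert all(b >= a - 1e-6 for a, b in zip(log_likes, log_likes[1:]))
print(omegas.round(2), log_likes[0], log_likes[-1])
```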
