
Linear Dimensionality Reduction

Practical Machine Learning (CS294-34)

September 24, 2009

Percy Liang

Lots of high-dimensional data...

face images

documents, e.g.:

"Zambian President Levy Mwanawasa has won a second term in office in an election his challenger Michael Sata accused him of rigging, official results showed on Monday."

"According to media reports, a pair of hackers said on Saturday that the Firefox Web browser, commonly perceived as the safer and more customizable alternative to market leader Internet Explorer, is critically flawed. A presentation on the flaw was shown during the ToorCon hacker conference in San Diego."

gene expression data

MEG readings

Motivation and context

Why do dimensionality reduction?

• Computational: compress data ⇒ time/space efficiency

• Statistical: fewer dimensions ⇒ better generalization

• Visualization: understand structure of data

• Anomaly detection: describe normal data, detect outliers

Dimensionality reduction in this course:

• Linear methods (this week)

• Clustering (last week)

• Feature selection (next week)

• Nonlinear methods (later)

Types of problems

• Prediction x → y: classification, regression
Applications: face recognition, gene expression prediction
Techniques: kNN, SVM, least squares (+ dimensionality reduction preprocessing)

• Structure discovery x → z: find an alternative representation z of the data x
Applications: visualization
Techniques: clustering, linear dimensionality reduction

• Density estimation p(x): model the data
Applications: anomaly detection, language modeling
Techniques: clustering, linear dimensionality reduction

Basic idea of linear dimensionality reduction

Represent each face as a high-dimensional vector x ∈ R^361

Project it down to a low-dimensional vector z = U^T x, with z ∈ R^10

How do we choose U?


Outline

• Principal component analysis (PCA)

– Basic principles

– Case studies

– Kernel PCA

– Probabilistic PCA

• Canonical correlation analysis (CCA)

• Fisher discriminant analysis (FDA)

• Summary



Dimensionality reduction setup

Given n data points in d dimensions: x_1, ..., x_n ∈ R^d

X = ( x_1 ⋯ x_n ) ∈ R^{d×n}

Want to reduce dimensionality from d to k

Choose k directions u_1, ..., u_k

U = ( u_1 ⋯ u_k ) ∈ R^{d×k}

For each u_j, compute “similarity” z_j = u_j^T x

Project x down to z = (z_1, ..., z_k)^T = U^T x

How to choose U?

PCA objective 1: reconstruction error

U serves two functions:

• Encode: z = U^T x, z_j = u_j^T x

• Decode: x̃ = U z = Σ_{j=1}^k z_j u_j

Want the reconstruction error ‖x − x̃‖ to be small

Objective: minimize total squared reconstruction error

min_{U ∈ R^{d×k}} Σ_{i=1}^n ‖x_i − U U^T x_i‖²

PCA objective 2: projected variance

Empirical distribution: uniform over x_1, ..., x_n

Expectation (think sum over data points):
E[f(x)] = (1/n) Σ_{i=1}^n f(x_i)

Variance (think sum of squares if centered):
var[f(x)] + (E[f(x)])² = E[f(x)²] = (1/n) Σ_{i=1}^n f(x_i)²

Assume the data is centered: E[x] = 0 (what's E[U^T x]?)

Objective: maximize variance of projected data

max_{U ∈ R^{d×k}, U^T U = I} E[‖U^T x‖²]

Equivalence of the two objectives

Key intuition:

variance of data (fixed) = captured variance (want large) + reconstruction error (want small)

Pythagorean decomposition: x = U U^T x + (I − U U^T) x

[Figure: right triangle with legs ‖U U^T x‖ and ‖(I − U U^T) x‖ and hypotenuse ‖x‖]

Take expectations; note that ‖U U^T x‖ = ‖U^T x‖ since U has orthonormal columns:

E[‖x‖²] = E[‖U^T x‖²] + E[‖x − U U^T x‖²]

Minimize reconstruction error ↔ Maximize captured variance
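A quick numerical check of this identity, as a minimal numpy sketch (any U with orthonormal columns will do; the data here is random, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, k = 5, 100, 2

X = rng.standard_normal((d, n))
X -= X.mean(axis=1, keepdims=True)          # center the data: E[x] = 0

# Any matrix with orthonormal columns satisfies the identity.
U, _ = np.linalg.qr(rng.standard_normal((d, k)))

total = np.mean(np.sum(X**2, axis=0))                       # E[||x||^2]
captured = np.mean(np.sum((U.T @ X)**2, axis=0))            # E[||U^T x||^2]
residual = np.mean(np.sum((X - U @ (U.T @ X))**2, axis=0))  # E[||x - UU^T x||^2]

assert np.isclose(total, captured + residual)
```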

Finding one principal component

Input data: X = ( x_1 ⋯ x_n )

Objective: maximize variance of projected data

max_{‖u‖=1} E[(u^T x)²]

= max_{‖u‖=1} (1/n) Σ_{i=1}^n (u^T x_i)²

= max_{‖u‖=1} (1/n) ‖u^T X‖²

= max_{‖u‖=1} u^T ( (1/n) X X^T ) u

= largest eigenvalue of C, where C := (1/n) X X^T

(C is the covariance matrix of the data)
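As a concrete check, a few lines of numpy recover the top principal component and verify that its captured variance equals the largest eigenvalue (synthetic data, illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 200)) * np.array([[3.0], [1.0], [0.1]])  # d x n
X -= X.mean(axis=1, keepdims=True)        # PCA assumes centered data

C = X @ X.T / X.shape[1]                  # covariance matrix C = (1/n) X X^T
eigvals, eigvecs = np.linalg.eigh(C)      # eigenvalues in ascending order
u = eigvecs[:, -1]                        # top principal component

print(eigvals[-1], np.var(u @ X))         # captured variance = largest eigenvalue
```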

How many principal components?

• Similar to the question of “How many clusters?”

• The magnitude of each eigenvalue indicates the fraction of variance captured by its principal component.

• Eigenvalues on a face image dataset:

[Plot: eigenvalues λ_i versus index i = 2, ..., 11; the y-axis runs to about 1353]

• Eigenvalues typically drop off sharply, so we don't need that many components.

• Of course, variance isn't everything...

Computing PCA

Method 1: eigendecomposition

U are the eigenvectors of the covariance matrix C = (1/n) X X^T

Computing C already takes O(nd²) time (very expensive)

Method 2: singular value decomposition (SVD)

Find X = U_{d×d} Σ_{d×n} V^T_{n×n}
where U^T U = I_{d×d}, V^T V = I_{n×n}, and Σ is diagonal

Computing the top k singular vectors takes only O(ndk) time

Relationship between eigendecomposition and SVD:
left singular vectors are the principal components (C = (1/n) U Σ² U^T)
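A minimal numpy sketch of Method 2 (the 361-pixel dimension echoes the face example; the data itself is random):

```python
import numpy as np

def pca_svd(X, k):
    """Top-k principal components of a centered d x n data matrix via SVD."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :k], s[:k]**2 / X.shape[1]   # directions, captured variances

rng = np.random.default_rng(0)
X = rng.standard_normal((361, 50))           # d x n, e.g. 361-pixel face images
X -= X.mean(axis=1, keepdims=True)           # center before PCA

Uk, variances = pca_svd(X, k=10)
Z = Uk.T @ X                                 # k x n codes z_i = U^T x_i
```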


Roadmap

• Principal component analysis (PCA)

– Basic principles

– Case studies

– Kernel PCA

– Probabilistic PCA

• Canonical correlation analysis (CCA)

• Fisher discriminant analysis (FDA)

• Summary


Eigen-faces [Turk and Pentland, 1991]

• d = number of pixels

• Each x_i ∈ R^d is a face image

• x_{ji} = intensity of the j-th pixel in image i

X_{d×n} ≈ U_{d×k} Z_{k×n}, with the columns of Z being the codes ( z_1 ⋯ z_n )

Idea: z_i is a more “meaningful” representation of the i-th face than x_i

Can use z_i for nearest-neighbor classification

Much faster: O(dk + nk) time instead of O(dn) when n, d ≫ k

Why no time savings for a linear classifier?
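A sketch of nearest-neighbor classification in the reduced space; the names (nn_classify, train, labels, Uk, mean) are hypothetical stand-ins, with Uk and mean assumed to come from PCA on the training faces:

```python
import numpy as np

def nn_classify(query, train, labels, Uk, mean):
    """Nearest-neighbor label for one face, comparing k-dim codes not pixels.

    query: face in R^d; train: d x n faces; labels: n identities;
    Uk: d x k principal components; mean: d x 1 training mean.
    """
    Z = Uk.T @ (train - mean)                # k x n codes (precompute once)
    z = Uk.T @ (query - mean.ravel())        # project query to k dimensions
    i = np.argmin(np.sum((Z - z[:, None])**2, axis=0))  # nearest code
    return labels[i]
```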

Latent Semantic Analysis [Deerwester et al., 1990]

• d = number of words in the vocabulary

• Each x_i ∈ R^d is a vector of word counts

• x_{ji} = frequency of word j in document i

X_{d×n} ≈ U_{d×k} Z_{k×n}, e.g.

stocks:    2 ⋯ 0         0.4   ⋯ −0.001
chairman:  4 ⋯ 1         0.8   ⋯  0.03
the:       8 ⋯ 7    ≈    0.01  ⋯  0.04     ( z_1 ⋯ z_n )
   ⋮       ⋮   ⋮          ⋮        ⋮
wins:      0 ⋯ 2         0.002 ⋯  2.3
game:      1 ⋯ 3         0.003 ⋯  1.9

How to measure similarity between two documents?

z_1^T z_2 is probably better than x_1^T x_2

Applications: information retrieval

Note: no computational savings; the original x is already sparse
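A small sketch of the document-similarity comparison; the term-document counts echo the slide's toy matrix, with a made-up middle column:

```python
import numpy as np

# Toy term-document matrix (rows = words, columns = documents).
X = np.array([[2., 0., 1.],     # stocks
              [4., 1., 0.],     # chairman
              [8., 7., 6.],     # the
              [0., 2., 3.],     # wins
              [1., 3., 2.]])    # game

U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
Z = np.diag(s[:k]) @ Vt[:k]     # k x n document codes, Z = U_k^T X

def sim(i, j):
    """Similarity of documents i and j in the latent space."""
    return Z[:, i] @ Z[:, j]

print(sim(1, 2), X[:, 1] @ X[:, 2])   # z_i^T z_j vs. raw x_i^T x_j
```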

Network anomaly detection [Lakhina et al., 2005]

x_{ji} = amount of traffic on link j in the network during each time interval i

Model assumption: total traffic is a sum of flows along a few “paths”

Apply PCA: each principal component intuitively represents a “path”

Anomaly when traffic deviates from the first few principal components
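A hedged sketch of the idea (not Lakhina et al.'s exact procedure): score each time interval by the traffic that the top-k “path” subspace fails to explain.

```python
import numpy as np

def anomaly_scores(X, k):
    """Residual traffic outside the top-k principal subspace.

    X: d links x n time intervals. Returns one score per interval;
    a large score means the interval deviates from "normal" traffic.
    """
    mean = X.mean(axis=1, keepdims=True)
    Xc = X - mean
    U, _, _ = np.linalg.svd(Xc, full_matrices=False)
    Uk = U[:, :k]                          # "normal" traffic subspace
    residual = Xc - Uk @ (Uk.T @ Xc)       # part not explained by top-k paths
    return np.sum(residual**2, axis=0)
```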

Unsupervised POS tagging [Schütze, '95]

Part-of-speech (POS) tagging task:

Input: I like reducing the dimensionality of data .
Output: NOUN VERB VERB(-ING) DET NOUN PREP NOUN .

Each x_i is (the context distribution of) a word;
x_{ji} is the number of times word i appeared in context j

Key idea: words appearing in similar contexts tend to have the same POS tag, so cluster using the contexts of each word type

Problem: contexts are too sparse

Solution: run PCA first, then cluster using the new representation

Multi-task learning [Ando & Zhang, '05]

• Have n related tasks (e.g., classify documents for various users)

• Each task has a linear classifier with weights x_i

• Want to share structure between the classifiers

One step of their procedure: given n linear classifiers x_1, ..., x_n, run PCA to identify shared structure:

X = ( x_1 ⋯ x_n ) ≈ U Z

Each principal component is an eigen-classifier

The other step of their procedure: retrain the classifiers, regularizing towards the subspace U

PCA summary

• Intuition: capture the variance of the data, or minimize the reconstruction error

• Algorithm: eigendecomposition of the covariance matrix, or SVD

• Impact: reduce storage (from O(nd) to O(nk)), reduce time complexity

• Advantages: simple, fast

• Applications: eigen-faces, eigen-documents, network anomaly detection, etc.


Roadmap

• Principal component analysis (PCA)

– Basic principles

– Case studies

– Kernel PCA

– Probabilistic PCA

• Canonical correlation analysis (CCA)

• Fisher discriminant analysis (FDA)

• Summary


Limitations of linearity

[Figure: two 2-D datasets; PCA is effective when the data lies near a line, PCA is ineffective when the data lies near a curve]

The problem is that the PCA subspace is linear:

S = { x = U z : z ∈ R^k }

In this example:

S = { (x_1, x_2) : x_2 = (u_2 / u_1) x_1 }

Going beyond linearity: quick solution

[Figure: broken solution (a straight line through the parabolic data) vs. desired solution (the parabola itself)]

We want the desired solution: S = { (x_1, x_2) : x_2 = (u_2 / u_1) x_1² }

We can get it: S = { φ(x) = U z } with φ(x) = (x_1², x_2)^T

Linear dimensionality reduction in φ(x) space
⇔
Nonlinear dimensionality reduction in x space

In general, can set φ(x) = (x_1, x_1², x_1 x_2, sin(x_1), ...)^T

Problems: (1) ad-hoc and tedious; (2) φ(x) is large, so computation is expensive
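A quick sketch of this trick on the parabola example (the data-generating constants here are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.uniform(-1, 1, 200)
# Data near the parabola x2 = 0.5 * x1^2, plus a little noise.
X = np.vstack([x1, 0.5 * x1**2 + 0.05 * rng.standard_normal(200)])

Phi = np.vstack([X[0]**2, X[1]])           # feature map phi(x) = (x1^2, x2)^T
Phi -= Phi.mean(axis=1, keepdims=True)

C = Phi @ Phi.T / Phi.shape[1]
_, vecs = np.linalg.eigh(C)
u = vecs[:, -1]                            # top PC in feature space

print(u / u[0])   # close to (1, 0.5): recovers x2 = 0.5 * x1^2
```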

Towards kernels

Representer theorem: the PCA solution is a linear combination of the x_i's

Why?
Recall the PCA eigenvalue problem: X X^T u = λ u
Notice that u = X α = Σ_{i=1}^n α_i x_i for some weights α

Analogy with SVMs: weight vector w = X α

Key fact: PCA only needs inner products K = X^T X

Why? Use the representer theorem on the PCA objective:

max_{‖u‖=1} u^T X X^T u  =  max_{α^T X^T X α = 1} α^T (X^T X)(X^T X) α

Kernel PCA

Kernel function: k(x_1, x_2) such that K, the kernel matrix formed by K_{ij} = k(x_i, x_j), is positive semi-definite

Examples:

Linear kernel: k(x_1, x_2) = x_1^T x_2

Polynomial kernel: k(x_1, x_2) = (1 + x_1^T x_2)²

Gaussian (RBF) kernel: k(x_1, x_2) = e^{−‖x_1 − x_2‖²}

Treat data points x as black boxes, accessed only via k;
k intuitively measures the “similarity” between two inputs

Mercer's theorem (using kernels is sensible): there exists a high-dimensional feature space φ such that k(x_1, x_2) = φ(x_1)^T φ(x_2) (like the quick solution earlier!)

Solving kernel PCA

Direct method:

Kernel PCA objective: max_{α^T K α = 1} α^T K² α

⇒ kernel PCA eigenvalue problem: X^T X α = λ′ α

Modular method (if you don't want to think about kernels):

Find vectors x′_1, ..., x′_n such that x′_i^T x′_j = K_{ij} = φ(x_i)^T φ(x_j)

Key: use any vectors that preserve the inner products

One possibility is the Cholesky decomposition K = X′^T X′
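The direct method in code, as a minimal sketch. One caveat: the double-centering step below is standard practice (it centers the data in feature space) but is left implicit in the slides.

```python
import numpy as np

def kernel_pca(K, k):
    """Top-k kernel PCA codes from an n x n kernel matrix K (a sketch)."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    Kc = H @ K @ H                          # center K in feature space
    eigvals, A = np.linalg.eigh(Kc)         # ascending order
    eigvals, A = eigvals[::-1][:k], A[:, ::-1][:, :k]
    # Projections of the training points: row j is sqrt(lambda_j) * alpha_j.
    return A.T * np.sqrt(np.maximum(eigvals, 0))[:, None]   # k x n

rng = np.random.default_rng(0)
X = rng.standard_normal((2, 100))
sq = np.sum(X**2, axis=0)
K = np.exp(-(sq[:, None] + sq[None, :] - 2 * X.T @ X))  # Gaussian kernel
Z = kernel_pca(K, k=2)
```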


Roadmap

• Principal component analysis (PCA)

– Basic principles

– Case studies

– Kernel PCA

– Probabilistic PCA

• Canonical correlation analysis (CCA)

• Fisher discriminant analysis (FDA)

• Summary


Page 113: Practical Machine Learning (CS294-34) September 24, 2009 ...jordan/courses/... · Linear Dimensionality Reduction Practical Machine Learning (CS294-34) September 24, 2009 Percy Liang

Probabilistic modelingSo far, deal with objective functions:

minU

f(X,U)

Principal component analysis (PCA) / Probabilistic PCA 31

Page 114: Practical Machine Learning (CS294-34) September 24, 2009 ...jordan/courses/... · Linear Dimensionality Reduction Practical Machine Learning (CS294-34) September 24, 2009 Percy Liang

Probabilistic modelingSo far, deal with objective functions:

minU

f(X,U)

Probabilistic modeling:max

Up(X | U)

Principal component analysis (PCA) / Probabilistic PCA 31

Page 115: Practical Machine Learning (CS294-34) September 24, 2009 ...jordan/courses/... · Linear Dimensionality Reduction Practical Machine Learning (CS294-34) September 24, 2009 Percy Liang

Probabilistic modelingSo far, deal with objective functions:

minU

f(X,U)

Probabilistic modeling:max

Up(X | U)

Invent a generative story of how data X arosePlay detective: infer parameters U that produced X

Principal component analysis (PCA) / Probabilistic PCA 31

Page 116: Practical Machine Learning (CS294-34) September 24, 2009 ...jordan/courses/... · Linear Dimensionality Reduction Practical Machine Learning (CS294-34) September 24, 2009 Percy Liang

Probabilistic modelingSo far, deal with objective functions:

minU

f(X,U)

Probabilistic modeling:max

Up(X | U)

Invent a generative story of how data X arosePlay detective: infer parameters U that produced X

Advantages:•Model reports estimates of uncertainty

• Natural way to handle missing data

Principal component analysis (PCA) / Probabilistic PCA 31

Page 117: Practical Machine Learning (CS294-34) September 24, 2009 ...jordan/courses/... · Linear Dimensionality Reduction Practical Machine Learning (CS294-34) September 24, 2009 Percy Liang

Probabilistic modelingSo far, deal with objective functions:

minU

f(X,U)

Probabilistic modeling:max

Up(X | U)

Invent a generative story of how data X arosePlay detective: infer parameters U that produced X

Advantages:•Model reports estimates of uncertainty

• Natural way to handle missing data

• Natural way to introduce prior knowledge

• Natural way to incorporate in a larger model

Principal component analysis (PCA) / Probabilistic PCA 31

Page 118: Practical Machine Learning (CS294-34) September 24, 2009 ...jordan/courses/... · Linear Dimensionality Reduction Practical Machine Learning (CS294-34) September 24, 2009 Percy Liang

Probabilistic modeling

So far, we have dealt with objective functions:

min_U f(X, U)

Probabilistic modeling:

max_U p(X | U)

Invent a generative story of how the data X arose
Play detective: infer the parameters U that produced X

Advantages:
• Model reports estimates of uncertainty

• Natural way to handle missing data

• Natural way to introduce prior knowledge

• Natural way to incorporate in a larger model

Example from last lecture: k-means ⇒ GMMs


Probabilistic PCA

Generative story [Tipping and Bishop, 1999]:

For each data point i = 1, . . . , n:
  Draw the latent vector: zi ∼ N(0, Ik×k)
  Create the data point: xi ∼ N(Uzi, σ²Id×d)

PCA finds the U that maximizes the likelihood of the data (see the sketch below)

Advantages:
• Handles missing data (important for collaborative filtering)
• Extension to factor analysis: allow non-isotropic noise (replace σ²Id×d with an arbitrary diagonal matrix)
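A minimal sketch of both directions of the story: sampling from the model, and the closed-form maximum-likelihood fit of Tipping and Bishop (the eigenvalue formulas are theirs; the NumPy rendering and the assumption k < d are mine):

    import numpy as np

    def sample_ppca(U, sigma, n, rng=np.random.default_rng(0)):
        # Follow the generative story literally.
        d, k = U.shape
        Z = rng.standard_normal((n, k))                    # z_i ~ N(0, I_k)
        X = Z @ U.T + sigma * rng.standard_normal((n, d))  # x_i ~ N(U z_i, sigma^2 I_d)
        return X, Z

    def fit_ppca(X, k):
        # Closed-form MLE [Tipping and Bishop, 1999]; assumes k < d.
        Xc = X - X.mean(axis=0)
        S = Xc.T @ Xc / len(X)                      # sample covariance
        evals, evecs = np.linalg.eigh(S)
        evals, evecs = evals[::-1], evecs[:, ::-1]  # descending order
        sigma2 = evals[k:].mean()                   # noise = average discarded variance
        W = evecs[:, :k] * np.sqrt(evals[:k] - sigma2)
        return W, sigma2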


Probabilistic latent semantic analysis (pLSA)

Motivation: in text analysis, X contains word counts; PCA (LSA) is a bad model because it allows negative counts; pLSA fixes this

Generative story for pLSA [Hofmann, 1999]:

For each document i = 1, . . . , n:
  Repeat M times (the number of word tokens in the document):
    Draw a latent topic: z ∼ p(z | i)
    Choose the word token: x ∼ p(x | z)
  Set xji to be the number of times word j was chosen

Learning using hard EM (analog of k-means; see the sketch below):
  E-step: fix parameters, choose the best topics
  M-step: fix topics, optimize the parameters

More sophisticated methods: EM, Latent Dirichlet Allocation

Comparison to a mixture model for clustering:
  Mixture model: assumes a single topic for the entire document
  pLSA: allows multiple topics per document
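A hard-EM sketch on a word-count matrix, assuming X[j, i] holds the count of word j in document i; the random initialization and iteration count are illustrative choices:

    import numpy as np

    def plsa_hard_em(X, k, iters=50, rng=np.random.default_rng(0)):
        V, n = X.shape
        # Random init of p(z | i) (k x n) and p(x | z) (V x k), column-normalized.
        p_z_i = rng.random((k, n)); p_z_i /= p_z_i.sum(axis=0)
        p_x_z = rng.random((V, k)); p_x_z /= p_x_z.sum(axis=0)
        for _ in range(iters):
            # Hard E-step: best topic for each (word, document) pair.
            scores = p_x_z[:, :, None] * p_z_i[None, :, :]   # V x k x n
            z_star = scores.argmax(axis=1)                   # V x n
            # M-step: reassign counts to their chosen topics, renormalize.
            p_z_i = np.zeros((k, n)); p_x_z = np.zeros((V, k))
            for z in range(k):
                assigned = (z_star == z) * X                 # counts claimed by topic z
                p_z_i[z] = assigned.sum(axis=0)              # per-document topic counts
                p_x_z[:, z] = assigned.sum(axis=1)           # per-topic word counts
            p_z_i /= np.maximum(p_z_i.sum(axis=0), 1e-12)
            p_x_z /= np.maximum(p_x_z.sum(axis=0), 1e-12)
        return p_z_i, p_x_z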


Roadmap

• Principal component analysis (PCA)

– Basic principles

– Case studies

– Kernel PCA

– Probabilistic PCA

• Canonical correlation analysis (CCA)

• Fisher discriminant analysis (FDA)

• Summary



Motivation for CCA [Hotelling, 1936]

Often, each data point consists of two views:

• Image retrieval: for each image, have the following:

– x: Pixels (or other visual features)

– y: Text around the image

• Time series:

– x: Signal at time t

– y: Signal at time t + 1

• Two-view learning: divide features into two sets

– x: Features of a word/object, etc.

– y: Features of the context in which it appears

Goal: reduce the dimensionality of the two views jointly



An example

Setup:

Input data: (x1,y1), . . . , (xn,yn) (matrices X,Y)

Goal: find pair of projections (u,v)

In the figure, x and y are paired by brightness

Dimensionality reduction solutions:

[Figure: two solutions shown side by side, "Independent" and "Joint"]



From PCA to CCA

PCA on views separately: no covariance term

max_{u,v} u>XX>u / (u>u) + v>YY>v / (v>v)

PCA on concatenation (X>,Y>)>: includes covariance term

max_{u,v} (u>XX>u + 2u>XY>v + v>YY>v) / (u>u + v>v)

Maximum covariance: drop variance terms

max_{u,v} u>XY>v / (√(u>u) √(v>v))

Maximum correlation (CCA): divide out variance terms

max_{u,v} u>XY>v / (√(u>XX>u) √(v>YY>v))
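The maximum-covariance objective has a particularly direct solution: the leading singular vectors of the cross-covariance matrix XY>. A two-line sketch, assuming X and Y hold centered, paired points as columns:

    import numpy as np

    def max_covariance_directions(X, Y, k):
        # Top singular pairs of X @ Y.T maximize u^T X Y^T v over unit u, v.
        A, S, Bt = np.linalg.svd(X @ Y.T)
        return A[:, :k], Bt[:k].T      # u's and v's as columns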



Canonical correlation analysis (CCA)

Definitions:

Variance: var(u>x) = u>XX>u
Covariance: cov(u>x, v>y) = u>XY>v

Correlation: corr(u>x, v>y) = cov(u>x, v>y) / (√var(u>x) √var(v>y))

Objective: maximize correlation between projected views

max_{u,v} corr(u>x, v>y)

Properties:

• Focus on how variables are related, not how much they vary

• Invariant to any rotation and scaling of data

Solved via a generalized eigenvalue problem (Aw = λBw; see the sketch below)
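Every objective in this lecture ends up in this form, so one helper suffices. A sketch using SciPy's symmetric-definite solver; the small ridge on B is an assumption to keep it positive definite:

    import numpy as np
    from scipy.linalg import eigh

    def top_generalized_eigvecs(A, B, k, reg=1e-8):
        # Solve A w = lambda B w for symmetric A and positive-definite B.
        evals, evecs = eigh(A, B + reg * np.eye(len(B)))
        order = np.argsort(evals)[::-1]        # eigh returns ascending eigenvalues
        return evecs[:, order[:k]]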



Regularization is important

Extreme examples of degeneracy:

• If x = Ay, then any (u,v) with u = Av is optimal (correlation 1)

• If x and y are independent, then any (u,v) is optimal (correlation 0)

Problem: if X or Y has rank n, then for any v, taking u = (X>)†Y>v is optimal (correlation 1) ⇒ CCA is meaningless!

Solution: regularization (interpolate between maximum covariance and maximum correlation)

max_{u,v} u>XY>v / (√(u>(XX> + λI)u) √(v>(YY> + λI)v))
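A sketch of regularized CCA via whitening: take inverse square roots of the regularized view covariances, then an SVD of the whitened cross-covariance. Columns of X and Y are assumed to be centered, paired points:

    import numpy as np

    def reg_cca(X, Y, k, lam=1e-3):
        Cxx = X @ X.T + lam * np.eye(X.shape[0])
        Cyy = Y @ Y.T + lam * np.eye(Y.shape[0])
        Cxy = X @ Y.T

        def inv_sqrt(C):
            # Inverse matrix square root via eigendecomposition.
            w, V = np.linalg.eigh(C)
            return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

        Wx, Wy = inv_sqrt(Cxx), inv_sqrt(Cyy)
        # Canonical correlations = singular values of the whitened cross-covariance.
        A, S, Bt = np.linalg.svd(Wx @ Cxy @ Wy)
        return Wx @ A[:, :k], Wy @ Bt[:k].T, S[:k]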



Kernel CCA

Two kernels: kx and ky

Direct method:

(some math)

Modular method:

1. Transform xi into x′i ∈ Rn satisfying
   kx(xi, xj) = x′i>x′j (do the same for y with ky)

2. Perform regular CCA

Regularization is especially important for kernel CCA!
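A sketch of the modular method, reusing the reg_cca helper from the regularization slide: factor each Gram matrix so inner products are preserved, then run regularized CCA on the surrogate points. The jitter term is an assumption to keep the Cholesky well posed.

    import numpy as np

    def kernel_cca(Kx, Ky, k, lam=1e-2, jitter=1e-9):
        # Kx, Ky: n x n Gram matrices for the two views.
        n = len(Kx)
        Lx = np.linalg.cholesky(Kx + jitter * np.eye(n))   # rows of Lx are the x'_i
        Ly = np.linalg.cholesky(Ky + jitter * np.eye(n))   # rows of Ly are the y'_i
        # reg_cca (defined earlier) expects points as columns, hence the transposes.
        return reg_cca(Lx.T, Ly.T, k, lam=lam)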


Roadmap

• Principal component analysis (PCA)

– Basic principles

– Case studies

– Kernel PCA

– Probabilistic PCA

• Canonical correlation analysis (CCA)

• Fisher discriminant analysis (FDA)

• Summary



Motivation for FDA [Fisher, 1936]

What is the best linear projection with these labels?

[Figure: the same labeled data projected two ways, "PCA solution" vs. "FDA solution"]

Goal: reduce the dimensionality given labels

Idea: want the projection to maximize overall interclass variance relative to intraclass variance

Linear classifiers (logistic regression, SVMs) have a similar feel:

Find a one-dimensional subspace w, e.g., to maximize the margin between different classes

FDA handles multiple classes and allows multiple dimensions



FDA objective function

Setup: xi ∈ Rd, yi ∈ {1, . . . ,m}, for i = 1, . . . , n

Objective: maximize interclass variance / intraclass variance = total variance / intraclass variance − 1

Total variance: (1/n) ∑i (u>(xi − µ))²
Mean of all points: µ = (1/n) ∑i xi

Intraclass variance: (1/n) ∑i (u>(xi − µyi))²
Mean of points in class y: µy = (1/|{i : yi = y}|) ∑{i : yi = y} xi

Reduces to a generalized eigenvalue problem (see the sketch below).

Kernel FDA: use modular method
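A sketch of FDA as the generalized eigenproblem S_B u = λ S_W u over interclass and intraclass scatter matrices; the small ridge on S_W is an assumption for numerical stability:

    import numpy as np
    from scipy.linalg import eigh

    def fda(X, y, k):
        # X is n x d; y holds integer class labels.
        n, d = X.shape
        mu = X.mean(axis=0)
        Sw = np.zeros((d, d))      # intraclass scatter
        Sb = np.zeros((d, d))      # interclass scatter
        for c in np.unique(y):
            Xc = X[y == c]
            mu_c = Xc.mean(axis=0)
            Sw += (Xc - mu_c).T @ (Xc - mu_c)
            diff = (mu_c - mu)[:, None]
            Sb += len(Xc) * (diff @ diff.T)
        evals, evecs = eigh(Sb, Sw + 1e-8 * np.eye(d))
        return evecs[:, np.argsort(evals)[::-1][:k]]   # top-k projection directions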



Other linear methods

Random projections:

Randomly project data onto k = O(log n) dimensions

All pairwise distances preserved with high probability:

‖U>xi − U>xj‖² ≈ ‖xi − xj‖² for all i, j

Trivial to implement
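Trivial indeed; a sketch with Gaussian directions (the 1/√k scaling, which preserves squared norms in expectation, is a standard choice, not prescribed by the slide):

    import numpy as np

    def random_projection(X, k, rng=np.random.default_rng(0)):
        # Project n x d data onto k random Gaussian directions.
        d = X.shape[1]
        U = rng.standard_normal((d, k)) / np.sqrt(k)
        return X @ U                                   # n x k projected data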

Kernel dimensionality reduction:

One type of sufficient dimensionality reduction

Find subspace that contains all information about labels

y ⊥⊥ x | U>x

Capturing information is stronger than capturing variance

Hard nonconvex optimization problem



Summary

Framework: z = U>x, x ≈ Uz

Criteria for choosing U:

• PCA: maximize projected variance

• CCA: maximize projected correlation

• FDA: maximize projected interclass variance / intraclass variance

Algorithm: generalized eigenvalue problem

Extensions:

non-linear, via kernels (within the same linear framework)

probabilistic, sparse, robust (hard optimization)
