CS 2750: Machine Learning
Dimensionality Reduction
Prof. Adriana Kovashka
University of Pittsburgh
January 19, 2017
Plan for today
• Dimensionality reduction – motivation
• Principal Component Analysis (PCA)
• Applications of PCA
• Other methods for dimensionality reduction
Why reduce dimensionality?
• Data may intrinsically live in a lower-dim space
• Too many features and too few data
• Lower computational expense (memory, train/test time)
• Want to visualize the data in a lower-dim space
• Want to use data of different dimensionality
Goal
• Input: Data in a high-dim feature space
• Output: Projection of same data into a lower-dim space
• F: high-dim X → low-dim X
Goal
Slide credit: Erik Sudderth
Some criteria for success
• Find a projection where the data has:
– Low reconstruction error
– High variance of the data
See hand-written notes for how we find the optimal projection
Slide credit: Subhransu Maji
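As a rough illustration of these two criteria (not the course's derivation; see the hand-written notes and the demo scripts below), here is a minimal MATLAB sketch of PCA on an N x D data matrix X. The names X and K, and the choice K = 2, are placeholders:
mu = mean(X, 1);                               % mean of each feature (1 x D)
Xc = X - repmat(mu, size(X, 1), 1);            % center the data
C = (Xc' * Xc) / (size(X, 1) - 1);             % D x D covariance matrix
[U, L] = eig(C);                               % eigenvectors (columns of U), eigenvalues (diag of L)
[lambda, order] = sort(diag(L), 'descend');    % sort by decreasing variance
U = U(:, order);
K = 2;                                         % number of components to keep (example value)
W = Xc * U(:, 1:K);                            % N x K projection: directions of highest variance
Xhat = repmat(mu, size(X, 1), 1) + W * U(:, 1:K)';   % reconstruction back in D dimensions
recon_error = mean(sum((X - Xhat).^2, 2));     % small when the projection is good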
Principal Components Analysis
Demo
• http://www.cs.pitt.edu/~kovashka/cs2750_sp17/PCA_demo.m
• http://www.cs.pitt.edu/~kovashka/cs2750_sp17/PCA.m
• Demo with eigenfaces: http://www.cs.ait.ac.th/~mdailey/matlab/
Implementation issue
• Covariance matrix is huge (D^2 entries for D pixels)
• But typically # examples N << D
• Simple trick:
– X is the N x D matrix of normalized training data
– Solve for eigenvectors u of XX^T instead of X^TX
– Then X^Tu is an eigenvector of the covariance X^TX
– Need to normalize each vector X^Tu to unit length
Adapted from Derek Hoiem
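A hedged sketch of this trick (variable names assumed; X is the N x D matrix of normalized training data as above):
G = X * X';                                    % small N x N matrix instead of the D x D covariance
[V, L] = eig(G);                               % eigenvectors v of X*X' (eigenvalues in L)
U = X' * V;                                    % each column X'*v is an eigenvector of X'*X
U = U ./ repmat(sqrt(sum(U.^2, 1)), size(U, 1), 1);   % normalize each column to unit length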
How to pick K?
• One goal can be to pick K such that a fraction P of the variance of the data is preserved, e.g. P = 0.9 (90%)
• Let Λ = a vector containing the eigenvalues of the covariance matrix, sorted in decreasing order
• Total variance can be obtained from the entries of Λ
– total_variance = sum(Λ);
• Take as many of the largest entries as needed
– K = find( cumsum(Λ) / total_variance >= P, 1);
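A tiny worked example of this rule, with made-up eigenvalues purely for illustration:
lambda = [4.0; 2.5; 1.0; 0.3; 0.2];            % hypothetical eigenvalues, sorted in decreasing order
P = 0.9;                                        % keep 90% of the variance
total_variance = sum(lambda);                   % = 8.0
K = find(cumsum(lambda) / total_variance >= P, 1);   % fractions 0.50, 0.81, 0.94, ... so K = 3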
Variance preserved at i-th eigenvalue
Figure 12.4 (a) from Bishop
Application: Face Recognition
Image from cnet.com
Face recognition: once you’ve detected and cropped a face, try to recognize it
Detection Recognition “Sally”
Slide credit: Lana Lazebnik
Typical face recognition scenarios
• Verification: a person is claiming a particular identity; verify whether that is true– E.g., security
• Closed-world identification: assign a face to one person from among a known set
• General identification: assign a face to a known person or to “unknown”
Slide credit: Derek Hoiem
The space of all face images
• When viewed as vectors of pixel values, face images are extremely high-dimensional
– A 24x24 image = 576 dimensions
– Slow and lots of storage
• But very few 576-dimensional vectors are valid face images
• We want to effectively model the subspace of face images
Adapted from Derek Hoiem
Representation and reconstruction
• Face x in "face space" coordinates: (w1, …, wk) = (u1^T(x − µ), …, uk^T(x − µ))
• Reconstruction: x̂ = µ + w1u1 + w2u2 + w3u3 + w4u4 + …
Slide credit: Derek Hoiem
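A minimal sketch of these two operations, assuming µ is the D x 1 mean face, the columns of U are the eigenfaces, x is one face image unrolled into a D x 1 vector, and K is the number of components kept (all names are placeholders):
w = U(:, 1:K)' * (x - mu);                      % "face space" coordinates (w1, ..., wK)
xhat = mu + U(:, 1:K) * w;                      % reconstruction: mu + w1*u1 + ... + wK*uK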
Recognition with eigenfaces
Process labeled training images:
• Find the mean µ and covariance matrix Σ
• Find the k principal components (eigenvectors of Σ) u1, …, uk
• Project each training image xi onto the subspace spanned by the principal components: (wi1, …, wik) = (u1^T xi, …, uk^T xi)
Given a novel image x:
• Project onto the subspace: (w1, …, wk) = (u1^T x, …, uk^T x)
• Classify as the closest training face in the k-dimensional subspace
M. Turk and A. Pentland, Face Recognition using Eigenfaces, CVPR 1991
Adapted from Derek Hoiem
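A hedged sketch of this pipeline (not Turk and Pentland's code; Xtrain is N x D with one vectorized face per row, labels is N x 1, the columns of U are the eigenvectors of Σ, x_new is a 1 x D novel face, and faces are mean-subtracted before projection):
mu = mean(Xtrain, 1);                           % mean face (1 x D)
W_train = (Xtrain - repmat(mu, size(Xtrain, 1), 1)) * U(:, 1:k);   % project training faces
w_new = (x_new - mu) * U(:, 1:k);               % project the novel face
dists = sum((W_train - repmat(w_new, size(W_train, 1), 1)).^2, 2); % distances in the k-dim subspace
[~, nn] = min(dists);                           % index of the closest training face
predicted_label = labels(nn);                   % classify by nearest neighbor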
Slide credit: Alexander Ihler
Plan for today
• Dimensionality reduction – motivation
• Principal Component Analysis (PCA)
• Applications of PCA
• Other methods for dimensionality reduction
PCA
• General dimensionality reduction technique
• Preserves most of the variance with a much more compact representation
– Lower storage requirements (eigenvectors + a few numbers per face)
– Faster matching
• What are some problems?
Slide credit: Derek Hoiem
PCA limitations
• The direction of maximum variance is not always good for classification
Slide credit: Derek Hoiem
PCA limitations
• PCA preserves maximum variance
• A more discriminative subspace:
Fisher Linear Discriminants
• FLD preserves discrimination
– Find projection that maximizes scatter between classes and minimizes scatter within classes
Adapted from Derek Hoiem
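A hedged two-class sketch of this criterion in its simple closed form (X1 and X2 are N1 x D and N2 x D matrices of examples from the two classes; all names are assumptions):
m1 = mean(X1, 1)';  m2 = mean(X2, 1)';          % class means (D x 1)
X1c = X1 - repmat(m1', size(X1, 1), 1);         % center each class
X2c = X2 - repmat(m2', size(X2, 1), 1);
Sw = X1c' * X1c + X2c' * X2c;                   % within-class scatter matrix
w = Sw \ (m1 - m2);                             % direction maximizing between- vs. within-class scatter
w = w / norm(w);                                % unit-length projection direction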
Figure: using two classes as an example, a poor projection mixes the classes while a good projection separates them (axes x1, x2)
Slide credit: Derek Hoiem
Fisher’s Linear Discriminant
Slide credit: Derek Hoiem
Comparison with PCA
Other dimensionality reduction methods
• Non-linear:
– Kernel PCA – Schölkopf et al., Neural Computation 1998
– Independent component analysis – Comon, Signal Processing 1994
– LLE (locally linear embedding) – Roweis and Saul, Science 2000
– ISOMAP (isometric feature mapping) – Tenenbaum et al., Science 2000
– t-SNE (t-distributed stochastic neighbor embedding) – van der Maaten and Hinton, JMLR 2008
ISOMAP example
Figure from Carlotta Domeniconi
ISOMAP example
Figure from Carlotta Domeniconi
t-SNE example
Figure from Genevieve Patterson, IJCV 2014
t-SNE example
Thomas and Kovashka, CVPR 2016
t-SNE example
Thomas and Kovashka, CVPR 2016