Linear Dimensionality Reduction: Principal Component Analysis
Piyush Rai
Machine Learning (CS771A)
Sept 2, 2016
Dimensionality Reduction
Usually considered an unsupervised learning method
Used for learning the low-dimensional structure in the data
Also useful for “feature learning” or “representation learning” (learning a better, often smaller-dimensional, representation of the data), e.g.,
Documents represented using topic vectors instead of bag-of-words vectors
Images represented using their constituent parts (e.g., faces via eigenfaces)
Can be used for speeding up learning algorithms
Can be used for data compression
Curse of Dimensionality
An exponentially large number of examples is required to “fill up” high-dimensional spaces
Fewer dimensions ⇒ lower chance of overfitting ⇒ better generalization
Dimensionality reduction is a way to beat the curse of dimensionality
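A quick back-of-the-envelope sketch of the “filling up” argument (the grid resolution of 10 bins per dimension is an arbitrary illustrative choice, not part of the slides):

```python
def cells_needed(D, bins_per_dim=10):
    """Number of cells in a grid over [0,1]^D with bins_per_dim bins per axis.
    Even one example per cell requires exponentially many examples in D."""
    return bins_per_dim ** D

for D in (1, 2, 10, 100):
    print(f"D={D}: {cells_needed(D)} cells")
```

At D = 100 the count already exceeds the number of atoms in the observable universe, which is the sense in which high-dimensional spaces cannot be “filled up”.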
Linear Dimensionality Reduction
A projection matrix U = [u_1 u_2 ... u_K] of size D × K defines K linear projection directions, each u_k ∈ R^D, for the D-dimensional data (assume K < D)
Can use U to transform x_n ∈ R^D into z_n ∈ R^K
Note that z_n = U^T x_n = [u_1^T x_n  u_2^T x_n  ...  u_K^T x_n]^T is a K-dimensional projection of x_n
z_n ∈ R^K is also called the low-dimensional “embedding” of x_n ∈ R^D
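The transformation can be sketched in NumPy (a minimal illustration: D, K, the data point, and U here are arbitrary random choices, not yet the directions PCA will learn):

```python
import numpy as np

rng = np.random.default_rng(0)
D, K = 5, 2                      # original and reduced dimensions (K < D)
x_n = rng.normal(size=D)         # one data point in R^D
U = rng.normal(size=(D, K))      # K projection directions, each a column in R^D

z_n = U.T @ x_n                  # K-dim projection/embedding of x_n
assert z_n.shape == (K,)
# Each coordinate of z_n is the projection of x_n on one direction u_k
assert np.allclose(z_n[0], U[:, 0] @ x_n)
```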
X = [x_1 x_2 ... x_N] is the D × N matrix of all N data points
Z = [z_1 z_2 ... z_N] is the K × N matrix of their embeddings
With this notation, the transformation for the full data set can be written compactly as Z = U^T X
How do we learn the “best” projection matrix U?
What criterion should we optimize when learning U?
Principal Component Analysis (PCA) is an algorithm for doing this
Principal Component Analysis (PCA)
A classic linear dimensionality reduction method (Pearson, 1901; Hotelling, 1933)
Can be seen as
Learning projection directions that capture the maximum variance in the data
Learning projection directions that result in the smallest reconstruction error
Can also be seen as changing the basis in which the data is represented (and transforming the features such that the new features become decorrelated)
Also related to other classic methods, e.g., Factor Analysis (Spearman, 1904)
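The decorrelation view can be checked numerically: projecting centered data onto the eigenvectors of its covariance matrix yields new features whose covariance is (numerically) diagonal. A minimal sketch with arbitrary synthetic data (the mixing matrix is just there to make the original features correlated):

```python
import numpy as np

rng = np.random.default_rng(1)
N, D = 500, 3
X = rng.normal(size=(D, N)) * np.array([[3.0], [1.0], [0.5]])  # D x N, unequal scales
X = np.array([[1, 1, 0], [0, 1, 1], [1, 0, 1]], float) @ X     # mix features -> correlated
X = X - X.mean(axis=1, keepdims=True)                          # center

S = (X @ X.T) / N                      # D x D covariance
eigvals, U = np.linalg.eigh(S)         # eigenvectors of S form the new basis
Z = U.T @ X                            # data re-expressed in the new basis
S_new = (Z @ Z.T) / N                  # covariance of the new features
off_diag = S_new - np.diag(np.diag(S_new))
print(np.max(np.abs(off_diag)))        # ~0: new features are decorrelated
```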
PCA as Maximizing Variance
Variance Captured by Projections
Consider projecting x_n ∈ R^D onto a one-dimensional subspace defined by u_1 ∈ R^D
The projection/embedding of x_n along u_1 is u_1^T x_n
Mean of the projections of all the data: (1/N) ∑_{n=1}^N u_1^T x_n = u_1^T ((1/N) ∑_{n=1}^N x_n) = u_1^T µ
Variance of the projected data (its “spread” along u_1):
(1/N) ∑_{n=1}^N (u_1^T x_n − u_1^T µ)^2 = (1/N) ∑_{n=1}^N {u_1^T (x_n − µ)}^2 = u_1^T S u_1
S is the D × D data covariance matrix: S = (1/N) ∑_{n=1}^N (x_n − µ)(x_n − µ)^T. If the data is already centered (µ = 0), then S = (1/N) ∑_{n=1}^N x_n x_n^T = (1/N) XX^T
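The identity “variance of the projections = u_1^T S u_1” can be verified numerically (random data and a random unit direction; the sizes are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(2)
N, D = 1000, 4
X = rng.normal(size=(D, N))            # D x N data matrix
mu = X.mean(axis=1, keepdims=True)     # D x 1 mean vector

u1 = rng.normal(size=D)
u1 /= np.linalg.norm(u1)               # unit-norm projection direction

proj = u1 @ X                          # projections u_1^T x_n for all n
var_direct = np.mean((proj - u1 @ mu) ** 2)   # variance computed from projections

S = ((X - mu) @ (X - mu).T) / N        # D x D covariance matrix
var_quadratic = u1 @ S @ u1            # variance via the quadratic form
assert np.isclose(var_direct, var_quadratic)
```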
Direction of Maximum Variance
We want u_1 such that the variance of the projected data is maximized:
arg max_{u_1} u_1^T S u_1
To prevent the trivial solution (unbounded variance by scaling u_1), constrain ||u_1||^2 = u_1^T u_1 = 1
We will find u_1 by solving the following constrained optimization problem:
arg max_{u_1} u_1^T S u_1 + λ_1 (1 − u_1^T u_1)
where λ_1 is a Lagrange multiplier
The objective function: arg max_{u_1} u_1^T S u_1 + λ_1 (1 − u_1^T u_1)
Taking the derivative w.r.t. u_1 and setting it to zero gives
S u_1 = λ_1 u_1
Thus u_1 is an eigenvector of S (with corresponding eigenvalue λ_1)
But which of S’s D possible eigenvectors is it?
Note that since u_1^T u_1 = 1, the variance of the projected data is
u_1^T S u_1 = λ_1
Variance is therefore maximized when u_1 is the (top) eigenvector with the largest eigenvalue
The top eigenvector u_1 is also known as the first Principal Component (PC)
Other directions can be found likewise (each orthogonal to all previous ones) using the eigendecomposition of S (this is PCA)
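This conclusion can be sanity-checked numerically: the top eigenvector of S attains at least as much projected variance as any random unit direction (synthetic data; sizes are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(3)
N, D = 500, 5
X = rng.normal(size=(D, N))
X = X - X.mean(axis=1, keepdims=True)        # center the data
S = (X @ X.T) / N                            # D x D covariance

eigvals, eigvecs = np.linalg.eigh(S)         # eigh returns ascending eigenvalues
u1 = eigvecs[:, -1]                          # top eigenvector (largest eigenvalue)
assert np.isclose(u1 @ S @ u1, eigvals[-1])  # variance along u1 equals lambda_1

# No random unit direction captures more variance than the top eigenvector
for _ in range(100):
    v = rng.normal(size=D)
    v /= np.linalg.norm(v)
    assert v @ S @ v <= eigvals[-1] + 1e-12
```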
Steps in Principal Component Analysis
Center the data (subtract the mean µ = (1/N) ∑_{n=1}^N x_n from each data point)
Compute the covariance matrix S using the centered data:
S = (1/N) XX^T (note: X assumed D × N here)
Do an eigendecomposition of the covariance matrix S
Take the K leading eigenvectors {u_k}_{k=1}^K with eigenvalues {λ_k}_{k=1}^K
The final K-dimensional projection/embedding of the data is given by
Z = U^T X
where U = [u_1 ... u_K] is D × K and the embedding matrix Z is K × N
A word about notation: if X is N × D, then S = (1/N) X^T X (needs to be D × D) and the embedding will be computed as Z = XU, where Z is N × K
Machine Learning (CS771A) Linear Dimensionality Reduction: Principal Component Analysis 13
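The steps above can be sketched in NumPy (a minimal illustration on synthetic data; the variable names are mine, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(1)
N, D, K = 200, 5, 2
X = rng.normal(size=(D, N))                # D x N, as assumed on the slide

# Step 1: center the data
mu = X.mean(axis=1, keepdims=True)
Xc = X - mu

# Step 2: covariance matrix S = (1/N) X X^T
S = Xc @ Xc.T / N

# Steps 3-4: eigendecomposition; keep the K leading eigenvectors
eigvals, eigvecs = np.linalg.eigh(S)       # ascending order
U = eigvecs[:, ::-1][:, :K]                # D x K, top-K eigenvectors

# Step 5: K-dim embedding Z = U^T X
Z = U.T @ Xc                               # K x N
```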
![Page 52: Linear Dimensionality Reduction: Principal Component Analysis](https://reader030.vdocuments.site/reader030/viewer/2022013000/61c995a11d380046ed4fa8fa/html5/thumbnails/52.jpg)
PCA as Minimizing the Reconstruction Error
![Page 53: Linear Dimensionality Reduction: Principal Component Analysis](https://reader030.vdocuments.site/reader030/viewer/2022013000/61c995a11d380046ed4fa8fa/html5/thumbnails/53.jpg)
Data as Combination of Basis Vectors
Assume a complete orthonormal basis of vectors u_1, u_2, …, u_D, each u_d ∈ R^D
We can represent each data point x_n ∈ R^D exactly using this new basis
x_n = ∑_{k=1}^D z_{nk} u_k
Denoting z_n = [z_{n1} z_{n2} … z_{nD}]^T, U = [u_1 u_2 … u_D], and using U^T U = I_D
x_n = U z_n   and   z_n = U^T x_n
Also note that each component of the vector z_n is z_{nk} = u_k^T x_n
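A small check of the exact-representation claim (an illustrative sketch, assuming a random orthonormal basis built via a QR decomposition): with a complete orthonormal basis, any x can be rebuilt exactly from its coefficients.

```python
import numpy as np

rng = np.random.default_rng(2)
D = 4
# A complete orthonormal basis U (D x D), e.g. from QR of a random matrix
U, _ = np.linalg.qr(rng.normal(size=(D, D)))

x = rng.normal(size=D)       # an arbitrary data point
z = U.T @ x                  # coefficients z_k = u_k^T x
x_rebuilt = U @ z            # equals sum_k z_k u_k, exactly x
```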
![Page 58: Linear Dimensionality Reduction: Principal Component Analysis](https://reader030.vdocuments.site/reader030/viewer/2022013000/61c995a11d380046ed4fa8fa/html5/thumbnails/58.jpg)
Reconstruction of Data from Projections
Reconstruction of x_n from z_n will be exact if we use all D basis vectors
It will be approximate if we use only K < D basis vectors: x_n ≈ ∑_{k=1}^K z_{nk} u_k
Let's use K = 1 basis vector. Then the one-dim. embedding of x_n is
z_n = u_1^T x_n   (note: this will just be a scalar)
We can now try "reconstructing" x_n from its embedding z_n as follows
x̂_n = u_1 z_n = u_1 u_1^T x_n
Total error or "loss" in reconstructing all the data points
L(u_1) = ∑_{n=1}^N ||x_n − x̂_n||² = ∑_{n=1}^N ||x_n − u_1 u_1^T x_n||²
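The K = 1 embedding and reconstruction loss above can be computed directly (a sketch on synthetic centered data; here rows of X are data points, and u_1 is an arbitrary unit-norm direction):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 3))          # N x D, rows are data points
X = X - X.mean(axis=0)                 # centered

u1 = np.array([1.0, 0.0, 0.0])         # some unit-norm direction
z = X @ u1                             # N scalars, z_n = u1^T x_n
X_hat = np.outer(z, u1)                # reconstructions x_hat_n = u1 z_n

loss = np.sum((X - X_hat) ** 2)        # L(u1) = sum_n ||x_n - x_hat_n||^2
```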
![Page 63: Linear Dimensionality Reduction: Principal Component Analysis](https://reader030.vdocuments.site/reader030/viewer/2022013000/61c995a11d380046ed4fa8fa/html5/thumbnails/63.jpg)
Direction with Best Reconstruction
We want to find the u_1 that minimizes the reconstruction error
L(u_1) = ∑_{n=1}^N ||x_n − u_1 u_1^T x_n||²
       = ∑_{n=1}^N { x_n^T x_n + (u_1 u_1^T x_n)^T (u_1 u_1^T x_n) − 2 x_n^T u_1 u_1^T x_n }
       = ∑_{n=1}^N − u_1^T x_n x_n^T u_1   (using u_1^T u_1 = 1 and ignoring constants w.r.t. u_1)
Thus the problem is equivalent to the following maximization
arg max_{u_1: ||u_1|| = 1} u_1^T ( (1/N) ∑_{n=1}^N x_n x_n^T ) u_1 = arg max_{u_1: ||u_1|| = 1} u_1^T S u_1
where S is the covariance matrix of the data (data assumed centered)
It's the same objective we had when we maximized the variance
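Since minimizing the reconstruction error is the same as maximizing u_1^T S u_1, the top eigenvector should beat any other unit-norm direction. A numerical sketch (synthetic data; the helper `recon_error` is my own name, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(300, 3)) * np.array([2.0, 1.0, 0.3])
X = X - X.mean(axis=0)                 # rows are centered data points
S = X.T @ X / len(X)

def recon_error(u):
    """L(u) = sum_n ||x_n - u u^T x_n||^2 for a unit vector u."""
    z = X @ u
    return np.sum((X - np.outer(z, u)) ** 2)

eigvals, eigvecs = np.linalg.eigh(S)
u_top = eigvecs[:, -1]                 # maximizes u^T S u

# Random unit directions should never beat the top eigenvector
randoms = rng.normal(size=(50, 3))
randoms /= np.linalg.norm(randoms, axis=1, keepdims=True)
best_random = min(recon_error(u) for u in randoms)
```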
![Page 69: Linear Dimensionality Reduction: Principal Component Analysis](https://reader030.vdocuments.site/reader030/viewer/2022013000/61c995a11d380046ed4fa8fa/html5/thumbnails/69.jpg)
How many Principal Components to Use?
The eigenvalue λ_k measures the variance captured by the corresponding PC u_k
The "left-over" variance after keeping K components will therefore be ∑_{k=K+1}^D λ_k
Can choose K by looking at what fraction of the variance is captured by the first K PCs
Another direct way is to look at the plot of the eigenvalue spectrum, or the plot of reconstruction error vs. the number of PCs
Can also use other criteria such as AIC/BIC (or more advanced probabilistic approaches to PCA using nonparametric Bayesian methods)
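The fraction-of-variance criterion can be sketched as follows (illustrative; the 95% threshold is an arbitrary choice of mine, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(5)
# Synthetic data with a few dominant directions
X = rng.normal(size=(500, 6)) * np.array([5.0, 3.0, 1.0, 0.2, 0.1, 0.05])
X = X - X.mean(axis=0)
S = X.T @ X / len(X)

eigvals = np.sort(np.linalg.eigvalsh(S))[::-1]   # descending eigenvalues
frac = np.cumsum(eigvals) / eigvals.sum()        # variance captured by first K PCs

# Smallest K capturing at least 95% of the total variance
K = int(np.searchsorted(frac, 0.95) + 1)
```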
![Page 74: Linear Dimensionality Reduction: Principal Component Analysis](https://reader030.vdocuments.site/reader030/viewer/2022013000/61c995a11d380046ed4fa8fa/html5/thumbnails/74.jpg)
PCA as Matrix Factorization
Note that PCA represents each x_n as x_n = U z_n
When using only K < D components, x_n ≈ U z_n
For all N data points, we can write the same as
X ≈ UZ
where X is D × N, U is D × K, and Z is K × N
The above approximation is equivalent to a low-rank matrix factorization of X
Also closely related to Singular Value Decomposition (SVD); see next slide
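The rank-K factorization view implies that the squared Frobenius error ||X − UZ||² shrinks as K grows and vanishes at K = D. A sketch (synthetic data; `factorization_error` is a hypothetical helper name of mine):

```python
import numpy as np

rng = np.random.default_rng(6)
D, N = 5, 200
X = rng.normal(size=(D, N))
X = X - X.mean(axis=1, keepdims=True)        # centered, D x N
S = X @ X.T / N

eigvals, eigvecs = np.linalg.eigh(S)
eigvecs = eigvecs[:, ::-1]                   # descending eigenvalue order

def factorization_error(K):
    """Squared Frobenius error of the rank-K factorization X ~ U Z."""
    U = eigvecs[:, :K]                       # D x K
    Z = U.T @ X                              # K x N
    return np.linalg.norm(X - U @ Z) ** 2

errors = [factorization_error(K) for K in range(1, D + 1)]
```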
![Page 79: Linear Dimensionality Reduction: Principal Component Analysis](https://reader030.vdocuments.site/reader030/viewer/2022013000/61c995a11d380046ed4fa8fa/html5/thumbnails/79.jpg)
PCA and SVD
A rank-K SVD approximates a data matrix X as follows: X ≈ U Λ V^T
U is a D × K matrix with the top K left singular vectors of X
Λ is a K × K diagonal matrix (with the top K singular values)
V is an N × K matrix with the top K right singular vectors of X
Rank-K SVD is based on minimizing the reconstruction error
||X − U Λ V^T||
PCA is equivalent to the best rank-K SVD after centering the data
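The PCA/SVD equivalence can be verified numerically (a sketch on synthetic centered data): the top-K eigenvectors of S match the top-K left singular vectors of the centered X up to sign, with λ_k = σ_k²/N.

```python
import numpy as np

rng = np.random.default_rng(7)
D, N, K = 4, 300, 2
X = rng.normal(size=(D, N))
X = X - X.mean(axis=1, keepdims=True)        # centering is essential here

# PCA route: eigendecomposition of S = (1/N) X X^T
S = X @ X.T / N
eigvals, eigvecs = np.linalg.eigh(S)
U_pca = eigvecs[:, ::-1][:, :K]              # top-K eigenvectors

# SVD route: left singular vectors of the centered X
U_svd, svals, Vt = np.linalg.svd(X, full_matrices=False)
U_svd = U_svd[:, :K]
# Columns agree up to sign, and eigenvalues satisfy lambda_k = sigma_k^2 / N
```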
![Page 86: Linear Dimensionality Reduction: Principal Component Analysis](https://reader030.vdocuments.site/reader030/viewer/2022013000/61c995a11d380046ed4fa8fa/html5/thumbnails/86.jpg)
PCA: Some Comments
The idea of approximating each data point as a combination of basis vectors
x_n ≈ ∑_{k=1}^K z_{nk} u_k   or   X ≈ UZ
is also popularly known as "Dictionary Learning" in signal/image processing; the learned basis vectors represent the "Dictionary"
Some examples:
Each face in a collection as a combination of a small number of "eigenfaces"
Each document in a collection as a combination of a small number of "topics"
Each gene-expression sample as a combination of a small number of "genetic pathways"
The "eigenfaces", "topics", "genetic pathways", etc. are the "basis vectors", which can be learned from data using PCA/SVD or other similar methods
![Page 87: Linear Dimensionality Reduction: Principal Component Analysis](https://reader030.vdocuments.site/reader030/viewer/2022013000/61c995a11d380046ed4fa8fa/html5/thumbnails/87.jpg)
PCA: Some Comments
The idea of approximating each data point as a combination of basis vectors
xn ≈K∑
k=1
znkuk or X ≈ UZ
is also popularly known as “Dictionary Learning” in signal/image processing; the learned basisvectors represent the “Dictionary”
Some examples:
Each face in a collection as a combination of a small no of “eigenfaces”
Each document in a collection as a comb. of a small no of “topics”
Each gene-expression sample as a comb. of a small no of “genetic pathways”
The “eigenfaces”, “topics”, “genetic pathways”, etc. are the “basis vectors”, which can be learnedfrom data using PCA/SVD or other similar methods
![Page 91: Linear Dimensionality Reduction: Principal Component Analysis](https://reader030.vdocuments.site/reader030/viewer/2022013000/61c995a11d380046ed4fa8fa/html5/thumbnails/91.jpg)
PCA: Example
![Page 92: Linear Dimensionality Reduction: Principal Component Analysis](https://reader030.vdocuments.site/reader030/viewer/2022013000/61c995a11d380046ed4fa8fa/html5/thumbnails/92.jpg)
PCA: Example
![Page 93: Linear Dimensionality Reduction: Principal Component Analysis](https://reader030.vdocuments.site/reader030/viewer/2022013000/61c995a11d380046ed4fa8fa/html5/thumbnails/93.jpg)
PCA: Limitations and Extensions

A linear projection method

Won't work well if the data can't be approximated by a linear subspace

But PCA can be kernelized easily (Kernel PCA)

Variance-based projection directions can sometimes be suboptimal, e.g., if we want to preserve class separation when doing classification

PCA relies on eigendecomposition of a D × D covariance matrix

Can be slow if done naïvely: takes O(D³) time

Many faster methods exist (e.g., the Power Method)
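The Power Method mentioned above can be sketched as follows: repeated multiplication by a symmetric matrix, with renormalization, converges to its leading eigenvector, so the top principal component can be found without a full O(D³) eigendecomposition. The matrix and sizes below are illustrative.

```python
import numpy as np

# Illustrative symmetric PSD matrix, standing in for a covariance matrix
rng = np.random.default_rng(1)
D = 50
A = rng.normal(size=(D, D))
S = A @ A.T / D

# Power method: v_{t+1} = S v_t / ||S v_t|| converges to the top eigenvector
v = rng.normal(size=D)
for _ in range(500):
    v = S @ v
    v = v / np.linalg.norm(v)

lam = v @ S @ v                 # Rayleigh quotient: estimate of largest eigenvalue

# Compare against a full eigendecomposition
top = np.max(np.linalg.eigvalsh(S))
print(abs(lam - top))           # should be close to 0
```

Each iteration costs only O(D²) (one matrix-vector product), and the convergence rate depends on the eigenvalue gap, i.e., the ratio λ₂/λ₁. Deflating S and repeating recovers further principal directions.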