CE-725: Statistical Pattern Recognition
Sharif University of Technology, Spring 2013
Soleymani
Feature Extraction (PCA & LDA)
Outline
- What is feature extraction?
- Feature extraction algorithms: linear methods
  - Unsupervised: Principal Component Analysis (PCA), also known as the Karhunen-Loève (KL) transform
  - Supervised: Linear Discriminant Analysis (LDA), also known as Fisher's Discriminant Analysis (FDA)
Dimensionality Reduction: Feature Selection vs. Feature Extraction
- Feature selection: select a subset of a given feature set.
- Feature extraction (e.g., PCA, LDA): a linear or non-linear transform on the original feature space.
Feature selection: $[x_1, \ldots, x_d]^T \rightarrow [x_{i_1}, \ldots, x_{i_{d'}}]^T$ (with $d' < d$)

Feature extraction: $[x_1, \ldots, x_d]^T \rightarrow \boldsymbol{y} = f(\boldsymbol{x}) = [y_1, \ldots, y_{d'}]^T$
Feature Extraction
- Mapping of the original data onto a lower-dimensional space.
- The criterion for feature extraction can differ based on the problem setting:
  - Unsupervised task: minimize the information loss (reconstruction error).
  - Supervised task: maximize the class discrimination in the projected space.
- In the previous lecture, we talked about feature selection. Feature selection can be considered a special form of feature extraction in which only a subset of the original features is used.
Example:
$$\boldsymbol{x}' = \begin{bmatrix} 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix}\boldsymbol{x}, \qquad \boldsymbol{x} \in \mathbb{R}^4,\ \boldsymbol{x}' \in \mathbb{R}^2$$
The second and third features are selected.
Feature Extraction
Unsupervised feature extraction: the input is the data matrix
$$\boldsymbol{X} = \begin{bmatrix} x_1^{(1)} & \cdots & x_d^{(1)} \\ \vdots & \ddots & \vdots \\ x_1^{(N)} & \cdots & x_d^{(N)} \end{bmatrix}$$
and the output is a mapping $f: \mathbb{R}^d \rightarrow \mathbb{R}^{d'}$, or only the transformed data
$$\boldsymbol{X}' = \begin{bmatrix} x_1'^{(1)} & \cdots & x_{d'}'^{(1)} \\ \vdots & \ddots & \vdots \\ x_1'^{(N)} & \cdots & x_{d'}'^{(N)} \end{bmatrix}$$

Supervised feature extraction: the input is the data matrix $\boldsymbol{X}$ together with the label vector $\boldsymbol{y} = [y^{(1)}, \ldots, y^{(N)}]^T$, and the output is again a mapping $f: \mathbb{R}^d \rightarrow \mathbb{R}^{d'}$, or only the transformed data $\boldsymbol{X}'$.
Linear Transformation
For a linear transformation, we find an explicit mapping that can also transform new data vectors.
$$\underbrace{\boldsymbol{x}'}_{d'\times 1,\ \text{reduced data}} = \underbrace{\boldsymbol{A}^T}_{d'\times d}\ \underbrace{\boldsymbol{x}}_{d\times 1,\ \text{original data}}$$
Linear Transformation
Linear transformations are simple mappings:
$$\boldsymbol{x}' = \boldsymbol{A}^T\boldsymbol{x}, \qquad x'_j = \boldsymbol{a}_j^T\boldsymbol{x} \quad (j = 1, \ldots, d')$$
$$\boldsymbol{A} = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1d'} \\ a_{21} & a_{22} & \cdots & a_{2d'} \\ \vdots & \vdots & \ddots & \vdots \\ a_{d1} & a_{d2} & \cdots & a_{dd'} \end{bmatrix} = \begin{bmatrix} \boldsymbol{a}_1 & \cdots & \boldsymbol{a}_{d'} \end{bmatrix}$$
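To make the mapping concrete, here is a minimal NumPy sketch (not from the slides; the matrix entries are arbitrary) that projects 4-dimensional points to 2 dimensions via $\boldsymbol{x}' = \boldsymbol{A}^T\boldsymbol{x}$:

```python
import numpy as np

# Projection matrix A with d = 4 original features and d' = 2 new features.
# Its columns a_1, a_2 are the projection directions (arbitrary values here).
A = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0],
              [0.0, 0.0]])          # shape (d, d') = (4, 2)

x = np.array([0.5, -1.2, 3.0, 0.7])  # one original data point, d = 4
x_new = A.T @ x                      # x' = A^T x, shape (2,)

X = np.random.randn(100, 4)          # 100 samples stored as rows
X_new = X @ A                        # row-wise version of the same mapping
print(x_new, X_new.shape)            # (100, 2)
```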
Linear Dimensionality Reduction
- Unsupervised:
  - Principal Component Analysis (PCA) [we will discuss]
  - Independent Component Analysis (ICA)
  - Singular Value Decomposition (SVD)
  - Multi-Dimensional Scaling (MDS)
  - Canonical Correlation Analysis (CCA)
- Supervised:
  - Linear Discriminant Analysis (LDA) [we will discuss]
Unsupervised Feature Reduction
- Visualization: projection of high-dimensional data onto 2D or 3D.
- Data compression: efficient storage, communication, and retrieval.
- Noise removal: improving accuracy by removing irrelevant features.
- As a preprocessing step to reduce dimensionality for classification tasks.
Principal Component Analysis (PCA)
- The “best” subspace:
  - Centered at the sample mean.
  - The axes have been rotated to new (principal) axes such that:
    - Principal axis 1 has the highest variance, principal axis 2 has the next highest variance, and so on.
    - The principal axes are uncorrelated: the covariance between each pair of principal axes is zero.
- Goal: reduce the dimensionality of the data while preserving the variation present in the dataset as much as possible.
Principal Component Analysis (PCA)
- Principal Components (PCs): orthogonal vectors that are ordered by the fraction of the total information (variation) in the corresponding directions.
- PCs can be found as the “best” eigenvectors of the covariance matrix of the data points.
  - If the data has a Gaussian distribution N(μ, Σ), the direction of largest variance is given by the eigenvector of Σ that corresponds to the largest eigenvalue of Σ.
Covariance Matrix
ML estimate of the covariance matrix from data points $\boldsymbol{x}^{(i)}$, $i = 1, \ldots, N$:
$$\hat{\boldsymbol{\mu}} = \frac{1}{N}\sum_{i=1}^{N}\boldsymbol{x}^{(i)}, \qquad \hat{\boldsymbol{\Sigma}} = \frac{1}{N}\sum_{i=1}^{N}\left(\boldsymbol{x}^{(i)} - \hat{\boldsymbol{\mu}}\right)\left(\boldsymbol{x}^{(i)} - \hat{\boldsymbol{\mu}}\right)^T = \frac{1}{N}\tilde{\boldsymbol{X}}^T\tilde{\boldsymbol{X}}$$
where the mean-centered data matrix is
$$\tilde{\boldsymbol{X}} = \begin{bmatrix} \left(\boldsymbol{x}^{(1)} - \hat{\boldsymbol{\mu}}\right)^T \\ \vdots \\ \left(\boldsymbol{x}^{(N)} - \hat{\boldsymbol{\mu}}\right)^T \end{bmatrix}$$
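A minimal NumPy check of this estimate (a sketch, assuming the data matrix X stores one sample per row):

```python
import numpy as np

X = np.random.randn(200, 5)                        # N = 200 samples, d = 5 features
mu = X.mean(axis=0)                                # ML estimate of the mean
X_centered = X - mu                                # mean-centered data matrix
Sigma = (X_centered.T @ X_centered) / X.shape[0]   # ML covariance estimate (1/N factor)

# np.cov uses 1/(N-1) by default; bias=True switches it to the same 1/N estimate.
print(np.allclose(Sigma, np.cov(X, rowvar=False, bias=True)))   # True
```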
Eigenvalues and Eigenvectors: Geometrical Interpretation
- The covariance matrix is a PSD matrix corresponding to a hyper-ellipsoid in d-dimensional space.
- Its eigenvectors give the principal axes of this ellipsoid: $\boldsymbol{\Sigma}\boldsymbol{u}_i = \lambda_i\boldsymbol{u}_i$; for $d = 2$ the eigenvalue matrix is $\boldsymbol{\Lambda} = \begin{bmatrix} \lambda_1 & 0 \\ 0 & \lambda_2 \end{bmatrix}$.
PCA: Steps
- Input: data matrix $\boldsymbol{X}_{N\times d}$ (each row contains a $d$-dimensional data point).
- $\boldsymbol{\mu} = \frac{1}{N}\sum_{i=1}^{N}\boldsymbol{x}^{(i)}$; the mean of the data points is subtracted from each row of $\boldsymbol{X}$, giving $\tilde{\boldsymbol{X}}$.
- $\boldsymbol{\Sigma} = \frac{1}{N}\tilde{\boldsymbol{X}}^T\tilde{\boldsymbol{X}}$ (covariance matrix).
- Calculate the eigenvalues and eigenvectors of $\boldsymbol{\Sigma}$.
- Pick the eigenvectors corresponding to the $d'$ largest eigenvalues and put them in the columns of $\boldsymbol{A} = [\boldsymbol{v}_1, \ldots, \boldsymbol{v}_{d'}]$ ($\boldsymbol{v}_1$: first PC, ..., $\boldsymbol{v}_{d'}$: $d'$-th PC).
- $\boldsymbol{X}' = \tilde{\boldsymbol{X}}\boldsymbol{A}$.
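A minimal NumPy sketch of these steps (names such as pca and d_prime are illustrative; eigh is used because the covariance matrix is symmetric):

```python
import numpy as np

def pca(X, d_prime):
    """Project the rows of X (N x d) onto the d_prime leading principal components."""
    mu = X.mean(axis=0)
    X_centered = X - mu                                  # subtract the mean from each row
    Sigma = (X_centered.T @ X_centered) / X.shape[0]     # covariance matrix
    eigvals, eigvecs = np.linalg.eigh(Sigma)             # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:d_prime]          # indices of the largest eigenvalues
    A = eigvecs[:, order]                                # columns = first d_prime PCs
    return X_centered @ A, A, mu                         # X' = (X - mu) A

X = np.random.randn(500, 10)
X_reduced, A, mu = pca(X, d_prime=3)
print(X_reduced.shape)   # (500, 3)
```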
Another Interpretation: Least Squares Error
PCs are linear least-squares fits to the samples, each orthogonal to the previous PCs:
- The first PC is a minimum-distance fit to a vector in the original feature space.
- The second PC is a minimum-distance fit to a vector in the plane perpendicular to the first PC.
Least Squares Error and Maximum Variance Views Are Equivalent (1-dim Interpretation)
Minimizing the sum of squared distances to the line is equivalent to maximizing the sum of squares of the projections on that line (by Pythagoras, since the squared distance of each point from the origin is fixed).
PCA: Uncorrelated Features
- If $\boldsymbol{x}' = \boldsymbol{A}^T\boldsymbol{x}$, where the columns of $\boldsymbol{A}$ are orthonormal eigenvectors of $\boldsymbol{\Sigma}$, then mutually uncorrelated features are obtained (the covariance of $\boldsymbol{x}'$ is diagonal).
- Completely uncorrelated features avoid information redundancies.
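A quick numerical check of this property (a self-contained sketch with synthetic correlated data, not from the slides): the covariance of the projected data comes out diagonal.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 5)) @ rng.standard_normal((5, 5))  # correlated features
Xc = X - X.mean(axis=0)
Sigma = (Xc.T @ Xc) / X.shape[0]

_, eigvecs = np.linalg.eigh(Sigma)
A = eigvecs[:, ::-1][:, :3]                 # top 3 orthonormal eigenvectors of Sigma
X_proj = Xc @ A

cov_proj = (X_proj.T @ X_proj) / X_proj.shape[0]
print(np.round(cov_proj, 6))                # off-diagonal entries are ~0
```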
PCA Derivation (Correlation Version): Mean Square Error Approximation
Incorporating all eigenvectors in $\boldsymbol{A} = [\boldsymbol{v}_1, \ldots, \boldsymbol{v}_d]$: if $d' = d$, then $\boldsymbol{x}$ can be reconstructed exactly from $\boldsymbol{x}'$, since $\boldsymbol{x} = \boldsymbol{A}\boldsymbol{x}' = \sum_{j=1}^{d} x'_j\boldsymbol{v}_j$.
PCA Derivation (Correlation Version): Mean Square Error Approximation
Incorporating only the $d'$ eigenvectors corresponding to the largest eigenvalues, $\boldsymbol{A} = [\boldsymbol{v}_1, \ldots, \boldsymbol{v}_{d'}]$ ($d' < d$), minimizes the MSE between $\boldsymbol{x}$ and $\hat{\boldsymbol{x}} = \boldsymbol{A}\boldsymbol{x}'$:
$$E\left[\|\boldsymbol{x} - \hat{\boldsymbol{x}}\|^2\right] = E\left[\Big\|\sum_{j=d'+1}^{d} x'_j\boldsymbol{v}_j\Big\|^2\right] = \sum_{j=d'+1}^{d} E\left[x_j'^2\right] = \sum_{j=d'+1}^{d} \boldsymbol{v}_j^T E\left[\boldsymbol{x}\boldsymbol{x}^T\right]\boldsymbol{v}_j = \sum_{j=d'+1}^{d} \lambda_j$$
i.e., the sum of the $d - d'$ smallest eigenvalues.
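A small numerical check of this identity (a sketch with synthetic, mean-centered data, so the covariance and correlation versions coincide): the average squared reconstruction error equals the sum of the discarded eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((2000, 6)) @ rng.standard_normal((6, 6))
Xc = X - X.mean(axis=0)
Sigma = (Xc.T @ Xc) / X.shape[0]

eigvals, eigvecs = np.linalg.eigh(Sigma)         # ascending order
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]

d_prime = 3
A = eigvecs[:, :d_prime]
X_hat = (Xc @ A) @ A.T                           # reconstruction from d' components

mse = np.mean(np.sum((Xc - X_hat) ** 2, axis=1))
print(mse, eigvals[d_prime:].sum())              # the two numbers agree
```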
PCA Derivation (Correlation Version): Relation between Eigenvalues and Variances
The $j$-th largest eigenvalue is the variance along the $j$-th PC: $\operatorname{Var}(x'_j) = \lambda_j$.
PCA Derivation (Correlation Version): Mean Square Error Approximation
- In general, it can be shown that this MSE is minimal compared to any other approximation of $\boldsymbol{x}$ by a $d'$-dimensional orthonormal basis; this result can also be obtained without first assuming that the axes are eigenvectors of the correlation matrix.
- If the data is mean-centered in advance, the correlation matrix and the covariance matrix will be the same. However, in the correlation version, when the mean is nonzero, the approximation is not, in general, a good one (although it is a minimum-MSE solution).
PCA on Faces: “Eigenfaces”
[Figure: some sample images from the ORL face database.]
PCA on Faces: “Eigenfaces”
For eigenfaces: “gray” = 0, “white” > 0, “black” < 0.
[Figure: the average face and the 1st through 6th PCs displayed as images.]
PCA on Faces:
- Feature vector $= [y_1, y_2, \ldots, y_{d'}]$
- $\boldsymbol{x} \approx \boldsymbol{\mu} + y_1\boldsymbol{v}_1 + y_2\boldsymbol{v}_2 + y_3\boldsymbol{v}_3 + \cdots$, where $\boldsymbol{\mu}$ is the average face.
- $y_i$ is the projection of $\boldsymbol{x}$ on the $i$-th PC.
- $\boldsymbol{x}$ is a $112 \times 92 = 10304$-dimensional vector containing the pixel intensities of this image.
PCA on Faces: Reconstructed Face
[Figure: faces reconstructed with d'=1, 2, 4, 8, 16, 32, 64, 128, and 256 principal components, alongside the original image.]
PCA Drawback
- An excellent information-packing transform does not necessarily lead to good class separability.
- The directions of maximum variance may be useless for classification purposes.
Independent Component Analysis (ICA)
- PCA:
  - The transformed dimensions will be uncorrelated with each other.
  - Orthogonal linear transform.
  - Only uses second-order statistics (i.e., the covariance matrix).
- ICA:
  - The transformed dimensions will be as independent as possible.
  - Non-orthogonal linear transform.
  - Higher-order statistics are used.
Uncorrelated and Independent
- Uncorrelated: $\operatorname{Cov}(y_i, y_j) = 0$. Independent: $p(y_i, y_j) = p(y_i)\,p(y_j)$.
- Gaussian: Independent ⟺ Uncorrelated.
- Non-Gaussian: Independent ⇒ Uncorrelated, but Uncorrelated ⇏ Independent.
PCA vs. ICA
[Figure: PCA (orthogonal) vs. ICA (non-orthogonal) directions on several example datasets with 1, 2, and 3 classes.]
Kernel PCA
- Kernel extension of PCA.
- Useful when the data (approximately) lies on a lower-dimensional non-linear space (manifold).
Kernel PCA
Map the data to a Hilbert space, $\boldsymbol{x} \rightarrow \phi(\boldsymbol{x})$ (nonlinear extension of PCA):
$$\boldsymbol{\Sigma}_\phi = \frac{1}{N}\sum_{i=1}^{N}\phi\!\left(\boldsymbol{x}^{(i)}\right)\phi\!\left(\boldsymbol{x}^{(i)}\right)^T$$
All eigenvectors of $\boldsymbol{\Sigma}_\phi$ lie in the span of the mapped data points:
$$\boldsymbol{v} = \sum_{i=1}^{N}\alpha_i\,\phi\!\left(\boldsymbol{x}^{(i)}\right)$$
After some algebra, we have:
$$\boldsymbol{K}\boldsymbol{\alpha} = N\lambda\boldsymbol{\alpha}, \qquad K_{ij} = \phi\!\left(\boldsymbol{x}^{(i)}\right)^T\phi\!\left(\boldsymbol{x}^{(j)}\right)$$
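A minimal kernel-PCA sketch along these lines (an assumed RBF kernel; the kernel matrix is centered, which corresponds to centering φ(x) in feature space):

```python
import numpy as np

def rbf_kernel(X, gamma=1.0):
    # K_ij = exp(-gamma * ||x_i - x_j||^2)
    sq = np.sum(X**2, axis=1)
    sq_dists = sq[:, None] + sq[None, :] - 2 * X @ X.T
    return np.exp(-gamma * sq_dists)

def kernel_pca(X, d_prime, gamma=1.0):
    K = rbf_kernel(X, gamma)
    N = K.shape[0]
    one_n = np.ones((N, N)) / N
    K_c = K - one_n @ K - K @ one_n + one_n @ K @ one_n   # centered kernel matrix
    eigvals, eigvecs = np.linalg.eigh(K_c)                # solves K alpha = (N lambda) alpha
    order = np.argsort(eigvals)[::-1][:d_prime]
    alphas = eigvecs[:, order] / np.sqrt(eigvals[order])  # normalize so each v_k has unit norm
    return K_c @ alphas                                   # projections of the training points

X = np.random.randn(300, 2)
Z = kernel_pca(X, d_prime=2, gamma=0.5)
print(Z.shape)   # (300, 2)
```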
Linear Discriminant Analysis (LDA)
- Supervised feature extraction.
- Fisher's Linear Discriminant Analysis:
  - Dimensionality reduction: finds linear combinations of features with large ratios of between-group to within-group scatter (as new discriminant variables).
  - Classification example: predicts the class of an observation as the class whose mean vector is closest to it in the space of the discriminant variables.
Good Projection for Classification
- What is a good criterion? Separating different classes in the projected space.
- As opposed to PCA, we also use the labels of the training data.
LDA Problem
- Problem definition:
  - $C = 2$ classes.
  - $N$ training samples $\boldsymbol{x}^{(1)}, \ldots, \boldsymbol{x}^{(N)}$, with $N_1$ samples from the first class ($C_1$) and $N_2$ samples from the second class ($C_2$).
- Goal: find the best direction $\boldsymbol{w}$ that we hope will enable accurate classification.
- The projection of sample $\boldsymbol{x}$ onto a line in direction $\boldsymbol{w}$ is $\boldsymbol{w}^T\boldsymbol{x}$.
- What is a measure of the separation between the projected points of different classes?
Measure of Separation in the Projected Direction
[Bishop]
Is the direction of the line joining the class means a good candidate for $\boldsymbol{w}$?
$$\boldsymbol{\mu}_1 = \frac{1}{N_1}\sum_{\boldsymbol{x}^{(i)} \in C_1}\boldsymbol{x}^{(i)}, \qquad \boldsymbol{\mu}_2 = \frac{1}{N_2}\sum_{\boldsymbol{x}^{(i)} \in C_2}\boldsymbol{x}^{(i)}$$
Measure of Separation in the Projected Direction
- The direction of the line joining the class means is the solution of the following problem (it maximizes the separation of the projected class means):
$$\max_{\boldsymbol{w}}\ m_2 - m_1 = \boldsymbol{w}^T(\boldsymbol{\mu}_2 - \boldsymbol{\mu}_1) \quad \text{s.t.} \quad \|\boldsymbol{w}\| = 1$$
where $m_1 = \boldsymbol{w}^T\boldsymbol{\mu}_1$ and $m_2 = \boldsymbol{w}^T\boldsymbol{\mu}_2$ are the projected class means.
- What is the problem with a criterion that considers only $m_2 - m_1$? It does not consider the variances of the classes.
LDA Criteria
Fisher's idea: maximize a function that will give a large separation between the projected class means while also achieving a small variance within each class, thereby minimizing the class overlap.
LDA Criteria
The scatters of the original data are:
$$\boldsymbol{S}_1 = \sum_{\boldsymbol{x}^{(i)} \in C_1}\left(\boldsymbol{x}^{(i)} - \boldsymbol{\mu}_1\right)\left(\boldsymbol{x}^{(i)} - \boldsymbol{\mu}_1\right)^T, \qquad \boldsymbol{S}_2 = \sum_{\boldsymbol{x}^{(i)} \in C_2}\left(\boldsymbol{x}^{(i)} - \boldsymbol{\mu}_2\right)\left(\boldsymbol{x}^{(i)} - \boldsymbol{\mu}_2\right)^T$$
The scatters of the projected data are:
$$\tilde{s}_1^2 = \sum_{\boldsymbol{x}^{(i)} \in C_1}\left(\boldsymbol{w}^T\boldsymbol{x}^{(i)} - m_1\right)^2, \qquad \tilde{s}_2^2 = \sum_{\boldsymbol{x}^{(i)} \in C_2}\left(\boldsymbol{w}^T\boldsymbol{x}^{(i)} - m_2\right)^2$$
LDA Criteria
$$J(\boldsymbol{w}) = \frac{(m_2 - m_1)^2}{\tilde{s}_1^2 + \tilde{s}_2^2}, \qquad m_2 - m_1 = \boldsymbol{w}^T(\boldsymbol{\mu}_2 - \boldsymbol{\mu}_1)$$
$$\tilde{s}_k^2 = \sum_{\boldsymbol{x}^{(i)} \in C_k}\left(\boldsymbol{w}^T\boldsymbol{x}^{(i)} - m_k\right)^2 = \sum_{\boldsymbol{x}^{(i)} \in C_k}\boldsymbol{w}^T\left(\boldsymbol{x}^{(i)} - \boldsymbol{\mu}_k\right)\left(\boldsymbol{x}^{(i)} - \boldsymbol{\mu}_k\right)^T\boldsymbol{w} \qquad (k = 1, 2)$$
LDA Criteria
$$\boldsymbol{S}_B = (\boldsymbol{\mu}_2 - \boldsymbol{\mu}_1)(\boldsymbol{\mu}_2 - \boldsymbol{\mu}_1)^T \qquad \text{(between-class scatter matrix)}$$
$$\boldsymbol{S}_W = \boldsymbol{S}_1 + \boldsymbol{S}_2, \qquad \boldsymbol{S}_k = \sum_{\boldsymbol{x}^{(i)} \in C_k}\left(\boldsymbol{x}^{(i)} - \boldsymbol{\mu}_k\right)\left(\boldsymbol{x}^{(i)} - \boldsymbol{\mu}_k\right)^T \quad (k = 1, 2) \qquad \text{(within-class scatter matrix)}$$
(A scatter matrix is N times the corresponding covariance matrix.)
LDA Derivation
$$J(\boldsymbol{w}) = \frac{\boldsymbol{w}^T\boldsymbol{S}_B\boldsymbol{w}}{\boldsymbol{w}^T\boldsymbol{S}_W\boldsymbol{w}}$$
$$\frac{\partial J(\boldsymbol{w})}{\partial\boldsymbol{w}} = \frac{2\boldsymbol{S}_B\boldsymbol{w}\left(\boldsymbol{w}^T\boldsymbol{S}_W\boldsymbol{w}\right) - 2\boldsymbol{S}_W\boldsymbol{w}\left(\boldsymbol{w}^T\boldsymbol{S}_B\boldsymbol{w}\right)}{\left(\boldsymbol{w}^T\boldsymbol{S}_W\boldsymbol{w}\right)^2}$$
Setting $\frac{\partial J(\boldsymbol{w})}{\partial\boldsymbol{w}} = 0$ gives
$$\left(\boldsymbol{w}^T\boldsymbol{S}_W\boldsymbol{w}\right)\boldsymbol{S}_B\boldsymbol{w} = \left(\boldsymbol{w}^T\boldsymbol{S}_B\boldsymbol{w}\right)\boldsymbol{S}_W\boldsymbol{w}$$
LDA Derivation
$$\boldsymbol{S}_B\boldsymbol{w} = \lambda\boldsymbol{S}_W\boldsymbol{w} \quad \Rightarrow \quad \boldsymbol{S}_W^{-1}\boldsymbol{S}_B\boldsymbol{w} = \lambda\boldsymbol{w} \qquad (\boldsymbol{S}_W \text{ full-rank})$$
For any vector $\boldsymbol{x}$, $\boldsymbol{S}_B\boldsymbol{x} = (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)^T\boldsymbol{x}$ points in the same direction as $\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2$.
Thus, we can solve the eigenvalue problem immediately:
$$\boldsymbol{w} \propto \boldsymbol{S}_W^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)$$
LDA Algorithm
- $\boldsymbol{\mu}_1$ and $\boldsymbol{\mu}_2$ ← mean of the samples of class 1 and class 2, respectively.
- $\boldsymbol{S}_1$ and $\boldsymbol{S}_2$ ← scatter matrix of class 1 and class 2, respectively.
- $\boldsymbol{S}_W = \boldsymbol{S}_1 + \boldsymbol{S}_2$, $\boldsymbol{S}_B = (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)^T$.
- Feature extraction: $\boldsymbol{w} \propto \boldsymbol{S}_W^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)$, i.e., $\boldsymbol{w}$ is the eigenvector corresponding to the largest eigenvalue of $\boldsymbol{S}_W^{-1}\boldsymbol{S}_B$.
- Classification: $y = \boldsymbol{w}^T\boldsymbol{x}$; using a threshold on $y$, we can classify $\boldsymbol{x}$.
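A minimal two-class Fisher LDA sketch following these steps (synthetic data; np.linalg.solve is used in place of an explicit matrix inverse):

```python
import numpy as np

def fisher_lda_direction(X1, X2):
    """Fisher discriminant direction for two classes; rows of X1, X2 are samples."""
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    S1 = (X1 - mu1).T @ (X1 - mu1)          # scatter matrix of class 1
    S2 = (X2 - mu2).T @ (X2 - mu2)          # scatter matrix of class 2
    Sw = S1 + S2                            # within-class scatter matrix
    w = np.linalg.solve(Sw, mu1 - mu2)      # w proportional to Sw^{-1} (mu1 - mu2)
    return w / np.linalg.norm(w)

rng = np.random.default_rng(0)
X1 = rng.standard_normal((100, 2)) + np.array([2.0, 0.0])   # class 1
X2 = rng.standard_normal((100, 2)) + np.array([-2.0, 0.0])  # class 2
w = fisher_lda_direction(X1, X2)

# Classify by thresholding the projection at the midpoint of the projected class means.
threshold = 0.5 * (w @ X1.mean(axis=0) + w @ X2.mean(axis=0))
print(np.mean(X1 @ w > threshold), np.mean(X2 @ w < threshold))   # both close to 1.0
```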
Multi-Class LDA (MDA)
- $C > 2$ classes: the natural generalization of LDA involves $C - 1$ discriminant functions.
  - The projection is from a $d$-dimensional space to a $(C - 1)$-dimensional space (it is tacitly assumed that $d \geq C$).
$$\boldsymbol{S}_W = \sum_{k=1}^{C}\boldsymbol{S}_k, \qquad \boldsymbol{S}_B = \sum_{k=1}^{C} N_k\left(\boldsymbol{\mu}_k - \boldsymbol{\mu}\right)\left(\boldsymbol{\mu}_k - \boldsymbol{\mu}\right)^T$$
$$\boldsymbol{\mu}_k = \frac{1}{N_k}\sum_{\boldsymbol{x}^{(i)} \in C_k}\boldsymbol{x}^{(i)} \quad (k = 1, \ldots, C), \qquad \boldsymbol{\mu} = \frac{1}{N}\sum_{i=1}^{N}\boldsymbol{x}^{(i)}, \qquad \boldsymbol{S}_k = \sum_{\boldsymbol{x}^{(i)} \in C_k}\left(\boldsymbol{x}^{(i)} - \boldsymbol{\mu}_k\right)\left(\boldsymbol{x}^{(i)} - \boldsymbol{\mu}_k\right)^T$$
Multi-Class LDA
Means and scatters after the transform $\boldsymbol{y} = \boldsymbol{W}^T\boldsymbol{x}$:
$$\tilde{\boldsymbol{\mu}}_k = \boldsymbol{W}^T\boldsymbol{\mu}_k, \qquad \tilde{\boldsymbol{S}}_W = \boldsymbol{W}^T\boldsymbol{S}_W\boldsymbol{W}, \qquad \tilde{\boldsymbol{S}}_B = \boldsymbol{W}^T\boldsymbol{S}_B\boldsymbol{W}$$
Multi-Class LDA: Objective Function
- We seek a transformation matrix $\boldsymbol{W}$ that in some sense “maximizes the ratio of the between-class scatter to the within-class scatter”.
- A simple scalar measure of scatter is the determinant of the scatter matrix.
Multi-Class LDA: Objective Function
$$J(\boldsymbol{W}) = \frac{\left|\boldsymbol{W}^T\boldsymbol{S}_B\boldsymbol{W}\right|}{\left|\boldsymbol{W}^T\boldsymbol{S}_W\boldsymbol{W}\right|} \qquad (|\cdot| \text{ denotes the determinant})$$
The solution of this problem, where $\boldsymbol{W} = [\boldsymbol{w}_1 \cdots \boldsymbol{w}_{C-1}]$, satisfies
$$\boldsymbol{S}_B\boldsymbol{w}_i = \lambda_i\boldsymbol{S}_W\boldsymbol{w}_i$$
It is a generalized eigenvector problem.
Multi-Class LDA:
- $d' \leq C - 1$: $\boldsymbol{S}_B$ is the sum of $C$ matrices $N_k(\boldsymbol{\mu}_k - \boldsymbol{\mu})(\boldsymbol{\mu}_k - \boldsymbol{\mu})^T$ of rank (at most) one, and only $C - 1$ of these are independent ⇒ at most $C - 1$ nonzero eigenvalues, and the desired weight vectors correspond to these nonzero eigenvalues.
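A minimal multi-class LDA sketch based on this formulation (synthetic data; the generalized eigenproblem $\boldsymbol{S}_B\boldsymbol{w} = \lambda\boldsymbol{S}_W\boldsymbol{w}$ is solved as an ordinary eigenproblem on $\boldsymbol{S}_W^{-1}\boldsymbol{S}_B$, which assumes $\boldsymbol{S}_W$ is full-rank):

```python
import numpy as np

def multiclass_lda(X, y, d_prime):
    """Return W (d x d_prime) with the leading generalized eigenvectors of (S_B, S_W)."""
    mu = X.mean(axis=0)
    d = X.shape[1]
    Sw, Sb = np.zeros((d, d)), np.zeros((d, d))
    for c in np.unique(y):
        Xc = X[y == c]
        mu_c = Xc.mean(axis=0)
        Sw += (Xc - mu_c).T @ (Xc - mu_c)
        Sb += len(Xc) * np.outer(mu_c - mu, mu_c - mu)
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(Sw, Sb))   # Sw^{-1} Sb
    order = np.argsort(eigvals.real)[::-1][:d_prime]
    return eigvecs[:, order].real

rng = np.random.default_rng(0)
means = [np.array([0, 0, 0, 0]), np.array([3, 0, 0, 0]), np.array([0, 3, 0, 0])]
X = np.vstack([rng.standard_normal((50, 4)) + m for m in means])
y = np.repeat([0, 1, 2], 50)
W = multiclass_lda(X, y, d_prime=2)    # at most C - 1 = 2 useful directions
print((X @ W).shape)                   # (150, 2)
```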
Multi-Class LDA: Other Objective Functions
There are many possible choices of criterion for multi-class LDA, e.g.:
$$J(\boldsymbol{W}) = \operatorname{tr}\left\{\tilde{\boldsymbol{S}}_W^{-1}\tilde{\boldsymbol{S}}_B\right\} = \operatorname{tr}\left\{\left(\boldsymbol{W}^T\boldsymbol{S}_W\boldsymbol{W}\right)^{-1}\left(\boldsymbol{W}^T\boldsymbol{S}_B\boldsymbol{W}\right)\right\}$$
- The solution is given by solving a generalized eigenvalue problem.
- Solution: the eigenvectors corresponding to the largest eigenvalues constitute the new variables.
LDA Criterion Limitation
- When $\boldsymbol{\mu}_1 = \boldsymbol{\mu}_2$, the LDA criterion cannot lead to a proper projection ($\boldsymbol{S}_B = 0$). However, discriminatory information in the scatter of the data may still be helpful.
- If classes are non-linearly separable, they may have a large overlap when projected onto any line.
- LDA implicitly assumes a Gaussian distribution for the samples of each class.
Issues in LDA
- Singularity or undersampled problem (when $N < d$).
  - Examples: gene expression data, images, text documents.
- Can reduce the dimension only to $d' \leq C - 1$ (unlike PCA).
- Approaches to avoid these problems: PCA+LDA, Regularized LDA, Locally FDA (LFDA), etc.
Summary
- Although LDA often provides more suitable features for classification tasks, PCA might outperform LDA in some situations, such as:
  - when the number of samples per class is small (overfitting problem of LDA);
  - when the training data non-uniformly sample the underlying distribution;
  - when the number of desired features is more than $C - 1$.
- Advances in the recent decade:
  - Semi-supervised feature extraction
  - Nonlinear dimensionality reduction