

CE-725: Statistical Pattern Recognition, Sharif University of Technology, Spring 2013

Soleymani

Feature Extraction (PCA & LDA)


Outline
What is feature extraction?
Feature extraction algorithms
Linear methods

Unsupervised: Principal Component Analysis (PCA)
Also known as the Karhunen-Loève (KL) transform

Supervised: Linear Discriminant Analysis (LDA)
Also known as Fisher's Discriminant Analysis (FDA)

2


Dimensionality Reduction: Feature Selection vs. Feature Extraction

Feature selection: select a subset of a given feature set.

Feature extraction (e.g., PCA, LDA): a linear or non-linear transform on the original feature space.

3

Feature selection: $[x_1, \dots, x_d]^T \;\to\; [x_{i_1}, \dots, x_{i_{d'}}]^T$ ($d' < d$)

Feature extraction: $[x_1, \dots, x_d]^T \;\to\; \mathbf{x}' = f(\mathbf{x}) = [x'_1, \dots, x'_{d'}]^T$


Feature Extraction

4

Mapping of the original data onto a lower-dimensional space.
The criterion for feature extraction can differ based on the problem setting:
Unsupervised task: minimize the information loss (reconstruction error).
Supervised task: maximize the class discrimination in the projected space.

In the previous lecture, we talked about feature selection:
Feature selection can be considered a special form of feature extraction (only a subset of the original features is used).

Example: with $d = 4$ and $d' = 2$,

$$\mathbf{X}'_{N \times 2} = \mathbf{X}_{N \times 4}\,\mathbf{A}, \qquad \mathbf{A}^T = \begin{bmatrix} 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix}$$

The second and third features are selected.


Feature Extraction

5

Unsupervised feature extraction: given only the data matrix

$$\mathbf{X} = \begin{bmatrix} x_1^{(1)} & \cdots & x_d^{(1)} \\ \vdots & \ddots & \vdots \\ x_1^{(N)} & \cdots & x_d^{(N)} \end{bmatrix}$$

find a mapping $f: \mathbb{R}^d \to \mathbb{R}^{d'}$, or only the transformed data

$$\mathbf{X}' = \begin{bmatrix} x_1'^{(1)} & \cdots & x_{d'}'^{(1)} \\ \vdots & \ddots & \vdots \\ x_1'^{(N)} & \cdots & x_{d'}'^{(N)} \end{bmatrix}$$

Supervised feature extraction: given the data matrix $\mathbf{X}$ together with the labels $\mathbf{y} = [y^{(1)}, \dots, y^{(N)}]^T$, find a mapping $f: \mathbb{R}^d \to \mathbb{R}^{d'}$, or only the transformed data $\mathbf{X}'$.


Linear Transformation

6

For a linear transformation, we find an explicit mapping that can also transform new data vectors:

$$\mathbf{x}'_{d' \times 1} = \mathbf{A}^T_{d' \times d}\,\mathbf{x}_{d \times 1}$$

where $\mathbf{x}$ is the original data vector and $\mathbf{x}'$ is the reduced data vector ($d' < d$).


Linear Transformation

7

Linear transformations are simple mappings:

$$\mathbf{x}' = \mathbf{A}^T\mathbf{x}, \qquad x'_j = \mathbf{a}_j^T\mathbf{x}, \quad j = 1, \dots, d'$$

$$\mathbf{A}_{d \times d'} = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1d'} \\ a_{21} & a_{22} & \cdots & a_{2d'} \\ \vdots & & & \vdots \\ a_{d1} & a_{d2} & \cdots & a_{dd'} \end{bmatrix} = [\mathbf{a}_1 \; \cdots \; \mathbf{a}_{d'}]$$
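A minimal numpy sketch of this mapping (the dimensions and the random matrix A are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_prime, N = 5, 2, 100          # original dim, reduced dim, number of samples (illustrative)

X = rng.normal(size=(N, d))        # rows are d-dimensional data points
A = rng.normal(size=(d, d_prime))  # columns a_1, ..., a_{d'} define the linear transform

X_reduced = X @ A                  # each row is x' = A^T x for the corresponding x
print(X_reduced.shape)             # (100, 2)
```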


Linear Dimensionality Reduction

8

Unsupervised:
Principal Component Analysis (PCA) [we will discuss]
Independent Component Analysis (ICA)
Singular Value Decomposition (SVD)
Multidimensional Scaling (MDS)
Canonical Correlation Analysis (CCA)

Supervised:
Linear Discriminant Analysis (LDA) [we will discuss]


Unsupervised Feature Reduction

9

Visualization: projection of high-dimensional data onto 2D or 3D.

Data compression: efficient storage, communication, or retrieval.

Noise removal: to improve accuracy by removing irrelevant features.

As a preprocessing step to reduce dimensions for classification tasks.


Principal Component Analysis (PCA)

10

The “best” subspace:
Centered at the sample mean
The axes have been rotated to new (principal) axes such that:
Principal axis 1 has the highest variance
Principal axis 2 has the next highest variance, and so on.
The principal axes are uncorrelated: the covariance between each pair of principal axes is zero.

Goal: reducing the dimensionality of the data while preserving the variation present in the dataset as much as possible.


Principal Component Analysis (PCA)

11

Principal Components (PCs): orthogonal vectors that are ordered by the fraction of the total information (variation) in the corresponding directions.

PCs can be found as the “best” eigenvectors of the covariance matrix of the data points.
If the data has a Gaussian distribution N(μ, Σ), the direction of the largest variance can be found by the eigenvector of Σ that corresponds to the largest eigenvalue of Σ.


Covariance Matrix

12

ML estimate of the covariance matrix from data points $\{\mathbf{x}^{(i)}\}_{i=1}^{N}$:

$$\hat{\boldsymbol{\mu}} = \frac{1}{N}\sum_{i=1}^{N}\mathbf{x}^{(i)}, \qquad \hat{\boldsymbol{\Sigma}} = \frac{1}{N}\sum_{i=1}^{N}(\mathbf{x}^{(i)} - \hat{\boldsymbol{\mu}})(\mathbf{x}^{(i)} - \hat{\boldsymbol{\mu}})^T = \frac{1}{N}\tilde{\mathbf{X}}^T\tilde{\mathbf{X}}$$

where $\tilde{\mathbf{X}}$ is the mean-centered data matrix with rows $(\mathbf{x}^{(i)} - \hat{\boldsymbol{\mu}})^T$.
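A small numpy sketch of this estimate (the data are illustrative); note the ML 1/N factor rather than numpy's default 1/(N-1):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))            # N=100 data points, d=3 features (illustrative)

mu = X.mean(axis=0)                      # ML estimate of the mean
Xc = X - mu                              # mean-centered data matrix
Sigma = (Xc.T @ Xc) / X.shape[0]         # ML covariance estimate (1/N, not 1/(N-1))

# Same result via numpy's built-in estimator with bias=True (divides by N):
assert np.allclose(Sigma, np.cov(X, rowvar=False, bias=True))
```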


Eigenvalues and Eigenvectors:Geometrical Interpretation

The covariance matrix is a PSD matrix corresponding to a hyper-ellipsoid in the d-dimensional space:

$$\boldsymbol{\Sigma}\mathbf{u}_i = \lambda_i\mathbf{u}_i, \qquad i = 1, \dots, d$$

13


PCA: Steps

14

Input: data matrix $\mathbf{X}_{N \times d}$ (each row contains a d-dimensional data point)

$\hat{\boldsymbol{\mu}} = \frac{1}{N}\sum_{i=1}^{N}\mathbf{x}^{(i)}$ ← the mean of the data points is subtracted from the rows of $\mathbf{X}$

$\boldsymbol{\Sigma} = \frac{1}{N}\mathbf{X}^T\mathbf{X}$ (covariance matrix of the centered data)

Calculate the eigenvalues and eigenvectors of $\boldsymbol{\Sigma}$.
Pick the eigenvectors corresponding to the $d'$ largest eigenvalues and put them in the columns of $\mathbf{A} = [\mathbf{v}_1, \dots, \mathbf{v}_{d'}]$ (first PC, ..., d'-th PC).

$\mathbf{X}' = \mathbf{X}\mathbf{A}$
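A direct numpy translation of these steps (a sketch; the data matrix and the choice d' = 3 are placeholders):

```python
import numpy as np

def pca(X, d_prime):
    """Project rows of X (N x d) onto the top d_prime principal components."""
    mu = X.mean(axis=0)                      # mean of the data points
    Xc = X - mu                              # subtract the mean from the rows of X
    Sigma = (Xc.T @ Xc) / X.shape[0]         # covariance matrix of the centered data
    eigvals, eigvecs = np.linalg.eigh(Sigma) # eigh: ascending eigenvalues, orthonormal eigenvectors
    order = np.argsort(eigvals)[::-1]        # sort eigenvalues in descending order
    A = eigvecs[:, order[:d_prime]]          # columns = first PC, ..., d'-th PC
    return Xc @ A, A, mu                     # X' = X A (on centered data), plus the model

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
X_prime, A, mu = pca(X, d_prime=3)
print(X_prime.shape)   # (200, 3)
```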


Another Interpretation: Least Squares Error

15

PCs are linear least-squares fits to the samples, each orthogonal to the previous PCs:
The first PC is a minimum-distance fit to a vector in the original feature space.
The second PC is a minimum-distance fit to a vector in the plane perpendicular to the first PC.


Another Interpretation: Least Squares Error

16


Another Interpretation: Least Squares Error

17


Least Squares Error and Maximum Variance Views Are Equivalent (1-dim Interpretation)

Minimizing the sum of squared distances to the line is equivalent to maximizing the sum of squares of the projections on that line (Pythagoras).

18

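In symbols, for a mean-centered sample $\mathbf{x}^{(i)}$ and a unit vector $\mathbf{w}$:

$$\big\|\mathbf{x}^{(i)} - (\mathbf{w}^T\mathbf{x}^{(i)})\mathbf{w}\big\|^2 = \|\mathbf{x}^{(i)}\|^2 - (\mathbf{w}^T\mathbf{x}^{(i)})^2,$$

so, summing over samples, minimizing the squared distances to the line is the same as maximizing $\sum_i(\mathbf{w}^T\mathbf{x}^{(i)})^2$, since $\sum_i\|\mathbf{x}^{(i)}\|^2$ does not depend on $\mathbf{w}$.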


PCA: Uncorrelated Features

19

If $\mathbf{A} = [\mathbf{v}_1, \dots, \mathbf{v}_{d'}]$, where $\mathbf{v}_i$ are orthonormal eigenvectors of $\boldsymbol{\Sigma}$, then mutually uncorrelated features are obtained.

Completely uncorrelated features avoid information redundancies.
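A quick numerical check of this property (an illustrative sketch): the sample covariance of the projected features should be diagonal, with the eigenvalues on the diagonal.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.multivariate_normal(mean=[0, 0, 0],
                            cov=[[4, 2, 1], [2, 3, 1], [1, 1, 2]], size=5000)

Xc = X - X.mean(axis=0)
Sigma = (Xc.T @ Xc) / X.shape[0]
eigvals, V = np.linalg.eigh(Sigma)        # orthonormal eigenvectors of the covariance

X_prime = Xc @ V                          # project onto all principal axes
Sigma_prime = (X_prime.T @ X_prime) / X.shape[0]

# Off-diagonal covariances of the new features are (numerically) zero:
print(np.round(Sigma_prime, 6))           # ~ diag(eigvals)
```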


PCA Derivation (Correlation Version): Mean Square Error Approximation

20

Incorporating all $d$ eigenvectors in $\mathbf{A} = [\mathbf{v}_1, \dots, \mathbf{v}_d]$:

$$\mathbf{x}' = \mathbf{A}^T\mathbf{x} \quad\Rightarrow\quad \mathbf{x} = \mathbf{A}\mathbf{x}' = \sum_{i=1}^{d} x'_i\,\mathbf{v}_i$$

If $d' = d$, then $\mathbf{x}$ can be reconstructed exactly from $\mathbf{x}'$.


PCA Derivation (Correlation Version): Mean Square Error Approximation

21

Incorporating only the $d'$ eigenvectors corresponding to the largest eigenvalues, $\mathbf{A} = [\mathbf{v}_1, \dots, \mathbf{v}_{d'}]$ ($d' < d$):

This minimizes the MSE between $\mathbf{x}$ and $\hat{\mathbf{x}} = \mathbf{A}\mathbf{x}'$:

$$E\big[\|\mathbf{x} - \hat{\mathbf{x}}\|^2\big] = E\Big[\Big\|\sum_{i=d'+1}^{d}(\mathbf{v}_i^T\mathbf{x})\mathbf{v}_i\Big\|^2\Big] = \sum_{i=d'+1}^{d}\mathbf{v}_i^T E[\mathbf{x}\mathbf{x}^T]\mathbf{v}_i = \sum_{i=d'+1}^{d}\lambda_i$$

i.e., the sum of the $d - d'$ smallest eigenvalues.


PCA Derivation (Correlation Version):Relation between Eigenvalues and Variances

22

The j-th largest eigenvalue of $\boldsymbol{\Sigma}$ is the variance on the j-th PC:
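Filling in the elided formula (assuming mean-centered data, so that $\boldsymbol{\Sigma} = E[\mathbf{x}\mathbf{x}^T]$):

$$\mathrm{Var}(x'_j) = E\big[(\mathbf{v}_j^T\mathbf{x})^2\big] = \mathbf{v}_j^T E[\mathbf{x}\mathbf{x}^T]\mathbf{v}_j = \mathbf{v}_j^T\boldsymbol{\Sigma}\mathbf{v}_j = \lambda_j\,\mathbf{v}_j^T\mathbf{v}_j = \lambda_j.$$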


PCA Derivation (Correlation Version): Mean Square Error Approximation

23

In general, it can be shown that the MSE is minimized compared to any other approximation of $\mathbf{x}$ by any $d'$-dimensional orthonormal basis; this result can be obtained without first assuming that the axes are eigenvectors of the correlation matrix.

If the data is mean-centered in advance, the correlation matrix and the covariance matrix will be the same. However, in the correlation version, when the mean is not zero the approximation is not, in general, a good one (although it is a minimum-MSE solution).


PCA on Faces: “Eigenfaces”

24

ORL database (some sample images)


PCA on Faces: “Eigenfaces”

25

For eigenfaces: “gray” = 0, “white” > 0, “black” < 0

Average face

1st PC

6th PC


PCA on Faces:

Feature vector $= [w_1, w_2, \dots, w_{d'}]$

$$\mathbf{x} \approx \bar{\mathbf{x}} + w_1 \times \text{PC}_1 + w_2 \times \text{PC}_2 + w_3 \times \text{PC}_3 + \cdots$$

26

$\bar{\mathbf{x}}$ = average face

$w_i$ = the projection of $\mathbf{x}$ on the i-th PC

$\mathbf{x}$ is a $112 \times 92 = 10304$-dimensional vector containing the intensities of the pixels of this image.
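A numpy sketch of this decomposition and reconstruction; the random `faces` matrix is only a stand-in for the ORL images (400 flattened 112×92 images), and the SVD route is just a convenient way to obtain the same eigenvectors:

```python
import numpy as np

def fit_eigenfaces(faces, d_prime):
    """faces: N x 10304 matrix, one flattened 112x92 image per row."""
    mean_face = faces.mean(axis=0)
    Xc = faces - mean_face
    # Right singular vectors of the centered data = eigenvectors of its covariance.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    pcs = Vt[:d_prime]                   # rows = first d' principal components ("eigenfaces")
    return mean_face, pcs

def reconstruct(face, mean_face, pcs):
    w = pcs @ (face - mean_face)         # w_i = projection of the face on the i-th PC
    return mean_face + pcs.T @ w         # average face + sum_i w_i * PC_i

# Illustrative random stand-in for the ORL data (400 images of 112*92 pixels):
rng = np.random.default_rng(0)
faces = rng.random((400, 112 * 92))
mean_face, pcs = fit_eigenfaces(faces, d_prime=16)
approx = reconstruct(faces[0], mean_face, pcs)
print(approx.shape)                      # (10304,)
```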


PCA on Faces: Reconstructed Face

Reconstructions with d'=1, 2, 4, 8, 16, 32, 64, 128, 256, and the original image.

27


PCA Drawback

28

An excellent information-packing transform does not necessarily lead to good class separability.
The directions of maximum variance may be useless for classification purposes.


Independent Component Analysis (ICA)

29

PCA:
The transformed dimensions will be uncorrelated with each other.
Orthogonal linear transform.
Only uses second-order statistics (i.e., the covariance matrix).

ICA:
The transformed dimensions will be as independent as possible.
Non-orthogonal linear transform.
Higher-order statistics are used.


Uncorrelated and Independent

30

Gaussian: Independent ⟺ Uncorrelated

Non-Gaussian: Independent ⇒ Uncorrelated, but Uncorrelated ⇏ Independent

Uncorrelated: $\mathrm{Cov}(x_i, x_j) = 0$.  Independent: $p(x_i, x_j) = p(x_i)\,p(x_j)$


PCA vs. ICA

31

PCA

ICA

[Figure: PCA (orthogonal) vs. ICA (non-orthogonal) directions on example datasets: 1 class, 2 classes, 3 classes, cubic/linear structure]


Kernel PCA

32

Kernel extension of PCA

Useful when the data (approximately) lies on a lower-dimensional non-linear manifold.


Kernel PCA

33

Map to a Hilbert space: $\mathbf{x} \to \phi(\mathbf{x})$ (non-linear extension of PCA):

$$\mathbf{C} = \frac{1}{N}\sum_{n=1}^{N}\phi(\mathbf{x}^{(n)})\,\phi(\mathbf{x}^{(n)})^T$$

All eigenvectors of $\mathbf{C}$ lie in the span of the mapped data points:

$$\mathbf{C}\mathbf{v} = \lambda\mathbf{v}, \qquad \mathbf{v} = \sum_{n=1}^{N}\alpha_n\,\phi(\mathbf{x}^{(n)})$$

After some algebra, we have:

$$\mathbf{K}\boldsymbol{\alpha} = N\lambda\,\boldsymbol{\alpha}, \qquad K_{ij} = \phi(\mathbf{x}^{(i)})^T\phi(\mathbf{x}^{(j)}) = k(\mathbf{x}^{(i)}, \mathbf{x}^{(j)})$$
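A minimal kernel-PCA sketch following this recipe: build the Gram matrix K, center it, and solve the eigenproblem for the coefficients α. The RBF kernel, its width gamma, and the toy data are illustrative assumptions:

```python
import numpy as np

def rbf_kernel(X, gamma=1.0):
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq_dists)

def kernel_pca(X, d_prime, gamma=1.0):
    N = X.shape[0]
    K = rbf_kernel(X, gamma)
    # Center the kernel matrix (equivalent to centering phi(x) in feature space).
    one_n = np.ones((N, N)) / N
    Kc = K - one_n @ K - K @ one_n + one_n @ K @ one_n
    eigvals, alphas = np.linalg.eigh(Kc)            # ascending order
    eigvals, alphas = eigvals[::-1], alphas[:, ::-1]
    # Normalize alphas so the feature-space eigenvectors v have unit norm.
    alphas = alphas[:, :d_prime] / np.sqrt(np.maximum(eigvals[:d_prime], 1e-12))
    return Kc @ alphas                               # projections of the training points

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 2))
Z = kernel_pca(X, d_prime=2, gamma=0.5)
print(Z.shape)                                       # (150, 2)
```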


Linear Discriminant Analysis (LDA)

34

Supervised feature extraction

Fisher's Linear Discriminant Analysis:

Dimensionality reduction: finds linear combinations of features with large ratios of between-group to within-group scatter (as new discriminant variables).

Classification: for example, predicts the class of an observation $\mathbf{x}$ by the class whose mean vector is closest to $\mathbf{x}$ in the space of the discriminant variables.


Good Projection for Classification

What is a good criterion?
Separating the different classes in the projected space.
As opposed to PCA, we also use the labels of the training data.

35



LDA Problem

Problem definition:
$C = 2$ classes
$\{(\mathbf{x}^{(i)}, y^{(i)})\}_{i=1}^{N}$ training samples, with $N_1$ samples from the first class ($C_1$) and $N_2$ samples from the second class ($C_2$)

Goal: find the best direction $\mathbf{w}$ that we hope will enable accurate classification.

The projection of a sample $\mathbf{x}$ onto a line in direction $\mathbf{w}$ is $\mathbf{w}^T\mathbf{x}$.

What is the measure of separation between the projected points of the different classes?

38


Measure of Separation in the Projected Direction

39

[Bishop]

Is the direction of the line joining the class means a good candidate for $\mathbf{w}$?

$$\boldsymbol{\mu}_1 = \frac{1}{N_1}\sum_{\mathbf{x}^{(i)} \in C_1}\mathbf{x}^{(i)}, \qquad \boldsymbol{\mu}_2 = \frac{1}{N_2}\sum_{\mathbf{x}^{(i)} \in C_2}\mathbf{x}^{(i)}$$


Measure of Separation in the Projected Direction

40

The direction of the line joining the class means is the solution of the following problem (maximize the separation of the projected class means):

$$\max_{\mathbf{w}}\; \mu'_2 - \mu'_1 = \mathbf{w}^T(\boldsymbol{\mu}_2 - \boldsymbol{\mu}_1) \quad \text{s.t.}\; \|\mathbf{w}\| = 1$$

where $\mu'_1 = \mathbf{w}^T\boldsymbol{\mu}_1$ and $\mu'_2 = \mathbf{w}^T\boldsymbol{\mu}_2$.

What is the problem with a criterion that considers only $\mu'_2 - \mu'_1$? It does not consider the variances of the classes.
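As a quick check of the claim above, adding a Lagrange multiplier for the unit-norm constraint gives:

$$\mathcal{L}(\mathbf{w}, \lambda) = \mathbf{w}^T(\boldsymbol{\mu}_2 - \boldsymbol{\mu}_1) + \lambda(1 - \mathbf{w}^T\mathbf{w}), \qquad \nabla_{\mathbf{w}}\mathcal{L} = 0 \;\Rightarrow\; \mathbf{w} = \frac{1}{2\lambda}(\boldsymbol{\mu}_2 - \boldsymbol{\mu}_1) \propto \boldsymbol{\mu}_2 - \boldsymbol{\mu}_1.$$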


LDA Criteria

41

Fisher's idea: maximize a function that will give a large separation between the projected class means while also achieving a small variance within each class, thereby minimizing the class overlap.
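Written out, this is the standard Fisher criterion that the following slides expand, using the projected means $\mu'_k = \mathbf{w}^T\boldsymbol{\mu}_k$ and the projected scatters $s'^2_k$ defined on the next slide:

$$J(\mathbf{w}) = \frac{(\mu'_2 - \mu'_1)^2}{s'^2_1 + s'^2_2}.$$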


LDA Criteria

The scatters of the original data are:

$$\mathbf{S}_k = \sum_{\mathbf{x}^{(i)} \in C_k}(\mathbf{x}^{(i)} - \boldsymbol{\mu}_k)(\mathbf{x}^{(i)} - \boldsymbol{\mu}_k)^T, \qquad k = 1, 2$$

The scatters of the projected data are:

$$s'^2_k = \sum_{\mathbf{x}^{(i)} \in C_k}(\mathbf{w}^T\mathbf{x}^{(i)} - \mu'_k)^2, \qquad k = 1, 2$$

42


LDA Criteria

43

$$J(\mathbf{w}) = \frac{(\mu'_2 - \mu'_1)^2}{s'^2_1 + s'^2_2}$$

$$(\mu'_2 - \mu'_1)^2 = \big(\mathbf{w}^T(\boldsymbol{\mu}_2 - \boldsymbol{\mu}_1)\big)^2 = \mathbf{w}^T(\boldsymbol{\mu}_2 - \boldsymbol{\mu}_1)(\boldsymbol{\mu}_2 - \boldsymbol{\mu}_1)^T\mathbf{w}$$

$$s'^2_k = \sum_{\mathbf{x}^{(i)} \in C_k}(\mathbf{w}^T\mathbf{x}^{(i)} - \mu'_k)^2 = \sum_{\mathbf{x}^{(i)} \in C_k}\mathbf{w}^T(\mathbf{x}^{(i)} - \boldsymbol{\mu}_k)(\mathbf{x}^{(i)} - \boldsymbol{\mu}_k)^T\mathbf{w}$$


LDA Criteria

44

$$(\mu'_2 - \mu'_1)^2 = \mathbf{w}^T\mathbf{S}_B\mathbf{w}, \qquad \mathbf{S}_B = (\boldsymbol{\mu}_2 - \boldsymbol{\mu}_1)(\boldsymbol{\mu}_2 - \boldsymbol{\mu}_1)^T \quad \text{(between-class scatter matrix)}$$

$$s'^2_1 + s'^2_2 = \mathbf{w}^T\mathbf{S}_W\mathbf{w}, \qquad \mathbf{S}_W = \mathbf{S}_1 + \mathbf{S}_2 = \sum_{k=1,2}\;\sum_{\mathbf{x}^{(i)} \in C_k}(\mathbf{x}^{(i)} - \boldsymbol{\mu}_k)(\mathbf{x}^{(i)} - \boldsymbol{\mu}_k)^T \quad \text{(within-class scatter matrix)}$$

(A scatter matrix is $N$ times the covariance matrix.)


LDA Derivation

$$J(\mathbf{w}) = \frac{\mathbf{w}^T\mathbf{S}_B\mathbf{w}}{\mathbf{w}^T\mathbf{S}_W\mathbf{w}}$$

$$\frac{\partial J(\mathbf{w})}{\partial\mathbf{w}} = \frac{2\,\mathbf{S}_B\mathbf{w}\,(\mathbf{w}^T\mathbf{S}_W\mathbf{w}) - 2\,\mathbf{S}_W\mathbf{w}\,(\mathbf{w}^T\mathbf{S}_B\mathbf{w})}{(\mathbf{w}^T\mathbf{S}_W\mathbf{w})^2}$$

$$\frac{\partial J(\mathbf{w})}{\partial\mathbf{w}} = 0 \;\Rightarrow\; (\mathbf{w}^T\mathbf{S}_W\mathbf{w})\,\mathbf{S}_B\mathbf{w} = (\mathbf{w}^T\mathbf{S}_B\mathbf{w})\,\mathbf{S}_W\mathbf{w}$$

45


LDA Derivation

$\mathbf{S}_B\mathbf{x}$, for any vector $\mathbf{x}$, points in the same direction as $\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2$:

$$\mathbf{S}_B\mathbf{x} = (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)^T\mathbf{x} = \big((\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)^T\mathbf{x}\big)\,(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)$$

Thus, we can solve the eigenvalue problem immediately:

$$\mathbf{S}_B\mathbf{w} = \lambda\,\mathbf{S}_W\mathbf{w} \;\Rightarrow\; \mathbf{S}_W^{-1}\mathbf{S}_B\mathbf{w} = \lambda\mathbf{w} \quad (\mathbf{S}_W \text{ full-rank}) \;\Rightarrow\; \mathbf{w} \propto \mathbf{S}_W^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)$$

46


LDA Algorithm

47

$\boldsymbol{\mu}_1$ and $\boldsymbol{\mu}_2$ ← mean of the samples of class 1 and class 2, respectively
$\mathbf{S}_1$ and $\mathbf{S}_2$ ← scatter matrix of class 1 and class 2, respectively
$\mathbf{S}_W = \mathbf{S}_1 + \mathbf{S}_2$, $\quad \mathbf{S}_B = (\boldsymbol{\mu}_2 - \boldsymbol{\mu}_1)(\boldsymbol{\mu}_2 - \boldsymbol{\mu}_1)^T$

Feature extraction: $\mathbf{w} = \mathbf{S}_W^{-1}(\boldsymbol{\mu}_2 - \boldsymbol{\mu}_1)$, equivalently the eigenvector corresponding to the largest eigenvalue of $\mathbf{S}_W^{-1}\mathbf{S}_B$

Classification: project $x' = \mathbf{w}^T\mathbf{x}$; using a threshold on $x'$, we can classify $\mathbf{x}$
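A numpy sketch of this two-class algorithm (the toy data, and the midpoint threshold on the projection, are illustrative assumptions rather than part of the slides):

```python
import numpy as np

def fisher_lda_2class(X1, X2):
    """X1, X2: rows are samples of class 1 and class 2."""
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)            # class means
    S1 = (X1 - mu1).T @ (X1 - mu1)                         # scatter matrix of class 1
    S2 = (X2 - mu2).T @ (X2 - mu2)                         # scatter matrix of class 2
    Sw = S1 + S2                                           # within-class scatter
    w = np.linalg.solve(Sw, mu2 - mu1)                     # w = Sw^{-1} (mu2 - mu1)
    return w

rng = np.random.default_rng(0)
X1 = rng.normal(loc=[0, 0], scale=1.0, size=(100, 2))
X2 = rng.normal(loc=[3, 2], scale=1.0, size=(100, 2))
w = fisher_lda_2class(X1, X2)

# Classify with a threshold on the projection (here: midpoint of the projected class means).
threshold = 0.5 * (w @ X1.mean(axis=0) + w @ X2.mean(axis=0))
pred_class2 = (np.vstack([X1, X2]) @ w) > threshold
print(pred_class2.mean())   # fraction of all points assigned to class 2
```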


Multi-Class LDA (MDA)

48

$C > 2$ classes: the natural generalization of LDA involves $C - 1$ discriminant functions.
The projection is from a d-dimensional space to a $(C-1)$-dimensional space (it is tacitly assumed that $d \geq C$).

$$\mathbf{S}_W = \sum_{k=1}^{C}\mathbf{S}_k, \qquad \mathbf{S}_k = \sum_{\mathbf{x}^{(i)} \in C_k}(\mathbf{x}^{(i)} - \boldsymbol{\mu}_k)(\mathbf{x}^{(i)} - \boldsymbol{\mu}_k)^T, \quad k = 1, \dots, C$$

$$\mathbf{S}_B = \sum_{k=1}^{C} N_k(\boldsymbol{\mu}_k - \boldsymbol{\mu})(\boldsymbol{\mu}_k - \boldsymbol{\mu})^T$$

$$\boldsymbol{\mu}_k = \frac{1}{N_k}\sum_{\mathbf{x}^{(i)} \in C_k}\mathbf{x}^{(i)}, \quad k = 1, \dots, C, \qquad \boldsymbol{\mu} = \frac{1}{N}\sum_{i=1}^{N}\mathbf{x}^{(i)}$$


Multi-Class LDA

49

Means and scatters after the transform $\mathbf{x}' = \mathbf{W}^T\mathbf{x}$:

$$\boldsymbol{\mu}'_k = \mathbf{W}^T\boldsymbol{\mu}_k, \qquad \tilde{\mathbf{S}}_W = \mathbf{W}^T\mathbf{S}_W\mathbf{W}, \qquad \tilde{\mathbf{S}}_B = \mathbf{W}^T\mathbf{S}_B\mathbf{W}$$


Multi-Class LDA: Objective Function

50

We seek a transformation matrix $\mathbf{W}$ that in some sense “maximizes the ratio of the between-class scatter to the within-class scatter”.

A simple scalar measure of scatter is the determinantof the scatter matrix.


Multi-Class LDA: Objective Function

51

$$J(\mathbf{W}) = \frac{|\mathbf{W}^T\mathbf{S}_B\mathbf{W}|}{|\mathbf{W}^T\mathbf{S}_W\mathbf{W}|} \qquad (|\cdot| \text{ denotes the determinant})$$

The solution of this problem, where $\mathbf{W} = [\mathbf{w}_1 \cdots \mathbf{w}_{C-1}]$:

$$\mathbf{S}_B\mathbf{w}_i = \lambda_i\,\mathbf{S}_W\mathbf{w}_i$$

It is a generalized eigenvector problem.
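A sketch of the multi-class version using scipy's generalized symmetric eigensolver (the toy three-class data are an illustrative assumption; it also assumes $\mathbf{S}_W$ is non-singular):

```python
import numpy as np
from scipy.linalg import eigh

def multiclass_lda(X, y, d_prime):
    classes = np.unique(y)
    mu = X.mean(axis=0)
    d = X.shape[1]
    Sw, Sb = np.zeros((d, d)), np.zeros((d, d))
    for c in classes:
        Xc = X[y == c]
        mu_c = Xc.mean(axis=0)
        Sw += (Xc - mu_c).T @ (Xc - mu_c)                   # within-class scatter
        Sb += len(Xc) * np.outer(mu_c - mu, mu_c - mu)      # between-class scatter
    # Generalized eigenproblem Sb w = lambda Sw w; eigh returns ascending eigenvalues.
    eigvals, W = eigh(Sb, Sw)
    W = W[:, np.argsort(eigvals)[::-1][:d_prime]]           # top d' <= C-1 directions
    return X @ W, W

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=m, size=(50, 4)) for m in ([0]*4, [3]*4, [0, 3, 0, 3])])
y = np.repeat([0, 1, 2], 50)
Z, W = multiclass_lda(X, y, d_prime=2)
print(Z.shape)   # (150, 2)
```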


Multi-Class LDA:

52

$\mathrm{rank}(\mathbf{S}_B) \leq C - 1$: $\mathbf{S}_B$ is the sum of $C$ matrices $(\boldsymbol{\mu}_k - \boldsymbol{\mu})(\boldsymbol{\mu}_k - \boldsymbol{\mu})^T$ of rank (at most) one, and only $C - 1$ of these are independent ⇒ at most $C - 1$ nonzero eigenvalues, and the desired weight vectors correspond to these nonzero eigenvalues.


Multi-Class LDA: Other Objective Functions

53

There are many possible choices of criterion for multi-class LDA, e.g. trace-based ratios such as

$$J(\mathbf{W}) = \mathrm{tr}\big\{(\mathbf{W}^T\mathbf{S}_W\mathbf{W})^{-1}(\mathbf{W}^T\mathbf{S}_B\mathbf{W})\big\}$$

The solution is given by solving a generalized eigenvalue problem.
Solution: the eigenvectors corresponding to the largest eigenvalues constitute the new variables.


LDA Criterion Limitation

54

When $\boldsymbol{\mu}_1 = \boldsymbol{\mu}_2$, the LDA criterion cannot lead to a proper projection ($\mathbf{S}_B = \mathbf{0}$).
However, discriminatory information in the scatter of the data may be helpful.

If the classes are non-linearly separable, they may have a large overlap when projected onto any line.

LDA implicitly assumes a Gaussian distribution for the samples of each class.


Issues in LDA

55

Singularity or undersampled problem (when $N < d$)
Examples: gene expression data, images, text documents

LDA can reduce the dimension only to $d' \leq C - 1$ (unlike PCA).

Approaches to avoid these problems: PCA+LDA, Regularized LDA, Locally FDA (LFDA), etc.


Summary

Although LDA often provides more suitable features for classification tasks, PCA might outperform LDA in some situations, such as:
when the number of samples per class is small (overfitting problem of LDA)
when the training data non-uniformly sample the underlying distribution
when the number of desired features is more than $C - 1$

Advances in the recent decade:
Semi-supervised feature extraction
Non-linear dimensionality reduction

56

[Figure: PCA vs. LDA projection directions]