

CE-725: Statistical Pattern Recognition, Sharif University of Technology, Spring 2013

Soleymani

Feature Extraction (PCA & LDA)


Outline
What is feature extraction?
Feature extraction algorithms
Linear methods

Unsupervised: Principal Component Analysis (PCA)
Also known as the Karhunen-Loève (KL) transform

Supervised: Linear Discriminant Analysis (LDA)
Also known as Fisher's Discriminant Analysis (FDA)

2


Dimensionality Reduction: Feature Selection vs. Feature Extraction

Feature selection: select a subset of a given feature set.

Feature extraction (e.g., PCA, LDA): a linear or non-linear transform on the original feature space.

3

Feature selection: $[x_1, \dots, x_d]^T \;\to\; [x_{i_1}, \dots, x_{i_{d'}}]^T$ ($d' < d$)

Feature extraction: $[x_1, \dots, x_d]^T \;\to\; \mathbf{x}' = f(\mathbf{x}) = [x'_1, \dots, x'_{d'}]^T$


Feature Extraction

4

Mapping of the original data onto a lower-dimensional space.
The criterion for feature extraction can differ based on the problem setting:
Unsupervised task: minimize the information loss (reconstruction error).
Supervised task: maximize the class discrimination in the projected space.

In the previous lecture, we talked about feature selection:
Feature selection can be considered a special form of feature extraction (only a subset of the original features is used).

Example: with $d = 4$ and $d' = 2$,

$$\mathbf{X}'_{N \times 2} = \mathbf{X}_{N \times 4}\,\mathbf{A}, \qquad \mathbf{A}^T = \begin{bmatrix} 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix}$$

The second and third features are selected.


Feature Extraction

5

Unsupervised feature extraction: given only the data matrix

$$\mathbf{X} = \begin{bmatrix} x_1^{(1)} & \cdots & x_d^{(1)} \\ \vdots & \ddots & \vdots \\ x_1^{(N)} & \cdots & x_d^{(N)} \end{bmatrix}$$

find a mapping $f: \mathbb{R}^d \to \mathbb{R}^{d'}$, or only the transformed data

$$\mathbf{X}' = \begin{bmatrix} x_1'^{(1)} & \cdots & x_{d'}'^{(1)} \\ \vdots & \ddots & \vdots \\ x_1'^{(N)} & \cdots & x_{d'}'^{(N)} \end{bmatrix}$$

Supervised feature extraction: given the data matrix $\mathbf{X}$ together with the labels $\mathbf{y} = [y^{(1)}, \dots, y^{(N)}]^T$, find a mapping $f: \mathbb{R}^d \to \mathbb{R}^{d'}$, or only the transformed data $\mathbf{X}'$.


Linear Transformation

6

For a linear transformation, we find an explicit mapping that can also transform new data vectors:

$$\mathbf{x}'_{d' \times 1} = \mathbf{A}^T_{d' \times d}\,\mathbf{x}_{d \times 1}$$

where $\mathbf{x}$ is the original data vector and $\mathbf{x}'$ is the reduced data vector ($d' < d$).


Linear Transformation

7

Linear transformations are simple mappings:

$$\mathbf{x}' = \mathbf{A}^T\mathbf{x}, \qquad x'_j = \mathbf{a}_j^T\mathbf{x}, \quad j = 1, \dots, d'$$

$$\mathbf{A}_{d \times d'} = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1d'} \\ a_{21} & a_{22} & \cdots & a_{2d'} \\ \vdots & & & \vdots \\ a_{d1} & a_{d2} & \cdots & a_{dd'} \end{bmatrix} = [\mathbf{a}_1 \; \cdots \; \mathbf{a}_{d'}]$$
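A minimal numpy sketch of this mapping (the dimensions and the random matrix A are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_prime, N = 5, 2, 100          # original dim, reduced dim, number of samples (illustrative)

X = rng.normal(size=(N, d))        # rows are d-dimensional data points
A = rng.normal(size=(d, d_prime))  # columns a_1, ..., a_{d'} define the linear transform

X_reduced = X @ A                  # each row is x' = A^T x for the corresponding x
print(X_reduced.shape)             # (100, 2)
```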


Linear Dimensionality Reduction

8

Unsupervised:
Principal Component Analysis (PCA) [we will discuss]
Independent Component Analysis (ICA)
Singular Value Decomposition (SVD)
Multidimensional Scaling (MDS)
Canonical Correlation Analysis (CCA)

Supervised:
Linear Discriminant Analysis (LDA) [we will discuss]


Unsupervised Feature Reduction

9

Visualization: projection of high-dimensional data onto 2D or 3D.

Data compression: efficient storage, communication, or retrieval.

Noise removal: to improve accuracy by removing irrelevant features.

As a preprocessing step to reduce dimensions for classification tasks.


Principal Component Analysis (PCA)

10

The “best” subspace:
Centered at the sample mean
The axes have been rotated to new (principal) axes such that:
Principal axis 1 has the highest variance
Principal axis 2 has the next highest variance, and so on.
The principal axes are uncorrelated: the covariance between each pair of principal axes is zero.

Goal: reducing the dimensionality of the data while preserving the variation present in the dataset as much as possible.


Principal Component Analysis (PCA)

11

Principal Components (PCs): orthogonal vectors that are ordered by the fraction of the total information (variation) in the corresponding directions.

PCs can be found as the “best” eigenvectors of the covariance matrix of the data points.
If the data has a Gaussian distribution N(μ, Σ), the direction of the largest variance can be found by the eigenvector of Σ that corresponds to the largest eigenvalue of Σ.


Covariance Matrix

12

ML estimate of the covariance matrix from data points $\{\mathbf{x}^{(i)}\}_{i=1}^{N}$:

$$\hat{\boldsymbol{\mu}} = \frac{1}{N}\sum_{i=1}^{N}\mathbf{x}^{(i)}, \qquad \hat{\boldsymbol{\Sigma}} = \frac{1}{N}\sum_{i=1}^{N}(\mathbf{x}^{(i)} - \hat{\boldsymbol{\mu}})(\mathbf{x}^{(i)} - \hat{\boldsymbol{\mu}})^T = \frac{1}{N}\tilde{\mathbf{X}}^T\tilde{\mathbf{X}}$$

where $\tilde{\mathbf{X}}$ is the mean-centered data matrix with rows $(\mathbf{x}^{(i)} - \hat{\boldsymbol{\mu}})^T$.
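A small numpy sketch of this estimate (the data are illustrative); note the ML 1/N factor rather than numpy's default 1/(N-1):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))            # N=100 data points, d=3 features (illustrative)

mu = X.mean(axis=0)                      # ML estimate of the mean
Xc = X - mu                              # mean-centered data matrix
Sigma = (Xc.T @ Xc) / X.shape[0]         # ML covariance estimate (1/N, not 1/(N-1))

# Same result via numpy's built-in estimator with bias=True (divides by N):
assert np.allclose(Sigma, np.cov(X, rowvar=False, bias=True))
```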


Eigenvalues and Eigenvectors:Geometrical Interpretation

The covariance matrix is a PSD matrix corresponding to a hyper-ellipsoid in the d-dimensional space:

$$\boldsymbol{\Sigma}\mathbf{u}_i = \lambda_i\mathbf{u}_i, \qquad i = 1, \dots, d$$

13


PCA: Steps

14

Input: data matrix $\mathbf{X}_{N \times d}$ (each row contains a d-dimensional data point)

$\hat{\boldsymbol{\mu}} = \frac{1}{N}\sum_{i=1}^{N}\mathbf{x}^{(i)}$ ← the mean of the data points is subtracted from the rows of $\mathbf{X}$

$\boldsymbol{\Sigma} = \frac{1}{N}\mathbf{X}^T\mathbf{X}$ (covariance matrix of the centered data)

Calculate the eigenvalues and eigenvectors of $\boldsymbol{\Sigma}$.
Pick the eigenvectors corresponding to the $d'$ largest eigenvalues and put them in the columns of $\mathbf{A} = [\mathbf{v}_1, \dots, \mathbf{v}_{d'}]$ (first PC, ..., d'-th PC).

$\mathbf{X}' = \mathbf{X}\mathbf{A}$
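A direct numpy translation of these steps (a sketch; the data matrix and the choice d' = 3 are placeholders):

```python
import numpy as np

def pca(X, d_prime):
    """Project rows of X (N x d) onto the top d_prime principal components."""
    mu = X.mean(axis=0)                      # mean of the data points
    Xc = X - mu                              # subtract the mean from the rows of X
    Sigma = (Xc.T @ Xc) / X.shape[0]         # covariance matrix of the centered data
    eigvals, eigvecs = np.linalg.eigh(Sigma) # eigh: ascending eigenvalues, orthonormal eigenvectors
    order = np.argsort(eigvals)[::-1]        # sort eigenvalues in descending order
    A = eigvecs[:, order[:d_prime]]          # columns = first PC, ..., d'-th PC
    return Xc @ A, A, mu                     # X' = X A (on centered data), plus the model

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
X_prime, A, mu = pca(X, d_prime=3)
print(X_prime.shape)   # (200, 3)
```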


Another Interpretation: Least Squares Error

15

PCs are linear least-squares fits to the samples, each orthogonal to the previous PCs:
The first PC is a minimum-distance fit to a vector in the original feature space.
The second PC is a minimum-distance fit to a vector in the plane perpendicular to the first PC.


Another Interpretation: Least Squares Error

16


Another Interpretation: Least Squares Error

17


Least Squares Error and Maximum Variance Views Are Equivalent (1-dim Interpretation)

Minimizing the sum of squared distances to the line is equivalent to maximizing the sum of squares of the projections on that line (Pythagoras).

18

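In symbols, for a mean-centered sample $\mathbf{x}^{(i)}$ and a unit vector $\mathbf{w}$:

$$\big\|\mathbf{x}^{(i)} - (\mathbf{w}^T\mathbf{x}^{(i)})\mathbf{w}\big\|^2 = \|\mathbf{x}^{(i)}\|^2 - (\mathbf{w}^T\mathbf{x}^{(i)})^2,$$

so, summing over samples, minimizing the squared distances to the line is the same as maximizing $\sum_i(\mathbf{w}^T\mathbf{x}^{(i)})^2$, since $\sum_i\|\mathbf{x}^{(i)}\|^2$ does not depend on $\mathbf{w}$.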


PCA: Uncorrelated Features

19

If $\mathbf{A} = [\mathbf{v}_1, \dots, \mathbf{v}_{d'}]$, where $\mathbf{v}_i$ are orthonormal eigenvectors of $\boldsymbol{\Sigma}$, then mutually uncorrelated features are obtained.

Completely uncorrelated features avoid information redundancies.
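A quick numerical check of this property (an illustrative sketch): the sample covariance of the projected features should be diagonal, with the eigenvalues on the diagonal.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.multivariate_normal(mean=[0, 0, 0],
                            cov=[[4, 2, 1], [2, 3, 1], [1, 1, 2]], size=5000)

Xc = X - X.mean(axis=0)
Sigma = (Xc.T @ Xc) / X.shape[0]
eigvals, V = np.linalg.eigh(Sigma)        # orthonormal eigenvectors of the covariance

X_prime = Xc @ V                          # project onto all principal axes
Sigma_prime = (X_prime.T @ X_prime) / X.shape[0]

# Off-diagonal covariances of the new features are (numerically) zero:
print(np.round(Sigma_prime, 6))           # ~ diag(eigvals)
```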


PCA Derivation (Correlation Version): Mean Square Error Approximation

20

Incorporating all $d$ eigenvectors in $\mathbf{A} = [\mathbf{v}_1, \dots, \mathbf{v}_d]$:

$$\mathbf{x}' = \mathbf{A}^T\mathbf{x} \quad\Rightarrow\quad \mathbf{x} = \mathbf{A}\mathbf{x}' = \sum_{i=1}^{d} x'_i\,\mathbf{v}_i$$

If $d' = d$, then $\mathbf{x}$ can be reconstructed exactly from $\mathbf{x}'$.


PCA Derivation (Correlation Version): Mean Square Error Approximation

21

Incorporating only the $d'$ eigenvectors corresponding to the largest eigenvalues, $\mathbf{A} = [\mathbf{v}_1, \dots, \mathbf{v}_{d'}]$ ($d' < d$):

This minimizes the MSE between $\mathbf{x}$ and $\hat{\mathbf{x}} = \mathbf{A}\mathbf{x}'$:

$$E\big[\|\mathbf{x} - \hat{\mathbf{x}}\|^2\big] = E\Big[\Big\|\sum_{i=d'+1}^{d}(\mathbf{v}_i^T\mathbf{x})\mathbf{v}_i\Big\|^2\Big] = \sum_{i=d'+1}^{d}\mathbf{v}_i^T E[\mathbf{x}\mathbf{x}^T]\mathbf{v}_i = \sum_{i=d'+1}^{d}\lambda_i$$

i.e., the sum of the $d - d'$ smallest eigenvalues.


PCA Derivation (Correlation Version):Relation between Eigenvalues and Variances

22

The j-th largest eigenvalue of $\boldsymbol{\Sigma}$ is the variance on the j-th PC:
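Filling in the elided formula (assuming mean-centered data, so that $\boldsymbol{\Sigma} = E[\mathbf{x}\mathbf{x}^T]$):

$$\mathrm{Var}(x'_j) = E\big[(\mathbf{v}_j^T\mathbf{x})^2\big] = \mathbf{v}_j^T E[\mathbf{x}\mathbf{x}^T]\mathbf{v}_j = \mathbf{v}_j^T\boldsymbol{\Sigma}\mathbf{v}_j = \lambda_j\,\mathbf{v}_j^T\mathbf{v}_j = \lambda_j.$$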


PCA Derivation (Correlation Version): Mean Square Error Approximation

23

In general, it can be shown that the MSE is minimized compared to any other approximation of $\mathbf{x}$ by any $d'$-dimensional orthonormal basis; this result can be obtained without first assuming that the axes are eigenvectors of the correlation matrix.

If the data is mean-centered in advance, the correlation matrix and the covariance matrix will be the same. However, in the correlation version, when the mean is not zero the approximation is not, in general, a good one (although it is a minimum-MSE solution).


PCA on Faces: “Eigenfaces”

24

ORL database (some sample images)


PCA on Faces: “Eigenfaces”

25

For eigenfaces: “gray” = 0, “white” > 0, “black” < 0

Average face

1st PC

6th PC


PCA on Faces:

Feature vector $= [w_1, w_2, \dots, w_{d'}]$

$$\mathbf{x} \approx \bar{\mathbf{x}} + w_1 \times \text{PC}_1 + w_2 \times \text{PC}_2 + w_3 \times \text{PC}_3 + \cdots$$

26

$\bar{\mathbf{x}}$ = average face

$w_i$ = the projection of $\mathbf{x}$ on the i-th PC

$\mathbf{x}$ is a $112 \times 92 = 10304$-dimensional vector containing the intensities of the pixels of this image.
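A numpy sketch of this decomposition and reconstruction; the random `faces` matrix is only a stand-in for the ORL images (400 flattened 112×92 images), and the SVD route is just a convenient way to obtain the same eigenvectors:

```python
import numpy as np

def fit_eigenfaces(faces, d_prime):
    """faces: N x 10304 matrix, one flattened 112x92 image per row."""
    mean_face = faces.mean(axis=0)
    Xc = faces - mean_face
    # Right singular vectors of the centered data = eigenvectors of its covariance.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    pcs = Vt[:d_prime]                   # rows = first d' principal components ("eigenfaces")
    return mean_face, pcs

def reconstruct(face, mean_face, pcs):
    w = pcs @ (face - mean_face)         # w_i = projection of the face on the i-th PC
    return mean_face + pcs.T @ w         # average face + sum_i w_i * PC_i

# Illustrative random stand-in for the ORL data (400 images of 112*92 pixels):
rng = np.random.default_rng(0)
faces = rng.random((400, 112 * 92))
mean_face, pcs = fit_eigenfaces(faces, d_prime=16)
approx = reconstruct(faces[0], mean_face, pcs)
print(approx.shape)                      # (10304,)
```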


PCA on Faces: Reconstructed Face

Reconstructions with d'=1, 2, 4, 8, 16, 32, 64, 128, 256, and the original image.

27


PCA Drawback

28

An excellent information-packing transform does not necessarily lead to good class separability.
The directions of maximum variance may be useless for classification purposes.


Independent Component Analysis (ICA)

29

PCA:
The transformed dimensions will be uncorrelated with each other.
Orthogonal linear transform.
Only uses second-order statistics (i.e., the covariance matrix).

ICA:
The transformed dimensions will be as independent as possible.
Non-orthogonal linear transform.
Higher-order statistics are used.


Uncorrelated and Independent

30

Gaussian: Independent ⟺ Uncorrelated

Non-Gaussian: Independent ⇒ Uncorrelated, but Uncorrelated ⇏ Independent

Uncorrelated: $\mathrm{Cov}(x_i, x_j) = 0$.  Independent: $p(x_i, x_j) = p(x_i)\,p(x_j)$


PCA vs. ICA

31

PCA

ICA

[Figure: PCA (orthogonal) vs. ICA (non-orthogonal) directions on example datasets: 1 class, 2 classes, 3 classes, cubic/linear structure]


Kernel PCA

32

Kernel extension of PCA

Useful when the data (approximately) lies on a lower-dimensional non-linear manifold.


Kernel PCA

33

Map to a Hilbert space: $\mathbf{x} \to \phi(\mathbf{x})$ (non-linear extension of PCA):

$$\mathbf{C} = \frac{1}{N}\sum_{n=1}^{N}\phi(\mathbf{x}^{(n)})\,\phi(\mathbf{x}^{(n)})^T$$

All eigenvectors of $\mathbf{C}$ lie in the span of the mapped data points:

$$\mathbf{C}\mathbf{v} = \lambda\mathbf{v}, \qquad \mathbf{v} = \sum_{n=1}^{N}\alpha_n\,\phi(\mathbf{x}^{(n)})$$

After some algebra, we have:

$$\mathbf{K}\boldsymbol{\alpha} = N\lambda\,\boldsymbol{\alpha}, \qquad K_{ij} = \phi(\mathbf{x}^{(i)})^T\phi(\mathbf{x}^{(j)}) = k(\mathbf{x}^{(i)}, \mathbf{x}^{(j)})$$
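A minimal kernel-PCA sketch following this recipe: build the Gram matrix K, center it, and solve the eigenproblem for the coefficients α. The RBF kernel, its width gamma, and the toy data are illustrative assumptions:

```python
import numpy as np

def rbf_kernel(X, gamma=1.0):
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq_dists)

def kernel_pca(X, d_prime, gamma=1.0):
    N = X.shape[0]
    K = rbf_kernel(X, gamma)
    # Center the kernel matrix (equivalent to centering phi(x) in feature space).
    one_n = np.ones((N, N)) / N
    Kc = K - one_n @ K - K @ one_n + one_n @ K @ one_n
    eigvals, alphas = np.linalg.eigh(Kc)            # ascending order
    eigvals, alphas = eigvals[::-1], alphas[:, ::-1]
    # Normalize alphas so the feature-space eigenvectors v have unit norm.
    alphas = alphas[:, :d_prime] / np.sqrt(np.maximum(eigvals[:d_prime], 1e-12))
    return Kc @ alphas                               # projections of the training points

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 2))
Z = kernel_pca(X, d_prime=2, gamma=0.5)
print(Z.shape)                                       # (150, 2)
```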


Linear Discriminant Analysis (LDA)

34

Supervised feature extraction

Fisher's Linear Discriminant Analysis:

Dimensionality reduction: finds linear combinations of features with large ratios of between-group to within-group scatter (as new discriminant variables).

Classification: for example, predicts the class of an observation $\mathbf{x}$ by the class whose mean vector is closest to $\mathbf{x}$ in the space of the discriminant variables.


Good Projection for Classification

What is a good criterion?
Separating the different classes in the projected space.
As opposed to PCA, we also use the labels of the training data.

35



LDA Problem

Problem definition:
$C = 2$ classes
$\{(\mathbf{x}^{(i)}, y^{(i)})\}_{i=1}^{N}$ training samples, with $N_1$ samples from the first class ($C_1$) and $N_2$ samples from the second class ($C_2$)

Goal: find the best direction $\mathbf{w}$ that we hope will enable accurate classification.

The projection of a sample $\mathbf{x}$ onto a line in direction $\mathbf{w}$ is $\mathbf{w}^T\mathbf{x}$.

What is the measure of separation between the projected points of the different classes?

38


Measure of Separation in the Projected Direction

39

[Bishop]

Is the direction of the line joining the class means a good candidate for $\mathbf{w}$?

$$\boldsymbol{\mu}_1 = \frac{1}{N_1}\sum_{\mathbf{x}^{(i)} \in C_1}\mathbf{x}^{(i)}, \qquad \boldsymbol{\mu}_2 = \frac{1}{N_2}\sum_{\mathbf{x}^{(i)} \in C_2}\mathbf{x}^{(i)}$$


Measure of Separation in the Projected Direction

40

The direction of the line joining the class means is the solution of the following problem (maximize the separation of the projected class means):

$$\max_{\mathbf{w}}\; \mu'_2 - \mu'_1 = \mathbf{w}^T(\boldsymbol{\mu}_2 - \boldsymbol{\mu}_1) \quad \text{s.t.}\; \|\mathbf{w}\| = 1$$

where $\mu'_1 = \mathbf{w}^T\boldsymbol{\mu}_1$ and $\mu'_2 = \mathbf{w}^T\boldsymbol{\mu}_2$.

What is the problem with a criterion that considers only $\mu'_2 - \mu'_1$? It does not consider the variances of the classes.
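As a quick check of the claim above, adding a Lagrange multiplier for the unit-norm constraint gives:

$$\mathcal{L}(\mathbf{w}, \lambda) = \mathbf{w}^T(\boldsymbol{\mu}_2 - \boldsymbol{\mu}_1) + \lambda(1 - \mathbf{w}^T\mathbf{w}), \qquad \nabla_{\mathbf{w}}\mathcal{L} = 0 \;\Rightarrow\; \mathbf{w} = \frac{1}{2\lambda}(\boldsymbol{\mu}_2 - \boldsymbol{\mu}_1) \propto \boldsymbol{\mu}_2 - \boldsymbol{\mu}_1.$$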


LDA Criteria

41

Fisher's idea: maximize a function that will give a large separation between the projected class means while also achieving a small variance within each class, thereby minimizing the class overlap.
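Written out, this is the standard Fisher criterion that the following slides expand, using the projected means $\mu'_k = \mathbf{w}^T\boldsymbol{\mu}_k$ and the projected scatters $s'^2_k$ defined on the next slide:

$$J(\mathbf{w}) = \frac{(\mu'_2 - \mu'_1)^2}{s'^2_1 + s'^2_2}.$$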


LDA Criteria

The scatters of the original data are:

$$\mathbf{S}_k = \sum_{\mathbf{x}^{(i)} \in C_k}(\mathbf{x}^{(i)} - \boldsymbol{\mu}_k)(\mathbf{x}^{(i)} - \boldsymbol{\mu}_k)^T, \qquad k = 1, 2$$

The scatters of the projected data are:

$$s'^2_k = \sum_{\mathbf{x}^{(i)} \in C_k}(\mathbf{w}^T\mathbf{x}^{(i)} - \mu'_k)^2, \qquad k = 1, 2$$

42


LDA Criteria

43

$$J(\mathbf{w}) = \frac{(\mu'_2 - \mu'_1)^2}{s'^2_1 + s'^2_2}$$

$$(\mu'_2 - \mu'_1)^2 = \big(\mathbf{w}^T(\boldsymbol{\mu}_2 - \boldsymbol{\mu}_1)\big)^2 = \mathbf{w}^T(\boldsymbol{\mu}_2 - \boldsymbol{\mu}_1)(\boldsymbol{\mu}_2 - \boldsymbol{\mu}_1)^T\mathbf{w}$$

$$s'^2_k = \sum_{\mathbf{x}^{(i)} \in C_k}(\mathbf{w}^T\mathbf{x}^{(i)} - \mu'_k)^2 = \sum_{\mathbf{x}^{(i)} \in C_k}\mathbf{w}^T(\mathbf{x}^{(i)} - \boldsymbol{\mu}_k)(\mathbf{x}^{(i)} - \boldsymbol{\mu}_k)^T\mathbf{w}$$


LDA Criteria

44

$$(\mu'_2 - \mu'_1)^2 = \mathbf{w}^T\mathbf{S}_B\mathbf{w}, \qquad \mathbf{S}_B = (\boldsymbol{\mu}_2 - \boldsymbol{\mu}_1)(\boldsymbol{\mu}_2 - \boldsymbol{\mu}_1)^T \quad \text{(between-class scatter matrix)}$$

$$s'^2_1 + s'^2_2 = \mathbf{w}^T\mathbf{S}_W\mathbf{w}, \qquad \mathbf{S}_W = \mathbf{S}_1 + \mathbf{S}_2 = \sum_{k=1,2}\;\sum_{\mathbf{x}^{(i)} \in C_k}(\mathbf{x}^{(i)} - \boldsymbol{\mu}_k)(\mathbf{x}^{(i)} - \boldsymbol{\mu}_k)^T \quad \text{(within-class scatter matrix)}$$

(A scatter matrix is $N$ times the covariance matrix.)


LDA Derivation

$$J(\mathbf{w}) = \frac{\mathbf{w}^T\mathbf{S}_B\mathbf{w}}{\mathbf{w}^T\mathbf{S}_W\mathbf{w}}$$

$$\frac{\partial J(\mathbf{w})}{\partial\mathbf{w}} = \frac{2\,\mathbf{S}_B\mathbf{w}\,(\mathbf{w}^T\mathbf{S}_W\mathbf{w}) - 2\,\mathbf{S}_W\mathbf{w}\,(\mathbf{w}^T\mathbf{S}_B\mathbf{w})}{(\mathbf{w}^T\mathbf{S}_W\mathbf{w})^2}$$

$$\frac{\partial J(\mathbf{w})}{\partial\mathbf{w}} = 0 \;\Rightarrow\; (\mathbf{w}^T\mathbf{S}_W\mathbf{w})\,\mathbf{S}_B\mathbf{w} = (\mathbf{w}^T\mathbf{S}_B\mathbf{w})\,\mathbf{S}_W\mathbf{w}$$

45


LDA Derivation

$\mathbf{S}_B\mathbf{x}$, for any vector $\mathbf{x}$, points in the same direction as $\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2$:

$$\mathbf{S}_B\mathbf{x} = (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)^T\mathbf{x} = \big((\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)^T\mathbf{x}\big)\,(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)$$

Thus, we can solve the eigenvalue problem immediately:

$$\mathbf{S}_B\mathbf{w} = \lambda\,\mathbf{S}_W\mathbf{w} \;\Rightarrow\; \mathbf{S}_W^{-1}\mathbf{S}_B\mathbf{w} = \lambda\mathbf{w} \quad (\mathbf{S}_W \text{ full-rank}) \;\Rightarrow\; \mathbf{w} \propto \mathbf{S}_W^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)$$

46


LDA Algorithm

47

$\boldsymbol{\mu}_1$ and $\boldsymbol{\mu}_2$ ← mean of the samples of class 1 and class 2, respectively
$\mathbf{S}_1$ and $\mathbf{S}_2$ ← scatter matrix of class 1 and class 2, respectively
$\mathbf{S}_W = \mathbf{S}_1 + \mathbf{S}_2$, $\quad \mathbf{S}_B = (\boldsymbol{\mu}_2 - \boldsymbol{\mu}_1)(\boldsymbol{\mu}_2 - \boldsymbol{\mu}_1)^T$

Feature extraction: $\mathbf{w} = \mathbf{S}_W^{-1}(\boldsymbol{\mu}_2 - \boldsymbol{\mu}_1)$, equivalently the eigenvector corresponding to the largest eigenvalue of $\mathbf{S}_W^{-1}\mathbf{S}_B$

Classification: project $x' = \mathbf{w}^T\mathbf{x}$; using a threshold on $x'$, we can classify $\mathbf{x}$
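A numpy sketch of this two-class algorithm (the toy data, and the midpoint threshold on the projection, are illustrative assumptions rather than part of the slides):

```python
import numpy as np

def fisher_lda_2class(X1, X2):
    """X1, X2: rows are samples of class 1 and class 2."""
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)            # class means
    S1 = (X1 - mu1).T @ (X1 - mu1)                         # scatter matrix of class 1
    S2 = (X2 - mu2).T @ (X2 - mu2)                         # scatter matrix of class 2
    Sw = S1 + S2                                           # within-class scatter
    w = np.linalg.solve(Sw, mu2 - mu1)                     # w = Sw^{-1} (mu2 - mu1)
    return w

rng = np.random.default_rng(0)
X1 = rng.normal(loc=[0, 0], scale=1.0, size=(100, 2))
X2 = rng.normal(loc=[3, 2], scale=1.0, size=(100, 2))
w = fisher_lda_2class(X1, X2)

# Classify with a threshold on the projection (here: midpoint of the projected class means).
threshold = 0.5 * (w @ X1.mean(axis=0) + w @ X2.mean(axis=0))
pred_class2 = (np.vstack([X1, X2]) @ w) > threshold
print(pred_class2.mean())   # fraction of all points assigned to class 2
```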


Multi-Class LDA (MDA)

48

$C > 2$ classes: the natural generalization of LDA involves $C - 1$ discriminant functions.
The projection is from a d-dimensional space to a $(C-1)$-dimensional space (it is tacitly assumed that $d \geq C$).

$$\mathbf{S}_W = \sum_{k=1}^{C}\mathbf{S}_k, \qquad \mathbf{S}_k = \sum_{\mathbf{x}^{(i)} \in C_k}(\mathbf{x}^{(i)} - \boldsymbol{\mu}_k)(\mathbf{x}^{(i)} - \boldsymbol{\mu}_k)^T, \quad k = 1, \dots, C$$

$$\mathbf{S}_B = \sum_{k=1}^{C} N_k(\boldsymbol{\mu}_k - \boldsymbol{\mu})(\boldsymbol{\mu}_k - \boldsymbol{\mu})^T$$

$$\boldsymbol{\mu}_k = \frac{1}{N_k}\sum_{\mathbf{x}^{(i)} \in C_k}\mathbf{x}^{(i)}, \quad k = 1, \dots, C, \qquad \boldsymbol{\mu} = \frac{1}{N}\sum_{i=1}^{N}\mathbf{x}^{(i)}$$


Multi-Class LDA

49

Means and scatters after the transform $\mathbf{x}' = \mathbf{W}^T\mathbf{x}$:

$$\boldsymbol{\mu}'_k = \mathbf{W}^T\boldsymbol{\mu}_k, \qquad \tilde{\mathbf{S}}_W = \mathbf{W}^T\mathbf{S}_W\mathbf{W}, \qquad \tilde{\mathbf{S}}_B = \mathbf{W}^T\mathbf{S}_B\mathbf{W}$$


Multi-Class LDA: Objective Function

50

We seek a transformation matrix $\mathbf{W}$ that in some sense “maximizes the ratio of the between-class scatter to the within-class scatter”.

A simple scalar measure of scatter is the determinantof the scatter matrix.


Multi-Class LDA: Objective Function

51

$$J(\mathbf{W}) = \frac{|\mathbf{W}^T\mathbf{S}_B\mathbf{W}|}{|\mathbf{W}^T\mathbf{S}_W\mathbf{W}|} \qquad (|\cdot| \text{ denotes the determinant})$$

The solution of this problem, where $\mathbf{W} = [\mathbf{w}_1 \cdots \mathbf{w}_{C-1}]$:

$$\mathbf{S}_B\mathbf{w}_i = \lambda_i\,\mathbf{S}_W\mathbf{w}_i$$

It is a generalized eigenvector problem.
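A sketch of the multi-class version using scipy's generalized symmetric eigensolver (the toy three-class data are an illustrative assumption; it also assumes $\mathbf{S}_W$ is non-singular):

```python
import numpy as np
from scipy.linalg import eigh

def multiclass_lda(X, y, d_prime):
    classes = np.unique(y)
    mu = X.mean(axis=0)
    d = X.shape[1]
    Sw, Sb = np.zeros((d, d)), np.zeros((d, d))
    for c in classes:
        Xc = X[y == c]
        mu_c = Xc.mean(axis=0)
        Sw += (Xc - mu_c).T @ (Xc - mu_c)                   # within-class scatter
        Sb += len(Xc) * np.outer(mu_c - mu, mu_c - mu)      # between-class scatter
    # Generalized eigenproblem Sb w = lambda Sw w; eigh returns ascending eigenvalues.
    eigvals, W = eigh(Sb, Sw)
    W = W[:, np.argsort(eigvals)[::-1][:d_prime]]           # top d' <= C-1 directions
    return X @ W, W

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=m, size=(50, 4)) for m in ([0]*4, [3]*4, [0, 3, 0, 3])])
y = np.repeat([0, 1, 2], 50)
Z, W = multiclass_lda(X, y, d_prime=2)
print(Z.shape)   # (150, 2)
```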


Multi-Class LDA:

52

$\mathrm{rank}(\mathbf{S}_B) \leq C - 1$: $\mathbf{S}_B$ is the sum of $C$ matrices $(\boldsymbol{\mu}_k - \boldsymbol{\mu})(\boldsymbol{\mu}_k - \boldsymbol{\mu})^T$ of rank (at most) one, and only $C - 1$ of these are independent ⇒ at most $C - 1$ nonzero eigenvalues, and the desired weight vectors correspond to these nonzero eigenvalues.


Multi-Class LDA: Other Objective Functions

53

There are many possible choices of criterion for multi-class LDA, e.g. trace-based ratios such as

$$J(\mathbf{W}) = \mathrm{tr}\big\{(\mathbf{W}^T\mathbf{S}_W\mathbf{W})^{-1}(\mathbf{W}^T\mathbf{S}_B\mathbf{W})\big\}$$

The solution is given by solving a generalized eigenvalue problem.
Solution: the eigenvectors corresponding to the largest eigenvalues constitute the new variables.


LDA Criterion Limitation

54

When $\boldsymbol{\mu}_1 = \boldsymbol{\mu}_2$, the LDA criterion cannot lead to a proper projection ($\mathbf{S}_B = \mathbf{0}$).
However, discriminatory information in the scatter of the data may be helpful.

If the classes are non-linearly separable, they may have a large overlap when projected onto any line.

LDA implicitly assumes a Gaussian distribution for the samples of each class.


Issues in LDA

55

Singularity or undersampled problem (when $N < d$)
Examples: gene expression data, images, text documents

LDA can reduce the dimension only to $d' \leq C - 1$ (unlike PCA).

Approaches to avoid these problems: PCA+LDA, Regularized LDA, Locally FDA (LFDA), etc.


Summary

Although LDA often provides more suitable features for classification tasks, PCA might outperform LDA in some situations, such as:
when the number of samples per class is small (overfitting problem of LDA)
when the training data non-uniformly sample the underlying distribution
when the number of desired features is more than $C - 1$

Advances in the recent decade:
Semi-supervised feature extraction
Non-linear dimensionality reduction

56

[Figure: PCA vs. LDA projection directions]