
Dimensionality Reduction

Chapter 3 (Duda et al.) – Section 3.8

CS479/679 Pattern Recognition, Dr. George Bebis

Data Dimensionality

• From a theoretical point of view, increasing the number of features should lead to better performance.

• In practice, including more features often leads to worse performance (i.e., the curse of dimensionality).

• The number of training examples required grows exponentially with the dimensionality.


Dimensionality Reduction

• Significant improvements can be achieved by first mapping the data into a lower-dimensional space.

• Dimensionality can be reduced by:
− Combining features (linearly or non-linearly)
− Selecting a subset of features (i.e., feature selection)

• We will focus on feature combinations first.

Dimensionality Reduction (cont’d)

• Linear combinations are particularly attractive because they are simple to compute and analytically tractable.

• Given x ∈ R^N, the goal is to find an N x K transformation matrix U such that:

y = U^T x ∈ R^K, where K << N
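As a minimal plain-NumPy illustration of this mapping (the matrix U here is only a random placeholder; PCA and LDA below provide principled ways to choose it):

```python
import numpy as np

# Hypothetical sizes: N-dimensional input, K-dimensional output (K << N).
N, K = 1000, 20
rng = np.random.default_rng(0)

U = rng.standard_normal((N, K))   # N x K transformation matrix (placeholder)
x = rng.standard_normal(N)        # a sample x in R^N

y = U.T @ x                       # y = U^T x, now in R^K
print(y.shape)                    # -> (20,)
```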


Dimensionality Reduction (cont’d)

• Idea: find a set of basis vectors in a lower dimensional space.


(1) Higher-dimensional space representation: x = a_1 v_1 + a_2 v_2 + ... + a_N v_N, where v_1, ..., v_N is a basis of the N-dimensional space.

(2) Lower-dimensional sub-space representation: x̂ = b_1 u_1 + b_2 u_2 + ... + b_K u_K, where u_1, ..., u_K is a basis of the K-dimensional sub-space.

Dimensionality Reduction (cont’d)

• Two classical approaches for finding optimal linear transformations are:

− Principal Components Analysis (PCA): Seeks a projection that preserves as much information in the data as possible (in a least-squares sense).

− Linear Discriminant Analysis (LDA): Seeks a projection that best separates the data (in a least-squares sense).


Principal Component Analysis (PCA)

• Dimensionality reduction implies information loss; PCA preserves as much information as possible by minimizing the reconstruction error (defined below).

• How should we determine the “best” lower dimensional space?

The “best” low-dimensional space can be determined by the “best” eigenvectors of the covariance matrix of the data (i.e., the eigenvectors corresponding to the “largest” eigenvalues – also called “principal components”).


PCA - Steps

− Suppose x_1, x_2, ..., x_M are N x 1 vectors.
− Step 1: compute the sample mean: x̄ = (1/M) Σ_{i=1}^{M} x_i
− Step 2: subtract the mean (i.e., center at zero): Φ_i = x_i − x̄
− Step 3: compute the sample covariance matrix: C = (1/M) Σ_{i=1}^{M} Φ_i Φ_i^T
− Step 4: compute the eigenvalues λ_1 ≥ λ_2 ≥ ... ≥ λ_N and corresponding eigenvectors u_1, u_2, ..., u_N of C.
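A compact NumPy sketch of Steps 1–4 (the function name pca_fit and the layout of X, one sample per row, are my own choices, not notation from the slides):

```python
import numpy as np

def pca_fit(X):
    """PCA steps: mean, centering, covariance, eigen-decomposition.

    X : (M, N) array with one N-dimensional sample per row.
    Returns the sample mean, the eigenvalues in decreasing order, and the
    corresponding eigenvectors as columns.
    """
    mean = X.mean(axis=0)                    # x_bar = (1/M) * sum_i x_i
    Phi = X - mean                           # center at zero
    cov = Phi.T @ Phi / X.shape[0]           # sample covariance matrix (N x N)
    eigvals, eigvecs = np.linalg.eigh(cov)   # symmetric matrix -> eigh
    order = np.argsort(eigvals)[::-1]        # largest eigenvalues first
    return mean, eigvals[order], eigvecs[:, order]
```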


PCA – Steps (cont’d)

− Step 5 (dimensionality reduction): approximate each centered vector using only the first K eigenvectors (K << N):

x − x̄ ≈ Σ_{i=1}^{K} b_i u_i, where b_i = ((x − x̄) · u_i) / (u_i · u_i)

and u_1, u_2, ..., u_K form an orthogonal basis.


PCA – Linear Transformation

• The linear transformation R^N → R^K that performs the dimensionality reduction is:

b_i = ((x − x̄) · u_i) / (u_i · u_i), for i = 1, 2, ..., K

• If u_i has unit length: b_i = (x − x̄) · u_i
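A sketch of this projection, assuming the unit-length eigenvectors come from an eigen-decomposition such as the one above (np.linalg.eigh already returns unit-length eigenvectors):

```python
import numpy as np

def pca_project(x, mean, eigvecs, K):
    """Compute b_i = (x - x_bar) . u_i for the first K unit eigenvectors."""
    U_K = eigvecs[:, :K]           # N x K matrix with u_1, ..., u_K as columns
    return U_K.T @ (x - mean)      # the K coefficients b_1, ..., b_K
```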


Geometric interpretation

• PCA projects the data along the directions where the data varies the most.

• These directions are determined by the eigenvectors of the covariance matrix corresponding to the largest eigenvalues.

• The magnitude of the eigenvalues corresponds to the variance of the data along the eigenvector directions.


How to choose K?

• Choose K using the following criterion: (Σ_{i=1}^{K} λ_i) / (Σ_{i=1}^{N} λ_i) > T, where T is a threshold (e.g., 0.9 or 0.95).

• In this case, we say that we “preserve” 90% or 95% of the information in the data.

• If K=N, then we “preserve” 100% of the information in the data.
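A sketch of this criterion, assuming eigvals is sorted in decreasing order as in the earlier snippet (the threshold value, e.g. 0.9 or 0.95, is the fraction of "information" to preserve):

```python
import numpy as np

def choose_K(eigvals, threshold=0.95):
    """Smallest K such that (sum of first K eigenvalues) / (sum of all) >= threshold."""
    ratios = np.cumsum(eigvals) / np.sum(eigvals)
    return int(np.searchsorted(ratios, threshold) + 1)
```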


Error due to dimensionality reduction

• The original vector x can be reconstructed from its principal components: x̂ = x̄ + Σ_{i=1}^{K} b_i u_i

• PCA minimizes the reconstruction error: e = ||x − x̂||

• It can be shown that the average squared reconstruction error over the data equals the sum of the eigenvalues of the discarded components: Σ_{i=K+1}^{N} λ_i
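A quick numerical check of this result on synthetic data (all sizes here are hypothetical; the average squared reconstruction error should match the sum of the discarded eigenvalues):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((500, 10)) @ rng.standard_normal((10, 10))  # M=500, N=10

mean = X.mean(axis=0)
Phi = X - mean
eigvals, eigvecs = np.linalg.eigh(Phi.T @ Phi / X.shape[0])
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]    # decreasing order

K = 4
U_K = eigvecs[:, :K]
X_hat = mean + (Phi @ U_K) @ U_K.T                    # reconstruction from K components
avg_sq_error = np.mean(np.sum((X - X_hat) ** 2, axis=1))
print(avg_sq_error, eigvals[K:].sum())                # the two numbers agree
```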


Normalization

• The principal components are dependent on the units used to measure the original variables as well as on the range of values they assume.

• Data should always be normalized prior to using PCA.

• A common normalization method is to transform all the data to have zero mean and unit standard deviation: x' = (x − μ) / σ, applied per feature.
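A minimal sketch of this normalization, applied feature-by-feature to an (M, N) data matrix:

```python
import numpy as np

def standardize(X):
    """Transform each feature (column) to zero mean and unit standard deviation."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    sigma[sigma == 0] = 1.0       # guard against constant features
    return (X - mu) / sigma
```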


Application to Faces

• Computation of the low-dimensional basis (i.e., eigenfaces):


Application to Faces

• Computation of the eigenfaces – cont.
− Compute the average face x̄ = (1/M) Σ_{i=1}^{M} x_i and subtract it from every face image: Φ_i = x_i − x̄


Application to Faces

• Computation of the eigenfaces – cont.
− Form the N x M matrix A = [Φ_1 Φ_2 ... Φ_M]; the covariance matrix is proportional to AA^T. The eigenfaces are the eigenvectors u_i satisfying AA^T u_i = λ_i u_i, but AA^T is N x N with N very large, so computing them directly is impractical.
− Instead, compute the eigenvectors v_i of the much smaller M x M matrix A^T A: A^T A v_i = μ_i v_i. Then u_i = A v_i and λ_i = μ_i.


Application to Faces

• Computation of the eigenfaces – cont.
− Each (mean-subtracted) face Φ_i can be represented as a linear combination of the K best eigenvectors: Φ_i ≈ Σ_{j=1}^{K} w_j u_j, where w_j = u_j^T Φ_i (i.e., the u_j obtained using A^T A).
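A sketch of the A^T A trick under the conventions above (one vectorized face per column of the input array; the function name compute_eigenfaces is my own):

```python
import numpy as np

def compute_eigenfaces(faces, K):
    """faces: (N, M) array with one vectorized face image per column.

    Works with the small M x M matrix A^T A instead of the huge N x N
    matrix A A^T.  Returns the mean face and the top-K unit-length
    eigenfaces u_i = A v_i as columns of an N x K matrix.
    """
    mean_face = faces.mean(axis=1, keepdims=True)
    A = faces - mean_face                    # N x M matrix of centered faces
    mu, V = np.linalg.eigh(A.T @ A)          # eigenvectors v_i of A^T A
    order = np.argsort(mu)[::-1][:K]         # keep the K largest eigenvalues
    U = A @ V[:, order]                      # u_i = A v_i (eigenvectors of A A^T)
    U /= np.linalg.norm(U, axis=0)           # normalize each u_i to unit length
    return mean_face.ravel(), U
```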


Example

Training images


Example (cont’d)

Top eigenvectors: u1,…uk

Mean: μ


Application to Faces

• Representing faces in this basis
− Face reconstruction: x̂ − x̄ = Σ_{j=1}^{K} w_j u_j (where ||u_j|| = 1)


Eigenfaces

• Case Study: Eigenfaces for Face Detection/Recognition

− M. Turk, A. Pentland, "Eigenfaces for Recognition", Journal of Cognitive Neuroscience, vol. 3, no. 1, pp. 71-86, 1991.

• Face Recognition

− The simplest approach is to think of it as a template matching problem.

− Problems arise when performing recognition in a high-dimensional space.

− Significant improvements can be achieved by first mapping the data into a lower-dimensional space.


Eigenfaces

• Face Recognition Using Eigenfaces
− Project the mean-subtracted input face Φ onto the eigenspace: Φ̂ = Σ_{i=1}^{K} w_i u_i, where w_i = u_i^T Φ (and ||u_i|| = 1).
− Represent the face by its coefficient vector Ω = [w_1, w_2, ..., w_K]^T and compare it with the coefficient vector Ω^l of each training face:

e_r = min_l ||Ω − Ω^l||, where ||Ω − Ω^l||^2 = Σ_{i=1}^{K} (w_i − w_i^l)^2

− The distance e_r is called the distance in face space (difs); the face is recognized as the closest training face if e_r falls below a threshold.
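A sketch of this recognition step, assuming the eigenfaces U (N x K, unit-length columns), the mean face, and the training coefficient vectors Omega_train (one Ω^l per row) have already been computed; the threshold T_r is application-dependent:

```python
import numpy as np

def recognize(face, mean_face, U, Omega_train, T_r):
    """Return the index of the closest training face, or -1 if difs > T_r."""
    Phi = face - mean_face
    Omega = U.T @ Phi                                   # w_i = u_i^T Phi
    dists = np.linalg.norm(Omega_train - Omega, axis=1)
    l = int(np.argmin(dists))                           # nearest training face
    e_r = dists[l]                                      # distance in face space (difs)
    return l if e_r < T_r else -1
```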


Face detection and recognition

Detection → Recognition → "Sally"


Eigenfaces

• Face Detection Using Eigenfaces
− Project the mean-subtracted input image Φ onto the eigenspace: Φ̂ = Σ_{i=1}^{K} w_i u_i, where w_i = u_i^T Φ (and ||u_i|| = 1), and compute the reconstruction error e_d = ||Φ − Φ̂||.
− The distance e_d is called the distance from face space (dffs); the input is classified as a face if e_d falls below a threshold.
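A sketch of the dffs computation under the same assumptions (U holds K unit-length eigenfaces as columns):

```python
import numpy as np

def distance_from_face_space(image, mean_face, U):
    """e_d = ||Phi - Phi_hat||: reconstruction error measured in image space."""
    Phi = image - mean_face
    Phi_hat = U @ (U.T @ Phi)     # projection of Phi onto the face space
    return np.linalg.norm(Phi - Phi_hat)

# The image is declared a face if its dffs falls below a chosen threshold T_d.
```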


Eigenfaces

Reconstructed face looks like a face.

Reconstructed non-face looks like a face again!

Input Reconstructed


Eigenfaces

• Face Detection Using Eigenfaces – cont.
Case 1: in face space AND close to a given face
Case 2: in face space but NOT close to any given face
Case 3: not in face space AND close to a given face
Case 4: not in face space and NOT close to any given face


Reconstruction using partial information

• Robust to partial face occlusion.

Input Reconstructed


Eigenfaces

• Face detection, tracking, and recognition

Visualize dffs:


Limitations

• Background changes cause problems
− De-emphasize the outside of the face (e.g., by multiplying the input image by a 2D Gaussian window centered on the face).

• Light changes degrade performance
− Light normalization might help.

• Performance decreases quickly with changes to face size
− Scale the input image to multiple sizes.
− Multi-scale eigenspaces.

• Performance decreases with changes to face orientation (but not as fast as with scale changes)
− In-plane rotations are easier to handle.
− Out-of-plane rotations are more difficult to handle.
− Multi-orientation eigenspaces.


Limitations (cont’d)

• Not robust to misalignment.


Limitations (cont’d)

• PCA is not always an optimal dimensionality-reduction technique for classification purposes.


Linear Discriminant Analysis (LDA)

• What is the goal of LDA?

− Perform dimensionality reduction “while preserving as much of the class discriminatory information as possible”.

− Seeks directions along which the classes are best separated, taking into account both the within-class scatter and the between-class scatter.


Case of C classes

• Let μ_i be the mean of class i (which contains M_i samples) and μ the overall mean of the data.

• Within-class scatter matrix: S_w = Σ_{i=1}^{C} Σ_{j=1}^{M_i} (x_j − μ_i)(x_j − μ_i)^T

• Between-class scatter matrix: S_b = Σ_{i=1}^{C} M_i (μ_i − μ)(μ_i − μ)^T
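A sketch of these two scatter matrices, assuming X holds one sample per row and labels gives each sample's class:

```python
import numpy as np

def scatter_matrices(X, labels):
    """Within-class (S_w) and between-class (S_b) scatter for C classes."""
    labels = np.asarray(labels)
    N = X.shape[1]
    mu = X.mean(axis=0)                        # overall mean
    S_w = np.zeros((N, N))
    S_b = np.zeros((N, N))
    for c in np.unique(labels):
        X_c = X[labels == c]                   # the M_i samples of class i
        mu_c = X_c.mean(axis=0)
        D = X_c - mu_c
        S_w += D.T @ D                         # sum_j (x_j - mu_i)(x_j - mu_i)^T
        d = (mu_c - mu).reshape(-1, 1)
        S_b += X_c.shape[0] * (d @ d.T)        # M_i (mu_i - mu)(mu_i - mu)^T
    return S_w, S_b
```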


Case of C classes (cont’d)

• Suppose the desired projection transformation is given by: y = U^T x

• Suppose the scatter matrices of the projected data y are U^T S_b U and U^T S_w U.

• LDA seeks the projection that maximizes the between-class scatter while minimizing the within-class scatter:

max_U |U^T S_b U| / |U^T S_w U|


Case of C classes (cont’d)

• This is equivalent to the following generalized eigenvector problem: S_b u_k = λ_k S_w u_k

• The columns of the matrix U in the linear transformation y = U^T x are the eigenvectors (called Fisherfaces) corresponding to the largest eigenvalues.

• Important: S_b has at most rank C−1; therefore, the maximum number of eigenvectors with non-zero eigenvalues is C−1 (i.e., the maximum dimensionality of the sub-space is C−1).
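A sketch of solving this generalized eigenvector problem with SciPy, assuming S_w is non-singular (and hence positive definite); the singular case is discussed on the next slide:

```python
import numpy as np
from scipy.linalg import eigh

def lda_directions(S_b, S_w, num_classes):
    """Solve S_b u = lambda S_w u and keep at most C-1 directions."""
    eigvals, eigvecs = eigh(S_b, S_w)        # generalized symmetric eigenproblem
    order = np.argsort(eigvals)[::-1]        # largest eigenvalues first
    K = num_classes - 1                      # S_b has rank at most C-1
    return eigvecs[:, order[:K]]             # columns of U (the Fisherfaces)
```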


Case of C classes (cont’d)

• If S_w is non-singular, we can convert the generalized eigenvalue problem S_b u_k = λ_k S_w u_k into a conventional eigenvalue problem: S_w^{-1} S_b u_k = λ_k u_k

• In practice, S_w is singular, since the data are image vectors of very high dimensionality while the size of the data set is much smaller (M << N).


Does S_w^{-1} always exist? (cont’d)

• To alleviate this problem, PCA can be used first:

1) PCA is first applied to the data set to reduce its dimensionality.

2) LDA is then applied in the PCA-reduced space to find the most discriminative directions (see the sketch below).
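A sketch of this two-stage pipeline (all names are mine; the intermediate PCA dimensionality is commonly chosen no larger than M − C so that S_w becomes non-singular in the reduced space):

```python
import numpy as np
from scipy.linalg import eigh

def pca_then_lda(X, labels, pca_dim):
    """PCA first, then LDA in the reduced space.  X: (M, N); pca_dim <= M - C."""
    labels = np.asarray(labels)

    # --- PCA stage ---
    mean = X.mean(axis=0)
    Phi = X - mean
    _, _, Vt = np.linalg.svd(Phi, full_matrices=False)   # rows of Vt = PCA directions
    W_pca = Vt[:pca_dim].T                               # N x pca_dim
    Z = Phi @ W_pca                                      # PCA-reduced data

    # --- LDA stage: scatter matrices in the reduced space ---
    C = len(np.unique(labels))
    mu = Z.mean(axis=0)
    S_w = np.zeros((pca_dim, pca_dim))
    S_b = np.zeros((pca_dim, pca_dim))
    for c in np.unique(labels):
        Z_c = Z[labels == c]
        mu_c = Z_c.mean(axis=0)
        S_w += (Z_c - mu_c).T @ (Z_c - mu_c)
        d = (mu_c - mu).reshape(-1, 1)
        S_b += Z_c.shape[0] * (d @ d.T)
    eigvals, eigvecs = eigh(S_b, S_w)
    W_lda = eigvecs[:, np.argsort(eigvals)[::-1][:C - 1]]
    return mean, W_pca @ W_lda       # project a new x with: y = (x - mean) @ W
```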


Case Study I

− D. Swets, J. Weng, "Using Discriminant Eigenfeatures for Image Retrieval", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 18, no. 8, pp. 831-836, 1996.

• Content-based image retrieval:

− Application: query-by-example content-based image retrieval

− Question: how to select a good set of image features for content-based image retrieval?


Case Study I (cont’d)

• Assumptions

− "Well-framed" images are required as input for training and query-by-example test probes.

− Only a small variation in the size, position, and orientation of the objects in the images is allowed.


Case Study I (cont’d)

• Terminology
− Most Expressive Features (MEF): features obtained using PCA.
− Most Discriminating Features (MDF): features obtained using LDA.

• Numerical instabilities
− When computing the eigenvalues/eigenvectors of S_w^{-1} S_b u_k = λ_k u_k numerically, the computations can be unstable because S_w^{-1} S_b is not always symmetric; see the paper for details.


Case Study I (cont’d)

• Comparing MEF with MDF:
− MEF vectors show the tendency of PCA to capture major variations in the training set, such as lighting direction.
− MDF vectors discount those factors unrelated to classification.


Case Study I (cont’d)

• Clustering effect


Case Study I (cont’d)

• Methodology
1) Represent each training image in terms of MEFs/MDFs.
2) Represent a query image in terms of MEFs/MDFs.
3) Find the k closest neighbors for retrieval (e.g., using Euclidean distance), as sketched below.
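A sketch of steps 1)–3), assuming the MEF or MDF features of the training images and of the query have already been computed (train_features holds one feature vector per row):

```python
import numpy as np

def retrieve(query_feature, train_features, k=5):
    """Indices of the k nearest training images under Euclidean distance."""
    dists = np.linalg.norm(train_features - query_feature, axis=1)
    return np.argsort(dists)[:k]
```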


Case Study I (cont’d)

• Experiments and results

• Face images
− A set of face images was used with 2 expressions and 3 lighting conditions.

− Testing was performed using a disjoint set of images.


Case Study I (cont’d)


Case Study I (cont’d)

− Examples of correct search probes


Case Study I (cont’d)

− Example of a failed search probe


Case Study II

− A. Martinez, A. Kak, "PCA versus LDA", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 23, no. 2, pp. 228-233, 2001.

• Is LDA always better than PCA?

− There has been a tendency in the computer vision community to prefer LDA over PCA.

− This is mainly because LDA deals directly with discrimination between classes while PCA does not pay attention to the underlying class structure.


Case Study II (cont’d)

AR database


Case Study II (cont’d)

LDA is not always better when the training set is small


Case Study II (cont’d)

LDA outperforms PCA when the training set is large
