
Multivariate Statistics

Principal Component Analysis (PCA)

10.12.2014


Goals of Today’s Lecture

Get familiar with the multivariate counterparts of the expectation and the variance.

See how principal component analysis (PCA) can be used as a dimension reduction technique.


Introduction

In your introductory course you started with univariate statistics. You had a look at one random variable X at a time, e.g., X = “measurement of temperature”.

A random variable can be characterized by its expectation µ and the variance σ² (or standard deviation σ).

µ = E[X],    σ² = Var(X) = E[(X − µ)²].

A model that is often used is the normal distribution: X ∼ N(µ, σ²).

The normal distribution is fully characterized by the expectation and the variance.


The unknown parameters µ and σ can be estimated from data.

Say we observe n (independent) realizations of our random variable X: x_1, …, x_n.

You can think of measuring a certain quantity, e.g. temperature, n times.

Usual parameter estimates are

µ̂ = x̄ = (1/n) ∑_{i=1}^n x_i,    σ̂² = V̂ar(X) = 1/(n − 1) ∑_{i=1}^n (x_i − x̄)².
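To make this concrete, here is a minimal R sketch (an assumed example, not part of the original slides) that computes these estimates for simulated temperature data:

## Assumed example: estimating mu and sigma^2 from n simulated measurements.
set.seed(1)
x <- rnorm(50, mean = 20, sd = 1.5)                    # 50 hypothetical temperature readings

mu.hat     <- mean(x)                                  # estimate of mu
sigma2.hat <- var(x)                                   # uses the 1/(n - 1) denominator
sigma2.man <- sum((x - mean(x))^2) / (length(x) - 1)   # the formula written out
c(mu.hat, sigma2.hat, sigma2.man)                      # the last two agree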


Multivariate Data

Now we are going to have a look at the situation where we measure multiple things simultaneously.

Hence, we have a multivariate random variable (vector) X having m components: X ∈ R^m.

You can think of measuring temperature at two different locations or measuring temperature and pressure at one location (m = 2).

In that case

X = (X^(1), X^(2))^T,

where X^(1) is temperature and X^(2) is pressure.


A possible data-set now consists of n vectors of dimension 2 (or m in the general case):

x_1, …, x_n,

where x_i ∈ R² (or R^m).

Remark

In multiple linear regression we already had multiple variables per observation.

There, we had one response variable and many predictor variables.

Here, the situation is more general in the sense that we don’t have a response variable but we want to model “relationships” between (any) variables.


Expectation and Covariance Matrix

We need new concepts to model / describe this kind of data.

We are therefore looking for the multivariate counterparts of the expectation and the variance.

The (multivariate) expectation of X is defined as

E[X] = µ = (µ_1, …, µ_m)^T = (E[X^(1)], …, E[X^(m)])^T.

It is nothing more than the collection of the univariate expectations.


What about dependency?

[Figure: four scatterplots of Temperature B against Temperature A, illustrating different strengths and directions of dependency.]

We need a measure to characterize the dependency between the different components.

The simplest thing one can think of is linear dependency between two components.

The corresponding measure is the correlation ρ.

ρ is dimensionless and it always holds that

−1 ≤ ρ ≤ 1

|ρ| measures the strength of the linear relationship.

The sign of ρ indicates the direction of the linear relationship.


The formal definition of the correlation is based on the covariance.

The covariance is an unstandardized version of the correlation. It is defined as

Cov(X^(j), X^(k)) = E[(X^(j) − µ_j)(X^(k) − µ_k)].

The correlation between X^(j) and X^(k) is then

ρ_jk = Corr(X^(j), X^(k)) = Cov(X^(j), X^(k)) / √(Var(X^(j)) Var(X^(k))).

You have seen the empirical version in the introductory course.


The covariance matrix |Σ is an m × m matrix with elements

|Σ_jk = Cov(X^(j), X^(k)) = E[(X^(j) − µ_j)(X^(k) − µ_k)].

We also write Var(X) or Cov(X) instead of |Σ.

The special symbol |Σ is used in order to avoid confusion with the sum sign ∑.


The covariance matrix contains a lot of information, e.g.

|Σ_jj = Var(X^(j)).

This means that the diagonal consists of the individual variances.

We can also compute the correlations via

Corr(X^(j), X^(k)) = Cov(X^(j), X^(k)) / √(Var(X^(j)) Var(X^(k))) = |Σ_jk / √(|Σ_jj |Σ_kk).


Again, from a real data-set we can estimate these quantities with

µ̂ = (x̄^(1), x̄^(2), …, x̄^(m))^T,

|Σ̂_jk = 1/(n − 1) ∑_{i=1}^n (x_i^(j) − µ̂_j)(x_i^(k) − µ̂_k),

or, more directly, the whole matrix

|Σ̂ = 1/(n − 1) ∑_{i=1}^n (x_i − µ̂)(x_i − µ̂)^T.

Remember: (Empirical) correlation only measures the strength of the linear relationship between two variables.
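As an illustration, a small R sketch (an assumed example, not part of the original slides) that computes µ̂, |Σ̂ and the correlations for a simulated bivariate sample:

## Assumed example: empirical mean vector, covariance and correlation matrix.
set.seed(2)
n <- 30
tempA <- rnorm(n, mean = 20, sd = 1)
tempB <- 0.8 * tempA + rnorm(n, mean = 4, sd = 0.5)   # correlated with tempA
X <- cbind(tempA, tempB)                              # n x m data matrix

mu.hat    <- colMeans(X)   # component-wise means
Sigma.hat <- cov(X)        # empirical covariance matrix (1/(n - 1) version)
R.hat     <- cor(X)        # empirical correlation matrix

## the correlation recovered from the covariance matrix:
Sigma.hat[1, 2] / sqrt(Sigma.hat[1, 1] * Sigma.hat[2, 2])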


Illustration: Empirical Correlation

Source: Wikipedia


Linear Transformations

The following table illustrates how the expectation and the variance (covariance matrix) change when linear transformations are applied to the univariate random variable X or the multivariate random vector X.

                 Univariate               Multivariate
Transformation   Y = a + bX               Y = a + BX
Expectation      E[Y] = a + b E[X]        E[Y] = a + B E[X]
Variance         Var(Y) = b² Var(X)       Var(Y) = B |Σ_X B^T

where a, b ∈ R, a ∈ R^m, and B ∈ R^{m×m}.
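The multivariate rule can be checked numerically; here is a short R sketch (an assumed example, not part of the original slides):

## Assumed example: checking Var(a + B X) = B |Sigma_X B^T on simulated data.
set.seed(3)
x1 <- rnorm(5000)
x2 <- 0.5 * x1 + rnorm(5000)
X  <- cbind(x1, x2)

a <- c(5, -2)
B <- matrix(c(2, 1, -1, 3), nrow = 2)   # arbitrary 2 x 2 transformation matrix

Y <- t(a + B %*% t(X))                  # rows y_i = a + B x_i

cov(Y)                                  # approximately equal to ...
B %*% cov(X) %*% t(B)                   # ... B |Sigma_X B^T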


Principal Component Analysis (PCA)

Goal: Dimensionality reduction.

We have m different dimensions (variables) but we would like to find “a few specific dimensions (projections) of the data that contain most variation”.

If two specific dimensions of the data-set contain most variation, visualizations will be easy (plot these two!).

Such a plot can then be used to check for any “structure”.


Illustration of Artificial 3-Dim Data-Set

[Figure: three-dimensional scatterplot of the artificial data-set with coordinates x^(1), x^(2), x^(3).]

We have to be more precise about what we mean by “variation”.

We define the total variation in the data as the sum of all individual empirical variances

∑_{j=1}^m V̂ar(X^(j)) = ∑_{j=1}^m σ̂²(X^(j)).

How can we now find projections that contain most variation?

Conceptually, we are looking for a new coordinate system with basis vectors b_1, …, b_m ∈ R^m.


Of course, our data-points x_i ∈ R^m will then have new coordinates

z_i^(k) = x_i^T b_k,    k = 1, …, m

(= projection onto the new basis vectors).

How should we choose the new basis?

The first basis vector b_1 should be chosen such that V̂ar(Z^(1)) is maximal (i.e. it captures the most variation).

The second basis vector b_2 should be orthogonal to the first one (b_2^T b_1 = 0) such that V̂ar(Z^(2)) is maximized.

And so on...


The new basis vectors are the so-called principal components.

The individual components of these basis vectors are called loadings. The loadings tell us how to interpret the new coordinate system (i.e., how the old variables are weighted to get the new ones).

The coordinates with respect to the new basis vectors (the transformed variable values) are the so-called scores.


PCA: Illustration in Two Dimensions

[Figure: scatterplot of a two-dimensional data-set (one variable is log(width)) together with the 1st and 2nd principal component as the new coordinate axes.]

We could find the first basis vector b_1 by solving the following maximization problem

max_{b: ‖b‖ = 1} V̂ar(Xb),

where X is the matrix that has different observations in different rows and different variables in different columns (like the design matrix in regression).

It can be shown that b_1 is the (standardized) eigenvector of |Σ̂_X that corresponds to the largest eigenvalue.

Similarly for the other vectors b_2, …, b_m.
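A small R sketch (an assumed example, not part of the original slides) showing that the b_k and λ_k can be obtained from eigen() applied to the empirical covariance matrix, and that prcomp() gives the same result:

## Assumed example: principal components as eigenvectors of cov(X).
set.seed(4)
X <- matrix(rnorm(100 * 3), ncol = 3)
X[, 3] <- X[, 1] + 0.3 * X[, 2] + rnorm(100, sd = 0.1)   # induce correlation

e <- eigen(cov(X))
e$vectors          # columns b_1, ..., b_m (defined up to sign)
e$values           # lambda_1 >= ... >= lambda_m

pca <- prcomp(X)   # centers the data internally
pca$rotation       # the same eigenvectors (possibly with flipped signs)
pca$sdev^2         # the same eigenvalues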


To summarize: We are performing a transformation to new variables

z_i = B^T (x_i − µ̂),

where the transformation matrix B is orthogonal and contains the b_k’s as columns.

In general we also subtract the mean vector to ensure that all components have mean 0.

B is the matrix of (standardized) eigenvectors corresponding to the eigenvalues λ_k of |Σ̂_X (in decreasing order).


Hence, we have

V̂ar(Z) = |Σ̂_Z = B^T |Σ̂_X B = diag(λ_1, λ_2, …, λ_m),    λ_1 ≥ λ_2 ≥ … ≥ λ_m ≥ 0.

Hence, the variance of the different components of Z is given by the corresponding eigenvalue (values on the diagonal).

Moreover, the different components are uncorrelated (because the off-diagonal elements of the covariance matrix of Z are all zero).


Scaling Issues

The variance is not invariant under rescaling.

If we change the units of a variable, that will change the variance (e.g. when measuring a length in [m] instead of [mm]).

Therefore, if variables are measured on very different scales, they should first be standardized to comparable units.

This can be done by standardizing each variable to variance 1.

Otherwise, PCA can be misleading.
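In R this standardization is available directly; a minimal sketch (an assumed example, not part of the original slides):

## Assumed example: variables on very different scales.
set.seed(5)
X <- cbind(temp  = rnorm(50, mean = 20,  sd = 2),     # spread of order 2
           press = rnorm(50, mean = 1e5, sd = 500))   # spread of order 500

prcomp(X)$rotation                  # first PC dominated by the large-variance variable
prcomp(X, scale. = TRUE)$rotation   # each variable first standardized to variance 1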


PCA and Dimensionality Reduction

At the beginning we were talking about dimensionality reduction.

We can achieve this by simply looking at the first p < m principal components (and ignoring the remaining components).

The proportion of the variance that is explained by the first p principal components is

∑_{j=1}^p λ_j / ∑_{j=1}^m λ_j.
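The explained proportions are easy to compute from the eigenvalues; a short R sketch (an assumed example, not part of the original slides):

## Assumed example: proportion of total variation explained by the first p PCs.
set.seed(6)
X <- matrix(rnorm(50 * 4), ncol = 4)
X[, 2] <- X[, 1] + rnorm(50, sd = 0.2)   # make one variable nearly redundant

lambda <- prcomp(X)$sdev^2               # eigenvalues lambda_1 >= ... >= lambda_m
cumsum(lambda) / sum(lambda)             # explained proportion for p = 1, ..., m
summary(prcomp(X))                       # reports the same cumulative proportions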


Sometimes we see a sharp drop when plotting the eigenvalues (in decreasing order).

[Figure: barplot of the variances of the principal components for the NIR-spectra (without 5 outliers).]

This plot is also known as the scree-plot.


PCA and Dimensionality Reduction

For visualization of our data, we can for example use a scatterplot of the first two principal components.

It should show “most of the variation”.


Illustration

[Figure: the three-dimensional data-set from before, with coordinates x^(1), x^(2), x^(3).]

[Figure: pairwise scatterplots of the scores of the first three principal components PC1, PC2 and PC3.]

Loadings

        PC1            PC2           PC3
x1   -0.395118451    0.06887762    0.91604437
x2   -0.009993467    0.99680385   -0.07926044
x3   -0.918575822   -0.04047172   -0.39316727

Screeplot

[Figure: barplot of the variances of the three principal components.]


Applying PCA to NIR-Spectra

An NIR-spectrum can be thought of as a multivariate observation (the different variables are the measurements at different wavelengths).

A spectrum has the property that the different variables are “ordered” and we can plot one observation as a “function” (see plot on next slide).

If we apply PCA to this kind of data, the individual components of the b_k’s (the so-called loadings) can again be plotted as spectra.

As an example we have a look at spectra measured at different time-points of a chemical reaction.


Illustration: Spectra (centered at each wavelength)

[Figure: centered spectra plotted against wavelength (1200–2400).]

One observation is one spectrum, i.e. a whole “function”.

[Figure: loadings of the 1st, 2nd and 3rd principal component plotted against wavelength (1200–2400).]

Scatterplot of the First Two Principal Components

[Figure: scatterplot of the scores of the first two principal components (1. PCA vs. 2. PCA).]

Scree-Plot

[Figure: scree-plot of the variances of the principal components.]


Alternative Interpretations

If we restrict ourselves to the first p < m principal components we have

x_i − µ̂ = x̂_i + e_i

where

x̂_i = ∑_{k=1}^p z_i^(k) b_k,    e_i = ∑_{k=p+1}^m z_i^(k) b_k.
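A short R sketch (an assumed example, not part of the original slides) of this decomposition into a rank-p approximation plus residual:

## Assumed example: rank-p reconstruction x.hat_i and residual e_i for p = 2.
set.seed(7)
X <- matrix(rnorm(50 * 3), ncol = 3)
X[, 3] <- X[, 1] - X[, 2] + rnorm(50, sd = 0.1)

pca  <- prcomp(X)
p    <- 2
Xc   <- scale(X, center = TRUE, scale = FALSE)    # rows: x_i - mu.hat
Xhat <- pca$x[, 1:p] %*% t(pca$rotation[, 1:p])   # rows: sum over k <= p of z_i^(k) b_k
E    <- Xc - Xhat                                 # rows: residuals e_i

sum(apply(E, 2, var))                             # total residual variance ...
sum(pca$sdev^2[(p + 1):3])                        # ... equals the sum of the dropped eigenvalues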


Linear Algebra

It can be shown that the data matrix consisting of the x̂_i is the best approximation of our original (centered) data matrix if we restrict ourselves to matrices of rank p (with respect to the Frobenius norm), i.e. it has the smallest

∑_{i,j} (e_i^(j))².

Statistics

It’s the best approximation in the sense that it has the smallest sum of variances

∑_{j=1}^m V̂ar(E^(j)).


PCA via Singular Value Decomposition (SVD)

We can also get the principal components from the singular value decomposition (SVD) of the data matrix X.

For this to work, we require X to have centered columns!

Why does this work? The SVD of X yields the decomposition

X = U D V^T,

where

U is n × n and orthogonal,
D is n × m and generalized diagonal (containing the so-called singular values in descending order),
V is m × m and orthogonal.


Properties of SVD

The (standardized) eigenvectors of X^T X make up the columns of V.

The singular values are the square roots of the eigenvalues of X^T X.

But what is X^T X? If the columns of X are centered, this is the rescaled (empirical) covariance matrix |Σ̂_X, because

(X^T X)_jk = ∑_{i=1}^n (x_i x_i^T)_jk = ∑_{i=1}^n x_i^(j) x_i^(k) = (n − 1) |Σ̂_jk.

Hence, the singular value decomposition of the centered data-matrix automatically gives us the principal components (in V).

The data-matrix in the new coordinates (the scores) is given by UD.
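A small R sketch (an assumed example, not part of the original slides) showing the correspondence between svd() of the centered data matrix and prcomp():

## Assumed example: PCA via the SVD of the centered data matrix.
set.seed(8)
X  <- matrix(rnorm(40 * 3), ncol = 3)
Xc <- scale(X, center = TRUE, scale = FALSE)   # columns must be centered

s <- svd(Xc)                  # Xc = U D V^T
s$v                           # principal components, as in prcomp(X)$rotation (up to sign)
s$d^2 / (nrow(X) - 1)         # eigenvalues lambda_k of the covariance matrix
scores <- s$u %*% diag(s$d)   # data in the new coordinates, matches prcomp(X)$x up to sign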


Summary

PCA is a useful tool for dimension reduction.

The new basis system is given by the (standardized) eigenvectors of the covariance matrix.

The eigenvalues are the variances of the new coordinates.

In the case of spectra, the loadings can again be plotted as spectra.
