
CHAPTER 9

PRINCIPAL COMPONENT ANALYSIS

In many physical, statistical, and biological investigations it is desirable to represent a system of points in plane, three, or higher dimensioned space by the "best-fitting" straight line or plane. K. Pearson, "On Lines and Planes of Closest Fit to Systems of Points in Space," 1901.

If, as we assume, the population is normally distributed, the loci of uniform density are concentric similar, and similarly placed ellipsoids. The method of principal components, we shall see, is equivalent to choosing a set of coordinate axes coinciding with the principal axes of these ellipsoids. H. Hotelling, "Analysis of a Complex of Statistical Variables with Principal Components," 1933.

Principal component analysis (PCA), also called Empirical Orthogonal Function (EOF) analysis, is a technique for decomposing multivariate time series into a set of uncorrelated, orthogonal components ordered according to their variance, from largest to smallest. PCA is optimal in the sense that the first T components come closer to the multivariate time series (in a mean square sense) than any other T components. Equivalently, the first T components capture the maximum possible variance of any T components: no other set of T components can capture as much variance as the first T principal components. For this reason, PCA is a popular technique for "data compression."

An important use of PCA is to characterize the space-time variability of a system. In this approach, we try to get a feeling for what is going on by looking at the "biggest" and "smallest" components. Ranking is a basic tool in virtually all fields of data analysis.


Another use of PCA is to define a small number of components within a large-dimensional system for performing linear regression.

9.1 DESCRIPTION OF THE PROBLEM

To describe PCA, suppose all the data are collected into the N × M matrix X. Here, N is the number of samples (usually time steps), and M is the number of parameters needed to define the spatial structure (usually values at individual grid points). Thus, typically, each row of X is a "snapshot" of the data at an instant in time.

In typical applications, we are interested in variability about the climatological mean. The difference between the original data and the climatological mean field is called the anomaly. Anomaly data often are indicated by a prime. Thus

$$X'_{nm} = X_{nm} - X^{c}_{nm}, \qquad (9.1)$$

where X^c_{nm} is the climatological mean value of the m-th component at the n-th sample.

For instance, if the data are monthly means, then each row of X^c might be the calendar-month mean of the corresponding row of X. In long-term climate studies, the climatology is sometimes defined to be the average during some reference "base period."

A basic difficulty is that the matrix X often is too big to understand easily. For instance, a 2.5° × 2.5° lat-long grid contains over 10,000 grid points, and when this is multiplied by the number of time steps the total number of data values can easily exceed one million. If, however, the variability is characterized by a small number of spatial structures, then it would be more efficient to describe the variability of those spatial structures directly. As an extreme example, consider a propagating sine wave: at each instant the sine wave is defined by an infinite number of data values (e.g., the amplitude at each location), but it can also be described completely by three parameters, namely the amplitude, wavelength, and phase of a sinusoid. In this extreme case, an infinite number of parameters is replaced by three parameters without any loss of information. In practice, the best spatial structures to consider are unknown. The basic idea of PCA is to use the data to choose the components.
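
To make the setup concrete, here is a minimal sketch (hypothetical array names and sizes; monthly data assumed, stored in a NumPy array with time along the rows) of forming the anomaly matrix X' by removing a calendar-month climatology:

```python
import numpy as np

# Hypothetical example: X has shape (N, M), with N monthly time steps (rows)
# and M grid points (columns).
rng = np.random.default_rng(0)
N, M = 240, 500                      # e.g., 20 years of monthly data, 500 grid points
X = rng.standard_normal((N, M))      # stand-in for the observed fields

# Calendar-month climatology: average all Januaries, all Februaries, and so on.
months = np.arange(N) % 12
Xc = np.empty_like(X)
for m in range(12):
    Xc[months == m] = X[months == m].mean(axis=0)

Xp = X - Xc                          # the anomaly matrix X' of (9.1)
```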

In general, any propagating or standing pattern can be described by a sum of fixed patterns with time-varying coefficients. For instance, a propagating sine wave can be written as the sum of two fixed patterns with time-varying coefficients:

$$\sin(kx - \omega t) = \cos(\omega t)\,\sin(kx) - \sin(\omega t)\,\cos(kx). \qquad (9.2)$$

More generally, we consider representing an arbitrary data set in the form

$$X(x, y, z, t) = u_1(t)\,v_1(x, y, z) + u_2(t)\,v_2(x, y, z) + \cdots + u_M(t)\,v_M(x, y, z). \qquad (9.3)$$

If space and time are discretized, and only a single component is considered, then the data are approximated in the form

$$X'_{nm} \approx s\,[u]_n\,[v]_m, \qquad (9.4)$$

where the N-dimensional vector u defines the time variability, the M-dimensional vector v defines the spatial structure, and s defines the amplitude of the component; the notation [u]_n and [v]_m refers to the n-th and m-th elements of the vectors u and v, reserving


the subscript without brackets for indexing components. The approximation (9.4) can be written in matrix form as

$$X' \approx s\,u v^T, \qquad (9.5)$$

where superscript T denotes the transpose operation.

The question arises as to what is the "best" way of choosing s, u, v. PCA defines "best" as that which minimizes the sum square difference between X' and the component. Thus, PCA determines s, u, v by minimizing

$$\|X' - s\,u v^T\|_F^2 = \sum_m \sum_n \left( X'_{nm} - s\,[u]_n\,[v]_m \right)^2. \qquad (9.6)$$

This measure is called the Frobenius norm (hence the subscript "F"). This measure is never negative, and equals zero if and only if suv^T equals X'. Hence, the Frobenius norm can be interpreted as the "distance" between the data and the component defined by the triplet s, u, v.

Technically, the above minimization problem has no unique solution because the approximation depends only on the product of s, u, v: the same product can be produced by very different values of s, u, v, although the product that solves the minimization problem is generally unique. We can produce an (almost) unique solution triplet s, u, v by imposing constraints. Convenient constraints are that u and v have unit "length;" i.e.,

$$\sum_n [u]_n^2 = 1, \qquad \sum_m [v]_m^2 = 1, \qquad (9.7)$$

and that s > 0. These constraints fix the "length" of u and v, but not their signs: u can be multiplied by negative one provided v also is multiplied by negative one, since the two sign changes cancel in the product. In general, there is no standard convention for the sign, and so we allow it to be arbitrary. In summary, we seek the triplet s, u, v that solves the following optimization problem:

$$\text{minimize } \|X' - s\,u v^T\|_F^2 \quad \text{subject to} \quad u^T u = 1, \ v^T v = 1, \ \text{and } s > 0. \qquad (9.8)$$

The solution we seek is unique up to a sign.

9.2 SOLUTION BASED ON SINGULAR VALUE DECOMPOSITION

The above optimization problem is solved most directly (and often most efficiently) by the singular value decomposition (SVD). The SVD follows from a mathematical theorem that can be stated very simply: every matrix X' can be written in the form

$$\underset{[N\times M]}{X'} \;=\; \underset{[N\times N]}{U}\ \ \underset{[N\times M]}{S}\ \ \underset{[M\times M]}{V^T}, \qquad (9.9)$$

where U and V are unitary matrices and S is a diagonal matrix (not necessarily square) with non-negative diagonal elements. The columns of U are called the left singular vectors, the columns of V are called the right singular vectors, and the diagonal elements of S are called the singular values. Since U and V are unitary,

$$U^T U = U U^T = I \quad \text{and} \quad V^T V = V V^T = I. \qquad (9.10)$$


The usual convention is to order the positive singular values from largest to smallest, s_1 ≥ s_2 ≥ ⋯ ≥ s_R > 0, where the number R of nonzero singular values is the rank of X', which cannot exceed the minimum of M and N. (In fact, the rank of X' often is less than the minimum of M and N due to subtraction of the climatology.) Thus, the first left singular vector is the left singular vector with the largest singular value, and so on.

The SVD (9.9) can be expressed equivalently and more efficiently as

$$X' = s_1 u_1 v_1^T + s_2 u_2 v_2^T + \cdots + s_R u_R v_R^T, \qquad (9.11)$$

where u_j and v_j are the j-th columns of U and V, respectively. Writing (9.11) in matrix form gives the economy SVD

$$\underset{[N\times M]}{X'} \;=\; \underset{[N\times R]}{\dot{U}}\ \ \underset{[R\times R]}{\dot{S}}\ \ \underset{[R\times M]}{\dot{V}^T}, \qquad (9.12)$$

where $\dot{U} = [u_1, \ldots, u_R]$, $\dot{V} = [v_1, \ldots, v_R]$, and $\dot{S} = \mathrm{diag}(s_1, \ldots, s_R)$. This form is called the "economy" SVD because it avoids storage of singular vectors associated with the null spaces of X'. The dots indicate that the corresponding matrices in the SVD have been truncated. Further details of the SVD can be found in sec. ??.

The solution to our optimization problem is now simple: the triplet s, u, v that solves (9.8) is given by the leading singular value and singular vectors of X', namely s_1, u_1, v_1. However, the SVD yields more than just the single triplet that best approximates the data. It can be shown that the truncated sum

$$\hat{X}' = s_1 u_1 v_1^T + s_2 u_2 v_2^T + \cdots + s_T u_T v_T^T, \qquad (9.13)$$

where the sum stops at T terms, minimizes $\|X' - \hat{X}'\|_F$ out of all matrices $\hat{X}'$ of rank T. For instance, this means that the first two singular values and vectors can be used to construct the best rank-2 approximation of X' with respect to the Frobenius norm.
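
A minimal sketch of this SVD-based solution (hypothetical anomaly matrix and sizes; NumPy assumed), including the best rank-T approximation (9.13) and its residual (9.20):

```python
import numpy as np

rng = np.random.default_rng(1)
N, M = 240, 500
Xp = rng.standard_normal((N, M))          # stand-in for the anomaly matrix X'

# Economy SVD: U is N x R, s holds the singular values, Vt is R x M.
U, s, Vt = np.linalg.svd(Xp, full_matrices=False)

# Leading triplet solves the rank-1 problem (9.8).
s1, u1, v1 = s[0], U[:, 0], Vt[0, :]

# Best rank-T approximation (9.13) and its squared Frobenius error (9.20).
T = 5
Xhat = U[:, :T] @ np.diag(s[:T]) @ Vt[:T, :]
err = np.linalg.norm(Xp - Xhat, "fro") ** 2
assert np.isclose(err, np.sum(s[T:] ** 2))
```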

The above decomposition is important enough to be identified with specific names. If the data matrix X' is organized by time × space, then:

Definition 9.1 (Empirical Orthogonal Function (EOF)). A right singular vector v_i of the data matrix X'. The EOF defines the spatial structure of the component.

Definition 9.2 (Principal Components (PCs)). The left singular vector times the singular value, s_i u_i, of the data matrix X'. The principal component defines the time series of the component.

Definition 9.3 (Principal Component Analysis (PCA)). The procedure for finding the EOFs and PCs of a data set.

Some basic properties of the EOFs and PCs are the following (a numerical check of several of them is sketched after this list).

1. The PCs can be obtained from linear combinations of the data X' as

$$U S = X' V, \qquad (9.14)$$

a fact that follows from V being unitary. This procedure for finding the time series that multiply the EOFs can be applied not only to the data used in the PCA but to other independent data as well. For instance, suppose we have a new data set Y' with dimension N_y × M; then the projection of Y' onto the EOFs of X' is Y'V.


2. The time mean of the PCs is zero, since the time mean of the columns of X' vanishes, and the PCs are linear combinations of the columns of X'. (This statement assumes the climatology has been defined as the column mean of X'.)

3. Because the PCs have zero mean, the variance of the i-th PC is

$$\frac{1}{N}\, s_i u_i^T u_i s_i = \frac{1}{N}\, s_i^2. \qquad (9.15)$$

Since the singular vectors are unit vectors, the singular values have the same physical units as the data, and the variance of a PC has the same physical units as the variance of the data.

4. The orthogonality of the left singular vectors means that u_i^T u_j = 0 if i ≠ j, which implies that the covariance between the i-th and j-th PC vanishes, which in turn means the PCs are uncorrelated.

5. The total variance of the anomaly data X' is

$$\frac{1}{N}\,\|X'\|_F^2 = \frac{1}{N}\left( s_1^2 + \cdots + s_R^2 \right). \qquad (9.16)$$

6. The fraction of variance "explained" by the i-th EOF is the variance of the PC divided by the total variance and is

$$\frac{s_i^2}{s_1^2 + s_2^2 + \cdots + s_R^2}. \qquad (9.17)$$

7. The previous points imply that principal component analysis decomposes the data into an uncorrelated set of components ordered by their explained variance. The decomposition is optimal in the following sense: the first PC/EOF comes "closest" to the anomaly data, in the sense that it minimizes the Frobenius norm of the difference between the anomaly data and the component; the second PC/EOF comes "closest" to the data out of all vectors that are orthogonal to the first PC/EOF; and so on.

8. The truncated data matrix $\hat{X}'$ defined in (9.13) gives the best possible approximation of the data by any rank-T matrix. The residual after subtracting out this best approximation has the squared norm

$$\begin{aligned}
\|X' - \hat{X}'\|_F^2 &= \left\| \sum_{n=1}^{R} s_n u_n v_n^T \;-\; \sum_{n=1}^{T} s_n u_n v_n^T \right\|_F^2 && (9.18) \\
&= \left\| \sum_{n=T+1}^{R} s_n u_n v_n^T \right\|_F^2 && (9.19) \\
&= \sum_{n=T+1}^{R} s_n^2 && (9.20) \\
&= \|X'\|_F^2 \;-\; \sum_{n=1}^{T} s_n^2. && (9.21)
\end{aligned}$$


Dividing this relation by the sample size gives

$$\underbrace{\frac{1}{N}\,\|X' - \hat{X}'\|_F^2}_{\text{Residual Variance}} \;=\; \underbrace{\frac{1}{N}\,\|X'\|_F^2}_{\text{Total Variance}} \;-\; \underbrace{\frac{1}{N}\sum_{k=1}^{T} s_k^2}_{\text{Explained Variance}}. \qquad (9.22)$$

This expression shows that minimizing the residual variance is equivalent to maximizing the explained variance.

9. The sample estimate of the covariance matrix of x is

$$\hat{\Sigma}_X = \frac{1}{N}\sum_{n=1}^{N} \left( x_n - \mu_X \right)\left( x_n - \mu_X \right)^T, \qquad (9.23)$$

or equivalently,

$$\hat{\Sigma}_X = \frac{1}{N}\, X'^T X' = \frac{1}{N}\, V S^T S V^T. \qquad (9.24)$$

Since S^T S is a diagonal matrix, it can be seen by inspection that the columns of V are the eigenvectors of the matrix $\hat{\Sigma}_X$, and that the eigenvalues are s_i^2/N. Since the EOFs are the eigenvectors of the covariance matrix, it follows that the EOFs can be obtained alternatively by solving the eigenvector equation

$$\hat{\Sigma}_X\, v = \lambda v. \qquad (9.25)$$

Numerically, the eigenvector approach is less accurate and often more expensive than the SVD approach, because eigenvector methods require (1) calculating a matrix product, (2) solving an eigenvector problem that may be ill-conditioned or involve a substantial null space, and (3) an additional matrix product to calculate the PCs. A major exception is when N or M is so large that the data matrix X cannot even be stored in computer memory. In such cases, PCA still can be calculated by reading in small chunks of the data and then updating the covariance matrix sequentially. For instance, if N is very large, we may compute

$$C = \sum_{n=1}^{N} x_n x_n^T, \qquad m = \sum_{n=1}^{N} x_n, \qquad (9.26)$$

for each time step n; then, after all the data have been read, we compute the covariance matrix as

$$\mu_X = \frac{1}{N}\, m, \qquad \hat{\Sigma}_X = \frac{1}{N}\, C - \mu_X \mu_X^T. \qquad (9.27)$$

10. The PCs are the eigenvectors of the matrix X'X'^T. This form is useful for performing PCA when M is very large.

11. It can be shown that the EOFs solve the following maximization problem:

$$\text{maximize } v^T X'^T X' v \quad \text{subject to} \quad v^T v = 1. \qquad (9.28)$$

Since X'v is the time series produced by projection vector v, and v^T X'^T X' v is proportional to the variance of that time series, the above optimization problem can be interpreted as finding the projection vector of unit length that maximizes variance. In this sense, the first PC/EOF explains the maximum possible variance; the second PC/EOF explains the maximum variance out of all vectors that are orthogonal to the first PC/EOF; and so on.
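
A brief numerical check of several of the properties above (a sketch with hypothetical sizes; NumPy assumed). The last few lines also illustrate the sequential covariance update of (9.26)-(9.27):

```python
import numpy as np

rng = np.random.default_rng(2)
N, M = 240, 50
Xp = rng.standard_normal((N, M))
Xp -= Xp.mean(axis=0)                        # zero column (time) means, as for anomalies

U, s, Vt = np.linalg.svd(Xp, full_matrices=False)
V = Vt.T
PCs = U * s                                  # columns are the PCs s_i u_i

# Property 1: the PCs are linear combinations of the data, US = X'V (9.14).
assert np.allclose(PCs, Xp @ V)

# Properties 3 and 4: PC variances are s_i^2/N and the PCs are uncorrelated.
assert np.allclose(PCs.T @ PCs / N, np.diag(s**2 / N))

# Property 9: the eigenvalues of the sample covariance are s_i^2/N, cf. (9.24)-(9.25).
Sigma = Xp.T @ Xp / N
lam = np.linalg.eigvalsh(Sigma)
assert np.allclose(np.sort(lam)[::-1], s**2 / N)

# Sequential (chunked) covariance, (9.26)-(9.27), reading the data in pieces.
C, m = np.zeros((M, M)), np.zeros(M)
for chunk in np.array_split(Xp, 6):
    C += chunk.T @ chunk
    m += chunk.sum(axis=0)
mu = m / N
assert np.allclose(C / N - np.outer(mu, mu), Sigma)
```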


9.3 PCA BASED ON GENERALIZED VARIANCE

In general, variables with different units cannot be included in PCA because they are not additive (e.g., minimizing (9.6) implies summing the squared difference over all variables). To include variables with different units, one usually normalizes the data based on physical principles (e.g., so that all variables have units of energy), or based on the standard deviation (e.g., so that all variables are nondimensional and have unit variance). Another related issue is that the use of sum square differences to measure the "distance" between two data sets is not appropriate for gridded data, because points near the poles are highly redundant and should not count the same as points along the equator. A more appropriate measure might be the area-weighted squared difference. These issues can be taken into account by modifying the "distance" measure between two data sets.

A common measure of “generalized distance” is a weighted sum square difference

$$\|X' - s\,u v^T\|_W^2 = \sum_m \sum_n w_m \left( X'_{nm} - s\,[u]_n\,[v]_m \right)^2, \qquad (9.29)$$

where w_m defines the weights. For area weighting, w_m would be approximately the cosine of the latitude of grid point m. In order for this expression to be a meaningful "distance" measure, all weights must be positive: if some weights are negative, then the above expression could vanish even if the residual does not; if some weights equal zero, then those points do not contribute to the distance measure and should be ignored in PCA.

A clever trick for finding the components that minimize the weighted sum square difference is to define a weighted data matrix

$$X''_{nm} = \sqrt{w_m}\, X'_{nm}. \qquad (9.30)$$

Then, the minimum weighted distance between data and component is equal to the minimum distance between weighted data and component, that is,

$$\min_{s,u,v} \|X' - s\,u v^T\|_W^2 = \min_{s,u,v} \|X'' - s\,u v^T\|_F^2. \qquad (9.31)$$

Therefore, the components that minimize the weighted sum square difference are found by applying PCA to the weighted data; i.e., by calculating the

$$\text{SVD of } X'' = U S V^T. \qquad (9.32)$$

Having multiplied the data by the square root of the weights, one must divide out these factors from the EOFs to obtain the EOFs appropriate to the original data (this division is always possible since the weights are positive). Thus, for the generalized distance (9.29), the EOFs would be $V_{mi}/\sqrt{w_m}$ and the PCs would be $s_i U_{ni}$. Note that both of these terms depend on the weights owing to (9.32). The transformation (9.30) can be written in matrix form as X'' = X'W^{1/2}, where W^{1/2} is a diagonal matrix in which the diagonal elements equal the square root of the weights. Then the EOFs are W^{-1/2}V, the PCs are US, and the decomposition is

$$X' = U S \left( W^{-1/2} V \right)^T. \qquad (9.33)$$

The properties of the PCs from weighted PCA parallel those of ordinary PCA, with statements about variance being replaced by statements about weighted variance. The properties of the EOFs from weighted PCA are different.


1. While not unit vectors, the EOFs are orthogonal with respect to the generalized distance measure:

$$\left( W^{-1/2} V \right)^T W \left( W^{-1/2} V \right) = I. \qquad (9.34)$$

2. The PCs can be obtained from linear combinations of the data X' as

$$U S = X' W^{1/2} V. \qquad (9.35)$$
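
As an illustration of weighted PCA, the following sketch applies cosine-of-latitude area weights to a hypothetical anomaly matrix (all names and sizes are assumptions for illustration) and recovers the decomposition (9.33):

```python
import numpy as np

rng = np.random.default_rng(3)
nlat, nlon, N = 20, 36, 240
lat = np.linspace(-85, 85, nlat)
w = np.cos(np.deg2rad(lat))                  # area weights, one per latitude
w = np.repeat(w, nlon)                       # one weight per grid point, shape (M,)
M = w.size

Xp = rng.standard_normal((N, M))             # stand-in anomaly matrix X'
Xpp = Xp * np.sqrt(w)                        # weighted data X'' = X' W^{1/2}, (9.30)

U, s, Vt = np.linalg.svd(Xpp, full_matrices=False)

EOFs = Vt.T / np.sqrt(w)[:, None]            # W^{-1/2} V, EOFs for the original data
PCs = U * s                                  # US, the PCs

# Reconstruction (9.33): X' = U S (W^{-1/2} V)^T
assert np.allclose(Xp, PCs @ EOFs.T)
```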

9.4 GRAPHICAL DISPLAY

When reporting the results of PCA, the PC and EOF can themselves be plotted as a time series and a state-space pattern, respectively. However, in the present form, the results depend on the sample size N, since u_i is normalized such that the sum of its N squared elements is unity. This implies that the PC amplitude depends on the length of the data record, and makes it awkward to compare PCs from data sets with different numbers of time points. Similarly, requiring the EOF v_i to be a unit vector makes its amplitude depend on the state dimension and spatial resolution. We would like to define normalizations of the EOF and PC that maintain the product of the EOF and PC but allow all of the information in the components to be displayed effectively and efficiently.

To define normalized EOFs and PCs, first we define the i-th normalized PC f_i to be the time series vector that is proportional to u_i and has unit variance. That is,

$$f_i = \sqrt{N}\, u_i. \qquad (9.36)$$

Then, looking at the product of an EOF with its PC

$$s_i u_i v_i^T W^{-1/2} = \frac{1}{\sqrt{N}}\, s_i f_i v_i^T W^{-1/2} = f_i\, e_i^T, \qquad (9.37)$$

where we define the physical EOF pattern

$$e_i = \frac{1}{\sqrt{N}}\, s_i W^{-1/2} v_i. \qquad (9.38)$$

Note that the physical pattern has the same units as the data. Matrix forms of the plotting-variable definitions are

$$F = \sqrt{N}\, U \qquad (9.39)$$

and

$$E = \frac{1}{\sqrt{N}}\, W^{-1/2} V S^T. \qquad (9.40)$$
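
A minimal sketch of these plotting conventions (hypothetical setup with trivial weights; NumPy assumed), checking (9.41) and (9.43):

```python
import numpy as np

rng = np.random.default_rng(4)
N, M = 240, 50
w = np.full(M, 1.0)                          # trivial weights for illustration
Xp = rng.standard_normal((N, M))
Xp -= Xp.mean(axis=0)

U, s, Vt = np.linalg.svd(Xp * np.sqrt(w), full_matrices=False)

F = np.sqrt(N) * U                                   # normalized PCs, (9.39)
E = (Vt.T / np.sqrt(w)[:, None]) * s / np.sqrt(N)    # physical patterns, (9.40)

assert np.allclose(F.T @ F / N, np.eye(F.shape[1]))  # (9.41): unit variance, uncorrelated
assert np.allclose(Xp, F @ E.T)                      # (9.43): anomaly data recovered
```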

Some properties of these plotting variables are the following.

1. The normalized PCs have zero mean, are uncorrelated, and have unit variance:

$$\frac{1}{N}\, F^T F = I. \qquad (9.41)$$


2. While not unit vectors, the pattern vectors are orthogonal with respect to the generalized distance measure:

$$E^T W E = \frac{1}{N}\, S V^T V S^T = \frac{1}{N}\, S S^T. \qquad (9.42)$$

3. The anomaly data is

$$X' = F E^T. \qquad (9.43)$$

4. Since the normalized PCs have unit variance, the absolute value of the (i, k)-th element of E, e_{ik}, is the standard deviation explained by the k-th EOF at the i-th spatial element.

5. The fraction of generalized variance ‖X''‖_F^2/N "explained" by the i-th EOF is given by

$$\frac{s_i^2}{s_1^2 + s_2^2 + \cdots + s_R^2}. \qquad (9.44)$$

6. Regression and projection properties of the EOFs and PCs can be translated into properties of the pattern vectors and normalized PCs, with factors of N running around. For those comfortable with the SVD, the EOF and PC relations are easier to remember, and the properties of the pattern vectors and normalized PCs follow simply.

9.5 EXAMPLE

We now show the EOFs of monthly sea surface temperature (SST) (Smith and Reynolds, 1999). SST is nonstationary owing to, among other things, a strong annual cycle. The seasonal cycle is subtracted prior to computing EOFs. The resulting residual is not necessarily stationary, but at least it is not dominated by a strongly periodic signal.

Fig. 9.1 shows the leading EOF. This EOF explains 22% of the variance, as determined from (9.17). The corresponding time series is shown in the bottom panel. The variance of the time series is exactly one, by construction. Note that the time series is suggestive of a "climate change" in the late 1970s. Hence, the time series might not be stationary. A substantial fraction of the "variance" is due to the "shift" in the late 1970s.

The EOF is shown as the shaded figure. The EOF is normalized such that it can be multiplied by the time series to give the amplitude of the pattern at each month. Since the time series has unit variance, the absolute value of the EOF at each point in space gives the standard deviation of the EOF at that point. Thus, the EOF has the same units as the original data set, namely degrees Celsius. It follows that the leading EOF has a standard deviation of around 1°C in the eastern Pacific.

9.6 PROJECTIONS AND THE ECONOMY SVD

Recall the economy SVD, which is

$$\underset{[N\times M]}{X''} \;=\; \underset{[N\times R]}{\dot{U}}\ \ \underset{[R\times R]}{\dot{S}}\ \ \underset{[R\times M]}{\dot{V}^T}, \qquad (9.45)$$


[Figure 9.1 here: top panel, a longitude-latitude map titled "EOF-1 of ERSSTv3 (22%)"; bottom panel, the corresponding PC time series versus Year, 1950-2000.]

Figure 9.1 Leading EOF and PC of the Extended Reconstructed SST produced by Reynolds and Smith (1999)

where $\dot{U} = [u_1, \ldots, u_R]$, $\dot{V} = [v_1, \ldots, v_R]$, and $\dot{S} = \mathrm{diag}(s_1, \ldots, s_R)$, and the dots indicate that the corresponding matrices in the SVD have been truncated. We define the pseudo-inverse of the k-th EOF as

$$e^{i}_k = \sqrt{N}\, s_k^{-1} W^{1/2} v_k. \qquad (9.46)$$

In matrix form, the pseudo-inverse is

$$E^{i} = \sqrt{N}\, W^{1/2} V S^{-1}. \qquad (9.47)$$

This matrix is not square, but E^T E^i = I. It can be verified that

$$X' E^{i} = F. \qquad (9.48)$$

This expression shows that the principal components can be computed from the original data X' by projection based on the pseudo-inverse. This projection also provides the proper way to find the principal components in independent data.
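
A sketch of the pseudo-inverse projection under the same hypothetical setup as before (trivial weights), checking E^T E^i = I and (9.48):

```python
import numpy as np

rng = np.random.default_rng(5)
N, M = 240, 50
w = np.full(M, 1.0)                            # trivial weights for illustration
Xp = rng.standard_normal((N, M))

U, s, Vt = np.linalg.svd(Xp * np.sqrt(w), full_matrices=False)
V = Vt.T

E = (V / np.sqrt(w)[:, None]) * s / np.sqrt(N)       # physical patterns, (9.40)
Ei = np.sqrt(N) * (V * np.sqrt(w)[:, None]) / s      # pseudo-inverse, (9.47)

assert np.allclose(E.T @ Ei, np.eye(len(s)))         # E^T E^i = I
F = Xp @ Ei                                          # (9.48): projection gives the PCs
assert np.allclose(F, np.sqrt(N) * U)
```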

9.7 LARGE SAMPLE PROPERTIES OF PCA

Quantifying the sampling distribution of the EOFs and PCs is difficult. Some useful mathematical results are known about the eigenvectors of sample covariance matrices. Since EOFs are the eigenvectors of the sample covariance matrix, the sampling properties of eigenvectors can be used to derive some basic conclusions regarding the sampling properties of EOFs and PCs. The following theorem gives the distribution of the eigenvectors and eigenvalues of the sample covariance matrix $\hat{\Sigma}_X$ in the limit of large sample size N.


Theorem 9.1 (Sampling Distribution of EOFs). Let x_1, x_2, ..., x_N be N independent samples from the M-dimensional normal population $\mathcal{N}_M(\mu, \Sigma_x)$, and let $\hat{\Sigma}_x$ be a sample estimate of $\Sigma_x$. Assume that the eigenvalues of $\Sigma_x$ are distinct and positive. Then, the M eigenvalues and eigenvectors of the sample covariance matrix $\hat{\Sigma}_x$, denoted

$$\hat{\Lambda} = \mathrm{diag}\!\left( \hat{\lambda}_1, \hat{\lambda}_2, \ldots, \hat{\lambda}_M \right), \qquad \hat{V} = \left[ \hat{v}_1, \hat{v}_2, \ldots, \hat{v}_M \right], \qquad (9.49)$$

have the following properties for asymptotically large N:

$$\hat{\lambda}_i \sim \mathcal{N}\!\left( \lambda_i,\; \frac{2\lambda_i^2}{N} \right), \qquad \hat{v}_i \sim \mathcal{N}_M\!\left( v_i,\; \frac{1}{N} L_i \right), \quad \text{where } L_i = \lambda_i \sum_{k \neq i} \frac{\lambda_k}{(\lambda_k - \lambda_i)^2}\, v_k v_k^T. \qquad (9.50)$$

This theorem will be examined in more detail in the next three sections.

9.8 CONFIDENCE INTERVAL FOR THE EIGENVALUES

According to the above theorem, the 100c% confidence interval for the eigenvalues is

$$\begin{aligned}
c &= p\!\left( \lambda_i - z_c \lambda_i \sqrt{\tfrac{2}{N}} \;<\; \hat{\lambda}_i \;<\; \lambda_i + z_c \lambda_i \sqrt{\tfrac{2}{N}} \right) \\
  &= p\!\left( \lambda_i - z_c \lambda_i \sqrt{\tfrac{2}{N}} < \hat{\lambda}_i \ \text{ and } \ \hat{\lambda}_i < \lambda_i + z_c \lambda_i \sqrt{\tfrac{2}{N}} \right) \\
  &= p\!\left( \lambda_i < \frac{\hat{\lambda}_i}{1 - z_c\sqrt{2/N}} \ \text{ and } \ \frac{\hat{\lambda}_i}{1 + z_c\sqrt{2/N}} < \lambda_i \right) \\
  &= p\!\left( \frac{\hat{\lambda}_i}{1 + z_c\sqrt{2/N}} \;<\; \lambda_i \;<\; \frac{\hat{\lambda}_i}{1 - z_c\sqrt{2/N}} \right).
\end{aligned} \qquad (9.51)$$

Since the above theorem applies only for large samples, the above confidence interval is usually approximated as

$$\left( \hat{\lambda}_i - z_c\, \hat{\lambda}_i \sqrt{\tfrac{2}{N}},\ \ \hat{\lambda}_i + z_c\, \hat{\lambda}_i \sqrt{\tfrac{2}{N}} \right). \qquad (9.52)$$

Note that the length of the confidence interval is proportional to $\hat{\lambda}_i$, implying that the uncertainty grows with the estimate, in contrast to the confidence intervals considered in previous chapters.
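
A sketch of the approximate interval (9.52) for hypothetical sample eigenvalues, together with the overlap check used in the rule of thumb discussed in Section 9.10 (the eigenvalues, sample size, and z-value are assumptions for illustration):

```python
import numpy as np

# Hypothetical sample eigenvalues (variances explained) and sample size.
lam_hat = np.array([10.0, 6.0, 5.5, 3.0, 2.8, 2.7])
N = 240                      # independent samples (optimistic if data are autocorrelated)
zc = 1.96                    # z-value for a 95% confidence interval

half_width = zc * lam_hat * np.sqrt(2.0 / N)
lower, upper = lam_hat - half_width, lam_hat + half_width   # interval (9.52)

# North et al. (1982) style check: do consecutive intervals overlap?
for i in range(len(lam_hat) - 1):
    separated = lower[i] > upper[i + 1]
    print(f"EOFs {i+1} and {i+2}: {'separated' if separated else 'effectively degenerate'}")
```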

9.9 EFFECTIVE DEGENERACY

When a matrix has two eigenvalues that are equal to each other, the associated eigenvalues are said to be degenerate. The eigenvectors of degenerate eigenvalues are not unique,


although they must span a particular space. Although the matrix $\Sigma_x$ may have distinct eigenvalues, the uncertainty in the eigenvalues may be larger than the spacing between the eigenvalues, in which case the eigenvalues are effectively degenerate. The implication of effective degeneracy is that the eigenvectors associated with effectively degenerate eigenvalues are not unique, or rather, they are extremely sensitive to the details of sampling errors. This sensitivity is reflected in (9.50), which shows that the variance of the eigenvector depends inversely on the spacing of the eigenvalues, in contrast to the variance of the eigenvalues themselves. Thus, as two eigenvalues get "closer," the variance in the eigenvector grows to infinity. If the spacing in eigenvalues is comparable to the sampling error in the eigenvalues, then eigenvectors associated with neighboring eigenvalues cannot be resolved. This means that the eigenvectors probably vary wildly from sample to sample.

Suppose eigenvalues $\lambda_i$ and $\lambda_{i+1}$ are sufficiently close to each other that these terms dominate L_i in Theorem 9.1. Then

$$\hat{v}_i \sim \mathcal{N}\!\left( v_i,\; \frac{1}{N}\, \frac{\lambda_i \lambda_{i+1}}{(\lambda_i - \lambda_{i+1})^2}\, v_{i+1} v_{i+1}^T \right), \qquad \hat{v}_{i+1} \sim \mathcal{N}\!\left( v_{i+1},\; \frac{1}{N}\, \frac{\lambda_i \lambda_{i+1}}{(\lambda_i - \lambda_{i+1})^2}\, v_i v_i^T \right). \qquad (9.53)$$

These distributions show that eigenvectors i and i+1 "mix" with each other. It is reasonable to assume that when the spacing between eigenvalues is comparable to the sampling errors, the eigenvectors are mixed and should not be split up. This mixing is called an "effective degeneracy." In essence, the covariance matrix is "effectively degenerate," because two closely spaced eigenvalues cannot be resolved by the data.

9.10 NORTH ET AL.’S RULE OF THUMB

Fig. 9.2 shows the leading eigenvalues of the sample covariance matrix of the SST data set discussed previously. It also shows the 95% confidence intervals, indicated as error bars. First, note that the confidence intervals all have the same size on a log-scale plot. This is a useful property of log plots. North et al. (1982) proposed that one should truncate EOFs only where the confidence intervals for the corresponding eigenvalues do not overlap (although they effectively used a 67% confidence interval). In the figure, this rule of thumb would imply that we could truncate between 1 and 2, or between 2 and 3, but not anywhere else.

In our opinion, the above rule of thumb is more important when discussing "physically important EOFs" than when choosing a truncation point. In most problems in which we are truncating EOFs, the EOFs are used as basis functions for representing the data. In this context, it is not critically important whether the trailing EOFs have been "resolved" or not. For instance, basis functions can be arbitrary, such as spherical harmonics. A collection of EOFs provides a convenient orthogonal basis set that is more tailored to the data than those that arise from classical differential equations. On the other hand, if we want to interpret an individual EOF in physical terms, then it is much more important to be certain that the EOF is not sensitively dependent on the sample.


[Figure 9.2 here: fractional variance (log scale) versus EOF index, titled "Fraction of Explained Variance," with error bars.]

Figure 9.2

There is one caveat that should be mentioned about the above comment: EOFs tend to give biased estimates of variance. This bias is easy to understand. For instance, the leading EOF is specially designed to explain maximal variance in the available sample. However, the detailed weighting required to achieve this variance is peculiar to the sample and will not consistently hold in independent data sets. Consequently, the leading EOF generally will account for less variance in an independent data set than in the sample from which it was derived. This reasoning suggests that if bias in variance is an important issue in a particular analysis, then EOFs should be used with care.

9.11 ARE EOFS PHYSICAL MODES?

One might be tempted to interpret an individual EOF as a "mode of oscillation" of the physical system. For instance, the leading EOF of SST, shown previously, is often interpreted as an actual mode of oscillation associated with El Niño, La Niña, and the Southern Oscillation. The question arises as to whether this interpretation is appropriate.

One problem with interpreting EOFs as modes is that the EOFs are orthogonal in space, whereas modes are not necessarily orthogonal. For instance, fluid dynamical systems with background shear tend to have non-orthogonal eigenmodes (Farrell and Ioannou, 1996). While the orthogonality constraint is perhaps not critical for the leading EOF, it becomes very important as one examines successively higher-order EOFs.

Another problem with this interpretation is that it is not clear what is meant by "modes." The climate system is nonlinear, and there is no universally accepted definition of a "mode" of a nonlinear system. If we consider the climate system as "mostly linear," then the modes would be identified with the eigenmodes of the dynamical operator. However, most linear models of the climate system have eigenmode amplitudes that either grow to infinity or decay to zero, neither of which is realistic. Thus, most linear theories of climate are based on stochastic models: damped linear systems driven by random noise. It turns out that


there is only one class of stochastic models whose EOFs correspond to eigenmodes: a linear system with orthogonal eigenmodes driven by noise that is white in space and time. Such systems tend to be too simplistic to describe interesting climate dynamics. However, it turns out that in many stochastic climate models the leading EOF is very similar to the least damped eigenmode.

9.12 A CAUTIONARY EXAMPLE OF EOFS

Consider three modes of oscillation, each of whose structure is shown in Fig. 9.3. Suppose each mode fluctuates independently of the others, but with the variances indicated in the figure and with zero mean. The data matrix is of the form X = ZAM^T, where

$$M = \begin{pmatrix} 5 & 0 & 2 \\ 0 & 0 & 2 \\ 0 & 4.5 & 2 \end{pmatrix}, \qquad A = \begin{pmatrix} \sqrt{0.436} & 0 & 0 \\ 0 & \sqrt{0.353} & 0 \\ 0 & 0 & \sqrt{0.209} \end{pmatrix}, \qquad (9.54)$$

and Z is a matrix whose columns are mutually orthonormal time series. The EOFs of the system are given by the eigenvectors of the covariance matrix

$$\Sigma_x = \frac{1}{N}\, X^T X = \frac{1}{N}\, M A Z^T Z A^T M^T = \frac{1}{N}\, M A^2 M^T. \qquad (9.55)$$

The eigenvectors of this matrix are shown in the second row of the figure. We see that the EOFs fail to recover the original eigenmodes. In fact, we see a dipole for the second EOF and a tripole for the third EOF, even though neither of these structures explicitly fluctuates. The reason for this is that the EOF procedure merely maximizes variance; it does not guarantee that the resulting structures are modes. This example illustrates the danger of interpreting "dipoles" and "tripoles" as "real modes." The figure also shows the VARIMAX pattern (third panel from top), and the regression patterns obtained by regressing the time series in the box against all other coordinates (bottom panel). The values inside the boxes are the amplitudes of the patterns in the respective regions, in arbitrary units. The figure is from Dommenget and Latif (2002).
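
A sketch of this construction, assuming the values of M and A reconstructed in (9.54) and generating Z as random orthonormal time series; it shows that the EOFs (eigenvectors of the covariance matrix) do not coincide with the columns of M:

```python
import numpy as np

rng = np.random.default_rng(6)
N = 10000

# Mode patterns (columns of M) and mode amplitudes (diagonal of A), as in (9.54).
Mmat = np.array([[5.0, 0.0, 2.0],
                 [0.0, 0.0, 2.0],
                 [0.0, 4.5, 2.0]])
A = np.diag(np.sqrt([0.436, 0.353, 0.209]))

# Z: N x 3 matrix with mutually orthonormal columns (random time series).
Z, _ = np.linalg.qr(rng.standard_normal((N, 3)))

X = Z @ A @ Mmat.T                      # data matrix X = Z A M^T
Sigma = X.T @ X / N                     # covariance, (1/N) M A^2 M^T

lam, eofs = np.linalg.eigh(Sigma)       # EOFs = eigenvectors of the covariance
eofs = eofs[:, ::-1]                    # order by decreasing eigenvalue

# The EOFs (orthogonal by construction) differ from the columns of M,
# which are not mutually orthogonal.
print(np.round(eofs, 2))
print(np.round(Mmat / np.linalg.norm(Mmat, axis=0), 2))
```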

9.13 IS THERE A PROCEDURE THAT CAN FIND MODES?

It may surprise you to learn that we can prove that NO procedure based only on the covariance matrix can find the modes in the above example. To show this, let Z be a unitary matrix; i.e., Z^T Z = I. Then the two data matrices X_1 = FE^T and X_2 = FZE^T have precisely the same sample covariance matrix, EE^T. This implies that the structure E can be determined directly from the sample covariance only to within a unitary transformation. It follows that in order to extract a unique matrix E, additional constraints are needed.

It is probably worth pointing out that M need not be a square matrix. For instance, M could be a 3 × 100 matrix and still yield a 3 × 3 covariance matrix. Clearly, there is a loss of information from M to $\Sigma_x$. This example illustrates in a different way that there is no method based on the covariance matrix that can extract the mode matrix M.
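
A small numerical illustration of this non-identifiability (hypothetical F, E, and a random unitary Z):

```python
import numpy as np

rng = np.random.default_rng(7)
N, M, R = 500, 8, 3

F, _ = np.linalg.qr(rng.standard_normal((N, R)))   # orthonormal time series
F *= np.sqrt(N)                                    # unit-variance columns, F^T F = N I
E = rng.standard_normal((M, R))                    # arbitrary structure matrix
Z, _ = np.linalg.qr(rng.standard_normal((R, R)))   # unitary: Z^T Z = I

X1 = F @ E.T
X2 = F @ Z @ E.T

# Identical sample covariance matrices, even though the decompositions differ.
assert np.allclose(X1.T @ X1 / N, E @ E.T)
assert np.allclose(X2.T @ X2 / N, E @ E.T)
```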

In order to constrain the problem to extract modes, we have to answer the question "what precisely is a mode?" As discussed before, the main interpretation that makes sense


Figure 9.3


is one in which the system is considered to be a linear dynamical system driven by random noise. If the noise is white in time, then the modes of the underlying linear system can be obtained by a procedure known as principal oscillation pattern (POP) analysis. This procedure requires knowledge of the time-lagged covariance matrix, which gives the covariance between every degree of freedom and every other degree of freedom, lagged in time.

9.14 EOFS AS THE MAJOR AXES OF AN ELLIPSOID OF CONSTANT PROBABILITY DENSITY

The EOFs also have an important interpretation in the context of multivariate Gaussian distributions. Recall that the multivariate Gaussian distribution has the form

$$f(x_1, x_2, \ldots, x_M) = f(x) = \frac{1}{(2\pi)^{M/2}\, |\Sigma|^{1/2}}\, e^{-(x - \mu)^T \Sigma^{-1} (x - \mu)/2}. \qquad (9.56)$$

This distribution depends on x only through the argument of the exponential (all other parameters are population parameters and hence constant). Therefore, the vectors x with constant probability density are those that satisfy

$$(x - \mu)^T\, \Sigma^{-1}\, (x - \mu) = \text{constant}. \qquad (9.57)$$

This equation is a quadratic form. In fact, since it is a linear combination of second-order products, it must be a conic section; i.e., it must be an ellipse, hyperboloid, or cone. The key to figuring out the type of conic section is to transform the above equation using a transformation based on the eigenvectors of the covariance matrix. Since the covariance matrix is symmetric and positive definite, it can be decomposed into the form

$$\Sigma = V \Lambda V^T, \qquad (9.58)$$

where V is a unitary matrix and $\Lambda$ is a diagonal matrix with positive diagonal elements. It follows that the inverse is

$$\Sigma^{-1} = V \Lambda^{-1} V^T. \qquad (9.59)$$

Substituting this into the quadratic form gives

$$\begin{aligned}
(x - \mu)^T\, \Sigma^{-1}\, (x - \mu) &= (x - \mu)^T\, V \Lambda^{-1} V^T\, (x - \mu) \\
&= y^T \Lambda^{-1} y \\
&= \lambda_1^{-1} y_1^2 + \lambda_2^{-1} y_2^2 + \cdots + \lambda_M^{-1} y_M^2,
\end{aligned} \qquad (9.60)$$

where y = V^T(x − µ). The above expression is a sum of squares (with no cross terms) with positive coefficients. This implies that the surface of constant probability density is an ellipsoid in the y-variable. However, a unitary transformation corresponds to a rotation or reflection. Therefore, the y-variable is related to the x-variable by a translation and rotation, which preserves the ellipsoidal shape. It follows that the surface of constant probability density for x is an ellipsoid with the following properties (a small numerical sketch follows the list):

1. The ellipsoid is centered at µ.

2. The major axes are oriented along the eigenvectors of $\Sigma$, which are the EOFs.

3. The lengths of the major axes are proportional to $\sqrt{\lambda_i}$, where $\lambda_i$ is the i-th eigenvalue of $\Sigma$.

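
A small sketch for a hypothetical 2 × 2 covariance matrix, illustrating the three properties above:

```python
import numpy as np

# Hypothetical 2-D Gaussian: the ellipse axes align with the eigenvectors (EOFs)
# and have lengths proportional to the square roots of the eigenvalues.
Sigma = np.array([[2.0, 1.2],
                  [1.2, 1.0]])
mu = np.array([0.5, -1.0])

lam, V = np.linalg.eigh(Sigma)            # Sigma = V diag(lam) V^T

for i in range(2):
    print(f"axis {i+1}: direction {np.round(V[:, i], 3)}, "
          f"half-length proportional to {np.sqrt(lam[i]):.3f}")

# Points y on a constant-density contour satisfy sum(y_i^2 / lam_i) = const,
# where y = V^T (x - mu); i.e., an ellipse centered at mu.
```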

9.15 SUMMARY OF EQUATIONS RELATED TO PRINCIPAL COMPONENT ANALYSIS

X': The anomaly data, stored in an N × M matrix. M = space or variable dimension, N = time dimension.

W: Area weighting/variable normalization matrix. The user specifies this operator. It defines the norm the user believes to be physically important.

$X'W^{1/2} = USV^T$: economy SVD

$X' = USV^T W^{-1/2}$: PCA decomposition

$US$: PCs

$W^{-1/2}V$: EOFs

$F = \sqrt{N}\,U$: normalized PCs (plotting)

$E = \frac{1}{\sqrt{N}}\, W^{-1/2}VS$: physical patterns (plotting)

$X' = FE^T$: PCA decomposition (plotting)

$\frac{1}{N}\, F^T F = I$: normalized PCs uncorrelated and unit variance

$E^T W E = \frac{1}{N}\, S^2$: physical patterns orthogonal with respect to the norm W

$\frac{1}{N}\, \|X'W^{1/2}\|_F^2 = \frac{1}{N}\left( s_1^2 + \cdots + s_R^2 \right)$: total variance with respect to the norm W

$E^{i} = \sqrt{N}\, W^{1/2}VS^{-1}$: pseudo-inverse of the EOFs

$\frac{1}{N}\, X'^T f_k = e_k$: normalized PC to EOF transformation

$X' e^{i}_k = f_k$: pseudo-inverse to normalized PC transformation