Principal Component Analysis (PCA) Principal component analysis (PCA) creates new variables (components) that consist of uncorrelated, linear combinations of the original variables. PCA is used to simplify the data structure and still account for as much of the total variation in the original data as possible.

Post on 20-Dec-2015


TRANSCRIPT

Page 1

Principal Component Analysis (PCA)

• Principal component analysis (PCA) creates new variables (components) that consist of uncorrelated, linear combinations of the original variables.

• PCA is used to simplify the data structure and still account for as much of the total variation in the original data as possible.

Page 2

Simple Case: Stock Market Data

[Figure: Time Series Plot of S&P. x-axis: Index (1 to 250); y-axis: S&P (1200 to 1450)]

[Figure: Time Series Plot of Dow. x-axis: Index (1 to 250); y-axis: Dow (10500 to 12500)]

Can the data be reduced to just one linear combination of the original variables without losing much information?

Page 3

3 Steps for PCA

1) Calculate the correlation matrix

2) Calculate the eigenvectors of the correlation matrix

3) Multiply the eigenvectors by the standardized original data. The first principal component (PC1) is a linear combination of the standardized data where the first eigenvector is used as the weights.
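The three steps above can be sketched in Python with NumPy. This is a minimal illustration, not the slides' own code: the function name `pca_via_correlation` and the simulated two-column data set (standing in for the Dow and S&P series, which are not reproduced here) are assumptions made for the example.

```python
import numpy as np

def pca_via_correlation(data):
    """Return eigenvalues, eigenvectors (as columns), and component scores."""
    # Standardize each column to mean 0 and sample standard deviation 1.
    z = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)
    # Step 1: correlation matrix of the original variables.
    corr = np.corrcoef(data, rowvar=False)
    # Step 2: eigenvalues and eigenvectors (eigh is meant for symmetric matrices).
    eigenvalues, eigenvectors = np.linalg.eigh(corr)
    order = np.argsort(eigenvalues)[::-1]  # largest eigenvalue first
    eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]
    # Step 3: multiply the standardized data by the eigenvectors.
    # Column 0 of the result is PC1, weighted by the first eigenvector.
    scores = z @ eigenvectors
    return eigenvalues, eigenvectors, scores

# Hypothetical data: two noisy copies of one common signal (250 rows).
rng = np.random.default_rng(0)
common = rng.normal(size=250)
data = np.column_stack([common + 0.3 * rng.normal(size=250),
                        common + 0.3 * rng.normal(size=250)])

eigenvalues, eigenvectors, scores = pca_via_correlation(data)
print("share of total variation per component:", eigenvalues / eigenvalues.sum())
```

Because the data are standardized, the eigenvalues sum to the number of variables, so each eigenvalue divided by that sum gives the share of total variation its component accounts for.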

Page 4

Standardized closing values of 2006 Dow Index vs 2006 S&P 500

[Figure: Scatterplot of ZSandP vs ZDow. x-axis: ZDow (-2 to 2); y-axis: ZSandP (-2 to 2)]

Simple Case: Stock Market Data

Page 5

Direction of the first principal component (the first eigenvector).

[Figure: Scatterplot of ZSandP vs ZDow. x-axis: ZDow (-2 to 2); y-axis: ZSandP (-2 to 2)]

Simple Case: Stock Market Data

Page 6

[Figure: Scatterplot of ZSandP vs ZDow. x-axis: ZDow (-2 to 2); y-axis: ZSandP (-2 to 2)]

Rotating the data to the first principal component. PC1 is a linear combination of the standardized data, with the first eigenvector used as the weights.
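With only two standardized variables, the rotation can be worked out by hand. The sketch below assumes that r denotes the sample correlation between ZDow and ZSandP and that r > 0; the symbols r, R, \lambda, and e are introduced here for illustration and do not appear on the slides.

\[
R = \begin{pmatrix} 1 & r \\ r & 1 \end{pmatrix}, \qquad
\lambda_1 = 1 + r, \quad e_1 = \tfrac{1}{\sqrt{2}} \begin{pmatrix} 1 \\ 1 \end{pmatrix}, \qquad
\lambda_2 = 1 - r, \quad e_2 = \tfrac{1}{\sqrt{2}} \begin{pmatrix} 1 \\ -1 \end{pmatrix}
\]

\[
\mathrm{PC1} = \frac{\mathrm{ZDow} + \mathrm{ZSandP}}{\sqrt{2}}, \qquad
\operatorname{Var}(\mathrm{PC1}) = \lambda_1 = 1 + r, \qquad
\frac{\lambda_1}{\lambda_1 + \lambda_2} = \frac{1 + r}{2}
\]

Under these assumptions the first eigenvector points along the 45-degree line of the scatterplot, and the closer r is to 1, the closer PC1 comes to accounting for all of the variation in the two series.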

Simple Case: Stock Market Data

Page 7

[Figure: Time Series Plot of Standardized Stock Market Values and First Principal Component. x-axis: 2006 Business Day (1 to 250); y-axis: Standardized Stock Market Values (-2 to 3); series: PC1, ZDow, ZSandP]

Simple Case: Stock Market Data

Page 8

LAB: Principal Component Analysis in Environmental Studies

The Debate Over Statistical Techniques Used in the Derivation of the Global Warming Hockey Stick Graph

Figure 1: The instrumental record of global average temperatures.

Page 9

The Hockey Stick Graph

Figure 2: Mann’s 1998 Hockey Stick Graph


Page 11

The Hockey Stick Graph

Page 12

• In 1998 Mann, Bradley, and Hughes (MBH) used a modified PCA to reduce 70 series of proxy data to one principal component (PC1).

• MBH’s graph was widely used as evidence of global warming.

• In 2003, McIntyre and McKitrick (MM) claimed that the graph was not correct, but they had significant trouble getting published.

• In 2005, MM published a simulation study showing that MBH’s modified PCA technique would consistently result in a hockey stick shape.

• In 2006, Ed Wegman provided an ad hoc committee report to Congress on the “Hockey Stick Global Climate Reconstruction”, http://www.heartland.org/pdf/19383.pdf .

The Hockey Stick Graph

Page 13

• MBH used data from 1400 to 1980: 581 observations for each of the 70 proxy variables (tree ring data).

• Each variable would typically be standardized by the following formula:

\[ \frac{X - \bar{X}_{[1400:1980]}}{S_{[1400:1980]}} \]

• MBH used a ‘decentered’ standardization:

\[ \frac{X - \bar{X}_{[1902:1980]}}{S_{[1902:1980]}} \]

• What is the mean and standard deviation of a ‘decentered’ variable?

• How will this impact principal component analysis?

The Hockey Stick Graph
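To explore the question about the mean and standard deviation of a ‘decentered’ variable, the two standardizations can be compared on a single series. This is a rough sketch under stated assumptions: the series `x` is simulated (white noise with a hypothetical upward shift after 1902), not real proxy data, and the variable names are chosen for this example.

```python
import numpy as np

years = np.arange(1400, 1981)   # 581 annual observations, 1400 to 1980
calib = years >= 1902           # the 79 years 1902 to 1980

# Hypothetical series: white noise plus an assumed shift in the last 79 years.
rng = np.random.default_rng(1)
x = rng.normal(size=years.size) + np.where(calib, 1.0, 0.0)

# Typical standardization: mean and standard deviation of the full period.
z_full = (x - x.mean()) / x.std(ddof=1)

# 'Decentered' standardization: mean and standard deviation of 1902-1980 only.
z_dec = (x - x[calib].mean()) / x[calib].std(ddof=1)

print("full period, typical:   ", z_full.mean(), z_full.std(ddof=1))
print("full period, decentered:", z_dec.mean(), z_dec.std(ddof=1))
print("1902-1980, decentered:  ", z_dec[calib].mean(), z_dec[calib].std(ddof=1))
```

The decentered variable has mean 0 and standard deviation 1 only over 1902 to 1980. Over the full period its mean is generally not 0, which is what distinguishes it from a typical standardized variable.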

Page 14

Questions 1 and 2: Generate a matrix of random AR(1) data. AR(1) data follows the general pattern of tree ring growth in many trees.

Question 3: Standardize the data matrix

Question 4: Perform PCA on a random AR(1) matrix with 70 series.

Question 5: Write a function that repeats question 4 ten times.

Question 6: Write a function that repeats question 5, but uses a ‘decentered’ standardization.

Does it look like ‘hockey stick’ shaped graphs occur more often with decentered data? Can we conduct a more thorough simulation study?

Simulation Study of the Hockey Stick Graph
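One possible shape for this simulation is sketched below in Python with NumPy. It is not the lab's or MM's code, and several choices are assumptions made for illustration: the AR(1) coefficient `PHI = 0.9`, the helper names, and the "hockey stick index" used to summarize each PC1 (the gap between its 1902 to 1980 mean and its overall mean, in standard deviation units). PC1 is taken from a singular value decomposition of the standardized matrix without re-centering; for fully standardized data this gives the same weights as the first eigenvector of the correlation matrix, while for decentered data it keeps the nonzero column means in play.

```python
import numpy as np

N_YEARS, N_SERIES, CALIB = 581, 70, 79  # 1400-1980, 70 series, last 79 years
PHI = 0.9                               # assumed AR(1) coefficient

def ar1_matrix(rng, phi=PHI):
    """Questions 1-2: matrix of independent AR(1) series, one per column."""
    noise = rng.normal(size=(N_YEARS, N_SERIES))
    x = np.empty_like(noise)
    x[0] = noise[0] / np.sqrt(1 - phi**2)  # draw from the stationary distribution
    for t in range(1, N_YEARS):
        x[t] = phi * x[t - 1] + noise[t]
    return x

def standardize(x, decentered=False):
    """Question 3 (and 6): use all rows, or only the last CALIB rows."""
    ref = x[-CALIB:] if decentered else x
    return (x - ref.mean(axis=0)) / ref.std(axis=0, ddof=1)

def first_pc(z):
    """Question 4: PC1 scores from an SVD of z, with no further centering."""
    _, _, vt = np.linalg.svd(z, full_matrices=False)
    return z @ vt[0]

def hockey_stick_index(pc):
    """Illustrative summary: |mean of last CALIB values - overall mean| / sd."""
    return abs(pc[-CALIB:].mean() - pc.mean()) / pc.std(ddof=1)

def simulate(n_reps=10, decentered=False, seed=0):
    """Questions 5-6: repeat the PCA n_reps times and collect the index."""
    rng = np.random.default_rng(seed)
    return np.array([
        hockey_stick_index(first_pc(standardize(ar1_matrix(rng), decentered)))
        for _ in range(n_reps)
    ])

centered = simulate(decentered=False)
decentered = simulate(decentered=True)  # same seed, so the same random matrices
print("mean index, typical standardization:   ", centered.mean())
print("mean index, decentered standardization:", decentered.mean())
```

Raising `n_reps` and plotting the individual PC1 series turns this into a more thorough study. Since the sign of a principal component is arbitrary, the index uses an absolute value.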

Page 15

The Hockey Stick Graph

1) Why do you think that the IPCC and supporters of the Kyoto accord prominently featured Mann’s (i.e. MBH’s) graph?

2) This paper shows reasons to believe that MBH’s graph was developed inappropriately; does this mean that there is no global warming?

3) State specifically how you would expect proponents and opponents to respond to MM’s and MBH’s work for their own political or personal benefit.

4) In 2006, the Chairman of the Committee on Energy and Commerce as well as the Chairman of the Subcommittee on Oversight and Investigations requested an ad hoc committee, chaired by Edward Wegman, to review the controversy between MM and MBH. This committee claimed there was improper use of principal component analysis in MBH’s work. Wegman’s report hasn’t been widely publicized. In addition, according to Wegman[i], he has been personally slandered and called a patsy for the Republican Party, even though he has stated publicly that he voted for Al Gore in 2000. Why do you believe this material hasn’t been made more public? Should inaccurate mathematical details remain hidden if doing so results in a better environment?

5) Other scientists have essentially stated that while Mann’s statistical analysis was incorrect, Mann’s conclusion (global warming) is correct, and the focus should be on global warming and not the technical details[ii]. Do you agree with this assessment?

6) Wegman’s report and MM [http://www.climatechangeissues.com/files/PDF/conf05mckitrick.pdf p. 8] describe the difficulty of obtaining the original data (and algorithm) from MBH and Nature (where MBH’s article was published). Under a court subpoena, MBH has shared the raw data; however, to date, they have refused to share the code used in conducting Mann’s analysis, and no one has been able to perfectly replicate his results. Do you feel that researchers and journals should be required to share data after an article has been published? Does your opinion change if the data collection was paid for by the US government?

7) Do you believe that research involving new/advanced statistical techniques should be reviewed by statisticians before it is published?

8) What can be done to ensure proper information is appropriately communicated to the public? What are the consequences of inaccurate data being highly publicized?

Page 16

Week 1: Review of Statistics 101

Lab: Making connections between the two sample t-test, ANOVA, and regression

Week 2-3: Randomization Tests/Nonparametric Tests

Activity: Westvaco discrimination case

Week 4-6: Multiple Regression

Intro Lab: How much is your car worth?

Lab: Population control and economic growth

Week 7-9: Designing an Experiment

Intro Lab: Weight gain in pigs

Lab: Perfection- reaction time tests

Week 10-12: Principal Component Analysis

Intro Lab: Stock market values

Lab: Global warming and the hockey stick graph

Week 13 and 14: Final Projects

Proposed Course