Post on 20-Dec-2015
Principal Component Analysis (PCA)
• Principal component analysis (PCA) creates new variables (components) that consist of uncorrelated, linear combinations of the original variables.
• PCA is used to simplify the data structure and still account for as much of the total variation in the original data as possible.
Simple Case: Stock Market Data
[Figure: Time Series Plot of S&P (x-axis: Index; y-axis: S&P)]
[Figure: Time Series Plot of Dow (x-axis: Index; y-axis: Dow)]
Can the data be reduced to a single linear combination of the original variables without losing much information?
3 Steps for PCA
1) Calculate the correlation matrix
2) Calculate the eigenvectors of the correlation matrix
3) Multiply the eigenvectors by the standardized original data. The first principal component (PC1) is a linear combination of the standardized data where the first eigenvector is used as the weights.
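The three steps can be sketched in a few lines of NumPy. This is a minimal illustration, not the code used to produce the slides: the function name `pca_from_correlation` and the two synthetic series (stand-ins for the Dow and S&P closing values, which are not reproduced here) are assumptions made for the example.

```python
import numpy as np

def pca_from_correlation(data):
    """Return eigenvalues, eigenvectors, and component scores for the columns of `data`."""
    data = np.asarray(data, dtype=float)
    # Standardize each column: mean 0 and sample standard deviation 1.
    z = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)
    # Step 1: correlation matrix of the original variables.
    corr = np.corrcoef(data, rowvar=False)
    # Step 2: eigenvectors of the correlation matrix.
    # eigh is meant for symmetric matrices and returns eigenvalues in ascending order.
    eigenvalues, eigenvectors = np.linalg.eigh(corr)
    order = np.argsort(eigenvalues)[::-1]          # largest eigenvalue first
    eigenvalues = eigenvalues[order]
    eigenvectors = eigenvectors[:, order]
    # Step 3: multiply the standardized data by the eigenvectors.
    # Column 0 of `scores` is PC1, weighted by the first eigenvector.
    scores = z @ eigenvectors
    return eigenvalues, eigenvectors, scores

# Synthetic stand-in data (assumption): two series driven by one common factor.
rng = np.random.default_rng(0)
common = rng.normal(size=250)
series_a = 11000 + 500 * common + 100 * rng.normal(size=250)
series_b = 1300 + 60 * common + 15 * rng.normal(size=250)

eigenvalues, eigenvectors, scores = pca_from_correlation(
    np.column_stack([series_a, series_b])
)
share_pc1 = eigenvalues[0] / eigenvalues.sum()
print(f"PC1 accounts for {share_pc1:.1%} of the total variation")
```

Each eigenvalue equals the variance of the corresponding component, so the ratio printed at the end is the proportion of total variation retained when only PC1 is kept.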
Standardized closing values of 2006 Dow Index vs 2006 S&P 500
[Figure: Scatterplot of ZSandP vs ZDow]
Direction of the first principal component (the first eigenvector).
[Figure: Scatterplot of ZSandP vs ZDow]
[Figure: Scatterplot of ZSandP vs ZDow]
Rotating the data to the first principal component. PC1 is a linear combination of the standardized data, with the first eigenvector used as the weights.
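For two standardized variables the rotation can be written out explicitly. The short derivation below assumes that r denotes the sample correlation between ZDow and ZSandP and that r > 0; the symbols R, r, λ, and e are introduced here for illustration and do not appear on the slides.

```latex
R = \begin{pmatrix} 1 & r \\ r & 1 \end{pmatrix},
\qquad
\lambda_1 = 1 + r,\quad
e_1 = \frac{1}{\sqrt{2}}\begin{pmatrix} 1 \\ 1 \end{pmatrix},
\qquad
\lambda_2 = 1 - r,\quad
e_2 = \frac{1}{\sqrt{2}}\begin{pmatrix} 1 \\ -1 \end{pmatrix}

\mathrm{PC1} = \frac{Z_{\mathrm{Dow}} + Z_{\mathrm{SandP}}}{\sqrt{2}},
\qquad
\frac{\lambda_1}{\lambda_1 + \lambda_2} = \frac{1 + r}{2}
```

Under these assumptions PC1 is proportional to the average of the two standardized series, and the closer r is to 1, the closer the share of total variation carried by PC1 is to 100%.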
[Figure: Time Series Plot of Standardized Stock Market Values and First Principal Component (x-axis: 2006 Business Day; y-axis: Standardized Stock Market Values; series: PC1, ZDow, ZSandP)]
LAB: Principal Component Analysis in Environmental Studies
The Debate Over Statistical Techniques Used in the Derivation of the Global Warming Hockey Stick Graph
Figure 1: The instrumental record of global average temperatures.
The Hockey Stick Graph
Figure 2: Mann’s 1998 Hockey Stick Graph
• In 1998 Mann, Bradley, and Hughes (MBH) used a modified PCA to reduce 70 series of proxy data to one principal component (PC1).
• MBH’s graph was widely used as evidence of global warming.
• In 2003 McIntyre and McKitrick (MM) claimed that the graph was not correct, but had significant trouble getting published.
• In 2005 MM published a simulation study that showed that MBH’s modified PCA technique would consistently result in a hockey stick shape.
• In 2006 Ed Wegman provided an ad hoc committee report to Congress on the “Hockey Stick Global Climate Reconstruction”, http://www.heartland.org/pdf/19383.pdf .
• MBH used data from 1400-1980: 581 observations for each of the 70 proxy variables (tree ring data).
• Each variable would typically be standardized by the following formula:
  $\dfrac{X - \bar{X}_{[1400:1980]}}{S_{[1400:1980]}}$
• MBH used a ‘decentered’ standardization:
  $\dfrac{X - \bar{X}_{[1902:1980]}}{S_{[1902:1980]}}$
• What are the mean and standard deviation of a ‘decentered’ variable?
• How will this impact principal component analysis?
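The two standardizations can be compared numerically. The sketch below is only an illustration under stated assumptions: the helper `standardize`, the `calibration` mask, and the synthetic series `x` (white noise with a mild upward drift after 1902) are hypothetical and are not MBH's data or code.

```python
import numpy as np

years = np.arange(1400, 1981)        # 581 annual observations, 1400-1980
calibration = years >= 1902          # the 1902-1980 sub-period

def standardize(x, mask=None):
    """Subtract the mean and divide by the sample standard deviation.

    Both statistics are computed over the rows selected by `mask`
    (all rows when `mask` is None) but applied to every row.
    """
    x = np.asarray(x, dtype=float)
    ref = x if mask is None else x[mask]
    return (x - ref.mean(axis=0)) / ref.std(axis=0, ddof=1)

# Hypothetical series (assumption): noise plus a drift that starts in 1902.
rng = np.random.default_rng(1)
x = rng.normal(size=years.size) + np.where(calibration, 0.02 * (years - 1902), 0.0)

z_full = standardize(x)                     # usual standardization, 1400-1980
z_dec = standardize(x, calibration)         # 'decentered', 1902-1980 statistics

print(f"full period:  mean={z_full.mean():.3f}, sd={z_full.std(ddof=1):.3f}")
print(f"decentered:   mean={z_dec.mean():.3f}, sd={z_dec.std(ddof=1):.3f}")
```

The usual standardization gives mean 0 and standard deviation 1 over 1400-1980 by construction. The decentered version has mean 0 and standard deviation 1 only over 1902-1980, so over the full period its mean is generally nonzero and its standard deviation generally differs from 1.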
Questions 1 and 2: Generate a matrix of random AR(1) data. AR(1) data follows the general pattern of tree ring growth in many trees.
Question 3: Standardize the data matrix
Question 4: Perform PCA on a random AR(1) matrix with 70 series.
Question 5: Write a function that repeats question 4 ten times.
Question 6: Write a function that repeats question 5, but uses a ‘decentered’ standardization.
Does it look like ‘hockey stick’ shaped graphs occur more often with decentered data? Can we conduct a more thorough simulation study?
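One possible way to approach Questions 1 through 6 is sketched below. All names (`ar1_matrix`, `first_pc`, `blade_index`, `simulate`), the AR(1) coefficient `phi=0.5`, and the shape measure are assumptions chosen for illustration. Taking PC1 from a singular value decomposition of the decentered matrix is also an assumption about how a modified PCA might be carried out; it is not a reproduction of MBH's or MM's code.

```python
import numpy as np

def ar1_matrix(n_obs, n_series, phi, rng):
    """Simulate independent AR(1) series x[t] = phi * x[t-1] + e[t], one per column."""
    noise = rng.normal(size=(n_obs, n_series))
    x = np.empty_like(noise)
    x[0] = noise[0] / np.sqrt(1.0 - phi**2)   # start from the stationary distribution
    for t in range(1, n_obs):
        x[t] = phi * x[t - 1] + noise[t]
    return x

def first_pc(x, last_k=None):
    """PC1 scores after standardizing each column.

    Mean and standard deviation come from all rows, or from only the
    last `last_k` rows when `last_k` is given ('decentered').
    """
    ref = x if last_k is None else x[-last_k:]
    z = (x - ref.mean(axis=0)) / ref.std(axis=0, ddof=1)
    # The first right singular vector of z holds the PC1 weights. With the usual
    # standardization it is the first eigenvector of the correlation matrix.
    _, _, vt = np.linalg.svd(z, full_matrices=False)
    return z @ vt[0]

def blade_index(pc, last_k):
    """Hypothetical shape measure: distance between the late-period mean
    and the overall mean of PC1, in units of PC1's standard deviation."""
    return abs(pc[-last_k:].mean() - pc.mean()) / pc.std(ddof=1)

def simulate(n_reps=10, n_obs=581, n_series=70, last_k=79, phi=0.5, seed=0):
    """Repeat the PCA `n_reps` times with both standardizations on the same data."""
    rng = np.random.default_rng(seed)
    centered, decentered = [], []
    for _ in range(n_reps):
        x = ar1_matrix(n_obs, n_series, phi, rng)
        centered.append(blade_index(first_pc(x), last_k))
        decentered.append(blade_index(first_pc(x, last_k), last_k))
    return np.array(centered), np.array(decentered)

centered, decentered = simulate()
print(f"mean blade index, usual standardization: {centered.mean():.2f}")
print(f"mean blade index, decentered:            {decentered.mean():.2f}")
```

Increasing `n_reps`, varying `phi`, and plotting the individual PC1 series rather than relying on a single summary number would turn this into a more thorough simulation study.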
Simulation Study of the Hockey Stick Graph
The Hockey Stick Graph
1) Why do you think that the IPCC and supporters of the Kyoto accord prominently featured Mann’s (i.e. MBH’s) graph?
2) This paper shows reasons to believe that MBH’s graph was developed inappropriately; does this mean that there is no global warming?
3) State specifically how you would expect proponents and opponents to respond to MM’s and MBH’s work for their own political/personal benefit.
4) In 2006, the Chairman of the Committee on Energy and Commerce as well as the Chairman of the Subcommittee on Oversight and Investigations requested an ad hoc committee, chaired by Edward Wegman, to review the controversy between MM and MBH. This committee claimed there was improper use of principal component analysis in MBH’s work. Wegman’s report hasn’t been widely publicized. In addition, according to Wegman[i], he has been personally slandered and called a patsy for the Republican Party, even though he has stated publicly that he voted for Al Gore in 2000. Why do you believe this material hasn’t been made more public? Should inaccurate mathematical details remain hidden if hiding them results in a better environment?
5) Other scientists have essentially stated that while Mann’s statistical analysis was incorrect, Mann’s conclusion (global warming) is correct and the focus should be on global warming and not the technical details[ii]. Do you agree with this assessment?
6) Wegman’s report and MM [http://www.climatechangeissues.com/files/PDF/conf05mckitrick.pdf p. 8] describe the difficulty of obtaining the original data (and algorithm) from MBH and Nature (where MBH’s article was published). Under a court subpoena, MBH has shared the raw data; however, to date, they have refused to share the code used in conducting Mann’s analysis, and no one has been able to perfectly replicate his results. Do you feel that researchers and journals should be required to share data after an article has been published? Does your opinion change if the data collection was paid for by the US government?
7) Do you believe that research involving new/advanced statistical techniques should be reviewed by statisticians before it is published?
8) What can be done to ensure proper information is appropriately communicated to the public? What are the consequences of inaccurate data being highly publicized?
Proposed Course
Week 1: Review of Statistics 101
Lab: Making connections between the two sample t-test, ANOVA, and regression
Weeks 2-3: Randomization Tests/Nonparametric Tests
Activity: Westvaco discrimination case
Weeks 4-6: Multiple Regression
Intro Lab: How much is your car worth?
Lab: Population control and economic growth
Weeks 7-9: Designing an Experiment
Intro Lab: Weight gain in pigs
Lab: Perfection - reaction time tests
Weeks 10-12: Principal Component Analysis
Intro Lab: Stock market values
Lab: Global warming and the hockey stick graph
Weeks 13-14: Final Projects