Statistics – O. R. 892 Object Oriented Data Analysis J. S. Marron Dept. of Statistics and Operations Research University of North Carolina


Administrative Info
Details on Course Web Page: http://stor892fall2014.web.unc.edu/
Or: Google "Marron Courses", then Choose This Course

Object Oriented Data Analysis
What is it?
A Sound-Bite Explanation: What is the atom of the statistical analysis?
1st Course: Numbers
Multivariate Analysis Course: Vectors
Functional Data Analysis: Curves
More generally: Data Objects

Object Oriented Data Analysis
Current Motivation: In Complicated Data Analyses, Fundamental (Non-Obvious) Question Is:
What Should We Take as Data Objects?
Key to Focussing Needed Analyses

Mortality Time Series
Improved Coloring: Rainbow Representing Year (Magenta = 1908, Red = 2002)

Time Series of Curves
Just a Set of Curves
But Time Order is Important!
Useful Approach (as above): Use color to code for time
[Color bar: Start to End]

T. S. Toy E.g., PCA View

PCA gives Modes of Variation
But there are Many
Intuitively Useful???
Like harmonics?
Isn't there only 1 mode of variation?
Answer comes in scores scatterplots

T. S. Toy E.g., PCA Scatterplot
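A minimal sketch of how a PCA view and its scores scatterplot might be computed, assuming the curves are stored as rows of a NumPy array. The toy data (a bump drifting sideways), the function name `pca_scores`, and the use of a plain SVD are illustrative assumptions, not the course's own code.

```python
import numpy as np

def pca_scores(curves, n_components=2):
    """Return (modes, scores, explained) for curves stored as rows.

    modes:     unit-length directions (rows), i.e. the modes of variation
    scores:    projections of each mean-centered curve onto those directions
    explained: fraction of total variance carried by each returned mode
    """
    centered = curves - curves.mean(axis=0)           # remove the mean curve
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    modes = vt[:n_components]
    scores = centered @ modes.T                       # one row of scores per curve
    explained = (s ** 2 / np.sum(s ** 2))[:n_components]
    return modes, scores, explained

# Hypothetical toy time series of curves: one shape drifting over time.
t = np.linspace(0, 1, 50)
shifts = np.linspace(-0.2, 0.2, 20)
toy_curves = np.array([np.exp(-((t - 0.5 - c) ** 2) / 0.01) for c in shifts])

modes, scores, explained = pca_scores(toy_curves, n_components=2)
```

Plotting `scores[:, 0]` against `scores[:, 1]` gives the scores scatterplot. For a drifting shape like this one the points typically trace a curve rather than a straight line, which is one way a single underlying mode can spread over several PCs.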

Chemo-metric Time Series, Control

Functional Data Analysis
Suggestion of Clusters
Which Are These?

Functional Data Analysis
Manually Brush Clusters

Functional Data Analysis
Manually Brush Clusters
Clear Alternate Splicing

Limitation of PCA, Toy E.g.

NCI 60: Can we find classes using PCA view?

PCA Visualization of NCI 60 Data
Maybe need to look at more PCs?
Study array of such PCA projections:

NCI 60: Can we find classes using PCA 9-12?

PCA Visualization of NCI 60 Data
Can we find classes using PC directions?
Found some, but not others
Nothing after 1st five PCs
Rest seem to be noise driven
Are There Better Directions?
PCA only feels maximal variation
Ignores Class Labels
How Can We Use Class Labels?

Visualization of NCI 60 Data
How Can We Use Class Labels?

Approach:
Find Directions to Best Separate Classes
In Disjoint Pairs (thus 4 Directions)
Use DWD: Distance Weighted Discrimination
Defined (& Motivated) Later
Project All Data on These 4 Directions

NCI 60: Views using DWD Dirns (focus on biology)
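The approach of fitting one direction per disjoint pair of classes and then projecting all of the data can be sketched as follows. DWD itself is only defined later in the course, so this sketch swaps in the normalized difference of class means as a stand-in direction. The class names, simulated data, and the helper `pair_direction` are hypothetical.

```python
import numpy as np

def pair_direction(X, labels, class_a, class_b):
    """Unit vector from the mean of class_a to the mean of class_b.

    Stand-in only: DWD would solve an optimization problem here instead.
    """
    diff = X[labels == class_b].mean(axis=0) - X[labels == class_a].mean(axis=0)
    return diff / np.linalg.norm(diff)

rng = np.random.default_rng(0)
n_per_class, d = 8, 100
class_names = np.array(["A", "B", "C", "D", "E", "F", "G", "H"])  # hypothetical labels
labels = np.repeat(class_names, n_per_class)
centers = rng.normal(scale=3.0, size=(len(class_names), d))
class_index = np.repeat(np.arange(len(class_names)), n_per_class)
X = centers[class_index] + rng.normal(size=(labels.size, d))

# Disjoint pairs of the 8 classes give 4 directions.
pairs = [("A", "B"), ("C", "D"), ("E", "F"), ("G", "H")]
directions = np.array([pair_direction(X, labels, a, b) for a, b in pairs])

# Project ALL data (not just the pair used for fitting) onto the 4 directions.
projections = X @ directions.T      # shape: (n_samples, 4)
gram = directions @ directions.T    # off-diagonal entries are generally nonzero
```

The off-diagonal entries of `gram` show that directions chosen this way are not forced to be orthogonal, unlike PC directions.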

DWD Visualization of NCI 60 Data
Most cancer types clearly distinct (Renal, CNS, Ovar, Leuk, Colon, Melan)
Using these carefully chosen directions
Others less clear cut:
NSCLC (at least 3 subtypes)
Breast (4 published subtypes)

DWD Visualization
Recall PCA limitations
DWD uses class info
Hence can better separate known classes
Do this for pairs of classes (DWD just on those, ignore others)
Carefully choose pairs in NCI 60 data
Note DWD Directions Not Orthogonal (PCA orthogonality may be too strong a constraint)

NCI 60: Views using DWD Dirns (focus on biology)

PCA Visualization of NCI 60 Data
Can we find classes using PC directions?
Found some, but not others
Not so distinct as in DWD view
Nothing after 1st five PCs
Rest seem to be noise driven
Orthogonality too strong a constraint?
Interesting dirns are nearly orthogonal

Limitation of PCA
Main Point: May be Important Data Structure Not Visible in 1st Few PCs

Yeast Cell Cycle Data
Another Example Showing Interesting Directions Beyond PCA

Yeast Cell Cycle Data
Gene Expression Microarray data
Data (after major preprocessing): Expression level of:
thousands of genes (d ~ 1,000s)
but only dozens of cases (n ~ 10s)
Interesting statistical issue:
High Dimension Low Sample Size data (HDLSS)

Yeast Cell Cycle Data
Data from: Spellman, et al (1998)
Analysis here is from: Zhao, Marron & Wells (2004)

Yeast Cell Cycle Data
Lab experiment:
Chemically synchronize cell cycles of yeast cells
Do cDNA micro-arrays over time
Used 18 time points, over about 2 cell cycles
Studied 4,489 genes (whole genome)
Time series view of data: have 4,489 time series of length 18
Functional Data View: have 18 curves, of dimension 4,489
What are the data objects?

Yeast Cell Cycle Data, FDA View
Central question: Which genes are periodic over 2 cell cycles?

Yeast Cell Cycle Data, FDA View
Periodic genes?
Naive approach: Simple PCA

Yeast Cell Cycle Data, FDA View
Central question: which genes are periodic over 2 cell cycles?
Naive approach: Simple PCA
No apparent (2 cycle) periodic structure?
Eigenvalues suggest large amount of variation
PCA finds directions of maximal variation
Often, but not always, same as interesting directions
Here need better approach to study periodicities

Yeast Cell Cycle Data, FDA View
Approach:
Project on Period 2 Components Only
Calculate via Fourier Representation
To understand, study Fourier Basis
Powerful Fact: linear combos of sin and cos capture phase, by the angle-addition identity

Sin-Cos Phase Shifts are Linear
Powerful Fact: linear combos of sin and cos capture phase, by the angle-addition identity
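Written out, the identity in question is presumably the standard angle-addition formula. The slide's own equation is not preserved in this transcript, so the symbols $x$ (argument) and $c$ (phase shift) are assumptions:

```latex
\sin(x + c) = \cos(c)\,\sin(x) + \sin(c)\,\cos(x)
```

For a fixed shift $c$, the shifted curve is therefore a linear combination of $\sin(x)$ and $\cos(x)$ with coefficients $\cos(c)$ and $\sin(c)$.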

Consequence:

Random Phase Shifts Captured in Just 2 PCs
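A small numerical check of this consequence, as a sketch under assumed settings: 30 simulated sine curves with uniformly random phases on an evenly spaced grid. The grid size and random seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n_curves, n_grid = 30, 200
x = np.linspace(0, 2 * np.pi, n_grid, endpoint=False)
phases = rng.uniform(0, 2 * np.pi, size=n_curves)

# Each row is sin(x + c_i) for a random phase c_i.
curves = np.sin(x[None, :] + phases[:, None])

# PCA via SVD of the mean-centered curves. Every curve is a linear
# combination of sin(x) and cos(x), so at most two singular values
# are nonzero (up to rounding error).
centered = curves - curves.mean(axis=0)
singular_values = np.linalg.svd(centered, compute_uv=False)
energy_in_first_two = np.sum(singular_values[:2] ** 2) / np.sum(singular_values ** 2)
```

Here `energy_in_first_two` equals 1 up to floating-point error, so two PCs carry all of the variation no matter how the phases are drawn.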

Sin-Cos Phase Shifts are Linear
[Figure: n = 30 curves]

Sin-Cos Phase Shifts are Linear
[Figure: n = 30 curves; Random Phase Shifts Captured in Just 2 PCs]

Sin-Cos Phase Shifts are Linear

Fourier Basis

Fourier Basis
Fourier Basis Facts:
Complete Basis (spans whole space)
Exactly True for both versions
Basis Elements are Directions
Will think about as above
Good References:
Brillinger (2001)
Bloomfield (2004)

Fourier Basis

Yeast Cell Cycle Data, FDA View
Approach:
Project on Period 2 Components Only
Calculate via Fourier Representation
Project onto Subspace of Even Frequencies
Keeps only 2-period part of data (i.e. same over both cycles)
Then do PCA on projected data

Fourier Basis
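The projection onto even frequencies followed by PCA might be sketched as below. This assumes each row is one gene's time series at 18 equally spaced time points covering exactly 2 cycles, and it uses NumPy's real FFT in place of an explicit Fourier basis. The simulated data and the helper `project_even_frequencies` are hypothetical.

```python
import numpy as np

def project_even_frequencies(curves):
    """Keep only the even-numbered Fourier frequencies of each row."""
    coeffs = np.fft.rfft(curves, axis=1)
    freqs = np.arange(coeffs.shape[1])
    coeffs[:, freqs % 2 == 1] = 0.0       # drop odd frequencies
    return np.fft.irfft(coeffs, n=curves.shape[1], axis=1)

rng = np.random.default_rng(0)
n_genes, n_times = 200, 18                # 18 time points spanning 2 cycles
t = np.arange(n_times)
phases = rng.uniform(0, 2 * np.pi, size=(n_genes, 1))
periodic = np.cos(2 * np.pi * 2 * t / n_times + phases)   # frequency 2: one period per cycle
data = periodic + rng.normal(scale=1.0, size=(n_genes, n_times))

projected = project_even_frequencies(data)

# PCA on the projected data (rows = genes).
centered = projected - projected.mean(axis=0)
_, s, vt = np.linalg.svd(centered, full_matrices=False)
explained = s ** 2 / np.sum(s ** 2)
```

With an even number of time points, keeping the even frequencies is the same as averaging each series with its copy shifted by one cycle, which is why the projected series is identical over both cycles.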

Yeast Cell Cycles, Freq. 2 Proj.
[Figure: PCA on Freq. 2 Periodic Component of Data]

Yeast Cell Cycles, Freq. 2 Proj.
PCA on periodic component of data
Hard to see periodicities in raw data
But very clear in PC1 (~sin) and PC2 (~cos)
PC1 and PC2 explain 65% of variation (see residuals)
Recall linear combos of sin and cos capture phase, by the angle-addition identity

Frequency 2 Analysis
Important features of data appear only at frequency 2
Hence project data onto 2-dim space of sin and cos (freq. 2)
Useful view: scatterplot
Similar to PCA projections, except directions are now chosen, not variance maximizing
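A sketch of this chosen-direction projection, assuming 18 equally spaced time points and unit-length sin and cos vectors at frequency 2. The polar angle of each projected point then reads off its phase. The three test genes and the helper `freq2_coordinates` are hypothetical.

```python
import numpy as np

n_times = 18
t = np.arange(n_times)

# Unit-length cos and sin directions at frequency 2 (two full periods over 18 points).
cos_dir = np.cos(2 * np.pi * 2 * t / n_times)
sin_dir = np.sin(2 * np.pi * 2 * t / n_times)
cos_dir /= np.linalg.norm(cos_dir)
sin_dir /= np.linalg.norm(sin_dir)

def freq2_coordinates(curves):
    """Project each row onto the (cos, sin) directions; return coords, amplitude, phase."""
    coords = np.column_stack([curves @ cos_dir, curves @ sin_dir])
    amplitude = np.hypot(coords[:, 0], coords[:, 1])
    phase = np.arctan2(coords[:, 1], coords[:, 0])    # polar angle of each point
    return coords, amplitude, phase

# Hypothetical genes with known phase shifts.
true_shift = np.array([0.0, np.pi / 2, -np.pi / 3])
genes = np.cos(2 * np.pi * 2 * t[None, :] / n_times - true_shift[:, None])
coords, amplitude, phase = freq2_coordinates(genes)
```

A scatterplot of `coords[:, 0]` against `coords[:, 1]` is the 2-dim view. Unlike PC directions, the two axes are fixed in advance rather than chosen to maximize variance.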

[Figure: scatterplot with color legend]

Frequency 2 Analysis
Project data onto 2-dim space of sin and cos (freq. 2)
Useful view: scatterplot
Angle (in polar coordinates) shows phase
Colors: Spellman's cell cycle phase classification
Black was labeled not periodic
Within class phases approximately the same, but notable differences
Later will try to improve phase classification

Batch and Source Adjustment
For Stanford Breast Cancer Data (C. Perou)
Analysis in Benito, et al (2004) Bioinformatics, 20, 105-114.
https://genome.unc.edu/pubsup/dwd/
Adjust for Source Effects: Different sources of mRNA
Adjust for Batch Effects: Arrays fabricated at different times

Idea Behind Adjustment
Find direction from one to other
Shift data along that direction
Details of DWD Direction developed later
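The shift idea above can be sketched as follows. Since the DWD direction is developed later, this sketch uses the normalized difference of group means as a stand-in direction, and it assumes each group is shifted so that its mean projection on that direction becomes zero. The batch labels, simulated data, and the helper `shift_adjust` are hypothetical.

```python
import numpy as np

def shift_adjust(X, groups, group_a, group_b):
    """Shift two groups along the direction joining their means.

    Stand-in direction: normalized mean difference (DWD would supply its own).
    Each group is moved so that its mean projection on the direction is zero.
    """
    in_a, in_b = groups == group_a, groups == group_b
    w = X[in_b].mean(axis=0) - X[in_a].mean(axis=0)
    w /= np.linalg.norm(w)
    adjusted = X.astype(float).copy()
    for mask in (in_a, in_b):
        mean_proj = (X[mask] @ w).mean()
        adjusted[mask] = X[mask] - mean_proj * w      # slide the whole group along w
    return adjusted, w

rng = np.random.default_rng(1)
d = 50
batch = np.repeat(np.array(["batch1", "batch2"]), 20)        # hypothetical labels
offset = np.where(batch[:, None] == "batch2", 2.0, 0.0)      # batch effect in every coordinate
X = rng.normal(size=(batch.size, d)) + offset
adjusted, w = shift_adjust(X, batch, "batch1", "batch2")
```

Only the component of each point along `w` changes. Everything orthogonal to `w`, including within-group variation, is left as it was.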

Source Batch Adj: Raw Breast Cancer data
Source Batch Adj: Source Colors
Source Batch Adj: Batch Colors
Source Batch Adj: Biological Class Colors
Source Batch Adj: Biological Class Col. & Symbols
Source Batch Adj: Biological Class Symbols
Source Batch Adj: Source Colors
Source Batch Adj: PC 1-3 & DWD direction
Source Batch Adj: DWD Source Adjustment
Source Batch Adj: Source Adj'd, PCA view
Source Batch Adj: Source Adj'd, Class Colored
Source Batch Adj: Source Adj'd, Batch Colored
Source Batch Adj: Source Adj'd, 5 PCs
Source Batch Adj: S. Adj'd, Batch 1,2 vs. 3 DWD
Source Batch Adj: S. & B1,2 vs. 3 Adjusted
Source Batch Adj: S. & B1,2 vs. 3 Adj'd, 5 PCs
Source Batch Adj: S. & B Adj'd, B1 vs. 2 DWD
Source Batch Adj: S. & B Adj'd, B1 vs. 2 Adj'd
Source Batch Adj: S. & B Adj'd, 5 PC view
Source Batch Adj: S. & B Adj'd, 4 PC view
Source Batch Adj: S. & B Adj'd, Class Colors
Source Batch Adj: S. & B Adj'd, Adj'd PCA
Source Batch Adj: Raw Data, Tree View

Caution on Colors
~10 % of Males are: Red Green Color Blind
Can't distinguish Red vs. Green
Should use better scheme

Caution About Tree View
Can Miss Important Features

Caution About Tree View
Important Clusters, not in Coord Axis Dirn

Source Batch Adj: Raw Data, Tree View
Source Batch Adj: Raw Data, Array Tree
Source Batch Adj: Raw Array Tree, Source Colored
Source Batch Adj: Raw Array Tree, Batch Colored
Source Batch Adj: Raw Array Tree, Class Colored
Source Batch Adj: DWD Adjusted Data, Tree View
Source Batch Adj: DWD Adjusted Data, Array Tree
Source Batch Adj: DWD Adjusted Data, Source Colored
Source Batch Adj: DWD Adjusted Data, Batch Colored
Source Batch Adj: DWD Adjusted Data, Class Colored

DWD: A look under the hood
Distance Weighted Discrimination (DWD)
Modification of Support Vector Machine
For HDLSS data
Uses 2nd Order Cone programming
Will study later
Main Goal: Find direction that separates data classes in best possible way

DWD: Why not PC1?
PC 1 Direction feels variation, not classes
Also eliminates (important?) within class variation

DWD: Why not PC1?
Direction driven by classes
Sliding maintains (important?) within class variation

E.g. even worse for PCA
PC1 direction is worst possible

But easy for DWD
Since DWD uses class label information
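A toy illustration of this contrast, as a sketch rather than a DWD implementation: two simulated classes vary strongly along one coordinate but differ only along the other. The label-driven direction here is the normalized difference of class means, used as a simple stand-in for DWD. All data and names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100

def make_class(center_2):
    """Large variation in coordinate 1, class-specific center in coordinate 2."""
    pts = np.column_stack([rng.normal(scale=3.0, size=n),
                           rng.normal(loc=center_2, scale=0.1, size=n)])
    pts[:, 0] -= pts[:, 0].mean()         # force identical class means in coordinate 1
    return pts

class_pos, class_neg = make_class(+1.0), make_class(-1.0)
X = np.vstack([class_pos, class_neg])

# PC1: direction of maximal variation, computed without class labels.
centered = X - X.mean(axis=0)
pc1 = np.linalg.svd(centered, full_matrices=False)[2][0]

# Label-driven stand-in direction: normalized difference of class means.
w = class_pos.mean(axis=0) - class_neg.mean(axis=0)
w /= np.linalg.norm(w)

gap_on_pc1 = abs((class_pos @ pc1).mean() - (class_neg @ pc1).mean())
gap_on_w = abs((class_pos @ w).mean() - (class_neg @ w).mean())
```

In this construction PC1 lines up with the high-variance coordinate, so the class means nearly coincide when projected on it, while the label-driven direction separates the two classes completely.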

DWD does not solve all problems
Only handles means, not differing variation

Interesting Benchmark Data Set
NCI 60 Cell Lines
Interesting benchmark, since same cells
Data Web available: http://discover.nci.nih.gov/datasetsNature2000.jsp
Both cDNA and Affymetrix Platforms
Different from Breast Cancer Data, which had no common samples

NCI 60: Raw Data, Platform Colored

NCI 60: Raw Data, Tree View
NCI 60: Raw Data
NCI 60: Raw Data, Before DWD Adjustment
NCI 60: Before & After DWD adjustment
NCI 60: Before & After, new scales
NCI 60: After DWD
NCI 60: DWD adjusted data
NCI 60: Before Column Mean Adjustment
NCI 60: Before & After Column Mean Adjustment
NCI 60: Before & After Col. Mean Adj., Rescaled
NCI 60: After DWD & Column Mean Adj.
NCI 60: DWD & Column Mean Adjusted
NCI 60: Before Column Stand. Dev. Adjustment
NCI 60: Before and After Column S.D. Adjustment
NCI 60: Before and After Col. S.D. Adj., Rescaled
NCI 60: After Column Stand. Dev. adjustment
NCI 60: Fully Adjusted Data
NCI 60: Fully Adjusted Data, Platform Colored

NCI 60: Fully Adjusted Data, Melanoma Cluster
BREAST.MDAMB435 BREAST.MDN MELAN.MALME3M MELAN.SKMEL2 MELAN.SKMEL5 MELAN.SKMEL28 MELAN.M14 MELAN.UACC62 MELAN.UACC257

NCI 60: Fully Adjusted Data, Leukemia Cluster
LEUK.CCRFCEM LEUK.K562 LEUK.MOLT4 LEUK.HL60 LEUK.RPMI8266 LEUK.SR

Another DWD Appln: Visualization
Recall PCA limitations
DWD uses class info
Hence can better separate known classes
Do this for pairs of classes (DWD just on those, ignore others)
Carefully choose pairs in NCI 60 data
Shows Effectiveness of Adjustment

NCI 60: Views using DWD Dirns (focus on biology)

DWD Visualization of NCI 60 Data
Most cancer types clearly distinct (Renal, CNS, Ovar, Leuk, Colon, Melan)
Using these carefully chosen directions
Others less clear cut:
NSCLC (at least 3 subtypes)
Breast (4 published subtypes)
DWD adjustment was very effective (very few black connectors visible)