theanalysis of gene expression data -...
TRANSCRIPT
Giovanni ParmigianiElizabeth S. GarrettRafael A. IrizarryScott L. ZegerEditors
The Analysis ofGene Expression DataMethods and Software
With 113 Figures, Including 37 Color Plates
, Springer
vii
1 The Analysis of Gene Expression Data: An Overview ofMethods and Software 1Giovanni Parmigiani, Elizabeth S. Garrett, Rafael A. Irizarry,and Scott L. Zeger
1.1 Measuring Gene Expression Using Microarrays. . 11.1.1 Microarray Technologies . . . . . . . . . . 11.1.2 Sources of Variation in Gene Expression
Measurements Using Microarrays . . 41.1.3 Phases of Microarray Data Analysis . . 5
1.2 Design of Microarray Experiments . . . . . . . . 71.2.1 Replication and Sample Size Considerations. 71.2.2 Design of Two-Channel Arrays 9
1.3 Data Storage . . . . 91.3.1 Databases.......... . . 91.3.2 Standards........ .. . . 101.3.3 Statistical Analysis Languages 11
1.4 Preprocessing........... .. . 121.4.1 Image Analysis . . . . . . . . . 121.4.2 Visualizations for Quality Control 121.4.3 Background Subtraction . . . . . . 131.4.4 Probe-level Analysis of Oligonucleotide Arrays 141.4.5 Within-Array Normalization of cDNA Arrays. 151.4.6 Normalization Across Arrays . . . . 15
1.5 Screening for Differentially Expressed Genes 161.5.1 Estimation or Selection? . . . . . . . 161.5.2 One Problem or Many? . . . . . . . 171.5.3 Selection and False Discovery Rates 181.5.4 Beyond Two Groups . . . . . . . . 19
1.6 Challenges of Genome Biometry Analyses . 19
Contents
Preface
Contributors
Color Insert
v
xvii
(facing page 236)
46
viii Contents
1.7 Visualization and Unsupervised Analyses 211.7.1 Profile Visualization . . 211.7.2 Why Clustering? . . . . . . . . . 221.7.3 Hierarchical Clustering . . . .. 231.7.4 k-Means Clustering and Self-Organizing Maps 251.7.5 Model-Based Cluste ring . . . . . 261.7.6 Principal Components Analysis. . . . . 261.7.7 Mult idimensional Scaling . . . . . . . . 271.7.8 Identifying Novel Molecular Subclasses. 271.7.9 T ime Series Analysis . 28
1.8 Prediction . . . . .... . . . 291.8.1 Prediction Tools . . . 291.8.2 Dimension Reduction 301.8.3 Evaluation of Classifiers 301.8.4 Regression-Based Approaches. 311.8.5 Classification Trees. . . . . . . 311.8.6 Probabilistic Model-Based Classification. 321.8.7 Discriminant Analysis . . . . 331.8.8 Nearest-Neighbor Classifiers . 331.8.9 Support Vector Machines 33
1.9 Free and Open-Source Software . . 331.9.1 Whitehead Institute Tools . 341.9.2 Eisen Lab Tools . . 341.9.3 TIGR Tools. . . . . 341.9.4 GeneX and Cyber'I' 351.9.5 Projects at NCBI 351.9.6 BRB . . ... .. .. 351.9.7 The OOML library. 361.9.8 MatArray . 361.9.9 BASE 36
1.10 Conclusion. . .. . 36
2 Visualization and Annotation ofGenomic ExperimentsRobert Gentleman and Vincent Carey
2.1 Introduction.......... . . . . . ... 462.2 Motivations for Component-Based Software 472.3 Formalism . .. ... ..... . . .... . . 492.4 Bioconductor Software for Filtering, Explorin g, and
Interp reting Microarray Exp eriments . . . . . . . 502.4.1 Formal Dat a Structures and Methods for
Multiple Microarrays . . . . . . . . . . . . 502.4.2 Tools for Filtering Gene Expression Data:
The Closure Concept 54
Contents IX
4 An R Package for Analyses of AffymetrixOligonucleotide Arrays 102Rafael A. Irizarry, Laurent Gaut ier, and Leslie M. Cope
21212223252626272728292930303131323333333334343435353536363636
2.4.3 Expression Density Diagnost ics:High-Throughput Exploratory Dat aAnalysis for Microarrays .
2.4.4 Annotation2.5 Visualizat ion ..
2.5.1 Chromosomes .2.6 Applications ..
2.6.1 A Case Study of Gene Filtering .2.6. 2 Application of Expression Density Diagnostics
2.7 Conclusions . . . . .. . .. . .. .. . . . . . . . . .
3 Bioconductor R Packages for Exploratory Analysisand Normalization of cDNA Microarray DataSandrine Dud oit and Jean Yee Hwa Yang
3.1 Introduction.......... . . .. . . . .3.1.1 Overview of Packages .3.1.2 Two-Color cDNA Microarray Experiments
3.2 Methods . . . .. . .3.2.1 Standards for Microarr ay Dat a .3.2.2 Object-Oriented Programming: Microarray
Classes and Methods. . . . . . . . . . . . .3.2.3 Diagnost ic Plots . . . . . . . . . . . . . . .3.2.4 Normalization Using Robust Local Regression
3.3 Application: Swirl Microarray Experiment . . . . . . .3.4 Software ..... . .. ...... . .. ..... .. .
3.4.1 Package marrayClasses- Classes and Methodsfor cDNA Microarray Data .
3.4.2 Package marraylnput-Data Input for cDNAMicroarrays . . . . . . . . . . . . . . . . . . .
3.4.3 Package marrayPlots-Diagnost ic Plots forcDNA Microarray Data .
3.4.4 Package marrayNorm-Location and ScaleNormalization for cDNA Microarr ay Data .
3.5 Discussion . . .. .
4.1 Introduction . ..4.2 Methods . . ... . . .. . .. . . .
4.2.1 Notat ion .4.2.2 The CEL/CDF Convent ion4.2.3 Probe Pair Sets. . . .4.2.4 Probe-Level Obj ects ....
5557585964646770
73
7373757676
7778798081
81
89
91
9699
102103103104106107
6 Expression Profiler 142Jaak Vilo, Misha Kapushesky , Patrick Kem meren, Ugis Sarkans,and Alvis Brazma
142143143144146147
III113113115115118118
108
120
6.1 Introduction .6.2 EP CLUST .
6.2.1 EP CLUST : Dat a Import6.2.2 EP CLUST: Data Filtering6.2.3 EPCLUST: Data Annotation6.2.4 EPCLUST: Data Environment
4.2.5 Normalization .4.2.6 Exploratory Data Analysis of
Probe-L evel Data . .4.3 Applicat ion .
4.3.1 Expression Measures .4.4 Software... ....··· ··
4.4.1 A Case Study . . . . .4.4.2 Extending the Package .
4.5 Conclusion. . . . .....·
5 DNA-Chip Analyzer (dChip)Cheng Li and Wing Hung Wong
5.1 Introduction · 1205.2 Methods. . ··· 121
5.2.1 Normalization of Arrays Based on an"Invariant Set" . . . . . . 121
5.2.2 Model-Based Analysis ofOligonucleotide Arrays . . 122
5.2.3 Confidence Interval for Fold Change . 1225.2.4 Pooling Replicate Arrays Considering
Measurement Accuracy . . . 1245.3 Software and Applications . . . . . . 125
5.3.1 Reading in Array Data Files 1255.3.2 Viewing an Array Image. 1275.3.3 Normalizing Arrays . . . . . 1295.3.4 Viewing PM /MM Data . . . 1295.3.5 Calculating Model-Based Expression Indexes 1315.3.6 Filter Genes 1325.3.7 Hierarchical Clustering . . . . . 1335.3.8 Comparing Samples . . . . . . . 1355.3.9 Mapping Genes to Chromosomes 1375.3.10 Sample Classification by Linear
Discriminant Analysis 1385.4 Discussion........... . . . . .. 139
x Contents
108
120
120121
121
122122
124125125127129129131132
• 133• 135
137
Contents xi
6.2.5 EP CLUST: Data Analysis . . . . . . . . . 1486.3 URLMAP: Cross-Linking of the Analysis Results
Between the Tools and Databases . . . . . . . . . 1516.4 EP:GO GeneOntology Browser 1526.5 EP:PPI: Comparison of Pro tein Pairs and Expression. 1536.6 Pattern Discovery, Pattern Matching, and
Visualization Tools . . . . . . . . . . . . . . . . . . . 1546.7 An Example of th e Data Analysis and Visualizations
Performed by the Tools in Expression Profiler 1546.8 Integration of Expression Profiler with Pu blic
Microarray Dat abases . 1596.9 Conclusions.. . . . . . . ... . . . . . . . . 160
7 An S-PLUS Library for the Analysis and Visualizationof Differential Expression 163Jae K. Lee and Michael 0 'Connell
7.1 Int roduc t ion . ............. 1637.2 Assessment of Differential Expression 164
7.2.1 Local Pooled Error. . . . . . 1657.2.2 Tests for Differential Expression 1697.2.3 Cluster Analysis and Visualization 171
7.3 Analysis of Melanoma Expression . . . . . 1747.3.1 Tests for Differential Expression 1757.3.2 Cluster Analysis and Visualizat ion 1787.3.3 Annotation 180
7.4 Discussion . . . . .... . . . . . . . . . . 181
8 DRAGON and DRAGON View: Methods for theAnnotation, Analysis, and Visualization of Large-ScaleGene Expression Data 185Christopher M.L .S. Bouton, George Henry , Carlo Colantuoni,and Jonathan Pevsner
8.1 Introduction........... 1858.2 System and Methods . . . . . . 189
8.2.1 Overview of DRAGON 1898.2.2 DRAGON's Hardwar e, Software, and
Database Architecture . . . . . . . . . 1908.2.3 Cross-Referencing Information in DRAGON 1928.2.4 The DRAGON Search and Annotate Tools 1938.2.5 The DRAGON View Data Visualization Tools 1968.2.6 DRAGON Gram: A Novel Visualization Tool 198
8.3 Implementation .. .. .. 1998.4 Discussion and Conclusion . . . . . . . . . . . . . . . 204
xii Contents
9 SNOMAD: Biologist-Friendly Web Toolsfor the Standardization and NOrmalizationof Microarray Data 210Carlo Colantuoni, George Henry , Christopher M.L.S. Bouton,Scott L. Zeger, and Jonathan Pevsner
9.1 Introduction 2109.2 Methods and Application. . . . . . . . . . . 212
9.2.1 Overview of Exp eriment al and DataAnalysis Procedures . . . . . 212
9.2.2 Background Subtraction. . . . . . . 2149.2.3 Global Mean Normalization . . . . . 2149.2.4 Standard Data Transformation and
Visualization Methods . . . . . . . . 2159.2.5 Local Mean Normalization Across Element
Signal Intensity . . . . . . . . . . . . . . . . 2179.2.6 Local Variance Correc tion Across Element
Signal Intensity . . . . . . . . . . . . . 2199.2.7 Local Mean Normalization Across the
Microarray Surface . 2239.3 Software 2259.4 Discussion .. .... . . .. 226
10 Microarray Analysis Using the MicroArray Explorer 229Peter F. Lemkin, Gregory C. Thornwall, and Jai Evans
10.1 Introduction 2~9
10.1.1 Need for the Methodology . . . . . . 23010.1.2 Basic Ideas Behind th e Approach. . 231
10.2 Methods-Statistical and Informatics Basis. 23210.2.1 Analysis Paradigm. . . . . . 23510.2.2 Particular Analysis Methods 23810.2.3 Data Conversion . . . . . . . 238
10.3 Software . . . . . . . . . . . . . . . . 23910.3.1 System Design-Software Implementation 24410.3.2 How to Download th e Software. . . . . . . 24710.3.3 Strengths and Weaknesses of th e Approach 248
10.4 Applications 24910.5 Discussion . . . .. 251
11 Parametric Empirical Bayes Methods for Microarrays 254Michael A . Newton and Christina Kendziorski
11.1 Introduction 25411.2 EB Methods . . . . . . . . . . . . . . . . 256
11.2.1 Canonical EB Example . . . . . 25611.2.2 General Model Structure: Two Condi tions . 256
285285285287287288289
272273273275276277278280283283283284
259260261263269
258
272
291
Contents xiii
11.2.3 Multiple Conditions . . . .11.2.4 The Gamma-Gamma and
Lognormal-Normal Models11.2.5 Model Fitting.Software ..ApplicationDiscussion .
11.311.411.5
12 SAM Thresholding and False Discovery Ratesfor Detecting Differential Gene Expressionin DNA MicroarraysJohn D. Storey and Robert Tibshirani
12.1 Introduction .12.2 Methods and Applications .
12.2.1 Multiple Hypothesis Testing12.2.2 An Application .12.2.3 Forming the Test Statistics .12.2.4 Calculating the Null Distribution .12.2.5 The SAM Thresholding Procedure12.2.6 Estimating False-Discovery Rates.
12.3 Software . . . . . . . . . . . ..12.3.1 Obtaining the Software12.3.2 Data Formats. . . . ..12.3.3 Response Format . . ..12.3.4 Example Input Data File for an
Unpaired Problem .12.3.5 Block Permutations .12.3.6 Normalization of Experiments12.3.7 Handling Missing Data . .. .12.3.8 Running SAM .... . . . . .12.3.9 Format of the Significant Gene List
12.4 Discussion .
13 Adaptive Gene Picking with Microarray Data:Detecting Important Low Abundance SignalsYi Lin, Samuel T . Nadler, Hong Lan, Alan D. Attie,and Brian S. Yandell
13.1 Introduction. . .. . ... ... . 29113.2 Methods . . . . . . . . . . . . . . 292
13.2.1 Background Subtraction. 29213.2.2 Transformation to Approximate Normality 29313.2.3 Differential Expression Across Conditions 29513.2.4 Robust Center and Spread 297
2~9
230231232235238238239
·244. 247• 248
249251
210212
212214214
.' .
.' . 215
217
219
223225226
229
342
xiv Contents
13.2.5 Formal Evaluation of SignificantDifferential Expression . . . . . . . . . . . . 299
13.2.6 Simulation Studies . . . . . . . . . . . . . . 30113.2.7 Comparison of Methods with E. coli Data . 304
13.3 Software , 30413.4 Application 306
13.4.1 Diabetes and Obesity Studies . 30613.4.2 Software Example . . . . . . . 308
14 MAANOVA: A Software Package for the Analysis ofSpotted cDNA Microarray Experiments 313Hao Wu, M. Kathleen Kerr, Xiangqin cu.and Gary A. Churchill
14.1 Introduction....... 31314.2 Meth ods . . . . . . . . . 314
14.2.1 Data Acquisition 31514.2.2 ANOVA Models for Microarray Data. 31514.2.3 Experimental Design for Microarrays . 31714.2.4 Dat a Transformations . . . . . . . . . 32114.2.5 Algorithms for Computing ANOVA Estimates 32214.2.6 Statistical Inference 32314.2.7 Cluster Analysis 327
14.3 Software . . . . . . 32814.3.1 Availability.. 32814.3.2 Functionality.. 329
14.4 Data Analysis with MAANOVA 33414.5 Discussion . . . . . . . . . . . . 339
15 GeneClustKim-Anh Do, Bradley Broom, and Siji n Wen
15.1 Introduction.... 34215.2 Methods . . . . . . . . . . . . . . . . . 343
15.2.1 Algorithm....... ..... 34315.2.2 Choice of Cluster Size via the Gap Statistic . 34415.2.3 Supervised Gene Shaving for
Class Discrimination . . . . 34615.3 Software . . . . . . . . . . . . . . . . 347
15.3.1 Th e GeneShaving Package . . 34715.3.2 GeneClust: A Faster Implementation of
Gene Shaving . . . . . . . 35215.4 Applications . . .. . ...... . 354
15.4.1 The Alon Colon Dat aset . 35415.4.2 The NCI60 Dataset 356
15.5 Discussion . . . . . . . . . . . . . 358
313
313314
. 315• 315
317321322323327328328329334339
Contents xv
16 POE: Statistical Methods for Qualitative Analysis ofGene Expression 362Elizabeth S. Garrett and Giovanni Parmigiani
16.1 Introduction... . . . .. . .. . . . . . 36216.2 Methodology.. ....... . . . .... 364
16.2.1 Mixture Model for Gene Expression 36416.2.2 Useful Representations of the Results 36616.2.3 Bayesian Hierarchical Model Formulation 36716.2.4 Restrictions to Remove Ambiguity in the
Case of Only Two Components . 36816.2.5 Mining for Subsets of Genes . 36816.2.6 Creat ing Molecular Profiles . . 370
16.3 R Software Extension: POE . . . . . . 37116.3.1 An Example of Using POE on
Simulated Data. . . . . . . . . 37116.3.2 Est imating Posterior Probability of
Expression Using poe. fit . 37216.3.3 Visualization Tools . . . . 37416.3.4 Gene-Mining Functions . . 37716.3.5 Molecular Profiling Tool . . 379
16.4 Resul ts of P OE Applied to Lung Cancer Dat a 38116.5 Discussion and Future Work . . . . . . . . . . 384
17 Bayesian Decomposition 388Michael F. Ochs
17.1 Introduction .. .. ... . . . . . . . . . . . .. . . 38817.1.1 Role of Signaling and Metabolic Pathways . 38817.1.2 Gene Expression Microar rays 389
17.2 Methods . . . . . . . . . . . . . . . 39017.2.1 Matrix Decomposition. . . 39017.2.2 Markov Chain Monte Carlo 39117.2.3 Bayesian Framework . . 39217.2.4 Th e Prior Distribution . 39317.2.5 Summary Statistics 395
17.3 Software . . . . . . . . . . . . 39617.3.1 Implementation. . . . 39617.3.2 Files and Installation 39617.3.3 Issues in th e Application of
Bayesian Decomposition . . 39717.4 Applicat ion of Bayesian Decomposit ion to Yeast
Cell Cycle Data . . . . . . . . . 39817.4.1 Preparation of the Dat a 39817.4.2 Running the Program 39917.4.3 Visualizing the Output 400
xvi Contents
18 Bayesian Clustering of Gene Expression Dynamics 409Paola Sebastiani, Marco Ramoni, and Isaac S. Kahane
17.4.4 Interpretation... ...... .. ... . 40217.4.5 Advantages of Bayesian Decomposition 403
17.5 Discussion . . . . . . . . . . . . . . . . . . . . . 403
409411412413415416417417418418419419420420421421424
447
428
18.1 Introduction .18.2 Methods . . . . . . . . . . . . . . . .
18.2.1 Modeling Time .18.2.2 Probabilistic Scoring Metric .18.2.3 Heuristic Search . . .18.2.4 Statistical Diagnostics . .
18.3 Software . . . . . . . . . . ... .18.3.1 Screen 0: Welcome Screen18.3.2 Screen 1: Getting Started18.3.3 Screen 2: Analysis .. .18.3.4 Screen 3: Cluster Model18.3.5 Screen 4: Pack and Go!
18.4 Application .18.4.1 Analysis .18.4.2 Statistical Diagnostics .18.4.3 Understanding the Model
18.5 Conclusions .
Index
19 Relevance Networks: A First Step TowardFinding Genetic Regulatory Networks WithinMicroarray DataAtul J. Butte and Isaac S. Kahan e
19.1 Introduction 42819.1.1 Advantages of Relevance Networks 429
19.2 Methodology........ .... ..... 43119.2.1 Formal Definition of Relevance Networks 43119.2.2 Finding Regulatory Networks in
Phenotypic Data . . . . . . . . . . . . . . 43219.2.3 Using Entropy and Mutual Information to
Evaluate Gene-Gene Associations 43419.3 Applications. . .. .... . . . . . 437
19.3.1 Finding PharmacogenomicRegulatory Networks . 437
19.3.2 Setting the Threshold 43919.4 Software . . . . . . . . . . . . 440