theanalysis of gene expression data -...

11
Giovanni Parmigiani Elizabeth S. Garrett Rafael A. Irizarry Scott L. Zeger Editors The Analysis of Gene Expression Data Methods and Software With 113 Figures, Including 37 Color Plates , Springer

Upload: lydung

Post on 25-Aug-2018

216 views

Category:

Documents


0 download

TRANSCRIPT

Giovanni ParmigianiElizabeth S. GarrettRafael A. IrizarryScott L. ZegerEditors

The Analysis ofGene Expression DataMethods and Software

With 113 Figures, Including 37 Color Plates

, Springer

vii

1 The Analysis of Gene Expression Data: An Overview ofMethods and Software 1Giovanni Parmigiani, Elizabeth S. Garrett, Rafael A. Irizarry,and Scott L. Zeger

1.1 Measuring Gene Expression Using Microarrays. . 11.1.1 Microarray Technologies . . . . . . . . . . 11.1.2 Sources of Variation in Gene Expression

Measurements Using Microarrays . . 41.1.3 Phases of Microarray Data Analysis . . 5

1.2 Design of Microarray Experiments . . . . . . . . 71.2.1 Replication and Sample Size Considerations. 71.2.2 Design of Two-Channel Arrays 9

1.3 Data Storage . . . . 91.3.1 Databases.......... . . 91.3.2 Standards........ .. . . 101.3.3 Statistical Analysis Languages 11

1.4 Preprocessing........... .. . 121.4.1 Image Analysis . . . . . . . . . 121.4.2 Visualizations for Quality Control 121.4.3 Background Subtraction . . . . . . 131.4.4 Probe-level Analysis of Oligonucleotide Arrays 141.4.5 Within-Array Normalization of cDNA Arrays. 151.4.6 Normalization Across Arrays . . . . 15

1.5 Screening for Differentially Expressed Genes 161.5.1 Estimation or Selection? . . . . . . . 161.5.2 One Problem or Many? . . . . . . . 171.5.3 Selection and False Discovery Rates 181.5.4 Beyond Two Groups . . . . . . . . 19

1.6 Challenges of Genome Biometry Analyses . 19

Contents

Preface

Contributors

Color Insert

v

xvii

(facing page 236)

46

viii Contents

1.7 Visualization and Unsupervised Analyses 211.7.1 Profile Visualization . . 211.7.2 Why Clustering? . . . . . . . . . 221.7.3 Hierarchical Clustering . . . .. 231.7.4 k-Means Clustering and Self-Organizing Maps 251.7.5 Model-Based Cluste ring . . . . . 261.7.6 Principal Components Analysis. . . . . 261.7.7 Mult idimensional Scaling . . . . . . . . 271.7.8 Identifying Novel Molecular Subclasses. 271.7.9 T ime Series Analysis . 28

1.8 Prediction . . . . .... . . . 291.8.1 Prediction Tools . . . 291.8.2 Dimension Reduction 301.8.3 Evaluation of Classifiers 301.8.4 Regression-Based Approaches. 311.8.5 Classification Trees. . . . . . . 311.8.6 Probabilistic Model-Based Classification. 321.8.7 Discriminant Analysis . . . . 331.8.8 Nearest-Neighbor Classifiers . 331.8.9 Support Vector Machines 33

1.9 Free and Open-Source Software . . 331.9.1 Whitehead Institute Tools . 341.9.2 Eisen Lab Tools . . 341.9.3 TIGR Tools. . . . . 341.9.4 GeneX and Cyber'I' 351.9.5 Projects at NCBI 351.9.6 BRB . . ... .. .. 351.9.7 The OOML library. 361.9.8 MatArray . 361.9.9 BASE 36

1.10 Conclusion. . .. . 36

2 Visualization and Annotation ofGenomic ExperimentsRobert Gentleman and Vincent Carey

2.1 Introduction.......... . . . . . ... 462.2 Motivations for Component-Based Software 472.3 Formalism . .. ... ..... . . .... . . 492.4 Bioconductor Software for Filtering, Explorin g, and

Interp reting Microarray Exp eriments . . . . . . . 502.4.1 Formal Dat a Structures and Methods for

Multiple Microarrays . . . . . . . . . . . . 502.4.2 Tools for Filtering Gene Expression Data:

The Closure Concept 54

Contents IX

4 An R Package for Analyses of AffymetrixOligonucleotide Arrays 102Rafael A. Irizarry, Laurent Gaut ier, and Leslie M. Cope

21212223252626272728292930303131323333333334343435353536363636

2.4.3 Expression Density Diagnost ics:High-Throughput Exploratory Dat aAnalysis for Microarrays .

2.4.4 Annotation2.5 Visualizat ion ..

2.5.1 Chromosomes .2.6 Applications ..

2.6.1 A Case Study of Gene Filtering .2.6. 2 Application of Expression Density Diagnostics

2.7 Conclusions . . . . .. . .. . .. .. . . . . . . . . .

3 Bioconductor R Packages for Exploratory Analysisand Normalization of cDNA Microarray DataSandrine Dud oit and Jean Yee Hwa Yang

3.1 Introduction.......... . . .. . . . .3.1.1 Overview of Packages .3.1.2 Two-Color cDNA Microarray Experiments

3.2 Methods . . . .. . .3.2.1 Standards for Microarr ay Dat a .3.2.2 Object-Oriented Programming: Microarray

Classes and Methods. . . . . . . . . . . . .3.2.3 Diagnost ic Plots . . . . . . . . . . . . . . .3.2.4 Normalization Using Robust Local Regression

3.3 Application: Swirl Microarray Experiment . . . . . . .3.4 Software ..... . .. ...... . .. ..... .. .

3.4.1 Package marrayClasses- Classes and Methodsfor cDNA Microarray Data .

3.4.2 Package marraylnput-Data Input for cDNAMicroarrays . . . . . . . . . . . . . . . . . . .

3.4.3 Package marrayPlots-Diagnost ic Plots forcDNA Microarray Data .

3.4.4 Package marrayNorm-Location and ScaleNormalization for cDNA Microarr ay Data .

3.5 Discussion . . .. .

4.1 Introduction . ..4.2 Methods . . ... . . .. . .. . . .

4.2.1 Notat ion .4.2.2 The CEL/CDF Convent ion4.2.3 Probe Pair Sets. . . .4.2.4 Probe-Level Obj ects ....

5557585964646770

73

7373757676

7778798081

81

89

91

9699

102103103104106107

6 Expression Profiler 142Jaak Vilo, Misha Kapushesky , Patrick Kem meren, Ugis Sarkans,and Alvis Brazma

142143143144146147

III113113115115118118

108

120

6.1 Introduction .6.2 EP CLUST .

6.2.1 EP CLUST : Dat a Import6.2.2 EP CLUST: Data Filtering6.2.3 EPCLUST: Data Annotation6.2.4 EPCLUST: Data Environment

4.2.5 Normalization .4.2.6 Exploratory Data Analysis of

Probe-L evel Data . .4.3 Applicat ion .

4.3.1 Expression Measures .4.4 Software... ....··· ··

4.4.1 A Case Study . . . . .4.4.2 Extending the Package .

4.5 Conclusion. . . . .....·

5 DNA-Chip Analyzer (dChip)Cheng Li and Wing Hung Wong

5.1 Introduction · 1205.2 Methods. . ··· 121

5.2.1 Normalization of Arrays Based on an"Invariant Set" . . . . . . 121

5.2.2 Model-Based Analysis ofOligonucleotide Arrays . . 122

5.2.3 Confidence Interval for Fold Change . 1225.2.4 Pooling Replicate Arrays Considering

Measurement Accuracy . . . 1245.3 Software and Applications . . . . . . 125

5.3.1 Reading in Array Data Files 1255.3.2 Viewing an Array Image. 1275.3.3 Normalizing Arrays . . . . . 1295.3.4 Viewing PM /MM Data . . . 1295.3.5 Calculating Model-Based Expression Indexes 1315.3.6 Filter Genes 1325.3.7 Hierarchical Clustering . . . . . 1335.3.8 Comparing Samples . . . . . . . 1355.3.9 Mapping Genes to Chromosomes 1375.3.10 Sample Classification by Linear

Discriminant Analysis 1385.4 Discussion........... . . . . .. 139

x Contents

108

120

120121

121

122122

124125125127129129131132

• 133• 135

137

Contents xi

6.2.5 EP CLUST: Data Analysis . . . . . . . . . 1486.3 URLMAP: Cross-Linking of the Analysis Results

Between the Tools and Databases . . . . . . . . . 1516.4 EP:GO GeneOntology Browser 1526.5 EP:PPI: Comparison of Pro tein Pairs and Expression. 1536.6 Pattern Discovery, Pattern Matching, and

Visualization Tools . . . . . . . . . . . . . . . . . . . 1546.7 An Example of th e Data Analysis and Visualizations

Performed by the Tools in Expression Profiler 1546.8 Integration of Expression Profiler with Pu blic

Microarray Dat abases . 1596.9 Conclusions.. . . . . . . ... . . . . . . . . 160

7 An S-PLUS Library for the Analysis and Visualizationof Differential Expression 163Jae K. Lee and Michael 0 'Connell

7.1 Int roduc t ion . ............. 1637.2 Assessment of Differential Expression 164

7.2.1 Local Pooled Error. . . . . . 1657.2.2 Tests for Differential Expression 1697.2.3 Cluster Analysis and Visualization 171

7.3 Analysis of Melanoma Expression . . . . . 1747.3.1 Tests for Differential Expression 1757.3.2 Cluster Analysis and Visualizat ion 1787.3.3 Annotation 180

7.4 Discussion . . . . .... . . . . . . . . . . 181

8 DRAGON and DRAGON View: Methods for theAnnotation, Analysis, and Visualization of Large-ScaleGene Expression Data 185Christopher M.L .S. Bouton, George Henry , Carlo Colantuoni,and Jonathan Pevsner

8.1 Introduction........... 1858.2 System and Methods . . . . . . 189

8.2.1 Overview of DRAGON 1898.2.2 DRAGON's Hardwar e, Software, and

Database Architecture . . . . . . . . . 1908.2.3 Cross-Referencing Information in DRAGON 1928.2.4 The DRAGON Search and Annotate Tools 1938.2.5 The DRAGON View Data Visualization Tools 1968.2.6 DRAGON Gram: A Novel Visualization Tool 198

8.3 Implementation .. .. .. 1998.4 Discussion and Conclusion . . . . . . . . . . . . . . . 204

xii Contents

9 SNOMAD: Biologist-Friendly Web Toolsfor the Standardization and NOrmalizationof Microarray Data 210Carlo Colantuoni, George Henry , Christopher M.L.S. Bouton,Scott L. Zeger, and Jonathan Pevsner

9.1 Introduction 2109.2 Methods and Application. . . . . . . . . . . 212

9.2.1 Overview of Exp eriment al and DataAnalysis Procedures . . . . . 212

9.2.2 Background Subtraction. . . . . . . 2149.2.3 Global Mean Normalization . . . . . 2149.2.4 Standard Data Transformation and

Visualization Methods . . . . . . . . 2159.2.5 Local Mean Normalization Across Element

Signal Intensity . . . . . . . . . . . . . . . . 2179.2.6 Local Variance Correc tion Across Element

Signal Intensity . . . . . . . . . . . . . 2199.2.7 Local Mean Normalization Across the

Microarray Surface . 2239.3 Software 2259.4 Discussion .. .... . . .. 226

10 Microarray Analysis Using the MicroArray Explorer 229Peter F. Lemkin, Gregory C. Thornwall, and Jai Evans

10.1 Introduction 2~9

10.1.1 Need for the Methodology . . . . . . 23010.1.2 Basic Ideas Behind th e Approach. . 231

10.2 Methods-Statistical and Informatics Basis. 23210.2.1 Analysis Paradigm. . . . . . 23510.2.2 Particular Analysis Methods 23810.2.3 Data Conversion . . . . . . . 238

10.3 Software . . . . . . . . . . . . . . . . 23910.3.1 System Design-Software Implementation 24410.3.2 How to Download th e Software. . . . . . . 24710.3.3 Strengths and Weaknesses of th e Approach 248

10.4 Applications 24910.5 Discussion . . . .. 251

11 Parametric Empirical Bayes Methods for Microarrays 254Michael A . Newton and Christina Kendziorski

11.1 Introduction 25411.2 EB Methods . . . . . . . . . . . . . . . . 256

11.2.1 Canonical EB Example . . . . . 25611.2.2 General Model Structure: Two Condi tions . 256

285285285287287288289

272273273275276277278280283283283284

259260261263269

258

272

291

Contents xiii

11.2.3 Multiple Conditions . . . .11.2.4 The Gamma-Gamma and

Lognormal-Normal Models11.2.5 Model Fitting.Software ..ApplicationDiscussion .

11.311.411.5

12 SAM Thresholding and False Discovery Ratesfor Detecting Differential Gene Expressionin DNA MicroarraysJohn D. Storey and Robert Tibshirani

12.1 Introduction .12.2 Methods and Applications .

12.2.1 Multiple Hypothesis Testing12.2.2 An Application .12.2.3 Forming the Test Statistics .12.2.4 Calculating the Null Distribution .12.2.5 The SAM Thresholding Procedure12.2.6 Estimating False-Discovery Rates.

12.3 Software . . . . . . . . . . . ..12.3.1 Obtaining the Software12.3.2 Data Formats. . . . ..12.3.3 Response Format . . ..12.3.4 Example Input Data File for an

Unpaired Problem .12.3.5 Block Permutations .12.3.6 Normalization of Experiments12.3.7 Handling Missing Data . .. .12.3.8 Running SAM .... . . . . .12.3.9 Format of the Significant Gene List

12.4 Discussion .

13 Adaptive Gene Picking with Microarray Data:Detecting Important Low Abundance SignalsYi Lin, Samuel T . Nadler, Hong Lan, Alan D. Attie,and Brian S. Yandell

13.1 Introduction. . .. . ... ... . 29113.2 Methods . . . . . . . . . . . . . . 292

13.2.1 Background Subtraction. 29213.2.2 Transformation to Approximate Normality 29313.2.3 Differential Expression Across Conditions 29513.2.4 Robust Center and Spread 297

2~9

230231232235238238239

·244. 247• 248

249251

210212

212214214

.' .

.' . 215

217

219

223225226

229

342

xiv Contents

13.2.5 Formal Evaluation of SignificantDifferential Expression . . . . . . . . . . . . 299

13.2.6 Simulation Studies . . . . . . . . . . . . . . 30113.2.7 Comparison of Methods with E. coli Data . 304

13.3 Software , 30413.4 Application 306

13.4.1 Diabetes and Obesity Studies . 30613.4.2 Software Example . . . . . . . 308

14 MAANOVA: A Software Package for the Analysis ofSpotted cDNA Microarray Experiments 313Hao Wu, M. Kathleen Kerr, Xiangqin cu.and Gary A. Churchill

14.1 Introduction....... 31314.2 Meth ods . . . . . . . . . 314

14.2.1 Data Acquisition 31514.2.2 ANOVA Models for Microarray Data. 31514.2.3 Experimental Design for Microarrays . 31714.2.4 Dat a Transformations . . . . . . . . . 32114.2.5 Algorithms for Computing ANOVA Estimates 32214.2.6 Statistical Inference 32314.2.7 Cluster Analysis 327

14.3 Software . . . . . . 32814.3.1 Availability.. 32814.3.2 Functionality.. 329

14.4 Data Analysis with MAANOVA 33414.5 Discussion . . . . . . . . . . . . 339

15 GeneClustKim-Anh Do, Bradley Broom, and Siji n Wen

15.1 Introduction.... 34215.2 Methods . . . . . . . . . . . . . . . . . 343

15.2.1 Algorithm....... ..... 34315.2.2 Choice of Cluster Size via the Gap Statistic . 34415.2.3 Supervised Gene Shaving for

Class Discrimination . . . . 34615.3 Software . . . . . . . . . . . . . . . . 347

15.3.1 Th e GeneShaving Package . . 34715.3.2 GeneClust: A Faster Implementation of

Gene Shaving . . . . . . . 35215.4 Applications . . .. . ...... . 354

15.4.1 The Alon Colon Dat aset . 35415.4.2 The NCI60 Dataset 356

15.5 Discussion . . . . . . . . . . . . . 358

313

313314

. 315• 315

317321322323327328328329334339

Contents xv

16 POE: Statistical Methods for Qualitative Analysis ofGene Expression 362Elizabeth S. Garrett and Giovanni Parmigiani

16.1 Introduction... . . . .. . .. . . . . . 36216.2 Methodology.. ....... . . . .... 364

16.2.1 Mixture Model for Gene Expression 36416.2.2 Useful Representations of the Results 36616.2.3 Bayesian Hierarchical Model Formulation 36716.2.4 Restrictions to Remove Ambiguity in the

Case of Only Two Components . 36816.2.5 Mining for Subsets of Genes . 36816.2.6 Creat ing Molecular Profiles . . 370

16.3 R Software Extension: POE . . . . . . 37116.3.1 An Example of Using POE on

Simulated Data. . . . . . . . . 37116.3.2 Est imating Posterior Probability of

Expression Using poe. fit . 37216.3.3 Visualization Tools . . . . 37416.3.4 Gene-Mining Functions . . 37716.3.5 Molecular Profiling Tool . . 379

16.4 Resul ts of P OE Applied to Lung Cancer Dat a 38116.5 Discussion and Future Work . . . . . . . . . . 384

17 Bayesian Decomposition 388Michael F. Ochs

17.1 Introduction .. .. ... . . . . . . . . . . . .. . . 38817.1.1 Role of Signaling and Metabolic Pathways . 38817.1.2 Gene Expression Microar rays 389

17.2 Methods . . . . . . . . . . . . . . . 39017.2.1 Matrix Decomposition. . . 39017.2.2 Markov Chain Monte Carlo 39117.2.3 Bayesian Framework . . 39217.2.4 Th e Prior Distribution . 39317.2.5 Summary Statistics 395

17.3 Software . . . . . . . . . . . . 39617.3.1 Implementation. . . . 39617.3.2 Files and Installation 39617.3.3 Issues in th e Application of

Bayesian Decomposition . . 39717.4 Applicat ion of Bayesian Decomposit ion to Yeast

Cell Cycle Data . . . . . . . . . 39817.4.1 Preparation of the Dat a 39817.4.2 Running the Program 39917.4.3 Visualizing the Output 400

xvi Contents

18 Bayesian Clustering of Gene Expression Dynamics 409Paola Sebastiani, Marco Ramoni, and Isaac S. Kahane

17.4.4 Interpretation... ...... .. ... . 40217.4.5 Advantages of Bayesian Decomposition 403

17.5 Discussion . . . . . . . . . . . . . . . . . . . . . 403

409411412413415416417417418418419419420420421421424

447

428

18.1 Introduction .18.2 Methods . . . . . . . . . . . . . . . .

18.2.1 Modeling Time .18.2.2 Probabilistic Scoring Metric .18.2.3 Heuristic Search . . .18.2.4 Statistical Diagnostics . .

18.3 Software . . . . . . . . . . ... .18.3.1 Screen 0: Welcome Screen18.3.2 Screen 1: Getting Started18.3.3 Screen 2: Analysis .. .18.3.4 Screen 3: Cluster Model18.3.5 Screen 4: Pack and Go!

18.4 Application .18.4.1 Analysis .18.4.2 Statistical Diagnostics .18.4.3 Understanding the Model

18.5 Conclusions .

Index

19 Relevance Networks: A First Step TowardFinding Genetic Regulatory Networks WithinMicroarray DataAtul J. Butte and Isaac S. Kahan e

19.1 Introduction 42819.1.1 Advantages of Relevance Networks 429

19.2 Methodology........ .... ..... 43119.2.1 Formal Definition of Relevance Networks 43119.2.2 Finding Regulatory Networks in

Phenotypic Data . . . . . . . . . . . . . . 43219.2.3 Using Entropy and Mutual Information to

Evaluate Gene-Gene Associations 43419.3 Applications. . .. .... . . . . . 437

19.3.1 Finding PharmacogenomicRegulatory Networks . 437

19.3.2 Setting the Threshold 43919.4 Software . . . . . . . . . . . . 440