Data Acquisition - Broad Institute


Post on 13-Jul-2020


Data Acquisition: DNA microarrays

The functional genomics pipeline
• Experimental design (affects the outcome of data analysis)
• Data acquisition: microarray processing
• Data preprocessing: scaling/normalization/filtering
• Data analysis/Hypothesis generation
– Supervised analysis: differential analysis, classification, …
– Unsupervised analysis: clustering, bi-clustering, …
• Validation/Annotation
– Enrichment analysis: GO annotation, GSEA, …
– "In silico" testing: cross-validation, train/test, etc.
– "In vitro" testing: back to the lab


Microarrays
• A technology that has reshaped molecular biology.
• Traditional methods in molecular biology work on a "one gene in one experiment" basis, which makes the "whole picture" of gene function hard to obtain. This is a hypothesis-testing approach.
• Microarray technology puts thousands of genes on a chip to measure their activity/expression simultaneously. This is a hypothesis-generation approach.
• Main technologies: cDNA arrays and oligonucleotide chips.
• "Gene expression levels" are measured in terms of mRNA abundance.

From Tissues to Microarrays
[Figure: from tissues to microarrays. Source: N Engl J Med, 354: 2463, 2006.]


cDNA Microarrays
• For each gene, synthesize its sequence and print/spot it on the chip surface.
• The labeled probes are allowed to bind to complementary DNA strands (targets) on the slides.
• Examining the fluorescence at each probe tells us which gene is present in which sample.
• Nomenclature: the terms "probe" and "target" are interchanged between cDNA and oligo arrays.

Steps shown in the figure: prepare cDNA targets, then hybridize the targets to the microarray.


Image
1. Red dot: the gene is expressed in the treated sample but not in the control.
2. Green dot: the gene is expressed in the control but not in the treated sample.
3. Yellow dot: the gene is expressed in both samples.
4. Grey dot: the gene is expressed in neither sample.

From Image to Data

One microarray gives measures for genes (rows) in two conditions (columns): a reference sample and a sample of interest, which are not necessarily paired. In the example, the reference sample is a healthy cell and the sample of interest is a tumor cell; the ratio is Tumor/Healthy.

Gene name                 Healthy   Tumor   Ratio
transcription terminat    0.72      0.1     0.14
selenoprotein P plasm     1.58      1.05    0.66
Hs protein (peptidyl-p    1.1       0.97    0.88
erythrocyte membran       0.97      1       1.03
acid-inducible phosph     1.21      1.29    1.07
ESTs                      1.45      1.44    0.99
lumican                   1.15      1.1     0.96
cathepsin K (pycnody      1.32      1.35    1.02
carnitine palmitoyltra    1.01      1.38    1.37
KIAA0455 gene prod        0.85      1.03    1.21
ESTs                      1.12      0.92    0.82
ribosomal protein L5      1.23      1.21    0.98
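The step from two-channel intensities to a per-gene ratio can be sketched as follows. This is a minimal illustration using the first four rows of the table above; the function name `expression_ratios` is hypothetical, and the log2 transform is a common convention added here as an assumption, not something the slides specify.

```python
import math

# Intensities copied from the first four rows of the table (gene names are
# truncated in the source and kept as-is).
healthy = {
    "transcription terminat": 0.72,
    "selenoprotein P plasm": 1.58,
    "Hs protein (peptidyl-p": 1.10,
    "erythrocyte membran": 0.97,
}
tumor = {
    "transcription terminat": 0.10,
    "selenoprotein P plasm": 1.05,
    "Hs protein (peptidyl-p": 0.97,
    "erythrocyte membran": 1.00,
}


def expression_ratios(reference, sample):
    """Return {gene: (ratio, log2 ratio)} with ratio = sample / reference."""
    result = {}
    for gene, ref_value in reference.items():
        ratio = sample[gene] / ref_value
        result[gene] = (ratio, math.log2(ratio))
    return result


ratios = expression_ratios(healthy, tumor)
for gene, (ratio, log_ratio) in ratios.items():
    print(f"{gene:25s} ratio={ratio:.2f}  log2={log_ratio:+.2f}")
```

On a log2 scale, a gene that is lower in the tumor sample gets a negative value and a gene that is higher gets a positive one, which makes up- and down-regulation symmetric around zero.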


Oligonucleotide Microarrays
• Oligonucleotide arrays: Affymetrix GeneChip.
• A gene is represented by a set of 11-20 probe pairs:
– Each probe (oligonucleotide) is a 25-base sequence characteristic of one gene.
• Each probe pair consists of:
– Perfect match (PM): a probe that should hybridize.
– Mismatch (MM): a probe that should not hybridize, because its central base has been replaced by the complementary base (internal control).
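How an MM probe relates to its PM probe can be sketched in a few lines, taking the MM probe to be the PM sequence with its central base swapped for the complementary base. The function name `mismatch_probe` and the example sequence are hypothetical, chosen only for illustration.

```python
# Watson-Crick complements for DNA bases.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def mismatch_probe(pm):
    """Return the MM probe: the PM sequence with its central base complemented."""
    if len(pm) % 2 == 0:
        raise ValueError("probe length must be odd to have a central base")
    mid = len(pm) // 2  # index 12, i.e. the 13th base, for a 25-mer
    return pm[:mid] + COMPLEMENT[pm[mid]] + pm[mid + 1:]


pm = "ATGGCTAGCTAGGCTAACGTTAGCA"  # hypothetical 25-mer, not a real probe
mm = mismatch_probe(pm)
print(pm)
print(mm)
```

The two sequences differ at exactly one position, so the MM probe serves as an estimate of non-specific binding for its PM partner.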


Oligonucleotide Microarrays
[Figure: scanned microarray.]
• Probe: a 25-mer.
• Probe pair: a (PM, MM) pair.
• Probe set: a collection of probe pairs.
• Intensity: the gene expression level is quantified by the intensity of its cells in the scanned image.
• Gene Expression = "avg"(PM-MM)
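The summary "avg"(PM-MM) can be illustrated with a minimal sketch. The intensities below are made up, and `average_difference` is a hypothetical helper. The optional trimming is one possible reading of the quoted "avg" (an average that resists outlying probe pairs), not necessarily what any particular software computes.

```python
from statistics import mean

# Hypothetical intensities for one probe set of 11 probe pairs.
pm = [1200, 980, 1510, 1100, 890, 1340, 1250, 1020, 4000, 1180, 1090]
mm = [300, 250, 410, 280, 900, 350, 330, 260, 500, 310, 270]


def average_difference(pm, mm, trim=0):
    """Average of PM - MM over a probe set.

    With trim > 0, the `trim` smallest and `trim` largest differences are
    dropped first, which limits the influence of outlying probe pairs.
    """
    if len(pm) != len(mm):
        raise ValueError("PM and MM must have the same number of probes")
    diffs = sorted(p - m for p, m in zip(pm, mm))
    if trim:
        diffs = diffs[trim:len(diffs) - trim]
    return mean(diffs)


plain = average_difference(pm, mm)
trimmed = average_difference(pm, mm, trim=1)
print(f"plain mean: {plain:.1f}, trimmed mean: {trimmed:.1f}")
```

In this made-up probe set, one pair has MM above PM and another has a very bright PM; the trimmed version discards both before averaging.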

Technology Comparison
[Figure: cDNA arrays (two-color analysis) versus oligo arrays (one-color analysis). Source: N Engl J Med, 354: 2463, 2006.]


Technology comparison
• Affymetrix arrays ("use the genes in a file")
– Factory-made; light-directed chemical synthesis of 25-mer oligos.
– Multiple assays per gene (20-40 oligos/gene).
– Measure the absolute level of transcription.
– Inter-experiment and inter-lab comparisons are relatively straightforward.
– Expensive.
• cDNA spotted arrays ("use the genes in a tube")
– Custom-made; robotically spotted PCR products from cloned ESTs.
– Single assay per gene.
– Measure the relative level of transcription (experiment/control ratio).
– Inter-experiment comparisons are difficult unless done on exactly the same slides.
– Cheap.

Technology comparison
• 2 studies:
– Irizarry et al., Nat Meth 2(5): 345-350
– Larkin et al., Nat Meth 2(5): 337-344
• Multi-lab comparison of three platforms:
– 1-color short oligos (Affy)
– 2-color cDNA
– 2-color long oligos
• Results:
– Affymetrix accuracy best overall.
– Precision comparable among platforms.
– Lab effect stronger than platform effect.
– Reproducibility across labs and platforms good, although not perfect.


Data representation
• Rows: genes g1, g2, …, gm.
• Columns: experiments e1, e2, …, en.
• The data form an m x n matrix (genes x experiments).
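A minimal sketch of this genes-by-experiments layout, using plain Python lists and made-up values. The helper names are hypothetical. The row standardization at the end is a common preparation step before drawing a heatmap and is included as an assumption, not as something the slides prescribe.

```python
from statistics import mean, pstdev

genes = ["g1", "g2", "g3"]              # m = 3 rows
experiments = ["e1", "e2", "e3", "e4"]  # n = 4 columns

# Hypothetical expression values: X[i][j] is gene i measured in experiment j.
X = [
    [0.72, 0.10, 0.65, 0.12],
    [1.58, 1.05, 1.49, 1.10],
    [1.01, 1.38, 0.98, 1.41],
]
m, n = len(X), len(X[0])


def gene_profile(name):
    """One row: the expression of a gene across all experiments."""
    return X[genes.index(name)]


def experiment_profile(name):
    """One column: the expression of all genes in one experiment."""
    j = experiments.index(name)
    return [row[j] for row in X]


def standardize_rows(matrix):
    """Center each gene at 0 and scale it to unit standard deviation.

    Assumes no gene is constant across experiments (pstdev would be 0).
    """
    out = []
    for row in matrix:
        mu, sd = mean(row), pstdev(row)
        out.append([(v - mu) / sd for v in row])
    return out


Z = standardize_rows(X)
print(m, n, gene_profile("g2"), experiment_profile("e3"))
```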

Data Visualization: Heatmap


Statistical Challenges
• Noisy measurements.
• "Large p, small n" (small sample size and a large number of dimensions):
– "Curse of dimensionality".
– Easy to "over-fit".
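Why "large p, small n" makes over-fitting easy can be seen in a small simulation. This is a sketch under stated assumptions: the "genes" are pure Gaussian noise, there are 3 samples per class, and the function name and parameters are hypothetical. With 6 samples, a noise gene separates the two classes perfectly with probability 2/20 = 10%, so thousands of genes yield hundreds of spurious "perfect markers".

```python
import random


def count_perfect_separators(n_per_class=3, n_genes=5000, seed=0):
    """Count pure-noise genes whose values perfectly separate two classes."""
    rng = random.Random(seed)
    perfect = 0
    for _ in range(n_genes):
        class_a = [rng.gauss(0, 1) for _ in range(n_per_class)]
        class_b = [rng.gauss(0, 1) for _ in range(n_per_class)]
        # Perfect separation: every value of one class lies below every
        # value of the other class.
        if max(class_a) < min(class_b) or max(class_b) < min(class_a):
            perfect += 1
    return perfect


perfect = count_perfect_separators()
print(f"{perfect} of 5000 noise genes separate the classes perfectly")
```

None of these genes carries any signal, which is why apparent markers found on a handful of samples need validation on independent data.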


Signal Extraction
[Figure: matrix of genes by samples.]

Main steps:
• Background correction
– Eliminate background noise (correct for background "luminosity", etc.).
• Normalization
– Make the different samples "comparable".
• Summarization
– Gene Expression = "avg"(PM-MM)
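The three steps can be chained in a toy pipeline. This is a deliberately simplified sketch, not any published method: the background is assumed to be a known constant per array, normalization is a plain rescaling to a common mean intensity, and summarization is the mean of PM-MM. All function names and numbers are hypothetical.

```python
from statistics import mean


def background_correct(array, background):
    """Subtract a constant background from every PM and MM, clipping at 0."""
    return {
        gene: [(max(pm - background, 0.0), max(mm - background, 0.0))
               for pm, mm in pairs]
        for gene, pairs in array.items()
    }


def scale_normalize(array, target=100.0):
    """Rescale an array so that its mean intensity equals `target`."""
    intensities = [v for pairs in array.values() for pair in pairs for v in pair]
    factor = target / mean(intensities)
    return {gene: [(pm * factor, mm * factor) for pm, mm in pairs]
            for gene, pairs in array.items()}


def summarize(array):
    """One value per gene: the mean of PM - MM over its probe pairs."""
    return {gene: mean(pm - mm for pm, mm in pairs)
            for gene, pairs in array.items()}


def extract_signal(array, background, target=100.0):
    corrected = background_correct(array, background)
    return summarize(scale_normalize(corrected, target))


# The same hypothetical sample scanned twice: the second scan is twice as
# bright and has a higher background offset.
true = {"geneA": [(120, 40), (100, 30)], "geneB": [(60, 50), (70, 55)]}
array1 = {g: [(pm + 10, mm + 10) for pm, mm in p] for g, p in true.items()}
array2 = {g: [(2 * pm + 30, 2 * mm + 30) for pm, mm in p] for g, p in true.items()}

signal1 = extract_signal(array1, background=10)
signal2 = extract_signal(array2, background=30)
print(signal1)
print(signal2)
```

After background correction and scaling, the two scans give matching gene-level values, which is the sense in which normalization makes samples "comparable".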


Signal Extraction methods
Source: Bioinformatics, 22(7): 789-794, 2006.

Each entry lists the method, its citation, and its summarization, background correction, and normalization steps. Where the original table does not separate the steps, the description is given as a whole.

• ChipMan (Lauren, 2003): A multiplicative model similar to that of dChip is fit to the PM. Linear transformation.
• dChip (Li and Wong, 2001): Summarization: a multiplicative model is fit. Background correction: MM intensities are subtracted. Normalization: spline fitted to rank-invariant set.
• GL (Freudenberg, 2005): Summarization: as RMA. Background correction: none. Normalization: Loess fitted to subset.
• gMOSv.1 (Milo et al., 2003): Parameters from a gamma model are estimated from the PM and MM. These account for background and signal.
• GCRMA (Wu et al., 2004): Summarization: as RMA. Background correction: based on probe sequence. Normalization: as RMA.
• GSVDmod (Zuzan, 2003): Summarization: generalized SVD is used. Background correction: none. Normalization: scale normalization.
• MAS5.0 (Affymetrix, 2002): Summarization: a robust average (Tukey biweight). Background correction: spatial effect and MM subtracted. Normalization: scale normalization.
• MMEI (Deng et al., 2005): Summarization: a linear mixed model is fitted. Background correction: none. Normalization: linear mixed model used as well.
• PerfectMatch (Zhang et al., 2003): Model accounts for background and signal. The non-specific and specific effects are predicted using a free energy model.
• PLIER (Hubbell et al., 2004): A multiplicative model is fitted to PM-MM; accounts for heteroskedasticity. As RMA.
• ProbeProfiler: Proprietary algorithm (http://www.corimbia.com).
• RMA (Irizarry et al., 2003): Summarization: a robust linear model is fitted. Background correction: a global correction is performed. Normalization: quantile.
• RSVD (Liu et al., 2003): The robust singular value decomposition methodology is applied to probe-level data.
• UMTrMn (Giordano et al., 2001): Summarization: a trimmed mean of the PM-MM is computed. Background correction: negatives are truncated. Normalization: similar to RMA.
• VSN (Huber et al., 2002): Summarization: as RMA. Background correction and normalization: a generalized log transform is used to normalize and background correct.
• ZAM (Åstrand, 2003): Summarization: a robust linear model is fit. Background correction: similar to RMA. Normalization: averaged pairwise Loess.
• ZL (Zhou and Rocke, 2005): Model used to motivate a generalized log transform that normalizes, background corrects and summarizes.
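The table lists "Quantile" as the normalization step of RMA. The basic idea of quantile normalization can be sketched as below: every array is forced to share the same distribution, namely the rank-wise mean across arrays. This is a simplified illustration with made-up intensities; ties are broken by position rather than averaged, and the function name is hypothetical.

```python
def quantile_normalize(arrays):
    """Quantile-normalize a list of arrays (equal-length lists of intensities).

    Each value is replaced by the mean, across arrays, of the values that
    share its rank. Ties within an array are broken by position (a
    simplification).
    """
    n = len(arrays[0])
    if any(len(a) != n for a in arrays):
        raise ValueError("all arrays must have the same number of probes")
    sorted_arrays = [sorted(a) for a in arrays]
    rank_means = [sum(column) / len(arrays) for column in zip(*sorted_arrays)]
    normalized = []
    for a in arrays:
        order = sorted(range(n), key=lambda i: a[i])  # probe indices by rank
        out = [0.0] * n
        for rank, i in enumerate(order):
            out[i] = rank_means[rank]
        normalized.append(out)
    return normalized


# Three hypothetical arrays, four probes each.
arrays = [[5, 2, 3, 4], [4, 1, 6, 2], [3, 4, 6, 8]]
normalized = quantile_normalize(arrays)
for a in normalized:
    print([round(v, 2) for v in a])
```

After normalization the arrays contain the same set of values; only the assignment of values to probes differs, following each probe's rank in its own array.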

Signal Extraction desiderata
• Maximize precision: low variance (need for replicates).
• Maximize accuracy: low bias (need for the truth).
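The two desiderata can be made concrete with a small sketch: precision is estimated from the variance across replicate measurements, and accuracy from the bias relative to a known true value. The numbers are made up, and the spiked-in transcript with known concentration is an assumed way of obtaining "the truth".

```python
from statistics import mean, variance


def precision_and_accuracy(replicates, truth):
    """Return (variance across replicates, bias = mean - truth)."""
    return variance(replicates), mean(replicates) - truth


# Hypothetical log2 expression estimates of one transcript whose true value
# is known, measured on five replicate arrays by two methods.
truth = 8.0
method_a = [7.9, 8.1, 8.0, 8.2, 7.8]  # noisier, centered on the truth
method_b = [9.0, 9.1, 8.9, 9.0, 9.0]  # tighter, shifted away from the truth

var_a, bias_a = precision_and_accuracy(method_a, truth)
var_b, bias_b = precision_and_accuracy(method_b, truth)
print(f"A: variance={var_a:.3f} bias={bias_a:+.2f}")
print(f"B: variance={var_b:.3f} bias={bias_b:+.2f}")
```

Method B is the more precise of the two and method A the more accurate, which shows that the two criteria have to be assessed separately.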


Signal Extraction
• Most popular methods:
– Robust Multi-array Average (RMA).
– MAS5: Affymetrix software.
– GC-RMA: modified RMA that models probe-level intensity as a function of GC content. It expects higher intensity values for GC-rich probes because of increased binding.
• Pros and cons:
– (GC-)RMA is more precise (accurate?) in the low and mid-range, but tends to "squash" the signal at high levels.
– MAS5 is noisier at low levels.
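The covariate that GC-RMA relies on, the GC content of a probe, is simple to compute. The sketch below shows only that covariate and is not the GC-RMA model itself; the function name and the example sequences are hypothetical.

```python
def gc_content(probe):
    """Fraction of bases in a probe sequence that are G or C."""
    if not probe:
        raise ValueError("probe sequence must not be empty")
    probe = probe.upper()
    return sum(base in "GC" for base in probe) / len(probe)


# Two hypothetical 25-mers: one GC-rich, one AT-rich.
gc_rich = "GCGCGGCCGCGGCGCCGGCGCGGCC"
at_rich = "ATATTAATATTATAATTATAATATA"
print(f"GC-rich probe: {gc_content(gc_rich):.2f}")
print(f"AT-rich probe: {gc_content(at_rich):.2f}")
```

Under the reasoning in the slide, the GC-rich probe would be expected to show the higher intensity at equal transcript abundance, so a method that models this can adjust for it.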

Introductory references
1. Hoffman, E. P., Awad, T., et al. Expression Profiling - Best Practices for Data Generation and Interpretation in Clinical Trials. Nat Rev Genet, 5: 229-237, 2004.
2. Larkin, J. E., Frank, B. C., et al. Independence and reproducibility across microarray platforms. Nat Meth, 2: 337-344, 2005.
3. Irizarry, R. A., Warren, D., et al. Multiple-laboratory comparison of microarray platforms. Nat Meth, 2: 345-350, 2005.
4. Irizarry, R. A., Wu, Z., and Jaffee, H. A. Comparison of Affymetrix GeneChip expression measures. Bioinformatics, 22(7): 789-794, 2006.