Data Acquisition - Broad Institute
TRANSCRIPT
Data Acquisition
DNA microarrays
The functional genomics pipeline
• Experimental design: affects outcome data analysis
• Data acquisition: microarray processing
• Data preprocessing: scaling/normalization/filtering
• Data analysis/Hypothesis generation
– Supervised analysis: differential analysis, classification, …
– Unsupervised analysis: clustering, bi-clustering, …
• Validation/Annotation
– Enrichment analysis: GO annotation, GSEA, …
– “In silico” testing: cross validation, train/test, etc.
– “In vitro” testing: back to the lab
Microarrays
• A technology that has reshaped molecular biology.
• Traditional methods in molecular biology work on a "one gene in one experiment" basis, but the "whole picture" of gene function is hard to obtain.
– Hypothesis testing approach.
• Microarray technology: put thousands of genes on a chip to measure their activity/expression simultaneously.
– Hypothesis generation approach.
• Main technologies: cDNA arrays and oligonucleotide chips.
• “Gene expression levels” are measured in terms of mRNA abundance.
From Tissues to Microarrays
N Engl J Med, 354: 2463, 2006.
Microarrays
• DNA Microarray: a technology that is reshaping molecular biology.
• Traditional methods in molecular biology generally work on a "one gene in one experiment" basis, but the "whole picture" of gene function is hard to obtain.
– This is a hypothesis driven approach.
• Microarray technology: put thousands of genes on a chip to measure their activity/expression simultaneously.
– This is a hypothesis generation approach!
• Main technologies: cDNA arrays and oligonucleotide arrays.
• Both measure the “gene expression levels” in terms of mRNA abundance.
cDNA Microarrays
For each gene, synthesize its sequence and print/spot it on the chip surface.
The labeled probes are allowed to bind to complementary DNA strands (targets) on the slides.
Examining the fluorescence at each spot tells us which gene is expressed in which sample.
Nomenclature: the terms probe and target are interchanged between cDNA and oligo arrays.
Figure labels: Prepare cDNA targets; Hybridize target to microarray.
Image
1. Red dot: the gene is expressed in the treated sample but not in the control.
2. Green dot: the gene is expressed in the control but not in the treated sample.
3. Yellow dot: the gene is expressed in both samples.
4. Grey dot: the gene is expressed in neither sample.
From Image to Data

Gene name                  Healthy   Tumor
transcription terminat     0.72      0.1
selenoprotein P plasm      1.58      1.05
Hs protein (peptidyl-p     1.1       0.97
erythrocyte membran        0.97      1
acid-inducible phosph      1.21      1.29
ESTs                       1.45      1.44
lumican                    1.15      1.1
cathepsin K (pycnody       1.32      1.35
carnitine palmitoyltra     1.01      1.38
KIAA0455 gene prod         0.85      1.03
ESTs                       1.12      0.92
ribosomal protein L5       1.23      1.21

Gene name                  Ratio
transcription terminati    0.14
selenoprotein P plasm      0.66
Hs protein (peptidyl-p     0.88
erythrocyte membran        1.03
acid-inducible phosph      1.07
ESTs                       0.99
lumican                    0.96
cathepsin K (pycnody       1.02
carnitine palmitoyltran    1.37
KIAA0455 gene produ        1.21
ESTs                       0.82
ribosomal protein L5       0.98

Rows are genes; columns are samples (healthy cell, tumor cell).
One microarray gives measures for genes in two conditions (reference sample and sample of interest), not necessarily paired.
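The Ratio column appears to be the Tumor intensity divided by the Healthy intensity. The sketch below reproduces the first four ratios under that assumption. The function name `expression_ratio` is illustrative, and the log2 ratio is shown only as a common convention, not something the slides state.

```python
import math

# Intensities copied from the first four rows of the Healthy/Tumor table.
# Gene names are truncated exactly as they appear in the slide.
intensities = {
    "transcription terminat": (0.72, 0.10),
    "selenoprotein P plasm": (1.58, 1.05),
    "Hs protein (peptidyl-p": (1.10, 0.97),
    "erythrocyte membran": (0.97, 1.00),
}


def expression_ratio(healthy, tumor):
    """Return the tumor/healthy ratio for one gene (assumed definition)."""
    return tumor / healthy


ratios = {
    gene: round(expression_ratio(h, t), 2)
    for gene, (h, t) in intensities.items()
}

# Assumption: log2 ratios are often preferred because up- and
# down-regulation become symmetric around zero.
log2_ratios = {
    gene: math.log2(expression_ratio(h, t))
    for gene, (h, t) in intensities.items()
}

for gene, ratio in ratios.items():
    print(f"{gene:<24} ratio={ratio:.2f} log2={log2_ratios[gene]:+.2f}")
```

A ratio below 1 (negative log2) means lower signal in the tumor sample than in the healthy one.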
Oligonucleotide Microarrays
• Oligonucleotide arrays: Affymetrix GeneChip.
• Represent a gene with a set of 11-20 probe pairs:
– Each probe (oligonucleotide) is a 25-base sequence characteristic of one gene.
• Each probe pair consists of:
– Perfect match (PM): a probe that should hybridize.
– Mismatch (MM): a probe that should not hybridize, because the central base has been replaced by its complement (internal control).
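As a minimal sketch of how a mismatch probe relates to its perfect match, the code below complements the central base of a 25-mer. The sequence is a made-up placeholder, not a real Affymetrix probe, and `mismatch_probe` is an illustrative name.

```python
# Watson-Crick complements for the four DNA bases.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def mismatch_probe(pm):
    """Return the MM probe: the PM probe with its central base complemented."""
    center = len(pm) // 2  # index 12, i.e. the 13th base of a 25-mer
    return pm[:center] + COMPLEMENT[pm[center]] + pm[center + 1:]


pm = "ACGTACGTACGTACGTACGTACGTA"  # hypothetical 25-mer
mm = mismatch_probe(pm)

print(pm)
print(mm)
```

The two probes differ at exactly one position, which is what lets the MM probe act as an internal control for non-specific binding.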
Oligonucleotide Microarrays
Scanned microarray
• Probe: a 25-mer.
• Probe pair: a (PM, MM) pair.
• Probe set: collection of probe pairs.
• Intensity: gene expression level quantified by the intensity of its cells in the scanned image.
Gene Expression = “avg”(PM-MM)
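The formula Gene Expression = “avg”(PM-MM) can be sketched as follows, reading “avg” as a plain arithmetic mean. That reading is an assumption: the quotation marks on the slide suggest that other averages (for example robust ones) are used in practice. The intensities are invented for illustration.

```python
from statistics import mean


def probe_set_expression(pm, mm):
    """Average the PM - MM differences over the probe pairs of one probe set."""
    if len(pm) != len(mm):
        raise ValueError("each PM probe needs a matching MM probe")
    return mean(p - m for p, m in zip(pm, mm))


# Made-up intensities for a probe set with four probe pairs.
pm_intensities = [1200.0, 950.0, 1430.0, 1100.0]
mm_intensities = [300.0, 410.0, 280.0, 350.0]

expression = probe_set_expression(pm_intensities, mm_intensities)
print(expression)
```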
Technology Comparison
N Engl J Med, 354: 2463, 2006.
[Figure: cDNA arrays (two-color analysis) compared with oligo arrays (one-color analysis).]
Technology comparison
• Affymetrix arrays
– Factory made, light-directed chemical synthesis of 25-mer oligos.
– Multiple assays/gene (20-40 oligos/gene).
– Absolute level of transcription.
– Inter-experiment, inter-lab comparisons relatively straightforward.
– Expensive.
– "Use the genes in a file."
• cDNA spotted arrays
– Custom-made, robotically spotted PCR products from cloned ESTs.
– Single assay/gene.
– Relative level (experiment/control ratio).
– Inter-experiment comparisons difficult, unless done on exactly the same slides.
– Cheap.
– "Use the genes in a tube."
Technology comparison
• 2 studies:
– Irizarry et al., Nat Meth 2(5): 345-350
– Larkin et al., Nat Meth 2(5): 337-344
• Multi-lab comparison of three platforms
– 1-color short oligos (Affy)
– 2-color cDNA
– 2-color long oligos
• Results
– Affymetrix accuracy best overall.
– Precision comparable among platforms.
– Lab effect stronger than platform effect.
– Reproducibility across labs and platforms good, although not perfect.
Data representation
• Genes: g1, g2, …, gm (rows)
• Experiments: e1, e2, …, en (columns)
• m x n matrix
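A minimal sketch of the m x n representation, assuming genes index the rows and experiments index the columns. The labels, values and helper names (`gene_profile`, `experiment_profile`) are illustrative only.

```python
genes = ["g1", "g2", "g3"]        # m = 3 rows
experiments = ["e1", "e2"]        # n = 2 columns

# matrix[i][j] is the expression of gene i in experiment j (made-up values).
matrix = [
    [0.72, 0.10],
    [1.58, 1.05],
    [1.10, 0.97],
]


def gene_profile(gene):
    """Return one row: the expression of a gene across all experiments."""
    return matrix[genes.index(gene)]


def experiment_profile(experiment):
    """Return one column: the expression of all genes in one experiment."""
    j = experiments.index(experiment)
    return [row[j] for row in matrix]


print(gene_profile("g2"))
print(experiment_profile("e1"))
```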
Data Visualization: Heatmap
Statistical Challenges
• Noisy measurements.
• “Large p, small n” (small sample size and large number of dimensions)
– “Curse of dimensionality”
– Easy to “over-fit”.
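To illustrate why “large p, small n” makes over-fitting easy, the sketch below simulates pure noise: 10,000 hypothetical genes measured in 3 + 3 samples with no real difference between the two classes. It counts the genes that nevertheless separate the classes perfectly. All numbers and the function name `count_perfect_separators` are assumptions made for the illustration.

```python
import random


def count_perfect_separators(n_genes, n_per_class, seed=0):
    """Count noise-only genes whose values separate two classes perfectly."""
    rng = random.Random(seed)
    count = 0
    for _ in range(n_genes):
        class_a = [rng.gauss(0, 1) for _ in range(n_per_class)]
        class_b = [rng.gauss(0, 1) for _ in range(n_per_class)]
        # Perfect separation: all values of one class lie below the other's.
        if max(class_a) < min(class_b) or max(class_b) < min(class_a):
            count += 1
    return count


n_genes = 10000
hits = count_perfect_separators(n_genes, n_per_class=3)
fraction = hits / n_genes
print(f"{hits} of {n_genes} noise genes separate the classes ({fraction:.1%})")
```

With 3 samples per class, the chance that a single noise gene separates the classes is 2 x (3! x 3!) / 6! = 0.1, so roughly a tenth of the genes look like perfect markers by chance alone.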
Signal Extraction
Main steps (applied to the genes x samples data):
• Background correction
– Eliminate background noise (correct for background “luminosity”, etc.).
• Normalization
– Make the different samples “comparable”.
• Summarization
– Gene Expression = “avg”(PM-MM)
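The first two steps can be sketched as follows, under simplifying assumptions that are not taken from the slides: background is a single constant per sample, and samples are made “comparable” by rescaling each one to a common mean intensity. The function names and all values are illustrative.

```python
from statistics import mean


def background_correct(sample, background):
    """Subtract a constant background estimate, flooring at zero."""
    return [max(x - background, 0.0) for x in sample]


def scale_normalize(samples, target=None):
    """Rescale each sample so that all samples share the same mean intensity."""
    means = [mean(s) for s in samples]
    if target is None:
        target = mean(means)  # default target: average of the sample means
    return [[x * target / m for x in s] for s, m in zip(samples, means)]


# Two samples, three probes each (made-up raw intensities).
raw = [
    [110.0, 210.0, 310.0],
    [220.0, 420.0, 620.0],
]

corrected = [background_correct(s, background=10.0) for s in raw]
normalized = scale_normalize(corrected)

for sample in normalized:
    print([round(x, 1) for x in sample], "mean =", round(mean(sample), 1))
```

After rescaling, both samples have the same mean, so a difference between them at a given probe is no longer dominated by overall array brightness.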
Signal Extraction methods

Method | Summarization | Background correction | Normalization | Citation
ChipMan | A multiplicative model similar to that of dChip is fit to the PM | | Linear transformation | (Lauren, 2003)
dChip | A multiplicative model is fit | MM intensities are subtracted | Spline fitted to rank invariant set | (Li and Wong, 2001)
GL | As RMA | None | Loess fitted to subset | (Freudenberg, 2005)
gMOSv.1 | Parameters from a gamma model are estimated from the PM and MM. These account for background and signal | | | (Milo et al., 2003)
GCRMA | As RMA | Based on probe sequence | As RMA | (Wu et al., 2004)
GSVDmod | Generalized SVD is used | None | Scale normalization | (Zuzan, 2003)
MAS5.0 | A robust average (Tukey biweight) | Spatial effect and MM subtracted | Scale normalization | (Affymetrix, 2002)
MMEI | A linear mixed model is fitted | None | Linear mixed model used as well | (Deng et al., 2005)
PerfectMatch | Model accounts for background and signal. The non-specific and specific effects are predicted using a free energy model | | | (Zhang et al., 2003)
PLIER | A multiplicative model is fitted to PM-MM. Accounts for heteroskedasticity | | As RMA | (Hubbell et al., 2004)
ProbeProfiler | Proprietary algorithm (http://www.corimbia.com) | | |
RMA | A robust linear model is fitted | A global correction is performed | Quantile | (Irizarry et al., 2003)
RSVD | The robust singular value decomposition methodology is applied to probe-level data | | | (Liu et al., 2003)
UMTrMn | A trimmed mean of the PM-MM is computed | Negatives are truncated | Similar to RMA | (Giordano et al., 2001)
VSN | As RMA | Generalized log transform is used to normalize and background correct | | (Huber et al., 2002)
ZAM | A robust linear model is fit | Similar to RMA | Averaged pairwise Loess | (Åstrand, 2003)
ZL | Model used to motivate a generalized log transform that normalizes, background corrects and summarizes | | | (Zhou and Rocke, 2005)

A description that covers several steps is given once; the remaining cells are left empty.
Source: Bioinformatics, 22(7): 789-794, 2006
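The table lists “Quantile” as the normalization step of RMA. The sketch below shows the basic idea of quantile normalization in plain Python: every sample is forced to share the same empirical distribution. It is a simplified illustration, not the RMA implementation: ties are not handled specially, and the function name and data are invented.

```python
def quantile_normalize(samples):
    """Give every sample the same empirical distribution (ties ignored)."""
    n = len(samples[0])
    sorted_samples = [sorted(s) for s in samples]
    # Reference distribution: mean of the k-th smallest values across samples.
    reference = [
        sum(s[k] for s in sorted_samples) / len(samples) for k in range(n)
    ]
    normalized = []
    for s in samples:
        # Indices of this sample's values, from smallest to largest.
        order = sorted(range(n), key=lambda i: s[i])
        out = [0.0] * n
        for rank, i in enumerate(order):
            out[i] = reference[rank]  # replace value by the reference quantile
        normalized.append(out)
    return normalized


# Two samples measured on the same three probes (made-up intensities).
samples = [
    [5.0, 2.0, 3.0],
    [4.0, 1.0, 6.0],
]
result = quantile_normalize(samples)
print(result)
```

Each probe keeps its rank within its own sample, but the set of values is identical across samples after normalization.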
Signal Extraction desiderata
• Maximize precision
– Low variance (need for replicates).
• Maximize accuracy
– Low bias (need for the truth).
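The two desiderata can be made concrete with a small sketch: precision is estimated from the variance of replicate measurements, and accuracy from the bias relative to a known true value. The replicate values and the "truth" below are hypothetical.

```python
from statistics import mean, variance


def bias(measurements, truth):
    """Accuracy: how far the average measurement is from the true value."""
    return mean(measurements) - truth


def spread(measurements):
    """Precision: sample variance across replicate measurements."""
    return variance(measurements)


# Hypothetical replicate measurements of one gene whose true level is known.
replicates = [9.8, 10.4, 10.1, 9.9]
truth = 10.0

measured_bias = bias(replicates, truth)
measured_variance = spread(replicates)
print(f"bias = {measured_bias:.3f}, variance = {measured_variance:.3f}")
```

Without replicates the variance cannot be estimated, and without a known true value the bias cannot be estimated, which is why the slide pairs each goal with its requirement.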
Signal Extraction
• Most popular methods:
– Robust Multi-chip Analysis (RMA)
– MAS5
  • Affymetrix software.
– GC-RMA
  • Modified RMA that models intensity of probe levels as a function of GC-content.
  • Expects to see higher intensity values for probes that are GC-rich due to increased binding.
• Pros and cons:
– (GC-)RMA more precise (accurate?) in the low and mid-range (tends to “squash” signal at high levels).
– MAS5 noisier at low levels.
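GC-RMA is described above as modeling probe intensity as a function of GC-content. The sketch below computes only that covariate, the GC fraction of a probe sequence. It does not reproduce the GC-RMA model itself, and the sequences are made-up 25-mers.

```python
def gc_content(probe):
    """Return the fraction of bases in the probe that are G or C."""
    probe = probe.upper()
    return sum(base in "GC" for base in probe) / len(probe)


# Hypothetical 25-mers, from GC-rich to GC-poor.
probes = {
    "gc_rich": "GCGCGCGCGCGCGCGCGCGCGCGCG",
    "mixed": "ACGTACGTACGTACGTACGTACGTA",
    "gc_poor": "ATATATATATATATATATATATATA",
}

for name, sequence in probes.items():
    print(f"{name:<8} GC fraction = {gc_content(sequence):.2f}")
```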
Introductory references
1. Hoffman, E. P., Awad, T., et al. Expression Profiling - Best Practices for Data Generation and Interpretation in Clinical Trials. Nat Rev Genet, 5: 229-237, 2004.
2. Larkin, J. E., Frank, B. C., et al. Independence and reproducibility across microarray platforms. Nat Meth, 2: 337-344, 2005.
3. Irizarry, R. A., Warren, D., et al. Multiple-laboratory comparison of microarray platforms. Nat Meth, 2: 345-350, 2005.
4. Irizarry, R. A., Wu, Z., and Jaffee, H. A. Comparison of Affymetrix GeneChip expression measures. Bioinformatics, 22(7): 789-794, 2006.