outline of the course - broad institute
Gad Getz, Stefano Monti, Michael Reich
{gadgetz,smonti,mreich}@broad.mit.edu
http://www.broad.mit.edu/~smonti/aws
Broad Institute of MIT & Harvard
October 18-20, 2006
Cambridge, MA
Outline of the course

Lectures
Day 1:
• Introduction: Functional Genomics
• GenePattern mini-tutorial
• FG Pipeline:
  o Data Acquisition
  o Preprocessing & Visualization
Day 2:
• Supervised Analysis
  o Differential analysis/GSEA
  o Class Prediction/Classification
  o Validation
Day 3:
  o [Survival analysis]
• Unsupervised Analysis
  o Clustering, Bi-clustering
• Annotation

Hands-on's
Day 1:
• Preprocessing
• Data visualization/Dimensionality reduction:
  o HeatMaps
  o PCA, NMF, MDS
Day 2:
• Differential Analysis/Annotation
• Classification:
  o Model building/selection
  o Evaluation
Day 3:
• Clustering:
  o HC, NMF, CC, Bi-clustering
  o GO Annotation
• Final Project
The functional genomics pipeline
• Experimental design: affects the outcome of data analysis
• Data acquisition: microarray processing
• Data preprocessing: scaling/normalization/filtering
• Data analysis/Hypothesis generation
  o Supervised Analysis: Differential analysis, Classification, …
  o Unsupervised Analysis: Clustering, Bi-clustering, …
• Validation/Annotation
  o Enrichment analysis: GO annotation, GSEA, …
  o “In silico” testing: Cross validation, train/test, etc.
  o “In vitro” testing: Back to the lab
Supervised vs. Unsupervised
Supervised methods look at relations between gene expression and known experimental conditions.
• Differential experiments usually yield:
  o A list of genes changed across two (or more) conditions;
  o A “stochastic profile” of each condition.
• Useful to identify diagnostic profiles and prognostic models.
Unsupervised methods aim at discovering new and unknown relationships:
  o among genes.
  o among samples.
  o among genes and samples.

Differential analysis: markers selection
Given phenotypically distinct classes, find “markers” with distinct expression patterns (in different classes).
Differential analysis: take-home messages
How to look for genes differentially expressed
How to visualize results
How to test for significance & control for multiple testing
How to evaluate significance by permutation testing
Going beyond gene markers (genesets as markers)
Sources of variation in the data
Interesting variation: gene expression variation associated with phenotype changes.
“Obscuring” variation:
• Technical:
  o sample preparation (RNA extraction)
  o manufacture of the arrays
  o processing of the arrays (IVT)
  o temperature in the lab
  o instrument (scanner) precision
  o different platforms
• Biological: different growth conditions, heterogeneity of samples, stochastic nature of biology.
Marker Selection: hierarchy of problems' difficulty

      Problem                                              Gene Markers   Error     Example
I.    Tissue or Cell Type (Normal vs. Abnormal)            ~1000-2000     ~0%       Normal vs. Renal carcinoma
II.   Morphological Type                                   ~200-500       ~0-5%     Leukemia ALL vs. AML
III.  Morphological Subtype (Multiclass Classification)    ~50-100        ~0-15%    ALL B- vs. T-Cell
IV.   Treatment Outcome (Drug Sensitivity)                 ~1-20          ~5-50%    AML Treatment Outcome

Degree of difficulty: I (lowest) to IV (highest). Adapted from P. Tamayo.
Marker Selection: what is needed
[Diagram: Dataset and Phenotype → Score → Measure of confidence/significance → Reject/Accept criterion (Yes/No) → Marker candidates]
Gene markers selection: quantifying differential expression

• fold difference: $\bar{y}_{gB} / \bar{y}_{gA}$
• difference of class means scaled by the standard deviation: $(\bar{y}_{gB} - \bar{y}_{gA}) / \sigma_g$

A: normal (N01-N03), B: tumor (T01-T03)

Gene         N01     N02    N03    T01     T02    T03     Descriptions
40909_at     21.00   9.40   2.60   19.20   4.60   3.70    M23316 SGD:YEL024C Yeast S
40908_r_at   6.20    4.40   2.20   8.30    0.70   1.10    K02207 SGD:YEL021W Yeast S
40907_at     17.20   28.90  38.10  14.70   13.20  23.40   U18530 SGD:YEL018W Yeast S
40906_at     6.90    4.60   3.10   33.70   5.60   6.20    X61388 SGD:YEL002C Yeast S
40905_s_at   52.10   4.20   12.60  61.50   4.50   16.60   K01391 B subtilis TrpE protein
40904_at     154.90  88.40  70.10  118.40  72.30  110.20  K01391 B subtilis TrpE protein
40903_at     92.40   6.40   10.00  96.60   9.90   12.80   K01391 B subtilis TrpE protein
40902_at     10.20   8.00   2.80   4.10    1.80   10.60   X04603 B subtilis thrC
40901_at     45.40   9.30   15.60  54.90   11.60  14.30   X04603 B subtilis thrC
Gene markers selection: scores
• GeneCluster/GenePattern (Broad) uses the signal-to-noise ratio (SNR) statistic, and traditional t-statistics.
• SAM (Stanford) uses a statistic similar to the classical t-statistic. The parameter a is chosen to minimize the coefficient of variation.
• BioConductor
• …
Success: availability of software.

SAM statistic:
$$t'_g = \frac{\bar{y}_{gB} - \bar{y}_{gA}}{a + \sqrt{\dfrac{S_{gA}^2}{n_A} + \dfrac{S_{gB}^2}{n_B}}}$$

Signal-to-noise ratio:
$$\mathrm{s2n}_g = \frac{\bar{y}_{gB} - \bar{y}_{gA}}{S_{gA} + S_{gB}}$$

t-statistic:
$$t_g = \frac{\bar{y}_{gB} - \bar{y}_{gA}}{\sqrt{\dfrac{S_{gA}^2}{n_A} + \dfrac{S_{gB}^2}{n_B}}}$$
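The three scores can be sketched in a few lines of Python. This is a minimal illustration, not the GenePattern or SAM implementation: the function names `snr` and `t_stat` are made up here, the sign convention (class B minus class A) is an assumption, and the SAM constant `a` is simply passed in rather than tuned as SAM does.

```python
from math import sqrt
from statistics import mean, stdev


def snr(a_values, b_values):
    """Signal-to-noise ratio: difference of class means over the sum of class std devs."""
    return (mean(b_values) - mean(a_values)) / (stdev(a_values) + stdev(b_values))


def t_stat(a_values, b_values, a=0.0):
    """Unequal-variance t-statistic; a > 0 gives a SAM-like regularized score."""
    std_err = sqrt(stdev(a_values) ** 2 / len(a_values)
                   + stdev(b_values) ** 2 / len(b_values))
    return (mean(b_values) - mean(a_values)) / (a + std_err)


if __name__ == "__main__":
    # Gene 40906_at from the table on the slide: normals (A) and tumors (B).
    normal = [6.90, 4.60, 3.10]
    tumor = [33.70, 5.60, 6.20]
    print("SNR:", snr(normal, tumor))
    print("t:  ", t_stat(normal, tumor))
    print("t': ", t_stat(normal, tumor, a=1.0))  # a=1.0 is an arbitrary illustrative value
```

With only three samples per class a single outlying tumor value (33.70) dominates both the mean difference and the standard deviation, which is one reason the scores need a significance assessment.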
Visualization of E: Heatmap
Rows are sorted according to the score.
[Figure: heatmap of ALL vs. MLL vs. AML (top 50 markers/class). Columns: ALL, MLL and AML samples; rows: ALL markers, MLL markers, AML markers; color scale from low to high expression. Source: Nature Genetics 30, pp. 41-47 (2002).]
Gene markers selection: challenges
• Small sample size.
• Non-normal (non-symmetric) distribution.
  o SNR, t-statistic work best with symmetric uni-modal distributions.
• Gene interaction not taken into account.
• Multiple testing.
Gene markers selection: significance and multiple testing
Settings:
• 20K+ genes,
• only 10s/100s of samples
Problem:
• the chances of finding genes correlated with any random phenotype labels are high.
• the smaller the no. of samples, the higher the chances.
[Figure: genes × samples matrix with random “Head vs. Tail” labels]
Gene markers selection: better than chance?
• Generated a [10,000x100] matrix from a Gaussian(μ=0, σ=0.5)
• Picked n columns (n = 6, 14, 30, 100)
• Tossed n coins
• Selected:
  o top 25 markers for head
  o top 25 markers for tail
[Figure: heatmaps of the selected markers, Head vs. Tail, for 6, 14, 30 and 100 samples]
As sample size increases, the random phenotype looks less clean.
The Head/Tail example suggests a way of testing for the significance of the results: is the observed difference in expression bigger than what we can observe with respect to a random phenotype (head/tail)?
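The Head/Tail experiment is easy to re-run. The sketch below is a rough reconstruction under stated assumptions: it scores genes with the signal-to-noise ratio, it draws the n columns directly instead of picking them from a 10,000x100 matrix (equivalent for i.i.d. Gaussian data), and it re-tosses the coins until each class has at least two samples so that a standard deviation exists. The function name and the seed are illustrative only.

```python
import random


def _mean(xs):
    return sum(xs) / len(xs)


def _std(xs):
    m = _mean(xs)
    return (sum((x - m) ** 2 for x in xs) / (len(xs) - 1)) ** 0.5


def random_phenotype_markers(n_samples, n_genes=10_000, n_top=25, sigma=0.5, seed=0):
    """Score pure-noise genes against coin-toss labels; return (head, tail) marker scores."""
    rng = random.Random(seed)
    data = [[rng.gauss(0.0, sigma) for _ in range(n_samples)] for _ in range(n_genes)]

    # Toss n coins; re-toss if a class ends up with fewer than 2 samples (assumption).
    while True:
        is_head = [rng.random() < 0.5 for _ in range(n_samples)]
        if 2 <= sum(is_head) <= n_samples - 2:
            break

    scores = []
    for row in data:
        head = [x for x, h in zip(row, is_head) if h]
        tail = [x for x, h in zip(row, is_head) if not h]
        scores.append((_mean(head) - _mean(tail)) / (_std(head) + _std(tail)))

    scores.sort()
    return scores[-n_top:], scores[:n_top]  # highest scores (head), lowest scores (tail)


if __name__ == "__main__":
    for n in (6, 14, 30, 100):
        head, tail = random_phenotype_markers(n, n_genes=2000)
        print(n, "samples: best head score", round(head[-1], 2),
              "best tail score", round(tail[0], 2))
```

The best "marker" scores shrink as n grows even though no gene carries any signal, which is the point of the slide.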
Statistical Significance
• P-value: definition(s)
• Multiple-Hypothesis Testing: P-value … vs. False Discovery Rate (FDR) … vs. Family-Wise Error Rate (FWER).
Single test: H0 vs. H1
Null hypothesis H0: no difference in expression
  $\mu_0 = \mu_1$ (and/or $\sigma_0 = \sigma_1$)
Alternative hypothesis H1: there IS a difference in expression
  $\mu_0 \neq \mu_1$ (and/or $\sigma_0 \neq \sigma_1$)
where $\mu_0$ is the mean in class 0.

• Choose a test statistic (e.g. SNR, T-test, rank-sum)
• Calculate observed statistic: t_obs
• Calculate/estimate P-value: probability of observing t_obs (or larger) under the null distribution
  p ≡ P(T ≥ t_obs | H0)
What distributional form does the t-statistic have?
P-value calculation: asymptotic theory
IF a gene is normally distributed, … THEN the t-score follows the t-distribution (if H0 holds):
  t_obs | H0 ~ t(μ=0, df=n-1)
[Figure: distribution of scores under the null hypothesis, P(t-score | H0) vs. t-score, with the observed score of the gene marked; p = Area(blue) / Area(total)]
WHAT IF genes are NOT normally distributed?
P-value calculation: permutation test
Repeat many times:
• shuffle labels (class membership)
• compute score for each gene (t-score, SNR, ...)
⇒ Empirical null distribution of scores for each gene.
Compare observed score to empirical distribution:
$$p = \frac{\#\{s_{permuted} \ge s_{observed}\}}{\#\{s_{permuted}\}}$$
No distributional assumptions are made.
[Figure: distribution of permuted scores for a given gene, with the observed score of the gene marked]
Permutation Test: gene-specific p-value
• Compute observed score for each gene
• Compute permuted scores for each gene (5 times)
• Compute gene-specific p-values

          SNR     perm.1  perm.2  perm.3  perm.4  perm.5
gene.1    -0.29   -0.17   -0.09    0.23    0.23   -0.25   p=0/5
gene.2    -0.28   -0.08    0.29    0.16    0.61    0.41   p=3/5
gene.3    -0.31   -0.33    0.03   -0.05   -0.10   -0.46   p=2/5
gene.4    -0.17    0.65   -0.46   -0.13   -0.30   -0.75   …
gene.5    -0.47    0.19   -0.70    0.76    0.02   -0.36
gene.6     0.29   -0.09   -0.15    0.08    0.06   -0.44
gene.7     0.28    0.05    0.13    0.03    0.47    0.26
gene.8     0.31   -0.23   -0.12   -0.18   -0.20   -0.69
gene.9     0.17    0.82   -0.54   -0.13   -0.44   -0.96
gene.10    0.47    0.41   -0.94    0.48   -0.28   -0.61
gene.11   -0.36   -0.60    0.10    0.81   -0.16    0.11
gene.12    0.08   -0.69    0.42   -0.27    0.26    0.32
gene.13    0.22   -0.18   -0.54    0.29    0.20    0.15
gene.14    0.26    0.25   -0.13    0.07   -1.74   -0.32
gene.15   -0.22    0.15    0.21   -0.06   -0.52   -0.14
gene.16    0.22    0.03    0.18   -0.32   -0.36   -0.62
gene.17    0.12    0.65    0.09   -0.10   -0.54   -0.41
gene.18    0.42    0.14   -0.51    0.37   -0.10    0.09
gene.19    0.04    0.34   -0.30    0.04   -0.18   -0.01   …
gene.20   -0.82    0.37    0.02   -0.10   -0.40   -0.50   p=0/5

Class-label vectors shown on the slide (first row: original labels; remaining rows: shuffled labels):
0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1
1 0 0 1 1 1 0 0 1 0 0 1 0 1 1 0
0 1 1 0 1 0 0 1 1 0 0 1 1 1 0 0
1 0 0 0 1 0 1 1 1 0 0 1 1 0 0 1
1 0 1 0 1 0 1 0 1 0 1 1 1 0 0 0
0 0 0 1 0 0 1 1 1 1 1 0 0 1 0 1
0 1 0 0 1 0 1 1 1 1 1 0 0 0 1 0
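The gene-specific p-values on this slide can be reproduced from the table. The sketch below uses four rows copied from it; the helper name `perm_pvalue` is made up for illustration. One assumption is worth flagging: the p-values shown on the slide (0/5, 3/5, 2/5, 0/5) only come out if scores are compared in absolute value, i.e. as a two-sided test, so that is what the code does.

```python
# Observed SNR and the five permuted scores, copied from the slide table.
SCORES = {
    "gene.1": (-0.29, [-0.17, -0.09, 0.23, 0.23, -0.25]),
    "gene.2": (-0.28, [-0.08, 0.29, 0.16, 0.61, 0.41]),
    "gene.3": (-0.31, [-0.33, 0.03, -0.05, -0.10, -0.46]),
    "gene.20": (-0.82, [0.37, 0.02, -0.10, -0.40, -0.50]),
}


def perm_pvalue(observed, permuted):
    """Fraction of permuted scores at least as extreme as the observed one (two-sided)."""
    hits = sum(abs(s) >= abs(observed) for s in permuted)
    return hits / len(permuted)


if __name__ == "__main__":
    for gene, (observed, permuted) in SCORES.items():
        print(gene, perm_pvalue(observed, permuted))
```

With only 5 permutations the smallest attainable non-zero p-value is 1/5; in practice many more permutations are needed to resolve small p-values.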
Single test: H0 vs. H1
• Choose a test statistic (e.g. SNR, T-test, rank-sum)
• Calculate observed statistic: t_obs
• Calculate/estimate P-value: p ≡ P(T ≥ t_obs | H0)
• Controlling the false positive rate (FPR) at level α: p ≤ α → significant
• Type I error: False Positive Rate (FPR): P(p ≤ α | H0) = P(“significant” | H0)
• Type II error: False Negative Rate (FNR): P(p > α | H1) = P(¬significant | H1) = β(α)
• Power: P(p ≤ α | H1) = 1 − β(α)
Given H0: CDF(p) = p ⇒ P-values of H0 are U[0,1], so the FPR is P(p ≤ α | H0) = α.
[Figure: uniform density P(p) = 1 on [0,1], with α marked on the p axis]
Multiple Hypotheses: what to control
• FWER (family-wise error rate): probability of calling one or more hypotheses significant given that they are all null:
  P(#significant > 0 | ∩i H0i)
  o Bonferroni or maxT
• FDR (false discovery rate): probability that the null hypothesis is true given that the result is significant:
  P(H0 | significant)
  o Benjamini & Hochberg [JRSSB, 1995]
  o Storey & Tibshirani [PNAS, 2003]
  o …
• No free lunch: MHT corrections control for false positives at the cost of increasing the false negatives.
• Try to reduce the number of hypotheses tested in the first place.
Multiple Hypotheses
Multiple hypotheses (H•i, i=1,2,…,n) → {pi}
• {pi} are in general not independent
• the marginal distribution of each one is uniform
Distribution of P-values is a mixture of 2 distributions:
• π0: fraction of genes NOT differentially expressed = P(H0)
• 1−π0: fraction of genes differentially expressed = P(H1) = 1−P(H0)
[Figure: density of p-values on [0,1] = π0 × (uniform null density) + (1−π0) × (density under H1); FDR(α) is shown as a ratio of areas to the left of α]
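FDR calculation
$$FDR(p_i) = \frac{p_i}{\mathrm{prop}\{p' \le p_i\}} = p_i \times \frac{N}{i}$$
where i is the rank of the p-value and N is the number of hypotheses.
$$FDR(p_i) = \pi_0 \times \frac{p_i}{\mathrm{prop}\{p' \le p_i\}} = \pi_0 \times p_i \times \frac{N}{i}$$
• Benjamini & Hochberg [JRSSB, 1995] assume π0=1
• Storey & Tibshirani [PNAS, 2003] estimate π0 from the data
Empirical FDR based on permutation:
$$FDR(s) = \frac{\mathrm{prop}\{s_{prm} \ge s\}}{\mathrm{prop}\{s_{obs} \ge s\}}$$
[Figure: p-value density on [0,1] with the null level π0 marked, illustrating FDR(α)]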
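The rank-based FDR formula can be sketched as follows. This is a minimal illustration with a made-up function name, not a reference implementation: `pi0=1.0` corresponds to the Benjamini & Hochberg assumption, and any other value has to be supplied by the caller (estimating π0 from the data is not shown). The running minimum that keeps the estimates monotone in the p-value is the usual step-up adjustment and is an addition to the bare formula on the slide.

```python
def fdr_from_pvalues(pvalues, pi0=1.0):
    """Return FDR(p_i) = pi0 * p_i * N / rank(p_i) for each p-value, made monotone."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda k: pvalues[k])  # indices by increasing p-value
    fdr = [0.0] * n
    running_min = 1.0  # also caps the estimate at 1
    # Walk from the largest p-value down so each FDR is <= those of larger p-values.
    for rank in range(n, 0, -1):
        k = order[rank - 1]
        running_min = min(running_min, pi0 * pvalues[k] * n / rank)
        fdr[k] = running_min
    return fdr


if __name__ == "__main__":
    pvals = [0.001, 0.5, 0.9, 0.02, 0.03]
    for p, q in zip(pvals, fdr_from_pvalues(pvals)):
        print(f"p = {p:<6} FDR = {q:.4f}")
```

Calling genes with FDR ≤ α then gives a list in which roughly a fraction α of the calls is expected to be false.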
P-value calculation by permutation test
• Gene-specific p-value
• FWER (family-wise error rate)
• FWER by maxT [Westfall & Young, 93]
• …
Permutation Test: FWER
The observed SNR column and each permutation column are sorted independently:

          SNR     perm.1  perm.2  perm.3  perm.4  perm.5
gene.20   -0.82   -0.69   -0.94   -0.32   -1.74   -0.96
gene.5    -0.47   -0.60   -0.70   -0.27   -0.54   -0.75
gene.11   -0.36   -0.33   -0.54   -0.18   -0.52   -0.69
gene.3    -0.31   -0.23   -0.54   -0.13   -0.44   -0.62
gene.1    -0.29   -0.18   -0.51   -0.13   -0.40   -0.61
gene.2    -0.28   -0.17   -0.46   -0.10   -0.36   -0.50
gene.15   -0.22   -0.09   -0.30   -0.10   -0.30   -0.46
gene.4    -0.17   -0.08   -0.15   -0.06   -0.28   -0.44
gene.19    0.04    0.03   -0.13   -0.05   -0.20   -0.41
gene.12    0.08    0.05   -0.12    0.03   -0.18   -0.36
gene.17    0.12    0.14   -0.09    0.04   -0.16   -0.32
gene.9     0.17    0.15    0.02    0.07   -0.10   -0.25
gene.13    0.22    0.19    0.03    0.08   -0.10   -0.14
gene.16    0.22    0.25    0.09    0.16    0.02   -0.01
gene.14    0.26    0.34    0.10    0.23    0.06    0.09
gene.7     0.28    0.37    0.13    0.29    0.20    0.11
gene.6     0.29    0.41    0.18    0.37    0.23    0.15
gene.8     0.31    0.65    0.21    0.48    0.26    0.26
gene.18    0.42    0.65    0.29    0.76    0.47    0.32
gene.10    0.47    0.82    0.42    0.81    0.61    0.41

FWER(s) = proportion of iterations having one or more scores ≥ s
FWER(gene20) = (1+1+0+1+1) ÷ 5 = 0.8
FWER(gene5) = (1+1+1+1+1) ÷ 5 = 1
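The two worked FWER values can be checked with a short sketch. It only needs the most extreme score of each permutation, taken here from the first and last rows of the sorted table. As an assumption, "≥ s" is read as a comparison of absolute values, since that is the reading under which the table yields 0.8 for gene.20 and 1 for gene.5. The names `PERM_MIN`, `PERM_MAX` and `fwer` are illustrative.

```python
# Smallest and largest permuted score in each of the 5 permutations (from the table).
PERM_MIN = [-0.69, -0.94, -0.32, -1.74, -0.96]
PERM_MAX = [0.82, 0.42, 0.81, 0.61, 0.41]


def fwer(score, perm_min=PERM_MIN, perm_max=PERM_MAX):
    """Proportion of permutations with at least one score as extreme as `score`."""
    # A permutation contains such a score exactly when its most extreme score does.
    extremes = [max(abs(lo), abs(hi)) for lo, hi in zip(perm_min, perm_max)]
    return sum(e >= abs(score) for e in extremes) / len(extremes)


if __name__ == "__main__":
    print("FWER(gene.20):", fwer(-0.82))
    print("FWER(gene.5): ", fwer(-0.47))
```

Even the top-scoring gene has a FWER of 0.8 in this toy table, which illustrates how stringent family-wise control is compared with the gene-specific p-value.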
Permutation Test: FWER by maxT procedure
• Compute observed score for each gene
• Compute permuted scores for each gene (5 times)

GeneID    score   prm.1   prm.2   prm.3   prm.4   prm.5
gene.1    -0.47    0.09    0.18   -0.25    0.19    0.02
gene.2    -0.42    0.25   -0.14   -0.29    0.18   -0.03
gene.3    -0.28   -0.15    0.22   -0.10   -0.48   -0.19
gene.4    -0.51    0.06    0.02   -0.50    0.01   -0.19
gene.5    -0.26    0.23   -0.16   -0.18    0.17   -0.19
gene.6     0.47   -0.06    0.20    0.17    0.15    0.04
gene.7     0.42   -0.13    0.07    0.26    0.32    0.01
gene.8     0.28    0.07    0.25    0.13   -0.32   -0.18
gene.9     0.51    0.06    0.01   -0.10    0.03   -0.09
gene.10    0.26    0.12    0.13   -0.01    0.33    0.02
gene.11    0.13    0.01    0.03   -0.01    0.02   -0.11
gene.12    0.19    0.22   -0.02   -0.18    0.17   -0.05
gene.13    0.19   -0.05    0.02   -0.02    0.10    0.20
gene.14   -0.17   -0.13    0.09   -0.12   -0.08    0.27
gene.15    0.04    0.06   -0.11    0.15   -0.10   -0.34
gene.16    0.03   -0.18    0.07   -0.42   -0.13   -0.07
gene.17   -0.02   -0.03    0.12    0.10   -0.09    0.05
gene.18    0.15   -0.26   -0.22    0.13   -0.16    0.05
gene.19    0.23    0.02    0.36   -0.02    0.13    0.16
gene.20    0.10   -0.03    0.13    0.00   -0.05    0.01
Permutation Test: FWER by maxT procedure (continued)
• Take the absolute values of the scores
• Sort genes according to absolute value of observed scores
• Make the permuted scores monotonically non-decreasing

Absolute values, sorted by observed score:

GeneID    score   prm.1   prm.2   prm.3   prm.4   prm.5
gene.4    0.51    0.06    0.02    0.50    0.01    0.19
gene.9    0.51    0.06    0.01    0.10    0.03    0.09
gene.1    0.47    0.09    0.18    0.25    0.19    0.02
gene.6    0.47    0.06    0.20    0.17    0.15    0.04
gene.2    0.42    0.25    0.14    0.29    0.18    0.03
gene.7    0.42    0.13    0.07    0.26    0.32    0.01
gene.3    0.28    0.15    0.22    0.10    0.48    0.19
gene.8    0.28    0.07    0.25    0.13    0.32    0.18
gene.5    0.26    0.23    0.16    0.18    0.17    0.19
gene.10   0.26    0.12    0.13    0.01    0.33    0.02
gene.19   0.23    0.02    0.36    0.02    0.13    0.16
gene.12   0.19    0.22    0.02    0.18    0.17    0.05
gene.13   0.19    0.05    0.02    0.02    0.10    0.20
gene.14   0.17    0.13    0.09    0.12    0.08    0.27
gene.18   0.15    0.26    0.22    0.13    0.16    0.05
gene.11   0.13    0.01    0.03    0.01    0.02    0.11
gene.20   0.10    0.03    0.13    0.00    0.05    0.01
gene.15   0.04    0.06    0.11    0.15    0.10    0.34
gene.16   0.03    0.18    0.07    0.42    0.13    0.07
gene.17   0.02    0.03    0.12    0.10    0.09    0.05
Permuted scores after being made monotonically non-decreasing from the bottom of the list upward, one permutation column at a time (prm.1, then prm.2, then the rest), with the resulting p-values:

GeneID    score   prm.1   prm.2   prm.3   prm.4   prm.5
gene.4    0.51    0.26    0.36    0.50    0.48    0.34    p = 0/5
gene.9    0.51    0.26    0.36    0.42    0.48    0.34    p = 0/5
gene.1    0.47    0.26    0.36    0.42    0.48    0.34    p = 1/5
gene.6    0.47    0.26    0.36    0.42    0.48    0.34    p = 1/5
gene.2    0.42    0.26    0.36    0.42    0.48    0.34    p = 2/5
gene.7    0.42    0.26    0.36    0.42    0.48    0.34    p = 2/5
gene.3    0.28    0.26    0.36    0.42    0.48    0.34    p = 4/5
gene.8    0.28    0.26    0.36    0.42    0.33    0.34    p = 4/5
gene.5    0.26    0.26    0.36    0.42    0.33    0.34    p = 5/5
gene.10   0.26    0.26    0.36    0.42    0.33    0.34    …
gene.19   0.23    0.26    0.36    0.42    0.17    0.34
gene.12   0.19    0.26    0.22    0.42    0.17    0.34
gene.13   0.19    0.26    0.22    0.42    0.16    0.34
gene.14   0.17    0.26    0.22    0.42    0.16    0.34
gene.18   0.15    0.26    0.22    0.42    0.16    0.34
gene.11   0.13    0.18    0.13    0.42    0.13    0.34
gene.20   0.10    0.18    0.13    0.42    0.13    0.34
gene.15   0.04    0.18    0.12    0.42    0.13    0.34
gene.16   0.03    0.18    0.12    0.42    0.13    0.07
gene.17   0.02    0.03    0.12    0.10    0.09    0.05

• Compute the p-values.
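The maxT steps above can be sketched in Python using the scores from the slide. The function name `maxt_pvalues` is illustrative, and the code implements only the steps shown here (absolute values, sort, running maximum, count); a full Westfall & Young step-down procedure would additionally force the adjusted p-values themselves to be monotone, which is left out.

```python
# Observed score and permuted scores (prm.1 .. prm.5), copied from the slide.
SCORES = {
    "gene.1": (-0.47, [0.09, 0.18, -0.25, 0.19, 0.02]),
    "gene.2": (-0.42, [0.25, -0.14, -0.29, 0.18, -0.03]),
    "gene.3": (-0.28, [-0.15, 0.22, -0.10, -0.48, -0.19]),
    "gene.4": (-0.51, [0.06, 0.02, -0.50, 0.01, -0.19]),
    "gene.5": (-0.26, [0.23, -0.16, -0.18, 0.17, -0.19]),
    "gene.6": (0.47, [-0.06, 0.20, 0.17, 0.15, 0.04]),
    "gene.7": (0.42, [-0.13, 0.07, 0.26, 0.32, 0.01]),
    "gene.8": (0.28, [0.07, 0.25, 0.13, -0.32, -0.18]),
    "gene.9": (0.51, [0.06, 0.01, -0.10, 0.03, -0.09]),
    "gene.10": (0.26, [0.12, 0.13, -0.01, 0.33, 0.02]),
    "gene.11": (0.13, [0.01, 0.03, -0.01, 0.02, -0.11]),
    "gene.12": (0.19, [0.22, -0.02, -0.18, 0.17, -0.05]),
    "gene.13": (0.19, [-0.05, 0.02, -0.02, 0.10, 0.20]),
    "gene.14": (-0.17, [-0.13, 0.09, -0.12, -0.08, 0.27]),
    "gene.15": (0.04, [0.06, -0.11, 0.15, -0.10, -0.34]),
    "gene.16": (0.03, [-0.18, 0.07, -0.42, -0.13, -0.07]),
    "gene.17": (-0.02, [-0.03, 0.12, 0.10, -0.09, 0.05]),
    "gene.18": (0.15, [-0.26, -0.22, 0.13, -0.16, 0.05]),
    "gene.19": (0.23, [0.02, 0.36, -0.02, 0.13, 0.16]),
    "gene.20": (0.10, [-0.03, 0.13, 0.00, -0.05, 0.01]),
}


def maxt_pvalues(scores):
    """maxT-adjusted p-values following the steps on the slide."""
    # Steps 1-2: absolute values, genes sorted by decreasing |observed score|.
    genes = sorted(scores, key=lambda g: -abs(scores[g][0]))
    observed = [abs(scores[g][0]) for g in genes]
    permuted = [[abs(v) for v in scores[g][1]] for g in genes]
    n_perm = len(permuted[0])

    # Step 3: within each permutation, replace every score by the running
    # maximum taken from the bottom of the sorted list upward.
    for k in range(n_perm):
        running_max = 0.0
        for row in range(len(genes) - 1, -1, -1):
            running_max = max(running_max, permuted[row][k])
            permuted[row][k] = running_max

    # Step 4: p = fraction of permutations whose adjusted score reaches the observed one.
    return {g: sum(v >= obs for v in permuted[i]) / n_perm
            for i, (g, obs) in enumerate(zip(genes, observed))}


if __name__ == "__main__":
    for gene, p in maxt_pvalues(SCORES).items():
        print(gene, p)
```

The running maximum means each gene is compared against the largest permuted score among itself and all lower-ranked genes, which is what makes the correction family-wise.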
Permutation test: comments
• Permuting the class labels vs. permuting the gene values:
  o Permuting the class labels preserves the correlation among genes.
• Pooling permuted scores from different genes is problematic (genes may have different distributions).
• Empirical p-value does not always take into account the magnitude of score/difference.
Annotation
Annotating the identified gene markers:
• identify the gene products and their documented characteristics/function:
  o GeneCruiser
  o …
• test for their enrichment with respect to meaningful biological categories.
  o Gene Ontology (GO)
  o Gene Set Enrichment Analysis (GSEA)
Gene Set Enrichment Analysis: beyond gene markers
Using genesets as phenotype markers.

Markers selection: what's needed
[Diagram: Dataset and Phenotype → Score → Measure of confidence/significance → Reject/Accept criterion (Yes/No) → Marker candidates]
Enrichment: KS-score
• Rank genes according to their “correlation” with the class of interest.
• Test if a geneset (e.g., a GO category, a pathway, a different class signature) “enriches” any of the classes.
• Use the Kolmogorov-Smirnov score to measure enrichment.
[Figure: ordered marker list ranked by phenotype, with the members of Gene Set G marked as hits (member of G) or misses (non-member of G); plot of the running enrichment score S against the gene list order index, with the maximum enrichment score ES marked]

Empirical Cumulative Distributions:
$$P_{hit}(i) = \sum_{j=1}^{i} \frac{M(j)}{N_H}, \qquad P_{miss}(i) = \sum_{j=1}^{i} \frac{1 - M(j)}{N_M}$$
where M(j) = 1 if the j-th gene in the ordered list is a hit and 0 if it is a miss, N_H is the number of hits and N_M the number of misses.

Running Score S:
$$S(i) = \sum_{j=1}^{i} \left[ \frac{M(j)}{N_H} - \frac{1 - M(j)}{N_M} \right]$$

KS Enrichment Score (ES):
$$ES = \max_{i=1,..,N} S(i)$$

Mootha et al., Nature Genetics 2003
Subramanian et al., PNAS 2005
Enrichment: weighted score
Motivation: the standard KS score is susceptible to calling a geneset significant when its members are close to each other even though they are far from both ends of the ranking.
Enrichment: KS-score
[Figure: running enrichment score S vs. gene list order index for an enriched gene set and an un-enriched gene set, with the maximum enrichment score ES marked in each panel]
• Every hit goes up by 1/NH
• Every miss goes down by 1/NM
• The maximum height provides the enrichment score
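The running score is a simple random walk over the ranked list, as the following sketch shows. It is a minimal unweighted version with illustrative function names and a toy gene list; it is not the GSEA software, which also offers the weighted score.

```python
def running_score(ranked_genes, gene_set):
    """Running KS score: +1/N_H at every hit, -1/N_M at every miss."""
    hits = [gene in gene_set for gene in ranked_genes]
    n_hit = sum(hits)
    n_miss = len(hits) - n_hit
    score, walk = 0.0, []
    for is_hit in hits:
        score += 1.0 / n_hit if is_hit else -1.0 / n_miss
        walk.append(score)
    return walk


def enrichment_score(ranked_genes, gene_set):
    """ES = maximum height reached by the running score."""
    return max(running_score(ranked_genes, gene_set))


if __name__ == "__main__":
    ranked = ["a", "b", "c", "d", "e", "f"]         # toy ordered marker list
    print(running_score(ranked, {"a", "b"}))        # hits at the top of the list
    print(enrichment_score(ranked, {"a", "b"}))
    print(enrichment_score(ranked, {"e", "f"}))     # hits at the bottom of the list
```

Because every hit adds 1/N_H and every miss subtracts 1/N_M, the walk always returns to zero at the end of the list; a set concentrated at the top produces an early peak.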
Enrichment: scoring multiple genesets
[Diagram: an ordered gene marker list (ranked by phenotype or gene template) is scored against Gene Set A, Gene Set B, … (e.g., Pathway A, Pathway B), potentially an entire database of Gene Sets]
Enrichment: testing a geneset for significance (permutation test)
• Test 1: Randomize Gene Set Labels
• Test 2: Randomize Phenotype Labels
  o Preserves correlation between genes: more stringent
• Repeat N times
• Assess significance of Geneset G's observed ES score using the histogram of permuted ES scores
[Figure: Gene Set G scored against the ordered list under repeated randomizations, and a histogram of the permuted ES scores with the observed ES marked]
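Test 1 (randomizing gene set labels) can be sketched without any expression data. The code below is a rough illustration under stated assumptions: it uses the unweighted KS score, draws random gene sets of the same size as G from the ranked list, and the function names, the toy gene list and the seed are all made up. Test 2 (randomizing phenotype labels) needs the expression matrix so that the ranking can be recomputed for every shuffle, and is not shown.

```python
import random


def enrichment_score(ranked_genes, gene_set):
    """Unweighted KS enrichment score: maximum of the running hit/miss walk."""
    n_hit = sum(gene in gene_set for gene in ranked_genes)
    n_miss = len(ranked_genes) - n_hit
    score, best = 0.0, float("-inf")
    for gene in ranked_genes:
        score += 1.0 / n_hit if gene in gene_set else -1.0 / n_miss
        best = max(best, score)
    return best


def geneset_permutation_pvalue(ranked_genes, gene_set, n_perm=1000, seed=0):
    """Compare the observed ES with the ES of random gene sets of the same size."""
    rng = random.Random(seed)
    observed = enrichment_score(ranked_genes, gene_set)
    size = sum(gene in gene_set for gene in ranked_genes)
    null_scores = [enrichment_score(ranked_genes, set(rng.sample(ranked_genes, size)))
                   for _ in range(n_perm)]
    p_value = sum(es >= observed for es in null_scores) / n_perm
    return observed, p_value


if __name__ == "__main__":
    ranked = [f"g{i}" for i in range(100)]                     # toy ordered marker list
    top_set = set(ranked[:5])                                  # concentrated at the top
    spread_set = {"g3", "g27", "g48", "g71", "g95"}            # spread over the list
    print(geneset_permutation_pvalue(ranked, top_set))
    print(geneset_permutation_pvalue(ranked, spread_set))
```

Randomizing gene set labels ignores the correlation between genes, which is why the slide singles out the phenotype-label test as the more stringent one.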
Enrichment: testing multiple genesets for significance
[Diagram: ES scores for Gene Set 1, G2, … under Permutation 1, Permutation 2, …, Permutation Π, obtained by randomizing phenotype labels]
Need for Multiple Hypothesis Correction: same approaches as in the gene markers selection (FDR, maxT, etc.)
Constructing genesets
• GO categories.
• “Pathways” from BioCarta, GenMAPP, KEGG, etc.
• “Target” gene lists.
• Gene marker signatures.
• Gene clusters.
Geneset Enrichment: an example: human diabetes
[Figure: skeletal muscle biopsies from normals and diabetics]
GSEA was used to assess enrichment of 149 Gene Sets including 113 pathways from internal curation and GenMAPP, and 36 tightly co-expressed clusters from a compendium of mouse gene expression data.
These GSEA results appeared in Mootha et al., Nature Genetics, 15 June 2003, vol. 34, no. 3, pp. 267-273:
PGC-1α-responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes
Geneset Enrichment: another example: mediastinal lymphoma
KS-based enrichment analysis was used to test the similarity between Mediastinal Lymphoma (a type of DLBCL) and Hodgkin Lymphoma [Savage et al., Blood 2003].
[Figure: MLBCL vs. DLBCL samples, with the positions of the Hodgkin geneset hits marked]
Hits represent genes belonging to a “Hodgkin geneset” identified by differential analysis of “normal B-cells vs. Hodgkin cell lines” in an independent study.
Differential analysis: take-home messages
How to look for genes differentially expressed
How to visualize results
How to test for significance & control for multiple testing
How to evaluate significance by permutation testing
Going beyond gene markers (genesets as markers)
Cookbook
• Reduce number of hypotheses/genes by variation filtering (attempt at reducing FNs)
• Choose test statistic (say, t-score)
• If enough samples, compute p-values by permutation test (otherwise, use asymptotic test).
• Control for MHT by using the FDR correction
  o Remember: if you choose FDR≤α (e.g., 0.05), you’re willing to accept that about α×100% (e.g., 5%) of the genes called significant are false positives.
• If number of significant hypotheses/genes “too large” even for very small α’s, either:
  o use the maxT correction (possible w/ empirical p-values only).
  o use additional criteria (e.g., min fold-change, min expression value, etc.)
WARNING: should not be followed blindly …
Differential Analysis: not discussed
Two-sided vs. one-sided test
Variance thresholded t-test
Confidence Interval on P-values after n permutations.
“Smoothing” of p-values
Speeding up permutation tests
Bayesian Methods (computing the posterior)
Controlling for “confounders”