outline of the course - broad institute

Gad Getz, Stefano Monti, Michael Reich
{gadgetz,smonti,mreich}@broad.mit.edu
http://www.broad.mit.edu/~smonti/aws
Broad Institute of MIT & Harvard
October 18-20, 2006
Cambridge, MA

Outline of the course

Lectures

Day 1:
Introduction: Functional Genomics
GenePattern mini-tutorial
FG Pipeline:
o Data Acquisition
o Preprocessing & Visualization

Day 2:
Supervised Analysis
o Differential analysis/GSEA
o Class Prediction/Classification
o Validation

Day 3:
o [Survival analysis]
Unsupervised Analysis
o Clustering, Bi-clustering
Annotation

Hands-on’s

Day 1:
Preprocessing
Data visualization/Dimensionality reduction:
o HeatMaps
o PCA, NMF, MDS

Day 2:
Differential Analysis/Annotation
Classification:
o Model building/selection
o Evaluation

Day 3:
Clustering:
o HC, NMF, CC, Bi-clustering
o GO Annotation

Final Project



The functional genomics pipeline

Experimental design: affects outcome of data analysis

Data acquisition: microarray processing

Data preprocessing: scaling/normalization/filtering

Data analysis/Hypothesis generation:
o Supervised Analysis: Differential analysis, Classification, …
o Unsupervised Analysis: Clustering, Bi-clustering, …

Validation/Annotation:
o Enrichment analysis: GO annotation, GSEA, …
o “In silico” testing: Cross validation, train/test, etc.
o “In vitro” testing: Back to the lab


Supervised vs. Unsupervised

Supervised methods look at relations between gene expression and known experimental conditions.
o Differential experiments usually yield:
  - A list of genes changed across two (or more) conditions;
  - A “stochastic profile” of each condition.
o Useful to identify diagnostic profiles and prognostic models.

Unsupervised methods aim at discovering new and unknown relationships:
o among genes.
o among samples.
o among genes and samples.

Differential analysis: markers selection

Given phenotypically distinct classes, find “markers” with distinct expression patterns (in different classes).

Differential analysis: Take-home messages

How to look for genes differentially expressed

How to visualize results

How to test for significance & control for multiple testing

How to evaluate significance by permutation testing

Going beyond gene markers (genesets as markers)

Sources of variation in the data

Interesting variation:
o Gene expression variation associated with phenotype changes.

“Obscuring” variation:
o Technical:
  - sample preparation (RNA extraction)
  - manufacture of the arrays
  - processing of the arrays (IVT)
  - temperature in the lab
  - instrument (scanner) precision
  - different platforms
o Biological:
  - different growth conditions, heterogeneity of samples, stochastic nature of biology.

Marker Selection: hierarchy of problems’ difficulty

Problem                                                  Gene Markers   Error     Example
I.   Tissue or Cell Type (Normal vs. Abnormal)           ~1000-2000     ~0%       Normal vs. Renal carcinoma
II.  Morphological Type                                  ~200-500       ~0-5%     Leukemia ALL vs. AML
III. Morphological Subtype (Multiclass Classification)   ~50-100        ~0-15%    ALL B- vs. T-Cell
IV.  Treatment Outcome (Drug Sensitivity)                ~1-20          ~5-50%    AML Treatment Outcome

Degree of difficulty increases from I to IV (adapted from P. Tamayo).

Marker Selection: what is needed

[Flowchart with boxes: Dataset, Phenotype, Score, Measure of confidence/significance, Reject/Accept criterion (Yes/No), Marker candidates]

Gene markers selection: quantifying differential expression

[Figure: expression of gene g in class A (normal) and class B (tumor), with class means \bar{y}_{gA} and \bar{y}_{gB} and spread \sigma_g; fold difference = \bar{y}_{gB} - \bar{y}_{gA}]

Gene         N01      N02     N03     T01      T02     T03      Descriptions
40909_at     21.00    9.40    2.60    19.20    4.60    3.70     M23316 SGD:YEL024C Yeast S
40908_r_at   6.20     4.40    2.20    8.30     0.70    1.10     K02207 SGD:YEL021W Yeast S
40907_at     17.20    28.90   38.10   14.70    13.20   23.40    U18530 SGD:YEL018W Yeast S
40906_at     6.90     4.60    3.10    33.70    5.60    6.20     X61388 SGD: YEL002C Yeast S
40905_s_at   52.10    4.20    12.60   61.50    4.50    16.60    K01391 B subtilis TrpE protein
40904_at     154.90   88.40   70.10   118.40   72.30   110.20   K01391 B subtilis TrpE protein
40903_at     92.40    6.40    10.00   96.60    9.90    12.80    K01391 B subtilis TrpE protein
40902_at     10.20    8.00    2.80    4.10     1.80    10.60    X04603 B subtilis thrC
40901_at     45.40    9.30    15.60   54.90    11.60   14.30    X04603 B subtilis thrC

Gene markers selection: scores

GeneCluster/GenePattern (Broad) uses the signal-to-noise ratio (SNR) statistic, and traditional t-statistics:

s2n_g = \frac{\bar{y}_{gA} - \bar{y}_{gB}}{S_{gA} + S_{gB}}

t_g = \frac{\bar{y}_{gA} - \bar{y}_{gB}}{\sqrt{S_{gA}^2 / n_A + S_{gB}^2 / n_B}}

SAM (Stanford) uses a statistic similar to the classical t-statistic. The parameter a is chosen to minimize the coefficient of variation:

t'_g = \frac{\bar{y}_{gA} - \bar{y}_{gB}}{a + \sqrt{S_{gA}^2 / n_A + S_{gB}^2 / n_B}}

BioConductor

Success: availability of software.
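The three scores on this slide can be sketched in a few lines of plain Python. This is a minimal illustration, not the GenePattern or SAM implementation: the function names are made up for the example, the sample standard deviation is assumed for S, and the SAM-style constant is simply passed in as `offset` rather than tuned to minimize the coefficient of variation. The example values are the 40906_at row of the data table (N01..N03 vs. T01..T03).

```python
import math
import statistics


def class_stats(values):
    """Return (mean, sample standard deviation, n) for one class of samples."""
    return statistics.mean(values), statistics.stdev(values), len(values)


def snr(a, b):
    """Signal-to-noise ratio: difference of class means over sum of class SDs."""
    mean_a, sd_a, _ = class_stats(a)
    mean_b, sd_b, _ = class_stats(b)
    return (mean_a - mean_b) / (sd_a + sd_b)


def t_score(a, b, offset=0.0):
    """Unequal-variance t-score; offset > 0 gives a SAM-like statistic.

    `offset` plays the role of the constant `a` on the slide. It is a
    user-supplied assumption here, not an estimated value.
    """
    mean_a, sd_a, n_a = class_stats(a)
    mean_b, sd_b, n_b = class_stats(b)
    return (mean_a - mean_b) / (offset + math.sqrt(sd_a**2 / n_a + sd_b**2 / n_b))


# Gene 40906_at from the table: three normal (A) and three tumor (B) samples.
normal = [6.90, 4.60, 3.10]
tumor = [33.70, 5.60, 6.20]
print("SNR:", snr(normal, tumor))
print("t:", t_score(normal, tumor))
print("SAM-like (a=1):", t_score(normal, tumor, offset=1.0))
```

A positive offset shrinks the score of genes whose standard deviations are very small, which is the motivation for the extra term in the SAM-style denominator.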

Visualization of E: Heatmap

Rows are sorted according to the score (high to low).

[Figure: heatmap, ALL vs. MLL vs. AML (top 50 markers/class). Columns: ALL, MLL, AML samples; rows: ALL markers, MLL markers, AML markers; color scale from low expression to high expression. Source: Nature Genetics 30, pp 41-47 (2002).]

Gene markers selection: challenges

o Small sample size.
o Non-normal (non-symmetric) distribution.
  - SNR, t-statistic work best with symmetric uni-modal distributions.
o Gene interaction not taken into account.
o Multiple testing.

Gene markers selection: significance and multiple testing

Settings:
• 20K+ genes,
• only 10s/100s of samples

Problem:
• the chances of finding genes correlated with any random phenotype labels are high.
• the smaller the no. of samples, the higher the chances.

[Figure: Genes × Samples expression matrix with random “Head vs. Tail” sample labels]

Gene markers selection: better than chance?

As sample size increases, the separation obtained with a random phenotype becomes less clean.

Head/Tail example suggests a way of testing for the significance of the results: is the observed difference in expression bigger than what we can observe with respect to a random phenotype (head/tail)?

[Figure: four heatmaps with Head/Tail column labels, for 6 samples, 14 samples, 30 samples, and 100 samples]

o Generated a [10,000x100] matrix from a Gaussian(μ=0, σ=0.5)
o Picked n columns (n = 6, 14, 30, 100)
o Tossed n coins
o Selected: top 25 markers for head, top 25 markers for tail
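The coin-toss experiment is easy to re-create. The sketch below is an assumption-laden re-creation, not the code used for the slide: it scores genes with the SNR, draws the n columns directly (equivalent to picking n columns of a larger i.i.d. Gaussian matrix), re-tosses the coins if a class ends up with fewer than two samples, and uses fewer genes in the demo call to keep pure Python fast. All names are illustrative.

```python
import math
import random


def mean_std(xs):
    """Mean and sample standard deviation of a list of numbers."""
    m = sum(xs) / len(xs)
    var = sum((x - m) ** 2 for x in xs) / (len(xs) - 1)
    return m, math.sqrt(var)


def snr(row, labels):
    """Signal-to-noise ratio of one gene: head samples vs. tail samples."""
    heads = [x for x, is_head in zip(row, labels) if is_head]
    tails = [x for x, is_head in zip(row, labels) if not is_head]
    mean_h, sd_h = mean_std(heads)
    mean_t, sd_t = mean_std(tails)
    return (mean_h - mean_t) / (sd_h + sd_t)


def coin_toss_markers(n, n_genes=10000, n_top=25, sigma=0.5, seed=0):
    """Score pure-noise genes against random head/tail labels.

    Returns (head_scores, tail_scores): the n_top highest and n_top lowest
    SNR values, i.e. the apparent "markers" of each random class.
    """
    rng = random.Random(seed)
    data = [[rng.gauss(0.0, sigma) for _ in range(n)] for _ in range(n_genes)]
    labels = [rng.random() < 0.5 for _ in range(n)]
    # Assumption: re-toss until each class has at least two samples,
    # so that a standard deviation can be computed for both.
    while min(sum(labels), n - sum(labels)) < 2:
        labels = [rng.random() < 0.5 for _ in range(n)]
    scores = sorted(snr(row, labels) for row in data)
    return scores[-n_top:][::-1], scores[:n_top]


best = {}
for n in (6, 14, 30, 100):
    head_scores, tail_scores = coin_toss_markers(n, n_genes=2000)
    best[n] = head_scores[0]
    print(n, "samples: best head SNR", round(head_scores[0], 2),
          "best tail SNR", round(tail_scores[0], 2))
```

With few samples the best random "markers" reach very large scores, and the scores shrink as n grows, which is the point of the slide.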

Statistical Significance

P-value: definition(s)

Multiple-Hypothesis Testing: P-value … vs. False Discovery Rate (FDR) … vs. Family-Wise Error Rate (FWER).

Single test: H0 vs. H1

Null hypothesis H0: no difference in expression

\mu_0 = \mu_1 \quad (\mu_0 = \mu_1 \text{ and/or } \sigma_0 = \sigma_1)

Alternative hypothesis H1: there IS difference in expression

\mu_0 \neq \mu_1 \quad (\mu_0 \neq \mu_1 \text{ and/or } \sigma_0 \neq \sigma_1)

(\mu_0: mean in class 0)

Choose a test statistic (e.g. SNR, T-test, rank-sum)

Calculate observed statistic: t_obs

Calculate/estimate P-value: probability of observing t_obs (or larger) under the null distribution

p ≡ P(T ≥ t_obs | H0)

What distributional form does the t-statistic have?

P-value calculation: asymptotic theory

IF a gene is normally distributed, … THEN t-score follows the t-distribution (if H0 holds):

t_obs | H0 ~ t(μ=0, df=n-1)

[Figure: distribution of scores under the null hypothesis, P(t-score | H0) vs. t-score, with the observed score of the gene marked; p = Area Blue / Area Total]

WHAT IF genes are NOT normally distributed?

P-value calculation: permutation test

Repeat many times:
o shuffle labels (class membership)
o compute score for each gene (t-score, SNR, ...)

⇒ Empirical null distribution of scores for each gene. Compare observed score to empirical distribution:

p = \frac{\#\{s_{permuted} \geq s_{observed}\}}{\#\{permuted\}}

No distributional assumptions are made.

[Figure: distribution of permuted scores for given gene, with the observed score of the gene marked]

Permutation Test: gene-specific p-value

o Compute observed score for each gene
o Compute permuted scores for each gene (5 times)
o Compute gene-specific p-values

           SNR     perm.1   perm.2   perm.3   perm.4   perm.5
gene.1     -0.29   -0.17    -0.09    0.23     0.23     -0.25     p=0/5
gene.2     -0.28   -0.08    0.29     0.16     0.61     0.41      p=3/5
gene.3     -0.31   -0.33    0.03     -0.05    -0.10    -0.46     p=2/5
gene.4     -0.17   0.65     -0.46    -0.13    -0.30    -0.75     …
gene.5     -0.47   0.19     -0.70    0.76     0.02     -0.36
gene.6     0.29    -0.09    -0.15    0.08     0.06     -0.44
gene.7     0.28    0.05     0.13     0.03     0.47     0.26
gene.8     0.31    -0.23    -0.12    -0.18    -0.20    -0.69
gene.9     0.17    0.82     -0.54    -0.13    -0.44    -0.96
gene.10    0.47    0.41     -0.94    0.48     -0.28    -0.61
gene.11    -0.36   -0.60    0.10     0.81     -0.16    0.11
gene.12    0.08    -0.69    0.42     -0.27    0.26     0.32
gene.13    0.22    -0.18    -0.54    0.29     0.20     0.15
gene.14    0.26    0.25     -0.13    0.07     -1.74    -0.32
gene.15    -0.22   0.15     0.21     -0.06    -0.52    -0.14
gene.16    0.22    0.03     0.18     -0.32    -0.36    -0.62
gene.17    0.12    0.65     0.09     -0.10    -0.54    -0.41
gene.18    0.42    0.14     -0.51    0.37     -0.10    0.09
gene.19    0.04    0.34     -0.30    0.04     -0.18    -0.01
gene.20    -0.82   0.37     0.02     -0.10    -0.40    -0.50     p=0/5

Class labels and their permutations:

0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1
1 0 0 1 1 1 0 0 1 0 0 1 0 1 1 0
0 1 1 0 1 0 0 1 1 0 0 1 1 1 0 0
1 0 0 0 1 0 1 1 1 0 0 1 1 0 0 1
1 0 1 0 1 0 1 0 1 0 1 1 1 0 0 0
0 0 0 1 0 0 1 1 1 1 1 0 0 1 0 1
0 1 0 0 1 0 1 1 1 1 1 0 0 0 1 0
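A gene-specific permutation p-value can be sketched as follows. The p-values printed on the slide (0/5, 3/5, 2/5, 0/5) are reproduced when permuted and observed scores are compared in absolute value, so the sketch assumes that two-sided reading. The helper names, the use of the SNR as score, and the fixed random seed are illustrative choices, not part of the original material.

```python
import random
import statistics


def snr(values, labels):
    """Signal-to-noise ratio of one gene for 0/1 class labels."""
    a = [v for v, lab in zip(values, labels) if lab == 0]
    b = [v for v, lab in zip(values, labels) if lab == 1]
    return (statistics.mean(a) - statistics.mean(b)) / (
        statistics.stdev(a) + statistics.stdev(b))


def empirical_p(observed, permuted):
    """Fraction of permuted scores at least as extreme as the observed one.

    Assumption: "extreme" is measured in absolute value (two-sided).
    """
    return sum(abs(s) >= abs(observed) for s in permuted) / len(permuted)


def permutation_p(values, labels, n_perm=1000, seed=0):
    """Shuffle the class labels n_perm times and return the empirical p-value."""
    rng = random.Random(seed)
    observed = snr(values, labels)
    shuffled = list(labels)
    permuted = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # class sizes are preserved by shuffling
        permuted.append(snr(values, shuffled))
    return empirical_p(observed, permuted)


# Rows copied from the slide: observed SNR and five permuted SNRs.
slide_rows = {
    "gene.1": (-0.29, [-0.17, -0.09, 0.23, 0.23, -0.25]),
    "gene.2": (-0.28, [-0.08, 0.29, 0.16, 0.61, 0.41]),
    "gene.3": (-0.31, [-0.33, 0.03, -0.05, -0.10, -0.46]),
    "gene.20": (-0.82, [0.37, 0.02, -0.10, -0.40, -0.50]),
}
slide_p = {gene: empirical_p(obs, perms) for gene, (obs, perms) in slide_rows.items()}
print(slide_p)

# A made-up, clearly separated gene: 4 samples per class.
demo_p = permutation_p([1.0, 1.2, 0.9, 1.1, 5.0, 5.2, 4.9, 5.1], [0] * 4 + [1] * 4)
print("demo p-value:", demo_p)
```

With only 5 permutations the smallest non-zero p-value is 1/5, which is why real analyses use many more permutations.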

Single test: H0 vs. H1

Choose a test statistic (e.g. SNR, T-test, rank-sum)
Calculate observed statistic: t_obs
Calculate/estimate P-value: p ≡ P(T ≥ t_obs | H0)
Controlling the false positive rate (FPR) at level of α: p ≤ α → significant

Type I error: False Positive Rate (FPR): P(p ≤ α | H0) = P(“significant” | H0)
Type II error: False Negative Rate (FNR): P(p > α | H1) = P(¬significant | H1) = β(α)
Power: P(p ≤ α | H1) = 1 − β(α)

P(“significant” | H0) = ?

Given H0: CDF(p) = p ⇒ P-values of H0 are U[0,1]

[Figure: density P(p) of p-values under H0, flat at 1 on the interval from 0 to 1, with the threshold α marked]
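The question mark on the slide can be answered in one line. The derivation below assumes a continuous test statistic, so that the p-values under H0 are exactly uniform:

```latex
P(\text{``significant''} \mid H_0)
  = P(p \le \alpha \mid H_0)
  = \mathrm{CDF}(\alpha)
  = \alpha
```

So thresholding at α controls the FPR of a single test at α. As a rough illustration with the 20K genes mentioned earlier: if all 20,000 genes were null and α = 0.05, about 20,000 × 0.05 = 1,000 genes would be called significant by chance alone, which is why multiple-testing corrections are needed.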

Multiple Hypotheses: what to control

FWER (family-wise error rate): probability of calling one or more hypotheses significant given that they are all null

P(#significant > 0 | ∩_i H0_i)

o Bonferroni or maxT

FDR (false discovery rate): probability that the null hypothesis is true given that the result is significant ≡ P(H0 | significant)

o Benjamini & Hochberg [JRSSB, 1995]
o Storey & Tibshirani [PNAS, 2004]
o …

No free lunch: MHT corrections control for false positives at the cost of increasing the false negatives.

Try to reduce the number of hypotheses tested in the first place.

Multiple Hypotheses

Multiple hypotheses (H_i, i=1,2,…,n) → {p_i}
o {p_i} are in general not independent
o the marginal distribution of each one is uniform

Distribution of P-values is a mixture of 2 distributions:
o π0: fraction of genes NOT differentially expressed = P(H0)
o 1−π0: fraction of genes differentially expressed = P(H1) = 1 − P(H0)

[Figure: p-value density on the interval from 0 to 1, drawn as π0 × (uniform density) + (1−π0) × (density under H1); FDR(α) is shown as the ratio of the null area below α to the total area below α]

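FDR calculation

FDR(p_i) = \frac{p_i}{\mathrm{prop}\{p' \le p_i\}} = \frac{p_i \times N}{i}

where i is the rank of the p-value.

FDR(p_i) = \frac{p_i \times \pi_0}{\mathrm{prop}\{p' \le p_i\}} = \frac{p_i \times N \times \pi_0}{i}

Benjamini & Hochberg [JRSSB, 1995] assume π0=1

Storey & Tibshirani [PNAS, 2004] estimate π0 from the data

Empirical FDR based on permutation:

FDR(s) = \frac{\mathrm{prop}\{s_{prm} \ge s\}}{\mathrm{prop}\{s_{obs} \ge s\}}

[Figure: p-value histogram on the interval from 0 to 1 with the null level π0 marked, illustrating FDR(α)]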
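The rank-based formula above translates directly into code. The following is a minimal sketch under stated assumptions: `pi0` defaults to 1 (the Benjamini & Hochberg choice) and is otherwise supplied by the caller rather than estimated from the data, and a running minimum from the largest p-value downward is added so that the reported FDR never decreases as p grows. That monotonicity step is a common convention and is not shown on the slide. The function name is illustrative.

```python
def fdr_from_pvalues(pvalues, pi0=1.0):
    """Return FDR(p_i) = pi0 * p_i * N / rank_i for each p-value, in input order.

    A running minimum taken from the largest p-value downward keeps the
    values monotone in p (an added convention, see the text above).
    """
    n = len(pvalues)
    order = sorted(range(n), key=lambda k: pvalues[k])  # indices by increasing p
    fdr = [0.0] * n
    running_min = 1.0
    for rank in range(n, 0, -1):  # rank n = largest p-value
        k = order[rank - 1]
        running_min = min(running_min, pi0 * pvalues[k] * n / rank)
        fdr[k] = running_min
    return fdr


example_p = [0.01, 0.04, 0.03, 0.20]
example_fdr = fdr_from_pvalues(example_p)
for p, q in zip(example_p, example_fdr):
    print(f"p = {p:.2f}  FDR = {q:.4f}")
```

In this example the raw formula gives 0.03 × 4 / 2 = 0.06 for the second-smallest p-value, but the next one gives 0.04 × 4 / 3 ≈ 0.053, so the running minimum reports 0.053 for both.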

P-value calculation by permutation test

o Gene-specific p-value
o FWER (family-wise error rate)
o FWER by maxT [Westfall & Young, 93]

Permutation Test: FWER

           SNR     perm.1   perm.2   perm.3   perm.4   perm.5
gene.20    -0.82   -0.69    -0.94    -0.32    -1.74    -0.96
gene.5     -0.47   -0.60    -0.70    -0.27    -0.54    -0.75
gene.11    -0.36   -0.33    -0.54    -0.18    -0.52    -0.69
gene.3     -0.31   -0.23    -0.54    -0.13    -0.44    -0.62
gene.1     -0.29   -0.18    -0.51    -0.13    -0.40    -0.61
gene.2     -0.28   -0.17    -0.46    -0.10    -0.36    -0.50
gene.15    -0.22   -0.09    -0.30    -0.10    -0.30    -0.46
gene.4     -0.17   -0.08    -0.15    -0.06    -0.28    -0.44
gene.19    0.04    0.03     -0.13    -0.05    -0.20    -0.41
gene.12    0.08    0.05     -0.12    0.03     -0.18    -0.36
gene.17    0.12    0.14     -0.09    0.04     -0.16    -0.32
gene.9     0.17    0.15     0.02     0.07     -0.10    -0.25
gene.13    0.22    0.19     0.03     0.08     -0.10    -0.14
gene.16    0.22    0.25     0.09     0.16     0.02     -0.01
gene.14    0.26    0.34     0.10     0.23     0.06     0.09
gene.7     0.28    0.37     0.13     0.29     0.20     0.11
gene.6     0.29    0.41     0.18     0.37     0.23     0.15
gene.8     0.31    0.65     0.21     0.48     0.26     0.26
gene.18    0.42    0.65     0.29     0.76     0.47     0.32
gene.10    0.47    0.82     0.42     0.81     0.61     0.41

FWER(s) = proportion of iterations having one or more scores ≥ s

FWER(gene20) = (1+1+0+1+1) ÷ 5 = 0.8

FWER(gene5) = (1+1+1+1+1) ÷ 5 = 1
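The two worked values can be checked with a short script. The per-permutation counts on the slide (1+1+0+1+1 for gene.20) are reproduced when scores are compared in absolute value, so that is what this sketch assumes. Only the smallest and largest score of each permutation column are listed, since the other values in a column cannot change its maximum absolute score. Function and variable names are illustrative.

```python
def fwer(observed, perm_columns):
    """Proportion of permutations with at least one score as extreme as `observed`.

    Assumption: scores are compared in absolute value, which reproduces
    the counts shown on the slide.
    """
    hits = sum(
        max(abs(s) for s in column) >= abs(observed)
        for column in perm_columns
    )
    return hits / len(perm_columns)


# Smallest and largest score of perm.1 .. perm.5, copied from the table.
perm_extremes = [
    [-0.69, 0.82],
    [-0.94, 0.42],
    [-0.32, 0.81],
    [-1.74, 0.61],
    [-0.96, 0.41],
]

fwer_gene20 = fwer(-0.82, perm_extremes)
fwer_gene5 = fwer(-0.47, perm_extremes)
print("FWER(gene20) =", fwer_gene20)
print("FWER(gene5)  =", fwer_gene5)
```

Even the strongest observed score (gene.20) is matched by chance in 4 of 5 permutations, so nothing in this toy table would survive FWER control.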

Permutation Test: FWER by maxT procedure

o Compute observed score for each gene
o Compute permuted scores for each gene (5 times)
o Take the absolute values of the scores
o Sort genes according to absolute value of observed scores
o Make the permuted scores monotonically non-decreasing (within each permutation column, from the bottom of the sorted list upward)
o Compute the p-values.

Observed and permuted scores:

GeneID     score   prm.1   prm.2   prm.3   prm.4   prm.5
gene.1     -0.47   0.09    0.18    -0.25   0.19    0.02
gene.2     -0.42   0.25    -0.14   -0.29   0.18    -0.03
gene.3     -0.28   -0.15   0.22    -0.10   -0.48   -0.19
gene.4     -0.51   0.06    0.02    -0.50   0.01    -0.19
gene.5     -0.26   0.23    -0.16   -0.18   0.17    -0.19
gene.6     0.47    -0.06   0.20    0.17    0.15    0.04
gene.7     0.42    -0.13   0.07    0.26    0.32    0.01
gene.8     0.28    0.07    0.25    0.13    -0.32   -0.18
gene.9     0.51    0.06    0.01    -0.10   0.03    -0.09
gene.10    0.26    0.12    0.13    -0.01   0.33    0.02
gene.11    0.13    0.01    0.03    -0.01   0.02    -0.11
gene.12    0.19    0.22    -0.02   -0.18   0.17    -0.05
gene.13    0.19    -0.05   0.02    -0.02   0.10    0.20
gene.14    -0.17   -0.13   0.09    -0.12   -0.08   0.27
gene.15    0.04    0.06    -0.11   0.15    -0.10   -0.34
gene.16    0.03    -0.18   0.07    -0.42   -0.13   -0.07
gene.17    -0.02   -0.03   0.12    0.10    -0.09   0.05
gene.18    0.15    -0.26   -0.22   0.13    -0.16   0.05
gene.19    0.23    0.02    0.36    -0.02   0.13    0.16
gene.20    0.10    -0.03   0.13    0.00    -0.05   0.01

After taking absolute values, sorting, and making each permutation column monotone:

GeneID     score   prm.1   prm.2   prm.3   prm.4   prm.5
gene.4     0.51    0.26    0.36    0.50    0.48    0.34    p = 0/5
gene.9     0.51    0.26    0.36    0.42    0.48    0.34    p = 0/5
gene.1     0.47    0.26    0.36    0.42    0.48    0.34    p = 1/5
gene.6     0.47    0.26    0.36    0.42    0.48    0.34    p = 1/5
gene.2     0.42    0.26    0.36    0.42    0.48    0.34    p = 2/5
gene.7     0.42    0.26    0.36    0.42    0.48    0.34    p = 2/5
gene.3     0.28    0.26    0.36    0.42    0.48    0.34    p = 4/5
gene.8     0.28    0.26    0.36    0.42    0.33    0.34    p = 4/5
gene.5     0.26    0.26    0.36    0.42    0.33    0.34    p = 5/5
gene.10    0.26    0.26    0.36    0.42    0.33    0.34    …
gene.19    0.23    0.26    0.36    0.42    0.17    0.34
gene.12    0.19    0.26    0.22    0.42    0.17    0.34
gene.13    0.19    0.26    0.22    0.42    0.16    0.34
gene.14    0.17    0.26    0.22    0.42    0.16    0.34
gene.18    0.15    0.26    0.22    0.42    0.16    0.34
gene.11    0.13    0.18    0.13    0.42    0.13    0.34
gene.20    0.10    0.18    0.13    0.42    0.13    0.34
gene.15    0.04    0.18    0.12    0.42    0.13    0.34
gene.16    0.03    0.18    0.12    0.42    0.13    0.07
gene.17    0.02    0.03    0.12    0.10    0.09    0.05
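The maxT steps can be sketched as a short function and checked against the p-values shown on the slide. This is an illustrative re-implementation of the steps as listed (absolute values, sort by observed score, running maximum of permuted scores from the bottom of the list upward, then count), not the original software; it omits any further adjustment step a full implementation might add. The data are the observed and permuted scores from the first table of this slide.

```python
TABLE = """
gene.1 -0.47 0.09 0.18 -0.25 0.19 0.02
gene.2 -0.42 0.25 -0.14 -0.29 0.18 -0.03
gene.3 -0.28 -0.15 0.22 -0.10 -0.48 -0.19
gene.4 -0.51 0.06 0.02 -0.50 0.01 -0.19
gene.5 -0.26 0.23 -0.16 -0.18 0.17 -0.19
gene.6 0.47 -0.06 0.20 0.17 0.15 0.04
gene.7 0.42 -0.13 0.07 0.26 0.32 0.01
gene.8 0.28 0.07 0.25 0.13 -0.32 -0.18
gene.9 0.51 0.06 0.01 -0.10 0.03 -0.09
gene.10 0.26 0.12 0.13 -0.01 0.33 0.02
gene.11 0.13 0.01 0.03 -0.01 0.02 -0.11
gene.12 0.19 0.22 -0.02 -0.18 0.17 -0.05
gene.13 0.19 -0.05 0.02 -0.02 0.10 0.20
gene.14 -0.17 -0.13 0.09 -0.12 -0.08 0.27
gene.15 0.04 0.06 -0.11 0.15 -0.10 -0.34
gene.16 0.03 -0.18 0.07 -0.42 -0.13 -0.07
gene.17 -0.02 -0.03 0.12 0.10 -0.09 0.05
gene.18 0.15 -0.26 -0.22 0.13 -0.16 0.05
gene.19 0.23 0.02 0.36 -0.02 0.13 0.16
gene.20 0.10 -0.03 0.13 0.00 -0.05 0.01
"""


def parse_table(text):
    """Parse 'gene score prm.1 ... prm.k' rows into two dicts."""
    observed, permuted = {}, {}
    for line in text.strip().splitlines():
        gene, score, *perms = line.split()
        observed[gene] = float(score)
        permuted[gene] = [float(p) for p in perms]
    return observed, permuted


def maxt_pvalues(observed, permuted):
    """maxT p-values following the steps listed on the slide."""
    # Sort genes by decreasing |observed score| (stable for ties).
    genes = sorted(observed, key=lambda g: -abs(observed[g]))
    n_perm = len(permuted[genes[0]])
    running_max = [0.0] * n_perm
    pvalues = {}
    # Walk from the lowest-ranked gene upward, keeping a running maximum
    # of |permuted score| within each permutation column.
    for gene in reversed(genes):
        running_max = [max(m, abs(s)) for m, s in zip(running_max, permuted[gene])]
        pvalues[gene] = sum(m >= abs(observed[gene]) for m in running_max) / n_perm
    return pvalues


observed, permuted = parse_table(TABLE)
maxt_p = maxt_pvalues(observed, permuted)
for gene in ("gene.4", "gene.1", "gene.2", "gene.3", "gene.5"):
    print(gene, maxt_p[gene])
```

The running maximum is what makes the procedure control the FWER: each gene is compared with the largest permuted score among all genes ranked at or below it.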

Permutation test: comments

Permuting the class labels vs. permuting the gene values:
o Permuting the class labels preserves the correlation among genes.

Pooling permuted scores from different genes is problematic (genes may have different distributions).

Empirical p-value does not always take into account the magnitude of score/difference.


Annotation

Annotating the identified gene markers:

identify the gene products and their documented characteristics/function:
o GeneCruiser
o …

test for their enrichment with respect to meaningful biological categories:
o Gene Ontology (GO)
o Gene Set Enrichment Analysis (GSEA)

Gene Set Enrichment Analysis: beyond gene markers

Using genesets as phenotype markers.


Enrichment: KS-score

[Figure: running enrichment score S vs. gene list order index, for Gene Set G along the ordered marker list (ranked by phenotype); each gene is a hit (member of G) or a miss (non-member of G); the maximum of S is the enrichment score ES]

Empirical Cumulative Distributions:

P_{hit}(i) = \sum_{j=1}^{i} \frac{M(j)}{N_H}, \qquad P_{miss}(i) = \sum_{j=1}^{i} \frac{1 - M(j)}{N_M}

Running Score S:

S(i) = \sum_{j=1}^{i} \left[ \frac{M(j)}{N_H} - \frac{1 - M(j)}{N_M} \right]

KS Enrichment Score (ES):

ES = \max_{i=1,..,N} S(i)

Here M(j) = 1 if the j-th gene in the ordered list is a hit and 0 if it is a miss; N_H and N_M are the numbers of hits and misses.

Mootha et al., Nature Genetics 2003

• Rank genes according to their “correlation” with the class of interest.

• Test if a geneset (e.g., a GO category, a pathway, a different class signature) “enriches” any of the classes.

• Use Kolmogorov-Smirnov score to measure enrichment.

Subramanian et al., PNAS 2005

Enrichment: Weighted score

Motivation: the standard KS score is susceptible to calling a geneset significant when its members are close to each other even though far from both ends of the ranking.

Enrichment: KS-score

[Figure: two panels, Enriched Gene Set and Un-enriched Gene Set, each plotting Enrichment Score S against Gene List Order Index, with the Max. Enrichment Score ES marked]

o Every hit goes up by 1/N_H
o Every miss goes down by 1/N_M
o The maximum height provides the enrichment score
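The running score is simple to compute: walk down the ranked list, step up by 1/N_H at each hit and down by 1/N_M at each miss, and take the maximum. The sketch below implements this unweighted KS score; the function names and the six-gene toy ranking are made up for illustration.

```python
def running_enrichment(ranked_genes, gene_set):
    """Running score S(i): +1/N_H for each hit, -1/N_M for each miss."""
    members = set(gene_set)
    n_hit = sum(g in members for g in ranked_genes)
    n_miss = len(ranked_genes) - n_hit
    if n_hit == 0 or n_miss == 0:
        raise ValueError("need at least one hit and one miss in the ranked list")
    score, path = 0.0, []
    for gene in ranked_genes:
        score += 1.0 / n_hit if gene in members else -1.0 / n_miss
        path.append(score)
    return path


def enrichment_score(ranked_genes, gene_set):
    """KS enrichment score: maximum of the running score."""
    return max(running_enrichment(ranked_genes, gene_set))


ranked = ["a", "b", "c", "d", "e", "f"]  # toy list, best marker first
es_top = enrichment_score(ranked, {"a", "b"})     # both hits at the top
es_spread = enrichment_score(ranked, {"a", "d"})  # hits spread out
print("running score, top set:", running_enrichment(ranked, {"a", "b"}))
print("ES top set:", es_top, " ES spread set:", es_spread)
```

A set concentrated at the top of the list reaches the maximum possible score of 1, while a set spread along the list keeps being pulled back toward 0 by the misses between its hits.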

Enrichment: scoring multiple genesets

[Figure: an ordered gene marker list (ranked by phenotype or gene template) scored against Gene Set A, Gene Set B, Pathway A, Pathway B: potentially an entire database of Gene Sets]

Enrichment: testing a geneset for significance: permutation test

o Test 1: Randomize Gene Set Labels
o Test 2: Randomize Phenotype Labels (preserves correlation between genes: more stringent)

Repeat N times.

Assess significance of Geneset G's observed ES score using the histogram of permuted ES scores.

[Figure: histogram of permuted ES scores for Gene Set G, with the observed ES marked]

Enrichment: testing multiple genesets for significance

[Figure: table of ES scores with one row per gene set (Gene Set 1, G2, …) and one column per permutation (Permutation 1, Permutation 2, …, Permutation Π), obtained by randomizing phenotype labels]

Need for Multiple Hypothesis Correction. Same approaches as in the gene markers selection (FDR, maxT, etc.)
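Test 1 (randomizing the gene set) can be sketched without any expression data, since only the ranked list is needed. The code below is an illustration under stated assumptions: it uses the unweighted KS score, draws random gene sets of the same size as the tested set, and counts how often a random set scores at least as high as the observed one. Test 2 (randomizing phenotype labels) would require re-ranking the genes from the expression matrix for every permutation and is not shown. All names and the 100-gene toy ranking are hypothetical.

```python
import random


def enrichment_score(ranked_genes, gene_set):
    """Unweighted KS enrichment score: maximum of the running hit/miss score."""
    members = set(gene_set)
    n_hit = sum(g in members for g in ranked_genes)
    n_miss = len(ranked_genes) - n_hit
    score, best = 0.0, float("-inf")
    for gene in ranked_genes:
        score += 1.0 / n_hit if gene in members else -1.0 / n_miss
        best = max(best, score)
    return best


def geneset_permutation_p(ranked_genes, gene_set, n_perm=1000, seed=0):
    """Return (observed ES, p-value) using random gene sets of the same size."""
    rng = random.Random(seed)
    observed = enrichment_score(ranked_genes, gene_set)
    size = len(set(gene_set) & set(ranked_genes))
    count = 0
    for _ in range(n_perm):
        random_set = rng.sample(ranked_genes, size)
        if enrichment_score(ranked_genes, random_set) >= observed:
            count += 1
    return observed, count / n_perm


ranked = [f"g{i}" for i in range(100)]  # toy ranking, g0 = best marker
top_set = ranked[:5]                    # all members at the top of the list
spread_set = ranked[::20]               # members spread evenly over the list

es_top, p_top = geneset_permutation_p(ranked, top_set)
es_spread, p_spread = geneset_permutation_p(ranked, spread_set)
print("top set:    ES =", round(es_top, 3), " p =", p_top)
print("spread set: ES =", round(es_spread, 3), " p =", p_spread)
```

When many gene sets are tested this way, the resulting p-values still need a multiple-hypothesis correction, as noted on the slide.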

Constructing genesets

GO categories.

“Pathways” from BioCarta, GenMAPP, KEGG, etc.

“Target” gene lists.

Gene marker signatures.

Gene clusters.

Geneset Enrichment: an example: human diabetes

[Figure: Normals vs. Diabetics, skeletal muscle biopsies]

GSEA was used to assess enrichment of 149 Gene Sets including 113 pathways from internal curation and GenMAPP, and 36 tightly co-expressed clusters from a compendium of mouse gene expression data.

These GSEA results appeared in Mootha et al. Nature Genetics 15 June 2003, vol. 34 no. 3 pp 267 – 273:

PGC-1α-responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes

Geneset Enrichment: another example: mediastinal lymphoma

KS-based enrichment analysis was used to test the similarity between Mediastinal Lymphoma (a type of DLBCL), and Hodgkin Lymphoma [Savage et al., Blood 2003].

[Figure: Hodgkin Geneset hits along the MLBCL vs. DLBCL ranking]

Hits represent genes belonging to a “Hodgkin geneset” identified by differential analysis of “normal B-cells vs. Hodgkin cell lines” in an independent study.

Differential analysis: Take-home messages

How to look for genes differentially expressed

How to visualize results

How to test for significance & control for multiple testing

How to evaluate significance by permutation testing

Going beyond gene markers (genesets as markers)

Cookbook

o Reduce number of hypotheses/genes by variation filtering (attempt at reducing FNs)
o Choose test statistic (say, t-score)
o If enough samples, compute p-values by permutation test (otherwise, use asymptotic test).
o Control for MHT by using the FDR correction
  Remember: if you choose FDR≤α (e.g., 0.05), you’re willing to accept that α×100% (e.g., 5%) of the genes called significant are false positives.
o If number of significant hypotheses/genes “too large” even for very small α’s, either:
  - use the maxT correction (possible w/ empirical p-values only).
  - use additional criteria (e.g., min fold-change, min expression value, etc.)

WARNING: should not be followed blindly …

Differential Analysis: not discussed

Two-sided vs. one-sided test

Variance thresholded t-test

Confidence Interval on P-values after n permutations.

“Smoothing” of p-values

Speeding up permutation tests

Bayesian Methods (computing the posterior)

Controlling for “confounders”