the problem of detecting differentially expressed genes

101
The Problem of Detecting Differentially Expressed Genes

Upload: valentine-richard

Post on 31-Dec-2015

233 views

Category:

Documents


3 download

TRANSCRIPT

Page 1: The Problem of Detecting Differentially Expressed Genes

The Problem of Detecting Differentially Expressed Genes

Page 2: The Problem of Detecting Differentially Expressed Genes

Gene 1

Gene 2

Gene N

.

.

.

.. . . .Sample 1 Sample 2 Sample M

Page 3: The Problem of Detecting Differentially Expressed Genes

Gene 1

Gene 2

Gene N

.

.

.

.. . . .Sample 1 Sample 2 Sample M

Class 1 Class 2

Page 4: The Problem of Detecting Differentially Expressed Genes

Fold Change is the Simplest Method

Calculate the log ratio between the two classes and consider all genes that differ by more than an arbitrary cutoff value to be differentially expressed. A two-fold difference is often chosen.

Fold change is not a statistical test.

Page 5: The Problem of Detecting Differentially Expressed Genes

(1) For gene consider the null hypothesis of no association between its expression level and its class membership.

Test of a Single Hypothesis

(3) Perform a test (e.g Student’s t-test) for each gene.

(2) Decide on level of significance (commonly 5%).

(4) Obtain P-value corresponding to that test statistic.

(5) Compare P-value with the significance level. Theneither reject or retain the null hypothesis.

Page 6: The Problem of Detecting Differentially Expressed Genes

Two-Sample t-Statistic

2221

21

21

nsnsW

jj

jjj

yy

Student’s t-statistic:

Page 7: The Problem of Detecting Differentially Expressed Genes

Two-Sample t-Statistic

21

21

11 nnsW

j

jjj

yy

Pooled form of the Student’s t-statistic, assumed common variance in the two classes:

Page 8: The Problem of Detecting Differentially Expressed Genes

021

21

11 annsW

j

jjj

yy

Two-Sample t-Statistic

Modified t-statistic of Tusher et al. (2001):

Page 9: The Problem of Detecting Differentially Expressed Genes

Retain Null Reject Null

Null False positive

Non-null False negative TRUE

PREDICTED

Types of Errors in Hypothesis Testing

Page 10: The Problem of Detecting Differentially Expressed Genes

Multiplicity Problem

Further: Genes are co-regulated, subsequently there is correlation between the test statistics.

When many hypotheses are tested, the probability of a false positive increases sharply with the number of hypotheses.

Page 11: The Problem of Detecting Differentially Expressed Genes

Suppose we measure the expression of 10,000 genes in a microarray experiment.

If all 10,000 genes were not differentially expressed, then we would expect for:

P= 0.05 for each test, 500 false positives.

P= 0.05/10,000 for each test, .05 false positives.

Example

Page 12: The Problem of Detecting Differentially Expressed Genes

Controlling the Error Rate

• Methods for controlling false positives e.g. Bonferroni are too strict for microarray analyses

• Use the False Discovery Rate instead (FDR)(Benjamini and Hochberg 1995)

Page 13: The Problem of Detecting Differentially Expressed Genes

Methods for dealing with the Multiplicity Problem

• The Bonferroni Method controls the family wise error rate (FWER) i.e.the probability that at least one false positive error will be made

• The False Discovery Rate (FDR)emphasizes the proportion of false positives among the identified differentially expressed genes.

Too strict for gene expression data, tries to make it unlikely that even one false rejection of the null is made, may lead to missed findings

Good for gene expression data – says something about the chosen genes

Page 14: The Problem of Detecting Differentially Expressed Genes

)hypotheses (rejected#

positives) (false#FDR

The FDR is essentially the expectation of the proportion of false positives among the identified differentially expressed genes.

False Discovery Rate Benjamini and Hochberg (1995)

Page 15: The Problem of Detecting Differentially Expressed Genes

  Accept Null Reject Null Total

Null True N00 N01 N0

Non-True N10 N11 N1

Total N - Nr Nr N

Possible Outcomes for N Hypothesis Tests

Page 16: The Problem of Detecting Differentially Expressed Genes

}1

{FDR01

rNN

E

)1,max(1 rr NN where

}1)(

{FNDR10

rNN

NE

Page 17: The Problem of Detecting Differentially Expressed Genes

}0|/E{pFDR 01 rr NNN

}0FDR/pr{ rN

Positive FDR

Page 18: The Problem of Detecting Differentially Expressed Genes

Lindsay, Kettenring, and Siegmund (2004).

A Report on the Future of Statistics.

Statist. Sci. 19.

Page 19: The Problem of Detecting Differentially Expressed Genes
Page 20: The Problem of Detecting Differentially Expressed Genes

Key papers on controlling the FDR

• Genovese and Wasserman (2002)

• Storey (2002, 2003)

• Storey and Tibshirani (2003a, 2003b)

• Storey, Taylor and Siegmund (2004)

• Black (2004)

• Cox and Wong (2004)

Controlling FDR

Benjamini and Hochberg (1995)

Page 21: The Problem of Detecting Differentially Expressed Genes

Benjamini-Hochberg (BH) Procedure

Controls the FDR at level when the P-values following the null distribution are independent and uniformly distributed.

(1) Let be the observed P-values.

(2) Calculate .

(3) If exists then reject null hypotheses corresponding to

. Otherwise, reject nothing.

)()1( Npp

)()1( kpp

Nkpkk k /:maxargˆ)(

Nk 1

k

Page 22: The Problem of Detecting Differentially Expressed Genes

Example: Bonferroni and BH Tests

Suppose that 10 independent hypothesis tests are carried outleading to the following ordered P-values:

0.00017 0.00448 0.00671 0.00907 0.012200.33626 0.39341 0.53882 0.58125 0.98617

(a) With = 0.05, the Bonferroni test rejects any hypothesiswhose P-value is less than / 10 = 0.005.

Thus only the first two hypotheses are rejected.

(b) For the BH test, we find the largest k such that P(k) < k / N. Here k = 5, thus we reject the first five hypotheses.

Page 23: The Problem of Detecting Differentially Expressed Genes

q-VALUE

q-value of a gene j is expected proportion of false positives when calling that gene significant.

P-value is the probability under the null hypothesis of obtaining a value of the test statistic as or more extreme than its observed value.

The q-value for an observed test statistic can be viewed as the expected proportion of false positives among all genes with their test statistics as or more extreme than the observed value.

Page 24: The Problem of Detecting Differentially Expressed Genes

LIST OF SIGNIFICANT GENES

Call all genes significant if pj < 0.05

or

Call all genes significant if qj < 0.05

to produce a set of significant genes so that a proportion of them (<0.05) is expected to be false (at least for a large no. of genes not necessarily independent)

Page 25: The Problem of Detecting Differentially Expressed Genes

BRCA1 versus BRCA2-mutation positive tumours (Hedenfalk et al., 2001)

BRCA1 (7) versus BRCA2-mutation (8) positive tumours, p=3226 genes

P=.001 gave 51 genes differentially expressed

P=0.0001 gave 9-11 genes

Page 26: The Problem of Detecting Differentially Expressed Genes

Using q<0.05, gives 160 genes are taken to be significant.

It means that approx. 8 of these 160 genes are expected to be false positives.

Also, it is estimated that 33% of the genes are differentially expressed.

Page 27: The Problem of Detecting Differentially Expressed Genes
Page 28: The Problem of Detecting Differentially Expressed Genes
Page 29: The Problem of Detecting Differentially Expressed Genes

Permutation Method

The null distribution has a resolution on the order of the number of permutations.

If we perform B permutations, then the P-value will be estimatedwith a resolution of 1/B.

If we assume that each gene has the same null distribution and combine the permutations, then the resolution will be 1/(NB) for the pooled null distribution.

Null Distribution of the Test Statistic

Page 30: The Problem of Detecting Differentially Expressed Genes

Using just the B permutations of the class labels for the gene-specific statistic Wj , the P-value for Wj = wj is assessed as:

where w(b)0j is the null version of wj after the bth permutation

of the class labels.

B

wwbp j

bj

j

|}||:|{# )(0

Page 31: The Problem of Detecting Differentially Expressed Genes

If we pool over all N genes, then:

B

b

jbi

j NB

Niwwip

1

)(0 },...,1|,||:|{#

Page 32: The Problem of Detecting Differentially Expressed Genes

Class 1 Class 2

Gene 1 A1(1) A2

(1) A3(1) B4

(1) B5(1) B6

(1)

Gene 2 A1(2) A2

(2) A3(2) B4

(2) B5(2) B6

(2)

Suppose we have two classes of tissue samples, with three samples from each class. Consider the expressions of two genes, Gene 1 and Gene 2.

Null Distribution of the Test Statistic: Example

Page 33: The Problem of Detecting Differentially Expressed Genes

Class 1 Class 2

Gene 1 A1(1) A2

(1) A3(1) B4

(1) B5(1) B6

(1)

Gene 2 A1(2) A2

(2) A3(2) B4

(2) B5(2) B6

(2)

Gene 1 A1(1) A2

(1) A3(1) A4

(1) A5(1) A6

(1)

To find the null distribution of the test statistic for Gene 1, we proceed under the assumption that there is no difference between the classes (for Gene 1) so that:

Perm. 1 A1(1) A2

(1) A4(1) A3

(1) A5(1) A6

(1) ...There are 10 distinct permutations.

And permute the class labels:

Page 34: The Problem of Detecting Differentially Expressed Genes

Ten Permutations of Gene 1

A1(1) A2

(1) A3(1) A4

(1) A5(1) A6

(1)

A1(1) A2

(1) A4(1) A3

(1) A5(1) A6

(1)

A1(1) A2

(1) A5(1) A3

(1) A4(1) A6

(1)

A1(1) A2

(1) A6(1) A3

(1) A4(1) A5

(1)

A1(1) A3

(1) A4(1) A2

(1) A5(1) A6

(1)

A1(1) A3

(1) A5(1) A2

(1) A4(1) A6

(1)

A1(1) A3

(1) A6(1) A2

(1) A4(1) A5

(1)

A1(1) A4

(1) A5(1) A2

(1) A3(1) A6

(1)

A1(1) A4

(1) A6(1) A2

(1) A3(1) A5

(1)

A1(1) A5

(1) A6(1) A2

(1) A3(1) A4

(1)

Page 35: The Problem of Detecting Differentially Expressed Genes

As there are only 10 distinct permutations here, thenull distribution based on these permutations is toogranular.

Hence consideration is given to permuting the labelsof each of the other genes and estimating the null distribution of a gene based on the pooled permutationsso obtained.

But there is a problem with this method in that the null values of the test statistic for each gene does not necessarily have the theoretical null distribution that we are trying to estimate.

Page 36: The Problem of Detecting Differentially Expressed Genes

Suppose we were to use Gene 2 also to estimate the null distribution of Gene 1.

Suppose that Gene 2 is differentially expressed, then the null values of the test statistic for Gene 2will have a mixed distribution.

Page 37: The Problem of Detecting Differentially Expressed Genes

Class 1 Class 2

Gene 1 A1(1) A2

(1) A3(1) B4

(1) B5(1) B6

(1)

Gene 2 A1(2) A2

(2) A3(2) B4

(2) B5(2) B6

(2)

Gene 2 A1(2) A2

(2) A3(2) B4

(2) B5(2) B6

(2)

Perm. 1 A1(2) A2

(2) B4(2) A3

(2) B5(2) B6

(2)

...

Permute the class labels:

Page 38: The Problem of Detecting Differentially Expressed Genes

Example of a null case: with 7 N(0,1) points and8 N(0,1) points; histogram of the pooled two-sample t-statistic under 1000 permutations of the class labels with t13 density superimposed.

ty

Page 39: The Problem of Detecting Differentially Expressed Genes

Example of a null case: with 7 N(0,1) points and

8 N(10,9) points; histogram of the pooled two-sample

t-statistic under 1000 permutations of the class

labels with t13 density superimposed.

ty

Page 40: The Problem of Detecting Differentially Expressed Genes

The SAM Method

Use the permutation method to calculate the null distribution of the modified t-statistic (Tusher et al., 2001).

The order statistics t(1), ... , t(N) are plotted against their null expectations above.

A good test in situations where there are more genes being over-expressed than under-expressed, or vice-versa.

B

b

bjj NjwBw

1

)()(0)(0 ).,...,1()/1(

Page 41: The Problem of Detecting Differentially Expressed Genes

TRUE

PREDICTED

Retain Null Reject Null

Null a b (false positive)

Non-null c (false negative) d

FDR ~ R

b

R

FNDR ~ ca

c

The FDR and other error rates

Page 42: The Problem of Detecting Differentially Expressed Genes

TRUE

PREDICTED

Retain Null Reject Null

Null a b (false positive)

Non-null c (false negative) d

FDR ~ R

b

R

FNR = dc

c

The FDR and other error rates

Page 43: The Problem of Detecting Differentially Expressed Genes

Test Result

non-DE DE Total

True

non-DE

DE

Total

A = 9025 B = 475

C = 100 D = 400

9125 875

9500

500

10000

FDR = B / (B + D) = 475 / 875 = 54%

FNDR = C / (A + C) = 100 / 9125 = 1%

Toy Example with 10,000 Genes

Page 44: The Problem of Detecting Differentially Expressed Genes

Two-component mixture model

)()()( 1100 jjj wfwfwf 0 is the proportion of genes that are not

differentially expressed, and 01 1

Page 45: The Problem of Detecting Differentially Expressed Genes

Use of the P-Value as a Summary Statistic(Allison et al., 2002)

Instead of using the pooled form of the t-statistic, we can work with the value pj, which is the P-value associated with tj

in the test of the null hypothesis of no difference in expression between the two classes.

The distribution of the P-value is modelled by the h-component mixture model

h

i

iijij pfpf1

21 ),;()(

where 11 = 12 = 1.

,

Page 46: The Problem of Detecting Differentially Expressed Genes

Use of the P-Value as a Summary Statistic

Under the null hypothesis of no difference in expression for thejth gene, pj will have a uniform distribution on the unit interval; ie the 1,1 distribution.

The 1,2 density is given by

where

),()},(/)1({),;( )1,0(2111

2121 uIBuuuf

).(/)()(),( 212121 B

Page 47: The Problem of Detecting Differentially Expressed Genes
Page 48: The Problem of Detecting Differentially Expressed Genes

• Efron B, Tibshirani R, Storey JD, Tusher V (2001) Empirical Bayes analysis of a microarray experiment. JASA 96,1151-1160.

• Efron B (2004) Large-scale simultaenous hypothesis testing: the choice of a null hypothesis. JASA 99, 96-104.

• Efron B (2004) Selection and Estimation for Large-Scale Simultaneous Inference.

• Efron B (2005) Local False Discovery Rates.

• Efron B (2006) Correlation and Large-Scale Simultaneous Significance Testing.

Page 49: The Problem of Detecting Differentially Expressed Genes

• McLachlan GJ, Bean RW, Ben-Tovim Jones L, Zhu JX. Using mixture models to detect differentially expressed genes. Australian Journal of Experimental Agriculture 45, 859-866.

• McLachlan GJ, Bean RW, Ben-Tovim Jones L. A simple implentation of a normal mixture approach to differential gene expression in multiclass microarrays. Bioinformatics 26. To appear.

Page 50: The Problem of Detecting Differentially Expressed Genes

Two component mixture model

)()()( 1100 jjj wfwfwf

)(

)()( 00

0j

jj wf

wfw

Using Bayes’ Theorem, we calculate the posterior probability that gene j is not differentially expressed:

π0 is the proportion of genes that are not differentiallyexpressed. The two-component mixture model is:

Page 51: The Problem of Detecting Differentially Expressed Genes

0(wj) c0,

then this decision minimizes the (estimated) Bayes risk

If we conclude that gene j is differentially expressed if:

Page 52: The Problem of Detecting Differentially Expressed Genes

where

101010)1(Risk ecec where e01 is the probability of a false positive and e10 is the probability of a false negative.

Page 53: The Problem of Detecting Differentially Expressed Genes

Estimated FDR

where

Page 54: The Problem of Detecting Differentially Expressed Genes

N

jrjcj NNwIwRDFN

10),(1 )/())(ˆ()(ˆˆ

0

N

jj

N

jjcj wwIwRPF

10

10],0[0 )(ˆ/))(ˆ()(ˆˆ

0

N

jj

N

jjcj wwIwRNF

11

10),(1 )(ˆ/))(ˆ()(ˆˆ

0

Similarly, the false positive rate is given by

and the false non-discovery rate and false negativerate by:

Page 55: The Problem of Detecting Differentially Expressed Genes

F0: N(0,1), π0=0.9F1: N(1,1), π1=0.1

Reject H0 if z≥2

τ0(2) = 0.99972but FDR=0.17

F0: N(0,1), π0=0.6F1: N(1,1), π1=0.4

Reject H0 if z≥2

τ0(2) = 0.251but FDR=0.177

Glonek and Solomon (2003)

Page 56: The Problem of Detecting Differentially Expressed Genes

-4 -2 0 2 4

0.0

0.1

0.2

0.3

0.4

Page 57: The Problem of Detecting Differentially Expressed Genes

-4 -2 0 2 4

0.0

0.1

0.2

0.3

0.4

Page 58: The Problem of Detecting Differentially Expressed Genes

Gene Statistics: Two-Sample t-Statistic

2221

21

21

nsnsw

jj

jjj

yyStudent’s t-statistic

Pooled form of the Student’s t-statistic, assumed common variance in the two classes 21

21

11 nnsw

j

jjj

yy

Modified t-statistic of Tusher et al. (2001)

021

21

11 annsw

j

jjj

yy

Page 59: The Problem of Detecting Differentially Expressed Genes

1. Obtain the z-score for each of the genes

The Procedure

2. Rank the genes on the basis of the z-scores, starting with the largest ones (the same ordering as with the P-values, pj).

3. The posterior probability of non-differential expression of gene j, is given by 0(zj).

4. Conclude gene j to be differentially expressed if

0(zj) < c0

1jz jp1

Page 60: The Problem of Detecting Differentially Expressed Genes

τ0(zj) ≤ c

czfzf

zf

jj

j )()(

)(

1100

00

1

0

0

1 1

)(

)(

c

c

zf

zf

j

j

c

c

zf

zf

j

j

19

)(

)(

0

1

If π0/π1 ≤ 9, then

e.g. c = 0.2, 36)(

)(

0

1 j

j

zf

zf

Page 61: The Problem of Detecting Differentially Expressed Genes

Suppose π0/ π1 ≥ 0.9

then

τ0(zj) ≤ c

τ0(zj) ≤ 0.2 if

36)(

)(

0

1 j

j

zf

zf

Much stronger level of evidence against the nullthan in standard one-at-a-time testing.

Page 62: The Problem of Detecting Differentially Expressed Genes

e.g.

Zj ~ N(μ,1)

H0 : μ = 0 vs H1 : μ = 2.8

Rejecting H0 if |Zj| ≥ 1.96 yields a two-sidedtest of size 0.05 and power 0.80.

f1(1.96)/f0(1.96) = 4.8.

Page 63: The Problem of Detecting Differentially Expressed Genes

Z-scores, null case Z-scores, +1

Z-scores, +2Z-scores, +3

r.bean
Page 64: The Problem of Detecting Differentially Expressed Genes

zj j

Use of Wilson-Hilferty Transformation asin Broet et al. (2004)

Page 65: The Problem of Detecting Differentially Expressed Genes

Plot of approximate z-scores obtained byWilson-Hilferty transformation of simulated values of F-statisticwith 1 and 8 degrees of freedom.

Transformed t-statistics

Fre

qu

en

cy

-1 0 1 2 3 4

05

10

15

20

25

Page 66: The Problem of Detecting Differentially Expressed Genes

EMMIX-FDR

A program has been written in R which interfaces with EMMIX to implement the algorithm described in McLachlan et al (2006).

We fit a mixture of two normal components

to test statistics calculated from the gene expression data.

Page 67: The Problem of Detecting Differentially Expressed Genes

1100 ˆˆˆˆ z

21010

211

200

2 )ˆˆ(ˆˆˆˆˆˆ zs

When we equate the sample mean and varianceof the mixture to their population counterparts, we obtain:

Page 68: The Problem of Detecting Differentially Expressed Genes

)ˆ1/(ˆ 01 z

)ˆ1/(}ˆ)ˆ1(ˆˆ{ˆ 021000

221 zs

When we are working with the theoretical null,we can easily estimate the mean and varianceof the non-null component with the following formulae.

Page 69: The Problem of Detecting Differentially Expressed Genes

)}(/{}:{#)()0(0 Nzz jj

Following the approach of Storey and Tibshirani (2003)we can obtain an initial estimate for π0 as follows:

This estimate of π0 is used as input to EMMIX aspart of the initial parameter estimates. Thus no randomor k-means starts are required, as is usually the case.

There are two different versions of EMMIX used, thestandard version for the empirical null and a modifiedversion for the theoretical null which fixes the mean andvariance of the null component to be 0 and 1 respectively.

Page 70: The Problem of Detecting Differentially Expressed Genes

Theoretical and empirical nulls

Efron (2004) suggested the use of two kinds of null component: the theoretical and the empirical null. In the theoretical case the null component has mean 0 and variance 1 and the empirical null has unrestricted mean and variance.

Page 71: The Problem of Detecting Differentially Expressed Genes

Examples

We examined the performance of EMMIX-FDR on three well-known data sets

from the literature.

• Alon colon cancer data (1999)

• Hedenfalk breast cancer data (2001)

• van’t Wout HIV data (2003)

Page 72: The Problem of Detecting Differentially Expressed Genes

Hedenfalk Breast Cancer Data

Hedenfalk et al. (2001) used cDNA arrays to obtain gene expression profiles of tumours from carriers of either the BRCA1 or BRCA2 mutation (hereditary breast cancers), as well as sporadic breast cancer.

We consider their data set of M = 15 patients, comprising two patient groups: BRCA1 (7) versus BRCA2 - mutation positive (8), with N = 3,226 genes.

The problem is to find genes which are differentially expressed between the BRCA1 and BRCA2 patients.

Hedenfalk et al. (2001) NEJM, 344, 539-547

Page 73: The Problem of Detecting Differentially Expressed Genes

Fit

to the N values of wj (based on pooled two-sample t-statistic)

),()1,0( 21110 NN

j th gene is taken to be differentially expressed if:

00 )(ˆ cw j

Two component model for the Breast Cancer Data

Page 74: The Problem of Detecting Differentially Expressed Genes

Histogram of z-scores for 3226 Hedenfalk genes

z

Fre

qu

en

cy

-4 -2 0 2 4

02

04

06

08

01

00

Fitting two component mixture model to Hedenfalk data

Null

Non–Null(DE genes)

Page 75: The Problem of Detecting Differentially Expressed Genes

Estimates of π0 for Hedenfalk data

• 0.52 (Broet, 2004)

• 0.64 (Gottardo, 2006)

• 0.61 (Ploner et al, 2006)

• 0.47 (Storey, 2002)

Using a theoretical null, we estimated π0 to be 0.65.

Page 76: The Problem of Detecting Differentially Expressed Genes

We used

pj = 2 F13(-|wj|)

Similar results where pj obtained by permutation methods.

Page 77: The Problem of Detecting Differentially Expressed Genes

Gene j Pj zj 0 ( zj )

Gene 1

.

.

.

Gene R

.

.

.

.

Gene R+R1

.

.

.

Gene N

0.002

.

.

.

0.1

0.12

0.18

.

.

0.20

c0 = 0.1

Ranking and Selecting the Genes

FDR = Sum/R = 0.06

Proportion ofFalse Negatives= 1 – Sum1/ R1

Local FDR

Page 78: The Problem of Detecting Differentially Expressed Genes

c0 R

0.5 906 0.26

0.4 694 0.21

0.3 518 0.16

0.2 326 0.11

0.1 143 0.06

RDF ˆ

Estimated FDR for various levels of c0

Page 79: The Problem of Detecting Differentially Expressed Genes

c0 Nr

0.1 143 0.06 0.32 0.88 0.004

0.2 338 0.11 0.28 0.73 0.02

0.3 539 0.16 0.25 0.60 0.04

0.4 742 0.21 0.22 0.48 0.08

0.5 971 0.27 0.18 0.37 0.12

RDF ˆ DRNF ˆ RNF ˆ RPF ˆ

Estimated FDR and other error rates for various levelsof threshold c0 applied to the posterior probability of nondifferentialexpression for the breast cancer data (Nr=number of selected genes)

Page 80: The Problem of Detecting Differentially Expressed Genes

Storey and Tibshirani (2003) PNAS, 100, 9440-9445

Comparison of identified DE genes

Our method (143) Hedenfalk (175)

Storey and Tibshirani (160)

101

6

12

8

39

2924

Page 81: The Problem of Detecting Differentially Expressed Genes

Uniquely Identified Genes: Differentially Expressed between BRCA1 and BRCA2

Gene GO TermUBE2B, DDB2

(UBE2V1)

RAB9, RHOC

ITGB5, ITGA3

PRKCBP1

HDAC3, MIF

KIF5B, spindle body pole protein

CTCL1

TNAFIP1

HARS, HSD17B7

DNA repair

(cell cycle)

small GTPase signal transduction

integrin mediated signalling pathway

regulation of transcription

negative regulation of apoptosis

cytoskeleton organisation

vesicle mediated transport

cation transport

metabolism

Page 82: The Problem of Detecting Differentially Expressed Genes

1. Obtain the z-score for each of the genes

The Procedure

2. Rank the genes on the basis of the z-scores, starting with the largest ones (the same ordering as with the P-values, pj).

3. The posterior probability of non-differential expression of gene j, is given by 0(zj).

4. Conclude gene j to be differentially expressed if

0(zj) < c0

1jz jp1

Page 83: The Problem of Detecting Differentially Expressed Genes

Breast cancer data: plot of fitted two-componentnormal mixture model with theoretical N(0,1) null and non-nullcomponents (weighted respectively by the estimated proportion of null andnon-null genes) imposed on histogram of z-scores.

z-scores

Fre

qu

en

cy

-4 -2 0 2 4

02

04

06

08

01

00

Page 84: The Problem of Detecting Differentially Expressed Genes

Hedenfalk breast cancer data:plot of fitted two-component normal mixture model with empirical nulland non-null components (weighted respectively by the estimated proportionof null and non-null genes) imposed on histogram of z-scores.

z-scores

Fre

qu

en

cy

-4 -2 0 2 4

02

04

06

08

01

00

Page 85: The Problem of Detecting Differentially Expressed Genes

Histogram of N=3,226 z-Values from the Breast Cancer Study.The theoretical N(0,1) null is much narrower than the central peak,which has (δ0,σ0)=(-.02,1.58). In this case the central peak seems toinclude the entire histogram.

Page 86: The Problem of Detecting Differentially Expressed Genes

Alon Colon Cancer Data

• Affymetrix arrays, samples from colon tissues from 62 patients

• Dataset of M = 62 patients: 40 tumor vs 22 normal, with N = 2,000 genes.

Alon et al. (1999) PNAS, 96, 6745-6750

Page 87: The Problem of Detecting Differentially Expressed Genes

Alon data theoretical null

We estimated π0 to be 0.39 using the theoretical null. With the empirical null, it is 0.53.

Six smooth muscle genes are included in the 2,000 genes and each has an estimated posterior probability of non-DE less than 0.015.

Page 88: The Problem of Detecting Differentially Expressed Genes

Colon cancer data: plot of fitted two-componentnormal mixture model with theoretical N(0,1) null and non-nullcomponents (weighted respectively by the estimated proportion of null andnon-null genes) imposed on histogram of z-scores.

z-scores

Fre

qu

en

cy

-6 -4 -2 0 2 4 6

01

02

03

04

05

06

0

Page 89: The Problem of Detecting Differentially Expressed Genes

Colon cancer data: plot of fitted two-componentnormal mixture model with empirical null and non-null components (weighted respectively by the estimated proportion of null and non-null genes) imposed on histogram of z-scores.

Page 90: The Problem of Detecting Differentially Expressed Genes

van’t Wout HIV Data

• Four experiments using the same RNA preparation on 4 different slides

• Expression levels of transcripts assessed in CD4-T-cell lines 24 hours after infection with HIV type 1

• Two samples – one corresponds to HIV infected cells and one to non-infected cells

• Dataset of M = 8 samples: 4 HIV-infected vs 4 non-infected, with N = 7,680 genes. Analysed by Gottardo et al (2006)

van’t Wout et al (2003), J Virology 77, 1392-1402Gottardo et al (2006), Biometrics 62, 10-18

Page 91: The Problem of Detecting Differentially Expressed Genes

HIV data: plot of fitted two-componentnormal mixture model with empirical null and non-nullcomponents (weighted respectively by the estimated proportion of null andnon-null genes) imposed on histogram of z-scores.

z-scores

Fre

qu

en

cy

-6 -4 -2 0 2 4 6

05

01

00

15

02

00

25

03

00

35

0

Page 92: The Problem of Detecting Differentially Expressed Genes

• Mixture-model based approach to finding DE genes can yield new information

• Gives a measure of the posterior probability that a gene is not DE (i.e. a local FDR rather than global)

• Provides estimates of global rates – the FDR and the false negative rate FNR (FNR=1-sensitivity)

Conclusions

Page 93: The Problem of Detecting Differentially Expressed Genes

Mixture Model-Based Approach: Advantages

• Provides a local FDR and a measure of the FNR, as well as the global FDR for the selected genes.

• We show that it can yield new information

Page 94: The Problem of Detecting Differentially Expressed Genes

Allison generated data for 10 mice over 3000 genes. The data are generated in six groups of 500 with a value ρ of 0, 0.4, or 0.8 in the off-diagonal elements of the 500 x 500 covariance matrix used to generate each group.

For a random 20% of the genes, a value d of 0, 4, or 8 is added to the gene expression levels of the last five mice.

Allison Mice Simulation

Page 95: The Problem of Detecting Differentially Expressed Genes

Theoretical null, ρ=0.8, d=4

Page 96: The Problem of Detecting Differentially Expressed Genes

Empirical null, ρ=0.8, d=4

Page 97: The Problem of Detecting Differentially Expressed Genes

Theoretical null, ρ=0.8, d=8

Page 98: The Problem of Detecting Differentially Expressed Genes

Empirical null, ρ=0.8, d=8

Page 99: The Problem of Detecting Differentially Expressed Genes

Suppose 0(w) is monotonic decreasing in w. Then

0)(0ˆ cwj for .0wwj

)(ˆ1

)(1ˆˆ

0

000

wF

wFRDF

00 |)( wwwE

Page 100: The Problem of Detecting Differentially Expressed Genes

Suppose 0(w) is monotonic decreasing in w. Then

0)(0ˆ cwj for .0wwj

)(ˆ1

)(1ˆˆ

0

000

wF

wFRDF

where )()( 000 wwF

1

10

1000ˆ

ˆˆ)(ˆ)(ˆ

w

wwF

Page 101: The Problem of Detecting Differentially Expressed Genes

For a desired control level , say = 0.05, define

)(ˆminarg0 wRDFww

(1)

If)(ˆ1

)(1ˆ

00

wF

wF

is monotonic in w, then using (1)

to control the FDR [with and taken to be the empirical distribution function] is equivalent to using the Benjamini-Hochberg procedure based on the P-values corresponding to the statistic wj.

1ˆ 0 )(ˆ 0wF