the problem of detecting differentially expressed genes

The Problem of Detecting Differentially Expressed Genes

Gene 1

Gene 2

Gene N

.

.

.

.. . . .Sample 1 Sample 2 Sample M

Gene 1

Gene 2

Gene N

.

.

.

.. . . .Sample 1 Sample 2 Sample M

Class 1 Class 2

Fold Change is the Simplest Method

Calculate the log ratio between the two classes and consider all genes that differ by more than an arbitrary cutoff value to be differentially expressed. A two-fold difference is often chosen.

Fold change is not a statistical test.

(1) For gene consider the null hypothesis of no association between its expression level and its class membership.

Test of a Single Hypothesis

(3) Perform a test (e.g Student’s t-test) for each gene.

(2) Decide on level of significance (commonly 5%).

(4) Obtain P-value corresponding to that test statistic.

(5) Compare P-value with the significance level. Theneither reject or retain the null hypothesis.

Two-Sample t-Statistic

2221

21

21

nsnsW

jj

jjj

yy

Student’s t-statistic:


21

21

11 nnsW

j

jjj

yy

Pooled form of the Student’s t-statistic, assumed common variance in the two classes:

021

21

11 annsW

j

jjj

yy


Modified t-statistic of Tusher et al. (2001):

Retain Null Reject Null

Null False positive

Non-null False negative TRUE

PREDICTED

Types of Errors in Hypothesis Testing

Multiplicity Problem

Further: Genes are co-regulated, subsequently there is correlation between the test statistics.

When many hypotheses are tested, the probability of a false positive increases sharply with the number of hypotheses.

Suppose we measure the expression of 10,000 genes in a microarray experiment.

If all 10,000 genes were not differentially expressed, then we would expect for:

P= 0.05 for each test, 500 false positives.

P= 0.05/10,000 for each test, .05 false positives.

Example

Controlling the Error Rate

• Methods for controlling false positives e.g. Bonferroni are too strict for microarray analyses

• Use the False Discovery Rate instead (FDR)(Benjamini and Hochberg 1995)

Methods for dealing with the Multiplicity Problem

• The Bonferroni Method controls the family wise error rate (FWER) i.e.the probability that at least one false positive error will be made

• The False Discovery Rate (FDR)emphasizes the proportion of false positives among the identified differentially expressed genes.

Too strict for gene expression data, tries to make it unlikely that even one false rejection of the null is made, may lead to missed findings

Good for gene expression data – says something about the chosen genes

)hypotheses (rejected#

positives) (false#FDR

The FDR is essentially the expectation of the proportion of false positives among the identified differentially expressed genes.

False Discovery Rate Benjamini and Hochberg (1995)

Accept Null Reject Null Total

Null True N00 N01 N0

Non-True N10 N11 N1

Total N - Nr Nr N

Possible Outcomes for N Hypothesis Tests

}1

{FDR01

rNN

E

)1,max(1 rr NN where

}1)(

{FNDR10

rNN

NE

}0|/E{pFDR 01 rr NNN

}0FDR/pr{ rN

Positive FDR

Lindsay, Kettenring, and Siegmund (2004).

A Report on the Future of Statistics.

Statist. Sci. 19.

Key papers on controlling the FDR

• Genovese and Wasserman (2002)

• Storey (2002, 2003)

• Storey and Tibshirani (2003a, 2003b)

• Storey, Taylor and Siegmund (2004)

• Black (2004)

• Cox and Wong (2004)

Controlling FDR

Benjamini and Hochberg (1995)

Benjamini-Hochberg (BH) Procedure

Controls the FDR at level when the P-values following the null distribution are independent and uniformly distributed.

(1) Let be the observed P-values.

(2) Calculate .

(3) If exists then reject null hypotheses corresponding to

. Otherwise, reject nothing.

)()1( Npp

)()1( kpp

Nkpkk k /:maxargˆ)(

Nk 1

k

Example: Bonferroni and BH Tests

Suppose that 10 independent hypothesis tests are carried outleading to the following ordered P-values:

0.00017 0.00448 0.00671 0.00907 0.012200.33626 0.39341 0.53882 0.58125 0.98617

(a) With = 0.05, the Bonferroni test rejects any hypothesiswhose P-value is less than / 10 = 0.005.

Thus only the first two hypotheses are rejected.

(b) For the BH test, we find the largest k such that P(k) < k / N. Here k = 5, thus we reject the first five hypotheses.

q-VALUE

q-value of a gene j is expected proportion of false positives when calling that gene significant.

P-value is the probability under the null hypothesis of obtaining a value of the test statistic as or more extreme than its observed value.

The q-value for an observed test statistic can be viewed as the expected proportion of false positives among all genes with their test statistics as or more extreme than the observed value.

LIST OF SIGNIFICANT GENES

Call all genes significant if pj < 0.05

or

Call all genes significant if qj < 0.05

to produce a set of significant genes so that a proportion of them (<0.05) is expected to be false (at least for a large no. of genes not necessarily independent)

BRCA1 versus BRCA2-mutation positive tumours (Hedenfalk et al., 2001)

BRCA1 (7) versus BRCA2-mutation (8) positive tumours, p=3226 genes

P=.001 gave 51 genes differentially expressed

P=0.0001 gave 9-11 genes

Using q<0.05, gives 160 genes are taken to be significant.

It means that approx. 8 of these 160 genes are expected to be false positives.

Also, it is estimated that 33% of the genes are differentially expressed.

Permutation Method

The null distribution has a resolution on the order of the number of permutations.

If we perform B permutations, then the P-value will be estimatedwith a resolution of 1/B.

If we assume that each gene has the same null distribution and combine the permutations, then the resolution will be 1/(NB) for the pooled null distribution.

Null Distribution of the Test Statistic

Using just the B permutations of the class labels for the gene-specific statistic Wj , the P-value for Wj = wj is assessed as:

where w(b)0j is the null version of wj after the bth permutation

of the class labels.

B

wwbp j

bj

j

|}||:|{# )(0

If we pool over all N genes, then:

B

b

jbi

j NB

Niwwip

1

)(0 },...,1|,||:|{#

Class 1 Class 2

Gene 1 A1(1) A2

(1) A3(1) B4

(1) B5(1) B6

(1)

Gene 2 A1(2) A2

(2) A3(2) B4

(2) B5(2) B6

(2)

Suppose we have two classes of tissue samples, with three samples from each class. Consider the expressions of two genes, Gene 1 and Gene 2.

Null Distribution of the Test Statistic: Example

Class 1 Class 2

Gene 1 A1(1) A2

(1) A3(1) B4

(1) B5(1) B6

(1)

Gene 2 A1(2) A2

(2) A3(2) B4

(2) B5(2) B6

(2)

Gene 1 A1(1) A2

(1) A3(1) A4

(1) A5(1) A6

(1)

To find the null distribution of the test statistic for Gene 1, we proceed under the assumption that there is no difference between the classes (for Gene 1) so that:

Perm. 1 A1(1) A2

(1) A4(1) A3

(1) A5(1) A6

(1) ...There are 10 distinct permutations.

And permute the class labels:

Ten Permutations of Gene 1

A1(1) A2

(1) A3(1) A4

(1) A5(1) A6

(1)

A1(1) A2

(1) A4(1) A3

(1) A5(1) A6

(1)

A1(1) A2

(1) A5(1) A3

(1) A4(1) A6

(1)

A1(1) A2

(1) A6(1) A3

(1) A4(1) A5

(1)

A1(1) A3

(1) A4(1) A2

(1) A5(1) A6

(1)

A1(1) A3

(1) A5(1) A2

(1) A4(1) A6

(1)

A1(1) A3

(1) A6(1) A2

(1) A4(1) A5

(1)

A1(1) A4

(1) A5(1) A2

(1) A3(1) A6

(1)

A1(1) A4

(1) A6(1) A2

(1) A3(1) A5

(1)

A1(1) A5

(1) A6(1) A2

(1) A3(1) A4

(1)

As there are only 10 distinct permutations here, thenull distribution based on these permutations is toogranular.

Hence consideration is given to permuting the labelsof each of the other genes and estimating the null distribution of a gene based on the pooled permutationsso obtained.

But there is a problem with this method in that the null values of the test statistic for each gene does not necessarily have the theoretical null distribution that we are trying to estimate.

Suppose we were to use Gene 2 also to estimate the null distribution of Gene 1.

Suppose that Gene 2 is differentially expressed, then the null values of the test statistic for Gene 2will have a mixed distribution.

Class 1 Class 2

Gene 1 A1(1) A2

(1) A3(1) B4

(1) B5(1) B6

(1)

Gene 2 A1(2) A2

(2) A3(2) B4

(2) B5(2) B6

(2)

Gene 2 A1(2) A2

(2) A3(2) B4

(2) B5(2) B6

(2)

Perm. 1 A1(2) A2

(2) B4(2) A3

(2) B5(2) B6

(2)

...

Permute the class labels:

Example of a null case: with 7 N(0,1) points and8 N(0,1) points; histogram of the pooled two-sample t-statistic under 1000 permutations of the class labels with t13 density superimposed.

ty

Example of a null case: with 7 N(0,1) points and

8 N(10,9) points; histogram of the pooled two-sample

t-statistic under 1000 permutations of the class

labels with t13 density superimposed.

ty

The SAM Method

Use the permutation method to calculate the null distribution of the modified t-statistic (Tusher et al., 2001).

The order statistics t(1), ... , t(N) are plotted against their null expectations above.

A good test in situations where there are more genes being over-expressed than under-expressed, or vice-versa.

B

b

bjj NjwBw

1

)()(0)(0 ).,...,1()/1(

TRUE

PREDICTED


Null a b (false positive)

Non-null c (false negative) d

FDR ~ R

b

R

FNDR ~ ca

c

The FDR and other error rates

TRUE

PREDICTED


Null a b (false positive)

Non-null c (false negative) d

FDR ~ R

b

R

FNR = dc

c

The FDR and other error rates

Test Result

non-DE DE Total

True

non-DE

DE

Total

A = 9025 B = 475

C = 100 D = 400

9125 875

9500

500

10000

FDR = B / (B + D) = 475 / 875 = 54%

FNDR = C / (A + C) = 100 / 9125 = 1%

Toy Example with 10,000 Genes

Two-component mixture model

)()()( 1100 jjj wfwfwf 0 is the proportion of genes that are not

differentially expressed, and 01 1

Use of the P-Value as a Summary Statistic(Allison et al., 2002)

Instead of using the pooled form of the t-statistic, we can work with the value pj, which is the P-value associated with tj

in the test of the null hypothesis of no difference in expression between the two classes.

The distribution of the P-value is modelled by the h-component mixture model

h

i

iijij pfpf1

21 ),;()(

where 11 = 12 = 1.

,

Use of the P-Value as a Summary Statistic

Under the null hypothesis of no difference in expression for thejth gene, pj will have a uniform distribution on the unit interval; ie the 1,1 distribution.

The 1,2 density is given by

where

),()},(/)1({),;( )1,0(2111

2121 uIBuuuf

).(/)()(),( 212121 B

• Efron B, Tibshirani R, Storey JD, Tusher V (2001) Empirical Bayes analysis of a microarray experiment. JASA 96,1151-1160.

• Efron B (2004) Large-scale simultaenous hypothesis testing: the choice of a null hypothesis. JASA 99, 96-104.

• Efron B (2004) Selection and Estimation for Large-Scale Simultaneous Inference.

• Efron B (2005) Local False Discovery Rates.

• Efron B (2006) Correlation and Large-Scale Simultaneous Significance Testing.

• McLachlan GJ, Bean RW, Ben-Tovim Jones L, Zhu JX. Using mixture models to detect differentially expressed genes. Australian Journal of Experimental Agriculture 45, 859-866.

• McLachlan GJ, Bean RW, Ben-Tovim Jones L. A simple implentation of a normal mixture approach to differential gene expression in multiclass microarrays. Bioinformatics 26. To appear.

Two component mixture model

)()()( 1100 jjj wfwfwf

)(

)()( 00

0j

jj wf

wfw

Using Bayes’ Theorem, we calculate the posterior probability that gene j is not differentially expressed:

π0 is the proportion of genes that are not differentiallyexpressed. The two-component mixture model is:

0(wj) c0,

then this decision minimizes the (estimated) Bayes risk

If we conclude that gene j is differentially expressed if:

where

101010)1(Risk ecec where e01 is the probability of a false positive and e10 is the probability of a false negative.

Estimated FDR

where

N

jrjcj NNwIwRDFN

10),(1 )/())(ˆ()(ˆˆ

0

N

jj

N

jjcj wwIwRPF

10

10],0[0 )(ˆ/))(ˆ()(ˆˆ

0

N

jj

N

jjcj wwIwRNF

11

10),(1 )(ˆ/))(ˆ()(ˆˆ

0

Similarly, the false positive rate is given by

and the false non-discovery rate and false negativerate by:

F0: N(0,1), π0=0.9F1: N(1,1), π1=0.1

Reject H0 if z≥2

τ0(2) = 0.99972but FDR=0.17

F0: N(0,1), π0=0.6F1: N(1,1), π1=0.4

Reject H0 if z≥2

τ0(2) = 0.251but FDR=0.177

Glonek and Solomon (2003)

-4 -2 0 2 4

0.0

0.1

0.2

0.3

0.4

Gene Statistics: Two-Sample t-Statistic

2221

21

21

nsnsw

jj

jjj

yyStudent’s t-statistic

Pooled form of the Student’s t-statistic, assumed common variance in the two classes 21

21

11 nnsw

j

jjj

yy

Modified t-statistic of Tusher et al. (2001)

021

21

11 annsw

j

jjj

yy

1. Obtain the z-score for each of the genes

The Procedure

2. Rank the genes on the basis of the z-scores, starting with the largest ones (the same ordering as with the P-values, pj).

3. The posterior probability of non-differential expression of gene j, is given by 0(zj).

4. Conclude gene j to be differentially expressed if

0(zj) < c0

1jz jp1

τ0(zj) ≤ c

czfzf

zf

jj

j )()(

)(

1100

00

1

0

0

1 1

)(

)(

c

c

zf

zf

j

j

c

c

zf

zf

j

j

19

)(

)(

0

1

If π0/π1 ≤ 9, then

e.g. c = 0.2, 36)(

)(

0

1 j

j

zf

zf

Suppose π0/ π1 ≥ 0.9

then

τ0(zj) ≤ c

τ0(zj) ≤ 0.2 if

36)(

)(

0

1 j

j

zf

zf

Much stronger level of evidence against the nullthan in standard one-at-a-time testing.

e.g.

Zj ~ N(μ,1)

H0 : μ = 0 vs H1 : μ = 2.8

Rejecting H0 if |Zj| ≥ 1.96 yields a two-sidedtest of size 0.05 and power 0.80.

f1(1.96)/f0(1.96) = 4.8.

Z-scores, null case Z-scores, +1

Z-scores, +2Z-scores, +3

r.bean

zj j

Use of Wilson-Hilferty Transformation asin Broet et al. (2004)

Plot of approximate z-scores obtained byWilson-Hilferty transformation of simulated values of F-statisticwith 1 and 8 degrees of freedom.

Transformed t-statistics

Fre

qu

en

cy

-1 0 1 2 3 4

05

10

15

20

25

EMMIX-FDR

A program has been written in R which interfaces with EMMIX to implement the algorithm described in McLachlan et al (2006).

We fit a mixture of two normal components

to test statistics calculated from the gene expression data.

1100 ˆˆˆˆ z

21010

211

200

2 )ˆˆ(ˆˆˆˆˆˆ zs

When we equate the sample mean and varianceof the mixture to their population counterparts, we obtain:

)ˆ1/(ˆ 01 z

)ˆ1/(}ˆ)ˆ1(ˆˆ{ˆ 021000

221 zs

When we are working with the theoretical null,we can easily estimate the mean and varianceof the non-null component with the following formulae.

)}(/{}:{#)()0(0 Nzz jj

Following the approach of Storey and Tibshirani (2003)we can obtain an initial estimate for π0 as follows:

This estimate of π0 is used as input to EMMIX aspart of the initial parameter estimates. Thus no randomor k-means starts are required, as is usually the case.

There are two different versions of EMMIX used, thestandard version for the empirical null and a modifiedversion for the theoretical null which fixes the mean andvariance of the null component to be 0 and 1 respectively.

Theoretical and empirical nulls

Efron (2004) suggested the use of two kinds of null component: the theoretical and the empirical null. In the theoretical case the null component has mean 0 and variance 1 and the empirical null has unrestricted mean and variance.

Examples

We examined the performance of EMMIX-FDR on three well-known data sets

from the literature.

• Alon colon cancer data (1999)

• Hedenfalk breast cancer data (2001)

• van’t Wout HIV data (2003)

Hedenfalk Breast Cancer Data

Hedenfalk et al. (2001) used cDNA arrays to obtain gene expression profiles of tumours from carriers of either the BRCA1 or BRCA2 mutation (hereditary breast cancers), as well as sporadic breast cancer.

We consider their data set of M = 15 patients, comprising two patient groups: BRCA1 (7) versus BRCA2 - mutation positive (8), with N = 3,226 genes.

The problem is to find genes which are differentially expressed between the BRCA1 and BRCA2 patients.

Hedenfalk et al. (2001) NEJM, 344, 539-547

Fit

to the N values of wj (based on pooled two-sample t-statistic)

),()1,0( 21110 NN

j th gene is taken to be differentially expressed if:

00 )(ˆ cw j

Two component model for the Breast Cancer Data

Histogram of z-scores for 3226 Hedenfalk genes

z

Fre

qu

en

cy

-4 -2 0 2 4

02

04

06

08

01

00

Fitting two component mixture model to Hedenfalk data

Null

Non–Null(DE genes)

Estimates of π0 for Hedenfalk data

• 0.52 (Broet, 2004)

• 0.64 (Gottardo, 2006)

• 0.61 (Ploner et al, 2006)

• 0.47 (Storey, 2002)

Using a theoretical null, we estimated π0 to be 0.65.

We used

pj = 2 F13(-|wj|)

Similar results where pj obtained by permutation methods.

Gene j Pj zj 0 ( zj )

Gene 1

.

.

.

Gene R

.

.

.

.

Gene R+R1

.

.

.

Gene N

0.002

.

.

.

0.1

0.12

0.18

.

.

0.20

c0 = 0.1

Ranking and Selecting the Genes

FDR = Sum/R = 0.06

Proportion ofFalse Negatives= 1 – Sum1/ R1

Local FDR

c0 R

0.5 906 0.26

0.4 694 0.21

0.3 518 0.16

0.2 326 0.11

0.1 143 0.06

RDF ˆ

Estimated FDR for various levels of c0

c0 Nr

0.1 143 0.06 0.32 0.88 0.004

0.2 338 0.11 0.28 0.73 0.02

0.3 539 0.16 0.25 0.60 0.04

0.4 742 0.21 0.22 0.48 0.08

0.5 971 0.27 0.18 0.37 0.12

RDF ˆ DRNF ˆ RNF ˆ RPF ˆ

Estimated FDR and other error rates for various levelsof threshold c0 applied to the posterior probability of nondifferentialexpression for the breast cancer data (Nr=number of selected genes)

Storey and Tibshirani (2003) PNAS, 100, 9440-9445

Comparison of identified DE genes

Our method (143) Hedenfalk (175)

Storey and Tibshirani (160)

101

6

12

8

39

2924

Uniquely Identified Genes: Differentially Expressed between BRCA1 and BRCA2

Gene GO TermUBE2B, DDB2

(UBE2V1)

RAB9, RHOC

ITGB5, ITGA3

PRKCBP1

HDAC3, MIF

KIF5B, spindle body pole protein

CTCL1

TNAFIP1

HARS, HSD17B7

DNA repair

(cell cycle)

small GTPase signal transduction

integrin mediated signalling pathway

regulation of transcription

negative regulation of apoptosis

cytoskeleton organisation

vesicle mediated transport

cation transport

metabolism

1. Obtain the z-score for each of the genes

The Procedure

2. Rank the genes on the basis of the z-scores, starting with the largest ones (the same ordering as with the P-values, pj).

3. The posterior probability of non-differential expression of gene j, is given by 0(zj).

4. Conclude gene j to be differentially expressed if

0(zj) < c0

1jz jp1

Breast cancer data: plot of fitted two-componentnormal mixture model with theoretical N(0,1) null and non-nullcomponents (weighted respectively by the estimated proportion of null andnon-null genes) imposed on histogram of z-scores.

z-scores

Fre

qu

en

cy

-4 -2 0 2 4

02

04

06

08

01

00

Hedenfalk breast cancer data:plot of fitted two-component normal mixture model with empirical nulland non-null components (weighted respectively by the estimated proportionof null and non-null genes) imposed on histogram of z-scores.

z-scores

Fre

qu

en

cy

-4 -2 0 2 4

02

04

06

08

01

00

Histogram of N=3,226 z-Values from the Breast Cancer Study.The theoretical N(0,1) null is much narrower than the central peak,which has (δ0,σ0)=(-.02,1.58). In this case the central peak seems toinclude the entire histogram.

Alon Colon Cancer Data

• Affymetrix arrays, samples from colon tissues from 62 patients

• Dataset of M = 62 patients: 40 tumor vs 22 normal, with N = 2,000 genes.

Alon et al. (1999) PNAS, 96, 6745-6750

Alon data theoretical null

We estimated π0 to be 0.39 using the theoretical null. With the empirical null, it is 0.53.

Six smooth muscle genes are included in the 2,000 genes and each has an estimated posterior probability of non-DE less than 0.015.

Colon cancer data: plot of fitted two-componentnormal mixture model with theoretical N(0,1) null and non-nullcomponents (weighted respectively by the estimated proportion of null andnon-null genes) imposed on histogram of z-scores.

z-scores

Fre

qu

en

cy

-6 -4 -2 0 2 4 6

01

02

03

04

05

06

0

Colon cancer data: plot of fitted two-componentnormal mixture model with empirical null and non-null components (weighted respectively by the estimated proportion of null and non-null genes) imposed on histogram of z-scores.

van’t Wout HIV Data

• Four experiments using the same RNA preparation on 4 different slides

• Expression levels of transcripts assessed in CD4-T-cell lines 24 hours after infection with HIV type 1

• Two samples – one corresponds to HIV infected cells and one to non-infected cells

• Dataset of M = 8 samples: 4 HIV-infected vs 4 non-infected, with N = 7,680 genes. Analysed by Gottardo et al (2006)

van’t Wout et al (2003), J Virology 77, 1392-1402Gottardo et al (2006), Biometrics 62, 10-18

HIV data: plot of fitted two-componentnormal mixture model with empirical null and non-nullcomponents (weighted respectively by the estimated proportion of null andnon-null genes) imposed on histogram of z-scores.

z-scores

Fre

qu

en

cy

-6 -4 -2 0 2 4 6

05

01

00

15

02

00

25

03

00

35

0

• Mixture-model based approach to finding DE genes can yield new information

• Gives a measure of the posterior probability that a gene is not DE (i.e. a local FDR rather than global)

• Provides estimates of global rates – the FDR and the false negative rate FNR (FNR=1-sensitivity)

Conclusions

Mixture Model-Based Approach: Advantages

• Provides a local FDR and a measure of the FNR, as well as the global FDR for the selected genes.

• We show that it can yield new information

Allison generated data for 10 mice over 3000 genes. The data are generated in six groups of 500 with a value ρ of 0, 0.4, or 0.8 in the off-diagonal elements of the 500 x 500 covariance matrix used to generate each group.

For a random 20% of the genes, a value d of 0, 4, or 8 is added to the gene expression levels of the last five mice.

Allison Mice Simulation

Theoretical null, ρ=0.8, d=4

Empirical null, ρ=0.8, d=4

Theoretical null, ρ=0.8, d=8

Empirical null, ρ=0.8, d=8

Suppose 0(w) is monotonic decreasing in w. Then

0)(0ˆ cwj for .0wwj

)(ˆ1

)(1ˆˆ

0

000

wF

wFRDF

00 |)( wwwE

Suppose 0(w) is monotonic decreasing in w. Then

0)(0ˆ cwj for .0wwj

)(ˆ1

)(1ˆˆ

0

000

wF

wFRDF

where )()( 000 wwF

1

10

1000ˆ

ˆˆ)(ˆ)(ˆ

w

wwF

For a desired control level , say = 0.05, define

)(ˆminarg0 wRDFww

(1)

If)(ˆ1

)(1ˆ

00

wF

wF

is monotonic in w, then using (1)

to control the FDR [with and taken to be the empirical distribution function] is equivalent to using the Benjamini-Hochberg procedure based on the P-values corresponding to the statistic wj.

1ˆ 0 )(ˆ 0wF

the problem of detecting differentially expressed genes

Documents

false rejection

proportion of false

false positive error

false positive increases

false discovery rate

null hypothesis

test statistics

g students ttest