the problem of detecting differentially expressed genes
TRANSCRIPT
The Problem of Detecting Differentially Expressed Genes
Gene 1
Gene 2
Gene N
.
.
.
.. . . .Sample 1 Sample 2 Sample M
Gene 1
Gene 2
Gene N
.
.
.
.. . . .Sample 1 Sample 2 Sample M
Class 1 Class 2
Fold Change is the Simplest Method
Calculate the log ratio between the two classes and consider all genes that differ by more than an arbitrary cutoff value to be differentially expressed. A two-fold difference is often chosen.
Fold change is not a statistical test.
(1) For gene consider the null hypothesis of no association between its expression level and its class membership.
Test of a Single Hypothesis
(3) Perform a test (e.g Student’s t-test) for each gene.
(2) Decide on level of significance (commonly 5%).
(4) Obtain P-value corresponding to that test statistic.
(5) Compare P-value with the significance level. Theneither reject or retain the null hypothesis.
Two-Sample t-Statistic
2221
21
21
nsnsW
jj
jjj
yy
Student’s t-statistic:
Two-Sample t-Statistic
21
21
11 nnsW
j
jjj
yy
Pooled form of the Student’s t-statistic, assumed common variance in the two classes:
021
21
11 annsW
j
jjj
yy
Two-Sample t-Statistic
Modified t-statistic of Tusher et al. (2001):
Retain Null Reject Null
Null False positive
Non-null False negative TRUE
PREDICTED
Types of Errors in Hypothesis Testing
Multiplicity Problem
Further: Genes are co-regulated, subsequently there is correlation between the test statistics.
When many hypotheses are tested, the probability of a false positive increases sharply with the number of hypotheses.
Suppose we measure the expression of 10,000 genes in a microarray experiment.
If all 10,000 genes were not differentially expressed, then we would expect for:
P= 0.05 for each test, 500 false positives.
P= 0.05/10,000 for each test, .05 false positives.
Example
Controlling the Error Rate
• Methods for controlling false positives e.g. Bonferroni are too strict for microarray analyses
• Use the False Discovery Rate instead (FDR)(Benjamini and Hochberg 1995)
Methods for dealing with the Multiplicity Problem
• The Bonferroni Method controls the family wise error rate (FWER) i.e.the probability that at least one false positive error will be made
• The False Discovery Rate (FDR)emphasizes the proportion of false positives among the identified differentially expressed genes.
Too strict for gene expression data, tries to make it unlikely that even one false rejection of the null is made, may lead to missed findings
Good for gene expression data – says something about the chosen genes
)hypotheses (rejected#
positives) (false#FDR
The FDR is essentially the expectation of the proportion of false positives among the identified differentially expressed genes.
False Discovery Rate Benjamini and Hochberg (1995)
Accept Null Reject Null Total
Null True N00 N01 N0
Non-True N10 N11 N1
Total N - Nr Nr N
Possible Outcomes for N Hypothesis Tests
}1
{FDR01
rNN
E
)1,max(1 rr NN where
}1)(
{FNDR10
rNN
NE
}0|/E{pFDR 01 rr NNN
}0FDR/pr{ rN
Positive FDR
Lindsay, Kettenring, and Siegmund (2004).
A Report on the Future of Statistics.
Statist. Sci. 19.
Key papers on controlling the FDR
• Genovese and Wasserman (2002)
• Storey (2002, 2003)
• Storey and Tibshirani (2003a, 2003b)
• Storey, Taylor and Siegmund (2004)
• Black (2004)
• Cox and Wong (2004)
Controlling FDR
Benjamini and Hochberg (1995)
Benjamini-Hochberg (BH) Procedure
Controls the FDR at level when the P-values following the null distribution are independent and uniformly distributed.
(1) Let be the observed P-values.
(2) Calculate .
(3) If exists then reject null hypotheses corresponding to
. Otherwise, reject nothing.
)()1( Npp
)()1( kpp
Nkpkk k /:maxargˆ)(
Nk 1
k
Example: Bonferroni and BH Tests
Suppose that 10 independent hypothesis tests are carried outleading to the following ordered P-values:
0.00017 0.00448 0.00671 0.00907 0.012200.33626 0.39341 0.53882 0.58125 0.98617
(a) With = 0.05, the Bonferroni test rejects any hypothesiswhose P-value is less than / 10 = 0.005.
Thus only the first two hypotheses are rejected.
(b) For the BH test, we find the largest k such that P(k) < k / N. Here k = 5, thus we reject the first five hypotheses.
q-VALUE
q-value of a gene j is expected proportion of false positives when calling that gene significant.
P-value is the probability under the null hypothesis of obtaining a value of the test statistic as or more extreme than its observed value.
The q-value for an observed test statistic can be viewed as the expected proportion of false positives among all genes with their test statistics as or more extreme than the observed value.
LIST OF SIGNIFICANT GENES
Call all genes significant if pj < 0.05
or
Call all genes significant if qj < 0.05
to produce a set of significant genes so that a proportion of them (<0.05) is expected to be false (at least for a large no. of genes not necessarily independent)
BRCA1 versus BRCA2-mutation positive tumours (Hedenfalk et al., 2001)
BRCA1 (7) versus BRCA2-mutation (8) positive tumours, p=3226 genes
P=.001 gave 51 genes differentially expressed
P=0.0001 gave 9-11 genes
Using q<0.05, gives 160 genes are taken to be significant.
It means that approx. 8 of these 160 genes are expected to be false positives.
Also, it is estimated that 33% of the genes are differentially expressed.
Permutation Method
The null distribution has a resolution on the order of the number of permutations.
If we perform B permutations, then the P-value will be estimatedwith a resolution of 1/B.
If we assume that each gene has the same null distribution and combine the permutations, then the resolution will be 1/(NB) for the pooled null distribution.
Null Distribution of the Test Statistic
Using just the B permutations of the class labels for the gene-specific statistic Wj , the P-value for Wj = wj is assessed as:
where w(b)0j is the null version of wj after the bth permutation
of the class labels.
B
wwbp j
bj
j
|}||:|{# )(0
If we pool over all N genes, then:
B
b
jbi
j NB
Niwwip
1
)(0 },...,1|,||:|{#
Class 1 Class 2
Gene 1 A1(1) A2
(1) A3(1) B4
(1) B5(1) B6
(1)
Gene 2 A1(2) A2
(2) A3(2) B4
(2) B5(2) B6
(2)
Suppose we have two classes of tissue samples, with three samples from each class. Consider the expressions of two genes, Gene 1 and Gene 2.
Null Distribution of the Test Statistic: Example
Class 1 Class 2
Gene 1 A1(1) A2
(1) A3(1) B4
(1) B5(1) B6
(1)
Gene 2 A1(2) A2
(2) A3(2) B4
(2) B5(2) B6
(2)
Gene 1 A1(1) A2
(1) A3(1) A4
(1) A5(1) A6
(1)
To find the null distribution of the test statistic for Gene 1, we proceed under the assumption that there is no difference between the classes (for Gene 1) so that:
Perm. 1 A1(1) A2
(1) A4(1) A3
(1) A5(1) A6
(1) ...There are 10 distinct permutations.
And permute the class labels:
Ten Permutations of Gene 1
A1(1) A2
(1) A3(1) A4
(1) A5(1) A6
(1)
A1(1) A2
(1) A4(1) A3
(1) A5(1) A6
(1)
A1(1) A2
(1) A5(1) A3
(1) A4(1) A6
(1)
A1(1) A2
(1) A6(1) A3
(1) A4(1) A5
(1)
A1(1) A3
(1) A4(1) A2
(1) A5(1) A6
(1)
A1(1) A3
(1) A5(1) A2
(1) A4(1) A6
(1)
A1(1) A3
(1) A6(1) A2
(1) A4(1) A5
(1)
A1(1) A4
(1) A5(1) A2
(1) A3(1) A6
(1)
A1(1) A4
(1) A6(1) A2
(1) A3(1) A5
(1)
A1(1) A5
(1) A6(1) A2
(1) A3(1) A4
(1)
As there are only 10 distinct permutations here, thenull distribution based on these permutations is toogranular.
Hence consideration is given to permuting the labelsof each of the other genes and estimating the null distribution of a gene based on the pooled permutationsso obtained.
But there is a problem with this method in that the null values of the test statistic for each gene does not necessarily have the theoretical null distribution that we are trying to estimate.
Suppose we were to use Gene 2 also to estimate the null distribution of Gene 1.
Suppose that Gene 2 is differentially expressed, then the null values of the test statistic for Gene 2will have a mixed distribution.
Class 1 Class 2
Gene 1 A1(1) A2
(1) A3(1) B4
(1) B5(1) B6
(1)
Gene 2 A1(2) A2
(2) A3(2) B4
(2) B5(2) B6
(2)
Gene 2 A1(2) A2
(2) A3(2) B4
(2) B5(2) B6
(2)
Perm. 1 A1(2) A2
(2) B4(2) A3
(2) B5(2) B6
(2)
...
Permute the class labels:
Example of a null case: with 7 N(0,1) points and8 N(0,1) points; histogram of the pooled two-sample t-statistic under 1000 permutations of the class labels with t13 density superimposed.
ty
Example of a null case: with 7 N(0,1) points and
8 N(10,9) points; histogram of the pooled two-sample
t-statistic under 1000 permutations of the class
labels with t13 density superimposed.
ty
The SAM Method
Use the permutation method to calculate the null distribution of the modified t-statistic (Tusher et al., 2001).
The order statistics t(1), ... , t(N) are plotted against their null expectations above.
A good test in situations where there are more genes being over-expressed than under-expressed, or vice-versa.
B
b
bjj NjwBw
1
)()(0)(0 ).,...,1()/1(
TRUE
PREDICTED
Retain Null Reject Null
Null a b (false positive)
Non-null c (false negative) d
FDR ~ R
b
R
FNDR ~ ca
c
The FDR and other error rates
TRUE
PREDICTED
Retain Null Reject Null
Null a b (false positive)
Non-null c (false negative) d
FDR ~ R
b
R
FNR = dc
c
The FDR and other error rates
Test Result
non-DE DE Total
True
non-DE
DE
Total
A = 9025 B = 475
C = 100 D = 400
9125 875
9500
500
10000
FDR = B / (B + D) = 475 / 875 = 54%
FNDR = C / (A + C) = 100 / 9125 = 1%
Toy Example with 10,000 Genes
Two-component mixture model
)()()( 1100 jjj wfwfwf 0 is the proportion of genes that are not
differentially expressed, and 01 1
Use of the P-Value as a Summary Statistic(Allison et al., 2002)
Instead of using the pooled form of the t-statistic, we can work with the value pj, which is the P-value associated with tj
in the test of the null hypothesis of no difference in expression between the two classes.
The distribution of the P-value is modelled by the h-component mixture model
h
i
iijij pfpf1
21 ),;()(
where 11 = 12 = 1.
,
Use of the P-Value as a Summary Statistic
Under the null hypothesis of no difference in expression for thejth gene, pj will have a uniform distribution on the unit interval; ie the 1,1 distribution.
The 1,2 density is given by
where
),()},(/)1({),;( )1,0(2111
2121 uIBuuuf
).(/)()(),( 212121 B
• Efron B, Tibshirani R, Storey JD, Tusher V (2001) Empirical Bayes analysis of a microarray experiment. JASA 96,1151-1160.
• Efron B (2004) Large-scale simultaenous hypothesis testing: the choice of a null hypothesis. JASA 99, 96-104.
• Efron B (2004) Selection and Estimation for Large-Scale Simultaneous Inference.
• Efron B (2005) Local False Discovery Rates.
• Efron B (2006) Correlation and Large-Scale Simultaneous Significance Testing.
• McLachlan GJ, Bean RW, Ben-Tovim Jones L, Zhu JX. Using mixture models to detect differentially expressed genes. Australian Journal of Experimental Agriculture 45, 859-866.
• McLachlan GJ, Bean RW, Ben-Tovim Jones L. A simple implentation of a normal mixture approach to differential gene expression in multiclass microarrays. Bioinformatics 26. To appear.
Two component mixture model
)()()( 1100 jjj wfwfwf
)(
)()( 00
0j
jj wf
wfw
Using Bayes’ Theorem, we calculate the posterior probability that gene j is not differentially expressed:
π0 is the proportion of genes that are not differentiallyexpressed. The two-component mixture model is:
0(wj) c0,
then this decision minimizes the (estimated) Bayes risk
If we conclude that gene j is differentially expressed if:
where
101010)1(Risk ecec where e01 is the probability of a false positive and e10 is the probability of a false negative.
Estimated FDR
where
N
jrjcj NNwIwRDFN
10),(1 )/())(ˆ()(ˆˆ
0
N
jj
N
jjcj wwIwRPF
10
10],0[0 )(ˆ/))(ˆ()(ˆˆ
0
N
jj
N
jjcj wwIwRNF
11
10),(1 )(ˆ/))(ˆ()(ˆˆ
0
Similarly, the false positive rate is given by
and the false non-discovery rate and false negativerate by:
F0: N(0,1), π0=0.9F1: N(1,1), π1=0.1
Reject H0 if z≥2
τ0(2) = 0.99972but FDR=0.17
F0: N(0,1), π0=0.6F1: N(1,1), π1=0.4
Reject H0 if z≥2
τ0(2) = 0.251but FDR=0.177
Glonek and Solomon (2003)
-4 -2 0 2 4
0.0
0.1
0.2
0.3
0.4
-4 -2 0 2 4
0.0
0.1
0.2
0.3
0.4
Gene Statistics: Two-Sample t-Statistic
2221
21
21
nsnsw
jj
jjj
yyStudent’s t-statistic
Pooled form of the Student’s t-statistic, assumed common variance in the two classes 21
21
11 nnsw
j
jjj
yy
Modified t-statistic of Tusher et al. (2001)
021
21
11 annsw
j
jjj
yy
1. Obtain the z-score for each of the genes
The Procedure
2. Rank the genes on the basis of the z-scores, starting with the largest ones (the same ordering as with the P-values, pj).
3. The posterior probability of non-differential expression of gene j, is given by 0(zj).
4. Conclude gene j to be differentially expressed if
0(zj) < c0
1jz jp1
τ0(zj) ≤ c
czfzf
zf
jj
j )()(
)(
1100
00
1
0
0
1 1
)(
)(
c
c
zf
zf
j
j
c
c
zf
zf
j
j
19
)(
)(
0
1
If π0/π1 ≤ 9, then
e.g. c = 0.2, 36)(
)(
0
1 j
j
zf
zf
Suppose π0/ π1 ≥ 0.9
then
τ0(zj) ≤ c
τ0(zj) ≤ 0.2 if
36)(
)(
0
1 j
j
zf
zf
Much stronger level of evidence against the nullthan in standard one-at-a-time testing.
e.g.
Zj ~ N(μ,1)
H0 : μ = 0 vs H1 : μ = 2.8
Rejecting H0 if |Zj| ≥ 1.96 yields a two-sidedtest of size 0.05 and power 0.80.
f1(1.96)/f0(1.96) = 4.8.
Z-scores, null case Z-scores, +1
Z-scores, +2Z-scores, +3
zj j
Use of Wilson-Hilferty Transformation asin Broet et al. (2004)
Plot of approximate z-scores obtained byWilson-Hilferty transformation of simulated values of F-statisticwith 1 and 8 degrees of freedom.
Transformed t-statistics
Fre
qu
en
cy
-1 0 1 2 3 4
05
10
15
20
25
EMMIX-FDR
A program has been written in R which interfaces with EMMIX to implement the algorithm described in McLachlan et al (2006).
We fit a mixture of two normal components
to test statistics calculated from the gene expression data.
1100 ˆˆˆˆ z
21010
211
200
2 )ˆˆ(ˆˆˆˆˆˆ zs
When we equate the sample mean and varianceof the mixture to their population counterparts, we obtain:
)ˆ1/(ˆ 01 z
)ˆ1/(}ˆ)ˆ1(ˆˆ{ˆ 021000
221 zs
When we are working with the theoretical null,we can easily estimate the mean and varianceof the non-null component with the following formulae.
)}(/{}:{#)()0(0 Nzz jj
Following the approach of Storey and Tibshirani (2003)we can obtain an initial estimate for π0 as follows:
This estimate of π0 is used as input to EMMIX aspart of the initial parameter estimates. Thus no randomor k-means starts are required, as is usually the case.
There are two different versions of EMMIX used, thestandard version for the empirical null and a modifiedversion for the theoretical null which fixes the mean andvariance of the null component to be 0 and 1 respectively.
Theoretical and empirical nulls
Efron (2004) suggested the use of two kinds of null component: the theoretical and the empirical null. In the theoretical case the null component has mean 0 and variance 1 and the empirical null has unrestricted mean and variance.
Examples
We examined the performance of EMMIX-FDR on three well-known data sets
from the literature.
• Alon colon cancer data (1999)
• Hedenfalk breast cancer data (2001)
• van’t Wout HIV data (2003)
Hedenfalk Breast Cancer Data
Hedenfalk et al. (2001) used cDNA arrays to obtain gene expression profiles of tumours from carriers of either the BRCA1 or BRCA2 mutation (hereditary breast cancers), as well as sporadic breast cancer.
We consider their data set of M = 15 patients, comprising two patient groups: BRCA1 (7) versus BRCA2 - mutation positive (8), with N = 3,226 genes.
The problem is to find genes which are differentially expressed between the BRCA1 and BRCA2 patients.
Hedenfalk et al. (2001) NEJM, 344, 539-547
Fit
to the N values of wj (based on pooled two-sample t-statistic)
),()1,0( 21110 NN
j th gene is taken to be differentially expressed if:
00 )(ˆ cw j
Two component model for the Breast Cancer Data
Histogram of z-scores for 3226 Hedenfalk genes
z
Fre
qu
en
cy
-4 -2 0 2 4
02
04
06
08
01
00
Fitting two component mixture model to Hedenfalk data
Null
Non–Null(DE genes)
Estimates of π0 for Hedenfalk data
• 0.52 (Broet, 2004)
• 0.64 (Gottardo, 2006)
• 0.61 (Ploner et al, 2006)
• 0.47 (Storey, 2002)
Using a theoretical null, we estimated π0 to be 0.65.
We used
pj = 2 F13(-|wj|)
Similar results where pj obtained by permutation methods.
Gene j Pj zj 0 ( zj )
Gene 1
.
.
.
Gene R
.
.
.
.
Gene R+R1
.
.
.
Gene N
0.002
.
.
.
0.1
0.12
0.18
.
.
0.20
c0 = 0.1
Ranking and Selecting the Genes
FDR = Sum/R = 0.06
Proportion ofFalse Negatives= 1 – Sum1/ R1
Local FDR
c0 R
0.5 906 0.26
0.4 694 0.21
0.3 518 0.16
0.2 326 0.11
0.1 143 0.06
RDF ˆ
Estimated FDR for various levels of c0
c0 Nr
0.1 143 0.06 0.32 0.88 0.004
0.2 338 0.11 0.28 0.73 0.02
0.3 539 0.16 0.25 0.60 0.04
0.4 742 0.21 0.22 0.48 0.08
0.5 971 0.27 0.18 0.37 0.12
RDF ˆ DRNF ˆ RNF ˆ RPF ˆ
Estimated FDR and other error rates for various levelsof threshold c0 applied to the posterior probability of nondifferentialexpression for the breast cancer data (Nr=number of selected genes)
Storey and Tibshirani (2003) PNAS, 100, 9440-9445
Comparison of identified DE genes
Our method (143) Hedenfalk (175)
Storey and Tibshirani (160)
101
6
12
8
39
2924
Uniquely Identified Genes: Differentially Expressed between BRCA1 and BRCA2
Gene GO TermUBE2B, DDB2
(UBE2V1)
RAB9, RHOC
ITGB5, ITGA3
PRKCBP1
HDAC3, MIF
KIF5B, spindle body pole protein
CTCL1
TNAFIP1
HARS, HSD17B7
DNA repair
(cell cycle)
small GTPase signal transduction
integrin mediated signalling pathway
regulation of transcription
negative regulation of apoptosis
cytoskeleton organisation
vesicle mediated transport
cation transport
metabolism
1. Obtain the z-score for each of the genes
The Procedure
2. Rank the genes on the basis of the z-scores, starting with the largest ones (the same ordering as with the P-values, pj).
3. The posterior probability of non-differential expression of gene j, is given by 0(zj).
4. Conclude gene j to be differentially expressed if
0(zj) < c0
1jz jp1
Breast cancer data: plot of fitted two-componentnormal mixture model with theoretical N(0,1) null and non-nullcomponents (weighted respectively by the estimated proportion of null andnon-null genes) imposed on histogram of z-scores.
z-scores
Fre
qu
en
cy
-4 -2 0 2 4
02
04
06
08
01
00
Hedenfalk breast cancer data:plot of fitted two-component normal mixture model with empirical nulland non-null components (weighted respectively by the estimated proportionof null and non-null genes) imposed on histogram of z-scores.
z-scores
Fre
qu
en
cy
-4 -2 0 2 4
02
04
06
08
01
00
Histogram of N=3,226 z-Values from the Breast Cancer Study.The theoretical N(0,1) null is much narrower than the central peak,which has (δ0,σ0)=(-.02,1.58). In this case the central peak seems toinclude the entire histogram.
Alon Colon Cancer Data
• Affymetrix arrays, samples from colon tissues from 62 patients
• Dataset of M = 62 patients: 40 tumor vs 22 normal, with N = 2,000 genes.
Alon et al. (1999) PNAS, 96, 6745-6750
Alon data theoretical null
We estimated π0 to be 0.39 using the theoretical null. With the empirical null, it is 0.53.
Six smooth muscle genes are included in the 2,000 genes and each has an estimated posterior probability of non-DE less than 0.015.
Colon cancer data: plot of fitted two-componentnormal mixture model with theoretical N(0,1) null and non-nullcomponents (weighted respectively by the estimated proportion of null andnon-null genes) imposed on histogram of z-scores.
z-scores
Fre
qu
en
cy
-6 -4 -2 0 2 4 6
01
02
03
04
05
06
0
Colon cancer data: plot of fitted two-componentnormal mixture model with empirical null and non-null components (weighted respectively by the estimated proportion of null and non-null genes) imposed on histogram of z-scores.
van’t Wout HIV Data
• Four experiments using the same RNA preparation on 4 different slides
• Expression levels of transcripts assessed in CD4-T-cell lines 24 hours after infection with HIV type 1
• Two samples – one corresponds to HIV infected cells and one to non-infected cells
• Dataset of M = 8 samples: 4 HIV-infected vs 4 non-infected, with N = 7,680 genes. Analysed by Gottardo et al (2006)
van’t Wout et al (2003), J Virology 77, 1392-1402Gottardo et al (2006), Biometrics 62, 10-18
HIV data: plot of fitted two-componentnormal mixture model with empirical null and non-nullcomponents (weighted respectively by the estimated proportion of null andnon-null genes) imposed on histogram of z-scores.
z-scores
Fre
qu
en
cy
-6 -4 -2 0 2 4 6
05
01
00
15
02
00
25
03
00
35
0
• Mixture-model based approach to finding DE genes can yield new information
• Gives a measure of the posterior probability that a gene is not DE (i.e. a local FDR rather than global)
• Provides estimates of global rates – the FDR and the false negative rate FNR (FNR=1-sensitivity)
Conclusions
Mixture Model-Based Approach: Advantages
• Provides a local FDR and a measure of the FNR, as well as the global FDR for the selected genes.
• We show that it can yield new information
Allison generated data for 10 mice over 3000 genes. The data are generated in six groups of 500 with a value ρ of 0, 0.4, or 0.8 in the off-diagonal elements of the 500 x 500 covariance matrix used to generate each group.
For a random 20% of the genes, a value d of 0, 4, or 8 is added to the gene expression levels of the last five mice.
Allison Mice Simulation
Theoretical null, ρ=0.8, d=4
Empirical null, ρ=0.8, d=4
Theoretical null, ρ=0.8, d=8
Empirical null, ρ=0.8, d=8
Suppose 0(w) is monotonic decreasing in w. Then
0)(0ˆ cwj for .0wwj
)(ˆ1
)(1ˆˆ
0
000
wF
wFRDF
00 |)( wwwE
Suppose 0(w) is monotonic decreasing in w. Then
0)(0ˆ cwj for .0wwj
)(ˆ1
)(1ˆˆ
0
000
wF
wFRDF
where )()( 000 wwF
1
10
1000ˆ
ˆˆ)(ˆ)(ˆ
w
wwF
For a desired control level , say = 0.05, define
)(ˆminarg0 wRDFww
(1)
If)(ˆ1
)(1ˆ
00
wF
wF
is monotonic in w, then using (1)
to control the FDR [with and taken to be the empirical distribution function] is equivalent to using the Benjamini-Hochberg procedure based on the P-values corresponding to the statistic wj.
1ˆ 0 )(ˆ 0wF