statistical significance for genomewide studies

Statistical significance for genomewide studies

John D. Storey and Robert Tibshirani

Saurabh PaliwalTopics in Bioinformatics class presentation

11/14/06

Outline of presentation

(Multiple) Hypothesis testing Hypothesis testing: terminology Type I and II errors Problem of multiple hypothesis testing Choice of threshold

Methodology of paper : Motivating biological examples False discovery rates, q values Results

Discussion

Hypothesis testingUsual steps in statistical hypothesis testing:

1. Formulate the Null Hypothesis (H0) and the Alternate hypothesis (H1): Null hypothesis: Statistical hypothesis that is tested for possible rejection

under the assumption that it is true Alternate hypothesis: The observations are the result of a real effect.

(Hypothesis that the data is distributed different from the distribution of the null hypothesis)

2. Identify a test statistic that can be used to assess the truth of the null hypothesis

3. Compute the p-value: Probability that a test statistic at least as extreme as the one observed would be obtained assuming that the null hypothesis is true

4. Compare the p-value to an acceptable significance level . If

then the observed effect is called statistically significant, the null hypothesis is rejected in favor of the alternative hypothesisp

Type I and II errors, other terms Type I error / α error / false positive : the error of rejecting a null

hypothesis i.e. the test statistic is sufficiently extreme, when it is in fact true

Type II error / β error / false negative : the error of not rejecting the null hypothesis when the alternative hypothesis is true

Power of a test / Sensitivity = 1-β i.e. the probability of not committing a type II error

Specificity = 1- α i.e. the probability of not making a type I error Critical region/ rejection region: A set of values (greater than

threshold) of the test statistic for which the null hypothesis is rejected

Some common pitfalls

If the test statistic is outside the critical region (less than threshold), the only conclusion is that ‘there is not enough evidence to reject the null hypothesis’. This is NOT the same as evidence in favor of the null hypothesis. In other words, failing to find evidence that there is a difference does not constitute evidence that there is no difference.

The p-value is NOT the probability that a finding is “merely a fluke”

The p-value is NOT the probability that the null hypothesis is true, or (1-(p-value)) is NOT the probability of the alternative hypothesis being true

The significance level of the test is NOT determined by the p-value. It is decided upon before any of the data are collected

A statistically significant result is not always of practical significance / does not necessarily demonstrate a large effect in the population. Given a sufficiently large sample, extremely small and non-notable differences can be found to be statistically significant

some more terms…

One-sided / one-tailed test: Values for which we can reject the null hypothesis are located entirely in one-tail of the probability distribution

Two-sided / two-tailed test: Values … are located in both tails of the probability distribution

Multiple hypothesis testing m hypothesis tests

H1 = 0 vs H1 = 1 H2 = 0 vs H2 = 1 …. Hm = 0 vs Hm = 1

Want to make simultaneous inference Each test has possible Type I and Type II errors and there are

many possible ways to combine them

The multiple testing problem Assume all the test statistics are null. As the number of independent

applications of the hypothesis testing criterion grows, it begins to outweigh the high unlikelihood associated with each individual test. It becomes increasingly likely that that one will observe data that satisfies the rejection criterion by chance alone (even if the null hypothesis is true in all cases). Thus, there is an increase in the number of ‘false positives’.

Example of flipping coins:A coin is declared biased if in 10 flips, it lands on heads at least 9 times. If the null hypothesis is that the coin is fair, the likelihood that a fair coin would

come up heads at least 9 times out of 10 is 11 / 210 = 0.0107 : pretty unlikely! Multiple comparison: want to test the fairness of many coins, say 100, by this

method (flip each 10 times). The likelihood that all 100 fair coins are identified as fair is (1 – 0.0107)100 = 0.34 : pretty likely (0.66) that some will be identified as biased! (note that the probability of seeing a pre-selected coin do this is still 0.0107)

In n independent comparisons are performed, the probability that one or more false positives will be detected is given byIt increases as the number of comparisons increase (in fact, it will go to 1 eventually).

All quantities except m and S are unobservable m0 = truly null, m1 = truly alternative Regardless of whether the p-value threshold is fixed or data-dependent,

the quantities F, T and S are random variables. Hence, it is common statistical practice to write the overall error in terms of an expected value, E[·]

Specificity = m0 - F / m0 Sensitivity = T / m1

How does one choose a threshold?

Control the Per-Comparison Type I error (PCER) i.e. “uncorrected testing”, too many false positives Gives P(Fi = 1) ≤ α marginally for all 1 ≤ i ≤ m

Control the Familywise Type I error (FWER) E.g. Bonferroni correction: can guarantee that P(F ≥ 1) ≤ α by

setting individual test p-values ≤ α/m. Follows from Boole’s inequality : May be appropriate if very few features are expected to be truly

alternative (e.g. linkage analysis) However, typically it is much too conservative for a number of

applications e.g. genomewide studies involving differentially expressed genes, fMRI studies etc

Control the False Discovery Rate (FDR) Guarantees that the FDR = E [F / F+T] = E [F/S] ≤ α Sensible balance between the number of false positive features, F

and the number of true positive features, T More later…

Biomedical significance of the multiple testing problem

Example 1 : Detecting differentially expressed genes : Hedenfalk et.al., N. Engl.J.Med. 2001 Detect differential expression of genes (features in this case) between

BRCA1 and BRCA2 mutation-positive tumors Computed a modified F statistic, which was used to assign a p value to

each gene p-value of 0.001 was selected to find 51 differentially expressed (DE)

genes out of 3226 (~ 3 false positives expected) More conservative threshold of 0.0001 yielded 9-11 DE genes

Example 2 :Identifying Exonic Splicing Enhancers : Fairbrother et.al., Science, 2002 Exonic splice enhancers are short oligonucleotide sequences that

enhance pre-mRNA splicing when present in exons They analyzed human genomic DNA to predict exonic splice enhancers

based on the statistical analysis of exon-intron and splice site composition

Used a p-value associated with 4096 possible hexamer sequences. Cutoff of 10-4 results in an expected value of < 1 false positive.

238 significant hexamers were subsequently biologically verified

Example 3: Genetic dissection of transcriptional regulation, Brem et.al., Science, 2002 Statistically significant linkage between a gene’s expression level and a

marker indicates that a regulator for that gene is located in the region of the marker

Tested each of 6215 genes for linkage to at least one locus, resulted in 6215 p values

p-value cutoff of 8.5x 10-3 was used, and 507 genes showed linkage to at least one locus, where 53 are expected by chance

p-value cutoff of 1.6x 10-4 : 205 genes (where 1 is expected by chance) Example 4: Finding binding sites of transcriptional regulators: Lee et.al.,

Science, 2002 Transcriptional regulatory proteins bind to specific promoter sequences

to participate in the regulation of gene expression Binding of 106 transcriptional factors was analyzed all over the genome.

At each genomic location, a p value was calculated under the null hypothesis that no binding occirs, resulting in thousands of p values

At a p-value of 0.001, they estimate 3985 interactions, ~6-10% are false positives

fMRI applications

Compare two sets of conditions and using statistical methods, analyze the difference in ‘brain activity’ in particular parts of the brain

100,000 voxels in fMRI Low signal to noise ratio

in images High number of features

will lead to enhanced number of false positives, and a huge difficulty in recognizing the actual area of interest

Bonferroni correction is far too conservative

p values and q values

The p value is a measure of significance in terms of the false positive rate i.e. the rate that truly null features are called significant

The q value is a measure in terms of the false discovery rate i.e. the rate that significant features are truly null

Note the difference between FPR and FDR Hence, a p-value cutoff says little about the

content of the features actually called significant

(Positive) False Discovery Rate (p)FDR

False discovery rate: There is the possibility that S = 0, in which case F/S is undefined.

Hence, define the positive False discovery rate (3 possible formulations discussed in Benjamini and Hochberg, 1995 : R is the same as S, and V is the same as F):

For the purposes of the paper, first concentrate on the assumption of S > 0 (S = 0 case will be discussed at the end)

In terms of specificity and sensitivity , one can write the FDR as:

Commonly referred to as FDRCommonly referred to as pFDR

| 0 Pr( 0),

| 0 ,

FE S S

S

FE S

S

E F

E S

The FDR is a measure of the overall accuracy of the set of features declared to be significant

The q value is a measure that reflects the significance that can be attached to each individual feature.

The q value of a certain feature can be described as the expected proportion of false positives among all features as or more extreme than the observed one. (Similar to the p-value definition as the probability of a null feature being as or more extreme than the observed one)

Methodology1. The authors first calculated a

p-value for each of the 3170 genes of Example 1 (Using a two sample t statistic*)

2. Plotted a density histogram of the 3170 p values

3. Order the p-values in increasing order of magnitude. For some threshold t, where 0<t<1, all the features with p values less than t are called significant.

Let these m p values be

4. Estimate FDR(t) as Since m is very large, this can be approximated as (proved later on)

5. Simple estimate of E[S(t)] is the observed S(t) i.e. number of observed p values ≤ t

6. The probability a null p value is ≤ t is simply t. Hence, E[F(t)] = m0 . t

However, since m0 is unknown, it has to be estimated.

7. Define the ratio of features that are truly null to total features = m0 / m = π0

Need to specify the distribution of the truly alternative p values to estimate π0

However, since the null p values are uniformly distributed, we can get an estimate

Aside: Note that if all the genes were null (not differentially expressed, the density histogram would look like this

8. Estimate π0 in terms of a tuning parameter λ

The rationale behind this estimate:

p values of truly alternative features will be close to 0, while p values of null features will be uniformly distributed among [0,1]. ‘Most’ of the p values near 1 will be null.

An unbiased estimate of π0 would be

(assuming that we could count only null values). However, presence of a few alternative p values only makes the estimate conservative

There must be mostly null p values in this region of the graph (p> λ), where λ = 0.5

Conservative estimate of overall proportion of p values

Bias–Variance tradeoff:

As λ is closer to 1, the bias is lower (because lesser and lesser number of alternative p values will be found there), but the variance of the estimate increases (because lesser number of points are being used to estimate π0)

…and vice-versa

Notice the high variance in the estimate here. It makes it necessary to estimate π0 using a cubic spline

Thus, the mathematical definition of the q value is the minimum FDR that can be attained when calling that feature significant

The above method ensures that the estimated q values are increasing in the same order as the p-values

Suppose that each feature’s statistic probabilistically follows a random mixture of a null distribution and an alternative distribution. Then the pFDR can be written as Pr (feature i is truly null | feature i is significant) ~ q value

False positive rate is Pr(feature i is significant | feature i is truly null) ~ p value The p value can be thought of as the minimum possible false positive rate when calling the feature significant

9. Using the above estimate for π0 to estimate FDR(t) as

10. Estimate the q value of feature i as

The algorithm, in brief

Results The initial estimate (based on a

tuning parameter of λ=0.5) is that 33% of the examined genes were differentially expressed in the study of example 1 (Hedenfalk et.al.)

Thresholding genes with q values ≤ 0.05 yields 160 genes significant for differential expression (~8 genes are expected to be false positives). 117 of these 160 were found to be overexpressed in BRCA1-mutation-positive tumors.

Since all q values can be considered simultaneously, we can use several plots to help us calibrate the q-value cutoff that should be applied in a study based on curves of the form shown in (b), (c) and (d) on the right

further interpretation of results

Assume a gene, say MSH2, whose p value is 5.05 x 10-5 and a q value of 0.013. The former implies that the probability that a null (nondifferentially expressed) gene would be as or more extreme than MSH2 is 5.05 x 10-5. The latter on the other hand suggests that ~0.013 of the genes that are as or more extreme than MSH2 are false positives.

Note: q value is not the probability that the feature (say MSH2) is a false positive.

Intuitively, the probability that MSH2 is a false positive is higher than that of the other genes which are more significant than MSH2, thus it is like a “local FDR”.

The q value takes into account multiple features simultaneously (every feature as or more extreme will also be significant), which is important when assigning multiple measures of statistical significance.

Analysis of Hedenfalk et.al. data

Data consisted of 3226 genes on n1 = 7 BRCA1 arrays and n2 = 8 BRCA2 arrays. Disregarded some genes that had measurements that were several interquartile ranges away from the interquartile range of all of the data.

The expression value from array j and gene i is denoted by xij. Then, the sample mean and variance for gene i are given by

The two-sample t statistic for gene i allowing for different variances of the gene in the two tumors is given by

The null versions of t1…t3170 are calculated by a permutation method: the labels on the arrays are randomly scrambled and the t statistics are recomputed for B = 100 permutations. The p value for gene i was calculated as :

22

2 222 2

2 2

( )

( 1)

iij ijj BRCA j BRCA

i i

x x x

x and sn n

Theorem proof

Critical discussion

Used t-statistic for calculation of p values instead of the f-statistic used in the Hedenfalk paper. Could have additionally used the same values as in that paper to make the comparison easier in terms of number of genes detected as significant etc.

Do not show whether the increased number of genes found are significant for differentiating between BRCA1 positive or negative samples, or BRCA2 positive or negative samples

The assumption of null p values being uniform is critical to the algorithm (as mentioned by the authors too). However, it would be interesting to see how they would handle it if the null p value distribution was different

The assumption is that as m →∞, their procedure controls the FDR asymptotically (i.e. if all features with q ≤ α are taken, then FDR ≤ α for large m). Another recent paper Benjamini, Krieger, Yekutieli (BKY) 2004 suggested that many of the cases of practical interest may not have such a high m value, so this may not be as relevant for those cases

The effect of dependency of the various features has not been investigated enough, it could potentially be very important. BKY 2004 find that FDR can be almost double the bound.

Parting thought….

"... surely, God loves the .06 nearly as much as the .05."

(Rosnell and Rosenthal 1989)

statistical significance for genomewide studies

Documents