
Controlling the Actual Number of False Discoveries at a Given Confidence Level

Joe Maisog

BIST-530 Final Project

December 3, 2008

False Discovery Rate

• FDR (FPR) = proportion of positive tests which are actually false positives

• FDR methods control the FDR in the sense that E{FDR} ≤ q, where q ∈ [0,1] is the desired level of control

Benjamini and Hochberg, 1995
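As a concrete illustration of what "regular" FDR control involves, here is a minimal Python sketch of the Benjamini-Hochberg step-up rule. The function name and the example p-values are made up for illustration; the project itself was written in R.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans, True where the null hypothesis is rejected."""
    k = len(p_values)
    # Indices of the p-values, sorted from smallest to largest p-value.
    order = sorted(range(k), key=lambda i: p_values[i])
    # Step-up rule: find the largest rank r (1-based) with P(r) <= r * q / k.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / k:
            cutoff_rank = rank
    # Reject every hypothesis ranked at or below the cutoff.
    rejected = [False] * k
    for idx in order[:cutoff_rank]:
        rejected[idx] = True
    return rejected


# Hypothetical p-values, for illustration only.
example_p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205]
print(benjamini_hochberg(example_p, q=0.05))
```

Because the rule is step-up, a p-value that misses its own threshold can still be rejected when a larger p-value further down the ordered list meets its threshold.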

Korn’s Variants

Korn E et al., J of Statistical Planning and Inference 124(2): 379-98 (2004).

Follow-Up Paper by Lusa et al.

• Lusa L, Korn EL, McShane LM, A class comparison method with filtering-enhanced variable selection for high-dimensional data sets, Stat Med. 2008 Dec 10;27(28):5834-49.

• C code (R package)

A Problem

“Procedures targeting control of the expected number or proportion of false discoveries rather than the actual number or proportion can give a false sense of security. … Even with no correlation the results here [using “regular” FDR with simulated data] are troubling: 10% of the time the false discovery proportion will be 0.29 or more.” [emphasis mine]

Analogy: Accuracy vs. Precision

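[Figure: two target diagrams, one showing high accuracy with low precision and the other showing high precision with low accuracy; the label “FDR” appears on the figure.]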

http://en.wikipedia.org/wiki/Accuracy

Two Jokes: Controlling Expectation Without a Confidence Level

• Three statisticians went out hunting, and came across a large deer. The first statistician fired, but missed, by a meter to the left. The second statistician fired, but also missed, by a meter to the right. The third statistician didn't fire, but shouted in triumph, "On the average we got it!"

• With one foot in a bucket of ice water, and one foot in a bucket of boiling water, you are, on the average, comfortable.

http://www.workjoke.com/statisticians-jokes.html

Korn’s Solution

“[Procedures targeting control of the actual number or proportion of false discoveries] will allow statements such as ‘with 95% confidence, the number of false discoveries does not exceed 2’ or ‘with approximate 95% confidence, the proportion of false discoveries does not exceed 0.01.’” [emphasis mine]

Korn’s Variants

• Adjusted p-values

    • Actual number of false discoveries (“A”): full algorithm or computationally efficient algorithm

    • Actual proportion of false discoveries (“B”): full algorithm or computationally efficient algorithm

• Unadjusted p-values

    • Actual number of false discoveries (“A”): full algorithm or computationally efficient algorithm

    • Actual proportion of false discoveries (“B”): full algorithm or computationally efficient algorithm

Two Goals

1. Confirm Korn’s warning that when using “regular” FDR, the realized false positive rate exceeds the nominal level q in a fairly large fraction of data sets.

2. Implement in R Korn’s method to control the actual number of false positives at a given confidence level, using the computationally efficient version.

Definition

• k variables (e.g., genes)

• P(1) < P(2) < . . . < P(k) are the ordered p-values from the univariate tests

• H(1), H(2), . . . , H(k) are the corresponding null hypotheses

• T = { t1, t2, . . . , tj } is any subset of K = { 1, 2, . . . , k }

• Pr00 is the multivariate permutation distribution of p-values

Definition

Procedure To Control the Actual Number of False Discoveries
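The general idea of using a permutation distribution to bound the actual number of false discoveries can be sketched in Python as follows. This is not Korn's procedure: the published method is a step-down procedure that works on permutation p-values, whereas this sketch is a simpler single-step analogue. It also uses the absolute difference in group means as the test statistic, which is an assumption made to keep the example dependency-free. All function names, parameters, and the toy data are hypothetical.

```python
import math
import random


def abs_mean_diffs(data, labels):
    """Absolute difference in group means for each gene (one row per gene)."""
    n1 = sum(labels)
    n0 = len(labels) - n1
    out = []
    for row in data:
        s1 = sum(x for x, g in zip(row, labels) if g == 1)
        s0 = sum(x for x, g in zip(row, labels) if g == 0)
        out.append(abs(s1 / n1 - s0 / n0))
    return out


def single_step_control(data, labels, u, alpha=0.05, n_perm=200, seed=0):
    """Reject genes whose statistic exceeds a permutation critical value.

    The critical value is an upper quantile of the (u+1)-th largest
    statistic under random relabeling, so that more than u exceedances
    occur in roughly at most an alpha fraction of the permutations.
    """
    rng = random.Random(seed)
    observed = abs_mean_diffs(data, labels)
    permuted = list(labels)
    null_values = []
    for _ in range(n_perm):
        rng.shuffle(permuted)  # random reassignment of group labels
        stats = sorted(abs_mean_diffs(data, permuted), reverse=True)
        null_values.append(stats[u])  # (u+1)-th largest statistic
    null_values.sort()
    # Index of the empirical (1 - alpha) quantile of the permutation values.
    idx = min(n_perm - 1, math.ceil((1.0 - alpha) * n_perm) - 1)
    critical = null_values[idx]
    rejected = [s > critical for s in observed]
    return rejected, critical


# Toy data (hypothetical): 40 genes, 10 subjects per group, 5 shifted genes.
data_rng = random.Random(1)
labels = [0] * 10 + [1] * 10
data = []
for gene in range(40):
    shift = 2.0 if gene < 5 else 0.0
    data.append([data_rng.gauss(shift * g, 1.0) for g in labels])

rejected_u0, crit_u0 = single_step_control(data, labels, u=0)
rejected_u3, crit_u3 = single_step_control(data, labels, u=3)
print(sum(rejected_u0), "rejections allowing 0 false discoveries")
print(sum(rejected_u3), "rejections allowing up to 3 false discoveries")
```

Allowing more false discoveries (larger u) can only lower the critical value when the same permutations are used, so the rejection set never shrinks as u grows.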

1000 Simulations in R

• 50 controls, 50 treatments, 1000 genes

• Noise ~ N(0,1), no cross-gene correlations

• 100 genes “activated” in treatments with increase = 0.3969 (p = 0.05)

• “Regular” FDR method to control E{FDR} at q = 0.05

• Korn’s method to control the number of actual FP’s at u = 50, with 95% confidence
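The "regular" FDR part of this setup can be sketched in Python as follows. This is not the project's R code. As a shortcut, it draws one z statistic per gene directly instead of generating the full 100 x 1000 data matrix and running t-tests, and it runs 200 rather than 1000 simulations; both are assumptions made to keep the example fast and dependency-free. The seed and all names are arbitrary. With unit-variance noise and 50 subjects per group, the standard error of the mean difference is 0.2, so the shift of 0.3969 corresponds to an expected statistic of about 1.9845. That is close to the two-sided 5% critical value of a t distribution with 98 degrees of freedom, which may be what “(p = 0.05)” on the slide refers to.

```python
import math
import random

N_PER_GROUP = 50   # subjects per group, from the slide
N_GENES = 1000     # total number of genes
N_ACTIVE = 100     # genes with a true shift
SHIFT = 0.3969     # mean increase in activated genes
Q = 0.05           # nominal FDR level

# Standard error of a difference in means with unit-variance noise.
SE = math.sqrt(1.0 / N_PER_GROUP + 1.0 / N_PER_GROUP)
EXPECTED_Z = SHIFT / SE  # expected statistic for an activated gene


def two_sided_p(z):
    """Two-sided normal p-value for a z statistic."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def bh_rejections(p_values, q):
    """Indices rejected by the Benjamini-Hochberg step-up rule."""
    k = len(p_values)
    order = sorted(range(k), key=lambda i: p_values[i])
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / k:
            cutoff = rank
    return order[:cutoff]


def one_simulation(rng):
    """Realized false discovery proportion for one simulated data set."""
    p_values = []
    for gene in range(N_GENES):
        mean_z = EXPECTED_Z if gene < N_ACTIVE else 0.0
        p_values.append(two_sided_p(rng.gauss(mean_z, 1.0)))
    rejected = bh_rejections(p_values, Q)
    false_pos = sum(1 for idx in rejected if idx >= N_ACTIVE)
    # Convention: the proportion is 0 when nothing is rejected.
    return false_pos / len(rejected) if rejected else 0.0


rng = random.Random(530)  # arbitrary seed
fdps = [one_simulation(rng) for _ in range(200)]
mean_fdp = sum(fdps) / len(fdps)
frac_exceed = sum(f > Q for f in fdps) / len(fdps)
print(f"mean FDP = {mean_fdp:.4f}, fraction with FDP > q = {frac_exceed:.3f}")
```

The two printed summaries correspond to the two quantities of interest: the average realized proportion, which FDR methods control, and how often an individual data set exceeds q, which they do not.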

Simulated Data Matrix

[Figure: simulated data matrix with Ntot = 100 subjects (N1 = 50, N2 = 50) and k = 1000 genes (G1 = 100, G2 = 900), with one p-value computed per gene.]

Results: “Regular” FDR

• Mean FPR = 0.0394 (so, controlled at q = 0.05)

• But 17.5% of the time, FPR > 0.05

Results: Korn’s Method

• 98.9% of the time, the actual number of false positives was ≤ 50

• Controlled at u = 50 with 95% confidence
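The difference between controlling an expectation and controlling the actual number at a confidence level comes down to which summary of the per-simulation counts is checked. A minimal Python sketch, using made-up counts rather than the project's results:

```python
def fraction_within(counts, u):
    """Fraction of simulations with at most u false positives."""
    return sum(c <= u for c in counts) / len(counts)


def statement_holds(counts, u, confidence=0.95):
    """Check 'with the given confidence, false positives do not exceed u'."""
    return fraction_within(counts, u) >= confidence


# Hypothetical false-positive counts from ten simulations.
hypothetical_counts = [0, 1, 2, 2, 3, 0, 1, 5, 2, 1]

mean_count = sum(hypothetical_counts) / len(hypothetical_counts)
print("mean count:", mean_count)
print("fraction with count <= 2:", fraction_within(hypothetical_counts, 2))
print("95% statement for u = 2 holds:", statement_holds(hypothetical_counts, 2))
```

In this toy example the mean count is below 2, yet the count exceeds 2 in two of the ten simulations, so a 95% confidence statement for u = 2 would not hold.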

Conclusions

• 17.5% of the time, FPR > q = 0.05 with “regular” FDR

• Korn’s method controlled actual number of false positives at u = 50 with 95% confidence (actually slightly conservative)

• Disadvantage: computationally intensive

• Examining someone else’s computer program can be difficult but very rewarding!

Future Directions

• Try different parameters (e.g., signal size; number of subjects, variables, or permutations), or with correlated variables

• Try the method on real data

• Try Korn’s “Procedure B”, which controls the actual FDR at a given confidence level

• Try Lusa’s R package for feature selection

References

• Benjamini Y and Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc. Ser. B 57: 289-300 (1995).

• Korn EL, Troendle JF, McShane LM and Simon R. Controlling the number of false discoveries: application to high-dimensional genomic data. Journal of Statistical Planning and Inference 124(2): 379-398 (2004).

• Lusa L, Korn EL, McShane LM, A class comparison method with filtering-enhanced variable selection for high-dimensional data sets, Stat Med. 2008 Dec 10;27(28):5834-49. R package available at: http://linus.nci.nih.gov/Data/LusaL/bioinfo/

• Westfall PH, Tobias RD, Rom D, Wolfinger RD, Hochberg Y. Multiple Comparisons and Multiple Tests. Cary, NC: SAS Institute, Inc., 1999.

• A copy of the R code developed for this project can be found here: http://bist.pbwiki.com/f/bist530FinalProject.r


http://bist.pbwiki.com/f/korn_fdr_v12.r
