# α-Investing: A Procedure for Sequential Control of Expected False Discoveries


α-Investing: A Procedure for Sequential Control of Expected False Discoveries
Author(s): Dean P. Foster and Robert A. Stine
Source: Journal of the Royal Statistical Society, Series B (Statistical Methodology), Vol. 70, No. 2 (2008), pp. 429-444
Published by: Wiley for the Royal Statistical Society
Stable URL: http://www.jstor.org/stable/20203833


J. R. Statist. Soc. B (2008) 70, Part 2, pp. 429-444

α-investing: a procedure for sequential control of expected false discoveries

Dean P. Foster and Robert A. Stine

University of Pennsylvania, Philadelphia, USA

[Received April 2005. Final revision September 2007]

Summary. α-investing is an adaptive sequential methodology that encompasses a large family of procedures for testing multiple hypotheses. All control mFDR, which is the ratio of the expected number of false rejections to the expected number of rejections. mFDR is a weaker criterion than the false discovery rate, which is the expected value of the ratio. We compensate for this weakness by showing that α-investing controls mFDR at every rejected hypothesis. α-investing resembles the α-spending that is used in sequential trials, but it has a key difference. When a test rejects a null hypothesis, α-investing earns additional probability towards subsequent tests. α-investing hence allows us to incorporate domain knowledge into the testing procedure and to improve the power of the tests. In this way, α-investing enables the statistician to design a testing procedure for a specific problem while guaranteeing control of mFDR.

Keywords: α-spending; Bonferroni method; False discovery rate; Familywise error rate; Multiple comparisons

1. Introduction

We propose an adaptive sequential methodology for testing multiple hypotheses. Our approach, called α-investing, works in the usual setting in which one has a batch of several hypotheses as well as in cases in which hypotheses arrive sequentially in a stream. Streams of hypotheses arise naturally in contemporary modelling applications such as genomics and variable selection for large models. In contrast with the comparatively small problems that spawned multiple-comparison procedures, modern applications can involve thousands of tests. For example, microarrays lead to comparisons of a control group with a treatment group by using measured differences on over 6000 genes (Dudoit et al., 2003). If we consider the possibility for interactions, then the number of tests is virtually infinite. In contrast, the example that was used by Tukey to motivate multiple comparisons compares the means of only six groups (Tukey (1953), available in Braun (1994)). Because α-investing tests hypotheses sequentially, the choice of future hypotheses can depend on the results of previous tests. Thus, having discovered differences in certain genes, an investigator could, for example, direct attention towards genes that share common transcription factor binding sites (Gupta and Ibrahim, 2005). Genovese et al. (2006) described other testing situations that offer domain knowledge.

Before we describe α-investing, we introduce a variation on the marginal false discovery rate mFDR, which is an existing criterion for multiple testing. Let the observable random variable R denote the total number of hypotheses that are rejected by a testing procedure, and let V denote

Address for correspondence: Robert A. Stine, Department of Statistics, Wharton School of the University of Pennsylvania, Philadelphia, PA 19104-6340, USA. E-mail: stine@wharton.upenn.edu

© 2008 Royal Statistical Society 1369-7412/08/70429


the unobserved number of falsely rejected hypotheses. A testing procedure controls mFDR at level α if

mFDR_1 = E(V) / {E(R) + 1} ≤ α.  (1)


We follow the standard notation for labelling correct and incorrect rejections (Benjamini and Hochberg, 1995). Assume that m_0 of the null hypotheses in H(m) are true. The observable statistic R(m) counts how many of these m hypotheses are rejected. A superscript θ distinguishes unobservable random variables from statistics such as R(m). The random variable V^θ(m) denotes the number of false positive results among the m tests, counting cases in which the testing procedure incorrectly rejects a true null hypothesis. S^θ(m) = R(m) − V^θ(m) counts the number of correctly rejected null hypotheses. Under the complete null hypothesis, m_0 = m, V^θ(m) = R(m) and S^θ(m) = 0.

The original intent of multiple testing was to control the chance for any false rejection. The familywise error rate FWER is the probability of falsely rejecting any null hypothesis from H(m):

FWER(m) = sup_θ [P_θ{V^θ(m) ≥ 1}].  (2)

An important special case is control of FWER under the complete null hypothesis: P_0{V^θ(m) ≥ 1} ≤ α, where P_0 denotes the probability measure under the complete null hypothesis. We refer to controlling FWER under the complete null hypothesis as controlling FWER in the weak sense.

Bonferroni procedures control FWER. Let p_1, …, p_m denote the p-values of tests of H_1, …, H_m. Given a chosen level 0 < α < 1, the simplest Bonferroni procedure rejects those H_j for which p_j ≤ α/m. Let the indicators V_j ∈ {0, 1} track incorrect rejections; V_j is 1 if H_j is incorrectly rejected and is 0 otherwise. Then V^θ(m) = Σ_j V_j, and the inequality

P_0{V^θ(m) ≥ 1} ≤ Σ_{j=1}^m P_0(V_j = 1) ≤ α  (3)

shows that this procedure controls FWER(m) ≤ α. We need not distribute α equally over H(m); inequality (3) requires only that the sum of the α-levels does not exceed α. This observation suggests an α-spending characterization of the Bonferroni procedure. As an α-spending rule, the Bonferroni procedure allocates α over a collection of hypotheses, devoting a larger share to hypotheses of greater interest. In effect, the procedure has a budget of α to spend. It can spend α_j ≥ 0 on testing each hypothesis H_j so long as Σ_j α_j ≤ α. Although such α-spending rules control FWER, they are often criticized for having little power. Clearly, the power of the traditional Bonferroni procedure decreases as m increases because the threshold α/m for detecting a significant effect decreases. The testing procedure that was introduced in Holm (1979) offers more power while controlling FWER, but the improvements are small.
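The α-spending reading of inequality (3) is easy to state in code. The following Python sketch (ours, not the authors'; the function name is invented for illustration) rejects H_j whenever p_j ≤ α_j and checks that the total spending respects the budget:

```python
def bonferroni_spend(p_values, alphas, alpha=0.05):
    """alpha-spending form of the Bonferroni procedure: reject H_j when p_j <= alpha_j.

    By the union bound, P(any false rejection) <= sum(alphas) <= alpha,
    so FWER is controlled however the budget is split across hypotheses."""
    assert sum(alphas) <= alpha + 1e-12, "total spending must not exceed alpha"
    return [p <= a for p, a in zip(p_values, alphas)]

# Equal split (the classical Bonferroni rule) over m = 5 hypotheses:
m, alpha = 5, 0.05
rejections = bonferroni_spend([0.001, 0.2, 0.008, 0.5, 0.04], [alpha / m] * m)
```

An unequal split, say spending half of α on the single most interesting hypothesis, passes the same check and carries the same guarantee.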

To obtain substantially more power, Benjamini and Hochberg (1995) introduced a different criterion, the false discovery rate FDR. FDR is the expected proportion of false positive results among rejected hypotheses,

FDR(m) = E_θ{V^θ(m)/R(m) | R(m) > 0} P{R(m) > 0}.  (4)

For the complete null hypothesis, FDR(m) = P_0{R(m) > 0}, which is FWER(m). Thus, test procedures that control FDR(m) ≤ α also control FWER(m) in the weak sense at level α. If the complete null hypothesis is rejected, FDR introduces a different type of control. Under the alternative, FDR(m) decreases as the number of false null hypotheses m − m_0 increases (Dudoit et al., 2003). As a result, FDR(m) becomes easier to control in the presence of non-zero effects, allowing more powerful procedures. Variations on FDR include pFDR (which drops the

term P(R > 0); see Storey (2002, 2003)) and the local false discovery rate fdr(z) (which estimates the false discovery rate as a function of the size of the test statistic; see Efron (2007)). Closer to our work, Meinshausen and Bühlmann (2004) and Meinshausen and Rice (2006) estimated m_0, the total number of true null hypotheses in H(m), and Genovese et al. (2006) weighted p-values on the basis of prior knowledge that identifies hypotheses that are likely to be false. Benjamini and Hochberg (1995) also considered mFDR_0 and mFDR_1, which they regarded as artificial: mFDR does not control a property of the realized sequence of tests; instead it controls a ratio of expectations.

Benjamini and Hochberg (1995) also introduced a step-down testing procedure that controls FDR. Order the collection of m hypotheses so that the p-values of the associated tests are sorted from smallest to largest (putting the most significant first),

p_(1) ≤ p_(2) ≤ … ≤ p_(m).

Let W(k) ≥ 0 denote the α-wealth that is accumulated by an investing rule after k tests; W(0) is the initial α-wealth. Conventionally, W(0) = 0.05 or W(0) = 0.10. At step j, an α-investing rule sets the level α_j for testing H_j from 0 up to the maximum α-level that it can afford. The rule


must ensure that its wealth never goes negative. Let R_j ∈ {0, 1} denote the outcome of testing hypothesis H_j:

R_j = 1 if H_j is rejected (p_j ≤ α_j), and R_j = 0 otherwise.  (6)

Using this notation, the investing rule I is a function of W(0) and the prior outcomes,

α_j = I_{W(0)}({R_1, R_2, …, R_{j−1}}).  (7)

The outcome of testing H_1, H_2, …, H_j determines the α-wealth W(j) that is available for testing H_{j+1}. If p_j ≤ α_j, the test rejects H_j and the investing rule earns a contribution to its α-wealth, which is called the pay-out and is denoted by ω_j ≤ 1. We typically set α = ω = W(0). If p_j > α_j, its α-wealth decreases by α_j/(1 − α_j), which is slightly more than the cost that is extracted in α-spending. The change in the α-wealth is thus

W(j) − W(j − 1) = ω_j if p_j ≤ α_j, and W(j) − W(j − 1) = −α_j/(1 − α_j) if p_j > α_j.  (8)

If the p-value is uniformly distributed on [0, 1], then the expected change in the α-wealth is −(1 − ω_j)α_j < 0. This shows that α-wealth decreases, on average, when testing a true null hypothesis. Other payment systems are possible; see the discussion in Section 7.

The notion of compensation for rejecting a hypothesis allows us to build context-dependent information into the testing procedure. Suppose that substantive insights suggest that the first few hypotheses are likely to be rejected and that subsequent false hypotheses come in clusters. In this instance, we might consider an α-investing rule that invests heavily at the start and after each rejection, as illustrated by the following rule. Assume that the most recently rejected hypothesis is H_{k*}. (Set k* = 0 when testing H_1.) If false hypotheses are clustered, an α-investing rule should invest heavily in the test of H_{k*+1}. One rule that does this is, for j > k*,

I_{W(0)}({R_1, R_2, …, R_{j−1}}) = W(j − 1)/(1 + j − k*)².  (9)

This rule invests 1/4 of its current wealth in testing H_1 or H_{k*+1}. The α-level falls off quadratically if subsequent hypotheses are tested and not rejected. If the substantive insight is correct and the false hypotheses are clustered, then tests of H_1 or H_{k*+1} represent 'good investments'. An example in Section 6 illustrates these ideas.
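A minimal sketch of α-investing that combines the wealth update (8) with the clustered-alternatives rule (9) might look as follows in Python (our illustration, not the authors' code; `alpha_invest` and its default arguments are invented):

```python
def alpha_invest(p_values, W0=0.05, omega=0.05):
    """Sequential alpha-investing: alpha_j = W(j-1)/(1 + j - k*)^2 as in rule (9),
    with the wealth update of expression (8) and pay-out omega."""
    W = W0            # current alpha-wealth W(j-1)
    k_star = 0        # index of the most recent rejection (0 before any rejection)
    rejections = []
    for j, p in enumerate(p_values, start=1):
        alpha_j = W / (1 + j - k_star) ** 2   # invests W/4 right after a rejection
        if p <= alpha_j:
            W += omega                        # earn the pay-out for a rejection
            k_star = j
            rejections.append(True)
        else:
            W -= alpha_j / (1 - alpha_j)      # pay slightly more than alpha_j
            rejections.append(False)
    return rejections
```

Because α_j is at most a quarter of the current wealth, the cost α_j/(1 − α_j) stays below W at realistic wealth levels, so the wealth remains non-negative.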

Although it is relatively straightforward to devise investing rules, it may be difficult a priori to order the hypotheses in such a way that those which are most likely to be rejected come first. Such an ordering relies on the specific situation. Another complication is the construction of tests for which one can obtain the p-values needed. To show that a testing procedure controls mFDR, we require that, conditionally on the prior j − 1 outcomes, the level of the test of H_j must not exceed α_j:

E_θ(V_j | R_{j−1}, R_{j−2}, …, R_1) ≤ α_j.  (10)


4. mFDR

The following definition generalizes the definition of mFDR that is given in condition (1).

Definition 1. Consider a procedure that tests hypotheses H_1, H_2, …, H_m. Then we define

mFDR_η(m) = sup_{θ∈Θ} [E_θ{V^θ(m)} / (E_θ{R(m)} + η)].  (11)

A multiple-testing procedure controls mFDR_η(m) at level α if mFDR_η(m) ≤ α. We typically set η = 1 − α. Values of η that are near 0 produce a less satisfactory criterion. Under the complete null hypothesis, no procedure can reduce mFDR_0 below 1 since V^θ(m) = R(m). Control of mFDR_η provides control of FWER in the weak sense. Under the complete null hypothesis, mFDR_η(m) ≤ α implies that

E_θ{V^θ(m)} ≤ αη/(1 − α).

If η = 1 − α, then E_θ{V^θ(m)} ≤ α. Hence, control of mFDR_{1−α} implies weak control of FWER at level α.

The following simulation illustrates the similarity of mFDR_η and FDR. The tested hypotheses H_j : μ_j = 0 specify means of m = 200 populations. We set μ_j by sampling a spike-and-slab mixture. The mixture puts 100(1 − π_1)% of its probability in a spike at zero; π_1 = 0 identifies the complete null hypothesis. The slab of this mixture is a normal distribution, so

μ_j = 0 with probability 1 − π_1, and μ_j ~ N(0, σ²) with probability π_1.  (12)

In the simulation, π_1 ranges from 0 (the complete null hypothesis) to 1 (in which case V^θ(m) = 0). We set σ² = 2 log(m) so that the standard deviation of the non-zero μ_j matches the bound that is commonly used in hard thresholding. The test statistics are independent, normally distributed random variables Z_j ~ N(μ_j, 1) for which the two-sided p-values are p_j = 2Φ(−|Z_j|).
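One replication of this design can be generated with only the Python standard library (a sketch under our own naming, not the authors' code; Φ is computed from `math.erf`):

```python
import math
import random

def simulate_pvalues(m=200, pi1=0.1, seed=0):
    """One replication of the spike-and-slab design (12):
    mu_j = 0 with probability 1 - pi1, else mu_j ~ N(0, sigma^2), sigma^2 = 2 log m."""
    rng = random.Random(seed)
    sigma = math.sqrt(2 * math.log(m))
    mu = [rng.gauss(0, sigma) if rng.random() < pi1 else 0.0 for _ in range(m)]
    z = [rng.gauss(mu_j, 1.0) for mu_j in mu]        # Z_j ~ N(mu_j, 1), independent
    Phi = lambda x: 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    p = [2.0 * Phi(-abs(z_j)) for z_j in z]          # two-sided p-value 2 * Phi(-|Z_j|)
    return mu, p
```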

Given these p-values, we computed FDR and mFDR_0.95 with 10 000 trials of four test procedures. Two procedures fix the level: one rejects H_j if p_j ≤ α = 0.05 and the second rejects if p_j ≤ α/200 = 0.00025 (Bonferroni). The other two are step-down tests: the BH and weighted BH (WBH) procedures. For our implementation of the WBH procedure, we group the hypotheses into those that are false (μ_j ≠ 0) and those that are true. The hypotheses are weighted so that only false null hypotheses are tested. Each false null hypothesis receives weight w_j = 1/S^θ(m), and the true null hypotheses receive weight 0 (see section 3 of Genovese et al. (2006)). In effect, it is as though an oracle has revealed which hypotheses are false and the statistician applies the BH procedure to these rather than all m hypotheses.

Fig. 1. FDR and mFDR_η provide similar control: simulated FDR and mFDR_0.95 for the BH step-down procedure, the oracle-based WBH procedure (*), the Bonferroni method (o) and a procedure that tests each hypothesis at level α = 0.05 (+)

Fig. 1 confirms the similarity of control that is implied by FDR and mFDR. Both criteria identify the failure of naive testing; FDR and mFDR approach 1 unless π_1 approaches 1. The criteria also provide similar assessments of the other two procedures; Bonferroni and step-down testing control FDR and mFDR for all levels of signal. As π_1 increases and more hypotheses are rejected, both FDR and mFDR of all procedures fall towards 0. The FDR of the WBH procedure is identically 0 because it only tests false null hypotheses and cannot incorrectly reject a hypothesis.

5. Uniform control of mFDR

Because α-investing proceeds sequentially, the testing can halt after some number of rejected hypotheses. To control such sequential testing, we extend the definition of mFDR_η(m) to random stopping times. Recall the definition of a stopping time: T is a stopping time if the event T ≤ j can be determined by information that is known when the jth test is completed.

Definition 2. If T denotes a finite stopping time of a procedure for testing a stream of hypotheses H_1, H_2, …, then

mFDR_η(T) = sup_{θ∈Θ} [E_θ{V^θ(T)} / (E_θ{R(T)} + η)].

The supremum over stopping times of mFDR determines the uniform mFDR.

Definition 3. A testing procedure provides uniform control of mFDR_η at level α if

∀ T ∈ 𝒯 : mFDR_η(T) ≤ α,  (13)

where 𝒯 is the set of finite stopping times.

Before proving that α-investing provides uniform control of mFDR, we prove half of theorem 1. We first define the relevant stopping time.

Definition 4. The stopping time T_{R=r} = inf{t | R(t) = r}, where we take T_{R=r} to be ∞ if the set is empty.

Lemma 1. For any testing procedure that uniformly controls mFDR_{1−α} at level α,

E_θ{V^θ(T_{R=r})} ≤ α(r + 1 − α).

Proof. Let t ≥ 1 denote an arbitrary, but finite, number of tests. We can stop the process at T = T_{R=r} ∧ t to make it a finite stopping time. Thus E_θ{R(T)} ≤ r. We know by our hypothesis that E_θ{V^θ(T)} ≤ α[E_θ{R(T)} + 1 − α] ≤ α(r + 1 − α). Since this holds for all t and V^θ(T_{R=r} ∧ t) is bounded by r, we can take the limit as t increases. □

mFDR and a variety of modifications of FDR become equivalent when stopped at a fixed number of rejections. A little algebra shows that

−γ_R² ≤ (mFDR_0 − FDR)/mFDR_0 − ργ_Vγ_R ≤ 0,  (14)

where μ_V and σ_V are the mean and standard deviation respectively of V^θ(j), the coefficients of variation are γ_V = σ_V/μ_V and γ_R = σ_R/μ_R, and ρ is the correlation between V^θ(j) and R(j). When γ_R is small, FDR and mFDR_0 are close. If T_{R=r} is finite almost surely, then the standard deviation of R is identically 0, and hence γ_R = 0. So, mFDR_0 and FDR are identical. The following theorem is similar in spirit to Tsai et al. (2003), who worked in a Bayesian setting.

Theorem 2. Suppose that T_{R=r} < ∞ almost surely. Then a testing procedure that stops when r rejections have occurred has the following properties.

(a) γ_R = 0.
(b) FDR = mFDR_0 = cFDR = eFDR = pFDR = E{V^θ(r)}/r.
(c) FDR ≤ α(r + 2)/(r + 1) if the procedure has uniform control of mFDR_0 at level α.

Further, for α-investing rules with α = ω_j = W(0),

(d) var{V^θ(r)} ≤ αr;
(e) P{V^θ(r) ≥ 1 + αr + k√r} ≤ exp(−k²/2).

The proofs of all five results are straightforward and hence are omitted. Property (e) has a relationship to Genovese and Wasserman (2002, 2004). In Genovese and Wasserman (2004) a stochastic process of hypothesis tests is considered. Their approach contrasts with ours in that they use a stochastic process that is indexed by p-values, whereas we use a process that is indexed by the a priori order in which hypotheses are considered. In both cases, an appropriate martingale converts expectations into tail probabilities.

The following theorem shows that an α-investing rule I_{W(0)} with wealth determined by expression (8) controls mFDR_η so long as the pay-out ω is not too large. The theorem follows by showing that a stochastic process that is related to the α-wealth sequence W(0), W(1), … is a submartingale. Because the proof of this result relies only on the optional stopping theorem for martingales, we do not require independent tests, though this is the easiest context in which to show that the p-values are honest in the sense that is required for condition (10) to hold.

Theorem 3. An α-investing rule I_{W(0)} which is governed by expression (8) with initial α-wealth W(0) ≤ αη and pay-out ω ≤ α controls mFDR_η at level α.

Theorem 3 also applies to the stopped version of mFDR and hence shows uniform control.

A proof of this theorem is in Appendix A.

Remark 1. It may not be obvious that the condition of a finite stopping time T_{R=r} < ∞ can in fact be met. Given an infinite sequence of hypotheses, define S^θ(∞) = lim_{m→∞} S^θ(m). If S^θ(∞) = ∞ then T_{R=r} < ∞ for all r because R(m) ≥ S^θ(m). A parameter θ that has the property (for all m and all W(m) > 0) that the chance that the α-investing procedure I will reject at least one more hypothesis after m is at least 0.5 is said to provide continuous funding for I. Clearly for such a θ we have that S^θ(∞) = ∞. We say that an α-investing procedure is thrifty if it never commits all of its current α-wealth to the current hypothesis. An α-investing procedure is hopeful if it always spends some wealth on the next hypothesis. A hopeful, thrifty procedure consumes some of its α-wealth to test every hypothesis in an infinite sequence. With these preliminaries, it can be shown that for any hopeful, thrifty α-investing procedure there is a distribution P_θ that provides continuous funding, so that T_{R=r} is finite almost surely.

6. Examples

This section discusses practical issues of using α-investing. We start by offering general guidelines, or policies, on how to construct α-investing rules. An example constructs an α-investing rule that mimics step-down testing. We conclude with simulations that show the advantages of a good policy and compare α-investing with BH procedures.

α-investing allows the statistician to incorporate prior beliefs into the design of the testing procedure while avoiding the quagmire of uncorrected multiplicities. This flexibility opens the question of how such choices should be made. We can recommend a few policies.

(a) Best foot forward policy: ideally, the initial hypotheses include those which are believed most likely to be rejected. For example, in testing drugs, it is common to test the primary end point before others. α-investing rewards this approach: the rejection of the leading hypotheses earns additional α-wealth towards tests of secondary end points.
(b) Spending rate policies: compared with ordering the hypotheses, deciding how much α-wealth to spend on each is less important. That said, spending too slowly is inefficient. There is no reward for conserving α-wealth unused; the procedure could have used more powerful tests. Alternatively, spending too quickly may exhaust the α-wealth before testing every hypothesis. It seems reasonable to use a thrifty procedure that reserves some α-wealth for future tests.
(c) Dynamic ordering policies: suppose that you are sufficiently lucky to have a drug that might cure cancer and heart disease. Clearly, these two hypotheses should be tested first. But what should come next if the procedure rejects one but not the other? The entire collection of subsequent tests depends on which of the initial hypotheses has been rejected. The nature of such dynamic ordering policies is clearly domain specific.
(d) Revisiting policies: our theorems make no assumptions on how the various hypotheses are related. This flexibility makes it possible to test hypotheses that closely resemble others. In fact, our theorems hold if the same hypothesis is tested more than once, so long as subsequent tests condition on prior outcomes. For example, it might be sensible to test hypothesis H_j initially at a small level α_j = 0.001 (so as not to risk much α-wealth) and then to test other hypotheses. If the test does not reject H_j the first time, it might make sense to test this hypothesis again at a higher level, say 0.01. In this way, the procedure distributes its α-wealth among a variety of hypotheses, spending a little here and then a little there.

The following testing procedure illustrates how α-investing benefits when the investigator has accurate knowledge of the underlying science. If the investigator can order hypotheses a priori so that the procedure first tests those which are most likely to be rejected (the best foot forward policy), then α-investing rejects more hypotheses than the step-down BH test. The full benefit is only realized, however, when combined with a spending rate policy. Suppose that the test procedure has rejected H_{k*} and is about to test H_{k*+1}. Rather than spread its current α-wealth W(k*) evenly over the remaining hypotheses, allocate W(k*) by using a discrete probability mass function such as the following version of equation (9). This version consumes all remaining α-wealth by the last hypothesis by setting

α_j = W(j − 1) q_{j−k*},  (15)

where the fractions q_1, q_2, … decay as in equation (9) and are scaled so that the last hypothesis H_m receives all of the α-wealth that remains. If a subsequent test rejects a hypothesis, the procedure reallocates its wealth so that all is spent by the time that the procedure tests H_m. Mimicking the language of financial investing, we describe this type of α-investing rule as aggressive.

In the absence of domain knowledge, a revisiting policy produces an α-investing rule that imitates the step-down BH procedure. Begin by investing small amounts of α-wealth in the initial test of every hypothesis. This conservative policy means that the procedure runs out of hypotheses well before it runs out of α-wealth. To improve its power, the rule uses its remaining α-wealth to take a second pass through the hypotheses that were not rejected in the first pass. Although we do not advocate this as a general procedure, it is allowed by our theorems. Gradually 'nibbling' away at the hypotheses in this fashion results in a procedure that resembles step-down testing. In fact, as the size of these nibbles goes to 0, the order in which hypotheses are rejected is precisely that from step-down testing. The point where each testing procedure stops is slightly different. Sometimes step-down testing stops first and sometimes α-spending stops first.

A few calculations clarify the connection to the BH procedure. To avoid taking many passes through the hypotheses without generating any rejections, increase the size of the nibbles so as to take as large a bite each time as possible. Set α = ω_j = W(0). The procedure begins by testing each hypothesis in H(m) at level β_1 = α/(α + m) ≈ α/m, approximating the Bonferroni level. This level assures us that the procedure exhausts its α-wealth if no hypothesis is rejected. If, however, some hypothesis is rejected, the procedure uses the earned α-wealth to revisit the hypotheses that were not rejected in the first pass. At the start of each pass, the algorithm divides its α-wealth equally among all remaining unrejected hypotheses and tests each at this common level. These steps continue until a pass does not reject a hypothesis and the α-wealth reaches 0.

On successive passes through the hypotheses, the fact that a hypothesis was previously tested must be used in computing the rejection region. Suppose that exactly one hypothesis is rejected at each pass. In the first pass, the p-value of one hypothesis is smaller than the initial level β_1. Following expression (8), the rule pays β_1/(1 − β_1) for each test that it does not reject and earns ω for rejecting H_(1). Hence, after the initial test of each hypothesis at level β_1, the α-wealth grows slightly to

W(m) = W(0) + ω − (m − 1)β_1/(1 − β_1) = α + α/m.  (16)

For large m, its α-wealth is virtually unchanged from W(0). For the second pass, the procedure again distributes its α-wealth equally over the remaining hypotheses. The level that is invested in each test at this second pass is

β_2 = W(m)/{W(m) + m − 1} ≈ α/m.

To determine which tests are rejected during the second pass, assume that the remaining p-values are uniformly distributed. Conditioning on p_j > β_1, this pass rejects any hypothesis for which p_j ≤ p*, with the threshold p* determined by

P(p_j ≤ p* | p_j > β_1) = (p* − β_1)/(1 − β_1) = β_2.

This implies that p* = β_1 + β_2 − β_1β_2 ≈ 2α/m. Thus, the second pass approximately rejects any hypothesis with p-value that is smaller than 2α/m, the second threshold of the step-down test. In this way, the investing rule gradually raises the threshold for rejecting hypotheses as in step-down testing. If any hypothesis is rejected during a pass over the remaining hypotheses, then α-wealth remains and the testing procedure continues recursively. Instead of spending equally on each hypothesis we could weight these hypotheses differently. This idea of using prior information is implicit in α-spending rules. The use of prior information also appears in the WBH procedure that was proposed in Genovese et al. (2006). Following the ideas of this section, we can show that the WBH procedure controls mFDR.

6.1. Simulations

Our first simulation compares three α-investing rules with step-down versions of the BH and WBH procedures. As in Section 3 for the WBH procedure, an oracle assigns weights

w_j = 0 if H_j is true, and w_j = 1/S^θ(m) otherwise,

to each hypothesis so that the WBH procedure tests only false hypotheses. For comparison, each replication of the simulation tests a fixed batch of m = 200 hypotheses. The 200 hypotheses are defined as in the simulation in Section 3 (see equation (12)); this simulation also uses 10 000 samples. Of the α-investing rules, one implements the revisiting policy that mimics the BH procedure. The other two α-investing rules implement aggressive α-investing. To illustrate the effect of domain knowledge, we simulated the performance of α-investing by using equation (15) in a best case and a worst case scenario. For the best case, the investigator tests the hypotheses in the order that is implied by |μ_j|, testing the hypothesis with the largest |μ_j| first. The tests are ordered by the underlying means rather than the observed p-values. In the worst case, the hypotheses are tested in random order, indicating poor domain knowledge. We set the level for all procedures to α = 0.05; for α-investing, the initial wealth W(0) = α = ω and η = 1 − α.

Fig. 2(a) shows FDR for each procedure. As in the prior simulation that is summarized in Fig. 1, FDR and mFDR are quite similar in this simulation, and so we have shown only FDR. All five procedures control FDR (and mFDR), as they should. As in Fig. 1, FDR for the WBH procedure is identically 0 because it tests only false null hypotheses. For the other procedures that do not benefit from this oracle, it becomes easier to control FDR in the presence of more false null hypotheses (larger π_1). Accurate domain knowledge reduces FDR for the aggressive procedure. When tested in random order (poor domain knowledge), FDR for aggressive testing is similar to that of the BH step-down procedure.

α-investing guarantees protection from too many false rejections, but how well does it find signal? For each α-investing rule, Fig. 2(b) shows the ratio of the number of correctly rejected hypotheses relative to the number that is rejected by the BH procedure, estimating

E{S^θ(200, test procedure) / S^θ(200, BH)}

from the simulation. With accurate domain knowledge, aggressive α-investing rejects in excess of 50% more hypotheses than the step-down BH procedure. Unless the problem offers few false null hypotheses (π_1 < 0.02, an average of one or two false null hypotheses), aggressive α-investing with good domain knowledge obtains the greatest power. α-investing gains more from testing the hypotheses in the right order than BH gains from knowing which hypotheses to test. If the domain knowledge is poor, aggressive α-investing rejects no less than 70% of the number that is rejected by step-down testing. As expected from its design, α-investing by using the revisiting policy performs similarly to step-down testing. We also performed a simulation to investigate the performance of α-investing when applied to

an infinite stream of hypotheses. Each null hypothesis specifies a mean (H_t : μ_t = 0, t = 1, 2, ...), and the test statistic for each hypothesis is Z_t ~ N(μ_t, 1), independently. The simulation computes the FDR and power of aggressive α-investing by using rule (9) and a two-sided test of H_t. Two levels of signal are present in the simulation.

Fig. 2. Comparison of the FDR and power of aggressive α-investing rules with accurate (▽) and inaccurate (△) domain knowledge with the BH procedure, the oracle-based WBH procedure (*) and a revisiting α-investing rule (○); the horizontal axis shows the proportion of false null hypotheses (π₁), and the vertical axis of panel (b) shows the ratio of the number of correctly rejected null hypotheses relative to the BH procedure: (a) all five procedures control FDR; (b) better domain knowledge improves the power of α-investing relative to step-down testing.

Under one scenario, 10% of the null hypotheses are false; in the second, 20% are false. For each scenario, false null hypotheses cluster in

bursts of varying size. We generated the sequence of means μ_t from a two-state Markov chain {Y_t}. In state 0, the null hypothesis holds: μ_t = 0. In state 1, μ_t = 3. The transition probabilities for the Markov chain are p_ij = P(Y_{t+1} = j | Y_t = i). The probability of leaving state 0 varies over p_01 ∈ {0.0025, 0.005, 0.010, 0.025, 0.05}. To obtain a fixed percentage of false null hypotheses, we set p_10 = k p_01 with k = 4 (20% false null hypotheses) and k = 9 (10%). Given k, increases in the transition probability p_01 produce a more choppy sequence of hypotheses. We simulated 1000 streams of hypotheses for each combination of k and p_01, beginning each with Y_1 = 1 (a false null hypothesis). As in previous simulations, we set W(0) = α = ω = 0.05 and η = 1 − α.
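The two-state Markov chain that generates the sequence of means can be sketched as follows; the function name and seed handling are our own, but the transition structure (p_10 = k p_01, start in state 1, stationary proportion of false nulls 1/(1 + k)) follows the description above:

```python
import random

def simulate_means(n, p01, k, mu1=3.0, seed=0):
    """Generate means mu_t from the two-state Markov chain {Y_t}:
    state 0 emits mu_t = 0 (true null), state 1 emits mu_t = mu1 (false null).
    The chain leaves state 0 with probability p01 and leaves state 1 with
    probability p10 = k * p01, so a fraction 1/(1 + k) of nulls is false."""
    rng = random.Random(seed)
    p10 = k * p01
    state = 1                         # Y_1 = 1: begin on a false null hypothesis
    means = []
    for _ in range(n):
        means.append(mu1 if state == 1 else 0.0)
        if state == 0:
            state = 1 if rng.random() < p01 else 0
        else:
            state = 0 if rng.random() < p10 else 1
    return means
```

With k = 4 and p_01 = 0.025, roughly 20% of the 4000 generated means are 3 and the clusters of false nulls average 1/p_10 = 10 hypotheses.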

Fig. 3 presents a snapshot of the results after testing 4000 hypotheses.

Fig. 3. The power of aggressive α-investing increases with a higher proportion of false null hypotheses that arrive in longer clusters (○, 10% false null hypotheses (k = 9); ×, 20% false null hypotheses (k = 4)); the horizontal axis shows the mean cluster size (from 2 to 100): (a) FDR after completing 4000 tests; (b) power.

Although aggressive α-investing is thrifty and hopeful in the sense of Section 5, it can be shown that the sequence of means {μ_t} that is generated by the Markov chain does not provide continuous funding. Each sequence of tests eventually stops after a finite number of tests when the α-wealth runs

out. Even so, at the point of the snapshot in Fig. 3, each simulated realization of α-investing has enough α-wealth to reject further hypotheses. For example, the retained α-wealth W(4000) averaged 0.003 with k = 9 and p_01 = 0.05 (fewest, most choppy false hypotheses) up to 0.63 with k = 4 and p_01 = 0.0025. In every case, Fig. 3(a) shows that FDR lies well below 0.05.

Fig. 3(b) shows that aggressive α-investing has higher power when applied to sequences with a higher proportion of false null hypotheses that arrive in longer clusters. This situation affords the best opportunity to accumulate α-wealth that can be spent quickly to reject clustered false


hypotheses. The power rises rapidly once a cluster of false null hypotheses has been discovered. For example, the overall power is 0.51 with k = 4 and p_01 = 0.025 (20% false null hypotheses with mean cluster size 10). For finding the first null hypothesis of each cluster, however, the power falls to 0.20. This observation suggests that a revisiting policy that returns to hypotheses immediately preceding a rejected hypothesis would have higher power. In general, the power that is shown in Fig. 3(b) rises roughly linearly in the logarithm of the mean cluster size 1/p_10.

Aggressive α-investing also finds a higher percentage of the false null hypotheses when more are present. The power is consistently higher when 20% of the hypotheses are false (k = 4) than when 10% are false (k = 9). The gap between these scenarios diminishes slightly as the mean cluster size increases. As in the prior simulation, the higher power that is obtained for longer clusters is accompanied by smaller rates of false discoveries.

7. Discussion

The best foot forward policy raises a concern that is relevant to any multiple-testing procedure that controls an FDR-like criterion. Suppose that H(m) is contaminated with trivially false hypotheses that artificially produce α-wealth. Then all subsequent tests are tainted. As an extreme example, suppose that the first hypothesis claims that 'gravity does not exist'. After rejecting this hypothesis, the testing procedure has more α-wealth to use in subsequent tests than allocated by W(0). Most readers would, however, be uncomfortable using this additional α-wealth to test the primary end point of a drug. As discussed by Finner and Roters (2001), step-down tests share this problem. By rejecting the trivially false hypothesis H_1, the level of the test of the first 'real' hypothesis is 2α/m rather than α/m. In this sense, it is important to allow an observer to ignore the list of tested hypotheses from some point onwards. The design of the sequential test procedure should put the most interesting hypotheses first to ensure that, when the reader stops, they have seen the most important results. By providing uniform control of mFDR, α-investing controls this criterion wherever the observer stops.

We can regulate α-investing by using other methods of compensating, or charging, for each

test. The increment in the α-wealth that is defined in expression (8) is natural, with a fixed reward and penalty that are determined by whether the test rejects a hypothesis, say H_j. Because neither the pay-out ω nor the cost α_j/(1 − α_j) reveals p_j (other than to indicate whether p_j ≤ α_j), subsequent tests need only to condition on the sequence of rejections, R_1, ..., R_j. The following alternative method for regulating α-investing has the same expected pay-out but varies the winnings when the test rejects H_j:

W(j) − W(j − 1) = { ω + log{(1 − p_j)/(1 − α_j)}   if p_j ≤ α_j,
                 { log(1 − α_j)                    if p_j > α_j.     (17)

α-investing that is governed by this 'regulator' also satisfies the theorems that were shown previously. Because the reward reveals p_j when H_j is rejected, however, the investing rule must condition on p_j for any rejected prior hypotheses. This would seem to complicate the design of tests in applications in which the p-values are not independent. Other methods for regulating the α-wealth could be desirable in other situations. We hope to pursue these ideas in future work.
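As a concrete illustration of the baseline accounting in expression (8), the sketch below tracks the α-wealth over a given sequence of p-values and test levels: a rejection earns the pay-out ω, while a failure to reject costs α_j/(1 − α_j). The function name is our own, and the capping of α_j so that the charge can never exceed the remaining wealth is our simplification, not a rule from the paper:

```python
def alpha_invest(pvalues, alpha_levels, omega=0.05, w0=0.05):
    """Track alpha-wealth under the pay-out scheme of expression (8):
    reject H_j when p_j <= alpha_j, earning omega; otherwise pay
    alpha_j / (1 - alpha_j).  The chosen levels alpha_j may depend on the
    past rejections; here they are supplied and merely capped so that the
    charge never exceeds the wealth on hand."""
    wealth = w0
    rejections = []
    for p, a in zip(pvalues, alpha_levels):
        a = min(a, wealth / (1.0 + wealth))    # ensures a/(1-a) <= wealth
        reject = p <= a
        wealth += omega if reject else -a / (1.0 - a)
        rejections.append(reject)
        if wealth <= 0:                        # alpha-wealth exhausted
            break
    return rejections, wealth
```

For example, `alpha_invest([0.001, 0.5], [0.01, 0.01])` rejects the first hypothesis (wealth rises by ω) and pays 0.01/0.99 for the second.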

We speculate that the greatest reward from developing a specialized testing strategy will come from developing methods that select the next hypothesis rather than specific functions to determine how α is spent. Rule (15) invests half of the current wealth in testing hypotheses following a rejection. One can devise other choices. Results in information theory (Rissanen, 1983; Foster et al., 2002), however, suggest that one can find universal α-investing rules. A universal α-investing rule would reject on average as many hypotheses as the best rule within some class. We would expect such a rule to spend its α-wealth more slowly than the simple rule (15), but we retain this general form.

Appendix A: Proof of theorem 3

We begin by defining a stochastic process that is indexed by j, the number of hypotheses that have been tested:

A(j) = α R(j) − V_θ(j) + η α − W(j).

Our main lemma shows that A(j) is a submartingale for α-investing rules with pay-out ω ≤ α. In other words, we shall show that A(j) is 'increasing' in the sense that

E_θ{A(j) | A(j − 1), A(j − 2), ..., A(1)} ≥ A(j − 1).

Theorem 3 uses the weaker fact that E_θ{A(j)} ≥ A(0). By definition V_θ(0) = R(0) = 0, so A(0) = ηα − W(0) ≥ 0 if W(0) ≤ ηα. When A(j) is a submartingale, the optional stopping theorem implies for all finite stopping times M that E_θ{A(M)} ≥ 0. Thus,

E_θ[α{R(M) + η} − V_θ(M)] = E_θ{W(M) + A(M)} ≥ E_θ{A(M)} ≥ A(0) ≥ 0.

The first inequality follows because the α-wealth W(j) ≥ 0 (almost surely), and the second inequality follows from the submartingale property. Thus, once we have shown that A(j) is a submartingale, it follows that E_θ{V_θ(M)} ≤ α[E_θ{R(M)} + η] and

mFDR_η(M) = E_θ{V_θ(M)} / [E_θ{R(M)} + η] ≤ α.

To show that A(j) is a submartingale, write the numbers of rejections and false rejections as sums of increments,

R(m) = Σ_{j=1}^{m} R_j,   V_θ(m) = Σ_{j=1}^{m} V_j.

Similarly write the accumulated α-wealth W(m) and A(m) as sums of increments, W(m) = Σ_{j=0}^{m} W_j and A(m) = Σ_{j=0}^{m} A_j. Let α_j denote the α-level of the test of H_j that satisfies condition (10). The change in the α-wealth from testing H_j can be written as

W_j = R_j ω − (1 − R_j) α_j / (1 − α_j).

Substituting this expression for W_j into the definition of A_j, we obtain

A_j = (α − ω) R_j − V_j + (1 − R_j) α_j / (1 − α_j).


Since R_j ≥ 0 and α − ω ≥ 0 by the conditions of the lemma, it follows that

A_j ≥ (1 − R_j) α_j / (1 − α_j) − V_j.     (19)

If θ_j ∉ H_j, then V_j = 0 and A_j ≥ 0 almost surely. So we need to consider only the case in which the null hypothesis H_j is true. When H_j is true, R_j = V_j and inequality (19) becomes

A_j ≥ (1 − R_j) α_j / (1 − α_j) − R_j = (α_j − R_j) / (1 − α_j).     (20)

Abbreviate the conditional expectation

E_θ^{j−1}(X) = E_θ{X | A(1), A(2), ..., A(j − 1)}.

Under the null hypothesis, E_θ^{j−1}(R_j) ≤ α_j by the definition of this being an α_j-level test. Taking conditional expectations in expression (20) gives E_θ^{j−1}(A_j) ≥ 0.
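The bound of theorem 3 can be checked numerically. The sketch below simulates an α-investing rule of our own devising (spend half the remaining wealth on each test) on streams in which every null hypothesis is true, so that V(M) = R(M), and estimates mFDR_η = E(V)/(E(R) + η); the rule and the constants are illustrative assumptions, chosen so that W(0) ≤ ηα as the proof requires, not choices taken from the paper:

```python
import random

def mc_mfdr_check(n_reps=2000, m=100, alpha=0.05, omega=0.05,
                  eta=0.95, w0=0.01, seed=1):
    """Monte Carlo check that mFDR_eta = E(V)/(E(R) + eta) <= alpha when
    every null hypothesis is true (every rejection is false, V = R).
    The illustrative rule risks half the remaining wealth on each test."""
    rng = random.Random(seed)
    total_rejects = 0
    for _ in range(n_reps):
        wealth = w0                          # W(0) = w0 <= eta * alpha
        for _ in range(m):
            charge = wealth / 2.0            # amount risked on this test
            a_j = charge / (1.0 + charge)    # so that a_j / (1 - a_j) = charge
            if rng.random() <= a_j:          # p-value is uniform under the null
                total_rejects += 1
                wealth += omega              # pay-out for a rejection
            else:
                wealth -= charge             # charge for failing to reject
    ev = total_rejects / n_reps              # estimates E(V) = E(R)
    return ev / (ev + eta)                   # estimated mFDR_eta
```

With these settings the estimate sits well below α = 0.05, consistent with the theorem.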

References

Benjamini, Y. and Hochberg, Y. (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Statist. Soc. B, 57, 289-300.

Benjamini, Y. and Yekutieli, D. (2001) The control of the false discovery rate in multiple testing under dependency. Ann. Statist., 29, 1165-1188.

Braun, H. I. (ed.) (1994) The Collected Works of John W. Tukey: Multiple Comparisons, vol. VIII. New York: Chapman and Hall.

Dudoit, S., Shaffer, J. P. and Boldrick, J. C. (2003) Multiple hypothesis testing in microarray experiments. Statist. Sci., 18, 71-103.

Efron, B. (2007) Size, power, and false discovery rates. Ann. Statist., 35, 1351-1377.

Finner, H. and Roters, M. (2001) On the false discovery rate and expected type I errors. Biometr. J., 43, 985-1005.

Foster, D. P., Stine, R. A. and Wyner, A. J. (2002) Universal codes for finite sequences of integers drawn from a monotone distribution. IEEE Trans. Inform. Theory, 48, 1713-1720.

Genovese, C., Roeder, K. and Wasserman, L. (2006) False discovery control with p-value weighting. Biometrika, 93, 509-524.

Genovese, C. and Wasserman, L. (2002) Operating characteristics and extensions of the false discovery rate procedure. J. R. Statist. Soc. B, 64, 499-517.

Genovese, C. and Wasserman, L. (2004) A stochastic process approach to false discovery control. Ann. Statist., 32, 1035-1061.

Gupta, M. and Ibrahim, J. G. (2005) Towards a complete picture of gene regulation: using Bayesian approaches to integrate genomic sequence and expression data. Technical Report. University of North Carolina, Chapel Hill.

Holm, S. (1979) A simple sequentially rejective multiple test procedure. Scand. J. Statist., 6, 65-70.

Lehmacher, W. and Wassmer, G. (1999) Adaptive sample size calculations in group sequential trials. Biometrics, 55, 1286-1290.

Meinshausen, N. and Bühlmann, P. (2004) Lower bounds for the number of false null hypotheses for multiple testing of associations under general dependence. Biometrika, 92, 893-907.

Meinshausen, N. and Rice, J. (2006) Estimating the proportion of false null hypotheses among a large number of independently tested hypotheses. Ann. Statist., 34, 373-393.

Rissanen, J. (1983) A universal prior for integers and estimation by minimum description length. Ann. Statist., 11, 416-431.

Sarkar, S. K. (1998) Some probability inequalities for ordered MTP2 random variables: a proof of the Simes conjecture. Ann. Statist., 26, 494-504.

Simes, R. J. (1986) An improved Bonferroni procedure for multiple tests of significance. Biometrika, 73, 751-754.

Storey, J. D. (2002) A direct approach to false discovery rates. J. R. Statist. Soc. B, 64, 479-498.

Storey, J. D. (2003) The positive false discovery rate: a Bayesian interpretation and the q-value. Ann. Statist., 31, 2013-2035.

Troendle, J. F. (1996) A permutation step-up method of testing multiple outcomes. Biometrics, 52, 846-859.

Tsai, C.-A., Hsueh, H.-M. and Chen, J. J. (2003) Estimation of false discovery rates in multiple testing: application to gene microarray data. Biometrics, 59, 1071-1081.

Tsiatis, A. A. and Mehta, C. (2003) On the inefficiency of the adaptive design for monitoring clinical trials. Biometrika, 90, 367-378.

Tukey, J. W. (1953) The problem of multiple comparisons. Unpublished lecture notes.

Tukey, J. W. (1991) The philosophy of multiple comparisons. Statist. Sci., 6, 100-116.
