
False discoveries and models for gene discovery

Edwin J.C.G. van den Oord and Patrick F. Sullivan

Virginia Institute for Psychiatric and Behavioral Genetics, Medical College of Virginia, Virginia Commonwealth University, Richmond, VA 23298-0126, USA

In the search for genes underlying complex traits, there is a tendency to impose increasingly stringent criteria to avoid false discoveries. These stringent criteria make it hard to find true effects, and we argue that it might be better to optimize our procedures for eliminating and controlling false discoveries. Focusing on achieving an acceptable ratio of true and false positives, we show that false discoveries could be eliminated much more efficiently using a stepwise approach. To avoid a relatively high false discovery rate, corrections for 'multiple testing' might also be needed in candidate gene studies. If the appropriate methods are used, the proportion of true effects one wants to detect appears to be a more important determinant of the genotyping burden than the desired false discovery rate. This raises the question of whether current models for gene discovery are shaped excessively by a fear of false discoveries.

In the search for mutations associated with complex diseases, geneticists are faced with the challenge of finding a limited number of causal mutations by testing a (very) large set of markers. This challenge could become even more substantial in the future. For example, interactions between genes might also be important [1]. In the case of k markers, considering only two-marker interactions already results in (k² − k)/2 tests.
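To give a sense of scale, a minimal sketch in Python (the marker counts are illustrative; 500 000 echoes the SNP count mentioned later for a whole-genome LD scan):

def n_pairwise_tests(k: int) -> int:
    """Number of two-marker interaction tests among k markers: (k^2 - k) / 2."""
    return (k * k - k) // 2

print(n_pairwise_tests(1_000))    # 499,500 tests for 1000 markers
print(n_pairwise_tests(500_000))  # 124,999,750,000 tests for 500,000 markers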

Stringent models for gene discovery

A false discovery is the result of a TYPE I ERROR (see Glossary): rejecting the null hypothesis stating that the marker has no effect when, in fact, it has none. The risk of false discoveries is associated with p0, the proportion of tests for which the null hypothesis is true. Thus, if none of the markers has an effect, all significant results will be false discoveries. If all markers have effects, none of the significant results can be a false discovery. Many tests are typically performed in the search for disease mutations, of which only a small proportion will have true effects. There is therefore a considerable risk of false discoveries. This appears to be confirmed by studies showing that, even for allelic variants in sensible candidate genes, a significant number of the initial associations are not replicated and were probably false discoveries [2].

To prevent false discoveries, it can be tempting to resort to stringent criteria for gene discovery. For linkage studies, Lander and Kruglyak [3] proposed critical P-values that would result in only one false-positive discovery in every 20 whole-genome scans. To consider association studies for publication, scientific journals look for large sample sizes, small P-values, associations that make biological sense, and independent replications [4]. A recent article [5] proposed an even more extensive burden of proof, further requiring sequence analysis to identify the variants, functional tests, and preferably circumstantial evidence, such as similar genotype–phenotype associations found in other species.

It is not clear, however, whether such stringent criteria will have a positive impact on gene discovery. A negative consequence might be that true effects are missed. This becomes a particular concern if one realizes that polymorphisms affecting complex diseases might have small effects, and that commonly used methods such as linkage scans do not have the POWER to detect small effects. Limitations in resources, skills and biological knowledge might further hamper the detection of true causal genes when the stringent criteria for gene discovery proposed above are used.

Strategies for eliminating false discoveries

Before advocating very stringent criteria, we argue that it might be sensible to optimize our procedures for controlling and eliminating false discoveries. The null hypothesis will be rejected if the P-value of the test statistic is smaller than the SIGNIFICANCE LEVEL α.

Glossary

Type I error. A Type I error is the error of rejecting the null hypothesis when it is true. This results in a false positive, or false discovery.

Power. The probability of rejecting the null hypothesis when the alternative hypothesis is true.

Significance level α. The a priori probability of making a Type I error. The P-value is the a posteriori probability of obtaining a test statistic at least as extreme as the one observed, given that the null hypothesis is true. Small P-values are an indication that the null hypothesis could be false. If the P-value is smaller than the significance level α, the null hypothesis will be rejected. Although somewhat arbitrary, significance levels of 0.05 and 0.01 are often used in practice. These choices imply that, if the null hypothesis is true, a significant result will, on average, occur by chance 5% or 1% of the time.

Multiple testing. If more than one statistical test is performed, the number of false discoveries will increase. For example, for 100 independent tests and a significance level of 0.05, there will be 0.05 × 100 = 5 false discoveries if the null hypothesis is true for all tests. To correct for multiple testing and avoid this increase in false positives, the significance level needs to be adjusted downwards. For ease of presentation, we refer to this adjusted significance level as the critical P-value (Pk); that is, the null hypothesis will be rejected for all tests that yield test statistics whose associated P-values are smaller than Pk.

Corresponding author: Edwin J.C.G. van den Oord ([email protected]).


If multiple statistical tests are performed, the probability that at least one of the P-values is smaller than α increases. To counteract this increase in false discoveries, the significance level needs to be adjusted downwards. For ease of presentation, we refer to this adjusted significance level as the critical P-value Pk.

The adjustment for MULTIPLE TESTING is typically based on the number of tests that are performed. The Bonferroni correction, for example, is Pk = α/m, where m is the number of tests. Because in multiple-testing situations the significance level α is the probability of finding one or more false discoveries, this procedure aims to achieve a probability of 1 − α that not a single false discovery will be found in the whole set of tests. A disadvantage of this kind of 'familywise' correction procedure is that, when many tests are performed (i.e. m is large), Pk becomes very small. This results in a low statistical power to detect the true effects. For single-gene traits, maximizing the probability of finding not even a single false discovery might make sense. First, the power to detect the disease mutation is relatively good. For complex traits, where finding disease mutations is more difficult, it is questionable whether further sacrificing power for this stringent goal is desirable. Second, a discovery in the case of a single-gene disorder implies the strong claim that one has found the cause of the disorder, which would have important practical and medical implications. As complex traits are affected by multiple genes and environmental factors, it could be argued that the consequences of a false discovery are less severe.
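As a concrete illustration, a minimal sketch of the familywise correction and its effect on the critical P-value (the values of α and m are ours; 500 000 mirrors the SNP count used later for a whole-genome LD scan):

def bonferroni_critical_p(alpha: float, m: int) -> float:
    """Familywise (Bonferroni) critical P-value: P_k = alpha / m."""
    return alpha / m

print(bonferroni_critical_p(0.05, 100))      # 0.0005
print(bonferroni_critical_p(0.05, 500_000))  # 1e-07: a tiny P_k, hence low power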

Although current procedures cannot correct for them, multiple-testing problems also arise in candidate gene studies. For example, assume that, in one year, 1000 different research groups test, for a variety of phenotypes, a putative high-risk allele in a candidate gene. Each research group would probably assume that m = 1 and make no correction for multiple testing. However, whereas a single test might have been performed in the reported study, there is a need to correct for multiple testing because the research community as a whole is testing many candidates. Thus, if each research group tested at α = 0.05, together they would produce 0.05 × 1000 = 50 false discoveries in that year.

These examples illustrate that our current procedures for controlling and eliminating false discoveries might not be optimal. That is, depending on the type of study that is performed, they can either be extremely conservative (genome-wide scans), making it very hard to find genes, or very liberal (candidate gene studies), allowing a high rate of false discoveries.

Alternative approaches

Controlling a false discovery rate

Instead of sacrificing power to maximize the probability that no false discoveries will be found, the difficulty of identifying polymorphisms influencing common diseases could argue for focusing on obtaining an acceptable ratio of true and false discoveries. The formula in Box 1 calculates the Pk needed to obtain a desired false discovery rate (FDR) [6]. It shows that lowering the FDR in a given situation will result in smaller Pk values because fewer false positives are allowed. The parameter p0 is the proportion of tests for which the null hypothesis is true (i.e. that have no effect). Instead of the number of tests m, the correction for multiple testing is, in essence, made on the basis of this p0. Larger values of p0 result in smaller values of Pk. This makes sense because, if most of the markers have no effect, tests need to be performed at a more conservative level to obtain the same FDR. In statistical terms, the proportion of true effects that will be detected (PTD) is the average power of the tests at the computed Pk. The PTD also influences Pk because, if it is lower, it will be more difficult to distinguish true from false positives.

This formula is helpful for studying and understanding multiple-testing problems. It is important to mention that procedures are readily available for controlling the FDR in situations where the P-values of all the tests that have been performed are available [7] (Y. Benjamini et al., unpublished). These procedures essentially control the FDR by estimating Pk from the data.
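As an illustration of such a data-driven procedure, here is a minimal sketch of the step-up rule of Benjamini and Hochberg [6]; the toy P-values are ours, chosen only for illustration:

def bh_critical_p(p_values, fdr):
    """Data-estimated critical P-value from the Benjamini-Hochberg step-up rule:
    the largest ordered P-value p_(i) with p_(i) <= i * FDR / m.
    Returns 0.0 if nothing is rejected."""
    m = len(p_values)
    p_k = 0.0
    for i, p in enumerate(sorted(p_values), start=1):
        if p <= i * fdr / m:
            p_k = p
    return p_k

# Toy P-values (assumed): three tests are rejected at FDR = 0.1.
print(bh_critical_p([0.001, 0.008, 0.04, 0.3, 0.7], fdr=0.1))  # 0.04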

We observed that FDRs smaller than 0.1 generally resulted in a sharp increase in sample size (Fig. 1).

Box 1. A simple formula for controlling the false discovery rate

The critical P-value (Pk) can be calculated as follows:

$$P_k = \frac{(1 - p_0)\,\mathrm{PTD}}{p_0/\mathrm{FDR} - p_0} \qquad \text{[Eqn I]}$$

where p0 is the proportion of tests for which the null hypothesis is true, PTD is the expected proportion of markers with true effects that will be detected, and FDR is the false discovery rate. The formula can be applied when tests are independent or dependent, such as when there is LD between markers (for details see [13]).
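For readers who want to see why Eqn I holds, a short derivation (our sketch, using only the definitions above): under the null hypothesis the P-value is uniformly distributed, so a fraction Pk of the proportion p0 of true-null tests is expected to be rejected, whereas a fraction PTD of the proportion (1 − p0) of tests with true effects is expected to be rejected. The expected proportion of false discoveries among all discoveries is therefore

$$\mathrm{FDR} = \frac{p_0\,P_k}{p_0\,P_k + (1 - p_0)\,\mathrm{PTD}},$$

and solving this expression for Pk gives Eqn I.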

The use of Pk computed with this formula controls the expected FDR in the situation where p0 and PTD are as specified [6]. That is, if an infinite number of studies were conducted, the expected proportion of false discoveries within the whole set of tests that are rejected would equal the specified FDR. The FDR can be set to whatever value is considered acceptable and refers to a proportion. Thus, the FDR is identical when one false positive is found among ten claims, or ten false positives are found among 100 claims. The formula shows that lowering the FDR in a given situation will result in smaller Pk values because fewer false positives are allowed.

In statistical terms, the PTD is the average power of the tests for which the alternative hypotheses are true. A researcher can control the PTD; for example, it can be fixed at a desired level. Then, by using the Pk from the formula, and assuming a certain effect size, it is possible to compute the sample size that results in the chosen PTD. A lower PTD implies a lower Pk because it is more difficult to distinguish between true and false discoveries.

p0 is generally unknown, although there are ways to estimate it (see the final section of the text). In our calculations, this was addressed by assuming a range of values. Larger values of p0 result in smaller values of Pk. This makes sense because, if most of the markers have no effect, tests need to be performed at a more conservative level to obtain the same FDR. In traditional approaches, the correction for multiple testing is based on the number of tests; here, it is essentially made on the basis of p0.
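To make Box 1 concrete, a minimal sketch in Python of Eqn I; the example values are taken from the one-step designs in Table 1 (PTD = 0.8, FDR = 0.1):

def critical_p_value(p0: float, ptd: float, fdr: float) -> float:
    """Eqn I from Box 1: P_k = (1 - p0) * PTD / (p0 / FDR - p0)."""
    return (1.0 - p0) * ptd / (p0 / fdr - p0)

print(critical_p_value(p0=0.9999, ptd=0.8, fdr=0.1))  # ~8.9e-6
print(critical_p_value(p0=0.995,  ptd=0.8, fdr=0.1))  # ~0.00045
print(critical_p_value(p0=0.75,   ptd=0.8, fdr=0.1))  # ~0.0296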


We used an FDR of 0.1 in our calculations because this value provided a good balance between controlling false positives and finding true effects with a minimum of genotyping and phenotyping costs. Many geneticists would probably also be happy to accept one false positive in every ten claims if that helps them to find true disease mutations.

A stepwise approach

Although not all proposed procedures are accurate [8], the use of a stepwise approach might significantly reduce the amount of genotyping [9,10]. In the first step, all markers are typed for a subset of individuals. The most promising markers are then evaluated on additional subjects in the next steps. This eliminates wastage of resources on markers that are unlikely to be associated with disease based on the results of the first step. In this article, we compare one- versus two-step designs; however, the effect of using more than two steps can also be inferred from our results. We consider three scenarios. In scenario (a), p0 = 0.9999, which could, for example, apply to a whole-genome linkage disequilibrium (LD) scan with 500 000 single nucleotide polymorphisms (SNPs) [11,12] and 50 true effects. In scenario (b), p0 = 0.995, which could apply to an LD fine-mapping study where 200 tests are performed to detect a single mutation, or a linkage scan where 400 markers are used to detect two true peaks. Finally, in scenario (c), p0 = 0.75, which applies to markers that are likely to have effects.

The genotyping burden was defined as the average number of genotypes needed per marker. For a one-step design, this is simply the number of individuals screened. For a two-step design, the genotype burden is the step-one sample size plus the proportion of markers selected for step two multiplied by the sample size of step two. Note that the total amount of genotyping would be the number of markers multiplied by the genotype burden. Efficiency was defined as the genotype burden for a two-step design divided by the genotype burden for a one-step design. Because the equation in Box 1 is based on proportions and rates, the actual numbers of tests and/or markers, true and false discoveries, and true effects are not relevant. Furthermore, this ratio appeared to be robust against variations in effect size and the type of test that is performed. Our results should therefore demonstrate fairly general principles.

An iterative procedure can be used to determine the design that minimizes the genotyping burden [13]. Table 1 shows features of the optimal design when the aim is to detect 80% of the true effects (PTD = 0.8) while allowing an FDR of 0.1. Because we focus on relative efficiency, the actual sample size is not very important. In Table 1, we simply assumed a chi-square test for equal genotype distributions in cases and controls. In scenario (a), where p0 = 0.9999, our iterative procedure attempted to keep the sample size for step one as small as possible by allowing a very high FDR and a PTD that was only slightly higher than the eventual aim of 0.8. This resulted in a fairly liberal critical P-value of Pk = 0.068. In step two, the large number of false discoveries present after step one needs to be reduced to the desired FDR of 0.1, and the PTD needs to be relatively high because the proportion of true effects left after step one is 0.82. This results in a much larger sample size and a much smaller Pk of 0.000082. Every marker needs to be typed for the 319 individuals at step one, but only the proportion of markers left after step one (0.06833) needs to be typed for the 1373 individuals in step two. This results in a total genotype burden of 319 + 0.06833 × 1373 = 413. Because 1092 individuals would be needed to achieve the same aims in a single step, the efficiency ratio is 413/1092 ≈ 0.38. This ratio is also displayed in Fig. 2. Thus, a two-step design would require merely one-third of the genotyping that would be needed with a one-step design.
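A minimal sketch of this bookkeeping, using the values from Table 1 for scenario (a) (p0 = 0.9999):

def genotype_burden_two_step(n1: float, prop_left: float, n2: float) -> float:
    """Average genotypes per marker: the step-one sample plus the fraction of
    markers carried forward to step two times the step-two sample."""
    return n1 + prop_left * n2

burden_two_step = genotype_burden_two_step(319, 0.06833, 1373)  # ~413
burden_one_step = 1092
print(burden_two_step, burden_two_step / burden_one_step)       # ~413 and ~0.38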

Table 1 and Fig. 2 show that for p0 = 0.995 the results are somewhat less extreme. However, the same pattern of results is observed, with Pk = 0.09 and 0.005 for steps one and two, respectively. Finally, p0 = 0.75 applies to markers that are likely to have effects; in this condition, little is gained by using two steps.

Scenarios (b) and (c) of Fig. 2 show the effect of changing the goals with respect to true and false discoveries. The proportion of detected genes [scenario (b)] seems a much more important determinant of the amount of genotyping than the FDR [scenario (c)]. That this phenomenon is present in a one-step design suggests that the curve relating changes in the PTD to the required sample size is steeper than the corresponding curve for the FDR.

Fig. 1. Relationship between false discovery rate (FDR; x-axis) and required sample size (y-axis). Although actual sample sizes differed, we found the shapes of the curves to be similar for a variety of tests and effect sizes. We did not show specific sample sizes on the y-axis to stress that the figure indicates a general trend. The curves are labelled with the PTD, the proportion of markers with true effects that will be detected (see Box 1); curves are shown for PTD = 0.975, 0.9, 0.7, 0.5 and 0.3. The figure shows that the required sample size increases sharply for FDRs smaller than 0.1.


In Fig. 2, we studied a limited number of situations, and this result might also have been a consequence of the way we made the comparison (a reduction of n true discoveries paired with an increase of n false discoveries). However, Fig. 1 also suggests that, unless the FDR is very close to zero, changes in the FDR have a relatively modest impact on sample size.

This phenomenon is much more pronounced for the two-step approach. Particularly in conditions with a large p0, reducing the false discoveries has only a small effect on the genotyping burden, whereas increasing the true discoveries has a large effect. Table 1 shows that the FDR is mainly reduced in the second step, so that changing the eventual goal for the FDR mainly affects the step-two sample size. However, because there are relatively few markers left at step two, the impact on the genotyping burden is relatively small. By contrast, the PTD at step one can never be lower than the final proportion of true discoveries one wants to detect. A change in the final PTD therefore implies a similar change in the PTD at step one. Thus, because all markers have to be typed at step one, this has a much larger impact on the genotyping burden.

Implications

Scanning genomic regions

An efficient elimination strategy might include a first step characterized by high P-values and relatively small sample sizes, which would result in many false positives. This would be followed by a replication step with small P-values and large sample sizes to separate the true positives from the false positives in this first selection. This sequence might seem counterintuitive and is essentially the opposite of the guidelines proposed by Lander and Kruglyak [3]. The key issue is that it is inefficient to type markers that are likely to have no effect. It is better to make a first rough selection, and to pursue only those markers for which there is some indication that they could have a true effect. There is a clear similarity with population-based screens for breast cancer. Here, individuals are first selected using a quick test that should be sensitive for detecting affected individuals but lacks the specificity to eliminate false positives (mammography), followed by a 'gold-standard' test (e.g. biopsy) that separates the true from the false positives.

The stepwise approach to eliminating false positives could involve a single study or multiple studies. In our opinion, a large step-one study has scientific merit and would deserve to be considered for publication.

Table 1. Characteristics of one- and two-step study designs that minimize genotype burden given a set of goals with respect to proportions of true and false discoveries (a)

Design                   FDR      PTD    Pk         Proportion of  Required  Genotype burden
                                                    markers left   sample    per marker
p0 = 0.9999
  Step one of one        0.1000   0.800  0.0000089  0.0000889      1092      1092
  Step one of two        0.9988   0.820  0.0682600  0.0683300       319       319
  Step two of two        0.1000   0.976  0.0001302  0.0000889      1373        94
  Total two-step design                                                        413
p0 = 0.995
  Step one of one        0.1000   0.800  0.0004470  0.0044440       756       756
  Step one of two        0.9580   0.820  0.0939890  0.0976190       287       287
  Step two of two        0.1000   0.976  0.0047520  0.0044440       952        93
  Total two-step design                                                        380
p0 = 0.75
  Step one of one        0.1000   0.800  0.0296300  0.2222200       381       381
  Step one of two        0.3660   0.840  0.1617100  0.3312800       246       246
  Step two of two        0.1000   0.952  0.1832200  0.2222200       381       126
  Total two-step design                                                        372

(a) Abbreviations: FDR, false discovery rate; p0, the proportion of tests for which the null hypothesis is true; Pk, critical P-value; PTD, proportion of true effects that will be detected.

Fig. 2. Reduction in the amount of genotyping by using a two-step rather than a one-step design. In scenario (a), the goal is to detect 80% of the markers with real effects [proportion of true effects that will be detected (PTD) = 0.8] with a false discovery rate (FDR) of 0.1. Results show that a two-step design is much more efficient than a one-step design, but that this difference decreases when the proportion of tests for which the null hypothesis is true (p0) increases; one- and two-step designs are compared for p0 = 0.9999, 0.995 and 0.75. Scenarios (b) and (c) are modifications of scenario (a) in which we changed the PTD and FDR. More specifically, in scenario (b) we reduced the number of true discoveries by n. In scenario (c), we allowed for an increase in the same number n of false discoveries. For example, if we lowered the number of true discoveries we wanted to find by five, we would also allow five more false positives. It can be shown that, once the PTD is chosen for scenario (b) (the proportion of true effects one wants to find), this reduction of n true discoveries and the increase of n false discoveries can be achieved by changing the FDR. The results therefore hold, in general, for that value of PTD, and do not depend on the actual number n. Here, we chose PTD = 0.4 for scenario (b). Details of the calculations can be found at http://www.vipbg.vcu.edu/~edwin. The amount of genotyping required in scenarios (b) and (c) is expressed relative to the genotyping required for scenario (a). Results are the same as those for scenario (a), showing that a two-step design is more efficient. Furthermore, the proportion of genes one wants to detect [scenario (b)] is a more important determinant of the amount of genotyping than the FDR [scenario (c)].


The same is true for step-two studies that have high power and find significant results using small P-values, because these are likely to report true discoveries. A mechanism for storage and dissemination of the data, including those from smaller studies, could also be useful. This could be a brief reference in a journal or, similar to what is now available for published association studies [14], a curated website. The website would preferably include results of step one and all step-two studies, thereby avoiding the 'file drawer' problem that keeps false positives in the literature longer than is desirable.

In this article, a statistical procedure was used to make a first selection of markers in step one. Another option would be to use database resources to make the first selection [15]. Knowledge about biological pathways and other literature might hold further suggestions. Because of the increasing amount of information that is becoming available, in addition to the development of better bioinformatic tools to collect and integrate these data, this strategy might be a viable alternative in the future. False discoveries could, in principle, be eliminated even more efficiently using more than two steps. This corresponds with moving along the x-axis of Fig. 2, from the first to the third condition. Thus, in each step, p0 among the remaining markers is decreased a little until no further reduction in genotyping can be expected.
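A minimal sketch of how the genotype-burden bookkeeping extends to more than two steps; the stage sizes and selection proportions below are illustrative, not taken from the article:

def multi_step_burden(stages):
    """Average genotypes per marker for a design with any number of steps.
    Each stage is (sample_size, proportion_of_markers_carried_to_the_next_stage)."""
    burden, markers_left = 0.0, 1.0
    for sample_size, prop_carried in stages:
        burden += markers_left * sample_size
        markers_left *= prop_carried
    return burden

# Illustrative three-step design (numbers assumed):
print(multi_step_burden([(319, 0.068), (800, 0.01), (1500, 1.0)]))  # ~374 genotypes per marker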

Candidate gene studies

In studies of candidate genes that are likely to have true effects, not much will be gained by using a two-step procedure. However, the critical P-value will still need to be adjusted to account for the multiple testing performed by the scientific community as a whole. Standard multiple-testing and FDR procedures that are based on the number of tests, or on the P-values of the tests that have been performed, cannot be applied in this situation. The formula in Box 1 can, in principle, be used. A practical limitation is that p0 is unknown for candidate gene studies. For this reason, we plotted the Pk values for a range of p0 values (Fig. 3). The results suggest that Pk might need to be as small as 0.004, and that Pk = 0.01 will, on average, control the FDR at 0.10. This method would clearly need to be refined, for example, to obtain more precise estimates of p0. However, it does suggest that traditional significance levels such as 0.05 are too high for candidate gene studies. Instead, a critical P-value of 0.01 is more likely to control the FDR at the level of 0.1.
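As an illustration, Eqn I from Box 1 can be applied directly to this setting; a minimal sketch, in which the p0 values are assumptions within the range considered in Fig. 3:

def critical_p_value(p0, ptd, fdr):
    """Eqn I from Box 1: P_k = (1 - p0) * PTD / (p0 / FDR - p0)."""
    return (1.0 - p0) * ptd / (p0 / fdr - p0)

print(critical_p_value(p0=0.75, ptd=0.8, fdr=0.1))  # ~0.030
print(critical_p_value(p0=0.90, ptd=0.8, fdr=0.1))  # ~0.010, cf. the P_k = 0.01 quoted above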

Concluding remarks

This article does not claim to offer any ready-made solutions: many questions remain. An important issue for future research is which p0 value to choose for a given situation. In our calculations, this was dealt with by assuming a range of possible values. Although many complexities are involved, it is, in principle, possible to obtain more precise estimates based on empirical data. An experimental approach would be to collect data from a large number of linkage and/or LD scans, or candidate gene studies. By counting the proportion of markers that replicate in independent samples (presumably, markers with real effects), an estimate of p0 could be obtained. Statistical procedures using data from a single study have also been developed to estimate p0 [16,17] (Y. Benjamini et al., unpublished).
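As one example of such a single-study procedure, a minimal sketch in the spirit of Storey's estimator [17] (the threshold and the toy P-values are ours): under the null hypothesis P-values are uniform, so the density of P-values above a threshold estimates the null proportion.

def estimate_p0(p_values, lam=0.5):
    """Estimate the proportion of true null hypotheses: P-values above lam are
    assumed to come (almost) entirely from nulls, which are uniform on [0, 1]."""
    m = len(p_values)
    return sum(p > lam for p in p_values) / ((1.0 - lam) * m)

# Toy P-values (assumed): 3 of 10 exceed 0.5, giving an estimate of p0 = 0.6.
print(estimate_p0([0.001, 0.004, 0.01, 0.03, 0.08, 0.15, 0.3, 0.55, 0.7, 0.9]))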

Our findings suggest that there is considerable room to improve our procedures for eliminating and controlling false discoveries. A clear example was the stepwise approach, which achieved the same false discovery rate much more efficiently. If appropriate methods are used, the proportion of true effects one wants to detect seems a much more important determinant of the genotyping burden than the false discovery rate. This raises the question of whether current models for gene discovery are shaped too much by a concern for false discoveries.

Fig. 3. Possible critical P-values (Pk) for candidate gene association studies. Pk is plotted against the expected proportion of markers with no effect (p0; x-axis, ranging from 0.99 down to 0.75), assuming that either 80% or 40% of the true effects are discovered (PTD = 0.8 or 0.4). Because few genes have been found for complex traits (although many candidates have probably been tested), we speculated that 0.75 could be a lower-bound value of p0 for candidate gene association studies; 0.99 was taken as the upper-bound value because this is more typical for the LD and linkage scans studied in scenario (b) of Fig. 2. Results show that Pk might need to be as small as 0.004, and that Pk = 0.01 will, on average, control the FDR at 0.10. PTD indicates the proportion of markers with true effects that will be detected.

Acknowledgements
E.v.d.O. is supported in part by grants from the US National Institute of Mental Health (MH065320) and NARSAD.

References
1 Wright, S. (1931) Evolution in Mendelian populations. Genetics 16, 97–159
2 Hirschhorn, J.N. et al. (2002) A comprehensive review of genetic association studies. Genet. Med. 4, 45–61
3 Lander, E. and Kruglyak, L. (1995) Genetic dissection of complex traits: guidelines for interpreting and reporting linkage results. Nat. Genet. 11, 241–247
4 Editorial (1999) Freely associating. Nat. Genet. 22, 1–2
5 Glazier, A.M. et al. (2002) Finding genes that underlie complex traits. Science 298, 2345–2349
6 Benjamini, Y. and Hochberg, Y. (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc. 57, 289–300
7 Genovese, C. and Wasserman, L. (2002) Operating characteristics and extensions of the FDR procedure. J. R. Stat. Soc. 64, 499–517
8 Allison, D.B. and Coffey, C.S. (2002) Two-stage testing in microarray analysis: what is gained? J. Gerontol. A Biol. Sci. Med. Sci. 57, B189–B192
9 Elston, R.C. et al. (1996) Two-stage global search designs for linkage analysis using pairs of affected relatives. Genet. Epidemiol. 13, 535–558
10 Satagopan, J.M. et al. (2002) Two-stage designs for gene-disease association studies. Biometrics 58, 163–170
11 Zwick, M. et al. (2002) An empirical estimate of the number of SNPs required for a whole genome association study. Am. J. Hum. Genet. 71, 219
12 Kruglyak, L. (1999) Prospects for whole-genome linkage disequilibrium mapping of common disease genes. Nat. Genet. 22, 139–144
13 Van den Oord, E.J.C.G. and Sullivan, P. A framework for controlling false discovery rates and minimizing the amount of genotyping in the search for disease mutations. (in press) (http://www.vipbg.vcu.edu/~edwin)
14 Casci, T. (2002) Web watch: free associations. Nat. Rev. Genet. 3, 904
15 Tabor, H.K. (2002) Candidate-gene approaches for studying complex genetic traits: practical considerations. Nat. Rev. Genet. 3, 391–397
16 Efron, B. et al. (2001) Empirical Bayes analysis of a microarray experiment. J. Am. Stat. Assoc. 96, 1151–1160
17 Storey, J.D. (2002) A direct approach to false discovery rates. J. R. Stat. Soc. 64, 479–498
