research article robust association tests for the replication of...
TRANSCRIPT
Research ArticleRobust Association Tests for the Replication ofGenome-Wide Association Studies
Jungnam Joo1 Ju-Hyun Park2 Bora Lee1 Boram Park1 Sohee Kim1 Kyong-Ah Yoon3
Jin Soo Lee3 and Nancy L Geller4
1Biometric Research Branch Research Institute and Hospital National Cancer Center Gyeonggi-doGoyang-si 410-769 Republic of Korea2Department of Statistics Dongguk University Seoul 100-715 Republic of Korea3Lung Cancer Branch Research Institute and Hospital National Cancer Center Gyeonggi-do Goyang-si 410-769 Republic of Korea4Office of Biostatistics Research National Heart Lung and Blood Institute Bethesda MD 20892-7938 USA
Correspondence should be addressed to Jungnam Joo joojnccrekr
Received 14 November 2014 Revised 14 February 2015 Accepted 14 February 2015
Academic Editor Taesung Park
Copyright copy 2015 Jungnam Joo et al This is an open access article distributed under the Creative Commons Attribution Licensewhich permits unrestricted use distribution and reproduction in any medium provided the original work is properly cited
In genome-wide association study (GWAS) robust genetic association tests such as maximum of three CATTs (MAX3) eachcorresponding to recessive additive and dominant geneticmodels theminimum p value of Pearsonrsquos Chi-square test with 2 degreesof freedom and CATT based on additive genetic model (MIN2) genetic model selection (GMS) and genetic model exclusion(GME) methods have been shown to provide better power performance under wide range of underlying genetic models In thispaper we demonstrate how these robust tests can be applied to the replication study of GWAS and how the overall statisticalsignificance can be evaluated using the combined test formed by p values of the discovery and replication studies
1 Introduction
With the advance of biotechnology and substantial reduc-tion of genotyping costs a genome-wide association study(GWAS) using hundred thousand markers in several thou-sand individuals is now increasingly utilized and has beensuccessful in detecting genetic associations across the entiregenomewith complex human traits [1ndash6] Amongmany chal-lenges this application holds development of more efficientand robust statistical methodologies with higher power todetect an association with a single marker has been one ofthe most important statistical issues given that effects ofindividualmarkers are usually characterized as being small tomoderate One attempt to overcome this challenge is focusedondeveloping efficient tests that are robust against underlyinggenetic model misspecification
Two most frequently used association tests are the allele-based test (ABT) and the genotype-based test (GBT) ABTcompares the allele frequencies between cases and controlswhile GBT compares the genotype distributions of cases andcontrolsThe Cochran-Armitage trend test (CATT) [7 8] is a
popularGBTwhich takes into account the underlying geneticmodel It is well known however that the ABT may inflatetype I error when Hardy-Weinberg equilibrium (HWE) doesnot hold in the samples [9] Even under HWE when thegenetic model is recessive or dominant the ABT may sufferfrom serious power loss On the other hand the CATT doesnot depend on HWE but to apply the CATT the choice ofscores optimal for the underlying genetic model needs to bespecified For complex diseases the genetic model is usuallyunknown and robust tests such as the maximum of threeCATTs (MAX3) [10] and the maximum efficiency robusttest (MERT) [11 12] are preferable Alternatively Zheng andNg [13] and Joo et al [14] proposed a two-phase analysisbased on the genetic model selection (GMS) and geneticmodel exclusion (GME) Moreover an alternative approachwas proposed by theWellcome trust case-control consortium(WTCCC) [5] which used a minimum 119901 value of PearsonrsquosChi-square test and additive CATT and the asymptoticproperties of this approach were studied in detail by Jooet al [15]Thesemethods provide better or comparable powerperformance than some of the robust tests such as MAX3
Hindawi Publishing CorporationBioMed Research InternationalVolume 2015 Article ID 461593 10 pageshttpdxdoiorg1011552015461593
2 BioMed Research International
In this paper we illustrate how these robust tests canbe applied to a replication study of GWAS and how overallstatistical significance can be evaluated using the combinedtest formed by 119901 values of the discovery and replicationstudiesThe importance of replication or validation in GWAShas been well recognized [16 17] and joint analysis in atwo-stage design of GWAS has been proved to be morepowerful than replication-based analysis and has been widelyconducted in GWAS with a variety of phenotypes of interest[18 19]
The paper is organized as follows We first describethe data structures and notation and review existing robustassociation tests for a single data set Then we describe howto obtain the 119901 value for the replication data set given thesignificant result of the discovery stage using robust tests Inthe next section a combined test of the119901 values of the discov-ery and replication data sets is proposed together with theway to evaluate the statistical significance for the combinedtest Simulation studies are conducted to compare the typeI error rates and powers of various analytical strategies Forillustration purposes the summarized methods are appliedto a non-small-cell lung cancer data set and at the end thereis a discussion
2 Methods
21 Data and Notation For a marker with two alleles 119860 and119861 let the frequencies of 119861 in cases and controls be 119901 = 119875(119861 |
case) and 119902 = 119875(119861 | control) Denote three genotypes by1198660= 119860119860119866
1= 119860119861 and119866
2= 119861119861 In case-control association
studies 119903 cases and 119904 controls are independently sampledfrom each population The observed genotype counts for(1198660 1198661 1198662) are (119903
0 1199031 1199032) in the cases and (119904
0 1199041 1199042) in the
controls Disease prevalence is denoted by 119896 = 119875(disease)and penetrance by 119891
119894= 119875(disease | 119866
119894) for 119894 = 0 1 2 Two
genotype relative risks (GRRs) are denoted by 1205821= 11989111198910
and 1205822= 11989121198910using 119891
0gt 0 as baseline penetrance Under
the null hypothesis of no association 1198670 1198910= 1198911= 1198912= 119896
or alternatively1198670 1205822= 1205821= 1 Genetic model is recessive
(REC) additive (ADD)multiplicative (MUL) and dominant(DOM)when 120582
1= 1 120582
1= (1+120582
2)2 1205821= 12058212
2 and 120582
2= 1205821
respectively
22 Review of Association Tests for a Single Data Set Theassociation in case-control studies can be tested using variousmethods which have been extensively studied The generalassociation between the disease status and the SNP canbe tested using Pearsonrsquos Chi-square test which has anasymptotic Chi-square distributionwith 2 degrees of freedomunder119867
0 The test is given by
119879chi2 =2
sum
119895=0
(119903119895minus 119899119895119903119899)2
119899119895119903119899
+
2
sum
119895=0
(119904119895minus 119899119895119904119899)2
119899119895119904119899
(1)
where 119899119894= 119903119894+ 119904119894for 119894 = 0 1 2 and 119899 = 119903 + 119904 Under Hardy-
Weinberg equilibrium (HWE) an allele-based test (ABT) and
CATT with scores (0 119909 1) for (1198660 1198661 1198662) where 0 le 119909 le 1
are given by
119885ABT =11989912
2119903 (21199040+ 1199041) minus 2119904 (2119903
0+ 1199031)
2119903119904 (21198990+ 1198991) (1198991+ 21198992)12
119885119909=
11989912
sum2
119894=0119909119894(119904119903119894minus 119903119904119894)
[119903119904119899 119899sum2
119894=01199092119894119899119894minus (sum2
119894=0119909119894119899119894)2
]12
(2)
where (1199090 1199091 1199092) = (0 119909 1) [9] The optimal choices of 119909
for the recessive (REC) additivemultiplicative (ADDMUL)and dominant (DOM) models are 119909 = 0 12 and 1respectively [9 20] Both 119885
119909and 119885ABT asymptotically follow
a standard normal distribution under 1198670 119885119909can be used
even when HWE does not hold However without theHWE assumption 119885ABT does not follow a standard normaldistribution due to the correlation between two alleles
A robust test MAX3 proposed by Friedlin et al [10] canbe obtained by taking the maximum of three CATTs underthe three genetic models as MAX3 = max(|119885
0| |11988512| |1198851|)
Parametric bootstrap or permutationmethods can be used tofind the 119901 value of MAX3 [4]
Let the 119901 values of Pearsonrsquos Chi-square test and CATTunder the additive genetic model 119885
12be 119875chi2 and 119875
12
respectively WTCCC [5] proposed an alternative robust testMIN2 = min(119875chi2 11987512) Joo et al [15] derived the asymptoticnull distribution of MIN2 and using their result the 119901 valueof MIN2 can be obtained as
119875MIN2
=1
2exp minus1
2119867minus1
1(1 minusMIN2) + 1
2MIN2
minus1
2120587int
minus2 log(MIN2)
119867minus1
1
(1minusMIN2)119890minusV2 arcsin(
2119867minus1
1(1 minusMIN2)V minus 1
)119889V
(3)
where 1198671and 119867
2are the cumulative distributions of Chi-
square distributions with 1 and 2 degrees of freedomOn the other hand Song and Elston [21] considered a
Hardy-Weinberg disequilibrium trend test (HWDTT) givenby
119885119867=
(119903119904119899)12
(Δ119901minus Δ119902)
1 minus 1198992119899 minus 119899
1 (2119899) 119899
2119899 + 119899
1 (2119899)
(4)
where Δ119901= 1199012minus (1199012+ 11990112)2 and Δ
119902= 1199022minus (1199022+ 11990212)2
are the estimates of Δ119901and Δ
119902 where 119901
119894= 119903119894119903 and 119902
119894= 119904119894119904
Here Δ denotes the Hardy-Weinberg disequilibrium (HWD)coefficient defined by 119875119903(119861119861) minus 119875119903(119860119861)2 + 119875119903(119861119861)
2 andΔ119901andΔ
119902denote the HWD coefficient in cases and controls
respectivelyZheng and Ng [13] used the information contained in the
signs of (Δ119901 Δ119902) to determine the genetic models in their
two-phase method Their two-phase statistic 119885GMS is givenby 119885GMS = 119885
0if 119885119867
gt 119888 1198851if 119885119867
lt minus119888 and 11988512
other-wise where 119888 = Φ
minus1(1 minus 120572
119867) for 120572
119867= 005 The asymptotic
BioMed Research International 3
correlations between 119885119867and three CATTs under HWE were
derived and the significance level was adjusted accordingly tocontrol the desired type I error Based on the observation thatthis method assumes 119861 is the risk allele Joo et al [14] studiedthe behavior of 119885GMS when either one of the alleles can be arisk alleleThey chose the risk allele based on the sign of119885
12
that is if 11988512
gt 0 119861 is the risk allele and 1198850 11988512
and 1198851
are chosen for REC ADD and DOM models respectivelyIf 11988512
lt 0 the respective test statistics are chosen tobe minus119885
1 minus11988512
and minus1198850 They incorporate this property in
defining the test statistic for genetic model selection (119885GMS)and calculating the 119901 value Let Θ
0(119911) = 119911 119911 gt 119888
Θ12(119911) = 119911 |119911| lt 119888 and Θ
1(119911) = 119911 119911 lt minus119888 Then
the 119901 value of this method can be obtained by
119901GMS
= 2
1
sum
119909=0
intΘ119909
(119911119867
)
int
infin
0
int
infin
119905
120601119909(119911119909 11991112 119911119867) 11988911991111990911988911991112119889119911119867
+ 2
intΘ12
(119911)
Φ((minus119905 and 0) + 120588
12119911
(1 minus 120588212)12
)119889Φ (119911)
(5)
where 120588119909= Corr(119885
119909 119885119867) in (5) and 120588
11990912= Corr(119885
119909 11988512)
(119909 = 0 1) are replaced by their consistent estimates Here119905 = 119911GMS and (minus119905 and 0) = min(minus119905 0) Moreover 119911GMS and 11991112are the observed values of 119885GMS and 11988512 respectively
While studying the properties of GMS Joo et al [14]noticed that the probability of selecting the true recessive ordominant models using 119885
119867is very low especially for low
to moderate GRRs but the unlikely genetic model can besuccessfully excluded This led to genetic model exclusionmethod119885GME which is the same as the119885GMS described aboveexcept 119885
119909for 119909 = 0 12 1 is replaced by 119885lowast
119909where 119885lowast
119909=
(119885119909+11988512)2(1 +120588
11990912)12 And the 119901 value of GME can be
obtained as
119901GME
= 2
1
sum
119909=0
intΘ119909
(119911119867
)
int
infin
0
int
infin
119871
120601119909(119911119909 11991112 119911119867) 11988911991111990911988911991112119889119911119867
+ 2
intΘ12
(119911)
Φ((minus119905 and 0) + 120588
12119911
(1 minus 120588212)12
)119889Φ (119911)
(6)
where 119871 = 1199052(1 + 12058811990912
)12
minus 11991112
for 119905 = 119911GME
23 119901 Value of Replication Data Using the Robust Method Inthe discovery stage the 119901 value of robust association testsincluding MAX3 MIN2 119885GMS and 119885GME can be obtainedas described in Section 22 For the 119901 value of replication datausing the robust method we use the same analytic methodthat was used for discovery and the risk allele identified byit [16] This means that when the best test statistic or geneticmodel is selected in the discovery stage the replication stage
will adopt the discovery stage selection and the direction ofassociation
Suppose that for simplicity of notation our interest isin GWAS with two stages one for discovery and the otherfor replication although the methodology described belowcan be extended to multistages for replication Let 119885(119894)
119909for
119909 = 0 12 1 be the CATT optimal for recessive additiveand dominant models and let 119875(119894)
119909be corresponding 119901 value
for 119894th stage (119894 = 1 for discovery and 119894 = 2 for replicationstages) Also denote 119885(119894)lowast
119909= (119885(119894)
119909+ 119885(119894)
12)2(1 + 120588
(119894)
11990912)12
for 119909 = 0 12 1 Then for CATT with a preselected geneticmodel 119875(2)
119909= 1minusΦ(sign(119885(1)
119909sdot119885(2)
119909) sdot |119885(2)
119909|) using a one-sided
119901 value given the direction of association from the discoverystage and119875(2)lowast
119909= 1minusΦ(sign(119885(1)lowast
119909sdot119885(2)lowast
119909) sdot |119885(2)lowast
119909|) Moreover
denote the test statistics and 119901 values using Pearsonrsquos Chi-square test from the 119894th stage as 119879(119894)chi2 and 119875
(119894)
chi2 Further letHWDTT from the 119894th stage be 119885(119894)
119867 Then the second stage
119901 values using MAX3 MIN2 119885GMS and 119885GME denoted as119875(2)
MAX3 119875(2)
MIN2 119875(2)
GMS and 119875(2)
GME can be obtained as follows
119875(2)
MAX3 = 119875(2)
119909lowast
where 119909lowast = arg min119909isin0121
119875(1)
119909
119875(2)
MIN2 = 119875(2)
12sdot 119868 (119875(1)
12le 119875(2)
chi2) + 119875(2)
chi2 sdot 119868 (119875(1)
12gt 119875(2)
chi2)
119875(2)
GMS = 119875(2)
0119868 (119885(1)
119867gt 119888) + 119875
(2)
1119868 (119885(1)
119867lt minus119888)
+ 119875(2)
12119868 (10038161003816100381610038161003816119885(1)
119867
10038161003816100381610038161003816le 119888) if 119885(1)
12gt 0
= 119875(2)
0119868 (119885(1)
119867lt minus119888) + 119875
(2)
1119868 (119885(1)
119867gt 119888)
+ 119875(2)
12119868 (10038161003816100381610038161003816119885(1)
119867
10038161003816100381610038161003816le 119888) if 119885(1)
12le 0
119875(2)
GME = 119875(2)lowast
0119868 (119885(1)
119867gt 119888) + 119875
(2)lowast
1119868 (119885(1)
119867lt minus119888)
+ 119875(2)lowast
12119868 (10038161003816100381610038161003816119885(1)
119867
10038161003816100381610038161003816le 119888) if 119885(1)
12gt 0
= 119875(2)lowast
0119868 (119885(1)
119867lt minus119888) + 119875
(2)lowast
1119868 (119885(1)
119867gt 119888)
+ 119875(2)lowast
12119868 (10038161003816100381610038161003816119885(1)
119867
10038161003816100381610038161003816le 119888) if 119885(1)
12le 0
(7)
It is important to note that even though the direction ofthe test statistics and the selected genetic models are usedto obtain the second stage 119901 values the 119901 values from thetwo stages are independent under the null hypothesis Thisis because under the null hypothesis the probability of 119885
12
being positive or negative is simply 12 and the probabilityof the selection of a certain genetic model is also a constant(120572119867for the recessive and dominant models and 1 minus 2120572
119867for
the additive model)
24 Combined Test Using 119901 Values and Its Statistical Signif-icance For a given robust test we can consider the jointanalysis by combining 119901 values from the discovery andreplication stages of GWAS We consider using 119901 valuesrather than the test statistics because test statistics can have
4 BioMed Research International
Table 1 Type I error rates of three approachesmdashreplication-based (REP) test Fisherrsquos combination (119885FC) and linear combination of test(119885LC)mdashbased on the CATT with an additive model (119885
12) 1205942 MAX3 MIN2 GMS and GME The disease prevalence 119870 = 01 119872 = 10
markers 119903 = 1 500 cases and 119904 = 1 500 controls are considered based on 20000 simulations
MAF 120587119904
120572119863
119865 = 0
11988512
1205942 MAX3 MIN2 GMS GME
03 05
005REP 000530 000455 000505 000485 00050 000490119885FC 000500 000535 000495 000510 000515 000460119885LC 000535 000525 000485 000510 00050 000485
01REP 000510 000560 000565 000485 000525 000545119885FC 000565 000535 000565 000545 000565 000540119885LC 000520 000565 000525 000520 000530 000525
03 06
005REP 000510 000485 000480 000515 000480 000500119885FC 000445 000455 000450 000455 000450 000460LC 000500 000515 000495 000520 000475 000480
01REP 000500 000485 000485 000535 000530 000505119885FC 000465 000490 000485 000500 000455 000460119885LC 000480 000470 000515 000490 000485 000475
04 05
005REP 000590 000505 000530 000565 000505 000510119885FC 000575 000430 000460 000535 000460 000500119885LC 000600 000445 000500 000540 000490 000490
01REP 000525 000470 000535 000450 000480 000515119885FC 000515 000510 000495 000475 000540 000500119885LC 000530 000500 000485 000475 000495 000510
04 06
005REP 000475 000585 000480 000500 000515 000495119885FC 000460 000470 000420 000490 000455 000440119885LC 000525 000550 000520 000580 000510 000510
01REP 000550 000490 000515 000535 000555 000540119885FC 000520 000370 000495 000450 000515 000510119885LC 000565 000485 000570 000530 000610 000580
complex forms and obtaining the distribution of the joint testcan be difficult On the other hand calculating a 119901 value foreach data set might be relatively simple and the distributionof 119901 values under the null hypothesis of no association is easyto handle
There are several methods for combining test statisticsfrom two stages [22] and twomost commonly used forms arebased on Fisherrsquos combination and a linear combination afterinverse normal transformation [23] Fisherrsquos combination(FC) directly sums 119901 values after minus2 log transformation thatis 119885FC = minus2119908
1log(119875(1)) minus 2119908
2log(119875(2)) where 119875(119894) is 119901 value
from 119894 = 0 for discovery and 119894 = 1 for replication stagesusing a given robust test A specification of 119908
1= 1199082= 1
gives the same weight for discovery and replication stagesand one can consider 119908
1= 2120587
119904and 119908
2= 2(1 minus 120587
119904)
where 120587119904= 119873119863(119873119863+ 119873119877) and 119873
119863and 119873
119877are sample
sizes of the discovery and replication data sets A linearcombination of two 119875 values after taking the inverse of thestandard normal cumulative distribution is given by 119885LC =
1199081Φminus1(1minus119875(1)2)+119908
2Φminus1(1minus119875(2))radic1199082
1+ 11990822with a natural
choice of 1199081= radic120587119904 and 1199082 = radic1 minus 120587
119904 Let the significance
level of the discovery stage be 120572119863 which means that markers
with 119875(1) lt 120572119863are selected and replicated in the replication
stage The 119901 value of combined test can then be obtained as
119901FC = 1198751198670
(119875(1)
lt 120572119863 119885FC gt 119911FC)where the observed value of
119885FC is 119911FCThe119901FC are calculated as 119890minus119911FC2(1+119911FC2+log120572119863)
for equal weights where 119911FC gt minus2 log120572119863and (119908
1(1199081minus
1199082))119890minus119911FC21199081minus(119908
2(1199081minus1199082))119890minus119911FC21199082120572
minus(1199081
minus1199082
)1199082
119889for unequal
weights where 119911FC gt minus21199081log120572119863 Detailed derivations are
described in the Appendix Equivalently for an overall typeI error threshold for a single marker of 120572 one may obtainthe threshold 119862FC of 119885FC that satisfies 119875
1198670
(119875(1)
lt 120572119863 119885FC gt
119862FC) le 120572 Similarly for the 119885LC the 119901 value is calculatedas 119901LC = 119875
1198670
(119875(1)
lt 120572119863 119885LC gt 119911LC) = int
infin
1199111minus120572
119863
2
120601(119911)[1 minus
Φ((radic11990821+ 11990822119911LC minus 1199081119911)1199082)]119889119911 for 119911LC gt 119911
1minus120572119863
2where the
observed value of 119885LC = 119911LC
3 Simulation Results
31 Type I Error Table 1 provides the type I errors underdifferent scenarios A disease prevalence of 10 is assumedand a total of 1500 cases and 1500 controls were divided intotwo stages The proportions of samples in the first stage (120587
119904)
of 05 and 06 were considered for the minor allele frequency(MAF) of 03 and 04 We considered 119872 = 10 markers tocontrol the genome-wide false positive rate at 120572 = 005 withthe Bonferroni correction We did not consider a larger 119872
BioMed Research International 5
such as 300000 or 500000 because this will require morethan 10 million simulations to show a stable estimate of thetype I error rate With 119872 = 10 we performed 20000simulations which result in less than 10 of a coefficientof variation for a significance level 005119872 = 0005 foreach marker [24] The test statistics considered are 119885
12
Pearsonrsquos Chi-square testMIN2MAX3GMS andGME Forthe second stage analysis we considered a replication-basedanalysis 119885FC and 119885LC as proposed above The results arebased on the situation under HWE (HWE coefficient 119865 = 0)As expected all tests control the type I error reasonablly welland similar results were obtained when a slight deviationfrom HWE is present with 119865 = 005 (results not shown)
32 Empirical Power We examined the empirical powers ofdifferent tests considered above In Figure 1 we considered119872 = 10 markers a disease prevalence of 10 the samegenotype relative risk for two stages (119903
1= 14 and 119903
2= 14)
and 1000 cases and 1000 controls 2000 simulations wereperformed under HWE (119865 = 0) to control the genome-wide false positive rate at 120572 = 005 The recessive additiveand dominant models were assumed for the first secondand third rows Both joint analyses showed better powerperformances compared to the replication-based analysis (upto 159 in scenarios considered in Figure 1) and LC and FChave comparable powers with less than 2 difference Thepower gain of using the joint analysis is not as much as thatobserved in Skol et al [18] However as reported by Skolet al [18] when the between-stage heterogeneity exists andthe risk allele has a larger effect in the first stage than thatin the second stage much improved power is observed byusing the joint test Figure 2 shows results under this scenariowith 119903
1= 16 and 119903
2= 14 and the observed increase
in power using the joint test is as high as 339 Againthe difference between LC and FC is minor with less than3 difference As for comparison between different robustmethods MAX3 GMS and GME perform well under therecessive model while 119885
12 1205942 and MIN2 are less powerful
Under the additivemodel11988512
is most powerful as expectedand 1205942 is least powerful Other robust methods perform wellwith a slight decrease in power compared to 119885
12 Under the
dominant model MAX3 GMS and GME perform the besteven though all tests show good power performances and thedifference is minor Similar patterns were observed when aslight deviation from theHWE is present (results not shown)
4 Real Data Application
The GWAS on non-small-cell lung cancer (NSCLC) by Yoonet al [25] studied 621 NSCLC patients and 1541 controlsubjects in the discovery stage After stringent quality controlsteps a total of 246758 SNPs were tested for the associationwith NSCLC based on119885
12 In the replication stage 168 SNPs
with 119901 value less than 1 times 10minus4 in the first stage based on 11988512
were tested using 804 patients and 1470 control samples Weidentified additional 234 SNPs using MIN2 in the first stagewhich could be studied in the replication stage if MIN2 wasused instead of 119885
12since MIN2 produces stronger evidence
for the additional SNPs than 11988512
does The Manhattan plotsof using MIN2 and 119885
12are presented in Figure 3 One
example is 119903119904385272 located in chromosome 2 which hada 119901 value of 137 times 10
minus7 which reached significance level atBonferroni correction in discovery samples alone whereas11988512
yielded a119901 value greater than 1times10minus4 Even though thereis possibility of false positive findings these SNPs could havebeen selected for replication if robust methods were used
Since we do not have replication data for these additionalSNPs selected using MIN2 because the first stage selectionwas based on 119885
12in Yoon et al [25] just for illustration
purpose of the proposed methods we present the results ofthree SNPs including 1199031199042131877 that was reported by Yoonet al [25] When the significance level in the discovery stageis set at 120572
119863= 5 times 10
minus5 so that all these exemplary SNPs canbe selected in the discovery stage the 119901 value of combinedtest based on four robust methods (MAX3 MIN2 GMSand GME) as well as 119885
12and Pearsonrsquos Chi-square test is
presented in Table 2 Fisherrsquos combination was used for thejoint test in the second stage Only 1199031199042131877was found to besignificant with Bonferroni correction (119901 value lt 203times10
minus7)by all except MAX3method
5 Discussion
In genetic association studies efficiency robust tests whoseperformance does not depend on the underlying geneticmodel have been extensively studied and their power benefitover a wide range of geneticmodels has been well recognizedIn this paper we described how the idea of these robustassociation tests can be applied to the replication studies andfurther how overall statistical significance can be evaluatedusing the combined test formed by 119901 values of the discoveryand replication studies
When the robust tests are used the test statistic ofeach stage can have a complex form and thus dealing withthe distribution of the joint test can be difficult whereascalculating the 119901 value of each stage might be relativelysimple Because the asymptotic distribution of the 119901 valueunder the null hypothesis of no association is easy to handlethe combined test using 119901 values rather than the test statisticsthemselves can provide computational convenience
There are several methods for combining test statisticsfrom two stages and Won et al [22] compared the per-formances of various choices Two most commonly usedforms are based on Fisherrsquos combination and the linearcombination after the inverse normal transformation [23]and we presented the test statistics and 119901 values of these twomethods In our limited experience the linear combinationand Fisherrsquos combination are fairly comparable Fisherrsquoscombination seems to perform slightly better than the linearcombination when there exists some heterogeneity betweenstages in terms of the genotype relative risk while the linearcombination seems to perform slightly better in most ofother situations However the difference is extremely minorFurther research is required for the thorough comparison ofvarious methods of combining 119901 values in the application ofefficiency robust tests to the replication of genetic associationstudies
6 BioMed Research International
Recessive
Additive
Dominant
(05 03) (05 04) (06 03) (06 04)
(120587sMAF)
Empi
rical
pow
er
10
08
06
04
02
00
(05 03) (05 04) (06 03) (06 04)
(120587sMAF)Em
piric
al p
ower
10
08
06
04
02
00
(05 03) (05 04) (06 03) (06 04)
(120587sMAF)
Empi
rical
pow
er
10
08
06
04
02
00
(05 03) (05 04) (06 03) (06 04)
(120587sMAF)
Empi
rical
pow
er
10
08
06
04
02
00
(05 03) (05 04) (06 03) (06 04)
(120587sMAF)
Empi
rical
pow
er
10
08
06
04
02
00
(05 03) (05 04) (06 03) (06 04)
(120587sMAF)
Empi
rical
pow
er
10
08
06
04
02
00
(05 03) (05 04) (06 03) (06 04)
(120587sMAF)
Empi
rical
pow
er
10
08
06
04
02
00
(05 03) (05 04) (06 03) (06 04)
(120587sMAF)
Empi
rical
pow
er
10
08
06
04
02
00
(05 03) (05 04) (06 03) (06 04)
(120587sMAF)
Empi
rical
pow
er
10
08
06
04
02
00
Z05
1205942
MAX3
ZFC
ZFC
ZFC
ZLC
ZLC
ZLC
MIN2GMSGME
MIN2GMSGME
MIN2GMSGME
Z05
1205942
MAX3
Z05
1205942
MAX3
Replication-based test
Replication-based test
Replication-based test
Figure 1 Empirical powers based on 2000 simulations for119872 = 10markers genotype relative risks of both stages = 14 and disease prevalence119870 = 01 under the recessive additive and dominant models 1000 cases and 1000 controls are considered to control 120572 = 005 The first stagetype I error rate for discovery is 120572
119863= 005 Six test statistics 119885
12 1205942 MAX3 MIN2 GMS and GME are considered The first second and
third columns depict powers using the replication-based test 119885FC and 119885LC respectively
BioMed Research International 7
Replication-based test
Recessive
Additive
Dominant
(05 03) (05 04) (06 03) (06 04)
(120587sMAF)
Empi
rical
pow
er
10
08
06
04
02
00
(05 03) (05 04) (06 03) (06 04)
(120587sMAF)Em
piric
al p
ower
10
08
06
04
02
00
(05 03) (05 04) (06 03) (06 04)
(120587sMAF)
Empi
rical
pow
er
10
08
06
04
02
00
(05 03) (05 04) (06 03) (06 04)
(120587sMAF)
Empi
rical
pow
er
10
08
06
04
02
00
(05 03) (05 04) (06 03) (06 04)
(120587sMAF)
Empi
rical
pow
er
10
08
06
04
02
00
(05 03) (05 04) (06 03) (06 04)
(120587sMAF)
Empi
rical
pow
er
10
08
06
04
02
00
(05 03) (05 04) (06 03) (06 04)
(120587sMAF)
Empi
rical
pow
er
10
08
06
04
02
00
(05 03) (05 04) (06 03) (06 04)
(120587sMAF)
Empi
rical
pow
er
10
08
06
04
02
00
(05 03) (05 04) (06 03) (06 04)
(120587sMAF)
Empi
rical
pow
er
10
08
06
04
02
00
Z05
1205942
MAX3
ZFC
ZFC
ZFC
ZLC
ZLC
ZLC
MIN2GMSGME
MIN2GMSGME
MIN2GMSGME
Z05
1205942
MAX3
Z05
1205942
MAX3
Replication-based test
Replication-based test
Figure 2 Empirical powers based on 2000 simulations for 119872 = 10 markers genotype relative risks of two stages are different (1199031= 16
1199032= 14) disease prevalence 119870 = 01 under the recessive additive and dominant models 1000 cases and 1000 controls are considered to
control 120572 = 005 The first stage type I error rate for discovery is 120572119863= 005 Six test statistics 119885
12 1205942 MAX3 MIN2 GMS and GME are
considered The first second and third columns depict powers using the replication-based test 119885FC and 119885LC respectively
8 BioMed Research International
minuslo
g 10(120588)
Chromosome1 2 3 4 5 6 7 8 9 11 13 15 17 19 21 23
8
6
4
2
0
MIN2 P value
(a)
minuslo
g 10(120588)
Additive P value8
6
4
2
0
Chromosome1 2 3 4 5 6 7 8 9 11 13 15 17 19 21 23
(b)
Figure 3 Manhattan plots of 246758 SNPs from Yoon et al [25] based on MIN2 (a) and 11988512
(b) The 119909 axis is chromosomal location andthe 119910 axis is the significance (minuslog
10119875) of association The horizontal line corresponds to the significance level 10minus4
Table 2 For selected exemplary three SNPs for testing association with NSCLC 119901 value of combined test using additive CATT (11988512)
Pearsonrsquos Chi-square test (119879chi2) MAX3 MIN2 119885GMS and 119885GME
SNP 119901 value of 11988512
119901 value of 119879chi2Discovery Replication Combined test Discovery Replication Combined test
rs2131877 788 times 10minus5
104 times 10minus4
797 times 10minus8
140 times 10minus4
149 times 10minus4
184 times 10minus7
rs905551 183 times 10minus5
702 times 10minus3
770 times 10minus6
806 times 10minus5
489 times 10minus2
140 times 10minus5
rs1695109 248 times 10minus4
346 times 10minus2
217 times 10minus6
456 times 10minus5
153 times 10minus1
207 times 10minus5
SNP 119901 value of MAX3 119901 value of MIN2Discovery Replication Combined test Discovery Replication Combined test
rs2131877 153 times 10minus4
405 times 10minus2
192 times 10minus5
132 times 10minus4
104 times 10minus4
126 times 10minus7
rs905551 450 times 10minus5
702 times 10minus3
192 times 10minus6
134 times 10minus4
489 times 10minus2
199 times 10minus5
rs1695109 354 times 10minus5
263 times 10minus2
464 times 10minus6
236 times 10minus5
263 times 10minus2
335 times 10minus6
SNP 119901 value of 119885GMS 119901 value of 119885GME
Discovery Replication Combined test Discovery Replication Combined testrs2131877 186 times 10
minus4104 times 10
minus4171 times 10
minus7103 times 10
minus4104 times 10
minus4102 times 10
minus7
rs905551 519 times 10minus5
702 times 10minus3
216 times 10minus6
735 times 10minus5
801 times 10minus3
320 times 10minus6
rs1695109 689 times 10minus4
127 times 10minus1
385 times 10minus5
269 times 10minus5
419 times 10minus2
540 times 10minus6
In a genetic study where the purpose of considering areplication stage is to validate or replicate the genetic findingsfrom the discovery stage which is the case considered inthis paper the analysis in the replication stage utilized thetest statistic or genetic model that is selected as being thebest in the discovery stage and also the direction of the riskallele following guidelines for exact replication in geneticassociation studies If the purpose is to simply combine theevidence fromdifferent data sources such as inmeta-analysisother strategies may be devised Further research again isrequired to provide fully detailed properties of suchmethods
Power gain of a joint analysis over the conventionalreplication-based analysis was thoroughly studied by Skol etal [18 19] In our simulation the amount of power increaseusing a joint test compared to the replication-based analysiswas much minor than what was observed by Skol et al[18 19] The exact reason is not known but we suspect thismight be due to the power advantages of robust methodsand also due to the fact that the optimal choice from thefirst stage is used when calculating the second stage 119901 values
However even though it was minor in some situationsthe joint anlysis presented better power performance thanthe replication-based analysis in our study This type ofjoint analysis raised concerns about the exact meaning ofreplication [17] However McCarthy et al [26] mentionedthat joint analyses ldquoblur the boundaries of where exactlyreplication starts but whichever analytical approach is takenconfirmation in many independent samples is important andit is the overall strength of the evidence of association thatmattersrdquo Purpose of the current study was to present howthe overall strength of the evidence of association can beevaluated when robust tests are used in GWAS replicationstudies
We illustrated how the proposed methods can be appliedin the real data that studied the association of SNPs with non-small-cell lung cancer (NSCLC) in discovery and replicationstages In the original study reported by Yoon et al [25]SNPs were selected in the discovery data set not based onthe robust tests but based on additive CATT Therefore wefound that some SNPs could have been selected by one
BioMed Research International 9
of the robust methods but they were not included in thereplication data set For these SNPs we were not able toperform the joint analysis that we propose and it was notpossible to examine whether there are other SNPs that couldhave been found to be associated with NSCLC by proposedmethods in the replication study For this reason we merelypresented howmany additional SNPs could have been furtherfollowed in the replication stage when robust methods wereused In many GWASs it is a common practice to reportthe summary test statistics and 119901 values of the SNPs undera specific genetic model usually an additive model whichwere further genotyped in the replication stage and werefinally defined to be significantly associated with a phenotypeof interest As emphasized in this paper one may have abetter chance of findingmanymissing SNPs by applyingmorepowerful and robust methods that consider different geneticmodels simultaneouslyTherefore we urge the community toshare test results under not only an additive model but alsoother genetic models although they were not significant at astringent significance level so that future research may haveenriched data resources to which robust tests can be appliedin association studies
Appendix
119901 value of Fisherrsquos Combination forEqual and Unequal Weights
Equal Weights Assume 1199081
= 1199082
= 1 Under the nullhypothesis of no association 119883
1= minus2 log119875(1) and 119883
2=
minus2 log119875(2) are independent and each asymptotically followsa 1205942 distribution with 2 degrees of freedom Let 119891
119896(119909) and
119865119896(119909) be the probability and cumulative density functions of
1205942 random variable with 119896 degrees of freedomThen 119891
2(119909) =
exp(minus1199092)2 1198914(119909) = 119909 exp(minus1199092)4 119865
2(119909) = 1 minus exp(minus1199092)
and1198654(119909) = 1minusexp(minus1199092)minus119909 exp(minus1199092)2 Denote the cutoff
of the discovery stage based on 120572119863as 119862119863 that is 119865
2(119862119863) =
1 minus 120572119863 For observed value 119911FC gt 119862
119863of1198831+ 1198832 the 119901 value
is written as
1198751198670
(1198831gt 119862119863 1198831+ 1198832gt 119911FC)
= 120572119863minus int
119911FC
119862119863
1198912 (119909) 1198652 (119911FC minus 119909) 119889119909
= 120572119863+ exp (minus
119911FC2) minus 120572119863+1
2exp(minus
119911FC2) (119911FC minus 119862119863)
= exp (minus119911FC2) (1 +
119911FC2
+ log120572119863)
(A1)
Unequal WeightsWhen different proportions of samples areused in the discovery and replication stages it may be moreappropriate to assignweights proportional to the sample sizesfor each stage For example when only a small portion isused in the discovery stage to prevent Fisherrsquos combinationtest from being dominated by the significant result in the
discovery stage one may want to assign a small weight to thediscovery stage result
When 120587119904is the proportion of samples used in the dis-
covery stage one selection for weights is 1199081
= 2120587119904and
1199082= 2(1 minus 120587
119904) for discovery and replication stages which
simplifies to equal weights when 120587119904= 05 Based on these
weights we consider unequal-weighted Fisherrsquos combinationasminus2 log119875(1)1199081119875(2)1199082 = 119908
11198831+11990821198832[27] Its density function
is given by
119891119908 (119909) =
1
2 (1199081minus 1199082)exp(minus 119909
(21199081))
minus1
2 (1199081minus 1199082)exp(minus 119909
(21199082))
(A2)
and the probability distribution function is
119865119908 (119909) = 1 minus
1199081
(1199081minus 1199082)exp(minus 119909
21199081
)
minus1199082
(1199081minus 1199082)exp(minus 119909
21199082
) 1199081
= 1199082
(A3)
Using the previous notation we have the following formof 119901 value
1198751198670
(1198831gt 119862119863 11990811198831+ 11990821198832gt 119911FC)
= 120572119863minus int
119911FC1199081
119862119863
1198912 (119909) 1198652 (
119911FC minus 1199081119909
1199082
)119889119909
= exp(minus119911FC21199081
) +1199082
1199081minus 1199082
exp(minus119911FC21199082
)
times exp(1199081 minus 1199082211990811199082
119911FC) minus exp(1199081 minus 1199082211990811199082
1199081119862119863)
=1199081
1199081minus 1199082
exp(minus119911FC21199081
)
minus1199082
1199081minus 1199082
exp(minus119911FC21199082
)120572minus(1199081
minus1199082
)1199082
119863
=1199081
1199081minus 1199082
exp(minus119911FC21199081
)
minus1199082
1199081minus 1199082
exp(minus119911FC21199082
)120572minus(1199081
minus1199082
)1199082
119863
(A4)
where 119911FC1199081 gt 119862119863
Conflict of Interests
The authors declare that there is no conflict of interestsregarding the publication of this paper
10 BioMed Research International
Acknowledgments
The authors are indebted to late Dr Gang Zheng for his inspi-ration and support on their workThisworkwas supported bygrant of the National Cancer Center (no NCC-1210060)
References
[1] R J Klein C Zeiss E Y Chew et al ldquoComplement factor Hpolymorphism in age-related macular degenerationrdquo Sciencevol 308 no 5720 pp 385ndash389 2005
[2] R H Duerr K D Taylor S R Brant et al ldquoA genome-wideassociation study identifies IL 23R as an inflammatory boweldisease generdquo Science vol 314 no 5804 pp 1461ndash1463 2006
[3] A Herbert N P Gerry M B McQueen et al ldquoA commongenetic variant is associated with adult and childhood obesityrdquoScience vol 312 no 5771 pp 279ndash283 2006
[4] R Sladek G Rocheleau J Rung et al ldquoA genome-wideassociation study identifies novel risk loci for type 2 diabetesrdquoNature vol 445 no 7130 pp 881ndash885 2007
[5] P R Burton D G Clayton L R Cardon et al ldquoGenome-wideassociation study of 14000 cases of seven common diseases and3000 shared controlsrdquo Nature vol 447 no 7145 pp 661ndash6782007
[6] S H Kwak S H Kim Y M Cho et al ldquoA genome-wideassociation study of gestational diabetes mellitus in Koreanwomenrdquo Diabetes vol 61 no 2 pp 531ndash541 2012
[7] W G Cochran ldquoSomemethods for strengthening the common1205942 testsrdquo Biometrics vol 10 no 4 pp 417ndash451 1954
[8] P Armitage ldquoTests for linear trends in proportions and frequen-ciesrdquo Biometrics vol 11 no 3 pp 375ndash386 1955
[9] P D Sasieni ldquoFrom genotypes to genes doubling the samplesizerdquo Biometrics vol 53 no 4 pp 1253ndash1261 1997
[10] B Freidlin G Zheng Z Li and J L Gastwirth ldquoTrend tests forcase-control studies of genetic markers power sample size androbustnessrdquo Human Heredity vol 53 no 3 pp 146ndash152 2002
[11] J L Gastwirth ldquoOn robust proceduresrdquo Journal of the AmericanStatistical Association vol 61 no 316 pp 929ndash948 1966
[12] J L Gastwirth ldquoThe use of maximin efficiency robust tests incombining contingency tables and survival analysisrdquo Journal ofthe American Statistical Association vol 80 no 390 pp 380ndash384 1985
[13] G Zheng and H K T Ng ldquoGenetic model selection in two-phase analysis for case-control association studiesrdquo Biostatisticsvol 9 no 3 pp 391ndash399 2008
[14] J Joo M Kwak and G Zheng ldquoImproving power for testinggenetic association in case-control studies by reducing thealternative spacerdquo Biometrics vol 66 no 1 pp 266ndash276 2010
[15] J Joo M Kwak K Ahn and G Zheng ldquoA robust genome-widescan statistic of theWellcome Trust Case-Control ConsortiumrdquoBiometrics vol 65 no 4 pp 1115ndash1122 2009
[16] P Kraft E Zeggini and J P A Ioannidis ldquoReplication ingenome-wide association studiesrdquo Statistical Science vol 24 no4 pp 561ndash573 2009
[17] D CThomas G Casey D V Conti R W Haile J P Lewingerand D O Stram ldquoMethodological issues in multistage genome-wide association studiesrdquo Statistical Science vol 24 no 4 pp414ndash429 2009
[18] A D Skol L J Scott G R Abecasis and M Boehnke ldquoJointanalysis is more efficient than replication-based analysis for
two-stage genome-wide association studiesrdquo Nature Geneticsvol 38 no 2 pp 209ndash213 2006
[19] A D Skol L J Scott G R Abecasis andM Boehnke ldquoOptimaldesigns for two-stage genome-wide association studiesrdquoGeneticEpidemiology vol 31 no 7 pp 776ndash788 2007
[20] G Zheng B Freidlin Z Li and J L Gastwirth ldquoChoice ofscores in trend tests for case-control studies of candidate-geneassociationsrdquo Biometrical Journal vol 45 no 3 pp 335ndash3482003
[21] K Song and R C Elston ldquoA powerful method of combiningmeasures of association and Hardy-Weinberg Disequilibriumfor fine-mapping in case-control studiesrdquo Statistics in Medicinevol 25 no 1 pp 105ndash126 2006
[22] S Won N Morris Q Lu and R C Elston ldquoChoosing anoptimal method to combine P-valuesrdquo Statistics in Medicinevol 28 no 11 pp 1537ndash1553 2009
[23] F BegumDGhoshGC Tseng andE Feingold ldquoComprehen-sive literature review and statistical considerations for GWASmeta-analysisrdquo Nucleic Acids Research vol 40 no 9 pp 3777ndash3784 2012
[24] B Efron and R J Tibshirani An Introduction to the BootstrapChapman amp HallCRC Press 1993
[25] K A Yoon J H Park J Han et al ldquoA genome-wide associationstudy reveals susceptibility variants for non-small cell lungcancer in the Korean populationrdquo Human Molecular Geneticsvol 19 no 24 pp 4948ndash4954 2010
[26] M I McCarthy G R Abecasis L R Cardon et al ldquoGenome-wide association studies for complex traits consensus uncer-tainty and challengesrdquoNature Reviews Genetics vol 9 no 5 pp356ndash369 2008
[27] I J Good ldquoOn the weighted combination of significance testsrdquoJournal of the Royal Statistical Society Series B vol 17 pp 264ndash265 1955
Submit your manuscripts athttpwwwhindawicom
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Anatomy Research International
PeptidesInternational Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporation httpwwwhindawicom
International Journal of
Volume 2014
Zoology
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Molecular Biology International
GenomicsInternational Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
BioinformaticsAdvances in
Marine BiologyJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Signal TransductionJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
BioMed Research International
Evolutionary BiologyInternational Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Biochemistry Research International
ArchaeaHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Genetics Research International
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Advances in
Virolog y
Hindawi Publishing Corporationhttpwwwhindawicom
Nucleic AcidsJournal of
Volume 2014
Stem CellsInternational
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Enzyme Research
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
International Journal of
Microbiology
2 BioMed Research International
In this paper we illustrate how these robust tests canbe applied to a replication study of GWAS and how overallstatistical significance can be evaluated using the combinedtest formed by 119901 values of the discovery and replicationstudiesThe importance of replication or validation in GWAShas been well recognized [16 17] and joint analysis in atwo-stage design of GWAS has been proved to be morepowerful than replication-based analysis and has been widelyconducted in GWAS with a variety of phenotypes of interest[18 19]
The paper is organized as follows We first describethe data structures and notation and review existing robustassociation tests for a single data set Then we describe howto obtain the 119901 value for the replication data set given thesignificant result of the discovery stage using robust tests Inthe next section a combined test of the119901 values of the discov-ery and replication data sets is proposed together with theway to evaluate the statistical significance for the combinedtest Simulation studies are conducted to compare the typeI error rates and powers of various analytical strategies Forillustration purposes the summarized methods are appliedto a non-small-cell lung cancer data set and at the end thereis a discussion
2 Methods
21 Data and Notation For a marker with two alleles 119860 and119861 let the frequencies of 119861 in cases and controls be 119901 = 119875(119861 |
case) and 119902 = 119875(119861 | control) Denote three genotypes by1198660= 119860119860119866
1= 119860119861 and119866
2= 119861119861 In case-control association
studies 119903 cases and 119904 controls are independently sampledfrom each population The observed genotype counts for(1198660 1198661 1198662) are (119903
0 1199031 1199032) in the cases and (119904
0 1199041 1199042) in the
controls Disease prevalence is denoted by 119896 = 119875(disease)and penetrance by 119891
119894= 119875(disease | 119866
119894) for 119894 = 0 1 2 Two
genotype relative risks (GRRs) are denoted by 1205821= 11989111198910
and 1205822= 11989121198910using 119891
0gt 0 as baseline penetrance Under
the null hypothesis of no association 1198670 1198910= 1198911= 1198912= 119896
or alternatively1198670 1205822= 1205821= 1 Genetic model is recessive
(REC) additive (ADD)multiplicative (MUL) and dominant(DOM)when 120582
1= 1 120582
1= (1+120582
2)2 1205821= 12058212
2 and 120582
2= 1205821
respectively
22 Review of Association Tests for a Single Data Set Theassociation in case-control studies can be tested using variousmethods which have been extensively studied The generalassociation between the disease status and the SNP canbe tested using Pearsonrsquos Chi-square test which has anasymptotic Chi-square distributionwith 2 degrees of freedomunder119867
0 The test is given by
119879chi2 =2
sum
119895=0
(119903119895minus 119899119895119903119899)2
119899119895119903119899
+
2
sum
119895=0
(119904119895minus 119899119895119904119899)2
119899119895119904119899
(1)
where 119899119894= 119903119894+ 119904119894for 119894 = 0 1 2 and 119899 = 119903 + 119904 Under Hardy-
Weinberg equilibrium (HWE) an allele-based test (ABT) and
CATT with scores (0 119909 1) for (1198660 1198661 1198662) where 0 le 119909 le 1
are given by
119885ABT =11989912
2119903 (21199040+ 1199041) minus 2119904 (2119903
0+ 1199031)
2119903119904 (21198990+ 1198991) (1198991+ 21198992)12
119885119909=
11989912
sum2
119894=0119909119894(119904119903119894minus 119903119904119894)
[119903119904119899 119899sum2
119894=01199092119894119899119894minus (sum2
119894=0119909119894119899119894)2
]12
(2)
where (1199090 1199091 1199092) = (0 119909 1) [9] The optimal choices of 119909
for the recessive (REC) additivemultiplicative (ADDMUL)and dominant (DOM) models are 119909 = 0 12 and 1respectively [9 20] Both 119885
119909and 119885ABT asymptotically follow
a standard normal distribution under 1198670 119885119909can be used
even when HWE does not hold However without theHWE assumption 119885ABT does not follow a standard normaldistribution due to the correlation between two alleles
A robust test MAX3 proposed by Friedlin et al [10] canbe obtained by taking the maximum of three CATTs underthe three genetic models as MAX3 = max(|119885
0| |11988512| |1198851|)
Parametric bootstrap or permutationmethods can be used tofind the 119901 value of MAX3 [4]
Let the 119901 values of Pearsonrsquos Chi-square test and CATTunder the additive genetic model 119885
12be 119875chi2 and 119875
12
respectively WTCCC [5] proposed an alternative robust testMIN2 = min(119875chi2 11987512) Joo et al [15] derived the asymptoticnull distribution of MIN2 and using their result the 119901 valueof MIN2 can be obtained as
119875MIN2
=1
2exp minus1
2119867minus1
1(1 minusMIN2) + 1
2MIN2
minus1
2120587int
minus2 log(MIN2)
119867minus1
1
(1minusMIN2)119890minusV2 arcsin(
2119867minus1
1(1 minusMIN2)V minus 1
)119889V
(3)
where 1198671and 119867
2are the cumulative distributions of Chi-
square distributions with 1 and 2 degrees of freedomOn the other hand Song and Elston [21] considered a
Hardy-Weinberg disequilibrium trend test (HWDTT) givenby
119885119867=
(119903119904119899)12
(Δ119901minus Δ119902)
1 minus 1198992119899 minus 119899
1 (2119899) 119899
2119899 + 119899
1 (2119899)
(4)
where Δ119901= 1199012minus (1199012+ 11990112)2 and Δ
119902= 1199022minus (1199022+ 11990212)2
are the estimates of Δ119901and Δ
119902 where 119901
119894= 119903119894119903 and 119902
119894= 119904119894119904
Here Δ denotes the Hardy-Weinberg disequilibrium (HWD)coefficient defined by 119875119903(119861119861) minus 119875119903(119860119861)2 + 119875119903(119861119861)
2 andΔ119901andΔ
119902denote the HWD coefficient in cases and controls
respectivelyZheng and Ng [13] used the information contained in the
signs of (Δ119901 Δ119902) to determine the genetic models in their
two-phase method Their two-phase statistic 119885GMS is givenby 119885GMS = 119885
0if 119885119867
gt 119888 1198851if 119885119867
lt minus119888 and 11988512
other-wise where 119888 = Φ
minus1(1 minus 120572
119867) for 120572
119867= 005 The asymptotic
BioMed Research International 3
correlations between 119885119867and three CATTs under HWE were
derived and the significance level was adjusted accordingly tocontrol the desired type I error Based on the observation thatthis method assumes 119861 is the risk allele Joo et al [14] studiedthe behavior of 119885GMS when either one of the alleles can be arisk alleleThey chose the risk allele based on the sign of119885
12
that is if 11988512
gt 0 119861 is the risk allele and 1198850 11988512
and 1198851
are chosen for REC ADD and DOM models respectivelyIf 11988512
lt 0 the respective test statistics are chosen tobe minus119885
1 minus11988512
and minus1198850 They incorporate this property in
defining the test statistic for genetic model selection (119885GMS)and calculating the 119901 value Let Θ
0(119911) = 119911 119911 gt 119888
Θ12(119911) = 119911 |119911| lt 119888 and Θ
1(119911) = 119911 119911 lt minus119888 Then
the 119901 value of this method can be obtained by
119901GMS
= 2
1
sum
119909=0
intΘ119909
(119911119867
)
int
infin
0
int
infin
119905
120601119909(119911119909 11991112 119911119867) 11988911991111990911988911991112119889119911119867
+ 2
intΘ12
(119911)
Φ((minus119905 and 0) + 120588
12119911
(1 minus 120588212)12
)119889Φ (119911)
(5)
where 120588119909= Corr(119885
119909 119885119867) in (5) and 120588
11990912= Corr(119885
119909 11988512)
(119909 = 0 1) are replaced by their consistent estimates Here119905 = 119911GMS and (minus119905 and 0) = min(minus119905 0) Moreover 119911GMS and 11991112are the observed values of 119885GMS and 11988512 respectively
While studying the properties of GMS Joo et al [14]noticed that the probability of selecting the true recessive ordominant models using 119885
119867is very low especially for low
to moderate GRRs but the unlikely genetic model can besuccessfully excluded This led to genetic model exclusionmethod119885GME which is the same as the119885GMS described aboveexcept 119885
119909for 119909 = 0 12 1 is replaced by 119885lowast
119909where 119885lowast
119909=
(119885119909+11988512)2(1 +120588
11990912)12 And the 119901 value of GME can be
obtained as
119901GME
= 2
1
sum
119909=0
intΘ119909
(119911119867
)
int
infin
0
int
infin
119871
120601119909(119911119909 11991112 119911119867) 11988911991111990911988911991112119889119911119867
+ 2
intΘ12
(119911)
Φ((minus119905 and 0) + 120588
12119911
(1 minus 120588212)12
)119889Φ (119911)
(6)
where 119871 = 1199052(1 + 12058811990912
)12
minus 11991112
for 119905 = 119911GME
23 119901 Value of Replication Data Using the Robust Method Inthe discovery stage the 119901 value of robust association testsincluding MAX3 MIN2 119885GMS and 119885GME can be obtainedas described in Section 22 For the 119901 value of replication datausing the robust method we use the same analytic methodthat was used for discovery and the risk allele identified byit [16] This means that when the best test statistic or geneticmodel is selected in the discovery stage the replication stage
will adopt the discovery stage selection and the direction ofassociation
Suppose that for simplicity of notation our interest isin GWAS with two stages one for discovery and the otherfor replication although the methodology described belowcan be extended to multistages for replication Let 119885(119894)
119909for
119909 = 0 12 1 be the CATT optimal for recessive additiveand dominant models and let 119875(119894)
119909be corresponding 119901 value
for 119894th stage (119894 = 1 for discovery and 119894 = 2 for replicationstages) Also denote 119885(119894)lowast
119909= (119885(119894)
119909+ 119885(119894)
12)2(1 + 120588
(119894)
11990912)12
for 119909 = 0 12 1 Then for CATT with a preselected geneticmodel 119875(2)
119909= 1minusΦ(sign(119885(1)
119909sdot119885(2)
119909) sdot |119885(2)
119909|) using a one-sided
119901 value given the direction of association from the discoverystage and119875(2)lowast
119909= 1minusΦ(sign(119885(1)lowast
119909sdot119885(2)lowast
119909) sdot |119885(2)lowast
119909|) Moreover
denote the test statistics and 119901 values using Pearsonrsquos Chi-square test from the 119894th stage as 119879(119894)chi2 and 119875
(119894)
chi2 Further letHWDTT from the 119894th stage be 119885(119894)
119867 Then the second stage
119901 values using MAX3 MIN2 119885GMS and 119885GME denoted as119875(2)
MAX3 119875(2)
MIN2 119875(2)
GMS and 119875(2)
GME can be obtained as follows
119875(2)
MAX3 = 119875(2)
119909lowast
where 119909lowast = arg min119909isin0121
119875(1)
119909
119875(2)
MIN2 = 119875(2)
12sdot 119868 (119875(1)
12le 119875(2)
chi2) + 119875(2)
chi2 sdot 119868 (119875(1)
12gt 119875(2)
chi2)
119875(2)
GMS = 119875(2)
0119868 (119885(1)
119867gt 119888) + 119875
(2)
1119868 (119885(1)
119867lt minus119888)
+ 119875(2)
12119868 (10038161003816100381610038161003816119885(1)
119867
10038161003816100381610038161003816le 119888) if 119885(1)
12gt 0
= 119875(2)
0119868 (119885(1)
119867lt minus119888) + 119875
(2)
1119868 (119885(1)
119867gt 119888)
+ 119875(2)
12119868 (10038161003816100381610038161003816119885(1)
119867
10038161003816100381610038161003816le 119888) if 119885(1)
12le 0
119875(2)
GME = 119875(2)lowast
0119868 (119885(1)
119867gt 119888) + 119875
(2)lowast
1119868 (119885(1)
119867lt minus119888)
+ 119875(2)lowast
12119868 (10038161003816100381610038161003816119885(1)
119867
10038161003816100381610038161003816le 119888) if 119885(1)
12gt 0
= 119875(2)lowast
0119868 (119885(1)
119867lt minus119888) + 119875
(2)lowast
1119868 (119885(1)
119867gt 119888)
+ 119875(2)lowast
12119868 (10038161003816100381610038161003816119885(1)
119867
10038161003816100381610038161003816le 119888) if 119885(1)
12le 0
(7)
It is important to note that even though the direction ofthe test statistics and the selected genetic models are usedto obtain the second stage 119901 values the 119901 values from thetwo stages are independent under the null hypothesis Thisis because under the null hypothesis the probability of 119885
12
being positive or negative is simply 12 and the probabilityof the selection of a certain genetic model is also a constant(120572119867for the recessive and dominant models and 1 minus 2120572
119867for
the additive model)
24 Combined Test Using 119901 Values and Its Statistical Signif-icance For a given robust test we can consider the jointanalysis by combining 119901 values from the discovery andreplication stages of GWAS We consider using 119901 valuesrather than the test statistics because test statistics can have
4 BioMed Research International
Table 1 Type I error rates of three approachesmdashreplication-based (REP) test Fisherrsquos combination (119885FC) and linear combination of test(119885LC)mdashbased on the CATT with an additive model (119885
12) 1205942 MAX3 MIN2 GMS and GME The disease prevalence 119870 = 01 119872 = 10
markers 119903 = 1 500 cases and 119904 = 1 500 controls are considered based on 20000 simulations
MAF 120587119904
120572119863
119865 = 0
11988512
1205942 MAX3 MIN2 GMS GME
03 05
005REP 000530 000455 000505 000485 00050 000490119885FC 000500 000535 000495 000510 000515 000460119885LC 000535 000525 000485 000510 00050 000485
01REP 000510 000560 000565 000485 000525 000545119885FC 000565 000535 000565 000545 000565 000540119885LC 000520 000565 000525 000520 000530 000525
03 06
005REP 000510 000485 000480 000515 000480 000500119885FC 000445 000455 000450 000455 000450 000460LC 000500 000515 000495 000520 000475 000480
01REP 000500 000485 000485 000535 000530 000505119885FC 000465 000490 000485 000500 000455 000460119885LC 000480 000470 000515 000490 000485 000475
04 05
005REP 000590 000505 000530 000565 000505 000510119885FC 000575 000430 000460 000535 000460 000500119885LC 000600 000445 000500 000540 000490 000490
01REP 000525 000470 000535 000450 000480 000515119885FC 000515 000510 000495 000475 000540 000500119885LC 000530 000500 000485 000475 000495 000510
04 06
005REP 000475 000585 000480 000500 000515 000495119885FC 000460 000470 000420 000490 000455 000440119885LC 000525 000550 000520 000580 000510 000510
01REP 000550 000490 000515 000535 000555 000540119885FC 000520 000370 000495 000450 000515 000510119885LC 000565 000485 000570 000530 000610 000580
complex forms and obtaining the distribution of the joint testcan be difficult On the other hand calculating a 119901 value foreach data set might be relatively simple and the distributionof 119901 values under the null hypothesis of no association is easyto handle
There are several methods for combining test statisticsfrom two stages [22] and twomost commonly used forms arebased on Fisherrsquos combination and a linear combination afterinverse normal transformation [23] Fisherrsquos combination(FC) directly sums 119901 values after minus2 log transformation thatis 119885FC = minus2119908
1log(119875(1)) minus 2119908
2log(119875(2)) where 119875(119894) is 119901 value
from 119894 = 0 for discovery and 119894 = 1 for replication stagesusing a given robust test A specification of 119908
1= 1199082= 1
gives the same weight for discovery and replication stagesand one can consider 119908
1= 2120587
119904and 119908
2= 2(1 minus 120587
119904)
where 120587119904= 119873119863(119873119863+ 119873119877) and 119873
119863and 119873
119877are sample
sizes of the discovery and replication data sets A linearcombination of two 119875 values after taking the inverse of thestandard normal cumulative distribution is given by 119885LC =
1199081Φminus1(1minus119875(1)2)+119908
2Φminus1(1minus119875(2))radic1199082
1+ 11990822with a natural
choice of 1199081= radic120587119904 and 1199082 = radic1 minus 120587
119904 Let the significance
level of the discovery stage be 120572119863 which means that markers
with 119875(1) lt 120572119863are selected and replicated in the replication
stage The 119901 value of combined test can then be obtained as
119901FC = 1198751198670
(119875(1)
lt 120572119863 119885FC gt 119911FC)where the observed value of
119885FC is 119911FCThe119901FC are calculated as 119890minus119911FC2(1+119911FC2+log120572119863)
for equal weights where 119911FC gt minus2 log120572119863and (119908
1(1199081minus
1199082))119890minus119911FC21199081minus(119908
2(1199081minus1199082))119890minus119911FC21199082120572
minus(1199081
minus1199082
)1199082
119889for unequal
weights where 119911FC gt minus21199081log120572119863 Detailed derivations are
described in the Appendix Equivalently for an overall typeI error threshold for a single marker of 120572 one may obtainthe threshold 119862FC of 119885FC that satisfies 119875
1198670
(119875(1)
lt 120572119863 119885FC gt
119862FC) le 120572 Similarly for the 119885LC the 119901 value is calculatedas 119901LC = 119875
1198670
(119875(1)
lt 120572119863 119885LC gt 119911LC) = int
infin
1199111minus120572
119863
2
120601(119911)[1 minus
Φ((radic11990821+ 11990822119911LC minus 1199081119911)1199082)]119889119911 for 119911LC gt 119911
1minus120572119863
2where the
observed value of 119885LC = 119911LC
3 Simulation Results
31 Type I Error Table 1 provides the type I errors underdifferent scenarios A disease prevalence of 10 is assumedand a total of 1500 cases and 1500 controls were divided intotwo stages The proportions of samples in the first stage (120587
119904)
of 05 and 06 were considered for the minor allele frequency(MAF) of 03 and 04 We considered 119872 = 10 markers tocontrol the genome-wide false positive rate at 120572 = 005 withthe Bonferroni correction We did not consider a larger 119872
BioMed Research International 5
such as 300000 or 500000 because this will require morethan 10 million simulations to show a stable estimate of thetype I error rate With 119872 = 10 we performed 20000simulations which result in less than 10 of a coefficientof variation for a significance level 005119872 = 0005 foreach marker [24] The test statistics considered are 119885
12
Pearsonrsquos Chi-square testMIN2MAX3GMS andGME Forthe second stage analysis we considered a replication-basedanalysis 119885FC and 119885LC as proposed above The results arebased on the situation under HWE (HWE coefficient 119865 = 0)As expected all tests control the type I error reasonablly welland similar results were obtained when a slight deviationfrom HWE is present with 119865 = 005 (results not shown)
32 Empirical Power We examined the empirical powers ofdifferent tests considered above In Figure 1 we considered119872 = 10 markers a disease prevalence of 10 the samegenotype relative risk for two stages (119903
1= 14 and 119903
2= 14)
and 1000 cases and 1000 controls 2000 simulations wereperformed under HWE (119865 = 0) to control the genome-wide false positive rate at 120572 = 005 The recessive additiveand dominant models were assumed for the first secondand third rows Both joint analyses showed better powerperformances compared to the replication-based analysis (upto 159 in scenarios considered in Figure 1) and LC and FChave comparable powers with less than 2 difference Thepower gain of using the joint analysis is not as much as thatobserved in Skol et al [18] However as reported by Skolet al [18] when the between-stage heterogeneity exists andthe risk allele has a larger effect in the first stage than thatin the second stage much improved power is observed byusing the joint test Figure 2 shows results under this scenariowith 119903
1= 16 and 119903
2= 14 and the observed increase
in power using the joint test is as high as 339 Againthe difference between LC and FC is minor with less than3 difference As for comparison between different robustmethods MAX3 GMS and GME perform well under therecessive model while 119885
12 1205942 and MIN2 are less powerful
Under the additivemodel11988512
is most powerful as expectedand 1205942 is least powerful Other robust methods perform wellwith a slight decrease in power compared to 119885
12 Under the
dominant model MAX3 GMS and GME perform the besteven though all tests show good power performances and thedifference is minor Similar patterns were observed when aslight deviation from theHWE is present (results not shown)
4 Real Data Application
The GWAS on non-small-cell lung cancer (NSCLC) by Yoonet al [25] studied 621 NSCLC patients and 1541 controlsubjects in the discovery stage After stringent quality controlsteps a total of 246758 SNPs were tested for the associationwith NSCLC based on119885
12 In the replication stage 168 SNPs
with 119901 value less than 1 times 10minus4 in the first stage based on 11988512
were tested using 804 patients and 1470 control samples Weidentified additional 234 SNPs using MIN2 in the first stagewhich could be studied in the replication stage if MIN2 wasused instead of 119885
12since MIN2 produces stronger evidence
for the additional SNPs than 11988512
does The Manhattan plotsof using MIN2 and 119885
12are presented in Figure 3 One
example is 119903119904385272 located in chromosome 2 which hada 119901 value of 137 times 10
minus7 which reached significance level atBonferroni correction in discovery samples alone whereas11988512
yielded a119901 value greater than 1times10minus4 Even though thereis possibility of false positive findings these SNPs could havebeen selected for replication if robust methods were used
Since we do not have replication data for these additionalSNPs selected using MIN2 because the first stage selectionwas based on 119885
12in Yoon et al [25] just for illustration
purpose of the proposed methods we present the results ofthree SNPs including 1199031199042131877 that was reported by Yoonet al [25] When the significance level in the discovery stageis set at 120572
119863= 5 times 10
minus5 so that all these exemplary SNPs canbe selected in the discovery stage the 119901 value of combinedtest based on four robust methods (MAX3 MIN2 GMSand GME) as well as 119885
12and Pearsonrsquos Chi-square test is
presented in Table 2 Fisherrsquos combination was used for thejoint test in the second stage Only 1199031199042131877was found to besignificant with Bonferroni correction (119901 value lt 203times10
minus7)by all except MAX3method
5 Discussion
In genetic association studies efficiency robust tests whoseperformance does not depend on the underlying geneticmodel have been extensively studied and their power benefitover a wide range of geneticmodels has been well recognizedIn this paper we described how the idea of these robustassociation tests can be applied to the replication studies andfurther how overall statistical significance can be evaluatedusing the combined test formed by 119901 values of the discoveryand replication studies
When the robust tests are used the test statistic ofeach stage can have a complex form and thus dealing withthe distribution of the joint test can be difficult whereascalculating the 119901 value of each stage might be relativelysimple Because the asymptotic distribution of the 119901 valueunder the null hypothesis of no association is easy to handlethe combined test using 119901 values rather than the test statisticsthemselves can provide computational convenience
There are several methods for combining test statisticsfrom two stages and Won et al [22] compared the per-formances of various choices Two most commonly usedforms are based on Fisherrsquos combination and the linearcombination after the inverse normal transformation [23]and we presented the test statistics and 119901 values of these twomethods In our limited experience the linear combinationand Fisherrsquos combination are fairly comparable Fisherrsquoscombination seems to perform slightly better than the linearcombination when there exists some heterogeneity betweenstages in terms of the genotype relative risk while the linearcombination seems to perform slightly better in most ofother situations However the difference is extremely minorFurther research is required for the thorough comparison ofvarious methods of combining 119901 values in the application ofefficiency robust tests to the replication of genetic associationstudies
6 BioMed Research International
Recessive
Additive
Dominant
(05 03) (05 04) (06 03) (06 04)
(120587sMAF)
Empi
rical
pow
er
10
08
06
04
02
00
(05 03) (05 04) (06 03) (06 04)
(120587sMAF)Em
piric
al p
ower
10
08
06
04
02
00
(05 03) (05 04) (06 03) (06 04)
(120587sMAF)
Empi
rical
pow
er
10
08
06
04
02
00
(05 03) (05 04) (06 03) (06 04)
(120587sMAF)
Empi
rical
pow
er
10
08
06
04
02
00
(05 03) (05 04) (06 03) (06 04)
(120587sMAF)
Empi
rical
pow
er
10
08
06
04
02
00
(05 03) (05 04) (06 03) (06 04)
(120587sMAF)
Empi
rical
pow
er
10
08
06
04
02
00
(05 03) (05 04) (06 03) (06 04)
(120587sMAF)
Empi
rical
pow
er
10
08
06
04
02
00
(05 03) (05 04) (06 03) (06 04)
(120587sMAF)
Empi
rical
pow
er
10
08
06
04
02
00
(05 03) (05 04) (06 03) (06 04)
(120587sMAF)
Empi
rical
pow
er
10
08
06
04
02
00
Z05
1205942
MAX3
ZFC
ZFC
ZFC
ZLC
ZLC
ZLC
MIN2GMSGME
MIN2GMSGME
MIN2GMSGME
Z05
1205942
MAX3
Z05
1205942
MAX3
Replication-based test
Replication-based test
Replication-based test
Figure 1 Empirical powers based on 2000 simulations for119872 = 10markers genotype relative risks of both stages = 14 and disease prevalence119870 = 01 under the recessive additive and dominant models 1000 cases and 1000 controls are considered to control 120572 = 005 The first stagetype I error rate for discovery is 120572
119863= 005 Six test statistics 119885
12 1205942 MAX3 MIN2 GMS and GME are considered The first second and
third columns depict powers using the replication-based test 119885FC and 119885LC respectively
BioMed Research International 7
Replication-based test
Recessive
Additive
Dominant
(05 03) (05 04) (06 03) (06 04)
(120587sMAF)
Empi
rical
pow
er
10
08
06
04
02
00
(05 03) (05 04) (06 03) (06 04)
(120587sMAF)Em
piric
al p
ower
10
08
06
04
02
00
(05 03) (05 04) (06 03) (06 04)
(120587sMAF)
Empi
rical
pow
er
10
08
06
04
02
00
(05 03) (05 04) (06 03) (06 04)
(120587sMAF)
Empi
rical
pow
er
10
08
06
04
02
00
(05 03) (05 04) (06 03) (06 04)
(120587sMAF)
Empi
rical
pow
er
10
08
06
04
02
00
(05 03) (05 04) (06 03) (06 04)
(120587sMAF)
Empi
rical
pow
er
10
08
06
04
02
00
(05 03) (05 04) (06 03) (06 04)
(120587sMAF)
Empi
rical
pow
er
10
08
06
04
02
00
(05 03) (05 04) (06 03) (06 04)
(120587sMAF)
Empi
rical
pow
er
10
08
06
04
02
00
(05 03) (05 04) (06 03) (06 04)
(120587sMAF)
Empi
rical
pow
er
10
08
06
04
02
00
Z05
1205942
MAX3
ZFC
ZFC
ZFC
ZLC
ZLC
ZLC
MIN2GMSGME
MIN2GMSGME
MIN2GMSGME
Z05
1205942
MAX3
Z05
1205942
MAX3
Replication-based test
Replication-based test
Figure 2 Empirical powers based on 2000 simulations for 119872 = 10 markers genotype relative risks of two stages are different (1199031= 16
1199032= 14) disease prevalence 119870 = 01 under the recessive additive and dominant models 1000 cases and 1000 controls are considered to
control 120572 = 005 The first stage type I error rate for discovery is 120572119863= 005 Six test statistics 119885
12 1205942 MAX3 MIN2 GMS and GME are
considered The first second and third columns depict powers using the replication-based test 119885FC and 119885LC respectively
8 BioMed Research International
minuslo
g 10(120588)
Chromosome1 2 3 4 5 6 7 8 9 11 13 15 17 19 21 23
8
6
4
2
0
MIN2 P value
(a)
minuslo
g 10(120588)
Additive P value8
6
4
2
0
Chromosome1 2 3 4 5 6 7 8 9 11 13 15 17 19 21 23
(b)
Figure 3 Manhattan plots of 246758 SNPs from Yoon et al [25] based on MIN2 (a) and 11988512
(b) The 119909 axis is chromosomal location andthe 119910 axis is the significance (minuslog
10119875) of association The horizontal line corresponds to the significance level 10minus4
Table 2 For selected exemplary three SNPs for testing association with NSCLC 119901 value of combined test using additive CATT (11988512)
Pearsonrsquos Chi-square test (119879chi2) MAX3 MIN2 119885GMS and 119885GME
SNP 119901 value of 11988512
119901 value of 119879chi2Discovery Replication Combined test Discovery Replication Combined test
rs2131877 788 times 10minus5
104 times 10minus4
797 times 10minus8
140 times 10minus4
149 times 10minus4
184 times 10minus7
rs905551 183 times 10minus5
702 times 10minus3
770 times 10minus6
806 times 10minus5
489 times 10minus2
140 times 10minus5
rs1695109 248 times 10minus4
346 times 10minus2
217 times 10minus6
456 times 10minus5
153 times 10minus1
207 times 10minus5
SNP 119901 value of MAX3 119901 value of MIN2Discovery Replication Combined test Discovery Replication Combined test
rs2131877 153 times 10minus4
405 times 10minus2
192 times 10minus5
132 times 10minus4
104 times 10minus4
126 times 10minus7
rs905551 450 times 10minus5
702 times 10minus3
192 times 10minus6
134 times 10minus4
489 times 10minus2
199 times 10minus5
rs1695109 354 times 10minus5
263 times 10minus2
464 times 10minus6
236 times 10minus5
263 times 10minus2
335 times 10minus6
SNP 119901 value of 119885GMS 119901 value of 119885GME
Discovery Replication Combined test Discovery Replication Combined testrs2131877 186 times 10
minus4104 times 10
minus4171 times 10
minus7103 times 10
minus4104 times 10
minus4102 times 10
minus7
rs905551 519 times 10minus5
702 times 10minus3
216 times 10minus6
735 times 10minus5
801 times 10minus3
320 times 10minus6
rs1695109 689 times 10minus4
127 times 10minus1
385 times 10minus5
269 times 10minus5
419 times 10minus2
540 times 10minus6
In a genetic study where the purpose of considering areplication stage is to validate or replicate the genetic findingsfrom the discovery stage which is the case considered inthis paper the analysis in the replication stage utilized thetest statistic or genetic model that is selected as being thebest in the discovery stage and also the direction of the riskallele following guidelines for exact replication in geneticassociation studies If the purpose is to simply combine theevidence fromdifferent data sources such as inmeta-analysisother strategies may be devised Further research again isrequired to provide fully detailed properties of suchmethods
Power gain of a joint analysis over the conventionalreplication-based analysis was thoroughly studied by Skol etal [18 19] In our simulation the amount of power increaseusing a joint test compared to the replication-based analysiswas much minor than what was observed by Skol et al[18 19] The exact reason is not known but we suspect thismight be due to the power advantages of robust methodsand also due to the fact that the optimal choice from thefirst stage is used when calculating the second stage 119901 values
However even though it was minor in some situationsthe joint anlysis presented better power performance thanthe replication-based analysis in our study This type ofjoint analysis raised concerns about the exact meaning ofreplication [17] However McCarthy et al [26] mentionedthat joint analyses ldquoblur the boundaries of where exactlyreplication starts but whichever analytical approach is takenconfirmation in many independent samples is important andit is the overall strength of the evidence of association thatmattersrdquo Purpose of the current study was to present howthe overall strength of the evidence of association can beevaluated when robust tests are used in GWAS replicationstudies
We illustrated how the proposed methods can be appliedin the real data that studied the association of SNPs with non-small-cell lung cancer (NSCLC) in discovery and replicationstages In the original study reported by Yoon et al [25]SNPs were selected in the discovery data set not based onthe robust tests but based on additive CATT Therefore wefound that some SNPs could have been selected by one
BioMed Research International 9
of the robust methods but they were not included in thereplication data set For these SNPs we were not able toperform the joint analysis that we propose and it was notpossible to examine whether there are other SNPs that couldhave been found to be associated with NSCLC by proposedmethods in the replication study For this reason we merelypresented howmany additional SNPs could have been furtherfollowed in the replication stage when robust methods wereused In many GWASs it is a common practice to reportthe summary test statistics and 119901 values of the SNPs undera specific genetic model usually an additive model whichwere further genotyped in the replication stage and werefinally defined to be significantly associated with a phenotypeof interest As emphasized in this paper one may have abetter chance of findingmanymissing SNPs by applyingmorepowerful and robust methods that consider different geneticmodels simultaneouslyTherefore we urge the community toshare test results under not only an additive model but alsoother genetic models although they were not significant at astringent significance level so that future research may haveenriched data resources to which robust tests can be appliedin association studies
Appendix
119901 value of Fisherrsquos Combination forEqual and Unequal Weights
Equal Weights Assume 1199081
= 1199082
= 1 Under the nullhypothesis of no association 119883
1= minus2 log119875(1) and 119883
2=
minus2 log119875(2) are independent and each asymptotically followsa 1205942 distribution with 2 degrees of freedom Let 119891
119896(119909) and
119865119896(119909) be the probability and cumulative density functions of
1205942 random variable with 119896 degrees of freedomThen 119891
2(119909) =
exp(minus1199092)2 1198914(119909) = 119909 exp(minus1199092)4 119865
2(119909) = 1 minus exp(minus1199092)
and1198654(119909) = 1minusexp(minus1199092)minus119909 exp(minus1199092)2 Denote the cutoff
of the discovery stage based on 120572119863as 119862119863 that is 119865
2(119862119863) =
1 minus 120572119863 For observed value 119911FC gt 119862
119863of1198831+ 1198832 the 119901 value
is written as
1198751198670
(1198831gt 119862119863 1198831+ 1198832gt 119911FC)
= 120572119863minus int
119911FC
119862119863
1198912 (119909) 1198652 (119911FC minus 119909) 119889119909
= 120572119863+ exp (minus
119911FC2) minus 120572119863+1
2exp(minus
119911FC2) (119911FC minus 119862119863)
= exp (minus119911FC2) (1 +
119911FC2
+ log120572119863)
(A1)
Unequal WeightsWhen different proportions of samples areused in the discovery and replication stages it may be moreappropriate to assignweights proportional to the sample sizesfor each stage For example when only a small portion isused in the discovery stage to prevent Fisherrsquos combinationtest from being dominated by the significant result in the
discovery stage one may want to assign a small weight to thediscovery stage result
When 120587119904is the proportion of samples used in the dis-
covery stage one selection for weights is 1199081
= 2120587119904and
1199082= 2(1 minus 120587
119904) for discovery and replication stages which
simplifies to equal weights when 120587119904= 05 Based on these
weights we consider unequal-weighted Fisherrsquos combinationasminus2 log119875(1)1199081119875(2)1199082 = 119908
11198831+11990821198832[27] Its density function
is given by
119891119908 (119909) =
1
2 (1199081minus 1199082)exp(minus 119909
(21199081))
minus1
2 (1199081minus 1199082)exp(minus 119909
(21199082))
(A2)
and the probability distribution function is
119865119908 (119909) = 1 minus
1199081
(1199081minus 1199082)exp(minus 119909
21199081
)
minus1199082
(1199081minus 1199082)exp(minus 119909
21199082
) 1199081
= 1199082
(A3)
Using the previous notation we have the following formof 119901 value
1198751198670
(1198831gt 119862119863 11990811198831+ 11990821198832gt 119911FC)
= 120572119863minus int
119911FC1199081
119862119863
1198912 (119909) 1198652 (
119911FC minus 1199081119909
1199082
)119889119909
= exp(minus119911FC21199081
) +1199082
1199081minus 1199082
exp(minus119911FC21199082
)
times exp(1199081 minus 1199082211990811199082
119911FC) minus exp(1199081 minus 1199082211990811199082
1199081119862119863)
=1199081
1199081minus 1199082
exp(minus119911FC21199081
)
minus1199082
1199081minus 1199082
exp(minus119911FC21199082
)120572minus(1199081
minus1199082
)1199082
119863
=1199081
1199081minus 1199082
exp(minus119911FC21199081
)
minus1199082
1199081minus 1199082
exp(minus119911FC21199082
)120572minus(1199081
minus1199082
)1199082
119863
(A4)
where 119911FC1199081 gt 119862119863
Conflict of Interests
The authors declare that there is no conflict of interestsregarding the publication of this paper
10 BioMed Research International
Acknowledgments
The authors are indebted to late Dr Gang Zheng for his inspi-ration and support on their workThisworkwas supported bygrant of the National Cancer Center (no NCC-1210060)
References
[1] R J Klein C Zeiss E Y Chew et al ldquoComplement factor Hpolymorphism in age-related macular degenerationrdquo Sciencevol 308 no 5720 pp 385ndash389 2005
[2] R H Duerr K D Taylor S R Brant et al ldquoA genome-wideassociation study identifies IL 23R as an inflammatory boweldisease generdquo Science vol 314 no 5804 pp 1461ndash1463 2006
[3] A Herbert N P Gerry M B McQueen et al ldquoA commongenetic variant is associated with adult and childhood obesityrdquoScience vol 312 no 5771 pp 279ndash283 2006
[4] R Sladek G Rocheleau J Rung et al ldquoA genome-wideassociation study identifies novel risk loci for type 2 diabetesrdquoNature vol 445 no 7130 pp 881ndash885 2007
[5] P R Burton D G Clayton L R Cardon et al ldquoGenome-wideassociation study of 14000 cases of seven common diseases and3000 shared controlsrdquo Nature vol 447 no 7145 pp 661ndash6782007
[6] S H Kwak S H Kim Y M Cho et al ldquoA genome-wideassociation study of gestational diabetes mellitus in Koreanwomenrdquo Diabetes vol 61 no 2 pp 531ndash541 2012
[7] W G Cochran ldquoSomemethods for strengthening the common1205942 testsrdquo Biometrics vol 10 no 4 pp 417ndash451 1954
[8] P Armitage ldquoTests for linear trends in proportions and frequen-ciesrdquo Biometrics vol 11 no 3 pp 375ndash386 1955
[9] P D Sasieni ldquoFrom genotypes to genes doubling the samplesizerdquo Biometrics vol 53 no 4 pp 1253ndash1261 1997
[10] B Freidlin G Zheng Z Li and J L Gastwirth ldquoTrend tests forcase-control studies of genetic markers power sample size androbustnessrdquo Human Heredity vol 53 no 3 pp 146ndash152 2002
[11] J L Gastwirth ldquoOn robust proceduresrdquo Journal of the AmericanStatistical Association vol 61 no 316 pp 929ndash948 1966
[12] J L Gastwirth ldquoThe use of maximin efficiency robust tests incombining contingency tables and survival analysisrdquo Journal ofthe American Statistical Association vol 80 no 390 pp 380ndash384 1985
[13] G Zheng and H K T Ng ldquoGenetic model selection in two-phase analysis for case-control association studiesrdquo Biostatisticsvol 9 no 3 pp 391ndash399 2008
[14] J Joo M Kwak and G Zheng ldquoImproving power for testinggenetic association in case-control studies by reducing thealternative spacerdquo Biometrics vol 66 no 1 pp 266ndash276 2010
[15] J Joo M Kwak K Ahn and G Zheng ldquoA robust genome-widescan statistic of theWellcome Trust Case-Control ConsortiumrdquoBiometrics vol 65 no 4 pp 1115ndash1122 2009
[16] P Kraft E Zeggini and J P A Ioannidis ldquoReplication ingenome-wide association studiesrdquo Statistical Science vol 24 no4 pp 561ndash573 2009
[17] D CThomas G Casey D V Conti R W Haile J P Lewingerand D O Stram ldquoMethodological issues in multistage genome-wide association studiesrdquo Statistical Science vol 24 no 4 pp414ndash429 2009
[18] A D Skol L J Scott G R Abecasis and M Boehnke ldquoJointanalysis is more efficient than replication-based analysis for
two-stage genome-wide association studiesrdquo Nature Geneticsvol 38 no 2 pp 209ndash213 2006
[19] A D Skol L J Scott G R Abecasis andM Boehnke ldquoOptimaldesigns for two-stage genome-wide association studiesrdquoGeneticEpidemiology vol 31 no 7 pp 776ndash788 2007
[20] G Zheng B Freidlin Z Li and J L Gastwirth ldquoChoice ofscores in trend tests for case-control studies of candidate-geneassociationsrdquo Biometrical Journal vol 45 no 3 pp 335ndash3482003
[21] K Song and R C Elston ldquoA powerful method of combiningmeasures of association and Hardy-Weinberg Disequilibriumfor fine-mapping in case-control studiesrdquo Statistics in Medicinevol 25 no 1 pp 105ndash126 2006
[22] S Won N Morris Q Lu and R C Elston ldquoChoosing anoptimal method to combine P-valuesrdquo Statistics in Medicinevol 28 no 11 pp 1537ndash1553 2009
[23] F BegumDGhoshGC Tseng andE Feingold ldquoComprehen-sive literature review and statistical considerations for GWASmeta-analysisrdquo Nucleic Acids Research vol 40 no 9 pp 3777ndash3784 2012
[24] B Efron and R J Tibshirani An Introduction to the BootstrapChapman amp HallCRC Press 1993
[25] K A Yoon J H Park J Han et al ldquoA genome-wide associationstudy reveals susceptibility variants for non-small cell lungcancer in the Korean populationrdquo Human Molecular Geneticsvol 19 no 24 pp 4948ndash4954 2010
[26] M I McCarthy G R Abecasis L R Cardon et al ldquoGenome-wide association studies for complex traits consensus uncer-tainty and challengesrdquoNature Reviews Genetics vol 9 no 5 pp356ndash369 2008
[27] I J Good ldquoOn the weighted combination of significance testsrdquoJournal of the Royal Statistical Society Series B vol 17 pp 264ndash265 1955
Submit your manuscripts athttpwwwhindawicom
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Anatomy Research International
PeptidesInternational Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporation httpwwwhindawicom
International Journal of
Volume 2014
Zoology
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Molecular Biology International
GenomicsInternational Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
BioinformaticsAdvances in
Marine BiologyJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Signal TransductionJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
BioMed Research International
Evolutionary BiologyInternational Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Biochemistry Research International
ArchaeaHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Genetics Research International
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Advances in
Virolog y
Hindawi Publishing Corporationhttpwwwhindawicom
Nucleic AcidsJournal of
Volume 2014
Stem CellsInternational
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Enzyme Research
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
International Journal of
Microbiology
BioMed Research International 3
correlations between 119885119867and three CATTs under HWE were
derived and the significance level was adjusted accordingly tocontrol the desired type I error Based on the observation thatthis method assumes 119861 is the risk allele Joo et al [14] studiedthe behavior of 119885GMS when either one of the alleles can be arisk alleleThey chose the risk allele based on the sign of119885
12
that is if 11988512
gt 0 119861 is the risk allele and 1198850 11988512
and 1198851
are chosen for REC ADD and DOM models respectivelyIf 11988512
lt 0 the respective test statistics are chosen tobe minus119885
1 minus11988512
and minus1198850 They incorporate this property in
defining the test statistic for genetic model selection (119885GMS)and calculating the 119901 value Let Θ
0(119911) = 119911 119911 gt 119888
Θ12(119911) = 119911 |119911| lt 119888 and Θ
1(119911) = 119911 119911 lt minus119888 Then
the 119901 value of this method can be obtained by
119901GMS
= 2
1
sum
119909=0
intΘ119909
(119911119867
)
int
infin
0
int
infin
119905
120601119909(119911119909 11991112 119911119867) 11988911991111990911988911991112119889119911119867
+ 2
intΘ12
(119911)
Φ((minus119905 and 0) + 120588
12119911
(1 minus 120588212)12
)119889Φ (119911)
(5)
where 120588119909= Corr(119885
119909 119885119867) in (5) and 120588
11990912= Corr(119885
119909 11988512)
(119909 = 0 1) are replaced by their consistent estimates Here119905 = 119911GMS and (minus119905 and 0) = min(minus119905 0) Moreover 119911GMS and 11991112are the observed values of 119885GMS and 11988512 respectively
While studying the properties of GMS Joo et al [14]noticed that the probability of selecting the true recessive ordominant models using 119885
119867is very low especially for low
to moderate GRRs but the unlikely genetic model can besuccessfully excluded This led to genetic model exclusionmethod119885GME which is the same as the119885GMS described aboveexcept 119885
119909for 119909 = 0 12 1 is replaced by 119885lowast
119909where 119885lowast
119909=
(119885119909+11988512)2(1 +120588
11990912)12 And the 119901 value of GME can be
obtained as
119901GME
= 2
1
sum
119909=0
intΘ119909
(119911119867
)
int
infin
0
int
infin
119871
120601119909(119911119909 11991112 119911119867) 11988911991111990911988911991112119889119911119867
+ 2
intΘ12
(119911)
Φ((minus119905 and 0) + 120588
12119911
(1 minus 120588212)12
)119889Φ (119911)
(6)
where 119871 = 1199052(1 + 12058811990912
)12
minus 11991112
for 119905 = 119911GME
23 119901 Value of Replication Data Using the Robust Method Inthe discovery stage the 119901 value of robust association testsincluding MAX3 MIN2 119885GMS and 119885GME can be obtainedas described in Section 22 For the 119901 value of replication datausing the robust method we use the same analytic methodthat was used for discovery and the risk allele identified byit [16] This means that when the best test statistic or geneticmodel is selected in the discovery stage the replication stage
will adopt the discovery stage selection and the direction ofassociation
Suppose that for simplicity of notation our interest isin GWAS with two stages one for discovery and the otherfor replication although the methodology described belowcan be extended to multistages for replication Let 119885(119894)
119909for
119909 = 0 12 1 be the CATT optimal for recessive additiveand dominant models and let 119875(119894)
119909be corresponding 119901 value
for 119894th stage (119894 = 1 for discovery and 119894 = 2 for replicationstages) Also denote 119885(119894)lowast
119909= (119885(119894)
119909+ 119885(119894)
12)2(1 + 120588
(119894)
11990912)12
for 119909 = 0 12 1 Then for CATT with a preselected geneticmodel 119875(2)
119909= 1minusΦ(sign(119885(1)
119909sdot119885(2)
119909) sdot |119885(2)
119909|) using a one-sided
119901 value given the direction of association from the discoverystage and119875(2)lowast
119909= 1minusΦ(sign(119885(1)lowast
119909sdot119885(2)lowast
119909) sdot |119885(2)lowast
119909|) Moreover
denote the test statistics and 119901 values using Pearsonrsquos Chi-square test from the 119894th stage as 119879(119894)chi2 and 119875
(119894)
chi2 Further letHWDTT from the 119894th stage be 119885(119894)
119867 Then the second stage
119901 values using MAX3 MIN2 119885GMS and 119885GME denoted as119875(2)
MAX3 119875(2)
MIN2 119875(2)
GMS and 119875(2)
GME can be obtained as follows
119875(2)
MAX3 = 119875(2)
119909lowast
where 119909lowast = arg min119909isin0121
119875(1)
119909
119875(2)
MIN2 = 119875(2)
12sdot 119868 (119875(1)
12le 119875(2)
chi2) + 119875(2)
chi2 sdot 119868 (119875(1)
12gt 119875(2)
chi2)
119875(2)
GMS = 119875(2)
0119868 (119885(1)
119867gt 119888) + 119875
(2)
1119868 (119885(1)
119867lt minus119888)
+ 119875(2)
12119868 (10038161003816100381610038161003816119885(1)
119867
10038161003816100381610038161003816le 119888) if 119885(1)
12gt 0
= 119875(2)
0119868 (119885(1)
119867lt minus119888) + 119875
(2)
1119868 (119885(1)
119867gt 119888)
+ 119875(2)
12119868 (10038161003816100381610038161003816119885(1)
119867
10038161003816100381610038161003816le 119888) if 119885(1)
12le 0
119875(2)
GME = 119875(2)lowast
0119868 (119885(1)
119867gt 119888) + 119875
(2)lowast
1119868 (119885(1)
119867lt minus119888)
+ 119875(2)lowast
12119868 (10038161003816100381610038161003816119885(1)
119867
10038161003816100381610038161003816le 119888) if 119885(1)
12gt 0
= 119875(2)lowast
0119868 (119885(1)
119867lt minus119888) + 119875
(2)lowast
1119868 (119885(1)
119867gt 119888)
+ 119875(2)lowast
12119868 (10038161003816100381610038161003816119885(1)
119867
10038161003816100381610038161003816le 119888) if 119885(1)
12le 0
(7)
It is important to note that even though the direction ofthe test statistics and the selected genetic models are usedto obtain the second stage 119901 values the 119901 values from thetwo stages are independent under the null hypothesis Thisis because under the null hypothesis the probability of 119885
12
being positive or negative is simply 12 and the probabilityof the selection of a certain genetic model is also a constant(120572119867for the recessive and dominant models and 1 minus 2120572
119867for
the additive model)
24 Combined Test Using 119901 Values and Its Statistical Signif-icance For a given robust test we can consider the jointanalysis by combining 119901 values from the discovery andreplication stages of GWAS We consider using 119901 valuesrather than the test statistics because test statistics can have
4 BioMed Research International
Table 1 Type I error rates of three approachesmdashreplication-based (REP) test Fisherrsquos combination (119885FC) and linear combination of test(119885LC)mdashbased on the CATT with an additive model (119885
12) 1205942 MAX3 MIN2 GMS and GME The disease prevalence 119870 = 01 119872 = 10
markers 119903 = 1 500 cases and 119904 = 1 500 controls are considered based on 20000 simulations
MAF 120587119904
120572119863
119865 = 0
11988512
1205942 MAX3 MIN2 GMS GME
03 05
005REP 000530 000455 000505 000485 00050 000490119885FC 000500 000535 000495 000510 000515 000460119885LC 000535 000525 000485 000510 00050 000485
01REP 000510 000560 000565 000485 000525 000545119885FC 000565 000535 000565 000545 000565 000540119885LC 000520 000565 000525 000520 000530 000525
03 06
005REP 000510 000485 000480 000515 000480 000500119885FC 000445 000455 000450 000455 000450 000460LC 000500 000515 000495 000520 000475 000480
01REP 000500 000485 000485 000535 000530 000505119885FC 000465 000490 000485 000500 000455 000460119885LC 000480 000470 000515 000490 000485 000475
04 05
005REP 000590 000505 000530 000565 000505 000510119885FC 000575 000430 000460 000535 000460 000500119885LC 000600 000445 000500 000540 000490 000490
01REP 000525 000470 000535 000450 000480 000515119885FC 000515 000510 000495 000475 000540 000500119885LC 000530 000500 000485 000475 000495 000510
04 06
005REP 000475 000585 000480 000500 000515 000495119885FC 000460 000470 000420 000490 000455 000440119885LC 000525 000550 000520 000580 000510 000510
01REP 000550 000490 000515 000535 000555 000540119885FC 000520 000370 000495 000450 000515 000510119885LC 000565 000485 000570 000530 000610 000580
complex forms and obtaining the distribution of the joint testcan be difficult On the other hand calculating a 119901 value foreach data set might be relatively simple and the distributionof 119901 values under the null hypothesis of no association is easyto handle
There are several methods for combining test statisticsfrom two stages [22] and twomost commonly used forms arebased on Fisherrsquos combination and a linear combination afterinverse normal transformation [23] Fisherrsquos combination(FC) directly sums 119901 values after minus2 log transformation thatis 119885FC = minus2119908
1log(119875(1)) minus 2119908
2log(119875(2)) where 119875(119894) is 119901 value
from 119894 = 0 for discovery and 119894 = 1 for replication stagesusing a given robust test A specification of 119908
1= 1199082= 1
gives the same weight for discovery and replication stagesand one can consider 119908
1= 2120587
119904and 119908
2= 2(1 minus 120587
119904)
where 120587119904= 119873119863(119873119863+ 119873119877) and 119873
119863and 119873
119877are sample
sizes of the discovery and replication data sets A linearcombination of two 119875 values after taking the inverse of thestandard normal cumulative distribution is given by 119885LC =
1199081Φminus1(1minus119875(1)2)+119908
2Φminus1(1minus119875(2))radic1199082
1+ 11990822with a natural
choice of 1199081= radic120587119904 and 1199082 = radic1 minus 120587
119904 Let the significance
level of the discovery stage be 120572119863 which means that markers
with 119875(1) lt 120572119863are selected and replicated in the replication
stage The 119901 value of combined test can then be obtained as
119901FC = 1198751198670
(119875(1)
lt 120572119863 119885FC gt 119911FC)where the observed value of
119885FC is 119911FCThe119901FC are calculated as 119890minus119911FC2(1+119911FC2+log120572119863)
for equal weights where 119911FC gt minus2 log120572119863and (119908
1(1199081minus
1199082))119890minus119911FC21199081minus(119908
2(1199081minus1199082))119890minus119911FC21199082120572
minus(1199081
minus1199082
)1199082
119889for unequal
weights where 119911FC gt minus21199081log120572119863 Detailed derivations are
described in the Appendix Equivalently for an overall typeI error threshold for a single marker of 120572 one may obtainthe threshold 119862FC of 119885FC that satisfies 119875
1198670
(119875(1)
lt 120572119863 119885FC gt
119862FC) le 120572 Similarly for the 119885LC the 119901 value is calculatedas 119901LC = 119875
1198670
(119875(1)
lt 120572119863 119885LC gt 119911LC) = int
infin
1199111minus120572
119863
2
120601(119911)[1 minus
Φ((radic11990821+ 11990822119911LC minus 1199081119911)1199082)]119889119911 for 119911LC gt 119911
1minus120572119863
2where the
observed value of 119885LC = 119911LC
3 Simulation Results
31 Type I Error Table 1 provides the type I errors underdifferent scenarios A disease prevalence of 10 is assumedand a total of 1500 cases and 1500 controls were divided intotwo stages The proportions of samples in the first stage (120587
119904)
of 05 and 06 were considered for the minor allele frequency(MAF) of 03 and 04 We considered 119872 = 10 markers tocontrol the genome-wide false positive rate at 120572 = 005 withthe Bonferroni correction We did not consider a larger 119872
BioMed Research International 5
such as 300000 or 500000 because this will require morethan 10 million simulations to show a stable estimate of thetype I error rate With 119872 = 10 we performed 20000simulations which result in less than 10 of a coefficientof variation for a significance level 005119872 = 0005 foreach marker [24] The test statistics considered are 119885
12
Pearsonrsquos Chi-square testMIN2MAX3GMS andGME Forthe second stage analysis we considered a replication-basedanalysis 119885FC and 119885LC as proposed above The results arebased on the situation under HWE (HWE coefficient 119865 = 0)As expected all tests control the type I error reasonablly welland similar results were obtained when a slight deviationfrom HWE is present with 119865 = 005 (results not shown)
32 Empirical Power We examined the empirical powers ofdifferent tests considered above In Figure 1 we considered119872 = 10 markers a disease prevalence of 10 the samegenotype relative risk for two stages (119903
1= 14 and 119903
2= 14)
and 1000 cases and 1000 controls 2000 simulations wereperformed under HWE (119865 = 0) to control the genome-wide false positive rate at 120572 = 005 The recessive additiveand dominant models were assumed for the first secondand third rows Both joint analyses showed better powerperformances compared to the replication-based analysis (upto 159 in scenarios considered in Figure 1) and LC and FChave comparable powers with less than 2 difference Thepower gain of using the joint analysis is not as much as thatobserved in Skol et al [18] However as reported by Skolet al [18] when the between-stage heterogeneity exists andthe risk allele has a larger effect in the first stage than thatin the second stage much improved power is observed byusing the joint test Figure 2 shows results under this scenariowith 119903
1= 16 and 119903
2= 14 and the observed increase
in power using the joint test is as high as 339 Againthe difference between LC and FC is minor with less than3 difference As for comparison between different robustmethods MAX3 GMS and GME perform well under therecessive model while 119885
12 1205942 and MIN2 are less powerful
Under the additivemodel11988512
is most powerful as expectedand 1205942 is least powerful Other robust methods perform wellwith a slight decrease in power compared to 119885
12 Under the
dominant model MAX3 GMS and GME perform the besteven though all tests show good power performances and thedifference is minor Similar patterns were observed when aslight deviation from theHWE is present (results not shown)
4 Real Data Application
The GWAS on non-small-cell lung cancer (NSCLC) by Yoonet al [25] studied 621 NSCLC patients and 1541 controlsubjects in the discovery stage After stringent quality controlsteps a total of 246758 SNPs were tested for the associationwith NSCLC based on119885
12 In the replication stage 168 SNPs
with 119901 value less than 1 times 10minus4 in the first stage based on 11988512
were tested using 804 patients and 1470 control samples Weidentified additional 234 SNPs using MIN2 in the first stagewhich could be studied in the replication stage if MIN2 wasused instead of 119885
12since MIN2 produces stronger evidence
for the additional SNPs than 11988512
does The Manhattan plotsof using MIN2 and 119885
12are presented in Figure 3 One
example is 119903119904385272 located in chromosome 2 which hada 119901 value of 137 times 10
minus7 which reached significance level atBonferroni correction in discovery samples alone whereas11988512
yielded a119901 value greater than 1times10minus4 Even though thereis possibility of false positive findings these SNPs could havebeen selected for replication if robust methods were used
Since we do not have replication data for these additionalSNPs selected using MIN2 because the first stage selectionwas based on 119885
12in Yoon et al [25] just for illustration
purpose of the proposed methods we present the results ofthree SNPs including 1199031199042131877 that was reported by Yoonet al [25] When the significance level in the discovery stageis set at 120572
119863= 5 times 10
minus5 so that all these exemplary SNPs canbe selected in the discovery stage the 119901 value of combinedtest based on four robust methods (MAX3 MIN2 GMSand GME) as well as 119885
12and Pearsonrsquos Chi-square test is
presented in Table 2 Fisherrsquos combination was used for thejoint test in the second stage Only 1199031199042131877was found to besignificant with Bonferroni correction (119901 value lt 203times10
minus7)by all except MAX3method
5 Discussion
In genetic association studies efficiency robust tests whoseperformance does not depend on the underlying geneticmodel have been extensively studied and their power benefitover a wide range of geneticmodels has been well recognizedIn this paper we described how the idea of these robustassociation tests can be applied to the replication studies andfurther how overall statistical significance can be evaluatedusing the combined test formed by 119901 values of the discoveryand replication studies
When the robust tests are used the test statistic ofeach stage can have a complex form and thus dealing withthe distribution of the joint test can be difficult whereascalculating the 119901 value of each stage might be relativelysimple Because the asymptotic distribution of the 119901 valueunder the null hypothesis of no association is easy to handlethe combined test using 119901 values rather than the test statisticsthemselves can provide computational convenience
There are several methods for combining test statisticsfrom two stages and Won et al [22] compared the per-formances of various choices Two most commonly usedforms are based on Fisherrsquos combination and the linearcombination after the inverse normal transformation [23]and we presented the test statistics and 119901 values of these twomethods In our limited experience the linear combinationand Fisherrsquos combination are fairly comparable Fisherrsquoscombination seems to perform slightly better than the linearcombination when there exists some heterogeneity betweenstages in terms of the genotype relative risk while the linearcombination seems to perform slightly better in most ofother situations However the difference is extremely minorFurther research is required for the thorough comparison ofvarious methods of combining 119901 values in the application ofefficiency robust tests to the replication of genetic associationstudies
6 BioMed Research International
Recessive
Additive
Dominant
(05 03) (05 04) (06 03) (06 04)
(120587sMAF)
Empi
rical
pow
er
10
08
06
04
02
00
(05 03) (05 04) (06 03) (06 04)
(120587sMAF)Em
piric
al p
ower
10
08
06
04
02
00
(05 03) (05 04) (06 03) (06 04)
(120587sMAF)
Empi
rical
pow
er
10
08
06
04
02
00
(05 03) (05 04) (06 03) (06 04)
(120587sMAF)
Empi
rical
pow
er
10
08
06
04
02
00
(05 03) (05 04) (06 03) (06 04)
(120587sMAF)
Empi
rical
pow
er
10
08
06
04
02
00
(05 03) (05 04) (06 03) (06 04)
(120587sMAF)
Empi
rical
pow
er
10
08
06
04
02
00
(05 03) (05 04) (06 03) (06 04)
(120587sMAF)
Empi
rical
pow
er
10
08
06
04
02
00
(05 03) (05 04) (06 03) (06 04)
(120587sMAF)
Empi
rical
pow
er
10
08
06
04
02
00
(05 03) (05 04) (06 03) (06 04)
(120587sMAF)
Empi
rical
pow
er
10
08
06
04
02
00
Z05
1205942
MAX3
ZFC
ZFC
ZFC
ZLC
ZLC
ZLC
MIN2GMSGME
MIN2GMSGME
MIN2GMSGME
Z05
1205942
MAX3
Z05
1205942
MAX3
Replication-based test
Replication-based test
Replication-based test
Figure 1 Empirical powers based on 2000 simulations for119872 = 10markers genotype relative risks of both stages = 14 and disease prevalence119870 = 01 under the recessive additive and dominant models 1000 cases and 1000 controls are considered to control 120572 = 005 The first stagetype I error rate for discovery is 120572
119863= 005 Six test statistics 119885
12 1205942 MAX3 MIN2 GMS and GME are considered The first second and
third columns depict powers using the replication-based test 119885FC and 119885LC respectively
BioMed Research International 7
Replication-based test
Recessive
Additive
Dominant
(05 03) (05 04) (06 03) (06 04)
(120587sMAF)
Empi
rical
pow
er
10
08
06
04
02
00
(05 03) (05 04) (06 03) (06 04)
(120587sMAF)Em
piric
al p
ower
10
08
06
04
02
00
(05 03) (05 04) (06 03) (06 04)
(120587sMAF)
Empi
rical
pow
er
10
08
06
04
02
00
(05 03) (05 04) (06 03) (06 04)
(120587sMAF)
Empi
rical
pow
er
10
08
06
04
02
00
(05 03) (05 04) (06 03) (06 04)
(120587sMAF)
Empi
rical
pow
er
10
08
06
04
02
00
(05 03) (05 04) (06 03) (06 04)
(120587sMAF)
Empi
rical
pow
er
10
08
06
04
02
00
(05 03) (05 04) (06 03) (06 04)
(120587sMAF)
Empi
rical
pow
er
10
08
06
04
02
00
(05 03) (05 04) (06 03) (06 04)
(120587sMAF)
Empi
rical
pow
er
10
08
06
04
02
00
(05 03) (05 04) (06 03) (06 04)
(120587sMAF)
Empi
rical
pow
er
10
08
06
04
02
00
Z05
1205942
MAX3
ZFC
ZFC
ZFC
ZLC
ZLC
ZLC
MIN2GMSGME
MIN2GMSGME
MIN2GMSGME
Z05
1205942
MAX3
Z05
1205942
MAX3
Replication-based test
Replication-based test
Figure 2 Empirical powers based on 2000 simulations for 119872 = 10 markers genotype relative risks of two stages are different (1199031= 16
1199032= 14) disease prevalence 119870 = 01 under the recessive additive and dominant models 1000 cases and 1000 controls are considered to
control 120572 = 005 The first stage type I error rate for discovery is 120572119863= 005 Six test statistics 119885
12 1205942 MAX3 MIN2 GMS and GME are
considered The first second and third columns depict powers using the replication-based test 119885FC and 119885LC respectively
8 BioMed Research International
minuslo
g 10(120588)
Chromosome1 2 3 4 5 6 7 8 9 11 13 15 17 19 21 23
8
6
4
2
0
MIN2 P value
(a)
minuslo
g 10(120588)
Additive P value8
6
4
2
0
Chromosome1 2 3 4 5 6 7 8 9 11 13 15 17 19 21 23
(b)
Figure 3 Manhattan plots of 246758 SNPs from Yoon et al [25] based on MIN2 (a) and 11988512
(b) The 119909 axis is chromosomal location andthe 119910 axis is the significance (minuslog
10119875) of association The horizontal line corresponds to the significance level 10minus4
Table 2 For selected exemplary three SNPs for testing association with NSCLC 119901 value of combined test using additive CATT (11988512)
Pearsonrsquos Chi-square test (119879chi2) MAX3 MIN2 119885GMS and 119885GME
SNP 119901 value of 11988512
119901 value of 119879chi2Discovery Replication Combined test Discovery Replication Combined test
rs2131877 788 times 10minus5
104 times 10minus4
797 times 10minus8
140 times 10minus4
149 times 10minus4
184 times 10minus7
rs905551 183 times 10minus5
702 times 10minus3
770 times 10minus6
806 times 10minus5
489 times 10minus2
140 times 10minus5
rs1695109 248 times 10minus4
346 times 10minus2
217 times 10minus6
456 times 10minus5
153 times 10minus1
207 times 10minus5
SNP 119901 value of MAX3 119901 value of MIN2Discovery Replication Combined test Discovery Replication Combined test
rs2131877 153 times 10minus4
405 times 10minus2
192 times 10minus5
132 times 10minus4
104 times 10minus4
126 times 10minus7
rs905551 450 times 10minus5
702 times 10minus3
192 times 10minus6
134 times 10minus4
489 times 10minus2
199 times 10minus5
rs1695109 354 times 10minus5
263 times 10minus2
464 times 10minus6
236 times 10minus5
263 times 10minus2
335 times 10minus6
SNP 119901 value of 119885GMS 119901 value of 119885GME
Discovery Replication Combined test Discovery Replication Combined testrs2131877 186 times 10
minus4104 times 10
minus4171 times 10
minus7103 times 10
minus4104 times 10
minus4102 times 10
minus7
rs905551 519 times 10minus5
702 times 10minus3
216 times 10minus6
735 times 10minus5
801 times 10minus3
320 times 10minus6
rs1695109 689 times 10minus4
127 times 10minus1
385 times 10minus5
269 times 10minus5
419 times 10minus2
540 times 10minus6
In a genetic study where the purpose of considering areplication stage is to validate or replicate the genetic findingsfrom the discovery stage which is the case considered inthis paper the analysis in the replication stage utilized thetest statistic or genetic model that is selected as being thebest in the discovery stage and also the direction of the riskallele following guidelines for exact replication in geneticassociation studies If the purpose is to simply combine theevidence fromdifferent data sources such as inmeta-analysisother strategies may be devised Further research again isrequired to provide fully detailed properties of suchmethods
Power gain of a joint analysis over the conventionalreplication-based analysis was thoroughly studied by Skol etal [18 19] In our simulation the amount of power increaseusing a joint test compared to the replication-based analysiswas much minor than what was observed by Skol et al[18 19] The exact reason is not known but we suspect thismight be due to the power advantages of robust methodsand also due to the fact that the optimal choice from thefirst stage is used when calculating the second stage 119901 values
However even though it was minor in some situationsthe joint anlysis presented better power performance thanthe replication-based analysis in our study This type ofjoint analysis raised concerns about the exact meaning ofreplication [17] However McCarthy et al [26] mentionedthat joint analyses ldquoblur the boundaries of where exactlyreplication starts but whichever analytical approach is takenconfirmation in many independent samples is important andit is the overall strength of the evidence of association thatmattersrdquo Purpose of the current study was to present howthe overall strength of the evidence of association can beevaluated when robust tests are used in GWAS replicationstudies
We illustrated how the proposed methods can be appliedin the real data that studied the association of SNPs with non-small-cell lung cancer (NSCLC) in discovery and replicationstages In the original study reported by Yoon et al [25]SNPs were selected in the discovery data set not based onthe robust tests but based on additive CATT Therefore wefound that some SNPs could have been selected by one
BioMed Research International 9
of the robust methods but they were not included in thereplication data set For these SNPs we were not able toperform the joint analysis that we propose and it was notpossible to examine whether there are other SNPs that couldhave been found to be associated with NSCLC by proposedmethods in the replication study For this reason we merelypresented howmany additional SNPs could have been furtherfollowed in the replication stage when robust methods wereused In many GWASs it is a common practice to reportthe summary test statistics and 119901 values of the SNPs undera specific genetic model usually an additive model whichwere further genotyped in the replication stage and werefinally defined to be significantly associated with a phenotypeof interest As emphasized in this paper one may have abetter chance of findingmanymissing SNPs by applyingmorepowerful and robust methods that consider different geneticmodels simultaneouslyTherefore we urge the community toshare test results under not only an additive model but alsoother genetic models although they were not significant at astringent significance level so that future research may haveenriched data resources to which robust tests can be appliedin association studies
Appendix
119901 value of Fisherrsquos Combination forEqual and Unequal Weights
Equal Weights Assume 1199081
= 1199082
= 1 Under the nullhypothesis of no association 119883
1= minus2 log119875(1) and 119883
2=
minus2 log119875(2) are independent and each asymptotically followsa 1205942 distribution with 2 degrees of freedom Let 119891
119896(119909) and
119865119896(119909) be the probability and cumulative density functions of
1205942 random variable with 119896 degrees of freedomThen 119891
2(119909) =
exp(minus1199092)2 1198914(119909) = 119909 exp(minus1199092)4 119865
2(119909) = 1 minus exp(minus1199092)
and1198654(119909) = 1minusexp(minus1199092)minus119909 exp(minus1199092)2 Denote the cutoff
of the discovery stage based on 120572119863as 119862119863 that is 119865
2(119862119863) =
1 minus 120572119863 For observed value 119911FC gt 119862
119863of1198831+ 1198832 the 119901 value
is written as
1198751198670
(1198831gt 119862119863 1198831+ 1198832gt 119911FC)
= 120572119863minus int
119911FC
119862119863
1198912 (119909) 1198652 (119911FC minus 119909) 119889119909
= 120572119863+ exp (minus
119911FC2) minus 120572119863+1
2exp(minus
119911FC2) (119911FC minus 119862119863)
= exp (minus119911FC2) (1 +
119911FC2
+ log120572119863)
(A1)
Unequal WeightsWhen different proportions of samples areused in the discovery and replication stages it may be moreappropriate to assignweights proportional to the sample sizesfor each stage For example when only a small portion isused in the discovery stage to prevent Fisherrsquos combinationtest from being dominated by the significant result in the
discovery stage one may want to assign a small weight to thediscovery stage result
When 120587119904is the proportion of samples used in the dis-
covery stage one selection for weights is 1199081
= 2120587119904and
1199082= 2(1 minus 120587
119904) for discovery and replication stages which
simplifies to equal weights when 120587119904= 05 Based on these
weights we consider unequal-weighted Fisherrsquos combinationasminus2 log119875(1)1199081119875(2)1199082 = 119908
11198831+11990821198832[27] Its density function
is given by
119891119908 (119909) =
1
2 (1199081minus 1199082)exp(minus 119909
(21199081))
minus1
2 (1199081minus 1199082)exp(minus 119909
(21199082))
(A2)
and the probability distribution function is
119865119908 (119909) = 1 minus
1199081
(1199081minus 1199082)exp(minus 119909
21199081
)
minus1199082
(1199081minus 1199082)exp(minus 119909
21199082
) 1199081
= 1199082
(A3)
Using the previous notation we have the following formof 119901 value
1198751198670
(1198831gt 119862119863 11990811198831+ 11990821198832gt 119911FC)
= 120572119863minus int
119911FC1199081
119862119863
1198912 (119909) 1198652 (
119911FC minus 1199081119909
1199082
)119889119909
= exp(minus119911FC21199081
) +1199082
1199081minus 1199082
exp(minus119911FC21199082
)
times exp(1199081 minus 1199082211990811199082
119911FC) minus exp(1199081 minus 1199082211990811199082
1199081119862119863)
=1199081
1199081minus 1199082
exp(minus119911FC21199081
)
minus1199082
1199081minus 1199082
exp(minus119911FC21199082
)120572minus(1199081
minus1199082
)1199082
119863
=1199081
1199081minus 1199082
exp(minus119911FC21199081
)
minus1199082
1199081minus 1199082
exp(minus119911FC21199082
)120572minus(1199081
minus1199082
)1199082
119863
(A4)
where 119911FC1199081 gt 119862119863
Conflict of Interests
The authors declare that there is no conflict of interestsregarding the publication of this paper
10 BioMed Research International
Acknowledgments
The authors are indebted to late Dr Gang Zheng for his inspi-ration and support on their workThisworkwas supported bygrant of the National Cancer Center (no NCC-1210060)
References
[1] R J Klein C Zeiss E Y Chew et al ldquoComplement factor Hpolymorphism in age-related macular degenerationrdquo Sciencevol 308 no 5720 pp 385ndash389 2005
[2] R H Duerr K D Taylor S R Brant et al ldquoA genome-wideassociation study identifies IL 23R as an inflammatory boweldisease generdquo Science vol 314 no 5804 pp 1461ndash1463 2006
[3] A Herbert N P Gerry M B McQueen et al ldquoA commongenetic variant is associated with adult and childhood obesityrdquoScience vol 312 no 5771 pp 279ndash283 2006
[4] R Sladek G Rocheleau J Rung et al ldquoA genome-wideassociation study identifies novel risk loci for type 2 diabetesrdquoNature vol 445 no 7130 pp 881ndash885 2007
[5] P R Burton D G Clayton L R Cardon et al ldquoGenome-wideassociation study of 14000 cases of seven common diseases and3000 shared controlsrdquo Nature vol 447 no 7145 pp 661ndash6782007
[6] S H Kwak S H Kim Y M Cho et al ldquoA genome-wideassociation study of gestational diabetes mellitus in Koreanwomenrdquo Diabetes vol 61 no 2 pp 531ndash541 2012
[7] W G Cochran ldquoSomemethods for strengthening the common1205942 testsrdquo Biometrics vol 10 no 4 pp 417ndash451 1954
[8] P Armitage ldquoTests for linear trends in proportions and frequen-ciesrdquo Biometrics vol 11 no 3 pp 375ndash386 1955
[9] P D Sasieni ldquoFrom genotypes to genes doubling the samplesizerdquo Biometrics vol 53 no 4 pp 1253ndash1261 1997
[10] B Freidlin G Zheng Z Li and J L Gastwirth ldquoTrend tests forcase-control studies of genetic markers power sample size androbustnessrdquo Human Heredity vol 53 no 3 pp 146ndash152 2002
[11] J L Gastwirth ldquoOn robust proceduresrdquo Journal of the AmericanStatistical Association vol 61 no 316 pp 929ndash948 1966
[12] J L Gastwirth ldquoThe use of maximin efficiency robust tests incombining contingency tables and survival analysisrdquo Journal ofthe American Statistical Association vol 80 no 390 pp 380ndash384 1985
[13] G Zheng and H K T Ng ldquoGenetic model selection in two-phase analysis for case-control association studiesrdquo Biostatisticsvol 9 no 3 pp 391ndash399 2008
[14] J Joo M Kwak and G Zheng ldquoImproving power for testinggenetic association in case-control studies by reducing thealternative spacerdquo Biometrics vol 66 no 1 pp 266ndash276 2010
[15] J Joo M Kwak K Ahn and G Zheng ldquoA robust genome-widescan statistic of theWellcome Trust Case-Control ConsortiumrdquoBiometrics vol 65 no 4 pp 1115ndash1122 2009
[16] P Kraft E Zeggini and J P A Ioannidis ldquoReplication ingenome-wide association studiesrdquo Statistical Science vol 24 no4 pp 561ndash573 2009
[17] D CThomas G Casey D V Conti R W Haile J P Lewingerand D O Stram ldquoMethodological issues in multistage genome-wide association studiesrdquo Statistical Science vol 24 no 4 pp414ndash429 2009
[18] A D Skol L J Scott G R Abecasis and M Boehnke ldquoJointanalysis is more efficient than replication-based analysis for
two-stage genome-wide association studiesrdquo Nature Geneticsvol 38 no 2 pp 209ndash213 2006
[19] A D Skol L J Scott G R Abecasis andM Boehnke ldquoOptimaldesigns for two-stage genome-wide association studiesrdquoGeneticEpidemiology vol 31 no 7 pp 776ndash788 2007
[20] G Zheng B Freidlin Z Li and J L Gastwirth ldquoChoice ofscores in trend tests for case-control studies of candidate-geneassociationsrdquo Biometrical Journal vol 45 no 3 pp 335ndash3482003
[21] K Song and R C Elston ldquoA powerful method of combiningmeasures of association and Hardy-Weinberg Disequilibriumfor fine-mapping in case-control studiesrdquo Statistics in Medicinevol 25 no 1 pp 105ndash126 2006
[22] S Won N Morris Q Lu and R C Elston ldquoChoosing anoptimal method to combine P-valuesrdquo Statistics in Medicinevol 28 no 11 pp 1537ndash1553 2009
[23] F BegumDGhoshGC Tseng andE Feingold ldquoComprehen-sive literature review and statistical considerations for GWASmeta-analysisrdquo Nucleic Acids Research vol 40 no 9 pp 3777ndash3784 2012
[24] B Efron and R J Tibshirani An Introduction to the BootstrapChapman amp HallCRC Press 1993
[25] K A Yoon J H Park J Han et al ldquoA genome-wide associationstudy reveals susceptibility variants for non-small cell lungcancer in the Korean populationrdquo Human Molecular Geneticsvol 19 no 24 pp 4948ndash4954 2010
[26] M I McCarthy G R Abecasis L R Cardon et al ldquoGenome-wide association studies for complex traits consensus uncer-tainty and challengesrdquoNature Reviews Genetics vol 9 no 5 pp356ndash369 2008
[27] I J Good ldquoOn the weighted combination of significance testsrdquoJournal of the Royal Statistical Society Series B vol 17 pp 264ndash265 1955
Submit your manuscripts athttpwwwhindawicom
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Anatomy Research International
PeptidesInternational Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporation httpwwwhindawicom
International Journal of
Volume 2014
Zoology
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Molecular Biology International
GenomicsInternational Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
BioinformaticsAdvances in
Marine BiologyJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Signal TransductionJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
BioMed Research International
Evolutionary BiologyInternational Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Biochemistry Research International
ArchaeaHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Genetics Research International
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Advances in
Virolog y
Hindawi Publishing Corporationhttpwwwhindawicom
Nucleic AcidsJournal of
Volume 2014
Stem CellsInternational
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Enzyme Research
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
International Journal of
Microbiology
4 BioMed Research International
Table 1 Type I error rates of three approachesmdashreplication-based (REP) test Fisherrsquos combination (119885FC) and linear combination of test(119885LC)mdashbased on the CATT with an additive model (119885
12) 1205942 MAX3 MIN2 GMS and GME The disease prevalence 119870 = 01 119872 = 10
markers 119903 = 1 500 cases and 119904 = 1 500 controls are considered based on 20000 simulations
MAF 120587119904
120572119863
119865 = 0
11988512
1205942 MAX3 MIN2 GMS GME
03 05
005REP 000530 000455 000505 000485 00050 000490119885FC 000500 000535 000495 000510 000515 000460119885LC 000535 000525 000485 000510 00050 000485
01REP 000510 000560 000565 000485 000525 000545119885FC 000565 000535 000565 000545 000565 000540119885LC 000520 000565 000525 000520 000530 000525
03 06
005REP 000510 000485 000480 000515 000480 000500119885FC 000445 000455 000450 000455 000450 000460LC 000500 000515 000495 000520 000475 000480
01REP 000500 000485 000485 000535 000530 000505119885FC 000465 000490 000485 000500 000455 000460119885LC 000480 000470 000515 000490 000485 000475
04 05
005REP 000590 000505 000530 000565 000505 000510119885FC 000575 000430 000460 000535 000460 000500119885LC 000600 000445 000500 000540 000490 000490
01REP 000525 000470 000535 000450 000480 000515119885FC 000515 000510 000495 000475 000540 000500119885LC 000530 000500 000485 000475 000495 000510
04 06
005REP 000475 000585 000480 000500 000515 000495119885FC 000460 000470 000420 000490 000455 000440119885LC 000525 000550 000520 000580 000510 000510
01REP 000550 000490 000515 000535 000555 000540119885FC 000520 000370 000495 000450 000515 000510119885LC 000565 000485 000570 000530 000610 000580
complex forms and obtaining the distribution of the joint testcan be difficult On the other hand calculating a 119901 value foreach data set might be relatively simple and the distributionof 119901 values under the null hypothesis of no association is easyto handle
There are several methods for combining test statisticsfrom two stages [22] and twomost commonly used forms arebased on Fisherrsquos combination and a linear combination afterinverse normal transformation [23] Fisherrsquos combination(FC) directly sums 119901 values after minus2 log transformation thatis 119885FC = minus2119908
1log(119875(1)) minus 2119908
2log(119875(2)) where 119875(119894) is 119901 value
from 119894 = 0 for discovery and 119894 = 1 for replication stagesusing a given robust test A specification of 119908
1= 1199082= 1
gives the same weight for discovery and replication stagesand one can consider 119908
1= 2120587
119904and 119908
2= 2(1 minus 120587
119904)
where 120587119904= 119873119863(119873119863+ 119873119877) and 119873
119863and 119873
119877are sample
sizes of the discovery and replication data sets A linearcombination of two 119875 values after taking the inverse of thestandard normal cumulative distribution is given by 119885LC =
1199081Φminus1(1minus119875(1)2)+119908
2Φminus1(1minus119875(2))radic1199082
1+ 11990822with a natural
choice of 1199081= radic120587119904 and 1199082 = radic1 minus 120587
119904 Let the significance
level of the discovery stage be 120572119863 which means that markers
with 119875(1) lt 120572119863are selected and replicated in the replication
stage The 119901 value of combined test can then be obtained as
119901FC = 1198751198670
(119875(1)
lt 120572119863 119885FC gt 119911FC)where the observed value of
119885FC is 119911FCThe119901FC are calculated as 119890minus119911FC2(1+119911FC2+log120572119863)
for equal weights where 119911FC gt minus2 log120572119863and (119908
1(1199081minus
1199082))119890minus119911FC21199081minus(119908
2(1199081minus1199082))119890minus119911FC21199082120572
minus(1199081
minus1199082
)1199082
119889for unequal
weights where 119911FC gt minus21199081log120572119863 Detailed derivations are
described in the Appendix Equivalently for an overall typeI error threshold for a single marker of 120572 one may obtainthe threshold 119862FC of 119885FC that satisfies 119875
1198670
(119875(1)
lt 120572119863 119885FC gt
119862FC) le 120572 Similarly for the 119885LC the 119901 value is calculatedas 119901LC = 119875
1198670
(119875(1)
lt 120572119863 119885LC gt 119911LC) = int
infin
1199111minus120572
119863
2
120601(119911)[1 minus
Φ((radic11990821+ 11990822119911LC minus 1199081119911)1199082)]119889119911 for 119911LC gt 119911
1minus120572119863
2where the
observed value of 119885LC = 119911LC
3 Simulation Results
31 Type I Error Table 1 provides the type I errors underdifferent scenarios A disease prevalence of 10 is assumedand a total of 1500 cases and 1500 controls were divided intotwo stages The proportions of samples in the first stage (120587
119904)
of 05 and 06 were considered for the minor allele frequency(MAF) of 03 and 04 We considered 119872 = 10 markers tocontrol the genome-wide false positive rate at 120572 = 005 withthe Bonferroni correction We did not consider a larger 119872
BioMed Research International 5
such as 300000 or 500000 because this will require morethan 10 million simulations to show a stable estimate of thetype I error rate With 119872 = 10 we performed 20000simulations which result in less than 10 of a coefficientof variation for a significance level 005119872 = 0005 foreach marker [24] The test statistics considered are 119885
12
Pearsonrsquos Chi-square testMIN2MAX3GMS andGME Forthe second stage analysis we considered a replication-basedanalysis 119885FC and 119885LC as proposed above The results arebased on the situation under HWE (HWE coefficient 119865 = 0)As expected all tests control the type I error reasonablly welland similar results were obtained when a slight deviationfrom HWE is present with 119865 = 005 (results not shown)
32 Empirical Power We examined the empirical powers ofdifferent tests considered above In Figure 1 we considered119872 = 10 markers a disease prevalence of 10 the samegenotype relative risk for two stages (119903
1= 14 and 119903
2= 14)
and 1000 cases and 1000 controls 2000 simulations wereperformed under HWE (119865 = 0) to control the genome-wide false positive rate at 120572 = 005 The recessive additiveand dominant models were assumed for the first secondand third rows Both joint analyses showed better powerperformances compared to the replication-based analysis (upto 159 in scenarios considered in Figure 1) and LC and FChave comparable powers with less than 2 difference Thepower gain of using the joint analysis is not as much as thatobserved in Skol et al [18] However as reported by Skolet al [18] when the between-stage heterogeneity exists andthe risk allele has a larger effect in the first stage than thatin the second stage much improved power is observed byusing the joint test Figure 2 shows results under this scenariowith 119903
1= 16 and 119903
2= 14 and the observed increase
in power using the joint test is as high as 339 Againthe difference between LC and FC is minor with less than3 difference As for comparison between different robustmethods MAX3 GMS and GME perform well under therecessive model while 119885
12 1205942 and MIN2 are less powerful
Under the additivemodel11988512
is most powerful as expectedand 1205942 is least powerful Other robust methods perform wellwith a slight decrease in power compared to 119885
12 Under the
dominant model MAX3 GMS and GME perform the besteven though all tests show good power performances and thedifference is minor Similar patterns were observed when aslight deviation from theHWE is present (results not shown)
4 Real Data Application
The GWAS on non-small-cell lung cancer (NSCLC) by Yoonet al [25] studied 621 NSCLC patients and 1541 controlsubjects in the discovery stage After stringent quality controlsteps a total of 246758 SNPs were tested for the associationwith NSCLC based on119885
12 In the replication stage 168 SNPs
with 119901 value less than 1 times 10minus4 in the first stage based on 11988512
were tested using 804 patients and 1470 control samples Weidentified additional 234 SNPs using MIN2 in the first stagewhich could be studied in the replication stage if MIN2 wasused instead of 119885
12since MIN2 produces stronger evidence
for the additional SNPs than 11988512
does The Manhattan plotsof using MIN2 and 119885
12are presented in Figure 3 One
example is 119903119904385272 located in chromosome 2 which hada 119901 value of 137 times 10
minus7 which reached significance level atBonferroni correction in discovery samples alone whereas11988512
yielded a119901 value greater than 1times10minus4 Even though thereis possibility of false positive findings these SNPs could havebeen selected for replication if robust methods were used
Since we do not have replication data for these additionalSNPs selected using MIN2 because the first stage selectionwas based on 119885
12in Yoon et al [25] just for illustration
purpose of the proposed methods we present the results ofthree SNPs including 1199031199042131877 that was reported by Yoonet al [25] When the significance level in the discovery stageis set at 120572
119863= 5 times 10
minus5 so that all these exemplary SNPs canbe selected in the discovery stage the 119901 value of combinedtest based on four robust methods (MAX3 MIN2 GMSand GME) as well as 119885
12and Pearsonrsquos Chi-square test is
presented in Table 2 Fisherrsquos combination was used for thejoint test in the second stage Only 1199031199042131877was found to besignificant with Bonferroni correction (119901 value lt 203times10
minus7)by all except MAX3method
5 Discussion
In genetic association studies efficiency robust tests whoseperformance does not depend on the underlying geneticmodel have been extensively studied and their power benefitover a wide range of geneticmodels has been well recognizedIn this paper we described how the idea of these robustassociation tests can be applied to the replication studies andfurther how overall statistical significance can be evaluatedusing the combined test formed by 119901 values of the discoveryand replication studies
When the robust tests are used the test statistic ofeach stage can have a complex form and thus dealing withthe distribution of the joint test can be difficult whereascalculating the 119901 value of each stage might be relativelysimple Because the asymptotic distribution of the 119901 valueunder the null hypothesis of no association is easy to handlethe combined test using 119901 values rather than the test statisticsthemselves can provide computational convenience
There are several methods for combining test statisticsfrom two stages and Won et al [22] compared the per-formances of various choices Two most commonly usedforms are based on Fisherrsquos combination and the linearcombination after the inverse normal transformation [23]and we presented the test statistics and 119901 values of these twomethods In our limited experience the linear combinationand Fisherrsquos combination are fairly comparable Fisherrsquoscombination seems to perform slightly better than the linearcombination when there exists some heterogeneity betweenstages in terms of the genotype relative risk while the linearcombination seems to perform slightly better in most ofother situations However the difference is extremely minorFurther research is required for the thorough comparison ofvarious methods of combining 119901 values in the application ofefficiency robust tests to the replication of genetic associationstudies
6 BioMed Research International
Recessive
Additive
Dominant
(05 03) (05 04) (06 03) (06 04)
(120587sMAF)
Empi
rical
pow
er
10
08
06
04
02
00
(05 03) (05 04) (06 03) (06 04)
(120587sMAF)Em
piric
al p
ower
10
08
06
04
02
00
(05 03) (05 04) (06 03) (06 04)
(120587sMAF)
Empi
rical
pow
er
10
08
06
04
02
00
(05 03) (05 04) (06 03) (06 04)
(120587sMAF)
Empi
rical
pow
er
10
08
06
04
02
00
(05 03) (05 04) (06 03) (06 04)
(120587sMAF)
Empi
rical
pow
er
10
08
06
04
02
00
(05 03) (05 04) (06 03) (06 04)
(120587sMAF)
Empi
rical
pow
er
10
08
06
04
02
00
(05 03) (05 04) (06 03) (06 04)
(120587sMAF)
Empi
rical
pow
er
10
08
06
04
02
00
(05 03) (05 04) (06 03) (06 04)
(120587sMAF)
Empi
rical
pow
er
10
08
06
04
02
00
(05 03) (05 04) (06 03) (06 04)
(120587sMAF)
Empi
rical
pow
er
10
08
06
04
02
00
Z05
1205942
MAX3
ZFC
ZFC
ZFC
ZLC
ZLC
ZLC
MIN2GMSGME
MIN2GMSGME
MIN2GMSGME
Z05
1205942
MAX3
Z05
1205942
MAX3
Replication-based test
Replication-based test
Replication-based test
Figure 1 Empirical powers based on 2000 simulations for119872 = 10markers genotype relative risks of both stages = 14 and disease prevalence119870 = 01 under the recessive additive and dominant models 1000 cases and 1000 controls are considered to control 120572 = 005 The first stagetype I error rate for discovery is 120572
119863= 005 Six test statistics 119885
12 1205942 MAX3 MIN2 GMS and GME are considered The first second and
third columns depict powers using the replication-based test 119885FC and 119885LC respectively
BioMed Research International 7
Replication-based test
Recessive
Additive
Dominant
(05 03) (05 04) (06 03) (06 04)
(120587sMAF)
Empi
rical
pow
er
10
08
06
04
02
00
(05 03) (05 04) (06 03) (06 04)
(120587sMAF)Em
piric
al p
ower
10
08
06
04
02
00
(05 03) (05 04) (06 03) (06 04)
(120587sMAF)
Empi
rical
pow
er
10
08
06
04
02
00
(05 03) (05 04) (06 03) (06 04)
(120587sMAF)
Empi
rical
pow
er
10
08
06
04
02
00
(05 03) (05 04) (06 03) (06 04)
(120587sMAF)
Empi
rical
pow
er
10
08
06
04
02
00
(05 03) (05 04) (06 03) (06 04)
(120587sMAF)
Empi
rical
pow
er
10
08
06
04
02
00
(05 03) (05 04) (06 03) (06 04)
(120587sMAF)
Empi
rical
pow
er
10
08
06
04
02
00
(05 03) (05 04) (06 03) (06 04)
(120587sMAF)
Empi
rical
pow
er
10
08
06
04
02
00
(05 03) (05 04) (06 03) (06 04)
(120587sMAF)
Empi
rical
pow
er
10
08
06
04
02
00
Z05
1205942
MAX3
ZFC
ZFC
ZFC
ZLC
ZLC
ZLC
MIN2GMSGME
MIN2GMSGME
MIN2GMSGME
Z05
1205942
MAX3
Z05
1205942
MAX3
Replication-based test
Replication-based test
Figure 2 Empirical powers based on 2000 simulations for 119872 = 10 markers genotype relative risks of two stages are different (1199031= 16
1199032= 14) disease prevalence 119870 = 01 under the recessive additive and dominant models 1000 cases and 1000 controls are considered to
control 120572 = 005 The first stage type I error rate for discovery is 120572119863= 005 Six test statistics 119885
12 1205942 MAX3 MIN2 GMS and GME are
considered The first second and third columns depict powers using the replication-based test 119885FC and 119885LC respectively
8 BioMed Research International
minuslo
g 10(120588)
Chromosome1 2 3 4 5 6 7 8 9 11 13 15 17 19 21 23
8
6
4
2
0
MIN2 P value
(a)
minuslo
g 10(120588)
Additive P value8
6
4
2
0
Chromosome1 2 3 4 5 6 7 8 9 11 13 15 17 19 21 23
(b)
Figure 3 Manhattan plots of 246758 SNPs from Yoon et al [25] based on MIN2 (a) and 11988512
(b) The 119909 axis is chromosomal location andthe 119910 axis is the significance (minuslog
10119875) of association The horizontal line corresponds to the significance level 10minus4
Table 2 For selected exemplary three SNPs for testing association with NSCLC 119901 value of combined test using additive CATT (11988512)
Pearsonrsquos Chi-square test (119879chi2) MAX3 MIN2 119885GMS and 119885GME
SNP 119901 value of 11988512
119901 value of 119879chi2Discovery Replication Combined test Discovery Replication Combined test
rs2131877 788 times 10minus5
104 times 10minus4
797 times 10minus8
140 times 10minus4
149 times 10minus4
184 times 10minus7
rs905551 183 times 10minus5
702 times 10minus3
770 times 10minus6
806 times 10minus5
489 times 10minus2
140 times 10minus5
rs1695109 248 times 10minus4
346 times 10minus2
217 times 10minus6
456 times 10minus5
153 times 10minus1
207 times 10minus5
SNP 119901 value of MAX3 119901 value of MIN2Discovery Replication Combined test Discovery Replication Combined test
rs2131877 153 times 10minus4
405 times 10minus2
192 times 10minus5
132 times 10minus4
104 times 10minus4
126 times 10minus7
rs905551 450 times 10minus5
702 times 10minus3
192 times 10minus6
134 times 10minus4
489 times 10minus2
199 times 10minus5
rs1695109 354 times 10minus5
263 times 10minus2
464 times 10minus6
236 times 10minus5
263 times 10minus2
335 times 10minus6
SNP 119901 value of 119885GMS 119901 value of 119885GME
Discovery Replication Combined test Discovery Replication Combined testrs2131877 186 times 10
minus4104 times 10
minus4171 times 10
minus7103 times 10
minus4104 times 10
minus4102 times 10
minus7
rs905551 519 times 10minus5
702 times 10minus3
216 times 10minus6
735 times 10minus5
801 times 10minus3
320 times 10minus6
rs1695109 689 times 10minus4
127 times 10minus1
385 times 10minus5
269 times 10minus5
419 times 10minus2
540 times 10minus6
In a genetic study where the purpose of considering areplication stage is to validate or replicate the genetic findingsfrom the discovery stage which is the case considered inthis paper the analysis in the replication stage utilized thetest statistic or genetic model that is selected as being thebest in the discovery stage and also the direction of the riskallele following guidelines for exact replication in geneticassociation studies If the purpose is to simply combine theevidence fromdifferent data sources such as inmeta-analysisother strategies may be devised Further research again isrequired to provide fully detailed properties of suchmethods
Power gain of a joint analysis over the conventionalreplication-based analysis was thoroughly studied by Skol etal [18 19] In our simulation the amount of power increaseusing a joint test compared to the replication-based analysiswas much minor than what was observed by Skol et al[18 19] The exact reason is not known but we suspect thismight be due to the power advantages of robust methodsand also due to the fact that the optimal choice from thefirst stage is used when calculating the second stage 119901 values
However even though it was minor in some situationsthe joint anlysis presented better power performance thanthe replication-based analysis in our study This type ofjoint analysis raised concerns about the exact meaning ofreplication [17] However McCarthy et al [26] mentionedthat joint analyses ldquoblur the boundaries of where exactlyreplication starts but whichever analytical approach is takenconfirmation in many independent samples is important andit is the overall strength of the evidence of association thatmattersrdquo Purpose of the current study was to present howthe overall strength of the evidence of association can beevaluated when robust tests are used in GWAS replicationstudies
We illustrated how the proposed methods can be appliedin the real data that studied the association of SNPs with non-small-cell lung cancer (NSCLC) in discovery and replicationstages In the original study reported by Yoon et al [25]SNPs were selected in the discovery data set not based onthe robust tests but based on additive CATT Therefore wefound that some SNPs could have been selected by one
BioMed Research International 9
of the robust methods but they were not included in thereplication data set For these SNPs we were not able toperform the joint analysis that we propose and it was notpossible to examine whether there are other SNPs that couldhave been found to be associated with NSCLC by proposedmethods in the replication study For this reason we merelypresented howmany additional SNPs could have been furtherfollowed in the replication stage when robust methods wereused In many GWASs it is a common practice to reportthe summary test statistics and 119901 values of the SNPs undera specific genetic model usually an additive model whichwere further genotyped in the replication stage and werefinally defined to be significantly associated with a phenotypeof interest As emphasized in this paper one may have abetter chance of findingmanymissing SNPs by applyingmorepowerful and robust methods that consider different geneticmodels simultaneouslyTherefore we urge the community toshare test results under not only an additive model but alsoother genetic models although they were not significant at astringent significance level so that future research may haveenriched data resources to which robust tests can be appliedin association studies
Appendix
119901 value of Fisherrsquos Combination forEqual and Unequal Weights
Equal Weights Assume 1199081
= 1199082
= 1 Under the nullhypothesis of no association 119883
1= minus2 log119875(1) and 119883
2=
minus2 log119875(2) are independent and each asymptotically followsa 1205942 distribution with 2 degrees of freedom Let 119891
119896(119909) and
119865119896(119909) be the probability and cumulative density functions of
1205942 random variable with 119896 degrees of freedomThen 119891
2(119909) =
exp(minus1199092)2 1198914(119909) = 119909 exp(minus1199092)4 119865
2(119909) = 1 minus exp(minus1199092)
and1198654(119909) = 1minusexp(minus1199092)minus119909 exp(minus1199092)2 Denote the cutoff
of the discovery stage based on 120572119863as 119862119863 that is 119865
2(119862119863) =
1 minus 120572119863 For observed value 119911FC gt 119862
119863of1198831+ 1198832 the 119901 value
is written as
1198751198670
(1198831gt 119862119863 1198831+ 1198832gt 119911FC)
= 120572119863minus int
119911FC
119862119863
1198912 (119909) 1198652 (119911FC minus 119909) 119889119909
= 120572119863+ exp (minus
119911FC2) minus 120572119863+1
2exp(minus
119911FC2) (119911FC minus 119862119863)
= exp (minus119911FC2) (1 +
119911FC2
+ log120572119863)
(A1)
Unequal WeightsWhen different proportions of samples areused in the discovery and replication stages it may be moreappropriate to assignweights proportional to the sample sizesfor each stage For example when only a small portion isused in the discovery stage to prevent Fisherrsquos combinationtest from being dominated by the significant result in the
discovery stage one may want to assign a small weight to thediscovery stage result
When 120587119904is the proportion of samples used in the dis-
covery stage one selection for weights is 1199081
= 2120587119904and
1199082= 2(1 minus 120587
119904) for discovery and replication stages which
simplifies to equal weights when 120587119904= 05 Based on these
weights we consider unequal-weighted Fisherrsquos combinationasminus2 log119875(1)1199081119875(2)1199082 = 119908
11198831+11990821198832[27] Its density function
is given by
119891119908 (119909) =
1
2 (1199081minus 1199082)exp(minus 119909
(21199081))
minus1
2 (1199081minus 1199082)exp(minus 119909
(21199082))
(A2)
and the probability distribution function is
119865119908 (119909) = 1 minus
1199081
(1199081minus 1199082)exp(minus 119909
21199081
)
minus1199082
(1199081minus 1199082)exp(minus 119909
21199082
) 1199081
= 1199082
(A3)
Using the previous notation we have the following formof 119901 value
1198751198670
(1198831gt 119862119863 11990811198831+ 11990821198832gt 119911FC)
= 120572119863minus int
119911FC1199081
119862119863
1198912 (119909) 1198652 (
119911FC minus 1199081119909
1199082
)119889119909
= exp(minus119911FC21199081
) +1199082
1199081minus 1199082
exp(minus119911FC21199082
)
times exp(1199081 minus 1199082211990811199082
119911FC) minus exp(1199081 minus 1199082211990811199082
1199081119862119863)
=1199081
1199081minus 1199082
exp(minus119911FC21199081
)
minus1199082
1199081minus 1199082
exp(minus119911FC21199082
)120572minus(1199081
minus1199082
)1199082
119863
=1199081
1199081minus 1199082
exp(minus119911FC21199081
)
minus1199082
1199081minus 1199082
exp(minus119911FC21199082
)120572minus(1199081
minus1199082
)1199082
119863
(A4)
where 119911FC1199081 gt 119862119863
Conflict of Interests
The authors declare that there is no conflict of interestsregarding the publication of this paper
10 BioMed Research International
Acknowledgments
The authors are indebted to late Dr Gang Zheng for his inspi-ration and support on their workThisworkwas supported bygrant of the National Cancer Center (no NCC-1210060)
References
[1] R J Klein C Zeiss E Y Chew et al ldquoComplement factor Hpolymorphism in age-related macular degenerationrdquo Sciencevol 308 no 5720 pp 385ndash389 2005
[2] R H Duerr K D Taylor S R Brant et al ldquoA genome-wideassociation study identifies IL 23R as an inflammatory boweldisease generdquo Science vol 314 no 5804 pp 1461ndash1463 2006
[3] A Herbert N P Gerry M B McQueen et al ldquoA commongenetic variant is associated with adult and childhood obesityrdquoScience vol 312 no 5771 pp 279ndash283 2006
[4] R Sladek G Rocheleau J Rung et al ldquoA genome-wideassociation study identifies novel risk loci for type 2 diabetesrdquoNature vol 445 no 7130 pp 881ndash885 2007
[5] P R Burton D G Clayton L R Cardon et al ldquoGenome-wideassociation study of 14000 cases of seven common diseases and3000 shared controlsrdquo Nature vol 447 no 7145 pp 661ndash6782007
[6] S H Kwak S H Kim Y M Cho et al ldquoA genome-wideassociation study of gestational diabetes mellitus in Koreanwomenrdquo Diabetes vol 61 no 2 pp 531ndash541 2012
[7] W G Cochran ldquoSomemethods for strengthening the common1205942 testsrdquo Biometrics vol 10 no 4 pp 417ndash451 1954
[8] P Armitage ldquoTests for linear trends in proportions and frequen-ciesrdquo Biometrics vol 11 no 3 pp 375ndash386 1955
[9] P D Sasieni ldquoFrom genotypes to genes doubling the samplesizerdquo Biometrics vol 53 no 4 pp 1253ndash1261 1997
[10] B Freidlin G Zheng Z Li and J L Gastwirth ldquoTrend tests forcase-control studies of genetic markers power sample size androbustnessrdquo Human Heredity vol 53 no 3 pp 146ndash152 2002
[11] J L Gastwirth ldquoOn robust proceduresrdquo Journal of the AmericanStatistical Association vol 61 no 316 pp 929ndash948 1966
[12] J L Gastwirth ldquoThe use of maximin efficiency robust tests incombining contingency tables and survival analysisrdquo Journal ofthe American Statistical Association vol 80 no 390 pp 380ndash384 1985
[13] G Zheng and H K T Ng ldquoGenetic model selection in two-phase analysis for case-control association studiesrdquo Biostatisticsvol 9 no 3 pp 391ndash399 2008
[14] J Joo M Kwak and G Zheng ldquoImproving power for testinggenetic association in case-control studies by reducing thealternative spacerdquo Biometrics vol 66 no 1 pp 266ndash276 2010
[15] J Joo M Kwak K Ahn and G Zheng ldquoA robust genome-widescan statistic of theWellcome Trust Case-Control ConsortiumrdquoBiometrics vol 65 no 4 pp 1115ndash1122 2009
[16] P Kraft E Zeggini and J P A Ioannidis ldquoReplication ingenome-wide association studiesrdquo Statistical Science vol 24 no4 pp 561ndash573 2009
[17] D CThomas G Casey D V Conti R W Haile J P Lewingerand D O Stram ldquoMethodological issues in multistage genome-wide association studiesrdquo Statistical Science vol 24 no 4 pp414ndash429 2009
[18] A D Skol L J Scott G R Abecasis and M Boehnke ldquoJointanalysis is more efficient than replication-based analysis for
two-stage genome-wide association studiesrdquo Nature Geneticsvol 38 no 2 pp 209ndash213 2006
[19] A D Skol L J Scott G R Abecasis andM Boehnke ldquoOptimaldesigns for two-stage genome-wide association studiesrdquoGeneticEpidemiology vol 31 no 7 pp 776ndash788 2007
[20] G Zheng B Freidlin Z Li and J L Gastwirth ldquoChoice ofscores in trend tests for case-control studies of candidate-geneassociationsrdquo Biometrical Journal vol 45 no 3 pp 335ndash3482003
[21] K Song and R C Elston ldquoA powerful method of combiningmeasures of association and Hardy-Weinberg Disequilibriumfor fine-mapping in case-control studiesrdquo Statistics in Medicinevol 25 no 1 pp 105ndash126 2006
[22] S Won N Morris Q Lu and R C Elston ldquoChoosing anoptimal method to combine P-valuesrdquo Statistics in Medicinevol 28 no 11 pp 1537ndash1553 2009
[23] F BegumDGhoshGC Tseng andE Feingold ldquoComprehen-sive literature review and statistical considerations for GWASmeta-analysisrdquo Nucleic Acids Research vol 40 no 9 pp 3777ndash3784 2012
[24] B Efron and R J Tibshirani An Introduction to the BootstrapChapman amp HallCRC Press 1993
[25] K A Yoon J H Park J Han et al ldquoA genome-wide associationstudy reveals susceptibility variants for non-small cell lungcancer in the Korean populationrdquo Human Molecular Geneticsvol 19 no 24 pp 4948ndash4954 2010
[26] M I McCarthy G R Abecasis L R Cardon et al ldquoGenome-wide association studies for complex traits consensus uncer-tainty and challengesrdquoNature Reviews Genetics vol 9 no 5 pp356ndash369 2008
[27] I J Good ldquoOn the weighted combination of significance testsrdquoJournal of the Royal Statistical Society Series B vol 17 pp 264ndash265 1955
Submit your manuscripts athttpwwwhindawicom
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Anatomy Research International
PeptidesInternational Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporation httpwwwhindawicom
International Journal of
Volume 2014
Zoology
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Molecular Biology International
GenomicsInternational Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
BioinformaticsAdvances in
Marine BiologyJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Signal TransductionJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
BioMed Research International
Evolutionary BiologyInternational Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Biochemistry Research International
ArchaeaHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Genetics Research International
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Advances in
Virolog y
Hindawi Publishing Corporationhttpwwwhindawicom
Nucleic AcidsJournal of
Volume 2014
Stem CellsInternational
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Enzyme Research
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
International Journal of
Microbiology
BioMed Research International 5
such as 300000 or 500000 because this will require morethan 10 million simulations to show a stable estimate of thetype I error rate With 119872 = 10 we performed 20000simulations which result in less than 10 of a coefficientof variation for a significance level 005119872 = 0005 foreach marker [24] The test statistics considered are 119885
12
Pearsonrsquos Chi-square testMIN2MAX3GMS andGME Forthe second stage analysis we considered a replication-basedanalysis 119885FC and 119885LC as proposed above The results arebased on the situation under HWE (HWE coefficient 119865 = 0)As expected all tests control the type I error reasonablly welland similar results were obtained when a slight deviationfrom HWE is present with 119865 = 005 (results not shown)
32 Empirical Power We examined the empirical powers ofdifferent tests considered above In Figure 1 we considered119872 = 10 markers a disease prevalence of 10 the samegenotype relative risk for two stages (119903
1= 14 and 119903
2= 14)
and 1000 cases and 1000 controls 2000 simulations wereperformed under HWE (119865 = 0) to control the genome-wide false positive rate at 120572 = 005 The recessive additiveand dominant models were assumed for the first secondand third rows Both joint analyses showed better powerperformances compared to the replication-based analysis (upto 159 in scenarios considered in Figure 1) and LC and FChave comparable powers with less than 2 difference Thepower gain of using the joint analysis is not as much as thatobserved in Skol et al [18] However as reported by Skolet al [18] when the between-stage heterogeneity exists andthe risk allele has a larger effect in the first stage than thatin the second stage much improved power is observed byusing the joint test Figure 2 shows results under this scenariowith 119903
1= 16 and 119903
2= 14 and the observed increase
in power using the joint test is as high as 339 Againthe difference between LC and FC is minor with less than3 difference As for comparison between different robustmethods MAX3 GMS and GME perform well under therecessive model while 119885
12 1205942 and MIN2 are less powerful
Under the additivemodel11988512
is most powerful as expectedand 1205942 is least powerful Other robust methods perform wellwith a slight decrease in power compared to 119885
12 Under the
dominant model MAX3 GMS and GME perform the besteven though all tests show good power performances and thedifference is minor Similar patterns were observed when aslight deviation from theHWE is present (results not shown)
4 Real Data Application
The GWAS on non-small-cell lung cancer (NSCLC) by Yoonet al [25] studied 621 NSCLC patients and 1541 controlsubjects in the discovery stage After stringent quality controlsteps a total of 246758 SNPs were tested for the associationwith NSCLC based on119885
12 In the replication stage 168 SNPs
with 119901 value less than 1 times 10minus4 in the first stage based on 11988512
were tested using 804 patients and 1470 control samples Weidentified additional 234 SNPs using MIN2 in the first stagewhich could be studied in the replication stage if MIN2 wasused instead of 119885
12since MIN2 produces stronger evidence
for the additional SNPs than 11988512
does The Manhattan plotsof using MIN2 and 119885
12are presented in Figure 3 One
example is 119903119904385272 located in chromosome 2 which hada 119901 value of 137 times 10
minus7 which reached significance level atBonferroni correction in discovery samples alone whereas11988512
yielded a119901 value greater than 1times10minus4 Even though thereis possibility of false positive findings these SNPs could havebeen selected for replication if robust methods were used
Since we do not have replication data for these additionalSNPs selected using MIN2 because the first stage selectionwas based on 119885
12in Yoon et al [25] just for illustration
purpose of the proposed methods we present the results ofthree SNPs including 1199031199042131877 that was reported by Yoonet al [25] When the significance level in the discovery stageis set at 120572
119863= 5 times 10
minus5 so that all these exemplary SNPs canbe selected in the discovery stage the 119901 value of combinedtest based on four robust methods (MAX3 MIN2 GMSand GME) as well as 119885
12and Pearsonrsquos Chi-square test is
presented in Table 2 Fisherrsquos combination was used for thejoint test in the second stage Only 1199031199042131877was found to besignificant with Bonferroni correction (119901 value lt 203times10
minus7)by all except MAX3method
5 Discussion
In genetic association studies efficiency robust tests whoseperformance does not depend on the underlying geneticmodel have been extensively studied and their power benefitover a wide range of geneticmodels has been well recognizedIn this paper we described how the idea of these robustassociation tests can be applied to the replication studies andfurther how overall statistical significance can be evaluatedusing the combined test formed by 119901 values of the discoveryand replication studies
When the robust tests are used the test statistic ofeach stage can have a complex form and thus dealing withthe distribution of the joint test can be difficult whereascalculating the 119901 value of each stage might be relativelysimple Because the asymptotic distribution of the 119901 valueunder the null hypothesis of no association is easy to handlethe combined test using 119901 values rather than the test statisticsthemselves can provide computational convenience
There are several methods for combining test statisticsfrom two stages and Won et al [22] compared the per-formances of various choices Two most commonly usedforms are based on Fisherrsquos combination and the linearcombination after the inverse normal transformation [23]and we presented the test statistics and 119901 values of these twomethods In our limited experience the linear combinationand Fisherrsquos combination are fairly comparable Fisherrsquoscombination seems to perform slightly better than the linearcombination when there exists some heterogeneity betweenstages in terms of the genotype relative risk while the linearcombination seems to perform slightly better in most ofother situations However the difference is extremely minorFurther research is required for the thorough comparison ofvarious methods of combining 119901 values in the application ofefficiency robust tests to the replication of genetic associationstudies
6 BioMed Research International
Recessive
Additive
Dominant
(05 03) (05 04) (06 03) (06 04)
(120587sMAF)
Empi
rical
pow
er
10
08
06
04
02
00
(05 03) (05 04) (06 03) (06 04)
(120587sMAF)Em
piric
al p
ower
10
08
06
04
02
00
(05 03) (05 04) (06 03) (06 04)
(120587sMAF)
Empi
rical
pow
er
10
08
06
04
02
00
(05 03) (05 04) (06 03) (06 04)
(120587sMAF)
Empi
rical
pow
er
10
08
06
04
02
00
(05 03) (05 04) (06 03) (06 04)
(120587sMAF)
Empi
rical
pow
er
10
08
06
04
02
00
(05 03) (05 04) (06 03) (06 04)
(120587sMAF)
Empi
rical
pow
er
10
08
06
04
02
00
(05 03) (05 04) (06 03) (06 04)
(120587sMAF)
Empi
rical
pow
er
10
08
06
04
02
00
(05 03) (05 04) (06 03) (06 04)
(120587sMAF)
Empi
rical
pow
er
10
08
06
04
02
00
(05 03) (05 04) (06 03) (06 04)
(120587sMAF)
Empi
rical
pow
er
10
08
06
04
02
00
Z05
1205942
MAX3
ZFC
ZFC
ZFC
ZLC
ZLC
ZLC
MIN2GMSGME
MIN2GMSGME
MIN2GMSGME
Z05
1205942
MAX3
Z05
1205942
MAX3
Replication-based test
Replication-based test
Replication-based test
Figure 1 Empirical powers based on 2000 simulations for119872 = 10markers genotype relative risks of both stages = 14 and disease prevalence119870 = 01 under the recessive additive and dominant models 1000 cases and 1000 controls are considered to control 120572 = 005 The first stagetype I error rate for discovery is 120572
119863= 005 Six test statistics 119885
12 1205942 MAX3 MIN2 GMS and GME are considered The first second and
third columns depict powers using the replication-based test 119885FC and 119885LC respectively
BioMed Research International 7
Replication-based test
Recessive
Additive
Dominant
(05 03) (05 04) (06 03) (06 04)
(120587sMAF)
Empi
rical
pow
er
10
08
06
04
02
00
(05 03) (05 04) (06 03) (06 04)
(120587sMAF)Em
piric
al p
ower
10
08
06
04
02
00
(05 03) (05 04) (06 03) (06 04)
(120587sMAF)
Empi
rical
pow
er
10
08
06
04
02
00
(05 03) (05 04) (06 03) (06 04)
(120587sMAF)
Empi
rical
pow
er
10
08
06
04
02
00
(05 03) (05 04) (06 03) (06 04)
(120587sMAF)
Empi
rical
pow
er
10
08
06
04
02
00
(05 03) (05 04) (06 03) (06 04)
(120587sMAF)
Empi
rical
pow
er
10
08
06
04
02
00
(05 03) (05 04) (06 03) (06 04)
(120587sMAF)
Empi
rical
pow
er
10
08
06
04
02
00
(05 03) (05 04) (06 03) (06 04)
(120587sMAF)
Empi
rical
pow
er
10
08
06
04
02
00
(05 03) (05 04) (06 03) (06 04)
(120587sMAF)
Empi
rical
pow
er
10
08
06
04
02
00
Z05
1205942
MAX3
ZFC
ZFC
ZFC
ZLC
ZLC
ZLC
MIN2GMSGME
MIN2GMSGME
MIN2GMSGME
Z05
1205942
MAX3
Z05
1205942
MAX3
Replication-based test
Replication-based test
Figure 2 Empirical powers based on 2000 simulations for 119872 = 10 markers genotype relative risks of two stages are different (1199031= 16
1199032= 14) disease prevalence 119870 = 01 under the recessive additive and dominant models 1000 cases and 1000 controls are considered to
control 120572 = 005 The first stage type I error rate for discovery is 120572119863= 005 Six test statistics 119885
12 1205942 MAX3 MIN2 GMS and GME are
considered The first second and third columns depict powers using the replication-based test 119885FC and 119885LC respectively
8 BioMed Research International
minuslo
g 10(120588)
Chromosome1 2 3 4 5 6 7 8 9 11 13 15 17 19 21 23
8
6
4
2
0
MIN2 P value
(a)
minuslo
g 10(120588)
Additive P value8
6
4
2
0
Chromosome1 2 3 4 5 6 7 8 9 11 13 15 17 19 21 23
(b)
Figure 3 Manhattan plots of 246758 SNPs from Yoon et al [25] based on MIN2 (a) and 11988512
(b) The 119909 axis is chromosomal location andthe 119910 axis is the significance (minuslog
10119875) of association The horizontal line corresponds to the significance level 10minus4
Table 2 For selected exemplary three SNPs for testing association with NSCLC 119901 value of combined test using additive CATT (11988512)
Pearsonrsquos Chi-square test (119879chi2) MAX3 MIN2 119885GMS and 119885GME
SNP 119901 value of 11988512
119901 value of 119879chi2Discovery Replication Combined test Discovery Replication Combined test
rs2131877 788 times 10minus5
104 times 10minus4
797 times 10minus8
140 times 10minus4
149 times 10minus4
184 times 10minus7
rs905551 183 times 10minus5
702 times 10minus3
770 times 10minus6
806 times 10minus5
489 times 10minus2
140 times 10minus5
rs1695109 248 times 10minus4
346 times 10minus2
217 times 10minus6
456 times 10minus5
153 times 10minus1
207 times 10minus5
SNP 119901 value of MAX3 119901 value of MIN2Discovery Replication Combined test Discovery Replication Combined test
rs2131877 153 times 10minus4
405 times 10minus2
192 times 10minus5
132 times 10minus4
104 times 10minus4
126 times 10minus7
rs905551 450 times 10minus5
702 times 10minus3
192 times 10minus6
134 times 10minus4
489 times 10minus2
199 times 10minus5
rs1695109 354 times 10minus5
263 times 10minus2
464 times 10minus6
236 times 10minus5
263 times 10minus2
335 times 10minus6
SNP 119901 value of 119885GMS 119901 value of 119885GME
Discovery Replication Combined test Discovery Replication Combined testrs2131877 186 times 10
minus4104 times 10
minus4171 times 10
minus7103 times 10
minus4104 times 10
minus4102 times 10
minus7
rs905551 519 times 10minus5
702 times 10minus3
216 times 10minus6
735 times 10minus5
801 times 10minus3
320 times 10minus6
rs1695109 689 times 10minus4
127 times 10minus1
385 times 10minus5
269 times 10minus5
419 times 10minus2
540 times 10minus6
In a genetic study where the purpose of considering areplication stage is to validate or replicate the genetic findingsfrom the discovery stage which is the case considered inthis paper the analysis in the replication stage utilized thetest statistic or genetic model that is selected as being thebest in the discovery stage and also the direction of the riskallele following guidelines for exact replication in geneticassociation studies If the purpose is to simply combine theevidence fromdifferent data sources such as inmeta-analysisother strategies may be devised Further research again isrequired to provide fully detailed properties of suchmethods
Power gain of a joint analysis over the conventionalreplication-based analysis was thoroughly studied by Skol etal [18 19] In our simulation the amount of power increaseusing a joint test compared to the replication-based analysiswas much minor than what was observed by Skol et al[18 19] The exact reason is not known but we suspect thismight be due to the power advantages of robust methodsand also due to the fact that the optimal choice from thefirst stage is used when calculating the second stage 119901 values
However even though it was minor in some situationsthe joint anlysis presented better power performance thanthe replication-based analysis in our study This type ofjoint analysis raised concerns about the exact meaning ofreplication [17] However McCarthy et al [26] mentionedthat joint analyses ldquoblur the boundaries of where exactlyreplication starts but whichever analytical approach is takenconfirmation in many independent samples is important andit is the overall strength of the evidence of association thatmattersrdquo Purpose of the current study was to present howthe overall strength of the evidence of association can beevaluated when robust tests are used in GWAS replicationstudies
We illustrated how the proposed methods can be appliedin the real data that studied the association of SNPs with non-small-cell lung cancer (NSCLC) in discovery and replicationstages In the original study reported by Yoon et al [25]SNPs were selected in the discovery data set not based onthe robust tests but based on additive CATT Therefore wefound that some SNPs could have been selected by one
BioMed Research International 9
of the robust methods but they were not included in thereplication data set For these SNPs we were not able toperform the joint analysis that we propose and it was notpossible to examine whether there are other SNPs that couldhave been found to be associated with NSCLC by proposedmethods in the replication study For this reason we merelypresented howmany additional SNPs could have been furtherfollowed in the replication stage when robust methods wereused In many GWASs it is a common practice to reportthe summary test statistics and 119901 values of the SNPs undera specific genetic model usually an additive model whichwere further genotyped in the replication stage and werefinally defined to be significantly associated with a phenotypeof interest As emphasized in this paper one may have abetter chance of findingmanymissing SNPs by applyingmorepowerful and robust methods that consider different geneticmodels simultaneouslyTherefore we urge the community toshare test results under not only an additive model but alsoother genetic models although they were not significant at astringent significance level so that future research may haveenriched data resources to which robust tests can be appliedin association studies
Appendix
119901 value of Fisherrsquos Combination forEqual and Unequal Weights
Equal Weights Assume 1199081
= 1199082
= 1 Under the nullhypothesis of no association 119883
1= minus2 log119875(1) and 119883
2=
minus2 log119875(2) are independent and each asymptotically followsa 1205942 distribution with 2 degrees of freedom Let 119891
119896(119909) and
119865119896(119909) be the probability and cumulative density functions of
1205942 random variable with 119896 degrees of freedomThen 119891
2(119909) =
exp(minus1199092)2 1198914(119909) = 119909 exp(minus1199092)4 119865
2(119909) = 1 minus exp(minus1199092)
and1198654(119909) = 1minusexp(minus1199092)minus119909 exp(minus1199092)2 Denote the cutoff
of the discovery stage based on 120572119863as 119862119863 that is 119865
2(119862119863) =
1 minus 120572119863 For observed value 119911FC gt 119862
119863of1198831+ 1198832 the 119901 value
is written as
1198751198670
(1198831gt 119862119863 1198831+ 1198832gt 119911FC)
= 120572119863minus int
119911FC
119862119863
1198912 (119909) 1198652 (119911FC minus 119909) 119889119909
= 120572119863+ exp (minus
119911FC2) minus 120572119863+1
2exp(minus
119911FC2) (119911FC minus 119862119863)
= exp (minus119911FC2) (1 +
119911FC2
+ log120572119863)
(A1)
Unequal WeightsWhen different proportions of samples areused in the discovery and replication stages it may be moreappropriate to assignweights proportional to the sample sizesfor each stage For example when only a small portion isused in the discovery stage to prevent Fisherrsquos combinationtest from being dominated by the significant result in the
discovery stage one may want to assign a small weight to thediscovery stage result
When 120587119904is the proportion of samples used in the dis-
covery stage one selection for weights is 1199081
= 2120587119904and
1199082= 2(1 minus 120587
119904) for discovery and replication stages which
simplifies to equal weights when 120587119904= 05 Based on these
weights we consider unequal-weighted Fisherrsquos combinationasminus2 log119875(1)1199081119875(2)1199082 = 119908
11198831+11990821198832[27] Its density function
is given by
119891119908 (119909) =
1
2 (1199081minus 1199082)exp(minus 119909
(21199081))
minus1
2 (1199081minus 1199082)exp(minus 119909
(21199082))
(A2)
and the probability distribution function is
119865119908 (119909) = 1 minus
1199081
(1199081minus 1199082)exp(minus 119909
21199081
)
minus1199082
(1199081minus 1199082)exp(minus 119909
21199082
) 1199081
= 1199082
(A3)
Using the previous notation we have the following formof 119901 value
1198751198670
(1198831gt 119862119863 11990811198831+ 11990821198832gt 119911FC)
= 120572119863minus int
119911FC1199081
119862119863
1198912 (119909) 1198652 (
119911FC minus 1199081119909
1199082
)119889119909
= exp(minus119911FC21199081
) +1199082
1199081minus 1199082
exp(minus119911FC21199082
)
times exp(1199081 minus 1199082211990811199082
119911FC) minus exp(1199081 minus 1199082211990811199082
1199081119862119863)
=1199081
1199081minus 1199082
exp(minus119911FC21199081
)
minus1199082
1199081minus 1199082
exp(minus119911FC21199082
)120572minus(1199081
minus1199082
)1199082
119863
=1199081
1199081minus 1199082
exp(minus119911FC21199081
)
minus1199082
1199081minus 1199082
exp(minus119911FC21199082
)120572minus(1199081
minus1199082
)1199082
119863
(A4)
where 119911FC1199081 gt 119862119863
Conflict of Interests
The authors declare that there is no conflict of interestsregarding the publication of this paper
10 BioMed Research International
Acknowledgments
The authors are indebted to late Dr Gang Zheng for his inspi-ration and support on their workThisworkwas supported bygrant of the National Cancer Center (no NCC-1210060)
References
[1] R J Klein C Zeiss E Y Chew et al ldquoComplement factor Hpolymorphism in age-related macular degenerationrdquo Sciencevol 308 no 5720 pp 385ndash389 2005
[2] R H Duerr K D Taylor S R Brant et al ldquoA genome-wideassociation study identifies IL 23R as an inflammatory boweldisease generdquo Science vol 314 no 5804 pp 1461ndash1463 2006
[3] A Herbert N P Gerry M B McQueen et al ldquoA commongenetic variant is associated with adult and childhood obesityrdquoScience vol 312 no 5771 pp 279ndash283 2006
[4] R Sladek G Rocheleau J Rung et al ldquoA genome-wideassociation study identifies novel risk loci for type 2 diabetesrdquoNature vol 445 no 7130 pp 881ndash885 2007
[5] P R Burton D G Clayton L R Cardon et al ldquoGenome-wideassociation study of 14000 cases of seven common diseases and3000 shared controlsrdquo Nature vol 447 no 7145 pp 661ndash6782007
[6] S H Kwak S H Kim Y M Cho et al ldquoA genome-wideassociation study of gestational diabetes mellitus in Koreanwomenrdquo Diabetes vol 61 no 2 pp 531ndash541 2012
[7] W G Cochran ldquoSomemethods for strengthening the common1205942 testsrdquo Biometrics vol 10 no 4 pp 417ndash451 1954
[8] P Armitage ldquoTests for linear trends in proportions and frequen-ciesrdquo Biometrics vol 11 no 3 pp 375ndash386 1955
[9] P D Sasieni ldquoFrom genotypes to genes doubling the samplesizerdquo Biometrics vol 53 no 4 pp 1253ndash1261 1997
[10] B Freidlin G Zheng Z Li and J L Gastwirth ldquoTrend tests forcase-control studies of genetic markers power sample size androbustnessrdquo Human Heredity vol 53 no 3 pp 146ndash152 2002
[11] J L Gastwirth ldquoOn robust proceduresrdquo Journal of the AmericanStatistical Association vol 61 no 316 pp 929ndash948 1966
[12] J L Gastwirth ldquoThe use of maximin efficiency robust tests incombining contingency tables and survival analysisrdquo Journal ofthe American Statistical Association vol 80 no 390 pp 380ndash384 1985
[13] G Zheng and H K T Ng ldquoGenetic model selection in two-phase analysis for case-control association studiesrdquo Biostatisticsvol 9 no 3 pp 391ndash399 2008
[14] J Joo M Kwak and G Zheng ldquoImproving power for testinggenetic association in case-control studies by reducing thealternative spacerdquo Biometrics vol 66 no 1 pp 266ndash276 2010
[15] J Joo M Kwak K Ahn and G Zheng ldquoA robust genome-widescan statistic of theWellcome Trust Case-Control ConsortiumrdquoBiometrics vol 65 no 4 pp 1115ndash1122 2009
[16] P Kraft E Zeggini and J P A Ioannidis ldquoReplication ingenome-wide association studiesrdquo Statistical Science vol 24 no4 pp 561ndash573 2009
[17] D CThomas G Casey D V Conti R W Haile J P Lewingerand D O Stram ldquoMethodological issues in multistage genome-wide association studiesrdquo Statistical Science vol 24 no 4 pp414ndash429 2009
[18] A D Skol L J Scott G R Abecasis and M Boehnke ldquoJointanalysis is more efficient than replication-based analysis for
two-stage genome-wide association studiesrdquo Nature Geneticsvol 38 no 2 pp 209ndash213 2006
[19] A D Skol L J Scott G R Abecasis andM Boehnke ldquoOptimaldesigns for two-stage genome-wide association studiesrdquoGeneticEpidemiology vol 31 no 7 pp 776ndash788 2007
[20] G Zheng B Freidlin Z Li and J L Gastwirth ldquoChoice ofscores in trend tests for case-control studies of candidate-geneassociationsrdquo Biometrical Journal vol 45 no 3 pp 335ndash3482003
[21] K Song and R C Elston ldquoA powerful method of combiningmeasures of association and Hardy-Weinberg Disequilibriumfor fine-mapping in case-control studiesrdquo Statistics in Medicinevol 25 no 1 pp 105ndash126 2006
[22] S Won N Morris Q Lu and R C Elston ldquoChoosing anoptimal method to combine P-valuesrdquo Statistics in Medicinevol 28 no 11 pp 1537ndash1553 2009
[23] F BegumDGhoshGC Tseng andE Feingold ldquoComprehen-sive literature review and statistical considerations for GWASmeta-analysisrdquo Nucleic Acids Research vol 40 no 9 pp 3777ndash3784 2012
[24] B Efron and R J Tibshirani An Introduction to the BootstrapChapman amp HallCRC Press 1993
[25] K A Yoon J H Park J Han et al ldquoA genome-wide associationstudy reveals susceptibility variants for non-small cell lungcancer in the Korean populationrdquo Human Molecular Geneticsvol 19 no 24 pp 4948ndash4954 2010
[26] M I McCarthy G R Abecasis L R Cardon et al ldquoGenome-wide association studies for complex traits consensus uncer-tainty and challengesrdquoNature Reviews Genetics vol 9 no 5 pp356ndash369 2008
[27] I J Good ldquoOn the weighted combination of significance testsrdquoJournal of the Royal Statistical Society Series B vol 17 pp 264ndash265 1955
Submit your manuscripts athttpwwwhindawicom
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Anatomy Research International
PeptidesInternational Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporation httpwwwhindawicom
International Journal of
Volume 2014
Zoology
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Molecular Biology International
GenomicsInternational Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
BioinformaticsAdvances in
Marine BiologyJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Signal TransductionJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
BioMed Research International
Evolutionary BiologyInternational Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Biochemistry Research International
ArchaeaHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Genetics Research International
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Advances in
Virolog y
Hindawi Publishing Corporationhttpwwwhindawicom
Nucleic AcidsJournal of
Volume 2014
Stem CellsInternational
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Enzyme Research
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
International Journal of
Microbiology
6 BioMed Research International
Recessive
Additive
Dominant
(05 03) (05 04) (06 03) (06 04)
(120587sMAF)
Empi
rical
pow
er
10
08
06
04
02
00
(05 03) (05 04) (06 03) (06 04)
(120587sMAF)Em
piric
al p
ower
10
08
06
04
02
00
(05 03) (05 04) (06 03) (06 04)
(120587sMAF)
Empi
rical
pow
er
10
08
06
04
02
00
(05 03) (05 04) (06 03) (06 04)
(120587sMAF)
Empi
rical
pow
er
10
08
06
04
02
00
(05 03) (05 04) (06 03) (06 04)
(120587sMAF)
Empi
rical
pow
er
10
08
06
04
02
00
(05 03) (05 04) (06 03) (06 04)
(120587sMAF)
Empi
rical
pow
er
10
08
06
04
02
00
(05 03) (05 04) (06 03) (06 04)
(120587sMAF)
Empi
rical
pow
er
10
08
06
04
02
00
(05 03) (05 04) (06 03) (06 04)
(120587sMAF)
Empi
rical
pow
er
10
08
06
04
02
00
(05 03) (05 04) (06 03) (06 04)
(120587sMAF)
Empi
rical
pow
er
10
08
06
04
02
00
Z05
1205942
MAX3
ZFC
ZFC
ZFC
ZLC
ZLC
ZLC
MIN2GMSGME
MIN2GMSGME
MIN2GMSGME
Z05
1205942
MAX3
Z05
1205942
MAX3
Replication-based test
Replication-based test
Replication-based test
Figure 1 Empirical powers based on 2000 simulations for119872 = 10markers genotype relative risks of both stages = 14 and disease prevalence119870 = 01 under the recessive additive and dominant models 1000 cases and 1000 controls are considered to control 120572 = 005 The first stagetype I error rate for discovery is 120572
119863= 005 Six test statistics 119885
12 1205942 MAX3 MIN2 GMS and GME are considered The first second and
third columns depict powers using the replication-based test 119885FC and 119885LC respectively
BioMed Research International 7
Replication-based test
Recessive
Additive
Dominant
(05 03) (05 04) (06 03) (06 04)
(120587sMAF)
Empi
rical
pow
er
10
08
06
04
02
00
(05 03) (05 04) (06 03) (06 04)
(120587sMAF)Em
piric
al p
ower
10
08
06
04
02
00
(05 03) (05 04) (06 03) (06 04)
(120587sMAF)
Empi
rical
pow
er
10
08
06
04
02
00
(05 03) (05 04) (06 03) (06 04)
(120587sMAF)
Empi
rical
pow
er
10
08
06
04
02
00
(05 03) (05 04) (06 03) (06 04)
(120587sMAF)
Empi
rical
pow
er
10
08
06
04
02
00
(05 03) (05 04) (06 03) (06 04)
(120587sMAF)
Empi
rical
pow
er
10
08
06
04
02
00
(05 03) (05 04) (06 03) (06 04)
(120587sMAF)
Empi
rical
pow
er
10
08
06
04
02
00
(05 03) (05 04) (06 03) (06 04)
(120587sMAF)
Empi
rical
pow
er
10
08
06
04
02
00
(05 03) (05 04) (06 03) (06 04)
(120587sMAF)
Empi
rical
pow
er
10
08
06
04
02
00
Z05
1205942
MAX3
ZFC
ZFC
ZFC
ZLC
ZLC
ZLC
MIN2GMSGME
MIN2GMSGME
MIN2GMSGME
Z05
1205942
MAX3
Z05
1205942
MAX3
Replication-based test
Replication-based test
Figure 2 Empirical powers based on 2000 simulations for 119872 = 10 markers genotype relative risks of two stages are different (1199031= 16
1199032= 14) disease prevalence 119870 = 01 under the recessive additive and dominant models 1000 cases and 1000 controls are considered to
control 120572 = 005 The first stage type I error rate for discovery is 120572119863= 005 Six test statistics 119885
12 1205942 MAX3 MIN2 GMS and GME are
considered The first second and third columns depict powers using the replication-based test 119885FC and 119885LC respectively
8 BioMed Research International
minuslo
g 10(120588)
Chromosome1 2 3 4 5 6 7 8 9 11 13 15 17 19 21 23
8
6
4
2
0
MIN2 P value
(a)
minuslo
g 10(120588)
Additive P value8
6
4
2
0
Chromosome1 2 3 4 5 6 7 8 9 11 13 15 17 19 21 23
(b)
Figure 3 Manhattan plots of 246758 SNPs from Yoon et al [25] based on MIN2 (a) and 11988512
(b) The 119909 axis is chromosomal location andthe 119910 axis is the significance (minuslog
10119875) of association The horizontal line corresponds to the significance level 10minus4
Table 2 For selected exemplary three SNPs for testing association with NSCLC 119901 value of combined test using additive CATT (11988512)
Pearsonrsquos Chi-square test (119879chi2) MAX3 MIN2 119885GMS and 119885GME
SNP 119901 value of 11988512
119901 value of 119879chi2Discovery Replication Combined test Discovery Replication Combined test
rs2131877 788 times 10minus5
104 times 10minus4
797 times 10minus8
140 times 10minus4
149 times 10minus4
184 times 10minus7
rs905551 183 times 10minus5
702 times 10minus3
770 times 10minus6
806 times 10minus5
489 times 10minus2
140 times 10minus5
rs1695109 248 times 10minus4
346 times 10minus2
217 times 10minus6
456 times 10minus5
153 times 10minus1
207 times 10minus5
SNP 119901 value of MAX3 119901 value of MIN2Discovery Replication Combined test Discovery Replication Combined test
rs2131877 153 times 10minus4
405 times 10minus2
192 times 10minus5
132 times 10minus4
104 times 10minus4
126 times 10minus7
rs905551 450 times 10minus5
702 times 10minus3
192 times 10minus6
134 times 10minus4
489 times 10minus2
199 times 10minus5
rs1695109 354 times 10minus5
263 times 10minus2
464 times 10minus6
236 times 10minus5
263 times 10minus2
335 times 10minus6
SNP 119901 value of 119885GMS 119901 value of 119885GME
Discovery Replication Combined test Discovery Replication Combined testrs2131877 186 times 10
minus4104 times 10
minus4171 times 10
minus7103 times 10
minus4104 times 10
minus4102 times 10
minus7
rs905551 519 times 10minus5
702 times 10minus3
216 times 10minus6
735 times 10minus5
801 times 10minus3
320 times 10minus6
rs1695109 689 times 10minus4
127 times 10minus1
385 times 10minus5
269 times 10minus5
419 times 10minus2
540 times 10minus6
In a genetic study where the purpose of considering areplication stage is to validate or replicate the genetic findingsfrom the discovery stage which is the case considered inthis paper the analysis in the replication stage utilized thetest statistic or genetic model that is selected as being thebest in the discovery stage and also the direction of the riskallele following guidelines for exact replication in geneticassociation studies If the purpose is to simply combine theevidence fromdifferent data sources such as inmeta-analysisother strategies may be devised Further research again isrequired to provide fully detailed properties of suchmethods
Power gain of a joint analysis over the conventionalreplication-based analysis was thoroughly studied by Skol etal [18 19] In our simulation the amount of power increaseusing a joint test compared to the replication-based analysiswas much minor than what was observed by Skol et al[18 19] The exact reason is not known but we suspect thismight be due to the power advantages of robust methodsand also due to the fact that the optimal choice from thefirst stage is used when calculating the second stage 119901 values
However even though it was minor in some situationsthe joint anlysis presented better power performance thanthe replication-based analysis in our study This type ofjoint analysis raised concerns about the exact meaning ofreplication [17] However McCarthy et al [26] mentionedthat joint analyses ldquoblur the boundaries of where exactlyreplication starts but whichever analytical approach is takenconfirmation in many independent samples is important andit is the overall strength of the evidence of association thatmattersrdquo Purpose of the current study was to present howthe overall strength of the evidence of association can beevaluated when robust tests are used in GWAS replicationstudies
We illustrated how the proposed methods can be appliedin the real data that studied the association of SNPs with non-small-cell lung cancer (NSCLC) in discovery and replicationstages In the original study reported by Yoon et al [25]SNPs were selected in the discovery data set not based onthe robust tests but based on additive CATT Therefore wefound that some SNPs could have been selected by one
BioMed Research International 9
of the robust methods but they were not included in thereplication data set For these SNPs we were not able toperform the joint analysis that we propose and it was notpossible to examine whether there are other SNPs that couldhave been found to be associated with NSCLC by proposedmethods in the replication study For this reason we merelypresented howmany additional SNPs could have been furtherfollowed in the replication stage when robust methods wereused In many GWASs it is a common practice to reportthe summary test statistics and 119901 values of the SNPs undera specific genetic model usually an additive model whichwere further genotyped in the replication stage and werefinally defined to be significantly associated with a phenotypeof interest As emphasized in this paper one may have abetter chance of findingmanymissing SNPs by applyingmorepowerful and robust methods that consider different geneticmodels simultaneouslyTherefore we urge the community toshare test results under not only an additive model but alsoother genetic models although they were not significant at astringent significance level so that future research may haveenriched data resources to which robust tests can be appliedin association studies
Appendix
119901 value of Fisherrsquos Combination forEqual and Unequal Weights
Equal Weights Assume 1199081
= 1199082
= 1 Under the nullhypothesis of no association 119883
1= minus2 log119875(1) and 119883
2=
minus2 log119875(2) are independent and each asymptotically followsa 1205942 distribution with 2 degrees of freedom Let 119891
119896(119909) and
119865119896(119909) be the probability and cumulative density functions of
1205942 random variable with 119896 degrees of freedomThen 119891
2(119909) =
exp(minus1199092)2 1198914(119909) = 119909 exp(minus1199092)4 119865
2(119909) = 1 minus exp(minus1199092)
and1198654(119909) = 1minusexp(minus1199092)minus119909 exp(minus1199092)2 Denote the cutoff
of the discovery stage based on 120572119863as 119862119863 that is 119865
2(119862119863) =
1 minus 120572119863 For observed value 119911FC gt 119862
119863of1198831+ 1198832 the 119901 value
is written as
1198751198670
(1198831gt 119862119863 1198831+ 1198832gt 119911FC)
= 120572119863minus int
119911FC
119862119863
1198912 (119909) 1198652 (119911FC minus 119909) 119889119909
= 120572119863+ exp (minus
119911FC2) minus 120572119863+1
2exp(minus
119911FC2) (119911FC minus 119862119863)
= exp (minus119911FC2) (1 +
119911FC2
+ log120572119863)
(A1)
Unequal WeightsWhen different proportions of samples areused in the discovery and replication stages it may be moreappropriate to assignweights proportional to the sample sizesfor each stage For example when only a small portion isused in the discovery stage to prevent Fisherrsquos combinationtest from being dominated by the significant result in the
discovery stage one may want to assign a small weight to thediscovery stage result
When 120587119904is the proportion of samples used in the dis-
covery stage one selection for weights is 1199081
= 2120587119904and
1199082= 2(1 minus 120587
119904) for discovery and replication stages which
simplifies to equal weights when 120587119904= 05 Based on these
weights we consider unequal-weighted Fisherrsquos combinationasminus2 log119875(1)1199081119875(2)1199082 = 119908
11198831+11990821198832[27] Its density function
is given by
119891119908 (119909) =
1
2 (1199081minus 1199082)exp(minus 119909
(21199081))
minus1
2 (1199081minus 1199082)exp(minus 119909
(21199082))
(A2)
and the probability distribution function is
119865119908 (119909) = 1 minus
1199081
(1199081minus 1199082)exp(minus 119909
21199081
)
minus1199082
(1199081minus 1199082)exp(minus 119909
21199082
) 1199081
= 1199082
(A3)
Using the previous notation we have the following formof 119901 value
1198751198670
(1198831gt 119862119863 11990811198831+ 11990821198832gt 119911FC)
= 120572119863minus int
119911FC1199081
119862119863
1198912 (119909) 1198652 (
119911FC minus 1199081119909
1199082
)119889119909
= exp(minus119911FC21199081
) +1199082
1199081minus 1199082
exp(minus119911FC21199082
)
times exp(1199081 minus 1199082211990811199082
119911FC) minus exp(1199081 minus 1199082211990811199082
1199081119862119863)
=1199081
1199081minus 1199082
exp(minus119911FC21199081
)
minus1199082
1199081minus 1199082
exp(minus119911FC21199082
)120572minus(1199081
minus1199082
)1199082
119863
=1199081
1199081minus 1199082
exp(minus119911FC21199081
)
minus1199082
1199081minus 1199082
exp(minus119911FC21199082
)120572minus(1199081
minus1199082
)1199082
119863
(A4)
where 119911FC1199081 gt 119862119863
Conflict of Interests
The authors declare that there is no conflict of interestsregarding the publication of this paper
10 BioMed Research International
Acknowledgments
The authors are indebted to late Dr Gang Zheng for his inspi-ration and support on their workThisworkwas supported bygrant of the National Cancer Center (no NCC-1210060)
References
[1] R J Klein C Zeiss E Y Chew et al ldquoComplement factor Hpolymorphism in age-related macular degenerationrdquo Sciencevol 308 no 5720 pp 385ndash389 2005
[2] R H Duerr K D Taylor S R Brant et al ldquoA genome-wideassociation study identifies IL 23R as an inflammatory boweldisease generdquo Science vol 314 no 5804 pp 1461ndash1463 2006
[3] A Herbert N P Gerry M B McQueen et al ldquoA commongenetic variant is associated with adult and childhood obesityrdquoScience vol 312 no 5771 pp 279ndash283 2006
[4] R Sladek G Rocheleau J Rung et al ldquoA genome-wideassociation study identifies novel risk loci for type 2 diabetesrdquoNature vol 445 no 7130 pp 881ndash885 2007
[5] P R Burton D G Clayton L R Cardon et al ldquoGenome-wideassociation study of 14000 cases of seven common diseases and3000 shared controlsrdquo Nature vol 447 no 7145 pp 661ndash6782007
[6] S H Kwak S H Kim Y M Cho et al ldquoA genome-wideassociation study of gestational diabetes mellitus in Koreanwomenrdquo Diabetes vol 61 no 2 pp 531ndash541 2012
[7] W G Cochran ldquoSomemethods for strengthening the common1205942 testsrdquo Biometrics vol 10 no 4 pp 417ndash451 1954
[8] P Armitage ldquoTests for linear trends in proportions and frequen-ciesrdquo Biometrics vol 11 no 3 pp 375ndash386 1955
[9] P D Sasieni ldquoFrom genotypes to genes doubling the samplesizerdquo Biometrics vol 53 no 4 pp 1253ndash1261 1997
[10] B Freidlin G Zheng Z Li and J L Gastwirth ldquoTrend tests forcase-control studies of genetic markers power sample size androbustnessrdquo Human Heredity vol 53 no 3 pp 146ndash152 2002
[11] J L Gastwirth ldquoOn robust proceduresrdquo Journal of the AmericanStatistical Association vol 61 no 316 pp 929ndash948 1966
[12] J L Gastwirth ldquoThe use of maximin efficiency robust tests incombining contingency tables and survival analysisrdquo Journal ofthe American Statistical Association vol 80 no 390 pp 380ndash384 1985
[13] G Zheng and H K T Ng ldquoGenetic model selection in two-phase analysis for case-control association studiesrdquo Biostatisticsvol 9 no 3 pp 391ndash399 2008
[14] J Joo M Kwak and G Zheng ldquoImproving power for testinggenetic association in case-control studies by reducing thealternative spacerdquo Biometrics vol 66 no 1 pp 266ndash276 2010
[15] J Joo M Kwak K Ahn and G Zheng ldquoA robust genome-widescan statistic of theWellcome Trust Case-Control ConsortiumrdquoBiometrics vol 65 no 4 pp 1115ndash1122 2009
[16] P Kraft E Zeggini and J P A Ioannidis ldquoReplication ingenome-wide association studiesrdquo Statistical Science vol 24 no4 pp 561ndash573 2009
[17] D CThomas G Casey D V Conti R W Haile J P Lewingerand D O Stram ldquoMethodological issues in multistage genome-wide association studiesrdquo Statistical Science vol 24 no 4 pp414ndash429 2009
[18] A D Skol L J Scott G R Abecasis and M Boehnke ldquoJointanalysis is more efficient than replication-based analysis for
two-stage genome-wide association studiesrdquo Nature Geneticsvol 38 no 2 pp 209ndash213 2006
[19] A D Skol L J Scott G R Abecasis andM Boehnke ldquoOptimaldesigns for two-stage genome-wide association studiesrdquoGeneticEpidemiology vol 31 no 7 pp 776ndash788 2007
[20] G Zheng B Freidlin Z Li and J L Gastwirth ldquoChoice ofscores in trend tests for case-control studies of candidate-geneassociationsrdquo Biometrical Journal vol 45 no 3 pp 335ndash3482003
[21] K Song and R C Elston ldquoA powerful method of combiningmeasures of association and Hardy-Weinberg Disequilibriumfor fine-mapping in case-control studiesrdquo Statistics in Medicinevol 25 no 1 pp 105ndash126 2006
[22] S Won N Morris Q Lu and R C Elston ldquoChoosing anoptimal method to combine P-valuesrdquo Statistics in Medicinevol 28 no 11 pp 1537ndash1553 2009
[23] F BegumDGhoshGC Tseng andE Feingold ldquoComprehen-sive literature review and statistical considerations for GWASmeta-analysisrdquo Nucleic Acids Research vol 40 no 9 pp 3777ndash3784 2012
[24] B Efron and R J Tibshirani An Introduction to the BootstrapChapman amp HallCRC Press 1993
[25] K A Yoon J H Park J Han et al ldquoA genome-wide associationstudy reveals susceptibility variants for non-small cell lungcancer in the Korean populationrdquo Human Molecular Geneticsvol 19 no 24 pp 4948ndash4954 2010
[26] M I McCarthy G R Abecasis L R Cardon et al ldquoGenome-wide association studies for complex traits consensus uncer-tainty and challengesrdquoNature Reviews Genetics vol 9 no 5 pp356ndash369 2008
[27] I J Good ldquoOn the weighted combination of significance testsrdquoJournal of the Royal Statistical Society Series B vol 17 pp 264ndash265 1955
Submit your manuscripts athttpwwwhindawicom
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Anatomy Research International
PeptidesInternational Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporation httpwwwhindawicom
International Journal of
Volume 2014
Zoology
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Molecular Biology International
GenomicsInternational Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
BioinformaticsAdvances in
Marine BiologyJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Signal TransductionJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
BioMed Research International
Evolutionary BiologyInternational Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Biochemistry Research International
ArchaeaHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Genetics Research International
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Advances in
Virolog y
Hindawi Publishing Corporationhttpwwwhindawicom
Nucleic AcidsJournal of
Volume 2014
Stem CellsInternational
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Enzyme Research
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
International Journal of
Microbiology
BioMed Research International 7
Replication-based test
Recessive
Additive
Dominant
(05 03) (05 04) (06 03) (06 04)
(120587sMAF)
Empi
rical
pow
er
10
08
06
04
02
00
(05 03) (05 04) (06 03) (06 04)
(120587sMAF)Em
piric
al p
ower
10
08
06
04
02
00
(05 03) (05 04) (06 03) (06 04)
(120587sMAF)
Empi
rical
pow
er
10
08
06
04
02
00
(05 03) (05 04) (06 03) (06 04)
(120587sMAF)
Empi
rical
pow
er
10
08
06
04
02
00
(05 03) (05 04) (06 03) (06 04)
(120587sMAF)
Empi
rical
pow
er
10
08
06
04
02
00
(05 03) (05 04) (06 03) (06 04)
(120587sMAF)
Empi
rical
pow
er
10
08
06
04
02
00
(05 03) (05 04) (06 03) (06 04)
(120587sMAF)
Empi
rical
pow
er
10
08
06
04
02
00
(05 03) (05 04) (06 03) (06 04)
(120587sMAF)
Empi
rical
pow
er
10
08
06
04
02
00
(05 03) (05 04) (06 03) (06 04)
(120587sMAF)
Empi
rical
pow
er
10
08
06
04
02
00
Z05
1205942
MAX3
ZFC
ZFC
ZFC
ZLC
ZLC
ZLC
MIN2GMSGME
MIN2GMSGME
MIN2GMSGME
Z05
1205942
MAX3
Z05
1205942
MAX3
Replication-based test
Replication-based test
Figure 2 Empirical powers based on 2000 simulations for 119872 = 10 markers genotype relative risks of two stages are different (1199031= 16
1199032= 14) disease prevalence 119870 = 01 under the recessive additive and dominant models 1000 cases and 1000 controls are considered to
control 120572 = 005 The first stage type I error rate for discovery is 120572119863= 005 Six test statistics 119885
12 1205942 MAX3 MIN2 GMS and GME are
considered The first second and third columns depict powers using the replication-based test 119885FC and 119885LC respectively
8 BioMed Research International
minuslo
g 10(120588)
Chromosome1 2 3 4 5 6 7 8 9 11 13 15 17 19 21 23
8
6
4
2
0
MIN2 P value
(a)
minuslo
g 10(120588)
Additive P value8
6
4
2
0
Chromosome1 2 3 4 5 6 7 8 9 11 13 15 17 19 21 23
(b)
Figure 3 Manhattan plots of 246758 SNPs from Yoon et al [25] based on MIN2 (a) and 11988512
(b) The 119909 axis is chromosomal location andthe 119910 axis is the significance (minuslog
10119875) of association The horizontal line corresponds to the significance level 10minus4
Table 2 For selected exemplary three SNPs for testing association with NSCLC 119901 value of combined test using additive CATT (11988512)
Pearsonrsquos Chi-square test (119879chi2) MAX3 MIN2 119885GMS and 119885GME
SNP 119901 value of 11988512
119901 value of 119879chi2Discovery Replication Combined test Discovery Replication Combined test
rs2131877 788 times 10minus5
104 times 10minus4
797 times 10minus8
140 times 10minus4
149 times 10minus4
184 times 10minus7
rs905551 183 times 10minus5
702 times 10minus3
770 times 10minus6
806 times 10minus5
489 times 10minus2
140 times 10minus5
rs1695109 248 times 10minus4
346 times 10minus2
217 times 10minus6
456 times 10minus5
153 times 10minus1
207 times 10minus5
SNP 119901 value of MAX3 119901 value of MIN2Discovery Replication Combined test Discovery Replication Combined test
rs2131877 153 times 10minus4
405 times 10minus2
192 times 10minus5
132 times 10minus4
104 times 10minus4
126 times 10minus7
rs905551 450 times 10minus5
702 times 10minus3
192 times 10minus6
134 times 10minus4
489 times 10minus2
199 times 10minus5
rs1695109 354 times 10minus5
263 times 10minus2
464 times 10minus6
236 times 10minus5
263 times 10minus2
335 times 10minus6
SNP 119901 value of 119885GMS 119901 value of 119885GME
Discovery Replication Combined test Discovery Replication Combined testrs2131877 186 times 10
minus4104 times 10
minus4171 times 10
minus7103 times 10
minus4104 times 10
minus4102 times 10
minus7
rs905551 519 times 10minus5
702 times 10minus3
216 times 10minus6
735 times 10minus5
801 times 10minus3
320 times 10minus6
rs1695109 689 times 10minus4
127 times 10minus1
385 times 10minus5
269 times 10minus5
419 times 10minus2
540 times 10minus6
In a genetic study where the purpose of considering areplication stage is to validate or replicate the genetic findingsfrom the discovery stage which is the case considered inthis paper the analysis in the replication stage utilized thetest statistic or genetic model that is selected as being thebest in the discovery stage and also the direction of the riskallele following guidelines for exact replication in geneticassociation studies If the purpose is to simply combine theevidence fromdifferent data sources such as inmeta-analysisother strategies may be devised Further research again isrequired to provide fully detailed properties of suchmethods
Power gain of a joint analysis over the conventionalreplication-based analysis was thoroughly studied by Skol etal [18 19] In our simulation the amount of power increaseusing a joint test compared to the replication-based analysiswas much minor than what was observed by Skol et al[18 19] The exact reason is not known but we suspect thismight be due to the power advantages of robust methodsand also due to the fact that the optimal choice from thefirst stage is used when calculating the second stage 119901 values
However even though it was minor in some situationsthe joint anlysis presented better power performance thanthe replication-based analysis in our study This type ofjoint analysis raised concerns about the exact meaning ofreplication [17] However McCarthy et al [26] mentionedthat joint analyses ldquoblur the boundaries of where exactlyreplication starts but whichever analytical approach is takenconfirmation in many independent samples is important andit is the overall strength of the evidence of association thatmattersrdquo Purpose of the current study was to present howthe overall strength of the evidence of association can beevaluated when robust tests are used in GWAS replicationstudies
We illustrated how the proposed methods can be appliedin the real data that studied the association of SNPs with non-small-cell lung cancer (NSCLC) in discovery and replicationstages In the original study reported by Yoon et al [25]SNPs were selected in the discovery data set not based onthe robust tests but based on additive CATT Therefore wefound that some SNPs could have been selected by one
BioMed Research International 9
of the robust methods but they were not included in thereplication data set For these SNPs we were not able toperform the joint analysis that we propose and it was notpossible to examine whether there are other SNPs that couldhave been found to be associated with NSCLC by proposedmethods in the replication study For this reason we merelypresented howmany additional SNPs could have been furtherfollowed in the replication stage when robust methods wereused In many GWASs it is a common practice to reportthe summary test statistics and 119901 values of the SNPs undera specific genetic model usually an additive model whichwere further genotyped in the replication stage and werefinally defined to be significantly associated with a phenotypeof interest As emphasized in this paper one may have abetter chance of findingmanymissing SNPs by applyingmorepowerful and robust methods that consider different geneticmodels simultaneouslyTherefore we urge the community toshare test results under not only an additive model but alsoother genetic models although they were not significant at astringent significance level so that future research may haveenriched data resources to which robust tests can be appliedin association studies
Appendix
119901 value of Fisherrsquos Combination forEqual and Unequal Weights
Equal Weights Assume 1199081
= 1199082
= 1 Under the nullhypothesis of no association 119883
1= minus2 log119875(1) and 119883
2=
minus2 log119875(2) are independent and each asymptotically followsa 1205942 distribution with 2 degrees of freedom Let 119891
119896(119909) and
119865119896(119909) be the probability and cumulative density functions of
1205942 random variable with 119896 degrees of freedomThen 119891
2(119909) =
exp(minus1199092)2 1198914(119909) = 119909 exp(minus1199092)4 119865
2(119909) = 1 minus exp(minus1199092)
and1198654(119909) = 1minusexp(minus1199092)minus119909 exp(minus1199092)2 Denote the cutoff
of the discovery stage based on 120572119863as 119862119863 that is 119865
2(119862119863) =
1 minus 120572119863 For observed value 119911FC gt 119862
119863of1198831+ 1198832 the 119901 value
is written as
1198751198670
(1198831gt 119862119863 1198831+ 1198832gt 119911FC)
= 120572119863minus int
119911FC
119862119863
1198912 (119909) 1198652 (119911FC minus 119909) 119889119909
= 120572119863+ exp (minus
119911FC2) minus 120572119863+1
2exp(minus
119911FC2) (119911FC minus 119862119863)
= exp (minus119911FC2) (1 +
119911FC2
+ log120572119863)
(A1)
Unequal WeightsWhen different proportions of samples areused in the discovery and replication stages it may be moreappropriate to assignweights proportional to the sample sizesfor each stage For example when only a small portion isused in the discovery stage to prevent Fisherrsquos combinationtest from being dominated by the significant result in the
discovery stage one may want to assign a small weight to thediscovery stage result
When 120587119904is the proportion of samples used in the dis-
covery stage one selection for weights is 1199081
= 2120587119904and
1199082= 2(1 minus 120587
119904) for discovery and replication stages which
simplifies to equal weights when 120587119904= 05 Based on these
weights we consider unequal-weighted Fisherrsquos combinationasminus2 log119875(1)1199081119875(2)1199082 = 119908
11198831+11990821198832[27] Its density function
is given by
119891119908 (119909) =
1
2 (1199081minus 1199082)exp(minus 119909
(21199081))
minus1
2 (1199081minus 1199082)exp(minus 119909
(21199082))
(A2)
and the probability distribution function is
119865119908 (119909) = 1 minus
1199081
(1199081minus 1199082)exp(minus 119909
21199081
)
minus1199082
(1199081minus 1199082)exp(minus 119909
21199082
) 1199081
= 1199082
(A3)
Using the previous notation we have the following formof 119901 value
1198751198670
(1198831gt 119862119863 11990811198831+ 11990821198832gt 119911FC)
= 120572119863minus int
119911FC1199081
119862119863
1198912 (119909) 1198652 (
119911FC minus 1199081119909
1199082
)119889119909
= exp(minus119911FC21199081
) +1199082
1199081minus 1199082
exp(minus119911FC21199082
)
times exp(1199081 minus 1199082211990811199082
119911FC) minus exp(1199081 minus 1199082211990811199082
1199081119862119863)
=1199081
1199081minus 1199082
exp(minus119911FC21199081
)
minus1199082
1199081minus 1199082
exp(minus119911FC21199082
)120572minus(1199081
minus1199082
)1199082
119863
=1199081
1199081minus 1199082
exp(minus119911FC21199081
)
minus1199082
1199081minus 1199082
exp(minus119911FC21199082
)120572minus(1199081
minus1199082
)1199082
119863
(A4)
where 119911FC1199081 gt 119862119863
Conflict of Interests
The authors declare that there is no conflict of interestsregarding the publication of this paper
10 BioMed Research International
Acknowledgments
The authors are indebted to late Dr Gang Zheng for his inspi-ration and support on their workThisworkwas supported bygrant of the National Cancer Center (no NCC-1210060)
References
[1] R J Klein C Zeiss E Y Chew et al ldquoComplement factor Hpolymorphism in age-related macular degenerationrdquo Sciencevol 308 no 5720 pp 385ndash389 2005
[2] R H Duerr K D Taylor S R Brant et al ldquoA genome-wideassociation study identifies IL 23R as an inflammatory boweldisease generdquo Science vol 314 no 5804 pp 1461ndash1463 2006
[3] A Herbert N P Gerry M B McQueen et al ldquoA commongenetic variant is associated with adult and childhood obesityrdquoScience vol 312 no 5771 pp 279ndash283 2006
[4] R Sladek G Rocheleau J Rung et al ldquoA genome-wideassociation study identifies novel risk loci for type 2 diabetesrdquoNature vol 445 no 7130 pp 881ndash885 2007
[5] P R Burton D G Clayton L R Cardon et al ldquoGenome-wideassociation study of 14000 cases of seven common diseases and3000 shared controlsrdquo Nature vol 447 no 7145 pp 661ndash6782007
[6] S H Kwak S H Kim Y M Cho et al ldquoA genome-wideassociation study of gestational diabetes mellitus in Koreanwomenrdquo Diabetes vol 61 no 2 pp 531ndash541 2012
[7] W G Cochran ldquoSomemethods for strengthening the common1205942 testsrdquo Biometrics vol 10 no 4 pp 417ndash451 1954
[8] P Armitage ldquoTests for linear trends in proportions and frequen-ciesrdquo Biometrics vol 11 no 3 pp 375ndash386 1955
[9] P D Sasieni ldquoFrom genotypes to genes doubling the samplesizerdquo Biometrics vol 53 no 4 pp 1253ndash1261 1997
[10] B Freidlin G Zheng Z Li and J L Gastwirth ldquoTrend tests forcase-control studies of genetic markers power sample size androbustnessrdquo Human Heredity vol 53 no 3 pp 146ndash152 2002
[11] J L Gastwirth ldquoOn robust proceduresrdquo Journal of the AmericanStatistical Association vol 61 no 316 pp 929ndash948 1966
[12] J L Gastwirth ldquoThe use of maximin efficiency robust tests incombining contingency tables and survival analysisrdquo Journal ofthe American Statistical Association vol 80 no 390 pp 380ndash384 1985
[13] G Zheng and H K T Ng ldquoGenetic model selection in two-phase analysis for case-control association studiesrdquo Biostatisticsvol 9 no 3 pp 391ndash399 2008
[14] J Joo M Kwak and G Zheng ldquoImproving power for testinggenetic association in case-control studies by reducing thealternative spacerdquo Biometrics vol 66 no 1 pp 266ndash276 2010
[15] J Joo M Kwak K Ahn and G Zheng ldquoA robust genome-widescan statistic of theWellcome Trust Case-Control ConsortiumrdquoBiometrics vol 65 no 4 pp 1115ndash1122 2009
[16] P Kraft E Zeggini and J P A Ioannidis ldquoReplication ingenome-wide association studiesrdquo Statistical Science vol 24 no4 pp 561ndash573 2009
[17] D CThomas G Casey D V Conti R W Haile J P Lewingerand D O Stram ldquoMethodological issues in multistage genome-wide association studiesrdquo Statistical Science vol 24 no 4 pp414ndash429 2009
[18] A D Skol L J Scott G R Abecasis and M Boehnke ldquoJointanalysis is more efficient than replication-based analysis for
two-stage genome-wide association studiesrdquo Nature Geneticsvol 38 no 2 pp 209ndash213 2006
[19] A D Skol L J Scott G R Abecasis andM Boehnke ldquoOptimaldesigns for two-stage genome-wide association studiesrdquoGeneticEpidemiology vol 31 no 7 pp 776ndash788 2007
[20] G Zheng B Freidlin Z Li and J L Gastwirth ldquoChoice ofscores in trend tests for case-control studies of candidate-geneassociationsrdquo Biometrical Journal vol 45 no 3 pp 335ndash3482003
[21] K Song and R C Elston ldquoA powerful method of combiningmeasures of association and Hardy-Weinberg Disequilibriumfor fine-mapping in case-control studiesrdquo Statistics in Medicinevol 25 no 1 pp 105ndash126 2006
[22] S Won N Morris Q Lu and R C Elston ldquoChoosing anoptimal method to combine P-valuesrdquo Statistics in Medicinevol 28 no 11 pp 1537ndash1553 2009
[23] F BegumDGhoshGC Tseng andE Feingold ldquoComprehen-sive literature review and statistical considerations for GWASmeta-analysisrdquo Nucleic Acids Research vol 40 no 9 pp 3777ndash3784 2012
[24] B Efron and R J Tibshirani An Introduction to the BootstrapChapman amp HallCRC Press 1993
[25] K A Yoon J H Park J Han et al ldquoA genome-wide associationstudy reveals susceptibility variants for non-small cell lungcancer in the Korean populationrdquo Human Molecular Geneticsvol 19 no 24 pp 4948ndash4954 2010
[26] M I McCarthy G R Abecasis L R Cardon et al ldquoGenome-wide association studies for complex traits consensus uncer-tainty and challengesrdquoNature Reviews Genetics vol 9 no 5 pp356ndash369 2008
[27] I J Good ldquoOn the weighted combination of significance testsrdquoJournal of the Royal Statistical Society Series B vol 17 pp 264ndash265 1955
Submit your manuscripts athttpwwwhindawicom
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Anatomy Research International
PeptidesInternational Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporation httpwwwhindawicom
International Journal of
Volume 2014
Zoology
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Molecular Biology International
GenomicsInternational Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
BioinformaticsAdvances in
Marine BiologyJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Signal TransductionJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
BioMed Research International
Evolutionary BiologyInternational Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Biochemistry Research International
ArchaeaHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Genetics Research International
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Advances in
Virolog y
Hindawi Publishing Corporationhttpwwwhindawicom
Nucleic AcidsJournal of
Volume 2014
Stem CellsInternational
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Enzyme Research
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
International Journal of
Microbiology
8 BioMed Research International
minuslo
g 10(120588)
Chromosome1 2 3 4 5 6 7 8 9 11 13 15 17 19 21 23
8
6
4
2
0
MIN2 P value
(a)
minuslo
g 10(120588)
Additive P value8
6
4
2
0
Chromosome1 2 3 4 5 6 7 8 9 11 13 15 17 19 21 23
(b)
Figure 3 Manhattan plots of 246758 SNPs from Yoon et al [25] based on MIN2 (a) and 11988512
(b) The 119909 axis is chromosomal location andthe 119910 axis is the significance (minuslog
10119875) of association The horizontal line corresponds to the significance level 10minus4
Table 2 For selected exemplary three SNPs for testing association with NSCLC 119901 value of combined test using additive CATT (11988512)
Pearsonrsquos Chi-square test (119879chi2) MAX3 MIN2 119885GMS and 119885GME
SNP 119901 value of 11988512
119901 value of 119879chi2Discovery Replication Combined test Discovery Replication Combined test
rs2131877 788 times 10minus5
104 times 10minus4
797 times 10minus8
140 times 10minus4
149 times 10minus4
184 times 10minus7
rs905551 183 times 10minus5
702 times 10minus3
770 times 10minus6
806 times 10minus5
489 times 10minus2
140 times 10minus5
rs1695109 248 times 10minus4
346 times 10minus2
217 times 10minus6
456 times 10minus5
153 times 10minus1
207 times 10minus5
SNP 119901 value of MAX3 119901 value of MIN2Discovery Replication Combined test Discovery Replication Combined test
rs2131877 153 times 10minus4
405 times 10minus2
192 times 10minus5
132 times 10minus4
104 times 10minus4
126 times 10minus7
rs905551 450 times 10minus5
702 times 10minus3
192 times 10minus6
134 times 10minus4
489 times 10minus2
199 times 10minus5
rs1695109 354 times 10minus5
263 times 10minus2
464 times 10minus6
236 times 10minus5
263 times 10minus2
335 times 10minus6
SNP 119901 value of 119885GMS 119901 value of 119885GME
Discovery Replication Combined test Discovery Replication Combined testrs2131877 186 times 10
minus4104 times 10
minus4171 times 10
minus7103 times 10
minus4104 times 10
minus4102 times 10
minus7
rs905551 519 times 10minus5
702 times 10minus3
216 times 10minus6
735 times 10minus5
801 times 10minus3
320 times 10minus6
rs1695109 689 times 10minus4
127 times 10minus1
385 times 10minus5
269 times 10minus5
419 times 10minus2
540 times 10minus6
In a genetic study where the purpose of considering areplication stage is to validate or replicate the genetic findingsfrom the discovery stage which is the case considered inthis paper the analysis in the replication stage utilized thetest statistic or genetic model that is selected as being thebest in the discovery stage and also the direction of the riskallele following guidelines for exact replication in geneticassociation studies If the purpose is to simply combine theevidence fromdifferent data sources such as inmeta-analysisother strategies may be devised Further research again isrequired to provide fully detailed properties of suchmethods
Power gain of a joint analysis over the conventionalreplication-based analysis was thoroughly studied by Skol etal [18 19] In our simulation the amount of power increaseusing a joint test compared to the replication-based analysiswas much minor than what was observed by Skol et al[18 19] The exact reason is not known but we suspect thismight be due to the power advantages of robust methodsand also due to the fact that the optimal choice from thefirst stage is used when calculating the second stage 119901 values
However even though it was minor in some situationsthe joint anlysis presented better power performance thanthe replication-based analysis in our study This type ofjoint analysis raised concerns about the exact meaning ofreplication [17] However McCarthy et al [26] mentionedthat joint analyses ldquoblur the boundaries of where exactlyreplication starts but whichever analytical approach is takenconfirmation in many independent samples is important andit is the overall strength of the evidence of association thatmattersrdquo Purpose of the current study was to present howthe overall strength of the evidence of association can beevaluated when robust tests are used in GWAS replicationstudies
We illustrated how the proposed methods can be appliedin the real data that studied the association of SNPs with non-small-cell lung cancer (NSCLC) in discovery and replicationstages In the original study reported by Yoon et al [25]SNPs were selected in the discovery data set not based onthe robust tests but based on additive CATT Therefore wefound that some SNPs could have been selected by one
BioMed Research International 9
of the robust methods but they were not included in thereplication data set For these SNPs we were not able toperform the joint analysis that we propose and it was notpossible to examine whether there are other SNPs that couldhave been found to be associated with NSCLC by proposedmethods in the replication study For this reason we merelypresented howmany additional SNPs could have been furtherfollowed in the replication stage when robust methods wereused In many GWASs it is a common practice to reportthe summary test statistics and 119901 values of the SNPs undera specific genetic model usually an additive model whichwere further genotyped in the replication stage and werefinally defined to be significantly associated with a phenotypeof interest As emphasized in this paper one may have abetter chance of findingmanymissing SNPs by applyingmorepowerful and robust methods that consider different geneticmodels simultaneouslyTherefore we urge the community toshare test results under not only an additive model but alsoother genetic models although they were not significant at astringent significance level so that future research may haveenriched data resources to which robust tests can be appliedin association studies
Appendix
119901 value of Fisherrsquos Combination forEqual and Unequal Weights
Equal Weights Assume 1199081
= 1199082
= 1 Under the nullhypothesis of no association 119883
1= minus2 log119875(1) and 119883
2=
minus2 log119875(2) are independent and each asymptotically followsa 1205942 distribution with 2 degrees of freedom Let 119891
119896(119909) and
119865119896(119909) be the probability and cumulative density functions of
1205942 random variable with 119896 degrees of freedomThen 119891
2(119909) =
exp(minus1199092)2 1198914(119909) = 119909 exp(minus1199092)4 119865
2(119909) = 1 minus exp(minus1199092)
and1198654(119909) = 1minusexp(minus1199092)minus119909 exp(minus1199092)2 Denote the cutoff
of the discovery stage based on 120572119863as 119862119863 that is 119865
2(119862119863) =
1 minus 120572119863 For observed value 119911FC gt 119862
119863of1198831+ 1198832 the 119901 value
is written as
1198751198670
(1198831gt 119862119863 1198831+ 1198832gt 119911FC)
= 120572119863minus int
119911FC
119862119863
1198912 (119909) 1198652 (119911FC minus 119909) 119889119909
= 120572119863+ exp (minus
119911FC2) minus 120572119863+1
2exp(minus
119911FC2) (119911FC minus 119862119863)
= exp (minus119911FC2) (1 +
119911FC2
+ log120572119863)
(A1)
Unequal WeightsWhen different proportions of samples areused in the discovery and replication stages it may be moreappropriate to assignweights proportional to the sample sizesfor each stage For example when only a small portion isused in the discovery stage to prevent Fisherrsquos combinationtest from being dominated by the significant result in the
discovery stage one may want to assign a small weight to thediscovery stage result
When 120587119904is the proportion of samples used in the dis-
covery stage one selection for weights is 1199081
= 2120587119904and
1199082= 2(1 minus 120587
119904) for discovery and replication stages which
simplifies to equal weights when 120587119904= 05 Based on these
weights we consider unequal-weighted Fisherrsquos combinationasminus2 log119875(1)1199081119875(2)1199082 = 119908
11198831+11990821198832[27] Its density function
is given by
119891119908 (119909) =
1
2 (1199081minus 1199082)exp(minus 119909
(21199081))
minus1
2 (1199081minus 1199082)exp(minus 119909
(21199082))
(A2)
and the probability distribution function is
119865119908 (119909) = 1 minus
1199081
(1199081minus 1199082)exp(minus 119909
21199081
)
minus1199082
(1199081minus 1199082)exp(minus 119909
21199082
) 1199081
= 1199082
(A3)
Using the previous notation we have the following formof 119901 value
1198751198670
(1198831gt 119862119863 11990811198831+ 11990821198832gt 119911FC)
= 120572119863minus int
119911FC1199081
119862119863
1198912 (119909) 1198652 (
119911FC minus 1199081119909
1199082
)119889119909
= exp(minus119911FC21199081
) +1199082
1199081minus 1199082
exp(minus119911FC21199082
)
times exp(1199081 minus 1199082211990811199082
119911FC) minus exp(1199081 minus 1199082211990811199082
1199081119862119863)
=1199081
1199081minus 1199082
exp(minus119911FC21199081
)
minus1199082
1199081minus 1199082
exp(minus119911FC21199082
)120572minus(1199081
minus1199082
)1199082
119863
=1199081
1199081minus 1199082
exp(minus119911FC21199081
)
minus1199082
1199081minus 1199082
exp(minus119911FC21199082
)120572minus(1199081
minus1199082
)1199082
119863
(A4)
where 119911FC1199081 gt 119862119863
Conflict of Interests
The authors declare that there is no conflict of interestsregarding the publication of this paper
10 BioMed Research International
Acknowledgments
The authors are indebted to late Dr Gang Zheng for his inspi-ration and support on their workThisworkwas supported bygrant of the National Cancer Center (no NCC-1210060)
References
[1] R J Klein C Zeiss E Y Chew et al ldquoComplement factor Hpolymorphism in age-related macular degenerationrdquo Sciencevol 308 no 5720 pp 385ndash389 2005
[2] R H Duerr K D Taylor S R Brant et al ldquoA genome-wideassociation study identifies IL 23R as an inflammatory boweldisease generdquo Science vol 314 no 5804 pp 1461ndash1463 2006
[3] A Herbert N P Gerry M B McQueen et al ldquoA commongenetic variant is associated with adult and childhood obesityrdquoScience vol 312 no 5771 pp 279ndash283 2006
[4] R Sladek G Rocheleau J Rung et al ldquoA genome-wideassociation study identifies novel risk loci for type 2 diabetesrdquoNature vol 445 no 7130 pp 881ndash885 2007
[5] P R Burton D G Clayton L R Cardon et al ldquoGenome-wideassociation study of 14000 cases of seven common diseases and3000 shared controlsrdquo Nature vol 447 no 7145 pp 661ndash6782007
[6] S H Kwak S H Kim Y M Cho et al ldquoA genome-wideassociation study of gestational diabetes mellitus in Koreanwomenrdquo Diabetes vol 61 no 2 pp 531ndash541 2012
[7] W G Cochran ldquoSomemethods for strengthening the common1205942 testsrdquo Biometrics vol 10 no 4 pp 417ndash451 1954
[8] P Armitage ldquoTests for linear trends in proportions and frequen-ciesrdquo Biometrics vol 11 no 3 pp 375ndash386 1955
[9] P D Sasieni ldquoFrom genotypes to genes doubling the samplesizerdquo Biometrics vol 53 no 4 pp 1253ndash1261 1997
[10] B Freidlin G Zheng Z Li and J L Gastwirth ldquoTrend tests forcase-control studies of genetic markers power sample size androbustnessrdquo Human Heredity vol 53 no 3 pp 146ndash152 2002
[11] J L Gastwirth ldquoOn robust proceduresrdquo Journal of the AmericanStatistical Association vol 61 no 316 pp 929ndash948 1966
[12] J L Gastwirth ldquoThe use of maximin efficiency robust tests incombining contingency tables and survival analysisrdquo Journal ofthe American Statistical Association vol 80 no 390 pp 380ndash384 1985
[13] G Zheng and H K T Ng ldquoGenetic model selection in two-phase analysis for case-control association studiesrdquo Biostatisticsvol 9 no 3 pp 391ndash399 2008
[14] J Joo M Kwak and G Zheng ldquoImproving power for testinggenetic association in case-control studies by reducing thealternative spacerdquo Biometrics vol 66 no 1 pp 266ndash276 2010
[15] J Joo M Kwak K Ahn and G Zheng ldquoA robust genome-widescan statistic of theWellcome Trust Case-Control ConsortiumrdquoBiometrics vol 65 no 4 pp 1115ndash1122 2009
[16] P Kraft E Zeggini and J P A Ioannidis ldquoReplication ingenome-wide association studiesrdquo Statistical Science vol 24 no4 pp 561ndash573 2009
[17] D CThomas G Casey D V Conti R W Haile J P Lewingerand D O Stram ldquoMethodological issues in multistage genome-wide association studiesrdquo Statistical Science vol 24 no 4 pp414ndash429 2009
[18] A D Skol L J Scott G R Abecasis and M Boehnke ldquoJointanalysis is more efficient than replication-based analysis for
two-stage genome-wide association studiesrdquo Nature Geneticsvol 38 no 2 pp 209ndash213 2006
[19] A D Skol L J Scott G R Abecasis andM Boehnke ldquoOptimaldesigns for two-stage genome-wide association studiesrdquoGeneticEpidemiology vol 31 no 7 pp 776ndash788 2007
[20] G Zheng B Freidlin Z Li and J L Gastwirth ldquoChoice ofscores in trend tests for case-control studies of candidate-geneassociationsrdquo Biometrical Journal vol 45 no 3 pp 335ndash3482003
[21] K Song and R C Elston ldquoA powerful method of combiningmeasures of association and Hardy-Weinberg Disequilibriumfor fine-mapping in case-control studiesrdquo Statistics in Medicinevol 25 no 1 pp 105ndash126 2006
[22] S Won N Morris Q Lu and R C Elston ldquoChoosing anoptimal method to combine P-valuesrdquo Statistics in Medicinevol 28 no 11 pp 1537ndash1553 2009
[23] F BegumDGhoshGC Tseng andE Feingold ldquoComprehen-sive literature review and statistical considerations for GWASmeta-analysisrdquo Nucleic Acids Research vol 40 no 9 pp 3777ndash3784 2012
[24] B Efron and R J Tibshirani An Introduction to the BootstrapChapman amp HallCRC Press 1993
[25] K A Yoon J H Park J Han et al ldquoA genome-wide associationstudy reveals susceptibility variants for non-small cell lungcancer in the Korean populationrdquo Human Molecular Geneticsvol 19 no 24 pp 4948ndash4954 2010
[26] M I McCarthy G R Abecasis L R Cardon et al ldquoGenome-wide association studies for complex traits consensus uncer-tainty and challengesrdquoNature Reviews Genetics vol 9 no 5 pp356ndash369 2008
[27] I J Good ldquoOn the weighted combination of significance testsrdquoJournal of the Royal Statistical Society Series B vol 17 pp 264ndash265 1955
Submit your manuscripts athttpwwwhindawicom
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Anatomy Research International
PeptidesInternational Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporation httpwwwhindawicom
International Journal of
Volume 2014
Zoology
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Molecular Biology International
GenomicsInternational Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
BioinformaticsAdvances in
Marine BiologyJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Signal TransductionJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
BioMed Research International
Evolutionary BiologyInternational Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Biochemistry Research International
ArchaeaHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Genetics Research International
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Advances in
Virolog y
Hindawi Publishing Corporationhttpwwwhindawicom
Nucleic AcidsJournal of
Volume 2014
Stem CellsInternational
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Enzyme Research
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
International Journal of
Microbiology
BioMed Research International 9
of the robust methods but they were not included in thereplication data set For these SNPs we were not able toperform the joint analysis that we propose and it was notpossible to examine whether there are other SNPs that couldhave been found to be associated with NSCLC by proposedmethods in the replication study For this reason we merelypresented howmany additional SNPs could have been furtherfollowed in the replication stage when robust methods wereused In many GWASs it is a common practice to reportthe summary test statistics and 119901 values of the SNPs undera specific genetic model usually an additive model whichwere further genotyped in the replication stage and werefinally defined to be significantly associated with a phenotypeof interest As emphasized in this paper one may have abetter chance of findingmanymissing SNPs by applyingmorepowerful and robust methods that consider different geneticmodels simultaneouslyTherefore we urge the community toshare test results under not only an additive model but alsoother genetic models although they were not significant at astringent significance level so that future research may haveenriched data resources to which robust tests can be appliedin association studies
Appendix
119901 value of Fisherrsquos Combination forEqual and Unequal Weights
Equal Weights Assume 1199081
= 1199082
= 1 Under the nullhypothesis of no association 119883
1= minus2 log119875(1) and 119883
2=
minus2 log119875(2) are independent and each asymptotically followsa 1205942 distribution with 2 degrees of freedom Let 119891
119896(119909) and
119865119896(119909) be the probability and cumulative density functions of
1205942 random variable with 119896 degrees of freedomThen 119891
2(119909) =
exp(minus1199092)2 1198914(119909) = 119909 exp(minus1199092)4 119865
2(119909) = 1 minus exp(minus1199092)
and1198654(119909) = 1minusexp(minus1199092)minus119909 exp(minus1199092)2 Denote the cutoff
of the discovery stage based on 120572119863as 119862119863 that is 119865
2(119862119863) =
1 minus 120572119863 For observed value 119911FC gt 119862
119863of1198831+ 1198832 the 119901 value
is written as
1198751198670
(1198831gt 119862119863 1198831+ 1198832gt 119911FC)
= 120572119863minus int
119911FC
119862119863
1198912 (119909) 1198652 (119911FC minus 119909) 119889119909
= 120572119863+ exp (minus
119911FC2) minus 120572119863+1
2exp(minus
119911FC2) (119911FC minus 119862119863)
= exp (minus119911FC2) (1 +
119911FC2
+ log120572119863)
(A1)
Unequal WeightsWhen different proportions of samples areused in the discovery and replication stages it may be moreappropriate to assignweights proportional to the sample sizesfor each stage For example when only a small portion isused in the discovery stage to prevent Fisherrsquos combinationtest from being dominated by the significant result in the
discovery stage one may want to assign a small weight to thediscovery stage result
When 120587119904is the proportion of samples used in the dis-
covery stage one selection for weights is 1199081
= 2120587119904and
1199082= 2(1 minus 120587
119904) for discovery and replication stages which
simplifies to equal weights when 120587119904= 05 Based on these
weights we consider unequal-weighted Fisherrsquos combinationasminus2 log119875(1)1199081119875(2)1199082 = 119908
11198831+11990821198832[27] Its density function
is given by
119891119908 (119909) =
1
2 (1199081minus 1199082)exp(minus 119909
(21199081))
minus1
2 (1199081minus 1199082)exp(minus 119909
(21199082))
(A2)
and the probability distribution function is
119865119908 (119909) = 1 minus
1199081
(1199081minus 1199082)exp(minus 119909
21199081
)
minus1199082
(1199081minus 1199082)exp(minus 119909
21199082
) 1199081
= 1199082
(A3)
Using the previous notation we have the following formof 119901 value
1198751198670
(1198831gt 119862119863 11990811198831+ 11990821198832gt 119911FC)
= 120572119863minus int
119911FC1199081
119862119863
1198912 (119909) 1198652 (
119911FC minus 1199081119909
1199082
)119889119909
= exp(minus119911FC21199081
) +1199082
1199081minus 1199082
exp(minus119911FC21199082
)
times exp(1199081 minus 1199082211990811199082
119911FC) minus exp(1199081 minus 1199082211990811199082
1199081119862119863)
=1199081
1199081minus 1199082
exp(minus119911FC21199081
)
minus1199082
1199081minus 1199082
exp(minus119911FC21199082
)120572minus(1199081
minus1199082
)1199082
119863
=1199081
1199081minus 1199082
exp(minus119911FC21199081
)
minus1199082
1199081minus 1199082
exp(minus119911FC21199082
)120572minus(1199081
minus1199082
)1199082
119863
(A4)
where 119911FC1199081 gt 119862119863
Conflict of Interests
The authors declare that there is no conflict of interestsregarding the publication of this paper
10 BioMed Research International
Acknowledgments
The authors are indebted to late Dr Gang Zheng for his inspi-ration and support on their workThisworkwas supported bygrant of the National Cancer Center (no NCC-1210060)
References
[1] R J Klein C Zeiss E Y Chew et al ldquoComplement factor Hpolymorphism in age-related macular degenerationrdquo Sciencevol 308 no 5720 pp 385ndash389 2005
[2] R H Duerr K D Taylor S R Brant et al ldquoA genome-wideassociation study identifies IL 23R as an inflammatory boweldisease generdquo Science vol 314 no 5804 pp 1461ndash1463 2006
[3] A Herbert N P Gerry M B McQueen et al ldquoA commongenetic variant is associated with adult and childhood obesityrdquoScience vol 312 no 5771 pp 279ndash283 2006
[4] R Sladek G Rocheleau J Rung et al ldquoA genome-wideassociation study identifies novel risk loci for type 2 diabetesrdquoNature vol 445 no 7130 pp 881ndash885 2007
[5] P R Burton D G Clayton L R Cardon et al ldquoGenome-wideassociation study of 14000 cases of seven common diseases and3000 shared controlsrdquo Nature vol 447 no 7145 pp 661ndash6782007
[6] S H Kwak S H Kim Y M Cho et al ldquoA genome-wideassociation study of gestational diabetes mellitus in Koreanwomenrdquo Diabetes vol 61 no 2 pp 531ndash541 2012
[7] W G Cochran ldquoSomemethods for strengthening the common1205942 testsrdquo Biometrics vol 10 no 4 pp 417ndash451 1954
[8] P Armitage ldquoTests for linear trends in proportions and frequen-ciesrdquo Biometrics vol 11 no 3 pp 375ndash386 1955
[9] P D Sasieni ldquoFrom genotypes to genes doubling the samplesizerdquo Biometrics vol 53 no 4 pp 1253ndash1261 1997
[10] B Freidlin G Zheng Z Li and J L Gastwirth ldquoTrend tests forcase-control studies of genetic markers power sample size androbustnessrdquo Human Heredity vol 53 no 3 pp 146ndash152 2002
[11] J L Gastwirth ldquoOn robust proceduresrdquo Journal of the AmericanStatistical Association vol 61 no 316 pp 929ndash948 1966
[12] J L Gastwirth ldquoThe use of maximin efficiency robust tests incombining contingency tables and survival analysisrdquo Journal ofthe American Statistical Association vol 80 no 390 pp 380ndash384 1985
[13] G Zheng and H K T Ng ldquoGenetic model selection in two-phase analysis for case-control association studiesrdquo Biostatisticsvol 9 no 3 pp 391ndash399 2008
[14] J Joo M Kwak and G Zheng ldquoImproving power for testinggenetic association in case-control studies by reducing thealternative spacerdquo Biometrics vol 66 no 1 pp 266ndash276 2010
[15] J Joo M Kwak K Ahn and G Zheng ldquoA robust genome-widescan statistic of theWellcome Trust Case-Control ConsortiumrdquoBiometrics vol 65 no 4 pp 1115ndash1122 2009
[16] P Kraft E Zeggini and J P A Ioannidis ldquoReplication ingenome-wide association studiesrdquo Statistical Science vol 24 no4 pp 561ndash573 2009
[17] D CThomas G Casey D V Conti R W Haile J P Lewingerand D O Stram ldquoMethodological issues in multistage genome-wide association studiesrdquo Statistical Science vol 24 no 4 pp414ndash429 2009
[18] A D Skol L J Scott G R Abecasis and M Boehnke ldquoJointanalysis is more efficient than replication-based analysis for
two-stage genome-wide association studiesrdquo Nature Geneticsvol 38 no 2 pp 209ndash213 2006
[19] A D Skol L J Scott G R Abecasis andM Boehnke ldquoOptimaldesigns for two-stage genome-wide association studiesrdquoGeneticEpidemiology vol 31 no 7 pp 776ndash788 2007
[20] G Zheng B Freidlin Z Li and J L Gastwirth ldquoChoice ofscores in trend tests for case-control studies of candidate-geneassociationsrdquo Biometrical Journal vol 45 no 3 pp 335ndash3482003
[21] K Song and R C Elston ldquoA powerful method of combiningmeasures of association and Hardy-Weinberg Disequilibriumfor fine-mapping in case-control studiesrdquo Statistics in Medicinevol 25 no 1 pp 105ndash126 2006
[22] S Won N Morris Q Lu and R C Elston ldquoChoosing anoptimal method to combine P-valuesrdquo Statistics in Medicinevol 28 no 11 pp 1537ndash1553 2009
[23] F BegumDGhoshGC Tseng andE Feingold ldquoComprehen-sive literature review and statistical considerations for GWASmeta-analysisrdquo Nucleic Acids Research vol 40 no 9 pp 3777ndash3784 2012
[24] B Efron and R J Tibshirani An Introduction to the BootstrapChapman amp HallCRC Press 1993
[25] K A Yoon J H Park J Han et al ldquoA genome-wide associationstudy reveals susceptibility variants for non-small cell lungcancer in the Korean populationrdquo Human Molecular Geneticsvol 19 no 24 pp 4948ndash4954 2010
[26] M I McCarthy G R Abecasis L R Cardon et al ldquoGenome-wide association studies for complex traits consensus uncer-tainty and challengesrdquoNature Reviews Genetics vol 9 no 5 pp356ndash369 2008
[27] I J Good ldquoOn the weighted combination of significance testsrdquoJournal of the Royal Statistical Society Series B vol 17 pp 264ndash265 1955
Submit your manuscripts athttpwwwhindawicom
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Anatomy Research International
PeptidesInternational Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporation httpwwwhindawicom
International Journal of
Volume 2014
Zoology
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Molecular Biology International
GenomicsInternational Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
BioinformaticsAdvances in
Marine BiologyJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Signal TransductionJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
BioMed Research International
Evolutionary BiologyInternational Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Biochemistry Research International
ArchaeaHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Genetics Research International
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Advances in
Virolog y
Hindawi Publishing Corporationhttpwwwhindawicom
Nucleic AcidsJournal of
Volume 2014
Stem CellsInternational
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Enzyme Research
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
International Journal of
Microbiology
10 BioMed Research International
Acknowledgments
The authors are indebted to late Dr Gang Zheng for his inspi-ration and support on their workThisworkwas supported bygrant of the National Cancer Center (no NCC-1210060)
References
[1] R J Klein C Zeiss E Y Chew et al ldquoComplement factor Hpolymorphism in age-related macular degenerationrdquo Sciencevol 308 no 5720 pp 385ndash389 2005
[2] R H Duerr K D Taylor S R Brant et al ldquoA genome-wideassociation study identifies IL 23R as an inflammatory boweldisease generdquo Science vol 314 no 5804 pp 1461ndash1463 2006
[3] A Herbert N P Gerry M B McQueen et al ldquoA commongenetic variant is associated with adult and childhood obesityrdquoScience vol 312 no 5771 pp 279ndash283 2006
[4] R Sladek G Rocheleau J Rung et al ldquoA genome-wideassociation study identifies novel risk loci for type 2 diabetesrdquoNature vol 445 no 7130 pp 881ndash885 2007
[5] P R Burton D G Clayton L R Cardon et al ldquoGenome-wideassociation study of 14000 cases of seven common diseases and3000 shared controlsrdquo Nature vol 447 no 7145 pp 661ndash6782007
[6] S H Kwak S H Kim Y M Cho et al ldquoA genome-wideassociation study of gestational diabetes mellitus in Koreanwomenrdquo Diabetes vol 61 no 2 pp 531ndash541 2012
[7] W G Cochran ldquoSomemethods for strengthening the common1205942 testsrdquo Biometrics vol 10 no 4 pp 417ndash451 1954
[8] P Armitage ldquoTests for linear trends in proportions and frequen-ciesrdquo Biometrics vol 11 no 3 pp 375ndash386 1955
[9] P D Sasieni ldquoFrom genotypes to genes doubling the samplesizerdquo Biometrics vol 53 no 4 pp 1253ndash1261 1997
[10] B Freidlin G Zheng Z Li and J L Gastwirth ldquoTrend tests forcase-control studies of genetic markers power sample size androbustnessrdquo Human Heredity vol 53 no 3 pp 146ndash152 2002
[11] J L Gastwirth ldquoOn robust proceduresrdquo Journal of the AmericanStatistical Association vol 61 no 316 pp 929ndash948 1966
[12] J L Gastwirth ldquoThe use of maximin efficiency robust tests incombining contingency tables and survival analysisrdquo Journal ofthe American Statistical Association vol 80 no 390 pp 380ndash384 1985
[13] G Zheng and H K T Ng ldquoGenetic model selection in two-phase analysis for case-control association studiesrdquo Biostatisticsvol 9 no 3 pp 391ndash399 2008
[14] J Joo M Kwak and G Zheng ldquoImproving power for testinggenetic association in case-control studies by reducing thealternative spacerdquo Biometrics vol 66 no 1 pp 266ndash276 2010
[15] J Joo M Kwak K Ahn and G Zheng ldquoA robust genome-widescan statistic of theWellcome Trust Case-Control ConsortiumrdquoBiometrics vol 65 no 4 pp 1115ndash1122 2009
[16] P Kraft E Zeggini and J P A Ioannidis ldquoReplication ingenome-wide association studiesrdquo Statistical Science vol 24 no4 pp 561ndash573 2009
[17] D CThomas G Casey D V Conti R W Haile J P Lewingerand D O Stram ldquoMethodological issues in multistage genome-wide association studiesrdquo Statistical Science vol 24 no 4 pp414ndash429 2009
[18] A D Skol L J Scott G R Abecasis and M Boehnke ldquoJointanalysis is more efficient than replication-based analysis for
two-stage genome-wide association studiesrdquo Nature Geneticsvol 38 no 2 pp 209ndash213 2006
[19] A D Skol L J Scott G R Abecasis andM Boehnke ldquoOptimaldesigns for two-stage genome-wide association studiesrdquoGeneticEpidemiology vol 31 no 7 pp 776ndash788 2007
[20] G Zheng B Freidlin Z Li and J L Gastwirth ldquoChoice ofscores in trend tests for case-control studies of candidate-geneassociationsrdquo Biometrical Journal vol 45 no 3 pp 335ndash3482003
[21] K Song and R C Elston ldquoA powerful method of combiningmeasures of association and Hardy-Weinberg Disequilibriumfor fine-mapping in case-control studiesrdquo Statistics in Medicinevol 25 no 1 pp 105ndash126 2006
[22] S Won N Morris Q Lu and R C Elston ldquoChoosing anoptimal method to combine P-valuesrdquo Statistics in Medicinevol 28 no 11 pp 1537ndash1553 2009
[23] F BegumDGhoshGC Tseng andE Feingold ldquoComprehen-sive literature review and statistical considerations for GWASmeta-analysisrdquo Nucleic Acids Research vol 40 no 9 pp 3777ndash3784 2012
[24] B Efron and R J Tibshirani An Introduction to the BootstrapChapman amp HallCRC Press 1993
[25] K A Yoon J H Park J Han et al ldquoA genome-wide associationstudy reveals susceptibility variants for non-small cell lungcancer in the Korean populationrdquo Human Molecular Geneticsvol 19 no 24 pp 4948ndash4954 2010
[26] M I McCarthy G R Abecasis L R Cardon et al ldquoGenome-wide association studies for complex traits consensus uncer-tainty and challengesrdquoNature Reviews Genetics vol 9 no 5 pp356ndash369 2008
[27] I J Good ldquoOn the weighted combination of significance testsrdquoJournal of the Royal Statistical Society Series B vol 17 pp 264ndash265 1955
Submit your manuscripts athttpwwwhindawicom
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Anatomy Research International
PeptidesInternational Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporation httpwwwhindawicom
International Journal of
Volume 2014
Zoology
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Molecular Biology International
GenomicsInternational Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
BioinformaticsAdvances in
Marine BiologyJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Signal TransductionJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
BioMed Research International
Evolutionary BiologyInternational Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Biochemistry Research International
ArchaeaHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Genetics Research International
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Advances in
Virolog y
Hindawi Publishing Corporationhttpwwwhindawicom
Nucleic AcidsJournal of
Volume 2014
Stem CellsInternational
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Enzyme Research
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
International Journal of
Microbiology
Submit your manuscripts athttpwwwhindawicom
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Anatomy Research International
PeptidesInternational Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporation httpwwwhindawicom
International Journal of
Volume 2014
Zoology
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Molecular Biology International
GenomicsInternational Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
BioinformaticsAdvances in
Marine BiologyJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Signal TransductionJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
BioMed Research International
Evolutionary BiologyInternational Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Biochemistry Research International
ArchaeaHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Genetics Research International
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Advances in
Virolog y
Hindawi Publishing Corporationhttpwwwhindawicom
Nucleic AcidsJournal of
Volume 2014
Stem CellsInternational
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Enzyme Research
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
International Journal of
Microbiology