research article robust association tests for the replication of...

11
Research Article Robust Association Tests for the Replication of Genome-Wide Association Studies Jungnam Joo, 1 Ju-Hyun Park, 2 Bora Lee, 1 Boram Park, 1 Sohee Kim, 1 Kyong-Ah Yoon, 3 Jin Soo Lee, 3 and Nancy L. Geller 4 1 Biometric Research Branch, Research Institute and Hospital, National Cancer Center, Gyeonggi-do, Goyang-si 410-769, Republic of Korea 2 Department of Statistics, Dongguk University, Seoul 100-715, Republic of Korea 3 Lung Cancer Branch, Research Institute and Hospital, National Cancer Center, Gyeonggi-do, Goyang-si 410-769, Republic of Korea 4 Office of Biostatistics Research, National Heart, Lung and Blood Institute, Bethesda, MD 20892-7938, USA Correspondence should be addressed to Jungnam Joo; [email protected] Received 14 November 2014; Revised 14 February 2015; Accepted 14 February 2015 Academic Editor: Taesung Park Copyright © 2015 Jungnam Joo et al. is is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. In genome-wide association study (GWAS), robust genetic association tests such as maximum of three CATTs (MAX3), each corresponding to recessive, additive, and dominant genetic models, the minimum p value of Pearson’s Chi-square test with 2 degrees of freedom, and CATT based on additive genetic model (MIN2), genetic model selection (GMS), and genetic model exclusion (GME) methods have been shown to provide better power performance under wide range of underlying genetic models. In this paper, we demonstrate how these robust tests can be applied to the replication study of GWAS and how the overall statistical significance can be evaluated using the combined test formed by p values of the discovery and replication studies. 1. Introduction With the advance of biotechnology and substantial reduc- tion of genotyping costs, a genome-wide association study (GWAS) using hundred thousand markers in several thou- sand individuals is now increasingly utilized and has been successful in detecting genetic associations across the entire genome with complex human traits [16]. Among many chal- lenges this application holds; development of more efficient and robust statistical methodologies with higher power to detect an association with a single marker has been one of the most important statistical issues, given that effects of individual markers are usually characterized as being small to moderate. One attempt to overcome this challenge is focused on developing efficient tests that are robust against underlying genetic model misspecification. Two most frequently used association tests are the allele- based test (ABT) and the genotype-based test (GBT). ABT compares the allele frequencies between cases and controls, while GBT compares the genotype distributions of cases and controls. e Cochran-Armitage trend test (CATT) [7, 8] is a popular GBT which takes into account the underlying genetic model. It is well known, however, that the ABT may inflate type I error when Hardy-Weinberg equilibrium (HWE) does not hold in the samples [9]. Even under HWE, when the genetic model is recessive or dominant, the ABT may suffer from serious power loss. On the other hand, the CATT does not depend on HWE, but to apply the CATT the choice of scores optimal for the underlying genetic model needs to be specified. For complex diseases, the genetic model is usually unknown and robust tests such as the maximum of three CATTs (MAX3) [10] and the maximum efficiency robust test (MERT) [11, 12] are preferable. Alternatively, Zheng and Ng [13] and Joo et al. [14] proposed a two-phase analysis based on the genetic model selection (GMS) and genetic model exclusion (GME). Moreover, an alternative approach was proposed by the Wellcome trust case-control consortium (WTCCC) [5] which used a minimum value of Pearson’s Chi-square test and additive CATT, and the asymptotic properties of this approach were studied in detail by Joo et al. [15]. ese methods provide better or comparable power performance than some of the robust tests such as MAX3. Hindawi Publishing Corporation BioMed Research International Volume 2015, Article ID 461593, 10 pages http://dx.doi.org/10.1155/2015/461593

Upload: others

Post on 22-Aug-2020

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Research Article Robust Association Tests for the Replication of …downloads.hindawi.com/journals/bmri/2015/461593.pdf · 2019. 7. 31. · Research Article Robust Association Tests

Research ArticleRobust Association Tests for the Replication ofGenome-Wide Association Studies

Jungnam Joo1 Ju-Hyun Park2 Bora Lee1 Boram Park1 Sohee Kim1 Kyong-Ah Yoon3

Jin Soo Lee3 and Nancy L Geller4

1Biometric Research Branch Research Institute and Hospital National Cancer Center Gyeonggi-doGoyang-si 410-769 Republic of Korea2Department of Statistics Dongguk University Seoul 100-715 Republic of Korea3Lung Cancer Branch Research Institute and Hospital National Cancer Center Gyeonggi-do Goyang-si 410-769 Republic of Korea4Office of Biostatistics Research National Heart Lung and Blood Institute Bethesda MD 20892-7938 USA

Correspondence should be addressed to Jungnam Joo joojnccrekr

Received 14 November 2014 Revised 14 February 2015 Accepted 14 February 2015

Academic Editor Taesung Park

Copyright copy 2015 Jungnam Joo et al This is an open access article distributed under the Creative Commons Attribution Licensewhich permits unrestricted use distribution and reproduction in any medium provided the original work is properly cited

In genome-wide association study (GWAS) robust genetic association tests such as maximum of three CATTs (MAX3) eachcorresponding to recessive additive and dominant geneticmodels theminimum p value of Pearsonrsquos Chi-square test with 2 degreesof freedom and CATT based on additive genetic model (MIN2) genetic model selection (GMS) and genetic model exclusion(GME) methods have been shown to provide better power performance under wide range of underlying genetic models In thispaper we demonstrate how these robust tests can be applied to the replication study of GWAS and how the overall statisticalsignificance can be evaluated using the combined test formed by p values of the discovery and replication studies

1 Introduction

With the advance of biotechnology and substantial reduc-tion of genotyping costs a genome-wide association study(GWAS) using hundred thousand markers in several thou-sand individuals is now increasingly utilized and has beensuccessful in detecting genetic associations across the entiregenomewith complex human traits [1ndash6] Amongmany chal-lenges this application holds development of more efficientand robust statistical methodologies with higher power todetect an association with a single marker has been one ofthe most important statistical issues given that effects ofindividualmarkers are usually characterized as being small tomoderate One attempt to overcome this challenge is focusedondeveloping efficient tests that are robust against underlyinggenetic model misspecification

Two most frequently used association tests are the allele-based test (ABT) and the genotype-based test (GBT) ABTcompares the allele frequencies between cases and controlswhile GBT compares the genotype distributions of cases andcontrolsThe Cochran-Armitage trend test (CATT) [7 8] is a

popularGBTwhich takes into account the underlying geneticmodel It is well known however that the ABT may inflatetype I error when Hardy-Weinberg equilibrium (HWE) doesnot hold in the samples [9] Even under HWE when thegenetic model is recessive or dominant the ABT may sufferfrom serious power loss On the other hand the CATT doesnot depend on HWE but to apply the CATT the choice ofscores optimal for the underlying genetic model needs to bespecified For complex diseases the genetic model is usuallyunknown and robust tests such as the maximum of threeCATTs (MAX3) [10] and the maximum efficiency robusttest (MERT) [11 12] are preferable Alternatively Zheng andNg [13] and Joo et al [14] proposed a two-phase analysisbased on the genetic model selection (GMS) and geneticmodel exclusion (GME) Moreover an alternative approachwas proposed by theWellcome trust case-control consortium(WTCCC) [5] which used a minimum 119901 value of PearsonrsquosChi-square test and additive CATT and the asymptoticproperties of this approach were studied in detail by Jooet al [15]Thesemethods provide better or comparable powerperformance than some of the robust tests such as MAX3

Hindawi Publishing CorporationBioMed Research InternationalVolume 2015 Article ID 461593 10 pageshttpdxdoiorg1011552015461593

2 BioMed Research International

In this paper we illustrate how these robust tests canbe applied to a replication study of GWAS and how overallstatistical significance can be evaluated using the combinedtest formed by 119901 values of the discovery and replicationstudiesThe importance of replication or validation in GWAShas been well recognized [16 17] and joint analysis in atwo-stage design of GWAS has been proved to be morepowerful than replication-based analysis and has been widelyconducted in GWAS with a variety of phenotypes of interest[18 19]

The paper is organized as follows We first describethe data structures and notation and review existing robustassociation tests for a single data set Then we describe howto obtain the 119901 value for the replication data set given thesignificant result of the discovery stage using robust tests Inthe next section a combined test of the119901 values of the discov-ery and replication data sets is proposed together with theway to evaluate the statistical significance for the combinedtest Simulation studies are conducted to compare the typeI error rates and powers of various analytical strategies Forillustration purposes the summarized methods are appliedto a non-small-cell lung cancer data set and at the end thereis a discussion

2 Methods

21 Data and Notation For a marker with two alleles 119860 and119861 let the frequencies of 119861 in cases and controls be 119901 = 119875(119861 |

case) and 119902 = 119875(119861 | control) Denote three genotypes by1198660= 119860119860119866

1= 119860119861 and119866

2= 119861119861 In case-control association

studies 119903 cases and 119904 controls are independently sampledfrom each population The observed genotype counts for(1198660 1198661 1198662) are (119903

0 1199031 1199032) in the cases and (119904

0 1199041 1199042) in the

controls Disease prevalence is denoted by 119896 = 119875(disease)and penetrance by 119891

119894= 119875(disease | 119866

119894) for 119894 = 0 1 2 Two

genotype relative risks (GRRs) are denoted by 1205821= 11989111198910

and 1205822= 11989121198910using 119891

0gt 0 as baseline penetrance Under

the null hypothesis of no association 1198670 1198910= 1198911= 1198912= 119896

or alternatively1198670 1205822= 1205821= 1 Genetic model is recessive

(REC) additive (ADD)multiplicative (MUL) and dominant(DOM)when 120582

1= 1 120582

1= (1+120582

2)2 1205821= 12058212

2 and 120582

2= 1205821

respectively

22 Review of Association Tests for a Single Data Set Theassociation in case-control studies can be tested using variousmethods which have been extensively studied The generalassociation between the disease status and the SNP canbe tested using Pearsonrsquos Chi-square test which has anasymptotic Chi-square distributionwith 2 degrees of freedomunder119867

0 The test is given by

119879chi2 =2

sum

119895=0

(119903119895minus 119899119895119903119899)2

119899119895119903119899

+

2

sum

119895=0

(119904119895minus 119899119895119904119899)2

119899119895119904119899

(1)

where 119899119894= 119903119894+ 119904119894for 119894 = 0 1 2 and 119899 = 119903 + 119904 Under Hardy-

Weinberg equilibrium (HWE) an allele-based test (ABT) and

CATT with scores (0 119909 1) for (1198660 1198661 1198662) where 0 le 119909 le 1

are given by

119885ABT =11989912

2119903 (21199040+ 1199041) minus 2119904 (2119903

0+ 1199031)

2119903119904 (21198990+ 1198991) (1198991+ 21198992)12

119885119909=

11989912

sum2

119894=0119909119894(119904119903119894minus 119903119904119894)

[119903119904119899 119899sum2

119894=01199092119894119899119894minus (sum2

119894=0119909119894119899119894)2

]12

(2)

where (1199090 1199091 1199092) = (0 119909 1) [9] The optimal choices of 119909

for the recessive (REC) additivemultiplicative (ADDMUL)and dominant (DOM) models are 119909 = 0 12 and 1respectively [9 20] Both 119885

119909and 119885ABT asymptotically follow

a standard normal distribution under 1198670 119885119909can be used

even when HWE does not hold However without theHWE assumption 119885ABT does not follow a standard normaldistribution due to the correlation between two alleles

A robust test MAX3 proposed by Friedlin et al [10] canbe obtained by taking the maximum of three CATTs underthe three genetic models as MAX3 = max(|119885

0| |11988512| |1198851|)

Parametric bootstrap or permutationmethods can be used tofind the 119901 value of MAX3 [4]

Let the 119901 values of Pearsonrsquos Chi-square test and CATTunder the additive genetic model 119885

12be 119875chi2 and 119875

12

respectively WTCCC [5] proposed an alternative robust testMIN2 = min(119875chi2 11987512) Joo et al [15] derived the asymptoticnull distribution of MIN2 and using their result the 119901 valueof MIN2 can be obtained as

119875MIN2

=1

2exp minus1

2119867minus1

1(1 minusMIN2) + 1

2MIN2

minus1

2120587int

minus2 log(MIN2)

119867minus1

1

(1minusMIN2)119890minusV2 arcsin(

2119867minus1

1(1 minusMIN2)V minus 1

)119889V

(3)

where 1198671and 119867

2are the cumulative distributions of Chi-

square distributions with 1 and 2 degrees of freedomOn the other hand Song and Elston [21] considered a

Hardy-Weinberg disequilibrium trend test (HWDTT) givenby

119885119867=

(119903119904119899)12

(Δ119901minus Δ119902)

1 minus 1198992119899 minus 119899

1 (2119899) 119899

2119899 + 119899

1 (2119899)

(4)

where Δ119901= 1199012minus (1199012+ 11990112)2 and Δ

119902= 1199022minus (1199022+ 11990212)2

are the estimates of Δ119901and Δ

119902 where 119901

119894= 119903119894119903 and 119902

119894= 119904119894119904

Here Δ denotes the Hardy-Weinberg disequilibrium (HWD)coefficient defined by 119875119903(119861119861) minus 119875119903(119860119861)2 + 119875119903(119861119861)

2 andΔ119901andΔ

119902denote the HWD coefficient in cases and controls

respectivelyZheng and Ng [13] used the information contained in the

signs of (Δ119901 Δ119902) to determine the genetic models in their

two-phase method Their two-phase statistic 119885GMS is givenby 119885GMS = 119885

0if 119885119867

gt 119888 1198851if 119885119867

lt minus119888 and 11988512

other-wise where 119888 = Φ

minus1(1 minus 120572

119867) for 120572

119867= 005 The asymptotic

BioMed Research International 3

correlations between 119885119867and three CATTs under HWE were

derived and the significance level was adjusted accordingly tocontrol the desired type I error Based on the observation thatthis method assumes 119861 is the risk allele Joo et al [14] studiedthe behavior of 119885GMS when either one of the alleles can be arisk alleleThey chose the risk allele based on the sign of119885

12

that is if 11988512

gt 0 119861 is the risk allele and 1198850 11988512

and 1198851

are chosen for REC ADD and DOM models respectivelyIf 11988512

lt 0 the respective test statistics are chosen tobe minus119885

1 minus11988512

and minus1198850 They incorporate this property in

defining the test statistic for genetic model selection (119885GMS)and calculating the 119901 value Let Θ

0(119911) = 119911 119911 gt 119888

Θ12(119911) = 119911 |119911| lt 119888 and Θ

1(119911) = 119911 119911 lt minus119888 Then

the 119901 value of this method can be obtained by

119901GMS

= 2

1

sum

119909=0

intΘ119909

(119911119867

)

int

infin

0

int

infin

119905

120601119909(119911119909 11991112 119911119867) 11988911991111990911988911991112119889119911119867

+ 2

intΘ12

(119911)

Φ((minus119905 and 0) + 120588

12119911

(1 minus 120588212)12

)119889Φ (119911)

(5)

where 120588119909= Corr(119885

119909 119885119867) in (5) and 120588

11990912= Corr(119885

119909 11988512)

(119909 = 0 1) are replaced by their consistent estimates Here119905 = 119911GMS and (minus119905 and 0) = min(minus119905 0) Moreover 119911GMS and 11991112are the observed values of 119885GMS and 11988512 respectively

While studying the properties of GMS Joo et al [14]noticed that the probability of selecting the true recessive ordominant models using 119885

119867is very low especially for low

to moderate GRRs but the unlikely genetic model can besuccessfully excluded This led to genetic model exclusionmethod119885GME which is the same as the119885GMS described aboveexcept 119885

119909for 119909 = 0 12 1 is replaced by 119885lowast

119909where 119885lowast

119909=

(119885119909+11988512)2(1 +120588

11990912)12 And the 119901 value of GME can be

obtained as

119901GME

= 2

1

sum

119909=0

intΘ119909

(119911119867

)

int

infin

0

int

infin

119871

120601119909(119911119909 11991112 119911119867) 11988911991111990911988911991112119889119911119867

+ 2

intΘ12

(119911)

Φ((minus119905 and 0) + 120588

12119911

(1 minus 120588212)12

)119889Φ (119911)

(6)

where 119871 = 1199052(1 + 12058811990912

)12

minus 11991112

for 119905 = 119911GME

23 119901 Value of Replication Data Using the Robust Method Inthe discovery stage the 119901 value of robust association testsincluding MAX3 MIN2 119885GMS and 119885GME can be obtainedas described in Section 22 For the 119901 value of replication datausing the robust method we use the same analytic methodthat was used for discovery and the risk allele identified byit [16] This means that when the best test statistic or geneticmodel is selected in the discovery stage the replication stage

will adopt the discovery stage selection and the direction ofassociation

Suppose that for simplicity of notation our interest isin GWAS with two stages one for discovery and the otherfor replication although the methodology described belowcan be extended to multistages for replication Let 119885(119894)

119909for

119909 = 0 12 1 be the CATT optimal for recessive additiveand dominant models and let 119875(119894)

119909be corresponding 119901 value

for 119894th stage (119894 = 1 for discovery and 119894 = 2 for replicationstages) Also denote 119885(119894)lowast

119909= (119885(119894)

119909+ 119885(119894)

12)2(1 + 120588

(119894)

11990912)12

for 119909 = 0 12 1 Then for CATT with a preselected geneticmodel 119875(2)

119909= 1minusΦ(sign(119885(1)

119909sdot119885(2)

119909) sdot |119885(2)

119909|) using a one-sided

119901 value given the direction of association from the discoverystage and119875(2)lowast

119909= 1minusΦ(sign(119885(1)lowast

119909sdot119885(2)lowast

119909) sdot |119885(2)lowast

119909|) Moreover

denote the test statistics and 119901 values using Pearsonrsquos Chi-square test from the 119894th stage as 119879(119894)chi2 and 119875

(119894)

chi2 Further letHWDTT from the 119894th stage be 119885(119894)

119867 Then the second stage

119901 values using MAX3 MIN2 119885GMS and 119885GME denoted as119875(2)

MAX3 119875(2)

MIN2 119875(2)

GMS and 119875(2)

GME can be obtained as follows

119875(2)

MAX3 = 119875(2)

119909lowast

where 119909lowast = arg min119909isin0121

119875(1)

119909

119875(2)

MIN2 = 119875(2)

12sdot 119868 (119875(1)

12le 119875(2)

chi2) + 119875(2)

chi2 sdot 119868 (119875(1)

12gt 119875(2)

chi2)

119875(2)

GMS = 119875(2)

0119868 (119885(1)

119867gt 119888) + 119875

(2)

1119868 (119885(1)

119867lt minus119888)

+ 119875(2)

12119868 (10038161003816100381610038161003816119885(1)

119867

10038161003816100381610038161003816le 119888) if 119885(1)

12gt 0

= 119875(2)

0119868 (119885(1)

119867lt minus119888) + 119875

(2)

1119868 (119885(1)

119867gt 119888)

+ 119875(2)

12119868 (10038161003816100381610038161003816119885(1)

119867

10038161003816100381610038161003816le 119888) if 119885(1)

12le 0

119875(2)

GME = 119875(2)lowast

0119868 (119885(1)

119867gt 119888) + 119875

(2)lowast

1119868 (119885(1)

119867lt minus119888)

+ 119875(2)lowast

12119868 (10038161003816100381610038161003816119885(1)

119867

10038161003816100381610038161003816le 119888) if 119885(1)

12gt 0

= 119875(2)lowast

0119868 (119885(1)

119867lt minus119888) + 119875

(2)lowast

1119868 (119885(1)

119867gt 119888)

+ 119875(2)lowast

12119868 (10038161003816100381610038161003816119885(1)

119867

10038161003816100381610038161003816le 119888) if 119885(1)

12le 0

(7)

It is important to note that even though the direction ofthe test statistics and the selected genetic models are usedto obtain the second stage 119901 values the 119901 values from thetwo stages are independent under the null hypothesis Thisis because under the null hypothesis the probability of 119885

12

being positive or negative is simply 12 and the probabilityof the selection of a certain genetic model is also a constant(120572119867for the recessive and dominant models and 1 minus 2120572

119867for

the additive model)

24 Combined Test Using 119901 Values and Its Statistical Signif-icance For a given robust test we can consider the jointanalysis by combining 119901 values from the discovery andreplication stages of GWAS We consider using 119901 valuesrather than the test statistics because test statistics can have

4 BioMed Research International

Table 1 Type I error rates of three approachesmdashreplication-based (REP) test Fisherrsquos combination (119885FC) and linear combination of test(119885LC)mdashbased on the CATT with an additive model (119885

12) 1205942 MAX3 MIN2 GMS and GME The disease prevalence 119870 = 01 119872 = 10

markers 119903 = 1 500 cases and 119904 = 1 500 controls are considered based on 20000 simulations

MAF 120587119904

120572119863

119865 = 0

11988512

1205942 MAX3 MIN2 GMS GME

03 05

005REP 000530 000455 000505 000485 00050 000490119885FC 000500 000535 000495 000510 000515 000460119885LC 000535 000525 000485 000510 00050 000485

01REP 000510 000560 000565 000485 000525 000545119885FC 000565 000535 000565 000545 000565 000540119885LC 000520 000565 000525 000520 000530 000525

03 06

005REP 000510 000485 000480 000515 000480 000500119885FC 000445 000455 000450 000455 000450 000460LC 000500 000515 000495 000520 000475 000480

01REP 000500 000485 000485 000535 000530 000505119885FC 000465 000490 000485 000500 000455 000460119885LC 000480 000470 000515 000490 000485 000475

04 05

005REP 000590 000505 000530 000565 000505 000510119885FC 000575 000430 000460 000535 000460 000500119885LC 000600 000445 000500 000540 000490 000490

01REP 000525 000470 000535 000450 000480 000515119885FC 000515 000510 000495 000475 000540 000500119885LC 000530 000500 000485 000475 000495 000510

04 06

005REP 000475 000585 000480 000500 000515 000495119885FC 000460 000470 000420 000490 000455 000440119885LC 000525 000550 000520 000580 000510 000510

01REP 000550 000490 000515 000535 000555 000540119885FC 000520 000370 000495 000450 000515 000510119885LC 000565 000485 000570 000530 000610 000580

complex forms and obtaining the distribution of the joint testcan be difficult On the other hand calculating a 119901 value foreach data set might be relatively simple and the distributionof 119901 values under the null hypothesis of no association is easyto handle

There are several methods for combining test statisticsfrom two stages [22] and twomost commonly used forms arebased on Fisherrsquos combination and a linear combination afterinverse normal transformation [23] Fisherrsquos combination(FC) directly sums 119901 values after minus2 log transformation thatis 119885FC = minus2119908

1log(119875(1)) minus 2119908

2log(119875(2)) where 119875(119894) is 119901 value

from 119894 = 0 for discovery and 119894 = 1 for replication stagesusing a given robust test A specification of 119908

1= 1199082= 1

gives the same weight for discovery and replication stagesand one can consider 119908

1= 2120587

119904and 119908

2= 2(1 minus 120587

119904)

where 120587119904= 119873119863(119873119863+ 119873119877) and 119873

119863and 119873

119877are sample

sizes of the discovery and replication data sets A linearcombination of two 119875 values after taking the inverse of thestandard normal cumulative distribution is given by 119885LC =

1199081Φminus1(1minus119875(1)2)+119908

2Φminus1(1minus119875(2))radic1199082

1+ 11990822with a natural

choice of 1199081= radic120587119904 and 1199082 = radic1 minus 120587

119904 Let the significance

level of the discovery stage be 120572119863 which means that markers

with 119875(1) lt 120572119863are selected and replicated in the replication

stage The 119901 value of combined test can then be obtained as

119901FC = 1198751198670

(119875(1)

lt 120572119863 119885FC gt 119911FC)where the observed value of

119885FC is 119911FCThe119901FC are calculated as 119890minus119911FC2(1+119911FC2+log120572119863)

for equal weights where 119911FC gt minus2 log120572119863and (119908

1(1199081minus

1199082))119890minus119911FC21199081minus(119908

2(1199081minus1199082))119890minus119911FC21199082120572

minus(1199081

minus1199082

)1199082

119889for unequal

weights where 119911FC gt minus21199081log120572119863 Detailed derivations are

described in the Appendix Equivalently for an overall typeI error threshold for a single marker of 120572 one may obtainthe threshold 119862FC of 119885FC that satisfies 119875

1198670

(119875(1)

lt 120572119863 119885FC gt

119862FC) le 120572 Similarly for the 119885LC the 119901 value is calculatedas 119901LC = 119875

1198670

(119875(1)

lt 120572119863 119885LC gt 119911LC) = int

infin

1199111minus120572

119863

2

120601(119911)[1 minus

Φ((radic11990821+ 11990822119911LC minus 1199081119911)1199082)]119889119911 for 119911LC gt 119911

1minus120572119863

2where the

observed value of 119885LC = 119911LC

3 Simulation Results

31 Type I Error Table 1 provides the type I errors underdifferent scenarios A disease prevalence of 10 is assumedand a total of 1500 cases and 1500 controls were divided intotwo stages The proportions of samples in the first stage (120587

119904)

of 05 and 06 were considered for the minor allele frequency(MAF) of 03 and 04 We considered 119872 = 10 markers tocontrol the genome-wide false positive rate at 120572 = 005 withthe Bonferroni correction We did not consider a larger 119872

BioMed Research International 5

such as 300000 or 500000 because this will require morethan 10 million simulations to show a stable estimate of thetype I error rate With 119872 = 10 we performed 20000simulations which result in less than 10 of a coefficientof variation for a significance level 005119872 = 0005 foreach marker [24] The test statistics considered are 119885

12

Pearsonrsquos Chi-square testMIN2MAX3GMS andGME Forthe second stage analysis we considered a replication-basedanalysis 119885FC and 119885LC as proposed above The results arebased on the situation under HWE (HWE coefficient 119865 = 0)As expected all tests control the type I error reasonablly welland similar results were obtained when a slight deviationfrom HWE is present with 119865 = 005 (results not shown)

32 Empirical Power We examined the empirical powers ofdifferent tests considered above In Figure 1 we considered119872 = 10 markers a disease prevalence of 10 the samegenotype relative risk for two stages (119903

1= 14 and 119903

2= 14)

and 1000 cases and 1000 controls 2000 simulations wereperformed under HWE (119865 = 0) to control the genome-wide false positive rate at 120572 = 005 The recessive additiveand dominant models were assumed for the first secondand third rows Both joint analyses showed better powerperformances compared to the replication-based analysis (upto 159 in scenarios considered in Figure 1) and LC and FChave comparable powers with less than 2 difference Thepower gain of using the joint analysis is not as much as thatobserved in Skol et al [18] However as reported by Skolet al [18] when the between-stage heterogeneity exists andthe risk allele has a larger effect in the first stage than thatin the second stage much improved power is observed byusing the joint test Figure 2 shows results under this scenariowith 119903

1= 16 and 119903

2= 14 and the observed increase

in power using the joint test is as high as 339 Againthe difference between LC and FC is minor with less than3 difference As for comparison between different robustmethods MAX3 GMS and GME perform well under therecessive model while 119885

12 1205942 and MIN2 are less powerful

Under the additivemodel11988512

is most powerful as expectedand 1205942 is least powerful Other robust methods perform wellwith a slight decrease in power compared to 119885

12 Under the

dominant model MAX3 GMS and GME perform the besteven though all tests show good power performances and thedifference is minor Similar patterns were observed when aslight deviation from theHWE is present (results not shown)

4 Real Data Application

The GWAS on non-small-cell lung cancer (NSCLC) by Yoonet al [25] studied 621 NSCLC patients and 1541 controlsubjects in the discovery stage After stringent quality controlsteps a total of 246758 SNPs were tested for the associationwith NSCLC based on119885

12 In the replication stage 168 SNPs

with 119901 value less than 1 times 10minus4 in the first stage based on 11988512

were tested using 804 patients and 1470 control samples Weidentified additional 234 SNPs using MIN2 in the first stagewhich could be studied in the replication stage if MIN2 wasused instead of 119885

12since MIN2 produces stronger evidence

for the additional SNPs than 11988512

does The Manhattan plotsof using MIN2 and 119885

12are presented in Figure 3 One

example is 119903119904385272 located in chromosome 2 which hada 119901 value of 137 times 10

minus7 which reached significance level atBonferroni correction in discovery samples alone whereas11988512

yielded a119901 value greater than 1times10minus4 Even though thereis possibility of false positive findings these SNPs could havebeen selected for replication if robust methods were used

Since we do not have replication data for these additionalSNPs selected using MIN2 because the first stage selectionwas based on 119885

12in Yoon et al [25] just for illustration

purpose of the proposed methods we present the results ofthree SNPs including 1199031199042131877 that was reported by Yoonet al [25] When the significance level in the discovery stageis set at 120572

119863= 5 times 10

minus5 so that all these exemplary SNPs canbe selected in the discovery stage the 119901 value of combinedtest based on four robust methods (MAX3 MIN2 GMSand GME) as well as 119885

12and Pearsonrsquos Chi-square test is

presented in Table 2 Fisherrsquos combination was used for thejoint test in the second stage Only 1199031199042131877was found to besignificant with Bonferroni correction (119901 value lt 203times10

minus7)by all except MAX3method

5 Discussion

In genetic association studies efficiency robust tests whoseperformance does not depend on the underlying geneticmodel have been extensively studied and their power benefitover a wide range of geneticmodels has been well recognizedIn this paper we described how the idea of these robustassociation tests can be applied to the replication studies andfurther how overall statistical significance can be evaluatedusing the combined test formed by 119901 values of the discoveryand replication studies

When the robust tests are used the test statistic ofeach stage can have a complex form and thus dealing withthe distribution of the joint test can be difficult whereascalculating the 119901 value of each stage might be relativelysimple Because the asymptotic distribution of the 119901 valueunder the null hypothesis of no association is easy to handlethe combined test using 119901 values rather than the test statisticsthemselves can provide computational convenience

There are several methods for combining test statisticsfrom two stages and Won et al [22] compared the per-formances of various choices Two most commonly usedforms are based on Fisherrsquos combination and the linearcombination after the inverse normal transformation [23]and we presented the test statistics and 119901 values of these twomethods In our limited experience the linear combinationand Fisherrsquos combination are fairly comparable Fisherrsquoscombination seems to perform slightly better than the linearcombination when there exists some heterogeneity betweenstages in terms of the genotype relative risk while the linearcombination seems to perform slightly better in most ofother situations However the difference is extremely minorFurther research is required for the thorough comparison ofvarious methods of combining 119901 values in the application ofefficiency robust tests to the replication of genetic associationstudies

6 BioMed Research International

Recessive

Additive

Dominant

(05 03) (05 04) (06 03) (06 04)

(120587sMAF)

Empi

rical

pow

er

10

08

06

04

02

00

(05 03) (05 04) (06 03) (06 04)

(120587sMAF)Em

piric

al p

ower

10

08

06

04

02

00

(05 03) (05 04) (06 03) (06 04)

(120587sMAF)

Empi

rical

pow

er

10

08

06

04

02

00

(05 03) (05 04) (06 03) (06 04)

(120587sMAF)

Empi

rical

pow

er

10

08

06

04

02

00

(05 03) (05 04) (06 03) (06 04)

(120587sMAF)

Empi

rical

pow

er

10

08

06

04

02

00

(05 03) (05 04) (06 03) (06 04)

(120587sMAF)

Empi

rical

pow

er

10

08

06

04

02

00

(05 03) (05 04) (06 03) (06 04)

(120587sMAF)

Empi

rical

pow

er

10

08

06

04

02

00

(05 03) (05 04) (06 03) (06 04)

(120587sMAF)

Empi

rical

pow

er

10

08

06

04

02

00

(05 03) (05 04) (06 03) (06 04)

(120587sMAF)

Empi

rical

pow

er

10

08

06

04

02

00

Z05

1205942

MAX3

ZFC

ZFC

ZFC

ZLC

ZLC

ZLC

MIN2GMSGME

MIN2GMSGME

MIN2GMSGME

Z05

1205942

MAX3

Z05

1205942

MAX3

Replication-based test

Replication-based test

Replication-based test

Figure 1 Empirical powers based on 2000 simulations for119872 = 10markers genotype relative risks of both stages = 14 and disease prevalence119870 = 01 under the recessive additive and dominant models 1000 cases and 1000 controls are considered to control 120572 = 005 The first stagetype I error rate for discovery is 120572

119863= 005 Six test statistics 119885

12 1205942 MAX3 MIN2 GMS and GME are considered The first second and

third columns depict powers using the replication-based test 119885FC and 119885LC respectively

BioMed Research International 7

Replication-based test

Recessive

Additive

Dominant

(05 03) (05 04) (06 03) (06 04)

(120587sMAF)

Empi

rical

pow

er

10

08

06

04

02

00

(05 03) (05 04) (06 03) (06 04)

(120587sMAF)Em

piric

al p

ower

10

08

06

04

02

00

(05 03) (05 04) (06 03) (06 04)

(120587sMAF)

Empi

rical

pow

er

10

08

06

04

02

00

(05 03) (05 04) (06 03) (06 04)

(120587sMAF)

Empi

rical

pow

er

10

08

06

04

02

00

(05 03) (05 04) (06 03) (06 04)

(120587sMAF)

Empi

rical

pow

er

10

08

06

04

02

00

(05 03) (05 04) (06 03) (06 04)

(120587sMAF)

Empi

rical

pow

er

10

08

06

04

02

00

(05 03) (05 04) (06 03) (06 04)

(120587sMAF)

Empi

rical

pow

er

10

08

06

04

02

00

(05 03) (05 04) (06 03) (06 04)

(120587sMAF)

Empi

rical

pow

er

10

08

06

04

02

00

(05 03) (05 04) (06 03) (06 04)

(120587sMAF)

Empi

rical

pow

er

10

08

06

04

02

00

Z05

1205942

MAX3

ZFC

ZFC

ZFC

ZLC

ZLC

ZLC

MIN2GMSGME

MIN2GMSGME

MIN2GMSGME

Z05

1205942

MAX3

Z05

1205942

MAX3

Replication-based test

Replication-based test

Figure 2 Empirical powers based on 2000 simulations for 119872 = 10 markers genotype relative risks of two stages are different (1199031= 16

1199032= 14) disease prevalence 119870 = 01 under the recessive additive and dominant models 1000 cases and 1000 controls are considered to

control 120572 = 005 The first stage type I error rate for discovery is 120572119863= 005 Six test statistics 119885

12 1205942 MAX3 MIN2 GMS and GME are

considered The first second and third columns depict powers using the replication-based test 119885FC and 119885LC respectively

8 BioMed Research International

minuslo

g 10(120588)

Chromosome1 2 3 4 5 6 7 8 9 11 13 15 17 19 21 23

8

6

4

2

0

MIN2 P value

(a)

minuslo

g 10(120588)

Additive P value8

6

4

2

0

Chromosome1 2 3 4 5 6 7 8 9 11 13 15 17 19 21 23

(b)

Figure 3 Manhattan plots of 246758 SNPs from Yoon et al [25] based on MIN2 (a) and 11988512

(b) The 119909 axis is chromosomal location andthe 119910 axis is the significance (minuslog

10119875) of association The horizontal line corresponds to the significance level 10minus4

Table 2 For selected exemplary three SNPs for testing association with NSCLC 119901 value of combined test using additive CATT (11988512)

Pearsonrsquos Chi-square test (119879chi2) MAX3 MIN2 119885GMS and 119885GME

SNP 119901 value of 11988512

119901 value of 119879chi2Discovery Replication Combined test Discovery Replication Combined test

rs2131877 788 times 10minus5

104 times 10minus4

797 times 10minus8

140 times 10minus4

149 times 10minus4

184 times 10minus7

rs905551 183 times 10minus5

702 times 10minus3

770 times 10minus6

806 times 10minus5

489 times 10minus2

140 times 10minus5

rs1695109 248 times 10minus4

346 times 10minus2

217 times 10minus6

456 times 10minus5

153 times 10minus1

207 times 10minus5

SNP 119901 value of MAX3 119901 value of MIN2Discovery Replication Combined test Discovery Replication Combined test

rs2131877 153 times 10minus4

405 times 10minus2

192 times 10minus5

132 times 10minus4

104 times 10minus4

126 times 10minus7

rs905551 450 times 10minus5

702 times 10minus3

192 times 10minus6

134 times 10minus4

489 times 10minus2

199 times 10minus5

rs1695109 354 times 10minus5

263 times 10minus2

464 times 10minus6

236 times 10minus5

263 times 10minus2

335 times 10minus6

SNP 119901 value of 119885GMS 119901 value of 119885GME

Discovery Replication Combined test Discovery Replication Combined testrs2131877 186 times 10

minus4104 times 10

minus4171 times 10

minus7103 times 10

minus4104 times 10

minus4102 times 10

minus7

rs905551 519 times 10minus5

702 times 10minus3

216 times 10minus6

735 times 10minus5

801 times 10minus3

320 times 10minus6

rs1695109 689 times 10minus4

127 times 10minus1

385 times 10minus5

269 times 10minus5

419 times 10minus2

540 times 10minus6

In a genetic study where the purpose of considering areplication stage is to validate or replicate the genetic findingsfrom the discovery stage which is the case considered inthis paper the analysis in the replication stage utilized thetest statistic or genetic model that is selected as being thebest in the discovery stage and also the direction of the riskallele following guidelines for exact replication in geneticassociation studies If the purpose is to simply combine theevidence fromdifferent data sources such as inmeta-analysisother strategies may be devised Further research again isrequired to provide fully detailed properties of suchmethods

Power gain of a joint analysis over the conventionalreplication-based analysis was thoroughly studied by Skol etal [18 19] In our simulation the amount of power increaseusing a joint test compared to the replication-based analysiswas much minor than what was observed by Skol et al[18 19] The exact reason is not known but we suspect thismight be due to the power advantages of robust methodsand also due to the fact that the optimal choice from thefirst stage is used when calculating the second stage 119901 values

However even though it was minor in some situationsthe joint anlysis presented better power performance thanthe replication-based analysis in our study This type ofjoint analysis raised concerns about the exact meaning ofreplication [17] However McCarthy et al [26] mentionedthat joint analyses ldquoblur the boundaries of where exactlyreplication starts but whichever analytical approach is takenconfirmation in many independent samples is important andit is the overall strength of the evidence of association thatmattersrdquo Purpose of the current study was to present howthe overall strength of the evidence of association can beevaluated when robust tests are used in GWAS replicationstudies

We illustrated how the proposed methods can be appliedin the real data that studied the association of SNPs with non-small-cell lung cancer (NSCLC) in discovery and replicationstages In the original study reported by Yoon et al [25]SNPs were selected in the discovery data set not based onthe robust tests but based on additive CATT Therefore wefound that some SNPs could have been selected by one

BioMed Research International 9

of the robust methods but they were not included in thereplication data set For these SNPs we were not able toperform the joint analysis that we propose and it was notpossible to examine whether there are other SNPs that couldhave been found to be associated with NSCLC by proposedmethods in the replication study For this reason we merelypresented howmany additional SNPs could have been furtherfollowed in the replication stage when robust methods wereused In many GWASs it is a common practice to reportthe summary test statistics and 119901 values of the SNPs undera specific genetic model usually an additive model whichwere further genotyped in the replication stage and werefinally defined to be significantly associated with a phenotypeof interest As emphasized in this paper one may have abetter chance of findingmanymissing SNPs by applyingmorepowerful and robust methods that consider different geneticmodels simultaneouslyTherefore we urge the community toshare test results under not only an additive model but alsoother genetic models although they were not significant at astringent significance level so that future research may haveenriched data resources to which robust tests can be appliedin association studies

Appendix

119901 value of Fisherrsquos Combination forEqual and Unequal Weights

Equal Weights Assume 1199081

= 1199082

= 1 Under the nullhypothesis of no association 119883

1= minus2 log119875(1) and 119883

2=

minus2 log119875(2) are independent and each asymptotically followsa 1205942 distribution with 2 degrees of freedom Let 119891

119896(119909) and

119865119896(119909) be the probability and cumulative density functions of

1205942 random variable with 119896 degrees of freedomThen 119891

2(119909) =

exp(minus1199092)2 1198914(119909) = 119909 exp(minus1199092)4 119865

2(119909) = 1 minus exp(minus1199092)

and1198654(119909) = 1minusexp(minus1199092)minus119909 exp(minus1199092)2 Denote the cutoff

of the discovery stage based on 120572119863as 119862119863 that is 119865

2(119862119863) =

1 minus 120572119863 For observed value 119911FC gt 119862

119863of1198831+ 1198832 the 119901 value

is written as

1198751198670

(1198831gt 119862119863 1198831+ 1198832gt 119911FC)

= 120572119863minus int

119911FC

119862119863

1198912 (119909) 1198652 (119911FC minus 119909) 119889119909

= 120572119863+ exp (minus

119911FC2) minus 120572119863+1

2exp(minus

119911FC2) (119911FC minus 119862119863)

= exp (minus119911FC2) (1 +

119911FC2

+ log120572119863)

(A1)

Unequal WeightsWhen different proportions of samples areused in the discovery and replication stages it may be moreappropriate to assignweights proportional to the sample sizesfor each stage For example when only a small portion isused in the discovery stage to prevent Fisherrsquos combinationtest from being dominated by the significant result in the

discovery stage one may want to assign a small weight to thediscovery stage result

When 120587119904is the proportion of samples used in the dis-

covery stage one selection for weights is 1199081

= 2120587119904and

1199082= 2(1 minus 120587

119904) for discovery and replication stages which

simplifies to equal weights when 120587119904= 05 Based on these

weights we consider unequal-weighted Fisherrsquos combinationasminus2 log119875(1)1199081119875(2)1199082 = 119908

11198831+11990821198832[27] Its density function

is given by

119891119908 (119909) =

1

2 (1199081minus 1199082)exp(minus 119909

(21199081))

minus1

2 (1199081minus 1199082)exp(minus 119909

(21199082))

(A2)

and the probability distribution function is

119865119908 (119909) = 1 minus

1199081

(1199081minus 1199082)exp(minus 119909

21199081

)

minus1199082

(1199081minus 1199082)exp(minus 119909

21199082

) 1199081

= 1199082

(A3)

Using the previous notation we have the following formof 119901 value

1198751198670

(1198831gt 119862119863 11990811198831+ 11990821198832gt 119911FC)

= 120572119863minus int

119911FC1199081

119862119863

1198912 (119909) 1198652 (

119911FC minus 1199081119909

1199082

)119889119909

= exp(minus119911FC21199081

) +1199082

1199081minus 1199082

exp(minus119911FC21199082

)

times exp(1199081 minus 1199082211990811199082

119911FC) minus exp(1199081 minus 1199082211990811199082

1199081119862119863)

=1199081

1199081minus 1199082

exp(minus119911FC21199081

)

minus1199082

1199081minus 1199082

exp(minus119911FC21199082

)120572minus(1199081

minus1199082

)1199082

119863

=1199081

1199081minus 1199082

exp(minus119911FC21199081

)

minus1199082

1199081minus 1199082

exp(minus119911FC21199082

)120572minus(1199081

minus1199082

)1199082

119863

(A4)

where 119911FC1199081 gt 119862119863

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

10 BioMed Research International

Acknowledgments

The authors are indebted to late Dr Gang Zheng for his inspi-ration and support on their workThisworkwas supported bygrant of the National Cancer Center (no NCC-1210060)

References

[1] R J Klein C Zeiss E Y Chew et al ldquoComplement factor Hpolymorphism in age-related macular degenerationrdquo Sciencevol 308 no 5720 pp 385ndash389 2005

[2] R H Duerr K D Taylor S R Brant et al ldquoA genome-wideassociation study identifies IL 23R as an inflammatory boweldisease generdquo Science vol 314 no 5804 pp 1461ndash1463 2006

[3] A Herbert N P Gerry M B McQueen et al ldquoA commongenetic variant is associated with adult and childhood obesityrdquoScience vol 312 no 5771 pp 279ndash283 2006

[4] R Sladek G Rocheleau J Rung et al ldquoA genome-wideassociation study identifies novel risk loci for type 2 diabetesrdquoNature vol 445 no 7130 pp 881ndash885 2007

[5] P R Burton D G Clayton L R Cardon et al ldquoGenome-wideassociation study of 14000 cases of seven common diseases and3000 shared controlsrdquo Nature vol 447 no 7145 pp 661ndash6782007

[6] S H Kwak S H Kim Y M Cho et al ldquoA genome-wideassociation study of gestational diabetes mellitus in Koreanwomenrdquo Diabetes vol 61 no 2 pp 531ndash541 2012

[7] W G Cochran ldquoSomemethods for strengthening the common1205942 testsrdquo Biometrics vol 10 no 4 pp 417ndash451 1954

[8] P Armitage ldquoTests for linear trends in proportions and frequen-ciesrdquo Biometrics vol 11 no 3 pp 375ndash386 1955

[9] P D Sasieni ldquoFrom genotypes to genes doubling the samplesizerdquo Biometrics vol 53 no 4 pp 1253ndash1261 1997

[10] B Freidlin G Zheng Z Li and J L Gastwirth ldquoTrend tests forcase-control studies of genetic markers power sample size androbustnessrdquo Human Heredity vol 53 no 3 pp 146ndash152 2002

[11] J L Gastwirth ldquoOn robust proceduresrdquo Journal of the AmericanStatistical Association vol 61 no 316 pp 929ndash948 1966

[12] J L Gastwirth ldquoThe use of maximin efficiency robust tests incombining contingency tables and survival analysisrdquo Journal ofthe American Statistical Association vol 80 no 390 pp 380ndash384 1985

[13] G Zheng and H K T Ng ldquoGenetic model selection in two-phase analysis for case-control association studiesrdquo Biostatisticsvol 9 no 3 pp 391ndash399 2008

[14] J Joo M Kwak and G Zheng ldquoImproving power for testinggenetic association in case-control studies by reducing thealternative spacerdquo Biometrics vol 66 no 1 pp 266ndash276 2010

[15] J Joo M Kwak K Ahn and G Zheng ldquoA robust genome-widescan statistic of theWellcome Trust Case-Control ConsortiumrdquoBiometrics vol 65 no 4 pp 1115ndash1122 2009

[16] P Kraft E Zeggini and J P A Ioannidis ldquoReplication ingenome-wide association studiesrdquo Statistical Science vol 24 no4 pp 561ndash573 2009

[17] D CThomas G Casey D V Conti R W Haile J P Lewingerand D O Stram ldquoMethodological issues in multistage genome-wide association studiesrdquo Statistical Science vol 24 no 4 pp414ndash429 2009

[18] A D Skol L J Scott G R Abecasis and M Boehnke ldquoJointanalysis is more efficient than replication-based analysis for

two-stage genome-wide association studiesrdquo Nature Geneticsvol 38 no 2 pp 209ndash213 2006

[19] A D Skol L J Scott G R Abecasis andM Boehnke ldquoOptimaldesigns for two-stage genome-wide association studiesrdquoGeneticEpidemiology vol 31 no 7 pp 776ndash788 2007

[20] G Zheng B Freidlin Z Li and J L Gastwirth ldquoChoice ofscores in trend tests for case-control studies of candidate-geneassociationsrdquo Biometrical Journal vol 45 no 3 pp 335ndash3482003

[21] K Song and R C Elston ldquoA powerful method of combiningmeasures of association and Hardy-Weinberg Disequilibriumfor fine-mapping in case-control studiesrdquo Statistics in Medicinevol 25 no 1 pp 105ndash126 2006

[22] S Won N Morris Q Lu and R C Elston ldquoChoosing anoptimal method to combine P-valuesrdquo Statistics in Medicinevol 28 no 11 pp 1537ndash1553 2009

[23] F BegumDGhoshGC Tseng andE Feingold ldquoComprehen-sive literature review and statistical considerations for GWASmeta-analysisrdquo Nucleic Acids Research vol 40 no 9 pp 3777ndash3784 2012

[24] B Efron and R J Tibshirani An Introduction to the BootstrapChapman amp HallCRC Press 1993

[25] K A Yoon J H Park J Han et al ldquoA genome-wide associationstudy reveals susceptibility variants for non-small cell lungcancer in the Korean populationrdquo Human Molecular Geneticsvol 19 no 24 pp 4948ndash4954 2010

[26] M I McCarthy G R Abecasis L R Cardon et al ldquoGenome-wide association studies for complex traits consensus uncer-tainty and challengesrdquoNature Reviews Genetics vol 9 no 5 pp356ndash369 2008

[27] I J Good ldquoOn the weighted combination of significance testsrdquoJournal of the Royal Statistical Society Series B vol 17 pp 264ndash265 1955

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Anatomy Research International

PeptidesInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

International Journal of

Volume 2014

Zoology

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Molecular Biology International

GenomicsInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioinformaticsAdvances in

Marine BiologyJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Signal TransductionJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioMed Research International

Evolutionary BiologyInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Biochemistry Research International

ArchaeaHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Genetics Research International

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Advances in

Virolog y

Hindawi Publishing Corporationhttpwwwhindawicom

Nucleic AcidsJournal of

Volume 2014

Stem CellsInternational

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Enzyme Research

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Microbiology

Page 2: Research Article Robust Association Tests for the Replication of …downloads.hindawi.com/journals/bmri/2015/461593.pdf · 2019. 7. 31. · Research Article Robust Association Tests

2 BioMed Research International

In this paper we illustrate how these robust tests canbe applied to a replication study of GWAS and how overallstatistical significance can be evaluated using the combinedtest formed by 119901 values of the discovery and replicationstudiesThe importance of replication or validation in GWAShas been well recognized [16 17] and joint analysis in atwo-stage design of GWAS has been proved to be morepowerful than replication-based analysis and has been widelyconducted in GWAS with a variety of phenotypes of interest[18 19]

The paper is organized as follows We first describethe data structures and notation and review existing robustassociation tests for a single data set Then we describe howto obtain the 119901 value for the replication data set given thesignificant result of the discovery stage using robust tests Inthe next section a combined test of the119901 values of the discov-ery and replication data sets is proposed together with theway to evaluate the statistical significance for the combinedtest Simulation studies are conducted to compare the typeI error rates and powers of various analytical strategies Forillustration purposes the summarized methods are appliedto a non-small-cell lung cancer data set and at the end thereis a discussion

2 Methods

21 Data and Notation For a marker with two alleles 119860 and119861 let the frequencies of 119861 in cases and controls be 119901 = 119875(119861 |

case) and 119902 = 119875(119861 | control) Denote three genotypes by1198660= 119860119860119866

1= 119860119861 and119866

2= 119861119861 In case-control association

studies 119903 cases and 119904 controls are independently sampledfrom each population The observed genotype counts for(1198660 1198661 1198662) are (119903

0 1199031 1199032) in the cases and (119904

0 1199041 1199042) in the

controls Disease prevalence is denoted by 119896 = 119875(disease)and penetrance by 119891

119894= 119875(disease | 119866

119894) for 119894 = 0 1 2 Two

genotype relative risks (GRRs) are denoted by 1205821= 11989111198910

and 1205822= 11989121198910using 119891

0gt 0 as baseline penetrance Under

the null hypothesis of no association 1198670 1198910= 1198911= 1198912= 119896

or alternatively1198670 1205822= 1205821= 1 Genetic model is recessive

(REC) additive (ADD)multiplicative (MUL) and dominant(DOM)when 120582

1= 1 120582

1= (1+120582

2)2 1205821= 12058212

2 and 120582

2= 1205821

respectively

22 Review of Association Tests for a Single Data Set Theassociation in case-control studies can be tested using variousmethods which have been extensively studied The generalassociation between the disease status and the SNP canbe tested using Pearsonrsquos Chi-square test which has anasymptotic Chi-square distributionwith 2 degrees of freedomunder119867

0 The test is given by

119879chi2 =2

sum

119895=0

(119903119895minus 119899119895119903119899)2

119899119895119903119899

+

2

sum

119895=0

(119904119895minus 119899119895119904119899)2

119899119895119904119899

(1)

where 119899119894= 119903119894+ 119904119894for 119894 = 0 1 2 and 119899 = 119903 + 119904 Under Hardy-

Weinberg equilibrium (HWE) an allele-based test (ABT) and

CATT with scores (0 119909 1) for (1198660 1198661 1198662) where 0 le 119909 le 1

are given by

119885ABT =11989912

2119903 (21199040+ 1199041) minus 2119904 (2119903

0+ 1199031)

2119903119904 (21198990+ 1198991) (1198991+ 21198992)12

119885119909=

11989912

sum2

119894=0119909119894(119904119903119894minus 119903119904119894)

[119903119904119899 119899sum2

119894=01199092119894119899119894minus (sum2

119894=0119909119894119899119894)2

]12

(2)

where (1199090 1199091 1199092) = (0 119909 1) [9] The optimal choices of 119909

for the recessive (REC) additivemultiplicative (ADDMUL)and dominant (DOM) models are 119909 = 0 12 and 1respectively [9 20] Both 119885

119909and 119885ABT asymptotically follow

a standard normal distribution under 1198670 119885119909can be used

even when HWE does not hold However without theHWE assumption 119885ABT does not follow a standard normaldistribution due to the correlation between two alleles

A robust test MAX3 proposed by Friedlin et al [10] canbe obtained by taking the maximum of three CATTs underthe three genetic models as MAX3 = max(|119885

0| |11988512| |1198851|)

Parametric bootstrap or permutationmethods can be used tofind the 119901 value of MAX3 [4]

Let the 119901 values of Pearsonrsquos Chi-square test and CATTunder the additive genetic model 119885

12be 119875chi2 and 119875

12

respectively WTCCC [5] proposed an alternative robust testMIN2 = min(119875chi2 11987512) Joo et al [15] derived the asymptoticnull distribution of MIN2 and using their result the 119901 valueof MIN2 can be obtained as

119875MIN2

=1

2exp minus1

2119867minus1

1(1 minusMIN2) + 1

2MIN2

minus1

2120587int

minus2 log(MIN2)

119867minus1

1

(1minusMIN2)119890minusV2 arcsin(

2119867minus1

1(1 minusMIN2)V minus 1

)119889V

(3)

where 1198671and 119867

2are the cumulative distributions of Chi-

square distributions with 1 and 2 degrees of freedomOn the other hand Song and Elston [21] considered a

Hardy-Weinberg disequilibrium trend test (HWDTT) givenby

119885119867=

(119903119904119899)12

(Δ119901minus Δ119902)

1 minus 1198992119899 minus 119899

1 (2119899) 119899

2119899 + 119899

1 (2119899)

(4)

where Δ119901= 1199012minus (1199012+ 11990112)2 and Δ

119902= 1199022minus (1199022+ 11990212)2

are the estimates of Δ119901and Δ

119902 where 119901

119894= 119903119894119903 and 119902

119894= 119904119894119904

Here Δ denotes the Hardy-Weinberg disequilibrium (HWD)coefficient defined by 119875119903(119861119861) minus 119875119903(119860119861)2 + 119875119903(119861119861)

2 andΔ119901andΔ

119902denote the HWD coefficient in cases and controls

respectivelyZheng and Ng [13] used the information contained in the

signs of (Δ119901 Δ119902) to determine the genetic models in their

two-phase method Their two-phase statistic 119885GMS is givenby 119885GMS = 119885

0if 119885119867

gt 119888 1198851if 119885119867

lt minus119888 and 11988512

other-wise where 119888 = Φ

minus1(1 minus 120572

119867) for 120572

119867= 005 The asymptotic

BioMed Research International 3

correlations between 119885119867and three CATTs under HWE were

derived and the significance level was adjusted accordingly tocontrol the desired type I error Based on the observation thatthis method assumes 119861 is the risk allele Joo et al [14] studiedthe behavior of 119885GMS when either one of the alleles can be arisk alleleThey chose the risk allele based on the sign of119885

12

that is if 11988512

gt 0 119861 is the risk allele and 1198850 11988512

and 1198851

are chosen for REC ADD and DOM models respectivelyIf 11988512

lt 0 the respective test statistics are chosen tobe minus119885

1 minus11988512

and minus1198850 They incorporate this property in

defining the test statistic for genetic model selection (119885GMS)and calculating the 119901 value Let Θ

0(119911) = 119911 119911 gt 119888

Θ12(119911) = 119911 |119911| lt 119888 and Θ

1(119911) = 119911 119911 lt minus119888 Then

the 119901 value of this method can be obtained by

119901GMS

= 2

1

sum

119909=0

intΘ119909

(119911119867

)

int

infin

0

int

infin

119905

120601119909(119911119909 11991112 119911119867) 11988911991111990911988911991112119889119911119867

+ 2

intΘ12

(119911)

Φ((minus119905 and 0) + 120588

12119911

(1 minus 120588212)12

)119889Φ (119911)

(5)

where 120588119909= Corr(119885

119909 119885119867) in (5) and 120588

11990912= Corr(119885

119909 11988512)

(119909 = 0 1) are replaced by their consistent estimates Here119905 = 119911GMS and (minus119905 and 0) = min(minus119905 0) Moreover 119911GMS and 11991112are the observed values of 119885GMS and 11988512 respectively

While studying the properties of GMS Joo et al [14]noticed that the probability of selecting the true recessive ordominant models using 119885

119867is very low especially for low

to moderate GRRs but the unlikely genetic model can besuccessfully excluded This led to genetic model exclusionmethod119885GME which is the same as the119885GMS described aboveexcept 119885

119909for 119909 = 0 12 1 is replaced by 119885lowast

119909where 119885lowast

119909=

(119885119909+11988512)2(1 +120588

11990912)12 And the 119901 value of GME can be

obtained as

119901GME

= 2

1

sum

119909=0

intΘ119909

(119911119867

)

int

infin

0

int

infin

119871

120601119909(119911119909 11991112 119911119867) 11988911991111990911988911991112119889119911119867

+ 2

intΘ12

(119911)

Φ((minus119905 and 0) + 120588

12119911

(1 minus 120588212)12

)119889Φ (119911)

(6)

where 119871 = 1199052(1 + 12058811990912

)12

minus 11991112

for 119905 = 119911GME

23 119901 Value of Replication Data Using the Robust Method Inthe discovery stage the 119901 value of robust association testsincluding MAX3 MIN2 119885GMS and 119885GME can be obtainedas described in Section 22 For the 119901 value of replication datausing the robust method we use the same analytic methodthat was used for discovery and the risk allele identified byit [16] This means that when the best test statistic or geneticmodel is selected in the discovery stage the replication stage

will adopt the discovery stage selection and the direction ofassociation

Suppose that for simplicity of notation our interest isin GWAS with two stages one for discovery and the otherfor replication although the methodology described belowcan be extended to multistages for replication Let 119885(119894)

119909for

119909 = 0 12 1 be the CATT optimal for recessive additiveand dominant models and let 119875(119894)

119909be corresponding 119901 value

for 119894th stage (119894 = 1 for discovery and 119894 = 2 for replicationstages) Also denote 119885(119894)lowast

119909= (119885(119894)

119909+ 119885(119894)

12)2(1 + 120588

(119894)

11990912)12

for 119909 = 0 12 1 Then for CATT with a preselected geneticmodel 119875(2)

119909= 1minusΦ(sign(119885(1)

119909sdot119885(2)

119909) sdot |119885(2)

119909|) using a one-sided

119901 value given the direction of association from the discoverystage and119875(2)lowast

119909= 1minusΦ(sign(119885(1)lowast

119909sdot119885(2)lowast

119909) sdot |119885(2)lowast

119909|) Moreover

denote the test statistics and 119901 values using Pearsonrsquos Chi-square test from the 119894th stage as 119879(119894)chi2 and 119875

(119894)

chi2 Further letHWDTT from the 119894th stage be 119885(119894)

119867 Then the second stage

119901 values using MAX3 MIN2 119885GMS and 119885GME denoted as119875(2)

MAX3 119875(2)

MIN2 119875(2)

GMS and 119875(2)

GME can be obtained as follows

119875(2)

MAX3 = 119875(2)

119909lowast

where 119909lowast = arg min119909isin0121

119875(1)

119909

119875(2)

MIN2 = 119875(2)

12sdot 119868 (119875(1)

12le 119875(2)

chi2) + 119875(2)

chi2 sdot 119868 (119875(1)

12gt 119875(2)

chi2)

119875(2)

GMS = 119875(2)

0119868 (119885(1)

119867gt 119888) + 119875

(2)

1119868 (119885(1)

119867lt minus119888)

+ 119875(2)

12119868 (10038161003816100381610038161003816119885(1)

119867

10038161003816100381610038161003816le 119888) if 119885(1)

12gt 0

= 119875(2)

0119868 (119885(1)

119867lt minus119888) + 119875

(2)

1119868 (119885(1)

119867gt 119888)

+ 119875(2)

12119868 (10038161003816100381610038161003816119885(1)

119867

10038161003816100381610038161003816le 119888) if 119885(1)

12le 0

119875(2)

GME = 119875(2)lowast

0119868 (119885(1)

119867gt 119888) + 119875

(2)lowast

1119868 (119885(1)

119867lt minus119888)

+ 119875(2)lowast

12119868 (10038161003816100381610038161003816119885(1)

119867

10038161003816100381610038161003816le 119888) if 119885(1)

12gt 0

= 119875(2)lowast

0119868 (119885(1)

119867lt minus119888) + 119875

(2)lowast

1119868 (119885(1)

119867gt 119888)

+ 119875(2)lowast

12119868 (10038161003816100381610038161003816119885(1)

119867

10038161003816100381610038161003816le 119888) if 119885(1)

12le 0

(7)

It is important to note that even though the direction ofthe test statistics and the selected genetic models are usedto obtain the second stage 119901 values the 119901 values from thetwo stages are independent under the null hypothesis Thisis because under the null hypothesis the probability of 119885

12

being positive or negative is simply 12 and the probabilityof the selection of a certain genetic model is also a constant(120572119867for the recessive and dominant models and 1 minus 2120572

119867for

the additive model)

24 Combined Test Using 119901 Values and Its Statistical Signif-icance For a given robust test we can consider the jointanalysis by combining 119901 values from the discovery andreplication stages of GWAS We consider using 119901 valuesrather than the test statistics because test statistics can have

4 BioMed Research International

Table 1 Type I error rates of three approachesmdashreplication-based (REP) test Fisherrsquos combination (119885FC) and linear combination of test(119885LC)mdashbased on the CATT with an additive model (119885

12) 1205942 MAX3 MIN2 GMS and GME The disease prevalence 119870 = 01 119872 = 10

markers 119903 = 1 500 cases and 119904 = 1 500 controls are considered based on 20000 simulations

MAF 120587119904

120572119863

119865 = 0

11988512

1205942 MAX3 MIN2 GMS GME

03 05

005REP 000530 000455 000505 000485 00050 000490119885FC 000500 000535 000495 000510 000515 000460119885LC 000535 000525 000485 000510 00050 000485

01REP 000510 000560 000565 000485 000525 000545119885FC 000565 000535 000565 000545 000565 000540119885LC 000520 000565 000525 000520 000530 000525

03 06

005REP 000510 000485 000480 000515 000480 000500119885FC 000445 000455 000450 000455 000450 000460LC 000500 000515 000495 000520 000475 000480

01REP 000500 000485 000485 000535 000530 000505119885FC 000465 000490 000485 000500 000455 000460119885LC 000480 000470 000515 000490 000485 000475

04 05

005REP 000590 000505 000530 000565 000505 000510119885FC 000575 000430 000460 000535 000460 000500119885LC 000600 000445 000500 000540 000490 000490

01REP 000525 000470 000535 000450 000480 000515119885FC 000515 000510 000495 000475 000540 000500119885LC 000530 000500 000485 000475 000495 000510

04 06

005REP 000475 000585 000480 000500 000515 000495119885FC 000460 000470 000420 000490 000455 000440119885LC 000525 000550 000520 000580 000510 000510

01REP 000550 000490 000515 000535 000555 000540119885FC 000520 000370 000495 000450 000515 000510119885LC 000565 000485 000570 000530 000610 000580

complex forms and obtaining the distribution of the joint testcan be difficult On the other hand calculating a 119901 value foreach data set might be relatively simple and the distributionof 119901 values under the null hypothesis of no association is easyto handle

There are several methods for combining test statisticsfrom two stages [22] and twomost commonly used forms arebased on Fisherrsquos combination and a linear combination afterinverse normal transformation [23] Fisherrsquos combination(FC) directly sums 119901 values after minus2 log transformation thatis 119885FC = minus2119908

1log(119875(1)) minus 2119908

2log(119875(2)) where 119875(119894) is 119901 value

from 119894 = 0 for discovery and 119894 = 1 for replication stagesusing a given robust test A specification of 119908

1= 1199082= 1

gives the same weight for discovery and replication stagesand one can consider 119908

1= 2120587

119904and 119908

2= 2(1 minus 120587

119904)

where 120587119904= 119873119863(119873119863+ 119873119877) and 119873

119863and 119873

119877are sample

sizes of the discovery and replication data sets A linearcombination of two 119875 values after taking the inverse of thestandard normal cumulative distribution is given by 119885LC =

1199081Φminus1(1minus119875(1)2)+119908

2Φminus1(1minus119875(2))radic1199082

1+ 11990822with a natural

choice of 1199081= radic120587119904 and 1199082 = radic1 minus 120587

119904 Let the significance

level of the discovery stage be 120572119863 which means that markers

with 119875(1) lt 120572119863are selected and replicated in the replication

stage The 119901 value of combined test can then be obtained as

119901FC = 1198751198670

(119875(1)

lt 120572119863 119885FC gt 119911FC)where the observed value of

119885FC is 119911FCThe119901FC are calculated as 119890minus119911FC2(1+119911FC2+log120572119863)

for equal weights where 119911FC gt minus2 log120572119863and (119908

1(1199081minus

1199082))119890minus119911FC21199081minus(119908

2(1199081minus1199082))119890minus119911FC21199082120572

minus(1199081

minus1199082

)1199082

119889for unequal

weights where 119911FC gt minus21199081log120572119863 Detailed derivations are

described in the Appendix Equivalently for an overall typeI error threshold for a single marker of 120572 one may obtainthe threshold 119862FC of 119885FC that satisfies 119875

1198670

(119875(1)

lt 120572119863 119885FC gt

119862FC) le 120572 Similarly for the 119885LC the 119901 value is calculatedas 119901LC = 119875

1198670

(119875(1)

lt 120572119863 119885LC gt 119911LC) = int

infin

1199111minus120572

119863

2

120601(119911)[1 minus

Φ((radic11990821+ 11990822119911LC minus 1199081119911)1199082)]119889119911 for 119911LC gt 119911

1minus120572119863

2where the

observed value of 119885LC = 119911LC

3 Simulation Results

31 Type I Error Table 1 provides the type I errors underdifferent scenarios A disease prevalence of 10 is assumedand a total of 1500 cases and 1500 controls were divided intotwo stages The proportions of samples in the first stage (120587

119904)

of 05 and 06 were considered for the minor allele frequency(MAF) of 03 and 04 We considered 119872 = 10 markers tocontrol the genome-wide false positive rate at 120572 = 005 withthe Bonferroni correction We did not consider a larger 119872

BioMed Research International 5

such as 300000 or 500000 because this will require morethan 10 million simulations to show a stable estimate of thetype I error rate With 119872 = 10 we performed 20000simulations which result in less than 10 of a coefficientof variation for a significance level 005119872 = 0005 foreach marker [24] The test statistics considered are 119885

12

Pearsonrsquos Chi-square testMIN2MAX3GMS andGME Forthe second stage analysis we considered a replication-basedanalysis 119885FC and 119885LC as proposed above The results arebased on the situation under HWE (HWE coefficient 119865 = 0)As expected all tests control the type I error reasonablly welland similar results were obtained when a slight deviationfrom HWE is present with 119865 = 005 (results not shown)

32 Empirical Power We examined the empirical powers ofdifferent tests considered above In Figure 1 we considered119872 = 10 markers a disease prevalence of 10 the samegenotype relative risk for two stages (119903

1= 14 and 119903

2= 14)

and 1000 cases and 1000 controls 2000 simulations wereperformed under HWE (119865 = 0) to control the genome-wide false positive rate at 120572 = 005 The recessive additiveand dominant models were assumed for the first secondand third rows Both joint analyses showed better powerperformances compared to the replication-based analysis (upto 159 in scenarios considered in Figure 1) and LC and FChave comparable powers with less than 2 difference Thepower gain of using the joint analysis is not as much as thatobserved in Skol et al [18] However as reported by Skolet al [18] when the between-stage heterogeneity exists andthe risk allele has a larger effect in the first stage than thatin the second stage much improved power is observed byusing the joint test Figure 2 shows results under this scenariowith 119903

1= 16 and 119903

2= 14 and the observed increase

in power using the joint test is as high as 339 Againthe difference between LC and FC is minor with less than3 difference As for comparison between different robustmethods MAX3 GMS and GME perform well under therecessive model while 119885

12 1205942 and MIN2 are less powerful

Under the additivemodel11988512

is most powerful as expectedand 1205942 is least powerful Other robust methods perform wellwith a slight decrease in power compared to 119885

12 Under the

dominant model MAX3 GMS and GME perform the besteven though all tests show good power performances and thedifference is minor Similar patterns were observed when aslight deviation from theHWE is present (results not shown)

4 Real Data Application

The GWAS on non-small-cell lung cancer (NSCLC) by Yoonet al [25] studied 621 NSCLC patients and 1541 controlsubjects in the discovery stage After stringent quality controlsteps a total of 246758 SNPs were tested for the associationwith NSCLC based on119885

12 In the replication stage 168 SNPs

with 119901 value less than 1 times 10minus4 in the first stage based on 11988512

were tested using 804 patients and 1470 control samples Weidentified additional 234 SNPs using MIN2 in the first stagewhich could be studied in the replication stage if MIN2 wasused instead of 119885

12since MIN2 produces stronger evidence

for the additional SNPs than 11988512

does The Manhattan plotsof using MIN2 and 119885

12are presented in Figure 3 One

example is 119903119904385272 located in chromosome 2 which hada 119901 value of 137 times 10

minus7 which reached significance level atBonferroni correction in discovery samples alone whereas11988512

yielded a119901 value greater than 1times10minus4 Even though thereis possibility of false positive findings these SNPs could havebeen selected for replication if robust methods were used

Since we do not have replication data for these additionalSNPs selected using MIN2 because the first stage selectionwas based on 119885

12in Yoon et al [25] just for illustration

purpose of the proposed methods we present the results ofthree SNPs including 1199031199042131877 that was reported by Yoonet al [25] When the significance level in the discovery stageis set at 120572

119863= 5 times 10

minus5 so that all these exemplary SNPs canbe selected in the discovery stage the 119901 value of combinedtest based on four robust methods (MAX3 MIN2 GMSand GME) as well as 119885

12and Pearsonrsquos Chi-square test is

presented in Table 2 Fisherrsquos combination was used for thejoint test in the second stage Only 1199031199042131877was found to besignificant with Bonferroni correction (119901 value lt 203times10

minus7)by all except MAX3method

5 Discussion

In genetic association studies efficiency robust tests whoseperformance does not depend on the underlying geneticmodel have been extensively studied and their power benefitover a wide range of geneticmodels has been well recognizedIn this paper we described how the idea of these robustassociation tests can be applied to the replication studies andfurther how overall statistical significance can be evaluatedusing the combined test formed by 119901 values of the discoveryand replication studies

When the robust tests are used the test statistic ofeach stage can have a complex form and thus dealing withthe distribution of the joint test can be difficult whereascalculating the 119901 value of each stage might be relativelysimple Because the asymptotic distribution of the 119901 valueunder the null hypothesis of no association is easy to handlethe combined test using 119901 values rather than the test statisticsthemselves can provide computational convenience

There are several methods for combining test statisticsfrom two stages and Won et al [22] compared the per-formances of various choices Two most commonly usedforms are based on Fisherrsquos combination and the linearcombination after the inverse normal transformation [23]and we presented the test statistics and 119901 values of these twomethods In our limited experience the linear combinationand Fisherrsquos combination are fairly comparable Fisherrsquoscombination seems to perform slightly better than the linearcombination when there exists some heterogeneity betweenstages in terms of the genotype relative risk while the linearcombination seems to perform slightly better in most ofother situations However the difference is extremely minorFurther research is required for the thorough comparison ofvarious methods of combining 119901 values in the application ofefficiency robust tests to the replication of genetic associationstudies

6 BioMed Research International

Recessive

Additive

Dominant

(05 03) (05 04) (06 03) (06 04)

(120587sMAF)

Empi

rical

pow

er

10

08

06

04

02

00

(05 03) (05 04) (06 03) (06 04)

(120587sMAF)Em

piric

al p

ower

10

08

06

04

02

00

(05 03) (05 04) (06 03) (06 04)

(120587sMAF)

Empi

rical

pow

er

10

08

06

04

02

00

(05 03) (05 04) (06 03) (06 04)

(120587sMAF)

Empi

rical

pow

er

10

08

06

04

02

00

(05 03) (05 04) (06 03) (06 04)

(120587sMAF)

Empi

rical

pow

er

10

08

06

04

02

00

(05 03) (05 04) (06 03) (06 04)

(120587sMAF)

Empi

rical

pow

er

10

08

06

04

02

00

(05 03) (05 04) (06 03) (06 04)

(120587sMAF)

Empi

rical

pow

er

10

08

06

04

02

00

(05 03) (05 04) (06 03) (06 04)

(120587sMAF)

Empi

rical

pow

er

10

08

06

04

02

00

(05 03) (05 04) (06 03) (06 04)

(120587sMAF)

Empi

rical

pow

er

10

08

06

04

02

00

Z05

1205942

MAX3

ZFC

ZFC

ZFC

ZLC

ZLC

ZLC

MIN2GMSGME

MIN2GMSGME

MIN2GMSGME

Z05

1205942

MAX3

Z05

1205942

MAX3

Replication-based test

Replication-based test

Replication-based test

Figure 1 Empirical powers based on 2000 simulations for119872 = 10markers genotype relative risks of both stages = 14 and disease prevalence119870 = 01 under the recessive additive and dominant models 1000 cases and 1000 controls are considered to control 120572 = 005 The first stagetype I error rate for discovery is 120572

119863= 005 Six test statistics 119885

12 1205942 MAX3 MIN2 GMS and GME are considered The first second and

third columns depict powers using the replication-based test 119885FC and 119885LC respectively

BioMed Research International 7

Replication-based test

Recessive

Additive

Dominant

(05 03) (05 04) (06 03) (06 04)

(120587sMAF)

Empi

rical

pow

er

10

08

06

04

02

00

(05 03) (05 04) (06 03) (06 04)

(120587sMAF)Em

piric

al p

ower

10

08

06

04

02

00

(05 03) (05 04) (06 03) (06 04)

(120587sMAF)

Empi

rical

pow

er

10

08

06

04

02

00

(05 03) (05 04) (06 03) (06 04)

(120587sMAF)

Empi

rical

pow

er

10

08

06

04

02

00

(05 03) (05 04) (06 03) (06 04)

(120587sMAF)

Empi

rical

pow

er

10

08

06

04

02

00

(05 03) (05 04) (06 03) (06 04)

(120587sMAF)

Empi

rical

pow

er

10

08

06

04

02

00

(05 03) (05 04) (06 03) (06 04)

(120587sMAF)

Empi

rical

pow

er

10

08

06

04

02

00

(05 03) (05 04) (06 03) (06 04)

(120587sMAF)

Empi

rical

pow

er

10

08

06

04

02

00

(05 03) (05 04) (06 03) (06 04)

(120587sMAF)

Empi

rical

pow

er

10

08

06

04

02

00

Z05

1205942

MAX3

ZFC

ZFC

ZFC

ZLC

ZLC

ZLC

MIN2GMSGME

MIN2GMSGME

MIN2GMSGME

Z05

1205942

MAX3

Z05

1205942

MAX3

Replication-based test

Replication-based test

Figure 2 Empirical powers based on 2000 simulations for 119872 = 10 markers genotype relative risks of two stages are different (1199031= 16

1199032= 14) disease prevalence 119870 = 01 under the recessive additive and dominant models 1000 cases and 1000 controls are considered to

control 120572 = 005 The first stage type I error rate for discovery is 120572119863= 005 Six test statistics 119885

12 1205942 MAX3 MIN2 GMS and GME are

considered The first second and third columns depict powers using the replication-based test 119885FC and 119885LC respectively

8 BioMed Research International

minuslo

g 10(120588)

Chromosome1 2 3 4 5 6 7 8 9 11 13 15 17 19 21 23

8

6

4

2

0

MIN2 P value

(a)

minuslo

g 10(120588)

Additive P value8

6

4

2

0

Chromosome1 2 3 4 5 6 7 8 9 11 13 15 17 19 21 23

(b)

Figure 3 Manhattan plots of 246758 SNPs from Yoon et al [25] based on MIN2 (a) and 11988512

(b) The 119909 axis is chromosomal location andthe 119910 axis is the significance (minuslog

10119875) of association The horizontal line corresponds to the significance level 10minus4

Table 2 For selected exemplary three SNPs for testing association with NSCLC 119901 value of combined test using additive CATT (11988512)

Pearsonrsquos Chi-square test (119879chi2) MAX3 MIN2 119885GMS and 119885GME

SNP 119901 value of 11988512

119901 value of 119879chi2Discovery Replication Combined test Discovery Replication Combined test

rs2131877 788 times 10minus5

104 times 10minus4

797 times 10minus8

140 times 10minus4

149 times 10minus4

184 times 10minus7

rs905551 183 times 10minus5

702 times 10minus3

770 times 10minus6

806 times 10minus5

489 times 10minus2

140 times 10minus5

rs1695109 248 times 10minus4

346 times 10minus2

217 times 10minus6

456 times 10minus5

153 times 10minus1

207 times 10minus5

SNP 119901 value of MAX3 119901 value of MIN2Discovery Replication Combined test Discovery Replication Combined test

rs2131877 153 times 10minus4

405 times 10minus2

192 times 10minus5

132 times 10minus4

104 times 10minus4

126 times 10minus7

rs905551 450 times 10minus5

702 times 10minus3

192 times 10minus6

134 times 10minus4

489 times 10minus2

199 times 10minus5

rs1695109 354 times 10minus5

263 times 10minus2

464 times 10minus6

236 times 10minus5

263 times 10minus2

335 times 10minus6

SNP 119901 value of 119885GMS 119901 value of 119885GME

Discovery Replication Combined test Discovery Replication Combined testrs2131877 186 times 10

minus4104 times 10

minus4171 times 10

minus7103 times 10

minus4104 times 10

minus4102 times 10

minus7

rs905551 519 times 10minus5

702 times 10minus3

216 times 10minus6

735 times 10minus5

801 times 10minus3

320 times 10minus6

rs1695109 689 times 10minus4

127 times 10minus1

385 times 10minus5

269 times 10minus5

419 times 10minus2

540 times 10minus6

In a genetic study where the purpose of considering areplication stage is to validate or replicate the genetic findingsfrom the discovery stage which is the case considered inthis paper the analysis in the replication stage utilized thetest statistic or genetic model that is selected as being thebest in the discovery stage and also the direction of the riskallele following guidelines for exact replication in geneticassociation studies If the purpose is to simply combine theevidence fromdifferent data sources such as inmeta-analysisother strategies may be devised Further research again isrequired to provide fully detailed properties of suchmethods

Power gain of a joint analysis over the conventionalreplication-based analysis was thoroughly studied by Skol etal [18 19] In our simulation the amount of power increaseusing a joint test compared to the replication-based analysiswas much minor than what was observed by Skol et al[18 19] The exact reason is not known but we suspect thismight be due to the power advantages of robust methodsand also due to the fact that the optimal choice from thefirst stage is used when calculating the second stage 119901 values

However even though it was minor in some situationsthe joint anlysis presented better power performance thanthe replication-based analysis in our study This type ofjoint analysis raised concerns about the exact meaning ofreplication [17] However McCarthy et al [26] mentionedthat joint analyses ldquoblur the boundaries of where exactlyreplication starts but whichever analytical approach is takenconfirmation in many independent samples is important andit is the overall strength of the evidence of association thatmattersrdquo Purpose of the current study was to present howthe overall strength of the evidence of association can beevaluated when robust tests are used in GWAS replicationstudies

We illustrated how the proposed methods can be appliedin the real data that studied the association of SNPs with non-small-cell lung cancer (NSCLC) in discovery and replicationstages In the original study reported by Yoon et al [25]SNPs were selected in the discovery data set not based onthe robust tests but based on additive CATT Therefore wefound that some SNPs could have been selected by one

BioMed Research International 9

of the robust methods but they were not included in thereplication data set For these SNPs we were not able toperform the joint analysis that we propose and it was notpossible to examine whether there are other SNPs that couldhave been found to be associated with NSCLC by proposedmethods in the replication study For this reason we merelypresented howmany additional SNPs could have been furtherfollowed in the replication stage when robust methods wereused In many GWASs it is a common practice to reportthe summary test statistics and 119901 values of the SNPs undera specific genetic model usually an additive model whichwere further genotyped in the replication stage and werefinally defined to be significantly associated with a phenotypeof interest As emphasized in this paper one may have abetter chance of findingmanymissing SNPs by applyingmorepowerful and robust methods that consider different geneticmodels simultaneouslyTherefore we urge the community toshare test results under not only an additive model but alsoother genetic models although they were not significant at astringent significance level so that future research may haveenriched data resources to which robust tests can be appliedin association studies

Appendix

119901 value of Fisherrsquos Combination forEqual and Unequal Weights

Equal Weights Assume 1199081

= 1199082

= 1 Under the nullhypothesis of no association 119883

1= minus2 log119875(1) and 119883

2=

minus2 log119875(2) are independent and each asymptotically followsa 1205942 distribution with 2 degrees of freedom Let 119891

119896(119909) and

119865119896(119909) be the probability and cumulative density functions of

1205942 random variable with 119896 degrees of freedomThen 119891

2(119909) =

exp(minus1199092)2 1198914(119909) = 119909 exp(minus1199092)4 119865

2(119909) = 1 minus exp(minus1199092)

and1198654(119909) = 1minusexp(minus1199092)minus119909 exp(minus1199092)2 Denote the cutoff

of the discovery stage based on 120572119863as 119862119863 that is 119865

2(119862119863) =

1 minus 120572119863 For observed value 119911FC gt 119862

119863of1198831+ 1198832 the 119901 value

is written as

1198751198670

(1198831gt 119862119863 1198831+ 1198832gt 119911FC)

= 120572119863minus int

119911FC

119862119863

1198912 (119909) 1198652 (119911FC minus 119909) 119889119909

= 120572119863+ exp (minus

119911FC2) minus 120572119863+1

2exp(minus

119911FC2) (119911FC minus 119862119863)

= exp (minus119911FC2) (1 +

119911FC2

+ log120572119863)

(A1)

Unequal WeightsWhen different proportions of samples areused in the discovery and replication stages it may be moreappropriate to assignweights proportional to the sample sizesfor each stage For example when only a small portion isused in the discovery stage to prevent Fisherrsquos combinationtest from being dominated by the significant result in the

discovery stage one may want to assign a small weight to thediscovery stage result

When 120587119904is the proportion of samples used in the dis-

covery stage one selection for weights is 1199081

= 2120587119904and

1199082= 2(1 minus 120587

119904) for discovery and replication stages which

simplifies to equal weights when 120587119904= 05 Based on these

weights we consider unequal-weighted Fisherrsquos combinationasminus2 log119875(1)1199081119875(2)1199082 = 119908

11198831+11990821198832[27] Its density function

is given by

119891119908 (119909) =

1

2 (1199081minus 1199082)exp(minus 119909

(21199081))

minus1

2 (1199081minus 1199082)exp(minus 119909

(21199082))

(A2)

and the probability distribution function is

119865119908 (119909) = 1 minus

1199081

(1199081minus 1199082)exp(minus 119909

21199081

)

minus1199082

(1199081minus 1199082)exp(minus 119909

21199082

) 1199081

= 1199082

(A3)

Using the previous notation we have the following formof 119901 value

1198751198670

(1198831gt 119862119863 11990811198831+ 11990821198832gt 119911FC)

= 120572119863minus int

119911FC1199081

119862119863

1198912 (119909) 1198652 (

119911FC minus 1199081119909

1199082

)119889119909

= exp(minus119911FC21199081

) +1199082

1199081minus 1199082

exp(minus119911FC21199082

)

times exp(1199081 minus 1199082211990811199082

119911FC) minus exp(1199081 minus 1199082211990811199082

1199081119862119863)

=1199081

1199081minus 1199082

exp(minus119911FC21199081

)

minus1199082

1199081minus 1199082

exp(minus119911FC21199082

)120572minus(1199081

minus1199082

)1199082

119863

=1199081

1199081minus 1199082

exp(minus119911FC21199081

)

minus1199082

1199081minus 1199082

exp(minus119911FC21199082

)120572minus(1199081

minus1199082

)1199082

119863

(A4)

where 119911FC1199081 gt 119862119863

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

10 BioMed Research International

Acknowledgments

The authors are indebted to late Dr Gang Zheng for his inspi-ration and support on their workThisworkwas supported bygrant of the National Cancer Center (no NCC-1210060)

References

[1] R J Klein C Zeiss E Y Chew et al ldquoComplement factor Hpolymorphism in age-related macular degenerationrdquo Sciencevol 308 no 5720 pp 385ndash389 2005

[2] R H Duerr K D Taylor S R Brant et al ldquoA genome-wideassociation study identifies IL 23R as an inflammatory boweldisease generdquo Science vol 314 no 5804 pp 1461ndash1463 2006

[3] A Herbert N P Gerry M B McQueen et al ldquoA commongenetic variant is associated with adult and childhood obesityrdquoScience vol 312 no 5771 pp 279ndash283 2006

[4] R Sladek G Rocheleau J Rung et al ldquoA genome-wideassociation study identifies novel risk loci for type 2 diabetesrdquoNature vol 445 no 7130 pp 881ndash885 2007

[5] P R Burton D G Clayton L R Cardon et al ldquoGenome-wideassociation study of 14000 cases of seven common diseases and3000 shared controlsrdquo Nature vol 447 no 7145 pp 661ndash6782007

[6] S H Kwak S H Kim Y M Cho et al ldquoA genome-wideassociation study of gestational diabetes mellitus in Koreanwomenrdquo Diabetes vol 61 no 2 pp 531ndash541 2012

[7] W G Cochran ldquoSomemethods for strengthening the common1205942 testsrdquo Biometrics vol 10 no 4 pp 417ndash451 1954

[8] P Armitage ldquoTests for linear trends in proportions and frequen-ciesrdquo Biometrics vol 11 no 3 pp 375ndash386 1955

[9] P D Sasieni ldquoFrom genotypes to genes doubling the samplesizerdquo Biometrics vol 53 no 4 pp 1253ndash1261 1997

[10] B Freidlin G Zheng Z Li and J L Gastwirth ldquoTrend tests forcase-control studies of genetic markers power sample size androbustnessrdquo Human Heredity vol 53 no 3 pp 146ndash152 2002

[11] J L Gastwirth ldquoOn robust proceduresrdquo Journal of the AmericanStatistical Association vol 61 no 316 pp 929ndash948 1966

[12] J L Gastwirth ldquoThe use of maximin efficiency robust tests incombining contingency tables and survival analysisrdquo Journal ofthe American Statistical Association vol 80 no 390 pp 380ndash384 1985

[13] G Zheng and H K T Ng ldquoGenetic model selection in two-phase analysis for case-control association studiesrdquo Biostatisticsvol 9 no 3 pp 391ndash399 2008

[14] J Joo M Kwak and G Zheng ldquoImproving power for testinggenetic association in case-control studies by reducing thealternative spacerdquo Biometrics vol 66 no 1 pp 266ndash276 2010

[15] J Joo M Kwak K Ahn and G Zheng ldquoA robust genome-widescan statistic of theWellcome Trust Case-Control ConsortiumrdquoBiometrics vol 65 no 4 pp 1115ndash1122 2009

[16] P Kraft E Zeggini and J P A Ioannidis ldquoReplication ingenome-wide association studiesrdquo Statistical Science vol 24 no4 pp 561ndash573 2009

[17] D CThomas G Casey D V Conti R W Haile J P Lewingerand D O Stram ldquoMethodological issues in multistage genome-wide association studiesrdquo Statistical Science vol 24 no 4 pp414ndash429 2009

[18] A D Skol L J Scott G R Abecasis and M Boehnke ldquoJointanalysis is more efficient than replication-based analysis for

two-stage genome-wide association studiesrdquo Nature Geneticsvol 38 no 2 pp 209ndash213 2006

[19] A D Skol L J Scott G R Abecasis andM Boehnke ldquoOptimaldesigns for two-stage genome-wide association studiesrdquoGeneticEpidemiology vol 31 no 7 pp 776ndash788 2007

[20] G Zheng B Freidlin Z Li and J L Gastwirth ldquoChoice ofscores in trend tests for case-control studies of candidate-geneassociationsrdquo Biometrical Journal vol 45 no 3 pp 335ndash3482003

[21] K Song and R C Elston ldquoA powerful method of combiningmeasures of association and Hardy-Weinberg Disequilibriumfor fine-mapping in case-control studiesrdquo Statistics in Medicinevol 25 no 1 pp 105ndash126 2006

[22] S Won N Morris Q Lu and R C Elston ldquoChoosing anoptimal method to combine P-valuesrdquo Statistics in Medicinevol 28 no 11 pp 1537ndash1553 2009

[23] F BegumDGhoshGC Tseng andE Feingold ldquoComprehen-sive literature review and statistical considerations for GWASmeta-analysisrdquo Nucleic Acids Research vol 40 no 9 pp 3777ndash3784 2012

[24] B Efron and R J Tibshirani An Introduction to the BootstrapChapman amp HallCRC Press 1993

[25] K A Yoon J H Park J Han et al ldquoA genome-wide associationstudy reveals susceptibility variants for non-small cell lungcancer in the Korean populationrdquo Human Molecular Geneticsvol 19 no 24 pp 4948ndash4954 2010

[26] M I McCarthy G R Abecasis L R Cardon et al ldquoGenome-wide association studies for complex traits consensus uncer-tainty and challengesrdquoNature Reviews Genetics vol 9 no 5 pp356ndash369 2008

[27] I J Good ldquoOn the weighted combination of significance testsrdquoJournal of the Royal Statistical Society Series B vol 17 pp 264ndash265 1955

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Anatomy Research International

PeptidesInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

International Journal of

Volume 2014

Zoology

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Molecular Biology International

GenomicsInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioinformaticsAdvances in

Marine BiologyJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Signal TransductionJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioMed Research International

Evolutionary BiologyInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Biochemistry Research International

ArchaeaHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Genetics Research International

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Advances in

Virolog y

Hindawi Publishing Corporationhttpwwwhindawicom

Nucleic AcidsJournal of

Volume 2014

Stem CellsInternational

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Enzyme Research

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Microbiology

Page 3: Research Article Robust Association Tests for the Replication of …downloads.hindawi.com/journals/bmri/2015/461593.pdf · 2019. 7. 31. · Research Article Robust Association Tests

BioMed Research International 3

correlations between 119885119867and three CATTs under HWE were

derived and the significance level was adjusted accordingly tocontrol the desired type I error Based on the observation thatthis method assumes 119861 is the risk allele Joo et al [14] studiedthe behavior of 119885GMS when either one of the alleles can be arisk alleleThey chose the risk allele based on the sign of119885

12

that is if 11988512

gt 0 119861 is the risk allele and 1198850 11988512

and 1198851

are chosen for REC ADD and DOM models respectivelyIf 11988512

lt 0 the respective test statistics are chosen tobe minus119885

1 minus11988512

and minus1198850 They incorporate this property in

defining the test statistic for genetic model selection (119885GMS)and calculating the 119901 value Let Θ

0(119911) = 119911 119911 gt 119888

Θ12(119911) = 119911 |119911| lt 119888 and Θ

1(119911) = 119911 119911 lt minus119888 Then

the 119901 value of this method can be obtained by

119901GMS

= 2

1

sum

119909=0

intΘ119909

(119911119867

)

int

infin

0

int

infin

119905

120601119909(119911119909 11991112 119911119867) 11988911991111990911988911991112119889119911119867

+ 2

intΘ12

(119911)

Φ((minus119905 and 0) + 120588

12119911

(1 minus 120588212)12

)119889Φ (119911)

(5)

where 120588119909= Corr(119885

119909 119885119867) in (5) and 120588

11990912= Corr(119885

119909 11988512)

(119909 = 0 1) are replaced by their consistent estimates Here119905 = 119911GMS and (minus119905 and 0) = min(minus119905 0) Moreover 119911GMS and 11991112are the observed values of 119885GMS and 11988512 respectively

While studying the properties of GMS Joo et al [14]noticed that the probability of selecting the true recessive ordominant models using 119885

119867is very low especially for low

to moderate GRRs but the unlikely genetic model can besuccessfully excluded This led to genetic model exclusionmethod119885GME which is the same as the119885GMS described aboveexcept 119885

119909for 119909 = 0 12 1 is replaced by 119885lowast

119909where 119885lowast

119909=

(119885119909+11988512)2(1 +120588

11990912)12 And the 119901 value of GME can be

obtained as

119901GME

= 2

1

sum

119909=0

intΘ119909

(119911119867

)

int

infin

0

int

infin

119871

120601119909(119911119909 11991112 119911119867) 11988911991111990911988911991112119889119911119867

+ 2

intΘ12

(119911)

Φ((minus119905 and 0) + 120588

12119911

(1 minus 120588212)12

)119889Φ (119911)

(6)

where 119871 = 1199052(1 + 12058811990912

)12

minus 11991112

for 119905 = 119911GME

23 119901 Value of Replication Data Using the Robust Method Inthe discovery stage the 119901 value of robust association testsincluding MAX3 MIN2 119885GMS and 119885GME can be obtainedas described in Section 22 For the 119901 value of replication datausing the robust method we use the same analytic methodthat was used for discovery and the risk allele identified byit [16] This means that when the best test statistic or geneticmodel is selected in the discovery stage the replication stage

will adopt the discovery stage selection and the direction ofassociation

Suppose that for simplicity of notation our interest isin GWAS with two stages one for discovery and the otherfor replication although the methodology described belowcan be extended to multistages for replication Let 119885(119894)

119909for

119909 = 0 12 1 be the CATT optimal for recessive additiveand dominant models and let 119875(119894)

119909be corresponding 119901 value

for 119894th stage (119894 = 1 for discovery and 119894 = 2 for replicationstages) Also denote 119885(119894)lowast

119909= (119885(119894)

119909+ 119885(119894)

12)2(1 + 120588

(119894)

11990912)12

for 119909 = 0 12 1 Then for CATT with a preselected geneticmodel 119875(2)

119909= 1minusΦ(sign(119885(1)

119909sdot119885(2)

119909) sdot |119885(2)

119909|) using a one-sided

119901 value given the direction of association from the discoverystage and119875(2)lowast

119909= 1minusΦ(sign(119885(1)lowast

119909sdot119885(2)lowast

119909) sdot |119885(2)lowast

119909|) Moreover

denote the test statistics and 119901 values using Pearsonrsquos Chi-square test from the 119894th stage as 119879(119894)chi2 and 119875

(119894)

chi2 Further letHWDTT from the 119894th stage be 119885(119894)

119867 Then the second stage

119901 values using MAX3 MIN2 119885GMS and 119885GME denoted as119875(2)

MAX3 119875(2)

MIN2 119875(2)

GMS and 119875(2)

GME can be obtained as follows

119875(2)

MAX3 = 119875(2)

119909lowast

where 119909lowast = arg min119909isin0121

119875(1)

119909

119875(2)

MIN2 = 119875(2)

12sdot 119868 (119875(1)

12le 119875(2)

chi2) + 119875(2)

chi2 sdot 119868 (119875(1)

12gt 119875(2)

chi2)

119875(2)

GMS = 119875(2)

0119868 (119885(1)

119867gt 119888) + 119875

(2)

1119868 (119885(1)

119867lt minus119888)

+ 119875(2)

12119868 (10038161003816100381610038161003816119885(1)

119867

10038161003816100381610038161003816le 119888) if 119885(1)

12gt 0

= 119875(2)

0119868 (119885(1)

119867lt minus119888) + 119875

(2)

1119868 (119885(1)

119867gt 119888)

+ 119875(2)

12119868 (10038161003816100381610038161003816119885(1)

119867

10038161003816100381610038161003816le 119888) if 119885(1)

12le 0

119875(2)

GME = 119875(2)lowast

0119868 (119885(1)

119867gt 119888) + 119875

(2)lowast

1119868 (119885(1)

119867lt minus119888)

+ 119875(2)lowast

12119868 (10038161003816100381610038161003816119885(1)

119867

10038161003816100381610038161003816le 119888) if 119885(1)

12gt 0

= 119875(2)lowast

0119868 (119885(1)

119867lt minus119888) + 119875

(2)lowast

1119868 (119885(1)

119867gt 119888)

+ 119875(2)lowast

12119868 (10038161003816100381610038161003816119885(1)

119867

10038161003816100381610038161003816le 119888) if 119885(1)

12le 0

(7)

It is important to note that even though the direction ofthe test statistics and the selected genetic models are usedto obtain the second stage 119901 values the 119901 values from thetwo stages are independent under the null hypothesis Thisis because under the null hypothesis the probability of 119885

12

being positive or negative is simply 12 and the probabilityof the selection of a certain genetic model is also a constant(120572119867for the recessive and dominant models and 1 minus 2120572

119867for

the additive model)

24 Combined Test Using 119901 Values and Its Statistical Signif-icance For a given robust test we can consider the jointanalysis by combining 119901 values from the discovery andreplication stages of GWAS We consider using 119901 valuesrather than the test statistics because test statistics can have

4 BioMed Research International

Table 1 Type I error rates of three approachesmdashreplication-based (REP) test Fisherrsquos combination (119885FC) and linear combination of test(119885LC)mdashbased on the CATT with an additive model (119885

12) 1205942 MAX3 MIN2 GMS and GME The disease prevalence 119870 = 01 119872 = 10

markers 119903 = 1 500 cases and 119904 = 1 500 controls are considered based on 20000 simulations

MAF 120587119904

120572119863

119865 = 0

11988512

1205942 MAX3 MIN2 GMS GME

03 05

005REP 000530 000455 000505 000485 00050 000490119885FC 000500 000535 000495 000510 000515 000460119885LC 000535 000525 000485 000510 00050 000485

01REP 000510 000560 000565 000485 000525 000545119885FC 000565 000535 000565 000545 000565 000540119885LC 000520 000565 000525 000520 000530 000525

03 06

005REP 000510 000485 000480 000515 000480 000500119885FC 000445 000455 000450 000455 000450 000460LC 000500 000515 000495 000520 000475 000480

01REP 000500 000485 000485 000535 000530 000505119885FC 000465 000490 000485 000500 000455 000460119885LC 000480 000470 000515 000490 000485 000475

04 05

005REP 000590 000505 000530 000565 000505 000510119885FC 000575 000430 000460 000535 000460 000500119885LC 000600 000445 000500 000540 000490 000490

01REP 000525 000470 000535 000450 000480 000515119885FC 000515 000510 000495 000475 000540 000500119885LC 000530 000500 000485 000475 000495 000510

04 06

005REP 000475 000585 000480 000500 000515 000495119885FC 000460 000470 000420 000490 000455 000440119885LC 000525 000550 000520 000580 000510 000510

01REP 000550 000490 000515 000535 000555 000540119885FC 000520 000370 000495 000450 000515 000510119885LC 000565 000485 000570 000530 000610 000580

complex forms and obtaining the distribution of the joint testcan be difficult On the other hand calculating a 119901 value foreach data set might be relatively simple and the distributionof 119901 values under the null hypothesis of no association is easyto handle

There are several methods for combining test statisticsfrom two stages [22] and twomost commonly used forms arebased on Fisherrsquos combination and a linear combination afterinverse normal transformation [23] Fisherrsquos combination(FC) directly sums 119901 values after minus2 log transformation thatis 119885FC = minus2119908

1log(119875(1)) minus 2119908

2log(119875(2)) where 119875(119894) is 119901 value

from 119894 = 0 for discovery and 119894 = 1 for replication stagesusing a given robust test A specification of 119908

1= 1199082= 1

gives the same weight for discovery and replication stagesand one can consider 119908

1= 2120587

119904and 119908

2= 2(1 minus 120587

119904)

where 120587119904= 119873119863(119873119863+ 119873119877) and 119873

119863and 119873

119877are sample

sizes of the discovery and replication data sets A linearcombination of two 119875 values after taking the inverse of thestandard normal cumulative distribution is given by 119885LC =

1199081Φminus1(1minus119875(1)2)+119908

2Φminus1(1minus119875(2))radic1199082

1+ 11990822with a natural

choice of 1199081= radic120587119904 and 1199082 = radic1 minus 120587

119904 Let the significance

level of the discovery stage be 120572119863 which means that markers

with 119875(1) lt 120572119863are selected and replicated in the replication

stage The 119901 value of combined test can then be obtained as

119901FC = 1198751198670

(119875(1)

lt 120572119863 119885FC gt 119911FC)where the observed value of

119885FC is 119911FCThe119901FC are calculated as 119890minus119911FC2(1+119911FC2+log120572119863)

for equal weights where 119911FC gt minus2 log120572119863and (119908

1(1199081minus

1199082))119890minus119911FC21199081minus(119908

2(1199081minus1199082))119890minus119911FC21199082120572

minus(1199081

minus1199082

)1199082

119889for unequal

weights where 119911FC gt minus21199081log120572119863 Detailed derivations are

described in the Appendix Equivalently for an overall typeI error threshold for a single marker of 120572 one may obtainthe threshold 119862FC of 119885FC that satisfies 119875

1198670

(119875(1)

lt 120572119863 119885FC gt

119862FC) le 120572 Similarly for the 119885LC the 119901 value is calculatedas 119901LC = 119875

1198670

(119875(1)

lt 120572119863 119885LC gt 119911LC) = int

infin

1199111minus120572

119863

2

120601(119911)[1 minus

Φ((radic11990821+ 11990822119911LC minus 1199081119911)1199082)]119889119911 for 119911LC gt 119911

1minus120572119863

2where the

observed value of 119885LC = 119911LC

3 Simulation Results

31 Type I Error Table 1 provides the type I errors underdifferent scenarios A disease prevalence of 10 is assumedand a total of 1500 cases and 1500 controls were divided intotwo stages The proportions of samples in the first stage (120587

119904)

of 05 and 06 were considered for the minor allele frequency(MAF) of 03 and 04 We considered 119872 = 10 markers tocontrol the genome-wide false positive rate at 120572 = 005 withthe Bonferroni correction We did not consider a larger 119872

BioMed Research International 5

such as 300000 or 500000 because this will require morethan 10 million simulations to show a stable estimate of thetype I error rate With 119872 = 10 we performed 20000simulations which result in less than 10 of a coefficientof variation for a significance level 005119872 = 0005 foreach marker [24] The test statistics considered are 119885

12

Pearsonrsquos Chi-square testMIN2MAX3GMS andGME Forthe second stage analysis we considered a replication-basedanalysis 119885FC and 119885LC as proposed above The results arebased on the situation under HWE (HWE coefficient 119865 = 0)As expected all tests control the type I error reasonablly welland similar results were obtained when a slight deviationfrom HWE is present with 119865 = 005 (results not shown)

32 Empirical Power We examined the empirical powers ofdifferent tests considered above In Figure 1 we considered119872 = 10 markers a disease prevalence of 10 the samegenotype relative risk for two stages (119903

1= 14 and 119903

2= 14)

and 1000 cases and 1000 controls 2000 simulations wereperformed under HWE (119865 = 0) to control the genome-wide false positive rate at 120572 = 005 The recessive additiveand dominant models were assumed for the first secondand third rows Both joint analyses showed better powerperformances compared to the replication-based analysis (upto 159 in scenarios considered in Figure 1) and LC and FChave comparable powers with less than 2 difference Thepower gain of using the joint analysis is not as much as thatobserved in Skol et al [18] However as reported by Skolet al [18] when the between-stage heterogeneity exists andthe risk allele has a larger effect in the first stage than thatin the second stage much improved power is observed byusing the joint test Figure 2 shows results under this scenariowith 119903

1= 16 and 119903

2= 14 and the observed increase

in power using the joint test is as high as 339 Againthe difference between LC and FC is minor with less than3 difference As for comparison between different robustmethods MAX3 GMS and GME perform well under therecessive model while 119885

12 1205942 and MIN2 are less powerful

Under the additivemodel11988512

is most powerful as expectedand 1205942 is least powerful Other robust methods perform wellwith a slight decrease in power compared to 119885

12 Under the

dominant model MAX3 GMS and GME perform the besteven though all tests show good power performances and thedifference is minor Similar patterns were observed when aslight deviation from theHWE is present (results not shown)

4 Real Data Application

The GWAS on non-small-cell lung cancer (NSCLC) by Yoonet al [25] studied 621 NSCLC patients and 1541 controlsubjects in the discovery stage After stringent quality controlsteps a total of 246758 SNPs were tested for the associationwith NSCLC based on119885

12 In the replication stage 168 SNPs

with 119901 value less than 1 times 10minus4 in the first stage based on 11988512

were tested using 804 patients and 1470 control samples Weidentified additional 234 SNPs using MIN2 in the first stagewhich could be studied in the replication stage if MIN2 wasused instead of 119885

12since MIN2 produces stronger evidence

for the additional SNPs than 11988512

does The Manhattan plotsof using MIN2 and 119885

12are presented in Figure 3 One

example is 119903119904385272 located in chromosome 2 which hada 119901 value of 137 times 10

minus7 which reached significance level atBonferroni correction in discovery samples alone whereas11988512

yielded a119901 value greater than 1times10minus4 Even though thereis possibility of false positive findings these SNPs could havebeen selected for replication if robust methods were used

Since we do not have replication data for these additionalSNPs selected using MIN2 because the first stage selectionwas based on 119885

12in Yoon et al [25] just for illustration

purpose of the proposed methods we present the results ofthree SNPs including 1199031199042131877 that was reported by Yoonet al [25] When the significance level in the discovery stageis set at 120572

119863= 5 times 10

minus5 so that all these exemplary SNPs canbe selected in the discovery stage the 119901 value of combinedtest based on four robust methods (MAX3 MIN2 GMSand GME) as well as 119885

12and Pearsonrsquos Chi-square test is

presented in Table 2 Fisherrsquos combination was used for thejoint test in the second stage Only 1199031199042131877was found to besignificant with Bonferroni correction (119901 value lt 203times10

minus7)by all except MAX3method

5 Discussion

In genetic association studies efficiency robust tests whoseperformance does not depend on the underlying geneticmodel have been extensively studied and their power benefitover a wide range of geneticmodels has been well recognizedIn this paper we described how the idea of these robustassociation tests can be applied to the replication studies andfurther how overall statistical significance can be evaluatedusing the combined test formed by 119901 values of the discoveryand replication studies

When the robust tests are used the test statistic ofeach stage can have a complex form and thus dealing withthe distribution of the joint test can be difficult whereascalculating the 119901 value of each stage might be relativelysimple Because the asymptotic distribution of the 119901 valueunder the null hypothesis of no association is easy to handlethe combined test using 119901 values rather than the test statisticsthemselves can provide computational convenience

There are several methods for combining test statisticsfrom two stages and Won et al [22] compared the per-formances of various choices Two most commonly usedforms are based on Fisherrsquos combination and the linearcombination after the inverse normal transformation [23]and we presented the test statistics and 119901 values of these twomethods In our limited experience the linear combinationand Fisherrsquos combination are fairly comparable Fisherrsquoscombination seems to perform slightly better than the linearcombination when there exists some heterogeneity betweenstages in terms of the genotype relative risk while the linearcombination seems to perform slightly better in most ofother situations However the difference is extremely minorFurther research is required for the thorough comparison ofvarious methods of combining 119901 values in the application ofefficiency robust tests to the replication of genetic associationstudies

6 BioMed Research International

Recessive

Additive

Dominant

(05 03) (05 04) (06 03) (06 04)

(120587sMAF)

Empi

rical

pow

er

10

08

06

04

02

00

(05 03) (05 04) (06 03) (06 04)

(120587sMAF)Em

piric

al p

ower

10

08

06

04

02

00

(05 03) (05 04) (06 03) (06 04)

(120587sMAF)

Empi

rical

pow

er

10

08

06

04

02

00

(05 03) (05 04) (06 03) (06 04)

(120587sMAF)

Empi

rical

pow

er

10

08

06

04

02

00

(05 03) (05 04) (06 03) (06 04)

(120587sMAF)

Empi

rical

pow

er

10

08

06

04

02

00

(05 03) (05 04) (06 03) (06 04)

(120587sMAF)

Empi

rical

pow

er

10

08

06

04

02

00

(05 03) (05 04) (06 03) (06 04)

(120587sMAF)

Empi

rical

pow

er

10

08

06

04

02

00

(05 03) (05 04) (06 03) (06 04)

(120587sMAF)

Empi

rical

pow

er

10

08

06

04

02

00

(05 03) (05 04) (06 03) (06 04)

(120587sMAF)

Empi

rical

pow

er

10

08

06

04

02

00

Z05

1205942

MAX3

ZFC

ZFC

ZFC

ZLC

ZLC

ZLC

MIN2GMSGME

MIN2GMSGME

MIN2GMSGME

Z05

1205942

MAX3

Z05

1205942

MAX3

Replication-based test

Replication-based test

Replication-based test

Figure 1 Empirical powers based on 2000 simulations for119872 = 10markers genotype relative risks of both stages = 14 and disease prevalence119870 = 01 under the recessive additive and dominant models 1000 cases and 1000 controls are considered to control 120572 = 005 The first stagetype I error rate for discovery is 120572

119863= 005 Six test statistics 119885

12 1205942 MAX3 MIN2 GMS and GME are considered The first second and

third columns depict powers using the replication-based test 119885FC and 119885LC respectively

BioMed Research International 7

Replication-based test

Recessive

Additive

Dominant

(05 03) (05 04) (06 03) (06 04)

(120587sMAF)

Empi

rical

pow

er

10

08

06

04

02

00

(05 03) (05 04) (06 03) (06 04)

(120587sMAF)Em

piric

al p

ower

10

08

06

04

02

00

(05 03) (05 04) (06 03) (06 04)

(120587sMAF)

Empi

rical

pow

er

10

08

06

04

02

00

(05 03) (05 04) (06 03) (06 04)

(120587sMAF)

Empi

rical

pow

er

10

08

06

04

02

00

(05 03) (05 04) (06 03) (06 04)

(120587sMAF)

Empi

rical

pow

er

10

08

06

04

02

00

(05 03) (05 04) (06 03) (06 04)

(120587sMAF)

Empi

rical

pow

er

10

08

06

04

02

00

(05 03) (05 04) (06 03) (06 04)

(120587sMAF)

Empi

rical

pow

er

10

08

06

04

02

00

(05 03) (05 04) (06 03) (06 04)

(120587sMAF)

Empi

rical

pow

er

10

08

06

04

02

00

(05 03) (05 04) (06 03) (06 04)

(120587sMAF)

Empi

rical

pow

er

10

08

06

04

02

00

Z05

1205942

MAX3

ZFC

ZFC

ZFC

ZLC

ZLC

ZLC

MIN2GMSGME

MIN2GMSGME

MIN2GMSGME

Z05

1205942

MAX3

Z05

1205942

MAX3

Replication-based test

Replication-based test

Figure 2 Empirical powers based on 2000 simulations for 119872 = 10 markers genotype relative risks of two stages are different (1199031= 16

1199032= 14) disease prevalence 119870 = 01 under the recessive additive and dominant models 1000 cases and 1000 controls are considered to

control 120572 = 005 The first stage type I error rate for discovery is 120572119863= 005 Six test statistics 119885

12 1205942 MAX3 MIN2 GMS and GME are

considered The first second and third columns depict powers using the replication-based test 119885FC and 119885LC respectively

8 BioMed Research International

minuslo

g 10(120588)

Chromosome1 2 3 4 5 6 7 8 9 11 13 15 17 19 21 23

8

6

4

2

0

MIN2 P value

(a)

minuslo

g 10(120588)

Additive P value8

6

4

2

0

Chromosome1 2 3 4 5 6 7 8 9 11 13 15 17 19 21 23

(b)

Figure 3 Manhattan plots of 246758 SNPs from Yoon et al [25] based on MIN2 (a) and 11988512

(b) The 119909 axis is chromosomal location andthe 119910 axis is the significance (minuslog

10119875) of association The horizontal line corresponds to the significance level 10minus4

Table 2 For selected exemplary three SNPs for testing association with NSCLC 119901 value of combined test using additive CATT (11988512)

Pearsonrsquos Chi-square test (119879chi2) MAX3 MIN2 119885GMS and 119885GME

SNP 119901 value of 11988512

119901 value of 119879chi2Discovery Replication Combined test Discovery Replication Combined test

rs2131877 788 times 10minus5

104 times 10minus4

797 times 10minus8

140 times 10minus4

149 times 10minus4

184 times 10minus7

rs905551 183 times 10minus5

702 times 10minus3

770 times 10minus6

806 times 10minus5

489 times 10minus2

140 times 10minus5

rs1695109 248 times 10minus4

346 times 10minus2

217 times 10minus6

456 times 10minus5

153 times 10minus1

207 times 10minus5

SNP 119901 value of MAX3 119901 value of MIN2Discovery Replication Combined test Discovery Replication Combined test

rs2131877 153 times 10minus4

405 times 10minus2

192 times 10minus5

132 times 10minus4

104 times 10minus4

126 times 10minus7

rs905551 450 times 10minus5

702 times 10minus3

192 times 10minus6

134 times 10minus4

489 times 10minus2

199 times 10minus5

rs1695109 354 times 10minus5

263 times 10minus2

464 times 10minus6

236 times 10minus5

263 times 10minus2

335 times 10minus6

SNP 119901 value of 119885GMS 119901 value of 119885GME

Discovery Replication Combined test Discovery Replication Combined testrs2131877 186 times 10

minus4104 times 10

minus4171 times 10

minus7103 times 10

minus4104 times 10

minus4102 times 10

minus7

rs905551 519 times 10minus5

702 times 10minus3

216 times 10minus6

735 times 10minus5

801 times 10minus3

320 times 10minus6

rs1695109 689 times 10minus4

127 times 10minus1

385 times 10minus5

269 times 10minus5

419 times 10minus2

540 times 10minus6

In a genetic study where the purpose of considering areplication stage is to validate or replicate the genetic findingsfrom the discovery stage which is the case considered inthis paper the analysis in the replication stage utilized thetest statistic or genetic model that is selected as being thebest in the discovery stage and also the direction of the riskallele following guidelines for exact replication in geneticassociation studies If the purpose is to simply combine theevidence fromdifferent data sources such as inmeta-analysisother strategies may be devised Further research again isrequired to provide fully detailed properties of suchmethods

Power gain of a joint analysis over the conventionalreplication-based analysis was thoroughly studied by Skol etal [18 19] In our simulation the amount of power increaseusing a joint test compared to the replication-based analysiswas much minor than what was observed by Skol et al[18 19] The exact reason is not known but we suspect thismight be due to the power advantages of robust methodsand also due to the fact that the optimal choice from thefirst stage is used when calculating the second stage 119901 values

However even though it was minor in some situationsthe joint anlysis presented better power performance thanthe replication-based analysis in our study This type ofjoint analysis raised concerns about the exact meaning ofreplication [17] However McCarthy et al [26] mentionedthat joint analyses ldquoblur the boundaries of where exactlyreplication starts but whichever analytical approach is takenconfirmation in many independent samples is important andit is the overall strength of the evidence of association thatmattersrdquo Purpose of the current study was to present howthe overall strength of the evidence of association can beevaluated when robust tests are used in GWAS replicationstudies

We illustrated how the proposed methods can be appliedin the real data that studied the association of SNPs with non-small-cell lung cancer (NSCLC) in discovery and replicationstages In the original study reported by Yoon et al [25]SNPs were selected in the discovery data set not based onthe robust tests but based on additive CATT Therefore wefound that some SNPs could have been selected by one

BioMed Research International 9

of the robust methods but they were not included in thereplication data set For these SNPs we were not able toperform the joint analysis that we propose and it was notpossible to examine whether there are other SNPs that couldhave been found to be associated with NSCLC by proposedmethods in the replication study For this reason we merelypresented howmany additional SNPs could have been furtherfollowed in the replication stage when robust methods wereused In many GWASs it is a common practice to reportthe summary test statistics and 119901 values of the SNPs undera specific genetic model usually an additive model whichwere further genotyped in the replication stage and werefinally defined to be significantly associated with a phenotypeof interest As emphasized in this paper one may have abetter chance of findingmanymissing SNPs by applyingmorepowerful and robust methods that consider different geneticmodels simultaneouslyTherefore we urge the community toshare test results under not only an additive model but alsoother genetic models although they were not significant at astringent significance level so that future research may haveenriched data resources to which robust tests can be appliedin association studies

Appendix

119901 value of Fisherrsquos Combination forEqual and Unequal Weights

Equal Weights Assume 1199081

= 1199082

= 1 Under the nullhypothesis of no association 119883

1= minus2 log119875(1) and 119883

2=

minus2 log119875(2) are independent and each asymptotically followsa 1205942 distribution with 2 degrees of freedom Let 119891

119896(119909) and

119865119896(119909) be the probability and cumulative density functions of

1205942 random variable with 119896 degrees of freedomThen 119891

2(119909) =

exp(minus1199092)2 1198914(119909) = 119909 exp(minus1199092)4 119865

2(119909) = 1 minus exp(minus1199092)

and1198654(119909) = 1minusexp(minus1199092)minus119909 exp(minus1199092)2 Denote the cutoff

of the discovery stage based on 120572119863as 119862119863 that is 119865

2(119862119863) =

1 minus 120572119863 For observed value 119911FC gt 119862

119863of1198831+ 1198832 the 119901 value

is written as

1198751198670

(1198831gt 119862119863 1198831+ 1198832gt 119911FC)

= 120572119863minus int

119911FC

119862119863

1198912 (119909) 1198652 (119911FC minus 119909) 119889119909

= 120572119863+ exp (minus

119911FC2) minus 120572119863+1

2exp(minus

119911FC2) (119911FC minus 119862119863)

= exp (minus119911FC2) (1 +

119911FC2

+ log120572119863)

(A1)

Unequal WeightsWhen different proportions of samples areused in the discovery and replication stages it may be moreappropriate to assignweights proportional to the sample sizesfor each stage For example when only a small portion isused in the discovery stage to prevent Fisherrsquos combinationtest from being dominated by the significant result in the

discovery stage one may want to assign a small weight to thediscovery stage result

When 120587119904is the proportion of samples used in the dis-

covery stage one selection for weights is 1199081

= 2120587119904and

1199082= 2(1 minus 120587

119904) for discovery and replication stages which

simplifies to equal weights when 120587119904= 05 Based on these

weights we consider unequal-weighted Fisherrsquos combinationasminus2 log119875(1)1199081119875(2)1199082 = 119908

11198831+11990821198832[27] Its density function

is given by

119891119908 (119909) =

1

2 (1199081minus 1199082)exp(minus 119909

(21199081))

minus1

2 (1199081minus 1199082)exp(minus 119909

(21199082))

(A2)

and the probability distribution function is

119865119908 (119909) = 1 minus

1199081

(1199081minus 1199082)exp(minus 119909

21199081

)

minus1199082

(1199081minus 1199082)exp(minus 119909

21199082

) 1199081

= 1199082

(A3)

Using the previous notation we have the following formof 119901 value

1198751198670

(1198831gt 119862119863 11990811198831+ 11990821198832gt 119911FC)

= 120572119863minus int

119911FC1199081

119862119863

1198912 (119909) 1198652 (

119911FC minus 1199081119909

1199082

)119889119909

= exp(minus119911FC21199081

) +1199082

1199081minus 1199082

exp(minus119911FC21199082

)

times exp(1199081 minus 1199082211990811199082

119911FC) minus exp(1199081 minus 1199082211990811199082

1199081119862119863)

=1199081

1199081minus 1199082

exp(minus119911FC21199081

)

minus1199082

1199081minus 1199082

exp(minus119911FC21199082

)120572minus(1199081

minus1199082

)1199082

119863

=1199081

1199081minus 1199082

exp(minus119911FC21199081

)

minus1199082

1199081minus 1199082

exp(minus119911FC21199082

)120572minus(1199081

minus1199082

)1199082

119863

(A4)

where 119911FC1199081 gt 119862119863

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

10 BioMed Research International

Acknowledgments

The authors are indebted to late Dr Gang Zheng for his inspi-ration and support on their workThisworkwas supported bygrant of the National Cancer Center (no NCC-1210060)

References

[1] R J Klein C Zeiss E Y Chew et al ldquoComplement factor Hpolymorphism in age-related macular degenerationrdquo Sciencevol 308 no 5720 pp 385ndash389 2005

[2] R H Duerr K D Taylor S R Brant et al ldquoA genome-wideassociation study identifies IL 23R as an inflammatory boweldisease generdquo Science vol 314 no 5804 pp 1461ndash1463 2006

[3] A Herbert N P Gerry M B McQueen et al ldquoA commongenetic variant is associated with adult and childhood obesityrdquoScience vol 312 no 5771 pp 279ndash283 2006

[4] R Sladek G Rocheleau J Rung et al ldquoA genome-wideassociation study identifies novel risk loci for type 2 diabetesrdquoNature vol 445 no 7130 pp 881ndash885 2007

[5] P R Burton D G Clayton L R Cardon et al ldquoGenome-wideassociation study of 14000 cases of seven common diseases and3000 shared controlsrdquo Nature vol 447 no 7145 pp 661ndash6782007

[6] S H Kwak S H Kim Y M Cho et al ldquoA genome-wideassociation study of gestational diabetes mellitus in Koreanwomenrdquo Diabetes vol 61 no 2 pp 531ndash541 2012

[7] W G Cochran ldquoSomemethods for strengthening the common1205942 testsrdquo Biometrics vol 10 no 4 pp 417ndash451 1954

[8] P Armitage ldquoTests for linear trends in proportions and frequen-ciesrdquo Biometrics vol 11 no 3 pp 375ndash386 1955

[9] P D Sasieni ldquoFrom genotypes to genes doubling the samplesizerdquo Biometrics vol 53 no 4 pp 1253ndash1261 1997

[10] B Freidlin G Zheng Z Li and J L Gastwirth ldquoTrend tests forcase-control studies of genetic markers power sample size androbustnessrdquo Human Heredity vol 53 no 3 pp 146ndash152 2002

[11] J L Gastwirth ldquoOn robust proceduresrdquo Journal of the AmericanStatistical Association vol 61 no 316 pp 929ndash948 1966

[12] J L Gastwirth ldquoThe use of maximin efficiency robust tests incombining contingency tables and survival analysisrdquo Journal ofthe American Statistical Association vol 80 no 390 pp 380ndash384 1985

[13] G Zheng and H K T Ng ldquoGenetic model selection in two-phase analysis for case-control association studiesrdquo Biostatisticsvol 9 no 3 pp 391ndash399 2008

[14] J Joo M Kwak and G Zheng ldquoImproving power for testinggenetic association in case-control studies by reducing thealternative spacerdquo Biometrics vol 66 no 1 pp 266ndash276 2010

[15] J Joo M Kwak K Ahn and G Zheng ldquoA robust genome-widescan statistic of theWellcome Trust Case-Control ConsortiumrdquoBiometrics vol 65 no 4 pp 1115ndash1122 2009

[16] P Kraft E Zeggini and J P A Ioannidis ldquoReplication ingenome-wide association studiesrdquo Statistical Science vol 24 no4 pp 561ndash573 2009

[17] D CThomas G Casey D V Conti R W Haile J P Lewingerand D O Stram ldquoMethodological issues in multistage genome-wide association studiesrdquo Statistical Science vol 24 no 4 pp414ndash429 2009

[18] A D Skol L J Scott G R Abecasis and M Boehnke ldquoJointanalysis is more efficient than replication-based analysis for

two-stage genome-wide association studiesrdquo Nature Geneticsvol 38 no 2 pp 209ndash213 2006

[19] A D Skol L J Scott G R Abecasis andM Boehnke ldquoOptimaldesigns for two-stage genome-wide association studiesrdquoGeneticEpidemiology vol 31 no 7 pp 776ndash788 2007

[20] G Zheng B Freidlin Z Li and J L Gastwirth ldquoChoice ofscores in trend tests for case-control studies of candidate-geneassociationsrdquo Biometrical Journal vol 45 no 3 pp 335ndash3482003

[21] K Song and R C Elston ldquoA powerful method of combiningmeasures of association and Hardy-Weinberg Disequilibriumfor fine-mapping in case-control studiesrdquo Statistics in Medicinevol 25 no 1 pp 105ndash126 2006

[22] S Won N Morris Q Lu and R C Elston ldquoChoosing anoptimal method to combine P-valuesrdquo Statistics in Medicinevol 28 no 11 pp 1537ndash1553 2009

[23] F BegumDGhoshGC Tseng andE Feingold ldquoComprehen-sive literature review and statistical considerations for GWASmeta-analysisrdquo Nucleic Acids Research vol 40 no 9 pp 3777ndash3784 2012

[24] B Efron and R J Tibshirani An Introduction to the BootstrapChapman amp HallCRC Press 1993

[25] K A Yoon J H Park J Han et al ldquoA genome-wide associationstudy reveals susceptibility variants for non-small cell lungcancer in the Korean populationrdquo Human Molecular Geneticsvol 19 no 24 pp 4948ndash4954 2010

[26] M I McCarthy G R Abecasis L R Cardon et al ldquoGenome-wide association studies for complex traits consensus uncer-tainty and challengesrdquoNature Reviews Genetics vol 9 no 5 pp356ndash369 2008

[27] I J Good ldquoOn the weighted combination of significance testsrdquoJournal of the Royal Statistical Society Series B vol 17 pp 264ndash265 1955

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Anatomy Research International

PeptidesInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

International Journal of

Volume 2014

Zoology

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Molecular Biology International

GenomicsInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioinformaticsAdvances in

Marine BiologyJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Signal TransductionJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioMed Research International

Evolutionary BiologyInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Biochemistry Research International

ArchaeaHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Genetics Research International

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Advances in

Virolog y

Hindawi Publishing Corporationhttpwwwhindawicom

Nucleic AcidsJournal of

Volume 2014

Stem CellsInternational

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Enzyme Research

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Microbiology

Page 4: Research Article Robust Association Tests for the Replication of …downloads.hindawi.com/journals/bmri/2015/461593.pdf · 2019. 7. 31. · Research Article Robust Association Tests

4 BioMed Research International

Table 1 Type I error rates of three approachesmdashreplication-based (REP) test Fisherrsquos combination (119885FC) and linear combination of test(119885LC)mdashbased on the CATT with an additive model (119885

12) 1205942 MAX3 MIN2 GMS and GME The disease prevalence 119870 = 01 119872 = 10

markers 119903 = 1 500 cases and 119904 = 1 500 controls are considered based on 20000 simulations

MAF 120587119904

120572119863

119865 = 0

11988512

1205942 MAX3 MIN2 GMS GME

03 05

005REP 000530 000455 000505 000485 00050 000490119885FC 000500 000535 000495 000510 000515 000460119885LC 000535 000525 000485 000510 00050 000485

01REP 000510 000560 000565 000485 000525 000545119885FC 000565 000535 000565 000545 000565 000540119885LC 000520 000565 000525 000520 000530 000525

03 06

005REP 000510 000485 000480 000515 000480 000500119885FC 000445 000455 000450 000455 000450 000460LC 000500 000515 000495 000520 000475 000480

01REP 000500 000485 000485 000535 000530 000505119885FC 000465 000490 000485 000500 000455 000460119885LC 000480 000470 000515 000490 000485 000475

04 05

005REP 000590 000505 000530 000565 000505 000510119885FC 000575 000430 000460 000535 000460 000500119885LC 000600 000445 000500 000540 000490 000490

01REP 000525 000470 000535 000450 000480 000515119885FC 000515 000510 000495 000475 000540 000500119885LC 000530 000500 000485 000475 000495 000510

04 06

005REP 000475 000585 000480 000500 000515 000495119885FC 000460 000470 000420 000490 000455 000440119885LC 000525 000550 000520 000580 000510 000510

01REP 000550 000490 000515 000535 000555 000540119885FC 000520 000370 000495 000450 000515 000510119885LC 000565 000485 000570 000530 000610 000580

complex forms and obtaining the distribution of the joint testcan be difficult On the other hand calculating a 119901 value foreach data set might be relatively simple and the distributionof 119901 values under the null hypothesis of no association is easyto handle

There are several methods for combining test statisticsfrom two stages [22] and twomost commonly used forms arebased on Fisherrsquos combination and a linear combination afterinverse normal transformation [23] Fisherrsquos combination(FC) directly sums 119901 values after minus2 log transformation thatis 119885FC = minus2119908

1log(119875(1)) minus 2119908

2log(119875(2)) where 119875(119894) is 119901 value

from 119894 = 0 for discovery and 119894 = 1 for replication stagesusing a given robust test A specification of 119908

1= 1199082= 1

gives the same weight for discovery and replication stagesand one can consider 119908

1= 2120587

119904and 119908

2= 2(1 minus 120587

119904)

where 120587119904= 119873119863(119873119863+ 119873119877) and 119873

119863and 119873

119877are sample

sizes of the discovery and replication data sets A linearcombination of two 119875 values after taking the inverse of thestandard normal cumulative distribution is given by 119885LC =

1199081Φminus1(1minus119875(1)2)+119908

2Φminus1(1minus119875(2))radic1199082

1+ 11990822with a natural

choice of 1199081= radic120587119904 and 1199082 = radic1 minus 120587

119904 Let the significance

level of the discovery stage be 120572119863 which means that markers

with 119875(1) lt 120572119863are selected and replicated in the replication

stage The 119901 value of combined test can then be obtained as

119901FC = 1198751198670

(119875(1)

lt 120572119863 119885FC gt 119911FC)where the observed value of

119885FC is 119911FCThe119901FC are calculated as 119890minus119911FC2(1+119911FC2+log120572119863)

for equal weights where 119911FC gt minus2 log120572119863and (119908

1(1199081minus

1199082))119890minus119911FC21199081minus(119908

2(1199081minus1199082))119890minus119911FC21199082120572

minus(1199081

minus1199082

)1199082

119889for unequal

weights where 119911FC gt minus21199081log120572119863 Detailed derivations are

described in the Appendix Equivalently for an overall typeI error threshold for a single marker of 120572 one may obtainthe threshold 119862FC of 119885FC that satisfies 119875

1198670

(119875(1)

lt 120572119863 119885FC gt

119862FC) le 120572 Similarly for the 119885LC the 119901 value is calculatedas 119901LC = 119875

1198670

(119875(1)

lt 120572119863 119885LC gt 119911LC) = int

infin

1199111minus120572

119863

2

120601(119911)[1 minus

Φ((radic11990821+ 11990822119911LC minus 1199081119911)1199082)]119889119911 for 119911LC gt 119911

1minus120572119863

2where the

observed value of 119885LC = 119911LC

3 Simulation Results

31 Type I Error Table 1 provides the type I errors underdifferent scenarios A disease prevalence of 10 is assumedand a total of 1500 cases and 1500 controls were divided intotwo stages The proportions of samples in the first stage (120587

119904)

of 05 and 06 were considered for the minor allele frequency(MAF) of 03 and 04 We considered 119872 = 10 markers tocontrol the genome-wide false positive rate at 120572 = 005 withthe Bonferroni correction We did not consider a larger 119872

BioMed Research International 5

such as 300000 or 500000 because this will require morethan 10 million simulations to show a stable estimate of thetype I error rate With 119872 = 10 we performed 20000simulations which result in less than 10 of a coefficientof variation for a significance level 005119872 = 0005 foreach marker [24] The test statistics considered are 119885

12

Pearsonrsquos Chi-square testMIN2MAX3GMS andGME Forthe second stage analysis we considered a replication-basedanalysis 119885FC and 119885LC as proposed above The results arebased on the situation under HWE (HWE coefficient 119865 = 0)As expected all tests control the type I error reasonablly welland similar results were obtained when a slight deviationfrom HWE is present with 119865 = 005 (results not shown)

32 Empirical Power We examined the empirical powers ofdifferent tests considered above In Figure 1 we considered119872 = 10 markers a disease prevalence of 10 the samegenotype relative risk for two stages (119903

1= 14 and 119903

2= 14)

and 1000 cases and 1000 controls 2000 simulations wereperformed under HWE (119865 = 0) to control the genome-wide false positive rate at 120572 = 005 The recessive additiveand dominant models were assumed for the first secondand third rows Both joint analyses showed better powerperformances compared to the replication-based analysis (upto 159 in scenarios considered in Figure 1) and LC and FChave comparable powers with less than 2 difference Thepower gain of using the joint analysis is not as much as thatobserved in Skol et al [18] However as reported by Skolet al [18] when the between-stage heterogeneity exists andthe risk allele has a larger effect in the first stage than thatin the second stage much improved power is observed byusing the joint test Figure 2 shows results under this scenariowith 119903

1= 16 and 119903

2= 14 and the observed increase

in power using the joint test is as high as 339 Againthe difference between LC and FC is minor with less than3 difference As for comparison between different robustmethods MAX3 GMS and GME perform well under therecessive model while 119885

12 1205942 and MIN2 are less powerful

Under the additivemodel11988512

is most powerful as expectedand 1205942 is least powerful Other robust methods perform wellwith a slight decrease in power compared to 119885

12 Under the

dominant model MAX3 GMS and GME perform the besteven though all tests show good power performances and thedifference is minor Similar patterns were observed when aslight deviation from theHWE is present (results not shown)

4 Real Data Application

The GWAS on non-small-cell lung cancer (NSCLC) by Yoonet al [25] studied 621 NSCLC patients and 1541 controlsubjects in the discovery stage After stringent quality controlsteps a total of 246758 SNPs were tested for the associationwith NSCLC based on119885

12 In the replication stage 168 SNPs

with 119901 value less than 1 times 10minus4 in the first stage based on 11988512

were tested using 804 patients and 1470 control samples Weidentified additional 234 SNPs using MIN2 in the first stagewhich could be studied in the replication stage if MIN2 wasused instead of 119885

12since MIN2 produces stronger evidence

for the additional SNPs than 11988512

does The Manhattan plotsof using MIN2 and 119885

12are presented in Figure 3 One

example is 119903119904385272 located in chromosome 2 which hada 119901 value of 137 times 10

minus7 which reached significance level atBonferroni correction in discovery samples alone whereas11988512

yielded a119901 value greater than 1times10minus4 Even though thereis possibility of false positive findings these SNPs could havebeen selected for replication if robust methods were used

Since we do not have replication data for these additionalSNPs selected using MIN2 because the first stage selectionwas based on 119885

12in Yoon et al [25] just for illustration

purpose of the proposed methods we present the results ofthree SNPs including 1199031199042131877 that was reported by Yoonet al [25] When the significance level in the discovery stageis set at 120572

119863= 5 times 10

minus5 so that all these exemplary SNPs canbe selected in the discovery stage the 119901 value of combinedtest based on four robust methods (MAX3 MIN2 GMSand GME) as well as 119885

12and Pearsonrsquos Chi-square test is

presented in Table 2 Fisherrsquos combination was used for thejoint test in the second stage Only 1199031199042131877was found to besignificant with Bonferroni correction (119901 value lt 203times10

minus7)by all except MAX3method

5 Discussion

In genetic association studies efficiency robust tests whoseperformance does not depend on the underlying geneticmodel have been extensively studied and their power benefitover a wide range of geneticmodels has been well recognizedIn this paper we described how the idea of these robustassociation tests can be applied to the replication studies andfurther how overall statistical significance can be evaluatedusing the combined test formed by 119901 values of the discoveryand replication studies

When the robust tests are used the test statistic ofeach stage can have a complex form and thus dealing withthe distribution of the joint test can be difficult whereascalculating the 119901 value of each stage might be relativelysimple Because the asymptotic distribution of the 119901 valueunder the null hypothesis of no association is easy to handlethe combined test using 119901 values rather than the test statisticsthemselves can provide computational convenience

There are several methods for combining test statisticsfrom two stages and Won et al [22] compared the per-formances of various choices Two most commonly usedforms are based on Fisherrsquos combination and the linearcombination after the inverse normal transformation [23]and we presented the test statistics and 119901 values of these twomethods In our limited experience the linear combinationand Fisherrsquos combination are fairly comparable Fisherrsquoscombination seems to perform slightly better than the linearcombination when there exists some heterogeneity betweenstages in terms of the genotype relative risk while the linearcombination seems to perform slightly better in most ofother situations However the difference is extremely minorFurther research is required for the thorough comparison ofvarious methods of combining 119901 values in the application ofefficiency robust tests to the replication of genetic associationstudies

6 BioMed Research International

Recessive

Additive

Dominant

(05 03) (05 04) (06 03) (06 04)

(120587sMAF)

Empi

rical

pow

er

10

08

06

04

02

00

(05 03) (05 04) (06 03) (06 04)

(120587sMAF)Em

piric

al p

ower

10

08

06

04

02

00

(05 03) (05 04) (06 03) (06 04)

(120587sMAF)

Empi

rical

pow

er

10

08

06

04

02

00

(05 03) (05 04) (06 03) (06 04)

(120587sMAF)

Empi

rical

pow

er

10

08

06

04

02

00

(05 03) (05 04) (06 03) (06 04)

(120587sMAF)

Empi

rical

pow

er

10

08

06

04

02

00

(05 03) (05 04) (06 03) (06 04)

(120587sMAF)

Empi

rical

pow

er

10

08

06

04

02

00

(05 03) (05 04) (06 03) (06 04)

(120587sMAF)

Empi

rical

pow

er

10

08

06

04

02

00

(05 03) (05 04) (06 03) (06 04)

(120587sMAF)

Empi

rical

pow

er

10

08

06

04

02

00

(05 03) (05 04) (06 03) (06 04)

(120587sMAF)

Empi

rical

pow

er

10

08

06

04

02

00

Z05

1205942

MAX3

ZFC

ZFC

ZFC

ZLC

ZLC

ZLC

MIN2GMSGME

MIN2GMSGME

MIN2GMSGME

Z05

1205942

MAX3

Z05

1205942

MAX3

Replication-based test

Replication-based test

Replication-based test

Figure 1 Empirical powers based on 2000 simulations for119872 = 10markers genotype relative risks of both stages = 14 and disease prevalence119870 = 01 under the recessive additive and dominant models 1000 cases and 1000 controls are considered to control 120572 = 005 The first stagetype I error rate for discovery is 120572

119863= 005 Six test statistics 119885

12 1205942 MAX3 MIN2 GMS and GME are considered The first second and

third columns depict powers using the replication-based test 119885FC and 119885LC respectively

BioMed Research International 7

Replication-based test

Recessive

Additive

Dominant

(05 03) (05 04) (06 03) (06 04)

(120587sMAF)

Empi

rical

pow

er

10

08

06

04

02

00

(05 03) (05 04) (06 03) (06 04)

(120587sMAF)Em

piric

al p

ower

10

08

06

04

02

00

(05 03) (05 04) (06 03) (06 04)

(120587sMAF)

Empi

rical

pow

er

10

08

06

04

02

00

(05 03) (05 04) (06 03) (06 04)

(120587sMAF)

Empi

rical

pow

er

10

08

06

04

02

00

(05 03) (05 04) (06 03) (06 04)

(120587sMAF)

Empi

rical

pow

er

10

08

06

04

02

00

(05 03) (05 04) (06 03) (06 04)

(120587sMAF)

Empi

rical

pow

er

10

08

06

04

02

00

(05 03) (05 04) (06 03) (06 04)

(120587sMAF)

Empi

rical

pow

er

10

08

06

04

02

00

(05 03) (05 04) (06 03) (06 04)

(120587sMAF)

Empi

rical

pow

er

10

08

06

04

02

00

(05 03) (05 04) (06 03) (06 04)

(120587sMAF)

Empi

rical

pow

er

10

08

06

04

02

00

Z05

1205942

MAX3

ZFC

ZFC

ZFC

ZLC

ZLC

ZLC

MIN2GMSGME

MIN2GMSGME

MIN2GMSGME

Z05

1205942

MAX3

Z05

1205942

MAX3

Replication-based test

Replication-based test

Figure 2 Empirical powers based on 2000 simulations for 119872 = 10 markers genotype relative risks of two stages are different (1199031= 16

1199032= 14) disease prevalence 119870 = 01 under the recessive additive and dominant models 1000 cases and 1000 controls are considered to

control 120572 = 005 The first stage type I error rate for discovery is 120572119863= 005 Six test statistics 119885

12 1205942 MAX3 MIN2 GMS and GME are

considered The first second and third columns depict powers using the replication-based test 119885FC and 119885LC respectively

8 BioMed Research International

minuslo

g 10(120588)

Chromosome1 2 3 4 5 6 7 8 9 11 13 15 17 19 21 23

8

6

4

2

0

MIN2 P value

(a)

minuslo

g 10(120588)

Additive P value8

6

4

2

0

Chromosome1 2 3 4 5 6 7 8 9 11 13 15 17 19 21 23

(b)

Figure 3 Manhattan plots of 246758 SNPs from Yoon et al [25] based on MIN2 (a) and 11988512

(b) The 119909 axis is chromosomal location andthe 119910 axis is the significance (minuslog

10119875) of association The horizontal line corresponds to the significance level 10minus4

Table 2 For selected exemplary three SNPs for testing association with NSCLC 119901 value of combined test using additive CATT (11988512)

Pearsonrsquos Chi-square test (119879chi2) MAX3 MIN2 119885GMS and 119885GME

SNP 119901 value of 11988512

119901 value of 119879chi2Discovery Replication Combined test Discovery Replication Combined test

rs2131877 788 times 10minus5

104 times 10minus4

797 times 10minus8

140 times 10minus4

149 times 10minus4

184 times 10minus7

rs905551 183 times 10minus5

702 times 10minus3

770 times 10minus6

806 times 10minus5

489 times 10minus2

140 times 10minus5

rs1695109 248 times 10minus4

346 times 10minus2

217 times 10minus6

456 times 10minus5

153 times 10minus1

207 times 10minus5

SNP 119901 value of MAX3 119901 value of MIN2Discovery Replication Combined test Discovery Replication Combined test

rs2131877 153 times 10minus4

405 times 10minus2

192 times 10minus5

132 times 10minus4

104 times 10minus4

126 times 10minus7

rs905551 450 times 10minus5

702 times 10minus3

192 times 10minus6

134 times 10minus4

489 times 10minus2

199 times 10minus5

rs1695109 354 times 10minus5

263 times 10minus2

464 times 10minus6

236 times 10minus5

263 times 10minus2

335 times 10minus6

SNP 119901 value of 119885GMS 119901 value of 119885GME

Discovery Replication Combined test Discovery Replication Combined testrs2131877 186 times 10

minus4104 times 10

minus4171 times 10

minus7103 times 10

minus4104 times 10

minus4102 times 10

minus7

rs905551 519 times 10minus5

702 times 10minus3

216 times 10minus6

735 times 10minus5

801 times 10minus3

320 times 10minus6

rs1695109 689 times 10minus4

127 times 10minus1

385 times 10minus5

269 times 10minus5

419 times 10minus2

540 times 10minus6

In a genetic study where the purpose of considering areplication stage is to validate or replicate the genetic findingsfrom the discovery stage which is the case considered inthis paper the analysis in the replication stage utilized thetest statistic or genetic model that is selected as being thebest in the discovery stage and also the direction of the riskallele following guidelines for exact replication in geneticassociation studies If the purpose is to simply combine theevidence fromdifferent data sources such as inmeta-analysisother strategies may be devised Further research again isrequired to provide fully detailed properties of suchmethods

Power gain of a joint analysis over the conventionalreplication-based analysis was thoroughly studied by Skol etal [18 19] In our simulation the amount of power increaseusing a joint test compared to the replication-based analysiswas much minor than what was observed by Skol et al[18 19] The exact reason is not known but we suspect thismight be due to the power advantages of robust methodsand also due to the fact that the optimal choice from thefirst stage is used when calculating the second stage 119901 values

However even though it was minor in some situationsthe joint anlysis presented better power performance thanthe replication-based analysis in our study This type ofjoint analysis raised concerns about the exact meaning ofreplication [17] However McCarthy et al [26] mentionedthat joint analyses ldquoblur the boundaries of where exactlyreplication starts but whichever analytical approach is takenconfirmation in many independent samples is important andit is the overall strength of the evidence of association thatmattersrdquo Purpose of the current study was to present howthe overall strength of the evidence of association can beevaluated when robust tests are used in GWAS replicationstudies

We illustrated how the proposed methods can be appliedin the real data that studied the association of SNPs with non-small-cell lung cancer (NSCLC) in discovery and replicationstages In the original study reported by Yoon et al [25]SNPs were selected in the discovery data set not based onthe robust tests but based on additive CATT Therefore wefound that some SNPs could have been selected by one

BioMed Research International 9

of the robust methods but they were not included in thereplication data set For these SNPs we were not able toperform the joint analysis that we propose and it was notpossible to examine whether there are other SNPs that couldhave been found to be associated with NSCLC by proposedmethods in the replication study For this reason we merelypresented howmany additional SNPs could have been furtherfollowed in the replication stage when robust methods wereused In many GWASs it is a common practice to reportthe summary test statistics and 119901 values of the SNPs undera specific genetic model usually an additive model whichwere further genotyped in the replication stage and werefinally defined to be significantly associated with a phenotypeof interest As emphasized in this paper one may have abetter chance of findingmanymissing SNPs by applyingmorepowerful and robust methods that consider different geneticmodels simultaneouslyTherefore we urge the community toshare test results under not only an additive model but alsoother genetic models although they were not significant at astringent significance level so that future research may haveenriched data resources to which robust tests can be appliedin association studies

Appendix

119901 value of Fisherrsquos Combination forEqual and Unequal Weights

Equal Weights Assume 1199081

= 1199082

= 1 Under the nullhypothesis of no association 119883

1= minus2 log119875(1) and 119883

2=

minus2 log119875(2) are independent and each asymptotically followsa 1205942 distribution with 2 degrees of freedom Let 119891

119896(119909) and

119865119896(119909) be the probability and cumulative density functions of

1205942 random variable with 119896 degrees of freedomThen 119891

2(119909) =

exp(minus1199092)2 1198914(119909) = 119909 exp(minus1199092)4 119865

2(119909) = 1 minus exp(minus1199092)

and1198654(119909) = 1minusexp(minus1199092)minus119909 exp(minus1199092)2 Denote the cutoff

of the discovery stage based on 120572119863as 119862119863 that is 119865

2(119862119863) =

1 minus 120572119863 For observed value 119911FC gt 119862

119863of1198831+ 1198832 the 119901 value

is written as

1198751198670

(1198831gt 119862119863 1198831+ 1198832gt 119911FC)

= 120572119863minus int

119911FC

119862119863

1198912 (119909) 1198652 (119911FC minus 119909) 119889119909

= 120572119863+ exp (minus

119911FC2) minus 120572119863+1

2exp(minus

119911FC2) (119911FC minus 119862119863)

= exp (minus119911FC2) (1 +

119911FC2

+ log120572119863)

(A1)

Unequal WeightsWhen different proportions of samples areused in the discovery and replication stages it may be moreappropriate to assignweights proportional to the sample sizesfor each stage For example when only a small portion isused in the discovery stage to prevent Fisherrsquos combinationtest from being dominated by the significant result in the

discovery stage one may want to assign a small weight to thediscovery stage result

When 120587119904is the proportion of samples used in the dis-

covery stage one selection for weights is 1199081

= 2120587119904and

1199082= 2(1 minus 120587

119904) for discovery and replication stages which

simplifies to equal weights when 120587119904= 05 Based on these

weights we consider unequal-weighted Fisherrsquos combinationasminus2 log119875(1)1199081119875(2)1199082 = 119908

11198831+11990821198832[27] Its density function

is given by

119891119908 (119909) =

1

2 (1199081minus 1199082)exp(minus 119909

(21199081))

minus1

2 (1199081minus 1199082)exp(minus 119909

(21199082))

(A2)

and the probability distribution function is

119865119908 (119909) = 1 minus

1199081

(1199081minus 1199082)exp(minus 119909

21199081

)

minus1199082

(1199081minus 1199082)exp(minus 119909

21199082

) 1199081

= 1199082

(A3)

Using the previous notation we have the following formof 119901 value

1198751198670

(1198831gt 119862119863 11990811198831+ 11990821198832gt 119911FC)

= 120572119863minus int

119911FC1199081

119862119863

1198912 (119909) 1198652 (

119911FC minus 1199081119909

1199082

)119889119909

= exp(minus119911FC21199081

) +1199082

1199081minus 1199082

exp(minus119911FC21199082

)

times exp(1199081 minus 1199082211990811199082

119911FC) minus exp(1199081 minus 1199082211990811199082

1199081119862119863)

=1199081

1199081minus 1199082

exp(minus119911FC21199081

)

minus1199082

1199081minus 1199082

exp(minus119911FC21199082

)120572minus(1199081

minus1199082

)1199082

119863

=1199081

1199081minus 1199082

exp(minus119911FC21199081

)

minus1199082

1199081minus 1199082

exp(minus119911FC21199082

)120572minus(1199081

minus1199082

)1199082

119863

(A4)

where 119911FC1199081 gt 119862119863

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

10 BioMed Research International

Acknowledgments

The authors are indebted to late Dr Gang Zheng for his inspi-ration and support on their workThisworkwas supported bygrant of the National Cancer Center (no NCC-1210060)

References

[1] R J Klein C Zeiss E Y Chew et al ldquoComplement factor Hpolymorphism in age-related macular degenerationrdquo Sciencevol 308 no 5720 pp 385ndash389 2005

[2] R H Duerr K D Taylor S R Brant et al ldquoA genome-wideassociation study identifies IL 23R as an inflammatory boweldisease generdquo Science vol 314 no 5804 pp 1461ndash1463 2006

[3] A Herbert N P Gerry M B McQueen et al ldquoA commongenetic variant is associated with adult and childhood obesityrdquoScience vol 312 no 5771 pp 279ndash283 2006

[4] R Sladek G Rocheleau J Rung et al ldquoA genome-wideassociation study identifies novel risk loci for type 2 diabetesrdquoNature vol 445 no 7130 pp 881ndash885 2007

[5] P R Burton D G Clayton L R Cardon et al ldquoGenome-wideassociation study of 14000 cases of seven common diseases and3000 shared controlsrdquo Nature vol 447 no 7145 pp 661ndash6782007

[6] S H Kwak S H Kim Y M Cho et al ldquoA genome-wideassociation study of gestational diabetes mellitus in Koreanwomenrdquo Diabetes vol 61 no 2 pp 531ndash541 2012

[7] W G Cochran ldquoSomemethods for strengthening the common1205942 testsrdquo Biometrics vol 10 no 4 pp 417ndash451 1954

[8] P Armitage ldquoTests for linear trends in proportions and frequen-ciesrdquo Biometrics vol 11 no 3 pp 375ndash386 1955

[9] P D Sasieni ldquoFrom genotypes to genes doubling the samplesizerdquo Biometrics vol 53 no 4 pp 1253ndash1261 1997

[10] B Freidlin G Zheng Z Li and J L Gastwirth ldquoTrend tests forcase-control studies of genetic markers power sample size androbustnessrdquo Human Heredity vol 53 no 3 pp 146ndash152 2002

[11] J L Gastwirth ldquoOn robust proceduresrdquo Journal of the AmericanStatistical Association vol 61 no 316 pp 929ndash948 1966

[12] J L Gastwirth ldquoThe use of maximin efficiency robust tests incombining contingency tables and survival analysisrdquo Journal ofthe American Statistical Association vol 80 no 390 pp 380ndash384 1985

[13] G Zheng and H K T Ng ldquoGenetic model selection in two-phase analysis for case-control association studiesrdquo Biostatisticsvol 9 no 3 pp 391ndash399 2008

[14] J Joo M Kwak and G Zheng ldquoImproving power for testinggenetic association in case-control studies by reducing thealternative spacerdquo Biometrics vol 66 no 1 pp 266ndash276 2010

[15] J Joo M Kwak K Ahn and G Zheng ldquoA robust genome-widescan statistic of theWellcome Trust Case-Control ConsortiumrdquoBiometrics vol 65 no 4 pp 1115ndash1122 2009

[16] P Kraft E Zeggini and J P A Ioannidis ldquoReplication ingenome-wide association studiesrdquo Statistical Science vol 24 no4 pp 561ndash573 2009

[17] D CThomas G Casey D V Conti R W Haile J P Lewingerand D O Stram ldquoMethodological issues in multistage genome-wide association studiesrdquo Statistical Science vol 24 no 4 pp414ndash429 2009

[18] A D Skol L J Scott G R Abecasis and M Boehnke ldquoJointanalysis is more efficient than replication-based analysis for

two-stage genome-wide association studiesrdquo Nature Geneticsvol 38 no 2 pp 209ndash213 2006

[19] A D Skol L J Scott G R Abecasis andM Boehnke ldquoOptimaldesigns for two-stage genome-wide association studiesrdquoGeneticEpidemiology vol 31 no 7 pp 776ndash788 2007

[20] G Zheng B Freidlin Z Li and J L Gastwirth ldquoChoice ofscores in trend tests for case-control studies of candidate-geneassociationsrdquo Biometrical Journal vol 45 no 3 pp 335ndash3482003

[21] K Song and R C Elston ldquoA powerful method of combiningmeasures of association and Hardy-Weinberg Disequilibriumfor fine-mapping in case-control studiesrdquo Statistics in Medicinevol 25 no 1 pp 105ndash126 2006

[22] S Won N Morris Q Lu and R C Elston ldquoChoosing anoptimal method to combine P-valuesrdquo Statistics in Medicinevol 28 no 11 pp 1537ndash1553 2009

[23] F BegumDGhoshGC Tseng andE Feingold ldquoComprehen-sive literature review and statistical considerations for GWASmeta-analysisrdquo Nucleic Acids Research vol 40 no 9 pp 3777ndash3784 2012

[24] B Efron and R J Tibshirani An Introduction to the BootstrapChapman amp HallCRC Press 1993

[25] K A Yoon J H Park J Han et al ldquoA genome-wide associationstudy reveals susceptibility variants for non-small cell lungcancer in the Korean populationrdquo Human Molecular Geneticsvol 19 no 24 pp 4948ndash4954 2010

[26] M I McCarthy G R Abecasis L R Cardon et al ldquoGenome-wide association studies for complex traits consensus uncer-tainty and challengesrdquoNature Reviews Genetics vol 9 no 5 pp356ndash369 2008

[27] I J Good ldquoOn the weighted combination of significance testsrdquoJournal of the Royal Statistical Society Series B vol 17 pp 264ndash265 1955

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Anatomy Research International

PeptidesInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

International Journal of

Volume 2014

Zoology

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Molecular Biology International

GenomicsInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioinformaticsAdvances in

Marine BiologyJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Signal TransductionJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioMed Research International

Evolutionary BiologyInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Biochemistry Research International

ArchaeaHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Genetics Research International

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Advances in

Virolog y

Hindawi Publishing Corporationhttpwwwhindawicom

Nucleic AcidsJournal of

Volume 2014

Stem CellsInternational

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Enzyme Research

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Microbiology

Page 5: Research Article Robust Association Tests for the Replication of …downloads.hindawi.com/journals/bmri/2015/461593.pdf · 2019. 7. 31. · Research Article Robust Association Tests

BioMed Research International 5

such as 300000 or 500000 because this will require morethan 10 million simulations to show a stable estimate of thetype I error rate With 119872 = 10 we performed 20000simulations which result in less than 10 of a coefficientof variation for a significance level 005119872 = 0005 foreach marker [24] The test statistics considered are 119885

12

Pearsonrsquos Chi-square testMIN2MAX3GMS andGME Forthe second stage analysis we considered a replication-basedanalysis 119885FC and 119885LC as proposed above The results arebased on the situation under HWE (HWE coefficient 119865 = 0)As expected all tests control the type I error reasonablly welland similar results were obtained when a slight deviationfrom HWE is present with 119865 = 005 (results not shown)

32 Empirical Power We examined the empirical powers ofdifferent tests considered above In Figure 1 we considered119872 = 10 markers a disease prevalence of 10 the samegenotype relative risk for two stages (119903

1= 14 and 119903

2= 14)

and 1000 cases and 1000 controls 2000 simulations wereperformed under HWE (119865 = 0) to control the genome-wide false positive rate at 120572 = 005 The recessive additiveand dominant models were assumed for the first secondand third rows Both joint analyses showed better powerperformances compared to the replication-based analysis (upto 159 in scenarios considered in Figure 1) and LC and FChave comparable powers with less than 2 difference Thepower gain of using the joint analysis is not as much as thatobserved in Skol et al [18] However as reported by Skolet al [18] when the between-stage heterogeneity exists andthe risk allele has a larger effect in the first stage than thatin the second stage much improved power is observed byusing the joint test Figure 2 shows results under this scenariowith 119903

1= 16 and 119903

2= 14 and the observed increase

in power using the joint test is as high as 339 Againthe difference between LC and FC is minor with less than3 difference As for comparison between different robustmethods MAX3 GMS and GME perform well under therecessive model while 119885

12 1205942 and MIN2 are less powerful

Under the additivemodel11988512

is most powerful as expectedand 1205942 is least powerful Other robust methods perform wellwith a slight decrease in power compared to 119885

12 Under the

dominant model MAX3 GMS and GME perform the besteven though all tests show good power performances and thedifference is minor Similar patterns were observed when aslight deviation from theHWE is present (results not shown)

4 Real Data Application

The GWAS on non-small-cell lung cancer (NSCLC) by Yoonet al [25] studied 621 NSCLC patients and 1541 controlsubjects in the discovery stage After stringent quality controlsteps a total of 246758 SNPs were tested for the associationwith NSCLC based on119885

12 In the replication stage 168 SNPs

with 119901 value less than 1 times 10minus4 in the first stage based on 11988512

were tested using 804 patients and 1470 control samples Weidentified additional 234 SNPs using MIN2 in the first stagewhich could be studied in the replication stage if MIN2 wasused instead of 119885

12since MIN2 produces stronger evidence

for the additional SNPs than 11988512

does The Manhattan plotsof using MIN2 and 119885

12are presented in Figure 3 One

example is 119903119904385272 located in chromosome 2 which hada 119901 value of 137 times 10

minus7 which reached significance level atBonferroni correction in discovery samples alone whereas11988512

yielded a119901 value greater than 1times10minus4 Even though thereis possibility of false positive findings these SNPs could havebeen selected for replication if robust methods were used

Since we do not have replication data for these additionalSNPs selected using MIN2 because the first stage selectionwas based on 119885

12in Yoon et al [25] just for illustration

purpose of the proposed methods we present the results ofthree SNPs including 1199031199042131877 that was reported by Yoonet al [25] When the significance level in the discovery stageis set at 120572

119863= 5 times 10

minus5 so that all these exemplary SNPs canbe selected in the discovery stage the 119901 value of combinedtest based on four robust methods (MAX3 MIN2 GMSand GME) as well as 119885

12and Pearsonrsquos Chi-square test is

presented in Table 2 Fisherrsquos combination was used for thejoint test in the second stage Only 1199031199042131877was found to besignificant with Bonferroni correction (119901 value lt 203times10

minus7)by all except MAX3method

5 Discussion

In genetic association studies efficiency robust tests whoseperformance does not depend on the underlying geneticmodel have been extensively studied and their power benefitover a wide range of geneticmodels has been well recognizedIn this paper we described how the idea of these robustassociation tests can be applied to the replication studies andfurther how overall statistical significance can be evaluatedusing the combined test formed by 119901 values of the discoveryand replication studies

When the robust tests are used the test statistic ofeach stage can have a complex form and thus dealing withthe distribution of the joint test can be difficult whereascalculating the 119901 value of each stage might be relativelysimple Because the asymptotic distribution of the 119901 valueunder the null hypothesis of no association is easy to handlethe combined test using 119901 values rather than the test statisticsthemselves can provide computational convenience

There are several methods for combining test statisticsfrom two stages and Won et al [22] compared the per-formances of various choices Two most commonly usedforms are based on Fisherrsquos combination and the linearcombination after the inverse normal transformation [23]and we presented the test statistics and 119901 values of these twomethods In our limited experience the linear combinationand Fisherrsquos combination are fairly comparable Fisherrsquoscombination seems to perform slightly better than the linearcombination when there exists some heterogeneity betweenstages in terms of the genotype relative risk while the linearcombination seems to perform slightly better in most ofother situations However the difference is extremely minorFurther research is required for the thorough comparison ofvarious methods of combining 119901 values in the application ofefficiency robust tests to the replication of genetic associationstudies

6 BioMed Research International

Recessive

Additive

Dominant

(05 03) (05 04) (06 03) (06 04)

(120587sMAF)

Empi

rical

pow

er

10

08

06

04

02

00

(05 03) (05 04) (06 03) (06 04)

(120587sMAF)Em

piric

al p

ower

10

08

06

04

02

00

(05 03) (05 04) (06 03) (06 04)

(120587sMAF)

Empi

rical

pow

er

10

08

06

04

02

00

(05 03) (05 04) (06 03) (06 04)

(120587sMAF)

Empi

rical

pow

er

10

08

06

04

02

00

(05 03) (05 04) (06 03) (06 04)

(120587sMAF)

Empi

rical

pow

er

10

08

06

04

02

00

(05 03) (05 04) (06 03) (06 04)

(120587sMAF)

Empi

rical

pow

er

10

08

06

04

02

00

(05 03) (05 04) (06 03) (06 04)

(120587sMAF)

Empi

rical

pow

er

10

08

06

04

02

00

(05 03) (05 04) (06 03) (06 04)

(120587sMAF)

Empi

rical

pow

er

10

08

06

04

02

00

(05 03) (05 04) (06 03) (06 04)

(120587sMAF)

Empi

rical

pow

er

10

08

06

04

02

00

Z05

1205942

MAX3

ZFC

ZFC

ZFC

ZLC

ZLC

ZLC

MIN2GMSGME

MIN2GMSGME

MIN2GMSGME

Z05

1205942

MAX3

Z05

1205942

MAX3

Replication-based test

Replication-based test

Replication-based test

Figure 1 Empirical powers based on 2000 simulations for119872 = 10markers genotype relative risks of both stages = 14 and disease prevalence119870 = 01 under the recessive additive and dominant models 1000 cases and 1000 controls are considered to control 120572 = 005 The first stagetype I error rate for discovery is 120572

119863= 005 Six test statistics 119885

12 1205942 MAX3 MIN2 GMS and GME are considered The first second and

third columns depict powers using the replication-based test 119885FC and 119885LC respectively

BioMed Research International 7

Replication-based test

Recessive

Additive

Dominant

(05 03) (05 04) (06 03) (06 04)

(120587sMAF)

Empi

rical

pow

er

10

08

06

04

02

00

(05 03) (05 04) (06 03) (06 04)

(120587sMAF)Em

piric

al p

ower

10

08

06

04

02

00

(05 03) (05 04) (06 03) (06 04)

(120587sMAF)

Empi

rical

pow

er

10

08

06

04

02

00

(05 03) (05 04) (06 03) (06 04)

(120587sMAF)

Empi

rical

pow

er

10

08

06

04

02

00

(05 03) (05 04) (06 03) (06 04)

(120587sMAF)

Empi

rical

pow

er

10

08

06

04

02

00

(05 03) (05 04) (06 03) (06 04)

(120587sMAF)

Empi

rical

pow

er

10

08

06

04

02

00

(05 03) (05 04) (06 03) (06 04)

(120587sMAF)

Empi

rical

pow

er

10

08

06

04

02

00

(05 03) (05 04) (06 03) (06 04)

(120587sMAF)

Empi

rical

pow

er

10

08

06

04

02

00

(05 03) (05 04) (06 03) (06 04)

(120587sMAF)

Empi

rical

pow

er

10

08

06

04

02

00

Z05

1205942

MAX3

ZFC

ZFC

ZFC

ZLC

ZLC

ZLC

MIN2GMSGME

MIN2GMSGME

MIN2GMSGME

Z05

1205942

MAX3

Z05

1205942

MAX3

Replication-based test

Replication-based test

Figure 2 Empirical powers based on 2000 simulations for 119872 = 10 markers genotype relative risks of two stages are different (1199031= 16

1199032= 14) disease prevalence 119870 = 01 under the recessive additive and dominant models 1000 cases and 1000 controls are considered to

control 120572 = 005 The first stage type I error rate for discovery is 120572119863= 005 Six test statistics 119885

12 1205942 MAX3 MIN2 GMS and GME are

considered The first second and third columns depict powers using the replication-based test 119885FC and 119885LC respectively

8 BioMed Research International

minuslo

g 10(120588)

Chromosome1 2 3 4 5 6 7 8 9 11 13 15 17 19 21 23

8

6

4

2

0

MIN2 P value

(a)

minuslo

g 10(120588)

Additive P value8

6

4

2

0

Chromosome1 2 3 4 5 6 7 8 9 11 13 15 17 19 21 23

(b)

Figure 3 Manhattan plots of 246758 SNPs from Yoon et al [25] based on MIN2 (a) and 11988512

(b) The 119909 axis is chromosomal location andthe 119910 axis is the significance (minuslog

10119875) of association The horizontal line corresponds to the significance level 10minus4

Table 2 For selected exemplary three SNPs for testing association with NSCLC 119901 value of combined test using additive CATT (11988512)

Pearsonrsquos Chi-square test (119879chi2) MAX3 MIN2 119885GMS and 119885GME

SNP 119901 value of 11988512

119901 value of 119879chi2Discovery Replication Combined test Discovery Replication Combined test

rs2131877 788 times 10minus5

104 times 10minus4

797 times 10minus8

140 times 10minus4

149 times 10minus4

184 times 10minus7

rs905551 183 times 10minus5

702 times 10minus3

770 times 10minus6

806 times 10minus5

489 times 10minus2

140 times 10minus5

rs1695109 248 times 10minus4

346 times 10minus2

217 times 10minus6

456 times 10minus5

153 times 10minus1

207 times 10minus5

SNP 119901 value of MAX3 119901 value of MIN2Discovery Replication Combined test Discovery Replication Combined test

rs2131877 153 times 10minus4

405 times 10minus2

192 times 10minus5

132 times 10minus4

104 times 10minus4

126 times 10minus7

rs905551 450 times 10minus5

702 times 10minus3

192 times 10minus6

134 times 10minus4

489 times 10minus2

199 times 10minus5

rs1695109 354 times 10minus5

263 times 10minus2

464 times 10minus6

236 times 10minus5

263 times 10minus2

335 times 10minus6

SNP 119901 value of 119885GMS 119901 value of 119885GME

Discovery Replication Combined test Discovery Replication Combined testrs2131877 186 times 10

minus4104 times 10

minus4171 times 10

minus7103 times 10

minus4104 times 10

minus4102 times 10

minus7

rs905551 519 times 10minus5

702 times 10minus3

216 times 10minus6

735 times 10minus5

801 times 10minus3

320 times 10minus6

rs1695109 689 times 10minus4

127 times 10minus1

385 times 10minus5

269 times 10minus5

419 times 10minus2

540 times 10minus6

In a genetic study where the purpose of considering areplication stage is to validate or replicate the genetic findingsfrom the discovery stage which is the case considered inthis paper the analysis in the replication stage utilized thetest statistic or genetic model that is selected as being thebest in the discovery stage and also the direction of the riskallele following guidelines for exact replication in geneticassociation studies If the purpose is to simply combine theevidence fromdifferent data sources such as inmeta-analysisother strategies may be devised Further research again isrequired to provide fully detailed properties of suchmethods

Power gain of a joint analysis over the conventionalreplication-based analysis was thoroughly studied by Skol etal [18 19] In our simulation the amount of power increaseusing a joint test compared to the replication-based analysiswas much minor than what was observed by Skol et al[18 19] The exact reason is not known but we suspect thismight be due to the power advantages of robust methodsand also due to the fact that the optimal choice from thefirst stage is used when calculating the second stage 119901 values

However even though it was minor in some situationsthe joint anlysis presented better power performance thanthe replication-based analysis in our study This type ofjoint analysis raised concerns about the exact meaning ofreplication [17] However McCarthy et al [26] mentionedthat joint analyses ldquoblur the boundaries of where exactlyreplication starts but whichever analytical approach is takenconfirmation in many independent samples is important andit is the overall strength of the evidence of association thatmattersrdquo Purpose of the current study was to present howthe overall strength of the evidence of association can beevaluated when robust tests are used in GWAS replicationstudies

We illustrated how the proposed methods can be appliedin the real data that studied the association of SNPs with non-small-cell lung cancer (NSCLC) in discovery and replicationstages In the original study reported by Yoon et al [25]SNPs were selected in the discovery data set not based onthe robust tests but based on additive CATT Therefore wefound that some SNPs could have been selected by one

BioMed Research International 9

of the robust methods but they were not included in thereplication data set For these SNPs we were not able toperform the joint analysis that we propose and it was notpossible to examine whether there are other SNPs that couldhave been found to be associated with NSCLC by proposedmethods in the replication study For this reason we merelypresented howmany additional SNPs could have been furtherfollowed in the replication stage when robust methods wereused In many GWASs it is a common practice to reportthe summary test statistics and 119901 values of the SNPs undera specific genetic model usually an additive model whichwere further genotyped in the replication stage and werefinally defined to be significantly associated with a phenotypeof interest As emphasized in this paper one may have abetter chance of findingmanymissing SNPs by applyingmorepowerful and robust methods that consider different geneticmodels simultaneouslyTherefore we urge the community toshare test results under not only an additive model but alsoother genetic models although they were not significant at astringent significance level so that future research may haveenriched data resources to which robust tests can be appliedin association studies

Appendix

119901 value of Fisherrsquos Combination forEqual and Unequal Weights

Equal Weights Assume 1199081

= 1199082

= 1 Under the nullhypothesis of no association 119883

1= minus2 log119875(1) and 119883

2=

minus2 log119875(2) are independent and each asymptotically followsa 1205942 distribution with 2 degrees of freedom Let 119891

119896(119909) and

119865119896(119909) be the probability and cumulative density functions of

1205942 random variable with 119896 degrees of freedomThen 119891

2(119909) =

exp(minus1199092)2 1198914(119909) = 119909 exp(minus1199092)4 119865

2(119909) = 1 minus exp(minus1199092)

and1198654(119909) = 1minusexp(minus1199092)minus119909 exp(minus1199092)2 Denote the cutoff

of the discovery stage based on 120572119863as 119862119863 that is 119865

2(119862119863) =

1 minus 120572119863 For observed value 119911FC gt 119862

119863of1198831+ 1198832 the 119901 value

is written as

1198751198670

(1198831gt 119862119863 1198831+ 1198832gt 119911FC)

= 120572119863minus int

119911FC

119862119863

1198912 (119909) 1198652 (119911FC minus 119909) 119889119909

= 120572119863+ exp (minus

119911FC2) minus 120572119863+1

2exp(minus

119911FC2) (119911FC minus 119862119863)

= exp (minus119911FC2) (1 +

119911FC2

+ log120572119863)

(A1)

Unequal WeightsWhen different proportions of samples areused in the discovery and replication stages it may be moreappropriate to assignweights proportional to the sample sizesfor each stage For example when only a small portion isused in the discovery stage to prevent Fisherrsquos combinationtest from being dominated by the significant result in the

discovery stage one may want to assign a small weight to thediscovery stage result

When 120587119904is the proportion of samples used in the dis-

covery stage one selection for weights is 1199081

= 2120587119904and

1199082= 2(1 minus 120587

119904) for discovery and replication stages which

simplifies to equal weights when 120587119904= 05 Based on these

weights we consider unequal-weighted Fisherrsquos combinationasminus2 log119875(1)1199081119875(2)1199082 = 119908

11198831+11990821198832[27] Its density function

is given by

119891119908 (119909) =

1

2 (1199081minus 1199082)exp(minus 119909

(21199081))

minus1

2 (1199081minus 1199082)exp(minus 119909

(21199082))

(A2)

and the probability distribution function is

119865119908 (119909) = 1 minus

1199081

(1199081minus 1199082)exp(minus 119909

21199081

)

minus1199082

(1199081minus 1199082)exp(minus 119909

21199082

) 1199081

= 1199082

(A3)

Using the previous notation we have the following formof 119901 value

1198751198670

(1198831gt 119862119863 11990811198831+ 11990821198832gt 119911FC)

= 120572119863minus int

119911FC1199081

119862119863

1198912 (119909) 1198652 (

119911FC minus 1199081119909

1199082

)119889119909

= exp(minus119911FC21199081

) +1199082

1199081minus 1199082

exp(minus119911FC21199082

)

times exp(1199081 minus 1199082211990811199082

119911FC) minus exp(1199081 minus 1199082211990811199082

1199081119862119863)

=1199081

1199081minus 1199082

exp(minus119911FC21199081

)

minus1199082

1199081minus 1199082

exp(minus119911FC21199082

)120572minus(1199081

minus1199082

)1199082

119863

=1199081

1199081minus 1199082

exp(minus119911FC21199081

)

minus1199082

1199081minus 1199082

exp(minus119911FC21199082

)120572minus(1199081

minus1199082

)1199082

119863

(A4)

where 119911FC1199081 gt 119862119863

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

10 BioMed Research International

Acknowledgments

The authors are indebted to late Dr Gang Zheng for his inspi-ration and support on their workThisworkwas supported bygrant of the National Cancer Center (no NCC-1210060)

References

[1] R J Klein C Zeiss E Y Chew et al ldquoComplement factor Hpolymorphism in age-related macular degenerationrdquo Sciencevol 308 no 5720 pp 385ndash389 2005

[2] R H Duerr K D Taylor S R Brant et al ldquoA genome-wideassociation study identifies IL 23R as an inflammatory boweldisease generdquo Science vol 314 no 5804 pp 1461ndash1463 2006

[3] A Herbert N P Gerry M B McQueen et al ldquoA commongenetic variant is associated with adult and childhood obesityrdquoScience vol 312 no 5771 pp 279ndash283 2006

[4] R Sladek G Rocheleau J Rung et al ldquoA genome-wideassociation study identifies novel risk loci for type 2 diabetesrdquoNature vol 445 no 7130 pp 881ndash885 2007

[5] P R Burton D G Clayton L R Cardon et al ldquoGenome-wideassociation study of 14000 cases of seven common diseases and3000 shared controlsrdquo Nature vol 447 no 7145 pp 661ndash6782007

[6] S H Kwak S H Kim Y M Cho et al ldquoA genome-wideassociation study of gestational diabetes mellitus in Koreanwomenrdquo Diabetes vol 61 no 2 pp 531ndash541 2012

[7] W G Cochran ldquoSomemethods for strengthening the common1205942 testsrdquo Biometrics vol 10 no 4 pp 417ndash451 1954

[8] P Armitage ldquoTests for linear trends in proportions and frequen-ciesrdquo Biometrics vol 11 no 3 pp 375ndash386 1955

[9] P D Sasieni ldquoFrom genotypes to genes doubling the samplesizerdquo Biometrics vol 53 no 4 pp 1253ndash1261 1997

[10] B Freidlin G Zheng Z Li and J L Gastwirth ldquoTrend tests forcase-control studies of genetic markers power sample size androbustnessrdquo Human Heredity vol 53 no 3 pp 146ndash152 2002

[11] J L Gastwirth ldquoOn robust proceduresrdquo Journal of the AmericanStatistical Association vol 61 no 316 pp 929ndash948 1966

[12] J L Gastwirth ldquoThe use of maximin efficiency robust tests incombining contingency tables and survival analysisrdquo Journal ofthe American Statistical Association vol 80 no 390 pp 380ndash384 1985

[13] G Zheng and H K T Ng ldquoGenetic model selection in two-phase analysis for case-control association studiesrdquo Biostatisticsvol 9 no 3 pp 391ndash399 2008

[14] J Joo M Kwak and G Zheng ldquoImproving power for testinggenetic association in case-control studies by reducing thealternative spacerdquo Biometrics vol 66 no 1 pp 266ndash276 2010

[15] J Joo M Kwak K Ahn and G Zheng ldquoA robust genome-widescan statistic of theWellcome Trust Case-Control ConsortiumrdquoBiometrics vol 65 no 4 pp 1115ndash1122 2009

[16] P Kraft E Zeggini and J P A Ioannidis ldquoReplication ingenome-wide association studiesrdquo Statistical Science vol 24 no4 pp 561ndash573 2009

[17] D CThomas G Casey D V Conti R W Haile J P Lewingerand D O Stram ldquoMethodological issues in multistage genome-wide association studiesrdquo Statistical Science vol 24 no 4 pp414ndash429 2009

[18] A D Skol L J Scott G R Abecasis and M Boehnke ldquoJointanalysis is more efficient than replication-based analysis for

two-stage genome-wide association studiesrdquo Nature Geneticsvol 38 no 2 pp 209ndash213 2006

[19] A D Skol L J Scott G R Abecasis andM Boehnke ldquoOptimaldesigns for two-stage genome-wide association studiesrdquoGeneticEpidemiology vol 31 no 7 pp 776ndash788 2007

[20] G Zheng B Freidlin Z Li and J L Gastwirth ldquoChoice ofscores in trend tests for case-control studies of candidate-geneassociationsrdquo Biometrical Journal vol 45 no 3 pp 335ndash3482003

[21] K Song and R C Elston ldquoA powerful method of combiningmeasures of association and Hardy-Weinberg Disequilibriumfor fine-mapping in case-control studiesrdquo Statistics in Medicinevol 25 no 1 pp 105ndash126 2006

[22] S Won N Morris Q Lu and R C Elston ldquoChoosing anoptimal method to combine P-valuesrdquo Statistics in Medicinevol 28 no 11 pp 1537ndash1553 2009

[23] F BegumDGhoshGC Tseng andE Feingold ldquoComprehen-sive literature review and statistical considerations for GWASmeta-analysisrdquo Nucleic Acids Research vol 40 no 9 pp 3777ndash3784 2012

[24] B Efron and R J Tibshirani An Introduction to the BootstrapChapman amp HallCRC Press 1993

[25] K A Yoon J H Park J Han et al ldquoA genome-wide associationstudy reveals susceptibility variants for non-small cell lungcancer in the Korean populationrdquo Human Molecular Geneticsvol 19 no 24 pp 4948ndash4954 2010

[26] M I McCarthy G R Abecasis L R Cardon et al ldquoGenome-wide association studies for complex traits consensus uncer-tainty and challengesrdquoNature Reviews Genetics vol 9 no 5 pp356ndash369 2008

[27] I J Good ldquoOn the weighted combination of significance testsrdquoJournal of the Royal Statistical Society Series B vol 17 pp 264ndash265 1955

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Anatomy Research International

PeptidesInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

International Journal of

Volume 2014

Zoology

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Molecular Biology International

GenomicsInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioinformaticsAdvances in

Marine BiologyJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Signal TransductionJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioMed Research International

Evolutionary BiologyInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Biochemistry Research International

ArchaeaHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Genetics Research International

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Advances in

Virolog y

Hindawi Publishing Corporationhttpwwwhindawicom

Nucleic AcidsJournal of

Volume 2014

Stem CellsInternational

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Enzyme Research

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Microbiology

Page 6: Research Article Robust Association Tests for the Replication of …downloads.hindawi.com/journals/bmri/2015/461593.pdf · 2019. 7. 31. · Research Article Robust Association Tests

6 BioMed Research International

Recessive

Additive

Dominant

(05 03) (05 04) (06 03) (06 04)

(120587sMAF)

Empi

rical

pow

er

10

08

06

04

02

00

(05 03) (05 04) (06 03) (06 04)

(120587sMAF)Em

piric

al p

ower

10

08

06

04

02

00

(05 03) (05 04) (06 03) (06 04)

(120587sMAF)

Empi

rical

pow

er

10

08

06

04

02

00

(05 03) (05 04) (06 03) (06 04)

(120587sMAF)

Empi

rical

pow

er

10

08

06

04

02

00

(05 03) (05 04) (06 03) (06 04)

(120587sMAF)

Empi

rical

pow

er

10

08

06

04

02

00

(05 03) (05 04) (06 03) (06 04)

(120587sMAF)

Empi

rical

pow

er

10

08

06

04

02

00

(05 03) (05 04) (06 03) (06 04)

(120587sMAF)

Empi

rical

pow

er

10

08

06

04

02

00

(05 03) (05 04) (06 03) (06 04)

(120587sMAF)

Empi

rical

pow

er

10

08

06

04

02

00

(05 03) (05 04) (06 03) (06 04)

(120587sMAF)

Empi

rical

pow

er

10

08

06

04

02

00

Z05

1205942

MAX3

ZFC

ZFC

ZFC

ZLC

ZLC

ZLC

MIN2GMSGME

MIN2GMSGME

MIN2GMSGME

Z05

1205942

MAX3

Z05

1205942

MAX3

Replication-based test

Replication-based test

Replication-based test

Figure 1 Empirical powers based on 2000 simulations for119872 = 10markers genotype relative risks of both stages = 14 and disease prevalence119870 = 01 under the recessive additive and dominant models 1000 cases and 1000 controls are considered to control 120572 = 005 The first stagetype I error rate for discovery is 120572

119863= 005 Six test statistics 119885

12 1205942 MAX3 MIN2 GMS and GME are considered The first second and

third columns depict powers using the replication-based test 119885FC and 119885LC respectively

BioMed Research International 7

Replication-based test

Recessive

Additive

Dominant

(05 03) (05 04) (06 03) (06 04)

(120587sMAF)

Empi

rical

pow

er

10

08

06

04

02

00

(05 03) (05 04) (06 03) (06 04)

(120587sMAF)Em

piric

al p

ower

10

08

06

04

02

00

(05 03) (05 04) (06 03) (06 04)

(120587sMAF)

Empi

rical

pow

er

10

08

06

04

02

00

(05 03) (05 04) (06 03) (06 04)

(120587sMAF)

Empi

rical

pow

er

10

08

06

04

02

00

(05 03) (05 04) (06 03) (06 04)

(120587sMAF)

Empi

rical

pow

er

10

08

06

04

02

00

(05 03) (05 04) (06 03) (06 04)

(120587sMAF)

Empi

rical

pow

er

10

08

06

04

02

00

(05 03) (05 04) (06 03) (06 04)

(120587sMAF)

Empi

rical

pow

er

10

08

06

04

02

00

(05 03) (05 04) (06 03) (06 04)

(120587sMAF)

Empi

rical

pow

er

10

08

06

04

02

00

(05 03) (05 04) (06 03) (06 04)

(120587sMAF)

Empi

rical

pow

er

10

08

06

04

02

00

Z05

1205942

MAX3

ZFC

ZFC

ZFC

ZLC

ZLC

ZLC

MIN2GMSGME

MIN2GMSGME

MIN2GMSGME

Z05

1205942

MAX3

Z05

1205942

MAX3

Replication-based test

Replication-based test

Figure 2 Empirical powers based on 2000 simulations for 119872 = 10 markers genotype relative risks of two stages are different (1199031= 16

1199032= 14) disease prevalence 119870 = 01 under the recessive additive and dominant models 1000 cases and 1000 controls are considered to

control 120572 = 005 The first stage type I error rate for discovery is 120572119863= 005 Six test statistics 119885

12 1205942 MAX3 MIN2 GMS and GME are

considered The first second and third columns depict powers using the replication-based test 119885FC and 119885LC respectively

8 BioMed Research International

minuslo

g 10(120588)

Chromosome1 2 3 4 5 6 7 8 9 11 13 15 17 19 21 23

8

6

4

2

0

MIN2 P value

(a)

minuslo

g 10(120588)

Additive P value8

6

4

2

0

Chromosome1 2 3 4 5 6 7 8 9 11 13 15 17 19 21 23

(b)

Figure 3 Manhattan plots of 246758 SNPs from Yoon et al [25] based on MIN2 (a) and 11988512

(b) The 119909 axis is chromosomal location andthe 119910 axis is the significance (minuslog

10119875) of association The horizontal line corresponds to the significance level 10minus4

Table 2 For selected exemplary three SNPs for testing association with NSCLC 119901 value of combined test using additive CATT (11988512)

Pearsonrsquos Chi-square test (119879chi2) MAX3 MIN2 119885GMS and 119885GME

SNP 119901 value of 11988512

119901 value of 119879chi2Discovery Replication Combined test Discovery Replication Combined test

rs2131877 788 times 10minus5

104 times 10minus4

797 times 10minus8

140 times 10minus4

149 times 10minus4

184 times 10minus7

rs905551 183 times 10minus5

702 times 10minus3

770 times 10minus6

806 times 10minus5

489 times 10minus2

140 times 10minus5

rs1695109 248 times 10minus4

346 times 10minus2

217 times 10minus6

456 times 10minus5

153 times 10minus1

207 times 10minus5

SNP 119901 value of MAX3 119901 value of MIN2Discovery Replication Combined test Discovery Replication Combined test

rs2131877 153 times 10minus4

405 times 10minus2

192 times 10minus5

132 times 10minus4

104 times 10minus4

126 times 10minus7

rs905551 450 times 10minus5

702 times 10minus3

192 times 10minus6

134 times 10minus4

489 times 10minus2

199 times 10minus5

rs1695109 354 times 10minus5

263 times 10minus2

464 times 10minus6

236 times 10minus5

263 times 10minus2

335 times 10minus6

SNP 119901 value of 119885GMS 119901 value of 119885GME

Discovery Replication Combined test Discovery Replication Combined testrs2131877 186 times 10

minus4104 times 10

minus4171 times 10

minus7103 times 10

minus4104 times 10

minus4102 times 10

minus7

rs905551 519 times 10minus5

702 times 10minus3

216 times 10minus6

735 times 10minus5

801 times 10minus3

320 times 10minus6

rs1695109 689 times 10minus4

127 times 10minus1

385 times 10minus5

269 times 10minus5

419 times 10minus2

540 times 10minus6

In a genetic study where the purpose of considering areplication stage is to validate or replicate the genetic findingsfrom the discovery stage which is the case considered inthis paper the analysis in the replication stage utilized thetest statistic or genetic model that is selected as being thebest in the discovery stage and also the direction of the riskallele following guidelines for exact replication in geneticassociation studies If the purpose is to simply combine theevidence fromdifferent data sources such as inmeta-analysisother strategies may be devised Further research again isrequired to provide fully detailed properties of suchmethods

Power gain of a joint analysis over the conventionalreplication-based analysis was thoroughly studied by Skol etal [18 19] In our simulation the amount of power increaseusing a joint test compared to the replication-based analysiswas much minor than what was observed by Skol et al[18 19] The exact reason is not known but we suspect thismight be due to the power advantages of robust methodsand also due to the fact that the optimal choice from thefirst stage is used when calculating the second stage 119901 values

However even though it was minor in some situationsthe joint anlysis presented better power performance thanthe replication-based analysis in our study This type ofjoint analysis raised concerns about the exact meaning ofreplication [17] However McCarthy et al [26] mentionedthat joint analyses ldquoblur the boundaries of where exactlyreplication starts but whichever analytical approach is takenconfirmation in many independent samples is important andit is the overall strength of the evidence of association thatmattersrdquo Purpose of the current study was to present howthe overall strength of the evidence of association can beevaluated when robust tests are used in GWAS replicationstudies

We illustrated how the proposed methods can be appliedin the real data that studied the association of SNPs with non-small-cell lung cancer (NSCLC) in discovery and replicationstages In the original study reported by Yoon et al [25]SNPs were selected in the discovery data set not based onthe robust tests but based on additive CATT Therefore wefound that some SNPs could have been selected by one

BioMed Research International 9

of the robust methods but they were not included in thereplication data set For these SNPs we were not able toperform the joint analysis that we propose and it was notpossible to examine whether there are other SNPs that couldhave been found to be associated with NSCLC by proposedmethods in the replication study For this reason we merelypresented howmany additional SNPs could have been furtherfollowed in the replication stage when robust methods wereused In many GWASs it is a common practice to reportthe summary test statistics and 119901 values of the SNPs undera specific genetic model usually an additive model whichwere further genotyped in the replication stage and werefinally defined to be significantly associated with a phenotypeof interest As emphasized in this paper one may have abetter chance of findingmanymissing SNPs by applyingmorepowerful and robust methods that consider different geneticmodels simultaneouslyTherefore we urge the community toshare test results under not only an additive model but alsoother genetic models although they were not significant at astringent significance level so that future research may haveenriched data resources to which robust tests can be appliedin association studies

Appendix

119901 value of Fisherrsquos Combination forEqual and Unequal Weights

Equal Weights Assume 1199081

= 1199082

= 1 Under the nullhypothesis of no association 119883

1= minus2 log119875(1) and 119883

2=

minus2 log119875(2) are independent and each asymptotically followsa 1205942 distribution with 2 degrees of freedom Let 119891

119896(119909) and

119865119896(119909) be the probability and cumulative density functions of

1205942 random variable with 119896 degrees of freedomThen 119891

2(119909) =

exp(minus1199092)2 1198914(119909) = 119909 exp(minus1199092)4 119865

2(119909) = 1 minus exp(minus1199092)

and1198654(119909) = 1minusexp(minus1199092)minus119909 exp(minus1199092)2 Denote the cutoff

of the discovery stage based on 120572119863as 119862119863 that is 119865

2(119862119863) =

1 minus 120572119863 For observed value 119911FC gt 119862

119863of1198831+ 1198832 the 119901 value

is written as

1198751198670

(1198831gt 119862119863 1198831+ 1198832gt 119911FC)

= 120572119863minus int

119911FC

119862119863

1198912 (119909) 1198652 (119911FC minus 119909) 119889119909

= 120572119863+ exp (minus

119911FC2) minus 120572119863+1

2exp(minus

119911FC2) (119911FC minus 119862119863)

= exp (minus119911FC2) (1 +

119911FC2

+ log120572119863)

(A1)

Unequal WeightsWhen different proportions of samples areused in the discovery and replication stages it may be moreappropriate to assignweights proportional to the sample sizesfor each stage For example when only a small portion isused in the discovery stage to prevent Fisherrsquos combinationtest from being dominated by the significant result in the

discovery stage one may want to assign a small weight to thediscovery stage result

When 120587119904is the proportion of samples used in the dis-

covery stage one selection for weights is 1199081

= 2120587119904and

1199082= 2(1 minus 120587

119904) for discovery and replication stages which

simplifies to equal weights when 120587119904= 05 Based on these

weights we consider unequal-weighted Fisherrsquos combinationasminus2 log119875(1)1199081119875(2)1199082 = 119908

11198831+11990821198832[27] Its density function

is given by

119891119908 (119909) =

1

2 (1199081minus 1199082)exp(minus 119909

(21199081))

minus1

2 (1199081minus 1199082)exp(minus 119909

(21199082))

(A2)

and the probability distribution function is

119865119908 (119909) = 1 minus

1199081

(1199081minus 1199082)exp(minus 119909

21199081

)

minus1199082

(1199081minus 1199082)exp(minus 119909

21199082

) 1199081

= 1199082

(A3)

Using the previous notation we have the following formof 119901 value

1198751198670

(1198831gt 119862119863 11990811198831+ 11990821198832gt 119911FC)

= 120572119863minus int

119911FC1199081

119862119863

1198912 (119909) 1198652 (

119911FC minus 1199081119909

1199082

)119889119909

= exp(minus119911FC21199081

) +1199082

1199081minus 1199082

exp(minus119911FC21199082

)

times exp(1199081 minus 1199082211990811199082

119911FC) minus exp(1199081 minus 1199082211990811199082

1199081119862119863)

=1199081

1199081minus 1199082

exp(minus119911FC21199081

)

minus1199082

1199081minus 1199082

exp(minus119911FC21199082

)120572minus(1199081

minus1199082

)1199082

119863

=1199081

1199081minus 1199082

exp(minus119911FC21199081

)

minus1199082

1199081minus 1199082

exp(minus119911FC21199082

)120572minus(1199081

minus1199082

)1199082

119863

(A4)

where 119911FC1199081 gt 119862119863

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

10 BioMed Research International

Acknowledgments

The authors are indebted to late Dr Gang Zheng for his inspi-ration and support on their workThisworkwas supported bygrant of the National Cancer Center (no NCC-1210060)

References

[1] R J Klein C Zeiss E Y Chew et al ldquoComplement factor Hpolymorphism in age-related macular degenerationrdquo Sciencevol 308 no 5720 pp 385ndash389 2005

[2] R H Duerr K D Taylor S R Brant et al ldquoA genome-wideassociation study identifies IL 23R as an inflammatory boweldisease generdquo Science vol 314 no 5804 pp 1461ndash1463 2006

[3] A Herbert N P Gerry M B McQueen et al ldquoA commongenetic variant is associated with adult and childhood obesityrdquoScience vol 312 no 5771 pp 279ndash283 2006

[4] R Sladek G Rocheleau J Rung et al ldquoA genome-wideassociation study identifies novel risk loci for type 2 diabetesrdquoNature vol 445 no 7130 pp 881ndash885 2007

[5] P R Burton D G Clayton L R Cardon et al ldquoGenome-wideassociation study of 14000 cases of seven common diseases and3000 shared controlsrdquo Nature vol 447 no 7145 pp 661ndash6782007

[6] S H Kwak S H Kim Y M Cho et al ldquoA genome-wideassociation study of gestational diabetes mellitus in Koreanwomenrdquo Diabetes vol 61 no 2 pp 531ndash541 2012

[7] W G Cochran ldquoSomemethods for strengthening the common1205942 testsrdquo Biometrics vol 10 no 4 pp 417ndash451 1954

[8] P Armitage ldquoTests for linear trends in proportions and frequen-ciesrdquo Biometrics vol 11 no 3 pp 375ndash386 1955

[9] P D Sasieni ldquoFrom genotypes to genes doubling the samplesizerdquo Biometrics vol 53 no 4 pp 1253ndash1261 1997

[10] B Freidlin G Zheng Z Li and J L Gastwirth ldquoTrend tests forcase-control studies of genetic markers power sample size androbustnessrdquo Human Heredity vol 53 no 3 pp 146ndash152 2002

[11] J L Gastwirth ldquoOn robust proceduresrdquo Journal of the AmericanStatistical Association vol 61 no 316 pp 929ndash948 1966

[12] J L Gastwirth ldquoThe use of maximin efficiency robust tests incombining contingency tables and survival analysisrdquo Journal ofthe American Statistical Association vol 80 no 390 pp 380ndash384 1985

[13] G Zheng and H K T Ng ldquoGenetic model selection in two-phase analysis for case-control association studiesrdquo Biostatisticsvol 9 no 3 pp 391ndash399 2008

[14] J Joo M Kwak and G Zheng ldquoImproving power for testinggenetic association in case-control studies by reducing thealternative spacerdquo Biometrics vol 66 no 1 pp 266ndash276 2010

[15] J Joo M Kwak K Ahn and G Zheng ldquoA robust genome-widescan statistic of theWellcome Trust Case-Control ConsortiumrdquoBiometrics vol 65 no 4 pp 1115ndash1122 2009

[16] P Kraft E Zeggini and J P A Ioannidis ldquoReplication ingenome-wide association studiesrdquo Statistical Science vol 24 no4 pp 561ndash573 2009

[17] D CThomas G Casey D V Conti R W Haile J P Lewingerand D O Stram ldquoMethodological issues in multistage genome-wide association studiesrdquo Statistical Science vol 24 no 4 pp414ndash429 2009

[18] A D Skol L J Scott G R Abecasis and M Boehnke ldquoJointanalysis is more efficient than replication-based analysis for

two-stage genome-wide association studiesrdquo Nature Geneticsvol 38 no 2 pp 209ndash213 2006

[19] A D Skol L J Scott G R Abecasis andM Boehnke ldquoOptimaldesigns for two-stage genome-wide association studiesrdquoGeneticEpidemiology vol 31 no 7 pp 776ndash788 2007

[20] G Zheng B Freidlin Z Li and J L Gastwirth ldquoChoice ofscores in trend tests for case-control studies of candidate-geneassociationsrdquo Biometrical Journal vol 45 no 3 pp 335ndash3482003

[21] K Song and R C Elston ldquoA powerful method of combiningmeasures of association and Hardy-Weinberg Disequilibriumfor fine-mapping in case-control studiesrdquo Statistics in Medicinevol 25 no 1 pp 105ndash126 2006

[22] S Won N Morris Q Lu and R C Elston ldquoChoosing anoptimal method to combine P-valuesrdquo Statistics in Medicinevol 28 no 11 pp 1537ndash1553 2009

[23] F BegumDGhoshGC Tseng andE Feingold ldquoComprehen-sive literature review and statistical considerations for GWASmeta-analysisrdquo Nucleic Acids Research vol 40 no 9 pp 3777ndash3784 2012

[24] B Efron and R J Tibshirani An Introduction to the BootstrapChapman amp HallCRC Press 1993

[25] K A Yoon J H Park J Han et al ldquoA genome-wide associationstudy reveals susceptibility variants for non-small cell lungcancer in the Korean populationrdquo Human Molecular Geneticsvol 19 no 24 pp 4948ndash4954 2010

[26] M I McCarthy G R Abecasis L R Cardon et al ldquoGenome-wide association studies for complex traits consensus uncer-tainty and challengesrdquoNature Reviews Genetics vol 9 no 5 pp356ndash369 2008

[27] I J Good ldquoOn the weighted combination of significance testsrdquoJournal of the Royal Statistical Society Series B vol 17 pp 264ndash265 1955

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Anatomy Research International

PeptidesInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

International Journal of

Volume 2014

Zoology

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Molecular Biology International

GenomicsInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioinformaticsAdvances in

Marine BiologyJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Signal TransductionJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioMed Research International

Evolutionary BiologyInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Biochemistry Research International

ArchaeaHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Genetics Research International

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Advances in

Virolog y

Hindawi Publishing Corporationhttpwwwhindawicom

Nucleic AcidsJournal of

Volume 2014

Stem CellsInternational

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Enzyme Research

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Microbiology

Page 7: Research Article Robust Association Tests for the Replication of …downloads.hindawi.com/journals/bmri/2015/461593.pdf · 2019. 7. 31. · Research Article Robust Association Tests

BioMed Research International 7

Replication-based test

Recessive

Additive

Dominant

(05 03) (05 04) (06 03) (06 04)

(120587sMAF)

Empi

rical

pow

er

10

08

06

04

02

00

(05 03) (05 04) (06 03) (06 04)

(120587sMAF)Em

piric

al p

ower

10

08

06

04

02

00

(05 03) (05 04) (06 03) (06 04)

(120587sMAF)

Empi

rical

pow

er

10

08

06

04

02

00

(05 03) (05 04) (06 03) (06 04)

(120587sMAF)

Empi

rical

pow

er

10

08

06

04

02

00

(05 03) (05 04) (06 03) (06 04)

(120587sMAF)

Empi

rical

pow

er

10

08

06

04

02

00

(05 03) (05 04) (06 03) (06 04)

(120587sMAF)

Empi

rical

pow

er

10

08

06

04

02

00

(05 03) (05 04) (06 03) (06 04)

(120587sMAF)

Empi

rical

pow

er

10

08

06

04

02

00

(05 03) (05 04) (06 03) (06 04)

(120587sMAF)

Empi

rical

pow

er

10

08

06

04

02

00

(05 03) (05 04) (06 03) (06 04)

(120587sMAF)

Empi

rical

pow

er

10

08

06

04

02

00

Z05

1205942

MAX3

ZFC

ZFC

ZFC

ZLC

ZLC

ZLC

MIN2GMSGME

MIN2GMSGME

MIN2GMSGME

Z05

1205942

MAX3

Z05

1205942

MAX3

Replication-based test

Replication-based test

Figure 2 Empirical powers based on 2000 simulations for 119872 = 10 markers genotype relative risks of two stages are different (1199031= 16

1199032= 14) disease prevalence 119870 = 01 under the recessive additive and dominant models 1000 cases and 1000 controls are considered to

control 120572 = 005 The first stage type I error rate for discovery is 120572119863= 005 Six test statistics 119885

12 1205942 MAX3 MIN2 GMS and GME are

considered The first second and third columns depict powers using the replication-based test 119885FC and 119885LC respectively

8 BioMed Research International

minuslo

g 10(120588)

Chromosome1 2 3 4 5 6 7 8 9 11 13 15 17 19 21 23

8

6

4

2

0

MIN2 P value

(a)

minuslo

g 10(120588)

Additive P value8

6

4

2

0

Chromosome1 2 3 4 5 6 7 8 9 11 13 15 17 19 21 23

(b)

Figure 3 Manhattan plots of 246758 SNPs from Yoon et al [25] based on MIN2 (a) and 11988512

(b) The 119909 axis is chromosomal location andthe 119910 axis is the significance (minuslog

10119875) of association The horizontal line corresponds to the significance level 10minus4

Table 2 For selected exemplary three SNPs for testing association with NSCLC 119901 value of combined test using additive CATT (11988512)

Pearsonrsquos Chi-square test (119879chi2) MAX3 MIN2 119885GMS and 119885GME

SNP 119901 value of 11988512

119901 value of 119879chi2Discovery Replication Combined test Discovery Replication Combined test

rs2131877 788 times 10minus5

104 times 10minus4

797 times 10minus8

140 times 10minus4

149 times 10minus4

184 times 10minus7

rs905551 183 times 10minus5

702 times 10minus3

770 times 10minus6

806 times 10minus5

489 times 10minus2

140 times 10minus5

rs1695109 248 times 10minus4

346 times 10minus2

217 times 10minus6

456 times 10minus5

153 times 10minus1

207 times 10minus5

SNP 119901 value of MAX3 119901 value of MIN2Discovery Replication Combined test Discovery Replication Combined test

rs2131877 153 times 10minus4

405 times 10minus2

192 times 10minus5

132 times 10minus4

104 times 10minus4

126 times 10minus7

rs905551 450 times 10minus5

702 times 10minus3

192 times 10minus6

134 times 10minus4

489 times 10minus2

199 times 10minus5

rs1695109 354 times 10minus5

263 times 10minus2

464 times 10minus6

236 times 10minus5

263 times 10minus2

335 times 10minus6

SNP 119901 value of 119885GMS 119901 value of 119885GME

Discovery Replication Combined test Discovery Replication Combined testrs2131877 186 times 10

minus4104 times 10

minus4171 times 10

minus7103 times 10

minus4104 times 10

minus4102 times 10

minus7

rs905551 519 times 10minus5

702 times 10minus3

216 times 10minus6

735 times 10minus5

801 times 10minus3

320 times 10minus6

rs1695109 689 times 10minus4

127 times 10minus1

385 times 10minus5

269 times 10minus5

419 times 10minus2

540 times 10minus6

In a genetic study where the purpose of considering areplication stage is to validate or replicate the genetic findingsfrom the discovery stage which is the case considered inthis paper the analysis in the replication stage utilized thetest statistic or genetic model that is selected as being thebest in the discovery stage and also the direction of the riskallele following guidelines for exact replication in geneticassociation studies If the purpose is to simply combine theevidence fromdifferent data sources such as inmeta-analysisother strategies may be devised Further research again isrequired to provide fully detailed properties of suchmethods

Power gain of a joint analysis over the conventionalreplication-based analysis was thoroughly studied by Skol etal [18 19] In our simulation the amount of power increaseusing a joint test compared to the replication-based analysiswas much minor than what was observed by Skol et al[18 19] The exact reason is not known but we suspect thismight be due to the power advantages of robust methodsand also due to the fact that the optimal choice from thefirst stage is used when calculating the second stage 119901 values

However even though it was minor in some situationsthe joint anlysis presented better power performance thanthe replication-based analysis in our study This type ofjoint analysis raised concerns about the exact meaning ofreplication [17] However McCarthy et al [26] mentionedthat joint analyses ldquoblur the boundaries of where exactlyreplication starts but whichever analytical approach is takenconfirmation in many independent samples is important andit is the overall strength of the evidence of association thatmattersrdquo Purpose of the current study was to present howthe overall strength of the evidence of association can beevaluated when robust tests are used in GWAS replicationstudies

We illustrated how the proposed methods can be appliedin the real data that studied the association of SNPs with non-small-cell lung cancer (NSCLC) in discovery and replicationstages In the original study reported by Yoon et al [25]SNPs were selected in the discovery data set not based onthe robust tests but based on additive CATT Therefore wefound that some SNPs could have been selected by one

BioMed Research International 9

of the robust methods but they were not included in thereplication data set For these SNPs we were not able toperform the joint analysis that we propose and it was notpossible to examine whether there are other SNPs that couldhave been found to be associated with NSCLC by proposedmethods in the replication study For this reason we merelypresented howmany additional SNPs could have been furtherfollowed in the replication stage when robust methods wereused In many GWASs it is a common practice to reportthe summary test statistics and 119901 values of the SNPs undera specific genetic model usually an additive model whichwere further genotyped in the replication stage and werefinally defined to be significantly associated with a phenotypeof interest As emphasized in this paper one may have abetter chance of findingmanymissing SNPs by applyingmorepowerful and robust methods that consider different geneticmodels simultaneouslyTherefore we urge the community toshare test results under not only an additive model but alsoother genetic models although they were not significant at astringent significance level so that future research may haveenriched data resources to which robust tests can be appliedin association studies

Appendix

119901 value of Fisherrsquos Combination forEqual and Unequal Weights

Equal Weights Assume 1199081

= 1199082

= 1 Under the nullhypothesis of no association 119883

1= minus2 log119875(1) and 119883

2=

minus2 log119875(2) are independent and each asymptotically followsa 1205942 distribution with 2 degrees of freedom Let 119891

119896(119909) and

119865119896(119909) be the probability and cumulative density functions of

1205942 random variable with 119896 degrees of freedomThen 119891

2(119909) =

exp(minus1199092)2 1198914(119909) = 119909 exp(minus1199092)4 119865

2(119909) = 1 minus exp(minus1199092)

and1198654(119909) = 1minusexp(minus1199092)minus119909 exp(minus1199092)2 Denote the cutoff

of the discovery stage based on 120572119863as 119862119863 that is 119865

2(119862119863) =

1 minus 120572119863 For observed value 119911FC gt 119862

119863of1198831+ 1198832 the 119901 value

is written as

1198751198670

(1198831gt 119862119863 1198831+ 1198832gt 119911FC)

= 120572119863minus int

119911FC

119862119863

1198912 (119909) 1198652 (119911FC minus 119909) 119889119909

= 120572119863+ exp (minus

119911FC2) minus 120572119863+1

2exp(minus

119911FC2) (119911FC minus 119862119863)

= exp (minus119911FC2) (1 +

119911FC2

+ log120572119863)

(A1)

Unequal WeightsWhen different proportions of samples areused in the discovery and replication stages it may be moreappropriate to assignweights proportional to the sample sizesfor each stage For example when only a small portion isused in the discovery stage to prevent Fisherrsquos combinationtest from being dominated by the significant result in the

discovery stage one may want to assign a small weight to thediscovery stage result

When 120587119904is the proportion of samples used in the dis-

covery stage one selection for weights is 1199081

= 2120587119904and

1199082= 2(1 minus 120587

119904) for discovery and replication stages which

simplifies to equal weights when 120587119904= 05 Based on these

weights we consider unequal-weighted Fisherrsquos combinationasminus2 log119875(1)1199081119875(2)1199082 = 119908

11198831+11990821198832[27] Its density function

is given by

119891119908 (119909) =

1

2 (1199081minus 1199082)exp(minus 119909

(21199081))

minus1

2 (1199081minus 1199082)exp(minus 119909

(21199082))

(A2)

and the probability distribution function is

119865119908 (119909) = 1 minus

1199081

(1199081minus 1199082)exp(minus 119909

21199081

)

minus1199082

(1199081minus 1199082)exp(minus 119909

21199082

) 1199081

= 1199082

(A3)

Using the previous notation we have the following formof 119901 value

1198751198670

(1198831gt 119862119863 11990811198831+ 11990821198832gt 119911FC)

= 120572119863minus int

119911FC1199081

119862119863

1198912 (119909) 1198652 (

119911FC minus 1199081119909

1199082

)119889119909

= exp(minus119911FC21199081

) +1199082

1199081minus 1199082

exp(minus119911FC21199082

)

times exp(1199081 minus 1199082211990811199082

119911FC) minus exp(1199081 minus 1199082211990811199082

1199081119862119863)

=1199081

1199081minus 1199082

exp(minus119911FC21199081

)

minus1199082

1199081minus 1199082

exp(minus119911FC21199082

)120572minus(1199081

minus1199082

)1199082

119863

=1199081

1199081minus 1199082

exp(minus119911FC21199081

)

minus1199082

1199081minus 1199082

exp(minus119911FC21199082

)120572minus(1199081

minus1199082

)1199082

119863

(A4)

where 119911FC1199081 gt 119862119863

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

10 BioMed Research International

Acknowledgments

The authors are indebted to late Dr Gang Zheng for his inspi-ration and support on their workThisworkwas supported bygrant of the National Cancer Center (no NCC-1210060)

References

[1] R J Klein C Zeiss E Y Chew et al ldquoComplement factor Hpolymorphism in age-related macular degenerationrdquo Sciencevol 308 no 5720 pp 385ndash389 2005

[2] R H Duerr K D Taylor S R Brant et al ldquoA genome-wideassociation study identifies IL 23R as an inflammatory boweldisease generdquo Science vol 314 no 5804 pp 1461ndash1463 2006

[3] A Herbert N P Gerry M B McQueen et al ldquoA commongenetic variant is associated with adult and childhood obesityrdquoScience vol 312 no 5771 pp 279ndash283 2006

[4] R Sladek G Rocheleau J Rung et al ldquoA genome-wideassociation study identifies novel risk loci for type 2 diabetesrdquoNature vol 445 no 7130 pp 881ndash885 2007

[5] P R Burton D G Clayton L R Cardon et al ldquoGenome-wideassociation study of 14000 cases of seven common diseases and3000 shared controlsrdquo Nature vol 447 no 7145 pp 661ndash6782007

[6] S H Kwak S H Kim Y M Cho et al ldquoA genome-wideassociation study of gestational diabetes mellitus in Koreanwomenrdquo Diabetes vol 61 no 2 pp 531ndash541 2012

[7] W G Cochran ldquoSomemethods for strengthening the common1205942 testsrdquo Biometrics vol 10 no 4 pp 417ndash451 1954

[8] P Armitage ldquoTests for linear trends in proportions and frequen-ciesrdquo Biometrics vol 11 no 3 pp 375ndash386 1955

[9] P D Sasieni ldquoFrom genotypes to genes doubling the samplesizerdquo Biometrics vol 53 no 4 pp 1253ndash1261 1997

[10] B Freidlin G Zheng Z Li and J L Gastwirth ldquoTrend tests forcase-control studies of genetic markers power sample size androbustnessrdquo Human Heredity vol 53 no 3 pp 146ndash152 2002

[11] J L Gastwirth ldquoOn robust proceduresrdquo Journal of the AmericanStatistical Association vol 61 no 316 pp 929ndash948 1966

[12] J L Gastwirth ldquoThe use of maximin efficiency robust tests incombining contingency tables and survival analysisrdquo Journal ofthe American Statistical Association vol 80 no 390 pp 380ndash384 1985

[13] G Zheng and H K T Ng ldquoGenetic model selection in two-phase analysis for case-control association studiesrdquo Biostatisticsvol 9 no 3 pp 391ndash399 2008

[14] J Joo M Kwak and G Zheng ldquoImproving power for testinggenetic association in case-control studies by reducing thealternative spacerdquo Biometrics vol 66 no 1 pp 266ndash276 2010

[15] J Joo M Kwak K Ahn and G Zheng ldquoA robust genome-widescan statistic of theWellcome Trust Case-Control ConsortiumrdquoBiometrics vol 65 no 4 pp 1115ndash1122 2009

[16] P Kraft E Zeggini and J P A Ioannidis ldquoReplication ingenome-wide association studiesrdquo Statistical Science vol 24 no4 pp 561ndash573 2009

[17] D CThomas G Casey D V Conti R W Haile J P Lewingerand D O Stram ldquoMethodological issues in multistage genome-wide association studiesrdquo Statistical Science vol 24 no 4 pp414ndash429 2009

[18] A D Skol L J Scott G R Abecasis and M Boehnke ldquoJointanalysis is more efficient than replication-based analysis for

two-stage genome-wide association studiesrdquo Nature Geneticsvol 38 no 2 pp 209ndash213 2006

[19] A D Skol L J Scott G R Abecasis andM Boehnke ldquoOptimaldesigns for two-stage genome-wide association studiesrdquoGeneticEpidemiology vol 31 no 7 pp 776ndash788 2007

[20] G Zheng B Freidlin Z Li and J L Gastwirth ldquoChoice ofscores in trend tests for case-control studies of candidate-geneassociationsrdquo Biometrical Journal vol 45 no 3 pp 335ndash3482003

[21] K Song and R C Elston ldquoA powerful method of combiningmeasures of association and Hardy-Weinberg Disequilibriumfor fine-mapping in case-control studiesrdquo Statistics in Medicinevol 25 no 1 pp 105ndash126 2006

[22] S Won N Morris Q Lu and R C Elston ldquoChoosing anoptimal method to combine P-valuesrdquo Statistics in Medicinevol 28 no 11 pp 1537ndash1553 2009

[23] F BegumDGhoshGC Tseng andE Feingold ldquoComprehen-sive literature review and statistical considerations for GWASmeta-analysisrdquo Nucleic Acids Research vol 40 no 9 pp 3777ndash3784 2012

[24] B Efron and R J Tibshirani An Introduction to the BootstrapChapman amp HallCRC Press 1993

[25] K A Yoon J H Park J Han et al ldquoA genome-wide associationstudy reveals susceptibility variants for non-small cell lungcancer in the Korean populationrdquo Human Molecular Geneticsvol 19 no 24 pp 4948ndash4954 2010

[26] M I McCarthy G R Abecasis L R Cardon et al ldquoGenome-wide association studies for complex traits consensus uncer-tainty and challengesrdquoNature Reviews Genetics vol 9 no 5 pp356ndash369 2008

[27] I J Good ldquoOn the weighted combination of significance testsrdquoJournal of the Royal Statistical Society Series B vol 17 pp 264ndash265 1955

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Anatomy Research International

PeptidesInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

International Journal of

Volume 2014

Zoology

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Molecular Biology International

GenomicsInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioinformaticsAdvances in

Marine BiologyJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Signal TransductionJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioMed Research International

Evolutionary BiologyInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Biochemistry Research International

ArchaeaHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Genetics Research International

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Advances in

Virolog y

Hindawi Publishing Corporationhttpwwwhindawicom

Nucleic AcidsJournal of

Volume 2014

Stem CellsInternational

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Enzyme Research

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Microbiology

Page 8: Research Article Robust Association Tests for the Replication of …downloads.hindawi.com/journals/bmri/2015/461593.pdf · 2019. 7. 31. · Research Article Robust Association Tests

8 BioMed Research International

minuslo

g 10(120588)

Chromosome1 2 3 4 5 6 7 8 9 11 13 15 17 19 21 23

8

6

4

2

0

MIN2 P value

(a)

minuslo

g 10(120588)

Additive P value8

6

4

2

0

Chromosome1 2 3 4 5 6 7 8 9 11 13 15 17 19 21 23

(b)

Figure 3 Manhattan plots of 246758 SNPs from Yoon et al [25] based on MIN2 (a) and 11988512

(b) The 119909 axis is chromosomal location andthe 119910 axis is the significance (minuslog

10119875) of association The horizontal line corresponds to the significance level 10minus4

Table 2 For selected exemplary three SNPs for testing association with NSCLC 119901 value of combined test using additive CATT (11988512)

Pearsonrsquos Chi-square test (119879chi2) MAX3 MIN2 119885GMS and 119885GME

SNP 119901 value of 11988512

119901 value of 119879chi2Discovery Replication Combined test Discovery Replication Combined test

rs2131877 788 times 10minus5

104 times 10minus4

797 times 10minus8

140 times 10minus4

149 times 10minus4

184 times 10minus7

rs905551 183 times 10minus5

702 times 10minus3

770 times 10minus6

806 times 10minus5

489 times 10minus2

140 times 10minus5

rs1695109 248 times 10minus4

346 times 10minus2

217 times 10minus6

456 times 10minus5

153 times 10minus1

207 times 10minus5

SNP 119901 value of MAX3 119901 value of MIN2Discovery Replication Combined test Discovery Replication Combined test

rs2131877 153 times 10minus4

405 times 10minus2

192 times 10minus5

132 times 10minus4

104 times 10minus4

126 times 10minus7

rs905551 450 times 10minus5

702 times 10minus3

192 times 10minus6

134 times 10minus4

489 times 10minus2

199 times 10minus5

rs1695109 354 times 10minus5

263 times 10minus2

464 times 10minus6

236 times 10minus5

263 times 10minus2

335 times 10minus6

SNP 119901 value of 119885GMS 119901 value of 119885GME

Discovery Replication Combined test Discovery Replication Combined testrs2131877 186 times 10

minus4104 times 10

minus4171 times 10

minus7103 times 10

minus4104 times 10

minus4102 times 10

minus7

rs905551 519 times 10minus5

702 times 10minus3

216 times 10minus6

735 times 10minus5

801 times 10minus3

320 times 10minus6

rs1695109 689 times 10minus4

127 times 10minus1

385 times 10minus5

269 times 10minus5

419 times 10minus2

540 times 10minus6

In a genetic study where the purpose of considering areplication stage is to validate or replicate the genetic findingsfrom the discovery stage which is the case considered inthis paper the analysis in the replication stage utilized thetest statistic or genetic model that is selected as being thebest in the discovery stage and also the direction of the riskallele following guidelines for exact replication in geneticassociation studies If the purpose is to simply combine theevidence fromdifferent data sources such as inmeta-analysisother strategies may be devised Further research again isrequired to provide fully detailed properties of suchmethods

Power gain of a joint analysis over the conventionalreplication-based analysis was thoroughly studied by Skol etal [18 19] In our simulation the amount of power increaseusing a joint test compared to the replication-based analysiswas much minor than what was observed by Skol et al[18 19] The exact reason is not known but we suspect thismight be due to the power advantages of robust methodsand also due to the fact that the optimal choice from thefirst stage is used when calculating the second stage 119901 values

However even though it was minor in some situationsthe joint anlysis presented better power performance thanthe replication-based analysis in our study This type ofjoint analysis raised concerns about the exact meaning ofreplication [17] However McCarthy et al [26] mentionedthat joint analyses ldquoblur the boundaries of where exactlyreplication starts but whichever analytical approach is takenconfirmation in many independent samples is important andit is the overall strength of the evidence of association thatmattersrdquo Purpose of the current study was to present howthe overall strength of the evidence of association can beevaluated when robust tests are used in GWAS replicationstudies

We illustrated how the proposed methods can be appliedin the real data that studied the association of SNPs with non-small-cell lung cancer (NSCLC) in discovery and replicationstages In the original study reported by Yoon et al [25]SNPs were selected in the discovery data set not based onthe robust tests but based on additive CATT Therefore wefound that some SNPs could have been selected by one

BioMed Research International 9

of the robust methods but they were not included in thereplication data set For these SNPs we were not able toperform the joint analysis that we propose and it was notpossible to examine whether there are other SNPs that couldhave been found to be associated with NSCLC by proposedmethods in the replication study For this reason we merelypresented howmany additional SNPs could have been furtherfollowed in the replication stage when robust methods wereused In many GWASs it is a common practice to reportthe summary test statistics and 119901 values of the SNPs undera specific genetic model usually an additive model whichwere further genotyped in the replication stage and werefinally defined to be significantly associated with a phenotypeof interest As emphasized in this paper one may have abetter chance of findingmanymissing SNPs by applyingmorepowerful and robust methods that consider different geneticmodels simultaneouslyTherefore we urge the community toshare test results under not only an additive model but alsoother genetic models although they were not significant at astringent significance level so that future research may haveenriched data resources to which robust tests can be appliedin association studies

Appendix

119901 value of Fisherrsquos Combination forEqual and Unequal Weights

Equal Weights Assume 1199081

= 1199082

= 1 Under the nullhypothesis of no association 119883

1= minus2 log119875(1) and 119883

2=

minus2 log119875(2) are independent and each asymptotically followsa 1205942 distribution with 2 degrees of freedom Let 119891

119896(119909) and

119865119896(119909) be the probability and cumulative density functions of

1205942 random variable with 119896 degrees of freedomThen 119891

2(119909) =

exp(minus1199092)2 1198914(119909) = 119909 exp(minus1199092)4 119865

2(119909) = 1 minus exp(minus1199092)

and1198654(119909) = 1minusexp(minus1199092)minus119909 exp(minus1199092)2 Denote the cutoff

of the discovery stage based on 120572119863as 119862119863 that is 119865

2(119862119863) =

1 minus 120572119863 For observed value 119911FC gt 119862

119863of1198831+ 1198832 the 119901 value

is written as

1198751198670

(1198831gt 119862119863 1198831+ 1198832gt 119911FC)

= 120572119863minus int

119911FC

119862119863

1198912 (119909) 1198652 (119911FC minus 119909) 119889119909

= 120572119863+ exp (minus

119911FC2) minus 120572119863+1

2exp(minus

119911FC2) (119911FC minus 119862119863)

= exp (minus119911FC2) (1 +

119911FC2

+ log120572119863)

(A1)

Unequal WeightsWhen different proportions of samples areused in the discovery and replication stages it may be moreappropriate to assignweights proportional to the sample sizesfor each stage For example when only a small portion isused in the discovery stage to prevent Fisherrsquos combinationtest from being dominated by the significant result in the

discovery stage one may want to assign a small weight to thediscovery stage result

When 120587119904is the proportion of samples used in the dis-

covery stage one selection for weights is 1199081

= 2120587119904and

1199082= 2(1 minus 120587

119904) for discovery and replication stages which

simplifies to equal weights when 120587119904= 05 Based on these

weights we consider unequal-weighted Fisherrsquos combinationasminus2 log119875(1)1199081119875(2)1199082 = 119908

11198831+11990821198832[27] Its density function

is given by

119891119908 (119909) =

1

2 (1199081minus 1199082)exp(minus 119909

(21199081))

minus1

2 (1199081minus 1199082)exp(minus 119909

(21199082))

(A2)

and the probability distribution function is

119865119908 (119909) = 1 minus

1199081

(1199081minus 1199082)exp(minus 119909

21199081

)

minus1199082

(1199081minus 1199082)exp(minus 119909

21199082

) 1199081

= 1199082

(A3)

Using the previous notation we have the following formof 119901 value

1198751198670

(1198831gt 119862119863 11990811198831+ 11990821198832gt 119911FC)

= 120572119863minus int

119911FC1199081

119862119863

1198912 (119909) 1198652 (

119911FC minus 1199081119909

1199082

)119889119909

= exp(minus119911FC21199081

) +1199082

1199081minus 1199082

exp(minus119911FC21199082

)

times exp(1199081 minus 1199082211990811199082

119911FC) minus exp(1199081 minus 1199082211990811199082

1199081119862119863)

=1199081

1199081minus 1199082

exp(minus119911FC21199081

)

minus1199082

1199081minus 1199082

exp(minus119911FC21199082

)120572minus(1199081

minus1199082

)1199082

119863

=1199081

1199081minus 1199082

exp(minus119911FC21199081

)

minus1199082

1199081minus 1199082

exp(minus119911FC21199082

)120572minus(1199081

minus1199082

)1199082

119863

(A4)

where 119911FC1199081 gt 119862119863

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

10 BioMed Research International

Acknowledgments

The authors are indebted to late Dr Gang Zheng for his inspi-ration and support on their workThisworkwas supported bygrant of the National Cancer Center (no NCC-1210060)

References

[1] R J Klein C Zeiss E Y Chew et al ldquoComplement factor Hpolymorphism in age-related macular degenerationrdquo Sciencevol 308 no 5720 pp 385ndash389 2005

[2] R H Duerr K D Taylor S R Brant et al ldquoA genome-wideassociation study identifies IL 23R as an inflammatory boweldisease generdquo Science vol 314 no 5804 pp 1461ndash1463 2006

[3] A Herbert N P Gerry M B McQueen et al ldquoA commongenetic variant is associated with adult and childhood obesityrdquoScience vol 312 no 5771 pp 279ndash283 2006

[4] R Sladek G Rocheleau J Rung et al ldquoA genome-wideassociation study identifies novel risk loci for type 2 diabetesrdquoNature vol 445 no 7130 pp 881ndash885 2007

[5] P R Burton D G Clayton L R Cardon et al ldquoGenome-wideassociation study of 14000 cases of seven common diseases and3000 shared controlsrdquo Nature vol 447 no 7145 pp 661ndash6782007

[6] S H Kwak S H Kim Y M Cho et al ldquoA genome-wideassociation study of gestational diabetes mellitus in Koreanwomenrdquo Diabetes vol 61 no 2 pp 531ndash541 2012

[7] W G Cochran ldquoSomemethods for strengthening the common1205942 testsrdquo Biometrics vol 10 no 4 pp 417ndash451 1954

[8] P Armitage ldquoTests for linear trends in proportions and frequen-ciesrdquo Biometrics vol 11 no 3 pp 375ndash386 1955

[9] P D Sasieni ldquoFrom genotypes to genes doubling the samplesizerdquo Biometrics vol 53 no 4 pp 1253ndash1261 1997

[10] B Freidlin G Zheng Z Li and J L Gastwirth ldquoTrend tests forcase-control studies of genetic markers power sample size androbustnessrdquo Human Heredity vol 53 no 3 pp 146ndash152 2002

[11] J L Gastwirth ldquoOn robust proceduresrdquo Journal of the AmericanStatistical Association vol 61 no 316 pp 929ndash948 1966

[12] J L Gastwirth ldquoThe use of maximin efficiency robust tests incombining contingency tables and survival analysisrdquo Journal ofthe American Statistical Association vol 80 no 390 pp 380ndash384 1985

[13] G Zheng and H K T Ng ldquoGenetic model selection in two-phase analysis for case-control association studiesrdquo Biostatisticsvol 9 no 3 pp 391ndash399 2008

[14] J Joo M Kwak and G Zheng ldquoImproving power for testinggenetic association in case-control studies by reducing thealternative spacerdquo Biometrics vol 66 no 1 pp 266ndash276 2010

[15] J Joo M Kwak K Ahn and G Zheng ldquoA robust genome-widescan statistic of theWellcome Trust Case-Control ConsortiumrdquoBiometrics vol 65 no 4 pp 1115ndash1122 2009

[16] P Kraft E Zeggini and J P A Ioannidis ldquoReplication ingenome-wide association studiesrdquo Statistical Science vol 24 no4 pp 561ndash573 2009

[17] D CThomas G Casey D V Conti R W Haile J P Lewingerand D O Stram ldquoMethodological issues in multistage genome-wide association studiesrdquo Statistical Science vol 24 no 4 pp414ndash429 2009

[18] A D Skol L J Scott G R Abecasis and M Boehnke ldquoJointanalysis is more efficient than replication-based analysis for

two-stage genome-wide association studiesrdquo Nature Geneticsvol 38 no 2 pp 209ndash213 2006

[19] A D Skol L J Scott G R Abecasis andM Boehnke ldquoOptimaldesigns for two-stage genome-wide association studiesrdquoGeneticEpidemiology vol 31 no 7 pp 776ndash788 2007

[20] G Zheng B Freidlin Z Li and J L Gastwirth ldquoChoice ofscores in trend tests for case-control studies of candidate-geneassociationsrdquo Biometrical Journal vol 45 no 3 pp 335ndash3482003

[21] K Song and R C Elston ldquoA powerful method of combiningmeasures of association and Hardy-Weinberg Disequilibriumfor fine-mapping in case-control studiesrdquo Statistics in Medicinevol 25 no 1 pp 105ndash126 2006

[22] S Won N Morris Q Lu and R C Elston ldquoChoosing anoptimal method to combine P-valuesrdquo Statistics in Medicinevol 28 no 11 pp 1537ndash1553 2009

[23] F BegumDGhoshGC Tseng andE Feingold ldquoComprehen-sive literature review and statistical considerations for GWASmeta-analysisrdquo Nucleic Acids Research vol 40 no 9 pp 3777ndash3784 2012

[24] B Efron and R J Tibshirani An Introduction to the BootstrapChapman amp HallCRC Press 1993

[25] K A Yoon J H Park J Han et al ldquoA genome-wide associationstudy reveals susceptibility variants for non-small cell lungcancer in the Korean populationrdquo Human Molecular Geneticsvol 19 no 24 pp 4948ndash4954 2010

[26] M I McCarthy G R Abecasis L R Cardon et al ldquoGenome-wide association studies for complex traits consensus uncer-tainty and challengesrdquoNature Reviews Genetics vol 9 no 5 pp356ndash369 2008

[27] I J Good ldquoOn the weighted combination of significance testsrdquoJournal of the Royal Statistical Society Series B vol 17 pp 264ndash265 1955

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Anatomy Research International

PeptidesInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

International Journal of

Volume 2014

Zoology

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Molecular Biology International

GenomicsInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioinformaticsAdvances in

Marine BiologyJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Signal TransductionJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioMed Research International

Evolutionary BiologyInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Biochemistry Research International

ArchaeaHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Genetics Research International

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Advances in

Virolog y

Hindawi Publishing Corporationhttpwwwhindawicom

Nucleic AcidsJournal of

Volume 2014

Stem CellsInternational

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Enzyme Research

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Microbiology

Page 9: Research Article Robust Association Tests for the Replication of …downloads.hindawi.com/journals/bmri/2015/461593.pdf · 2019. 7. 31. · Research Article Robust Association Tests

BioMed Research International 9

of the robust methods but they were not included in thereplication data set For these SNPs we were not able toperform the joint analysis that we propose and it was notpossible to examine whether there are other SNPs that couldhave been found to be associated with NSCLC by proposedmethods in the replication study For this reason we merelypresented howmany additional SNPs could have been furtherfollowed in the replication stage when robust methods wereused In many GWASs it is a common practice to reportthe summary test statistics and 119901 values of the SNPs undera specific genetic model usually an additive model whichwere further genotyped in the replication stage and werefinally defined to be significantly associated with a phenotypeof interest As emphasized in this paper one may have abetter chance of findingmanymissing SNPs by applyingmorepowerful and robust methods that consider different geneticmodels simultaneouslyTherefore we urge the community toshare test results under not only an additive model but alsoother genetic models although they were not significant at astringent significance level so that future research may haveenriched data resources to which robust tests can be appliedin association studies

Appendix

119901 value of Fisherrsquos Combination forEqual and Unequal Weights

Equal Weights Assume 1199081

= 1199082

= 1 Under the nullhypothesis of no association 119883

1= minus2 log119875(1) and 119883

2=

minus2 log119875(2) are independent and each asymptotically followsa 1205942 distribution with 2 degrees of freedom Let 119891

119896(119909) and

119865119896(119909) be the probability and cumulative density functions of

1205942 random variable with 119896 degrees of freedomThen 119891

2(119909) =

exp(minus1199092)2 1198914(119909) = 119909 exp(minus1199092)4 119865

2(119909) = 1 minus exp(minus1199092)

and1198654(119909) = 1minusexp(minus1199092)minus119909 exp(minus1199092)2 Denote the cutoff

of the discovery stage based on 120572119863as 119862119863 that is 119865

2(119862119863) =

1 minus 120572119863 For observed value 119911FC gt 119862

119863of1198831+ 1198832 the 119901 value

is written as

1198751198670

(1198831gt 119862119863 1198831+ 1198832gt 119911FC)

= 120572119863minus int

119911FC

119862119863

1198912 (119909) 1198652 (119911FC minus 119909) 119889119909

= 120572119863+ exp (minus

119911FC2) minus 120572119863+1

2exp(minus

119911FC2) (119911FC minus 119862119863)

= exp (minus119911FC2) (1 +

119911FC2

+ log120572119863)

(A1)

Unequal WeightsWhen different proportions of samples areused in the discovery and replication stages it may be moreappropriate to assignweights proportional to the sample sizesfor each stage For example when only a small portion isused in the discovery stage to prevent Fisherrsquos combinationtest from being dominated by the significant result in the

discovery stage one may want to assign a small weight to thediscovery stage result

When 120587119904is the proportion of samples used in the dis-

covery stage one selection for weights is 1199081

= 2120587119904and

1199082= 2(1 minus 120587

119904) for discovery and replication stages which

simplifies to equal weights when 120587119904= 05 Based on these

weights we consider unequal-weighted Fisherrsquos combinationasminus2 log119875(1)1199081119875(2)1199082 = 119908

11198831+11990821198832[27] Its density function

is given by

119891119908 (119909) =

1

2 (1199081minus 1199082)exp(minus 119909

(21199081))

minus1

2 (1199081minus 1199082)exp(minus 119909

(21199082))

(A2)

and the probability distribution function is

119865119908 (119909) = 1 minus

1199081

(1199081minus 1199082)exp(minus 119909

21199081

)

minus1199082

(1199081minus 1199082)exp(minus 119909

21199082

) 1199081

= 1199082

(A3)

Using the previous notation we have the following formof 119901 value

1198751198670

(1198831gt 119862119863 11990811198831+ 11990821198832gt 119911FC)

= 120572119863minus int

119911FC1199081

119862119863

1198912 (119909) 1198652 (

119911FC minus 1199081119909

1199082

)119889119909

= exp(minus119911FC21199081

) +1199082

1199081minus 1199082

exp(minus119911FC21199082

)

times exp(1199081 minus 1199082211990811199082

119911FC) minus exp(1199081 minus 1199082211990811199082

1199081119862119863)

=1199081

1199081minus 1199082

exp(minus119911FC21199081

)

minus1199082

1199081minus 1199082

exp(minus119911FC21199082

)120572minus(1199081

minus1199082

)1199082

119863

=1199081

1199081minus 1199082

exp(minus119911FC21199081

)

minus1199082

1199081minus 1199082

exp(minus119911FC21199082

)120572minus(1199081

minus1199082

)1199082

119863

(A4)

where 119911FC1199081 gt 119862119863

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

10 BioMed Research International

Acknowledgments

The authors are indebted to late Dr Gang Zheng for his inspi-ration and support on their workThisworkwas supported bygrant of the National Cancer Center (no NCC-1210060)

References

[1] R J Klein C Zeiss E Y Chew et al ldquoComplement factor Hpolymorphism in age-related macular degenerationrdquo Sciencevol 308 no 5720 pp 385ndash389 2005

[2] R H Duerr K D Taylor S R Brant et al ldquoA genome-wideassociation study identifies IL 23R as an inflammatory boweldisease generdquo Science vol 314 no 5804 pp 1461ndash1463 2006

[3] A Herbert N P Gerry M B McQueen et al ldquoA commongenetic variant is associated with adult and childhood obesityrdquoScience vol 312 no 5771 pp 279ndash283 2006

[4] R Sladek G Rocheleau J Rung et al ldquoA genome-wideassociation study identifies novel risk loci for type 2 diabetesrdquoNature vol 445 no 7130 pp 881ndash885 2007

[5] P R Burton D G Clayton L R Cardon et al ldquoGenome-wideassociation study of 14000 cases of seven common diseases and3000 shared controlsrdquo Nature vol 447 no 7145 pp 661ndash6782007

[6] S H Kwak S H Kim Y M Cho et al ldquoA genome-wideassociation study of gestational diabetes mellitus in Koreanwomenrdquo Diabetes vol 61 no 2 pp 531ndash541 2012

[7] W G Cochran ldquoSomemethods for strengthening the common1205942 testsrdquo Biometrics vol 10 no 4 pp 417ndash451 1954

[8] P Armitage ldquoTests for linear trends in proportions and frequen-ciesrdquo Biometrics vol 11 no 3 pp 375ndash386 1955

[9] P D Sasieni ldquoFrom genotypes to genes doubling the samplesizerdquo Biometrics vol 53 no 4 pp 1253ndash1261 1997

[10] B Freidlin G Zheng Z Li and J L Gastwirth ldquoTrend tests forcase-control studies of genetic markers power sample size androbustnessrdquo Human Heredity vol 53 no 3 pp 146ndash152 2002

[11] J L Gastwirth ldquoOn robust proceduresrdquo Journal of the AmericanStatistical Association vol 61 no 316 pp 929ndash948 1966

[12] J L Gastwirth ldquoThe use of maximin efficiency robust tests incombining contingency tables and survival analysisrdquo Journal ofthe American Statistical Association vol 80 no 390 pp 380ndash384 1985

[13] G Zheng and H K T Ng ldquoGenetic model selection in two-phase analysis for case-control association studiesrdquo Biostatisticsvol 9 no 3 pp 391ndash399 2008

[14] J Joo M Kwak and G Zheng ldquoImproving power for testinggenetic association in case-control studies by reducing thealternative spacerdquo Biometrics vol 66 no 1 pp 266ndash276 2010

[15] J Joo M Kwak K Ahn and G Zheng ldquoA robust genome-widescan statistic of theWellcome Trust Case-Control ConsortiumrdquoBiometrics vol 65 no 4 pp 1115ndash1122 2009

[16] P Kraft E Zeggini and J P A Ioannidis ldquoReplication ingenome-wide association studiesrdquo Statistical Science vol 24 no4 pp 561ndash573 2009

[17] D CThomas G Casey D V Conti R W Haile J P Lewingerand D O Stram ldquoMethodological issues in multistage genome-wide association studiesrdquo Statistical Science vol 24 no 4 pp414ndash429 2009

[18] A D Skol L J Scott G R Abecasis and M Boehnke ldquoJointanalysis is more efficient than replication-based analysis for

two-stage genome-wide association studiesrdquo Nature Geneticsvol 38 no 2 pp 209ndash213 2006

[19] A D Skol L J Scott G R Abecasis andM Boehnke ldquoOptimaldesigns for two-stage genome-wide association studiesrdquoGeneticEpidemiology vol 31 no 7 pp 776ndash788 2007

[20] G Zheng B Freidlin Z Li and J L Gastwirth ldquoChoice ofscores in trend tests for case-control studies of candidate-geneassociationsrdquo Biometrical Journal vol 45 no 3 pp 335ndash3482003

[21] K Song and R C Elston ldquoA powerful method of combiningmeasures of association and Hardy-Weinberg Disequilibriumfor fine-mapping in case-control studiesrdquo Statistics in Medicinevol 25 no 1 pp 105ndash126 2006

[22] S Won N Morris Q Lu and R C Elston ldquoChoosing anoptimal method to combine P-valuesrdquo Statistics in Medicinevol 28 no 11 pp 1537ndash1553 2009

[23] F BegumDGhoshGC Tseng andE Feingold ldquoComprehen-sive literature review and statistical considerations for GWASmeta-analysisrdquo Nucleic Acids Research vol 40 no 9 pp 3777ndash3784 2012

[24] B Efron and R J Tibshirani An Introduction to the BootstrapChapman amp HallCRC Press 1993

[25] K A Yoon J H Park J Han et al ldquoA genome-wide associationstudy reveals susceptibility variants for non-small cell lungcancer in the Korean populationrdquo Human Molecular Geneticsvol 19 no 24 pp 4948ndash4954 2010

[26] M I McCarthy G R Abecasis L R Cardon et al ldquoGenome-wide association studies for complex traits consensus uncer-tainty and challengesrdquoNature Reviews Genetics vol 9 no 5 pp356ndash369 2008

[27] I J Good ldquoOn the weighted combination of significance testsrdquoJournal of the Royal Statistical Society Series B vol 17 pp 264ndash265 1955

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Anatomy Research International

PeptidesInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

International Journal of

Volume 2014

Zoology

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Molecular Biology International

GenomicsInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioinformaticsAdvances in

Marine BiologyJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Signal TransductionJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioMed Research International

Evolutionary BiologyInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Biochemistry Research International

ArchaeaHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Genetics Research International

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Advances in

Virolog y

Hindawi Publishing Corporationhttpwwwhindawicom

Nucleic AcidsJournal of

Volume 2014

Stem CellsInternational

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Enzyme Research

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Microbiology

Page 10: Research Article Robust Association Tests for the Replication of …downloads.hindawi.com/journals/bmri/2015/461593.pdf · 2019. 7. 31. · Research Article Robust Association Tests

10 BioMed Research International

Acknowledgments

The authors are indebted to late Dr Gang Zheng for his inspi-ration and support on their workThisworkwas supported bygrant of the National Cancer Center (no NCC-1210060)

References

[1] R J Klein C Zeiss E Y Chew et al ldquoComplement factor Hpolymorphism in age-related macular degenerationrdquo Sciencevol 308 no 5720 pp 385ndash389 2005

[2] R H Duerr K D Taylor S R Brant et al ldquoA genome-wideassociation study identifies IL 23R as an inflammatory boweldisease generdquo Science vol 314 no 5804 pp 1461ndash1463 2006

[3] A Herbert N P Gerry M B McQueen et al ldquoA commongenetic variant is associated with adult and childhood obesityrdquoScience vol 312 no 5771 pp 279ndash283 2006

[4] R Sladek G Rocheleau J Rung et al ldquoA genome-wideassociation study identifies novel risk loci for type 2 diabetesrdquoNature vol 445 no 7130 pp 881ndash885 2007

[5] P R Burton D G Clayton L R Cardon et al ldquoGenome-wideassociation study of 14000 cases of seven common diseases and3000 shared controlsrdquo Nature vol 447 no 7145 pp 661ndash6782007

[6] S H Kwak S H Kim Y M Cho et al ldquoA genome-wideassociation study of gestational diabetes mellitus in Koreanwomenrdquo Diabetes vol 61 no 2 pp 531ndash541 2012

[7] W G Cochran ldquoSomemethods for strengthening the common1205942 testsrdquo Biometrics vol 10 no 4 pp 417ndash451 1954

[8] P Armitage ldquoTests for linear trends in proportions and frequen-ciesrdquo Biometrics vol 11 no 3 pp 375ndash386 1955

[9] P D Sasieni ldquoFrom genotypes to genes doubling the samplesizerdquo Biometrics vol 53 no 4 pp 1253ndash1261 1997

[10] B Freidlin G Zheng Z Li and J L Gastwirth ldquoTrend tests forcase-control studies of genetic markers power sample size androbustnessrdquo Human Heredity vol 53 no 3 pp 146ndash152 2002

[11] J L Gastwirth ldquoOn robust proceduresrdquo Journal of the AmericanStatistical Association vol 61 no 316 pp 929ndash948 1966

[12] J L Gastwirth ldquoThe use of maximin efficiency robust tests incombining contingency tables and survival analysisrdquo Journal ofthe American Statistical Association vol 80 no 390 pp 380ndash384 1985

[13] G Zheng and H K T Ng ldquoGenetic model selection in two-phase analysis for case-control association studiesrdquo Biostatisticsvol 9 no 3 pp 391ndash399 2008

[14] J Joo M Kwak and G Zheng ldquoImproving power for testinggenetic association in case-control studies by reducing thealternative spacerdquo Biometrics vol 66 no 1 pp 266ndash276 2010

[15] J Joo M Kwak K Ahn and G Zheng ldquoA robust genome-widescan statistic of theWellcome Trust Case-Control ConsortiumrdquoBiometrics vol 65 no 4 pp 1115ndash1122 2009

[16] P Kraft E Zeggini and J P A Ioannidis ldquoReplication ingenome-wide association studiesrdquo Statistical Science vol 24 no4 pp 561ndash573 2009

[17] D CThomas G Casey D V Conti R W Haile J P Lewingerand D O Stram ldquoMethodological issues in multistage genome-wide association studiesrdquo Statistical Science vol 24 no 4 pp414ndash429 2009

[18] A D Skol L J Scott G R Abecasis and M Boehnke ldquoJointanalysis is more efficient than replication-based analysis for

two-stage genome-wide association studiesrdquo Nature Geneticsvol 38 no 2 pp 209ndash213 2006

[19] A D Skol L J Scott G R Abecasis andM Boehnke ldquoOptimaldesigns for two-stage genome-wide association studiesrdquoGeneticEpidemiology vol 31 no 7 pp 776ndash788 2007

[20] G Zheng B Freidlin Z Li and J L Gastwirth ldquoChoice ofscores in trend tests for case-control studies of candidate-geneassociationsrdquo Biometrical Journal vol 45 no 3 pp 335ndash3482003

[21] K Song and R C Elston ldquoA powerful method of combiningmeasures of association and Hardy-Weinberg Disequilibriumfor fine-mapping in case-control studiesrdquo Statistics in Medicinevol 25 no 1 pp 105ndash126 2006

[22] S Won N Morris Q Lu and R C Elston ldquoChoosing anoptimal method to combine P-valuesrdquo Statistics in Medicinevol 28 no 11 pp 1537ndash1553 2009

[23] F BegumDGhoshGC Tseng andE Feingold ldquoComprehen-sive literature review and statistical considerations for GWASmeta-analysisrdquo Nucleic Acids Research vol 40 no 9 pp 3777ndash3784 2012

[24] B Efron and R J Tibshirani An Introduction to the BootstrapChapman amp HallCRC Press 1993

[25] K A Yoon J H Park J Han et al ldquoA genome-wide associationstudy reveals susceptibility variants for non-small cell lungcancer in the Korean populationrdquo Human Molecular Geneticsvol 19 no 24 pp 4948ndash4954 2010

[26] M I McCarthy G R Abecasis L R Cardon et al ldquoGenome-wide association studies for complex traits consensus uncer-tainty and challengesrdquoNature Reviews Genetics vol 9 no 5 pp356ndash369 2008

[27] I J Good ldquoOn the weighted combination of significance testsrdquoJournal of the Royal Statistical Society Series B vol 17 pp 264ndash265 1955

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Anatomy Research International

PeptidesInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

International Journal of

Volume 2014

Zoology

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Molecular Biology International

GenomicsInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioinformaticsAdvances in

Marine BiologyJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Signal TransductionJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioMed Research International

Evolutionary BiologyInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Biochemistry Research International

ArchaeaHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Genetics Research International

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Advances in

Virolog y

Hindawi Publishing Corporationhttpwwwhindawicom

Nucleic AcidsJournal of

Volume 2014

Stem CellsInternational

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Enzyme Research

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Microbiology

Page 11: Research Article Robust Association Tests for the Replication of …downloads.hindawi.com/journals/bmri/2015/461593.pdf · 2019. 7. 31. · Research Article Robust Association Tests

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Anatomy Research International

PeptidesInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

International Journal of

Volume 2014

Zoology

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Molecular Biology International

GenomicsInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioinformaticsAdvances in

Marine BiologyJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Signal TransductionJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioMed Research International

Evolutionary BiologyInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Biochemistry Research International

ArchaeaHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Genetics Research International

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Advances in

Virolog y

Hindawi Publishing Corporationhttpwwwhindawicom

Nucleic AcidsJournal of

Volume 2014

Stem CellsInternational

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Enzyme Research

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Microbiology