assessment of population structure and its effects on genome-wide association studies

This article was downloaded by: [Stony Brook University] On: 02 November 2014, At: 06:41 Publisher: Taylor & Francis Informa Ltd Registered in England and Wales Registered Number: 1072954 Registered office: Mortimer House, 37-41 Mortimer Street, London W1T 3JH, UK Communications in Statistics - Theory and Methods Publication details, including instructions for authors and subscription information: http://www.tandfonline.com/loi/lsta20 Assessment of Population Structure and Its Effects on Genome-Wide Association Studies Hongyan Xu a & Varghese George a a Department of Biostatistics , Medical College of Georgia , Augusta, Georgia, USA Published online: 20 Aug 2009. To cite this article: Hongyan Xu & Varghese George (2009) Assessment of Population Structure and Its Effects on Genome-Wide Association Studies, Communications in Statistics - Theory and Methods, 38:16-17, 2843-2855, DOI: 10.1080/03610920902947188 To link to this article: http://dx.doi.org/10.1080/03610920902947188 PLEASE SCROLL DOWN FOR ARTICLE Taylor & Francis makes every effort to ensure the accuracy of all the information (the “Content”) contained in the publications on our platform. However, Taylor & Francis, our agents, and our licensors make no representations or warranties whatsoever as to the accuracy, completeness, or suitability for any purpose of the Content. Any opinions and views expressed in this publication are the opinions and views of the authors, and are not the views of or endorsed by Taylor & Francis. The accuracy of the Content should not be relied upon and should be independently verified with primary sources of information. Taylor and Francis shall not be liable for any losses, actions, claims, proceedings, demands, costs, expenses, damages, and other liabilities whatsoever or howsoever caused arising directly or indirectly in connection with, in relation to or arising out of the use of the Content. This article may be used for research, teaching, and private study purposes. 
Any substantial or systematic reproduction, redistribution, reselling, loan, sub-licensing, systematic supply, or distribution in any form to anyone is expressly forbidden. Terms & Conditions of access and use can be found at http:// www.tandfonline.com/page/terms-and-conditions



Communications in Statistics - Theory and Methods, 38: 2843–2855, 2009
Copyright © Taylor & Francis Group, LLC
ISSN: 0361-0926 print/1532-415X online
DOI: 10.1080/03610920902947188

Assessment of Population Structure and Its Effects on Genome-Wide Association Studies

HONGYAN XU AND VARGHESE GEORGE

Department of Biostatistics, Medical College of Georgia, Augusta, Georgia, USA

Large-scale genome-wide association studies are promising for unraveling the genetic basis of complex diseases. However, population structure is a potential problem, the effects of which on genetic association studies are controversial. Quantification of the effects of population structure on large-scale genetic association studies is needed for valid analysis of data and correct interpretation of results. In this study, we performed an extensive coalescent-based simulation study with varying levels of population structure to investigate the effects of population structure on large-scale genetic association studies. The effects of population structure are measured by the multiplicative change in the probability of Type I error, which is then correlated with the level of population structure. It is found that at each nominal level of the association tests, there is a positive relationship between the level of population structure and its effects, which can be summarized well with a regression function. It is also found that at a specific level of population structure, its effect on association studies increases drastically as the significance level of the test decreases. The Type I error is inflated by an amount approximately equal to Wright’s $F_{ST}$, a measure that is used to quantify the magnitude of population structure. Therefore, in genome-wide association studies, the effects of population structure cannot be safely ignored and must be accounted for with proper methods. This study provides quantitative guidelines to account for the effects of population structure on genome-wide association studies in admixed populations.

Keywords Complex diseases; False positives; Genetic variation; Genome-wide association; Heterozygosity; Population structure; SNP.

Mathematics Subject Classification Primary 62P10; Secondary 92D10.

1. Introduction

Complex human diseases such as hypertension, diabetes, and cancer pose great public health burdens on human society. These diseases are “complex” because the etiology involves both environmental factors and multiple genes, with each contributing small to moderate effects. Identification of the underlying genetic factors would be the initial major step toward understanding the molecular mechanisms of these diseases. There are two primary statistical approaches to mapping genes of complex diseases: linkage and association. Genetic linkage analysis is the traditional approach that tests for correlation of transmission patterns of genetic markers and disease traits in families, while association studies test for the correlation of a specific genetic variant (allele) with a disease trait in populations. Generally speaking, association studies have more power and provide greater resolution than linkage analysis (Risch and Merikangas, 1996). Association studies also have simpler study designs, and samples are relatively easier to ascertain than for linkage analysis. Consequently, genetic association studies have become increasingly popular in recent years. With the development of genomic technologies, large-scale genome-wide association studies are becoming increasingly feasible and hold great promise for unraveling the genetic basis of complex diseases. This is made possible by the recent advances in human genome research, in particular, the completion of the human genome sequence (International Human Genome Sequencing Consortium, 2004; Lander et al., 2001; Venter et al., 2001), the availability of millions of genetic markers, especially single nucleotide polymorphisms (SNPs) in public databases, the initiation of the International HapMap Project (The International HapMap Consortium, 2003), and improvements in genotyping technologies such as the 500K SNP chip. However, the size of the resulting data sets raises significant issues of analysis and interpretation.

Received November 25, 2008; Accepted February 5, 2009.
Address correspondence to Varghese George, Department of Biostatistics, Medical College of Georgia, 1120 15th St, AE-1001, Augusta, GA 30904, USA; E-mail: [email protected]

One of the potential problems in association studies is the presence of population structure, which raises the potential for confounding that leads to spurious significant results. For example, if the sample comes from a population having several subpopulations with different allele frequencies, and if the proportions of cases and controls sampled from each subpopulation are not matched, differences in allele frequencies between cases and controls will mimic a statistical signal of association and lead to false-positive results. This concern has influenced the design and interpretation of association studies (Spence et al., 2003). However, the magnitude of the effects of population structure in real large-scale association studies is still controversial. While some investigators believe that it is not a major concern because the levels of population structure in the general human population are small (Ardlie et al., 2002), others argue strongly against ignoring the effect of population structure, especially for large-scale association studies (Marchini et al., 2004). Several statistical methods have been proposed to correct for population structure in genetic association studies (Devlin and Roeder, 1999; Price et al., 2006; Pritchard et al., 2000b). These approaches are quite different and rest on different theoretical bases. It is unclear when one should use a specific method as opposed to another. Therefore, a systematic assessment is needed to quantify the levels of population structure and its effects on genetic association studies.

The first step to quantify the effects of population structure is to choose an appropriate measure of population structure. A commonly used measure is Wright’s $F_{ST}$ (Wright, 1943). In a large population, consider a genetic locus with two alleles, $A_1$ and $A_2$. There are three possible genotypes at this locus in the population: $A_1A_1$, $A_1A_2$, and $A_2A_2$. Among these three genotypes, $A_1A_1$ and $A_2A_2$ are called homozygotes and $A_1A_2$ is called a heterozygote. The frequency of heterozygotes in the population is called heterozygosity, denoted $H$, and it is a measure of the genetic variation at this particular locus. Under random mating and other regular assumptions, the genotype distribution will reach Hardy–Weinberg equilibrium (HWE). That is, the frequencies of $A_1A_1$, $A_1A_2$, and $A_2A_2$ will be $p^2$, $2pq$, and $q^2$, respectively, where $p$ is the frequency of allele $A_1$ and $q = 1 - p$ is the frequency of allele $A_2$. When a large population splits into several small sub-populations, these sub-populations evolve somewhat independently and will diverge genetically. In this case, it is useful to differentiate genetic variation between sub-populations from that within sub-populations. $F_{ST}$ is a measure of population structure that quantifies the proportion of genetic variation between sub-populations with respect to the total genetic variation. In the presence of population structure, the observed heterozygosity within sub-populations will deviate from the heterozygosity under HWE, with fewer heterozygotes than expected under HWE. We can measure the genetic variation within sub-populations using the average heterozygosity across sub-populations, denoted $H_S$. Similarly, the overall genetic variation in the total population can be measured using the expected heterozygosity under HWE, denoted $H_T$. Then $F_{ST}$ is given by $F_{ST} = (H_T - H_S)/H_T$.
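As a small numerical illustration of the heterozygosity-based definition, the following Python sketch computes the within-subpopulation heterozygosity, the total expected heterozygosity, and their ratio-based measure for one biallelic locus. The function name, the equal weighting of subpopulations, and the use of HWE heterozygosity 2p(1 - p) inside each subpopulation are assumptions made for this example; they are not taken from the study.

```python
def fst_from_frequencies(freqs):
    """Return (H_S, H_T, F_ST) for one biallelic locus.

    freqs: frequency of allele A1 in each subpopulation.
    Assumptions (illustrative only): subpopulations are weighted equally
    and each subpopulation is itself in Hardy-Weinberg equilibrium.
    """
    if not freqs:
        raise ValueError("need at least one subpopulation")
    # Average heterozygosity within subpopulations: mean of 2 * p * (1 - p).
    h_s = sum(2.0 * p * (1.0 - p) for p in freqs) / len(freqs)
    # Expected heterozygosity of the total population at the mean frequency.
    p_bar = sum(freqs) / len(freqs)
    h_t = 2.0 * p_bar * (1.0 - p_bar)
    if h_t == 0.0:
        raise ValueError("locus is monomorphic; F_ST is undefined")
    return h_s, h_t, (h_t - h_s) / h_t
```

With two subpopulations at allele frequencies 0.2 and 0.8, the within-subpopulation heterozygosity is 0.32, the total expected heterozygosity is 0.5, and the resulting value is 0.36. Identical frequencies in all subpopulations give a value of 0.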

In Sec. 2, we review methods of estimating $F_{ST}$ and procedures for testing association in the presence of population structure. We then present the simulation model used to generate samples with population structure, and the results from the simulation study. In Sec. 3, we present a new estimator of population structure using all the marker information simultaneously, and investigate its relationship with $F_{ST}$.

2. Methods

2.1. Estimation of $F_{ST}$

$F_{ST}$ is estimated using the unbiased estimator, $\hat{F}_{ST}$, described by Weir and Cockerham (1984). Specifically, suppose samples are drawn from $P$ subpopulations and there are two alleles, $A$ and $a$, at any given SNP. Let the frequency of allele $A$ in the $i$th population be $p_i$, the average allele frequency across populations be $\bar{p}$, and the sample size from the $i$th population be $n_i$. Then the observed mean square error of allele frequency within a population, denoted MSI, is given by

$$\mathrm{MSI} = \frac{1}{\sum_{i=1}^{P} (n_i - 1)} \sum_{i=1}^{P} n_i p_i (1 - p_i).$$

The observed mean square error of allele frequency between populations, denoted MSP, is given by

$$\mathrm{MSP} = \frac{1}{P - 1} \sum_{i=1}^{P} n_i (p_i - \bar{p})^2.$$

$F_{ST}$ is then estimated as

$$\hat{F}_{ST} = \frac{\mathrm{MSP} - \mathrm{MSI}}{\mathrm{MSP} + (n_c - 1)\,\mathrm{MSI}},$$

where $n_c$ is the average sample size across subpopulations, corrected for the different sample sizes from each population, which is given by

$$n_c = \frac{1}{P - 1} \left( \sum_{i=1}^{P} n_i - \frac{\sum_{i=1}^{P} n_i^2}{\sum_{i=1}^{P} n_i} \right).$$
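The estimator can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name is hypothetical, the average allele frequency is assumed to be the sample-size-weighted mean, the within-population mean square is assumed to divide by the sum of (n_i - 1), and the denominator is assumed to take the Weir and Cockerham form MSP + (n_c - 1) MSI.

```python
def weir_cockerham_fst(freqs, sizes):
    """Estimate F_ST at one biallelic SNP.

    freqs: frequency of allele A in each subpopulation (p_i).
    sizes: sample size from each subpopulation (n_i).
    Assumption: p_bar is the sample-size-weighted mean allele frequency.
    """
    if len(freqs) != len(sizes) or len(freqs) < 2:
        raise ValueError("need matching inputs for at least two subpopulations")
    n_pops = len(freqs)
    n_total = sum(sizes)
    p_bar = sum(n * p for n, p in zip(sizes, freqs)) / n_total
    # Mean square within subpopulations (MSI).
    msi = sum(n * p * (1 - p) for n, p in zip(sizes, freqs)) / sum(n - 1 for n in sizes)
    # Mean square between subpopulations (MSP).
    msp = sum(n * (p - p_bar) ** 2 for n, p in zip(sizes, freqs)) / (n_pops - 1)
    # Sample-size correction n_c.
    n_c = (n_total - sum(n * n for n in sizes) / n_total) / (n_pops - 1)
    denominator = msp + (n_c - 1) * msi
    if denominator == 0:
        raise ValueError("SNP is monomorphic in the sample; estimate is undefined")
    return (msp - msi) / denominator
```

Two subpopulations fixed for different alleles give an estimate of 1. Two equally sized samples with identical allele frequencies give a slightly negative estimate, which is expected behavior for an unbiased estimator when the true value is zero.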

2.2. Measuring Association at an SNP Locus in the Presence of Population Structure

In order to quantify the effects of population structure on tests for association, we simulated samples under the null hypothesis of no association, where we randomly assigned case/control status while keeping population structure in the total sample. For each SNP locus, 1,000 random assignments of case/control status were performed. The association for each assignment was measured using Armitage’s trend test under an additive genetic model (Armitage, 1955), which has been shown to be robust against deviations of genotype frequencies from Hardy–Weinberg proportions (Sasieni et al., 1997). Suppose the two alleles at an SNP locus are denoted $A$ and $a$; then the test statistic is given by

$$Y^2 = \frac{N \left[ N(k_1 + 2k_2) - k(n_1 + 2n_2) \right]^2}{k(N - k)\left[ N(n_1 + 4n_2) - (n_1 + 2n_2)^2 \right]},$$

where $N$ is the total sample size, $k$ is the number of cases, $n_1$ and $n_2$ are the numbers of individuals with genotypes $Aa$ and $AA$, respectively, in the total sample, and $k_1$ and $k_2$ are the numbers of cases with genotypes $Aa$ and $AA$, respectively. Under the null hypothesis of no association, and assuming no population structure, the test statistic $Y^2$ follows a $\chi^2$ distribution with one degree of freedom. $Y^2$ was computed for each of the 1,000 samples with randomly assigned case/control status at each SNP. The empirical distribution of $Y^2$ from samples with population structure was compared with the $\chi^2_1$ distribution to study the effects of population structure under the null hypothesis.
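The trend statistic and the random assignment of case/control status can be sketched as follows. The function names, the 0/1/2 genotype coding (copies of allele A), and the seeding are assumptions for illustration; the study's actual implementation is not described in the text.

```python
import random


def trend_test_statistic(genotypes, is_case):
    """Armitage trend statistic Y^2 under an additive model.

    genotypes: copies of allele A per individual (0, 1, or 2).
    is_case:   True for cases, False for controls, in the same order.
    """
    n_total = len(genotypes)
    k = sum(is_case)                                   # number of cases
    n1 = sum(g == 1 for g in genotypes)                # Aa in total sample
    n2 = sum(g == 2 for g in genotypes)                # AA in total sample
    k1 = sum(g == 1 and c for g, c in zip(genotypes, is_case))  # Aa cases
    k2 = sum(g == 2 and c for g, c in zip(genotypes, is_case))  # AA cases
    numerator = n_total * (n_total * (k1 + 2 * k2) - k * (n1 + 2 * n2)) ** 2
    denominator = k * (n_total - k) * (n_total * (n1 + 4 * n2) - (n1 + 2 * n2) ** 2)
    if denominator == 0:
        raise ValueError("statistic undefined: no variation in genotype or status")
    return numerator / denominator


def null_statistics(genotypes, n_cases, n_assignments=1000, seed=0):
    """Y^2 values from random case/control assignments (hypothetical helper)."""
    rng = random.Random(seed)
    labels = [True] * n_cases + [False] * (len(genotypes) - n_cases)
    stats = []
    for _ in range(n_assignments):
        rng.shuffle(labels)  # random assignment keeps the case count fixed
        stats.append(trend_test_statistic(genotypes, labels))
    return stats
```

As a check on the formula, the statistic equals the total sample size times the squared correlation between genotype count and case status, so it is 0 when the genotype distributions of cases and controls are identical.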

2.3. Simulation Models for the Assessment of Population Structure

In order to systematically examine the effect of population structure on genetic association studies, we took a computer simulation approach to generate samples with various levels of population structure. All the simulations were performed under the null hypothesis of no association between the SNPs and the disease status. The basic study design is a case-control design. The samples were simulated using the standard coalescent-based algorithm assuming a finite-island model (Wright, 1943), which is commonly used to model subdivided populations. Specifically, a homogeneous ancestral population was split into $k$ isolated sub-populations $T$ generations before the current sampling time. The sub-populations are assumed to evolve independently after they separated from the ancestral population. Varying levels of population structure arise across simulations, because each simulation is a realization of the stochastic evolutionary process leading to the simulated samples. Coalescent theory is a recent development in theoretical population genetics concerning the stochastic evolution process (Kingman, 1982a,b, 2000). It has reshaped the field of population genetics and is efficient in simulating samples under various demographic models, because it focuses on the history of samples rather than the statistical properties of the entire population (Hudson, 1991). It differs from the traditional approach in that it looks backward in time from the current sampling time. Looking backward, we see that samples join together, or coalesce, successively at common ancestors and, thus, the history of the sample can be described as a tree. The coalescent-based algorithm first generates a binary tree from the sample to their common ancestor, with the branch length following an exponential distribution with mean $4N/[k(k-1)]$, where $N$ is the effective population size and $k$ here denotes the number of samples at a particular time. Given the branch length $t$, the algorithm then simulates the number of mutations along each branch from a Poisson distribution with mean $\mu t$, where $\mu$ is the mutation rate per sequence per generation. In the finite-island model, the samples from each subpopulation were allowed to coalesce in their respective subpopulations until the separation time $T$, and the remaining samples were pooled together to continue the coalescence process in the ancestral population until one single common ancestor for all the sample sequences was reached (Hudson, 1991).
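The two sampling steps of the coalescent algorithm (exponential waiting times between coalescence events, then Poisson-distributed mutations given the branch lengths) can be illustrated with a deliberately simplified Python sketch. It assumes a single unstructured population without recombination, so it does not reproduce the finite-island split at time T; the function name, parameter names, and the mutation-rate symbol `mu` are assumptions for the example.

```python
import random


def simulate_coalescent(n_samples, pop_size, mu, seed=None):
    """Simulate coalescence times and the mutation count for one locus.

    n_samples: number of sampled sequences.
    pop_size:  effective population size N.
    mu:        mutation rate per sequence per generation (assumed symbol).
    Returns (waiting_times, total_branch_length, n_mutations).
    """
    rng = random.Random(seed)
    waiting_times = []
    total_length = 0.0
    for k in range(n_samples, 1, -1):
        # While k lineages remain, the waiting time to the next coalescence
        # is exponential with mean 4N / (k (k - 1)).
        mean_wait = 4.0 * pop_size / (k * (k - 1))
        t = rng.expovariate(1.0 / mean_wait)
        waiting_times.append(t)
        total_length += k * t  # k branches each extend by t
    # Mutations on independent branches are Poisson(mu * t), so the total is
    # Poisson(mu * total_length); draw it by counting unit-rate arrivals.
    target = mu * total_length
    n_mutations = 0
    clock = rng.expovariate(1.0)
    while clock <= target:
        n_mutations += 1
        clock += rng.expovariate(1.0)
    return waiting_times, total_length, n_mutations
```

Under the infinite-sites assumption each simulated mutation corresponds to one SNP, so the returned count plays the role of the (random) number of SNPs in a region. The expected time to the common ancestor in this sketch is 4N(1 - 1/n) generations for n sampled sequences.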

We simulated samples of 1,000 individuals from 2 subpopulations. Among them, 900 are from population 1 and 100 are from population 2. Genotypes at SNP markers were simulated from 100 consecutive genomic regions among which recombination events were allowed. The SNP genotypes were derived from the simulated coalescent tree assuming an infinite-sites model, where each mutation creates a new SNP site. Therefore, the number of SNPs in each simulation is a random variable equal to the number of mutations in the simulation given a fixed mutation rate. For each simulated sample, we randomly selected 500 individuals from population 1 as cases, and the remaining 400 individuals from population 1 and 100 individuals from population 2 were treated as controls. This mimics a sampling scenario where cases were from one population, while the controls were mismatched with a small fraction from the other population.

2.4. Simulation Results

We performed 1,000 simulations. For each replicate, we generated SNP genotypes from 100 consecutive genomic regions. The number of markers varies across replicates, ranging from 433 to 1078 with a mean of 719. Figure 1 gives the allele frequency distribution of the simulated markers in the two sub-populations. The number of alleles in each bin decreases as the allele frequency increases, which is similar to the allele frequency distribution for SNPs reported from the HapMap project (Frazer et al., 2007). The frequency in each bin is also similar. For each simulated sample, we also generated 1,000 samples with randomly assigned case/control status as outlined in the Methods section. Therefore, for each simulated sample with a specific level of population structure, the number of association tests is on the order of $10^6$, which ensures that we have enough tests to generate the empirical distribution of the $Y^2$ statistic and examine the effects of population structure even at very low significance levels. Samples were simulated using the coalescent algorithm assuming the finite-island model as detailed in Sec. 2.3. Each simulated sample is a realization of a stochastic evolutionary process. The levels of population structure in our simulated samples, measured by $F_{ST}$, range from 0.022 to 0.487, with an estimated mean value of 0.15 and standard deviation of 0.08. Figure 2 shows the histogram of the distribution of $F_{ST}$ in the samples.


Figure 1. Allele frequency distribution in the two sub-populations from the simulated samples.

Figure 2. Histogram of the distribution of estimates of $F_{ST}$ from the simulated samples.


The effects of population structure on genetic association studies were assessed by comparing the empirical distribution of the $Y^2$ test statistic under the null hypothesis to the expected $\chi^2_1$ distribution. Given a specific significance level $\alpha$, the number of tests in which the null hypothesis is rejected can be counted among all the tests for a specific level of population structure. The empirical Type I error probability, $E$, is estimated as the ratio of the number of tests rejected under the null hypothesis to the total number of tests. The empirical Type I error probability is then compared to the nominal level. The ratio of the empirical Type I error probability to the nominal Type I error probability is defined as the multiplicative change of Type I error, denoted $M$. Since the simulations were performed under the null hypothesis, allele frequencies in cases and controls differ not because of true association with the genetic marker but because of the presence of population structure. If population structure has little effect on the association tests, the empirical Type I error probability and the nominal level will generally agree, and the multiplicative change will be close to 1. On the other hand, if population structure has a significant effect on the association test, the empirical Type I error probability will be higher than the nominal level, and the multiplicative change will deviate from the null value of 1. Therefore, the magnitude of the multiplicative change of Type I error can be used as a measure to quantify the effects of population structure on association studies.
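The empirical Type I error probability and its multiplicative change can be computed from a list of null test statistics as in the sketch below. The function names are hypothetical; the p-value uses the identity that a chi-square variable with one degree of freedom is the square of a standard normal variable, so its upper tail probability at y is erfc(sqrt(y / 2)).

```python
import math


def chi2_1_pvalue(y):
    """Upper-tail probability of a chi-square(1) variable at y >= 0."""
    return math.erfc(math.sqrt(y / 2.0))


def multiplicative_change(statistics, alpha):
    """Return (E, M) for test statistics generated under the null.

    statistics: Y^2 values from tests performed under no association.
    alpha:      nominal significance level.
    E is the fraction of tests rejected at level alpha; M = E / alpha.
    """
    if not statistics:
        raise ValueError("no test statistics supplied")
    rejected = sum(chi2_1_pvalue(y) < alpha for y in statistics)
    empirical = rejected / len(statistics)
    return empirical, empirical / alpha
```

With well-calibrated tests the second returned value should be close to 1; values far above 1 indicate inflation. Estimating it reliably at a very small nominal level requires many more tests than the reciprocal of that level, which is why the number of tests per simulated sample matters.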

Figure 3 shows the multiplicative changes of Type I error at various nominal significance levels. The multiplicative changes are shown on a logarithmic scale with base 10. Two results are obvious from the figure. First, the Type I error probability is inflated because of population structure, and the level of inflation can be very high. For example, the empirical Type I error probability is inflated by over 1,000-fold for all simulated samples with varying $F_{ST}$ when the tests were performed at the $10^{-5}$ level. Thus, in this case, ignoring population structure will result in highly spurious significance and false positives. As expected, the magnitude of the inflation increases with the level of population structure, as reflected by the increase in multiplicative changes. The other interesting result is that the magnitude of inflation of the Type I error probability depends on the significance level of the test. For example, when $F_{ST} = 0.065$, which is a likely value for major human populations (Nei, 1975; Rosenberg et al., 2002), the Type I error probability was inflated on average by 8, 15, 29, and 54 times when the significance levels of the tests were $10^{-2}$, $10^{-3}$, $10^{-4}$, and $10^{-5}$, respectively. In all cases, the Type I error probability was inflated further when the significance levels were lower.

Figure 3. Effects of population structure on genetic association studies as measured by multiplicative changes of Type I error probability versus $F_{ST}$ at several significance levels.

Table 1. Estimates of regression coefficients of multiplicative changes of Type I error probability on $F_{ST}$ at various significance levels.

Significance level    Slope (SE)            Intercept (SE)      $R^2$
$10^{-2}$             114.96 (3.44)         4.10 (2.63)         0.919
$10^{-3}$             1120.7 (27.27)        −8.27 (5.03)        0.945
$10^{-4}$             10807.97 (305.35)     −368.83 (56.36)     0.927
$10^{-5}$             104095 (3459.9)       −5533 (638.6)       0.902

To further characterize the effects of population structure on genetic association studies and provide quantitative guidelines to gauge the magnitude of the inflation of Type I error probabilities, we fit a linear regression function for the multiplicative changes at each significance level of the tests, with the measure of population structure as the predictor variable; i.e., the regression function is $M = \beta_0 + \beta_1 F_{ST} + \varepsilon$, where $\beta_0$ and $\beta_1$ are the regression coefficients.

Table 1 gives the slope, intercept, and $R^2$ values of the regression function at different significance levels. The standard errors of the slope and the intercept are given in parentheses. From the $R^2$ values, it can be seen that the regression line fits the data well. With the regression line, we can make quantitative predictions of the actual Type I error probability for a given level of population structure at a nominal significance level.
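Such a prediction amounts to evaluating the fitted line and multiplying by the nominal level. The sketch below transcribes the slopes and intercepts from Table 1; the dictionary layout, the function name, and the choice to index by the base-10 exponent of the significance level are assumptions for illustration.

```python
# (slope, intercept) of the fitted lines, keyed by log10 of the nominal level.
TABLE_1 = {
    -2: (114.96, 4.10),
    -3: (1120.7, -8.27),
    -4: (10807.97, -368.83),
    -5: (104095.0, -5533.0),
}


def predict_inflation(fst, log10_alpha):
    """Return (M, E) predicted by the fitted line for a tabulated level.

    fst:         level of population structure.
    log10_alpha: -2, -3, -4, or -5, i.e. nominal level 10**log10_alpha.
    M is the predicted multiplicative change; E = alpha * M is the
    predicted actual Type I error probability.
    """
    slope, intercept = TABLE_1[log10_alpha]
    alpha = 10.0 ** log10_alpha
    m = intercept + slope * fst
    return m, alpha * m
```

For instance, `predict_inflation(0.15, -2)` evaluates the first fitted line at the mean simulated structure level of 0.15 and gives a multiplicative change of about 21.3, that is, a predicted actual error rate near 0.21 at a nominal level of 0.01. The lines are fitted over the simulated range (0.022 to 0.487) and should not be extrapolated; the lines with negative intercepts return meaningless negative values for small enough structure levels.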

As evident from Table 1, there is approximately a ten-fold increase in slope each time the nominal $\alpha$ decreases by a factor of 10. Also, the slope is approximately equal to the reciprocal of the nominal $\alpha$. The theoretical value for the intercept is 1 (the value of the multiplicative change under no population structure). Noting that the outcome variable is the ratio of the actual (empirical) Type I error to the nominal significance level, this implies that the actual Type I error is approximately the sum of the nominal significance level and $F_{ST}$, which is a very interesting and, potentially, a highly useful result. However, further confirmatory investigation is needed to support this conclusion.
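The step from the regression coefficients to this additive approximation can be written out explicitly. Writing $\alpha$ for the nominal significance level, $E$ for the empirical Type I error probability, and $M = E/\alpha$ for the multiplicative change, and idealizing the fitted values by taking the slope as $1/\alpha$ and the intercept as its theoretical value of 1 (an approximation to Table 1, not an exact result):

```latex
\begin{align*}
M &= \frac{E}{\alpha} \approx 1 + \frac{1}{\alpha}\, F_{ST}, \\
E &= \alpha M \approx \alpha + F_{ST}.
\end{align*}
```

The approximation inherits the linear form of the fitted model and the range of population structure covered by the simulations, so it is best read as a rough guide within that range.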

3. A New Measure of Population Structure

Given the genotype information at several markers, $F_{ST}$ is usually estimated at each marker separately. In general, these estimates are different for distinct markers. For a particular set of subpopulations in one population, the general practice is to take the average of the estimates from each marker and use it as an overall measure of population structure. With the development of advanced genotyping technology, we now have access to genetic information on genome-wide markers, especially on SNPs, which are the polymorphic sites at single nucleotides in the population. It is therefore feasible to measure population structure utilizing the information at genome-wide markers simultaneously through a genomic approach. Recently, Nicholson et al. (2002) proposed a method for estimating population differentiation and isolation from SNP data, which is subpopulation-specific and common for all loci. For the $j$th subpopulation, let $c_j$ denote this measure of population structure. Each $c_j$ is comparable to the subpopulation-specific version of $F_{ST}$, and it measures the divergence of the $j$th subpopulation from the common ancestor. However, it would be more desirable if one single measure could summarize the level of population structure across all subpopulations. We propose a new measure, denoted $C$, by combining the measures $c_j$ from all subpopulations. For each sample in the simulated data set, $c_j$ is estimated for each subpopulation using a Bayesian approach as implemented in the software package Popgen (Nicholson et al., 2002; Pritchard et al., 2000b). Uniform priors are used for each $c_j$, and Markov chain Monte Carlo (MCMC) is employed to sample from the posterior distribution. The Markov chain was run for 20,000 iterations and the first 10,000 iterations were discarded as burn-in. We estimate $c_j$ by the mean of the posterior distribution. When summarizing the level of population structure across subpopulations, it is desirable to have a single statistic, instead of one for each subpopulation. Therefore, we take the weighted mean of all values of $c_j$ as the new measure of the overall population structure. Specifically, suppose we have $P$ subpopulations, and let $c_j$ be the measure of population structure for the $j$th subpopulation, as defined above. The new overall measure of population structure is defined as

$$C = \sum_{j=1}^{P} w_j c_j,$$

where $w_j$ is the weight for the $j$th subpopulation. There are many possible weighting schemes that can be employed. Here, we use the sample size from each subpopulation as the weight. That is, $w_j = n_j / \sum_{i=1}^{P} n_i$, where $n_j$ is the number of individuals from subpopulation $j$.

The new measure $C$ is specific to SNPs and takes account of the ascertainment bias in the process of SNP discovery. Since SNP discovery is generally conducted in small samples, SNPs with high minor allele frequencies are more likely to be discovered than SNPs with low minor allele frequencies, thus creating potential ascertainment bias. It has been shown that ascertainment bias can affect the estimation of population parameters in genetic analysis (Wakeley et al., 2001). This ascertainment bias has been explicitly accounted for in the method for estimating the individual $c_j$ for each sub-population, and hence in the estimation of the overall $C$. Another advantage, as pointed out by Nicholson et al. (2002), is that for point estimation, the estimates of the $c_j$ and hence that of $C$ can substantially outperform $F_{ST}$ and other similar methods. Also, this approach appropriately handles differing sample sizes across loci and/or subpopulations.
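The combination step that defines the overall measure is a sample-size-weighted mean of the subpopulation-specific estimates. A minimal sketch follows; the function name is hypothetical and the subpopulation-specific values are taken as given, since their Bayesian estimation is outside the scope of this example.

```python
def overall_structure(c_values, sample_sizes):
    """Sample-size-weighted mean of subpopulation-specific c_j values.

    c_values:     estimate c_j for each subpopulation (taken as given).
    sample_sizes: number of individuals n_j from each subpopulation.
    """
    if len(c_values) != len(sample_sizes) or not c_values:
        raise ValueError("need one sample size per subpopulation estimate")
    total = sum(sample_sizes)
    if total <= 0:
        raise ValueError("total sample size must be positive")
    # Weights are n_j / sum(n_i), so they sum to 1.
    return sum(n * c for n, c in zip(sample_sizes, c_values)) / total
```

With hypothetical estimates of 0.1 and 0.3 for subpopulations of 900 and 100 individuals, the weighted mean is 0.12, showing that the larger subpopulation dominates the overall measure under this weighting scheme.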

Table 2 gives the correlation coefficient and $R^2$ between the estimates of $F_{ST}$ and $C$ for varying sample sizes, from replicates of simulated samples. The number in the sample size column is the number of individuals simulated from each sub-population. Clearly, the correlation between estimates of $F_{ST}$ and the new measure $C$ increases as the sample size increases, though the correlation is high at all the sample sizes we have tried. Even when the sample size is as small as 30 individuals from each sub-population, we still have $R^2 = 0.89$. Given this high correlation, results for the magnitude of the inflated Type I error similar to those obtained in Sec. 2 should hold using $C$ instead of $F_{ST}$.

Table 2. Correlation between estimates of $F_{ST}$ and $C$ from simulations with varying sample sizes

Sample size    Correlation coefficient    $R^2$
30             0.944                      0.89
60             0.959                      0.92
90             0.973                      0.95

4. Discussion

The presence of population structure could potentially be a confounding factor in genetic association studies in admixed populations, and can lead to spurious significance and false positives. However, whether it poses a serious problem for studies in human populations is controversial. Some studies argued that population structure is not a major issue (Ardlie et al., 2002; Wacholder et al., 2002), while others concluded that population structure cannot be safely ignored, especially in large-scale studies (Freedman et al., 2004; Marchini et al., 2004). The general solution is to match samples in cases and controls by ancestry. However, this may be difficult for association studies in admixed populations such as the African American population, because the degree of admixture varies among individuals, and population structure can be considered a continuous variable in this case. Moreover, using genome-wide markers, genetic structure has been discovered within populations previously regarded as “homogeneous” (Campbell et al., 2005; Salmela et al., 2008). Therefore, matching by ancestry or geographic region may not be sufficient.

In this study, we took a computer simulation approach to assess the effect of population structure on genetic association studies. Samples with varying levels of population structure were generated using a coalescent-based algorithm. Effects of population structure on association studies were assessed quantitatively in terms of changes in actual Type I error probability as a function of population structure and the nominal significance level. Results shown in Fig. 2 indicate that population structure can cause inflation of Type I error probability. The degree of inflation increases with the level of population structure, and it can be very serious in some cases. The results in Fig. 2 also indicate that as the nominal significance level decreases, the problem posed by a given amount of population structure becomes more serious. This could have important implications for large-scale genome-wide association studies, in which millions of markers are usually used. Owing to the large number of markers tested in such studies, correcting for multiple testing means that a result must have a much lower p-value to be considered “significant” at the genomic level. Usually, the genome-wide significance level is set in a very low range of 10^-4 to 10^-8. Our results indicate that in this range, the problem posed by population structure becomes very serious indeed, with Type I error probability inflated over 1,000 times. As a consequence, the effects of population structure cannot be safely ignored in genome-wide association studies, and appropriate steps must be taken to correct for the effects of population structure.
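To make the scale of these genome-wide thresholds concrete, the sketch below computes a per-marker significance level under a Bonferroni correction. The Bonferroni rule, the family-wise level of 0.05, and the marker counts are assumptions chosen for illustration; the discussion above does not commit to a particular multiple-testing correction.

```python
def bonferroni_threshold(family_wise_alpha: float, n_tests: int) -> float:
    """Per-test significance level that keeps the family-wise error at or below
    `family_wise_alpha` when `n_tests` tests are performed (Bonferroni)."""
    if n_tests < 1:
        raise ValueError("n_tests must be a positive integer")
    return family_wise_alpha / n_tests


# Hypothetical study sizes, from a few hundred markers to a million.
for n_markers in (500, 100_000, 1_000_000):
    threshold = bonferroni_threshold(0.05, n_markers)
    print(f"{n_markers:>9} markers -> per-marker level {threshold:.1e}")
```

With these assumed inputs the per-marker levels span roughly the range of very low significance levels mentioned above.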

Through systematic simulations of samples with a wide range of levels of population structure, and using a regression analysis approach, we describe the multiplicative changes of Type I error probability as a function of population structure and nominal significance level of association tests. This result could be used as a quantitative guideline to assess the effects of population structure on genetic association studies. For a given level of population structure and significance level, we could use the regression line to find the expected change in Type I error probability. The regression function covers the very low significance levels that are commonly used in genome-wide studies. Therefore, it could provide guidelines for genome-wide association studies.

A strong empirical finding of our study is that the actual Type I error can be approximated as the sum of the nominal Type I error and F_ST. In other words, the Type I error is inflated by a magnitude approximately equal to F_ST. Thus, we give a simple method to assess the actual Type I error in the presence of population structure. It is noteworthy that the minimum value of Type I error that can be used in an association study must be larger than F_ST, imposing Type I error restrictions on association studies of admixed populations. This is an especially critical issue in genome-wide association studies, wherein an extremely small α-level is typically preferred, for reasons described above.
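The additive approximation stated above is easy to tabulate. The sketch below applies it to a few nominal levels and reports the implied inflation factor. The value F_ST = 0.01 and the function names are hypothetical choices for illustration only, and the calculation uses the empirical approximation from the text rather than the fitted regression.

```python
def approximate_actual_alpha(nominal_alpha: float, fst: float) -> float:
    """Empirical approximation quoted in the text: actual ~ nominal + F_ST."""
    return nominal_alpha + fst


def inflation_factor(nominal_alpha: float, fst: float) -> float:
    """Ratio of the approximate actual Type I error to the nominal level."""
    return approximate_actual_alpha(nominal_alpha, fst) / nominal_alpha


fst = 0.01  # hypothetical level of population structure
for nominal in (0.05, 1e-4, 1e-5, 1e-8):
    actual = approximate_actual_alpha(nominal, fst)
    factor = inflation_factor(nominal, fst)
    print(f"nominal = {nominal:.0e}  approx. actual = {actual:.6f}  inflation = {factor:,.1f}x")
```

Under this approximation the actual error rate never drops below F_ST, so the inflation factor grows without bound as the nominal level shrinks. This is why the problem is most acute at genome-wide thresholds.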

The proposed estimator, C, is highly correlated with F̂_ST, and it has some added advantages. First, it is specific for SNPs and takes account of the ascertainment bias in the process of SNP discovery. Second, since the estimator of C is derived within the Bayesian framework, it is very flexible in modeling and can incorporate prior information on the parameters. In our simulation studies, we used a non-informative prior for the c_j parameters. However, any prior knowledge regarding the parameters can be incorporated in the estimation, which can lead to more precise estimates than the moment-based estimates of F_ST (Nicholson et al., 2002).

Acknowledgment

We thank the two anonymous reviewers for their constructive comments, which helped us improve the manuscript. This work was partially supported by grant NS057506 from the National Institutes of Health and the Scientist Training Grant from the Medical College of Georgia to HX.

References

Ardlie, K. G., Lunetta, K. L., Seielstad, M. (2002). Testing for population subdivision and association in four case-control studies. Amer. J. Hum. Genet. 71(2):304–311.

Armitage, P. (1955). Tests for linear trends in proportions and frequencies. Biometrics 11:375–386.

Campbell, C. D., Ogburn, E. L., Lunetta, K. L., Lyon, H. N., Freedman, M. L., Groop, L. C., Altshuler, D., Ardlie, K. G., Hirschhorn, J. N. (2005). Demonstrating stratification in a European American population. Nat. Genet. 37(8):868–872.

Devlin, B., Roeder, K. (1999). Genomic control for association studies. Biometrics 55(4):997–1004.

Frazer, K. A., Ballinger, D. G., Cox, D. R., Hinds, D. A., Stuve, L. L., Gibbs, R. A. et al. (2007). A second generation human haplotype map of over 3.1 million SNPs. Nature 449(7164):851–861.

Freedman, M. L., Reich, D., Penney, K. L., McDonald, G. J., Mignault, A. A., Patterson, N. et al. (2004). Assessing the impact of population stratification on genetic association studies. Nat. Genet. 36(4):388–393.

Hudson, R. R. (1991). Gene genealogies and the coalescent process. In: Futuyma, D., Antonovics, J., eds. Oxford Surveys in Evolutionary Biology. Vol. 7. Oxford: Oxford University Press. pp. 1–44.

International Human Genome Sequencing Consortium (2004). Finishing the euchromatic sequence of the human genome. Nature 431(7011):931–945.

Kingman, J. F. C. (1982a). The coalescent. Stochastic Process. Applic. 13:235–248.

Kingman, J. F. C. (1982b). On the genealogy of large populations. J. App. Probab. A 19:27–43.

Kingman, J. F. C. (2000). Origins of the coalescent. 1974–1982. Genetics 156(4):1461–1463.

Lander, E. S., Linton, L. M., Birren, B., Nusbaum, C., Zody, M. C., Baldwin, J. et al. (2001). Initial sequencing and analysis of the human genome. Nature 409(6822):860–921.

Marchini, J., Cardon, L. R., Phillips, M. S., Donnelly, P. (2004). The effects of human population structure on large genetic association studies. Nat. Genet. 36(5):512–517.

Nei, M. (1975). Molecular Population Genetics and Evolution. New York: North-Holland – American Elsevier.

Nicholson, G., Smith, A. V., Jónsson, F., Gústafsson, Ó., Stefánsson, K., Donnelly, P. (2002). Assessing population differentiation and isolation from single-nucleotide polymorphism data. J. Roy. Statist. Soc. Ser. B Statist. Methodol. 64(4):695–715.

Price, A. L., Patterson, N. J., Plenge, R. M., Weinblatt, M. E., Shadick, N. A., Reich, D. (2006). Principal components analysis corrects for stratification in genome-wide association studies. Nat. Genet. 38(8):904–909.

Pritchard, J. K., Stephens, M., Donnelly, P. (2000a). Inference of population structure using multilocus genotype data. Genetics 155(2):945–959.

Pritchard, J. K., Stephens, M., Rosenberg, N. A., Donnelly, P. (2000b). Association mapping in structured populations. Amer. J. Hum. Genet. 67(1):170–181.

Risch, N., Merikangas, K. (1996). The future of genetic studies of complex human diseases. Science 273(5281):1516–1517.

Rosenberg, N. A., Pritchard, J. K., Weber, J. L., Cann, H. M., Kidd, K. K., Zhivotovsky, L. A. et al. (2002). Genetic structure of human populations. Science 298(5602):2381–2385.

Salmela, E., Lappalainen, T., Fransson, I., Andersen, P. M., Dahlman-Wright, K., Fiebig, A. et al. (2008). Genome-wide analysis of single nucleotide polymorphisms uncovers population structure in Northern Europe. PLoS ONE 3(10):e3519.

Sasieni, P. D. (1997). From genotypes to genes: doubling the sample size. Biometrics 53(4):1253–1261.

Spence, M. A., Greenberg, D. A., Hodge, S. E., Vieland, V. J. (2003). The emperor’s new methods. Amer. J. Hum. Genet. 72(5):1084–1087.

The International HapMap Consortium (2003). The International HapMap Project. Nature 426(6968):789–796.

Venter, J. C., Adams, M. D., Myers, E. W., Li, P. W., Mural, R. J., Sutton, G. G. et al. (2001). The sequence of the human genome. Science 291(5507):1304–1351.

Wacholder, S., Rothman, N., Caporaso, N. (2002). Counterpoint: bias from population stratification is not a major threat to the validity of conclusions from epidemiological studies of common polymorphisms and cancer. Cancer Epidemiol. Biomarkers Prev. 11(6):513–520.

Wakeley, J., Nielsen, R., Liu-Cordero, S. N., Ardlie, K. (2001). The discovery of single-nucleotide polymorphisms – and inferences about human demographic history. Amer. J. Hum. Genet. 69(6):1332–1347.

Weir, B. S., Cockerham, C. C. (1984). Estimating F-statistics for the analysis of population structure. Evolution 38:1358–1370.

Wright, S. (1943). Isolation by distance. Genetics 28(2):114–138.
