simulating genes in genome-wide association studies
DESCRIPTION
Talk given to the UCI Genetic Epidemiology Research Group (GERI, http://www.geri.uci.edu/) on May 16, 2014. Recent results on power to detect associations in growing populations, and the need for better statistical tests.
TRANSCRIPT
Simulating Genes in GWAS
Kevin R. Thornton Ecology and Evolutionary Biology
UC Irvine
slides will be available at http://www.slideshare.net/molpopgen
http://www.molpopgen.org
Acknowledgements
Tony Long Andrew Foran Jaleal Sanjak
from the analyses described above, and consideration of an expanded reference group, described below.
Bipolar disorder (BD). Bipolar disorder (BD; manic depressive illness (ref. 26)) refers to an episodic recurrent pathological disturbance in mood (affect) ranging from extreme elation or mania to severe depression and usually accompanied by disturbances in thinking and behaviour: psychotic features (delusions and hallucinations) often occur. Pathogenesis is poorly understood but there is robust evidence for a substantial genetic contribution to risk (refs 27, 28). The estimated sibling recurrence risk (λs) is 7–10 and heritability 80–90% (refs 27, 28). The definition of BD phenotype is based solely on clinical features because, as yet, psychiatry lacks validating diagnostic tests such as those available for many physical illnesses. Indeed, a major goal of molecular genetics approaches to psychiatric illness is an improvement in diagnostic classification that will follow identification of the biological systems that underpin the clinical syndromes. The phenotype definition that we have used includes individuals that have suffered one or more episodes of pathologically elevated mood (see Methods), a criterion that captures the clinical spectrum of bipolar mood variation that shows familial aggregation (ref. 29).
Several genomic regions have been implicated in linkage studies (ref. 30) and, recently, replicated evidence implicating specific genes has been reported. Increasing evidence suggests an overlap in genetic susceptibility with schizophrenia, a psychotic disorder with many similarities to BD. In particular association findings have been reported with both disorders at DAOA (D-amino acid oxidase activator), DISC1 (disrupted in schizophrenia 1), NRG1 (neuregulin1) and DTNBP1 (dystrobrevin binding protein 1) (ref. 31).
The strongest signal in BD was with rs420259 at chromosome 16p12 (genotypic test P = 6.3 × 10^-8; Table 3) and the best-fitting genetic model was recessive (Supplementary Table 8). Although recognizing that this signal was not additionally supported by the expanded reference group analysis (see below and Supplementary Table 9) and that independent replication is essential, we note that several genes at this locus could have pathological relevance to BD (Fig. 5). These include PALB2 (partner and localizer of BRCA2), which is involved in stability of key nuclear structures including chromatin and the nuclear matrix; NDUFAB1 (NADH dehydrogenase (ubiquinone) 1, alpha/beta subcomplex, 1), which encodes a subunit of complex I of the mitochondrial respiratory chain; and DCTN5 (dynactin 5), which encodes a protein involved in intracellular transport that is known to interact with the gene 'disrupted in schizophrenia 1' (DISC1) (ref. 32), the latter having been implicated in susceptibility to bipolar disorder as well as schizophrenia (ref. 33).
Of the four regions showing association at P < 5 × 10^-7 in the expanded reference group analysis (Supplementary Table 9), it is of interest that the closest gene to the signal at rs1526805 (P = 2.2 × 10^-7) is KCNC2 which encodes the Shaw-related voltage-gated potassium channel. Ion channelopathies are well-recognized as causes of episodic central nervous system disease, including seizures, ataxias
[Figure 4: seven Manhattan plots, one per disease (bipolar disorder, coronary artery disease, Crohn's disease, hypertension, rheumatoid arthritis, type 1 diabetes, type 2 diabetes). x axis: chromosome (1 to 22, X); y axis: −log10(P), 0 to 15.]
Figure 4 | Genome-wide scan for seven diseases. For each of seven diseases, −log10 of the trend test P value for quality-control-positive SNPs, excluding those in each disease that were excluded for having poor clustering after visual inspection, are plotted against position on each chromosome. Chromosomes are shown in alternating colours for clarity, with P values < 1 × 10^-5 highlighted in green. All panels are truncated at −log10(P value) = 15, although some markers (for example, in the MHC in T1D and RA) exceed this significance threshold.
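As a minimal sketch of the transformation plotted in these panels, the Python below converts a P value to −log10(P), caps it at 15 as the caption describes, and flags values below 1 × 10^-5. The function names and the example P values are illustrative assumptions, not part of the original analysis.

```python
import math

CAP = 15.0           # panels are truncated at -log10(P) = 15
HIGHLIGHT_P = 1e-5   # P values below this are highlighted in green


def neg_log10_p(p_value, cap=CAP):
    """Return -log10(p_value), truncated at `cap` (hypothetical helper)."""
    if not 0.0 < p_value <= 1.0:
        raise ValueError("P value must be in (0, 1]")
    return min(-math.log10(p_value), cap)


def is_highlighted(p_value, threshold=HIGHLIGHT_P):
    """True when the marker would be highlighted under the caption's rule."""
    return p_value < threshold


if __name__ == "__main__":
    # Example P values chosen only for illustration.
    for p in (0.05, 6.3e-8, 1e-30):
        print(p, round(neg_log10_p(p), 2), is_highlighted(p))
```

Capping is a display choice only: a marker at P = 1e-30 is drawn at 15 but is still far beyond the threshold.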
Burton et al., Nature (Articles), Vol 447, 7 June 2007, p. 666. ©2007 Nature Publishing Group. doi:10.1038/nature05911
The questions arise as to why so much of the heritability is apparently unexplained by initial GWA findings, and why it is important. It is important because a substantial proportion of individual differences in disease susceptibility is known to be due to genetic factors, and understanding this genetic variation may contribute to better prevention, diagnosis and treatment of disease. It is important to recognize, however, that few investigators expected these studies immediately to find all of the variants associated with common diseases, or even most of them; the hope was that they would at least find some (ref. 16). Limitations in the design of early GWAS, such as imprecise phenotyping and the use of control groups of questionable comparability, may have reduced estimates of effect sizes while preserving some ability to identify associated variants (ref. 17). These studies have considerably surpassed early expectations, reproducibly identifying hundreds of variants in many dozens of traits, but for many traits they have explained only a small proportion of estimated heritability (ref. 18).
Many explanations for this missing heritability have been suggested, including much larger numbers of variants of smaller effect yet to be found; rarer variants (possibly with larger effects) that are poorly detected by available genotyping arrays that focus on variants present in 5% or more of the population; structural variants poorly captured by existing arrays; low power to detect gene–gene interactions; and inadequate accounting for shared environment among relatives. Consensus is lacking, however, on approaches and priorities for research to examine what has been termed 'dark matter' of genome-wide association: dark matter in the sense that one is sure it exists, can detect its influence, but simply cannot 'see' it (yet). Here we examine potential sources of missing heritability and propose research strategies to illuminate the genetics of complex diseases.
Heritability and allelic architecture of complex traits
It is reasonable to assume that allelic architecture (number, type, effect size and frequency of susceptibility variants) may differ across traits, and that missing heritability may take a different form for different diseases (ref. 19), but at present our understanding is too limited to distinguish these possibilities. Age-related macular degeneration may provide the best example of a common disease in which heritability is substantially explained by a small number of common variants of large effect (ref. 20), but for other conditions, such as Crohn's disease, the proportion of heritability explained is not nearly so large despite a much larger number of identified variants (ref. 21) (Table 1). There are no obvious differences between these two traits in genetic architecture as predicted from clinical and epidemiological data that would explain the differences observed in their allelic architecture. Some apparent differences may simply be due to differences in the stage of investigation across traits. Studies in several conditions have clearly demonstrated that the number of detected variants increases with increasing sample size (refs 22–24).
Population genetic theory suggests an explanation for the paucity of variants explaining a large proportion of disease predisposition, in that decreased reproductive fitness should typically act to reduce the frequencies of high-risk variants. This might explain the relative lack of variants detected so far for some neuropsychiatric conditions, such as autism spectrum disorders, given their low reproductive fitness (ref. 25). Yet for a condition such as type 1 diabetes, which has a similar prevalence, familial risk, early onset and poor reproductive fitness (at least before the discovery of insulin therapy), more than 40 loci have already been reported; this might be because the overall sample sizes studied in type 1 diabetes have been very large (refs 26, 27). Present-day reproductive fitness may correlate poorly with the forces that have shaped variation throughout human evolution; moreover focusing on the reproductive effects of a single disease ignores the pleiotropic effects (effects of the same variant on multiple characteristics or disease risks) of multiple alleles influencing that condition simultaneously with many other conditions (ref. 28).
Selection might also be responsible for keeping genetic effect sizes low, as variants of larger effect may be selected against and eventually disappear (ref. 19). Long-term stabilizing selection minimizes the production of individuals at the extremes of a trait (ref. 29), in part by reducing the additive genetic effects of alleles already present or those arising de novo by mutation (ref. 30) to levels potentially beneath the ability of studies of feasible size to detect them. Selection may also contribute to differences in the ability to detect loci in different complex diseases, if genetic susceptibility to some diseases is more strongly affected by selection than other diseases, or if environmental perturbations vary in intensity across diseases. Immune and infectious agents have been recognized as among the strongest selection pressures in human evolution (ref. 31), and immune-related genes have been strongly implicated in Crohn's disease and other immune-mediated diseases (ref. 3), suggesting either that pleiotropic effects of these variants reduce the efficiency of negative selection, or that strong environmental perturbation in modern societies might expose the disease risk associated with these variants. Selection may thus explain why disease allele frequencies are low and allelic effects are small, but this should manifest as low, rather than missing, heritability.
A probable contributor to the small genetic effect sizes observed so far is that current investigations have incompletely surveyed the potential causal variants within each gene. Relative risks observed for marker SNPs may underestimate the actual risks associated with the true causal variants. Notably, 11 out of 30 genes implicated as carrying common variants associated with lipid levels also carry known rare alleles of large effect identified in Mendelian dyslipidemias, including ABCA1, PCSK9 and LDLR (refs 22, 32), suggesting that genes containing common variants with modest effects on complex traits may also contain rare variants with larger effects.
An important consideration is that the overwhelming majority of GWAS and other genetic studies have been limited to European ancestry populations, whereas genetic variation is greatest in populations of recent African ancestry (ref. 2), and studies in non-Europeans have yielded intriguing new variants (refs 33, 34). Studies of populations of recent African ancestry in particular are likely to increase the yield of rare variants and narrow the large chromosomal regions of association identified in the 'younger' population due to extended linkage disequilibrium, or the tendency for adjacent genetic loci to be inherited together (ref. 31). Isolated populations may also be of value given their potential to be enriched in unique variants (ref. 35).
The accuracy of current heritability estimates is also important, because experimentally identified variants could never explain all the variance in an erroneously inflated heritability estimate. Heritability of quantitative traits, formally defined as the proportion of phenotypic variance in a population attributable to additive genetic factors (narrow-sense heritability, h^2 (ref. 36)) is typically estimated from
Table 1 | Estimates of heritability and number of loci for several complex traits
Disease                                       Number of loci   Proportion of heritability explained   Heritability measure
Age-related macular degeneration (ref. 72)    5                50%                                    Sibling recurrence risk
Crohn's disease (ref. 21)                     32               20%                                    Genetic risk (liability)
Systemic lupus erythematosus (ref. 73)        6                15%                                    Sibling recurrence risk
Type 2 diabetes (ref. 74)                     18               6%                                     Sibling recurrence risk
HDL cholesterol (ref. 75)                     7                5.2%                                   Residual* phenotypic variance
Height (ref. 15)                              40               5%                                     Phenotypic variance
Early onset myocardial infarction (ref. 76)   9                2.8%                                   Phenotypic variance
Fasting glucose (ref. 77)                     4                1.5%                                   Phenotypic variance
*Residual is after adjustment for age, gender, diabetes.
Manolio et al., Nature (Reviews), Vol 461, 8 October 2009, p. 748. ©2009 Macmillan Publishers Limited. All rights reserved. doi:10.1038/nature08494
NHGRI GWA Catalog
www.genome.gov/GWAStudies
www.ebi.ac.uk/fgpt/gwas/
Published Genome-Wide Associations through 12/2012
Published GWA at p ≤ 5 × 10^-8 for 17 trait categories
Figure 2. Frequency distributions of a) the risk allele frequency of the most associated SNPs listed in the GWAS Catalog [1] for the diseases in Table 3, b) MAF of all SNPs simulated under the coalescence model, c) MAF of SNPs used in analyses to be representative of SNPs included in GWAS, d–f) coupled allele of most associated SNP from simulations of 1, 9, or 36 causal variants in a 100 kb region. doi:10.1371/journal.pbio.1000579.g002
Wray et al., PLoS Biology (www.plosbiology.org), January 2011, Volume 9, Issue 1, e1000579, p. 3. doi:10.1371/journal.pbio.1000579
Unsurprisingly, since the GWAS method is primarily powered for common alleles, risk allele frequencies were well above 5% (median risk allele frequency 36%, interquartile range, IQR, 21%–53%) in the populations analyzed as well as in the HapMap populations (CEU: 37%, 21–54%; YRI: 33%, 13–65%; combined JPT+CHB 32%, 13–58%; Fig. S1).
The 531 reported SNP-trait associations represented 465 unique TASs; 43% (n = 199) of which were located in an intergenic region, 45% (n = 208) were intronic, 9% (n = 41) were nonsynonymous, 2% (n = 10) were in a 5′ or 3′ untranslated region, and 2% (n = 7) were synonymous, according to the University of California Santa Cruz Genome Browser (5). Discrete traits were the focus of 227 (43%) of the 531 SNP-trait associations, which had associated odds ratios (ORs) ranging from 1.04 to 29.4 (median 1.33, IQR 1.20–1.61; Fig. S2). Among the discrete traits, the range of ORs was similar between nonsynonymous and other TASs; however, the right tail of the OR distribution for nonsynonymous TASs was slightly skewed toward higher values. The highest ORs were reported for pigmentation traits [Fig. 1; MC1R and hair color (6) and OCA2 and eye color (6)]. SNP-trait associations were also distributed widely across diseases of high population prevalence, including heart disease, obesity, diabetes, and cancer (Table S2). Trait prevalence was not associated with the magnitude of ORs and risk allele frequencies, which were similar between the 10 most prevalent traits and all others combined (median ORs 1.26 and 1.29, respectively; median risk allele frequencies, 40% and 35%, respectively).
Among genes or regions harboring TASs that were reported in multiple studies of discrete traits, 18 were associated with seemingly distinct traits that may suggest clues toward common etiologic pathways (Table 1). Several TASs were located in previously characterized candidate genes, such as APOE, HLA, KCNJ11, PPARG, and CARD15, and were detected through GWAS at comparable effect sizes and stronger levels of statistical significance (Table S3). In these instances, GWAS-identified SNPs served as reasonable positive controls for known disease-associated genetic variants.
Functional Analysis. To assess the underlying functionality at the trait/disease-associated genetic loci, we systematically mapped all TASPs (reported index TASs with an association p value < 5.0 × 10^-8 and all HapMap phase II CEU SNPs in LD [r^2 > 0.9]) to 20 nonmutually exclusive genomic annotation sets (Table S4). For each annotation set, we did the following. For every unique TAS block, we determined whether any TASPs mapped to the annotation set. If none mapped, we did not count the block. However, if one or more TASPs mapped, then we counted 1 per block. To compute the odds of a TAS block mapping to the annotation set, we divided the number of unique TAS blocks that were counted in the annotation set (n) by the number of TAS blocks that were not counted (N − n). To evaluate whether any annotation set was significantly enriched or depleted for TAS blocks, we compared the observed odds with the expected odds calculated from 100 control datasets comprised of randomly selected SNPs and their LD partners. Importantly, the mapping and counting strategies were consistent across both the test and the control datasets to ensure a fair comparison. Further, the generation of the control datasets took into account the representation biases on the genotyping arrays that were used to identify the TASs (SI Text).
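The counting scheme above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' code: the function names and every count below are hypothetical, and the expected odds are taken here as the mean over control datasets, which the text does not specify.

```python
from statistics import mean


def block_odds(n_mapped, n_total):
    """Odds of a TAS block mapping to an annotation set: n / (N - n)."""
    if not 0 <= n_mapped < n_total:
        raise ValueError("need 0 <= n < N so that N - n is positive")
    return n_mapped / (n_total - n_mapped)


def enrichment_odds_ratio(n_mapped, n_total, control_counts):
    """Observed odds divided by expected odds.

    control_counts holds (n, N) pairs, one per control dataset of randomly
    selected SNPs. Averaging the control odds is an assumption made here.
    """
    expected = mean(block_odds(n, big_n) for n, big_n in control_counts)
    return block_odds(n_mapped, n_total) / expected


if __name__ == "__main__":
    # Hypothetical numbers: 40 of 400 TAS blocks hit the annotation set,
    # versus 12 of 400 in each of 100 control datasets.
    controls = [(12, 400)] * 100
    print(round(enrichment_odds_ratio(40, 400, controls), 2))
```

An odds ratio above 1 would suggest enrichment and below 1 depletion; judging significance would additionally need the spread across control datasets, which this sketch leaves out.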
For 9 annotation sets (nonsynonymous sites, 1kb promoters, 5kb promoters, most conserved sequences (MCSs), 3′ UTRs, microRNA target sites, introns, CpG islands and experimentally validated regulatory regions from ORegAnno), the 95% confidence interval (CI) of the OR excluded 1.0 and the enrichment p values were < 0.05 (Fig. S3), indicating that these categories may be significantly enriched for TAS blocks. Nonsynonymous sites had the strongest signal for enrichment (OR = 3.9 [2.2–7.0], p = 3.5 × 10^-7). After restricting the analysis to only those nonsynonymous SNPs predicted by PolyPhen (7) to be potentially deleterious (which reduces the sample size by approximately 65%), TAS blocks were even more strongly enriched (OR = 5.2 [1.8–15.3], p = 0.001). Thirty nonsynonymous TASPs that are predicted to be potentially deleterious [by PolyPhen and an unpublished method, CDPred (P. Cherukuri and J. Mullikin, personal communication)] were identified as attractive candidates for functional follow-up (Table 2).
To examine the possibility that signals in other annotation sets might not represent bona fide TAS block enrichment, but rather a "hitchhiking" effect whereby TASPs closely linked with non-
[Figure 1: scatter plot. x axis: reported risk allele frequency, % (0 to 100); y axis: odds ratio (log scale, 1 to 30). Labeled points: OCA2, eye color; MC1R, hair color; LOXL1, exfoliation glaucoma.]
Fig. 1. Published odds ratios for discrete traits by reported risk allele frequencies. Labeled SNP-trait associations are those with the highest ORs. Note that the y axis is on the log scale.
Hindorff et al., PNAS (Genetics), June 9, 2009, vol. 106, no. 23, p. 9363. www.pnas.org/cgi/doi/10.1073/pnas.0903103106
effects, we believe that current approaches in quantitative genetics coupled with gathering adequate data will dissect additional genetic variance within populations. The goal of explaining heritable effects is not purely academic. Until more of the variation expected from family studies is explained by direct analysis of the genome, there remains the possibility that there is a fundamental misunderstanding in our knowledge and conceptual framework. Identification of specific genomic variants that underpin individual differences provides the foundation for prediction, risk profiling, and personalized medicine; for identifying pathways and new potential drug targets; for classifying disease subtypes; for improving and maintaining food sources; and for understanding the influence of selection and the maintenance of diversity in the natural world.
Complex trait variation
The genomic variation observed within a population is the result of the evolutionary forces of mutation, genetic drift, recombination, and natural selection in the evolutionary past [24], which is something that we do not know, particularly given the extent of pleiotropy across traits (Box 1). A range of genetic architectures, in terms of the exact number, effect size, and frequency of causal variants, may be consistent with current findings in humans [30]. Linkage studies and GWAS have identified many thousands of significant associations across more than 500 human phenotypes (Box 1), and it is clear that, for any given trait, genetic variance is likely contributed from a large number of loci across the entire allele frequency spectrum.
Some researchers suggest that 'synthetic associations', where associations at common single nucleotide polymorphisms (SNPs) reflect linkage disequilibrium (LD) with multiple rare variants, underlie many GWAS results and that drawing conclusions regarding genetic architecture from GWAS is not justified [31,32]. Although there are examples of 'synthetic associations' [33], they cannot explain all GWAS results [3,34,35]. Converging lines of evidence suggest a contribution from variants of >5%
Box 1. The distribution of genetic variants across allele frequency
The variance explained by a single causal variant depends upon its effect size and its frequency within the population. Under neutrality and random mating, the allele frequency distribution is approximately proportional to Equation I [23]:
$$\frac{1}{p(1-p)} \qquad \text{[I]}$$
and the genetic variance contributed by a single variant is (Equation II):
$$2p(1-p)a^{2} \qquad \text{[II]}$$
where p is the frequency of the causal variant and a is the effect size on an arbitrary scale. Under a neutral model, this implies that most variants are rare, but most of the genetic variance is due to common variants [24].
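A short numerical check of that last claim, sketched in Python. It integrates Equation I, and the product of Equations I and II, in closed form. The 5% frequency cutoff, the population size, and the assumption that the effect size a is the same at every frequency are illustrative choices, not values from the source.

```python
import math


def logit(p):
    """Antiderivative of 1 / [p(1 - p)]."""
    return math.log(p / (1.0 - p))


def neutral_fractions(maf_cutoff=0.05, n_diploids=10_000):
    """Share of variants, and of genetic variance, with MAF below the cutoff.

    Assumes density proportional to 1/[p(1-p)] on [eps, 1 - eps] with
    eps = 1/(2N), and a constant effect size a (both are assumptions).
    """
    eps = 1.0 / (2 * n_diploids)
    total_count = logit(1.0 - eps) - logit(eps)
    rare_count = (logit(maf_cutoff) - logit(eps)) + (
        logit(1.0 - eps) - logit(1.0 - maf_cutoff)
    )
    # Density times 2p(1-p)a^2 is constant in p, so variance is
    # proportional to the length of the frequency interval.
    rare_variance = 2.0 * (maf_cutoff - eps) / (1.0 - 2.0 * eps)
    return rare_count / total_count, rare_variance


if __name__ == "__main__":
    share_variants, share_variance = neutral_fractions()
    print(f"MAF < 5%: {share_variants:.2f} of variants, "
          f"{share_variance:.2f} of genetic variance")
```

With these assumed settings the rare class holds the majority of variants but only about a tenth of the variance, because the two factors of p(1 − p) cancel and variance ends up spread evenly over frequency.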
The effect of directional selection is to increase the amount of variation explained by rare variants, because natural selection should minimize the frequency of deleterious variants in the population [24]. Therefore, for any phenotype, many causal variants will be rare, and the proportion of population-level genetic variance in complex phenotypes attributable to variants across the allele frequency spectrum will depend upon the strength of selection in our evolutionary past. The problem is that this is something that we do not know. Additionally, newly arising mutations can have pleiotropic effects on multiple phenotypes and the effect (size and/or direction) of a given mutation may not be the same for all traits. Moreover, each of the traits affected may be associated with fitness in different ways and, thus, held at frequencies that are intermediate between two phenotypes (e.g., balancing selection).
The distribution of GWAS findings to date, obtained from the Published GWAS Catalogue, across allele frequency is shown in Figure I for studies from 2008 on a selection of traits, each of which is given a different color. For quantitative traits (Figure IA), the absolute effect is plotted against the minor allele frequency, and for complex common diseases (Figure IB), the odds ratio is plotted against the risk allele frequency. Each of the 38 quantitative traits and 43 disease traits is represented by a different color. There is an ascertainment bias in that the power of detection is proportional to pa^2, but it is clear that, for each complex trait, variance is contributed from the entire allele frequency spectrum. This highlights the scarcity of low-frequency variants identified by GWAS for quantitative traits and complex disease in humans. Detecting these variants will require a combination of greater sample size, better genotyping, and improved phenotyping.
[Figure I: two scatter panels. (A) x axis: minor allele frequency (<0.001 to 0.5); y axis: absolute effect (SD units). (B) x axis: risk allele frequency (<0.001 to 1); y axis: odds ratio. Credit: TRENDS in Genetics.]
Figure I. For quantitative traits (A), the absolute effect is plotted against the minor allele frequency, whereas for complex common diseases (B), the odds ratio is plotted against the risk allele frequency. Each of the 38 quantitative traits and 43 disease traits are represented by different colors. Abbreviation: SD, standard deviation.
Robinson et al., Trends in Genetics (Opinion), TIGS-1106, p. 2. http://dx.doi.org/10.1016/j.tig.2014.02.003
provide important clues about the evolutionary history and underlying molecular mechanisms of certain TASPs.
Several limitations of the underlying catalog data should be noted. We extracted all eligible associations from published articles and SI Text, but the number and quality of reported SNP associations is dependent upon the preferences of the individual author and journal. Also, the studies within the catalog generally test only those SNPs that are detectable via commonly used genotyping platforms in participants who tend to be from European-descent populations. The GWAS data are likely to be subject to varying degrees of upward bias in effect size estimates (the "winner's curse" phenomenon), particularly to the extent that estimates from the GWAS discovery population, who may be less representative of the general population, influence those reported in our catalog. Nonetheless, in several instances in which known candidate SNPs have been previously identified, GWAS of the same trait tended to confirm these findings with similar effect sizes and stronger levels of statistical significance. Finally, TASs reported in published GWAS suffer from "lead TAS bias"; generally 1 or 2 TASs out of a cluster are selected from the initial study, often based on likely functional significance such as a conserved nonsynonymous site, for association analysis in the replication sample. To minimize the effect of this bias, we analyzed TAS blocks, which include the lead SNPs and their known LD partners based on HapMap phase II data. However, the true impact of the bias is difficult to quantify and it may still exert a slight effect on the enrichment/depletion signals especially for categories such as nonsynonymous sites.
An important question is to what extent GWAS have identified genetic variants likely to be of clinical or public health importance, particularly for developing preventive or therapeutic interventions. Answering this question must await better functional characterization of TASs or the true causative variants they may be tagging, evidence of effective interventions, and identification of potential modifiers of SNP-trait associations (1). However, the current study contributes empiric bounds on the expectations for the effect sizes and allele frequencies of TASs that can be identified from GWAS. It also highlights the distribution of promising SNP-trait associations across a wide variety of traits of substantial public health interest, such as obesity, hypertension, coronary artery disease, and cancer. Our results may guide future studies by highlighting genetic variants that are of particular interest from a descriptive, association, evolutionary, or functional perspective (such as predictions of TASP-mediated allele-specific transcription factor binding sites) and suggesting hypotheses for future study. Our description of GWAS-identified variants builds upon the important work previously targeted toward candidate genes, adding to a more complete picture of the contribution of common genetic variation to common diseases. It is clear, however, that the proportion of heritability explained by common variation for most common diseases to date is modest at best (17). As the power of the GWAS approach increases with access to more samples, and as the types of methods to test for genetic associations expand to include copy number variants and rarer alleles, more associations will likely be identified and timely analyses similar to those presented here will continue to update our knowledge of the influence of genomic structure and function on complex diseases.
[Figure 2: odds ratio (y axis, log scale, 1 to 10) by annotation set (x axis): Non-synonymous sites, Promoters (1kb), Promoters (5kb), 5' UTRs, 3' UTRs, miRTS, Intronic regions, Intergenic regions, Intergenic TFBSs, CpG islands, PReMod sites, ORegAnno elements, EAR regions, MCSs, HARs, PSGs. Panel title: Enrichment/depletion analysis after adjusting for 'hitchhiking' effects from non-synonymous sites.]
Fig. 2. Odds ratios for TAS block enrichment/depletion analysis after adjusting for "hitchhiking" effects from nonsynonymous sites. Four annotation sets (Splice sites, Validated enhancers, EvoFold elements, and noncoding RNAs) are not represented here because no TAS blocks mapped to these annotation sets. The blue circle represents the point estimate of the odds ratio (OR) and the red lines represent the 95% CI. Possible "hitchhiking" effects from nonsynonymous sites are reduced by discarding any TASP/control SNP in r^2 > 0.6 with a nonsynonymous SNP. For an explanation of the annotation sets on the x axis, we refer the reader to Table S4. Note that the y axis is on the log scale. Nonsynonymous OR computation is not adjusted for "hitchhiking" effects.
Observation                      Interpretation
Missing H                        Lots
Uniform frequencies of “hits”    Common associations exist
Rare hits have larger OR         Rare alleles may have larger effects
Larger OR in genes               Genes matter
Observation                      Interpretation
Rare hits have larger OR         Rare alleles may have larger effects
                                 Disease is harmful with respect to fitness (in the evolutionary sense).
Larger OR in genes               Genes matter
[Figure 3: two panels. (a) y axis: frequency of observations (0 to 0.4); x axis: risk allele frequency of most significant common SNP; series: simulated hits and GWAS hits. (b) y axis: causal variant frequency (0.005 to 0.020); x axis: common SNP frequency (0.1 to 0.5); legend: odds ratio, in bins from 2 through 9 and > 9.]
Figure 3 | Inconsistency between genome-wide association study results and rare variant expectations. a | The frequency distribution of risk allele frequencies (shown in light red) for 414 common variant associations with 17 diseases is only slightly skewed towards lower-frequency variants. By contrast, simulations — in this case, assuming up to nine rare causal variants inducing the common variant association with SNPs at the same frequency as observed on common genotyping platforms (light green bars) — result in a marked left-skew with a peak for common variants whose frequency is less than 10%. (The skew is even stronger if only a single causal variant is responsible.) The observed data are thus not immediately consistent with the rare variant model. b | Part of the problem with synthetic associations is that they would explain too much heritability if they were pervasively responsible for common variant effects. This is due to the relationship between allele frequency, maximum possible linkage disequilibrium (LD) and the amount of variance explained19. The plot shows the expected odds ratio due to a rare variant of the indicated frequency (from 0.5% to 2%) if it increases the odds ratio at a common SNP (with which it is in maximum possible LD) by 1.1-fold. Intermediate effect sizes (2 < odds ratio < 5) require combined causal variant frequencies in excess of 1%. As the number of rare variants increases, the likelihood that they are in high LD with the common variant also drops, further reducing the probability that they can explain observed common variant association. Suppose that a disease has a prevalence of 1%. Then ten causal variants that are each at a frequency of 1% would result in 20% of people carrying a causal variant. If the penetrance is 5%, then 1% of people would have the disease, and these 10 variants would completely explain the genetic risk. 
Similarly, if 100 causal variants were each at 0.1% frequency, it would take ~10 such variants to induce each single common variant association with an observed odds ratio of 1.1. If large genome-wide association studies (GWASs) detect dozens of such common loci, and they were actually due to LD with rare variants, then the heritability would be explained several times over. Alternatively, if hundreds of very rare causal variants are not in LD with common variants, we do not expect to see significant GWAS associations. Data taken from REF. 19.
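The arithmetic in this caption can be reproduced with a small Python sketch. The function names are hypothetical. The carrier fraction uses the same rough approximation as the caption (two allele copies per person, carriers of different variants not overlapping), with the exact value under independence shown for comparison.

```python
def approx_carrier_fraction(n_variants, allele_freq):
    """Approximate share of people carrying at least one causal variant."""
    return 2 * n_variants * allele_freq


def exact_carrier_fraction(n_variants, allele_freq):
    """Same quantity assuming independent variants and random mating."""
    return 1.0 - (1.0 - allele_freq) ** (2 * n_variants)


def implied_prevalence(carrier_fraction, penetrance):
    """Disease prevalence if only carriers are at risk."""
    return carrier_fraction * penetrance


if __name__ == "__main__":
    # Caption's example: ten causal variants at 1% each, penetrance 5%.
    carriers = approx_carrier_fraction(10, 0.01)
    print(carriers, implied_prevalence(carriers, 0.05))
    print(exact_carrier_fraction(10, 0.01))
```

The exact carrier fraction comes out slightly below the approximate 20% because some people carry more than one variant; the caption's conclusion is not sensitive to that difference.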
Decanalization: The notion that genetic systems evolved to be buffered but that large effect mutations or environmental change can overcome this buffering, thereby increasing the genetic variance.
Genomic selection: The use of genetic markers that are spread throughout the genome to select individuals with desired predicted breeding values.
Predicted breeding value: The estimated phenotype of progeny of individuals that have a particular genotype.
the consistency of common variant effects is that they are actually due to the common variants themselves or to unobserved common variants in high LD across all populations.
Arguments in favour of the infinitesimal model
The infinitesimal model underpins standard quantitative genetic theory. Just as evolutionary theory provides a strong argument in favour of rare variants, standard quantitative genetic theory provides ample support for the infinitesimal model7,8. Whatever the causes of the maintenance of genetic variance may be, the consistent observation is that all diseases have moderately high heritability, and so purifying selection has been unable to purge the population of disease-promoting variants2. At face value, the existence of dozens of susceptibility alleles for metabolic and immunological diseases with effect sizes that are just not detected for psychological diseases implies a difference in genetic architecture between the two categories of conditions. This may imply different intensities of purifying selection, although other models, including decanalization69, are also compatible with the data. Because most of the genetic variance remains unexplained, it is a priori just as likely to exist in the form of rare or common alleles, and the fact is that there is nothing about GWAS findings that is inconsistent with the infinitesimal model of many variants of very small effect across the full allele frequency spectrum. This model has served applied quantitative geneticists as well as evolutionary biologists for close to a century and, in a sense, it can be regarded as the null model that needs to be disproved before it is abandoned.
Common variants collectively capture the majority of the genetic variance in GWASs. Direct empirical support for the infinitesimal model comes from genomic variance analyses70,71. Animal breeders have been using genomic selection methods with great success for the past decade72, basing their selection of sires and dams on the overall predicted breeding value, which is determined from the full set of genomic markers that capture variation distributed throughout the genome. Similarly, in humans, by taking all nominally significant SNPs rather than just the significant ones from GWASs, it is possible to capture much more of the genetic variance than is explained by the highly significant loci73,74 (BOX 3). A multivariate version of this approach, which is implemented by regression of phenotypic similarity on genetic relatedness, also implies that common variants capture most of the genetic variants71. Furthermore, partitioning of the genetic variance
REVIEWS
140 | FEBRUARY 2012 | VOLUME 13 www.nature.com/reviews/genetics
© 2012 Macmillan Publishers Limited. All rights reserved
doi:10.1038/nrg3118 (Gibson)
[Figure 3 plots. Panel a: frequency of observations against risk allele frequency of the most significant common SNP, for simulated hits and GWAS hits. Panel b: causal variant frequency (0.005 to 0.020) against common SNP frequency (0.1 to 0.5), with odds ratio classes from ≤ 2 to > 9.]
The multiplicative model
$G = \prod_i (1 + e_i)$
[Figure: contour plot of $G$; x-axis: causative mutations on paternal allele (0 to 10); y-axis: causative mutations on maternal allele (0 to 10); contour levels 0.2 to 1.4.]
Risch & colleagues, Pritchard, countless others
WWHD? (What would Haldane do?)
Genotype | AA | Aa | aa
Mating frequency | $p^2$ | $2pq$ | $q^2$
Fitness | $1$ | $1 - sh$ | $1 - 2s$
$\hat{q} = \frac{u}{sh}$
$\hat{q} \approx \sqrt{\frac{u}{s}}$ as $h \to 0$
DOI: 10.1017/S0305004100015644 (Haldane)
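The mutation-selection balance on this slide can be checked numerically. The sketch below is my own deterministic recursion, not code from the talk: it applies selection with the fitnesses 1, 1 - sh and 1 - 2s, then one-way mutation at rate u, and compares the resulting frequency with u/(sh). The parameter values are arbitrary illustrations, and only the partially dominant case (h > 0) is covered.

```python
def next_q(q, u, s, h):
    """One generation: viability selection, then mutation A -> a at rate u."""
    p = 1.0 - q
    w_het = 1.0 - s * h        # fitness of Aa
    w_hom = 1.0 - 2.0 * s      # fitness of aa
    mean_w = p * p + 2.0 * p * q * w_het + q * q * w_hom
    # Frequency of "a" among gametes after selection.
    q_sel = (p * q * w_het + q * q * w_hom) / mean_w
    # New "a" alleles arise from the remaining "A" alleles.
    return q_sel + u * (1.0 - q_sel)


def equilibrium_q(u, s, h, generations=20000):
    """Iterate the recursion from q = 0 until it has settled."""
    q = 0.0
    for _ in range(generations):
        q = next_q(q, u, s, h)
    return q


u, s, h = 1e-5, 0.1, 0.5       # illustrative values only
print(equilibrium_q(u, s, h), u / (s * h))
```

The two printed numbers should agree closely as long as the equilibrium frequency is small, which is the regime in which the approximation is derived.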
Mutation at rate u (per gamete per generation)
“A” allele
“a” allele is heterogeneous in its molecular origin
Trans-heterozygotes are at risk. Phenotype has (weak) effect on individual fitness.
doi:10.1371/journal.pgen.1003258 (Thornton et al.)
Effect sizes ~ Exp(λ)
[Figure: density of effect sizes; x-axis: effect size (0.0 to 0.9); y-axis: density (0.0 to 7.5).]
$h_i$ = effect of haplotype. Additive over causative mutations.
doi:10.1371/journal.pgen.1003258 (Thornton et al.)
$G_{i,j} = \sqrt{h_i \times h_j}$ (geometric mean)
[Figure: contour plot of $G_{i,j}$; x-axis: causative mutations on paternal allele (0 to 10); y-axis: causative mutations on maternal allele (0 to 10); contour levels 0.05 to 0.4.]
$P_{i,j} = G_{i,j} + N(0, \sigma)$
$w = e^{-\frac{(P_{i,j})^2}{2\sigma_S^2}}$
doi:10.1371/journal.pgen.1003258 (Thornton et al.)
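The genotype-to-fitness map on these slides can be written out in a few lines. This is a minimal sketch using only the Python standard library, not the simulation code behind the talk. It takes λ as the mean effect size, as the plot axes in this deck label it; the function names, the random seed and the values chosen for λ, σ and σ_S are illustrative assumptions.

```python
import math
import random


def haplotype_effect(n_mutations, mean_effect, rng):
    """Sum of exponential effect sizes (additive within a haplotype).

    random.expovariate takes a rate, so the rate is 1 / mean.
    """
    return sum(rng.expovariate(1.0 / mean_effect) for _ in range(n_mutations))


def genetic_value(h_i, h_j):
    """Geometric mean of the two haplotype effects."""
    return math.sqrt(h_i * h_j)


def phenotype(g, sigma, rng):
    """Genetic value plus Gaussian noise with standard deviation sigma."""
    return g + rng.gauss(0.0, sigma)


def fitness(p, sigma_s):
    """Gaussian fitness function centred on a phenotype of zero."""
    return math.exp(-(p * p) / (2.0 * sigma_s * sigma_s))


rng = random.Random(1)                     # arbitrary seed
h_pat = haplotype_effect(2, 0.1, rng)      # two causative mutations
h_mat = haplotype_effect(0, 0.1, rng)      # an unaffected haplotype
g = genetic_value(h_pat, h_mat)            # zero: one clean copy is enough
print(g, fitness(phenotype(g, 0.075, rng), 1.0))
```

The geometric mean is what makes the gene recessive at the level of the whole haplotype: a diploid with one mutation-free copy has a genetic value of zero however many mutations sit on the other copy.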
Aside: simulation tools
• C++ library for rapid forward simulation
• Available from https://github.com/molpopgen/fwdpp
• Preprint on arXiv at http://arxiv.org/abs/1401.3786
[Figure: mean run time (days) and mean peak memory use (Mb) against population size (N diploids: 1000, 10000, 50000), in panels θ = ρ = 100 and θ = ρ = 500; programs compared: sfs_code, SLiM, fwdpp (gamete-based), fwdpp (individual-based).]
http://arxiv.org/abs/1401.3786 (Thornton)
[Figure: mean run time (hours) and mean peak memory use (megabytes) against the proportion of new mutations that are deleterious (0.1, 0.5, 1), in panels 2Nsh = 1, 2Nsh = 10 and 2Nsh = 100; simulations compared: fwdpp (gamete-based), fwdpp (individual-based), SLiM.]
http://arxiv.org/abs/1401.3786 (Thornton)
Selection is weak
[Figure: relative fitness (0.70 to 1.00) against mean effect size (λ, 0.0 to 0.5); series: population mean fitness, average fitness of a case, average minimum fitness.]
doi:10.1371/journal.pgen.1003258 (Thornton et al.)
Heritability plateaus
[Figure: broad-sense heritability (0.00 to 0.06) against mean effect size (λ, 0.0 to 0.5).]
doi:10.1371/journal.pgen.1003258 (Thornton et al.)
Rare alleles
[Figure: proportion (0.0 to 0.4) against derived allele frequency (1, 5, 10); λ = 0.25.]
doi:10.1371/journal.pgen.1003258 (Thornton et al.)
GWAS have poor power
[Figure: power (0.0 to 0.8) against mean effect size (λ, 0.0 to 0.5); series: GWAS; GWAS, no recombination; resequencing; resequencing, no recombination.]
doi:10.1371/journal.pgen.1003258 (Thornton et al.)
Compare model to data…
[Figure: Figure 3 of Gibson, shown again; panels a and b as in the caption quoted earlier.]
doi:10.1038/nrg3118 (Gibson)
Figure 2. Frequency distributions of a) the risk allele frequency of the most associated SNPs listed in the GWAS Catalog [1] for the diseases in Table 3. b) MAF of all SNPs simulated under the coalescence model, c) MAF of SNPs used in analyses to be representative of SNPs included in GWAS. d–f) Coupled allele of most associated SNP from simulations of 1, 9, or 36 causal variants in a 100 kb region. doi:10.1371/journal.pbio.1000579.g002
PLoS Biology | www.plosbiology.org 3 January 2011 | Volume 9 | Issue 1 | e1000579
doi:10.1371/journal.pbio.1000579 (Wray et al.)
…reveals a pretty good fit
[Figure: Figure 2 of Wray et al. (doi:10.1371/journal.pbio.1000579), shown again, alongside the mean number of markers (0 to 10) against MAF of the most significant marker in cases (0 to 0.5); n = 36.899; λ = 0.05.]
(Based on simulating imperfect SNP chips)
“Burden” tests do badly…
[Figure: two panels of power (0.0 to 1.0) against mean effect size (λ, 0.0 to 0.5). Legend of the first panel: GWAS; GWAS, no recombination; resequencing; resequencing, no recombination. Legend of the second panel: 50, 100, 200 and 250 markers, each with and without recombination.]
Madsen and Browning (2009)
Li and Leal (2008)
doi:10.1371/journal.pgen.1003258 (Thornton et al.)
…because the model is wrong.
[Figure: mean number of causative mutations per diploid (0 to 8) against mean effect size (λ, 0.0 to 0.5); series: controls, cases, controls (rares), cases (rares).]
doi:10.1371/journal.pgen.1003258 (Thornton et al.)
SKAT does ok
[Figure: power (0.0 to 1.0) against mean effect size (λ, 0.0 to 0.5); series: resequencing, default weights and optimal p-values; GWAS, default weights and optimal p-values; resequencing, Madsen-Browning weights and optimal p-values; GWAS, Madsen-Browning weights and optimal p-values.]
doi:10.1371/journal.pgen.1003258 (Thornton et al.)
Manhattan plots
[Figure: two panels of −log10(p) (0 to 15) against position (0 to 100 kbp); marker classes: common; common, causative; rare; rare, causative.]
Methods), and excluded 153 individuals on this basis. We next looked for evidence of population heterogeneity by studying allele frequency differences between the 12 broad geographical regions (defined in Supplementary Fig. 4). The results for these 11-d.f. tests and associated quantile-quantile plots are shown in Fig. 2. Widespread small differences in allele frequencies are evident as an increased slope of the line (Fig. 2b); in addition, a few loci show much larger differences (Fig. 2a and Supplementary Fig. 6).
Thirteen genomic regions showing strong geographical variation are listed in Table 1, and Supplementary Fig. 7 shows the way in which their allele frequencies vary geographically. The predominant pattern is variation along a NW/SE axis. The most likely cause for these marked geographical differences is natural selection, most plausibly in populations ancestral to those now in the UK. Variation due to selection has previously been implicated at LCT (lactase) and major histocompatibility complex (MHC)7–9, and within-UK differentiation at 4p14 has been found independently10, but others seem to be new findings. All but three of the regions contain known genes. Aside from evolutionary interest, genes showing evidence of natural selection are particularly interesting for the biology of traits such as infectious diseases; possible targets for selection include NADSYN1 (NAD synthetase 1) at 11q13, which could have a role in prevention of pellagra, as well as TLR1 (toll-like receptor 1) at 4p14, for which a role in the biology of tuberculosis and leprosy has been suggested10.
There may be important population structure that is not well captured by current geographical region of residence. Present implementations of strongly model-based approaches such as STRUCTURE11,12 are impracticable for data sets of this size, and we reverted to the classical method of principal components13,14, using a subset of 197,175 SNPs chosen to reduce inter-locus linkage disequilibrium. Nevertheless, four of the first six principal components clearly picked up effects attributable to local linkage disequilibrium rather than genome-wide structure. The remaining two components show the same predominant geographical trend from NW to SE but, perhaps unsurprisingly, London is set somewhat apart (Supplementary Fig. 8).
The overall effect of population structure on our association results seems to be small, once recent migrants from outside Europe are excluded. Estimates of over-dispersion of the association trend test statistics (usually denoted λ; ref. 15) ranged from 1.03 and 1.05 for RA and T1D, respectively, to 1.08–1.11 for the remaining diseases. Some of this over-dispersion could be due to factors other than structure, and this possibility is supported by the fact that inclusion of the two ancestry informative principal components as covariates in the association tests reduced the over-dispersion estimates only slightly (Supplementary Table 6), as did stratification by geographical region. This impression is confirmed on noting that P values with and without correction for structure are similar (Supplementary Fig. 9). We conclude that, for most of the genome, population structure has at most a small confounding effect in our study, and as a consequence the analyses reported below do not correct for structure. In principle, apparent associations in the few genomic regions identified in Table 1 as showing strong geographical differentiation should be interpreted with caution, but none arose in our analyses.
Disease association results
We assessed evidence for association in several ways (see Methods for details), drawing on both classical and bayesian statistical approaches. For polymorphic SNPs on the Affymetrix chip, we performed trend tests (1 degree of freedom16) and general genotype tests (2 degrees of freedom16, referred to as genotypic) between each case collection and the pooled controls, and calculated analogous Bayes factors. There are examples from animal models where genetic effects act differently in males and females17, and to assess this in our data we applied a
[Figure 2 plots. Panel a: −log10(P) (0 to 15) by chromosome (1 to 22, X). Panel b: observed test statistic against expected chi-squared value.]
Figure 2 | Genome-wide picture of geographic variation. a, P values for the 11-d.f. test for difference in SNP allele frequencies between geographical regions, within the 9 collections. SNPs have been excluded using the project quality control filters described in Methods. Green dots indicate SNPs with a P value < 1 × 10^-5. b, Quantile-quantile plots of these test statistics. SNPs at which the test statistic exceeds 100 are represented by triangles at the top of the plot, and the shaded region is the 95% concentration band (see Methods). Also shown in blue is the quantile-quantile plot resulting from removal of all SNPs in the 13 most differentiated regions (Table 1).
Table 1 | Highly differentiated SNPs
Chromosome | Genes | Region (Mb) | SNP | Position | P value
2q21 | LCT | 135.16–136.82 | rs1042712 | 136,379,576 | 5.54 × 10^-13
4p14 | TLR1, TLR6, TLR10 | 38.51–38.74 | rs7696175 | 38,643,552 | 1.51 × 10^-12
4q28 | - | 137.97–138.01 | rs1460133 | 137,999,953 | 4.43 × 10^-8
6p25 | IRF4 | 0.32–0.42 | rs9378805 | 362,727 | 5.39 × 10^-13
6p21 | HLA | 31.10–31.55 | rs3873375 | 31,359,339 | 1.07 × 10^-11
9p24 | DMRT1 | 0.86–0.88 | rs11790408 | 866,418 | 4.96 × 10^-7
11p15 | NAV2 | 19.55–19.70 | rs12295525 | 19,661,808 | 7.44 × 10^-8
11q13 | NADSYN1, DHCR7 | 70.78–70.93 | rs12797951 | 70,820,914 | 3.01 × 10^-8
12p13 | DYRK4, AKAP3, NDUFA9, RAD51AP1, GALNT8 | 4.37–4.82 | rs10774241 | 4,553,727 | 2.73 × 10^-8
14q12 | HECTD1, AP4S1, STRN3 | 30.41–31.03 | rs17449560 | 30,598,823 | 1.46 × 10^-7
19q13 | GIPR, SNRPD2, QPCTL, SIX5, DMPK, DMWD, RSHL1, SYMPK, FOXA3 | 50.84–51.09 | rs3760843 | 50,980,546 | 4.19 × 10^-7
20q12 | - | 38.30–38.77 | rs2143877 | 38,526,309 | 1.12 × 10^-9
Xp22 | - | 2.06–2.08 | rs6644913 | 2,061,160 | 1.23 × 10^-7
Properties of SNPs that show large allele frequency differences between samples of individuals from 12 regions across Great Britain. Regions showing differentiated SNPs are given with details of the SNP with the smallest P value in each region for differentiation on the 11-d.f. test of differences in SNP allele frequencies between geographical regions, within the 9 collections. Cluster plots for these SNPs have been examined visually. Signal plots appear in Supplementary Information. Positions are in NCBI build-35 coordinates.
NATURE | Vol 447 | 7 June 2007 | ARTICLES
663 | ©2007 Nature Publishing Group
doi:10.1371/journal.pgen.1003258 (Thornton et al.); doi:10.1038/nature05911 (Burton et al.)
A new association test
Methods), and excluded 153 individuals on this basis. We nextlooked for evidence of population heterogeneity by studying allelefrequency differences between the 12 broad geographical regions(defined in Supplementary Fig. 4). The results for these 11-d.f. testsand associated quantile-quantile plots are shown in Fig. 2. Wide-spread small differences in allele frequencies are evident as anincreased slope of the line (Fig. 2b); in addition, a few loci show muchlarger differences (Fig. 2a and Supplementary Fig. 6).
Thirteen genomic regions showing strong geographical variationare listed in Table 1, and Supplementary Fig. 7 shows the way in whichtheir allele frequencies vary geographically. The predominant patternis variation along a NW/SE axis. The most likely cause for thesemarked geographical differences is natural selection, most plausiblyin populations ancestral to those now in the UK. Variation due toselection has previously been implicated at LCT (lactase) and majorhistocompatibility complex (MHC)7–9, and within-UK differentiationat 4p14 has been found independently10, but others seem to be newfindings. All but three of the regions contain known genes. Aside from
evolutionary interest, genes showing evidence of natural selection areparticularly interesting for the biology of traits such as infectious dis-eases; possible targets for selection include NADSYN1 (NAD synthe-tase 1) at 11q13, which could have a role in prevention of pellagra, aswell as TLR1 (toll-like receptor 1) at 4p14, for which a role in thebiology of tuberculosis and leprosy has been suggested10.
There may be important population structure that is not wellcaptured by current geographical region of residence. Presentimplementations of strongly model-based approaches such asSTRUCTURE11,12 are impracticable for data sets of this size, and wereverted to the classical method of principal components13,14, using asubset of 197,175 SNPs chosen to reduce inter-locus linkage disequi-librium. Nevertheless, four of the first six principal componentsclearly picked up effects attributable to local linkage disequilibriumrather than genome-wide structure. The remaining two componentsshow the same predominant geographical trend from NW to SE but,perhaps unsurprisingly, London is set somewhat apart (Supplemen-tary Fig. 8).
The overall effect of population structure on our associationresults seems to be small, once recent migrants from outsideEurope are excluded. Estimates of over-dispersion of the associationtrend test statistics (usually denoted l; ref. 15) ranged from 1.03 and1.05 for RA and T1D, respectively, to 1.08–1.11 for the remainingdiseases. Some of this over-dispersion could be due to factors otherthan structure, and this possibility is supported by the fact that inclu-sion of the two ancestry informative principal components as cov-ariates in the association tests reduced the over-dispersion estimatesonly slightly (Supplementary Table 6), as did stratification by geo-graphical region. This impression is confirmed on noting thatP values with and without correction for structure are similar(Supplementary Fig. 9). We conclude that, for most of the genome,population structure has at most a small confounding effect in ourstudy, and as a consequence the analyses reported below do notcorrect for structure. In principle, apparent associations in the fewgenomic regions identified in Table 1 as showing strong geographicaldifferentiation should be interpreted with caution, but none arose inour analyses.
Disease association results
We assessed evidence for association in several ways (see Methods fordetails), drawing on both classical and bayesian statistical approaches.For polymorphic SNPs on the Affymetrix chip, we performed trendtests (1 degree of freedom16) and general genotype tests (2 degrees offreedom16, referred to as genotypic) between each case collection andthe pooled controls, and calculated analogous Bayes factors. Thereare examples from animal models where genetic effects act differentlyin males and females17, and to assess this in our data we applied a
−log
10(P
)
0
5
10
15
Chromosome
22 X2120191817161514131211109876543213020
20
100
0
40
80
60
40
100
Obs
erve
d te
st s
tatis
tic
Expected chi-squared value
a
b
Figure 2 | Genome-wide picture of geographic variation. a, P values for the11-d.f. test for difference in SNP allele frequencies between geographicalregions, within the 9 collections. SNPs have been excluded using the projectquality control filters described in Methods. Green dots indicate SNPs with aP value ,1 3 1025. b, Quantile-quantile plots of these test statistics. SNPs atwhich the test statistic exceeds 100 are represented by triangles at the top ofthe plot, and the shaded region is the 95% concentration band (seeMethods). Also shown in blue is the quantile-quantile plot resulting fromremoval of all SNPs in the 13 most differentiated regions (Table 1).
Table 1 | Highly differentiated SNPs
Chromosome  Genes                                                       Region (Mb)    SNP         Position     P value
2q21        LCT                                                         135.16–136.82  rs1042712   136,379,576  5.54 × 10^-13
4p14        TLR1, TLR6, TLR10                                           38.51–38.74    rs7696175   38,643,552   1.51 × 10^-12
4q28        -                                                           137.97–138.01  rs1460133   137,999,953  4.43 × 10^-8
6p25        IRF4                                                        0.32–0.42      rs9378805   362,727      5.39 × 10^-13
6p21        HLA                                                         31.10–31.55    rs3873375   31,359,339   1.07 × 10^-11
9p24        DMRT1                                                       0.86–0.88      rs11790408  866,418      4.96 × 10^-7
11p15       NAV2                                                        19.55–19.70    rs12295525  19,661,808   7.44 × 10^-8
11q13       NADSYN1, DHCR7                                              70.78–70.93    rs12797951  70,820,914   3.01 × 10^-8
12p13       DYRK4, AKAP3, NDUFA9, RAD51AP1, GALNT8                      4.37–4.82      rs10774241  4,553,727    2.73 × 10^-8
14q12       HECTD1, AP4S1, STRN3                                        30.41–31.03    rs17449560  30,598,823   1.46 × 10^-7
19q13       GIPR, SNRPD2, QPCTL, SIX5, DMPK, DMWD, RSHL1, SYMPK, FOXA3  50.84–51.09    rs3760843   50,980,546   4.19 × 10^-7
20q12       -                                                           38.30–38.77    rs2143877   38,526,309   1.12 × 10^-9
Xp22        -                                                           2.06–2.08      rs6644913   2,061,160    1.23 × 10^-7
Properties of SNPs that show large allele frequency differences between samples of individuals from 12 regions across Great Britain. Regions showing differentiated SNPs are given with details of the SNP with the smallest P value in each region for differentiation on the 11-d.f. test of differences in SNP allele frequencies between geographical regions, within the 9 collections. Cluster plots for these SNPs have been examined visually. Signal plots appear in Supplementary Information. Positions are in NCBI build-35 coordinates.
NATURE | Vol 447 | 7 June 2007 | ARTICLES | p. 663
©2007 Nature Publishing Group
$$\mathrm{ESM}_K = \sum_{i=1}^{K} \left( -\log_{10}(p_i) + \log_{10}\frac{i}{K} \right)$$
Thornton et al., doi:10.1371/journal.pgen.1003258
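The ESM statistic on this slide can be computed directly from a list of single-marker p-values. The sketch below makes two assumptions that the slide does not state: p_i is taken to be the i-th smallest p-value in the region, and K is capped at the number of markers supplied. The function name is hypothetical.

```python
import math


def esm(p_values, k):
    """ESM_K = sum over i = 1..K of ( -log10(p_i) + log10(i / K) ).

    Assumptions: p_i is the i-th smallest p-value, and K is capped at the
    number of p-values supplied. P-values must be strictly positive.
    """
    ordered = sorted(p_values)
    k = min(k, len(ordered))
    if k < 1:
        raise ValueError("need at least one p-value and k >= 1")
    return sum(
        -math.log10(p) + math.log10(i / k)
        for i, p in enumerate(ordered[:k], start=1)
    )


# Made-up p-values: the first set matches the expected order statistics
# i/K, the second contains one strongly associated marker.
print(esm([0.25, 0.5, 0.75, 1.0], k=4))
print(esm([0.001, 0.5, 0.75, 1.0], k=4))
```

Each term compares an observed −log10 p-value with the value i/K expected for that rank, so the statistic grows when a region holds an excess of small p-values rather than one extreme marker alone.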
ESM is a more powerful test
[Figure: power (y-axis, 0 to 1) against mean effect size λ (x-axis, 0 to 0.5). Legend: GWAS; GWAS, no recombination; resequencing; resequencing, no recombination.]
(Caveat: requires permutation to get p-values)
Thornton et al., doi:10.1371/journal.pgen.1003258
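The permutation caveat can be made concrete with a generic label-shuffling routine. The slide says only that permutation is required; the scheme below (shuffle case/control labels, keep genotypes fixed, add-one correction) is an assumption, and both function names are hypothetical. A toy statistic stands in for the region-level statistic, which in practice would be recomputed from the association tests on each permuted data set.

```python
import random


def permutation_p_value(statistic, genotypes, labels, n_perm=1000, seed=0):
    """One-sided permutation p-value for statistic(genotypes, labels).

    Labels are shuffled while genotypes stay fixed, so the correlation
    among markers is preserved. Larger statistics count as more extreme.
    """
    rng = random.Random(seed)
    observed = statistic(genotypes, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if statistic(genotypes, shuffled) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)


def mean_difference(genotypes, labels):
    """Toy statistic: absolute case/control difference in mean allele count."""
    cases = [g for g, y in zip(genotypes, labels) if y == 1]
    controls = [g for g, y in zip(genotypes, labels) if y == 0]
    return abs(sum(cases) / len(cases) - sum(controls) / len(controls))


# Made-up data: 10 cases carrying two copies, 10 controls carrying none.
genotypes = [2] * 10 + [0] * 10
labels = [1] * 10 + [0] * 10
print(permutation_p_value(mean_difference, genotypes, labels))
```

The smallest attainable p-value is 1 / (n_perm + 1), so the number of permutations bounds the significance level that can be reported.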
Running ESM on real data
• We think we can implement ESM using a mix of the PLINK toolkit plus some custom programs.
• We need data to test it out on.
• There are very few modern GWAS available for reanalysis.
• Lack of data sharing hurts the field.
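One piece of the "PLINK plus custom programs" pipeline could be a small reader for single-marker p-values. The sketch below assumes whitespace-delimited output with a header row containing `SNP` and `P` columns, as in the `.assoc` files written by PLINK 1.x `--assoc`; the function name and the sample rows are made up for illustration.

```python
def read_assoc_p_values(lines):
    """Return {SNP id: p-value} from whitespace-delimited association output.

    Assumes a header row with `SNP` and `P` columns. Rows whose p-value
    is missing (`NA`) are skipped.
    """
    rows = (line.split() for line in lines if line.strip())
    header = next(rows)
    snp_col, p_col = header.index("SNP"), header.index("P")
    p_values = {}
    for fields in rows:
        if fields[p_col] != "NA":
            p_values[fields[snp_col]] = float(fields[p_col])
    return p_values


# Made-up rows in the assumed layout, not real results.
sample = [
    " CHR  SNP   BP   A1  F_A   F_U   A2  CHISQ  P       OR",
    "   1  rs1  1000   A  0.30  0.20   G  5.2    0.0226  1.71",
    "   1  rs2  2000   C  0.10  0.10   T  NA     NA      NA",
]
print(read_assoc_p_values(sample))
```

With a real file, the same function can be called on an open file handle, since it only iterates over lines.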
Rare alleles and missing heritability
• Current tests are underpowered
• Heterogeneity means that GWAS “hits” tag few causative mutations
• Causative mutations that are tagged tend to be (relatively) common. These “common” mutations have effect sizes much smaller than the typical causative mutation that segregates
[Figure: mean number of causative singletons per individual (y-axis, 0 to 0.003) against number of copies of derived allele at focal SNP (x-axis: 0, 1, 2), in ten panels labelled 0.010, 0.025, 0.050, 0.075, 0.100, 0.125, 0.175, 0.250, 0.350 and 0.500. Legend (focal SNP): most significant marker; unassociated SNP.]
Thornton et al., doi:10.1371/journal.pgen.1003258
Population growth
[Figure: schematic of population size (y-axis) over time (x-axis, with Past and Present marked).]
H^2 insensitive to growth
[Figure: mean broad-sense heritability (y-axis, 0.01 to 0.04) against average effect size of new mutation (x-axis, 0 to 0.5). Legend (model): constant; growth.]
Unpublished
Consistent with recent findings from other groups
©2014 Nature America, Inc. All rights reserved.
NATURE GENETICS | ADVANCE ONLINE PUBLICATION | ANALYSIS | p. 2
But despite these substantial shifts in the overall frequency spectrum, the impact on genetic load—namely, the mean number of deleterious variants per individual and thus the average fitness—is much more subtle.
In the semidominant case, the individual burden is essentially unaffected by these demographic events (Fig. 1c,d). With growth, the increased number of segregating sites is balanced exactly by a decrease in the mean frequency (with the converse being true for the bottleneck model) so that the number of variants per individual stays constant. This kind of balance is predicted by classic mutation-selection balance models (ref. 18) and can be shown to hold for general changes in population size, provided that selection is strong and deleterious alleles are at least partially dominant (Supplementary Note).
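The balance in the semidominant case can be illustrated with the classic deterministic mutation-selection approximation. This is a sketch under assumptions that this excerpt does not spell out: genotype fitnesses 1, 1 − hs and 1 − s, a per-site mutation rate u, dominance h > 0, and selection strong enough that drift can be ignored.

```latex
\hat{q} \approx \frac{u}{hs},
\qquad
\mathbb{E}[\text{deleterious alleles per individual per site}] = 2\hat{q} \approx \frac{2u}{hs}
```

Population size N appears in neither expression, which is the sense in which the individual burden is insensitive to demography in this regime. As a purely illustrative calculation under this parameterization, h = 1/2, s = 0.01 and u = 10^-8 give an equilibrium frequency of about 2 × 10^-6, or roughly 4 deleterious alleles per individual per Mb of sites.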
The behavior of the recessive model is more complicated (Fig. 1e,f). In the bottleneck model, the mean number of deleterious variants per individual drops by 60% as a result of the bottleneck. This drop is due to the loss of rare alleles. However, during the bottleneck, some deleterious alleles drift to higher frequencies (refs 11, 19), contributing disproportionately to the number of homozygotes. This causes a transient increase in the number of deleterious homozygous sites per individual, i.e., the recessive load. Meanwhile, population growth has a less pronounced effect on recessive variation, leaving the mean number of deleterious alleles per individual unchanged but causing a slight decrease in load.
More generally, the manner in which demography affects individual load varies with the degree of dominance and the strength of selection (Fig. 2, Supplementary Note and Supplementary Table 1). The behavior of these models can be classified into three selection regimes: strong, weak and effectively neutral. In the case of strong selection, i.e., where selection is much stronger than drift (approximately s > 10^-3 for semidominant mutations), deleterious variants are extremely unlikely to fix, and virtually all of the genetic load is due to segregating variation. In this range, we infer that human demography has had no impact on semidominant load (and, more generally, for mutations with at least some dominance component) and has had only small effects on recessive load.
The case of weak selection, where drift and selection have comparable effects, is more complex, as fixed alleles may contribute appreciably to load, and the steady-state load depends on population size (ref. 20). However, the approach to the steady state is very slow, being limited by both the time to fixation (on the order of 4N generations) and the mutational input (on the order of 1/(2NU) generations, where U is the mutation rate). For both the semidominant and recessive cases, population growth is too recent to have substantially decreased the load. Recent growth increases the input of new deleterious mutations, but this effect is counterbalanced by the fact that the new deleterious mutations are proportionally rarer, as well as by the input of beneficial mutations. The bottleneck in Europeans is estimated to have occurred further in the past and at much lower population sizes (ref. 5; Supplementary Fig. 1), thus increasing its effect. In this case, the increase in drift causes segregating deleterious alleles to increase in frequency, sometimes reaching fixation, and results in a slight increase in load (Supplementary Fig. 2). The out-of-Africa bottleneck should thus lead to a slight increase of load in Europeans, most notably for recessive sites.
In the effectively neutral range, where selection has negligible effects on the population dynamics, segregating variation contributes negligibly, and hence the load does not change with demography. Thus, across all three selection regimes, recent human demographic history is likely to have had virtually no impact on genetic load at partially dominant sites and only weak effects at recessive sites.
Analysis of exome data
To test these predictions, we analyzed two recent data sets of exome sequences from individuals of west African and European descent. Previous work comparing load in different populations has produced conflicting conclusions depending on the data set, choice of measures and functional annotations used. For example, Lohmueller et al. (ref. 11) reported that there is “proportionally more deleterious variation in European than in African populations.” Similarly, Tennessen et al. (ref. 5) found that European Americans had more nonreference genotypes when they used a conservative classification of deleterious sites but
[Figure 1, panels a–f. a, b: population size (y-axis, log scale) against time in generations for the bottleneck (a) and growth (b) models. c–f: number per MB (y-axis, log scale) against time since beginning of bottleneck or time since beginning of growth (generations), for semidominant (c, d) and recessive (e, f) mutations. Curve labels: number of segregating sites; number of rare segregating sites; number of deleterious alleles per individual; number of rare deleterious alleles per individual; load: number of deleterious alleles per individual; load: number of homozygous sites per individual.]
Figure 1 Time course of load and other key aspects of variation through a bottleneck and exponential growth. (a,b) The bottleneck (a) and exponential growth (b). (c–f) The expected number of variants and alleles per MB assuming semidominant mutations (c,d) or recessive mutations (e,f) with s = 1% and a mutation rate per site per generation of 10^-8.
Simons et al., doi:10.1038/ng.2896
Power is affected
[Figure: frequency in population (y-axis, 0 to 0.08) against effect size of segregating causative mutation (x-axis, 0 to 0.1). Legend (model): constant; growth.]
[Figure: power (y-axis, 0 to 0.8) against mean effect size of causative mutation (x-axis, 0 to 0.5). Legend (statistic): ESM50; Logit; SKAT. Legend (model): constant; growth.]
Unpublished
Excellent fit to empirical data
[Figure: histogram of number of markers (y-axis, 0 to 14) by frequency of most-associated marker (x-axis, 0 to 1).]
Unpublished
Implications
• Power to detect regions with modest effects on risk (4-5% contribution to broad-sense heritability) is very low in growing populations
• The explanatory power of simple models is probably far from exhausted
Implications
• Much more likely to detect loci with mutations of modest effect
• Underlying distribution of mean effect size across loci is completely unknown in any system
[Figure, repeated from the "Power is affected" slide: power (y-axis, 0 to 0.8) against mean effect size of causative mutation (x-axis, 0 to 0.5). Legend (statistic): ESM50; Logit; SKAT. Legend (model): constant; growth.]
Unpublished
Future work
• Multilocus models with epistasis
• Machine learning approaches: do they work?
• Develop new simulation tools
• Make simulation output available
• Implement ESM test for analyzing real GWAS data
Other work in the lab
• Copy number variation in Drosophila: doi: 10.1093/molbev/msu124
• Detecting TE insertions using paired-end data in Drosophila: doi: 10.1093/molbev/mst129
• Modeling experimental evolution: doi: 10.1093/molbev/msu048
• Structural variation and variation in gene expression