The American Journal of Human Genetics - Best of 2011 & 2012
DESCRIPTION
The best publications of The American Journal of Human Genetics, 2011-2012
TRANSCRIPT
Foreword
We are pleased to introduce a new series of “Best of…” reprint collections from Cell Press,
which give us a chance to reflect on what has caught the attention of AJHG readers in late
2011 and early 2012. This collection includes a selection of eight of the most-accessed
research articles across a range of topics and the most highly accessed review article of
2012. To select the articles, we considered the number of requests for PDF and full-text
HTML versions of a given article. Half of the articles were published in the last six months of
2011 and half were published between January and June of 2012; in doing so, we are able
to capture the full spectrum of articles that have been published during the past 12 months.
We acknowledge that no single measurement can truly be indicative of “the best” research
papers over a given period of time. This is especially true when sufficient time has not
necessarily passed to allow one to fully appreciate the relative importance of a discovery.
That said, we think it is still informative to look back at the scientific community’s interests
in what has been published in AJHG over the past year.
In this collection, you will see a range of the exciting topics that have widely captured
the attention and enthusiasm of our readers, including genome-wide association studies,
evolutionary and population genetics, genetics of disease, and new approaches for
analyzing sequencing data.
We hope that you will enjoy reading this special collection and that you will visit
http://www.cell.com/AJHG/home to check out the latest findings that we have had the privilege to
publish. To stay on top of what your colleagues have been reading over the past 30 days,
check out http://www.cell.com/AJHG/top20. Also be sure to visit http://www.cell.com to
find other high quality papers published in the full collection of Cell Press journals.
Finally, we are grateful for the generosity of our sponsors, who helped make this reprint
collection possible.
For information about the Best of series, please contact:
Jonathan Christison
Program Director, Best of Cell Press
617-397-2893
Best of 2011 and 2012

Volume 89

Denisova Admixture and the First Modern Human Dispersals into Southeast Asia and Oceania
David Reich, Nick Patterson, Martin Kircher, Frederick Delfin, Madhusudan R. Nandineni, Irina Pugach, Albert Min-Shan Ko, Ying-Chin Ko, Timothy A. Jinam, Maude E. Phipps, Naruya Saitou, Andreas Wollstein, Manfred Kayser, Svante Pääbo, and Mark Stoneking

Rare-Variant Association Testing for Sequencing Data with the Sequence Kernel Association Test
Michael C. Wu, Seunggeun Lee, Tianxi Cai, Yun Li, Michael Boehnke, and Xihong Lin

Expansion of Intronic GGCCTG Hexanucleotide Repeat in NOP56 Causes SCA36, a Type of Spinocerebellar Ataxia Accompanied by Motor Neuron Involvement
Hatasu Kobayashi, Koji Abe, Tohru Matsuura, Yoshio Ikeda, Toshiaki Hitomi, Yuji Akechi, Toshiyuki Habu, Wanyang Liu, Hiroko Okuda, and Akio Koizumi

A Mutation in a Skin-Specific Isoform of SMARCAD1 Causes Autosomal-Dominant Adermatoglyphia
Janna Nousbeck, Bettina Burger, Dana Fuchs-Telem, Mor Pavlovsky, Shlomit Fenig, Ofer Sarig, Peter Itin, and Eli Sprecher

Volume 90

Five Years of GWAS Discovery
Peter M. Visscher, Matthew A. Brown, Mark I. McCarthy, and Jian Yang

Mitochondrial DNA and Y Chromosome Variation Provides Evidence for a Recent Common Ancestry between Native Americans and Indigenous Altaians
Matthew C. Dulik, Sergey I. Zhadanov, Ludmila P. Osipova, Ayken Askapuli, Lydia Gau, Omer Gokcumen, Samara Rubinstein, and Theodore G. Schurr

A “Copernican” Reassessment of the Human Mitochondrial DNA Tree from its Root
Doron M. Behar, Mannis van Oven, Saharon Rosset, Mait Metspalu, Eva-Liis Loogväli, Nuno M. Silva, Toomas Kivisild, Antonio Torroni, and Richard Villems

Age-Related Somatic Structural Changes in the Nuclear Genome of Human Blood Cells
Lars A. Forsberg, Chiara Rasi, Hamid R. Razzaghian, Geeta Pakalapati, Lindsay Waite, Krista Stanton Thilbeault, Anna Ronowicz, Nathan E. Wineinger, Hemant K. Tiwari, Dorret Boomsma, Maxwell P. Westerman, Jennifer R. Harris, Robert Lyle, Magnus Essand, Fredrik Eriksson, Themistocles L. Assimes, Carlos Iribarren, Eric Strachan, Terrance P. O’Hanlon, Lisa G. Rider, Frederick W. Miller, Vilmantas Giedraitis, Lars Lannfelt, Martin Ingelsson, Arkadiusz Piotrowski, Nancy L. Pedersen, Devin Absher, and Jan P. Dumanski

Rare Mutations in XRCC2 Increase the Risk of Breast Cancer
D.J. Park, F. Lesueur, T. Nguyen-Dumont, M. Pertesi, F. Odefrey, F. Hammet, S.L. Neuhausen, E.M. John, I.L. Andrulis, M.B. Terry, M. Daly, S. Buys, F. Le Calvez-Kelm, A. Lonie, B.J. Pope, H. Tsimiklis, C. Voegele, F.M. Hilbers, N. Hoogerbrugge, A. Barroso, A. Osorio, the Breast
On the cover: Whole-mount preparation of a mouse cochlea, immunolabeled with myosin VIIa in green, DAPI in blue, and phalloidin in red
to stain hair cells, nuclei, and actin, respectively. The background sequence is that of connexin 26, the most commonly mutated gene in deaf
individuals. Image courtesy of Shaked Shivatzki and Karen Avraham, Tel Aviv University, Tel Aviv, Israel. Support: grant R01 DC011835 from the
National Institute on Deafness and Other Communication Disorders, National Institutes of Health. This image was the winner of the 2012 ASHG
GenArt competition.
ARTICLE
Denisova Admixture and the First Modern Human Dispersals into Southeast Asia and Oceania
David Reich,1,2,* Nick Patterson,2 Martin Kircher,3 Frederick Delfin,3 Madhusudan R. Nandineni,3,4
Irina Pugach,3 Albert Min-Shan Ko,3 Ying-Chin Ko,5 Timothy A. Jinam,6 Maude E. Phipps,7
Naruya Saitou,6 Andreas Wollstein,8,9 Manfred Kayser,9 Svante Pääbo,3 and Mark Stoneking3,*
It has recently been shown that ancestors of New Guineans and Bougainville Islanders have inherited a proportion of their ancestry from
Denisovans, an archaic hominin group from Siberia. However, only a sparse sampling of populations from Southeast Asia and Oceania
were analyzed. Here, we quantify Denisova admixture in 33 additional populations from Asia and Oceania. Aboriginal Australians, Near
Oceanians, Polynesians, Fijians, east Indonesians, and Mamanwa (a ‘‘Negrito’’ group from the Philippines) have all inherited genetic
material from Denisovans, but mainland East Asians, western Indonesians, Jehai (a Negrito group from Malaysia), and Onge (a Negrito
group from the Andaman Islands) have not. These results indicate that Denisova gene flow occurred into the common ancestors of New
Guineans, Australians, and Mamanwa but not into the ancestors of the Jehai and Onge and suggest that relatives of present-day East
Asians were not in Southeast Asia when the Denisova gene flow occurred. Our finding that descendants of the earliest inhabitants of
Southeast Asia do not all harbor Denisova admixture is inconsistent with a history in which the Denisova interbreeding occurred in
mainland Asia and then spread over Southeast Asia, leading to all its earliest modern human inhabitants. Instead, the data can be
most parsimoniously explained if the Denisova gene flow occurred in Southeast Asia itself. Thus, archaic Denisovans must have lived
over an extraordinarily broad geographic and ecological range, from Siberia to tropical Asia.
Introduction
The history of the earliest arrival of modern humans in Southeast Asia and Oceania from Africa remains controversial. Archaeological evidence has been interpreted to support either a single wave of settlement [1] or, alternatively, multiple waves of settlement, the first leading to the initial peopling of Southeast Asia and Oceania via a southern route and subsequent dispersals leading to the peopling of all of East Asia [2]. Mitochondrial DNA studies have been interpreted as supporting a single wave of migration via a southern route [3–5], although other interpretations are possible [6,7], and single-locus studies are unlikely to resolve this issue [8]. The largest genetic study of the region to date, based on 73 populations genotyped at 55,000 SNPs, concluded that the data were consistent with a single wave of settlement of Asia that moved from south to north and gave rise to all of the present-day inhabitants of the region [9]. However, another study of genome-wide SNP data argued for two waves of settlement [10], as did an analysis of diversity in the bacterium Helicobacter pylori [11].
The recent finding that Near Oceanians (New Guineans and Bougainville Islanders) have received 4%–6% of their genetic material from archaic Denisovans [12] in principle provides a powerful tool for understanding the earliest human migrations to the region and thus for resolving the question of the number of waves of settlement. The Denisova genetic material in Southeast Asians should be easily recognizable because it is very divergent from modern human DNA. Thus, the presence or absence of Denisova genetic material in particular populations should provide an informative probe for the migration history of Southeast Asia and Oceania, in addition to being interesting in its own right. However, the populations previously analyzed for signatures of Denisova admixture [12] comprise a very thin sampling of Southeast Asia and Oceania. In particular, no groups from island Southeast Asia or Australia were surveyed. Here, we report an analysis of genome-wide data from an additional 33 populations from south Asia, Southeast Asia, and Oceania; analyze the data for signatures of Denisova admixture; and use the results to infer the history of human migration(s) to this part of the world.
Material and Methods
SNP Array Data
We analyzed data for modern humans genotyped on Affymetrix 6.0 SNP arrays. We began by assembling previously published data for YRI (Yoruba in Ibadan, Nigeria) West Africans, CHB (Han Chinese in Beijing, China) Han Chinese, and CEU (Utah residents with Northern and Western European ancestry from the CEPH collection) European Americans from HapMap 3 [13]; Onge Andaman “Negritos” [14]; and New Guinea highlanders, Fijians, one Bornean population, and Polynesians from seven islands [10].
1 Department of Genetics, Harvard Medical School, Boston, MA 02115, USA
2 Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
3 Department of Evolutionary Genetics, Max Planck Institute for Evolutionary Anthropology, Leipzig D-04103, Germany
4 Laboratory of DNA Fingerprinting, Centre for DNA Fingerprinting and Diagnostics, Nampally, Hyderabad 500 001, India
5 Center of Excellence for Environmental Medicine, Kaohsiung Medical University, Kaohsiung City 807, Taiwan
6 Division of Population Genetics, National Institute of Genetics, Yata 1111, Mishima, Shizuoka 411-8540, Japan
7 School of Medicine and Health Sciences, Monash University (Sunway Campus), Selangor 46150, Malaysia
8 Cologne Center for Genomics, University of Cologne, Cologne D-50931, Germany
9 Department of Forensic Molecular Biology, Erasmus MC University Medical Center Rotterdam, 3000 CA Rotterdam, The Netherlands
*Correspondence: [email protected] (D.R.), [email protected] (M.S.)
DOI 10.1016/j.ajhg.2011.09.005. ©2011 by The American Society of Human Genetics. All rights reserved.
516 The American Journal of Human Genetics 89, 516–528, October 7, 2011
We also assembled data including two aboriginal Australian populations: one from the Northern Territories [15] and one from a human diversity cell line panel in the European Collection of Cell Cultures. The data also include nine Indonesian populations: four from the Nusa Tenggaras, two from the Moluccas, one from Borneo, and two from Sumatra. Finally, the data include three Malaysian populations (Temuan and Jehai [a Negrito group], both from the Malay peninsula, and Bidayuh from Sarawak on the island of Borneo), two Philippine populations (Manobo and a Negrito group, the Mamanwa), six aboriginal Taiwanese populations, one Dravidian population from southern India, and San Bushmen from southern Africa from the Centre d’Etude du Polymorphisme Humain (CEPH)-Human Genome Diversity Panel [16]. All volunteers provided informed consent for research into population history, and the approval of appropriate local ethical review boards was obtained. This project was approved by the ethical review boards of the University of Leipzig Medical Faculty and Harvard Medical School. The genotype data that we analyzed for this study are available from the authors on request.
Merging Genotyping Data with Chimpanzee, Denisova, and Neandertal
We merged the SNP array data from modern humans with genome sequence data from chimpanzee (CGSC 2.1/PanTro2 [17]), Denisova [12], and Neandertal [18]. We eliminated A/T and C/G SNPs to minimize strand misidentification. After removing SNPs with low genotyping completeness, we had data for 353,143 autosomal SNPs.
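The strand filter described above can be illustrated with a short sketch. It is not the authors' pipeline; the function names and the tuple layout of the SNP records are assumptions made for illustration. A/T and C/G SNPs are the ones whose two alleles are complements of each other, so the alleles look identical when read from either strand.

```python
# Complement of each base on the opposite DNA strand.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def is_strand_ambiguous(allele1, allele2):
    """Return True for A/T and C/G SNPs, whose alleles are complements of each other."""
    return COMPLEMENT[allele1.upper()] == allele2.upper()


def drop_ambiguous(snps):
    """Keep only SNPs whose alleles are not complementary.

    `snps` is a hypothetical list of (snp_id, allele1, allele2) tuples.
    """
    return [s for s in snps if not is_strand_ambiguous(s[1], s[2])]


# Invented SNP records for illustration only.
example = [("rs1", "A", "G"), ("rs2", "A", "T"), ("rs3", "C", "G"), ("rs4", "T", "C")]
kept = drop_ambiguous(example)
```

In this toy input, the A/T and C/G records are dropped and the other two are kept.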
Removal of Outlier Samples
We carried out principal components analysis by using EIGENSOFT [19]. We removed samples that were visual outliers relative to others from the same population on eigenvectors that were statistically significant by using a Tracy-Widom statistic (p < 0.05) [19], resulting in the removal of three YRI, two CHB, five Polynesians, one New Guinea highlander, two Jehai, and three Mamanwa.
Sequencing Data
We prepared DNA sequencing libraries with 300 bp insert sizes from a Papua New Guinea highlander (SH10) and Mamanwa Negrito (ID36) individual by using a previously described protocol [12]. The two libraries were sequenced on an Illumina Genome Analyzer IIx instrument with 2 × 101 + 7 cycles according to the manufacturer’s instructions for multiplex sequencing (FC-104-400x v4 sequencing chemistry and PE-203-4001 cluster generation kit v4). Bases and quality scores were generated with the Ibis base caller [20], and the reads were aligned with the Burrows-Wheeler Aligner (BWA) software [21] to the human (NCBI 36/hg18) and chimpanzee (CGSC 2.1/pantro2) genomes with default parameters. The resulting BAM files were filtered as follows: (1) a mapping quality of at least 30 was required; (2) we removed duplicated reads with the same outer coordinates; and (3) we removed reads with sequence entropy < 1.0, calculated by summing −p·log2(p) for each of the four nucleotides. The sequencing data are publicly available from the European Nucleotide Archive (Project ID ERP000121), and summary statistics are provided in Table S1, available online.
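The mapping-quality and entropy filters above can be sketched in a few lines. This is a minimal illustration and not the authors' code: the function names are hypothetical, reads are represented as plain strings instead of BAM records, and the duplicate-removal step is omitted.

```python
import math
from collections import Counter


def sequence_entropy(read):
    """Shannon entropy in bits over the four nucleotides: the sum of -p*log2(p)."""
    counts = Counter(base for base in read.upper() if base in "ACGT")
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def passes_filters(read, mapping_quality, min_mapq=30, min_entropy=1.0):
    """Apply filters (1) and (3) from the text: MAPQ >= 30 and entropy >= 1.0."""
    return mapping_quality >= min_mapq and sequence_entropy(read) >= min_entropy
```

A read with all four bases in equal proportion has the maximum entropy of 2 bits, whereas a homopolymer run has entropy 0 and would be removed.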
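Estimating Denisova pD(X), Near Oceanian pN(X), and Australian pA(X) Ancestry
We define the frequency of one of the alleles at SNP i in population X as $z^i_X$. We can then compute three statistics for a given population X that are informative about admixture:

$$p_D(X) = \frac{\sum_{i=1}^{n} \left(z^i_{\text{Outgroup}} - z^i_{\text{Archaic}}\right)\left(z^i_{\text{East Asian}} - z^i_{X}\right)}{\sum_{i=1}^{n} \left(z^i_{\text{Outgroup}} - z^i_{\text{Archaic}}\right)\left(z^i_{\text{East Asian}} - z^i_{\text{New Guinea}}\right)} = \frac{f_4(\text{Outgroup}, \text{Archaic}; \text{East Asian}, X)}{f_4(\text{Outgroup}, \text{Archaic}; \text{East Asian}, \text{New Guinea})}$$
(Equation 1)

$$p_N(X) = 1 - \frac{\sum_{i=1}^{n} \left(z^i_{\text{Outgroup}} - z^i_{\text{Australia}}\right)\left(z^i_{X} - z^i_{\text{New Guinea}}\right)}{\sum_{i=1}^{n} \left(z^i_{\text{Outgroup}} - z^i_{\text{Australia}}\right)\left(z^i_{\text{East Asian}} - z^i_{\text{New Guinea}}\right)} = 1 - \frac{f_4(\text{Outgroup}, \text{Australia}; X, \text{New Guinea})}{f_4(\text{Outgroup}, \text{Australia}; \text{East Asian}, \text{New Guinea})}$$
(Equation 2)

$$p_A(X) = 1 - \frac{\sum_{i=1}^{n} \left(z^i_{\text{Outgroup}} - z^i_{\text{New Guinea}}\right)\left(z^i_{X} - z^i_{\text{Australia}}\right)}{\sum_{i=1}^{n} \left(z^i_{\text{Outgroup}} - z^i_{\text{New Guinea}}\right)\left(z^i_{\text{East Asian}} - z^i_{\text{Australia}}\right)} = 1 - \frac{f_4(\text{Outgroup}, \text{New Guinea}; X, \text{Australia})}{f_4(\text{Outgroup}, \text{New Guinea}; \text{East Asian}, \text{Australia})}$$
(Equation 3)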
The right side of each equation shows that these statistics can also be expressed as ratios of f4 statistics [14], which provide unbiased estimates of admixture proportions even in the absence of populations that are closely related to the analyzed populations (Appendix A). For the ancestry estimates reported in Table 1, we use Outgroup = YRI (West Africans), Archaic = Denisova, and East Asian = CHB (Han Chinese). Table S2 and Table S3 demonstrate that consistent values are obtained when we replace these choices with a variety of distantly related populations. Further details are provided in Appendix A.
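The ratio statistics in Equations 1 and 2 can be sketched directly from per-SNP allele frequencies. The following is an illustrative sketch and not the authors' software: the function names, the dictionary layout, and the toy frequencies are all assumptions. The f4 function here uses a mean over SNPs, not a sum, which leaves every ratio unchanged.

```python
def f4(freqs, a, b, c, d):
    """Mean over SNPs of (z_a - z_b) * (z_c - z_d)."""
    terms = [
        (za - zb) * (zc - zd)
        for za, zb, zc, zd in zip(freqs[a], freqs[b], freqs[c], freqs[d])
    ]
    return sum(terms) / len(terms)


def p_d(freqs, x, outgroup="YRI", archaic="Denisova",
        east_asian="CHB", new_guinea="NewGuinea"):
    """Equation 1: Denisovan ancestry of x as a fraction of that in New Guinea."""
    return (f4(freqs, outgroup, archaic, east_asian, x)
            / f4(freqs, outgroup, archaic, east_asian, new_guinea))


def p_n(freqs, x, outgroup="YRI", australia="Australia",
        east_asian="CHB", new_guinea="NewGuinea"):
    """Equation 2: Near Oceanian ancestry of x."""
    return 1 - (f4(freqs, outgroup, australia, x, new_guinea)
                / f4(freqs, outgroup, australia, east_asian, new_guinea))


# Toy allele frequencies at five SNPs (invented for illustration only).
freqs = {
    "YRI":       [0.1, 0.8, 0.5, 0.3, 0.9],
    "Denisova":  [1.0, 0.0, 1.0, 0.0, 1.0],
    "CHB":       [0.2, 0.7, 0.4, 0.2, 0.8],
    "NewGuinea": [0.4, 0.5, 0.6, 0.1, 0.9],
    "Australia": [0.5, 0.4, 0.6, 0.2, 0.8],
}
# A synthetic population that is an exact 50/50 mix of CHB and New Guinea.
freqs["Mix"] = [(c + g) / 2 for c, g in zip(freqs["CHB"], freqs["NewGuinea"])]
```

In this toy setting, the synthetic 50/50 mixture yields a value of about 0.5 for both statistics, the reference population New Guinea yields 1, and CHB yields 0 for pD. Equation 3 follows the same pattern with the roles of New Guinea and Australia exchanged.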
Block Jackknife Standard Error and Statistical Testing
We used a block jackknife [22,23] to compute standard errors, dropping each nonoverlapping 5 cM stretch of the genome in turn and studying the variance of each statistic of interest to obtain an approximately normally distributed standard error [12,18]. To test whether pD(X), pN(X), pA(X), and pD(X) − pN(X) are statistically consistent with zero for any tested population X, we computed the statistics along with a standard error from the block jackknife, and then used a two-sided Z test that computes the number of standard errors from zero. To implement the 4 Population Test [14] for whether an unrooted phylogenetic tree ([A,B],[C,D]) relating four populations is consistent with the data, we computed the statistic f4(A,B;C,D) and assessed the number of standard errors from zero.
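A minimal sketch of this procedure follows. It assumes the simple equal-weight delete-one-block jackknife; the authors' implementation may differ, for example by weighting blocks by their SNP counts. The function names and the toy per-block values are hypothetical.

```python
import math


def block_jackknife(blocks, statistic):
    """Delete-one-block jackknife (equal-weight form; an assumption of this sketch).

    `blocks` is a list with one entry per genomic block, and `statistic`
    maps a list of blocks to a number. Returns (estimate, standard_error).
    """
    n = len(blocks)
    estimate = statistic(blocks)
    leave_one_out = [statistic(blocks[:i] + blocks[i + 1:]) for i in range(n)]
    mean_loo = sum(leave_one_out) / n
    variance = (n - 1) / n * sum((t - mean_loo) ** 2 for t in leave_one_out)
    return estimate, math.sqrt(variance)


def two_sided_p(z):
    """Two-sided p value for a Z score under a standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2))


def ratio_of_sums(blocks):
    """Ratio statistic such as pD(X): each block holds (numerator, denominator) sums."""
    return sum(b[0] for b in blocks) / sum(b[1] for b in blocks)


# Invented per-block sums for illustration only.
toy_blocks = [(0.9, 2.0), (1.1, 2.1), (1.0, 1.9), (1.2, 2.2), (0.8, 1.8)]
estimate, se = block_jackknife(toy_blocks, ratio_of_sums)
z = estimate / se
p = two_sided_p(z)
```

Recomputing the whole ratio with each block removed, and not averaging per-block ratios, keeps the estimate stable when individual blocks contain few SNPs.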
Results
Quantifying Denisova Admixture from Genome-wide SNP Data
To investigate which modern humans have inherited genetic material from Denisovans, we assembled SNP data from 33 populations from mainland East Asia, island Southeast Asia, New Guinea, Fiji, Polynesia, Australia, and India, and genotyped all of them on Affymetrix 6.0 arrays. After removing samples that were outliers with respect to
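Table 1. Estimates of Denisovan and Near Oceanian Ancestry from SNP Data
pD(X): Denisovan ancestry as % of New Guinea. pN(X): Near Oceanian ancestry. Est. = estimated ancestry; SE = standard error in the estimate; Z = Z score. The last column is the p value for the difference pN(X) − pD(X).

| Broad Grouping | Detailed | Code | N | pD(X) Est. | pD(X) SE | pD(X) Z | pN(X) Est. | pN(X) SE | pN(X) Z | p Value, pN(X) − pD(X) |
|---|---|---|---|---|---|---|---|---|---|---|
| New Guinea | Highlander | SH | 24 | 100% | 0% | n/a | 100% | 0% | n/a | n/a |
| Australian | all | | 10 | 103% | 6% | 17.1 | n/a | n/a | n/a | n/a |
| | Northern Territories | AU1 | 8 | 103% | 6% | 16.6 | n/a | n/a | n/a | n/a |
| | Cell Cultures | AU2 | 2 | 103% | 7% | 14.1 | n/a | n/a | n/a | n/a |
| Fiji | Fiji | FI | 25 | 56% | 3% | 17.7 | 58% | 1% | 94.6 | 0.38 |
| Nusa Tenggaras | all | | 10 | 40% | 3% | 12.8 | 38% | 1% | 54.7 | 0.34 |
| | Alor | AL | 2 | 51% | 6% | 8.3 | 49% | 1% | 35.6 | 0.69 |
| | Flores | FL | 1 | 40% | 8% | 5.0 | 37% | 2% | 19.8 | 0.68 |
| | Roti | RO | 4 | 27% | 4% | 6.4 | 27% | 1% | 29.4 | 0.85 |
| | Timor | TI | 3 | 50% | 5% | 9.8 | 45% | 1% | 41.7 | 0.29 |
| Philippines | all | | 27 | 28% | 3% | 8.2 | 6% | 1% | 10.6 | 3.4 × 10^-10 |
| | Mamanwa (N) | MA | 11 | 49% | 5% | 9.2 | 11% | 1% | 11.4 | 1.5 × 10^-12 |
| | Manobo | MN | 16 | 13% | 3% | 4.2 | 4% | 1% | 5.7 | 0.0018 |
| Moluccas | all | | 10 | 35% | 4% | 10.1 | 34% | 1% | 46.0 | 0.59 |
| | Hiri | HI | 7 | 35% | 4% | 9.0 | 32% | 1% | 38.4 | 0.36 |
| | Ternate | TE | 3 | 36% | 5% | 7.2 | 38% | 1% | 33.7 | 0.67 |
| Polynesia | all | PO | 19 | 20% | 4% | 5.1 | 27% | 1% | 34.8 | 0.052 |
| | Cook | | 2 | 16% | 6% | 2.5 | 24% | 1% | 17.3 | 0.21 |
| | Futuna | | 4 | 28% | 5% | 5.3 | 29% | 1% | 26.9 | 0.87 |
| | Niue | | 1 | 27% | 8% | 3.3 | 30% | 2% | 16.3 | 0.72 |
| | Samoa | | 5 | 13% | 5% | 2.6 | 24% | 1% | 23.3 | 0.024 |
| | Tokelau | | 2 | 22% | 6% | 3.5 | 31% | 1% | 23.8 | 0.14 |
| | Tonga | | 2 | 17% | 7% | 2.5 | 31% | 1% | 22.5 | 0.027 |
| | Tuvalu | | 3 | 21% | 6% | 3.6 | 28% | 1% | 22.8 | 0.28 |
| Andamanese | Onge (N) | AN | 10 | 10% | 6% | 1.6 | 3% | 1% | 1.8 | 0.27 |
| Taiwan | all | TA | 12 | 4% | 3% | 1.2 | 1% | 1% | 1.5 | 0.35 |
| | Puyuma | | 2 | 4% | 6% | 0.6 | 2% | 1% | 1.8 | 0.79 |
| | Rukai | | 2 | 0% | 6% | 0.0 | 2% | 1% | 1.6 | 0.74 |
| | Paiwan | | 2 | 5% | 6% | 0.8 | 3% | 1% | 2.2 | 0.67 |
| | Atayal | | 2 | −5% | 5% | −0.9 | 0% | 1% | 0.3 | 0.34 |
| | Bunun | | 2 | 12% | 6% | 2.1 | −2% | 1% | −1.6 | 0.01 |
| | Pingpu | | 2 | 7% | 6% | 1.2 | 1% | 1% | 1.1 | 0.30 |
| Malaysia | all | | 18 | 5% | 3% | 1.4 | 0% | 1% | −0.2 | 0.16 |
| | Jehai (N) | JE | 8 | 7% | 5% | 1.4 | 1% | 1% | 0.8 | 0.21 |
| | Temuan | TM | 10 | 3% | 4% | 0.8 | −1% | 1% | −0.9 | 0.32 |
| Sumatra | all | | 17 | 4% | 3% | 1.4 | 0% | 1% | 0.3 | 0.17 |
| | Besemah | BE | 8 | 5% | 3% | 1.5 | 1% | 1% | 0.9 | 0.20 |
| | Semende | SM | 9 | 3% | 4% | 0.9 | 0% | 1% | −0.3 | 0.31 |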
their own populations (reflecting admixture in the last few generations or genotyping error), we had data from 243 individuals (Table 1). We restricted the analysis to autosomal SNPs with high genotyping completeness and with data from the Denisova genome, leaving 353,143 SNPs.
To quantify the proportion of Denisova genes in each population X, we computed a statistic pD(X), which measures the proportion of Denisova genetic material in a population as a fraction of that in New Guineans. Our main analyses in Figure 1 and Table 1 compute pD(X) as a ratio of two f4 statistics [14], each of which measures the correlation in allele frequency differences between the two populations used as outgroups (Yoruba and Denisova) and two East or Southeast Asian populations (Han and X = tested population). If Han and X descend from a single ancestral population without any subsequent admixture
Table 1. Continued

| Broad Grouping | Detailed | Code | N | pD(X) Est. | pD(X) SE | pD(X) Z | pN(X) Est. | pN(X) SE | pN(X) Z | p Value, pN(X) − pD(X) |
|---|---|---|---|---|---|---|---|---|---|---|
| Borneo | all | | 49 | 1% | 2% | 0.6 | 1% | 1% | 1.3 | 0.79 |
| | Bidayuh | BI | 10 | 6% | 4% | 1.7 | 1% | 1% | 1.4 | 0.80 |
| | Barito River | BO | 23 | 0% | 3% | 0.2 | 1% | 1% | 1.7 | 0.18 |
| | Land Dayak | DY | 16 | 0% | 3% | −0.1 | 0% | 1% | 0.2 | 0.94 |
| India | Dravidian | SI | 12 | −7% | 5% | −1.5 | n/a | n/a | n/a | n/a |

We provide each population’s estimated ancestry, the standard error in the estimate, and the Z score for deviation from zero (Z). Negrito populations are marked with (N). The New Guinea highlanders by definition have 100% Denisovan and 100% Near Oceanian ancestry because they are used as a reference population for computations. Results are not provided for Australians and Dravidians, for whom the phylogenetic relationships do not allow the estimate (n/a). The last column reports the two-sided p value for a difference based on a block jackknife and a Z test.
Figure 1. Denisovan Genetic Material as a Fraction of that in New Guineans
Populations are only shown as having Denisova ancestry if the estimates are more than two standard errors from zero (we combine estimates for populations in this study with analogous estimates from CEPH-Human Genome Diversity Panel populations reported previously [12]). No population has an estimate of Denisova ancestry that is significantly more than that in New Guineans, and hence we at most plot 100%. The sampling location of the AU2 population is unknown and hence the position of this population is not precise.
Population codes shown on the map: AL, Alor; AN, Andaman (Onge); AU, Australian; BE, Besemah; BG, Bougainville; BI, Bidayuh; BO, Borneo; CA, Cambodia; DA, Dai; DR, Daur; DY, Dayak; FI, Fiji; FL, Flores; HA, Han; HE, Hezhen; HI, Hiri; JA, Japan; JE, Jehai; LA, Lahu; MA, Mamanwa; MI, Miao; MN, Manobo; MO, Mongola; NA, Naxi; NG, New Guinea; OR, Oroqen; PO, Polynesia; RO, Roti; SE, She; SH, S. Highlands; SI, Southern India; SM, Semende; TA, Taiwan; TE, Ternate; TI, Timor; TJ, Tujia; TM, Temuan; TU, Tu; UY, Uygur; XI, Xibo; YI, Yi.
from Denisova, then the allele frequency differences between Han and X must have arisen solely since their separation from their common ancestor, and the two frequency differences should be uncorrelated; thus, the f4 statistic has an expected value of zero. However, if population X inherited some of its ancestry from an archaic population related to Denisovans, then the allele frequency differences between Han and X will be correlated with those between Yoruba and Denisova: the higher the admixture from the archaic population, the higher the correlation. Because the f4 statistic in the numerator uses X as the test population, and the f4 statistic in the denominator uses New Guinea as the test population, the ratio pD(X) estimates a quantity proportional to the percentage of Denisova ancestry qX; that is, the Denisova admixture fraction in X divided by that in New Guinea, qX/qNew Guinea (Appendix A).
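The proportionality stated above can be motivated with a short expectation argument. What follows is a simplified sketch and not the derivation in Appendix A. It assumes, for illustration, that X is a mixture of an East Asian-related population E' and a Denisova-related population D', that the same D' contributed to New Guineans, that E' forms a clade with Han relative to Yoruba and Denisova, and that drift after admixture is independent noise. The symbols E', D', c, and ε are introduced here and do not appear in the original text.

```latex
\begin{align*}
z^i_X &= (1 - q_X)\, z^i_{E'} + q_X\, z^i_{D'} + \varepsilon^i_X, \\
\mathbb{E}\!\left[(z^i_{\text{Yoruba}} - z^i_{\text{Denisova}})(z^i_{\text{Han}} - z^i_X)\right]
  &= (1 - q_X)\,\underbrace{\mathbb{E}\!\left[(z^i_{\text{Yoruba}} - z^i_{\text{Denisova}})(z^i_{\text{Han}} - z^i_{E'})\right]}_{=\,0}
   + q_X\,\underbrace{\mathbb{E}\!\left[(z^i_{\text{Yoruba}} - z^i_{\text{Denisova}})(z^i_{\text{Han}} - z^i_{D'})\right]}_{=\,c} \\
  &= q_X\, c, \\
p_D(X) &\approx \frac{q_X\, c}{q_{\text{New Guinea}}\, c} = \frac{q_X}{q_{\text{New Guinea}}}.
\end{align*}
```

Under these assumptions the unknown constant c cancels in the ratio, which is why pD(X) measures Denisova ancestry relative to New Guinea and not as an absolute fraction.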
We computed pD(X) for a range of non-African populations and found that for mainland East Asians, western Negritos (Jehai and Onge), or western Indonesians, pD(X) is within two standard errors of zero when a standard error is computed from a block jackknife (Table 1 and Figure 1). Thus, there is no significant evidence of Denisova genetic material in these populations. However, there is strong evidence of Denisovan genetic material in Australians (1.03 ± 0.06 times the New Guinean proportion; one standard error), Fijians (0.56 ± 0.03), Nusa Tenggaras islanders of southeastern Indonesia (0.40 ± 0.03), Moluccas islanders of eastern Indonesia (0.35 ± 0.04), Polynesians (0.20 ± 0.04), Philippine Mamanwa, who are classified as a “Negrito” group (0.49 ± 0.05), and Philippine Manobo (0.13 ± 0.03) (Table 1 and Figure 1). The New Guineans and Australians are estimated to have indistinguishable proportions of Denisovan ancestry (within the statistical error), suggesting Denisova gene flow into the common ancestors of Australians and New Guineans prior to their entry into Sahul (Pleistocene New Guinea and Australia), that is, at least 44,000 years ago [24,25]. These results are consistent with the Common Origin model of present-day New Guineans and Australians [26,27]. We further confirmed the consistency of the Common Origin model with our data by testing for a correlation in the allele frequency difference of two populations used as outgroups (Yoruba and Han) and the two tested populations (New Guinean and Australian). The f4 statistic that measures their correlation is only |Z| = 0.8 standard errors from zero, as expected if New Guineans and Australians descend from a common ancestral population after they split from East Asians, without any evidence of a closer relationship of one group or the other to East Asians. Two alternative histories, in which either New Guineans or Australians have a common origin with East Asians, are inconsistent with the data (both |Z| > 52).
To assess the robustness of these estimates of Denisova admixture proportion, we recomputed pD(X) for diverse choices of the outgroup A (YRI, San, and chimpanzee), the archaic population B (Denisova, Neandertal, and chimpanzee), the East Asian population C (CHB and Borneo), and X (17 different populations). For any population X, we obtain consistent estimates of the archaic mixture proportion, regardless of the choice of A, B, and C. Thus, the method is robust to the choice of comparison populations, suggesting that the underlying model of population relationships (Appendix A) provides a reasonable fit to the data and that our pD(X) ancestry estimates are reliable. For our main estimates of admixture proportion, we report results for A = YRI, B = Denisova, and C = CHB because Table S2 shows that the standard errors are smallest (in part because of larger sample sizes).
To test whether our estimates of pD(X) are robust to ascertainment bias (the complex ways that SNPs were chosen for inclusion on genotyping arrays originally designed for medical genetics studies), we also estimated Denisova admixture by using sequencing data. For this purpose, we generated new shotgun sequencing data from a Philippine Mamanwa individual (~1×) and a New Guinea highlander (~3×, from a different New Guinean group than the one sampled in the Human Genome Diversity Panel [16]). We merged these with data from Neandertal, Denisova, chimpanzee, and 12 present-day humans analyzed as part of the Neandertal and Denisova genome sequencing studies [12,18]. We then computed the same pD(X) statistics for the sequencing as for the genotyping data, replacing YRI with a Yoruba (HGDP00927), CHB with a Han (HGDP00778), and New Guinea with a Papuan sample (Papuan2; HGDP00551). Both the full sequence data and the SNP data produce consistent estimates of pD(X) (Table 2), suggesting that ascertainment bias is not influencing the pD(X) estimates from genome-wide SNP data.
Near Oceanian Ancestry Explains Denisovan Genes
Outside of Australia and the Philippines
Aparsimonious explanation for theDenisova geneticmate-
rial that we detect in the non-Australian populations is the
well-documented admixture that has occurred in many
Southeast Asian and Oceanian groups between (1) Near
Oceanian populations related to New Guineans and (2)
populations from island Southeast Asia related tomainland
East Asians, who are the primary populations of Taiwan
and Indonesia today.28–31 Thus, many groups might have
Denisova admixture as an indirect consequence of their
history of Near Oceanian admixture. For those populations
whose Denisova ancestry is explained in this way, their fraction
of Denisovan ancestry is predicted to be exactly
proportional to their fraction of Near Oceanian ancestry.
To test this hypothesis, we designed a second statistic,
pN(X), to estimate the fraction of a population’s Near Oceanian
ancestry, defined here as the proportion of its ancestry
inherited from a population that is more closely related to
New Guineans than to Australians (Appendix A). A virtue
of pN(X) is that it provides an unbiased estimate of a popula-
tion’s Near Oceanian ancestry proportion even without
access to close relatives of the ancestral populations
(Appendix A), whereas previous estimators10,30 depend
on the accuracy of the surrogate contemporary popula-
tions used to approximate the ancestral populations. We
520 The American Journal of Human Genetics 89, 516–528, October 7, 2011
compared pD(X) and pN(X) for all relevant populations
(Table 1, Figure 2, and Figure S1) and found that, allowing
for sampling error, they occur in a one-to-one ratio for the
populations from the Nusa Tenggaras, Moluccas, Polynesia,
and Fiji. Common ancestry with Near Oceania thus can
account for the Denisova genetic material in these groups.
A striking exception is observed in the two Philippine
populations, neither of which conforms to this relationship:
pD(Mamanwa) = 0.49 ± 0.05 versus pN(Mamanwa) = 0.11 ± 0.01
(p = 1.5 × 10⁻¹² for the difference) and
pD(Manobo) = 0.13 ± 0.03 versus pN(Manobo) = 0.04 ± 0.01
(p = 0.0018) (Figure 2). An alternative hypothesis
that could account for the Denisovan genetic material in
the Philippines is common ancestry with Australians.32,33
We thus computed a third statistic, pA(X), that estimates
the relative proportion of Australian ancestry (Appendix
A). However, Australian ancestry cannot explain these
patterns either: pD(Mamanwa) = 0.49 ± 0.05 versus
pA(Mamanwa) = 0.13 ± 0.01 and pD(Manobo) = 0.13 ±
0.03 versus pA(Manobo) = 0.05 ± 0.01. The estimates of
pN(X) and pA(X) are consistent for a variety of outgroups
(Appendix A and Table S3). Thus, the Denisova genetic
material in Mamanwa, as well as the smaller proportion
in their Manobo neighbors, cannot be due to common
ancestry with Near Oceanians or Australians after the
two groups diverged from one another. In the following
section, we focus on the Mamanwa because they have
a higher proportion of Denisova genetic material and allow
us to study the pattern at a higher resolution.
Modeling Denisova Admixture and Population
History
To test whether the patterns observed in the Philippine
populations might reflect a history of Denisova gene flow
into a population that was ancestral to New Guineans,
Australians, and Mamanwa, followed by separation of
the Mamanwa first and then divergence of the New Guin-
eans from Australians, we fit f statistics summarizing the
allele frequency correlations among all possible sets of
populations to admixture graphs.14 Admixture graphs are
formal models of population relationships with the impor-
tant feature that simply by specifying a topology of popu-
lation relationships, admixture proportions, and genetic
drift values on each lineage, they produce precise predictions
of the values that will be observed for the f4, f3, and f2
statistics (Appendix B). These predictions can then be
compared to the empirically observed values (with standard
Figure 2. Denisovan and Near Oceanian Ancestry Are Proportional Except in the Philippines
We plot pD(X), the estimated percentage of Denisova ancestry as a fraction of that seen in New Guineans, against the estimated percentage of Near Oceanian ancestry pN(X) by using the values from Table 1 (horizontal and vertical bars specify ±1 standard error). The Mamanwa deviate significantly from the pD(X) = pN(X) line, indicating that their Denisova genetic material does not owe its origin to gene flow from a population related to Near Oceanians. A weaker deviation is seen in the Manobo, who live near the Mamanwa on the island of Mindanao.
Table 2. Denisovan Admixture pD(X) Estimated from Sequencing versus Genotyping Data

Sample                 HGDP ID for     Sequencing Data                          Genotyping Data
                       Sequence Data   Estimated  Standard Error   Z Score     Estimated  Standard Error   Z Score
                                       Ancestry   in the Estimate              Ancestry   in the Estimate
Papuan                 HGDP00542       105%       9%               11.8        100%       n/a              n/a
New Guinea Highlander  -               104%       9%               11.7        100%       n/a              n/a
Bougainville           HGDP00491       83%        10%              8.3         82%        5%               15.9
Mamanwa                -               28%        10%              2.9         49%        5%               9.2
Cambodian              HGDP00711       19%        9%               2.0         −3%        3%               −0.8
Karitiana              HGDP00998       9%         12%              0.7         4%         6%               0.7
Mongolian              HGDP01224       −6%        12%              −0.5        3%         3%               1.1

For the sequencing data, we present the ratio f4(Yoruba, Denisova; Han, X)/f4(Yoruba, Denisova; Han, Papuan2), estimating the proportion of Denisova ancestry in a population X as a fraction of that in the Papuan2 sample (for the first line, the Papuan sample in the numerator is Papuan1, HGDP00542). For the genotyping data, we present the ratio f4(YRI, Denisova; CHB, X)/f4(YRI, Denisova; CHB, Papuan). No standard errors are given for the genotyping-based estimates in the first two rows because the Papuans and New Guineans are the reference populations, and so by definition those fractions are 100%.
errors from a block jackknife) to assess the fit to the data.14
The best-fitting admixture graph for seven populations
(Neandertal, Denisova, Yoruba, Han Chinese, Mamanwa,
Australians, and New Guineans) specifies Denisova gene
flow into a population ancestral to New Guineans, Australians,
and Mamanwa, followed by the splitting of the ancestors
of the Mamanwa and much more recent admixture
between them and populations related to East Eurasians
(Figure 3 and Figure S2). For this model, the admixture graph
predicts the values of 91 allele frequency correlation statis-
tics (f statistics) relating the seven analyzed populations,
and only one f statistic has an observed value more than
three standard errors from the prediction (Appendix B).
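The standard errors mentioned above come from a block jackknife. The sketch below is only an illustration of the idea, under stated assumptions: it uses the simple delete-one-block jackknife with equally weighted blocks (the original analysis may use a weighted variant), and the function name `block_jackknife_ratio`, the per-SNP inputs, and the block labels are hypothetical.

```python
import math
from collections import defaultdict


def block_jackknife_ratio(num, den, block_ids):
    """Delete-one-block jackknife for the ratio sum(num) / sum(den).

    num, den  : per-SNP terms of the numerator and denominator
    block_ids : genomic block label for each SNP (hypothetical input)
    Returns (estimate, standard_error). Blocks are weighted equally,
    which is a simplifying assumption of this sketch.
    """
    num_by_block = defaultdict(float)
    den_by_block = defaultdict(float)
    for n, d, b in zip(num, den, block_ids):
        num_by_block[b] += n
        den_by_block[b] += d

    total_num = sum(num_by_block.values())
    total_den = sum(den_by_block.values())
    estimate = total_num / total_den

    # Recompute the ratio with each block removed in turn.
    leave_one_out = [
        (total_num - num_by_block[b]) / (total_den - den_by_block[b])
        for b in num_by_block
    ]
    g = len(leave_one_out)
    mean_loo = sum(leave_one_out) / g
    variance = (g - 1) / g * sum((t - mean_loo) ** 2 for t in leave_one_out)
    return estimate, math.sqrt(variance)
```

Dividing the difference between an observed and a predicted statistic by such a standard error gives the Z score used to judge the fit.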
Encouraged by the fit of the admixture graph to the data
from the seven populations, we extended the model to
include two additional populations—Andaman Islanders
(Onge) and Negrito groups from Malaysia (Jehai)—both
of which have been hypothesized to descend from the
same migration that gave rise to Australians and New
Guineans4,5 (Figure 3 and Figure S3). This analysis provides
support for common ancestry of the Onge
and Jehai: an admixture graph specifying such a history is
an excellent fit to the joint data in the sense that only one
of the 246 possible f statistics is more than three standard
errors from expectation (Appendix B). The analysis also
suggests that after their separation from the Onge, the Je-
hai received substantial admixture (about three-quarters
of their genome) from populations related to mainland
East Asians (Appendix B). In contrast, a model in which
the Onge have no recent East Asian admixture is a good
fit to the data, providing further evidence that the Onge
have been unadmixed (at least with non-South Asians8)
since their initial arrival in the region.14
A striking finding that emerges from the admixture
graph model fitting is the evidence of an episode of addi-
tional gene flow into Australian and New Guinean ances-
tors—after their ancestors separated from those of the Ma-
manwa—from a modern human population that did not
have Denisova genetic material. A model in which this
admixture accounts for half of the genetic material in
Australians and New Guineans is an excellent fit to the
data (Figure 3, Figures S2 and S3, and Appendix B). Admix-
ture graphs that do not model a second admixture event
are much poorer fits, producing 11 f statistics more than 3
standard errors from expectation (|Z| > 3) (Appendix B). Our
analysis further suggests that the modern humans who
admixed with the ancestors of Australians and New Guin-
eans were closer to Andamanese and Malaysian Negritos
than to mainland East Asians (Figure 3), although this
is a weaker signal (1 f statistic with |Z| > 3 for this model versus 3 for the alternative) (Figure
S3). This suggests that populations with Denisova
admixture could have been in proximity to the ancestors
of the Onge and Jehai during the earliest settlement of
the region but provides no evidence for ancestors of pres-
ent-day East Asians in the region at that time (Appendix B).
Thus, these findings suggest that the present-day East
Asian and Indonesian populations are primarily descended
from more recent migrations to the region.
Discussion
This study has shown that Southeast Asia was settled by
modern humans in multiple waves: One wave contributed
the ancestors of present-day Onge, Jehai, Mamanwa, New
Guineans, and Australians (some of whom admixed with
Denisovans), and a second wave contributed much of
the ancestry of present-day East Asians and Indonesians.
This scenario of human dispersals is broadly consistent
with the archaeologically-motivated hypothesis of an early
southern route migration leading to the colonization of
Sahul and East Asia2 but also further clarifies this scenario.
In particular, our data provide no evidence for multiple
dispersals of modern humans out of Africa, as all non-
Africans have statistically indistinguishable amounts of
Figure 3. A Model of Population Separation and Admixture that Fits the Data
[Admixture graph relating Yoruba, Neandertal, Denisova, Chinese, Jehai (N), Onge (N), Mamanwa (N), New Guinea, and Australian; admixture proportions labeled on the graph: 1.3%/98.7%, 7%/93%, 51%/49%, 24%/76%, 24%/76%, and 27%/73%.]
The admixture graph suggests Denisova-related gene flow into a common ancestral population of Mamanwa, New Guineans, and Australians, followed by admixture of New Guinean and Australian ancestors with another population that did not experience Denisova gene flow. We cannot distinguish the order of population divergence of the ancestors of Chinese, Onge/Jehai, and Mamanwa/New Guineans/Australians, and hence show a trifurcation. Admixture proportion estimates (red) are potentially affected by ascertainment bias and hence should be viewed with caution. In addition, although admixture graphs are precise about the topology of population relationships, they are not informative regarding timing. Thus, the lengths of lineages should not be interpreted in terms of population split times and admixture events.
Neandertal genetic material.12,18 Instead, our data are
consistent with a single dispersal out of Africa (as proposed
in some versions of the early southern route hypothesis1)
from which there were multiple dispersals to South and
East Asia.
This study is also important in providing a clue about the
geographic location of the Denisova gene flow. Given the
high mobility of human populations, it is difficult to use
genetic data from present-day populations to infer the location
of past demographic events with high confidence.
Nevertheless, the fact that Denisova genetic material is
present in eastern Southeast Asians and Oceanians (Ma-
manwa, Australians, and New Guineans), but not in the
west (Onge and Jehai) or northwest (the Eurasian conti-
nent) suggests that interbreeding might have occurred in
Southeast Asia itself. Further evidence for a Southeast Asian
location comes from our evidence of ancient gene flow from
relatives of the Onge and Jehai into the common ancestors
of Australians and New Guineans after the initial Denisova
gene flow (Figure 3); this suggests that ancestors of both of
these groups (but not of East Asians) were present in the
region at the time. Although some of the observed patterns
could alternatively be explained by a history in which there
was initially some Denisova genetic material throughout
Southeast Asia—which was subsequently displaced by
major migrations of people related to present-day East
Asians—such a history cannot parsimoniously explain the
absence of Denisova genetic material in the Onge and Jehai.
Our evidence of a Southeast Asian location for the Deniso-
van admixture thus suggests that Denisovans were spread
across a wider ecological and geographic region—from the
deciduous forests of Siberia to the tropics—than any other
hominin with the exception of modern humans.
Finally, this study is methodologically important in
showing that there is much to learn about the relation-
ships among modern humans by analyzing patterns of
genetic material contributed by archaic humans. Because
the archaic genetic material is highly divergent, it is easily
detected in a modern human even if it contributes only a
small proportion of the ancestry; this makes it possible to
use archaic genetic material to study subtle and ancient
gene flow much as a medical imaging dye injected into a
patient allows the tracing of blood vessels. A priority for
future research should be to obtain direct estimates for
the dates of the Denisova and Neandertal gene flow, as
these will provide a better understanding of the interac-
tions among Denisovans, Neandertals, and the ancestors
of various present-day human populations.
Appendix A: Statistics Used for Estimating
Admixture Proportions
pD(X) Statistic Used for Estimating Denisova
Admixture Proportion
We first discuss the pD(X) statistic that we use for estimating
the Denisova admixture proportion in any population
X. Define the frequency of allele i in a sample from
population Y as z^i_Y. Then pD(X) is defined as in Equation 1.
The rightmost part of Equation 1 shows that pD(X) can
also be expressed as a ratio of f4 statistics, which we intro-
duced previously14 to measure the correlation in allele
frequency differences between pairs of populations. We
previously reported simulations showing that the expected
values of f4 statistics are in practice robust to ascertainment
bias (how the polymorphisms are chosen for inclusion in
an analysis), making them useful for learning about
history with SNP array data.14
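Equation 1 does not survive in this text. From the definitions above and from the ratio given in the footnote to Table 2, it can plausibly be reconstructed as follows. This is a reconstruction, not a quotation: the sums run over SNPs i, and New Guinea is taken as the reference population in the denominator.

```latex
p_D(X) \;=\;
\frac{\sum_i \left(z^i_A - z^i_B\right)\left(z^i_C - z^i_X\right)}
     {\sum_i \left(z^i_A - z^i_B\right)\left(z^i_C - z^i_{\text{New Guinea}}\right)}
\;=\;
\frac{f_4(A, B;\, C, X)}{f_4(A, B;\, C, \text{New Guinea})}
\qquad (1)
```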
The expected values of f4 statistics can be understood
visually by following the arrows through the phylogenetic
trees with admixture relating sets of samples, assuming
that these are accurate models for the relationships among
the populations.14 Figure 4 illustrates how the ratio of f4 statistics computed in Equation 1 estimates an admixture
proportion. Both the numerator and denominator can be
viewed as a correlation of two allele frequency differences:
z^i_A − z^i_B is the allele frequency difference
between an Outgroup “A” that did not experience
admixture and an Archaic group “B” hypothesized to be
related to the admixing group (e.g., A = {chimpanzee,
Yoruba, or San} and B = {Denisova or Neandertal}). This
follows the blue arrows in Figure 4.
z^i_C − z^i_X is the allele frequency difference
between a modern non-African population “C” and
a test population “X” (e.g., C = {Chinese or Bornean}).
This follows the red arrows in Figure 4.
If populations C and X are sister groups that descend from
a homogeneous non-African ancestral population, then the
allele frequency differences are expected to have arisen
entirely since the split from that common ancestral population,
and thus the correlation to A and B is expected to be
zero (no overlap of the arrows). In contrast, if population
X has inherited some proportion qX of its lineages from an
archaic population, then the expected value of the product
of the frequency differences is proportional to qX times
the overlap of the paths A to B and C to X in Figure 4,
which corresponds to genetic drift α + β. While we do
not know the value of α + β, when we take the ratio of
the numerator and denominator to compute the pD(X)
statistic, this unknown quantity cancels, and we obtain
qX/qNew Guinea, the proportion of archaic ancestry in a population
as a fraction of that in New Guineans (Figure 4).
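The cancellation described above can be checked numerically. The following is a minimal sketch, not the authors' software: the function names `f4` and `p_d` are hypothetical, the allele frequencies are made-up toy values, and the test population is built as an exact linear blend of two other populations, which stands in for admixture and ignores genetic drift and sampling noise.

```python
def f4(z_a, z_b, z_c, z_d):
    """Average over SNPs of (z_a - z_b) * (z_c - z_d)."""
    products = [(a - b) * (c - d) for a, b, c, d in zip(z_a, z_b, z_c, z_d)]
    return sum(products) / len(products)


def p_d(z_a, z_b, z_c, z_x, z_ref):
    """f4 ratio: archaic ancestry in X as a fraction of that in z_ref."""
    return f4(z_a, z_b, z_c, z_x) / f4(z_a, z_b, z_c, z_ref)


# Toy allele frequencies at five SNPs (made-up numbers, not real data).
yri = [0.10, 0.50, 0.90, 0.30, 0.70]         # outgroup A
denisova = [1.00, 0.00, 1.00, 1.00, 0.00]    # archaic B
chb = [0.20, 0.40, 0.80, 0.35, 0.60]         # non-African C
new_guinea = [0.30, 0.35, 0.85, 0.45, 0.50]  # reference population

# Test population: an exact 40/60 blend of the reference and C, so the
# ratio recovers the blend weight (up to floating-point rounding).
w = 0.4
x = [w * ng + (1 - w) * c for ng, c in zip(new_guinea, chb)]

print(round(p_d(yri, denisova, chb, x, new_guinea), 6))
```

Because z_C − z_X equals w times z_C − z_New Guinea at every SNP in this toy setup, the numerator is exactly w times the denominator, mirroring how the shared drift term drops out of the ratio.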
Two issues merit further discussion. First, Figure 4 is an
oversimplification in that it does not show two archaic
gene-flow events (corresponding to Denisovans and Nean-
dertals). However, we have previously reported that the
data are consistent with the same amount of Neandertal
gene flow into the ancestors of East Asians (C, such as
CHB) and populations with Denisovan ancestry (X).12,18
As a result, the same genetic drift terms are added to the
numerator and denominator, which then cancel in the
ratio pD(X) so that they do not affect results. Second,
pD(X) is expected to provide an unbiased estimate of the
admixture proportion even if the genetic drift on various
lineages has been large. This contrasts with previous
methods for estimating admixture, which have required
accurate proxies for the ancestral populations.10
pN(X) and pA(X) Statistics for Estimating Near
Oceanian and Australian Admixture
We next discuss the statistics that we use for estimating the
New Guinean pN(X) or Australian pA(X) mixture proportion
in any East Eurasian or island Southeast Asian population
X, which are defined in Equations 2 and 3, respectively.
Figure 5 shows the admixture graph corresponding to
the computation of pN(X). Both the numerator and the
denominator are of the form f4(A, Australia; X, New
Guinea). This statistic measures the correlation in allele
frequency differences between (A − Australia) and (X − New Guinea).
If X and New Guinea descended from a
common ancestral population since the split from Australians,
then they are perfect sister groups, and the expected
value of f4 is zero (the sample is consistent with 100%
Near Oceanian ancestry). On the other hand, if X has
a proportion (1 − qX) of non-Near Oceanian ancestry,
then the two terms will have a nonzero correlation, which
as shown in Figure 5 is proportional to the genetic drift
shared between the two population comparisons and has
an expected value of (1 − qX)[(1 − pX)β + γ] (the proportions
of ancestry flowing along various genetic drift paths times
the genetic drift on each of these lineages, indicated by
the overlap of the red and blue arrows). When we take
one minus the ratio, pN(X) = 1 − f4(A, Australia; X, New
Guinea)/f4(A, Australia; CHB, New Guinea), the complicated
term on the right side of this expectation cancels, and we
obtain E[pN(X)] = qX. As with Figure 4, we do not show the
independent Neandertal admixture because the effect of
this term cancels from the numerator and denominator.
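Equation 2 is not reproduced in this text. Based on the ratio quoted above, it can plausibly be reconstructed as shown below, with E standing for the East Asian reference population (CHB in the text). The second line restates the cancellation argument using the expected values given above, treating the expectation of the ratio as approximately the ratio of expectations; this is a sketch of the reasoning rather than the authors' derivation.

```latex
\begin{align}
p_N(X) &= 1 - \frac{f_4(A, \text{Australia};\, X, \text{New Guinea})}
                   {f_4(A, \text{Australia};\, E, \text{New Guinea})} \tag{2} \\
E\left[p_N(X)\right] &\approx 1 -
  \frac{(1 - q_X)\left[(1 - p_X)\beta + \gamma\right]}
       {(1 - 0)\left[(1 - p_X)\beta + \gamma\right]}
  \;=\; 1 - (1 - q_X) \;=\; q_X \notag
\end{align}
```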
In Table S3 we report the pN(X) estimates for diverse
choices of outgroup populations A (Yoruba, San, and chim-
panzee) and E (China and Borneo). The estimates are con-
sistent whatever the choice of A and E, suggesting that our
inferences are robust. (We do not report pN(X) estimates in
Table S3 for the Australians because this population is not
expected to conform to the population relationships
shown in Figure 5; indeed, the pN(X) estimates for Austra-
lians, when we do compute them, are significantly greater
than 1.) Further evidence for the usefulness of the pN(X)
estimates comes from the fact that they are consistent with
the pD(X) estimates for nearly all the populations in Table
1 (except for the Philippine populations, in which the De-
nisova ancestry does not appear to be explainable by Near
Oceanian gene flow as described in the main text).
We also computed a statistic pA(X) that is identical to
pN(X) except that the positions of Australia
and New Guinea are transposed in the statistics (Equations 2 and 3).
Once again, we obtain consistent inferences of pA(X) in
Table S3 regardless of the choice of outgroup populations.
Because New Guinea and Australia are sister groups, de-
scending from a common ancestral population, the justifi-
cations for the two statistics are very similar.
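The two statistics can be sketched in a few lines. This is an illustrative sketch under assumptions, not the authors' code: the function names `f4`, `p_n`, and `p_a` are hypothetical, the frequencies are invented toy values, and the test populations are exact linear blends, which ignores drift and sampling noise. The pA(X) function simply swaps the roles of Australia and New Guinea, as described above.

```python
def f4(z_a, z_b, z_c, z_d):
    """Average over SNPs of (z_a - z_b) * (z_c - z_d)."""
    products = [(a - b) * (c - d) for a, b, c, d in zip(z_a, z_b, z_c, z_d)]
    return sum(products) / len(products)


def p_n(z_out, z_aus, z_x, z_ng, z_east):
    """1 - f4(A, Australia; X, New Guinea) / f4(A, Australia; E, New Guinea)."""
    return 1 - f4(z_out, z_aus, z_x, z_ng) / f4(z_out, z_aus, z_east, z_ng)


def p_a(z_out, z_aus, z_x, z_ng, z_east):
    """Same statistic with the positions of Australia and New Guinea swapped."""
    return p_n(z_out, z_ng, z_x, z_aus, z_east)


# Toy allele frequencies at five SNPs (made-up numbers, not real data).
yri = [0.10, 0.50, 0.90, 0.30, 0.70]
australia = [0.40, 0.30, 0.75, 0.55, 0.45]
new_guinea = [0.30, 0.35, 0.85, 0.45, 0.50]
chb = [0.20, 0.40, 0.80, 0.35, 0.60]

# X1 is 30% New Guinea-like and 70% CHB-like; X2 is 25% Australia-like.
x1 = [0.30 * ng + 0.70 * c for ng, c in zip(new_guinea, chb)]
x2 = [0.25 * au + 0.75 * c for au, c in zip(australia, chb)]

print(round(p_n(yri, australia, x1, new_guinea, chb), 6))
print(round(p_a(yri, australia, x2, new_guinea, chb), 6))
```

In this toy setup the statistics return the blend weights (0.30 and 0.25) because the East Asian share of each test population scales the numerator relative to the denominator, which is the same cancellation the text describes.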
The only problem we found with the pN(X) estimation
procedure is that when X is any non-African population
known to have West Eurasian ancestry (e.g., Europeans or
South Asians), we often obtained negative pN(X) statistics.
Two hypotheses could be consistent with this observation:
(1) In unpublished data, we have attempted to write down
a model of population separation and mixture analogous
Figure 4. Computation of the Estimate of Denisovan Ancestry pD(X)
The black lines show the model for how populations are related that is the basis for the pD(X) ancestry estimate. Population X arose from an admixture of a proportion (1 − qX) of ancestry from an ancestral non-African population C′ and (qX) from archaic population B′ (C and B are their unmixed descendants). The expected value of f4(A,B;C,X) is proportional to the correlation in the allele frequency differences A − B and C − X, and can be computed as the overlap in the drift paths separating A − B (blue arrows) and C − X (red arrows). These paths only overlap over the branches α and β, in proportion to the percentage qX of the lineages of population X that are of archaic ancestry, and so the expected value is qX(α + β). When we compute the ratio pD(X), (α + β) cancels from both the numerator and denominator, and we obtain qX/qNew Guinea, the fraction of archaic ancestry in a population X divided by that in New Guinea. This provides unbiased estimates of the mixture proportion even if populations C and B have experienced a large amount of genetic drift since splitting from their ancestors, that is, even if we do not have good surrogates for the ancestral populations. This robustness arises because the genetic drift on the branches B→B′ and C→C′ does not contribute to the expectations.
to that in Figure 3 that jointly fits the genetic data com-
paring eastern and western Eurasian populations and
have so far not succeeded in developing a model that passes
goodness-of-fit tests. This suggests that the population
relationships between eastern and western Eurasians might
be more complex than we have been able to model to date,
and therefore we cannot use them in the pN(X) computa-
tion. (2) An alternative possibility is that the negative
pN(X) statistics reflect an artifact of ascertainment bias on
SNP arrays. Ascertainment bias is likely to be particularly
complex with regard to the joint information from Euro-
peans and East Asians because these populations were
heavily used in choices of SNPs for medical genetics arrays.
Thus, it might be difficult to make inferences using populations
from both regions together with data from conventional
SNP arrays developed for medical genetic studies.
Whatever the explanation, we have some reason to
believe that estimates of Near Oceanian admixture by
using data from populations with West Eurasian ancestry might
be unreliable. Thus, we have excluded West Eurasians
from the estimates reported in Table 1.
Appendix B: Admixture Graphs
Overview of Admixture Graphs
A key finding from this study is that there is Denisova
genetic material in the Mamanwa, a Negrito group from
the Philippines, which cannot be explained by a history of
recent gene flow from relatives of New Guineans (Near Oceanians)
or Australians. To further understand this history,
we use the admixture graph methodology that we initially
developed for a study of Indian genetic variation14 to test
whether various hypotheses about population relationships
are consistent with the data. Specifically, we tested the
hypothesis of a single episode of Denisovan gene flow into
the ancestors of New Guineans, Australians, and Mamanwa,
prior to the separation of New Guineans and Australians.
Admixture graphs refer to generalizations of phyloge-
netic trees that incorporate the possibility of gene flow.
Like phylogenetic trees, admixture graphs describe the
topology of population relationships without specifying
the timing of events (such as population splits or gene-
flow events), or the details of population size changes on
different lineages. While this can be a disadvantage in
that fitting admixture graphs to data does not allow infer-
ences of these important details, it is also an advantage in
that one can fit genetic data to an admixture graph without
having to specify a demographic history. This allows for
inferences that are more robust to uncertainties about
important parameters of history. Once the topology of the
population relationships is inferred, one can in principle
use other methods to make inferences about the timing of
events and population size changes. This makes the
problem of learning about history simpler than if one had
to simultaneously infer topology, timing, and demography.
An admixture graph makes precise predictions about the
patterns of correlation in allele frequency differences
across all subsets of two, three, and four populations in
an analysis, as measured for example by the f2, f3, and f4
statistics of Reich et al.14 Given n populations, there are
n(n − 1)/2 f2 statistics, n(n − 1)(n − 2)/6 f3 statistics, and
n(n − 1)(n − 2)(n − 3)/24 f4 statistics. To fit an admixture
graph to data, one first proposes a topology, then identifies
the set of admixture proportions and genetic drift values
on each lineage (variation in allele frequency correspond-
ing to random sampling of alleles from generation to
generation in a population of finite size) that are the best
match to the data under that model. The admixture graph
topology, admixture proportions, and genetic drift values
Figure 5. Computation of the Estimate of Near Oceanian Ancestry pN(X)
The test population X is assumed to have arisen from a mixture of a proportion (1 − qX) of ancestry from ancestral East Asians E′ and (qX) of ancestral Near Oceanians N′. The Near Oceanians are, in turn, assumed to have received a proportion pX of their ancestry from the Denisovans (E and New Guinea are assumed to be unmixed descendants of these two). The expected value of f4(A, Australia; X, New Guinea) can be computed from the correlation in the allele frequency differences A − Australia (blue arrows) and X − New Guinea (red arrows). These paths only overlap along the proportion (1 − qX) of the ancestry of population X that takes the East Asian path, where the expected shared drift is (1 − pX)β + γ as shown in the figure. Thus, the expected value of the f4 statistic is (1 − qX)[(1 − pX)β + γ]. Because qX = 0 for the denominator of pN(X) (no Near Oceanian ancestry), the ratio of f4 statistics has an expected value of (1 − qX) and E[pN(X)] = qX.
on each lineage together generate expected values for the
f2, f3 and f4 statistics14 that can be compared to the
observed values—which have empirical standard errors
from a block jackknife—to assess the adequacy of the
best fit under the proposed topology. As we showed previ-
ously,14 the topology relating populations in an admixture
graph can be accurately inferred even if the polymor-
phisms used in an analysis are affected by substantial ascer-
tainment bias. The software that we have developed for
fitting admixture graphs carries out a hill-climb to find
the genetic drift values and admixture proportions that
minimize the discrepancy between the observed and ex-
pected f2, f3, and f4 statistics for a given topology relating
a set of populations.
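The hill climb described above can be illustrated on a toy problem. The sketch below is not the authors' software: it assumes a simple coordinate-wise search that minimizes the sum of squared Z scores, a three-population tree in which each f2 statistic is the sum of two branch drifts, and made-up observed values. All names (`fit_by_hill_climb`, `predict_f2`) are hypothetical.

```python
def fit_by_hill_climb(predict, observed, std_err, start, step=0.1, tol=1e-9):
    """Coordinate-wise hill climb minimizing the sum of squared Z scores.

    predict(params) returns the expected statistics for a parameter vector.
    """
    def score(params):
        expected = predict(params)
        return sum(((o - e) / s) ** 2
                   for o, e, s in zip(observed, expected, std_err))

    params = list(start)
    best = score(params)
    while step > tol:
        improved = False
        for k in range(len(params)):
            for delta in (step, -step):
                trial = list(params)
                trial[k] = max(0.0, trial[k] + delta)  # drift cannot be negative
                trial_score = score(trial)
                if trial_score < best:
                    params, best, improved = trial, trial_score, True
        if not improved:
            step /= 2  # refine the search once no neighbor is better
    return params, best


def predict_f2(drifts):
    """Toy tree: f2 between two tips is the sum of their branch drifts."""
    a, b, c = drifts
    return [a + b, a + c, b + c]


# Made-up observations, generated from branch drifts 0.1, 0.2, and 0.3.
observed_f2 = [0.3, 0.4, 0.5]
standard_errors = [0.01, 0.01, 0.01]
fitted, final_score = fit_by_hill_climb(
    predict_f2, observed_f2, standard_errors, start=[0.5, 0.5, 0.5]
)
```

A real admixture graph adds admixture proportions to the parameter vector and predicts f3 and f4 statistics as well, but the fitting loop has the same shape: propose a topology, then search for the parameters that best match the observed statistics.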
A complication in fitting admixture graphs to data is
that we do not know how many effectively independent
f statistics there are, out of the [n(n − 1)/2][1 + (n − 2)/
3 + (n − 2)(n − 3)/12] that are computed. These statistics are
highly correlated, and in fact can be related algebraically
to each other; for example, all the f3 and f4 statistics are
linear combinations of the f2 statistics. Although we
believe that it is possible to construct a reasonable score
for how well the model fits the data by studying the covari-
ance matrix of the f statistics—and indeed a score of this
type is the basis for our hill-climbing software—we have
not yet found a formal way to assess how many indepen-
dent hypotheses are being tested, and thus we do not at
present have a goodness-of-fit test. Instead, we simply
compute all possible f statistics and search for extreme
outliers (e.g., Z scores of 3 or more from expectation). A
large number of Z scores greater than 3 are not likely to
be observed if the admixture graph topology is an accurate
description of a set of population relationships.
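The counting and outlier screen described above can be sketched as follows. This is an illustration only: the function names are hypothetical, the counts use the formulas given earlier in this appendix (which equal the binomial coefficients for subsets of two, three, and four populations), and the threshold of 3 follows the text.

```python
from math import comb


def count_f_statistics(n):
    """Numbers of f2, f3, and f4 statistics for n populations."""
    return comb(n, 2), comb(n, 3), comb(n, 4)


def outliers(observed, expected, std_err, threshold=3.0):
    """Indices of statistics with |observed - expected| / SE above the threshold."""
    return [i for i, (o, e, s) in enumerate(zip(observed, expected, std_err))
            if abs(o - e) / s > threshold]


for n in (7, 9):
    f2_count, f3_count, f4_count = count_f_statistics(n)
    print(n, f2_count + f3_count + f4_count)
```

For n = 7 and n = 9 the totals are 91 and 246, matching the numbers of f statistics quoted for the seven- and nine-population admixture graphs.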
Denisova Gene Flow into Mamanwa/New Guinean/
Australian Ancestors
We initially fit an admixture graph to the data from
Mamanwa, New Guineans, Australians, Denisova, Nean-
dertal, West Africans (YRI), and Han Chinese (CHB), basing
some of the proposed population relationships on pre-
vious work that hypothesized a model of an out-of-Africa
migration of modern humans, Neandertal gene flow into
the ancestors of all non-Africans, and sister group status
for Neandertals and Denisovans.12 A complication in
fitting an admixture graph to these data is that because
of the low coverage of the Neandertal and Denisova
genomes, we could not accurately infer the diploid geno-
type at each SNP. Thus, we sampled a single read from
Neandertal and Denisova to represent each site and (incor-
rectly) assumed that these individuals were homozygous
for the observed allele at each analyzed SNP. This means
that the estimates of genetic drift on the Neandertal and
Denisova branches are not reliable (the genetic drift values
are overestimated). However, these sources of error do not
introduce a correlation in allele frequencies across popula-
tions and hence are not expected to generate a false infer-
ence about the population relationships.
Figure S2 shows an admixture graph that proposes that the
Mamanwa, New Guineans, and Australians descend from
a common ancestral population; the Mamanwa split first
and the New Guinean and Australian ancestors split later.
This is an excellent fit to the data in the sense that only
one of 91 f statistics is more than three standard errors
from expectation (|Z| = 3.4). An interesting feature of this admixture
graph is that it specifies an additional admixture event, after
the Mamanwa lineage separated, into the ancestors of
Australians and New Guineans that contributed about half
of their ancestry and involved a population without Deni-
sova admixture. A model that does not include such a
secondary admixture event is strongly rejected (see below).
The estimated proportion of Neandertal ancestry in all
non-Africans from the admixture graph fitting in Figure 3,
at 1.3%, is at the low end of the 1%–4% previously estimated
from sequencing data.18 Similarly, we infer a proportion
of Denisova ancestry in New Guineans of 3.5% = 6.6% × 53%, which is lower than the 4%–6% previously
estimated based on sequencing data but not significantly
so when one takes into account the standard errors quoted
in that study.12 These low numbers could reflect statistical
uncertainty from the previously reported analyses of
sequencing data or in the admixture graph estimates (the
latter possibility is especially important to consider
because we do not at present understand how to compute
standard errors on the admixture estimates derived from
admixture graphs). Another possible explanation for the
low estimates of mixture proportions is ascertainment
bias affecting the way SNPs were selected, which can affect
estimates of mixture proportions and branch lengths
(while having much less impact on the inference of
topology). Further support for the hypothesis that ascer-
tainment bias might be contributing to our lower estimates
of mixture proportions comes from the fact that in unpub-
lished work we have found that the polymorphisms most
enriched for signals of archaic admixture are those in
which the derived allele is present in the archaic popula-
tion, absent in West Africans, and present at low minor
allele frequency in the studied population. In our admix-
ture graph fitting, we filtered out this class of SNPs, as
the f statistics used in the admixture graph have denominators
that require frequency estimates from a polymorphic
reference population, and we used YRI as our reference.
Consistent with this, when we refitted the same admixture graph
with CHB instead of YRI as the reference population, we
obtained the same topology but the Neandertal mixture
proportion increased to 1.9%. We have chosen to use YRI
as the reference population in all of our reported admix-
ture graphs because they are a better outgroup for the
modern populations whose history we are studying than
the CHB (populations related to the Chinese were directly
involved in admixture events in Southeast Asia).
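The Denisova proportion quoted earlier in this section is the product of two fitted quantities. A minimal sketch of that arithmetic follows. It assumes, as a reading of the formula rather than a statement in the text, that 6.6% is the Denisova proportion in the common ancestral lineage and 53% is the share of New Guinean ancestry retained from that lineage after the secondary admixture. The function name is hypothetical.

```python
def diluted_proportion(initial_fraction, retained_fraction):
    """Archaic ancestry remaining after later admixture with an unadmixed population."""
    return initial_fraction * retained_fraction


# Values taken from the text: 6.6% x 53%.
denisova_in_new_guineans = diluted_proportion(0.066, 0.53)
print(f"{denisova_in_new_guineans:.1%}")  # prints 3.5%
```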
Adding Onge and Jehai
The Andamanese Negrito group (Onge) and Malaysian
Negrito group (Jehai) have been proposed to share ancient
526 The American Journal of Human Genetics 89, 516–528, October 7, 2011
common ancestry with Philippine Negritos (e.g., Ma-
manwa). The fact that neither the Onge nor the Jehai
have evidence of Denisova genetic material, however,
suggests that any common ancestry must date to before
the Denisova gene flow into the ancestors of the Ma-
manwa, New Guineans, and Australians. To explore the
relationship between the Onge and Jehai and the other
populations, we added them into the admixture graph.
The only family of admixture graphs that we could identify
as fitting the data have the Onge as a deep lineage of
modern humans, with the Jehai deriving ancestry from
the same lineage but also harboring a substantial additional
contribution of East Asian related admixture (Figure S3).
A striking feature of the family of admixture graphs shown
in Figure S3 is that both the Jehai and Mamanwa are inferred
to have up to about three-quarters of their ancestry due to
recent East Eurasian admixture, which is not too surprising
given that these populations have been living side by side
with populations of East Eurasian ancestry for thousands
of years. Moreover, both Y-chromosome and mtDNA anal-
yses strongly suggest recent East Asian admixture in the
Mamanwa.32,34 In contrast, the genome-wide SNP data for
the Onge are consistent with having no non-Negrito admix-
ture within the limits of our resolution, perhaps reflecting
their greater geographic isolation.
We next sought to resolve how the lineage including
Onge and Jehai ancestors, the mainland East Asian (e.g.,
Chinese), and the eastern group (including Mamanwa,
Australian and New Guinean ancestors) are related. Three
relationships are all consistent with the data. Specifically,
for all three of the admixture graphs shown in Figure S3,
only one of the 246 possible f statistics has a score of
|Z| > 3. Thus, we cannot discern the order of splitting of
these three lineages and represent the relationships as
a trifurcation in Figure 3. The actual estimates of mixture
proportions are similar for all three figures as well.
Perturbing the Best-Fitting Admixture Graph to Assess
the Robustness of Our Inferences
To assess the robustness of the admixture graphs, we per-
turbed Figure S3 (in practice, we perturbed Figure 3A, but
given the fact that the graphs are statistically indistin-
guishable we expected that results would be similar for
all three). First, we considered the possibility that after
the initial Denisova gene flow into the ancestors of Mamanwa,
New Guineans, and Australians, the New Guinean
and Australian ancestors did not experience an additional
gene-flow event with a population without Denisovan
admixture. However, when we try to fit this simpler model
to the data, we find that instead of one f statistic that is
|Z| > 3 standard errors from expectation, there are now
11, and all but one of them involve the Mamanwa, suggesting
that this population is poorly fit by such a model. Thus,
an additional admixture event in the ancestry of New
Guineans and Australians (resulting in a decrease in their
proportion of Denisova ancestry) results in a major
improvement in the fit.
Second, we considered the possibility that the secondary
gene-flow event into the ancestors of Australians and
New Guineans came from relatives of Chinese (CHB)
rather than western Negritos such as the Onge. However,
when we fit this alternative history to the data, we find
three f statistics (rather than one) with scores of |Z| > 3,
a substantially worse fit. We conclude that the modern
human population with which the ancestors of Australians
and New Guineans interbred was likely to have been more
closely related to western Negritos than to mainland East
Asians.
Supplemental Data
Supplemental Data include three figures and three tables and can
be found with this article online at http://www.cell.com/AJHG/.
Acknowledgments
We thank the volunteers who donated DNA samples. We acknowledge
F.A. Almeda Jr., J.P. Erazo, D. Gil, the late J. Kuhl, E.S. Larase, I.
Motinola, G. Patagan, W. Sinco, A. Sofro, U. Tadmor, and R. Trent
for assistance with sample collections. We thank M. Meyer for
preparing DNA libraries for high-throughput sequencing; A. Barik
and P. Nurenberg for assistance with genotyping; and O. Bar-Yosef,
K. Bryc, R.E. Green, J.-J. Hublin, J. Kelso, D. Lieberman, B. Pakendorf,
M. Slatkin, and B. Viola for comments on the manuscript.
T.A. Jinam was supported by a grant from the SOKENDAI Graduate
Student Overseas Travel Fund. This work was supported by the
Max Planck Society and by a National Science Foundation
HOMINID grant (1032255).
Received: August 11, 2011
Revised: September 8, 2011
Accepted: September 8, 2011
Published online: September 22, 2011
Web Resources
The URLs for data presented herein are as follows:
Burrows-Wheeler Aligner, http://bio-bwa.sourceforge.net/index.
shtml
CEPH-Human Genome Diversity Cell Line Panel, http://www.
cephb.fr/en/hgdp/diversity.php
EIGENSOFT, http://genepath.med.harvard.edu/~reich/Software.htm
European Collection of Cell Cultures, http://www.hpacultures.
org.uk/pages/Ethnic_DNA_Panel.pdf
European Nucleotide Archive (Project ID ERP000121), http://
www.ebi.ac.uk/ena/
Ibis, http://bioinf.eva.mpg.de/Ibis/
SAMtools, http://samtools.sourceforge.net/
References
1. Mellars, P. (2006). Going east: New genetic and archaeological
perspectives on the modern human colonization of Eurasia.
Science 313, 796–800.
2. Lahr, M., and Foley, R. (1994). Multiple dispersals and modern
human origins. Evol. Anthropol. 3, 48–60.
3. Endicott, P., Gilbert, M.T., Stringer, C., Lalueza-Fox, C., Willerslev,
E., Hansen, A.J., and Cooper, A. (2003). The genetic origins
of the Andaman Islanders. Am. J. Hum. Genet. 72, 178–184.
4. Macaulay, V., Hill, C., Achilli, A., Rengo, C., Clarke, D., Mee-
han, W., Blackburn, J., Semino, O., Scozzari, R., Cruciani, F.,
et al. (2005). Single, rapid coastal settlement of Asia revealed
by analysis of complete mitochondrial genomes. Science
308, 1034–1036.
5. Thangaraj, K., Chaubey, G., Kivisild, T., Reddy, A.G., Singh,
V.K., Rasalkar, A.A., and Singh, L. (2005). Reconstructing the
origin of Andaman Islanders. Science 308, 996.
6. Cordaux, R., and Stoneking, M. (2003). South Asia, the
Andamanese, and the genetic evidence for an early human
dispersal out of Africa. Am. J. Hum. Genet. 72, 1586–1590;
author reply 1590–1593.
7. Palanichamy, M.G., Agrawal, S., Yao, Y.G., Kong, Q.P., Sun, C.,
Khan, F., Chaudhuri, T.K., and Zhang, Y.P. (2006). Comment
on ‘‘Reconstructing the origin of Andaman islanders’’. Science
311, 470, author reply 470.
8. Barik, S.S., Sahani, R., Prasad, B.V.R., Endicott, P., Metspalu, M.,
Sarkar, B.N., Bhattacharya, S., Annapoorna, P.C.H., Sreenath, J.,
Sun, D., et al. (2008). Detailed mtDNA genotypes permit
a reassessment of the settlement and population structure of
the Andaman Islands. Am. J. Phys. Anthropol. 136, 19–27.
9. Abdulla, M.A., Ahmed, I., Assawamakin, A., Bhak, J.,
Brahmachari, S.K., Calacal, G.C., Chaurasia, A., Chen, C.H.,
Chen, J., Chen, Y.T., et al; HUGO Pan-Asian SNP Consortium;
Indian Genome Variation Consortium. (2009). Mapping
human genetic diversity in Asia. Science 326, 1541–1545.
10. Wollstein, A., Lao, O., Becker, C., Brauer, S., Trent, R.J., Nurn-
berg, P., Stoneking, M., and Kayser, M. (2010). Demographic
history of Oceania inferred from genome-wide data. Curr.
Biol. 20, 1983–1992.
11. Moodley, Y., Linz, B., Yamaoka, Y., Windsor, H.M., Breurec, S.,
Wu, J.Y., Maady, A., Bernhoft, S., Thiberge, J.M., Phuanukoon-
non, S., et al. (2009). The peopling of the Pacific from a bacte-
rial perspective. Science 323, 527–530.
12. Reich, D., Green, R.E., Kircher, M., Krause, J., Patterson, N.,
Durand, E.Y., Viola, B., Briggs, A.W., Stenzel, U., Johnson,
P.L., et al. (2010). Genetic history of an archaic hominin group
from Denisova Cave in Siberia. Nature 468, 1053–1060.
13. Altshuler, D.M., Gibbs, R.A., Peltonen, L., Altshuler, D.M.,
Gibbs, R.A., Peltonen, L., Dermitzakis, E., Schaffner, S.F., Yu,
F., Peltonen, L., et al; International HapMap 3 Consortium.
(2010). Integrating common and rare genetic variation in
diverse human populations. Nature 467, 52–58.
14. Reich, D., Thangaraj, K., Patterson, N., Price, A.L., and Singh,
L. (2009). Reconstructing Indian population history. Nature
461, 489–494.
15. Redd, A.J., and Stoneking, M. (1999). Peopling of Sahul:
mtDNA variation in aboriginal Australian and Papua New
Guinean populations. Am. J. Hum. Genet. 65, 808–828.
16. Cann, H.M., de Toma, C., Cazes, L., Legrand, M.F., Morel, V.,
Piouffre, L., Bodmer, J., Bodmer, W.F., Bonne-Tamir, B., Cam-
bon-Thomsen, A., et al. (2002). A human genome diversity
cell line panel. Science 296, 261–262.
17. Chimpanzee Sequencing and Analysis Consortium. (2005).
Initial sequence of the chimpanzee genome and comparison
with the human genome. Nature 437, 69–87.
18. Green, R.E., Krause, J., Briggs, A.W., Maricic, T., Stenzel, U., Kircher,
M., Patterson, N., Li, H., Zhai, W., Fritz, M.H., et al. (2010). A draft
sequence of the Neandertal genome. Science 328, 710–722.
19. Patterson, N., Price, A.L., and Reich, D. (2006). Population
structure and eigenanalysis. PLoS Genet. 2, e190.
20. Kircher, M., Stenzel, U., and Kelso, J. (2009). Improved base
calling for the Illumina Genome Analyzer using machine
learning strategies. Genome Biol. 10, R83.
21. Li, H., and Durbin, R. (2009). Fast and accurate short read
alignment with Burrows-Wheeler transform. Bioinformatics
25, 1754–1760.
22. Busing, F., Meijer, E., and Van Der Leeden, R. (1999). Delete-m
jackknife for unequal m. Stat. Comput. 9, 3–8.
23. Kunsch, H.K. (1989). The jackknife and the bootstrap for
general stationary observations. Ann. Stat. 17, 1217–1241.
24. O’Connell, J., and Allen, J. (2004). Dating the colonization of
Sahul (Pleistocene Australia - New Guinea): A review of recent
research. J. Archaeol. Sci. 31, 835–853.
25. Summerhayes, G.R., Leavesley, M., Fairbairn, A., Mandui, H.,
Field, J., Ford, A., and Fullagar, R. (2010). Human adaptation
and plant use in highland New Guinea 49,000 to 44,000 years
ago. Science 330, 78–81.
26. McEvoy, B.P., Lind, J.M., Wang, E.T., Moyzis, R.K., Visscher, P.M.,
van Holst Pellekaan, S.M., and Wilton, A.N. (2010). Whole-
genome genetic diversity in a sample of Australians with deep
Aboriginal ancestry. Am. J. Hum. Genet. 87, 297–305.
27. Roberts-Thomson, J.M., Martinson, J.J., Norwich, J.T.,
Harding, R.M., Clegg, J.B., and Boettcher, B. (1996). An
ancient common origin of aboriginal Australians and New
Guinea highlanders is supported by alpha-globin haplotype
analysis. Am. J. Hum. Genet. 58, 1017–1024.
28. Friedlaender, J.S., Friedlaender, F.R., Reed, F.A., Kidd, K.K.,
Kidd, J.R., Chambers, G.K., Lea, R.A., Loo, J.H., Koki, G., Hodg-
son, J.A., et al. (2008). The genetic structure of Pacific
Islanders. PLoS Genet. 4, e19.
29. Kayser, M., Brauer, S., Cordaux, R., Casto, A., Lao, O., Zhivo-
tovsky, L.A., Moyse-Faurie, C., Rutledge, R.B., Schiefenhoevel,
W., Gil, D., et al. (2006). Melanesian and Asian origins of Poly-
nesians: mtDNA and Y chromosome gradients across the
Pacific. Mol. Biol. Evol. 23, 2234–2244.
30. Kayser, M., Lao, O., Saar, K., Brauer, S., Wang, X., Nurnberg, P.,
Trent, R.J., and Stoneking, M. (2008). Genome-wide analysis
indicates more Asian than Melanesian ancestry of Polyne-
sians. Am. J. Hum. Genet. 82, 194–198.
31. Mona, S., Grunz, K.E., Brauer, S., Pakendorf, B., Castrì, L.,
Sudoyo, H., Marzuki, S., Barnes, R.H., Schmidtke, J.,
Stoneking, M., and Kayser, M. (2009). Genetic admixture
history of Eastern Indonesia as revealed by Y-chromosome
and mitochondrial DNA analysis. Mol. Biol. Evol. 26, 1865–
1877.
32. Delfin, F., Salvador, J.M., Calacal, G.C., Perdigon, H.B.,
Tabbada, K.A., Villamor, L.P., Halos, S.C., Gunnarsdottir, E.,
Myles, S., Hughes, D.A., et al. (2011). The Y-chromosome
landscape of the Philippines: Extensive heterogeneity and
varying genetic affinities of Negrito and non-Negrito groups.
Eur. J. Hum. Genet. 19, 224–230.
33. Matsumoto, H., Miyazaki, T., Omoto, K., Misawa, S., Harada,
S., Hirai, M., Sumpaico, J.S., Medado, P.M., and Ogonuki, H.
(1979). Population genetic studies of the Philippine Negritos.
II. gm and km allotypes of three population groups. Am. J.
Hum. Genet. 31, 70–76.
34. Gunnarsdottir, E.D., Li, M., Bauchet, M., Finstermeier, K., and
Stoneking, M. (2011). High-throughput sequencing of
complete human mtDNA genomes from the Philippines.
Genome Res. 21, 1–11.
ARTICLE
Rare-Variant Association Testing for Sequencing Data with the Sequence Kernel Association Test
Michael C. Wu,1,5 Seunggeun Lee,2,5 Tianxi Cai,2 Yun Li,1,3 Michael Boehnke,4 and Xihong Lin2,*
Sequencing studies are increasingly being conducted to identify rare variants associated with complex traits. The limited power of clas-
sical single-marker association analysis for rare variants poses a central challenge in such studies. We propose the sequence kernel asso-
ciation test (SKAT), a supervised, flexible, computationally efficient regression method to test for association between genetic variants
(common and rare) in a region and a continuous or dichotomous trait while easily adjusting for covariates. As a score-based vari-
ance-component test, SKAT can quickly calculate p values analytically by fitting the null model containing only the covariates, and
so can easily be applied to genome-wide data. Using SKAT to analyze a genome-wide sequencing study of 1000 individuals, by segment-
ing the whole genome into 30 kb regions, requires only 7 hr on a laptop. Through analysis of simulated data across a wide range of
practical scenarios and triglyceride data from the Dallas Heart Study, we show that SKAT can substantially outperform several alternative
rare-variant association tests. We also provide analytic power and sample-size calculations to help design candidate-gene, whole-exome,
and whole-genome sequence association studies.
Introduction
Genome-wide association studies (GWASs) have identified
more than 1000 genetic loci associated with many human
diseases and traits,1 yet common variants identified
through GWASs often explain only a small proportion of
trait heritability. The advent of massively parallel
sequencing2 has transformed human genetics3,4 and has
the potential to explain some of this missing heritability
through identification of trait-associated rare variants.5
Although considerable resources have been devoted to
sequence mapping and genotype calling,6–9 successful
application of sequencing to the study of complex traits
requires novel statistical methods that allow researchers
to test efficiently for association given data on rare vari-
ants10 and to perform sample-size and power calculations
to help design sequencing-based association studies.
Rare genetic variants, here defined as alleles with
a frequency less than 1%–5%, can play key roles in influ-
encing complex disease and traits.11 However, standard
methods used to test for association with single common
genetic variants are underpowered for rare variants unless
sample sizes or effect sizes are very large.12,13 A logical alter-
native approach is to employ burden tests that assess
the cumulative effects of multiple variants in a genomic
region.12–18 Burden tests proposed to date are based on
collapsing or summarizing the rare variants within a region
by a single value, which is then tested for association with
the trait of interest. For example, the cohort allelic sum test
(CAST)14 collapses information on all rare variants within
a region (e.g., the exons of a gene) into a single dichoto-
mous variable for each subject by indicating whether or
not the subject has any rare variants within the region
and then applies a univariate test. Instead of collapsing by
dichotomizing the number of rare variants within a region,
collapsing by counting them is also possible.18 The
combined multivariate and collapsing method12 extends
CAST by collapsing rare variants within a region into
subgroups on the basis of allele frequency, collapsing
subgroups as in CAST, and applying a multivariate test to
the subgroups. The weighted sum test (WST)13 specifically
considers the case-control setting and collapses a set of
SNPs into a single weighted average of the number of
rare alleles for each individual. Numerous alternative
methods are largely variations on these approaches.16,17,19
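The two simplest collapsing schemes described above can be sketched as follows. This is an illustration under stated assumptions, not the published implementations of CAST or the count-based test: the function names, the toy genotype matrix, and the choice of which variants count as rare are all hypothetical.

```python
def cast_indicator(genotypes, is_rare):
    """CAST-style collapsing: 1 if the subject carries any rare allele, else 0."""
    return [
        int(any(g > 0 for g, rare in zip(row, is_rare) if rare))
        for row in genotypes
    ]


def rare_allele_count(genotypes, is_rare):
    """Count-based collapsing: total number of rare alleles per subject."""
    return [
        sum(g for g, rare in zip(row, is_rare) if rare)
        for row in genotypes
    ]


# Toy data (hypothetical): 4 subjects x 3 variants, minor-allele counts 0/1/2.
genotypes = [
    [0, 0, 1],
    [2, 1, 0],
    [0, 0, 0],
    [0, 1, 2],
]
is_rare = [True, True, False]  # third variant treated as common (assumed threshold)

print(cast_indicator(genotypes, is_rare))     # [0, 1, 0, 1]
print(rare_allele_count(genotypes, is_rare))  # [0, 3, 0, 1]
```

Either collapsed variable would then be tested against the trait with a univariate test. Both summaries discard the identity of the individual variants.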
A limitation for all these burden tests is that they implic-
itly assume that all rare variants influence the phenotype
in the same direction and with the same magnitude of
effect (after incorporating known weights). However, one
would expect most variants (common or rare) within
a sequenced region to have little or no effect on pheno-
type, whereas some variants are protective and others dele-
terious, and the magnitude of each variant’s effect is likely
to vary (e.g., rarer variants might have larger effects).
Hence, collapsing across all variants is likely to introduce
substantial noise into the aggregated index, attenuate
evidence for association, and result in power loss. Further-
more, burden tests require either specification of thresh-
olds for collapsing or the use of permutation to estimate
the threshold.16–20 Permutation tests are computationally
expensive, especially on the whole-genome scale, and are
difficult for covariate adjustment because permutation
1Department of Biostatistics, The University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA; 2Department of Biostatistics, Harvard School
of Public Health, Boston, MA 02115, USA; 3Department of Genetics, The University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA; 4Department
of Biostatistics and Center for Statistical Genetics, University of Michigan, Ann Arbor, MI 48109, USA
5These authors contributed equally to this work
*Correspondence: [email protected]
DOI 10.1016/j.ajhg.2011.05.029. ©2011 by The American Society of Human Genetics. All rights reserved.
82 The American Journal of Human Genetics 89, 82–93, July 15, 2011
requires independence between the genotype and the co-
variates.
The recently proposed C-alpha test21 is a non-burden-
based test and is hence robust to the direction and magni-
tude of effect. For case-control data, it compares the
expected variance to the actual variance of the distribution
of allele frequencies. These important advantages allow the
C-alpha test to have improved power over burden-based
tests, especially when the effects are in different directions.
Despite these attractive features, the C-alpha test does not
allow for easy covariate adjustment, such as for controlling
population stratification, which is important in genetic
association studies. The C-alpha test also uses permutation
to obtain a p value when linkage disequilibrium is present
among the variants, which is, as noted earlier, computa-
tionally expensive for whole-genome experiments. The
approach has not been generalized to analysis of contin-
uous phenotypes.
We propose in this paper the sequence kernel association
test (SKAT), a flexible, computationally efficient, regression
approach that tests for association between variants in a
region (both common and rare) and a dichotomous (e.g.,
case-control) or continuous phenotype while adjusting for
covariates, such as principal components, to account for
population stratification.22 The kernel machine regression
framework was previously considered for common variants.23,24
In this paper, we provide several essential methodological
improvements necessary for testing rare variants.
SKAT uses a multiple regression model to directly regress
the phenotype on genetic variants in a region and on cova-
riates, and so allows different variants to have different
directions and magnitude of effects, including no effects;
SKAT also avoids selection of thresholds. We develop a
kernel association test to test the regression coefficients of
the variants by using a variance-component score test in a
mixed-model framework by accounting for rare variants.
SKAT is computationally efficient. This quality is espe-
cially important in genome-wide studies because SKAT
only requires fitting the null model in which phenotypes
are regressed on the covariates alone; p values are easily
computed with simple analytic formulae. Additional
features of SKAT include exploitation of local correlation
structure, incorporation of flexible weights to boost power
(e.g., by increasing the weight of rarer variants or incorpo-
rating functionality), and allowance for epistatic variant
effects. As discussed in more detail below, under special
cases, the SKAT, C-alpha test, and individual variant test
statistics are closely related.
We demonstrate through simulation and analysis of
resequencing data from the Dallas Heart Study that SKAT
is often more powerful than existing tests across a broad
range of models for both continuous and dichotomous
data. We also investigate the factors that influence power
for sequence association studies. Finally, we describe
analytic tools to estimate statistical power and sample sizes
to guide the design of new sequence association studies of
rare variants with SKAT.
Material and Methods
Sequencing Kernel Association Test
SKAT is a supervised test for the joint effects of multiple variants in
a region on a phenotype. Regions can be defined by genes (in
candidate-gene or whole-exome studies) or moving windows
across the genome (in whole-genome studies). For each region,
SKAT analytically calculates a p value for association while adjust-
ing for covariates. Adjustments for multiple comparisons are
necessary for analyzing multiple regions, for example with the
Bonferroni correction or FDR control.
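The two multiple-comparison adjustments named above can be sketched as follows. The sketch assumes a genome of roughly 3 × 10^9 base pairs cut into non-overlapping 30 kb windows. The function names and the five example p values are hypothetical, and the FDR procedure shown is the Benjamini-Hochberg step-up rule, which is one common choice rather than one the text specifies.

```python
def bonferroni_threshold(alpha, n_regions):
    """Per-region significance threshold controlling the family-wise error rate."""
    return alpha / n_regions


def benjamini_hochberg(p_values, fdr):
    """Return indices of regions declared significant by the BH step-up rule."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= fdr * rank / m:
            cutoff_rank = rank  # keep the largest rank that meets its threshold
    return sorted(order[:cutoff_rank])


# Assumption: ~3e9 bp genome in 30 kb windows gives about 100,000 regions.
n_regions = 3_000_000_000 // 30_000
print(n_regions, bonferroni_threshold(0.05, n_regions))

# Hypothetical p values for five regions.
print(benjamini_hochberg([0.001, 0.045, 0.025, 0.2, 0.008], fdr=0.05))  # [0, 2, 4]
```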
Notation
Assume n subjects are sequenced in a region with p variant sites
observed. Covariates might include age, gender, and top principal
components of genetic variation for controlling population stratification.22
For the i-th subject, $y_i$ denotes the phenotype variable,
$X_i = (X_{i1}, X_{i2}, \ldots, X_{im})$ denotes the covariates, and $G_i = (G_{i1}, G_{i2}, \ldots, G_{ip})$
denotes the genotypes for the p variants within the region.
Typically, we assume an additive genetic model and let $G_{ij} = 0$,
1, or 2 represent the number of copies of the minor allele. Dominant
and recessive models can also be considered.
SKAT Model and Test for Linear SNP Effects
For a simple illustration of SKAT, we focus here on testing for a rela-
tionship between the variants and the phenotype by using clas-
sical multiple linear and logistic regression. We describe how the
SKAT can incorporate epistatic effects later. To relate the sequence
variants in a region to the phenotype, consider the linear model
$$y_i = \alpha_0 + \alpha' X_i + \beta' G_i + \varepsilon_i, \qquad \text{(Equation 1)}$$
when the phenotypes are continuous traits, and the logistic model
$$\operatorname{logit} P(y_i = 1) = \alpha_0 + \alpha' X_i + \beta' G_i, \qquad \text{(Equation 2)}$$
when the phenotypes are dichotomous (e.g., y = 0/1 for case or
control). Here $\alpha_0$ is an intercept term, $\alpha = [\alpha_1, \ldots, \alpha_m]'$ is the vector
of regression coefficients for the m covariates, $\beta = [\beta_1, \ldots, \beta_p]'$ is the
vector of regression coefficients for the p observed gene variants in
the region, and for continuous phenotypes $\varepsilon_i$ is an error term with
a mean of zero and a variance of $\sigma^2$. Under both linear and logistic
models, evaluating whether the gene variants influence the
phenotype, adjusting for covariates, corresponds to testing the
null hypothesis $H_0: \beta = 0$, that is, $\beta_1 = \beta_2 = \cdots = \beta_p = 0$. The standard
p-DF likelihood ratio test has little power, especially for rare
variants. To increase the power, SKAT tests $H_0$ by assuming each
$\beta_j$ follows an arbitrary distribution with a mean of zero and
a variance of $w_j \tau$, where $\tau$ is a variance component and $w_j$ is a prespecified
weight for variant j. One can easily see that $H_0: \beta = 0$ is
equivalent to testing $H_0: \tau = 0$, which can be conveniently tested
with a variance-component score test in the corresponding mixed
model; this is known to be a locally most powerful test.25 A key
advantage of the score test is that it only requires fitting the null
model $y_i = \alpha_0 + \alpha' X_i + \varepsilon_i$ for continuous traits and
$\operatorname{logit} P(y_i = 1) = \alpha_0 + \alpha' X_i$ for dichotomous traits.
Specifically, the variance-component score statistic is
$$Q = (y - \hat{\mu})' K (y - \hat{\mu}), \qquad \text{(Equation 3)}$$
where $K = G W G'$, $\hat{\mu}$ is the predicted mean of y under $H_0$, that is,
$\hat{\mu} = \hat{\alpha}_0 + X\hat{\alpha}$ for continuous traits and $\hat{\mu} = \operatorname{logit}^{-1}(\hat{\alpha}_0 + X\hat{\alpha})$ for
dichotomous traits; and $\hat{\alpha}_0$ and $\hat{\alpha}$ are estimated under the null
model by regressing y on only the covariates X. Here G is an
$n \times p$ matrix with the (i, j)-th element being the genotype of
variant j of subject i, and $W = \operatorname{diag}(w_1, \ldots, w_p)$ contains the weights
of the p variants.
In fact, K is an $n \times n$ matrix with the (i, i')-th element equal to
$K(G_i, G_{i'}) = \sum_{j=1}^{p} w_j G_{ij} G_{i'j}$. $K(\cdot, \cdot)$ is called the kernel function, and
$K(G_i, G_{i'})$ measures the genetic similarity between subjects i and i'
in the region via the p markers. This particular form of $K(\cdot, \cdot)$ is
called the weighted linear kernel function. We later discuss other
choices of the kernel to model epistatic effects.
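The kernel matrix and the score statistic of Equation 3 can be sketched in a few lines for a continuous trait. This is a toy illustration, not the authors' software. It makes the simplifying assumption that there are no covariates, so the null model is intercept-only and the fitted mean is the sample mean of y. The function names, genotypes, weights, and trait values are hypothetical.

```python
def weighted_linear_kernel(G, w):
    """K[i][k] = sum_j w[j] * G[i][j] * G[k][j], i.e. K = G W G'."""
    n = len(G)
    return [
        [sum(wj * gi * gk for wj, gi, gk in zip(w, G[i], G[k])) for k in range(n)]
        for i in range(n)
    ]


def skat_q(y, mu_hat, K):
    """Q = (y - mu_hat)' K (y - mu_hat)."""
    r = [yi - mi for yi, mi in zip(y, mu_hat)]
    n = len(r)
    return sum(r[i] * K[i][k] * r[k] for i in range(n) for k in range(n))


# Toy data (hypothetical): 4 subjects, 2 variants, continuous trait.
G = [[0, 1], [1, 0], [2, 0], [0, 0]]
w = [1.0, 4.0]
y = [1.0, 2.0, 4.0, 1.0]

# Simplifying assumption: intercept-only null model, so mu_hat is mean(y).
mu_hat = [sum(y) / len(y)] * len(y)

K = weighted_linear_kernel(G, w)
Q = skat_q(y, mu_hat, K)
print(Q)  # 20.0
```

With covariates, `mu_hat` would instead come from a linear or logistic regression of y on X, fitted once and reused for every region.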
Good choices of weights can improve power. Each weight $w_j$
is prespecified using only the genotypes, covariates, and external
biological information, that is, it is estimated without using the
outcome, and reflects the relative contribution of the j-th variant
to the score statistic: if $w_j$ is close to zero, then the j-th variant
makes only a small contribution to Q. Thus, decreasing the
weight of noncausal variants and increasing the weight of
causal variants can yield improved power. Because in practice
we do not know which variants are causal, we propose to set
$\sqrt{w_j} = \mathrm{Beta}(\mathrm{MAF}_j; a_1, a_2)$, the beta distribution density function
with prespecified parameters $a_1$ and $a_2$ evaluated at the sample
minor-allele frequency (MAF) (across cases and controls
combined) for the j-th variant in the data. The beta density is flexible
and can accommodate a broad range of scenarios. For
example, if rarer variants are expected to be more likely to have
larger effects, then setting $0 < a_1 \le 1$ and $a_2 \ge 1$ allows for
increasing the weight of rarer variants and decreasing the weight
of common variants. We suggest setting $a_1 = 1$ and $a_2 = 25$ because
it increases the weight of rare variants while still putting decent
nonzero weights on variants with MAF 1%–5%. All simulations
were conducted with this default choice unless stated otherwise.
Note that a smaller $a_1$ results in more strongly increasing
the weight of rarer variants. Examples of weights across a range
of $a_1$ and $a_2$ values are presented in Figure S1, available online.
Note that $a_1 = a_2 = 1$ corresponds to $w_j = 1$, that is, all variants
are weighted equally, and $a_1 = a_2 = 0.5$ corresponds to
$\sqrt{w_j} = 1/\sqrt{\mathrm{MAF}_j (1 - \mathrm{MAF}_j)}$, that is, $w_j$ is the inverse of the variance
of the genotype of marker j, which puts almost zero weight for
MAFs > 1% and can be used if one believes only variants with
MAF < 1% are likely to be causal. Note that SKAT calculated
with this weight is identical to the unweighted SKAT test with
the standardized genotypes in Equations 1 and 2. Other forms of
the weight as a function of MAF can also be used. Because SKAT
is a score test, the type I error is protected for any choice of prechosen
weights. Note that the weights used in the weighted sum
test13 involve phenotype information and will therefore alter
the null distribution of SKAT if such weights are used.
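The beta-density weighting described above can be sketched with the standard library alone. The function names are hypothetical and the MAF values are illustrative; only the default parameters (a1 = 1, a2 = 25) come from the text.

```python
import math


def beta_pdf(x, a, b):
    """Density of the Beta(a, b) distribution at x, for 0 < x < 1."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(x) + (b - 1) * math.log(1 - x))


def skat_weight(maf, a1=1.0, a2=25.0):
    """w_j, where sqrt(w_j) is the Beta(a1, a2) density evaluated at the MAF."""
    return beta_pdf(maf, a1, a2) ** 2


# sqrt(w_j) falls steadily as the variant becomes more common.
for maf in (0.001, 0.01, 0.05, 0.2):
    print(maf, round(math.sqrt(skat_weight(maf)), 3))
```

With a1 = 1 the density reduces to a2 (1 - MAF)^(a2 - 1). For a1 = a2 = 0.5 the density carries an extra constant factor 1/π relative to the expression in the text; a constant common to all weights rescales Q and its null distribution together.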
Under the null hypothesis, Q follows a mixture of chi-square
distributions, which can be closely approximated with the compu-
tationally efficient Davies method.26 See Appendix A for details.
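Under the null, Q is distributed as a weighted sum of independent 1-df chi-square variables. As a rough cross-check, and explicitly not the Davies method, a tail probability for such a mixture can be estimated by simulation. The sketch assumes the mixture weights (called `lambdas` here) have already been obtained as described in Appendix A; the values used below are hypothetical, as is the function name.

```python
import random


def mixture_chisq_pvalue(q_obs, lambdas, n_draws=100_000, seed=0):
    """Monte Carlo estimate of P(sum_j lambda_j * chi2_1 >= q_obs)."""
    rng = random.Random(seed)
    exceed = 0
    for _ in range(n_draws):
        # A chi-square(1) draw is the square of a standard normal draw.
        q_sim = sum(lam * rng.gauss(0.0, 1.0) ** 2 for lam in lambdas)
        if q_sim >= q_obs:
            exceed += 1
    return exceed / n_draws


lambdas = [2.0, 1.0, 0.5]  # hypothetical mixture weights
print(mixture_chisq_pvalue(q_obs=10.0, lambdas=lambdas))
```

Simulation cannot resolve p values much smaller than 1/n_draws, which is one reason an analytic approximation is preferable at genome-wide significance levels.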
A special case of SKAT arises when the outcome is dichotomous,
no covariates are included, and all wj ¼ 1. Under these conditions,
we show in Appendix A that the SKAT test statistic Q is equivalent
to the C-alpha test statistic T. Hence, the C-alpha test can be
seen as a special case of SKAT, or alternatively, SKAT can be seen
as a generalized C-alpha test that does not require permutation
but calculates the p value analytically, allows for covariate adjust-
ment, and accommodates either dichotomous or continuous
phenotypes. Because SKAT under flat weights is also equivalent
to the kernel machine regression test23,24 and because the kernel
machine regression test is in turn related to the SSU test,27 it
follows transitively that SKAT under flat weights, the kernel
machine regression test, the SSU test, and the C-alpha test are all
equivalent and special cases of SKAT. Note that the null distribu-
tion is calculated differently via these methods, and SKAT gives
more accurate analytic p values, especially in the extreme tail,
when sample sizes are sufficient.
Relationship between Linear SKAT and Individual Variant Test Statistics
One can efficiently compute the test statistic Q by exploiting
a close connection between the SKAT score test statistic Q and
the individual variant test statistics. In particular, Q is a weighted
sum of the individual score statistics for testing for individual
variant effects. Hence, letting $g_j = [G_{1j}, G_{2j}, \ldots, G_{nj}]'$ denote
the $n \times 1$ vector containing the genotypes of the n subjects for
variant j, it is straightforward to see that $Q = \sum_{j=1}^{p} w_j S_j^2$, where
$S_j = g_j'(y - \hat{\mu}_0)$ is the individual score statistic for testing the
marginal effect of the j-th marker ($H_0: \beta_j = 0$) under the individual
linear or logistic regression model of $y_i$ on $X_i$ and only the j-th
variant $G_{ij}$:
$$y_i = \alpha_0 + X_i' \alpha + \beta_j G_{ij} + \varepsilon_i$$
for continuous phenotypes and
$$\operatorname{logit} P(y_i = 1) = \alpha_0 + X_i' \alpha + \beta_j G_{ij}$$
for dichotomous phenotypes. $\hat{\mu}_0$ is estimated as $\hat{\mu}_0 = \hat{\alpha}_0 + X_i' \hat{\alpha}$
for continuous traits and $\hat{\mu}_0 = \operatorname{logit}^{-1}(\hat{\alpha}_0 + X_i' \hat{\alpha})$ for dichotomous
traits. Because SKAT is a score test, one needs to fit the null model only a single
time to be able to compute the $S_j$ for all individual variants j as well
as all regions to be tested. Similarly, if multiple regions are under
consideration, then the same $\hat{\mu}_0$ can be used to compute the
SKAT Q statistics for each region.
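The identity between Q and the weighted sum of squared per-variant score statistics can be checked numerically. The sketch below uses hypothetical toy data and function names, and again assumes an intercept-only null model, so the fitted mean is simply the sample mean of y.

```python
def score_statistics(G, y, mu0_hat):
    """S_j = g_j' (y - mu0_hat) for every variant j (column of G)."""
    resid = [yi - mi for yi, mi in zip(y, mu0_hat)]
    p = len(G[0])
    return [sum(G[i][j] * resid[i] for i in range(len(G))) for j in range(p)]


def skat_q_from_scores(scores, w):
    """Q = sum_j w_j * S_j ** 2."""
    return sum(wj * sj ** 2 for wj, sj in zip(w, scores))


# Toy data (hypothetical); intercept-only null model assumed.
G = [[0, 1], [1, 0], [2, 0], [0, 0]]
w = [1.0, 4.0]
y = [1.0, 2.0, 4.0, 1.0]
mu0_hat = [sum(y) / len(y)] * len(y)

S = score_statistics(G, y, mu0_hat)
Q = skat_q_from_scores(S, w)
print(S, Q)  # [4.0, -1.0] 20.0

# Cross-check against the quadratic form (y - mu)' G W G' (y - mu).
r = [yi - mi for yi, mi in zip(y, mu0_hat)]
quad = sum(
    r[i] * r[k] * sum(w[j] * G[i][j] * G[k][j] for j in range(len(w)))
    for i in range(len(G))
    for k in range(len(G))
)
print(quad)  # 20.0
```

This form costs one pass over each variant column and never builds the n × n kernel matrix, which matters when n is large.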
Accommodating Epistatic Effects and Prior Information under the SKAT
An attractive feature of SKAT is the ability to model the epistatic
effects of sequence variants on the phenotype within the flexible
kernel machine regression framework.28–30 To do so, we replace
$G_i'\beta$ by a more flexible function $f(G_i)$ in the linear and logistic
models (1) and (2), where $f(G_i)$ allows for rare-variant by rare-variant
and common-variant by rare-variant interactions. Specifically,
for continuous traits we use the semiparametric linear
model23,29
$$y_i = \alpha_0 + \alpha'X_i + f(G_i) + \varepsilon_i, \qquad \text{(Equation 4)}$$
and for dichotomous traits, we use the semiparametric logistic
model24,30
$$\operatorname{logit} P(y_i = 1) = \alpha_0 + \alpha'X_i + f(G_i). \qquad \text{(Equation 5)}$$
Here the variants, $G_i$, are related to the phenotype through
a possibly nonparametric function $f(\cdot)$, which is assumed to lie
in a functional space generated by a positive semidefinite kernel
function $K(\cdot,\cdot)$. Models (1) and (2) assume linear genetic effects
and are specified by $K(G_i, G_{i'}) = \sum_{j=1}^{p} w_j G_{ij} G_{i'j}$. By changing
$K(\cdot,\cdot)$, one can allow for more complex models. Intuitively,
$K(G_i, G_{i'})$ is a function that measures genetic similarity between
the i-th and i'-th subjects via the p variants in the region, and
any positive semidefinite function $K(G_i, G_{i'})$ can be used as
a kernel function. We tailored several useful and commonly used
kernels specifically for the purpose of rare-variant analysis: the
weighted linear kernel, the weighted quadratic kernel, and the
weighted identity by state (IBS) kernel.
The weighted linear kernel function $K(G_i, G_{i'}) = \sum_{j=1}^{p} w_j G_{ij} G_{i'j}$
implies that the trait depends on the variants in a linear fashion
and is equivalent to the classical linear and logistic models
presented in Equations 1 and 2. The weighted quadratic kernel
$K(G_i, G_{i'}) = (1 + \sum_{j=1}^{p} w_j G_{ij} G_{i'j})^2$ implicitly assumes that the model
depends on the main effects and quadratic terms for the gene
variants and the first-order variant by variant interactions. The
weighted IBS kernel $K(G_i, G_{i'}) = \sum_{j=1}^{p} w_j \mathrm{IBS}(G_{ij}, G_{i'j})$ defines
similarity between individuals as the weighted number of alleles shared
IBS. For additively coded autosomal genotype data,
$K(G_i, G_{i'}) = \sum_{j=1}^{p} w_j (2 - |G_{ij} - G_{i'j}|)$. The model implied by the weighted IBS
kernel models the SNP effects nonparametrically.31 Consequently,
this allows for epistatic effects because the function $f(\cdot)$ does not
assume linearity or interactions of a particular order (e.g., the
second order). Using the weighted IBS kernel removes the assumption
of additivity because the number of alleles that are identical
by state is a physical quantity that does not change on the basis
of different genotype encodings.
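The three kernels can be written directly from their definitions. The sketch below is illustrative only and assumes additively coded genotypes (0, 1, or 2 minor alleles); the function names are hypothetical and are not taken from the SKAT package.

```python
def weighted_linear_kernel(gi, gk, w):
    """K = sum_j w_j * G_ij * G_kj."""
    return sum(wj * a * b for wj, a, b in zip(w, gi, gk))


def weighted_quadratic_kernel(gi, gk, w):
    """K = (1 + sum_j w_j * G_ij * G_kj)^2."""
    return (1.0 + weighted_linear_kernel(gi, gk, w)) ** 2


def weighted_ibs_kernel(gi, gk, w):
    """K = sum_j w_j * (2 - |G_ij - G_kj|) for additive 0/1/2 coding."""
    return sum(wj * (2 - abs(a - b)) for wj, a, b in zip(w, gi, gk))


def kernel_matrix(genotypes, w, kernel):
    """n x n similarity matrix for a list of genotype rows."""
    n = len(genotypes)
    return [[kernel(genotypes[i], genotypes[k], w) for k in range(n)]
            for i in range(n)]


# Two subjects, three variants (toy values)
g1, g2 = [0, 1, 2], [1, 1, 0]
w = [1.0, 2.0, 0.5]
k_lin = weighted_linear_kernel(g1, g2, w)       # 0 + 2 + 0
k_quad = weighted_quadratic_kernel(g1, g2, w)   # (1 + 2)^2
k_ibs = weighted_ibs_kernel(g1, g2, w)          # 1*1 + 2*2 + 0.5*0
```

Only the kernel function changes between models; the rest of a kernel-based test can stay the same, which is what makes swapping kernels inexpensive.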
We note that a kernel function that better captures both the
similarity between individuals and the causal variant effects will
increase power. In particular, if relationships are linear and no
interactions are present, then the weighted linear kernel will
have highest power. If interactions are present, the weighted
quadratic and weighted IBS kernels can increase power. Our expe-
rience suggests using the IBS kernel when the number of interact-
ing variants within the region is modest. As our understanding of
genetic architecture improves so too will our knowledge of which
kernel to use.
In each of the above kernels, $w_j$ is an allele-specific weight that
controls the relative importance of the j-th variant and might be
a function of factors such as allele frequency or anticipated
functionality. Without prior information, we suggest the use of
$\sqrt{w_j} = \mathrm{Beta}(\mathrm{MAF}_j; 1, 25)$, as suggested earlier. However, if prior
information is available, for example if some variants are predicted to be
functional or damaging via Polyphen32 or Sift,33 weights can be
selected to up-weight the variants that are more likely to be functional.
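For concreteness, the default weight can be evaluated from the beta density. This is a small sketch using only the standard library; `beta_density` and `skat_weight` are hypothetical helper names, and MAF is assumed to lie strictly between 0 and 1.

```python
import math


def beta_density(x, a, b):
    """Density of the Beta(a, b) distribution at x, for 0 < x < 1."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(x) + (b - 1) * math.log(1 - x))


def skat_weight(maf, a=1.0, b=25.0):
    """w_j such that sqrt(w_j) = Beta(MAF_j; a, b)."""
    return beta_density(maf, a, b) ** 2


# With a = 1 and b = 25 the density reduces to 25 * (1 - MAF)^24,
# so the weight falls off quickly as MAF grows.
weights = {maf: skat_weight(maf) for maf in (0.001, 0.01, 0.05, 0.3)}
```

With these parameters the weights decrease smoothly with MAF, so rare variants are up-weighted without imposing a hard frequency threshold.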
To test for the effects of gene variants in a region on a phenotype,
one tests the null hypothesis $H_0\colon f(G) = 0$. SKAT tests this null
hypothesis by assuming that the $n \times 1$ vector $f = [f(G_1), \ldots, f(G_n)]'$ for
the genetic effects of n subjects follows a distribution with mean
zero and covariance $\tau K$, where $\tau$ is a variance component that
indexes the effects of the variants.29,30 Hence, testing the
null hypothesis corresponds to testing $H_0\colon \tau = 0$ by a
variance-component score test. In particular, we simply replace K in
Equation 3 by the K discussed in this section, for example,
the weighted IBS kernel, for epistatic effects. All subsequent
calculations for computing a p value remain the same.
Because the SKAT evaluates significance via a score test, which
operates under the null hypothesis, the SKAT is valid (in terms
of protecting type I error) irrespective of the kernel and the
weights used. Good choices of the kernel and the weights simply
increase power.
Planning New Sequencing-Based Association Studies:
Estimation of Power and Sample Size
Power and sample-size calculations are important in designing
sequencing studies of complex traits. Using a modification of
the higher-order moment-approximation method,34 we provide
an analytic method to carry out such calculations efficiently for
SKAT.35 Specifically, for a fixed sample size and $\alpha$ level, given a prior
hypothesis on the genetic architecture of a particular region, the
effect size, and the proportion and number of causal variants
within a region, our method provides the power to detect the
region as significant with SKAT. Similarly, if the desired power is
fixed, the approach can be used to find the necessary sample size.
There are key differences between the power and sample-size
estimation for single-variant- and region (set)-based tests. For
a region (set)-based test, the power depends strongly on the under-
lying genetic architecture, and its estimation requires modeling
this genetic architecture and the linkage disequilibrium (LD)
between variants. Therefore, to estimate power to detect a partic-
ular region as associated with a phenotype requires specification
of the significance level, sample size, which variants in the region
are causal with corresponding effect size, and the LD structure of
the variants in the region. Ideally, one could use prior data to
assess the LD and MAF. Because prior data can be difficult to
obtain, we currently recommend the use of either 1000 Genomes
Project data36 or data simulated under a population genetics
model.37 Relevant preliminary data will become increasingly
available as sequencing studies become more common.
Our SKAT software uses simulated data based on the coalescent
population genetic model (released with the software package) as
a default in performing sample-size and power calculations, and
instead of directly specifying the effects of any given variant, the
user can input an MAF threshold for determining which variants
are regarded as rare and also a proportion determining how many
of the rare variants are causal. The causal variants are then randomly
selected from the alleles with true MAF (based on simulated or
preliminary data) less than the threshold. The magnitudes of the
effects $|\beta_j|$ for causal variants are set to be equal to $c \times |\log_{10} \mathrm{MAF}|$,
where c is determined on the basis of the maximum effect size the
user would like to allow (described below in the power simulations
section) at MAF $= 10^{-4}$. This allows the effects of causal variants to
decrease as MAF increases. Because these parameters can be difficult to
choose a priori, power and sample size can be reasonably estimated
by averaging results over a range of parameter values. Similarly,
because the regional architecture can vary across different regions,
for genome-wide studies, one can average over multiple randomly
selected regions as currently implemented in the SKAT software.
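The inputs described above (an MAF threshold, a causal proportion, and a maximum effect size) can be turned into per-variant effects as follows. This is a hedged sketch of the idea rather than the SKAT software's implementation; the function name, its arguments, and the rounding of the number of causal variants are assumptions made for illustration.

```python
import math
import random


def assign_causal_effects(mafs, maf_threshold, causal_fraction, max_effect,
                          rng, ref_maf=1e-4):
    """Pick causal variants among the rare ones and set |beta_j| = c * |log10 MAF_j|.

    c is chosen so that a variant with MAF = ref_maf has effect max_effect.
    Non-causal variants get effect 0. Signs are not assigned here.
    """
    c = max_effect / abs(math.log10(ref_maf))
    rare = [j for j, m in enumerate(mafs) if m < maf_threshold]
    n_causal = round(causal_fraction * len(rare))   # assumption: simple rounding
    causal = rng.sample(rare, n_causal)
    effects = [0.0] * len(mafs)
    for j in causal:
        effects[j] = c * abs(math.log10(mafs[j]))
    return effects


mafs = [1e-4, 1e-3, 0.01, 0.02, 0.2, 0.4]
effects = assign_causal_effects(mafs, maf_threshold=0.03, causal_fraction=0.5,
                                max_effect=1.6, rng=random.Random(0))
```

Averaging power over repeated draws of the causal set, and over several parameter values, corresponds to the averaging strategy described in the text.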
Numerical Experiments and Simulations
To validate SKAT in terms of protecting type I error and to assess its
power compared to burden tests and the accuracy of our power
and sample-size tools, we carried out simulation studies under
a range of configurations. For all simulations, we determined
sequence genotypes by simulating 10,000 chromosomes for a
1 Mb region on the basis of a coalescent model that mimics the
LD pattern, local recombination rate, and population history
for Europeans by using COSI.37
Type I Error Simulations
To investigate whether SKAT preserves the desired type I error rate
at the near genome-wide threshold level, for example $\alpha = 10^{-6}$, it
is necessary to conduct simulations with hundreds of millions of
simulated data sets. Although SKAT is computationally efficient,
generating such a large number of data sets is challenging. To
reduce the computational burden, we took the following approach.
Using 10,000 randomly selected 30 kb subregions within
a 1 Mb chromosome, we first generated 10,000 sets of genotypes
$G_{n \times p}$ from the coalescent model, with p variants on n subjects.
Then, for each of the 10,000 simulated genotype data sets, we
simulated 10,000 sets of continuous phenotypes, such that we
obtained $10^8$ individual genotype-phenotype data sets,
by using the model
$$y = 0.5X_1 + 0.5X_2 + \varepsilon,$$
where $X_1$ is a continuous covariate generated from a standard
normal distribution, $X_2$ is a dichotomous covariate taking values
0 and 1 with a probability of 0.5, and $\varepsilon$ follows a standard normal
distribution. Note that the continuous trait values are not related
to the genotype so that the null model holds. The 30 kb regions on
which the genotype values are based contained 605 variants on
average, but the number of observed variants for any given data
set was considerably less and depended on the sample size n,
which we set to 500, 1000, 2500, and 5000.
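The null phenotype model is simple enough to sketch directly. The snippet below is an illustration under the stated model only (it does not generate genotypes, which in the study come from the coalescent simulation); the function name is hypothetical.

```python
import random


def simulate_null_phenotypes(n, rng):
    """Draw (x1, x2, y) with y = 0.5*x1 + 0.5*x2 + eps, independent of genotype."""
    x1 = [rng.gauss(0.0, 1.0) for _ in range(n)]             # standard normal covariate
    x2 = [1 if rng.random() < 0.5 else 0 for _ in range(n)]  # Bernoulli(0.5) covariate
    y = [0.5 * a + 0.5 * b + rng.gauss(0.0, 1.0) for a, b in zip(x1, x2)]
    return x1, x2, y


# Reusing each genotype set for many phenotype draws gives the total count
# of genotype-phenotype data sets quoted in the text.
n_genotype_sets = 10_000
n_phenotype_sets_per_genotype_set = 10_000
total_data_sets = n_genotype_sets * n_phenotype_sets_per_genotype_set

x1, x2, y = simulate_null_phenotypes(500, random.Random(1))
```

Resampling only the phenotypes is much cheaper than regenerating genotypes, which is why the design above makes $10^8$ replicates feasible.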
We repeated the type I error simulations for dichotomous
phenotypes as above, except the dichotomous outcomes were
generated via the model:
$$\operatorname{logit} P(y = 1) = \alpha_0,$$
where $\alpha_0$ was determined to set the prevalence to 1% and
case-control sampling was used.
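Under this null model the intercept is just the logit of the prevalence, and case-control sampling can be mimicked by drawing subjects until the required numbers of cases and controls are reached. The sketch below is one simple way to do this, offered as an assumption about the mechanics rather than a description of the authors' code; the function names are hypothetical.

```python
import math
import random


def logit(p):
    return math.log(p / (1.0 - p))


def sample_case_control_null(n_cases, n_controls, prevalence, rng):
    """Draw population subjects under logit P(y=1) = a0 and keep the first
    n_cases cases and n_controls controls (retrospective sampling)."""
    a0 = logit(prevalence)
    p_case = 1.0 / (1.0 + math.exp(-a0))   # equals the prevalence under the null
    cases, controls = [], []
    subject = 0
    while len(cases) < n_cases or len(controls) < n_controls:
        is_case = rng.random() < p_case
        if is_case and len(cases) < n_cases:
            cases.append(subject)
        elif not is_case and len(controls) < n_controls:
            controls.append(subject)
        subject += 1
    return a0, cases, controls


a0, cases, controls = sample_case_control_null(50, 50, 0.01, random.Random(2))
```

For a 1% prevalence, `a0` is about -4.6, and roughly 100 population draws are needed per sampled case.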
For both continuous and dichotomous simulations, we applied
SKAT by using the default weighted linear kernel to each of the $10^8$
data sets and estimated the empirical type I error rate as the
proportion of p values less than $\alpha = 10^{-4}$, $10^{-5}$, or $10^{-6}$.
We note that the estimated type I error from this approach is
not the same as the empirical type I error when genotypes are
generated randomly for each simulation, because for each of the
10,000 genotype data sets, only the outcomes are resampled.
However, our type I error estimator is still unbiased and results
in very accurate type I error estimates. For larger $\alpha$ levels (0.05
and 0.01), we directly computed the empirical type I error rate
by using data sets in which genotypes were randomly generated
for each simulation.
Empirical Power Simulations
We simulated data sets in which 30 kb subregions were randomly
selected from the generated 1 Mb chromosomes and used to
create causal variants and a phenotype variable as well as additional
simulated covariates. We generated continuous phenotypes by
$$y = 0.5X_1 + 0.5X_2 + \beta_1 G^c_1 + \beta_2 G^c_2 + \cdots + \beta_s G^c_s + \varepsilon,$$
where $X_1$, $X_2$, and $\varepsilon$ are as defined for the type I error simulations,
$G^c_1, G^c_2, \ldots, G^c_s$ are the genotypes of the s causal rare variants (a
randomly selected subset of the simulated rare variants, for
example 5% of variants that have MAF < 3% in Figure 1), and
the $\beta$s are effect sizes for the causal variants.

[Figure 1: power (y axis, 0.0 to 1.0) versus total sample size (x axis: 0.5k, 1k, 2.5k, 5k) for SKAT, SKAT_M, rSKAT, W, N, and C. Top row: Continuous Trait; bottom row: Dichotomous Trait. Columns: β +/− = 100/0, 80/20, and 50/50.]

Figure 1. Simulation-Study-Based Power Comparisons of SKAT and Burden Tests
Empirical power at $\alpha = 10^{-6}$ under an assumption that 5% of the rare variants with MAF < 3% within random 30 kb regions were causal. Top panel: continuous phenotypes with maximum effect size ($|\beta|$) equal to 1.6 when MAF $= 10^{-4}$; bottom panel: case-control studies with maximum OR = 5 when MAF $= 10^{-4}$. Regression coefficients for the s causal variants were assumed to be a decreasing function of MAF as $|\beta_j| = c\,|\log_{10} \mathrm{MAF}_j|$ ($j = 1, \ldots, p$ [see Figure S2]), where c was chosen to result in these maximum effect sizes. From left to right, the plots consider settings in which the coefficients for the causal rare variants are 100% positive (0% negative), 80% positive (20% negative), and 50% positive (50% negative). Total sample sizes considered are 500, 1000, 2500, and 5000, with half being cases in case-control studies. For each setting, six methods are compared: SKAT, SKAT in which 10% of the genotypes were set to missing and then imputed (SKAT_M), restricted SKAT (rSKAT) in which unweighted SKAT is applied to variants with MAF < 3%, the weighted sum burden test (W) with the same weights as used by SKAT, the counting-based burden test (N), and the CAST method (C). All the burden tests used MAF < 3% as the threshold. For each method, power was estimated as the proportion of p values < $\alpha$ among 1000 simulated data sets.

Similarly, we generated dichotomous phenotypes for case-control data under
the logistic model
$$\operatorname{logit} P(y = 1) = \alpha_0 + 0.5X_1 + 0.5X_2 + \beta_1 G^c_1 + \beta_2 G^c_2 + \cdots + \beta_s G^c_s,$$
where $G^c_1, G^c_2, \ldots, G^c_s$ are again the genotypes for the causal rare
variants and the $\beta$s are log ORs for the causal variants. We controlled
prevalence by $\alpha_0$ and set it to 1% unless otherwise stated. Under
both models, we set the magnitude of each $\beta_j$ to $c\,|\log_{10} \mathrm{MAF}_j|$
such that rarer variants had larger effects. In the simulation
studies, for continuous traits, c = 0.4, which gives the maximum
effect size $|\beta_j| = 1.6$ for variants with MAF $= 10^{-4}$ and small effects
$|\beta_j| = 0.28$ for MAF = 0.2. For dichotomous traits, $c = (\ln 5)/4 = 0.402$,
which gives the "maximum" OR = 5.0 ($|\beta_j| = \ln 5$) for variants
with MAF $= 10^{-4}$ and smaller OR = 1.32 for MAF = 0.2. The
effect size curves are given in Figure S2.
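As a quick arithmetic check of the constants quoted above (a worked calculation added for illustration, using $|\log_{10} 10^{-4}| = 4$ and $|\log_{10} 0.2| \approx 0.699$):

```latex
\begin{aligned}
\text{continuous:}\quad & c = \frac{1.6}{|\log_{10} 10^{-4}|} = \frac{1.6}{4} = 0.4,
  & 0.4 \times 0.699 &\approx 0.28, \\
\text{dichotomous:}\quad & c = \frac{\ln 5}{4} \approx \frac{1.609}{4} \approx 0.402,
  & \exp(0.402 \times 0.699) &\approx \exp(0.281) \approx 1.32 .
\end{aligned}
```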
We compared SKAT, an unsupervised variation on the WST13
that uses weighted-count-based collapsing, counting-based
collapsing,18 and CAST.14 For each of these tests, we considered
variants with observed MAF < 3% as rare: CAST collapses on
whether an individual carries any variant with
allele frequency < 3%, the counting method counts the number of
variants with MAF < 3%, and the weighted count inflates the
contribution of each rare variant by multiplying the genotype
by the same beta-density-based weights as used in SKAT.
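The three collapsing schemes differ only in how each subject's rare-variant genotypes are reduced to a single number. The sketch below shows one plausible reading of the descriptions above; in particular, counting the number of rare variant sites carried (rather than the number of rare alleles) is an assumption, and the function name is hypothetical.

```python
def burden_scores(genotypes, mafs, weights, maf_threshold=0.03):
    """Per-subject CAST indicator, rare-variant count, and weighted count.

    genotypes: list of rows of allele counts (0, 1, or 2)
    mafs:      observed minor allele frequency of each variant
    weights:   per-variant weights used by the weighted count
    """
    rare = [j for j, m in enumerate(mafs) if m < maf_threshold]
    cast, count, weighted = [], [], []
    for row in genotypes:
        carried = [row[j] > 0 for j in rare]
        cast.append(1 if any(carried) else 0)                    # any rare variant present
        count.append(sum(carried))                               # number of rare sites carried
        weighted.append(sum(weights[j] * row[j] for j in rare))  # weighted allele count
    return cast, count, weighted


G = [[0, 1, 2], [0, 0, 1], [1, 0, 0], [2, 1, 0]]
mafs = [0.01, 0.02, 0.2]          # the third variant is common and is ignored
w = [2.0, 1.0, 0.5]
cast, count, weighted = burden_scores(G, mafs, w)
```

Each score is then used as a single regressor, which is why these tests lose power when the aggregated variants have effects in opposite directions.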
To accommodate missing genotypes commonly observed in
sequence data, we considered the effect of imputing missing
values by randomly setting 10% of the genotypes as missing,
imputing genotypes on the basis of observed allele frequencies
and Hardy-Weinberg equilibrium, and then applying SKAT to
the imputed data. We also performed restricted SKAT (rSKAT) by
applying unweighted SKAT to rare variants with MAF < 3%.
Note that for dichotomous phenotypes, rSKAT is essentially equiv-
alent to a covariate adjusted C-alpha test with the p value calcu-
lated analytically instead of via permutation. For each of the
methods, power was estimated as the proportion of p values < $\alpha$,
where $\alpha = 10^{-6}$ to mimic genome-wide studies.
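The simple imputation step described above can be sketched as follows: estimate the allele frequency from the observed genotypes at a variant and draw each missing genotype from the Hardy-Weinberg proportions. This is an illustrative sketch with a hypothetical function name; it assumes missing values are coded as `None` and that at least one genotype is observed at the variant.

```python
import random


def impute_hwe(column, rng):
    """Fill missing genotypes (None) at one variant by sampling from
    Hardy-Weinberg proportions at the observed allele frequency."""
    observed = [g for g in column if g is not None]
    q = sum(observed) / (2 * len(observed))            # observed minor allele frequency
    probs = [(1 - q) ** 2, 2 * q * (1 - q), q ** 2]    # P(G = 0), P(G = 1), P(G = 2)
    return [g if g is not None else rng.choices([0, 1, 2], weights=probs)[0]
            for g in column]


column = [0, None, 1, 0, None, 2]
imputed = impute_hwe(column, random.Random(3))
```

For a rare variant, q is small and almost every imputed value is 0, which is consistent with the later observation that imputation costs little power when most variants are rare.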
Power and Sample-Size Formulae
To demonstrate the utility and accuracy of our power and sample-
size calculation method, we conducted several numerical experi-
ments. We first illustrated the use of the methods by computing
the sample size necessary to detect a 30 kb region with 5% of
the variants with MAF < 3% being causal. We assume effect size
(OR) increases with decreasing MAF, and seek 80% power at
significance levels $\alpha = 10^{-6}$, $10^{-3}$, and $10^{-2}$, corresponding to
approximate genome-wide sequencing significance and
candidate-gene-sequencing studies of 50 and five genes, respectively. We
considered both continuous and dichotomous traits.
To show that the power estimated from our sample-size formula
is accurate, we compared empirical power for SKAT under simula-
tions to power estimated via our analytic method. Specifically, we
simulated continuous and case-control data under the same
setting as that used in the power simulations, and we estimated
power as a function of the sample size by computing the
proportion of p values < $\alpha = 10^{-6}$ and compared the empirical power
curve to the power estimated by using our analytical method.
Results
Simulation of the Type I Error
The empirical type I error rates estimated for SKAT are
presented in Table 1 for $\alpha = 10^{-4}$, $10^{-5}$, and $10^{-6}$ and suggest
that the type I error rate is protected for continuous
phenotypes, though for smaller sample sizes SKAT can be
slightly conservative. For dichotomous phenotypes, SKAT
is conservative for smaller sample sizes and very small
$\alpha$ levels. Additional results from simulations of the type I
error for SKAT and the competing methods are presented
in Figure S3 for both continuous traits and dichotomous
traits and show that all of the considered tests correctly
control type I error at the larger $\alpha = 0.05$ and 0.01 levels. These
results show that SKAT is a valid method, and despite being
conservative at low $\alpha$ levels, SKAT maintains good power
relative to existing methods (see below). However, if
sample sizes are small or sharp control of type I error is
necessary, then standard permutation-based procedures
can be used to generate a Monte Carlo p value for
significance, though this can be computationally expensive.
Moreover, standard permutation does not work in the presence
of covariates, such as when controlling for population
stratification, and requires careful modifications.
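A Monte Carlo p value of the kind mentioned above can be sketched for the covariate-free case, where permuting phenotypes is valid. This is a minimal illustration, not the authors' procedure: it assumes an intercept-only null model and the weighted linear kernel, and the function names are hypothetical.

```python
import random


def q_statistic(genotypes, y, weights):
    """Q = sum_j w_j * S_j^2 with S_j = g_j'(y - mean(y)) (no covariates)."""
    n = len(y)
    mu0 = sum(y) / n
    resid = [yi - mu0 for yi in y]
    q = 0.0
    for j, w in enumerate(weights):
        s_j = sum(genotypes[i][j] * resid[i] for i in range(n))
        q += w * s_j * s_j
    return q


def permutation_p_value(genotypes, y, weights, n_perm, rng):
    """Monte Carlo p value: share of permuted Q values at least as large as
    the observed Q, with the usual +1 correction."""
    q_obs = q_statistic(genotypes, y, weights)
    y_perm = list(y)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(y_perm)                 # breaks any genotype-phenotype link
        if q_statistic(genotypes, y_perm, weights) >= q_obs:
            exceed += 1
    return (exceed + 1) / (n_perm + 1)


G = [[0, 1], [1, 0], [0, 0], [2, 1], [0, 0], [1, 1]]
y = [0.2, 1.1, -0.3, 2.4, 0.0, 1.5]
p_value = permutation_p_value(G, y, [1.0, 1.0], n_perm=200, rng=random.Random(4))
```

The smallest attainable p value is 1/(n_perm + 1), so reaching levels near $10^{-6}$ needs on the order of millions of permutations per region, which illustrates the computational cost noted in the text.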
Statistical Power of SKAT and Competing Methods
We compared the power of SKAT with three burden tests
in a series of simulation studies for both continuous traits
and dichotomous traits by generating sequence data
in randomly selected 30 kb regions with a coalescent
model.37 For our primary power simulation, within each
region, 5% of variants with population MAF < 3% were
randomly chosen as causal, the effect size of causal variants
was a decreasing function of MAF, and 50%–100% of the
causal variants were positively associated with the trait
(see Materials and Methods and Figure S2).

Table 1. Type I Error Estimates of SKAT Aimed at Testing an Association between Randomly Selected 30 kb Regions with a Continuous Trait at Type I Error Rates as Low as the Genome-wide $\alpha = 10^{-6}$ Level

| Total Sample Size (n) | Continuous, $\alpha = 10^{-4}$ | Continuous, $\alpha = 10^{-5}$ | Continuous, $\alpha = 10^{-6}$ | Dichotomous, $\alpha = 10^{-4}$ | Dichotomous, $\alpha = 10^{-5}$ | Dichotomous, $\alpha = 10^{-6}$ |
|---|---|---|---|---|---|---|
| 500 | $7.4 \times 10^{-5}$ | $6.5 \times 10^{-6}$ | $5.9 \times 10^{-7}$ | $2.2 \times 10^{-5}$ | $1.0 \times 10^{-6}$ | $1.0 \times 10^{-8}$ |
| 1000 | $8.5 \times 10^{-5}$ | $8.2 \times 10^{-6}$ | $8.0 \times 10^{-7}$ | $5.0 \times 10^{-5}$ | $3.5 \times 10^{-6}$ | $2.3 \times 10^{-7}$ |
| 2500 | $9.6 \times 10^{-5}$ | $9.1 \times 10^{-6}$ | $8.4 \times 10^{-7}$ | $7.6 \times 10^{-5}$ | $6.3 \times 10^{-6}$ | $5.6 \times 10^{-7}$ |
| 5000 | $9.8 \times 10^{-5}$ | $9.6 \times 10^{-6}$ | $8.8 \times 10^{-7}$ | $8.9 \times 10^{-5}$ | $7.8 \times 10^{-6}$ | $7.0 \times 10^{-7}$ |

Each entry represents type I error rate estimates as the proportion of p values < $\alpha$ under the null hypothesis based on $10^8$ simulated phenotypes.

The simulated
regions for our power analysis contained on average
605 variants (26 causal), of which 530.9 (88%), 502.9
(83%), and 422.8 (70%) had population MAF < 3%, < 1%,
and < 0.1%, respectively. The average allele frequency
spectrum across the samples is similar to that of the Dallas Heart
Study data (Figure S4). Because the majority of variants have
a low MAF, they might not be observed in any particular
sample. The average number of observed variants
(assuming no genotyping error) and the average number
of observed causal variants are presented in Table 2.
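Why so many variants go unobserved can be seen from a back-of-the-envelope calculation. Assuming the 2n sampled chromosomes are independent draws (an idealization that ignores LD and case-control ascertainment), a variant with a given population MAF appears in the sample with probability $1 - (1 - \mathrm{MAF})^{2n}$. The helper names below are hypothetical.

```python
def prob_variant_observed(maf, n):
    """Probability that at least one of 2n sampled chromosomes carries the minor allele."""
    return 1.0 - (1.0 - maf) ** (2 * n)


def expected_observed(mafs, n):
    """Expected number of variants seen at least once in a sample of n subjects."""
    return sum(prob_variant_observed(m, n) for m in mafs)


# A variant with population MAF = 1e-4 is seen in about 1 in 10 samples of
# 500 subjects, and becomes much more likely to be seen with 5000 subjects.
p_500 = prob_variant_observed(1e-4, 500)
p_5000 = prob_variant_observed(1e-4, 5000)
```

This is consistent with the pattern in Table 2, where the average number of observed variants grows with the sample size.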
For continuous traits, SKAT had much higher power
than all the burden tests, and the weighted count method
tended to outperform the count and CAST methods
(Figure 1). SKAT’s power was robust to the proportion of
causal variants that were positively associated with the
trait, whereas the burden tests suffered substantial loss of
power when causal variants had the opposite effects. The
simulation results examining dichotomous traits were
qualitatively similar in that SKAT dominated the compet-
ing methods. However, here the power of the SKAT
decreased when both protective and harmful variants
were present, although less so than for the burden tests.
The difference in power for SKAT for different proportions
of protective variants is due to the fact that given fixed
population MAFs, protective variants imply negative log
ORs and lower disease risk and hence lower MAFs in cases
and more difficulties in observing rare variants in cases.
The larger decrease in power for the competing methods
is additionally driven by sensitivity to direction of effect
due to aggregation of genotypes. Across all configurations,
using imputed genotypes instead of the true genotype
for 10% missing genotype data led to a very small
reduction in power, despite the use of a very simple
Hardy-Weinberg-based imputation strategy. This is true
in part because most variants are rare.
Note that SKAT increases the weight of rare variants but
does not require thresholding. To show that the superior
performance of SKAT is intrinsic and is not driven by the
particular choice of the weight used, we calculated rSKAT,
which does not weight the rare variants but instead uses
the same threshold as the burden tests. Our results, pre-
sented in Figure 1, show that rSKAT is still substantially
more powerful than all three burden tests.
Power simulation results for other type I error rates ($\alpha$ = 0.01, 0.001), lower causal variant frequencies (population
MAF < 1%), and other region sizes (10 kb and 60 kb)
yielded the same conclusions (Figures S5–S8).
In the 30 kb genomic regions considered, reflecting anal-
ysis of genome-wide sequencing data, it is unlikely that
a large proportion of the rare variants are all causal.
However, for exome-scale sequencing, the number of
observed rare variants can be considerably smaller and
the proportion of causal rare variants can be greater.
Hence, we also conducted power simulations for smaller
region sizes (3 kb and 5 kb) and larger proportions of causal
variants (10%, 20%, and 50%). Results for both continuous
and dichotomous phenotypes are presented in Figures S9–
S12 and show that if 50% of the rare variants are causal and
all of the causal variants have effects in the same
direction, then SKAT and rSKAT are less powerful than
collapsing methods, with count-based collapsing having
the greatest power. This result held for both 3 kb and
5 kb regions and is expected because the collapsing
methods implicitly assume that all of the variants are
causal and have unidirectional effects. In all other settings
we considered, SKAT was the most powerful method.
Power and Sample-Size Estimation
To illustrate our power and sample-size calculation
method, in Figure 2 we present the estimated sample-size
curves as a function of maximum effect sizes (ORs for
dichotomous traits) necessary to detect a 30 kb region
with 5% of the variants with MAF < 3% being causal.
Table 3 presents estimated sample sizes for several configu-
rations of practical interest. Additional sample-size curves
when causal variants are rarer (MAF < 1%) or occur more
frequently (10% of variants are causal) or when prevalence
is varied (5%, 0.1%) can be found in Figures S13–S15.
These results show that, for a given region, one will
have more power (and a lower required sample size) to
detect rare causal variants if the percentage of variants
that are causal is higher, the causal rare variants have
higher MAFs and/or larger effect sizes (e.g., odds ratios
[ORs]), and the effects are more consistently in the same
direction. For case-control designs, lower prevalence
yields higher power: given the same OR and population
MAF, lower prevalence enriches the sample for
harmful variants (ORs > 1), that is, it gives them higher MAFs
across cases and controls combined, so for rarer diseases
harmful rare variants are more likely to be observed.
Conversely, if the prevalence is low, protective variants
(ORs < 1) have lower MAFs in the sample and are less likely
to be observed.
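The prevalence argument can be checked numerically with Bayes' rule. The sketch below is a simplified illustration: it works with carriers versus non-carriers of a single variant, uses the disease risk in non-carriers as a stand-in for prevalence (for a rare variant the two nearly coincide), and assumes equal numbers of cases and controls. All names are hypothetical.

```python
import math


def expit(x):
    return 1.0 / (1.0 + math.exp(-x))


def carrier_frequencies(pop_carrier_freq, odds_ratio, baseline_risk):
    """Carrier frequency in cases, in controls, and in a 50/50 case-control sample."""
    a0 = math.log(baseline_risk / (1.0 - baseline_risk))
    risk_noncarrier = expit(a0)
    risk_carrier = expit(a0 + math.log(odds_ratio))
    q = pop_carrier_freq
    p_disease = q * risk_carrier + (1 - q) * risk_noncarrier
    in_cases = q * risk_carrier / p_disease                   # P(carrier | case)
    in_controls = q * (1 - risk_carrier) / (1 - p_disease)    # P(carrier | control)
    return in_cases, in_controls, 0.5 * (in_cases + in_controls)


# Harmful variant (OR = 5): more enriched in the sample when the disease is rarer.
harmful_rare_disease = carrier_frequencies(0.001, 5.0, 0.01)[2]
harmful_common_disease = carrier_frequencies(0.001, 5.0, 0.10)[2]

# Protective variant (OR = 0.2): less frequent in the sample when the disease is rarer.
protective_rare_disease = carrier_frequencies(0.001, 0.2, 0.01)[2]
protective_common_disease = carrier_frequencies(0.001, 0.2, 0.10)[2]
```

With these inputs the pooled carrier frequency of the harmful variant is higher at 1% than at 10% baseline risk, and the reverse holds for the protective variant, matching the direction of the argument in the text.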
We also compared the power and sample-size formulae
estimates to the empirical, simulation-based power esti-
mates for both continuous and dichotomous traits. The
curves plotted in Figure 3 show that the empirical power
is accurately approximated by our analytical formula.
Table 2. Characteristics of the 30 kb Region Data Sets Used in the Simulation Studies

Average number of observed variants, by sample size (n):

| | n = 500 | n = 1000 | n = 2500 | n = 5000 |
|---|---|---|---|---|
| All traits* | 255 | 330 | 438 | 512 |
| Continuous trait** | 9.6 | 13.3 | 18.6 | 22.3 |
| Dichotomous trait (β +/− = 100/0)** | 14.4 | 18.7 | 23.5 | 25.2 |
| Dichotomous trait (β +/− = 80/20)** | 13.3 | 17.1 | 22.0 | 24.3 |
| Dichotomous trait (β +/− = 50/50)** | 11.1 | 14.9 | 19.7 | 22.6 |

The number of observed variants (*) and the number of observed causal variants (**) within the region are averaged over the 1000 simulated data sets.
Application to Dallas Heart Study Data
We analyzed sequence data on 93 variants in ANGPTL3
(MIM 604774), ANGPTL4 (MIM 605910), and ANGPTL5
(MIM 607666) in 3476 individuals from the Dallas Heart
Study38 to test for association between log-transformed
serum triglyceride (logTG) levels and rare variants in these
genes. We adjusted for sex and ethnicity (black, Hispanic,
or white) but did not adjust for age as a large number of
subjects have missing ages. In addition to testing for asso-
ciation via SKAT and the three burden tests considered
earlier, we also applied the permutation-based varying-
threshold method (VT) and the Polyphen-score-adjusted
VT (VTP),16 which are based on the residuals obtained
from regressing the phenotype on the covariates and
assume gene-covariate independence. Because VT and
VTP require permutation, they are computationally expen-
sive when applied genome wide. For VTP, we used the
Polyphen score for rare variants (MAF < 0.01) and assigned
a constant score of 0.5 to all other variants. We also
analyzed a dichotomized phenotype based on the highest and
lowest quartiles of each of the six sex-ethnicity groups
(Table 4).
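One way to construct such a dichotomized phenotype is to compute quartiles within each group and keep only the extremes. The sketch below is an assumption about the mechanics (the text does not specify the quantile definition or how ties are handled); it uses `statistics.quantiles` with its default method, and the function name is hypothetical.

```python
import statistics
from collections import defaultdict


def dichotomize_by_group(values, groups):
    """Label each subject 1 (top quartile of its group), 0 (bottom quartile),
    or None (middle half, excluded from the dichotomized analysis)."""
    by_group = defaultdict(list)
    for v, g in zip(values, groups):
        by_group[g].append(v)
    # Three cut points per group: lower quartile, median, upper quartile
    cuts = {g: statistics.quantiles(vals, n=4) for g, vals in by_group.items()}
    labels = []
    for v, g in zip(values, groups):
        lower, _, upper = cuts[g]
        if v >= upper:
            labels.append(1)
        elif v <= lower:
            labels.append(0)
        else:
            labels.append(None)
    return labels


values = [1, 2, 3, 4, 5, 6, 7, 8, 10, 20, 30, 40, 50, 60, 70, 80]
groups = ["a"] * 8 + ["b"] * 8
labels = dichotomize_by_group(values, groups)
```

Computing the cut points within each sex-ethnicity group keeps group membership from driving the case-control labels.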
Table 3. Required Total Sample Size to Achieve 80% Power to Detect Rare Variants Associated with a Continuous or Dichotomous Case-Control Phenotype at the Genome-wide Level $\alpha = 10^{-6}$

| Total Sample Size | Maximum β = 1.6 / Maximum OR = 5, 5% Causal | Maximum β = 1.6 / Maximum OR = 5, 10% Causal | Maximum β = 1.9 / Maximum OR = 7, 5% Causal | Maximum β = 1.9 / Maximum OR = 7, 10% Causal |
|---|---|---|---|---|
| Continuous trait | 5,990 | 1,800 | 4,260 | 1,290 |
| Dichotomous trait with prevalence 10% | 15,120 | 4,810 | 9,650 | 3,120 |
| Dichotomous trait with prevalence 1% | 12,030 | 3,870 | 7,010 | 2,290 |

Power was estimated via the analytical formulae assuming 5% or 10% of variants with MAF < 3% are causal. Regression coefficients for the s causal variants were assumed to be a decreasing function of MAF, $|\beta_j| = c\,|\log_{10} \mathrm{MAF}_j|$ ($j = 1, \ldots, s$), where 80% of $\beta_j$'s are positive and 20% are negative; see Figure S2. Required total sample sizes (cases and controls) are given for different "maximum" effect sizes (or ORs) when MAF $= 10^{-4}$ and different prevalences for case-control studies. Estimated sample sizes were averaged over 100 random 30 kb regions.
[Figure 2: required total sample size (y axis, 0 to 10,000) versus maximum effect size, with one curve per significance level ($\alpha = 10^{-6}$, $10^{-3}$, $10^{-2}$). Top row: Continuous Trait, x axis max β (1.4 to 2.2); bottom row: Dichotomous Trait, x axis max OR (5 to 11). Columns: β +/− = 100/0, 80/20, and 50/50.]

Figure 2. Sample Sizes Required for Reaching 80% Power
Analytically estimated sample sizes required for reaching 80% power to detect rare variants associated with a continuous (top panel) or dichotomous phenotype in case-control studies (half are cases) (bottom panel) at the $\alpha = 10^{-6}$, $10^{-3}$, and $10^{-2}$ levels, under the assumption that 5% of rare variants with MAF < 3% within the 30 kb regions are causal. Plots correspond to 100%, 80%, and 50% of the causal variants associated with increase in the continuous phenotype or risk of the dichotomous phenotype. The absolute values of the regression coefficients for the s causal variants were assumed to be the same decreasing function of MAF as that in Figure 1. Required total sample sizes are plotted against the maximum effect sizes (ORs) when MAF $= 10^{-4}$. Estimated total sample sizes were averaged over 100 random 30 kb regions.
SKAT yielded by far the smallest p value for the dichotomous
trait. For the continuous trait, SKAT had a much smaller
p value than two burden methods (CAST and WST) and
VT, and had a slightly higher p value than the
counting-based burden test (N) and VTP. Note that SKAT was easier
to apply because it did not require prior functional infor-
mation (available for only a subset of variants) or permuta-
tion, and it adjusted for covariates without assuming gene-
covariate independence.
Computation Time
The computation time for the SKAT depends on the
sample size and the number of markers. To analyze a 30 kb
region sequenced on 1000, 2500, or 5000 individuals,
SKAT required 0.21, 0.73, and 2.3 s, respectively, for
continuous traits and ~20% longer for dichotomous traits,
on a 2.33 GHz laptop with 6 GB of memory. Analyzing
300 kb, 3 Mb, or 3 Gb (the entire genome) on 1000
individuals requires 2.5 s, 25 s, and 7 hr, respectively.
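The genome-wide figure is consistent with run time growing roughly linearly in the amount of sequence analyzed, as the tenfold step from 300 kb to 3 Mb (2.5 s to 25 s) suggests. Under that assumption, a worked check gives:

```latex
\frac{3 \times 10^{9}\ \text{bp}}{3 \times 10^{5}\ \text{bp}} \times 2.5\ \text{s}
  = 10^{4} \times 2.5\ \text{s}
  = 2.5 \times 10^{4}\ \text{s}
  \approx 6.9\ \text{hr} \approx 7\ \text{hr}.
```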
Discussion
We propose SKAT as a supervised, flexible, and computationally
efficient statistical method that tests for association
between a continuous or dichotomous phenotype and rare
and common genetic variants in sequencing-based associa-
tion studies. We demonstrate that SKAT’s power is greater
than that of several burden tests over a range of genetic
models. Furthermore, we have developed analytical power
and sample-size calculations for SKAT that assist in
designing sequencing-based association studies.
[Figure 3: power (y axis, 0.0 to 1.0) versus total sample size (x axis, 2000 to 10,000), with theoretical and empirical curves. Left panel: Continuous Trait; right panel: Dichotomous Trait.]

Figure 3. Power Comparisons Based on Simulation and Analytic Estimation
Power as a function of total sample size estimated by simulation with 1000 replicates and by the proposed power formula for continuous and dichotomous case-control traits. Simulation configurations correspond to those used in Figure 1, in which 80% of the regression coefficients for the causal rare variants were positive.
Table 4. Analysis of the Dallas Heart Study Sequencing Data

| | SKAT | C | N | W | VT(a) | VTP(a) |
|---|---|---|---|---|---|---|
| Continuous TG level | $9.5 \times 10^{-5}$ | $1.9 \times 10^{-3}$ | $7.2 \times 10^{-5}$ | $2.3 \times 10^{-4}$ | $3.5 \times 10^{-4}$ | $2.0 \times 10^{-5}$ |
| Dichotomized TG level | $1.3 \times 10^{-4}$ | $3.2 \times 10^{-2}$ | $2.2 \times 10^{-3}$ | $3.1 \times 10^{-3}$ | $8.6 \times 10^{-3}$ | $2.1 \times 10^{-3}$ |

Analysis of the Dallas Heart Study sequencing data with SKAT, the weighted sum burden test (W), the counting-based burden test (N), the CAST method (C), the varying-threshold method (VT), and the Polyphen-score adjusted VT (VTP) method. Beta (1, 25) is used as the weight in the SKAT and the weighted sum test.
(a) p values are estimated on the basis of $10^6$ permutations.
Like burden tests, SKAT performs region-based testing. However,
SKAT has several major advantages over the existing tests. As
a supervised method, SKAT directly performs multiple
regressions of a phenotype on genotypes for all variants in
the region, adjusting for covariates. Hence, as with conven-
tional multiple regression models, neither directionality
nor magnitudes of the associations are assumed a priori
but are instead estimated from the data. To test efficiently
for the joint effects of rare variants in the region on the
phenotype, SKAT assumes a distribution for the regression
coefficients of the markers, whose variances depend on
flexible weights. SKAT performs a score-based variance-
component test, whose calculation only requires fitting
the null model by regressing phenotypes on covariates
alone and computing p values analytically. The flexible
regression framework also accommodates epistatic
effects.
Besides region-based analysis, SKAT can also be applied
to any biologically meaningful SNP set. As SKAT is a
regression-based method, it can be easily extended to survival,
longitudinal, and multivariate phenotypes and hence
provides a comprehensive framework for a wide variety
of sequencing-based association studies.
The ability to obtain a p value directly without the need
for permutation is an attractive feature of SKAT, and allows
for rapid estimation of p values in exome and genome-
wide sequencing studies. Our simulations showed that
for continuous phenotypes, the p values are accurate
when the sample size is moderate or large; for dichotomous
phenotypes, the p values are conservative at lower
α levels (e.g., < 10^-4) if the sample size is modest or
small. Permutation can be used to obtain a more accurate
estimate in the absence of covariates. In the presence of
covariates, for example population stratification, standard
90 The American Journal of Human Genetics 89, 82–93, July 15, 2011
permutations fail and require careful modifications.
Despite the conservative nature of the score test, SKAT
often still has higher power than competing methods at
small α levels.
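As a rough illustration of the permutation alternative mentioned above, the following sketch permutes phenotype labels and recomputes an unweighted linear-kernel statistic. It is only valid in the setting the text describes (no covariates). The function names, the toy data, and the choice of 999 permutations are assumptions made for this example, not part of the SKAT software.

```python
import random


def linear_kernel_stat(y, G):
    """Sum over variants of the squared score (y - ybar)' G_j.

    y is a list of phenotypes; G is a list of rows (one per subject)
    holding minor-allele counts for each variant.
    """
    n = len(y)
    ybar = sum(y) / n
    resid = [yi - ybar for yi in y]
    total = 0.0
    for j in range(len(G[0])):
        score = sum(resid[i] * G[i][j] for i in range(n))
        total += score * score
    return total


def permutation_p_value(y, G, n_perm=999, seed=0):
    """Permutation p value; assumes no covariates (labels are exchangeable)."""
    rng = random.Random(seed)
    observed = linear_kernel_stat(y, G)
    y_perm = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(y_perm)
        if linear_kernel_stat(y_perm, G) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_perm + 1)


# Hypothetical toy data: 10 cases, 10 controls, one variant carried by 8 cases.
y = [1] * 10 + [0] * 10
G = [[1] if i < 8 else [0] for i in range(20)]
p_perm = permutation_p_value(y, G)
```

With covariates such as ancestry components, simple label shuffling of this kind breaks the covariate-phenotype relationship, which is why the text cautions that standard permutation fails in that case.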
SKAT can be combined with collapsing strategies to form
a hybrid testing approach. If most of the variants within
a range of allele frequencies are causal and have the same
directionality (i.e., under settings that are optimal for
burden-based tests), collapsing these variants and then
applying SKAT to the collapsed variants can improve
power. For example, because singletons are common in
sequencing studies (57 of 93 variants in the Dallas Heart
Study data), a possible hybrid strategy is to first collapse
all of the singletons into a single value and then apply
SKAT to the collapsed value and the other 36 variants.
Compared to the original SKAT, this strategy gives a slightly
lower p value, 3.1 × 10^-5, for the continuous trait and
a slightly higher p value, 1.6 × 10^-4, for the dichotomous
trait. Simulation studies showed that the two methods are
of similar power under the settings we used to generate
Figure 1.
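The collapsing step of the hybrid strategy can be sketched as follows. This is a minimal illustration that assumes genotypes are coded as minor-allele counts and that a singleton is a variant whose minor allele is seen exactly once in the sample; the function name is hypothetical. Applied to the Dallas Heart Study counts quoted above, such a step would turn 93 variant columns into 1 collapsed column plus the other 36.

```python
def collapse_singletons(G):
    """Collapse all singleton variants into one burden column.

    G is a list of rows (subjects), each a list of minor-allele counts.
    Returns (collapsed_matrix, singleton_column_indices). The first column
    of the result is the per-subject count of singleton alleles; the
    remaining columns are the non-singleton variants in original order.
    """
    n_subjects = len(G)
    n_variants = len(G[0])
    allele_counts = [sum(G[i][j] for i in range(n_subjects))
                     for j in range(n_variants)]
    singleton_cols = [j for j in range(n_variants) if allele_counts[j] == 1]
    other_cols = [j for j in range(n_variants) if allele_counts[j] != 1]
    collapsed = []
    for row in G:
        burden = sum(row[j] for j in singleton_cols)
        collapsed.append([burden] + [row[j] for j in other_cols])
    return collapsed, singleton_cols


# Hypothetical toy matrix: variants 0 and 1 are singletons, variant 2 is not.
G_toy = [[1, 0, 1],
         [0, 1, 1],
         [0, 0, 0]]
G_collapsed, singleton_idx = collapse_singletons(G_toy)
```

The collapsed matrix can then be passed to whatever region-based test is in use in place of the original genotype matrix.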
An important feature of SKAT is that it allows for incor-
poration of flexible weight functions to boost analysis
power, for example by increasing the weight of variants
with lower MAFs and decreasing the weight of information
from variants inferred with lower confidence. Good
choices of weights are likely to improve the power of the
association test with SKAT, although simulations show
that even equal weights can yield high power when
combined with thresholding. In our simulation studies,
we employed a class of flexible continuous weights as
a function of MAF by using the beta function, which
increases the weight of rare variants and does not require
thresholding. Users can define other types of weight func-
tions. To further improve analysis power, one can estimate
weights by incorporating information besides MAF, for
example by using the Polyphen score or integrating other
annotation information, which will become increasingly
available as our understanding of genome variation
improves. Therefore, because of its flexibility, SKAT has
the capacity to mature, and its power to increase, as the
field progresses.
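To make the MAF-based weighting concrete, the sketch below evaluates the Beta(1, 25) density named in the Table 4 legend at a few minor allele frequencies. Whether this density is used directly as the weight or as its square root is a convention this example does not settle; the function name and the example MAF values are assumptions for illustration.

```python
import math


def beta_density(x, a, b):
    """Density of the Beta(a, b) distribution at x, for 0 < x < 1."""
    norm = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    return norm * x ** (a - 1) * (1 - x) ** (b - 1)


# Hypothetical minor allele frequencies, from very rare to common.
mafs = [0.001, 0.01, 0.05, 0.2]
maf_weights = [beta_density(m, 1, 25) for m in mafs]

# With a = 1 and b = 25 the density reduces to 25 * (1 - MAF)^24, so rarer
# variants receive larger weights without any hard MAF threshold.
```

Other parameter choices change how sharply the weight falls off with MAF; for instance, Beta(1, 1) gives every variant the same weight.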
Appendix A
Estimating the Null Distribution for Q
Under the null hypothesis, Q follows a mixture of chi-square
distributions.29,30 More specifically, we define
$$P_0 = V - V\tilde{X}(\tilde{X}'V\tilde{X})^{-1}\tilde{X}'V,$$
where $\tilde{X}$ is the n × (p + 1) matrix equal to [1, X].
For continuous phenotypes, $V = \hat{\sigma}_0^2 I$,
where $\hat{\sigma}_0$ is the estimator of σ under the null model where
f(G) = 0, and I is an n × n identity matrix. For dichotomous
phenotypes, $V = \mathrm{diag}(\hat{\mu}_{01}(1-\hat{\mu}_{01}), \hat{\mu}_{02}(1-\hat{\mu}_{02}), \ldots, \hat{\mu}_{0n}(1-\hat{\mu}_{0n}))$,
where $\hat{\mu}_{0i} = \mathrm{logit}^{-1}(\hat{\alpha}_0 + \hat{\alpha}'X_i)$ is the estimated
probability that the i-th subject is a case under the
null model. Then under the null model
$$Q \sim \sum_{i=1}^{n} \lambda_i \chi^2_{1,i}, \quad \text{(Equation 6)}$$
where $(\lambda_1, \lambda_2, \ldots, \lambda_n)$ are the eigenvalues of $P_0^{1/2} K P_0^{1/2}$, and
$\chi^2_{1,i}$ are independent $\chi^2_1$ random variables.
Several approximation and exact methods have been
suggested to obtain the distribution of Q.39 Among these,
the Davies exact method,26 based on inverting the charac-
teristic function of Equation 6, appears to work well in
practice and is used here.
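The tail probability of the mixture in Equation 6 can be illustrated with a plain Monte Carlo draw. This is deliberately not the Davies method used in the paper; it is a simple stand-in that only shows what quantity is being computed, and it is far too slow for the very small p values of interest in practice. The function names, the number of draws, and the seed are assumptions for this sketch.

```python
import math
import random


def mixture_chisq_tail_mc(q, lambdas, n_draws=20000, seed=1):
    """Monte Carlo estimate of P(sum_i lambda_i * Z_i^2 > q), Z_i ~ N(0, 1)."""
    rng = random.Random(seed)
    exceed = 0
    for _ in range(n_draws):
        draw = sum(lam * rng.gauss(0.0, 1.0) ** 2 for lam in lambdas)
        if draw > q:
            exceed += 1
    return exceed / n_draws


def single_chisq1_tail(q, lam):
    """Exact P(lam * chi^2_1 > q), available when there is one eigenvalue."""
    return math.erfc(math.sqrt(q / (2.0 * lam)))


# Sanity check on the one-eigenvalue case, where the exact answer is known.
mc_estimate = mixture_chisq_tail_mc(3.84, [1.0])
exact_value = single_chisq1_tail(3.84, 1.0)
```

An exact inversion method is preferable in real analyses because Monte Carlo error dominates once the target p value drops below roughly one over the number of draws.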
SKAT Is a Generalization of the C-Alpha Test
The recently proposed C-alpha test has advantages
over burden tests in that it explicitly models the possibility
that minor alleles can be deleterious or protective.
However, it does not currently allow for the analysis of
quantitative outcomes or the inclusion of covariates and
p value calculation requires permutation. We demonstrate
that for a dichotomous trait in the absence of covariates,
the C-alpha test statistic is equivalent to the SKAT statistic
with unweighted linear kernel, which is the same as the
kernel machine test in Wu et al.24
Suppose the j-th variant is observed $d_j$ times in the cases,
out of $n_j$ times total in cases and controls, and that
$p_0 = \sum_{i=1}^{n} y_i / n$. For a dichotomous trait and no covariates,
the C-alpha test statistic is
$$T_\alpha = \sum_{j=1}^{p}\left[(d_j - n_j p_0)^2 - n_j p_0 (1 - p_0)\right]. \quad \text{(Equation 7)}$$
Denote $T_{1\alpha} = \sum_{j=1}^{p}(d_j - n_j p_0)^2$. Because $\sum_{j=1}^{p} n_j p_0 (1 - p_0)$
is the mean of $T_{1\alpha}$ under the null hypothesis of no association,
$T_{1\alpha}$ is the C-alpha test statistic without mean centering.
Because $d_j = y'G_{\cdot j}$ and $n_j = J'G_{\cdot j}$, where $G_{\cdot j}$ is the j-th
column of the genotype matrix G and $J = (1, 1, \ldots, 1)'$, it
can be easily shown that
$$T_{1\alpha} = (y - p_0 J)'\, G G'\, (y - p_0 J). \quad \text{(Equation 8)}$$
Note that under the unweighted linear kernel, $K = GG'$
and $\hat{\mu}_0 = p_0 J$ if no covariates are present. Hence, Equation
8 is identical to Equation 3; that is, $T_{1\alpha}$ is equivalent to the
SKAT test statistic with unweighted linear kernel.
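The equivalence between the uncentered C-alpha sum and the quadratic form in Equation 8 can be checked numerically on a small example. The sketch below computes both sides independently, the first from the per-variant counts and the second from an explicitly built kernel matrix GG'. The function names and toy genotype matrix are assumptions for illustration only.

```python
def calpha_uncentered(y, G):
    """Sum over variants of (d_j - n_j * p0)^2, from per-variant counts."""
    n = len(y)
    p0 = sum(y) / n
    total = 0.0
    for j in range(len(G[0])):
        d_j = sum(y[i] * G[i][j] for i in range(n))   # copies seen in cases
        n_j = sum(G[i][j] for i in range(n))          # copies seen overall
        total += (d_j - n_j * p0) ** 2
    return total


def linear_kernel_quadratic_form(y, G):
    """(y - p0 J)' K (y - p0 J) with the unweighted linear kernel K = G G'."""
    n = len(y)
    p0 = sum(y) / n
    resid = [yi - p0 for yi in y]
    K = [[sum(G[i][j] * G[k][j] for j in range(len(G[0])))
          for k in range(n)] for i in range(n)]
    return sum(resid[i] * K[i][k] * resid[k]
               for i in range(n) for k in range(n))


# Hypothetical toy data: 3 cases, 5 controls, 3 rare variants.
y_toy = [1, 1, 1, 0, 0, 0, 0, 0]
G_toy = [[1, 0, 0], [1, 1, 0], [0, 0, 2], [0, 0, 0],
         [0, 1, 0], [0, 0, 0], [1, 0, 0], [0, 0, 1]]
t1_direct = calpha_uncentered(y_toy, G_toy)
t1_quadratic = linear_kernel_quadratic_form(y_toy, G_toy)
```

The two quantities agree because the j-th entry of G'(y - p0 J) is exactly d_j - n_j p0, so the quadratic form is the sum of its squared entries.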
Although the SKAT statistic with unweighted linear
kernel and the C-alpha test statistic are equivalent, SKAT
and the C-alpha test use different null distributions to assess
significance: the C-alpha test uses a normal approximation,
whereas we use a mixture of chi-squares. The normal
approximation gives a valid p value when the tested rare
variants are independent and sample sizes are large, and
so requires an assumption of linkage equilibrium. In the
presence of LD, permutation is used by the C-alpha test
for significance testing. One can easily see that the test
statistic takes a quadratic form of y, which follows a mixture
of chi-square distributions. SKAT approximates this distri-
bution directly with the Davies method and hence gives
accurate estimation of significance regardless of the LD
structure when sample size is sufficient.
Supplemental Data
Supplemental Data include 15 figures and can be found with this
article online at http://www.cell.com/AJHG/.
Acknowledgments
This work was supported by grants P30 ES010126 (to M.C.W.),
DMS 0854970 and R01 GM079330 (to T.C.), R01 HG000376 (to
M.B.), and R37 CA076404 and P01 CA134294 (to S.L. and X.L.).
We thank Jonathan Cohen, Alkes Price, and Shamil Sunyaev for
providing the Dallas Heart Study data and Larisa Miropolsky for
help with the software development.
Received: March 16, 2011
Revised: May 27, 2011
Accepted: May 30, 2011
Published online: July 7, 2011
Web Resources
The URLs for data presented herein are as follows:
1000 Genomes Project, http://www.1000genomes.org/
Online Mendelian Inheritance in Man (OMIM), http://www.omim.org
SKAT software, http://www.hsph.harvard.edu/~xlin/software.html
References
1. Hindorff, L.A., Sethupathy, P., Junkins, H.A., Ramos, E.M.,
Mehta, J.P., Collins, F.S., and Manolio, T.A. (2009). Potential
etiologic and functional implications of genome-wide associa-
tion loci for human diseases and traits. Proc. Natl. Acad. Sci.
USA 106, 9362–9367.
2. Margulies, M., Egholm, M., Altman, W.E., Attiya, S., Bader, J.S.,
Bemben, L.A., Berka, J., Braverman, M.S., Chen, Y.J., Chen, Z.,
et al. (2005). Genome sequencing in microfabricated high-
density picolitre reactors. Nature 437, 376–380.
3. Mardis, E.R. (2008). Next-generation DNA sequencing
methods. Annu. Rev. Genomics Hum. Genet. 9, 387–402.
4. Ansorge, W.J. (2009). Next-generation DNA sequencing tech-
niques. New Biotechnol. 25, 195–203.
5. Eichler, E.E., Flint, J., Gibson, G., Kong, A., Leal, S.M.,
Moore, J.H., and Nadeau, J.H. (2010). Missing heritability
and strategies for finding the underlying causes of complex
disease. Nat. Rev. Genet. 11, 446–450.
6. Ley, T.J., Mardis, E.R., Ding, L., Fulton, B., McLellan, M.D.,
Chen, K., Dooling, D., Dunford-Shore, B.H., McGrath, S.,
Hickenbotham, M., et al. (2008). DNA sequencing of a cytoge-
netically normal acute myeloid leukaemia genome. Nature
456, 66–72.
7. Li, H., Ruan, J., and Durbin, R. (2008). Mapping short DNA
sequencing reads and calling variants using mapping quality
scores. Genome Res. 18, 1851–1858.
8. Li, R.Q., Li, Y.R., Fang, X.D., Yang, H.M., Wang, J., Kristiansen, K.,
and Wang, J. (2009). SNP detection for massively parallel
whole-genome resequencing. Genome Res. 19, 1124–1132.
9. Bansal, V., Harismendy, O., Tewhey, R., Murray, S.S., Schork,
N.J., Topol, E.J., and Frazer, K.A. (2010). Accurate detection
and genotyping of SNPs utilizing population sequencing
data. Genome Res. 20, 537–545.
10. Carvajal-Carmona, L.G. (2010). Challenges in the identifica-
tion and use of rare disease-associated predisposition variants.
Curr. Opin. Genet. Dev. 20, 277–281.
11. Schork, N.J., Murray, S.S., Frazer, K.A., and Topol, E.J. (2009).
Common vs. rare allele hypotheses for complex diseases.
Curr. Opin. Genet. Dev. 19, 212–219.
12. Li, B., and Leal, S.M. (2008). Methods for detecting associa-
tions with rare variants for common diseases: application
to analysis of sequence data. Am. J. Hum. Genet. 83, 311–321.
13. Madsen, B.E., and Browning, S.R. (2009). A groupwise associ-
ation test for rare mutations using a weighted sum statistic.
PLoS Genet. 5, e1000384.
14. Morgenthaler, S., and Thilly, W.G. (2007). A strategy to
discover genes that carry multi-allelic or mono-allelic risk for
common diseases: a cohort allelic sums test (CAST). Mutat.
Res. 615, 28–56.
15. Li, B., and Leal, S.M. (2009). Discovery of rare variants via
sequencing: implications for the design of complex trait asso-
ciation studies. PLoS Genet. 5, e1000481.
16. Price, A.L., Kryukov, G.V., de Bakker, P.I., Purcell, S.M., Staples,
J., Wei, L.J., and Sunyaev, S.R. (2010). Pooled association tests
for rare variants in exon-resequencing studies. Am. J. Hum.
Genet. 86, 832–838.
17. Han, F., and Pan, W. (2010). A data-adaptive sum test for
disease association with multiple common or rare variants.
Hum. Hered. 70, 42–54.
18. Morris, A.P., and Zeggini, E. (2010). An evaluation of statistical
approaches to rare variant analysis in genetic association
studies. Genet. Epidemiol. 34, 188–193.
19. Zawistowski, M., Gopalakrishnan, S., Ding, J., Li, Y., Grimm, S.,
and Zollner, S. (2010). Extending rare-variant testing strategies:
analysis of noncoding sequence and imputed genotypes. Am. J.
Hum. Genet. 87, 604–617.
20. Asimit, J., and Zeggini, E. (2010). Rare variant association analysis
methods for complex traits. Annu. Rev. Genet. 44, 293–308.
21. Neale, B.M., Rivas, M.A., Voight, B.F., Altshuler, D., Devlin, B.,
Orho-Melander, M., Kathiresan, S., Purcell, S.M., Roeder, K.,
and Daly, M.J. (2011). Testing for an unusual distribution of
rare variants. PLoS Genet. 7, e1001322.
22. Price, A.L., Patterson, N.J., Plenge, R.M., Weinblatt, M.E.,
Shadick, N.A., and Reich, D. (2006). Principal components
analysis corrects for stratification in genome-wide association
studies. Nat. Genet. 38, 904–909.
23. Kwee, L.C., Liu, D., Lin, X., Ghosh, D., and Epstein, M.P.
(2008). A powerful and flexible multilocus association test
for quantitative traits. Am. J. Hum. Genet. 82, 386–397.
24. Wu, M.C., Kraft, P., Epstein, M.P., Taylor, D.M., Chanock, S.J.,
Hunter, D.J., and Lin, X. (2010). Powerful SNP-set analysis for
case-control genome-wide association studies. Am. J. Hum.
Genet. 86, 929–942.
25. Lin, X. (1997). Variance component testing in generalised
linear models with random effects. Biometrika 84, 309–326.
26. Davies, R. (1980). The distribution of a linear combination of
chi-square random variables. J. R. Stat. Soc. Ser. C Appl. Stat.
29, 323–333.
27. Pan, W. (2009). Asymptotic tests of association with multiple
SNPs in linkage disequilibrium. Genet. Epidemiol. 33, 497–507.
28. Cristianini, N., and Shawe-Taylor, J. (2000). An Introduction
to Support Vector Machines and Other Kernel-Based Learning
Methods (Cambridge: Cambridge Univ Press).
29. Liu, D., Lin, X., and Ghosh, D. (2007). Semiparametric regres-
sion of multidimensional genetic pathway data: least-squares
kernel machines and linear mixed models. Biometrics 63,
1079–1088.
30. Liu, D., Ghosh, D., and Lin, X. (2008). Estimation and testing
for the effect of a genetic pathway on a disease outcome using
logistic kernel machine regression via logistic mixed models.
BMC Bioinformatics 9, 292.
31. Fleuret, F., and Sahbi, H. (2003). Scale-invariance of support
vector machines based on the triangular kernel. In 3rd International
Workshop on Statistical and Computational Theories
of Vision. (ftp://ftp.inria.fr/INRIA/publication/publi-pdf/RR/
RR-4601.pdf).
32. Ramensky, V., Bork, P., and Sunyaev, S. (2002). Human non-
synonymous SNPs: server and survey. Nucleic Acids Res. 30,
3894–3900.
33. Kumar, P., Henikoff, S., and Ng, P.C. (2009). Predicting the
effects of coding non-synonymous variants on protein func-
tion using the SIFT algorithm. Nat. Protoc. 4, 1073–1081.
34. Liu, H., Tang, Y., and Zhang, H. (2009). A new chi-square
approximation to the distribution of non-negative definite
quadratic forms in non-central normal variables. Comput.
Stat. Data Anal. 53, 853–856.
35. Lee, S., Wu, M.C., Cai, T., Li, Y., Boehnke, M., and Lin, X.
(2011). Power and sample size calculations for designing rare
variant sequencing association studies. In Harvard University
Technical Report. (http://www.hsph.harvard.edu/~xlin).
36. Durbin, R.M., Abecasis, G.R., Altshuler, D.L., Auton, A.,
Brooks, L.D., Gibbs, R.A., Hurles, M.E., and McVean, G.A.;
1000 Genomes Project Consortium. (2010). A map of human
genome variation from population-scale sequencing. Nature
467, 1061–1073.
37. Schaffner, S.F., Foo, C., Gabriel, S., Reich, D., Daly, M.J., and
Altshuler, D. (2005). Calibrating a coalescent simulation of
human genome sequence variation. Genome Res. 15, 1576–
1583.
38. Romeo, S., Yin, W., Kozlitina, J., Pennacchio, L.A., Boerwinkle,
E., Hobbs, H.H., and Cohen, J.C. (2009). Rare loss-of-function
mutations in ANGPTL family members contribute to plasma
triglyceride levels in humans. J. Clin. Invest. 119, 70–79.
39. Duchesne, P., and Lafaye De Micheaux, P. (2010). Computing
the distribution of quadratic forms: Further comparisons
between the Liu-Tang-Zhang approximation and exact
methods. Comput. Stat. Data Anal. 54, 858–862.
REPORT
Expansion of Intronic GGCCTG Hexanucleotide Repeat in NOP56 Causes SCA36, a Type of Spinocerebellar Ataxia Accompanied by Motor Neuron Involvement
Hatasu Kobayashi,1,4 Koji Abe,2,4 Tohru Matsuura,2,4 Yoshio Ikeda,2 Toshiaki Hitomi,1 Yuji Akechi,2
Toshiyuki Habu,3 Wanyang Liu,1 Hiroko Okuda,1 and Akio Koizumi1,*
Autosomal-dominant spinocerebellar ataxias (SCAs) are a heterogeneous group of neurodegenerative disorders. In this study, we per-
formed genetic analysis of a unique form of SCA (SCA36) that is accompanied by motor neuron involvement. Genome-wide linkage
analysis and subsequent fine mapping for three unrelated Japanese families in a cohort of SCA cases, in whom molecular diagnosis
had never been performed, mapped the disease locus to the region of a 1.8 Mb stretch (LOD score of 4.60) on 20p13 (D20S906–
D20S193) harboring 37 genes with definitive open reading frames. We sequenced 33 of these and observed a large expansion of an
intronic GGCCTG hexanucleotide repeat in NOP56 and an unregistered missense variant (Phe265Leu) in C20orf194, but we found
no mutations in PDYN and TGM6. The expansion showed complete segregation with the SCA phenotype in family studies, whereas
Phe265Leu in C20orf194 did not. Screening of the expansions in the SCA cohort cases revealed four additional occurrences, but
none were revealed in the cohort of 27 Alzheimer disease cases, 154 amyotrophic lateral sclerosis cases, or 300 controls. In total,
nine unrelated cases were found in 251 cohort SCA patients (3.6%). A founder haplotype was confirmed in these cases. RNA foci formation
was detected in lymphoblastoid cells from affected subjects by fluorescence in situ hybridization. Double staining and gel-shift assay
showed that (GGCCUG)n binds the RNA-binding protein SRSF2 but that (CUG)6 does not. In addition, transcription of MIR1292,
a neighboring miRNA, was significantly decreased in lymphoblastoid cells of SCA patients. Our finding suggests that SCA36 is caused
by hexanucleotide repeat expansions through RNA gain of function.
Autosomal-dominant spinocerebellar ataxias (SCAs) are
a heterogeneous group of neurodegenerative disorders
characterized by loss of balance, progressive gait, and
limb ataxia.1–3 We recently encountered two unrelated
patients with intriguing clinical symptoms from a commu-
nity in the Chugoku region in western mainland Japan.4
These patients both showed complicated clinical features,
with ataxia as the first symptom, followed by characteristic
late-onset involvement of the motor neuron system that
caused symptoms similar to those of amyotrophic lateral
sclerosis (ALS [MIM 105400]).4 Some SCAs (SCA1 [MIM
164400], SCA2 [MIM 183090], SCA3 [MIM 607047], and
SCA6 [MIM 183086]) are known to slightly affect motor
neurons; however, their involvement is minimal and the
patients usually do not develop skeletal muscle and tongue
atrophies.4 Of particular interest is that RNA foci have been
recently demonstrated in hereditary disorders caused by
microsatellite repeat expansions or insertions in the non-
coding regions of their gene.5–7 The unique clinical features
in these families have seldom been described in previous
reports; therefore, we undertook a genetic analysis.
A similar form of SCA was observed in five Japanese cases
from a cohort of 251 patients with SCA, in whom molec-
ular diagnosis had not been performed, who were followed
by the Department of Neurology, Okayama University
Hospital. These five cases originated from a city of
450,000 people in the Chugoku region. Thus, we suspected
the presence of a founder mutation common to these five
cases, prompting us to recruit these five families (pedigrees
1–5) (Figure 1, Table 1). This study was approved by the
Ethics Committee of Kyoto University and the Okayama
University institutional review board. Written informed
consent was obtained from all subjects. An index case
per family was investigated in some depth: IV-4 in pedigree
1, II-1 in pedigree 2, III-1 in pedigree 3, II-1 in pedigree 4,
and II-1 in pedigree 5. The mean age at onset of cerebellar
ataxia was 52.8 ± 4.3 years, and the disease was transmitted
by an autosomal-dominant mode of inheritance.
All affected individuals started their ataxic symptoms,
such as gait and truncal instability, ataxic dysarthria, and
uncoordinated limbs, in their late forties to fifties. MRI
revealed relatively confined and mild cerebellar atrophy
(Figure 2A). Unlike individuals with previously known
SCAs, all affected individuals with longer disease duration
showed obvious signs of motor neuron involvement
(Table 1). Characteristically, all affected individuals ex-
hibited tongue atrophy with fasciculation, although its
degree of severity varied (Figure 2B). Despite severe tongue
atrophy in some cases, their swallowing function was rela-
tively preserved, and they were allowed oral intake even at
a later point after onset. In addition to tongue atrophy,
skeletal muscle atrophy and fasciculation in the limbs
and trunk appeared in advanced cases.4 Tendon reflexes
were generally mildly to severely hyperreactive in most
1Department of Health and Environmental Sciences, Graduate School of Medicine, Kyoto University, Kyoto, Japan; 2Department of Neurology, Graduate School of Medicine, Dentistry and Pharmaceutical Science, Okayama University, Okayama, Japan; 3Radiation Biology Center, Kyoto University, Kyoto, Japan
4These authors contributed equally to this work
*Correspondence: [email protected]
DOI 10.1016/j.ajhg.2011.05.015. ©2011 by The American Society of Human Genetics. All rights reserved.
The American Journal of Human Genetics 89, 121–130, July 15, 2011 121
Figure 1. Pedigree Charts of the Five SCA Families
Haplotypes are shown for nine markers from D20S906 (1,505,576 bp) to D20S193 (3,313,494 bp), spanning 1.8 Mb on chromosome 20p13. NOP56 is located at 2,633,254–2,639,039 bp (NCBI build 37.1). Filled and unfilled symbols indicate affected and unaffected individuals, respectively. Squares and circles represent males and females, respectively. A slash indicates a deceased individual. The putative founder haplotypes among patients are shown in boxes constructed by GENEHUNTER.8 Arrows indicate the index case. The pedigrees were slightly modified for privacy protection.
affected individuals, none of whom displayed severe lower
limb spasticity or extensor plantar response. Electrophysi-
ological studies were performed in an affected individual.
Nerve conduction studies revealed normal findings in all
of the cases that were examined; however, an electromyo-
gram showed neurogenic changes only in cases with
skeletal muscle atrophy, indicating that lower motor
neuropathy existed in this particular disease. Progression
of motor neuron involvement in this SCA was typically
limited to the tongue and main proximal skeletal muscles
in both upper and lower extremities, which is clearly
different from typical ALS, which usually involves most
skeletal muscles over the course of a few years, leading to
fatal results within several years.
We conducted genome-wide linkage analysis for nine
affected subjects and eight unaffected subjects in three
informative families (pedigrees 1–3; Figure 1). For genotyp-
ing, we used an ABI Prism Linkage Mapping Set (Version 2;
Applied Biosystems, Foster City, CA, USA) with 382
markers, 10 cM apart, for 22 autosomes. Fine-mapping
markers (approximately 1 cM apart) were designed accord-
ing to information from the uniSTS reference physical map
in the NCBI database. A parametric linkage analysis was
carried out in GENEHUNTER8 with the assumption of an
autosomal-dominant model. The disease allele frequency
was set at 0.000001, and a phenocopy frequency of
0.000001 was assumed. Population allele frequencies
were assigned equal portions of individual alleles. We per-
formed multipoint analyses for autosomes and obtained
LOD scores. We considered LOD scores above 3.0 to be
significant.8 Genome-wide linkage analysis revealed
a single locus on chromosome 20p13 with a LOD score
of 3.20. Fine mapping increased the LOD score to 4.60
(Figure 3). Haplotype analysis revealed two recombination
events in pedigree 3, delimiting a 1.8 Mb region (D20S906–
D20S193) (Figure 1). We further tested whether the five
cases shared the haplotype. As shown in Figure 1, pedigrees
4 and 5 were confirmed to have the same haplotype as
pedigrees 1, 2, and 3, indicating that the 1.8 Mb region is
very likely to be derived from a common ancestor.
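For readers unfamiliar with LOD scores, the sketch below computes a simplified two-point LOD for phase-known meioses. This is an assumption-laden teaching example, not the multipoint parametric analysis run in GENEHUNTER for this study: it ignores penetrance, allele frequencies, and phase uncertainty, and the function name is hypothetical.

```python
import math


def two_point_lod(n_meioses, n_recombinants, theta):
    """log10 of L(theta) / L(0.5) for phase-known, fully informative meioses."""
    r = n_recombinants
    n = n_meioses
    likelihood_theta = theta ** r * (1.0 - theta) ** (n - r)
    likelihood_null = 0.5 ** n
    if likelihood_theta == 0.0:
        # Observed recombinants are impossible at theta = 0 (or non-recombinants at 1).
        return float("-inf")
    return math.log10(likelihood_theta / likelihood_null)


# Each non-recombinant informative meiosis contributes log10(2), about 0.301,
# at theta = 0, so ten of them are just enough to pass the 3.0 threshold.
lod_ten_meioses = two_point_lod(10, 0, 0.0)
lod_no_information = two_point_lod(10, 5, 0.5)
```

This back-of-the-envelope view helps explain why adding fine-mapping markers and more informative meioses can raise a LOD score, as happened here between the genome-wide scan and fine mapping.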
The 1.8 Mb region harbors 44 genes (NCBI, build 37.1). We
eliminated two pseudogenes and five genes (LOC441938,
LOC100289473, LOC100288797, LOC100289507, and
LOC100289538) from the candidates. Evidence view
showed that the first, fourth, and fifth genes were not found
in the contig in this region, whereas the second and third
Table 1. Clinical Characteristics of Affected Subjects
Motor neuron involvement comprises the three columns Skeletal Muscle Atrophy, Skeletal Muscle Fasciculation, and Tongue Atrophy/Fasciculation.
Pedigree No. | Patient ID | Gender | Onset Age (yr) | Current Age (yr) | Ataxia | Skeletal Muscle Atrophy | Skeletal Muscle Fasciculation | Tongue Atrophy/Fasciculation | Genotype of GGCCTG Repeats
1 | III-5 | M | 50 | 70 | +++ | N.D. | N.D. | N.D. | g.263397_263402[6]+(1800)
1 | III-6 | F | 52 | 68 | ++ | + | + | + | g.263397_263402[6]+(2300)
1 | IV-2 | F | 57 | 63 | + | - | - | + | g.263397_263402[6]+(2300)
1 | IV-4 | M | 50 | 59 | + | - | - | + | g.263397_263402[6]+(2300)
2 | II-1 | M | 55 | 77 | +++ | ++ | + | + | g.263397_263402[6]+(2200)
2 | II-2 | F | 53 | 70 | ++ | N.D. | N.D. | N.D. | g.263397_263402[6]+(2200)
3 | II-3 | M | 58 | 77 | ++ | ++ | + | + | g.263397_263402[3]+(2300)
3 | III-1 | M | 56 | 62 | + | - | - | ± | g.263397_263402[8]+(2200)
3 | III-2 | M | 51 | 61 | ++ | + | + | + | g.263397_263402[6]+(1800)
4 | I-1 | M | 57 | died in 2001 at 83 | ++ | N.D. | N.D. | N.D. | g.263397_263402[5]+(1800)
4 | II-1 | F | 48 | 61 | ++ | + | ± | ++ | g.263397_263402[6]+(2000)
5 | I-1 | M | 57 | 86 | ++ | +++ | + | + | g.263397_263402[5]+(2000)
5 | II-1 | F | 47 | 58 | ++ | + | + | + | g.263397_263402[8]+(1700)
 | SCA#1 | M | 52 | 69 | +++ | +++ | +++ | +++ | g.263397_263402[5]+(2200)
 | SCA#2 | F | 43 | 53 | +++ | - | - | + | g.263397_263402[6]+(1800)
 | SCA#3 | M | 55 | 60 | ++ | - | - | ++ | g.263397_263402[8]+(1700)
 | SCA#4 | M | 57 | 81 | +++ | + | + | +++ | g.263397_263402[5]+(2200)
Mean onset age: 52.8 yr
SD of onset age: 4.3 yr
N.D., not determined.
genes are not assigned to orthologous loci in the mouse
genome. Sequence similarities among paralog genes defied
direct sequencing of four genes: SIRPD [NM_178460.2],
SIRPB1 [MIM 603889], SIRPG [MIM 605466], and SIRPA
[MIM 602461]. Thus, we sequenced 33 of 37 genes (PDYN
[MIM 131340], STK35 [MIM 609370], TGM3 [MIM
600238], TGM6 [NM_198994.2], SNRPB [MIM 182282],
SNORD119 [NR_003684.1], ZNF343 [NM_024325.4],
TMC2 [MIM 606707], NOP56 [NM_006392.2], MIR1292
[NR_031699.1], SNORD110 [NR_003078.1], SNORA51
[NR_002981.1], SNORD86 [NR_004399.1], SNORD56
[NR_002739.1], SNORD57 [NR_002738.1], IDH3B [MIM
604526], EBF4 [MIM 609935], CPXM1 [NM_019609.4],
C20orf141 [NM_080739.2], FAM113A [NM_022760.3],
VPS16 [MIM 608550], PTPRA [MIM 176884], GNRH2
[MIM 602352], MRPS26 [MIM 611988], OXT [MIM
167050], AVP [MIM 192340], UBOX5 [NM_014948.2],
FASTKD5 [NM_021826.4], ProSAPiP1 [MIM 610484],
DDRGK1 [NM_023935.1], ITPA [MIM 147520], SLC4A11
[MIM 610206], and C20orf194 [NM_001009984.1]) (Fig-
ure 2C). All noncoding and coding exons, as well as the
100 bp up- and downstream of the splice junctions of these
genes, were sequenced in two index cases (IV-4 in pedigree 1
and III-1 in pedigree 3) and in three additional cases (II-1 in
pedigree 2, II-1 in pedigree 4, and II-1 in pedigree 5) with the
use of specific primers (Table S1 available online). Eight
unregistered variants were found among the two index
cases. Among these, there was a coding variant, c.795C>G
Figure 2. Motor Neuron Involvement and (GGCCTG)n Expansion in the First Intron of NOP56
(A) MRI of an affected subject (SCA#3) showed mild cerebellar atrophy (arrow) but no other cerebral or brainstem pathology.
(B) Tongue atrophy (arrow) was observed in SCA#1.
(C) Physical map of the 1.8-Mb linkage region from D20S906 (1,505,576 bp) to D20S193 (3,313,494 bp), with 33 candidate genes shown, as well as the direction of transcription (arrows).
(D) The upper portion of the panel shows the scheme of primer binding for repeat-primed PCR analysis. In the lower portion, sequence traces of the PCR reactions are shown. Red lines indicate the size markers. The vertical axis indicates arbitrary intensity levels. A typical saw-tooth pattern is observed in an affected pedigree.
(E) Southern blotting of LCLs from SCA cases and three controls. Genomic DNA (10 μg) was extracted from Epstein-Barr virus (EBV)-immortalized LCLs derived from six affected subjects (Ped2_II-1, Ped3_III-1, Ped3_III-2, Ped5_I-1, Ped5_II-1, and SCA#1) and digested with 2 U of AvrII overnight (New England Biolabs, Beverly, MA, USA). A probe covering exon 4 of NOP56 (452 bp) was subjected to PCR amplification from human genomic DNA with the use of primers (Table S3) and labeled with 32P-dCTP.
(p.Phe265Leu), in C20orf194, whereas the other seven
included one synonymous variant, c.1695T>A (p.Leu565Leu),
in ZNF343 and six non-splice-site intronic variants
(Table S2). We tested segregation by sequencing exon 11 of
C20orf194 in IV-2 and III-5 in pedigree 1. Neither IV-2 nor
III-5 had this variant. We thus eliminated C20orf194 as
a candidate. Missense mutations in PDYN and TGM6, which
have been recently reported as causes of SCA, mapped to
20p12.3-p13,9,10 but none were detected in the five index
cases studied here (Table S2).
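The candidate-gene bookkeeping in this passage can be retraced with a few lines of arithmetic. The coordinates are those given in the Figure 1 legend, and the gene counts are those stated in the text; the variable names are only illustrative.

```python
# Marker and gene coordinates (bp) as quoted in the Figure 1 legend.
D20S906_BP = 1_505_576
D20S193_BP = 3_313_494
NOP56_START_BP, NOP56_END_BP = 2_633_254, 2_639_039

# Size of the linked interval and a check that NOP56 lies inside it.
interval_bp = D20S193_BP - D20S906_BP
interval_mb = interval_bp / 1_000_000
nop56_in_interval = D20S906_BP <= NOP56_START_BP and NOP56_END_BP <= D20S193_BP

# Gene counts as stated in the text.
genes_in_region = 44
pseudogenes = 2
excluded_loc_genes = 5      # the five LOC entries removed from the candidates
unsequenced_sirp_genes = 4  # SIRPD, SIRPB1, SIRPG, SIRPA

candidate_genes = genes_in_region - pseudogenes - excluded_loc_genes
sequenced_genes = candidate_genes - unsequenced_sirp_genes
```

The results line up with the 1.8 Mb interval, the 37 genes with definitive open reading frames, and the 33 genes actually sequenced.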
Possible expansions of repetitive sequences in these
33 genes were investigated when intragenic repeats
were indicated in the database (UCSC Genome Bioinfor-
matics). Expansions of the hexanucleotide repeat
GGCCTG (rs68063608) were found in intron 1 of NOP56
(Figure 2D) in all five index cases through the use of
a repeat-primed PCR method.11–13 An outline of the
repeat-primed PCR experiment is described in Figure 2D.
In brief, the fluorescent-dye-conjugated forward primer
corresponded to the region upstream of the repeat of
interest. The first reverse primer consisted of four units of
the repeat (GGCCTG) and a 5′ tail used as an anchor.
The second reverse primer was an ‘‘anchor’’ primer. These
primers are described in Table S3. Complete segregation
of the expanded hexanucleotide was confirmed in all pedi-
grees, and the maximum repeat size in nine unaffected
members was eight (data not shown).
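For in silico work with reference or read sequences, the number of uninterrupted GGCCTG units can be counted with a short helper like the one below. This is a generic string-based sketch, not a model of repeat-primed PCR or of how the authors sized alleles; the function names are hypothetical, and the cutoff of eight units is taken from the largest allele reported here in unaffected family members.

```python
import re


def longest_tandem_run(sequence, unit="GGCCTG"):
    """Largest number of back-to-back copies of `unit` found in `sequence`."""
    pattern = re.compile("(?:%s)+" % re.escape(unit))
    runs = [len(m.group(0)) // len(unit)
            for m in pattern.finditer(sequence.upper())]
    return max(runs, default=0)


def exceeds_normal_range(n_units, normal_max=8):
    """Flag alleles longer than the largest size seen in unaffected members."""
    return n_units > normal_max


# Hypothetical toy sequence with six repeat units and arbitrary flanks.
toy_sequence = "ACGT" + "GGCCTG" * 6 + "TTAA"
toy_units = longest_tandem_run(toy_sequence)
```

Expansions of thousands of units cannot be spanned by a single short read, which is one reason methods such as repeat-primed PCR and Southern blotting are used for the expanded allele.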
In addition to the SCA cases in five pedigrees, four
unrelated cases (SCA#1–SCA#4) were found to have a
(GGCCTG)n allele through screening of the cohort SCA
patients (Table 1). Neurological examination was reeval-
uated in these four cases, revealing both ataxia and motor
neuron dysfunction with tongue atrophy and fasciculation
(Table 1). In total, nine unrelated cases were found in the
251 cohort patients with SCA (3.6%). For confirmation of
the repeat expansions, Southern blot analysis was conduct-
ed in six affected subjects (Ped2_II-1, Ped3_III-1, Ped3_III-2,
Ped5_I-1, Ped5_II-1, and SCA#1). The data showed >10 kb
of repeat expansions in the lymphoblastoid cell lines
(LCLs) obtained from the SCA patients (Figure 2E). Further-
more, the numbers of GGCCTG repeat expansion were
estimated by Southern blotting in 11 other cases. The
expansion analysis revealed approximately 1500 to 2500
repeats in 17 cases (Table 1). There was no negative association
between age at onset and the number of GGCCTG
repeats (n = 17, r = 0.42, p = 0.09; Figure S1) and no obvious
anticipation in the current pedigrees.
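The summary statistics quoted for the 17 affected subjects can be recomputed from Table 1. The sketch below assumes that the value in parentheses in the genotype column is the estimated repeat number of the expanded allele, and it uses the sample standard deviation; under those assumptions it gives a mean onset age that rounds to 52.8 years, an SD that rounds to 4.3 years, and a Pearson correlation that rounds to 0.42.

```python
import math
import statistics

# Onset age (yr) and estimated expanded repeat number, in Table 1 row order.
onset_age = [50, 52, 57, 50, 55, 53, 58, 56, 51,
             57, 48, 57, 47, 52, 43, 55, 57]
repeat_number = [1800, 2300, 2300, 2300, 2200, 2200, 2300, 2200, 1800,
                 1800, 2000, 2000, 1700, 2200, 1800, 1700, 2200]

mean_onset = statistics.mean(onset_age)
sd_onset = statistics.stdev(onset_age)  # sample SD (n - 1 denominator)


def pearson_r(x, y):
    """Pearson correlation coefficient computed from centered sums."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


r_onset_repeats = pearson_r(onset_age, repeat_number)

# t statistic for testing r = 0 with n - 2 degrees of freedom.
n_cases = len(onset_age)
t_stat = r_onset_repeats * math.sqrt(n_cases - 2) / math.sqrt(1 - r_onset_repeats ** 2)
```

The positive sign of r is what underlies the statement that there was no negative association, since a classic anticipation pattern would show earlier onset with longer repeats.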
To investigate the disease specificity and disease spec-
trum of the hexanucleotide repeat expansions, we tested
the repeat expansions in an Alzheimer disease (MIM
104300) cohort and an ALS cohort followed by the Depart-
ment of Neurology, Okayama University Hospital. We also
recruited Japanese controls, who were confirmed to be free
from brain lesions through MRI and magnetic resonance
angiography, which was performed as described previ-
ously.14 Screening of the 27 Alzheimer disease cases and
154 ALS cases failed to detect additional cases with repeat
expansions. The GGCCTG repeat sizes ranged from 3 to
8 in 300 Japanese controls (5.9 ± 0.8 repeats), suggesting
that the >10 kb repeat expansions were mutations.
Expression of Nop56, an essential component of the
splicing machinery,15 was examined by RT-PCR with the
use of primers for wild-type mouse Nop56 cDNA (Table
S3). Expression of Nop56 mRNA was detected in various
tissues, including CNS tissue, and a very weak signal was
detected in spinal cord tissue (Figure 4A). Immunohisto-
chemistry using an anti-mouse Nop56 antibody (Santa
Cruz Biotechnology, Santa Cruz, CA, USA) detected the
Nop56 protein in Purkinje cells of the cerebellum as well
as motor neurons of the hypoglossal nucleus and the
spinal cord anterior horn (Figure 4B), suggesting that these
cells may be responsible for tongue and muscle atrophy in
the trunk and limbs, respectively. Immunoblotting also
confirmed the presence of Nop56 in neural tissues
(Figure 4C), where Nop56 is localized in both the nucleus
and cytoplasm.
Alterations of NOP56 RNA expression and protein levels
in LCLs from patients were examined by real-time RT-PCR
and immunoblotting. The primers for quantitative PCR of
human NOP56 cDNA are described in Table S3. Immuno-
blotting was performed with the use of an anti-human
NOP56 antibody (Santa Cruz Biotechnology, Santa Cruz,
CA, USA). We found no decrease in NOP56 RNA expression
or protein levels in LCLs from these patients (Figure 5A). To
investigate abnormal splicing variants of NOP56, we per-
formed RT-PCR using the primers covering the region
from the 5′ UTR to exon 4 around the repeat expansion
(Table S3); however, no splicing variant was observed in
LCLs from the cases (Figure 5B). We also performed immu-
nocytochemistry for NOP56 and coilin, a marker of the
Cajal body, where NOP56 functions.16 NOP56 and coilin
distributions were not altered in LCLs of the SCA patients
(Figure 5C), suggesting that qualitative or quantitative
changes in the Cajal body did not occur. These results indi-
cated that haploinsufficiency could not explain the
observed phenotype.
Figure 3. Multipoint Linkage Analysis with Ten Markers on Chromosome 20p13
The American Journal of Human Genetics 89, 121–130, July 15, 2011 125
We performed fluorescence in situ hybridization to
detect RNA foci containing the repeat transcripts in LCLs
from patients, as previously described.17,18 Lymphoblas-
toid cells from two SCA patients (Ped2_II-2 and Ped5_I-1)
and two control subjects were analyzed. An average of
2.1 ± 0.5 RNA foci per cell were detected in 57.0%
of LCLs (n = 100) from the SCA subjects through the use
of a nuclear probe targeting the GGCCUG repeat, whereas
no RNA foci were observed in control LCLs (n = 100)
(Figure 6A). In contrast, a probe for the CGCCUG repeat,
another repeat sequence in intron 1 of NOP56, detected
no RNA foci in either SCA or control LCLs (n = 100
each) (Figure 6A), indicating that the GGCCUG repeat
was specifically expanded in the SCA subjects. The speci-
ficity of the RNA foci was confirmed by sensitivity to RNase
A treatment and resistance to DNase treatment (Figure 6A).
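The probe logic can be checked directly from the sequences given in the legend of Figure 6. This is a minimal sketch, assuming the probes are written 5′ to 3′ as DNA and that pairing is judged by exact reverse complementarity (T stands in for U in the transcript). The helper names are illustrative and not part of the original methods.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def reverse_complement(dna):
    """Reverse complement of a DNA string written 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(dna))

def probe_targets_repeat(probe_dna, repeat_unit, copies=6):
    """True if the probe is antisense to a run of the given repeat unit."""
    target = reverse_complement(probe_dna)  # sense sequence the probe pairs with
    return target in repeat_unit * copies

# Probes as listed in the Figure 6 legend, expanded to plain strings.
ggcctg_probe = "C" + "CAGGCC" * 2 + "CAG"   # C(CAGGCC)2CAG
cgcctg_probe = "G" + "CAGGCG" * 2 + "CAG"   # G(CAGGCG)2CAG

print(probe_targets_repeat(ggcctg_probe, "GGCCTG"))  # probe vs. its intended repeat
print(probe_targets_repeat(cgcctg_probe, "CGCCTG"))  # control probe vs. its repeat
print(probe_targets_repeat(ggcctg_probe, "CGCCTG"))  # cross-check between repeats
```

Each probe matches only its own repeat in this check, which is consistent with using the CGCCUG probe as a specificity control.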
Several reports have suggested that RNA foci play a role
in the etiology of SCA through sequestration of specific
RNA-binding proteins.5–7 In silico searches (ESEfinder
3.0) predicted an RNA-binding protein, SRSF2 (MIM
600813), as a strong candidate for binding of the GGCCUG
repeat. Double staining with the probe for the GGCCUG
repeat and an anti-SRSF2 antibody (Sigma-Aldrich, Tokyo,
Japan) was performed. The results showed colocalization of
RNA foci with SRSF2, whereas NOP56 and coilin were not
colocalized with the RNA foci (Figure 6B), suggesting
a specific interaction of endogenous SRSF2 with the RNA
foci in vivo.
To further confirm the interaction, gel-shift assays were
carried out for investigation of the binding activity of
SRSF2 with (GGCCUG)n. Synthetic RNA oligonucleotides
(200 pmol), (GGCCUG)4 or (CUG)6, which is the latter
part of the hexanucleotide, as well as the repeat RNA
involved in myotonic dystrophy type 1 (DM1 [MIM
160900])18 and SCA8 (MIM 608768),5 were denatured
and immediately mixed with different amounts (0, 0.2,
or 0.6 μg) of recombinant full-length human SRSF2
(Abcam, Cambridge, UK). The mixtures were incubated,
and the protein-bound probes were separated from the
free forms by electrophoresis on 5%–20% native polyacryl-
amide gels. The separated RNA probes were detected with
SYBR Gold staining (Invitrogen, Carlsbad, CA, USA). We
found a strong association of (GGCCUG)4 with SRSF2
in vitro in comparison to (CUG)6 (Figure 6C). Collectively,
we concluded that (GGCCUG)n interacts with SRSF2.
It is notable that MIR1292 is located just 19 bp 3′ of the
GGCCTG repeat (Figure 2D). MiRNAs such as MIR1292 are
small noncoding RNAs that regulate gene expression by inhibiting
translation of specific target mRNAs.19,20
Figure 4. Nop56 in the Mouse Nervous System
(A) RT-PCR analysis of Nop56 (422 bp) in various mouse tissues. cDNA (25 ng) collected from various organs of C57BL/6 mice was purchased from GenoStaf (Tokyo, Japan).
(B) Immunohistochemical analysis of Nop56 in the cerebellum, hypoglossal nucleus, and spinal cord anterior horn in wild-type male Slc:ICR mice at 8 wks of age (Japan SLC, Shizuoka, Japan). The arrows indicate anti-Nop56 antibody staining. The negative control was the cerebellar sample without the Nop56 antibody treatment. Scale bar represents 100 μm.
(C) Immunoblotting of Nop56 (66 kDa) in the cerebellum and cerebrum. Protein sample (10 μg) was subjected to immunoblotting. LaminB1, a nuclear protein, and beta-tubulin were used as loading controls.
MiRNAs are believed to play important roles in key molecular
pathways by fine-tuning gene expression.19,20 Recent
studies have revealed that miRNAs influence neuronal
survival and are also associated with neurodegenerative
diseases.21,22 In silico searches (Target Scan Human 5.1)
predicted glutamate receptors (GRIN2B [MIM 138252]
and GRIK3 [MIM 138243]) to be potential target genes.
Real-time RT-PCR using TaqMan probes for miRNA
(Invitrogen, Carlsbad, CA, USA) revealed that the levels
of both mature and precursor MIR1292 were significantly
decreased in SCA LCLs (Figure 6D), indicating that the
GGCCTG repeat expansion decreased the transcription
of MIR1292. A decrease in MIR1292 expression may
upregulate glutamate receptors in particular cell types;
e.g., GRIK3 in stellate cells in the cerebellum,23 leading to
ataxia because of perturbation of signal transduction to
the Purkinje cells. In addition, it has been suggested, on
the basis of ALS mouse models,24,25 that excitotoxicity
mediated by a type of glutamate receptor, the NMDA
receptor including GRIN2B, is involved in loss of spinal
neurons. A very slowly progressing and mild form of the
motor neuron disease, such as that described here, which
is limited to mostly fasciculation of the tongue, limbs
and trunk, may also be compatible with such a functional
dysregulation rather than degeneration.
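Relative quantification of this kind is commonly reported with the 2^-ΔΔCt calculation. The source does not state which calculation was used, so the sketch below is an assumption, and the Ct values are invented purely for illustration; they are not data from this study.

```python
def relative_expression(ct_target, ct_reference, ct_target_ctrl, ct_reference_ctrl):
    """Fold change of a target in a sample versus a control by 2^-ddCt.

    Each Ct is first normalized to a reference transcript measured in the
    same sample (for example RNU6 for a miRNA assay), then the sample is
    compared with the control.
    """
    delta_ct_sample = ct_target - ct_reference
    delta_ct_control = ct_target_ctrl - ct_reference_ctrl
    return 2 ** -(delta_ct_sample - delta_ct_control)

# Hypothetical Ct values: the target crosses threshold one cycle later in the
# patient line than in the control line, with identical reference Ct values.
fold = relative_expression(ct_target=30.0, ct_reference=20.0,
                           ct_target_ctrl=29.0, ct_reference_ctrl=20.0)
print(f"fold change vs. control = {fold:.2f}")
```

A one-cycle delay after normalization corresponds to half the control level, which is the sense in which a decrease would be read from such data.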
In the present study, we have conducted genetic analysis
to find a genetic cause for the unique SCA with motor
neuron disease. With extensive sequencing of the 1.8 Mb
linked region, we found large hexanucleotide repeat
expansions in NOP56, which were completely segregated
with SCA in five pedigrees and were found in four unre-
lated cases with a similar phenotype. The expansion was
not found in 300 controls or in other neurodegenerative
diseases. We further proved that repeat expansions of
NOP56 induce RNA foci and sequester SRSF2. We thus
conclude that the hexanucleotide repeat expansions
cause SCA by a toxic RNA gain-of-function
mechanism, and we name this unique SCA SCA36.
Haplotype analysis indicates that hexanucleotide expan-
sions are derived from a common ancestor. The prevalence
of SCA36 was estimated at 3.6% in the SCA cohort in
Chugoku district, suggesting that prevalence of SCA36
may be geographically limited to the western part of Japan
and is rare even in Japanese SCAs.
Expansion of tandem nucleotide repeats in different
regions of respective genes (most often the triplets CAG
and CTG) has been shown to cause a number of inherited
diseases over the past decades. An expansion in the coding
region of a gene causes a gain of toxic function and/or
reduces the normal function of the corresponding protein
at the protein level. RNA-mediated noncoding repeat
expansions have also been identified as causing eight other
neuromuscular disorders: DM1, DM2 (MIM 602668),
fragile X tremor/ataxia syndrome (FXTAS [MIM 300623]),
Huntington disease-like 2 (HDL2 [MIM 606438]), SCA8,
SCA10 (MIM 603516), SCA12 (MIM 604326), and SCA31
(MIM 117210).26 The repeat numbers in affected alleles
of SCA36 are among the largest seen in this group of
diseases (i.e., there are thousands of repeats). Moreover,
SCA36 is not merely a nontriplet repeat expansion disorder
similar to SCA10, DM2, and SCA31, but is now proven to
be a human disease caused by a large hexanucleotide
repeat expansion. In addition, no or only weak anticipa-
tion has been reported for noncoding repeat expansion
in SCA, whereas clear anticipation has been reported for
most polyglutamine expansions in SCA.2 As such, absence
of anticipation in SCA36 is in accord with previous studies
on SCAs with noncoding repeat expansions.
Figure 5. Analysis of NOP56 in LCLs from SCA Patients
(A) mRNA expression (upper panel) and protein levels (lower panel) in LCLs from cases (n = 6) and controls (n = 3) were measured by RT-PCR and immunoblotting, respectively. cDNA (10 ng) was transcribed from total RNA isolated from LCLs and used for RT-PCR. Immunoblotting was performed with the use of a protein sample (40 μg) extracted from LCLs. The data indicate the mean ± SD relative to the levels of PP1A and GAPDH, respectively. There was no significant difference between LCLs from controls and cases.
(B) Analysis for splicing variants of NOP56 cDNA. RT-PCR with 10 ng of cDNA and primers corresponding to the region from the 5′ UTR to exon 4 around the repeat expansion was performed. The PCR product has an expected size of 230 bp.
(C) Immunocytochemistry for NOP56 and coilin. Green signals represent NOP56 or coilin. Shown are representative samples from 100 observations of controls or cases.
The common
hallmark in these noncoding repeat expansion disorders
is transcribed repeat nuclear accumulations with respec-
tive repeat RNA-binding proteins, which are considered
to primarily trigger and develop the disease at the RNA
level. However, multiple different mechanisms are likely
to be involved in each disorder. There are at least two
possible explanations for the motor neuron involvement
of SCA36: gene- and tissue-specific splicing specificity of
SRSF2 and involvement of miRNA. In SCA36, there is the
possibility that the adverse effect of the expansion muta-
tion is mediated by downregulation of miRNA expression.
The biochemical implication of miRNA involvement
cannot be evaluated in this study, because availability of
tissue samples from affected cases was limited to LCLs.
Given definitive downregulation of miRNA 1292 in
LCLs, we should await further study to substantiate its
involvement in affected tissues. Elucidating which mecha-
nism(s) plays a critical role in the pathogenesis will
be required for determining whether cerebellar degenera-
tion and motor neuron disease occur through a similar
scenario.
Figure 6. RNA Foci Formation and Decreased Transcription of MIR1292
(A) Cells were fixed on coverslips and then hybridized with solutions containing either a Cy3-labeled C(CAGGCC)2CAG or G(CAGGCG)2CAG oligonucleotide probe (1 ng/ml). For controls, the cells were treated with 1000 U/ml DNase or 100 μg/ml RNase for 1 hr at 37°C prior to hybridization, as indicated. After a wash step, coverslips were placed on the slides in the presence of ProLong Gold with DAPI mounting media (Molecular Probes, Tokyo, Japan) and photographed with a fluorescence microscope. The upper panels indicate LCLs from an SCA case and a control hybridized with C(CAGGCC)2CAG (left) or G(CAGGCG)2CAG (right). Red and blue signals represent RNA foci and the nucleus (DAPI staining), respectively. Similar RNA foci formation was confirmed in LCLs from another index case. The lower panels show RNA foci in SCA LCLs treated with DNase or RNase.
(B) Double staining was performed with the probe for (GGCCUG)n (red) and anti-SRSF2, NOP56, or coilin antibody (green).
(C) Gel-shift assays revealed specific binding of SRSF2 to (GGCCUG)4 but little to (CUG)6.
(D) RNA samples (10 ng) were extracted from LCLs of controls (n = 3) and cases (n = 6). MiRNAs were measured with the use of a TaqMan probe for precursor (Pri-) and mature MIR1292. The data indicate the mean ± SD, relative to the levels of PP1A or RNU6. *: p < 0.05.
In conclusion, expansion of the intronic GGCCTG
hexanucleotide repeat in NOP56 causes a unique form of
SCA, SCA36, which shows not only ataxia but also motor
neuron dysfunction. This characteristic disease phenotype
can be explained by the combination of RNA gain of func-
tion and MIR1292 suppression. Additional studies are
required to investigate the roles of each mechanistic
component in the pathogenesis of SCA36.
Supplemental Data
Supplemental Data include one figure and three tables and can be
found with this article online at http://www.cell.com/AJHG/.
Acknowledgments
This work was supported mainly by grants to A.K. and partially by
grants to T.M., Y.I., H.K., and K.A. We thank Norio Matsuura,
Kokoro Iwasawa, and Kouji H. Harada (Kyoto University Graduate
School of Medicine).
Received: February 23, 2011
Revised: May 8, 2011
Accepted: May 18, 2011
Published online: June 16, 2011
Web Resources
The URLs for data presented herein are as follows:
ESEfinder 3.0, http://rulai.cshl.edu/cgi-bin/tools/ESE3/esefinder.cgi?process=home
NCBI, http://www.ncbi.nlm.nih.gov/
Target Scan Human 5.1, http://www.targetscan.org/
UCSC Genome Bioinformatics, http://genome.ucsc.edu
References
1. Harding, A.E. (1982). The clinical features and classification of
the late onset autosomal dominant cerebellar ataxias. A study
of 11 families, including descendants of ‘the Drew family
of Walworth’. Brain 105, 1–28.
2. Matilla-Duenas, A., Sanchez, I., Corral-Juan, M., Davalos, A.,
Alvarez, R., and Latorre, P. (2010). Cellular and molecular
pathways triggering neurodegeneration in the spinocerebellar
ataxias. Cerebellum 9, 148–166.
3. Schols, L., Bauer, P., Schmidt, T., Schulte, T., and Riess, O. (2004).
Autosomal dominant cerebellar ataxias: clinical features,
genetics, and pathogenesis. Lancet Neurol. 3, 291–304.
4. Ohta, Y., Hayashi, T., Nagai, M., Okamoto, M., Nagotani, S.,
Nagano, I., Ohmori, N., Takehisa, Y., Murakami, T., Shoji,
M., et al. (2007). Two cases of spinocerebellar ataxia accompa-
nied by involvement of the skeletal motor neuron system and
bulbar palsy. Intern. Med. 46, 751–755.
5. Daughters, R.S., Tuttle, D.L., Gao, W., Ikeda, Y., Moseley, M.L.,
Ebner, T.J., Swanson, M.S., and Ranum, L.P. (2009). RNA gain-
of-function in spinocerebellar ataxia type 8. PLoS Genet. 5,
e1000600.
6. Sato, N., Amino, T., Kobayashi, K., Asakawa, S., Ishiguro, T.,
Tsunemi, T., Takahashi, M., Matsuura, T., Flanigan, K.M.,
Iwasaki, S., et al. (2009). Spinocerebellar ataxia type 31 is
associated with “inserted” penta-nucleotide repeats containing
(TGGAA)n. Am. J. Hum. Genet. 85, 544–557.
7. White, M.C., Gao, R., Xu, W., Mandal, S.M., Lim, J.G., Hazra,
T.K., Wakamiya, M., Edwards, S.F., Raskin, S., Teive, H.A., et al.
(2010). Inactivation of hnRNP K by expanded intronic
AUUCU repeat induces apoptosis via translocation of
PKCdelta to mitochondria in spinocerebellar ataxia 10. PLoS
Genet. 6, e1000984.
8. Kruglyak, L., Daly, M.J., Reeve-Daly, M.P., and Lander, E.S.
(1996). Parametric and nonparametric linkage analysis:
a unified multipoint approach. Am. J. Hum. Genet. 58,
1347–1363.
9. Bakalkin, G., Watanabe, H., Jezierska, J., Depoorter, C.,
Verschuuren-Bemelmans, C., Bazov, I., Artemenko, K.A.,
Yakovleva, T., Dooijes, D., Van de Warrenburg, B.P., et al.
(2010). Prodynorphin mutations cause the neurodegenerative
disorder spinocerebellar ataxia type 23. Am. J. Hum. Genet.
87, 593–603.
10. Wang, J.L., Yang, X., Xia, K., Hu, Z.M., Weng, L., Jin, X., Jiang,
H., Zhang, P., Shen, L., Guo, J.F., et al. (2010). TGM6 identified
as a novel causative gene of spinocerebellar ataxias using
exome sequencing. Brain 133, 3510–3518.
11. Cagnoli, C., Michielotto, C., Matsuura, T., Ashizawa, T., Marg-
olis, R.L., Holmes, S.E., Gellera, C., Migone, N., and Brusco, A.
(2004). Detection of large pathogenic expansions in FRDA1,
SCA10, and SCA12 genes using a simple fluorescent repeat-
primed PCR assay. J. Mol. Diagn. 6, 96–100.
12. Matsuura, T., and Ashizawa, T. (2002). Polymerase chain reac-
tion amplification of expanded ATTCT repeat in spinocerebel-
lar ataxia type 10. Ann. Neurol. 51, 271–272.
13. Warner, J.P., Barron, L.H., Goudie, D., Kelly, K., Dow, D.,
Fitzpatrick, D.R., and Brock, D.J. (1996). A general method
for the detection of large CAG repeat expansions by fluores-
cent PCR. J. Med. Genet. 33, 1022–1026.
14. Hashikata, H., Liu, W., Inoue, K., Mineharu, Y., Yamada, S.,
Nanayakkara, S., Matsuura, N., Hitomi, T., Takagi, Y., Hashi-
moto, N., et al. (2010). Confirmation of an association of
single-nucleotide polymorphism rs1333040 on 9p21 with
familial and sporadic intracranial aneurysms in Japanese
patients. Stroke 41, 1138–1144.
15. Wahl, M.C., Will, C.L., and Luhrmann, R. (2009). The spliceo-
some: design principles of a dynamic RNP machine. Cell 136,
701–718.
16. Lechertier, T., Grob, A., Hernandez-Verdun, D., and Roussel, P.
(2009). Fibrillarin and Nop56 interact before being co-assem-
bled in box C/D snoRNPs. Exp. Cell Res. 315, 928–942.
17. Liquori, C.L., Ricker, K., Moseley, M.L., Jacobsen, J.F., Kress,
W., Naylor, S.L., Day, J.W., and Ranum, L.P. (2001). Myotonic
dystrophy type 2 caused by a CCTG expansion in intron 1 of
ZNF9. Science 293, 864–867.
18. Taneja, K.L., McCurrach, M., Schalling, M., Housman, D., and
Singer, R.H. (1995). Foci of trinucleotide repeat transcripts in
nuclei of myotonic dystrophy cells and tissues. J. Cell Biol.
128, 995–1002.
19. Winter, J., Jung, S., Keller, S., Gregory, R.I., and Diederichs, S.
(2009). Many roads to maturity: microRNA biogenesis path-
ways and their regulation. Nat. Cell Biol. 11, 228–234.
20. Zhao, Y., and Srivastava, D. (2007). A developmental view of
microRNA function. Trends Biochem. Sci. 32, 189–197.
21. Eacker, S.M., Dawson, T.M., and Dawson, V.L. (2009). Under-
standing microRNAs in neurodegeneration. Nat. Rev. Neuro-
sci. 10, 837–841.
22. Hebert, S.S., and De Strooper, B. (2009). Alterations of the
microRNA network cause neurodegenerative disease. Trends
Neurosci. 32, 199–206.
23. Tsuzuki, K., and Ozawa, S. (2005). Glutamate Receptors. Encyclopedia
of life sciences. John Wiley and Sons, Ltd.,
http://onlinelibrary.com/doi/10.1038/npg.els.0005056.
24. Nutini, M., Frazzini, V., Marini, C., Spalloni, A., Sensi, S.L., and
Longone, P. (2011). Zinc pre-treatment enhances NMDAR-
mediated excitotoxicity in cultured cortical neurons from
SOD1(G93A) mouse, a model of amyotrophic lateral sclerosis.
Neuropharmacology 60, 1200–1208.
25. Sanelli, T., Ge, W., Leystra-Lantz, C., and Strong, M.J. (2007).
Calcium mediated excitotoxicity in neurofilament aggregate-
bearing neurons in vitro is NMDA receptor dependant.
J. Neurol. Sci. 256, 39–51.
26. Todd, P.K., and Paulson, H.L. (2010). RNA-mediated neurode-
generation in repeat expansion disorders. Ann. Neurol. 67,
291–300.
REPORT
A Mutation in a Skin-Specific Isoform of SMARCAD1 Causes Autosomal-Dominant Adermatoglyphia
Janna Nousbeck,1 Bettina Burger,2 Dana Fuchs-Telem,1,4 Mor Pavlovsky,1 Shlomit Fenig,1 Ofer Sarig,1
Peter Itin,2,3 and Eli Sprecher1,4,*
Monogenic disorders offer unique opportunities for researchers to shed light upon fundamental physiological processes in humans. We
investigated a large family affected with autosomal-dominant adermatoglyphia (absence of fingerprints), also known as the “immigration
delay disease.” Using linkage and haplotype analyses, we mapped the disease phenotype to 4q22. One of the genes located in
this interval is SMARCAD1, a member of the SNF subfamily of the helicase protein superfamily. We demonstrated the existence of a short
isoform of SMARCAD1 exclusively expressed in the skin. Sequencing of all SMARCAD1 coding and noncoding exons revealed a heterozygous
transversion predicted to disrupt a conserved donor splice site adjacent to the 3′ end of a noncoding exon uniquely present in the
skin-specific short isoform of the gene. This mutation segregated with the disease phenotype throughout the entire family. Using a minigene
system, we found that this mutation causes aberrant splicing, resulting in decreased stability of the short RNA isoform as predicted
by computational analysis and shown by RT-PCR. Taken together, the present findings implicate a skin-specific isoform of SMARCAD1 in
the regulation of dermatoglyph development.
Epidermal ridges are characteristic features of the human
skin1 and in wide use in the modern era as almost unsur-
passed identification tools. The physiological role of
epidermal ridges remains controversial. Recent data have
dismissed the theory that fingerprints might improve the
grip by ramping up friction levels.2 Instead, epidermal
ridges might amplify vibratory signals to deeply embedded
nerves involved in fine texture perception.3
The factors underlying the formation of epidermal ridges
during embryonic development and their pattern remain
unknown but are likely to include both genetically deter-
mined traits4 as well as environmental elements5 and to
involve some form of interactions between the mesen-
chymal and the dermal and the epidermal elements. At
24 weeks postfertilization, the epidermal-ridge system
displays an adult morphology6 that remains permanent
without any modification throughout life. The congenital
absence of epidermal ridges is a rare condition known as
adermatoglyphia (ADG). To date only four families with
congenital absence of fingerprints have been described.7–10
In three of these families,7–9 additional features such as
congenital facial milia, skin blisters, and fissures associated
with heat or trauma were reported. A number of more
complex syndromes such as Naegeli-Franceschetti-Jadas-
sohn syndrome (MIM 161000) and dyskeratosis congenita
(MIM 305000) also feature abnormal development of
epidermal ridges,11,12 as detailed in a recent review of the
topic.13
In the present study we investigated a large Swiss kindred
presenting with autosomal-dominant adermatoglyphia,
recently coined as the “immigration delay disease”13
because affected individuals report significant difficulties
entering countries that require fingerprint recording. All
affected members of this family displayed since birth an
absence of fingerprints (Figure 1A); histological analysis13
revealed that this absence was associated with a reduced
number of sweat glands and a sweat test showed a reduced
ability for hand transpiration (Figure 1B).
All affected (n = 9) and healthy (n = 7) family members
or their legal guardian provided written and informed
consent according to a protocol approved by the institutional
review board of University Hospital Basel in adherence
with the principles of the Declaration of Helsinki.
DNA was extracted from peripheral blood lymphocytes.
We initially genotyped all family members by using the
Illumina Human Linkage-12 chip comprising 6000 tagged
SNPs distributed across the genome. Two hundred ng of
DNA were hybridized according to the Infinium II assay
(Illumina, San Diego, CA) and scanned with an Illumina
BeadArray reader. The scanned images were imported
into BeadStudio 3.1.3.0 (Illumina) for extraction and
quality control, with an average call rate of 99.9%.
Multipoint linkage analysis with the Superlink software14
generated a LOD score of 2.85 at marker rs1509948
(Figure 2). Fine mapping of the disease interval was per-
formed with polymorphic microsatellite markers that
were selected from the National Center for Biotechnology
Information (NCBI) database. Genotypes were established
with fluorescently labeled primer pairs (Research Genetics,
Invitrogen, Carlsbad, CA) according to the manufacturer’s
recommendations. PCR products were separated by PAGE
on an automated sequencer (ABI PRISM 3100 Genetic
Analyzer; Applied Biosystems, Foster City, CA), and allele
sizes were determined with Gene Mapper v4.0 software.
Haplotype analysis refined the disease locus to a 5.1 Mb
interval between markers D4S423 and D4S1560 (Figure 2).
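For intuition about the size of such a score, the following sketch computes a two-point LOD for phase-known, fully informative meioses. This is a deliberately simplified model and not the multipoint algorithm implemented in Superlink; the function name and the example counts are hypothetical.

```python
import math

def two_point_lod(theta, meioses, recombinants):
    """LOD = log10[L(theta) / L(0.5)] for phase-known informative meioses.

    theta is the recombination fraction under linkage; 0.5 is the
    no-linkage null.
    """
    nonrecombinants = meioses - recombinants
    likelihood = (1 - theta) ** nonrecombinants * theta ** recombinants
    if likelihood == 0:
        return float("-inf")  # e.g. theta = 0 with an observed recombinant
    return math.log10(likelihood / 0.5 ** meioses)

# Hypothetical example: 10 informative meioses, no recombinants.
print(f"{two_point_lod(0.0, 10, 0):.2f}")
# Same family size with one recombinant, evaluated at theta = 0.1.
print(f"{two_point_lod(0.1, 10, 1):.2f}")
```

With no recombinants, each informative meiosis contributes log10(2), about 0.30 LOD units at θ = 0, so the attainable score is bounded by the number of informative meioses in the pedigree.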
1Department of Dermatology, Tel Aviv Sourasky Medical Center, Tel Aviv 64239, Israel; 2Department of Biomedicine, University Hospital Basel, Basel 4051,
Switzerland; 3Department of Dermatology, University Hospital Basel, Basel 4051, Switzerland; 4Department of Human Molecular Genetics and Biochem-
istry, Sackler Faculty of Medicine, Tel-Aviv University, Ramat Aviv 61390, Israel
*Correspondence: [email protected]
DOI 10.1016/j.ajhg.2011.07.004. ©2011 by The American Society of Human Genetics. All rights reserved.
302 The American Journal of Human Genetics 89, 302–307, August 12, 2011
We found the disease interval contained 17 genes. All
coding and noncoding exons of the disease interval genes
were fully sequenced. Initially, no mutation was identified.
We therefore carefully scrutinized all currently available
databases for rare transcripts. We identified one minor
transcript (ENST00000509418, NM_001128430.1), sharing
a common nucleotide sequence with the 3′ end of
SMARCAD1 (MIM 612761). SMARCAD1 encodes a protein
that is structurally related to the SWI2/SNF2 superfamily
of DNA-dependent ATPases, which function as catalytic
subunits of chromatin-remodeling complexes and are
consequently considered to be major regulators of tran-
scriptional activity.15 The two SMARCAD1 isoforms differ
in lengths and sites of transcription initiation. The shortest
SMARCAD1 isoform is predicted to contain a unique
5′-nontranslated exon (Figure 3A). It is of interest that, in
contrast with the major large isoform, which was found to
be expressed ubiquitously as previously shown,16 the
SMARCAD1 short isoform was mainly identifiable by RT-
PCR in skin fibroblasts and to a lesser extent in keratino-
cytes and esophageal tissue (Figure 4), suggesting that it
might represent an attractive candidate gene for a skin
condition such as ADG.
To assess the possible involvement of SMARCAD1 in
ADG, genomic DNA was amplified by PCR with primer
pairs spanning the entire coding sequence of both
SMARCAD1 isoforms (Table S1, available online) and Taq
polymerase (QIAGEN, Valencia, CA). Cycling conditions
were 94°C for 2 min followed by three cycles at 94°C for
40 s, 61°C for 40 s, and 72°C for 40 s; three cycles at
94°C for 40 s, 59°C for 40 s, and 72°C for 40 s; three cycles
at 94°C for 40 s, 57°C for 40 s, and 72°C for 40 s; 33 cycles
at 94°C for 40 s, 55°C for 40 s, and 72°C for 40 s; and a final
extension step at 72°C for 10 min. DNA was extracted
from gel and purified with QIAquick Gel Extraction kit
(QIAGEN). Direct sequencing of the resulting PCR prod-
ucts with the BigDye terminator system on an automated
sequencer (Applied Biosystems) revealed a heterozygous
G>T transversion in the first intron of the skin-specific
SMARCAD1 short isoform. The mutation, c.378+1G>T,
was predicted to abolish the donor splice site adjacent to
the 3′ end of the first unique exon of the short SMARCAD1
isoform. To confirm the existence of the mutation, we used
a PCR-RFLP assay. A 537 bp long DNA fragment was amplified
with the forward primer 5′-AGCTGATTGGCTGGGAATAC-3′
and reverse primer 5′-GGCATTCATAAAACTCAAAATGC-3′
(Figure 3B). The mutation creates a recognition
site for MseI endonuclease (New England Biolabs, Ipswich,
MA). Using this assay, we confirmed segregation of the
mutation with the disease phenotype throughout the
entire family and also excluded the mutation from a panel
of 100 healthy Swiss individuals and 100 healthy Jewish
individuals (data not shown); this suggests that the muta-
tion does not represent a common neutral polymorphism
but rather is a disease-causing mutation.
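The logic of the PCR-RFLP assay can be sketched in a few lines. The sequence below is a made-up toy amplicon, not the 537 bp SMARCAD1 fragment, and it assumes that the intron begins with GTAA so that the G>T change yields TTAA, the MseI recognition sequence (cut as T^TAA). Fragment sizes therefore differ from those in the study.

```python
MSEI_SITE = "TTAA"  # MseI recognition sequence; cleavage after the first T

def digest(sequence, site=MSEI_SITE, cut_offset=1):
    """Fragment lengths after complete digestion of a linear amplicon."""
    cuts = []
    start = sequence.find(site)
    while start != -1:
        cuts.append(start + cut_offset)
        start = sequence.find(site, start + 1)
    bounds = [0] + cuts + [len(sequence)]
    return [b - a for a, b in zip(bounds, bounds[1:])]

def rflp_pattern(allele_a, allele_b):
    """Distinct band sizes expected on a gel for a diploid genotype."""
    return sorted(set(digest(allele_a) + digest(allele_b)), reverse=True)

# Hypothetical toy sequences (illustration only).
exon_end = "ATGGCACGTCCAG"
intron_start = "GTAAGCTAGC"
wild_type = exon_end + intron_start
mutant = exon_end + "T" + intron_start[1:]   # +1 G>T creates TTAA

print(rflp_pattern(wild_type, wild_type))  # unaffected: uncut band only
print(rflp_pattern(wild_type, mutant))     # heterozygote: uncut plus two cut bands
```

A heterozygote shows the wild-type band plus the two smaller fragments from the mutant allele, which is the pattern used to follow segregation in the family.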
To assess the consequences of the mutation on the
SMARCAD1-splicing pattern, we initially used RT-PCR to
amplify cDNA derived from the RNA extracted from the
fibroblast cell cultures that were established from a patient
and a healthy individual. Total RNA was extracted with
RNeasy Extraction Kit (QIAGEN). cDNA was synthesized
(Thermo Scientific Verso cDNA Synthesis Kit, ABgene,
Surrey, UK) and amplified by PCR with exon-crossing
primers 5′-GAAAGCAAGAATGTGGCAG-3′ and 5′-GGGCTTGAGTGACAAACT-3′,
located in exons 1 and 3 of the short
SMARCAD1 isoform, respectively. DNA was extracted from
gel, purified with QIAquick Gel Extraction kit (QIAGEN),
and directly sequenced as described above. Only the
wild-type splice product was identified, suggesting that
aberrant splice variants might undergo degradation. To
obtain further support for this possibility, we generated
a minigene construct17 by subcloning exon 1, parts of
intron 1 (because the first intron is very large [~10.5 kb],
we trimmed the intronic sequence) and exon 2 of the
SMARCAD1 short isoform into the pEGFP-C3 vector
(Figure 5A). More specifically, a 1.7 kb genomic DNA frag-
ment comprising exon 1 and the first 1358 bp of intron
1 was cloned into the EcoRI and KpnI restriction sites of
the pEGFP-C3 vector with primers
5′-AAAAAGAATTCAAGAAATTAGAGCTTACATTTAG-3′ and
5′-AAAAAGGTACCTCACTGATTAACAGGGAAAAAG-3′, respectively.
Then, a 0.7 kb genomic fragment comprising the last 500 bp of
intron 1 followed by exon 2 was cloned into the KpnI
and BamHI sites of the first construct with primers
5′-AAAAAGGTACCTATACTTTGATGATAGATGTGG-3′ and
5′-AAAAGGATCCCTTTGGTTTAGAATGGAAGG-3′, respectively.
Figure 1. Clinical Features
(A and B) Absence of fingerprints (A) and reduced hand perspiration demonstrated by sweat test (B) in a patient with adermatoglyphia.
We sequenced the entire insert to verify the authenticity
of the construct. Next, we introduced the
c.378+1G>T mutation into the minigene by using the
QuikChange Site-Directed Mutagenesis kit (Stratagene,
Santa Clara, CA). Both the wild-type and the mutant mini-
gene constructs were transiently transfected into HeLa
cells with Lipofectamine 2000 (Invitrogen).
Figure 2. Genetic Mapping of ADG
(A) Multipoint LOD score analysis was performed with the SuperLink software. LOD scores are plotted against all SNP markers distributed across the genome.
(B) Haplotype analysis with polymorphic markers on chromosomal region 4q22 reveals a heterozygous 5.1 Mb interval between markers D4S423 and D4S1560 uniquely shared by all patients (boxed in red).
Cells were harvested 48 hr after transfection; total RNA was extracted
and subjected to RT-PCR and direct sequencing. Transfec-
tion of the wild-type minigene resulted as expected
in the formation of one single and abundant spliced
variant containing exons 1 and 2 of the short SMARCAD1
isoform; this was confirmed by sequencing analysis. In
contrast, transfection of the mutation-carrying minigene
Figure 3. Mutation Analysis(A) Bioinformatics analysis indicated theexistence of two SMARCAD1 isoformsdiffering both in lengths and sites of tran-scription start site. The short SMARCAD1isoform contains a unique nontranslatedexon (red arrow).(B) Sequence analysis revealed a heterozy-gous transversion, c.378þ1G>T, in theshort SMARCAD1 isoform (red arrow, leftpanel). The wild-type sequence is givenfor comparison (right panel).(C) PCR-RFLP analysis confirmed segrega-tion of the mutation in the family. Muta-tion c.378þ1G>T creates a recognitionsite for MseI endonuclease; thus, healthyindividuals display fragments of 163 bpand 46 bp, whereas affected heterozygouspatients show in addition fragments of73 bp and 90 bp.
Figure 4. Tissue Expression of SMARCAD1 IsoformsSMARCAD1 isoform expression was assessed with Clontech tissueblot cDNA array. Quantitative RT-PCR analysis showed thatthe long SMARCAD1 isoform is expressed ubiquitously at lowlevel. In contrast, the short SMARCAD1 isoform was found to beexpressed mainly in skin fibroblasts, keratinocytes, and theesophagus. Expression of SMARCAD1 was normalized to that ofACTB. Results are provided as the fold change of expression ofSMARCAD1 long isoform expression in keratinocytes 5 standarddeviation.
led to the generation of two aberrant
splice variants: the first contained an
extra 51 bp from intron 1, and the
second lacked one G at the end of
exon 1; both result from the
utilization of cryptic donor splice sites. Of interest, the
abnormal splice products were only marginally detectable
as compared with the wild-type RNA, both in HeLa cells
(Figure 5B) and in primary human fibroblasts (data not
shown). These results are in line with the fact that aberrant
splice variants were not detectable in patient fibroblasts
(see above).
Two main mechanisms, alone or in combination, might
explain this observation. First, authentic splicing is typi-
cally more efficient than splicing activated at cryptic
sites.18 Therefore, it is possible that the significantly
reduced level of aberrant splice variants is due to a decrease
in splicing efficiency. Another possibility is that the
abnormal 5′ UTR variants affect RNA stability. Indeed, alteration
in the secondary structure of an RNA molecule has
been shown to inhibit translation initiation directly, by
preventing the 40S subunit binding or scanning, or indi-
rectly, by preventing the action of regulatory RNA-binding
proteins. This in turn has been shown to foster mRNA
degradation by increasing decapping and the deadenyla-
tion rate.19 To assess this possibility, we initially compared
via computational analysis the secondary structure of wild-
type and aberrant splice RNA variants by using the Gene-
Bee RNA secondary-structure prediction software. As
shown in Figure 5C, computational analysis predicts that
both aberrant splice variations are likely to significantly
affect RNA secondary configuration; this prediction is in
agreement with the fact that the 5′ UTR region of the gene
affected by the abnormal splicing is highly conserved
across species at the nucleotide level (data not shown).
To obtain experimental support for the possibility that
aberrantly spliced variants of the SMARCAD1 short isoform
undergo degradation, we treated cells transfected with
both the wild-type and mutation-carrying constructs
with cycloheximide at a concentration of 50 μg/ml for
24 hr, which is known to inhibit decapping of mRNA.20
As a result, we observed a significant increase in the aber-
rant splice variant levels but not in the wild-type splice
variant (Figure 5D).
In conclusion, we have identified in a large family with
ADG a splice site mutation causing aberrant splicing of
a skin-specific isoform of SMARCAD1, implicating this
gene in dermatoglyph ontogenesis. The mutation is likely
to exert a loss-of-function effect.
Little is known about the function of the full-length
SMARCAD1, and virtually nothing is known regarding
the physiological role of the skin-specific isoform of this
gene. The tissue-specific pattern of expression of
the short isoform is likely to underlie the very limited
phenotype displayed by our patients, as attested by the
severe phenotype observed in mice knocked out for the
ubiquitous long SMARCAD1 isoform;21 those
mice feature retarded growth, perinatal mortality,
decreased fertility, and various skeletal defects.
The full-length SMARCAD1 seems to control the expres-
sion of a large spectrum of target genes encoding transcrip-
tional factors and histone modifiers as well as regulators
of the cell cycle and development.16 It is tempting to
speculate that the skin-specific isoform of SMARCAD1
might target genes involved in dermatoglyph and sweat
gland development, two structures jointly affected in
the present family and in additional disorders such as
Naegeli-Franceschetti-Jadassohn and Rapp-Hodgkin (MIM
129400) syndromes.11,22 Regardless of the exact mecha-
nisms mediating the activity of the skin-specific isoform
of SMARCAD1 in the skin, the present results once again
underscore the fact that rare monogenic traits represent
an invaluable tool for the investigation of concealed
aspects of our biology.
Supplemental Data
Supplemental Data include one table and can be found with this
article online at http://www.cell.com/AJHG/.
Acknowledgments
We would like to acknowledge the participation of all family
members in this study. We would like to thank Sylvia Kiese for
her help. We wish to thank Gil Ast, Hadas Keren, and Mordechai
Choder for helpful discussions.
Figure 5. Consequences of Mutation c.378+1G>T
To assess the consequences of mutation c.378+1G>T on SMARCAD1 splicing, we used a minigene system.
(A) Schematic representation of the SMARCAD1 short isoform wild-type and mutation-carrying minigenes.
(B) Sequence analysis of RT-PCR products generated from HeLa cells transfected with wild-type and mutant minigene constructs. Transfection of the wild-type minigene resulted in the formation of one spliced variant containing exons 1 and 2 of the SMARCAD1 short isoform. In contrast, transfection of the mutant minigene resulted in two aberrant splice variants, containing an extra 51 bp from intron 1 or missing one G at the end of exon 1. A marked decrease in the level of expression of the spliced variants was also observed.
(C) Computational modeling predicts an altered mRNA secondary structure of both aberrant splice variants.
(D) Treatment with cycloheximide (at a concentration of 50 μg/ml for 24 hr), known to inhibit mRNA decapping, resulted in significantly increased levels of aberrant (but not wild-type) splice variants.
Received: June 7, 2011
Revised: July 4, 2011
Accepted: July 8, 2011
Published online: August 4, 2011
Web Resources
The URLs for data presented herein are as follows
dbSNP, http://www.ncbi.nlm.nih.gov/SNP/
Ensembl, http://www.ensembl.org/
GenBank, http://www.ncbi.nlm.nih.gov/Genbank/
GeneBee, http://www.genebee.msu.su/
Online Mendelian Inheritance in Man (OMIM), http://www.omim.org
Superlink, http://bioinfo.cs.technion.ac.il/superlink-online-twoloci/makeped/TwoLociMultiPoint.html
UCSC Genome Browser, http://genome.ucsc.edu/
References
1. Verbov, J. (1970). Clinical significance and genetics of
epidermal ridges—a review of dermatoglyphics. J. Invest. Der-
matol. 54, 261–271.
2. Warman, P.H., and Ennos, A.R. (2009). Fingerprints are
unlikely to increase the friction of primate fingerpads. J. Exp.
Biol. 212, 2016–2022.
3. Scheibert, J., Leurent, S., Prevost, A., and Debregeas, G. (2009).
The role of fingerprints in the coding of tactile information
probed with a biomimetic sensor. Science 323, 1503–1506.
4. Reed, T., Viken, R.J., and Rinehart, S.A. (2006). High herita-
bility of fingertip arch patterns in twin-pairs. Am. J. Med.
Genet. A. 140, 263–271.
5. Bokhari, A., Coull, B.A., and Holmes, L.B. (2002). Effect of
prenatal exposure to anticonvulsant drugs on dermal ridge
patterns of fingers. Teratology 66, 19–23.
6. Babler, W.J. (1991). Embryologic development of epidermal
ridges and their configurations. Birth Defects Orig. Artic. Ser.
27, 95–112.
7. Baird, H.W. (1968). Absence of fingerprints in four genera-
tions. Lancet 2, 1250.
8. Basan, M. (1965). Ectodermal dysplasia. Missing papillary
pattern, nail disorders and furrows on 4 fingers. Arch. Klin.
Exp. Dermatol. 222, 546–557.
9. Reed, T., and Schreiner, R.L. (1983). Absence of dermal ridge
patterns: Genetic heterogeneity. Am. J. Med. Genet. 16, 81–88.
10. Límova, M., Blacker, K.L., and LeBoit, P.E. (1993). Congenital
absence of dermatoglyphs. J. Am. Acad. Dermatol. 29, 355–358.
11. Lugassy, J., Itin, P., Ishida-Yamamoto, A., Holland, K., Huson,
S., Geiger, D., Hennies, H.C., Indelman, M., Bercovich, D.,
Uitto, J., et al. (2006). Naegeli-Franceschetti-Jadassohn
syndrome and dermatopathia pigmentosa reticularis: Two
allelic ectodermal dysplasias caused by dominant mutations
in KRT14. Am. J. Hum. Genet. 79, 724–730.
12. Sirinavin, C., and Trowbridge, A.A. (1975). Dyskeratosis con-
genita: Clinical features and genetic aspects. Report of a family
and review of the literature. J. Med. Genet. 12, 339–354.
13. Burger, B., Fuchs, D., Sprecher, E., and Itin, P. (2011).
The immigration delay disease: Adermatoglyphia-inherited
absence of epidermal ridges. J. Am. Acad. Dermatol. 64,
974–980.
14. Fishelson, M., and Geiger, D. (2002). Exact genetic linkage
computations for general pedigrees. Bioinformatics 18
(Suppl 1), S189–S198.
15. Adra, C.N., Donato, J.L., Badovinac, R., Syed, F., Kheraj, R.,
Cai, H., Moran, C., Kolker, M.T., Turner, H., Weremowicz, S.,
et al. (2000). SMARCAD1, a novel human helicase family-
defining member associated with genetic instability: Cloning,
expression, and mapping to 4q22-q23, a band rich in break-
points and deletion mutants involved in several human
diseases. Genomics 69, 162–173.
16. Okazaki, N., Ikeda, S., Ohara, R., Shimada, K., Yanagawa, T.,
Nagase, T., Ohara, O., and Koga, H. (2008). The novel protein
complex with SMARCAD1/KIAA1122 binds to the vicinity of
TSS. J. Mol. Biol. 382, 257–265.
17. Singh, G., and Cooper, T.A. (2006). Minigene reporter for
identification and analysis of cis elements and trans factors
affecting pre-mRNA splicing. Biotechniques 41, 177–181.
18. Roca, X., Sachidanandam, R., and Krainer, A.R. (2003).
Intrinsic differences between authentic and cryptic 5′ splice
sites. Nucleic Acids Res. 31, 6321–6333.
19. Day, D.A., and Tuite, M.F. (1998). Post-transcriptional gene
regulatory mechanisms in eukaryotes: An overview. J. Endo-
crinol. 157, 361–371.
20. Schwartz, D.C., and Parker, R. (1999). Mutations in translation
initiation factors lead to increased rates of deadenylation and
decapping of mRNAs in Saccharomyces cerevisiae. Mol. Cell.
Biol. 19, 5247–5256.
21. Schoor, M., Schuster-Gossler, K., Roopenian, D., and Gossler,
A. (1999). Skeletal dysplasias, growth retardation, reduced
postnatal survival, and impaired fertility in mice lacking the
SNF2/SWI2 family member ETL1. Mech. Dev. 85, 73–83.
22. Atasu, M., Akesi, S., Elcioglu, N., Yatmaz, P.I., and Ertas, E.B.
(1999). A Rapp-Hodgkin like syndrome in three sibs: Clinical,
dental and dermatoglyphic study. Clin. Dysmorphol. 8,
101–110.
REVIEW
Five Years of GWAS Discovery
Peter M. Visscher,1,2,* Matthew A. Brown,1 Mark I. McCarthy,3,4 and Jian Yang5
The past five years have seen many scientific and biological discoveries
made through the experimental design of genome-wide association
studies (GWASs). These studies were aimed at detecting
variants at genomic loci that are associated with complex traits
in the population and, in particular, at detecting associations
between common single-nucleotide polymorphisms (SNPs) and
common diseases such as heart disease, diabetes, auto-immune
diseases, and psychiatric disorders. We start by giving a number
of quotes from scientists and journalists about perceived problems
with GWASs. We will then briefly give the history of GWASs and
focus on the discoveries made through this experimental design,
what those discoveries tell us and do not tell us about the genetics
and biology of complex traits, and what immediate utility has
come out of these studies. Rather than giving an exhaustive review
of all reported findings for all diseases and other complex traits, we
focus on the results for auto-immune diseases and metabolic
diseases. We return to the perceived failure or disappointment
about GWASs in the concluding section.
Introduction: Have GWASs Been a Failure?
In the past five years, genome-wide association studies
(GWASs) have led to many scientific discoveries, and yet
at the same time, many people have pointed to various
problems and perceived failures of this experimental
design. Let us begin by considering a number of criticisms
that have been made against GWASs. We do not list these
quotes to discredit any of the scientists or journalists
involved, nor to deliberately cite them out of context.
Rather, they serve to confirm that the points we discuss
in this review are related to beliefs held by a significant
number of scientific commentators and therefore warrant
consideration.
From an interview with Sir Alec Jeffreys, ESHG Award
Lecturer 2010:
‘‘One of the great hopes for GWAS was that, in the
same way that huge numbers of Mendelian disorders
were pinned down at the DNA level and the gene
and mutations involved identified, it would be
possible to simply extrapolate from single gene disor-
ders to complex multigenic disorders. That really
hasn’t happened. Proponents will argue that it has
worked and that all sorts of fascinating genes that
predispose to or protect against diabetes or breast
cancer, for example, have been identified, but the
fact remains that the bulk of the heritability in these
conditions cannot be ascribed to loci that have
emerged from GWAS, which clearly isn’t going to
be the answer to everything.’’
From McClellan and King, Cell 2010 (ref. 1):
‘‘To date, genome-wide association studies (GWAS)
have published hundreds of common variants
whose allele frequencies are statistically correlated
with various illnesses and traits. However, the vast
majority of such variants have no established biolog-
ical relevance to disease or clinical utility for prog-
nosis or treatment.’’
‘‘An odds ratio of 3.0, or even of 2.0 depending on
population allele frequencies, would be robust to
such population stratification. However, odds ratios
of the magnitude generally detected by GWAS
(<1.5) can frequently be explained by cryptic popu-
lation stratification, regardless of the p value associ-
ated with them.’’
‘‘More generally, it is now clear that common risk
variants fail to explain the vast majority of genetic
heritability for any human disease, either individu-
ally or collectively (Manolio et al., 2009).’’
‘‘The general failure to confirm common risk vari-
ants is not due to a failure to carry out GWAS
properly. The problem is underlying biology, not
the operationalization of study design. The common
disease–common variant model has been the
primary focus of human genomics over the last
decade. Numerous international collaborative efforts
representing hundreds of important human diseases
and traits have been carried out with large well-char-
acterized cohorts of cases and controls. If common
alleles influenced common diseases, many would
have been found by now. The issue is not how to
develop still larger studies, or how to parse the data
still further, but rather whether the common
disease–common variant hypothesis has now been
tested and found not to apply to most complex
human diseases.’’
From Nicholas Wade in the New York Times, March 20
2011:
‘‘More common diseases, like cancer, are thought to
be caused by mutations in several genes, and finding
the causes was the principal goal of the $3 billion
1University of Queensland Diamantina Institute, Princess Alexandra Hospital, Brisbane, Queensland 4102, Australia; 2The Queensland Brain Institute, The
University of Queensland, Brisbane, Queensland 4072, Australia; 3Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford OX3 7BN, UK; 4Oxford Centre for Diabetes, Endocrinology and Metabolism, Churchill Hospital, Old Road, Headington, Oxford OX3 7LJ, UK; 5Queensland Institute of
Medical Research, 300 Herston Road, Brisbane, Queensland 4006, Australia
*Correspondence: [email protected]
DOI 10.1016/j.ajhg.2011.11.029. ©2012 by The American Society of Human Genetics. All rights reserved.
The American Journal of Human Genetics 90, 7–24, January 13, 2012 7
human genome project. To that end, medical genet-
icists have invested heavily over the last eight years
in an alluring shortcut. But the shortcut was based
on a premise that is turning out to be incorrect. Scien-
tists thought the mutations that caused common
diseases would themselves be common. So they first
identified the common mutations in the human
population in a $100 million project called the
HapMap. Then they compared patients’ genomes
with those of healthy genomes. The comparisons
relied on ingenious devices called SNP chips, which
scan just a tiny portion of the genome. (SNP,
pronounced ‘‘snip,’’ stands for single nucleotide
polymorphism.) These projects, called genome-wide
association studies, each cost around $10 million or
more. The results of this costly international exercise
have been disappointing. About 2,000 sites on the
human genome have been statistically linked with
various diseases, but in many cases the sites are
not inside working genes, suggesting there may be
some conceptual flaw in the statistics. And in most
diseases the culprit DNA was linked to only a small
portion of all the cases of the disease. It seemed that
natural selection has weeded out any disease-causing
mutation before it becomes common.’’
From Tim Crow, Molecular Psychiatry 2011 (ref. 2):
‘‘There comes a point at which the genetic skeptic
can be pardoned the suggestion that if the genes
are so small and so multiple, what they are hardly
matters, the dividing line between polygenes and
no genes is of little practical consequence. Have we
reached this point’’?
From a commentary article by Jonathan Latham, on
guardian.co.uk, 17 April 2011:
‘‘Among all the genetic findings for common
illnesses, such as heart disease, cancer and mental
illnesses, only a handful are of genuine significance
for human health. Faulty genes rarely cause, or even
mildly predispose us, to disease, and as a consequence
the science of human genetics is in deep crisis.
Since the Collins paper [Manolio et al. 2009 (ref. 3)] was
published nothing has happened to change that
conclusion. It now seems that the original twin-
study critics were more right than they imagined.
The most likely explanation for why genes for
common diseases have not been found is that, with
few exceptions, they do not exist.’’
These quotes raise a number of different issues about
the methodology, research outcomes, and utility of the
research findings. The pertinent points made in these
quotes are:
(1) GWASs are founded on a flawed assumption that
genetics plays an important role in the risk to
common diseases;
(2) GWASs have been disappointing in not explaining
more genetic variation in the population;
(3) GWASs have not delivered meaningful, biologically
relevant knowledge or results of clinical or any
other utility; and
(4) GWAS results are spurious.
In this review we will briefly give the history of GWASs
and then focus on the discoveries made through this
experimental design, what those discoveries tell us and
do not tell us about the genetics and biology of complex
traits, and what immediate utility has come out of these
studies. We will focus on the results for auto-immune
diseases and metabolic diseases, although there have
been important findings for other diseases and complex
traits. In the concluding section, we will again consider
the perceived failure or disappointment of GWASs.
What Are GWASs, and How Did We Get There?
Attempts to use linkage analysis to map genomic loci that
have an effect on disease or other complex traits have
been ubiquitous in the last two decades. Gene mapping
by linkage relies on the cosegregation of causal variants
with marker alleles within pedigrees. We define and
discuss what we mean by ‘‘causal’’ in Box 1. Because the
number of recombination events per meiosis is relatively
small, tagging a causal variant requires only a few genetic
markers per chromosome. The downside of the small
number of recombination events is that the mapping
resolution, i.e., how close to the causal variant one can
get through linked markers, is typically low. Linkage
mapping has been extremely successful in mapping genes
and gene variants affecting Mendelian traits (e.g., single-
gene disorders).4 Mapping loci underlying common
diseases and, in particular, identifying causative muta-
tions have had much less success. There are many reasons
for the failure of linkage analyses to reliably identify
complex-trait loci in human pedigrees. One reason is
that the effect sizes (‘‘penetrance’’) of individual causal
variants are too small to allow detection via cosegregation
within pedigrees.
GWASs are based upon the principle of linkage disequi-
librium (LD) at the population level. LD is the nonrandom
association between alleles at different loci. It is created by
evolutionary forces such as mutation, drift, and selection
and is broken down by recombination.5 Generally, loci
that are physically close together exhibit stronger LD
than loci that are farther apart on a chromosome. The
larger the (effective) population size, the weaker the LD
for a given distance.6 (Linkage analysis exploits the large
LD within pedigrees.) The genomic distance at which LD
decays determines how many genetic markers are needed
to ‘‘tag’’ a haplotype, and the number of such tagging
markers is much smaller than the total number of
segregating variants in the population. For example,
a selection of approximately 500,000 common SNPs in
the human genome is sufficient to tag common variation
in non-African populations, even though the total number
of common SNPs exceeds 10 million.7
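
As a concrete illustration of the LD definition above, the following minimal Python sketch computes the disequilibrium coefficient D and the squared correlation r² between two biallelic loci from a sample of phased haplotypes. The function name, the 0/1 allele coding, and the toy haplotype samples are assumptions made for this sketch and are not taken from any study discussed here.

```python
def ld_from_haplotypes(haplotypes):
    """Return (D, r2) for two biallelic loci.

    haplotypes: list of (a, b) tuples, one per sampled chromosome, with
    each allele coded 0 or 1 (a hypothetical coding used for this sketch).
    """
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n   # frequency of allele 1 at locus A
    p_b = sum(b for _, b in haplotypes) / n   # frequency of allele 1 at locus B
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    if not (0 < p_a < 1 and 0 < p_b < 1):
        raise ValueError("both loci must be polymorphic in the sample")
    # D measures the departure of the haplotype frequency from independence.
    d = p_ab - p_a * p_b
    # r2 is the squared correlation between the alleles at the two loci.
    r2 = d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))
    return d, r2


# Toy samples: alleles always co-occur (complete LD) versus
# alleles assorting independently (no LD).
perfect = [(1, 1)] * 4 + [(0, 0)] * 6
independent = [(1, 1)] * 4 + [(1, 0)] * 6 + [(0, 1)] * 4 + [(0, 0)] * 6

d_perfect, r2_perfect = ld_from_haplotypes(perfect)
d_indep, r2_indep = ld_from_haplotypes(independent)
print(f"complete LD: D={d_perfect:.3f}, r2={r2_perfect:.3f}")
print(f"no LD:       D={d_indep:.3f}, r2={r2_indep:.3f}")
```

In the first sample a genotyped marker would tag the second locus perfectly; in the second it would carry no information about it.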
Geneticists realized some time ago that they could
exploit population-based LD to map genes. For example,
Bodmer suggested in 1986 that fine-mapping using popu-
lation association could lead to closer linkage between
a causative mutation and a linked marker.82 However,
fine-mapping still relied on having an initial genomic loca-
tion that is obtained from linkage analysis in family
studies. What if we do not have any prior information
on genomic loci or, alternatively, we deliberately want an
unbiased scan of the genome? In a landmark paper, Risch
and Merikangas83 showed that performing an association
scan involving one million variants in the genome and
a sample of unrelated individuals could be more powerful
than performing a linkage analysis with a few hundred
markers. It took only 10 years before this theoretical design
became reality. What was needed was the discovery (accel-
erated by the sequencing of the human genome) of
hundreds of thousands of single-nucleotide variants, the
quantification of the correlation (LD) structure of those
markers in the human genome, and the ability to accu-
rately genotype hundreds of thousands of markers in an
automated and affordable manner. The LD structure was
investigated in the HapMap project,7 and the outcome
was a list of tag SNPs that captured most of the common
genomic variation in a number of human populations.
Concurrently, commercial companies produced dense
SNP arrays that could genotype many markers in a single
assay. The technological advances together with biobanks
of either population cohorts or case-control samples facili-
tated the ability to conduct GWASs.
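
The multiple-testing burden of a scan of this size can be made concrete with a small sketch. It assumes a Bonferroni correction and a family-wise error rate of 0.05, neither of which is specified in the passage above; the one million tests correspond to the number of variants mentioned for the association scan.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold that keeps the family-wise
    error rate at or below alpha across n_tests tests."""
    return alpha / n_tests


ALPHA = 0.05          # assumed family-wise error rate
N_TESTS = 1_000_000   # variants tested in the scan

threshold = bonferroni_threshold(ALPHA, N_TESTS)
# Expected number of false positives if every test were declared
# significant at the uncorrected level instead.
expected_false_positives = ALPHA * N_TESTS

print(f"per-variant threshold: {threshold:.1e}")
print(f"expected false positives at p < {ALPHA}: {expected_false_positives:.0f}")
```

Under these assumptions the per-variant threshold is of the order of the value conventionally used for genome-wide significance.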
Although GWASs are unbiased with respect to prior bio-
logical knowledge (or prior beliefs) and with respect to
genome location, they are not unbiased in terms of what
is detectable. GWASs rely on LD between genotyped
SNPs and ungenotyped causal variants. The strength of
statistical association between alleles at two loci in the
genome strongly depends on their allele frequencies,
such that a rare variant (say, one with a frequency <0.01)
will be in low LD (as measured by r²) with a nearby
common variant, even if they map to the same recombina-
tion interval.84 But the SNPs that are on the SNP chips
have been selected to be common (most have a minor
allele frequency >0.05). Therefore, GWASs are by design
powered to detect association with causal variants that
are relatively common in the population. Is it realistic to
assume common causal variants for disease segregate in
the population? This is discussed in Box 2.
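
The dependence of r² on allele frequencies can be sketched numerically. The function below returns the largest r² that two biallelic loci can reach given only their allele frequencies, using the standard bounds on the disequilibrium coefficient D; the function name and the example frequencies are illustrative assumptions.

```python
def max_r2(p_a, p_b):
    """Upper bound on r2 between two biallelic loci whose reference
    alleles have frequencies p_a and p_b."""
    q_a, q_b = 1 - p_a, 1 - p_b
    # Haplotype frequencies must lie between 0 and the allele
    # frequencies, which bounds D from above and from below.
    d_pos = min(p_a * q_b, q_a * p_b)   # largest positive D
    d_neg = min(p_a * p_b, q_a * q_b)   # largest magnitude of negative D
    d = max(d_pos, d_neg)
    return d * d / (p_a * q_a * p_b * q_b)


# A rare variant (frequency 0.01) next to a common SNP (frequency 0.30),
# compared with two loci of equal frequency.
rare_vs_common = max_r2(0.01, 0.30)
matched = max_r2(0.30, 0.30)
print(f"max r2, rare vs common: {rare_vs_common:.4f}")
print(f"max r2, matched frequencies: {matched:.4f}")
```

Even when the rare allele always sits on the same haplotype as the common allele, the attainable r² stays far below 1, whereas loci with matched frequencies can reach complete correlation.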
(Nearly) Five Years of Discovery
Although the first results from a GWAS were reported in
20058 and 2006,9 we take the 2007 Wellcome Trust Case
Control Consortium (WTCCC) paper in Nature10 as a starting
point. The reason for this is that the WTCCC study was
the first large, well-designed GWAS for complex diseases to
employ a SNP chip that had good coverage of the genome.
There are many ways to summarize the discoveries based
on GWASs in the last five years. We have tried to separate
the discoveries quantitatively and to focus on the biology.
There are now well over 2000 loci that are significantly and
robustly associated with one or more complex traits (see
GWAS catalog in Web Resources), as shown in Figure 1.
The vast majority of the loci identified are new, i.e., before
2007 their association with disease or other complex traits
Box 1. What Is a Causal Variant?
New mutations that contribute to an increase or
decrease in risk to disease arise in populations all
the time. Some of these mutations can reach an
appreciable frequency in the population, for
example by random drift or by natural selection.
As discussed in the main text, these mutations will
be associated with other variants in the genome
through LD. Such associations will include those
with SNPs that are genotyped on ‘‘SNP chips.’’
Because there are many more segregating variants
in the population than those genotyped in GWASs,
it is unlikely, but not impossible, that a mutation is
genotyped itself, and so its effect usually will be de-
tected through an association with a genotyped
variant. This genotyped variant can be robustly asso-
ciated with disease in multiple samples from the
same population, or even across populations, but it
is not the mutation that causes variation in risk.
The results from GWASs have shown that variants
at many genetic loci in the genome are associated
with disease, and these also reflect many ancestral
mutations with an effect on susceptibility to disease.
Therefore, the effect size (in terms of increasing or
decreasing the absolute probability of disease) is,
on average, small, and individual variants are
neither necessary nor sufficient to cause disease.
Herein lies the problem of defining ‘‘causal’’: How
do we prove that a particular mutation causes the
observed effect on variation in the population?
Engineering the same mutation in a cell or animal
model might give a relevant phenotype, but that is
not a proof. The mutation can have a direct effect
on gene expression in human tissues or be func-
tional in another way, but that doesn’t prove it has
a causal effect on disease risk. Operationally, in this
review what we mean by ‘‘causal variant’’ is an
(unknown) variant that has a direct or indirect func-
tional effect on disease risk, rather than a variant
that is associated with disease risk through LD,
even if we don’t have the tools available at present
to prove causality beyond reasonable doubt. Hence,
it is the variant that causes the observed association
signal.
was not known. Essentially, these are 2000 new biological
leads. The number of loci identified per complex trait
varies substantially, from a handful for psychiatric diseases
to a hundred or more for inflammatory bowel disease
(IBD1 [MIM 266600], including Crohn disease [CD]11
and ulcerative colitis [UC]12) and stature.13 Importantly,
the number of discovered variants is strongly correlated
with experimental sample size (Figure 2), which predicts
that an ever-increasing discovery sample size will increase
the number of discovered variants: very roughly, after
a minimum sample-size threshold below which no vari-
ants are detected is reached, a doubling in sample size leads
Box 2. The CDCV Hypothesis
Currently, the allele frequency of variants that
contribute to cause common disease is a subject of
some debate.85,86 The common disease-common
variant (CDCV) hypothesis is sometimes said to be
one side of this debate; the other side holds that
disease-causing alleles are typically rare. But what
is the precise ‘‘hypothesis’’ in the CDCV hypothesis?
We tried to find the origin of the CDCV hypothesis.
Many researchers cite either Lander87 or Risch and
Merikangas.83 We will add Chakravarti88 and Reich
and Lander89 as key studies. Lander87 noted from
the then-available data that there is a limited diver-
sity in coding regions at genes, in that most variants
are very rare, and therefore the effective number of
alleles is small. In addition, he provided ‘‘tantalizing
examples’’ of common alleles with large effects (for
example, such alleles include APOE [MIM 107741],
MTHFR [MIM 607093], and ACE [MIM 106180]).
Reich and Lander89 presented a theoretical popula-
tion-genetics model that predicted a relatively
simple spectrum of the frequency of disease risk
alleles at a particular disease locus. They (re)phrased
the CDCV hypothesis as the prediction that the ex-
pected allelic identity is high for those disease loci
that are responsible for most of the population risk
for disease. These studies did not appear to make
any prediction about the number of disease loci or,
therefore, about the effect size. What the authors
stated was that if a disease was common, there was
likely to be one disease-causing allele that was
much more common than all the other disease-
causing alleles at the same locus.87,89
Risch and Merikangas83 quantified two important
points regarding the detection of disease loci: first,
that detection by association is more powerful
than linkage when the genotype-relative risk is
modest or small and the risk-allele frequency is large
(say, >10%); and second, that the multiple-testing
burden of a genome scan by association does not
prevent the detection of genome-wide-significant
findings. This paper was essentially about experi-
mental design and statistical power (and hence feasi-
bility), not about the CDCV hypothesis as such.
Finally, Chakravarti88 pointed out that if individuals
with disease needed to be homozygous for risk vari-
ants at multiple loci, then the risk alleles at those
loci must be more common than they would be in
a model in which homozygosity at any risk locus is
sufficient to cause disease. We note that without
the assumption of strong epistasis on the scale of
liability, there is no need for risk variants to be
common. For example, Risch’s multilocus multipli-
cative model,90 which implies an additive model
on the log (risk) scale (it is one of the ‘‘exchangeable’’
models91), does not rely on a particular allelic spec-
trum of risk-allele frequencies.
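
The link between a multiplicative model on the risk scale and an additive model on the log scale can be written out as follows. The notation is introduced here for illustration only: $L$ loci, genotype $g_i$ at locus $i$, and a per-locus relative risk $\mathrm{RR}_i(g_i)$.

```latex
\[
  \mathrm{RR}(g_1,\dots,g_L) \;=\; \prod_{i=1}^{L} \mathrm{RR}_i(g_i)
  \quad\Longrightarrow\quad
  \log \mathrm{RR}(g_1,\dots,g_L) \;=\; \sum_{i=1}^{L} \log \mathrm{RR}_i(g_i).
\]
```

Nothing in this identity constrains the allele frequencies at the individual loci.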
What all these landmark papers have in common
is a remarkable foresight in predicting the GWAS era
well before the publication of the full draft of the
human genome sequence, the HapMap project, or
the availability of commercial genotyping. But
what can we conclude about the origin and specifics
of the CDCV hypothesis? As implicitly or explicitly
stated in these key papers, there is no strong predic-
tion about the exact allele-frequency spectrum of
risk variants in the genome, nor a prediction about
the effect size at any disease loci and hence about
the total number of risk alleles in the genome.
The current debate is about the frequency spec-
trum of disease-causing alleles. Phrasing the debate
as an either/or question is not very helpful because
examples of both common and rare alleles are
already known, but there is still an open question
as to whether most genetic variation contributing
to complex traits in the population is caused by
rare variants or common variants. A more general
question regards the spectrum of allele frequencies
of disease-causing alleles and the joint distribution
between risk-allele frequency and effect size. In the
special case of an evolutionarily neutral model and
a constant effective population size, most causal
variants that are segregating in the population will
be rare, but most heritability will be due to common
variants.79,92 The reason for this apparent paradox is
that the number of segregating variants is proportional
to 1/[p(1 − p)], where p is the allele frequency
of a risk-increasing allele (so the smaller p, the
more variants of that frequency), whereas the heritability
contributed at that frequency is proportional
to p(1 − p). The net effect is that the heritability is
distributed equally over all frequencies, and cumula-
tively most heritability is contributed by common
variants.
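The argument above can be checked with a short calculation. The following Python sketch is illustrative only: it takes the two proportionalities stated in the text at face value, and the lowest frequency `P_MIN` and the 1% threshold `RARE` are assumptions chosen for the example, not values from the cited papers. All function names are hypothetical.

```python
import math

def logit(p):
    """Log-odds of p; this is the antiderivative of 1/[p(1 - p)]."""
    return math.log(p / (1.0 - p))

def variant_share(a, b, p_min):
    """Share of segregating variants with risk-allele frequency in [a, b],
    assuming a variant density proportional to 1/[p(1 - p)] on [p_min, 1 - p_min]."""
    return (logit(b) - logit(a)) / (logit(1.0 - p_min) - logit(p_min))

def heritability_share(a, b, p_min):
    """Share of heritability contributed by frequencies in [a, b].
    Density 1/[p(1 - p)] times per-variant contribution p(1 - p) is constant,
    so the share is simply a ratio of interval lengths."""
    return (b - a) / (1.0 - 2.0 * p_min)

P_MIN = 1e-5  # assumed lowest frequency at which variants segregate
RARE = 0.01   # assumed threshold: "rare" means minor allele frequency below 1%

# Rare variants sit at both ends of the risk-allele frequency range.
rare_variants = (variant_share(P_MIN, RARE, P_MIN)
                 + variant_share(1.0 - RARE, 1.0 - P_MIN, P_MIN))
rare_h2 = (heritability_share(P_MIN, RARE, P_MIN)
           + heritability_share(1.0 - RARE, 1.0 - P_MIN, P_MIN))

print(f"share of variants that are rare:   {rare_variants:.2f}")
print(f"share of heritability from them:   {rare_h2:.3f}")
```

Under these assumed bounds, rare variants make up the majority of segregating variants yet contribute only a few percent of the heritability. The exact split depends on the chosen `P_MIN`.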
10 The American Journal of Human Genetics 90, 7–24, January 13, 2012
to a doubling of the number of associated variants discov-
ered. The proportion of genetic variation explained by
significantly associated SNPs is usually low (typically less
than 10%) for many complex traits, but for diseases such
as CD and multiple sclerosis (MS [MIM 126200]), and for
quantitative traits such as height and lipid traits, between
10% and 20% of genetic variance has been accounted for
(Table 1). In comparison to the pre-GWAS era, the propor-
tion of genetic variation accounted for by newly discov-
ered variants that are segregating in the population is large.
It is clear that for most complex traits that have been
investigated by GWAS, multiple identified loci have
genome-wide statistical significance, and thus it is likely
that there are (many) other loci that have not been identi-
fied because of a lack of statistical significance (false nega-
tives). Recently, researchers have developed and applied
methods to quantify the proportion of phenotypic varia-
tion that is tagged when one considers all SNPs simulta-
neously.12–14 These methods focus on estimation rather
than hypothesis testing and do not suffer from false
negatives caused by small effect sizes.15 Whole-genome
approaches to estimating genetic variation have shown
that approximately one-third to one-half of additive
genetic variation in the population is being tagged when
all GWAS SNPs are considered simultaneously.12–14 This
is a surprisingly large proportion given that evolutionary
theory predicts that most variants affecting disease risk
ought to be found at a low frequency in the population
if they affect fitness,16,17 and such risk variants would
not be in sufficient LD with the common SNPs to be
detected in GWASs.
Autoimmune Diseases
We concentrate on seven autoimmune diseases: ankylosing
spondylitis (AS [MIM 106300]), rheumatoid arthritis
(RA [MIM 180300]), systemic lupus erythematosus (SLE
[MIM 152700]), type 1 diabetes (T1D [MIM 222100]),
MS, CD, and UC. Table 2 summarizes the number of genes
that have been identified for these diseases. Across these
diseases, 19 loci (mainly related to human leukocyte
antigen) were known prior to 2007, and 277 have been
discovered from 2007 onward. The total of 277 includes
multiple counts of loci that have been implicated across a
number of diseases; such loci include BLK (MIM 191305),
TNFAIP3 (MIM 191163) and CD40 (MIM 109535).
Inflammatory bowel disease (IBD, not to be confused
here with identity by descent) is thought to arise from
dysregulation of intestinal homeostasis.18 GWASs of IBD
(CD and UC) have been highly successful in terms of
the number of loci identified (99 nonoverlapping loci in
Figure 1. GWAS Discoveries over Time
Data obtained from the Published GWAS Catalog (see Web Resources). Only the top SNPs representing loci with association p values < 5 × 10⁻⁸ are included, and so that multiple counting is avoided, SNPs identified for the same traits with LD r² > 0.8 estimated from the entire HapMap samples are excluded.
Figure 2. Increase in Number of Loci Identified as a Function of Experimental Sample Size
(A) Selected quantitative traits.
(B) Selected diseases.
The coordinates are on the log scale. The complex traits were selected with the criteria that there were at least three GWAS papers published on each in journals with a 2010–2011 journal impact factor > 9 (e.g., Nature, Nature Genetics, the American Journal of Human Genetics, and PLoS Genetics) and that at least one paper contained more than ten genome-wide significant loci. These traits are a representative selection among all complex traits that fulfilled these criteria.
total18), and a substantial proportion of familial risk, about
20%, has been accounted for.11,12,18 Twenty-eight risk loci
are shared between CD and UC, despite the fact that these
diseases display distinct clinical features, and it has been
suggested that the two diseases share pathways and are
part of a mechanistic continuum.18 There are also strong
overlaps between genes involved in CD and UC, AS,19
and psoriasis (MIM 177900), again suggesting shared aetio-
pathogenic mechanisms in these conditions. Pleiotropic
genetic effects are becoming increasingly widely identified,
including in classical autoimmune diseases.20 For example,
a coding variant in the gene PTPN22 (MIM 600716)
confers strong risk for T1D and RA as well as protection
against CD.18
Metabolic Diseases
In terms of metabolic diseases, we focus here specifically
on type 2 diabetes (T2D [MIM 125853]); fasting glucose
and insulin levels; body-mass index (BMI) and obesity;
and fat distribution. A recent review21 already covered
these complex traits, but we have updated that review
wherever necessary. Table 3 gives an overview of the
number of loci identified.
More than 20 major GWASs for T2D have been pub-
lished to date21–24, and there has been a cumulative tally
of around 50 genome-wide-significant hits,21,23,24 only
three of which were known before the GWAS era. Most
of these studies have involved individuals of European
descent; the latest published effort is from the DIAGRAM
(Diabetes Genetics Replication and Meta-analysis)
Consortium and includes more than 47,000 GWAS indi-
viduals and 94,000 samples for replication. More recently,
equivalent studies have emerged from samples of East
Asians,23,25–27 South Asians,22 and Hispanics,28,29 and
large studies involving African Americans and other major
ethnic groups are underway. Notwithstanding differences
in allele frequency and LD patterns, most of the signals
found in one ethnic group show some evidence of associ-
ation in others, indicating that the common-variant
signals identified by GWASs are likely to be the result of
widely distributed causal alleles that are of relatively high
frequency. This is an important observation because it
indicates that most of the GWAS-identified associations
for T2D reflect high LD with a causal variant that has
a small effect size rather than low LD with a causal variant
that has a large effect size. The largest common-variant
signal identified for T2D remains TCF7L2 (MIM 602228)
(detected just prior to the GWAS era30), which has a
per-allele odds ratio (OR) of around 1.35. The remaining
signals detected by GWAS have allelic ORs in the range
between 1.05 and 1.25. Collectively, the most-strongly
associated variants at these loci are estimated to explain
around 10% of familial aggregation of T2D in European
populations.
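To make the quoted effect sizes concrete, the sketch below converts a per-allele OR into genotype ORs. It assumes a multiplicative (log-additive) model on the odds scale, which is the usual reading of a per-allele OR but is an assumption here rather than something stated in the text; the function name is hypothetical.

```python
def genotype_odds_ratios(per_allele_or):
    """Odds ratios for carrying 0, 1, or 2 risk alleles, relative to 0 alleles,
    assuming each additional allele multiplies the odds by the same factor."""
    return [per_allele_or ** k for k in range(3)]

# Per-allele ORs quoted in the text: TCF7L2, and the range for the other signals.
for label, per_allele in [("TCF7L2", 1.35),
                          ("other signals, low end", 1.05),
                          ("other signals, high end", 1.25)]:
    none, het, hom = genotype_odds_ratios(per_allele)
    print(f"{label}: heterozygote OR {het:.2f}, homozygote OR {hom:.2f}")
```

Even for the largest signal, the implied odds for homozygous carriers are less than double those of non-carriers, which is consistent with the small share of familial aggregation explained.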
The MAGIC (Meta-Analysis of Glucose- and Insulin-
Related Traits Consortium) investigators have been
carrying out equivalent analyses focused on the identifica-
tion of variants influencing variation in glucose and
insulin levels in healthy nondiabetic individuals.31–33 Prior
to the GWAS era, the only compelling association signal
for fasting glucose levels was known at GCK (MIM
138079) (glucokinase),34 but GWAS in European samples
(46,000 GWAS and 76,000 replication samples) have
expanded that number to 16 (ref. 32). These variants explain
around 10% of the inherited variation in fasting glucose
levels. Only two signals (near GCKR [MIM 600842] and
IGF1 [MIM 147440]) were shown to influence fasting
insulin levels in the same analysis. Equivalent analyses
for 2h glucose33 (15,000 GWAS samples and up to 30,000
replication samples) identified further signals, including
variants near the GIP (MIM 137240) receptor (GIPR [MIM
137241]).
Before the GWAS era, the only robust association
between DNA sequence variation and either BMI or
weight involved low-frequency variants in MC4R (MIM
155541).35 Now, there are more than 30. In the most
recent study from the GIANT consortium,36 these analyses
extended to almost 250,000 samples, half of them in the
stage 1 GWAS, the remainder for replication. The largest
signal remains that at FTO (MIM 610966),37 where the
Table 1. Population Variation Explained by GWAS for a Selected Number of Complex Traits
Reference numbers are given in parentheses after each value.
Trait or Disease | h² Pedigree Studies | h² GWAS Hits^a | h² All GWAS SNPs^b
Type 1 diabetes | 0.9 (98) | 0.6 (99)^c | 0.3 (12)
Type 2 diabetes | 0.3–0.6 (100) | 0.05–0.10 (34) |
Obesity (BMI) | 0.4–0.6 (101,102) | 0.01–0.02 (36) | 0.2 (14)
Crohn’s disease | 0.6–0.8 (103) | 0.1 (11) | 0.4 (12)
Ulcerative colitis | 0.5 (103) | 0.05 (12) |
Multiple sclerosis | 0.3–0.8 (104) | 0.1 (45) |
Ankylosing spondylitis | >0.90 (105) | 0.2 (106) |
Rheumatoid arthritis | 0.6 (107) | |
Schizophrenia | 0.7–0.8 (108) | 0.01 (79) | 0.3 (109)
Bipolar disorder | 0.6–0.7 (108) | 0.02 (79) | 0.4 (12)
Breast cancer | 0.3 (110) | 0.08 (111) |
Von Willebrand factor | 0.66–0.75 (112,113) | 0.13 (114) | 0.25 (14)
Height | 0.8 (115,116) | 0.1 (13) | 0.5 (13,14)
Bone mineral density | 0.6–0.8 (117) | 0.05 (118) |
QT interval | 0.37–0.60 (119,120) | 0.07 (121) | 0.2 (14)
HDL cholesterol | 0.5 (122) | 0.1 (57) |
Platelet count | 0.8 (123) | 0.05–0.1 (58) |
a Proportion of phenotypic variance or variance in liability explained by genome-wide-significant and validated SNPs. For a number of diseases, other parameters were reported, and these were converted and approximated to the scale of total variation explained. Blank cells indicate that these parameters have not been reported in the literature.
b Proportion of phenotypic variance or variance in liability explained when all GWAS SNPs are considered simultaneously. Blank cells indicate that these parameters have not been reported in the literature.
c Includes pre-GWAS loci with large effects.
average between-homozygotes difference in weight is
around 2.5 kg. The effects at other loci are smaller, and
in combination, these variants explain no more than
1%–2% of overall variation in adult BMI (although this
percentage rises to almost 20% if the analysis is extended
to all GWA variants, not just those that reach genome-
wide significance14). As well as these studies of BMI and
obesity in population samples, there have been several
studies focused on extreme obesity phenotypes.38,39 The
genome-wide-significant loci thrown up by these efforts
only partially overlap with those emerging from popula-
tion-based studies, raising the possibility that some of
Table 2. Summary of GWAS Findings for Seven Autoimmune Diseases^a
Disease | Prior to 2007: Number of Loci | Prior to 2007: Loci | 2007 onward: Number of Loci | 2007 onward: Some or All of the Loci
Ankylosing spondylitis | 1 | HLA-B27 | 13 | IL23R, ERAP1, 2p15, 21q22, CARD9 (MIM 607212), IL12B (MIM 161561), PTGER4 (MIM 601586), IL1R2 (MIM 147811), TNFR1, TBKBP1 (MIM 608476), ANTXR2 (MIM 608041), RUNX3 (MIM 600210), KIF21B (MIM 608322)
Rheumatoid arthritis | 3 | HLA-DRB1, PADI4, CTLA4 | 30 | AFF3 (MIM 601464), BLK, CCL21 (MIM 602737), CD2/CD58 (MIM 186990/153420), CD28, CD40, FCGR2A (MIM 146790), HLA-DRB1, IL2/IL21 (MIM 147680/605384), IL2RA, IL2RB (MIM 146710), KIF5A/PIP4K2C, PRDM1 (MIM 603423), PRKCQ (MIM 600448), PTPRC (MIM 151460), REL (MIM 164910), STAT4 (MIM 600558), TAGAP, TNFAIP3, TNFRSF14, TRAF1/C5 (MIM 120900/601711), TRAF6 (MIM 602355), IL6ST (MIM 600694), SPRED2 (MIM 609292), RBPJ (MIM 147183), CCR6 (MIM 601835), IRF5 (MIM 607218), PXK (MIM 611450)
Systemic lupus erythematosus | 3 | HLA, PTPN22, IRF5 (MIM 607218) | 31 | BANK1 (MIM 610292), BLK (MIM 191305), C1q, C2 (MIM 613927), C4A/B (MIM 120820/120810), CRP (MIM 123260), ETS1 (MIM 164720), FcGR2A–FcGR3A (MIM 146790/146740), FcGR3B (MIM 610665), HIC2-UBE2L3 (MIM 607712/603721), IKZF1 (MIM 603023), IL10 (MIM 124092), IRAK1 (MIM 300283), ITGAM–ITGAX (MIM 120980/151510), JAZF1, KIAA1542/PHRF1, LRRC18-WDFY4, LYN (MIM 165120), NMNAT2 (MIM 608701), PRDM1 (MIM 603423), PTTG1 (MIM 604147), PXK (MIM 611450), RASGRP3 (MIM 609531), SLC15A4, STAT1 (MIM 600555), TNFAIP3, TNFSF4 (MIM 603594), TNIP1 (MIM 607714), TREX1 (MIM 606609), UHRF1BP1, XKR6
Type 1 diabetes | 4 | HLA, INS (MIM 176730), PTPN22, CTLA4 | 40 | RGS1, IL18RAP (MIM 604509), IFIH1 (MIM 606951), CCR5 (MIM 601373), IL2 (MIM 147680), IL7R, MHC, BACH2 (MIM 605394), TNFAIP3, TAGAP, IL2RA, PRKCQ (MIM 600448), INS (MIM 176730), ERBB3 (MIM 190151), 12q13.3, SH2B3 (MIM 605093), CTSH (MIM 116820), CLEC16A (MIM 611303), PTPN2 (MIM 176887), CD226 (MIM 605397), UBASH3A (MIM 605736), C1QTNF6, IL10 (MIM 124092), 4p15.2, C6orf173, 7p15.2, COBL (MIM 610317), GLIS3 (MIM 610192), C10orf59, CD69 (MIM 107273), 14q24.1, 14q32.2, IL27 (MIM 608273), 16q23.1, ORMDL3 (MIM 610075), 17q21.2, 19q13.32, 20p13, 22q12.2, Xq28
Multiple sclerosis | 1 | HLA | 52 | BACH2 (MIM 605394), BATF (MIM 612476), CBLB, CD40, CD58, CD6 (MIM 186720), CD86, CLEC16A (MIM 611303), CLECL1, CYP24A1, CYP27B1, DKKL1 (MIM 605418), EOMES (MIM 604615), EVI5 (MIM 602942), GALC (MIM 606890), HHEX (MIM 604420), IL12A, IL12B, IL22RA2, IL2RA, IL7, IL7R, IRF8, KIF21B (MIM 608322), MALT1, MAPK1 (MIM 176948), MERTK (MIM 604705), MMEL1, MPHOSPH9 (MIM 605501), MPV17L2, MYB (MIM 189990), MYC (MIM 190080), OLIG3 (MIM 609323), PLEK (MIM 173570), PTGER4 (MIM 601586), PVT1 (MIM 165140), RGS1, SCO2 (MIM 604272), SP140 (MIM 608602), STAT3, TAGAP, THEMIS (MIM 613607), TMEM39A, TNFRSF1A, TNFSF14 (MIM 604520), TYK2, VCAM1, ZFP36L1 (MIM 601064), ZMIZ1 (MIM 607159), ZNF767
Crohn’s disease | 4 | NOD2 (MIM 605956), IBD5 (MIM 606348), DRB1*0103, IL23R | 67 | SMAD3 (MIM 603109), ERAP2 (MIM 609497), IL10 (MIM 124092), IL2RA, TYK2, FUT2 (MIM 182100), DNMT3A (MIM 602769), DENND1B (MIM 613292), BACH2 (MIM 605394), ATG16L1 (MIM 610767)
Ulcerative colitis | 3 | DRB1*1502, DRB1*0103, IL23R | 44 | IL1R2 (MIM 147811), IL8RA-IL8RB, IL7R, IL12B, DAP (MIM 600954), PRDM1 (MIM 603423), JAK2 (MIM 147796), IRF5 (MIM 607218), GNA12 (MIM 604394), LSP1 (MIM 153432), ATG16L1 (MIM 610767)
Total | 19 | | 277 |
a The names of the loci are signposts and do not indicate that these loci are necessarily biologically relevant. A number of associated variants are distant from protein-coding genes.
the most extreme cases of obesity are driven by highly
penetrant, low-frequency variants. Variation at copy-
number variants (CNVs) has some impact on BMI. This is
true of common CNVs (the NEGR1 association seems likely
to be driven by a common CNV40) and also rarer CNVs for
which evidence is starting to accumulate (e.g., 16p CNV
and effect on morbid obesity and developmental delay41).
The adverse metabolic effects of obesity depend not
only on the overall level of adiposity but also on the distribu-
tion of fat around the body; visceral (abdominal) fat has
particularly adverse consequences for overall health. GWASs
of fat-distribution phenotypes (including waist circumference,
waist:hip ratio, and body-fat percentage studied in close
to 200,000 individuals) have revealed almost 20 loci with
genome-wide significance40,42–44 and relatively little overlap
with those loci influencing overall adiposity. As with BMI, the
proportion of variance explained by these loci is small
(around 1% after adjustment for BMI, age, and sex).
New Biology Arising from GWAS Discoveries
Autoimmune Diseases
Thus far nearly all genes associated with MS have been
involved in autoimmune pathways rather than in
neurologic degenerative diseases.45 Indeed, of the two
MS-associated genes involved in neurodegeneration, one
(KIF21B) is also associated with AS and CD, suggesting
that it is actually an autoimmunity gene. The genes
involved in MS include genes coding for components of
the cytokine pathway (CXCR5 [MIM 601613], IL2RA
[MIM 147730], IL7R [MIM 146661], IL7 [MIM 146660],
IL12RB1 [MIM 601604], IL22RA2 [MIM 606648], IL12A
[MIM 161560], IL12B [MIM 161561], IRF8 [MIM 601565],
TNFRSF1A [MIM 191190], TNFRSF14 [MIM 602746], and
TNFSF14 [MIM 604520]), costimulatory molecules
(CD37 [MIM 151523], CD40, CD58 [MIM 153420],
CD80 [MIM 112203], CD86 [MIM 601020], and CLECL1
[MIM 607467]), and signal-transduction molecules of
immunological relevance (CBLB [MIM 604491], GPR65
[MIM 604620], MALT1 [MIM 604860], RGS1 [MIM
600323], STAT3 [MIM 102582], TAGAP [MIM 609667],
and TYK2 [MIM 176941]). Interestingly, these genes mainly
implicate T-helper cells in MS pathogenesis.
Genetic findings have had a major impact on AS research
and therapeutics. The associations of the genes IL23R (MIM
607562)46 and IL12B19 have pointed to the involvement of
the IL-23R pathway, and hence IL-17-producing
Table 3. Summary of GWAS Findings for Metabolic Traits^a
Disease | Prior to 2007: Number of Loci | Prior to 2007: Loci | 2007 onward: Number of Loci | 2007 onward: Some or All of the Loci
Type 2 diabetes | 3 | PPARG, KCNJ11 (MIM 600937), TCF7L2 | 50 | NOTCH2 (MIM 600275), PROX1 (MIM 601546), GCKR, THADA (MIM 611800), BCL11A (MIM 606557), RBMS1 (MIM 602310), IRS1, ADAMTS9, ADCY5 (MIM 600293), IGF2BP2 (MIM 608289), WFS1, ZBED3, CDKAL1, DGKB (MIM 604070), JAZF1, GCK, KLF14, TP53INP1 (MIM 606185), SLC30A8 (MIM 611145), PTPRD (MIM 601598), CDKN2A, CHCHD9, CDC123, HHEX (MIM 604420), DUSP8 (MIM 602038), KCNQ1, CENTD2, MTNR1B, HMGA2 (MIM 600698), TSPAN8 (MIM 600769), HNF1A, ZFAND6 (MIM 610183), PRC1 (MIM 603484), FTO, SRR (MIM 606477), HNF1B (MIM 189907), DUSP9 (MIM 300134), CDCD4A, UBE2E2 (MIM 602163), GRB14 (MIM 601524), ST6GAL1 (MIM 109675), VPS26A (MIM 605506), HMG20A (MIM 605534), AP3S2 (MIM 602416), HNF4A (MIM 600281), SPRY2 (MIM 602466)
Body-mass index | 1 | MC4R | 30 | NEGR1 (MIM 613173), TNNI3K (MIM 613932), PTBP2 (MIM 608449), TMEM18 (MIM 613220), POMC, FANCL (MIM 608111), LRP1B (MIM 608766), CADM2 (MIM 609938), ETV5 (MIM 601600), GNPDA2 (MIM 613222), SLC39A8 (MIM 608732), HMGCR (MIM 142910), PCSK1, ZNF608, NCR3 (MIM 611550), HMGA1 (MIM 600701), LRRN6C, TUB (MIM 601197), BDNF, MTCH2 (MIM 613221), FAIM3 (MIM 606015), MTIF3, PRKD1 (MIM 605435), MAP2K5 (MIM 602520), FTO, SH2B1, GPRC5B (MIM 605948), KCTD15, GIPR, TMEM160
Glucose or insulin | 1 | GCK | 15 | GCKR, G6PC2, IGF1, ADCY5 (MIM 600293), MADD (MIM 603584), ADRA2A, CRY2 (MIM 603732), FADS1 (MIM 606148), GLIS3 (MIM 610192), SLC2A2, PROX1 (MIM 601546), C2CD4B (MIM 610344), DGKB (MIM 604070), GIPR, VPS13C (MIM 608879)
Fat distribution | 0 | | 20 | TBX15 (MIM 604127), LYPLAL1, IRS1, SPRY2 (MIM 602466), GRB14 (MIM 601524), STAB1 (MIM 608560), ADAMTS9, CPEB4 (MIM 610607), VEGFA (MIM 192240), TFAP2B (MIM 601601), LY86 (MIM 605241), RSPO3 (MIM 610574), NFE2L3 (MIM 604135), MSRA (MIM 601250), ITPR2 (MIM 600144), HOXC13 (MIM 142976), NRXN3 (MIM 600567), ZNRF3 (MIM 612062), PIGC (MIM 601730)
Total | 5 | | 107 |
a The names of the loci are signposts and do not indicate that these loci are necessarily biologically relevant. A number of associated variants are distant from protein-coding genes.
proinflammatory cell populations, in the aetiopathogene-
sis of AS. The involvement of this pathway in AS was not
considered until the genetic discoveries were reported.
The recent demonstration that ERAP1 (MIM 606832) poly-
morphisms are associated with HLA-B27-positive but not
HLA-B27-negative AS has shed important light on research
into the mechanism by which HLA-B27 induces AS; this
mechanism has remained an enigma since the discovery
of the association of HLA-B27 with AS in the early 1970s.
ERAP1 is involved in peptide processing before HLA class
I molecule presentation; the restriction of the association
of ERAP1 variants to HLA-B27-positive disease indicates
that HLA-B27 operates to cause AS by a mechanism
that involves peptide presentation. Protective variants of
ERAP1 have been shown to have lower peptide-processing
capacity and thus to reduce the amount of peptide avail-
able to HLA-B27.47 Thus HLA-B27 is more likely to cause
AS when it is processing more peptides.
The finding that PADI4 (MIM 605347) is associated with
RA focused research interest on the role of anti-citrulli-
nated peptide antibodies (ACPAs) and disease.48 PADI4 is
involved in the citrullination of peptides against which
ACPAs develop. The association of PADI4 variants with
RA therefore indicated that ACPAs are directly involved
in RA pathogenesis, not an indirect manifestation of
immune dysregulation in the disease. Subsequently, it
was discovered that the association of HLA-DRB1 (MIM
142857) with RA was restricted to ACPA-positive disease
and that there was a strong gene-environment interaction,
such that cigarette smoking increases the risk of ACPA-
positive but not ACPA-negative RA.49 Because ACPA-
positive disease is more severe than ACPA-negative disease
and has a greater propensity toward joint-damaging
erosion, this provided further evidence supporting public-
health measures against cigarette smoking.
The genetic loci identified for IBD through GWASs have
highlighted a number of pathways, including antibacterial
autophagy and signaling pathways (e.g., IL-10 signaling,
T-cell-negative regulators, and pathways involving B cells
and innate sensors).18 Some of these pathways were previ-
ously not suspected to be important for these diseases.
The roles of a number of pathways, for example the IL-23R
pathway, the autophagy pathway, and innate immunity,
have all emerged from hypothesis-generating genetics research,
not from immunology or hypothesis-driven research.
Similar advances could be described for many other
autoimmune diseases but are beyond the scope of this
review.
Metabolic Traits
Most loci affecting T2D and fasting glucose levels map to
regulatory sequences, and in many cases, the “causal” transcript,
i.e., the transcript responsible for mediating the
effect of the associated variants, is not yet known. At other
loci, a combination of coding variants, strong biological
candidates, and/or cis expression QTL data has defined
the transcript through which the effect is mediated
(HNF1A [MIM 142410], GCK, IRS1 [MIM 147545], WFS1
[MIM 606201], PPARG [MIM 601487], CAMK1D [MIM
607957], JAZF1 [MIM 606246], KLF14 [MIM 609393] and
others) as a first step to inferring biology.50 Some of these
stories are now starting to be fleshed out into biological
mechanisms (e.g., KLF14; ref. 51).
There is incomplete overlap with the loci influencing
physiological variation in glucose and insulin. Some loci
(e.g., MTNR1B [MIM 600804]) have a relatively large effect
on both, whereas others (e.g., G6PC2 [MIM 608058])
influence fasting glucose levels but have a minimal effect
on T2D risk. Still others (e.g., CDKN2A and CDKN2B
[MIM 600160 and 600431]) impact T2D and have surpris-
ingly modest effects on fasting glucose levels in healthy,
nondiabetic individuals32,33,50. Most of these loci appear
to have their primary effect on the function of beta cells
rather than on insulin resistance, highlighting the impor-
tance of the former with respect to normal and abnormal
glucose homeostasis.50 Of the subset of loci (including
PPARG, KLF14, and ADAMTS9 [MIM 605421]) shown to
influence T2D risk through a primary effect on insulin
resistance, only FTO seems to act primarily through an
effect on obesity.50 Several of the T2D loci overlap genes
that are known to harbor rare variants responsible for
penetrant, monogenic forms of diabetes (such genes
include KCNQ1 [MIM 607542], PPARG, HNF1A, GCK,
and WFS1), indicating that multiple causal variants at
the same locus segregate in the population at different
frequencies. There is overlap between signals influencing
T2D risk and those influencing body weight (CDKAL1
[MIM 611259] and ADCY5 [MIM 600293]) indicating
that some of the observed epidemiological associations
between these traits are attributable to shared suscepti-
bility variants.52
Whereas many of the fasting-glucose and fasting-insulin
signals map near strong biological candidates for relevant
traits (such candidate genes include IRS1, IGF1, ADRA2A
[MIM 104210], SLC2A2 [MIM 138160], GCK and GCKR)
and fit within established models of our understanding
of islet biology, this is far from the case with the loci iden-
tified for T2D. Efforts to demonstrate that the genes
mapping close to T2D risk loci are enriched for particular
pathways or processes have met with only limited success;
the most robust finding yet has been in relation to
cell-cycle regulation (and was consistent with a model in
which the regulation of islet mass is a key component of
risk50). Either T2D is especially heterogeneous or else key
aspects of its pathophysiology are as yet poorly codified
in existing databases.
As for T2D and fasting glucose, most of the signals for
obesity and fat distribution map to regulatory sequences, and the
causal transcript is known at only a minority of the loci.
Signals influencing BMI appear to be enriched for genes
implicated in neuronal processes, whereas those influ-
encing fat distribution seem to be more closely related to
adipose development.36,43 Overlap with signals and genes
implicated in more severe forms of disease (morbid obesity,
lipodystrophy) is seen at some loci (PCSK1 [MIM 162150],
POMC [MIM 176830], BDNF [MIM 113505], MC4R, and
SH2B1 [MIM 608937]) but is far from complete (some
loci implicated in extreme obesity case-control studies
show no association with BMI at the population level36).
The strongest signal for overall adiposity is the one mapping
to FTO.37 FTO is thought to be a DNA demethylase,53
but its function is poorly understood. Murine models
demonstrate that modulation of Fto expression is associ-
ated with changes in body weight,54–56 but no direct
evidence linking coding variants in FTO in humans to
body-weight variation has been demonstrated. For the
time being, FTO remains the strongest candidate, but
the role of other genes (e.g., RPGRIP1L [MIM 610937]) in
the region cannot be discounted. This example demon-
strates the difficulties that remain in relating GWAS signals
to downstream biology. Fat distribution is a strongly
gender-dimorphic phenotype, and many of the signals
associated with fat distribution seem to have a selective
effect on this phenotype in women.43
Quantitative Traits
In addition to having been performed on the quantitative
traits discussed previously (e.g., BMI and fasting-glucose
and -insulin levels), GWASs have been done on a number
of quantitative risk factors for disease and for traits that
are models for the genetic architecture of complex traits.
For bone mineral density (BMD), a risk factor for osteopo-
rotic fracture, a total of 34 loci, together explaining ~5% of
narrow sense heritability, have been identified (Estrada
et al., abstract presented at the American Society for Bone
and Mineral Research 2010 Annual Meeting, published
in J. Bone Miner. Res. 25 [Suppl S1], p. 1243). Among these
genes, there is a major over-representation of genes in the
Wnt-signaling pathway, which was first implicated in oste-
oporosis (MIM 166710) from studies in families with high
or low BMD phenotypes. Many other examples exist in
osteoporosis and other human diseases in which GWASs
have demonstrated that more-prevalent but less-severe
genetic variants in genes initially identified from studies
of severe familial diseases have proven to be important in
the risk of disease in the general population. For human
height, a combined discovery and validation cohort of
~180,000 samples identified 180 robustly associated loci,
many in meaningful biological pathways and with evi-
dence for multiple segregating variants at the same loci.13
Together these loci explain approximately 12%–14% of
additive genetic variation (~10% of phenotypic variation).
A meta-analysis of more than 100,000 individuals of
European ancestry detected a total of 95 loci significantly
associated with plasma concentrations of cholesterol
and triglycerides, known risk factors for coronary artery
disease,57 and it provided evidence that the GWAS loci
were of biological and clinical relevance. A meta-analysis
from the HaemGen consortium on platelet count and
platelet volume, which are endophenotypes for myo-
cardial infarction (MIM 608446), discovered 68 loci.58
When the genes of a number of these loci were silenced
in Drosophila, 11 showed a clear platelet phenotype. These
genes are previously unknown regulators of blood cell
formation. The identification of so many loci has uncov-
ered new gene functions in megakaryopoiesis and platelet
formation. That is, new biology has resulted directly from
the identification of SNPs that are associated with variation
in platelet phenotypes.
Across these quantitative traits, a number of loci discov-
ered through GWASs were known to be a mutational target
for those traits because Mendelian forms with extreme
phenotypes existed. Taken together, the inference from
quantitative traits in terms of the (large) number of loci
involved, the allelic frequency spectrum of associated vari-
ants, and the nature of the candidate genes suggest that
models arising from quantitative traits appropriately
reflect the genetic architecture of disease and reinforce
the emerging evidence that it is the cumulative effect of
many loci that underlies susceptibility to disease.
From GWAS to Translation: Clinical Relevance
Autoimmune Diseases
Many of the MS-associated genes discovered by GWASs
represent excellent potential therapeutic targets. Of partic-
ular note is the identification of two genes involved in
vitamin D metabolism (CYP27B1 [MIM 609506] and
CYP24A1 [MIM 126065]). This identification might help
to explain the latitudinal variation in MS incidence—i.e.,
higher MS prevalence at more extreme latitudes is most
likely due to higher rates of vitamin D deficiency. Two
other identified genes are already targets of MS therapies,
highlighting the relevance of the findings to the disease
pathogenesis (natalizumab targets VCAM1 [MIM
192225], and daclizumab targets IL2RA). The findings for
AS have stimulated the trial of therapies against identified
pathways. Anti-IL-17 treatment has been shown in a phase
2 trial to have efficacy equivalent to that of the current gold-standard
treatment, TNF inhibition, in the treatment of AS.
The relevance of the RA-related genetic findings to thera-
peutic development is highlighted by the fact that some
existing therapies already target genes or gene pathways
highlighted by the genetic associations with RA; such ther-
apies include those involving TNF inhibitors (e.g., inflixi-
mab) and co-stimulation inhibitors (e.g., abatacept).
Abatacept is a fusion protein of CTLA-4 and immunoglob-
ulin. It acts by preventing costimulation of T-helper cells
by the binding of the T cell’s CD28 protein to the B7
protein on the antigen-presenting cell. CTLA4 (MIM
123890) and CD28 (MIM 186760) polymorphisms are
associated with RA. The RA-associated genes include
many involved in the NF-κB signaling pathway and
place this pathway at the center of RA pathogenesis. As
in MS, mouse research prior to the genetic discoveries
had implicated the IL-23-dependent Th17-lymphocyte
pathway in RA pathogenesis. To date there has been very
little genetic support for this with regard to human
diseases, in contrast to the situation in seronegative
diseases such as AS, psoriasis and IBD, where strong genetic
associations exist and treatments targeting the pathway
are in clinical use.
Metabolic Diseases
The main relevance of GWASs lies in the insights into
disease biology (see above) and the potential for clinical
translation through novel approaches to the diagnosis,
prevention, treatment, and monitoring of disease. This
will take some time, in particular given that most GWAS
discoveries were made in the last few years. The predictive
power of disease risk ascertained from genetic data remains
poor because for most diseases only a small proportion of
additive genetic variation has been accounted for.
Although for T2D it is possible to identify individuals
who are at the extremes of the genotype risk score distribu-
tion and who differ appreciably in T2D risk (they have
twice or half the average risk for the upper and lower
1%–2%, respectively), many of these would already be
identifiable on the basis of classical risk factors. In fact,
when using receiver operating characteristic (ROC) anal-
yses, BMI and age do a far better job of discrimination
than the genetic variants so far discovered.59 This may
change as low frequency and rare causal alleles are found.
Although individual prediction is not yet practical with
the variants at hand, it should be possible to identify
groups of individuals who are at a substantially greater-
than-average risk for diabetes, and this might be of value,
for example, with respect to clinical-trial enrichment.
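As a hedged illustration of the comparison described above, the following Python sketch computes an additive genotype risk score and the area under the ROC curve (AUC) that is used to judge discrimination. The function names, the per-allele weights, and the genotypes are hypothetical toy values chosen only to exercise the code; they are not taken from any study cited here, and a real analysis would use estimated effect sizes and an independent validation sample.

```python
from itertools import product


def genotype_risk_score(risk_allele_counts, log_odds_ratios):
    """Additive score: sum over SNPs of risk-allele count (0, 1, or 2)
    multiplied by an assumed per-allele log odds ratio."""
    return sum(c * b for c, b in zip(risk_allele_counts, log_odds_ratios))


def auc(case_scores, control_scores):
    """Area under the ROC curve, computed as the probability that a randomly
    chosen case scores higher than a randomly chosen control (ties count 1/2)."""
    wins = 0.0
    for case, control in product(case_scores, control_scores):
        if case > control:
            wins += 1.0
        elif case == control:
            wins += 0.5
    return wins / (len(case_scores) * len(control_scores))


# Toy, made-up inputs (assumptions for illustration only).
toy_weights = [0.10, 0.20, 0.05]                   # hypothetical log odds ratios
toy_cases = [[2, 1, 1], [1, 2, 0], [2, 2, 2]]      # risk-allele counts per SNP
toy_controls = [[0, 1, 1], [1, 0, 2], [1, 2, 0]]

case_scores = [genotype_risk_score(g, toy_weights) for g in toy_cases]
control_scores = [genotype_risk_score(g, toy_weights) for g in toy_controls]
toy_auc = auc(case_scores, control_scores)
print(f"AUC on toy data: {toy_auc:.3f}")
```

An AUC of 0.5 corresponds to no discrimination and 1.0 to perfect discrimination; the same function could be applied to scores built from BMI and age to make the comparison described in the text.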
One obvious route to early translation involves the iden-
tification of diagnostic biomarkers on the basis of the
processes that have been uncovered. These may have
predictive impact well beyond the genetic variants that
led to their discovery. This was recently demonstrated by
a GWAS of C-reactive protein (CRP) levels; that study
found that common variants near the HNF1A gene were
associated with variation in CRP.60 The authors asked
whether rare HNF1A mutations that are causal for the
Mendelian MODY (MIM 606391) subtype of diabetes are
also associated with differences in CRP levels and whether
it would be possible to use CRP levels as a diagnostic
marker to help identify individuals who have early-onset
diabetes and who are likely to have HNF1A-MODY (and
to direct those individuals to sequence-based diagnostics).
They were able to show marked differences in CRP levels
between HNF1A-MODY and other types of diabetes and
demonstrated that diagnosis based on CRP levels has
a discriminative accuracy of more than 80% for this diagnostic
classification.61,62 Otherwise, GWAS findings have
as yet had no impact on therapeutic optimization. Recent
studies have identified variants that influence therapeutic
response to metformin63 and might herald better under-
standing of how these drugs work.
New Science Facilitated by GWASs
Although the GWAS approach was designed for the detec-
tion of associations between DNA markers and disease, as
a by-product such studies have generated new scientific
discoveries. A detailed description and discussion is outside
the scope of this review, and we highlight only a few of
these advances: the discovery of genes affecting genetic
recombination and their correlation with natural selection64–66
and new insights into human population structure
and evolution.67–73
Interpretation of GWAS Results
GWASs conducted in the last five years were designed and
powered to detect associations through LD between geno-
typed (or imputed) common SNP markers and unknown
causal variants. What do the results imply in terms of vari-
ance explained in the population, common versus rare
variants underlying complex traits, and the nature of
complex-trait variation and evolution? It is too early to
be able to quantify the joint distribution of risk-allele
frequencies and their effect sizes because there are very
few causal variants identified by GWAS and because
systematic study of rare variants (through exome or
whole-genome sequencing) is in an early stage. To under-
stand the allelic spectrum of risk variants and thereby
inform optimal design of experiments aiming to detect
causal variants, one must differentiate between two expla-
nations for observed associations between genotyped
common SNPs and disease: the association can be caused
by one or more causal variants that have large effect sizes
and are in low LD with the genotyped SNPs, or it can be
caused by causal variants that have small effects and are
in high LD with the genotyped SNPs. Low LD occurs
when the allele frequencies of the unknown causal vari-
ants and those at the genotyped SNPs are very different
from each other, for example when the allele frequency
of causal variants is much lower than that of the SNPs.
For a single robustly associated SNP in a homogeneous
population, we cannot distinguish between the hypoth-
eses that the association signal is caused by a rare variant
of large effect or a common variant with small effect.
However, variants at multiple loci and GWASs in other
ethnic populations help to narrow the boundaries of the
genetic architecture of diseases. At this point in time, we
can conclude that
(1) Many loci contribute to complex-trait variation
(e.g., Figure 2).
(2) At a number of identified risk loci, there are multiple
alleles associated with disease at a wide range of
frequencies.
(3) There is evidence for pleiotropy, i.e., that the same
variants are associated with multiple traits.66,74,75
(4) A number of variants associated with disease or
complex traits in one ethnic population are also
associated with the same disease or traits in other
populations (see above for T2D examples).
(5) The hypothesis76 that causal variant(s) that lead to
the association between common SNPs and disease
are mostly rare (say, have an allele frequency of 1%
or lower) is not consistent with theoretical and empirical
results.77,78 In particular, there is no widespread
evidence for the existence of ‘‘synthetic associations’’
(see Box 3). Numerically, we expect that most causal
variants that segregate in the population are rare,
consistent with evolutionary theory, but the propor-
tion of genetic variation that these variants cumula-
tively explain depends on their correlation with
fitness.79
(6) A surprisingly large proportion of additive genetic
variation is tagged when all SNPs are considered
simultaneously.12–14
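The earlier point in this section, that LD is necessarily low when the allele frequencies of the causal variant and the genotyped SNP differ, can be made concrete with a standard bound on the squared LD correlation. The Python sketch below assumes two biallelic variants in complete LD (|D'| = 1), with the rarer allele carried only on haplotypes that also carry the more common allele. The function name and the example frequencies (0.5% and 30%) are illustrative assumptions, not values from this review.

```python
def max_r2(p_a, p_b):
    """Upper bound on r^2 between two biallelic variants whose coupled
    alleles have population frequencies p_a and p_b, assuming |D'| = 1.

    With p_lo <= p_hi, the largest possible D is p_lo * (1 - p_hi), so
    r^2_max = p_lo * (1 - p_hi) / (p_hi * (1 - p_lo)).
    """
    if not (0.0 < p_a < 1.0 and 0.0 < p_b < 1.0):
        raise ValueError("allele frequencies must lie strictly between 0 and 1")
    p_lo, p_hi = sorted((p_a, p_b))
    return p_lo * (1.0 - p_hi) / (p_hi * (1.0 - p_lo))


# Hypothetical example: a causal allele at 0.5% tagged by a SNP allele at 30%.
bound = max_r2(0.005, 0.30)
print(f"maximum r^2: {bound:.4f}")
```

The bound equals 1 only when the two frequencies match, which is why a rare causal variant can be only weakly tagged by a common SNP even in the most favorable haplotype configuration.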
The Cost of GWASs
If we assume that the GWAS results from Figure 1 represent
a total of 500,000 SNP chips and that on average a chip
costs $500, then this is a total investment of $250 million.
If there are a total of ~2,000 loci detected across all traits,
then this implies an investment of $125,000 per discov-
ered locus. Is that a good investment? We think so: The
total amount of money spent on candidate-gene studies
and linkage analyses in the 1990s and 2000s probably
exceeds $250M, and they in total have had little to show
for it. Also, it is worthwhile to put these amounts in
context. $250M is of the order of the cost of one or two
stealth fighter jets and much less than the cost of a single
navy submarine. It is a fraction of the ~$9 billion cost of
the Large Hadron Collider. It would also pay for about
100 R01 grants. Would those 100 non-funded R01 grants
have made breakthrough discoveries in biology and medi-
cine? We simply can’t answer this question, but we can
conclude that a tremendous number of genuinely new
discoveries have been made in a period of only five years.
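The arithmetic behind these figures is simple enough to check directly. The short Python sketch below uses only the numbers assumed in the text (500,000 chips, $500 per chip, about 2,000 loci, a $9 billion collider, and 100 R01 grants); the function and variable names are ours and purely illustrative.

```python
def gwas_investment(n_chips, cost_per_chip, n_loci):
    """Return (total cost, cost per discovered locus) in dollars."""
    total = n_chips * cost_per_chip
    return total, total / n_loci


# Figures assumed in the text above.
total_cost, cost_per_locus = gwas_investment(500_000, 500, 2_000)

share_of_lhc = total_cost / 9e9          # fraction of the ~$9 billion LHC cost
implied_cost_per_r01 = total_cost / 100  # implied average cost of one R01 grant

print(f"total: ${total_cost:,}; per locus: ${cost_per_locus:,.0f}")
print(f"share of LHC cost: {share_of_lhc:.1%}")
print(f"implied cost per R01: ${implied_cost_per_r01:,.0f}")
```

Changing any of the three inputs (for example, a lower chip price) rescales the per-locus cost proportionally, so the conclusion is only as firm as these rough assumptions.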
Concluding Comments
In this review we have attempted to summarize the
tremendous quality and quantity of discoveries that have
been made by GWASs in the last five years. Because of
space limitations, we have been able to discuss only
a subset of diseases and have not mentioned those made
in common cancers, pediatric diseases, and ophthalmolog-
ical diseases, to name but a few. We now return to the
Box 3. Synthetic Associations
Dickson and colleagues suggested that the observed
association between a common SNP and a complex
trait might result when one or more rare variants at
the locus are in LD with that SNP.76,93 Because the
properties of LD prevent common SNP alleles and rare
causal variants from being highly correlated,84
the hypothesis of ‘‘synthetic’’ associations
implies that the effect sizes of the causal variants
are much larger than the effect size observed at the
common SNP and suggests that (re)sequencing
studies might detect such variants. The hypothesis
is not about whether GWASs work as an experi-
mental design but what the likely interpretation of
GWAS hits is in terms of the allele spectrum of causal
risk alleles. Are empirical data consistent with this
hypothesis? Several lines of evidence suggest that
associations observed at common SNPs
are rarely due to synthetic associations with
rare variants. First, because the LD correlation
between common and rare variants is so low (typi-
cally 0.01–0.02), synthetic associations imply that
variation explained by the causal variants at the
locus is 50–100 times larger than the variance ex-
plained at the genotyped SNP.78 So, if the SNP
explains 0.1% of phenotypic variation in the popu-
lation, the causal variant would explain 5%–10%.
But as shown in this review, for many complex traits
and diseases tens to hundreds of common variants
are identified, and so their combined effects would
explain too much variation if synthetic associations
were the norm. Second, empirical data from
(re)sequencing studies and trans-ethnic mapping
suggest that both common and rare variants
contribute to disease risk.77 At most loci detected
by GWASs, there is no evidence (despite extensive
genotyping and/or re-sequencing) that the
common-variant signal is driven by low-frequency
or rarer variants. Where rare risk alleles are uncov-
ered at the same loci, they seem much more likely
to be independent signals.94–96
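The first argument above rests on a simple scaling: the variance explained at a tagging SNP is roughly the variance explained by the causal variant multiplied by the squared LD correlation between them. The Python sketch below works through that scaling, treating the quoted 0.01–0.02 as r² values (which is what the 50–100-fold figure implies). The function name and the choice of 50 loci are illustrative assumptions; the text says only ‘‘tens to hundreds’’.

```python
def implied_causal_variance(snp_variance, r2):
    """Variance explained by the causal variant(s) implied by the variance
    explained at a tagging SNP, given their squared LD correlation r2."""
    if not 0.0 < r2 <= 1.0:
        raise ValueError("r2 must be in (0, 1]")
    return snp_variance / r2


snp_variance = 0.001  # the SNP explains 0.1% of phenotypic variation

low = implied_causal_variance(snp_variance, 0.02)   # 50-fold inflation
high = implied_causal_variance(snp_variance, 0.01)  # 100-fold inflation
print(f"implied causal variance per locus: {low:.0%} to {high:.0%}")

# Hypothetical: if 50 such loci were all synthetic associations, the implied
# total would exceed 100% of phenotypic variation, which is impossible.
n_loci = 50
print(f"implied total over {n_loci} loci: {n_loci * low:.0%} to {n_loci * high:.0%}")
```

Under these assumptions the implied totals are several times larger than the whole phenotypic variance, which is the sense in which synthetic associations cannot be the norm.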
Together these observations point to a highly
polygenic model of disease susceptibility with causal
variants across the entire range of the allele-
frequency spectrum. By ‘‘polygenic,’’ we mean that
segregating variants at many genomic loci (tens,
hundreds, or even thousands) contribute to genetic
variation for susceptibility in the population. The
observations imply that, for most common complex
diseases, nearly everyone in the population carries
some risk alleles and that affected individuals are
likely to have a different portfolio of risk alleles.79
They also imply that any single risk allele is neither
necessary nor sufficient to cause disease. For the
etiology of disease, these observations provide
empirical evidence to support a threshold or burden
model involving multiple variants and environ-
mental factors, and they appear to be inconsistent
with a single cause (e.g., a single mutation). A
rare-variant-only model of disease, characterized by locus
heterogeneity and rare mutations of large effects and
proposed by, for example, McClellan and King,1 is
not consistent with empirical observations.77,79,97
perceived failure of GWASs as summarized in the introduc-
tory section:
(1) Is the GWAS approach founded on a flawed assumption
that genetics plays an important role in the risk for
common diseases? Pedigree studies, including those
involving twins, suggest that a substantial propor-
tion of variation in susceptibility for common
disease is due to genetic factors. The proportion of
total variation explained by genome-wide-signifi-
cant variants has reached 10%–20% for a number
of diseases, and clearly there are additional variants
with such small effect sizes that they have not been
detected with stringent significance. As reviewed
here, many of the detected loci are in biologically
meaningful pathways for the diseases investigated.
Whole-genome analyses involving GWAS data
have estimated that 20%–50% of phenotypic varia-
tion is captured when all SNPs are considered simul-
taneously for a number of complex diseases and
traits. These estimates are based on population-
wide studies and provide a lower limit of the total
proportion of phenotypic variation due to genetic
factors. Inference from GWASs is independent of
inference drawn from close relatives (pedigree/
family studies), and therefore these studies have
provided independent evidence for the role of
genetics in common diseases.
(2) Have GWASs been disappointing in not explaining more
genetic variation in the population? This criticism
implies that the aim of GWASs is to explain all
genetic variation. This is a misrepresentation of
the objective of GWASs. As was the aim of linkage
studies in pedigrees for complex diseases prior to
the GWAS era, the aim of GWASs is to detect loci
that are associated with complex traits. The detec-
tion of such loci has led to the discovery of new bio-
logical knowledge about disease—knowledge that
was absent only five years ago. But even ignoring
the aim of GWASs, for a number of complex traits
the proportion of genetic variation uncovered by
GWASs is actually substantial. For example, for
T2D, MS, and CD, approximately 10%, 20%, and
20%, respectively, of genetic variation in the popu-
lation has been accounted for. Apart from diseases
with a known major locus (which is usually the
major histocompatibility locus), the baseline of
variation explained five years ago was essentially
zero.
(3) Have GWASs delivered meaningful biologically relevant
knowledge or results of clinical or any other utility? As
we have highlighted in this review, the answer to
this question is a definite ‘‘yes.’’ For example, the
discovery of the importance of the autophagy
pathway in Crohn disease, the IL-23R pathway in
rheumatoid arthritis, and factor H in age-related
macular degeneration (MIM 610149)9 have given
important biological insight with direct clinical
relevance. Hunter and Kraft put it this way back in
2007: ‘‘There have been few, if any, similar bursts
of discovery in the history of medical research.’’80
(4) Are GWAS results spurious? The combination of large
sample sizes and stringent significance testing has
led to a large number of robust and replicable asso-
ciations between complex traits and genetic vari-
ants, many of which are in meaningful biological
pathways. A number of variants or different variants
at the same loci have been shown to be associated
with the same trait in different ethnic populations,
and some loci are even replicated across species.81
The combination of multiple variants with small
effect sizes has been shown to predict disease status
or phenotype in independent samples from the
same population. Clearly, these results are not
consistent with flawed inferences from GWASs.
In conclusion, in a period of less than five years, the
GWAS experimental design in human populations has
led to new discoveries about genes and pathways involved
in common diseases and other complex traits, has
provided a wealth of new biological insights, has led to
discoveries with direct clinical utility, and has facilitated
basic research in human genetics and genomics. For the
future, technological advances enabling the sequencing
of entire genomes in large samples at affordable prices are
likely to reveal additional genes, pathways, and biological
insights, as well as to identify causal mutations.
Acknowledgments
We acknowledge funding from the Australian National Health and
Medical Research Council (NHMRC grants 389892, 496667,
613672, 613601, and 1011506) and the Australian Research
Council (ARC grant DP1093502). P.M.V. and M.A.B. are funded
by NHMRC Senior Principal Research Fellowships. We thank two
referees for many helpful comments.
Web Resources
The URLs for data presented herein are as follows:
Online Mendelian Inheritance in Man (OMIM), http://www.
omim.org
GWAS Catalog, http://www.genome.gov/26525384
References
1. McClellan, J., and King, M.C. (2010). Genetic heterogeneity
in human disease. Cell 141, 210–217.
2. Crow, T.J. (2011). ‘The missing genes: what happened to the
heritability of psychiatric disorders?’. Mol. Psychiatry 16,
362–364.
3. Manolio, T.A., Collins, F.S., Cox, N.J., Goldstein, D.B.,
Hindorff, L.A., Hunter, D.J., McCarthy, M.I., Ramos, E.M.,
Cardon, L.R., Chakravarti, A., et al. (2009). Finding the missing
heritability of complex diseases. Nature 461, 747–753.
4. Botstein, D., and Risch, N. (2003). Discovering genotypes
underlying human phenotypes: Past successes for mende-
lian disease, future approaches for complex disease. Nat.
Genet. Suppl. 33, 228–237.
5. Hartl, D.L., and Clark, A.G. (1997). Principles of population
genetics (Sunderland: Sinauer Associates).
6. Hill, W.G., and Robertson, A. (1968). The effects of
inbreeding at loci with heterozygote advantage. Genetics
60, 615–628.
7. Altshuler, D., Brooks, L.D., Chakravarti, A., Collins, F.S.,
Daly, M.J., and Donnelly, P.; International HapMap Consor-
tium. (2005). A haplotype map of the human genome.
Nature 437, 1299–1320.
8. Dewan, A., Liu, M., Hartman, S., Zhang, S.S., Liu, D.T., Zhao,
C., Tam, P.O., Chan, W.M., Lam, D.S., Snyder, M., et al.
(2006). HTRA1 promoter polymorphism in wet age-related
macular degeneration. Science 314, 989–992.
9. Klein, R.J., Zeiss, C., Chew, E.Y., Tsai, J.Y., Sackler, R.S.,
Haynes, C., Henning, A.K., SanGiovanni, J.P., Mane, S.M.,
Mayne, S.T., et al. (2005). Complement factor H polymor-
phism in age-related macular degeneration. Science 308,
385–389.
10. Wellcome Trust Case Control Consortium. (2007). Genome-
wide association study of 14,000 cases of seven common
diseases and 3,000 shared controls. Nature 447, 661–678.
11. Franke, A., McGovern, D.P., Barrett, J.C., Wang, K., Radford-
Smith, G.L., Ahmad, T., Lees, C.W., Balschun, T., Lee, J.,
Roberts, R., et al. (2010). Genome-wide meta-analysis
increases to 71 the number of confirmed Crohn’s disease
susceptibility loci. Nat. Genet. 42, 1118–1125.
12. Anderson, C.A., Boucher, G., Lees, C.W., Franke, A.,
D’Amato, M., Taylor, K.D., Lee, J.C., Goyette, P., Imielinski,
M., Latiano, A., et al. (2011). Meta-analysis identifies 29 addi-
tional ulcerative colitis risk loci, increasing the number of
confirmed associations to 47. Nat. Genet. 43, 246–252.
13. Lango Allen, H., Estrada, K., Lettre, G., Berndt, S.I., Weedon,
M.N., Rivadeneira, F., Willer, C.J., Jackson, A.U., Vedantam,
S., Raychaudhuri, S., et al. (2010). Hundreds of variants clus-
tered in genomic loci and biological pathways affect human
height. Nature 467, 832–838.
14. Yang, J., Manolio, T.A., Pasquale, L.R., Boerwinkle, E., Capor-
aso, N., Cunningham, J.M., de Andrade, M., Feenstra, B.,
Feingold, E., Hayes, M.G., et al. (2011). Genome partitioning
of genetic variation for complex traits using common SNPs.
Nat. Genet. 43, 519–525.
15. Yang, J., Benyamin, B., McEvoy, B.P., Gordon, S., Henders,
A.K., Nyholt, D.R., Madden, P.A., Heath, A.C., Martin, N.G.,
Montgomery, G.W., et al. (2010). Common SNPs explain
a large proportion of the heritability for human height.
Nat. Genet. 42, 565–569.
16. Eyre-Walker, A. (2010). Evolution in health and medicine
Sackler colloquium: Genetic architecture of complex traits
and its implications for fitness and genome-wide association
studies. Proc. Natl. Acad. Sci. USA 107 (Suppl 1),
1752–1756.
17. Pritchard, J.K. (2001). Are rare variants responsible for
susceptibility to complex diseases? Am. J. Hum. Genet. 69,
124–137.
18. Khor, B., Gardet, A., and Xavier, R.J. (2011). Genetics and
pathogenesis of inflammatory bowel disease. Nature 474,
307–317.
19. Danoy, P., Pryce, K., Hadler, J., Bradbury, L.A., Farrar, C., Poin-
ton, J., Ward, M., Weisman, M., Reveille, J.D., Wordsworth,
B.P., et al; Australo-Anglo-American Spondyloarthritis
Consortium; Spondyloarthritis Research Consortium of
Canada. (2010). Association of variants at 1q32 and STAT3
with ankylosing spondylitis suggests genetic overlap with
Crohn’s disease. PLoS Genet. 6, e1001195.
20. Cotsapas, C., Voight, B.F., Rossin, E., Lage, K., Neale, B.M.,
Wallace, C., Abecasis, G.R., Barrett, J.C., Behrens, T., Cho,
J., et al; FOCiS Network of Consortia. (2011). Pervasive
sharing of genetic effects in autoimmune disease. PLoS
Genet. 7, e1002254.
21. McCarthy, M.I. (2010). Genomics, type 2 diabetes, and
obesity. N. Engl. J. Med. 363, 2339–2350.
22. Kooner, J.S., Saleheen, D., Sim, X., Sehmi, J., Zhang, W.,
Frossard, P., Been, L.F., Chia, K.S., Dimas, A.S., Hassanali,
N., et al; DIAGRAM; MuTHER. (2011). Genome-wide associ-
ation study in individuals of South Asian ancestry identifies
six new type 2 diabetes susceptibility loci. Nat. Genet. 43,
984–989.
23. Yamauchi, T., Hara, K., Maeda, S., Yasuda, K., Takahashi, A.,
Horikoshi, M., Nakamura, M., Fujita, H., Grarup, N., Cauchi,
S., et al. (2010). A genome-wide association study in the
Japanese population identifies susceptibility loci for type 2
diabetes at UBE2E2 and C2CD4A-C2CD4B. Nat. Genet. 42,
864–868.
24. Shu, X.O., Long, J., Cai, Q., Qi, L., Xiang, Y.B., Cho, Y.S., Tai,
E.S., Li, X., Lin, X., Chow, W.H., et al. (2010). Identification
of new genetic risk variants for type 2 diabetes. PLoS Genet.
6, e1001127.
25. Yasuda, K., Miyake, K., Horikawa, Y., Hara, K., Osawa, H.,
Furuta, H., Hirota, Y., Mori, H., Jonsson, A., Sato, Y., et al.
(2008). Variants in KCNQ1 are associated with susceptibility
to type 2 diabetes mellitus. Nat. Genet. 40, 1092–1097.
26. Unoki, H., Takahashi, A., Kawaguchi, T., Hara, K., Horikoshi,
M., Andersen, G., Ng, D.P., Holmkvist, J., Borch-Johnsen, K.,
Jørgensen, T., et al. (2008). SNPs in KCNQ1 are associated
with susceptibility to type 2 diabetes in East Asian and Euro-
pean populations. Nat. Genet. 40, 1098–1102.
27. Tsai, F.J., Yang, C.F., Chen, C.C., Chuang, L.M., Lu, C.H.,
Chang, C.T., Wang, T.Y., Chen, R.H., Shiu, C.F., Liu, Y.M.,
et al. (2010). A genome-wide association study identifies
susceptibility variants for type 2 diabetes in Han Chinese.
PLoS Genet. 6, e1000847.
28. Below, J.E., Gamazon, E.R., Morrison, J.V., Konkashbaev, A.,
Pluzhnikov, A., McKeigue, P.M., Parra, E.J., Elbein, S.C.,
Hallman, D.M., Nicolae, D.L., et al. (2011). Genome-wide
association and meta-analysis in populations from Starr
County, Texas, and Mexico City identify type 2 diabetes
susceptibility loci and enrichment for expression quantita-
tive trait loci in top signals. Diabetologia 54, 2047–2055.
29. Parra, E.J., Below, J.E., Krithika, S., Valladares, A., Barta, J.L.,
Cox, N.J., Hanis, C.L., Wacher, N., Garcia-Mena, J., Hu, P.,
et al; Diabetes Genetics Replication and Meta-analysis
(DIAGRAM) Consortium. (2011). Genome-wide association
study of type 2 diabetes in a sample from Mexico City and
a meta-analysis of a Mexican-American sample from Starr
County, Texas. Diabetologia 54, 2038–2046.
30. Grant, S.F., Thorleifsson, G., Reynisdottir, I., Benediktsson,
R., Manolescu, A., Sainz, J., Helgason, A., Stefansson, H.,
Emilsson, V., Helgadottir, A., et al. (2006). Variant of
transcription factor 7-like 2 (TCF7L2) gene confers risk of
type 2 diabetes. Nat. Genet. 38, 320–323.
31. Prokopenko, I., Langenberg, C., Florez, J.C., Saxena, R.,
Soranzo, N., Thorleifsson, G., Loos, R.J., Manning, A.K.,
Jackson, A.U., Aulchenko, Y., et al. (2009). Variants in
MTNR1B influence fasting glucose levels. Nat. Genet. 41,
77–81.
32. Dupuis, J., Langenberg, C., Prokopenko, I., Saxena, R.,
Soranzo, N., Jackson, A.U., Wheeler, E., Glazer, N.L., Boua-
tia-Naji, N., Gloyn, A.L., et al; DIAGRAM Consortium;
GIANT Consortium; Global BPgen Consortium; Anders
Hamsten on behalf of Procardis Consortium; MAGIC investi-
gators. (2010). New genetic loci implicated in fasting glucose
homeostasis and their impact on type 2 diabetes risk. Nat.
Genet. 42, 105–116.
33. Saxena, R., Hivert, M.F., Langenberg, C., Tanaka, T., Pankow,
J.S., Vollenweider, P., Lyssenko, V., Bouatia-Naji, N., Dupuis,
J., Jackson, A.U., et al; GIANT consortium; MAGIC investiga-
tors. (2010). Genetic variation in GIPR influences the glucose
and insulin responses to an oral glucose challenge. Nat.
Genet. 42, 142–148.
34. Weedon, M.N., Clark, V.J., Qian, Y., Ben-Shlomo, Y., Timp-
son, N., Ebrahim, S., Lawlor, D.A., Pembrey, M.E., Ring, S.,
Wilkin, T.J., et al. (2006). A common haplotype of the gluco-
kinase gene alters fasting glucose and birth weight: Associa-
tion in six studies and population-genetics analyses. Am. J.
Hum. Genet. 79, 991–1001.
35. Larsen, L.H., Echwald, S.M., Sørensen, T.I., Andersen, T.,
Wulff, B.S., and Pedersen, O. (2005). Prevalence of mutations
and functional analyses of melanocortin 4 receptor variants
identified among 750 men with juvenile-onset obesity. J.
Clin. Endocrinol. Metab. 90, 219–224.
36. Speliotes, E.K., Willer, C.J., Berndt, S.I., Monda, K.L., Thor-
leifsson, G., Jackson, A.U., Allen, H.L., Lindgren, C.M.,
Luan, J., Magi, R., et al; MAGIC; Procardis Consortium.
(2010). Association analyses of 249,796 individuals reveal
18 new loci associated with body mass index. Nat. Genet.
42, 937–948.
37. Frayling, T.M., Timpson, N.J., Weedon, M.N., Zeggini, E.,
Freathy, R.M., Lindgren, C.M., Perry, J.R., Elliott, K.S., Lango,
H., Rayner, N.W., et al. (2007). A common variant in the FTO
gene is associated with body mass index and predisposes to
childhood and adult obesity. Science 316, 889–894.
38. Meyre, D., Delplanque, J., Chevre, J.C., Lecoeur, C., Lobbens,
S., Gallina, S., Durand, E., Vatin, V., Degraeve, F., Proenca, C.,
et al. (2009). Genome-wide association study for early-onset
and morbid adult obesity identifies three new risk loci in
European populations. Nat. Genet. 41, 157–159.
39. Scherag, A., Dina, C., Hinney, A., Vatin, V., Scherag, S., Vogel,
C.I., Muller, T.D., Grallert, H., Wichmann, H.E., Balkau, B.,
et al. (2010). Two new Loci for body-weight regulation iden-
tified in a joint analysis of genome-wide association studies
for early-onset extreme obesity in French and german study
groups. PLoS Genet. 6, e1000916.
40. Willer, C.J., Speliotes, E.K., Loos, R.J., Li, S., Lindgren, C.M.,
Heid, I.M., Berndt, S.I., Elliott, A.L., Jackson, A.U., Lamina,
C., et al; Wellcome Trust Case Control Consortium; Genetic
Investigation of ANthropometric Traits Consortium.
(2009). Six new loci associated with body mass index high-
light a neuronal influence on body weight regulation. Nat.
Genet. 41, 25–34.
41. Walters, R.G., Jacquemont, S., Valsesia, A., de Smith, A.J.,
Martinet, D., Andersson, J., Falchi, M., Chen, F., Andrieux,
J., Lobbens, S., et al. (2010). A new highly penetrant form
of obesity due to deletions on chromosome 16p11.2. Nature
463, 671–675.
42. Heard-Costa, N.L., Zillikens, M.C., Monda, K.L., Johansson,
A., Harris, T.B., Fu, M., Haritunians, T., Feitosa, M.F., Aspe-
lund, T., Eiriksdottir, G., et al. (2009). NRXN3 is a novel locus
for waist circumference: A genome-wide association study
from the CHARGE Consortium. PLoS Genet. 5, e1000539.
43. Heid, I.M., Jackson, A.U., Randall, J.C., Winkler, T.W., Qi, L.,
Steinthorsdottir, V., Thorleifsson, G., Zillikens, M.C.,
Speliotes, E.K., Magi, R., et al; MAGIC. (2010). Meta-analysis
identifies 13 new loci associated with waist-hip ratio and
reveals sexual dimorphism in the genetic basis of fat distribu-
tion. Nat. Genet. 42, 949–960.
44. Kilpelainen, T.O., Zillikens, M.C., Stancakova, A., Finucane,
F.M., Ried, J.S., Langenberg, C., Zhang, W., Beckmann, J.S.,
Luan, J., Vandenput, L., et al. (2011). Genetic variation
near IRS1 associates with reduced adiposity and an impaired
metabolic profile. Nat. Genet. 43, 753–760.
45. Sawcer, S., Hellenthal, G., Pirinen, M., Spencer, C.C., Patso-
poulos, N.A., Moutsianas, L., Dilthey, A., Su, Z., Freeman,
C., Hunt, S.E., et al; International Multiple Sclerosis Genetics
Consortium; Wellcome Trust Case Control Consortium 2.
(2011). Genetic risk and a primary role for cell-mediated
immune mechanisms in multiple sclerosis. Nature 476,
214–219.
46. Burton, P.R., Clayton, D.G., Cardon, L.R., Craddock, N.,
Deloukas, P., Duncanson, A., Kwiatkowski, D.P., McCarthy,
M.I., Ouwehand, W.H., Samani, N.J., et al; Wellcome Trust
Case Control Consortium; Australo-Anglo-American Spon-
dylitis Consortium (TASC); Biologics in RA Genetics and
Genomics Study Syndicate (BRAGGS) Steering Committee;
Breast Cancer Susceptibility Collaboration (UK). (2007).
Association scan of 14,500 nonsynonymous SNPs in four
diseases identifies autoimmunity variants. Nat. Genet. 39,
1329–1337.
47. Evans, D.M., Spencer, C.C., Pointon, J.J., Su, Z., Harvey, D.,
Kochan, G., Oppermann, U., Dilthey, A., Pirinen, M.,
Stone, M.A., et al; Spondyloarthritis Research Consortium
of Canada (SPARCC); Australo-Anglo-American Spondyloar-
thritis Consortium (TASC); Wellcome Trust Case Control
Consortium 2 (WTCCC2). (2011). Interaction between
ERAP1 and HLA-B27 in ankylosing spondylitis implicates
peptide handling in the mechanism for HLA-B27 in disease
susceptibility. Nat. Genet. 43, 761–767.
48. Suzuki, A., Yamada, R., Chang, X., Tokuhiro, S., Sawada, T.,
Suzuki, M., Nagasaki, M., Nakayama-Hamada, M., Kawaida,
R., Ono, M., et al. (2003). Functional haplotypes of PADI4,
encoding citrullinating enzyme peptidylarginine deiminase
4, are associated with rheumatoid arthritis. Nat. Genet. 34,
395–402.
49. Padyukov, L., Silva, C., Stolt, P., Alfredsson, L., and Klareskog,
L. (2004). A gene-environment interaction between smoking
and shared epitope genes in HLA-DR provides a high risk
of seropositive rheumatoid arthritis. Arthritis Rheum. 50,
3085–3092.
50. Voight, B.F., Scott, L.J., Steinthorsdottir, V., Morris, A.P., Dina,
C., Welch, R.P., Zeggini, E., Huth, C., Aulchenko, Y.S.,
Thorleifsson, G., et al; MAGIC investigators; GIANT
Consortium. (2010). Twelve type 2 diabetes susceptibility
loci identified through large-scale association analysis. Nat.
Genet. 42, 579–589.
51. Small, K.S., Hedman, A.K., Grundberg, E., Nica, A.C., Thor-
leifsson, G., Kong, A., Thorsteindottir, U., Shin, S.Y.,
Richards, H.B., Soranzo, N., et al; GIANT Consortium;
MAGIC Investigators; DIAGRAM Consortium; MuTHER
Consortium. (2011). Identification of an imprinted master
trans regulator at the KLF14 locus related to multiple meta-
bolic phenotypes. Nat. Genet. 43, 561–564.
52. Freathy, R.M., Mook-Kanamori, D.O., Sovio, U., Prokopenko,
I., Timpson, N.J., Berry, D.J., Warrington, N.M., Widen, E.,
Hottenga, J.J., Kaakinen, M., et al; Genetic Investigation of
ANthropometric Traits (GIANT) Consortium; Meta-Analyses
of Glucose and Insulin-related traits Consortium; Wellcome
Trust Case Control Consortium; Early Growth Genetics
(EGG) Consortium. (2010). Variants in ADCY5 and near
CCNL1 are associated with fetal growth and birth weight.
Nat. Genet. 42, 430–435.
53. Gerken, T., Girard, C.A., Tung, Y.C., Webby, C.J., Saudek, V.,
Hewitson, K.S., Yeo, G.S., McDonough, M.A., Cunliffe, S.,
McNeill, L.A., et al. (2007). The obesity-associated FTO
gene encodes a 2-oxoglutarate-dependent nucleic acid deme-
thylase. Science 318, 1469–1472.
54. Church, C., Lee, S., Bagg, E.A., McTaggart, J.S., Deacon, R.,
Gerken, T., Lee, A., Moir, L., Mecinović, J., Quwailid, M.M.,
et al. (2009). A mouse model for the metabolic effects of
the human fat mass and obesity associated FTO gene. PLoS
Genet. 5, e1000599.
55. Church, C., Moir, L., McMurray, F., Girard, C., Banks, G.T.,
Teboul, L., Wells, S., Bruning, J.C., Nolan, P.M., Ashcroft,
F.M., and Cox, R.D. (2010). Overexpression of Fto leads to
increased food intake and results in obesity. Nat. Genet. 42,
1086–1092.
56. Freathy, R.M., Timpson, N.J., Lawlor, D.A., Pouta, A., Ben-
Shlomo, Y., Ruokonen, A., Ebrahim, S., Shields, B., Zeggini,
E., Weedon, M.N., et al. (2008). Common variation in the
FTO gene alters diabetes-related metabolic traits to the extent
expected given its effect on BMI. Diabetes 57, 1419–1426.
57. Teslovich, T.M., Musunuru, K., Smith, A.V., Edmondson,
A.C., Stylianou, I.M., Koseki, M., Pirruccello, J.P., Ripatti, S.,
Chasman, D.I., Willer, C.J., et al. (2010). Biological, clinical
and population relevance of 95 loci for blood lipids. Nature
466, 707–713.
58. Gieger, C., Radhakrishnan, A., Cvejic, A., Tang, W., Porcu, E.,
Pistis, G., Serbanovic-Canic, J., Elling, U., Goodall, A.H., Lab-
rune, Y., et al. (2011). New gene functions in megakaryopoi-
esis and platelet formation. Nature 480, 201–208.
59. Mihaescu, R., Meigs, J., Sijbrands, E., and Janssens, A.C.
(2011). Genetic risk profiling for prediction of type 2 dia-
betes. PLoS Curr. 3, RRN1208.
60. Elliott, P., Chambers, J.C., Zhang, W., Clarke, R., Hopewell,
J.C., Peden, J.F., Erdmann, J., Braund, P., Engert, J.C., Bennett,
D., et al. (2009). Genetic Loci associated with C-reactive
protein levels and risk of coronary heart disease. JAMA 302,
37–48.
61. Owen, K.R., Thanabalasingham, G., James, T.J., Karpe, F.,
Farmer, A.J., McCarthy, M.I., and Gloyn, A.L. (2010). Assess-
ment of high-sensitivity C-reactive protein levels as diag-
nostic discriminator of maturity-onset diabetes of the young
due to HNF1A mutations. Diabetes Care 33, 1919–1924.
62. Thanabalasingham, G., Shah, N., Vaxillaire, M., Hansen, T.,
Tuomi, T., Gasperikova, D., Szopa, M., Tjora, E., James, T.J.,
Kokko, P., et al. (2011). A large multi-centre European study
validates high-sensitivity C-reactive protein (hsCRP) as a
clinical biomarker for the diagnosis of diabetes subtypes.
Diabetologia 54, 2801–2810.
63. Zhou, K., Bellenguez, C., Spencer, C.C., Bennett, A.J.,
Coleman, R.L., Tavendale, R., Hawley, S.A., Donnelly, L.A.,
Schofield, C., Groves, C.J., et al; GoDARTS and UKPDS
Diabetes Pharmacogenetics Study Group; Wellcome Trust
Case Control Consortium 2; MAGIC investigators. (2011).
Common variants near ATM are associated with glycemic
response to metformin in type 2 diabetes. Nat. Genet. 43,
117–120.
64. Stefansson, H., Helgason, A., Thorleifsson, G., Steinthorsdot-
tir, V., Masson, G., Barnard, J., Baker, A., Jonasdottir, A., Inga-
son, A., Gudnadottir, V.G., et al. (2005). A common inversion
under selection in Europeans. Nat. Genet. 37, 129–137.
65. Kong, A., Barnard, J., Gudbjartsson, D.F., Thorleifsson, G.,
Jonsdottir, G., Sigurdardottir, S., Richardsson, B., Jonsdottir,
J., Thorgeirsson, T., Frigge, M.L., et al. (2004). Recombination
rate and reproductive success in humans. Nat. Genet. 36,
1203–1206.
66. Hinch, A.G., Tandon, A., Patterson, N., Song, Y., Rohland, N.,
Palmer, C.D., Chen, G.K., Wang, K., Buxbaum, S.G., Akylbe-
kova, E.L., et al. (2011). The landscape of recombination in
African Americans. Nature 476, 170–175.
67. Seldin, M.F., Tian, C., Shigeta, R., Scherbarth, H.R., Silva, G.,
Belmont, J.W., Kittles, R., Gamron, S., Allevi, A., Palatnik,
S.A., et al. (2007). Argentine population genetic structure:
Large variance in Amerindian contribution. Am. J. Phys.
Anthropol. 132, 455–462.
68. Seldin, M.F., Shigeta, R., Villoslada, P., Selmi, C., Tuomilehto,
J., Silva, G., Belmont, J.W., Klareskog, L., and Gregersen, P.K.
(2006). European population substructure: Clustering of
northern and southern populations. PLoS Genet. 2, e143.
69. Tian, C., Hinds, D.A., Shigeta, R., Kittles, R., Ballinger, D.G.,
and Seldin, M.F. (2006). A genomewide single-nucleotide-
polymorphism panel with high ancestry information for
African American admixture mapping. Am. J. Hum. Genet.
79, 640–649.
70. McEvoy, B.P., Montgomery, G.W., McRae, A.F., Ripatti, S.,
Perola, M., Spector, T.D., Cherkas, L., Ahmadi, K.R.,
Boomsma, D., Willemsen, G., et al. (2009). Geographical
structure and differential natural selection among North
European populations. Genome Res. 19, 804–814.
71. Heath, S.C., Gut, I.G., Brennan, P., McKay, J.D., Bencko, V.,
Fabianova, E., Foretova, L., Georges, M., Janout, V., Kabesch,
M., et al. (2008). Investigation of the fine structure of
European populations with applications to disease associa-
tion studies. Eur. J. Hum. Genet. 16, 1413–1429.
72. Novembre, J., Johnson, T., Bryc, K., Kutalik, Z., Boyko, A.R.,
Auton, A., Indap, A., King, K.S., Bergmann, S., Nelson,
M.R., et al. (2008). Genes mirror geography within Europe.
Nature 456, 98–101.
73. Price, A.L., Butler, J., Patterson, N., Capelli, C., Pascali, V.L.,
Scarnicci, F., Ruiz-Linares, A., Groop, L., Saetta, A.A., Korkolo-
poulou, P., et al. (2008). Discerning the ancestry of European
Americans in genetic association studies. PLoS Genet. 4,
e236.
74. Manolio, T.A. (2010). Genomewide association studies
and assessment of the risk of disease. N. Engl. J. Med. 363,
166–176.
22 The American Journal of Human Genetics 90, 7–24, January 13, 2012
75. Sivakumaran, S., Agakov, F., Theodoratou, E., Prendergast,
J.G., Zgaga, L., Manolio, T., Rudan, I., McKeigue, P., Wilson,
J.F., and Campbell, H. (2011). Abundant pleiotropy in
human complex diseases and traits. Am. J. Hum. Genet. 89,
607–618.
76. Dickson, S.P., Wang, K., Krantz, I., Hakonarson, H., and
Goldstein, D.B. (2010). Rare variants create synthetic
genome-wide associations. PLoS Biol. 8, e1000294.
77. Anderson, C.A., Soranzo, N., Zeggini, E., and Barrett, J.C.
(2011). Synthetic associations are unlikely to account for
many common disease genome-wide association signals.
PLoS Biol. 9, e1000580.
78. Wray, N.R., Purcell, S.M., and Visscher, P.M. (2011). Synthetic
associations created by rare variants do not explain most
GWAS results. PLoS Biol. 9, e1000579.
79. Visscher, P.M., Goddard, M.E., Derks, E.M., and Wray, N.R.
(2011). Evidence-based psychiatric genetics, AKA the false
dichotomy between common and rare variant hypotheses.
Molecular Psychiatry, in press. Published online 14 June
2011. doi:10.1038/mp.2011.65.
80. Hunter, D.J., and Kraft, P. (2007). Drinking from the fire
hose—Statistical issues in genomewide association studies.
N. Engl. J. Med. 357, 436–439.
81. Pryce, J.E., Hayes, B.J., Bolormaa, S., and Goddard, M.E.
(2011). Polymorphic regions affecting human height also
control stature in cattle. Genetics 187, 981–984.
82. Bodmer, W.F. (1986). Human genetics: The molecular chal-
lenge. Cold Spring Harb. Symp. Quant. Biol. 51, 1–13.
83. Risch, N., and Merikangas, K. (1996). The future of genetic
studies of complex human diseases. Science 273, 1516–
1517.
84. Wray, N.R. (2005). Allele frequencies and the r2 measure of
linkage disequilibrium: impact on design and interpretation
of association studies. Twin Res. Hum. Genet. 8, 87–94.
85. McClellan, J.M., Susser, E., and King, M.C. (2007). Schizo-
phrenia: A common disease caused by multiple rare alleles.
Br. J. Psychiatry 190, 194–199.
86. Craddock, N., O’Donovan, M.C., and Owen, M.J. (2007).
Phenotypic and genetic complexity of psychosis. Invited
commentary on. Schizophrenia: a common disease caused
by multiple rare alleles. Br. J. Psychiatry 190, 200–203.
87. Lander, E.S. (1996). The new genomics: Global views of
biology. Science 274, 536–539.
88. Chakravarti, A. (1999). Population genetics—Making sense
out of sequence. Nat. Genet. 21 (1, Suppl), 56–60.
89. Reich, D.E., and Lander, E.S. (2001). On the allelic spectrum
of human disease. Trends Genet. 17, 502–510.
90. Risch, N. (1990). Linkage strategies for genetically complex
traits. I. Multilocus models. Am. J. Hum. Genet. 46, 222–228.
91. Slatkin, M. (2008). Exchangeable models of complex in-
herited diseases. Genetics 179, 2253–2261.
92. Hill, W.G., Goddard, M.E., and Visscher, P.M. (2008). Data
and theory point to mainly additive genetic variance for
complex traits. PLoS Genet. 4, e1000008.
93. Wang, K., Dickson, S.P., Stolle, C.A., Krantz, I.D., Goldstein,
D.B., and Hakonarson, H. (2010). Interpretation of associa-
tion signals and identification of causal variants from
genome-wide association studies. Am. J. Hum. Genet. 86,
730–742.
94. Nejentsev, S., Walker, N., Riches, D., Egholm, M., and Todd,
J.A. (2009). Rare variants of IFIH1, a gene implicated in anti-
viral responses, protect against type 1 diabetes. Science 324,
387–389.
95. Momozawa, Y., Mni, M., Nakamura, K., Coppieters, W.,
Almer, S., Amininejad, L., Cleynen, I., Colombel, J.F.,
de Rijk, P., Dewit, O., et al. (2011). Resequencing of positional
candidates identifies low frequency IL23R coding variants
protecting against inflammatory bowel disease. Nat. Genet.
43, 43–47.
96. Rivas, M.A., Beaudoin, M., Gardet, A., Stevens, C., Sharma, Y.,
Zhang, C.K., Boucher, G., Ripke, S., Ellinghaus, D., Burtt, N.,
et al; National Institute of Diabetes and Digestive Kidney
Diseases Inflammatory Bowel Disease Genetics Consortium
(NIDDK IBDGC); United Kingdom Inflammatory Bowel
Disease Genetics Consortium; International Inflammatory
Bowel Disease Genetics Consortium. (2011). Deep rese-
quencing of GWAS loci identifies independent rare variants
associated with inflammatory bowel disease. Nat. Genet.
43, 1066–1073.
97. Wang, K., Bucan, M., Grant, S.F., Schellenberg, G., and Hako-
narson, H. (2010). Strategies for genetic studies of complex
diseases. Cell 142, 351–353, author reply 353–355.
98. Hyttinen, V., Kaprio, J., Kinnunen, L., Koskenvuo, M., and
Tuomilehto, J. (2003). Genetic liability of type 1 diabetes
and the onset age among 22,650 young Finnish twin pairs:
A nationwide follow-up study. Diabetes 52, 1052–1055.
99. Polychronakos, C., and Li, Q. (2011). Understanding type 1
diabetes through genetics: Advances and prospects. Nat.
Rev. Genet. 12, 781–792.
100. Poulsen, P., Kyvik, K.O., Vaag, A., and Beck-Nielsen, H.
(1999). Heritability of type II (non-insulin-dependent)
diabetes mellitus and abnormal glucose tolerance—A popu-
lation-based twin study. Diabetologia 42, 139–145.
101. Magnusson, P.K., and Rasmussen, F. (2002). Familial resem-
blance of body mass index and familial risk of high and
low body mass index. A study of young men in Sweden.
Int. J. Obes. Relat. Metab. Disord. 26, 1225–1231.
102. Schousboe, K., Willemsen, G., Kyvik, K.O., Mortensen, J.,
Boomsma, D.I., Cornes, B.K., Davis, C.J., Fagnani, C., Hjelm-
borg, J., Kaprio, J., et al. (2003). Sex differences in heritability
of BMI: A comparative study of results from twin studies in
eight countries. Twin Res. 6, 409–421.
103. Tysk, C., Lindberg, E., Jarnerot, G., and Floderus-Myrhed, B.
(1988). Ulcerative colitis and Crohn’s disease in an unse-
lected population of monozygotic and dizygotic twins. A
study of heritability and the influence of smoking. Gut 29,
990–996.
104. Hawkes, C.H., and Macgregor, A.J. (2009). Twin studies
and the heritability of MS: A conclusion. Mult. Scler. 15,
661–667.
105. Brown, M.A., Kennedy, L.G., MacGregor, A.J., Darke, C.,
Duncan, E., Shatford, J.L., Taylor, A., Calin, A., and Words-
worth, P. (1997). Susceptibility to ankylosing spondylitis in
twins: The role of genes, HLA, and the environment.
Arthritis Rheum. 40, 1823–1828.
106. Brown, M.A. (2011). Progress in the genetics of ankylosing
spondylitis. Brief. Funct. Genomics 10, 249–257.
107. MacGregor, A.J., Snieder, H., Rigby, A.S., Koskenvuo, M.,
Kaprio, J., Aho, K., and Silman, A.J. (2000). Characterizing
the quantitative genetic contribution to rheumatoid arthritis
using data from twins. Arthritis Rheum. 43, 30–37.
108. Lichtenstein, P., Yip, B.H., Bjork, C., Pawitan, Y., Cannon,
T.D., Sullivan, P.F., and Hultman, C.M. (2009). Common
genetic determinants of schizophrenia and bipolar disorder
in Swedish families: A population-based study. Lancet 373,
234–239.
109. Purcell, S.M., Wray, N.R., Stone, J.L., Visscher, P.M., O’Dono-
van, M.C., Sullivan, P.F., and Sklar, P.; International Schizo-
phrenia Consortium. (2009). Common polygenic variation
contributes to risk of schizophrenia and bipolar disorder.
Nature 460, 748–752.
110. Lichtenstein, P., Holm, N.V., Verkasalo, P.K., Iliadou, A.,
Kaprio, J., Koskenvuo, M., Pukkala, E., Skytthe, A., and Hem-
minki, K. (2000). Environmental and heritable factors in the
causation of cancer—Analyses of cohorts of twins from
Sweden, Denmark, and Finland. N. Engl. J. Med. 343, 78–85.
111. Turnbull, C., Ahmed, S., Morrison, J., Pernet, D., Renwick, A.,
Maranian, M., Seal, S., Ghoussaini, M., Hines, S., Healey,
C.S., et al; Breast Cancer Susceptibility Collaboration (UK).
(2010). Genome-wide association study identifies five new
breast cancer susceptibility loci. Nat. Genet. 42, 504–507.
112. Orstavik, K.H., Magnus, P., Reisner, H., Berg, K., Graham, J.B.,
and Nance, W. (1985). Factor VIII and factor IX in a twin
population. Evidence for a major effect of ABO locus on
factor VIII level. Am. J. Hum. Genet. 37, 89–101.
113. de Lange, M., Snieder, H., Ariens, R.A., Spector, T.D., and
Grant, P.J. (2001). The genetics of haemostasis: A twin study.
Lancet 357, 101–105.
114. Smith, N.L., Chen, M.H., Dehghan, A., Strachan, D.P., Basu,
S., Soranzo, N., Hayward, C., Rudan, I., Sabater-Lleal, M., Bis,
J.C., et al; Wellcome Trust Case Control Consortium. (2010).
Novel associations of multiple genetic loci with plasma levels
of factor VII, factor VIII, and von Willebrand factor: The
CHARGE (Cohorts for Heart and Aging Research in Genome
Epidemiology) Consortium. Circulation 121, 1382–1392.
115. Visscher, P.M., Medland, S.E., Ferreira, M.A., Morley, K.I.,
Zhu, G., Cornes, B.K., Montgomery, G.W., and Martin,
N.G. (2006). Assumption-free estimation of heritability
from genome-wide identity-by-descent sharing between
full siblings. PLoS Genet. 2, e41.
116. Silventoinen, K., Sammalisto, S., Perola, M., Boomsma, D.I.,
Cornes, B.K., Davis, C., Dunkel, L., De Lange, M., Harris,
J.R., Hjelmborg, J.V., et al. (2003). Heritability of adult body
height: A comparative study of twin cohorts in eight coun-
tries. Twin Res. 6, 399–408.
117. Peacock, M., Turner, C.H., Econs, M.J., and Foroud, T. (2002).
Genetics of osteoporosis. Endocr. Rev. 23, 303–326.
118. Duncan, E.L., Danoy, P., Kemp, J.P., Leo, P.J., McCloskey, E.,
Nicholson, G.C., Eastell, R., Prince, R.L., Eisman, J.A., Jones,
G., et al. (2011). Genome-wide association study using
extreme truncate selection identifies novel genes affecting
bone mineral density and fracture risk. PLoS Genet. 7,
e1001372.
119. Dalageorgou, C., Ge, D., Jamshidi, Y., Nolte, I.M., Riese, H.,
Savelieva, I., Carter, N.D., Spector, T.D., and Snieder, H.
(2008). Heritability of QT interval: how much is explained
by genes for resting heart rate? J. Cardiovasc. Electrophysiol.
19, 386–391.
120. Russell, M.W., Law, I., Sholinsky, P., and Fabsitz, R.R. (1998).
Heritability of ECG measurements in adult male twins. J.
Electrocardiol. Suppl. 30, 64–68.
121. Shah, S.H., and Pitt, G.S. (2009). Genetics of cardiac repolar-
ization. Nat. Genet. 41, 388–389.
122. Hunt, S.C., Hasstedt, S.J., Kuida, H., Stults, B.M., Hopkins,
P.N., and Williams, R.R. (1989). Genetic heritability and
common environmental components of resting and stressed
blood pressures, lipids, and body mass index in Utah pedi-
grees and twins. Am. J. Epidemiol. 129, 625–638.
123. Evans, D.M., Frazer, I.H., and Martin, N.G. (1999). Genetic
and environmental causes of variation in basal levels of
blood cells. Twin Research: The Official Journal of the Inter-
national Society for Twin Studies 2, 250–257.
ARTICLE
Mitochondrial DNA and Y Chromosome Variation Provides Evidence for a Recent Common Ancestry between Native Americans and Indigenous Altaians
Matthew C. Dulik,1 Sergey I. Zhadanov,1,2 Ludmila P. Osipova,2 Ayken Askapuli,1,3 Lydia Gau,1
Omer Gokcumen,1,4 Samara Rubinstein,1,5 and Theodore G. Schurr1,*
The Altai region of southern Siberia has played a critical role in the peopling of northern Asia as an entry point into Siberia and a possible
homeland for ancestral Native Americans. It has an old and rich history because humans have inhabited this area since the Paleolithic.
Today, the Altai region is home to numerous Turkic-speaking ethnic groups, which have been divided into northern and southern clus-
ters based on linguistic, cultural, and anthropological traits. To untangle Altaian genetic histories, we analyzed mtDNA and Y chromo-
some variation in northern and southern Altaian populations. All mtDNAs were assayed by PCR-RFLP analysis and control region
sequencing, and the nonrecombining portion of the Y chromosome was scored for more than 100 biallelic markers and 17 Y-STRs. Based
on these data, we noted differences in the origin and population history of Altaian ethnic groups, with northern Altaians appearing more
like Yeniseian, Ugric, and Samoyedic speakers to the north, and southern Altaians having greater affinities to other Turkic speaking pop-
ulations of southern Siberia and Central Asia. Moreover, high-resolution analysis of Y chromosome haplogroup Q has allowed us to
reshape the phylogeny of this branch, making connections between populations of the New World and Old World more apparent
and demonstrating that southern Altaians and Native Americans share a recent common ancestor. These results greatly enhance our
understanding of the peopling of Siberia and the Americas.
Introduction
The Altai Republic is located in south-central Russia, situ-
ated at the borders of Mongolia, China, and Kazakhstan.
It sits at a crossroads where the Eurasian steppe meets the
Siberian taiga and serves as an entry point into northern
Asia. Having been habitable throughout the last glacial
maximum (LGM), the Altai region has had a human pres-
ence for some 45,000 years.1 The archaeology of the region
shows that, during this time, a number of different cultures
and peoples lived in and migrated through the area.2–4 The
confirmation of Neanderthals and the recent discovery of
a new hominin at the Denisova cave in the Altai region
indicate that this area has long hosted extremely diverse
populations.5–7 It is also the area from which the ancestors
of Native American populations are thought to have arisen
prior to their expansion into the New World.8–11 In addi-
tion, archaeological evidence suggests that a few of the
later cultural horizons (Afanasievo and Andronovo) arose
in western Eurasia and spread eastward to the Altai region
during the Eneolithic and Bronze Ages, respectively.12,13
Such interactions increased during the Iron Age, as evi-
denced by the frozen Pazyryk kurgans in the southern Altai
Mountains,14 which contained examples of the typical
“Scytho-Siberian animal style” observed throughout the
entire Eurasian steppe.3,15 These populations further
intermingled with expanding Altaic speaking groups,
and specifically the movements involving the Xiongnu,
Xianbei, and Yuezhi, as recorded by ancient Chinese histo-
rians in the second century BCE.16,17
Ethnographic studies of Turkic-speaking tribes indige-
nous to the Altai region of southern Siberia noted cultural
differences among ethnic groups such that they could be
classified into northern or southern Altaians.18,19 Northern
Altaian ethnic groups include the Chelkan, Kumandin,
and Tubalar. The Altai-kizhi, Teleut, and Telengit were
grouped together as southern Altaians, along with a few
other smaller populations. Similarly, linguistic studies
have shown that languages from northern and southern
populations are mutually unintelligible, despite their
having similar Turkic roots. The northern Altai languages
also showed greater influences from Samoyedic, Yeniseian,
and Ugric languages, possibly reflecting their origin among
the ancestors of these present-day peoples. By contrast,
southern Altaian languages belong to the Kipchak
branch of the Turkic language family and have been greatly
influenced by Mongolian, especially after the expansion
of the Mongol Empire.16,20 These linguistic differences are
further mirrored by differences in anthropometric traits,
traditional subsistence strategies, religious traditions, and
clan names for northern and southern Altaians.18,19,21
1Department of Anthropology, University of Pennsylvania, Philadelphia, PA 19104-6398, USA; 2Institute of Cytology and Genetics, SB RAS, Novosibirsk 630090, Russia; 3Institute of General Genetics and Cytology, Almaty 050060, Kazakhstan
4Present address: Harvard University Medical School, Brigham and Women’s Hospital, Boston, MA 02115, USA
5Present address: Sackler Educational Laboratory for Comparative Genomics and Human Origins, American Museum of Natural History, New York, NY 10024-5192, USA
*Correspondence: [email protected]
DOI 10.1016/j.ajhg.2011.12.014. ©2012 by The American Society of Human Genetics. All rights reserved.
The American Journal of Human Genetics 90, 229–246, February 10, 2012 229
Genetic analysis of Altaian populations initially focused
on protein polymorphisms to assess levels of diversity and
the relationships between them and other Siberian popula-
tions by comparing relative proportions of West and East
Eurasian genotypes.22–24 The role that the Altai region
played in the dispersal of humans into northern Eurasia
and subsequently into the Americas gained increasing
importance with the search for the founding mitochon-
drial DNAs (mtDNAs) and Y chromosomes for the
New World.8,25,26 As a result, the issue of where Native
American progenitors originated became a hotly debated
topic, with suggested source areas being Central Asia,
Mongolia, and different parts of Siberia.8–10,27–46 However,
much of the previous genetic research into this issue
focused mainly on southern Altaian populations, leaving
our understanding of the genetic diversity of northern
Altaian groups incomplete.
Given the ethnographic and historical background of
Altaian peoples, we characterized the mtDNA and Y chro-
mosome variation in these populations to elucidate their
genetic history. Our first objective was to determine
whether the ethnographic classifications of northern and
southern Altaians reflected their patterns of genetic varia-
tion, and specifically whether they shared a common
ancestry. If differences were observed, we then wanted to
know whether they were attributable to demographic
factors, social organization, or some combination of the
two. The second goal was to examine whether northern
Altaians’ genetic variation is structured by tribe and clan
identity. The third goal was to use these data to investigate
larger questions concerning the peopling of Siberia (and
the Americas). In particular, we were interested in learning
whether these genetic data would reveal the effects of
ancient and/or recent migrations into or out of the Altai
region, including that giving rise to the ancestors of
indigenous populations from America. Overall, this paper
attempts to understand the population history of Altaians
by placing them into a Siberian genetic context and uses
a phylogeographic approach to dissect the layers of history,
uncovering the formation of these ethnic groups and their
importance for understanding the peopling of Northern
Asia and the Americas.
Subjects and Methods
Sample Collection
Between 1991 and 2002, we conducted ethnographic fieldwork
and sample collection in a number of settlements within the
southern part of the Altai Republic (Figure 1). During this period,
a total of 267 self-identified Altai-kizhi individuals living in the
villages of Mendur-Sokkon, Cherny Anuy, Turata, and Kosh-Agach
participated in the study. In addition, another nine Altai-kizhi
individuals from villages in the northern Altai Republic partici-
pated in the study (see below), bringing the total number of
Altai-kizhi participants to 276, of whom 120 were men.
Figure 1. Map of the Altai Republic and Locations of Sample Collection
In 2003, we worked with 214 Northern Altaians living in the
Turochak District of the Altai Republic. These persons included
91 Chelkans, 52 Kumandins, and 71 Tubalars living in nine
different villages in the Biya and Lebed’ River basins and
along Teletskoe Lake (Figure 1). The villages included Artybash,
Biika, Dmitrievka, Kebezen, Kurmach-Baigol, Sank-Ino, Shunarak,
Tandoshka, and Yugach. Of the northern Altaian participants, 69
were men.
Blood samples were drawn from all participants with informed
consent written in Russian and approved by the University of
Pennsylvania IRB and the Institute of Cytology and Genetics in
Novosibirsk, Russia. Genealogical data were also obtained from
each person at the time of sample collection to ensure that the
individuals were unrelated through at least three generations
and to assess the level of admixture in these communities. Individ-
uals were categorized by self-identified ethnicity for this study.
Molecular Genetic Analysis
Sample Preparation
Blood samples were fractionated through low-speed centrifugation to
obtain plasma and red cell fractions. Total genomic DNAs were
isolated from buffy coats with a lysis buffer and standard phenol-
chloroform extraction protocol modified from earlier studies.27,47
mtDNA Analysis
The mtDNA of each sample was characterized by high-resolution
SNP analysis and control region sequencing. PCR-RFLP analysis
was employed to assign individuals to West48–52 and East30,53–56
Eurasian mtDNA haplogroups by screening them for known diag-
nostic markers, as per previous studies57,58 (Table S1 available
online), with the nomenclature used to classify the mitochondrial
haplotype according to PhyloTree.org.59
The hypervariable segment 1 (HVS1) of the control region was
directly sequenced for each sample by published methods,58 and
hypervariable segment 2 (HVS2) was sequenced with the primers
indicated in Table S2. Sequences were read on ABI 3130xl Gene
Analyzers located in the Laboratory of Molecular Anthropology
and the Department of Genetics Sequencing Core Facility at the
University of Pennsylvania and aligned and edited with
Sequencher 4.8 (Gene Codes Corporation). All polymorphic
nucleotides were reckoned relative to the revised Cambridge refer-
ence sequence (rCRS).60,61 The combination of SNP data and
control region sequences defined maternal haplotypes in these
individuals.
Y Chromosome Analysis
The nonrecombining portion of the Y chromosome (NRY) from
each male participant was characterized by assaying phylogeneti-
cally informative biallelic markers in a hierarchical fashion accord-
ing to published information62,63 and previously published
methods.64 A total of 116 biallelic markers were tested to define
sample membership in respective NRY haplogroups. Most of the
SNPs and fragment length polymorphisms were characterized by
custom TaqMan assays read on an ABI Prism 7900 HT Real-Time
PCR System (Applied Biosystems). These polymorphisms included
L53, L54, L55, L56, L57, L213, L329, L330, L331, L332, L333,
L365, L400, L456, L472, L474, L475, L476, L528, LLY22g, M3,
M9, M12, M15, M18, M20, M25, M35, M45, M55, M56, M69,
M70, M73, M81, M86, M89, M93, M96, M102, M117, M119,
M120, M122, M123, M124, M128, M130, M134, M143, M147,
M157, M162, M170, M172, M173, M174, M178, M186, M201,
M204, M207, M214, M217, M223, M230, M242, M253, M265,
M267, M269, M285, M304, M323, M335, M346, M410, M417,
M434, M458, P15, P25, P31, P36.2, P37.2, P47, P60, P63, P105,
P215, P256, P261, P297, and PK2. Additional markers were
detected through direct sequencing (L191, L334, L401, L527,
L529, M17, M46 [Tat], M343, M407, MEH2, P39, P43, P48,
P53.1, P62, P89, P98, P101, PageS000104, and PK5) and by PCR-
RFLP analysis (M175).65 Seventeen short tandem repeats (STRs)
were characterized with the AmpFlSTR Yfiler PCR Amplification
Kit (ABI) and read on an ABI 3130xl Genetic Analyzer with Gene-
Mapper ID v3.2 software. Each paternal haplotype was designated
by its 17-STR profile. Y chromosome lineages were defined as the
unique combinations of SNP and STR data present in the samples.
DYS389b was calculated by subtracting DYS389I from DYS389II,
which was used for all statistical and network analyses.64
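The DYS389b derivation described above can be sketched as follows. This is a minimal illustration, assuming each Y-STR profile is stored as a dictionary of repeat counts keyed by locus name; the function name and the example repeat counts are hypothetical and are not taken from the study.

```python
def add_dys389b(profile):
    """Return a copy of a Y-STR profile with DYS389b = DYS389II - DYS389I.

    `profile` maps locus names to repeat counts (an assumed representation).
    The input dictionary is left unchanged.
    """
    derived = dict(profile)
    derived["DYS389b"] = profile["DYS389II"] - profile["DYS389I"]
    return derived


# Hypothetical repeat counts, for illustration only.
example = {"DYS19": 13, "DYS389I": 13, "DYS389II": 29}
print(add_dys389b(example)["DYS389b"])
```

Subtracting DYS389I removes the repeat block that both amplicons share, so the derived value can be treated as a separate locus in distance and network calculations.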
Comparative Data
To place their genetic histories in a broader contextual framework,
we compared Altaian mtDNA and NRY data with those from
populations in southern Siberia, Central Asia, Mongolia, and
East Asia. For the mtDNA analysis, the populations included
Telengits, Teleuts, Shors, Khakass, Tuvinians, Todzhans, Tofalars,
Soyots, Buryats, Khanty, Mansi, Ket, Nganasan, Western Evenks,
Uyghurs, Kazakhs, Kyrgyz, Uzbeks, and Mongolians.41,43,44,66–71
For the NRY analysis, only populations that were represented by
full Y-STR data sets (not just Y-STRs for specific haplogroups)
were used for comparative purposes. These populations included
Teleuts, Khakass, Mansi, Khanty, Kalmyks, Mongolians, and
Uyghurs.68,72–75 The STR haplotypes were reduced to ten loci
(DYS19, DYS389I, DYS389b, DYS390, DYS391, DYS392, DYS393,
DYS437, DYS438, and DYS439) to allow for as broad a comparison
as possible. In the coalescence analysis, we used the 15 Y-STR loci
Q-M3 haplotypes from Geppert et al.76
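The reduction of published haplotypes to a shared ten-locus set might look like the sketch below. It assumes profiles are dictionaries of repeat counts and takes the third locus to be DYS389b, the derived value defined in the Y chromosome methods. The helper name and the decision to discard incomplete profiles are illustrative assumptions, not details given in the text.

```python
# The ten loci retained for cross-study comparison, in a fixed order.
TEN_LOCI = (
    "DYS19", "DYS389I", "DYS389b", "DYS390", "DYS391",
    "DYS392", "DYS393", "DYS437", "DYS438", "DYS439",
)


def reduce_haplotype(profile, loci=TEN_LOCI):
    """Return repeat counts at `loci` as a tuple, or None if any locus is missing.

    Dropping incomplete profiles is an assumption made for this sketch.
    """
    if any(locus not in profile for locus in loci):
        return None
    return tuple(profile[locus] for locus in loci)
```

Returning a tuple in a fixed locus order makes reduced haplotypes hashable, so identical haplotypes from different studies can be counted or compared directly.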
Data Analysis
Summary statistics, including gene diversity and pairwise differences,
were calculated with Arlequin v3.11 (ref. 77) for mtDNA HVS1
(np 16024-16400) and NRY Y-STRs. FST and RST values between
populations were also calculated with Arlequin v3.11 for the
HVS1 sequences and Y-STRs, respectively. FST values were esti-
mated with the Tamura and Nei model of sequence evolution.78
Pairwise genetic distances were visualized by multidimensional
scaling (MDS) with SPSS 11.0.0 (ref. 79). In addition, nucleotide diversity,
Tajima’s D, and Fu’s FS were calculated with mtDNA HVS1
sequences.
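Two of the summary statistics named above can be illustrated with a short sketch. It assumes the standard unbiased estimator of gene diversity, H = n/(n − 1) × (1 − Σ p_i²), and a simple count of mismatched positions for pairwise differences. It is not Arlequin's implementation, it ignores the Tamura and Nei correction, and the function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def gene_diversity(haplotypes):
    """Unbiased gene diversity: n/(n-1) * (1 - sum of squared frequencies)."""
    n = len(haplotypes)
    if n < 2:
        raise ValueError("at least two haplotypes are required")
    counts = Counter(haplotypes)
    sum_sq = sum((c / n) ** 2 for c in counts.values())
    return n / (n - 1) * (1.0 - sum_sq)


def mean_pairwise_differences(sequences):
    """Mean number of mismatched positions over all pairs of aligned sequences."""
    pairs = list(combinations(sequences, 2))
    if not pairs:
        raise ValueError("at least two sequences are required")
    total = sum(sum(a != b for a, b in zip(s1, s2)) for s1, s2 in pairs)
    return total / len(pairs)
```

Both functions expect aligned, equal-length inputs; gaps and ambiguous bases would need explicit handling in a real analysis.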
We analyzed the phylogenetic relationships among Y-STR
haplotypes and complete mtDNA genomes by using Network
4.6.0.0 (Fluxus Technology Ltd). These networks employed a
reduced median-median joining approach and MP post-process-
ing.80–82 The NRY haplotypes used to generate the networks
consisted of 15 Y-STRs. DYS385 was excluded from the network
analysis because differentiation between DYS385a and DYS385b
is not possible with the Y-Filer kit.83 The Y-STR loci were weighted
based on the inverse of their variances. Mitogenomes used in this
analysis came from the published literature and GenBank.
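The inverse-variance weighting of Y-STR loci can be sketched as below. This assumes the variance is the population variance of repeat counts across the sampled haplotypes; how the weights were scaled or rounded before being entered into Network is not stated in the text, so this sketch returns the raw reciprocals. The function name is hypothetical.

```python
from statistics import pvariance


def inverse_variance_weights(haplotypes, loci):
    """Map each locus to 1/variance of its repeat counts across haplotypes.

    `haplotypes` is a list of dicts of repeat counts (an assumed layout).
    Loci with zero variance carry no information and are mapped to None.
    """
    weights = {}
    for locus in loci:
        var = pvariance([h[locus] for h in haplotypes])
        weights[locus] = None if var == 0 else 1.0 / var
    return weights
```

Weighting in this way gives slowly mutating, low-variance loci more influence on the network than highly variable ones.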
The time to the most recent common ancestor (TMRCA) for mi-
togenomes was estimated with the methods of Soares et al.84 The
Y-STR diversity within each haplogroup was assessed by two
methods.64 The first involved calculation of rho statistics with
Network 4.6.0.0, where the founder haplotype was inferred as
in Sengupta et al.85 The second used Batwing,86 a Bayesian
analysis where the TMRCA and expansion time of each popula-
tion (or haplogroup) were calculated by previously published
methods.64,72,87 Both the evolutionary and the pedigree-based
mutation rates were used to estimate coalescence dates with
generation times of 25 and 30 years, respectively.88–90 Because
a definitive consensus does not yet exist as to which rate should
be used, the validity of the resulting estimates is discussed. In
addition, Batwing was used to estimate the split or divergence
times of several haplogroups. This method assumes that, after populations
split, no further migration occurs between them. In this
case, the haplogroups investigated were not shared between pop-
ulations but derive from a common source, thereby justifying this
approach. Duplicated loci and new STR variants detected in this
study were excluded from statistical analysis.
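The rho-based dating described above can be illustrated with a simplified sketch. It assumes a star-like genealogy, so that rho is the mean number of single-repeat steps separating each haplotype from the inferred founder; Network computes rho along network branches, which can differ. The function name and the mutation rate in the usage example are hypothetical placeholders, not the rates used in the study.

```python
def rho_tmrca(haplotypes, founder, rate_per_locus_per_generation,
              years_per_generation):
    """Estimate rho and a TMRCA in years from Y-STR haplotypes.

    `haplotypes` and `founder` are tuples of repeat counts in the same
    locus order. Assumes a star-like genealogy and stepwise mutation.
    """
    n_loci = len(founder)
    steps = [sum(abs(a - b) for a, b in zip(h, founder)) for h in haplotypes]
    rho = sum(steps) / len(haplotypes)
    # Expected mutations per generation across all loci = rate * n_loci.
    generations = rho / (rate_per_locus_per_generation * n_loci)
    return rho, generations * years_per_generation


# Hypothetical data and rate, for illustration only.
founder = (13, 14, 15)
sample = [(13, 14, 15), (14, 14, 15), (13, 14, 17)]
rho, years = rho_tmrca(sample, founder, 0.002, 30)
print(rho, round(years))
```

Because the estimate scales inversely with the mutation rate, the choice between an evolutionary and a pedigree-based rate changes the resulting date proportionally.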
Results
Mitochondrial DNA and Y Chromosome Diversity
The maternal genetic ancestry of northern and southern
Altaian populations was explored by characterizing coding
region SNPs and control region sequences from 490 inhab-
itants of the Altai Republic, which yielded 99 distinct
mtDNA haplotypes defined by SNP and HVS1 mutations
(Table S3). The majority of mtDNAs were of East Eurasian
origin, although the relative proportion of these haplo-
types was greater in Chelkans (91.5%) compared to other
Altaian populations (75.2% in Tubalars, 75.6% in Kuman-
dins, and 76.4% in Altai-kizhi) (Table 1). Despite exhibit-
ing a lower overall frequency of West Eurasian haplo-
groups, Altaians (specifically, the Altai-kizhi, Tubalar, and
Kumandins) had a higher proportion of them as compared
to other southern Siberians.41,43 Differences in mtDNA
haplogroup profiles were observed among northern
Altaian ethnic groups and between northern Altaians
and Altai-kizhi, with the Chelkans being extraordinarily
distinct. Nevertheless, comparisons among other Altaian
ethnic groups revealed some consistent patterns. mtDNA
haplogroups B, C, D, and U4 were found in all Altaian pop-
ulations, but at varying frequencies, whereas southern
Altaians (Altai-kizhi, Telengits, and Teleuts) tended to
have a greater variety of West Eurasian haplogroups at
low frequencies. Shors, who have sometimes been catego-
rized as northern Altaians,18 exhibited a similar haplo-
group profile to other northern Altaian ethnic groups,
including moderate frequencies of C, D, and F1, although
they lacked others (N9a and U).41
Haplogroups C and D were the most frequent mtDNA
lineages in the Altaians, consistent with the overall picture
of the Siberian mtDNA gene pool. However, phylogeo-
graphic analysis of these lineages showed a greater diver-
sity of haplotypes in the southern Altaians compared to
northern Altaians. Although haplotypes were shared
between regions, northern Altaians largely had C4 with
the root HVS1 motif (16223-16298-16327) and C5c,
whereas the southern Altaians had C4a1 and C4a2.
Although C5c is largely confined to Altaians, it has been
suggested that an early migration from Siberia to Europe
brought haplogroup C west, where the branch differenti-
ated during the Neolithic and then was taken back into
southern Siberia.83 Also noteworthy, D4j7 appears to
be specific to Altaians and Shors.41,91 In addition, a D5a
haplotype was shared by Tubalars and Altai-kizhi, and
a rare D5c2 haplotype was shared by the Chelkans
and Kumandins. Interestingly, complete mtDNA genome
sequencing of a subset of our D5c2 samples showed few
differences from those present in Japan,55 suggesting
a possible connection resulting from the dispersal of Altaic
speaking populations.92 The remainder of the D haplo-
types were found in other southern Siberian and Central
Asian populations.
To explore the NRY variation in Altaian populations, 116
biallelic polymorphisms were characterized in 189 male
individuals, resulting in 106 Y chromosome lineages
(Table 2). Northern Altaian populations were composed
largely of haplogroups Q and N-P43, whereas southern
Altaians had a higher proportion of R-M417, C-M217/
PK2, C-M86, and D-P47. Haplogroups typical of south
Asia, western Europe, and East Asia were not found in
appreciable frequencies.72,93–99 The haplogroup frequency
differences between northern and southern Altaians were
statistically significant (χ² = 66.03, df = 9, p = 9.09 × 10⁻¹¹).
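The reported p value can be checked from the test statistic and degrees of freedom alone. The sketch below is not the authors' procedure (the software used for this test is not stated here); it applies a standard closed form for the upper tail of a chi-square distribution with an odd number of degrees of freedom. The function name `chi2_sf_odd_df` is hypothetical.

```python
import math

def chi2_sf_odd_df(x: float, df: int) -> float:
    """Upper-tail probability P(X >= x) for a chi-square variate with odd df."""
    if df < 1 or df % 2 == 0:
        raise ValueError("this closed form only covers odd degrees of freedom")
    chi = math.sqrt(x)
    # Two-sided standard normal tail beyond chi: 2 * Q(chi)
    tail = math.erfc(chi / math.sqrt(2.0))
    # Standard normal density evaluated at chi
    density = math.exp(-x / 2.0) / math.sqrt(2.0 * math.pi)
    series = 0.0
    odd_product = 1.0
    for r in range(1, (df - 1) // 2 + 1):
        odd_product *= 2 * r - 1          # 1 * 3 * 5 * ... * (2r - 1)
        series += chi ** (2 * r - 1) / odd_product
    return tail + 2.0 * density * series

# Values reported in the text: chi-square = 66.03 with 9 degrees of freedom
p_value = chi2_sf_odd_df(66.03, 9)
print(f"p = {p_value:.2e}")
```

The result is on the order of 10⁻¹¹, in line with the reported value; small differences in the last digits are expected because the test statistic is given to only two decimal places.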
As with the mtDNA data set, we also observed differ-
ences in NRY haplogroup composition among northern
Altaian populations, where each ethnic group shared
haplogroups with the other two, yet had distinct haplo-
group profiles. Overall, Kumandins had the most disparate
haplogroup frequencies of the northern Altaians, exhibiting
a similar number of N-P43 chromosomes as the
Chelkans, which were quite similar to those found in
Khanty and Mansi populations in northwestern Sibe-
ria.68,100 In addition, a large proportion of Kumandin Y
chromosomes belonged to R-M73. This haplogroup is
largely restricted to Central Asia101 but has also been found
in Altaian Kazakhs and other southern Siberians.64,102 In
fact, Myres et al.101 noted two distinct clusters of R-M73
STR haplotypes, with one of them containing Y chromo-
somes bearing a 19 repeat allele for DYS390, which appears
to be unique to R-M73. Interestingly, the majority of
Kumandin R-M73 haplotypes fell into this category,
although haplotypes from both clusters are found in
southern Siberia.102
In all cases, the haplotypes present in Altaians fit into
known modern human phylogenies. None of the Altaians
had a mitochondrial lineage similar to those of Neander-
thals or the Denisovan hominin. Although there are no
ancient Denisovan or Neanderthal Y chromosome data
to compare with the Altaian data set, the Altaian Y chro-
mosomes clearly derived from more recent expansions of
modern humans out of Africa.
Altaian Genetic Relationships
Summary statistics were calculated to assess the relative
amounts of genetic diversity in Altaian populations
(Table 3). Gene diversities based on HVS1 of the mtDNA
showed that, overall, the Altai-kizhi were more diverse
than the northern Altaians. The average pairwise differences
for the Altai-kizhi were, however, smaller. In fact, the estimates
for the Altai-kizhi and Tubalars were comparable
to other southern Siberians.43 By contrast, those for the
Chelkans and Kumandins were lower and more similar to
Soyots, but not as low as that of Tofalars. Mismatch distri-
butions were smooth and bell-shaped for all populations
except the Chelkans, which had a significant raggedness
index. This statistic indicated that Tubalars, Kumandins,
and Altai-kizhi had experienced sudden expansions
or expansions from population bottlenecks.103 Tests of
neutrality confirmed these findings in yielding signifi-
cantly negative Tajima’s D and Fu’s FS estimates for all
populations, except the Chelkans, indicating that this
Table 1. mtDNA Haplogroup Frequencies of Altaian Populations
Hg Chelkan Kumandin Tubalar1 Tubalar2 Shor Altai-kizhi1 Altai-kizhi2 Telengit Teleut
# 91 52 71 72 28 276 48 55 33
C 15.1 41.5 35.6 20.8 17.9 31.4 25.0 14.6 24.2
Z 2.7 3.6 4.3 4.2 3.0
M8 3.6 4.2
D4 13.9 15.1 24.7 15.3 25.0 13.0 6.3 18.2 24.2
D5 8.6 3.8 4.1 5.6 3.6 0.7 3.0
G 3.2 4.0 4.2 3.6
M7 1.8
M9 1.4
M10 1.1 3.6 0.4 2.1
M11 2.1 1.8 3.0
M* 1.8
A 1.9 11.1 3.6 2.9 4.7 7.3
I 3.6 1.4 2.1 1.8
N1a 1.8
N1b 0.4
W 1.1
X 3.8 1.4 2.2 2.1 3.0
N9a 19.4 1.9 2.7 6.9 1.8
B 3.2 3.8 2.7 4.2 3.6 1.4 6.3 14.6 6.1
F1 10.8 3.8 1.4 14.3 8.3 4.2 1.8 3.0
F2 15.1 2.7 3.6 2.5 2.1
H 1.1 2.7 1.4 3.6 2.5 8.3 9.1 9.1
H2 3.3 2.1
H8 5.7 2.7 4.2 3.6 1.4
HV 1.8
V 6.1
J 3.6 4.0 6.3 1.8
T 1.9 0.4 3.6 6.1
U2 2.8 0.7 1.8 3.0
U3 2.1
U4 4.3 3.8 15.1 18.1 3.6 0.7 2.1 1.8 3.0
U5 2.2 9.4 4.1 5.6 3.3 2.1 1.8
U8 1.8
K 3.6 3.3 6.3 3.0
R9 1.1 3.8 1.4 2.2 5.5
R11 2.1
particular population probably experienced a reduction in
population size or was subdivided.
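The mismatch distributions and average pairwise differences discussed above are both derived from counts of nucleotide differences between every pair of sequences in a sample. A minimal sketch of that calculation follows, using made-up toy sequences rather than the HVS1 data from this study; the function names are hypothetical, and the sequences are assumed to be aligned and of equal length.

```python
from collections import Counter
from itertools import combinations

def pairwise_differences(seqs):
    """Number of differing sites for every unordered pair of aligned sequences."""
    if len({len(s) for s in seqs}) != 1:
        raise ValueError("sequences must be aligned to the same length")
    return [sum(a != b for a, b in zip(s, t)) for s, t in combinations(seqs, 2)]

def mismatch_distribution(seqs):
    """Relative frequency of each pairwise-difference class, from 0 up to the maximum."""
    diffs = pairwise_differences(seqs)
    counts = Counter(diffs)
    return {k: counts.get(k, 0) / len(diffs) for k in range(max(diffs) + 1)}

# Toy alignment (hypothetical data, not from the study)
toy = ["ACGTAC", "ACGTAT", "ACGAAT", "TCGAAT"]
diffs = pairwise_differences(toy)
mean_diff = sum(diffs) / len(diffs)
distribution = mismatch_distribution(toy)
print(diffs, round(mean_diff, 3), distribution)
```

A smooth, unimodal distribution of these classes is the pattern usually read as a signal of past expansion, whereas a ragged, multimodal one is the pattern reported here for the Chelkans.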
To understand Altaian maternal genetic background, we
compared our data with those from other North Asian and
Central Asian populations. FST values between populations
were calculated with HVS1 sequences and viewed through
multidimensional scaling (Figure 2). In this analysis,
southern Siberians formed a rather diffuse cluster, with
most Central Asian and Mongolian populations being
separated from them. Altaian populations also did not
constitute a distinct cluster unto themselves. Based on
the FST values, the Chelkans were distinctive from all other
ethnic groups. Although falling closest to the Khakassians
in the MDS plot, they shared a smaller genetic distance
with the Tubalars2, which was expected because of the
inclusion of some Chelkans in that sample set.44 Kuman-
dins and Tubalars1 were not significantly different, and
appeared close to Tuvinians and southern Altaians. In
fact, both populations had smaller FST values with
southern Altaians than they did with the Chelkans,
although the genetic distances between Tubalars1 and
Tubalars2, Altai-kizhi, and Teleuts were also nonsignifi-
cant. Unlike northern Altaians, most of the southern
Altaian populations clustered together. The Altai-kizhi,
Teleuts, and Tubalars1 formed one small cluster with
Kyrgyz, whereas the Telengits showed greater affinities
with Central Asian populations. The southern Altaian
cluster sat near a cluster of Tuvinian populations, suggest-
ing a similar population history and likely gene flow
between these groups.
Summary statistics were calculated to assess the genetic
diversity of paternal lineages in Altaian populations
(Table 4). Gene diversities based on Y-STR haplotypes
(15-loci Y-STR haplotypes; Table S4) showed that the Altai-
kizhi were more diverse than the northern Altaians. Unlike
the mtDNA data, within-group pairwise differences were
greater in the southern Altaian and Tubalar Y-STR haplo-
types than in the Chelkans and Kumandins.
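Gene (haplotype) diversity of the kind reported in Tables 3 and 4 is commonly computed with Nei's unbiased estimator, H = n/(n − 1) × (1 − Σ pᵢ²), where n is the sample size and pᵢ is the frequency of the i-th haplotype. The sketch below assumes that estimator, which may differ in detail from the software used in this study, and uses hypothetical toy labels rather than the Altaian haplotypes.

```python
from collections import Counter

def haplotype_diversity(haplotypes):
    """Nei's unbiased gene diversity for a list of haplotype labels."""
    n = len(haplotypes)
    if n < 2:
        raise ValueError("at least two samples are required")
    freqs = [count / n for count in Counter(haplotypes).values()]
    return n / (n - 1) * (1.0 - sum(p * p for p in freqs))

# Toy sample (hypothetical): four individuals carrying three distinct haplotypes
toy_sample = ["ht1", "ht1", "ht2", "ht3"]
toy_diversity = haplotype_diversity(toy_sample)
print(round(toy_diversity, 4))
```

The estimator equals 1 when every individual carries a different haplotype and 0 when all share one, so values near 0.98, as reported for the Altai-kizhi, indicate that most sampled chromosomes are distinct.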
Y-chromosomal variation in the four populations in our
data set provided a slightly different picture than the mito-
chondrial data. In this analysis, RST values were calculated
with 15-loci Y-STR haplotypes (Table S6). These estimates
indicated that only the Chelkans and Tubalars were not
Table 2. High-Resolution NRY Haplogroup Frequencies in Altaian Populations
Haplogroup   Chelkan      Kumandin     Tubalar      Altai-kizhi
C3*          -            -            -            19 (0.158)
C3c1         -            -            -            5 (0.042)
D3a          -            -            -            6 (0.050)
E1b1b1c      -            -            1 (0.037)    -
I2a          -            -            1 (0.037)    -
J2a          -            -            -            3 (0.025)
L            1 (0.040)    -            -            -
N1*          -            1 (0.059)    3 (0.111)    -
N1b*         5 (0.200)    8 (0.471)    -            2 (0.017)
N1c*         -            -            -            1 (0.008)
N1c1         -            -            -            2 (0.017)
O3a3c*       -            -            -            1 (0.008)
O3a3c1       -            -            1 (0.037)    1 (0.008)
Q1a2         -            -            1 (0.037)    -
Q1a3a*       15 (0.600)   -            10 (0.370)   -
Q1a3a1c*     -            -            -            20 (0.167)
R1a1a1*      4 (0.160)    2 (0.118)    10 (0.370)   60 (0.500)
R1b1a1       -            6 (0.353)    -            -
T            -            -            -            -
Total        25           17           27           120
Table 3. HVS1 Summary Statistics for Altaian Populations
Population             Northern Altaian                                   Southern Altaian
                       Chelkan          Kumandin         Tubalar1         Altai-kizhi1
# of samples           91               52               71               276
# of haplotypes        22               18               26               75
Haplotype diversity    0.923 ± 0.013    0.914 ± 0.021    0.953 ± 0.010    0.976 ± 0.003
Nucleotide diversity   0.020 ± 0.011    0.022 ± 0.011    0.019 ± 0.010    0.018 ± 0.009
Pairwise differences   7.68 ± 3.61      8.22 ± 3.87      7.03 ± 3.34      6.84 ± 3.23
Raggedness index       0.032            0.022            0.010            0.011
Raggedness p value     0.000            0.149            0.635            0.388
Tajima D               1.201            −0.644           −0.701           −1.180
Tajima D p value       0.000            0.000            0.000            0.000
Fu’s FS                3.417            −0.497           −3.877           −24.416
Fu’s FS p value        0.002            0.000            0.000            0.000
significantly different from each other. The Kumandins
were quite distant from all populations, although these
distances were slightly smaller among northern Altaians
than with the Altai-kizhi. The Altai-kizhi were again closest
to the Tubalars.
These relationships were affirmed by the haplotype
sharing between the four populations. The Chelkans and
Tubalars shared a large proportion of their haplotypes,
mostly those from haplogroups Q and R-M417, whereas
the Kumandins shared only one haplotype with Tubalars
(a rare N-LLY22g haplotype). In addition, the northern
and southern Altaians shared only a single haplotype,
belonging to haplogroup O-M117, which is more
commonly found in southern China.104 In fact, these
two Y chromosomes were the only occurrences of hap-
logroup O in our data set.
The Y-STR profiles were reduced to 10-loci STR haplo-
types in order to compare Y chromosome diversity in
several Siberian and Central Asian populations (Table 5;
Figure 3). The genetic distances in our sample set remained
high despite the greater haplotype sharing that resulted
from this reduction. Overall, the genetic distances were
much greater with the Y-STR haplotypes than with
mtDNA haplotypes, indicating greater genetic differentiation
in paternal lineages than in maternal lineages.
In addition to the Chelkans and Tubalars, two other groups
of populations exhibited nonsignificant RST values. One
group included Uyghur (from Urumqi and Yili) and
Mongolian (Kalmyks and Mongolians) populations, and
the other included the Mansi and a Sagai population iden-
tified as part of the Khakass ethnic group. In contrast with
their position in the mtDNA MDS plot, northern Altaians
were separated from all other populations, including other
southern Siberians. The three groups of Khakass (Sagai,
Sagai/Shor, and Kachin) fell much closer to the Khanty
and Mansi, which probably indicates a common ancestry
Figure 2. MDS Plot of FST Genetic Distances Generated from mtDNA HVS1 Sequences in Siberian and Central Asian Populations
Circle, southern Siberian; diamond, northwestern Siberian; square, Central Asian.
Table 4. Y-STR Summary Statistics for Altaian Populations
Population             Northern Altaian                                   Southern Altaian
                       Chelkan          Kumandin         Tubalar          Altai-kizhi
# of samples           25               17               27               120
# of haplotypes        14               9                18               62
Haplotype diversity    0.910 ± 0.043    0.912 ± 0.042    0.954 ± 0.025    0.978 ± 0.005
Pairwise differences   6.59 ± 3.22      6.39 ± 3.19      7.40 ± 3.57      7.58 ± 3.56
for these populations. Unfortunately, more complete
Y-STR data sets were not available for other southern Sibe-
rian populations. Nonetheless, these results indicated a
different history for northern Altaians compared to
Central Asians and even other southern Siberians. A
specific reason for this difference is that Mongolians
had a much greater genetic impact on southern Altaians,
which is expected given the historical and linguistic
evidence.18,19,105
Altaian and Native American Connections
To test the hypothesis that Native Americans share a
more recent common ancestor with Altaians relative
to other Siberian and East Asian populations, we specifi-
cally examined the mtDNA and NRY haplogroups that
appeared in both locations. For the mtDNA, it is well
known that haplogroups A–D and X largely make up the
maternal genetic heritage of indigenous peoples in the
Americas.27,29,39,47,106 Complete mtDNA genome sequenc-
ing has led to a greater comprehension of the phylogeny of
Native American mtDNAs and, consequently, a better
understanding of their origins.107–110 Although Altaians
possess the five primary mtDNA haplogroups found in
the Americas, these lineages are not exactly the same as
those appearing in Native Americans at the subhaplogroup
level. This is also true for other Siberian populations except
in those few instances where gene flow across the Bering
Strait brought some low frequency types back to north-
eastern Siberians.
An example of this pattern is haplogroup C1a.
Southern Altaians possessed C1a, which is an exclusively
Asian branch of the predominately American C1 haplo-
group.107,108 To date, only four complete C1a genomes
have been published. These sequences produced a more
recent TMRCA than other genetic evidence had previously
suggested for the peopling of the Americas. Although
Tamm et al.107 viewed this haplogroup as representing a
back migration into Siberia, it does not occur in Siberian
populations that are geographically closest to the Americas,
but rather those living in southern and southeastern
Siberia.41,89 However, given the small effective population
sizes from the northeastern Siberian groups that have
been studied thus far, this haplogroup could have been
lost because of drift.
The other mtDNA haplogroup found in northern
and southern Altaians that is a close relative of a Native
American lineage is D4b1a2a1a. This haplogroup has
been found in Altaians, Shors, and Uzbeks from north-
western China.41,44,70 Analysis of complete mtDNA
genomes identified a sister branch (D4b1a2a1a1), which
is found only in northeastern Siberian populations
and Inuit from Canada and Greenland.42,45,54,91,111
TMRCAs were calculated from the complete mtDNA
genomes of this branch and those from Native American
D4b1a2a1a1. By analyzing only synonymous mutations
from these sequences with the method of Soares et al.,84
Table 5. Low-Resolution NRY Haplogroup Frequency Comparison of Altaians
Hg             Chelkan  Kumandin  Tubalar  Altai-kizhi1  Altai-kizhi2  Teleut1  Teleut2  Shor
C              -        -         -        20.0          13.0          8.5      5.7      2.0
D              -        -         -        5.0           3.3           -        -        -
E              -        -         3.7      -             -             -        -        -
F (xJ,K)       -        -         3.7      -             3.3           10.7     -        2.0
J              -        -         -        2.5           2.2           2.1      -        -
K (xN1c,O,P)   24.0     52.9      11.1     1.7           2.2           -        -        13.7
N1c            -        -         -        2.5           5.4           10.6     28.6     2.0
O              -        -         3.7      1.7           -             -        -        -
P (xR1a1a)     60.0     35.3      40.7     16.7          28.3          -        34.3     2.0
R1a1a          16.0     11.8      37.0     50.0          42.4          68.1     31.4     78.4
Total          25       17        27       120           92            47       35       51
Figure 3. MDS Plot of RST Genetic Distances Generated from Y Chromosome STR Haplotypes in Siberian and Central Asian Populations
Circle, southern Siberian; diamond, northwestern Siberian; square, Central Asian.
we estimated the TMRCAs of these two branches at
11.8 kya and 15.8 kya, respectively.
For the Y chromosome, indigenous American lineages
are derived mostly from haplogroups C and Q, and, as
such, are crucial for understanding the genetic histories
of peoples from the Americas and how they relate to
populations of Central Asia and Siberia.9,39,93,98,112,113
Just as Seielstad et al.114 and Bortolini et al.38 used M242
to clarify the genetic relationship between Asian and
American Y chromosomes, the characterization of this
haplogroup at an even higher level of resolution has led
to a much greater understanding of the origins of Native
American Y chromosomes and their connections to Asian
types. In this regard, it was recently shown that the
American Q-M3 SNP is located on an M346-positive
background.63 The presence of M346 in Central Asia and
Siberia has strengthened the argument for a southern
Siberian or Central Asian origin for many American Y chro-
mosomes.85,99,102,115
Given the importance of haplogroup Q for Native
American origins, we subjected samples from this lineage
to high-resolution SNP analysis involving 37 biallelic
markers to better understand the relationship between
Old and New World populations and the migration(s)
that connect them. All Y chromosomes in this study that
belonged to haplogroup Q (as indicated by the presence
of M242) were also found to have the P36.2, MEH2,
L472, and L528 markers (Figure S1). Thus, these haplo-
types fell into the Q1a branch of the Y chromosome
phylogeny. Because Q1b Y chromosomes were not found
in Altaian samples, we were not able to definitively place
the L472 and L528 SNPs at the same phylogenetic position
as MEH2. For this reason, their placement is tentative until
L275/L314/M378 Y chromosomes are screened for these
markers. Furthermore, M120/M265-positive, P48-positive,
and P89-positive samples were not found in the Altai
region. Therefore, the placement of these branches at the
same phylogenetic level as M25/M143 and M346/L56/
L57 should also be considered as provisional (although
see Karafet et al.63).
The M346, L56, and L57 SNPs were positioned as ances-
tral to three derived branches in the Family Tree DNA
phylogeny. We found that the L474, L475, and L476
SNPs were present in all of our M346-positive samples.
However, because M323- and L527/L529-positive samples
were not found in the Altaians, we could not confirm the
exact position of these markers at either the Q1a3 or
Q1a3a level. On the other hand, all Altaians that possessed
the M346, L56, L57, L474, L475, and L476 SNPs also had
L53, L55, L213, and L331.
Interestingly, northern and southern Altaian Q Y chro-
mosomes differed by three markers. L54, L330, and L333
were found in Q haplotypes in the southern Altaians and
one Altaian Kazakh, whereas the northern Altaian Q
haplotypes lacked these derived SNPs. Thus, according to
the standard nomenclature set by the Y Chromosome
Consortium62 and followed by others, the northern Altaian
Q haplotypes belonged to Q1a3a* and the southern
Altaians belonged to Q1a3a1c*. We have further confirmed
that M3 haplotypes belong to L54-derived Y chromosomes
(unpublished data). These alterations in the phylogeny
change the haplogroup name of the Native American
Q-M3 Y chromosomes from Q1a3a to Q1a3a1a. Moreover,
the position of M3 and L330/L333 in the phylogeny indis-
putably showed that the MRCA of most Native American
Y chromosomes was shared with southern Altaians.
The differences between the northern and southern
Altaian Q Y chromosomes were also reflected in the anal-
ysis of high-resolution Y-STR haplotypes (Figure S2).116
Comparisons of Altaian Q-M346 Y chromosomes with
those from southern Siberian, Central Asian, and East
Asian populations revealed affinities between southern
Altaian and these other groups. However, the northern
Altaians remained distinctive, even in networks con-
structed from fewer Y-STR loci (Figure S3).
The time required to evolve the extent of haplotypic
diversity observed in each of the subhaplogroups can aid
in determining when particular mutations arose and
possibly when these mutations were carried to other loca-
tions. The TMRCA for the northern Altaian Q1a3a* Y chro-
mosomes indicated a relatively recent origin for them, one
dating to either the Bronze Age or recent historical period,
depending on the Y-STR mutation rate being used (Table 6).
The southern Altaian/Altaian Kazakh Q1a3a1c* Y chromo-
somes had a slightly older TMRCA that dated them to
either the late Neolithic or early Bronze Age. By using
Bayesian analysis, we further estimated the divergence
time of the two Q haplogroups at about 1,000 years after
the TMRCA of all Altaian Q lineages (~20 kya), indicating
an ancient separation of northern and southern Altaian
Q Y chromosomes (Table 7).
A similar analysis was conducted to determine when the
L54 haplogroup arose and gave rise to M3 and L330/L334
subbranches. The indigenous American Y chromosomes
used in this analysis were more diverse than those of
southern Altaians. The resulting TMRCA for the South
American Q1a3a1a* samples was 22.2 kya or 7.6 kya,
depending on the mutation rate used. The divergence
between the M3 and L330/L334 Y chromosomes was
~13.4 kya, with a TMRCA of 22.0 kya, via the evolutionary
rate. By contrast, the TMRCA and divergence time via
a pedigree-based mutation rate were 7.7 kya and 4.9 kya,
respectively.
The time required to generate the haplotypic diversity in
the L54-positive Y chromosomes clearly showed that the
evolutionary rate provided a more reasonable estimate.
The Americas were inhabited well before 5–8 kya, based
on various lines of evidence, making the use of the pedi-
gree-based mutation rate questionable. The estimates
generated with the evolutionary-based mutation rate
provided times that are more congruent with the known
prehistory of the Americas.117 They are also similar to the
TMRCAs calculated for Native American mtDNA haplo-
groups.107,108
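The dependence of these dates on the assumed mutation rate can be read directly from the rho-based estimates in Table 6. In rho dating, the age estimate is the rho statistic divided by the mutation rate, so changing the rate should rescale every estimate by the same factor. The sketch below, which uses only the rho values from Table 6 (the variable names are illustrative), computes the ratio of evolutionary-rate to pedigree-rate estimates for each lineage.

```python
# Rho-based TMRCA estimates from Table 6: (pedigree-based, evolutionary-based)
rho_tmrca = {
    "All Q1a3a": (5390, 14970),
    "Q1a3a*": (1410, 3910),
    "Q1a3a1a*": (5820, 16170),
    "Q1a3a1c*": (2420, 6750),
}

# Ratio of the evolutionary-rate estimate to the pedigree-rate estimate
ratios = {hg: evo / ped for hg, (ped, evo) in rho_tmrca.items()}

for hg, ratio in ratios.items():
    print(f"{hg:10s} {ratio:.2f}")
```

The ratios are nearly constant across lineages, which suggests that the two sets of rho dates differ only by the choice of rate. The Batwing medians in the same table come from a Bayesian model and need not rescale so uniformly.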
Discussion
Origins of Northern and Southern Altaians
In this paper, we characterized mtDNA and NRY variation
in northern and southern Altaians to better understand
their population histories and elucidate the genetic
relationship between Altaians and Native American popu-
lations. The evidence from the mtDNA and NRY data
supports the hypothesis that northern and southern
Altaians generally formed out of separate gene pools.
This complex genetic history involves repeated migrations
into (and probably out of) the Altai-Sayan region. In addi-
tion, the histories as revealed by these data added nuances
that could not be attained through low-resolution charac-
terization alone.
The NRY data provided the clearest evidence for a signif-
icant genetic difference between the two sets of Altaian
ethnic groups. Although sharing certain NRY haplogroups,
the two population groups differed in the frequencies of
these lineages, and, more importantly, shared few haplotypes
with each other. By contrast, northern and southern populations
shared considerably more mtDNA haplotypes,
indicating that some degree of gene flow had occurred
between them, albeit in a sex-specific manner. As seen in
other populations from Siberia and Central Asia, the patri-
lineality of these groups probably helped to shape this
difference in patterns of mtDNA and Y-chromosomal vari-
ation.64,118
In addition, each northern Altaian ethnic group showed
different genetic relationships with the Altai-kizhi. The
Tubalars consistently grouped closer to the Altai-kizhi
than the other two northern Altaians based on both
mtDNA and NRY data. Thus, the higher genetic diversity
of mtDNA and NRY haplotypes in the Tubalars is probably
the result of admixture with other groups, such as
southern Altaians. The Chelkans, on the other hand,
have the most divergent set of mtDNAs of the three popu-
lations. Mismatch analysis and tests of neutrality indicated
that the Chelkans show signs of decreasing population size
or population structure. Long-term endogamy has prob-
ably also played a role in shifting the pattern of mtDNA
diversity in Chelkans from that seen in other northern
Altaians. Because of this endogamy (and genetic drift),
only a few lineages attained high frequencies, resulting
Table 7. Divergence Times between Haplogroups/Populations
                                          TMRCA                               Split Time
                                          Median   95% Confidence Interval    Median   95% Confidence Interval
Pedigree-Based Mutation Rate
Northern and Southern Altaians            5,490    [3,000–11,100]             4,490    [1,730–10,070]
Southern Altaians and Native Americans    7,740    [5,170–12,760]             4,950    [2,360–9,490]
Evolutionary-Based Mutation Rate
Northern and Southern Altaians            21,890   [9,900–57,440]             19,260   [7,060–54,600]
Southern Altaians and Native Americans    21,960   [12,260–42,690]            13,420   [5,220–30,430]
Table 6. TMRCAs and Expansion Times for Altaian and Native American NRY Haplogroup Q Lineages
                   Network           Batwing - TMRCA             Batwing - Expansion
Hg          N      ρ ± σ             Median   95% C.I.           Median   95% C.I.
Pedigree-Based Mutation Rate
All Q1a3a   97     5,390 ± 1,000     8,420    [5,620–14,290]     7,230    [1,220–20,510]
Q1a3a*      25     1,410 ± 580       1,480    [680–3,060]        2,100    [380–6,830]
Q1a3a1a*    52     5,820 ± 1,280     7,630    [4,870–12,920]     4,680    [480–14,940]
Q1a3a1c*    20     2,420 ± 700       2,970    [1,500–5,960]      2,680    [450–8,610]
Evolutionary-Based Mutation Rate
All Q1a3a   97     14,970 ± 2,760    25,580   [14,230–51,140]    17,220   [1,380–54,950]
Q1a3a*      25     3,910 ± 1,610     5,320    [2,300–12,160]     4,340    [1,000–13,080]
Q1a3a1a*    52     16,170 ± 3,550    22,160   [11,960–44,340]    9,800    [620–39,543]
Q1a3a1c*    20     6,750 ± 1,950     8,720    [3,960–20,010]     5,600    [1,030–17,910]
Note: ρ, rho statistic; σ, standard error; Q1a3a*, Northern Altaians (this study); Q1a3a1a, Native Americans (Geppert et al.76); Q1a3a1c, Southern Altaians (this study).
in reduced mtDNA diversity. Based on the NRY data, the
Kumandins were distinct from both the Chelkans and
Tubalars, who were composed of mostly the same set of
lineages. Thus, the genetic diversity in northern Altaians
is structured by ethnic group membership, and, therefore,
can be viewed as reflecting distinctive histories for each
population.
Not much is known about the ethnogenesis of northern
Altaians. However, it has been suggested that they
descended from groups that historically lived around the
Yenisei River and spoke either southern Samoyedic, Ugric,
or Yeniseian languages.18,19 These populations are the
same ones that later contributed to the formation of the
Kets, Selk’ups, Shors, and Khakass in northwestern Siberia
and the western Sayans of southern Siberia.4,105 Further-
more, the Chelkans and Tubalars possess a large number
of Q1a3a* Y chromosomes with dramatically different
STR profiles compared to other southern Siberians (Altai-
kizhi and Tuvinians) and Mongolians. Thus, it is possible
that similar lineages will be found in the Kets and/or
Selk’ups, where high frequencies of Q1-P36 have already
been noted.119 Should this be the case, it would provide
additional evidence for northern Altaians having common
ancestry with Samoyedic, Yeniseian, and Ugric speakers. In
fact, Chelkans and Kumandins also have N-P43 Y chromo-
somes very similar to ones found in the Ugric-speaking
Khanty. Regardless, there is notable genetic discontinuity
between northern Altaians and other Turkic-speaking
people of southern Siberia.
Southern Altaians share greater affinities with Mongo-
lians and Central Asians than they do with northern
Altaians. This is partly because of the high frequencies of
Y chromosome haplogroup C in these groups. In fact,
present-day Kyrgyz are nearly indistinguishable from the
Altai-kizhi based on their NRY haplogroup profile.120,121
They share similar C-M217 and R-M417 lineages with
the Altai-kizhi, suggesting a recent common ancestry for
the two groups, which further supports the theory of a
recent common ancestry among southern Siberians and
Kyrgyz.122
As evident in the disparities in genetic history between
northern and southern Altaians, the Altai has served as
a long-term genetic boundary zone. These disparities
reflect the different sources of genetic lineages and spheres
of interaction for both groups. The northern Altaians share
clan names, similar languages, subsistence strategies, and
other cultural elements with populations that today live
farther to the north.4 By contrast, southern Altaians share
these same features with populations in Central Asia,
mostly with Turkic- (Kipchak) but also Mongolic-speaking
peoples. Thus, the geography of the Altai (taiga versus
steppe) has helped to maintain these cultural and biolog-
ical (mtDNA, Y chromosome, and cranial-morphological)
differences.
Furthermore, no evidence of Denisovan or Neanderthal
ancestry was found in the Altaian mtDNA and Y chromo-
some data. However, this does not preclude such admix-
ture in the autosomes of Altaian populations. Greater
numbers of derived Denisovan SNPs were found in some
southeastern Asian and Oceanian populations, although
native Siberians were not included in that study.123 There-
fore, this issue requires further investigation.
Native American Origins
Many earlier genetic studies looked for the origins of
Native Americans among the indigenous peoples of Sibe-
ria, Mongolia, and East Asia. Often, the identification of
source populations conflicted between studies, depending
largely on the loci or samples being studied. Cranial
morphology has been used to demonstrate a connection
between the Native Americans and Siberian popula-
tions.124,125 Various researchers have suggested sources
such as the Baikal region of southern Siberia, the Amur
region of southeastern Siberia, and more generally Eurasia
and East Asia.126–128 A study of autosomal loci also showed
an affinity between populations in the New World and
Siberian regions but did not attempt to pinpoint a partic-
ular area of Siberia as the source area.129 In addition,
mtDNA studies have suggested New World origins from
a number of different locations including different parts
of Siberia, Mongolia, and northern China.34,41–45,47,71,130
Our own analysis of Altaian mtDNAs showed that the
five primary haplogroups (A–D, X) were present among
these populations. However, Altaian populations (and
generally all Siberian populations outside of Chukotka)
lack mtDNA haplotypes that are identical to those appear-
ing in the Americas. The only exceptions are the Selk’ups
and Evenks who bear A2 haplotypes, with their presence
in those groups being explained as a result of a back migra-
tion to northeast Asia.107
Despite the general absence of Native American haplo-
types in southern Siberia, there are sister branches whose
MRCAs are shared with those in Native Americans. One
such lineage is C1a, which was found in two Altai-kizhi
individuals and has also been observed at low frequencies
in Mongolia, southeastern Siberia, and Japan.44,46,55,71
Tamm et al.107 attribute its presence in northeast Asia to
a back migration from the New World, where haplogroups
C1b–d are prevalent, whereas Starikovskaya et al.44 argue
that C1a and C1b arose in the Amur region, with C1b
migrating to the Americas later. A similar lineage is
D4b1a2a1a, a sister branch to D4b1a2a1a1, which is found
in northern North America. Although both of these line-
ages date to around 15,000 years ago, additional mitoge-
nome sequences from these haplogroups are needed to
estimate more precise TMRCAs for them and thereby
delineate their putative Asian and American origins.
Results obtained from the Y chromosome analysis
support the view that southern Siberians and Native
Americans share a common source.8,9,11,38,131 This con-
nection was initially suggested by a low-level Y-SNP
resolution and an alphoid heteroduplex system by Santos
et al.8 Subsequently, Zegura et al.11 showed a similarity in
NRY Q and C types among southern Altaians and Native
Americans by using only fast evolving Y-STR loci and,
again, low-level Y-SNP resolution. We focused on haplo-
group Q in this study because of the greater number of
new mutations published for this branch and correspond-
ing levels of Y-STR resolution (15–17 loci), which are
currently lacking for published Native American haplo-
group C Y chromosomes. This high-resolution character-
ization is critical because it allows for a more accurate
dating of TMRCAs and estimates of divergence between
the ancestors of Native Americans and indigenous Sibe-
rians. For example, with this approach, Seielstad et al.114
dated the origin of the M242, which defines the NRY
haplogroup Q, and, in turn, provided a more accurate
upper bound to the timing of the initial peopling of the
Western Hemisphere.
Several studies have shown that the American-specific
Q-M3 arose on an M346-positive Y chromosome.63,115,132
The M346 marker was also discovered in Altaians and
other Siberian populations.102,116 However, it has a broad
geographic distribution, being found in Siberia, Central
Asia, East Asia, India, and Pakistan, albeit at lower frequencies.85,99
We have shown that southern Altaian M346 Y
chromosomes also possess L54, a SNP marker that is more
derived than M346 and is shared by Native Americans who
have the M3 marker. Because L54 is
found in both Siberia and the Americas, it most probably
defines the initial founder haplogroup from which M3
later developed.
Our coalescence analysis suggests that the two derived
branches of L54 (M3 and L330/L334) diverged soon after
this mutation arose. Estimates using the evolutionary
Y-STR mutation rate place the origin of this marker at
around 22,000 years ago, with the two branches diverging
at roughly 13,400 years ago. Although the 95% confidence
intervals for the Bayesian analyses are broad, the median
values of the TMRCAs estimated with this method closely
match those obtained through the analysis with rho statis-
tics. In addition, the coalescence estimates of northern and
southern Altaian Q Y chromosomes show that they, too,
are similar to the overall TMRCA estimates. This concor-
dance suggests that a rapid expansion probably occurred
for this particular Y chromosome branch around 15,000–
20,000 years ago. Given previous estimates for the timing
of the initial peopling of the Americas, this scenario seems
plausible, because these estimates fall in line with recent
estimates of indigenous American mitogenomes.107,133
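The rho-based dating discussed above can be illustrated with a minimal sketch. It is not the authors' pipeline: it assumes a star-like genealogy in which every sampled haplotype descends directly from a known founder haplotype, a strict stepwise mutation model, and the evolutionary Y-STR rate of 6.9 x 10^-4 per locus per 25 years from Zhivotovsky et al.88 The five-locus haplotypes and the function names `rho` and `tmrca_years` are hypothetical and chosen only for illustration.

```python
from statistics import mean


def rho(founder, haplotypes):
    """Average number of single-repeat steps between each sampled
    haplotype and the assumed founder (stepwise model, star-like tree)."""
    return mean(
        sum(abs(h[i] - founder[i]) for i in range(len(founder)))
        for h in haplotypes
    )


def tmrca_years(rho_value, n_loci, rate_per_locus, years_per_unit):
    """Convert rho to years. rate_per_locus is the mutation rate per locus
    per time unit, and years_per_unit is the length of that unit in years."""
    return rho_value / (n_loci * rate_per_locus) * years_per_unit


# Hypothetical 5-locus Y-STR haplotypes (repeat counts); not real data.
founder = (13, 24, 14, 10, 11)
sample = [
    (13, 24, 14, 10, 11),
    (13, 25, 14, 10, 11),
    (14, 24, 14, 10, 12),
    (13, 24, 13, 10, 11),
]

r = rho(founder, sample)
# Evolutionary rate: 6.9e-4 mutations per locus per 25 years (assumed here).
age = tmrca_years(r, n_loci=5, rate_per_locus=6.9e-4, years_per_unit=25)
print(f"rho = {r:.2f}, TMRCA ~ {age:,.0f} years")
```

In practice rho is computed over a reconstructed network rather than a strict star, so this sketch only shows how the statistic scales with the number of loci and the chosen rate.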
As in any study, there are limitations to this analysis. The
primary issues are the accuracy and precision of using
microsatellites for dating origins and dispersals of haplo-
types. The stochastic nature of mutational accumulation
will continue to be a source of some uncertainty in any
attempt at dating TMRCAs. For this reason, the question
of which Y-STR mutation rate to use for coalescence esti-
mates has been debated.88,134,135 In this study, the evolu-
tionary rate seems the most realistic, because estimates
generated with the pedigree rate provided times that are
much too recent, given what is known about the peopling
of the New World from nongenetic studies.117 There is no
evidence that the majority of Native Americans (men with
Q-M3 Y chromosomes) derived from a migration less than
8 kya, as would be suggested from the TMRCAs calculated
with the pedigree rate. However, other studies have used
the pedigree mutation rate to explore historical events
with great effect, the best-known case being the
Genghis Khan star cluster.136 It is possible that such rates
are, like that of the mtDNA, time dependent or that the
Y chromosomes to which the Y-STRs are linked have
been affected by purifying selection.84,133,137,138 In this
regard, the pedigree-based mutation rate would be more
appropriately used with lower diversity estimates, reflect-
ing recent historical events, while the evolutionary rate
would be used in scenarios with higher diversity estimates,
reflecting more ancient phenomena. Although the question is
beyond the scope of this paper, it is likely that the Y-STR
mutation rate follows a curve similar in shape to that of the
mitochondrial genome.
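The sensitivity of the dates to the choice of rate can be made explicit with a short worked relation. The notation is not from the article and is introduced here as an assumption: ρ is the mean number of mutational steps from the founder haplotype, L the number of Y-STR loci, μ the per-locus mutation rate per generation, and g the generation length in years. The evolutionary rate is that of Zhivotovsky et al.88 (6.9 x 10^-4 per locus per 25 years); the pedigree value of 2.1 x 10^-3 per locus per generation is an illustrative figure, not one stated in this section, and both rates are assumed to share a 25-year generation.

```latex
\[
T = \frac{\rho}{L\,\mu}\, g
\qquad\Longrightarrow\qquad
\frac{T_{\mathrm{evo}}}{T_{\mathrm{ped}}}
  = \frac{\mu_{\mathrm{ped}}}{\mu_{\mathrm{evo}}}
  \approx \frac{2.1 \times 10^{-3}}{6.9 \times 10^{-4}}
  \approx 3.0
\]
```

Under these assumptions the same data yield pedigree-rate ages roughly one third of the evolutionary-rate ages, because ρ, L, and g cancel in the ratio.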
Furthermore, haplogroup divergence dates need not
(and mostly do not) equate with population divergence
dates. In this case, however, the mutations defining the
southern Altaian and Native American branches of the
Q-L54 lineage most probably arose after their ancestral
populations split, given the geographic exclusivity of
each derived marker. Yet, sample sets that are not entirely
representative of a derived branch could potentially skew
the coalescent results. In all likelihood, the L54 marker
will be found in other southern Siberian populations,
because southern Altaians show some genetic affinities
with Tuvinians and other populations from the eastern
Sayan region. Even so, the consistency of TMRCA esti-
mates and the divergence dates for the different Q
branches examined here suggest that our data sets are suffi-
ciently representative. Moreover, even though the M3
haplotypes used in this analysis came exclusively from
indigenous Ecuadorian populations, the diversity found
within this data set is similar to previous estimates of the
age of the Q-M3 haplogroup.11
Although different lines of evidence point to different
source populations for Native Americans, the alternatives
need not be exclusive. The effects of historical and demo-
graphic events and evolutionary processes, particularly
recent gene flow, have shaped modern-day populations
such that we should not expect that any one population
in the Old World would show the same genetic composition
as populations in the New World. We recognize that one or
more ancestral populations probably differentiated into the
numerous populations of Siberia and Central Asia, which have
interacted over the past 15,000 years. Historical
expansions of people and the effects of animal and plant
domestication have played critical roles in shaping the
genetics of both Old and New World populations, particu-
larly in the past several thousand years. Modern popula-
tions have complex, local histories that need to be under-
stood if these are to be used in larger interregional (or
biomedical) analyses. Through the use of phylogeographic
methods, we can attain a better understanding of these
populations for such purposes. It is through this type of
approach that it becomes quite clear that southern Altaians
and Native Americans share a recent common paternal
ancestor.
Supplemental Data
Supplemental Data include three figures and six tables and can be
found with this article online at http://www.cell.com/AJHG/.
Acknowledgments
The authors would like to thank all of the indigenous Altaian
participants for their involvement in this study. We also thank
Fabricio Santos for his careful review of and helpful suggestions
for the manuscript, and two anonymous reviewers for their
constructive comments. In addition, we would like to acknowl-
edge the people who facilitated and provided assistance with our
field research in the Altai Republic. They include Vasiliy Semeno-
vich Palchikov, the staff of the Biochemistry Lab at the Turochak
Hospital, Dr. Maria Nikolaevna Trishina, Vitaliy Trishin, Alexander
A. Guryanov, the staff of the Native Affairs office in Gorniy
Altaiask, Galina Nikolaevna Makhalina, and Tatiana Kunduchi-
novna Babrasheva. In addition, we received help from a number
of people living in local villages around the Turochakskiy Raion,
particularly Alexander Adonyov. This project was supported by
funds from the University of Pennsylvania (T.G.S.), the National
Science Foundation (BCS-0726623) (T.G.S., M.C.D.), the Social
Sciences and Humanities Research Council of Canada (MCRI
412-2005-1004) (T.G.S.), and the Russian Basic Fund for Research
(L.P.O.). T.G.S. would also like to acknowledge the infrastructural
support provided by the National Geographic Society.
Received: September 15, 2011
Revised: December 6, 2011
Accepted: December 19, 2011
Published online: January 26, 2012
Web Resources
The URLs for data presented herein are as follows:
Arlequin, version 3.11, http://cmpg.unibe.ch/software/arlequin3/
Batwing, http://www.mas.ncl.ac.uk/~nijw/
Network, version 4.6.0.0, http://www.fluxus-engineering.com/sharenet.htm
Network Publisher, version 1.3.0.0, http://www.fluxus-engineering.com/nwpub.htm
Y-DNA Haplogroup Tree 2011, version 6.46, http://www.isogg.org/tree
References
1. Goebel, T. (1999). Pleistocene human colonization of
Siberia and peopling of the Americas: An ecological
approach. Evol. Anthropol. 8, 208–227.
2. Gryaznov, M.P. (1969). The Ancient Civilization of
Southern Siberia (New York: Cowles Book Company, Inc.).
3. Okladnikov, A.P. (1964). Ancient population of Siberia
and its culture. In The Peoples of Siberia, M.G. Levin and
L.P. Potapov, eds. (Chicago: The University of Chicago
Press), pp. 13–98.
4. Levin, M.G., and Potapov, L.P. (1964). The Peoples of Siberia
(Chicago: University of Chicago Press).
5. Reich, D., Green, R.E., Kircher, M., Krause, J., Patterson, N.,
Durand, E.Y., Viola, B., Briggs, A.W., Stenzel, U., Johnson,
P.L.F., et al. (2010). Genetic history of an archaic hominin
group from Denisova Cave in Siberia. Nature 468, 1053–
1060.
6. Krause, J., Fu, Q., Good, J.M., Viola, B., Shunkov, M.V.,
Derevianko, A.P., and Paabo, S. (2010). The complete mito-
chondrial DNA genome of an unknown hominin from
southern Siberia. Nature 464, 894–897.
7. Krause, J., Orlando, L., Serre, D., Viola, B., Prufer, K.,
Richards, M.P., Hublin, J.J., Hanni, C., Derevianko, A.P.,
and Paabo, S. (2007). Neanderthals in central Asia and
Siberia. Nature 449, 902–904.
8. Santos, F.R., Pandya, A., Tyler-Smith, C., Pena, S.D., Schan-
field, M., Leonard, W.R., Osipova, L., Crawford, M.H., and
Mitchell, R.J. (1999). The central Siberian origin for native
American Y chromosomes. Am. J. Hum. Genet. 64, 619–628.
9. Karafet, T.M., Zegura, S.L., Posukh, O., Osipova, L., Bergen,
A., Long, J., Goldman, D., Klitz, W., Harihara, S., de Knijff,
P., et al. (1999). Ancestral Asian source(s) of new world
Y-chromosome founder haplotypes. Am. J. Hum. Genet.
64, 817–831.
10. Lell, J.T., Sukernik, R.I., Starikovskaya, Y.B., Su, B., Jin, L.,
Schurr, T.G., Underhill, P.A., and Wallace, D.C. (2002). The
dual origin and Siberian affinities of Native American Y chro-
mosomes. Am. J. Hum. Genet. 70, 192–206.
11. Zegura, S.L., Karafet, T.M., Zhivotovsky, L.A., and Hammer,
M.F. (2004). High-resolution SNPs and microsatellite haplo-
types point to a single, recent entry of Native American
Y chromosomes into the Americas. Mol. Biol. Evol. 21,
164–175.
12. Anthony, D.W. (2007). The Horse, the Wheel, and Language:
How Bronze-Age Riders from the Eurasian Steppes Shaped
the Modern World (Princeton, N.J.: Princeton University
Press).
13. Kuzmina, E.E., and Mair, V.H. (2008). The Prehistory of the
Silk Road (Philadelphia: University of Pennsylvania Press).
14. Rudenko, S.I. (1970). Frozen Tombs of Siberia, the Pazyryk
Burials of Iron Age Horsemen (Berkeley: University of
California Press).
15. Davis-Kimball, J., Bashilov, V.A., and Yablonsky, L.T., eds.
(1995). Nomads of the Eurasian Steppes in the Early Iron
Age (Berkeley, CA: Zinat Press).
16. Golden, P.B. (1992). An Introduction to the History of
the Turkic Peoples: Ethnogenesis and State-Formation in
Medieval and Early Modern Eurasia and the Middle East
(Wiesbaden: Otto Harrassowitz).
17. Grousset, R. (1970). The Empire of the Steppes: A History of
Central Asia (New Brunswick, N.J.: Rutgers University Press).
18. Potapov, L.P. (1962). The origins of the Altayans. In Studies
in Siberian Ethnogenesis, H.N. Michael, ed. (Toronto:
University of Toronto Press), pp. 169–196.
19. Potapov, L.P. (1964). The Altays. In The Peoples of Siberia,
M.G. Levin and L.P. Potapov, eds. (Chicago: University of
Chicago Press), pp. 305–341.
20. Menges, K.H. (1968). The Turkic Languages and Peoples:
An Introduction to Turkic Studies (Wiesbaden: Otto Harras-
sowitz).
21. Levin, M.G. (1964). The anthropological types of Siberia. In
The Peoples of Siberia, M.G. Levin and L.P. Potapov, eds.
(Chicago: The University of Chicago Press), pp. 99–104.
22. Osipova, L.P., and Sukernik, R.I. (1978). [Polymorphism
of immunoglobulin Gm- and Km-allotypes in northern
Altaians (western Sibiria)]. Genetika 14, 1272–1275.
23. Posukh, O.L., Osipova, L.P., Kashinskaia, IuO., Ivakin, E.A.,
Kriukov, IuA., Karafet, T.M., Kazakovtseva, M.A., Skobel’tsina,
L.M., Crawford, M.G., Lefranc, M.P., and Lefranc, G. (1998).
[Genetic analysis of the South Altaian population of
the Mendur-Sokkon village, Altai Republic]. Genetika 34,
106–113.
24. Sukernik, R.I., Karafet, T.M., Abanina, T.A., Korostyshevskiĭ,
M.A., and Bashlaĭ, A.G. (1977). [Genetic structure of 2 isolated
populations of native inhabitants of Sibiria (Northern
Altaics) according to the results of a study of blood groups
and isoenzymes]. Genetika 13, 911–918.
25. Sukernik, R.I., Shur, T.G., Starikovskaia, E.B., and Uolles, D.K.
(1996). [Mitochondrial DNA variation in native inhabitants
of Siberia with reconstructions of the evolutional history of
the American Indians. Restriction polymorphism]. Genetika
32, 432–439.
26. Shields, G.F., Schmiechen, A.M., Frazier, B.L., Redd, A.,
Voevoda, M.I., Reed, J.K., and Ward, R.H. (1993). mtDNA
sequences suggest a recent evolutionary divergence for
Beringian and northern North American populations. Am.
J. Hum. Genet. 53, 549–562.
27. Torroni, A., Schurr, T.G., Yang, C.C., Szathmary, E.J.,
Williams, R.C., Schanfield, M.S., Troup, G.A., Knowler,
W.C., Lawrence, D.N., Weiss, K.M., et al. (1992). Native
American mitochondrial DNA analysis indicates that the
Amerind and the Nadene populations were founded by two
independent migrations. Genetics 130, 153–162.
28. Wallace, D.C., and Torroni, A. (1992). American Indian
prehistory as written in the mitochondrial DNA: a review.
Hum. Biol. 64, 403–416.
29. Torroni, A., Schurr, T.G., Cabell, M.F., Brown, M.D., Neel, J.V.,
Larsen, M., Smith, D.G., Vullo, C.M., and Wallace, D.C.
(1993). Asian affinities and continental radiation of the
four founding Native American mtDNAs. Am. J. Hum. Genet.
53, 563–590.
30. Torroni, A., Sukernik, R.I., Schurr, T.G., Starikorskaya, Y.B.,
Cabell, M.F., Crawford, M.H., Comuzzie, A.G., and Wallace,
D.C. (1993). mtDNA variation of aboriginal Siberians reveals
distinct genetic affinities with Native Americans. Am. J.
Hum. Genet. 53, 591–608.
31. Forster, P., Harding, R., Torroni, A., and Bandelt, H.J. (1996).
Origin and evolution of Native American mtDNA variation:
a reappraisal. Am. J. Hum. Genet. 59, 935–945.
32. Merriwether, D.A., and Ferrell, R.E. (1996). The four founding
lineage hypothesis for the New World: a critical reevaluation.
Mol. Phylogenet. Evol. 5, 241–246.
33. Bonatto, S.L., and Salzano, F.M. (1997). Diversity and age of
the four major mtDNA haplogroups, and their implications
for the peopling of the New World. Am. J. Hum. Genet. 61,
1413–1423.
34. Merriwether, D.A., Hall, W.W., Vahlne, A., and Ferrell, R.E.
(1996). mtDNA variation indicates Mongolia may have
been the source for the founding population for the New
World. Am. J. Hum. Genet. 59, 204–212.
35. Neel, J.V., Biggar, R.J., and Sukernik, R.I. (1994). Virologic and
genetic studies relate Amerind origins to the indigenous
people of the Mongolia/Manchuria/southeastern Siberia
region. Proc. Natl. Acad. Sci. USA 91, 10737–10741.
36. Karafet, T.M., Zegura, S.L., Vuturo-Brady, J., Posukh, O.,
Osipova, L., Wiebe, V., Romero, F., Long, J.C., Harihara, S.,
Jin, F., et al. (1997). Y chromosome markers and Trans-Bering
Strait dispersals. Am. J. Phys. Anthropol. 102, 301–314.
37. Lell, J.T., Brown, M.D., Schurr, T.G., Sukernik, R.I., Starikov-
skaya, Y.B., Torroni, A., Moore, L.G., Troup, G.M., and
Wallace, D.C. (1997). Y chromosome polymorphisms in
native American and Siberian populations: identification of
native American Y chromosome haplotypes. Hum. Genet.
100, 536–543.
38. Bortolini, M.C., Salzano, F.M., Thomas, M.G., Stuart, S.,
Nasanen, S.P., Bau, C.H., Hutz, M.H., Layrisse, Z., Petzl-Erler,
M.L., Tsuneto, L.T., et al. (2003). Y-chromosome evidence for
differing ancient demographic histories in the Americas. Am.
J. Hum. Genet. 73, 524–539.
39. Schurr, T.G., and Sherry, S.T. (2004). Mitochondrial DNA and
Y chromosome diversity and the peopling of the Americas:
evolutionary and demographic evidence. Am. J. Hum. Biol.
16, 420–439.
40. Derenko, M.V., Malyarchuk, B., Denisova, G.A., Wozniak, M.,
Dambueva, I., Dorzhu, C., Luzina, F., Miścicka-Śliwka, D.,
and Zakharov, I. (2006). Contrasting patterns of Y-chromo-
some variation in South Siberian populations from Baikal
and Altai-Sayan regions. Hum. Genet. 118, 591–604.
41. Derenko, M.V., Malyarchuk, B., Grzybowski, T., Denisova, G.,
Dambueva, I., Perkova, M., Dorzhu, C., Luzina, F., Lee, H.K.,
Vanecek, T., et al. (2007). Phylogeographic analysis of mito-
chondrial DNA in northern Asian populations. Am. J.
Hum. Genet. 81, 1025–1041.
42. Volodko, N.V., Starikovskaya, E.B., Mazunin, I.O., Eltsov,
N.P., Naidenko, P.V., Wallace, D.C., and Sukernik, R.I.
(2008). Mitochondrial genome diversity in arctic Siberians,
with particular reference to the evolutionary history of
Beringia and Pleistocenic peopling of the Americas. Am. J.
Hum. Genet. 82, 1084–1100.
43. Derenko, M.V., Grzybowski, T., Malyarchuk, B.A., Dambueva,
I.K., Denisova, G.A., Czarny, J., Dorzhu, C.M., Kakpakov,
V.T., Miścicka-Śliwka, D., Woźniak, M., and Zakharov,
I.A. (2003). Diversity of mitochondrial DNA lineages in
South Siberia. Ann. Hum. Genet. 67, 391–411.
44. Starikovskaya, E.B., Sukernik, R.I., Derbeneva, O.A., Volodko,
N.V., Ruiz-Pesini, E., Torroni, A., Brown, M.D., Lott, M.T.,
Hosseini, S.H., Huoponen, K., and Wallace, D.C. (2005).
Mitochondrial DNA diversity in indigenous populations of
the southern extent of Siberia, and the origins of Native
American haplogroups. Ann. Hum. Genet. 69, 67–89.
45. Starikovskaya, Y.B., Sukernik, R.I., Schurr, T.G., Kogelnik,
A.M., and Wallace, D.C. (1998). mtDNA diversity in Chukchi
and Siberian Eskimos: implications for the genetic history of
Ancient Beringia and the peopling of the New World. Am. J.
Hum. Genet. 63, 1473–1491.
46. Schurr, T.G., and Wallace, D.C. (2003). Genetic prehistory of
Paleoasiatic-speaking populations of northeastern Siberia
and their relationships to Native Americans. In Constructing
cultures then and now: celebrating Franz Boas and the Jesup
North Pacific Expedition, L. Kendall and I. Krupnik, eds.
(Washington, D.C.: Arctic Studies Center, National Museum
of Natural History, Smithsonian Institution), pp. 239–258.
47. Schurr, T.G., Ballinger, S.W., Gan, Y.Y., Hodge, J.A., Merri-
wether, D.A., Lawrence, D.N., Knowler, W.C., Weiss, K.M.,
and Wallace, D.C. (1990). Amerindian mitochondrial DNAs
have rare Asian mutations at high frequencies, suggesting
they derived from four primary maternal lineages. Am. J.
Hum. Genet. 46, 613–623.
48. Macaulay, V., Richards, M., Hickey, E., Vega, E., Cruciani, F.,
Guida, V., Scozzari, R., Bonne-Tamir, B., Sykes, B., and
Torroni, A. (1999). The emerging tree of West Eurasian
mtDNAs: a synthesis of control-region sequences and RFLPs.
Am. J. Hum. Genet. 64, 232–249.
49. Richards, M., Macaulay, V., Hickey, E., Vega, E., Sykes, B.,
Guida, V., Rengo, C., Sellitto, D., Cruciani, F., Kivisild, T.,
et al. (2000). Tracing European founder lineages in the Near
Eastern mtDNA pool. Am. J. Hum. Genet. 67, 1251–1276.
50. Torroni, A., Bandelt, H.J., D’Urbano, L., Lahermo, P., Moral,
P., Sellitto, D., Rengo, C., Forster, P., Savontaus, M.L.,
Bonne-Tamir, B., and Scozzari, R. (1998). mtDNA analysis
reveals a major late Paleolithic population expansion from
southwestern to northeastern Europe. Am. J. Hum. Genet.
62, 1137–1152.
51. Torroni, A., Huoponen, K., Francalacci, P., Petrozzi, M.,
Morelli, L., Scozzari, R., Obinu, D., Savontaus, M.L., and
Wallace, D.C. (1996). Classification of European mtDNAs
from an analysis of three European populations. Genetics
144, 1835–1850.
52. Torroni, A., Lott, M.T., Cabell, M.F., Chen, Y.S., Lavergne, L.,
and Wallace, D.C. (1994). mtDNA and the origin of Cauca-
sians: identification of ancient Caucasian-specific haplo-
groups, one of which is prone to a recurrent somatic duplica-
tion in the D-loop region. Am. J. Hum. Genet. 55, 760–776.
53. Kivisild, T., Tolk, H.V., Parik, J., Wang, Y., Papiha, S.S.,
Bandelt, H.J., and Villems, R. (2002). The emerging limbs
and twigs of the East Asian mtDNA tree. Mol. Biol. Evol.
19, 1737–1751.
54. Schurr, T.G., Sukernik, R.I., Starikovskaya, Y.B., and Wallace,
D.C. (1999). Mitochondrial DNA variation in Koryaks and
Itel’men: population replacement in the Okhotsk Sea-Bering
Sea region during the Neolithic. Am. J. Phys. Anthropol. 108,
1–39.
55. Tanaka, M., Cabrera, V.M., Gonzalez, A.M., Larruga, J.M.,
Takeyasu, T., Fuku, N., Guo, L.J., Hirose, R., Fujita, Y., Kurata,
M., et al. (2004). Mitochondrial genome variation in eastern
Asia and the peopling of Japan. Genome Res. 14 (10A), 1832–
1850.
56. Yao, Y.G., Kong, Q.P., Bandelt, H.J., Kivisild, T., and Zhang,
Y.P. (2002). Phylogeographic differentiation of mitochon-
drial DNA in Han Chinese. Am. J. Hum. Genet. 70, 635–651.
57. Gokcumen, O., Dulik, M.C., Pai, A.A., Zhadanov, S.I., Rubin-
stein, S., Osipova, L.P., Andreenkov, O.V., Tabikhanova, L.E.,
Gubina, M.A., Labuda, D., and Schurr, T.G. (2008). Genetic
variation in the enigmatic Altaian Kazakhs of South-Central
Russia: insights into Turkic population history. Am. J. Phys.
Anthropol. 136, 278–293.
58. Rubinstein, S., Dulik, M.C., Gokcumen, O., Zhadanov, S.,
Osipova, L., Cocca, M., Mehta, N., Gubina, M., Posukh, O.,
and Schurr, T.G. (2008). Russian Old Believers: genetic conse-
quences of their persecution and exile, as shown by mito-
chondrial DNA evidence. Hum. Biol. 80, 203–237.
59. van Oven, M., and Kayser, M. (2009). Updated comprehen-
sive phylogenetic tree of global human mitochondrial DNA
variation. Hum. Mutat. 30, E386–E394.
60. Anderson, S., Bankier, A.T., Barrell, B.G., de Bruijn, M.H.,
Coulson, A.R., Drouin, J., Eperon, I.C., Nierlich, D.P., Roe,
B.A., Sanger, F., et al. (1981). Sequence and organization of
the human mitochondrial genome. Nature 290, 457–465.
61. Andrews, R.M., Kubacka, I., Chinnery, P.F., Lightowlers, R.N.,
Turnbull, D.M., and Howell, N. (1999). Reanalysis and revi-
sion of the Cambridge reference sequence for human mito-
chondrial DNA. Nat. Genet. 23, 147.
62. Y Chromosome Consortium. (2002). A nomenclature system
for the tree of human Y-chromosomal binary haplogroups.
Genome Res. 12, 339–348.
63. Karafet, T.M., Mendez, F.L., Meilerman, M.B., Underhill, P.A.,
Zegura, S.L., and Hammer, M.F. (2008). New binary polymor-
phisms reshape and increase resolution of the human Y chro-
mosomal haplogroup tree. Genome Res. 18, 830–838.
64. Dulik, M.C., Osipova, L.P., and Schurr, T.G. (2011). Y-chro-
mosome variation in Altaian Kazakhs reveals a common
paternal gene pool for Kazakhs and the influence of Mongo-
lian expansions. PLoS ONE 6, e17548.
65. Cox, M.P. (2006). Minimal hierarchical analysis of global
human Y-chromosome SNP diversity by PCR-RFLP. Anthro-
pol. Sci. 114, 69–74.
66. Derbeneva, O.A., Starikovskaia, E.B., Volod’ko, N.V., Wallace,
D.C., and Sukernik, R.I. (2002). [Mitochondrial DNA varia-
tion in Kets and Nganasans and the early peoples of
Northern Eurasia]. Genetika 38, 1554–1560.
67. Derbeneva, O.A., Starikovskaya, E.B., Wallace, D.C., and
Sukernik, R.I. (2002). Traces of early Eurasians in the
Mansi of northwest Siberia revealed by mitochondrial DNA
analysis. Am. J. Hum. Genet. 70, 1009–1014.
68. Pimenoff, V.N., Comas, D., Palo, J.U., Vershubsky, G., Kozlov,
A., and Sajantila, A. (2008). Northwest Siberian Khanty and
Mansi in the junction of West and East Eurasian gene pools
as revealed by uniparental markers. Eur. J. Hum. Genet. 16,
1254–1264.
69. Comas, D., Calafell, F., Mateu, E., Perez-Lezaun, A., Bosch, E.,
Martınez-Arias, R., Clarimon, J., Facchini, F., Fiori, G.,
Luiselli, D., et al. (1998). Trading genes along the silk road:
mtDNA sequences and the origin of central Asian popula-
tions. Am. J. Hum. Genet. 63, 1824–1838.
70. Yao, Y.G., Kong, Q.P., Wang, C.Y., Zhu, C.L., and Zhang, Y.P.
(2004). Different matrilineal contributions to genetic struc-
ture of ethnic groups in the silk road region in china. Mol.
Biol. Evol. 21, 2265–2280.
71. Kolman, C.J., Sambuughin, N., and Bermingham, E. (1996).
Mitochondrial DNA analysis of Mongolian populations and
implications for the origin of New World founders. Genetics
142, 1321–1334.
72. Xue, Y., Zerjal, T., Bao, W., Zhu, S., Shu, Q., Xu, J., Du, R.,
Fu, S., Li, P., Hurles, M.E., et al. (2006). Male demography
in East Asia: a north-south contrast in human population
expansion times. Genetics 172, 2431–2439.
73. Khar’kov, V.N., Medvedeva, O.F., Luzina, F.A., Kolbasko, A.V.,
Gafarov, N.I., Puzyrev, V.P., and Stepanov, V.A. (2009).
[Comparative characteristics of the gene pool of Teleuts
inferred from Y-chromosomal marker data]. Genetika 45,
1132–1142.
74. Khar’kov, V., Khamina, K., Medvedeva, O., Shtygasheva, O.,
and Stepanov, V. (2011). Genetic diversity of the Khakass
gene pool: Subethnic differentiation and the structure of
Y-chromosome haplogroups. Mol. Biol. (Mosk.) 45, 446–458.
75. Roewer, L., Kruger, C., Willuweit, S., Nagy, M., Rodig, H.,
Kokshunova, L., Rothamel, T., Kravchenko, S., Jobling, M.A.,
Stoneking, M., and Nasidze, I. (2007). Y-chromosomal STR
haplotypes in Kalmyk population samples. Forensic Sci. Int.
173, 204–209.
76. Geppert, M., Baeta, M., Nunez, C., Martınez-Jarreta, B., Zwey-
nert, S., Cruz, O.W., Gonzalez-Andrade, F., Gonzalez-Solo-
rzano, J., Nagy, M., and Roewer, L. (2011). Hierarchical
Y-SNP assay to study the hidden diversity and phylogenetic
relationship of native populations in South America.
Forensic Sci. Int. Genet. 5, 100–104.
77. Excoffier, L., Laval, G., and Schneider, S. (2005). Arlequin
(version 3.0): an integrated software package for population
genetics data analysis. Evol. Bioinform. Online 1, 47–50.
78. Tamura, K., and Nei, M. (1993). Estimation of the number of
nucleotide substitutions in the control region of mitochon-
drial DNA in humans and chimpanzees. Mol. Biol. Evol.
10, 512–526.
79. SPSS Inc. (2001). SPSS for Windows Release 11.0.0 (Chicago,
IL: SPSS Inc.).
80. Polzin, T., and Daneschmand, S.V. (2003). On Steiner trees
and minimum spanning trees in hypergraphs. Oper. Res.
Lett. 31, 12–20.
81. Bandelt, H.J., Forster, P., and Rohl, A. (1999). Median-joining
networks for inferring intraspecific phylogenies. Mol. Biol.
Evol. 16, 37–48.
82. Bandelt, H.J., Forster, P., Sykes, B.C., and Richards, M.B.
(1995). Mitochondrial portraits of human populations using
median networks. Genetics 141, 743–753.
83. Gusmao, L., Butler, J.M., Carracedo, A., Gill, P., Kayser, M.,
Mayr, W.R., Morling, N., Prinz, M., Roewer, L., Tyler-Smith,
C., and Schneider, P.M.; DNA Commission of the Interna-
tional Society of Forensic Genetics. (2006). DNA Commis-
sion of the International Society of Forensic Genetics
(ISFG): an update of the recommendations on the use of
Y-STRs in forensic analysis. Forensic Sci. Int. 157, 187–197.
84. Soares, P., Ermini, L., Thomson, N., Mormina, M., Rito, T.,
Rohl, A., Salas, A., Oppenheimer, S., Macaulay, V., and
Richards, M.B. (2009). Correcting for purifying selection:
an improved human mitochondrial molecular clock. Am. J.
Hum. Genet. 84, 740–759.
85. Sengupta, S., Zhivotovsky, L.A., King, R., Mehdi, S.Q.,
Edmonds, C.A., Chow, C.E., Lin, A.A., Mitra, M., Sil, S.K.,
Ramesh, A., et al. (2006). Polarity and temporality of high-
resolution y-chromosome distributions in India identify
both indigenous and exogenous expansions and reveal
minor genetic influence of Central Asian pastoralists. Am.
J. Hum. Genet. 78, 202–221.
86. Wilson, I., Balding, D., and Weale, M. (2003). Inferences from
DNA data: population histories, evolutionary processes and
forensic match probabilities. J. R. Stat. Soc. [Ser A] 166,
155–188.
87. Xue, Y., Zerjal, T., Bao, W., Zhu, S., Shu, Q., Xu, J., Du, R., Fu,
S., Li, P., Hurles, M.E., et al. (2008). Modelling male prehis-
tory in east Asia using BATWING. In Simulations, Genetics
and Human Prehistory, S. Matsumura, P. Forster, and C. Ren-
frew, eds. (Cambridge: McDonald Institute for Archaeolog-
ical Research), pp. 79–88.
88. Zhivotovsky, L.A., Underhill, P.A., Cinnioğlu, C., Kayser, M.,
Morar, B., Kivisild, T., Scozzari, R., Cruciani, F., Destro-Bisol,
G., Spedini, G., et al. (2004). The effective mutation rate at
Y chromosome short tandem repeats, with application to
human population-divergence time. Am. J. Hum. Genet.
74, 50–61.
89. Dupuy, B.M., Stenersen, M., Egeland, T., and Olaisen, B.
(2004). Y-chromosomal microsatellite mutation rates: differences
in mutation rate between and within loci. Hum. Mutat.
23, 117–124.
90. Fenner, J.N. (2005). Cross-cultural estimation of the human
generation interval for use in genetics-based population
divergence studies. Am. J. Phys. Anthropol. 128, 415–423.
91. Derenko, M., Malyarchuk, B., Grzybowski, T., Denisova, G.,
Rogalla, U., Perkova, M., Dambueva, I., and Zakharov, I.
(2010). Origin and post-glacial dispersal of mitochondrial
DNA haplogroups C and D in northern Asia. PLoS ONE 5,
e15214.
92. Zhadanov, S.I., Dulik, M.C., Markley, M., Jennings, G.W.,
Gaieski, J.B., Elias, G., and Schurr, T.G.; Genographic Project
Consortium. (2010). Genetic heritage and native identity of
the Seaconke Wampanoag tribe of Massachusetts. Am. J.
Phys. Anthropol. 142, 579–589.
93. Hammer, M.F., Karafet, T.M., Redd, A.J., Jarjanazi, H., Santa-
chiara-Benerecetti, S., Soodyall, H., and Zegura, S.L. (2001).
Hierarchical patterns of global human Y-chromosome diver-
sity. Mol. Biol. Evol. 18, 1189–1203.
94. Kivisild, T., Rootsi, S., Metspalu, M., Mastana, S., Kaldma, K.,
Parik, J., Metspalu, E., Adojaan, M., Tolk, H.V., Stepanov, V.,
et al. (2003). The genetic heritage of the earliest settlers
persists both in Indian tribal and caste populations. Am. J.
Hum. Genet. 72, 313–332.
95. Wells, R.S., Yuldasheva, N., Ruzibakiev, R., Underhill, P.A.,
Evseeva, I., Blue-Smith, J., Jin, L., Su, B., Pitchappan, R.,
Shanmugalakshmi, S., et al. (2001). The Eurasian heartland:
a continental perspective on Y-chromosome diversity. Proc.
Natl. Acad. Sci. USA 98, 10244–10249.
96. Rosser, Z.H., Zerjal, T., Hurles, M.E., Adojaan, M., Alavantic,
D., Amorim, A., Amos, W., Armenteros, M., Arroyo, E., Barbujani,
G., et al. (2000). Y-chromosomal diversity in Europe is
clinal and influenced primarily by geography, rather than
by language. Am. J. Hum. Genet. 67, 1526–1543.
97. Quintana-Murci, L., Krausz, C., Zerjal, T., Sayar, S.H.,
Hammer, M.F., Mehdi, S.Q., Ayub, Q., Qamar, R., Mohyud-
din, A., Radhakrishna, U., et al. (2001). Y-chromosome line-
ages trace diffusion of people and languages in southwestern
Asia. Am. J. Hum. Genet. 68, 537–542.
98. Underhill, P.A., Passarino, G., Lin, A.A., Shen, P., Mirazon
Lahr, M., Foley, R.A., Oefner, P.J., and Cavalli-Sforza, L.L.
(2001). The phylogeography of Y chromosome binary haplo-
types and the origins of modern human populations. Ann.
Hum. Genet. 65, 43–62.
99. Zhong, H., Shi, H., Qi, X.-B., Duan, Z.-Y., Tan, P.-P., Jin, L., Su,
B., and Ma, R.Z. (2011). Extended Y chromosome investiga-
tion suggests postglacial migrations of modern humans
into East Asia via the northern route. Mol. Biol. Evol. 28,
717–727.
100. Mirabal, S., Regueiro, M., Cadenas, A.M., Cavalli-Sforza, L.L.,
Underhill, P.A., Verbenko, D.A., Limborska, S.A., and Her-
rera, R.J. (2009). Y-chromosome distribution within the
geo-linguistic landscape of northwestern Russia. Eur. J.
Hum. Genet. 17, 1260–1273.
101. Myres, N.M., Rootsi, S., Lin, A.A., Jarve, M., King, R.J.,
Kutuev, I., Cabrera, V.M., Khusnutdinova, E.K., Pshenichnov,
A., Yunusbayev, B., et al. (2011). A major Y-chromosome
haplogroup R1b Holocene era founder effect in Central and
Western Europe. Eur. J. Hum. Genet. 19, 95–101.
102. Malyarchuk, B., Derenko, M., Denisova, G., Maksimov, A.,
Wozniak, M., Grzybowski, T., Dambueva, I., and Zakharov,
I. (2011). Ancient links between Siberians and Native Amer-
icans revealed by subtyping the Y chromosome haplogroup
Q1a. J. Hum. Genet. 56, 583–588.
103. Rogers, A.R., and Harpending, H. (1992). Population growth
makes waves in the distribution of pairwise genetic differ-
ences. Mol. Biol. Evol. 9, 552–569.
104. Shi, H., Dong, Y.L., Wen, B., Xiao, C.J., Underhill, P.A., Shen,
P.D., Chakraborty, R., Jin, L., and Su, B. (2005). Y-chromo-
some evidence of southern origin of the East Asian-specific
haplogroup O3-M122. Am. J. Hum. Genet. 77, 408–419.
105. Forsyth, J. (1992). A History of the Peoples of Siberia: Russia’s
North Asian Colony, 1581–1990 (Cambridge, England:
Cambridge University Press).
106. Brown, M.D., Hosseini, S.H., Torroni, A., Bandelt, H.J., Allen,
J.C., Schurr, T.G., Scozzari, R., Cruciani, F., and Wallace, D.C.
(1998). mtDNA haplogroup X: An ancient link between
Europe/Western Asia and North America? Am. J. Hum.
Genet. 63, 1852–1861.
107. Tamm, E., Kivisild, T., Reidla, M., Metspalu, M., Smith, D.G.,
Mulligan, C.J., Bravi, C.M., Rickards, O., Martinez-Labarga,
C., Khusnutdinova, E.K., et al. (2007). Beringian standstill
and spread of Native American founders. PLoS ONE 2, e829.
108. Achilli, A., Perego, U.A., Bravi, C.M., Coble, M.D., Kong, Q.P.,
Woodward, S.R., Salas, A., Torroni, A., and Bandelt, H.J.
(2008). The phylogeny of the four pan-American MtDNA
haplogroups: implications for evolutionary and disease
studies. PLoS ONE 3, e1764.
109. Perego, U.A., Achilli, A., Angerhofer, N., Accetturo, M., Pala,
M., Olivieri, A., Kashani, B.H., Ritchie, K.H., Scozzari, R.,
Kong, Q.P., et al. (2009). Distinctive Paleo-Indian migration
routes from Beringia marked by two rare mtDNA haplo-
groups. Curr. Biol. 19, 1–8.
110. Perego, U.A., Angerhofer, N., Pala, M., Olivieri, A., Lancioni,
H., Kashani, B.H., Carossa, V., Ekins, J.E., Gomez-Carballa, A.,
Huber, G., et al. (2010). The initial peopling of the Americas:
a growing number of founding mitochondrial genomes from
Beringia. Genome Res. 20, 1174–1179.
111. Helgason, A., Palsson, G., Pedersen, H.S., Angulalik, E., Gun-
narsdottir, E.D., Yngvadottir, B., and Stefansson, K. (2006).
mtDNA variation in Inuit populations of Greenland and
Canada: migration history and population structure. Am. J.
Phys. Anthropol. 130, 123–134.
112. Bortolini, M.C., Salzano, F.M., Bau, C.H., Layrisse, Z., Petzl-
Erler, M.L., Tsuneto, L.T., Hill, K., Hurtado, A.M., Castro-
De-Guerra, D., Bedoya, G., and Ruiz-Linares, A. (2002).
Y-chromosome biallelic polymorphisms and Native Amer-
ican population structure. Ann. Hum. Genet. 66, 255–259.
113. Underhill, P.A., Shen, P., Lin, A.A., Jin, L., Passarino, G., Yang,
W.H., Kauffman, E., Bonne-Tamir, B., Bertranpetit, J., Franca-
lacci, P., et al. (2000). Y chromosome sequence variation
and the history of human populations. Nat. Genet. 26,
358–361.
114. Seielstad, M., Yuldasheva, N., Singh, N., Underhill, P., Oef-
ner, P., Shen, P., and Wells, R.S. (2003). A novel Y-chromo-
some variant puts an upper limit on the timing of first entry
into the Americas. Am. J. Hum. Genet. 73, 700–705.
115. Schurr, T.G., Osipova, L.P., Zhadanov, S.I., and Dulik, M.C.
(2010). Genetic diversity in Native Siberians: Implications
for the prehistoric settlement of the Cis-Baikal region. In
Prehistoric Hunter-Gatherers of the Baikal Region, Siberia,
A.W. Weber, M.A. Katzenberg, and T.G. Schurr, eds. (Philadelphia:
University of Pennsylvania Press), pp. 121–134.
116. Dulik, M.C. (2011). A molecular anthropological study
of Altaian histories utilizing population genetics and
phylogeography. PhD thesis, University of Pennsylvania,
Philadelphia, PA.
117. Fiedel, S.J. (2000). The peopling of the New World: present
evidence, new theories, and future directions. J. Archaeol.
Res. 8, 39–103.
118. Martínez-Cruz, B., Vitalis, R., Segurel, L., Austerlitz, F.,
Georges, M., Thery, S., Quintana-Murci, L., Hegay, T., Alda-
shev, A., Nasyrova, F., and Heyer, E. (2011). In the heartland
of Eurasia: the multilocus genetic landscape of Central Asian
populations. Eur. J. Hum. Genet. 19, 216–223.
119. Karafet, T.M., Osipova, L.P., Gubina, M.A., Posukh, O.L.,
Zegura, S.L., and Hammer, M.F. (2002). High levels of Y-chro-
mosome differentiation among native Siberian populations
and the genetic signature of a boreal hunter-gatherer way
of life. Hum. Biol. 74, 761–789.
120. Balaresque, P., Parkin, E.J., Roewer, L., Carvalho-Silva, D.R.,
Mitchell, R.J., van Oorschot, R.A., Henke, J., Stoneking, M.,
Nasidze, I., Wetton, J., et al. (2009). Genomic complexity
of the Y-STR DYS19: inversions, deletions and founder line-
ages carrying duplications. Int. J. Legal Med. 123, 15–23.
121. Underhill, P.A., Myres, N.M., Rootsi, S., Metspalu, M., Zhivo-
tovsky, L.A., King, R.J., Lin, A.A., Chow, C.E., Semino, O.,
Battaglia, V., et al. (2010). Separating the post-Glacial coan-
cestry of European and Asian Y chromosomes within haplo-
group R1a. Eur. J. Hum. Genet. 18, 479–484.
122. Soucek, S. (2000). A History of Inner Asia (Cambridge, New
York: Cambridge University Press).
123. Reich, D., Patterson, N., Kircher, M., Delfin, F., Nandineni,
M.R., Pugach, I., Ko, A.M., Ko, Y.C., Jinam, T.A., Phipps,
M.E., et al. (2011). Denisova admixture and the first modern
human dispersals into Southeast Asia and Oceania. Am. J.
Hum. Genet. 89, 516–528.
124. Hrdlička, A. (1942). Crania of Siberia. Am. J. Phys. Anthropol.
29, 435–481.
125. Gonzalez-Jose, R., Bortolini, M.C., Santos, F.R., and Bonatto,
S.L. (2008). The peopling of America: craniofacial shape vari-
ation on a continental scale and its interpretation from an
interdisciplinary view. Am. J. Phys. Anthropol. 137, 175–187.
126. Kozintsev, A.G., Gromov, A.V., and Moiseyev, V.G. (1999).
Collateral relatives of American Indians among the Bronze
Age populations of Siberia? Am. J. Phys. Anthropol. 108,
193–204.
127. Crawford, M.H. (1998). The Origins of Native Americans:
Evidence from Anthropological Genetics (Cambridge: Cam-
bridge University Press).
128. Brace, C.L., Nelson, A.R., Seguchi, N., Oe, H., Sering, L.,
Qifeng, P., Yongyi, L., and Tumen, D. (2001). Old World sources
of the first New World human inhabitants: a comparative
craniofacial view. Proc. Natl. Acad. Sci. USA 98, 10017–
10022.
129. Wang, S., Lewis, C.M., Jakobsson, M., Ramachandran, S.,
Ray, N., Bedoya, G., Rojas, W., Parra, M.V., Molina, J.A.,
Gallo, C., et al. (2007). Genetic variation and population
structure in native Americans. PLoS Genet. 3, e185.
130. Horai, S., Kondo, R., Nakagawa-Hattori, Y., Hayashi, S.,
Sonoda, S., and Tajima, K. (1993). Peopling of the Americas,
founded by four major lineages of mitochondrial DNA. Mol.
Biol. Evol. 10, 23–47.
131. Kaessmann, H., Zollner, S., Gustafsson, A.C., Wiebe, V., Laan,
M., Lundeberg, J., Uhlen, M., and Paabo, S. (2002). Extensive
linkage disequilibrium in small human populations in
Eurasia. Am. J. Hum. Genet. 70, 673–685.
132. Bailliet, G., Ramallo, V., Muzzio, M., García, A., Santos, M.R.,
Alfaro, E.L., Dipierri, J.E., Salceda, S., Carnese, F.R., Bravi,
C.M., et al. (2009). Brief communication: Restricted geo-
graphic distribution for Y-Q* paragroup in South America.
Am. J. Phys. Anthropol. 140, 578–582.
133. Ho, S.Y., and Endicott, P. (2008). The crucial role of calibra-
tion in molecular date estimates for the peopling of the
Americas. Am. J. Hum. Genet. 83, 142–146, author reply
146–147.
134. Zhivotovsky, L.A., and Underhill, P.A. (2005). On the evolu-
tionary mutation rate at Y-chromosome STRs: comments
on paper by Di Giacomo et al. (2004). Hum. Genet. 116,
529–532.
135. Di Giacomo, F., Luca, F., Popa, L.O., Akar, N., Anagnou, N.,
Banyko, J., Brdicka, R., Barbujani, G., Papola, F., Ciavarella,
G., et al. (2004). Y chromosomal haplogroup J as a signature
of the post-neolithic colonization of Europe. Hum. Genet.
115, 357–371.
136. Zerjal, T., Xue, Y., Bertorelle, G., Wells, R.S., Bao, W., Zhu, S.,
Qamar, R., Ayub, Q., Mohyuddin, A., Fu, S., et al. (2003).
The genetic legacy of the Mongols. Am. J. Hum. Genet. 72,
717–721.
137. Zhivotovsky, L.A., Underhill, P.A., and Feldman, M.W.
(2006). Difference between evolutionarily effective and
germ line mutation rate due to stochastically varying haplo-
group size. Mol. Biol. Evol. 23, 2268–2270.
138. Ho, S.Y., Phillips, M.J., Cooper, A., and Drummond, A.J.
(2005). Time dependency of molecular rate estimates and
systematic overestimation of recent divergence times. Mol.
Biol. Evol. 22, 1561–1568.
ARTICLE
A ‘‘Copernican’’ Reassessment of the Human Mitochondrial DNA Tree from its Root
Doron M. Behar,1,2,* Mannis van Oven,3,* Saharon Rosset,4 Mait Metspalu,1 Eva-Liis Loogvali,1
Nuno M. Silva,5 Toomas Kivisild,1,6 Antonio Torroni,7 and Richard Villems1,8
Mutational events along the human mtDNA phylogeny are traditionally identified relative to the revised Cambridge Reference
Sequence, a contemporary European sequence published in 1981. This historical choice is a continuous source of inconsistencies,
misinterpretations, and errors in medical, forensic, and population genetic studies. Here, after having refined the human mtDNA
phylogeny to an unprecedented level by adding information from 8,216 modern mitogenomes, we propose switching the reference
to a Reconstructed Sapiens Reference Sequence, which was identified by considering all available mitogenomes from Homo neandertha-
lensis. This ‘‘Copernican’’ reassessment of the human mtDNA tree from its deepest root should resolve previous problems and will
have a substantial practical and educational influence on the scientific and public perception of human evolution by clarifying the
core principles of common ancestry for extant descendants.
Introduction
Nested hierarchy of species, resulting from the descent
with modification process,1 is fundamental to our under-
standing of the evolution of biological diversity and
life in general. In molecular genealogy, the sequential
accumulation of mutations since the time of the most
recent common ancestor (MRCA) is reflected within the
ever-evolving phylogeny of any genetic locus. Accordingly,
the reconstructed ancestral sequence of a locus should
optimally serve as the reference point for its derived
alleles.2 The human mtDNA phylogeny3–7 is an almost
perfect molecular prototype for a nonrecombining locus,
and knowledge on its variation has been and is extensively
used in medical, genealogical, forensic, and popula-
tion genetic studies.8–11 Boosted by rapid advances in
sequencing and genotyping technology, its mode of inher-
itance, high mutation rate, lack of recombination, and
high cellular copy number have proved critical in making
this locus the primary choice in the field of archaeoge-
netics and ancient DNA.12–14 Although its early synthesis
was based on restriction-fragment-length polymor-
phisms,15–18 control-region variation,19,20 or a combina-
tion of both,21 the human mtDNA phylogeny is now
reconstructed from complete mtDNA sequences,4,6,7,22
thus stretching the phylogenetic resolution to its maxi-
mum. mtDNA also became the main target of ancient-
DNA studies because it is much more abundant than
nuclear DNA.13 The recently published Homo neandertha-
lensis mitogenomes23,24 represent the best available out-
group source for rooting the human mtDNA phylogeny
known to lie within the contemporary African variation.22,25,26
Despite these major advances, the extinct
human mtDNA complete root sequence was never
precisely determined, and mtDNA nomenclature remains
cumbersome because it refers to the first completely
sequenced mtDNA,27,28 labeled rCRS, which is now
known to belong to the recently coalescing European
haplogroup H2a2a1.7 The use of the rCRS as a reference
resulted in a number of practical problems such as (1)
the misidentification of derived versus ancestral states
of alleles and (2) the count of nonsynonymous muta-
tions that map to the path between the rCRS and
the case sequences.29 For instance, clinical and func-
tional studies frequently include among the putative
nonsynonymous candidate mutations the haplogroup-
HV-defining transition at position 14766 (CYTB) simply
because the revised Cambridge Reference Sequence
(rCRS) belongs to its derived haplogroup H.30
In this study, to definitively address these issues,
we propose a ‘‘Copernican’’ reassessment of the human
mtDNA phylogeny by switching to a Reconstructed
Sapiens Reference Sequence (RSRS) as the phylogenetically
valid reference point. To this end, the previously suggested
root7,22,25 was updated to most parsimoniously incorporate
the available mitogenomes from H. neanderthalensis.23,24
Moreover, we further refined the human mtDNA
phylogeny to an unprecedented level by adding informa-
tion from 8,216 mitogenomes and evaluated the ranges
of nucleotide substitutions from the root RSRS rather
than the rCRS28 as a reference point (Figure 1 and Figure S1,
available online).
1Estonian Biocentre and Department of Evolutionary Biology, University of Tartu, Tartu 51010, Estonia; 2Molecular Medicine Laboratory, Rambam Health
Care Campus, Haifa 31096, Israel; 3Department of Forensic Molecular Biology, Erasmus MC, University Medical Center Rotterdam, 3000 CA Rotterdam,
The Netherlands; 4Department of Statistics and Operations Research, School of Mathematical Sciences, Tel Aviv University, Tel Aviv 69978, Israel; 5Instituto
de Patologia e Imunologia Molecular da Universidade do Porto, Porto 4200-465, Portugal; 6Department of Biological Anthropology, University of
Cambridge, Cambridge CB2 1QH, UK; 7Dipartimento di Biologia e Biotecnologie ‘‘L. Spallanzani,’’ Universita di Pavia, Pavia 27100, Italy; 8Estonian
Academy of Sciences, 6 Kohtu Street, Tallinn 10130, Estonia
*Correspondence: [email protected] (D.M.B.), [email protected] (M.v.O.)
DOI 10.1016/j.ajhg.2012.03.002. ©2012 by The American Society of Human Genetics. All rights reserved.
The American Journal of Human Genetics 90, 675–684, April 6, 2012 675
Figure 1. Schematic Representation of the Human mtDNA Phylogeny within Hominini
(Left) Hominini phylogeny illustrating approximate divergence times of the studied species (Pan paniscus, Pan troglodytes, Homo neanderthalensis, and Homo sapiens; time axis in Mya). The positions of the RSRS and the putative Reconstructed Neanderthal Reference Sequence (RNRS) are shown.
(Right) Magnification of the human mtDNA phylogeny. Mutated nucleotide positions separating the nodes of the two basal human haplogroups L0 and L1’2’3’4’5’6 and their derived states as compared to the RSRS are shown. The positions of the rCRS and the RSRS are indicated by golden and green five-pointed stars, respectively. Accordingly, the number of mutations counted from the rCRS (NC_012920) or the RSRS (Sequence S1) to the L0d1c1b (EU092832) and H4a1a (HQ860291) haplotypes retrieved from a San and a German, respectively, are marked on the golden and green branches. The principle of equidistant star-like radiation from the common ancestor of all contemporary haplotypes is highlighted when the RSRS is preferred over the rCRS as the reference sequence.
Subjects and Methods
Updating the Human mtDNA Phylogeny and Inference of the Ancestral Root Haplotype
MtDNA Genomes Comprising the Phylogeny
A total of 18,843 complete mtDNA sequences were used to refine
the human mtDNA phylogeny of which 10,627 were previously
reported and used for the mtDNA tree Build 13 (28 Dec 2011)
as posted by PhyloTree.7 The remaining 8,216 sequences are
mainly from the large complete mtDNA database available at
FamilyTreeDNA and in part from data sets maintained by the
authors. The large database available at FamilyTreeDNA was
privately obtained by the sample donors, usually for genealogical
purposes. Most donors were of western Eurasian ancestry, but
donors with matrilineal ancestry from other geographical regions
have also contributed. Once the mtDNA sequences were obtained,
donors had several options: keep them confidential, share them
with peer genealogists, submit them to the National Center for
Biotechnology Information (NCBI) GenBank, and/or consent to
contribute them anonymously to a research database maintained
by FamilyTreeDNA to improve the mtDNA phylogeny. In turn,
this contribution rewards and enriches the genealogical experi-
ence as well as benefits the scientific community. All the proce-
dures followed in this study were in accordance with the ethical
standards of the responsible committee on human experimenta-
tion of the participating research centers.
Likewise, it is important to clarify that because the complete
sequences were obtained privately, some donors have indepen-
dently uploaded their sequence to NCBI. Currently (as of February
28, 2012), a total of 1,220 complete mtDNA sequences that were
generated at FamilyTreeDNA were privately deposited in NCBI
GenBank. Most of these sequences were already considered in
the previous PhyloTree Builds.7 Because we have no way to
know which of the sequences were autonomously uploaded to
NCBI, all duplicate sequences that matched precisely between
NCBI and our database were excluded from our analysis. There-
fore, even if multiple samples were excluded, no topological infor-
mation was lost. Accordingly, out of the 8,216 sequences used
to verify the phylogeny, a total of 4,265 sequences are released
and deposited in NCBI GenBank under accession numbers
JQ701803–JQ706067. The complete mtDNA sequences of the
Neanderthals were retrieved from the literature.23,24
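The counts quoted in this subsection can be cross-checked with simple arithmetic. The following is a minimal sketch; the helper name `accession_range_size` is ours and only assumes that the two accession numbers share a letter prefix and form an inclusive, contiguous range.

```python
# Sequence counts as stated in the text.
TOTAL_SEQUENCES = 18_843
PREVIOUSLY_REPORTED = 10_627  # used for PhyloTree Build 13
NEWLY_ADDED = 8_216


def accession_range_size(first: str, last: str) -> int:
    """Number of accessions in an inclusive range that shares a letter prefix."""
    prefix_len = len(first.rstrip("0123456789"))
    if first[:prefix_len] != last[:prefix_len]:
        raise ValueError("accessions must share the same prefix")
    return int(last[prefix_len:]) - int(first[prefix_len:]) + 1


# The two subsets should add up to the stated total.
counts_agree = PREVIOUSLY_REPORTED + NEWLY_ADDED == TOTAL_SEQUENCES

# Size of the GenBank accession range JQ701803-JQ706067.
released = accession_range_size("JQ701803", "JQ706067")
print(counts_agree, released)
```

Under these assumptions the accession range is consistent with the stated number of released sequences.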
Complete mtDNA Sequencing
DNA was extracted from buccal swabs. MtDNA was amplified with
18 primers to yield nine overlapping fragments as previously
reported.22 PCR products were cleaned with magnetic-particle
technology (BioSprint 96; QIAGEN). After purification, the nine
fragments were sequenced by means of 92 internal primers to
obtain the complete mtDNA genome. Sequencing was performed
on a 3730xl DNA Analyzer (Applied Biosystems), and the resulting
sequences were analyzed with the Sequencher software (Gene
Codes Corporation). Mutations were scored relative to the rCRS
and the suggested RSRS. Sample quality control was assured as
follows:
(1) After the PCR amplification of the nine fragments, DNA
handling and distribution to the 96 sequencing reactions
was aided by the Beckman Coulter Biomek FX liquid
handler to minimize the chance for human pipetting
errors.
(2) All 96 sequencing reactions of each sample were performed
simultaneously in the same sequencing run. Most observed
mutations were determined by at least two sequence reads.
However, in a minority of the cases only one sequence read
was available because of various technical reasons, usually
related to the amount and quality of the DNA available.
(3) Any fragment that failed the first sequencing attempt or
any ambiguous base call was tested by additional and
independent PCR and sequencing reactions. In these cases,
the first hypervariable segment (HVS-I) of the control
region was resequenced too to assure that the correct
sample was retrieved.
(4) Genotyping history for each sample was recorded to help
in the search for DNA handling errors and artificial recom-
bination events.
(5) All sequences were aligned with the software Sequencher
(Gene Codes Corporation), and all positions with a Phred
score less than 30 were manually evaluated by an operator.
Two independent operators read each sequence. All posi-
tions that differed from the reference sequences were
recorded electronically to minimize typographic errors.
(6) Any sequence that did not comfortably fit within the estab-
lished human mtDNA phylogeny was highlighted and
resequenced to exclude potential lab errors.
(7) Any comments and remarks raised by external investiga-
tors after release of the data will be addressed by reassessing
the original sequences for accuracy. After that, any unre-
solved result will be further examined by resequencing
and, if necessary, immediately corrected.
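Step (5) relies on a Phred quality threshold of 30. As a rough illustration only (this is not the Sequencher workflow), the sketch below converts a Phred score to its nominal error probability using the standard definition Q = −10 log10(P) and flags positions below the threshold for manual review. The function names and the example scores are hypothetical.

```python
def phred_error_probability(q: float) -> float:
    """Nominal base-calling error probability for a Phred score q."""
    return 10 ** (-q / 10)


def positions_needing_review(quality_by_position, threshold=30):
    """Return the sorted positions whose Phred score is below the threshold."""
    return sorted(pos for pos, q in quality_by_position.items() if q < threshold)


# Hypothetical per-position scores for a handful of nucleotide positions.
qualities = {73: 45, 152: 28, 263: 30, 16189: 12}
flagged = positions_needing_review(qualities)
print(flagged, phred_error_probability(30))
```

A score of exactly 30 is not flagged here because the text specifies "less than 30".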
Tree Reconstruction and Notation of Mutations
The phylogeny was reconstructed by evaluating both the previously
published and the herein released complete mtDNA sequences, aiming
at the most parsimonious solution and aided by the software mtPhyl.
Polymorphic positions are
shown on the branches and reticulations were resolved by consid-
ering the degree of mutability of individual positions as counted
by their number of occurrences in the overall phylogeny. Both
the ancestral and derived base status for each mutation appearing
in the phylogeny according to the International Union Of Pure
And Applied Chemistry (IUPAC) nucleotide code are reported.
We use capital letters for transitions (e.g., G73A) and lowercase
letters for transversions (e.g., A73t). Although heteroplasmies are
not noted in the phylogeny, we recommend labeling them by
using IUPAC code and capital letters (e.g., G73R). Throughout
the phylogeny, indels are given with respect to the RSRS and maintain
the traditional nucleotide position numbering as in the rCRS.
Sequencing alignment prefers 3′ placement for indels, except in
cases where the phylogeny suggests otherwise.31 Deletions are
indicated by a ‘‘d’’ after the deleted nucleotide position (e.g.,
T15944d). Insertions are indicated by the position number followed by
a dot, the ordinal of the inserted nucleotide, and the type of
inserted nucleotide(s) (e.g., 5899.1C for
a C insertion at the first inserted nucleotide position after position
5899 and 5899.2C for a subsequent C insertion; these are
abbreviated as 5899.1CC when occurring on the same branch).
We label polynucleotide stretches of unknown length as follows:
573.XC. In cases where an insertion occurred at an ancestral
branch but a reversion of this insertion (= deletion) took place
at a descendant branch, we noted the latter as follows:
5899.1Cd. An exclamation mark (!) at the end of a labeled position
denotes a reversion to the ancestral state. The number of exclamation
marks stands for the number of sequential reversions in
the given position from the RSRS (e.g., C152T, T152C!, and
C152T!!). Some indel positions have been a source of confusion
because multiple alignment solutions enable alternative scoring.
Notably, the dinucleotide repeat in hypervariable segment II
(HVS-II) of the control region can be viewed either as a CA repeat
starting at position 514 or as an AC repeat starting at position 515,
leading to two different notations being in use for a repeat loss:
522–523d versus 523–524d. We adhered to the guidelines for
consistent treatment of mtDNA-length variants that were estab-
lished by the forensic genetic community31 and favor the AC
interpretation. As the RSRS has one AC unit less compared to
the rCRS, we filled positions 523 and 524 of the RSRS with "NN,"
thereby preserving the historical genome annotation numbering.
Consequently, an AC insertion compared to the RSRS is scored as
522.1AC, whereas an AC deletion is scored as 521–522d. Table S2
presents all common indel positions throughout the complete
mtDNA sequence and the way we labeled them. Transitions at
the hypervariable position 16519, insertions of one or two Cs at
positions 309, 315, and 16193, A to C transversions at 16182
and 16183, as well as length variation of the AC dinucleotide
repeat spanning 515–522, were excluded from the phylogeny.
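The labeling rules in this subsection are regular enough to be parsed mechanically. The sketch below is one possible reading of them, not the authors' software: the function `parse_mutation` and the dictionary keys it returns are our own, and range deletions (e.g., 522–523d) and IUPAC heteroplasmy codes are deliberately left unhandled.

```python
import re

# Patterns follow the labeling rules described in the text.
_SUBSTITUTION = re.compile(r"^([ACGT])(\d+)([ACGTacgt])(!*)$")
_DELETION = re.compile(r"^([ACGT])(\d+)d$")
_INSERTION = re.compile(r"^(\d+)\.(\d+|X)([ACGT]+)(d?)$")


def parse_mutation(label):
    """Classify one mutation label and return its parts as a dict."""
    m = _SUBSTITUTION.match(label)
    if m:
        ancestral, position, derived, bangs = m.groups()
        # Capital derived base = transition, lowercase = transversion.
        kind = "transition" if derived.isupper() else "transversion"
        return {"kind": kind, "position": int(position),
                "ancestral": ancestral, "derived": derived.upper(),
                "reversions": len(bangs)}
    m = _DELETION.match(label)
    if m:
        return {"kind": "deletion", "position": int(m.group(2)),
                "ancestral": m.group(1)}
    m = _INSERTION.match(label)
    if m:
        position, ordinal, bases, reverted = m.groups()
        # A trailing "d" marks the later loss of an ancestral insertion.
        return {"kind": "insertion_reversion" if reverted else "insertion",
                "position": int(position), "ordinal": ordinal, "bases": bases}
    raise ValueError(f"unrecognized mutation label: {label!r}")


for example in ["G73A", "A73t", "C152T!!", "T15944d", "5899.1CC", "573.XC"]:
    print(example, parse_mutation(example))
```

The parser trusts the letter case of the label; it does not verify that a capitalized change is chemically a transition.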
Haplogroup labels were re-evaluated and the following sugges-
tions were made:
(1) Monophyletic clades that are composed of two or more
previously named haplogroups are labeled by concate-
nating their names and separating them by apostrophe
(e.g., L0a’b). This is not applied in the case of capital-
letter-only labeled haplogroups (e.g., JT);
(2) We suggest labeling an extant sample that matches
a haplogroup root with the superscript case letter n for
‘‘nodal’’ (e.g., Hn);
(3) We note that when complete mtDNA sequences are considered,
the inability to differentiate a nodal haplotype from
an unresolved paraphyletic clade is eliminated. Accordingly,
the haplogroup label of each observed complete
mtDNA sequence can: (1) mark it in a nodal position; (2)
affiliate it with a previously labeled haplogroup; (3) suggest
a, so far, unlabeled haplogroup; or (4) in the absence of
two additional samples to justify the labeling of a, so far,
unidentified haplogroup, affiliate it with the ancestral
haplogroup. So, the label of a given sample as ‘‘H’’ means
that it is an unlabeled descendent of haplogroup H that
cannot be affiliated to any known H haplogroup clade
at the time of report and based on complete mtDNA
sequence. We suggest restricting the use of label ‘‘H*’’ to
cases where the haplogroup labeling is based on partial
mtDNA sequence;
(4) To aid the nonexpert in understanding the mtDNA hap-
logroup nomenclature system, we summarize in Table S3
the cases where haplogroup labels do not logically follow
from the hierarchy and hence could lead to confusion.
Changing these haplogroup labels to make them more
logical is undesirable at this stage because they are already
used extensively in the literature and therefore changing
them would probably cause even more confusion. In addi-
tion, we note that for the most basal nodes of the
phylogeny, historically the following shorthand names
have been in use: L1’5 = L1’2’3’4’5’6; L2’5 = L2’3’4’5’6; L2’6 = L2’3’4’6; and L4’6 = L3’4’6, which we will herein
refer to by their full name. One shorthand haplogroup
name, M4’’67, is maintained because writing it in full
(M4’18’30’37’38’43’45’63’64’65’66’67) seems impractical.
It is important to note that the aim of this study is to publish the
most up-to-date human mtDNA phylogeny, and it cannot be
regarded by any means as a population-level survey exploring
the frequencies and distributions of the various haplogroups.
Therefore, although all sequences were used to establish the tree
topology, the subset of sequences actually presented in the
phylogeny is lower because for each branch up to two representa-
tive example sequences are provided. In most cases, we labeled
haplogroups only when supported by at least three distinct haplo-
types to maximize the accuracy of the haplogroup defining array
of mutations and to avoid the establishment of haplogroups
resulting from sequencing errors. Exceptions included previously
established haplogroups or haplogroups supported by a particu-
larly long array of mutations. Accordingly, the tips of the herein
released phylogeny are in fact internal haplogroup nodes, thus
private mutations (if any) of individual haplotypes were not
included.
Evaluation of the mtDNA Clock and Age Estimates
Substitution Counts and Molecular Clock
To calculate the substitution counts from the RSRS to every extant
mitogenome (which is a tip in the mtDNA phylogeny), we
summed up the number of mutations on the path leading to
each noted haplogroup in the phylogeny and added to this the
number of positions that differed between the tip and the root
of the haplogroup. Thus, we are guaranteed to correctly count
all parallel and back mutations, except for the case where two
mutations affecting the same position occurred on a branch in
the tree (in which case we either count zero instead of two, if
the second is a back mutation, or one instead of two, if the second
mutation is not back to the initial state). As has been argued in the
past, such repeated mutations within a single branch in the highly
resolved human mtDNA tree are highly unlikely,32 and are even
more so if the fastest mutating sites (16519 and the A to C trans-
versions and poly-C insertions around the HVS-I position 16189)
are eliminated, as was done in our analysis.
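The counting procedure described above can be sketched as a walk from a haplogroup node back to the root, adding the tip's private differences. The tree below is a hypothetical three-node toy (node names and branch mutations are invented for illustration), and `substitutions_from_root` is our own helper name.

```python
def substitutions_from_root(node, tree, private=()):
    """Count mutations on the path root -> node, plus the tip's private ones.

    `tree` maps each node to (parent, mutations on the branch leading to it).
    """
    count = len(private)
    while node is not None:
        parent, branch_mutations = tree[node]
        count += len(branch_mutations)
        node = parent
    return count


# Hypothetical toy phylogeny: root -> X -> X1.
toy_tree = {
    "root": (None, []),
    "X": ("root", ["C146T", "C182T"]),
    "X1": ("X", ["T4312C"]),
}

# A tip assigned to X1 that carries one private substitution.
total = substitutions_from_root("X1", toy_tree, private=["A16129G"])
print(total)
```

As the text notes, this kind of count cannot see two hits at the same position within a single branch, because only the net change on each branch is recorded.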
To test the validity of molecular clock assumption on human
mtDNA substitutions, we used PAML 4.4 with the HKY85 substitu-
tion model to generate maximum likelihood estimates of branch
lengths with and without the molecular clock assumption. We
chose to sample around 200–300 sequences and analyze their
coalescent tree (a subtree of the complete tree) in each PAML
run, to accommodate PAML’s computational limitations, and
also to sample mostly deep branches (such as M44), rather than
the recent and very short branches (such as D4a1b1) of the over-
sampled haplogroups such as H and D. Thus, we preferentially
sampled haplogroups whose coalescence with other samples in
the tree was more ancient. This ensured that even in such
a sample, the deeper clades such as the basal M clades would
be represented with high probability, whereas more recently
coalescing haplogroups such as the ones of haplogroup D would
be rarely sampled.
The generalized likelihood ratio (GLR) test for validity of the
clock assumption then uses the test statistic 2 × (log-likelihood
of non-clock model − log-likelihood of clock model), which,
under the null hypothesis of molecular clock, has a χ² distribution
with degrees of freedom equal to the number of parameters under
no clock (= number of branches in the tree) minus number of
parameters under clock (= number of internal nodes in the tree).
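The test statistic and its degrees of freedom can be written out in a few lines. This is a sketch under stated assumptions: the log-likelihood values are invented, the tree is taken to be rooted and fully bifurcating (so n tips give 2n − 2 branches and n − 1 internal nodes), and the p value uses the Wilson-Hilferty normal approximation to the χ² tail rather than an exact computation.

```python
import math


def glr_statistic(loglik_no_clock, loglik_clock):
    """2 x (log-likelihood without clock - log-likelihood with clock)."""
    return 2.0 * (loglik_no_clock - loglik_clock)


def clock_test_df(n_branches, n_internal_nodes):
    """Free parameters without the clock minus free parameters with it."""
    return n_branches - n_internal_nodes


def chi2_sf_wilson_hilferty(x, df):
    """Approximate upper-tail chi-square probability (Wilson-Hilferty)."""
    if x <= 0:
        return 1.0
    c = 2.0 / (9.0 * df)
    z = ((x / df) ** (1.0 / 3.0) - (1.0 - c)) / math.sqrt(c)
    return 0.5 * math.erfc(z / math.sqrt(2.0))


# Hypothetical run with 250 sampled sequences on a rooted binary tree.
n_tips = 250
df = clock_test_df(n_branches=2 * n_tips - 2, n_internal_nodes=n_tips - 1)
stat = glr_statistic(loglik_no_clock=-31000.0, loglik_clock=-31180.5)
p_value = chi2_sf_wilson_hilferty(stat, df)
print(df, stat, p_value)
```

For reported results an exact χ² tail function from a statistics library would be preferable to the approximation used here.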
We performed the analyses on two sets of the mtDNA
sequences: once by using the coding region alone and once on
the entire molecule. This was done as another sanity check for
the validity and generality of our results. All obtained p values are
presented in Table S4.
Age Calculations Assuming a Molecular Clock
In spite of the discovered clock violations, we were still interested in
applying the best available tools for estimating the ages of ancestral
nodes in the tree assuming a molecular clock. We adopted the
calculation approach and mutation rate estimate of a previous study,32
which suggests estimating ages in substitutions and then transforming
them to years in a nonlinear manner accounting for the selection
effect on nonsynonymous mutations. We used PAML 4.4 (ref. 33) with the HKY85
substitution model to generate maximum likelihood estimates of
internal node ages under a molecular clock assumption. Because
PAML is computationally limited in the size of trees it can analyze,
we performed estimation for the whole tree in several separate runs.
We divided the tree into seven collections of haplogroups:
• All L haplogroups (i.e., the entire phylogeny excluding M
and N)
• All of M excluding D
• D and JT
• H excluding H1 and H5
• B4’5 and HV excluding H but including H1 and H5
• U
• N excluding HV, U, JT and B4’5
For each PAML run, we selected all sequences belonging to one
of these sets, and added a small random sample of other samples
from the rest of the phylogeny to maintain ‘‘calibration.’’ Putting
together the estimates from all seven runs provided us with age
estimates for all nodes in our tree. Estimates are given in Table S5.
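The way each run combines one haplogroup set with a small random sample from the rest of the phylogeny can be sketched as follows. This is only an illustration under simplifying assumptions: the function name `build_run`, the toy sequence identifiers, and the sample size are hypothetical, and membership is decided by a flat haplogroup label rather than by the nested exclusions listed above.

```python
import random


def build_run(sequences_by_haplogroup, target_haplogroups, n_calibration, seed=0):
    """Select all sequences in the target set plus a random sample of the rest."""
    targets = set(target_haplogroups)
    in_set, rest = [], []
    for haplogroup in sorted(sequences_by_haplogroup):
        bucket = in_set if haplogroup in targets else rest
        bucket.extend(sequences_by_haplogroup[haplogroup])
    # A seeded generator keeps the calibration sample reproducible.
    rng = random.Random(seed)
    calibration = rng.sample(rest, min(n_calibration, len(rest)))
    return in_set, calibration


# Toy identifiers, invented for illustration.
toy = {"U": ["u1", "u2", "u3"], "D": ["d1", "d2"], "JT": ["jt1"], "L": ["l1", "l2", "l3"]}
run_sequences, calibration = build_run(toy, ["D", "JT"], n_calibration=2)
print(run_sequences, calibration)
```

Fixing the seed is a design choice that makes a run repeatable; changing it gives a different calibration sample, which is one way to probe how sensitive the estimates are to the sequences drawn.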
Data Transition
We are aware that the suggested change can raise difficulties and
even antagonism from the scientific community. On the other
hand, a scenario in which a reference sequence of a genetic locus
does not represent its ancestral sequence should, indisputably, be
corrected. The realization of the superiority of complete mtDNA
sequence analysis compared to other approaches, combined
with the emergence of deep sequencing technologies, will possibly
shift the entire field into the use of only complete mtDNA
sequences in the near future.34–36 Therefore, the sooner the
change is made the less ‘‘painful’’ it will be. As the common
practice for reporting complete mtDNA sequences is by posting
the sequences as FASTA files to NCBI, rather than reporting the
substitutions with respect to a reference sequence (as in the case
of many data sets restricted to control-region variation), no major
change is needed. When a FASTA file is available or created, the
only change needed is to switch the reference sequence to the
RSRS. For control-region-based data sets, the conversion might
be more problematic as the common practice to report the
sequences in literature did not involve FASTA files but recorded
mutations as compared to the rCRS. Table S6 compares the classic
diagnostic mutations for the major haplogroups relative to the
rCRS or the RSRS.
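To illustrate why a FASTA sequence makes the switch of reference straightforward, the sketch below lists the substitutions of one aligned sequence against two different references. The function name `differences_from` and the 8-base toy sequences are invented for illustration; real mtDNA comparisons also have to handle insertions and deletions, which this sketch ignores.

```python
def differences_from(reference, sample, first_position=1):
    """List substitutions in `sample` relative to `reference` (aligned, equal length)."""
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    return [
        f"{ref_base}{first_position + i}{base}"
        for i, (ref_base, base) in enumerate(zip(reference, sample))
        if ref_base != base
    ]


# Toy 8-base sequences, invented for illustration (not real mtDNA).
ancestral_ref = "ACGTACGT"  # stands in for an ancestral reference such as the RSRS
modern_ref = "ACGTGCGC"     # stands in for a modern reference such as the rCRS
sample = "ACGTGCGC"         # a sample identical to the modern reference

print(differences_from(modern_ref, sample))     # []
print(differences_from(ancestral_ref, sample))  # ['A5G', 'T8C']
```

The same sample yields an empty list against the modern reference and two ancestral-to-derived substitutions against the ancestral one; only the reference changed, not the sequence.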
To facilitate data transition we release the tools ‘‘FASTmtDNA,’’
which allows transformation of Excel list-type reports of mtDNA
haplotypes into FASTA files, and ‘‘mtDNAble,’’ which labels
haplogroups, performs a phylogeny-based quality check and
identifies private substitutions. These noted features are fully
supported in a web interface or as standalone versions, which
can be freely downloaded from the website including their manual
and example files. In addition, the web interface allows the
benefit of comparing private substitutions between submitted
and previously stored mitogenomes to suggest the labeling of
additional haplogroups. Following quality check and consent, the
web interface enables the storing of complete mtDNA sequences
by members of the mtDNA community to enrich a growing
database. This in turn is expected to strengthen the data set used
by the website to label haplogroups, perform quality control and
refine the phylogeny. Additional tools will be periodically added
and updated.
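The general idea of turning a list-type haplotype report into a FASTA record can be sketched as below. This is not the FASTmtDNA implementation, whose internals are not described here; the function names, the regular expression, and the toy reference are assumptions, and only simple substitutions such as A5G are handled (no insertions, deletions, or length variants).

```python
import re

# Simple substitution notation: reference base, 1-based position, derived base.
SUBSTITUTION = re.compile(r"^([ACGT])(\d+)([ACGT])$")


def apply_substitutions(reference, mutations):
    """Return the sequence obtained by applying substitutions such as 'A5G' to `reference`."""
    bases = list(reference)
    for mutation in mutations:
        match = SUBSTITUTION.match(mutation)
        if match is None:
            raise ValueError(f"unsupported mutation notation: {mutation}")
        expected, position, derived = match.group(1), int(match.group(2)), match.group(3)
        if not 1 <= position <= len(bases):
            raise ValueError(f"{mutation}: position outside the reference")
        if bases[position - 1] != expected:
            raise ValueError(f"{mutation}: reference has {bases[position - 1]} at {position}")
        bases[position - 1] = derived
    return "".join(bases)


def to_fasta(name, sequence, width=60):
    """Format a sequence as a FASTA record with fixed-width lines."""
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return ">" + name + "\n" + "\n".join(lines) + "\n"


# Toy reference and report, invented for illustration.
reference = "ACGTACGT"
record = to_fasta("sample_1", apply_substitutions(reference, ["A5G", "T8C"]))
print(record)
```

Checking the stated reference base before applying each substitution catches reports written against a different reference than the one supplied.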
Results
The RSRS
Since the sub-Saharan haplogroup L0 was defined,37 it
became clear that the root of the extant variation
of human mitochondrial genomes is allocated between
haplogroups L0 and L1’2’3’4’5’6, which are separated
from each other by 14 coding and four control-region
mutations22 (Figure 1). Until now, our understanding of
the root of the human mtDNA tree was incomplete
because of the absence of reliable closely related outgroup
mitogenomes, and the exact placement of the 18 mutations
separating the L0 and L1’2’3’4’5’6 nodes remained
vague. In principle, ancient mtDNA from early human
fossils might be informative but unreachable because of
considerable technical problems inherent to the analysis
process.13 However, as the split between H. sapiens and
H. neanderthalensis certainly predates the appearance of
the RSRS,38 a resolution of the deepest node might
be achieved by rooting the human phylogeny with
H. neanderthalensis complete mtDNA sequences23,24
(Figure 1). Table S1 shows all substitutions separating haplogroup
L0 from L1’2’3’4’5’6, their status in the six
H. neanderthalensis mitogenomes and their most parsimo-
nious allocation around the human root. Accordingly,
the ancestral mtDNA sequence of extant humans should
correspond to the bifurcation of L0 and L1’2’3’4’5’6.
Although it cannot be excluded that further sampling of
the African mtDNA variation might reveal yet another
more basal clade of the human mtDNA tree, it is at least
equally valid to indicate that, in spite of the many
thousands of reported complete mtDNA sequences,7 such
a clade has not been found so far. Operating under this
assumption we established the reference point, RSRS,
which is made available as Sequence S1.
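The logic of using outgroup sequences to decide which side of the deepest split carries the derived state can be sketched as follows. This is a simplified majority-vote stand-in for the parsimony reasoning summarized in Table S1, not the authors' procedure; the function name `polarize_site` and the example states are hypothetical.

```python
def polarize_site(state_a, state_b, outgroup_states):
    """Pick the ancestral state at a site where two sister clades differ.

    Returns (ancestral_state, derived_clade), where derived_clade is 'a' or 'b',
    or (None, None) when the outgroup does not favor either state.
    """
    votes_a = sum(1 for state in outgroup_states if state == state_a)
    votes_b = sum(1 for state in outgroup_states if state == state_b)
    if votes_a > votes_b:
        return state_a, "b"
    if votes_b > votes_a:
        return state_b, "a"
    return None, None


# Invented example: clade a carries C, clade b carries T, six outgroup sequences carry C.
print(polarize_site("C", "T", ["C"] * 6))  # ('C', 'b')
```

When the outgroup carries a third state or is evenly split, the sketch leaves the site unresolved rather than guessing.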
We present the most resolved human mtDNA
phylogeny by compiling the information from 18,843
mitochondrial genomes of which 10,627 were previously
summarized in PhyloTree Build 13 (28 Dec 2011).7 We fol-
lowed the established cladistic notation for haplogroup
labeling adjusted for complete mtDNA genomes.7,39 Yet,
in contrast with the previously reported phylogeny, all
mutational changes noted on the branches of the tree indi-
cate the actual descendant nucleotide state relative to the
state in the RSRS. Although this has no effect on the tree
topology per se, it is critical to emphasize its major conse-
quences in the way of reporting the list of mutations
denoting an mtDNA haplotype. Accordingly, although the
HVS-I haplotype of a nodal haplogroup H2a2a1 mitoge-
nome will show no differences when compared to the
rCRS, its differentiation relative to the RSRS is now docu-
mented by the transitions A16129G, T16187C, C16189T,
T16223C, G16230A, T16278C and C16311T. This
common practice of expressing haplotypes as a string of
differences from the rCRS (Figure 1) led, for instance,
many inexperienced readers to incorrectly hold the ‘‘fact’’
that African haplogroup L mitogenomes have more substi-
tutions separating them from the rCRS as compared to
western Eurasian haplogroup H mitogenomes as a ‘‘proof’’
of an African origin for all contemporary humans.
Indications for Violation of the Molecular Clock
The accepted notion of a molecular clock means that
contemporary mtDNA haplotypes should show statisti-
cally insignificant differences in the number of accu-
mulated mutations from the RSRS.40 Triggered by the
suggested change in the reference sequence that facili-
tates substitution counts from the ancestral root, we
further evaluated this hypothesis. The range of sub-
stitution counts separating contemporary mitogenomes
belonging to major haplogroups from the RSRS is shown
in Figure S2. The mean distance is 57.1 substitutions, the
median is 56 and the empirical standard deviation is 5.9.
Widely different distances ranging from 41 substitutions in
some L0d1a1 mitogenomes to 77 in some L2b1a mitoge-
nomes are observed. Interestingly, the ranges of sub-
stitution counts within haplogroups M and N, which are
hallmarks of the relatively recent out-of-Africa exodus of
humans, are also very large. For example, within M there
are two mitogenomes with 43 substitutions (in M30a and
M44) and two mitogenomes with as many as 71 substitu-
tions (in M2b1b and M7b3a). This is especially striking
because the path from the RSRS to the root of M already
contains 39 substitutions. Hence, the difference between
the M root and its M44 descendant is only four substitu-
tions (two in the coding region and two in the control
region) as compared to 32 substitutions in the M2b1b
and M7b3a mitogenomes. These observations raise the
possibility that the tree in general, and haplogroup M in
particular, might not adhere uniformly to the assumed
molecular clock, under which substitutions occur at a fixed
rate on all branches of the tree over time. We evaluated this
scenario by performing generalized likelihood ratio tests of
the molecular clock by using PAML33 on subsets of samples
from the entire tree, on haplogroup L2 (following past
evidence of clock violations in this haplogroup40) and on
the sister haplogroups M and N. Our results demonstrate
violations of the molecular clock in M (0.00015 ≤
p value ≤ 0.0003 for χ² GLR test in three different analyses)
and give mixed results for the entire tree (p = 0.005
and p = 0.018 for two analyses, which might be sensitive
to the parts of the tree randomly sampled) and L2 (GLR
χ² p value = 5 × 10⁻⁵ and p value = 0.033 for two analyses)
and borderline results in N (GLR χ² p value = 0.049 and
p value = 0.054 in two analyses). We are currently unable
to offer well-founded explanations for these findings,
which remain the scope of future studies.
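The arithmetic behind the haplogroup M example can be reproduced directly from the counts quoted above (39 substitutions from the RSRS to the M root; 43 and 71 substitutions in the named mitogenomes). The variable names are illustrative only.

```python
# Substitution counts quoted in the text.
substitutions_rsrs_to_m_root = 39
total_from_rsrs = {"M30a": 43, "M44": 43, "M2b1b": 71, "M7b3a": 71}

# Substitutions accumulated on each lineage since the M root.
since_m_root = {
    haplogroup: total - substitutions_rsrs_to_m_root
    for haplogroup, total in total_from_rsrs.items()
}
spread = max(since_m_root.values()) - min(since_m_root.values())

print(since_m_root)  # {'M30a': 4, 'M44': 4, 'M2b1b': 32, 'M7b3a': 32}
print(spread)        # 28
```

Under a strict clock, contemporary lineages descending from the same node would be expected to show similar counts, so a spread of 28 substitutions within M is what motivates the formal tests.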
As the clock violation was observed only in a restricted
number of specified cases, we applied the best available
tools for estimating the ages of ancestral nodes. We adop-
ted a conventional calculation approach and mutation
rate32 and used PAML 4.4 to generate maximum likelihood
estimates for internal node ages under a molecular clock
assumption.33 Figure 2 displays the phylogeny and density
of extant haplogroups as a function of both the number of
substitutions occurring since the RSRS and the estimated
coalescence times.
Approaching a Perfect Phylogeny
The mitochondrial genomes released herein almost double
the number of sequences that were previously available.
Despite the fact that the sequences released in this study
are not equally representative of all human populations
but are mainly from donors of western Eurasian matrilineal
ancestry, a few additional advantages arise from this com-
bined data. First, an almost final level of resolution for
a number of western Eurasian clades was achieved, and
the nodes of ancestral and derived haplogroups are often
differentiated by a single mutation. For example, Figure 3
[Figure 2 image: rooted tree of haplogroups L0, L1, L5, L2, L6, L4, L3, M, N, and R with the rCRS and RSRS marked; x axes: substitutions since RSRS and KYBP; y axis: mtDNA haplogroups]
Figure 2. Human mtDNA Phylogeny
A schematic representation of the most parsimonious human mtDNA phylogeny inferred from 18,843 complete mtDNA sequences with the structure shown explicitly for bifurcations that occurred 40,000 years before present (YBP) or earlier, and a graph showing the explosion of haplogroups since then. The y axis indicates the approximate number of haplogroups from each time layer that have survived to nowadays. The upper and lower x axes of the rooted tree are scaled according to the number of accumulated mutations since the RSRS and the corresponding coalescence ages, respectively.
compares the resolution of haplogroup H4 as first41 and as
currently resolved. This comprehensive level of resolution
minimizes the chance of additional nomenclature issues
arising in future studies. Second, the highly resolved phy-
logeny is a powerful tool for quality assessment.29,42–44
Mapping any additional complete mtDNA haplotype to
such highly resolved phylogeny will highlight potential
sequencing errors and problems such as sample mix-
up, contamination, and typographical errors. Third, the
phylogeny itself is a useful resource for future evolutionary,
clinical, and forensic studies.45–51
Discussion
Thirty-one years ago, Anderson and colleagues27 published
the first complete sequence of human mtDNA. This
became the reference sequence in multidisciplinary studies
that revolutionized human genetics, leading, for instance,
to the concept of ‘‘late-out-of-Africa’’ (‘‘African Eve’’)
peopling of the world by modern humans,17,18 the identi-
fication of a wide range of pathological mtDNA muta-
tions,52,53 and the possibility of reconstructing the origins
and the relationships of modern as well as ancient popula-
tions.12,14,54 The publication of globally selected complete
mtDNA genomes about 10 years ago marked the beginning
of the genomic era in this field.4 Since then, progress has
been impressive. Most admirable is the penetration of
the principles applied in the field of archaeogenetics to
hundreds of thousands of people around the world who
became interested in their matrilineal descent. In fact, in
this paper we add information from more than 8,000
complete mtDNA sequences resulting largely from the
curiosity and enthusiasm of lay people to the ~10,000
publicly available complete mtDNA sequences. However,
as discussed above, the entire field faces a problem: the
traditional manner of reporting variation observed in
human mitochondrial genome sequences is, to be blunt,
conceptually incorrect.
Supported by a consensus of many colleagues and after
a few years of hesitation, we have reached the conclusion
that on the verge of the deep-sequencing revolution,47,55
when perhaps tens of thousands of additional complete
mtDNA sequences are expected to be generated over the
next few years, the principal change we suggest cannot
be postponed any longer: an ancestral rather than a ‘‘phylo-
genetically peripheral’’ and modern mitogenome from
Europe should serve as the epicenter of the human mtDNA
reference system. Inevitably, the proposed change could
raise some temporary inconveniences. For this reason, we
provide tables and software to aid data transition.
What we propose is much more than a mere clerical
change. We use the Ptolemaic geocentric versus Copernican
heliocentric systems as a metaphor. And the metaphor
extends further: as the acceptance of the heliocentric
system circumvented epicycles in the orbits of planets,
Figure 3. Haplogroup H4 internal cladistic structure
(Left) Haplogroup H4 as first reported.41 Mutations in bold were considered diagnostic for the haplogroup.
(Right) Haplogroup H4 as currently resolved with a total of 236 H4 mitogenomes. An almost perfect resolution of the nested hierarchy is achieved. Additional haplogroups suggested herein are shown in yellow. Control-region mutations are noted in blue.
switching the mtDNA reference to an ancestral RSRS will
end an academically inadmissible conjuncture where
virtually all mitochondrial genome sequences are scored
in part from derived-to-ancestral states and in part from
ancestral-to-derived states. We aim to trigger the radical
but necessary change in the way mtDNA mutations are
reported relative to their ancestral versus derived status,
thus establishing an intellectual cohesiveness with the
current consensus of shared common ancestry of all con-
temporary human mitochondrial genomes.
Note that the problem is not restricted to mtDNA.
Indeed, in the much larger perspective of complete nuclear
genomes in which comparisons are often currently made
relative to modern human reference sequences, often of
European origin, it seems worthwhile to begin consid-
ering, as valuable alternatives, public reference sequences
of ancestral alleles (common in all primates) whereby
derived alleles (common to some human populations)
would be distinguished.
Supplemental Data
Supplemental Data include two figures, six tables, and one
sequence and can be found with this article online at http://
www.cell.com/AJHG/.
Acknowledgments
We thank the genealogical community for donating their
privately obtained complete mtDNA sequences for scientific
studies and FamilyTreeDNA for compiling the data. We thank
FamilyTreeDNA for supporting the establishment of the herein
released website. We thank Eileen Krauss-Murphy of Family-
TreeDNA for help with assembly of the database. We thank
Rebekah Canada and William R. Hurst for help with the assembly
of haplogroup H and K samples, respectively. R.V. and D.M.B.
thank the European Commission, Directorate-General for
Research for FP7 Ecogene grant 205419. D.M.B. is a shareholder
of FamilyTreeDNA and a member of its scientific advisory board.
R.V. and M.M. thank the European Union, Regional Development
Fund for a Centre of Excellence in Genomics grant, and R.V.
thanks the Swedish Collegium for Advanced Studies for support
during the initial stage of this study. M.M. thanks Estonian Science
Foundation for grant 8973. A.T. received support from Fondazione
Alma Mater Ticinensis and the Italian Ministry of Education,
University and Research: Progetti Ricerca Interesse Nazionale
2009. S.R. thanks the Israeli Science Foundation for grant 1227/
09 and IBM for an Open Collaborative Research grant. FCT, the
Portuguese Foundation for Science and Technology, partially sup-
ported this work through the personal grant N.M.S. (SFRH/BD/
69119/2010). Instituto de Patologia e Imunologia Molecular da
Universidade do Porto is an Associate Laboratory of the Portuguese
Ministry of Science, Technology and Higher Education and is
partially supported by the Portuguese Foundation for Science
and Technology.
Received: January 9, 2012
Revised: February 22, 2012
Accepted: March 2, 2012
Published online: April 5, 2012
Web Resources
The URLs for data presented herein are as follows:
FASTmtDNA, http://www.mtdnacommunity.org
mtDNAble, http://www.mtdnacommunity.org
mtPhyl, http://eltsov.org/mtphyl.aspx
PhyloTree, http://www.phylotree.org
Accession Numbers
The 4,265 complete mtDNA sequences reported herein have been
submitted to GenBank (accession numbers JQ701803–JQ706067).
References
1. Darwin, C. (1859). Natural Selection. On the Origin of
Species by Means of Natural Selection, or, The Preservation
of Favoured Races in the Struggle for Life, Chapter 4 (London:
John Murray).
2. Delsuc, F., Brinkmann, H., and Philippe, H. (2005). Phyloge-
nomics and the reconstruction of the tree of life. Nat. Rev.
Genet. 6, 361–375.
3. Kivisild, T., Metspalu, E., Bandelt, H.J., Richards, M., and
Villems, R. (2006). The world mtDNA phylogeny. In Human
mitochondrial DNA and the evolution of Homo sapiens, H.J.
Bandelt, V. Macaulay, and M. Richards, eds. (Berlin: Springer-
Verlag), pp. 149–179.
4. Ingman, M., Kaessmann, H., Paabo, S., and Gyllensten, U.
(2000). Mitochondrial genome variation and the origin of
modern humans. Nature 408, 708–713.
5. Richards, M., and Macaulay, V. (2001). The mitochondrial
gene tree comes of age. Am. J. Hum. Genet. 68, 1315–1320.
6. Torroni, A., Achilli, A., Macaulay, V., Richards, M., and
Bandelt, H.J. (2006). Harvesting the fruit of the human
mtDNA tree. Trends Genet. 22, 339–345.
7. van Oven, M., and Kayser, M. (2009). Updated comprehensive
phylogenetic tree of global human mitochondrial DNA
variation. Hum. Mutat. 30, E386–E394.
8. Underhill, P.A., and Kivisild, T. (2007). Use of y chromosome
and mitochondrial DNA population structure in tracing
human migrations. Annu. Rev. Genet. 41, 539–564.
9. Salas, A., Bandelt, H.J., Macaulay, V., and Richards, M.B.
(2007). Phylogeographic investigations: The role of trees in
forensic genetics. Forensic Sci. Int. 168, 1–13.
10. Shriver, M.D., and Kittles, R.A. (2004). Genetic ancestry and
the search for personalized genetic histories. Nat. Rev. Genet.
5, 611–618.
11. Taylor, R.W., and Turnbull, D.M. (2005). Mitochondrial DNA
mutations in human disease. Nat. Rev. Genet. 6, 389–402.
12. Gilbert, M.T., Kivisild, T., Grønnow, B., Andersen, P.K., Metspalu,
E., Reidla, M., Tamm, E., Axelsson, E., Gotherstrom, A., Campos,
P.F., et al. (2008). Paleo-Eskimo mtDNA genome reveals matri-
lineal discontinuity in Greenland. Science 320, 1787–1789.
13. Gilbert, M.T., Hansen, A.J., Willerslev, E., Rudbeck, L., Barnes,
I., Lynnerup, N., and Cooper, A. (2003). Characterization of
genetic miscoding lesions caused by postmortem damage.
Am. J. Hum. Genet. 72, 48–61.
14. Haak, W., Forster, P., Bramanti, B., Matsumura, S., Brandt, G.,
Tanzer, M., Villems, R., Renfrew, C., Gronenborn, D., Alt,
K.W., and Burger, J. (2005). Ancient DNA from the first Euro-
pean farmers in 7500-year-old Neolithic sites. Science 310,
1016–1018.
15. Denaro, M., Blanc, H., Johnson, M.J., Chen, K.H., Wilmsen,
E., Cavalli-Sforza, L.L., and Wallace, D.C. (1981). Ethnic vari-
ation in Hpa 1 endonuclease cleavage patterns of human
mitochondrial DNA. Proc. Natl. Acad. Sci. USA 78, 5768–5772.
16. Brown, W.M. (1980). Polymorphism in mitochondrial DNA of
humans as revealed by restriction endonuclease analysis. Proc.
Natl. Acad. Sci. USA 77, 3605–3609.
17. Cann, R.L., Stoneking, M., and Wilson, A.C. (1987). Mito-
chondrial DNA and human evolution. Nature 325, 31–36.
18. Vigilant, L., Stoneking, M., Harpending, H., Hawkes, K., and
Wilson, A.C. (1991). African populations and the evolution
of human mitochondrial DNA. Science 253, 1503–1507.
19. Richards, M., Corte-Real, H., Forster, P., Macaulay, V.,
Wilkinson-Herbots, H., Demaine, A., Papiha, S., Hedges, R.,
Bandelt, H.J., and Sykes, B. (1996). Paleolithic and neolithic
lineages in the European mitochondrial gene pool. Am. J.
Hum. Genet. 59, 185–203.
20. Torroni, A., Bandelt, H.J., D’Urbano, L., Lahermo, P., Moral, P.,
Sellitto, D., Rengo, C., Forster, P., Savontaus, M.L., Bonne-
Tamir, B., and Scozzari, R. (1998). mtDNA analysis reveals
a major late Paleolithic population expansion from south-
western to northeastern Europe. Am. J. Hum. Genet. 62,
1137–1152.
21. Torroni, A., Schurr, T.G., Cabell, M.F., Brown, M.D., Neel, J.V.,
Larsen, M., Smith, D.G., Vullo, C.M., and Wallace, D.C.
(1993). Asian affinities and continental radiation of the four
founding Native American mtDNAs. Am. J. Hum. Genet. 53,
563–590.
22. Behar, D.M., Villems, R., Soodyall, H., Blue-Smith, J., Pereira,
L., Metspalu, E., Scozzari, R., Makkan, H., Tzur, S., Comas,
D., et al; Genographic Consortium. (2008). The dawn of
human matrilineal diversity. Am. J. Hum. Genet. 82, 1130–
1140.
23. Briggs, A.W., Good, J.M., Green, R.E., Krause, J., Maricic, T.,
Stenzel, U., Lalueza-Fox, C., Rudan, P., Brajkovic, D., Kucan,
Z., et al. (2009). Targeted retrieval and analysis of five Nean-
dertal mtDNA genomes. Science 325, 318–321.
24. Green, R.E., Malaspinas, A.S., Krause, J., Briggs, A.W., Johnson,
P.L., Uhler, C., Meyer, M., Good, J.M., Maricic, T., Stenzel, U.,
et al. (2008). A complete Neandertal mitochondrial genome
sequence determined by high-throughput sequencing. Cell
134, 416–426.
25. Kivisild, T., Shen, P., Wall, D.P., Do, B., Sung, R., Davis, K.,
Passarino, G., Underhill, P.A., Scharfe, C., Torroni, A., et al.
(2006). The role of selection in the evolution of human mito-
chondrial genomes. Genetics 172, 373–387.
26. Kivisild, T., Reidla, M., Metspalu, E., Rosa, A., Brehm, A.,
Pennarun, E., Parik, J., Geberhiwot, T., Usanga, E., and
Villems, R. (2004). Ethiopian mitochondrial DNA heritage:
Tracking gene flow across and around the gate of tears. Am.
J. Hum. Genet. 75, 752–770.
27. Anderson, S., Bankier, A.T., Barrell, B.G., de Bruijn, M.H.,
Coulson, A.R., Drouin, J., Eperon, I.C., Nierlich, D.P., Roe,
B.A., Sanger, F., et al. (1981). Sequence and organization of
the human mitochondrial genome. Nature 290, 457–465.
28. Andrews, R.M., Kubacka, I., Chinnery, P.F., Lightowlers, R.N.,
Turnbull, D.M., and Howell, N. (1999). Reanalysis and
revision of the Cambridge reference sequence for human
mitochondrial DNA. Nat. Genet. 23, 147.
29. Yao, Y.G., Salas, A., Bravi, C.M., and Bandelt, H.J. (2006).
A reappraisal of complete mtDNA variation in East Asian families
with hearing impairment. Hum. Genet. 119, 505–515.
30. Pello, R., Martín, M.A., Carelli, V., Nijtmans, L.G., Achilli, A.,
Pala, M., Torroni, A., Gomez-Duran, A., Ruiz-Pesini, E., Marti-
nuzzi, A., et al. (2008). Mitochondrial DNA background
modulates the assembly kinetics of OXPHOS complexes in
a cellular model of mitochondrial disease. Hum. Mol. Genet.
17, 4001–4011.
31. Bandelt, H.J., and Parson, W. (2008). Consistent treatment
of length variants in the human mtDNA control region:
A reappraisal. Int. J. Legal Med. 122, 11–21.
32. Soares, P., Ermini, L., Thomson, N., Mormina, M., Rito, T.,
Rohl, A., Salas, A., Oppenheimer, S., Macaulay, V., and Ri-
chards, M.B. (2009). Correcting for purifying selection: An
improved human mitochondrial molecular clock. Am. J.
Hum. Genet. 84, 740–759.
33. Yang, Z. (2007). PAML 4: Phylogenetic analysis by maximum
likelihood. Mol. Biol. Evol. 24, 1586–1591.
34. Tang, S., and Huang, T. (2010). Characterization of mitochon-
drial DNA heteroplasmy using a parallel sequencing system.
Biotechniques 48, 287–296.
35. Li, M., Schonberg, A., Schaefer, M., Schroeder, R., Nasidze, I.,
and Stoneking, M. (2010). Detecting heteroplasmy from
high-throughput sequencing of complete human mitochon-
drial DNA genomes. Am. J. Hum. Genet. 87, 237–249.
36. Zaragoza, M.V., Fass, J., Diegoli, M., Lin, D., and Arbustini, E.
(2010). Mitochondrial DNA variant discovery and evaluation
in human Cardiomyopathies through next-generation
sequencing. PLoS ONE 5, e12295.
37. Mishmar, D., Ruiz-Pesini, E., Golik, P., Macaulay, V., Clark,
A.G., Hosseini, S., Brandon, M., Easley, K., Chen, E., Brown,
M.D., et al. (2003). Natural selection shaped regional mtDNA
variation in humans. Proc. Natl. Acad. Sci. USA 100, 171–176.
38. Green, R.E., Krause, J., Briggs, A.W., Maricic, T., Stenzel, U.,
Kircher, M., Patterson, N., Li, H., Zhai, W., Fritz, M.H., et al.
(2010). A draft sequence of the Neandertal genome. Science
328, 710–722.
39. Richards, M.B., Macaulay, V.A., Bandelt, H.J., and Sykes, B.C.
(1998). Phylogeography of mitochondrial DNA in western
Europe. Ann. Hum. Genet. 62, 241–260.
40. Torroni, A., Rengo, C., Guida, V., Cruciani, F., Sellitto, D.,
Coppa, A., Calderon, F.L., Simionati, B., Valle, G., Richards,
M., et al. (2001). Do the four clades of the mtDNA haplogroup
L2 evolve at different rates? Am. J. Hum. Genet. 69, 1348–1356.
41. Achilli, A., Rengo, C., Magri, C., Battaglia, V., Olivieri, A., Scoz-
zari, R., Cruciani, F., Zeviani, M., Briem, E., Carelli, V., et al.
(2004). The molecular dissection of mtDNA haplogroup H
confirms that the Franco-Cantabrian glacial refuge was a major
source for the European gene pool. Am. J. Hum. Genet. 75,
910–918.
42. Parson, W., and Bandelt, H.J. (2007). Extended guidelines for
mtDNA typing of population data in forensic science. Forensic
Sci. Int. Genet. 1, 13–19.
43. Salas, A., Carracedo, A., Macaulay, V., Richards, M., and
Bandelt, H.J. (2005). A practical guide to mitochondrial DNA
error prevention in clinical, forensic, and population genetics.
Biochem. Biophys. Res. Commun. 335, 891–899.
44. Bandelt, H.J., Lahermo, P., Richards, M., and Macaulay, V.
(2001). Detecting errors in mtDNA data by phylogenetic
analysis. Int. J. Legal Med. 115, 64–69.
45. Ballantyne, K.N., van Oven, M., Ralf, A., Stoneking, M., Mitchell,
R.J., van Oorschot, R.A., and Kayser, M. (2011). MtDNA
SNP multiplexes for efficient inference of matrilineal genetic
ancestry within Oceania. Forensic Sci. Int. Genet., in press.
Published online September 20, 2011. 10.1016/j.fsigen.2011.08.010.
46. Pereira, L., Soares, P., Radivojac, P., Li, B., and Samuels, D.C.
(2011). Comparing phylogeny and the predicted pathogenicity
of protein variations reveals equal purifying selection across
the global human mtDNA diversity. Am. J. Hum. Genet. 88,
433–439.
47. Behar, D.M., Harmant, C., Manry, J., van Oven, M., Haak, W.,
Martinez-Cruz, B., Salaberria, J., Oyharcabal, B., Bauduer, F.,
Comas, D., and Quintana-Murci, L.; Consortium. TG.
(2012). The Basque paradigm: Genetic evidence of a maternal
continuity in the Franco-Cantabrian Region since pre-
Neolithic times. Am. J. Hum. Genet. 90, 486–493.
48. Zeviani, M., and Carelli, V. (2007). Mitochondrial disorders.
Curr. Opin. Neurol. 20, 564–571.
49. Gunnarsdottir, E.D., Nandineni, M.R., Li, M., Myles, S., Gil,
D., Pakendorf, B., and Stoneking, M. (2011). Larger mitochondrial
DNA than Y-chromosome differences between matrilocal
and patrilocal groups from Sumatra. Nat. Commun. 2, 228.
50. Baum, D.A., Smith, S.D., and Donovan, S.S. (2005). Evolution.
The tree-thinking challenge. Science 310, 979–980.
51. Behar, D.M., Metspalu, E., Kivisild, T., Rosset, S., Tzur, S.,
Hadid, Y., Yudkovsky, G., Rosengarten, D., Pereira, L.,
Amorim, A., et al. (2008). Counting the founders: The matri-
lineal genetic ancestry of the Jewish Diaspora. PLoS ONE 3,
e2062.
52. Wallace, D.C., Singh, G., Lott, M.T., Hodge, J.A., Schurr, T.G.,
Lezza, A.M., Elsas, L.J., 2nd, and Nikoskelainen, E.K. (1988).
Mitochondrial DNA mutation associated with Leber’s heredi-
tary optic neuropathy. Science 242, 1427–1430.
53. MITOMAP. (2011) A Human Mitochondrial Genome Data-
base. http://www.mitomap.org.
54. Quintana-Murci, L., Harmant, C., Quach, H., Balanovsky, O.,
Zaporozhchenko, V., Bormans, C., van Helden, P.D., Hoal,
E.G., and Behar, D.M. (2010). Strong maternal Khoisan contribution
to the South African coloured population: A case of
gender-biased admixture. Am. J. Hum. Genet. 86, 611–620.
55. Schonberg, A., Theunert, C., Li, M., Stoneking, M., and
Nasidze, I. (2011). High-throughput sequencing of complete
human mtDNA genomes from the Caucasus and West Asia:
High diversity and demographic inferences. Eur. J. Hum.
Genet. 19, 988–994.
684 The American Journal of Human Genetics 90, 675–684, April 6, 2012
ARTICLE
Age-Related Somatic Structural Changes in the Nuclear Genome of Human Blood Cells
Lars A. Forsberg,1 Chiara Rasi,1 Hamid R. Razzaghian,1 Geeta Pakalapati,1 Lindsay Waite,2
Krista Stanton Thilbeault,2 Anna Ronowicz,3 Nathan E. Wineinger,4 Hemant K. Tiwari,4
Dorret Boomsma,5 Maxwell P. Westerman,6 Jennifer R. Harris,7 Robert Lyle,8 Magnus Essand,1
Fredrik Eriksson,1 Themistocles L. Assimes,9 Carlos Iribarren,10 Eric Strachan,11 Terrance P. O’Hanlon,12
Lisa G. Rider,12 Frederick W. Miller,12 Vilmantas Giedraitis,13 Lars Lannfelt,13 Martin Ingelsson,13
Arkadiusz Piotrowski,3 Nancy L. Pedersen,14 Devin Absher,2 and Jan P. Dumanski1,*
Structural variations are among the most frequent interindividual genetic differences in the human genome. The frequency and distri-
bution of de novo somatic structural variants in normal cells is, however, poorly explored. Using age-stratified cohorts of 318 monozy-
gotic (MZ) twins and 296 single-born subjects, we describe age-related accumulation of copy-number variation in the nuclear genomes
in vivo and frequency changes for both megabase- and kilobase-range variants. Megabase-range aberrations were found in 3.4% (9 of
264) of subjects ≥60 years old; these subjects included 78 MZ twin pairs and 108 single-born individuals. No such findings were
observed in 81 MZ pairs or 180 single-born subjects who were ≤55 years old. Recurrent region- and gene-specific mutations, mostly deletions,
were observed. Longitudinal analyses of 43 subjects whose data were collected 7–19 years apart suggest considerable variation in
the rate of accumulation of clones carrying structural changes. Furthermore, the longitudinal analysis of individuals with structural aber-
rations suggests that there is a natural self-removal of aberrant cell clones from peripheral blood. In three healthy subjects, we detected
somatic aberrations characteristic of patients with myelodysplastic syndrome. The recurrent rearrangements uncovered here are candi-
dates for common age-related defects in human blood cells. We anticipate that extension of these results will allow determination of the
genetic age of different somatic-cell lineages and estimation of possible individual differences between genetic and chronological age.
Our work might also help to explain the cause of an age-related reduction in the number of cell clones in the blood; such a reduction is
one of the hallmarks of immunosenescence.
Introduction
Structural changes in the human genome have been iden-
tified as one of the major types of interindividual genetic
variation.1,2 Furthermore, the rate of formation of copy-
number variants (CNVs) exceeds the corresponding rate
of SNPs by 2–4 orders of magnitude.3–5 In spite of this, little
is known about the rate of formation and distribution of
de novo somatic CNVs in normal cells and whether these
aberrations accumulate with age. There are, however, indi-
cations that chromosomal remodeling in the nuclear and
mitochondrial genomes increases with age.6–12 Theoretical
predictions suggest that somatic mosaicism should be
widespread,13,14 and reviews in the field point out that
somatic mosaicism, in both healthy and diseased cells, is
an understudied aspect of human-genome biology.15–18
A recent estimate of 1.7% for the frequency with which
somatic mosaicism causes large-scale structural aberrations
in adult human samples is, however, a relatively low
number.19 We have shown that adult monozygotic (MZ)
twins and differentiated human tissues frequently display
somatic CNVs.20,21 We therefore hypothesized that the
nuclear genome of blood cells in vivo might accumulate
CNVs with age, and we used age-stratified MZ twins as
a starting point for testing this hypothesis. Because nuclear
genomes of MZ twins are identical at conception, they
represent a good model for studying somatic variation.
We replicated a MZ-twin-based analysis by using age-strat-
ified cohorts of single-born subjects. Using these resources,
we show age-related accumulation of CNVs in the nuclear
genomes of blood cells in vivo. Age effects were found for
both megabase- and kilobase-range variants.
1Department of Immunology, Genetics and Pathology, Rudbeck Laboratory, Uppsala University, 75185 Uppsala, Sweden; 2HudsonAlpha Institute for
Biotechnology, 601 Genome Way, Huntsville, AL 35806, USA; 3Department of Biology and Pharmaceutical Botany, Medical University of Gdansk, Hallera
107, 80-416 Gdansk, Poland; 4Section on Statistical Genetics, Department of Biostatistics, Ryals Public Health Building, University of Alabama at Birming-
ham, Suite 327, Birmingham, AL 35294-0022, USA; 5Department of Biological Psychology, VU University, Van der Boechorststraat 1, 1081 BT Amsterdam,
The Netherlands; 6Hematology Research, Mount Sinai Hospital Medical Center, 1500 S California Avenue, Chicago, IL 60608, USA; 7Department of Genes
and Environment, Division of Epidemiology, The Norwegian Institute of Public Health, P.O. Box 4404 Nydalen, N-0403 Oslo, Norway; 8Department of
Medical Genetics, Oslo University Hospital, Kirkeveien 166, 0407 Oslo, Norway; 9Department of Medicine, Stanford University School of Medicine,
Stanford, CA 94305, USA; 10Kaiser Foundation Research Institute, Oakland, CA 94612, USA; 11Department of Psychiatry and Behavioral Sciences and
University of Washington Twin Registry, University of Washington, Box 359780 Seattle, WA 98104, USA; 12Environmental Autoimmunity Group,
National Institute of Environmental Health Sciences, National Institutes of Health Clinical Research Center, National Institutes of Health, Building 10,
Room 4-2352, 10 Center Drive, MSC 1301, Bethesda, MD 20892-1301, USA; 13Department of Public Health and Caring Sciences, Division of Molecular
Geriatrics, Rudbeck laboratory, Uppsala University, 751 85 Uppsala, Sweden; 14Department of Medical Epidemiology and Biostatistics, Karolinska Institutet,
SE-171 77 Stockholm, Sweden
*Correspondence: [email protected]
DOI 10.1016/j.ajhg.2011.12.009. ©2012 by The American Society of Human Genetics. All rights reserved.
The American Journal of Human Genetics 90, 217–228, February 10, 2012 217
Material and Methods
Studied Cohorts, DNA Isolation, and Quality Control
Samples were collected with informed consent from all subjects,
and the study was approved by the respective local institutional
review boards or research ethics committees. The information
about studied cohorts of MZ twins and single-born subjects is
provided in Tables S1 and S2, available online. We isolated DNA
from peripheral blood by using the QIAGEN kit (QIAGEN, Hilden,
Germany).
The quality, quantity, and integrity of DNA samples were
controlled with NanoDrop (Thermo Fisher Scientific, Waltham,
MA, USA), picoGreen fluorescent assay (Invitrogen, Eugene, Ore-
gon, USA), and agarose gels.
Sorting of Subpopulations of Cells from Peripheral
Blood and Culturing of Fibroblasts
Peripheral blood mononuclear cells (PBMCs) were isolated from
the whole blood with Ficoll-Paque centrifugation (Amersham
Biosciences, Uppsala, Sweden), and a mixture of granulocytes
was collected from under the PBMC layer. We isolated CD19+ cells
from PBMCs by positive selection with CD19 MicroBeads (Miltenyi
Biotech, Auburn, CA, USA). First, we negatively selected
CD4+ cells by using the CD4+ T cell Isolation Kit II (Miltenyi
Biotech, Auburn, CA, USA), and then we positively selected the
cells by using CD4 MicroBeads (Miltenyi Biotech, Auburn, CA,
USA). The CD19+ and CD4+ cells were incubated for 30 min at
4°C with phycoerythrin- and PerCP-conjugated antibodies (BD
Biosciences, San Diego, CA, USA), respectively, for
fluorescence-activated cell sorting (FACS) analysis. We measured purities
of >90% for CD19+ and >98% for CD4+ cells by flow cytometry
(FACS CantoII, BD Biosciences, San Diego, CA, USA). The
skin-biopsy-derived fibroblasts were cultured in RPMI medium
supplemented with Ham's F-10 medium, fetal bovine serum
(10%), penicillin, and L-glutamine (all cell culture reagents were
from GIBCO, Invitrogen, Paisley, UK) in an incubator at 37°C.
After reaching ~90% confluence, the cells were trypsinized
(Trypsin-EDTA, GIBCO, Invitrogen, Paisley, UK), and the fibroblasts
were used for DNA isolation. We performed a standard
phenol-chloroform extraction to isolate DNA from CD19+ cells,
CD4+ cells, fibroblasts, and crude granulocyte fraction.
Genotyping with Illumina SNP Arrays and Calling
of Large-Scale CNVs
We performed the SNP genotyping experiments by using several
types of Illumina beadchips according to the recommendations of
the manufacturer. Such experiments were performed at two facili-
ties: Hudson Alpha Institute for Biotechnology (Huntsville, AL,
USA) and the SNP Technology Platform (Uppsala University,
Sweden). All Illumina genotyping experiments passed the follow-
ing quality-control criteria: The SNP call rate for all samples was
>98%, and the LogRdev value was <0.2. The results from Illumina
SNP arrays consist of two main data tracks: log R ratio (LRR) and
B-allele frequency (BAF)22 (see Figure 1). Deviations of consecutive
probes from normal states are indicative of structural aberrations.
We analyzed Illumina output files by using Nexus Copy Number
version 5.1 (BioDiscovery, CA, USA), which applies a “Rank
Segmentation” algorithm based on the circular binary segmentation
(CBS) approach.23 The applied version, “SNPRank Segmentation,”
an extended algorithm in which BAF values are also included
in the segmentation process, generated both copy-number and
allelic-event calls. We applied the default calling parameters of
the program. The array data for large-scale CNVs reported in this
paper have been submitted to the Database of Genomic Structural
Variation (dbVAR) under the accession number nstd58.
A Method for Detection of Small-Scale CNVs
with Illumina SNP Array Data
We developed and applied an algorithm for testing whether
smaller structural variants would also accumulate with age. We
used deviations in BAF as the main tool for detecting candidate
CNV regions because it can detect mosaicism in as few as 5%–7%
of cells24,25 and allows uncovering of deletions and duplications
as well as copy-number-neutral loss of heterozygosity
(CNNLOH). This method uses an in-house-developed R-script26
to perform scans for deviations in BAF values alone and in BAF
values together with LRR values in MZ twins. Figure S1 describes
this algorithm, which identifies CNV calls for each MZ pair at
user-defined thresholds of either ΔBAF or both ΔBAF and ΔLRR.
Our initial tests of the algorithm were based on the entire cohort
of 159 MZ pairs. However, a series of “trial and error” tests
suggested that the method is sensitive to the quality of input data,
given that the results were heavily biased toward detection of
putative CNV calls in MZ co-twins with lower quality of genotyping,
as measured by the Nexus Quality (NQ) score. The latter is one
of the features of Nexus Copy Number software. We therefore
defined strict NQ-score-based criteria for inclusion of MZ pairs in
the analysis (see Table S3 and Figure S1), which resulted in the
selection of 87 pairs that were processed further.
We based the final analysis on 87 twin pairs by identifying
candidate CNV loci in which BAF values were different between
co-twins when multiple thresholds were used. As expected, the
number of putative CNV calls between MZ co-twins was highly
dependent on the settings of the ΔBAF filtering (Figures S1–S4).
Thus, when the settings were too generous in this step, an
age-related signal was hidden in large background variation (Figure S2).
By using stricter filtering criteria, we found an age-related
correlation (Figures 2A and S4C). We trimmed the list of putative
CNVs generated by ΔBAF by using a ΔLRR filter of >0.35 so that
only loci with differences in both BAF and LRR remained in the
final list (Figures 2B and S4D). Hence, the ΔLRR filter removed
all loci with copy-number-neutral variation from the list. In the
course of tuning ΔBAF (or both ΔBAF and ΔLRR) filtering parameters,
we took advantage of three already-known large-scale aberrations
that are described in our dataset (Figures 1A–1F, 3, and S5).
These worked as ideal internal controls for the validity of our
approach as shown in Figures S2–S4. Hence, by plotting the
number of calls both including the probes located within the three
known aberrations (Figures S2A–S2B, S3A–S3B, and S4A–S4B) and
after excluding the probes located within the known aberrations
(Figures S2C–S2D, S3C–S3D, and S4C–S4D), we could compare
and evaluate the observed and expected results. For example, in
Figure S4B, the twin pair TP25-1/TP25-2 stands out because the
probes positioned within the large de novo aberration of chromosome
5 (Figure 1) are included in the list of calls. When plotting
the same data after excluding probes within this region, we found
that the twin pair falls into the cluster of variation similar to that
of the other MZ twin pairs (Figure S4D). On the basis of such
evaluations, we observed that probes within the three large-scale
CNVs were detected (or not, depending on the input file used in
the analysis) as predicted by our ΔBAF and ΔLRR algorithm.
Therefore, these evaluations provided an internal validation of our
approach to detecting de novo small-scale CNVs.
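The two-stage filter described in this section can be sketched as follows. This is only an illustration under stated assumptions, not the authors' R script (which is documented in Figure S1): the function name, the per-probe dictionary layout, and the use of absolute co-twin differences are hypothetical. Only the thresholds (ΔBAF between 0.2 and 0.45, ΔLRR > 0.35) are taken from the text.

```python
def candidate_cnv_probes(twin1, twin2, dbaf_min=0.2, dbaf_max=0.45, dlrr_min=None):
    """Return probe IDs whose BAF (and optionally LRR) differ between co-twins.

    twin1, twin2: dicts mapping probe ID -> (BAF, LRR); a hypothetical layout.
    dlrr_min=None applies the BAF-only filter; a number adds the LRR filter.
    """
    calls = []
    for probe in sorted(twin1.keys() & twin2.keys()):
        baf1, lrr1 = twin1[probe]
        baf2, lrr2 = twin2[probe]
        dbaf = abs(baf1 - baf2)  # assumed definition of delta-BAF
        dlrr = abs(lrr1 - lrr2)  # assumed definition of delta-LRR
        if not (dbaf_min < dbaf < dbaf_max):
            continue
        if dlrr_min is not None and dlrr <= dlrr_min:
            continue  # copy-number-neutral difference: dropped by the LRR filter
        calls.append(probe)
    return calls


# Made-up values for three probes in one twin pair.
twin_a = {"rs1": (0.50, 0.00), "rs2": (0.50, 0.02), "rs3": (0.51, -0.01)}
twin_b = {"rs1": (0.52, 0.01), "rs2": (0.75, -0.40), "rs3": (0.80, 0.00)}

baf_only = candidate_cnv_probes(twin_a, twin_b)
baf_and_lrr = candidate_cnv_probes(twin_a, twin_b, dlrr_min=0.35)
```

In this toy input, the third probe differs in BAF but not in LRR, so it survives the BAF-only filter and is removed once the LRR filter is added, mirroring how the ΔLRR step discards copy-number-neutral loci.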
Figure 1. Two Examples of Megabase-Range De Novo Somatic Aberrations
(A) A normal profile of MZ twin TP25-1.
(B) A 32.5 Mb deletion on 5q is shown in nucleated blood cells of co-twin TP25-2. This deletion was uncovered with LRR data from the Illumina SNP array.
(C and D) The BAF profiles of twins TP25-1 (C) and TP25-2 (D). The qPCR experiments showed that 66.2% of nucleated blood cells in TP25-2 had the 5q deletion (i.e., 33.1% fewer copies of the DNA segment, Figure 5). The R-package-MAD (Mosaic Alteration Detection) analysis of the Illumina data suggested that 50.5% of the cells had the 5q deletion when the subjects were 77 years old.
(E) The deviation of BAF values from 0.5 (the allelic fraction of intensity at each heterozygous SNP) was plotted, and the percentage of cells with the 5q deletion was higher when the subjects were 77 years old than when they were 70 years old (t test: p < 0.001). This slow increase in aberrant clones was also supported by the MAD estimate of 48.3% of cells detected when the subjects were 70 years old. The size and position of this deletion is typical of patients with myelodysplastic syndrome (MDS).
(F) A confirmatory array-CGH experiment.
(G–K) Another large somatic event: a terminal CNNLOH encompassing 103 Mb of 4q in ULSAM-697. The LRR and BAF data from Illumina SNP genotyping of samples collected when the subjects were 71, 82, 88, and 90 years old are plotted in (G), (H), (I), and (J), respectively. Percentages of cells with the aberration were calculated with the MAD package and are given for each panel.
(K) The proportion of cells with the 4q aberration changes with time, and the changes are significantly different between all samplings at different ages (ANOVA: F(3,25935) = 39087, p < 0.001; Tukey’s test for multiple comparisons). Figure S8 shows other analysis details of the samples collected from ULSAM-697 when he was 90 years old. These analyses include those of fibroblasts and three types of sorted blood cells.
The analysis of samples obtained when the subjects were 90 years old was performed in duplicate experiments on Illumina 1M-Duo and Omni-Express arrays.
Design of the Nimblegen 135K Custom-Made
Tiling-Path Oligonucleotide Array
This tool was designed according to the instructions from Roche-
Nimblegen (Madison, WI, USA) and encompassed 137,545 probes
used for validation of the 138 putative CNVs detected by the Illu-
mina SNP array (Figures 2B, S4C, and S4D). In total, the design
consisted of 98,894 experimental probes and an additional
38,651 backbone control probes distributed across the genome.
The median overlap of probes (i.e., probe spacing) was 30 bp.
This array was applied in cohybridizations of 34 MZ twin pairs
(Figures 2G, 2H, and S6 and Table S4).
Array-Comparative Genomic Hybridization
with Nimblegen 720K and 135K Arrays
We performed DNA labeling for both platforms (3 × 720K and
12 × 135K) by random priming with the Nimblegen
Dual-Color DNA Labeling kit (Roche-Nimblegen) according to
Nimblegen’s protocol. In brief, test and reference DNA (500 ng
each) samples were labeled with Cy3 and Cy5, respectively. The
combined test and reference DNA was cohybridized (for 48 hr at
42°C) onto a human comparative genomic hybridization (CGH)
3 × 720K whole-genome tiling array (100718_HG18_WH_CGH_v3.1_HX3,
OID:30853; Roche-Nimblegen) or a 12 × 135K
custom-designed array (110131_HG18_LF_CGH_HX18, OID:33469;
Roche-Nimblegen). The arrays were washed with the
Nimblegen Wash Kit. We performed image acquisition with an MS
200 Scanner at 2 µm resolution by using high-sensitivity and autogain
settings. We extracted data with NimbleScan v2.6 segMNT,
including spatial correction (LOESS) and qspline fit normalization,
in order to compensate for differences in signal between the two
dyes.27 We generated an experimental metrics report with
NimbleScan v2.6 to verify hybridization quality. We performed
CNV analysis with Nexus Copy Number software version 5.1 by
using default settings (see above). All plots shown in Figures 2G,
2H, and S6 are derived from unaveraged, normalized raw data.
Validation Experiments Involving Quantitative
Real-Time Polymerase Chain Reaction
We measured the relative amount of DNA molecules by using
quantitative real-time polymerase chain reaction (qPCR) with
SYBR green to validate the CNV findings from the arrays. qPCRs
were performed in 20 µl reactions containing 5 ng genomic DNA,
0.3 µM of each primer, and 1× Maxima SYBR Green/ROX qPCR
Master Mix (Fermentas, Vilnius, Lithuania) (for primer sequences,
see Table S5). The reactions were incubated at 95°C for 10 min,
after which they underwent 40 cycles of 95°C for 15 s and 60°C
for 60 s in a Stratagene Mx3000P (Agilent Technologies) machine.
The reactions for evaluation of primer efficiencies were performed
in duplicates with control DNA (normal human female genomic
DNA, Promega Corporation, Madison, WI, USA), whereas all other
reactions with test and reference DNA were performed in triplicates;
in both instances, the averages were used in analyses. Each
primer pair’s efficiency and standard curve are described in
Figure S7. Melting-curve analysis was performed in all the experiments,
and the results were analyzed with MxPro v4.10 software. We used
ultraconserved elements on human chromosomes 3 and 6 (UCE3 and
UCE6) as control loci as previously described.28,29 We used
the average cycle threshold (Ct) value of UCE6 to normalize the
average Ct values of UCE3 and test loci. We used these normalized
Ct values to calculate copy-number ratios of test regions. Using the
estimated copy-number ratios from UCE3 and the test loci from
multiple replicate experiments, we performed t tests for
statistical testing.

[Figure 2 image not reproduced. Annotations recoverable from the plots: the two regression panels (number of calls versus age of twin pairs) are labeled with correlation coefficients of 0.62 and 0.54, both with p < 0.001; the age-group comparison is labeled ANOVA F(8,78) = 7.58, p < 0.001; the longitudinal panels are titled "Longitudinal changes between twins (10 years)" and "Longitudinal changes within individuals"; the validation panels show BAF and log2-ratio tracks for pair TP63-1/2 (about 100.695–100.710 Mb) and pair TP31-1/2 (about 84.265–84.285 Mb), with printed p values of 5.85E-08 and 1.82E-10.]

Age groups shown in panel C (table in panel D):
Age group   N (MZ pairs)   Median age
1           10             8
2           10             19
3           9              29
4           10             65
5           10             68
6           10             72
7           10             76
8           10             78
9           8              82

Figure 2. Age-Related Accumulation of Small Somatic Structural Rearrangements in 87 Pairs of MZ Twins
(A and B) Linear regression analyses showing that the number of calls increases with age in MZ twin pairs when ΔBAF values are between 0.2 and 0.45 as well as when ΔBAF values are between 0.2 and 0.45 and when the LRR deviation is >0.35. Each dot represents data from one MZ twin pair. Details regarding the filtering algorithms used are shown in Figure S1.
(C and D) An analysis of statistical significance for nine age groups of MZ twin pairs when ΔBAF values are between 0.2 and 0.45.
(E and F) Longitudinal data analyses comparing the number of ΔBAF reports (between 0.2 and 0.45) of 18 twin pairs that were sampled twice, 10 years apart. Each point in the plot represents the number of differences within one MZ pair (E). Each line (plotted between the two time points for the same MZ pair) thus represents the change over time of the number of differences within a pair (blue line, increase; red line, decrease; green line, no change). The intraindividual changes for each twin over a period of 10 years are shown in (F). The x axis shows individual ages at the later sampling. On the y axis, the number of differences found between the two samples from the same person at the two time points is shown, and vertical lines connect co-twins.
(G and H) Validation of copy-number imbalance between MZ twins in two pairs (chromosomes 10 and 6, respectively), which were detected by the ΔBAF analysis. The small boxes at the top of both (G) and (H) display original data from Illumina arrays for pairs TP63-1/TP63-2 and TP31-1/TP31-2, respectively. The larger boxes at the bottom of (G) and (H) display raw data from Nimblegen tiling-path 135K array for these two twin pairs. Each line is drawn to scale and represents data from one oligonucleotide probe. Statistical significance for the results of the Nimblegen array was calculated with the Mann-Whitney U test; values were analyzed for the region of interest (shaded) and for both areas on either side of the control regions. Twenty additional examples of validation experiments are shown in Figure S6. There was no difference between the rates of validation success for the young (n = 8) and old (n = 26) MZ pairs used in these experiments (t test: t = 0.7062, p value = 0.4819), supporting the results from linear-regression analyses. The detailed description of the Nimblegen array is provided in Figure S6 and Table S4.

Figure 3. An Example of a Somatic Megabase-Range Aberration
(A, E, and F) A deletion encompassing 12.9 Mb of 20q in MZ twin TP30-1 was sampled when she was 69 years old.
(B, G, and H) The normal profile of co-twin TP30-2, as detected by LRR and BAF after Illumina SNP array genotyping. R-package-MAD analysis of the Illumina data suggested that 41.5% of the blood cells had the 20q deletion. qPCR validation experiments confirmed this result by showing 39.6% aberrant cells (i.e., 19.8% fewer copies of the DNA segment, Figure 5).
(C and D) Array-CGH validation experiments also confirmed the copy-number variation. The genetic change in MZ twin TP30-1 is another example of an MDS-like aberration, which was uncovered in a subject without a clinical diagnosis of MDS.
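The text does not state the exact formula used to turn normalized Ct values into copy-number ratios. As a rough sketch, the widely used comparative-Ct (ΔΔCt) calculation is shown here; treating it as the calculation used in this study is an assumption, as are the function name, the default of perfect doubling per cycle, and the example Ct values.

```python
def copy_number_ratio(ct_test_locus, ct_test_ref, ct_ctrl_locus, ct_ctrl_ref,
                      efficiency=2.0):
    """Relative copy number of a locus in a test sample versus a control sample.

    ct_*_locus: mean Ct of the locus of interest in each sample.
    ct_*_ref:   mean Ct of the reference locus (UCE6 in the text) in the same sample.
    efficiency: amplification factor per cycle; 2.0 (perfect doubling) is an
                assumption, whereas the study measured primer efficiencies (Figure S7).
    """
    delta_test = ct_test_locus - ct_test_ref  # reference-normalized Ct, test sample
    delta_ctrl = ct_ctrl_locus - ct_ctrl_ref  # reference-normalized Ct, control sample
    return efficiency ** -(delta_test - delta_ctrl)


# Made-up Ct values: the locus amplifies half a cycle later in the test sample.
ratio = copy_number_ratio(25.5, 24.0, 25.0, 24.0)
same = copy_number_ratio(25.0, 24.0, 25.0, 24.0)
```

A ratio below 1 indicates fewer copies of the test region relative to the control; a later Ct at equal reference signal means less starting template.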
Statistical Methods
The statistical analyses were performed
with the R 2.12–2.13 software.26 We used
methods such as linear regression, t tests,
and one-way analyses of variance (ANOVAs)
when suitable, as further specified in the
text. Prior to testing, we checked the data
to confirm that no test assumptions were violated.
For multiple comparisons (i.e., Figures 1K
and S8G), we used the Tukey honest-significant-difference
method by implementing
the TukeyHSD function in R. When appropriate,
we performed the nonparametric Fisher’s exact test and
Mann-Whitney U test, as described in the text.
Boxplots of Longitudinal-Analysis Data
Heterozygous SNPs have a theoretical expected BAF value of 0.5,
and deviations from this normal state can be indicative of struc-
tural aberrations.24 We can therefore use changes in the magni-
tude of these deviations in the subjects’ longitudinal samples to
measure intraindividual changes over time and to estimate the
proportion of cells affected by large-scale aberrations. We
produced the boxplots in Figures 1E, 1K, 4J, S9D, S9G, and S8G
to visualize such changes in BAF variation. In these figures, we
plotted the absolute deviation of BAF values from 0.5 for all
heterozygous SNPs in the region of interest (i.e., ABS(0.5 − BAF))
on the y axes. We only included heterozygous SNPs (i.e., those
with a BAF value between 0.2 and 0.8) in these calculations to
increase quality and accuracy of the plots. A larger BAF value devi-
ation from 0.5 corresponds to a larger degree of mosaicism, i.e.,
a higher proportion of cells with a specific aberration. We used t
tests (in cases with two factor levels) or one-way ANOVAs (in cases
with >2 factor levels) to test for significance of such differences.
For the model illustrated in Figures 1K and S8G, we used the Tukey
post-hoc test for multiple comparisons to compute differences
between factor-level means after adjusting p values for the
multiple testing.

Figure 4. Longitudinal Analysis of ULSAM-340, a Single-Born Subject Containing a 13.8 Mb Deletion on 20q, as Detected by LRR and BAF with the Illumina SNP Array
The size and position of this deletion is typical of MDS patients. This subject, however, has not been diagnosed with MDS. When the patient was 71 years old, the deletion was only carried by a small proportion of blood cells and was barely detectable, and neither Nexus Copy Number software nor R-package MAD reported this aberration at this age (A, D, and E). R-package MAD suggested that 50.7% of the nucleated cells had the deletion when ULSAM-340 was 75 years old (B, F, and G) and that when he was 88 years old, the corresponding proportion of cells was 36.1% (C, H, and I). qPCR validation experiments showed that the sample taken when the patient was 88 years old contained 14.5% fewer copies of DNA in the segment as compared to the sample taken when he was 75 years old (Figure 5). The deviations from 0.5 of the BAF values within the deleted region in the three different sampling stages are illustrated in (J).
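The quantity plotted in the longitudinal boxplots, ABS(0.5 − BAF) restricted to heterozygous SNPs, can be computed as in the sketch below. The function name and the example BAF values are hypothetical, and treating the 0.2 and 0.8 limits as exclusive is an assumption, since the text does not say whether the bounds are inclusive.

```python
from statistics import median


def abs_baf_deviation(baf_values, het_min=0.2, het_max=0.8):
    """Absolute deviation from 0.5 for SNPs treated as heterozygous.

    SNPs with BAF outside (het_min, het_max) are regarded as homozygous
    and dropped before the deviation is computed.
    """
    return [abs(0.5 - b) for b in baf_values if het_min < b < het_max]


# Made-up BAF values for one region at two sampling ages.
earlier = [0.02, 0.48, 0.53, 0.47, 0.98, 0.51]
later = [0.01, 0.38, 0.63, 0.36, 0.99, 0.62]

dev_earlier = abs_baf_deviation(earlier)
dev_later = abs_baf_deviation(later)
shift = median(dev_later) - median(dev_earlier)
```

A positive shift in the median deviation between sampling ages corresponds to a larger fraction of cells carrying the aberration; the significance of such a shift would then be assessed with a t test or ANOVA as described in the text.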
Quantification of the Number of Cells Affected by
Megabase-Range Aberrations
We calculated the approximate percentage of cells affected by
aberrations in the megabase range by using data from qPCR exper-
iments (the data are described in Figure 5). The qPCR measure-
ments provided the approximate number of DNA molecules that
are affected by an aberration. Assuming that an aberration affects
only one chromosome (i.e., an aberration that is a heterozygous
event) in a diploid genome, we used this number and converted
it to the approximate number of affected cells. Our assumption
is reasonable, given that we are studying normal cells and that
the size of these large-scale aberrations renders them unlikely to
affect both chromosomes (i.e., they are unlikely to be homozygous
[biallelic] events). For example, the relative number of DNA copies
in nucleated blood cells of twin TP25-2 at the age of 77 years
confirmed the array data. To determine these numbers, we used
two primer pairs (41.1 and 42.1) designed within the deleted
region and took five independent measurements for both primer
pairs. These experiments suggested that, at the age of 77, twin
TP25-2 had 30.8% (when primer pair 41.1 was used) and 35.4%
(when primer pair 42.1 was used)—an average of 33.1%—fewer
DNA copies with a 32.5 Mb 5q deletion than did her co-twin at
the same age (Figure 5). If one assumes that this deletion is
affecting one chromosome in a diploid cell, our calculations
suggest that 66.2% of cells contain this deletion.
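The conversion from qPCR copy loss to affected cells follows from the heterozygous-event assumption: if a fraction p of diploid cells has lost one of two copies, the sample retains (2 − p)/2 of the normal copy number, so the fractional loss is p/2 and p is twice the measured loss. The sketch below reproduces the arithmetic for TP25-2 with the two primer-pair estimates quoted in the text; the function name is hypothetical.

```python
def cells_with_heterozygous_deletion(fraction_fewer_copies):
    """Fraction of cells carrying a one-copy deletion, given the fractional copy loss.

    Assumes a diploid genome and a deletion affecting only one chromosome,
    so at most half of the copies can be missing.
    """
    if not 0.0 <= fraction_fewer_copies <= 0.5:
        raise ValueError("a heterozygous deletion removes at most 50% of copies")
    return 2.0 * fraction_fewer_copies


# Copy loss measured with primer pairs 41.1 and 42.1 (values from the text).
primer_estimates = [0.308, 0.354]
mean_loss = sum(primer_estimates) / len(primer_estimates)  # 33.1% fewer copies
cell_fraction = cells_with_heterozygous_deletion(mean_loss)  # about 66.2% of cells
```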
In order to quantify the level of mosaicism, we also applied an
alternative, published method19,30 based on calculations of the
deviation of BAF values from the expected value of 0.5 for the
heterozygous SNPs in a normal state. This method has been
tailored for data derived from the Illumina SNP platform. The
R-package MAD (Mosaic Alteration Detection) version 0.5-9 (ref. 30)
identifies the aberrant regions, such as deletions, gains, and
CNNLOHs, and calculates the B deviation (Bdev, deviation from
the expected BAF value of 0.5 for heterozygous SNPs) value, which
is then used for calculation of the number of cells affected by the
aberration. We used the following modified version of the
published19 formula for deletions, gains, and CNNLOHs:
$$\text{Proportion of cells with aberration} = \frac{2\,B_{\mathrm{dev}}}{0.5 + B_{\mathrm{dev}}}$$
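As a rough check of the formula above, the following Python sketch applies it to a simulated heterozygous deletion. The forward model (B allele frequency of the retained allele equal to 1/(2 − p) when a fraction p of cells carries the deletion) is an assumption made here for illustration; it is not taken from the MAD package, and the function names are hypothetical.

```python
def proportion_from_bdev(bdev):
    """Proportion of cells with the aberration, per the formula in the text."""
    if not 0.0 <= bdev <= 0.5:
        raise ValueError("Bdev must lie between 0 and 0.5")
    return 2.0 * bdev / (0.5 + bdev)


def bdev_for_heterozygous_deletion(p):
    """Expected Bdev when a fraction p of cells lost one homolog (assumed model).

    Per cell on average, the retained allele has 1 copy and the deleted
    allele has (1 - p) copies, so BAF = 1 / (2 - p) and Bdev = BAF - 0.5.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie between 0 and 1")
    return 1.0 / (2.0 - p) - 0.5


# Round trip: simulate Bdev for known clone sizes and recover them.
for p in (0.1, 0.5, 1.0):
    bdev = bdev_for_heterozygous_deletion(p)
    print(f"p = {p:.2f}  Bdev = {bdev:.4f}  recovered = {proportion_from_bdev(bdev):.2f}")
```

Under this deletion model the formula recovers the simulated proportion exactly; the sketch does not address gains or CNNLOH.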
Results
Age-Related Accumulation of Megabase-Range
Structural Variants
Our analysis of 159 MZ pairs involved genotyping with
Illumina 600K SNP arrays, confirmation of monozygosity
(>99.9% genotype concordance), CNV calling with Nexus
Copy Number software (BioDiscovery, CA, USA), followed
by inspection of genomic profiles. Validation was per-
formed with a different Illumina array, Nimblegen array,
and qPCR. Comparison of MZ twin pairs, including 19
previously reported pairs,21 identified five large de novo
aberrations of >1 Mb among 81 young or middle-aged
(≤55 years) and 78 elderly (≥60 years) pairs studied
(Figures 1, 3, 5, and S5). All five large rearrangements
occurred in the older twins, suggesting a relationship
between age and the presence of changes. Tables S1 and
S2 show a description of subjects, cohorts, and statistical
support for the use of Illumina data for the detection of
variants. We expanded on the results from twins by using
two age-stratified groups of single-born subjects. First, we
genotyped DNA from 108 men, all 88 years old, from the
ULSAM (Uppsala Longitudinal Study of Adult Men) cohort
by using the Illumina-1M-Duo array. We found that four
subjects had large-scale rearrangements at the age of
88 years, and the somatic nature of such rearrangements
was established by examination of samples taken from
the same individuals at other time points (Figures 1, 4, 5,
and S8–S10 and Table S1). Second, for the young or
middle-aged single-born control cohort (33–55 years), we
used existing Illumina 550K data from 180 controls from
the ADVANCE (Atherosclerotic Disease, Vascular Function,
and Genetic Epidemiology) study.31,32 Analogous analysis
of ADVANCE subjects did not reveal any cases of large-scale
aberrations. The genotyping quality of 550K experiments
is at least as good as the quality of 1M-Duo arrays, and
the resolution of the 550K array is sufficient for detection
of ~1 Mb aberrations that have been uncovered in the
ULSAM cohort (Figures S11 and S12 and Table S6). In
fact, we described a 1.6 Mb deletion by using the 300K
array in twin D8,21 and literature comparing arrays
suggests that the 250K level is sufficient for uncovering
submegabase-range changes.28,33 By studying the twins
and the single-born individuals and by analyzing the two
groups together, we obtained firm statistical support for
age-related accumulation of large structural variants
(with Fisher’s exact test; p value = 0.00052) (Table S2).
Overall, 3.4% of the studied population ≥60 years old
carries cells containing megabase-range somatic aberra-
tions that are readily detectable by array-based scanning,
whereas none of the younger controls carried aberrations
in this size range. The sensitivity of our analysis to detect
aberrant clones is about 5% of nucleated blood cells.24,25
A previous estimate of 1.7% for somatic mosaicism was
obtained in an analysis that was not stratified by age.19
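The Fisher's exact test mentioned above can be sketched in Python as follows. The 2 × 2 counts are a reconstruction from the numbers given in the text, not a copy of Table S2: twins are counted as individuals (78 elderly pairs = 156 subjects, plus 108 ULSAM men = 264 elderly subjects, 9 of whom carry large aberrations; 81 younger pairs = 162 subjects, plus 180 ADVANCE controls = 342 younger subjects, none with aberrations). The function name is hypothetical.

```python
from math import comb


def fisher_exact_upper_tail(a, b, c, d):
    """One-sided Fisher's exact test for the table [[a, b], [c, d]].

    Returns P(X >= a) under the hypergeometric null, where X is the
    number of carriers falling in the first row.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)
    p = 0.0
    for k in range(a, min(row1, col1) + 1):
        p += comb(row1, k) * comb(row2, col1 - k) / denom
    return p


# Assumed counts reconstructed from the text (see the lead-in above).
elderly_carriers, elderly_total = 9, 264
young_carriers, young_total = 0, 342

p_value = fisher_exact_upper_tail(
    elderly_carriers, elderly_total - elderly_carriers,
    young_carriers, young_total - young_carriers,
)
print(f"carrier frequency in elderly: {elderly_carriers / elderly_total:.1%}")
print(f"one-sided Fisher p value: {p_value:.5f}")
```

With these assumed counts the carrier frequency is 3.4% and the p value is close to the reported 0.00052, which suggests that the reconstruction is at least consistent with the published figures.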
Five subjects harboring large CNVs (twin TP25-2 and
ULSAM-102, -298, -340, and -697) were followed in
repeated samplings collected up to 19 years apart. They
all showed accumulation of aberrant cells with a variation
in the rate of this process. Twin TP25-2 is an example of
slow accumulation of a 5q-deletion clone (Figure 1);
when this twin was 77 years old, two independent
methods (qPCR and MAD-program-based) suggested
that 66.2% and 50.5% of cells, respectively, contained
a deletion on one copy of chromosome 5. The change in
deviation of BAF within the deleted region when twin
TP25-2 was 70 and 77 years old translates into a 2.2%
increase in cells with the 5q deletion. The latter estimation
was based on analysis with the MAD program. It is note-
worthy that the size and position of this 5q deletion are
typical of myelodysplastic syndrome (MDS).34–38 However,
twin TP25-2 has not been diagnosed with this disease.
The American Journal of Human Genetics 90, 217–228, February 10, 2012 223
[Figure 5: bar charts of the relative amount of DNA molecules (%) in the control region UCE3 and in test loci; image not reproduced. Annotations recovered from the panels follow.]
Panel A:
MZ pair TP25-1/2 at the age of 77, Chr. 5 locus 41.1, n = 5: ~30.8% fewer DNA copies in test locus in twin TP25-2, p = 0.0149
MZ pair TP25-1/2 at the age of 77, Chr. 5 locus 42.1, n = 5: ~35.4% fewer DNA copies in test locus in twin TP25-2, p < 0.001
MZ pair TP30-1/2 at the age of 69, Chr. 20 locus 45.1, n = 5: ~19.8% fewer DNA copies in test locus in twin TP30-1, p < 0.001
ULSAM-340 at the age of 75 and 88, Chr. 20 locus 45.1, n = 6: ~14.5% fewer DNA copies in test locus at the age of 88, p < 0.001
ULSAM-102, age 88 vs. f-gDNA, Chr. 1 locus rs540796 (n = 5) and Chr. 8 locus rs9298462 (n = 5): annotations of ~49.1% more and ~34.7% more DNA copies in test locus in ULSAM-102 compared to reference DNA; p < 0.001; p = 0.0015
Panel B:
Panel titles: MZ pair TP31-1/2 at the age of 69, SNP rs6928830, n = 8; MZ pair TP19-1/2 at the age of 75, SNP rs329312, n = 9; MZ pair TP63-1/2 at the age of 76, SNP rs4635020, n = 6; MZ pair TP16-1/2 at the age of 77, SNP rs4841318, n = 7; MZ pair TP63-1/2 at the age of 76, SNP rs708039, n = 11
Annotations (assignment to individual panels not recoverable from the extracted text): ~8.9% fewer DNA copies in test locus, p = 0.0449; ~14.2% fewer, p < 0.0001; ~7.8% fewer, p = 0.0057; ~5.9% fewer, p = 0.0458; ~5.7% fewer, p = 0.0101
Figure 5. Validation of de novo CNVs by qPCR with SYBR Green
Eleven independent qPCR experiments, each composed of multiple (5–11) independent measurements, are shown. The relative number of DNA copies in both test loci (white bars) and the control region UCE3 (gray bars) were plotted. Before we plotted and performed statistical analyses with t tests, we normalized all Ct values by using the control region UCE6. Figure S7 shows the determination of primer efficiency for each of the primer pairs.
(A and B) Validations for five large-scale (A) and five small-scale (B) aberrations. The dotted line drawn at 100% represents the copy-number state in control DNA (i.e., that from the normal MZ co-twin, or human female control DNA, or DNA from the same subject sampled at another age), and error bars indicate standard error of means.
(A) The 5q deletion in twin TP25-2 (Figure 1) was validated with two primer pairs (41.1 and 42.1) designed within the deleted region. In total, ten independent qPCR experiments showed that ~66.2% of all nucleated blood cells in TP25-2 had the 5q deletion (i.e., an average of 33.1% [30.8% with primer pair 41.1 and 35.4% with primer pair 42.1] fewer copies of the DNA segment). Similarly, the 20q deletion in twin TP30-1 (Figure 3) was validated with primer pair 45.1 in five experiments. The 19.8% fewer DNA copies found in the test locus indicates that 39.6% of the nucleated blood cells had the deletion. For ULSAM-340, the array data indicated a longitudinal somatic change in the number of cells carrying the 20q deletion. Six independent qPCR experiments comparing DNA sampled when ULSAM-340 was 75
ULSAM-102 is another example of slow accumulation and
contains gains on 1p and 8q (Figure S9). The 1p gain is
stable, whereas the 8q gain shows a statistically significant
(ANOVA: p value <0.05) increase over a period of 10 years.
Consequently, ULSAM-102 probably carries two coexisting
clones with different aberrations. In ULSAM-340 and -697,
the rate of accumulation was faster and there was a decrease
in the proportion of cells with aberrations at later sam-
plings. ULSAM-340 contains a 20q deletion, which was
barely detectable at the age of 71 (Figure 4). The number
of cells containing the 20q deletion was estimated by anal-
ysis with the MAD program to be 50.7% when ULSAM-340
was 75 years old and to be 36.1% when he was 88 years
old. ULSAM-340 is another example of an aberration
typical of MDS in a subject without this diagnosis.
However, his clinical history includes thrombocytopenia,
which is normally a part of MDS clinical features. We there-
fore speculate that this symptom might be due to clonal
expansion of cells with a 20q deletion and suppression of
normal thrombocyte production. Finally, ULSAM-697
was analyzed four times and shows the most pronounced
increase and decrease in the number of cells with
CNNLOH of 4q (Figures 1 and S8). This aberration was
not detectable at the age of 71, reached 58.4% at the age
of 88, and decreased radically to 29.9% of cells at the age
of 90. When ULSAM-697 was 90 years old, we profiled
sorted CD4+ cells, CD19+ cells, granulocytes, and fibro-
blasts, in addition to whole-blood DNA. CD4+ cells, gran-
ulocytes, and whole blood showed similar levels of
aberrant cells, whereas CD19+ cells and fibroblasts ap-
peared normal. We performed all experiments on samples
taken when ULSAM-697 was 90 years old in duplicate with
different types of arrays. Thus, in ULSAM-697, both
lymphoid and myeloid cells were affected, except for, quite
surprisingly, CD19+ B cells. Overall, the analyses per-
formed on ULSAM-340 and ULSAM-697 suggest that the
cells with aberrations have a higher proliferative potential
than do other cells in the immune system, but they are not
immortalized because they apparently disappear from
circulation.
Small-Scale Structural Aberrations Also Display
Positive Correlation with Age
Given the above results, we tested whether smaller struc-
tural variants would also accumulate with age, and we
used deviations in BAF as the main detection tool because
they can detect mosaicism in as few as 5%–7% of cells24,25
and allow detection of deletions and duplications as well as
CNNLOH. We performed scans for deviations in BAF
values alone and BAF together with LRR in twins by using
a new R-script (Figure S1) that identifies CNV calls for each
MZ pair at various thresholds of ΔBAF and ΔLRR. Early
analyses showed that the algorithm was sensitive to the
quality of genotyping because calls were preferentially
observed in co-twins with lower data quality. We therefore
applied strict inclusion criteria by using the NQ score,
which is based on genome-wide noise measurements.
This resulted in the selection of 87 out of 159 MZ pairs
(Table S3). We found that small putative CNVs increased
with age (Figure 2A; linear regression: F(1,85) = 54.00,
p < 0.001; Figures S2–S4). We further narrowed the
number of calls by combining the ΔBAF and ΔLRR
values >0.35 from both twins in each MZ twin pair, and
this process also indicated that these CNVs accumulate
with age (Figure 2B; F(1,85) = 34.60, p < 0.001). We also
tested whether genotyping quality (the ΔNQ value is the abso-
lute value of the difference in quality score within pairs)
might explain the observed pattern. Importantly, there
was no association between ΔNQ and age (F(1,85) = 1.85, p > 0.05), sug-
gesting that the positive correlation with age reflects true
aberrations. Figure 2B displays a total of 827 CNV calls at
378 loci in 87 pairs with an age span of 3–86 years. Plotting
of the 378 calls against the genome shows the nonrandom
distribution and recurrent nature of these CNV calls
(Figure S13). On the basis of frequency and/or location in
the vicinity of known genes, we selected 138 loci for vali-
dation by using a tiling-path array (Nimblegen 135K) in
34 twin pairs. With this platform, 15% of putative CNVs
were validated in the same twin pairs in which they were
first detected by ΔBAF and ΔLRR analysis. There was no
bias in the success rate of validation between younger
and older groups (t test: t = 0.7062, p value = 0.4819). In
total, 52 of the 138 loci (38%) included on the 135K array
showed CNVs within 32 of the 34 MZ pairs tested (Figures
2G, 2H, and S6), and the majority of CNVs encompassed
<1 kb. The discrepancy (i.e., 15% versus
38%) in the validation success rates mentioned above is
probably due, at least in part, to the high stringency of
the ΔBAF and ΔLRR analysis, which only reported a subset
of preferentially strong calls representing structural vari-
ants, and to the recurrent nature of loci that are affected by
the small-scale variation. Hence, some true structural vari-
ants were validated in (often multiple) MZ pairs on the
135K array, even though the initial ΔBAF and ΔLRR anal-
ysis did not pick them up because the filtering parameters
were too stringent. We selected 5 of these 52 loci for further
validation with qPCR, and all five were confirmed by this
alternative approach (Figure 5). We also performed break-
point-PCR validation in 17 out of the above 52 loci by
using PCR across the deleted region in instances that
and 88 years old showed that the subject had 14.5% fewer copies of the DNA segment when he was 88 years old. In ULSAM-102, the Illumina array identified a duplication event on both chromosomes 1 and 8 (Figure S9). Given that the proportion of cells with a gained segment in this subject was relatively stable over time, we used human female genomic DNA as control DNA in these experiments. The qPCR experiments validated both somatic CNVs.
(B) qPCR validation of five loci with small-scale de novo CNVs within MZ twins. These loci were identified by Illumina array genotyping and were confirmed on the Nimblegen 135K array (see also Figures 2G, 2H, and S6). The layout of this panel is similar to that of (A), described above. For example, the first locus (rs6928830) illustrates de novo CNVs in twin TP31-1 (Figure 2H).
were presumed to represent the shortest deletions based on
the Illumina and Nimblegen 135K array data. However,
these attempts were not successful. We obtained correctly
sized PCR bands representing wild-type alleles for tested
loci. However, we could not detect any shorter, mutated
alleles that were mapped to the correct genomic regions.
These validation experiments included gel purification of
PCR fragments, PCR-fragment analysis, subcloning in plas-
mids, and Sanger sequencing (details not shown). These
results suggest that the vast majority of the uncovered
small structural variants are due to more complex rear-
rangements involving deletions or gains embedded
together with other structural changes. These results are
in agreement with a recent sequencing-based validation
analysis of CNV loci; the analysis showed that as few as
5% of CNVs suspected to represent gains or deletions are
in fact “pure blunt-end breakpoints.”39 Details for the 52
validated loci are shown in Table S4, which includes infor-
mation about genes affected by the variation. The results
presented in Table S4 and Figure S13 emphasize the
recurrent nature of the 52 validated loci. For example,
out of the 52 loci, 13 only occurred once in any of the 34
tested twin pairs, whereas the remaining 39 were recurrent
and occurred 2–16 times in the same set of MZ twin
pairs. The number of CNVs per pair validated with the
135K Nimblegen array ranged from 1 to 32 (median 6)
(Table S7). In summary, the deviation between MZ co-
twins ranged from 0 to 51,040 bp (median 4,995 bp),
and the latter corresponds to ~0.0000016% genome-wide
divergence.
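The co-twin comparison at the core of the small-scale pipeline described in this section can be illustrated with a simplified Python sketch. It is not the authors' R-script (Figure S1): the function name, the input layout, and the way the two thresholds are combined are assumptions made here, and only the 0.35 cutoff is taken from the text. The probe values are invented for the example.

```python
def discordant_probes(twin1, twin2, baf_threshold=0.35, lrr_threshold=0.35,
                      require_both=False):
    """Flag probes at which MZ co-twins differ.

    twin1, twin2: lists of (BAF, LRR) tuples, one per probe, in the same order.
    A probe is flagged when |delta BAF| exceeds its threshold; with
    require_both=True, |delta LRR| must exceed its threshold as well.
    Returns a list of (probe_index, delta_baf, delta_lrr).
    """
    calls = []
    for index, ((baf1, lrr1), (baf2, lrr2)) in enumerate(zip(twin1, twin2)):
        delta_baf = abs(baf1 - baf2)
        delta_lrr = abs(lrr1 - lrr2)
        baf_hit = delta_baf > baf_threshold
        lrr_hit = delta_lrr > lrr_threshold
        if (baf_hit and lrr_hit) if require_both else baf_hit:
            calls.append((index, delta_baf, delta_lrr))
    return calls


# Hypothetical probe values for one MZ pair (not data from the study).
twin_a = [(0.50, 0.00), (0.50, 0.01), (0.50, 0.02)]
twin_b = [(0.95, -0.50), (0.52, 0.00), (0.90, -0.05)]

baf_only = discordant_probes(twin_a, twin_b)
baf_and_lrr = discordant_probes(twin_a, twin_b, require_both=True)
print([index for index, _, _ in baf_only])
print([index for index, _, _ in baf_and_lrr])
```

Requiring both signals keeps only the strongest discordances, which mirrors the trade-off discussed above between stringent filtering and calls that were later validated on the 135K array.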
By using the small-scale CNV pipeline, we analyzed 18
pairs of MZ twins that were sampled twice, 10 years apart
(Figures 2E, 2F, and S1 and Table S8). Analyses were per-
formed in two ways: as an interindividual comparison of
one twin to its co-twin at the first and second sampling
and as an intraindividual comparison of the two samplings
of a single twin. Both types of comparisons suggest varia-
tion in the dynamics of changes between co-twins and
show both increases and decreases over a period of 10 years
in the number of calls in different twin pairs. Interestingly,
this evidence for the dynamics of small-scale CNVs over
time (Figure 2E) is consistent with the results from longitu-
dinal analyses of large-scale aberrations in ULSAM-697 and
ULSAM-340 (Figures 1 and 4), suggesting both increases
and decreases over time in the number of cells containing
different variants.
Discussion
The phenotypic consequences of accumulating aberra-
tions are an interesting aspect of our results. In two
subjects diagnosed with chronic lymphocytic leukemia
(CLL), we detected multiple changes consistent with the
disease (Figure S10). These findings are not unexpected:
Our population-based cohort was not preselected against
any diagnoses, and CLL is the most prevalent leukemia
among the elderly.40 However, it is surprising that appar-
ently healthy subjects have aberrations characteristic of
MDS. A typical 5q deletion (observed in one subject) and
a 20q deletion (observed in two subjects) are among the
most common aberrations in patients diagnosed with
MDS.34–38 Trisomy 8 is also a recurrent aberration in
MDS, and ULSAM-102 displays a restricted 8q gain; it
remains unclear whether this gain is related to MDS.
None of the above-mentioned individuals were diagnosed
with MDS, and their cases might represent an indolent,
subclinical form of MDS. In two individuals followed in
longitudinal sampling (i.e., ULSAM-340 and -697), we
observed not only an increase but also a clear subsequent
decrease in the proportion of nucleated blood cells with
aberrations (Figures 1, 4, and S8). These results suggest an
“autocorrection” of the immune system, given that the
aberrant clones are apparently disappearing from circula-
tion. Similar expansions of preleukemic clones containing
gene fusions specific to acute leukemia have been
described in newborns;41 the gene fusions TEL-AML1 and
AML1-ETO were present in cord blood at a frequency
100× greater than the frequency that is associated with
the risk of developing the corresponding leukemia.
The presented data are probably only part of all the
somatic changes that actually occurred in the studied
cohorts because balanced inversions and translocations
escape our detection and because we interrogated a fraction
of all the nucleotides in the genome. Furthermore, we only
detected high-frequency aberrations, presumably because
these aberrations provided the affected cells with a prolifer-
ative advantage, which led to clonal expansion above
the detection limit of ~5% of cells. It follows from this
reasoning that deleterious aberrations leading to prolifera-
tive disadvantage or aberrations that are neutral from the
point of view of the proliferative potential go undetected.
Nevertheless, the chromosomal regions (e.g., those that
contain the 20q deletion) and loci affected in a recurrent
fashion (Figure S13 and Table S4) are candidates for con-
taining common and redundant age-related defects in
human blood cells. These mutations are presumed to
provide the affected cells with a mild proliferative advan-
tage without transforming the affected cells into immortal-
ized cancer clones. However, the proliferative advantage
for a limited number of cells will most likely affect the
overall complexity of cell clones present in blood and
should therefore be discussed in the context of immunose-
nescence, which, in fact, involves loss of complexity of cell
clones in both B and T cell lineages.42,43 Our results might
therefore help to explain the cause of age-related reduction
in the number of cell clones in the blood. This reduction
could lead to a less diverse immune system caused by the
accumulation of genetic changes that induce the expan-
sion of a limited number of clones. We also anticipate
that extension of our work will allow determination of
the genetic age of different somatic cell lineages and esti-
mation of possible individual differences between genetic
and chronological age.
Supplemental Data
Supplemental Data include 13 figures and eight tables and can be
found with this article online at http://www.cell.com/AJHG.
Acknowledgments
We thank Lars Feuk, Brigitte Schlegelberger, Jacek Witkowski, Greg
Cooper, Richard Rosenquist Brandell, Eva Hellstrom-Lindberg,
Chris Gunther, and Eva Tiensuu Janson for critical review of the
manuscript and Larry Mansouri and Juan R. Gonzalez for method-
ological advice. This study was sponsored by grants from the
Ellison Medical Foundation (J.P.D. and D.A.) and from the Swedish
Cancer Society, the Swedish Research Council, and the Science for
Life Laboratory-Uppsala (J.P.D.). A.P. acknowledges FOCUS 4/2008
and FOCUS 4/08/2009 grants from the Foundation for Polish
Science. Genotyping was performed in part by the SNP&SEQ
Technology Platform, which is supported by Uppsala University,
Uppsala University Hospital, the Science for Life Laboratory–
Uppsala, and the Swedish Research Council (contracts 80576801
and 70374401).
Received: November 10, 2011
Revised: December 6, 2011
Accepted: December 14, 2011
Published online: February 2, 2012
Web Resources
The URLs for data presented herein are as follows:
GenePipe PrimerZ, http://genepipe.ngc.sinica.edu.tw/primerz/
Illumina Beadchip information, http://www.illumina.com/documents/products/appnotes/appnote_cytogenetics.pdf
R 2.12–2.13 software, http://www.r-project.org/
Roche-Nimblegen array CGH Protocols, http://www.nimblegen.com/
R-package MAD version 0.5–9, http://www.creal.cat/jrgonzalez/software.htm
Surveillance Epidemiology and End Results (SEER) Program Fast Stats, http://seer.cancer.gov/faststats/
The Gene Ontology, http://www.geneontology.org/
The Genetic Association Database, http://geneticassociationdb.nih.gov/
The HUGO Gene Nomenclature Committee, http://www.genenames.org/
University of California Santa Cruz Human Genome Browser, http://genome.cse.ucsc.edu/cgi-bin/hgGateway
Accession Numbers
The array data for large-scale CNVs reported in this paper have
been submitted to the Database of Genomic Structural Variation
(dbVAR) under the accession number nstd58.
References
1. Conrad, D.F., Pinto, D., Redon, R., Feuk, L., Gokcumen, O.,
Zhang, Y., Aerts, J., Andrews, T.D., Barnes, C., Campbell, P.,
et al; Wellcome Trust Case Control Consortium. (2010).
Origins and functional impact of copy number variation in
the human genome. Nature 464, 704–712.
2. Itsara, A., Cooper, G.M., Baker, C., Girirajan, S., Li, J., Absher,
D., Krauss, R.M., Myers, R.M., Ridker, P.M., Chasman, D.I.,
et al. (2009). Population analysis of large copy number vari-
ants and hotspots of human genetic disease. Am. J. Hum.
Genet. 84, 148–161.
3. van Ommen, G.J. (2005). Frequency of new copy number vari-
ation in humans. Nat. Genet. 37, 333–334.
4. Lupski, J.R. (2007). Genomic rearrangements and sporadic
disease. Nat. Genet. 39 (7 Suppl), S43–S47.
5. Itsara, A., Wu, H., Smith, J.D., Nickerson, D.A., Romieu, I.,
London, S.J., and Eichler, E.E. (2010). De novo rates and selec-
tion of large copy number variation. Genome Res. 20, 1469–
1481.
6. Harley, C.B., Futcher, A.B., and Greider, C.W. (1990). Telo-
meres shorten during ageing of human fibroblasts. Nature
345, 458–460.
7. Vaziri, H., Schachter, F., Uchida, I., Wei, L., Zhu, X., Effros, R.,
Cohen, D., and Harley, C.B. (1993). Loss of telomeric DNA
during aging of normal and trisomy 21 human lymphocytes.
Am. J. Hum. Genet. 52, 661–667.
8. Lee, H.C., Pang, C.Y., Hsu, H.S., and Wei, Y.H. (1994). Differen-
tial accumulations of 4,977 bp deletion in mitochondrial DNA
of various tissues in human ageing. Biochim. Biophys. Acta
1226, 37–43.
9. Fraga, M.F., Ballestar, E., Paz, M.F., Ropero, S., Setien, F., Balles-
tar, M.L., Heine-Suner, D., Cigudosa, J.C., Urioste, M., Benitez,
J., et al. (2005). Epigenetic differences arise during the lifetime
of monozygotic twins. Proc. Natl. Acad. Sci. USA 102, 10604–
10609.
10. Mohamed, S.A., Hanke, T., Erasmi, A.W., Bechtel, M.J.,
Scharfschwerdt, M., Meissner, C., Sievers, H.H., and Gosslau,
A. (2006). Mitochondrial DNA deletions and the aging heart.
Exp. Gerontol. 41, 508–517.
11. Flores, M., Morales, L., Gonzaga-Jauregui, C., Domínguez-
Vidana, R., Zepeda, C., Yanez, O., Gutierrez, M., Lemus, T.,
Valle, D., Avila, M.C., et al. (2007). Recurrent DNA inversion
rearrangements in the human genome. Proc. Natl. Acad. Sci.
USA 104, 6099–6106.
12. Sloter, E.D., Marchetti, F., Eskenazi, B., Weldon, R.H., Nath, J.,
Cabreros, D., and Wyrobek, A.J. (2007). Frequency of human
sperm carrying structural aberrations of chromosome 1
increases with advancing age. Fertil. Steril. 87, 1077–1086.
13. Frank, S.A. (2010). Evolution in health and medicine Sackler
colloquium: Somatic evolutionary genomics: Mutations
during development cause highly variable genetic mosaicism
with risk of cancer and neurodegeneration. Proc. Natl. Acad.
Sci. USA 107 (Suppl 1), 1725–1730.
14. Lynch, M. (2010). Evolution of the mutation rate. Trends
Genet. 26, 345–352.
15. Youssoufian, H., and Pyeritz, R.E. (2002). Mechanisms and
consequences of somatic mosaicism in humans. Nat. Rev.
Genet. 3, 748–758.
16. Erickson, R.P. (2010). Somatic gene mutation and human
disease other than cancer: An update. Mutat. Res. 705, 96–106.
17. De, S. (2011). Somatic mosaicism in healthy human tissues.
Trends Genet. 27, 217–223.
18. Dumanski, J.P., and Piotrowski, A. (2012). Structural genetic
variation in the context of somatic mosaicism. In Genomic
Structural Variation, L. Feuk, ed. (New York: Humana Press).
19. Rodríguez-Santiago, B., Malats, N., Rothman, N., Armengol,
L., Garcia-Closas, M., Kogevinas, M., Villa, O., Hutchinson,
A., Earl, J., Marenne, G., et al. (2010). Mosaic uniparental
disomies and aneuploidies as large structural variants of the
human genome. Am. J. Hum. Genet. 87, 129–138.
20. Piotrowski, A., Bruder, C.E., Andersson, R., Diaz de Stahl, T.,
Menzel, U., Sandgren, J., Poplawski, A., von Tell, D., Crasto,
C., Bogdan, A., et al. (2008). Somatic mosaicism for copy
number variation in differentiated human tissues. Hum. Mu-
tat. 29, 1118–1124.
21. Bruder, C.E., Piotrowski, A., Gijsbers, A.A., Andersson, R.,
Erickson, S., Diaz de Stahl, T., Menzel, U., Sandgren, J.,
von Tell, D., Poplawski, A., et al. (2008). Phenotypically
concordant and discordant monozygotic twins display
different DNA copy-number-variation profiles. Am. J. Hum.
Genet. 82, 763–771.
22. Steemers, F.J., Chang, W., Lee, G., Barker, D.L., Shen, R., and
Gunderson, K.L. (2006). Whole-genome genotyping with
the single-base extension assay. Nat. Methods 3, 31–33.
23. Olshen, A.B., Venkatraman, E.S., Lucito, R., and Wigler, M.
(2004). Circular binary segmentation for the analysis of
array-based DNA copy number data. Biostatistics 5, 557–572.
24. Conlin, L.K., Thiel, B.D., Bonnemann, C.G., Medne, L., Ernst,
L.M., Zackai, E.H., Deardorff, M.A., Krantz, I.D., Hakonarson,
H., and Spinner, N.B. (2010). Mechanisms of mosaicism,
chimerism and uniparental disomy identified by single nucle-
otide polymorphism array analysis. Hum. Mol. Genet. 19,
1263–1275.
25. Razzaghian, H.R., Shahi, M.H., Forsberg, L.A., de Stahl, T.D.,
Absher, D., Dahl, N., Westerman, M.P., and Dumanski, J.P.
(2010). Somatic mosaicism for chromosome X and Y aneu-
ploidies in monozygotic twins heterozygous for sickle cell
disease mutation. Am. J. Med. Genet. A. 152A, 2595–2598.
26. R_Development_Core_Team. (2010). R: A language and envi-
ronment for statistical computing. In. (Vienna, Austria).
URL: http://www.R-project.org/
27. Workman, C., Jensen, L.J., Jarmer, H., Berka, R., Gautier, L.,
Nielser, H.B., Saxild, H.H., Nielsen, C., Brunak, S., and Knud-
sen, S. (2002). A new non-linear normalization method for
reducing variability in DNA microarray experiments. Genome
Biol. 3, research0048.
28. Gunnarsson, R., Staaf, J., Jansson, M., Ottesen, A.M., Gorans-
son, H., Liljedahl, U., Ralfkiaer, U., Mansouri, M., Buhl, A.M.,
Smedby, K.E., et al. (2008). Screening for copy-number alter-
ations and loss of heterozygosity in chronic lymphocytic
leukemia—a comparative study of four differently designed,
high resolution microarray platforms. Genes Chromosomes
Cancer 47, 697–711.
29. Gunnarsson, R., Isaksson, A., Mansouri, M., Goransson, H.,
Jansson, M., Cahill, N., Rasmussen, M., Staaf, J., Lundin, J.,
Norin, S., et al. (2010). Large but not small copy-number alter-
ations correlate to high-risk genomic aberrations and survival
in chronic lymphocytic leukemia: A high-resolution genomic
screening of newly diagnosed patients. Leukemia 24, 211–215.
30. Gonzalez, J.R., Rodríguez-Santiago, B., Caceres, A., Pique-Regi,
R., Rothman, N., Chanock, S.J., Armengol, L., and Perez-
Jurado, L.A. (2011). A fast and accurate method to detect
allelic genomic imbalances underlying mosaic rearrange-
ments using SNP array data. BMC Bioinformatics 12, 166.
31. Schunkert, H., Konig, I.R., Kathiresan, S., Reilly, M.P.,
Assimes, T.L., Holm, H., Preuss, M., Stewart, A.F., Barbalic,
M., Gieger, C., et al; Cardiogenics; CARDIoGRAM Consor-
tium. (2011). Large-scale association analysis identifies 13
new susceptibility loci for coronary artery disease. Nat. Genet.
43, 333–338.
32. Assimes, T.L., Knowles, J.W., Basu, A., Iribarren, C., Southwick,
A., Tang, H., Absher, D., Li, J., Fair, J.M., Rubin, G.D., et al.
(2008). Susceptibility locus for clinical and subclinical coro-
nary artery disease at chromosome 9p21 in the multi-ethnic
ADVANCE study. Hum. Mol. Genet. 17, 2320–2328.
33. Hagenkord, J.M., Monzon, F.A., Kash, S.F., Lilleberg, S., Xie,
Q., and Kant, J.A. (2010). Array-based karyotyping for
prognostic assessment in chronic lymphocytic leukemia:
Performance comparison of Affymetrix 10K2.0, 250K Nsp,
and SNP6.0 arrays. J. Mol. Diagn. 12, 184–196.
34. Bernasconi, P., Boni, M., Cavigliano, P.M., Calatroni, S.,
Giardini, I., Rocca, B., Zappatore, R., Dambruoso, I., and Care-
sana, M. (2006). Clinical relevance of cytogenetics in myelo-
dysplastic syndromes. Ann. N Y Acad. Sci. 1089, 395–410.
35. Haase, D. (2008). Cytogenetic features in myelodysplastic
syndromes. Ann. Hematol. 87, 515–526.
36. Tiu, R.V., Gondek, L.P., O’Keefe, C.L., Elson, P., Huh, J.,
Mohamedali, A., Kulasekararaj, A., Advani, A.S., Paquette, R.,
List, A.F., et al. (2011). Prognostic impact of SNP array karyo-
typing in myelodysplastic syndromes and related myeloid
malignancies. Blood 117, 4552–4560.
37. Braun, T., de Botton, S., Taksin, A.L., Park, S., Beyne-Rauzy, O.,
Coiteux, V., Sapena, R., Lazareth, A., Leroux, G., Guenda, K.,
et al. (2011). Characteristics and outcome of myelodysplastic
syndromes (MDS) with isolated 20q deletion: A report on 62
cases. Leuk. Res. 35, 863–867.
38. Bejar, R., Levine, R., and Ebert, B.L. (2011). Unraveling the
molecular pathophysiology of myelodysplastic syndromes.
J. Clin. Oncol. 29, 504–515.
39. Conrad, D.F., Bird, C., Blackburne, B., Lindsay, S., Mamanova,
L., Lee, C., Turner, D.J., and Hurles, M.E. (2010). Mutation
spectrum revealed by breakpoint sequencing of human germ-
line CNVs. Nat. Genet. 42, 385–391.
40. Surveillance Epidemiology and End Results (SEER) Program.
Fast stats. Bethesda, MD, National Cancer Institute, NIH,
USA (2011) URL: http://seer.cancer.gov/faststats/
41. Mori, H., Colman, S.M., Xiao, Z., Ford, A.M., Healy, L.E.,
Donaldson, C., Hows, J.M., Navarrete, C., and Greaves, M.
(2002). Chromosome translocations and covert leukemic
clones are generated during normal fetal development. Proc.
Natl. Acad. Sci. USA 99, 8242–8247.
42. Naylor, K., Li, G., Vallejo, A.N., Lee, W.W., Koetz, K., Bryl, E.,
Witkowski, J., Fulbright, J., Weyand, C.M., and Goronzy, J.J.
(2005). The influence of age on T cell generation and TCR
diversity. J. Immunol. 174, 7446–7452.
43. Gibson, K.L., Wu, Y.C., Barnett, Y., Duggan, O., Vaughan, R.,
Kondeatis, E., Nilsson, B.O., Wikby, A., Kipling, D., and
Dunn-Walters, D.K. (2009). B-cell diversity decreases in old
age and is correlated with poor health status. Aging Cell 8,
18–25.
228 The American Journal of Human Genetics 90, 217–228, February 10, 2012
REPORT
Rare Mutations in XRCC2 Increase the Risk of Breast Cancer
D.J. Park,1,20 F. Lesueur,2,20 T. Nguyen-Dumont,1 M. Pertesi,2 F. Odefrey,1 F. Hammet,1 S.L. Neuhausen,3
E.M. John,4,5 I.L. Andrulis,6 M.B. Terry,7 M. Daly,8 S. Buys,9 F. Le Calvez-Kelm,2 A. Lonie,10 B.J. Pope,10
H. Tsimiklis,1 C. Voegele,2 F.M. Hilbers,11 N. Hoogerbrugge,12 A. Barroso,13 A. Osorio,13,14 the Breast Cancer Family Registry, the Kathleen Cuningham Foundation Consortium for Research into Familial Breast Cancer, G.G. Giles,15 P. Devilee,11,16 J. Benitez,13,14 J.L. Hopper,17 S.V. Tavtigian,18 D.E. Goldgar,19
and M.C. Southey1,*
An exome-sequencing study of families with multiple breast-cancer-affected individuals identified two families with XRCC2 mutations,
one with a protein-truncating mutation and one with a probably deleterious missense mutation. We performed a population-based
case-control mutation-screening study that identified six probably pathogenic coding variants in 1,308 cases with early-onset breast cancer
and no variants in 1,120 controls (the severity grading was p < 0.02). We also performed additional mutation screening in 689
multiple-case families. We identified ten breast-cancer-affected families with protein-truncating or probably deleterious rare missense variants in
XRCC2. Our identification of XRCC2 as a breast cancer susceptibility gene thus increases the proportion of breast cancers that are
associated with homologous recombination-DNA-repair dysfunction and Fanconi anemia and could therefore benefit from specific targeted
treatments such as PARP (poly ADP ribose polymerase) inhibitors. This study demonstrates the power of massively parallel sequencing
for discovering susceptibility genes for common, complex diseases.
Currently, only approximately 30% of the familial risk for
breast cancer has been explained, leaving the substantial
majority unaccounted for.1 Recently, exome sequencing
has been demonstrated to be a powerful tool for identi-
fying the underlying cause of rare Mendelian disorders.
However, diseases such as breast cancer present substan-
tially increased complexity in terms of locus, allelic and
phenotypic heterogeneity, and relationships between
genotype and phenotype.
As part of a collaborative (Leiden University Medical
Centre, the Spanish National Cancer Center, and The
University of Melbourne) project involving the exome
capture and massively parallel sequencing of multiple-case
breast-cancer-affected families, we applied whole-exome
sequencing to DNA from multiple affected relatives
from 13 families (family structure and sample availability
were considered before the affected relatives were chosen).
Bioinformatic analysis of the resulting exome sequences
identified a protein-truncating mutation, c.651_652del
(p.Cys217*), in X-ray repair cross complementing gene-2
(XRCC2 [MIM 600375; NM_005431.1]) in the peripheral-blood
DNA of a man participating in the Australian Breast
Cancer Family Registry2 (ABCFR; Figure 1A); this man (III-4
in Figure 1A) had been diagnosed with breast cancer at
29 years of age, and his mother (II-3), sister (III-5), and
cousin (III-1) had been diagnosed with breast cancer at
37, 41, and 34 years of age, respectively. The cousin
(III-1), who had also been selected for exome sequencing,
did not carry this mutation; Sanger sequencing of the
sister’s DNA showed that she carried the mutation; and no
DNA was available for testing of the mother. Exome
sequencing of three individuals from a family participating
in a Dutch research study of multiple-case breast-cancer-
affected families identified a probably deleterious missense
mutation (c.271C>T [p.Arg91Trp] in XRCC2) (Figure 2) in
two sisters (II-6 and II-8 in Figure 1B) diagnosed with breast
cancer at 40 and 48 years of age, respectively, but not in
their cousin (II-1), who was diagnosed at 47 years of age.
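The variant names above pair a coding-DNA position with a codon number, so the two can be cross-checked arithmetically. The following is a minimal sketch of that check, assuming 1-based coding positions counted from the A of the start codon; the helper names `codon_number` and `first_position` are hypothetical and are not part of the study's pipeline.

```python
import re


def codon_number(cdna_position: int) -> int:
    """Return the 1-based codon that contains a 1-based coding-DNA position."""
    if cdna_position < 1:
        raise ValueError("coding positions start at 1")
    return (cdna_position - 1) // 3 + 1


def first_position(hgvs_c: str) -> int:
    """Extract the first coding position from a simple name such as 'c.271C>T'."""
    match = re.match(r"c\.(\d+)", hgvs_c)
    if match is None:
        raise ValueError(f"unsupported variant name: {hgvs_c}")
    return int(match.group(1))


# Coding-DNA names and the codon numbers reported alongside them in the text.
reported = {
    "c.651_652del": 217,  # p.Cys217*
    "c.271C>T": 91,       # p.Arg91Trp
}

for name, codon in reported.items():
    # Each coding position should fall inside the codon named in the protein change.
    assert codon_number(first_position(name)) == codon
```

Position 271 is the first base of codon 91 and position 651 is the last base of codon 217, which agrees with the protein-level names given for both mutations.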
Genotyping of XRCC2 mutations c.651_652del
(p.Cys217*) and c.271C>T (p.Arg91Trp) in 1,344 cases
1 Genetic Epidemiology Laboratory, The University of Melbourne, Victoria 3010, Australia;
2 Genetic Cancer Susceptibility Group, International Agency for Research on Cancer, 69372 Lyon, France;
3 Department of Population Sciences, Beckman Research Institute of City of Hope, Duarte, CA 91010, USA;
4 Cancer Prevention Institute of California, Fremont, CA 94538, USA;
5 Department of Health Research and Policy, Stanford Cancer Center Institute, Stanford, CA 94305, USA;
6 Department of Molecular Genetics, Samuel Lunenfeld Research Institute, Mount Sinai Hospital, Toronto, ON M5G 1X5, Canada;
7 Department of Epidemiology, Mailman School of Public Health, Columbia University, New York, NY 10032, USA;
8 Fox Chase Cancer Center, Philadelphia, PA 19111, USA;
9 Huntsman Cancer Institute, University of Utah Health Sciences Center, Salt Lake City, UT 84112, USA;
10 Victorian Life Sciences Computation Initiative, Carlton, Victoria 3010, Australia;
11 Department of Human Genetics, Leiden University Medical Center, Leiden, 2300 RC Leiden, The Netherlands;
12 Department of Human Genetics, Radboud University Nijmegen Medical Center, 6525 GA Nijmegen, The Netherlands;
13 Human Genetics Group, Human Cancer Genetics Program, Spanish National Cancer Center, 28029 Madrid, Spain;
14 Spanish Network on Rare Diseases, 46010 Valencia, Spain;
15 Centre for Cancer Epidemiology, The Cancer Council Victoria, Carlton, Victoria 3052, Australia;
16 Department of Pathology, Leiden University Medical Center, Leiden, 2300 RC Leiden, The Netherlands;
17 Centre for Molecular, Environmental, Genetic, and Analytical Epidemiology, School of Population Health, The University of Melbourne, Victoria 3010, Australia;
18 Department of Oncological Sciences, Huntsman Cancer Institute, University of Utah School of Medicine, Salt Lake City, UT 84112, USA;
19 Department of Dermatology, University of Utah School of Medicine, Salt Lake City, UT 84132, USA
20 These authors contributed equally to this work
*Correspondence: [email protected]
DOI 10.1016/j.ajhg.2012.02.027. ©2012 by The American Society of Human Genetics. All rights reserved.
734 The American Journal of Human Genetics 90, 734–739, April 6, 2012
and 1,436 controls from the Melbourne Collaborative
Cohort Study3 (MCCS) and the ABCFR revealed one
control (II-2, Figure 1C) who carried c.651_652del
(p.Cys217*). Intriguingly, this control individual’s sister
(II-1) was diagnosed with breast cancer at 63 years of age,
and her mother (I-2) was diagnosed with melanoma at
69 years of age (Figure 1C, Tables 1 and 2).
XRCC2, a RAD51 paralog, was cloned because of its
ability to complement the DNA-damage sensitivity of the
irs1 hamster cell line.4 Cells derived from Xrcc2-knockout
mice exhibit profound genetic instability as a result of
homologous recombination (HR) deficiency.5 XRCC2 is
highly conserved, and most truncations of the protein
destroy its ability to protect cells from the effects of the
DNA cross-linking agent mitomycin C.6 The involvement
of the HR DNA repair genes BRCA1 (MIM 113705),
BRCA2 (MIM 600185), ATM (MIM 607585), CHEK2 (MIM
604373), BRIP1 (MIM 605882), PALB2 (MIM 610355),
and RAD51C (MIM 602774) in breast cancer risk empha-
sizes the importance of this mechanism in the etiology
of breast cancer.7–9 Biallelic mutations in three of these
genes are associated with Fanconi anemia (FA), and, most
interestingly, Shamseldin et al.10 have recently reported
a homozygous frameshift mutation in XRCC2 as being
associated with a previously unrecognized form of FA.
XRCC2 binds directly to the C-terminal portion of the
product of the breast cancer susceptibility pathway gene
RAD51 (MIM 179617), which is central to HR.6,11 XRCC2
also complexes in vivo with RAD51B (RAD51L1 [MIM
602948]), the product of the breast and ovarian cancer
susceptibility gene RAD51C9 and the product of the
ovarian cancer risk gene RAD51D (MIM 602954),12,13 and
localizes to sites of DNA damage.6 Cells deficient in
XRCC2 also show centrosome disruption, a key compo-
nent of mitotic-apparatus dysfunction, which is often
linked to the onset of mitotic catastrophe. XRCC2 is
important in preventing chromosome missegregation
leading to aneuploidy.14 Studies of common genetic varia-
tion in XRCC2 have reported some evidence of association
with breast cancer risk (e.g., rs3218408),15 subtle effects on
DNA-repair capacity,16 and poor survival associated with
rs3218536 (XRCC2, Arg188His).15
On the basis of the exome-sequencing results, the subse-
quent genotyping of the two probably pathogenic variants
Figure 1. Pedigrees of Families Found to Carry XRCC2 Mutations (Panels A–J)
Mutation status is indicated for all family members for whom a DNA sample was available. Cancer diagnosis and age of onset are indicated for affected members. Asterisks indicate that DNA underwent exome sequencing (libraries for 50 bp fragment reads were prepared according to the SOLiD Baylor protocol 2.1 and the Nimblegen exome-capture protocol v.1.2 with some variations). The following abbreviations are used: BC, breast cancer (black filled symbols); PC, pancreatic cancer; BwC, bowel cancer; UC, uterine cancer; MM, malignant melanoma; UK, unknown age; BlC, bladder cancer; OC, ovarian cancer; BCC, basal cell carcinoma; L, lung cancer (all gray-filled symbols); V, verified cancer (via cancer registry or pathology report); and wt, wild-type. Some symbols represent more than one person as indicated by a numeral.
in the MCCS and ABCFR, the rarity of these variants, and
the biochemical plausibility of XRCC2, we conducted two
further studies in parallel. The first study was case-control
mutation screening of XRCC2 (with high-resolution melt
[HRM] curve analysis followed by Sanger-sequencing
confirmation) in an additional series of 1,308 cases with
early-onset breast cancer and 1,120 frequency-matched
controls recruited through population-based sampling
by the Breast Cancer Family Registry2 (BCFR; Supplemental
Data, available online); the BCFR sampling was recently
carried out for the characterization of the breast cancer
risk associated with variants in ATM and CHEK2.17,18 The
second study was mutation screening of XRCC2 in a series
of index cases from multiple-case breast-cancer-affected
families and a series of male breast cancer cases.
The case-control mutation screening identified two cases
that carried protein-truncating variants in XRCC2: indi-
vidual III-2 had c.49C>T (p.Arg17*) (Figure 1F), and indi-
vidual II-1 had c.651_652del (p.Cys217*) (Figure 1G).
Five cases carried singleton missense substitutions ranging
from probably deleterious to relatively innocuous (accord-
ing to in silico prediction). One control carried a relatively
innocuous missense substitution (Table 2). In addition,
a case diagnosed with breast cancer at 32 years of age
carried a G>A substitution located one nucleotide prior
to the start codon.
We graded the rare missense variants by using three
computational tools: SIFT, PolyPhen-2.1, and Align-GVGD.
Differences in grading between these tools were
minor. Depending on which of the three computational
tools we used to grade the missense substitutions, the
statistical significances of the differences in the frequency
and severity distributions of protein-truncating variants
and rare missense substitutions between cases and controls
from the case-control mutation-screening study fell in the
range of p = 0.01–0.02 (adjusted for race, study center, and
age). There were six probably deleterious variants (pre-
dicted deleterious by at least two prediction algorithms)
in the cases and none in the controls, corresponding to
a p value by Fisher’s exact test of 0.02. All together, the
case-control mutation-screening data provide statistical
support for the hypothesis that rare, evolutionarily
unlikely sequence variation in XRCC2 is associated with
increased risk of breast cancer.
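The Fisher's exact test on the six probably deleterious variants can be reproduced from the counts given above (6 carriers among 1,308 cases, 0 among 1,120 controls). The sketch below computes a one-sided p value from the hypergeometric distribution using only the Python standard library. The text does not state whether the reported test was one- or two-sided, so the one-sided form is an assumption, and the function name `fisher_one_sided` is hypothetical.

```python
from math import comb


def fisher_one_sided(case_carriers: int, n_cases: int,
                     control_carriers: int, n_controls: int) -> float:
    """One-sided Fisher's exact p value.

    Probability, under the hypergeometric null with fixed margins, of seeing
    at least `case_carriers` of all carriers fall among the cases.
    """
    total = n_cases + n_controls
    carriers = case_carriers + control_carriers
    denominator = comb(total, carriers)
    most_extreme = min(carriers, n_cases)
    numerator = sum(
        comb(n_cases, k) * comb(n_controls, carriers - k)
        for k in range(case_carriers, most_extreme + 1)
    )
    return numerator / denominator


# Counts taken from the case-control mutation-screening study described above.
p_value = fisher_one_sided(case_carriers=6, n_cases=1308,
                           control_carriers=0, n_controls=1120)
print(f"one-sided p = {p_value:.3f}")
```

With these counts the one-sided value is about 0.024, in line with the reported 0.02. A two-sided calculation would also count the opposite extreme (all six carriers among controls) and would therefore be larger.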
Mutation screening (by Sanger sequencing) of XRCC2 in
the index cases of 689 multiple-case breast-cancer-affected
families participating in the BCFR and the Kathleen
Cuningham Foundation Consortium for Research into
Familial Breast Cancer19 (kConFab) plus 150 male breast
cancer cases participating in a US-based study of male
breast cancer (Beckman Research Institute of the City of
Hope20) and kConFab revealed three rare coding-sequence
alterations. We identified a second family (from the kCon-
Fab resource) with an index case who carried XRCC2
c.651_652del (p.Cys217*); this individual (II-5, Figure 1D)
also carried a truncating mutation in BRCA1 (c.70_80del
[p.Cys24Serfs*13]). We identified an ABCFR index case
(II-2, Figure 1E and Figure 2) who carried the previously
identified missense substitution, XRCC2 c.271C>T
(p.Arg91Trp). We also identified a male breast cancer case
who carried a relatively innocuous missense substitution,
c.283A>C (p.Ile95Leu).
In addition to the protein-truncating mutations and the
above-described missense variants, a number of missense,
silent, and intronic variants were also observed in
XRCC2, and common SNPs that were reported in public
databases such as dbSNP, HapMap, or the 1000 Genomes
Project were also identified. These included the common
coding SNP c.563G>A (p.Arg188His) (rs3218536), one
silent substitution, three 5′ UTR variants, five 3′ UTR variants,
and six intronic variants in the vicinity of exon-intron
boundaries. All these variants were predicted to be
neutral according to various in silico prediction tools
(Supplemental Data, Tables 1 and 2). For common SNPs
(>1% in controls), no difference in allele frequency was
observed between cases and controls in the BCFR series.
The genetic studies included in this report received ap-
proval from The University of Melbourne Human Research
Ethics Committee, the International Agency for Research
on Cancer institutional review board (IRB), and the local
IRBs of every center from which we report findings.
Of the six distinct rare variants predicted to severely
affect protein function and identified in our work, two were
truncating mutations, and four were missense changes.
Although most recognized pathogenic mutations in the
major breast cancer susceptibility genes are protein truncating,
there is evidence that missense mutations might be the more
prominent class in some more recently identified
Figure 2. XRCC2 Multiple-Sequence Alignment Centered on Position Arg91
Missense substitutions observed in this interval are given with the missense residue directly above the corresponding human reference sequence residue. The following abbreviations are used: Hsap, Homo sapiens; Mmul, Macaca mulatta; Mmus, Mus musculus; Cfam, Canis familiaris; Lafr, Loxodonta africana; Mdom, Monodelphis domestica; Oana, Ornithorhynchus anatinus; Ggal, Gallus gallus; Acar, Anolis carolinensis; Xtro, Xenopus tropicalis; Drer, Danio rerio; Bflo, Branchiostoma floridae; Spur, Strongylocentrotus purpuratus; Nvec, Nematostella vectensis; and Tadh, Trichoplax adhaerens. The alignment, or updated versions thereof, is available at the Align-GVGD website (see Web Resources).
breast cancer susceptibility genes. For example, in comprehensive
studies of ATM and CHEK2, the proportion of probably
deleterious or pathogenic rare sequence variants that
are missense changes is often over 50%. More relevantly,
estimates of breast cancer risk are higher for missense vari-
ants than they are for protein-truncating variants. This
has been observed through case-control mutation-
screening analyses of ATM and CHEK217,18 and through
a pedigree analysis21 of ATM; in these analyses, the breast
cancer risk associated with one specific missense mutation
approaches the average risk associated with pathogenic
BRCA2 mutations. A very recent analysis of PALB2 muta-
tions found no difference in the frequency of missense
mutations between two case groups (contralateral and
unilateral breast cancer cases),22 suggesting that the contri-
bution of missense mutations to breast cancer risk might
vary between susceptibility genes.
Our finding of XRCC2 as a breast cancer susceptibility
gene expands the proportion of breast cancer that is associ-
ated with rare mutations in the HR-DNA-repair pathways
and the number of breast cancer susceptibility genes in
which biallelic mutations are associated with FA; the precise
contribution of mutation in these genes will become clearer
as more whole-exome-sequencing (or whole-genome-
sequencing) and targeted-pathway-sequencing studies are
performed. XRCC2 mutations appear to be very rare, even
in the context of multiple-case families; they appear in 1
of 66 (1.5%) early-onset female breast cancer cases with
a strong family history of the disease present in the ABCFR,
compared to 9 (14%) BRCA1 mutations, 6 (9%) BRCA2
mutations, 3 (5%) TP53 (MIM 191170) mutations, and 2
(3%) PALB2 mutations.
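The percentages quoted for the ABCFR cases with a strong family history can be checked with simple arithmetic. The sketch below assumes that every count shares the denominator of 66 cases; the text states this explicitly only for XRCC2, so the shared denominator is an inference from the quoted percentages. The names `FAMILY_HISTORY_CASES`, `carrier_counts` and `percent` are hypothetical.

```python
# Early-onset female breast cancer cases with a strong family history (ABCFR).
FAMILY_HISTORY_CASES = 66

# Mutation-carrier counts as quoted in the text.
carrier_counts = {"BRCA1": 9, "BRCA2": 6, "TP53": 3, "PALB2": 2, "XRCC2": 1}


def percent(count: int, total: int) -> float:
    """Return count as a percentage of total."""
    return 100.0 * count / total


for gene, count in carrier_counts.items():
    share = percent(count, FAMILY_HISTORY_CASES)
    print(f"{gene}: {count}/{FAMILY_HISTORY_CASES} = {share:.1f}%")
```

Rounded, these values match the figures quoted above (14%, 9%, 5%, 3%, and 1.5%), which supports reading all five counts against the same 66 cases.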
These frequencies are consistent with data from both
breast cancer linkage studies that have suggested that no
single gene is likely to account for a large fraction of the re-
maining familial aggregation of breast cancer5 and reports
from recent candidate-gene sequencing studies that have
associated other members of the HR pathway with breast
cancer susceptibility.23,24 Although mutations in HR-
DNA-repair genes are rare, it is important to identify people
whose breast cancer is associated with HR-DNA-repair
dysfunction because they could benefit from specific tar-
geted treatments such as PARP inhibitors. Unaffected rela-
tives of people with a mutation in a HR-DNA-repair gene
could also be offered predictive testing and subsequent
clinical management and genetic counseling on the basis
of their mutation status. The identification of a family
with rare mutations in both XRCC2 and BRCA1 illustrates
the complexity of the underlying genetic architecture of
breast cancer susceptibility for some families and the chal-
lenges for personalized risk-prediction models that are
incorporating an increasing array of risk factors, which
include rare mutations in breast cancer susceptibility genes
and more common genetic variation. Currently, esti-
mating the relative importance of the XRCC2 mutation
to the breast cancer risk for members of this family is diffi-
cult because of the presence of a BRCA1 protein-truncating
mutation in the proband in addition to the XRCC2 muta-
tion. Many examples have been described of individuals
and families carrying deleterious mutations in more than
Table 1. Mutation Screening in Multiple-Case Breast Cancer Families

| Rare XRCC2 Variant | Effect on Protein | Align-GVGD [a] | SIFT [b] | PolyPhen-2.1 (HumDiv) | Case or Control | Pedigree (Study Source) | Age and Origin of Carrier |
|---|---|---|---|---|---|---|---|
| Truncating variants | | | | | | | |
| c.651_652del | p.Cys217* | – | – | – | case | Figure 1A (ABCFR) [e] | 29, white |
| c.651_652del | p.Cys217* | – | – | – | case [c] | Figure 1D (kConFab) | 36, white |
| c.651_652del | p.Cys217* | – | – | – | control | Figure 1C (MCCS) | 72, white |
| Missense substitutions | | | | | | | |
| c.271C>T | p.Arg91Trp | C65 | 0.00 | probably damaging | case | Figure 1B (Dutch) [e] | 40, white |
| c.271C>T | p.Arg91Trp | C65 | 0.00 | probably damaging | case [d] | Figure 1E (ABCFR) | 32, white |
| c.283A>C | p.Ile95Val | C0 | 0.34 | benign | case | – (kConFab) | 59, white |
| c.283A>G | p.Ile95Leu | C0 | 0.41 | benign | case | – (kConFab) | 70, white |
| c.283A>C | p.Ile95Val | C0 | 0.34 | benign | case | – (BRICOH) | 68, white |
| Silent substitution | | | | | | | |
| c.582G>T | p.Thr194Thr | – | – | – | case | – (kConFab) | 60, white |

The following abbreviations are used: ABCFR, Australian Breast Cancer Family Registry; kConFab, Kathleen Cuningham Foundation Consortium for Research into Familial Breast Cancer; MCCS, Melbourne Collaborative Cohort Study; and BRICOH, Beckman Research Institute of City of Hope.
[a] Protein multiple sequence alignment (PMSA) used for obtaining scores for Align-GVGD: from Human to Branchiostoma floridae (Bflo).
[b] PMSA used for obtaining scores for SIFT: from Human to Trichoplax (Tadh).
[c] This woman also carries BRCA1 c.70_80del (p.Cys24Serfs*13).
[d] This carrier of p.Arg91Trp was identified through both the ABCFR multiple-case family screening and the BCFR-IARC (Breast Cancer Family Registry-International Agency for Research on Cancer) case-control screening.
[e] Family included in the exome-sequencing phase.
one proven breast cancer susceptibility gene; one such
example is the co-observation of BRCA1, BRCA2, ATM,
and CHEK2 mutations.21,25
This study demonstrates the power of massively parallel
sequencing in the discovery of additional breast cancer
susceptibility genes when used with an appropriate study
design. Our approach could be applied to other common,
complex diseases with components of unexplained herita-
bility.
Supplemental Data
Supplemental Data include 6 tables and can be found with this
article online at http://www.cell.com/AJHG.
Acknowledgments
This work was supported by Cancer Council Victoria (grant
628774), the National Institutes of Health (R01CA155767 and
R01CA121245), the Australian National Health and Medical
Research Council (grant 466668), The University of Melbourne
(infrastructure award to J.L.H.), a Victorian Life Sciences Computa-
tion Initiative grant (VR00353) on its Peak Computing Facility at
the University of Melbourne, and an initiative of the Victorian
Government and Dutch Cancer Society (grant UL 2009-4388).
The research resources, including the Melbourne Collaborative
Cohort Study, the Australian Breast Cancer Family Study, the Breast
Cancer Family Registry, and the Kathleen Cuningham Foundation
Consortium for Research into Familial Breast Cancer, are further
acknowledged in the supplementary information. We wish to
thank Nivonirina Robinot and Geoffroy Durand for their technical
help during the case-control mutation screening at the Interna-
tional Agency for Research on Cancer, Georgia Chenevix-Trench
for her support of and contribution to the establishment of the
case-control mutation-screening study, and Greg Wilhoite for
sequencing the male breast cancer cases at the Beckman Research
Institute of City of Hope. This work and partial support for S.L.N.
was provided by the Morris and Horowitz Families Endowment.
Work at the Spanish National Cancer Center was partially funded
by the Spanish Association Against Cancer and Health Ministry
(FIS08/1120). M.C.S. is a National Health and Medical Research
Council (NHMRC) Senior Research Fellow and a Victorian Breast
Cancer Research Consortium (VBCRC) Group Leader. J.L.H. is
a NHMRC Australia Fellow and a VBCRC Group Leader. T.N.-D. is
a Susan G. Komen for the Cure Postdoctoral Fellow.
Received: November 20, 2011
Revised: January 16, 2012
Accepted: February 29, 2012
Published online: March 29, 2012
Web Resources
The URLs for data presented herein are as follows:
Align-GVGD, http://agvgd.iarc.fr/alignments
GATK v.1.0.4418, http://gatk.sourceforge.net/
Genome Viewer (IGV v.1.5.48), http://www.broadinstitute.org/
software/igv/
Online Mendelian Inheritance in Man (OMIM), http://www.
omim.org
Picard v.1.29, http://sourceforge.net/projects/picard/
PolyPhen2.1, http://genetics.bwh.harvard.edu./pph2/
SIFT, http://sift.jcvi.org/
SOLiD Baylor protocol 2.1, http://www.hgsc.bcm.tmc.edu/
documents/Preparation_of_SOLiD_Capture_Libraries.pdf
UCSC Genome Browser, http://genome.ucsc.edu/cgi-bin/
hgGateway
Table 2. Case-Control Mutation Screening Applied to the BCFR Population-Based Study

| Rare XRCC2 Variant | Effect on Protein | Align-GVGD [a] | SIFT [b] | PolyPhen-2.1 (HumDiv) | Case (n = 1,308) or Control (n = 1,120) | Pedigree (BCFR) | Age and Origin of Carrier |
|---|---|---|---|---|---|---|---|
| Truncating variants | | | | | | | |
| c.49C>T | p.Arg17* | – | – | – | case | Figure 1F | 33, white |
| Missense substitutions | | | | | | | |
| c.46G>T | p.Ala16Ser | C0 | 0.24 | benign | case | – | 44, East Asian |
| c.181C>A | p.Leu61Ile | C0 | 0.00 | possibly damaging | case | Figure 1H | 30, East Asian |
| c.271C>T | p.Arg91Trp | C65 | 0.00 | probably damaging | case [c] | Figure 1E | 32, white |
| c.283A>G | p.Ile95Val | C0 | 0.34 | benign | control | – | 44, white |
| c.693G>T | p.Trp231Cys | C65 | 0.00 | probably damaging | case [d] | Figure 1I | 44, East Asian |
| c.808T>G | p.Phe270Val | C45 | 0.00 | probably damaging | case | Figure 1J | 38, African |
| Silent substitution | | | | | | | |
| c.354G>A | p.Val118Val | – | – | – | case [d] | – | 44, East Asian |
| 5′ UTR variants | | | | | | | |
| c.-1G>A | ? | – | – | – | case [e] | – | 32, white |

The following abbreviation is used: BCFR, Breast Cancer Family Registry.
[a] Protein multiple sequence alignment (PMSA) used for obtaining scores for Align-GVGD: from Human to Branchiostoma floridae (Bflo).
[b] PMSA used for obtaining scores for SIFT: from Human to Trichoplax (Tadh).
[c] This carrier of p.Arg91Trp was identified through both the ABCFR multiple-case family screening and the BCFR-IARC (Breast Cancer Family Registry-International Agency for Research on Cancer) case-control screening.
[d] This 44-year-old East Asian case carries p.Trp231Cys and p.Val118Val.
[e] This case is considered a "noncarrier" in the analysis.
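The report treats a missense variant as probably deleterious when at least two prediction algorithms call it deleterious. The sketch below applies a two-of-three vote to the missense rows of Table 2. The cutoffs are assumptions made for illustration and are not given in the report: an Align-GVGD grade of C15 or higher, a SIFT score of 0.05 or lower, and a PolyPhen call of "possibly damaging" or "probably damaging" each count as one vote. The class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Missense:
    protein: str
    gvgd_class: int  # numeric part of the Align-GVGD grade, e.g. 65 for C65
    sift: float
    polyphen: str
    status: str      # "case" or "control"


# Missense rows transcribed from Table 2.
TABLE_2_MISSENSE = [
    Missense("p.Ala16Ser", 0, 0.24, "benign", "case"),
    Missense("p.Leu61Ile", 0, 0.00, "possibly damaging", "case"),
    Missense("p.Arg91Trp", 65, 0.00, "probably damaging", "case"),
    Missense("p.Ile95Val", 0, 0.34, "benign", "control"),
    Missense("p.Trp231Cys", 65, 0.00, "probably damaging", "case"),
    Missense("p.Phe270Val", 45, 0.00, "probably damaging", "case"),
]

# Assumed cutoffs (not stated in the report).
GVGD_MIN_CLASS = 15
SIFT_MAX_SCORE = 0.05
POLYPHEN_DAMAGING = {"possibly damaging", "probably damaging"}


def deleterious_votes(variant: Missense) -> int:
    """Count how many of the three tools call the variant deleterious."""
    return sum([
        variant.gvgd_class >= GVGD_MIN_CLASS,
        variant.sift <= SIFT_MAX_SCORE,
        variant.polyphen in POLYPHEN_DAMAGING,
    ])


def probably_deleterious(variant: Missense) -> bool:
    """Apply the at-least-two-of-three rule."""
    return deleterious_votes(variant) >= 2


flagged = [v for v in TABLE_2_MISSENSE if probably_deleterious(v)]
for v in flagged:
    print(v.protein, v.status, deleterious_votes(v))
```

Under these assumed cutoffs, four missense substitutions are flagged, all in cases. Together with the two protein-truncating variants found in cases, that gives the six probably deleterious variants reported for the case-control study.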
References
1. Turnbull, C., and Rahman, N. (2008). Genetic predisposition
to breast cancer: Past, present, and future. Annu. Rev. Geno-
mics Hum. Genet. 9, 321–345.
2. John, E.M., Hopper, J.L., Beck, J.C., Knight, J.A., Neuhausen,
S.L., Senie, R.T., Ziogas, A., Andrulis, I.L., Anton-Culver, H.,
Boyd, N., et al; Breast Cancer Family Registry. (2004). The
Breast Cancer Family Registry: An infrastructure for coopera-
tive multinational, interdisciplinary and translational studies
of the genetic epidemiology of breast cancer. Breast Cancer
Res. 6, R375–R389.
3. Giles, G.G., and English, D.R. (2002). The Melbourne Collaborative
Cohort Study. IARC Sci Publ 156, 2.
4. Cartwright, R., Tambini, C.E., Simpson, P.J., and Thacker, J.
(1998). The XRCC2 DNA repair gene from human and mouse
encodes a novel member of the recA/RAD51 family. Nucleic
Acids Res. 26, 3084–3089.
5. Deans, B., Griffin, C.S., O’Regan, P., Jasin, M., and Thacker, J.
(2003). Homologous recombination deficiency leads to
profound genetic instability in cells derived from Xrcc2-
knockout mice. Cancer Res. 63, 8181–8187.
6. Tambini, C.E., Spink, K.G., Ross, C.J., Hill, M.A., and Thacker,
J. (2010). The importance of XRCC2 in RAD51-related DNA
damage repair. DNA Repair (Amst.) 9, 517–525.
7. Moynahan, M.E., Chiu, J.W., Koller, B.H., and Jasin, M. (1999).
Brca1 controls homology-directed DNA repair. Mol. Cell 4,
511–518.
8. Moynahan, M.E., Pierce, A.J., and Jasin, M. (2001). BRCA2 is
required for homology-directed repair of chromosomal breaks.
Mol. Cell 7, 263–272.
9. Meindl, A., Hellebrand, H., Wiek, C., Erven, V., Wappensch-
midt, B., Niederacher, D., Freund, M., Lichtner, P., Hartmann,
L., Schaal, H., et al. (2010). Germline mutations in breast and
ovarian cancer pedigrees establish RAD51C as a human cancer
susceptibility gene. Nat. Genet. 42, 410–414.
10. Shamseldin, H.E., Elfaki, M., and Alkuraya, F.S. (2012). Exome
sequencing reveals a novel Fanconi group defined by XRCC2
mutation. J. Med. Genet. 49, 184–186.
11. Gao, L.-B., Pan, X.-M., Li, L.-J., Liang, W.-B., Zhu, Y., Zhang,
L.-S., Wei, Y.-G., Tang, M., and Zhang, L. (2011). RAD51
135G/C polymorphism and breast cancer risk: A meta-analysis
from 21 studies. Breast Cancer Res. Treat. 125, 827–835.
12. Loveday, C., Turnbull, C., Ramsay, E., Hughes, D., Ruark, E.,
Frankum, J.R., Bowden, G., Kalmyrzaev, B., Warren-Perry,
M., Snape, K., et al; Breast Cancer Susceptibility Collaboration
(UK). (2011). Germline mutations in RAD51D confer
susceptibility to ovarian cancer. Nat. Genet. 43, 879–882.
13. Liu, N., Schild, D., Thelen, M.P., and Thompson, L.H. (2002).
Involvement of Rad51C in two distinct protein complexes
of Rad51 paralogs in human cells. Nucleic Acids Res. 30,
1009–1015.
14. Griffin, C.S., Simpson, P.J., Wilson, C.R., and Thacker, J.
(2000). Mammalian recombination-repair genes XRCC2 and
XRCC3 promote correct chromosome segregation. Nat. Cell
Biol. 2, 757–761.
15. Lin, W.-Y., Camp, N.J., Cannon-Albright, L.A., Allen-Brady, K.,
Balasubramanian, S., Reed, M.W.R., Hopper, J.L., Apicella, C.,
Giles, G.G., Southey, M.C., et al. (2011). A role for XRCC2
gene polymorphisms in breast cancer risk and survival. J.
Med. Genet. 48, 477–484.
16. Rafii, S., O’Regan, P., Xinarianos, G., Azmy, I., Stephenson, T.,
Reed, M., Meuth, M., Thacker, J., and Cox, A. (2002). A poten-
tial role for the XRCC2 R188H polymorphic site in DNA-
damage repair and breast cancer. Hum. Mol. Genet. 11,
1433–1438.
17. Le Calvez-Kelm, F., Lesueur, F., Damiola, F., Vallee, M.,
Voegele, C., Babikyan, D., Durand, G., Forey, N., McKay-
Chopin, S., Robinot, N., et al; Breast Cancer Family Registry.
(2011). Rare, evolutionarily unlikely missense substitutions
in CHEK2 contribute to breast cancer susceptibility: results
from a breast cancer family registry case-control mutation-
screening study. Breast Cancer Res. 13, R6.
18. Tavtigian, S.V., Oefner, P.J., Babikyan, D., Hartmann, A.,
Healey, S., Le Calvez-Kelm, F., Lesueur, F., Byrnes, G.B.,
Chuang, S.-C., Forey, N., et al; Australian Cancer Study; Breast
Cancer Family Registries (BCFR); Kathleen Cuningham
Foundation Consortium for Research into Familial Aspects
of Breast Cancer (kConFab). (2009). Rare, evolutionarily
unlikely missense substitutions in ATM confer increased risk
of breast cancer. Am. J. Hum. Genet. 85, 427–446.
19. Mann, G.J., Thorne, H., Balleine, R.L., Butow, P.N., Clarke,
C.L., Edkins, E., Evans, G.M., Fereday, S., Haan, E., Gattas,
M., et al; Kathleen Cuningham Consortium for Research in
Familial Breast Cancer. (2006). Analysis of cancer risk and
BRCA1 and BRCA2 mutation prevalence in the kConFab
familial breast cancer resource. Breast Cancer Res. 8, R12.
20. Ding, Y.C., Steele, L., Chu, L.-H., Kelley, K., Davis, H., John,
E.M., Tomlinson, G.E., and Neuhausen, S.L. (2011). Germline
mutations in PALB2 in African-American breast cancer cases.
Breast Cancer Res. Treat. 126, 227–230.
21. Goldgar, D.E., Healey, S., Dowty, J.G., Da Silva, L., Chen, X.,
Spurdle, A.B., Terry, M.B., Daly, M.J., Buys, S.M., Southey,
M.C., et al; BCFR; kConFab. (2011). Rare variants in the
ATM gene and risk of breast cancer. Breast Cancer Res. 13, R73.
22. Tischkowitz, M., Capanu, M., Sabbaghian, N., Li, L., Liang, X.,
Vallee, M.P., Tavtigian, S.V., Concannon, P., Foulkes, W.D.,
Bernstein, L., et al; The WECARE Study Collaborative Group.
(2012). Rare germline mutations in PALB2 and breast cancer
risk: A population-based study. Hum Mutat 33, 674–680.
23. Rahman, N., Seal, S., Thompson, D., Kelly, P., Renwick, A.,
Elliott, A., Reid, S., Spanova, K., Barfoot, R., Chagtai, T., et al;
Breast Cancer Susceptibility Collaboration (UK). (2007).
PALB2, which encodes a BRCA2-interacting protein, is a breast
cancer susceptibility gene. Nat. Genet. 39, 165–167.
24. Seal, S., Thompson, D., Renwick, A., Elliott, A., Kelly, P.,
Barfoot, R., Chagtai, T., Jayatilake, H., Ahmed, M., Spanova,
K., et al; Breast Cancer Susceptibility Collaboration (UK).
(2006). Truncating mutations in the Fanconi anemia J gene
BRIP1 are low-penetrance breast cancer susceptibility alleles.
Nat. Genet. 38, 1239–1241.
25. Turnbull, C., Seal, S., Renwick, A., Warren-Perry, M., Hughes,
D., Elliott, A., Pernet, D., Peock, S., Adlard, J.W., Barwell, J.,
et al; Breast Cancer Susceptibility Collaboration (UK),
EMBRACE. (2012). Gene-gene interactions in breast cancer
susceptibility. Hum. Mol. Genet. 21, 958–962.