The American Journal of Human Genetics - Best of 2011 & 2012


Foreword

We are pleased to introduce a new series of “Best of…” reprint collections from Cell Press,

which give us a chance to reflect on what has caught the attention of AJHG readers in late

2011 and early 2012. This collection includes a selection of eight of the most-accessed

research articles across a range of topics and the most highly accessed review article of

2012. To select the articles, we considered the number of requests for PDF and full-text

HTML versions of a given article. Half of the articles were published in the last six months of

2011 and half were published between January and June of 2012; in doing so, we are able

to capture the full spectrum of articles that have been published during the past 12 months.

We acknowledge that no single measurement can truly be indicative of “the best” research

papers over a given period of time. This is especially true when sufficient time has not

necessarily passed to allow one to fully appreciate the relative importance of a discovery.

That said, we think it is still informative to look back at the scientific community’s interests

in what has been published in AJHG over the past year.

In this collection, you will see a range of the exciting topics that have widely captured

the attention and enthusiasm of our readers, including genome-wide association studies,

evolutionary and population genetics, genetics of disease, and new approaches for

analyzing sequencing data.

We hope that you will enjoy reading this special collection and that you will visit

http://www.cell.com/AJHG/home to check out the latest findings that we have had the privilege to

publish. To stay on top of what your colleagues have been reading over the past 30 days,

check out http://www.cell.com/AJHG/top20. Also be sure to visit http://www.cell.com to

find other high quality papers published in the full collection of Cell Press journals.

Finally, we are grateful for the generosity of our sponsors, who helped make this reprint

collection possible.

For information about the Best of series, please contact:

Jonathan Christison

Program Director, Best of Cell Press

[email protected]

617-397-2893

Best of 2011 and 2012

Volume 89

Denisova Admixture and the First Modern Human Dispersals into Southeast Asia and Oceania
David Reich, Nick Patterson, Martin Kircher, Frederick Delfin, Madhusudan R. Nandineni, Irina Pugach, Albert Min-Shan Ko, Ying-Chin Ko, Timothy A. Jinam, Maude E. Phipps, Naruya Saitou, Andreas Wollstein, Manfred Kayser, Svante Pääbo, and Mark Stoneking

Rare-Variant Association Testing for Sequencing Data with the Sequence Kernel Association Test
Michael C. Wu, Seunggeun Lee, Tianxi Cai, Yun Li, Michael Boehnke, and Xihong Lin

Expansion of Intronic GGCCTG Hexanucleotide Repeat in NOP56 Causes SCA36, a Type of Spinocerebellar Ataxia Accompanied by Motor Neuron Involvement
Hatasu Kobayashi, Koji Abe, Tohru Matsuura, Yoshio Ikeda, Toshiaki Hitomi, Yuji Akechi, Toshiyuki Habu, Wanyang Liu, Hiroko Okuda, and Akio Koizumi

A Mutation in a Skin-Specific Isoform of SMARCAD1 Causes Autosomal-Dominant Adermatoglyphia
Janna Nousbeck, Bettina Burger, Dana Fuchs-Telem, Mor Pavlovsky, Shlomit Fenig, Ofer Sarig, Peter Itin, and Eli Sprecher

Volume 90

Five Years of GWAS Discovery
Peter M. Visscher, Matthew A. Brown, Mark I. McCarthy, and Jian Yang

Mitochondrial DNA and Y Chromosome Variation Provides Evidence for a Recent Common Ancestry between Native Americans and Indigenous Altaians
Matthew C. Dulik, Sergey I. Zhadanov, Ludmila P. Osipova, Ayken Askapuli, Lydia Gau, Omer Gokcumen, Samara Rubinstein, and Theodore G. Schurr

A ‘‘Copernican’’ Reassessment of the Human Mitochondrial DNA Tree from its Root
Doron M. Behar, Mannis van Oven, Saharon Rosset, Mait Metspalu, Eva-Liis Loogväli, Nuno M. Silva, Toomas Kivisild, Antonio Torroni, and Richard Villems

Age-Related Somatic Structural Changes in the Nuclear Genome of Human Blood Cells
Lars A. Forsberg, Chiara Rasi, Hamid R. Razzaghian, Geeta Pakalapati, Lindsay Waite, Krista Stanton Thilbeault, Anna Ronowicz, Nathan E. Wineinger, Hemant K. Tiwari, Dorret Boomsma, Maxwell P. Westerman, Jennifer R. Harris, Robert Lyle, Magnus Essand, Fredrik Eriksson, Themistocles L. Assimes, Carlos Iribarren, Eric Strachan, Terrance P. O’Hanlon, Lisa G. Rider, Frederick W. Miller, Vilmantas Giedraitis, Lars Lannfelt, Martin Ingelsson, Arkadiusz Piotrowski, Nancy L. Pedersen, Devin Absher, and Jan P. Dumanski

Rare Mutations in XRCC2 Increase the Risk of Breast Cancer
D.J. Park, F. Lesueur, T. Nguyen-Dumont, M. Pertesi, F. Odefrey, F. Hammet, S.L. Neuhausen, E.M. John, I.L. Andrulis, M.B. Terry, M. Daly, S. Buys, F. Le Calvez-Kelm, A. Lonie, B.J. Pope, H. Tsimiklis, C. Voegele, F.M. Hilbers, N. Hoogerbrugge, A. Barroso, A. Osorio, the Breast

On the cover: Whole-mount preparation of a mouse cochlea, immunolabeled with myosin VIIa in green, DAPI in blue, and phalloidin in red

to stain hair cells, nuclei, and actin, respectively. The background sequence is that of connexin 26, the most commonly mutated gene in deaf

individuals. Image courtesy of Shaked Shivatzki and Karen Avraham, Tel Aviv University, Tel Aviv, Israel. Support: grant R01 DC011835 from the

National Institute on Deafness and Other Communication Disorders, National Institutes of Health. This image was the winner of the 2012 ASHG

GenArt competition.

ARTICLE

Denisova Admixture and the First Modern Human Dispersals into Southeast Asia and Oceania

David Reich,1,2,* Nick Patterson,2 Martin Kircher,3 Frederick Delfin,3 Madhusudan R. Nandineni,3,4

Irina Pugach,3 Albert Min-Shan Ko,3 Ying-Chin Ko,5 Timothy A. Jinam,6 Maude E. Phipps,7

Naruya Saitou,6 Andreas Wollstein,8,9 Manfred Kayser,9 Svante Pääbo,3 and Mark Stoneking3,*

It has recently been shown that ancestors of New Guineans and Bougainville Islanders have inherited a proportion of their ancestry from

Denisovans, an archaic hominin group from Siberia. However, only a sparse sampling of populations from Southeast Asia and Oceania

were analyzed. Here, we quantify Denisova admixture in 33 additional populations from Asia and Oceania. Aboriginal Australians, Near

Oceanians, Polynesians, Fijians, east Indonesians, and Mamanwa (a ‘‘Negrito’’ group from the Philippines) have all inherited genetic

material from Denisovans, but mainland East Asians, western Indonesians, Jehai (a Negrito group from Malaysia), and Onge (a Negrito

group from the Andaman Islands) have not. These results indicate that Denisova gene flow occurred into the common ancestors of New

Guineans, Australians, and Mamanwa but not into the ancestors of the Jehai and Onge and suggest that relatives of present-day East

Asians were not in Southeast Asia when the Denisova gene flow occurred. Our finding that descendants of the earliest inhabitants of

Southeast Asia do not all harbor Denisova admixture is inconsistent with a history in which the Denisova interbreeding occurred in

mainland Asia and then spread over Southeast Asia, leading to all its earliest modern human inhabitants. Instead, the data can be

most parsimoniously explained if the Denisova gene flow occurred in Southeast Asia itself. Thus, archaic Denisovans must have lived

over an extraordinarily broad geographic and ecological range, from Siberia to tropical Asia.

Introduction

The history of the earliest arrival of modern humans in

Southeast Asia and Oceania from Africa remains contro-

versial. Archaeological evidence has been interpreted to

support either a single wave of settlement1 or, alternatively,

multiple waves of settlement, the first leading to the initial

peopling of Southeast Asia and Oceania via a southern route

and subsequent dispersals leading to the peopling of all of

East Asia.2 Mitochondrial DNA studies have been inter-

preted as supporting a single wave of migration via a

southern route,3–5 although other interpretations are

possible,6,7 and single-locus studies are unlikely to resolve

this issue.8 The largest genetic study of the region to date,

based on 73 populations genotyped at 55,000 SNPs,

concluded that the data were consistent with a single

wave of settlement of Asia that moved from south to north

and gave rise to all of the present-day inhabitants of the

region.9 However, another study of genome-wide SNP

data argued for two waves of settlement10 as did an analysis

of diversity in the bacterium Helicobacter pylori.11

The recent finding that Near Oceanians (New Guineans

and Bougainville Islanders) have received 4%–6% of their

genetic material from archaic Denisovans12 in principle

provides a powerful tool for understanding the earliest

human migrations to the region and thus for resolving

the question of the number of waves of settlement. The

Denisova genetic material in Southeast Asians should be

easily recognizable because it is very divergent from modern

human DNA. Thus, the presence or absence of Denisova

genetic material in particular populations should provide

an informative probe for the migration history of Southeast

Asia and Oceania, in addition to being interesting in its own

right. However, the populations previously analyzed for

signatures of Denisova admixture12 comprise a very thin

sampling of Southeast Asia and Oceania. In particular, no

groups from island Southeast Asia or Australia were

surveyed. Here, we report an analysis of genome-wide

data from an additional 33 populations from south Asia,

Southeast Asia, and Oceania; analyze the data for signatures

of Denisova admixture; and use the results to infer the

history of human migration(s) to this part of the world.

Material and Methods

SNP Array Data

We analyzed data for modern humans genotyped on Affymetrix

6.0 SNP arrays. We began by assembling previously published

data for YRI (Yoruba in Ibadan, Nigeria) West Africans, CHB

(Han Chinese in Beijing, China) Han Chinese and CEU (Utah resi-

dents with Northern and Western European ancestry from the

CEPH collection) European Americans from HapMap 3;13 Onge

Andaman ‘‘Negritos’’;14 and New Guinea highlanders, Fijians,

one Bornean population, and Polynesians from seven islands.10

1Department of Genetics, Harvard Medical School, Boston, MA 02115, USA; 2Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA; 3Department of Evolutionary Genetics, Max Planck Institute for Evolutionary Anthropology, Leipzig D-04103, Germany; 4Laboratory of DNA Fingerprinting,

Centre for DNA Fingerprinting and Diagnostics, Nampally, Hyderabad 500 001, India; 5Center of Excellence for Environmental Medicine,

Kaohsiung Medical University, Kaohsiung City 807, Taiwan; 6Division of Population Genetics, National Institute of Genetics, Yata 1111, Mishima, Shizuoka

411-8540, Japan; 7School of Medicine and Health Sciences, Monash University (Sunway Campus), Selangor 46150, Malaysia; 8Cologne Center

for Genomics, University of Cologne, Cologne D-50931, Germany; 9Department of Forensic Molecular Biology, Erasmus MC University Medical Center

Rotterdam, 3000 CA Rotterdam, The Netherlands

*Correspondence: [email protected] (D.R.), [email protected] (M.S.)

DOI 10.1016/j.ajhg.2011.09.005. ©2011 by The American Society of Human Genetics. All rights reserved.

516 The American Journal of Human Genetics 89, 516–528, October 7, 2011

We also assembled data including two aboriginal Australian populations:

one from the Northern Territories15 and one from a human

diversity cell line panel in the European Collection of Cell

Cultures. The data also include nine Indonesian populations:

four from the Nusa Tenggaras, two from the Moluccas, one from

Borneo, and two from Sumatra. Finally, the data include three

Malaysian populations (Temuan and Jehai [a Negrito group]

both from the Malay peninsula, and Bidayuh from Sarawak on

the island of Borneo), two Philippine populations (Manobo and

a Negrito group, the Mamanwa), six aboriginal Taiwanese popula-

tions, one Dravidian population from southern India, and San

Bushmen from southern Africa from the Centre d’Etude du

Polymorphisme Humain (CEPH)-Human Genome Diversity

Panel.16 All volunteers provided informed consent for research

into population history and the approval of appropriate local

ethical review boards was obtained. This project was approved

by the ethical review boards of the University of Leipzig Medical

Faculty and Harvard Medical School. The genotype data that we

analyzed for this study are available from the authors on request.

Merging Genotyping Data with Chimpanzee, Denisova, and Neandertal

We merged the SNP array data from modern humans with genome

sequence data from chimpanzee (CGSC 2.1/PanTro217), Denisova,12

and Neandertal.18 We eliminated A/T and C/G SNPs to minimize

strand misidentification. After removing SNPs with low genotyping

completeness, we had data for 353,143 autosomal SNPs.
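The reason for dropping A/T and C/G SNPs is that their two alleles are complements of each other, so the same genotype reads identically from either strand. A minimal sketch of such a filter follows. The record layout (SNP identifier, allele 1, allele 2) and the function names are assumptions made for illustration and are not taken from the study's own pipeline.

```python
# Complementary base pairs: a SNP whose two alleles are complements of each
# other (A/T or C/G) looks the same on both strands.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def is_strand_ambiguous(allele1: str, allele2: str) -> bool:
    """Return True for A/T and C/G SNPs, whose strand cannot be inferred."""
    return COMPLEMENT[allele1.upper()] == allele2.upper()


def drop_ambiguous(snps):
    """Keep only SNPs whose strand can be resolved from the alleles alone.

    `snps` is an iterable of (snp_id, allele1, allele2) tuples; this layout
    is hypothetical and chosen only for the example.
    """
    return [s for s in snps if not is_strand_ambiguous(s[1], s[2])]


# Toy input with made-up identifiers.
example = [("rs1", "A", "G"), ("rs2", "A", "T"), ("rs3", "C", "G"), ("rs4", "T", "C")]
kept = drop_ambiguous(example)
print([s[0] for s in kept])
```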

Removal of Outlier Samples

We carried out principal components analysis by using

EIGENSOFT.19 We removed samples that were visual outliers relative

to others from the same population on eigenvectors that

were statistically significant by using a Tracy-Widom statistic (p <

0.05),19 resulting in the removal of three YRI, two CHB, five Polynesians,

one New Guinea highlander, two Jehai, and three Mamanwa.

Sequencing Data

We prepared DNA sequencing libraries with 300 bp insert sizes from

a Papua New Guinea highlander (SH10) and Mamanwa Negrito

(ID36) individual by using a previously described protocol.12 The

two libraries were sequenced on an Illumina Genome Analyzer

IIx instrument with 2 × 101 + 7 cycles according to the manufacturer’s

instructions for multiplex sequencing (FC-104-400x v4

sequencing chemistry and PE-203-4001 cluster generation kit v4).

Bases and quality scores were generated with the Ibis base caller,20

and the reads were aligned with the Burrows-Wheeler Aligner

(BWA) software 21 to the human (NCBI 36/hg18) and chimpanzee

(CGSC 2.1/pantro2) genomes with default parameters. The result-

ing BAM files were filtered as follows: (1) a mapping quality of at

least 30 was required; (2) we removed duplicated reads with the

same outer coordinates; and (3) we removed reads with sequence

entropy < 1.0, calculated by summing −p·log2(p) for each of the

four nucleotides. The sequencing data are publicly available from

the European Nucleotide Archive (Project ID ERP000121), and

summary statistics are provided in Table S1, available online.
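The entropy criterion in filter (3) can be illustrated with a short sketch. It assumes that p is the fraction of each of the four nucleotides within a read and that other characters (such as N) are ignored. The function names are hypothetical, and this is not the authors' implementation.

```python
import math
from collections import Counter


def sequence_entropy(read: str) -> float:
    """Shannon entropy in bits over the A, C, G, T composition of a read.

    Sums -p * log2(p) over the nucleotides present; characters other than
    A, C, G, T are ignored (an assumption made for this sketch).
    """
    counts = Counter(base for base in read.upper() if base in "ACGT")
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def passes_entropy_filter(read: str, threshold: float = 1.0) -> bool:
    """Reads with entropy below the threshold are removed."""
    return sequence_entropy(read) >= threshold


for read in ("ACGTACGT", "AAAAAAAT", "AAAAAAAA"):
    print(read, round(sequence_entropy(read), 3), passes_entropy_filter(read))
```

A read with all four bases in equal proportion reaches the maximum of 2 bits, whereas a homopolymer has zero entropy and is discarded.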

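Estimating Denisova pD(X), Near Oceanian pN(X), and Australian pA(X) Ancestry

We define the frequency of one of the alleles at a SNP i in population X as $z^i_X$. We can then compute three statistics for a given population X that are informative about admixture:

$$
p_D(X) = \frac{\sum_{i=1}^{n} \left(z^i_{\mathrm{Outgroup}} - z^i_{\mathrm{Archaic}}\right)\left(z^i_{\mathrm{East\ Asian}} - z^i_{X}\right)}{\sum_{i=1}^{n} \left(z^i_{\mathrm{Outgroup}} - z^i_{\mathrm{Archaic}}\right)\left(z^i_{\mathrm{East\ Asian}} - z^i_{\mathrm{New\ Guinea}}\right)} = \frac{f_4(\mathrm{Outgroup}, \mathrm{Archaic}; \mathrm{East\ Asian}, X)}{f_4(\mathrm{Outgroup}, \mathrm{Archaic}; \mathrm{East\ Asian}, \mathrm{New\ Guinea})}
$$

(Equation 1)

$$
p_N(X) = 1 - \frac{\sum_{i=1}^{n} \left(z^i_{\mathrm{Outgroup}} - z^i_{\mathrm{Australia}}\right)\left(z^i_{X} - z^i_{\mathrm{New\ Guinea}}\right)}{\sum_{i=1}^{n} \left(z^i_{\mathrm{Outgroup}} - z^i_{\mathrm{Australia}}\right)\left(z^i_{\mathrm{East\ Asia}} - z^i_{\mathrm{New\ Guinea}}\right)} = 1 - \frac{f_4(\mathrm{Outgroup}, \mathrm{Australia}; X, \mathrm{New\ Guinea})}{f_4(\mathrm{Outgroup}, \mathrm{Australia}; \mathrm{East\ Asia}, \mathrm{New\ Guinea})}
$$

(Equation 2)

$$
p_A(X) = 1 - \frac{\sum_{i=1}^{n} \left(z^i_{\mathrm{Outgroup}} - z^i_{\mathrm{New\ Guinea}}\right)\left(z^i_{X} - z^i_{\mathrm{Australia}}\right)}{\sum_{i=1}^{n} \left(z^i_{\mathrm{Outgroup}} - z^i_{\mathrm{New\ Guinea}}\right)\left(z^i_{\mathrm{East\ Asia}} - z^i_{\mathrm{Australia}}\right)} = 1 - \frac{f_4(\mathrm{Outgroup}, \mathrm{New\ Guinea}; X, \mathrm{Australia})}{f_4(\mathrm{Outgroup}, \mathrm{New\ Guinea}; \mathrm{East\ Asia}, \mathrm{Australia})}
$$

(Equation 3)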

The right side of each equation shows that these statistics can also

be expressed as ratios of f4 statistics,14 which provide unbiased

estimates of admixture proportions even in the absence of popula-

tions that are closely related to the analyzed populations

(Appendix A). For the ancestry estimates reported in Table 1, we

use Outgroup = YRI (West Africans), Archaic = Denisova, and

East Asian = CHB (Han Chinese). Table S2 and Table S3 demonstrate

that consistent values are obtained when we replace these

choices with a variety of distantly related populations. Further

details are provided in Appendix A.
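Equations 1 and 2 can be sketched numerically as ratios of f4 statistics. In the sketch below, f4 is computed as a mean rather than a sum over SNPs, which leaves the ratios unchanged. The function names and the toy allele frequencies are assumptions for illustration only: the simulated values are not real data and are not constrained to lie between 0 and 1.

```python
import numpy as np


def f4(za, zb, zc, zd):
    """f4(A, B; C, D): average over SNPs of (zA - zB) * (zC - zD)."""
    za, zb, zc, zd = (np.asarray(v, dtype=float) for v in (za, zb, zc, zd))
    return float(np.mean((za - zb) * (zc - zd)))


def p_denisova(z_out, z_arch, z_ea, z_x, z_ng):
    """Equation 1: archaic ancestry in X as a fraction of that in New Guinea."""
    return f4(z_out, z_arch, z_ea, z_x) / f4(z_out, z_arch, z_ea, z_ng)


def p_near_oceanian(z_out, z_aus, z_ea, z_x, z_ng):
    """Equation 2: Near Oceanian ancestry proportion of X."""
    return 1.0 - f4(z_out, z_aus, z_x, z_ng) / f4(z_out, z_aus, z_ea, z_ng)


# Toy example: X is built as a 40/60 mixture of "New Guinea" and "East Asian".
rng = np.random.default_rng(0)
n = 1000
z_out, z_arch, z_ea, z_aus = (rng.uniform(0.05, 0.95, n) for _ in range(4))
z_ng = z_ea + 0.2 * (z_arch - z_out)   # shares drift with the archaic group
z_x = 0.4 * z_ng + 0.6 * z_ea          # 40% New Guinea-related ancestry

pd_x = p_denisova(z_out, z_arch, z_ea, z_x, z_ng)
pn_x = p_near_oceanian(z_out, z_aus, z_ea, z_x, z_ng)
print(round(pd_x, 6), round(pn_x, 6))
```

Because X is an exact linear mixture in this toy setup, both ratios recover the mixing proportion of 0.4. With real data the estimates carry sampling error, which is what the block jackknife described next is used to quantify.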

Block Jackknife Standard Error and Statistical Testing

We used a block jackknife22,23 to compute standard errors, dropping

each nonoverlapping five cM stretch of the genome in turn

and studying the variance of each statistic of interest to obtain

an approximately normally distributed standard error.12,18 To

test whether pD(X), pN(X), pA(X), and pD(X) − pN(X) are statistically

consistent with zero for any tested population X, we computed

the statistics along with a standard error from the block jackknife,

and then used a two-sided Z test that computes the number of

standard errors from zero. To implement the 4 Population Test14

for whether an unrooted phylogenetic tree ([A,B],[C,D]) relating

four populations is consistent with the data, we computed the

statistic f4(A,B;C,D) and assessed the number of standard errors

from zero.
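The procedure above can be sketched as a leave-one-block-out jackknife followed by a two-sided Z test. This is a simplified illustration under stated assumptions: blocks are weighted equally (a weighted jackknife that accounts for the number of SNPs per block is a common refinement), the assignment of SNPs to 5 cM blocks is supplied by the caller, and all function names are hypothetical.

```python
import math
import numpy as np


def block_jackknife(statistic, block_ids, *arrays):
    """Leave-one-block-out estimate and standard error of `statistic`.

    `statistic` takes the per-SNP arrays and returns a float; `block_ids`
    assigns each SNP to a genomic block (for example a 5 cM window).
    Blocks are weighted equally here, which is a simplification.
    """
    block_ids = np.asarray(block_ids)
    arrays = [np.asarray(a, dtype=float) for a in arrays]
    full = statistic(*arrays)
    blocks = np.unique(block_ids)
    g = len(blocks)
    # Recompute the statistic with each block removed in turn.
    loo = np.array(
        [statistic(*(a[block_ids != b] for a in arrays)) for b in blocks]
    )
    variance = (g - 1) / g * np.sum((loo - loo.mean()) ** 2)
    return full, math.sqrt(variance)


def two_sided_z_test(estimate, se):
    """Number of standard errors from zero and the two-sided normal p value."""
    z = estimate / se
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p


# Toy check: with one observation per block, the jackknife standard error of
# a mean equals the usual standard error of the mean.
values = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
est, se = block_jackknife(lambda a: float(a.mean()), np.arange(5), values)
z, p = two_sided_z_test(est, se)
print(est, se, z, p)
```

The same `block_jackknife` call can wrap a ratio of f4 statistics by passing a function of several allele-frequency arrays as `statistic`, since every array is subset by the same block mask.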

Results

Quantifying Denisova Admixture from Genome-wide

SNP Data

To investigate which modern humans have inherited

genetic material from Denisovans, we assembled SNP

data from 33 populations from mainland East Asia, island

Southeast Asia, New Guinea, Fiji, Polynesia, Australia, and

India, and genotyped all of them on Affymetrix 6.0 arrays.

After removing samples that were outliers with respect to


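Table 1. Estimates of Denisovan and Near Oceanian Ancestry from SNP Data

pD(X): Denisovan ancestry as % of New Guinea. pN(X): Near Oceanian ancestry. The final column gives the p value for the difference pN(X) − pD(X).

| Broad Grouping | Detailed | Code | N | pD(X) Estimated Ancestry | pD(X) Standard Error | pD(X) Z Score | pN(X) Estimated Ancestry | pN(X) Standard Error | pN(X) Z Score | p Value for Difference |
|---|---|---|---|---|---|---|---|---|---|---|
| New Guinea | Highlander | SH | 24 | 100% | 0% | n/a | 100% | 0% | n/a | n/a |
| Australian | all | | 10 | 103% | 6% | 17.1 | n/a | n/a | n/a | n/a |
| | Northern Territories | AU1 | 8 | 103% | 6% | 16.6 | n/a | n/a | n/a | n/a |
| | Cell Cultures | AU2 | 2 | 103% | 7% | 14.1 | n/a | n/a | n/a | n/a |
| Fiji | Fiji | FI | 25 | 56% | 3% | 17.7 | 58% | 1% | 94.6 | 0.38 |
| Nusa Tenggaras | all | | 10 | 40% | 3% | 12.8 | 38% | 1% | 54.7 | 0.34 |
| | Alor | AL | 2 | 51% | 6% | 8.3 | 49% | 1% | 35.6 | 0.69 |
| | Flores | FL | 1 | 40% | 8% | 5.0 | 37% | 2% | 19.8 | 0.68 |
| | Roti | RO | 4 | 27% | 4% | 6.4 | 27% | 1% | 29.4 | 0.85 |
| | Timor | TI | 3 | 50% | 5% | 9.8 | 45% | 1% | 41.7 | 0.29 |
| Philippines | all | | 27 | 28% | 3% | 8.2 | 6% | 1% | 10.6 | 3.4 × 10^-10 |
| | Mamanwa (N) | MA | 11 | 49% | 5% | 9.2 | 11% | 1% | 11.4 | 1.5 × 10^-12 |
| | Manobo | MN | 16 | 13% | 3% | 4.2 | 4% | 1% | 5.7 | 0.0018 |
| Moluccas | all | | 10 | 35% | 4% | 10.1 | 34% | 1% | 46.0 | 0.59 |
| | Hiri | HI | 7 | 35% | 4% | 9.0 | 32% | 1% | 38.4 | 0.36 |
| | Ternate | TE | 3 | 36% | 5% | 7.2 | 38% | 1% | 33.7 | 0.67 |
| Polynesia | all | PO | 19 | 20% | 4% | 5.1 | 27% | 1% | 34.8 | 0.052 |
| | Cook | | 2 | 16% | 6% | 2.5 | 24% | 1% | 17.3 | 0.21 |
| | Futuna | | 4 | 28% | 5% | 5.3 | 29% | 1% | 26.9 | 0.87 |
| | Niue | | 1 | 27% | 8% | 3.3 | 30% | 2% | 16.3 | 0.72 |
| | Samoa | | 5 | 13% | 5% | 2.6 | 24% | 1% | 23.3 | 0.024 |
| | Tokelau | | 2 | 22% | 6% | 3.5 | 31% | 1% | 23.8 | 0.14 |
| | Tonga | | 2 | 17% | 7% | 2.5 | 31% | 1% | 22.5 | 0.027 |
| | Tuvalu | | 3 | 21% | 6% | 3.6 | 28% | 1% | 22.8 | 0.28 |
| Andamanese | Onge (N) | AN | 10 | 10% | 6% | 1.6 | 3% | 1% | 1.8 | 0.27 |
| Taiwan | all | TA | 12 | 4% | 3% | 1.2 | 1% | 1% | 1.5 | 0.35 |
| | Puyuma | | 2 | 4% | 6% | 0.6 | 2% | 1% | 1.8 | 0.79 |
| | Rukai | | 2 | 0% | 6% | 0.0 | 2% | 1% | 1.6 | 0.74 |
| | Paiwan | | 2 | 5% | 6% | 0.8 | 3% | 1% | 2.2 | 0.67 |
| | Atayal | | 2 | −5% | 5% | −0.9 | 0% | 1% | 0.3 | 0.34 |
| | Bunun | | 2 | 12% | 6% | 2.1 | −2% | 1% | −1.6 | 0.01 |
| | Pingpu | | 2 | 7% | 6% | 1.2 | 1% | 1% | 1.1 | 0.30 |
| Malaysia | all | | 18 | 5% | 3% | 1.4 | 0% | 1% | −0.2 | 0.16 |
| | Jehai (N) | JE | 8 | 7% | 5% | 1.4 | 1% | 1% | 0.8 | 0.21 |
| | Temuan | TM | 10 | 3% | 4% | 0.8 | −1% | 1% | −0.9 | 0.32 |
| Sumatra | all | | 17 | 4% | 3% | 1.4 | 0% | 1% | 0.3 | 0.17 |
| | Besemah | BE | 8 | 5% | 3% | 1.5 | 1% | 1% | 0.9 | 0.20 |
| | Semende | SM | 9 | 3% | 4% | 0.9 | 0% | 1% | −0.3 | 0.31 |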


their own populations (reflecting admixture in the last few

generations or genotyping error), we had data from 243

individuals (Table 1). We restricted the analysis to auto-

somal SNPs with high genotyping completeness and with

data from the Denisova genome, leaving 353,143 SNPs.

To quantify the proportion of Denisova genes in each

population X, we computed a statistic pD(X), which

measures the proportion of Denisova genetic material in

a population as a fraction of that in New Guineans. Our

main analyses in Figure 1 and Table 1 compute pD(X) as

a ratio of two f4 statistics,14 each of which measures the

correlation in allele frequency differences between the

two populations used as outgroups (Yoruba and Denisova)

and two East or Southeast Asian populations (Han and X = tested population). If Han and X descend from a single

ancestral population without any subsequent admixture

Table 1. Continued

| Broad Grouping | Detailed | Code | N | pD(X) Estimated Ancestry | pD(X) Standard Error | pD(X) Z Score | pN(X) Estimated Ancestry | pN(X) Standard Error | pN(X) Z Score | p Value for Difference |
|---|---|---|---|---|---|---|---|---|---|---|
| Borneo | all | | 49 | 1% | 2% | 0.6 | 1% | 1% | 1.3 | 0.79 |
| | Bidayuh | BI | 10 | 6% | 4% | 1.7 | 1% | 1% | 1.4 | 0.80 |
| | Barito River | BO | 23 | 0% | 3% | 0.2 | 1% | 1% | 1.7 | 0.18 |
| | Land Dayak | DY | 16 | 0% | 3% | −0.1 | 0% | 1% | 0.2 | 0.94 |
| India | Dravidian | SI | 12 | −7% | 5% | −1.5 | n/a | n/a | n/a | n/a |

We provide each population’s estimated ancestry, the standard error in the estimate, and the Z score for deviation from zero (Z). Negrito populations are marked with (N). The New Guinea highlanders by definition have 100% Denisovan and 100% Near Oceanian ancestry because they are used as a reference population for computations. Results are not provided for Australians and Dravidians for whom the phylogenetic relationships do not allow the estimate (n/a). The last column reports the two-sided p value for a difference based on a block jackknife and a Z test.

Figure 1. Denisovan Genetic Material as a Fraction of that in New Guineans

[Map of sampling locations, including the Denisova site, labeled with population codes.]

Population codes: AL Alor; AN Andaman (Onge); AU Australian; BE Besemah; BG Bougainville; BI Bidayuh; BO Borneo; CA Cambodia; DA Dai; DR Daur; DY Dayak; FI Fiji; FL Flores; HA Han; HE Hezhen; HI Hiri; JA Japan; JE Jehai; LA Lahu; MA Mamanwa; MI Miao; MN Manobo; MO Mongola; NA Naxi; NG New Guinea; OR Oroqen; PO Polynesia; RO Roti; SE She; SH S. Highlands; SI Southern India; SM Semende; TA Taiwan; TE Ternate; TI Timor; TJ Tujia; TM Temuan; TU Tu; UY Uygur; XI Xibo; YI Yi.

Populations are only shown as having Denisova ancestry if the estimates are more than two standard errors from zero (we combine estimates for populations in this study with analogous estimates from CEPH-Human Genome Diversity Panel populations reported previously12). No population has an estimate of Denisova ancestry that is significantly more than that in New Guineans, and hence we at most plot 100%. The sampling location of the AU2 population is unknown and hence the position of this population is not precise.


from Denisova, then the allele frequency differences

between Han and X must have arisen solely since their

separation from their common ancestor, and the two

frequency differences should be uncorrelated; thus, the f4 statistic has an expected value of zero. However, if population

X inherited some of its ancestry from an archaic

population related to Denisovans, then the allele

frequency differences between Han and X will be corre-

lated, the higher the admixture from the archaic popula-

tion, the higher the correlation. Because the f4 statistic in

the numerator uses X as the test population, and the f4 statistic in the denominator uses New Guinea as the test

population, the ratio pD(X) estimates a quantity propor-

tional to the percentage of Denisova ancestry qX; that is,

the Denisova admixture fraction in X divided by that in

New Guinea, qX/qNew Guinea (Appendix A).

We computed pD(X) for a range of non-African popula-

tions and found that for mainland East Asians, western

Negritos (Jehai and Onge), or western Indonesians, pD(X)

is within two standard errors of zero when a standard error

is computed from a block jackknife (Table 1 and Figure 1).

Thus, there is no significant evidence of Denisova genetic

material in these populations. However, there is strong

evidence of Denisovan genetic material in Australians

(1.03 ± 0.06 times the New Guinean proportion; one standard

error), Fijians (0.56 ± 0.03), Nusa Tenggaras islanders

of southeastern Indonesia (0.40 ± 0.03), Moluccas

islanders of eastern Indonesia (0.35 ± 0.04), Polynesians

(0.20 ± 0.04), Philippine Mamanwa, who are classified

as a ‘‘Negrito’’ group (0.49 ± 0.05), and Philippine Manobo

(0.13 ± 0.03) (Table 1 and Figure 1). The New Guineans

and Australians are estimated to have indistinguishable

proportions of Denisovan ancestry (within the statistical

error), suggesting Denisova gene flow into the common

ancestors of Australians and New Guineans prior to their

entry into Sahul (Pleistocene New Guinea and Australia),

that is, at least 44,000 years ago.24,25 These results are

consistent with the Common Origin model of present-

day New Guineans and Australians.26,27 We further con-

firmed the consistency of the Common Origin model

with our data by testing for a correlation in the allele

frequency difference of two populations used as outgroups

(Yoruba and Han) and the two tested populations (New

Guinean and Australian). The f4 statistic that measures

their correlation is only |Z| = 0.8 standard errors from

zero, as expected if New Guineans and Australians descend

from a common ancestral population after they split from

East Asians, without any evidence of a closer relationship

of one group or the other to East Asians. Two alternative

histories, in which either New Guineans or Australians

have a common origin with East Asians, are inconsistent

with the data (both |Z| > 52).

To assess the robustness of these estimates of Denisova

admixture proportion, we recomputed pD(X) for diverse

choices of A (YRI, San, and chimpanzee), B (Denisova,

Neandertal, and chimpanzee), C (CHB and Borneo) and

X (17 different populations). For any population X, we

obtain consistent estimates of the archaic mixture propor-

tion, regardless of the choice of A, B, and C. Thus, the

method is robust to the choice of comparison populations,

suggesting that the underlying model of population rela-

tionships (Appendix A) provides a reasonable fit to the

data and that our pD(X) ancestry estimates are reliable.

For our main estimates of admixture proportion, we report

results for A = YRI, B = Denisova, and C = CHB because

Table S2 shows that the standard errors are smallest (in

part because of larger sample sizes).

To test whether our estimates of pD(X) are robust to ascer-

tainment bias—the complex ways that SNPs were chosen

for inclusion on genotyping arrays originally designed

for medical genetics studies—we also estimated Denisova

admixture by using sequencing data. For this purpose, we

generated new shotgun sequencing data from a Philippine

Mamanwa individual (~1×) and a New Guinea highlander

(~3×, from a different New Guinean group than the one

sampled in the Human Genome Diversity Panel16). We

merged these with data from Neandertal, Denisova, chim-

panzee, and 12 present-day humans analyzed as part of the

Neandertal and Denisova genome sequencing studies.12,18

We then computed the same pD(X) statistics for the se-

quencing as for the genotyping data, replacing YRI with

a Yoruba (HGDP00927), CHB with a Han (HGDP00778),

and New Guinea with a Papuan sample (Papuan2;

HGDP00551). Both the full sequence data and the SNP

data produce consistent estimates of pD(X) (Table 2), sug-

gesting that ascertainment bias is not influencing the

pD(X) estimates from genome-wide SNP data.

Near Oceanian Ancestry Explains Denisovan Genes

Outside of Australia and the Philippines

A parsimonious explanation for the Denisova genetic material

that we detect in the non-Australian populations is the

well-documented admixture that has occurred in many

Southeast Asian and Oceanian groups between (1) Near

Oceanian populations related to New Guineans and (2)

populations from island Southeast Asia related to mainland

East Asians, who are the primary populations of Taiwan

and Indonesia today.28–31 Thus, many groups might have

Denisova admixture as an indirect consequence of their

history of Near Oceanian admixture. For those populations

whose Denisova ancestry is explained in this way, their fraction

of Denisovan ancestry is predicted to be exactly

proportional to their fraction of Near Oceanian ancestry.

To test this hypothesis, we designed a second statistic,

pN(X), to estimate the fraction of a population’s Near Oceanian

ancestry, defined here as the proportion of its ancestry

inherited from a population that is more closely related to

New Guineans than to Australians (Appendix A). A virtue

of pN(X) is that it provides an unbiased estimate of a popula-

tion’s Near Oceanian ancestry proportion even without

access to close relatives of the ancestral populations

(Appendix A), whereas previous estimators10,30 depend

on the accuracy of the surrogate contemporary popula-

tions used to approximate the ancestral populations. We


compared pD(X) and pN(X) for all relevant populations

(Table 1, Figure 2, and Figure S1) and found that, allowing

for sampling error, they occur in a one-to-one ratio for the

populations from the Nusa Tenggaras, Moluccas, Polynesia,

and Fiji. Common ancestry with Near Oceania thus can

account for the Denisova genetic material in these groups.

A striking exception is observed in the two Philippine

populations, neither of which conforms to this relationship:

pD(Mamanwa) = 0.49 ± 0.05 versus pN(Mamanwa) = 0.11 ± 0.01 (p = 1.5 × 10^−12 for the difference) and

pD(Manobo) = 0.13 ± 0.03 versus pN(Manobo) = 0.04 ± 0.01

(p = 0.0018) (Figure 2). An alternative hypothesis

that could account for the Denisovan genetic material in

the Philippines is common ancestry with Australians.32,33

We thus computed a third statistic, pA(X), that estimates

the relative proportion of Australian ancestry (Appendix

A). However, Australian ancestry cannot explain these

patterns either: pD(Mamanwa) = 0.49 ± 0.05 versus

pA(Mamanwa) = 0.13 ± 0.01 and pD(Manobo) = 0.13 ±

0.03 versus pA(Manobo) = 0.05 ± 0.01. The estimates of

pN(X) and pA(X) are consistent for a variety of outgroups

(Appendix A and Table S3). Thus, the Denisova genetic

material in Mamanwa, as well as the smaller proportion

in their Manobo neighbors, cannot be due to common

ancestry with Near Oceanians or Australians after the

two groups diverged from one another. In the following

section, we focus on the Mamanwa because they have

a higher proportion of Denisova genetic material and allow

us to study the pattern at a higher resolution.

Modeling Denisova Admixture and Population

History

To test whether the patterns observed in the Philippine

populations might reflect a history of Denisova gene flow

into a population that was ancestral to New Guineans,

Australians, and Mamanwa, followed by separation of

the Mamanwa first and then divergence of the New Guin-

eans from Australians, we fit f statistics summarizing the

allele frequency correlations among all possible sets of

populations to admixture graphs.14 Admixture graphs are

formal models of population relationships with the impor-

tant feature that simply by specifying a topology of popu-

lation relationships, admixture proportions, and genetic

drift values on each lineage, they produce precise predictions

of the values that will be observed at f4, f3, and f2

statistics (Appendix B). These predictions can then be

compared to the empirically observed values (with standard

Figure 2. Denisovan and Near Oceanian Ancestry Are Proportional Except in the Philippines
We plot pD(X), the estimated percentage of Denisova ancestry as a fraction of that seen in New Guineans, against the estimated percentage of Near Oceanian ancestry pN(X) by using the values from Table 1 (horizontal and vertical bars specify ±1 standard errors). The Mamanwa deviate significantly from the pD(X) = pN(X) line, indicating that their Denisova genetic material does not owe its origin to gene flow from a population related to Near Oceanians. A weaker deviation is seen in the Manobo, who live near the Mamanwa on the island of Mindanao.

Table 2. Denisovan Admixture pD(X) Estimated from Sequencing versus Genotyping Data

| Sample | HGDP ID for Sequence Data | Sequencing: Estimated Ancestry | Sequencing: Standard Error in the Estimate | Sequencing: Z Score | Genotyping: Estimated Ancestry | Genotyping: Standard Error in the Estimate | Genotyping: Z Score |
|---|---|---|---|---|---|---|---|
| Papuan | HGDP00542 | 105% | 9% | 11.8 | 100% | n/a | n/a |
| New Guinea Highlander | | 104% | 9% | 11.7 | 100% | n/a | n/a |
| Bougainville | HGDP00491 | 83% | 10% | 8.3 | 82% | 5% | 15.9 |
| Mamanwa | | 28% | 10% | 2.9 | 49% | 5% | 9.2 |
| Cambodian | HGDP00711 | 19% | 9% | 2.0 | −3% | 3% | −0.8 |
| Karitiana | HGDP00998 | 9% | 12% | 0.7 | 4% | 6% | 0.7 |
| Mongolian | HGDP01224 | −6% | 12% | −0.5 | 3% | 3% | 1.1 |

For the sequencing data, we present the ratio f4(Yoruba, Denisova; Han, X)/f4(Yoruba, Denisova; Han, Papuan2), estimating the proportion of Denisova ancestry in a population X as a fraction of that in the Papuan2 sample (for the first line, the Papuan sample in the numerator is Papuan1 HGDP000551). For the genotyping data, we present the ratio f4(YRI, Denisova; CHB, X)/f4(YRI, Denisova; CHB, Papuan). No standard errors are given for the genotyping-based estimates in the first two rows because the Papuans and New Guineans are the reference populations, and so by definition those fractions are 100%.


errors from a block jackknife) to assess the fit to the data.14

The best-fitting admixture graph for seven populations

(Neandertal, Denisova, Yoruba, Han Chinese, Mamanwa,

Australians, and New Guineans) specifies Denisova gene

flow into a population ancestral to New Guineans, Australians,

and Mamanwa, followed by the splitting of the ancestors

of the Mamanwa and much more recent admixture

between them and populations related to East Eurasians

(Figure 3 and Figure S2). For this model, the admixture graph

predicts the values of 91 allele frequency correlation statis-

tics (f statistics) relating the seven analyzed populations,

and only one f statistic has an observed value more than

three standard errors from the prediction (Appendix B).

Encouraged by the fit of the admixture graph to the data

from the seven populations, we extended the model to

include two additional populations—Andaman Islanders

(Onge) and Negrito groups from Malaysia (Jehai)—both

of which have been hypothesized to descend from the

same migration that gave rise to Australians and New

Guineans4,5 (Figure 3 and Figure S3). This analysis provides

overwhelming support for common ancestry for the Onge

and Jehai: an admixture graph specifying such a history is

an excellent fit to the joint data in the sense that only one

of the 246 possible f statistics is more than three standard

errors from expectation (Appendix B). The analysis also

suggests that after their separation from the Onge, the Je-

hai received substantial admixture (about three-quarters

of their genome) from populations related to mainland

East Asians (Appendix B). In contrast, a model in which

the Onge have no recent East Asian admixture is a good

fit to the data, providing further evidence that the Onge

have been unadmixed (at least with non-South Asians8)

since their initial arrival in the region.14

A striking finding that emerges from the admixture

graph model fitting is the evidence of an episode of addi-

tional gene flow into Australian and New Guinean ances-

tors—after their ancestors separated from those of the Ma-

manwa—from a modern human population that did not

have Denisova genetic material. A model in which this

admixture accounts for half of the genetic material in

Australians and New Guineans is an excellent fit to the

data (Figure 3, Figures S2 and S3, and Appendix B). Admix-

ture graphs that do not model a second admixture event

are much poorer fits, producing 11 f statistics at |Z| > 3

standard errors from expectation (Appendix B). Our

analysis further suggests that the modern humans who

admixed with the ancestors of Australians and New Guin-

eans were closer to Andamanese and Malaysian Negritos

than to mainland East Asians (Figure 3), although this

is a weaker signal (1 f statistic with |Z| > 3 versus 3) (Figure S3).

This suggests that populations with Denisova

admixture could have been in proximity to the ancestors

of the Onge and Jehai during the earliest settlement of

the region but provides no evidence for ancestors of pres-

ent-day East Asians in the region at that time (Appendix B).

Thus, these findings suggest that the present-day East

Asian and Indonesian populations are primarily descended

from more recent migrations to the region.

Discussion

This study has shown that Southeast Asia was settled by

modern humans in multiple waves: One wave contributed

the ancestors of present-day Onge, Jehai, Mamanwa, New

Guineans, and Australians (some of whom admixed with

Denisovans), and a second wave contributed much of

the ancestry of present-day East Asians and Indonesians.

This scenario of human dispersals is broadly consistent

with the archaeologically-motivated hypothesis of an early

southern route migration leading to the colonization of

Sahul and East Asia2 but also further clarifies this scenario.

In particular, our data provide no evidence for multiple

dispersals of modern humans out of Africa, as all non-

Africans have statistically indistinguishable amounts of

Figure 3. A Model of Population Separation and Admixture that Fits the Data
[Admixture graph. Leaf populations: Yoruba, Neandertal, Denisova, Chinese, Jehai (N), Onge (N), Mamanwa (N), New Guinea, and Australian. Edge labels for admixture proportions: 1.3%/98.7%, 7%/93%, 51%/49%, 24%/76%, 24%/76%, and 27%/73%.]
The admixture graph suggests Denisova-related gene flow into a common ancestral population of Mamanwa, New Guineans, and Australians, followed by admixture of New Guinean and Australian ancestors with another population that did not experience Denisova gene flow. We cannot distinguish the order of population divergence of the ancestors of Chinese, Onge/Jehai, and Mamanwa/New Guineans/Australians, and hence show a trifurcation. Admixture proportion estimates (red) are potentially affected by ascertainment bias and hence should be viewed with caution. In addition, although admixture graphs are precise about the topology of population relationships, they are not informative regarding timing. Thus, the lengths of lineages should not be interpreted in terms of population split times and admixture events.


Neandertal genetic material.12,18 Instead, our data are

consistent with a single dispersal out of Africa (as proposed

in some versions of the early southern route hypothesis1)

from which there were multiple dispersals to South and

East Asia.

This study is also important in providing a clue about the

geographic location of the Denisova gene flow. Given the

high mobility of human populations, it is difficult to use

genetic data from present-day populations to infer the location

of past demographic events with high confidence.

Nevertheless, the fact that Denisova genetic material is

present in eastern Southeast Asians and Oceanians (Ma-

manwa, Australians, and New Guineans), but not in the

west (Onge and Jehai) or northwest (the Eurasian conti-

nent) suggests that interbreeding might have occurred in

Southeast Asia itself. Further evidence for a Southeast Asian

location comes from our evidence of ancient gene flow from

relatives of the Onge and Jehai into the common ancestors

of Australians and New Guineans after the initial Denisova

gene flow (Figure 3); this suggests that ancestors of both of

these groups (but not of East Asians) were present in the

region at the time. Although some of the observed patterns

could alternatively be explained by a history in which there

was initially some Denisova genetic material throughout

Southeast Asia—which was subsequently displaced by

major migrations of people related to present-day East

Asians—such a history cannot parsimoniously explain the

absence of Denisova genetic material in the Onge and Jehai.

Our evidence of a Southeast Asian location for the Deniso-

van admixture thus suggests that Denisovans were spread

across a wider ecological and geographic region—from the

deciduous forests of Siberia to the tropics—than any other

hominin with the exception of modern humans.

Finally, this study is methodologically important in

showing that there is much to learn about the relation-

ships among modern humans by analyzing patterns of

genetic material contributed by archaic humans. Because

the archaic genetic material is highly divergent, it is easily

detected in a modern human even if it contributes only a

small proportion of the ancestry; this makes it possible to

use archaic genetic material to study subtle and ancient

gene flow much as a medical imaging dye injected into a

patient allows the tracing of blood vessels. A priority for

future research should be to obtain direct estimates for

the dates of the Denisova and Neandertal gene flow, as

these will provide a better understanding of the interac-

tions among Denisovans, Neandertals, and the ancestors

of various present-day human populations.

Appendix A: Statistics Used for Estimating

Admixture Proportions

pD(X) Statistic Used for Estimating Denisova

Admixture Proportion

We first discuss the pD(X) statistic that we use for esti-

mating the Denisova admixture proportion in any popula-

tion X. Define the frequency of allele i in a sample from

population Y as z^i_Y. Then pD(X) is defined as in Equation 1.

The rightmost part of Equation 1 shows that pD(X) can

also be expressed as a ratio of f4 statistics, which we intro-

duced previously14 to measure the correlation in allele

frequency differences between pairs of populations. We

previously reported simulations showing that the expected

values of f4 statistics are in practice robust to ascertainment

bias (how the polymorphisms are chosen for inclusion in

an analysis), making them useful for learning about

history with SNP array data.14
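Equation 1 is not legible in this copy of the text. A reconstruction from the surrounding definitions and from the footnote to Table 2 is sketched below; the summation index i over SNPs and the subscript NG (for New Guinea) are notational assumptions made here, not the authors' typesetting.

```latex
\begin{equation*}
p_D(X) \;=\;
\frac{\sum_i \left(z^i_A - z^i_B\right)\left(z^i_C - z^i_X\right)}
     {\sum_i \left(z^i_A - z^i_B\right)\left(z^i_C - z^i_{\mathrm{NG}}\right)}
\;=\;
\frac{f_4(A, B;\, C, X)}{f_4(A, B;\, C, \mathrm{NG})}
\tag{1}
\end{equation*}
```

With A = YRI, B = Denisova, and C = CHB, this is the ratio f4(YRI, Denisova; CHB, X)/f4(YRI, Denisova; CHB, Papuan) quoted for the genotyping data in Table 2.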

The expected values of f4 statistics can be understood

visually by following the arrows through the phylogenetic

trees with admixture relating sets of samples, assuming

that these are accurate models for the relationships among

the populations.14 Figure 4 illustrates how the ratio of f4 statistics computed in Equation 1 estimates an admixture

proportion. Both the numerator and denominator can be

viewed as a correlation of two allele frequency differences:

z^i_A − z^i_B is the allele frequency difference between an outgroup “A” that did not experience admixture and an archaic group “B” hypothesized to be related to the admixing group (e.g., A = {chimpanzee, Yoruba, or San} and B = {Denisova or Neandertal}). This follows the blue arrows in Figure 4.

z^i_C − z^i_X is the allele frequency difference between a modern non-African population “C” and a test population “X” (e.g., C = {Chinese or Bornean}). This follows the red arrows in Figure 4.

If populations C and X are sister groups that descend from

a homogeneous non-African ancestral population, then the

allele frequency differences are expected to have arisen

entirely since the split from that common ancestral population,

and thus the correlation to A and B is expected to be

zero (no overlap of the arrows). In contrast, if population

X has inherited some proportion qX of its lineages from an

archaic population, then the expected value of the product

of the frequency differences is proportional to qX times

the overlap of the paths of A and B and C and X in Figure 4,

which corresponds to genetic drift a + b. While we do

not know the value of a + b, when we take the ratio of

the numerator and denominator to compute the pD(X)

statistic, this unknown quantity cancels, and we obtain

qX/qNew Guinea, the proportion of archaic ancestry in a popu-

lation as a fraction of that in New Guineans (Figure 4).
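The cancellation argument above can be checked numerically on toy data. The following is a minimal sketch, not the authors' software: the function names `f4` and `f4_ratio` and all allele frequencies are hypothetical, and the admixed population is built as a simple SNP-by-SNP mixture so that the expected ratio is known in advance.

```python
from typing import Sequence


def f4(a: Sequence[float], b: Sequence[float],
       c: Sequence[float], d: Sequence[float]) -> float:
    """Average over SNPs of (a_i - b_i) * (c_i - d_i)."""
    if not (len(a) == len(b) == len(c) == len(d)) or len(a) == 0:
        raise ValueError("need four equally long, non-empty frequency lists")
    return sum((ai - bi) * (ci - di)
               for ai, bi, ci, di in zip(a, b, c, d)) / len(a)


def f4_ratio(a: Sequence[float], b: Sequence[float], c: Sequence[float],
             x: Sequence[float], ref: Sequence[float]) -> float:
    """f4(A, B; C, X) / f4(A, B; C, ref), the form used for pD(X)."""
    return f4(a, b, c, x) / f4(a, b, c, ref)


# Hypothetical allele frequencies at four SNPs (illustration only).
yri = [0.1, 0.5, 0.9, 0.3]
denisova = [0.9, 0.5, 0.1, 1.0]
chb = [0.2, 0.4, 0.8, 0.3]
new_guinea = [0.4, 0.4, 0.6, 0.5]

# A toy population with 30% of its ancestry from the New Guinea-like
# source and 70% from the CHB-like source, mixed SNP by SNP.
mixed = [0.7 * c + 0.3 * n for c, n in zip(chb, new_guinea)]

ratio_mixed = f4_ratio(yri, denisova, chb, mixed, new_guinea)
ratio_chb = f4_ratio(yri, denisova, chb, chb, new_guinea)
ratio_ng = f4_ratio(yri, denisova, chb, new_guinea, new_guinea)
```

Because C − X equals 0.3 × (C − New Guinea) at every SNP in this toy construction, `ratio_mixed` is 0.3 up to floating-point rounding, while the CHB-like and New Guinea-like populations give 0 and 1. Real data add sampling noise and genetic drift, which is why the estimates in the text carry standard errors.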

Two issues merit further discussion. First, Figure 4 is an

oversimplification in that it does not show two archaic

gene-flow events (corresponding to Denisovans and Nean-

dertals). However, we have previously reported that the

data are consistent with the same amount of Neandertal

gene flow into the ancestors of East Asians (C, such as

CHB) and populations with Denisovan ancestry (X).12,18

As a result, the same genetic drift terms are added to the

numerator and denominator, which then cancel in the

ratio pD(X) so that they do not affect results. Second,

pD(X) is expected to provide an unbiased estimate of the

admixture proportion even if the genetic drift on various


lineages has been large. This contrasts with previous

methods for estimating admixture, which have required

accurate proxies for the ancestral populations.10

pN(X) and pA(X) Statistics for Estimating Near

Oceanian and Denisova Admixture

We next discuss the statistics that we use for estimating the

New Guinean pN(X) or Australian pA(X) mixture proportion

in any East Eurasian or island Southeast Asian population

X, which are defined in Equations 2 and 3, respectively.

Figure 5 shows the admixture graph corresponding to

the computation of pN(X). Both the numerator and the

denominator are of the form f4(A, Australia; X, New Guinea). The first term measures the correlation in allele frequency differences between (A − Australia) and (X − New Guinea). If X and New Guinea descended from a common ancestral population since the split from Australians, then they are perfect sister groups, and the expected value of f4 is zero (the sample is consistent with 100% Near Oceanian ancestry). On the other hand, if X has a proportion (1 − qX) of non-Near Oceanian ancestry, then the two terms will have a nonzero correlation, which as shown in Figure 5 is proportional to the genetic drift shared between the two population comparisons and has an expected value of (1 − qX)[(1 − pX)b + g] (the proportions of ancestry flowing along various genetic drift paths times the genetic drift on each of these lineages, indicated by the overlap of the red and blue arrows). When we take one minus the ratio pN(X) = 1 − f4(A, Australia; X, New Guinea)/f4(A, Australia; CHB, New Guinea), the complicated term on the right side of this expectation cancels, and we obtain E[pN(X)] = qX. As with Figure 4, we do not show the independent Neandertal admixture because the effect of this term is to cancel from the numerator and denominator.

In Table S3 we report the pN(X) estimates for diverse

choices of outgroup populations A (Yoruba, San, and chim-

panzee) and E (China and Borneo). The estimates are con-

sistent whatever the choice of A and E, suggesting that our

inferences are robust. (We do not report pN(X) estimates in

Table S3 for the Australians because this population is not

expected to conform to the population relationships

shown in Figure 5; indeed, the pN(X) estimates for Austra-

lians, when we do compute them, are significantly greater

than 1.) Further evidence for the usefulness of the pN(X)

estimates comes from the fact that it is consistent with

the pD(X) estimate for nearly all the populations in Table

1 (except for the Philippine populations, in which the De-

nisova ancestry does not appear to be explainable by Near

Oceanian gene flow as described in the main text).

We also computed a statistic pA(X) that is identical to

pN(X) except for the transpositions of the positions of Aus-

tralia and New Guinea in the statistics (Equations 2 and 3).

Once again, we obtain consistent inferences of pA(X) in

Table S3 regardless of the choice of outgroup populations.

Because New Guinea and Australia are sister groups, de-

scending from a common ancestral population, the justifi-

cations for the two statistics are very similar.
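Equations 2 and 3 are not legible in this copy of the text. A reconstruction from the verbal definitions in this appendix is sketched below. Writing E for the East Asian comparison population (CHB in the main analysis) follows the labeling used for the outgroup choices; the layout is an assumption made here, not the authors' typesetting.

```latex
\begin{align*}
p_N(X) &= 1 - \frac{f_4(A, \mathrm{Australia};\; X, \mathrm{New\ Guinea})}
                   {f_4(A, \mathrm{Australia};\; E, \mathrm{New\ Guinea})} \tag{2} \\[4pt]
p_A(X) &= 1 - \frac{f_4(A, \mathrm{New\ Guinea};\; X, \mathrm{Australia})}
                   {f_4(A, \mathrm{New\ Guinea};\; E, \mathrm{Australia})} \tag{3}
\end{align*}
```

Under the model of Figure 5, the numerator of Equation 2 has expectation (1 − qX)[(1 − pX)b + g] and the denominator, for which qX = 0, has expectation (1 − pX)b + g, so the ratio is 1 − qX and E[pN(X)] = qX.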

The only problem we found with the estimation of pN(X)

procedure is that when X is any non-African population

known to have West Eurasian ancestry (e.g., Europeans or

South Asians), we often obtained negative pN(X) statistics.

Two hypotheses could be consistent with this observation:

(1) In unpublished data, we have attempted to write down

a model of population separation and mixture analogous

Figure 4. Computation of the Estimate of Denisovan Ancestry pD(X)
The black lines show the model for how populations are related that is the basis for the pD(X) ancestry estimate. Population X arose from an admixture of a proportion (1 − qX) of ancestry from an ancestral non-African population C′ and (qX) from archaic population B′ (C and B are their unmixed descendants). The expected value of f4(A, B; C, X) is proportional to the correlation in the allele frequency differences A − B and C − X, and can be computed as the overlap in the drift paths separating A − B (blue arrows) and C − X (red arrows). These paths only overlap over the branches a and b, in proportion to the percentage qX of the lineages of population X that are of archaic ancestry, and so the expected value is qX(a + b). When we compute the ratio pD(X), (a + b) cancels from both the numerator and denominator, and we obtain qX/qNew Guinea, the fraction of archaic ancestry in a population X divided by that in New Guinea. This provides unbiased estimates of the mixture proportion even if populations C and B have experienced a large amount of genetic drift since splitting from their ancestors, that is, even if we do not have good surrogates for the ancestral populations. This robustness arises because the genetic drift on the branches B/B′ and C/C′ does not contribute to the expectations.


to that in Figure 3 that jointly fits the genetic data com-

paring eastern and western Eurasian populations and

have so far not succeeded in developing a model that passes

goodness-of-fit tests. This suggests that the population

relationships between eastern and western Eurasians might

be more complex than we have been able to model to date,

and therefore we cannot use them in the pN(X) computa-

tion. (2) An alternative possibility is that the negative

pN(X) statistics reflect an artifact of ascertainment bias on

SNP arrays. Ascertainment bias is likely to be particularly

complex with regard to the joint information from Euro-

peans and East Asians because these populations were

heavily used in choices of SNPs for medical genetics arrays.

Thus, it might be difficult to make inferences using populations

from both regions together with data from conven-

tional SNP arrays developed for medical genetic studies.

Whatever the explanation, we have some reason to

believe that estimates of Near Oceanian admixture by

using data from populations with West Eurasians might

be unreliable. Thus, we have excluded West Eurasians

from the estimates reported in Table 1.

Appendix B: Admixture Graphs

Overview of Admixture Graphs

A key finding from this study is that there is Denisova

genetic material in the Mamanwa, a Negrito group from

the Philippines, which cannot be explained by a history of

recent gene flow from relatives of New Guineans (Near Oceanians)

or Australians. To further understand this history,

we use the admixture graph methodology that we initially

developed for a study of Indian genetic variation14 to test

whether various hypotheses about population relationships

are consistent with the data. Specifically, we tested the

hypothesis of a single episode of Denisovan gene flow into

the ancestors of New Guineans, Australians, and Mamanwa,

prior to the separation of New Guineans and Australians.

Admixture graphs refer to generalizations of phyloge-

netic trees that incorporate the possibility of gene flow.

Like phylogenetic trees, admixture graphs describe the

topology of population relationships without specifying

the timing of events (such as population splits or gene-

flow events), or the details of population size changes on

different lineages. While this can be a disadvantage in

that fitting admixture graphs to data does not allow infer-

ences of these important details, it is also an advantage in

that one can fit genetic data to an admixture graph without

having to specify a demographic history. This allows for

inferences that are more robust to uncertainties about

important parameters of history. Once the topology of the

population relationships is inferred, one can in principle

use other methods to make inferences about the timing of

events and population size changes. This makes the

problem of learning about history simpler than if one had

to simultaneously infer topology, timing, and demography.

An admixture graph makes precise predictions about the

patterns of correlation in allele frequency differences

across all subsets of two, three, and four populations in

an analysis, as measured for example by the f2, f3, and f4

statistics of Reich et al.14 Given n populations, there are

n(n − 1)/2 f2 statistics, n(n − 1)(n − 2)/6 f3 statistics, and

n(n − 1)(n − 2)(n − 3)/24 f4 statistics. To fit an admixture

graph to data, one first proposes a topology, then identifies

the set of admixture proportions and genetic drift values

on each lineage (variation in allele frequency correspond-

ing to random sampling of alleles from generation to

generation in a population of finite size) that are the best

match to the data under that model. The admixture graph

topology, admixture proportions, and genetic drift values

Figure 5. Computation of the Estimate of Near Oceanian Ancestry pN(X)
The test population X is assumed to have arisen from a mixture of a proportion (1 − qX) of ancestry from ancestral East Asians E′ and (qX) of ancestral Near Oceanians N′. The Near Oceanians are, in turn, assumed to have received a proportion pX of their ancestry from the Denisovans (E and New Guinea are assumed to be unmixed descendants of these two). The expected value of f4(A, Australia; X, New Guinea) can be computed from the correlation in the allele frequency differences A − Australia (blue arrows) and X − New Guinea (red arrows). These paths only overlap along the proportion (1 − qX) of the ancestry of population X that takes the East Asian path, where the expected shared drift is (1 − pX)b + g as shown in the figure. Thus, the expected value of the f4 statistic is (1 − qX)[(1 − pX)b + g]. Because qX = 0 for the denominator of pN(X) (no Near Oceanian ancestry), the ratio of f4 statistics has an expected value of (1 − qX) and E[pN(X)] = qX.


on each lineage together generate expected values for the

f2, f3 and f4 statistics14 that can be compared to the

observed values—which have empirical standard errors

from a block jackknife—to assess the adequacy of the

best fit under the proposed topology. As we showed previ-

ously,14 the topology relating populations in an admixture

graph can be accurately inferred even if the polymor-

phisms used in an analysis are affected by substantial ascer-

tainment bias. The software that we have developed for

fitting admixture graphs carries out a hill-climb to find

the genetic drift values and admixture proportions that

minimize the discrepancy between the observed and ex-

pected f2, f3, and f4 statistics for a given topology relating

a set of populations.
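The empirical standard errors mentioned above come from a block jackknife. The sketch below illustrates the idea for a ratio of two sums, such as an f4 ratio, under simplifying assumptions that are not taken from the text: SNPs are split into equally weighted contiguous blocks, and the unweighted delete-one-block variance formula is used. The function name `block_jackknife_ratio` is hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def block_jackknife_ratio(num_terms: Sequence[float],
                          den_terms: Sequence[float],
                          n_blocks: int) -> Tuple[float, float]:
    """Return (estimate, standard error) for sum(num_terms) / sum(den_terms).

    Each element is one SNP's contribution to the numerator or denominator.
    The standard error comes from deleting one contiguous block at a time.
    """
    n = len(num_terms)
    if n != len(den_terms):
        raise ValueError("numerator and denominator need the same length")
    if not 2 <= n_blocks <= n:
        raise ValueError("n_blocks must be between 2 and the number of SNPs")

    total_num = sum(num_terms)
    total_den = sum(den_terms)
    estimate = total_num / total_den

    # Contiguous blocks stand in for linked genomic regions.
    bounds = [k * n // n_blocks for k in range(n_blocks + 1)]
    leave_one_out: List[float] = []
    for lo, hi in zip(bounds[:-1], bounds[1:]):
        num_without = total_num - sum(num_terms[lo:hi])
        den_without = total_den - sum(den_terms[lo:hi])
        leave_one_out.append(num_without / den_without)

    mean_loo = sum(leave_one_out) / n_blocks
    variance = (n_blocks - 1) / n_blocks * sum(
        (t - mean_loo) ** 2 for t in leave_one_out)
    return estimate, math.sqrt(variance)


# Toy check: with a constant denominator the ratio is a plain mean, and the
# jackknife reproduces the usual standard error of the mean.
est, se = block_jackknife_ratio([1.0, 2.0, 3.0, 4.0], [1.0] * 4, n_blocks=4)
z_score = est / se
```

Dividing an estimate by its jackknife standard error gives a Z score of the kind reported alongside the ancestry estimates. A production analysis would typically weight blocks by their SNP counts, which this sketch omits.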

A complication in fitting admixture graphs to data is

that we do not know how many effectively independent

f statistics there are, out of the [n(n − 1)/2][1 + (n − 2)/

3 + (n − 2)(n − 3)/12] that are computed. These statistics are

highly correlated, and in fact can be related algebraically

to each other; for example, all the f3 and f4 statistics are

linear combinations of the f2 statistics. Although we

believe that it is possible to construct a reasonable score

for how well the model fits the data by studying the covari-

ance matrix of the f statistics—and indeed a score of this

type is the basis for our hill-climbing software—we have

not yet found a formal way to assess how many indepen-

dent hypotheses are being tested, and thus we do not at

present have a goodness-of-fit test. Instead, we simply

compute all possible f statistics and search for extreme

outliers (e.g., Z scores of 3 or more from expectation). A

large number of Z scores greater than 3 are not likely to

be observed if the admixture graph topology is an accurate

description of a set of population relationships.
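The counts of f statistics and the outlier screen described above can be sketched in a few lines. This is an illustration under stated assumptions rather than the authors' software: the function names `count_f_statistics` and `flag_outliers` are hypothetical, and the observed, expected, and standard error values passed to the screen would come from the fitting procedure.

```python
from math import comb
from typing import Dict, List, Sequence, Tuple


def count_f_statistics(n: int) -> Dict[str, int]:
    """Number of f2, f3, and f4 statistics for n populations.

    comb(n, 2) = n(n-1)/2, comb(n, 3) = n(n-1)(n-2)/6, and
    comb(n, 4) = n(n-1)(n-2)(n-3)/24, matching the counts in the text.
    """
    f2, f3, f4 = comb(n, 2), comb(n, 3), comb(n, 4)
    return {"f2": f2, "f3": f3, "f4": f4, "total": f2 + f3 + f4}


def flag_outliers(observed: Sequence[float],
                  expected: Sequence[float],
                  std_err: Sequence[float],
                  threshold: float = 3.0) -> List[Tuple[int, float]]:
    """Return (index, Z) for statistics more than `threshold` SEs from expectation."""
    flagged = []
    for idx, (obs, exp, se) in enumerate(zip(observed, expected, std_err)):
        z = (obs - exp) / se
        if abs(z) > threshold:
            flagged.append((idx, z))
    return flagged


seven = count_f_statistics(7)   # the seven-population graph
nine = count_f_statistics(9)    # after adding Onge and Jehai
```

For seven populations the total is 21 + 35 + 35 = 91, and for nine populations it is 36 + 84 + 126 = 246, which agrees with the numbers of f statistics quoted for the two admixture graphs.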

Denisova Gene Flow into Mamanwa/New Guinean/

Australian Ancestors

We initially fit an admixture graph to the data from

Mamanwa, New Guineans, Australians, Denisova, Nean-

dertal, West Africans (YRI), and Han Chinese (CHB), basing

some of the proposed population relationships on pre-

vious work that hypothesized a model of an out-of-Africa

migration of modern humans, Neandertal gene flow into

the ancestors of all non-Africans, and sister group status

for Neandertals and Denisovans.12 A complication in

fitting an admixture graph to these data is that because

of the low coverage of the Neandertal and Denisova

genomes, we could not accurately infer the diploid geno-

type at each SNP. Thus, we sampled a single read from

Neandertal and Denisova to represent each site and (incor-

rectly) assumed that these individuals were homozygous

for the observed allele at each analyzed SNP. This means

that the estimates of genetic drift on the Neandertal and

Denisova branches are not reliable (the genetic drift values

are overestimated). However, these sources of error do not

introduce a correlation in allele frequencies across popula-

tions and hence are not expected to generate a false infer-

ence about the population relationships.

Figure S2 shows an admixture graph that proposes that the

Mamanwa, New Guineans, and Australians descend from

a common ancestral population; the Mamanwa split first

and the New Guinean and Australian ancestors split later.

This is an excellent fit to the data in the sense that only

one of 91 f statistics is more than three standard errors

from zero (|Z| = 3.4). An interesting feature of this admixture

graph is that it specifies an additional admixture event, after

the Mamanwa lineage separated, into the ancestors of

Australians and New Guineans that contributed about half

of their ancestry and involved a population without Deni-

sova admixture. A model that does not include such a

secondary admixture event is strongly rejected (see below).

The estimated proportion of Neandertal ancestry in all

non-Africans from the admixture graph fitting in Figure 3,

at 1.3%, is at the low end of the 1%–4% previously estimated

from sequencing data.18 Similarly, we infer a proportion

of Denisova ancestry in New Guineans of 3.5% = 6.6% × 53%, which is lower than the 4%–6% previously

estimated based on sequencing data but not significantly

so when one takes into account the standard errors quoted

in that study.12 These low numbers could reflect statistical

uncertainty from the previously reported analyses of

sequencing data or in the admixture graph estimates (the

latter possibility is especially important to consider

because we do not at present understand how to compute

standard errors on the admixture estimates derived from

admixture graphs). Another possible explanation for the

low estimates of mixture proportions is ascertainment

bias affecting the way SNPs were selected, which can affect

estimates of mixture proportions and branch lengths

(while having much less impact on the inference of

topology). Further support for the hypothesis that ascer-

tainment bias might be contributing to our lower estimates

of mixture proportions comes from the fact that in unpub-

lished work we have found that the polymorphisms most

enriched for signals of archaic admixture are those in

which the derived allele is present in the archaic popula-

tion, absent in West Africans, and present at low minor

allele frequency in the studied population. In our admix-

ture graph fitting, we filtered out this class of SNPs, as

the f statistics used in the admixture graph have denomi-

nators that require frequency estimates from a polymor-

phic reference population, and we used YRI as our refer-

ence. Thus, when we refitted the same admixture graph

with CHB instead of YRI as the reference population, we

obtained the same topology but the Neandertal mixture

proportion increased to 1.9%. We have chosen to use YRI

as the reference population in all of our reported admix-

ture graphs because they are a better outgroup for the

modern populations whose history we are studying than

the CHB (populations related to the Chinese were directly

involved in admixture events in Southeast Asia).

Adding Onge and Jehai

The Andamanese Negrito group (Onge) and Malaysian

Negrito group (Jehai) have been proposed to share ancient


common ancestry with Philippine Negritos (e.g., Ma-

manwa). The fact that neither the Onge nor the Jehai

have evidence of Denisova genetic material, however,

suggests that any common ancestry must date to before

the Denisova gene flow into the ancestors of the Ma-

manwa, New Guineans, and Australians. To explore the

relationship between the Onge and Jehai and the other

populations, we added them into the admixture graph.

The only family of admixture graphs that we could identify

as fitting the data have the Onge as a deep lineage of

modern humans, with the Jehai deriving ancestry from

the same lineage but also harboring a substantial additional

contribution of East Asian related admixture (Figure S3).

A striking feature of the family of admixture graphs shown

in Figure S3 is that both the Jehai and Mamanwa are inferred

to have up to about three-quarters of their ancestry due to

recent East Eurasian admixture, which is not too surprising

given that these populations have been living side by side

with populations of East Eurasian ancestry for thousands

of years. Moreover, both Y-chromosome and mtDNA anal-

yses strongly suggest recent East Asian admixture in the

Mamanwa.32,34 In contrast, the genome-wide SNP data for

the Onge are consistent with having no non-Negrito admix-

ture within the limits of our resolution, perhaps reflecting

their greater geographic isolation.

We next sought to resolve how the lineage including

Onge and Jehai ancestors, the mainland East Asian (e.g.,

Chinese), and the eastern group (including Mamanwa,

Australian and New Guinean ancestors) are related. Three

relationships are all consistent with the data. Specifically,

for all three of the admixture graphs shown in Figure S3,

only one of the 246 possible f statistics has a score of

|Z| > 3. Thus, we cannot discern the order of splitting of

these three lineages and represent the relationships as

a trifurcation in Figure 3. The actual estimates of mixture

proportions are similar for all three figures as well.

Perturbing the Best-Fitting Admixture Graph to Assess

the Robustness of Our Inferences

To assess the robustness of the admixture graphs, we per-

turbed Figure S3 (in practice, we perturbed Figure 3A, but

given the fact that the graphs are statistically indistin-

guishable we expected that results would be similar for

all three). First, we considered the possibility that after

the initial Denisova gene flow into the ancestors of Mamanwa,

New Guineans, and Australians, the New Guinean

and Australian ancestors did not experience an additional

gene-flow event with a population without Denisovan

admixture. However, when we try to fit this simpler model

to the data, we find that instead of one f statistic that is

|Z| > 3 standard errors from expectation, there are now

11, and all but one of them involve the Mamanwa, suggesting

that this population is poorly fit by such a model. Thus,

an additional admixture event in the ancestry of New

Guineans and Australians (resulting in a decrease in their

proportion of Denisova ancestry) results in a major

improvement in the fit.

Second, we considered the possibility that the secondary

gene-flow event into the ancestors of Australians and

New Guineans came from relatives of Chinese (CHB)

rather than western Negritos such as the Onge. However,

when we fit this alternative history to the data, we find

three f statistics (rather than one) with scores of |Z| > 3,

a substantially worse fit. We conclude that the modern

human population with which the ancestors of Australians

and New Guineans interbred was likely to have been more

closely related to western Negritos than to mainland East

Asians.

Supplemental Data

Supplemental Data include three figures and three tables and can

be found with this article online at http://www.cell.com/AJHG/.

Acknowledgments

We thank the volunteers who donated DNA samples. We acknowledge

F.A. Almeda Jr., J.P. Erazo, D. Gil, the late J. Kuhl, E.S. Larase, I.

Motinola, G. Patagan, W. Sinco, A. Sofro, U. Tadmor, and R. Trent

for assistance with sample collections. We thank M. Meyer for

preparing DNA libraries for high-throughput sequencing; A. Barik

and P. Nurenberg for assistance with genotyping; and O. Bar-Yosef,

K. Bryc, R.E. Green, J.-J. Hublin, J. Kelso, D. Lieberman, B. Pakendorf,

M. Slatkin, and B. Viola for comments on the manuscript.

T.A. Jinam was supported by a grant from the SOKENDAI Graduate

Student Overseas Travel Fund. This work was supported by the

Max Planck Society and by a National Science Foundation

HOMINID grant (1032255).

Received: August 11, 2011

Revised: September 8, 2011

Accepted: September 8, 2011

Published online: September 22, 2011

Web Resources

The URLs for data presented herein are as follows:

Burrows-Wheeler Aligner, http://bio-bwa.sourceforge.net/index.shtml

CEPH-Human Genome Diversity Cell Line Panel, http://www.cephb.fr/en/hgdp/diversity.php

EIGENSOFT, http://genepath.med.harvard.edu/~reich/Software.htm

European Collection of Cell Cultures, http://www.hpacultures.org.uk/pages/Ethnic_DNA_Panel.pdf

European Nucleotide Archive (Project ID ERP000121), http://www.ebi.ac.uk/ena/

Ibis, http://bioinf.eva.mpg.de/Ibis/

SAMtools, http://samtools.sourceforge.net/

References

1. Mellars, P. (2006). Going east: New genetic and archaeological

perspectives on the modern human colonization of Eurasia.

Science 313, 796–800.

2. Lahr, M., and Foley, R. (1994). Multiple dispersals and modern

human origins. Evol. Anthropol. 3, 48–60.


3. Endicott, P., Gilbert, M.T., Stringer, C., Lalueza-Fox, C., Willerslev,

E., Hansen, A.J., and Cooper, A. (2003). The genetic origins

of the Andaman Islanders. Am. J. Hum. Genet. 72, 178–184.

4. Macaulay, V., Hill, C., Achilli, A., Rengo, C., Clarke, D., Mee-

han, W., Blackburn, J., Semino, O., Scozzari, R., Cruciani, F.,

et al. (2005). Single, rapid coastal settlement of Asia revealed

by analysis of complete mitochondrial genomes. Science

308, 1034–1036.

5. Thangaraj, K., Chaubey, G., Kivisild, T., Reddy, A.G., Singh,

V.K., Rasalkar, A.A., and Singh, L. (2005). Reconstructing the

origin of Andaman Islanders. Science 308, 996.

6. Cordaux, R., and Stoneking, M. (2003). South Asia, the

Andamanese, and the genetic evidence for an early human

dispersal out of Africa. Am. J. Hum. Genet. 72, 1586–1590;

author reply 1590-1583.

7. Palanichamy, M.G., Agrawal, S., Yao, Y.G., Kong, Q.P., Sun, C.,

Khan, F., Chaudhuri, T.K., and Zhang, Y.P. (2006). Comment

on “Reconstructing the origin of Andaman islanders”. Science

311, 470, author reply 470.

8. Barik, S.S., Sahani, R., Prasad, B.V.R., Endicott, P., Metspalu, M.,

Sarkar, B.N., Bhattacharya, S., Annapoorna, P.C.H., Sreenath, J.,

Sun, D., et al. (2008). Detailed mtDNA genotypes permit

a reassessment of the settlement and population structure of

the Andaman Islands. Am. J. Phys. Anthropol. 136, 19–27.

9. Abdulla, M.A., Ahmed, I., Assawamakin, A., Bhak, J.,

Brahmachari, S.K., Calacal, G.C., Chaurasia, A., Chen, C.H.,

Chen, J., Chen, Y.T., et al; HUGO Pan-Asian SNP Consortium;

Indian Genome Variation Consortium. (2009). Mapping

human genetic diversity in Asia. Science 326, 1541–1545.

10. Wollstein, A., Lao, O., Becker, C., Brauer, S., Trent, R.J., Nurn-

berg, P., Stoneking, M., and Kayser, M. (2010). Demographic

history of Oceania inferred from genome-wide data. Curr.

Biol. 20, 1983–1992.

11. Moodley, Y., Linz, B., Yamaoka, Y., Windsor, H.M., Breurec, S.,

Wu, J.Y., Maady, A., Bernhoft, S., Thiberge, J.M., Phuanukoon-

non, S., et al. (2009). The peopling of the Pacific from a bacte-

rial perspective. Science 323, 527–530.

12. Reich, D., Green, R.E., Kircher, M., Krause, J., Patterson, N.,

Durand, E.Y., Viola, B., Briggs, A.W., Stenzel, U., Johnson,

P.L., et al. (2010). Genetic history of an archaic hominin group

from Denisova Cave in Siberia. Nature 468, 1053–1060.

13. Altshuler, D.M., Gibbs, R.A., Peltonen, L., Altshuler, D.M.,

Gibbs, R.A., Peltonen, L., Dermitzakis, E., Schaffner, S.F., Yu,

F., Peltonen, L., et al; International HapMap 3 Consortium.

(2010). Integrating common and rare genetic variation in

diverse human populations. Nature 467, 52–58.

14. Reich, D., Thangaraj, K., Patterson, N., Price, A.L., and Singh,

L. (2009). Reconstructing Indian population history. Nature

461, 489–494.

15. Redd, A.J., and Stoneking, M. (1999). Peopling of Sahul:

mtDNA variation in aboriginal Australian and Papua New

Guinean populations. Am. J. Hum. Genet. 65, 808–828.

16. Cann, H.M., de Toma, C., Cazes, L., Legrand, M.F., Morel, V.,

Piouffre, L., Bodmer, J., Bodmer, W.F., Bonne-Tamir, B., Cam-

bon-Thomsen, A., et al. (2002). A human genome diversity

cell line panel. Science 296, 261–262.

17. Chimpanzee Sequencing and Analysis Consortium. (2005).

Initial sequence of the chimpanzee genome and comparison

with the human genome. Nature 437, 69–87.

18. Green, R.E., Krause, J., Briggs, A.W., Maricic, T., Stenzel, U., Kircher,

M., Patterson, N., Li, H., Zhai, W., Fritz, M.H., et al. (2010). A draft

sequence of the Neandertal genome. Science 328, 710–722.

19. Patterson, N., Price, A.L., and Reich, D. (2006). Population

structure and eigenanalysis. PLoS Genet. 2, e190.

20. Kircher, M., Stenzel, U., and Kelso, J. (2009). Improved base

calling for the Illumina Genome Analyzer using machine

learning strategies. Genome Biol. 10, R83.

21. Li, H., and Durbin, R. (2009). Fast and accurate short read

alignment with Burrows-Wheeler transform. Bioinformatics

25, 1754–1760.

22. Busing, F., Meijer, E., and Van Der Leeden, R. (1999). Delete-m

jackknife for unequal m. Stat. Comput. 9, 3–8.

23. Kunsch, H.K. (1989). The jackknife and the bootstrap for

general stationary observations. Ann. Stat. 17, 1217–1241.

24. O’Connell, J., and Allen, J. (2004). Dating the colonization of

Sahul (Pleistocene Australia - New Guinea): A review of recent

research. J. Archaeol. Sci. 31, 835–853.

25. Summerhayes, G.R., Leavesley, M., Fairbairn, A., Mandui, H.,

Field, J., Ford, A., and Fullagar, R. (2010). Human adaptation

and plant use in highland New Guinea 49,000 to 44,000 years

ago. Science 330, 78–81.

26. McEvoy, B.P., Lind, J.M., Wang, E.T., Moyzis, R.K., Visscher, P.M.,

van Holst Pellekaan, S.M., and Wilton, A.N. (2010). Whole-

genome genetic diversity in a sample of Australians with deep

Aboriginal ancestry. Am. J. Hum. Genet. 87, 297–305.

27. Roberts-Thomson, J.M., Martinson, J.J., Norwich, J.T.,

Harding, R.M., Clegg, J.B., and Boettcher, B. (1996). An

ancient common origin of aboriginal Australians and New

Guinea highlanders is supported by alpha-globin haplotype

analysis. Am. J. Hum. Genet. 58, 1017–1024.

28. Friedlaender, J.S., Friedlaender, F.R., Reed, F.A., Kidd, K.K.,

Kidd, J.R., Chambers, G.K., Lea, R.A., Loo, J.H., Koki, G., Hodg-

son, J.A., et al. (2008). The genetic structure of Pacific

Islanders. PLoS Genet. 4, e19.

29. Kayser, M., Brauer, S., Cordaux, R., Casto, A., Lao, O., Zhivo-

tovsky, L.A., Moyse-Faurie, C., Rutledge, R.B., Schiefenhoevel,

W., Gil, D., et al. (2006). Melanesian and Asian origins of Poly-

nesians: mtDNA and Y chromosome gradients across the

Pacific. Mol. Biol. Evol. 23, 2234–2244.

30. Kayser, M., Lao, O., Saar, K., Brauer, S., Wang, X., Nurnberg, P.,

Trent, R.J., and Stoneking, M. (2008). Genome-wide analysis

indicates more Asian than Melanesian ancestry of Polyne-

sians. Am. J. Hum. Genet. 82, 194–198.

31. Mona, S., Grunz, K.E., Brauer, S., Pakendorf, B., Castrì, L.,

Sudoyo, H., Marzuki, S., Barnes, R.H., Schmidtke, J.,

Stoneking, M., and Kayser, M. (2009). Genetic admixture

history of Eastern Indonesia as revealed by Y-chromosome

and mitochondrial DNA analysis. Mol. Biol. Evol. 26, 1865–

1877.

32. Delfin, F., Salvador, J.M., Calacal, G.C., Perdigon, H.B.,

Tabbada, K.A., Villamor, L.P., Halos, S.C., Gunnarsdottir, E.,

Myles, S., Hughes, D.A., et al. (2011). The Y-chromosome

landscape of the Philippines: Extensive heterogeneity and

varying genetic affinities of Negrito and non-Negrito groups.

Eur. J. Hum. Genet. 19, 224–230.

33. Matsumoto, H., Miyazaki, T., Omoto, K., Misawa, S., Harada,

S., Hirai, M., Sumpaico, J.S., Medado, P.M., and Ogonuki, H.

(1979). Population genetic studies of the Philippine Negritos.

II. gm and km allotypes of three population groups. Am. J.

Hum. Genet. 31, 70–76.

34. Gunnarsdottir, E.D., Li, M., Bauchet, M., Finstermeier, K., and

Stoneking, M. (2011). High-throughput sequencing of

complete human mtDNA genomes from the Philippines.

Genome Res. 21, 1–11.


ARTICLE

Rare-Variant Association Testing for Sequencing Data with the Sequence Kernel Association Test

Michael C. Wu,1,5 Seunggeun Lee,2,5 Tianxi Cai,2 Yun Li,1,3 Michael Boehnke,4 and Xihong Lin2,*

Sequencing studies are increasingly being conducted to identify rare variants associated with complex traits. The limited power of clas-

sical single-marker association analysis for rare variants poses a central challenge in such studies. We propose the sequence kernel asso-

ciation test (SKAT), a supervised, flexible, computationally efficient regression method to test for association between genetic variants

(common and rare) in a region and a continuous or dichotomous trait while easily adjusting for covariates. As a score-based vari-

ance-component test, SKAT can quickly calculate p values analytically by fitting the null model containing only the covariates, and

so can easily be applied to genome-wide data. Using SKAT to analyze a genome-wide sequencing study of 1000 individuals, by segment-

ing the whole genome into 30 kb regions, requires only 7 hr on a laptop. Through analysis of simulated data across a wide range of

practical scenarios and triglyceride data from the Dallas Heart Study, we show that SKAT can substantially outperform several alternative

rare-variant association tests. We also provide analytic power and sample-size calculations to help design candidate-gene, whole-exome,

and whole-genome sequence association studies.

Introduction

Genome-wide association studies (GWASs) have identified

more than 1000 genetic loci associated with many human

diseases and traits,1 yet common variants identified

through GWASs often explain only a small proportion of

trait heritability. The advent of massively parallel

sequencing2 has transformed human genetics3,4 and has

the potential to explain some of this missing heritability

through identification of trait-associated rare variants.5

Although considerable resources have been devoted to

sequence mapping and genotype calling,6–9 successful

application of sequencing to the study of complex traits

requires novel statistical methods that allow researchers

to test efficiently for association given data on rare vari-

ants10 and to perform sample-size and power calculations

to help design sequencing-based association studies.

Rare genetic variants, here defined as alleles with

a frequency less than 1%–5%, can play key roles in influ-

encing complex disease and traits.11 However, standard

methods used to test for association with single common

genetic variants are underpowered for rare variants unless

sample sizes or effect sizes are very large.12,13 A logical alter-

native approach is to employ burden tests that assess

the cumulative effects of multiple variants in a genomic

region.12–18 Burden tests proposed to date are based on

collapsing or summarizing the rare variants within a region

by a single value, which is then tested for association with

the trait of interest. For example, the cohort allelic sum test

(CAST)14 collapses information on all rare variants within

a region (e.g., the exons of a gene) into a single dichoto-

mous variable for each subject by indicating whether or

not the subject has any rare variants within the region

and then applies a univariate test. Instead of collapsing by

dichotomizing the number of rare variants within a region,

collapsing by counting them is also possible.18 The

combined multivariate and collapsing method12 extends

CAST by collapsing rare variants within a region into

subgroups on the basis of allele frequency, collapsing

subgroups as in CAST, and applying a multivariate test to

the subgroups. The weighted sum test (WST)13 specifically

considers the case-control setting and collapses a set of

SNPs into a single weighted average of the number of

rare alleles for each individual. Numerous alternative

methods are largely variations on these approaches.16,17,19
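To make the collapsing step of these burden tests concrete, the following Python sketch contrasts the two summaries described above: a CAST-style carrier indicator and a simple rare-allele count. It is an illustration only. The function names, the 5% rare-variant threshold, and the toy genotypes and minor-allele frequencies are assumptions introduced here, and the univariate test that would follow the collapsing step is omitted.

```python
from typing import List


def rare_sites(maf: List[float], threshold: float) -> List[int]:
    """Indices of variants whose minor-allele frequency is below the threshold."""
    return [j for j, f in enumerate(maf) if f < threshold]


def collapse_indicator(genotypes: List[List[int]], maf: List[float],
                       threshold: float = 0.05) -> List[int]:
    """CAST-style collapsing: 1 if a subject carries any rare allele, else 0."""
    rare = rare_sites(maf, threshold)
    return [int(any(row[j] > 0 for j in rare)) for row in genotypes]


def collapse_count(genotypes: List[List[int]], maf: List[float],
                   threshold: float = 0.05) -> List[int]:
    """Count-based collapsing: number of rare alleles carried by each subject."""
    rare = rare_sites(maf, threshold)
    return [sum(row[j] for j in rare) for row in genotypes]


# Hypothetical data: 4 subjects x 3 variants, additive coding (0, 1, or 2).
# The MAFs are assumed to be supplied externally; the third variant is common.
genotypes = [
    [0, 0, 1],
    [1, 0, 2],
    [0, 0, 2],
    [0, 2, 0],
]
maf = [0.01, 0.02, 0.30]

carrier = collapse_indicator(genotypes, maf)
burden = collapse_count(genotypes, maf)
print(carrier)  # [0, 1, 0, 1]
print(burden)   # [0, 1, 0, 2]
```

Either summary reduces the region to one value per subject, which is why variants with opposite directions of effect can cancel in the aggregated index.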

A limitation for all these burden tests is that they implic-

itly assume that all rare variants influence the phenotype

in the same direction and with the same magnitude of

effect (after incorporating known weights). However, one

would expect most variants (common or rare) within

a sequenced region to have little or no effect on pheno-

type, whereas some variants are protective and others dele-

terious, and the magnitude of each variant’s effect is likely

to vary (e.g., rarer variants might have larger effects).

Hence, collapsing across all variants is likely to introduce

substantial noise into the aggregated index, attenuate

evidence for association, and result in power loss. Further-

more, burden tests require either specification of thresh-

olds for collapsing or the use of permutation to estimate

the threshold.16–20 Permutation tests are computationally

expensive, especially on the whole-genome scale, and are

difficult for covariate adjustment because permutation

1Department of Biostatistics, The University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA; 2Department of Biostatistics, Harvard School

of Public Health, Boston, MA 02115, USA; 3Department of Genetics, The University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA; 4Department

of Biostatistics and Center for Statistical Genetics, University of Michigan, Ann Arbor, MI 48109, USA

5These authors contributed equally to this work

*Correspondence: [email protected]


82 The American Journal of Human Genetics 89, 82–93, July 15, 2011

requires independence between the genotype and the co-

variates.

The recently proposed C-alpha test21 is a non-burden-

based test and is hence robust to the direction and magni-

tude of effect. For case-control data, it compares the

expected variance to the actual variance of the distribution

of allele frequencies. These important advantages allow the

C-alpha test to have improved power over burden-based

tests, especially when the effects are in different directions.

Despite these attractive features, the C-alpha test does not

allow for easy covariate adjustment, such as for controlling

population stratification, which is important in genetic

association studies. The C-alpha test also uses permutation

to obtain a p value when linkage disequilibrium is present

among the variants, which is, as noted earlier, computa-

tionally expensive for whole-genome experiments. The

approach has not been generalized to analysis of contin-

uous phenotypes.

We propose in this paper the sequence kernel association

test (SKAT), a flexible, computationally efficient, regression

approach that tests for association between variants in a

region (both common and rare) and a dichotomous (e.g.,

case-control) or continuous phenotype while adjusting for

covariates, such as principal components, to account for

population stratification.22 The kernel machine regression

framework was previously considered for common variants.23,24

In this paper, we provide several essential methodological

improvements necessary for testing rare variants.

SKAT uses a multiple regression model to directly regress

the phenotype on genetic variants in a region and on cova-

riates, and so allows different variants to have different

directions and magnitude of effects, including no effects;

SKAT also avoids selection of thresholds. We develop a

kernel association test to test the regression coefficients of

the variants by using a variance-component score test in a

mixed-model framework by accounting for rare variants.

SKAT is computationally efficient. This quality is espe-

cially important in genome-wide studies because SKAT

only requires fitting the null model in which phenotypes

are regressed on the covariates alone; p values are easily

computed with simple analytic formulae. Additional

features of SKAT include exploitation of local correlation

structure, incorporation of flexible weights to boost power

(e.g., by increasing the weight of rarer variants or incorpo-

rating functionality), and allowance for epistatic variant

effects. As discussed in more detail below, under special

cases, the SKAT, C-alpha test, and individual variant test

statistics are closely related.

We demonstrate through simulation and analysis of

resequencing data from the Dallas Heart Study that SKAT

is often more powerful than existing tests across a broad

range of models for both continuous and dichotomous

data. We also investigate the factors that influence power

for sequence association studies. Finally, we describe

analytic tools to estimate statistical power and sample sizes

to guide the design of new sequence association studies of

rare variants with SKAT.


Sequence Kernel Association Test

SKAT is a supervised test for the joint effects of multiple variants in

a region on a phenotype. Regions can be defined by genes (in

candidate-gene or whole-exome studies) or moving windows

across the genome (in whole-genome studies). For each region,

SKAT analytically calculates a p value for association while adjust-

ing for covariates. Adjustments for multiple comparisons are

necessary for analyzing multiple regions, for example with the

Bonferroni correction or FDR control.
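As an illustration of the multiple-comparison step, the sketch below applies the Bonferroni correction and, as one common FDR-controlling procedure, the Benjamini-Hochberg step-up rule to a set of per-region p values. The choice of Benjamini-Hochberg, the function names, the significance levels, and the p values are assumptions made for this example and are not prescribed by the text.

```python
from typing import List


def bonferroni(p_values: List[float], alpha: float = 0.05) -> List[bool]:
    """Reject region i when p_i <= alpha / m, where m is the number of regions."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg(p_values: List[float], q: float = 0.05) -> List[bool]:
    """Step-up rule: reject the k smallest p values, where k is the largest
    rank with p_(k) <= k * q / m."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank * q / m:
            k_max = rank
    rejected = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= k_max:
            rejected[i] = True
    return rejected


# Hypothetical p values for five regions.
region_p = [0.0004, 0.012, 0.035, 0.2, 0.6]
bonf = bonferroni(region_p)
bh = benjamini_hochberg(region_p)
print(bonf)  # [True, False, False, False, False]
print(bh)    # [True, True, False, False, False]
```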

Notation

Assume n subjects are sequenced in a region with p variant sites

observed. Covariates might include age, gender, and top principal

components of genetic variation for controlling population stratification.22 For the i-th subject, $y_i$ denotes the phenotype variable, $\mathbf{X}_i = (X_{i1}, X_{i2}, \ldots, X_{im})$ denotes the covariates, and $\mathbf{G}_i = (G_{i1}, G_{i2}, \ldots, G_{ip})$ denotes the genotypes for the p variants within the region. Typically, we assume an additive genetic model and let $G_{ij} = 0$, 1, or 2 represent the number of copies of the minor allele. Dominant and recessive models can also be considered.

SKAT Model and Test for Linear SNP Effects

For a simple illustration of SKAT, we focus here on testing for a rela-

tionship between the variants and the phenotype by using clas-

sical multiple linear and logistic regression. We describe how the

SKAT can incorporate epistatic effects later. To relate the sequence

variants in a region to the phenotype, consider the linear model

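$$y_i = \alpha_0 + \boldsymbol{\alpha}'\mathbf{X}_i + \boldsymbol{\beta}'\mathbf{G}_i + \varepsilon_i, \qquad \text{(Equation 1)}$$

when the phenotypes are continuous traits, and the logistic model

$$\mathrm{logit}\, P(y_i = 1) = \alpha_0 + \boldsymbol{\alpha}'\mathbf{X}_i + \boldsymbol{\beta}'\mathbf{G}_i, \qquad \text{(Equation 2)}$$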

when the phenotypes are dichotomous (e.g., $y = 0/1$ for case or control). Here $\alpha_0$ is an intercept term, $\boldsymbol{\alpha} = [\alpha_1, \ldots, \alpha_m]'$ is the vector of regression coefficients for the m covariates, $\boldsymbol{\beta} = [\beta_1, \ldots, \beta_p]'$ is the vector of regression coefficients for the p observed gene variants in the region, and for continuous phenotypes $\varepsilon_i$ is an error term with a mean of zero and a variance of $\sigma^2$. Under both linear and logistic models, evaluating whether the gene variants influence the phenotype, adjusting for covariates, corresponds to testing the null hypothesis $H_0\colon \boldsymbol{\beta} = 0$, that is, $\beta_1 = \beta_2 = \cdots = \beta_p = 0$. The standard p-DF likelihood ratio test has little power, especially for rare variants. To increase the power, SKAT tests $H_0$ by assuming each $\beta_j$ follows an arbitrary distribution with a mean of zero and a variance of $w_j\tau$, where $\tau$ is a variance component and $w_j$ is a prespecified weight for variant j. One can easily see that $H_0\colon \boldsymbol{\beta} = 0$ is equivalent to testing $H_0\colon \tau = 0$, which can be conveniently tested with a variance-component score test in the corresponding mixed model; this is known to be a locally most powerful test.25 A key advantage of the score test is that it only requires fitting the null model $y_i = \alpha_0 + \boldsymbol{\alpha}'\mathbf{X}_i + \varepsilon_i$ for continuous traits and $\mathrm{logit}\, P(y_i = 1) = \alpha_0 + \boldsymbol{\alpha}'\mathbf{X}_i$ for dichotomous traits.
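The null-model fit for a continuous trait can be sketched in a few lines of Python. The example below assumes ordinary least squares as the fitting method and uses a small hand-written linear solver so that it depends only on the standard library; the function names and the toy data are hypothetical. The logistic null model for dichotomous traits would require an iterative fit and is not shown.

```python
from typing import List, Tuple


def solve(A: List[List[float]], b: List[float]) -> List[float]:
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x


def fit_null_linear(y: List[float],
                    X: List[List[float]]) -> Tuple[List[float], List[float]]:
    """Least-squares fit of y_i = alpha_0 + alpha' X_i + eps_i (no genotypes).

    Returns (coefficients with the intercept first, fitted means mu_hat).
    """
    design = [[1.0] + list(row) for row in X]
    k = len(design[0])
    XtX = [[sum(d[a] * d[b] for d in design) for b in range(k)] for a in range(k)]
    Xty = [sum(d[a] * yi for d, yi in zip(design, y)) for a in range(k)]
    coef = solve(XtX, Xty)
    mu_hat = [sum(c * v for c, v in zip(coef, d)) for d in design]
    return coef, mu_hat


# Hypothetical data: five subjects, one covariate (for example, age in decades).
y = [1.0, 3.0, 5.0, 7.0, 9.5]
X = [[0.0], [1.0], [2.0], [3.0], [4.0]]
coef, mu_hat = fit_null_linear(y, X)
residuals = [yi - mi for yi, mi in zip(y, mu_hat)]
print(coef)
print(residuals)
```

Because the genotypes do not enter this fit, the same residuals can be reused for every region tested.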

Specifically, the variance-component score statistic is

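$$Q = (\mathbf{y} - \hat{\boldsymbol{\mu}})'\mathbf{K}(\mathbf{y} - \hat{\boldsymbol{\mu}}), \qquad \text{(Equation 3)}$$

where $\mathbf{K} = \mathbf{G}\mathbf{W}\mathbf{G}'$, $\hat{\boldsymbol{\mu}}$ is the predicted mean of $\mathbf{y}$ under $H_0$, that is, $\hat{\boldsymbol{\mu}} = \hat{\alpha}_0 + \mathbf{X}\hat{\boldsymbol{\alpha}}$ for continuous traits and $\hat{\boldsymbol{\mu}} = \mathrm{logit}^{-1}(\hat{\alpha}_0 + \mathbf{X}\hat{\boldsymbol{\alpha}})$ for dichotomous traits; and $\hat{\alpha}_0$ and $\hat{\boldsymbol{\alpha}}$ are estimated under the null model by regressing $\mathbf{y}$ on only the covariates $\mathbf{X}$. Here $\mathbf{G}$ is an $n \times p$ matrix with the (i, j)-th element being the genotype of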

variant j of subject i, and $\mathbf{W} = \mathrm{diag}(w_1, \ldots, w_p)$ contains the weights of the p variants.

In fact, $\mathbf{K}$ is an $n \times n$ matrix with the (i, i')-th element equal to $K(\mathbf{G}_i, \mathbf{G}_{i'}) = \sum_{j=1}^{p} w_j G_{ij} G_{i'j}$. $K(\cdot,\cdot)$ is called the kernel function, and $K(\mathbf{G}_i, \mathbf{G}_{i'})$ measures the genetic similarity between subjects i and i' in the region via the p markers. This particular form of $K(\cdot,\cdot)$ is called the weighted linear kernel function. We later discuss other choices of the kernel to model epistatic effects.
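The quadratic form in Equation 3 with the weighted linear kernel can be sketched as follows. The residuals $y - \hat{\mu}$ are taken as given inputs here (in practice they would come from the fitted null model), and the function names, genotypes, weights, and residual values are hypothetical choices made for illustration.

```python
from typing import List


def weighted_linear_kernel(G: List[List[int]], w: List[float]) -> List[List[float]]:
    """K = G W G': entry (i, k) is sum_j w_j * G_ij * G_kj."""
    n = len(G)
    return [
        [sum(wj * a * b for wj, a, b in zip(w, G[i], G[k])) for k in range(n)]
        for i in range(n)
    ]


def skat_q(residuals: List[float], K: List[List[float]]) -> float:
    """Q = r' K r, where r = y - mu_hat are the null-model residuals."""
    n = len(residuals)
    return sum(
        residuals[i] * K[i][k] * residuals[k] for i in range(n) for k in range(n)
    )


# Hypothetical data: 4 subjects x 3 variants, additive coding.
G = [
    [0, 1, 0],
    [1, 0, 0],
    [0, 0, 2],
    [0, 1, 1],
]
w = [1.0, 1.0, 0.5]                  # prespecified variant weights
residuals = [0.5, -1.0, 1.5, -1.0]   # assumed y - mu_hat from a null model

K = weighted_linear_kernel(G, w)
Q = skat_q(residuals, K)
print(Q)  # 3.25
```

Forming the full $n \times n$ kernel matrix is convenient for exposition but costs on the order of $n^2 p$ operations, so it is shown here for small examples only.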

Good choices of weights can improve power. Each weight $w_j$ is prespecified using only the genotypes, covariates, and external biological information, that is, it is estimated without using the outcome, and it reflects the relative contribution of the j-th variant to the score statistic: if $w_j$ is close to zero, then the j-th variant makes only a small contribution to Q. Thus, decreasing the weight of noncausal variants and increasing the weight of causal variants can yield improved power. Because in practice we do not know which variants are causal, we propose to set $\sqrt{w_j} = \mathrm{Beta}(\mathrm{MAF}_j; a_1, a_2)$, the beta distribution density function with prespecified parameters $a_1$ and $a_2$ evaluated at the sample minor-allele frequency (MAF) (across cases and controls combined) for the j-th variant in the data. The beta density is flexible and can accommodate a broad range of scenarios. For example, if rarer variants are expected to be more likely to have larger effects, then setting $0 < a_1 \le 1$ and $a_2 \ge 1$ allows for increasing the weight of rarer variants and decreasing the weight of common variants. We suggest setting $a_1 = 1$ and $a_2 = 25$ because it increases the weight of rare variants while still putting decent nonzero weights on variants with MAF 1%–5%. All simulations were conducted with this default choice unless stated otherwise. Note that a smaller $a_1$ results in more strongly increasing the weight of rarer variants. Examples of weights across a range of $a_1$ and $a_2$ values are presented in Figure S1, available online. Note that $a_1 = a_2 = 1$ corresponds to $w_j = 1$, that is, all variants are weighted equally, and $a_1 = a_2 = 0.5$ corresponds to $\sqrt{w_j} = 1/\sqrt{\mathrm{MAF}_j(1 - \mathrm{MAF}_j)}$, that is, $w_j$ is (up to a constant factor) the inverse of the variance of the genotype of marker j, which puts almost zero weight for MAFs > 1% and can be used if one believes only variants with MAF < 1% are likely to be causal. Note that SKAT calculated with this weight is identical to the unweighted SKAT test with the standardized genotypes in Equations 1 and 2. Other forms of the weight as a function of MAF can also be used. Because SKAT is a score test, the type I error is protected for any choice of prechosen weights. Note that the weights used in the weighted sum test13 involve phenotype information and will therefore alter the null distribution of SKAT if such weights are used.
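The beta-density weights can be computed directly from the sample MAFs. The sketch below evaluates the Beta($a_1$, $a_2$) density with the standard library and squares it to obtain $w_j$; the function names and the example MAFs are assumptions introduced for illustration. For $a_1 = a_2 = 0.5$ the density equals $1/(\pi\sqrt{\mathrm{MAF}(1-\mathrm{MAF})})$, so it matches the inverse-variance form only up to a constant factor, which does not affect the relative weighting.

```python
import math
from typing import List


def beta_density(x: float, a: float, b: float) -> float:
    """Beta(a, b) probability density evaluated at x, for 0 < x < 1."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(
        log_norm + (a - 1.0) * math.log(x) + (b - 1.0) * math.log(1.0 - x)
    )


def skat_weights(maf: List[float], a1: float = 1.0, a2: float = 25.0) -> List[float]:
    """Return w_j, where sqrt(w_j) = Beta(MAF_j; a1, a2)."""
    return [beta_density(f, a1, a2) ** 2 for f in maf]


# Hypothetical sample MAFs, from very rare to common.
maf = [0.001, 0.01, 0.05, 0.30]
default_w = skat_weights(maf)          # suggested default a1 = 1, a2 = 25
flat_w = skat_weights(maf, 1.0, 1.0)   # a1 = a2 = 1 gives equal weights

for f, w in zip(maf, default_w):
    print(f"MAF={f:.3f}  sqrt(w)={math.sqrt(w):.4f}")
```

With the default parameters the square-root weight reduces to $25(1 - \mathrm{MAF})^{24}$, which decreases smoothly as the MAF grows rather than cutting off at a threshold.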

Under the null hypothesis, Q follows a mixture of chi-square

distributions, which can be closely approximated with the compu-

tationally efficient Davies method.26 See Appendix A for details.
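For intuition about the null distribution, a mixture of chi-square distributions can be written as $\sum_j \lambda_j \chi^2_{1,j}$ with independent 1-DF chi-square variables. The sketch below approximates the upper-tail probability by plain Monte Carlo simulation. This is a substitute technique chosen for simplicity, not the Davies method used in the paper, and the mixture coefficients $\lambda_j$ (derived in Appendix A) are treated here as given inputs; the function name and the numerical values are hypothetical.

```python
import random
from typing import List


def mixture_chisq_pvalue_mc(q_obs: float, lambdas: List[float],
                            n_draws: int = 100_000, seed: int = 0) -> float:
    """Monte Carlo estimate of P(sum_j lambda_j * Z_j^2 >= q_obs), Z_j ~ N(0, 1)."""
    rng = random.Random(seed)
    exceed = 0
    for _ in range(n_draws):
        draw = sum(lam * rng.gauss(0.0, 1.0) ** 2 for lam in lambdas)
        if draw >= q_obs:
            exceed += 1
    return exceed / n_draws


# Sanity check: one coefficient equal to 1 is an ordinary 1-DF chi-square,
# whose 95th percentile is about 3.841.
p_single = mixture_chisq_pvalue_mc(3.841, [1.0])

# Hypothetical mixture coefficients for a region with three variants.
p_mixture = mixture_chisq_pvalue_mc(6.0, [2.0, 1.0, 0.5])
print(p_single, p_mixture)
```

Simulation of this kind cannot resolve p values much smaller than the reciprocal of the number of draws, which is one reason an analytic approximation is preferable at genome-wide significance levels.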

A special case of SKAT arises when the outcome is dichotomous,

no covariates are included, and all $w_j = 1$. Under these conditions,

we show in Appendix A that the SKAT test statistic Q is equivalent

to the C-alpha test statistic T. Hence, the C-alpha test can be

seen as a special case of SKAT, or alternatively, SKAT can be seen

as a generalized C-alpha test that does not require permutation

but calculates the p value analytically, allows for covariate adjust-

ment, and accommodates either dichotomous or continuous

phenotypes. Because SKAT under flat weights is also equivalent

to the kernel machine regression test23,24 and because the kernel

machine regression test is in turn related to the SSU test,27 it

follows transitively that SKAT under flat weights, the kernel

machine regression test, the SSU test, and the C-alpha test are all

equivalent and special cases of SKAT. Note that the null distribu-

tion is calculated differently via these methods, and SKAT gives

more accurate analytic p values, especially in the extreme tail,

when sample sizes are sufficient.

Relationship between Linear SKAT and Individual Variant Test Statistics

One can efficiently compute the test statistic Q by exploiting

a close connection between the SKAT score test statistic Q and

the individual variant test statistics. In particular, Q is a weighted

sum of the individual score statistics for testing for individual

variant effects. Hence, by letting $\mathbf{g}_j = [G_{1j}, G_{2j}, \ldots, G_{nj}]'$ denote the $n \times 1$ vector containing the genotypes of the n subjects for variant j, it is straightforward to see that $Q = \sum_{j=1}^{p} w_j S_j^2$, where $S_j = \mathbf{g}_j'(\mathbf{y} - \hat{\boldsymbol{\mu}}_0)$ is the individual score statistic for testing the marginal effect of the j-th marker ($H_0\colon \beta_j = 0$) under the individual linear or logistic regression model of $y_i$ on $\mathbf{X}_i$ and only the j-th variant $G_{ij}$:

$$y_i = \alpha_0 + \mathbf{X}_i'\boldsymbol{\alpha} + \beta_j G_{ij} + \varepsilon_i$$

for continuous phenotypes and

$$\mathrm{logit}\, P(y_i = 1) = \alpha_0 + \mathbf{X}_i'\boldsymbol{\alpha} + \beta_j G_{ij}$$

for dichotomous phenotypes. $\hat{\boldsymbol{\mu}}_0$ is estimated, for subject i, as $\hat{\mu}_{0i} = \hat{\alpha}_0 + \mathbf{X}_i'\hat{\boldsymbol{\alpha}}$ for continuous traits and $\hat{\mu}_{0i} = \mathrm{logit}^{-1}(\hat{\alpha}_0 + \mathbf{X}_i'\hat{\boldsymbol{\alpha}})$ for dichotomous traits. Because SKAT is a score test, one needs to fit the null model only a single time to be able to compute the $S_j$ for all individual variants j as well as all regions to be tested. Similarly, if multiple regions are under consideration, then the same $\hat{\boldsymbol{\mu}}_0$ can be used to compute the SKAT Q statistics for each region.
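The identity $Q = \sum_j w_j S_j^2$ can be checked numerically. The sketch below computes Q once from the per-variant score statistics and once from the kernel quadratic form and confirms that the two agree; the residuals are assumed to come from an already fitted null model, and all names and values are hypothetical.

```python
from typing import List


def score_statistics(G: List[List[int]], residuals: List[float]) -> List[float]:
    """S_j = g_j' (y - mu_hat_0): one score statistic per variant (column of G)."""
    p = len(G[0])
    return [sum(row[j] * r for row, r in zip(G, residuals)) for j in range(p)]


def q_from_scores(G: List[List[int]], residuals: List[float],
                  w: List[float]) -> float:
    """Q as the weighted sum of squared individual score statistics."""
    return sum(wj * s * s for wj, s in zip(w, score_statistics(G, residuals)))


def q_from_kernel(G: List[List[int]], residuals: List[float],
                  w: List[float]) -> float:
    """Q as the quadratic form r' (G W G') r, evaluated entry by entry."""
    n = len(G)
    total = 0.0
    for i in range(n):
        for k in range(n):
            k_ik = sum(wj * a * b for wj, a, b in zip(w, G[i], G[k]))
            total += residuals[i] * k_ik * residuals[k]
    return total


# Hypothetical data: 4 subjects x 3 variants, with assumed null-model residuals.
G = [
    [0, 1, 0],
    [1, 0, 0],
    [0, 0, 2],
    [0, 1, 1],
]
w = [1.0, 1.0, 0.5]
residuals = [0.5, -1.0, 1.5, -1.0]

scores = score_statistics(G, residuals)
q_scores = q_from_scores(G, residuals, w)
q_kernel = q_from_kernel(G, residuals, w)
print(scores)              # [-1.0, -0.5, 2.0]
print(q_scores, q_kernel)  # 3.25 3.25
```

The score form needs on the order of $np$ operations per region, compared with $n^2 p$ for the explicit kernel matrix, which illustrates why the connection is computationally useful.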

Accommodating Epistatic Effects and Prior Information under the SKAT

An attractive feature of SKAT is the ability to model the epistatic

effects of sequence variants on the phenotype within the flexible

kernel machine regression framework.28–30 To do so, we replace
$G_i'\beta$ by a more flexible function $f(G_i)$ in the linear and logistic
models (1) and (2), where $f(G_i)$ allows for rare-variant by rare-variant
and common-variant by rare-variant interactions. Specifically,
for continuous traits we use the semiparametric linear
model23,29

$$y_i = \alpha_0 + \alpha' X_i + f(G_i) + \varepsilon_i, \qquad \text{(Equation 4)}$$

and for dichotomous traits, we use the semiparametric logistic
model24,30

$$\operatorname{logit} P(y_i = 1) = \alpha_0 + \alpha' X_i + f(G_i). \qquad \text{(Equation 5)}$$

Here the variants, $G_i$, are related to the phenotype through
a possibly nonparametric function $f(\cdot)$, which is assumed to lie
in a functional space generated by a positive semidefinite kernel
function $K(\cdot,\cdot)$. Models (1) and (2) assume linear genetic effects
and are specified by $K(G_i, G_{i'}) = \sum_{j=1}^{p} w_j G_{ij} G_{i'j}$. By changing
$K(\cdot,\cdot)$, one can allow for more complex models. Intuitively,
$K(G_i, G_{i'})$ is a function that measures genetic similarity between
the $i$-th and $i'$-th subjects via the $p$ variants in the region, and
any positive semidefinite function $K(G_i, G_{i'})$ can be used as

a kernel function. We tailored several useful and commonly used

kernels specifically for the purpose of rare-variant analysis: the

weighted linear kernel, the weighted quadratic kernel, and the

weighted identity by state (IBS) kernel.

The weighted linear kernel function $K(G_i, G_{i'}) = \sum_{j=1}^{p} w_j G_{ij} G_{i'j}$
implies that the trait depends on the variants in a linear fashion
and is equivalent to the classical linear and logistic models presented
in Equations 1 and 2. The weighted quadratic kernel
$K(G_i, G_{i'}) = (1 + \sum_{j=1}^{p} w_j G_{ij} G_{i'j})^2$ implicitly assumes that the model
depends on the main effects and quadratic terms for the gene
variants and the first-order variant by variant interactions. The
weighted IBS kernel, $K(G_i, G_{i'}) = \sum_{j=1}^{p} w_j \mathrm{IBS}(G_{ij}, G_{i'j})$, defines similarity
between individuals as the number of alleles shared
IBS. For additively coded autosomal genotype data, $K(G_i, G_{i'}) = \sum_{j=1}^{p} w_j (2 - |G_{ij} - G_{i'j}|)$. The model implied by the weighted IBS
kernel models the SNP effects nonparametrically.31 Consequently,
this allows for epistatic effects because the function $f(\cdot)$ does not
assume linearity or interactions of a particular order (e.g., the
second order). Using the weighted IBS kernel removes the assumption
of additivity because the number of alleles that are identical

by state is a physical quantity that does not change on the basis

of different genotype encodings.
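The three kernels can be written down directly from their definitions. The sketch below assumes additively coded genotypes (0, 1, or 2 minor alleles) stored as plain Python lists; the function names are illustrative and are not taken from the SKAT software.

```python
# Sketch of the weighted linear, quadratic, and IBS kernels for two subjects.
# gi, gk: genotype vectors (0/1/2 coding); w: per-variant weights w_j.

def weighted_linear_kernel(gi, gk, w):
    """K = sum_j w_j * G_ij * G_kj."""
    return sum(wj * a * b for wj, a, b in zip(w, gi, gk))

def weighted_quadratic_kernel(gi, gk, w):
    """K = (1 + sum_j w_j * G_ij * G_kj)^2."""
    return (1.0 + weighted_linear_kernel(gi, gk, w)) ** 2

def weighted_ibs_kernel(gi, gk, w):
    """K = sum_j w_j * (2 - |G_ij - G_kj|): weighted count of alleles shared IBS."""
    return sum(wj * (2 - abs(a - b)) for wj, a, b in zip(w, gi, gk))

def kernel_matrix(G, w, kernel):
    """n x n matrix of pairwise similarities for all subjects in G."""
    return [[kernel(gi, gk, w) for gk in G] for gi in G]

# Toy example (hypothetical genotypes, flat weights).
g1, g2 = [0, 1, 2], [0, 2, 2]
w = [1.0, 1.0, 1.0]
k_lin = weighted_linear_kernel(g1, g2, w)      # 0*0 + 1*2 + 2*2 = 6
k_quad = weighted_quadratic_kernel(g1, g2, w)  # (1 + 6)^2 = 49
k_ibs = weighted_ibs_kernel(g1, g2, w)         # 2 + 1 + 2 = 5
K = kernel_matrix([g1, g2], w, weighted_ibs_kernel)
```

Any of these functions can be passed to `kernel_matrix`, which mirrors the point that only the kernel changes while the rest of the test stays the same.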

We note that a kernel function that better captures both the

similarity between individuals and the causal variant effects will

increase power. In particular, if relationships are linear and no

interactions are present, then the weighted linear kernel will

have highest power. If interactions are present, the weighted

quadratic and weighted IBS kernels can increase power. Our expe-

rience suggests using the IBS kernel when the number of interact-

ing variants within the region is modest. As our understanding of

genetic architecture improves so too will our knowledge of which

kernel to use.

In each of the above kernels, $w_j$ is an allele-specific weight that
controls the relative importance of the $j$-th variant and might be
a function of factors such as allele frequency or anticipated functionality.
Without prior information, we suggest using the weights
$\sqrt{w_j} = \mathrm{Beta}(\mathrm{MAF}_j; 1, 25)$ suggested earlier. However, if prior information
is available (for example, some variants are predicted to be
functional or damaging via Polyphen32 or Sift33), weights can be
selected to upweight variants that are likely to be functional.
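As a small illustration of the default weights, the sketch below evaluates the Beta(1, 25) density at a variant's MAF and squares it to obtain $w_j$. It uses only the standard library; the function names are illustrative. For these parameters the density reduces to 25(1 − MAF)^24.

```python
import math

def beta_density(x, a, b):
    """Density of a Beta(a, b) distribution at x, for 0 < x < 1."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(x) + (b - 1) * math.log(1 - x))

def default_weight(maf, a=1.0, b=25.0):
    """w_j such that sqrt(w_j) = Beta(MAF_j; a, b)."""
    return beta_density(maf, a, b) ** 2

sqrt_w_rare = beta_density(0.01, 1.0, 25.0)   # equals 25 * 0.99**24
weights = [default_weight(m) for m in (0.001, 0.01, 0.1)]
```

The weights fall steeply as MAF rises, so rare variants dominate the statistic while common variants still receive a small nonzero weight.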

To test for the effects of gene variants in a region on a phenotype,

one tests the null hypothesis $H_0$: $f(G) = 0$. SKAT tests this null
hypothesis by assuming that the $n \times 1$ vector $f = [f(G_1), \ldots, f(G_n)]'$ of
the genetic effects of the $n$ subjects follows a distribution with mean
zero and covariance $\tau K$, where $\tau$ is a variance component that
indexes the effects of the variants.29,30 Hence, we can test the
null hypothesis by testing $H_0$: $\tau = 0$ with a variance-component
score test. In particular, we simply replace K in

Equation 3 by using the K discussed in this section, for example,

the weighted IBS kernel, for epistatic effect. All subsequent calcu-

lations for computing a p value remain the same.

Because the SKAT evaluates significance via a score test, which

operates under the null hypothesis, the SKAT is valid (in terms

of protecting type I error) irrespective of the kernel and the

weights used. Good choices of the kernel and the weights simply

increase power.

Planning New Sequencing-Based Association Studies: Estimation of Power and Sample Size

Power and sample-size calculations are important in designing
sequencing studies of complex traits. Using a modification of
the higher-order moment-approximation method,34 we provide
an analytic method to carry out such calculations efficiently for
SKAT.35 Specifically, for a fixed sample size and $\alpha$ level, given a prior

hypothesis on the genetic architecture of a particular region, the

effect size, and the proportion and number of causal variants

within a region, our method provides the power to detect the

region as significant with SKAT. Similarly, if the desired power is

fixed, the approach can be used to find the necessary sample size.

There are key differences between the power and sample-size

estimation for single-variant- and region (set)-based tests. For

a region (set)-based test, the power depends strongly on the under-

lying genetic architecture, and its estimation requires modeling

this genetic architecture and the linkage disequilibrium (LD)

between variants. Therefore, to estimate power to detect a partic-

ular region as associated with a phenotype requires specification

of the significance level, sample size, which variants in the region

are causal with corresponding effect size, and the LD structure of

the variants in the region. Ideally, one could use prior data to

assess the LD and MAF. Because prior data can be difficult to

obtain, we currently recommend the use of either 1000 Genomes

Project data36 or data simulated under a population genetics

model.37 Relevant preliminary data will become increasingly

available as sequencing studies become more common.

Our SKAT software uses simulated data based on the coalescent

population genetic model (released with the software package) as

a default in performing sample-size and power calculations, and

instead of directly specifying the effects of any given variant, the

user can input an MAF threshold for determining which variants

are regarded as rare and also a proportion determining how many

of the rare variants are causal. The causal variants are then randomly

selected from the alleles with true MAF (based on simulated or

preliminary data) less than the threshold. The magnitudes of the

effects $|\beta_j|$ for causal variants are set to be equal to $c \times |\log_{10} \mathrm{MAF}_j|$, where $c$ is determined on the basis of the maximum effect size the
user would like to allow (described below in the power simulations
section) at $\mathrm{MAF} = 10^{-4}$. This allows the effects of causal variants to
decrease as MAF increases. Because these parameters can be difficult to
choose a priori, power and sample size can be reasonably estimated

by averaging results over a range of parameter values. Similarly,

because the regional architecture can vary across different regions,

for genome-wide studies, one can average over multiple randomly

selected regions as currently implemented in the SKAT software.
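The selection of causal variants and the assignment of effect magnitudes described above can be sketched as follows. This is an illustrative reading of the text, not the code of the SKAT software: the function name, the rounding of the number of causal variants, and the fixed random seed are all assumptions.

```python
import math
import random

def assign_effect_magnitudes(mafs, maf_threshold, prop_causal, max_effect, seed=0):
    """Pick causal variants at random among those with MAF below the threshold
    and set |beta_j| = c * |log10(MAF_j)|, with c chosen so that
    |beta| = max_effect at MAF = 1e-4. Non-causal variants get 0."""
    rng = random.Random(seed)
    c = max_effect / abs(math.log10(1e-4))
    rare = [j for j, maf in enumerate(mafs) if maf < maf_threshold]
    n_causal = round(prop_causal * len(rare))       # rounding rule is an assumption
    causal = set(rng.sample(rare, n_causal))
    return [c * abs(math.log10(mafs[j])) if j in causal else 0.0
            for j in range(len(mafs))]

# Hypothetical region: 20 rare variants (MAF 0.1%) and 20 common ones (MAF 10%).
mafs = [0.001] * 20 + [0.1] * 20
magnitudes = assign_effect_magnitudes(mafs, maf_threshold=0.03,
                                      prop_causal=0.10, max_effect=1.6)
```

Averaging power over several seeds, thresholds, and causal proportions corresponds to the averaging over parameter values suggested in the text.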

Numerical Experiments and Simulations

To validate SKAT in terms of protecting type I error and to assess its
power compared to burden tests and the accuracy of our power
and sample-size tools, we carried out simulation studies under
a range of configurations. For all simulations, we determined
sequence genotypes by simulating 10,000 chromosomes for a
1 Mb region on the basis of a coalescent model that mimics the
LD pattern, local recombination rate, and population history
of Europeans by using COSI.37

Type I Error Simulations

To investigate whether SKAT preserves the desired type I error rate
at the near genome-wide threshold level, for example $\alpha = 10^{-6}$, it
is necessary to conduct simulations with hundreds of millions of
simulated data sets. Although SKAT is computationally efficient,
generating such a large number of data sets is challenging. To
reduce the computational burden, we took the following approach.
Using 10,000 randomly selected 30 kb subregions within
a 1 Mb chromosome, we first generated 10,000 sets of genotypes
$G$ ($n \times p$) from the coalescent model, with $p$ variants on $n$ subjects.
Then, for each of the 10,000 simulated genotype data sets, we
simulated 10,000 sets of continuous phenotypes, so that we
obtained $10^8$ individual genotype-phenotype data sets,
by using the model

$$y = 0.5 X_1 + 0.5 X_2 + \varepsilon,$$

where $X_1$ is a continuous covariate generated from a standard
normal distribution, $X_2$ is a dichotomous covariate taking values
0 and 1 with a probability of 0.5, and $\varepsilon$ follows a standard normal

distribution. Note that the continuous trait values are not related

to the genotype so that the null model holds. The 30 kb regions on


which the genotype values are based contained 605 variants on

average, but the number of observed variants for any given data

set was considerably less and depended on the sample size n,

which we set to 500, 1000, 2500, and 5000.
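The null phenotype model for continuous traits is simple enough to sketch directly. The code below is an illustration under the stated model only; the function name and the seed are assumptions, and genotype simulation (done with COSI in the study) is not reproduced.

```python
import random

def simulate_null_continuous(n, rng):
    """Draw covariates and a null phenotype y = 0.5*X1 + 0.5*X2 + eps.
    X1 ~ N(0, 1); X2 ~ Bernoulli(0.5); eps ~ N(0, 1). No genotype term,
    so the null hypothesis of no genetic effect holds by construction."""
    x1 = [rng.gauss(0.0, 1.0) for _ in range(n)]
    x2 = [1 if rng.random() < 0.5 else 0 for _ in range(n)]
    y = [0.5 * a + 0.5 * b + rng.gauss(0.0, 1.0) for a, b in zip(x1, x2)]
    return x1, x2, y

rng = random.Random(1)
x1, x2, y = simulate_null_continuous(1000, rng)
mean_y = sum(y) / len(y)   # expected value is 0.5 * 0 + 0.5 * 0.5 = 0.25
```

Repeating the call with the same genotype matrix held fixed gives the resampled-phenotype design used to reach a very large number of null data sets.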

We repeated the type I error simulations for dichotomous

phenotypes as above, except the dichotomous outcomes were

generated via the model:

$$\operatorname{logit} P(y = 1) = \alpha_0,$$

where $\alpha_0$ was determined to set the prevalence to 1%, and case-control sampling was used.
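A minimal sketch of this null case-control design follows. With no covariates in the model, $\alpha_0$ is the logit of the prevalence. The sampling loop (draw subjects from the population until the required numbers of cases and controls are filled) is one plausible way to implement case-control sampling and is an assumption, as are the function name and the equal case and control counts in the example.

```python
import math
import random

def sample_null_case_control(n_cases, n_controls, prevalence=0.01, seed=0):
    """Draw disease status from logit P(y = 1) = alpha0 and keep subjects
    until n_cases cases and n_controls controls have been collected."""
    rng = random.Random(seed)
    alpha0 = math.log(prevalence / (1.0 - prevalence))   # logit(prevalence)
    p_case = 1.0 / (1.0 + math.exp(-alpha0))
    cases, controls = [], []
    subject_id = 0
    while len(cases) < n_cases or len(controls) < n_controls:
        is_case = rng.random() < p_case
        if is_case and len(cases) < n_cases:
            cases.append(subject_id)
        elif not is_case and len(controls) < n_controls:
            controls.append(subject_id)
        subject_id += 1
    return alpha0, cases, controls

alpha0, cases, controls = sample_null_case_control(50, 50)
```

At 1% prevalence roughly 100 population draws are needed per case, which is why case-control sampling, rather than random sampling, is used for rare outcomes.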

For both continuous and dichotomous simulations, we applied

SKAT by using the default weighted linear kernel to each of the $10^8$
data sets and estimated the empirical type I error rate as the
proportion of p values less than $\alpha = 10^{-4}$, $10^{-5}$, or $10^{-6}$.

We note that the estimated type I error from this approach is

not the same as the empirical type I error when genotypes are

generated randomly for each simulation, because for each of the

10,000 genotype data sets, only the outcomes are resampled.

However, our type I error estimator is still unbiased and results

in very accurate type I error estimates. For larger a levels (0.05

and 0.01), we directly computed the empirical type I error rate

by using data sets in which genotypes were randomly generated

for each simulation.

Empirical Power Simulations

We simulated data sets in which 30 kb subregions were randomly

selected from the generated 1 Mb chromosomes and used to

create causal variants and a phenotype variable as well as additional
simulated covariates. We generated continuous phenotypes by

$$y = 0.5 X_1 + 0.5 X_2 + \beta_1 G^c_1 + \beta_2 G^c_2 + \cdots + \beta_s G^c_s + \varepsilon,$$

where $X_1$, $X_2$, and $\varepsilon$ are as defined for the type I error simulations,
$G^c_1, G^c_2, \ldots, G^c_s$ are the genotypes of the $s$ causal rare variants (a
randomly selected subset of the simulated rare variants, for
example 5% of variants that have MAF < 3% in Figure 1), and
the $\beta$s are effect sizes for the causal variants. Similarly, we

[Figure 1 plots: power (0.0 to 1.0) against total sample size (0.5k, 1k, 2.5k, 5k) for SKAT, SKAT_M, rSKAT, W, N, and C. Columns: β +/− = 100/0, 80/20, and 50/50. Top row: continuous trait; bottom row: dichotomous trait.]

Figure 1. Simulation-Study-Based Power Comparisons of SKAT and Burden Tests
Empirical power at $\alpha = 10^{-6}$ under an assumption that 5% of the rare variants with MAF < 3% within random 30 kb regions were causal. Top panel: continuous phenotypes with maximum effect size ($|\beta|$) equal to 1.6 when $\mathrm{MAF} = 10^{-4}$; bottom panel: case-control studies with maximum OR = 5 when $\mathrm{MAF} = 10^{-4}$. Regression coefficients for the $s$ causal variants were assumed to be a decreasing function of MAF as $|\beta_j| = c\,|\log_{10} \mathrm{MAF}_j|$ ($j = 1, \ldots, s$ [see Figure S2]), where $c$ was chosen to result in these maximum effect sizes. From left to right, the plots consider settings in which the coefficients for the causal rare variants are 100% positive (0% negative), 80% positive (20% negative), and 50% positive (50% negative). Total sample sizes considered are 500, 1000, 2500, and 5000, with half being cases in case-control studies. For each setting, six methods are compared: SKAT, SKAT in which 10% of the genotypes were set to missing and then imputed (SKAT_M), restricted SKAT (rSKAT) in which unweighted SKAT is applied to variants with MAF < 3%, the weighted sum burden test (W) with the same weights as used by SKAT, counting-based burden test (N), and the CAST method (C). All the burden tests used MAF < 3% as the threshold. For each method, power was estimated as the proportion of p values < $\alpha$ among 1000 simulated data sets.


generated dichotomous phenotypes for case-control data under
the logistic model

$$\operatorname{logit} P(y = 1) = \alpha_0 + 0.5 X_1 + 0.5 X_2 + \beta_1 G^c_1 + \beta_2 G^c_2 + \cdots + \beta_s G^c_s,$$

where $G^c_1, G^c_2, \ldots, G^c_s$ are again the genotypes for the causal rare
variants and the $\beta$s are log ORs for the causal variants. We controlled
prevalence by $\alpha_0$ and set it to 1% unless otherwise stated. Under
both models, we set the magnitude of each $\beta_j$ to $c\,|\log_{10} \mathrm{MAF}_j|$ such that rarer variants had larger effects. In the simulation
studies, for continuous traits, $c = 0.4$, which gives the maximum
effect size $|\beta_j| = 1.6$ for variants with $\mathrm{MAF} = 10^{-4}$ and small effects
$|\beta_j| = 0.28$ for MAF = 0.2. For dichotomous traits, $c = \ln 5 / 4 = 0.402$, which gives the "maximum" OR = 5.0 ($|\beta_j| = \ln 5$) for variants
with $\mathrm{MAF} = 10^{-4}$ and smaller OR = 1.32 for MAF = 0.2. The
effect size curves are given in Figure S2.
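The effect sizes quoted above follow from the formula $|\beta_j| = c\,|\log_{10} \mathrm{MAF}_j|$ by direct arithmetic, which the short sketch below reproduces. Only the function name is introduced here; the constants come from the text.

```python
import math

def effect_magnitude(maf, c):
    """|beta| = c * |log10(MAF)|."""
    return c * abs(math.log10(maf))

c_continuous = 0.4
c_dichotomous = math.log(5) / 4          # about 0.402

beta_rare = effect_magnitude(1e-4, c_continuous)               # maximum effect, 1.6
beta_common = effect_magnitude(0.2, c_continuous)              # about 0.28
or_rare = math.exp(effect_magnitude(1e-4, c_dichotomous))      # maximum OR, 5.0
or_common = math.exp(effect_magnitude(0.2, c_dichotomous))     # about 1.32
```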

We compared SKAT, an unsupervised variation on the WST13

that uses weighted-count-based collapsing, counting-based

collapsing,18 and CAST.14 For each of these tests, we considered

variants with observed MAF < 3% as rare: CAST collapses
on the basis of whether an individual carries any variant with
allele frequency < 3%, the counting method counts the number of
variants with MAF < 3%, and the weighted count inflates the

contribution of each rare variant by multiplying the genotype

with the same beta-density-based weights as used in SKAT.
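The three collapsing schemes differ only in how one individual's rare-variant genotypes are reduced to a single score. The sketch below illustrates that step under stated assumptions: the counting score is taken to be the number of rare minor alleles carried (counting variant sites carried would be an equally plausible reading of the text), the weights are passed in rather than computed, and the regression of the phenotype on the score is omitted. All names are illustrative.

```python
def rare_indices(mafs, threshold=0.03):
    """Indices of variants whose observed MAF is below the threshold."""
    return [j for j, maf in enumerate(mafs) if maf < threshold]

def cast_score(genotype, rare):
    """CAST-style indicator: 1 if the individual carries any rare variant."""
    return 1 if any(genotype[j] > 0 for j in rare) else 0

def count_score(genotype, rare):
    """Counting-based collapsing: number of rare minor alleles carried."""
    return sum(genotype[j] for j in rare)

def weighted_count_score(genotype, rare, weights):
    """Weighted count: each rare genotype is multiplied by its weight."""
    return sum(weights[j] * genotype[j] for j in rare)

# Hypothetical example: two rare variants and one common variant.
mafs = [0.01, 0.02, 0.20]
weights = [4.0, 2.0, 1.0]
rare = rare_indices(mafs)

carrier = [1, 1, 2]
non_carrier = [0, 0, 2]
scores_carrier = (cast_score(carrier, rare),
                  count_score(carrier, rare),
                  weighted_count_score(carrier, rare, weights))
scores_non_carrier = (cast_score(non_carrier, rare),
                      count_score(non_carrier, rare),
                      weighted_count_score(non_carrier, rare, weights))
```

Because every score sums genotypes before testing, variants with opposite effects can cancel, which is the sensitivity to effect direction discussed in the power results.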

To accommodate missing genotypes commonly observed in

sequence data, we considered the effect of imputing missing

values by randomly setting 10% of the genotypes as missing,

imputing genotypes on the basis of observed allele frequencies

and Hardy-Weinberg equilibrium, and then applying SKAT to

the imputed data. We also performed restricted SKAT (rSKAT) by

applying unweighted SKAT to rare variants with MAF < 3%.

Note that for dichotomous phenotypes, rSKAT is essentially equiv-

alent to a covariate adjusted C-alpha test with the p value calcu-

lated analytically instead of via permutation. For each of the

methods, power was estimated as the proportion of p values < $\alpha$,
where $\alpha = 10^{-6}$ to mimic genome-wide studies.
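The missing-genotype experiment relies on a simple imputation based on observed allele frequencies and Hardy-Weinberg equilibrium. One way to sketch it is to draw each missing genotype as two independent alleles with the observed minor-allele frequency; whether the study drew genotypes at random or used the expected dosage 2 × MAF is not stated, so the random draw is an assumption, as are the function name and the use of `None` for missing values.

```python
import random

def impute_hwe(G, seed=0):
    """Return a copy of G in which each missing genotype (None) is replaced by
    a draw from Binomial(2, maf_j), where maf_j is the allele frequency
    estimated from the observed genotypes of variant j."""
    rng = random.Random(seed)
    n, p = len(G), len(G[0])
    imputed = [row[:] for row in G]
    for j in range(p):
        observed = [G[i][j] for i in range(n) if G[i][j] is not None]
        maf = sum(observed) / (2 * len(observed)) if observed else 0.0
        for i in range(n):
            if imputed[i][j] is None:
                # Two independent alleles, each the minor allele with probability maf.
                imputed[i][j] = int(rng.random() < maf) + int(rng.random() < maf)
    return imputed

# Hypothetical genotype matrix with missing entries.
G = [[0, None, 0],
     [1, 0, 0],
     [None, 0, 0],
     [0, 2, None],
     [1, 0, 0]]
G_imputed = impute_hwe(G)
```

For rare variants the estimated frequency is small, so nearly all imputed genotypes are 0, consistent with the small power loss reported for imputed data.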

Power and Sample-Size Formulae

To demonstrate the utility and accuracy of our power and sample-

size calculation method, we conducted several numerical experi-

ments. We first illustrated the use of the methods by computing

the sample size necessary to detect a 30 kb region with 5% of

the variants with MAF < 3% being causal. We assumed that effect size
(OR) increases with decreasing MAF and sought 80% power at
significance levels $\alpha = 10^{-6}$, $10^{-3}$, and $10^{-2}$, corresponding to approximate
genome-wide sequencing significance and candidate-gene
sequencing studies of 50 and five genes, respectively. We considered
both continuous and dichotomous traits.

To show that the power estimated from our sample-size formula

is accurate, we compared empirical power for SKAT under simula-

tions to power estimated via our analytic method. Specifically, we

simulated continuous and case-control data under the same

setting as that used in the power simulations, and we estimated

power as a function of the sample size by computing the proportion
of p values < $\alpha = 10^{-6}$ and compared the empirical power

curve to the power estimated by using our analytical method.

Results

Simulation of the Type I Error

The empirical type I error rates estimated for SKAT are presented
in Table 1 for $\alpha = 10^{-4}$, $10^{-5}$, and $10^{-6}$ and suggest that
the type I error rate is protected for continuous phenotypes,
though for smaller sample sizes SKAT can be
slightly conservative. For dichotomous phenotypes, SKAT
is conservative for smaller sample sizes and very small
$\alpha$ levels. Additional results from simulations of the type I
error for SKAT and the competing methods are presented
in Figure S3 for both continuous traits and dichotomous
traits and show that at larger $\alpha$ levels, all of the considered
tests correctly control type I error at the $\alpha = 0.05$ and 0.01 levels. These
results show that SKAT is a valid method, and despite being
conservative at low $\alpha$ levels, SKAT maintains good power
relative to existing methods (see below). However, if
sample sizes are small or sharp control of type I error is
necessary, then standard permutation-based procedures
can be used to generate a Monte Carlo p value for significance,
though this can be computationally expensive.
Moreover, standard permutation does not work in the presence of
covariates (for example, when controlling for population
stratification) and requires careful
modifications.
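A standard permutation procedure for the covariate-free case can be sketched as follows. The statistic used here is the weighted sum of squared score statistics under an intercept-only null model; the function names, the number of permutations, and the add-one correction in the p value are illustrative choices, not details taken from the text. The sketch does not apply when covariates are present.

```python
import random

def q_statistic(G, y, w):
    """Q = sum_j w_j * (g_j' (y - mean(y)))^2, intercept-only null model."""
    n = len(y)
    mean_y = sum(y) / n
    r = [v - mean_y for v in y]
    return sum(wj * sum(G[i][j] * r[i] for i in range(n)) ** 2
               for j, wj in enumerate(w))

def permutation_p_value(G, y, w, n_perm=999, seed=0):
    """Monte Carlo p value: shuffle phenotypes, recompute Q, and count
    permuted statistics at least as large as the observed one."""
    rng = random.Random(seed)
    q_obs = q_statistic(G, y, w)
    y_perm = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(y_perm)
        if q_statistic(G, y_perm, w) >= q_obs:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Hypothetical data: one variant carried by the first 10 of 40 subjects,
# with a strong effect on the phenotype.
G = [[1] if i < 10 else [0] for i in range(40)]
y = [3.0 * G[i][0] + 0.01 * i for i in range(40)]
p_value = permutation_p_value(G, y, [1.0], n_perm=999, seed=1)
```

The smallest attainable p value is 1 / (n_perm + 1), so reaching levels such as $10^{-6}$ requires on the order of millions of permutations, which is the computational cost noted above.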

Statistical Power of SKAT and Competing Methods

We compared the power of SKAT with three burden tests

in a series of simulation studies for both continuous traits

and dichotomous traits by generating sequence data

in randomly selected 30 kb regions with a coalescent

model.37 For our primary power simulation, within each

region, 5% of variants with population MAF < 3% were

randomly chosen as causal, the effect size of causal variants

was a decreasing function of MAF, and 50%–100% of the
causal variants were positively associated with the trait

Table 1. Type I Error Estimates of SKAT Aimed at Testing an Association between Randomly Selected 30 kb Regions with a Continuous Trait at Type I Error Rates as Low as the Genome-wide $\alpha = 10^{-6}$ Level

                        Continuous Phenotypes                      Dichotomous Phenotypes
Total Sample Size (n)   α = 10^-4     α = 10^-5     α = 10^-6     α = 10^-4     α = 10^-5     α = 10^-6
500                     7.4 × 10^-5   6.5 × 10^-6   5.9 × 10^-7   2.2 × 10^-5   1.0 × 10^-6   1.0 × 10^-8
1000                    8.5 × 10^-5   8.2 × 10^-6   8.0 × 10^-7   5.0 × 10^-5   3.5 × 10^-6   2.3 × 10^-7
2500                    9.6 × 10^-5   9.1 × 10^-6   8.4 × 10^-7   7.6 × 10^-5   6.3 × 10^-6   5.6 × 10^-7
5000                    9.8 × 10^-5   9.6 × 10^-6   8.8 × 10^-7   8.9 × 10^-5   7.8 × 10^-6   7.0 × 10^-7

Each entry represents type I error rate estimates as the proportion of p values less than $\alpha$ under the null hypothesis based on $10^8$ simulated phenotypes.


(see Materials and Methods and Figure S2). The simulated
regions for our power analysis contained on average
605 variants (26 causal), of which 530.9 (88%), 502.9
(83%), and 422.8 (70%) had population MAF < 3%, < 1%,
and < 0.1%, respectively. The average allele frequency spectrum
across the samples is similar to that of the Dallas Heart
Study data (Figure S4). Because the majority of variants have

a low MAF, they might not be observed in any particular

sample. The average number of observed variants

(assuming no genotyping error) and the average number

of observed causal variants are presented in Table 2.

For continuous traits, SKAT had much higher power

than all the burden tests, and the weighted count method

tended to outperform the count and CAST methods

(Figure 1). SKAT’s power was robust to the proportion of

causal variants that were positively associated with the

trait, whereas the burden tests suffered substantial loss of

power when causal variants had the opposite effects. The

simulation results examining dichotomous traits were

qualitatively similar in that SKAT dominated the compet-

ing methods. However, here the power of the SKAT

decreased when both protective and harmful variants

were present, although less so than for the burden tests.

The difference in power for SKAT for different proportions

of protective variants is due to the fact that given fixed

population MAFs, protective variants imply negative log

ORs and lower disease risk and hence lower MAFs in cases

and more difficulties in observing rare variants in cases.

The larger decrease in power for the competing methods

is additionally driven by sensitivity to direction of effect

due to aggregation of genotypes. Across all configurations,

using imputed genotypes instead of the true genotype

for 10% missing genotype data led to a very small

reduction in power, despite the use of a very simple

Hardy-Weinberg-based imputation strategy. This is true

in part because most variants are rare.

Note that SKAT increases the weight of rare variants but

does not require thresholding. To show that the superior

performance of SKAT is intrinsic and is not driven by the

particular choice of the weight used, we calculated rSKAT,

which does not weight the rare variants but instead uses

the same threshold as the burden tests. Our results, pre-

sented in Figure 1, show that rSKAT is still substantially

more powerful than all three burden tests.

Power simulation results for other type I error rates ($\alpha = 0.01$, 0.001), lower causal variant frequencies (population

MAF < 1%), and other region sizes (10 kb and 60 kb)

yielded the same conclusions (Figures S5–S8).

In the 30 kb genomic regions considered, reflecting anal-

ysis of genome-wide sequencing data, it is unlikely that

a large proportion of the rare variants are all causal.

However, for exome-scale sequencing, the number of

observed rare variants can be considerably smaller and

the proportion of causal rare variants can be greater.

Hence, we also conducted power simulations for smaller

region sizes (3 kb and 5 kb) and larger proportions of causal

variants (10%, 20%, and 50%). Results for both continuous

and dichotomous phenotypes are presented in Figures S9–

S12 and show that if 50% of the rare variants are causal and
all of the causal variants have effects in the same direction,
then SKAT and rSKAT are less powerful than

collapsing methods, with count-based collapsing having

the greatest power. This result held for both 3 kb and

5 kb regions and is expected because the collapsing

methods implicitly assume that all of the variants are

causal and have unidirectional effects. In all other settings

we considered, SKAT was the most powerful method.

Power and Sample-Size Estimation

To illustrate our power and sample-size calculation

method, in Figure 2 we present the estimated sample-size

curves as a function of maximum effect sizes (ORs for

dichotomous traits) necessary to detect a 30 kb region

with 5% of the variants with MAF < 3% being causal.

Table 3 presents estimated sample sizes for several configu-

rations of practical interest. Additional sample-size curves

when causal variants are rarer (MAF < 1%) or occur more

frequently (10% of variants are causal) or when prevalence

is varied (5%, 0.1%) can be found in Figures S13–S15.

These results show that, for a given region, one will

have more power (and a lower required sample size) to

detect rare causal variants if the percentage of variants

that are causal is higher, the causal rare variants have

higher MAFs and/or larger effect sizes (e.g., odds ratios

[ORs]), and the effects are more consistently in the same

direction. For case-control designs, lower prevalence
yields higher power because, given the same OR and population
MAF, lower prevalence enriches the combined sample of cases and
controls for harmful variants (ORs > 1), which therefore have
higher sample MAFs; that is, for rarer diseases,
harmful rare variants are more likely to be observed.
Conversely, if the prevalence is low, protective variants
(ORs < 1) have lower sample MAFs, and fewer of them are likely to be observed
in the sample.
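The effect of prevalence on how often a variant is seen in a case-control sample can be illustrated with a short calculation. The sketch below is a simplification under stated assumptions: it considers a single variant with carrier status (carrier or non-carrier) rather than allele counts, takes the risk in non-carriers to equal the prevalence (reasonable only for rare variants), and assumes equal numbers of cases and controls. The function name and the example values are hypothetical.

```python
def sample_carrier_frequency(pop_freq, odds_ratio, prevalence, case_fraction=0.5):
    """Carrier frequency in cases, in controls, and in the pooled sample."""
    odds0 = prevalence / (1.0 - prevalence)   # baseline odds (assumption: risk in
    risk0 = prevalence                        # non-carriers equals the prevalence)
    odds1 = odds_ratio * odds0
    risk1 = odds1 / (1.0 + odds1)             # risk in carriers
    pop_risk = pop_freq * risk1 + (1.0 - pop_freq) * risk0
    freq_cases = pop_freq * risk1 / pop_risk                       # Bayes' rule
    freq_controls = pop_freq * (1.0 - risk1) / (1.0 - pop_risk)
    pooled = case_fraction * freq_cases + (1.0 - case_fraction) * freq_controls
    return freq_cases, freq_controls, pooled

# Harmful variant (OR = 5) and protective variant (OR = 0.2), carrier frequency 0.1%.
harmful_rare_disease = sample_carrier_frequency(0.001, 5.0, 0.01)[2]
harmful_common_disease = sample_carrier_frequency(0.001, 5.0, 0.10)[2]
protective_rare_disease = sample_carrier_frequency(0.001, 0.2, 0.01)[2]
protective_common_disease = sample_carrier_frequency(0.001, 0.2, 0.10)[2]
```

Under these assumptions the harmful variant is more frequent in the pooled sample at 1% prevalence than at 10%, whereas the protective variant is less frequent, which matches the direction of the argument in the text.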

We also compared the power and sample-size formulae

estimates to the empirical, simulation-based power esti-

mates for both continuous and dichotomous traits. The

curves plotted in Figure 3 show that the empirical power

is accurately approximated by our analytical formula.

Table 2. Characteristics of the 30 kb Region Data Sets Used in the Simulation Studies

Average Number of Observed Variants      Sample Size (n)
                                         500     1000    2500    5000
All traits*                              255     330     438     512
Continuous trait**                       9.6     13.3    18.6    22.3
Dichotomous trait (β +/− = 100/0)**      14.4    18.7    23.5    25.2
Dichotomous trait (β +/− = 80/20)**      13.3    17.1    22.0    24.3
Dichotomous trait (β +/− = 50/50)**      11.1    14.9    19.7    22.6

The number of observed variants (*) and the number of observed causal variants (**) within the region are averaged over the 1000 simulated data sets.


Application to Dallas Heart Study Data

We analyzed sequence data on 93 variants in ANGPTL3

(MIM 604774), ANGPTL4 (MIM 605910), and ANGPTL5

(MIM 607666) in 3476 individuals from the Dallas Heart

Study38 to test for association between log-transformed

serum triglyceride (logTG) levels and rare variants in these

genes. We adjusted for sex and ethnicity (black, Hispanic,

or white) but did not adjust for age as a large number of

subjects have missing ages. In addition to testing for asso-

ciation via SKAT and the three burden tests considered

earlier, we also applied the permutation-based varying-

threshold method (VT) and the Polyphen-score-adjusted

VT (VTP),16 which are based on the residuals obtained

from regressing the phenotype on the covariates and

assume gene-covariate independence. Because VT and

VTP require permutation, they are computationally expen-

sive when applied genome wide. For VTP, we used the

Polyphen score for rare variants (MAF < 0.01) and assigned
a constant score of 0.5 to all other variants. We also
analyzed a dichotomized phenotype based on the highest and
lowest quartiles of each of the six sex-ethnicity groups

(Table 4).

Table 3. Required Total Sample Size to Achieve 80% Power to Detect Rare Variants Associated with a Continuous or Dichotomous Case-Control Phenotype at the Genome-wide Level $\alpha = 10^{-6}$

Total Sample Size                        Maximum β = 1.6 / Maximum OR = 5    Maximum β = 1.9 / Maximum OR = 7
                                         5% Causal     10% Causal            5% Causal     10% Causal
Continuous trait                         5,990         1,800                 4,260         1,290
Dichotomous trait with prevalence 10%    15,120        4,810                 9,650         3,120
Dichotomous trait with prevalence 1%     12,030        3,870                 7,010         2,290

Power was estimated via the analytical formulae assuming 5% or 10% of variants with MAF < 3% are causal. Regression coefficients for the $s$ causal variants were assumed to be a decreasing function of MAF, $|\beta_j| = c\,|\log_{10} \mathrm{MAF}_j|$ ($j = 1, \ldots, s$), where 80% of $\beta_j$'s are positive and 20% are negative; see Figure S2. Required total sample sizes (cases and controls) are given for different "maximum" effect sizes (or ORs) when $\mathrm{MAF} = 10^{-4}$ and different prevalences for case-control studies. Estimated sample sizes were averaged over 100 random 30 kb regions.

[Figure 2 plots: required total sample size (0 to 10,000) against maximum effect size, with curves for α = 10^-6, 10^-3, and 10^-2. Columns: β +/− = 100/0, 80/20, and 50/50. Top row: continuous trait (max β from 1.4 to 2.2); bottom row: dichotomous trait (max OR from 5 to 11).]

Figure 2. Sample Sizes Required for Reaching 80% Power
Analytically estimated sample sizes required for reaching 80% power to detect rare variants associated with a continuous (top panel) or dichotomous phenotype in case-control studies (half are cases) (bottom panel) at the $\alpha = 10^{-6}$, $10^{-3}$, and $10^{-2}$ levels, under the assumption that 5% of rare variants with MAF < 3% within the 30 kb regions are causal. Plots correspond to 100%, 80%, and 50% of the causal variants associated with increase in the continuous phenotype or risk of the dichotomous phenotype. Regression coefficients for the $s$ causal variants were assumed to be the same decreasing function of MAF as that in Figure 1. Required total sample sizes are plotted against the maximum effect sizes (ORs) when $\mathrm{MAF} = 10^{-4}$. Estimated total sample sizes were averaged over 100 random 30 kb regions.


SKAT was by far the most powerful test for the dichotomous
trait. For the continuous trait, SKAT had a much smaller
p value than two burden methods (CAST and WST) and
VT and a slightly higher p value than the counting-
based burden test (N) and VTP. Note that SKAT was easier

to apply because it did not require prior functional infor-

mation (available for only a subset of variants) or permuta-

tion, and it adjusted for covariates without assuming gene-

covariate independence.

Computation Time

The computation time for the SKAT depends on the

sample size and the number of markers. To analyze a 30 kb

region sequenced on 1000, 2500, or 5000 individuals,

SKAT required 0.21, 0.73, and 2.3 s, respectively, for

continuous traits and ~20% longer for dichotomous traits,

on a 2.33 GHz laptop with 6 GB of memory. Analyzing
300 kb, 3 Mb, or 3 Gb (the entire genome) on 1000 individuals
requires 2.5 s, 25 s, and 7 hr, respectively.
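The three timings are consistent with run time growing in proportion to the amount of sequence analyzed, as the following back-of-envelope check shows. Linear scaling is an assumption made here for illustration; only the timings themselves come from the text.

```python
# Extrapolate from the reported 2.5 s per 300 kb (1000 individuals),
# assuming run time is proportional to the length of sequence analyzed.
SECONDS_PER_300KB = 2.5

def extrapolated_seconds(region_bp):
    return SECONDS_PER_300KB * region_bp / 300_000

three_mb_seconds = extrapolated_seconds(3_000_000)            # 25 s, as reported
genome_hours = extrapolated_seconds(3_000_000_000) / 3600.0   # about 6.9 hr
```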

Discussion

We propose SKAT as a supervised, flexible, and computationally
efficient statistical method that tests for association

between a continuous or dichotomous phenotype and rare

and common genetic variants in sequencing-based associa-

tion studies. We demonstrate that SKAT’s power is greater

than that of several burden tests over a range of genetic

models. Furthermore, we have developed analytical power

and sample-size calculations for SKAT that assist in

designing sequencing-based association studies.

[Figure 3 plots: power (0.0 to 1.0) against total sample size (2000 to 10,000), theoretical versus empirical curves. Panels: continuous trait and dichotomous trait.]

Figure 3. Power Comparisons Based on Simulation and Analytic Estimation
Power as a function of total sample size estimated by simulation with 1000 replicates and by the proposed power formula for continuous and dichotomous case-control traits. Simulation configurations correspond to those used in Figure 1, in which 80% of the regression coefficients for the causal rare variants were positive.

Table 4. Analysis of the Dallas Heart Study Sequencing Data

                         SKAT          C             N             W             VT(a)         VTP(a)
Continuous TG level      9.5 × 10^-5   1.9 × 10^-3   7.2 × 10^-5   2.3 × 10^-4   3.5 × 10^-4   2.0 × 10^-5
Dichotomized TG level    1.3 × 10^-4   3.2 × 10^-2   2.2 × 10^-3   3.1 × 10^-3   8.6 × 10^-3   2.1 × 10^-3

Analysis of the Dallas Heart Study sequencing data with SKAT, the weighted sum burden test (W), the counting-based burden test (N), the CAST method (C), the varying-threshold method (VT), and the Polyphen-score adjusted VT (VTP) method. Beta(1, 25) is used as the weight in the SKAT and the weighted sum test.
(a) p values are estimated on the basis of $10^6$ permutations.

Like burden tests, SKAT performs region-based testing. However, SKAT
has several major advantages over the existing tests. As a supervised method,
SKAT directly performs multiple regression
of a phenotype on genotypes for all variants in

the region, adjusting for covariates. Hence, as with conven-

tional multiple regression models, neither directionality

nor magnitudes of the associations are assumed a priori

but are instead estimated from the data. To test efficiently

for the joint effects of rare variants in the region on the

phenotype, SKAT assumes a distribution for the regression

coefficients of the markers, whose variances depend on

flexible weights. SKAT performs a score-based variance-

component test, whose calculation only requires fitting

the null model by regressing phenotypes on covariates

alone and computing p values analytically. The flexible

regression framework also accommodates epistatic

effects.
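As a rough illustration of the score-based statistic described above, the sketch below fits the null model (phenotype on covariates only) for a continuous trait and forms a quadratic form in the residuals with a weighted linear kernel. It assumes a statistic of the form Q = (y − μ̂)′K(y − μ̂) with K = G W G′, where W is diagonal; the function name `skat_q_continuous` and the argument `sqrt_w` (square roots of the variant weights) are hypothetical, and this is not the authors' software.

```python
import numpy as np

def skat_q_continuous(y, X, G, sqrt_w):
    """Quadratic-form statistic Q = r' G W G' r, where r are null-model residuals.

    y: (n,) continuous phenotype; X: (n, m) covariates;
    G: (n, p) genotypes; sqrt_w: (p,) square roots of the variant weights.
    """
    n = len(y)
    X1 = np.column_stack([np.ones(n), X])            # intercept plus covariates
    coef, *_ = np.linalg.lstsq(X1, y, rcond=None)    # null model: no genotype term
    resid = y - X1 @ coef                            # y minus fitted null mean
    scores = (G * sqrt_w).T @ resid                  # sqrt(w_j) * G_j' r for each variant
    return float(scores @ scores)                    # sum_j w_j * (G_j' r)^2
```

Only the null model is fitted, so the same residuals can be reused for every region tested in a genome-wide scan. Setting `sqrt_w` to a vector of ones gives the unweighted linear kernel K = GG′.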

Besides region-based analysis, SKAT can also be applied

to any biologically meaningful SNP set. As SKAT is a regres-

sion-based method, it can be easily extended to survival,

and longitudinal and multivariate phenotypes and hence

provides a comprehensive framework for a wide variety

of sequencing-based association studies.

The ability to obtain a p value directly without the need

for permutation is an attractive feature of SKAT, and allows

for rapid estimation of p values in exome and genome-

wide sequencing studies. Our simulations showed that

for continuous phenotypes, the p values are accurate

when the sample size is moderate or large; for dichoto-

mous phenotypes, the p values are conservative at lower

α levels (e.g., < 10⁻⁴) if the sample size is modest or

small. Permutation can be used to obtain a more accurate

estimate in the absence of covariates. In the presence of

covariates, for example population stratification, standard


permutations fail and require careful modifications.

Despite the conservative nature of the score test, SKAT

often still has higher power than competing methods at

small α levels.

SKAT can be combined with collapsing strategies to form

a hybrid testing approach. If most of the variants within

a range of allele frequencies are causal and have the same

directionality (i.e., under settings that are optimal for

burden-based tests), collapsing these variants and then

applying SKAT to the collapsed variants can improve

power. For example, because singletons are common in

sequencing studies (57 of 93 variants in the Dallas Heart

Study data), a possible hybrid strategy is to first collapse

all of the singletons into a single value and then apply

SKAT to the collapsed value and the other 36 variants.

Compared to the original SKAT, this strategy gives a slightly

lower p value, 3.1 × 10⁻⁵, for the continuous trait and

a slightly higher p value, 1.6 × 10⁻⁴, for the dichotomous

trait. Simulation studies showed that the two methods are

of similar power under the settings we used to generate

Figure 1.
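The collapsing step of the hybrid strategy can be sketched as follows. This is a minimal illustration that assumes genotypes are coded as minor-allele counts and that a singleton is a variant whose minor allele is observed exactly once in the sample; the function name `collapse_singletons` is hypothetical and the data in the usage note are synthetic.

```python
import numpy as np

def collapse_singletons(G):
    """Merge all singleton variants into one column; keep the other variants as they are.

    G: (n, p) matrix of minor-allele counts. Returns an (n, 1 + p_nonsingleton) matrix
    whose first column is the per-subject count of singleton alleles.
    """
    G = np.asarray(G)
    allele_counts = G.sum(axis=0)                  # minor-allele count per variant
    is_singleton = allele_counts == 1
    collapsed = G[:, is_singleton].sum(axis=1, keepdims=True)
    return np.hstack([collapsed, G[:, ~is_singleton]])
```

With 93 variants of which 57 are singletons, as in the Dallas Heart Study example, the result has 37 columns: the collapsed value plus the other 36 variants. The kernel test is then applied to this reduced matrix.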

An important feature of SKAT is that it allows for incor-

poration of flexible weight functions to boost analysis

power, for example by increasing the weight of variants

with lower MAFs and decreasing the weight of information

from variants inferred with lower confidence. Good

choices of weights are likely to improve the power of the

association test with SKAT, although simulations show

that even equal weights can yield high power when

combined with thresholding. In our simulation studies,

we employed a class of flexible continuous weights as

a function of MAF by using the beta function, which

increases the weight of rare variants and does not require

thresholding. Users can define other types of weight func-

tions. To further improve analysis power, one can estimate

weights by incorporating information besides MAF, for

example by using the Polyphen score or integrating other

annotation information, which will become increasingly

available as our understanding of genome variation

improves. Therefore, because of its flexibility, SKAT has

the capacity to mature, and its power to increase, as the

field progresses.
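To make the beta weighting concrete, the sketch below evaluates the Beta(1, 25) density at a few minor allele frequencies. For these parameters the density has the closed form 25(1 − MAF)²⁴, so no special-function library is needed. Whether this value enters the kernel as the weight itself or as its square root depends on the convention defined in the Methods, which is not restated in this section; the function name `beta_1_25_weight` is hypothetical.

```python
import numpy as np

def beta_1_25_weight(maf):
    """Beta(1, 25) density evaluated at the minor allele frequency: 25 * (1 - maf)**24."""
    maf = np.asarray(maf, dtype=float)
    return 25.0 * (1.0 - maf) ** 24

# Illustrative MAFs ranging from very rare to common
mafs = np.array([0.001, 0.01, 0.05, 0.2])
weights = beta_1_25_weight(mafs)   # strictly decreasing in MAF
```

The weight decreases smoothly and steeply with MAF, so rare variants dominate the statistic without a hard frequency threshold.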

Appendix A

Estimating the Null Distribution for Q

Under the null hypothesis, Q follows a mixture of chi-square distributions.29,30 More specifically, we define

$$P_0 = V - V\tilde{X}(\tilde{X}' V \tilde{X})^{-1}\tilde{X}' V,$$

where $\tilde{X}$ is the $n \times (p + 1)$ matrix equal to $[1, X]$. For continuous phenotypes, $V = \hat{\sigma}_0^2 I$, where $\hat{\sigma}_0$ is the estimator of $\sigma$ under the null model where $f(G) = 0$, and $I$ is an $n \times n$ identity matrix. For dichotomous phenotypes, $V = \mathrm{diag}(\hat{\mu}_{01}(1 - \hat{\mu}_{01}), \hat{\mu}_{02}(1 - \hat{\mu}_{02}), \ldots, \hat{\mu}_{0n}(1 - \hat{\mu}_{0n}))$, where $\hat{\mu}_{0i} = \mathrm{logit}^{-1}(\hat{\alpha}_0 + \hat{\alpha}' X_i)$ is the estimated probability that the i-th subject is a case under the null model. Then under the null model

$$Q \sim \sum_{i=1}^{n} \lambda_i \chi^2_{1,i}, \qquad \text{(Equation 6)}$$

where $(\lambda_1, \lambda_2, \ldots, \lambda_n)$ are the eigenvalues of $P_0^{1/2} K P_0^{1/2}$, and $\chi^2_{1,i}$ are independent $\chi^2_1$ random variables.

Several approximation and exact methods have been

suggested to obtain the distribution of Q.39 Among these,

the Davies exact method,26 based on inverting the charac-

teristic function of Equation 6, appears to work well in

practice and is used here.
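The mixture in Equation 6 can be explored numerically with the sketch below, written for a continuous trait. It builds P0 with V = σ²I, obtains the mixture weights as eigenvalues, and estimates the tail probability by Monte Carlo. Note that Monte Carlo sampling is substituted here for the Davies method purely for illustration; it is far less accurate at the small α levels relevant to sequencing studies. The function names and the factorization K = ZZ′ (with Z the weighted genotype matrix) are assumptions of this sketch.

```python
import numpy as np

def null_eigenvalues(X, Z, sigma2):
    """Nonzero eigenvalues of P0^{1/2} K P0^{1/2} for K = Z Z' and V = sigma2 * I."""
    n = X.shape[0]
    X1 = np.column_stack([np.ones(n), X])              # [1, X]
    H = X1 @ np.linalg.solve(X1.T @ X1, X1.T)          # projection onto the null model
    P0 = sigma2 * (np.eye(n) - H)                      # V - V X~ (X~' V X~)^{-1} X~' V
    # Z' P0 Z has the same nonzero eigenvalues as P0^{1/2} Z Z' P0^{1/2}
    lam = np.linalg.eigvalsh(Z.T @ P0 @ Z)
    return lam[lam > 1e-10]

def mixture_pvalue_mc(q_obs, lam, n_draws=100_000, seed=0):
    """Monte Carlo estimate of P(sum_i lam_i * chi2_1 >= q_obs)."""
    rng = np.random.default_rng(seed)
    draws = rng.chisquare(1, size=(n_draws, lam.size)) @ lam
    return float(np.mean(draws >= q_obs))
```

Working with the p × p matrix Z′P0Z rather than the n × n matrix avoids a large eigendecomposition when the number of variants is much smaller than the sample size.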

SKAT Is a Generalization of the C-Alpha Test

The recently proposed C-alpha test has advantages

over burden tests in that it explicitly models the possibility

that minor alleles can be deleterious or protective.

However, it does not currently allow for the analysis of

quantitative outcomes or the inclusion of covariates, and

its p value calculation requires permutation. We demonstrate

that for a dichotomous trait in the absence of covariates,

the C-alpha test statistic is equivalent to the SKAT statistic

with unweighted linear kernel, which is the same as the

kernel machine test in Wu et al.24

Suppose the j-th variant is observed $d_j$ times in the cases, out of $n_j$ times total in cases and controls, and that $p_0 = \sum_{i=1}^{n} y_i / n$. For a dichotomous trait and no covariates, the C-alpha test statistic is

$$T_\alpha = \sum_{j=1}^{p} \left[ (d_j - n_j p_0)^2 - n_j p_0 (1 - p_0) \right]. \qquad \text{(Equation 7)}$$

Denote $T_{1\alpha} = \sum_{j=1}^{p} (d_j - n_j p_0)^2$. Because $\sum_{j=1}^{p} n_j p_0 (1 - p_0)$ is the mean of $T_{1\alpha}$ under the null hypothesis of no association, $T_{1\alpha}$ is the C-alpha test statistic without mean centering. Because $d_j = y' G_{\cdot j}$ and $n_j = J' G_{\cdot j}$, where $G_{\cdot j}$ is the j-th column of the genotype matrix $G$ and $J = (1, 1, \ldots, 1)'$, it can be easily shown that

$$T_{1\alpha} = (y - p_0 J)' G G' (y - p_0 J). \qquad \text{(Equation 8)}$$

Note that under the unweighted linear kernel, $K = GG'$ and $\hat{\mu}_0 = p_0 J$ if no covariates are present. Hence, Equation 8 is identical to Equation 3; that is, $T_{1\alpha}$ is equivalent to the SKAT test statistic with unweighted linear kernel.

Although the SKAT statistic with unweighted linear

kernel and the C-alpha test statistic are equivalent, SKAT

and the C-alpha test use different null distributions to assess

significance: the C-alpha test uses a normal approximation,

whereas we use a mixture of chi-squares. The normal

approximation gives a valid p value when the tested rare

variants are independent and sample sizes are large, and

so requires an assumption of linkage equilibrium. In the

presence of LD, permutation is used by the C-alpha test

for significance testing. One can easily see that the test

statistic takes a quadratic form of y, which follows a mixture

of chi-square distributions. SKAT approximates this distri-

bution directly with the Davies method and hence gives

accurate estimation of significance regardless of the LD

structure when sample size is sufficient.
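The algebraic identity behind this equivalence, namely that the uncentered C-alpha sum equals the quadratic form with kernel GG′, can be checked numerically. The sketch below uses synthetic genotypes and case-control labels; all variable names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 10
G = rng.binomial(2, 0.02, size=(n, p))    # synthetic minor-allele counts
y = rng.binomial(1, 0.5, size=n)          # synthetic labels: 1 = case, 0 = control

p0 = y.mean()                             # proportion of cases
d = y @ G                                 # d_j: copies of variant j observed in cases
n_j = G.sum(axis=0)                       # n_j: copies of variant j in cases and controls
t1_sum = np.sum((d - n_j * p0) ** 2)      # sum over variants of (d_j - n_j * p0)^2

resid = y - p0                            # y - p0 * J
t1_quad = resid @ G @ G.T @ resid         # (y - p0 J)' G G' (y - p0 J)
```

The two quantities agree up to floating-point error for any genotype matrix, because (y − p0 J)′G is exactly the vector of d_j − n_j p0.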


Supplemental Data

Supplemental Data include 15 figures and can be found with this

article online at http://www.cell.com/AJHG/.

Acknowledgments

This work was supported by grants P30 ES010126 (to M.C.W.),

DMS 0854970 and R01 GM079330 (to T.C.), R01 HG000376 (to

M.B.), and R37 CA076404 and P01 CA134294 (to S.L. and X.L.).

We thank Jonathan Cohen, Alkes Price, and Shamil Sunyaev for

providing the Dallas Heart Study data and Larisa Miropolsky for

help with the software development.

Received: March 16, 2011

Revised: May 27, 2011

Accepted: May 30, 2011

Published online: July 7, 2011

Web Resources


1000 Genomes Project, http://www.1000genomes.org/

Online Mendelian Inheritance in Man (OMIM), http://www.

omim.org

SKAT software, http://www.hsph.harvard.edu/~xlin/software.html

References

1. Hindorff, L.A., Sethupathy, P., Junkins, H.A., Ramos, E.M.,

Mehta, J.P., Collins, F.S., and Manolio, T.A. (2009). Potential

etiologic and functional implications of genome-wide associa-

tion loci for human diseases and traits. Proc. Natl. Acad. Sci.

USA 106, 9362–9367.

2. Margulies, M., Egholm, M., Altman, W.E., Attiya, S., Bader, J.S.,

Bemben, L.A., Berka, J., Braverman, M.S., Chen, Y.J., Chen, Z.,

et al. (2005). Genome sequencing in microfabricated high-

density picolitre reactors. Nature 437, 376–380.

3. Mardis, E.R. (2008). Next-generation DNA sequencing

methods. Annu. Rev. Genomics Hum. Genet. 9, 387–402.

4. Ansorge, W.J. (2009). Next-generation DNA sequencing tech-

niques. New Biotechnol. 25, 195–203.

5. Eichler, E.E., Flint, J., Gibson, G., Kong, A., Leal, S.M.,

Moore, J.H., and Nadeau, J.H. (2010). Missing heritability

and strategies for finding the underlying causes of complex

disease. Nat. Rev. Genet. 11, 446–450.

6. Ley, T.J., Mardis, E.R., Ding, L., Fulton, B., McLellan, M.D.,

Chen, K., Dooling, D., Dunford-Shore, B.H., McGrath, S.,

Hickenbotham, M., et al. (2008). DNA sequencing of a cytoge-

netically normal acute myeloid leukaemia genome. Nature

456, 66–72.

7. Li, H., Ruan, J., and Durbin, R. (2008). Mapping short DNA

sequencing reads and calling variants using mapping quality

scores. Genome Res. 18, 1851–1858.

8. Li, R.Q., Li, Y.R., Fang, X.D., Yang, H.M., Wang, J., Kristiansen, K.,

and Wang, J. (2009). SNP detection for massively parallel whole-

genome resequencing. Genome Res. 19, 1124–1132.

9. Bansal, V., Harismendy, O., Tewhey, R., Murray, S.S., Schork,

N.J., Topol, E.J., and Frazer, K.A. (2010). Accurate detection

and genotyping of SNPs utilizing population sequencing

data. Genome Res. 20, 537–545.

10. Carvajal-Carmona, L.G. (2010). Challenges in the identifica-

tion and use of rare disease-associated predisposition variants.

Curr. Opin. Genet. Dev. 20, 277–281.

11. Schork, N.J., Murray, S.S., Frazer, K.A., and Topol, E.J. (2009).

Common vs. rare allele hypotheses for complex diseases.

Curr. Opin. Genet. Dev. 19, 212–219.

12. Li, B., and Leal, S.M. (2008). Methods for detecting associa-

tions with rare variants for common diseases: application

to analysis of sequence data. Am. J. Hum. Genet. 83, 311–321.

13. Madsen, B.E., and Browning, S.R. (2009). A groupwise associ-

ation test for rare mutations using a weighted sum statistic.

PLoS Genet. 5, e1000384.

14. Morgenthaler, S., and Thilly, W.G. (2007). A strategy to

discover genes that carry multi-allelic or mono-allelic risk for

common diseases: a cohort allelic sums test (CAST). Mutat.

Res. 615, 28–56.

15. Li, B., and Leal, S.M. (2009). Discovery of rare variants via

sequencing: implications for the design of complex trait asso-

ciation studies. PLoS Genet. 5, e1000481.

16. Price, A.L., Kryukov, G.V., de Bakker, P.I., Purcell, S.M., Staples,

J., Wei, L.J., and Sunyaev, S.R. (2010). Pooled association tests

for rare variants in exon-resequencing studies. Am. J. Hum.

Genet. 86, 832–838.

17. Han, F., and Pan, W. (2010). A data-adaptive sum test for

disease association with multiple common or rare variants.

Hum. Hered. 70, 42–54.

18. Morris, A.P., and Zeggini, E. (2010). An evaluation of statistical

approaches to rare variant analysis in genetic association

studies. Genet. Epidemiol. 34, 188–193.

19. Zawistowski, M., Gopalakrishnan, S., Ding, J., Li, Y., Grimm, S.,

and Zollner, S. (2010). Extending rare-variant testing strategies:

analysis of noncoding sequence and imputed genotypes. Am. J.

Hum. Genet. 87, 604–617.

20. Asimit, J., and Zeggini, E. (2010). Rare variant association analysis

methods for complex traits. Annu. Rev. Genet. 44, 293–308.

21. Neale, B.M., Rivas, M.A., Voight, B.F., Altshuler, D., Devlin, B.,

Orho-Melander, M., Kathiresan, S., Purcell, S.M., Roeder, K.,

and Daly, M.J. (2011). Testing for an unusual distribution of

rare variants. PLoS Genet. 7, e1001322.

22. Price, A.L., Patterson, N.J., Plenge, R.M., Weinblatt, M.E.,

Shadick, N.A., and Reich, D. (2006). Principal components

analysis corrects for stratification in genome-wide association

studies. Nat. Genet. 38, 904–909.

23. Kwee, L.C., Liu, D., Lin, X., Ghosh, D., and Epstein, M.P.

(2008). A powerful and flexible multilocus association test

for quantitative traits. Am. J. Hum. Genet. 82, 386–397.

24. Wu, M.C., Kraft, P., Epstein, M.P., Taylor, D.M., Chanock, S.J.,

Hunter, D.J., and Lin, X. (2010). Powerful SNP-set analysis for

case-control genome-wide association studies. Am. J. Hum.

Genet. 86, 929–942.

25. Lin, X. (1997). Variance component testing in generalised

linear models with random effects. Biometrika 84, 309–326.

26. Davies, R. (1980). The distribution of a linear combination of

chi-square random variables. J. R. Stat. Soc. Ser. C Appl. Stat.

29, 323–333.

27. Pan, W. (2009). Asymptotic tests of association with multiple

SNPs in linkage disequilibrium. Genet. Epidemiol. 33, 497–507.

28. Cristianini, N., and Shawe-Taylor, J. (2000). An Introduction

to Support Vector Machines and Other Kernel-Based Learning

Methods (Cambridge: Cambridge Univ Press).

29. Liu, D., Lin, X., and Ghosh, D. (2007). Semiparametric regres-

sion of multidimensional genetic pathway data: least-squares


kernel machines and linear mixed models. Biometrics 63,

1079–1088.

30. Liu, D., Ghosh, D., and Lin, X. (2008). Estimation and testing

for the effect of a genetic pathway on a disease outcome using

logistic kernel machine regression via logistic mixed models.

BMC Bioinformatics 9, 292.

31. Fleuret, F., and Sahbi, H. (2003). Scale-invariance of support

vector machines based on the triangular kernel. In 3rd International

Workshop on Statistical and Computational Theories

of Vision. (ftp://ftp.inria.fr/INRIA/publication/publi-pdf/RR/

RR-4601.pdf).

32. Ramensky, V., Bork, P., and Sunyaev, S. (2002). Human non-

synonymous SNPs: server and survey. Nucleic Acids Res. 30,

3894–3900.

33. Kumar, P., Henikoff, S., and Ng, P.C. (2009). Predicting the

effects of coding non-synonymous variants on protein func-

tion using the SIFT algorithm. Nat. Protoc. 4, 1073–1081.

34. Liu, H., Tang, Y., and Zhang, H. (2009). A new chi-square

approximation to the distribution of non-negative definite

quadratic forms in non-central normal variables. Comput.

Stat. Data Anal. 53, 853–856.

35. Lee, S., Wu, M.C., Cai, T., Li, Y., Boehnke, M., and Lin, X.

(2011). Power and sample size calculations for designing rare

variant sequencing association studies. In Harvard University

Technical Report. (http://www.hsph.harvard.edu/~xlin).

36. Durbin, R.M., Abecasis, G.R., Altshuler, D.L., Auton, A.,

Brooks, L.D., Gibbs, R.A., Hurles, M.E., and McVean, G.A.;

1000 Genomes Project Consortium. (2010). A map of human

genome variation from population-scale sequencing. Nature

467, 1061–1073.

37. Schaffner, S.F., Foo, C., Gabriel, S., Reich, D., Daly, M.J., and

Altshuler, D. (2005). Calibrating a coalescent simulation of

human genome sequence variation. Genome Res. 15, 1576–

1583.

38. Romeo, S., Yin, W., Kozlitina, J., Pennacchio, L.A., Boerwinkle,

E., Hobbs, H.H., and Cohen, J.C. (2009). Rare loss-of-function

mutations in ANGPTL family members contribute to plasma

triglyceride levels in humans. J. Clin. Invest. 119, 70–79.

39. Duchesne, P., and Lafaye De Micheaux, P. (2010). Computing

the distribution of quadratic forms: Further comparisons

between the Liu-Tang-Zhang approximation and exact

methods. Comput. Stat. Data Anal. 54, 858–862.



REPORT

Expansion of Intronic GGCCTG Hexanucleotide Repeat in NOP56 Causes SCA36, a Type of Spinocerebellar Ataxia Accompanied by Motor Neuron Involvement

Hatasu Kobayashi,1,4 Koji Abe,2,4 Tohru Matsuura,2,4 Yoshio Ikeda,2 Toshiaki Hitomi,1 Yuji Akechi,2

Toshiyuki Habu,3 Wanyang Liu,1 Hiroko Okuda,1 and Akio Koizumi1,*

Autosomal-dominant spinocerebellar ataxias (SCAs) are a heterogeneous group of neurodegenerative disorders. In this study, we per-

formed genetic analysis of a unique form of SCA (SCA36) that is accompanied by motor neuron involvement. Genome-wide linkage

analysis and subsequent fine mapping for three unrelated Japanese families in a cohort of SCA cases, in whom molecular diagnosis

had never been performed, mapped the disease locus to the region of a 1.8 Mb stretch (LOD score of 4.60) on 20p13 (D20S906–

D20S193) harboring 37 genes with definitive open reading frames. We sequenced 33 of these and observed a large expansion of an

intronic GGCCTG hexanucleotide repeat in NOP56 and an unregistered missense variant (Phe265Leu) in C20orf194, but we found

no mutations in PDYN and TGM6. The expansion showed complete segregation with the SCA phenotype in family studies, whereas

Phe265Leu in C20orf194 did not. Screening of the expansions in the SCA cohort cases revealed four additional occurrences, but

none were revealed in the cohort of 27 Alzheimer disease cases, 154 amyotrophic lateral sclerosis cases, or 300 controls. In total,

nine unrelated cases were found in 251 cohort SCA patients (3.6%). A founder haplotype was confirmed in these cases. RNA foci formation

was detected in lymphoblastoid cells from affected subjects by fluorescence in situ hybridization. Double staining and gel-shift assay

showed that (GGCCUG)n binds the RNA-binding protein SRSF2 but that (CUG)6 does not. In addition, transcription of MIR1292,

a neighboring miRNA, was significantly decreased in lymphoblastoid cells of SCA patients. Our finding suggests that SCA36 is caused

by hexanucleotide repeat expansions through RNA gain of function.

Autosomal-dominant spinocerebellar ataxias (SCAs) are

a heterogeneous group of neurodegenerative disorders

characterized by loss of balance and by progressive gait and

limb ataxia.1–3 We recently encountered two unrelated

patients with intriguing clinical symptoms from a commu-

nity in the Chugoku region in western mainland Japan.4

These patients both showed complicated clinical features,

with ataxia as the first symptom, followed by characteristic

late-onset involvement of the motor neuron system that

caused symptoms similar to those of amyotrophic lateral

sclerosis (ALS [MIM 105400]).4 Some SCAs (SCA1 [MIM

164400], SCA2 [MIM 183090], SCA3 [MIM 607047], and

SCA6 [MIM 183086]) are known to slightly affect motor

neurons; however, their involvement is minimal and the

patients usually do not develop skeletal muscle and tongue

atrophies.4 Of particular interest is that RNA foci have been

recently demonstrated in hereditary disorders caused by

microsatellite repeat expansions or insertions in the non-

coding regions of their gene.5–7 The unique clinical features

in these families have seldom been described in previous

reports; therefore, we undertook a genetic analysis.

A similar form of SCA was observed in five Japanese cases

from a cohort of 251 patients with SCA, in whom molec-

ular diagnosis had not been performed, who were followed

by the Department of Neurology, Okayama University

Hospital. These five cases originated from a city of

450,000 people in the Chugoku region. Thus, we suspected

the presence of a founder mutation common to these five

cases, prompting us to recruit these five families (pedigrees

1–5) (Figure 1, Table 1). This study was approved by the

Ethics Committee of Kyoto University and the Okayama

University institutional review board. Written informed

consent was obtained from all subjects. One index case

per family was investigated in some depth: IV-4 in pedigree

1, II-1 in pedigree 2, III-1 in pedigree 3, II-1 in pedigree 4,

and II-1 in pedigree 5. The mean age at onset of cerebellar

ataxia was 52.8 ± 4.3 years, and the disease was transmitted

by an autosomal-dominant mode of inheritance.

All affected individuals started their ataxic symptoms,

such as gait and truncal instability, ataxic dysarthria, and

uncoordinated limbs, in their late forties to fifties. MRI

revealed relatively confined and mild cerebellar atrophy

(Figure 2A). Unlike individuals with previously known

SCAs, all affected individuals with longer disease duration

showed obvious signs of motor neuron involvement

(Table 1). Characteristically, all affected individuals ex-

hibited tongue atrophy with fasciculation, although its

degree of severity varied (Figure 2B). Despite severe tongue

atrophy in some cases, their swallowing function was rela-

tively preserved, and they were allowed oral intake even at

a later point after onset. In addition to tongue atrophy,

skeletal muscle atrophy and fasciculation in the limbs

and trunk appeared in advanced cases.4 Tendon reflexes

were generally mildly to severely hyperreactive in most

1Department of Health and Environmental Sciences, Graduate School of Medicine, Kyoto University, Kyoto, Japan; 2Department of Neurology, Graduate School of Medicine, Dentistry and Pharmaceutical Science, Okayama University, Okayama, Japan; 3Radiation Biology Center, Kyoto University, Kyoto, Japan

4These authors contributed equally to this work




Figure 1. Pedigree Charts of the Five SCA Families
Haplotypes are shown for nine markers from D20S906 (1,505,576 bp) to D20S193 (3,313,494 bp), spanning 1.8 Mb on chromosome 20p13. NOP56 is located at 2,633,254–2,639,039 bp (NCBI build 37.1). Filled and unfilled symbols indicate affected and unaffected individuals, respectively. Squares and circles represent males and females, respectively. A slash indicates a deceased individual. The putative founder haplotypes among patients, constructed by GENEHUNTER,8 are shown in boxes. Arrows indicate the index case. The pedigrees were slightly modified for privacy protection.


affected individuals, none of whom displayed severe lower

limb spasticity or extensor plantar response. Electrophysi-

ological studies were performed in an affected individual.

Nerve conduction studies revealed normal findings in all

of the cases that were examined; however, an electromyo-

gram showed neurogenic changes only in cases with

skeletal muscle atrophy, indicating that lower motor

neuropathy existed in this particular disease. Progression

of motor neuron involvement in this SCA was typically

limited to the tongue and main proximal skeletal muscles

in both upper and lower extremities, which is clearly

different from typical ALS, which usually involves most

skeletal muscles over the course of a few years, leading to

fatal results within several years.

We conducted genome-wide linkage analysis for nine

affected subjects and eight unaffected subjects in three

informative families (pedigrees 1–3; Figure 1). For genotyp-

ing, we used an ABI Prism Linkage Mapping Set (Version 2;

Applied Biosystems, Foster City, CA, USA) with 382

markers, 10 cM apart, for 22 autosomes. Fine-mapping

markers (approximately 1 cM apart) were designed accord-

ing to information from the uniSTS reference physical map

in the NCBI database. A parametric linkage analysis was

carried out in GENEHUNTER8 with the assumption of an

autosomal-dominant model. The disease allele frequency

was set at 0.000001, and a phenocopy frequency of

0.000001 was assumed. Alleles at each marker were assigned

equal population frequencies. We performed

multipoint analyses for autosomes and obtained

LOD scores. We considered LOD scores above 3.0 to be

significant.8 Genome-wide linkage analysis revealed

a single locus on chromosome 20p13 with a LOD score

of 3.20. Fine mapping increased the LOD score to 4.60

(Figure 3). Haplotype analysis revealed two recombination

events in pedigree 3, delimiting a 1.8 Mb region (D20S906–

D20S193) (Figure 1). We further tested whether the five

cases shared the haplotype. As shown in Figure 1, pedigrees

4 and 5 were confirmed to have the same haplotype as

pedigrees 1, 2, and 3, indicating that the 1.8 Mb region is

very likely to be derived from a common ancestor.
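For readers unfamiliar with LOD scores, the sketch below implements the textbook two-point formula for phase-known, fully informative meioses, LOD(θ) = log10[θ^r (1 − θ)^(m − r) / 0.5^m], where r is the number of recombinants among m meioses. This is a deliberate simplification: the multipoint parametric analysis performed in GENEHUNTER accounts for pedigree structure, several markers at once, and the disease model, none of which is reproduced here. The function name `two_point_lod` is hypothetical.

```python
import math

def two_point_lod(recombinants, meioses, theta):
    """Two-point LOD score for phase-known meioses at recombination fraction theta."""
    if not 0.0 <= theta <= 0.5:
        raise ValueError("theta must lie in [0, 0.5]")
    nonrecombinants = meioses - recombinants
    if theta == 0.0:
        # Any observed recombinant is impossible under complete linkage
        if recombinants > 0:
            return float("-inf")
        return meioses * math.log10(2.0)
    log_l_theta = (recombinants * math.log10(theta)
                   + nonrecombinants * math.log10(1.0 - theta))
    return log_l_theta - meioses * math.log10(0.5)
```

Under this simplified model, ten informative meioses with no recombinants give a LOD of 10 × log10(2), just above the conventional significance threshold of 3.0 used in the text.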

The 1.8 Mb region harbors 44 genes (NCBI, build 37.1). We

eliminated two pseudogenes and five genes (LOC441938,

LOC100289473, LOC100288797, LOC100289507, and

LOC100289538) from the candidates. Evidence view

showed that the first, fourth, and fifth genes were not found

in the contig in this region, whereas the second and third

Table 1. Clinical Characteristics of Affected Subjects

| Pedigree No. | Patient ID | Gender | Onset Age (yr) | Current Age (yr) | Ataxia | MNI: Skeletal Muscle Atrophy | MNI: Skeletal Muscle Fasciculation | MNI: Tongue Atrophy/Fasciculation | Genotype of GGCCTG Repeats |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | III-5 | M | 50 | 70 | +++ | N.D. | N.D. | N.D. | g.263397_263402[6]+(1800) |
| 1 | III-6 | F | 52 | 68 | ++ | + | + | + | g.263397_263402[6]+(2300) |
| 1 | IV-2 | F | 57 | 63 | + | - | - | + | g.263397_263402[6]+(2300) |
| 1 | IV-4 | M | 50 | 59 | + | - | - | + | g.263397_263402[6]+(2300) |
| 2 | II-1 | M | 55 | 77 | +++ | ++ | + | + | g.263397_263402[6]+(2200) |
| 2 | II-2 | F | 53 | 70 | ++ | N.D. | N.D. | N.D. | g.263397_263402[6]+(2200) |
| 3 | II-3 | M | 58 | 77 | ++ | ++ | + | + | g.263397_263402[3]+(2300) |
| 3 | III-1 | M | 56 | 62 | + | - | - | ± | g.263397_263402[8]+(2200) |
| 3 | III-2 | M | 51 | 61 | ++ | + | + | + | g.263397_263402[6]+(1800) |
| 4 | I-1 | M | 57 | died in 2001 at 83 | ++ | N.D. | N.D. | N.D. | g.263397_263402[5]+(1800) |
| 4 | II-1 | F | 48 | 61 | ++ | + | ± | ++ | g.263397_263402[6]+(2000) |
| 5 | I-1 | M | 57 | 86 | ++ | +++ | + | + | g.263397_263402[5]+(2000) |
| 5 | II-1 | F | 47 | 58 | ++ | + | + | + | g.263397_263402[8]+(1700) |
|  | SCA#1 | M | 52 | 69 | +++ | +++ | +++ | +++ | g.263397_263402[5]+(2200) |
|  | SCA#2 | F | 43 | 53 | +++ | - | - | + | g.263397_263402[6]+(1800) |
|  | SCA#3 | M | 55 | 60 | ++ | - | - | ++ | g.263397_263402[8]+(1700) |
|  | SCA#4 | M | 57 | 81 | +++ | + | + | +++ | g.263397_263402[5]+(2200) |
| Mean |  |  | 52.8 |  |  |  |  |  |  |
| SD |  |  | 4.3 |  |  |  |  |  |  |

MNI, motor neuron involvement; N.D., not determined.


genes are not assigned to orthologous loci in the mouse

genome. Sequence similarities among paralog genes defied

direct sequencing of four genes: SIRPD [NM_178460.2],

SIRPB1 [MIM 603889], SIRPG [MIM 605466], and SIRPA

[MIM 602461]. Thus, we sequenced 33 of 37 genes (PDYN

[MIM 131340], STK35 [MIM 609370], TGM3 [MIM

600238], TGM6 [NM_198994.2], SNRPB [MIM 182282],

SNORD119 [NR_003684.1], ZNF343 [NM_024325.4],

TMC2 [MIM 606707], NOP56 [NM_006392.2], MIR1292

[NR_031699.1], SNORD110 [NR_003078.1], SNORA51

[NR_002981.1], SNORD86 [NR_004399.1], SNORD56

[NR_002739.1], SNORD57 [NR_002738.1], IDH3B [MIM

604526], EBF4 [MIM 609935], CPXM1 [NM_019609.4],

C20orf141 [NM_080739.2], FAM113A [NM_022760.3],

VPS16 [MIM 608550], PTPRA [MIM 176884], GNRH2

[MIM 602352], MRPS26 [MIM 611988], OXT [MIM

167050], AVP [MIM 192340], UBOX5 [NM_014948.2],

FASTKD5 [NM_021826.4], ProSAPiP1 [MIM 610484],

DDRGK1 [NM_023935.1], ITPA [MIM 147520], SLC4A11

[MIM 610206], and C20orf194 [NM_001009984.1]) (Fig-

ure 2C). All noncoding and coding exons, as well as the

100 bp up- and downstream of the splice junctions of these

genes, were sequenced in two index cases (IV-4 in pedigree 1

and III-1 in pedigree 3) and in three additional cases (II-1 in

pedigree 2, II-1 in pedigree 4, and II-1 in pedigree 5) with the

use of specific primers (Table S1 available online). Eight

unregistered variants were found among the two index

cases. Among these, there was a coding variant, c.795C>G

Figure 2. Motor Neuron Involvement and (GGCCTG)n Expansion in the First Intron of NOP56
(A) MRI of an affected subject (SCA#3) showed mild cerebellar atrophy (arrow) but no other cerebral or brainstem pathology.
(B) Tongue atrophy (arrow) was observed in SCA#1.
(C) Physical map of the 1.8-Mb linkage region from D20S906 (1,505,576 bp) to D20S193 (3,313,494 bp), with 33 candidate genes shown, as well as the direction of transcription (arrows).
(D) The upper portion of the panel shows the scheme of primer binding for repeat-primer PCR analysis. In the lower portion, sequence traces of the PCR reactions are shown. Red lines indicate the size markers. The vertical axis indicates arbitrary intensity levels. A typical saw-tooth pattern is observed in an affected pedigree.
(E) Southern blotting of LCLs from SCA cases and three controls. Genomic DNA (10 μg) was extracted from Epstein-Barr virus (EBV)-immortalized LCLs derived from six affected subjects (Ped2_II-1, Ped3_III-1, Ped3_III-2, Ped5_I-1, Ped5_II-1, and SCA#1) and digested with 2 U of AvrII overnight (New England Biolabs, Beverly, MA, USA). A probe covering exon 4 of NOP56 (452 bp) was subjected to PCR amplification from human genomic DNA with the use of primers (Table S3) and labeled with 32P-dCTP.


(p.Phe265Leu), in C20orf194, whereas the other seven

included one synonymous variant, c.1695T>A (p.Leu565-

Leu), in ZNF343 and six non-splice-site intronic variants

(Table S2). We tested segregation by sequencing exon 11 of

C20orf194 in IV-2 and III-5 in pedigree 1. Neither IV-2 nor

III-5 had this variant. We thus eliminated C20orf194 as

a candidate. Missense mutations in PDYN and TGM6, which

have been recently reported as causes of SCA, mapped to

20p12.3-p13,9,10 but none were detected in the five index

cases studied here (Table S2).

Possible expansions of repetitive sequences in these

33 genes were investigated when intragenic repeats

were indicated in the database (UCSC Genome Bioinfor-

matics). Expansions of the hexanucleotide repeat

GGCCTG (rs68063608) were found in intron 1 of NOP56

(Figure 2D) in all five index cases through the use of

a repeat-primed PCR method.11–13 An outline of the

repeat-primed PCR experiment is described in Figure 2D.

In brief, the fluorescent-dye-conjugated forward primer

corresponded to the region upstream of the repeat of

interest. The first reverse primer consisted of four units of

the repeat (GGCCTG) and a 5′ tail used as an anchor.

The second reverse primer was an ‘‘anchor’’ primer. These

primers are described in Table S3. Complete segregation

of the expanded hexanucleotide was confirmed in all pedi-

grees, and the maximum repeat size in nine unaffected

members was eight (data not shown).

In addition to the SCA cases in five pedigrees, four

unrelated cases (SCA#1–SCA#4) were found to have a

(GGCCTG)n allele through screening of the cohort SCA

patients (Table 1). Neurological examination was reeval-

uated in these four cases, revealing both ataxia and motor

neuron dysfunction with tongue atrophy and fasciculation

(Table 1). In total, nine unrelated cases were found in the

251 cohort patients with SCA (3.6%). For confirmation of

the repeat expansions, Southern blot analysis was conduct-

ed in six affected subjects (Ped2_II-1, Ped3_III-1, Ped3_III-2,

Ped5_I-1, Ped5_II-1, and SCA#1). The data showed >10 kb

of repeat expansions in the lymphoblastoid cell lines

(LCLs) obtained from the SCA patients (Figure 2E). Further-

more, the numbers of GGCCTG repeat expansion were

estimated by Southern blotting in 11 other cases. The

expansion analysis revealed approximately 1500 to 2500

repeats in 17 cases (Table 1). There was no negative associa-

tion between age at onset and the number of GGCCTG

repeats (n = 17, r = 0.42, p = 0.09; Figure S1) and no obvious

anticipation in the current pedigrees.
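The summary figures quoted here can be recomputed from the per-patient values in Table 1. The sketch below transcribes the onset ages and estimated repeat numbers for the 17 cases (in table order) and computes the mean and sample standard deviation of onset age, the Pearson correlation between onset age and repeat number, and the smallest expansion length, assuming 6 bp per GGCCTG unit. The transcription and variable names are this sketch's own; the p value is not recomputed.

```python
import numpy as np

# Onset age (yr) and estimated GGCCTG repeat number, transcribed from Table 1 in row order
onset = np.array([50, 52, 57, 50, 55, 53, 58, 56, 51, 57, 48, 57, 47, 52, 43, 55, 57])
repeats = np.array([1800, 2300, 2300, 2300, 2200, 2200, 2300, 2200, 1800,
                    1800, 2000, 2000, 1700, 2200, 1800, 1700, 2200])

mean_onset = onset.mean()                      # mean age at onset
sd_onset = onset.std(ddof=1)                   # sample standard deviation
r = np.corrcoef(onset, repeats)[0, 1]          # Pearson correlation coefficient
min_expansion_kb = repeats.min() * 6 / 1000    # shortest expansion, in kb
```

The correlation is positive, so longer expansions are not associated with earlier onset in these data, which is consistent with the absence of a negative association reported in the text.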

To investigate the disease specificity and disease spec-

trum of the hexanucleotide repeat expansions, we tested

the repeat expansions in an Alzheimer disease (MIM

104300) cohort and an ALS cohort followed by the Depart-

ment of Neurology, Okayama University Hospital. We also

recruited Japanese controls, who were confirmed to be free

from brain lesions through MRI and magnetic resonance

angiography, which was performed as described previ-

ously.14 Screening of the 27 Alzheimer disease cases and

154 ALS cases failed to detect additional cases with repeat

expansions. The GGCCTG repeat sizes ranged from 3 to

8 in 300 Japanese controls (5.9 ± 0.8 repeats), suggesting

that the >10 kb repeat expansions were mutations.

Expression of Nop56, an essential component of the

splicing machinery,15 was examined by RT-PCR with the

use of primers for wild-type mouse Nop56 cDNA (Table

S3). Expression of Nop56 mRNA was detected in various

tissues, including CNS tissue, and a very weak signal was

detected in spinal cord tissue (Figure 4A). Immunohisto-

chemistry using an anti-mouse Nop56 antibody (Santa

Cruz Biotechnology, Santa Cruz, CA, USA) detected the

Nop56 protein in Purkinje cells of the cerebellum as well

as motor neurons of the hypoglossal nucleus and the

spinal cord anterior horn (Figure 4B), suggesting that these

cells may be responsible for tongue and muscle atrophy in

the trunk and limbs, respectively. Immunoblotting also

confirmed the presence of Nop56 in neural tissues

(Figure 4C), where Nop56 is localized in both the nucleus

and cytoplasm.

Alterations of NOP56 RNA expression and protein levels

in LCLs from patients were examined by real-time RT-PCR

and immunoblotting. The primers for quantitative PCR of

human NOP56 cDNA are described in Table S3. Immuno-

blotting was performed with the use of an anti-human

NOP56 antibody (Santa Cruz Biotechnology, Santa Cruz,

CA, USA). We found no decrease in NOP56 RNA expression

or protein levels in LCLs from these patients (Figure 5A). To

investigate abnormal splicing variants of NOP56, we per-

formed RT-PCR using the primers covering the region

from the 5′ UTR to exon 4 around the repeat expansion

(Table S3); however, no splicing variant was observed in

LCLs from the cases (Figure 5B). We also performed immu-

nocytochemistry for NOP56 and coilin, a marker of the

Cajal body, where NOP56 functions.16 NOP56 and coilin

distributions were not altered in LCLs of the SCA patients

(Figure 5C), suggesting that qualitative or quantitative

changes in the Cajal body did not occur. These results indi-

cated that haploinsufficiency could not explain the

observed phenotype.

Figure 3. Multipoint Linkage Analysis with Ten Markers on Chromosome 20p13


We performed fluorescence in situ hybridization to

detect RNA foci containing the repeat transcripts in LCLs

from patients, as previously described.17,18 Lymphoblas-

toid cells from two SCA patients (Ped2_II-2 and Ped5_I-1)

and two control subjects were analyzed. An average of

2.1 ± 0.5 RNA foci per cell were detected in 57.0%

of LCLs (n = 100) from the SCA subjects through the use

of a nuclear probe targeting the GGCCUG repeat, whereas

no RNA foci were observed in control LCLs (n = 100)

(Figure 6A). In contrast, a probe for the CGCCUG repeat,

another repeat sequence in intron 1 of NOP56, detected

no RNA foci in either SCA or control LCLs (n = 100

each) (Figure 6A), indicating that the GGCCUG repeat

was specifically expanded in the SCA subjects. The speci-

ficity of the RNA foci was confirmed by sensitivity to RNase

A treatment and resistance to DNase treatment (Figure 6A).

Several reports have suggested that RNA foci play a role

in the etiology of SCA through sequestration of specific

RNA-binding proteins.5–7 In silico searches (ESEfinder

3.0) predicted an RNA-binding protein, SRSF2 (MIM

600813), as a strong candidate for binding of the GGCCUG

repeat. Double staining with the probe for the GGCCUG

repeat and an anti-SRSF2 antibody (Sigma-Aldrich, Tokyo,

Japan) was performed. The results showed colocalization of

RNA foci with SRSF2, whereas NOP56 and coilin were not

colocalized with the RNA foci (Figure 6B), suggesting

a specific interaction of endogenous SRSF2 with the RNA

foci in vivo.

To further confirm the interaction, gel-shift assays were

carried out for investigation of the binding activity of

SRSF2 with (GGCCUG)n. Synthetic RNA oligonucleotides

(200 pmol), (GGCCUG)4 or (CUG)6, which is the latter

part of the hexanucleotide, as well as the repeat RNA

involved in myotonic dystrophy type 1 (DM1 [MIM

160900])18 and SCA8 (MIM 608768),5 were denatured

and immediately mixed with different amounts (0, 0.2,

or 0.6 μg) of recombinant full-length human SRSF2

(Abcam, Cambridge, UK). The mixtures were incubated,

and the protein-bound probes were separated from the

free forms by electrophoresis on 5%–20% native polyacryl-

amide gels. The separated RNA probes were detected with

SYBR Gold staining (Invitrogen, Carlsbad, CA, USA). We

found a strong association of (GGCCUG)4 with SRSF2

in vitro in comparison to (CUG)6 (Figure 6C). Collectively,

we concluded that (GGCCUG)n interacts with SRSF2.

It is notable that MIR1292 is located just 19 bp 3′ of the GGCCTG repeat (Figure 2D). MiRNAs such as MIR1292 are

small noncoding RNAs that regulate gene expression by in-

hibiting translation of specific target mRNAs.19,20 MiRNAs

are believed to play important roles in key molecular

Figure 4. Nop56 in the Mouse Nervous System
(A) RT-PCR analysis of Nop56 (422 bp) in various mouse tissues. cDNA (25 ng) collected from various organs of C57BL/6 mice was purchased from GenoStaf (Tokyo, Japan).
(B) Immunohistochemical analysis of Nop56 in the cerebellum, hypoglossal nucleus, and spinal cord anterior horn in wild-type male Slc:ICR mice at 8 wks of age (Japan SLC, Shizuoka, Japan). The arrows indicate anti-Nop56 antibody staining. The negative control was the cerebellar sample without the Nop56 antibody treatment. Scale bar represents 100 μm.
(C) Immunoblotting of Nop56 (66 kDa) in the cerebellum and cerebrum. Protein sample (10 μg) was subjected to immunoblotting. LaminB1, a nuclear protein, and beta-tubulin were used as loading controls.


pathways by fine-tuning gene expression.19,20 Recent

studies have revealed that miRNAs influence neuronal

survival and are also associated with neurodegenerative

diseases.21,22 In silico searches (Target Scan Human 5.1)

predicted glutamate receptors (GRIN2B [MIM 138252]

and GRIK3 [MIM 138243]) to be potential target genes.

Real-time RT-PCR using TaqMan probes for miRNA

(Invitrogen, Carlsbad, CA, USA) revealed that the levels

of both mature and precursor MIR1292 were significantly

decreased in SCA LCLs (Figure 6D), indicating that the

GGCCTG repeat expansion decreased the transcription

of MIR1292. A decrease in MIR1292 expression may

upregulate glutamate receptors in particular cell types;

e.g., GRIK3 in stellate cells in the cerebellum,23 leading to

ataxia because of perturbation of signal transduction to

the Purkinje cells. In addition, it has been suggested, on

the basis of ALS mouse models,24,25 that excitotoxicity

mediated by a type of glutamate receptor, the NMDA

receptor including GRIN2B, is involved in loss of spinal

neurons. A very slowly progressing and mild form of the

motor neuron disease, such as that described here, which

is limited to mostly fasciculation of the tongue, limbs

and trunk, may also be compatible with such a functional

dysregulation rather than degeneration.

In the present study, we have conducted genetic analysis

to find a genetic cause for the unique SCA with motor

neuron disease. With extensive sequencing of the 1.8 Mb

linked region, we found large hexanucleotide repeat

expansions in NOP56, which were completely segregated

with SCA in five pedigrees and were found in four unre-

lated cases with a similar phenotype. The expansion was

not found in 300 controls or in other neurodegenerative

diseases. We further proved that repeat expansions of

NOP56 induce RNA foci and sequester SRSF2. We thus

conclude that the hexanucleotide repeat expansions likely

cause SCA by a toxic RNA gain-of-function

mechanism, and we name this unique SCA SCA36.

Haplotype analysis indicates that hexanucleotide expan-

sions are derived from a common ancestor. The prevalence

of SCA36 was estimated at 3.6% in the SCA cohort in

Chugoku district, suggesting that prevalence of SCA36

may be geographically limited to the western part of Japan

and is rare even in Japanese SCAs.

Expansion of tandem nucleotide repeats in different

regions of respective genes (most often the triplets CAG

and CTG) has been shown to cause a number of inherited

diseases over the past decades. An expansion in the coding

region of a gene causes a gain of toxic function and/or

reduces the normal function of the corresponding protein

at the protein level. RNA-mediated noncoding repeat

expansions have also been identified as causing eight other

neuromuscular disorders: DM1, DM2 (MIM 602668),

fragile X tremor/ataxia syndrome (FXTAS [MIM 300623]),

Huntington disease-like 2 (HDL2 [MIM 606438]), SCA8,

SCA10 (MIM 603516), SCA12 (MIM 604326), and SCA31

(MIM 117210).26 The repeat numbers in affected alleles

of SCA36 are among the largest seen in this group of

diseases (i.e., there are thousands of repeats). Moreover,

SCA36 is not merely a nontriplet repeat expansion disorder

similar to SCA10, DM2, and SCA31, but is now proven to

be a human disease caused by a large hexanucleotide

repeat expansion. In addition, no or only weak anticipa-

tion has been reported for noncoding repeat expansion

in SCA, whereas clear anticipation has been reported for

most polyglutamine expansions in SCA.2 As such, absence

of anticipation in SCA36 is in accord with previous studies

Figure 5. Analysis of NOP56 in LCLs from SCA Patients
(A) mRNA expression (upper panel) and protein levels (lower panel) in LCLs from cases (n = 6) and controls (n = 3) were measured by RT-PCR and immunoblotting, respectively. cDNA (10 ng) was transcribed from total RNA isolated from LCLs and used for RT-PCR. Immunoblotting was performed with the use of a protein sample (40 μg) extracted from LCLs. The data indicate the mean ± SD relative to the levels of PP1A and GAPDH, respectively. There was no significant difference between LCLs from controls and cases.
(B) Analysis for splicing variants of NOP56 cDNA. RT-PCR with 10 ng of cDNA and primers corresponding to the region from the 5′ UTR to exon 4 around the repeat expansion was performed. The PCR product has an expected size of 230 bp.
(C) Immunocytochemistry for NOP56 and coilin. Green signals represent NOP56 or coilin. Shown are representative samples from 100 observations of controls or cases.


on SCAs with noncoding repeat expansions. The common

hallmark in these noncoding repeat expansion disorders

is transcribed repeat nuclear accumulations with respec-

tive repeat RNA-binding proteins, which are considered

to primarily trigger and develop the disease at the RNA

level. However, multiple different mechanisms are likely

to be involved in each disorder. There are at least two

possible explanations for the motor neuron involvement

of SCA36: gene- and tissue-specific splicing specificity of

SRSF2 and involvement of miRNA. In SCA36, there is the

possibility that the adverse effect of the expansion muta-

tion is mediated by downregulation of miRNA expression.

The biochemical implication of miRNA involvement

cannot be evaluated in this study, because availability of

tissue samples from affected cases was limited to LCLs.

Given definitive downregulation of miRNA 1292 in

LCLs, we should await further study to substantiate its

involvement in affected tissues. Elucidating which mecha-

nism(s) plays a critical role in the pathogenesis will

be required for determining whether cerebellar degenera-

tion and motor neuron disease occur through a similar

scenario.

Figure 6. RNA Foci Formation and Decreased Transcription of MIR1292
(A) Cells were fixed on coverslips and then hybridized with solutions containing either a Cy3-labeled C(CAGGCC)2CAG or G(CAGGCG)2CAG oligonucleotide probe (1 ng/ml). For controls, the cells were treated with 1000 U/ml DNase or 100 μg/ml RNase for 1 hr at 37°C prior to hybridization, as indicated. After a wash step, coverslips were placed on the slides in the presence of ProLong Gold with DAPI mounting media (Molecular Probes, Tokyo, Japan) and photographed with a fluorescence microscope. The upper panels indicate LCLs from an SCA case and a control hybridized with C(CAGGCC)2CAG (left) or G(CAGGCG)2CAG (right). Red and blue signals represent RNA foci and the nucleus (DAPI staining), respectively. Similar RNA foci formation was confirmed in LCLs from another index case. The lower panels show RNA foci in SCA LCLs treated with DNase or RNase.
(B) Double staining was performed with the probe for (GGCCUG)n (red) and anti-SRSF2, NOP56, or coilin antibody (green).
(C) Gel-shift assays revealed specific binding of SRSF2 to (GGCCUG)4 but little to (CUG)6.
(D) RNA samples (10 ng) were extracted from LCLs of controls (n = 3) and cases (n = 6). MiRNAs were measured with the use of a TaqMan probe for precursor (Pri-) and mature MIR1292. The data indicate the mean ± SD, relative to the levels of PP1A or RNU6. *: p < 0.05.


In conclusion, expansion of the intronic GGCCTG

hexanucleotide repeat in NOP56 causes a unique form of

SCA, SCA36, which shows not only ataxia but also motor

neuron dysfunction. This characteristic disease phenotype

can be explained by the combination of RNA gain of func-

tion and MIR1292 suppression. Additional studies are

required to investigate the roles of each mechanistic

component in the pathogenesis of SCA36.

Supplemental Data

Supplemental Data include one figure and three tables and can be

found with this article online at http://www.cell.com/AJHG/.

Acknowledgments

This work was supported mainly by grants to A.K. and partially by

grants to T.M., Y.I., H.K., and K.A. We thank Norio Matsuura,

Kokoro Iwasawa, and Kouji H. Harada (Kyoto University Graduate

School of Medicine).

Received: February 23, 2011

Revised: May 8, 2011

Accepted: May 18, 2011

Published online: June 16, 2011

Web Resources


ESEfinder 3.0, http://rulai.cshl.edu/cgi-bin/tools/ESE3/esefinder.cgi?process=home

NCBI, http://www.ncbi.nlm.nih.gov/

Target Scan Human 5.1, http://www.targetscan.org/

UCSC Genome Bioinformatics, http://genome.ucsc.edu

References

1. Harding, A.E. (1982). The clinical features and classification of

the late onset autosomal dominant cerebellar ataxias. A study

of 11 families, including descendants of the ‘the Drew family

of Walworth’. Brain 105, 1–28.

2. Matilla-Duenas, A., Sanchez, I., Corral-Juan, M., Davalos, A.,

Alvarez, R., and Latorre, P. (2010). Cellular and molecular

pathways triggering neurodegeneration in the spinocerebellar

ataxias. Cerebellum 9, 148–166.

3. Schols, L., Bauer, P., Schmidt, T., Schulte, T., and Riess, O. (2004).

Autosomal dominant cerebellar ataxias: clinical features,

genetics, and pathogenesis. Lancet Neurol. 3, 291–304.

4. Ohta, Y., Hayashi, T., Nagai, M., Okamoto, M., Nagotani, S.,

Nagano, I., Ohmori, N., Takehisa, Y., Murakami, T., Shoji,

M., et al. (2007). Two cases of spinocerebellar ataxia accompa-

nied by involvement of the skeletal motor neuron system and

bulbar palsy. Intern. Med. 46, 751–755.

5. Daughters, R.S., Tuttle, D.L., Gao, W., Ikeda, Y., Moseley, M.L.,

Ebner, T.J., Swanson, M.S., and Ranum, L.P. (2009). RNA gain-

of-function in spinocerebellar ataxia type 8. PLoS Genet. 5,

e1000600.

6. Sato, N., Amino, T., Kobayashi, K., Asakawa, S., Ishiguro, T.,

Tsunemi, T., Takahashi, M., Matsuura, T., Flanigan, K.M.,

Iwasaki, S., et al. (2009). Spinocerebellar ataxia type 31 is

associated with ‘‘inserted’’ penta-nucleotide repeats contain-

ing (TGGAA)n. Am. J. Hum. Genet. 85, 544–557.

7. White, M.C., Gao, R., Xu, W., Mandal, S.M., Lim, J.G., Hazra,

T.K., Wakamiya, M., Edwards, S.F., Raskin, S., Teive, H.A., et al.

(2010). Inactivation of hnRNP K by expanded intronic

AUUCU repeat induces apoptosis via translocation of

PKCdelta to mitochondria in spinocerebellar ataxia 10. PLoS

Genet. 6, e1000984.

8. Kruglyak, L., Daly, M.J., Reeve-Daly, M.P., and Lander, E.S.

(1996). Parametric and nonparametric linkage analysis:

a unified multipoint approach. Am. J. Hum. Genet. 58,

1347–1363.

9. Bakalkin, G., Watanabe, H., Jezierska, J., Depoorter, C.,

Verschuuren-Bemelmans, C., Bazov, I., Artemenko, K.A.,

Yakovleva, T., Dooijes, D., Van de Warrenburg, B.P., et al.

(2010). Prodynorphin mutations cause the neurodegenerative

disorder spinocerebellar ataxia type 23. Am. J. Hum. Genet.

87, 593–603.

10. Wang, J.L., Yang, X., Xia, K., Hu, Z.M., Weng, L., Jin, X., Jiang,

H., Zhang, P., Shen, L., Guo, J.F., et al. (2010). TGM6 identified

as a novel causative gene of spinocerebellar ataxias using

exome sequencing. Brain 133, 3510–3518.

11. Cagnoli, C., Michielotto, C., Matsuura, T., Ashizawa, T., Marg-

olis, R.L., Holmes, S.E., Gellera, C., Migone, N., and Brusco, A.

(2004). Detection of large pathogenic expansions in FRDA1,

SCA10, and SCA12 genes using a simple fluorescent repeat-

primed PCR assay. J. Mol. Diagn. 6, 96–100.

12. Matsuura, T., and Ashizawa, T. (2002). Polymerase chain reac-

tion amplification of expanded ATTCT repeat in spinocerebel-

lar ataxia type 10. Ann. Neurol. 51, 271–272.

13. Warner, J.P., Barron, L.H., Goudie, D., Kelly, K., Dow, D.,

Fitzpatrick, D.R., and Brock, D.J. (1996). A general method

for the detection of large CAG repeat expansions by fluores-

cent PCR. J. Med. Genet. 33, 1022–1026.

14. Hashikata, H., Liu, W., Inoue, K., Mineharu, Y., Yamada, S.,

Nanayakkara, S., Matsuura, N., Hitomi, T., Takagi, Y., Hashi-

moto, N., et al. (2010). Confirmation of an association of

single-nucleotide polymorphism rs1333040 on 9p21 with

familial and sporadic intracranial aneurysms in Japanese

patients. Stroke 41, 1138–1144.

15. Wahl, M.C., Will, C.L., and Luhrmann, R. (2009). The spliceo-

some: design principles of a dynamic RNP machine. Cell 136,

701–718.

16. Lechertier, T., Grob, A., Hernandez-Verdun, D., and Roussel, P.

(2009). Fibrillarin and Nop56 interact before being co-assem-

bled in box C/D snoRNPs. Exp. Cell Res. 315, 928–942.

17. Liquori, C.L., Ricker, K., Moseley, M.L., Jacobsen, J.F., Kress,

W., Naylor, S.L., Day, J.W., and Ranum, L.P. (2001). Myotonic

dystrophy type 2 caused by a CCTG expansion in intron 1 of

ZNF9. Science 293, 864–867.

18. Taneja, K.L., McCurrach, M., Schalling, M., Housman, D., and

Singer, R.H. (1995). Foci of trinucleotide repeat transcripts in

nuclei of myotonic dystrophy cells and tissues. J. Cell Biol.

128, 995–1002.

19. Winter, J., Jung, S., Keller, S., Gregory, R.I., and Diederichs, S.

(2009). Many roads to maturity: microRNA biogenesis path-

ways and their regulation. Nat. Cell Biol. 11, 228–234.

20. Zhao, Y., and Srivastava, D. (2007). A developmental view of

microRNA function. Trends Biochem. Sci. 32, 189–197.

21. Eacker, S.M., Dawson, T.M., and Dawson, V.L. (2009). Under-

standing microRNAs in neurodegeneration. Nat. Rev. Neuro-

sci. 10, 837–841.


22. Hebert, S.S., and De Strooper, B. (2009). Alterations of the

microRNA network cause neurodegenerative disease. Trends

Neurosci. 32, 199–206.

23. Tsuzuki, K., and Ozawa, S. (2005). Glutamate Receptors. Ency-

clopedia of life sciences. John Wiley and Sons, Ltd., http://

onlinelibrary.com/doi/10.1038/npg.els.0005056.

24. Nutini, M., Frazzini, V., Marini, C., Spalloni, A., Sensi, S.L., and

Longone, P. (2011). Zinc pre-treatment enhances NMDAR-

mediated excitotoxicity in cultured cortical neurons from

SOD1(G93A) mouse, a model of amyotrophic lateral sclerosis.

Neuropharmacology 60, 1200–1208.

25. Sanelli, T., Ge, W., Leystra-Lantz, C., and Strong, M.J. (2007).

Calcium mediated excitotoxicity in neurofilament aggregate-

bearing neurons in vitro is NMDA receptor dependant.

J. Neurol. Sci. 256, 39–51.

26. Todd, P.K., and Paulson, H.L. (2010). RNA-mediated neurode-

generation in repeat expansion disorders. Ann. Neurol. 67,

291–300.



REPORT

A Mutation in a Skin-Specific Isoform of SMARCAD1 Causes Autosomal-Dominant Adermatoglyphia

Janna Nousbeck,1 Bettina Burger,2 Dana Fuchs-Telem,1,4 Mor Pavlovsky,1 Shlomit Fenig,1 Ofer Sarig,1

Peter Itin,2,3 and Eli Sprecher1,4,*

Monogenic disorders offer unique opportunities for researchers to shed light upon fundamental physiological processes in humans. We

investigated a large family affected with autosomal-dominant adermatoglyphia (absence of fingerprints) also known as the ‘‘immigra-

tion delay disease.’’ Using linkage and haplotype analyses, we mapped the disease phenotype to 4q22. One of the genes located in

this interval is SMARCAD1, a member of the SNF subfamily of the helicase protein superfamily. We demonstrated the existence of a short

isoform of SMARCAD1 exclusively expressed in the skin. Sequencing of all SMARCAD1 coding and noncoding exons revealed a heterozygous

transversion predicted to disrupt a conserved donor splice site adjacent to the 3′ end of a noncoding exon uniquely present in the

skin-specific short isoform of the gene. This mutation segregated with the disease phenotype throughout the entire family. Using a

minigene system, we found that this mutation causes aberrant splicing, resulting in decreased stability of the short RNA isoform as predicted

by computational analysis and shown by RT-PCR. Taken together, the present findings implicate a skin-specific isoform of SMARCAD1 in

the regulation of dermatoglyph development.

Epidermal ridges are characteristic features of the human

skin1 and in wide use in the modern era as almost unsur-

passed identification tools. The physiological role of

epidermal ridges remains controversial. Recent data have

dismissed the theory that fingerprints might improve the

grip by ramping up friction levels.2 Instead, epidermal

ridges might amplify vibratory signals to deeply embedded

nerves involved in fine texture perception.3

The factors underlying the formation of epidermal ridges

during embryonic development and their pattern remain

unknown but are likely to include both genetically deter-

mined traits4 as well as environmental elements5 and to

involve some form of interactions between the mesen-

chymal and the dermal and the epidermal elements. At

24 weeks postfertilization, the epidermal-ridge system

displays an adult morphology6 that remains permanent

without any modification throughout life. The congenital

absence of epidermal ridges is a rare condition known as

adermatoglyphia (ADG). To date only four families with

congenital absence of fingerprints have been described.7–10

In three of these families,7–9 additional features such as

congenital facial milia, skin blisters, and fissures associated

with heat or trauma were reported. A number of more

complex syndromes such as Naegeli-Franceschetti-Jadas-

sohn syndrome (MIM 161000) and dyskeratosis congenita

(MIM 305000) also feature abnormal development of

epidermal ridges,11,12 as detailed in a recent review of the

topic.13

In the present study we investigated a large Swiss kindred

presenting with autosomal-dominant adermatoglyphia

recently coined as the ‘‘immigration delay disease’’13

because affected individuals report significant difficulties

entering countries that require fingerprint recording. All

affected members of this family displayed since birth an

absence of fingerprints (Figure 1A); histological analysis13

revealed that this absence was associated with a reduced

number of sweat glands and a sweat test showed a reduced

ability for hand transpiration (Figure 1B).

All affected (n = 9) and healthy (n = 7) family members

or their legal guardian provided written and informed

consent according to a protocol approved by the institu-

tional review board of University Hospital Basel in adher-

ence with the principles of the declaration of Helsinki.

DNA was extracted from peripheral blood lymphocytes.

We initially genotyped all family members by using the

Illumina Human Linkage-12 chip comprising 6000 tagged

SNPs distributed across the genome. Two hundred ng of

DNA were hybridized according to the Infinium II assay

(Illumina, San Diego, CA) and scanned with an Illumina

BeadArray reader. The scanned images were imported

into BeadStudio 3.1.3.0 (Illumina) for extraction and

quality control, with an average call rate of 99.9%.

Multipoint linkage analysis with the Superlink software14

generated a LOD score of 2.85 at marker rs1509948

(Figure 2). Fine mapping of the disease interval was per-

formed with polymorphic microsatellite markers that

were selected from the National Center for Biotechnology

Information (NCBI) database. Genotypes were established

with fluorescently labeled primer pairs (Research Genetics,

Invitrogen, Carlsbad, CA) according to the manufacturer’s

recommendations. PCR products were separated by PAGE

on an automated sequencer (ABI PRISM 3100 Genetic

Analyzer; Applied Biosystems, Foster City, CA), and allele

sizes were determined with Gene Mapper v4.0 software.

Haplotype analysis refined the disease locus to a 5.1 Mb

interval between markers D4S423 and D4S1560 (Figure 2).
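For readers less familiar with LOD scores, the sketch below shows what such a value means as an odds ratio and how a LOD is computed in the simplest textbook case. It is not the Superlink algorithm, which evaluates multipoint likelihoods over the whole pedigree. The two-point formula assumes phase-known, fully informative meioses, and the example counts (0 recombinants in 10 meioses) are hypothetical.

```python
import math

def two_point_lod(recombinants, meioses, theta):
    """LOD at recombination fraction theta for phase-known, fully informative meioses."""
    if not 0 < theta < 0.5:
        raise ValueError("theta must lie strictly between 0 and 0.5")
    nonrecombinants = meioses - recombinants
    log_l_linked = (recombinants * math.log10(theta)
                    + nonrecombinants * math.log10(1 - theta))
    log_l_unlinked = meioses * math.log10(0.5)  # free recombination, theta = 0.5
    return log_l_linked - log_l_unlinked

def lod_to_odds(lod):
    """Odds in favour of linkage implied by a LOD score."""
    return 10 ** lod

# Hypothetical toy pedigree: 10 informative meioses, no recombinants.
print(round(two_point_lod(0, 10, 0.01), 2))
# Odds implied by the LOD of 2.85 reported in the text.
print(round(lod_to_odds(2.85)))
```

Read this way, a LOD of 2.85 corresponds to odds of roughly 700 to 1 in favour of linkage over free recombination.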

1Department of Dermatology, Tel Aviv Sourasky Medical Center, Tel Aviv 64239, Israel; 2Department of Biomedicine, University Hospital Basel, Basel 4051,

Switzerland; 3Department of Dermatology, University Hospital Basel, Basel 4051, Switzerland; 4Department of Human Molecular Genetics and Biochem-

istry, Sackler Faculty of Medicine, Tel-Aviv University, Ramat Aviv 61390, Israel



302 The American Journal of Human Genetics 89, 302–307, August 12, 2011

We found the disease interval contained 17 genes. All

coding and noncoding exons of the disease interval genes

were fully sequenced. Initially, no mutation was identified.

We therefore carefully scrutinized all currently available

databases for rare transcripts. We identified one minor

transcript (ENST00000509418, NM_001128430.1), sharing

a common nucleotide sequence with the 3′-end of

SMARCAD1 (MIM 612761). SMARCAD1 encodes a protein

that is structurally related to the SWI2/SNF2 superfamily

of DNA-dependent ATPases, which function as catalytic

subunits of chromatin-remodeling complexes and are

consequently considered to be major regulators of tran-

scriptional activity.15 The two SMARCAD1 isoforms differ

in length and site of transcription initiation. The shorter

SMARCAD1 isoform is predicted to contain a unique

5′-nontranslated exon (Figure 3A). It is of interest that, in

contrast with the major large isoform, which was found to

be expressed ubiquitously as previously shown,16 the

SMARCAD1 short isoform was mainly identifiable by RT-

PCR in skin fibroblasts and to a lesser extent in keratino-

cytes and esophageal tissue (Figure 4), suggesting that it

might represent an attractive candidate gene for a skin

condition such as ADG.

To assess the possible involvement of SMARCAD1 in

ADG, genomic DNA was amplified by PCR with primer

pairs spanning the entire coding sequence of both

SMARCAD1 isoforms (Table S1, available online) and Taq

polymerase (QIAGEN, Valencia, CA). Cycling conditions

were 94°C for 2 min followed by three cycles at 94°C for

40 s, 61°C for 40 s, and 72°C for 40 s; three cycles at

94°C for 40 s, 59°C for 40 s, and 72°C for 40 s; three cycles

at 94°C for 40 s, 57°C for 40 s, and 72°C for 40 s; 33 cycles

at 94°C for 40 s, 55°C for 40 s, and 72°C for 40 s; and a final

extension step at 72°C for 10 min. DNA was extracted

from gel and purified with QIAquick Gel Extraction kit

(QIAGEN). Direct sequencing of the resulting PCR prod-

ucts with the BigDye terminator system on an automated

sequencer (Applied Biosystems) revealed a heterozygous

G>T transversion in the first intron of the skin-specific

SMARCAD1 short isoform. The mutation, c.378+1G>T,

was predicted to abolish the donor splice site adjacent to

the 3′-end of the first unique exon of the short SMARCAD1

isoform. To confirm the existence of the mutation, we used

a PCR-RFLP assay. A 537 bp long DNA fragment was amplified

with the forward primer 5′-AGCTGATTGGCTGGGAATAC-3′

and reverse primer 5′-GGCATTCATAAAACTCAAAATGC-3′

(Figure 3B). The mutation creates a recognition

site for MseI endonuclease (New England Biolabs, Ipswich,

MA). Using this assay, we confirmed segregation of the

mutation with the disease phenotype throughout the

entire family and also excluded the mutation from a panel

of 100 healthy Swiss individuals and 100 healthy Jewish

individuals (data not shown); this suggests that the muta-

tion does not represent a common neutral polymorphism

but rather is a disease-causing mutation.
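The logic of the PCR-RFLP assay can be illustrated with a toy digest. The sketch below relies on the known MseI recognition sequence (TTAA, cut after the first T) and on a hypothetical 209 bp sequence built only to reproduce the fragment sizes listed in the Figure 3 legend. It is not the SMARCAD1 sequence, and the function and variable names are illustrative.

```python
MSEI_SITE = "TTAA"    # MseI recognition sequence
MSEI_CUT_OFFSET = 1   # MseI cuts after the first T (T^TAA)

def digest(seq, site=MSEI_SITE, cut_offset=MSEI_CUT_OFFSET):
    """Return fragment lengths (5' to 3') after complete digestion of a linear sequence."""
    cuts = []
    start = seq.find(site)
    while start != -1:
        cuts.append(start + cut_offset)
        start = seq.find(site, start + 1)
    bounds = [0] + cuts + [len(seq)]
    return [right - left for left, right in zip(bounds, bounds[1:])]

# Hypothetical sequence: poly-C filler, a GTAA that a G>T change turns into
# TTAA, and one pre-existing TTAA site further downstream.
wild_type = "C" * 72 + "GTAA" + "C" * 86 + "TTAA" + "C" * 43
mutant = wild_type[:72] + "T" + wild_type[73:]  # G>T at the first base of GTAA

print(digest(wild_type))  # fragments from the normal allele
print(digest(mutant))     # fragments from the mutant allele
```

In this toy model the normal allele yields 163 and 46 bp fragments, and the mutant allele splits the 163 bp fragment into 73 and 90 bp. A heterozygous carrier would therefore show all four bands, as described in the figure legend.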

To assess the consequences of the mutation on the

SMARCAD1-splicing pattern, we initially used RT-PCR to

amplify cDNA derived from the RNA extracted from the

fibroblast cell cultures that were established from a patient

and a healthy individual. Total RNA was extracted with

RNeasy Extraction Kit (QIAGEN). cDNA was synthesized

(Thermo Scientific Verso cDNA Synthesis Kit, ABgene,

Surrey, UK) and amplified by PCR with exon-crossing

primers, 5′-GAAAGCAAGAATGTGGCAG-3′ and 5′-GGGCTTGAGTGACAAACT-3′,

located in exons 1 and 3 of the short

SMARCAD1 isoform, respectively. DNA was extracted from

gel, purified with QIAquick Gel Extraction kit (QIAGEN),

and directly sequenced as described above. Only the

wild-type splice product was identified, suggesting that

aberrant splice variants might undergo degradation. To

obtain further support for this possibility, we generated

a minigene construct17 by subcloning exon 1, parts of

intron 1 (because the first intron is very large [~10.5 kb],

we trimmed the intronic sequence) and exon 2 of the

SMARCAD1 short isoform into the pEGFP-C3 vector

(Figure 5A). More specifically, a 1.7 kb genomic DNA frag-

ment comprising exon 1 and the first 1358 bp of intron

1 was cloned into the EcoRI and KpnI restriction sites of

the pEGFP-C3 vector with primers 5′-AAAAAGAATTCAAGAAATTAGAGCTTACATTTAG-3′

and 5′-AAAAAGGTACCTCACTGATTAACAGGGAAAAAG-3′, respectively. Then,

a 0.7 kb genomic fragment comprising the last 500 bp of

intron 1 followed by exon 2 was cloned into the KpnI

and BamHI sites of the first construct with primers

5′-AAAAAGGTACCTATACTTTGATGATAGATGTGG-3′ and

Figure 1. Clinical Features
(A and B) Absence of fingerprints (A) and reduced hand perspiration demonstrated by sweat test (B) in a patient with adermatoglyphia.


5′-AAAAGGATCCCTTTGGTTTAGAATGGAAGG-3′, respectively. We sequenced the entire insert to verify the authenticity

of the construct. Next, we introduced the

c.378+1G>T mutation into the minigene by using the

QuikChange Site-Directed Mutagenesis kit (Stratagene,

Santa Clara, CA). Both the wild-type and the mutant mini-

gene constructs were transiently transfected into HeLa

cells with Lipofectamine 2000 (Invitrogen). Cells were

Figure 2. Genetic Mapping of ADG
(A) Multipoint LOD score analysis was performed with the SuperLink software. LOD scores are plotted against all SNP markers distributed across the genome.
(B) Haplotype analysis with polymorphic markers on chromosomal region 4q22 reveals a heterozygous 5.1 Mb interval between markers D4S423 and D4S1560 uniquely shared by all patients (boxed in red).


harvested 48 hr after transfection; total RNA was extracted

and subjected to RT-PCR and direct sequencing. Transfec-

tion of the wild-type minigene resulted as expected

in the formation of one single and abundant spliced

variant containing exons 1 and 2 of the short SMARCAD1

isoform; this was confirmed by sequencing analysis. In

contrast, transfection of the mutation-carrying minigene

Figure 3. Mutation Analysis
(A) Bioinformatics analysis indicated the existence of two SMARCAD1 isoforms differing in both length and transcription start site. The short SMARCAD1 isoform contains a unique nontranslated exon (red arrow).
(B) Sequence analysis revealed a heterozygous transversion, c.378+1G>T, in the short SMARCAD1 isoform (red arrow, left panel). The wild-type sequence is given for comparison (right panel).
(C) PCR-RFLP analysis confirmed segregation of the mutation in the family. Mutation c.378+1G>T creates a recognition site for MseI endonuclease; thus, healthy individuals display fragments of 163 bp and 46 bp, whereas affected heterozygous patients show in addition fragments of 73 bp and 90 bp.

Figure 4. Tissue Expression of SMARCAD1 Isoforms
SMARCAD1 isoform expression was assessed with Clontech tissue blot cDNA array. Quantitative RT-PCR analysis showed that the long SMARCAD1 isoform is expressed ubiquitously at low level. In contrast, the short SMARCAD1 isoform was found to be expressed mainly in skin fibroblasts, keratinocytes, and the esophagus. Expression of SMARCAD1 was normalized to that of ACTB. Results are provided as the fold change of expression relative to SMARCAD1 long isoform expression in keratinocytes ± standard deviation.

led to the generation of two aberrant splice variants: the first contained an extra 51 bp from intron 1, and the second lacked one G at the end of exon 1, both because of the utilization of cryptic donor splice sites. Of interest, the

abnormal splice products were only marginally detectable

as compared with the wild-type RNA, both in HeLa cells

(Figure 5B) and in primary human fibroblasts (data not

shown). These results are in line with the fact that aberrant

splice variants were not detectable in patient fibroblasts

(see above).

Two main mechanisms, alone or in combination, might explain this observation. First, authentic splicing is typically more efficient than splicing activated at cryptic sites.18 Therefore, it is possible that the significantly reduced level of aberrant splice variants is due to a decrease in splicing efficiency. Another possibility is that the abnormal 5′ UTR variants affect RNA stability. Indeed, alteration in the secondary structure of an RNA molecule has been shown to inhibit translation initiation directly, by preventing the 40S subunit binding or scanning, or indirectly, by preventing the action of regulatory RNA-binding proteins. This in turn has been shown to foster mRNA degradation by increasing decapping and the deadenylation rate.19 To assess this possibility, we initially compared via computational analysis the secondary structure of wild-type and aberrant splice RNA variants by using the GeneBee RNA secondary-structure prediction software. As shown in Figure 5C, computational analysis predicts that both aberrant splice variations are likely to significantly affect RNA secondary configuration; this prediction is in agreement with the fact that the 5′ UTR region of the gene affected by the abnormal splicing is highly conserved across species at the nucleotide level (data not shown).

To obtain experimental support for the possibility that aberrantly spliced variants of the SMARCAD1 short isoform undergo degradation, we treated cells transfected with both the wild-type and mutation-carrying constructs with cycloheximide at a concentration of 50 µg/ml for 24 hr, which is known to inhibit decapping of mRNA.20 As a result, we observed a significant increase in the aberrant splice variant levels but not in the wild-type splice variant (Figure 5D).

In conclusion, we have identified in a large family with

ADG a splice site mutation causing aberrant splicing of

a skin-specific isoform of SMARCAD1, implicating this

gene in dermatoglyph ontogenesis. The mutation is likely

to exert a loss-of-function effect.

Little is known about the function of the full-length

SMARCAD1, and virtually nothing is known regarding

the physiological role of the skin-specific isoform of this

gene. Clearly, the tissue-specific pattern of expression of

the short isoform is likely to underlie the very limited

phenotype displayed by our patients, as attested by the

severe phenotype observed in mice knocked out for the

ubiquitous SMARCAD1 large isoform of the gene;21 those

mice feature retarded growth, perinatal mortality,

decreased fertility, and various skeletal defects.

The full-length SMARCAD1 seems to control the expression of a large spectrum of target genes encoding transcriptional factors and histone modifiers as well as regulators of the cell cycle and development.16 It is tempting to speculate that the skin-specific isoform of SMARCAD1 might target genes involved in dermatoglyph and sweat gland development, two structures jointly affected in the present family and in additional disorders such as Naegeli-Franceschetti-Jadassohn and Rapp-Hodgkin (MIM 129400) syndromes.11,22 Regardless of the exact mechanisms mediating the activity of the skin-specific isoform of SMARCAD1 in the skin, the present results once again underscore the fact that rare monogenic traits represent an invaluable tool for the investigation of concealed aspects of our biology.

Supplemental Data

Supplemental Data include one table and can be found with this

article online at http://www.cell.com/AJHG/.

Acknowledgments

We would like to acknowledge the participation of all family

members in this study. We would like to thank Sylvia Kiese for

her help. We wish to thank Gil Ast, Hadas Keren, and Mordechai

Choder for helpful discussions.

Figure 5. Consequences of Mutation c.378+1G>T
To assess the consequences of mutation c.378+1G>T on SMARCAD1 splicing, we used a minigene system.
(A) Schematic representation of the SMARCAD1 short isoform wild-type and mutation-carrying minigenes.
(B) Sequence analysis of RT-PCR products generated from HeLa cells transfected with wild-type and mutant minigene constructs. Transfection of wild-type minigene resulted in the formation of one spliced variant containing exons 1 and 2 of the SMARCAD1 short isoform. In contrast, transfection of the mutant minigene resulted in two aberrant splice variants, containing an extra 51 bp from intron 1 or missing one G at the end of exon 1. A marked decrease in the level of expression of the spliced variants was also observed.
(C) Computational modeling predicts an altered mRNA secondary structure of both aberrant splice variants.
(D) Treatment with cycloheximide (at a concentration of 50 µg/ml for 24 hr), known to inhibit mRNA decapping, resulted in significantly increased levels of aberrant (but not wild-type) splice variants.


Received: June 7, 2011

Revised: July 4, 2011

Accepted: July 8, 2011

Published online: August 4, 2011

Web Resources

The URLs for data presented herein are as follows

dbSNP, http://www.ncbi.nlm.nih.gov/SNP/

Ensembl, http://www.ensembl.org/

GenBank, http://www.ncbi.nlm.nih.gov/Genbank/

GeneBee, http://www.genebee.msu.su/

Online Mendelian Inheritance in Man (OMIM), http://www.omim.org

Superlink, http://bioinfo.cs.technion.ac.il/superlink-online-twoloci/makeped/TwoLociMultiPoint.html

UCSC Genome Browser, http://genome.ucsc.edu/

References

1. Verbov, J. (1970). Clinical significance and genetics of

epidermal ridges—a review of dermatoglyphics. J. Invest. Der-

matol. 54, 261–271.

2. Warman, P.H., and Ennos, A.R. (2009). Fingerprints are unlikely to increase the friction of primate fingerpads. J. Exp. Biol. 212, 2016–2022.

3. Scheibert, J., Leurent, S., Prevost, A., and Debregeas, G. (2009). The role of fingerprints in the coding of tactile information probed with a biomimetic sensor. Science 323, 1503–1506.

4. Reed, T., Viken, R.J., and Rinehart, S.A. (2006). High heritability of fingertip arch patterns in twin-pairs. Am. J. Med. Genet. A. 140, 263–271.

5. Bokhari, A., Coull, B.A., and Holmes, L.B. (2002). Effect of prenatal exposure to anticonvulsant drugs on dermal ridge patterns of fingers. Teratology 66, 19–23.

6. Babler, W.J. (1991). Embryologic development of epidermal ridges and their configurations. Birth Defects Orig. Artic. Ser. 27, 95–112.

7. Baird, H.W. (1968). Absence of fingerprints in four generations. Lancet 2, 1250.

8. Basan, M. (1965). Ectodermal dysplasia. Missing papillary pattern, nail disorders and furrows on 4 fingers. Arch. Klin. Exp. Dermatol. 222, 546–557.

9. Reed, T., and Schreiner, R.L. (1983). Absence of dermal ridge patterns: Genetic heterogeneity. Am. J. Med. Genet. 16, 81–88.

10. Limova, M., Blacker, K.L., and LeBoit, P.E. (1993). Congenital absence of dermatoglyphs. J. Am. Acad. Dermatol. 29, 355–358.

11. Lugassy, J., Itin, P., Ishida-Yamamoto, A., Holland, K., Huson, S., Geiger, D., Hennies, H.C., Indelman, M., Bercovich, D., Uitto, J., et al. (2006). Naegeli-Franceschetti-Jadassohn syndrome and dermatopathia pigmentosa reticularis: Two allelic ectodermal dysplasias caused by dominant mutations in KRT14. Am. J. Hum. Genet. 79, 724–730.

12. Sirinavin, C., and Trowbridge, A.A. (1975). Dyskeratosis congenita: Clinical features and genetic aspects. Report of a family and review of the literature. J. Med. Genet. 12, 339–354.

13. Burger, B., Fuchs, D., Sprecher, E., and Itin, P. (2011). The immigration delay disease: Adermatoglyphia-inherited absence of epidermal ridges. J. Am. Acad. Dermatol. 64, 974–980.

14. Fishelson, M., and Geiger, D. (2002). Exact genetic linkage computations for general pedigrees. Bioinformatics 18 (Suppl 1), S189–S198.

15. Adra, C.N., Donato, J.L., Badovinac, R., Syed, F., Kheraj, R., Cai, H., Moran, C., Kolker, M.T., Turner, H., Weremowicz, S., et al. (2000). SMARCAD1, a novel human helicase family-defining member associated with genetic instability: Cloning, expression, and mapping to 4q22-q23, a band rich in breakpoints and deletion mutants involved in several human diseases. Genomics 69, 162–173.

16. Okazaki, N., Ikeda, S., Ohara, R., Shimada, K., Yanagawa, T., Nagase, T., Ohara, O., and Koga, H. (2008). The novel protein complex with SMARCAD1/KIAA1122 binds to the vicinity of TSS. J. Mol. Biol. 382, 257–265.

17. Singh, G., and Cooper, T.A. (2006). Minigene reporter for identification and analysis of cis elements and trans factors affecting pre-mRNA splicing. Biotechniques 41, 177–181.

18. Roca, X., Sachidanandam, R., and Krainer, A.R. (2003). Intrinsic differences between authentic and cryptic 5′ splice sites. Nucleic Acids Res. 31, 6321–6333.

19. Day, D.A., and Tuite, M.F. (1998). Post-transcriptional gene regulatory mechanisms in eukaryotes: An overview. J. Endocrinol. 157, 361–371.

20. Schwartz, D.C., and Parker, R. (1999). Mutations in translation initiation factors lead to increased rates of deadenylation and decapping of mRNAs in Saccharomyces cerevisiae. Mol. Cell. Biol. 19, 5247–5256.

21. Schoor, M., Schuster-Gossler, K., Roopenian, D., and Gossler, A. (1999). Skeletal dysplasias, growth retardation, reduced postnatal survival, and impaired fertility in mice lacking the SNF2/SWI2 family member ETL1. Mech. Dev. 85, 73–83.

22. Atasu, M., Akesi, S., Elcioglu, N., Yatmaz, P.I., and Ertas, E.B. (1999). A Rapp-Hodgkin like syndrome in three sibs: Clinical, dental and dermatoglyphic study. Clin. Dysmorphol. 8, 101–110.



REVIEW

Five Years of GWAS Discovery

Peter M. Visscher,1,2,* Matthew A. Brown,1 Mark I. McCarthy,3,4 and Jian Yang5

The past five years have seen many scientific and biological discoveries made through the experimental design of genome-wide association studies (GWASs). These studies were aimed at detecting variants at genomic loci that are associated with complex traits in the population and, in particular, at detecting associations between common single-nucleotide polymorphisms (SNPs) and common diseases such as heart disease, diabetes, auto-immune diseases, and psychiatric disorders. We start by giving a number of quotes from scientists and journalists about perceived problems with GWASs. We will then briefly give the history of GWASs and focus on the discoveries made through this experimental design, what those discoveries tell us and do not tell us about the genetics and biology of complex traits, and what immediate utility has come out of these studies. Rather than giving an exhaustive review of all reported findings for all diseases and other complex traits, we focus on the results for auto-immune diseases and metabolic diseases. We return to the perceived failure or disappointment about GWASs in the concluding section.

Introduction: Have GWASs Been a Failure?

In the past five years, genome-wide association studies

(GWASs) have led to many scientific discoveries, and yet

at the same time, many people have pointed to various

problems and perceived failures of this experimental

design. Let us begin by considering a number of criticisms

that have been made against GWASs. We do not list these

quotes to discredit any of the scientists or journalists

involved, nor to deliberately cite them out of context.

Rather, they serve to confirm that the points we discuss

in this review are related to beliefs held by a significant

number of scientific commentators and therefore warrant

consideration.

From an interview with Sir Alec Jeffreys, ESHG Award

Lecturer 2010:

‘‘One of the great hopes for GWAS was that, in the

same way that huge numbers of Mendelian disorders

were pinned down at the DNA level and the gene

and mutations involved identified, it would be

possible to simply extrapolate from single gene disorders

to complex multigenic disorders. That really

hasn’t happened. Proponents will argue that it has

worked and that all sorts of fascinating genes that

predispose to or protect against diabetes or breast

cancer, for example, have been identified, but the

fact remains that the bulk of the heritability in these

conditions cannot be ascribed to loci that have

emerged from GWAS, which clearly isn’t going to

be the answer to everything.’’

From McClellan and King, Cell 2010¹:

‘‘To date, genome-wide association studies (GWAS) have published hundreds of common variants whose allele frequencies are statistically correlated with various illnesses and traits. However, the vast majority of such variants have no established biological relevance to disease or clinical utility for prognosis or treatment.’’

‘‘An odds ratio of 3.0, or even of 2.0 depending on population allele frequencies, would be robust to such population stratification. However, odds ratios of the magnitude generally detected by GWAS (<1.5) can frequently be explained by cryptic population stratification, regardless of the p value associated with them.’’

‘‘More generally, it is now clear that common risk variants fail to explain the vast majority of genetic heritability for any human disease, either individually or collectively (Manolio et al., 2009).’’

‘‘The general failure to confirm common risk variants is not due to a failure to carry out GWAS properly. The problem is underlying biology, not the operationalization of study design. The common disease–common variant model has been the primary focus of human genomics over the last decade. Numerous international collaborative efforts representing hundreds of important human diseases and traits have been carried out with large well-characterized cohorts of cases and controls. If common alleles influenced common diseases, many would have been found by now. The issue is not how to develop still larger studies, or how to parse the data still further, but rather whether the common disease–common variant hypothesis has now been tested and found not to apply to most complex human diseases.’’

From Nicholas Wade in the New York Times, March 20

2011:

‘‘More common diseases, like cancer, are thought to

be caused by mutations in several genes, and finding

the causes was the principal goal of the $3 billion

1University of Queensland Diamantina Institute, Princess Alexandra Hospital, Brisbane, Queensland 4102, Australia; 2The Queensland Brain Institute, The University of Queensland, Brisbane, Queensland 4072, Australia; 3Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford OX3 7BN, UK; 4Oxford Centre for Diabetes, Endocrinology and Metabolism, Churchill Hospital, Old Road, Headington, Oxford OX3 7LJ, UK; 5Queensland Institute of Medical Research, 300 Herston Road, Brisbane, Queensland 4006, Australia



The American Journal of Human Genetics 90, 7–24, January 13, 2012 7

human genome project. To that end, medical geneticists

have invested heavily over the last eight years

in an alluring shortcut. But the shortcut was based

on a premise that is turning out to be incorrect. Scientists

thought the mutations that caused common

diseases would themselves be common. So they first

identified the common mutations in the human

population in a $100 million project called the

HapMap. Then they compared patients’ genomes

with those of healthy genomes. The comparisons

relied on ingenious devices called SNP chips, which

scan just a tiny portion of the genome. (SNP,

pronounced ‘‘snip,’’ stands for single nucleotide

polymorphism.) These projects, called genome-wide

association studies, each cost around $10 million or

more. The results of this costly international exercise

have been disappointing. About 2,000 sites on the

human genome have been statistically linked with

various diseases, but in many cases the sites are

not inside working genes, suggesting there may be

some conceptual flaw in the statistics. And in most

diseases the culprit DNA was linked to only a small

portion of all the cases of the disease. It seemed that

natural selection has weeded out any disease-causing

mutation before it becomes common.’’

From Tim Crow, Molecular Psychiatry 2011²:

‘‘There comes a point at which the genetic skeptic can be pardoned the suggestion that if the genes are so small and so multiple, what they are hardly matters, the dividing line between polygenes and no genes is of little practical consequence. Have we reached this point?’’

From a commentary article by Jonathan Latham, on

guardian.co.uk, 17 April 2011:

‘‘Among all the genetic findings for common

illnesses, such as heart disease, cancer and mental

illnesses, only a handful are of genuine significance

for human health. Faulty genes rarely cause, or even

mildly predispose us, to disease, and as a consequence

the science of human genetics is in deep crisis.

Since the Collins paper [Manolio et al. 2009³] was

published nothing has happened to change that

conclusion. It now seems that the original twin-study

critics were more right than they imagined.

The most likely explanation for why genes for

common diseases have not been found is that, with

few exceptions, they do not exist.’’

These quotes raise a number of different issues about

the methodology, research outcomes, and utility of the

research findings. The pertinent points made in these

quotes are:

(1) GWASs are founded on a flawed assumption that

genetics plays an important role in the risk to

common diseases;

(2) GWASs have been disappointing in not explaining

more genetic variation in the population;

(3) GWASs have not delivered meaningful, biologically

relevant knowledge or results of clinical or any

other utility; and

(4) GWAS results are spurious.

In this review we will briefly give the history of GWASs

and then focus on the discoveries made through this

experimental design, what those discoveries tell us and

do not tell us about the genetics and biology of complex

traits, and what immediate utility has come out of these

studies. We will focus on the results for auto-immune

diseases and metabolic diseases, although there have

been important findings for other diseases and complex

traits. In the concluding section, we will again consider

the perceived failure or disappointment of GWASs.

What Are GWASs, and How Did We Get There?

Attempts to use linkage analysis to map genomic loci that

have an effect on disease or other complex traits have

been ubiquitous in the last two decades. Gene mapping

by linkage relies on the cosegregation of causal variants

with marker alleles within pedigrees. We define and

discuss what we mean by ‘‘causal’’ in Box 1. Because the

number of recombination events per meiosis is relatively

small, tagging a causal variant requires only a few genetic

markers per chromosome. The downside of the small

number of recombination events is that the mapping

resolution, i.e., how close to the causal variant one can

get through linked markers, is typically low. Linkage

mapping has been extremely successful in mapping genes

and gene variants affecting Mendelian traits (e.g., single-gene

disorders).4 Mapping loci underlying common

diseases and, in particular, identifying causative mutations

have had much less success. There are many reasons

for the failure of linkage analyses to reliably identify

complex-trait loci in human pedigrees. One reason is

that the effect sizes (‘‘penetrance’’) of individual causal

variants are too small to allow detection via cosegregation

within pedigrees.

GWASs are based upon the principle of linkage disequilibrium

(LD) at the population level. LD is the nonrandom

association between alleles at different loci. It is created by

evolutionary forces such as mutation, drift, and selection

and is broken down by recombination.5 Generally, loci

that are physically close together exhibit stronger LD

than loci that are farther apart on a chromosome. The

larger the (effective) population size, the weaker the LD

for a given distance.6 (Linkage analysis exploits the large

LD within pedigrees.) The genomic distance at which LD

decays determines how many genetic markers are needed

to ‘‘tag’’ a haplotype, and the number of such tagging

markers is much smaller than the total number of

segregating variants in the population. For example,

a selection of approximately 500,000 common SNPs in

the human genome is sufficient to tag common variation


in non-African populations, even though the total number

of common SNPs exceeds 10 million.7
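As a rough illustration of the quantities described in the paragraph above, the following Python sketch computes the LD coefficient D and the squared correlation r² between two biallelic loci from haplotype and allele frequencies, and applies the textbook expectation that D shrinks by a factor of (1 − c) per generation for recombination fraction c. The function names, the example frequencies, and the recombination fractions are illustrative assumptions, not values taken from the studies cited here.

```python
def ld_coefficients(p_ab, p_a, p_b):
    """Return (D, r2) for two biallelic loci.

    p_ab: frequency of the haplotype carrying allele A and allele B
    p_a, p_b: frequencies of allele A (locus 1) and allele B (locus 2)
    """
    d = p_ab - p_a * p_b  # departure from random association of alleles
    r2 = d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))
    return d, r2


def decayed_d(d0, recomb_fraction, generations):
    """Expected D after some generations of random mating (standard result)."""
    return d0 * (1 - recomb_fraction) ** generations


# Hypothetical example: two alleles that always occur together.
d_full, r2_full = ld_coefficients(p_ab=0.3, p_a=0.3, p_b=0.3)

# Hypothetical example: alleles associated at random (no LD).
d_none, r2_none = ld_coefficients(p_ab=0.12, p_a=0.3, p_b=0.4)

# LD between tightly linked loci persists longer than between loosely linked loci.
d_tight = decayed_d(d_full, recomb_fraction=0.001, generations=100)
d_loose = decayed_d(d_full, recomb_fraction=0.1, generations=100)
```

The slower decay for the smaller recombination fraction is the reason physically close loci tend to show stronger LD, and hence why a limited set of tag markers can stand in for many nearby variants.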

Geneticists realized some time ago that they could exploit population-based LD to map genes. For example, Bodmer suggested in 1986 that fine-mapping using population association could lead to closer linkage between a causative mutation and a linked marker.82 However, fine-mapping still relied on having an initial genomic location that is obtained from linkage analysis in family studies. What if we do not have any prior information on genomic loci or, alternatively, we deliberately want an unbiased scan of the genome? In a landmark paper, Risch and Merikangas83 showed that performing an association scan involving one million variants in the genome and a sample of unrelated individuals could be more powerful than performing a linkage analysis with a few hundred markers. It took only 10 years before this theoretical design became reality. What was needed was the discovery (accelerated by the sequencing of the human genome) of hundreds of thousands of single-nucleotide variants, the quantification of the correlation (LD) structure of those markers in the human genome, and the ability to accurately genotype hundreds of thousands of markers in an automated and affordable manner. The LD structure was investigated in the HapMap project,7 and the outcome was a list of tag SNPs that captured most of the common genomic variation in a number of human populations. Concurrently, commercial companies produced dense SNP arrays that could genotype many markers in a single assay. The technological advances together with biobanks of either population cohorts or case-control samples facilitated the ability to conduct GWASs.

Although GWASs are unbiased with respect to prior biological

knowledge (or prior beliefs) and with respect to

genome location, they are not unbiased in terms of what

is detectable. GWASs rely on LD between genotyped

SNPs and ungenotyped causal variants. The strength of

statistical association between alleles at two loci in the

genome strongly depends on their allele frequencies,

such that a rare variant (say, one with a frequency <0.01)

will be in low LD (as measured by r²) with a nearby

common variant, even if they map to the same recombination

interval.84 But the SNPs that are on the SNP chips

have been selected to be common (most have a minor

allele frequency >0.05). Therefore, GWASs are by design

powered to detect association with causal variants that

are relatively common in the population. Is it realistic to

assume common causal variants for disease segregate in

the population? This is discussed in Box 2.
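The dependence of r² on allele frequencies can be made concrete with a small Python sketch. It computes the largest r² that two biallelic variants can reach given only their allele frequencies, using the standard bounds on the LD coefficient D. The function name and the example frequencies (a 1% variant paired with a 30% SNP) are illustrative assumptions rather than values from the text.

```python
def max_r2(p_a, p_b):
    """Upper bound on r2 between two biallelic loci with allele frequencies p_a and p_b."""
    q_a, q_b = 1 - p_a, 1 - p_b
    # Haplotype frequencies cannot be negative, which limits |D| in each direction.
    d_pos = min(p_a * q_b, q_a * p_b)  # largest possible positive D
    d_neg = min(p_a * p_b, q_a * q_b)  # largest possible magnitude of negative D
    d_max = max(d_pos, d_neg)
    return d_max ** 2 / (p_a * q_a * p_b * q_b)


# Hypothetical rare causal variant (frequency 0.01) and common array SNP (frequency 0.30).
rare_vs_common = max_r2(0.01, 0.30)

# Two variants with matching frequencies can be perfectly correlated.
matched = max_r2(0.30, 0.30)
```

Under these assumed frequencies the bound for the rare-common pair is only a few percent, whereas variants with matching frequencies can reach r² = 1, which is why arrays built from common SNPs tag common causal variants far better than rare ones.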

(Nearly) Five Years of Discovery

Although the first results from a GWAS were reported in

2005⁸ and 2006,⁹ we take the 2007 Wellcome Trust Case

Control Consortium (WTCCC) paper in Nature¹⁰ as a starting

point. The reason for this is that the WTCCC study was

the first large, well-designed GWAS for complex diseases to

employ a SNP chip that had good coverage of the genome.

There are many ways to summarize the discoveries based

on GWASs in the last five years. We have tried to separate

the discoveries quantitatively and to focus on the biology.

There are now well over 2000 loci that are significantly and

robustly associated with one or more complex traits (see

GWAS catalog in Web Resources), as shown in Figure 1.

The vast majority of the loci identified are new, i.e., before

2007 their association with disease or other complex traits

Box 1. What Is a Causal Variant?

New mutations that contribute to an increase or

decrease in risk to disease arise in populations all

the time. Some of these mutations can reach an

appreciable frequency in the population, for

example by random drift or by natural selection.

As discussed in the main text, these mutations will

be associated with other variants in the genome

through LD. Such associations will include those

with SNPs that are genotyped on ‘‘SNP chips.’’

Because there are many more segregating variants

in the population than those genotyped in GWASs,

it is unlikely, but not impossible, that a mutation is

genotyped itself, and so its effect usually will be detected

through an association with a genotyped

variant. This genotyped variant can be robustly associated

with disease in multiple samples from the

same population, or even across populations, but it

is not the mutation that causes variation in risk.

The results from GWASs have shown that variants

at many genetic loci in the genome are associated

with disease, and these also reflect many ancestral

mutations with an effect on susceptibility to disease.

Therefore, the effect size (in terms of increasing or

decreasing the absolute probability of disease) is,

on average, small, and individual variants are

neither necessary nor sufficient to cause disease.

Herein lies the problem of defining ‘‘causal’’: How

do we prove that a particular mutation causes the

observed effect on variation in the population?

Engineering the same mutation in a cell or animal

model might give a relevant phenotype, but that is

not a proof. The mutation can have a direct effect

on gene expression in human tissues or be functional

in another way, but that doesn’t prove it has

a causal effect on disease risk. Operationally, in this

review what we mean by ‘‘causal variant’’ is an

(unknown) variant that has a direct or indirect functional

effect on disease risk, rather than a variant

that is associated with disease risk through LD,

even if we don’t have the tools available at present

to prove causality beyond reasonable doubt. Hence,

it is the variant that causes the observed association

signal.


was not known. Essentially, these are 2000 new biological

leads. The number of loci identified per complex trait

varies substantially, from a handful for psychiatric diseases

to a hundred or more for inflammatory bowel disease

(IBD1 [MIM 266600], including Crohn disease [CD]11

and ulcerative colitis [UC]12) and stature.13 Importantly,

the number of discovered variants is strongly correlated

with experimental sample size (Figure 2), which predicts

that an ever-increasing discovery sample size will increase

the number of discovered variants: very roughly, after

a minimum sample-size threshold below which no variants

are detected is reached, a doubling in sample size leads

Box 2. The CDCV Hypothesis

Currently, the allele frequency of variants that contribute to causing common disease is a subject of some debate.85,86 The common disease-common variant (CDCV) hypothesis is sometimes said to be one side of this debate; the other side holds that disease-causing alleles are typically rare. But what is the precise ‘‘hypothesis’’ in the CDCV hypothesis? We tried to find the origin of the CDCV hypothesis. Many researchers cite either Lander87 or Risch and Merikangas.83 We will add Chakravarti88 and Reich and Lander89 as key studies. Lander87 noted from the then-available data that there is a limited diversity in coding regions at genes, in that most variants are very rare, and therefore the effective number of alleles is small. In addition, he provided ‘‘tantalizing examples’’ of common alleles with large effects (for example, APOE [MIM 107741], MTHFR [MIM 607093], and ACE [MIM 106180]). Reich and Lander89 presented a theoretical population-genetics model that predicted a relatively simple spectrum of the frequency of disease risk alleles at a particular disease locus. They (re)phrased the CDCV hypothesis as the prediction that the expected allelic identity is high for those disease loci that are responsible for most of the population risk for disease. These studies did not appear to make any prediction about the number of disease loci or, therefore, about the effect size. What the authors stated was that if a disease was common, there was likely to be one disease-causing allele that was much more common than all the other disease-causing alleles at the same locus.87,89

Risch and Merikangas83 quantified two important

points regarding the detection of disease loci: first,

that detection by association is more powerful

than linkage when the genotype-relative risk is

modest or small and the risk-allele frequency is large

(say, >10%); and second, that the multiple-testing

burden of a genome scan by association does not

prevent the detection of genome-wide-significant

findings. This paper was essentially about experimental

design and statistical power (and hence feasibility),

not about the CDCV hypothesis as such.
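The multiple-testing burden mentioned here can be sketched with a simple Bonferroni calculation in Python. The count of one million tests follows the association scan described in the main text; the family-wise error rate of 0.05, the two-sided test, and the variable names are assumptions made for illustration and are not taken from Risch and Merikangas.

```python
from statistics import NormalDist


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level that keeps the family-wise error rate at alpha."""
    return alpha / n_tests


n_tests = 1_000_000  # assumed number of tested variants in a genome scan
alpha = 0.05         # assumed genome-wide (family-wise) error rate

p_threshold = bonferroni_threshold(alpha, n_tests)

# Size of the standard normal test statistic needed to reach that p value (two-sided).
z_threshold = NormalDist().inv_cdf(1 - p_threshold / 2)
```

The required test statistic grows only slowly with the number of tests, which is one way to see why a genome-wide scan remains feasible when sample sizes are large enough.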

Finally, Chakravarti88 pointed out that if individuals with disease needed to be homozygous for risk variants at multiple loci, then the risk alleles at those loci must be more common than they would be in a model in which homozygosity at any risk locus is sufficient to cause disease. We note that without the assumption of strong epistasis on the scale of liability, there is no need for risk variants to be common. For example, Risch’s multilocus multiplicative model,90 which implies an additive model on the log (risk) scale (it is one of the ‘‘exchangeable’’ models91), does not rely on a particular allelic spectrum of risk-allele frequencies.

What all these landmark papers have in common

is a remarkable foresight in predicting the GWAS era

well before the publication of the full draft of the

human genome sequence, the HapMap project, or

the availability of commercial genotyping. But

what can we conclude about the origin and specifics

of the CDCV hypothesis? As implicitly or explicitly

stated in these key papers, there is no strong prediction

about the exact allele-frequency spectrum of

risk variants in the genome, nor a prediction about

the effect size at any disease loci and hence about

the total number of risk alleles in the genome.

The current debate is about the frequency spec-

trum of disease-causing alleles. Phrasing the debate

as an either/or question is not very helpful because

examples of both common and rare alleles are

already known, but there is still an open question

as to whether most genetic variation contributing

to complex traits in the population is caused by

rare variants or common variants. A more general

question regards the spectrum of allele frequencies

of disease-causing alleles and the joint distribution

between risk-allele frequency and effect size. In the

special case of an evolutionarily neutral model and

a constant effective population size, most causal

variants that are segregating in the population will

be rare, but most heritability will be due to common

variants.79,92 The reason for this apparent paradox is

that the number of segregating variants is proportional

to 1/[p(1 − p)], where p is the allele frequency

of a risk-increasing allele (so the smaller p, the

more variants of that frequency), whereas the herita-

bility contributed at that frequency is proportional

to p(1 − p). The net effect is that the heritability is

distributed equally over all frequencies, and cumula-

tively most heritability is contributed by common

variants.
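
The apparent paradox can be checked numerically. The Python sketch below assumes, as in the neutral model just described, a density of segregating variants proportional to 1/[p(1 − p)] and a per-variant heritability contribution proportional to p(1 − p). The frequency bounds (0.001 and 0.999) and the 5% cutoff used for "common" are illustrative assumptions, not values taken from the text.

```python
import math

def logit(p):
    """Log-odds of an allele frequency p; the antiderivative of 1/[p(1 - p)]."""
    return math.log(p / (1.0 - p))

def share_of_variants(lo, hi, p_min=0.001, p_max=0.999):
    """Share of segregating variants with frequency in [lo, hi]
    when the density of variants is proportional to 1/[p(1 - p)]."""
    return (logit(hi) - logit(lo)) / (logit(p_max) - logit(p_min))

def share_of_heritability(lo, hi, p_min=0.001, p_max=0.999):
    """Share of heritability from variants with frequency in [lo, hi].
    Density 1/[p(1 - p)] times contribution p(1 - p) is constant,
    so the share is just the relative width of the interval."""
    return (hi - lo) / (p_max - p_min)

# "Common" is taken to mean a minor-allele frequency above 5% (assumed cutoff).
common_variants = share_of_variants(0.05, 0.95)
common_heritability = share_of_heritability(0.05, 0.95)
print(f"share of segregating variants that are common: {common_variants:.2f}")
print(f"share of heritability due to common variants: {common_heritability:.2f}")
```

Under these assumptions fewer than half of the segregating variants are common, yet they account for about nine-tenths of the heritability. The exact shares depend on the chosen frequency bounds.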


to a doubling of the number of associated variants discov-

ered. The proportion of genetic variation explained by

significantly associated SNPs is usually low (typically less

than 10%) for many complex traits, but for diseases such

as CD and multiple sclerosis (MS [MIM 126200]), and for

quantitative traits such as height and lipid traits, between

10% and 20% of genetic variance has been accounted for

(Table 1). In comparison to the pre-GWAS era, the propor-

tion of genetic variation accounted for by newly discov-

ered variants that are segregating in the population is large.

It is clear that for most complex traits that have been

investigated by GWAS, multiple identified loci have

genome-wide statistical significance, and thus it is likely

that there are (many) other loci that have not been identi-

fied because of a lack of statistical significance (false nega-

tives). Recently, researchers have developed and applied

methods to quantify the proportion of phenotypic varia-

tion that is tagged when one considers all SNPs simulta-

neously.12–14 These methods focus on estimation rather

than hypothesis testing and do not suffer from false

negatives caused by small effect sizes.15 Whole-genome

approaches to estimating genetic variation have shown

that approximately one-third to one-half of additive

genetic variation in the population is being tagged when

all GWAS SNPs are considered simultaneously.12–14 This

is a surprisingly large proportion given that evolutionary

theory predicts that most variants affecting disease risk

ought to be found at a low frequency in the population

if they affect fitness,16,17 and such risk variants would

not be in sufficient LD with the common SNPs to be

detected in GWASs.

Autoimmune Diseases

We concentrate on seven autoimmune diseases: ankylosing

spondylitis (AS [MIM 106300]), rheumatoid arthritis

(RA [MIM 180300]), systemic lupus erythematosus (SLE

[MIM 152700]), type 1 diabetes (T1D [MIM 222100]),

MS, CD, and UC. Table 2 summarizes the number of genes

that have been identified for these diseases. Across these

diseases, 19 loci (mainly related to human leukocyte

antigen) were known prior to 2007, and 277 have been

discovered from 2007 onward. The total of 277 includes

multiple counts of loci that have been implicated across a

number of diseases; such loci include BLK (MIM 191305),

TNFAIP3 (MIM 191163) and CD40 (MIM 109535).

Inflammatory bowel disease (IBD, not to be confused

here with identity by descent) is thought to arise from

dysregulation of intestinal homeostasis.18 GWASs of IBD

(CD and UC) have been highly successful in terms of

the number of loci identified (99 nonoverlapping loci in

Figure 1. GWAS Discoveries over Time
Data obtained from the Published GWAS Catalog (see Web Resources). Only the top SNPs representing loci with association p values < 5 × 10⁻⁸ are included, and so that multiple counting is avoided, SNPs identified for the same traits with LD r² > 0.8 estimated from the entire HapMap samples are excluded.

Figure 2. Increase in Number of Loci Identified as a Function of Experimental Sample Size
(A) Selected quantitative traits.
(B) Selected diseases.
The coordinates are on the log scale. The complex traits were selected with the criteria that there were at least three GWAS papers published on each in journals with a 2010–2011 journal impact factor > 9 (e.g., Nature, Nature Genetics, the American Journal of Human Genetics, and PLoS Genetics) and that at least one paper contained more than ten genome-wide significant loci. These traits are a representative selection among all complex traits that fulfilled these criteria.


total18), and a substantial proportion of familial risk, about

20%, has been accounted for.11,12,18 Twenty-eight risk loci

are shared between CD and UC, despite the fact that these

diseases display distinct clinical features, and it has been

suggested that the two diseases share pathways and are

part of a mechanistic continuum.18 There are also strong

overlaps between genes involved in CD and UC, AS,19

and psoriasis (MIM 177900), again suggesting shared aetio-

pathogenic mechanisms in these conditions. Pleiotropic

genetic effects are becoming increasingly widely identified,

including in classical autoimmune diseases.20 For example,

a coding variant in the gene PTPN22 (MIM 600716)

confers strong risk for T1D and RA as well as protection

against CD.18

Metabolic Diseases

In terms of metabolic diseases, we focus here specifically

on type 2 diabetes (T2D [MIM 125853]); fasting glucose

and insulin levels; body-mass index (BMI) and obesity;

and fat distribution. A recent review21 already covered

these complex traits, but we have updated that review

wherever necessary. Table 3 gives an overview of the

number of loci identified.

More than 20 major GWASs for T2D have been pub-

lished to date21–24, and there has been a cumulative tally

of around 50 genome-wide-significant hits,21,23,24 only

three of which were known before the GWAS era. Most

of these studies have involved individuals of European

descent; the latest published effort is from the DIAGRAM

(Diabetes Genetics Replication and Meta-analysis)

Consortium and includes more than 47,000 GWAS indi-

viduals and 94,000 samples for replication. More recently,

equivalent studies have emerged from samples of East

Asians,23,25–27 South Asians,22 and Hispanics,28,29 and

large studies involving African Americans and other major

ethnic groups are underway. Notwithstanding differences

in allele frequency and LD patterns, most of the signals

found in one ethnic group show some evidence of associ-

ation in others, indicating that the common-variant

signals identified by GWASs are likely to be the result of

widely distributed causal alleles that are of relatively high

frequency. This is an important observation because it

indicates that most of the GWAS-identified associations

for T2D reflect high LD with a causal variant that has

a small effect size rather than low LD with a causal variant

that has a large effect size. The largest common-variant

signal identified for T2D remains TCF7L2 (MIM 602228)

(detected just prior to the GWAS era30), which has a

per-allele odds ratio (OR) of around 1.35. The remaining

signals detected by GWAS have allelic ORs in the range

between 1.05 and 1.25. Collectively, the most-strongly

associated variants at these loci are estimated to explain

around 10% of familial aggregation of T2D in European

populations.
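
To give a rough sense of how a per-allele effect of this size translates into familial aggregation, the sketch below uses the standard result for a single locus with multiplicative genotype risks (1, γ, γ²): the sibling relative risk is λs = (1 + w/2)², with w = pq(γ − 1)²/(pγ + q)². It treats the odds ratio as an approximation to the genotype relative risk γ, and it measures a locus's share of familial aggregation on the log scale, which assumes that loci combine multiplicatively. The risk-allele frequency (0.30) and the overall sibling relative risk (3.0) are hypothetical inputs chosen for illustration, not figures from this review.

```python
import math

def sibling_relative_risk(p, grr):
    """Sibling relative risk from one locus under a multiplicative model.

    p   : risk-allele frequency
    grr : genotype relative risk per risk allele (gamma)
    """
    q = 1.0 - p
    w = p * q * (grr - 1.0) ** 2 / (p * grr + q) ** 2
    return (1.0 + w / 2.0) ** 2

def share_of_familial_aggregation(locus_lambda, total_lambda):
    """Share of the overall sibling relative risk due to one locus (log scale)."""
    return math.log(locus_lambda) / math.log(total_lambda)

# Hypothetical inputs: per-allele OR of 1.35 used as gamma, frequency 0.30,
# overall sibling relative risk of 3.0.
locus = sibling_relative_risk(p=0.30, grr=1.35)
share = share_of_familial_aggregation(locus, total_lambda=3.0)
print(f"locus sibling relative risk: {locus:.4f}")
print(f"share of familial aggregation: {share:.3f}")
```

With these illustrative inputs, even the strongest common-variant signal contributes only a few percent of the familial aggregation, which is consistent with the need for many loci to reach the collective figure quoted above.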

The MAGIC (Meta-Analysis of Glucose- and Insulin-

Related Traits Consortium) investigators have been

carrying out equivalent analyses focused on the identifica-

tion of variants influencing variation in glucose and

insulin levels in healthy nondiabetic individuals.31–33 Prior

to the GWAS era, the only compelling association signal

for fasting glucose levels was known at GCK (MIM

138079) (glucokinase),34 but GWAS in European samples

(46,000 GWAS and 76,000 replication samples) have

expanded that number to 16 (ref. 32). These variants explain

around 10% of the inherited variation in fasting glucose

levels. Only two signals (near GCKR [MIM 600842] and

IGF1 [MIM 147440]) were shown to influence fasting

insulin levels in the same analysis. Equivalent analyses

for 2h glucose33 (15,000 GWAS samples and up to 30,000

replication samples) identified further signals, including

variants near the GIP (MIM 137240) receptor (GIPR [MIM

137241]).

Before the GWAS era, the only robust association

between DNA sequence variation and either BMI or

weight involved low-frequency variants in MC4R (MIM

155541).35 Now, there are more than 30. In the most

recent study from the GIANT consortium,36 these analyses

extended to almost 250,000 samples, half of them in the

stage 1 GWAS, the remainder for replication. The largest

signal remains that at FTO (MIM 610966),37 where the

Table 1. Population Variation Explained by GWAS for a Selected Number of Complex Traits
Reference numbers are shown in parentheses.

Trait or Disease          h2 Pedigree Studies     h2 GWAS Hits (a)      h2 All GWAS SNPs (b)

Type 1 diabetes           0.9 (98)                0.6 (99) (c)          0.3 (12)

Type 2 diabetes           0.3–0.6 (100)           0.05–0.10 (34)

Obesity (BMI)             0.4–0.6 (101,102)       0.01–0.02 (36)        0.2 (14)

Crohn’s disease           0.6–0.8 (103)           0.1 (11)              0.4 (12)

Ulcerative colitis        0.5 (103)               0.05 (12)

Multiple sclerosis        0.3–0.8 (104)           0.1 (45)

Ankylosing spondylitis    >0.90 (105)             0.2 (106)

Rheumatoid arthritis      0.6 (107)

Schizophrenia             0.7–0.8 (108)           0.01 (79)             0.3 (109)

Bipolar disorder          0.6–0.7 (108)           0.02 (79)             0.4 (12)

Breast cancer             0.3 (110)               0.08 (111)

Von Willebrand factor     0.66–0.75 (112,113)     0.13 (114)            0.25 (14)

Height                    0.8 (115,116)           0.1 (13)              0.5 (13,14)

Bone mineral density      0.6–0.8 (117)           0.05 (118)

QT interval               0.37–0.60 (119,120)     0.07 (121)            0.2 (14)

HDL cholesterol           0.5 (122)               0.1 (57)

Platelet count            0.8 (123)               0.05–0.1 (58)

(a) Proportion of phenotypic variance or variance in liability explained by genome-wide-significant and validated SNPs. For a number of diseases, other parameters were reported, and these were converted and approximated to the scale of total variation explained. Blank cells indicate that these parameters have not been reported in the literature.
(b) Proportion of phenotypic variance or variance in liability explained when all GWAS SNPs are considered simultaneously. Blank cells indicate that these parameters have not been reported in the literature.
(c) Includes pre-GWAS loci with large effects.


average between-homozygotes difference in weight is

around 2.5 kg. The effects at other loci are smaller, and

in combination, these variants explain no more than

1%–2% of overall variation in adult BMI (although this

percentage rises to almost 20% if the analysis is extended

to all GWA variants, not just those that reach genome-

wide significance14). As well as these studies of BMI and

obesity in population samples, there have been several

studies focused on extreme obesity phenotypes.38,39 The

genome-wide-significant loci thrown up by these efforts

only partially overlap with those emerging from popula-

tion-based studies, raising the possibility that some of

Table 2. Summary of GWAS Findings for Seven Autoimmune Diseases (a)
For each disease: number of loci and loci known prior to 2007, then number of loci and some or all of the loci discovered from 2007 onward.

Ankylosing spondylitis
Prior to 2007 (1 locus): HLA-B27
2007 onward (13 loci): IL23R, ERAP1, 2p15, 21q22, CARD9 (MIM 607212), IL12B (MIM 161561), PTGER4 (MIM 601586), IL1R2 (MIM 147811), TNFR1, TBKBP1 (MIM 608476), ANTXR2 (MIM 608041), RUNX3 (MIM 600210), KIF21B (MIM 608322)

Rheumatoid arthritis
Prior to 2007 (3 loci): HLA-DRB1, PADI4, CTLA4
2007 onward (30 loci): AFF3 (MIM 601464), BLK, CCL21 (MIM 602737), CD2/CD58 (MIM 186990/153420), CD28, CD40, FCGR2A (MIM 146790), HLA-DRB1, IL2/IL21 (MIM 147680/605384), IL2RA, IL2RB (MIM 146710), KIF5A/PIP4K2C, PRDM1 (MIM 603423), PRKCQ (MIM 600448), PTPRC (MIM 151460), REL (MIM 164910), STAT4 (MIM 600558), TAGAP, TNFAIP3, TNFRSF14, TRAF1/C5 (MIM 120900/601711), TRAF6 (MIM 602355), IL6ST (MIM 600694), SPRED2 (MIM 609292), RBPJ (MIM 147183), CCR6 (MIM 601835), IRF5 (MIM 607218), PXK (MIM 611450)

Systemic lupus erythematosus
Prior to 2007 (3 loci): HLA, PTPN22, IRF5 (MIM 607218)
2007 onward (31 loci): BANK1 (MIM 610292), BLK (MIM 191305), C1q, C2 (MIM 613927), C4A/B (MIM 120820/120810), CRP (MIM 123260), ETS1 (MIM 164720), FcGR2A–FcGR3A (MIM 146790/146740), FcGR3B (MIM 610665), HIC2-UBE2L3 (MIM 607712/603721), IKZF1 (MIM 603023), IL10 (MIM 124092), IRAK1 (MIM 300283), ITGAM–ITGAX (MIM 120980/151510), JAZF1, KIAA1542/PHRF1, LRRC18-WDFY4, LYN (MIM 165120), NMNAT2 (MIM 608701), PRDM1 (MIM 603423), PTTG1 (MIM 604147), PXK (MIM 611450), RASGRP3 (MIM 609531), SLC15A4, STAT1 (MIM 600555), TNFAIP3, TNFSF4 (MIM 603594), TNIP1 (MIM 607714), TREX1 (MIM 606609), UHRF1BP1, XKR6

Type 1 diabetes
Prior to 2007 (4 loci): HLA, INS (MIM 176730), PTPN22, CTLA4
2007 onward (40 loci): RGS1, IL18RAP (MIM 604509), IFIH1 (MIM 606951), CCR5 (MIM 601373), IL2 (MIM 147680), IL7R, MHC, BACH2 (MIM 605394), TNFAIP3, TAGAP, IL2RA, PRKCQ (MIM 600448), INS (MIM 176730), ERBB3 (MIM 190151), 12q13.3, SH2B3 (MIM 605093), CTSH (MIM 116820), CLEC16A (MIM 611303), PTPN2 (MIM 176887), CD226 (MIM 605397), UBASH3A (MIM 605736), C1QTNF6, IL10 (MIM 124092), 4p15.2, C6orf173, 7p15.2, COBL (MIM 610317), GLIS3 (MIM 610192), C10orf59, CD69 (MIM 107273), 14q24.1, 14q32.2, IL27 (MIM 608273), 16q23.1, ORMDL3 (MIM 610075), 17q21.2, 19q13.32, 20p13, 22q12.2, Xq28

Multiple sclerosis
Prior to 2007 (1 locus): HLA
2007 onward (52 loci): BACH2 (MIM 605394), BATF (MIM 612476), CBLB, CD40, CD58, CD6 (MIM 186720), CD86, CLEC16A (MIM 611303), CLECL1, CYP24A1, CYP27B1, DKKL1 (MIM 605418), EOMES (MIM 604615), EVI5 (MIM 602942), GALC (MIM 606890), HHEX (MIM 604420), IL12A, IL12B, IL22RA2, IL2RA, IL7, IL7R, IRF8, KIF21B (MIM 608322), MALT1, MAPK1 (MIM 176948), MERTK (MIM 604705), MMEL1, MPHOSPH9 (MIM 605501), MPV17L2, MYB (MIM 189990), MYC (MIM 190080), OLIG3 (MIM 609323), PLEK (MIM 173570), PTGER4 (MIM 601586), PVT1 (MIM 165140), RGS1, SCO2 (MIM 604272), SP140 (MIM 608602), STAT3, TAGAP, THEMIS (MIM 613607), TMEM39A, TNFRSF1A, TNFSF14 (MIM 604520), TYK2, VCAM1, ZFP36L1 (MIM 601064), ZMIZ1 (MIM 607159), ZNF767

Crohn’s disease
Prior to 2007 (4 loci): NOD2 (MIM 605956), IBD5 (MIM 606348), DRB1*0103, IL23R
2007 onward (67 loci): SMAD3 (MIM 603109), ERAP2 (MIM 609497), IL10 (MIM 124092), IL2RA, TYK2, FUT2 (MIM 182100), DNMT3A (MIM 602769), DENND1B (MIM 613292), BACH2 (MIM 605394), ATG16L1 (MIM 610767)

Ulcerative colitis
Prior to 2007 (3 loci): DRB1*1502, DRB1*0103, IL23R
2007 onward (44 loci): IL1R2 (MIM 147811), IL8RA-IL8RB, IL7R, IL12B, DAP (MIM 600954), PRDM1 (MIM 603423), JAK2 (MIM 147796), IRF5 (MIM 607218), GNA12 (MIM 604394), LSP1 (MIM 153432), ATG16L1 (MIM 610767)

Total
Prior to 2007: 19 loci
2007 onward: 277 loci

(a) The names of the loci are signposts and do not indicate that these loci are necessarily biologically relevant. A number of associated variants are distant from protein-coding genes.


the most extreme cases of obesity are driven by highly

penetrant, low-frequency variants. Variation at copy-

number variants (CNVs) has some impact on BMI. This is

true of common CNVs (the NEGR1 association seems likely

to be driven by a common CNV40) and also rarer CNVs for

which evidence is starting to accumulate (e.g., 16p CNV

and effect on morbid obesity and developmental delay41).

The adverse metabolic effects of obesity depend not

only on the overall level of adiposity but also on the distribu-

tion of fat around the body; visceral (abdominal) fat has

particularly adverse consequences for overall health. GWASs

of fat-distribution phenotypes (including waist circumference,

waist:hip ratio, and body-fat percentage studied in close

to 200,000 individuals) have revealed almost 20 loci with

genome-wide significance40,42–44 and relatively little overlap

with those loci influencing overall adiposity. As with BMI, the

proportion of variance explained by these loci is small

(around 1% after adjustment for BMI, age, and sex).

New Biology Arising from GWAS Discoveries

Autoimmune Diseases

Thus far nearly all genes associated with MS have been

involved in autoimmune pathways rather than in

neurologic degenerative diseases.45 Indeed, of the two

MS-associated genes involved in neurodegeneration, one

(KIF21B) is also associated with AS and CD, suggesting

that it is actually an autoimmunity gene. The genes

involved in MS include genes coding for components of

the cytokine pathway (CXCR5 [MIM 601613], IL2RA

[MIM 147730], IL7R [MIM 146661], IL7 [MIM 146660],

IL12RB1 [MIM 601604], IL22RA2 [MIM 606648], IL12A

[MIM 161560], IL12B [MIM 161561], IRF8 [MIM 601565],

TNFRSF1A [MIM 191190], TNFRSF14 [MIM 602746], and

TNFSF14 [MIM 604520]), costimulatory molecules

(CD37 [MIM 151523], CD40, CD58 [MIM 153420],

CD80 [MIM 112203], CD86 [MIM 601020], and CLECL1

[MIM 607467]), and signal-transduction molecules of

immunological relevance (CBLB [MIM 604491], GPR65

[MIM 604620], MALT1 [MIM 604860], RGS1 [MIM

600323], STAT3 [MIM 102582], TAGAP [MIM 609667],

and TYK2 [MIM 176941]). Interestingly, these genes mainly

implicate T-helper cells in MS pathogenesis.

Genetic findings have had a major impact on AS research

and therapeutics. The associations of the genes IL23R (MIM

607562)46 and IL12B19 have pointed to the involvement of

the IL-23R pathway, and hence IL-17-producing

Table 3. Summary of GWAS Findings for Metabolic Traits (a)
For each trait: number of loci and loci known prior to 2007, then number of loci and some or all of the loci discovered from 2007 onward.

Type 2 diabetes
Prior to 2007 (3 loci): PPARG, KCNJ11 (MIM 600937), TCF7L2
2007 onward (50 loci): NOTCH2 (MIM 600275), PROX1 (MIM 601546), GCKR, THADA (MIM 611800), BCL11A (MIM 606557), RBMS1 (MIM 602310), IRS1, ADAMTS9, ADCY5 (MIM 600293), IGF2BP2 (MIM 608289), WFS1, ZBED3, CDKAL1, DGKB (MIM 604070), JAZF1, GCK, KLF14, TP53INP1 (MIM 606185), SLC30A8 (MIM 611145), PTPRD (MIM 601598), CDKN2A, CHCHD9, CDC123, HHEX (MIM 604420), DUSP8 (MIM 602038), KCNQ1, CENTD2, MTNR1B, HMGA2 (MIM 600698), TSPAN8 (MIM 600769), HNF1A, ZFAND6 (MIM 610183), PRC1 (MIM 603484), FTO, SRR (MIM 606477), HNF1B (MIM 189907), DUSP9 (MIM 300134), CDCD4A, UBE2E2 (MIM 602163), GRB14 (MIM 601524), ST6GAL1 (MIM 109675), VPS26A (MIM 605506), HMG20A (MIM 605534), AP3S2 (MIM 602416), HNF4A (MIM 600281), SPRY2 (MIM 602466)

Body-mass index
Prior to 2007 (1 locus): MC4R
2007 onward (30 loci): NEGR1 (MIM 613173), TNNI3K (MIM 613932), PTBP2 (MIM 608449), TMEM18 (MIM 613220), POMC, FANCL (MIM 608111), LRP1B (MIM 608766), CADM2 (MIM 609938), ETV5 (MIM 601600), GNPDA2 (MIM 613222), SLC39A8 (MIM 608732), HMGCR (MIM 142910), PCSK1, ZNF608, NCR3 (MIM 611550), HMGA1 (MIM 600701), LRRN6C, TUB (MIM 601197), BDNF, MTCH2 (MIM 613221), FAIM3 (MIM 606015), MTIF3, PRKD1 (MIM 605435), MAP2K5 (MIM 602520), FTO, SH2B1, GPRC5B (MIM 605948), KCTD15, GIPR, TMEM160

Glucose or insulin
Prior to 2007 (1 locus): GCK
2007 onward (15 loci): GCKR, G6PC2, IGF1, ADCY5 (MIM 600293), MADD (MIM 603584), ADRA2A, CRY2 (MIM 603732), FADS1 (MIM 606148), GLIS3 (MIM 610192), SLC2A2, PROX1 (MIM 601546), C2CD4B (MIM 610344), DGKB (MIM 604070), GIPR, VPS13C (MIM 608879)

Fat distribution
Prior to 2007 (0 loci)
2007 onward (20 loci): TBX15 (MIM 604127), LYPLAL1, IRS1, SPRY2 (MIM 602466), GRB14 (MIM 601524), STAB1 (MIM 608560), ADAMTS9, CPEB4 (MIM 610607), VEGFA (MIM 192240), TFAP2B (MIM 601601), LY86 (MIM 605241), RSPO3 (MIM 610574), NFE2L3 (MIM 604135), MSRA (MIM 601250), ITPR2 (MIM 600144), HOXC13 (MIM 142976), NRXN3 (MIM 600567), ZNRF3 (MIM 612062), PIGC (MIM 601730)

Total
Prior to 2007: 5 loci
2007 onward: 107 loci

(a) The names of the loci are signposts and do not indicate that these loci are necessarily biologically relevant. A number of associated variants are distant from protein-coding genes.


proinflammatory cell populations, in the aetiopathogene-

sis of AS. The involvement of this pathway in AS was not

considered until the genetic discoveries were reported.

The recent demonstration that ERAP1 (MIM 606832) poly-

morphisms are associated with HLA-B27-positive but not

HLA-B27-negative AS has shed important light on research

into the mechanism by which HLA-B27 induces AS; this

mechanism has remained an enigma since the discovery

of the association of HLA-B27 with AS in the early 1970s.

ERAP1 is involved in peptide processing before HLA class

I molecule presentation; the restriction of the association

of ERAP1 variants to HLA-B27-positive disease indicates

that HLA-B27 operates to cause AS by a mechanism

that involves peptide presentation. Protective variants of

ERAP1 have been shown to have lower peptide-processing

capacity and thus to reduce the amount of peptide avail-

able to HLA-B27.47 Thus HLA-B27 is more likely to cause

AS when it is processing more peptides.

The finding that PADI4 (MIM 605347) is associated with

RA focused research interest on the role of anti-citrulli-

nated peptide antibodies (ACPAs) and disease.48 PADI4 is

involved in the citrullination of peptides against which

ACPAs develop. The association of PADI4 variants with

RA therefore indicated that ACPAs are directly involved

in RA pathogenesis, not an indirect manifestation of

immune dysregulation in the disease. Subsequently, it

was discovered that the association of HLA-DRB1 (MIM

142857) with RA was restricted to ACPA-positive disease

and that there was a strong gene-environment interaction,

such that cigarette smoking increases the risk of ACPA-

positive but not ACPA-negative RA.49 Because ACPA-

positive disease is more severe than ACPA-negative disease

and has a greater propensity toward joint-damaging

erosion, this provided further evidence supporting public-

health measures against cigarette smoking.

The genetic loci identified for IBD through GWASs have

highlighted a number of pathways, including antibacterial

autophagy and signaling pathways (e.g., IL-10 signaling,

T-cell-negative regulators, and pathways involving B cells

and innate sensors).18 Some of these pathways were previ-

ously not suspected to be important for these diseases.

The role of a number of pathways, for example the IL-23R

pathway, the autophagy pathway, and innate immunity,

has emerged from hypothesis-generating genetics research,

not from immunology or hypothesis-driven research.

Similar advances could be described for many other

autoimmune diseases but are beyond the scope of this

review.

Metabolic Traits

Most loci affecting T2D and fasting glucose levels map to

regulatory sequences, and in many cases, the ‘‘causal’’ transcript,

i.e., the transcript responsible for mediating the

effect of the associated variants, is not yet known. At other

loci, a combination of coding variants, strong biological

candidates, and/or cis expression QTL data has defined

the transcript through which the effect is mediated

(HNF1A [MIM 142410], GCK, IRS1 [MIM 147545], WFS1

[MIM 606201], PPARG [MIM 601487], CAMK1D [MIM

607957], JAZF1 [MIM 606246], KLF14 [MIM 609393] and

others) as a first step to inferring biology.50 Some of these

stories are now starting to be fleshed out into biological

mechanisms (e.g., KLF1451).

There is incomplete overlap with the loci influencing

physiological variation in glucose and insulin. Some loci

(e.g., MTNR1B [MIM 600804]) have a relatively large effect

on both, whereas others (e.g., G6PC2 [MIM 608058])

influence fasting glucose levels but have a minimal effect

on T2D risk. Still others (e.g., CDKN2A and CDKN2B

[MIM 600160 and 600431]) impact T2D and have surpris-

ingly modest effects on fasting glucose levels in healthy,

nondiabetic individuals32,33,50. Most of these loci appear

to have their primary effect on the function of beta cells

rather than on insulin resistance, highlighting the impor-

tance of the former with respect to normal and abnormal

glucose homeostasis.50 Of the subset of loci (including

PPARG, KLF14, and ADAMTS9 [MIM 605421]) shown to

influence T2D risk through a primary effect on insulin

resistance, only FTO seems to act primarily through an

effect on obesity.50 Several of the T2D loci overlap genes

that are known to harbor rare variants responsible for

penetrant, monogenic forms of diabetes (such genes

include KCNQ1 [MIM 607542], PPARG, HNF1A, GCK,

and WFS1), indicating that multiple causal variants at

the same locus segregate in the population at different

frequencies. There is overlap between signals influencing

T2D risk and those influencing body weight (CDKAL1

[MIM 611259] and ADCY5 [MIM 600293]) indicating

that some of the observed epidemiological associations

between these traits are attributable to shared suscepti-

bility variants.52

Whereas many of the fasting-glucose and fasting-insulin

signals map near strong biological candidates for relevant

traits (such candidate genes include IRS1, IGF1, ADRA2A

[MIM 104210], SLC2A2 [MIM 138160], GCK and GCKR)

and fit within established models of our understanding

of islet biology, this is far from the case with the loci iden-

tified for T2D. Efforts to demonstrate that the genes

mapping close to T2D risk loci are enriched for particular

pathways or processes have met with only limited success;

the most robust finding yet has been in relation to

cell-cycle regulation (and was consistent with a model in

which the regulation of islet mass is a key component of

risk50). Either T2D is especially heterogeneous or else key

aspects of its pathophysiology are as yet poorly codified

in existing databases.

As for T2D and fasting glucose, most of the signals for

obesity and fat distribution map to regulatory sequences, and the

causal transcript is known at only a minority of the loci.

Signals influencing BMI appear to be enriched for genes

implicated in neuronal processes, whereas those influ-

encing fat distribution seem to be more closely related to

adipose development.36,43 Overlap with signals and genes

implicated in more severe forms of disease (morbid obesity,


lipodystrophy) is seen at some loci (PCSK1 [MIM 162150],

POMC [MIM 176830], BDNF [MIM 113505], MC4R, and

SH2B1 [MIM 608937]) but is far from complete (some

loci implicated in extreme obesity case-control studies

show no association with BMI at the population level36).

The strongest signal for overall adiposity is the one mapping

to FTO37. FTO is thought to be a DNA demethylase,53

but its function is poorly understood. Murine models

demonstrate that modulation of Fto expression is associ-

ated with changes in body weight,54–56 but no direct

evidence linking coding variants in FTO in humans to

body-weight variation has been demonstrated. For the

time being, FTO remains the strongest candidate, but

the role of other genes (e.g., RPGRIP1L [MIM 610937]) in

the region cannot be discounted. This example demon-

strates the difficulties that remain in relating GWAS signals

to downstream biology. Fat distribution is a strongly

gender-dimorphic phenotype, and many of the signals

associated with fat distribution seem to have a selective

effect on this phenotype in women.43

Quantitative Traits

In addition to having been performed on the quantitative

traits discussed previously (e.g., BMI and fasting-glucose

and -insulin levels), GWASs have been done on a number

of quantitative risk factors for disease and for traits that

are models for the genetic architecture of complex traits.

For bone mineral density (BMD), a risk factor for osteopo-

rotic fracture, a total of 34 loci, together explaining ~5% of

narrow sense heritability, have been identified (Estrada

et al., abstract presented at the American Society for Bone

and Mineral Research 2010 Annual Meeting, published

in J. Bone Miner. Res. 25 [Suppl S1], p. 1243). Among these

genes, there is a major over-representation of genes in the

Wnt-signaling pathway, which was first implicated in oste-

oporosis (MIM 166710) from studies in families with high

or low BMD phenotypes. Many other examples exist in

osteoporosis and other human diseases in which GWASs

have demonstrated that more-prevalent but less-severe

genetic variants in genes initially identified from studies

of severe familial diseases have proven to be important in

the risk of disease in the general population. For human

height, a combined discovery and validation cohort of

~180,000 samples identified 180 robustly associated loci,

many in meaningful biological pathways and with evi-

dence for multiple segregating variants at the same loci.13

Together these loci explain approximately 12%–14% of

additive genetic variation (~10% of phenotypic variation).
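
The two percentages quoted for height are linked through the heritability: the share of additive genetic variance explained equals the share of phenotypic variance explained divided by the narrow-sense heritability. The short Python sketch below illustrates the conversion using the pedigree-based heritability of 0.8 listed for height in Table 1. The function name is hypothetical, and the calculation assumes that both quantities are expressed on the same scale.

```python
def share_of_genetic_variance(phenotypic_share, heritability):
    """Convert a share of phenotypic variance into a share of additive genetic variance.

    phenotypic_share : proportion of phenotypic variance explained by the loci
    heritability     : narrow-sense heritability (additive genetic / phenotypic variance)
    """
    if not 0.0 < heritability <= 1.0:
        raise ValueError("heritability must lie in (0, 1]")
    return phenotypic_share / heritability

# Height: about 10% of phenotypic variance explained, heritability of about 0.8.
height_share = share_of_genetic_variance(0.10, 0.8)
print(f"share of additive genetic variance explained: {height_share:.3f}")
```

With these inputs the result is 12.5%, which falls within the 12%–14% range given above.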

A meta-analysis of more than 100,000 individuals of

European ancestry detected a total of 95 loci significantly

associated with plasma concentrations of cholesterol

and triglycerides, known risk factors for coronary artery

disease,57 and it provided evidence that the GWAS loci

were of biological and clinical relevance. A meta-analysis

from the HaemGen consortium on platelet count and

platelet volume, which are endophenotypes for myo-

cardial infarction (MIM 608446), discovered 68 loci.58

When the genes of a number of these loci were silenced

in Drosophila, 11 showed a clear platelet phenotype. These

genes are previously unknown regulators of blood cell

formation. The identification of so many loci has uncov-

ered new gene functions in megakaryopoiesis and platelet

formation. That is, new biology has resulted directly from

the identification of SNPs that are associated with variation

in platelet phenotypes.

Across these quantitative traits, a number of loci discov-

ered through GWASs were known to be a mutational target

for those traits because Mendelian forms with extreme

phenotypes existed. Taken together, the inference from

quantitative traits in terms of the (large) number of loci

involved, the allelic frequency spectrum of associated vari-

ants, and the nature of the candidate genes suggest that

models arising from quantitative traits appropriately

reflect the genetic architecture of disease and reinforce

the emerging evidence that it is the cumulative effect of

many loci that underlies susceptibility to disease.

From GWAS to Translation: Clinical Relevance

Autoimmune Diseases

Many of the MS-associated genes discovered by GWASs

represent excellent potential therapeutic targets. Of partic-

ular note is the identification of two genes involved in

vitamin D metabolism (CYP27B1 [MIM 609506] and

CYP24A1 [MIM 126065]). This identification might help

to explain the latitudinal variation in MS incidence—i.e.,

higher MS prevalence at more extreme latitudes is most

likely due to higher rates of vitamin D deficiency. Two

other identified genes are already targets of MS therapies,

highlighting the relevance of the findings to the disease

pathogenesis (natalizumab targets VCAM1 [MIM

192225], and daclizumab targets IL2RA). The findings for

AS have stimulated the trial of therapies against identified

pathways. Anti-IL-17 treatment has been shown in a phase

2 trial to have efficacy equivalent to that of the current gold-standard

treatment, TNF inhibition, in the treatment of AS.

The relevance of the RA-related genetic findings to thera-

peutic development is highlighted by the fact that some

existing therapies already target genes or gene pathways

highlighted by the genetic associations with RA; such ther-

apies include those involving TNF inhibitors (e.g., inflixi-

mab) and co-stimulation inhibitors (e.g., abatacept).

Abatacept is a fusion protein of CTLA-4 and immunoglob-

ulin. It acts by preventing costimulation of T-helper cells

by the binding of the T cell’s CD28 protein to the B7

protein on the antigen-presenting cell. CTLA4 (MIM

123890) and CD28 (MIM 186760) polymorphisms are

associated with RA. The RA-associated genes include

many involved in the NF-κB signaling pathway and

place this pathway at the center of RA pathogenesis. As

in MS, mouse research prior to the genetic discoveries

had implicated the IL-23-dependent Th17-lymphocyte

pathway in RA pathogenesis. To date there has been very

little genetic support for this with regard to human

diseases, in contrast to the situation in seronegative


diseases such as AS, psoriasis and IBD, where strong genetic

associations exist and treatments targeting the pathway

are in clinical use.

Metabolic Diseases

The main relevance of GWASs lies in the insights into

disease biology (see above) and the potential for clinical

translation through novel approaches to the diagnosis,

prevention, treatment, and monitoring of disease. This

will take some time, in particular given that most GWAS

discoveries were made in the last few years. The predictive

power of disease risk ascertained from genetic data remains

poor because for most diseases only a small proportion of

additive genetic variation has been accounted for.

Although for T2D it is possible to identify individuals

who are at the extremes of the genotype risk score distribu-

tion and who differ appreciably in T2D risk (they have

twice or half the average risk for the upper and lower

1%–2%, respectively), many of these would already be

identifiable on the basis of classical risk factors. In fact,

when using receiver operating characteristic (ROC) anal-

yses, BMI and age do a far better job of discrimination

than the genetic variants so far discovered.59 This may

change as low frequency and rare causal alleles are found.

Although individual prediction is not yet practical with

the variants at hand, it should be possible to identify

groups of individuals who are at a substantially greater-

than-average risk for diabetes, and this might be of value,

for example, with respect to clinical-trial enrichment.

One obvious route to early translation involves the iden-

tification of diagnostic biomarkers on the basis of the

processes that have been uncovered. These may have

predictive impact well beyond the genetic variants that

led to their discovery. This was recently demonstrated by

a GWAS of C-reactive protein (CRP) levels; that study

found that common variants near the HNF1A gene were

associated with variation in CRP.60 The authors asked

whether rare HNF1A mutations that are causal for the

Mendelian MODY (MIM 606391) subtype of diabetes are

also associated with differences in CRP levels and whether

it would be possible to use CRP levels as a diagnostic

marker to help identify individuals who have early-onset

diabetes and who are likely to have HNF1A-MODY (and

to direct those individuals to sequence-based diagnostics).

They were able to show marked differences in CRP levels

between HNF1A-MODY and other types of diabetes and

demonstrated that diagnosis based on CRP levels has

a discriminative accuracy of more than 80% for this diag-

nostic classification.61,62 Otherwise, GWAS findings have

as yet had no impact on therapeutic optimization. Recent

studies have identified variants that influence therapeutic

response to metformin63 and might herald better under-

standing of how these drugs work.

New Science Facilitated by GWASs

Although the GWAS approach was designed for the detec-

tion of associations between DNA markers and disease, as

a by-product such studies have generated new scientific

discoveries. A detailed description and discussion is outside

the scope of this review, and we highlight only a few of

these advances: the discovery of genes affecting genetic

recombination and their correlation with natural selection64–66

and new insights into human population structure

and evolution.67–73

Interpretation of GWAS Results

GWASs conducted in the last five years were designed and

powered to detect associations through LD between geno-

typed (or imputed) common SNP markers and unknown

causal variants. What do the results imply in terms of vari-

ance explained in the population, common versus rare

variants underlying complex traits, and the nature of

complex-trait variation and evolution? It is too early to

be able to quantify the joint distribution of risk-allele

frequencies and their effect sizes because there are very

few causal variants identified by GWAS and because

systematic study of rare variants (through exome or

whole-genome sequencing) is in an early stage. To under-

stand the allelic spectrum of risk variants and thereby

inform optimal design of experiments aiming to detect

causal variants, one must differentiate between two expla-

nations for observed associations between genotyped

common SNPs and disease: the association can be caused

by one or more causal variants that have large effect sizes

and are in low LD with the genotyped SNPs, or it can be

caused by causal variants that have small effects and are

in high LD with the genotyped SNPs. Low LD occurs

when the allele frequencies of the unknown causal vari-

ants and those at the genotyped SNPs are very different

from each other, for example when the allele frequency

of causal variants is much lower than that of the SNPs.

For a single robustly associated SNP in a homogeneous

population, we cannot distinguish between the hypoth-

eses that the association signal is caused by a rare variant

of large effect or a common variant with small effect.

However, variants at multiple loci and GWASs in other

ethnic populations help to narrow the boundaries of the

genetic architecture of diseases. At this point in time, we

can conclude that

(1) Many loci contribute to complex-trait variation

(e.g., Figure 2).

(2) At a number of identified risk loci, there are multiple

alleles associated with disease at a wide range of

frequencies.

(3) There is evidence for pleiotropy, i.e., that the same

variants are associated with multiple traits.66,74,75

(4) A number of variants associated with disease or

complex traits in one ethnic population are also

associated with the same disease or traits in other

populations (see above for T2D examples).

(5) The hypothesis76 that causal variant(s) that lead to

the association between common SNPs and disease

are mostly rare (say, have an allele frequency of 1%


or lower) is not consistent with theoretical and

empirical results.77,78 In particular, there is no widespread

evidence for the existence of ‘‘synthetic associations’’

(see Box 3). Numerically, we expect that most causal

variants that segregate in the population are rare,

consistent with evolutionary theory, but the propor-

tion of genetic variation that these variants cumula-

tively explain depends on their correlation with

fitness.79

(6) A surprisingly large proportion of additive genetic

variation is tagged when all SNPs are considered

simultaneously.12–14

The Cost of GWASs

If we assume that the GWAS results from Figure 1 represent

a total of 500,000 SNP chips and that on average a chip

costs $500, then this is a total investment of $250 million.

If there are a total of ~2,000 loci detected across all traits,

then this implies an investment of $125,000 per discov-

ered locus. Is that a good investment? We think so: The

total amount of money spent on candidate-gene studies

and linkage analyses in the 1990s and 2000s probably

exceeds $250M, and they in total have had little to show

for it. Also, it is worthwhile to put these amounts in

context. $250M is of the order of the cost of one or two

stealth fighter jets and much less than the cost of a single

navy submarine. It is a fraction of the ~$9 billion cost of

the Large Hadron Collider. It would also pay for about

100 R01 grants. Would those 100 non-funded R01 grants

have made breakthrough discoveries in biology and medi-

cine? We simply can’t answer this question, but we can

conclude that a tremendous number of genuinely new

discoveries have been made in a period of only five years.
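The arithmetic behind these figures can be reproduced in a few lines. The sketch below is illustrative only: the chip count, the price per chip, and the number of loci are the round-number assumptions stated in the text above, and the variable names are ours.

```python
# Back-of-envelope cost per discovered locus.
# All three inputs are the round-number assumptions stated in the text,
# not measured costs.
chips = 500_000            # assumed total number of SNP chips
cost_per_chip_usd = 500    # assumed average cost per chip
loci_detected = 2_000      # assumed number of loci detected across all traits

total_usd = chips * cost_per_chip_usd
per_locus_usd = total_usd / loci_detected

print(f"Total investment: ${total_usd:,}")
print(f"Investment per discovered locus: ${per_locus_usd:,.0f}")
```

Changing any one of the three inputs shows how sensitive the per-locus figure is to these assumptions; doubling the chip price, for example, doubles the cost per locus.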

Concluding Comments

In this review we have attempted to summarize the

tremendous quality and quantity of discoveries that have

been made by GWASs in the last five years. Because of

space limitations, we have been able to discuss only

a subset of diseases and have not mentioned those made

in common cancers, pediatric diseases, and ophthalmolog-

ical diseases, to name but a few. We now return to the

Box 3. Synthetic Associations

Dickson and colleagues suggested that the observed

association between a common SNP and a complex

trait might result when one or more rare variants at

the locus are in LD with that SNP.76,93 Because the

properties of LD prevent common SNP alleles and rare

causal variants from being highly correlated,84

the hypothesis of ‘‘synthetic’’ associations

implies that the effect sizes of the causal variants

are much larger than the effect size observed at the

common SNP and suggests that (re)sequencing

studies might detect such variants. The hypothesis

is not about whether GWASs work as an experi-

mental design but what the likely interpretation of

GWAS hits is in terms of the allele spectrum of causal

risk alleles. Are empirical data consistent with this

hypothesis? Several lines of evidence suggest that

associations observed with common SNPs

are rarely due to synthetic associations with

rare variants. First, because the LD correlation

between common and rare variants is so low (typi-

cally 0.01–0.02), synthetic associations imply that

variation explained by the causal variants at the

locus is 50–100 times larger than the variance ex-

plained at the genotyped SNP.78 So, if the SNP

explains 0.1% of phenotypic variation in the popu-

lation, the causal variant would explain 5%–10%.

But as shown in this review, for many complex traits

and diseases tens to hundreds of common variants

are identified, and so their combined effects would

explain too much variation if synthetic associations

were the norm. Second, empirical data from

(re)sequencing studies and trans-ethnic mapping

suggest that both common and rare variants

contribute to disease risk.77 At most loci detected

by GWASs, there is no evidence (despite extensive

genotyping and/or re-sequencing) that the

common-variant signal is driven by low-frequency

or rarer variants. Where rare risk alleles are uncov-

ered at the same loci, they seem much more likely

to be independent signals.94–96
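The two quantitative steps in the first argument above can be sketched numerically. The first function gives the standard upper bound on r² between two biallelic variants whose allele frequencies satisfy p_rare ≤ p_common. The second assumes a single causal variant acting additively, so that the variance explained at the tagging SNP is r² times the variance explained by the causal variant. The function names and the example frequencies (1% and 30%) are illustrative assumptions, not values taken from the cited studies.

```python
def max_r2(p_rare: float, p_common: float) -> float:
    """Upper bound on r^2 between two biallelic loci.

    Assumes alleles are labeled so that p_rare <= p_common; the bound is
    reached when the rarer allele occurs only on haplotypes that carry
    the more common allele.
    """
    if not 0 < p_rare <= p_common < 1:
        raise ValueError("require 0 < p_rare <= p_common < 1")
    return p_rare * (1 - p_common) / (p_common * (1 - p_rare))


def implied_causal_variance(snp_variance: float, r2: float) -> float:
    """Variance explained by the causal variant, given the variance
    explained at the tagging SNP and the r^2 between the two
    (single causal variant, additive model)."""
    if not 0 < r2 <= 1:
        raise ValueError("require 0 < r2 <= 1")
    return snp_variance / r2


# Illustrative frequencies: a 1% causal allele tagged by a 30% SNP allele.
print(f"max r^2 = {max_r2(0.01, 0.30):.4f}")

# A SNP explaining 0.1% of phenotypic variance, at the r^2 values
# quoted in the text.
for r2 in (0.01, 0.02):
    implied = implied_causal_variance(0.001, r2)
    print(f"r^2 = {r2}: causal variant would explain {implied:.1%}")
```

With r² between 0.01 and 0.02, the implied causal variance is 50 to 100 times the variance at the SNP, which is the 5%–10% range given in the text for a SNP explaining 0.1%.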

Together these observations point to a highly

polygenic model of disease susceptibility with causal

variants across the entire range of the allele-

frequency spectrum. By ‘‘polygenic,’’ we mean that

segregating variants at many genomic loci (tens,

hundreds, or even thousands) contribute to genetic

variation for susceptibility in the population. The

observations imply that, for most common complex

diseases, nearly everyone in the population carries

some risk alleles and that affected individuals are

likely to have a different portfolio of risk alleles.79

They also imply that any single risk allele is neither

necessary nor sufficient to cause disease. For the


etiology of disease, these observations provide

empirical evidence to support a threshold or burden

model involving multiple variants and environ-

mental factors, and they appear to be inconsistent

with a single cause (e.g., a single mutation). A

rare-variant-only model of disease, characterized by locus

heterogeneity and rare mutations of large effects and

proposed by, for example, McClellan and King,1 is

not consistent with empirical observations.77,79,97


perceived failure of GWASs as summarized in the introduc-

tory section:

(1) Is the GWAS approach founded on a flawed assumption

that genetics plays an important role in the risk for

common diseases? Pedigree studies, including those

involving twins, suggest that a substantial propor-

tion of variation in susceptibility for common

disease is due to genetic factors. The proportion of

total variation explained by genome-wide-signifi-

cant variants has reached 10%–20% for a number

of diseases, and clearly there are additional variants

with such small effect sizes that they have not been

detected with stringent significance. As reviewed

here, many of the detected loci are in biologically

meaningful pathways for the diseases investigated.

Whole-genome analyses involving GWAS data

have estimated that 20%–50% of phenotypic varia-

tion is captured when all SNPs are considered simul-

taneously for a number of complex diseases and

traits. These estimates are based on population-

wide studies and provide a lower limit of the total

proportion of phenotypic variation due to genetic

factors. Inference from GWASs is independent of

inference drawn from close relatives (pedigree/

family studies), and therefore these studies have

provided independent evidence for the role of

genetics in common diseases.

(2) Have GWASs been disappointing in not explaining more

genetic variation in the population? This criticism

implies that the aim of GWASs is to explain all

genetic variation. This is a misrepresentation of

the objective of GWASs. As was the aim of linkage

studies in pedigrees for complex diseases prior to

the GWAS era, the aim of GWAS is to detect loci

that are associated with complex traits. The detec-

tion of such loci has led to the discovery of new bio-

logical knowledge about disease—knowledge that

was absent only five years ago. But even ignoring

the aim of GWASs, for a number of complex traits

the proportion of genetic variation uncovered by

GWASs is actually substantial. For example, for

T2D, MS, and CD, approximately 10%, 20%, and

20%, respectively, of genetic variation in the popu-

lation has been accounted for. Apart from diseases

with a known major locus (which is usually the

major histocompatibility locus), the baseline of

variation explained five years ago was essentially

zero.

(3) Have GWASs delivered meaningful biologically relevant

knowledge or results of clinical or any other utility? As

we have highlighted in this review, the answer to

this question is a definite ‘‘yes.’’ For example, the

discovery of the importance of the autophagy

pathway in Crohn disease, the IL-23R pathway in

rheumatoid arthritis, and factor H in age-related

macular degeneration (MIM 610149)9 have given

important biological insight with direct clinical

relevance. Hunter and Kraft put it this way back in

2007: ‘‘There have been few, if any, similar bursts

of discovery in the history of medical research.’’80

(4) Are GWAS results spurious? The combination of large

sample sizes and stringent significance testing has

led to a large number of robust and replicable asso-

ciations between complex traits and genetic vari-

ants, many of which are in meaningful biological

pathways. A number of variants or different variants

at the same loci have been shown to be associated

with the same trait in different ethnic populations,

and some loci are even replicated across species.81

The combination of multiple variants with small

effect sizes has been shown to predict disease status

or phenotype in independent samples from the

same population. Clearly, these results are not

consistent with flawed inferences from GWASs.

In conclusion, in a period of less than five years, the

GWAS experimental design in human populations has

led to new discoveries about genes and pathways involved

in common diseases and other complex traits, has

provided a wealth of new biological insights, has led to

discoveries with direct clinical utility, and has facilitated

basic research in human genetics and genomics. For the

future, technological advances enabling the sequencing

of entire genomes in large samples at affordable prices are

likely to reveal additional genes, pathways, and biological

insights, as well as to identify causal mutations.

Acknowledgments

We acknowledge funding from the Australian National Health and

Medical Research Council (NHMRC grants 389892, 496667,

613672, 613601, and 1011506) and the Australian Research

Council (ARC grant DP1093502). P.M.V. and M.A.B. are funded

by NHMRC Senior Principal Research Fellowships. We thank two

referees for many helpful comments.

Web Resources



omim.org

GWAS Catalog, http://www.genome.gov/26525384

References

1. McClellan, J., and King, M.C. (2010). Genetic heterogeneity

in human disease. Cell 141, 210–217.

2. Crow, T.J. (2011). ‘The missing genes: what happened to the

heritability of psychiatric disorders?’. Mol. Psychiatry 16,

362–364.

3. Manolio, T.A., Collins, F.S., Cox, N.J., Goldstein, D.B.,

Hindorff, L.A., Hunter, D.J., McCarthy, M.I., Ramos, E.M.,

Cardon, L.R., Chakravarti, A., et al. (2009). Finding the

missing heritability of complex diseases. Nature 461, 747–753.


4. Botstein, D., and Risch, N. (2003). Discovering genotypes

underlying human phenotypes: Past successes for mende-

lian disease, future approaches for complex disease. Nat.

Genet. Suppl. 33, 228–237.

5. Hartl, D.L., and Clark, A.G. (1997). Principles of population

genetics (Sunderland: Sinauer Associates).

6. Hill, W.G., and Robertson, A. (1968). The effects of

inbreeding at loci with heterozygote advantage. Genetics

60, 615–628.

7. Altshuler, D., Brooks, L.D., Chakravarti, A., Collins, F.S.,

Daly, M.J., and Donnelly, P.; International HapMap Consor-

tium. (2005). A haplotype map of the human genome.

Nature 437, 1299–1320.

8. Dewan, A., Liu, M., Hartman, S., Zhang, S.S., Liu, D.T., Zhao,

C., Tam, P.O., Chan, W.M., Lam, D.S., Snyder, M., et al.

(2006). HTRA1 promoter polymorphism in wet age-related

macular degeneration. Science 314, 989–992.

9. Klein, R.J., Zeiss, C., Chew, E.Y., Tsai, J.Y., Sackler, R.S.,

Haynes, C., Henning, A.K., SanGiovanni, J.P., Mane, S.M.,

Mayne, S.T., et al. (2005). Complement factor H polymor-

phism in age-related macular degeneration. Science 308,

385–389.

10. Wellcome Trust Case Control Consortium. (2007). Genome-

wide association study of 14,000 cases of seven common

diseases and 3,000 shared controls. Nature 447, 661–678.

11. Franke, A., McGovern, D.P., Barrett, J.C., Wang, K., Radford-

Smith, G.L., Ahmad, T., Lees, C.W., Balschun, T., Lee, J.,

Roberts, R., et al. (2010). Genome-wide meta-analysis

increases to 71 the number of confirmed Crohn’s disease

susceptibility loci. Nat. Genet. 42, 1118–1125.

12. Anderson, C.A., Boucher, G., Lees, C.W., Franke, A.,

D’Amato, M., Taylor, K.D., Lee, J.C., Goyette, P., Imielinski,

M., Latiano, A., et al. (2011). Meta-analysis identifies 29 addi-

tional ulcerative colitis risk loci, increasing the number of

confirmed associations to 47. Nat. Genet. 43, 246–252.

13. Lango Allen, H., Estrada, K., Lettre, G., Berndt, S.I., Weedon,

M.N., Rivadeneira, F., Willer, C.J., Jackson, A.U., Vedantam,

S., Raychaudhuri, S., et al. (2010). Hundreds of variants clus-

tered in genomic loci and biological pathways affect human

height. Nature 467, 832–838.

14. Yang, J., Manolio, T.A., Pasquale, L.R., Boerwinkle, E., Capor-

aso, N., Cunningham, J.M., de Andrade, M., Feenstra, B.,

Feingold, E., Hayes, M.G., et al. (2011). Genome partitioning

of genetic variation for complex traits using common SNPs.

Nat. Genet. 43, 519–525.

15. Yang, J., Benyamin, B., McEvoy, B.P., Gordon, S., Henders,

A.K., Nyholt, D.R., Madden, P.A., Heath, A.C., Martin, N.G.,

Montgomery, G.W., et al. (2010). Common SNPs explain

a large proportion of the heritability for human height.

Nat. Genet. 42, 565–569.

16. Eyre-Walker, A. (2010). Evolution in health and medicine

Sackler colloquium: Genetic architecture of complex traits

and its implications for fitness and genome-wide association

studies. Proc. Natl. Acad. Sci. USA 107 (Suppl 1),

1752–1756.

17. Pritchard, J.K. (2001). Are rare variants responsible for

susceptibility to complex diseases? Am. J. Hum. Genet. 69,

124–137.

18. Khor, B., Gardet, A., and Xavier, R.J. (2011). Genetics and

pathogenesis of inflammatory bowel disease. Nature 474,

307–317.

19. Danoy, P., Pryce, K., Hadler, J., Bradbury, L.A., Farrar, C., Poin-

ton, J., Ward, M., Weisman, M., Reveille, J.D., Wordsworth,

B.P., et al; Australo-Anglo-American Spondyloarthritis

Consortium; Spondyloarthritis Research Consortium of

Canada. (2010). Association of variants at 1q32 and STAT3

with ankylosing spondylitis suggests genetic overlap with

Crohn’s disease. PLoS Genet. 6, e1001195.

20. Cotsapas, C., Voight, B.F., Rossin, E., Lage, K., Neale, B.M.,

Wallace, C., Abecasis, G.R., Barrett, J.C., Behrens, T., Cho,

J., et al; FOCiS Network of Consortia. (2011). Pervasive

sharing of genetic effects in autoimmune disease. PLoS

Genet. 7, e1002254.

21. McCarthy, M.I. (2010). Genomics, type 2 diabetes, and

obesity. N. Engl. J. Med. 363, 2339–2350.

22. Kooner, J.S., Saleheen, D., Sim, X., Sehmi, J., Zhang, W.,

Frossard, P., Been, L.F., Chia, K.S., Dimas, A.S., Hassanali,

N., et al; DIAGRAM; MuTHER. (2011). Genome-wide associ-

ation study in individuals of South Asian ancestry identifies

six new type 2 diabetes susceptibility loci. Nat. Genet. 43,

984–989.

23. Yamauchi, T., Hara, K., Maeda, S., Yasuda, K., Takahashi, A.,

Horikoshi, M., Nakamura, M., Fujita, H., Grarup, N., Cauchi,

S., et al. (2010). A genome-wide association study in the

Japanese population identifies susceptibility loci for type 2

diabetes at UBE2E2 and C2CD4A-C2CD4B. Nat. Genet. 42,

864–868.

24. Shu, X.O., Long, J., Cai, Q., Qi, L., Xiang, Y.B., Cho, Y.S., Tai,

E.S., Li, X., Lin, X., Chow, W.H., et al. (2010). Identification

of new genetic risk variants for type 2 diabetes. PLoS Genet.

6, e1001127.

25. Yasuda, K., Miyake, K., Horikawa, Y., Hara, K., Osawa, H.,

Furuta, H., Hirota, Y., Mori, H., Jonsson, A., Sato, Y., et al.

(2008). Variants in KCNQ1 are associated with susceptibility

to type 2 diabetes mellitus. Nat. Genet. 40, 1092–1097.

26. Unoki, H., Takahashi, A., Kawaguchi, T., Hara, K., Horikoshi,

M., Andersen, G., Ng, D.P., Holmkvist, J., Borch-Johnsen, K.,

Jørgensen, T., et al. (2008). SNPs in KCNQ1 are associated

with susceptibility to type 2 diabetes in East Asian and Euro-

pean populations. Nat. Genet. 40, 1098–1102.

27. Tsai, F.J., Yang, C.F., Chen, C.C., Chuang, L.M., Lu, C.H.,

Chang, C.T., Wang, T.Y., Chen, R.H., Shiu, C.F., Liu, Y.M.,

et al. (2010). A genome-wide association study identifies

susceptibility variants for type 2 diabetes in Han Chinese.

PLoS Genet. 6, e1000847.

28. Below, J.E., Gamazon, E.R., Morrison, J.V., Konkashbaev, A.,

Pluzhnikov, A., McKeigue, P.M., Parra, E.J., Elbein, S.C.,

Hallman, D.M., Nicolae, D.L., et al. (2011). Genome-wide

association and meta-analysis in populations from Starr

County, Texas, and Mexico City identify type 2 diabetes

susceptibility loci and enrichment for expression quantita-

tive trait loci in top signals. Diabetologia 54, 2047–2055.

29. Parra, E.J., Below, J.E., Krithika, S., Valladares, A., Barta, J.L.,

Cox, N.J., Hanis, C.L., Wacher, N., Garcia-Mena, J., Hu, P.,

et al; Diabetes Genetics Replication and Meta-analysis

(DIAGRAM) Consortium. (2011). Genome-wide association

study of type 2 diabetes in a sample from Mexico City and

a meta-analysis of a Mexican-American sample from Starr

County, Texas. Diabetologia 54, 2038–2046.

30. Grant, S.F., Thorleifsson, G., Reynisdottir, I., Benediktsson,

R., Manolescu, A., Sainz, J., Helgason, A., Stefansson, H.,

Emilsson, V., Helgadottir, A., et al. (2006). Variant of


transcription factor 7-like 2 (TCF7L2) gene confers risk of

type 2 diabetes. Nat. Genet. 38, 320–323.

31. Prokopenko, I., Langenberg, C., Florez, J.C., Saxena, R.,

Soranzo, N., Thorleifsson, G., Loos, R.J., Manning, A.K.,

Jackson, A.U., Aulchenko, Y., et al. (2009). Variants in

MTNR1B influence fasting glucose levels. Nat. Genet. 41,

77–81.

32. Dupuis, J., Langenberg, C., Prokopenko, I., Saxena, R.,

Soranzo, N., Jackson, A.U., Wheeler, E., Glazer, N.L., Boua-

tia-Naji, N., Gloyn, A.L., et al; DIAGRAM Consortium;

GIANT Consortium; Global BPgen Consortium; Anders

Hamsten on behalf of Procardis Consortium; MAGIC investi-

gators. (2010). New genetic loci implicated in fasting glucose

homeostasis and their impact on type 2 diabetes risk. Nat.

Genet. 42, 105–116.

33. Saxena, R., Hivert, M.F., Langenberg, C., Tanaka, T., Pankow,

J.S., Vollenweider, P., Lyssenko, V., Bouatia-Naji, N., Dupuis,

J., Jackson, A.U., et al; GIANT consortium; MAGIC investiga-

tors. (2010). Genetic variation in GIPR influences the glucose

and insulin responses to an oral glucose challenge. Nat.

Genet. 42, 142–148.

34. Weedon, M.N., Clark, V.J., Qian, Y., Ben-Shlomo, Y., Timp-

son, N., Ebrahim, S., Lawlor, D.A., Pembrey, M.E., Ring, S.,

Wilkin, T.J., et al. (2006). A common haplotype of the gluco-

kinase gene alters fasting glucose and birth weight: Associa-

tion in six studies and population-genetics analyses. Am. J.

Hum. Genet. 79, 991–1001.

35. Larsen, L.H., Echwald, S.M., Sørensen, T.I., Andersen, T.,

Wulff, B.S., and Pedersen, O. (2005). Prevalence of mutations

and functional analyses of melanocortin 4 receptor variants

identified among 750 men with juvenile-onset obesity. J.

Clin. Endocrinol. Metab. 90, 219–224.

36. Speliotes, E.K., Willer, C.J., Berndt, S.I., Monda, K.L., Thor-

leifsson, G., Jackson, A.U., Allen, H.L., Lindgren, C.M.,

Luan, J., Magi, R., et al; MAGIC; Procardis Consortium.

(2010). Association analyses of 249,796 individuals reveal

18 new loci associated with body mass index. Nat. Genet.

42, 937–948.

37. Frayling, T.M., Timpson, N.J., Weedon, M.N., Zeggini, E.,

Freathy, R.M., Lindgren, C.M., Perry, J.R., Elliott, K.S., Lango,

H., Rayner, N.W., et al. (2007). A common variant in the FTO

gene is associated with body mass index and predisposes to

childhood and adult obesity. Science 316, 889–894.

38. Meyre, D., Delplanque, J., Chevre, J.C., Lecoeur, C., Lobbens,

S., Gallina, S., Durand, E., Vatin, V., Degraeve, F., Proenca, C.,

et al. (2009). Genome-wide association study for early-onset

and morbid adult obesity identifies three new risk loci in

European populations. Nat. Genet. 41, 157–159.

39. Scherag, A., Dina, C., Hinney, A., Vatin, V., Scherag, S., Vogel,

C.I., Muller, T.D., Grallert, H., Wichmann, H.E., Balkau, B.,

et al. (2010). Two new Loci for body-weight regulation iden-

tified in a joint analysis of genome-wide association studies

for early-onset extreme obesity in French and German study

groups. PLoS Genet. 6, e1000916.

40. Willer, C.J., Speliotes, E.K., Loos, R.J., Li, S., Lindgren, C.M.,

Heid, I.M., Berndt, S.I., Elliott, A.L., Jackson, A.U., Lamina,

C., et al; Wellcome Trust Case Control Consortium; Genetic

Investigation of ANthropometric Traits Consortium.

(2009). Six new loci associated with body mass index high-

light a neuronal influence on body weight regulation. Nat.

Genet. 41, 25–34.

41. Walters, R.G., Jacquemont, S., Valsesia, A., de Smith, A.J.,

Martinet, D., Andersson, J., Falchi, M., Chen, F., Andrieux,

J., Lobbens, S., et al. (2010). A new highly penetrant form

of obesity due to deletions on chromosome 16p11.2. Nature

463, 671–675.

42. Heard-Costa, N.L., Zillikens, M.C., Monda, K.L., Johansson,

A., Harris, T.B., Fu, M., Haritunians, T., Feitosa, M.F., Aspe-

lund, T., Eiriksdottir, G., et al. (2009). NRXN3 is a novel locus

for waist circumference: A genome-wide association study

from the CHARGE Consortium. PLoS Genet. 5, e1000539.

43. Heid, I.M., Jackson, A.U., Randall, J.C., Winkler, T.W., Qi, L.,

Steinthorsdottir, V., Thorleifsson, G., Zillikens, M.C.,

Speliotes, E.K., Magi, R., et al; MAGIC. (2010). Meta-analysis

identifies 13 new loci associated with waist-hip ratio and

reveals sexual dimorphism in the genetic basis of fat distribu-

tion. Nat. Genet. 42, 949–960.

44. Kilpelainen, T.O., Zillikens, M.C., Stancakova, A., Finucane,

F.M., Ried, J.S., Langenberg, C., Zhang, W., Beckmann, J.S.,

Luan, J., Vandenput, L., et al. (2011). Genetic variation

near IRS1 associates with reduced adiposity and an impaired

metabolic profile. Nat. Genet. 43, 753–760.

45. Sawcer, S., Hellenthal, G., Pirinen, M., Spencer, C.C., Patso-

poulos, N.A., Moutsianas, L., Dilthey, A., Su, Z., Freeman,

C., Hunt, S.E., et al; International Multiple Sclerosis Genetics

Consortium; Wellcome Trust Case Control Consortium 2.

(2011). Genetic risk and a primary role for cell-mediated

immune mechanisms in multiple sclerosis. Nature 476,

214–219.

46. Burton, P.R., Clayton, D.G., Cardon, L.R., Craddock, N.,

Deloukas, P., Duncanson, A., Kwiatkowski, D.P., McCarthy,

M.I., Ouwehand, W.H., Samani, N.J., et al; Wellcome Trust

Case Control Consortium; Australo-Anglo-American Spon-

dylitis Consortium (TASC); Biologics in RA Genetics and

Genomics Study Syndicate (BRAGGS) Steering Committee;

Breast Cancer Susceptibility Collaboration (UK). (2007).

Association scan of 14,500 nonsynonymous SNPs in four

diseases identifies autoimmunity variants. Nat. Genet. 39,

1329–1337.

47. Evans, D.M., Spencer, C.C., Pointon, J.J., Su, Z., Harvey, D.,

Kochan, G., Oppermann, U., Dilthey, A., Pirinen, M.,

Stone, M.A., et al; Spondyloarthritis Research Consortium

of Canada (SPARCC); Australo-Anglo-American Spondyloar-

thritis Consortium (TASC); Wellcome Trust Case Control

Consortium 2 (WTCCC2). (2011). Interaction between

ERAP1 and HLA-B27 in ankylosing spondylitis implicates

peptide handling in the mechanism for HLA-B27 in disease

susceptibility. Nat. Genet. 43, 761–767.

48. Suzuki, A., Yamada, R., Chang, X., Tokuhiro, S., Sawada, T.,

Suzuki, M., Nagasaki, M., Nakayama-Hamada, M., Kawaida,

R., Ono, M., et al. (2003). Functional haplotypes of PADI4,

encoding citrullinating enzyme peptidylarginine deiminase

4, are associated with rheumatoid arthritis. Nat. Genet. 34,

395–402.

49. Padyukov, L., Silva, C., Stolt, P., Alfredsson, L., and Klareskog,

L. (2004). A gene-environment interaction between smoking

and shared epitope genes in HLA-DR provides a high risk

of seropositive rheumatoid arthritis. Arthritis Rheum. 50,

3085–3092.

50. Voight, B.F., Scott, L.J., Steinthorsdottir, V., Morris, A.P., Dina,

C., Welch, R.P., Zeggini, E., Huth, C., Aulchenko, Y.S.,

Thorleifsson, G., et al; MAGIC investigators; GIANT

Consortium. (2010). Twelve type 2 diabetes susceptibility


loci identified through large-scale association analysis. Nat.

Genet. 42, 579–589.

51. Small, K.S., Hedman, A.K., Grundberg, E., Nica, A.C., Thor-

leifsson, G., Kong, A., Thorsteindottir, U., Shin, S.Y.,

Richards, H.B., Soranzo, N., et al; GIANT Consortium;

MAGIC Investigators; DIAGRAM Consortium; MuTHER

Consortium. (2011). Identification of an imprinted master

trans regulator at the KLF14 locus related to multiple meta-

bolic phenotypes. Nat. Genet. 43, 561–564.

52. Freathy, R.M., Mook-Kanamori, D.O., Sovio, U., Prokopenko,

I., Timpson, N.J., Berry, D.J., Warrington, N.M., Widen, E.,

Hottenga, J.J., Kaakinen, M., et al; Genetic Investigation of

ANthropometric Traits (GIANT) Consortium; Meta-Analyses

of Glucose and Insulin-related traits Consortium; Wellcome

Trust Case Control Consortium; Early Growth Genetics

(EGG) Consortium. (2010). Variants in ADCY5 and near

CCNL1 are associated with fetal growth and birth weight.

Nat. Genet. 42, 430–435.

53. Gerken, T., Girard, C.A., Tung, Y.C., Webby, C.J., Saudek, V.,

Hewitson, K.S., Yeo, G.S., McDonough, M.A., Cunliffe, S.,

McNeill, L.A., et al. (2007). The obesity-associated FTO

gene encodes a 2-oxoglutarate-dependent nucleic acid deme-

thylase. Science 318, 1469–1472.

54. Church, C., Lee, S., Bagg, E.A., McTaggart, J.S., Deacon, R.,

Gerken, T., Lee, A., Moir, L., Mecinović, J., Quwailid, M.M.,

et al. (2009). A mouse model for the metabolic effects of

the human fat mass and obesity associated FTO gene. PLoS

Genet. 5, e1000599.

55. Church, C., Moir, L., McMurray, F., Girard, C., Banks, G.T.,

Teboul, L., Wells, S., Bruning, J.C., Nolan, P.M., Ashcroft,

F.M., and Cox, R.D. (2010). Overexpression of Fto leads to

increased food intake and results in obesity. Nat. Genet. 42,

1086–1092.

56. Freathy, R.M., Timpson, N.J., Lawlor, D.A., Pouta, A., Ben-

Shlomo, Y., Ruokonen, A., Ebrahim, S., Shields, B., Zeggini,

E., Weedon, M.N., et al. (2008). Common variation in the

FTO gene alters diabetes-related metabolic traits to the extent

expected given its effect on BMI. Diabetes 57, 1419–1426.

57. Teslovich, T.M., Musunuru, K., Smith, A.V., Edmondson,

A.C., Stylianou, I.M., Koseki, M., Pirruccello, J.P., Ripatti, S.,

Chasman, D.I., Willer, C.J., et al. (2010). Biological, clinical

and population relevance of 95 loci for blood lipids. Nature

466, 707–713.

58. Gieger, C., Radhakrishnan, A., Cvejic, A., Tang, W., Porcu, E.,

Pistis, G., Serbanovic-Canic, J., Elling, U., Goodall, A.H., Lab-

rune, Y., et al. (2011). New gene functions in megakaryopoi-

esis and platelet formation. Nature 480, 201–208.

59. Mihaescu, R., Meigs, J., Sijbrands, E., and Janssens, A.C.

(2011). Genetic risk profiling for prediction of type 2 dia-

betes. PLoS Curr. 3, RRN1208.

60. Elliott, P., Chambers, J.C., Zhang, W., Clarke, R., Hopewell,

J.C., Peden, J.F., Erdmann, J., Braund, P., Engert, J.C., Bennett,

D., et al. (2009). Genetic Loci associated with C-reactive

protein levels and risk of coronary heart disease. JAMA 302,

37–48.

61. Owen, K.R., Thanabalasingham, G., James, T.J., Karpe, F.,

Farmer, A.J., McCarthy, M.I., and Gloyn, A.L. (2010). Assess-

ment of high-sensitivity C-reactive protein levels as diag-

nostic discriminator of maturity-onset diabetes of the young

due to HNF1A mutations. Diabetes Care 33, 1919–1924.

62. Thanabalasingham, G., Shah, N., Vaxillaire, M., Hansen, T.,

Tuomi, T., Gasperikova, D., Szopa, M., Tjora, E., James, T.J.,

Kokko, P., et al. (2011). A large multi-centre European study

validates high-sensitivity C-reactive protein (hsCRP) as a

clinical biomarker for the diagnosis of diabetes subtypes.

Diabetologia 54, 2801–2810.

63. Zhou, K., Bellenguez, C., Spencer, C.C., Bennett, A.J.,

Coleman, R.L., Tavendale, R., Hawley, S.A., Donnelly, L.A.,

Schofield, C., Groves, C.J., et al; GoDARTS and UKPDS

Diabetes Pharmacogenetics Study Group; Wellcome Trust

Case Control Consortium 2; MAGIC investigators. (2011).

Common variants near ATM are associated with glycemic

response to metformin in type 2 diabetes. Nat. Genet. 43,

117–120.

64. Stefansson, H., Helgason, A., Thorleifsson, G., Steinthorsdot-

tir, V., Masson, G., Barnard, J., Baker, A., Jonasdottir, A., Inga-

son, A., Gudnadottir, V.G., et al. (2005). A common inversion

under selection in Europeans. Nat. Genet. 37, 129–137.

65. Kong, A., Barnard, J., Gudbjartsson, D.F., Thorleifsson, G.,

Jonsdottir, G., Sigurdardottir, S., Richardsson, B., Jonsdottir,

J., Thorgeirsson, T., Frigge, M.L., et al. (2004). Recombination

rate and reproductive success in humans. Nat. Genet. 36,

1203–1206.

66. Hinch, A.G., Tandon, A., Patterson, N., Song, Y., Rohland, N.,

Palmer, C.D., Chen, G.K., Wang, K., Buxbaum, S.G., Akylbe-

kova, E.L., et al. (2011). The landscape of recombination in

African Americans. Nature 476, 170–175.

67. Seldin, M.F., Tian, C., Shigeta, R., Scherbarth, H.R., Silva, G.,

Belmont, J.W., Kittles, R., Gamron, S., Allevi, A., Palatnik,

S.A., et al. (2007). Argentine population genetic structure:

Large variance in Amerindian contribution. Am. J. Phys.

Anthropol. 132, 455–462.

68. Seldin, M.F., Shigeta, R., Villoslada, P., Selmi, C., Tuomilehto,

J., Silva, G., Belmont, J.W., Klareskog, L., and Gregersen, P.K.

(2006). European population substructure: Clustering of

northern and southern populations. PLoS Genet. 2, e143.

69. Tian, C., Hinds, D.A., Shigeta, R., Kittles, R., Ballinger, D.G.,

and Seldin, M.F. (2006). A genomewide single-nucleotide-

polymorphism panel with high ancestry information for

African American admixture mapping. Am. J. Hum. Genet.

79, 640–649.

70. McEvoy, B.P., Montgomery, G.W., McRae, A.F., Ripatti, S.,

Perola, M., Spector, T.D., Cherkas, L., Ahmadi, K.R.,

Boomsma, D., Willemsen, G., et al. (2009). Geographical

structure and differential natural selection among North

European populations. Genome Res. 19, 804–814.

71. Heath, S.C., Gut, I.G., Brennan, P., McKay, J.D., Bencko, V.,

Fabianova, E., Foretova, L., Georges, M., Janout, V., Kabesch,

M., et al. (2008). Investigation of the fine structure of

European populations with applications to disease associa-

tion studies. Eur. J. Hum. Genet. 16, 1413–1429.

72. Novembre, J., Johnson, T., Bryc, K., Kutalik, Z., Boyko, A.R.,

Auton, A., Indap, A., King, K.S., Bergmann, S., Nelson,

M.R., et al. (2008). Genes mirror geography within Europe.

Nature 456, 98–101.

73. Price, A.L., Butler, J., Patterson, N., Capelli, C., Pascali, V.L.,

Scarnicci, F., Ruiz-Linares, A., Groop, L., Saetta, A.A., Korkolo-

poulou, P., et al. (2008). Discerning the ancestry of European

Americans in genetic association studies. PLoS Genet. 4,

e236.

74. Manolio, T.A. (2010). Genomewide association studies

and assessment of the risk of disease. N. Engl. J. Med. 363,

166–176.


75. Sivakumaran, S., Agakov, F., Theodoratou, E., Prendergast,

J.G., Zgaga, L., Manolio, T., Rudan, I., McKeigue, P., Wilson,

J.F., and Campbell, H. (2011). Abundant pleiotropy in

human complex diseases and traits. Am. J. Hum. Genet. 89,

607–618.

76. Dickson, S.P., Wang, K., Krantz, I., Hakonarson, H., and

Goldstein, D.B. (2010). Rare variants create synthetic

genome-wide associations. PLoS Biol. 8, e1000294.

77. Anderson, C.A., Soranzo, N., Zeggini, E., and Barrett, J.C.

(2011). Synthetic associations are unlikely to account for

many common disease genome-wide association signals.

PLoS Biol. 9, e1000580.

78. Wray, N.R., Purcell, S.M., and Visscher, P.M. (2011). Synthetic

associations created by rare variants do not explain most

GWAS results. PLoS Biol. 9, e1000579.

79. Visscher, P.M., Goddard, M.E., Derks, E.M., and Wray, N.R.

(2011). Evidence-based psychiatric genetics, AKA the false

dichotomy between common and rare variant hypotheses.

Molecular Psychiatry, in press. Published online 14 June

2011. 2010.1038/mp.2011.2065.

80. Hunter, D.J., and Kraft, P. (2007). Drinking from the fire

hose—Statistical issues in genomewide association studies.

N. Engl. J. Med. 357, 436–439.

81. Pryce, J.E., Hayes, B.J., Bolormaa, S., and Goddard, M.E.

(2011). Polymorphic regions affecting human height also

control stature in cattle. Genetics 187, 981–984.

82. Bodmer, W.F. (1986). Human genetics: The molecular chal-

lenge. Cold Spring Harb. Symp. Quant. Biol. 51, 1–13.

83. Risch, N., and Merikangas, K. (1996). The future of genetic

studies of complex human diseases. Science 273, 1516–

1517.

84. Wray, N.R. (2005). Allele frequencies and the r2 measure of

linkage disequilibrium: impact on design and interpretation

of association studies. Twin Res. Hum. Genet. 8, 87–94.

85. McClellan, J.M., Susser, E., and King, M.C. (2007). Schizo-

phrenia: A common disease caused by multiple rare alleles.

Br. J. Psychiatry 190, 194–199.

86. Craddock, N., O’Donovan, M.C., and Owen, M.J. (2007).

Phenotypic and genetic complexity of psychosis. Invited

commentary on. Schizophrenia: a common disease caused

by multiple rare alleles. Br. J. Psychiatry 190, 200–203.

87. Lander, E.S. (1996). The new genomics: Global views of

biology. Science 274, 536–539.

88. Chakravarti, A. (1999). Population genetics—Making sense

out of sequence. Nat. Genet. 21 (1, Suppl), 56–60.

89. Reich, D.E., and Lander, E.S. (2001). On the allelic spectrum

of human disease. Trends Genet. 17, 502–510.

90. Risch, N. (1990). Linkage strategies for genetically complex

traits. I. Multilocus models. Am. J. Hum. Genet. 46, 222–228.

91. Slatkin, M. (2008). Exchangeable models of complex in-

herited diseases. Genetics 179, 2253–2261.

92. Hill, W.G., Goddard, M.E., and Visscher, P.M. (2008). Data

and theory point to mainly additive genetic variance for

complex traits. PLoS Genet. 4, e1000008.

93. Wang, K., Dickson, S.P., Stolle, C.A., Krantz, I.D., Goldstein,

D.B., and Hakonarson, H. (2010). Interpretation of associa-

tion signals and identification of causal variants from

genome-wide association studies. Am. J. Hum. Genet. 86,

730–742.

94. Nejentsev, S., Walker, N., Riches, D., Egholm, M., and Todd,

J.A. (2009). Rare variants of IFIH1, a gene implicated in anti-

viral responses, protect against type 1 diabetes. Science 324,

387–389.

95. Momozawa, Y., Mni, M., Nakamura, K., Coppieters, W.,

Almer, S., Amininejad, L., Cleynen, I., Colombel, J.F.,

de Rijk, P., Dewit, O., et al. (2011). Resequencing of positional

candidates identifies low frequency IL23R coding variants

protecting against inflammatory bowel disease. Nat. Genet.

43, 43–47.

96. Rivas,M.A., Beaudoin,M., Gardet, A., Stevens, C., Sharma, Y.,

Zhang, C.K., Boucher, G., Ripke, S., Ellinghaus, D., Burtt, N.,

et al; National Institute of Diabetes and Digestive Kidney

Diseases Inflammatory Bowel Disease Genetics Consortium

(NIDDK IBDGC); United Kingdom Inflammatory Bowel

Disease Genetics Consortium; International Inflammatory

Bowel Disease Genetics Consortium. (2011). Deep rese-

quencing of GWAS loci identifies independent rare variants

associated with inflammatory bowel disease. Nat. Genet.

43, 1066–1073.

97. Wang, K., Bucan, M., Grant, S.F., Schellenberg, G., and Hako-

narson, H. (2010). Strategies for genetic studies of complex

diseases. Cell 142, 351–353, author reply 353–355.

98. Hyttinen, V., Kaprio, J., Kinnunen, L., Koskenvuo, M., and

Tuomilehto, J. (2003). Genetic liability of type 1 diabetes

and the onset age among 22,650 young Finnish twin pairs:

A nationwide follow-up study. Diabetes 52, 1052–1055.

99. Polychronakos, C., and Li, Q. (2011). Understanding type 1

diabetes through genetics: Advances and prospects. Nat.

Rev. Genet. 12, 781–792.

100. Poulsen, P., Kyvik, K.O., Vaag, A., and Beck-Nielsen, H.

(1999). Heritability of type II (non-insulin-dependent)

diabetes mellitus and abnormal glucose tolerance—A popu-

lation-based twin study. Diabetologia 42, 139–145.

101. Magnusson, P.K., and Rasmussen, F. (2002). Familial resem-

blance of body mass index and familial risk of high and

low body mass index. A study of young men in Sweden.

Int. J. Obes. Relat. Metab. Disord. 26, 1225–1231.

102. Schousboe, K., Willemsen, G., Kyvik, K.O., Mortensen, J.,

Boomsma, D.I., Cornes, B.K., Davis, C.J., Fagnani, C., Hjelm-

borg, J., Kaprio, J., et al. (2003). Sex differences in heritability

of BMI: A comparative study of results from twin studies in

eight countries. Twin Res. 6, 409–421.

103. Tysk, C., Lindberg, E., Jarnerot, G., and Floderus-Myrhed, B.

(1988). Ulcerative colitis and Crohn’s disease in an unse-

lected population of monozygotic and dizygotic twins. A

study of heritability and the influence of smoking. Gut 29,

990–996.

104. Hawkes, C.H., and Macgregor, A.J. (2009). Twin studies

and the heritability of MS: A conclusion. Mult. Scler. 15,

661–667.

105. Brown, M.A., Kennedy, L.G., MacGregor, A.J., Darke, C.,

Duncan, E., Shatford, J.L., Taylor, A., Calin, A., and Words-

worth, P. (1997). Susceptibility to ankylosing spondylitis in

twins: The role of genes, HLA, and the environment.

Arthritis Rheum. 40, 1823–1828.

106. Brown, M.A. (2011). Progress in the genetics of ankylosing

spondylitis. Brief. Funct. Genomics 10, 249–257.

107. MacGregor, A.J., Snieder, H., Rigby, A.S., Koskenvuo, M.,

Kaprio, J., Aho, K., and Silman, A.J. (2000). Characterizing

the quantitative genetic contribution to rheumatoid arthritis

using data from twins. Arthritis Rheum. 43, 30–37.

108. Lichtenstein, P., Yip, B.H., Bjork, C., Pawitan, Y., Cannon,

T.D., Sullivan, P.F., and Hultman, C.M. (2009). Common


genetic determinants of schizophrenia and bipolar disorder

in Swedish families: A population-based study. Lancet 373,

234–239.

109. Purcell, S.M., Wray, N.R., Stone, J.L., Visscher, P.M., O’Dono-

van, M.C., Sullivan, P.F., and Sklar, P.; International Schizo-

phrenia Consortium. (2009). Common polygenic variation

contributes to risk of schizophrenia and bipolar disorder.

Nature 460, 748–752.

110. Lichtenstein, P., Holm, N.V., Verkasalo, P.K., Iliadou, A.,

Kaprio, J., Koskenvuo, M., Pukkala, E., Skytthe, A., and Hem-

minki, K. (2000). Environmental and heritable factors in the

causation of cancer—Analyses of cohorts of twins from

Sweden, Denmark, and Finland. N. Engl. J. Med. 343, 78–85.

111. Turnbull, C., Ahmed, S., Morrison, J., Pernet, D., Renwick, A.,

Maranian, M., Seal, S., Ghoussaini, M., Hines, S., Healey,

C.S., et al; Breast Cancer Susceptibility Collaboration (UK).

(2010). Genome-wide association study identifies five new

breast cancer susceptibility loci. Nat. Genet. 42, 504–507.

112. Orstavik, K.H., Magnus, P., Reisner, H., Berg, K., Graham, J.B.,

and Nance, W. (1985). Factor VIII and factor IX in a twin

population. Evidence for a major effect of ABO locus on

factor VIII level. Am. J. Hum. Genet. 37, 89–101.

113. de Lange, M., Snieder, H., Ariens, R.A., Spector, T.D., and

Grant, P.J. (2001). The genetics of haemostasis: A twin study.

Lancet 357, 101–105.

114. Smith, N.L., Chen, M.H., Dehghan, A., Strachan, D.P., Basu,

S., Soranzo, N., Hayward, C., Rudan, I., Sabater-Lleal, M., Bis,

J.C., et al; Wellcome Trust Case Control Consortium. (2010).

Novel associations ofmultiple genetic loci with plasma levels

of factor VII, factor VIII, and von Willebrand factor: The

CHARGE (Cohorts for Heart and Aging Research in Genome

Epidemiology) Consortium. Circulation 121, 1382–1392.

115. Visscher, P.M., Medland, S.E., Ferreira, M.A., Morley, K.I.,

Zhu, G., Cornes, B.K., Montgomery, G.W., and Martin,

N.G. (2006). Assumption-free estimation of heritability

from genome-wide identity-by-descent sharing between

full siblings. PLoS Genet. 2, e41.

116. Silventoinen, K., Sammalisto, S., Perola, M., Boomsma, D.I.,

Cornes, B.K., Davis, C., Dunkel, L., De Lange, M., Harris,

J.R., Hjelmborg, J.V., et al. (2003). Heritability of adult body

height: A comparative study of twin cohorts in eight coun-

tries. Twin Res. 6, 399–408.

117. Peacock, M., Turner, C.H., Econs, M.J., and Foroud, T. (2002).

Genetics of osteoporosis. Endocr. Rev. 23, 303–326.

118. Duncan, E.L., Danoy, P., Kemp, J.P., Leo, P.J., McCloskey, E.,

Nicholson, G.C., Eastell, R., Prince, R.L., Eisman, J.A., Jones,

G., et al. (2011). Genome-wide association study using

extreme truncate selection identifies novel genes affecting

bone mineral density and fracture risk. PLoS Genet. 7,

e1001372.

119. Dalageorgou, C., Ge, D., Jamshidi, Y., Nolte, I.M., Riese, H.,

Savelieva, I., Carter, N.D., Spector, T.D., and Snieder, H.

(2008). Heritability of QT interval: how much is explained

by genes for resting heart rate? J. Cardiovasc. Electrophysiol.

19, 386–391.

120. Russell, M.W., Law, I., Sholinsky, P., and Fabsitz, R.R. (1998).

Heritability of ECG measurements in adult male twins. J.

Electrocardiol. Suppl. 30, 64–68.

121. Shah, S.H., and Pitt, G.S. (2009). Genetics of cardiac repolar-

ization. Nat. Genet. 41, 388–389.

122. Hunt, S.C., Hasstedt, S.J., Kuida, H., Stults, B.M., Hopkins,

P.N., and Williams, R.R. (1989). Genetic heritability and

common environmental components of resting and stressed

blood pressures, lipids, and body mass index in Utah pedi-

grees and twins. Am. J. Epidemiol. 129, 625–638.

123. Evans, D.M., Frazer, I.H., and Martin, N.G. (1999). Genetic

and environmental causes of variation in basal levels of

blood cells. Twin Research: The Official Journal of the Inter-

national Society for Twin Studies 2, 250–257.


ARTICLE

Mitochondrial DNA and Y Chromosome Variation Provides Evidence for a Recent Common Ancestry between Native Americans and Indigenous Altaians

Matthew C. Dulik,1 Sergey I. Zhadanov,1,2 Ludmila P. Osipova,2 Ayken Askapuli,1,3 Lydia Gau,1

Omer Gokcumen,1,4 Samara Rubinstein,1,5 and Theodore G. Schurr1,*

The Altai region of southern Siberia has played a critical role in the peopling of northern Asia as an entry point into Siberia and a possible

homeland for ancestral Native Americans. It has an old and rich history because humans have inhabited this area since the Paleolithic.

Today, the Altai region is home to numerous Turkic-speaking ethnic groups, which have been divided into northern and southern clus-

ters based on linguistic, cultural, and anthropological traits. To untangle Altaian genetic histories, we analyzed mtDNA and Y chromo-

some variation in northern and southern Altaian populations. All mtDNAs were assayed by PCR-RFLP analysis and control region

sequencing, and the nonrecombining portion of the Y chromosome was scored for more than 100 biallelic markers and 17 Y-STRs. Based

on these data, we noted differences in the origin and population history of Altaian ethnic groups, with northern Altaians appearing more

like Yeniseian, Ugric, and Samoyedic speakers to the north, and southern Altaians having greater affinities to other Turkic-speaking populations

of southern Siberia and Central Asia. Moreover, high-resolution analysis of Y chromosome haplogroup Q has allowed us to

reshape the phylogeny of this branch, making connections between populations of the New World and Old World more apparent

and demonstrating that southern Altaians and Native Americans share a recent common ancestor. These results greatly enhance our

understanding of the peopling of Siberia and the Americas.

Introduction

The Altai Republic is located in south-central Russia, situ-

ated at the borders of Mongolia, China, and Kazakhstan.

It sits at a crossroads where the Eurasian steppe meets the

Siberian taiga and serves as an entry point into northern

Asia. Having been habitable throughout the last glacial

maximum (LGM), the Altai region has had a human pres-

ence for some 45,000 years.1 The archaeology of the region

shows that, during this time, a number of different cultures

and peoples lived in and migrated through the area.2–4 The

confirmation of Neanderthals and the recent discovery of

a new hominin at the Denisova cave in the Altai region

indicate that this area has long hosted extremely diverse

populations.5–7 It is also the area from which the ancestors

of Native American populations are thought to have arisen

prior to their expansion into the New World.8–11 In addi-

tion, archaeological evidence suggests that a few of the

later cultural horizons (Afanasievo and Andronovo) arose

in western Eurasia and spread eastward to the Altai region

during the Eneolithic and Bronze Ages, respectively.12,13

Such interactions increased during the Iron Age, as evi-

denced by the frozen Pazyryk kurgans in the southern Altai

Mountains,14 which contained examples of the typical

‘‘Scytho-Siberian animal style’’ observed throughout the

entire Eurasian steppe.3,15 These populations further

intermingled with expanding Altaic-speaking groups,

specifically during the movements of the Xiongnu,

Xianbei, and Yuezhi, as recorded by ancient Chinese histo-

rians in the second century BCE.16,17

Ethnographic studies of Turkic-speaking tribes indige-

nous to the Altai region of southern Siberia noted cultural

differences among ethnic groups such that they could be

classified into northern or southern Altaians.18,19 Northern

Altaian ethnic groups include the Chelkan, Kumandin,

and Tubalar. The Altai-kizhi, Teleut, and Telengit were

grouped together as southern Altaians, along with a few

other smaller populations. Similarly, linguistic studies

have shown that languages from northern and southern

populations are mutually unintelligible, despite their

having similar Turkic roots. The northern Altai languages

also showed greater influences from Samoyedic, Yeniseian,

and Ugric languages, possibly reflecting their origin among

the ancestors of these present-day peoples. By contrast,

southern Altaian languages belong to the Kipchak

branch of the Turkic language family and have been greatly

influenced by Mongolian, especially after the expansion

of the Mongol Empire.16,20 These linguistic differences are

further mirrored by differences in anthropometric traits,

traditional subsistence strategies, religious traditions, and

clan names for northern and southern Altaians.18,19,21

Genetic analysis of Altaian populations initially focused

on protein polymorphisms to assess levels of diversity and

the relationships between them and other Siberian popula-

tions by comparing relative proportions of West and East

Eurasian genotypes.22–24 The role that the Altai region

1Department of Anthropology, University of Pennsylvania, Philadelphia, PA 19104-6398, USA; 2Institute of Cytology and Genetics, SB RAS, Novosibirsk 630090, Russia; 3Institute of General Genetics and Cytology, Almaty 050060, Kazakhstan

4Present address: Harvard University Medical School, Brigham and Women’s Hospital, Boston, MA 02115, USA

5Present address: Sackler Educational Laboratory for Comparative Genomics and Human Origins, American Museum of Natural History, New York, NY 10024-5192, USA



The American Journal of Human Genetics 90, 229–246, February 10, 2012 229

played in the dispersal of humans into northern Eurasia

and subsequently into the Americas gained increasing

importance with the search for the founding mitochon-

drial DNAs (mtDNAs) and Y chromosomes for the

New World.8,25,26 As a result, the issue of where Native

American progenitors originated became a hotly debated

topic, with suggested source areas being Central Asia,

Mongolia, and different parts of Siberia.8–10,27–46 However,

much of the previous genetic research into this issue

focused mainly on southern Altaian populations, leaving

our understanding of the genetic diversity of northern

Altaian groups incomplete.

Given the ethnographic and historical background of

Altaian peoples, we characterized the mtDNA and Y chro-

mosome variation in these populations to elucidate their

genetic history. Our first objective was to determine

whether the ethnographic classifications of northern and

southern Altaians reflected their patterns of genetic varia-

tion, and specifically whether they shared a common

ancestry. If differences were observed, we then wanted to

know whether they were attributable to demographic

factors, social organization, or some combination of the

two. The second goal was to examine whether northern

Altaians’ genetic variation is structured by tribe and clan

identity. The third goal was to use these data to investigate

larger questions concerning the peopling of Siberia (and

the Americas). In particular, we were interested in learning

whether these genetic data would reveal the effects of

ancient and/or recent migrations into or out of the Altai

region, including that giving rise to the ancestors of

indigenous populations from America. Overall, this paper

attempts to understand the population history of Altaians

by placing them into a Siberian genetic context and uses

a phylogeographic approach to dissect the layers of history,

uncovering the formation of these ethnic groups and their

importance for understanding the peopling of Northern

Asia and the Americas.

Subjects and Methods

Sample Collection

Between 1991 and 2002, we conducted ethnographic fieldwork

and sample collection in a number of settlements within the

southern part of the Altai Republic (Figure 1). During this period,

a total of 267 self-identified Altai-kizhi individuals living in the

villages of Mendur-Sokkon, Cherny Anuy, Turata, and Kosh-Agach

participated in the study. In addition, another nine Altai-kizhi

individuals from villages in the northern Altai Republic partici-

pated in the study (see below), bringing the total number of

Altai-kizhi participants to 276, of whom 120 were men.

Figure 1. Map of the Altai Republic and Locations of Sample Collection


In 2003, we worked with 214 northern Altaians living in the

Turochak District of the Altai Republic. These persons included

91 Chelkans, 52 Kumandins, and 71 Tubalars living in nine

different villages in the Biya and Lebed’ River basins and

along Teletskoe Lake (Figure 1). The villages included Artybash,

Biika, Dmitrievka, Kebezen, Kurmach-Baigol, Sank-Ino, Shunarak,

Tandoshka, and Yugach. Of the northern Altaian participants, 69

were men.

Blood samples were drawn from all participants with informed

consent written in Russian and approved by the University of

Pennsylvania IRB and the Institute of Cytology and Genetics in

Novosibirsk, Russia. Genealogical data were also obtained from

each person at the time of sample collection to ensure that the

individuals were unrelated through at least three generations

and to assess the level of admixture in these communities. Individ-

uals were categorized by self-identified ethnicity for this study.

Molecular Genetic Analysis

Sample Preparation

Blood samples were fractionated by low-speed centrifugation to

obtain plasma and red cell fractions. Total genomic DNAs were

isolated from buffy coats with a lysis buffer and standard phenol-

chloroform extraction protocol modified from earlier studies.27,47

mtDNA Analysis

The mtDNA of each sample was characterized by high-resolution

SNP analysis and control region sequencing. PCR-RFLP analysis

was employed to assign individuals to West48–52 and East30,53–56

Eurasian mtDNA haplogroups by screening them for known diag-

nostic markers, as per previous studies57,58 (Table S1 available

online), with the nomenclature used to classify the mitochondrial

haplotype according to PhyloTree.org.59

The hypervariable segment 1 (HVS1) of the control region was

directly sequenced for each sample by published methods,58 and

hypervariable segment 2 (HVS2) was sequenced with the primers

indicated in Table S2. Sequences were read on ABI 3130xl Gene

Analyzers located in the Laboratory of Molecular Anthropology

and the Department of Genetics Sequencing Core Facility at the

University of Pennsylvania and aligned and edited with

Sequencher 4.8 (Gene Codes Corporation). All polymorphic

nucleotides were reckoned relative to the revised Cambridge refer-

ence sequence (rCRS).60,61 The combination of SNP data and

control region sequences defined maternal haplotypes in these

individuals.

Y Chromosome Analysis

The nonrecombining portion of the Y chromosome (NRY) from

each male participant was characterized by assaying phylogeneti-

cally informative biallelic markers in a hierarchical fashion accord-

ing to published information62,63 and previously published

methods.64 A total of 116 biallelic markers were tested to define

sample membership in respective NRY haplogroups. Most of the

SNPs and fragment length polymorphisms were characterized by

custom TaqMan assays read on an ABI Prism 7900 HT Real-Time

PCR System (Applied Biosystems). These polymorphisms included

L53, L54, L55, L56, L57, L213, L329, L330, L331, L332, L333,

L365, L400, L456, L472, L474, L475, L476, L528, LLY22g, M3,

M9, M12, M15, M18, M20, M25, M35, M45, M55, M56, M69,

M70, M73, M81, M86, M89, M93, M96, M102, M117, M119,

M120, M122, M123, M124, M128, M130, M134, M143, M147,

M157, M162, M170, M172, M173, M174, M178, M186, M201,

M204, M207, M214, M217, M223, M230, M242, M253, M265,

M267, M269, M285, M304, M323, M335, M346, M410, M417,

M434, M458, P15, P25, P31, P36.2, P37.2, P47, P60, P63, P105,

P215, P256, P261, P297, and PK2. Additional markers were

detected through direct sequencing (L191, L334, L401, L527,

L529, M17, M46 [Tat], M343, M407, MEH2, P39, P43, P48,

P53.1, P62, P89, P98, P101, PageS000104, and PK5) and by PCR-

RFLP analysis (M175).65 Seventeen short tandem repeats (STRs)

were characterized with the AmpFlSTR Yfiler PCR Amplification

Kit (ABI) and read on an ABI 3130xl Genetic Analyzer with Gene-

Mapper ID v3.2 software. Each paternal haplotype was designated

by its 17-STR profile. Y chromosome lineages were defined as the

unique combinations of SNP and STR data present in the samples.

DYS389b was calculated by subtracting DYS389I from DYS389II,

which was used for all statistical and network analyses.64
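The DYS389b derivation described above can be sketched in a few lines. This is a minimal illustration, assuming each Y-STR profile is held as a mapping from locus name to repeat count; the function name `add_dys389b` and the example allele values are hypothetical and do not come from the study data. The subtraction is needed because the DYS389II amplicon also contains the DYS389I repeat block.

```python
def add_dys389b(profile):
    """Return a copy of a Y-STR profile with DYS389b = DYS389II - DYS389I.

    `profile` maps locus names to repeat counts (hypothetical layout).
    """
    out = dict(profile)  # leave the caller's profile unchanged
    out["DYS389b"] = profile["DYS389II"] - profile["DYS389I"]
    return out


# Hypothetical example profile (allele values are illustrative only).
example = {"DYS19": 14, "DYS389I": 13, "DYS389II": 29, "DYS390": 24}
converted = add_dys389b(example)
print(converted["DYS389b"])  # 16
```

Using DYS389b rather than DYS389II keeps variation at DYS389I from being counted twice in distance-based statistics.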

Comparative Data

To place their genetic histories in a broader contextual framework,

we compared Altaian mtDNA and NRY data with those from

populations in southern Siberia, Central Asia, Mongolia, and

East Asia. For the mtDNA analysis, the populations included

Telengits, Teleuts, Shors, Khakass, Tuvinians, Todzhans, Tofalars,

Soyots, Buryats, Khanty, Mansi, Ket, Nganasan, Western Evenks,

Uyghurs, Kazakhs, Kyrgyz, Uzbeks, and Mongolians.41,43,44,66–71

For the NRY analysis, only populations that were represented by

full Y-STR data sets (not just Y-STRs for specific haplogroups)

were used for comparative purposes. These populations included

Teleuts, Khakass, Mansi, Khanty, Kalmyks, Mongolians, and

Uyghurs.68,72–75 The STR haplotypes were reduced to ten loci

(DYS19, DYS389I, DYS389b, DYS390, DYS391, DYS392, DYS393,

DYS437, DYS438, and DYS439) to allow for as broad a comparison

as possible. In the coalescence analysis, we used the 15-locus Y-STR

Q-M3 haplotypes from Geppert et al.76
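Reducing haplotypes to a shared locus set, as described above, might look like the following sketch. It assumes profiles are dictionaries keyed by locus name, with DYS389b taken as DYS389II minus DYS389I as defined under Y Chromosome Analysis; the names `SHARED_LOCI` and `reduce_haplotype` and the sample values are illustrative assumptions, not part of the original workflow.

```python
# The ten loci retained for cross-study comparison.
SHARED_LOCI = ("DYS19", "DYS389I", "DYS389b", "DYS390", "DYS391",
               "DYS392", "DYS393", "DYS437", "DYS438", "DYS439")


def reduce_haplotype(profile, loci=SHARED_LOCI):
    """Return the alleles at the shared loci as a tuple.

    Returns None when any shared locus is missing, so that incomplete
    profiles can be dropped from the comparison.
    """
    try:
        return tuple(profile[locus] for locus in loci)
    except KeyError:
        return None


# Hypothetical profiles: the first carries an extra locus, the second lacks DYS439.
full = {"DYS19": 14, "DYS389I": 13, "DYS389b": 16, "DYS390": 24,
        "DYS391": 10, "DYS392": 14, "DYS393": 13, "DYS437": 14,
        "DYS438": 11, "DYS439": 12, "DYS448": 20}
partial = {k: v for k, v in full.items() if k != "DYS439"}

reduced = reduce_haplotype(full)
missing = reduce_haplotype(partial)
```

Returning a tuple makes the reduced haplotypes hashable, which is convenient when counting shared haplotypes across populations.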

Data Analysis

Summary statistics, including gene diversity and pairwise differences,

were calculated with Arlequin v3.11 (ref. 77) for mtDNA HVS1

(np 16024-16400) and NRY Y-STRs. FST and RST values between

populations were also calculated with Arlequin v3.11 for the

HVS1 sequences and Y-STRs, respectively. FST values were esti-

mated with the Tamura and Nei model of sequence evolution.78

Pairwise genetic distances were visualized by multidimensional

scaling (MDS) with SPSS 11.0.0 (ref. 79). In addition, nucleotide diversity,

Tajima’s D, and Fu’s FS were calculated with mtDNA HVS1

sequences.
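For readers unfamiliar with the summary statistics named above, the sketch below computes two of them in their simplest textbook form: unbiased gene diversity, H = n/(n − 1) · (1 − Σ p_i²), and the mean number of pairwise differences. It is an illustration only, not a reimplementation of Arlequin. It counts raw site differences without any substitution-model correction, and the toy sequences are invented.

```python
from collections import Counter
from itertools import combinations


def gene_diversity(haplotypes):
    """Unbiased gene diversity: n/(n-1) * (1 - sum of squared frequencies)."""
    n = len(haplotypes)
    freqs = [count / n for count in Counter(haplotypes).values()]
    return n / (n - 1) * (1 - sum(p * p for p in freqs))


def mean_pairwise_differences(haplotypes):
    """Average number of differing sites over all pairs of aligned sequences."""
    pairs = list(combinations(haplotypes, 2))
    total = sum(sum(a != b for a, b in zip(x, y)) for x, y in pairs)
    return total / len(pairs)


# Toy aligned sequences (illustrative only, not study data).
sample = ["ACGT", "ACGT", "ACGA", "TCGA"]
h = gene_diversity(sample)
mpd = mean_pairwise_differences(sample)
```

For the toy sample, three distinct haplotypes occur at frequencies 0.5, 0.25, and 0.25, giving H = 5/6, and the six pairs differ at 7 sites in total, giving a mean of 7/6.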

We analyzed the phylogenetic relationships among Y-STR

haplotypes and complete mtDNA genomes by using Network

4.6.0.0 (Fluxus Technology Ltd). These networks employed a

reduced median-median joining approach and MP post-process-

ing.80–82 The NRY haplotypes used to generate the networks

consisted of 15 Y-STRs. DYS385 was excluded from the network

analysis because differentiation between DYS385a and DYS385b

is not possible with the Y-Filer kit.83 The Y-STR loci were weighted

based on the inverse of their variances. Mitogenomes used in this

analysis came from the published literature and GenBank.
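The inverse-variance weighting of Y-STR loci mentioned above can be illustrated as follows. The text does not give the exact scaling used, so this sketch makes its own assumptions: weights are integers proportional to 1/variance, the least variable polymorphic locus receives a hypothetical maximum weight `scale`, and monomorphic loci are also assigned that maximum. The function name and sample data are illustrative.

```python
from statistics import pvariance


def inverse_variance_weights(haplotypes, loci, scale=10):
    """Assign each locus an integer weight proportional to 1/variance.

    `haplotypes` is a list of dicts mapping locus name to repeat count.
    `scale` (an assumed parameter) is the weight given to the least
    variable polymorphic locus; monomorphic loci also receive `scale`.
    """
    variances = {locus: pvariance([h[locus] for h in haplotypes])
                 for locus in loci}
    positive = [v for v in variances.values() if v > 0]
    if not positive:  # every locus is monomorphic
        return {locus: scale for locus in loci}
    vmin = min(positive)
    weights = {}
    for locus, v in variances.items():
        if v == 0:
            weights[locus] = scale
        else:
            weights[locus] = max(1, round(scale * vmin / v))
    return weights


# Hypothetical repeat counts for three loci in four men.
men = [
    {"DYS19": 14, "DYS390": 23, "DYS391": 10},
    {"DYS19": 14, "DYS390": 24, "DYS391": 10},
    {"DYS19": 15, "DYS390": 25, "DYS391": 10},
    {"DYS19": 15, "DYS390": 26, "DYS391": 10},
]
weights = inverse_variance_weights(men, ["DYS19", "DYS390", "DYS391"])
```

Down-weighting fast-mutating loci in this way reduces the number of reticulations that recurrent mutation introduces into a median-joining network.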

The time to the most recent common ancestor (TMRCA) for mi-

togenomes was estimated with the methods of Soares et al.84 The

Y-STR diversity within each haplogroup was assessed by two

methods.64 The first involved calculation of rho statistics with

Network 4.6.0.0, where the founder haplotype was inferred as

in Sengupta et al.85 The second used Batwing,86 a Bayesian

analysis where the TMRCA and expansion time of each popula-

tion (or haplogroup) were calculated by previously published

methods.64,72,87 Both the evolutionary and the pedigree-based

mutation rates were used to estimate coalescence dates with


generation times of 25 and 30 years, respectively.88–90 Because

a definitive consensus does not yet exist as to which rate should

be used, the validity of the resulting estimates is discussed. In

addition, Batwing was used to estimate the split or divergence

times of several haplogroups. This method assumes that, after populations

split, no further migration occurs between them. In this

case, the haplogroups investigated were not shared between pop-

ulations but derive from a common source, thereby justifying this

approach. Duplicated loci and new STR variants detected in this

study were excluded from statistical analysis.
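The rho-based dating described above can be approximated with a short calculation. The sketch below is a simplification under stated assumptions: rho is taken as the mean number of single-repeat steps separating each haplotype from the inferred founder (a star-like genealogy under a stepwise mutation model), whereas Network computes rho along the reconstructed network. The mutation rate and generation time in the example are placeholders, not the rates used in the study.

```python
def rho_statistic(haplotypes, founder):
    """Mean number of repeat-unit steps between each haplotype and the founder."""
    total = sum(sum(abs(allele - ref) for allele, ref in zip(hap, founder))
                for hap in haplotypes)
    return total / len(haplotypes)


def rho_to_years(rho, n_loci, rate_per_locus_per_generation, generation_years):
    """Convert rho to an age: rho / (loci * rate) generations, then to years."""
    generations = rho / (n_loci * rate_per_locus_per_generation)
    return generations * generation_years


# Hypothetical three-locus haplotypes around an assumed founder.
founder = (14, 13, 16)
haps = [(14, 13, 16), (15, 13, 16), (14, 13, 17), (14, 12, 15)]

rho = rho_statistic(haps, founder)
# Placeholder rate (0.002 per locus per generation) and 30-year generations.
age_years = rho_to_years(rho, n_loci=3,
                         rate_per_locus_per_generation=0.002,
                         generation_years=30)
```

Because the age scales inversely with the mutation rate, switching between a slower evolutionary rate and a faster pedigree-based rate changes the estimate by the ratio of the two rates, which is why the choice of rate matters for the dates reported.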

Results

Mitochondrial DNA and Y Chromosome Diversity

The maternal genetic ancestry of northern and southern

Altaian populations was explored by characterizing coding

region SNPs and control region sequences from 490 inhab-

itants of the Altai Republic, which yielded 99 distinct

mtDNA haplotypes defined by SNP and HVS1 mutations

(Table S3). The majority of mtDNAs were of East Eurasian

origin, although the relative proportion of these haplo-

types was greater in Chelkans (91.5%) compared to other

Altaian populations (75.2% in Tubalars, 75.6% in Kuman-

dins, and 76.4% in Altai-kizhi) (Table 1). Despite exhibit-

ing a lower overall frequency of West Eurasian haplo-

groups, Altaians (specifically, the Altai-kizhi, Tubalar, and

Kumandins) had a higher proportion of them as compared

to other southern Siberians.41,43 Differences in mtDNA

haplogroup profiles were observed among northern

Altaian ethnic groups and between northern Altaians

and Altai-kizhi, with the Chelkans being extraordinarily

distinct. Nevertheless, comparisons among other Altaian

ethnic groups revealed some consistent patterns. mtDNA

haplogroups B, C, D, and U4 were found in all Altaian pop-

ulations, but at varying frequencies, whereas southern

Altaians (Altai-kizhi, Telengits, and Teleuts) tended to

have a greater variety of West Eurasian haplogroups at

low frequencies. Shors, who have sometimes been catego-

rized as northern Altaians,18 exhibited a similar haplo-

group profile to other northern Altaian ethnic groups,

including moderate frequencies of C, D, and F1, although

they lacked others (N9a and U).41

Haplogroups C and D were the most frequent mtDNA

lineages in the Altaians, consistent with the overall picture

of the Siberian mtDNA gene pool. However, phylogeo-

graphic analysis of these lineages showed a greater diver-

sity of haplotypes in the southern Altaians compared to

northern Altaians. Although haplotypes were shared

between regions, northern Altaians largely had C4 with

the root HVS1 motif (16223-16298-16327) and C5c,

whereas the southern Altaians had C4a1 and C4a2.

Although C5c is largely confined to Altaians, it has been

suggested that an early migration from Siberia to Europe

brought haplogroup C west, where the branch differenti-

ated during the Neolithic and then was taken back into

southern Siberia.83 Also noteworthy, D4j7 appears to

be specific to Altaians and Shors.41,91 In addition, a D5a

haplotype was shared by Tubalars and Altai-kizhi, and

a rare D5c2 haplotype was shared by the Chelkans

and Kumandins. Interestingly, complete mtDNA genome

sequencing of a subset of our D5c2 samples showed few

differences from those present in Japan,55 suggesting

a possible connection resulting from the dispersal of Altaic

speaking populations.92 The remainder of the D haplo-

types were found in other southern Siberian and Central

Asian populations.

To explore the NRY variation in Altaian populations, 116

biallelic polymorphisms were characterized in 189 male

individuals, resulting in 106 Y chromosome lineages

(Table 2). Northern Altaian populations were composed

largely of haplogroups Q and N-P43, whereas southern

Altaians had a higher proportion of R-M417, C-M217/

PK2, C-M86, and D-P47. Haplogroups typical of south

Asia, western Europe, and East Asia were not found in

appreciable frequencies.72,93–99 The haplogroup frequency

differences between northern and southern Altaians were

statistically significant (χ² = 66.03, df = 9, p = 9.09 × 10⁻¹¹).
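A test of this kind can be sketched as a Pearson chi-square on a contingency table of haplogroup counts. The sketch below is an assumption-laden illustration rather than the analysis actually performed: it pools the Table 2 counts for Chelkans, Kumandins, and Tubalars as "northern", treats the Altai-kizhi as "southern", and collapses the high-resolution haplogroups into the ten classes used in Table 5, which is one grouping that yields 9 degrees of freedom. The exact categories used for the reported value are not stated here.

```python
# Counts per haplogroup class: (northern Altaians, southern Altaians).
# Northern = Chelkan + Kumandin + Tubalar; southern = Altai-kizhi.
# The ten-class grouping is an assumption made for this sketch.
counts = {
    "C": (0, 24),
    "D": (0, 6),
    "E": (1, 0),
    "F(xJ,K)": (1, 0),
    "J": (0, 3),
    "K(xN1c,O,P)": (18, 2),
    "N1c": (0, 3),
    "O": (1, 2),
    "P(xR1a1a)": (32, 20),
    "R1a1a": (16, 60),
}


def chi_square(table):
    """Pearson chi-square statistic and degrees of freedom for an r x c table."""
    rows = [list(r) for r in table]
    row_totals = [sum(r) for r in rows]
    col_totals = [sum(c) for c in zip(*rows)]
    grand = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(rows):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand
            stat += (observed - expected) ** 2 / expected
    dof = (len(rows) - 1) * (len(col_totals) - 1)
    return stat, dof


stat, dof = chi_square(counts.values())
print(f"chi2 = {stat:.2f}, df = {dof}")
```

Under this grouping the statistic comes out at roughly 70.8 with 9 degrees of freedom, close to but not identical with the published 66.03, so the categories used in the original test presumably differ somewhat. Several expected counts are small, so an exact or permutation test would be a reasonable alternative.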

As with the mtDNA data set, we also observed differ-

ences in NRY haplogroup composition among northern

Altaian populations, where each ethnic group shared

haplogroups with the other two, yet had distinct haplo-

group profiles. Overall, Kumandins had the most disparate

haplogroup frequencies of the northern Altaians, exhibiting

a similar number of N-P43 chromosomes as the

Chelkans, which were quite similar to those found in

Khanty and Mansi populations in northwestern Sibe-

ria.68,100 In addition, a large proportion of Kumandin Y

chromosomes belonged to R-M73. This haplogroup is

largely restricted to Central Asia101 but has also been found

in Altaian Kazakhs and other southern Siberians.64,102 In

fact, Myres et al.101 noted two distinct clusters of R-M73

STR haplotypes, with one of them containing Y chromo-

somes bearing a 19 repeat allele for DYS390, which appears

to be unique to R-M73. Interestingly, the majority of

Kumandin R-M73 haplotypes fell into this category,

although haplotypes from both clusters are found in

southern Siberia.102

In all cases, the haplotypes present in Altaians fit into

known modern human phylogenies. None of the Altaians

had a mitochondrial lineage similar to those of Neander-

thals or the Denisovan hominin. Although there are no

ancient Denisovan or Neanderthal Y chromosome data

to compare with the Altaian data set, the Altaian Y chro-

mosomes clearly derived from more recent expansions of

modern humans out of Africa.

Altaian Genetic Relationships

Summary statistics were calculated to assess the relative

amounts of genetic diversity in Altaian populations

(Table 3). Gene diversities based on HVS1 of the mtDNA

showed that, overall, the Altai-kizhi were more diverse

than the northern Altaians. The average pairwise differ-

ences for the Altai-kizhi were also smaller. In fact, the esti-

mates for the Altai-kizhi and Tubalars were comparable

to other southern Siberians.43 By contrast, those for the


Chelkans and Kumandins were lower and more similar to

Soyots, but not as low as that of Tofalars. Mismatch distri-

butions were smooth and bell-shaped for all populations

except the Chelkans, which had a significant raggedness

index. This statistic indicated that Tubalars, Kumandins,

and Altai-kizhi had experienced sudden expansions

or expansions from population bottlenecks.103 Tests of

neutrality confirmed these findings in yielding signifi-

cantly negative Tajima’s D and Fu’s FS estimates for all

populations, except the Chelkans, indicating that this

Table 1. mtDNA Haplogroup Frequencies of Altaian Populations

Hg Chelkan Kumandin Tubalar1 Tubalar2 Shor Altai-kizhi1 Altai-kizhi2 Telengit Teleut

# 91 52 71 72 28 276 48 55 33

C 15.1 41.5 35.6 20.8 17.9 31.4 25.0 14.6 24.2

Z 2.7 3.6 4.3 4.2 3.0

M8 3.6 4.2

D4 13.9 15.1 24.7 15.3 25.0 13.0 6.3 18.2 24.2

D5 8.6 3.8 4.1 5.6 3.6 0.7 3.0

G 3.2 4.0 4.2 3.6

M7 1.8

M9 1.4

M10 1.1 3.6 0.4 2.1

M11 2.1 1.8 3.0

M* 1.8

A 1.9 11.1 3.6 2.9 4.7 7.3

I 3.6 1.4 2.1 1.8

N1a 1.8

N1b 0.4

W 1.1

X 3.8 1.4 2.2 2.1 3.0

N9a 19.4 1.9 2.7 6.9 1.8

B 3.2 3.8 2.7 4.2 3.6 1.4 6.3 14.6 6.1

F1 10.8 3.8 1.4 14.3 8.3 4.2 1.8 3.0

F2 15.1 2.7 3.6 2.5 2.1

H 1.1 2.7 1.4 3.6 2.5 8.3 9.1 9.1

H2 3.3 2.1

H8 5.7 2.7 4.2 3.6 1.4

HV 1.8

V 6.1

J 3.6 4.0 6.3 1.8

T 1.9 0.4 3.6 6.1

U2 2.8 0.7 1.8 3.0

U3 2.1

U4 4.3 3.8 15.1 18.1 3.6 0.7 2.1 1.8 3.0

U5 2.2 9.4 4.1 5.6 3.3 2.1 1.8

U8 1.8

K 3.6 3.3 6.3 3.0

R9 1.1 3.8 1.4 2.2 5.5

R11 2.1


particular population probably experienced a reduction in

population size or was subdivided.
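For readers unfamiliar with the summary statistics in Table 3, the two simplest ones can be sketched as follows. This is a minimal illustration on hypothetical aligned sequences, assuming Nei's unbiased estimator for haplotype (gene) diversity. It is not the software used in this study, and the function names are invented for the example.

```python
from collections import Counter
from itertools import combinations


def haplotype_diversity(seqs):
    """Nei's unbiased gene diversity: n/(n-1) * (1 - sum of squared frequencies)."""
    n = len(seqs)
    freqs = [count / n for count in Counter(seqs).values()]
    return n / (n - 1) * (1 - sum(p * p for p in freqs))


def mean_pairwise_differences(seqs):
    """Average number of differing sites over all pairs of aligned sequences."""
    pairs = list(combinations(seqs, 2))
    diffs = [sum(a != b for a, b in zip(s, t)) for s, t in pairs]
    return sum(diffs) / len(pairs)


# Hypothetical four-site alignment, not real HVS1 data.
toy = ["AAAA", "AAAT", "AAAT", "TTAT"]
h = haplotype_diversity(toy)
pi = mean_pairwise_differences(toy)
print(round(h, 3), pi)
```

Haplotype diversity depends only on how evenly the distinct haplotypes are represented, whereas the mean pairwise difference also reflects how far apart those haplotypes are, which is why the two measures can rank populations differently.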

To understand Altaian maternal genetic background, we

compared our data with those from other North Asian and

Central Asian populations. FST values between populations

were calculated with HVS1 sequences and viewed through

multidimensional scaling (Figure 2). In this analysis,

southern Siberians formed a rather diffuse cluster, with

most Central Asian and Mongolian populations being

separated from them. Altaian populations also did not

constitute a distinct cluster unto themselves. Based on

the FST values, the Chelkans were distinct from all other

ethnic groups. Although falling closest to the Khakassians

in the MDS plot, they shared a smaller genetic distance

with the Tubalars2, which was expected because of the

inclusion of some Chelkans in that sample set.44 Kuman-

dins and Tubalars1 were not significantly different, and

appeared close to Tuvinians and southern Altaians. In

fact, both populations had smaller FST values with

southern Altaians than they did with the Chelkans,

although the genetic distances between Tubalars1 and

Tubalars2, Altai-kizhi, and Teleuts were also nonsignifi-

cant. Unlike northern Altaians, most of the southern

Altaian populations clustered together. The Altai-kizhi,

Teleuts, and Tubalars1 formed one small cluster with

Kyrgyz, whereas the Telengits showed greater affinities

with Central Asian populations. The southern Altaian

cluster sat near a cluster of Tuvinian populations, suggest-

ing a similar population history and likely gene flow

between these groups.

Summary statistics were calculated to assess the genetic

diversity of paternal lineages in Altaian populations

(Table 4). Gene diversities based on Y-STR haplotypes

(15-loci Y-STR haplotypes; Table S4) showed that the Altai-

kizhi were more diverse than the northern Altaians. Unlike

the mtDNA data, within-group pairwise differences were

greater in the southern Altaian and Tubalar Y-STR haplo-

types than in the Chelkans and Kumandins.

Y-chromosomal variation in the four populations in our

data set provided a slightly different picture than the mito-

chondrial data. In this analysis, RST values were calculated

with 15-loci Y-STR haplotypes (Table S6). These estimates

indicated that only the Chelkans and Tubalars were not

Table 2. High-Resolution NRY Haplogroup Frequencies in Altaian Populations

Haplogroup   Chelkan      Kumandin     Tubalar      Altai-kizhi
C3*          -            -            -            19 (0.158)
C3c1         -            -            -            5 (0.042)
D3a          -            -            -            6 (0.050)
E1b1b1c      -            -            1 (0.037)    -
I2a          -            -            1 (0.037)    -
J2a          -            -            -            3 (0.025)
L            1 (0.040)    -            -            -
N1*          -            1 (0.059)    3 (0.111)    -
N1b*         5 (0.200)    8 (0.471)    -            2 (0.017)
N1c*         -            -            -            1 (0.008)
N1c1         -            -            -            2 (0.017)
O3a3c*       -            -            -            1 (0.008)
O3a3c1       -            -            1 (0.037)    1 (0.008)
Q1a2         -            -            1 (0.037)    -
Q1a3a*       15 (0.600)   -            10 (0.370)   -
Q1a3a1c*     -            -            -            20 (0.167)
R1a1a1*      4 (0.160)    2 (0.118)    10 (0.370)   60 (0.500)
R1b1a1       -            6 (0.353)    -            -
T            -            -            -            -
Total        25           17           27           120

Table 3. HVS1 Summary Statistics for Altaian Populations

                       Northern Altaian                                    Southern Altaian
Population             Chelkan          Kumandin         Tubalar1         Altai-kizhi1
# of samples           91               52               71               276
# of haplotypes        22               18               26               75
Haplotype diversity    0.923 ± 0.013    0.914 ± 0.021    0.953 ± 0.010    0.976 ± 0.003
Nucleotide diversity   0.020 ± 0.011    0.022 ± 0.011    0.019 ± 0.010    0.018 ± 0.009
Pairwise differences   7.68 ± 3.61      8.22 ± 3.87      7.03 ± 3.34      6.84 ± 3.23
Raggedness index       0.032            0.022            0.010            0.011
Raggedness p value     0.000            0.149            0.635            0.388
Tajima D               1.201            −0.644           −0.701           −1.180
Tajima D p value       0.000            0.000            0.000            0.000
Fu’s FS                3.417            −0.497           −3.877           −24.416
Fu’s FS p value        0.002            0.000            0.000            0.000


significantly different from each other. The Kumandins

were quite distant from all populations, although these

distances were slightly smaller among northern Altaians

than with the Altai-kizhi. The Altai-kizhi were again closest

to the Tubalars.

These relationships were affirmed by the haplotype

sharing between the four populations. The Chelkans and

Tubalars shared a large proportion of their haplotypes,

mostly those from haplogroups Q and R-M417, whereas

the Kumandins shared only one haplotype with Tubalars

(a rare N-LLY22g haplotype). In addition, the northern

and southern Altaians shared only a single haplotype,

belonging to haplogroup O-M117, which is more

commonly found in southern China.104 In fact, these

two Y chromosomes were the only occurrences of hap-

logroup O in our data set.
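Haplotype sharing of the kind described above amounts to intersecting the sets of distinct haplotypes observed in each population. The sketch below uses hypothetical four-locus haplotypes and generic population labels purely to show the operation. None of the values are drawn from Table S4.

```python
def shared_haplotypes(pop_a, pop_b):
    """Return the set of distinct haplotypes observed in both populations."""
    return set(pop_a) & set(pop_b)


# Hypothetical Y-STR haplotypes, each a tuple of repeat counts.
pop_x = [(14, 13, 30, 23), (14, 13, 30, 23), (15, 12, 29, 24)]
pop_y = [(14, 13, 30, 23), (15, 12, 29, 24), (16, 12, 28, 25)]
pop_z = [(13, 14, 31, 19), (16, 12, 28, 25)]

print(len(shared_haplotypes(pop_x, pop_y)))
print(len(shared_haplotypes(pop_z, pop_y)))
```

Because haplotypes are compared as whole tuples, a single repeat difference at any locus makes two chromosomes count as distinct, so the amount of sharing depends on how many loci are typed.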

The Y-STR profiles were reduced to 10-loci STR haplo-

types in order to compare Y chromosome diversity in

several Siberian and Central Asian populations (Table 5;

Figure 3). The genetic distances in our sample set remained

high despite the greater haplotype sharing that resulted

from this reduction. Overall, the genetic distances were

much greater with the Y-STR haplotypes compared to

mtDNA haplotypes, indicating greater genetic differentia-

tion in paternal lineages compared to maternal lineages.

In addition to the Chelkans and Tubalars, two other groups

of populations exhibited nonsignificant RST values. One

group included Uyghur (from Urumqi and Yili) and

Mongolian (Kalmyks and Mongolians) populations, and

the other included the Mansi and a Sagai population iden-

tified as part of the Khakass ethnic group. In contrast with

their position in the mtDNA MDS plot, northern Altaians

were separated from all other populations, including other

southern Siberians. The three groups of Khakass (Sagai,

Sagai/Shor, and Kachin) fell much closer to the Khanty

and Mansi, which probably indicates a common ancestry

Figure 2. MDS Plot of FST Genetic Distances Generated from mtDNA HVS1 Sequences in Siberian and Central Asian Populations

Circle, southern Siberian; diamond, northwestern Siberian; square, Central Asian.

Table 4. Y-STR Summary Statistics for Altaian Populations

                       Northern Altaian                                 Southern Altaian
Population             Chelkan          Kumandin         Tubalar          Altai-kizhi
# of samples           25               17               27               120
# of haplotypes        14               9                18               62
Haplotype diversity    0.910 ± 0.043    0.912 ± 0.042    0.954 ± 0.025    0.978 ± 0.005
Pairwise differences   6.59 ± 3.22      6.39 ± 3.19      7.40 ± 3.57      7.58 ± 3.56


for these populations. Unfortunately, more complete

Y-STR data sets were not available for other southern Sibe-

rian populations. Nonetheless, these results indicated a

different history for northern Altaians compared to

Central Asians and even other southern Siberians. A

specific reason for this difference is that Mongolians

had a much greater genetic impact on southern Altaians,

which is expected given the historical and linguistic

evidence.18,19,105

Altaian and Native American Connections

To test the hypothesis that Native Americans share a

more recent common ancestor with Altaians relative

to other Siberian and East Asian populations, we specifi-

cally examined the mtDNA and NRY haplogroups that

appeared in both locations. For the mtDNA, it is well

known that haplogroups A–D and X largely make up the

maternal genetic heritage of indigenous peoples in the

Americas.27,29,39,47,106 Complete mtDNA genome sequenc-

ing has led to a greater comprehension of the phylogeny of

Native American mtDNAs and, consequently, a better

understanding of their origins.107–110 Although Altaians

possess the five primary mtDNA haplogroups found in

the Americas, these lineages are not exactly the same as

those appearing in Native Americans at the subhaplogroup

level. This is also true for other Siberian populations except

in those few instances where gene flow across the Bering

Strait brought some low frequency types back to north-

eastern Siberians.

An example of this pattern is haplogroup C1a.

Southern Altaians possessed C1a, which is an exclusively

Asian branch of the predominately American C1 haplo-

group.107,108 To date, only four complete C1a genomes

have been published. These sequences produced a more

recent TMRCA than other genetic evidence had previously

suggested for the peopling of the Americas. Although

Tamm et al.107 viewed this haplogroup as representing a

back migration into Siberia, it does not occur in Siberian

populations that are geographically closest to the Americas,

but rather those living in southern and southeastern

Siberia.41,89 However, given the small effective population

sizes from the northeastern Siberian groups that have

been studied thus far, this haplogroup could have been

lost because of drift.

The other mtDNA haplogroup found in northern

and southern Altaians that is a close relative of a Native

American lineage is D4b1a2a1a. This haplogroup has

been found in Altaians, Shors, and Uzbeks from north-

western China.41,44,70 Analysis of complete mtDNA

genomes identified a sister branch (D4b1a2a1a1), which

is found only in northeastern Siberian populations

and Inuit from Canada and Greenland.42,45,54,91,111

TMRCAs were calculated from the complete mtDNA

genomes of this branch and those from Native American

D4b1a2a1a1. By analyzing only synonymous mutations

from these sequences with the method of Soares et al.,84

Table 5. Low-Resolution NRY Haplogroup Frequency Comparison of Altaians

Hg Chelkan Kumandin Tubalar Altai-kizhi1 Altai-kizhi2 Teleut1 Teleut2 Shor

C 20.0 13.0 8.5 5.7 2.0

D 5.0 3.3

E 3.7

F (xJ,K) 3.7 3.3 10.7 2.0

J 2.5 2.2 2.1

K (xN1c,O,P) 24.0 52.9 11.1 1.7 2.2 13.7

N1c 2.5 5.4 10.6 28.6 2.0

O 3.7 1.7

P (xR1a1a) 60.0 35.3 40.7 16.7 28.3 34.3 2.0

R1a1a 16.0 11.8 37.0 50.0 42.4 68.1 31.4 78.4

Total 25 17 27 120 92 47 35 51

Figure 3. MDS Plot of RST Genetic Distances Generated from Y Chromosome STR Haplotypes in Siberian and Central Asian Populations

Circle, southern Siberian; diamond, northwestern Siberian; square, Central Asian.


we estimated the TMRCAs of these two branches at

11.8 kya and 15.8 kya, respectively.

For the Y chromosome, indigenous American lineages

are derived mostly from haplogroups C and Q, and, as

such, are crucial for understanding the genetic histories

of peoples from the Americas and how they relate to

populations of Central Asia and Siberia.9,39,93,98,112,113

Just as Seielstad et al.114 and Bortolini et al.38 used M242

to clarify the genetic relationship between Asian and

American Y chromosomes, the characterization of this

haplogroup at an even higher level of resolution has led

to a much greater understanding of the origins of Native

American Y chromosomes and their connections to Asian

types. In this regard, it was recently shown that the

American Q-M3 SNP is located on an M346-positive

background.63 The presence of M346 in Central Asia and

Siberia has strengthened the argument for a southern

Siberian or Central Asian origin for many American Y chro-

mosomes.85,99,102,115

Given the importance of haplogroup Q for Native

American origins, we subjected samples from this lineage

to high-resolution SNP analysis involving 37 biallelic

markers to better understand the relationship between

Old and New World populations and the migration(s)

that connect them. All Y chromosomes in this study that

belonged to haplogroup Q (as indicated by the presence

of M242) were also found to have the P36.2, MEH2,

L472, and L528 markers (Figure S1). Thus, these haplo-

types fell into the Q1a branch of the Y chromosome

phylogeny. Because Q1b Y chromosomes were not found

in Altaian samples, we were not able to definitively place

the L472 and L528 SNPs at the same phylogenetic position

as MEH2. For this reason, their placement is tentative until

L275/L314/M378 Y chromosomes are screened for these

markers. Furthermore, M120/M265-positive, P48-positive,

and P89-positive samples were not found in the Altai

region. Therefore, the placement of these branches at the

same phylogenetic level as M25/M143 and M346/L56/

L57 should also be considered as provisional (although

see Karafet et al.63).

The M346, L56, and L57 SNPs were positioned as ances-

tral to three derived branches in the Family Tree DNA

phylogeny. We found that the L474, L475, and L476

SNPs were present in all of our M346-positive samples.

However, because M323- and L527/L529-positive samples

were not found in the Altaians, we could not confirm the

exact position of these markers at either the Q1a3 or

Q1a3a level. On the other hand, all Altaians that possessed

the M346, L56, L57, L474, L475, and L476 SNPs also had

L53, L55, L213, and L331.

Interestingly, northern and southern Altaian Q Y chro-

mosomes differed by three markers. L54, L330, and L333

were found in Q haplotypes in the southern Altaians and

one Altaian Kazakh, whereas the northern Altaian Q

haplotypes lacked these derived SNPs. Thus, according to

the standard nomenclature set by the Y Chromosome

Consortium62 and followed by others, the northern Altaian

Q haplotypes belonged to Q1a3a* and the southern

Altaians belonged to Q1a3a1c*. We have further confirmed

that M3 haplotypes belong to L54-derived Y chromosomes

(unpublished data). These alterations in the phylogeny

change the haplogroup name of the Native American

Q-M3 Y chromosomes from Q1a3a to Q1a3a1a. Moreover,

the position of M3 and L330/L333 in the phylogeny indis-

putably showed that the MRCA of most Native American

Y chromosomes was shared with southern Altaians.
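The marker hierarchy described in this section can be summarized as a simple decision procedure. The sketch below is a simplified reading of the text and not a complete haplogroup Q phylogeny: it assumes that L53 marks the Q1a3a level, that L54 marks the node from which both M3 and L330 descend, and that the label `Q1a3a1*` applies to L54-derived chromosomes lacking both. The function name and the fallback labels are hypothetical.

```python
def classify_q(derived_snps):
    """Assign a haplogroup Q label from the derived SNPs a sample carries.

    Simplified sketch: only the markers discussed for Altaian and Native
    American Q lineages are handled, and the marker-to-level mapping is an
    assumption based on the description in the text.
    """
    derived = set(derived_snps)
    if "M242" not in derived:
        return "not Q"
    if "M346" not in derived:
        return "Q (unresolved in this sketch)"
    if "L53" not in derived:
        return "Q1a3*"
    if "L54" not in derived:
        return "Q1a3a*"      # pattern reported for northern Altaians
    if "M3" in derived:
        return "Q1a3a1a"     # Native American Q-M3
    if "L330" in derived:
        return "Q1a3a1c*"    # pattern reported for southern Altaians
    return "Q1a3a1*"


northern = {"M242", "M346", "L53"}
southern = {"M242", "M346", "L53", "L54", "L330"}
american = {"M242", "M346", "L53", "L54", "M3"}
print(classify_q(northern), classify_q(southern), classify_q(american))
```

Ordering the checks from the most ancestral marker to the most derived reflects the logic of the argument: M3 and L330 chromosomes share every marker down to L54, so their common ancestor postdates the split from L54-negative northern Altaian lineages.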

The differences between the northern and southern

Altaian Q Y chromosomes were also reflected in the anal-

ysis of high-resolution Y-STR haplotypes (Figure S2).116

Comparisons of Altaian Q-M346 Y chromosomes with

those from southern Siberian, Central Asian, and East

Asian populations revealed affinities between southern

Altaian and these other groups. However, the northern

Altaians remained distinctive, even in networks con-

structed from fewer Y-STR loci (Figure S3).

The time required to evolve the extent of haplotypic

diversity observed in each of the subhaplogroups can aid

in determining when particular mutations arose and

possibly when these mutations were carried to other loca-

tions. The TMRCA for the northern Altaian Q1a3a* Y chro-

mosomes indicated a relatively recent origin for them, one

dating to either the Bronze Age or recent historical period,

depending on the Y-STR mutation rate being used (Table 6).

The southern Altaian/Altaian Kazakh Q1a3a1c* Y chromo-

somes had a slightly older TMRCA that dated them to

either the late Neolithic or early Bronze Age. By using

Bayesian analysis, we further estimated the divergence

time of the two Q haplogroups at about 1,000 years after

the TMRCA of all Altaian Q lineages (~20 kya), indicating

an ancient separation of northern and southern Altaian

Q Y chromosomes (Table 7).

A similar analysis was conducted to determine when the

L54 haplogroup arose and gave rise to M3 and L330/L334

subbranches. The indigenous American Y chromosomes

used in this analysis were more diverse than those of

southern Altaians. The resulting TMRCA for the South

American Q1a3a1a* samples was 22.2 kya or 7.6 kya,

depending on the mutation rate used. The divergence

between the M3 and L330/L334 Y chromosomes was

~13.4 kya, with a TMRCA of 22.0 kya, via the evolutionary

rate. By contrast, the TMRCA and divergence time via

a pedigree-based mutation rate were 7.7 kya and 4.9 kya,

respectively.

The time required to generate the haplotypic diversity in

the L54-positive Y chromosomes clearly showed that the

evolutionary rate provided a more reasonable estimate.

The Americas were inhabited well before 5–8 kya, based

on various lines of evidence, making the use of the pedi-

gree-based mutation rate questionable. The estimates

generated with the evolutionary-based mutation rate

provided times that are more congruent with the known

prehistory of the Americas.117 They are also similar to the

TMRCAs calculated for Native American mtDNA haplo-

groups.107,108


Discussion

Origins of Northern and Southern Altaians

In this paper, we characterized mtDNA and NRY variation

in northern and southern Altaians to better understand

their population histories and elucidate the genetic

relationship between Altaians and Native American popu-

lations. The evidence from the mtDNA and NRY data

supports the hypothesis that northern and southern

Altaians generally formed out of separate gene pools.

This complex genetic history involves repeated migrations

into (and probably out of) the Altai-Sayan region. In addi-

tion, the histories as revealed by these data added nuances

that could not be attained through low-resolution charac-

terization alone.

The NRY data provided the clearest evidence for a signif-

icant genetic difference between the two sets of Altaian

ethnic groups. Although sharing certain NRY haplogroups,

the two population groups differed in the frequencies of

these lineages, and, more importantly, shared few haplotypes

within them. By contrast, northern and southern populations

shared considerably more mtDNA haplotypes,

indicating that some degree of gene flow had occurred

between them, albeit in a sex-specific manner. As seen in

other populations from Siberia and Central Asia, the patri-

lineality of these groups probably helped to shape this

difference in patterns of mtDNA and Y-chromosomal vari-

ation.64,118

In addition, each northern Altaian ethnic group showed

different genetic relationships with the Altai-kizhi. The

Tubalars consistently grouped closer to the Altai-kizhi

than the other two northern Altaians based on both

mtDNA and NRY data. Thus, the higher genetic diversity

of mtDNA and NRY haplotypes in the Tubalars is probably

the result of admixture with other groups, such as

southern Altaians. The Chelkans, on the other hand,

have the most divergent set of mtDNAs of the three popu-

lations. Mismatch analysis and tests of neutrality indicated

that the Chelkans show signs of decreasing population size

or population structure. Long-term endogamy has prob-

ably also played a role in shifting the pattern of mtDNA

diversity in Chelkans from that seen in other northern

Altaians. Because of this endogamy (and genetic drift),

only a few lineages attained high frequencies, resulting

Table 7. Divergence Times between Haplogroups/Populations

TMRCA Split Time

Median 95% Confidence Interval Median 95% Confidence Interval

Pedigree-Based Mutation Rate

Northern and Southern Altaians 5,490 [3,000–11,100] 4,490 [1,730–10,070]

Southern Altaians and Native Americans 7,740 [5,170–12,760] 4,950 [2,360–9,490]

Evolutionary-Based Mutation Rate

Northern and Southern Altaians 21,890 [9,900–57,440] 19,260 [7,060–54,600]

Southern Altaians and Native Americans 21,960 [12,260–42,690] 13,420 [5,220–30,430]

Table 6. TMRCAs and Expansion Times for Altaian and Native American NRY Haplogroup Q Lineages

                 Network          Batwing - TMRCA              Batwing - Expansion
Hg          N    ρ ± σ            Median   95% C.I.            Median   95% C.I.

Pedigree-Based Mutation Rate
All Q1a3a   97   5,390 ± 1,000    8,420    [5,620–14,290]      7,230    [1,220–20,510]
Q1a3a*      25   1,410 ± 580      1,480    [680–3,060]         2,100    [380–6,830]
Q1a3a1a*    52   5,820 ± 1,280    7,630    [4,870–12,920]      4,680    [480–14,940]
Q1a3a1c*    20   2,420 ± 700      2,970    [1,500–5,960]       2,680    [450–8,610]

Evolutionary-Based Mutation Rate
All Q1a3a   97   14,970 ± 2,760   25,580   [14,230–51,140]     17,220   [1,380–54,950]
Q1a3a*      25   3,910 ± 1,610    5,320    [2,300–12,160]      4,340    [1,000–13,080]
Q1a3a1a*    52   16,170 ± 3,550   22,160   [11,960–44,340]     9,800    [620–39,543]
Q1a3a1c*    20   6,750 ± 1,950    8,720    [3,960–20,010]      5,600    [1,030–17,910]

Note: ρ, rho statistic; σ, standard error; Q1a3a*, Northern Altaians (this study); Q1a3a1a, Native Americans (Geppert et al.76); Q1a3a1c, Southern Altaians (this study).


in reduced mtDNA diversity. Based on the NRY data, the

Kumandins were distinct from both the Chelkans and

Tubalars, who were composed of mostly the same set of

lineages. Thus, the genetic diversity in northern Altaians

is structured by ethnic group membership, and, therefore,

can be viewed as reflecting distinctive histories for each

population.

Not much is known about the ethnogenesis of northern

Altaians. However, it has been suggested that they

descended from groups that historically lived around the

Yenisei River and spoke either southern Samoyedic, Ugric,

or Yeniseian languages.18,19 These populations are the

same ones that later contributed to the formation of the

Kets, Selk’ups, Shors, and Khakass in northwestern Siberia

and the western Sayans of southern Siberia.4,105 Further-

more, the Chelkans and Tubalars possess a large number

of Q1a3a* Y chromosomes with dramatically different

STR profiles compared to other southern Siberians (Altai-

kizhi and Tuvinians) and Mongolians. Thus, it is possible

that similar lineages will be found in the Kets and/or

Selk’ups, where high frequencies of Q1-P36 have already

been noted.119 Should this be the case, it would provide

additional evidence for northern Altaians having common

ancestry with Samoyedic, Yeniseian, and Ugric speakers. In

fact, Chelkans and Kumandins also have N-P43 Y chromo-

somes very similar to ones found in the Ugric-speaking

Khanty. Regardless, there is notable genetic discontinuity

between northern Altaians and other Turkic-speaking

people of southern Siberia.

Southern Altaians share greater affinities with Mongo-

lians and Central Asians than they do with northern

Altaians. This is partly because of the high frequencies of

Y chromosome haplogroup C in these groups. In fact,

present-day Kyrgyz are nearly indistinguishable from the

Altai-kizhi based on their NRY haplogroup profile.120,121

They share similar C-M217 and R-M417 lineages with

the Altai-kizhi, suggesting a recent common ancestry for

the two groups, which further supports the theory of a

recent common ancestry among southern Siberians and

Kyrgyz.122

As evident in the disparities in genetic history between

northern and southern Altaians, the Altai has served as

a long-term genetic boundary zone. These disparities

reflect the different sources of genetic lineages and spheres

of interaction for both groups. The northern Altaians share

clan names, similar languages, subsistence strategies, and

other cultural elements with populations that today live

farther to the north.4 By contrast, southern Altaians share

these same features with populations in Central Asia,

mostly with Turkic- (Kipchak) but also Mongolic-speaking

peoples. Thus, the geography of the Altai (taiga versus

steppe) has helped to maintain these cultural and biolog-

ical (mtDNA, Y chromosome, and cranial-morphological)

differences.

Furthermore, no evidence of Denisovan or Neanderthal

ancestry was found in the Altaian mtDNA and Y chromo-

some data. However, this does not preclude such admix-

ture in the autosomes of Altaian populations. Greater

numbers of derived Denisovan SNPs were found in some

southeastern Asian and Oceanian populations, although

native Siberians were not included in that study.123 There-

fore, this issue requires further investigation.

Native American Origins

Many earlier genetic studies looked for the origins of

Native Americans among the indigenous peoples of Sibe-

ria, Mongolia, and East Asia. Often, the identification of

source populations conflicted between studies, depending

largely on the loci or samples being studied. Cranial

morphology has been used to demonstrate a connection

between the Native Americans and Siberian popula-

tions.124,125 Various researchers have suggested sources

such as the Baikal region of southern Siberia, the Amur

region of southeastern Siberia, and more generally Eurasia

and East Asia.126–128 A study of autosomal loci also showed

an affinity between populations in the New World and

Siberian regions but did not attempt to pinpoint a partic-

ular area of Siberia as the source area.129 In addition,

mtDNA studies have suggested New World origins from

a number of different locations including different parts

of Siberia, Mongolia, and northern China.34,41–45,47,71,130

Our own analysis of Altaian mtDNAs showed that the

five primary haplogroups (A–D, X) were present among

these populations. However, Altaian populations (and

generally all Siberian populations outside of Chukotka)

lack mtDNA haplotypes that are identical to those appear-

ing in the Americas. The only exceptions are the Selk’ups

and Evenks who bear A2 haplotypes, with their presence

in those groups being explained as a result of a back migra-

tion to northeast Asia.107

Despite the general absence of Native American haplo-

types in southern Siberia, there are sister branches whose

MRCAs are shared with those in Native Americans. One

such lineage is C1a, which was found in two Altai-kizhi

individuals and has also been observed at low frequencies

in Mongolia, southeastern Siberia, and Japan.44,46,55,71

Tamm et al.107 attribute its presence in northeast Asia to

a back migration from the New World, where haplogroups

C1b–d are prevalent, whereas Starikovskaya et al.44 argue

that C1a and C1b arose in the Amur region, with C1b

migrating to the Americas later. A similar lineage is

D4b1a2a1a, a sister branch to D4b1a2a1a1, which is found

in northern North America. Although both of these line-

ages date to around 15,000 years ago, additional mitoge-

nome sequences from these haplogroups are needed to

estimate more precise TMRCAs for them and thereby

delineate their putative Asian and American origins.

Results obtained from the Y chromosome analysis

support the view that southern Siberians and Native

Americans share a common source.8,9,11,38,131 This connection

was initially suggested by Santos et al.,8 who used low-resolution

Y-SNP typing and an alphoid heteroduplex system. Subsequently,

Zegura et al.11 showed a similarity in

NRY Q and C types among southern Altaians and Native


Americans by using only fast evolving Y-STR loci and,

again, low-level Y-SNP resolution. We focused on haplo-

group Q in this study because of the greater number of

new mutations published for this branch and correspond-

ing levels of Y-STR resolution (15–17 loci), which are

currently lacking for published Native American haplo-

group C Y chromosomes. This high-resolution character-

ization is critical because it allows for a more accurate

dating of TMRCAs and estimates of divergence between

the ancestors of Native Americans and indigenous Sibe-

rians. For example, with this approach, Seielstad et al.114

dated the origin of the M242 mutation, which defines the NRY

haplogroup Q, and, in turn, provided a more accurate

upper bound to the timing of the initial peopling of the

Western Hemisphere.

Several studies have shown that the American-specific

Q-M3 arose on an M346-positive Y chromosome.63,115,132

The M346 marker was also discovered in Altaians and

other Siberian populations.102,116 However, it has a broad

geographic distribution, being found in Siberia, Central

Asia, East Asia, India, and Pakistan, albeit at lower frequencies.85,99

We have shown that southern Altaian M346 Y chromosomes also

possess L54, a SNP marker that is more derived than M346 and is

shared by Native Americans who have the M3 marker. Because L54 is

found in both Siberia and the Americas, it most probably

defines the initial founder haplogroup from which M3

later developed.

Our coalescence analysis suggests that the two derived

branches of L54 (M3 and L330/L334) diverged soon after

this mutation arose. Estimates using the evolutionary

Y-STR mutation rate place the origin of this marker at

around 22,000 years ago, with the two branches diverging

at roughly 13,400 years ago. Although the 95% confidence

intervals for the Bayesian analyses are broad, the median

values of the TMRCAs estimated with this method closely

match those obtained through the analysis with rho statis-

tics. In addition, the coalescence estimates of northern and

southern Altaian Q Y chromosomes show that they, too,

are similar to the overall TMRCA estimates. This concor-

dance suggests that a rapid expansion probably occurred

for this particular Y chromosome branch around 15,000–

20,000 years ago. Given previous estimates for the timing

of the initial peopling of the Americas, this scenario seems

plausible, because these estimates fall in line with recent

estimates of indigenous American mitogenomes.107,133
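
The rho-based dating referred to above can be illustrated with a minimal sketch. It assumes a stepwise mutation model in which each repeat-unit difference from a presumed founder haplotype counts as one mutation, and it measures distance directly from the founder rather than along network branches, which is a simplification. The function names, the toy haplotypes, and the mutation rate are hypothetical and are not taken from this study.

```python
from typing import Sequence


def rho_statistic(founder: Sequence[int], haplotypes: Sequence[Sequence[int]]) -> float:
    """Average number of repeat-unit steps between each haplotype and the founder."""
    if not haplotypes:
        raise ValueError("at least one haplotype is required")
    total_steps = 0
    for haplotype in haplotypes:
        if len(haplotype) != len(founder):
            raise ValueError("haplotype and founder must cover the same loci")
        total_steps += sum(abs(allele - root) for allele, root in zip(haplotype, founder))
    return total_steps / len(haplotypes)


def tmrca_years(rho: float, n_loci: int, rate_per_locus_per_year: float) -> float:
    """Convert rho to years by dividing by the per-haplotype mutation rate per year."""
    return rho / (n_loci * rate_per_locus_per_year)


# Toy data (hypothetical): repeat counts at three Y-STR loci.
founder = (10, 12, 14)
sample = [(10, 12, 14), (11, 12, 14), (10, 13, 15), (9, 12, 14)]

rho = rho_statistic(founder, sample)  # (0 + 1 + 2 + 1) / 4 = 1.0
age = tmrca_years(rho, n_loci=3, rate_per_locus_per_year=1e-4)  # hypothetical rate
print(f"rho = {rho:.2f}, TMRCA = {age:.0f} years")
```

Because the age is simply rho divided by the per-haplotype mutation rate, any uncertainty in that rate carries over proportionally to the date.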

As in any study, there are limitations to this analysis. The

primary issues are the accuracy and precision of using

microsatellites for dating origins and dispersals of haplo-

types. The stochastic nature of mutational accumulation

will continue to be a source of some uncertainty in any

attempt at dating TMRCAs. For this reason, the question

of which Y-STR mutation rate to use for coalescence esti-

mates has been debated.88,134,135 In this study, the evolu-

tionary rate seems the most realistic, because estimates

generated with the pedigree rate provided times that are

much too recent, given what is known about the peopling

of the New World from nongenetic studies.117 There is no

evidence that the majority of Native Americans (men with

Q-M3 Y chromosomes) derived from a migration less than

8 kya, as would be suggested from the TMRCAs calculated

with the pedigree rate. However, other studies have used

the pedigree mutation rate to explore historical events

with great effect, the best-known case being the

Genghis Khan star cluster.136 It is possible that such rates

are, like that of the mtDNA, time dependent or that the

Y chromosomes to which the Y-STRs are linked have

been affected by purifying selection.84,133,137,138 In this

regard, the pedigree-based mutation rate would be more

appropriately used with lower diversity estimates, reflect-

ing recent historical events, while the evolutionary rate

would be used in scenarios with higher diversity estimates,

reflecting more ancient phenomena. Although beyond the

scope of this paper, it is likely that the Y-STR mutation rate

follows a curve similar in shape to that of the mitochondrial

genome.
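
The sensitivity of these dates to the choice of mutation rate can be made concrete with a short sketch. The rate values below are illustrative assumptions, not figures quoted in this section: an evolutionary rate of 6.9e-4 per locus per 25 years and a pedigree rate of 2.1e-3 per locus per generation, with a 25-year generation. The rho value and the function name are hypothetical.

```python
def age_years(rho: float, n_loci: int, rate_per_locus_per_generation: float,
              generation_years: float) -> float:
    """Age implied by rho for a haplotype of n_loci Y-STRs at the given rate."""
    mutations_per_year = n_loci * rate_per_locus_per_generation / generation_years
    return rho / mutations_per_year


RHO = 5.0                   # hypothetical average number of steps from the founder
N_LOCI = 15                 # hypothetical haplotype length
GENERATION_YEARS = 25.0     # assumed generation time
EVOLUTIONARY_RATE = 6.9e-4  # assumed, per locus per generation
PEDIGREE_RATE = 2.1e-3      # assumed, per locus per generation

evolutionary_age = age_years(RHO, N_LOCI, EVOLUTIONARY_RATE, GENERATION_YEARS)
pedigree_age = age_years(RHO, N_LOCI, PEDIGREE_RATE, GENERATION_YEARS)
ratio = evolutionary_age / pedigree_age

print(f"evolutionary-rate age: {evolutionary_age:.0f} years")
print(f"pedigree-rate age:     {pedigree_age:.0f} years")
print(f"ratio:                 {ratio:.2f}")
```

Under these assumed values the pedigree-rate age is about one third of the evolutionary-rate age, because the same diversity is divided by a rate roughly three times higher.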

Furthermore, haplogroup divergence dates need not

(and mostly do not) equate with population divergence

dates. In this case, however, the mutations defining the

southern Altaian and Native American branches of the

Q-L54 lineage most probably arose after their ancestral

populations split, given the geographic exclusivity of

each derived marker. Yet, sample sets that are not entirely

representative of a derived branch could potentially skew

the coalescent results. In all likelihood, the L54 marker

will be found in other southern Siberian populations,

because southern Altaians show some genetic affinities

with Tuvinians and other populations from the eastern

Sayan region. Even so, the consistency of TMRCA esti-

mates and the divergence dates for the different Q

branches examined here suggest that our data sets are suffi-

ciently representative. Moreover, even though the M3

haplotypes used in this analysis came exclusively from

indigenous Ecuadorian populations, the diversity found

within this data set is similar to previous estimates of the

age of the Q-M3 haplogroup.11

Although different lines of evidence point to different

source populations for Native Americans, the alternatives

need not be exclusive. The effects of historical and demo-

graphic events and evolutionary processes, particularly

recent gene flow, have shaped modern-day populations

such that we should not expect that any one population

in the Old World would show the same genetic composi-

tion as populations in the New World. That (an) ancestral

population(s) probably differentiated into the numerous

populations of Siberia and Central Asia, which have inter-

acted over the past 15,000 years, is not lost on us. Historical

expansions of people and the effects of animal and plant

domestication have played critical roles in shaping the

genetics of both Old and New World populations, particu-

larly in the past several thousand years. Modern popula-

tions have complex, local histories that need to be under-

stood if these are to be used in larger interregional (or

biomedical) analyses. Through the use of phylogeographic


methods, we can attain a better understanding of these

populations for such purposes. It is through this type of

approach that it becomes quite clear that southern Altaians

and Native Americans share a recent common paternal

ancestor.

Supplemental Data

Supplemental Data include three figures and six tables and can be

found with this article online at http://www.cell.com/AJHG/.

Acknowledgments

The authors would like to thank all of the indigenous Altaian

participants for their involvement in this study. We also thank

Fabricio Santos for his careful review of and helpful suggestions

for the manuscript, and two anonymous reviewers for their

constructive comments. In addition, we would like to acknowl-

edge the people who facilitated and provided assistance with our

field research in the Altai Republic. They include Vasiliy Semeno-

vich Palchikov, the staff of the Biochemistry Lab at the Turochak

Hospital, Dr. Maria Nikolaevna Trishina, Vitaliy Trishin, Alexander

A. Guryanov, the staff of the Native Affairs office in Gorniy

Altaiask, Galina Nikolaevna Makhalina, and Tatiana Kunduchi-

novna Babrasheva. In addition, we received help from a number

of people living in local villages around the Turochakskiy Raion,

particularly Alexander Adonyov. This project was supported by

funds from the University of Pennsylvania (T.G.S.), the National

Science Foundation (BCS-0726623) (T.G.S., M.C.D.), the Social

Sciences and Humanities Research Council of Canada (MCRI

412-2005-1004) (T.G.S.), and the Russian Basic Fund for Research

(L.P.O.). T.G.S. would also like to acknowledge the infrastructural

support provided by the National Geographic Society.

Received: September 15, 2011

Revised: December 6, 2011

Accepted: December 19, 2011

Published online: January 26, 2012

Web Resources


Arlequin, version 3.11, http://cmpg.unibe.ch/software/arlequin3/

Batwing, http://www.mas.ncl.ac.uk/~nijw/

Network, version 4.6.0.0, http://www.fluxus-engineering.com/sharenet.htm

Network Publisher, version 1.3.0.0, http://www.fluxus-engineering.com/nwpub.htm

Y-DNA Haplogroup Tree 2011, version 6.46, http://www.isogg.org/tree

References

1. Goebel, T. (1999). Pleistocene human colonization of

Siberia and peopling of the Americas: An ecological

approach. Evol. Anthropol. 8, 208–227.

2. Gryaznov, M.P. (1969). The Ancient Civilization of

Southern Siberia (New York: Cowles Book Company, Inc.).

3. Okladnikov, A.P. (1964). Ancient population of Siberia

and its culture. In The Peoples of Siberia, M.G. Levin and

L.P. Potapov, eds. (Chicago: The University of Chicago

Press), pp. 13–98.

4. Levin, M.G., and Potapov, L.P. (1964). The Peoples of Siberia

(Chicago: University of Chicago Press).

5. Reich, D., Green, R.E., Kircher, M., Krause, J., Patterson, N.,

Durand, E.Y., Viola, B., Briggs, A.W., Stenzel, U., Johnson,

P.L.F., et al. (2010). Genetic history of an archaic hominin

group from Denisova Cave in Siberia. Nature 468, 1053–

1060.

6. Krause, J., Fu, Q., Good, J.M., Viola, B., Shunkov, M.V.,

Derevianko, A.P., and Paabo, S. (2010). The complete mito-

chondrial DNA genome of an unknown hominin from

southern Siberia. Nature 464, 894–897.

7. Krause, J., Orlando, L., Serre, D., Viola, B., Prufer, K.,

Richards, M.P., Hublin, J.J., Hanni, C., Derevianko, A.P.,

and Paabo, S. (2007). Neanderthals in central Asia and

Siberia. Nature 449, 902–904.

8. Santos, F.R., Pandya, A., Tyler-Smith, C., Pena, S.D., Schan-

field, M., Leonard, W.R., Osipova, L., Crawford, M.H., and

Mitchell, R.J. (1999). The central Siberian origin for native

American Y chromosomes. Am. J. Hum. Genet. 64, 619–628.

9. Karafet, T.M., Zegura, S.L., Posukh, O., Osipova, L., Bergen,

A., Long, J., Goldman, D., Klitz, W., Harihara, S., de Knijff,

P., et al. (1999). Ancestral Asian source(s) of new world

Y-chromosome founder haplotypes. Am. J. Hum. Genet.

64, 817–831.

10. Lell, J.T., Sukernik, R.I., Starikovskaya, Y.B., Su, B., Jin, L.,

Schurr, T.G., Underhill, P.A., and Wallace, D.C. (2002). The

dual origin and Siberian affinities of Native American Y chro-

mosomes. Am. J. Hum. Genet. 70, 192–206.

11. Zegura, S.L., Karafet, T.M., Zhivotovsky, L.A., and Hammer,

M.F. (2004). High-resolution SNPs and microsatellite haplo-

types point to a single, recent entry of Native American

Y chromosomes into the Americas. Mol. Biol. Evol. 21,

164–175.

12. Anthony, D.W. (2007). The Horse, the Wheel, and Language:

How Bronze-Age Riders from the Eurasian Steppes Shaped

the Modern World (Princeton, N.J.: Princeton University

Press).

13. Kuzmina, E.E., and Mair, V.H. (2008). The Prehistory of the

Silk Road (Philadelphia: University of Pennsylvania Press).

14. Rudenko, S.I. (1970). Frozen Tombs of Siberia, the Pazyryk

Burials of Iron Age Horsemen (Berkeley: University of

California Press).

15. Davis-Kimball, J., Bashilov, V.A., and Yablonsky, L.T., eds.

(1995). Nomads of the Eurasian Steppes in the Early Iron

Age (Berkeley, CA: Zinat Press).

16. Golden, P.B. (1992). An Introduction to the History of

the Turkic Peoples: Ethnogenesis and State-Formation in

Medieval and Early Modern Eurasia and the Middle East

(Wiesbaden: Otto Harrassowitz).

17. Grousset, R. (1970). The Empire of the Steppes: A History of

Central Asia (New Brunswick, N.J.: Rutgers University Press).

18. Potapov, L.P. (1962). The origins of the Altayans. In Studies

in Siberian Ethnogenesis, H.N. Michael, ed. (Toronto:

University of Toronto Press), pp. 169–196.

19. Potapov, L.P. (1964). The Altays. In The Peoples of Siberia,

M.G. Levin and L.P. Potapov, eds. (Chicago: University of

Chicago Press), pp. 305–341.

20. Menges, K.H. (1968). The Turkic Languages and Peoples:

An Introduction to Turkic Studies (Wiesbaden: Otto Harras-

sowitz).


21. Levin, M.G. (1964). The anthropological types of Siberia. In

The Peoples of Siberia, M.G. Levin and L.P. Potapov, eds.

(Chicago: The University of Chicago Press), pp. 99–104.

22. Osipova, L.P., and Sukernik, R.I. (1978). [Polymorphism

of immunoglobulin Gm- and Km-allotypes in northern

Altaians (western Sibiria)]. Genetika 14, 1272–1275.

23. Posukh, O.L., Osipova, L.P., Kashinskaia, IuO., Ivakin, E.A.,

Kriukov, IuA., Karafet, T.M., Kazakovtseva, M.A., Skobel’tsina,

L.M., Crawford, M.G., Lefranc, M.P., and Lefranc, G. (1998).

[Genetic analysis of the South Altaian population of

the Mendur-Sokkon village, Altai Republic]. Genetika 34,

106–113.

24. Sukernik, R.I., Karafet, T.M., Abanina, T.A., Korostyshevskiĭ,

M.A., and Bashlaĭ, A.G. (1977). [Genetic structure of 2 isolated

populations of native inhabitants of Sibiria (Northern

Altaics) according to the results of a study of blood groups

and isoenzymes]. Genetika 13, 911–918.

25. Sukernik, R.I., Shur, T.G., Starikovskaia, E.B., and Uolles, D.K.

(1996). [Mitochondrial DNA variation in native inhabitants

of Siberia with reconstructions of the evolutional history of

the American Indians. Restriction polymorphism]. Genetika

32, 432–439.

26. Shields, G.F., Schmiechen, A.M., Frazier, B.L., Redd, A.,

Voevoda, M.I., Reed, J.K., and Ward, R.H. (1993). mtDNA

sequences suggest a recent evolutionary divergence for

Beringian and northern North American populations. Am.

J. Hum. Genet. 53, 549–562.

27. Torroni, A., Schurr, T.G., Yang, C.C., Szathmary, E.J.,

Williams, R.C., Schanfield, M.S., Troup, G.A., Knowler,

W.C., Lawrence, D.N., Weiss, K.M., et al. (1992). Native

American mitochondrial DNA analysis indicates that the

Amerind and the Nadene populations were founded by two

independent migrations. Genetics 130, 153–162.

28. Wallace, D.C., and Torroni, A. (1992). American Indian

prehistory as written in the mitochondrial DNA: a review.

Hum. Biol. 64, 403–416.

29. Torroni, A., Schurr, T.G., Cabell, M.F., Brown, M.D., Neel, J.V.,

Larsen, M., Smith, D.G., Vullo, C.M., and Wallace, D.C.

(1993). Asian affinities and continental radiation of the

four founding Native American mtDNAs. Am. J. Hum. Genet.

53, 563–590.

30. Torroni, A., Sukernik, R.I., Schurr, T.G., Starikorskaya, Y.B.,

Cabell, M.F., Crawford, M.H., Comuzzie, A.G., and Wallace,

D.C. (1993). mtDNA variation of aboriginal Siberians reveals

distinct genetic affinities with Native Americans. Am. J.

Hum. Genet. 53, 591–608.

31. Forster, P., Harding, R., Torroni, A., and Bandelt, H.J. (1996).

Origin and evolution of Native American mtDNA variation:

a reappraisal. Am. J. Hum. Genet. 59, 935–945.

32. Merriwether, D.A., and Ferrell, R.E. (1996). The four founding

lineage hypothesis for the New World: a critical reevaluation.

Mol. Phylogenet. Evol. 5, 241–246.

33. Bonatto, S.L., and Salzano, F.M. (1997). Diversity and age of

the four major mtDNA haplogroups, and their implications

for the peopling of the New World. Am. J. Hum. Genet. 61,

1413–1423.

34. Merriwether, D.A., Hall, W.W., Vahlne, A., and Ferrell, R.E.

(1996). mtDNA variation indicates Mongolia may have

been the source for the founding population for the New

World. Am. J. Hum. Genet. 59, 204–212.

35. Neel, J.V., Biggar, R.J., and Sukernik, R.I. (1994). Virologic and

genetic studies relate Amerind origins to the indigenous

people of the Mongolia/Manchuria/southeastern Siberia

region. Proc. Natl. Acad. Sci. USA 91, 10737–10741.

36. Karafet, T.M., Zegura, S.L., Vuturo-Brady, J., Posukh, O.,

Osipova, L., Wiebe, V., Romero, F., Long, J.C., Harihara, S.,

Jin, F., et al. (1997). Y chromosome markers and Trans-Bering

Strait dispersals. Am. J. Phys. Anthropol. 102, 301–314.

37. Lell, J.T., Brown, M.D., Schurr, T.G., Sukernik, R.I., Starikov-

skaya, Y.B., Torroni, A., Moore, L.G., Troup, G.M., and

Wallace, D.C. (1997). Y chromosome polymorphisms in

native American and Siberian populations: identification of

native American Y chromosome haplotypes. Hum. Genet.

100, 536–543.

38. Bortolini, M.C., Salzano, F.M., Thomas, M.G., Stuart, S.,

Nasanen, S.P., Bau, C.H., Hutz, M.H., Layrisse, Z., Petzl-Erler,

M.L., Tsuneto, L.T., et al. (2003). Y-chromosome evidence for

differing ancient demographic histories in the Americas. Am.

J. Hum. Genet. 73, 524–539.

39. Schurr, T.G., and Sherry, S.T. (2004). Mitochondrial DNA and

Y chromosome diversity and the peopling of the Americas:

evolutionary and demographic evidence. Am. J. Hum. Biol.

16, 420–439.

40. Derenko, M.V., Malyarchuk, B., Denisova, G.A., Wozniak, M.,

Dambueva, I., Dorzhu, C., Luzina, F., Miścicka-Śliwka, D.,

and Zakharov, I. (2006). Contrasting patterns of Y-chromo-

some variation in South Siberian populations from Baikal

and Altai-Sayan regions. Hum. Genet. 118, 591–604.

41. Derenko, M.V., Malyarchuk, B., Grzybowski, T., Denisova, G.,

Dambueva, I., Perkova, M., Dorzhu, C., Luzina, F., Lee, H.K.,

Vanecek, T., et al. (2007). Phylogeographic analysis of mito-

chondrial DNA in northern Asian populations. Am. J.

Hum. Genet. 81, 1025–1041.

42. Volodko, N.V., Starikovskaya, E.B., Mazunin, I.O., Eltsov,

N.P., Naidenko, P.V., Wallace, D.C., and Sukernik, R.I.

(2008). Mitochondrial genome diversity in arctic Siberians,

with particular reference to the evolutionary history of

Beringia and Pleistocenic peopling of the Americas. Am. J.

Hum. Genet. 82, 1084–1100.

43. Derenko, M.V., Grzybowski, T., Malyarchuk, B.A., Dambueva,

I.K., Denisova, G.A., Czarny, J., Dorzhu, C.M., Kakpakov, V.T.,

Miścicka-Śliwka, D., Woźniak, M., and Zakharov,

I.A. (2003). Diversity of mitochondrial DNA lineages in

South Siberia. Ann. Hum. Genet. 67, 391–411.

44. Starikovskaya, E.B., Sukernik, R.I., Derbeneva, O.A., Volodko,

N.V., Ruiz-Pesini, E., Torroni, A., Brown, M.D., Lott, M.T.,

Hosseini, S.H., Huoponen, K., and Wallace, D.C. (2005).

Mitochondrial DNA diversity in indigenous populations of

the southern extent of Siberia, and the origins of Native

American haplogroups. Ann. Hum. Genet. 69, 67–89.

45. Starikovskaya, Y.B., Sukernik, R.I., Schurr, T.G., Kogelnik,

A.M., and Wallace, D.C. (1998). mtDNA diversity in Chukchi

and Siberian Eskimos: implications for the genetic history of

Ancient Beringia and the peopling of the New World. Am. J.

Hum. Genet. 63, 1473–1491.

46. Schurr, T.G., and Wallace, D.C. (2003). Genetic prehistory of

Paleoasiatic-speaking populations of northeastern Siberia

and their relationships to Native Americans. In Constructing

cultures then and now: celebrating Franz Boas and the Jesup

North Pacific Expedition, L. Kendall and I. Krupnik, eds.

(Washington, D.C.: Arctic Studies Center, National Museum

of Natural History, Smithsonian Institution), pp. 239–258.

47. Schurr, T.G., Ballinger, S.W., Gan, Y.Y., Hodge, J.A., Merri-

wether, D.A., Lawrence, D.N., Knowler, W.C., Weiss, K.M.,


and Wallace, D.C. (1990). Amerindian mitochondrial DNAs

have rare Asian mutations at high frequencies, suggesting

they derived from four primary maternal lineages. Am. J.

Hum. Genet. 46, 613–623.

48. Macaulay, V., Richards, M., Hickey, E., Vega, E., Cruciani, F.,

Guida, V., Scozzari, R., Bonne-Tamir, B., Sykes, B., and

Torroni, A. (1999). The emerging tree of West Eurasian

mtDNAs: a synthesis of control-region sequences and RFLPs.

Am. J. Hum. Genet. 64, 232–249.

49. Richards, M., Macaulay, V., Hickey, E., Vega, E., Sykes, B.,

Guida, V., Rengo, C., Sellitto, D., Cruciani, F., Kivisild, T.,

et al. (2000). Tracing European founder lineages in the Near

Eastern mtDNA pool. Am. J. Hum. Genet. 67, 1251–1276.

50. Torroni, A., Bandelt, H.J., D’Urbano, L., Lahermo, P., Moral,

P., Sellitto, D., Rengo, C., Forster, P., Savontaus, M.L.,

Bonne-Tamir, B., and Scozzari, R. (1998). mtDNA analysis

reveals a major late Paleolithic population expansion from

southwestern to northeastern Europe. Am. J. Hum. Genet.

62, 1137–1152.

51. Torroni, A., Huoponen, K., Francalacci, P., Petrozzi, M.,

Morelli, L., Scozzari, R., Obinu, D., Savontaus, M.L., and

Wallace, D.C. (1996). Classification of European mtDNAs

from an analysis of three European populations. Genetics

144, 1835–1850.

52. Torroni, A., Lott, M.T., Cabell, M.F., Chen, Y.S., Lavergne, L.,

and Wallace, D.C. (1994). mtDNA and the origin of Cauca-

sians: identification of ancient Caucasian-specific haplo-

groups, one of which is prone to a recurrent somatic duplica-

tion in the D-loop region. Am. J. Hum. Genet. 55, 760–776.

53. Kivisild, T., Tolk, H.V., Parik, J., Wang, Y., Papiha, S.S.,

Bandelt, H.J., and Villems, R. (2002). The emerging limbs

and twigs of the East Asian mtDNA tree. Mol. Biol. Evol.

19, 1737–1751.

54. Schurr, T.G., Sukernik, R.I., Starikovskaya, Y.B., and Wallace,

D.C. (1999). Mitochondrial DNA variation in Koryaks and

Itel’men: population replacement in the Okhotsk Sea-Bering

Sea region during the Neolithic. Am. J. Phys. Anthropol. 108,

1–39.

55. Tanaka, M., Cabrera, V.M., Gonzalez, A.M., Larruga, J.M.,

Takeyasu, T., Fuku, N., Guo, L.J., Hirose, R., Fujita, Y., Kurata,

M., et al. (2004). Mitochondrial genome variation in eastern

Asia and the peopling of Japan. Genome Res. 14 (10A), 1832–

1850.

56. Yao, Y.G., Kong, Q.P., Bandelt, H.J., Kivisild, T., and Zhang,

Y.P. (2002). Phylogeographic differentiation of mitochon-

drial DNA in Han Chinese. Am. J. Hum. Genet. 70, 635–651.

57. Gokcumen, O., Dulik, M.C., Pai, A.A., Zhadanov, S.I., Rubin-

stein, S., Osipova, L.P., Andreenkov, O.V., Tabikhanova, L.E.,

Gubina, M.A., Labuda, D., and Schurr, T.G. (2008). Genetic

variation in the enigmatic Altaian Kazakhs of South-Central

Russia: insights into Turkic population history. Am. J. Phys.

Anthropol. 136, 278–293.

58. Rubinstein, S., Dulik, M.C., Gokcumen, O., Zhadanov, S.,

Osipova, L., Cocca, M., Mehta, N., Gubina, M., Posukh, O.,

and Schurr, T.G. (2008). Russian Old Believers: genetic conse-

quences of their persecution and exile, as shown by mito-

chondrial DNA evidence. Hum. Biol. 80, 203–237.

59. van Oven, M., and Kayser, M. (2009). Updated comprehen-

sive phylogenetic tree of global human mitochondrial DNA

variation. Hum. Mutat. 30, E386–E394.

60. Anderson, S., Bankier, A.T., Barrell, B.G., de Bruijn, M.H.,

Coulson, A.R., Drouin, J., Eperon, I.C., Nierlich, D.P., Roe,

B.A., Sanger, F., et al. (1981). Sequence and organization of

the human mitochondrial genome. Nature 290, 457–465.

61. Andrews, R.M., Kubacka, I., Chinnery, P.F., Lightowlers, R.N.,

Turnbull, D.M., and Howell, N. (1999). Reanalysis and revi-

sion of the Cambridge reference sequence for human mito-

chondrial DNA. Nat. Genet. 23, 147.

62. Y Chromosome Consortium. (2002). A nomenclature system

for the tree of human Y-chromosomal binary haplogroups.

Genome Res. 12, 339–348.

63. Karafet, T.M., Mendez, F.L., Meilerman, M.B., Underhill, P.A.,

Zegura, S.L., and Hammer, M.F. (2008). New binary polymor-

phisms reshape and increase resolution of the human Y chro-

mosomal haplogroup tree. Genome Res. 18, 830–838.

64. Dulik, M.C., Osipova, L.P., and Schurr, T.G. (2011). Y-chro-

mosome variation in Altaian Kazakhs reveals a common

paternal gene pool for Kazakhs and the influence of Mongo-

lian expansions. PLoS ONE 6, e17548.

65. Cox, M.P. (2006). Minimal hierarchical analysis of global

human Y-chromosome SNP diversity by PCR-RFLP. Anthro-

pol. Sci. 114, 69–74.

66. Derbeneva, O.A., Starikovskaia, E.B., Volod’ko, N.V., Wallace,

D.C., and Sukernik, R.I. (2002). [Mitochondrial DNA varia-

tion in Kets and Nganasans and the early peoples of

Northern Eurasia]. Genetika 38, 1554–1560.

67. Derbeneva, O.A., Starikovskaya, E.B., Wallace, D.C., and

Sukernik, R.I. (2002). Traces of early Eurasians in the

Mansi of northwest Siberia revealed by mitochondrial DNA

analysis. Am. J. Hum. Genet. 70, 1009–1014.

68. Pimenoff, V.N., Comas, D., Palo, J.U., Vershubsky, G., Kozlov,

A., and Sajantila, A. (2008). Northwest Siberian Khanty and

Mansi in the junction of West and East Eurasian gene pools

as revealed by uniparental markers. Eur. J. Hum. Genet. 16,

1254–1264.

69. Comas, D., Calafell, F., Mateu, E., Perez-Lezaun, A., Bosch, E.,

Martínez-Arias, R., Clarimon, J., Facchini, F., Fiori, G.,

Luiselli, D., et al. (1998). Trading genes along the silk road:

mtDNA sequences and the origin of central Asian popula-

tions. Am. J. Hum. Genet. 63, 1824–1838.

70. Yao, Y.G., Kong, Q.P., Wang, C.Y., Zhu, C.L., and Zhang, Y.P.

(2004). Different matrilineal contributions to genetic struc-

ture of ethnic groups in the silk road region in china. Mol.

Biol. Evol. 21, 2265–2280.

71. Kolman, C.J., Sambuughin, N., and Bermingham, E. (1996).

Mitochondrial DNA analysis of Mongolian populations and

implications for the origin of New World founders. Genetics

142, 1321–1334.

72. Xue, Y., Zerjal, T., Bao, W., Zhu, S., Shu, Q., Xu, J., Du, R.,

Fu, S., Li, P., Hurles, M.E., et al. (2006). Male demography

in East Asia: a north-south contrast in human population

expansion times. Genetics 172, 2431–2439.

73. Khar’kov, V.N., Medvedeva, O.F., Luzina, F.A., Kolbasko, A.V.,

Gafarov, N.I., Puzyrev, V.P., and Stepanov, V.A. (2009).

[Comparative characteristics of the gene pool of Teleuts

inferred from Y-chromosomal marker data]. Genetika 45,

1132–1142.

74. Khar’kov, V., Khamina, K., Medvedeva, O., Shtygasheva, O.,

and Stepanov, V. (2011). Genetic diversity of the Khakass

gene pool: Subethnic differentiation and the structure of

Y-chromosome haplogroups. Mol. Biol. (Mosk.) 45, 446–458.

75. Roewer, L., Kruger, C., Willuweit, S., Nagy, M., Rodig, H.,

Kokshunova, L., Rothamel, T., Kravchenko, S., Jobling, M.A.,

Stoneking, M., and Nasidze, I. (2007). Y-chromosomal STR


haplotypes in Kalmyk population samples. Forensic Sci. Int.

173, 204–209.

76. Geppert, M., Baeta, M., Nunez, C., Martínez-Jarreta, B., Zweynert,

S., Cruz, O.W., Gonzalez-Andrade, F., Gonzalez-Solorzano,

J., Nagy, M., and Roewer, L. (2011). Hierarchical

Y-SNP assay to study the hidden diversity and phylogenetic

relationship of native populations in South America.

Forensic Sci. Int. Genet. 5, 100–104.

77. Excoffier, L., Laval, G., and Schneider, S. (2005). Arlequin

(version 3.0): an integrated software package for population

genetics data analysis. Evol. Bioinform. Online 1, 47–50.

78. Tamura, K., and Nei, M. (1993). Estimation of the number of

nucleotide substitutions in the control region of mitochon-

drial DNA in humans and chimpanzees. Mol. Biol. Evol.

10, 512–526.

79. SPSS Inc. (2001). SPSS for Windows Release 11.0.0 (Chicago,

IL: SPSS Inc.).

80. Polzin, T., and Daneschmand, S.V. (2003). On Steiner trees

and minimum spanning trees in hypergraphs. Oper. Res.

Lett. 31, 12–20.

81. Bandelt, H.J., Forster, P., and Rohl, A. (1999). Median-joining

networks for inferring intraspecific phylogenies. Mol. Biol.

Evol. 16, 37–48.

82. Bandelt, H.J., Forster, P., Sykes, B.C., and Richards, M.B.

(1995). Mitochondrial portraits of human populations using

median networks. Genetics 141, 743–753.

83. Gusmao, L., Butler, J.M., Carracedo, A., Gill, P., Kayser, M.,

Mayr, W.R., Morling, N., Prinz, M., Roewer, L., Tyler-Smith,

C., and Schneider, P.M.; DNA Commission of the Interna-

tional Society of Forensic Genetics. (2006). DNA Commis-

sion of the International Society of Forensic Genetics

(ISFG): an update of the recommendations on the use of

Y-STRs in forensic analysis. Forensic Sci. Int. 157, 187–197.

84. Soares, P., Ermini, L., Thomson, N., Mormina, M., Rito, T.,

Rohl, A., Salas, A., Oppenheimer, S., Macaulay, V., and

Richards, M.B. (2009). Correcting for purifying selection:

an improved human mitochondrial molecular clock. Am. J.

Hum. Genet. 84, 740–759.

85. Sengupta, S., Zhivotovsky, L.A., King, R., Mehdi, S.Q.,

Edmonds, C.A., Chow, C.E., Lin, A.A., Mitra, M., Sil, S.K.,

Ramesh, A., et al. (2006). Polarity and temporality of high-

resolution y-chromosome distributions in India identify

both indigenous and exogenous expansions and reveal

minor genetic influence of Central Asian pastoralists. Am.

J. Hum. Genet. 78, 202–221.

86. Wilson, I., Balding, D., and Weale, M. (2003). Inferences from

DNA data: population histories, evolutionary processes and

forensic match probabilities. J. R. Stat. Soc. [Ser A] 166,

155–188.

87. Xue, Y., Zerjal, T., Bao, W., Zhu, S., Shu, Q., Xu, J., Du, R., Fu,

S., Li, P., Hurles, M.E., et al. (2008). Modelling male prehis-

tory in east Asia using BATWING. In Simulations, Genetics

and Human Prehistory, S. Matsumura, P. Forster, and C. Ren-

frew, eds. (Cambridge: McDonald Institute for Archaeolog-

ical Research), pp. 79–88.

88. Zhivotovsky, L.A., Underhill, P.A., Cinnioğlu, C., Kayser, M.,

Morar, B., Kivisild, T., Scozzari, R., Cruciani, F., Destro-Bisol,

G., Spedini, G., et al. (2004). The effective mutation rate at

Y chromosome short tandem repeats, with application to

human population-divergence time. Am. J. Hum. Genet.

74, 50–61.

89. Dupuy, B.M., Stenersen, M., Egeland, T., and Olaisen, B.

(2004). Y-chromosomal microsatellite mutation rates: differences

in mutation rate between and within loci. Hum. Mutat.

23, 117–124.

90. Fenner, J.N. (2005). Cross-cultural estimation of the human

generation interval for use in genetics-based population

divergence studies. Am. J. Phys. Anthropol. 128, 415–423.

91. Derenko, M., Malyarchuk, B., Grzybowski, T., Denisova, G.,

Rogalla, U., Perkova, M., Dambueva, I., and Zakharov, I.

(2010). Origin and post-glacial dispersal of mitochondrial

DNA haplogroups C and D in northern Asia. PLoS ONE 5,

e15214.

92. Zhadanov, S.I., Dulik, M.C., Markley, M., Jennings, G.W.,

Gaieski, J.B., Elias, G., and Schurr, T.G.; Genographic Project

Consortium. (2010). Genetic heritage and native identity of

the Seaconke Wampanoag tribe of Massachusetts. Am. J.

Phys. Anthropol. 142, 579–589.

93. Hammer, M.F., Karafet, T.M., Redd, A.J., Jarjanazi, H., Santa-

chiara-Benerecetti, S., Soodyall, H., and Zegura, S.L. (2001).

Hierarchical patterns of global human Y-chromosome diver-

sity. Mol. Biol. Evol. 18, 1189–1203.

94. Kivisild, T., Rootsi, S., Metspalu, M., Mastana, S., Kaldma, K.,

Parik, J., Metspalu, E., Adojaan, M., Tolk, H.V., Stepanov, V.,

et al. (2003). The genetic heritage of the earliest settlers

persists both in Indian tribal and caste populations. Am. J.

Hum. Genet. 72, 313–332.

95. Wells, R.S., Yuldasheva, N., Ruzibakiev, R., Underhill, P.A.,

Evseeva, I., Blue-Smith, J., Jin, L., Su, B., Pitchappan, R.,

Shanmugalakshmi, S., et al. (2001). The Eurasian heartland:

a continental perspective on Y-chromosome diversity. Proc.

Natl. Acad. Sci. USA 98, 10244–10249.

96. Rosser, Z.H., Zerjal, T., Hurles, M.E., Adojaan, M., Alavantic,

D., Amorim, A., Amos, W., Armenteros, M., Arroyo, E., Barbujani,

G., et al. (2000). Y-chromosomal diversity in Europe is

clinal and influenced primarily by geography, rather than

by language. Am. J. Hum. Genet. 67, 1526–1543.

97. Quintana-Murci, L., Krausz, C., Zerjal, T., Sayar, S.H.,

Hammer, M.F., Mehdi, S.Q., Ayub, Q., Qamar, R., Mohyud-

din, A., Radhakrishna, U., et al. (2001). Y-chromosome line-

ages trace diffusion of people and languages in southwestern

Asia. Am. J. Hum. Genet. 68, 537–542.

98. Underhill, P.A., Passarino, G., Lin, A.A., Shen, P., Mirazon

Lahr, M., Foley, R.A., Oefner, P.J., and Cavalli-Sforza, L.L.

(2001). The phylogeography of Y chromosome binary haplo-

types and the origins of modern human populations. Ann.

Hum. Genet. 65, 43–62.

99. Zhong, H., Shi, H., Qi, X.-B., Duan, Z.-Y., Tan, P.-P., Jin, L., Su,

B., and Ma, R.Z. (2011). Extended Y chromosome investiga-

tion suggests postglacial migrations of modern humans

into East Asia via the northern route. Mol. Biol. Evol. 28,

717–727.

100. Mirabal, S., Regueiro, M., Cadenas, A.M., Cavalli-Sforza, L.L.,

Underhill, P.A., Verbenko, D.A., Limborska, S.A., and Her-

rera, R.J. (2009). Y-chromosome distribution within the

geo-linguistic landscape of northwestern Russia. Eur. J.

Hum. Genet. 17, 1260–1273.

101. Myres, N.M., Rootsi, S., Lin, A.A., Jarve, M., King, R.J.,

Kutuev, I., Cabrera, V.M., Khusnutdinova, E.K., Pshenichnov,

A., Yunusbayev, B., et al. (2011). A major Y-chromosome

haplogroup R1b Holocene era founder effect in Central and

Western Europe. Eur. J. Hum. Genet. 19, 95–101.


102. Malyarchuk, B., Derenko, M., Denisova, G., Maksimov, A.,

Wozniak, M., Grzybowski, T., Dambueva, I., and Zakharov,

I. (2011). Ancient links between Siberians and Native Amer-

icans revealed by subtyping the Y chromosome haplogroup

Q1a. J. Hum. Genet. 56, 583–588.

103. Rogers, A.R., and Harpending, H. (1992). Population growth

makes waves in the distribution of pairwise genetic differ-

ences. Mol. Biol. Evol. 9, 552–569.

104. Shi, H., Dong, Y.L., Wen, B., Xiao, C.J., Underhill, P.A., Shen,

P.D., Chakraborty, R., Jin, L., and Su, B. (2005). Y-chromo-

some evidence of southern origin of the East Asian-specific

haplogroup O3-M122. Am. J. Hum. Genet. 77, 408–419.

105. Forsyth, J. (1992). A History of the Peoples of Siberia: Russia’s

North Asian Colony, 1581–1990 (Cambridge, England:

Cambridge University Press).

106. Brown, M.D., Hosseini, S.H., Torroni, A., Bandelt, H.J., Allen,

J.C., Schurr, T.G., Scozzari, R., Cruciani, F., and Wallace, D.C.

(1998). mtDNA haplogroup X: An ancient link between

Europe/Western Asia and North America? Am. J. Hum.

Genet. 63, 1852–1861.

107. Tamm, E., Kivisild, T., Reidla, M., Metspalu, M., Smith, D.G.,

Mulligan, C.J., Bravi, C.M., Rickards, O., Martinez-Labarga,

C., Khusnutdinova, E.K., et al. (2007). Beringian standstill

and spread of Native American founders. PLoS ONE 2, e829.

108. Achilli, A., Perego, U.A., Bravi, C.M., Coble, M.D., Kong, Q.P.,

Woodward, S.R., Salas, A., Torroni, A., and Bandelt, H.J.

(2008). The phylogeny of the four pan-American MtDNA

haplogroups: implications for evolutionary and disease

studies. PLoS ONE 3, e1764.

109. Perego, U.A., Achilli, A., Angerhofer, N., Accetturo, M., Pala,

M., Olivieri, A., Kashani, B.H., Ritchie, K.H., Scozzari, R.,

Kong, Q.P., et al. (2009). Distinctive Paleo-Indian migration

routes from Beringia marked by two rare mtDNA haplo-

groups. Curr. Biol. 19, 1–8.

110. Perego, U.A., Angerhofer, N., Pala, M., Olivieri, A., Lancioni,

H., Kashani, B.H., Carossa, V., Ekins, J.E., Gomez-Carballa, A.,

Huber, G., et al. (2010). The initial peopling of the Americas:

a growing number of foundingmitochondrial genomes from

Beringia. Genome Res. 20, 1174–1179.

111. Helgason, A., Palsson, G., Pedersen, H.S., Angulalik, E., Gun-

narsdottir, E.D., Yngvadottir, B., and Stefansson, K. (2006).

mtDNA variation in Inuit populations of Greenland and

Canada: migration history and population structure. Am. J.

Phys. Anthropol. 130, 123–134.

112. Bortolini, M.C., Salzano, F.M., Bau, C.H., Layrisse, Z., Petzl-

Erler, M.L., Tsuneto, L.T., Hill, K., Hurtado, A.M., Castro-

De-Guerra, D., Bedoya, G., and Ruiz-Linares, A. (2002).

Y-chromosome biallelic polymorphisms and Native Amer-

ican population structure. Ann. Hum. Genet. 66, 255–259.

113. Underhill, P.A., Shen, P., Lin, A.A., Jin, L., Passarino, G., Yang,

W.H., Kauffman, E., Bonne-Tamir, B., Bertranpetit, J., Franca-

lacci, P., et al. (2000). Y chromosome sequence variation

and the history of human populations. Nat. Genet. 26,

358–361.

114. Seielstad, M., Yuldasheva, N., Singh, N., Underhill, P., Oef-

ner, P., Shen, P., and Wells, R.S. (2003). A novel Y-chromo-

some variant puts an upper limit on the timing of first entry

into the Americas. Am. J. Hum. Genet. 73, 700–705.

115. Schurr, T.G., Osipova, L.P., Zhadanov, S.I., and Dulik, M.C.

(2010). Genetic diversity in Native Siberians: Implications

for the prehistoric settlement of te Cis-Baikal region. In

Prehistoric Hunter-Gatherers of the Baikal Region, Siberia,

A.W.Weber, M.A. Katzenberg, and T.G. Schurr, eds. (Philadel-

phia: University of Pennsylvania Press), pp. 121–134.

116. Dulik, M.C. (2011). A molecular anthropological study

of Altaian histories utilizing population genetics and

phylogeography. PhD thesis, University of Pennsylvania,

Philadelphia, PA.

117. Fiedel, S.J. (2000). The peopling of the New World: present

evidence, new theories, and future directions. J. Archaeol.

Res. 8, 39–103.

118. Martınez-Cruz, B., Vitalis, R., Segurel, L., Austerlitz, F.,

Georges, M., Thery, S., Quintana-Murci, L., Hegay, T., Alda-

shev, A., Nasyrova, F., and Heyer, E. (2011). In the heartland

of Eurasia: the multilocus genetic landscape of Central Asian

populations. Eur. J. Hum. Genet. 19, 216–223.

119. Karafet, T.M., Osipova, L.P., Gubina, M.A., Posukh, O.L.,

Zegura, S.L., and Hammer, M.F. (2002). High levels of Y-chro-

mosome differentiation among native Siberian populations

and the genetic signature of a boreal hunter-gatherer way

of life. Hum. Biol. 74, 761–789.

120. Balaresque, P., Parkin, E.J., Roewer, L., Carvalho-Silva, D.R.,

Mitchell, R.J., van Oorschot, R.A., Henke, J., Stoneking, M.,

Nasidze, I., Wetton, J., et al. (2009). Genomic complexity

of the Y-STR DYS19: inversions, deletions and founder line-

ages carrying duplications. Int. J. Legal Med. 123, 15–23.

121. Underhill, P.A., Myres, N.M., Rootsi, S., Metspalu, M., Zhivo-

tovsky, L.A., King, R.J., Lin, A.A., Chow, C.E., Semino, O.,

Battaglia, V., et al. (2010). Separating the post-Glacial coan-

cestry of European and Asian Y chromosomes within haplo-

group R1a. Eur. J. Hum. Genet. 18, 479–484.

122. Soucek, S. (2000). A History of Inner Asia (Cambridge, New

York: Cambridge University Press).

123. Reich, D., Patterson, N., Kircher, M., Delfin, F., Nandineni,

M.R., Pugach, I., Ko, A.M., Ko, Y.C., Jinam, T.A., Phipps,

M.E., et al. (2011). Denisova admixture and the first modern

human dispersals into Southeast Asia and Oceania. Am. J.

Hum. Genet. 89, 516–528.

124. Hrdli�cka, A. (1942). Crania of Siberia. Am. J. Phys. Anthro-

pol. 29, 435–481.

125. Gonzalez-Jose, R., Bortolini, M.C., Santos, F.R., and Bonatto,

S.L. (2008). The peopling of America: craniofacial shape vari-

ation on a continental scale and its interpretation from an

interdisciplinary view. Am. J. Phys. Anthropol. 137, 175–187.

126. Kozintsev, A.G., Gromov, A.V., and Moiseyev, V.G. (1999).

Collateral relatives of American Indians among the Bronze

Age populations of Siberia? Am. J. Phys. Anthropol. 108,

193–204.

127. Crawford, M.H. (1998). The Origins of Native Americans:

Evidence from Anthropological Genetics (Cambridge: Cam-

bridge University Press).

128. Brace, C.L., Nelson, A.R., Seguchi, N., Oe, H., Sering, L.,

Qifeng, P., Yongyi, L., and Tumen, D. (2001). OldWorld sour-

ces of the first NewWorld human inhabitants: a comparative

craniofacial view. Proc. Natl. Acad. Sci. USA 98, 10017–

10022.

129. Wang, S., Lewis, C.M., Jakobsson, M., Ramachandran, S.,

Ray, N., Bedoya, G., Rojas, W., Parra, M.V., Molina, J.A.,

Gallo, C., et al. (2007). Genetic variation and population

structure in native Americans. PLoS Genet. 3, e185.

130. Horai, S., Kondo, R., Nakagawa-Hattori, Y., Hayashi, S.,

Sonoda, S., and Tajima, K. (1993). Peopling of the Americas,

founded by four major lineages of mitochondrial DNA. Mol.

Biol. Evol. 10, 23–47.


131. Kaessmann, H., Zollner, S., Gustafsson, A.C.,Wiebe, V., Laan,

M., Lundeberg, J., Uhlen, M., and Paabo, S. (2002). Extensive

linkage disequilibrium in small human populations in

Eurasia. Am. J. Hum. Genet. 70, 673–685.

132. Bailliet, G., Ramallo, V., Muzzio, M., Garcıa, A., Santos, M.R.,

Alfaro, E.L., Dipierri, J.E., Salceda, S., Carnese, F.R., Bravi,

C.M., et al. (2009). Brief communication: Restricted geo-

graphic distribution for Y-Q* paragroup in South America.

Am. J. Phys. Anthropol. 140, 578–582.

133. Ho, S.Y., and Endicott, P. (2008). The crucial role of calibra-

tion in molecular date estimates for the peopling of the

Americas. Am. J. Hum. Genet. 83, 142–146, author reply

146–147.

134. Zhivotovsky, L.A., and Underhill, P.A. (2005). On the evolu-

tionary mutation rate at Y-chromosome STRs: comments

on paper by Di Giacomo et al. (2004). Hum. Genet. 116,

529–532.

135. Di Giacomo, F., Luca, F., Popa, L.O., Akar, N., Anagnou, N.,

Banyko, J., Brdicka, R., Barbujani, G., Papola, F., Ciavarella,

G., et al. (2004). Y chromosomal haplogroup J as a signature

of the post-neolithic colonization of Europe. Hum. Genet.

115, 357–371.

136. Zerjal, T., Xue, Y., Bertorelle, G., Wells, R.S., Bao, W., Zhu, S.,

Qamar, R., Ayub, Q., Mohyuddin, A., Fu, S., et al. (2003).

The genetic legacy of the Mongols. Am. J. Hum. Genet. 72,

717–721.

137. Zhivotovsky, L.A., Underhill, P.A., and Feldman, M.W.

(2006). Difference between evolutionarily effective and

germ line mutation rate due to stochastically varying haplo-

group size. Mol. Biol. Evol. 23, 2268–2270.

138. Ho, S.Y., Phillips, M.J., Cooper, A., and Drummond, A.J.

(2005). Time dependency of molecular rate estimates and

systematic overestimation of recent divergence times. Mol.

Biol. Evol. 22, 1561–1568.


ARTICLE

A ‘‘Copernican’’ Reassessment of the Human Mitochondrial DNA Tree from its Root

Doron M. Behar,1,2,* Mannis van Oven,3,* Saharon Rosset,4 Mait Metspalu,1 Eva-Liis Loogvali,1

Nuno M. Silva,5 Toomas Kivisild,1,6 Antonio Torroni,7 and Richard Villems1,8

Mutational events along the human mtDNA phylogeny are traditionally identified relative to the revised Cambridge Reference

Sequence, a contemporary European sequence published in 1981. This historical choice is a continuous source of inconsistencies,

misinterpretations, and errors in medical, forensic, and population genetic studies. Here, after having refined the human mtDNA

phylogeny to an unprecedented level by adding information from 8,216 modern mitogenomes, we propose switching the reference

to a Reconstructed Sapiens Reference Sequence, which was identified by considering all available mitogenomes from Homo neandertha-

lensis. This ‘‘Copernican’’ reassessment of the human mtDNA tree from its deepest root should resolve previous problems and will

have a substantial practical and educational influence on the scientific and public perception of human evolution by clarifying the

core principles of common ancestry for extant descendants.

Introduction

Nested hierarchy of species, resulting from the descent

with modification process,1 is fundamental to our under-

standing of the evolution of biological diversity and

life in general. In molecular genealogy, the sequential

accumulation of mutations since the time of the most

recent common ancestor (MRCA) is reflected within the

ever-evolving phylogeny of any genetic locus. Accordingly,

the reconstructed ancestral sequence of a locus should

optimally serve as the reference point for its derived

alleles.2 The human mtDNA phylogeny3–7 is an almost

perfect molecular prototype for a nonrecombining locus,

and knowledge on its variation has been and is extensively

used in medical, genealogical, forensic, and popula-

tion genetic studies.8–11 Boosted by rapid advances in

sequencing and genotyping technology, its mode of inher-

itance, high mutation rate, lack of recombination, and

high cellular copy number have proved critical in making

this locus the primary choice in the field of archaeoge-

netics and ancient DNA.12–14 Although its early synthesis

was based on restriction-fragment-length polymor-

phisms,15–18 control-region variation,19,20 or a combina-

tion of both,21 the human mtDNA phylogeny is now

reconstructed from complete mtDNA sequences,4,6,7,22

thus stretching the phylogenetic resolution to its maxi-

mum. mtDNA also became the main target of ancient-

DNA studies because it is much more abundant than

nuclear DNA.13 The recently published Homo neandertha-

lensis mitogenomes23,24 represent the best available out-

group source for rooting the human mtDNA phylogeny

known to lie inside the contemporary African variation.22,25,26 Despite these major advances, the extinct

human mtDNA complete root sequence was never

precisely determined, and mtDNA nomenclature remains

cumbersome because it refers to the first completely

sequenced mtDNA,27,28 labeled rCRS, which is now

known to belong to the recently coalescing European

haplogroup H2a2a1.7 The use of the rCRS as a reference

resulted in a number of practical problems such as (1)

the misidentification of derived versus ancestral states

of alleles and (2) the count of nonsynonymous muta-

tions that map to the path between the rCRS and

the case sequences.29 For instance, clinical and func-

tional studies frequently include among the putative

nonsynonymous candidate mutations the haplogroup-

HV-defining transition at position 14766 (CYTB) simply

because the revised Cambridge Reference Sequence

(rCRS) belongs to its derived haplogroup H.30

In this study, to definitively address these issues,

we propose a ‘‘Copernican’’ reassessment of the human

mtDNA phylogeny by switching to a Reconstructed

Sapiens Reference Sequence (RSRS) as the phylogenetically

valid reference point. To this end, the previously suggested

root7,22,25 was updated to most parsimoniously incorporate

the available mitogenomes from H. neanderthalensis.23,24

Moreover, we further refined the human mtDNA

phylogeny to an unprecedented level by adding informa-

tion from 8,216 mitogenomes and evaluated the ranges

of nucleotide substitutions from the root RSRS rather

than the rCRS28 as a reference point (Figure 1 and Figure S1,

available online).

1Estonian Biocentre and Department of Evolutionary Biology, University of Tartu, Tartu 51010, Estonia; 2Molecular Medicine Laboratory, Rambam Health

Care Campus, Haifa 31096, Israel; 3Department of Forensic Molecular Biology, Erasmus MC, University Medical Center Rotterdam, 3000 CA Rotterdam,

The Netherlands; 4Department of Statistics and Operations Research, School of Mathematical Sciences, Tel Aviv University, Tel Aviv 69978, Israel; 5Instituto

de Patologia e Imunologia Molecular da Universidade do Porto, Porto 4200-465, Portugal; 6Department of Biological Anthropology, University of

Cambridge, Cambridge CB2 1QH, UK; 7Dipartimento di Biologia e Biotecnologie ‘‘L. Spallanzani,’’ Universita di Pavia, Pavia 27100, Italy; 8Estonian

Academy of Sciences, 6 Kohtu Street, Tallinn 10130, Estonia

*Correspondence: [email protected] (D.M.B.), [email protected] (M.v.O.)


The American Journal of Human Genetics 90, 675–684, April 6, 2012 675

Figure 1. Schematic Representation of the Human mtDNA Phylogeny within Hominini

(Left) Hominini phylogeny illustrating approximate divergence times of the studied species. The positions of the RSRS and the putative Reconstructed Neanderthal Reference Sequence (RNRS) are shown.

(Right) Magnification of the human mtDNA phylogeny. Mutated nucleotide positions separating the nodes of the two basal human haplogroups L0 and L1’2’3’4’5’6 and their derived states as compared to the RSRS are shown. The positions of the rCRS and the RSRS are indicated by golden and green five-pointed stars, respectively. Accordingly, the number of mutations counted from the rCRS (NC_012920) or the RSRS (Sequence S1) to the L0d1c1b (EU092832) and H4a1a (HQ860291) haplotypes retrieved from a San and a German, respectively, are marked on the golden and green branches. The principle of equidistant star-like radiation from the common ancestor of all contemporary haplotypes is highlighted when the RSRS is preferred over the rCRS as the reference sequence.


Subjects and Methods

Updating the Human mtDNA Phylogeny and Inference of the Ancestral Root Haplotype

MtDNA Genomes Comprising the Phylogeny

A total of 18,843 complete mtDNA sequences were used to refine

the human mtDNA phylogeny of which 10,627 were previously

reported and used for the mtDNA tree Build 13 (28 Dec 2011)

as posted by PhyloTree.7 The remaining 8,216 sequences are

mainly from the large complete mtDNA database available at

FamilyTreeDNA and in part from data sets maintained by the

authors. The large database available at FamilyTreeDNA was

privately obtained by the sample donors, usually for genealogical

purposes. Most donors were of western Eurasian ancestry, but

donors with matrilineal ancestry from other geographical regions

have also contributed. Once the mtDNA sequences were obtained,

donors had several options: keep them confidential, share them

with peer genealogists, submit them to the National Center for

Biotechnology Information (NCBI) GenBank, and/or consent to

contribute them anonymously to a research database maintained

by FamilyTreeDNA to improve the mtDNA phylogeny. In turn,

this contribution rewards and enriches the genealogical experi-

ence as well as benefits the scientific community. All the proce-

dures followed in this study were in accordance with the ethical

standards of the responsible committee on human experimenta-

tion of the participating research centers.

Likewise, it is important to clarify that because the complete

sequences were obtained privately, some donors have indepen-

dently uploaded their sequence to NCBI. Currently (as of February

28, 2012), a total of 1,220 complete mtDNA sequences that were

generated at FamilyTreeDNA were privately deposited in NCBI

GenBank. Most of these sequences were already considered in

the previous PhyloTree Builds.7 Because we have no way to

know which of the sequences were autonomously uploaded to

NCBI, all duplicate sequences that matched precisely between

NCBI and our database were excluded from our analysis. There-

fore, even if multiple samples were excluded, no topological infor-

mation was lost. Accordingly, out of the 8,216 sequences used

to verify the phylogeny, a total of 4,265 sequences are released

and deposited in NCBI GenBank under accession numbers

JQ701803–JQ706067. The complete mtDNA sequences of the

Neanderthals were retrieved from the literature.23,24
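The exact-match exclusion described above can be sketched as follows. This is a minimal illustration only, assuming each data set is a mapping from a sample identifier to a sequence string; the function name, the identifiers, and the toy sequences are hypothetical and are not taken from the study.

```python
def exclude_exact_duplicates(private_db, genbank):
    """Return the private sequences that have no exact match in GenBank.

    Both arguments map a sample identifier to a sequence string.
    Matching ignores letter case and surrounding whitespace.
    """
    def normalize(seq):
        return seq.strip().upper()

    known = {normalize(seq) for seq in genbank.values()}
    return {
        sample_id: seq
        for sample_id, seq in private_db.items()
        if normalize(seq) not in known
    }


# Toy data: identifiers and sequences are made up for illustration.
genbank = {"GB1": "ACGTACGT", "GB2": "ACGTACGA"}
private_db = {"S1": "acgtacgt", "S2": "ACGTTCGT", "S3": "ACGTACGA"}

retained = exclude_exact_duplicates(private_db, genbank)
print(sorted(retained))
```

Because every excluded sequence is identical to one that is already represented, removing it cannot remove a branch from the tree, which is why no topological information is lost.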

Complete mtDNA Sequencing

DNA was extracted from buccal swabs. MtDNA was amplified with

18 primers to yield nine overlapping fragments as previously

reported.22 PCR products were cleaned with magnetic-particle

technology (BioSprint 96; QIAGEN). After purification, the nine

fragments were sequenced by means of 92 internal primers to

obtain the complete mtDNA genome. Sequencing was performed

on a 3730xl DNA Analyzer (Applied Biosystems), and the resulting

sequences were analyzed with the Sequencher software (Gene

Codes Corporation). Mutations were scored relative to the rCRS

and the suggested RSRS. Sample quality control was assured as

follows:

(1) After the PCR amplification of the nine fragments, DNA

handling and distribution to the 96 sequencing reactions

was aided by the Beckman Coulter Biomek FX liquid

handler to minimize the chance for human pipetting

errors.

(2) All 96 sequencing reactions of each sample were performed

simultaneously in the same sequencing run. Most observed

mutations were determined by at least two sequence reads.

However, in a minority of the cases only one sequence read

was available because of various technical reasons, usually

related to the amount and quality of the DNA available.

(3) Any fragment that failed the first sequencing attempt or

any ambiguous base call was tested by additional and

independent PCR and sequencing reactions. In these cases,

the first hypervariable segment (HVS-I) of the control

region was resequenced too to assure that the correct

sample was retrieved.

(4) Genotyping history for each sample was recorded to help

in the search for DNA handling errors and artificial recom-

bination events.

(5) All sequences were aligned with the software Sequencher

(Gene Codes Corporation), and all positions with a Phred

score less than 30 were manually evaluated by an operator.

Two independent operators read each sequence. All posi-

tions that differed from the reference sequences were

recorded electronically to minimize typographic errors.

(6) Any sequence that did not comfortably fit within the estab-

lished human mtDNA phylogeny was highlighted and

resequenced to exclude potential lab errors.

(7) Any comments and remarks raised by external investiga-

tors after release of the data will be addressed by reassessing

the original sequences for accuracy. After that, any unre-

solved result will be further examined by resequencing

and, if necessary, immediately corrected.

Tree Reconstruction and Notation of Mutations

The phylogeny was reconstructed by evaluating both the previously published and the herein released complete

mtDNA sequences aiming at the most parsimonious solution

and aided by the software mtPhyl. Polymorphic positions are

shown on the branches and reticulations were resolved by consid-

ering the degree of mutability of individual positions as counted

by their number of occurrences in the overall phylogeny. Both

the ancestral and derived base status for each mutation appearing

in the phylogeny according to the International Union of Pure and Applied Chemistry (IUPAC) nucleotide code are reported.

We use capital letters for transitions (e.g., G73A) and lowercase

letters for transversions (e.g., A73t). Although heteroplasmies are

not noted in the phylogeny, we recommend labeling them by

using IUPAC code and capital letters (e.g., G73R). Throughout

the phylogeny, indels are given with respect to the RSRS and maintain the traditional nucleotide position numbering as in the rCRS. Sequencing alignment prefers 3′ placement for indels, except in

cases where the phylogeny suggests otherwise.31 Deletions are

indicated by a ‘‘d’’ after the deleted nucleotide position (e.g.,

T15944d). Insertions are indicated by a dot followed by the posi-

tion number and type of inserted nucleotide(s) (e.g., 5899.1C for

a C insertion at the first inserted nucleotide position after position

5899 and 5899.2C for a subsequent C insertion, and these are

abbreviated as 5899.1CC when occurring on the same branch).

We label polynucleotide stretches of unknown length as follows:

573.XC. In cases where an insertion occurred at an ancestral

branch but a reversion of this insertion (= deletion) took place

at a descendant branch, we noted the latter as follows:

5899.1Cd. An exclamation mark (!) at the end of a labeled position

denotes a reversion to the ancestral state. The number of exclama-

tion marks stands for the number of sequential reversions in

the given position from the RSRS (e.g., C152T, T152C!, and


C152T!!). Some indel positions have been a source of confusion

because multiple alignment solutions enable alternative scoring.

Notably, the dinucleotide repeat in hypervariable segment II

(HVS-II) of the control region can be viewed either as a CA repeat

starting at position 514 or as an AC repeat starting at position 515,

leading to two different notations being in use for a repeat loss:

522–523d versus 523–524d. We adhered to the guidelines for

consistent treatment of mtDNA-length variants that were estab-

lished by the forensic genetic community31 and favor the AC

interpretation. As the RSRS has one AC unit less compared to

the rCRS, we filled positions 523 and 524 of the RSRS with "NN,"

thereby preserving the historical genome annotation numbering.

Consequently, an AC insertion compared to the RSRS is scored as

522.1AC, whereas an AC deletion is scored as 521–522d. Table S2

presents all common indel positions throughout the complete

mtDNA sequence and the way we labeled them. Transitions at

the hypervariable position 16519, insertions of one or two Cs at

positions 309, 315, and 16193, A to C transversions at 16182

and 16183, as well as length variation of the AC dinucleotide

repeat spanning 515–522, were excluded from the phylogeny.
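The notation rules above can be summarized in a small parser. The sketch below is illustrative and is not part of the study's software: the function name, the returned field names, and the regular expressions are assumptions, and only the label forms quoted in the text (e.g., G73A, A73t, G73R, T15944d, 5899.1CC, 573.XC, 5899.1Cd, C152T!!, 521–522d) are covered.

```python
import re

# One pattern per label form described in the text.
DELETION = re.compile(r"^([ACGT])(\d+)d$")                 # T15944d
RANGE_DELETION = re.compile(r"^(\d+)[-–](\d+)d$")          # 521–522d
INSERTION = re.compile(r"^(\d+)\.(\d+|X)([ACGT]+)(d?)$")   # 5899.1C, 573.XC, 5899.1Cd
SUBSTITUTION = re.compile(r"^([ACGT])(\d+)([ACGTacgt]|[RYSWKMBDHVN])(!*)$")


def parse_mutation(label):
    """Classify one mutation label written in the notation described above."""
    m = DELETION.match(label)
    if m:
        return {"kind": "deletion", "position": int(m.group(2)),
                "ancestral": m.group(1)}
    m = RANGE_DELETION.match(label)
    if m:
        return {"kind": "deletion", "start": int(m.group(1)),
                "end": int(m.group(2))}
    m = INSERTION.match(label)
    if m:
        return {
            # A trailing "d" marks the later loss of an ancestral insertion.
            "kind": "insertion reverted" if m.group(4) else "insertion",
            "after_position": int(m.group(1)),
            "index": m.group(2),   # "1", "2", ... or "X" for unknown length
            "bases": m.group(3),
        }
    m = SUBSTITUTION.match(label)
    if m:
        ancestral, position, derived, bangs = m.groups()
        if derived.islower():
            kind = "transversion"      # lowercase derived base
        elif derived in "ACGT":
            kind = "transition"        # uppercase derived base
        else:
            kind = "heteroplasmy"      # IUPAC ambiguity code, e.g. R = A or G
        return {"kind": kind, "position": int(position),
                "ancestral": ancestral, "derived": derived.upper(),
                "reversions": len(bangs)}
    raise ValueError(f"unrecognized mutation label: {label!r}")


for example in ["G73A", "A73t", "G73R", "T15944d", "5899.1CC", "C152T!!"]:
    print(example, parse_mutation(example)["kind"])
```

The parser classifies a substitution purely by the case of the derived base, following the convention in the text; it does not check that an uppercase label is chemically a transition.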

Haplogroup labels were re-evaluated and the following sugges-

tions were made:

(1) Monophyletic clades that are composed of two or more

previously named haplogroups are labeled by concate-

nating their names and separating them by apostrophe

(e.g., L0a’b). This is not applied in the case of capital-

letter-only labeled haplogroups (e.g., JT);

(2) We suggest labeling an extant sample that matches

a haplogroup root with the superscript case letter n for

‘‘nodal’’ (e.g., Hⁿ);

(3) We note that when complete mtDNA sequences are considered, the inability to differentiate a nodal haplotype from

an unresolved paraphyletic clade is eliminated. Accord-

ingly, the haplogroup label of each observed complete

mtDNA sequence can: (1) mark it in a nodal position; (2)

affiliate it with a previously labeled haplogroup; (3) suggest

a, so far, unlabeled haplogroup; or (4) in the absence of

two additional samples to justify the labeling of a, so far,

unidentified haplogroup, affiliate it with the ancestral

haplogroup. So, the label of a given sample as ‘‘H’’ means

that it is an unlabeled descendent of haplogroup H that

cannot be affiliated to any known H haplogroup clade

at the time of report and based on complete mtDNA

sequence. We suggest restricting the use of label ‘‘H*’’ to

cases where the haplogroup labeling is based on partial

mtDNA sequence;

(4) To aid the nonexpert in understanding the mtDNA hap-

logroup nomenclature system, we summarize in Table S3

the cases where haplogroup labels do not logically follow

from the hierarchy and hence could lead to confusion.

Changing these haplogroup labels to make them more

logical is undesirable at this stage because they are already

used extensively in the literature and therefore changing

them would probably cause even more confusion. In addi-

tion, we note that for the most basal nodes of the

phylogeny, historically the following shorthand names

have been in use: L1’5 = L1’2’3’4’5’6; L2’5 = L2’3’4’5’6; L2’6 = L2’3’4’6; and L4’6 = L3’4’6, which we will herein

refer to by their full name. One shorthand haplogroup

name, M4’’67, is maintained because writing it in full

(M4’18’30’37’38’43’45’63’64’65’66’67) seems impractical.
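The concatenation rule in suggestion (1) can be illustrated with a short sketch. The text states only that names are concatenated and separated by an apostrophe, except for capital-letter-only haplogroups. Writing the shared prefix once is inferred from the examples given (L0a’b, L1’2’3’4’5’6, M4’18’30’...) and is an assumption, as are the function name and the handling of multi-digit names.

```python
import os

APOSTROPHE = "\u2019"  # typographic apostrophe, as in L0a’b


def join_haplogroup_labels(names):
    """Build a label for a clade that unites several named haplogroups."""
    if all(name.isalpha() and name.isupper() for name in names):
        # Capital-letter-only haplogroups are joined without an apostrophe.
        return "".join(names)

    prefix = os.path.commonprefix(names)

    def must_shorten(p):
        # Assumption: no name may be left with an empty remainder, and a
        # number must not be split (H10 + H11 share "H", not "H1").
        remainders = [name[len(p):] for name in names]
        if any(rem == "" for rem in remainders):
            return True
        return p[-1].isdigit() and any(rem[0].isdigit() for rem in remainders)

    while prefix and must_shorten(prefix):
        prefix = prefix[:-1]

    return prefix + APOSTROPHE.join(name[len(prefix):] for name in names)


print(join_haplogroup_labels(["L0a", "L0b"]))
print(join_haplogroup_labels(["J", "T"]))
print(join_haplogroup_labels(["L1", "L2", "L3", "L4", "L5", "L6"]))
```

Applied to the twelve M haplogroups listed above, the same function reproduces the full name that the shorthand M4’’67 stands for.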

It is important to note that the aim of this study is to publish the

most up-to-date human mtDNA phylogeny, and it cannot be

regarded by any means as a population-level survey exploring

the frequencies and distributions of the various haplogroups.

Therefore, although all sequences were used to establish the tree

topology, the subset of sequences actually presented in the

phylogeny is lower because for each branch up to two representa-

tive example sequences are provided. In most cases, we labeled

haplogroups only when supported by at least three distinct haplo-

types to maximize the accuracy of the haplogroup defining array

of mutations and to avoid the establishment of haplogroups

resulting from sequencing errors. Exceptions included previously

established haplogroups or haplogroups supported by a particu-

larly long array of mutations. Accordingly, the tips of the herein

released phylogeny are in fact internal haplogroup nodes, thus

private mutations (if any) of individual haplotypes were not

included.

Evaluation of the mtDNA Clock and Age Estimates

Substitution Counts and Molecular Clock

To calculate the substitution counts from the RSRS to every extant

mitogenome (which is a tip in the mtDNA phylogeny), we

summed up the number of mutations on the path leading to

each noted haplogroup in the phylogeny and added to this the

number of positions that differed between the tip and the root

of the haplogroup. Thus, we are guaranteed to correctly count

all parallel and back mutations, except for the case where two

mutations affecting the same position occurred on a branch in

the tree (in which case we either count zero instead of two, if

the second is a back mutation, or one instead of two, if the second

mutation is not back to the initial state). As has been argued in the

past, such repeated mutations within a single branch in the highly

resolved human mtDNA tree are highly unlikely,32 and are even

more so if the fastest mutating sites (16519 and the A to C trans-

versions and poly-C insertions around the HVS-I position 16189)

are eliminated, as was done in our analysis.
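The counting procedure described above can be sketched as a walk from a haplogroup node back to the root. This is a minimal illustration, assuming the tree is stored as a parent map plus a list of mutation labels per branch; the data structures, the function name, and the toy tree are hypothetical.

```python
def substitutions_from_root(haplogroup, parent, branch_mutations,
                            private_differences=0):
    """Count substitutions from the root to one extant mitogenome.

    haplogroup          -- the labeled node the mitogenome belongs to
    parent              -- maps each node to its parent (root maps to None)
    branch_mutations    -- maps each node to the mutations on the branch
                           leading to it from its parent
    private_differences -- positions that differ between the mitogenome
                           and the root of its haplogroup
    """
    count = private_differences
    node = haplogroup
    while node is not None:
        count += len(branch_mutations.get(node, []))
        node = parent.get(node)
    return count


# Toy tree (node names and mutations are made up for illustration).
parent = {"root": None, "A": "root", "A1": "A"}
branch_mutations = {"A": ["C152T", "G73A"], "A1": ["T152C!"]}

print(substitutions_from_root("A1", parent, branch_mutations,
                              private_differences=2))
```

In the toy tree, position 152 mutates on the branch to A and reverts on the branch to A1. Both events are counted because they lie on different branches; only two hits at the same position within a single branch would be missed, as noted in the text.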

To test the validity of the molecular clock assumption on human

mtDNA substitutions, we used PAML 4.4 with the HKY85 substitu-

tion model to generate maximum likelihood estimates of branch

lengths with and without the molecular clock assumption. We

chose to sample around 200–300 sequences and analyze their

coalescent tree (a subtree of the complete tree) in each PAML

run, to accommodate PAML’s computational limitations, and

also to sample mostly deep branches (such as M44), rather than

the recent and very short branches (such as D4a1b1) of the over-

sampled haplogroups such as H and D. Thus, we preferentially

sampled haplogroups whose coalescence with other samples in

the tree was more ancient. This ensured that even in such

a sample, the deeper clades such as the basal M clades would

be represented with high probability, whereas more recently

coalescing haplogroups such as the ones of haplogroup D would

be rarely sampled.

The generalized likelihood ratio (GLR) test for validity of the

clock assumption then uses the test statistic 2 × (log-likelihood

of non-clock model − log-likelihood of clock model), which,

under the null hypothesis of molecular clock, has a χ² distribution

with degrees of freedom equal to the number of parameters under

no clock (= number of branches in the tree) minus number of

parameters under clock (= number of internal nodes in the tree).
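The GLR computation described above can be sketched as follows. This is a minimal illustration, not the PAML workflow itself: the function names `chi2_sf` and `glr_clock_test` are hypothetical, the log-likelihoods and tree sizes in the usage lines are invented numbers, and the χ² tail probability is computed with a standard series and continued-fraction evaluation of the regularized incomplete gamma function so that only the Python standard library is needed.

```python
import math


def _gamma_q(a, x):
    """Regularized upper incomplete gamma function Q(a, x) for a > 0, x >= 0."""
    if a <= 0 or x < 0:
        raise ValueError("require a > 0 and x >= 0")
    if x == 0:
        return 1.0
    log_prefactor = -x + a * math.log(x) - math.lgamma(a)
    if x < a + 1.0:
        # Series expansion for the lower function P(a, x); return 1 - P.
        term = 1.0 / a
        total = term
        n = a
        for _ in range(100000):
            n += 1.0
            term *= x / n
            total += term
            if abs(term) < abs(total) * 1e-16:
                break
        return 1.0 - total * math.exp(log_prefactor)
    # Continued fraction for Q(a, x), evaluated with the modified Lentz method.
    tiny = 1e-300
    b = x + 1.0 - a
    c = 1.0 / tiny
    d = 1.0 / b
    h = d
    for i in range(1, 100000):
        an = -i * (i - a)
        b += 2.0
        d = an * d + b
        if abs(d) < tiny:
            d = tiny
        c = b + an / c
        if abs(c) < tiny:
            c = tiny
        d = 1.0 / d
        delta = d * c
        h *= delta
        if abs(delta - 1.0) < 1e-16:
            break
    return h * math.exp(log_prefactor)


def chi2_sf(statistic, df):
    """Upper tail probability of a chi-square distribution with df degrees of freedom."""
    return _gamma_q(df / 2.0, statistic / 2.0)


def glr_clock_test(lnl_no_clock, lnl_clock, n_branches, n_internal_nodes):
    """Return (statistic, degrees of freedom, p value) for the clock test."""
    statistic = max(0.0, 2.0 * (lnl_no_clock - lnl_clock))
    df = n_branches - n_internal_nodes
    return statistic, df, chi2_sf(statistic, df)


# Hypothetical values for a rooted binary tree with 250 tips
# (498 branches, 249 internal nodes); not taken from the study.
stat, df, p = glr_clock_test(-30120.4, -30290.9, 498, 249)
print(f"statistic={stat:.1f}, df={df}, p={p:.3g}")
```

A small p value here would indicate that the non-clock model fits significantly better, that is, a rejection of the clock assumption for the sampled subtree.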

We performed the analyses on two sets of the mtDNA

sequences: once by using the coding region alone and once on

the entire molecule. This was done as another sanity check for


the validity and generality of our results. All obtained p values are

presented in Table S4.

Age Calculations Assuming a Molecular Clock

In spite of the discovered clock violations, we were still interested in

applying the best available tools for estimating the ages of ancestral

nodes in the tree assuming a molecular clock. We adopted the

calculation approach and mutation rate estimate of Soares et al.,32 who suggest

estimating ages in substitutions and then transforming them to years

in a nonlinear manner accounting for the selection effect on non-

synonymous mutations. We used PAML 4.433 with the HKY85

substitution model to generate maximum likelihood estimates of

internal node ages under a molecular clock assumption. Because

PAML is computationally limited in the size of trees it can analyze,

we performed estimation for the whole tree in several separate runs.

We divided the tree into seven collections of haplogroups:

• All L haplogroups (i.e., the entire phylogeny excluding M and N)
• All of M excluding D
• D and JT
• H excluding H1 and H5
• B4’5 and HV excluding H but including H1 and H5
• U
• N excluding HV, U, JT and B4’5

For each PAML run, we selected all sequences belonging to one

of these sets, and added a small random sample of other samples

from the rest of the phylogeny to maintain ‘‘calibration.’’ Putting

together the estimates from all seven runs provided us with age

estimates for all nodes in our tree. Estimates are given in Table S5.
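The assembly of the sequence set for one run can be sketched as below. This is only an illustration under stated assumptions: the function name `select_run_samples`, the sample identifiers, the collection labels, and the size of the calibration sample are all hypothetical, since the text does not specify how many outside sequences were added.

```python
import random


def select_run_samples(collection_of, target_collections, n_calibration, seed=0):
    """Pick all samples in the target collections plus a random outside sample.

    collection_of maps a sample identifier to the label of its collection.
    Returns (core, calibration) as two lists of sample identifiers.
    """
    rng = random.Random(seed)
    core = sorted(s for s, c in collection_of.items() if c in target_collections)
    rest = sorted(s for s, c in collection_of.items() if c not in target_collections)
    calibration = rng.sample(rest, min(n_calibration, len(rest)))
    return core, calibration


# Hypothetical sample labels; real runs would use the seven collections above.
labels = {"s1": "U", "s2": "U", "s3": "H", "s4": "L", "s5": "M", "s6": "D and JT"}
core, calibration = select_run_samples(labels, {"U"}, n_calibration=2, seed=1)
print(core, calibration)
```

Fixing the seed keeps each run reproducible; drawing the calibration samples from the whole remainder of the phylogeny is one simple reading of "a small random sample of other samples."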

Data Transition

We are aware that the suggested change can raise difficulties and

even antagonism from the scientific community. On the other

hand, a scenario in which a reference sequence of a genetic locus

does not represent its ancestral sequence should, indisputably, be

corrected. The realization of the superiority of complete mtDNA

sequence analysis compared to other approaches, combined

with the emergence of deep sequencing technologies, will possibly

shift the entire field into the use of only complete mtDNA

sequences in the near future.34–36 Therefore, the sooner the

change is made the less ‘‘painful’’ it will be. As the common

practice for reporting complete mtDNA sequences is by posting

the sequences as FASTA files to NCBI, rather than reporting the

substitutions with respect to a reference sequence (as in the case

of many data sets restricted to control-region variation), no major

change is needed. When a FASTA file is available or created, the

only change needed is to switch the reference sequence to the

RSRS. For control-region-based data sets, the conversion might

be more problematic as the common practice to report the

sequences in literature did not involve FASTA files but recorded

mutations as compared to the rCRS. Table S6 compares the classic

diagnostic mutations for the major haplogroups relative to the

rCRS or the RSRS.
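To make the effect of switching the reference concrete, the following minimal sketch lists substitutions of one aligned sequence against two different references. It is a toy illustration: the eight-base sequences stand in for the rCRS and the RSRS and are invented, the function name `substitutions` is hypothetical, and insertions and deletions are not handled.

```python
def substitutions(reference, sample, start=1):
    """List substitutions as <reference base><position><sample base>.

    Both sequences must already be aligned and of equal length; positions
    are numbered from `start`.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    return [
        f"{ref_base}{start + offset}{obs_base}"
        for offset, (ref_base, obs_base) in enumerate(zip(reference, sample))
        if ref_base != obs_base
    ]


# Invented stand-ins for the two references (assumption, not real mtDNA).
rcrs_like = "ACGTTGCA"
rsrs_like = "ATGTTGTA"

# A sample identical to the modern reference shows no differences against it,
# yet carries derived states when scored against the ancestral reference.
sample = "ACGTTGCA"
print(substitutions(rcrs_like, sample))
print(substitutions(rsrs_like, sample))
```

The same FASTA sequence therefore yields two different mutation lists, and only the list scored against the ancestral reference reads consistently from ancestral to derived state.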

To facilitate data transition we release the tools ‘‘FASTmtDNA,’’

which allows transformation of Excel list-type reports of mtDNA

haplotypes into FASTA files, and ‘‘mtDNAble,’’ which labels

haplogroups, performs a phylogeny-based quality check and

identifies private substitutions. These noted features are fully

supported in a web interface or as standalone versions, which

can be freely downloaded from the website including their manual

and example files. In addition, the web interface allows the

benefit of comparing private substitutions between submitted

and previously stored mitogenomes to suggest the labeling of

additional haplogroups. Following quality check and consent, the

web interface enables the storing of complete mtDNA sequences

by members of the mtDNA community to enrich a growing

database. This in turn is expected to strengthen the data set used

by the website to label haplogroups, perform quality control and

refine the phylogeny. Additional tools will be periodically added

and updated.

Results

The RSRS

Since the sub-Saharan haplogroup L0 was defined,37 it

became clear that the root of the extant variation

of human mitochondrial genomes is allocated between

haplogroups L0 and L1’2’3’4’5’6, which are separated

from each other by 14 coding and four control-region

mutations22 (Figure 1). Until now, our understanding of

the root of the human mtDNA tree was incomplete

because of the absence of reliable closely related outgroup

mitogenomes, and the exact placement of the 18 mutations

separating the L0 and L1’2’3’4’5’6 nodes remained

vague. In principle, ancient mtDNA from early human

fossils might be informative but is currently unreachable because of

considerable technical problems inherent to the analysis

process.13 However, as the split between H. sapiens and

H. neanderthalensis certainly predates the appearance of

the RSRS,38 a resolution of the deepest node might

be achieved by rooting the human phylogeny with

H. neanderthalensis complete mtDNA sequences23,24

(Figure 1). Table S1 shows all substitutions separating haplogroup

L0 from L1’2’3’4’5’6, their status in the six

H. neanderthalensis mitogenomes and their most parsimo-

nious allocation around the human root. Accordingly,

the ancestral mtDNA sequence of extant humans should

correspond to the bifurcation of L0 and L1’2’3’4’5’6.

Although it cannot be excluded that further sampling of

the African mtDNA variation might reveal yet another

more basal clade of the human mtDNA tree, it is at least

equally valid to indicate that, in spite of the many

thousands of reported complete mtDNA sequences,7 such

a clade has not been found so far. Operating under this

assumption we established the reference point, RSRS,

which is made available as Sequence S1.

We present the most resolved human mtDNA

phylogeny by compiling the information from 18,843

mitochondrial genomes of which 10,627 were previously

summarized in PhyloTree Build 13 (28 Dec 2011).7 We fol-

lowed the established cladistic notation for haplogroup

labeling adjusted for complete mtDNA genomes.7,39 Yet,

in contrast with the previously reported phylogeny, all

mutational changes noted on the branches of the tree indi-

cate the actual descendant nucleotide state relative to the

state in the RSRS. Although this has no effect on the tree

topology per se, it is critical to emphasize its major conse-

quences in the way of reporting the list of mutations


denoting an mtDNA haplotype. Accordingly, although the

HVS-I haplotype of a nodal haplogroup H2a2a1 mitoge-

nome will show no differences when compared to the

rCRS, its differentiation relative to the RSRS is now docu-

mented by the transitions A16129G, T16187C, C16189T,

T16223C, G16230A, T16278C and C16311T. The

common practice of expressing haplotypes as a string of

differences from the rCRS (Figure 1) has led, for instance,

many inexperienced readers to incorrectly take the ‘‘fact’’

that African haplogroup L mitogenomes have more substitutions

separating them from the rCRS than

western Eurasian haplogroup H mitogenomes as a ‘‘proof’’

of an African origin for all contemporary humans.

Indications for Violation of the Molecular Clock

The accepted notion of a molecular clock means that

contemporary mtDNA haplotypes should show statisti-

cally insignificant differences in the number of accu-

mulated mutations from the RSRS.40 Triggered by the

suggested change in the reference sequence that facili-

tates substitution counts from the ancestral root, we

further evaluated this hypothesis. The range of sub-

stitution counts separating contemporary mitogenomes

belonging to major haplogroups from the RSRS is shown

in Figure S2. The mean distance is 57.1 substitutions, the

median is 56 and the empirical standard deviation is 5.9.

Widely different distances ranging from 41 substitutions in

some L0d1a1 mitogenomes to 77 in some L2b1a mitoge-

nomes are observed. Interestingly, the ranges of sub-

stitution counts within haplogroups M and N, which are

hallmarks of the relatively recent out-of-Africa exodus of

humans, are also very large. For example, within M there

are two mitogenomes with 43 substitutions (in M30a and

M44) and two mitogenomes with as many as 71 substitu-

tions (in M2b1b and M7b3a). This is especially striking

because the path from the RSRS to the root of M already

contains 39 substitutions. Hence, the difference between

the M root and its M44 descendant is only four substitu-

tions (two in the coding region and two in the control

region) as compared to 32 substitutions in the M2b1b

and M7b3a mitogenomes. These observations raise the

possibility that the tree in general, and haplogroup M in

particular, might not adhere uniformly to the assumed

molecular clock, under which substitutions occur at a fixed

rate on all branches of the tree over time. We evaluated this

scenario by performing generalized likelihood ratio tests of

the molecular clock by using PAML33 on subsets of samples

from the entire tree, on haplogroup L2 (following past

evidence of clock violations in this haplogroup40) and on

the sister haplogroups M and N. Our results demonstrate

violations of the molecular clock in M (0.00015 ≤

p value ≤ 0.0003 for χ² GLR test in three different analyses)

and give mixed results for the entire tree (p = 0.005

and p = 0.018 for two analyses, which might be sensitive

to the parts of the tree randomly sampled) and L2 (GLR

χ² p value = 5 × 10⁻⁵ and p value = 0.033 for two analyses)

and borderline results in N (GLR χ² p value = 0.049 and

p value = 0.054 in two analyses). We are currently unable

to offer well-founded explanations for these findings,

which remain the scope of future studies.
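The substitution counts quoted above can be checked with a few lines of arithmetic. The sketch below uses only numbers stated in the text (39 substitutions from the RSRS to the root of M, tip totals of 43 and 71, mean 57.1 and standard deviation 5.9); the variable names are hypothetical, and the standardized distances are a rough descriptive aid, not the formal GLR test.

```python
# Substitutions on the path from the RSRS to the root of haplogroup M.
rsrs_to_m_root = 39

# Total substitutions from the RSRS for the mitogenomes named in the text.
tip_totals = {"M30a": 43, "M44": 43, "M2b1b": 71, "M7b3a": 71}

# Summary statistics reported for all contemporary mitogenomes.
mean_distance = 57.1
sd_distance = 5.9

# Substitutions accumulated since the M root, per tip.
since_m_root = {name: total - rsrs_to_m_root for name, total in tip_totals.items()}

# Distance of each tip from the overall mean, in standard deviations.
standardized = {
    name: (total - mean_distance) / sd_distance for name, total in tip_totals.items()
}

for name in tip_totals:
    print(f"{name}: {since_m_root[name]} since M root, z = {standardized[name]:+.2f}")
```

Under a strict clock, lineages descending from the same node would be expected to accumulate similar numbers of substitutions, so a contrast of 4 versus 32 substitutions since the M root is what motivates the formal tests reported in this paragraph.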

As the clock violation was observed only in a restricted

number of specified cases, we applied the best available

tools for estimating the ages of ancestral nodes. We adop-

ted a conventional calculation approach and mutation

rate32 and used PAML 4.4 to generate maximum likelihood

estimates for internal node ages under a molecular clock

assumption.33 Figure 2 displays the phylogeny and density

of extant haplogroups as a function of both the number of

substitutions occurring since the RSRS and the estimated

coalescence times.

Approaching a Perfect Phylogeny

The mitochondrial genomes released herein almost double

the number of sequences that were previously available.

Despite the fact that the sequences released in this study

are not equally representative of all human populations

but are mainly from donors of western Eurasian matrilineal

ancestry, a few additional advantages arise from this com-

bined data. First, an almost final level of resolution for

a number of western Eurasian clades was achieved, and

the nodes of ancestral and derived haplogroups are often

differentiated by a single mutation. For example, Figure 3

compares the resolution of haplogroup H4 as first reported41 and as

currently resolved. This comprehensive level of resolution

minimizes the chance of additional nomenclature issues

arising in future studies. Second, the highly resolved phylogeny

is a powerful tool for quality assessment.29,42–44

Mapping any additional complete mtDNA haplotype to

such a highly resolved phylogeny will highlight potential

sequencing errors and problems such as sample mix-up,

contamination, and typographical errors. Third, the

phylogeny itself is a useful resource for future evolutionary,

clinical, and forensic studies.45–51

Figure 2. Human mtDNA Phylogeny

A schematic representation of the most parsimonious human mtDNA phylogeny inferred from 18,843 complete mtDNA sequences with the structure shown explicitly for bifurcations that occurred 40,000 years before present (YBP) or earlier, and a graph showing the explosion of haplogroups since then. The y axis indicates the approximate number of haplogroups from each time layer that have survived to nowadays. The upper and lower x axes of the rooted tree are scaled according to the number of accumulated mutations since the RSRS and the corresponding coalescence ages, respectively.

Discussion

Thirty-one years ago, Anderson and colleagues27 published

the first complete sequence of human mtDNA. This

became the reference sequence in multidisciplinary studies

that revolutionized human genetics, leading, for instance,

to the concept of ‘‘late-out-of-Africa’’ (‘‘African Eve’’)

peopling of the world by modern humans,17,18 the identi-

fication of a wide range of pathological mtDNA muta-

tions,52,53 and the possibility of reconstructing the origins

and the relationships of modern as well as ancient popula-

tions.12,14,54 The publication of globally selected complete

mtDNA genomes about 10 years ago marked the beginning

of the genomic era in this field.4 Since then, progress has

been impressive. Most admirable is the penetration of

the principles applied in the field of archaeogenetics to

hundreds of thousands of people around the world who

became interested in their matrilineal descent. In fact, in

this paper we add information from more than 8,000

complete mtDNA sequences resulting largely from the

curiosity and enthusiasm of lay people to the ~10,000

publicly available complete mtDNA sequences. However,

as discussed above, the entire field faces a problem: the

traditional manner of reporting variation observed in

human mitochondrial genome sequences is, to be blunt,

conceptually incorrect.

Supported by a consensus of many colleagues and after

a few years of hesitation, we have reached the conclusion

that on the verge of the deep-sequencing revolution,47,55

when perhaps tens of thousands of additional complete

mtDNA sequences are expected to be generated over the

next few years, the principal change we suggest cannot

be postponed any longer: an ancestral rather than a ‘‘phylo-

genetically peripheral’’ and modern mitogenome from

Europe should serve as the epicenter of the human mtDNA

reference system. Inevitably, the proposed change could

raise some temporary inconveniences. For this reason, we

provide tables and software to aid data transition.

What we propose is much more than a mere clerical

change. We use the Ptolemaic geocentric versus Copernican

heliocentric systems as a metaphor. And the metaphor

extends further: as the acceptance of the heliocentric

system circumvented epicycles in the orbits of planets,

switching the mtDNA reference to an ancestral RSRS will

end an academically inadmissible conjuncture where

virtually all mitochondrial genome sequences are scored

in part from derived-to-ancestral states and in part from

ancestral-to-derived states. We aim to trigger the radical

but necessary change in the way mtDNA mutations are

reported relative to their ancestral versus derived status,

thus establishing an intellectual cohesiveness with the

current consensus of shared common ancestry of all

contemporary human mitochondrial genomes.

Figure 3. Haplogroup H4 internal cladistic structure

(Left) Haplogroup H4 as first reported.41 Mutations in bold were considered diagnostic for the haplogroup.

(Right) Haplogroup H4 as currently resolved with a total of 236 H4 mitogenomes. An almost perfect resolution of the nested hierarchy is achieved. Additional haplogroups suggested herein are shown in yellow. Control-region mutations are noted in blue.

Note that the problem is not restricted to mtDNA.

Indeed, in the much larger perspective of complete nuclear

genomes in which comparisons are often currently made

relative to modern human reference sequences, often of

European origin, it seems worthwhile to begin consid-

ering, as valuable alternatives, public reference sequences

of ancestral alleles (common in all primates) whereby

derived alleles (common to some human populations)

would be distinguished.

Supplemental Data

Supplemental Data include two figures, six tables, and one

sequence and can be found with this article online at

http://www.cell.com/AJHG/.

Acknowledgments

We thank the genealogical community for donating their

privately obtained complete mtDNA sequences for scientific

studies and FamilyTreeDNA for compiling the data. We thank

FamilyTreeDNA for supporting the establishment of the herein

released website. We thank Eileen Krauss-Murphy of Family-

TreeDNA for help with assembly of the database. We thank

Rebekah Canada and William R. Hurst for help with the assembly

of haplogroup H and K samples, respectively. R.V. and D.M.B.

thank the European Commission, Directorate-General for

Research for FP7 Ecogene grant 205419. D.M.B. is a shareholder

of FamilyTreeDNA and a member of its scientific advisory board.

R.V. and M.M. thank the European Union, Regional Development

Fund for a Centre of Excellence in Genomics grant, and R.V.

thanks the Swedish Collegium for Advanced Studies for support

during the initial stage of this study. M.M. thanks Estonian Science

Foundation for grant 8973. A.T. received support from Fondazione

Alma Mater Ticinensis and the Italian Ministry of Education,

University and Research: Progetti Ricerca Interesse Nazionale

2009. S.R. thanks the Israeli Science Foundation for grant 1227/

09 and IBM for an Open Collaborative Research grant. FCT, the

Portuguese Foundation for Science and Technology, partially sup-

ported this work through the personal grant N.M.S. (SFRH/BD/

69119/2010). Instituto de Patologia e Imunologia Molecular da

Universidade do Porto is an Associate Laboratory of the Portuguese

Ministry of Science, Technology and Higher Education and is

partially supported by the Portuguese Foundation for Science

and Technology.

Received: January 9, 2012

Revised: February 22, 2012

Accepted: March 2, 2012

Published online: April 5, 2012

Web Resources


FASTmtDNA, http://www.mtdnacommunity.org

mtDNAble, http://www.mtdnacommunity.org

mtPhyl, http://eltsov.org/mtphyl.aspx

PhyloTree, http://www.phylotree.org

Accession Numbers

The 4,265 complete mtDNA sequences reported herein have been

submitted to GenBank (accession numbers JQ701803–JQ706067).

References

1. Darwin, C. (1859). Natural Selection. On the Origin of

Species by Means of Natural Selection, or, The Preservation

of Favoured Races in the Struggle for Life, Chapter 4 (London:

John Murray).

2. Delsuc, F., Brinkmann, H., and Philippe, H. (2005). Phyloge-

nomics and the reconstruction of the tree of life. Nat. Rev.

Genet. 6, 361–375.

3. Kivisild, T., Metspalu, E., Bandelt, H.J., Richards, M., and

Villems, R. (2006). The world mtDNA phylogeny. In Human

mitochondrial DNA and the evolution of Homo sapiens, H.J.

Bandelt, V. Macaulay, and M. Richards, eds. (Berlin: Springer-

Verlag), pp. 149–179.

4. Ingman, M., Kaessmann, H., Paabo, S., and Gyllensten, U.

(2000). Mitochondrial genome variation and the origin of

modern humans. Nature 408, 708–713.

5. Richards, M., and Macaulay, V. (2001). The mitochondrial

gene tree comes of age. Am. J. Hum. Genet. 68, 1315–1320.

6. Torroni, A., Achilli, A., Macaulay, V., Richards, M., and

Bandelt, H.J. (2006). Harvesting the fruit of the human

mtDNA tree. Trends Genet. 22, 339–345.

7. van Oven, M., and Kayser, M. (2009). Updated comprehensive

phylogenetic tree of global human mitochondrial DNA

variation. Hum. Mutat. 30, E386–E394.

8. Underhill, P.A., and Kivisild, T. (2007). Use of Y chromosome

and mitochondrial DNA population structure in tracing

human migrations. Annu. Rev. Genet. 41, 539–564.

9. Salas, A., Bandelt, H.J., Macaulay, V., and Richards, M.B.

(2007). Phylogeographic investigations: The role of trees in

forensic genetics. Forensic Sci. Int. 168, 1–13.

10. Shriver, M.D., and Kittles, R.A. (2004). Genetic ancestry and

the search for personalized genetic histories. Nat. Rev. Genet.

5, 611–618.

11. Taylor, R.W., and Turnbull, D.M. (2005). Mitochondrial DNA

mutations in human disease. Nat. Rev. Genet. 6, 389–402.

12. Gilbert, M.T., Kivisild, T., Grønnow, B., Andersen, P.K., Metspalu,

E., Reidla, M., Tamm, E., Axelsson, E., Gotherstrom, A., Campos,

P.F., et al. (2008). Paleo-Eskimo mtDNA genome reveals matri-

lineal discontinuity in Greenland. Science 320, 1787–1789.

13. Gilbert, M.T., Hansen, A.J., Willerslev, E., Rudbeck, L., Barnes,

I., Lynnerup, N., and Cooper, A. (2003). Characterization of

genetic miscoding lesions caused by postmortem damage.

Am. J. Hum. Genet. 72, 48–61.

14. Haak, W., Forster, P., Bramanti, B., Matsumura, S., Brandt, G.,

Tanzer, M., Villems, R., Renfrew, C., Gronenborn, D., Alt,

K.W., and Burger, J. (2005). Ancient DNA from the first Euro-

pean farmers in 7500-year-old Neolithic sites. Science 310,

1016–1018.


15. Denaro, M., Blanc, H., Johnson, M.J., Chen, K.H., Wilmsen,

E., Cavalli-Sforza, L.L., and Wallace, D.C. (1981). Ethnic vari-

ation in Hpa 1 endonuclease cleavage patterns of human

mitochondrial DNA. Proc. Natl. Acad. Sci. USA 78, 5768–5772.

16. Brown, W.M. (1980). Polymorphism in mitochondrial DNA of

humans as revealed by restriction endonuclease analysis. Proc.

Natl. Acad. Sci. USA 77, 3605–3609.

17. Cann, R.L., Stoneking, M., and Wilson, A.C. (1987). Mito-

chondrial DNA and human evolution. Nature 325, 31–36.

18. Vigilant, L., Stoneking, M., Harpending, H., Hawkes, K., and

Wilson, A.C. (1991). African populations and the evolution

of human mitochondrial DNA. Science 253, 1503–1507.

19. Richards, M., Corte-Real, H., Forster, P., Macaulay, V.,

Wilkinson-Herbots, H., Demaine, A., Papiha, S., Hedges, R.,

Bandelt, H.J., and Sykes, B. (1996). Paleolithic and neolithic

lineages in the European mitochondrial gene pool. Am. J.

Hum. Genet. 59, 185–203.

20. Torroni, A., Bandelt, H.J., D’Urbano, L., Lahermo, P., Moral, P.,

Sellitto, D., Rengo, C., Forster, P., Savontaus, M.L., Bonne-

Tamir, B., and Scozzari, R. (1998). mtDNA analysis reveals

a major late Paleolithic population expansion from south-

western to northeastern Europe. Am. J. Hum. Genet. 62,

1137–1152.

21. Torroni, A., Schurr, T.G., Cabell, M.F., Brown, M.D., Neel, J.V.,

Larsen, M., Smith, D.G., Vullo, C.M., and Wallace, D.C.

(1993). Asian affinities and continental radiation of the four

founding Native American mtDNAs. Am. J. Hum. Genet. 53,

563–590.

22. Behar, D.M., Villems, R., Soodyall, H., Blue-Smith, J., Pereira,

L., Metspalu, E., Scozzari, R., Makkan, H., Tzur, S., Comas,

D., et al; Genographic Consortium. (2008). The dawn of

human matrilineal diversity. Am. J. Hum. Genet. 82, 1130–

1140.

23. Briggs, A.W., Good, J.M., Green, R.E., Krause, J., Maricic, T.,

Stenzel, U., Lalueza-Fox, C., Rudan, P., Brajkovic, D., Kucan,

Z., et al. (2009). Targeted retrieval and analysis of five Nean-

dertal mtDNA genomes. Science 325, 318–321.

24. Green, R.E., Malaspinas, A.S., Krause, J., Briggs, A.W., Johnson,

P.L., Uhler, C., Meyer, M., Good, J.M., Maricic, T., Stenzel, U.,

et al. (2008). A complete Neandertal mitochondrial genome

sequence determined by high-throughput sequencing. Cell

134, 416–426.

25. Kivisild, T., Shen, P., Wall, D.P., Do, B., Sung, R., Davis, K.,

Passarino, G., Underhill, P.A., Scharfe, C., Torroni, A., et al.

(2006). The role of selection in the evolution of human mito-

chondrial genomes. Genetics 172, 373–387.

26. Kivisild, T., Reidla, M., Metspalu, E., Rosa, A., Brehm, A.,

Pennarun, E., Parik, J., Geberhiwot, T., Usanga, E., and

Villems, R. (2004). Ethiopian mitochondrial DNA heritage:

Tracking gene flow across and around the gate of tears. Am.

J. Hum. Genet. 75, 752–770.

27. Anderson, S., Bankier, A.T., Barrell, B.G., de Bruijn, M.H.,

Coulson, A.R., Drouin, J., Eperon, I.C., Nierlich, D.P., Roe,

B.A., Sanger, F., et al. (1981). Sequence and organization of

the human mitochondrial genome. Nature 290, 457–465.

28. Andrews, R.M., Kubacka, I., Chinnery, P.F., Lightowlers, R.N.,

Turnbull, D.M., and Howell, N. (1999). Reanalysis and

revision of the Cambridge reference sequence for human

mitochondrial DNA. Nat. Genet. 23, 147.

29. Yao, Y.G., Salas, A., Bravi, C.M., and Bandelt, H.J. (2006).

A reappraisal of complete mtDNA variation in East Asian families

with hearing impairment. Hum. Genet. 119, 505–515.

30. Pello, R., Martın, M.A., Carelli, V., Nijtmans, L.G., Achilli, A.,

Pala, M., Torroni, A., Gomez-Duran, A., Ruiz-Pesini, E., Marti-

nuzzi, A., et al. (2008). Mitochondrial DNA background

modulates the assembly kinetics of OXPHOS complexes in

a cellular model of mitochondrial disease. Hum. Mol. Genet.

17, 4001–4011.

31. Bandelt, H.J., and Parson, W. (2008). Consistent treatment

of length variants in the human mtDNA control region:

A reappraisal. Int. J. Legal Med. 122, 11–21.

32. Soares, P., Ermini, L., Thomson, N., Mormina, M., Rito, T.,

Rohl, A., Salas, A., Oppenheimer, S., Macaulay, V., and Ri-

chards, M.B. (2009). Correcting for purifying selection: An

improved human mitochondrial molecular clock. Am. J.

Hum. Genet. 84, 740–759.

33. Yang, Z. (2007). PAML 4: Phylogenetic analysis by maximum

likelihood. Mol. Biol. Evol. 24, 1586–1591.

34. Tang, S., and Huang, T. (2010). Characterization of mitochon-

drial DNA heteroplasmy using a parallel sequencing system.

Biotechniques 48, 287–296.

35. Li, M., Schonberg, A., Schaefer, M., Schroeder, R., Nasidze, I.,

and Stoneking, M. (2010). Detecting heteroplasmy from

high-throughput sequencing of complete human mitochon-

drial DNA genomes. Am. J. Hum. Genet. 87, 237–249.

36. Zaragoza, M.V., Fass, J., Diegoli, M., Lin, D., and Arbustini, E.

(2010). Mitochondrial DNA variant discovery and evaluation

in human Cardiomyopathies through next-generation

sequencing. PLoS ONE 5, e12295.

37. Mishmar, D., Ruiz-Pesini, E., Golik, P., Macaulay, V., Clark,

A.G., Hosseini, S., Brandon, M., Easley, K., Chen, E., Brown,

M.D., et al. (2003). Natural selection shaped regional mtDNA

variation in humans. Proc. Natl. Acad. Sci. USA 100, 171–176.

38. Green, R.E., Krause, J., Briggs, A.W., Maricic, T., Stenzel, U.,

Kircher, M., Patterson, N., Li, H., Zhai, W., Fritz, M.H., et al.

(2010). A draft sequence of the Neandertal genome. Science

328, 710–722.

39. Richards, M.B., Macaulay, V.A., Bandelt, H.J., and Sykes, B.C.

(1998). Phylogeography of mitochondrial DNA in western

Europe. Ann. Hum. Genet. 62, 241–260.

40. Torroni, A., Rengo, C., Guida, V., Cruciani, F., Sellitto, D.,

Coppa, A., Calderon, F.L., Simionati, B., Valle, G., Richards,

M., et al. (2001). Do the four clades of the mtDNA haplogroup

L2 evolve at different rates? Am. J. Hum. Genet. 69, 1348–1356.

41. Achilli, A., Rengo, C., Magri, C., Battaglia, V., Olivieri, A., Scoz-

zari, R., Cruciani, F., Zeviani, M., Briem, E., Carelli, V., et al.

(2004). The molecular dissection of mtDNA haplogroup H

confirms that the Franco-Cantabrian glacial refuge was a major

source for the European gene pool. Am. J. Hum. Genet. 75,

910–918.

42. Parson, W., and Bandelt, H.J. (2007). Extended guidelines for

mtDNA typing of population data in forensic science. Forensic

Sci. Int. Genet. 1, 13–19.

43. Salas, A., Carracedo, A., Macaulay, V., Richards, M., and

Bandelt, H.J. (2005). A practical guide to mitochondrial DNA

error prevention in clinical, forensic, and population genetics.

Biochem. Biophys. Res. Commun. 335, 891–899.

44. Bandelt, H.J., Lahermo, P., Richards, M., and Macaulay, V.

(2001). Detecting errors in mtDNA data by phylogenetic

analysis. Int. J. Legal Med. 115, 64–69.

45. Ballantyne, K.N., van Oven, M., Ralf, A., Stoneking, M., Mitchell,

R.J., van Oorschot, R.A., and Kayser, M. (2011). MtDNA

SNP multiplexes for efficient inference of matrilineal genetic

ancestry within Oceania. Forensic Sci. Int. Genet., in press.

Published online September 20, 2011. 10.1016/j.fsigen.2011.08.010.

46. Pereira, L., Soares, P., Radivojac, P., Li, B., and Samuels, D.C.

(2011). Comparing phylogeny and the predicted pathogenicity

of protein variations reveals equal purifying selection across

the global human mtDNA diversity. Am. J. Hum. Genet. 88,

433–439.

47. Behar, D.M., Harmant, C., Manry, J., van Oven, M., Haak, W.,

Martinez-Cruz, B., Salaberria, J., Oyharcabal, B., Bauduer, F.,

Comas, D., and Quintana-Murci, L.; Genographic Consortium.

(2012). The Basque paradigm: Genetic evidence of a maternal

continuity in the Franco-Cantabrian Region since pre-

Neolithic times. Am. J. Hum. Genet. 90, 486–493.

48. Zeviani, M., and Carelli, V. (2007). Mitochondrial disorders.

Curr. Opin. Neurol. 20, 564–571.

49. Gunnarsdottir, E.D., Nandineni, M.R., Li, M., Myles, S., Gil,

D., Pakendorf, B., and Stoneking, M. (2011). Larger mitochondrial

DNA than Y-chromosome differences between matrilocal

and patrilocal groups from Sumatra. Nat. Commun. 2, 228.

50. Baum, D.A., Smith, S.D., and Donovan, S.S. (2005). Evolution.

The tree-thinking challenge. Science 310, 979–980.

51. Behar, D.M., Metspalu, E., Kivisild, T., Rosset, S., Tzur, S.,

Hadid, Y., Yudkovsky, G., Rosengarten, D., Pereira, L.,

Amorim, A., et al. (2008). Counting the founders: The matri-

lineal genetic ancestry of the Jewish Diaspora. PLoS ONE 3,

e2062.

52. Wallace, D.C., Singh, G., Lott, M.T., Hodge, J.A., Schurr, T.G.,

Lezza, A.M., Elsas, L.J., 2nd, and Nikoskelainen, E.K. (1988).

Mitochondrial DNA mutation associated with Leber’s heredi-

tary optic neuropathy. Science 242, 1427–1430.

53. MITOMAP. (2011) A Human Mitochondrial Genome Data-

base. http://www.mitomap.org.

54. Quintana-Murci, L., Harmant, C., Quach, H., Balanovsky, O.,

Zaporozhchenko, V., Bormans, C., van Helden, P.D., Hoal,

E.G., and Behar, D.M. (2010). Strong maternal Khoisan contribution

to the South African coloured population: A case of

gender-biased admixture. Am. J. Hum. Genet. 86, 611–620.

55. Schonberg, A., Theunert, C., Li, M., Stoneking, M., and

Nasidze, I. (2011). High-throughput sequencing of complete

human mtDNA genomes from the Caucasus and West Asia:

High diversity and demographic inferences. Eur. J. Hum.

Genet. 19, 988–994.


ARTICLE

Age-Related Somatic Structural Changes in the Nuclear Genome of Human Blood Cells

Lars A. Forsberg,1 Chiara Rasi,1 Hamid R. Razzaghian,1 Geeta Pakalapati,1 Lindsay Waite,2

Krista Stanton Thilbeault,2 Anna Ronowicz,3 Nathan E. Wineinger,4 Hemant K. Tiwari,4

Dorret Boomsma,5 Maxwell P. Westerman,6 Jennifer R. Harris,7 Robert Lyle,8 Magnus Essand,1

Fredrik Eriksson,1 Themistocles L. Assimes,9 Carlos Iribarren,10 Eric Strachan,11 Terrance P. O’Hanlon,12

Lisa G. Rider,12 Frederick W. Miller,12 Vilmantas Giedraitis,13 Lars Lannfelt,13 Martin Ingelsson,13

Arkadiusz Piotrowski,3 Nancy L. Pedersen,14 Devin Absher,2 and Jan P. Dumanski1,*

Structural variations are among the most frequent interindividual genetic differences in the human genome. The frequency and distri-

bution of de novo somatic structural variants in normal cells is, however, poorly explored. Using age-stratified cohorts of 318 monozy-

gotic (MZ) twins and 296 single-born subjects, we describe age-related accumulation of copy-number variation in the nuclear genomes

in vivo and frequency changes for both megabase- and kilobase-range variants. Megabase-range aberrations were found in 3.4% (9 of

264) of subjects ≥60 years old; these subjects included 78 MZ twin pairs and 108 single-born individuals. No such findings were

observed in 81 MZ pairs or 180 single-born subjects who were ≤55 years old. Recurrent region- and gene-specific mutations, mostly deletions,

were observed. Longitudinal analyses of 43 subjects whose data were collected 7–19 years apart suggest considerable variation in

the rate of accumulation of clones carrying structural changes. Furthermore, the longitudinal analysis of individuals with structural aber-

rations suggests that there is a natural self-removal of aberrant cell clones from peripheral blood. In three healthy subjects, we detected

somatic aberrations characteristic of patients with myelodysplastic syndrome. The recurrent rearrangements uncovered here are candi-

dates for common age-related defects in human blood cells. We anticipate that extension of these results will allow determination of the

genetic age of different somatic-cell lineages and estimation of possible individual differences between genetic and chronological age.

Our work might also help to explain the cause of an age-related reduction in the number of cell clones in the blood; such a reduction is

one of the hallmarks of immunosenescence.

Introduction

Structural changes in the human genome have been iden-

tified as one of the major types of interindividual genetic

variation.1,2 Furthermore, the rate of formation of copy-

number variants (CNVs) exceeds the corresponding rate

of SNPs by 2–4 orders of magnitude.3–5 In spite of this, little

is known about the rate of formation and distribution of

de novo somatic CNVs in normal cells and whether these

aberrations accumulate with age. There are, however, indi-

cations that chromosomal remodeling in the nuclear and

mitochondrial genomes increases with age.6–12 Theoretical

predictions suggest that somatic mosaicism should be

widespread,13,14 and reviews in the field point out that

somatic mosaicism, in both healthy and diseased cells, is

an understudied aspect of human-genome biology.15–18

A recent estimate of 1.7% for the frequency with which

somatic mosaicism causes large-scale structural aberrations

in adult human samples is, however, a relatively low

number.19 We have shown that adult monozygotic (MZ)

twins and differentiated human tissues frequently display

somatic CNVs.20,21 We therefore hypothesized that the

nuclear genome of blood cells in vivo might accumulate

CNVs with age, and we used age-stratified MZ twins as

a starting point for testing this hypothesis. Because nuclear

genomes of MZ twins are identical at conception, they

represent a good model for studying somatic variation.

We replicated a MZ-twin-based analysis by using age-strat-

ified cohorts of single-born subjects. Using these resources,

we show age-related accumulation of CNVs in the nuclear

genomes of blood cells in vivo. Age effects were found for

both megabase- and kilobase-range variants.

1Department of Immunology, Genetics and Pathology, Rudbeck Laboratory, Uppsala University, 75185 Uppsala, Sweden; 2HudsonAlpha Institute for

Biotechnology, 601 Genome Way, Huntsville, AL 35806, USA; 3Department of Biology and Pharmaceutical Botany, Medical University of Gdansk, Hallera

107, 80-416 Gdansk, Poland; 4Section on Statistical Genetics, Department of Biostatistics, Ryals Public Health Building, University of Alabama at Birming-

ham, Suite 327, Birmingham, AL 35294-0022, USA; 5Department of Biological Psychology, VU University, Van der Boechorststraat 1, 1081 BT Amsterdam,

The Netherlands; 6Hematology Research, Mount Sinai Hospital Medical Center, 1500 S California Avenue, Chicago, IL 60608, USA; 7Department of Genes

and Environment, Division of Epidemiology, The Norwegian Institute of Public Health, P.O. Box 4404 Nydalen, N-0403 Oslo, Norway; 8Department of

Medical Genetics, Oslo University Hospital, Kirkeveien 166, 0407 Oslo, Norway; 9Department of Medicine, Stanford University School of Medicine,

Stanford, CA 94305, USA; 10Kaiser Foundation Research Institute, Oakland, CA 94612, USA; 11Department of Psychiatry and Behavioral Sciences and

University of Washington Twin Registry, University of Washington, Box 359780 Seattle, WA 98104, USA; 12Environmental Autoimmunity Group,

National Institute of Environmental Health Sciences, National Institutes of Health Clinical Research Center, National Institutes of Health, Building 10,

Room 4-2352, 10 Center Drive, MSC 1301, Bethesda, MD 20892-1301, USA; 13Department of Public Health and Caring Sciences, Division of Molecular

Geriatrics, Rudbeck Laboratory, Uppsala University, 751 85 Uppsala, Sweden; 14Department of Medical Epidemiology and Biostatistics, Karolinska Institutet,

SE-171 77 Stockholm, Sweden





Studied Cohorts, DNA Isolation, and Quality Control

Samples were collected with informed consent from all subjects,

and the study was approved by the respective local institutional

review boards or research ethics committees. The information

about studied cohorts of MZ twins and single-born subjects is

provided in Tables S1 and S2, available online. We isolated DNA

from peripheral blood by using the QIAGEN kit (QIAGEN, Hilden,

Germany).

The quality, quantity, and integrity of DNA samples were

controlled with NanoDrop (Thermo Fisher Scientific, Waltham,

MA, USA), picoGreen fluorescent assay (Invitrogen, Eugene, Ore-

gon, USA), and agarose gels.

Sorting of Subpopulations of Cells from Peripheral Blood and Culturing of Fibroblasts

Peripheral blood mononuclear cells (PBMCs) were isolated from

the whole blood with Ficoll-Paque centrifugation (Amersham

Biosciences, Uppsala, Sweden), and a mixture of granulocytes

was collected from under the PBMC layer. We isolated CD19+ cells from PBMCs by positive selection with CD19 MicroBeads (Miltenyi Biotech, Auburn, CA, USA). First, we negatively selected CD4+ cells by using the CD4+ T cell Isolation Kit II (Miltenyi Biotech, Auburn, CA, USA), and then we positively selected the cells by using CD4 MicroBeads (Miltenyi Biotech, Auburn, CA, USA). The CD19+ and CD4+ cells were incubated for 30 min at 4°C with phycoerythrin- and PerCP-conjugated antibodies (BD Biosciences, San Diego, CA, USA), respectively, for fluorescence-activated cell sorting (FACS) analysis. We measured purities of >90% for CD19+ and >98% for CD4+ cells by flow cytometry (FACS CantoII, BD Biosciences, San Diego, CA, USA). The skin-biopsy-derived fibroblasts were cultured in RPMI medium supplemented with Ham’s F-10 medium, fetal bovine serum (10%), penicillin, and L-glutamine (all cell culture reagents were from GIBCO, Invitrogen, Paisley, UK) in an incubator at 37°C. After reaching ~90% confluence, the cells were trypsinized (Trypsin-EDTA, GIBCO, Invitrogen, Paisley, UK), and the fibroblasts were used for DNA isolation. We performed a standard phenol-chloroform extraction to isolate DNA from CD19+ cells, CD4+ cells, fibroblasts, and the crude granulocyte fraction.

Genotyping with Illumina SNP Arrays and Calling of Large-Scale CNVs

We performed the SNP genotyping experiments by using several

types of Illumina beadchips according to the recommendations of

the manufacturer. Such experiments were performed at two facili-

ties: Hudson Alpha Institute for Biotechnology (Huntsville, AL,

USA) and the SNP Technology Platform (Uppsala University,

Sweden). All Illumina genotyping experiments passed the following

quality-control criteria: The SNP call rate for all samples was

>98%, and the LogRdev value was <0.2. The results from Illumina

SNP arrays consist of two main data tracks: log R ratio (LRR) and

B-allele frequency (BAF)22 (see Figure 1). Deviations of consecutive

probes from normal states are indicative of structural aberrations.

We analyzed Illumina output files by using Nexus Copy Number

version 5.1 (BioDiscovery, CA, USA), which applies a “Rank Segmentation” algorithm based on the circular binary segmentation (CBS) approach.23 The applied version, “SNPRank Segmentation,” an extended algorithm in which BAF values are also included

in the segmentation process, generated both copy-number and

allelic-event calls. We applied the default calling parameters of

the program. The array data for large-scale CNVs reported in this

paper have been submitted to the Database of Genomic Structural

Variation (dbVAR) under the accession number nstd58.

A Method for Detection of Small-Scale CNVs with Illumina SNP Array Data

We developed and applied an algorithm for testing whether

smaller structural variants would also accumulate with age. We

used deviations in BAF as the main tool for detecting candidate

CNV regions because it can detect mosaicism in as few as 5%–7% of cells24,25 and allows uncovering of deletions and duplications as well as copy-number-neutral loss of heterozygosity (CNNLOH). This method uses an in-house-developed R-script26 to perform scans for deviations in BAF values alone and in BAF values together with LRR values in MZ twins. Figure S1 describes this algorithm, which identifies CNV calls for each MZ pair at user-defined thresholds of either ΔBAF or both ΔBAF and ΔLRR.

Our initial tests of the algorithm were based on the entire cohort

of 159 MZ pairs. However, a series of ‘‘trial and error’’ tests sug-

gested that the method is sensitive to the quality of input data,

given that the results were heavily biased toward detection of

putative CNV calls in MZ co-twins with lower quality of genotyp-

ing, as measured by the Nexus Quality (NQ) score. The latter is one

of the features of Nexus Copy Number software. We therefore

defined strict NQ-score-based criteria for inclusion of MZ pairs in

the analysis (see Table S3 and Figure S1), which resulted in the

selection of 87 pairs that were processed further.

We based the final analysis on 87 twin pairs by identifying candidate CNV loci in which BAF values were different between co-twins when multiple thresholds were used. As expected, the number of putative CNV calls between MZ co-twins was highly dependent on the settings of the ΔBAF filtering (Figures S1–S4). Thus, when the settings were too generous in this step, an age-related signal was hidden in large background variation (Figure S2). By using stricter filtering criteria, we found an age-related correlation (Figures 2A and S4C). We trimmed the list of putative CNVs generated by ΔBAF by using a ΔLRR filter of >0.35 so that only loci with differences in both BAF and LRR remained in the final list (Figures 2B and S4D). Hence, the ΔLRR filter removed all loci with copy-number-neutral variation from the list. In the course of tuning ΔBAF (or both ΔBAF and ΔLRR) filtering parameters, we took advantage of three already-known large-scale aberrations that are described in our dataset (Figures 1A–1F, 3, and S5). These worked as ideal internal controls for the validity of our approach as shown in Figures S2–S4. Hence, by plotting the number of calls both including the probes located within the three known aberrations (Figures S2A–S2B, S3A–S3B, and S4A–S4B) and after excluding the probes located within the known aberrations (Figures S2C–S2D, S3C–S3D, and S4C–S4D), we could compare and evaluate the observed and expected results. For example, in Figure S4B, the twin pair TP25-1/TP25-2 stands out because the probes positioned within the large de novo aberration of chromosome 5 (Figure 1) are included in the list of calls. When plotting the same data after excluding probes within this region, we found that the twin pair falls into the cluster of variation similar to that of the other MZ twin pairs (Figure S4D). On the basis of such evaluations, we observed that probes within the three large-scale CNVs were detected (or not, depending on the input file used in the analysis) as predicted by our ΔBAF and ΔLRR algorithm. Therefore, these evaluations provided an internal validation of our approach to detecting de novo small-scale CNVs.
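The two-stage filter described above can be illustrated with a minimal Python sketch. This is not the authors' R script (which is described in Figure S1 and is not reproduced here); it assumes that the between-twin differences are taken as absolute per-probe differences, and the `Probe` class, the function name, and the example values are hypothetical. Only the thresholds (a BAF difference between 0.2 and 0.45, and an LRR difference above 0.35) come from the text.

```python
from dataclasses import dataclass


@dataclass
class Probe:
    """One SNP probe measured in both co-twins (hypothetical container)."""
    snp_id: str
    baf_twin1: float
    baf_twin2: float
    lrr_twin1: float
    lrr_twin2: float


def candidate_calls(probes, baf_min=0.2, baf_max=0.45, lrr_min=None):
    """Return SNP IDs whose between-twin differences pass the thresholds.

    Assumption: differences are absolute per-probe differences.
    If lrr_min is None, only the BAF difference is tested.
    """
    calls = []
    for p in probes:
        d_baf = abs(p.baf_twin1 - p.baf_twin2)
        if not (baf_min < d_baf < baf_max):
            continue
        if lrr_min is not None and abs(p.lrr_twin1 - p.lrr_twin2) <= lrr_min:
            # BAF differs but LRR does not: copy-number-neutral, so dropped.
            continue
        calls.append(p.snp_id)
    return calls


# Made-up example values for three probes in one MZ pair.
probes = [
    Probe("snp_a", 0.50, 0.52, 0.01, -0.02),  # concordant between twins
    Probe("snp_b", 0.50, 0.80, 0.00, -0.40),  # BAF and LRR both differ
    Probe("snp_c", 0.50, 0.75, 0.02, 0.00),   # BAF differs, LRR does not
]

baf_only = candidate_calls(probes)                  # BAF filter alone
baf_and_lrr = candidate_calls(probes, lrr_min=0.35) # BAF plus LRR filter
print(baf_only, baf_and_lrr)
```

In this toy input, the probe with a BAF difference but no LRR difference survives the first filter and is removed by the second, which mirrors the statement that the LRR filter removes copy-number-neutral loci.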


Figure 1. Two Examples of Megabase-Range De Novo Somatic Aberrations

(A) A normal profile of MZ twin TP25-1.

(B) A 32.5 Mb deletion on 5q is shown in nucleated blood cells of co-twin TP25-2. This deletion was uncovered with LRR data from the Illumina SNP array.

(C and D) The BAF profiles of twins TP25-1 (C) and TP25-2 (D). The qPCR experiments showed that 66.2% of nucleated blood cells in TP25-2 had the 5q deletion (i.e., 33.1% fewer copies of the DNA segment, Figure 5). The R-package-MAD (Mosaic Alteration Detection) analysis of the Illumina data suggested that 50.5% of the cells had the 5q deletion when the subjects were 77 years old.

(E) The deviation of BAF values from 0.5 (the allelic fraction of intensity at each heterozygous SNP) was plotted, and the percentage of cells with the 5q deletion was higher when the subjects were 77 years old than when they were 70 years old (t test: p < 0.001). This slow increase in aberrant clones was also supported by the MAD estimate of 48.3% of cells detected when the subjects were 70 years old. The size and position of this deletion is typical of patients with myelodysplastic syndrome (MDS).

(F) A confirmatory array-CGH experiment.

(G–K) Another large somatic event: a terminal CNNLOH encompassing 103 Mb of 4q in ULSAM-697. The LRR and BAF data from Illumina SNP genotyping of samples collected when the subject was 71, 82, 88, and 90 years old are plotted in (G), (H), (I), and (J), respectively. Percentages of cells with the aberration were calculated with the MAD package and are given for each panel.

(K) The proportion of cells with the 4q aberration changes with time, and the changes are significantly different between all samplings at different ages (ANOVA: F(3,25935) = 39087, p < 0.001; Tukey’s test for multiple comparisons). Figure S8 shows other analysis details of the samples collected from ULSAM-697 when he was 90 years old. These analyses include those of fibroblasts and three types of sorted blood cells. The analysis of samples obtained when the subject was 90 years old was performed in duplicate experiments on Illumina 1M-Duo and Omni-Express arrays.


Design of the Nimblegen 135K Custom-Made Tiling-Path Oligonucleotide Array

This tool was designed according to the instructions from Roche-

Nimblegen (Madison, WI, USA) and encompassed 137,545 probes

used for validation of the 138 putative CNVs detected by the Illu-

mina SNP array (Figures 2B, S4C, and S4D). In total, the design

consisted of 98,894 experimental probes and an additional

38,651 backbone control probes distributed across the genome.

The median overlap of probes (i.e., probe spacing) was 30 bp.

This array was applied in cohybridizations of 34 MZ twin pairs

(Figures 2G, 2H, and S6 and Table S4).

Array-Comparative Genomic Hybridization with Nimblegen 720K and 135K Arrays

We performed DNA labeling for both platforms (3 × 720K and 12 × 135K) by using random priming with the Nimblegen Dual-Color DNA Labeling kit (Roche-Nimblegen) according to Nimblegen’s protocol. In brief, test and reference DNA samples (500 ng each) were labeled with Cy3 and Cy5, respectively. The combined test and reference DNA was cohybridized (for 48 hr at 42°C) onto a human comparative genomic hybridization (CGH) 3 × 720K whole-genome tiling array (100718_HG18_WH_CGH_v3.1_HX3, OID:30853; Roche-Nimblegen) or a 12 × 135K custom-designed array (110131_HG18_LF_CGH_HX18, OID:33469; Roche-Nimblegen). The arrays were washed with the Nimblegen Wash Kit. We performed image acquisition with an MS 200 Scanner at 2 μm resolution by using high-sensitivity and autogain settings. We extracted data with NimbleScan v2.6 segMNT,

including spatial correction (LOESS) and qspline fit normalization,

in order to compensate for differences in signal between the two

dyes.27 We generated an experimental metrics report with

NimbleScan v2.6 to verify hybridization quality. We performed

CNV analysis with Nexus Copy Number software version 5.1 by

using default settings (see above). All plots shown in Figures 2G,

2H, and S6 are derived from unaveraged, normalized raw data.

Validation Experiments Involving Quantitative Real-Time Polymerase Chain Reaction

We measured the relative amount of DNA molecules by using

quantitative real-time polymerase chain reaction (qPCR) with

SYBR green to validate the CNV findings from the arrays. qPCRs

[Figure 2 graphic. Recoverable panel content:

(A and B) Number of calls plotted against age of twin pairs for 0.2 < dBAF < 0.45 and for 0.2 < dBAF < 0.45 with dLRR > 0.35. The annotated correlation coefficients are 0.62 and 0.54 (p < 0.001 for both).

(C) Number of calls (0.2 < dBAF < 0.45) by age group. ANOVA: F(8,78) = 7.58, p < 0.001.

(D) Age groups in panel C, listed as group: N (MZ pairs), median age. 1: 10, 8. 2: 10, 19. 3: 9, 29. 4: 10, 65. 5: 10, 68. 6: 10, 72. 7: 10, 76. 8: 10, 78. 9: 8, 82.

(E and F) Longitudinal changes between twins (10 years) and longitudinal changes within individuals. Number of calls (0.2 < dBAF < 0.45) plotted against age at second sampling and age at sampling.

(G and H) BAF and log2 ratio profiles for pair TP63-1/2 at age 76 (region around 100.695–100.710 Mb) and pair TP31-1/2 at age 70 (region around 84.265–84.285 Mb, position of rs6928830 marked).]

Figure 2. Age-Related Accumulation of Small Somatic Structural Rearrangements in 87 Pairs of MZ Twins

(A and B) Linear regression analyses showing that the number of calls increases with age in MZ twin pairs when ΔBAF values are between 0.2 and 0.45 (A) as well as when ΔBAF values are between 0.2 and 0.45 and the LRR deviation is >0.35 (B). Each dot represents data from one MZ twin pair. Details regarding the filtering algorithms used are shown in Figure S1.

(C and D) An analysis of statistical significance for nine age groups of MZ twin pairs when ΔBAF values are between 0.2 and 0.45.

(E and F) Longitudinal data analyses comparing the number of ΔBAF reports (between 0.2 and 0.45) of 18 twin pairs that were sampled twice, 10 years apart. Each point in the plot represents the number of differences within one MZ pair (E). Each line (plotted between the two time points for the same MZ pair) thus represents the change over time of the number of differences within a pair (blue line, increase; red line, decrease; green line, no change). The intraindividual changes for each twin over a period of 10 years are shown in (F). The x axis shows individual ages at the later sampling. On the y axis, the number of differences found between the two samples from the same person at the two time points is shown, and vertical lines connect co-twins.

(G and H) Validation of copy-number imbalance between MZ twins in two pairs (chromosomes 10 and 6, respectively), which were detected by the ΔBAF analysis. The small boxes at the top of both (G) and (H) display original data from Illumina arrays for pairs TP63-1/TP63-2 and TP31-1/TP31-2, respectively. The larger boxes at the bottom of (G) and (H) display raw data from the Nimblegen tiling-path 135K array for these two twin pairs. Each line is drawn to scale and represents data from one oligonucleotide probe. Statistical significance for the results of the Nimblegen array was calculated with the Mann-Whitney U test; values were analyzed for the region of interest (shaded) and for both areas on either side of the control regions. Twenty additional examples of validation experiments are shown in Figure S6. There was no difference between the rates of validation success for the young (n = 8) and old (n = 26) MZ pairs used in these experiments (t test: t = 0.7062, p value = 0.4819), supporting the results from linear-regression analyses. The detailed description of the Nimblegen array is provided in Figure S6 and Table S4.


were performed in 20 μl reactions containing 5 ng genomic DNA, 0.3 μM of each primer, and 1× Maxima SYBR Green/ROX qPCR Master Mix (Fermentas, Vilnius, Lithuania) (for primer sequences, see Table S5). The reactions were incubated at 95°C for 10 min, after which they underwent 40 cycles of 95°C for 15 s and 60°C for 60 s in a Stratagene Mx3000P (Agilent Technologies) machine. The reactions for evaluation of primer efficiencies were performed in duplicate with control DNA (normal human female genomic DNA, Promega Corporation, Madison, WI, USA), whereas all other reactions with test and reference DNA were performed in triplicate; in both instances, the averages were used in analyses. Each primer pair’s efficiency and standard curve are described in Figure S7.

Figure 3. An Example of a Somatic Megabase-Range Aberration

(A, E, and F) A deletion encompassing 12.9 Mb of 20q in MZ twin TP30-1, who was sampled when she was 69 years old.

(B, G, and H) The normal profile of co-twin TP30-2, as detected by LRR and BAF after Illumina SNP array genotyping. R-package-MAD analysis of the Illumina data suggested that 41.5% of the blood cells had the 20q deletion. qPCR validation experiments confirmed this result by showing 39.6% aberrant cells (i.e., 19.8% fewer copies of the DNA segment, Figure 5).

(C and D) Array-CGH validation experiments also confirmed the copy-number variation. The genetic change in MZ twin TP30-1 is another example of an MDS-like aberration, which was uncovered in a subject without a clinical diagnosis of MDS.

Melting-curve analysis was performed in all the experiments, and the results were analyzed with MxPro v4.10 software. We used ultraconserved elements on human chromosomes 3 and 6 (UCE3 and UCE6) as control loci as previously described.28,29 We used the average cycle threshold (Ct) value of UCE6 to normalize the average Ct values of UCE3 and test loci. We used these normalized Ct values to calculate copy-number ratios of test regions. Using the estimated copy-number ratios from UCE3 and the test loci from multiple replicate experiments, we performed t tests for statistical testing.
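As an illustration of how normalized Ct values can be turned into a copy-number ratio, the following Python sketch uses the common comparative Ct (2^-ΔΔCt) calculation. The text does not state the exact formula the authors applied, so this is an assumption; the function name and all Ct values are hypothetical, and the default amplification factor of 2 assumes perfect primer efficiency (the measured efficiencies are reported in Figure S7).

```python
from statistics import mean


def copy_number_ratio(ct_target_test, ct_control_test,
                      ct_target_ref, ct_control_ref, efficiency=2.0):
    """Copy-number ratio of a target locus, test sample versus reference sample.

    Each argument is a list of replicate Ct values. The control locus plays
    the role of UCE6 in the text. Assumes the comparative Ct method.
    """
    # Normalize the target Ct to the control locus within each sample.
    d_ct_test = mean(ct_target_test) - mean(ct_control_test)
    d_ct_ref = mean(ct_target_ref) - mean(ct_control_ref)
    # A higher normalized Ct in the test sample means fewer template copies.
    return efficiency ** -(d_ct_test - d_ct_ref)


# Made-up triplicate Ct values: the test sample amplifies half a cycle later.
ratio = copy_number_ratio(
    ct_target_test=[25.5, 25.5, 25.5], ct_control_test=[24.0, 24.0, 24.0],
    ct_target_ref=[25.0, 25.0, 25.0], ct_control_ref=[24.0, 24.0, 24.0],
)
print(round(ratio, 3))
```

With these invented values the test sample has roughly 0.71 times the copies of the reference sample at the target locus, that is, about 29% fewer copies.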

Statistical Methods

The statistical analyses were performed with the R 2.12–2.13 software.26 We used methods such as linear regression, t tests, and one-way analyses of variance (ANOVAs) when suitable, as further specified in the text. Prior to testing, we checked the data to confirm that no test assumptions were violated. For multiple comparisons (i.e., Figures 1K and S8G), we used the Tukey honest-significant-difference method by using the TukeyHSD function in R. When appropriate, we performed the nonparametric Fisher’s exact test and Mann-Whitney U test, as described in the text.
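The analyses themselves were run in R. Purely as an illustration of what a one-way ANOVA computes, the Python sketch below derives the F statistic and its degrees of freedom from grouped data by hand. The function name and the toy data are hypothetical, and the sketch does not compute a p value or the Tukey adjustment.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of groups of numbers."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    # Variation of group means around the grand mean, weighted by group size.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of observations around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Toy data: three groups of three observations each.
f_stat, df_b, df_w = one_way_anova_f([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
print(f_stat, df_b, df_w)
```

The same degrees-of-freedom rule explains the F(8,78) annotation in Figure 2: nine age groups give 8 between-group degrees of freedom, and 87 twin pairs minus nine groups give 78 within-group degrees of freedom.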

Boxplots of Longitudinal-Analysis Data

Heterozygous SNPs have a theoretical expected BAF value of 0.5,

and deviations from this normal state can be indicative of struc-

tural aberrations.24 We can therefore use changes in the magni-

tude of these deviations in the subjects’ longitudinal samples to

measure intraindividual changes over time and to estimate the

proportion of cells affected by large-scale aberrations. We

produced the boxplots in Figures 1E, 1K, 4J, S9D, S9G, and S8G

to visualize such changes in BAF variation. In these figures, we

plotted the absolute deviation of BAF values from 0.5 for all

heterozygous SNPs in the region of interest (i.e., ABS(0.5 − BAF))


on the y axes. We only included heterozygous SNPs (i.e., those

with a BAF value between 0.2 and 0.8) in these calculations to

increase quality and accuracy of the plots. A larger BAF value devi-

ation from 0.5 corresponds to a larger degree of mosaicism, i.e.,

a higher proportion of cells with a specific aberration. We used t

tests (in cases with two factor levels) or one-way ANOVAs (in cases

with >2 factor levels) to test for significance of such differences.

For themodel illustrated in Figures 1K and S8G, we used the Tukey

Figure 4. Longitudinal Analysis of ULSAM-340, a Single-Born Subject Containing a 13.8 Mb Deletion on 20q, as Detected by LRR andBAF with the Illumina SNP ArrayThe size and position of this deletion is typical of MDS patients. This subject, however, has not been diagnosed with MDS. When thepatient was 71 years old, the deletion was only carried by a small proportion of blood cells and was barely detectable, and neither NexusCopyNumber software nor R-packageMAD reported this aberration at this age (A, D, and E). R-packageMAD suggested that 50.7% of thenucleated cells had the deletion when ULSAM-340 was 75 years old (B, F, and G) and that when he was 88 years old, the correspondingproportion of cells was 36.1% (C, H, and I). qPCR validation experiments showed that the sample taken when the patient was 88 yearsold contained 14.5% fewer copies of DNA in the segment as compared to the sample taken when he was 75 years old (Figure 5). Thedeviations from 0.5 of the BAF values within the deleted region in the three different sampling stages are illustrated in (J).


post-hoc test for multiple comparisons to compute differences

between factor-level means after adjusting p values for the

multiple testing.
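
The BAF-deviation measure described above can be sketched in a few lines. This is a minimal illustration, not the authors' R code: the function name `baf_deviation`, the strict 0.2 and 0.8 bounds, and the BAF values are all assumptions made for the example (the text does not say whether the bounds are inclusive, and the values are not study data).

```python
from statistics import mean

def baf_deviation(baf_values, low=0.2, high=0.8):
    """Return |0.5 - BAF| for SNPs treated as heterozygous (low < BAF < high)."""
    return [abs(0.5 - b) for b in baf_values if low < b < high]

# Hypothetical BAF values for one region at two sampling ages (not study data).
baf_age_71 = [0.48, 0.52, 0.51, 0.49, 0.02, 0.99, 0.50]
baf_age_75 = [0.34, 0.66, 0.35, 0.67, 0.01, 0.98, 0.65]

dev_71 = baf_deviation(baf_age_71)
dev_75 = baf_deviation(baf_age_75)

# A larger mean deviation suggests a larger proportion of aberrant cells.
print(round(mean(dev_71), 3), round(mean(dev_75), 3))
```

The two resulting lists are the kind of per-SNP values that would go into a boxplot and into a t test or ANOVA across sampling ages.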

Quantification of the Number of Cells Affected by

Megabase-Range Aberrations

We calculated the approximate percentage of cells affected by

aberrations in the megabase range by using data from qPCR exper-

iments (the data are described in Figure 5). The qPCR measure-

ments provided the approximate number of DNA molecules that

are affected by an aberration. Assuming that an aberration affects

only one chromosome (i.e., an aberration that is a heterozygous

event) in a diploid genome, we used this number and converted

it to the approximate number of affected cells. Our assumption

is reasonable, given that we are studying normal cells and that

the size of these large-scale aberrations renders them unlikely to

affect both chromosomes (i.e., they are unlikely to be homozygous

[biallelic] events). For example, the relative number of DNA copies

in nucleated blood cells of twin TP25-2 at the age of 77 years

confirmed the array data. To determine these numbers, we used

two primer pairs (41.1 and 42.1) designed within the deleted

region and took five independent measurements for both primer

pairs. These experiments suggested that, at the age of 77, twin

TP25-2 had 30.8% (when primer pair 41.1 was used) and 35.4%

(when primer pair 42.1 was used)—an average of 33.1%—fewer

DNA copies with a 32.5 Mb 5q deletion than did her co-twin at

the same age (Figure 5). If one assumes that this deletion is

affecting one chromosome in a diploid cell, our calculations

suggest that 66.2% of cells contain this deletion.
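
The conversion from a qPCR copy-number deficit to a proportion of affected cells follows directly from the heterozygous-event assumption stated above: each affected diploid cell loses one of two copies, so the relative loss of copies is half the fraction of affected cells. A minimal sketch, in which the function name is hypothetical and the inputs are the two primer-pair values quoted in the text:

```python
def cells_with_heterozygous_deletion(fraction_fewer_copies):
    """Fraction of cells carrying a one-copy deletion, given the relative copy loss.

    Assumes a diploid genome and a deletion on one chromosome only,
    so copy loss = (fraction of affected cells) / 2.
    """
    return 2.0 * fraction_fewer_copies

# Values reported for twin TP25-2 at the age of 77.
primer_41_1 = 0.308
primer_42_1 = 0.354

mean_loss = (primer_41_1 + primer_42_1) / 2
affected = cells_with_heterozygous_deletion(mean_loss)
print(f"{mean_loss:.3f} {affected:.3f}")
```

The same doubling gives 39.6% of cells for the 19.8% deficit reported for twin TP30-1 in Figure 5.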

In order to quantify the level of mosaicism, we also applied an

alternative, published method19,30 based on calculations of the

deviation of BAF values from the expected value of 0.5 for the

heterozygous SNPs in a normal state. This method has been

tailored for data derived from the Illumina SNP platform. The

R-package MAD (Mosaic Alteration Detection) version 0.5–9 (reference 30)

identifies the aberrant regions, such as deletions, gains, and

CNNLOHs, and calculates the B deviation (Bdev, deviation from

the expected BAF value of 0.5 for heterozygous SNPs) value, which

is then used for calculation of the number of cells affected by the

aberration. We used the following modified version of the pub-

lished19 formula for deletions, gains, and CNNLOHs:

$$\text{Proportion of cells with aberration} = \frac{2\,B_{\mathrm{dev}}}{0.5 + B_{\mathrm{dev}}}$$
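
As a consistency check on the formula, the sketch below applies it to the B deviation expected for a heterozygous deletion. The deletion model is our own assumption for illustration: if a fraction p of cells has lost one copy and the B allele is retained, the BAF at a heterozygous SNP is 1/(2 − p). The function names are hypothetical and are not part of the MAD package.

```python
def proportion_from_bdev(bdev):
    """Proportion of cells with the aberration, from the B deviation (Bdev)."""
    return 2.0 * bdev / (0.5 + bdev)

def bdev_for_deletion(p):
    """Expected Bdev at a heterozygous SNP when a fraction p of cells lost one copy.

    Assumed model: the B allele is retained, so BAF = 1 / (2 - p).
    """
    return 1.0 / (2.0 - p) - 0.5

# Round trip: the formula should recover the proportion used to build Bdev.
for p in (0.05, 0.25, 0.5, 0.9):
    recovered = proportion_from_bdev(bdev_for_deletion(p))
    print(p, round(recovered, 6))
```

Under this model the formula inverts the BAF shift exactly. Whether the same holds for gains and CNNLOHs depends on the allelic model used for those events, which is not spelled out here.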

Results

Age-Related Accumulation of Megabase-Range

Structural Variants

Our analysis of 159 MZ pairs involved genotyping with

Illumina 600K SNP arrays, confirmation of monozygosity

(>99.9% genotype concordance), CNV calling with Nexus

Copy Number software (BioDiscovery, CA, USA), followed

by inspection of genomic profiles. Validation was per-

formed with a different Illumina array, Nimblegen array,

and qPCR. Comparison of MZ twin pairs, including 19

previously reported pairs,21 identified five large de novo

aberrations of >1 Mb among 81 young or middle-aged

(≤55 years) and 78 elderly (≥60 years) pairs studied

(Figures 1, 3, 5, and S5). All five large rearrangements

occurred in the older twins, suggesting a relationship

between age and the presence of changes. Tables S1 and

S2 show a description of subjects, cohorts, and statistical

support for the use of Illumina data for the detection of

variants. We expanded on the results from twins by using

two age-stratified groups of single-born subjects. First, we

genotyped DNA from 108 men, all 88 years old, from the

ULSAM (Uppsala Longitudinal Study of Adult Men) cohort

by using the Illumina-1M-Duo array. We found that four

subjects had large-scale rearrangements at the age of

88 years, and the somatic nature of such rearrangements

was established by examination of samples taken from

the same individuals at other time points (Figures 1, 4, 5,

and S8–S10 and Table S1). Second, for the young or

middle-aged single-born control cohort (33–55 years), we

used existing Illumina 550K data from 180 controls from

the ADVANCE (Atherosclerotic Disease, Vascular Function,

and Genetic Epidemiology) study.31,32 Analogous analysis

of ADVANCE subjects did not reveal any cases of large-scale

aberrations. The genotyping quality of 550K experiments

is at least as good as the quality of 1M-Duo arrays, and

the resolution of the 550K array is sufficient for detection

of ~1 Mb aberrations that have been uncovered in the

ULSAM cohort (Figures S11 and S12 and Table S6). In

fact, we described a 1.6 Mb deletion by using the 300K

array in twin D8,21 and literature comparing arrays

suggests that the 250K level is sufficient for uncovering

submegabase-range changes.28,33 By studying the twins

and the single-born individuals and by analyzing the two

groups together, we obtained firm statistical support for

age-related accumulation of large structural variants

(with Fisher’s exact test; p value = 0.00052) (Table S2).

Overall, 3.4% of the studied population ≥60 years old

carries cells containing megabase-range somatic aberra-

tions that are readily detectable by array-based scanning,

whereas none of the younger controls carried aberrations

in this size range. The sensitivity of our analysis to detect

aberrant clones is about 5% of nucleated blood cells.24,25

A previous estimate of 1.7% for somatic mosaicism was

performed in an analysis that was not stratified by age.19
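
The Fisher's exact test above can be approximated from the counts given in the text. The 2 × 2 table below is our reconstruction, not a copy of Table S2. It assumes individuals are counted rather than pairs: 78 elderly pairs plus 108 ULSAM men gives 264 elderly subjects with 9 carriers (9/264 ≈ 3.4%), and 81 younger pairs plus 180 ADVANCE controls gives 342 younger subjects with no carriers. The implementation uses only the standard library.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2
    denom = comb(total, col1)

    def prob(x):
        # Hypergeometric probability of x carriers falling in the first row.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum all tables that are at most as probable as the observed one.
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))

# Assumed counts: rows are elderly and younger subjects; columns are carriers
# and non-carriers of megabase-range aberrations.
p_value = fisher_exact_two_sided(9, 255, 0, 342)
print(f"{p_value:.5f}")
```

With these assumed counts the result is close to the reported p value of 0.00052, which suggests the per-individual reading of the cohorts is the one used.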

Five subjects harboring large CNVs (twin TP25-2 and

ULSAM-102, -298, -340, and -697) were followed in

repeated samplings collected up to 19 years apart. They

all showed accumulation of aberrant cells with a variation

in the rate of this process. Twin TP25-2 is an example of

slow accumulation of a 5q-deletion clone (Figure 1);

when this twin was 77 years old, two independent

methods (qPCR and MAD-program-based) suggested

that 66.2% and 50.5% of cells, respectively, contained

a deletion on one copy of chromosome 5. The change in

deviation of BAF within the deleted region when twin

TP25-2 was 70 and 77 years old translates into a 2.2%

increase in cells with the 5q deletion. The latter estimation

was based on analysis with the MAD program. It is note-

worthy that the size and position of this 5q deletion are

typical of myelodysplastic syndrome (MDS).34–38 However,

twin TP25-2 has not been diagnosed with this disease.


[Figure 5 plot. Bar charts; y axis: relative amount of DNA molecules (%); bars: control region UCE3 and test loci.

Panel A: MZ pair TP25-1/2 at the age of 77, Chr. 5 locus 41.1 (n = 5; ~30.8% fewer DNA copies in test locus in twin TP25-2, p = 0.0149) and Chr. 5 locus 42.1 (n = 5; ~35.4% fewer, p < 0.001); MZ pair TP30-1/2 at the age of 69, Chr. 20 locus 45.1 (n = 5; ~19.8% fewer in twin TP30-1, p < 0.001); ULSAM-340 at the ages of 75 and 88, Chr. 20 locus 45.1 (n = 6; ~14.5% fewer at the age of 88, p < 0.001); ULSAM-102 at age 88 versus f-gDNA, Chr. 1 locus rs540796 (n = 5) and Chr. 8 locus rs9298462 (n = 5), with ~49.1% and ~34.7% more DNA copies in the test locus compared to reference DNA (p < 0.001 and p = 0.0015; assignment to the two loci is not recoverable from the extracted text).

Panel B: MZ pair TP31-1/2 at the age of 69 (SNP rs6928830, n = 8); MZ pair TP19-1/2 at the age of 75 (SNP rs329312, n = 9); MZ pair TP63-1/2 at the age of 76 (SNP rs4635020, n = 6); MZ pair TP16-1/2 at the age of 77 (SNP rs4841318, n = 7); MZ pair TP63-1/2 at the age of 76 (SNP rs708039, n = 11). Annotations in extraction order: ~8.9% fewer DNA copies (p = 0.0449), ~14.2% (p < 0.0001), ~7.8% (p = 0.0057), ~5.9% (p = 0.0458), and ~5.7% (p = 0.0101).]

Figure 5. Validation of de novo CNVs by qPCR with SYBR Green

Eleven independent qPCR experiments, each composed of multiple (5–11) independent measurements, are shown. The relative number of DNA copies in both test loci (white bars) and the control region UCE3 (gray bars) were plotted. Before we plotted and performed statistical analyses with t tests, we normalized all Ct values by using the control region UCE6. Figure S7 shows the determination of primer efficiency for each of the primer pairs.

(A and B) Validations for five large-scale (A) and five small-scale (B) aberrations. The dotted line drawn at 100% represents the copy-number state in control DNA (i.e., that from the normal MZ co-twin, or human female control DNA, or DNA from the same subject sampled at another age), and error bars indicate standard error of means.

(A) The 5q deletion in twin TP25-2 (Figure 1) was validated with two primer pairs (41.1 and 42.1) designed within the deleted region. In total, ten independent qPCR experiments showed that ~66.2% of all nucleated blood cells in TP25-2 had the 5q deletion (i.e., an average of 33.1% [30.8% with primer pair 41.1 and 35.4% with primer pair 42.1] fewer copies of the DNA segment). Similarly, the 20q deletion in twin TP30-1 (Figure 3) was validated with primer pair 45.1 in five experiments. The 19.8% fewer DNA copies found in the test locus indicates that 39.6% of the nucleated blood cells had the deletion. For ULSAM-340, the array data indicated a longitudinal somatic change in the number of cells carrying the 20q deletion. Six independent qPCR experiments comparing DNA sampled when ULSAM-340 was 75


ULSAM-102 is another example of slow accumulation and

contains gains on 1p and 8q (Figure S9). The 1p gain is

stable, whereas the 8q gain shows a statistically significant

(ANOVA: p value <0.05) increase over a period of 10 years.

Consequently, ULSAM-102 probably carries two coexisting

clones with different aberrations. In ULSAM-340 and -697,

the rate of accumulation was faster and there was a decrease

in the proportion of cells with aberrations at later sam-

plings. ULSAM-340 contains a 20q deletion, which was

barely detectable at the age of 71 (Figure 4). The number

of cells containing the 20q deletion was estimated by anal-

ysis with the MAD program to be 50.7% when ULSAM-340

was 75 years old and to be 36.1% when he was 88 years

old. ULSAM-340 is another example of an aberration

typical of MDS in a subject without this diagnosis.

However, his clinical history includes thrombocytopenia,

which is normally a part of MDS clinical features. We therefore

speculate that this symptom might be due to clonal

expansion of cells with a 20q deletion and suppression of

normal thrombocyte production. Finally, ULSAM-697

was analyzed four times and shows the most pronounced

increase and decrease in the number of cells with

CNNLOH of 4q (Figures 1 and S8). This aberration was

not detectable at the age of 71, reached 58.4% at the age

of 88, and decreased radically to 29.9% of cells at the age

of 90. When ULSAM-697 was 90 years old, we profiled

sorted CD4+ cells, CD19+ cells, granulocytes, and fibroblasts,

in addition to whole-blood DNA. CD4+ cells, granulocytes,

and whole blood showed similar levels of

aberrant cells, whereas CD19+ cells and fibroblasts appeared

normal. We performed all experiments on samples

taken when ULSAM-697 was 90 years old in duplicate with

different types of arrays. Thus, in ULSAM-697, both

lymphoid and myeloid cells were affected, except for, quite

surprisingly, CD19+ B cells. Overall, the analyses performed

on ULSAM-340 and ULSAM-697 suggest that the

cells with aberrations have a higher proliferative potential

than do other cells in the immune system, but they are not

immortalized because they apparently disappear from

circulation.

Small-Scale Structural Aberrations Also Display

Positive Correlation with Age

Given the above results, we tested whether smaller struc-

tural variants would also accumulate with age, and we

used deviations in BAF as the main detection tool because

they can detect mosaicism in as low as 5%–7% of cells24,25

and allow detection of deletions and duplications as well as

CNNLOH. We performed scans for deviations in BAF

values alone and BAF together with LRR in twins by using

a new R-script (Figure S1) that identifies CNV calls for each

MZ pair at various thresholds of ΔBAF and ΔLRR. Early

analyses showed that the algorithm was sensitive to the

quality of genotyping because calls were preferentially

observed in co-twins with lower data quality. We therefore

applied strict inclusion criteria by using the NQ score,

which is based on genome-wide noise measurements.

This resulted in the selection of 87 out of 159 MZ pairs

(Table S3). We found that small putative CNVs increased

with age (Figure 2A, linear regression F(1,85) = 54.00,

p < 0.001, Figures S2–S4). We further narrowed the

number of calls by combining the ΔBAF and ΔLRR

values >0.35 from both twins in each MZ twin pair, and

this process also indicated that these CNVs accumulate

with age (Figure 2B; F(1,85) = 34.60, p < 0.001). We also

tested whether genotyping quality (the ΔNQ value is the absolute

value of the difference in quality score within pairs)

might explain the observed pattern. Importantly, ΔNQ

showed no relationship with age (F(1,85) = 1.85, p > 0.05), suggesting

that the positive correlation with age reflects true

aberrations. Figure 2B displays a total of 827 CNV calls at

378 loci in 87 pairs with an age span of 3–86 years. Plotting

of the 378 calls against the genome shows the nonrandom

distribution and recurrent nature of these CNV calls

(Figure S13). On the basis of frequency and/or location in

the vicinity of known genes, we selected 138 loci for vali-

dation by using a tiling-path array (Nimblegen 135K) in

34 twin pairs. With this platform, 15% of putative CNVs

were validated in the same twin pairs in which they were

first detected by ΔBAF and ΔLRR analysis. There was no

bias in the success rate of validation between younger

and older groups (t test: t = 0.7062, p value = 0.4819). In

total, 52 of the 138 loci (38%) included on the 135K array

showed CNVs within 32 of the 34 MZ pairs tested (Figures

2G, 2H, and S6), and the majority of CNVs encompassed

<1 kb. The reason for the discrepancy (i.e., 15% versus

38%) in the validation success rates mentioned above is

probably due, at least in part, to the high stringency of

the ΔBAF and ΔLRR analysis that only reported a subset

of preferentially strong calls representing structural vari-

ants and the recurrent nature of loci that are affected by

the small-scale variation. Hence, some true structural vari-

ants were validated in (often multiple) MZ pairs on the

135K array, even though the initial ΔBAF and ΔLRR analysis

did not pick them up because the filtering parameters

were too stringent. We selected 5 of these 52 loci for further

validation with qPCR, and all five were confirmed by this

alternative approach (Figure 5). We also performed break-

point-PCR validation in 17 out of the above 52 loci by

using PCR across the deleted region in instances that

and 88 years old showed that the subject had 14.5% fewer copies of the DNA segment when he was 88 years old. In ULSAM-102, the Illumina array identified a duplication event on both chromosomes 1 and 8 (Figure S9). Given that the proportion of cells with a gained segment in this subject was relatively stable over time, we used human female genomic DNA as control DNA in these experiments. The qPCR experiments validated both somatic CNVs.

(B) qPCR validation of five loci with small-scale de novo CNVs within MZ twins. These loci were identified by Illumina array genotyping and were confirmed on the Nimblegen 135K array (see also Figures 2G, 2H, and S6). The layout of this panel is similar to that of (A), described above. For example, the first locus (rs6928830) illustrates de novo CNVs in twin TP31-1 (Figure 2H).


were presumed to represent the shortest deletions based on

the Illumina and Nimblegen 135K array data. However,

these attempts were not successful. We obtained correctly

sized PCR bands representing wild-type alleles for tested

loci. However, we could not detect any shorter, mutated

alleles that were mapped to the correct genomic regions.

These validation experiments included gel purification of

PCR fragments, PCR-fragment analysis, subcloning in plas-

mids, and Sanger sequencing (details not shown). These

results suggest that the vast majority of the uncovered

small structural variants are due to more complex rear-

rangements involving deletions or gains embedded

together with other structural changes. These results are

in agreement with a recent sequencing-based validation

analysis of CNV loci; the analysis showed that as few as

5% of CNVs suspected to represent gains or deletions are

in fact ‘‘pure blunt-end breakpoints.’’39 Details for the 52

validated loci are shown in Table S4, which includes infor-

mation about genes affected by the variation. The results

presented in Table S4 and Figure S13 emphasize the

recurrent nature of the 52 validated loci. For example,

out of the 52 loci, 13 only occurred once in any of the 34

tested twin pairs, whereas the remaining 39 were recurrent

and occurred 2–16 times in the same set of MZ twin

pairs. The number of CNVs per pair validated with the

135K Nimblegen array ranged from 1 to 32 (median 6)

(Table S7). In summary, the deviation between MZ co-

twins ranged from 0 to 51,040 bp (median 4,995 bp),

and the latter corresponds to a genome-wide divergence

of ~0.0000016 (~0.00016%).
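
The age trends reported earlier in this section were tested by linear regression and quoted as F(1,85), which matches a single predictor (age) fitted across 87 twin pairs (87 − 2 = 85 residual degrees of freedom). The sketch below shows how such an F statistic is obtained for a simple regression. The function name is hypothetical and the ages and call counts are synthetic values for illustration only, not data from the study.

```python
def regression_f(x, y):
    """F statistic and degrees of freedom for a simple linear regression of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    ss_regression = slope * sxy                      # variation explained by x
    ss_total = sum((yi - mean_y) ** 2 for yi in y)
    ss_residual = ss_total - ss_regression
    df_residual = n - 2
    f_stat = ss_regression / (ss_residual / df_residual)
    return f_stat, (1, df_residual)

# Synthetic example: age of each pair and number of CNV calls (illustrative only).
ages = [10, 25, 40, 55, 70, 85]
calls = [2, 3, 7, 8, 12, 15]
f_stat, df = regression_f(ages, calls)
print(round(f_stat, 2), df)
```

A large F with (1, n − 2) degrees of freedom indicates that the slope on age explains far more variation than the residual scatter; the p value would then come from the F distribution with those degrees of freedom.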

By using the small-scale CNV pipeline, we analyzed 18

pairs of MZ twins that were sampled twice, 10 years apart

(Figures 2E, 2F, and S1 and Table S8). Analyses were per-

formed in two ways: as an interindividual comparison of

one twin to its co-twin at the first and second sampling

and as an intraindividual comparison of the two samplings

of a single twin. Both types of comparisons suggest varia-

tion in the dynamics of changes between co-twins and

show both increases and decreases over a period of 10 years

in the number of calls in different twin pairs. Interestingly,

this evidence for the dynamics of small-scale CNVs over

time (Figure 2E) is consistent with the results from longitu-

dinal analyses of large-scale aberrations in ULSAM-697 and

ULSAM-340 (Figures 1 and 4), suggesting both increases

and decreases over time in the number of cells containing

different variants.

Discussion

The phenotypic consequences of accumulating aberra-

tions are an interesting aspect of our results. In two

subjects diagnosed with chronic lymphocytic leukemia

(CLL), we detected multiple changes consistent with the

disease (Figure S10). These findings are not unexpected:

Our population-based cohort was not preselected against

any diagnoses, and CLL is the most prevalent leukemia

among the elderly.40 However, it is surprising that appar-

ently healthy subjects have aberrations characteristic of

MDS. A typical 5q deletion (observed in one subject) and

a 20q deletion (observed in two subjects) are among the

most common aberrations in patients diagnosed with

MDS.34–38 Trisomy 8 is also a recurrent aberration in

MDS, and ULSAM-102 displays a restricted 8q gain; it

remains unclear whether this gain is related to MDS.

None of the above-mentioned individuals were diagnosed

with MDS, and their cases might represent an indolent,

subclinical form of MDS. In two individuals followed in

longitudinal sampling (i.e., ULSAM-340 and -697), we

observed not only an increase but also a clear subsequent

decrease in the proportion of nucleated blood cells with

aberrations (Figures 1, 4, and S8). These results suggest an

‘‘autocorrection’’ of the immune system, given that the

aberrant clones are apparently disappearing from circula-

tion. Similar expansions of preleukemic clones containing

gene fusions specific to acute leukemia have been

described in newborns;41 the gene fusions TEL-AML1 and

AML1-ETO were present in cord blood at a frequency

100× greater than the frequency that is associated with

the risk of developing the corresponding leukemia.

The presented data are probably only part of all the

somatic changes that actually occurred in the studied

cohorts because balanced inversions and translocations

escape our detection and because we interrogated a fraction

of all the nucleotides in the genome. Furthermore, we only

detected high-frequency aberrations, presumably because

these aberrations provided the affected cells with a prolifer-

ative advantage, which led to clonal expansion above

the detection limit of ~5% of cells. It follows from this

reasoning that deleterious aberrations leading to prolifera-

tive disadvantage or aberrations that are neutral from the

point of view of the proliferative potential go undetected.

Nevertheless, the chromosomal regions (e.g., those that

contain the 20q deletion) and loci affected in a recurrent

fashion (Figure S13 and Table S4) are candidates for con-

taining common and redundant age-related defects in

human blood cells. These mutations are presumed to

provide the affected cells with a mild proliferative advan-

tage without transforming the affected cells into immortal-

ized cancer clones. However, the proliferative advantage

for a limited number of cells will most likely affect the

overall complexity of cell clones present in blood and

should therefore be discussed in the context of immunose-

nescence, which, in fact, involves loss of complexity of cell

clones in both B and T cell lineages.42,43 Our results might

therefore help to explain the cause of age-related reduction

in the number of cell clones in the blood. This reduction

could lead to a less diverse immune system caused by the

accumulation of genetic changes that induce the expan-

sion of a limited number of clones. We also anticipate

that extension of our work will allow determination of

the genetic age of different somatic cell lineages and esti-

mation of possible individual differences between genetic

and chronological age.


Supplemental Data

Supplemental Data include 13 figures and eight tables and can be

found with this article online at http://www.cell.com/AJHG.

Acknowledgments

We thank Lars Feuk, Brigitte Schlegelberger, Jacek Witkowski, Greg

Cooper, Richard Rosenquist Brandell, Eva Hellstrom-Lindberg,

Chris Gunther, and Eva Tiensuu Janson for critical review of the

manuscript and Larry Mansouri and Juan R. Gonzalez for method-

ological advice. This study was sponsored by grants from the

Ellison Medical Foundation (J.P.D. and D.A.) and from the Swedish

Cancer Society, the Swedish Research Council, and the Science for

Life Laboratory-Uppsala (J.P.D.). A.P. acknowledges FOCUS 4/2008

and FOCUS 4/08/2009 grants from the Foundation for Polish

Science. Genotyping was performed in part by the SNP&SEQ

Technology Platform, which is supported by Uppsala University,

Uppsala University Hospital, the Science for Life Laboratory–

Uppsala, and the Swedish Research Council (contracts 80576801

and 70374401).

Received: November 10, 2011

Revised: December 6, 2011

Accepted: December 14, 2011

Published online: February 2, 2012

Web Resources


GenePipe PrimerZ, http://genepipe.ngc.sinica.edu.tw/primerz/

Illumina Beadchip information, http://www.illumina.com/documents/products/appnotes/appnote_cytogenetics.pdf

R 2.12–2.13 software, http://www.r-project.org/

Roche-Nimblegen array CGH Protocols, http://www.nimblegen.com/

R-package MAD version 0.5–9, http://www.creal.cat/jrgonzalez/software.htm

Surveillance Epidemiology and End Results (SEER) Program Fast Stats, http://seer.cancer.gov/faststats/

The Gene Ontology, http://www.geneontology.org/

The Genetic Association Database, http://geneticassociationdb.nih.gov/

The HUGO Gene Nomenclature Committee, http://www.genenames.org/

University of California Santa Cruz Human Genome Browser, http://genome.cse.ucsc.edu/cgi-bin/hgGateway

Accession Numbers

The array data for large-scale CNVs reported in this paper have

been submitted to the Database of Genomic Structural Variation

(dbVAR) under the accession number nstd58.

References

1. Conrad, D.F., Pinto, D., Redon, R., Feuk, L., Gokcumen, O.,

Zhang, Y., Aerts, J., Andrews, T.D., Barnes, C., Campbell, P.,

et al; Wellcome Trust Case Control Consortium. (2010).

Origins and functional impact of copy number variation in

the human genome. Nature 464, 704–712.

2. Itsara, A., Cooper, G.M., Baker, C., Girirajan, S., Li, J., Absher,

D., Krauss, R.M., Myers, R.M., Ridker, P.M., Chasman, D.I.,

et al. (2009). Population analysis of large copy number vari-

ants and hotspots of human genetic disease. Am. J. Hum.

Genet. 84, 148–161.

3. van Ommen, G.J. (2005). Frequency of new copy number vari-

ation in humans. Nat. Genet. 37, 333–334.

4. Lupski, J.R. (2007). Genomic rearrangements and sporadic

disease. Nat. Genet. 39 (7 Suppl), S43–S47.

5. Itsara, A., Wu, H., Smith, J.D., Nickerson, D.A., Romieu, I.,

London, S.J., and Eichler, E.E. (2010). De novo rates and selec-

tion of large copy number variation. Genome Res. 20, 1469–

1481.

6. Harley, C.B., Futcher, A.B., and Greider, C.W. (1990). Telo-

meres shorten during ageing of human fibroblasts. Nature

345, 458–460.

7. Vaziri, H., Schachter, F., Uchida, I., Wei, L., Zhu, X., Effros, R.,

Cohen, D., and Harley, C.B. (1993). Loss of telomeric DNA

during aging of normal and trisomy 21 human lymphocytes.

Am. J. Hum. Genet. 52, 661–667.

8. Lee, H.C., Pang, C.Y., Hsu, H.S., and Wei, Y.H. (1994). Differential

accumulations of 4,977 bp deletion in mitochondrial DNA

of various tissues in human ageing. Biochim. Biophys. Acta

1226, 37–43.

9. Fraga, M.F., Ballestar, E., Paz, M.F., Ropero, S., Setien, F., Balles-

tar, M.L., Heine-Suner, D., Cigudosa, J.C., Urioste, M., Benitez,

J., et al. (2005). Epigenetic differences arise during the lifetime

of monozygotic twins. Proc. Natl. Acad. Sci. USA 102, 10604–

10609.

10. Mohamed, S.A., Hanke, T., Erasmi, A.W., Bechtel, M.J.,

Scharfschwerdt, M., Meissner, C., Sievers, H.H., and Gosslau,

A. (2006). Mitochondrial DNA deletions and the aging heart.

Exp. Gerontol. 41, 508–517.

11. Flores, M., Morales, L., Gonzaga-Jauregui, C., Domínguez-Vidana, R.,

Zepeda, C., Yanez, O., Gutierrez, M., Lemus, T.,

Valle, D., Avila, M.C., et al. (2007). Recurrent DNA inversion

rearrangements in the human genome. Proc. Natl. Acad. Sci.

USA 104, 6099–6106.

12. Sloter, E.D., Marchetti, F., Eskenazi, B., Weldon, R.H., Nath, J.,

Cabreros, D., and Wyrobek, A.J. (2007). Frequency of human

sperm carrying structural aberrations of chromosome 1

increases with advancing age. Fertil. Steril. 87, 1077–1086.

13. Frank, S.A. (2010). Evolution in health and medicine Sackler

colloquium: Somatic evolutionary genomics: Mutations

during development cause highly variable genetic mosaicism

with risk of cancer and neurodegeneration. Proc. Natl. Acad.

Sci. USA 107 (Suppl 1), 1725–1730.

14. Lynch, M. (2010). Evolution of the mutation rate. Trends

Genet. 26, 345–352.

15. Youssoufian, H., and Pyeritz, R.E. (2002). Mechanisms and

consequences of somatic mosaicism in humans. Nat. Rev.

Genet. 3, 748–758.

16. Erickson, R.P. (2010). Somatic gene mutation and human

disease other than cancer: An update. Mutat. Res. 705, 96–106.

17. De, S. (2011). Somatic mosaicism in healthy human tissues.

Trends Genet. 27, 217–223.

18. Dumanski, J.P., and Piotrowski, A. (2012). Structural genetic

variation in the context of somatic mosaicism. In Genomic

Structural Variation, L. Feuk, ed. (New York: Humana Press).

19. Rodríguez-Santiago, B., Malats, N., Rothman, N., Armengol,

L., Garcia-Closas, M., Kogevinas, M., Villa, O., Hutchinson,

A., Earl, J., Marenne, G., et al. (2010). Mosaic uniparental


disomies and aneuploidies as large structural variants of the

human genome. Am. J. Hum. Genet. 87, 129–138.

20. Piotrowski, A., Bruder, C.E., Andersson, R., Diaz de Stahl, T.,

Menzel, U., Sandgren, J., Poplawski, A., von Tell, D., Crasto,

C., Bogdan, A., et al. (2008). Somatic mosaicism for copy

number variation in differentiated human tissues. Hum. Mu-

tat. 29, 1118–1124.

21. Bruder, C.E., Piotrowski, A., Gijsbers, A.A., Andersson, R.,

Erickson, S., Diaz de Stahl, T., Menzel, U., Sandgren, J.,

von Tell, D., Poplawski, A., et al. (2008). Phenotypically

concordant and discordant monozygotic twins display

different DNA copy-number-variation profiles. Am. J. Hum.

Genet. 82, 763–771.

22. Steemers, F.J., Chang, W., Lee, G., Barker, D.L., Shen, R., and

Gunderson, K.L. (2006). Whole-genome genotyping with

the single-base extension assay. Nat. Methods 3, 31–33.

23. Olshen, A.B., Venkatraman, E.S., Lucito, R., and Wigler, M.

(2004). Circular binary segmentation for the analysis of

array-based DNA copy number data. Biostatistics 5, 557–572.

24. Conlin, L.K., Thiel, B.D., Bonnemann, C.G., Medne, L., Ernst,

L.M., Zackai, E.H., Deardorff, M.A., Krantz, I.D., Hakonarson,

H., and Spinner, N.B. (2010). Mechanisms of mosaicism,

chimerism and uniparental disomy identified by single nucle-

otide polymorphism array analysis. Hum. Mol. Genet. 19,

1263–1275.

25. Razzaghian, H.R., Shahi, M.H., Forsberg, L.A., de Stahl, T.D.,

Absher, D., Dahl, N., Westerman, M.P., and Dumanski, J.P.

(2010). Somatic mosaicism for chromosome X and Y aneu-

ploidies in monozygotic twins heterozygous for sickle cell

disease mutation. Am. J. Med. Genet. A. 152A, 2595–2598.

26. R_Development_Core_Team. (2010). R: A language and envi-

ronment for statistical computing. In. (Vienna, Austria).

URL: http://www.R-project.org/

27. Workman, C., Jensen, L.J., Jarmer, H., Berka, R., Gautier, L.,

Nielser, H.B., Saxild, H.H., Nielsen, C., Brunak, S., and Knud-

sen, S. (2002). A new non-linear normalization method for

reducing variability in DNA microarray experiments. Genome

Biol. 3, research0048.

28. Gunnarsson, R., Staaf, J., Jansson, M., Ottesen, A.M., Gorans-

son, H., Liljedahl, U., Ralfkiaer, U., Mansouri, M., Buhl, A.M.,

Smedby, K.E., et al. (2008). Screening for copy-number alter-

ations and loss of heterozygosity in chronic lymphocytic

leukemia—a comparative study of four differently designed,

high resolution microarray platforms. Genes Chromosomes

Cancer 47, 697–711.

29. Gunnarsson, R., Isaksson, A., Mansouri, M., Goransson, H.,

Jansson, M., Cahill, N., Rasmussen, M., Staaf, J., Lundin, J.,

Norin, S., et al. (2010). Large but not small copy-number alter-

ations correlate to high-risk genomic aberrations and survival

in chronic lymphocytic leukemia: A high-resolution genomic

screening of newly diagnosed patients. Leukemia 24, 211–215.

30. Gonzalez, J.R., Rodríguez-Santiago, B., Caceres, A., Pique-Regi,

R., Rothman, N., Chanock, S.J., Armengol, L., and Perez-

Jurado, L.A. (2011). A fast and accurate method to detect

allelic genomic imbalances underlying mosaic rearrange-

ments using SNP array data. BMC Bioinformatics 12, 166.

31. Schunkert, H., Konig, I.R., Kathiresan, S., Reilly, M.P.,

Assimes, T.L., Holm, H., Preuss, M., Stewart, A.F., Barbalic,

M., Gieger, C., et al; Cardiogenics; CARDIoGRAM Consor-

tium. (2011). Large-scale association analysis identifies 13

new susceptibility loci for coronary artery disease. Nat. Genet.

43, 333–338.

32. Assimes, T.L., Knowles, J.W., Basu, A., Iribarren, C., Southwick,

A., Tang, H., Absher, D., Li, J., Fair, J.M., Rubin, G.D., et al.

(2008). Susceptibility locus for clinical and subclinical coro-

nary artery disease at chromosome 9p21 in the multi-ethnic

ADVANCE study. Hum. Mol. Genet. 17, 2320–2328.

33. Hagenkord, J.M., Monzon, F.A., Kash, S.F., Lilleberg, S., Xie,

Q., and Kant, J.A. (2010). Array-based karyotyping for

prognostic assessment in chronic lymphocytic leukemia:

Performance comparison of Affymetrix 10K2.0, 250K Nsp,

and SNP6.0 arrays. J. Mol. Diagn. 12, 184–196.

34. Bernasconi, P., Boni, M., Cavigliano, P.M., Calatroni, S.,

Giardini, I., Rocca, B., Zappatore, R., Dambruoso, I., and Care-

sana, M. (2006). Clinical relevance of cytogenetics in myelo-

dysplastic syndromes. Ann. N Y Acad. Sci. 1089, 395–410.

35. Haase, D. (2008). Cytogenetic features in myelodysplastic

syndromes. Ann. Hematol. 87, 515–526.

36. Tiu, R.V., Gondek, L.P., O’Keefe, C.L., Elson, P., Huh, J.,

Mohamedali, A., Kulasekararaj, A., Advani, A.S., Paquette, R.,

List, A.F., et al. (2011). Prognostic impact of SNP array karyo-

typing in myelodysplastic syndromes and related myeloid

malignancies. Blood 117, 4552–4560.

37. Braun, T., de Botton, S., Taksin, A.L., Park, S., Beyne-Rauzy, O.,

Coiteux, V., Sapena, R., Lazareth, A., Leroux, G., Guenda, K.,

et al. (2011). Characteristics and outcome of myelodysplastic

syndromes (MDS) with isolated 20q deletion: A report on 62

cases. Leuk. Res. 35, 863–867.

38. Bejar, R., Levine, R., and Ebert, B.L. (2011). Unraveling the

molecular pathophysiology of myelodysplastic syndromes.

J. Clin. Oncol. 29, 504–515.

39. Conrad, D.F., Bird, C., Blackburne, B., Lindsay, S., Mamanova,

L., Lee, C., Turner, D.J., and Hurles, M.E. (2010). Mutation

spectrum revealed by breakpoint sequencing of human germ-

line CNVs. Nat. Genet. 42, 385–391.

40. Surveillance Epidemiology and End Results (SEER) Program.

Fast stats. Bethesda, MD, National Cancer Institute, NIH,

USA (2011) URL: http://seer.cancer.gov/faststats/

41. Mori, H., Colman, S.M., Xiao, Z., Ford, A.M., Healy, L.E.,

Donaldson, C., Hows, J.M., Navarrete, C., and Greaves, M.

(2002). Chromosome translocations and covert leukemic

clones are generated during normal fetal development. Proc.

Natl. Acad. Sci. USA 99, 8242–8247.

42. Naylor, K., Li, G., Vallejo, A.N., Lee, W.W., Koetz, K., Bryl, E.,

Witkowski, J., Fulbright, J., Weyand, C.M., and Goronzy, J.J.

(2005). The influence of age on T cell generation and TCR

diversity. J. Immunol. 174, 7446–7452.

43. Gibson, K.L., Wu, Y.C., Barnett, Y., Duggan, O., Vaughan, R.,

Kondeatis, E., Nilsson, B.O., Wikby, A., Kipling, D., and

Dunn-Walters, D.K. (2009). B-cell diversity decreases in old

age and is correlated with poor health status. Aging Cell 8,

18–25.



REPORT

Rare Mutations in XRCC2 Increase the Risk of Breast Cancer

D.J. Park,1,20 F. Lesueur,2,20 T. Nguyen-Dumont,1 M. Pertesi,2 F. Odefrey,1 F. Hammet,1 S.L. Neuhausen,3

E.M. John,4,5 I.L. Andrulis,6 M.B. Terry,7 M. Daly,8 S. Buys,9 F. Le Calvez-Kelm,2 A. Lonie,10 B.J. Pope,10

H. Tsimiklis,1 C. Voegele,2 F.M. Hilbers,11 N. Hoogerbrugge,12 A. Barroso,13 A. Osorio,13,14 the Breast Cancer Family Registry, the Kathleen Cuningham Foundation Consortium for Research into Familial Breast Cancer, G.G. Giles,15 P. Devilee,11,16 J. Benitez,13,14 J.L. Hopper,17 S.V. Tavtigian,18 D.E. Goldgar,19

and M.C. Southey1,*

An exome-sequencing study of families with multiple breast-cancer-affected individuals identified two families with XRCC2 mutations, one with a protein-truncating mutation and one with a probably deleterious missense mutation. We performed a population-based case-control mutation-screening study that identified six probably pathogenic coding variants in 1,308 cases with early-onset breast cancer and no variants in 1,120 controls (the severity grading was p < 0.02). We also performed additional mutation screening in 689 multiple-case families. We identified ten breast-cancer-affected families with protein-truncating or probably deleterious rare missense variants in XRCC2. Our identification of XRCC2 as a breast cancer susceptibility gene thus increases the proportion of breast cancers that are associated with homologous recombination-DNA-repair dysfunction and Fanconi anemia and could therefore benefit from specific targeted treatments such as PARP (poly ADP ribose polymerase) inhibitors. This study demonstrates the power of massively parallel sequencing for discovering susceptibility genes for common, complex diseases.

Currently, only approximately 30% of the familial risk for

breast cancer has been explained, leaving the substantial

majority unaccounted for.1 Recently, exome sequencing

has been demonstrated to be a powerful tool for identi-

fying the underlying cause of rare Mendelian disorders.

However, diseases such as breast cancer present substan-

tially increased complexity in terms of locus, allelic and

phenotypic heterogeneity, and relationships between

genotype and phenotype.

As part of a collaborative (Leiden University Medical

Centre, the Spanish National Cancer Center, and The

University of Melbourne) project involving the exome

capture and massively parallel sequencing of multiple-case breast-cancer-affected families, we applied whole-exome sequencing to DNA from multiple affected relatives

from 13 families (family structure and sample availability

were considered before the affected relatives were chosen).

Bioinformatic analysis of the resulting exome sequences

identified a protein-truncating mutation, c.651_652del

(p.Cys217*), in X-ray repair cross complementing gene-2

(XRCC2 [MIM 600375; NM_005431.1]) in the peripheral-blood DNA of a man participating in the Australian Breast

Cancer Family Registry2 (ABCFR; Figure 1A); this man (III-4

in Figure 1A) had been diagnosed with breast cancer at

29 years of age, and his mother (II-3), sister (III-5), and

cousin (III-1) had been diagnosed with breast cancer at

37, 41, and 34 years of age, respectively. The cousin

(III-1), who had also been selected for exome sequencing, did not carry this mutation; the sister’s DNA was Sanger sequenced and found to carry the mutation; and no DNA was available for testing of the mother. Exome

sequencing of three individuals from a family participating

in a Dutch research study of multiple-case breast-cancer-

affected families identified a probably deleterious missense

mutation (c.271C>T [p.Arg91Trp] in XRCC2) (Figure 2) in

two sisters (II-6 and II-8 in Figure 1B) diagnosed with breast

cancer at 40 and 48 years of age, respectively, but not in

their cousin (II-1), who was diagnosed at 47 years of age.

Genotyping of XRCC2 mutations c.651_652del

(p.Cys217*) and c.271C>T (p.Arg91Trp) in 1,344 cases

1Genetic Epidemiology Laboratory, The University of Melbourne, Victoria 3010, Australia; 2Genetic Cancer Susceptibility Group, International Agency for Research on Cancer, 69372 Lyon, France; 3Department of Population Sciences, Beckman Research Institute of City of Hope, Duarte, CA 91010, USA; 4Cancer Prevention Institute of California, Fremont, CA 94538, USA; 5Department of Health Research and Policy, Stanford Cancer Center Institute, Stanford, CA 94305, USA; 6Department of Molecular Genetics, Samuel Lunenfeld Research Institute, Mount Sinai Hospital, Toronto, ON M5G 1X5, Canada; 7Department of Epidemiology, Mailman School of Public Health, Columbia University, New York, NY 10032, USA; 8Fox Chase Cancer Center, Philadelphia, PA 19111, USA; 9Huntsman Cancer Institute, University of Utah Health Sciences Center, Salt Lake City, UT 84112, USA; 10Victorian Life Sciences Computation Initiative, Carlton, Victoria 3010, Australia; 11Department of Human Genetics, Leiden University Medical Center, Leiden, 2300 RC Leiden, The Netherlands; 12Department of Human Genetics, Radboud University Nijmegen Medical Center, 6525 GA Nijmegen, The Netherlands; 13Human Genetics Group, Human Cancer Genetics Program, Spanish National Cancer Center, 28029 Madrid, Spain; 14Spanish Network on Rare Diseases, 46010 Valencia, Spain; 15Centre for Cancer Epidemiology, The Cancer Council Victoria, Carlton, Victoria 3052, Australia; 16Department of Pathology, Leiden University Medical Center, Leiden, 2300 RC Leiden, The Netherlands; 17Centre for Molecular, Environmental, Genetic, and Analytical Epidemiology, School of Population Health, The University of Melbourne, Victoria 3010, Australia; 18Department of Oncological Sciences, Huntsman Cancer Institute, University of Utah School of Medicine, Salt Lake City, UT 84112, USA; 19Department of Dermatology, University of Utah School of Medicine, Salt Lake City, UT 84132, USA

20These authors contributed equally to this work




and 1,436 controls from the Melbourne Collaborative

Cohort Study3 (MCCS) and the ABCFR revealed one

control (II-2, Figure 1C) who carried c.651_652del

(p.Cys217*). Intriguingly, this control individual’s sister

(II-1) was diagnosed with breast cancer at 63 years of age,

and her mother (I-2) was diagnosed with melanoma at

69 years of age (Figure 1C, Tables 1 and 2).

XRCC2, a RAD51 paralog, was cloned because of its

ability to complement the DNA-damage sensitivity of the

irs1 hamster cell line.4 Cells derived from Xrcc2-knockout

mice exhibit profound genetic instability as a result of

homologous recombination (HR) deficiency.5 XRCC2 is

highly conserved, and most truncations of the protein

destroy its ability to protect cells from the effects of the

DNA cross-linking agent mitomycin C.6 The involvement

of the HR DNA repair genes BRCA1 (MIM 113705),

BRCA2 (MIM 600185), ATM (MIM 607585), CHEK2 (MIM

604373), BRIP1 (MIM 605882), PALB2 (MIM 610355),

and RAD51C (MIM 602774) in breast cancer risk empha-

sizes the importance of this mechanism in the etiology

of breast cancer.7–9 Biallelic mutations in three of these

genes are associated with Fanconi anemia (FA), and, most

interestingly, Shamseldin et al.10 have recently reported

a homozygous frameshift mutation in XRCC2 as being

associated with a previously unrecognized form of FA.

XRCC2 binds directly to the C-terminal portion of the

product of the breast cancer susceptibility pathway gene

RAD51 (MIM 179617), which is central to HR.6,11 XRCC2

also complexes in vivo with RAD51B (RAD51L1 [MIM

602948]), the product of the breast and ovarian cancer

susceptibility gene RAD51C9 and the product of the

ovarian cancer risk gene RAD51D (MIM 602954),12,13 and

localizes to sites of DNA damage.6 Cells deficient in

XRCC2 also show centrosome disruption, a key compo-

nent of mitotic-apparatus dysfunction, which is often

linked to the onset of mitotic catastrophe. XRCC2 is

important in preventing chromosome missegregation

leading to aneuploidy.14 Studies of common genetic varia-

tion in XRCC2 have reported some evidence of association

with breast cancer risk (e.g., rs3218408),15 subtle effects on

DNA-repair capacity,16 and poor survival associated with

rs3218536 (XRCC2, Arg188His).15

On the basis of the exome-sequencing results, the subse-

quent genotyping of the two probably pathogenic variants

Figure 1. Pedigrees of Families Found to Carry XRCC2 Mutations (Panels A–J)
Mutation status is indicated for all family members for whom a DNA sample was available. Cancer diagnosis and age of onset are indicated for affected members. Asterisks indicate that DNA underwent exome sequencing (libraries for 50 bp fragment reads were prepared according to the SOLiD Baylor protocol 2.1 and the Nimblegen exome-capture protocol v.1.2 with some variations). The following abbreviations are used: BC, breast cancer (black filled symbols); PC, pancreatic cancer; BwC, bowel cancer; UC, uterine cancer; MM, malignant melanoma; UK, unknown age; BlC, bladder cancer; OC, ovarian cancer; BCC, basal cell carcinoma; L, lung cancer (all gray-filled symbols); V, verified cancer (via cancer registry or pathology report); and wt, wild-type. Some symbols represent more than one person as indicated by a numeral.


in the MCCS and ABCFR, the rarity of these variants, and

the biochemical plausibility of XRCC2, we conducted two

further studies in parallel. The first study was case-control

mutation screening of XRCC2 (with high-resolution melt

[HRM] curve analysis followed by Sanger-sequencing

confirmation) in an additional series of 1,308 cases with

early-onset breast cancer and 1,120 frequency-matched

controls recruited through population-based sampling

by the Breast Cancer Family Registry2 (BCFR; Supplemental

Data, available online); the BCFR sampling was recently

carried out for the characterization of the breast cancer

risk associated with variants in ATM and CHEK2.17,18 The

second study was mutation screening of XRCC2 in a series

of index cases from multiple-case breast-cancer-affected

families and a series of male breast cancer cases.

The case-control mutation screening identified two cases

that carried protein-truncating variants in XRCC2: indi-

vidual III-2 had c.49C>T (p.Arg17*) (Figure 1F), and indi-

vidual II-1 had c.651_652del (p.Cys217*) (Figure 1G).

Five cases carried singleton missense substitutions ranging

from probably deleterious to relatively innocuous (accord-

ing to in silico prediction). One control carried a relatively

innocuous missense substitution (Table 2). In addition,

a case diagnosed with breast cancer at 32 years of age

carried a G>A substitution located one nucleotide prior

to the start codon.

We graded the rare missense variants by using three

computational tools: SIFT, Polyphen2.1, and Align-

GVGD. Differences in grading between these tools were

minor. Depending on which of the three computational

tools we used to grade the missense substitutions, the

statistical significances of the differences in the frequency

and severity distributions of protein-truncating variants

and rare missense substitutions between cases and controls

from the case-control mutation-screening study fell in the

range of p = 0.01–0.02 (adjusted for race, study center, and

age). There were six probably deleterious variants (pre-

dicted deleterious by at least two prediction algorithms)

in the cases and none in the controls, corresponding to

a p value by Fisher’s exact test of 0.02. All together, the

case-control mutation-screening data provide statistical

support for the hypothesis that rare, evolutionarily

unlikely sequence variation in XRCC2 is associated with

increased risk of breast cancer.
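The Fisher's exact comparison quoted above (six probably deleterious variants among 1,308 cases, none among 1,120 controls) can be checked with a short calculation. The following is a minimal sketch using only the Python standard library; the function names are illustrative, and whether the reported value was one- or two-sided is not stated in the text, so both are computed here as an assumption.

```python
from math import comb


def hypergeom_pmf(k, n_cases, n_controls, n_carriers):
    """Probability that exactly k of the carriers are cases, margins fixed."""
    total = n_cases + n_controls
    return comb(n_cases, k) * comb(n_controls, n_carriers - k) / comb(total, n_carriers)


def fisher_exact(carrier_cases, n_cases, carrier_controls, n_controls):
    """Return (one-sided, two-sided) Fisher's exact p values for a 2x2 table."""
    n_carriers = carrier_cases + carrier_controls
    k_min = max(0, n_carriers - n_controls)
    k_max = min(n_carriers, n_cases)
    pmf = {k: hypergeom_pmf(k, n_cases, n_controls, n_carriers)
           for k in range(k_min, k_max + 1)}
    p_obs = pmf[carrier_cases]
    # One-sided: tables with at least as many carrier cases as observed.
    one_sided = sum(p for k, p in pmf.items() if k >= carrier_cases)
    # Two-sided: all tables no more probable than the observed one.
    two_sided = sum(p for p in pmf.values() if p <= p_obs * (1 + 1e-9))
    return one_sided, two_sided


# Counts taken from the case-control mutation-screening study.
one_sided, two_sided = fisher_exact(6, 1308, 0, 1120)
print(f"one-sided p = {one_sided:.3f}, two-sided p = {two_sided:.3f}")
```

With these counts the one-sided value comes out at roughly 0.024, which is in line with the reported 0.02; the two-sided value is somewhat larger.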

Mutation screening (by Sanger sequencing) of XRCC2 in

the index cases of 689 multiple-case breast-cancer-affected

families participating in the BCFR and the Kathleen

Cuningham Foundation Consortium for Research into

Familial Breast Cancer19 (kConFab) plus 150 male breast

cancer cases participating in a US-based study of male

breast cancer (Beckman Research Institute of the City of

Hope20) and kConFab revealed three rare coding-sequence

alterations. We identified a second family (from the kCon-

Fab resource) with an index case who carried XRCC2

c.651_652del (p.Cys217*); this individual (II-5, Figure 1D)

also carried a truncating mutation in BRCA1 (c.70_80del

[p.Cys24Serfs*13]). We identified an ABCFR index case

(II-2, Figure 1E and Figure 2) who carried the previously

identified missense substitution, XRCC2 c.271C>T

(p.Arg91Trp). We also identified a male breast cancer case

who carried a relatively innocuous missense substitution,

c.283A>C (p.Ile95Leu).
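The relationship between a coding-sequence position and the affected codon in the variants named above can be made explicit with a little arithmetic. The sketch below is illustrative only: the helper names are hypothetical, the standard genetic code is assumed, and because the reference codon for Ile95 is not given in the text, all three isoleucine codons are tried.

```python
# Standard genetic code, bases ordered T, C, A, G at each codon position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def codon_position(c_pos):
    """Map a 1-based coding position to (codon number, 0-based offset in codon)."""
    return (c_pos - 1) // 3 + 1, (c_pos - 1) % 3


def substitute(codon, offset, alt):
    """Return the codon with the base at `offset` replaced by `alt`."""
    return codon[:offset] + alt + codon[offset + 1:]


# c.283 falls on the first base of codon 95.
print(codon_position(283))

# Whatever the reference isoleucine codon is, a first-base A>G change
# yields valine and a first-base A>C change yields leucine.
ILE_CODONS = [codon for codon, aa in CODON_TABLE.items() if aa == "I"]
for alt in "GC":
    outcomes = {CODON_TABLE[substitute(codon, 0, alt)] for codon in ILE_CODONS}
    print(f"A>{alt} at codon position 1: {sorted(outcomes)}")
```

The same position arithmetic places c.271 on the first base of codon 91 and c.49 on the first base of codon 17, consistent with the protein-level names used in the text.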

In addition to the protein-truncating mutations and the

above-described missense variants, a number of missense,

silent, and intronic variants were also observed in

XRCC2, and common SNPs that were reported in public

databases such as dbSNP, HapMap, or the 1000 Genomes Project were also identified. These included the common coding SNP c.563G>A (p.Arg188His) (rs3218536), one silent substitution, three 5′ UTR variants, five 3′ UTR variants, and six intronic variants in the vicinity of exon-intron boundaries. All these variants were predicted to be neutral according to various in silico prediction tools

(Supplemental Data, Tables 1 and 2). For common SNPs

(>1% in controls), no difference in allele frequency was

observed between cases and controls in the BCFR series.

The genetic studies included in this report received ap-

proval from The University of Melbourne Human Research

Ethics Committee, the International Agency for Research

on Cancer institutional review board (IRB), and the local

IRBs of every center from which we report findings.

Of the six distinct rare variants predicted to severely affect protein function and identified in our work, two were truncating mutations, and four were missense changes.

Although most recognized pathogenic mutations in the

major breast cancer susceptibility genes are protein trun-

cating, there is evidence that missense mutations might

be the more prominent class in some more recently identified

Figure 2. XRCC2 Multiple-Sequence Alignment Centered on Position Arg91
Missense substitutions observed in this interval are given with the missense residue directly above the corresponding human reference sequence residue. The following abbreviations are used: Hsap, Homo sapiens; Mmul, Macaca mulatta; Mmus, Mus musculus; Cfam, Canis familiaris; Lafr, Loxodonta africana; Mdom, Monodelphis domestica; Oana, Ornithorhynchus anatinus; Ggal, Gallus gallus; Acar, Anolis carolinensis; Xtro, Xenopus tropicalis; Drer, Danio rerio; Bflo, Branchiostoma floridae; Spur, Strongylocentrotus purpuratus; Nvec, Nematostella vectensis; and Tadh, Trichoplax adhaerens. The alignment, or updated versions thereof, is available at the Align-GVGD website (see Web Resources).


breast cancer susceptibility genes. For example, in comprehensive studies of ATM and CHEK2, the proportion of probably deleterious or pathogenic rare sequence variants that

are missense changes is often over 50%. More relevantly,

estimates of breast cancer risk are higher for missense vari-

ants than they are for protein-truncating variants. This

has been observed through case-control mutation-

screening analyses of ATM and CHEK217,18 and through

a pedigree analysis21 of ATM; in these analyses, the breast

cancer risk associated with one specific missense mutation

approaches the average risk associated with pathogenic

BRCA2 mutations. A very recent analysis of PALB2 muta-

tions found no difference in the frequency of missense

mutations between two case groups (contralateral and

unilateral breast cancer cases),22 suggesting that the contri-

bution of missense mutations to breast cancer risk might

vary between susceptibility genes.

Our finding of XRCC2 as a breast cancer susceptibility

gene expands the proportion of breast cancer that is associ-

ated with rare mutations in the HR-DNA-repair pathways

and the number of breast cancer susceptibility genes in

which biallelic mutations are associated with FA; the precise contribution of mutation in these genes will become clearer

as more whole-exome-sequencing (or whole-genome-

sequencing) and targeted-pathway-sequencing studies are

performed. XRCC2 mutations appear to be very rare, even

in the context of multiple-case families; they appear in 1

of 66 (1.5%) early-onset female breast cancer cases with

a strong family history of the disease present in the ABCFR,

compared to 9 (14%) BRCA1 mutations, 6 (9%) BRCA2

mutations, 3 (5%) TP53 (MIM 191170) mutations, and 2

(3%) PALB2 mutations.
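The percentages quoted for the 66 early-onset cases with a strong family history follow directly from the carrier counts. As a small illustrative check (assuming, as the sentence implies, that the same denominator of 66 applies to every gene; the variable names are hypothetical):

```python
# Carrier counts among the 66 ABCFR early-onset cases with a strong family history.
TOTAL_CASES = 66
CARRIERS = {"XRCC2": 1, "BRCA1": 9, "BRCA2": 6, "TP53": 3, "PALB2": 2}


def percent(count, total):
    """Carrier frequency expressed as a percentage."""
    return 100 * count / total


for gene, count in CARRIERS.items():
    print(f"{gene}: {count}/{TOTAL_CASES} = {percent(count, TOTAL_CASES):.1f}%")
```

Rounded to the precision used in the text, these give 1.5%, 14%, 9%, 5%, and 3%.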

These frequencies are consistent with data from both

breast cancer linkage studies that have suggested that no

single gene is likely to account for a large fraction of the re-

maining familial aggregation of breast cancer5 and reports

from recent candidate-gene sequencing studies that have

associated other members of the HR pathway with breast

cancer susceptibility.23,24 Although mutations in HR-

DNA-repair genes are rare, it is important to identify people

whose breast cancer is associated with HR-DNA-repair

dysfunction because they could benefit from specific tar-

geted treatments such as PARP inhibitors. Unaffected rela-

tives of people with a mutation in a HR-DNA-repair gene

could also be offered predictive testing and subsequent

clinical management and genetic counseling on the basis

of their mutation status. The identification of a family

with rare mutations in both XRCC2 and BRCA1 illustrates

the complexity of the underlying genetic architecture of

breast cancer susceptibility for some families and the chal-

lenges for personalized risk-prediction models that are

incorporating an increasing array of risk factors, which

include rare mutations in breast cancer susceptibility genes

and more common genetic variation. Currently, esti-

mating the relative importance of the XRCC2 mutation

to the breast cancer risk for members of this family is diffi-

cult because of the presence of a BRCA1 protein-truncating

mutation in the proband in addition to the XRCC2 muta-

tion. Many examples have been described of individuals

and families carrying deleterious mutations in more than

Table 1. Mutation Screening in Multiple-Case Breast Cancer Families

Rare XRCC2 Variant | Effect on Protein | Align-GVGD(a) | SIFT(b) | PolyPhen-2.1 (HumDiv) | Case or Control | Pedigree (Study Source) | Age and Origin of Carrier

Truncating variants
c.651_652del | p.Cys217* | – | – | – | case | Figure 1A (ABCFR)(e) | 29, white
c.651_652del | p.Cys217* | – | – | – | case(c) | Figure 1C (kConFab) | 36, white
c.651_652del | p.Cys217* | – | – | – | control | Figure 1D (MCCS) | 72, white

Missense substitutions
c.271C>T | p.Arg91Trp | C65 | 0.00 | probably damaging | case | Figure 1B (Dutch)(e) | 40, white
c.271C>T | p.Arg91Trp | C65 | 0.00 | probably damaging | case(d) | Figure 1E (ABCFR) | 32, white
c.283A>G | p.Ile95Val | C0 | 0.34 | benign | case | – (kConFab) | 59, white
c.283A>C | p.Ile95Leu | C0 | 0.41 | benign | case | – (kConFab) | 70, white
c.283A>G | p.Ile95Val | C0 | 0.34 | benign | case | – (BRICOH) | 68, white

Silent substitution
c.582G>T | p.Thr194Thr | – | – | – | case | – (kConFab) | 60, white

The following abbreviations are used: ABCFR, Australian Breast Cancer Family Registry; kConFab, Kathleen Cuningham Foundation Consortium for Research into Familial Breast Cancer; MCCS, Melbourne Collaborative Cohort Study; and BRICOH, Beckman Research Institute of City of Hope.
(a) Protein multiple sequence alignment (PMSA) used for obtaining scores for Align-GVGD: from Human to Branchiostoma floridae (Bflo).
(b) PMSA used for obtaining scores for SIFT: from Human to Trichoplax (Tadh).
(c) This woman also carries BRCA1 c.70_80del (p.Cys24Serfs*13).
(d) This carrier of p.Arg91Trp was identified through both the ABCFR multiple-case family screening and the BCFR-IARC (Breast Cancer Family Registry-International Agency for Research on Cancer) case-control screening.
(e) Family included in the exome-sequencing phase.


one proven breast cancer susceptibility gene; one such

example is the co-observation of BRCA1, BRCA2, ATM,

and CHEK2 mutations.21,25

This study demonstrates the power of massively parallel

sequencing in the discovery of additional breast cancer

susceptibility genes when used with an appropriate study

design. Our approach could be applied to other common,

complex diseases with components of unexplained herita-

bility.

Supplemental Data

Supplemental Data include 6 tables and can be found with this

article online at http://www.cell.com/AJHG.

Acknowledgments

This work was supported by Cancer Council Victoria (grant

628774), the National Institutes of Health (R01CA155767 and

R01CA121245), the Australian National Health and Medical

Research Council (grant 466668), The University of Melbourne

(infrastructure award to J.L.H.), a Victorian Life Sciences Computa-

tion Initiative grant (VR00353) on its Peak Computing Facility at

the University of Melbourne, and an initiative of the Victorian

Government and Dutch Cancer Society (grant UL 2009-4388).

The research resources, including the Melbourne Collaborative

Cohort Study, the Australian Breast Cancer Family Study, the Breast

Cancer Family Registry, and the Kathleen Cuningham Foundation

Consortium for Research into Familial Breast Cancer, are further

acknowledged in the supplementary information. We wish to

thank Nivonirina Robinot and Geoffroy Durand for their technical

help during the case-control mutation screening at the Interna-

tional Agency for Research on Cancer, Georgia Chenevix-Trench

for her support of and contribution to the establishment of the

case-control mutation-screening study, and Greg Wilhoite for

sequencing the male breast cancer cases at the Beckman Research

Institute of City of Hope. This work and partial support for S.L.N.

was provided by the Morris and Horowitz Families Endowment.

Work at the Spanish National Cancer Center was partially funded

by the Spanish Association Against Cancer and Health Ministry

(FIS08/1120). M.C.S. is a National Health and Medical Research

Council (NHMRC) Senior Research Fellow and a Victorian Breast

Cancer Research Consortium (VBCRC) Group Leader. J.L.H. is

a NHMRC Australia Fellow and a VBCRC Group Leader. T.N.-D. is

a Susan G. Komen for the Cure Postdoctoral Fellow.

Received: November 20, 2011

Revised: January 16, 2012

Accepted: February 29, 2012

Published online: March 29, 2012

Web Resources


Align-GVGD, http://agvgd.iarc.fr/alignments

GATK v.1.0.4418, http://gatk.sourceforge.net/

Genome Viewer (IGV v.1.5.48), http://www.broadinstitute.org/software/igv/


omim.org

Picard v.1.29, http://sourceforge.net/projects/picard/

PolyPhen2.1, http://genetics.bwh.harvard.edu./pph2/

SIFT, http://sift.jcvi.org/

SOLiD Baylor protocol 2.1, http://www.hgsc.bcm.tmc.edu/documents/Preparation_of_SOLiD_Capture_Libraries.pdf

UCSC Genome Browser, http://genome.ucsc.edu/cgi-bin/hgGateway

Table 2. Case-Control Mutation Screening Applied to the BCFR Population-Based Study

Rare XRCC2 Variant | Effect on Protein | Align-GVGD(a) | SIFT(b) | PolyPhen-2.1 (HumDiv) | Case (n = 1,308) or Control (n = 1,120) | Pedigree (BCFR) | Age and Origin of Carrier

Truncating variants
c.49C>T | p.Arg17* | – | – | – | case | Figure 1F | 33, white

Missense substitutions
c.46G>T | p.Ala16Ser | C0 | 0.24 | benign | case | – | 44, East Asian
c.181C>A | p.Leu61Ile | C0 | 0.00 | possibly damaging | case | Figure 1H | 30, East Asian
c.271C>T | p.Arg91Trp | C65 | 0.00 | probably damaging | case(c) | Figure 1E | 32, white
c.283A>G | p.Ile95Val | C0 | 0.34 | benign | control | – | 44, white
c.693G>T | p.Trp231Cys | C65 | 0.00 | probably damaging | case(d) | Figure 1I | 44, East Asian
c.808T>G | p.Phe270Val | C45 | 0.00 | probably damaging | case | Figure 1J | 38, African

Silent substitution
c.354G>A | p.Val118Val | – | – | – | case(d) | – | 44, East Asian

5′ UTR variants
c.-1G>A | ? | – | – | – | case(e) | – | 32, white

The following abbreviation is used: BCFR, Breast Cancer Family Registry.
(a) Protein multiple sequence alignment (PMSA) used for obtaining scores for Align-GVGD: from Human to Branchiostoma floridae (Bflo).
(b) PMSA used for obtaining scores for SIFT: from Human to Trichoplax (Tadh).
(c) This carrier of p.Arg91Trp was identified through both the ABCFR multiple-case family screening and the BCFR-IARC (Breast Cancer Family Registry-International Agency for Research on Cancer) case-control screening.
(d) This 44-year-old East Asian case carries p.Trp231Cys and p.Val118Val.
(e) This case is considered a "noncarrier" in the analysis.
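The "predicted deleterious by at least two prediction algorithms" rule used in the text can be illustrated on the missense rows of Table 2. The sketch below is not the authors' procedure: the per-tool cutoffs (SIFT score at or below 0.05, a PolyPhen call of possibly or probably damaging, and any Align-GVGD grade above C0) are assumptions chosen for illustration, and the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Missense:
    protein: str    # protein-level name from Table 2
    gvgd: str       # Align-GVGD grade, e.g. "C65"
    sift: float     # SIFT score
    polyphen: str   # PolyPhen-2.1 (HumDiv) call
    status: str     # "case" or "control"


VARIANTS = [
    Missense("p.Ala16Ser", "C0", 0.24, "benign", "case"),
    Missense("p.Leu61Ile", "C0", 0.00, "possibly damaging", "case"),
    Missense("p.Arg91Trp", "C65", 0.00, "probably damaging", "case"),
    Missense("p.Ile95Val", "C0", 0.34, "benign", "control"),
    Missense("p.Trp231Cys", "C65", 0.00, "probably damaging", "case"),
    Missense("p.Phe270Val", "C45", 0.00, "probably damaging", "case"),
]


def deleterious_votes(variant):
    """Count how many of the three tools flag the variant (assumed cutoffs)."""
    votes = 0
    votes += int(int(variant.gvgd[1:]) > 0)        # assumed: any grade above C0
    votes += int(variant.sift <= 0.05)             # assumed: conventional SIFT cutoff
    votes += int(variant.polyphen in {"possibly damaging", "probably damaging"})
    return votes


def probably_deleterious(variant, min_votes=2):
    """Apply the 'at least two of three tools' rule."""
    return deleterious_votes(variant) >= min_votes


flagged = [v for v in VARIANTS if probably_deleterious(v)]
for v in flagged:
    print(v.protein, v.status, deleterious_votes(v))
```

Under these assumed cutoffs, four missense substitutions are flagged, all of them in cases; together with the two protein-truncating variants found in cases this matches the six probably deleterious variants described in the text.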


References

1. Turnbull, C., and Rahman, N. (2008). Genetic predisposition

to breast cancer: Past, present, and future. Annu. Rev. Geno-

mics Hum. Genet. 9, 321–345.

2. John, E.M., Hopper, J.L., Beck, J.C., Knight, J.A., Neuhausen,

S.L., Senie, R.T., Ziogas, A., Andrulis, I.L., Anton-Culver, H.,

Boyd, N., et al; Breast Cancer Family Registry. (2004). The

Breast Cancer Family Registry: An infrastructure for coopera-

tive multinational, interdisciplinary and translational studies

of the genetic epidemiology of breast cancer. Breast Cancer

Res. 6, R375–R389.

3. Giles, G.G., and English, D.R. (2002). The Melbourne Collaborative

Cohort Study. IARC Sci Publ 156, 2.

4. Cartwright, R., Tambini, C.E., Simpson, P.J., and Thacker, J.

(1998). The XRCC2 DNA repair gene from human and mouse

encodes a novel member of the recA/RAD51 family. Nucleic

Acids Res. 26, 3084–3089.

5. Deans, B., Griffin, C.S., O’Regan, P., Jasin, M., and Thacker, J.

(2003). Homologous recombination deficiency leads to

profound genetic instability in cells derived from Xrcc2-

knockout mice. Cancer Res. 63, 8181–8187.

6. Tambini, C.E., Spink, K.G., Ross, C.J., Hill, M.A., and Thacker,

J. (2010). The importance of XRCC2 in RAD51-related DNA

damage repair. DNA Repair (Amst.) 9, 517–525.

7. Moynahan, M.E., Chiu, J.W., Koller, B.H., and Jasin, M. (1999).

Brca1 controls homology-directed DNA repair. Mol. Cell 4,

511–518.

8. Moynahan, M.E., Pierce, A.J., and Jasin, M. (2001). BRCA2 is

required for homology-directed repair of chromosomal breaks.

Mol. Cell 7, 263–272.

9. Meindl, A., Hellebrand, H., Wiek, C., Erven, V., Wappensch-

midt, B., Niederacher, D., Freund, M., Lichtner, P., Hartmann,

L., Schaal, H., et al. (2010). Germline mutations in breast and

ovarian cancer pedigrees establish RAD51C as a human cancer

susceptibility gene. Nat. Genet. 42, 410–414.

10. Shamseldin, H.E., Elfaki, M., and Alkuraya, F.S. (2012). Exome

sequencing reveals a novel Fanconi group defined by XRCC2

mutation. J. Med. Genet. 49, 184–186.

11. Gao, L.-B., Pan, X.-M., Li, L.-J., Liang, W.-B., Zhu, Y., Zhang,

L.-S., Wei, Y.-G., Tang, M., and Zhang, L. (2011). RAD51

135G/C polymorphism and breast cancer risk: A meta-analysis

from 21 studies. Breast Cancer Res. Treat. 125, 827–835.

12. Loveday, C., Turnbull, C., Ramsay, E., Hughes, D., Ruark, E.,

Frankum, J.R., Bowden, G., Kalmyrzaev, B., Warren-Perry,

M., Snape, K., et al; Breast Cancer Susceptibility Collaboration

(UK). (2011). Germline mutations in RAD51D confer susceptibility to ovarian cancer. Nat. Genet. 43, 879–882.

13. Liu, N., Schild, D., Thelen, M.P., and Thompson, L.H. (2002).

Involvement of Rad51C in two distinct protein complexes

of Rad51 paralogs in human cells. Nucleic Acids Res. 30,

1009–1015.

14. Griffin, C.S., Simpson, P.J., Wilson, C.R., and Thacker, J.

(2000). Mammalian recombination-repair genes XRCC2 and

XRCC3 promote correct chromosome segregation. Nat. Cell

Biol. 2, 757–761.

15. Lin, W.-Y., Camp, N.J., Cannon-Albright, L.A., Allen-Brady, K.,

Balasubramanian, S., Reed, M.W.R., Hopper, J.L., Apicella, C.,

Giles, G.G., Southey, M.C., et al. (2011). A role for XRCC2

gene polymorphisms in breast cancer risk and survival. J.

Med. Genet. 48, 477–484.

16. Rafii, S., O’Regan, P., Xinarianos, G., Azmy, I., Stephenson, T.,

Reed, M., Meuth, M., Thacker, J., and Cox, A. (2002). A poten-

tial role for the XRCC2 R188H polymorphic site in DNA-

damage repair and breast cancer. Hum. Mol. Genet. 11,

1433–1438.

17. Le Calvez-Kelm, F., Lesueur, F., Damiola, F., Vallee, M.,

Voegele, C., Babikyan, D., Durand, G., Forey, N., McKay-

Chopin, S., Robinot, N., et al; Breast Cancer Family Registry.

(2011). Rare, evolutionarily unlikely missense substitutions

in CHEK2 contribute to breast cancer susceptibility: results

from a breast cancer family registry case-control mutation-

screening study. Breast Cancer Res. 13, R6.

18. Tavtigian, S.V., Oefner, P.J., Babikyan, D., Hartmann, A.,

Healey, S., Le Calvez-Kelm, F., Lesueur, F., Byrnes, G.B.,

Chuang, S.-C., Forey, N., et al; Australian Cancer Study; Breast

Cancer Family Registries (BCFR); Kathleen Cuningham

Foundation Consortium for Research into Familial Aspects

of Breast Cancer (kConFab). (2009). Rare, evolutionarily

unlikely missense substitutions in ATM confer increased risk

of breast cancer. Am. J. Hum. Genet. 85, 427–446.

19. Mann, G.J., Thorne, H., Balleine, R.L., Butow, P.N., Clarke,

C.L., Edkins, E., Evans, G.M., Fereday, S., Haan, E., Gattas,

M., et al; Kathleen Cuningham Consortium for Research in

Familial Breast Cancer. (2006). Analysis of cancer risk and

BRCA1 and BRCA2 mutation prevalence in the kConFab

familial breast cancer resource. Breast Cancer Res. 8, R12.

20. Ding, Y.C., Steele, L., Chu, L.-H., Kelley, K., Davis, H., John,

E.M., Tomlinson, G.E., and Neuhausen, S.L. (2011). Germline

mutations in PALB2 in African-American breast cancer cases.

Breast Cancer Res. Treat. 126, 227–230.

21. Goldgar, D.E., Healey, S., Dowty, J.G., Da Silva, L., Chen, X.,

Spurdle, A.B., Terry, M.B., Daly, M.J., Buys, S.M., Southey,

M.C., et al; BCFR; kConFab. (2011). Rare variants in the

ATM gene and risk of breast cancer. Breast Cancer Res. 13, R73.

22. Tischkowitz, M., Capanu, M., Sabbaghian, N., Li, L., Liang, X.,

Vallee, M.P., Tavtigian, S.V., Concannon, P., Foulkes, W.D.,

Bernstein, L., et al; The WECARE Study Collaborative Group.

(2012). Rare germline mutations in PALB2 and breast cancer

risk: A population-based study. Hum. Mutat. 33, 674–680.

23. Rahman, N., Seal, S., Thompson, D., Kelly, P., Renwick, A.,

Elliott, A., Reid, S., Spanova, K., Barfoot, R., Chagtai, T., et al;

Breast Cancer Susceptibility Collaboration (UK). (2007).

PALB2, which encodes a BRCA2-interacting protein, is a breast

cancer susceptibility gene. Nat. Genet. 39, 165–167.

24. Seal, S., Thompson, D., Renwick, A., Elliott, A., Kelly, P.,

Barfoot, R., Chagtai, T., Jayatilake, H., Ahmed, M., Spanova,

K., et al; Breast Cancer Susceptibility Collaboration (UK).

(2006). Truncating mutations in the Fanconi anemia J gene

BRIP1 are low-penetrance breast cancer susceptibility alleles.

Nat. Genet. 38, 1239–1241.

25. Turnbull, C., Seal, S., Renwick, A., Warren-Perry, M., Hughes,

D., Elliott, A., Pernet, D., Peock, S., Adlard, J.W., Barwell, J.,

et al; Breast Cancer Susceptibility Collaboration (UK),

EMBRACE. (2012). Gene-gene interactions in breast cancer

susceptibility. Hum. Mol. Genet. 21, 958–962.


sponsored by

snapshots.cell.com

view the archive

SnapShots—sorted categorized—from chromatin

remodelers and autophagy to cancer and autism.

All SnapShots published from a year ago or more are

open access and freely available.

Be Frustrated No More.

www.sdix.com/perform

Better Antigens. Better Antibodies. Better Assays.

Discover how SDIX can help you create better antibodies to difficult targets, like GPCRs.

You need antibodies to perform in critical research, diagnostic and therapeutic applications; that’s what SDIX is all about, Design For Purpose™.

Our scientists have pioneered novel technologies in antigen design, including SDIX Genomic Antibody Technology™.

Antibodies designed to perform for YOU.

No reason to be frustrated anymore.

Empowering Sequencing, Our Focus.

The NGS Experts™

Complete Kit - Everything you need upstream of target capture

Optimized - Offers larger number of unique reads

Multiplexed - Up to 24 barcodes and barcode blockers

Available Now - Next Day Delivery

The NEXTflex™ Pre-Capture Combo Kit for NimbleGen SeqCap is a complete DNA-Seq library prep, barcode and barcode blocking solution, designed and validated for use upstream of Roche NimbleGen’s SeqCap v3 Target Capture.

DNA-Seq

ChIP-Seq

Bisulfite-Seq

Methyl-Seq

RNA-Seq

Small RNA-Seq

Directional RNA-Seq

PCR-Free DNA-Seq

Pre-Target Capture

Multiple Platform Compatibility

Simplify your NimbleGen SeqCap Target Capture.

Visit BiooNGS.com and turn your focus to your NGS results.