human population genomics man, woman, birth, death , infinity, plus

Human Population Genomics

Man, Woman, Birth, Death,

Infinity, Plus

Altruism, Cheap Talks,

Bad Behavior, ¥ Money, God and

Diversity on Steroids


Jack Schwartz (1930 – 2009)

“Damn the Human Genomes. Small populations; Genes too distant; Pestered with duplications; Feeble contrivance; Could make a better one myself!”

Lord Jeffrey (misattributed; badly paraphrased)


Small Populations

• Non-equilibrium Models
• Population Bottlenecks
• Not Well-mixed
• Migration/Colonization Patterns
• Catastrophic Infections
• Heterozygous Advantages

Wright-Fisher Process

Mickey (Coalescent talk)

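[Figure: Wright-Fisher process. A population of N individuals (vertical axis) is followed across generations (horizontal axis). Labels: mutation, derived allele, extinction. Legend: ancestral allele, derived allele.]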
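The Wright-Fisher picture above can be sketched in a few lines. This is a minimal illustration, assuming a haploid population of fixed size, a single neutral derived allele, and non-overlapping generations; the function name `wright_fisher` and its parameters are hypothetical and not taken from the talk.

```python
import random

def wright_fisher(n_individuals, initial_derived, generations, seed=0):
    """Track the count of a neutral derived allele under pure drift."""
    rng = random.Random(seed)
    count = initial_derived
    trajectory = [count]
    for _ in range(generations):
        freq = count / n_individuals
        # Each individual of the next generation copies a random parent,
        # so it carries the derived allele with probability freq.
        count = sum(1 for _ in range(n_individuals) if rng.random() < freq)
        trajectory.append(count)
        if count == 0 or count == n_individuals:
            break  # the derived allele is lost or fixed
    return trajectory

if __name__ == "__main__":
    # One new mutation in a population of 20: it is usually lost quickly.
    print(wright_fisher(n_individuals=20, initial_derived=1, generations=200))
```

With a single initial copy, most runs end with the derived allele at count 0, which is the "extinction" event marked in the figure.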

Moran Process

[Figure: Moran process. Axis labels: death, time.]

• Overlapping generations
• Distribution of time to replication
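For contrast with Wright-Fisher, a Moran step replaces one individual at a time, which is what gives overlapping generations. The sketch below assumes a neutral two-allele haploid population in discrete steps; `moran` and its arguments are illustrative names only.

```python
import random

def moran(n_individuals, initial_derived, max_steps, seed=0):
    """Neutral Moran process: one birth and one death per step."""
    rng = random.Random(seed)
    count = initial_derived
    trajectory = [count]
    for _ in range(max_steps):
        if count == 0 or count == n_individuals:
            break  # absorbed: allele lost or fixed
        freq = count / n_individuals
        parent_is_derived = rng.random() < freq  # individual chosen to replicate
        dying_is_derived = rng.random() < freq   # individual chosen to die
        count += int(parent_is_derived) - int(dying_is_derived)
        trajectory.append(count)
    return trajectory

if __name__ == "__main__":
    print(moran(n_individuals=20, initial_derived=10, max_steps=50))
```

The allele count changes by at most one per step, so roughly N Moran steps correspond to one Wright-Fisher generation.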

Forces in Population Genetics

How to understand forces that produce and maintain inherited genetic variation.

Forces:
• Mutation
• Recombination
• Natural Selection
• Population Structure/Migration
• Random birth/death (drift)

Genes Too Distant

• 20,000 Genes (the estimate in the 1980s was 120,000)
  • Occurring about every 150 Kb (roughly 3 Gb of haploid genome divided by 20,000 genes)
• Many more functional ncRNA
  • snoRNA, siRNA, piRNA, etc.
• Uncharacterized

Y

“From a gene’s point of view, reshuffling is a great restorative…

“The Y, in its solitary state disapproves of such laxity. Apart from small parts near each tip which line up with a shared section of the X, it stands aloof from the great DNA swap. Its genes, such as they are, remain in purdah as the generations succeed. As a result, each Y is a genetic republic, insulated from the outside world. Like most closed societies it becomes both selfish and wasteful. Every lineage evolves an identity of its own which, quite often, collapses under the weight of its own inborn weaknesses.

“Celibacy has ruined man’s chromosome.” Steve Jones, Y: The Descent of Men, 2002.

DAZ locus on Y Chromosome

Optical Mapping

Cells gently lysed to extract genomic DNA

DNA captured in parallel arrays of long single DNA molecules using microfluidic device

Genomic DNA, captured as single DNA molecules produced by random breakage of intact chromosomes

1. Capture and immobilize whole genomes as massive collections of single DNA molecules

Overlapping single molecule maps are aligned to produce a map assembly covering an entire chromosome


Error sources:
• Sizing Error (Bernoulli labeling, absorption cross-section, PSF)
• Partial Digestion
• False Optical Sites
• Orientation
• Spurious molecules, Optical chimerism, Calibration

Image of restriction enzyme digested YAC clone: YAC clone 6H3, derived from human chromosome 11, digested with the restriction endonuclease Eag I and Mlu I, stained with a fluorochrome and imaged by fluorescence microscopy.


Various combinations of error sources lead to NP-hard Problems
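To make the error sources concrete, here is a small sketch of how one single-molecule restriction map might be corrupted. It assumes independent cutting at each true site, uniformly placed false cuts, multiplicative Gaussian sizing noise, and a random orientation; the function `observe_molecule` and all default values are hypothetical, not parameters from the talk.

```python
import random

def observe_molecule(true_sites, length_kb, digestion_rate=0.8,
                     sizing_sd=0.1, n_false_cuts=1, seed=0):
    """Return noisy fragment sizes (kb) for one simulated molecule."""
    rng = random.Random(seed)
    # Partial digestion: each true site is cut with probability digestion_rate.
    cuts = [s for s in true_sites if rng.random() < digestion_rate]
    # False optical sites: spurious cuts at uniformly random positions.
    cuts += [rng.uniform(0, length_kb) for _ in range(n_false_cuts)]
    cuts.sort()
    edges = [0.0] + cuts + [float(length_kb)]
    fragments = [b - a for a, b in zip(edges, edges[1:])]
    # Sizing error: each fragment length is measured with relative noise.
    observed = [max(0.0, f * rng.gauss(1.0, sizing_sd)) for f in fragments]
    # Orientation: the molecule may be read from either end.
    if rng.random() < 0.5:
        observed.reverse()
    return observed

if __name__ == "__main__":
    print(observe_molecule([10, 30, 60], length_kb=100))
```

A map assembler has to recover the true site positions from many such molecules, without knowing which cuts are missing, spurious, or reversed.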

Pestered with Duplications

• Complex Genome Structures
  • Segmental Duplications
  • Many types of Polymorphisms (SNPs, CNVs, SVs, etc.)
• Models of Genome Dynamics
  • GOD (Genome Organizing Devices)
• Models of Coalescence

Segmental Duplications

Segmental duplications have been found to be associated with genomic disorders.
• Deletions: Williams-Beuren syndrome
• Duplications: Charcot-Marie-Tooth disease type 1A
• Inversions: Haemophilia A
• Translocations: Derivative 22 [der(22)] syndrome

Segmental duplications may be related to cancer development by causing copy number fluctuations.
• Duplication of myc in lung cancer, and ERBB2 in breast cancer.

Recent Segmental Duplications: Human

From [Bailey, et al. 2002]

• 3.5% to 5% of the human genome is found to contain segmental duplications, with length > 5 kb or > 1 kb and identity > 90%.
  • August 2001 assembly [Bailey, et al. 2002].
  • April 2003 assembly [Cheung, et al. 2003].
• These duplications are estimated to have emerged about 40 Mya under the neutral assumption.
• The duplications are mostly interspersed (non-tandem), and happen both inter- and intra-chromosomally.

Recent Segmental Duplications: Mouse

From [Cheung, et al. 2003]

• 1.2% of the mouse genome is found to contain segmental duplications, with length > 5 kb and identity > 90%.
  • February 2003 mouse assembly [Cheung, et al. 2003].
• These duplications are estimated to have emerged about 25 Mya under the neutral assumption.
• The duplications happen both inter- and intra-chromosomally.

Duplication Flanking Sequences

What are the molecular mechanisms that caused the recent segmental duplications in the human and mouse genomes?
• Thermodynamic instability in the DNA sequences;
• Recombination between homologous repeat elements;
• Other unknown mechanisms.

Thermodynamics

[Figure: thermodynamic profile around the duplicated region. Labels: 5’-breakpoint, 3’-breakpoint, duplicated region, windows from -512 bp to +512 bp. Series: Control, Data.]

[Figure: frequencies of the repeats, control set vs. data set, by repeat family; asterisks mark individual families.
SINE families (divergence): FLAM/FRAM (20%), Alu-Jo (14%), Alu-Jb (14%), Alu-Sc~Sx (8%), Alu-Y (5%), Alu-Ya~Yb (>1%), MIR (30%).
LINE families (divergence): L2 (30%), L1M4 (22%), L1M3 (21%), L1M2 (19%), L1M1 (18%), L1P5 (12%), L1P4 (11%), L1P3 (7%), L1P2 (4%), L1P1 (2%), L1Hs (<1%).]

The Model

[Figure: the model. Two origins of a duplication: duplication by recombination between repeats, and duplication by recombination between other repeats or other mechanisms. Afterwards, insertion and deletion or mutation events move the flanking regions among the states f--, f+-, and f++, while mutations accumulate in the duplicated sequences.]

The Mathematical Model

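[Figure: state-transition diagram of the model under hypotheses H0 and H1. Horizontal axis: time after duplication, divided into divergence bins 0 ≤ d < ε, ε ≤ d < 2ε, ..., (k-1)ε ≤ d < kε; arrows between consecutive bins are labeled α. Within each bin the flanking states f--, f++, and f+- carry the self-transition labels 1-α-2β, 1-α-2γ, and 1-α-β/2-γ, and the arrows among them are labeled 2β, γ, β/2, and 2γ. Initial proportions: h0 (split into h0--, h0+-, h0++) and h1 (split into h1--, h1++).]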

h1: proportion of duplications by repeat recombination;

h1++: proportion of duplications by recombination of the specific repeat;

h1--: proportion of duplications by recombination of other repeats;

h0: proportion of duplications by other, repeat-unrelated mechanisms;

h0++: proportion of h0 with common specific repeat in the flanking regions;

h0+-: proportion of h0 with no common specific repeat in the flanking regions;

h0--: proportion of h0 with no specific repeat in the flanking regions;

α: mutation rate in duplicated sequences;

β: insertion rate of the specific repeat;

γ: mutation rate in the specific repeat;

d: divergence level of duplications;

ε: divergence interval of duplications.
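One way to read the parameters above is as a small Markov chain on the flanking states. The sketch below is an assumption-laden reading, not the authors' implementation: it assumes that f-- gains the specific repeat with probability 2β, that f+- loses it with probability γ or gains the second copy with probability β/2, that f++ loses one copy with probability 2γ, and that advancing one divergence bin (probability α) leaves the flanking state unchanged, so α drops out of the flanking-state marginal. The numeric values of β and γ in the usage line are placeholders.

```python
def step(dist, beta, gamma):
    """One time step for the distribution over (f--, f+-, f++)."""
    f_mm, f_pm, f_pp = dist
    return (
        f_mm * (1 - 2 * beta) + f_pm * gamma,
        f_mm * 2 * beta + f_pm * (1 - beta / 2 - gamma) + f_pp * 2 * gamma,
        f_pm * beta / 2 + f_pp * (1 - 2 * gamma),
    )

def flank_distribution(initial, beta, gamma, n_steps):
    """Propagate an initial (f--, f+-, f++) distribution for n_steps."""
    dist = tuple(initial)
    for _ in range(n_steps):
        dist = step(dist, beta, gamma)
    return dist

if __name__ == "__main__":
    # Duplications created by recombination of the specific repeat (h1++)
    # start with the repeat on both flanks, i.e. in state f++.
    print(flank_distribution((0.0, 0.0, 1.0), beta=0.01, gamma=0.05, n_steps=10))
```

Under this reading, the decay of the f++ fraction with divergence is what separates the repeat-recombination hypothesis from the alternatives when the model is fitted.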

Model Fitting

[Figure: model fit for Alu and for L1. Each panel plots the fractions f--, f++, and f+- against diversity.]

The model parameters (αAlu, βAlu, γAlu, αL1, βL1, γL1) are estimated from the reported mutation and insertion rates in the literature.

The relative strengths of the alternative hypotheses can be estimated by model fitting to the real data.

h1++Alu ≈ 0.3; h1++ L1 ≈ 0.35.

Mer Frequencies

[Figure: tracks along Chr1: Ns, ATs, Reps, CDs, ΔG, DupCopy#, MerFreq. Labeled repeats: MER57A, L1P.]

Copy Number Variation Data

• China: 46 people
• Utah, European origin: 90 people
• Yoruba: 89 people
• Japan: 45 people

HapMap data

Made available to us by Drs. Evan Eichler and Andy Sharp

CNVs in Unique regions

[Figure panels, one per population, each labeled "OR".]

                      Yoruba   Japanese   Chinese   CEPH
No polymorphism         810      817        817      799
Amplifications only      43       43         46       55
Deletions only           46       37         36       44
Mixed                    1*      3=2+1*      1*     2=1+1*

CNVs in SD regions

[Figure panels, one per population, each labeled "AND".]

                      Yoruba   Japanese   Chinese   CEPH
No polymorphism         786      794        785      741
Amplifications only     124      135        141      129
Deletions only          101       86        101      141
Mixed                    43       40         27       44

Unique and SD regions show completely different behavior of CNVs!
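The contrast can be quantified directly from the two tables. The short sketch below computes the polymorphic fraction per population; it assumes the starred "Mixed" entries in the unique-region table count as 1, 3, 1, and 2 regions, and the names `UNIQUE`, `SD`, and `polymorphic_fraction` are illustrative.

```python
POPULATIONS = ["Yoruba", "Japanese", "Chinese", "CEPH"]

# Region counts copied from the two tables (columns follow POPULATIONS).
UNIQUE = {
    "none": [810, 817, 817, 799],
    "amplification": [43, 43, 46, 55],
    "deletion": [46, 37, 36, 44],
    "mixed": [1, 3, 1, 2],  # assumption: starred entries taken at face value
}
SD = {
    "none": [786, 794, 785, 741],
    "amplification": [124, 135, 141, 129],
    "deletion": [101, 86, 101, 141],
    "mixed": [43, 40, 27, 44],
}

def polymorphic_fraction(table):
    """Fraction of regions showing any CNV, per population."""
    fractions = []
    for i in range(len(POPULATIONS)):
        total = sum(row[i] for row in table.values())
        fractions.append((total - table["none"][i]) / total)
    return fractions

if __name__ == "__main__":
    for name, u, s in zip(POPULATIONS, polymorphic_fraction(UNIQUE),
                          polymorphic_fraction(SD)):
        print(f"{name}: unique {u:.1%}, SD {s:.1%}")
```

Roughly a tenth of the unique regions are polymorphic in each population, against about a quarter or more of the SD regions, and mixed amplification/deletion regions are almost absent outside SDs.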

Distance-dependent recombination

The chance of recombination depends on the distance between Allele A and its copy

Simulation (probabilistic model)

Observations & Conclusions

Mutation rate of 0.0001 and recombination rate of 0.001 in SD regions constitute the best fit to observed real life data.

Single mutations cannot explain the observed data, which can instead be explained by convergence via recombination.

Evolution-by-Duplication (EBD) appears to play a crucial role in evolution and molds the genetic circuitry in a rather constrained way, before it is subject to selection pressure.

Feeble Contrivance

• GWAS (Genome-Wide Association Studies)
  • Common Variants vs. Rare Variants
  • Haplotype Phasing/Linkage Analysis
• Poor Experiment Design
  • Reference Sequences
  • Genotypic vs. Haplotypic References
• Weak Technologies

Common vs. Rare Disease Variants

From Ionita-Laza (2009)

There are two disease models:
• CDCV: common disease, common variants
• CDRV: common disease, rare variants

The current genome-wide association studies only consider common variants (frequency at least 5%).
• Feasible with available resources
• The common loci identified so far have small effects (ORs 1.1 to 1.5) and only explain a small percentage of the estimated heritability.

Rare susceptibility variants are expected to play an important role:
• population genetics theory (Pritchard, 2001)
• empirical evidence (BMI, blood pressure, autism, Mendelian diseases, etc.)

Effect Size Distribution

Capture-Recapture Model

Suppose we have sequence data on Nind individuals in a genomic region.

• An individual shows variation at a position if the corresponding allele is different from the ancestral one.
• A position is variable, or is a variant, if there is at least one individual in the dataset with a variation at that position.
• Let xs be the number of individuals with variation at position s; for a variant, xs > 0.
• The unknown is N: the total number of variants in the region.

One can estimate the following:
• Δ(t) = # NEW variants expected to be found in a FUTURE dataset of size t · Nind, where t is a multiplier of the initial dataset size, Nind.
• Δf(t) = # new variants with frequency at least f . . .
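As a rough illustration of what estimating Δ(t) involves, the sketch below uses the classical Good-Toulmin alternating series from the species-sampling literature. This is a swapped-in, well-known nonparametric estimator, not necessarily the method of Ionita-Laza et al.; it is only reliable for 0 ≤ t ≤ 1, and the function name `good_toulmin` and the example counts are hypothetical.

```python
from collections import Counter

def good_toulmin(variant_counts, t):
    """Estimate the number of new variants in a future sample of relative size t.

    variant_counts holds x_s for every observed variant (each x_s >= 1).
    Uses the sum over k of (-1)^(k+1) * t^k * n_k, where n_k is the number
    of variants carried by exactly k individuals.
    """
    if not 0 <= t <= 1:
        raise ValueError("the plain Good-Toulmin series needs 0 <= t <= 1")
    n_k = Counter(variant_counts)
    return sum((-1) ** (k + 1) * t ** k * n for k, n in n_k.items())

if __name__ == "__main__":
    # Hypothetical data: three singletons, two doubletons, one tripleton.
    x = [1, 1, 1, 2, 2, 3]
    print(good_toulmin(x, 0.5))
    print(good_toulmin(x, 1.0))
```

The estimate is driven mostly by the singletons: a region with many variants seen in only one individual is expected to keep yielding new variants as more individuals are sequenced.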


ENCODE dataset

Ten 500 Kb genomic regions were sequenced in several unrelated DNA samples:
• 8 Yoruba (YRI)
• 16 CEPH European (CEPH)
• 7 Han Chinese (CHB)
• 8 Japanese (JPT)

To make results comparable across the four populations (YRI, CEPH, CHB and JPT), they considered only 7 of the sequenced individuals for each dataset.


ENCODE - Δf(t)

From Ionita-Laza et al. 2009


How to Make a Better Human?

• Debugging a human better
• Sequencing a genome
• Sequencing a population

S ★ M ★ A ★ S ★ H

Single Molecule Approach to Sequencing-by-Hybridization

S*M*A*S*H

Sequence a human-size genome of about 6 Gb, including both haplotypes.

Integrate:
• Optical Mapping (Ordered Restriction Maps)
• Hybridization (with short nucleobase probes [PNA or LNA oligomers]) with dsDNA on a surface, and
• Positional Sequencing by Hybridization (efficient polynomial time algorithms to solve “localized versions” of the PSBH problems)

1. Genomic DNA is carefully extracted (Fig 1).

2. LNA probes of length 6 – 8 nucleotides are hybridized to dsDNA (double-stranded genomic DNA). The modified DNA is stretched on a 1” x 1” chip (Fig 2).

3. DNA adheres to the surface along the channels and stretches out. Molecules range from 0.3 to 3 million base pairs in length. Bright emitters are attached to the probes and imaged (Fig 3).

4. A restriction enzyme cuts the DNA at specific sites. The cut fragments of DNA relax like entropic springs, leaving small visible gaps (Fig 4).

5. The DNA is then stained with a fluorogen (Fig 5) and reimaged. The two images are combined in a composite image suggesting the locations of a specific short word (e.g., probes) within the context of a pattern of restriction sites.

6. The integrated intensity measures the length of the DNA fragments. The bright emitters on the probes provide a profile of the probe locations. In Fig 6, the restriction sites are represented by tall rectangles and the probe sites by small circles.

7. These steps are repeated for all possible probe compositions (modulo reverse complementarity). Software assembles the haplotypic ordered restriction maps with approximate probe locations superimposed on the map.

[Fig 7: the overlapping probe words ATAT, TATC, ATCA, TCAT, CATA tile the sequence ATATCATAT.]

S*M*A*S*H

Local clusters of overlapping words are combined by our PSBH (positional sequencing by hybridization) algorithm

[Fig 8: the overlapping probe words ATAT, TATC, ATCA, TCAT, CATA are combined into the sequence ATATCATAT.]

Probe Map (lambda DNA)

Final Probe Map

Consensus map with 2 probe locations at 14.8% and 52.4% of the DNA length.

In close agreement with the correct map, 50.2% and 85.7% (known from the sequence), which correspond to 14.3% and 49.8% if measured from the opposite end of the molecule.

Implied probe hybridization rate = 42%. Significantly better than the needed 30%.

[Figure: four AFM images of lambda DNA with PNA probes; scale bar 500 nm.]

Combinatorial Structure

Discretization

Prediction

The probability of successfully computing the correct restriction map as a function of the number of cuts in the map and number of molecules used in creating the map…

Gentig: Bayesian Approach

Bayesian Model

Robustness

BAC Clones with 6-cutters: Average Clone Size = 160 Kb; Average Fragment Size = 4 Kb; Average Number of Cut Sites = 40.

Parameters:
• Digestion rate can be as low as 10%
• Orientation of DNA need not be known
• 40% foreign DNA
• 85% DNA partially broken
• Relative sizing error up to 30%
• 30% spurious randomly located cuts…

Single Molecule Haplotyping: Candida albicans

The left end of chromosome 1 of the common fungus Candida albicans (being sequenced by Stanford).

Three polymorphisms:

(A) Fragment 2 is of size 41.19 kb (top) vs. 38.73 kb (bottom).

(B) The 3rd fragment, of size 7.76 kb, is missing from the top haplotype.

(C) The large fragment in the middle is of size 61.78 kb vs. 59.66 kb.

Problem to Solve…

Given probe maps of some small region of the genome for all N-bp hybridization probes (e.g. all 2080 probes of 6-bp).

With known error rates (false positive, false negatives and sizing errors).

Can we reconstruct the complete sequence?
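The figure of 2080 probes follows from counting 6-mers up to reverse complementarity: of the 4^6 = 4096 words, 4^3 = 64 are their own reverse complement, giving (4096 + 64) / 2 = 2080. The brute-force check below confirms the count; the helper names are illustrative.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Reverse complement of a DNA string over A, C, G, T."""
    return seq.translate(COMPLEMENT)[::-1]

def count_distinct_probes(k):
    """Number of k-mers when a word and its reverse complement are one probe."""
    canonical = {
        min(word, reverse_complement(word))
        for word in map("".join, product("ACGT", repeat=k))
    }
    return len(canonical)

if __name__ == "__main__":
    print(count_distinct_probes(6))  # 2080
```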

Basic reconstruction algorithm

1. Keep track of multiple sequence assemblies.
2. Initialize with all possible 5-bp sequences.
3. Try all 4 possible extensions of each sequence.
4. Check if the probe is present in the corresponding map: if not, add a penalty score to the sequence involved.
5. Periodically delete sequences with high penalty.
6. Stop when the missing probe rate jumps significantly, from the False Negative rate (2%) to (100% - false extension rate) = 55%.
7. Return the highest scoring sequence.
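The reconstruction loop can be sketched as a small beam search. This is a deliberately simplified illustration, not the PSBH algorithm itself: it ignores probe positions on the map, uses an error-free forward-strand probe set, and stops at a known target length instead of watching the missing-probe rate. The names `spectrum`, `reconstruct`, `beam`, and `miss_penalty` are hypothetical.

```python
from itertools import product

def spectrum(seq, k):
    """All k-bp words occurring in seq (an idealized, error-free probe set)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def reconstruct(probes, k, target_len, beam=200, miss_penalty=1):
    """Extend candidate assemblies base by base, penalizing missing probes."""
    # Initialize with all possible (k-1)-bp sequences.
    candidates = [("".join(p), 0) for p in product("ACGT", repeat=k - 1)]
    while len(candidates[0][0]) < target_len:
        extended = []
        for seq, penalty in candidates:
            for base in "ACGT":  # try all 4 possible extensions
                new_seq = seq + base
                missing = new_seq[-k:] not in probes
                extended.append((new_seq, penalty + miss_penalty * missing))
        # Delete high-penalty assemblies, keeping only the best `beam`.
        extended.sort(key=lambda c: c[1])
        candidates = extended[:beam]
    return candidates[0]  # lowest-penalty assembly

if __name__ == "__main__":
    target = "ACGTTGCAAGCT"
    print(reconstruct(spectrum(target, 4), k=4, target_len=len(target)))
```

The toy target has no repeated 3-mers, so its 4-mer probe set determines it uniquely; with repeats, or without positional information, several assemblies can tie, which is where the anomalies on the next slide arise.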

Anomalies

Irresolvable Ambiguities: From assemblies based on 6bp probes

Error Pattern: s w sRC
Correct Pattern: s wRC sRC

s = tcgcc (any 5 bases)
sRC = ggcga (reverse complement of s)
w = CCCCTAAC (any short sequence under 50 bp)
wRC = GTTAGGGG (reverse complement of w)

Assembly: …tcgccCCCCTAACggcga…
           |||||        |||||
Correct : …tcgccGTTAGGGGggcga…

Directed Eulerian Graph

[Figure: directed Eulerian graph labeled with the 4-mers AATT, TTAA, ATTC, TAAG, TTCG, AAGC, ATCG, TAGC.]

Mixing ‘solid’ bases with ‘wild-card’ bases, e.g., xx-x-x-xx (9-mers) or xxx--x--x--xxx (14-mers).

A wild-card is an ‘inert’ base. Universal: in terms of its ability to form base pairs with the other natural DNA/RNA bases.

Examples: the naturally occurring base hypoxanthine, as its ribo- or 2'-deoxyribonucleoside; 2'-deoxyisoinosine; 7-deaza-2'-deoxyinosine; 2-aza-2'-deoxyinosine.

Simulation Results

[Figure: simulation results, two panels. UNGAPPED: errors per 10 kb of sequence (log scale, 1 to 10000) vs. bases per probe (5 to 8). GAPPED: errors per 10 kb of sequence (log scale, 0.01 to 1000) vs. gapped bases per probe (0 to 5), with 6 solid bases.]

1000 Rupees Genome

• 22.67 US$ for 6 billion bases
• 135 billion US$ for the entire human population

Who we are…

Population: David Albers (Columbia), Eric Aslakson (NYU), Mickey Atwal (CSHL), Ivan Iossifov (CSHL), Hossein Khiabanian (Columbia), Samantha Kleinberg (NYU), Partha Mitra (CSHL), Michaela Oswald (CSHL), Raul Rabadan (Columbia), Vladimir Trifonov (Columbia), Daniel Valente (CSHL), Chris Wiggins (Columbia)

Polymorphisms: Iuliana Ionita-Laza (Harvard), Antonina Mitrofanova (NYU), Joey Zhao (Princeton)

SMASH: TS Anantharaman (OpGen), Charles Cantor (Sequenom), Vladimir Demidov (BU), Pierre Franquin (NYU), Alex Lim (Ex-NYU), Toto Paxia (Ex-NYU), Jason Reed (UCLA), Andrew Sundstrom (NYU)

SUTTA: Giuseppe Narzisi (NYU), Alessio Narzisi (NYU/Catania)

“Beware prejudices.

“They are like rats, and men's minds are like traps; prejudices get in easily, but it is doubtful if they ever get out.”

Lord Jeffrey
