Human Population Genomics
Man, Woman, Birth, Death,
Infinity, Plus
Altruism, Cheap Talks,
Bad Behavior, ¥ Money, God and
Diversity on Steroids
Jack Schwartz (1930 – 2009)
“Damn the Human Genomes.
Small populations;
Genes too distant;
Pestered with duplications;
Feeble contrivance;
Could make a better one myself!”
Lord Jeffrey (misattributed; badly paraphrased)
Small Populations
•Non-equilibrium Models
•Population Bottlenecks
•Not Well-mixed
•Migration/Colonization Patterns
•Catastrophic Infections
  • Heterozygous Advantages
Wright-Fisher Process
Mickey (Coalescent talk)
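[Figure: Wright-Fisher process. Vertical axis: N individuals; horizontal axis: generation. A mutation introduces a derived allele next to the ancestral allele; the derived allele later goes extinct.]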
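The drift pictured on this slide can be sketched in a few lines. This is a minimal illustration, assuming a haploid population of constant size n, no selection, and a single new mutation; the function name and parameters are hypothetical, not taken from the talk.

```python
import random

def wright_fisher(n, generations, seed=0):
    """Follow the copy number of one new derived allele among n haploid individuals."""
    rng = random.Random(seed)
    count = 1                      # a single mutation creates the derived allele
    history = [count]
    for _gen in range(generations):
        p = count / n              # current derived-allele frequency
        # every offspring picks its parent uniformly at random (binomial sampling)
        count = sum(1 for _child in range(n) if rng.random() < p)
        history.append(count)
        if count in (0, n):        # extinction or fixation: both are absorbing
            break
    return history

if __name__ == "__main__":
    print(wright_fisher(n=50, generations=10_000, seed=1))
```

Most runs end with the derived allele going extinct within a few generations, which is the outcome drawn in the figure.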
Moran Process
[Figure: Moran process; labels: death, time.]
•Overlapping generations
•Distribution of time to replication
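For contrast with Wright-Fisher, a discrete-step Moran sketch: one birth and one death per step, so generations overlap. This assumes a neutral haploid population and replaces the continuous waiting-time distribution with unit steps; names and defaults are illustrative only.

```python
import random

def moran(n, start=1, max_steps=100_000, seed=0):
    """Run a neutral Moran process until the derived allele is lost or fixed."""
    rng = random.Random(seed)
    count = start                  # copies of the derived allele
    steps = 0
    while 0 < count < n and steps < max_steps:
        born_derived = rng.random() < count / n   # reproducing individual, chosen uniformly
        dies_derived = rng.random() < count / n   # dying individual, chosen uniformly
        count += int(born_derived) - int(dies_derived)
        steps += 1
    return count, steps

if __name__ == "__main__":
    print(moran(n=20, seed=3))
```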
Forces in Population Genetics
How to understand the forces that produce and maintain inherited genetic variation
Forces:
•Mutation
•Recombination
•Natural Selection
•Population Structure/Migration
•Random birth/death (drift)
Genes Too Distant
•20,000 Genes (estimate in the 1980s: 120,000)
  • Occurring about every 150 Kb
•Many more functional ncRNA
  • snoRNA, siRNA, piRNA, etc.
•Uncharacterized
Y
“From a gene’s point of view, reshuffling is a great restorative…
“The Y, in its solitary state disapproves of such laxity. Apart from small parts near each tip which line up with a shared section of the X, it stands aloof from the great DNA swap. Its genes, such as they are, remain in purdah as the generations succeed. As a result, each Y is a genetic republic, insulated from the outside world. Like most closed societies it becomes both selfish and wasteful. Every lineage evolves an identity of its own which, quite often, collapses under the weight of its own inborn weaknesses.
“Celibacy has ruined man’s chromosome.”
Steve Jones, Y: The Descent of Men, 2002.
DAZ locus on Y Chromosome
Optical Mapping
Cells gently lysed to extract genomic DNA
DNA captured in parallel arrays of long single DNA molecules using microfluidic device
Genomic DNA, captured as single DNA molecules produced by random breakage of intact chromosomes
1. Capture and immobilize whole genomes as massive collections of single DNA molecules
Overlapping single molecule maps are aligned to produce a map assembly covering an entire chromosome
Error sources:
•Sizing Error (Bernoulli labeling, absorption cross-section, PSF)
•Partial Digestion
•False Optical Sites
•Orientation
•Spurious molecules, Optical chimerism, Calibration
Image of restriction enzyme digested YAC clone: YAC clone 6H3, derived from human chromosome 11, digested with the restriction endonucleases Eag I and Mlu I, stained with a fluorochrome and imaged by fluorescence microscopy.
Various combinations of error sources lead to NP-hard Problems
Pestered with duplications
•Complex Genome Structures
  • Segmental Duplications
  • Many types of Polymorphisms (SNPs, CNVs, SVs, etc.)
•Models of Genome Dynamics
  • GOD (Genome Organizing Devices)
•Models of Coalescence
Segmental Duplications
Segmental duplications have been found to be associated with genomic disorders:
•Deletions: Williams-Beuren syndrome
•Duplications: Charcot-Marie-Tooth disease type 1A
•Inversions: Haemophilia A
•Translocations: Derivative 22 [der(22)] syndrome
Segmental duplications may be related to cancer development by causing copy number fluctuations:
•Duplication of myc in lung cancer, and ERBB2 in breast cancer.
Recent Segmental Duplications: Human
From [Bailey, et al. 2002]
•3.5% ~ 5% of the human genome is found to contain segmental duplications, with length > 5 kb or > 1 kb, identity > 90%.
  • August, 2001 assembly [Bailey, et al. 2002].
  • April, 2003 assembly [Cheung, et al. 2003].
•These duplications are estimated to have emerged about 40 Mya under the neutral assumption.
•The duplications are mostly interspersed (non-tandem), and happen both inter- and intra-chromosomally.
Recent Segmental Duplications: Mouse
From [Cheung, et al. 2003]
•1.2% of the mouse genome is found to contain segmental duplications, with length > 5kb, identity > 90%.
  • February, 2003 mouse assembly [Cheung, et al. 2003].
•These duplications are estimated to have emerged about 25 Mya under the neutral assumption.
•The duplications happen both inter- and intra-chromosomally.
Duplication Flanking Sequences
What are the molecular mechanisms that caused the recent segmental duplications in the human and mouse genomes?
•Thermodynamic instability in the DNA sequences;
•Recombination between homologous repeat elements;
•Other unknown mechanisms.
Thermodynamics
[Figure: Control vs. Data around the duplicated region; marks: 5’-breakpoint, 3’-breakpoint, -512bp, +512bp.]
[Figure: Frequencies of the repeats, Control set vs. Data set, by repeat family and divergence.]
SINE (divergence): MIR 30%; FLAM/FRAM 20%; Alu-Jo 14%; Alu-Jb 14%; Alu-Sc~Sx 8%; Alu-Y 5%; Alu-Ya~Yb >1%
LINE (divergence): L2 30%; L1M4 22%; L1M3 21%; L1M2 19%; L1M1 18%; L1P5 12%; L1P4 11%; L1P3 7%; L1P2 4%; L1P1 2%; L1Hs <1%
The Model
[Figure: two routes to a duplication: (1) duplication by recombination between repeats; (2) duplication by recombination between other repeats or other mechanisms. Afterwards, insertion and deletion or mutation move a duplication among the flanking states f--, f+-, f++, while mutations accumulate in the duplicated sequences.]
The Mathematical Model
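[Figure: Markov chain for hypotheses H0 and H1. Columns are divergence intervals 0 ≤ d < ε, ε ≤ d < 2ε, …, (k-1)ε ≤ d < kε; the horizontal axis is time after duplication. Each column holds the flanking states f--, f+-, f++. Initial proportions: h1 (split into h1++, h1--) and h0 (split into h0++, h0+-, h0--).]
Transitions per step:
•any state moves to the next divergence interval with probability α
•f-- → f+- with 2β; f+- → f-- with γ
•f+- → f++ with β/2; f++ → f+- with 2γ
•self-loops: f-- with 1-α-2β; f+- with 1-α-β/2-γ; f++ with 1-α-2γ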
h1: proportion of duplications by repeat recombination;
h1++: proportion of duplications by recombination of the specific repeat;
h1--: proportion of duplications by recombination of other repeats;
h0: proportion of duplications by other repeat-unrelated mechanism;
h0++: proportion of h0 with common specific repeat in the flanking regions;
h0+-: proportion of h0 with no common specific repeat in the flanking regions;
h0--: proportion of h0 with no specific repeat in the flanking regions;
α: mutation rate in duplicated sequences;
β: insertion rate of the specific repeat;
γ: mutation rate in the specific repeat;
d: divergence level of duplications;
ε: divergence interval of duplications.
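One way to read the diagram labels (α, 2β, β/2, γ, 2γ) is as per-step transition probabilities of a chain over the flanking states (f--, f+-, f++) in each divergence interval. The sketch below encodes that reading only; the state ordering, the absorbing last interval, and the parameter values in the usage lines are assumptions, not values from the talk.

```python
def evolve(bins, alpha, beta, gamma):
    """Advance one step. bins[i] = [f--, f+-, f++] mass in divergence interval i."""
    k = len(bins)
    new = [[0.0, 0.0, 0.0] for _ in range(k)]
    for i, (mm, pm, pp) in enumerate(bins):
        j = min(i + 1, k - 1)      # assumption: the last interval absorbs further divergence
        # a mutation in the duplicated sequence (rate alpha) moves mass to the next interval
        for state, mass in enumerate((mm, pm, pp)):
            new[j][state] += alpha * mass
        # repeat insertion (beta) and repeat mutation (gamma) change the flanking state
        new[i][0] += (1 - alpha - 2 * beta) * mm + gamma * pm
        new[i][1] += 2 * beta * mm + (1 - alpha - beta / 2 - gamma) * pm + 2 * gamma * pp
        new[i][2] += (beta / 2) * pm + (1 - alpha - 2 * gamma) * pp
    return new

if __name__ == "__main__":
    # hypothetical starting proportions and rates, for illustration only
    bins = [[0.5, 0.2, 0.3]] + [[0.0, 0.0, 0.0] for _ in range(4)]
    for _ in range(10):
        bins = evolve(bins, alpha=0.05, beta=0.01, gamma=0.02)
    print(bins)
```

Each state's outgoing probabilities sum to one, so total mass is conserved from step to step.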
Model Fitting
[Figure: two panels, Alu and L1, plotting f--, f++ and f+- against diversity.]
The model parameters (αAlu, βAlu, γAlu, αL1, βL1, γL1) are estimated from the reported mutation and insertion rates in the literature.
The relative strengths of the alternative hypotheses can be estimated by model fitting to the real data.
h1++Alu ≈ 0.3; h1++ L1 ≈ 0.35.
Mer Frequencies
[Figure: Chr1 tracks: Ns, ATs, Reps, CDs, ΔG, DupCopy#, MerFreq; labeled repeats: MER57A, L1P.]
Copy Number Variation Data
HapMap data:
•China: 46 people
•Utah, European origin: 90 people
•Yoruba: 89 people
•Japan: 45 people
Made available to us by Drs. Evan Eichler and Andy Sharp
CNVs in Unique regions
                     Yoruba  Japanese  Chinese  Ceph
No polymorphism      810     817       817      799
Amplifications only  43      43        46       55
Deletions only       46      37        36       44
Mixed                1*      3=2+1*    1*       2=1+1*
CNVs in SD regions
                     Yoruba  Japanese  Chinese  Ceph
No polymorphism      786     794       785      741
Amplifications only  124     135       141      129
Deletions only       101     86        101      141
Mixed                43      40        27       44
Unique and SD regions show completely different behavior of CNVs!
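The contrast can be made concrete from the two count tables. The sketch below computes, per population, the fraction of regions that are polymorphic and the share of polymorphic regions that are "Mixed"; starred entries are counted at their stated totals, and the summary statistics are an illustrative choice, not the analysis used in the talk.

```python
# (no polymorphism, amplifications only, deletions only, mixed), per population
COUNTS = {
    "unique": {"Yoruba": (810, 43, 46, 1), "Japanese": (817, 43, 37, 3),
               "Chinese": (817, 46, 36, 1), "CEPH": (799, 55, 44, 2)},
    "SD":     {"Yoruba": (786, 124, 101, 43), "Japanese": (794, 135, 86, 40),
               "Chinese": (785, 141, 101, 27), "CEPH": (741, 129, 141, 44)},
}

def summarize(row):
    """Return (fraction of regions polymorphic, share of polymorphic regions that are mixed)."""
    none, amp, dele, mixed = row
    poly = amp + dele + mixed
    return poly / (none + poly), mixed / poly

if __name__ == "__main__":
    for kind, table in COUNTS.items():
        for pop, row in table.items():
            frac_poly, frac_mixed = summarize(row)
            print(f"{kind:6s} {pop:8s} polymorphic={frac_poly:.3f} mixed share={frac_mixed:.3f}")
```

In every population the SD regions are polymorphic more often, and a far larger share of their polymorphic regions shows both amplifications and deletions.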
Distance-dependent recombination
The chance of recombination depends on the distance between Allele A and its copy
Simulation (probabilistic model)
Observations & Conclusions
A mutation rate of 0.0001 and a recombination rate of 0.001 in SD regions give the best fit to the observed real-life data.
Single mutations cannot explain the observed data, which can instead be explained by convergence via recombination.
Evolution-by-Duplication (EBD) appears to play a crucial role in evolution and molds the genetic circuitry in a rather constrained way, before it is subject to selection pressure.
Feeble Contrivance
•GWAS (Genome-Wide Association Studies)
  • Common Variants vs. Rare Variants
  • Haplotype Phasing/Linkage Analysis
•Poor Experiment Design
  • Reference Sequences
  • Genotypic vs. Haplotypic References
•Weak Technologies
Common vs. Rare Disease Variants
From Ionita-Laza (2009)
There are two disease models:
•CDCV: common disease, common variants
•CDRV: common disease, rare variants
Current genome-wide association studies consider only common variants (frequency at least 5%).
•Feasible with available resources
•The common loci identified so far have small effects (ORs 1.1 to 1.5) and explain only a small percentage of the estimated heritability.
Rare susceptibility variants are expected to play an important role:
•population genetics theory (Pritchard, 2001)
•empirical evidence (BMI, blood pressure, autism, Mendelian diseases, etc.)
Effect Size Distribution
Capture-Recapture Model
Suppose we have sequence data on Nind individuals in a genomic region.
An individual shows variation at a position if the corresponding allele is different from the ancestral one.
A position is variable (is a variant) if at least one individual in the dataset shows variation at that position.
Let xs be the number of individuals with variation at position s; for a variant, xs > 0.
Question: what is N, the total (unknown) number of variants in the region?
One can estimate the following:
•Δ(t) = # NEW variants expected to be found in a FUTURE dataset of size t · Nind, where t is a multiplier of the initial dataset size Nind.
•Δf(t) = # new variants with frequency at least f …
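To show what an estimate of Δ(t) looks like in practice, here is a sketch using the classical Good-Toulmin estimator as a stand-in. It is not claimed to be the estimator of Ionita-Laza et al. It assumes t · Nind is the number of additional individuals, treats each individual as one sampling unit, and is only reliable for t ≤ 1. The genotype matrix layout and function names are hypothetical.

```python
from collections import Counter

def variant_counts(genotypes):
    """genotypes[i][s] = 1 if individual i differs from the ancestral allele at position s.
    Returns x_s for every variable position (x_s > 0)."""
    n_pos = len(genotypes[0])
    x = [sum(ind[s] for ind in genotypes) for s in range(n_pos)]
    return [v for v in x if v > 0]

def good_toulmin(x, t):
    """Expected number of new variants in t * N_ind additional individuals (0 <= t <= 1)."""
    spectrum = Counter(x)          # spectrum[k] = number of variants seen in exactly k individuals
    return sum((-1) ** (k + 1) * t ** k * n_k for k, n_k in spectrum.items())

if __name__ == "__main__":
    toy = [[1, 0, 0, 1],
           [0, 0, 0, 1],
           [0, 1, 0, 0]]          # 3 individuals, 4 positions (toy data)
    x = variant_counts(toy)
    print(x, good_toulmin(x, 1.0))
```

The alternating series becomes unstable for t > 1, which is one reason smoothed or model-based estimators are used for larger extrapolations.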
ENCODE dataset
Ten 500Kb genomic regions were sequenced in several unrelated DNA samples:
•8 Yoruba (YRI)
•16 CEPH European (CEPH)
•7 Han Chinese (CHB)
•8 Japanese (JPT)
To make results comparable across the four populations (YRI, CEPH, CHB and JPT), they considered only 7 of the sequenced individuals for each dataset.
ENCODE - Δf(t)
From Ionita-Laza et al. 2009
How to Make a Better Human?
•Debugging a human better
•Sequencing a genome
•Sequencing a population
S ★ M ★ A ★ S ★ H
Single Molecule Approach to Sequencing-by-Hybridization
S*M*A*S*H
Sequence a human-size genome of about 6 Gb, including both haplotypes.
Integrate:
•Optical Mapping (Ordered Restriction Maps)
•Hybridization (with short nucleobase probes [PNA or LNA oligomers] with dsDNA on a surface), and
•Positional Sequencing by Hybridization (efficient polynomial time algorithms to solve “localized versions” of the PSBH problems)
Genomic DNA is carefully extracted
Fig 1
LNA probes of length 6 – 8 nucleotides are hybridized to dsDNA (double-stranded genomic DNA)
The modified DNA is stretched on a 1” x 1” chip.
Fig 2
DNA adheres to the surface along the channels and stretches out.
Molecules range from 0.3 to 3 million base pairs in length.
Bright emitters are attached to the probes and imaged (Fig 3).
Fig 3
A restriction enzyme cuts the DNA at specific sites.
The cut fragments of DNA relax like entropic springs, leaving small visible gaps
Fig 4
The DNA is then stained with a fluorogen (Fig 5) and reimaged.
The two images are combined into a composite image that suggests the locations of a specific short word (e.g., a probe) within the context of a pattern of restriction sites.
Fig 5
The integrated intensity measures the length of the DNA fragments.
The bright emitters on the probes provide a profile of the probe locations.
Fig 6
Restriction sites are represented by tall rectangles and probe sites by small circles.
These steps are repeated for all possible probe compositions (modulo reverse complementarity).
Software assembles the haplotypic ordered restriction maps with approximate probe locations superimposed on the map.
ATAT
TATC
ATCA
TCAT
CATA
ATATCATAT
Fig 7
S*M*A*S*H
Local clusters of overlapping words are combined by our PSBH (positional sequencing by hybridization) algorithm
ATAT
TATC
ATCA
TCAT
CATA
ATATCATAT
Fig 8
Probe Map (lambda DNA)
Final Probe Map
Consensus map with 2 probe locations at 14.8% and 52.4% of the DNA length.
In close agreement with the correct map, 50.2% and 85.7% (known from the sequence), when the molecule is read from the opposite end: 100 - 85.7 = 14.3 and 100 - 50.2 = 49.8.
Implied probe hybridization rate = 42%, significantly better than the needed 30%.
[Figure: four AFM images of lambda DNA with PNA probes; scale bar 500 nm.]
Combinatorial Structure
Discretization
Prediction
The probability of successfully computing the correct restriction map as a function of the number of cuts in the map and number of molecules used in creating the map…
Gentig: Bayesian Approach
Bayesian Model
Robustness
BAC clones with 6-cutters: average clone size = 160 Kb; average fragment size = 4 Kb; average number of cut sites = 40.
Parameters:
•Digestion rate can be as low as 10%
•Orientation of DNA need not be known
•40% foreign DNA
•85% DNA partially broken
•Relative sizing error up to 30%
•30% spurious randomly located cuts…
Single Molecule Haplotyping: Candida albicans
The left end of chromosome 1 of the common fungus Candida albicans (being sequenced by Stanford).
Three polymorphisms:
(A) Fragment 2 is of size 41.19kb (top) vs 38.73kb (bottom).
(B) The 3rd fragment, of size 7.76kb, is missing from the top haplotype.
(C) The large fragment in the middle is of size 61.78kb vs 59.66kb.
Problem to Solve…
Given probe maps of some small region of the genome for all N-bp hybridization probes (e.g. all 2080 probes of 6-bp),
with known error rates (false positives, false negatives and sizing errors),
can we reconstruct the complete sequence?
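The figure of 2080 probes follows from counting 6-mers modulo reverse complementarity: a probe and its reverse complement mark the same sites on double-stranded DNA, and the 4^3 = 64 self-complementary 6-mers pair with themselves, giving (4096 + 64) / 2 = 2080. A short check (function names are illustrative):

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def revcomp(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def canonical_probes(k):
    """All k-mers, keeping one representative per {probe, reverse complement} pair."""
    kmers = ("".join(p) for p in product("ACGT", repeat=k))
    return sorted({min(m, revcomp(m)) for m in kmers})

if __name__ == "__main__":
    print(len(canonical_probes(6)))
```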
Basic reconstruction algorithm
•Keep track of multiple sequence assemblies.
•Initialize with all possible 5-bp sequences.
•Try all 4 possible extensions of each sequence.
•Check if the probe is present in the corresponding map: if not, add a penalty score to the sequence involved.
•Periodically delete sequences with high penalty.
•Stop when the missing probe rate jumps significantly, from the False Negative rate (2%) to (100% - false extension rate) = 55%.
•Return the highest scoring sequence.
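The steps above can be sketched as a penalized beam search. This is a simplified illustration, not the authors' implementation: it assumes an error-free probe map keyed by exact start position, ignores reverse complements, and stops at a known target length instead of watching the missing-probe rate. All names and defaults are hypothetical.

```python
from itertools import product

def build_probe_map(seq, k):
    """Ideal probe map (hypothetical helper): k-mer -> set of start positions."""
    probe_map = {}
    for i in range(len(seq) - k + 1):
        probe_map.setdefault(seq[i:i + k], set()).add(i)
    return probe_map

def reconstruct(probe_map, k, length, tol=0, max_penalty=2, beam=64):
    """Extend candidates base by base, penalizing probes missing from the map."""
    # initialize with all possible (k-1)-bp sequences, each with zero penalty
    candidates = [("".join(p), 0) for p in product("ACGT", repeat=k - 1)]
    while len(candidates[0][0]) < length:
        extended = []
        for seq, penalty in candidates:
            for base in "ACGT":                      # try all 4 extensions
                new_seq = seq + base
                pos = len(new_seq) - k               # start of the newest k-mer
                sites = probe_map.get(new_seq[-k:], ())
                hit = any(abs(pos - q) <= tol for q in sites)
                new_penalty = penalty + (0 if hit else 1)
                if new_penalty <= max_penalty:       # delete high-penalty sequences
                    extended.append((new_seq, new_penalty))
        extended.sort(key=lambda c: c[1])
        candidates = extended[:beam]
    return candidates[0][0]                          # lowest-penalty assembly

if __name__ == "__main__":
    target = "ATATCATATGGC"
    print(reconstruct(build_probe_map(target, 4), k=4, length=len(target)))
```

With real data, `tol` would reflect the sizing error of the map and the penalty threshold would be tied to the false negative rate.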
Anomalies
Irresolvable ambiguities, from assemblies based on 6bp probes:
Error pattern:   s w sRC
Correct pattern: s wRC sRC
s = tcgcc (any 5 bases); sRC = ggcga (reverse complement of s)
w = CCCCTAAC (any short sequence under 50bp); wRC = GTTAGGGG (reverse complement of w)
Assembly: …tcgccCCCCTAACggcga…
Correct:  …tcgccGTTAGGGGggcga…
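A short check suggests why this pattern cannot be resolved from probe content alone: s w sRC is exactly the reverse complement of s wRC sRC, so the two candidates contain the same probes once a probe is identified with its reverse complement. The sketch below verifies this for the example on the slide; it ignores probe positions, which is an assumption made to keep the illustration small.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def revcomp(seq):
    """Reverse complement, preserving letter case."""
    return seq.translate(COMPLEMENT)[::-1]

def canonical_spectrum(seq, k):
    """Sorted k-mers of seq, each replaced by the smaller of itself and its reverse complement."""
    seq = seq.upper()
    kmers = (seq[i:i + k] for i in range(len(seq) - k + 1))
    return sorted(min(m, revcomp(m)) for m in kmers)

s, w = "tcgcc", "CCCCTAAC"
assembly = s + w + revcomp(s)            # error pattern:   s w sRC
correct = s + revcomp(w) + revcomp(s)    # correct pattern: s wRC sRC

if __name__ == "__main__":
    print(assembly, correct)
    print(revcomp(assembly) == correct)
    print(canonical_spectrum(assembly, 6) == canonical_spectrum(correct, 6))
```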
Directed Eulerian Graph
[Figure: directed graph with edges labeled by double-stranded 4-mers: AATT/TTAA, ATTC/TAAG, TTCG/AAGC, ATCG/TAGC.]
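As an illustration of the Eulerian-graph view, the sketch below builds a directed graph whose nodes are 3-mers and whose edges are the 4-mer words from the earlier S*M*A*S*H example (ATAT, TATC, ATCA, TCAT, CATA), then spells a sequence along an Eulerian path using Hierholzer's algorithm. It assumes single-stranded words and that ATAT is observed twice; both are simplifications for the example, and the function name is hypothetical.

```python
from collections import defaultdict

def eulerian_sequence(kmers):
    """Spell the sequence along an Eulerian path of the k-mer graph (Hierholzer)."""
    graph = defaultdict(list)
    indeg = defaultdict(int)
    outdeg = defaultdict(int)
    for word in kmers:
        u, v = word[:-1], word[1:]       # edge from prefix node to suffix node
        graph[u].append(v)
        outdeg[u] += 1
        indeg[v] += 1
    nodes = set(indeg) | set(outdeg)
    # start at the node with one more outgoing than incoming edge, if there is one
    start = next((n for n in nodes if outdeg[n] - indeg[n] == 1), kmers[0][:-1])
    stack, path = [start], []
    while stack:
        u = stack[-1]
        if graph[u]:
            stack.append(graph[u].pop())  # follow an unused edge
        else:
            path.append(stack.pop())      # dead end: emit node
    path.reverse()
    return path[0] + "".join(node[-1] for node in path[1:])

if __name__ == "__main__":
    words = ["ATAT", "TATC", "ATCA", "TCAT", "CATA", "ATAT"]
    print(eulerian_sequence(words))
```

When the graph admits several Eulerian paths the reconstruction is ambiguous, which is where the positional information from the probe maps helps.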
Mixing ‘solid’ bases with ‘wild-card’ bases:
•E.g., xx-x-x-xx (9-mers) or xxx--x--x--xxx (14-mers)
An ‘inert’ base
•Universal: in terms of its ability to form base pairs with the other natural DNA/RNA bases.
•Examples: the naturally occurring base hypoxanthine, as its ribo- or 2'-deoxyribonucleoside; 2'-deoxyisoinosine; 7-deaza-2'-deoxyinosine; 2-aza-2'-deoxyinosine
Simulation Results
[Figure: two log-scale plots of errors per 10kb sequence. UNGAPPED: versus bases per probe (5 to 8), vertical axis 1 to 10000. GAPPED: versus gapped bases per probe (0 to 5, with 6 solid bases), vertical axis 0.01 to 1000.]
1000 Rupees Genome
22.67 US$ for 6 billion bases
135 billion US$ for the entire human population
Who we are…
Population: David Albers (Columbia), Eric Aslakson (NYU), Mickey Atwal (CSHL), Ivan Iossifov (CSHL), Hossein Khiabanian (Columbia), Samantha Kleinberg (NYU), Partha Mitra (CSHL), Michaela Oswald (CSHL), Raul Rabadan (Columbia), Vladimir Trifonov (Columbia), Daniel Valente (CSHL), Chris Wiggins (Columbia)
Polymorphisms: Iuliana Ionita-Laza (Harvard), Antonina Mitrofanova (NYU), Joey Zhao (Princeton)
SMASH: TS Anantharaman (OpGen), Charles Cantor (Sequenom), Vladimir Demidov (BU), Pierre Franquin (NYU), Alex Lim (Ex-NYU), Toto Paxia (Ex-NYU), Jason Reed (UCLA), Andrew Sundstrom (NYU)
SUTTA: Giuseppe Narzisi (NYU), Alessio Narzisi (NYU/Catania)
“Beware prejudices.
“They are like rats, and men's minds are like traps; prejudices get in easily, but it is doubtful if they ever get out.”
Lord Jeffrey