1
Linkage Disequilibrium Mapping and HaploBlock
Gideon [email protected]
Guest lecture for course: Computational Genetics (236608)
2
Part 1: LD Mapping
• Basic LD Mapping
  – χ-squared test for individual SNPs
• Mapping with Haplotypes
  – Population phenomena
• Haplotyping
  – Clark algorithm
  – EM algorithm
3
Linkage Disequilibrium
• LD = Another word for ‘correlation’
  – Correlation between markers in a population
• Random recombination destroys correlation
  – Close markers may have high LD
  – Above 1 Mb, LD disappears
4
LD Mapping: The Basics
• Take set of unrelated individuals
  – Ideally from a small, inbred population
• Measure markers at high resolution
  – Single Nucleotide Polymorphisms are ideal
• Test marker–disease correlations
  – Non-parametric disease model
  – Suitable (in theory) for low penetrance
5
LD Mapping in Action
[Figure: worked LD mapping example; surviving numeric labels: 3 1 10 1 2 / 1 3 11 1 1]
6
Chi-Squared Test
Observed Counts
        Case   Control     ∑
A         69       236   305
a         31       264   295
∑        100       500   600

Expected Counts
        Case   Control     ∑
A      50.83    254.17   305
a      49.17    245.83   295
∑        100       500   600

$\chi^2 = \sum \frac{(o - e)^2}{e} = 15.85$
1 degree of freedom ⇒ p-value ≈ 0.0001
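The test on this slide can be reproduced in a few lines. The following is a minimal sketch, assuming a 2×2 table of allele counts (rows A and a, columns case and control) and using only the standard library; the function name `chi_squared_test` is illustrative and not from the lecture.

```python
import math

def chi_squared_test(observed):
    """Pearson chi-squared test for a 2x2 table of counts (1 degree of freedom)."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    # Expected count under independence: row total * column total / grand total.
    expected = [[r * c / total for c in col_totals] for r in row_totals]
    statistic = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )
    # With 1 degree of freedom the upper tail probability is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return expected, statistic, p_value

# Rows: allele A, allele a. Columns: case, control (counts from the slide).
observed = [[69, 236], [31, 264]]
expected, statistic, p_value = chi_squared_test(observed)
print(round(statistic, 2), p_value)
```

The closed-form tail probability only holds for one degree of freedom; larger tables would need a general chi-squared distribution function.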
7
SNPs
• Single base pair which exhibits variation
  – Caused by point mutations during meiosis
  – Variation almost always biallelic
• dbSNP contains ~4.3×10^6 SNPs
  – Over 1 SNP per 1,000 base pairs
  – About half with minor allele frequency > 20%
  – This number is still growing rapidly!
8
LD Mapping in Context
Identify chromosome (10^8 bp)
Linkage analysis (10^6~10^7 bp)
Identify genes (10^5~10^6 bp)
Resequencing (100 bp)
9
False Positives
• Causes of spurious LD
  – Population structure
    • Migration and admixture
    • Preferential mating
  – Phenotypic site interaction
    • Disease epistasis
• Key problem: too many SNP tests
  – Bonferroni correction
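The Bonferroni correction mentioned above divides the significance level by the number of tests. A minimal sketch, assuming a plain list of per-SNP p-values; the function name and the example p-values are hypothetical and chosen only for illustration.

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Return indices of tests that remain significant after Bonferroni correction."""
    threshold = alpha / len(p_values)  # per-test significance threshold
    return [i for i, p in enumerate(p_values) if p < threshold]

# Hypothetical p-values for four SNP tests (illustrative only).
p_values = [0.0001, 0.02, 0.04, 0.3]
significant = bonferroni_significant(p_values)
print(significant)
```

With four tests the per-test threshold becomes 0.0125, so p-values such as 0.02 that pass an uncorrected 0.05 cut-off are no longer reported.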
10
Haplotypes
Chromosome SNP Non-variable region
Haplotype = AGTCT
A G T C T
SNP allele
Generally, only a few of the 2^loci possible haplotypes cover >90% of a population,
due to bottleneck effects and genetic drift.
11
Bottleneck Effects
[Figure: population bottlenecks; time labels 10^6 years and 10^5 years]
12
Genetic Drift
[Figure: allele frequency (0 to 1000) against generation (0 to 1000)]
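A drift trajectory of this kind can be simulated with neutral Wright-Fisher sampling. This is a sketch under stated assumptions, not necessarily how the slide's plot was produced: the population of 1000 allele copies and 1000 generations follow the axis ranges, while the starting count of 500 and the seed are arbitrary choices.

```python
import random

def wright_fisher(pop_size=1000, start_count=500, generations=1000, seed=0):
    """Simulate neutral drift of one allele; returns the allele count per generation."""
    rng = random.Random(seed)
    counts = [start_count]
    for _ in range(generations):
        freq = counts[-1] / pop_size
        # Each copy in the next generation descends from a random copy in this one.
        counts.append(sum(rng.random() < freq for _ in range(pop_size)))
    return counts

trajectory = wright_fisher()
print(trajectory[0], trajectory[-1])
```

Once the count reaches 0 or `pop_size` it stays there, which is the loss of variation that leaves only a few haplotypes in a population.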
13
LD Mapping with Haplotypes
• Obtain haplotypes for a genomic region
  – Treat haplotype as correlated allele
• Advantage: fewer tests
  – Reduced false positive rate
• Disadvantage: ignores recombination
  – Different haplotypes could contain target
• Best: consider partial haplotypes…
14
The Haplotyping Problem
Variable loci          1    2    3    4    5
Maternal chromosome    A    G    T    C    T
Paternal chromosome    T    C    T    A    G
Observed genotypes     A/T  C/G  T/T  A/C  G/T
The two haplotypes are hidden; only the genotypes are observed.
15
Why is it hard?
• A series of joint measurements containing h heterozygous loci can be divided 2^(h−1) ways (we don’t care which is maternal or paternal).
Genotype: A/T C/G T/T A/C G/T (h = 4, so 8 possible pairs)
ACTAG / TGTCT
AGTAG / TCTCT
ACTAT / TGTCG
ACTCG / TGTAT
ACTCT / TGTAG
AGTAT / TCTCG
AGTCG / TCTAT
AGTCT / TCTAG  (Correct!)
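The count of 2^(h−1) can be checked by enumerating the pairs directly. A minimal sketch, assuming a genotype is given as a list of two-allele tuples; the function name `haplotype_pairs` is illustrative.

```python
from itertools import product

def haplotype_pairs(genotype):
    """Enumerate unordered haplotype pairs consistent with an unphased genotype."""
    het = [i for i, (x, y) in enumerate(genotype) if x != y]
    pairs = []
    # Fixing the phase of the first heterozygous locus avoids listing each
    # unordered pair twice, which leaves 2^(h-1) pairs for h >= 1.
    for flips in product([False, True], repeat=max(len(het) - 1, 0)):
        flip_at = dict(zip(het[1:], flips))
        first, second = [], []
        for i, (x, y) in enumerate(genotype):
            if flip_at.get(i, False):
                x, y = y, x
            first.append(x)
            second.append(y)
        pairs.append(("".join(first), "".join(second)))
    return pairs

genotype = [tuple(locus.split("/")) for locus in "A/T C/G T/T A/C G/T".split()]
pairs = haplotype_pairs(genotype)
print(len(pairs))
```

For the genotype on this slide the enumeration yields eight pairs, one of which is the true pair AGTCT / TCTAG.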
16
Why is it approachable?
• Many of the haplotypes appear many times.
• Data for many individuals allows inference.
Genotypes:    A/T C/G T/T A/C G/T    A/A C/G C/T A/C T/T
Solution 1:   ACTAT / TGTCG          ACCCT / AGTAT
Solution 2:   TCTAG / AGTCT          ACCAT / AGTCT
The second solution seems ‘better’ since it uses fewer haplotypes (3 rather than 4).
17
Formalization 1
• Assume all loci biallelic (realistic).
• Individuals numbered 1…n
• Loci numbered 1…l
• Possible alleles B = {0,1}
• Possible haplotypes H = B^l
• Possible locus observations L = {[B,B]}
• Possible genotypes G = L^l
• Possible haplotype pairs D = {[H,H]}
18
Formalization 2
• Given a true haplotype pair [h1,h2] ∈ D, G(h1,h2) ∈ G is the genotype observed.
• Given an observed genotype g ∈ G, D(g) ⊆ D is the set of possible haplotype pairs.
• Problem input: (g1,…,gn) where gi ∈ G
• Problem output: (d1,…,dn) where di ∈ D(gi)
19
Clark’s Algorithm
1. Initialize set S to {}.
2. For genotypes gi with a single possibility [h1,h2], assign di=[h1,h2] and add h1,h2 to S.
3. For genotypes gi with a possibility containing a member h1 ∈ S and another haplotype h2, assign di=[h1,h2] and add h2 to S.
4. Repeat step 3 until all haplotypes are assigned or we add nothing new to S.
5. Assign any remaining di arbitrarily.
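The five steps can be sketched as follows. This is an illustrative reading of the slide, not a reference implementation: the tie-breaking (first compatible pair found, genotypes visited in input order) and the helper names `candidate_pairs` and `clark` are assumptions.

```python
from itertools import product

def candidate_pairs(genotype):
    """Unordered haplotype pairs compatible with a genotype (list of allele 2-tuples)."""
    het = [i for i, (x, y) in enumerate(genotype) if x != y]
    pairs = []
    for flips in product([False, True], repeat=max(len(het) - 1, 0)):
        flip_at = dict(zip(het[1:], flips))
        a = "".join(y if flip_at.get(i) else x for i, (x, y) in enumerate(genotype))
        b = "".join(x if flip_at.get(i) else y for i, (x, y) in enumerate(genotype))
        pairs.append((a, b))
    return pairs

def clark(genotypes):
    known = set()                                  # step 1: S = {}
    options = [candidate_pairs(g) for g in genotypes]
    assigned = [None] * len(genotypes)
    for i, opts in enumerate(options):             # step 2: unambiguous genotypes
        if len(opts) == 1:
            assigned[i] = opts[0]
            known.update(opts[0])
    progress = True
    while progress:                                # steps 3 and 4
        progress = False
        for i, opts in enumerate(options):
            if assigned[i] is not None:
                continue
            for h1, h2 in opts:
                if h1 in known or h2 in known:
                    assigned[i] = (h1, h2)
                    known.update((h1, h2))
                    progress = True
                    break
    for i, opts in enumerate(options):             # step 5: arbitrary choice
        if assigned[i] is None:
            assigned[i] = opts[0]
    return assigned

lines = [
    "A/T C/G T/T A/C G/T",
    "A/A C/G C/T A/C T/T",
    "A/A C/G T/T C/C T/T",
    "A/T C/C C/T A/A G/T",
    "A/T C/G C/C A/A G/T",
]
genotypes = [[tuple(locus.split("/")) for locus in line.split()] for line in lines]
assigned = clark(genotypes)
used = {h for pair in assigned for h in pair}
print(assigned, len(used))
```

Because the result depends on the order in which genotypes and candidate pairs are visited, shuffling `lines` can change both the assignment and the number of haplotypes used.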
20
Clark: Run
Genotype                 Haplotype pair
A/T C/G T/T A/C G/T      TGTAG / ACTCT
A/A C/G C/T A/C T/T      AGCAT / ACTCT
A/A C/G T/T C/C T/T      AGTCT / ACTCT
A/T C/C C/T A/A G/T      ACTAT / TCCAG
A/T C/G C/C A/A G/T      AGCAT / TCCAG
6 haplotypes used
21
Clark: Rerun (same input)
Genotype                 Haplotype pair
A/T C/G T/T A/C G/T      AGTCT / TCTAG
A/A C/G C/T A/C T/T      AGTCT / ACCAT
A/A C/G T/T C/C T/T      AGTCT / ACTCT
A/T C/C C/T A/A G/T      TCTAG / ACCAT
A/T C/G C/C A/A G/T      TGCAG / ACCAT
5 haplotypes used
22
Clark: Comments
• Implementation is very fast, O(l·n^2)
• Total failure if no starting point.
• Blind haplotyping of ‘orphans’ at end.
• Arbitrary selections based on input order.
  – Try multiple orderings, select best results.
• Or formulate choices as integer program
  – Solve approximately by linear relaxation.
23
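Likelihood: Bayes Network
[Figure: network with HM and HP as parents of D, and D as parent of G]
• HM, HP: Pr(HM=h|θH) = Pr(HP=h|θH) = θh, with |HM| = |HP| = 2^l
• D = [hm,hp], with |D| = 2^l + 2^l×(2^l − 1)/2
• G = G(d), with |G| = 3^l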
EM Algorithm
1. Initialize θH^0 to uniform distribution with small random perturbations.
2. E step: For each d ∈ D, calculate Exp(d|θH^i) using observed data G.
3. M step: Calculate θH^(i+1) from Exp(d|θH^i).
4. Repeat 2 and 3 until θH converges.
5. Assign output di by maximum likelihood.
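The steps above can be sketched for small inputs as follows. This is a minimal illustration under assumptions: a fixed number of iterations stands in for the convergence test, the perturbation size and seed are arbitrary, and the names `candidate_pairs` and `em_haplotypes` are hypothetical.

```python
import random
from collections import defaultdict
from itertools import product

def candidate_pairs(genotype):
    """Unordered haplotype pairs compatible with a genotype (list of allele 2-tuples)."""
    het = [i for i, (x, y) in enumerate(genotype) if x != y]
    pairs = []
    for flips in product([False, True], repeat=max(len(het) - 1, 0)):
        flip_at = dict(zip(het[1:], flips))
        a = "".join(y if flip_at.get(i) else x for i, (x, y) in enumerate(genotype))
        b = "".join(x if flip_at.get(i) else y for i, (x, y) in enumerate(genotype))
        pairs.append((a, b))
    return pairs

def em_haplotypes(genotypes, iterations=100, seed=0):
    rng = random.Random(seed)
    options = [candidate_pairs(g) for g in genotypes]
    haplotypes = sorted({h for opts in options for pair in opts for h in pair})
    # Step 1: uniform start with small random perturbations, then normalise.
    theta = {h: 1.0 + 0.01 * rng.random() for h in haplotypes}
    norm = sum(theta.values())
    theta = {h: v / norm for h, v in theta.items()}
    for _ in range(iterations):
        expected = defaultdict(float)
        for opts in options:
            # E step: posterior weight of each candidate pair given theta.
            # The factor 2 for unordered heterozygous pairs is shared by all
            # candidates of a genotype, so it cancels in the normalisation.
            weights = [theta[a] * theta[b] for a, b in opts]
            total = sum(weights)
            for (a, b), w in zip(opts, weights):
                expected[a] += w / total
                expected[b] += w / total
        # M step: frequency = expected copies / number of chromosomes.
        theta = {h: expected[h] / (2 * len(genotypes)) for h in haplotypes}
    # Step 5: pick the most likely pair for each genotype.
    assignment = [max(opts, key=lambda p: theta[p[0]] * theta[p[1]]) for opts in options]
    return theta, assignment

lines = [
    "A/T C/G T/T A/C G/T",
    "A/A C/G C/T A/C T/T",
    "A/A C/G T/T C/C T/T",
    "A/T C/C C/T A/A G/T",
    "A/T C/G C/C A/A G/T",
]
genotypes = [[tuple(locus.split("/")) for locus in line.split()] for line in lines]
theta, assignment = em_haplotypes(genotypes)
print(sorted(theta.items(), key=lambda kv: -kv[1])[:5])
```

Restricting θ to haplotypes that appear in some D(gi) is what keeps the table small here; for the five genotypes used in the Clark examples there are 26 such candidates.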
25
EM: Iterations (same input)
[Figure: estimated haplotype frequencies θH against EM iteration (0 to 10) for the 26 candidate haplotypes. Frequencies start near 0.037 to 0.040; the five labelled curves converge to 0.300, 0.300, 0.200, 0.100 and 0.100.]
Candidate haplotypes: TGCAG, TGCAT, TGTCG, TGTCT, TGTAG, TGTAT, TCCAG, TCCAT, TCTCG, TCTCT, TCTAG, TCTAT, AGCCT, AGCAG, AGCAT, AGTCG, AGTCT, AGTAG, AGTAT, ACCCT, ACCAG, ACCAT, ACTCG, ACTCT, ACTAG, ACTAT
26
EM: Output = Clark rerun
Genotype                 Haplotype pair
A/T C/G T/T A/C G/T      AGTCT / TCTAG
A/A C/G C/T A/C T/T      AGTCT / ACCAT
A/A C/G T/T C/C T/T      AGTCT / ACTCT
A/T C/C C/T A/A G/T      TCTAG / ACCAT
A/T C/G C/C A/A G/T      TGCAG / ACCAT
Each pair is the most likely one under the final θH.
27
EM: Comments
• Problem of EM entering local optima
  – Try multiple runs, select best result.
• Generally far better results than Clark.
• Exponential complexity O(2^l⋅n) due to |H|
  – Ignore impossible haplotypes for O(1.5^l⋅n)
• Still infeasible for >30 loci
  – Hierarchical EM, MCMC methods.
28
Part 2: HaploBlock
• Haplotype blocks
• Statistical model
• Model inference
• Model criterion
• Applications
  – Haplotyping
  – Block-based LD mapping
29
Recombination Hotspots
30
Haplotype Blocks
  1  GAACTGC  ATTCGACTG  CCAGTAGC
  2  ACGTACA  GATGAGCTG  CCAGTAGC
  …
 99  ACGTACA  AACCGAGGT  TGTACTAA
100  GAACTGC  GATGAGCTG  TGTGCTAA
Recombination hotspot separates blocks
Few block variants due to bottlenecks, drift
Mutation hotspot
31
Bayesian Network Model
[Figure: C is the parent of A1, A2, A3; each Aj is the parent of Hj]
• Values of variable C are 1…q, denoting the index of the block’s haplotype.
  – Pr(C = c) is the frequency of haplotype c.
• Values of variable Aj are A, C, G, T, –, denoting the allele at site j of the haplotype. Example: A1 A2 A3 = CTA for C = 2.
  – Pr(aj | c) is deterministic.
• Values of variable Hj are A, C, G, T, –, denoting the allele at site j observed after possible haplotype mutations.
  – Pr(hj | aj) is the cumulative mutation rate.
32
Bayesian Network Model
[Figure: the nodes C, A1 to A3 and H1 to H3 together form one haplotype block]
Pr(c, a1, a2, a3, h1, h2, h3) =
Pr(c) × Pr(a1 | c) × Pr(a2 | c) × Pr(a3 | c) × Pr(h1 | a1) × Pr(h2 | a2) × Pr(h3 | a3)
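The factorisation for a single block can be evaluated directly. This sketch makes two assumptions that are not in the slides: mutations occur independently per site with one rate `mu`, spread uniformly over the other four symbols, and the ancestor table and frequencies are made-up values (only the ancestor CTA for C = 2 echoes the slide's example).

```python
from itertools import product

ALPHABET = "ACGT-"

def emission(observed, ancestor, mu):
    """Pr(h_j | a_j) under an assumed uniform per-site mutation model."""
    return 1.0 - mu if observed == ancestor else mu / (len(ALPHABET) - 1)

def block_joint(c, observed, prior, ancestors, mu):
    """Pr(c, a, h) for one block; a is fixed by c since Pr(a_j | c) is deterministic."""
    p = prior[c]
    for h, a in zip(observed, ancestors[c]):
        p *= emission(h, a, mu)
    return p

def block_marginal(observed, prior, ancestors, mu):
    """Pr(h) for one block, summing the joint over the ancestor index c."""
    return sum(block_joint(c, observed, prior, ancestors, mu) for c in prior)

prior = {1: 0.6, 2: 0.4}          # hypothetical ancestor frequencies Pr(C = c)
ancestors = {1: "GTA", 2: "CTA"}  # hypothetical ancestor sequences
mu = 0.01                         # hypothetical mutation rate

joint = block_joint(2, "CTA", prior, ancestors, mu)
total = sum(block_marginal("".join(s), prior, ancestors, mu)
            for s in product(ALPHABET, repeat=3))
print(joint, total)
```

Summing the marginal over every possible observed sequence gives 1, which is a quick check that the factors form a proper distribution.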
33
Bayesian Network Model
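[Figure: chain of block variables C1, C2, C3 separated by recombination hotspots; allele nodes A1 to A7 with observed nodes H1 to H7 beneath them; the sites under one C form a haplotype block]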
34
Data Likelihood
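• For haplotypes H, likelihood is:
[Figure: two-block example with C1, C2 above A1 to A5 and H1 to H5]
$$\Pr(H) = \prod_{h \in H} \sum_{c_1} \cdots \sum_{c_b} \sum_{a_1} \cdots \sum_{a_l} \Pr(c_1) \prod_{k=2}^{b} \Pr(c_k \mid c_{k-1}) \prod_{k=1}^{b} \prod_{j=s_k}^{e_k} \Pr(a_j \mid c_k) \Pr(h_j \mid a_j)$$
where b is the number of blocks and block k spans sites s_k to e_k.
• But we can calculate this efficiently using a suitable elimination ordering!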
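One elimination ordering that makes the sum cheap is to sweep the blocks left to right, keeping a probability for each value of the current block variable. The sketch below assumes a uniform per-site mutation model with rate `mu` and uses made-up blocks, ancestors and transition tables; none of these values come from the lecture.

```python
from itertools import product

ALPHABET = "ACGT-"

def emission(observed, ancestor, mu):
    """Pr(h_j | a_j) under an assumed uniform per-site mutation model."""
    return 1.0 - mu if observed == ancestor else mu / (len(ALPHABET) - 1)

def haplotype_likelihood(h, blocks, initial, transitions, mu):
    """Pr(h) for one haplotype, eliminating c_1, c_2, ... in block order."""
    forward = {}
    start = 0
    for k, ancestors in enumerate(blocks):
        width = len(next(iter(ancestors.values())))
        segment = h[start:start + width]
        start += width
        updated = {}
        for c, ancestor in ancestors.items():
            if k == 0:
                reach = initial[c]
            else:
                # Sum out the previous block variable c_{k-1}.
                reach = sum(forward[p] * transitions[k - 1][p][c] for p in forward)
            emit = 1.0
            for x, a in zip(segment, ancestor):
                emit *= emission(x, a, mu)  # a_j is fixed by c_k
            updated[c] = reach * emit
        forward = updated
    return sum(forward.values())

# Hypothetical two-block model: sites 1-2 in block 1, site 3 in block 2.
blocks = [{0: "AC", 1: "GT"}, {0: "A", 1: "T"}]
initial = {0: 0.7, 1: 0.3}
transitions = [{0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}]
mu = 0.01

likelihood = haplotype_likelihood("ACA", blocks, initial, transitions, mu)
total = sum(haplotype_likelihood("".join(s), blocks, initial, transitions, mu)
            for s in product(ALPHABET, repeat=3))
print(likelihood, total)
```

Pr(H) for a set of haplotypes is then the product of the per-haplotype values, and the cost grows linearly with the number of sites rather than exponentially with the number of blocks.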
35
Data Criterion
• Maximum Likelihood leads to over-fitting
  – No hotspots, no mutations, many ancestors
  – Need to consider model complexity
  – Min DL(H,M) = DL(M) − log_2 Pr(H|M)
• DL(M) considers variable elements only
  – Ancestor block sequences
  – Markov chain parameters
[Figure: two-block example model with C1, C2, A1 to A5, H1 to H5]
36
Model Search
[Figure: variants of a two-block model (C1, C2 over A1 to A5 and H1 to H5) illustrating the search moves: nudge, add/remove, and ancestors. Numeric labels in the original: 4 3 (start), 4 3 (nudge), 5 (add/remove, single block C1), 4 2 (ancestors).]
37
Hotspot Addition – Initial Scan
[Figure: MDL score (3500 to 6000) against hotspot position (0 to 90)]
38
Hotspot Addition – Minima 1
[Figure: MDL score (3500 to 6000) against hotspot position (0 to 90)]
39
Hotspot Addition – Pass 2
[Figure: MDL score (3500 to 6000) against hotspot position (0 to 90)]
40
Hotspot Addition – Minima 2
[Figure: MDL score (3500 to 6000) against hotspot position (0 to 90)]
41
Hotspot Addition – Pass 3
[Figure: MDL score (3500 to 6000) against hotspot position (0 to 90)]
42
Hotspot Addition – Minima 3
[Figure: MDL score (3500 to 6000) against hotspot position (0 to 90)]
43
Hotspot Addition – Pass 4
[Figure: MDL score (3500 to 6000) against hotspot position (0 to 90)]
44
Hotspot Addition – Summary
[Figure: MDL score (3500 to 6000) against hotspot position (0 to 90)]
45
EM Modifications
• Deterministic distribution for Aj
• Mutation rate constraints for Hj
• Infeasible calculations along whole model
1. Fix zero mutation rate to cluster sequences.
2. Extract ancestor alleles by maximum likelihood.
3. Constrained EM for mutation rates.
4. EM for Markov chain.
[Figure: beside each step, the part of the block network (Ck, As…Ae, Hs…He) that the step works on]
46
Model for Haplotyping
[Figure: two copies of the block model, (C1…C3, A1…A7, H1…H7) and (C'1…C'3, A'1…A'7, H'1…H'7), whose observed alleles combine into the genotypes G1…G7]
• Learn model directly from genotypes
• Haplotype pair: choose most likely under model
47
Haplotyping Results
Site pairwise error rate   C21a    C21b    C21c    C21d    C21e    ACE
Clark                      .0548   .0251   .0280   .0329   .0234   .0381
HAPLOTYPER                 .0224   failed  .0204   .0077   failed  .0102
PHASE                      .0669   .0403   .0655   .0262   .0183   .0419
Hierarchical EM            .0095   .0042   .0009   .0047   .0083   .0152
HaploBlock                 .0047   .0020   .0005   .0014   .0048   .0098
Improvement factor         2x      2x      2x      3x      2x      =

C21x data: 20 haplotypes, 100 SNPs over ≤ 35kb, Patil et al. (2001)
ACE data: 22 haplotypes, 52 SNPs over 24kb, Rieder et al. (1999)
Average shown for 10 random pairings of true haplotypes
48
Model for LD Mapping
• Learn model from marker data
• Mapping: try making phenotype dependent on each block
[Figure: block model (C1…C3, A1…A7, H1…H7) with a phenotype node P attached to a block]
49
LD Mapping Results
Resequencing required   5q31 haplos   5q31 genos   Chr 21
BLADE                   144 kb        –            107 kb
No Blocks               131 kb        105 kb       33 kb
HaploBlock              40 kb         37 kb        24 kb
Improvement factor      3x            3x           1.4x

5q31 data: 258 haplotypes, 98 SNPs over 464kb, Daly et al. (2001)
Chr 21 data: 20 haplotypes, 5 sets of 200 SNPs, Patil et al. (2001)
Average shown for 5 random selections of target SNP
50
HaploBlock: Comments
• Our model boils down to an HMM
  – Calculations have linear complexity
  – Forward/backward probability caching
• Better to infer multiple models
  – Prevent getting stuck in local minima
  – Account for uncertainty of block identification
  – Use Gibbs-style iterations on hotspots
  – Take ‘average’ result over set of models