c1 c2 c3 linkage disequilibrium 1 mapping and haploblock

1

C1

A1 A2 A3

H1 H2 H3

A4 A5 A6 A7

H4 H5 H6 H7

C2 C3

Linkage DisequilibriumMapping and HaploBlock

Gideon [email protected]

Guest lecture for course:Computational Genetics (236608)

2

Part 1: LD Mapping

• Basic LD Mapping– χ-squared test for individual SNPs

• Mapping with Haplotypes– Population phenomena

• Haplotyping– Clark algorithm– EM algorithm

3

Linkage Disequilibrium• LD = Another word for ‘correlation’

– Correlation between markers in a population• Random recombination destroys correlation

– Close markers may have high LD– Above 1 Mb, LD disappears

4

LD Mapping: The Basics

• Take set of unrelated individuals– Ideally from a small, inbred population

• Measure markers at high resolution– Single Nucleotide Polymorphisms are ideal

• Test marker–disease correlations– Non-parametric disease model– Suitable (in theory) for low penetrance

5

LD Mapping in Action

3 1 10 1 2

1 3 11 1 1

6

Chi-Squared Test

600500100∑29526431a30523669A∑ControlCase

Observed Counts

600500100∑295245.8349.17a305254.1750.83A∑ControlCase

Expected Counts

€

χ 2 =(o − e)2

e∑ =15.85 1 degree of freedom⇒ p-value = 0.0001

7

SNPs

• Single base pair which exhibits variation– Caused by point mutations during meiosis– Variation almost always biallelic

• dbSNP contains ~ 4.3×106 SNPs– Over 1 SNP per 1,000 base pairs– About half with minor allele frequency > 20%– This number is still growing rapidly!

8

LD Mapping in Context

Identifychromosome

(108 bp)

Linkageanalysis

(106~107 bp)

Identifygenes

(105~106 bp)Resequencing

(100 bp)

9

False Positives

• Causes of spurious LD– Population structure

• Migration and admixture• Preferential mating

– Phenotypic site interaction• Disease epistasis

• Key problem: too many SNP tests– Bonferroni correction

10

Haplotypes

Chromosome SNP Non-variable region

Haplotype = AGTCT

A G T C T

SNP allele

Generally, only a few of the 2loci possiblehaplotypes cover >90% of a population,

due to bottleneck effects and genetic drift.

11

Bottleneck Effects

106 years 105 years

12

Genetic Drift

0

100

200

300

400

500

600

700

800

900

1000

0 100 200 300 400 500 600 700 800 900 1000

Generation

Alle

le F

requ

ency

13

LD Mapping with Haplotypes

• Obtain haplotypes for a genomic region– Treat haplotype as correlated allele

• Advantage: fewer tests– Reduced false positive rate

• Disadvantage: ignores recombination– Different haplotypes could contain target

• Best: consider partial haplotypes…

14

The Haplotyping Problem1 2 3 4 5Variable Loci

MaternalChromosome

PaternalChromosome

ObservedGenotypes A/T C/G T/T A/C G/T

A G T C T

T C T A G

HiddenHaplotypes

15

Why is it hard?• A series of joint measurements containing h

heterozygous loci can be divided 2h-1 ways(we don’t care which is maternal or paternal).

A/T C/G T/T A/C G/T

A C T AG

T G T C T

AG T AG

T C T C T

A C T A T

T G T CG

A C T CG

T G T A T A C T C T

T G T AG

AG T A T

T C T CG

AG T CG

T C T A T

AG T C T

T C T AGCorrect!

16

Why is it approachable?

• Many of the haplotypes appear many times.• Data for many individuals allows inference.

A/T C/G T/T A/C G/T A/A C/G C/T A/C T/T

Solution seems ‘better’ since it uses fewer haplotypes.

A C T A TT G T CG A C C C TAG T A T

T C T AGAG T C T A C C A TAG T C T

17

Formalization 1• Assume all loci biallelic (realistic).• Individuals numbered 1…n• Loci numbered 1…l• Possible alleles B={0,1}• Possible haplotypes H= Bl

• Possible locus observations L={[B,B]}• Possible genotypes G=Ll

• Possible haplotype pairs D={[H,H]}

18

Formalization 2

• Given a true haplotype pair [h1,h2] ∈ D,G(h1,h2) ∈ G is the genotype observed.

• Given an observed genotype g ∈ G,D(g) ⊆ D is set of possible haplotype pairs.

• Problem input: (g1,…,gn) where gi ∈ G• Problem output: (d1,…,dn) where di ∈ D(gi)

19

Clark’s Algorithm

1. Initialize set S to {}.2. For genotypes gi with a single possibility [h1,h2]

assign di=[h1,h2] and add h1,h2 to S.3. For genotypes gi with a possibility containing a

member h1 ∈ S and another haplotype h2, assigndi=[h1,h2] and add h2 to S.

4. Repeat step 3 until all haplotypes are assigned orwe add nothing new to S.

5. Assign any remaining di arbitrarily.

20

Clark: Run

A/T C/G T/T A/C G/T

A/A C/G C/T A/C T/T

A/A C/G T/T C/C T/T AG T C TA C T C T

A/T C/C C/T A/A G/T

T G T AGA C T C T

AGC A TA C T C T

A C T A TT C C AG

A/T C/G C/C A/A G/T AGC A TT C C AG6

hapl

otyp

es u

sed

21

Clark: Rerun (same input)

A/T C/G T/T A/C G/T

A/A C/G C/T A/C T/T

A/A C/G T/T C/C T/T AG T C TA C T C T

A/T C/C C/T A/A G/T

AG T C TT C T AG

AG T C TA C C A T

T C T AGA C C A T

A/T C/G C/C A/A G/T T GC AGA C C A T5

hapl

otyp

es u

sed

22

Clark: Comments

• Implementation is very fast, O(ln2)• Total failure if no starting point.• Blind haplotyping of ‘orphans’ at end.• Arbitrary selections based on input order.

– Try multiple orderings, select best results.• Or formulate choices as integer program

– Solve approximately by linear relaxation.

23

Likelihood: Bayes NetworkPr(HM=h|θH)=Pr(HP=h|θH)=θh

D=[hm,hp]

G=G(d)

HM HP|HM|=|HP|=2l

D|D|= 2l +2l×(2l -1)/2

G|G|= 3l

24

EM Algorithm

1. Initialize θH0 to uniform distribution with

small random perturbations.2. E step: For each d ∈ D, calculate

Exp(d|θHi) using observed data G.

3. M step: Calculate θHi+1 from Exp(d|θH

i).4. Repeat 2 and 3 until θH converges.5. Assign output di by maximum likelihood.

25

EM: Iterations (same input)

0.0370.138 0.155

0.1500.134 0.121 0.113 0.106 0.101 0.100 0.100

0.037

0.072 0.1020.150

0.193 0.225 0.260 0.289 0.298 0.300 0.300

0.037

0.1360.204 0.266 0.279

0.2870.294 0.299 0.300 0.300

0.040

0.037 0.081 0.141 0.182 0.196 0.199 0.200 0.200 0.200 0.200

0.040 0.024 0.025 0.028 0.033 0.0470.073 0.095 0.100 0.100 0.100

0.245

0 1 2 3 4 5 6 7 8 9 10

Iteration

TGCAGTGCATTGTCGTGTCTTGTAGTGTATTCCAGTCCATTCTCGTCTCTTCTAGTCTATAGCCTAGCAGAGCATAGTCGAGTCTAGTAGAGTATACCCTACCAGACCATACTCGACTCTACTAGACTAT

26

EM: Output = Clark rerunA/T C/G T/T A/C G/T

A/A C/G C/T A/C T/T

A/A C/G T/T C/C T/T

A/T C/C C/T A/A G/T

A/T C/G C/C A/A G/T

AG T C TA C T C T

AG T C TT C T AG

AG T C TA C C A T

T C T AGA C C A T

T GC AGA C C A T

θH

θH

θH

θH

θH

27

EM: Comments

• Problem of EM entering local minima– Try multiple runs, select best result.

• Generally far better results than Clark.• Exponential complexity O(2l⋅n) due to |H|

– Ignore impossible haplotypes for O(1.5l⋅n)• Still infeasible for >30 loci

– Hierarchical EM, MCMC methods.

28

Part 2: HaploBlock

• Haplotype blocks• Statistical model• Model inference• Model criterion• Applications

– Haplotyping– Block-based LD mapping

29

Recombination Hotspots

30

Haplotype Blocks

1 GAACTGC ATTCGACTG CCAGTAGC 2 ACGTACA GATGAGCTG CCAGTAGC … 99 ACGTACA AACCGAGGT TGTACTAA100 GAACTGC GATGAGCTG TGTGCTAA

Recombinationhotspot

separates blocks

Few blockvariants due to

bottlenecks, drift

Mutationhotspot

31

Bayesian Network Model

C

A1 A2 A3

H1 H2 H3

Values of variable C are 1…qdenoting index of block’s haplotype

Pr(C = c) is frequency of haplotype c

Values of variable Aj are A,C,G,T,–denoting allele at site j of haplotype.Example: A1 A2 A3 = CTA for C = 2

Pr(aj | c) is deterministic

Values of variable Hj are A,C,G,T,–denoting allele at site j observed

after possible haplotype mutations

Pr(hj | aj) is cumulative mutation rate

33


C1

A1 A2 A3

H1 H2 H3

recombination hotspot

A4 A5 A6 A7

H4 H5 H6 H7

C2 C3

haplotype block

34

Data Likelihood

• For haplotypes H, likelihood is:

A1 A2 A3 A4 A5

H1 H2 H3 H4 H5

C1 C2

€

Pr(H) = Lc1

∑ La1

∑cb

∑Pr(c1) Pr(ck | ck−1)

k= 2

b

∏

Pr(a j | ck )Pr(h j | a j )j= sk

ek

∏k=1

b

∏

al

∑

h∈H∏

But we can calculate this efficientlyusing a suitable elimination ordering!

35

Data Criterion

• Maximum Likelihood leads to over-fitting– No hotspots, no mutations, many ancestors– Need to consider model complexity– Min DL(H,M)=DL(M)-log2Pr(H|M)

• DL(M) considers variable elements only– Ancestor block sequences– Markov chain parameters

A1 A2 A3 A4 A5

H1 H2 H3 H4 H5

C1 C2

36

Model Search

A1 A2 A3 A4 A5

H1 H2 H3 H4 H5

C1 C2

4 3

A1 A2 A3 A4 A5

H1 H2 H3 H4 H5

C1C2

4 3nudge

A1 A2 A3 A4 A5

H1 H2 H3 H4 H5

C1

5

add/remove

A1 A2 A3 A4 A5

H1 H2 H3 H4 H5

C1 C2

4 2

ancestors

37

Hotspot Addition – Initial Scan

3500

4000

4500

5000

5500

6000

0 10 20 30 40 50 60 70 80 90

Hotspot

MD

L Sc

ore

38

Hotspot Addition – Minima 1

3500

4000

4500

5000

5500

6000

0 10 20 30 40 50 60 70 80 90

Hotspot

MD

L Sc

ore

39

Hotspot Addition – Pass 2

3500

4000

4500

5000

5500

6000

0 10 20 30 40 50 60 70 80 90

Hotspot

MD

L Sc

ore

40


3500

4000

4500

5000

5500

6000

0 10 20 30 40 50 60 70 80 90

Hotspot

MD

L Sc

ore

41


3500

4000

4500

5000

5500

6000

0 10 20 30 40 50 60 70 80 90

Hotspot

MD

L Sc

ore

42


3500

4000

4500

5000

5500

6000

0 10 20 30 40 50 60 70 80 90

Hotspot

MD

L Sc

ore

43


3500

4000

4500

5000

5500

6000

0 10 20 30 40 50 60 70 80 90

Hotspot

MD

L Sc

ore

44

Hotspot Addition – Summary

3500

4000

4500

5000

5500

6000

0 10 20 30 40 50 60 70 80 90

Hotspot

MD

L Sc

ore

45

Ck

As Ae

Hs He

EM Modifications• Deterministic distribution for Aj• Mutation rate constraints for Hj• Infeasible calculations along whole model

1. Fix zero mutation rate tocluster sequences.Ck

As Ae

Hs He

2. Extract ancestor alleles bymaximum likelihood.

Ck

As Ae

3. Constrained EM formutation rates.

As Ae

Hs He

Ck

4. EM for Markov chain.Hs He

Ck

46

Model for Haplotyping

C1

A1 A2 A3

H1 H2 H3

A4 A5 A6 A7

H4 H5 H6 H7

C2 C3

C'1

A'1 A'2 A'3

H'1 H'2 H'3

A'4 A'5 A'6 A'7

H'4 H'5 H'6 H'7

C'2 C'3

G1 G2 G3 G4 G5 G6 G7

• Learn modeldirectly fromgenotypes

• Haplotypepair: choosemost likelyunder model

47

Haplotyping Results

.0152.0083.0047.0009.0042.0095Hierarchical EM

=2x3x2x2x2xImprovement factor

HaploBlock .0098.0048.0014.0005.0020.0047

.0419.0183.0262.0655.0403.0669PHASE

.0102failed.0077.0204failed.0224HAPLOTYPER

.0381.0234.0329.0280.0251.0548Clark

ACEC21eC21dC21cC21bC21aSite pairwise error rate

C21x data: 20 haplotypes, 100 SNPs over ≤ 35kb, Patil et al. (2001)ACE data: 22 haplotypes, 52 SNPs over 24kb, Rieder et al. (1999)

Average shown for 10 random pairings of true haplotypes

48

Model for LD Mapping• Learn model from marker data• Mapping: try making phenotype

dependent on each block

C1

A1 A2 A3

H1 H2 H3

A4 A5 A6 A7

H4 H5 H6 H7

C2 C3

P

49

LD Mapping Results

1.4x3x3xImprovement factor

HaploBlock 24 kb37 kb40 kb

33 kb105 kb131 kbNo Blocks

107 kb–144 kbBLADE

Chr 215q31 genos5q31 haplosResequencing required

5q31 data: 258 haplotypes, 98 SNPs over 464kb, Daly et al. (2001)Chr 21 data: 20 haplotypes, 5 sets of 200 SNPs, Patil et al. (2001)

Average shown for 5 random selections of target SNP

50

HaploBlock: Comments

• Our model boils down to an HMM– Calculations have linear complexity– Forward/backward probability caching

• Better to infer multiple models– Prevent getting stuck in local minima– Account for uncertainty of block identification– Use Gibbs-style iterations on hotspots– Take ‘average’ result over set of models

c1 c2 c3 linkage disequilibrium 1 mapping and haploblock

Documents