
Linkage Disequilibrium Mapping and HaploBlock

Gideon Greenspan
gdg@cs.technion.ac.il

Guest lecture for course: Computational Genetics (236608)

Part 1: LD Mapping

• Basic LD Mapping
  – χ-squared test for individual SNPs
• Mapping with Haplotypes
  – Population phenomena
• Haplotyping
  – Clark algorithm
  – EM algorithm

Linkage Disequilibrium

• LD = Another word for ‘correlation’
  – Correlation between markers in a population
• Random recombination destroys correlation
  – Close markers may have high LD
  – Above 1 Mb, LD disappears

LD Mapping: The Basics

• Take set of unrelated individuals
  – Ideally from a small, inbred population
• Measure markers at high resolution
  – Single Nucleotide Polymorphisms are ideal
• Test marker–disease correlations
  – Non-parametric disease model
  – Suitable (in theory) for low penetrance

LD Mapping in Action

[Figure: LD mapping illustration; only isolated numeric labels survive from the graphic]

Chi-Squared Test

Observed Counts

        Case    Control   ∑
A       69      236       305
a       31      264       295
∑       100     500       600

Expected Counts

        Case    Control   ∑
A       50.83   254.17    305
a       49.17   245.83    295
∑       100     500       600

$\chi^2 = \sum \frac{(o - e)^2}{e} = 15.85$

1 degree of freedom ⇒ p-value ≈ 0.0001
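The test above can be reproduced with a few lines of Python. This is a minimal sketch using only the standard library. The function name `chi_squared_2x2` is illustrative and not from the lecture. The p-value uses the identity that, for 1 degree of freedom, the χ² tail probability equals `erfc(sqrt(x / 2))`.

```python
import math

def chi_squared_2x2(observed):
    """Pearson chi-squared statistic and 1-d.o.f. p-value for a 2x2 table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            # Expected count if allele and disease status were independent.
            e = row_totals[i] * col_totals[j] / total
            stat += (o - e) ** 2 / e
    # Tail probability of a chi-squared variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value

# Rows: alleles A and a; columns: case, control (observed counts from the slide).
stat, p = chi_squared_2x2([[69, 236], [31, 264]])
print(f"chi2 = {stat:.2f}, p = {p:.1e}")
```

The expected counts are never stored; each one is recomputed from the row and column totals, which is all the test needs.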

Single Nucleotide Polymorphisms (SNPs)

• Single base pair which exhibits variation
  – Caused by point mutations during meiosis
  – Variation almost always biallelic
• dbSNP contains ~4.3×$10^6$ SNPs
  – Over 1 SNP per 1,000 base pairs
  – About half with minor allele frequency > 20%
  – This number is still growing rapidly!

LD Mapping in Context

• Identify chromosome ($10^8$ bp)
• Linkage analysis ($10^6$~$10^7$ bp)
• Identify genes ($10^5$~$10^6$ bp)
• Resequencing (100 bp)

False Positives

• Causes of spurious LD
  – Population structure
    • Migration and admixture
    • Preferential mating
  – Phenotypic site interaction
    • Disease epistasis
• Key problem: too many SNP tests
  – Bonferroni correction
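The Bonferroni correction named above divides the significance level by the number of tests, so that the chance of any false positive across all SNPs stays below the chosen level. A minimal sketch, with hypothetical p-values and illustrative function names:

```python
def bonferroni_threshold(alpha, num_tests):
    """Per-test significance threshold that keeps the family-wise error rate <= alpha."""
    return alpha / num_tests

def significant_after_bonferroni(p_values, alpha=0.05):
    """Flag each test whose p-value survives the Bonferroni-corrected threshold."""
    threshold = bonferroni_threshold(alpha, len(p_values))
    return [p <= threshold for p in p_values]

# Hypothetical p-values for four SNP tests (not data from the lecture).
p_values = [0.0001, 0.03, 0.2, 0.004]
flags = significant_after_bonferroni(p_values)
print(flags)
```

With thousands of SNPs the threshold becomes very strict, which is one motivation for the haplotype-based tests that follow.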

Haplotypes

[Figure: a chromosome with SNPs separated by non-variable regions; the SNP alleles A G T C T give Haplotype = AGTCT]

Generally, only a few of the $2^{\text{loci}}$ possible haplotypes cover >90% of a population, due to bottleneck effects and genetic drift.

Bottleneck Effects

[Figure: timeline labeled $10^6$ years and $10^5$ years]

Genetic Drift

[Figure: plot against Generation, 0 to 1000]

LD Mapping with Haplotypes

• Obtain haplotypes for a genomic region
  – Treat haplotype as correlated allele
• Advantage: fewer tests
  – Reduced false positive rate
• Disadvantage: ignores recombination
  – Different haplotypes could contain target
• Best: consider partial haplotypes…

The Haplotyping Problem

Variable loci                              1     2     3     4     5
Maternal chromosome (hidden haplotype)     A     G     T     C     T
Paternal chromosome (hidden haplotype)     T     C     T     A     G
Observed genotypes                         A/T   C/G   T/T   A/C   G/T

Why is it hard?

• A series of joint measurements containing h heterozygous loci can be divided $2^{h-1}$ ways (we don’t care which is maternal or paternal).

Genotype: A/T C/G T/T A/C G/T

  ACTAG / TGTCT
  AGTAG / TCTCT
  ACTAT / TGTCG
  ACTCG / TGTAT
  ACTCT / TGTAG
  AGTAT / TCTCG
  AGTCG / TCTAT
  AGTCT / TCTAG   Correct!
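The count of $2^{h-1}$ can be checked by enumerating the candidate pairs directly. The sketch below is illustrative (the helper names `parse_genotype` and `haplotype_pairs` are not from the lecture). It fixes the phase of the first heterozygous locus, because swapping it would only produce the mirror image of a pair already listed.

```python
from itertools import product

def parse_genotype(text):
    """Turn 'A/T C/G T/T' into [('A', 'T'), ('C', 'G'), ('T', 'T')]."""
    return [tuple(locus.split("/")) for locus in text.split()]

def haplotype_pairs(genotype):
    """All unordered haplotype pairs consistent with a genotype."""
    het = [i for i, (x, y) in enumerate(genotype) if x != y]
    pairs = []
    # One boolean per heterozygous locus after the first: swap its alleles or not.
    for flips in product([False, True], repeat=max(len(het) - 1, 0)):
        h1 = [x for x, _ in genotype]
        h2 = [y for _, y in genotype]
        for locus, flip in zip(het[1:], flips):
            if flip:
                h1[locus], h2[locus] = h2[locus], h1[locus]
        pairs.append(("".join(h1), "".join(h2)))
    return pairs

pairs = haplotype_pairs(parse_genotype("A/T C/G T/T A/C G/T"))
for h1, h2 in pairs:
    print(h1, "/", h2)
```

The genotype above has four heterozygous loci, so the loop runs over $2^3 = 8$ flip patterns, matching the eight pairs on the slide.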

Why is it approachable?

• Many of the haplotypes appear many times.
• Data for many individuals allows inference.

Genotype                Solution 1        Solution 2
A/T C/G T/T A/C G/T     ACTAT / TGTCG     TCTAG / AGTCT
A/A C/G C/T A/C T/T     ACCCT / AGTAT     ACCAT / AGTCT

Solution 2 seems ‘better’ since it uses fewer haplotypes.

Formalization 1

• Assume all loci biallelic (realistic).
• Individuals numbered 1…n
• Loci numbered 1…l
• Possible alleles $B = \{0, 1\}$
• Possible haplotypes $H = B^l$
• Possible locus observations $L = \{[B, B]\}$
• Possible genotypes $G = L^l$
• Possible haplotype pairs $D = \{[H, H]\}$

Formalization 2

• Given a true haplotype pair $[h_1, h_2] \in D$, $G(h_1, h_2) \in G$ is the genotype observed.
• Given an observed genotype $g \in G$, $D(g) \subseteq D$ is the set of possible haplotype pairs.
• Problem input: $(g_1, \ldots, g_n)$ where $g_i \in G$
• Problem output: $(d_1, \ldots, d_n)$ where $d_i \in D(g_i)$

Clark’s Algorithm

1. Initialize set S to {}.
2. For genotypes $g_i$ with a single possibility $[h_1, h_2]$, assign $d_i = [h_1, h_2]$ and add $h_1, h_2$ to S.
3. For genotypes $g_i$ with a possibility containing a member $h_1 \in S$ and another haplotype $h_2$, assign $d_i = [h_1, h_2]$ and add $h_2$ to S.
4. Repeat step 3 until all haplotypes are assigned or we add nothing new to S.
5. Assign any remaining $d_i$ arbitrarily.
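The five steps above can be sketched in Python as follows. This is an illustrative reading of the steps, not the lecturer's implementation. Genotypes are written as strings such as `"A/T C/G"`, and the helper `expand`, the scan order, and the "first option" rule in step 5 are all assumptions.

```python
from itertools import product

def expand(genotype):
    """All unordered haplotype pairs consistent with a genotype string."""
    loci = [tuple(x.split("/")) for x in genotype.split()]
    het = [i for i, (a, b) in enumerate(loci) if a != b]
    pairs = []
    for flips in product([False, True], repeat=max(len(het) - 1, 0)):
        h1 = [a for a, _ in loci]
        h2 = [b for _, b in loci]
        for i, flip in zip(het[1:], flips):
            if flip:
                h1[i], h2[i] = h2[i], h1[i]
        pairs.append(("".join(h1), "".join(h2)))
    return pairs

def clark(genotypes):
    known = set()                                  # step 1: S = {}
    options = [expand(g) for g in genotypes]
    assignment = [None] * len(genotypes)
    for i, opts in enumerate(options):             # step 2: unambiguous genotypes
        if len(opts) == 1:
            assignment[i] = opts[0]
            known.update(opts[0])
    progress = True
    while progress:                                # steps 3 and 4
        progress = False
        for i, opts in enumerate(options):
            if assignment[i] is not None:
                continue
            for h1, h2 in opts:
                if h1 in known or h2 in known:
                    assignment[i] = (h1, h2)
                    known.update((h1, h2))
                    progress = True
                    break
    for i, opts in enumerate(options):             # step 5: arbitrary choice
        if assignment[i] is None:
            assignment[i] = opts[0]
    return assignment

genotypes = [
    "A/T C/G T/T A/C G/T",
    "A/A C/G C/T A/C T/T",
    "A/A C/G T/T C/C T/T",
    "A/T C/C C/T A/A G/T",
    "A/T C/G C/C A/A G/T",
]
assignment = clark(genotypes)
for g, (h1, h2) in zip(genotypes, assignment):
    print(g, "->", h1, "/", h2)
```

Which pair gets picked in step 3 depends on the order in which genotypes and candidate pairs are scanned, so different orderings of the same input can give different resolutions.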

Clark: Run

Genotype                Assigned haplotypes
A/T C/G T/T A/C G/T     TGTAG / ACTCT
A/A C/G C/T A/C T/T     AGCAT / ACTCT
A/A C/G T/T C/C T/T     AGTCT / ACTCT
A/T C/C C/T A/A G/T     ACTAT / TCCAG
A/T C/G C/C A/A G/T     AGCAT / TCCAG

Distinct haplotypes used: 6

Clark: Rerun (same input)

Genotype                Assigned haplotypes
A/T C/G T/T A/C G/T     AGTCT / TCTAG
A/A C/G C/T A/C T/T     AGTCT / ACCAT
A/A C/G T/T C/C T/T     AGTCT / ACTCT
A/T C/C C/T A/A G/T     TCTAG / ACCAT
A/T C/G C/C A/A G/T     TGCAG / ACCAT

Distinct haplotypes used: 5

Clark: Comments

• Implementation is very fast, $O(l n^2)$
• Total failure if no starting point.
• Blind haplotyping of ‘orphans’ at end.
• Arbitrary selections based on input order.
  – Try multiple orderings, select best results.
• Or formulate choices as integer program
  – Solve approximately by linear relaxation.

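Likelihood: Bayes Network

• $\Pr(H_M = h \mid \theta_H) = \Pr(H_P = h \mid \theta_H) = \theta_h$
• $D = [h_m, h_p]$
• $G = G(d)$

Network nodes and state-space sizes:

• $H_M$, $H_P$: $|H_M| = |H_P| = 2^l$
• $D$: $|D| = 2^l + 2^l \times (2^l - 1)/2$
• $G$: $|G| = 3^l$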
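The three state-space sizes can be confirmed by brute force for small l. The sketch below is only a sanity check, and the function name `count_spaces` is illustrative. It encodes alleles as 0/1 and treats haplotype pairs as unordered.

```python
from itertools import combinations_with_replacement, product

def count_spaces(l):
    """Return (|H|, |D|, |G|) for l biallelic loci by explicit enumeration."""
    haplotypes = list(product((0, 1), repeat=l))
    # Unordered pairs, including pairs of identical haplotypes.
    pairs = list(combinations_with_replacement(haplotypes, 2))
    # The genotype of a pair is the unordered allele pair at every locus.
    genotypes = {
        tuple(tuple(sorted(alleles)) for alleles in zip(h1, h2))
        for h1, h2 in pairs
    }
    return len(haplotypes), len(pairs), len(genotypes)

sizes = {l: count_spaces(l) for l in range(1, 5)}
for l, (n_h, n_d, n_g) in sizes.items():
    print(l, n_h, n_d, n_g)
```

The $2^l$ term in $|D|$ counts homozygous pairs and the $2^l(2^l - 1)/2$ term counts pairs of distinct haplotypes.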

EM Algorithm

1. Initialize $\theta_H^0$ to uniform distribution with small random perturbations.
2. E step: For each $d \in D$, calculate $\mathrm{Exp}(d \mid \theta_H^i)$ using observed data G.
3. M step: Calculate $\theta_H^{i+1}$ from $\mathrm{Exp}(d \mid \theta_H^i)$.
4. Repeat 2 and 3 until $\theta_H$ converges.
5. Assign output $d_i$ by maximum likelihood.
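A compact sketch of these steps follows. It is illustrative rather than the lecturer's code: the helper `expand`, the perturbation size, the tolerance, and the fixed seed are all assumptions. Only haplotypes compatible with at least one genotype are tracked. The factor of 2 for heterozygous pairs is omitted because it is constant across the candidate pairs of any one genotype.

```python
import random
from itertools import product

def expand(genotype):
    """All unordered haplotype pairs consistent with a genotype string."""
    loci = [tuple(x.split("/")) for x in genotype.split()]
    het = [i for i, (a, b) in enumerate(loci) if a != b]
    pairs = []
    for flips in product([False, True], repeat=max(len(het) - 1, 0)):
        h1 = [a for a, _ in loci]
        h2 = [b for _, b in loci]
        for i, flip in zip(het[1:], flips):
            if flip:
                h1[i], h2[i] = h2[i], h1[i]
        pairs.append(("".join(h1), "".join(h2)))
    return pairs

def em_haplotypes(genotypes, max_iter=500, tol=1e-12, seed=0):
    rng = random.Random(seed)
    options = [expand(g) for g in genotypes]
    haps = sorted({h for opts in options for pair in opts for h in pair})
    # Step 1: near-uniform start with small random perturbations.
    theta = {h: 1.0 + 0.01 * rng.random() for h in haps}
    norm = sum(theta.values())
    theta = {h: v / norm for h, v in theta.items()}
    for _ in range(max_iter):
        # E step: expected haplotype counts under the current frequencies.
        counts = dict.fromkeys(haps, 0.0)
        for opts in options:
            weights = [theta[h1] * theta[h2] for h1, h2 in opts]
            total = sum(weights)
            for (h1, h2), w in zip(opts, weights):
                counts[h1] += w / total
                counts[h2] += w / total
        # M step: each individual contributes two haplotypes.
        new_theta = {h: c / (2 * len(genotypes)) for h, c in counts.items()}
        change = max(abs(new_theta[h] - theta[h]) for h in haps)
        theta = new_theta
        if change < tol:
            break
    # Step 5: most likely pair for each genotype under the fitted frequencies.
    phased = [max(opts, key=lambda p: theta[p[0]] * theta[p[1]]) for opts in options]
    return theta, phased

genotypes = [
    "A/T C/G T/T A/C G/T",
    "A/A C/G C/T A/C T/T",
    "A/A C/G T/T C/C T/T",
    "A/T C/C C/T A/A G/T",
    "A/T C/G C/C A/A G/T",
]
theta, phased = em_haplotypes(genotypes)
for h, f in sorted(theta.items(), key=lambda item: -item[1])[:5]:
    print(h, round(f, 3))
```

Because EM only finds a local optimum, the result can depend on the starting perturbation; rerunning with several values of `seed` and keeping the most likely fit is the usual remedy.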

EM: Iterations (same input)

[Figure: estimated haplotype frequencies plotted against Iteration (0 to 10) for the candidate haplotypes. Five of the estimates converge to 0.300, 0.300, 0.200, 0.100 and 0.100.]

EM: Output = Clark rerun

Genotype                Assigned haplotypes
A/T C/G T/T A/C G/T     AGTCT / TCTAG
A/A C/G C/T A/C T/T     AGTCT / ACCAT
A/A C/G T/T C/C T/T     AGTCT / ACTCT
A/T C/C C/T A/A G/T     TCTAG / ACCAT
A/T C/G C/C A/A G/T     TGCAG / ACCAT

EM: Comments

• Problem of EM entering local maxima of the likelihood
  – Try multiple runs, select best result.
• Generally far better results than Clark.
• Exponential complexity $O(2^l \cdot n)$ due to |H|
  – Ignore impossible haplotypes for $O(1.5^l \cdot n)$
• Still infeasible for >30 loci
  – Hierarchical EM, MCMC methods.

Part 2: HaploBlock

• Haplotype blocks
• Statistical model
• Model inference
• Model criterion
• Applications
  – Haplotyping
  – Block-based LD mapping

Recombination Hotspots

Haplotype Blocks

  1  GAACTGC  ATTCGACTG  CCAGTAGC
  2  ACGTACA  GATGAGCTG  CCAGTAGC
  …
 99  ACGTACA  AACCGAGGT  TGTACTAA
100  GAACTGC  GATGAGCTG  TGTGCTAA

• Recombination hotspot separates blocks
• Few block variants due to bottlenecks, drift
• Mutation hotspot

Bayesian Network Model

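[Figure: one haplotype block, with block variable C, ancestor alleles A1 A2 A3 and observed alleles H1 H2 H3]

• Values of variable C are 1…q, denoting the index of the block’s haplotype
  – $\Pr(C = c)$ is the frequency of haplotype c
• Values of variable $A_j$ are A, C, G, T, –, denoting the allele at site j of the haplotype. Example: $A_1 A_2 A_3$ = CTA for C = 2
  – $\Pr(a_j \mid c)$ is deterministic
• Values of variable $H_j$ are A, C, G, T, –, denoting the allele observed at site j after possible haplotype mutations
  – $\Pr(h_j \mid a_j)$ is the cumulative mutation rate

$\Pr(c, a_1, a_2, a_3, h_1, h_2, h_3) = \Pr(c) \prod_{j=1}^{3} \Pr(a_j \mid c)\,\Pr(h_j \mid a_j)$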
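For a single block, the model can be made concrete with a small sketch. Everything numeric here is hypothetical: the ancestor sequences (apart from CTA for the second haplotype, as in the slide's example), the frequencies, and the symmetric mutation model in `mutation_prob` are assumptions chosen only for illustration. Indices are 0-based, so the slide's C = 2 is `c = 1`.

```python
from itertools import product

ALPHABET = "ACGT-"

def mutation_prob(observed, ancestral, rate):
    """Pr(h_j | a_j) under an assumed symmetric mutation model."""
    if observed == ancestral:
        return 1.0 - rate
    return rate / (len(ALPHABET) - 1)

def block_joint(c, observed, ancestors, freqs, rate):
    """Pr(c, a, h), where a is the allele sequence fixed by ancestor c."""
    prob = freqs[c]
    for h_j, a_j in zip(observed, ancestors[c]):
        prob *= mutation_prob(h_j, a_j, rate)
    return prob

def block_marginal(observed, ancestors, freqs, rate):
    """Pr(h): sum the joint over the ancestor index c."""
    return sum(
        block_joint(c, observed, ancestors, freqs, rate)
        for c in range(len(ancestors))
    )

ancestors = ["GAA", "CTA", "ACG"]   # hypothetical block haplotypes (q = 3)
freqs = [0.5, 0.3, 0.2]             # hypothetical Pr(C = c)
rate = 0.01                         # hypothetical mutation rate

print(block_joint(1, "CTA", ancestors, freqs, rate))
print(block_marginal("CTG", ancestors, freqs, rate))

# Sanity check: the marginal sums to 1 over all possible observations.
total_mass = sum(
    block_marginal("".join(h), ancestors, freqs, rate)
    for h in product(ALPHABET, repeat=3)
)
```

Because $\Pr(a_j \mid c)$ is deterministic, no sum over ancestor alleles is needed: choosing c fixes every $a_j$ in the block.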

[Figure: two haplotype blocks, covering sites 1 to 3 (A1 A2 A3, H1 H2 H3) and sites 4 to 7 (A4 … A7, H4 … H7), separated by a recombination hotspot]

Data Likelihood

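• For haplotypes H, likelihood is:

$$\Pr(H) = \prod_{h \in H} \sum_{c_1} \cdots \sum_{a_1} \cdots \; \Pr(c_1) \prod_{k \ge 2} \Pr(c_k \mid c_{k-1}) \prod_{k \ge 1} \prod_{j = s_k}^{e_k} \Pr(a_j \mid c_k)\,\Pr(h_j \mid a_j)$$

where block k spans sites $s_k$ to $e_k$.

But we can calculate this efficiently using a suitable elimination ordering!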
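One elimination ordering that works is to sum out the block variables from left to right, which is the forward recursion of a hidden Markov model. The sketch below compares it with the brute-force sum over all block assignments. All numbers are hypothetical. `emit[k][c]` stands for the product of $\Pr(h_j \mid a_j)$ over the sites of block k when the ancestor is c, which is valid because the ancestor alleles are determined by c.

```python
from itertools import product

def forward_likelihood(init, trans, emit):
    """Pr(h) for one haplotype by left-to-right elimination.

    init[c]           : Pr(c_1 = c)
    trans[k][p][c]    : Pr(next block has ancestor c | block k has ancestor p)
    emit[k][c]        : Pr(observed alleles of block k | ancestor c)
    """
    alpha = [init[c] * emit[0][c] for c in range(len(init))]
    for k in range(1, len(emit)):
        alpha = [
            emit[k][c] * sum(alpha[p] * trans[k - 1][p][c] for p in range(len(alpha)))
            for c in range(len(emit[k]))
        ]
    return sum(alpha)

def brute_force_likelihood(init, trans, emit):
    """Same quantity by summing over every assignment of ancestors to blocks."""
    total = 0.0
    for cs in product(*(range(len(e)) for e in emit)):
        p = init[cs[0]] * emit[0][cs[0]]
        for k in range(1, len(cs)):
            p *= trans[k - 1][cs[k - 1]][cs[k]] * emit[k][cs[k]]
        total += p
    return total

# Hypothetical model: three blocks with 2, 3 and 2 ancestor haplotypes.
init = [0.6, 0.4]
trans = [
    [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]],      # block 1 -> block 2
    [[0.9, 0.1], [0.5, 0.5], [0.2, 0.8]],    # block 2 -> block 3
]
emit = [
    [0.97, 0.0003],
    [0.01, 0.94, 0.002],
    [0.5, 0.5],
]

fast = forward_likelihood(init, trans, emit)
slow = brute_force_likelihood(init, trans, emit)
print(fast, slow)
```

The forward version does work proportional to the number of blocks, while the brute-force sum grows exponentially with it.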

Data Criterion

• Maximum Likelihood leads to over-fitting
  – No hotspots, no mutations, many ancestors
  – Need to consider model complexity
  – Minimize $DL(H, M) = DL(M) - \log_2 \Pr(H \mid M)$
• DL(M) considers variable elements only
  – Ancestor block sequences
  – Markov chain parameters
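The trade-off in the description-length criterion can be illustrated with two made-up models. The bit counts and per-haplotype likelihoods below are hypothetical; how DL(M) is actually encoded is not specified beyond the bullets above.

```python
import math

def description_length(model_bits, haplotype_likelihoods):
    """DL(H, M) = DL(M) - log2 Pr(H | M), with Pr(H | M) a product over haplotypes."""
    return model_bits - sum(math.log2(p) for p in haplotype_likelihoods)

# Hypothetical simple model: cheap to describe, fits each haplotype moderately well.
simple = description_length(40, [0.02] * 10)
# Hypothetical complex model: fits each haplotype better but costs more bits.
complex_ = description_length(120, [0.5] * 10)

print(round(simple, 1), round(complex_, 1))
```

In this made-up case the complex model has the higher likelihood, yet the simple model has the shorter total description, so the criterion prefers it.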


Model Search

[Figure: model search moves applied to the block model over sites 1 to 5, labeled “nudge”, “add/remove” and “ancestors”]

Hotspot Addition – Initial Scan

[Plot: x-axis “Hotspot”, positions 0 to 90]

Hotspot Addition – Minima 1

[Plot: x-axis “Hotspot”, positions 0 to 90]

Hotspot Addition – Pass 2

[Plots: five panels, each with x-axis “Hotspot”, positions 0 to 90]

Hotspot Addition – Summary

[Plot: x-axis “Hotspot”, positions 0 to 90]

EM Modifications

• Deterministic distribution for $A_j$
• Mutation rate constraints for $H_j$
• Infeasible calculations along whole model

1. Fix zero mutation rate to cluster sequences.
2. Extract ancestor alleles by maximum likelihood.
3. Constrained EM for mutation rates.
4. EM for Markov chain.

[Figure: block variable $C_k$ with observed sites $H_s$ … $H_e$]

Model for Haplotyping

[Figure: two copies of the block model (A1 … A7 with H1 … H7, and A'1 … A'7 with H'1 … H'7) jointly producing the genotypes G1 … G7]

• Learn model directly from genotypes
• Haplotype pair: choose most likely under model

Haplotyping Results

Site pairwise error rate   C21a    C21b    C21c    C21d    C21e    ACE
Clark                      .0548   .0251   .0280   .0329   .0234   .0381
HAPLOTYPER                 .0224   failed  .0204   .0077   failed  .0102
PHASE                      .0669   .0403   .0655   .0262   .0183   .0419
Hierarchical EM            .0095   .0042   .0009   .0047   .0083   .0152
HaploBlock                 .0047   .0020   .0005   .0014   .0048   .0098
Improvement factor         2x      2x      2x      3x      2x      =

C21x data: 20 haplotypes, 100 SNPs over ≤ 35 kb, Patil et al. (2001)
ACE data: 22 haplotypes, 52 SNPs over 24 kb, Rieder et al. (1999)

Average shown for 10 random pairings of true haplotypes

Model for LD Mapping

• Learn model from marker data
• Mapping: try making phenotype dependent on each block

[Figure: block model over sites 1 to 3 (A1 A2 A3, H1 H2 H3) and sites 4 to 7 (A4 … A7, H4 … H7)]

LD Mapping Results

Resequencing required   5q31 haplos   5q31 genos   Chr 21
BLADE                   144 kb        –            107 kb
No Blocks               131 kb        105 kb       33 kb
HaploBlock              40 kb         37 kb        24 kb
Improvement factor      3x            3x           1.4x

5q31 data: 258 haplotypes, 98 SNPs over 464 kb, Daly et al. (2001)
Chr 21 data: 20 haplotypes, 5 sets of 200 SNPs, Patil et al. (2001)

Average shown for 5 random selections of target SNP

HaploBlock: Comments

• Our model boils down to an HMM
  – Calculations have linear complexity
  – Forward/backward probability caching
• Better to infer multiple models
  – Prevent getting stuck in local minima
  – Account for uncertainty of block identification
  – Use Gibbs-style iterations on hotspots
  – Take ‘average’ result over set of models
