haplotyping...1. “guesstimate” haplotype frequencies 2. use current frequency estimates to...

$: Haplotyping...1. “Guesstimate” haplotype frequencies 2. Use current frequency estimates to replace ambiguous genotypes with fractional counts of phased genotypes 3. Estimate frequency$
HaplotypingBiostatistics 666

Previously

• Introduction to the E-M algorithm

• Approach for likelihood optimization

• Examples related to gene counting• Allele frequency estimation – recessive disorder

• Allele frequency estimation – ABO blood group

• Introduction to haplotype frequency estimation

Today: Haplotype Frequencies

• Evolution of haplotype estimation methods• Clark’s greedy algorithm

• Excoffier and Slatkin’s E-M algorithm

• Stephens et al’s coalescent based algorithm

• Using estimated haplotypes in association studies

Useful Roles for Haplotypes

• Linkage disequilibrium studies• Summarize genetic variation

• Selecting markers to genotype• Identify haplotype tag SNPs

• Candidate gene association studies• Help interpret single marker associations• Capture the effect of ungenotyped alleles

• Identify regions of recent ancestry among individuals• Date interesting alleles• Identify regions undergoing natural selection

The problem…

• Haplotypes are hard to measure directly• X-chromosome in males

• Sperm typing

• Hybrid cell lines

• Other molecular techniques

• Instead, statistical reconstruction is often required

Typical Genotype Data

• Two alleles for each individual• Chromosome origin for each allele unknown

• Many haplotype pairs can fit observed genotype

C G Marker1

T C Marker2

G A Marker3

Observation

C G C G

T C C T

G A G A

C G C G

C T T C

A G A G

Possible States

Use Information on Relatives

• Family information can help determine haplotype phase

• Still, many ambiguities can remain• Especially with larger numbers of markers

• Can you think of examples where parental information helps resolve phase?

• Can you think of examples where parental information leaves phase ambiguous?

Proportion of Phase Known Haplotypes

0

33

67

100

1 8 15Markers

Unrelateds w/Parents

Parents + Sibling

What if there are no relatives?

• Rely on linkage disequilibrium

• Assume that…

• Population consists of small number of distinct haplotypes

• Haplotypes tend to be similar to each other

Clark’s Haplotyping Algorithm

• Clark (1990) Mol Biol Evol 7:111-122

• One of the first haplotyping algorithms• Computationally efficient

• Very fast and widely used in 1990’s

• More accurate methods are now available

• Predictions from the coalescent can be used to model performance

Clark’s Haplotyping Algorithm

• Find unambiguous individuals

• What kinds of genotypes will these have?

• Initialize a list of known haplotypes

• Resolve ambiguous individuals• If possible, use two haplotypes from list

• Otherwise, use one known haplotype and augment list

• If unphased individuals remain• Assign phase randomly to one individual

• Augment haplotype list and continue from previous step

Chain of Inference (Clark, 1990)

Can The Algorithm Get Started?

• What kinds of genotypes do we need to get started?

• What kinds of haplotype pairs do we need to get started?

• What is the probability of these occurring?

Probability of Failing To Start

Distribution of Orphaned Alleles

Distribution of Anomalous Matches

Notes …

• Clark’s Algorithm is extremely fast

• More likely to start with large sample

• Orphaned alleles and anomalous matches may occur• Solution with the least orphaned alleles is usually the one with the

fewest anomalous matches

The E-M Haplotyping Algorithm

• Excoffier and Slatkin (1995) • Mol Biol Evol 12:921-927

• Provide a clear outline of how the algorithm can be applied to genetic data

• Combination of two strategies• E-M statistical algorithm for missing data

• Counting algorithm for allele frequencies

• Refines Clark’s algorithm by incorporating frequency information

E-M Algorithm For Haplotyping

1. “Guesstimate” haplotype frequencies

2. Use current frequency estimates to replace ambiguous genotypes with fractional counts of phased genotypes

3. Estimate frequency of each haplotype by counting

4. Repeat steps 2 and 3 until frequencies are stable

Expected Haplotype Counts

ij

ji

ji

jiiii

hh

hhGH

hh

hhGhhGh

G

jiji

j

pHP

ppnn)(n

GH

n

hhhhG

jh

),(~

),(),()ˆ|(

ˆˆ22E

G with compatible pairs Haplotype~

G typeof genotypes ofNumber

, toingcorrespond genotype Unphased),(

haplotype

Notation used inExcoffier and Slatkin paper

n'combinatio genotype observed' with phenotype'' replacing try clarity,For

Likelihood)(

phenotype ofy probabilit)()cpair haplotype(

phenotypefor compatible pairs haplotype

phenotypefor markers usheterozygo of no.

phenotypes individualfor index ...1

phenotypes observed of no.

1 1

11

m

j

nc

i

ilik

c

i

ilik

c

i

jj

j

j

jj

jj

hhPL

jhhPPP

jc

js

mj

m

Notation used inExcoffier and Slatkin paper

n'combinatio genotype observed' withphenotype'' replacing try clarity,For

iterationnext for sfrequencie allele)(2

1

genotypefor proportion fitted)(

)(

phenotype ofy probabilit)(

shomozygote and tesheterozygofor

genotype ofy probabilit)(

roundat haplotype offrequency ...,

1 1

)()1(

)(

)(

)(

1

)()(

2)(

)()(

)(

)()(

2

)(

1

m

j

c

i

lk

g

jit

g

t

lkg

j

lk

g

jj

lk

g

c

i

ilik

g

j

g

j

lk

g

k

g

l

g

k

lk

g

j

g

h

gg

j

j

hhPp

hhP

hhP

n

nhhP

jhhPP

hh

p

pphhP

ghppp

Computational Cost (for SNPs)

• Consider sets of m unphased genotypes• Markers 1..m

• If markers are bi-allelic• 2m possible haplotypes

• 2m-1 (2m + 1) possible haplotype pairs

• 3m distinct observed genotypes

• 2n-1 reconstructions for n heterozygous loci

For example, if m = 10

= 1024

= 524,800

= 59,049

= 512

E-M Algorithm for Haplotyping

• Cost grows rapidly with number of markers

• Typically appropriate for < 25 SNPs• Fewer microsatellites

• More accurate than Clark’s method

• Fully or partially phased individuals contribute most of the information

Enhancements to E-M

• List only haplotypes present in sample

• Gradually expand subset of markers under consideration, eliminating haplotypes with zero or low estimated frequency from consideration at each stage

• SNPHAP [Clayton (2001)]

• HAPLOTYPER [Qin et al. (2002)]

Divide-And-Conquer Approximation

• Number of potential haplotypes increases exponentially

• Number of observed haplotypes does not

• Approximation• Successively divide marker set• Run E-M on each segment• Prune haplotype list as segments are

ligated

• Computation order: ~ m log m• Exact E-M is order ~ 2m

0

1250

2500

1 3 5 7 9 11

markers

tim

e (

ms)

Exact E-M Approximation

Further Refinements …

• More modern methods try to further improve haplotype estimation by favoring sets of similar haplotypes

• Stephens et al. (2001) • Am J Hum Genet 68:978-89

• Genealogical approach…

What the Genealogy Implies…

• Haplotypes are similar to each other…

Known Haplotypes

22544

22544

22544

33334

33334

23233

14234

Individual 1

Genotype:

32344

23534

Individual 2:

Genotype:

32444

23434

33334

22544

33434

22444

Chromosome Genealogies

Present

Past

*

*

*

*

*

*

*

*

* *

*

*

Method based on Gibbs sampler

• MCMC method• Stochastic, random procedure

• Improves solution gradually

• Given initial set of haplotypes

• Sample haplotypes for one individual at a time, assuming other haplotypes are true

• Repeat a few million times…

Update Procedure I

• Pick individual U to update at random

• Calculate haplotype frequencies F in all other individuals• Since everyone is “phased”, this is done by counting

• Sample new haplotypes for U from conditional distribution of U’s haplotypes given F

Update Procedure I

• This procedure would produce an estimate of haplotype frequencies equivalent to those obtained by E-M …

• Stephens et al (2001) suggested an alternative estimate of F…

Update Procedure II

• Estimate F from the other individuals

• Construct F* to include haplotypes in F and also other similar haplotypes (possibly differing at a few sites)

• Update U’s haplotypes conditional on F*

Stephens' Formula …

• Pr(h|H) is the probability of observing haplotype hgiven previous set H

h

S

S

S

Pn

n

nn

nHh

)|Pr(

Sum over haplotypes

Sum over number of mutations

S mutations before coalescence

Coalescence

Mutation Matrix

Further Refinements

• This naïve strategy becomes impractical for very long haplotypes• List of haplotypes for each individual could become too long

• Instead, we can proceed by selecting a short segment of the haplotype to update at random

Hypothesis Testing

• Often, haplotype frequencies are not final outcome.

• For example, we may wish to compare two groups of individuals…• Are haplotypes similar in two populations?

• Are haplotypes similar in patients and healthy controls?

Simplistic approach…

• Calculate haplotype frequencies in each group

• Find most likely haplotype for each individual

• Compare haplotype reconstructions in the two groups

Simplistic approach…

• Calculate haplotype frequencies in each group

• Find most likely haplotype for each individual

• Compare haplotype reconstructions in the two groups

NOT RECOMMENDED!!!

Observed Case Genotypes

1 2 3 4 5 6

The phase reconstruction in the five ambiguous individuals

will be driven by the haplotypes observed in individual 1 …

Inferred Case Haplotypes

1 2 3 4 5 6

This kind of phenomenon will occur with nearly all population

based haplotyping methods!

Observed Control Genotypes

1 2 3 4 5 6

Note these are identical, except for the single homozygous

individual …

Inferred Control Haplotypes

1 2 3 4 5 6

Ooops… The difference in a single genotype in the original

data has been greatly amplified by estimating haplotypes…

Hypothesis Testing II

• Never impute case and control haplotypes separately

• Instead, consider both groups together …• Schaid et al (2002) Am J Hum Genet 70:425-34

• Zaytkin et al (2002) Hum Hered. 53:79-91

• Another alternative is to use maximum likelihood

Hypothesis Testing III

• Estimated haplotype frequencies, imply a likelihood for the observed genotypes

i G~H i

)(HPL


• Estimated haplotype frequencies, imply a likelihood for the observed genotypes

i G~H i

)(HPL

individuals

possible haplotype pairs, conditional on genotype

haplotype pair frequency


• Calculate 3 likelihoods:• Maximum likelihood for combined sample, LA

• Maximum likelihood for control sample, LB

• Maximum likelihood for case sample, LC

2~ln2 df

A

CB

L

LL

df corresponds to number of non-zero haplotype frequencies in large samples

Significance in Small Samples

• In realistic sample sizes, it is hard to estimate the number of dfaccurately

• Instead, permute case and control labels randomly

Final thoughts…

• Compare alternative reconstructions• Change input order

• Change random seeds

• Change starting values

• When analyzing case-control studies• Randomize case-control labels

Summary

• Describe principles underlying haplotype estimation in unrelated individuals

• Heuristic algorithms

• The E-M algorithm

• Genealogical approach

Further Reading

• Clark AG (1990) Inference of Haplotypes from PCR-amplified samples of diploid populations. Mol Biol Evol 7:111-122

haplotyping...1. “guesstimate” haplotype frequencies 2. use current frequency estimates to...

Documents