Haplotyping Biostatistics 666


Post on 05-Oct-2020


Page 1: Haplotyping...1. “Guesstimate” haplotype frequencies 2. Use current frequency estimates to replace ambiguous genotypes with fractional counts of phased genotypes 3. Estimate frequency

Haplotyping

Biostatistics 666


Previously

• Introduction to the E-M algorithm

• Approach for likelihood optimization

• Examples related to gene counting
  • Allele frequency estimation – recessive disorder
  • Allele frequency estimation – ABO blood group

• Introduction to haplotype frequency estimation


Today: Haplotype Frequencies

• Evolution of haplotype estimation methods
  • Clark’s greedy algorithm
  • Excoffier and Slatkin’s E-M algorithm
  • Stephens et al.’s coalescent-based algorithm

• Using estimated haplotypes in association studies


Useful Roles for Haplotypes

• Linkage disequilibrium studies
  • Summarize genetic variation

• Selecting markers to genotype
  • Identify haplotype tag SNPs

• Candidate gene association studies
  • Help interpret single marker associations
  • Capture the effect of ungenotyped alleles

• Identify regions of recent ancestry among individuals
  • Date interesting alleles
  • Identify regions undergoing natural selection


The problem…

• Haplotypes are hard to measure directly
  • X-chromosome in males
  • Sperm typing
  • Hybrid cell lines
  • Other molecular techniques

• Instead, statistical reconstruction is often required


Typical Genotype Data

• Two alleles for each individual
  • Chromosome origin for each allele unknown

• Many haplotype pairs can fit observed genotype

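Observation (unphased genotype)

Marker1  C  G
Marker2  T  C
Marker3  G  A

Possible States (haplotype pairs; read down each column)

         State 1   State 2   State 3   State 4
Marker1  C  G      C  G      C  G      C  G
Marker2  T  C      C  T      C  T      T  C
Marker3  G  A      G  A      A  G      A  G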
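The enumeration of possible states can be sketched in a few lines of Python. This is only an illustration: the function name `compatible_pairs` and the representation of a genotype as a list of allele pairs are assumptions made here, not part of the lecture. The orientation of the first heterozygous marker is fixed so that each unordered haplotype pair is listed once.

```python
from itertools import product


def compatible_pairs(genotype):
    """List the unordered haplotype pairs consistent with an unphased genotype.

    genotype: list of (allele, allele) tuples, one tuple per marker.
    (Hypothetical helper, written for illustration only.)
    """
    het = [i for i, (a, b) in enumerate(genotype) if a != b]
    pairs = []
    # Keep the first heterozygous marker fixed; flipping the remaining
    # heterozygous markers generates every distinct pair exactly once.
    for flips in product([False, True], repeat=max(len(het) - 1, 0)):
        flip_at = dict(zip(het[1:], flips))
        h1, h2 = [], []
        for i, (a, b) in enumerate(genotype):
            if flip_at.get(i, False):
                a, b = b, a
            h1.append(a)
            h2.append(b)
        pairs.append(("".join(h1), "".join(h2)))
    return pairs


# The three-marker observation from the slide: C/G, T/C, G/A
states = compatible_pairs([("C", "G"), ("T", "C"), ("G", "A")])
for h1, h2 in states:
    print(h1, h2)
```

With three heterozygous markers the function lists four haplotype pairs, matching the four possible states shown on the slide.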


Use Information on Relatives

• Family information can help determine haplotype phase

• Still, many ambiguities can remain
  • Especially with larger numbers of markers

• Can you think of examples where parental information helps resolve phase?

• Can you think of examples where parental information leaves phase ambiguous?

[Figure: Proportion of Phase Known Haplotypes (0–100) by number of markers (1–15); series: Unrelateds, w/Parents, Parents + Sibling]


What if there are no relatives?

• Rely on linkage disequilibrium

• Assume that…

• Population consists of small number of distinct haplotypes

• Haplotypes tend to be similar to each other


Clark’s Haplotyping Algorithm

• Clark (1990) Mol Biol Evol 7:111-122

• One of the first haplotyping algorithms
  • Computationally efficient
  • Very fast and widely used in the 1990s

• More accurate methods are now available

• Predictions from the coalescent can be used to model performance


Clark’s Haplotyping Algorithm

• Find unambiguous individuals

• What kinds of genotypes will these have?

• Initialize a list of known haplotypes

• Resolve ambiguous individuals
  • If possible, use two haplotypes from list
  • Otherwise, use one known haplotype and augment list

• If unphased individuals remain
  • Assign phase randomly to one individual
  • Augment haplotype list and continue from previous step
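A minimal sketch of these steps for bi-allelic SNPs follows. It is not Clark's original implementation: the 0/1/2 genotype coding (1 = heterozygous), the function names `complement` and `clark`, and the order in which individuals are visited are all assumptions chosen for illustration.

```python
import random

# Genotypes are strings over 0/1/2 (copies of allele "1" at each SNP);
# haplotypes are strings over 0/1. Both codings are assumptions of this sketch.
TO_H1 = str.maketrans("012", "001")
TO_H2 = str.maketrans("012", "011")


def complement(genotype, hap):
    """Haplotype that pairs with `hap` to give `genotype`, or None if impossible."""
    other = []
    for g, h in zip(genotype, hap):
        left_over = int(g) - int(h)
        if left_over not in (0, 1):
            return None
        other.append(str(left_over))
    return "".join(other)


def clark(genotypes, seed=0):
    rng = random.Random(seed)
    known, phased = [], {}

    def add(h):
        if h not in known:
            known.append(h)

    # Unambiguous individuals (at most one heterozygous site) seed the list.
    for i, g in enumerate(genotypes):
        if g.count("1") <= 1:
            phased[i] = (g.translate(TO_H1), g.translate(TO_H2))
            add(phased[i][0])
            add(phased[i][1])

    while len(phased) < len(genotypes):
        progress = False
        for i, g in enumerate(genotypes):
            if i in phased:
                continue
            options = [(h, complement(g, h)) for h in known]
            options = [(h, c) for h, c in options if c is not None]
            if not options:
                continue
            # Prefer a resolution that uses two haplotypes already on the list.
            options.sort(key=lambda hc: hc[1] not in known)
            phased[i] = options[0]
            add(options[0][1])
            progress = True
        if not progress:
            # Nothing on the list fits: phase one remaining individual at random.
            i = next(k for k in range(len(genotypes)) if k not in phased)
            g = genotypes[i]
            h1 = "".join(
                rng.choice("01") if a == "1" else ("1" if a == "2" else "0")
                for a in g
            )
            phased[i] = (h1, complement(g, h1))
            add(h1)
            add(phased[i][1])
    return phased, known


phased, known = clark(["000", "220", "110", "211"])
print(phased)
print(known)
```

In this toy input the third individual is resolved with two haplotypes already on the list, while the fourth needs one known haplotype plus a new one, which is then added to the list.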


Chain of Inference (Clark, 1990)


Can The Algorithm Get Started?

• What kinds of genotypes do we need to get started?

• What kinds of haplotype pairs do we need to get started?

• What is the probability of these occurring?


Probability of Failing To Start


Distribution of Orphaned Alleles


Distribution of Anomalous Matches


Notes …

• Clark’s Algorithm is extremely fast

• More likely to start with large sample

• Orphaned alleles and anomalous matches may occur
  • Solution with the fewest orphaned alleles is usually the one with the fewest anomalous matches


The E-M Haplotyping Algorithm

• Excoffier and Slatkin (1995) Mol Biol Evol 12:921-927

• Provide a clear outline of how the algorithm can be applied to genetic data

• Combination of two strategies
  • E-M statistical algorithm for missing data
  • Counting algorithm for allele frequencies

• Refines Clark’s algorithm by incorporating frequency information


E-M Algorithm For Haplotyping

1. “Guesstimate” haplotype frequencies

2. Use current frequency estimates to replace ambiguous genotypes with fractional counts of phased genotypes

3. Estimate frequency of each haplotype by counting

4. Repeat steps 2 and 3 until frequencies are stable
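The four steps above can be sketched as follows for bi-allelic SNPs. This is an illustrative sketch, not the lecture's or the paper's code: the 0/1/2 genotype coding (1 = heterozygous), the uniform starting frequencies, the convergence tolerance and the function names are all assumptions.

```python
from collections import defaultdict
from itertools import product


def compatible_pairs(genotype):
    """Unordered haplotype pairs consistent with a 0/1/2-coded genotype."""
    het = [i for i, g in enumerate(genotype) if g == "1"]
    base = ["1" if g == "2" else "0" for g in genotype]
    pairs = []
    for flips in product("01", repeat=max(len(het) - 1, 0)):
        h1, h2 = base[:], base[:]
        if het:
            h1[het[0]], h2[het[0]] = "0", "1"
        for site, a in zip(het[1:], flips):
            h1[site], h2[site] = a, ("1" if a == "0" else "0")
        pairs.append(("".join(h1), "".join(h2)))
    return pairs


def em_haplotypes(genotypes, tol=1e-8, max_iter=1000):
    pairs = [compatible_pairs(g) for g in genotypes]
    haps = sorted({h for ps in pairs for p in ps for h in p})
    # Step 1: "guesstimate" (here: equal frequencies for all candidates)
    freq = {h: 1.0 / len(haps) for h in haps}
    for _ in range(max_iter):
        counts = defaultdict(float)
        for ps in pairs:
            # Step 2: split each individual into fractional phased genotypes,
            # weighted by the current probability of each haplotype pair.
            w = [(1 if a == b else 2) * freq[a] * freq[b] for a, b in ps]
            total = sum(w)
            for (a, b), wi in zip(ps, w):
                counts[a] += wi / total
                counts[b] += wi / total
        # Step 3: estimate frequencies by counting (2 haplotypes per person)
        new = {h: counts[h] / (2 * len(genotypes)) for h in haps}
        delta = max(abs(new[h] - freq[h]) for h in haps)
        freq = new
        # Step 4: stop once the frequencies are stable
        if delta < tol:
            break
    return freq


freq = em_haplotypes(["00", "00", "00", "22", "22", "22", "11"])
for hap, f in freq.items():
    print(hap, round(f, 4))
```

In this toy data set the six homozygous individuals carry only haplotypes 00 and 11, so the single double heterozygote is resolved almost entirely as 00/11 rather than 01/10.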


Expected Haplotype Counts

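$$E(n_{h_i} \mid \hat{p}) = 2\, n_{G(h_i,h_i)} + \sum_{j \neq i} n_{G(h_i,h_j)} \, \frac{2\, \hat{p}_{h_i} \hat{p}_{h_j}}{\sum_{H \sim G(h_i,h_j)} P(H)}$$

• $h_j$: haplotype $j$
• $G(h_i,h_j)$: unphased genotype corresponding to $h_i$, $h_j$
• $n_G$: number of genotypes of type $G$
• $H \sim G$: haplotype pairs compatible with $G$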


Notation used in Excoffier and Slatkin paper

For clarity, try replacing 'phenotype' with 'observed genotype combination'

• $m$: no. of observed phenotypes
• $j = 1 \ldots m$: index for individual phenotypes
• $n_j$: no. of individuals with phenotype $j$
• $s_j$: no. of heterozygous markers for phenotype $j$
• $c_j$: no. of haplotype pairs compatible with phenotype $j$
• $P_j = \sum_{i=1}^{c_j} P(h_{ik} h_{il})$: probability of phenotype $j$
• $L = \prod_{j=1}^{m} \left( \sum_{i=1}^{c_j} P(h_{ik} h_{il}) \right)^{n_j}$: likelihood


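Notation used in Excoffier and Slatkin paper

For clarity, try replacing 'phenotype' with 'observed genotype combination'

• $p_1^{(g)}, \ldots, p_h^{(g)}$: frequency of haplotypes $1 \ldots h$ at round $g$
• $P(h_k h_l)^{(g)} = 2\, p_k^{(g)} p_l^{(g)}$ for heterozygotes and $\left(p_k^{(g)}\right)^2$ for homozygotes: probability of genotype
• $P_j^{(g)} = \sum_{i=1}^{c_j} P(h_{ik} h_{il})^{(g)}$: probability of phenotype $j$
• $P_j(h_k h_l)^{(g)} = \dfrac{n_j}{n} \, \dfrac{P(h_k h_l)^{(g)}}{P_j^{(g)}}$: fitted proportion for genotype
• $p_t^{(g+1)} = \dfrac{1}{2} \sum_{j=1}^{m} \sum_{i=1}^{c_j} \delta_{it} \, P_j(h_{ik} h_{il})^{(g)}$: allele frequencies for next iteration, where $\delta_{it}$ is the number of copies of haplotype $t$ in the $i$-th compatible pair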


Computational Cost (for SNPs)

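• Consider sets of m unphased genotypes
  • Markers 1..m

• If markers are bi-allelic
  • $2^m$ possible haplotypes
  • $2^{m-1}(2^m + 1)$ possible haplotype pairs
  • $3^m$ distinct observed genotypes

• $2^{n-1}$ reconstructions for n heterozygous loci

For example, if m = 10

• $2^{10}$ = 1024 possible haplotypes
• $2^{9}(2^{10} + 1)$ = 524,800 possible haplotype pairs
• $3^{10}$ = 59,049 distinct observed genotypes
• $2^{9}$ = 512 reconstructions (for n = 10 heterozygous loci)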
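The counts above can be reproduced with a short calculation. The helper below is a hypothetical illustration (its name and return format are assumptions); it simply evaluates the four expressions for m bi-allelic markers and n heterozygous loci.

```python
def haplotype_search_space(m, n_het=None):
    """Sizes of the search space for m bi-allelic markers.

    n_het is the number of heterozygous loci in one individual;
    it defaults to m (every marker heterozygous).
    """
    if n_het is None:
        n_het = m
    n_hap = 2 ** m
    return {
        "haplotypes": n_hap,
        # 2^(m-1) * (2^m + 1) unordered pairs, including identical pairs
        "haplotype_pairs": n_hap * (n_hap + 1) // 2,
        "genotypes": 3 ** m,
        "reconstructions": 2 ** (n_het - 1) if n_het > 0 else 1,
    }


sizes = haplotype_search_space(10)
print(sizes)
```

For m = 10 this gives 1024 haplotypes, 524,800 haplotype pairs, 59,049 genotypes and 512 reconstructions, as on the slide.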


E-M Algorithm for Haplotyping

• Cost grows rapidly with number of markers

• Typically appropriate for < 25 SNPs
  • Fewer microsatellites

• More accurate than Clark’s method

• Fully or partially phased individuals contribute most of the information


Enhancements to E-M

• List only haplotypes present in sample

• Gradually expand subset of markers under consideration, eliminating haplotypes with zero or low estimated frequency from consideration at each stage

• SNPHAP [Clayton (2001)]

• HAPLOTYPER [Qin et al. (2002)]


Divide-And-Conquer Approximation

• Number of potential haplotypes increases exponentially

• Number of observed haplotypes does not

• Approximation
  • Successively divide marker set
  • Run E-M on each segment
  • Prune haplotype list as segments are ligated

• Computation order: ~ $m \log m$
  • Exact E-M is order ~ $2^m$

[Figure: run time (ms, 0–2500) versus number of markers (1–11) for Exact E-M and Approximation]


Further Refinements …

• More modern methods try to further improve haplotype estimation by favoring sets of similar haplotypes

• Stephens et al. (2001) Am J Hum Genet 68:978-89

• Genealogical approach…


What the Genealogy Implies…

• Haplotypes are similar to each other…

Known Haplotypes

22544
22544
22544
33334
33334
23233
14234

Individual 1

Genotype: 32344 / 23534
Haplotypes: 33334 / 22544

Individual 2

Genotype: 32444 / 23434
Haplotypes: 33434 / 22444


Chromosome Genealogies

[Figure: chromosome genealogy drawn from Past to Present, with asterisks (*) marked along the branches]


Method based on Gibbs sampler

• MCMC method
  • Stochastic, random procedure
  • Improves solution gradually

• Given initial set of haplotypes

• Sample haplotypes for one individual at a time, assuming other haplotypes are true

• Repeat a few million times…


Update Procedure I

• Pick individual U to update at random

• Calculate haplotype frequencies F in all other individuals
  • Since everyone is “phased”, this is done by counting

• Sample new haplotypes for U from conditional distribution of U’s haplotypes given F
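One update of this kind might look like the sketch below. The data layout (a list of current haplotype pairs plus, for each individual, a precomputed list of compatible pairs) and the function names are assumptions made for illustration. Because F is obtained purely by counting, a pair involving a haplotype not seen in anyone else gets probability zero; the uniform fallback used when every candidate has zero weight is also an assumption of this sketch.

```python
import random
from collections import Counter


def gibbs_update(phased, candidates, u, rng):
    """Resample the haplotype pair of individual u, given all other individuals.

    phased:     list of (h1, h2) tuples, the current phase of every individual
    candidates: candidates[i] lists the haplotype pairs compatible with
                individual i's genotype
    """
    # F: haplotype counts in everyone except U (everyone is "phased")
    counts = Counter(h for i, pair in enumerate(phased) if i != u for h in pair)
    # Weight of each candidate pair is proportional to its frequency under F
    weights = [
        (1 if a == b else 2) * counts[a] * counts[b]
        for a, b in candidates[u]
    ]
    if sum(weights) == 0:
        # No candidate is supported by the other individuals: pick uniformly
        weights = [1] * len(candidates[u])
    phased[u] = rng.choices(candidates[u], weights=weights, k=1)[0]
    return phased[u]


def gibbs_sweeps(phased, candidates, n_updates, seed=0):
    """Repeat the update for randomly chosen individuals."""
    rng = random.Random(seed)
    for _ in range(n_updates):
        u = rng.randrange(len(phased))
        gibbs_update(phased, candidates, u, rng)
    return phased


candidates = [
    [("00", "00")],
    [("11", "11")],
    [("00", "11"), ("01", "10")],
]
phased = [("00", "00"), ("11", "11"), ("01", "10")]
gibbs_update(phased, candidates, 2, random.Random(1))
print(phased[2])
```

In the toy example the other two individuals carry only haplotypes 00 and 11, so the ambiguous individual can only be resampled as 00/11.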


Update Procedure I

• This procedure would produce an estimate of haplotype frequencies equivalent to those obtained by E-M …

• Stephens et al. (2001) suggested an alternative estimate of F…


Update Procedure II

• Estimate F from the other individuals

• Construct F* to include haplotypes in F and also other similar haplotypes (possibly differing at a few sites)

• Update U’s haplotypes conditional on F*


Stephens' Formula …

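• Pr(h|H) is the probability of observing haplotype h given previous set H

$$\Pr(h \mid H) = \sum_{\alpha} \sum_{S=0}^{\infty} \frac{n_\alpha}{n} \left( \frac{\theta}{n+\theta} \right)^{S} \frac{n}{n+\theta} \left( P^{S} \right)_{\alpha h}$$

• $\sum_{\alpha}$: sum over haplotypes ($n_\alpha$ copies of haplotype $\alpha$ among the $n$ haplotypes in H)
• $\sum_{S}$: sum over number of mutations
• $\left( \frac{\theta}{n+\theta} \right)^{S}$: S mutations before coalescence ($\theta$ is the scaled mutation rate)
• $\frac{n}{n+\theta}$: coalescence
• $P$: mutation matrix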


Further Refinements

• This naïve strategy becomes impractical for very long haplotypes
  • List of haplotypes for each individual could become too long

• Instead, we can proceed by selecting a short segment of the haplotype to update at random


Hypothesis Testing

• Often, haplotype frequencies are not the final outcome.

• For example, we may wish to compare two groups of individuals…
  • Are haplotypes similar in two populations?

• Are haplotypes similar in patients and healthy controls?


Simplistic approach…

• Calculate haplotype frequencies in each group

• Find most likely haplotype for each individual

• Compare haplotype reconstructions in the two groups

NOT RECOMMENDED!!!


Observed Case Genotypes

[Figure: observed genotypes for case individuals 1–6]

The phase reconstruction in the five ambiguous individuals will be driven by the haplotypes observed in individual 1 …


Inferred Case Haplotypes

[Figure: inferred haplotypes for case individuals 1–6]

This kind of phenomenon will occur with nearly all population-based haplotyping methods!


Observed Control Genotypes

[Figure: observed genotypes for control individuals 1–6]

Note these are identical, except for the single homozygous individual …


Inferred Control Haplotypes

[Figure: inferred haplotypes for control individuals 1–6]

Ooops… The difference in a single genotype in the original data has been greatly amplified by estimating haplotypes…


Hypothesis Testing II

• Never impute case and control haplotypes separately

• Instead, consider both groups together …
  • Schaid et al. (2002) Am J Hum Genet 70:425-34
  • Zaykin et al. (2002) Hum Hered 53:79-91

• Another alternative is to use maximum likelihood


Hypothesis Testing III

• Estimated haplotype frequencies imply a likelihood for the observed genotypes

$$L = \prod_{i} \sum_{H \sim G_i} P(H)$$

• $\prod_i$: product over individuals
• $\sum_{H \sim G_i}$: sum over possible haplotype pairs, conditional on genotype
• $P(H)$: haplotype pair frequency


Hypothesis Testing III

• Calculate 3 likelihoods:
  • Maximum likelihood for combined sample, $L_A$
  • Maximum likelihood for control sample, $L_B$
  • Maximum likelihood for case sample, $L_C$

$$2 \ln \frac{L_B L_C}{L_A} \sim \chi^2_{df}$$

df corresponds to number of non-zero haplotype frequencies in large samples


Significance in Small Samples

• In realistic sample sizes, it is hard to estimate the number of df accurately

• Instead, permute case and control labels randomly
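A permutation version of the likelihood ratio test could be sketched as below. Everything here is an illustrative assumption rather than the lecture's implementation: the 0/1/2 genotype coding for bi-allelic SNPs (1 = heterozygous), the fixed number of E-M iterations, the function names, and the add-one p-value estimate. The statistic is 2 ln(L_B L_C / L_A), with each likelihood maximised by E-M on the corresponding sample.

```python
import math
import random
from itertools import product


def pairs_for(genotype):
    """Haplotype pairs compatible with a 0/1/2-coded genotype."""
    het = [i for i, g in enumerate(genotype) if g == "1"]
    base = ["1" if g == "2" else "0" for g in genotype]
    out = []
    for flips in product("01", repeat=max(len(het) - 1, 0)):
        h1, h2 = base[:], base[:]
        if het:
            h1[het[0]], h2[het[0]] = "0", "1"
        for site, a in zip(het[1:], flips):
            h1[site], h2[site] = a, ("1" if a == "0" else "0")
        out.append(("".join(h1), "".join(h2)))
    return out


def pair_prob(a, b, freq):
    return (1 if a == b else 2) * freq[a] * freq[b]


def max_loglik(genotypes, iters=100):
    """Log-likelihood of the genotypes at the E-M frequency estimates."""
    pairs = [pairs_for(g) for g in genotypes]
    haps = {h for ps in pairs for p in ps for h in p}
    freq = {h: 1.0 / len(haps) for h in haps}
    for _ in range(iters):
        counts = dict.fromkeys(haps, 0.0)
        for ps in pairs:
            w = [pair_prob(a, b, freq) for a, b in ps]
            total = sum(w)
            for (a, b), wi in zip(ps, w):
                counts[a] += wi / total
                counts[b] += wi / total
        freq = {h: c / (2 * len(genotypes)) for h, c in counts.items()}
    return sum(math.log(sum(pair_prob(a, b, freq) for a, b in ps)) for ps in pairs)


def lrt_stat(cases, controls):
    """2 ln(L_B * L_C / L_A): separate fits versus the combined fit."""
    return 2 * (max_loglik(cases) + max_loglik(controls) - max_loglik(cases + controls))


def permutation_pvalue(cases, controls, n_perm=99, seed=0):
    rng = random.Random(seed)
    observed = lrt_stat(cases, controls)
    pooled = cases + controls
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # randomly reassign case and control labels
        stat = lrt_stat(pooled[:len(cases)], pooled[len(cases):])
        if stat >= observed - 1e-12:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)


observed, p_value = permutation_pvalue(["00"] * 6, ["22"] * 6)
print(observed, p_value)
```

Because the haplotypes are re-estimated within every permuted data set, the null distribution reflects the extra variability introduced by phasing, which is the reason for permuting labels rather than relying on a chi-square reference with an uncertain df.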


Final thoughts…

• Compare alternative reconstructions
  • Change input order
  • Change random seeds
  • Change starting values

• When analyzing case-control studies
  • Randomize case-control labels


Summary

• Describe principles underlying haplotype estimation in unrelated individuals

• Heuristic algorithms

• The E-M algorithm

• Genealogical approach


Further Reading

• Clark AG (1990) Inference of Haplotypes from PCR-amplified samples of diploid populations. Mol Biol Evol 7:111-122