haplotyping...1. “guesstimate” haplotype frequencies 2. use current frequency estimates to...
TRANSCRIPT
HaplotypingBiostatistics 666
Previously
• Introduction to the E-M algorithm
• Approach for likelihood optimization
• Examples related to gene counting• Allele frequency estimation – recessive disorder
• Allele frequency estimation – ABO blood group
• Introduction to haplotype frequency estimation
Today: Haplotype Frequencies
• Evolution of haplotype estimation methods• Clark’s greedy algorithm
• Excoffier and Slatkin’s E-M algorithm
• Stephens et al’s coalescent based algorithm
• Using estimated haplotypes in association studies
Useful Roles for Haplotypes
• Linkage disequilibrium studies• Summarize genetic variation
• Selecting markers to genotype• Identify haplotype tag SNPs
• Candidate gene association studies• Help interpret single marker associations• Capture the effect of ungenotyped alleles
• Identify regions of recent ancestry among individuals• Date interesting alleles• Identify regions undergoing natural selection
The problem…
• Haplotypes are hard to measure directly• X-chromosome in males
• Sperm typing
• Hybrid cell lines
• Other molecular techniques
• Instead, statistical reconstruction is often required
Typical Genotype Data
• Two alleles for each individual• Chromosome origin for each allele unknown
• Many haplotype pairs can fit observed genotype
C G Marker1
T C Marker2
G A Marker3
Observation
C G C G
T C C T
G A G A
C G C G
C T T C
A G A G
Possible States
Use Information on Relatives
• Family information can help determine haplotype phase
• Still, many ambiguities can remain• Especially with larger numbers of markers
• Can you think of examples where parental information helps resolve phase?
• Can you think of examples where parental information leaves phase ambiguous?
Proportion of Phase Known Haplotypes
0
33
67
100
1 8 15Markers
Unrelateds w/Parents
Parents + Sibling
What if there are no relatives?
• Rely on linkage disequilibrium
• Assume that…
• Population consists of small number of distinct haplotypes
• Haplotypes tend to be similar to each other
Clark’s Haplotyping Algorithm
• Clark (1990) Mol Biol Evol 7:111-122
• One of the first haplotyping algorithms• Computationally efficient
• Very fast and widely used in 1990’s
• More accurate methods are now available
• Predictions from the coalescent can be used to model performance
Clark’s Haplotyping Algorithm
• Find unambiguous individuals
• What kinds of genotypes will these have?
• Initialize a list of known haplotypes
• Resolve ambiguous individuals• If possible, use two haplotypes from list
• Otherwise, use one known haplotype and augment list
• If unphased individuals remain• Assign phase randomly to one individual
• Augment haplotype list and continue from previous step
Chain of Inference (Clark, 1990)
Can The Algorithm Get Started?
• What kinds of genotypes do we need to get started?
• What kinds of haplotype pairs do we need to get started?
• What is the probability of these occurring?
Probability of Failing To Start
Distribution of Orphaned Alleles
Distribution of Anomalous Matches
Notes …
• Clark’s Algorithm is extremely fast
• More likely to start with large sample
• Orphaned alleles and anomalous matches may occur• Solution with the least orphaned alleles is usually the one with the
fewest anomalous matches
The E-M Haplotyping Algorithm
• Excoffier and Slatkin (1995) • Mol Biol Evol 12:921-927
• Provide a clear outline of how the algorithm can be applied to genetic data
• Combination of two strategies• E-M statistical algorithm for missing data
• Counting algorithm for allele frequencies
• Refines Clark’s algorithm by incorporating frequency information
E-M Algorithm For Haplotyping
1. “Guesstimate” haplotype frequencies
2. Use current frequency estimates to replace ambiguous genotypes with fractional counts of phased genotypes
3. Estimate frequency of each haplotype by counting
4. Repeat steps 2 and 3 until frequencies are stable
Expected Haplotype Counts
ij
ji
ji
jiiii
hh
hhGH
hh
hhGhhGh
G
jiji
j
pHP
ppnn)(n
GH
n
hhhhG
jh
),(~
),(),()ˆ|(
ˆˆ22E
G with compatible pairs Haplotype~
G typeof genotypes ofNumber
, toingcorrespond genotype Unphased),(
haplotype
Notation used inExcoffier and Slatkin paper
n'combinatio genotype observed' with phenotype'' replacing try clarity,For
Likelihood)(
phenotype ofy probabilit)()cpair haplotype(
phenotypefor compatible pairs haplotype
phenotypefor markers usheterozygo of no.
phenotypes individualfor index ...1
phenotypes observed of no.
1 1
11
m
j
nc
i
ilik
c
i
ilik
c
i
jj
j
j
jj
jj
hhPL
jhhPPP
jc
js
mj
m
Notation used inExcoffier and Slatkin paper
n'combinatio genotype observed' withphenotype'' replacing try clarity,For
iterationnext for sfrequencie allele)(2
1
genotypefor proportion fitted)(
)(
phenotype ofy probabilit)(
shomozygote and tesheterozygofor
genotype ofy probabilit)(
roundat haplotype offrequency ...,
1 1
)()1(
)(
)(
)(
1
)()(
2)(
)()(
)(
)()(
2
)(
1
m
j
c
i
lk
g
jit
g
t
lkg
j
lk
g
jj
lk
g
c
i
ilik
g
j
g
j
lk
g
k
g
l
g
k
lk
g
j
g
h
gg
j
j
hhPp
hhP
hhP
n
nhhP
jhhPP
hh
p
pphhP
ghppp
Computational Cost (for SNPs)
• Consider sets of m unphased genotypes• Markers 1..m
• If markers are bi-allelic• 2m possible haplotypes
• 2m-1 (2m + 1) possible haplotype pairs
• 3m distinct observed genotypes
• 2n-1 reconstructions for n heterozygous loci
For example, if m = 10
= 1024
= 524,800
= 59,049
= 512
E-M Algorithm for Haplotyping
• Cost grows rapidly with number of markers
• Typically appropriate for < 25 SNPs• Fewer microsatellites
• More accurate than Clark’s method
• Fully or partially phased individuals contribute most of the information
Enhancements to E-M
• List only haplotypes present in sample
• Gradually expand subset of markers under consideration, eliminating haplotypes with zero or low estimated frequency from consideration at each stage
• SNPHAP [Clayton (2001)]
• HAPLOTYPER [Qin et al. (2002)]
Divide-And-Conquer Approximation
• Number of potential haplotypes increases exponentially
• Number of observed haplotypes does not
• Approximation• Successively divide marker set• Run E-M on each segment• Prune haplotype list as segments are
ligated
• Computation order: ~ m log m• Exact E-M is order ~ 2m
0
1250
2500
1 3 5 7 9 11
markers
tim
e (
ms)
Exact E-M Approximation
Further Refinements …
• More modern methods try to further improve haplotype estimation by favoring sets of similar haplotypes
• Stephens et al. (2001) • Am J Hum Genet 68:978-89
• Genealogical approach…
What the Genealogy Implies…
• Haplotypes are similar to each other…
Known Haplotypes
22544
22544
22544
33334
33334
23233
14234
Individual 1
Genotype:
32344
23534
Individual 2:
Genotype:
32444
23434
33334
22544
33434
22444
Chromosome Genealogies
Present
Past
*
*
*
*
*
*
*
*
* *
*
*
Method based on Gibbs sampler
• MCMC method• Stochastic, random procedure
• Improves solution gradually
• Given initial set of haplotypes
• Sample haplotypes for one individual at a time, assuming other haplotypes are true
• Repeat a few million times…
Update Procedure I
• Pick individual U to update at random
• Calculate haplotype frequencies F in all other individuals• Since everyone is “phased”, this is done by counting
• Sample new haplotypes for U from conditional distribution of U’s haplotypes given F
Update Procedure I
• This procedure would produce an estimate of haplotype frequencies equivalent to those obtained by E-M …
• Stephens et al (2001) suggested an alternative estimate of F…
Update Procedure II
• Estimate F from the other individuals
• Construct F* to include haplotypes in F and also other similar haplotypes (possibly differing at a few sites)
• Update U’s haplotypes conditional on F*
Stephens' Formula …
• Pr(h|H) is the probability of observing haplotype hgiven previous set H
h
S
S
S
Pn
n
nn
nHh
)|Pr(
Sum over haplotypes
Sum over number of mutations
S mutations before coalescence
Coalescence
Mutation Matrix
Further Refinements
• This naïve strategy becomes impractical for very long haplotypes• List of haplotypes for each individual could become too long
• Instead, we can proceed by selecting a short segment of the haplotype to update at random
Hypothesis Testing
• Often, haplotype frequencies are not final outcome.
• For example, we may wish to compare two groups of individuals…• Are haplotypes similar in two populations?
• Are haplotypes similar in patients and healthy controls?
Simplistic approach…
• Calculate haplotype frequencies in each group
• Find most likely haplotype for each individual
• Compare haplotype reconstructions in the two groups
Simplistic approach…
• Calculate haplotype frequencies in each group
• Find most likely haplotype for each individual
• Compare haplotype reconstructions in the two groups
NOT RECOMMENDED!!!
Observed Case Genotypes
1 2 3 4 5 6
The phase reconstruction in the five ambiguous individuals
will be driven by the haplotypes observed in individual 1 …
Inferred Case Haplotypes
1 2 3 4 5 6
This kind of phenomenon will occur with nearly all population
based haplotyping methods!
Observed Control Genotypes
1 2 3 4 5 6
Note these are identical, except for the single homozygous
individual …
Inferred Control Haplotypes
1 2 3 4 5 6
Ooops… The difference in a single genotype in the original
data has been greatly amplified by estimating haplotypes…
Hypothesis Testing II
• Never impute case and control haplotypes separately
• Instead, consider both groups together …• Schaid et al (2002) Am J Hum Genet 70:425-34
• Zaytkin et al (2002) Hum Hered. 53:79-91
• Another alternative is to use maximum likelihood
Hypothesis Testing III
• Estimated haplotype frequencies, imply a likelihood for the observed genotypes
i G~H i
)(HPL
Hypothesis Testing III
• Estimated haplotype frequencies, imply a likelihood for the observed genotypes
i G~H i
)(HPL
individuals
possible haplotype pairs, conditional on genotype
haplotype pair frequency
Hypothesis Testing III
• Calculate 3 likelihoods:• Maximum likelihood for combined sample, LA
• Maximum likelihood for control sample, LB
• Maximum likelihood for case sample, LC
2~ln2 df
A
CB
L
LL
df corresponds to number of non-zero haplotype frequencies in large samples
Significance in Small Samples
• In realistic sample sizes, it is hard to estimate the number of dfaccurately
• Instead, permute case and control labels randomly
Final thoughts…
• Compare alternative reconstructions• Change input order
• Change random seeds
• Change starting values
• When analyzing case-control studies• Randomize case-control labels
Summary
• Describe principles underlying haplotype estimation in unrelated individuals
• Heuristic algorithms
• The E-M algorithm
• Genealogical approach
Further Reading
• Clark AG (1990) Inference of Haplotypes from PCR-amplified samples of diploid populations. Mol Biol Evol 7:111-122