haplotype analysis based on markov chain monte carlo by konstantin sinyuk

Haplotype Analysis based on Markov Chain Monte Carlo

By Konstantin Sinyuk

Overview

Haplotype, Haplotype Analysis Markov Chain Monte Carlo

(MCMC) The algorithm based on (MCMC) Compare with other algorithms

result Discussion on algorithm

accuracy

What is haplotype ? A haplotype is a particular pattern of

sequential SNPs found on a single chromosome.

Haplotype has a block-wise structure separated by hot spots.

Within each block, recombination is rare due to tight linkage and only very few haplotypes really occur

Haplotype analysis motivation Use of haplotypes in disease association

studies reduces the number of tests to be carried out, and hence the penalty for multiple testing.

The genome can be partitioned onto 200,000 blocks

With haplotypes we can conduct evolutionary studies.

International HapMap Project started in October 2002 and planned to be 3 years long.

Haplotype analysis algorithms Given a random sample of multilocus genotypes

at a set of SNPs the following actions can be taken: Estimate the frequencies of all possible

haplotypes. Infer the haplotypes of all individuals.

Haplotyping Algorithms: Clark algorithm EM algorithm

Haplotyping programs: HAPINFEREX ( Clark Parsimony algorthm) EM-Decoder ( EM algorithm) PHASE ( Gibbs Sampler) HAPLOTYPER

Motivation for MCMC method MCMC algorithm considers the underlying

configurations in proportion to their likelihood Estimates most probable haplotype

configuration

Prof. Donnelly: “If a statistician cannot solve a problem, s/he makes it more complicated”

Discrete-Time Markov Chain

Discrete-time stochastic process {Xn: n = 0,1,2,…}

Takes values in {0,1,2,…} Memoryless property:

Transition probabilities Pij

Transition probability matrix P=[Pij]

1 1 1 0 0 1

1

{ | , ,..., } { | }

{ | }n n n n n n

ij n n

P X j X i X i X i P X j X i

P P X j X i

0

0, 1ij ijj

P P

n step transition probabilities

Chapman-Kolmogorov equations

is element (i, j) in matrix Pn

Recursive computation of state probabilities

Chapman-Kolmogorov Equations

{ | }, , 0, , 0nij n m mP P X j X i n m i j

nijP

0

, , 0, , 0n m n mij ik kj

k

P P P n m i j

State Probabilities – Stationary Distribution

State probabilities (time-dependent)

In matrix form:

If time-dependent distribution converges to a limit

is called the stationary distribution

Existence depends on the structure of Markov chain

11 1

0 0

{ } { } { | } π πn nn n n n j i ij

i i

P X j P X i P X j X i P

0 1π { }, π (π ,π ,...)n n n n

j nP X j

1 2 2 0π π π ... πn n n nP P P

π lim πn

n

π πP

Aperiodic: State i is periodic:

Aperiodic Markov chain: none of the states is periodic

Classification of Markov Chains

Irreducible: States i and j communicate:

Irreducible Markov chain: all states communicate

, : 0, 0n mij jin m P P 1: 0n

iid P n d

0

3 4

21

0

3 4

21

Existence of Stationary Distribution

Theorem 1: Irreducible aperiodic Markov chain. There

are two possibilities for scalars:

1. j = 0, for all states j No stationary distribution

2. j > 0, for all states j is the unique stationary distribution

Remark: If the number of states is finite, case 2 is the

only possibility

0π lim { | } lim nj n ij

n nP X j X i P

Ergodic Markov Chains Markov chain with a stationary distribution

States are positive recurrent: The process returns to state j “infinitely often”

A positive recurrent and aperiodic Markov chain is called ergodic

Ergodic chains have a unique stationary distribution

Ergodicity ⇒ Time Averages = Stochastic Averages

π 0, 0,1,2,...j j

π lim nj ijn

P

Balanced Markov Chain Global Balance Equations (GBE)

Detailed Balance Equations (DBE)

is the frequency of transitions from j to i

0 0

π π π π , 0j ji i ij j ji i iji i i j i j

P P P P j

π j jiP

Frequency of Frequency of

transitions out of transitions into j j

π π , 0,1,...j ji i ijP P i j

Markov Chain Summary Markov chain is a set of random processes with stationary transition probabilities

- matrix of transition probabilities, between and Markov chain is Ergodic if:

Aperiodic – Irreducible -

Ergodic Markov chain has stationary distribution property: exists and is independent of i (

)The vector is stationary distribution of

the chain

Ergodic Markov chain is detailed balanced if:

p nijn

)(lim j

p nijn

)(lim1),(

iij

ppjijiji

jij

ii pP ,

Markov Chain Monte Carlo MCMC is used when we wish to simulate

from a distribution known only up to a constant (normalization) factor: (C is hard to calculate)

Metropolis proposed to construct Markov chain with stationary distribution using only ratio

Define transition matrix P indirectly via Q = matrix: - proposal probability - acceptance probability, selected such that

Markov chain will be detailed balanced

ii

C

j

i

}{X S

)(qij

ij

qpijijij

)|Pr(*

ij XXq Tij

Metropolis-Hastings algorithm Metropolis-Hasting (MH) algorithm steps:

Start with X0 = any state Given Xt-1 = i, choose j with probability Accept this j (put Xt = j) with acceptance probability

- Hastings ratio

Otherwise accept i (put Xt = i) Repeat step 2 through 4 a needed number of times

With such detailed balance is satisfied With rejection steps the Markov chain is surely

irreducible

qij ij

1hijijq

qh

iji

jij

ij

hij

Metropolis-Hastings Graph

Example of Metropolis-Hastings Suppose we want to simulate from Metropolis algorithm steps:

Start with X0 = 0 Generate Xt from the proposal distribution

N(Xt-1,1) Compute

Repeat step 2 through 4 a needed number of times

Rxcx x ),exp(*)(4

)exp()(

)*(44*

1xxx

xh tt

tt

t

Gibbs Sampler The Gibbs sampler is a special case of the MH

algorithm that considers states that can be partitioned into coordinates

At each step, a single coordinate of the state is updated.

Step from to given by

Gibbs sampler is used where each variable depends on other variable within some neighborhood

The acceptance probabilities are all equal to 1

),...,,(21 iii n

i

)...,,,,...(111 iiiii nrrr

i

)...,,,,...(1

*

11

*

iiiiii nrrr

MCMC in haplotyping The Gibbs sampler is good for multilocus

genotyping of n persons. Lets define:

The conditional distribution P(g|d) can be estimated

, The Markov chain obtained with Gibbs sampler

may not be ergodic.

Ordered genotype of person i

is the observed phenotype of individual iat locus j

di

j

The proposed algorithm Most algorithms search that maximize P(g|d) The proposed algorithm seeks for

An ergodic Markov chain is constructed such thatstationary distribution is P(f(c)|d) The sampling is done with Gibbs sampler An Ergodic property of Markov chain is satisfied

with use of Metropolis jump kernels The Gibbs-Jumping name is assigned to algorithm

gmax

})|(:{)( cdgPgcf

Gibbs step of algorithm For each individual i and locus j, alleles andare sampled from the conditional distribution:

The following assumption are commonly made in order to compute transition probability

Hardy-Weinberg Equilibrium Linkage Equilibrium No interference

spouses of i

children of i and s

ai

j1 ai

j2

),|( gggmf

Pii

i

Jumping step of algorithm After Gibbs step the algorithm attempts to jump

from current state of multilocus genotype g to the state g* in a different irreducible set.

The Metropolis jumping kernel is used Let be the set of non-

communicating genotypic configurations on locus j set of individuals who “characterize” irreducible

set at j A new state g* is formed by replacing the alleles

pair in g by those from for individuals in The g* is accepted with probability

},...,,{21

gggG rDDDD

j

jjjj

Qj

|

D j

gk

D j

D j

})|(

)|*(,1min{

dgP

dgP

Gibbs-Jump trajectories

Results Comparison Gibbs-Jumping algorithm is estimating one. So the algorithm should be tested on well-

explored genetic diseases. Such explored diseases are:

Krabbe disease (autosomal, recessive disorder) Episodic Ataxia disease (autosomal, dominant

disorder) The original exploration was done by programs

LINKAGE (Krabbe) – enumerating linkage analysis SIMWALK (Ataxia) – using simulated annealing (MC)

The comparison of various haplotyping method was carried out by Sobel.

So proposed algorithm results are compared to Sobel work.

Krabbe disease (Globoid-cell leukodystrophy) This autosomal recessively-inherited disease

results from a deficiency of the lysosomal enzyme

b-galactosylceramidase (GALC). GALC enzyme plays a role in the normal turnover

of lipids that form a significant part of myelin, the insulating material around certain neurons.

Affected individuals show progressive mental and motor (movement) deterioration and usually die within a year or two of birth.

Krabbe disease cont

Krabbe disease result compare The input data is genetic map of 8

polymorphic genetic markers on chromosome 14.

The Gibbs-Jump algorithm assigned the most likely haplotype configuration with probability 0.69, the same configuration as obtained by Sobol enumerative approach.

By Sobol they enumerated 262,144 haplotype variations with CPU time of couple of hours instead of less than 1 minute run of 100 iterations of Gibbs-Jump.

Episodic Ataxia disease Episodic ataxia, a autosomal

dominantly-inherited disease affecting the cerebellum.

Point mutations in the human voltage-gated potassium channel (Kv1.1) gene on chromosome 12p13

Affected individuals are normal between attacks but become ataxic under stressful conditions.

Episodic Ataxia result compare The input data is genetic map of 9

polymorphic genetic markers on chromosome 12.

The Gibbs-Jump algorithm assigned the most likely haplotype configuration with probability 0.41, that is very similar to the obtained by Sobel with SIMWALK.

The second most probable haplotype configuration obtained with 0.09 probability and is identical to the one picked by Sobel.

Simulation data To evaluate the performance of Gibbs-Jump on

large pedigrees (with loops) a haplotype configuration was simulated.

The genetic map of 10 co dominant markers (5 alleles per marker) with = 0.05 was taken. The founders haplotypes were sampled randomly

from population distribution of haplotypes. Haplotypes for nonfounders where then simulated

conditional on their parents’ haplotypes. Assuming HW equilibrium ,Linkage equilibrium

and Haldane’s no interference model for recombination.

Simulation results 100 iteration of Gibbs-Jump were performed. The most probable configuration (with probability

0.41) is identical to the true (simulated ) one There are 3 configurations with second largest

probability (0.07) All 3 differ from the true configuration in one

person with one extra recombination event in each

The algorithm execution time took several minutes

Simulation accuracy Results of 10 runs of

100 realizations each. In runs 1 and 3-10 the

most frequent configuration was the true one .

The most frequent configuration in run 2 differed from the true one at one individual.

Simulation run-length Results of 5 runs of

10000 realizations each. The figure shows that

there is a fair amount of variability in the estimates, but with very little correlation between consecutive estimates.

Autocorrelation = -0.02

Dot plot of the estimated frequency of the underlying true haplotype configuration for 100 iterations.

Simulation run-length cont Estimates converges to the

true haplotype configuration after 2000 steps.

The confidence bound is 95%

Four other runs also inferred the true configuration with probabilities: 34.54%,35.75%,37.08% and 35.27% respectively.

Cumulative frequency of the most probable configuration , plotted for every 100 iterations and the confidence bound.

Results of Sensitivity Analysis

Computation of P(g) requires an assignment of haplotype probabilities to the founders.

How inaccurate prediction of founder probabilities affects the results?

The 4 sets for gene frequencies (different from simulated) for one of 10 markers were used (other markers were leaved unchanged)

For the above simulation set the resulting haplotype configuration was as simulated one.

Conclusion In this discussion was presented a new method,Gibbs-Jump, for haplotype analysis, which explores

the whole distribution of haplotypes conditional on the observed phenotypes. The method is very time-efficient. The result accuracy was compared to obtained

by other methods (described by Sobol). Method demonstrated the sensitivity tolerance tofounders probabilities sample.

The End…

Wake up!

haplotype analysis based on markov chain monte carlo by konstantin sinyuk

Documents