Exact Computation of Coalescent Likelihood under the Infinite Sites Model Yufeng Wu University of Connecticut ISBRA 2009 1

Post on 19-Dec-2015


Exact Computation of Coalescent Likelihood under the Infinite Sites Model

Yufeng Wu

University of Connecticut

ISBRA 2009


Coalescent Likelihood

• D: a set of binary sequences.

• Coalescent genealogy: a history with coalescent and mutation events.

• Coalescent likelihood P(D): the probability of observing D under the coalescent model, given the mutation rate θ.

• Assume no recombination.

Example D: 00000, 00010, 00010, 00010, 01100, 10000, 10001

[Figure: a coalescent genealogy for D with mutations 1-5; legend: coalescent, mutation]


Perfect Phylogeny

• Infinitely many sites model of mutations: one mutation per site in the history.

• Perfect phylogeny
– Sites label the tree branches.
– Each site appears exactly once.
– Sequence: the list of mutations from root to leaf.
– Topologically unique, except that the root is unknown and the order of mutations on the same branch is not fixed.

[Figure: perfect phylogeny for D; labeled sequences include 00000 and 01100]
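Whether a set of binary sequences admits a perfect phylogeny can be checked with the standard four-gamete test. The sketch below is an illustration, not code from the talk; the function name `has_perfect_phylogeny` is hypothetical, and the data is the example D from the earlier slide.

```python
from itertools import combinations

def has_perfect_phylogeny(seqs):
    """Four-gamete test: True if no pair of sites shows all of 00, 01, 10, 11."""
    if not seqs:
        return True
    n_sites = len(seqs[0])
    for i, j in combinations(range(n_sites), 2):
        # Collect the distinct allele pairs observed at sites i and j.
        gametes = {(s[i], s[j]) for s in seqs}
        if len(gametes) == 4:
            return False
    return True

# Example D from the slides (7 sequences, 5 sites).
D = ["00000", "00010", "00010", "00010", "01100", "10000", "10001"]
print(has_perfect_phylogeny(D))
```

The test does not need the root to be known, which matches the slide's remark that the root of the perfect phylogeny is unknown.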


Genealogy and Perfect Phylogeny

• Perfect phylogeny: not enough timing information.
– Many coalescent genealogies exist for a fixed perfect phylogeny.
– Each genealogy has a different probability (depending on θ).

• Coalescent likelihood: sum over all compatible genealogies.

[Figure: three example genealogies, each with mutations 1-5]
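Written as a formula, the sum over compatible genealogies reads as follows. The symbol G for a genealogy (with its mutation events) is an assumption introduced here for illustration; the slides do not name it.

```latex
P(D \mid \theta) \;=\; \sum_{G \text{ compatible with } D} P(G \mid \theta)
```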


Computing Coalescent Likelihood

• Computation of P(D): a classic population genetics problem. Statistical (inexact) approaches:
– Importance sampling (IS): Griffiths and Tavare (1994), Stephens and Donnelly (2000), Hobolth, Uyenoyama and Wiuf (2008).
– MCMC: Kuhner, Yamato and Felsenstein (1995).

• Genetree: IS-based and widely used, but its estimates still have variance (sometimes large).

• How feasible is it to compute P(D) exactly?
– Considered difficult even for medium-sized data (Song, Lyngso and Hein, 2006).

• This talk: exact computation of P(D) is feasible for data significantly larger than previously believed.
– A simple algorithmic trick: dynamic programming.


Ethier-Griffiths Recursion

• Build a perfect phylogeny for D.

• Ancestral configuration (AC): at some point in time, one pair per sequence type, giving the multiplicity of that type and its list of mutations.

• Transition probability between ACs: depends on the AC and θ.

• Genealogy: a path of ACs (from the present to the root).

• P(D): sum of the probabilities of all paths.

• EG: faster summation, backwards in time.

Example ACs from the figure. Each pair is (multiplicity, mutation list); for instance, (3, 4 0) is three copies of the type carrying mutation 4, and the trailing 0 marks the root.

(1, 0), (3, 4 0), (1, 3 2 0), (1, 1 0), (1, 5 1 0)

(1, 0), (1, 0), (1, 3 2 0), (1, 1 0), (1, 5 1 0)

(1, 0), (2, 4 0), (1, 3 2 0), (1, 1 0), (1, 5 1 0)

(1, 0), (1, 4 0), (1, 3 2 0), (1, 1 0), (1, 5 1 0)
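As an illustration of the AC notation, the present-day AC can be derived from the example sequences by grouping identical sequences and listing each type's mutations from leaf to root. This is a minimal sketch under two assumptions that are not stated on the slide: the all-zero sequence is ancestral, and mutations on the same branch are listed in descending site order (the slide's "3 2 0" is consistent with that tie-break). The function name `present_day_ac` is hypothetical.

```python
from collections import Counter

def present_day_ac(seqs):
    """Return the present-day AC as a list of (multiplicity, mutation_list) pairs."""
    # Number of sequences carrying each mutation (sites are numbered from 1).
    carriers = Counter()
    for s in seqs:
        for i, c in enumerate(s):
            if c == "1":
                carriers[i + 1] += 1
    ac = []
    # Counter keeps first-seen order, so types appear in input order.
    for seq, mult in Counter(seqs).items():
        muts = [i + 1 for i, c in enumerate(seq) if c == "1"]
        # On a root-to-leaf path, a mutation with fewer carriers is closer to
        # the leaf; ties (same branch) are broken by descending site index.
        muts.sort(key=lambda m: (carriers[m], -m))
        ac.append((mult, muts + [0]))  # 0 marks the root
    return ac

D = ["00000", "00010", "00010", "00010", "01100", "10000", "10001"]
for mult, muts in present_day_ac(D):
    print(mult, muts)
```

For the example D this reproduces the first AC listed above, with each mutation list ending in 0.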


Computing Exact Likelihood

• Key idea: forward instead of backwards.
– Create all possible ACs reachable from the current AC (starting from the root) and update their probabilities.
– Intuition for an AC: growing coverage of the phylogeny, starting from the root.

• Possible events at the root: three branchings (b1, b2, b3) and three mutations (m1, m2, m4).

• Branching: covers a new branch.

• A covered branch can mutate.

• Each event yields a new AC.

[Figure: starting from the root AC, events such as b1, m2 and b2 extend the covered part of the phylogeny]


Forward Finding of ACs

• Find all ACs by looking forward.

• Maintain a list of active events in each AC, and update it in the new ACs.

• Rule: at a node in the phylogeny, the number of mutated branches must be less than the number of covered branches (unless all branches are covered).

[Figure: the case where mutated branches = covered branches]


Why Forward?

• Bottleneck: memory.

• Layer of ACs: the ACs that are k mutation or branching events away from the root AC, k = 1, 2, 3, …

• Key: only the current layer needs to be kept, which is memory efficient.

• A single forward pass is enough to compute P(D).

[Figure legend: coalescent, mutation]
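The layer-by-layer idea can be sketched as a generic dynamic program over a layered state graph. This is not the talk's implementation: the names `forward_pass`, `successors` and `is_final` are hypothetical, and the toy transition function below stands in for the real AC transition probabilities, which are not given on the slides.

```python
def forward_pass(root, successors, is_final):
    """Sum the probabilities of all root-to-final paths, one layer at a time."""
    layer = {root: 1.0}  # states exactly k events away from the root
    total = 0.0
    while layer:
        next_layer = {}
        for state, prob in layer.items():
            if is_final(state):
                total += prob
                continue
            for nxt, p in successors(state):
                # Paths that reach the same state are merged, so each state
                # is stored once per layer.
                next_layer[nxt] = next_layer.get(nxt, 0.0) + prob * p
        layer = next_layer  # the previous layer is discarded here
    return total

# Toy stand-in: a state counts two kinds of events; each occurs with
# probability 0.5, and the walk stops after two events in total.
def toy_successors(state):
    a, b = state
    if a + b >= 2:
        return []
    return [((a + 1, b), 0.5), ((a, b + 1), 0.5)]

total = forward_pass((0, 0), toy_successors, lambda s: s == (1, 1))
print(total)
```

Only `layer` and `next_layer` are held in memory at any time, which is the point made on the slide. The sketch relies on every path to a given state having the same number of events, as is the case for layers of ACs.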


Results on Simulated Data

• Data simulated with Hudson’s program ms: 20, 30, 40 and 50 sequences with θ = 1, 3 and 5; 100 datasets per setting. How many allow exact computation of P(D) within a reasonable amount of time?

[Figure, two panels: % of feasible data vs. number of sequences; average run time (sec.) for feasible data vs. number of sequences]


Application: MLE of Mutation Rate

• Given a set of sequences D, what is the maximum likelihood estimate (MLE) of the mutation rate θ?

• Issue: P(D | θ) must be computed for many possible θ, and the root of the genealogy is not known.

• Use exact likelihood
– Compute P(D | θ) for θ on a grid.
– MLE of θ: maximize P(D | θ).
– Full likelihood: sum P(D | θ) over all possible roots.
– Faster computation: compute P(D) for multiple θ in one pass, which is quicker than computing P(D) for one θ at a time.

[Figure: P(D) as a function of θ, with MLE(θ) marked]
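The grid search itself is simple once a likelihood function is available. The sketch below uses a stand-in likelihood rather than the talk's exact P(D | θ): for a sample of only two sequences under the infinite sites model, the probability of s segregating sites is θ^s / (1 + θ)^(s+1), which is maximized at θ = s. The names `likelihood_two_seqs` and `grid_mle` are hypothetical.

```python
def likelihood_two_seqs(s, theta):
    """P(s segregating sites | theta) for a sample of two sequences."""
    return theta ** s / (1.0 + theta) ** (s + 1)

def grid_mle(likelihood, grid):
    """Return the grid point with the largest likelihood."""
    return max(grid, key=likelihood)

# Grid of theta values 0.1, 0.2, ..., 10.0.
grid = [i / 10 for i in range(1, 101)]
theta_hat = grid_mle(lambda t: likelihood_two_seqs(3, t), grid)
print(theta_hat)
```

With an exact likelihood routine in place of the stand-in, the same loop gives the MLE on the grid; the resolution of the estimate is limited by the grid spacing.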


Mitochondrial Data

• Mitochondrial data from Ward et al. (1991), previously analyzed by Griffiths and Tavare (1994) and others.
– 55 sequences and 18 polymorphic sites.
– Believed to fit the infinite sites model.


MLE of Mutation Rate for the Mitochondrial Data

• MLE of θ: 4.8 (Griffiths and Tavare, 1994).

• IS methods can have variance.
– Is 4.8 really the MLE?

• Compute the exact full likelihood for a grid of θ between 4.0 and 6.0.


Conclusion

• IS seems to work well for the mitochondrial data.
– However, IS can still have large variance for some data.
– Thus, exact computation may help when the data is not very large and/or the mutation rate is relatively low.
– It can also help to evaluate different statistical methods.

• Research supported by the National Science Foundation.