A Large Version of the Small Parsimony Problem Optimally reconstruct ancestral sequences given - unrooted phylogeny (hence ‘small’ parsimony p.) - multiple alignment - affine gap cost function Jakob Fredslund* ([email protected]) , Jotun Hein**, Tejs Scharling* * Bioinformatics Research Center, Aarhus University, Denmark ** Department of Statistics, University of Oxford, United Kingdom

Post on 21-Dec-2015


A Large Version of the Small Parsimony Problem

Optimally reconstruct ancestral sequences given

- unrooted phylogeny (hence ‘small’ parsimony problem)
- multiple alignment
- affine gap cost function

Jakob Fredslund* ([email protected]), Jotun Hein**, Tejs Scharling*

* Bioinformatics Research Center, Aarhus University, Denmark
** Department of Statistics, University of Oxford, United Kingdom


Overview

• Introduction

• Examples

• Gap graph construction

• Theory

• Results

• Conclusions


Small Parsimony, No Gaps

Algorithm due to Fitch-Hartigan-Sankoff: going up the tree, calculate N(A, C, G, T) in each node (the minimal cost of the subtree rooted at this node with nucleotide X in the root); then backtrack going down.
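The up/down passes described above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the tree has been rooted arbitrarily and is binary, uses a unit substitution cost, and the names `sankoff`, `tree`, and `leaf_states` are hypothetical.

```python
from math import inf

NUCLEOTIDES = "ACGT"

def sankoff(tree, leaf_states, cost=lambda x, y: 0 if x == y else 1):
    """Return (minimal cost, {node: nucleotide}) for one alignment column.

    tree maps each internal node to its children; the first key is taken
    as the root (assumption). leaf_states maps each leaf to its nucleotide.
    """
    root = next(iter(tree))
    table = {}  # table[node][x] = minimal cost of the subtree with x at node

    def up(node):
        if node not in tree:  # leaf: only its observed nucleotide is allowed
            table[node] = {x: (0 if x == leaf_states[node] else inf)
                           for x in NUCLEOTIDES}
            return
        for child in tree[node]:
            up(child)
        table[node] = {
            x: sum(min(cost(x, y) + table[c][y] for y in NUCLEOTIDES)
                   for c in tree[node])
            for x in NUCLEOTIDES
        }

    assignment = {}

    def down(node, x):
        # Backtrack: give each child the nucleotide that realised the minimum.
        assignment[node] = x
        for c in tree.get(node, ()):
            best = min(NUCLEOTIDES, key=lambda y: cost(x, y) + table[c][y])
            down(c, best)

    up(root)
    root_state = min(NUCLEOTIDES, key=lambda x: table[root][x])
    down(root, root_state)
    return table[root][root_state], assignment
```

Running this once per alignment column gives the nucleotide assignment; gaps are not handled here, which is exactly what the rest of the talk addresses.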


Small Parsimony, Large Version

1: ac-a---gattc
2: acgac---atcc
3: gc-----gagcc
4: -agacttgt---
5: aagtcttagt-c

g(k) = 12 + 2*k

(note: alignment is given)


Two Steps

1) Find optimal set of indels to explain gaps

2) Assign nucleotides optimally (FHS)

So: focus on indels


Tracing Evolution

What events could explain this alignment?

cagtta

gcag--a

-cagtta

-cag--a

-ctg--a


Tracing Evolution

cagtta

cagtta


Tracing Evolution

cagtta caga

cagtta

cag--a

cagtta


Tracing Evolution

cagtta caga

ctga

cagtta

cag--a

ctg--a

caga


Tracing Evolution

cagtta caga

ctga

gcag--a

cagtta

cag--a

ctg--a

-cagtta

-cag--a

-ctg--a

gcaga


Indels Affect Full Subtrees

cagtta caga

ctga

gcaga

gcag--a

-cagtta

-cag--a

-ctg--a

All sequences in right subtree have gaps in blue indel’s position


Indels Affect Full Subtrees

cagtta caga

ctga

gcaga

gcag--a

-cagtta

-cag--a

-ctg--a

All sequences in left subtree have gaps in green indel’s position


Direction of Evolution?

cagtta caga

ctga

gcaga

gcag--a

-cagtta

-cag--a

-ctg--a

deletion of tt


Direction of Evolution?

cagtta caga

ctga

gcaga

gcag--a

-cagtta

-cag--a

-ctg--a

insertion of tt


Direction of Evolution?

cagtta caga

ctga

gcaga

gcag--a

-cagtta

-cag--a

-ctg--a

Since we don’t know the direction, we refer to insertions and deletions collectively as indels. And remember: an indel creates gaps in a full subtree.


Explaining Gaps With Indels

g(k) = a + bk

(Anonymous nucleotides denoted by n)


Costs of two alternative explanations: 2*(a+2b) versus 3*(a+b)
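To make the two totals concrete, here is a small sketch that evaluates them for the cost function g(k) = 12 + 2*k used on the earlier example slide. The helper names `g` and `explanation_cost` are hypothetical, and the indel lengths (two indels of length 2 versus three of length 1) are read off the formulas, since the figure itself is not preserved in this transcript.

```python
def g(k, a, b):
    """Affine gap cost of one indel of length k: g(k) = a + b*k."""
    return a + b * k

def explanation_cost(indel_lengths, a, b):
    """Total cost of explaining the gaps with the given set of indels."""
    return sum(g(k, a, b) for k in indel_lengths)

a, b = 12, 2  # values taken from g(k) = 12 + 2*k
two_long = explanation_cost([2, 2], a, b)        # 2*(a+2b)
three_short = explanation_cost([1, 1, 1], a, b)  # 3*(a+b)
print(two_long, three_short)
```

In general 2*(a+2b) is smaller than 3*(a+b) exactly when b < a, so which explanation wins depends on the gap cost function.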


Larger Example

N8, N9, N10, N11, N12, N13: ???

Complex problem! (We are not aware of any upper time bound.)


Gap Graph Construction

Represent in a concise way all gaps and how they are connected: in a graph.


Gap Intervals

1. Find gap intervals.


No optimal indel ‘stops’ in the middle of a gap interval: it is cheaper to extend the indel making the first gap than to open a new one (by the triangle inequality).
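The inequality can be checked numerically for an affine cost. The sketch below assumes g(k) = a + b*k with the example values a = 12, b = 2; the function names are hypothetical.

```python
def g(k, a=12, b=2):
    """Affine gap cost g(k) = a + b*k (a: opening cost, b: cost per position)."""
    return a + b * k

def saving_from_merging(k1, k2, a=12, b=2):
    """Cost saved by one indel of length k1 + k2 instead of two adjacent ones."""
    return g(k1, a, b) + g(k2, a, b) - g(k1 + k2, a, b)

# The saving always equals the opening cost a, whatever the two lengths are,
# so splitting an indel inside a gap interval never pays off when a >= 0.
for k1 in range(1, 6):
    for k2 in range(1, 6):
        assert saving_from_merging(k1, k2) == 12
```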


Gap Graph Vertices

2. Create minimal tree coverings:

For each gap interval, find the minimal number of subtrees that have gaps in all leaves and together cover all gaps.


Each vertex represents:

a) subtree with gaps in all leaves

b) region of alignment
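A covering of this kind can be sketched for a single gap interval as below. This is a simplification and not the paper's procedure: it assumes the tree has been rooted, so that "subtree" means a node with everything below it, whereas the talk works on an unrooted phylogeny. The names `minimal_covering`, `tree`, and `gapped` are hypothetical.

```python
def minimal_covering(tree, root, gapped):
    """Return the roots of the maximal subtrees whose leaves are all gapped.

    tree maps internal nodes to tuples of children; nodes absent from tree
    are leaves. gapped is the set of leaves with a gap in this interval.
    """
    cover = []

    def visit(node):
        # Returns True if every leaf below node has a gap.
        children = tree.get(node, ())
        if not children:
            return node in gapped
        results = [(c, visit(c)) for c in children]
        if all(full for _, full in results):
            return True
        # node is not fully gapped, so its fully gapped children are maximal.
        cover.extend(c for c, full in results if full)
        return False

    if visit(root):
        cover.append(root)
    return cover

tree = {"r": ("x", "y"), "x": ("1", "2"), "y": ("3", "z"), "z": ("4", "5")}
print(sorted(minimal_covering(tree, "r", {"1", "2", "4"})))
```

Each returned node would correspond to one gap graph vertex for that region of the alignment.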


Gap Graph Connections

3. Create a connection between vertices v and w if they represent neighboring gaps.


v → w : all v’s gaps continue in w


(a special-case connection exists; see paper)


Interpreting a Gap Graph Vertex

A vertex is a potential indel: one indel could have created all gaps in the subtree.

Either one indel created all gaps in the subtree (vertex confirmed), ..

.. or the vertex is decomposed into several indels (further ‘down’ in the tree).

Goal: confirm or decompose vertices with respect to the gap cost function.


Theory Needed Here..


We Need Optimality Proof

A gap graph may be huge, thus representing an enormous number of potential indels. We need to show two things:

P1: that all optimal indels are represented in the gap graph;

P2: how to ‘resolve the graph’ to determine the set of optimal indels.

P1 is proved directly in the paper (Theorem 1).


Resolving the Gap Graph

To determine the optimal set of indels, we need to reduce the potentially huge graph while keeping the optimal solution!

Theorem 2 and a set of lemmas that follow from it serve this purpose by identifying certain local graph configurations that can be reduced.

Preprocess the gap graph (perform local reductions) by applying the lemmas.


Preprocessing Earlier Example

Iteratively apply lemmas to reduce the graph..


Solving Earlier Example

After preprocessing: resolve the remaining graph by checking all combinations

decompose


Placing indels in the tree:


After Local Preprocessing

• In longer examples there will be many undecided vertices (purple) after preprocessing.

• Find possible decompositions for each vertex and check all combinations in each chain – number of combinations exponential in chain length
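The exhaustive check over a chain can be sketched as a brute-force search. This is only an illustration of why the work grows exponentially: the per-vertex options and the cost function are supplied by the caller, and `best_combination`, `options_per_vertex`, and `chain_cost` are hypothetical names, not the paper's interface.

```python
from itertools import product

def best_combination(options_per_vertex, chain_cost):
    """Try every combination of per-vertex choices; return (cost, combination).

    options_per_vertex holds one list of choices per undecided vertex
    (for example 'confirm' plus each possible decomposition).
    chain_cost scores a complete combination for the whole chain.
    """
    return min(
        ((chain_cost(combo), combo) for combo in product(*options_per_vertex)),
        key=lambda pair: pair[0],
    )

# Toy chain of three vertices with two choices each: 2**3 = 8 combinations.
options = [["confirm", "decompose"]] * 3
toy_price = {"confirm": 14, "decompose": 28}  # made-up costs for illustration
cost, combo = best_combination(options, lambda c: sum(toy_price[x] for x in c))
print(cost, combo)
```

With m choices per vertex and a chain of length n, the loop visits m**n combinations, which is why the local reductions applied beforehand matter.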


Execution Times..?

Worst-case: exponential.

Average times for random alignments with 60% gaps:


60% gaps is a lot..


Real Genome Analysis

B.ES.89.S61K15, B.FR.83.HXB2, B.GA.88.OYI, B.GB.83.CAM1, B.NL.86.3202A21, B.TW.94.TWCYS, B.US.86.AD87,

B.US.84.NY5CG, and B.US.83.SF2

Nine HIV-1 subtypes from the Los Alamos HIV database (tree constructed with Quicktree).

Length: 9868. Running Time: 1 sec


Conclusions

• Concise way of representing alignment gaps

• Theoretically sound framework to prove optimality

• Graph reductions lead to fast resolution