Post on 21-Dec-2015
A Large Version of the Small Parsimony Problem
Optimally reconstruct ancestral sequences given
- unrooted phylogeny (hence ‘small’ parsimony problem)
- multiple alignment
- affine gap cost function
Jakob Fredslund*, Jotun Hein**, Tejs Scharling*
* Bioinformatics Research Center, Aarhus University, Denmark
** Department of Statistics, University of Oxford, United Kingdom
2
Overview
• Introduction
• Examples
• Gap graph construction
• Theory
• Results
• Conclusions
3
Small Parsimony, No Gaps
Algorithm due to Fitch-Hartigan-Sankoff (FHS): going up the tree, calculate N(A, C, G, T) in each node (the minimal cost of the subtree rooted at this node with nucleotide X in the root); then backtrack going down.
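The up/down passes can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes a rooted tree written as nested tuples (an unrooted tree would first be rooted at an arbitrary node), a unit cost per substitution, and a single alignment column. The names `sankoff_up` and `sankoff_down` are made up for this sketch.

```python
NUCS = "ACGT"
INF = float("inf")

def sankoff_up(node):
    """Going up: return (cost, kids), where cost[x] is the minimal cost of the
    subtree rooted at this node with nucleotide x in the root."""
    if isinstance(node, str):  # leaf: only the observed nucleotide is allowed
        return {x: (0 if x == node else INF) for x in NUCS}, []
    kids = [sankoff_up(child) for child in node]
    # Assumed unit substitution cost: (x != y) adds 1 when the nucleotide changes.
    cost = {x: sum(min(c[y] + (x != y) for y in NUCS) for c, _ in kids)
            for x in NUCS}
    return cost, kids

def sankoff_down(table, parent_nuc=None):
    """Going down: pick the cheapest nucleotide in each node given its parent."""
    cost, kids = table
    pick = min(NUCS, key=lambda x: cost[x]
               + (parent_nuc is not None and x != parent_nuc))
    return pick if not kids else (pick, [sankoff_down(k, pick) for k in kids])

# One alignment column on a hypothetical four-leaf tree.
tree = (("A", "C"), ("A", "G"))
table = sankoff_up(tree)
assignment = sankoff_down(table)
```

Here `min(table[0].values())` is the parsimony cost of the column, and `assignment` holds the chosen nucleotide for every node.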
4
Small Parsimony, Large Version
1: ac-a---gattc
2: acgac---atcc
3: gc-----gagcc
4: -agacttgt---
5: aagtcttagt-c
g(k) = 12 + 2*k
(note: alignment is given)
5
Two Steps
1) Find optimal set of indels to explain gaps
2) Assign nucleotides optimally (FHS)
So: focus on indels
6
Tracing Evolution
What events could explain this alignment?
cagtta
gcag--a
-cagtta
-cag--a
-ctg--a
7
Tracing Evolution
cagtta
cagtta
8
Tracing Evolution
cagtta caga
cagtta
cag--a
cagtta
9
Tracing Evolution
cagtta caga
ctga
cagtta
cag--a
ctg--a
caga
10
Tracing Evolution
cagtta caga
ctga
gcag--a
cagtta
cag--a
ctg--a
-cagtta
-cag--a
-ctg--a
gcaga
11
Indels Affect Full Subtrees
cagtta caga
ctga
gcaga
gcag--a
-cagtta
-cag--a
-ctg--a
All sequences in right subtree have gaps in blue indel’s position
12
Indels Affect Full Subtrees
cagtta caga
ctga
gcaga
gcag--a
-cagtta
-cag--a
-ctg--a
All sequences in left subtree have gaps in green indel’s position
13
Direction of Evolution?
cagtta caga
ctga
gcaga
gcag--a
-cagtta
-cag--a
-ctg--a
deletion of tt
14
Direction of Evolution?
cagtta caga
ctga
gcaga
gcag--a
-cagtta
-cag--a
-ctg--a
insertion of tt
15
Direction of Evolution?
cagtta caga
ctga
gcaga
gcag--a
-cagtta
-cag--a
-ctg--a
Since we don’t know the direction, we refer to insertions/deletions as indels. And remember: an indel creates gaps in a full subtree.
16
Explaining Gaps With Indels
g(k) = a + bk
(Anonymous nucleotides denoted by n)
17
Explaining Gaps With Indels
g(k) = a + bk
Cost with two indels of length 2: 2*(a+2b)
18
Explaining Gaps With Indels
g(k) = a + bk
Cost with two indels of length 2: 2*(a+2b)
Cost with three indels of length 1: 3*(a+b)
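A quick numeric check of the two candidate costs, assuming the parameters a = 12 and b = 2 from the earlier slide (g(k) = 12 + 2*k). The function name `g` follows the slides; the variable names are only for illustration.

```python
def g(k, a=12, b=2):
    """Affine gap cost: a to open an indel, b per gap position."""
    return a + b * k

two_indels_of_length_2 = 2 * g(2)    # 2*(a+2b)
three_indels_of_length_1 = 3 * g(1)  # 3*(a+b)

# In general 2*(a+2b) < 3*(a+b) exactly when b < a,
# i.e. when opening an indel costs more than extending one.
cheaper = min(two_indels_of_length_2, three_indels_of_length_1)
```

With these parameters the two explanations cost 32 and 42, so the one with fewer, longer indels wins.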
19
Larger Example
N8, N9, N10, N11, N12, N13: ??? Complex problem! (We are not aware of any upper time bound.)
20
Gap Graph Construction
Represent in a concise way all gaps and how they are connected: in a graph.
21
Gap Intervals
1. Find gap intervals.
No optimal indel ‘stops’ in the middle of a gap interval:
it is cheaper to extend the indel making the first gap than to open a new one.
(by triangle inequality)
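A minimal sketch of step 1. The slides do not spell out the definition, so this assumes that a gap interval is a maximal run of alignment columns in which exactly the same set of sequences has gaps. The function name `gap_intervals` is hypothetical.

```python
def gap_intervals(alignment):
    """Return (first_col, last_col, gapped_rows) for each maximal run of
    columns sharing the same non-empty set of gapped rows (assumed definition)."""
    patterns = [frozenset(i for i, row in enumerate(alignment) if row[col] == "-")
                for col in range(len(alignment[0]))]
    intervals, start = [], 0
    for col in range(1, len(patterns) + 1):
        # Close the current run at the end of the alignment or when the pattern changes.
        if col == len(patterns) or patterns[col] != patterns[start]:
            if patterns[start]:  # skip runs without any gaps
                intervals.append((start, col - 1, patterns[start]))
            start = col
    return intervals

# The four-sequence alignment from the "Tracing Evolution" example.
rows = ["gcag--a", "-cagtta", "-cag--a", "-ctg--a"]
found = gap_intervals(rows)
```

For this alignment the sketch reports one interval at column 0 (rows 1, 2, 3 gapped) and one at columns 4-5 (rows 0, 2, 3 gapped), with 0-based indices.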
23
Gap Graph Vertices
2. Create minimal tree coverings:
For each gap interval, find minimal number of subtrees with gaps in all leaves, covering all gaps
Each vertex represents:
a) subtree with gaps in all leaves
b) region of alignment
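One way to compute such a covering for a single gap interval, as a sketch under assumptions: the unrooted tree is an adjacency dictionary, a "subtree" is everything on one side of an edge, and not every leaf is gapped in the interval. Under those assumptions the maximal all-gap subtrees are disjoint, so they form a minimal covering. The node names and function names are invented for the illustration.

```python
def leaves_behind(tree, parent, node):
    """Leaves reachable from `node` without going back through `parent`."""
    children = [n for n in tree[node] if n != parent]
    if not children:
        return frozenset([node])
    return frozenset().union(*(leaves_behind(tree, node, c) for c in children))

def minimal_covering(tree, gapped):
    """Maximal subtrees (as leaf sets) whose leaves all have gaps."""
    candidates = set()
    for u in tree:
        for v in tree[u]:  # the subtree on v's side of edge (u, v)
            side = leaves_behind(tree, u, v)
            if side <= gapped:
                candidates.add(side)
    # Keep only subtrees not strictly contained in another all-gap subtree.
    return [s for s in candidates if not any(s < t for t in candidates)]

# Hypothetical unrooted tree: leaves s1..s4, internal nodes x and y.
tree = {"s1": ["x"], "s2": ["x"], "s3": ["y"], "s4": ["y"],
        "x": ["s1", "s2", "y"], "y": ["s3", "s4", "x"]}
one_subtree = minimal_covering(tree, frozenset({"s2", "s3", "s4"}))
two_subtrees = minimal_covering(tree, frozenset({"s1", "s3"}))
```

Each returned leaf set, together with its gap interval, would correspond to one gap graph vertex.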
32
Gap Graph Connections
3. Create connection between vertices v and w if they represent neighboring gaps.
v → w : all v’s gaps continue in w
(a special-case connection exists; see paper)
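Step 3 might be sketched like this. The reading is an assumption: two vertices are taken to be neighbors when their column ranges are adjacent, and v → w is added when every leaf of v's subtree is also in w's subtree. The special-case connection mentioned above is not modeled. `Vertex` and `connections` are hypothetical names.

```python
from typing import NamedTuple

class Vertex(NamedTuple):
    start: int         # first alignment column of the gap interval
    end: int           # last alignment column of the gap interval
    leaves: frozenset  # leaves of the all-gap subtree

def connections(vertices):
    """Directed edges (v, w): neighboring regions, and all of v's gaps continue in w."""
    edges = []
    for v in vertices:
        for w in vertices:
            if v is w:
                continue
            adjacent = v.end + 1 == w.start or w.end + 1 == v.start
            if adjacent and v.leaves <= w.leaves:
                edges.append((v, w))
    return edges

v = Vertex(0, 1, frozenset({"s3"}))
w = Vertex(2, 4, frozenset({"s3", "s4"}))
u = Vertex(7, 8, frozenset({"s3"}))  # not adjacent to either of the others
edges = connections([v, w, u])
```

Only v → w is created here: w's gaps in s4 do not continue in v, and u is separated from the other two by ungapped columns.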
37
Interpreting a Gap Graph Vertex
A vertex is a potential indel: one indel could have created all gaps in the subtree.
Either one indel created all gaps in the subtree (vertex confirmed), ..
38
Interpreting a Gap Graph Vertex
.. or the vertex is decomposed into several indels (further ‘down’ in the tree).
Goal: confirm or decompose vertices with respect to the gap cost function.
43
Theory Needed Here..
44
We Need Optimality Proof
A gap graph may be huge, thus representing an enormous
number of potential indels. We need to show two things:
P1: that all optimal indels are represented in the gap graph;
P2: how to ‘resolve the graph’ to determine the set of optimal indels.
P1 proved directly in paper (Theorem 1).
45
Resolving the Gap Graph
To determine the optimal set of indels, we need to reduce a potentially huge graph while keeping the optimal solution!
Theorem 2 and a set of following lemmas serve this purpose by
identifying certain local graph configurations that can be reduced.
Preprocess gap graph (perform local reductions) by applying lemmas.
46
Preprocessing Earlier Example
Iteratively apply lemmas to reduce the graph..
50
Solving Earlier Example
After preprocessing: resolve remaining graph by checking all combinations
decompose
51
Solving Earlier Example
Placing indels in the tree:
52
After Local Preprocessing
• In longer examples there will be many undecided vertices (purple) after preprocessing.
• Find possible decompositions for each vertex and check all combinations in each chain – number of combinations exponential in chain length
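The brute-force step can be illustrated with a toy model. Everything here is an assumption made for the sketch, not the procedure from the paper: each vertex in a chain offers a list of options (a set of subtrees that explain its gaps), and an indel on a subtree pays the opening cost only if that subtree was not already in use in the previous vertex of the chain. The parameters reuse g(k) = 12 + 2*k from the earlier example.

```python
from itertools import product

A, B = 12, 2  # g(k) = A + B*k, values taken from the earlier example

def chain_cost(choice, lengths):
    """Toy cost of one combination: B per gap column for every subtree used,
    plus A whenever a subtree was not already open in the previous vertex."""
    total, previous = 0, frozenset()
    for subtrees, length in zip(choice, lengths):
        for s in subtrees:
            total += B * length + (0 if s in previous else A)
        previous = subtrees
    return total

def resolve_chain(options, lengths):
    """Check all combinations; their number is the product of the option counts,
    hence exponential in the chain length."""
    return min(product(*options), key=lambda c: chain_cost(c, lengths))

T = frozenset({"s3", "s4"})
S3, S4 = frozenset({"s3"}), frozenset({"s4"})
# Vertex 1 (2 columns): confirm T, or decompose into s3 and s4.
# Vertex 2 (3 columns): only s3 has gaps.
options = [[frozenset({T}), frozenset({S3, S4})], [frozenset({S3})]]
lengths = [2, 3]
best = resolve_chain(options, lengths)
best_cost = chain_cost(best, lengths)
```

In this toy chain, confirming the first vertex costs g(2) + g(3) = 34, while decomposing it costs g(5) + g(2) = 38, so the search keeps the confirmed vertex.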
53
Execution Times..?
Worst-case: exponential.
Average times for random alignments with 60% gaps:
54
60% gaps is a lot..
55
Real Genome Analysis
B.ES.89.S61K15, B.FR.83.HXB2, B.GA.88.OYI, B.GB.83.CAM1, B.NL.86.3202A21, B.TW.94.TWCYS, B.US.86.AD87,
B.US.84.NY5CG, and B.US.83.SF2
Nine HIV-1 subtypes from the Los Alamos HIV database (tree constructed with Quicktree).
Length: 9868. Running Time: 1 sec
56
Conclusions
• Concise way of representing alignment gaps
• Theoretically sound framework in which optimality is proved
• Graph reductions lead to fast resolution