an introduction to bioinformatics algorithms www ...michaluz/seminar/lecture4.1.smawk.pdf · an...
TRANSCRIPT
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Outline
• Review: sequence alignment in sub-quadratic
time for unrestricted Scoring Schemes.
• Discuss dynamic programming special properties:
convexity/concavity.
• SMAWK algorithm for computing the row/column
minima/maxima of a Totally Monotone nxm matrix
in O(n).
• Four Russians algorithm for sub-quadratic
sequence alignment under discrete scoring
schemes
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Alignment Graph Alignment Matrix
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
3
Computing the Optimal Global Alignment Value
c
a
c
t
a a c g a c g a0
1
1 2 3 4 5 6 7 8
2
3
4
a
g
a
g5
6
7
c
|B|= n
|A|= n
t9
9
8
0 1 2 3 4 5 6 8 7 9
Classical Dynamic Programming: O(n )
Score of = 1
Score of = -1
2
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
4
The O(n ) time, Classical Dynamic Programming Algorithm2
c
a
c
t
a a c g a c g a0
1
1 2 3 4 5 6 7 8
2
3
4
a
g
a
g5
6
7
c
|B|= n
|A|
= n
t9
9
8
0 1 2 3 4 5 6 8 7 9
The Alignment Graph
I1
I2 I3
O
O = max(I + edge[I ,O])x
x = 1
3
x
A: a c g
B: a c g
A: a c g a
B: a c g a
A: a c g a
B: a c g
A: a c g
B: a c g a
Can the quadratic complexity of the optimal alignment value
computation be reduced without relaxing the problem?
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Is it Possible to Align Sequences in
Subquadratic Time?
• Dynamic Programming takes O(n2) for global
alignment
• Can we do better? O (h n2 /log n), h <= 1.
Landau,Crochemore&Ziv-ukelson 2003
Techniques:
(1) Compress the sequences.
(2) Utilize the Total Monotonicity of DIST.
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
6
ctacg
ag
a
a ta a c g a c g
c
O(h n / log n) rows of n vertices +
O(h n / log n) columns of n vertices
ctacg
ag
a
a a c g a c g
c
O(n ) vertices2
a t
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
7
g
a
ga c
I4 I5 I6I3
I2
I1
O4
Standard, single-cell DP
I1
I2 I3
O
O = max(I + edge[I ,O])x
x = 1
3
x
O = max(I + DIST[x,3])x 4 x = 0
6
New, extended-cell DP
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
8
g
a
ga c
I4 I5 I6I3
I2
I1
O4
a a c g a c gctacg
ag
a
a t1
1 2 3 4 5
2
3
4
5
6
c
I1 DIST[1,4] I1+DIST[1,4]
I2 DIST[2,4] I2+DIST[2,4]
I3 DIST[3,4] I3+DIST[3,4]
I4 DIST[4,4] I4+DIST[4,4]
I5 DIST[5,4] I5+DIST[5,4]
I6 DIST[6,4] I6+DIST[6,4]
Computing the score for Output Border Vertex O4
O = max(I + DIST[x,3])x 4 x = 0
6
O4
DIST[x,4]
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
9
I1 = 1
I2 = 2
I3 = 3
I4 =
I5 = 1
I6 =
OUT[x,j] = Ix + DIST[x,j]
Input I\ DIST Matrix
Output vector O
How to compute the
column maxima
of OUT in O(t) time ?
(Utilize the
Total Monotonicity
Property of OUT).
How to obtain the DIST
for G in O(t) time ?
(Take advantage of the
incremental nature of
LZ78 parsing).
The Main Challenges
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
10
3/4
5/2
3/2
left prefix (5,2)
diagonal
prefix (3,2)
top
prefix(3,4)
block G (5,4)
g
a
ga cga c
g
a
a c
a
a c
a
ctacg
ag
a
a t
5/4
1
1 2 3 4 5
2
3
4
5
6
a a c g a c g
c
Accessing a Prefix Block in Constant time.
a c
Trie for A
0
13
5
2
t
4
gg
a
c
g
g
Trie for B
0
31
2
46
5c
t
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
11
I1 = 1
I2 = 2
I3 = 3
I4 =
I5 = 1
I6 =
OUT[x,j] = Ix + DIST[x,j]
Input I\ DIST Matrix
Output vector O
How to compute the
column maxima
of OUT in O(t) time ?
(Utilize the
Total Monotonicity
Property of OUT).
How to obtain the DIST
for G in O(t) time ?
(Take advantage of the
incremental nature of
LZ78 parsing).
The Main Challenges
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
12
a
(0,0)
I
O
b
c d
The Total Monotonicity Property [Aggarwal et al 1987].
For any a < b and c < d
OUT[b,c] >= OUT[a,c] ===> OUT[b,d] >= OUT[a,d]
g
a
ga c
I4 I5 I6I3
I2
I1
O4
a a c g a c gctacg
ag
a
a t1
1 2 3 4 5
2
3
4
5
6
c
I1 DIST[1,4] I1+DIST[1,4]
I2 DIST[2,4] I2+DIST[2,4]
I3 DIST[3,4] I3+DIST[3,4]
I4 DIST[4,4] I4+DIST[4,4]
I5 DIST[5,4] I5+DIST[5,4]
I6 DIST[6,4] I6+DIST[6,4]
Computing the score for Output Border Vertex O4
O = max(I + DIST[x,3])x 4 x = 0
6
O4
I1 = 1
I2 = 2
I3 = 3
I4 =
I5 = 1
I6 =
OUT[x,j] = Ix + DIST[x,j]
Input I\ DIST Matrix
Output vector O
How to compute the
column maxima
of OUT in O(t) time ?
(Utilize the
Total Monotonicity
Property of OUT).
The Main Challenges
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
13
a
b
d
OUT Matrix
How does Total Monotonicity affect Column Maxima behavior?
For all a <b and c < d, OUT[a,c] <= OUT[b,c] OUT[a,d] <= OUT[b,d]
Column maxima row indices are monotonically non-decreasing.
SMAWK Matrix Searching[Aggarwal et-al 87] .
The t column maxima of a Totally Monotone array
can be computed in O(t) time, by querying only O(t) elements.
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Outline
• Review: sequence alignment in sub-quadratic time
for unrestricted Scoring Schemes.
• Discuss dynamic programming special
properties: convexity/concavity.
• SMAWK algorithm for computing the
row/column minima/maxima of a Totally
Monotone nxm matrix in O(n).
• Four Russians algorithm for sub-quadratic
sequence alignment under discrete scoring
schemes
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
11-15
Elements of Dynamic Programming
• For dynamic programming to be applicable, an
optimization problem must have:
1. Optimal substructure
• An optimal solution to the problem contains within it
optimal solution to subproblems
2. Overlapping subproblems
• The space of subproblems must be small; i.e., the
same subproblems are encountered over and over
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Features of a Dynamic Programming
Algorithm:
(1) A table (where we store optimal costs of subproblems).
(2) The entry dependency of the table (given by the recurrence relation).
When we are interested in the design of efficient algorithms for dynamic
Programming, a third feature emerges:
(3) The order to fill in the table (the algorithm).
Some times one can use some properties of the problem at hand to
change the order in which (3) is computed to obtain better algorithms.
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Convexity/Concavity Properties of the
DP Table A
(1) Monge (convex) [Monge 1781]
A[a, c] + A[b, d ] > = A[b, c] + A[a, d] for all a< b and c< d
Monge (concave)
A[a, c] + A[b, d ] < = A[b, c] + A[a, d] for all a< b and c< d
(2) Total Monotonicity (convex)
A[a, c] > = A[b, c ] => A[a,d] > = A[b, d] for all a< b and c< d
Total Monotonicity (concave)
A[a, c] < = A[b, c ] => A[a,d] < = A[b, d] for all a< b and c< d
a
b
c d
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
The Total Monotonicity Property A[a, c] < = A[b, c ] => A[a,d] ] < = A[b, d]
(1) matrix A is Monotone if , for any column d > c,
Column Maximum [d] >= Colum Maximum [c]
(2) matrix A is Totally Monotone if every 2x2 submatrix is Monotone.
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
b
d
OUT Matrix
How does Total Monotonicity affect Column Maxima behavior?
For all a <b and c < d, OUT[a,c] <= OUT[b,c] OUT[a,d] <= OUT[b,d]
Column maxima row indices are monotonically non-decreasing.
SMAWK Matrix Searching[Aggarwal et-al 87] .
The t column maxima of a Totally Monotone array
can be computed in O(t) time, by querying only O(t) elements.
m
n
This statement holds for every 2x2 submatrix.
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
b
d
OUT Matrix
How does Total Monotonicity affect Column Maxima behavior?
For all a <b and c < d, OUT[a,c] <= OUT[b,c] OUT[a,d] <= OUT[b,d]
Column maxima row indices are monotonically non-decreasing.
SMAWK Matrix Searching[Aggarwal et-al 87] .
The t column maxima of a Totally Monotone array
can be computed in O(t) time, by querying only O(t) elements.
m
n
This statement holds for every 2x2 submatrix.
G[k,k] > G[k+1,k] and k < n
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
SMAWK Matrix Searching[Aggarwal et-al 87] .
The n column maxima of a Totally Monotone array
can be computed in O(n) time, by querying only O(n) elements.
The hearth of the algorithm is the subroutine REDUCE.
It takes as input an n x m totally monotone matrix A with
m < n and returns an m x m matrix G which is a submatrix
of A such that G contains the rows of A which carry
the column maxima of A.
A. Aggarwal, M. Klawe, S. Moran, P. Shor and R. Wilber
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
n
m1
1
k
k
G <-- A
k <-- 1
while the number of rows of G is greater than m do
begin
case:
a: G[k,k] > G[k+1,k] and k < n: k <-- k+1
b: G[k,k] > G[k+1,k] and k = m: delete row k+1
c: G[k,k] <= G[k+1,k]: delete row k; k <-- max(1, k-1)
endcase
end
return(G);
G[k,k]
G[k+1,k]
Procedure REDUCE(A, n):
input: an n x m totally monoton matrix A with m <= n.
output: an m x m matrix G which is a submatrix of A
such that G contains only the rows of A which
carry the column maxima of A.
Invariant: G[k, 1.. k-1] is dead by total
monotonicity.
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
SMAWK Matrix Searching[Aggarwal et-al 87] .
The n column maxima of a Totally Monotone array
can be computed in O(n) time, by querying only O(n) elements.
Procedure MaxColCompute(A):
G<-- REDUCE(A)
if (n = 1) then output the maximum and return
P <- {c2, c4,...c[n/2]} of G
MaxColCompute(P)
from the positions (now known) of the maxima
in the even columns of G,
find the maxima of its odd columns.
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
MaxColCompute(OUT = c1,c2,c3,c4,c5,c6)
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
MaxColCompute(OUT = c1,c2,c3,c4,c5,c6)MaxColCompute(A= c1,c2,c3,c4,c5,c6)
Reduce: A is 6x6, the number of rows is not
greater than m… return
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
MaxColCompute(OUT = c1,c2,c3,c4,c5,c6)
P the even columns of A
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
MaxColCompute(OUT = c1,c2,c3,c4,c5,c6)MaxColCompute(A= c2,c4,c6)
Reduce: A is 6x3, the number of rows is
greater than m.
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Reduce(OUT[c4], n = 1)
3
0
MaxColCompute(OUT[c4])
MaxColCompute(n = 3)
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
MaxColCompute(n=6)
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
n
m1
1
k
k
G <-- A
k <-- 1
while the number of rows of G is greater than m do
begin
case:
a: G[k,k] > G[k+1,k] and k < n: k <-- k+1
b: G[k,k] > G[k+1,k] and k = m: delete row k+1
c: G[k,k] <= G[k+1,k]: delete row k; k <-- max(1, k-1)
endcase
end
return(G);
G[k,k]
G[k+1,k]
Procedure REDUCE(A, n):
input: an n x m totally monoton matrix A with m <= n.
output: an m x m matrix G which is a submatrix of A
such that G contains only the rows of A which
carry the column maxima of A.
Time Analysis: Procedure Reduce takes time O(n)
An Introduction to Bioinformatics Algorithms www.bioalgorithms.infoTime Analysis.Let T(n,m) be the time taken by MaxColCompute for an n x m matrix.
The call to Reduce takes time O(n).
Notice that P is an m x m/2 totally monotone matrix,
so the recursive call takes time T(m,m/2).
Once the positions of the maxima in the even rows of G have
been found, the maximum in each odd column is restricted to the
interval of maximum positions of the neighboring even columns.
Thus, finding all maxima in the odd columns can be done in
O(m) time.
For some constants c1 and c2, the time complexity satisfies
T(n,m) <= c1n + c2m + T(m, m/2)
which gives the solution
T(n,m) <= 2(c1+ c2)m + c1n = O(n) , since m<=n.