an introduction to bioinformatics algorithms www ...michaluz/seminar/lecture4.1.smawk.pdf · an...

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

Outline

• Review: sequence alignment in sub-quadratic

time for unrestricted Scoring Schemes.

• Discuss dynamic programming special properties:

convexity/concavity.

• SMAWK algorithm for computing the row/column

minima/maxima of a Totally Monotone nxm matrix

in O(n).

• Four Russians algorithm for sub-quadratic

sequence alignment under discrete scoring

schemes


Alignment Graph Alignment Matrix


3

Computing the Optimal Global Alignment Value

c

a

c

t

a a c g a c g a0

1

1 2 3 4 5 6 7 8

2

3

4

a

g

a

g5

6

7

c

|B|= n

|A|= n

t9

9

8

0 1 2 3 4 5 6 8 7 9

Classical Dynamic Programming: O(n )

Score of = 1

Score of = -1

2


4

The O(n ) time, Classical Dynamic Programming Algorithm2

c

a

c

t

a a c g a c g a0

1

1 2 3 4 5 6 7 8

2

3

4

a

g

a

g5

6

7

c

|B|= n

|A|

= n

t9

9

8

0 1 2 3 4 5 6 8 7 9

The Alignment Graph

I1

I2 I3

O

O = max(I + edge[I ,O])x

x = 1

3

x

A: a c g

B: a c g

A: a c g a

B: a c g a

A: a c g a

B: a c g

A: a c g

B: a c g a

Can the quadratic complexity of the optimal alignment value

computation be reduced without relaxing the problem?


Is it Possible to Align Sequences in

Subquadratic Time?

• Dynamic Programming takes O(n2) for global

alignment

• Can we do better? O (h n2 /log n), h <= 1.

Landau,Crochemore&Ziv-ukelson 2003

Techniques:

(1) Compress the sequences.

(2) Utilize the Total Monotonicity of DIST.


6

ctacg

ag

a

a ta a c g a c g

c

O(h n / log n) rows of n vertices +

O(h n / log n) columns of n vertices

ctacg

ag

a

a a c g a c g

c

O(n ) vertices2

a t


7

g

a

ga c

I4 I5 I6I3

I2

I1

O4

Standard, single-cell DP

I1

I2 I3

O

O = max(I + edge[I ,O])x

x = 1

3

x

O = max(I + DIST[x,3])x 4 x = 0

6

New, extended-cell DP


8

g

a

ga c

I4 I5 I6I3

I2

I1

O4

a a c g a c gctacg

ag

a

a t1

1 2 3 4 5

2

3

4

5

6

c

I1 DIST[1,4] I1+DIST[1,4]






Computing the score for Output Border Vertex O4

O = max(I + DIST[x,3])x 4 x = 0

6

O4

DIST[x,4]


9

I1 = 1

I2 = 2

I3 = 3

I4 =

I5 = 1

I6 =

OUT[x,j] = Ix + DIST[x,j]

Input I\ DIST Matrix

Output vector O

How to compute the

column maxima

of OUT in O(t) time ?

(Utilize the

Total Monotonicity

Property of OUT).

How to obtain the DIST

for G in O(t) time ?

(Take advantage of the

incremental nature of

LZ78 parsing).

The Main Challenges


10

3/4

5/2

3/2

left prefix (5,2)

diagonal

prefix (3,2)

top

prefix(3,4)

block G (5,4)

g

a

ga cga c

g

a

a c

a

a c

a

ctacg

ag

a

a t

5/4

1

1 2 3 4 5

2

3

4

5

6

a a c g a c g

c

Accessing a Prefix Block in Constant time.

a c

Trie for A

0

13

5

2

t

4

gg

a

c

g

g

Trie for B

0

31

2

46

5c

t


11

I1 = 1

I2 = 2

I3 = 3

I4 =

I5 = 1

I6 =



Output vector O

How to compute the

column maxima


(Utilize the

Total Monotonicity

Property of OUT).

How to obtain the DIST

for G in O(t) time ?

(Take advantage of the

incremental nature of

LZ78 parsing).

The Main Challenges


12

a

(0,0)

I

O

b

c d

The Total Monotonicity Property [Aggarwal et al 1987].

For any a < b and c < d

OUT[b,c] >= OUT[a,c] ===> OUT[b,d] >= OUT[a,d]

g

a

ga c

I4 I5 I6I3

I2

I1

O4

a a c g a c gctacg

ag

a

a t1

1 2 3 4 5

2

3

4

5

6

c







Computing the score for Output Border Vertex O4

O = max(I + DIST[x,3])x 4 x = 0

6

O4

I1 = 1

I2 = 2

I3 = 3

I4 =

I5 = 1

I6 =



Output vector O

How to compute the

column maxima


(Utilize the

Total Monotonicity

Property of OUT).

The Main Challenges


13

a

b

d

OUT Matrix

How does Total Monotonicity affect Column Maxima behavior?

For all a <b and c < d, OUT[a,c] <= OUT[b,c] OUT[a,d] <= OUT[b,d]

Column maxima row indices are monotonically non-decreasing.

SMAWK Matrix Searching[Aggarwal et-al 87] .

The t column maxima of a Totally Monotone array

can be computed in O(t) time, by querying only O(t) elements.


Outline

• Review: sequence alignment in sub-quadratic time

for unrestricted Scoring Schemes.

• Discuss dynamic programming special

properties: convexity/concavity.

• SMAWK algorithm for computing the

row/column minima/maxima of a Totally

Monotone nxm matrix in O(n).

• Four Russians algorithm for sub-quadratic

sequence alignment under discrete scoring

schemes


11-15

Elements of Dynamic Programming

• For dynamic programming to be applicable, an

optimization problem must have:

1. Optimal substructure

• An optimal solution to the problem contains within it

optimal solution to subproblems

2. Overlapping subproblems

• The space of subproblems must be small; i.e., the

same subproblems are encountered over and over


Features of a Dynamic Programming

Algorithm:

(1) A table (where we store optimal costs of subproblems).

(2) The entry dependency of the table (given by the recurrence relation).

When we are interested in the design of efficient algorithms for dynamic

Programming, a third feature emerges:

(3) The order to fill in the table (the algorithm).

Some times one can use some properties of the problem at hand to

change the order in which (3) is computed to obtain better algorithms.


Convexity/Concavity Properties of the

DP Table A

(1) Monge (convex) [Monge 1781]

A[a, c] + A[b, d ] > = A[b, c] + A[a, d] for all a< b and c< d

Monge (concave)

A[a, c] + A[b, d ] < = A[b, c] + A[a, d] for all a< b and c< d

(2) Total Monotonicity (convex)

A[a, c] > = A[b, c ] => A[a,d] > = A[b, d] for all a< b and c< d

Total Monotonicity (concave)

A[a, c] < = A[b, c ] => A[a,d] < = A[b, d] for all a< b and c< d

a

b

c d


The Total Monotonicity Property A[a, c] < = A[b, c ] => A[a,d] ] < = A[b, d]

(1) matrix A is Monotone if , for any column d > c,

Column Maximum [d] >= Colum Maximum [c]

(2) matrix A is Totally Monotone if every 2x2 submatrix is Monotone.


b

d

OUT Matrix







m

n

This statement holds for every 2x2 submatrix.


b

d

OUT Matrix







m

n

This statement holds for every 2x2 submatrix.

G[k,k] > G[k+1,k] and k < n



The n column maxima of a Totally Monotone array

can be computed in O(n) time, by querying only O(n) elements.

The hearth of the algorithm is the subroutine REDUCE.

It takes as input an n x m totally monotone matrix A with

m < n and returns an m x m matrix G which is a submatrix

of A such that G contains the rows of A which carry

the column maxima of A.

A. Aggarwal, M. Klawe, S. Moran, P. Shor and R. Wilber


n

m1

1

k

k

G <-- A

k <-- 1

while the number of rows of G is greater than m do

begin

case:

a: G[k,k] > G[k+1,k] and k < n: k <-- k+1

b: G[k,k] > G[k+1,k] and k = m: delete row k+1

c: G[k,k] <= G[k+1,k]: delete row k; k <-- max(1, k-1)

endcase

end

return(G);

G[k,k]

G[k+1,k]

Procedure REDUCE(A, n):

input: an n x m totally monoton matrix A with m <= n.

output: an m x m matrix G which is a submatrix of A

such that G contains only the rows of A which

carry the column maxima of A.

Invariant: G[k, 1.. k-1] is dead by total

monotonicity.



The n column maxima of a Totally Monotone array

can be computed in O(n) time, by querying only O(n) elements.

Procedure MaxColCompute(A):

G<-- REDUCE(A)

if (n = 1) then output the maximum and return

P <- {c2, c4,...c[n/2]} of G

MaxColCompute(P)

from the positions (now known) of the maxima

in the even columns of G,

find the maxima of its odd columns.


MaxColCompute(OUT = c1,c2,c3,c4,c5,c6)


MaxColCompute(OUT = c1,c2,c3,c4,c5,c6)MaxColCompute(A= c1,c2,c3,c4,c5,c6)

Reduce: A is 6x6, the number of rows is not

greater than m… return


MaxColCompute(OUT = c1,c2,c3,c4,c5,c6)

P the even columns of A


MaxColCompute(OUT = c1,c2,c3,c4,c5,c6)MaxColCompute(A= c2,c4,c6)

Reduce: A is 6x3, the number of rows is

greater than m.


Reduce(OUT[c4], n = 1)

3

0

MaxColCompute(OUT[c4])

MaxColCompute(n = 3)


MaxColCompute(n=6)


n

m1

1

k

k

G <-- A

k <-- 1

while the number of rows of G is greater than m do

begin

case:

a: G[k,k] > G[k+1,k] and k < n: k <-- k+1

b: G[k,k] > G[k+1,k] and k = m: delete row k+1

c: G[k,k] <= G[k+1,k]: delete row k; k <-- max(1, k-1)

endcase

end

return(G);

G[k,k]

G[k+1,k]

Procedure REDUCE(A, n):

input: an n x m totally monoton matrix A with m <= n.

output: an m x m matrix G which is a submatrix of A

such that G contains only the rows of A which

carry the column maxima of A.

Time Analysis: Procedure Reduce takes time O(n)

An Introduction to Bioinformatics Algorithms www.bioalgorithms.infoTime Analysis.Let T(n,m) be the time taken by MaxColCompute for an n x m matrix.

The call to Reduce takes time O(n).

Notice that P is an m x m/2 totally monotone matrix,

so the recursive call takes time T(m,m/2).

Once the positions of the maxima in the even rows of G have

been found, the maximum in each odd column is restricted to the

interval of maximum positions of the neighboring even columns.

Thus, finding all maxima in the odd columns can be done in

O(m) time.

For some constants c1 and c2, the time complexity satisfies

T(n,m) <= c1n + c2m + T(m, m/2)

which gives the solution

T(n,m) <= 2(c1+ c2)m + c1n = O(n) , since m<=n.

an introduction to bioinformatics algorithms www ...michaluz/seminar/lecture4.1.smawk.pdf · an...

Documents