an introduction to bioinformatics algorithms www ...michaluz/seminar/lecture4.1.smawk.pdf · an...

31
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Outline Review: sequence alignment in sub-quadratic time for unrestricted Scoring Schemes. Discuss dynamic programming special properties: convexity/concavity. SMAWK algorithm for computing the row/column minima/maxima of a Totally Monotone nxm matrix in O(n). Four Russians algorithm for sub-quadratic sequence alignment under discrete scoring schemes

Upload: others

Post on 21-May-2020

6 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: An Introduction to Bioinformatics Algorithms www ...michaluz/seminar/lecture4.1.SMAWK.pdf · An Introduction to Bioinformatics Algorithms SMAWK Matrix Searching [Aggarwal et-al 87]

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

Outline

• Review: sequence alignment in sub-quadratic

time for unrestricted Scoring Schemes.

• Discuss dynamic programming special properties:

convexity/concavity.

• SMAWK algorithm for computing the row/column

minima/maxima of a Totally Monotone nxm matrix

in O(n).

• Four Russians algorithm for sub-quadratic

sequence alignment under discrete scoring

schemes

Page 2: An Introduction to Bioinformatics Algorithms www ...michaluz/seminar/lecture4.1.SMAWK.pdf · An Introduction to Bioinformatics Algorithms SMAWK Matrix Searching [Aggarwal et-al 87]

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

Alignment Graph Alignment Matrix

Page 3: An Introduction to Bioinformatics Algorithms www ...michaluz/seminar/lecture4.1.SMAWK.pdf · An Introduction to Bioinformatics Algorithms SMAWK Matrix Searching [Aggarwal et-al 87]

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

3

Computing the Optimal Global Alignment Value

c

a

c

t

a a c g a c g a0

1

1 2 3 4 5 6 7 8

2

3

4

a

g

a

g5

6

7

c

|B|= n

|A|= n

t9

9

8

0 1 2 3 4 5 6 8 7 9

Classical Dynamic Programming: O(n )

Score of = 1

Score of = -1

2

Page 4: An Introduction to Bioinformatics Algorithms www ...michaluz/seminar/lecture4.1.SMAWK.pdf · An Introduction to Bioinformatics Algorithms SMAWK Matrix Searching [Aggarwal et-al 87]

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

4

The O(n ) time, Classical Dynamic Programming Algorithm2

c

a

c

t

a a c g a c g a0

1

1 2 3 4 5 6 7 8

2

3

4

a

g

a

g5

6

7

c

|B|= n

|A|

= n

t9

9

8

0 1 2 3 4 5 6 8 7 9

The Alignment Graph

I1

I2 I3

O

O = max(I + edge[I ,O])x

x = 1

3

x

A: a c g

B: a c g

A: a c g a

B: a c g a

A: a c g a

B: a c g

A: a c g

B: a c g a

Can the quadratic complexity of the optimal alignment value

computation be reduced without relaxing the problem?

Page 5: An Introduction to Bioinformatics Algorithms www ...michaluz/seminar/lecture4.1.SMAWK.pdf · An Introduction to Bioinformatics Algorithms SMAWK Matrix Searching [Aggarwal et-al 87]

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

Is it Possible to Align Sequences in

Subquadratic Time?

• Dynamic Programming takes O(n2) for global

alignment

• Can we do better? O (h n2 /log n), h <= 1.

Landau,Crochemore&Ziv-ukelson 2003

Techniques:

(1) Compress the sequences.

(2) Utilize the Total Monotonicity of DIST.

Page 6: An Introduction to Bioinformatics Algorithms www ...michaluz/seminar/lecture4.1.SMAWK.pdf · An Introduction to Bioinformatics Algorithms SMAWK Matrix Searching [Aggarwal et-al 87]

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

6

ctacg

ag

a

a ta a c g a c g

c

O(h n / log n) rows of n vertices +

O(h n / log n) columns of n vertices

ctacg

ag

a

a a c g a c g

c

O(n ) vertices2

a t

Page 7: An Introduction to Bioinformatics Algorithms www ...michaluz/seminar/lecture4.1.SMAWK.pdf · An Introduction to Bioinformatics Algorithms SMAWK Matrix Searching [Aggarwal et-al 87]

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

7

g

a

ga c

I4 I5 I6I3

I2

I1

O4

Standard, single-cell DP

I1

I2 I3

O

O = max(I + edge[I ,O])x

x = 1

3

x

O = max(I + DIST[x,3])x 4 x = 0

6

New, extended-cell DP

Page 8: An Introduction to Bioinformatics Algorithms www ...michaluz/seminar/lecture4.1.SMAWK.pdf · An Introduction to Bioinformatics Algorithms SMAWK Matrix Searching [Aggarwal et-al 87]

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

8

g

a

ga c

I4 I5 I6I3

I2

I1

O4

a a c g a c gctacg

ag

a

a t1

1 2 3 4 5

2

3

4

5

6

c

I1 DIST[1,4] I1+DIST[1,4]

I2 DIST[2,4] I2+DIST[2,4]

I3 DIST[3,4] I3+DIST[3,4]

I4 DIST[4,4] I4+DIST[4,4]

I5 DIST[5,4] I5+DIST[5,4]

I6 DIST[6,4] I6+DIST[6,4]

Computing the score for Output Border Vertex O4

O = max(I + DIST[x,3])x 4 x = 0

6

O4

DIST[x,4]

Page 9: An Introduction to Bioinformatics Algorithms www ...michaluz/seminar/lecture4.1.SMAWK.pdf · An Introduction to Bioinformatics Algorithms SMAWK Matrix Searching [Aggarwal et-al 87]

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

9

I1 = 1

I2 = 2

I3 = 3

I4 =

I5 = 1

I6 =

OUT[x,j] = Ix + DIST[x,j]

Input I\ DIST Matrix

Output vector O

How to compute the

column maxima

of OUT in O(t) time ?

(Utilize the

Total Monotonicity

Property of OUT).

How to obtain the DIST

for G in O(t) time ?

(Take advantage of the

incremental nature of

LZ78 parsing).

The Main Challenges

Page 10: An Introduction to Bioinformatics Algorithms www ...michaluz/seminar/lecture4.1.SMAWK.pdf · An Introduction to Bioinformatics Algorithms SMAWK Matrix Searching [Aggarwal et-al 87]

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

10

3/4

5/2

3/2

left prefix (5,2)

diagonal

prefix (3,2)

top

prefix(3,4)

block G (5,4)

g

a

ga cga c

g

a

a c

a

a c

a

ctacg

ag

a

a t

5/4

1

1 2 3 4 5

2

3

4

5

6

a a c g a c g

c

Accessing a Prefix Block in Constant time.

a c

Trie for A

0

13

5

2

t

4

gg

a

c

g

g

Trie for B

0

31

2

46

5c

t

Page 11: An Introduction to Bioinformatics Algorithms www ...michaluz/seminar/lecture4.1.SMAWK.pdf · An Introduction to Bioinformatics Algorithms SMAWK Matrix Searching [Aggarwal et-al 87]

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

11

I1 = 1

I2 = 2

I3 = 3

I4 =

I5 = 1

I6 =

OUT[x,j] = Ix + DIST[x,j]

Input I\ DIST Matrix

Output vector O

How to compute the

column maxima

of OUT in O(t) time ?

(Utilize the

Total Monotonicity

Property of OUT).

How to obtain the DIST

for G in O(t) time ?

(Take advantage of the

incremental nature of

LZ78 parsing).

The Main Challenges

Page 12: An Introduction to Bioinformatics Algorithms www ...michaluz/seminar/lecture4.1.SMAWK.pdf · An Introduction to Bioinformatics Algorithms SMAWK Matrix Searching [Aggarwal et-al 87]

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

12

a

(0,0)

I

O

b

c d

The Total Monotonicity Property [Aggarwal et al 1987].

For any a < b and c < d

OUT[b,c] >= OUT[a,c] ===> OUT[b,d] >= OUT[a,d]

g

a

ga c

I4 I5 I6I3

I2

I1

O4

a a c g a c gctacg

ag

a

a t1

1 2 3 4 5

2

3

4

5

6

c

I1 DIST[1,4] I1+DIST[1,4]

I2 DIST[2,4] I2+DIST[2,4]

I3 DIST[3,4] I3+DIST[3,4]

I4 DIST[4,4] I4+DIST[4,4]

I5 DIST[5,4] I5+DIST[5,4]

I6 DIST[6,4] I6+DIST[6,4]

Computing the score for Output Border Vertex O4

O = max(I + DIST[x,3])x 4 x = 0

6

O4

I1 = 1

I2 = 2

I3 = 3

I4 =

I5 = 1

I6 =

OUT[x,j] = Ix + DIST[x,j]

Input I\ DIST Matrix

Output vector O

How to compute the

column maxima

of OUT in O(t) time ?

(Utilize the

Total Monotonicity

Property of OUT).

The Main Challenges

Page 13: An Introduction to Bioinformatics Algorithms www ...michaluz/seminar/lecture4.1.SMAWK.pdf · An Introduction to Bioinformatics Algorithms SMAWK Matrix Searching [Aggarwal et-al 87]

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

13

a

b

d

OUT Matrix

How does Total Monotonicity affect Column Maxima behavior?

For all a <b and c < d, OUT[a,c] <= OUT[b,c] OUT[a,d] <= OUT[b,d]

Column maxima row indices are monotonically non-decreasing.

SMAWK Matrix Searching[Aggarwal et-al 87] .

The t column maxima of a Totally Monotone array

can be computed in O(t) time, by querying only O(t) elements.

Page 14: An Introduction to Bioinformatics Algorithms www ...michaluz/seminar/lecture4.1.SMAWK.pdf · An Introduction to Bioinformatics Algorithms SMAWK Matrix Searching [Aggarwal et-al 87]

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

Outline

• Review: sequence alignment in sub-quadratic time

for unrestricted Scoring Schemes.

• Discuss dynamic programming special

properties: convexity/concavity.

• SMAWK algorithm for computing the

row/column minima/maxima of a Totally

Monotone nxm matrix in O(n).

• Four Russians algorithm for sub-quadratic

sequence alignment under discrete scoring

schemes

Page 15: An Introduction to Bioinformatics Algorithms www ...michaluz/seminar/lecture4.1.SMAWK.pdf · An Introduction to Bioinformatics Algorithms SMAWK Matrix Searching [Aggarwal et-al 87]

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

11-15

Elements of Dynamic Programming

• For dynamic programming to be applicable, an

optimization problem must have:

1. Optimal substructure

• An optimal solution to the problem contains within it

optimal solution to subproblems

2. Overlapping subproblems

• The space of subproblems must be small; i.e., the

same subproblems are encountered over and over

Page 16: An Introduction to Bioinformatics Algorithms www ...michaluz/seminar/lecture4.1.SMAWK.pdf · An Introduction to Bioinformatics Algorithms SMAWK Matrix Searching [Aggarwal et-al 87]

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

Features of a Dynamic Programming

Algorithm:

(1) A table (where we store optimal costs of subproblems).

(2) The entry dependency of the table (given by the recurrence relation).

When we are interested in the design of efficient algorithms for dynamic

Programming, a third feature emerges:

(3) The order to fill in the table (the algorithm).

Some times one can use some properties of the problem at hand to

change the order in which (3) is computed to obtain better algorithms.

Page 17: An Introduction to Bioinformatics Algorithms www ...michaluz/seminar/lecture4.1.SMAWK.pdf · An Introduction to Bioinformatics Algorithms SMAWK Matrix Searching [Aggarwal et-al 87]

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

Convexity/Concavity Properties of the

DP Table A

(1) Monge (convex) [Monge 1781]

A[a, c] + A[b, d ] > = A[b, c] + A[a, d] for all a< b and c< d

Monge (concave)

A[a, c] + A[b, d ] < = A[b, c] + A[a, d] for all a< b and c< d

(2) Total Monotonicity (convex)

A[a, c] > = A[b, c ] => A[a,d] > = A[b, d] for all a< b and c< d

Total Monotonicity (concave)

A[a, c] < = A[b, c ] => A[a,d] < = A[b, d] for all a< b and c< d

a

b

c d

Page 18: An Introduction to Bioinformatics Algorithms www ...michaluz/seminar/lecture4.1.SMAWK.pdf · An Introduction to Bioinformatics Algorithms SMAWK Matrix Searching [Aggarwal et-al 87]

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

The Total Monotonicity Property A[a, c] < = A[b, c ] => A[a,d] ] < = A[b, d]

(1) matrix A is Monotone if , for any column d > c,

Column Maximum [d] >= Colum Maximum [c]

(2) matrix A is Totally Monotone if every 2x2 submatrix is Monotone.

Page 19: An Introduction to Bioinformatics Algorithms www ...michaluz/seminar/lecture4.1.SMAWK.pdf · An Introduction to Bioinformatics Algorithms SMAWK Matrix Searching [Aggarwal et-al 87]

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

b

d

OUT Matrix

How does Total Monotonicity affect Column Maxima behavior?

For all a <b and c < d, OUT[a,c] <= OUT[b,c] OUT[a,d] <= OUT[b,d]

Column maxima row indices are monotonically non-decreasing.

SMAWK Matrix Searching[Aggarwal et-al 87] .

The t column maxima of a Totally Monotone array

can be computed in O(t) time, by querying only O(t) elements.

m

n

This statement holds for every 2x2 submatrix.

Page 20: An Introduction to Bioinformatics Algorithms www ...michaluz/seminar/lecture4.1.SMAWK.pdf · An Introduction to Bioinformatics Algorithms SMAWK Matrix Searching [Aggarwal et-al 87]

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

b

d

OUT Matrix

How does Total Monotonicity affect Column Maxima behavior?

For all a <b and c < d, OUT[a,c] <= OUT[b,c] OUT[a,d] <= OUT[b,d]

Column maxima row indices are monotonically non-decreasing.

SMAWK Matrix Searching[Aggarwal et-al 87] .

The t column maxima of a Totally Monotone array

can be computed in O(t) time, by querying only O(t) elements.

m

n

This statement holds for every 2x2 submatrix.

G[k,k] > G[k+1,k] and k < n

Page 21: An Introduction to Bioinformatics Algorithms www ...michaluz/seminar/lecture4.1.SMAWK.pdf · An Introduction to Bioinformatics Algorithms SMAWK Matrix Searching [Aggarwal et-al 87]

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

SMAWK Matrix Searching[Aggarwal et-al 87] .

The n column maxima of a Totally Monotone array

can be computed in O(n) time, by querying only O(n) elements.

The hearth of the algorithm is the subroutine REDUCE.

It takes as input an n x m totally monotone matrix A with

m < n and returns an m x m matrix G which is a submatrix

of A such that G contains the rows of A which carry

the column maxima of A.

A. Aggarwal, M. Klawe, S. Moran, P. Shor and R. Wilber

Page 22: An Introduction to Bioinformatics Algorithms www ...michaluz/seminar/lecture4.1.SMAWK.pdf · An Introduction to Bioinformatics Algorithms SMAWK Matrix Searching [Aggarwal et-al 87]

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

n

m1

1

k

k

G <-- A

k <-- 1

while the number of rows of G is greater than m do

begin

case:

a: G[k,k] > G[k+1,k] and k < n: k <-- k+1

b: G[k,k] > G[k+1,k] and k = m: delete row k+1

c: G[k,k] <= G[k+1,k]: delete row k; k <-- max(1, k-1)

endcase

end

return(G);

G[k,k]

G[k+1,k]

Procedure REDUCE(A, n):

input: an n x m totally monoton matrix A with m <= n.

output: an m x m matrix G which is a submatrix of A

such that G contains only the rows of A which

carry the column maxima of A.

Invariant: G[k, 1.. k-1] is dead by total

monotonicity.

Page 23: An Introduction to Bioinformatics Algorithms www ...michaluz/seminar/lecture4.1.SMAWK.pdf · An Introduction to Bioinformatics Algorithms SMAWK Matrix Searching [Aggarwal et-al 87]

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

SMAWK Matrix Searching[Aggarwal et-al 87] .

The n column maxima of a Totally Monotone array

can be computed in O(n) time, by querying only O(n) elements.

Procedure MaxColCompute(A):

G<-- REDUCE(A)

if (n = 1) then output the maximum and return

P <- {c2, c4,...c[n/2]} of G

MaxColCompute(P)

from the positions (now known) of the maxima

in the even columns of G,

find the maxima of its odd columns.

Page 24: An Introduction to Bioinformatics Algorithms www ...michaluz/seminar/lecture4.1.SMAWK.pdf · An Introduction to Bioinformatics Algorithms SMAWK Matrix Searching [Aggarwal et-al 87]

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

MaxColCompute(OUT = c1,c2,c3,c4,c5,c6)

Page 25: An Introduction to Bioinformatics Algorithms www ...michaluz/seminar/lecture4.1.SMAWK.pdf · An Introduction to Bioinformatics Algorithms SMAWK Matrix Searching [Aggarwal et-al 87]

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

MaxColCompute(OUT = c1,c2,c3,c4,c5,c6)MaxColCompute(A= c1,c2,c3,c4,c5,c6)

Reduce: A is 6x6, the number of rows is not

greater than m… return

Page 26: An Introduction to Bioinformatics Algorithms www ...michaluz/seminar/lecture4.1.SMAWK.pdf · An Introduction to Bioinformatics Algorithms SMAWK Matrix Searching [Aggarwal et-al 87]

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

MaxColCompute(OUT = c1,c2,c3,c4,c5,c6)

P the even columns of A

Page 27: An Introduction to Bioinformatics Algorithms www ...michaluz/seminar/lecture4.1.SMAWK.pdf · An Introduction to Bioinformatics Algorithms SMAWK Matrix Searching [Aggarwal et-al 87]

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

MaxColCompute(OUT = c1,c2,c3,c4,c5,c6)MaxColCompute(A= c2,c4,c6)

Reduce: A is 6x3, the number of rows is

greater than m.

Page 28: An Introduction to Bioinformatics Algorithms www ...michaluz/seminar/lecture4.1.SMAWK.pdf · An Introduction to Bioinformatics Algorithms SMAWK Matrix Searching [Aggarwal et-al 87]

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

Reduce(OUT[c4], n = 1)

3

0

MaxColCompute(OUT[c4])

MaxColCompute(n = 3)

Page 29: An Introduction to Bioinformatics Algorithms www ...michaluz/seminar/lecture4.1.SMAWK.pdf · An Introduction to Bioinformatics Algorithms SMAWK Matrix Searching [Aggarwal et-al 87]

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

MaxColCompute(n=6)

Page 30: An Introduction to Bioinformatics Algorithms www ...michaluz/seminar/lecture4.1.SMAWK.pdf · An Introduction to Bioinformatics Algorithms SMAWK Matrix Searching [Aggarwal et-al 87]

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

n

m1

1

k

k

G <-- A

k <-- 1

while the number of rows of G is greater than m do

begin

case:

a: G[k,k] > G[k+1,k] and k < n: k <-- k+1

b: G[k,k] > G[k+1,k] and k = m: delete row k+1

c: G[k,k] <= G[k+1,k]: delete row k; k <-- max(1, k-1)

endcase

end

return(G);

G[k,k]

G[k+1,k]

Procedure REDUCE(A, n):

input: an n x m totally monoton matrix A with m <= n.

output: an m x m matrix G which is a submatrix of A

such that G contains only the rows of A which

carry the column maxima of A.

Time Analysis: Procedure Reduce takes time O(n)

Page 31: An Introduction to Bioinformatics Algorithms www ...michaluz/seminar/lecture4.1.SMAWK.pdf · An Introduction to Bioinformatics Algorithms SMAWK Matrix Searching [Aggarwal et-al 87]

An Introduction to Bioinformatics Algorithms www.bioalgorithms.infoTime Analysis.Let T(n,m) be the time taken by MaxColCompute for an n x m matrix.

The call to Reduce takes time O(n).

Notice that P is an m x m/2 totally monotone matrix,

so the recursive call takes time T(m,m/2).

Once the positions of the maxima in the even rows of G have

been found, the maximum in each odd column is restricted to the

interval of maximum positions of the neighboring even columns.

Thus, finding all maxima in the odd columns can be done in

O(m) time.

For some constants c1 and c2, the time complexity satisfies

T(n,m) <= c1n + c2m + T(m, m/2)

which gives the solution

T(n,m) <= 2(c1+ c2)m + c1n = O(n) , since m<=n.