CS 5263 Bioinformatics Lecture 4: Global Sequence Alignment Algorithms

Upload: timothy-james

Post on 25-Dec-2015


CS 5263 Bioinformatics

Lecture 4: Global Sequence Alignment Algorithms

Roadmap

• Review of last lecture

• More global sequence alignment algorithms

• Given a scoring scheme,
– Match: m
– Mismatch: -s
– Gap: -d

• We can easily compute an optimal alignment by dynamic programming

• In a completed alignment between a pair of sequences X = x1x2…xM, Y = y1y2…yN

• If we look at any column of the alignment, there are only three possibilities
– xi is aligned to yj
– xi is aligned to a gap
– yj is aligned to a gap

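• Since the alignment score F(M, N) is the sum of the scores of all aligned columns, it can be broken down by its last column:

$$F(M, N) = \max \begin{cases} F(M-1, N-1) + \sigma(x_M, y_N) \\ F(M-1, N) - d \\ F(M, N-1) - d \end{cases}$$

Here σ(a, b) is the match/mismatch score of a column: m if a = b, -s otherwise.

• And recursively:

$$F(i, j) = \max \begin{cases} F(i-1, j-1) + \sigma(x_i, y_j) \\ F(i-1, j) - d \\ F(i, j-1) - d \end{cases}$$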

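[Figure: the DP matrix is filled from F(0,0) to F(M,N); the trace-back follows the optimal path from F(M,N) back to F(0,0)]

Trace-back

F(i,j)   j =  0    1    2    3    4
                   A    G    T    A
i = 0         0   -1   -2   -3   -4
i = 1   A    -1    1    0   -1   -2
i = 2   T    -2    0    0    1    0
i = 3   A    -3   -1   -1    0    2

Resulting alignment:

AGTA
A-TA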
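The recurrence and the trace-back can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's own code: the function name `needleman_wunsch` and the default scores (match 1, mismatch -1, gap -1, matching the table in this example) are assumptions made for the sketch.

```python
def needleman_wunsch(x, y, m=1, s=1, d=1):
    """Fill F for x (rows) against y (columns), then trace back one optimal alignment.

    m, s, d are the match score, mismatch penalty and gap penalty (assumed defaults).
    """
    M, N = len(x), len(y)
    F = [[0] * (N + 1) for _ in range(M + 1)]
    for i in range(1, M + 1):
        F[i][0] = -i * d              # x1..xi aligned to gaps only
    for j in range(1, N + 1):
        F[0][j] = -j * d              # y1..yj aligned to gaps only
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            sub = m if x[i - 1] == y[j - 1] else -s
            F[i][j] = max(F[i - 1][j - 1] + sub,   # xi aligned to yj
                          F[i - 1][j] - d,         # xi aligned to a gap
                          F[i][j - 1] - d)         # yj aligned to a gap

    # Trace-back: walk from (M, N) to (0, 0), re-deriving which case gave the max.
    ax, ay = [], []
    i, j = M, N
    while i > 0 or j > 0:
        sub = (m if x[i - 1] == y[j - 1] else -s) if i > 0 and j > 0 else None
        if sub is not None and F[i][j] == F[i - 1][j - 1] + sub:
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] - d:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return F, "".join(reversed(ax)), "".join(reversed(ay))


F, ax, ay = needleman_wunsch("ATA", "AGTA")
print(F[3][4])   # optimal score
print(ay)
print(ax)
```

When several cases tie, this sketch prefers the diagonal move, so it reports only one of the possibly many optimal alignments.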

Graph representation

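[Figure: grid graph from (0,0) to (3,4) for the two sequences S1 and S2 of the example above; each edge carries a score of 1 or -1, and the highlighted path is the optimal alignment]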

• Number of steps: length of the alignment

• Path length: alignment score

• Alignment: find the longest path from (0, 0) to (3, 4)

• The general longest-path problem cannot be solved by DP. The longest path on this graph can, because the graph contains no cycles.

[Figure legend: one straight arrow direction is a gap in the 2nd sequence, the other a gap in the 1st sequence; diagonal arrows are match / mismatch]

Values on vertical/horizontal edges: -d
Values on diagonal edges: m or -s

Question

• If we change the scoring scheme, will the optimal alignment change?
– Original: match = 1, mismatch = gap = -1
– New: match = 2, mismatch = gap = 0
– New: match = 2, mismatch = gap = -2?

Number of alignments

• Is equal to the number of distinct paths from (0, 0) to (m, n)

[Figure: the five paths through the grid for aligning A with BC, one per alignment]

A-    A--    --A    -A-    -A
BC    -BC    BC-    B-C    BC

• How to count?
– Homework assignment
– Hint: dynamic programming
– Or analytically
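One way to follow the dynamic-programming hint is to count paths instead of scoring them: every grid node is reached from its diagonal, upper, or left neighbour. The sketch below is an illustration under that reading; the function name `count_alignments` is hypothetical.

```python
def count_alignments(m, n):
    """Count distinct paths from (0, 0) to (m, n) using diagonal, down and right steps."""
    # Along the first row and first column there is exactly one path (all gaps).
    paths = [[1] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            paths[i][j] = (paths[i - 1][j - 1]   # last column aligns xi with yj
                           + paths[i - 1][j]     # last column aligns xi with a gap
                           + paths[i][j - 1])    # last column aligns yj with a gap
    return paths[m][n]


print(count_alignments(1, 2))  # the five alignments of A with BC listed above
```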

However

• Biologically meaningful “distinct” alignments may be far fewer
– All three may be considered equivalent
– A, B, and C are all aligned to gaps

A--    --A    -A-
-BC    BC-    B-C

Number of alignments

• We only care about who is aligned to whom, not the gaps

• For two sequences of length m, n, there may be k matches, k = 0 to min(m, n)

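• Number of alignments: choose which k positions of each sequence are aligned (their order is then fixed), and sum over k:

$$\sum_{k=0}^{\min(m,n)} \binom{m}{k}\binom{n}{k} = \binom{m+n}{n}$$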

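Furthermore

A--    =>    A-
-BC          BC

-d -d  =>  m or -s

• Alternating gaps are discouraged / prohibited.

• With most scoring schemes, alternating gaps never appear in an optimal alignment (as long as 2d > s): a gap in one sequence followed by a gap in the other costs -2d, while merging the two columns into one mismatch costs only -s.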

[Figure: the five alignments of A with BC, shown again]

• Special trick?

• No. With most scoring schemes this is achieved automatically
– 2d > s

Number of alignments

• Homework assignment

• Dynamic programming
– Multiple matrices
– Three states:
• Came from diagonal: can go in any of the three directions
• Came from left: cannot go down
• Came from above: cannot turn right
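The three states can be kept in three count matrices, one per direction of the last step. The sketch below is one possible reading of the rules above, not a reference solution: the name `count_no_alternating_gaps` is hypothetical, and treating the start node (0, 0) as a "diagonal" state is an assumption.

```python
def count_no_alternating_gaps(m, n):
    """Count paths from (0, 0) to (m, n) in which a gap in one sequence
    is never immediately followed by a gap in the other."""
    D = [[0] * (n + 1) for _ in range(m + 1)]  # last step was diagonal
    H = [[0] * (n + 1) for _ in range(m + 1)]  # last step came from the left
    V = [[0] * (n + 1) for _ in range(m + 1)]  # last step came from above
    D[0][0] = 1  # assumption: the start behaves like a diagonal state
    for i in range(m + 1):
        for j in range(n + 1):
            if i > 0 and j > 0:
                # A diagonal step may follow any state.
                D[i][j] = D[i - 1][j - 1] + H[i - 1][j - 1] + V[i - 1][j - 1]
            if j > 0:
                # A step to the right may not follow a step from above.
                H[i][j] = D[i][j - 1] + H[i][j - 1]
            if i > 0:
                # A step down may not follow a step from the left.
                V[i][j] = D[i - 1][j] + V[i - 1][j]
    return D[m][n] + H[m][n] + V[m][n]


print(count_no_alternating_gaps(1, 2))  # only A-/BC and -A/BC remain
```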

• Given two sequences of length M, N

• Time: O(MN)
– ok

• Space: O(MN)
– bad
– 1 Mb seq x 1 Mb seq = 1000 G of memory

• Can we do better?

In biology, this kind of alignment is unlikely to be meaningful

abcde--------vwxyz

Good alignment should appear near the diagonal

Bounded Dynamic Programming

If we know that x and y are very similar

Assumption: # gaps(x, y) < k

Then xi aligned to yj implies | i – j | < k

Bounded Dynamic Programming

Initialization:

F(i, 0), F(0, j) undefined for i, j > k

Iteration:

For i = 1…M
  For j = max(1, i – k)…min(N, i + k)

    F(i, j) = max of:
      F(i – 1, j – 1) + σ(xi, yj)
      F(i, j – 1) – d, if j > i – k
      F(i – 1, j) – d, if j < i + k

(σ(xi, yj) is the match/mismatch score: m or -s)

Termination: same
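The bounded iteration can be sketched as follows, storing only the band so that memory is O(kM). This is an illustration under assumptions: the name `banded_global_score`, the default scores, and the storage layout (column index c = j - i + k) are choices made for the sketch, and only the score is returned, not the alignment.

```python
NEG_INF = float("-inf")


def banded_global_score(x, y, k, m=1, s=1, d=1):
    """Global alignment score restricted to cells with |i - j| <= k."""
    M, N = len(x), len(y)
    if abs(M - N) > k:
        return NEG_INF                     # the end cell (M, N) lies outside the band
    width = 2 * k + 1
    # band[i][c] holds F(i, j) for c = j - i + k; cells outside the matrix stay -inf.
    band = [[NEG_INF] * width for _ in range(M + 1)]
    for j in range(min(N, k) + 1):
        band[0][j + k] = -j * d            # first row, inside the band only
    for i in range(1, M + 1):
        for j in range(max(0, i - k), min(N, i + k) + 1):
            c = j - i + k
            if j == 0:
                band[i][c] = -i * d        # first column, inside the band only
                continue
            sub = m if x[i - 1] == y[j - 1] else -s
            best = band[i - 1][c] + sub    # diagonal neighbour keeps the same c
            if c > 0:                      # j > i - k: left neighbour is in the band
                best = max(best, band[i][c - 1] - d)
            if c < width - 1:              # j < i + k: upper neighbour is in the band
                best = max(best, band[i - 1][c + 1] - d)
            band[i][c] = best
    return band[M][N - M + k]


print(banded_global_score("ATA", "AGTA", k=1))
```

If the optimal unrestricted alignment stays inside the band, the result equals the full DP score; otherwise it is only a lower bound.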

[Figure: the matrix with x1…xM along one axis and y1…yN along the other; only a band of width k around the diagonal is filled in]

Analysis

• Time: O(kM) << O(MN)

• Space: O(kM) with some tricks

[Figure: the diagonal band of width 2k is stored as a rectangular array of size M x 2k]

• What if we don’t know k?

• Iterate:
– For k = 2, 4, 8, 16, …
– For each k, we can compute an optimal bounded alignment with score Sk
– Stop when ((min(N, M) – k) * m – 2kd) < Sk, since we will not be able to get a higher score with a larger k
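The doubling loop can be sketched as below. The stopping test is copied from the slide; everything else is an assumption for illustration: the names `banded_score` and `align_unknown_k`, the dictionary-based band (chosen for brevity, not speed), and the extra exit once the band covers the whole matrix.

```python
NEG_INF = float("-inf")


def banded_score(x, y, k, m=1, s=1, d=1):
    """Best global score using only cells with |i - j| <= k (-inf if (M, N) is outside)."""
    M, N = len(x), len(y)
    F = {(0, 0): 0}
    for i in range(M + 1):
        for j in range(max(0, i - k), min(N, i + k) + 1):
            if i == 0 and j == 0:
                continue
            cands = []
            if i > 0 and j > 0 and (i - 1, j - 1) in F:
                cands.append(F[i - 1, j - 1] + (m if x[i - 1] == y[j - 1] else -s))
            if j > 0 and (i, j - 1) in F:
                cands.append(F[i, j - 1] - d)
            if i > 0 and (i - 1, j) in F:
                cands.append(F[i - 1, j] - d)
            F[i, j] = max(cands)
    return F.get((M, N), NEG_INF)


def align_unknown_k(x, y, m=1, s=1, d=1):
    """Double k until the slide's bound shows a wider band cannot score higher."""
    M, N = len(x), len(y)
    k = 2
    while True:
        score = banded_score(x, y, k, m, s, d)
        bound = (min(M, N) - k) * m - 2 * k * d   # stopping criterion from the slide
        if bound < score or k >= max(M, N):       # second test: band already covers everything
            return score, k
        k *= 2


print(align_unknown_k("ATA", "AGTA"))  # (score, k at which the loop stopped)
```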


Linear space algorithm

• If all we need is the alignment score but not the alignment, easy!

We only need to keep two rows

(if you are crafty enough, you only need one row)
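A minimal sketch of the two-row idea: each row of F depends only on the row above it, so the previous row can be discarded as soon as the current one is complete. The function name `global_score_two_rows` and the default scores are assumptions for the example.

```python
def global_score_two_rows(x, y, m=1, s=1, d=1):
    """Optimal global alignment score using O(N) memory (no alignment is recovered)."""
    prev = [-j * d for j in range(len(y) + 1)]        # row i = 0
    for i in range(1, len(x) + 1):
        cur = [-i * d] + [0] * len(y)                 # F(i, 0) = -i * d
        for j in range(1, len(y) + 1):
            sub = m if x[i - 1] == y[j - 1] else -s
            cur[j] = max(prev[j - 1] + sub,           # diagonal
                         prev[j] - d,                 # from above
                         cur[j - 1] - d)              # from the left
        prev = cur                                    # the older row is no longer needed
    return prev[-1]


print(global_score_two_rows("ATA", "AGTA"))
```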

But how do we get the alignment?

Linear space algorithm

• When we finish, we know how we have aligned the ends of the sequences

Naïve idea: Repeat on the smaller subproblem F(M-1, N-1)

Time complexity: O((M+N)(MN))


Hirschberg’s idea

• Divide and conquer!

Forward algorithm

• Align x1x2…xM/2 with Y

• F(M/2, k) represents the best alignment between x1x2…xM/2 and y1y2…yk

[Figure: forward DP over the first half of X, up to M/2, against Y]

Backward Algorithm

• Align reverse(xM/2+1xM/2+2…xM) with reverse(Y)

• B(M/2, k) represents the best alignment between reverse(xM/2+1xM/2+2…xM) and reverse(yk+1yk+2…yN)

[Figure: backward DP over the second half of X, from M down to M/2, against Y]

Lemma

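• F(M/2, k) + B(M/2, k) is the score of the best alignment whose path passes through (M/2, k), that is, x1…xM/2 is aligned with y1…yk and the rest of X with the rest of Y

• The optimal score is the best of these over all k:

$$F(M, N) = \max_{k = 0, \dots, N} \big( F(M/2, k) + B(M/2, k) \big)$$

[Figure: the alignment path crosses M/2 at k*; F(M/2, k) covers the part before the crossing and B(M/2, k) the part after]

• Longest path from (0, 0) to (6, 6) is max_k ( LP(0,0,3,k) + LP(3,k,6,6) )

[Figure: grid graph from (0,0) to (6,6) with candidate midpoints (3,0), (3,2), (3,4), (3,6)]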

Linear-space alignment

Now, using 2 rows of space, we can compute F(M/2, k) and B(M/2, k) for k = 0…N

Now, we can find k* maximizing F(M/2, k) + B(M/2, k)

Also, we can trace the path leaving M/2 from k*

Conclusion: In O(NM) time and O(N) space, we have found where the optimal alignment path crosses M/2

Linear-space alignment

• Iterate this procedure on the two sub-problems!

[Figure: the two sub-problems, of size M/2 x k* and M/2 x (N – k*)]
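The divide-and-conquer procedure can be sketched as below. This is an illustrative implementation under assumptions, not the lecture's code: the names `last_row`, `hirschberg` and `alignment_score` are hypothetical, the default scores are match 1, mismatch -1, gap -1, and the recursion bottoms out when X has at most one character.

```python
def last_row(x, y, m=1, s=1, d=1):
    """Last row of the global DP for x against y, computed with two rows."""
    prev = [-j * d for j in range(len(y) + 1)]
    for i in range(1, len(x) + 1):
        cur = [-i * d] + [0] * len(y)
        for j in range(1, len(y) + 1):
            sub = m if x[i - 1] == y[j - 1] else -s
            cur[j] = max(prev[j - 1] + sub, prev[j] - d, cur[j - 1] - d)
        prev = cur
    return prev


def hirschberg(x, y, m=1, s=1, d=1):
    """Return (aligned_x, aligned_y) for one optimal global alignment."""
    if not x:
        return "-" * len(y), y
    if not y:
        return x, "-" * len(x)
    if len(x) == 1:
        # Base case: pair the single character with its best partner in y, or with a gap.
        best_j, best = None, -(len(y) + 1) * d
        for j, c in enumerate(y):
            score = (m if x == c else -s) - (len(y) - 1) * d
            if score > best:
                best_j, best = j, score
        if best_j is None:
            return x + "-" * len(y), "-" + y
        return "-" * best_j + x + "-" * (len(y) - best_j - 1), y
    mid, n = len(x) // 2, len(y)
    f = last_row(x[:mid], y, m, s, d)                 # f[k] = F(M/2, k)
    b = last_row(x[mid:][::-1], y[::-1], m, s, d)     # b[n - k] = B(M/2, k)
    k_star = max(range(n + 1), key=lambda k: f[k] + b[n - k])
    left = hirschberg(x[:mid], y[:k_star], m, s, d)
    right = hirschberg(x[mid:], y[k_star:], m, s, d)
    return left[0] + right[0], left[1] + right[1]


def alignment_score(ax, ay, m=1, s=1, d=1):
    """Score an alignment column by column (helper for checking the result)."""
    return sum(-d if "-" in (a, b) else (m if a == b else -s) for a, b in zip(ax, ay))


ax, ay = hirschberg("ATA", "AGTA")
print(ay)
print(ax)
print(alignment_score(ax, ay))
```

String slicing and reversal copy their inputs, which keeps the sketch short; a version that passes index ranges instead would avoid the extra copies.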

Analysis

• Memory: O(N) for computation, O(N+M) to store the optimal alignment

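• Time:
– MN for the first iteration
– k M/2 + (N-k) M/2 = MN/2 for the second
– …

[Figure: work per level of the recursion: MN, MN/2, MN/4, MN/8, …]

$$MN + \frac{MN}{2} + \frac{MN}{4} + \frac{MN}{8} + \dots = MN \left(1 + \tfrac{1}{2} + \tfrac{1}{4} + \tfrac{1}{8} + \tfrac{1}{16} + \dots\right) = 2MN = O(MN)$$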