CS 5263 Bioinformatics Lecture 4: Global Sequence Alignment Algorithms
• Given a scoring scheme,
– Match: m
– Mismatch: -s
– Gap: -d
• We can easily compute an optimal alignment by dynamic programming
• In a completed alignment between a pair of sequences X = x1x2…xM, Y = y1y2…yN
• If we look at any column of the alignment, there are only three possibilities
– xi is aligned to yj
– xi is aligned to a gap
– yj is aligned to a gap
• Since the alignment score F(M, N) is a sum over all aligned columns, it can be broken down to:
$$F(M, N) = \max \begin{cases} F(M-1, N-1) + \sigma(x_M, y_N) \\ F(M-1, N) - d \\ F(M, N-1) - d \end{cases}$$
where σ(a, b) = m if a = b and -s otherwise
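The recurrence can be turned into a table-filling routine. The following is a minimal sketch, assuming the boundary values F(i, 0) = -i·d and F(0, j) = -j·d (an all-gap prefix); the function name `nw_matrix` and the default scores m = s = d = 1 are illustrative choices, not part of the slides.

```python
def nw_matrix(x, y, m=1, s=1, d=1):
    """Fill the global-alignment score matrix F for sequences x and y.

    m: match score, s: mismatch penalty, d: gap penalty (all >= 0).
    F[i][j] is the best score for aligning x[:i] with y[:j].
    """
    M, N = len(x), len(y)
    F = [[0] * (N + 1) for _ in range(M + 1)]
    # Assumed boundary: a prefix aligned entirely to gaps.
    for i in range(1, M + 1):
        F[i][0] = -i * d
    for j in range(1, N + 1):
        F[0][j] = -j * d
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            sub = m if x[i - 1] == y[j - 1] else -s
            F[i][j] = max(F[i - 1][j - 1] + sub,  # x_i aligned to y_j
                          F[i - 1][j] - d,        # x_i aligned to a gap
                          F[i][j - 1] - d)        # y_j aligned to a gap
    return F


F = nw_matrix("ATA", "AGTA")
print(F[3][4])  # 2
```

With X = ATA and Y = AGTA this gives an optimal score of 2 in the bottom-right cell.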
Example alignment:
AGTA
A-TA
Trace-back
F(i,j)      j = 0    1    2    3    4
                     A    G    T    A
i = 0           0   -1   -2   -3   -4
    1   A      -1    1    0   -1   -2
    2   T      -2    0    0    1    0
    3   A      -3   -1   -1    0    2
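Trace-back walks from F(M, N) to F(0, 0), at each cell following a predecessor that explains the stored value. The sketch below is one possible implementation; the name `nw_align`, the tie-breaking order (diagonal, then up, then left) and the boundary values are assumptions made for illustration.

```python
def nw_align(x, y, m=1, s=1, d=1):
    """Return (score, aligned_x, aligned_y) for a global alignment."""
    M, N = len(x), len(y)
    F = [[0] * (N + 1) for _ in range(M + 1)]
    for i in range(1, M + 1):
        F[i][0] = -i * d
    for j in range(1, N + 1):
        F[0][j] = -j * d
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            sub = m if x[i - 1] == y[j - 1] else -s
            F[i][j] = max(F[i - 1][j - 1] + sub, F[i - 1][j] - d, F[i][j - 1] - d)

    # Trace back from (M, N) to (0, 0), building the alignment in reverse.
    ax, ay = [], []
    i, j = M, N
    while i > 0 or j > 0:
        sub = (m if x[i - 1] == y[j - 1] else -s) if i > 0 and j > 0 else None
        if sub is not None and F[i][j] == F[i - 1][j - 1] + sub:
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] - d:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return F[M][N], "".join(reversed(ax)), "".join(reversed(ay))


print(nw_align("ATA", "AGTA"))  # (2, 'A-TA', 'AGTA')
```

For the table above this recovers the alignment of AGTA with A-TA. When several predecessors tie, a different tie-breaking order may return a different alignment with the same score.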
AGTA
A-TA
Graph representation
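[Figure: alignment graph on a grid from node (0,0) to node (3,4); the letters of the two sequences S1 and S2 label the columns and rows, and each edge carries a weight.]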
• Number of steps: length of the alignment
• Path length: alignment score
• Alignment: find the longest path from (0, 0) to (3, 4)
• The general longest-path problem cannot be solved by DP. The longest path on this graph can be found by DP because the graph has no cycles.
Edge types:
• Horizontal and vertical edges: a gap in the 2nd sequence or a gap in the 1st sequence. Value: -d (here -1)
• Diagonal edges: match / mismatch. Value: m or -s
Question
• If we change the scoring scheme, will the optimal alignment be changed?
– Original: match = 1, mismatch = gap = -1
– New: match = 2, mismatch = gap = 0
– New: match = 2, mismatch = gap = -2?
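One way to reason about the question is to count columns. The derivation below is a sketch, not the lecture's own answer; the symbols a (match columns), b (mismatch columns), g (gap columns) and L (alignment length) are introduced here for illustration.

```latex
% Every match/mismatch column uses one letter from each sequence,
% every gap column uses one letter, so for sequences of length M and N:
\[ 2(a + b) + g = M + N, \qquad L = a + b + g . \]
% Original scheme (1, -1, -1):
\[ S = a - b - g . \]
% Scheme (2, -2, -2): every alignment's score is doubled,
% so the ranking of alignments is unchanged.
\[ S'' = 2a - 2b - 2g = 2S . \]
% Scheme (2, 0, 0): each column gains +1, so longer alignments gain more.
\[ S' = 2a = S + (a + b + g) = S + L . \]
```

Under this reading, doubling all scores cannot change the optimal alignment, while the (2, 0, 0) scheme simply maximizes the number of matches and may prefer a different alignment.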
Number of alignments
• Is equal to the number of distinct paths from (0, 0) to (m, n)
Example: the five alignments of A with BC, one per path through the grid
A-    A--    --A    -A-    -A
BC    -BC    BC-    B-C    BC
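Counting paths in the alignment graph is itself a small dynamic program, since every node is reached from its left, upper, or upper-left neighbor. This is a minimal sketch; the function name `count_paths` is an illustrative choice.

```python
def count_paths(m, n):
    """Number of distinct paths from (0, 0) to (m, n) using right, down
    and diagonal steps, i.e. the number of gapped alignments of two
    sequences of lengths m and n."""
    # Along the top row and left column there is exactly one path.
    P = [[1] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            P[i][j] = P[i - 1][j] + P[i][j - 1] + P[i - 1][j - 1]
    return P[m][n]


print(count_paths(1, 2))  # 5
print(count_paths(3, 4))  # 129
```

The first value matches the five alignments of A with BC listed above; the second is the number of paths in the (0,0) to (3,4) graph.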
However
• Biologically meaningful “distinct” alignments may be far fewer
– The following three may be considered equivalent: A, B, and C are all aligned to gaps
A--    --A    -A-
-BC    BC-    B-C
Number of alignments
• We only care about who is aligned to whom, not the gaps
• For two sequences of length m, n, there may be k matches, k = 0 to min(m, n)
• Number of alignments:
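The formula itself did not survive in this transcript. A natural reconstruction, stated here as an assumption, is to choose which k letters of each sequence are paired: the sum over k = 0 … min(m, n) of C(m, k)·C(n, k), which equals C(m + n, n) by Vandermonde's identity. The sketch below checks that count numerically; the function name `count_matchings` is illustrative.

```python
from math import comb


def count_matchings(m, n):
    """Number of alignments when only 'who is aligned to whom' matters:
    pick k positions in each sequence to be paired, for every feasible k."""
    return sum(comb(m, k) * comb(n, k) for k in range(min(m, n) + 1))


print(count_matchings(1, 2))  # 3
print(count_matchings(3, 4))  # 35
```

For A versus BC this gives 3, consistent with the five paths above once the three all-gap alignments are treated as one.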
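Furthermore
A-    A--
BC    -BC
• Alternating gaps are discouraged / prohibited.
• With most scoring schemes, alternating gaps will never happen (as long as 2d > s).
• A gap in one sequence immediately followed by a gap in the other costs -d - d, whereas aligning the two letters to each other scores m or -s, so the two gap edges are replaced by one diagonal edge:
A--   =>   A-
-BC        BC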
• Special trick?
• No. In most scoring schemes this is achieved automatically, as long as 2d > s
Number of alignments
• Homework assignment
• Dynamic programming
– Multiple matrices
– Three states:
• Came from diagonal: can go in any of the three directions
• Came from left: cannot go down
• Came from above: cannot turn right
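One possible reading of the three-state idea is sketched below: keep one count matrix per "last step" and only allow the transitions listed above. The function name `count_no_alternating`, the matrix names D, L, U and the choice to treat the empty path at (0, 0) as a diagonal arrival are assumptions, and this is not presented as the intended homework solution.

```python
def count_no_alternating(m, n):
    """Count paths from (0, 0) to (m, n) in which a horizontal step is
    never followed by a vertical one, and vice versa.

    D, L, U hold the number of valid paths ending at (i, j) whose last
    step was diagonal, horizontal (came from left) or vertical (came
    from above).
    """
    D = [[0] * (n + 1) for _ in range(m + 1)]
    L = [[0] * (n + 1) for _ in range(m + 1)]
    U = [[0] * (n + 1) for _ in range(m + 1)]
    D[0][0] = 1  # assumption: the empty path may continue in any direction
    for i in range(m + 1):
        for j in range(n + 1):
            if i > 0 and j > 0:  # diagonal step allowed after any state
                D[i][j] = D[i - 1][j - 1] + L[i - 1][j - 1] + U[i - 1][j - 1]
            if j > 0:            # horizontal step: not after "came from above"
                L[i][j] = D[i][j - 1] + L[i][j - 1]
            if i > 0:            # vertical step: not after "came from left"
                U[i][j] = D[i - 1][j] + U[i - 1][j]
    return D[m][n] + L[m][n] + U[m][n]


print(count_no_alternating(1, 2))  # 2
```

For A versus BC only two paths remain under these strict rules (A paired with B, or A paired with C), because every all-gap alignment needs a turn between a horizontal and a vertical step.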
• Given two sequences of length M, N
• Time: O(MN) – ok
• Space: O(MN) – bad
– 1 Mb seq x 1 Mb seq = 10^12 matrix cells, i.e. about 1000G memory
• Can we do better?
Bounded Dynamic Programming
If we know that x and y are very similar
Assumption: # gaps(x, y) < k
Then, xi aligned to yj implies | i – j | < k
Bounded Dynamic Programming
Initialization:
F(i,0), F(0,j) undefined for i, j > k
Iteration:
For i = 1…M
  For j = max(1, i – k)…min(N, i + k)
                     F(i – 1, j – 1) + σ(xi, yj)
    F(i, j) = max    F(i, j – 1) – d, if j > i – k
                     F(i – 1, j) – d, if j < i + k
  (σ(xi, yj) = m for a match, -s for a mismatch)
Termination: same
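The banded iteration above can be sketched as follows. Cells are stored in a dictionary so that only the band is ever touched; that storage choice, the name `banded_score`, and returning minus infinity when (M, N) lies outside the band are assumptions made for this example.

```python
NEG_INF = float("-inf")


def banded_score(x, y, k, m=1, s=1, d=1):
    """Best global-alignment score among paths with |i - j| <= k."""
    M, N = len(x), len(y)
    F = {(0, 0): 0}
    # Initialization: border cells beyond k stay undefined.
    for i in range(1, min(M, k) + 1):
        F[i, 0] = -i * d
    for j in range(1, min(N, k) + 1):
        F[0, j] = -j * d
    for i in range(1, M + 1):
        for j in range(max(1, i - k), min(N, i + k) + 1):
            sub = m if x[i - 1] == y[j - 1] else -s
            best = F.get((i - 1, j - 1), NEG_INF) + sub
            if j > i - k:   # left neighbor is inside the band
                best = max(best, F.get((i, j - 1), NEG_INF) - d)
            if j < i + k:   # upper neighbor is inside the band
                best = max(best, F.get((i - 1, j), NEG_INF) - d)
            F[i, j] = best
    # Termination: same cell as the full algorithm, if it is in the band.
    return F.get((M, N), NEG_INF)


print(banded_score("ATA", "AGTA", k=1))  # 2
```

The work is proportional to the number of band cells, roughly (2k + 1)·M, instead of M·N.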
[Figure: DP matrix with x1 … xM along one axis and y1 … yN along the other; only the cells within k of the main diagonal are computed.]
• What if we don’t know k?
• Iterate:
– For k = 2, 4, 8, 16, …
– For each k, we can have an optimal bounded alignment with score Sk
– Stop when ((min(N, M) - k) * m – 2kd) < Sk, since we will not be able to get a higher score with larger k
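The doubling loop can be sketched as below. The stopping test is taken from the slide as stated; the extra exit when the band already covers the whole matrix, the helper name `banded_score` and the function name `align_unknown_k` are assumptions added so the example is self-contained and always terminates.

```python
NEG_INF = float("-inf")


def banded_score(x, y, k, m=1, s=1, d=1):
    """Best score among alignments confined to the band |i - j| <= k."""
    M, N = len(x), len(y)
    F = {(0, 0): 0}
    for i in range(1, min(M, k) + 1):
        F[i, 0] = -i * d
    for j in range(1, min(N, k) + 1):
        F[0, j] = -j * d
    for i in range(1, M + 1):
        for j in range(max(1, i - k), min(N, i + k) + 1):
            sub = m if x[i - 1] == y[j - 1] else -s
            # Cells outside the band are absent and count as -infinity.
            F[i, j] = max(F.get((i - 1, j - 1), NEG_INF) + sub,
                          F.get((i, j - 1), NEG_INF) - d,
                          F.get((i - 1, j), NEG_INF) - d)
    return F.get((M, N), NEG_INF)


def align_unknown_k(x, y, m=1, s=1, d=1):
    """Double k until the slide's bound shows a wider band cannot help."""
    M, N = len(x), len(y)
    k = 2
    while True:
        Sk = banded_score(x, y, k, m, s, d)
        if k >= max(M, N):                           # band covers the whole matrix
            return Sk, k
        if (min(M, N) - k) * m - 2 * k * d < Sk:     # stopping rule from the slide
            return Sk, k
        k *= 2


print(align_unknown_k("ACGTACGTAC", "ACGTACGTAC"))  # (10, 2)
```

For two identical sequences of length 10 the loop stops at the first band width, k = 2, with score 10.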
Linear space algorithm
• If all we need is the alignment score but not the alignment, easy!
We only need to keep two rows
(if you are crafty enough, you only need one row)
But how do we get the alignment?
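The score-only computation with two rows can be sketched as follows. The name `nw_score_linear` and the boundary values (all-gap prefixes) are assumptions for illustration.

```python
def nw_score_linear(x, y, m=1, s=1, d=1):
    """Optimal global-alignment score using two rows of length len(y) + 1."""
    prev = [-j * d for j in range(len(y) + 1)]   # row i - 1 of F
    for i in range(1, len(x) + 1):
        curr = [-i * d] + [0] * len(y)           # row i of F
        for j in range(1, len(y) + 1):
            sub = m if x[i - 1] == y[j - 1] else -s
            curr[j] = max(prev[j - 1] + sub, prev[j] - d, curr[j - 1] - d)
        prev = curr                              # row i becomes the previous row
    return prev[-1]


print(nw_score_linear("ATA", "AGTA"))  # 2
```

Only the score survives: the rows needed for trace-back have been overwritten, which is exactly the problem the next slides address.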
Linear space algorithm
• When we finish, we know how we have aligned the ends of the sequences
Naïve idea: Repeat on the smaller subproblem F(M-1, N-1)
Time complexity: O((M+N)(MN))
Hirschberg’s idea
• Divide and conquer!
• Forward algorithm: align x1x2…xM/2 with Y
• F(M/2, k) represents the best alignment between x1x2…xM/2 and y1y2…yk
Backward Algorithm
• Backward algorithm: align reverse(xM/2+1…xM) with reverse(Y)
• B(M/2, k) represents the best alignment between reverse(xM/2+1…xM) and reverse(yk+1…yN)
Lemma
• F(M/2, k) + B(M/2, k) is the score of the best alignment under the constraint that the alignment path passes through node (M/2, k)
• F(M, N) = max_{k = 0…N} ( F(M/2, k) + B(M/2, k) ); k* denotes the maximizing k
• Longest path from (0, 0) to (6, 6) is max_k ( LP(0,0,3,k) + LP(3,k,6,6) )
[Figure: grid from (0,0) to (6,6) with candidate midpoints (3,0), (3,2), (3,4), (3,6) on row 3.]
Linear-space alignment
Now, using 2 rows of space, we can compute F(M/2, k) and B(M/2, k) for k = 0…N
Linear-space alignment
Now, we can find k* maximizing F(M/2, k) + B(M/2, k)
Also, we can trace the path exiting row M/2 from k*
Conclusion: in O(NM) time and O(N) space, we have found where the optimal alignment path crosses row M/2
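Putting the forward pass, the backward pass and the recursion together gives a divide-and-conquer aligner. The following is a minimal sketch of that idea, not the lecture's reference code; the helper names `sub` and `last_row`, the closed-form base case for a single letter of x, and breaking ties by taking the smallest k* are assumptions.

```python
def sub(a, b, m, s):
    return m if a == b else -s


def last_row(x, y, m, s, d):
    """Last row of the score matrix for x vs y, using two rows of space."""
    prev = [-j * d for j in range(len(y) + 1)]
    for i in range(1, len(x) + 1):
        curr = [-i * d] + [0] * len(y)
        for j in range(1, len(y) + 1):
            curr[j] = max(prev[j - 1] + sub(x[i - 1], y[j - 1], m, s),
                          prev[j] - d, curr[j - 1] - d)
        prev = curr
    return prev


def hirschberg(x, y, m=1, s=1, d=1):
    """Return (aligned_x, aligned_y) for an optimal global alignment."""
    M, N = len(x), len(y)
    if M == 0:
        return "-" * N, y
    if N == 0:
        return x, "-" * M
    if M == 1:
        # Base case: pair x with its best partner in y, unless two gaps are cheaper.
        j = max(range(N), key=lambda t: sub(x, y[t], m, s))
        if sub(x, y[j], m, s) >= -2 * d:
            return "-" * j + x + "-" * (N - j - 1), y
        return x + "-" * N, "-" + y
    mid = M // 2
    fwd = last_row(x[:mid], y, m, s, d)                # F(M/2, k)
    bwd = last_row(x[mid:][::-1], y[::-1], m, s, d)    # B(M/2, k) is bwd[N - k]
    k_star = max(range(N + 1), key=lambda k: fwd[k] + bwd[N - k])
    left = hirschberg(x[:mid], y[:k_star], m, s, d)
    right = hirschberg(x[mid:], y[k_star:], m, s, d)
    return left[0] + right[0], left[1] + right[1]


print(hirschberg("ATA", "AGTA"))  # ('A-TA', 'AGTA')
```

Each call keeps only two rows of length N + 1 plus the pieces of the alignment it returns, so the score matrix is never stored in full. String slicing and reversal copy data, which a production version would avoid by passing index ranges instead.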
Analysis
• Memory: O(N) for computation, O(N+M) to store the optimal alignment
• Time:
– MN for the first iteration
– k M/2 + (N-k) M/2 = MN/2 for the second (the two subproblems have sizes M/2 x k and M/2 x (N-k))
– …
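If the pattern continues, with the total work halving at every level of the recursion, the running time is a geometric series. The short derivation below is a sketch of that standard argument rather than a transcription of the slide.

```latex
\[
T(M, N) \;\le\; MN + \frac{MN}{2} + \frac{MN}{4} + \cdots
\;=\; MN \sum_{t \ge 0} \frac{1}{2^{t}}
\;=\; 2MN \;=\; O(MN).
\]
```

So the linear-space algorithm costs at most about twice the time of the standard O(MN) dynamic program.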