4 -1 Chapter 4 The Sequence Alignment Problem

Post on 20-Dec-2015



The Longest Common Subsequence (LCS) Problem

A string: S1 = “TAGTCACG”

A subsequence of S1 is obtained by deleting 0 or more symbols (not necessarily consecutive) from S1, e.g. G, AGC, TATC, AGACG.

Common subsequences of S1 = “TAGTCACG” and S2 = “AGACTGTC”: GG, AGC, AGACG

Longest common subsequence (LCS):

S1: TAGTCACG

S2: AGACTGTC

LCS: AGACG


Applications of LCS

The edit distance of two strings or files (# of deletions and insertions; D = delete, M = match, I = insert):

S1: TAGTCAC-G--

S2: -AG--ACTGTC

Operation: DMMDDMMIMII

Spoken word recognition

Similarity of two biological sequences (DNA or protein)

Sequence alignment


The LCS Algorithm

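S1 = a1 a2 ... am and S2 = b1 b2 ... bn

A_{i,j} denotes the length of the longest common subsequence of a1 a2 ... ai and b1 b2 ... bj.

Dynamic programming:

$$
A_{i,j} =
\begin{cases}
A_{i-1,j-1} + 1 & \text{if } a_i = b_j \\
\max\{A_{i-1,j},\ A_{i,j-1}\} & \text{if } a_i \neq b_j
\end{cases}
$$

$$
A_{0,0} = A_{0,j} = A_{i,0} = 0 \quad \text{for } 1 \le i \le m,\ 1 \le j \le n.
$$

Time complexity: O(mn)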
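The recurrence can be sketched in a few lines of Python. This is only an illustration: the function names `lcs_table` and `lcs_length` are made up for this example and do not come from the slides.

```python
def lcs_table(s1, s2):
    """Return the (m+1) x (n+1) table A, where A[i][j] is the LCS length
    of the prefixes s1[:i] and s2[:j]."""
    m, n = len(s1), len(s2)
    # Row 0 and column 0 stay 0: the boundary condition A[0][j] = A[i][0] = 0.
    A = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if s1[i - 1] == s2[j - 1]:
                A[i][j] = A[i - 1][j - 1] + 1
            else:
                A[i][j] = max(A[i - 1][j], A[i][j - 1])
    return A


def lcs_length(s1, s2):
    """Length of the longest common subsequence, in O(mn) time."""
    return lcs_table(s1, s2)[len(s1)][len(s2)]


print(lcs_length("TAGTCACG", "AGACTGTC"))  # 5, the length of AGACG
```

Only the previous row is needed to compute the length, so the space can be reduced to O(n) when the subsequence itself is not required.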


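[Figure: matrix A, with entries A1,1, A1,2, A1,3, ... in the first row, A2,1, A2,2, ... and A3,1, ... in the following rows, and Am,n in the lower right corner.]

By dynamic programming, we can calculate matrix A starting at the upper left corner and ending at the lower right corner.

For simplicity, we can calculate it row by row or column by column.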


After matrix A has been found, we can trace back to find the LCS.

S1: TAGTCACG

S2: AGACTGTC

LCS: AGACG

Matrix A (rows: S1, columns: S2):

     -  A  G  A  C  T  G  T  C
-    0  0  0  0  0  0  0  0  0
T    0  0  0  0  0  1  1  1  1
A    0  1  1  1  1  1  1  1  1
G    0  1  2  2  2  2  2  2  2
T    0  1  2  2  2  3  3  3  3
C    0  1  2  2  3  3  3  3  4
A    0  1  2  3  3  3  3  3  4
C    0  1  2  3  4  4  4  4  4
G    0  1  2  3  4  4  5  5  5
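The trace-back step can be sketched as follows. The code is an illustrative Python version, not the slides' own; the function name `lcs` is hypothetical. When several longest common subsequences exist, the one returned depends on how ties are broken, so it may differ from the string shown on the slide while having the same length.

```python
def lcs(s1, s2):
    """Return one longest common subsequence of s1 and s2."""
    m, n = len(s1), len(s2)
    A = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if s1[i - 1] == s2[j - 1]:
                A[i][j] = A[i - 1][j - 1] + 1
            else:
                A[i][j] = max(A[i - 1][j], A[i][j - 1])

    # Trace back from the lower right corner to the upper left corner.
    out = []
    i, j = m, n
    while i > 0 and j > 0:
        if s1[i - 1] == s2[j - 1]:
            out.append(s1[i - 1])   # this symbol belongs to the LCS
            i -= 1
            j -= 1
        elif A[i - 1][j] >= A[i][j - 1]:
            i -= 1                  # the value came from the cell above
        else:
            j -= 1                  # the value came from the cell to the left
    return "".join(reversed(out))


print(lcs("TAGTCACG", "AGACTGTC"))
```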


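Edit Distance (1)

Goal: find a minimum-cost sequence of edit operations that transforms one string into the other.

S1: TAGTCAC-G--

S2: -AG--ACTGTC

Operation: DMMDDMMIMII

$$
c_{i,j} = \min
\begin{cases}
c_{i-1,j-1} + 0 & \text{if } a_i = b_j \quad \text{(Match)} \\
c_{i-1,j} + dist(a_i, -) & \text{(Delete)} \\
c_{i,j-1} + dist(-, b_j) & \text{(Insert)}
\end{cases}
$$

Suppose dist(a_i, -) = dist(-, b_j) = 1.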
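A small Python sketch of this insert/delete-only edit distance follows. The function name `edit_distance` and its cost parameters are illustrative assumptions. The boundary values c(i,0) = i and c(0,j) = j are taken from the first row and column of the example table on the next slide.

```python
def edit_distance(s1, s2, del_cost=1, ins_cost=1):
    """Minimum total cost of deletions and insertions that turn s1 into s2.
    A match costs 0; substitutions are not allowed in this model."""
    m, n = len(s1), len(s2)
    c = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        c[i][0] = c[i - 1][0] + del_cost      # delete a_1 .. a_i
    for j in range(1, n + 1):
        c[0][j] = c[0][j - 1] + ins_cost      # insert b_1 .. b_j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            best = min(c[i - 1][j] + del_cost,    # Delete a_i
                       c[i][j - 1] + ins_cost)    # Insert b_j
            if s1[i - 1] == s2[j - 1]:
                best = min(best, c[i - 1][j - 1])  # Match, cost 0
            c[i][j] = best
    return c[m][n]


print(edit_distance("TAGTCACG", "AGACTGTC"))  # 6
```

With unit costs the result equals m + n - 2·(LCS length), here 8 + 8 - 2·5 = 6, which is why the edit distance is listed as an application of LCS.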


Edit Distance(2)

S1: TAGTCAC-G--

S2: -AG--ACTGTC

Operation: DMMDDMMIMII

Matrix c (rows: S1, columns: S2):

     -  A  G  A  C  T  G  T  C
-    0  1  2  3  4  5  6  7  8
T    1  2  3  4  5  4  5  6  7
A    2  1  2  3  4  5  6  7  8
G    3  2  1  2  3  4  5  6  7
T    4  3  2  3  4  3  4  5  6
C    5  4  3  4  3  4  5  6  5
A    6  5  4  3  4  5  6  7  6
C    7  6  5  4  3  4  5  6  7
G    8  7  6  5  4  5  4  5  6

[Figure: cell c_{i,j} is computed from its neighbors c_{i-1,j-1}, c_{i-1,j} and c_{i,j-1}.]


The Longest Increasing Subsequence (LIS) Problem

Definition:

Input: One numeric sequence S

Output: The longest increasing subsequence in S

Example: Given S = 35274816, the LIS in S is 3578.

By applying the LCS algorithm, this problem can be solved in O(n^2) time. (Why?)

The Robinson-Schensted-Knuth algorithm can solve the LIS problem in O(n log n) time.

(See the example on the next page.)


Robinson-Schensted-Knuth Algorithm for LIS

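Each column shows the list L after the corresponding input number has been inserted:

Input  3  5  2  7  4  8  1  6
L1     3  3  2  2  2  2  1  1
L2        5  5  5  4  4  4  4
L3              7  7  7  7  6
L4                    8  8  8

LIS: 3578

Time complexity: O(n log n)

n numbers are inserted and each insertion takes O(log n) time for binary search.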
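The insertion process can be sketched in Python with the standard `bisect` module. This is an illustrative version: the back-pointer array `prev` used to recover the subsequence itself is an addition for this example and does not appear on the slide.

```python
from bisect import bisect_left


def lis(seq):
    """Return one longest strictly increasing subsequence of seq in O(n log n)."""
    tails = []       # tails[k]: smallest tail of an increasing subsequence of length k+1
    tails_idx = []   # index in seq of the element stored in tails[k]
    prev = [-1] * len(seq)   # back-pointers used to rebuild the answer
    for i, x in enumerate(seq):
        k = bisect_left(tails, x)        # binary search for the slot of x
        if k == len(tails):
            tails.append(x)              # x extends the longest subsequence so far
            tails_idx.append(i)
        else:
            tails[k] = x                 # x becomes a better (smaller) tail
            tails_idx[k] = i
        prev[i] = tails_idx[k - 1] if k > 0 else -1
    out = []
    i = tails_idx[-1] if tails_idx else -1
    while i != -1:
        out.append(seq[i])
        i = prev[i]
    return out[::-1]


print(lis([3, 5, 2, 7, 4, 8, 1, 6]))  # [3, 5, 7, 8]
```

Note that the final contents of `tails` (1, 4, 6, 8 in this example) give the length of the LIS but are not themselves an increasing subsequence of the input, which is why the back-pointers are kept.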


Hunt-Szymanski LCS Algorithm

By extending the idea in the RSK algorithm, the LCS problem can be solved in O(r log n) time, where r denotes the number of matches.

This algorithm is faster than traditional dynamic programming if r is small.


The Pairs of Matching

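Input sequences: TAGTCACG and AGACTGTC

Pairs of matching (i, j), where the i-th symbol of S1 equals the j-th symbol of S2 = A G A C T G T C:

T (row 1): (1,5) (1,7)
A (row 2): (2,1) (2,3)
G (row 3): (3,2) (3,6)
T (row 4): (4,5) (4,7)
C (row 5): (5,4) (5,8)
A (row 6): (6,1) (6,3)
C (row 7): (7,4) (7,8)
G (row 8): (8,2) (8,6)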


Example for Hunt-Szymanski Algorithm

Each column shows the list L after the corresponding match has been inserted:

Insert  (1,7) (1,5) (2,3) (2,1) (3,6) (3,2) (4,7) (4,5) (5,8) (5,4)
L1      (1,7) (1,5) (2,3) (2,1) (2,1) (2,1) (2,1) (2,1) (2,1) (2,1)
L2                              (3,6) (3,2) (3,2) (3,2) (3,2) (3,2)
L3                                          (4,7) (4,5) (4,5) (5,4)
L4                                                      (5,8) (5,8)

The matches are inserted in row-major order, and within each row from the largest column index to the smallest.

Exercise: Fill in the remaining columns (rows 6 to 8) yourself.

Time complexity: O(r log n), r: # of matches

Each match needs O(log n) time for binary search.
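The insertion scheme of the example can be sketched in Python. This is a simplified illustration that returns only the LCS length; the function name `hunt_szymanski_length` and the list name `thresh` are assumptions made for this example. The list `thresh` stores only the column index j of each best tail.

```python
from bisect import bisect_left
from collections import defaultdict


def hunt_szymanski_length(s1, s2):
    """LCS length computed from the matching pairs only, in O((r + n) log n)."""
    # Positions (1-based, increasing) at which each symbol occurs in s2.
    pos = defaultdict(list)
    for j, ch in enumerate(s2, start=1):
        pos[ch].append(j)

    # thresh[k]: smallest column j such that some common subsequence of
    # length k+1 ends with a match in column j.
    thresh = []
    for ch in s1:                              # row-major order
        for j in reversed(pos.get(ch, [])):    # columns visited backward
            k = bisect_left(thresh, j)         # binary search, O(log n)
            if k == len(thresh):
                thresh.append(j)
            else:
                thresh[k] = j
    return len(thresh)


print(hunt_szymanski_length("TAGTCACG", "AGACTGTC"))  # 5
```

Visiting the columns of one row backward ensures that two matches from the same row are never chained together, since a smaller column inserted later can only replace an entry, not extend the chain started by a larger column in the same row.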


The Longest Common Increasing Subsequence (LCIS) Problem

Definition:

Input: Two numeric sequences S1, S2

Output: The longest common increasing subsequence of S1 and S2.

Example: Given S1 = 35274816 and S2 = 51724863, the LCIS of S1 and S2 is 246.

This problem can be solved by applying the RSK algorithm on the table for finding the LCS (Chao’s algorithm).

(See the example on the next page.)


Chao’s Algorithm for LCIS

Columns: S1 = 3 5 2 7 4 8 1 6. Rows: S2 = 5 1 7 2 4 8 6 3. Each cell lists the best tails L1, L2, L3 (in this order); "-" means that no common increasing subsequence exists yet.

        3      5      2      7      4      8      1      6
5       -      5      5      5      5      5      5      5
1       -      5      5      5      5      5      1      1
7       -      5      5      5,7    5,7    5,7    1,7    1,7
2       -      5      2      2,7    2,7    2,7    1,7    1,7
4       -      5      2      2,7    2,4    2,4    1,4    1,4
8       -      5      2      2,7    2,4    2,4,8  1,4,8  1,4,8
6       -      5      2      2,7    2,4    2,4,8  1,4,8  1,4,6
3       3      3      2      2,7    2,4    2,4,8  1,4,8  1,4,6


Analysis for Chao’s Algorithm

There are two types of operations to update the best tails: insert (match) and merge (mismatch).

A direct implementation takes O(n^3) time, since each operation costs O(n).

However, it can be shown that each merge can be done in constant time. Also, all insertions in a row take O(n) time in total.

Thus, this is an O(n^2) algorithm.
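For comparison, the LCIS length can also be computed with a well-known textbook O(mn) dynamic program. The sketch below is that standard DP, not Chao's best-tails algorithm from the slides; it is included only as a way to check results such as the example above. The function name `lcis_length` is hypothetical.

```python
def lcis_length(a, b):
    """Length of the longest common (strictly) increasing subsequence of a and b.
    Standard O(len(a) * len(b)) dynamic program, not Chao's algorithm."""
    # f[j]: length of the best common increasing subsequence of the
    # processed prefix of a and of b that ends exactly at b[j].
    f = [0] * len(b)
    for x in a:
        best = 0   # best f[j'] seen so far in this pass with b[j'] < x
        for j, y in enumerate(b):
            if y == x:
                f[j] = max(f[j], best + 1)   # extend a subsequence with x
            elif y < x:
                best = max(best, f[j])       # b[j] may precede x
    return max(f, default=0)


s1 = [3, 5, 2, 7, 4, 8, 1, 6]
s2 = [5, 1, 7, 2, 4, 8, 6, 3]
print(lcis_length(s1, s2))  # 3, e.g. 2 4 6
```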


The Constrained Longest Common Subsequence (CLCS) Problem

Definition:

Input: Two sequences S1, S2, and a constraint sequence C.

Output: The longest common subsequence of S1 and S2 that contains C as a subsequence.

Example: Given S1 = TAGTCACG, S2 = AGACTGTC and C = AT, the CLCS of S1 and S2 is AGTG. (The LCS is AGACG.)

Purpose: From a biological perspective, we can specify the functional sites in the input sequences by setting proper constraints.


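The CLCS Algorithm

S1 = a1 a2 ... am, S2 = b1 b2 ... bn and C = c1 c2 ... cr

R_{k,i,j} denotes the length of the longest common subsequence of a1 a2 ... ai and b1 b2 ... bj that contains c1 c2 ... ck as a subsequence.

Dynamic programming:

$$
R_{k,i,j} =
\begin{cases}
R_{k-1,i-1,j-1} + 1 & \text{if } c_k = a_i = b_j \\
R_{k,i-1,j-1} + 1 & \text{if } c_k \neq a_i = b_j \\
\max\{R_{k,i-1,j},\ R_{k,i,j-1}\} & \text{if } a_i \neq b_j
\end{cases}
$$

$$
R_{k,0,0} = R_{k,i,0} = R_{k,0,j} = -\infty \quad \text{for } 1 \le k \le r,\ 1 \le i \le m,\ 1 \le j \le n.
$$

R_{0,i,j} = A_{i,j} (LCS without constraint; see the previous pages)

Time complexity: O(rnm)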
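A direct Python transcription of this recurrence might look as follows. It is a sketch under the stated boundary conditions; the function name `clcs_length` is hypothetical, and `float("-inf")` stands for the value -∞.

```python
NEG_INF = float("-inf")


def clcs_length(s1, s2, c):
    """Length of the longest common subsequence of s1 and s2 that contains c
    as a subsequence, or -inf if no such common subsequence exists."""
    m, n, r = len(s1), len(s2), len(c)
    # R[k][i][j] as defined on the slide; layer k = 0 is the plain LCS table.
    R = [[[0] * (n + 1) for _ in range(m + 1)] for _ in range(r + 1)]
    for k in range(r + 1):
        for i in range(m + 1):
            for j in range(n + 1):
                if i == 0 or j == 0:
                    R[k][i][j] = 0 if k == 0 else NEG_INF
                elif s1[i - 1] == s2[j - 1]:
                    if k > 0 and c[k - 1] == s1[i - 1]:
                        R[k][i][j] = R[k - 1][i - 1][j - 1] + 1
                    else:
                        R[k][i][j] = R[k][i - 1][j - 1] + 1
                else:
                    R[k][i][j] = max(R[k][i - 1][j], R[k][i][j - 1])
    return R[r][m][n]


print(clcs_length("TAGTCACG", "AGACTGTC", "AT"))  # 4, e.g. AGTG
```

Adding 1 to -∞ leaves -∞, so infeasible states propagate without any special handling.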


Example for CLCS Algorithm

Input: S1 = TAGTCACG, S2 = AGACTGTC and C = AT

CLCS of S1 and S2 with constraint C (X means -∞):

k = 0

     -  A  G  A  C  T  G  T  C
-    0  0  0  0  0  0  0  0  0
T    0  0  0  0  0  1  1  1  1
A    0  1  1  1  1  1  1  1  1
G    0  1  2  2  2  2  2  2  2
T    0  1  2  2  2  3  3  3  3
C    0  1  2  2  3  3  3  3  4
A    0  1  2  3  3  3  3  3  4
C    0  1  2  3  4  4  4  4  4
G    0  1  2  3  4  4  5  5  5

k = 1 (constraint A)

     -  A  G  A  C  T  G  T  C
-    X  X  X  X  X  X  X  X  X
T    X  X  X  X  X  X  X  X  X
A    X  1  1  1  1  1  1  1  1
G    X  1  2  2  2  2  2  2  2
T    X  1  2  2  2  3  3  3  3
C    X  1  2  2  3  3  3  3  4
A    X  1  2  3  3  3  3  3  4
C    X  1  2  3  4  4  4  4  4
G    X  1  2  3  4  4  5  5  5

k = 2 (constraint T)

     -  A  G  A  C  T  G  T  C
-    X  X  X  X  X  X  X  X  X
T    X  X  X  X  X  X  X  X  X
A    X  X  X  X  X  X  X  X  X
G    X  X  X  X  X  X  X  X  X
T    X  X  X  X  X  3  3  3  3
C    X  X  X  X  X  3  3  3  4
A    X  X  X  X  X  3  3  3  4
C    X  X  X  X  X  3  3  3  4
G    X  X  X  X  X  3  4  4  4

Following the links back through the tables, we obtain the CLCS AGTG.


Sequence Alignment

S1 = TAGTCACG

S2 = AGACTGTC

Alignment 1:

----TAGTCACG
AGACT-GTC---

Alignment 2:

TAGTCAC-G--
-AG--ACTGTC

Which one is better? We can set different gap penalties as parameters for different purposes.


Sequence Alignment Problem

Definition:

Input: Two (or more) sequences S1, S2, …, Sn, and a scoring function f.

Output: The alignment of S1, S2, …, Sn which has the optimal score.

Purpose:

To determine how close two species are
To perform data compression
To determine the common area of some sequences
To construct evolutionary trees


Gap Penalty

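$$
A(i,j) = \max
\begin{cases}
A(i-1,j-1) + \sigma(a_i, b_j) \\
A(i-1,j) + \sigma(a_i, -) \\
A(i,j-1) + \sigma(-, b_j)
\end{cases}
$$

$$
A(i,0) = i \cdot \sigma(x, -), \qquad A(0,j) = j \cdot \sigma(-, x)
$$

σ(x, -) or σ(-, x) is the gap penalty.

Suppose

$$
\sigma(x,y) =
\begin{cases}
2 & \text{if } x = y \\
-1 & \text{if } x \neq y \ (\text{including } -)
\end{cases}
$$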


Example for Sequence Alignment

TAGTCAC-G--

-AG--ACTGTC

Matrix A (rows: S1 = TAGTCACG, columns: S2 = AGACTGTC):

      -   A   G   A   C   T   G   T   C
-     0  -1  -2  -3  -4  -5  -6  -7  -8
T    -1  -1  -2  -3  -4  -2  -3  -4  -5
A    -2   1   0   0  -1  -2  -3  -4  -5
G    -3   0   3   2   1   0   0  -1  -2
T    -4  -1   2   2   1   3   2   2   1
C    -5  -2   1   1   4   3   2   1   4
A    -6  -3   0   3   3   3   2   1   3
C    -7  -4  -1   2   5   4   3   2   3
G    -8  -5  -2   1   4   4   6   5   4
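The table in this example can be reproduced with a short Python sketch of the global alignment recurrence, using a score of 2 for a match and -1 for a mismatch or a gap. The function name `global_alignment_score` and its parameter names are illustrative, not taken from the slides.

```python
def global_alignment_score(s1, s2, match=2, mismatch=-1, gap=-1):
    """Optimal global alignment score under a linear gap penalty."""
    m, n = len(s1), len(s2)
    A = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        A[i][0] = i * gap          # a_1 .. a_i aligned with gaps only
    for j in range(1, n + 1):
        A[0][j] = j * gap          # b_1 .. b_j aligned with gaps only
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if s1[i - 1] == s2[j - 1] else mismatch
            A[i][j] = max(A[i - 1][j - 1] + s,    # a_i aligned with b_j
                          A[i - 1][j] + gap,      # a_i aligned with -
                          A[i][j - 1] + gap)      # - aligned with b_j
    return A[m][n]


print(global_alignment_score("TAGTCACG", "AGACTGTC"))  # 4
```

The value 4 is the lower right entry of the table and equals the score of the alignment TAGTCAC-G-- / -AG--ACTGTC: five matches (+10) and six gap positions (-6).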


PAM250 Score Matrix

A C D E F G H I K L M N P Q R S T V W Y

A 2

C -2 12

D 0 -5 4

E 0 -5 3 4

F -4 -4 -6 -5 9

G 1 -3 1 0 -5 5

H -1 -3 1 1 -2 -2 6

I -1 -2 -2 -2 1 -3 -2 5

K -1 -5 0 0 -5 -2 0 -2 5

L -2 -6 -4 -3 2 -4 -2 2 -3 6

M -1 -5 -3 -2 0 -3 -2 2 0 4 6

N 0 -4 2 1 -4 0 2 -2 1 -3 -2 2

P 1 -3 -1 -1 -5 -1 0 -2 -1 -3 -2 -1 6

Q 0 -5 2 2 -5 -1 3 -2 1 -2 -1 1 0 4

R -2 -4 -1 -1 -4 -3 2 -2 3 -3 0 0 0 1 6

S 1 0 0 0 -3 1 -1 -1 0 -3 -2 1 1 -1 0 2

T 1 -2 0 0 -3 0 -1 0 0 -2 -1 0 0 -1 -1 1 3

V 0 -2 -2 -2 -1 -1 -2 4 -2 2 2 -2 -1 -2 -2 -1 0 4

W -6 -8 -7 -7 0 -7 -3 -5 -3 -2 -4 -4 -6 -5 2 -2 -5 -6 17

Y -3 0 -4 -4 7 -5 0 -1 -4 -1 -2 -2 -5 -4 -4 -3 -3 -2 0 10


Blosum62 Score Matrix

A C D E F G H I K L M N P Q R S T V W Y

A 4

C 0 9

D -2 -3 6

E -1 -4 2 5

F -2 -2 -3 -3 6

G 0 -3 -1 -2 -3 6

H -2 -3 1 0 -1 -2 8

I -1 -1 -3 -3 0 -4 -3 4

K -1 -3 -1 1 -3 -2 -1 -3 5

L -1 -1 -4 -3 0 -4 -3 2 -2 4

M -1 -1 -3 -2 0 -3 -2 1 -1 2 5

N -2 -3 1 0 -3 0 -1 -3 0 -3 -2 6

P -1 -3 -1 -1 -4 -2 -2 -3 -1 -3 -2 -1 7

Q -1 -3 0 2 -3 -2 0 -3 1 -2 0 0 -1 5

R -1 -3 -2 0 -3 -2 0 -3 2 -2 -1 0 -2 1 5

S 1 -1 0 0 -2 0 -1 -2 0 -2 -1 1 -1 0 -1 4

T -1 -1 1 0 -2 1 0 -2 0 -2 -1 0 1 0 -1 1 4

V 0 -1 -3 -2 -1 -3 -3 3 -2 1 1 -3 -2 -2 -3 -2 -2 4

W -3 -2 -4 -3 1 -2 -2 -3 -3 -2 -1 -4 -4 -2 -3 -3 -3 -3 11

Y -2 -2 -3 -2 3 -3 2 -1 -2 -1 -1 -2 -3 -1 -2 -2 -2 -1 2 7


The Local Alignment Problem

Input: Two (or more) sequences S1, S2, …, Sn, and a scoring function f.

Output: Substrings Si’ of Si (1 ≤ i ≤ n) such that the score obtained by aligning the Si’ is the highest among all possible substrings of the Si.

S1 = abbbcc

S2 = adddcc

Score = 3×2 + 3×(–1) = 3

S1’ = cc

S2’ = cc

Score = 2×2 = 4


Dynamic Programming for Local Alignment

$$
A(i,j) = \max
\begin{cases}
0 \\
A(i-1,j-1) + \sigma(a_i, b_j) \\
A(i-1,j) + \sigma(a_i, -) \\
A(i,j-1) + \sigma(-, b_j)
\end{cases}
$$

$$
A(i,0) = 0, \qquad A(0,j) = 0
$$

Once the score becomes negative, we reset it to 0.
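A minimal Python sketch of this local alignment recurrence follows, assuming a score of 2 for a match and -1 for a mismatch or a gap, as in the global alignment example. The function name `local_alignment_score` is hypothetical.

```python
def local_alignment_score(s1, s2, match=2, mismatch=-1, gap=-1):
    """Best local alignment score: the maximum entry of the table A."""
    m, n = len(s1), len(s2)
    A = [[0] * (n + 1) for _ in range(m + 1)]   # A(i,0) = A(0,j) = 0
    best = 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if s1[i - 1] == s2[j - 1] else mismatch
            A[i][j] = max(0,                      # reset a negative score to 0
                          A[i - 1][j - 1] + s,
                          A[i - 1][j] + gap,
                          A[i][j - 1] + gap)
            best = max(best, A[i][j])
    return best


print(local_alignment_score("abbbcc", "adddcc"))      # 4, from aligning cc with cc
print(local_alignment_score("TAGTCACG", "AGACTGTC"))  # 7
```

Unlike the global case, the answer is the maximum over all cells rather than the lower right corner, and the trace-back stops at the first cell whose value is 0.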


Example for Local Alignment

Matrix A (rows: S1 = TAGTCACG, columns: S2 = AGACTGTC):

     -  A  G  A  C  T  G  T  C
-    0  0  0  0  0  0  0  0  0
T    0  0  0  0  0  2  1  2  1
A    0  2  1  2  1  1  1  1  1
G    0  1  4  3  2  1  3  2  1
T    0  0  3  3  2  4  3  5  4
C    0  0  2  2  5  4  3  4  7
A    0  2  1  4  4  4  3  3  6
C    0  1  1  3  6  5  4  3  5
G    0  0  3  2  5  5  7  6  5

Two solutions (both with score 7):

AGTCAC-G
AG--ACTG

TAGTC
T-GTC


The Affine Gap Penalty

S1 = ACTTGATCC

S2 = AGTTAGTAGTCC

An optimal alignment:

S1 = ACTT-G-A-TCC

S2 = AGTTAGTAGTCC

Original score = 12

The following alignment may be better because there is only one gap.

S1 = ACTT---GATCC

S2 = AGTTAGTAGTCC

Original score = 6


Definition of Affine Gap Penalty

A gap is caused by a mutational event which removes a sequence of residues.

One long gap is often preferable to several short gaps.

An affine gap penalty is defined as Pg + k·Pe for a gap with k spaces (k ≥ 1), where Pg, Pe ≥ 0.

Pg is related to the initiation of a gap and Pe is related to the length of the gap.


Suppose that Pg = 4 and Pe = 1.

S1 = ACTTGATCC

S2 = AGTTAGTAGTCC

S1 = ACTT-G-A-TCC

S2 = AGTTAGTAGTCC

Score = 8×2 – 1×1 – 3×(4+1×1) = 0

S1 = ACTT---GATCC

S2 = AGTTAGTAGTCC

Score = 6×2 – 3×1 – (4+3×1) = 2
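The two scores can be checked with a small Python helper that scores a given alignment under an affine gap penalty. This is an illustrative sketch: the function name `affine_alignment_score` is hypothetical, and it assumes a score of 2 for a match and -1 for a mismatch, as in the earlier examples.

```python
def affine_alignment_score(row1, row2, match=2, mismatch=-1, pg=4, pe=1):
    """Score of a given alignment (two equal-length rows with '-' for spaces).
    A gap of k consecutive spaces in one row is charged pg + k * pe."""
    assert len(row1) == len(row2)
    score = 0
    prev_gap = None            # which row held '-' in the previous column
    for x, y in zip(row1, row2):
        if x == "-" or y == "-":
            side = 1 if x == "-" else 2
            if side != prev_gap:
                score -= pg    # a new gap is opened
            score -= pe        # every space pays the extension penalty
            prev_gap = side
        else:
            score += match if x == y else mismatch
            prev_gap = None
    return score


print(affine_alignment_score("ACTT-G-A-TCC", "AGTTAGTAGTCC"))  # 0
print(affine_alignment_score("ACTT---GATCC", "AGTTAGTAGTCC"))  # 2
```

Setting `pg=0` gives back the original (linear) scores 12 and 6 of the two alignments.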


Algorithm for Affine Gap Penalty

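$$
\begin{aligned}
& A(0,0) = A_2(0,0) = A_3(0,0) = 0 \\
& A(i,0) = A_2(i,0) = -P_g - i P_e \quad \text{for } i > 0 \\
& A(0,j) = A_3(0,j) = -P_g - j P_e \quad \text{for } j > 0 \\
& A_2(0,j) = A_3(i,0) = -\infty \quad \text{for } i, j > 0 \\
& A(i,j) = \max\{A_1(i,j),\ A_2(i,j),\ A_3(i,j)\} \\
& A_1(i,j) = A(i-1,j-1) + \sigma(a_i, b_j) \\
& A_2(i,j) = \max\{A_2(i-1,j) - P_e,\ A(i-1,j) - P_g - P_e\} \\
& A_3(i,j) = \max\{A_3(i,j-1) - P_e,\ A(i,j-1) - P_g - P_e\}
\end{aligned}
$$

A(i,j) is the score of the optimal alignment of a1 a2 ... ai and b1 b2 ... bj.

A1(i,j) is for the case that ai is aligned with bj.

A2(i,j) is for the case that ai is aligned with -.

A3(i,j) is for the case that - is aligned with bj.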
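The three-table scheme can be sketched in Python as follows. This is an illustrative version under the boundary conditions of the slide, assuming a score of 2 for a match and -1 for a mismatch; the function name `affine_gap_score` is hypothetical. A1 is not stored because it is used only once per cell.

```python
NEG_INF = float("-inf")


def affine_gap_score(s1, s2, match=2, mismatch=-1, pg=4, pe=1):
    """Optimal global alignment score with affine gap penalty pg + k * pe."""
    m, n = len(s1), len(s2)
    A = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    A2 = [[NEG_INF] * (n + 1) for _ in range(m + 1)]   # a_i aligned with -
    A3 = [[NEG_INF] * (n + 1) for _ in range(m + 1)]   # - aligned with b_j
    A[0][0] = A2[0][0] = A3[0][0] = 0
    for i in range(1, m + 1):
        A[i][0] = A2[i][0] = -pg - i * pe
    for j in range(1, n + 1):
        A[0][j] = A3[0][j] = -pg - j * pe
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if s1[i - 1] == s2[j - 1] else mismatch
            a1 = A[i - 1][j - 1] + s                       # a_i aligned with b_j
            A2[i][j] = max(A2[i - 1][j] - pe,              # extend a gap
                           A[i - 1][j] - pg - pe)          # open a new gap
            A3[i][j] = max(A3[i][j - 1] - pe,
                           A[i][j - 1] - pg - pe)
            A[i][j] = max(a1, A2[i][j], A3[i][j])
    return A[m][n]


print(affine_gap_score("ACTTGATCC", "AGTTAGTAGTCC"))
```

For the two sequences of the example the result is at least 2, the score of the single-gap alignment shown on the previous slide.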


Multiple Sequence Alignment (MSA)

Suppose three sequences are involved:

S1 = ATTCGAT

S2 = TTGAG

S3 = ATGCT

A very good alignment:

S1 = ATTCGAT

S2 = -TT-GAG

S3 = AT--GCT

In fact, the alignment that the above induces between every pair of sequences is also good.


Complexity of MSA

2-sequence alignment problem:

A(i,j) is computed from A(i-1,j-1), A(i-1,j) and A(i,j-1).

Time complexity: O(n^2)

3-sequence alignment problem:

σ(x,y,z) has to be defined.

A(i,j,k) is computed from A(i-1,j,k), A(i,j-1,k), A(i,j,k-1), A(i-1,j-1,k), A(i-1,j,k-1), A(i,j-1,k-1) and A(i-1,j-1,k-1).

Time complexity: O(n^3)

k-sequence alignment problem: O(n^k)


The Star Algorithm for MSA

Proposed by Gusfield

An approximation algorithm for the sum of pairs multiple sequence alignment problem

Let σ(x,y) = 0 if x = y and σ(x,y) = 1 if x ≠ y.

S1 = GCCAT
S2 = G--AT
distance = 2

S1 = GCCAT
S2 = GA--T
distance = 3

Given an alignment S'_1 = a'_1, a'_2, ..., a'_{n'} and S'_2 = b'_1, b'_2, ..., b'_{n'}, the distance induced by the alignment is defined as

$$
d(S_i, S_j) = \sum_{t=1}^{n'} \sigma(a'_t, b'_t)
$$
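The induced distance is simply the number of columns in which the two aligned rows differ. A short Python sketch (the function name `induced_distance` is hypothetical):

```python
def induced_distance(row1, row2):
    """Distance induced by an alignment: each column with two different
    symbols (a space '-' counts as a symbol) contributes 1."""
    assert len(row1) == len(row2)
    return sum(1 for x, y in zip(row1, row2) if x != y)


print(induced_distance("GCCAT", "G--AT"))  # 2
print(induced_distance("GCCAT", "GA--T"))  # 3
```

A column in which both rows hold a space contributes 0, which matters when two rows are taken out of a multiple alignment.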


Properties of d(Si,Sj):

d(Si,Si) = 0

Triangular inequality: d(Si,Sj) + d(Si,Sk) ≥ d(Sj,Sk)

Given two sequences Si and Sj, the minimum distance is denoted as D(Si,Sj).

D(Si,Sj) ≤ d(Si,Sj)

[Figure: a triangle on the vertices i, j and k, with its edges labeled by distances.]


Example for the Star Algorithm

S1 = ATGCTC

S2 = AGAGC

S3 = TTCTG

S4 = ATTGCATGC

Try to align every pair of sequences:

S1= ATGCTC

S2= A-GAGC

D(S1,S2) = 3

S1= ATGCTC

S3= TT-CTG

D(S1,S3) = 3


S1= AT-GC-T-C

S4= ATTGCATGC

D(S1,S4) = 3

S2= A--G-A-GC

S4= ATTGCATGC

D(S2,S4) = 4

S2= AGAGC

S3= TTCTG

D(S2,S3) = 5

S3= -TT-C-TG-

S4= ATTGCATGC

D(S3,S4) = 4


Given a set S of k sequences, the center of this set of sequences is the sequence S_i which minimizes

$$
\sum_{X \in S \setminus \{S_i\}} D(S_i, X)
$$

D(S1,S2)+D(S1,S3)+D(S1,S4) = 9

D(S2,S1)+D(S2,S3)+D(S2,S4) = 12

D(S3,S1)+D(S3,S2)+D(S3,S4) = 12

D(S4,S1)+D(S4,S2)+D(S4,S3) = 11

S1 is selected as the center since S1 is the most similar to the others.
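Choosing the center amounts to taking the row of the pairwise distance matrix with the smallest sum. A minimal Python sketch, using the six pairwise distances of this example (the function name `choose_center` is hypothetical, and indices are 0-based, so index 0 stands for S1):

```python
def choose_center(D):
    """Return (index of the center, list of row sums) for a symmetric
    matrix D of optimal pairwise distances with a zero diagonal."""
    sums = [sum(row) for row in D]
    center = min(range(len(D)), key=sums.__getitem__)
    return center, sums


# Pairwise distances D(Si,Sj) of the example, rows and columns S1..S4.
D = [
    [0, 3, 3, 3],
    [3, 0, 5, 4],
    [3, 5, 0, 4],
    [3, 4, 4, 0],
]
print(choose_center(D))  # (0, [9, 12, 12, 11]): S1 is the center
```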


S1 has been selected as the center. Align S2 with S1:

S1 = ATGCTC

S2 = A-GAGC

Adding S3 by aligning S3 with S1:

S1 = ATGCTC

S2 = A-GAGC

S3 = -TTCTG

Adding S4 by aligning S4 with S1:

S1 = AT-GC-T-C

S2 = A--GA-G-C

S3 = -T-TC-T-G

S4 = ATTGCATGC


Approximation Rate

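$$
App \le 2 \cdot Opt
$$

(See the proof in the lecture notes.)

$$
App = \sum_{i=1}^{k} \sum_{\substack{j=1 \\ j \neq i}}^{k} d(S_i, S_j) \quad \text{(star alignment)}
$$

$$
Opt = \sum_{i=1}^{k} \sum_{\substack{j=1 \\ j \neq i}}^{k} d^{*}(S_i, S_j) \quad \text{(optimal sum of pairs MSA)}
$$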


The MST Preservation for MSA

In Gusfield’s star algorithm, the alignments between the center and all other sequences are optimal. Thus, (k–1) distances are preserved.

MST preservation preserves the distances on the edges of the minimal spanning tree.

D: distance matrix based upon optimal alignments between every pair of input sequences.

Dm: distance matrix based upon a multiple sequence alignment

MST(D): MST based on D

MST(Dm): MST based on Dm

Goal: MST(D) = MST(Dm)


Example for MST Preservation

Input:

S1 = ATGCTC

S2 = ATGAGC

S3 = TTCTG

S4 = ATGCATGC

Step 1: Find the pairwise distances optimally by the dynamic programming algorithm.

S1 = ATGCTC

S2 = ATGAGC

D(S1,S2) = 2

S1= ATGCTC

S3= TT-CTG

D(S1,S3) = 3


S1= ATGC-T-C

S4= ATGCATGC

D(S1,S4) = 2

S2= ATG-A-GC

S4= ATGCATGC

D(S2,S4) = 2

S2= ATGAGC

S3= TTCTG-

D(S2,S3) = 4

S3= -TTC-TG-

S4= ATGCATGC

D(S3,S4) = 4

Distance matrix D:

      S1  S2  S3  S4
S1         2   3   2
S2             4   2
S3                 4
S4


Step 2: Find the minimal spanning tree based on matrix D.

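Distance matrix D:

      S1  S2  S3  S4
S1         2   3   2
S2             4   2
S3                 4
S4

[Figure: the minimal spanning tree, with edges (S1,S2) of weight 2, (S2,S4) of weight 2 and (S1,S3) of weight 3.]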
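Step 2 can be sketched with Prim's algorithm on the distance matrix D. This is only an illustration: the slides do not say which MST algorithm is used, and the function name `prim_mst` is hypothetical. Indices are 0-based, so index 0 stands for S1.

```python
def prim_mst(D):
    """Return the edges (weight, u, v) of a minimal spanning tree of the
    complete graph whose symmetric weight matrix is D (Prim's algorithm)."""
    n = len(D)
    in_tree = [False] * n
    in_tree[0] = True                  # start the tree at vertex 0
    edges = []
    for _ in range(n - 1):
        best = None
        for u in range(n):
            if not in_tree[u]:
                continue
            for v in range(n):
                if in_tree[v]:
                    continue
                if best is None or D[u][v] < best[0]:
                    best = (D[u][v], u, v)   # lightest edge leaving the tree
        edges.append(best)
        in_tree[best[2]] = True
    return edges


# Distance matrix D of the example, rows and columns S1..S4.
D = [
    [0, 2, 3, 2],
    [2, 0, 4, 2],
    [3, 4, 0, 4],
    [2, 2, 4, 0],
]
tree = prim_mst(D)
print(sum(w for w, _, _ in tree))  # 7
```

Because D(S1,S4) = D(S2,S4) = 2, the minimal spanning tree is not unique here. Depending on how ties are broken, S4 may be attached to S1 or to S2; the total weight is 7 in either case.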


Step 3: Align the pairs of sequences corresponding to the edges of the MST optimally.

For e(S1, S2):

S1 = ATGCTC
S2 = ATGAGC

For e(S2, S4):

S1 = ATG-C-TC
S2 = ATG-A-GC
S4 = ATGCATGC

For e(S1, S3):

S1 = ATG-C-TC
S2 = ATG-A-GC
S3 = TT--C-TG
S4 = ATGCATGC

Step 4: Output the above as the final alignment.


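MST Preservation

Distance matrix Dm and the minimal spanning tree based on Dm:

      S1  S2  S3  S4
S1         2   3   4
S2             5   2
S3                 7
S4

[Figure: the minimal spanning tree based on Dm, with edges (S1,S2) of weight 2, (S2,S4) of weight 2 and (S1,S3) of weight 3.]

Theorem: MST(D) is equal to MST(Dm).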