algorithms in bioinformatics: a practical introduction

111
Algorithms in Bioinformatics: A Practical Introduction Suffix tree

Upload: lolita

Post on 23-Jan-2016

74 views

Category:

Documents


0 download

DESCRIPTION

Algorithms in Bioinformatics: A Practical Introduction. Suffix tree. Overview. What is suffix tree? Simple application of suffix tree Linear time algorithm for constructing suffix tree Suffix array FM-index 1-mismatch search. Suffix. 1. acacag$. 2. cacag$. 3. acag$. 4. cag$. 5. - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Algorithms in Bioinformatics: A Practical Introduction

Algorithms in Bioinformatics: A Practical Introduction

Suffix tree

Page 2: Algorithms in Bioinformatics: A Practical Introduction

Overview What is suffix tree? Simple application of suffix tree Linear time algorithm for

constructing suffix tree Suffix array FM-index 1-mismatch search

Page 3: Algorithms in Bioinformatics: A Practical Introduction

Suffix Trie

E.g. consider the string S = acacag$Suffix Trie: a ties of all possible suffices of S

$ a c

a

g

c

a

g

c g

7

c g

1

3

5

2

4

6

a

gg

a

$7g$6ag$5cag$4acag$3cacag$2acacag$1Suffix

$

$

$ $

$

$

Page 4: Algorithms in Bioinformatics: A Practical Introduction

Suffix TreeSuffix tree for S=acacag$: merge nodes with only one

child 1 2 3 4 5 6 7a c a c a g $S=

v

$ a ca

g

ca

g

cg

c g

7

1 3

5

2 4

6

ag

g$ $

$

$

$$a

Path-label of node v is “aca”Denoted as (v)

“ca” is an edge label

This is a leaf edge

Page 5: Algorithms in Bioinformatics: A Practical Introduction

Size of Suffix Tree (I) How big is a suffix tree?

Suffix tree has exactly n leaves and at most 2n-1 edges The total length of all edge labels is O(n2).

Can we store suffix tree using o(n2) bit space?1 2 3 4 5 6 7a c a c a g $S=

$ a ca

g

ca

gc

gc

g

7

1 3

5

2 4

6

ag

g$ $

$

$

$$a

Page 6: Algorithms in Bioinformatics: A Practical Introduction

Size of Suffix Tree (II)Suffix tree has exactly n leaves and at most 2n-1 edgesNote that each edge label can be represented using 2

indicesThus, suffix tree can be represented using O(n log n) bits

1 2 3 4 5 6 7a c a c a g $S=$ a c

ag

ca

gc

gc

g

7

1 3

5

2 4

6

ag

g$ $

$

$

$$

6,7

6,76,7

6,74,7

4,7

a

7,71,1

2,32,3

Note: The end index of every leaf edge should be 7, the last index of S. Thus, for leaf edges, we only need to store the start index.

Page 7: Algorithms in Bioinformatics: A Practical Introduction

Property of suffix tree Fact: For any internal node v in the

suffix tree, if the path label of v is (v)=ap, then there exists another node w in the suffix

tree such that (w)=p.

Proof: Skip the proof.

Definition of Suffix Link: For any internal node v, define its suffix link

sl(v) = w.

Page 8: Algorithms in Bioinformatics: A Practical Introduction

Suffix Link example S=acacag$

$ a ca

g

ca

g

cg

c g

7

1 3

5

2 4

6

ag

g$ $

$

$

$$a

Page 9: Algorithms in Bioinformatics: A Practical Introduction

Generalized suffix tree Build a suffix tree for two or more strings E.g. S1 = acgat#, S2 = cgt$

a c gg t

at t a t

c

41 2 1 3 2 3

tat $ $$

#

g t#

5

##

46

# $

#

Page 10: Algorithms in Bioinformatics: A Practical Introduction

Applications of Suffix Tree

Page 11: Algorithms in Bioinformatics: A Practical Introduction

Exact string matching problem To find all occurrences of Q in S (searching)

Search for the node x in the suffix tree which represent Q All the leaves in the subtree rooted at x are the

occurrences Time: O(|Q| + occ) where occ is the total no. of

occurrences$ a c

ag

cag

cg

c g

7

1 3

5

2 4

6

ag

g$ $

$

$

$$a

E.g. S = acacag$Q = acaOccurrences: 1, 3

Page 12: Algorithms in Bioinformatics: A Practical Introduction

Longest repeated substring problem To find the longest repeated substring in S

Find the deepest internal node Time: O(n)

$ a ca

g

cag

cg

c g

7

1 3

5

2 4

6

ag

g$ $

$

$

$$a

E.g. S = acacag$The longest repeat is aca.

Page 13: Algorithms in Bioinformatics: A Practical Introduction

Longest common substring problem To find the longest common substring of

two or more sequences Note: 1970, Don Knuth conjectured that a

linear time algorithm for this problem is impossible

Now, we know that it can be solved in linear time.

E.g. consider two string S1 and S2,1. Build generalized suffix tree for S1# and S2$2. Then, mark each internal node with leaves

representing suffixes of both S1 and S2.3. Report the deepest marked node

Page 14: Algorithms in Bioinformatics: A Practical Introduction

Example for the longest common substring E.g. S1 = acgat#, S2 = cgt$ The longest common substring is “cg”. Its

length is 2.

a c gg t

at t a t

c

41 2 1 3 2 3

ta $ $$#

g t#

5

##

46

# $

#t

Page 15: Algorithms in Bioinformatics: A Practical Introduction

Longest common prefix (I) Given a string S. For any i, j,

Denote lcp(i, j) be the length of the longest common prefix of suffix i and j of S.

$ a ca

g

cag

cg

c g

7

1 3

5

24

6

ag

g$ $

$

$

$$a

1 2 3 4 5 6 7a c a c a g $S=

The longest common prefix of suffix 1 and suffix 3 is aca!

lcp(1, 3) = 3

Page 16: Algorithms in Bioinformatics: A Practical Introduction

Longest common prefix (II) Note that the lowest common ancestor(lca) of

leaves i and j identifies the longest common prefix. lcp(i, j) = |(lca(i, j))|. A well-know result:

Consider a tree of size n, after an O(n) time preprocessing, the lca for any two nodes can be returned in O(1) time.

First obtained by Harel and Tarjan (SIAM J. Comp. 1984) Simplified by Schieber and Vishkin (SIAM J. Comp. 1988)

Based on the above result, After an O(n) time preprocessing,

For any suffix i and suffix j, we can compute the longest common prefix of them in O(1) time.

Page 17: Algorithms in Bioinformatics: A Practical Introduction

Finding Palindrome (I) Given a string S, palindrome is a substring u of S s.t. u =

ur

E.g. ACAGACA Consider a palindrome u=S[i..i+|u|-1], u is called a

maximal palindrome if S[i’..j’] is not a palindrome for any [i’..j’][i..i+|u|-1].

Note that every palindrome is contained in a maximal palindrome.

Thus, maximal palindromes are a compact way to represent all palindromes.

Complemented Palindrome is a string u s.t. u = ūr

E.g. ACAUGU Maximal complemented palindrome is defined similarly.

Page 18: Algorithms in Bioinformatics: A Practical Introduction

Finding Palindrome (II) Recall that restriction enzyme usually is in

the form of complemented palindrome. This motivates the following two problems:

The palindrome problem: Given a string S (representing the genome) of length n,

the problem is to locate all maximal palindromes in S. The complemented palindrome problem:

Given a string S (representing the genome) of length n, the problem is to locate all maximal complemented palindromes in S.

Page 19: Algorithms in Bioinformatics: A Practical Introduction

Properties of palindrome (I) If S[i..i+k-1]=Sr[n-i+1..n-i+k], then

u=S[i-k+1..i+k-1] is an odd length palindrome 

1 i n

S a a a b a a a

1 n

Sr a a a b a a a

i-k+1 i+k-1

n-i+1

Page 20: Algorithms in Bioinformatics: A Practical Introduction

Properties of palindrome (II) If S[i..i+k-1]=Sr[n-i+2..n-i+k+1],

then u=S[i-k..i+k-1] is an even length palindrome 

1 i-k i n

S a a a b b a a a

1 n-i+2 n

Sr a a a b b a a a

i+k-1

Page 21: Algorithms in Bioinformatics: A Practical Introduction

Solution to the palindrome problem Preprocess S and Sr so that any longest

common prefix query can be answered in constant time.

For i=1 to n, Find the longest common prefix for (Si, Sr

n-i+1). If the longest prefix is k, we find an odd length maximal palindrome S[i-k+1..i+k-1].

Find the longest common prefix for (Si, Srn-i+2).

If the longest prefix is k, we find an even length maximal palindrome S[i-k..i+k-1].

Page 22: Algorithms in Bioinformatics: A Practical Introduction

Extracting embedded suffix tree from a generalized suffix tree

Input: The generalized suffix tree T of K strings S1, …, SK.

Aim: Compute the suffix tree Ti of the string Si.r

x y

a c gg t

at t a t

c

41 2 1 3 2 3

tat $ $$

#

gzw

t#

5

##

46

# $

#

r

a cgg

tat

ac

41 2 3

tat #

g

w

t#

5

# #

6

#

#

S1 = acgat#, S2 = cgt$T

T1

Page 23: Algorithms in Bioinformatics: A Practical Introduction

Extracting embedded suffix tree from a generalized suffix tree Observation: Ti is a subtree of T such that

The leaves of Ti are the leaves of T corresponding to Si. The internal nodes of Ti are the lowest common ancestors

of some leaves for Si. The edges of Ti can be inferred from the ancestor

descendent relationship among those nodes.r

x y

a c gg t

at t a t

c

41 2 1 3 2 3

tat $ $$

#

gzw

t#

5

##

46

# $

#

r

a cgg

tat

ac

41 2 3

tat #

g

w

t#

5

# #

6

#

#

S1 = acgat#, S2 = cgt$T

T1

Page 24: Algorithms in Bioinformatics: A Practical Introduction

Extracting embedded suffix tree from a generalized suffix tree

r

a cgg

tat

ac

41 2 3

tat #

g

w

t#

5

# #

6

#

#

r

a cgg

at

ac

41 2 3

tat #

g

w

t# #

6

#

#

r

a cgat

c

41 2

at

g

w

t# #

6

#

#

r

a

c

41

at

g

w

t#

6

#

#

r

ac

1

at

g6

#

#

r

6

#

Page 25: Algorithms in Bioinformatics: A Practical Introduction

Common substrings of more than 2 strings (I) Given a set of strings (protein or DNA

sequences), we want to know what substrings are common to a large number of these strings?

Why this question is important? DNA and protein sequences will evolve. If a

substring occur commonly in wide range of species. This may mean that the substring is critical for the correct functionality.

Page 26: Algorithms in Bioinformatics: A Practical Introduction

Common substrings of more than 2 strings (II) Given K strings whose total length

is n. For every 2kK, define l(k) be the

length of the longest substring common to at least k of these strings.

The problem is to compute l(k) for all k.

Page 27: Algorithms in Bioinformatics: A Practical Introduction

Common substrings of more than 2 strings (III) Example:

Consider a set of 5 strings { sandollar, sandlot, handler, grand, pantry }

Then, we havek l(k

)corresponding substring

2 4 sand

3 3 and

4 3 and

5 2 an

Page 28: Algorithms in Bioinformatics: A Practical Introduction

Common substrings of more than 2 strings (IV) Illustrating the solution by example:

S1 = aacg$, S2 = acgc#, S3 = cga%. (K=3)

1. Build a generalized suffix tree T for the K strings in O(n) time.

$ a cg

cgg

c

5

2 1

$#

54

#%

3

%

1

g$

ca

4

# c

3 1

$ #

2

a%

c

4 2

$ #

3

a%

Page 29: Algorithms in Bioinformatics: A Practical Introduction

Common substrings of more than 2 strings (V)

2. By traversing T, for each internal node v, compute its string depth. In total, O(n) time.

0

1 1

3

$ a cg

cgg

c

5

2 1

$#

54

#%

3

%

1

g$

ca

4

#2

c

3 1

$ #

2

a%

1

c

4 2

$ #

3

a%

Page 30: Algorithms in Bioinformatics: A Practical Introduction

Common substrings of more than 2 strings (VI)3. By traversing T, for each internal node v, compute

C(v). [C(v) is defined as the number of distinct termination symbols in the subtree rooted at v]

This step takes O(Kn) time.3

3 3

2

$ a cg

cgg

c

5

2 1

$#

54

#%

3

%

1

g$

ca

4

#3

c

3 1

$ #

2

a%

3

c

4 2

$ #

3

a%

Page 31: Algorithms in Bioinformatics: A Practical Introduction

Common substrings of more than 2 strings (VII)4. Traverse T and visit every internal node v. For each v, if

V(C(v)) < string-depth of v, set V(C(v)) = string-depth of v. [After step 4, V(k) = the length of the longest substring common to exactly k of these strings.]

5. l(k)=V(k). For i=k-1 downto 2, l(i)=max{l(i+1), V(i)}.

This two steps take O(n) time.

For our example, V(2) = 3, V(3) = 2. Thus, l(3) = 2, l(2) = 3.

In total, this algorithm takes O(Kn) time. Actually, we can improve this algorithm to O(n) time by

mean of lcp!

Page 32: Algorithms in Bioinformatics: A Practical Introduction

Linear time algorithm for constructing suffix tree

Page 33: Algorithms in Bioinformatics: A Practical Introduction

Straightforward construction of suffix tree Consider S = s1s2…sn where sn=$

Algorithm: Initialize the tree with only a root For i = n to 1

Includes S[i..n] into the tree

Time: O(n2)

Page 34: Algorithms in Bioinformatics: A Practical Introduction

Example of construction S=acca$

Init For-loop

$

5

I5

a

4

I4

5

$ $a

4

I3

5

$ $

3

a$

c

I2

c

ca a

3 2

a

45

$ $$ $

I1

c

ca a

3 25

$

$ $

cc

a

a

4 1

$$

Page 35: Algorithms in Bioinformatics: A Practical Introduction

Construction of generalized suffix tree S’= c#

I1

c

ca a

3 25

$

$ $

cc

a

a

4 1

$$

Init For-loop

J2

c

ca a

3 25

$

$ $

cc

a

a

4 1

$$

2

#

J1

c

ca a

3 25

$

$ $

cc

a

a

4 1

$$

2

##

1

Page 36: Algorithms in Bioinformatics: A Practical Introduction

Can we construct a suffix tree in o(n2) time? Yes. We can construct it in O(n) time.

Weiner’s algorithm [1973] Linear time for constant size alphabet, but much space

McGreight’s algorithm [JACM 1976] Linear time for constant size alphabet, quadratic space

Ukkonen’s algorithm [Algorithmica, 1995] Online algorithm, linear time for constant size alphabet, less space

Farach’s algorithm [FOCS 1997] Linear time for general alphabet

Hon,Sadakane, and Sung’s algorithm [FOCS 2003] O(n) bit space O(n logen) time for 0<e<1 O(n) bit space O(n) time for suffix array construction

We will discuss Farach’s algorithm later.

Page 37: Algorithms in Bioinformatics: A Practical Introduction

Idea Build Odd Suffix Tree and Even Suffix

Tree Then, merge odd and even suffix tree.

$ a ca

g

cag

cg

c g

7

1 3

5

24

6

ag

g$ $

$

$

$$a

$ a ca

g

cag

cg

c g

7

1 3

5

24

6

ag

g$ $

$

$

$$a

Even Suffix Tree Odd Suffix Tree

Page 38: Algorithms in Bioinformatics: A Practical Introduction

Idea Input: a string S of length n1. Recursively compute the suffix tree To of

all suffixes beginning at the odd positions.

To is of size n/2.

2. From To, compute Te which is the suffix tree for all suffixes beginning at the even positions.

3. Merge To and Te to form the suffix tree for S.

Page 39: Algorithms in Bioinformatics: A Practical Introduction

Stage 1: Constructing odd suffix tree Given a string S[1..n], we generate a new

string S’[1..n/2] as follows. we map pairs of characters into single

characters as follows: S[1..2], S[3..4], S[5..6], …, S[n-1..n].

Remove the duplicates from the pairs of characters and sort them by radix sort.

S’[i] = rank of S[2i-1..2i] in the sorted list, for i=1, 2, …, n/2.

By recursion, we get the suffix tree T’ for S’

Convert T’ to the odd suffix tree To.

Page 40: Algorithms in Bioinformatics: A Practical Introduction

Example (I) S = aaabbbabbaba$ S[1..2]=aa, S[3..4]=ab,

S[5..6]=bb, S[7..8]=ab, S[9..10]=ba, S[11..12]=ba.

By stable sort, aa < ab < ba < bb. Rank(aa)=1, Rank(ab)=2,

Rank(ba)=3, Rank(bb)=4. So, S’=124233$.

Page 41: Algorithms in Bioinformatics: A Practical Introduction

Example (II) By recursion, construct the suffix

tree T’ for S’:

7

$

1

12

42

33

$

4

33

$

2

2

24233$

3

6 5

3$$

3

4233$

Page 42: Algorithms in Bioinformatics: A Practical Introduction

Example (III) Convert T’ to the odd tree:

13

$

1

aa

ab

bb

ab

ba

ba

$

7

ba

ba

$

ab

3

ba

11 9

ba$

$

5

bbabbaba$

bbbabbaba$

This is not a suffix tree

i 2i-1

Page 43: Algorithms in Bioinformatics: A Practical Introduction

Example (IV) Refine the odd tree To:

13

$

1

aa

bb

ba

bb

ab

a$

7

ab

a$

bb

3

11 9

ba$

$

5

babbaba$

bbabbaba$

b

a

a

Page 44: Algorithms in Bioinformatics: A Practical Introduction

Time complexity for building the odd tree Let Time(n) be the time to build a

suffix tree for a string of length n.

Stable sorting and refinement of the odd trees take O(n) time.

Build suffix tree for S’ takes Time(n/2). So, Stage 1 takes Time(n/2)+O(n)

time.

Page 45: Algorithms in Bioinformatics: A Practical Introduction

Stage 2: Build the even tree1. Generate the lex-ordering of the

leaves in Te.

2. For any two adjacent leaves 2i and 2j, we find lcp(2i, 2j).

3. Construct the even tree Te from left to right (according to the lex-ordering).

Page 46: Algorithms in Bioinformatics: A Practical Introduction

Build the even tree (Step 1) We get the lex-ordering of the leaves

in To. Generate the lex-ordering of the

leaves in Te. For each leaf i in To, get the preceding

character c=S[i-1] and form a pair (c,i). Each pair represents a even suffix i-1.

Perform stable sorting on those pairs. We get the lex-ordering of the leaves in Te.

Page 47: Algorithms in Bioinformatics: A Practical Introduction

Example Lex-ordering of the leaves in To:

13 < 1 < 7 < 3 < 11 < 9 < 5 The pairs are:

(a,13), ($,1), (b,7), (a, 3), (a, 11), (b, 9), (b, 5). After stable sorting, we have

($, 1), (a, 13), (a, 3), (a, 11), (b, 7), (b, 9), (b, 5).

Hence, the lex-ordering of the leaves of Te: 12 < 2 < 10 < 6 < 8 < 4

S = aaabbbabbaba$

Page 48: Algorithms in Bioinformatics: A Practical Introduction

Build the even tree (Step 2) For any two adjacent leaves 2i and 2j,

we first find lcp(2i, 2j).

Observation: lcp(2i, 2j) = lcp(2i+1, 2j+1)+1 if S[2i]=S[2j] 0 otherwise

Proof: If S[2i]S[2j], lcp(2i,2j)=0. Otherwise, lcp(2i,2j)=1+lcp(2i+1,2j+1).

Page 49: Algorithms in Bioinformatics: A Practical Introduction

Example Recall that the lex-

ordering of leaves: 12 < 2 < 10 < 6 < 8 < 4.

By the previous observation, we have lcp(8,4)=lcp(9,5)+1=2

Similarly, we have lcp(12,2)=1, lcp(2,10)=1,

lcp(10,6)=0, lcp(6,8)=1, lcp(8,4)=2

13

$

1

aa

bb

ba

bb

ab

a$

7

ab

a$

bb

3

11 9

ba$

$

5

babbaba$

bbabbaba$

b

a

a

Page 50: Algorithms in Bioinformatics: A Practical Introduction

Build the even tree (Step 3) Construct the even tree Te from

left to right.

12

a

12

a$

$

2

abbbabbaba$

a

12

$

2

abbbabbaba$

10

ba$

Page 51: Algorithms in Bioinformatics: A Practical Introduction

Build the even tree (Step 3)

a

12

$

2

abbbabbaba$

10

ba$

6

babbaba$

a

12

$

2

abbbabbaba$

10

ba$

6

abbaba$

b

8

baba$

a

12

$

2

abbbabbaba$

10

ba$

6

abbaba$

b

8

aba$

4

babbaba$

b

Page 52: Algorithms in Bioinformatics: A Practical Introduction

Time complexity for building the even tree

Step 1: O(n) time Step 2: O(n) time Step 3: O(n) time

Page 53: Algorithms in Bioinformatics: A Practical Introduction

Stage 3: Merge odd and even trees We can merge To and Te by DFS. However, it takes O(n2) time.

13

$

1

aa

bb

ba

bb

ab

a$

7

ab

a$

bb

3

11 9

ba$

$

5

babbaba$

bbabbaba$

b

a

a

13

$

1

aa

bb

ba

bb

ab

a$

7

ab

a$

bb

3

11 9

ba$

$

5

babbaba$

bbabbaba$

b

a

a

Even tree

Odd tree

a

12

$

2

abbbabbaba$

10

ba$

6

abbaba$

b

8

aba$

4

babbaba$

b

a

12

$

2

abbbabbaba$

10

ba$

6

abbaba$

b

8

aba$

4

babbaba$

b

13$,1

1

7 3

11 9

$,1

b,1

a,1

a,1

12$,1

a,1

2

b,1

10a,2

a,4 b,9

6

b,1

b,5

8

4

b,1

a,2

5b,5

b,8a,11

b,10

a,2a,2

b,1

13$,1

1

7 3

11 9

$,1

b,1

a,1

a,1

12$,1

a,1

2

b,1

10a,2

a,4 b,9

6

b,1

b,5

8

4

b,1

a,2

5b,5

b,8a,2

a,2b,1

Page 54: Algorithms in Bioinformatics: A Practical Introduction

Stage 3: Merge odd and even trees We merge To and Te by DFS. We merge two edges as long as they start

with the same character. The merge is ended when one edge is longer than the other.

13

$

1

aa

bb

ba

bb

ab

a$

7

ab

a$

bb

3

11 9

ba$

$

5

babbaba$

bbabbaba$

b

a

aa

12

$

2

abbbabbaba$

10

ba$

6

abbaba$

b

8

aba$

4

babbaba$

b

Page 55: Algorithms in Bioinformatics: A Practical Introduction

Merge odd and even trees We merge To and Te by DFS. We merge two edges as long as they start

with the same character. The merge is ended when one edge is longer than the other.

13

$,1

27

3

11 9

$,1

b,1

a,1

a,1

12

$,1

a,11

1

$,1

b,2

10

$,1

a,4b,9

6

b,3

b,3

8

4

b,1

a,4

5

b,3

b,8

Page 56: Algorithms in Bioinformatics: A Practical Introduction

Merge odd and even trees The merging may over-merged some nodes. To correct the tree, we need to unmerge some nodes.

13

$,1

27

3

11 9

$,1

b,1

a,1

a,1

12

$,1

a,11

1

$,1

b,2

10

$,1

a,4b,9

6

b,3

b,3

8

4

b,1

a,4

5

b,3

b,8

Page 57: Algorithms in Bioinformatics: A Practical Introduction

Definition of L() and d() For every node u which may be over-

merged, there exist two leaves 2i and 2j-1 such that u=lca(2i, 2j-1). Denote L(u) be the correct depth of u, that is,

lcp(2i,2j-1).

Note that lcp(2i,2j-1) = 1+lcp(2i+1,2j) if S[2i]=S[2j-1]; 0 otherwise.

Let v be lca(2i+1,2j). Denote d(u) = v. Note that d() is equivalent to suffix link!

Page 58: Algorithms in Bioinformatics: A Practical Introduction

Example of d()

13

$,1

27

3

11 9

$,1

b,1

a,1

a,1

12

$,1

a,11

1$,1

b,2

10

$,1

a,4b,9

6

b,3

b,3

8

4

b,1

a,4

5

b,3

b,8

d()

Page 59: Algorithms in Bioinformatics: A Practical Introduction

Relationship between L() and d() Suppose u = lca(2i, 2j-1).

Note1: if u is not the root, then S[2i]=S[2j-1]. Note2: lcp(2i,2j-1) = 1+lcp(2i+1,2j) if S[2i]=S[2j-1]; 0

otherwise Note3: d(u) = lcp(2i+1,2j)

Hence, L(u) = 1 + L(d(u)) if u is not the root. Otherwise, L(u)=0.

Lemma: L(u) = the length of the purple path from u to the root.

Page 60: Algorithms in Bioinformatics: A Practical Introduction

Example of L()

13

$,1

27

3

11 9

$,1

b,1

a,1

a,1

12

$,1

a,11

1

$,1

b,2

10

$,1

a,4b,9

6

b,3

b,3

8

4

b,1

a,4

5

b,3

b,8

L( )=2 L( )=3

L( )=4

L( )=2L( )=2

L( )=2

L( )=1L( )=1

Page 61: Algorithms in Bioinformatics: A Practical Introduction

Unmerge the border nodes based on L() (I)

13

$,1

27

3

11 9

$,1

b,1

a,1

a,1

12

$,1

a,11

1

$,1

b,2

10

$,1

a,4b,9

6

b,3

b,3

8

4

b,1

a,4

5

b,3

b,8

L( )=2 L( )=3

L( )=4

L( )=2L( )=2

L( )=2

L( )=1L( )=1

Page 62: Algorithms in Bioinformatics: A Practical Introduction

Unmerge the border nodes based on L() (II)

13

$,1

1

7 3

11 9

$,1

b,1

a,1

a,1

12

$,1a,1

2

b,1

10

a,2

a,4 b,9

6

b,1

b,5

8

4

b,1

a,2

5

b,5

b,8a,11

b,10

a,2

a,2b,1

Page 63: Algorithms in Bioinformatics: A Practical Introduction

Time complexity for merging Merge the tree using DFS takes

O(n) time. Compute the links d() takes O(n)

time. Compute L() takes O(n) time. Unmerge takes O(n) time.

Page 64: Algorithms in Bioinformatics: A Practical Introduction

Total time complexity of Farach’s algorithm

Stage 1: Time(n/2)+O(n) Stage 2: O(n) Stage 3: O(n)

Thus, Time(n) = Time(n/2)+O(n). By solving the equation,

Time(n)= O(n).

Page 65: Algorithms in Bioinformatics: A Practical Introduction

Disadvantage of suffix tree Suffix tree is space inefficient. It

requires O(n||log n) bits. Manber and Myers (SIAM J. Comp

1993) proposes a new data structure, called suffix array, which has a similar functionality as suffix tree. Moreover, it only requires O(n log n) bits.

Page 66: Algorithms in Bioinformatics: A Practical Introduction

Suffix Array (I) It is just sorted suffixes. E.g. consider S = acacag$

Suffix array is an array of n indices. Thus, it takes O(n log n) bits.

Suffix Position SA[i] Suffix

acacag$ 1 7 $

cacag$ 2 1 acacag$

acag$ 3 => 3 acag$

cag$ 4 Sort 5 ag$

ag$ 5 2 cacag$

g$ 6 4 cag$

$ 7 6 g$

Page 67: Algorithms in Bioinformatics: A Practical Introduction

Observation The leaves of a suffix tree is in

suffix array order.

$ a ca

g

cag

cg

c g

7

1 3

5

2 4

6

ag

g$ $

$

$

$$a

SA[i] Suffix

7 $

1 acacag$

3 acag$

5 ag$

2 cacag$

4 cag$

6 g$

Page 68: Algorithms in Bioinformatics: A Practical Introduction

Linear time construction of suffix array from suffix tree Recall that the suffix tree T of S[1..n] can be

constructed in O(n) time. Then, by “lexical” depth-first traversal of T,

the suffix array of S is obtained. This takes O(n) time.

However, the space used during construction is the same as that for suffix tree! This defeats the purpose of suffix array.

Today, we can build suffix array using O(n) bit space and O(n) time.

Page 69: Algorithms in Bioinformatics: A Practical Introduction

range(T,Q) For a pattern Q, its

occurrences in T form a consecutive SA range.

Example: For T= acacag$, ca occurs in SA[5] and SA[6].

Definition: We called range(T,Q)=[st..ed] if

Q is a prefix of every Tj for j=SA[st], SA[st+1], …, SA[ed]

where Tj = j suffix of T = T[j..n]. Example: range(T,ca)=[5..6]

SA[i] Suffix

7 $

1 acacag$

3 acag$

5 ag$

2 cacag$

4 cag$

6 g$

1

2

3

4

5

6

7

Page 70: Algorithms in Bioinformatics: A Practical Introduction

Find occurrence of query Q in a string S using suffix array Input: (1) the suffix array of a string T

of length n and (2) a query Q of length m

Aim: check if Q occurs in T

Idea: binary search!

Page 71: Algorithms in Bioinformatics: A Practical Introduction

Algorithm

Page 72: Algorithms in Bioinformatics: A Practical Introduction

Example Consider T = acacag$ Pattern Q = acag

L=1 R=7 M=(L+R)/2=4

SA[i] Suffix

7 $

1 acacag$

3 acag$

5 ag$

2 cacag$

4 cag$

6 g$

i

1

2

3

4

5

6

7

Page 73: Algorithms in Bioinformatics: A Practical Introduction

Example Consider T = acacag$ Pattern Q = acag

L=1 R=7 M=(L+R)/2=4

suffix-SA[M] > Q. Set R=M=4.

SA[i] Suffix

7 $

1 acacag$

3 acag$

5 ag$

2 cacag$

4 cag$

6 g$

i

1

2

3

4

5

6

7

Page 74: Algorithms in Bioinformatics: A Practical Introduction

Example Consider T = acacag$ Pattern Q = acag

L=1 R=4 M=(L+R)/2=2

suffix-SA[M] < Q. Set L=M=2.

SA[i] Suffix

7 $

1 acacag$

3 acag$

5 ag$

2 cacag$

4 cag$

6 g$

i

1

2

3

4

5

6

7

Page 75: Algorithms in Bioinformatics: A Practical Introduction

Example Consider T = acacag$ Pattern Q = acag

L=2 R=4 M=(L+R)/2=3

The pattern Q is found at SA[M]=3.

SA[i] Suffix

7 $

1 acacag$

3 acag$

5 ag$

2 cacag$

4 cag$

6 g$

i

1

2

3

4

5

6

7

Page 76: Algorithms in Bioinformatics: A Practical Introduction

Can we do better? During each step of binary search,

we need to compare Q with a suffix using O(m) time, which is time consuming.

Can we do better?

We have the following observation. Suppose LCP(Q, suffix-SA[L]) is l and LCP(Q, suffix-SA[R])

is r. Then, LCP(Q, suffix-SA[M]) > min{l,r}.

Below, we describe how to utilize this observation to speedup the computation.

Page 77: Algorithms in Bioinformatics: A Practical Introduction

Algorithm

Page 78: Algorithms in Bioinformatics: A Practical Introduction

Example Consider T = acacag$ Pattern Q = acag

L=1, l=0 R=7, r=0 mlr = min(l,r)=0 M=(L+R)/2=4

SA[i] Suffix

7 $

1 acacag$

3 acag$

5 ag$

2 cacag$

4 cag$

6 g$

i

1

2

3

4

5

6

7

Page 79: Algorithms in Bioinformatics: A Practical Introduction

Example Consider T = acacag$ Pattern Q = acag

L=1, l=0 R=7, r=0 mlr = min(l,r)=0 M=(L+R)/2=4, m=1

The (m+1) char of suffix-SA[M] is g. The (m+1) char of Q is c. So, suffix-SA[M] > Q. Set R=M=4 and r=m=1.

SA[i] Suffix

7 $

1 acacag$

3 acag$

5 ag$

2 cacag$

4 cag$

6 g$

i

1

2

3

4

5

6

7

Page 80: Algorithms in Bioinformatics: A Practical Introduction

Example Consider T = acacag$ Pattern Q = acag

L=1, l=0 R=4, r=1 mlr = min(l,r)=0 M=(L+R)/2=2, m=3

The (m+1) char of suffix-SA[M] is c. The (m+1) char of Q is g. So, suffix-SA[M] < Q. Set L=M=2 and l=m=3.

SA[i] Suffix

7 $

1 acacag$

3 acag$

5 ag$

2 cacag$

4 cag$

6 g$

i

1

2

3

4

5

6

7

Page 81: Algorithms in Bioinformatics: A Practical Introduction

Example Consider T = acacag$ Pattern Q = acag

L=2, l=3 R=4, r=1 mlr = min(l,r)=1 M=(L+R)/2=3, m=4

The pattern Q is found at SA[M]=3.

SA[i] Suffix

7 $

1 acacag$

3 acag$

5 ag$

2 cacag$

4 cag$

6 g$

i

1

2

3

4

5

6

7

Page 82: Algorithms in Bioinformatics: A Practical Introduction

Time analysis Binary search will perform log n

comparisons Each comparison takes at most

O(m) time In the worst case, O(m log n) time. Myers and Manber report that, in

practice, the time is O(m + log n).

Page 83: Algorithms in Bioinformatics: A Practical Introduction

Suffix array and suffix tree We show one example of replacing

suffix tree by suffix array Note that most applications related

to suffix tree can be solved using suffix array with some time blow up!

When space is limited, replacing suffix tree by suffix array is a good choice.

Page 84: Algorithms in Bioinformatics: A Practical Introduction

The size is still too big! Why? DNA sequences can be very long!

E.g. Fly: ~100M bases, Human: ~3G bases, Tree: ~9G bases

Storage to store indexing data structure for human genomeSuffix Tree: ~40G bytesSuffix Array: ~13G bytes

Can we further reduce the space?

Page 85: Algorithms in Bioinformatics: A Practical Introduction

Solution Grossi, Vitter (STOC2000)

Compressed suffix array (CSA) Ferragine, Manzini (FOCS2000)

FM-index Both of them can be stored in O(n) bit

space For Human Genome

Both CSA and FM-index can be stored within 2G bytes.

Page 86: Algorithms in Bioinformatics: A Practical Introduction

FM-index Consider a text T=acacag$ FM-index stores:

A. The string BW=g$ccaaaB. C[x] = total no. of

occurrences of each symbol less than x. E.g. C[a]=1, C[c]=4, C[g]=6, C[t]=7

C. A data-structure occ(x, i) which tells us the number of occurrences of x in BW[1..i] using O(1) time.

SA[i] Suffix T[SA[i]-1]

7 $ g

1 acacag$ $

3 acag$ c

5 ag$ c

2 cacag$ a

4 cag$ a

6 g$ a

Page 87: Algorithms in Bioinformatics: A Practical Introduction

Data-structure for answering the occ(x, i) query?

Given the text BW[1..n], we divide BW[1..n] into buckets of size log2n. For each bucket i = 1, …, n/log2n, we store

P[i] = number of x’s in BW[1.. ilog2n]. Each bucket is further subdivided into log n sub-buckets of size log n. For each sub-bucket j of the bucket i, we store

Q[i][j] = number of x’s in the first j sub-buckets.

………BWlog2 n

………

log n

2 n/log2 n3 41

1 2 3 log n

00x0xxx0x0

Page 88: Algorithms in Bioinformatics: A Practical Introduction

Data-structure for answering the occ(x, i) query? We also need a lookup table rank(b,k)

b is any string of length (log n)/2 1 k (log n)/2

rank(b,k) = number of x in the first k characters of b.

00…0

xx…x

1 2 3 ………………(log n)/2

x0x..0

……

1 1 2 ………………………

bits )()logloglog(

space Total

bits )log(log sentry takeEach

2

log

2

log2

entries ofNumber

2

log

nonnnO

nO

nn

nn

Page 89: Algorithms in Bioinformatics: A Practical Introduction

Space complexity of the occ() data-structure P[1..n/log2n] uses O(n/log n) bits Q[1..n/log2n][1..log n] uses O(n log

log n / log n) bits rank(b,k) uses 2logn/2 (log n/2) =

o(n) bits

In total, we use O(n log log n / log n) bits.

Page 90: Algorithms in Bioinformatics: A Practical Introduction

How to compute occ(x,i)?

Suppose log n = 10. To compute occ(x, 327),

The result is P[3]+Q[4][2]+rank(00x0x,5)+rank(xx0x0,2)

Hence, O(1) time to compute occ(x,i).

………BWlog2 n

………

log n

2 n/log2 n3 41

1 2 3 log n

00x0xxx0x0

Page 91: Algorithms in Bioinformatics: A Practical Introduction

Size of FM-index Structure A can be store in 2n bits Structure B can be store in O(log n)

bits Structure C can be store in

O(n log log n/log n) bits

In total, the size of FM-index isO(n) bits.

Page 92: Algorithms in Bioinformatics: A Practical Introduction

Observation C[x]+occ(x,i) is the

number of suffixes smaller than of xTSA[i].

Example: C[c]=4 occ(c, 6)=2 Number of suffixes smaller

than cT[SA[6]..n]=ccag$ is 6.

SA[i] Suffix T[SA[i]-1]

7 $ g

1 acacag$ $

3 acag$ c

5 ag$ c

2 cacag$ a

4 cag$ a

6 g$ a

Page 93: Algorithms in Bioinformatics: A Practical Introduction

Lemma Suppose range(T,Q) is [st..ed]. Then, range(T,xQ) = [p..q] where

p = C[x] + occ(x,st-1) + 1 q = C[x] + occ(x,ed)

Proof: p = 1+number of suffixes strictly smaller

than xQ. The latter term = number of suffixes smaller than or equal to xTSA[st-1].

q = number of suffixes smaller than or equal to xTSA[ed].

Page 94: Algorithms in Bioinformatics: A Practical Introduction

Backward search Given the text T and the FM-index, we want to

determine if Q exists in T.

Algorithm BW_exist(Q[1..m])1. x=Q[m], i=m-1;2. /* find range(T,Q[m]) */

st = C[x]+1, ed = C[x+1];3. while (sted and i>1) {

/* find range(T, Q[i-1..m]) */ x = Q[i-1]; st = C[x] + occ(x, st-1) + 1; ed = C[x] + occ(x, ed); i = i – 1;}

4. if st > ed, then pattern not found else pattern found.

Page 95: Algorithms in Bioinformatics: A Practical Introduction

Example T = acacag$ Q = aca

Q[3..3]=a sp=C[a] +1

=1+1=2 ep=C[c]

=4

SA[i] Suffix T[SA[i]-1]

7 $ g

1 acacag$ $

3 acag$ c

5 ag$ c

2 cacag$ a

4 cag$ a

6 g$ a

g $ c c a a a

Page 96: Algorithms in Bioinformatics: A Practical Introduction

Example T = acacag$ Q = aca

Q[2..3]=ca st=C[c]+occ(c,stold-1)+1

=4+0+1=5 ed=C[c]+occ(c,edold)

=4+2=6

SA[i] Suffix T[SA[i]-1]

7 $ g

1 acacag$ $

3 acag$ c

5 ag$ c

2 cacag$ a

4 cag$ a

6 g$ a

g $ c c a a a

Page 97: Algorithms in Bioinformatics: A Practical Introduction

Example T = acacag$ Q = aca

Q[1..3]=aca st=C[a]+occ(a,stold-1)+1

=1+0+1=2 ed=C[a]+occ(a,edold)

=1+2=3

SA[i] Suffix T[SA[i]-1]

7 $ g

1 acacag$ $

3 acag$ c

5 ag$ c

2 cacag$ a

4 cag$ a

6 g$ a

g $ c c a a a

Page 98: Algorithms in Bioinformatics: A Practical Introduction

Example T = acacag$ Q = aca

Q[1..3]=aca st=C[a]+occ(a,stold-1)+1

=1+0+1=2 ed=C[a]+occ(a,edold)

=1+2=3

SA[i] Suffix T[SA[i]-1]

7 $ g

1 acacag$ $

3 acag$ c

5 ag$ c

2 cacag$ a

4 cag$ a

6 g$ a

g $ c c a a a

Q occurrences in T!

Page 99: Algorithms in Bioinformatics: A Practical Introduction

Time complexity of Backward Search To find a pattern Q[1..m]

Step 1, 2, and 4 can be computed in O(1) time.

For step 3, We need to iterate the loop for m-1 times. Each iteration of the loop can be computed

in O(1) time. The loop takes O(m) time.

In total, O(m) time for backward search.

Page 100: Algorithms in Bioinformatics: A Practical Introduction

Conclusion Suffix tree is a powerful data-structure

which has a lot of applications in Computational Biology.

Problems: Suffix tree is too big!

Can be solved using CSA and FM-index Suffix tree can only solve exact match

problem! (Most of the biology problems are approximate match!)

Many works have been done on this area! But still not pratical. One of the important area to explore!

Page 101: Algorithms in Bioinformatics: A Practical Introduction

1-mismatch problem

Page 102: Algorithms in Bioinformatics: A Practical Introduction

1-mismatch problem Index: the suffix tree of a text T[1..n] For any pattern P[1..m], the 1-mismatch

problem finds all occurrences of P in T that has hamming distance at most 1.

Example:P = ACGTT = AACGTGGCCAACTTGGA

Page 103: Algorithms in Bioinformatics: A Practical Introduction

Naïve solution Index:

create the suffix tree for T Algorithm for query:

Generate all possible 1-mismatch patterns of P. Find occurrences of every 1-mismatch pattern

Running time: There are ||m possible 1-mismatch patterns. Using suffix tree, it takes O(m) time to find

occurrences of each 1-mismatch pattern In total, O(m2+occ) time where occ is total

number of occurrences.

Page 104: Algorithms in Bioinformatics: A Practical Introduction

Any other solutions? Cobbs [CPM 1995]

O(n log n) bit index, O(m2+occ) query time. Amir et al [Journal of Algorithm 2000]

O(n log3 n) bit index, O(m log n loglog n + occ) query time. Buchsbaum et al [ESA 2000]

O(n log2 n) bit index, O(m log log n + occ) query time. Cole et al [SODA 2004]

O(n log2 n) bit index, O(m + log log n + occ) query time. Trinh et al [CPM 2004]

O(n log n) bit index, O(m log n + occ) query time. Lam et al [ISAAC 2005]

O(n sqrt(log n)) bit index, O(m log log n + occ) query time. Chan et al [ESA 2006]

O(n log n) bit index, O(m + log log n + occ) query time.

Today, we have a look of the solution of Trinh et al.

Page 105: Algorithms in Bioinformatics: A Practical Introduction

Index Suffix array of T

SA[1..n] Inversed suffix array of T

SA-1[1..n] where SA[SA-1[i]] = i.

Definition: We called range(T,P)=[st..ed] if

P is a prefix of every Tj for j=SA[st], SA[st+1], …, SA[ed]

where Tj = j suffix of T = T[j..n]. Example: For T= acacag$, range(T,ca)=[5..6]

SA[i] Suffix

7 $

1 acacag$

3 acag$

5 ag$

2 cacag$

4 cag$

6 g$

i

1

2

3

4

5

6

7

Page 106: Algorithms in Bioinformatics: A Practical Introduction

Lemma 1 (Forward search) Assume [st..ed]=range(T,P). We can compute

[st’..ed’]=range(T,Pc) in O(log n) time. Proof: binary search on SA.

Page 107: Algorithms in Bioinformatics: A Practical Introduction

Lemma 2 Assume [st1..ed1]=range(T,P1) and [st2..ed2]=range(T,P2). We can compute [st..ed]=range(T,P1P2) in O(log n) time.

Proof: Let the length of P1 be k. Note that TSA[st1],TSA[st1+1],…,TSA[ed1] are lexicographically increasing. Hence, TSA[st1]+k,TSA[st1+1]+k,…,TSA[ed1]+k are lexicographically

increasing. Thus, SA-1[SA[st1]+k] < SA-1[SA[st1+1]+k] <…< SA-1[SA[ed1]+k]. To find st and ed, we need to find

the smallest st such that st2 < SA-1[SA[st]+k]<ed2 and the largest ed such that st2 < SA-1[SA[ed]+k] < ed2.

This can be done by binary search.

Page 108: Algorithms in Bioinformatics: A Practical Introduction

Example T=acacag$ P1=a, P2=ca. range(T,P1)=[2..4], range(T,P2)=[5..6] To find range(T,P1P2), we do the following:

Note that TSA[2]<TSA[3]<TSA[4] are beginning with a. So, TSA[2]+1<TSA[3]+1<TSA[4]+1. Note that

SA-1[SA[2]+1]=5, SA-1[SA[3]+1]=6, SA-1[SA[4]+1]=7. Hence, TSA[5]<TSA[6]<TSA[7]. Among the three suffixes, we need to identify suffix beginning

with P2. Since range(T,P2)=[5..6], both TSA[5] and TSA[6] contain P2. As SA-1[SA[2]+1]=5 and SA-1[SA[3]+1]=6,

we have range(T,P1P2) = [2..3].

SA[i] Suffix

7 $

1 acacag$

3 acag$

5 ag$

2 cacag$

4 cag$

6 g$

i

1

2

3

4

5

6

7

Page 109: Algorithms in Bioinformatics: A Practical Introduction

Lemma 3 (Backward search) Assume [st..ed]=range(T,P). We can compute

[st’..ed’]=range(T,cP) in O(log n) time. Proof: Let P1=c, P2=P. By Lemma 2,

range(T,P1P2) can be computed in O(log n) time.

Page 110: Algorithms in Bioinformatics: A Practical Introduction

Algorithm1. For j=m,m-1,…,1,

By backward search, find range(T,P[j..m]).

2. For j=1,2,…,m, By forward search, find range(T,P[1..j]).

3. Report all occurrences of range(T,P[1..m])4. For j=1,2,…,m,

Let P1=P[1..j-1], P2=P[j+1..m] For every character cP[j],

By forward search, find range(T,P1c) By Lemma 2, find range(T,P1cP2) Report all occurrences of range(T,P1cP2)

Page 111: Algorithms in Bioinformatics: A Practical Introduction

Time analysis Ignoring the O(occ) reporting time! Steps 1 and 2 take O(m log n) time. Step 4 tries all possible mismatches.

There are in total ||m mismatches. For each mismatch, it take O(log n) time. So, Step 4 takes O(||m log n) time.

In total, the running time is O(||m log n + occ).

Note that this solution can be generalized to handle k-mismatch or k-difference.