Suffix Trees
Rolf Backofen
Lehrstuhl für Bioinformatik
Institut für Informatik
Course Bioinformatics II — WS 11/12
String Matching
efficiently find all occurrences of a pattern P of length m in a text T of length n
Counting query: reports the number of occurrences of P in T
Reporting query: reports all occurrences of P in T
string matching can be solved with a suffix tree
advantage over other string-matching algorithms:
if T is static, the suffix tree is constructed once in a preprocessing step
the subsequent string matchings are then "fast"
Definitions
T = t_1 t_2 ... t_n
Definition
The substring t_1 ... t_i is called the i-th prefix of T (1 ≤ i ≤ n).
Example: T=ACCTTCCT
first prefix: A
fourth prefix: ACCT
Definition
The substring t_i ... t_n is called the i-th suffix of T (1 ≤ i ≤ n).
Example: T=ACCTTCCT
first suffix: ACCTTCCT
fourth suffix: TTCCT
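The two definitions can be checked with a few lines of Python. This is only an illustrative sketch; the helper names prefix and suffix are hypothetical and use the 1-based positions of the slides.

```python
def prefix(text: str, i: int) -> str:
    """Return the i-th prefix t_1 ... t_i (1-based, 1 <= i <= n)."""
    if not 1 <= i <= len(text):
        raise ValueError("i must satisfy 1 <= i <= n")
    return text[:i]


def suffix(text: str, i: int) -> str:
    """Return the i-th suffix t_i ... t_n (1-based, 1 <= i <= n)."""
    if not 1 <= i <= len(text):
        raise ValueError("i must satisfy 1 <= i <= n")
    return text[i - 1:]


T = "ACCTTCCT"
print(prefix(T, 4))  # ACCT
print(suffix(T, 4))  # TTCCT
```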
Definition
Suffix Tree
A suffix tree for a text T of length n over the alphabet Σ is a rooted directed tree with n leaves. Apart from the root node, all internal nodes have at least two children. All edges are labeled with a non-empty substring of T, and all outgoing edges from a node start with a different character. Each leaf in the suffix tree is labeled with an integer i ∈ {1, ..., n} such that the concatenation of the edge labels on the path from the root to the leaf spells out the suffix of T that starts at position i. The suffix tree can be constructed in O(n) time and requires O(n) space.
Remark: In order to have a one-to-one correspondence between the suffixes of T and the leaves of the suffix tree, we add a new character $ ∉ Σ to the end of T. This ensures that no suffix is a prefix of another suffix.
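Why the sentinel is needed can be seen with a small brute-force check. The sketch below uses a hypothetical helper, suffix_that_is_prefix, which looks for a suffix that is a prefix of another suffix; such a suffix would end inside an edge or at an internal node and would get no leaf of its own.

```python
def suffix_that_is_prefix(text: str):
    """Return 1-based (i, j) such that suffix i is a prefix of suffix j, or None."""
    n = len(text)
    for i in range(1, n + 1):
        s = text[i - 1:]
        for j in range(1, n + 1):
            if j != i and text[j - 1:].startswith(s):
                return i, j
    return None


# Without the sentinel, suffix 6 (CCT) is a prefix of suffix 2 (CCTTCCT).
print(suffix_that_is_prefix("ACCTTCCT"))   # (6, 2)
# With the sentinel no such pair exists.
print(suffix_that_is_prefix("ACCTTCCT$"))  # None
```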
Example for Suffix Tree
T = ACCTTCCT$
Suffixes:
1: ACCTTCCT$
2: CCTTCCT$
3: CTTCCT$
4: TTCCT$
5: TCCT$
6: CCT$
7: CT$
8: T$
9: $
[Figure: suffix tree for T = ACCTTCCT$; the leaves are labeled 1 to 9 with the starting positions of the suffixes]
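The tree of this example can be reproduced with a naive construction that inserts one suffix after the other and splits an edge at the first mismatch. This is a minimal sketch and takes O(n^2) time; it is not the O(n) construction mentioned in the definition. The class Node and the functions build_suffix_tree and leaves are hypothetical names chosen for illustration.

```python
class Node:
    def __init__(self, label="", leaf=None):
        self.label = label    # label of the edge entering this node
        self.leaf = leaf      # 1-based suffix start for a leaf, else None
        self.children = {}    # first character of an edge label -> child


def build_suffix_tree(text):
    """Naive O(n^2) insertion of all suffixes; text must end in a unique sentinel."""
    if not text or text[-1] in text[:-1]:
        raise ValueError("text must end with a character that occurs nowhere else")
    root = Node()
    for start in range(len(text)):
        node, rest = root, text[start:]
        while True:
            child = node.children.get(rest[0])
            if child is None:                    # no edge starts with rest[0]: new leaf
                node.children[rest[0]] = Node(rest, start + 1)
                break
            k = 0                                # length of the common prefix with the edge label
            while k < len(child.label) and child.label[k] == rest[k]:
                k += 1
            if k == len(child.label):            # edge fully matched: descend
                node, rest = child, rest[k:]
                continue
            mid = Node(child.label[:k])          # mismatch inside the edge: split it
            node.children[rest[0]] = mid
            child.label = child.label[k:]
            mid.children[child.label[0]] = child
            mid.children[rest[k]] = Node(rest[k:], start + 1)
            break
    return root


def leaves(node, path=""):
    """Yield (leaf number, string spelled out from the root) for every leaf below node."""
    if node.leaf is not None:
        yield node.leaf, path
    for child in node.children.values():
        yield from leaves(child, path + child.label)


T = "ACCTTCCT$"
root = build_suffix_tree(T)
for number, spelled in sorted(leaves(root)):
    print(number, spelled)
```

Every leaf number i is printed next to the suffix starting at position i, which is the defining property of the suffix tree.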
Notations
for a node v in the suffix tree, v̄ denotes the concatenation of all edge labels on the path from the root to v
|v̄| denotes the string depth of a node v
in order to identify a node v in the suffix tree with v̄ = x, we write x̄
a suffix link sl(v) of an internal node v with v̄ = cb, where c is a character and b is a string, points to the node w with w̄ = b
Searching in a Suffix Tree
Task: find pattern P = p_1 ... p_m of length m in the suffix tree for text T of length n
1 set cur_node = root and cur_char = p_1
2 locate the outgoing edge of cur_node that starts with cur_char
3 match the subsequent characters of the pattern against the label of the edge located in step 2, character by character, until the whole pattern has been matched (go to step 4a) or a node v is reached. Assume p_1 ... p_i have been matched so far: set cur_node = v and cur_char = p_{i+1}
4 repeat steps 2 and 3 until:
  a) the whole pattern has been matched
  b) there is no outgoing edge that starts with cur_char (step 2) or the subsequent characters of P cannot be matched (step 3)
Searching in a Suffix Tree (cont.)
step 4a): the whole pattern was matched
suppose the search procedure ended at node w or on the incoming edge of node w
⇒ the occurrences of P in T can be found in the subtree rooted at w
step 4b): there is no outgoing edge that starts with cur_char, or the subsequent characters of P cannot be matched
⇒ P does not occur in T
Searching in a Suffix Tree (cont.)
Counting query: reports the number of occurrences of P in T
step 4a): occurrences of the pattern found
⇒ return the number of leaves in the subtree rooted at w (assuming that all nodes in the suffix tree are labeled with their subtree sizes, this can be done in constant time)
step 4b): no occurrence of the pattern found
⇒ return 0 (constant time)
Runtime for counting query: O(m)
Reporting query: reports all occurrences of P in T
step 4a): occurrences of the pattern found
⇒ output the labels of all leaves in the subtree rooted at w in O(Occ_T^P) time, where Occ_T^P is the number of occurrences of P in T
step 4b): no occurrence of the pattern found
⇒ output the empty set (constant time)
Runtime for reporting query: O(m + Occ_T^P)
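The search steps and both queries can be sketched in Python as follows. The tree is built with a naive O(n^2) insertion, which is an assumption made to keep the example self-contained; all names (Node, build_suffix_tree, annotate, locate, count, report) are hypothetical. The function locate follows steps 1 to 4 and returns the node w, count reads the stored subtree size, and report collects the leaf labels below w.

```python
class Node:
    def __init__(self, label="", leaf=None):
        self.label, self.leaf = label, leaf  # incoming edge label, 1-based leaf number
        self.children = {}                   # first character of an edge label -> child
        self.size = 0                        # number of leaves below, set by annotate()


def build_suffix_tree(text):
    """Naive O(n^2) construction; text must end in a unique sentinel such as $."""
    root = Node()
    for start in range(len(text)):
        node, rest = root, text[start:]
        while True:
            child = node.children.get(rest[0])
            if child is None:
                node.children[rest[0]] = Node(rest, start + 1)
                break
            k = 0
            while k < len(child.label) and child.label[k] == rest[k]:
                k += 1
            if k == len(child.label):
                node, rest = child, rest[k:]
                continue
            mid = Node(child.label[:k])      # split the edge at the mismatch
            node.children[rest[0]] = mid
            child.label = child.label[k:]
            mid.children[child.label[0]] = child
            mid.children[rest[k]] = Node(rest[k:], start + 1)
            break
    annotate(root)
    return root


def annotate(node):
    """Label every node with its subtree size (number of leaves)."""
    if node.leaf is not None:
        node.size = 1
    else:
        node.size = sum(annotate(c) for c in node.children.values())
    return node.size


def locate(root, pattern):
    """Steps 1 to 4: return node w (match ended at w or on its incoming edge), or None."""
    node, i = root, 0                         # step 1
    while i < len(pattern):
        child = node.children.get(pattern[i]) # step 2
        if child is None:
            return None                       # step 4b: no matching edge
        for ch in child.label:                # step 3
            if i == len(pattern):
                break                         # step 4a: pattern ended inside the edge
            if ch != pattern[i]:
                return None                   # step 4b: mismatch on the edge
            i += 1
        node = child
    return node


def count(root, pattern):
    """Counting query: constant time after the O(m) descent."""
    w = locate(root, pattern)
    return w.size if w is not None else 0


def report(root, pattern):
    """Reporting query: sorted 1-based start positions of all occurrences."""
    w = locate(root, pattern)
    stack = [w] if w is not None else []
    found = []
    while stack:
        node = stack.pop()
        if node.leaf is not None:
            found.append(node.leaf)
        stack.extend(node.children.values())
    return sorted(found)


root = build_suffix_tree("ACCTTCCT$")
print(count(root, "CCT"), report(root, "CCT"))  # 2 [2, 6]
print(count(root, "CG"), report(root, "CG"))    # 0 []
```

The sort in report is only for readable output; without it the leaves are collected in time proportional to the number of occurrences.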
Example for Searching
P=CCT
P=CG
[Figure: suffix tree for T = ACCTTCCT$, as in the previous example]
Summary
Task
Find pattern P of length m in a text T of length n.
Suffix Tree
The suffix tree for T can be constructed in O(n) time and space. With the suffix tree, the counting query can be solved in O(m) time and the reporting query in O(m + Occ_T^P) time, where Occ_T^P is the number of occurrences of P in T.
Applications
1 searching for exact patterns (already discussed)
2 find Maximal Unique Matches
3 find all maximal pairs
2. Maximal Unique Matches
We have as an input two sequences A and B.
Definition
an occurrence of the same substring in A and B is called a match
a match in A and B is left (right) maximal if the match cannot be extended to the left (right), i.e. the characters to the immediate left (right) differ
a Maximal Unique Match (MUM) is a substring that occurs exactly once in both A and B and is left and right maximal
Example: MUMs for A=ATGAC and B=AGAGGAC
GAC is a Maximal Unique Match as it occurs only once in A and B and cannot be extended
AG is not a Maximal Unique Match as it occurs twice in B
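The definition can be tested directly with a brute-force search over all substrings of A. This sketch is far slower than a suffix tree approach and is only meant to make the definition concrete; the names occurrences and mums_bruteforce are hypothetical, and a match that touches the start or end of a sequence is treated as not extendable in that direction.

```python
def occurrences(text, s):
    """All 0-based start positions of s in text (overlaps allowed)."""
    return [i for i in range(len(text) - len(s) + 1) if text.startswith(s, i)]


def mums_bruteforce(a, b):
    """Return the sorted list of Maximal Unique Matches of a and b."""
    result = set()
    for i in range(len(a)):
        for j in range(i + 1, len(a) + 1):
            s = a[i:j]
            occ_a, occ_b = occurrences(a, s), occurrences(b, s)
            if len(occ_a) != 1 or len(occ_b) != 1:
                continue                      # not unique in A or in B
            p, q = occ_a[0], occ_b[0]
            left_max = p == 0 or q == 0 or a[p - 1] != b[q - 1]
            end_a, end_b = p + len(s), q + len(s)
            right_max = end_a == len(a) or end_b == len(b) or a[end_a] != b[end_b]
            if left_max and right_max:
                result.add(s)
    return sorted(result)


print(mums_bruteforce("ATGAC", "AGAGGAC"))  # ['GAC']
```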
2. Maximal Unique Matches (cont.)
Why do we need MUMs? ⇒ for global alignments of large sequences
a significantly long MUM is almost certain to be part of a global alignment of the sequences A and B
to get the full alignment we only need to align the sequences in the gaps between the MUMs
How to efficiently find all MUMs?
generalized suffix tree for the string A#B$
2. Maximal Unique Matches (cont.)
leaf labels: the first entry identifies the string and the second one the starting position
observation: we can delete the edge label on leaf nodes after the #
example for A=CGAA and B=CGA, CGAA#CGA$
[Figure: generalized suffix tree for CGAA#CGA$; node w (path label CGA) has the leaves (A,1) and (B,1), node v (path label GA) has the leaves (A,2) and (B,2); blue arrows are suffix links, sl(w) = v]
2. Maximal Unique Matches (cont.)
1 create the generalized suffix tree T for A#B$
2 mark each internal node v of T with exactly two child nodes where one is a leaf from A and the other is a leaf from B
3 for each internal node v unmark sl(v)
4 report all marked nodes as Maximal Unique Matches
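The four steps can be sketched as follows, under two stated assumptions. First, the generalized suffix tree is built naively in O(n^2) time. Second, step 3 is replaced by an equivalent test: instead of following suffix links, the sketch compares the characters immediately to the left of the two occurrences, since a marked node is the target of a suffix link exactly when both occurrences are preceded by the same character. All names (Node, build_suffix_tree, find_mums) are hypothetical.

```python
class Node:
    def __init__(self, label="", leaf=None):
        self.label, self.leaf = label, leaf  # incoming edge label, 1-based leaf number
        self.children = {}                   # first character of an edge label -> child


def build_suffix_tree(text):
    """Naive O(n^2) construction; text must end in a unique sentinel."""
    root = Node()
    for start in range(len(text)):
        node, rest = root, text[start:]
        while True:
            child = node.children.get(rest[0])
            if child is None:
                node.children[rest[0]] = Node(rest, start + 1)
                break
            k = 0
            while k < len(child.label) and child.label[k] == rest[k]:
                k += 1
            if k == len(child.label):
                node, rest = child, rest[k:]
                continue
            mid = Node(child.label[:k])      # split the edge at the mismatch
            node.children[rest[0]] = mid
            child.label = child.label[k:]
            mid.children[child.label[0]] = child
            mid.children[rest[k]] = Node(rest[k:], start + 1)
            break
    return root


def find_mums(a, b):
    """Return sorted (mum, start in A, start in B) triples, positions 1-based."""
    if any(c in a + b for c in "#$"):
        raise ValueError("# and $ must not occur in the input sequences")
    text = a + "#" + b + "$"
    root = build_suffix_tree(text)           # step 1
    mums = []

    def visit(node, path):
        kids = list(node.children.values())
        # step 2: exactly two children, both leaves, one from A and one from B
        if node is not root and len(kids) == 2 and all(k.leaf is not None for k in kids):
            i, j = sorted(k.leaf for k in kids)
            if i <= len(a) < j:
                # step 3 (replaced): keep the node only if the left characters differ
                if i == 1 or text[i - 2] != text[j - 2]:
                    mums.append((path, i, j - len(a) - 1))
        for kid in kids:
            visit(kid, path + kid.label)

    visit(root, "")
    return sorted(mums)                      # step 4


print(find_mums("CGAA", "CGA"))  # [('CGA', 1, 1)]
```

Right maximality needs no extra test here: a node with two children is followed by two different characters by construction.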
2. Maximal Unique Matches (cont.)
[Figure: generalized suffix tree for CGAA#CGA$ with the nodes v and w]
1 create the generalized suffix tree for CGAA#CGA$
2 mark nodes v and w
3 unmark node v, as v = sl(w)
4 report node w (path label CGA) as a Maximal Unique Match
2. Maximal Unique Matches (cont.)
CGA is a MUM, as node w (path label CGA) has exactly one child labeled with A and one with B, and it cannot be extended to the left
GA is not a MUM, as GA can be extended to the left
example for A=CGAA and B=CGA, CGAA#CGA$
[Figure: generalized suffix tree for CGAA#CGA$ with the nodes v and w]
3. All maximal pairs
Definition
A maximal pair in a sequence A is a pair of occurrences of the substring α in A such that the characters to the immediate left (right) of the two occurrences differ (the pair is left and right maximal). A maximal pair is represented by (i, j, |α|), where i and j are the starting positions of the occurrences of α.
Example:
pos:  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
A  =  A  G  A  C  C  A  G  A  C  A  T  A  G  A  C  A
maximal pair AGAC: (1,6,4) and (1,12,4)
maximal pair AGACA: (6,12,5)
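The example can be verified by brute force. For a fixed pair of start positions there is at most one right-maximal match, namely the longest common extension, so it is enough to compute that length and to test the characters on the left. The function name maximal_pairs_bruteforce is hypothetical, and a sequence boundary is treated as a character that differs from everything.

```python
def maximal_pairs_bruteforce(a):
    """Return all maximal pairs (i, j, length) with 1-based positions and i < j."""
    n = len(a)
    pairs = []
    for i in range(n):
        for j in range(i + 1, n):
            if i > 0 and a[i - 1] == a[j - 1]:
                continue                 # extendable to the left: not left maximal
            k = 0                        # longest common extension to the right
            while j + k < n and a[i + k] == a[j + k]:
                k += 1
            if k > 0:
                pairs.append((i + 1, j + 1, k))
    return pairs


pairs = maximal_pairs_bruteforce("AGACCAGACATAGACA")
print((1, 6, 4) in pairs, (1, 12, 4) in pairs, (6, 12, 5) in pairs)  # True True True
```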
3. All maximal pairs (cont.)
build the suffix tree for sequence A
leaf annotation: in addition to the position i of the suffix, we store the character A_{i-1} that occurs immediately before the suffix
[Figure: suffix tree for ACCTTCCT$ with leaf annotations (preceding character, position): (_,1), (A,2), (C,3), (C,4), (T,5), (T,6), (C,7), (C,8), (T,9)]
3. All maximal pairs (cont.)
observation: a substring α can only form a maximal pair if the corresponding node (with path label α) has at least two children (⇒ right maximal) with different characters in their annotation (⇒ left maximal)
How to find all maximal pairs of a node v?
Reporting: for each character x and each child v′ of v, the cartesian product of the list for x at v′ with the union of every list for a character x′ ≠ x at a child w ≠ v′ is formed; each pair in this product, together with the string depth of v, is a maximal pair
Linking: to create the list for character x at node v, we link the lists for character x that exist at each of v's children
do a post-order traversal of the nodes in the suffix tree to get all maximal pairs
3. All maximal pairs (cont.)
[Figure: annotated suffix tree for ACCTTCCT$; node v (path label CCT) has the leaf children 2 (annotation A) and 6 (annotation T) and receives the lists T: 6 and A: 2]
for node v = CCT, we report the maximal pair CCT as (2,6,3)
we build the annotation for node v by combining the two leaf annotations of the children of v
3. All maximal pairs (cont.)
[Figure: annotated suffix tree for ACCTTCCT$; node w (path label CT) has the leaf children 3 and 7, both annotated with C, and receives the list C: 7, 3]
for node w = CT, we report no maximal pair
we build the annotation for node w by combining the two leaf annotations of the children of w
3. All maximal pairs (cont.)
[Figure: annotated suffix tree for ACCTTCCT$ with all internal nodes annotated; lists at node C: T: 6, A: 2, C: 7, 3; lists at node T: C: 4, 8 and T: 5]
repeat steps for all internal nodes
report the following maximal pairs:
1 CCT as (2,6,3)
2 C as (6,7,1), (3,6,1), (2,3,1), (2,7,1)
3 T as (5,8,1), (4,5,1)
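The pairs of this example can be reproduced with a short sketch of the post-order traversal. Two simplifications are assumptions of this sketch and not part of the slides: the tree is built naively in O(n^2) time, and the per-character lists are Python lists that are merged with extend instead of being linked in constant time. The names Node, build_suffix_tree and maximal_pairs are hypothetical.

```python
class Node:
    def __init__(self, label="", leaf=None):
        self.label, self.leaf = label, leaf  # incoming edge label, 1-based leaf number
        self.children = {}                   # first character of an edge label -> child


def build_suffix_tree(text):
    """Naive O(n^2) construction; text must end in a unique sentinel such as $."""
    root = Node()
    for start in range(len(text)):
        node, rest = root, text[start:]
        while True:
            child = node.children.get(rest[0])
            if child is None:
                node.children[rest[0]] = Node(rest, start + 1)
                break
            k = 0
            while k < len(child.label) and child.label[k] == rest[k]:
                k += 1
            if k == len(child.label):
                node, rest = child, rest[k:]
                continue
            mid = Node(child.label[:k])      # split the edge at the mismatch
            node.children[rest[0]] = mid
            child.label = child.label[k:]
            mid.children[child.label[0]] = child
            mid.children[rest[k]] = Node(rest[k:], start + 1)
            break
    return root


def maximal_pairs(text):
    """Return the sorted maximal pairs (i, j, length) of text, which ends in $."""
    root = build_suffix_tree(text)
    pairs = []

    def visit(node, depth):
        """Post-order step: return {left character: [positions]} for the subtree."""
        if node.leaf is not None:
            # leaf annotation: the character before the suffix ("" for position 1)
            left = text[node.leaf - 2] if node.leaf > 1 else ""
            return {left: [node.leaf]}
        merged = {}
        for child in node.children.values():
            lists = visit(child, depth + len(child.label))
            if depth > 0:                    # the root spells the empty string
                # reporting: combine lists of different characters from different children
                for x, ps in merged.items():
                    for y, qs in lists.items():
                        if x != y:
                            pairs.extend((min(p, q), max(p, q), depth)
                                         for p in ps for q in qs)
            for y, qs in lists.items():      # linking: merge the lists per character
                merged.setdefault(y, []).extend(qs)
        return merged

    visit(root, 0)
    return sorted(pairs)


print(maximal_pairs("ACCTTCCT$"))
# [(2, 3, 1), (2, 6, 3), (2, 7, 1), (3, 6, 1), (4, 5, 1), (5, 8, 1), (6, 7, 1)]
```

Each pair is produced once, because two positions are combined only at the node where their suffixes branch apart.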
3. All maximal pairs (cont.)
Runtime analysis
creation of the suffix tree, the post-order traversal, and all the list linking take O(n) time
each operation of the cartesian product produces a unique maximal pair ⇒ O(k) time, where k is the number of maximal pairs
in total the algorithm takes O(n + k) time
in many applications we are only interested in maximal pairs of at least a certain length m ⇒ runtime is reduced to O(n + k_m), where k_m is the number of maximal pairs with length ≥ m