exact string matching, suffix trees, and applications
DESCRIPTION
Exact String Matching, Suffix Trees, and Applications. CAP 5937 Bioinformatics University of Central Florida Fall 2004. Problem. - PowerPoint PPT PresentationTRANSCRIPT
![Page 1: Exact String Matching, Suffix Trees, and Applications](https://reader036.vdocuments.site/reader036/viewer/2022062309/568158ca550346895dc6148c/html5/thumbnails/1.jpg)
1
Exact String Matching, Suffix Trees, and Applications
CAP 5937 Bioinformatics
University of Central Florida
Fall 2004
![Page 2: Exact String Matching, Suffix Trees, and Applications](https://reader036.vdocuments.site/reader036/viewer/2022062309/568158ca550346895dc6148c/html5/thumbnails/2.jpg)
2
Problem
Given a string P called the pattern and longer string T called the text, the exact matching problem is to find all occurrences, if any, of pattern P in text.
![Page 3: Exact String Matching, Suffix Trees, and Applications](https://reader036.vdocuments.site/reader036/viewer/2022062309/568158ca550346895dc6148c/html5/thumbnails/3.jpg)
3
Notations
T: Text m: length = |T| P: Pattern n: length = |P| S: String s1 s2 s3 …….sn
Example: S=“AGCTTGA” |S| = 7 Substring: Si,j=SiS i+1…Sj
Example: S2,4=“GCT” Subsequence of S: deleting zero or more characters from S
“ACT” and “GCTT” are subsquences. Prefix of S: S1,k
“AGCT” is a prefix of S. Suffix of S: Sh,|S|
“CTTGA” is a suffix of S.
![Page 4: Exact String Matching, Suffix Trees, and Applications](https://reader036.vdocuments.site/reader036/viewer/2022062309/568158ca550346895dc6148c/html5/thumbnails/4.jpg)
4
0 1 1234567890123
T:xabxyabxyabxz
P:abxyabxz
*abxyabxz
^^^^^^^*
abxyabxz
*abxyabxz
*abxyabxz
*abxyabxz
^^^^^^^^
0 1 1234567890123T:xabxyabxyabxzP:abxyabxz *abxyabxz ^^^^^^^* abxyabxz ^^^^^^^^ ^^^^
![Page 5: Exact String Matching, Suffix Trees, and Applications](https://reader036.vdocuments.site/reader036/viewer/2022062309/568158ca550346895dc6148c/html5/thumbnails/5.jpg)
5
Z
Given a string S, and position i > 1, let Zi(S) be the length of the longest substring of S that starts at i and matches a prefix of S.
12345678901
S= aabcaabxaaz
Z5(S) = 3 (aabc…aabx)
Z6(S) = 1 (aa…ab)
Z7(S) = Z8(S) = 0
Z9(S) = 2 (aab…aaz)
![Page 6: Exact String Matching, Suffix Trees, and Applications](https://reader036.vdocuments.site/reader036/viewer/2022062309/568158ca550346895dc6148c/html5/thumbnails/6.jpg)
6
For any position i > 1 where Zi is greater than zero, the Z-box at i is defined as the substring starting at i and ending at position i + Zi -1
i
Zi
![Page 7: Exact String Matching, Suffix Trees, and Applications](https://reader036.vdocuments.site/reader036/viewer/2022062309/568158ca550346895dc6148c/html5/thumbnails/7.jpg)
7
For every i > 1, ri is the right-most endpoint of the Z-boxes that begin at or before position i and li is the left end of the Z-box.
12345678901234567
S = aabaabcaxaabaabcy
Z10 = 7, r15=16, l15 =10
![Page 8: Exact String Matching, Suffix Trees, and Applications](https://reader036.vdocuments.site/reader036/viewer/2022062309/568158ca550346895dc6148c/html5/thumbnails/8.jpg)
8
Z calculation (Linear)
Ref: Gusfield’s book Section 1.4 Given Zi for all 1 < i k-1 and the current v
alues r and l, Zk and updated r and l are computed as follows:
1. k > r. Find Zk by comparing the characters starting at position1 of S.
![Page 9: Exact String Matching, Suffix Trees, and Applications](https://reader036.vdocuments.site/reader036/viewer/2022062309/568158ca550346895dc6148c/html5/thumbnails/9.jpg)
9
2. k < r.
S
k’ Zl l rk
S
k’ Zl l rk
k’+Zk’-1
Zk’
k+Zk-1
S
k’ Zl l rk
?
k’+Zk’-1
Zk’ < | |
Zk’ | |
![Page 10: Exact String Matching, Suffix Trees, and Applications](https://reader036.vdocuments.site/reader036/viewer/2022062309/568158ca550346895dc6148c/html5/thumbnails/10.jpg)
10
First string matching algorithm Consider string P$T, where P is the pattern,
T is the text and $ is special alphabet. The algorithm is to calculate the Z value of th
e string P$T. For all i such that Zi = |P|, P matches the subsrting T[i..i+|P|-1].
The calculation time is linear.
![Page 11: Exact String Matching, Suffix Trees, and Applications](https://reader036.vdocuments.site/reader036/viewer/2022062309/568158ca550346895dc6148c/html5/thumbnails/11.jpg)
11
Classical Comparison-Based Methods
Boyer-Moore Algorithm Knuth-Morris-Pratt Algorithm Apostolico-Giancarlo Algorithm Aho-Corasick Algorithm
![Page 12: Exact String Matching, Suffix Trees, and Applications](https://reader036.vdocuments.site/reader036/viewer/2022062309/568158ca550346895dc6148c/html5/thumbnails/12.jpg)
12
Boyer-Morris Algorithm
Right-to-left scan
12345678901234567890
T: xpbctbzabpqxctbpq
P: tpabxab
![Page 13: Exact String Matching, Suffix Trees, and Applications](https://reader036.vdocuments.site/reader036/viewer/2022062309/568158ca550346895dc6148c/html5/thumbnails/13.jpg)
13
Bad character rule For each character x in the alphabet, let
R(x) be the position of right-most occurrence of character x in P. R(x) is defined to be zero if x does not occur in P.
12345678901234567890
T: xpbctbxabpqxctbpq
P: tpabxab R(t)= 1
tpabxab R(q)= 0
tpabxab
![Page 14: Exact String Matching, Suffix Trees, and Applications](https://reader036.vdocuments.site/reader036/viewer/2022062309/568158ca550346895dc6148c/html5/thumbnails/14.jpg)
14
Extended bad character rule
12345678901234567890
T: xpbcabxabpqxctbpq
P: aptbxab R(a)= 6
aptbxab R(q)= 0
aptbxab
When a mismatch occurs at position i of P and the mismatched character in T is x, then shift P to the right so that the closest x to the left of position i in P is below the mismatched x in T
![Page 15: Exact String Matching, Suffix Trees, and Applications](https://reader036.vdocuments.site/reader036/viewer/2022062309/568158ca550346895dc6148c/html5/thumbnails/15.jpg)
15
Strong good suffix rule
x
y
z
z
T
P before shift
P after shift
![Page 16: Exact String Matching, Suffix Trees, and Applications](https://reader036.vdocuments.site/reader036/viewer/2022062309/568158ca550346895dc6148c/html5/thumbnails/16.jpg)
16
0 1 123456789012345678T: prstabstubabvqxrst *P: qcabdabdab 1234567890 qcabdabdab weak rule qcabdabdab strong rule
![Page 17: Exact String Matching, Suffix Trees, and Applications](https://reader036.vdocuments.site/reader036/viewer/2022062309/568158ca550346895dc6148c/html5/thumbnails/17.jpg)
17
For each i, L(i) is the largest position less than n such that string P[i..n] matches a suffix of P[1..L(i)]. If no, L(i) = 0.
So is L’(i) with the characters to the left of the suffix are different.
Pi
nL(i)
i
n
PL’(i)
xy
xy
![Page 18: Exact String Matching, Suffix Trees, and Applications](https://reader036.vdocuments.site/reader036/viewer/2022062309/568158ca550346895dc6148c/html5/thumbnails/18.jpg)
18
Nj(P) is the length of the longest suffix of the substring P[1..j] that is also a suffix of the full string P.
Zi(P) is the length of the longest prefix of P[i..n] that is also a prefix of the full string P.
So, Nj(P) = Zn-j+1(Pr)
P
j-Nj(P)+1
nj
![Page 19: Exact String Matching, Suffix Trees, and Applications](https://reader036.vdocuments.site/reader036/viewer/2022062309/568158ca550346895dc6148c/html5/thumbnails/19.jpg)
19
P
j-Nj(P)+1
nj
123456789
P cabdabdab L(8) = 6, L’(8) = 3
abdab
abdab N3(P)=2, N6(P)=5
i
i+Zi-11
![Page 20: Exact String Matching, Suffix Trees, and Applications](https://reader036.vdocuments.site/reader036/viewer/2022062309/568158ca550346895dc6148c/html5/thumbnails/20.jpg)
20
Calculation of L(i)
L(i) is the largest index j less than n such that Nj(P) |P[i..n]|.
L’(i) is the largest index j less than n such that Nj(P) = |P[i..n]|.
Algorithm For i := 1 to n-1 do L’(i) := 0; For j := 1 to n-1 do
Begin i := n-Nj(P) +1; L’(i) := j End
![Page 21: Exact String Matching, Suffix Trees, and Applications](https://reader036.vdocuments.site/reader036/viewer/2022062309/568158ca550346895dc6148c/html5/thumbnails/21.jpg)
21
Let l’(i) denote the length of the largest suffix of |P[i..n]| that is also a prefix of P, if one exists. If none exists, then let l’(i) = 0.
l’(i) equals the largest j < |P[i..n]| such that Nj(P) = j.
![Page 22: Exact String Matching, Suffix Trees, and Applications](https://reader036.vdocuments.site/reader036/viewer/2022062309/568158ca550346895dc6148c/html5/thumbnails/22.jpg)
22
Knuth-Morris-Pratt Algorithm
For each position i in pattern P, defines spi(P) (resp. spi’(P) )to be the length of the longest proper suffix of P[1..i] that matches a prefix of P and (resp. P(i+1) P(sp’i(P)+1)).
i
spi(P)1P
x y
![Page 23: Exact String Matching, Suffix Trees, and Applications](https://reader036.vdocuments.site/reader036/viewer/2022062309/568158ca550346895dc6148c/html5/thumbnails/23.jpg)
23
k-1
P before shift
P after shift
Missed occurrence of P
T
is a prefix of P.Up to position k-1, P matches T,Thus, is also a suffix of P[1..k-1].
Shift rule of Knuth-Morris-Pratt algorithm
![Page 24: Exact String Matching, Suffix Trees, and Applications](https://reader036.vdocuments.site/reader036/viewer/2022062309/568158ca550346895dc6148c/html5/thumbnails/24.jpg)
24
spi(P) calculation
For any i >1, spi(P) = Zj = i-j+1, where j> 1 is the smallest position that maps to i.
i
spi’(P)1P
x y
i- spi’(P)+1
Z i- spi’(P)+1= spi’(P)
![Page 25: Exact String Matching, Suffix Trees, and Applications](https://reader036.vdocuments.site/reader036/viewer/2022062309/568158ca550346895dc6148c/html5/thumbnails/25.jpg)
25
123456789012345678 xyabcxabcxadcdqfeg abcxabcde 123456789 abcxabcde abcxabcde sp2=0, sp3=0, sp4=0, sp5=1, sp6=2, sp7=3, sp8=0, sp9=0.
![Page 26: Exact String Matching, Suffix Trees, and Applications](https://reader036.vdocuments.site/reader036/viewer/2022062309/568158ca550346895dc6148c/html5/thumbnails/26.jpg)
26
Theorem
After a mismatch at postion i+1 of P and a shift of i-spi’ places to the right, the left-most i-spi’ characters of P are gurranteed to match their counterparts in T.
For any alignment of P with T, if character 1 through i of P match the opposing characters of T but character i+1 mismatches T(k), then P can be shifted by i-spi’ places to the right without passing any occurrence of P in T.
![Page 27: Exact String Matching, Suffix Trees, and Applications](https://reader036.vdocuments.site/reader036/viewer/2022062309/568158ca550346895dc6148c/html5/thumbnails/27.jpg)
27
Classical Comparison-Based Methods
Boyer-Moore Algorithm Knuth-Morris-Pratt Algorithm Apostolico-Giancarlo Algorithm Aho-Corasick Algorithm
A Demonstration
![Page 28: Exact String Matching, Suffix Trees, and Applications](https://reader036.vdocuments.site/reader036/viewer/2022062309/568158ca550346895dc6148c/html5/thumbnails/28.jpg)
28
Exact matching with a set of patterns
Exact set matching problem is to find all the occurrences in a text T of a set of patterns P = {P
1, …, Pz}.
Dictionary problem: Given a text T, ask if T is a pattern in P.
![Page 29: Exact String Matching, Suffix Trees, and Applications](https://reader036.vdocuments.site/reader036/viewer/2022062309/568158ca550346895dc6148c/html5/thumbnails/29.jpg)
29
Keyword Tree
Keyword tree K for P each edge is labeled with exactly one character any two edges out of the same node have distinct
labels every pattern Pi in P maps to some node v of K
such that the characters on the path from the root of K to v exactly spell out Pi, and every leaf of K is mapped to by some pattern in P.
![Page 30: Exact String Matching, Suffix Trees, and Applications](https://reader036.vdocuments.site/reader036/viewer/2022062309/568158ca550346895dc6148c/html5/thumbnails/30.jpg)
30
Assumption: No pattern in P is a proper substring of any other pattern in P.
L(v) = the labels from root to the node v.lp(v) = the length of the longest proper suffix of strin
g of L(v) that is a prefix of some pattern in P. Lemma: Let be the lp(v)-length suffix of string L
(v). Then there is a unique node in the keyword tree that is labeled by string .
The unique node is denoted by nv.
When lp(v) =0, nv is the root.
nv for all v can be constructed in linear time.
![Page 31: Exact String Matching, Suffix Trees, and Applications](https://reader036.vdocuments.site/reader036/viewer/2022062309/568158ca550346895dc6148c/html5/thumbnails/31.jpg)
31
P = {potato, tattoo, theater, other} L(v) = potalp(v) = 2
p
o
t
a
t
o
1
t
t
t
t
o
o
o
th
e
r
h
e
e
aa
r
2
3
4v
nv
T = xxpotattooxx
![Page 32: Exact String Matching, Suffix Trees, and Applications](https://reader036.vdocuments.site/reader036/viewer/2022062309/568158ca550346895dc6148c/html5/thumbnails/32.jpg)
32
p
o
t
a
t
o
1
t
t
t
t
o
o
o
th
e
r
h
e
e
aa
r
2
3
4
T = xxpotattooxx
![Page 33: Exact String Matching, Suffix Trees, and Applications](https://reader036.vdocuments.site/reader036/viewer/2022062309/568158ca550346895dc6148c/html5/thumbnails/33.jpg)
33
nv is computed in linear time.Consider, for each pattern, two pointers, one points the current processing position and the other points to left end of the match suffix. We will see that each operation causes the pointers move forward, but they only move 2n times.
![Page 34: Exact String Matching, Suffix Trees, and Applications](https://reader036.vdocuments.site/reader036/viewer/2022062309/568158ca550346895dc6148c/html5/thumbnails/34.jpg)
34
p
o
t
a
t
o
1
t
t
t
t
o
o
o
th
e
r
h
e
e
aa
r
2
3
4
T = xxpotattooxx
![Page 35: Exact String Matching, Suffix Trees, and Applications](https://reader036.vdocuments.site/reader036/viewer/2022062309/568158ca550346895dc6148c/html5/thumbnails/35.jpg)
35
Aho-Corasick Algorithm
Without assumption. P = {acatt, ca}, T= acatx Suppose in a keyword tree K there is a direct
path of failure links from a node v to a node that numbered with pattern i. Then pattern Pi must occur in T ending at position c whenever node v is reached during the search of Aho-Corasick algorithm.
![Page 36: Exact String Matching, Suffix Trees, and Applications](https://reader036.vdocuments.site/reader036/viewer/2022062309/568158ca550346895dc6148c/html5/thumbnails/36.jpg)
36
Suppose a node v has been reached during the algorithm. Then the pattern Pi occurs in T ending at position c only if v is numbered i or there is a directed path of failure links from links from v to the node numbered i.
The output link at v points to that numbered node other than v that is reachable from v by the fewest failure links.
![Page 37: Exact String Matching, Suffix Trees, and Applications](https://reader036.vdocuments.site/reader036/viewer/2022062309/568158ca550346895dc6148c/html5/thumbnails/37.jpg)
37
P = {abcdefg, de, bcde, defg}T = xabcdefxcdefgx
a
bb
c c
d d
d
e
e
e
g
f
f
g
![Page 38: Exact String Matching, Suffix Trees, and Applications](https://reader036.vdocuments.site/reader036/viewer/2022062309/568158ca550346895dc6148c/html5/thumbnails/38.jpg)
38
Matching against DNA Library Sequence-tagged-sites(STS)
A DNA string of 200-300 bps whose right and left ends, of length 20 – 30 bps each, occur only once in the entire genome.
Expressed sequence tags (EST) A STS that comes from genes rather than parts of
inter-gene DNA. (Obtained from mRNA or cDNA)
![Page 39: Exact String Matching, Suffix Trees, and Applications](https://reader036.vdocuments.site/reader036/viewer/2022062309/568158ca550346895dc6148c/html5/thumbnails/39.jpg)
39
The set of patterns: all known STSs or ESTs
Text: a newly sequenced genome Goal: To identify STSs or ESTs occur in th
e newly sequences genome
![Page 40: Exact String Matching, Suffix Trees, and Applications](https://reader036.vdocuments.site/reader036/viewer/2022062309/568158ca550346895dc6148c/html5/thumbnails/40.jpg)
40
Seminumerical String Matching Shift-And Method
Let M be an n by m+1 binary matrix. M(i,j) = 1 if and only if the first i characters of P exact match the i characters of T ending at character j.
M(n,j) = 1 if and only if an occurrence of P ends at position j of T.
Bit-Shift(j-1) : shift column j-1 down by one position and set the first to 1.
![Page 41: Exact String Matching, Suffix Trees, and Applications](https://reader036.vdocuments.site/reader036/viewer/2022062309/568158ca550346895dc6148c/html5/thumbnails/41.jpg)
41
T= xabxabaaxa P= abaac C(8)T=(1 0 1 0 0) Bit-Shift C(8)T= (1 1 0 1 0) T(9)= a, Ua
T= (1 0 1 1 0)
C(9)T = C(8)T AND UaT = (1 0 0 1 0)
M(i, j) = 1 if and only if M(i-1, j-1) =1 and UT(j) (i) =1
![Page 42: Exact String Matching, Suffix Trees, and Applications](https://reader036.vdocuments.site/reader036/viewer/2022062309/568158ca550346895dc6148c/html5/thumbnails/42.jpg)
42
Advantage of Shift-And Very efficient if n is less than the size of single
computer word. Only two columns are needed in each computa
tion time. Agrep: The Shift-And method with errors.
Mk(i,j) is 1 if and only if at least i-k of the first i characters of P match the i characters up through character j of T.
In Agrep, the user chooses a value of k and then the arrays M, M1, …, Mk are computed.
![Page 43: Exact String Matching, Suffix Trees, and Applications](https://reader036.vdocuments.site/reader036/viewer/2022062309/568158ca550346895dc6148c/html5/thumbnails/43.jpg)
43
Ml(j) = Ml-1(j)
OR [Bit-Shift(Ml(j-1)) AND U(T(j))] OR Ml-1(j-1)
Computation time = O(kmn)
![Page 44: Exact String Matching, Suffix Trees, and Applications](https://reader036.vdocuments.site/reader036/viewer/2022062309/568158ca550346895dc6148c/html5/thumbnails/44.jpg)
44
Karp-Rabin fingerprint method Tr
n denote the n-length substring of T starting character r.
1
( ) 2 ( )n
n i
i
H P P i
1
( ) 2 ( 1)n
n ir
i
H T T r i
![Page 45: Exact String Matching, Suffix Trees, and Applications](https://reader036.vdocuments.site/reader036/viewer/2022062309/568158ca550346895dc6148c/html5/thumbnails/45.jpg)
45
There is an occurrence of P starting at position r if and only if H(P) = H(Tr).
Hp(P) = H(P) mod p and Hp(Tr) = H(Tr) mod p are called fingerprint of P and Tr .
Hp(P) = Hp(Tr) may introduce false match. (u) = the number of primes that are less
than or equal to u.
( ) 1.26ln( ) ln( )
u uπ u
u u
![Page 46: Exact String Matching, Suffix Trees, and Applications](https://reader036.vdocuments.site/reader036/viewer/2022062309/568158ca550346895dc6148c/html5/thumbnails/46.jpg)
46
If u 29, then the product of all the primes that are less than or equal to u is greater than 2u.
If u 29 and x is any number less than or equal to 2u, than x has fewer than (u) (distinct) prime divisors.
![Page 47: Exact String Matching, Suffix Trees, and Applications](https://reader036.vdocuments.site/reader036/viewer/2022062309/568158ca550346895dc6148c/html5/thumbnails/47.jpg)
47
Let P and T be any strings such that nm > 29. Let I be any positive integer. If p is a randomly chosen prime number less than or equal to I, then the probability of a false match between P and T is less than or equal to (mn)/(I).
R: the set of position in T, P does not begin.
Consider There are at most (mn) prime divisors p is randomly chosen from I.
(| ( ) ( ) |) 2mnss R
H P H T
![Page 48: Exact String Matching, Suffix Trees, and Applications](https://reader036.vdocuments.site/reader036/viewer/2022062309/568158ca550346895dc6148c/html5/thumbnails/48.jpg)
48
Algorithm
Choose a positive integer I. Randomly pick a prime number less than or e
qual to I, and compute Hp(P).
For each position r in T, compute Hp(Tr) and test if it equals Hp(P).
When I = nm2,the probability of a false match is at most 2.53/m.
![Page 49: Exact String Matching, Suffix Trees, and Applications](https://reader036.vdocuments.site/reader036/viewer/2022062309/568158ca550346895dc6148c/html5/thumbnails/49.jpg)
49
p
t t
o
te
i
h o o l
e
n
c
e
rr
yy
at
os
e c
P = {potato, pottery, poetry, school, science}
v
L(v) = pota
![Page 50: Exact String Matching, Suffix Trees, and Applications](https://reader036.vdocuments.site/reader036/viewer/2022062309/568158ca550346895dc6148c/html5/thumbnails/50.jpg)
50
Motivating Suffix Tree
![Page 51: Exact String Matching, Suffix Trees, and Applications](https://reader036.vdocuments.site/reader036/viewer/2022062309/568158ca550346895dc6148c/html5/thumbnails/51.jpg)
51
Exact String Matching
Input: P and T. Output: All occurrences of P in T. Time: O(|P| + |T|) Technique: Z values of PT.
Z(i + |P|) ≥ |P| iff P = T[i…i + |P| – 1].
P
i+|P| i+|P|+d-1
T
![Page 52: Exact String Matching, Suffix Trees, and Applications](https://reader036.vdocuments.site/reader036/viewer/2022062309/568158ca550346895dc6148c/html5/thumbnails/52.jpg)
52
Question 1
Solving the Exact String Matching problem in O(|P|) time under the assumption that T is known and already pre-processed? E.g., T is a dictionary whose content does not
change frequently. Answer:
![Page 53: Exact String Matching, Suffix Trees, and Applications](https://reader036.vdocuments.site/reader036/viewer/2022062309/568158ca550346895dc6148c/html5/thumbnails/53.jpg)
53
Question 2
Solving the Exact String Matching problem in O(|T|) time under the assumption that P is known and already pre-processed? E.g., P is one of your private collection of DNA
sequence. Answer:
![Page 54: Exact String Matching, Suffix Trees, and Applications](https://reader036.vdocuments.site/reader036/viewer/2022062309/568158ca550346895dc6148c/html5/thumbnails/54.jpg)
54
A Less Ambitious Version
The Substring Problem Input: P and T. Output: an occurrence of P in T.
![Page 55: Exact String Matching, Suffix Trees, and Applications](https://reader036.vdocuments.site/reader036/viewer/2022062309/568158ca550346895dc6148c/html5/thumbnails/55.jpg)
55
Question 2
Solving the Substring problem in O(|T|) time under the assumption that P is known and already pre-processed?
Answer:
![Page 56: Exact String Matching, Suffix Trees, and Applications](https://reader036.vdocuments.site/reader036/viewer/2022062309/568158ca550346895dc6148c/html5/thumbnails/56.jpg)
56
Question 1
Solving the Substring problem in O(|P|) time under the assumption that T is known and already pre-processed?
Answer:
![Page 57: Exact String Matching, Suffix Trees, and Applications](https://reader036.vdocuments.site/reader036/viewer/2022062309/568158ca550346895dc6148c/html5/thumbnails/57.jpg)
57
To P or not to P .........
Preprocessing P Gusfield Boyer-Moore Knuth-Morris-Pratt
Preprocessing T Suffix tree
![Page 58: Exact String Matching, Suffix Trees, and Applications](https://reader036.vdocuments.site/reader036/viewer/2022062309/568158ca550346895dc6148c/html5/thumbnails/58.jpg)
58
From Suffix Trie to Suffix Tree
![Page 59: Exact String Matching, Suffix Trees, and Applications](https://reader036.vdocuments.site/reader036/viewer/2022062309/568158ca550346895dc6148c/html5/thumbnails/59.jpg)
59
Notation Change
Input: P and S. Output: an occurrence of P in S.
For example, S = b b a b b a a b P = b a a
![Page 60: Exact String Matching, Suffix Trees, and Applications](https://reader036.vdocuments.site/reader036/viewer/2022062309/568158ca550346895dc6148c/html5/thumbnails/60.jpg)
60
Suffixes of S
S = b b a b b a a bS[1…8]= b b a b b a a bS[2…8]= b a b b a a bS[3…8]= a b b a a bS[4…8]= b b a a bS[5…8]= b a a bS[6…8]= a a bS[7…8]= a bS[8…8]= b
1st suffix
2nd suffix
3rd suffix
4th suffix
5th suffix
6th suffix
7th suffix
8th suffix
![Page 61: Exact String Matching, Suffix Trees, and Applications](https://reader036.vdocuments.site/reader036/viewer/2022062309/568158ca550346895dc6148c/html5/thumbnails/61.jpg)
61
KEY: P occurs in S iff P is a prefix of a suffix of S.S = b b a b b a a bS[1…8]= b b a b b a a bS[2…8]= b a b b a a bS[3…8]= a b b a a bS[4…8]= b b a a bS[5…8]= b a a bS[6…8]= a a bS[7…8]= a bS[8…8]= b
1st suffix
2nd suffix
3rd suffix
4th suffix
5th suffix
6th suffix
7th suffix
8th suffix
![Page 62: Exact String Matching, Suffix Trees, and Applications](https://reader036.vdocuments.site/reader036/viewer/2022062309/568158ca550346895dc6148c/html5/thumbnails/62.jpg)
62
T = Suffix Trie of S
b b a b b a a b b a b b a a b a b b a a b b b a a b b a a b a a b a b b
![Page 63: Exact String Matching, Suffix Trees, and Applications](https://reader036.vdocuments.site/reader036/viewer/2022062309/568158ca550346895dc6148c/html5/thumbnails/63.jpg)
63
Why suffix trie?
The following statements are equivalent. P occurrs in S. P is a prefix of a suffix of S. P corresponds to a path of T starting from the root
of T.
![Page 64: Exact String Matching, Suffix Trees, and Applications](https://reader036.vdocuments.site/reader036/viewer/2022062309/568158ca550346895dc6148c/html5/thumbnails/64.jpg)
64
P = b a b b a
b b a b b a a b b a b b a a b a b b a a b b b a a b b a a b a a b a b b
P occurs in S!
![Page 65: Exact String Matching, Suffix Trees, and Applications](https://reader036.vdocuments.site/reader036/viewer/2022062309/568158ca550346895dc6148c/html5/thumbnails/65.jpg)
65
P = b b a a b a
b b a b b a a b b a b b a a b a b b a a b b b a a b b a a b a a b a b b
P doesn’t occur in S!
![Page 66: Exact String Matching, Suffix Trees, and Applications](https://reader036.vdocuments.site/reader036/viewer/2022062309/568158ca550346895dc6148c/html5/thumbnails/66.jpg)
66
P = a b b b a a
b b a b b a a b b a b b a a b a b b a a b b b a a b b a a b a a b a b b
P doesn’t occur in S!
![Page 67: Exact String Matching, Suffix Trees, and Applications](https://reader036.vdocuments.site/reader036/viewer/2022062309/568158ca550346895dc6148c/html5/thumbnails/67.jpg)
67
Q: Where does P occur in S?
![Page 68: Exact String Matching, Suffix Trees, and Applications](https://reader036.vdocuments.site/reader036/viewer/2022062309/568158ca550346895dc6148c/html5/thumbnails/68.jpg)
68
P = a b b a a
8
4
4
4
4 1
1
1
1
1
5
5
5 2
2
2
2
2
6
7
6 3
3
3
7
3
1 2 3 4 5 6 7 8b b a b b a a b b a b b a a b a b b a a b b b a a b b a a b a a b a b b
Output: 3
![Page 69: Exact String Matching, Suffix Trees, and Applications](https://reader036.vdocuments.site/reader036/viewer/2022062309/568158ca550346895dc6148c/html5/thumbnails/69.jpg)
69
Question
Time complexity for constructing the suffix trie T of S?
(|S|)
(|S| log |S|)
(|S|2)
(|S|3)
![Page 70: Exact String Matching, Suffix Trees, and Applications](https://reader036.vdocuments.site/reader036/viewer/2022062309/568158ca550346895dc6148c/html5/thumbnails/70.jpg)
70
Time = O(|S|2)
8
4
4
4
4 1
1
1
1
1
5
5
5 2
2
2
2
2
6
7
6 3
3
3
7
3
1 2 3 4 5 6 7 8b b a b b a a b b a b b a a b a b b a a b b b a a b b a a b a a b a b b
![Page 71: Exact String Matching, Suffix Trees, and Applications](https://reader036.vdocuments.site/reader036/viewer/2022062309/568158ca550346895dc6148c/html5/thumbnails/71.jpg)
71
time = (|S|2)
How to establish a lower bound? Answer:
![Page 72: Exact String Matching, Suffix Trees, and Applications](https://reader036.vdocuments.site/reader036/viewer/2022062309/568158ca550346895dc6148c/html5/thumbnails/72.jpg)
72
S = a a a a b b b b
![Page 73: Exact String Matching, Suffix Trees, and Applications](https://reader036.vdocuments.site/reader036/viewer/2022062309/568158ca550346895dc6148c/html5/thumbnails/73.jpg)
73
Summary
Suffix trie is good in solving Substring Problem, but may require (|S|2) time and space.
Question: is there a compact representation of suffix trie that needs only O(|S|) time and space?
![Page 74: Exact String Matching, Suffix Trees, and Applications](https://reader036.vdocuments.site/reader036/viewer/2022062309/568158ca550346895dc6148c/html5/thumbnails/74.jpg)
74
Suffix Tree
A compact representation of suffix trie
![Page 75: Exact String Matching, Suffix Trees, and Applications](https://reader036.vdocuments.site/reader036/viewer/2022062309/568158ca550346895dc6148c/html5/thumbnails/75.jpg)
75
Observations on Trie T of S
T has at most |S| leaves. Why?
T has at most |S| branching nodes. Why?
![Page 76: Exact String Matching, Suffix Trees, and Applications](https://reader036.vdocuments.site/reader036/viewer/2022062309/568158ca550346895dc6148c/html5/thumbnails/76.jpg)
76
Keeping leaves and branching nodes only.
compact representation of edge labels
S = a a a a b b b b
[5,8]
[5,8]
[5,8]
[5,8]
[4,8]
[1,1]
[2,2]
[3,3]
![Page 77: Exact String Matching, Suffix Trees, and Applications](https://reader036.vdocuments.site/reader036/viewer/2022062309/568158ca550346895dc6148c/html5/thumbnails/77.jpg)
77
S = a a a a b b b b
[5,8]
[5,8]
[5,8]
[5,8][4,8]
[1,1]
[2,2]
[3,3]
![Page 78: Exact String Matching, Suffix Trees, and Applications](https://reader036.vdocuments.site/reader036/viewer/2022062309/568158ca550346895dc6148c/html5/thumbnails/78.jpg)
78
S = b b a b b a a b
![Page 79: Exact String Matching, Suffix Trees, and Applications](https://reader036.vdocuments.site/reader036/viewer/2022062309/568158ca550346895dc6148c/html5/thumbnails/79.jpg)
79
S = b b a b b a a b
[1,1]
[2,3]
[4,8]
[7,8]
[4,8]
[7,8]
[4,8]
[7,8]
[3,3]
[3,3][3,3] [1,1]
[3,3] [2,3]
[7,8] [4,8][7,8] [4,8]
[7,8] [4,8]
![Page 80: Exact String Matching, Suffix Trees, and Applications](https://reader036.vdocuments.site/reader036/viewer/2022062309/568158ca550346895dc6148c/html5/thumbnails/80.jpg)
80
S = b b a b b a a b
[3,3] [1,1]
[3,3] [2,3]
[7,8] [4,8][7,8] [4,8]
[7,8] [4,8]
![Page 81: Exact String Matching, Suffix Trees, and Applications](https://reader036.vdocuments.site/reader036/viewer/2022062309/568158ca550346895dc6148c/html5/thumbnails/81.jpg)
81
Question
The space complexity of suffix tree O(|S|) O(|S| log |S|) O(|S|2) O(|S|3)
Why? Number of nodes = Number of edges = Space required by each edge =
![Page 82: Exact String Matching, Suffix Trees, and Applications](https://reader036.vdocuments.site/reader036/viewer/2022062309/568158ca550346895dc6148c/html5/thumbnails/82.jpg)
82
The challenge
Constructing Suffix Tree in Linear Time
![Page 83: Exact String Matching, Suffix Trees, and Applications](https://reader036.vdocuments.site/reader036/viewer/2022062309/568158ca550346895dc6148c/html5/thumbnails/83.jpg)
83
History of Suffix Tree Algorithms [Weiner, IEEE FOCS 1973]
Linear time but expensive in space. D. E. Knuth: “the algorithm of 1973”.
[McCreight, J. ACM 1976] Linear time and quadratic space.
[Ukkonen, Algorithmica 1995] Linear time and linear space. Much better readability.
![Page 84: Exact String Matching, Suffix Trees, and Applications](https://reader036.vdocuments.site/reader036/viewer/2022062309/568158ca550346895dc6148c/html5/thumbnails/84.jpg)
84
Academy Professor ,Department of Computer Science , University of Helsinki, Finland
http://www.cs.helsinki.fi/u/ukkonen/
Esko Ukkonen: On-line construction of suffix-trees. Algorithmica 14 (1995), 249-260
![Page 85: Exact String Matching, Suffix Trees, and Applications](https://reader036.vdocuments.site/reader036/viewer/2022062309/568158ca550346895dc6148c/html5/thumbnails/85.jpg)
85
Ukkonen’s approach on Suffix Trie
b b a b b a a b b a b b a a b a b b a a b b b a a b b a a b a a b a b b
Case 1: Leaf Extension
Case 2: New Leaf
Case 3: Do Nothing
![Page 86: Exact String Matching, Suffix Trees, and Applications](https://reader036.vdocuments.site/reader036/viewer/2022062309/568158ca550346895dc6148c/html5/thumbnails/86.jpg)
86
Growing Suffix Trie
Three cases while growing trie Case 1: growing an edge at a leaf. Case 2: growing a new branch of leaf. Case 3: does not change the tree structure.
![Page 87: Exact String Matching, Suffix Trees, and Applications](https://reader036.vdocuments.site/reader036/viewer/2022062309/568158ca550346895dc6148c/html5/thumbnails/87.jpg)
87
Three Phase Theorem
Those k steps in the k-th iteration have the following pattern: some (at least one) Case-1 steps, followed by some (could be zero) Case-2 steps, followed by some (could be zero) Case-3 steps.
![Page 88: Exact String Matching, Suffix Trees, and Applications](https://reader036.vdocuments.site/reader036/viewer/2022062309/568158ca550346895dc6148c/html5/thumbnails/88.jpg)
88
Thinking in Suffix Tree
1 2 3 4 5 6 7 8b b a b b a a b b a b b a a b a b b a a b b b a a b b a a b a a b a b b Case 1: Leaf Extension
Case 2: New Leaf
Case 3: Do Nothing
[1,1]
[1,2][1,3]
[2,3][3,3]
[3,3]
[2,4][3,4]
[3,4]
[2,5][3,5]
[3,5]
[2,6][3,6]
[3,6]
[2,7]
[3,7]
[3,7]
[7,7]
[4,7]
[7,7] [4,7]
[7,7] [4,7]
[4,8]
[4,8]
[4,8]
[7,8]
[7,8]
[7,8]
![Page 89: Exact String Matching, Suffix Trees, and Applications](https://reader036.vdocuments.site/reader036/viewer/2022062309/568158ca550346895dc6148c/html5/thumbnails/89.jpg)
89
Saving a lot of efforts
We can simply ignore all Case-1 steps.
Recall that the number of Case-2 steps is at most |S|.
Q: Is this good enough? Case 1: Leaf Extension
Case 2: New Leaf
Case 3: Do Nothing
1 2 3 4 5 6 7 8b b a b b a a b b a b b a a b a b b a a b b b a a b b a a b a a b a b b
![Page 90: Exact String Matching, Suffix Trees, and Applications](https://reader036.vdocuments.site/reader036/viewer/2022062309/568158ca550346895dc6148c/html5/thumbnails/90.jpg)
90
How does Ukkonen overcome the problem of too many Case-3 steps?
Completely ignore them……
Do nothing when nothing happen……
![Page 91: Exact String Matching, Suffix Trees, and Applications](https://reader036.vdocuments.site/reader036/viewer/2022062309/568158ca550346895dc6148c/html5/thumbnails/91.jpg)
91
Saving even more efforts
Case 1: Leaf Extension
Case 2: New Leaf
Case 3: Do Nothing
1 2 3 4 5 6 7 8b b a b b a a b b a b b a a b a b b a a b b b a a b b a a b a a b a b b
![Page 92: Exact String Matching, Suffix Trees, and Applications](https://reader036.vdocuments.site/reader036/viewer/2022062309/568158ca550346895dc6148c/html5/thumbnails/92.jpg)
92
Rough idea
Just keep one current growing point throughout the execution.
Deriving the new position of the current growing point from its previous position (with the help of suffix links )
![Page 93: Exact String Matching, Suffix Trees, and Applications](https://reader036.vdocuments.site/reader036/viewer/2022062309/568158ca550346895dc6148c/html5/thumbnails/93.jpg)
93
Only one growing point
Case 2: New Leaf
Case 3: Do Nothing
1 2 3 4 5 6 7 8b b a b b a a b b a b b a a b a b b a a b b b a a b b a a b a a b a b b
The challenges: How do we derive the position of the current growing point? Vertically (case 2) Horizontally (case 3)
Q: Which one is easier?
![Page 94: Exact String Matching, Suffix Trees, and Applications](https://reader036.vdocuments.site/reader036/viewer/2022062309/568158ca550346895dc6148c/html5/thumbnails/94.jpg)
94
Horizontally, …
Moving from iteration k – 1 to iteration k.
The growing point does not move!
This is the easier case.
Case 2: New Leaf
Case 3: Do Nothing
1 2 3 4 5 6 7 8b b a b b a a b b a b b a a b a b b a a b b b a a b b a a b a a b a b b
![Page 95: Exact String Matching, Suffix Trees, and Applications](https://reader036.vdocuments.site/reader036/viewer/2022062309/568158ca550346895dc6148c/html5/thumbnails/95.jpg)
95
Vertically, …
Moving from Step i to Step i+1 in the same iteration.
The growing point moves dramatically.
This is the tougher case.
Case 2: New Leaf
Case 3: Do Nothing
1 2 3 4 5 6 7 8b b a b b a a b b a b b a a b a b b a a b b b a a b b a a b a a b a b b
![Page 96: Exact String Matching, Suffix Trees, and Applications](https://reader036.vdocuments.site/reader036/viewer/2022062309/568158ca550346895dc6148c/html5/thumbnails/96.jpg)
96
Suffix link
Keep records of what have been done --- (Dynamic Programming)
![Page 97: Exact String Matching, Suffix Trees, and Applications](https://reader036.vdocuments.site/reader036/viewer/2022062309/568158ca550346895dc6148c/html5/thumbnails/97.jpg)
97
Recording What’s Done
Whenever a vertical movement reaching the destination, keep a record of the movement by using a link.
Later on, we might what to follow these recorded linkages.
These links are thus called the suffix links.
![Page 98: Exact String Matching, Suffix Trees, and Applications](https://reader036.vdocuments.site/reader036/viewer/2022062309/568158ca550346895dc6148c/html5/thumbnails/98.jpg)
98
Why called “Suffix Links”?
Note that the destination of the link is the (-1)-suffix of the starting.
That is, a suffix link links a length n+1 suffix to a length n suffix.
Case 2: New Leaf
Case 3: Do Nothing
1 2 3 4 5 6 7 8b b a b b a a b b a b b a a b a b b a a b b b a a b b a a b a a b a b b
![Page 99: Exact String Matching, Suffix Trees, and Applications](https://reader036.vdocuments.site/reader036/viewer/2022062309/568158ca550346895dc6148c/html5/thumbnails/99.jpg)
99
Property of Suffix Links (1)
The starting point of a suffix is an internal node, Not a leaf No the middle part of
some suffix tree edge. Why?
Case 2: New Leaf
Case 3: Do Nothing
1 2 3 4 5 6 7 8b b a b b a a b b a b b a a b a b b a a b b b a a b b a a b a a b a b b
![Page 100: Exact String Matching, Suffix Trees, and Applications](https://reader036.vdocuments.site/reader036/viewer/2022062309/568158ca550346895dc6148c/html5/thumbnails/100.jpg)
100
Every internal node must be a starting point of a suffix link.
Why?
Case 2: New Leaf
Case 3: Do Nothing
1 2 3 4 5 6 7 8b b a b b a a b b a b b a a b a b b a a b b b a a b b a a b a a b a b b
Property of Suffix Links (2)
![Page 101: Exact String Matching, Suffix Trees, and Applications](https://reader036.vdocuments.site/reader036/viewer/2022062309/568158ca550346895dc6148c/html5/thumbnails/101.jpg)
101
Using suffix links
S = b b a b b a a b
[1,-]
[2,-]
[3,-]
[3,-]
[3,3]
[7,-][4,-]
[3,3]
[7,-] [4,-]
[2,3]
[7,-] [4,-]
1
[1,1]
12
1
1
1
![Page 102: Exact String Matching, Suffix Trees, and Applications](https://reader036.vdocuments.site/reader036/viewer/2022062309/568158ca550346895dc6148c/html5/thumbnails/102.jpg)
102
Traversal with the help of suffix links: phase (1) Going up to a closest
internal node (whose suffix link must be available). Suppose this upward traversal passes through t characters.
Following the suffix link that starts from this internal node.
t
[i, j]
![Page 103: Exact String Matching, Suffix Trees, and Applications](https://reader036.vdocuments.site/reader036/viewer/2022062309/568158ca550346895dc6148c/html5/thumbnails/103.jpg)
103
Traversal with the help of suffix links: phase (2) Going down by matching th
e t-character substring S[i, i + t – 1] of S.
t
[i, j]
![Page 104: Exact String Matching, Suffix Trees, and Applications](https://reader036.vdocuments.site/reader036/viewer/2022062309/568158ca550346895dc6148c/html5/thumbnails/104.jpg)
104
Running Time?
Naïvely: O(t). Cleverly: O(1+ d), wher
e d is the number of internal nodes being went through during phase (2).
t
[i, j]
![Page 105: Exact String Matching, Suffix Trees, and Applications](https://reader036.vdocuments.site/reader036/viewer/2022062309/568158ca550346895dc6148c/html5/thumbnails/105.jpg)
105
Overall Time = O(|S|)
Suppose di is the d in the i-th Case-2-step traversal.
It suffices to show d1+d2
+…+d|S| =O(|S|).
Case 2: New Leaf
Case 3: Do Nothing
1 2 3 4 5 6 7 8b b a b b a a b b a b b a a b a b b a a b b b a a b b a a b a a b a b b
![Page 106: Exact String Matching, Suffix Trees, and Applications](https://reader036.vdocuments.site/reader036/viewer/2022062309/568158ca550346895dc6148c/html5/thumbnails/106.jpg)
106
= the “slack” of the growing point The slack means the
distance between a position P and the closest internal node above P.
t
[i, j]
![Page 107: Exact String Matching, Suffix Trees, and Applications](https://reader036.vdocuments.site/reader036/viewer/2022062309/568158ca550346895dc6148c/html5/thumbnails/107.jpg)
107
case-3 traversal
Each case-3 traversal (i.e., horizontal movement) can only increase the value of by at most one.
(It can even decrease the value of .)
Case 2: New Leaf
Case 3: Do Nothing
1 2 3 4 5 6 7 8b b a b b a a b b a b b a a b a b b a a b b b a a b b a a b a a b a b b
![Page 108: Exact String Matching, Suffix Trees, and Applications](https://reader036.vdocuments.site/reader036/viewer/2022062309/568158ca550346895dc6148c/html5/thumbnails/108.jpg)
108
case-2 traversal
The i-th case-2 traversal (i.e., vertical movement) decreases the value of by at least di.
Case 2: New Leaf
Case 3: Do Nothing
1 2 3 4 5 6 7 8b b a b b a a b b a b b a a b a b b a a b b b a a b b a a b a a b a b b
![Page 109: Exact String Matching, Suffix Trees, and Applications](https://reader036.vdocuments.site/reader036/viewer/2022062309/568158ca550346895dc6148c/html5/thumbnails/109.jpg)
109
d1+d2+…+d|S| = O(|S|)
Initial = O(1). can be increased by one for at most |S|
times (because there are at most |S| horizontal movements (i.e., case-3 traversals).
Since is always non-negative, the above bound is proved.
![Page 110: Exact String Matching, Suffix Trees, and Applications](https://reader036.vdocuments.site/reader036/viewer/2022062309/568158ca550346895dc6148c/html5/thumbnails/110.jpg)
110
Using suffix links
S = b b a b b a a b
[1,-]
[2,-]
[3,-]
[3,-]
[3,3]
[7,-][4,-]
[3,3]
[7,-] [4,-]
[2,3]
[7,-] [4,-]
1
[1,1]
12
1
1
1
![Page 111: Exact String Matching, Suffix Trees, and Applications](https://reader036.vdocuments.site/reader036/viewer/2022062309/568158ca550346895dc6148c/html5/thumbnails/111.jpg)
111
Applications of Suffix Tree in Bioinformatics
![Page 112: Exact String Matching, Suffix Trees, and Applications](https://reader036.vdocuments.site/reader036/viewer/2022062309/568158ca550346895dc6148c/html5/thumbnails/112.jpg)
112
Rapid global alignment
Genomic regions of interest contain ordered islands of similarity
E.g. genes
1. Find local alignments
2. Chain an optimal subset of them
![Page 113: Exact String Matching, Suffix Trees, and Applications](https://reader036.vdocuments.site/reader036/viewer/2022062309/568158ca550346895dc6148c/html5/thumbnails/113.jpg)
113
Suffix Trees
Suffix trees are a method to find all maximal matches between two strings (and much more)
Example:
x = dabdac
d a b d a c
ca
bd
acc
cca
db
1
4
25
63
![Page 114: Exact String Matching, Suffix Trees, and Applications](https://reader036.vdocuments.site/reader036/viewer/2022062309/568158ca550346895dc6148c/html5/thumbnails/114.jpg)
114
Application: Find all Matches Between x and y
1. Build suffix tree for x, mark nodes with x
2. Insert y in suffix tree, mark all nodes y “passes from” with y
The path label of every node marked both 0 and 1, is a common substring
![Page 115: Exact String Matching, Suffix Trees, and Applications](https://reader036.vdocuments.site/reader036/viewer/2022062309/568158ca550346895dc6148c/html5/thumbnails/115.jpg)
115
Example of Suffix Tree Construction for x, yExample of Suffix Tree Construction for x, y
1
x = d a b d a $y = a b a d a $
d a b d a $1. Construct tree for x
a
bd
a$2
$a
db
3
$
4
$
5
$6
xx
x
6. Insert a $
5
6
7. Insert $
4. Insert a d a $
da$
3
5. Insert d a $
y
4
2. Insert a b a d a $
a
y
da
$
1
y
yx
3. Insert b a d a $ ady
2
a$
x
![Page 116: Exact String Matching, Suffix Trees, and Applications](https://reader036.vdocuments.site/reader036/viewer/2022062309/568158ca550346895dc6148c/html5/thumbnails/116.jpg)
116
Application: Online Search of Strings on a DatabaseSay a database D = { s1, s2, …sn }
(eg. proteins)
Question: given new string x, find all matches of x to database
1. Build suffix tree for {s1,…, sn}
2. All new queries x take O( |x| ) time(somewhat like BLAST)
![Page 117: Exact String Matching, Suffix Trees, and Applications](https://reader036.vdocuments.site/reader036/viewer/2022062309/568158ca550346895dc6148c/html5/thumbnails/117.jpg)
117
Longest Common Substring
Given two strings S and T. Find the longest common substring. S = carport, T = airports
Longest common substring = rport Longest common subsequence = arport
Longest common subsequence may be found in O(|S|*|T|) time using dynamic programming.
Longest common substring? How much time is needed ?
![Page 118: Exact String Matching, Suffix Trees, and Applications](https://reader036.vdocuments.site/reader036/viewer/2022062309/568158ca550346895dc6148c/html5/thumbnails/118.jpg)
118
Donald E. Knuth conjectured in 1970 that …it is impossible to solve this longest common substring problem in O(|A|+|B|) time.
![Page 119: Exact String Matching, Suffix Trees, and Applications](https://reader036.vdocuments.site/reader036/viewer/2022062309/568158ca550346895dc6148c/html5/thumbnails/119.jpg)
119
Application: Longest Common Substrings Say we want to find the longest common substring
of s1, s2, …sn
1. Build suffix tree for s1,…, sn
2. All nodes labeled {si1, …, sik} represent a match between si1, …, sik
3. Keep the substring length informations on these {si1, …, sik} match; find the largest values.
![Page 120: Exact String Matching, Suffix Trees, and Applications](https://reader036.vdocuments.site/reader036/viewer/2022062309/568158ca550346895dc6148c/html5/thumbnails/120.jpg)
120
Acknowledgement:Adopted form Dr. Yaw-Ling Lin’s slides
The End