presented by dr. shazzad hosain asst. prof. eecs, nsu
Presented By Dr. Shazzad Hosain
Asst. Prof. EECS, NSU
Exact String Matching Algorithms
Classical Comparison Based Methods
• Boyer-Moore Algorithm
• Knuth-Morris-Pratt Algorithm (KMP Algorithm)
Boyer-Moore Algorithm
• Basic ideas:
– Previously discussed ideas for naïve matching
1. Successively align P and T to check for a match.
2. Shift P to the right on match failure.
– New concepts with respect to the naïve algorithm
1. Scan from right to left.
2. Special bad character rule.
3. Suffix shift rule.
Concept: Right-to-left Scan
• How can we check for a match of pattern P at location i in target T?
• Naïve algorithm scanned left-to-right, i.e., T[i+k]&P[1+k], k = 0 to length(P)-1
Example: P = adab, T = abaracadabara
T: a b a r a c a d a b a r a
P: a d a b
Comparisons (left to right): 1. a == a, 2. d != b
Concept: Right-to-left Scan
• Alternatively, scan right-to-left, i.e., T[i+k]&P[1+k], k = length(P)-1 down to 0
Example: P = adab, T = abaracadabara
T: a b a r a c a d a b a r a
P: a d a b
Comparisons (right to left): 1. b != r
Concept: Right-to-left Scan
• Why is scanning right-to-left a good idea?
• Answer: by itself, it isn’t any better than left-to-right.
– A naïve approach with right-to-left scanning is also Θ(nm).
– Larger shifts, supported by a clever bad character rule and a suffix shift rule, make it better.
Concept: Bad Character Rule
• Idea: the mismatched character indicates a safe minimum shift.
Example: P = adacara, T = abaracadabara
T: a b a r a c a d a b a r a
P: a d a c a r a
Comparisons (right to left): 1. a == a, 2. r != c
Here the bad character is c. Perhaps we should shift to align this character with its rightmost occurrence in P?
Concept: Bad Character Rule
Shift two positions to align the rightmost occurrence of the mismatched character c in P.
T: a b a r a c a d a b a r a
P: a d a c a r a
P:     a d a c a r a   (after shifting two positions)
Now, start matching again from right to left.
Concept: Bad Character Rule
Second Example: P = adacara, T = abaxaradabara
T: a b a x a r a d a b a r a
P: a d a c a r a
Comparisons (right to left): 1. a == a, 2. r == r, 3. a == a, 4. c != x
Here the bad character is x. The minimum shift should align this character with its occurrence in P.
But x doesn’t occur in P!
Concept: Bad Character Rule
Second Example: P = adacara, T = abaxaradabara
T: a b a x a r a d a b a r a
P: a d a c a r a
Since x doesn’t occur in P, we can shift past it.
P:         a d a c a r a
Now, start matching again from right to left.
Concept: Bad Character Rule
• The idea of the bad character rule is to shift P by more than one character when possible.
• But if the rightmost occurrence of the mismatched character in P is to the right of the mismatch position, the rule only allows a shift of one position.
• Unfortunately, this is often the case.
   12345678901234567
T: spbctbsatpqsctbpq
P:   tpabsat
P:    tpabsat
Concept: Bad Character Rule
• We will define a bad character rule that uses the concept of the rightmost occurrence of each letter.
• Let R(x) be the rightmost position of the letter x in P for each letter x in our alphabet.
• If x doesn’t occur in P, define R(x) to be 0.
    1234567
P = adacara
x:    a  b  c  d  ...  z
R(x): 7  0  4  2  ...  0
Concept: Bad Character Rule
   12345678901234567
T: spbctbsabpqsctbpq
P:   tpabsab
• R(t) = 1, R(s) = 5.
• i: the position of the mismatch in P; here i = 3.
• k: the corresponding position in T; here k = 5, so T[k] = t.
• The bad character rule says P should be shifted right by max{1, i - R(T[k])}, i.e., if the right-most occurrence of character T[k] in P is in position j (j < i), then P[j] should be below T[k] after the shift. Here the shift is 3 - 1 = 2:
P:     tpabsab
• Otherwise, when R(T[k]) >= i, we have i - R(T[k]) <= 0, so we shift P by one position.
• This rule is not very useful when R(T[k]) >= i, which is usually the case for DNA sequences.
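The table R and the shift max{1, i - R(T[k])} can be sketched in a few lines of Python. This is only an illustration under the slide's conventions (1-based positions, R(x) = 0 for characters absent from P); the function names `rightmost_table` and `bad_character_shift` are hypothetical and not taken from the slides.

```python
def rightmost_table(pattern):
    """Map each character of pattern to its rightmost 1-based position."""
    R = {}
    for pos, ch in enumerate(pattern, start=1):
        R[ch] = pos  # later positions overwrite earlier ones
    return R


def bad_character_shift(R, i, bad_char):
    """Shift suggested when P[i] (1-based) mismatches the text character bad_char."""
    # Characters that do not occur in P have R(x) = 0.
    return max(1, i - R.get(bad_char, 0))


R = rightmost_table("tpabsab")
shift = bad_character_shift(R, 3, "t")  # mismatch at i = 3 against T[k] = t
```

With P = tpabsab the lookup reproduces R(t) = 1 and R(s) = 5, and the shift for the mismatch at i = 3 is 2. For P = tpabsat the same mismatch falls back to a shift of 1, which is the weakness the extended rule addresses.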
Concept: Extended Bad Character Rule
Extended Bad Character Rule: If P[i] mismatches T[k], shift P along T so that the closest occurrence of the letter T[k] in P to the left of i in P is aligned with T[k].
Example: P = aracara, T = abararadabara
T: a b a r a r a d a b a r a
P: a r a c a r a
Comparisons (right to left): 1. a == a, 2. r == r, 3. a == a, 4. c != r
• The rightmost occurrence of r in P is at position 6. Notice that i - R(T[k]) < 0, i.e., 4 - 6 < 0.
• The rightmost occurrence of r to the left of i in P is at position 2. Notice that 4 - 2 > 0, i.e., this gives us a positive shift.
Concept: Extended Bad Character Rule
The amount of shift is i - j, where:
– i is the index of the mismatch in P.
– j is the rightmost occurrence of T[k] to the left of i in P.
Example: P = aracara, T = abataradabara
T: a b a t a r a d a b a r a
P: a r a c a r a
Comparisons (right to left): 1. a == a, 2. r == r, 3. a == a, 4. c != t
There is no occurrence of t in P, thus j = 0. Notice that i - j = 4, i.e., this gives us a positive shift past the point of mismatch.
Concept: Extended Bad Character Rule
• How do we implement this rule?
• We preprocess P (from right to left), recording the position of each occurrence of the letters.
• For each character x in Σ, the alphabet, create a list of its occurrences in P. If x doesn’t occur in P, then it has an empty list.
Concept: Extended Bad Character Rule
Example: Σ = {a, b, c, d, r, t}, P = abataradabara
• a_list = <13, 11, 9, 7, 5, 3, 1> since ‘a’ occurs at these positions in P
• b_list = <10, 2>
• c_list = Ø
• d_list = <8>
• r_list = <12, 6>
• t_list = <4>
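The occurrence lists and the resulting shift i - j can be sketched as follows. This is a minimal sketch assuming 1-based positions as on the slides; `occurrence_lists` and `extended_bad_character_shift` are hypothetical names introduced for illustration.

```python
def occurrence_lists(pattern):
    """Map each character to its 1-based positions in pattern, in decreasing order."""
    lists = {}
    for pos in range(len(pattern), 0, -1):  # scan P from right to left
        lists.setdefault(pattern[pos - 1], []).append(pos)
    return lists


def extended_bad_character_shift(lists, i, bad_char):
    """Return i - j, where j is the closest occurrence of bad_char left of i (0 if none)."""
    for j in lists.get(bad_char, []):  # lists are in decreasing order
        if j < i:
            return i - j
    return i  # no occurrence to the left of i, so j = 0


lists = occurrence_lists("aracara")
shift_r = extended_bad_character_shift(lists, 4, "r")  # c != r at i = 4
shift_t = extended_bad_character_shift(lists, 4, "t")  # c != t at i = 4
```

Scanning a list costs time proportional to the number of occurrences skipped, which is bounded by the number of characters already compared in that alignment, so the lookup does not change the overall cost of a phase.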
Concept: Suffix Shift Rule
• Recall that we investigated finding prefixes before.
• Since we are matching P to T from right-to-left, we will instead need to use suffixes.
Suffix Shift Rule
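t is a suffix of P that matches a substring t of T, and the characters just to the left of t mismatch: x in T, y in P, x ≠ y.
t’ is the right-most copy of t in P such that t’ is not a suffix of P and the character z to the left of t’ differs from y (z ≠ y).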
Concept: Suffix Shift Rule
• Consider the partial right-to-left matching of P to T below.
• This partial match involves α, a suffix of P.
P: .....adbadbaddog
T: .....axbadbaddog.....
Concept: Suffix Shift Rule
• This partial match ends where the first mismatch occurs, where x is aligned with d.
P: .....adbadbaddog
T: .....axbadbaddog.....
Concept: Suffix Shift Rule
We want to find a right-most copy α´ of this substring α in P such that:
1. α´ is not a suffix of P, and
2. the character to the left of α´ is not the same as the character to the left of α.
P: .........gbadbaddoghorseadbadbaddog
T: ........................axbadbaddog.....
Concept: Suffix Shift Rule
1. If α´ exists, shift P to the right such that α´ is now aligned with the substring in T that was previously aligned with α.
T:          ......................xbadbaddog.............
P (before): .........gbadbaddogcatdbadbaddog
P (after):  ......................gbadbaddogcatdbadbaddog
Concept: Suffix Shift Rule
2. If α´ doesn’t exist, shift P right by the least amount such that a prefix of P is aligned with a suffix of α in T.
T:          .........xbadbaddog................
P (before): dogcatratdbadbaddog
P (after):  ................dogcatratdbadbaddog
Concept: Suffix Shift Rule
3. If α´ doesn’t exist, and there is no prefix of P that matches a suffix of α in T, shift P right by n positions.
T:          .........xbadbaddog...................
P (before): batcatratdbadbaddog
P (after):  ...................batcatratdbadbaddog
Preprocessing for the good suffix rule
• Let L(i) denote the largest position less than n s.t. P[i..n] matches a suffix of P[1..L(i)].
• If there is no such position, then L(i) = 0.
• Example 1: P = batcatdogdbadbaddog. If i = 17 then L(i) = 9 (P[17..19] = dog also ends at position 9).
• Example 2: P = batcatdogdbadbaddog. If i = 16 then L(i) = 0 (P[16..19] = ddog occurs nowhere else in P).
Concept: Suffix Shift Rule
• Let L´(i) denote the largest position less than n s.t. P[i..n] matches a suffix of P[1..L´(i)] and s.t. the character preceding the suffix is not equal to P(i-1).
• If there is no such position, then L´(i) = 0.
• Example 1: P = slydogsaddogdbadbaddog. If i = 20 then L(i) = 12 and L´(i) = 6.
Concept: Suffix Shift Rule
• Example 2: P = slydogsaddogdbadbaddog. If i = 19 then L(i) = 12 and L´(i) = 0.
Concept: Suffix Shift Rule
• Notice that L(i) indicates the right-most copy of P[i..n] that is not a suffix of P.
• In contrast, L´(i) indicates the right-most copy of P[i..n] that is not a suffix of P and whose preceding character doesn’t match P(i-1).
• The relation between L´(i) and L(i) is analogous to the relation between α´ and α.
P = slydogsaddogdbadbaddog, i = 20: L(20) = 12, L´(20) = 6
Concept: Suffix Shift Rule
• Q: What is the point?
• A: If P(i - 1) causes the mismatch and L´(i) > 0, then we can shift P right by n - L´(i) positions. Example:
T:          ......................xbadbaddog.............
P (before): .........gbadbaddogcatdbadbaddog
P (after):  ......................gbadbaddogcatdbadbaddog
Concept: Suffix Shift Rule
• If L(i) and L´(i) are different, then obviously shifting by n - L´(i) positions is a greater shift than n - L(i).
• Example (the matched suffix is baddog; L(i) = 18 and L´(i) = 9):
T:          .....................xbaxbaddog......................
P (before): slybaddogbadbaddogcatdbadbaddog
P (after):  ......................slybaddogbadbaddogcatdbadbaddog
Concept: Suffix Shift Rule
• Let Nj(P) denote the length of the longest suffix of P[1..j] that is also a suffix of P.
• Example 1: P = slydogsaddogdbadbaddog. N6(P) = 3 and N12(P) = 5.
• Example 2: P = hogslydogsaddogdbadbaddog. N3(P) = 2, N9(P) = 3, N15(P) = 5, N19(P) = 0.
Concept: Suffix Shift Rule
• Q: How are the concepts of Ni and Zi related?
• Recall that Zi = length of a maximal substring starting at position i, which is a prefix of P.
• In contrast, Ni = length of a maximal substring ending at position i, which is a suffix of P.
• In the case of Boyer-Moore, we are naturally interested in suffixes since we are scanning right-to-left.
Concept: Suffix Shift Rule
• Let P^r denote the mirror image of P, then the relationship can be expressed as N_j(P) = Z_{n-j+1}(P^r).
• In words, the length of the substring matching a suffix at position j in P is equal to the length of the corresponding substring matching a prefix in the reverse of P.
• Q: Why must this be true?
• A: Because they are the same substring, except that one is the reverse of the other.
Concept: Suffix Shift Rule
• Since N_j(P) = Z_{n-j+1}(P^r), we can use the Z algorithm to compute N in O(n).
• Q: How do we do this?
• A: We create P^r, the reverse of P, and process it with the Z algorithm.
Concept: Suffix Shift Rule
N is the reverse of Z!
P: the pattern
P^r: the string obtained by reversing P
Then N_j(P) = Z_{n-j+1}(P^r)
position: 1 2 3 4 5 6 7 8 9 10
P:        q c a b d a b d a b
N_j:      0 0 0 2 0 0 5 0 0 0
P^r:      b a d b a d b a c q
Z_i:      0 0 0 5 0 0 2 0 0 0
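The reversal trick can be sketched in Python as follows. This is a hedged sketch: `z_values` is a standard Z-box computation written with 0-based indices, `n_values` is a hypothetical helper name, and N[n] is simply left at 0 here to match the table above (other slides treat the full-length match separately).

```python
def z_values(s):
    """z[i] = length of the longest substring of s starting at i (0-based) that is a prefix of s."""
    n = len(s)
    z = [0] * n
    left = right = 0  # the current Z-box is s[left:right]
    for i in range(1, n):
        if i < right:
            z[i] = min(right - i, z[i - left])
        while i + z[i] < n and s[z[i]] == s[i + z[i]]:
            z[i] += 1
        if i + z[i] > right:
            left, right = i, i + z[i]
    return z


def n_values(pattern):
    """Return N with N[j] for 1-based j = 1..n (index 0 unused, N[n] left at 0)."""
    n = len(pattern)
    z = z_values(pattern[::-1])  # Z values of the reversed pattern
    N = [0] * (n + 1)
    for j in range(1, n):
        N[j] = z[n - j]  # Z_{n-j+1} in 1-based terms is z[n - j] in 0-based terms
    return N


N = n_values("qcabdabdab")
```

For P = qcabdabdab this yields N_4 = 2 and N_7 = 5, the two nonzero entries in the table.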
Concept: Suffix Shift Rule
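For pattern P, Nj (for j = 1,…,n) can be calculated in O(n) using the Z algorithm.
Why do we need to define Nj?
To use the strong good suffix rule, we need to find L´(i) for every i = 1,…,n.
We can get L´(i) from Nj!
[Figure: T above P. The matched suffix t of P starts at position i and is preceded by y; its copy t’ in P ends at position L´(i) and is preceded by z; in T, t is preceded by x.]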
Concept: Suffix Shift Rule
• We can then find L´(i) and L(i) values from N values in linear time with the following:
For i = 1 to n { L´(i) = 0; }
For j = 1 to n - 1 {
    i = n - Nj(P) + 1;    // when Nj(P) = 0, i = n + 1 is out of range and the assignment is ignored
    L´(i) = j;
}
// L values (if desired) can be obtained
L(2) = L´(2);
For i = 3 to n { L(i) = max(L(i - 1), L´(i)); }
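The two loops above translate almost directly into Python. In this sketch the N values are passed in as a list with index 0 unused, and an extra slot at index n + 1 absorbs the out-of-range writes that occur when N_j = 0; the function name `l_prime_and_l` is hypothetical.

```python
def l_prime_and_l(N, n):
    """Compute L'(i) and L(i) for i = 1..n from N[1..n-1] (index 0 of N is unused)."""
    Lp = [0] * (n + 2)          # slot n + 1 absorbs the writes made when N[j] == 0
    for j in range(1, n):
        Lp[n - N[j] + 1] = j    # larger j overwrite smaller ones, keeping the right-most copy
    L = [0] * (n + 1)
    if n >= 2:
        L[2] = Lp[2]
    for i in range(3, n + 1):
        L[i] = max(L[i - 1], Lp[i])
    return Lp[: n + 1], L


# N values for P = asdbasasas (n = 10), index 0 unused
N = [0, 0, 2, 0, 0, 0, 2, 0, 4, 0]
Lp, L = l_prime_and_l(N, 10)
```

For this input the only nonzero entries are L´(7) = 8 and L´(9) = 6, and L(i) = 8 for every i from 7 to 10.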
Concept: Suffix Shift Rule
• Example: P = asdbasasas, n = 10
• Values of Nj(P), j = 1..9: 0, 2, 0, 0, 0, 2, 0, 4, 0
• Computed values of i: 11, 9, 11, 11, 11, 9, 11, 7, 11
• Values of L´(i), i = 1..9: 0, 0, 0, 0, 0, 0, 8, 0, 6
Concept: Suffix Shift Rule
• Let l´(i) denote the length of the largest suffix of P[i..n] that is also a prefix of P. Let l´(i) = 0 if no such suffix exists.
Example: P = asasbsasas
l´(1) = 10 (P[1..n] is P itself), l´(2) = l´(3) = l´(4) = l´(5) = l´(6) = l´(7) = 4, l´(8) = l´(9) = 2, l´(10) = 0
Concept: Suffix Shift Rule
• Thm: l´(i) = largest j <= n - i + 1 s.t. Nj(P) = j.
• Q: How can we compute l´(i) values in linear time?
• A: This is problem #9 in Chapter 2. This would make an interesting homework problem.
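One possible linear-time approach, offered only as a sketch and not as the textbook's intended solution, follows directly from the theorem: as i decreases from n to 1, the bound n - i + 1 grows by one each step, so it is enough to remember the largest j seen so far with N_j = j. The helper name `small_l_prime` is hypothetical, and the sketch assumes N[n] = n (the whole pattern is a suffix of itself).

```python
def small_l_prime(N, n):
    """Return lp with lp[i] = l'(i) for i = 1..n, given N[1..n] with N[n] == n."""
    lp = [0] * (n + 1)
    best = 0  # largest j found so far with N[j] == j
    for i in range(n, 0, -1):
        j = n - i + 1          # the only new candidate allowed at this i
        if N[j] == j:
            best = j
        lp[i] = best
    return lp


# N values for P = golgol (n = 6), index 0 unused, with N[6] = 6 by assumption
lp = small_l_prime([0, 0, 0, 3, 0, 0, 6], 6)
```

For P = golgol this gives l´(1) = 6, l´(2) = l´(3) = l´(4) = 3 and l´(5) = l´(6) = 0.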
Boyer-Moore Algorithm
Preprocessing:
Compute L´(i) and l´(i) for each position i in P.
Compute R(x), the right-most occurrence of x in P, for each character x in Σ.
Search:
k = n;
While k <= m {
    i = n; h = k;
    While i > 0 and P(i) = T(h) {
        i = i - 1; h = h - 1;
    }
    if i = 0 {
        report occurrence of P in T ending at position k.
        k = k + n - l´(2);
    }
    else Shift P (increase k) by the max amount indicated by the extended bad character rule and the good suffix rule.
}
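Putting the pieces together, the search loop can be sketched in Python. This is one possible rendering under stated assumptions, not a reference implementation: positions are 1-based as on the slides, a mismatch at P(i) uses L´(i+1) and l´(i+1) for the good suffix shift, a mismatch on the very first comparison (i = n) uses a good suffix shift of 1, and occurrences are reported by start position k - n + 1. The names `z_values` and `boyer_moore` are hypothetical.

```python
def z_values(s):
    """z[i] = length of the longest substring of s starting at i (0-based) that is a prefix of s."""
    n = len(s)
    z = [0] * n
    left = right = 0
    for i in range(1, n):
        if i < right:
            z[i] = min(right - i, z[i - left])
        while i + z[i] < n and s[z[i]] == s[i + z[i]]:
            z[i] += 1
        if i + z[i] > right:
            left, right = i, i + z[i]
    return z


def boyer_moore(P, T):
    """Return the 1-based start positions of every occurrence of P in T."""
    n, m = len(P), len(T)
    if n == 0 or n > m:
        return []
    # N[j] via Z values of the reversed pattern; N[n] = n by convention here.
    z = z_values(P[::-1])
    N = [0] * (n + 1)
    for j in range(1, n):
        N[j] = z[n - j]
    N[n] = n
    # Lp[i] = L'(i); slot n + 1 absorbs the writes made when N[j] == 0.
    Lp = [0] * (n + 2)
    for j in range(1, n):
        Lp[n - N[j] + 1] = j
    # lp[i] = l'(i) = largest j <= n - i + 1 with N[j] == j.
    lp = [0] * (n + 2)
    best = 0
    for i in range(n, 0, -1):
        if N[n - i + 1] == n - i + 1:
            best = n - i + 1
        lp[i] = best
    # Occurrence lists (decreasing positions) for the extended bad character rule.
    occ = {}
    for pos in range(n, 0, -1):
        occ.setdefault(P[pos - 1], []).append(pos)

    hits = []
    k = n  # position in T aligned with the right end of P
    while k <= m:
        i, h = n, k
        while i > 0 and P[i - 1] == T[h - 1]:
            i -= 1
            h -= 1
        if i == 0:
            hits.append(k - n + 1)
            k += n - lp[2]
        else:
            j = next((q for q in occ.get(T[h - 1], []) if q < i), 0)
            bad = i - j
            if i == n:
                good = 1
            elif Lp[i + 1] > 0:
                good = n - Lp[i + 1]
            else:
                good = n - lp[i + 1]
            k += max(bad, good)
    return hits
```

Calling `boyer_moore("golgol", "lolgolgol")` follows the same two alignments as the worked example (k = 6, then k = 9) and returns the single start position 4.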
Boyer-Moore Algorithm
Example: P = golgol
Preprocessing: Compute L´(i) and l´(i) for each position i in P.
Notice that first we need Nj(P) values in order to compute L´(i) and l´(i) for each position i in P.
Boyer-Moore Algorithm
Example: P = golgol
Recall that Nj(P) is the length of the longest suffix of P[1..j] that is also a suffix of P.
N1(P) = 0, there is no suffix of P that ends with g
N2(P) = 0, there is no suffix of P that ends with o
N3(P) = 3, there is a suffix of P that ends with l
N4(P) = 0, there is no suffix of P that ends with g
N5(P) = 0, there is no suffix of P that ends with o
N1(P) = N2(P) = N4(P) = N5(P) = 0 and N3(P) = 3
Boyer-Moore Algorithm
Preprocessing: P = golgol, n = 6
N1(P) = N2(P) = N4(P) = N5(P) = 0 and N3(P) = 3
Compute L´(i) and l´(i) for each position i in P
For i = 1 to n { L´(i) = 0; }
For j = 1 to n - 1 {
    i = n - Nj(P) + 1;
    L´(i) = j;
}
j = 1: i = 7, therefore L´(7) = 1
j = 2: i = 7, therefore L´(7) = 2
j = 3: i = 4, therefore L´(4) = 3
j = 4: i = 7, therefore L´(7) = 4
j = 5: i = 7, therefore L´(7) = 5
Position 7 lies outside P (n = 6), so the L´(7) assignments are ignored.
L´(1) = L´(2) = L´(3) = L´(5) = L´(6) = 0 and L´(4) = 3
Boyer-Moore Algorithm
Preprocessing: P = golgol, n = 6
N1(P) = N2(P) = N4(P) = N5(P) = 0 and N3(P) = 3
L´(1) = L´(2) = L´(3) = L´(5) = L´(6) = 0 and L´(4) = 3
Compute l´(i) for each position i in P. Recall that l´(i) is the length of the longest suffix of P[i..n] that is also a prefix of P.
l´(1) = 6 since golgol is the longest suffix of P[1..n] that is a prefix of P.
l´(2) = 3 since gol is the longest suffix of P[2..n] that is a prefix of P.
l´(3) = 3 since gol is the longest suffix of P[3..n] that is a prefix of P.
l´(4) = 3 since gol is the longest suffix of P[4..n] that is a prefix of P.
l´(5) = 0 since there is no suffix of P[5..n] that is a prefix of P.
l´(6) = 0 since there is no suffix of P[6..n] that is a prefix of P.
l´(1) = 6, l´(2) = l´(3) = l´(4) = 3 and l´(5) = l´(6) = 0
Boyer-Moore Algorithm
Preprocessing: P = golgol, n = 6
N1(P) = N2(P) = N4(P) = N5(P) = 0 and N3(P) = 3
L´(1) = L´(2) = L´(3) = L´(5) = L´(6) = 0 and L´(4) = 3
l´(1) = 6, l´(2) = l´(3) = l´(4) = 3 and l´(5) = l´(6) = 0
Compute the list R(x), the right-most occurrences of x in P, for each character x in Σ = {g, o, l}
R(g) = <4, 1>
R(o) = <5, 2>
R(l) = <6, 3>
Boyer-Moore Algorithm
Preprocessing: P = golgol, n = 6, T = lolgolgol, m = 9
L´(1) = L´(2) = L´(3) = L´(5) = L´(6) = 0 and L´(4) = 3
l´(1) = 6, l´(2) = l´(3) = l´(4) = 3 and l´(5) = l´(6) = 0
R(g) = <4, 1>, R(o) = <5, 2>, R(l) = <6, 3>
Search: the search loop of the Boyer-Moore Algorithm starts with k = n = 6 and runs while k <= m = 9.
Search
T: lolgolgol
P: golgol   (k = 6)
Comparisons (right to left): i = 6, h = 6; i = 5, h = 5; i = 4, h = 4; i = 3, h = 3; i = 2, h = 2; i = 1, h = 1, where P(1) != T(1).
Bad Character Rule: the mismatched character in T is l, and there is no occurrence of l in P to the left of P(1). This suggests shifting only 1 place.
Good Suffix Rule: the mismatch is at i = 1. Since L´(2) = 0 and l´(2) = 3, shift P by n - l´(2) places, i.e., 6 - 3 = 3 places. Thus k = k + 3 = 9.
Search
T: lolgolgol
P:    golgol   (k = 9)
Comparisons (right to left): i = 6, h = 9; i = 5, h = 8; i = 4, h = 7; i = 3, h = 6; i = 2, h = 5; i = 1, h = 4; all match, leaving i = 0, h = 3.
Since i = 0, report an occurrence of P in T starting at position 4 (ending at k = 9), then k = k + 6 - l´(2) = 9 + 6 - 3 = 12.
k = 12 > m = 9, so we are done!
Homework 1: Due Next Week
• Implement the Boyer-Moore Algorithm
Break
KMP Algorithm
• Preliminaries:
– KMP can be easily explained in terms of finite state machines.
– KMP has an easily proved linear bound.
– KMP is usually not the method of choice.
KMP Algorithm
• Recall that the naïve approach to string matching is Θ(mn).
• How can we reduce this complexity?
– Avoid redundant comparisons
– Use larger shifts
• Boyer-Moore good suffix rule
• Boyer-Moore extended bad character rule
KMP Algorithm
• KMP finds larger shifts by recognizing patterns in P.
– Let spi(P) denote the length of the longest proper suffix of P[1..i] that matches a prefix of P.
– By definition sp1 = 0 for any string.
– Q: Why does this make sense?
– A: The only proper suffix of a single character is the empty string.
KMP Algorithm
• Example: P = abcaeabcabd– P[1..2] = ab hence sp2 = ?
– sp2 = 0
– P[1..3] = abc hence sp3 = ?
– sp3 = 0
– P[1..4] = abca hence sp4 = ?
– sp4 = 1
– P[1..5] = abcae hence sp5 = ?
– sp5 = 0
– P[1..6] = abcaea hence sp6 = ?
– sp6 = 1
KMP Algorithm
• Example Continued– P[1..7] = abcaeab hence sp7 = ?
– sp7 = 2
– P[1..8] = abcaeabc hence sp8 = ?
– sp8 = 3
– P[1..9] = abcaeabca hence sp9 = ?
– sp9 = 4
– P[1..10] = abcaeabcab hence sp10 = ?
– sp10 = 2
– P[1..11] = abcaeabcabd hence sp11 = ?
– sp11 = 0
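The sp values in this example can be checked mechanically. The sketch below applies the definition directly (try each proper suffix length, longest first), so it is quadratic and meant only for verifying small examples; it is not the linear-time preprocessing. The function name `sp_by_definition` is hypothetical.

```python
def sp_by_definition(P):
    """Return sp with sp[i] for 1-based i = 1..n (index 0 unused), straight from the definition."""
    n = len(P)
    sp = [0] * (n + 1)
    for i in range(1, n + 1):
        prefix = P[:i]                       # P[1..i] in slide notation
        for length in range(i - 1, 0, -1):   # proper suffixes, longest first
            if prefix.endswith(P[:length]):
                sp[i] = length
                break
    return sp


sp = sp_by_definition("abcaeabcabd")
```

The resulting list agrees with the slide values, for example sp[9] = 4 and sp[10] = 2.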
KMP Algorithm
• Like the α/α´ concept for Boyer-Moore, there is an analogous spi/sp´i concept.
• Let sp´i(P) denote the length of the longest proper suffix of P[1..i] that matches a prefix of P, with the added condition that characters P(i + 1) and P(sp´i + 1) are unequal.
• Example: P = abcdabce, sp´7 = 3
• Obviously sp´i(P) <= spi(P), since the latter is less restrictive.
KMP Algorithm
• KMP Shift Rule:
1. Mismatch case:
• Let position i+1 in P and position k in T be the first mismatch in a left-to-right scan.
• Shift P to the right, aligning P[1..sp´i] with T[k-sp´i..k-1].
2. Match case:
• If no mismatch is found, an occurrence of P has been found.
• Shift P by n - sp´n spaces to continue searching for other occurrences.
KMP Algorithm
• Observations:
– The prefix P[1..sp´i] of the shifted P is shifted to match the corresponding substring in T.
– Subsequent character matching proceeds from position sp´i + 1.
– Unlike Boyer-Moore, the matched substring is not compared again.
– The shift rule based on sp´i guarantees that the exact same mismatch won’t occur at sp´i + 1, but doesn’t guarantee that P(sp´i + 1) = T(k).
KMP Algorithm
• Example: P = abcxabcde
– If a mismatch occurs at position 8, P will be shifted 4 positions to the right.
– Q: Where did the 4 position shift come from?
– A: The number of positions is given by i - sp´i; in this example i = 7, sp´7 = 3, 7 - 3 = 4.
– Notice that we know the amount of shift without knowing anything about T other than that there was a mismatch at position 8.
KMP Algorithm
• Example Continued: P = abcxabcde
– After the shift, P[1..3] lines up with T[k-3..k-1].
– Since it is known that P[1..3] must match T[k-3..k-1], no comparison is needed.
– The scan continues from P(4) & T(k).
• Advantages of KMP Shift Rule
1. P is often shifted by more than 1 character (i - sp´i).
2. The left-most sp´i characters in the shifted P are known to match the corresponding characters in T.
KMP Algorithm
Full Example: T = xyabcxabcxadcdqfeg, P = abcxabcde
Assume that we have already shifted past the first two positions in T.
T: xyabcxabcxadcdqfeg
P:   abcxabcde
Comparisons 1 to 7 match; at comparison 8, d != x, so shift 4 places.
P:       abcxabcde
Start again from position 4 of P.
Preprocessing for KMP
Approach: show how to derive sp´ values from Z values.
Definition: Position j > 1 maps to i if i = j + Zj(P) - 1
– Recall that Zj(P) denotes the length of the Z-box starting at position j.
– This says that j maps to i if i is the right end of a Z-box starting at j.
Preprocessing for KMP
Theorem. For any i > 1, sp´i(P) = Zj = i - j + 1,
where j > 1 is the smallest position that maps to i. If there is no such j, then sp´i(P) = 0.
Similarly for sp: For any i > 1, spi(P) = i - j + 1,
where j, i >= j > 1, is the smallest position that maps to i or beyond. If there is no such j, then spi(P) = 0.
Definition: Position j > 1 maps to i if i = j + Zj(P) - 1
Preprocessing for KMP
Given the theorem from the preceding slide, the sp´i and spi values can be computed in linear time using Zi values:
For i = 1 to n { sp´i = 0; }
For j = n downto 2 {
    i = j + Zj(P) - 1;
    sp´i = Zj;
}
spn(P) = sp´n(P);
For i = n - 1 downto 2 {
    sp_i(P) = max[sp_{i+1}(P) - 1, sp´i(P)];
}
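The two loops above can be sketched in Python as follows. The sketch assumes a standard 0-based Z computation (`z_values`) and converts to the 1-based positions used on the slides; `sp_from_z` is a hypothetical name.

```python
def z_values(s):
    """z[i] = length of the longest substring of s starting at i (0-based) that is a prefix of s."""
    n = len(s)
    z = [0] * n
    left = right = 0
    for i in range(1, n):
        if i < right:
            z[i] = min(right - i, z[i - left])
        while i + z[i] < n and s[z[i]] == s[i + z[i]]:
            z[i] += 1
        if i + z[i] > right:
            left, right = i, i + z[i]
    return z


def sp_from_z(P):
    """Return (sp_prime, sp), both indexed 1..n with index 0 unused."""
    n = len(P)
    z = z_values(P)
    sp_prime = [0] * (n + 1)
    for j in range(n, 1, -1):      # j = n downto 2, so the smallest j mapping to i wins
        zj = z[j - 1]              # Z_j in 1-based terms
        sp_prime[j + zj - 1] = zj  # j maps to i = j + Z_j - 1
    sp = [0] * (n + 1)
    sp[n] = sp_prime[n]
    for i in range(n - 1, 1, -1):
        sp[i] = max(sp[i + 1] - 1, sp_prime[i])
    return sp_prime, sp


sp_prime, sp = sp_from_z("abcaeabcabd")
```

For P = abcaeabcabd the sp list matches the values worked out by hand earlier, and for P = abcxabcde the same routine gives sp´7 = 3.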
Preprocessing for KMP
Defn. Failure function F´(i) = sp´_{i-1} + 1, 1 <= i <= n + 1, sp´0 = 0
(similarly F(i) = sp_{i-1} + 1, 1 <= i <= n + 1, sp0 = 0)
T: xyabcxabcxadcdqfeg
P:   abcxabcde
Comparisons 1 to 7 match; at comparison 8, d != x, so shift 4 places.
P:       abcxabcde
Shifting is only conceptual and P is never explicitly shifted: a pointer i into P and a pointer c into T are moved instead.
Two special cases:
1. Mismatch at position 1, then F´(1) = 1.
2. Match found, then P shifts by n - sp´n places, which is F´(n+1) = sp´n + 1.
Preprocessing for KMP
Defn. Failure function F´(i) = sp´_{i-1} + 1, 1 <= i <= n + 1, sp´0 = 0
(similarly F(i) = sp_{i-1} + 1, 1 <= i <= n + 1, sp0 = 0)
• Idea:
– We maintain a pointer i in P and c in T.
– After a mismatch at P(i+1) with T(c), shift P to align P(sp´i + 1) with T(c), i.e., i = sp´i + 1.
– Special case 1: i = 1, set i = F´(1) = 1 & c = c + 1.
– Special case 2: we find P in T, shift n - sp´n spaces, i.e., i = F´(n + 1) = sp´n + 1.
Full KMP Algorithm
Preprocess P to find F´(k) = sp´_{k-1} + 1 for k from 1 to n + 1
c = 1; p = 1;
While c + (n - p) <= m {
    While p <= n and P(p) = T(c) {
        p = p + 1; c = c + 1;
    }
    If (p = n + 1) then report an occurrence of P at position c - n of T.
    If (p = 1) then c = c + 1;
    p = F´(p);
}
T = xyabcxabcxadcdqfeg
P = abcxabcde
|T| = m, |P| = n; p is the pointer into P and c is the pointer into T.
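The loop above can be sketched in Python as follows. This is an illustrative rendering, not a reference implementation: the failure function F´ is built from Z values as described on the preprocessing slides, positions are kept 1-based to mirror the pseudocode, and the names `z_values` and `kmp_search` are hypothetical.

```python
def z_values(s):
    """z[i] = length of the longest substring of s starting at i (0-based) that is a prefix of s."""
    n = len(s)
    z = [0] * n
    left = right = 0
    for i in range(1, n):
        if i < right:
            z[i] = min(right - i, z[i - left])
        while i + z[i] < n and s[z[i]] == s[i + z[i]]:
            z[i] += 1
        if i + z[i] > right:
            left, right = i, i + z[i]
    return z


def kmp_search(P, T):
    """Return the 1-based start positions of every occurrence of P in T."""
    n, m = len(P), len(T)
    if n == 0:
        return []
    # sp'_i from Z values: the smallest j that maps to i wins.
    z = z_values(P)
    sp_prime = [0] * (n + 1)
    for j in range(n, 1, -1):
        sp_prime[j + z[j - 1] - 1] = z[j - 1]
    # Failure function F'(k) = sp'_{k-1} + 1 for k = 1..n+1, with sp'_0 = 0.
    F = [0] * (n + 2)
    for k in range(1, n + 2):
        F[k] = sp_prime[k - 1] + 1

    hits = []
    c = p = 1  # c points into T, p points into P
    while c + (n - p) <= m:
        while p <= n and P[p - 1] == T[c - 1]:
            p += 1
            c += 1
        if p == n + 1:
            hits.append(c - n)
        if p == 1:
            c += 1
        p = F[p]
    return hits
```

On T = xyabcxabcxabcdefeg with P = abcxabcde the sketch reports a single occurrence at position 7, and on T = xyabcxabcxadcdqfeg it reports none.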
Full KMP Algorithm
T: xyabcxabcxabcdefeg
P: abcxabcde
Comparison 1: a != x.
p != n + 1.
p = 1, so c = 2.
p = F´(1) = 1.
Full KMP Algorithm
T: xyabcxabcxabcdefeg
P:  abcxabcde
Comparison 1: a != y.
p != n + 1.
p = 1, so c = 3.
p = F´(1) = 1.
Full KMP Algorithm
T: xyabcxabcxabcdefeg
P:   abcxabcde
Comparisons 1 to 7 match; comparison 8: d != x.
p != n + 1.
p = 8, so don’t change c.
p = F´(8) = 4, so now p = 4, c = 10.
P:       abcxabcde   (the scan resumes at position 4 of P)
Full KMP Algorithm
T: xyabcxabcxabcdefeg
P:       abcxabcde
Starting from p = 4, c = 10, positions 4 to 9 of P all match.
p = n + 1! Report an occurrence of P at position c - n = 16 - 9 = 7 of T.
Real-Time KMP
• Q: What is meant by real-time algorithms?
• A: Typically these are algorithms that are meant to interact synchronously in the real world.
– This implies a known fixed turn-around time for processing a task.
– Many embedded scheduling systems are examples involving real-time algorithms.
– For KMP this means that we require a constant time for processing all strings of length n.
Real-Time KMP
• Q: Why is KMP not real-time?
• A: For any mismatched character in T, we may try matching it several times.
– Recall that sp´i only guarantees that P(i + 1) and P(sp´i + 1) differ.
– There is NO guarantee that P(sp´i + 1) and T(k) match.
• We need to ensure that a mismatch at T(k) does NOT entail additional matches at T(k).
• This means that we have to compute sp´i values with respect to all characters in Σ since any could appear in T.
Real-Time KMP
• Define: sp´(i,x)(P) to be the length of the longest proper suffix of P[1..i] that matches a prefix of P, with the added condition that character P(sp´i + 1) is x.
• This will tell us exactly what shift to use for each possible mismatch.
• A mismatched character T(k) will never be involved in subsequent comparisons.
Real-Time KMP
• Q: How do we know that the mismatched character T(k) will never be involved in subsequent comparisons?
• A: Because the shift will shift P so that either the matching character aligns with T(k) or P will be shifted past T(k).
• This results in a real-time version of KMP.
• Let’s consider how we can find the sp´(i,x)(P) values in linear time.
Real-Time KMP
Thm. For P[i + 1] ≠ x, sp´(i,x)(P) = i - j + 1
– Here j is the smallest position such that j maps to i and P(Zj + 1) = x.
– If there is no such j, then sp´(i,x)(P) = 0.
For i = 1 to n { sp´(i,x) = 0 for every character x; }
For j = n downto 2 {
    i = j + Zj(P) - 1;
    x = P(Zj + 1);
    sp´(i,x) = Zj;
}
Real-Time KMP
• Notice how this works:
– Starting from the right:
• Find i, the right end of the Z-box associated with j.
• Find x, the character immediately following the prefix corresponding to this Z-box.
• Set sp´(i,x) = Zj, the length of this Z-box.
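The table of sp´(i,x) values can be sketched in Python as follows. In this sketch the table is a list of dictionaries so that missing entries stand for 0; positions with Zj = 0 are skipped, which is an assumption of the sketch (such a j would only write a 0 into the table). The names `z_values` and `real_time_table` are hypothetical.

```python
def z_values(s):
    """z[i] = length of the longest substring of s starting at i (0-based) that is a prefix of s."""
    n = len(s)
    z = [0] * n
    left = right = 0
    for i in range(1, n):
        if i < right:
            z[i] = min(right - i, z[i - left])
        while i + z[i] < n and s[z[i]] == s[i + z[i]]:
            z[i] += 1
        if i + z[i] > right:
            left, right = i, i + z[i]
    return z


def real_time_table(P):
    """Return table where table[i].get(x, 0) is sp'(i,x) for 1-based i (index 0 unused)."""
    n = len(P)
    z = z_values(P)
    table = [dict() for _ in range(n + 1)]
    for j in range(n, 1, -1):   # j = n downto 2, so the smallest j wins
        zj = z[j - 1]           # Z_j in 1-based terms
        if zj == 0:
            continue            # an empty Z-box contributes only a zero entry
        i = j + zj - 1          # right end of the Z-box starting at j
        x = P[zj]               # P(Z_j + 1) in 1-based terms
        table[i][x] = zj
    return table


table = real_time_table("abcxabcde")
```

For P = abcxabcde the only nonzero entry is sp´(7, x) = 3: after a mismatch at position 8 against a text character x, P(4) = x is aligned with that character, while any other text character gives 0 and P moves past it.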
Reference
• Chapter 1, 2: Exact Matching: Fundamental Preprocessing and First Algorithms