Presented by Dr. Shazzad Hosain, Asst. Prof., EECS, NSU


Exact String Matching Algorithms

Classical Comparison Based Methods

• Boyer-Moore Algorithm
• Knuth-Morris-Pratt Algorithm (KMP Algorithm)

Boyer-Moore Algorithm

• Basic ideas:
– Ideas already discussed for naïve matching:
1. Successively align P and T to check for a match.
2. Shift P to the right on match failure.
– New concepts with respect to the naïve algorithm:
1. Scan from right to left.
2. Bad character rule.
3. Suffix shift rule.

Concept: Right-to-left Scan

• How can we check for a match of pattern P at location i in target T?

• The naïve algorithm scans left to right, i.e., it compares T[i+k] with P[1+k] for k = 0 to length(P)-1.

Example: P = adab, T = abaracadabara

T: a b a r a c a d a b a r a
P: a d a b
   ^ 1: a == a
     ^ 2: d != b

• Alternatively, scan right to left, i.e., compare T[i+k] with P[1+k] for k = length(P)-1 down to 0.

Example: P = adab, T = abaracadabara

T: a b a r a c a d a b a r a
P: a d a b
         ^ 1: b != r


• Why is scanning right to left a good idea?
• Answer: by itself, it isn't any better than left to right.
– A naïve approach with right-to-left scanning is also Θ(nm).
– Larger shifts, supported by a clever bad character rule and a suffix shift rule, make it better.
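To make the point concrete, here is a minimal sketch of a naïve matcher that scans each alignment right to left but still shifts by one position every time. The function name `naive_right_to_left` is an assumption for illustration, and reported positions are 1-based to match the slides.

```python
def naive_right_to_left(pattern, text):
    """Return the 1-based start positions of pattern in text.

    Each alignment is checked right to left, but P always moves by
    exactly one position, so the worst case remains Theta(n * m).
    """
    n, m = len(pattern), len(text)
    hits = []
    for start in range(m - n + 1):
        i = n - 1
        # Compare from the last character of P back to the first.
        while i >= 0 and pattern[i] == text[start + i]:
            i -= 1
        if i < 0:
            hits.append(start + 1)
    return hits


print(naive_right_to_left("adab", "abaracadabara"))
```

The scan direction alone changes nothing about the shift; the rules that follow are what allow P to move further.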

Concept: Bad Character Rule

• Idea: the mismatched character indicates a safe minimum shift.

Example: P = adacara, T = abaracadabara

T: a b a r a c a d a b a r a
P: a d a c a r a
               ^ 1: a == a
             ^ 2: r != c

Here the bad character is c. Perhaps we should shift to align this character with its rightmost occurrence in P?

Shift two positions to align the rightmost occurrence of the mismatched character c in P.

T: a b a r a c a d a b a r a
P:     a d a c a r a

Now, start matching again from right to left.


Second Example: P = adacara, T = abaxaradabara

T: a b a x a r a d a b a r a
P: a d a c a r a
               ^ 1: a == a
             ^ 2: r == r
           ^ 3: a == a
         ^ 4: c != x

Here the bad character is x. The minimum shift should align this character with its occurrence in P. But x doesn't occur in P!

Since x doesn't occur in P, we can shift past it.

T: a b a x a r a d a b a r a
P:         a d a c a r a

Now, start matching again from right to left.


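• The idea of the bad character rule is to shift P by more than one character when possible.

• But if the rightmost occurrence of the mismatched character in P lies to the right of the mismatch position, the rule gives no useful shift and P moves by only one position.

• Unfortunately, this is often the case.

   12345678901234567
T: spbctbsatpqsctbpq
P:   tpabsat
P:    tpabsat

Here the mismatch is at position 3 of P against T[5] = t, but the rightmost t in P is at position 7, so P shifts by only one position.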


• We will define a bad character rule that uses the concept of the rightmost occurrence of each letter.

• Let R(x) be the rightmost position of the letter x in P for each letter x in our alphabet.

• If x doesn't occur in P, define R(x) to be 0.

Example: P = adacara (positions 1234567)

x:    a b c d ... z
R(x): 7 0 4 2 ... 0

Concept: Bad Character Rule

   12345678901234567
T: spbctbsabpqsctbpq
P:   tpabsab

R(t) = 1, R(s) = 5.
i: the position of the mismatch in P; i = 3.
k: the counterpart in T; k = 5, T[k] = t.

• The bad character rule says P should be shifted right by max{1, i - R(T[k])}, i.e., if the rightmost occurrence of character T[k] in P is in position j (j < i), then P[j] should be below T[k] after the shift.

T: spbctbsabpqsctbpq
P:     tpabsab

• Otherwise, i.e., when R(T[k]) >= i and hence i - R(T[k]) <= 0, we shift P by one position.
• Obviously this rule is not very useful when R(T[k]) >= i, which is usually the case for DNA sequences.
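The rule max{1, i - R(T[k])} can be sketched in a few lines. The helper names `rightmost_positions` and `bad_character_shift` are hypothetical; positions are 1-based as in the slides, and a missing dictionary key plays the role of R(x) = 0.

```python
def rightmost_positions(pattern):
    """R(x): 1-based rightmost position of each character of the pattern.

    Characters that do not occur are simply absent; callers treat
    a missing key as R(x) = 0.
    """
    # Later occurrences overwrite earlier ones, leaving the rightmost.
    return {ch: pos for pos, ch in enumerate(pattern, start=1)}


def bad_character_shift(R, i, bad_char):
    """Shift for a mismatch at position i of P against text character bad_char."""
    return max(1, i - R.get(bad_char, 0))


R = rightmost_positions("tpabsab")
print(R["t"], R["s"])                 # R(t) and R(s) from the slide
print(bad_character_shift(R, 3, "t"))  # mismatch at i = 3 against T[k] = t
```

With P = tpabsat instead, R(t) = 7 exceeds i = 3 and the same call falls back to a shift of 1.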

Concept: Extended Bad Character Rule

Extended Bad Character Rule: If P[i] mismatches T[k], shift P along T so that the closest occurrence of the letter T[k] in P to the left of i in P is aligned with T[k].

Example: P = aracara, T = abararadabara

T: a b a r a r a d a b a r a
P: a r a c a r a
               ^ 1: a == a
             ^ 2: r == r
           ^ 3: a == a
         ^ 4: c != r

The rightmost occurrence of r in P is at position 6. Notice that i - R(T[k]) < 0, i.e., 4 – 6 < 0.

The rightmost occurrence of r to the left of i in P is at position 2. Notice that 4 – 2 > 0, i.e., this gives us a positive shift.


The amount of shift is i – j, where:
– i is the index of the mismatch in P.
– j is the rightmost occurrence of T[k] to the left of i in P.

Example: P = aracara, T = abataradabara

T: a b a t a r a d a b a r a
P: a r a c a r a
               ^ 1: a == a
             ^ 2: r == r
           ^ 3: a == a
         ^ 4: c != t

There is no occurrence of t in P, thus j = 0. Notice that i – j = 4, i.e., this gives us a positive shift past the point of mismatch.


• How do we implement this rule?
• We preprocess P (from right to left), recording the position of each occurrence of the letters.
• For each character x in Σ, the alphabet, create a list of its occurrences in P. If x doesn't occur in P, then it has an empty list.

Example: Σ = {a, b, c, d, r, t}, P = abataradabara
• a_list = <13, 11, 9, 7, 5, 3, 1> since 'a' occurs at these positions in P
• b_list = <10, 2>
• c_list = Ø
• d_list = <8>
• r_list = <12, 6>
• t_list = <4>
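The occurrence lists and the lookup they support might be sketched as follows. Both function names are assumptions made for this illustration; the lists are built right to left so that the first entry smaller than i is the closest occurrence to the left of the mismatch.

```python
def occurrence_lists(pattern):
    """Map each character to its 1-based positions in P, in decreasing order."""
    lists = {}
    for pos in range(len(pattern), 0, -1):  # scan P from right to left
        lists.setdefault(pattern[pos - 1], []).append(pos)
    return lists


def extended_bad_character_shift(lists, i, bad_char):
    """Shift i - j, where j is the closest occurrence of bad_char left of i.

    If bad_char never occurs to the left of i, j = 0 and the shift is i.
    """
    for pos in lists.get(bad_char, []):
        if pos < i:
            return i - pos
    return i


lists = occurrence_lists("aracara")
print(extended_bad_character_shift(lists, 4, "r"))  # r occurs at 6 and 2
print(extended_bad_character_shift(lists, 4, "t"))  # t does not occur
```

Scanning a list costs at most as many steps as characters already matched, so the lookup does not change the overall running time.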

Concept: Suffix Shift Rule

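• Recall that we investigated finding prefixes before.
• Since we are matching P to T from right to left, we will instead need to use suffixes.

Suffix Shift Rule

• t is a suffix of P that matches a substring t of T; the character x of T to the left of that substring differs from the character y of P to the left of t (x ≠ y).
• t' is the right-most copy of t in P such that t' is not a suffix of P and the character z to the left of t' differs from y (z ≠ y).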


• Consider the partial right-to-left matching of P to T below.

• This partial match involves α, a suffix of P.

P: ...adbadbaddog
T: ...axbadbaddog...


• This partial match ends where the first mismatch occurs, where x (in T) is aligned with d (in P).

P: ...adbadbaddog
T: ...axbadbaddog...


We want to find a right-most copy α′ of this substring α in P such that:

1. α′ is not a suffix of P, and
2. the character to the left of α′ is not the same as the character to the left of α.

P: ...gbadbaddoghorseadbadbaddog
T: ..................axbadbaddog...

Here α = badbaddog at the end of P (preceded by d) and α′ is the copy of badbaddog preceded by g.


1. If α′ exists, shift P to the right such that α′ is now aligned with the substring in T that was previously aligned with α.

P before: gbadbaddogcatdbadbaddog
T:        .............xbadbaddog.....
P after:               gbadbaddogcatdbadbaddog


2. If α′ doesn't exist, shift P right by the least amount such that a prefix of P is aligned with a suffix of α in T.

P before: dogcatratdbadbaddog
T:        .........xbadbaddog.....
P after:                  dogcatratdbadbaddog


3. If α′ doesn't exist, and there is no prefix of P that matches a suffix of α in T, shift P right by n positions.

P before: batcatratdbadbaddog
T:        .........xbadbaddog.....
P after:                     batcatratdbadbaddog

Preprocessing for the good suffix rule

• Let L(i) denote the largest position less than n s.t. P[i..n] matches a suffix of P[1..L(i)].

• If there is no such position, then L(i) = 0.

• Example 1: P = batcatdogdbadbaddog. If i = 17 then L(i) = 9, since P[17..19] = dog also ends at position 9.

• Example 2: For the same P, if i = 16 then L(i) = 0.


• Let L′(i) denote the largest position less than n s.t. P[i..n] matches a suffix of P[1..L′(i)] and s.t. the character preceding that suffix is not equal to P(i-1).

• If there is no such position, then L′(i) = 0.

• Example 1: P = slydogsaddogdbadbaddog. If i = 20 then L(i) = 12 and L′(i) = 6.

• Example 2: For the same P, if i = 19 then L(i) = 12 and L′(i) = 0.


• Notice that L(i) indicates the right-most copy of P[i..n] that is not a suffix of P.

• In contrast, L´(i) indicates the right-most copy of P[i..n] that is not a suffix of P and whose preceding character doesn’t match P(i-1).

• The relation between L´(i) and L(i) is analogous to the relation between a´ and a.




• Q: What is the point?
• A: If P(i - 1) causes the mismatch and L′(i) > 0, then we can shift P right by n - L′(i) positions.

[Figure: P shifted right so that α′ lines up with the substring badbaddog of T = ...xbadbaddog...]

• If L(i) and L′(i) are different, then shifting by n - L′(i) positions is a greater shift than n - L(i), because L′(i) <= L(i).

• Example:

P: slybaddogbadbaddogcatdbadbaddog
T: .....................xbaxbaddog.....

The matched suffix is baddog. Its copy ending at position 18 (preceded by d) gives L(i); its copy ending at position 9 (preceded by y) gives L′(i).


• Let N_j(P) denote the length of the longest suffix of P[1..j] that is also a suffix of P.

• Example 1: P = slydogsaddogdbadbaddog: N_6(P) = 3 and N_12(P) = 5.

• Example 2: P = hogslydogsaddogdbadbaddog: N_3(P) = 2, N_9(P) = 3, N_15(P) = 5, N_19(P) = 0.


• Q: How are the concepts of N_i and Z_i related?
• Recall that Z_i = length of a maximal substring starting at position i, which is a prefix of P.

• In contrast, N_i = length of a maximal substring ending at position i, which is a suffix of P.

• In the case of Boyer-Moore, we are naturally interested in suffixes since we are scanning right to left.


• Let P^r denote the mirror image of P; then the relationship can be expressed as N_j(P) = Z_{n-j+1}(P^r).

• In words, the length of the substring matching a suffix at position j in P is equal to the length of the corresponding substring matching a prefix in the reverse of P.

• Q: Why must this be true?
• A: Because they are the same substring, except that one is the reverse of the other.


• Since N_j(P) = Z_{n-j+1}(P^r), we can use the Z algorithm to compute N in O(n).

• Q: How do we do this?
• A: We create P^r, the reverse of P, and process it with the Z algorithm.
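As a rough illustration of the reversal trick, the sketch below computes Z values with the standard linear-time Z algorithm and then reads the N values off the reversed pattern. The names `z_values` and `n_values` are assumptions; lists are 0-based, so `N[j - 1]` holds N_j, and Z_1 (and therefore N_n) is left at 0 by convention.

```python
def z_values(s):
    """Standard linear-time Z algorithm; z[0] is left at 0 by convention."""
    n = len(s)
    z = [0] * n
    left = right = 0  # current Z-box is s[left:right]
    for i in range(1, n):
        if i < right:
            # Reuse the value computed for the mirrored position in the box.
            z[i] = min(right - i, z[i - left])
        while i + z[i] < n and s[i + z[i]] == s[z[i]]:
            z[i] += 1
        if i + z[i] > right:
            left, right = i, i + z[i]
    return z


def n_values(pattern):
    """N_j(P) = Z_{n-j+1}(P^r): run Z on the reversed pattern, then reverse."""
    return z_values(pattern[::-1])[::-1]


print(n_values("qcabdabdab"))
```

For P = qcabdabdab this gives N_4 = 2 and N_7 = 5 with all other entries 0.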

N is the reverse of Z!

P: the pattern
P^r: the string obtained by reversing P

Then N_j(P) = Z_{n-j+1}(P^r).

j, i: 1 2 3 4 5 6 7 8 9 10
P:    q c a b d a b d a b
N_j:  0 0 0 2 0 0 5 0 0 0
P^r:  b a d b a d b a c q
Z_i:  0 0 0 5 0 0 2 0 0 0


For pattern P, N_j (for j = 1, …, n) can be calculated in O(n) using the Z algorithm.

Why do we need to define N_j?

To use the strong good suffix rule, we need to find L′(i) for every i = 1, …, n.

We can get L′(i) from N_j!


• We can then find L′(i) and L(i) values from N values in linear time with the following:

For i = 1 to n { L′(i) = 0; }
For j = 1 to n – 1 {
    i = n - N_j(P) + 1;
    L′(i) = j;
}

// L values (if desired) can be obtained as follows
L(2) = L′(2);
For i = 3 to n { L(i) = max(L(i - 1), L′(i)); }


• Example: P = asdbasasas, n = 10
• Values of N_j(P) for j = 1..9: 0, 2, 0, 0, 0, 2, 0, 4, 0
• Computed values i for j = 1..9: 11, 9, 11, 11, 11, 9, 11, 7, 11
• Values of L′(i) for i = 1..9: 0, 0, 0, 0, 0, 0, 8, 0, 6
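The two loops for L′ and L could be written roughly as below. The function name `big_l_tables` is hypothetical; the input is the list of N values (`N[j - 1]` = N_j) and the returned lists are 1-based, with index 0 unused. Entries with N_j = 0 would land on i = n + 1, outside the pattern, so this sketch skips them.

```python
def big_l_tables(N):
    """Return (L_prime, L) as 1-based lists computed from the N values."""
    n = len(N)
    l_prime = [0] * (n + 1)
    for j in range(1, n):           # j = 1 .. n - 1
        if N[j - 1] > 0:            # i = n + 1 falls outside P; skip it
            i = n - N[j - 1] + 1
            l_prime[i] = j          # later (larger) j overwrites earlier ones
    big_l = [0] * (n + 1)
    if n >= 2:
        big_l[2] = l_prime[2]
    for i in range(3, n + 1):
        big_l[i] = max(big_l[i - 1], l_prime[i])
    return l_prime, big_l


# N values for P = asdbasasas, with N_n left at 0 by convention.
l_prime, big_l = big_l_tables([0, 2, 0, 0, 0, 2, 0, 4, 0, 0])
print(l_prime[1:])
print(big_l[1:])
```

Because j increases through the loop, each L′(i) ends up holding the largest j that maps to i, which is the right-most copy.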




• Let l′(i) denote the length of the largest suffix of P[i..n] that is also a prefix of P. Let l′(i) = 0 if no such suffix exists.

Example: P = asasbsasas
l′(1) = 4 (counting only proper suffixes; the whole pattern, of length 10, is trivially both a suffix and a prefix of P)
l′(2) = l′(3) = l′(4) = l′(5) = l′(6) = l′(7) = 4
l′(8) = l′(9) = 2
l′(10) = 0


• Thm: l′(i) = largest j <= n – i + 1 s.t. N_j(P) = j.
• Q: How can we compute l′(i) values in linear time?
• A: This is problem #9 in Chapter 2. This would make an interesting homework problem.
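One possible way to apply the theorem in linear time is sketched below; it is not taken from the slides. The bound n - i + 1 grows as i decreases, so a single pass over j can carry along the largest j seen so far with N_j = j. The function name `small_l_prime` is an assumption, and the sketch treats N_n as n (the whole pattern is its own suffix), which makes l′(1) = n.

```python
def small_l_prime(N):
    """Return l'(i) as a 1-based list (index 0 unused) from the N values."""
    n = len(N)
    lp = [0] * (n + 1)
    best = 0                        # largest j so far with N_j = j
    for j in range(1, n + 1):
        if j == n or N[j - 1] == j:
            best = j
        lp[n - j + 1] = best        # the bound j = n - i + 1 gives i = n - j + 1
    return lp


# N values for P = golgol.
print(small_l_prime([0, 0, 3, 0, 0, 0])[1:])
```

For P = golgol this yields l′(2) = l′(3) = l′(4) = 3 and l′(5) = l′(6) = 0.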



Preprocessing:
Compute L′(i) and l′(i) for each position i in P.
Compute R(x), the right-most occurrence of x in P, for each character x in Σ.

Search:
k = n;
While k <= m {
    i = n; h = k;
    While i > 0 and P(i) = T(h) {
        i = i – 1; h = h – 1;
    }
    if i = 0 {
        report an occurrence of P in T ending at position k.
        k = k + n - l′(2);
    }
    else Shift P (increase k) by the max amount indicated by the extended bad character rule and the good suffix rule.
}
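The search loop above, together with its preprocessing, can be put together as in the following sketch. It is an illustration under stated assumptions rather than a reference implementation: all function names are hypothetical, occurrences are reported by 1-based start position (k - n + 1), and a mismatch at position n (no suffix matched yet) is given a good-suffix shift of 1, a common textbook convention that the slides do not spell out.

```python
def z_values(s):
    """Standard linear-time Z algorithm; z[0] is left at 0."""
    n, z, left, right = len(s), [0] * len(s), 0, 0
    for i in range(1, n):
        if i < right:
            z[i] = min(right - i, z[i - left])
        while i + z[i] < n and s[i + z[i]] == s[z[i]]:
            z[i] += 1
        if i + z[i] > right:
            left, right = i, i + z[i]
    return z


def preprocess(p):
    """Return (L', l', occurrence lists); L' and l' are 1-based lists."""
    n = len(p)
    N = z_values(p[::-1])[::-1]            # N[j - 1] = N_j
    big_lp = [0] * (n + 2)
    for j in range(1, n):
        if N[j - 1] > 0:
            big_lp[n - N[j - 1] + 1] = j
    small_lp = [0] * (n + 2)
    best = 0
    for j in range(1, n + 1):
        if j == n or N[j - 1] == j:
            best = j
        small_lp[n - j + 1] = best
    occ = {}
    for pos in range(n, 0, -1):            # positions in decreasing order
        occ.setdefault(p[pos - 1], []).append(pos)
    return big_lp, small_lp, occ


def boyer_moore(p, t):
    """Return 1-based start positions of all occurrences of p in t."""
    n, m = len(p), len(t)
    if n == 0 or n > m:
        return []
    big_lp, small_lp, occ = preprocess(p)
    hits, k = [], n
    while k <= m:
        i, h = n, k
        while i > 0 and p[i - 1] == t[h - 1]:
            i, h = i - 1, h - 1
        if i == 0:
            hits.append(k - n + 1)
            k += n - small_lp[2]
        else:
            # Extended bad character rule: closest occurrence left of i.
            j = next((pos for pos in occ.get(t[h - 1], []) if pos < i), 0)
            bad = i - j
            # Good suffix rule: mismatch at i, matched suffix is P[i+1..n].
            if i == n:
                good = 1                   # assumed convention: nothing matched
            elif big_lp[i + 1] > 0:
                good = n - big_lp[i + 1]
            else:
                good = n - small_lp[i + 1]
            k += max(bad, good)
    return hits


print(boyer_moore("golgol", "lolgolgol"))
```

Both rules always yield a shift of at least 1, so the loop is guaranteed to advance.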


Example: P = golgol

Preprocessing: compute L′(i) and l′(i) for each position i in P.

Notice that we first need the N_j(P) values in order to compute L′(i) and l′(i) for each position i in P.

For j = 1 to n – 1 {
    i = n - N_j(P) + 1;
    L′(i) = j;
}


Example: P = golgol

Recall that N_j(P) is the length of the longest suffix of P[1..j] that is also a suffix of P.

N1(P) = 0, there is no suffix of P that ends with g

N2(P) = 0, there is no suffix of P that ends with o

N3(P) = 3, there is a suffix of P that ends with l

N4(P) = 0, there is no suffix of P that ends with g

N5(P) = 0, there is no suffix of P that ends with o

N1(P) = N2(P) = N4(P) = N5(P) = 0 and N3(P) = 3


Preprocessing: P = golgol, n = 6

N1(P) = N2(P) = N4(P) = N5(P) = 0 and N3(P) = 3
Compute L′(i) and l′(i) for each position i in P.

For j = 1 to n – 1 {
    i = n - N_j(P) + 1;
    L′(i) = j;
}

j = 1: i = 7, therefore L′(7) = 1
j = 2: i = 7, therefore L′(7) = 2
j = 3: i = 4, therefore L′(4) = 3
j = 4: i = 7, therefore L′(7) = 4
j = 5: i = 7, therefore L′(7) = 5

Position 7 = n + 1 lies outside P, so the assignments to L′(7) are never used.

L′(1) = L′(2) = L′(3) = L′(5) = L′(6) = 0 and L′(4) = 3


Preprocessing: P = golgol, n = 6
N1(P) = N2(P) = N4(P) = N5(P) = 0 and N3(P) = 3
L′(1) = L′(2) = L′(3) = L′(5) = L′(6) = 0 and L′(4) = 3

Compute l′(i) for each position i in P.
Recall that l′(i) is the length of the longest suffix of P[i..n] that is also a prefix of P.

l′(1) = 6 since golgol itself is the longest suffix of P[1..n] that is a prefix of P.
l′(2) = 3 since gol is the longest suffix of P[2..n] that is a prefix of P.
l′(3) = 3 since gol is the longest suffix of P[3..n] that is a prefix of P.
l′(4) = 3 since gol is the longest suffix of P[4..n] that is a prefix of P.
l′(5) = 0 since there is no suffix of P[5..n] that is a prefix of P.
l′(6) = 0 since there is no suffix of P[6..n] that is a prefix of P.

l′(1) = 6, l′(2) = l′(3) = l′(4) = 3 and l′(5) = l′(6) = 0


Preprocessing: P = golgol, n = 6
N1(P) = N2(P) = N4(P) = N5(P) = 0 and N3(P) = 3
L′(1) = L′(2) = L′(3) = L′(5) = L′(6) = 0 and L′(4) = 3
l′(1) = 6, l′(2) = l′(3) = l′(4) = 3 and l′(5) = l′(6) = 0

Compute the list R(x) of occurrences of x in P, right-most first, for each character x in Σ = {g, o, l}.

R(g) = <4, 1>

R(o) = <5, 2>

R(l) = <6, 3>


Preprocessing: P = golgol, n = 6, T = lolgolgol, m = 9
L′(1) = L′(2) = L′(3) = L′(5) = L′(6) = 0 and L′(4) = 3
l′(1) = 6, l′(2) = l′(3) = l′(4) = 3 and l′(5) = l′(6) = 0
R(g) = <4, 1>, R(o) = <5, 2>, R(l) = <6, 3>

Search:
k = n;
While k <= m {
    i = n; h = k;
    While i > 0 and P(i) = T(h) {
        i = i – 1; h = h – 1;
    }
    if i = 0 {
        report an occurrence of P in T ending at position k.
        k = k + n - l′(2);
    }
    else Shift P (increase k) by the max amount indicated by the extended bad character rule and the good suffix rule.
}

Search

T: lolgolgol
P: golgol

First pass, k = 6:
i = 6, h = 6: match
i = 5, h = 5: match
i = 4, h = 4: match
i = 3, h = 3: match
i = 2, h = 2: match
i = 1, h = 1: P(1) != T(1)

Bad Character Rule: there is no occurrence of l, the mismatched character in T, to the left of P(1). This suggests shifting only 1 place.

Good Suffix Rule: the mismatch is at i = 1, so the rule consults position 2. Since L′(2) = 0 and l′(2) = 3, shift P by n - l′(2) places, i.e., 6 - 3 = 3 places. Thus k = k + 3 = 9.

k = 6;
While k <= 9 {
    i = 6; h = k;
    While i > 0 and P(i) = T(h) {
        i = i – 1; h = h – 1;
    }
    if i = 0 {
        report an occurrence of P in T ending at position k.
        k = k + 6 - l′(2);
    }
    else Shift P (increase k) by the max amount indicated by the extended bad character rule and the good suffix rule.
}

Search

T: lolgolgol
P:    golgol

Second pass, k = 9:
i = 6, h = 9: match
i = 5, h = 8: match
i = 4, h = 7: match
i = 3, h = 6: match
i = 2, h = 5: match
i = 1, h = 4: match
i = 0, h = 3

i = 0: report an occurrence of P in T starting at position 4 (ending at k = 9); k = k + 6 - l′(2) = 9 + 6 - 3 = 12.

k = 12 > m = 9, so we are done!

Homework 1: Due Next Week
• Implement the Boyer-Moore Algorithm

KMP Algorithm

• Preliminaries:
– KMP can be easily explained in terms of finite state machines.
– KMP has an easily proved linear bound.
– KMP is usually not the method of choice.

KMP Algorithm

• Recall that the naïve approach to string matching is Θ(mn).

• How can we reduce this complexity?
– Avoid redundant comparisons
– Use larger shifts
• Boyer-Moore good suffix rule
• Boyer-Moore extended bad character rule

KMP Algorithm

• KMP finds larger shifts by recognizing patterns in P.
– Let sp_i(P) denote the length of the longest proper suffix of P[1..i] that matches a prefix of P.
– By definition sp_1 = 0 for any string.
– Q: Why does this make sense?
– A: The proper suffix must be the empty string.

KMP Algorithm

• Example: P = abcaeabcabd
– P[1..2] = ab hence sp2 = ?

– sp2 = 0

– P[1..3] = abc hence sp3 = ?

– sp3 = 0

– P[1..4] = abca hence sp4 = ?

– sp4 = 1

– P[1..5] = abcae hence sp5 = ?

– sp5 = 0

– P[1..6] = abcaea hence sp6 = ?

– sp6 = 1

KMP Algorithm

• Example Continued
– P[1..7] = abcaeab hence sp7 = ?

– sp7 = 2

– P[1..8] = abcaeabc hence sp8 = ?

– sp8 = 3

– P[1..9] = abcaeabca hence sp9 = ?

– sp9 = 4

– P[1..10] = abcaeabcab hence sp10 = ?

– sp10 = 2

– P[1..11] = abcaeabcabd hence sp11 = ?

– sp11 = 0
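The sp values in this example can be checked mechanically. The sketch below uses the classic prefix-function recurrence (extend the previous border, falling back through shorter borders on a mismatch) rather than the Z-based derivation used elsewhere in these slides; the function name `sp_values` is an assumption, and the returned list is 1-based with index 0 unused.

```python
def sp_values(p):
    """sp[i] = length of the longest proper suffix of P[1..i] that is a prefix of P."""
    n = len(p)
    sp = [0] * (n + 1)
    k = 0                       # length of the border currently being extended
    for i in range(2, n + 1):
        # Fall back to shorter borders until P(k + 1) matches P(i).
        while k > 0 and p[k] != p[i - 1]:
            k = sp[k]
        if p[k] == p[i - 1]:
            k += 1
        sp[i] = k
    return sp


print(sp_values("abcaeabcabd")[1:])
```

The printed list can be compared entry by entry with sp1 through sp11 above.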

KMP Algorithm

• Like the α/α′ concept for Boyer-Moore, there is an analogous sp_i/sp′_i concept.
• Let sp′_i(P) denote the length of the longest proper suffix of P[1..i] that matches a prefix of P, with the added condition that characters P(i + 1) and P(sp′_i + 1) are unequal.

• Example: P = abcdabce, sp′_7 = 3

Obviously sp′_i(P) <= sp_i(P), since the latter is less restrictive.

KMP Algorithm

• KMP Shift Rule:
1. Mismatch case:
• Let position i+1 in P and position k in T be the first mismatch in a left-to-right scan.
• Shift P to the right, aligning P[1..sp′_i] with T[k-sp′_i..k-1].
2. Match case:
• If no mismatch is found, an occurrence of P has been found.
• Shift P by n – sp′_n spaces to continue searching for other occurrences.

KMP Algorithm

• Observations:
– The prefix P[1..sp′_i] of the shifted P is shifted to match the corresponding substring in T.
– Subsequent character matching proceeds from position sp′_i + 1.
– Unlike Boyer-Moore, the matched substring is not compared again.
– The shift rule based on sp′_i guarantees that the exact same mismatch won't occur at sp′_i + 1, but doesn't guarantee that P(sp′_i + 1) = T(k).

KMP Algorithm

• Example: P = abcxabcde
– If a mismatch occurs at position 8, P will be shifted 4 positions to the right.
– Q: Where did the 4 position shift come from?
– A: The number of positions is given by i - sp′_i; in this example i = 7, sp′_7 = 3, 7 – 3 = 4.
– Notice that we know the amount of shift without knowing anything about T other than that there was a mismatch at position 8.

KMP Algorithm

• Example Continued: P = abcxabcde
– After the shift, P[1..3] lines up with T[k-3..k-1].
– Since it is known that P[1..3] must match T[k-3..k-1], no comparison is needed.
– The scan continues from P(4) & T(k).

• Advantages of KMP Shift Rule
1. P is often shifted by more than 1 character (i - sp′_i).
2. The left-most sp′_i characters in the shifted P are known to match the corresponding characters in T.

KMP Algorithm

Full Example: T = xyabcxabcxadcdqfeg, P = abcxabcde

Assume that we have already shifted past the first two positions in T.

T: xyabcxabcxadcdqfeg
P:   abcxabcde

Comparisons 1 to 7 match (abcxabc); comparison 8 fails: d != x, shift 4 places.

T: xyabcxabcxadcdqfeg
P:       abcxabcde
            ^ start again from position 4

Preprocessing for KMP

Approach: show how to derive sp´ values from Z values.

Definition: Position j > 1 maps to i if i = j + Z_j(P) – 1.
– Recall that Z_j(P) denotes the length of the Z-box starting at position j.

– This says that j maps to i if i is the right end of a Z-box starting at j.

Theorem. For any i > 1, sp′_i(P) = Z_j = i – j + 1, where j > 1 is the smallest position that maps to i. If there is no such j, then sp′_i(P) = 0.

Similarly for sp: for any i > 1, sp_i(P) = i – j + 1, where j, with i >= j > 1, is the smallest position that maps to i or beyond. If there is no such j, then sp_i(P) = 0.

Preprocessing for KMP

Given the theorem from the preceding slide, the sp′_i and sp_i values can be computed in linear time using Z_i values:

For i = 1 to n { sp′_i = 0; }
For j = n downto 2 {
    i = j + Z_j(P) – 1;
    sp′_i = Z_j;
}

sp_n(P) = sp′_n(P);
For i = n - 1 downto 2 {
    sp_i(P) = max[sp_{i+1}(P) - 1, sp′_i(P)];
}
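A direct transcription of these two loops might look like the sketch below. The names `z_values` and `sp_tables` are assumptions; Z values are computed with the standard Z algorithm, and the returned lists are 1-based with index 0 unused. Positions with Z_j = 0 are skipped, which leaves the corresponding entries at their initial value 0.

```python
def z_values(s):
    """Standard linear-time Z algorithm; z[0] is left at 0."""
    n, z, left, right = len(s), [0] * len(s), 0, 0
    for i in range(1, n):
        if i < right:
            z[i] = min(right - i, z[i - left])
        while i + z[i] < n and s[i + z[i]] == s[z[i]]:
            z[i] += 1
        if i + z[i] > right:
            left, right = i, i + z[i]
    return z


def sp_tables(p):
    """Return (sp_prime, sp) as 1-based lists derived from Z values."""
    n = len(p)
    z = z_values(p)                     # z[j - 1] = Z_j
    sp_prime = [0] * (n + 1)
    for j in range(n, 1, -1):           # j = n downto 2
        if z[j - 1] > 0:
            i = j + z[j - 1] - 1        # right end of the Z-box at j
            sp_prime[i] = z[j - 1]      # smaller j overwrites: smallest j wins
    sp = [0] * (n + 1)
    sp[n] = sp_prime[n]
    for i in range(n - 1, 1, -1):       # i = n - 1 downto 2
        sp[i] = max(sp[i + 1] - 1, sp_prime[i])
    return sp_prime, sp


sp_prime, sp = sp_tables("abcxabcde")
print(sp_prime[1:])
print(sp[1:])
```

For P = abcxabcde the only non-zero sp′ entry is sp′_7 = 3, which is the value used in the shift example.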



Defn. Failure function F′(i) = sp′_{i-1} + 1, 1 <= i <= n + 1, with sp′_0 = 0 (similarly F(i) = sp_{i-1} + 1, 1 <= i <= n + 1, with sp_0 = 0).

[Figure: P = abcxabcde matches T at comparisons 1 to 7 and fails at comparison 8 (d != x), giving a shift of 4 places; pointer i marks the current position in P and pointer c the current position in T.]

Shifting is only conceptual; P is never explicitly shifted.

Two special cases:
1. Mismatch at position 1: then F′(1) = 1.
2. Match found: then P shifts by n - sp′_n places, which corresponds to F′(n + 1) = sp′_n + 1.


Defn. Failure function F′(i) = sp′_{i-1} + 1, 1 <= i <= n + 1, with sp′_0 = 0 (similarly F(i) = sp_{i-1} + 1, 1 <= i <= n + 1, with sp_0 = 0).

• Idea:
– We maintain a pointer i in P and c in T.
– After a mismatch at P(i+1) with T(c), shift P to align P(sp′_i + 1) with T(c), i.e., i = sp′_i + 1.
– Special case 1: i = 1; set i = F′(1) = 1 and c = c + 1.
– Special case 2: we find P in T; shift n - sp′_n spaces, i.e., i = F′(n + 1) = sp′_n + 1.

Full KMP Algorithm

Preprocess P to find F′(k) = sp′_{k-1} + 1 for k from 1 to n + 1.

c = 1; p = 1;
While c + (n – p) <= m {
    While P(p) = T(c) and p <= n {
        p = p + 1;
        c = c + 1;
    }
    If (p = n + 1) then report an occurrence of P at position c – n of T.
    if (p = 1) then c = c + 1;
    p = F′(p);
}
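The pseudocode above translates fairly directly into the following sketch. The names `z_values`, `failure_function` and `kmp_search` are hypothetical; sp′ is derived from Z values as on the preprocessing slide, pointers c and p are kept 1-based to mirror the pseudocode, and the bound check p <= n is evaluated before the character comparison to avoid reading past the end of P.

```python
def z_values(s):
    """Standard linear-time Z algorithm; z[0] is left at 0."""
    n, z, left, right = len(s), [0] * len(s), 0, 0
    for i in range(1, n):
        if i < right:
            z[i] = min(right - i, z[i - left])
        while i + z[i] < n and s[i + z[i]] == s[z[i]]:
            z[i] += 1
        if i + z[i] > right:
            left, right = i, i + z[i]
    return z


def failure_function(pattern):
    """F'(k) = sp'_{k-1} + 1 for k = 1 .. n + 1 (1-based list, index 0 unused)."""
    n = len(pattern)
    z = z_values(pattern)
    sp_prime = [0] * (n + 1)            # sp_prime[0] = sp'_0 = 0
    for j in range(n, 1, -1):
        if z[j - 1] > 0:
            sp_prime[j + z[j - 1] - 1] = z[j - 1]
    return [0] + [sp_prime[k - 1] + 1 for k in range(1, n + 2)]


def kmp_search(pattern, text):
    """Return 1-based start positions of all occurrences of pattern in text."""
    n, m = len(pattern), len(text)
    if n == 0:
        return []
    F = failure_function(pattern)
    hits = []
    c = p = 1
    while c + (n - p) <= m:
        while p <= n and pattern[p - 1] == text[c - 1]:
            p += 1
            c += 1
        if p == n + 1:
            hits.append(c - n)
        if p == 1:
            c += 1                      # mismatch at P(1): advance in T
        p = F[p]
    return hits


print(kmp_search("abcxabcde", "xyabcxabcxabcdefeg"))
```

Note that c never moves backwards, which is the source of the linear bound.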

T = xyabcxabcxadcdqfeg
P = abcxabcde

|T| = m, |P| = n; p is the current position in P and c is the current position in T.

Full KMP Algorithm

T: xyabcxabcxabcdefeg
P: abcxabcde
   ^ 1: a != x

p != n + 1, so no occurrence is reported.
p = 1, so c = 2.
p = F′(1) = 1.

Full KMP Algorithm

T: xyabcxabcxabcdefeg
P:  abcxabcde
    ^ 1: a != y

p != n + 1, so no occurrence is reported.
p = 1, so c = 3.
p = F′(1) = 1.

Full KMP Algorithm

T: xyabcxabcxabcdefeg
P:   abcxabcde

Comparisons 1 to 7 match; comparison 8 fails: d != x.

p != n + 1, so no occurrence is reported.
p = 8, so c is not changed.
p = F′(8) = 4.

T: xyabcxabcxabcdefeg
P:       abcxabcde
            ^ 4

The scan continues with p = 4, c = 10.

Full KMP Algorithm

T: xyabcxabcxabcdefeg
P:       abcxabcde

Comparisons at positions 4 through 9 of P all match, so p = n + 1 and an occurrence of P is reported at position c – n of T.

Real-Time KMP

• Q: What is meant by real-time algorithms?
• A: Typically these are algorithms that are meant to interact synchronously in the real world.
– This implies a known fixed turn-around time for processing a task.
– Many embedded scheduling systems are examples involving real-time algorithms.
– For KMP this means that we require a constant time for processing all strings of length n.

Real-Time KMP

• Q: Why is KMP not real-time?
• A: For any mismatched character in T, we may try matching it several times.
– Recall that sp′_i only guarantees that P(i + 1) and P(sp′_i + 1) differ.
– There is NO guarantee that P(sp′_i + 1) and T(k) match.

• We need to ensure that a mismatch at T(k) does NOT entail additional matches at T(k).

• This means that we have to compute sp′_i values with respect to all characters in Σ, since any could appear in T.

Real-Time KMP

• Define: sp′_(i,x)(P) to be the length of the longest proper suffix of P[1..i] that matches a prefix of P, with the added condition that character P(sp′_(i,x) + 1) is x.

• This will tell us exactly what shift to use for each possible mismatch.

• A mismatched character T(k) will never be involved in subsequent comparisons.

Real-Time KMP

• Q: How do we know that the mismatched character T(k) will never be involved in subsequent comparisons?

• A: Because the shift will move P so that either the matching character aligns with T(k) or P will be shifted past T(k).

• This results in a real-time version of KMP.
• Let's consider how we can find the sp′_(i,x)(P) values in linear time.

Real-Time KMP

Thm. For P[i + 1] ≠ x, sp′_(i,x)(P) = i - j + 1.
– Here j is the smallest position such that j maps to i and P(Z_j + 1) = x.
– If there is no such j, then sp′_(i,x)(P) = 0.

For i = 1 to n { sp′_(i,x) = 0 for every character x; }
For j = n downto 2 {
    i = j + Z_j(P) – 1;
    x = P(Z_j + 1);
    sp′_(i,x) = Z_j;
}

Real-Time KMP

• Notice how this works:
– Starting from the right:
• Find i, the right end of the Z-box associated with j.
• Find x, the character immediately following the prefix corresponding to this Z-box.
• Set sp′_(i,x) = Z_j, the length of this Z-box.

For i = 1 to n { sp′_(i,x) = 0 for every character x; }
For j = n downto 2 {
    i = j + Z_j(P) – 1;
    x = P(Z_j + 1);
    sp′_(i,x) = Z_j;
}
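The table of sp′_(i,x) values can be sketched as a list of small dictionaries, one per position i, where a missing key stands for 0. The names `z_values` and `realtime_sp_table` are assumptions for this illustration, and the sparse layout is a design choice of the sketch, not something the slides prescribe.

```python
def z_values(s):
    """Standard linear-time Z algorithm; z[0] is left at 0."""
    n, z, left, right = len(s), [0] * len(s), 0, 0
    for i in range(1, n):
        if i < right:
            z[i] = min(right - i, z[i - left])
        while i + z[i] < n and s[i + z[i]] == s[z[i]]:
            z[i] += 1
        if i + z[i] > right:
            left, right = i, i + z[i]
    return z


def realtime_sp_table(pattern):
    """table[i][x] = sp'_(i,x); missing keys mean 0. Index 0 is unused."""
    n = len(pattern)
    z = z_values(pattern)               # z[j - 1] = Z_j
    table = [dict() for _ in range(n + 1)]
    for j in range(n, 1, -1):           # j = n downto 2
        zj = z[j - 1]
        if zj > 0:                      # Z_j = 0 would only store a 0 entry
            i = j + zj - 1              # right end of the Z-box at j
            x = pattern[zj]             # P(Z_j + 1) in 1-based terms
            table[i][x] = zj            # smaller j overwrites larger j
    return table


table = realtime_sp_table("abcxabcde")
print(table[7])
```

For P = abcxabcde, a mismatch after matching 7 characters against T(k) = x maps to sp′ = 3, so the next comparison starts beyond T(k); any other character maps to 0.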

Reference

• Chapter 1, 2: Exact Matching: Fundamental Preprocessing and First Algorithms
