Presented By Dr. Shazzad Hosain Asst. Prof. EECS, NSU Exact String Matching Algorithms

Upload: sean-fuller

Post on 02-Jan-2016


Page 1: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU

Presented By Dr. Shazzad Hosain

Asst. Prof. EECS, NSU

Exact String Matching Algorithms

Page 2: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU

Classical Comparison Based Methods

• Boyer-Moore Algorithm
• Knuth-Morris-Pratt Algorithm (KMP Algorithm)

Page 3: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU

Boyer-Moore Algorithm

• Basic ideas:
– Ideas carried over from naïve matching:
1. Successively align P and T to check for a match.
2. Shift P to the right on match failure.
– New concepts with respect to the naïve algorithm:
1. Scan from right to left.
2. Special bad character rule.
3. Suffix shift rule.

Page 4: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU

Concept: Right-to-left Scan

• How can we check for a match of pattern P at location i in target T?

• The naïve algorithm scans left-to-right, i.e., it compares T[i+k] with P[1+k] for k = 0 to length(P)-1.

Example: P = adab, T = abaracadabara

T: a b a r a c a d a b a r a
P: a d a b
   ^ 1: a == a
     ^ 2: d != b

Page 5: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU

Concept: Right-to-left Scan

• Alternatively, scan right-to-left, i.e., compare T[i+k] with P[1+k] for k = length(P)-1 down to 0.

Example: P = adab, T = abaracadabara

T: a b a r a c a d a b a r a
P: a d a b
         ^ 1: b != r

Page 6: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU

Concept: Right-to-left Scan

• Why is scanning right-to-left a good idea?
• Answer: by itself, it isn’t any better than left-to-right.
– A naïve approach with right-to-left scanning is also Θ(nm).
– Larger shifts, supported by a clever bad character rule and a suffix shift rule, make it better.
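The naïve right-to-left scan described above can be sketched as follows. This is an illustration only; the function name `naive_right_to_left` and the choice to return 1-based start positions are assumptions, not part of the slides.

```python
def naive_right_to_left(pattern, text):
    """Return 1-based start positions, comparing each alignment from right to left."""
    n, m = len(pattern), len(text)
    hits = []
    for start in range(m - n + 1):       # try every alignment, always shifting by 1
        k = n - 1
        while k >= 0 and text[start + k] == pattern[k]:
            k -= 1                       # move one character to the left
        if k < 0:                        # every character of P matched
            hits.append(start + 1)
    return hits


print(naive_right_to_left("adab", "abaracadabara"))
```

Because the shift is always one position, the worst case is still proportional to n times m comparisons, which is what the two shift rules are meant to improve.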

Page 7: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU

Concept: Bad Character Rule

• Idea: the mismatched character indicates a safe minimum shift.

Example: P = adacara, T = abaracadabara

T: a b a r a c a d a b a r a
P: a d a c a r a
               ^ 1: a == a
             ^ 2: r != c

Here the bad character is c. Perhaps we should shift to align this character with its rightmost occurrence in P?

Page 8: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU

Concept: Bad Character Rule

Shift two positions to align the rightmost occurrence of the mismatched character c in P.

T: a b a r a c a d a b a r a
P: a d a c a r a            (before the shift)
P:     a d a c a r a        (after the shift)

Now, start matching again from right to left.

Page 9: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU

Concept: Bad Character Rule

Second Example: P = adacara, T = abaxaradabara

T: a b a x a r a d a b a r a
P: a d a c a r a
               ^ 1: a == a
             ^ 2: r == r
           ^ 3: a == a
         ^ 4: c != x

Here the bad character is x. The shift should at least align this character with an occurrence of it in P.

But x doesn’t occur in P!

Page 10: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU

Concept: Bad Character Rule

Second Example: P = adacara, T = abaxaradabara

Since x doesn’t occur in P, we can shift past it.

T: a b a x a r a d a b a r a
P: a d a c a r a                (before the shift)
P:         a d a c a r a        (after the shift)

Now, start matching again from right to left.

Page 11: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU


Concept: Bad Character Rule

• The idea of the bad character rule is to shift P by more than one character when possible.

• But the rule gives no useful shift when the rightmost position of the mismatched text character in P is greater than the position of the mismatch.

• Unfortunately, this is often the case.

   12345678901234567
T: spbctbsatpqsctbpq
P:   tpabsat
P:    tpabsat

Page 12: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU

Concept: Bad Character Rule

• We will define a bad character rule that uses the concept of the rightmost occurrence of each letter.

• Let R(x) be the rightmost position of the letter x in P for each letter x in our alphabet.

• If x doesn’t occur in P, define R(x) to be 0.

    1234567
P = adacara

x:    a  b  c  d  ...  z
R(x): 7  0  4  2  ...  0

Page 13: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU

Concept: Bad Character Rule

   12345678901234567
T: spbctbsabpqsctbpq
P:   tpabsab
P:     tpabsab

R(t) = 1, R(s) = 5.
i: the position of the mismatch in P. Here i = 3.
k: the corresponding position in T. Here k = 5 and T[k] = t.

• The bad character rule says P should be shifted right by max{1, i - R(T[k])}, i.e., if the right-most occurrence of character T[k] in P is at position j (j < i), then P[j] should be below T[k] after the shift.

• Otherwise, we shift P by one position, i.e., when R(T[k]) >= i, so that 1 >= i - R(T[k]).

• Obviously this rule is not very useful when R(T[k]) >= i, which is usually the case for DNA sequences.
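As a rough sketch of the rule on this slide, the table R(x) and the shift max{1, i - R(T[k])} could be computed as below. The helper names `rightmost_table` and `bad_character_shift` are made up for this illustration, and positions are 1-based as on the slides.

```python
def rightmost_table(pattern):
    """Map each character x occurring in P to R(x), its rightmost 1-based position."""
    table = {}
    for pos, ch in enumerate(pattern, start=1):
        table[ch] = pos              # later occurrences overwrite earlier ones
    return table


def bad_character_shift(pattern, i, text_char):
    """Shift suggested when P[i] mismatches text_char: max(1, i - R(text_char))."""
    r = rightmost_table(pattern).get(text_char, 0)   # R(x) = 0 if x is not in P
    return max(1, i - r)


print(rightmost_table("adacara"))
print(bad_character_shift("tpabsab", 3, "t"))   # R(t) = 1, so the shift is 3 - 1
print(bad_character_shift("tpabsat", 3, "t"))   # R(t) = 7 > i, so the shift falls back to 1
```

A real implementation would build the table once during preprocessing instead of on every call; it is rebuilt here only to keep the sketch short.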

Page 14: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU

Concept: Extended Bad Character Rule

Extended Bad Character Rule: If P[i] mismatches T[k], shift P along T so that the closest occurrence of the letter T[k] in P to the left of i in P is aligned with T[k].

Example: P = aracara, T = abararadabara

T: a b a r a r a d a b a r a
P: a r a c a r a
               ^ 1: a == a
             ^ 2: r == r
           ^ 3: a == a
         ^ 4: c != r

Position 6 holds the rightmost occurrence of r in P. Notice that i - R(T[k]) < 0, i.e., 4 – 6 < 0.

Position 2 holds the rightmost occurrence of r to the left of i in P. Notice that 4 – 2 > 0, i.e., this gives us a positive shift.

Page 15: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU

Concept: Extended Bad Character Rule

The amount of shift is i – j, where:
– i is the index of the mismatch in P.
– j is the rightmost occurrence of T[k] to the left of i in P.

Example: P = aracara, T = abataradabara

T: a b a t a r a d a b a r a
P: a r a c a r a
               ^ 1: a == a
             ^ 2: r == r
           ^ 3: a == a
         ^ 4: c != t

There is no occurrence of t in P, thus j = 0. Notice that i – j = 4, i.e., this gives us a positive shift past the point of mismatch.

Page 16: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU

Concept: Extended Bad Character Rule

• How do we implement this rule?
• We preprocess P (from right to left), recording the position of each occurrence of the letters.
• For each character x in Σ, the alphabet, create a list of its occurrences in P. If x doesn’t occur in P, then it has an empty list.

Page 17: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU

Concept: Extended Bad Character Rule

Example: Σ = {a, b, c, d, r, t}, P = abataradabara
• a_list = <13, 11, 9, 7, 5, 3, 1>, since ‘a’ occurs at these positions in P
• b_list = <10, 2>
• c_list = Ø
• d_list = <8>
• r_list = <12, 6>
• t_list = <4>
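The occurrence lists above, and the lookup they support, might be sketched like this. The names `occurrence_lists` and `extended_bad_character_shift` are illustrative assumptions; the lists are stored rightmost first, as on the slide.

```python
def occurrence_lists(pattern):
    """For each character, list its 1-based positions in P, rightmost first."""
    lists = {}
    for pos in range(len(pattern), 0, -1):          # scan P from right to left
        lists.setdefault(pattern[pos - 1], []).append(pos)
    return lists


def extended_bad_character_shift(lists, i, text_char):
    """Return i - j, where j is the closest occurrence of text_char left of i (j = 0 if none)."""
    for pos in lists.get(text_char, []):            # characters not in P have an empty list
        if pos < i:
            return i - pos
    return i                                        # j = 0: shift past the mismatched character


lists = occurrence_lists("aracara")
print(extended_bad_character_shift(lists, 4, "r"))  # closest r left of position 4 is at 2
print(extended_bad_character_shift(lists, 4, "t"))  # t does not occur in P
```

Scanning a list costs at most as many steps as there are characters to the right of the mismatch, which the algorithm has already compared anyway.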

Page 18: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU

Concept: Suffix Shift Rule

• Recall that we investigated finding prefixes before.
• Since we are matching P to T from right-to-left, we will instead need to use suffixes.

Page 19: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU

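Suffix Shift Rule

• t is a suffix of P that matches a substring t of T, and x ≠ y, where x is the character of T and y the character of P immediately to the left of t.
• t’ is the right-most copy of t in P such that t’ is not a suffix of P and z ≠ y, where z is the character of P immediately to the left of t’.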

Page 20: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU

Concept: Suffix Shift Rule

• Consider the partial right-to-left matching of P to T below.

• This partial match involves α, a suffix of P.

P: ...adbadbaddog
T: ...axbadbaddog.....

Page 21: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU

Concept: Suffix Shift Rule

• This partial match ends where the first mismatch occurs, where x is aligned with d.

P: ...adbadbaddog
T: ...axbadbaddog.....

Page 22: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU

Concept: Suffix Shift Rule

We want to find a right-most copy α´ of this substring α in P such that:

1. α´ is not a suffix of P, and
2. the character to the left of α´ is not the same as the character to the left of α.

P: ...gbadbaddoghorseadbadbaddog
T: ..................axbadbaddog.....

Page 23: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU

Concept: Suffix Shift Rule

1. If α´ exists, shift P to the right such that α´ is now aligned with the substring in T that was previously aligned with α.

T:       ................xbadbaddog.....
P:       ...gbadbaddogcatdbadbaddog
P after:              ...gbadbaddogcatdbadbaddog

Page 24: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU

Concept: Suffix Shift Rule

2. If α´ doesn’t exist, shift P right by the least amount such that a prefix of P is aligned with a suffix of α in T.

T:       .........xbadbaddog.....
P:       dogcatratdbadbaddog
P after:                 dogcatratdbadbaddog

Page 25: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU

Concept: Suffix Shift Rule

3. If α´ doesn’t exist, and there is no prefix of P that matches a suffix of α in T, shift P right by n positions.

T:       .........xbadbaddog.....
P:       batcatratdbadbaddog
P after:                    batcatratdbadbaddog

Page 26: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU

Preprocessing for the good suffix rule

• Let L(i) denote the largest position less than n s.t. P[i..n] matches a suffix of P[1..L(i)].

• If there is no such position, then L(i) = 0.

• Example 1: If i = 17 then L(i) = 9.

P: batcatdogdbadbaddog   (P[17..19] = dog; its right-most earlier copy ends at position 9)

• Example 2: If i = 16 then L(i) = 0.

P: batcatdogdbadbaddog   (P[16..19] = ddog has no earlier copy)

Page 27: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU

Concept: Suffix Shift Rule

• Let L´(i) denote the largest position less than n s.t. P[i..n] matches a suffix of P[1..L´(i)] and s.t. the character preceding that suffix is not equal to P(i-1).

• If there is no such position, then L´(i) = 0.

• Example 1: If i = 20 then L(i) = 12 and L´(i) = 6.

P: slydogsaddogdbadbaddog   (P[20..22] = dog; earlier copies end at positions 12 and 6)

Page 28: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU

Concept: Suffix Shift Rule

• Example 2: If i = 19 then L(i) = 12 and L´(i) = 0.

P: slydogsaddogdbadbaddog   (P[19..22] = ddog; its only earlier copy ends at position 12 and is preceded by the same character, a)

Page 29: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU

Concept: Suffix Shift Rule

• Notice that L(i) indicates the right-most copy of P[i..n] that is not a suffix of P.

• In contrast, L´(i) indicates the right-most copy of P[i..n] that is not a suffix of P and whose preceding character doesn’t match P(i-1).

• The relation between L´(i) and L(i) is analogous to the relation between α´ and α.

P: slydogsaddogdbadbaddog   (i = 20, L(20) = 12, L´(20) = 6)

Page 30: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU

Concept: Suffix Shift Rule

• Q: What is the point?
• A: If P(i - 1) causes the mismatch and L´(i) > 0, then we can shift P right by n - L´(i) positions. Example (n = 23, i = 15, L´(15) = 10, so the shift is 13):

T:       ................xbadbaddog.....
P:       ...gbadbaddogcatdbadbaddog
P after:              ...gbadbaddogcatdbadbaddog

Page 31: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU

Concept: Suffix Shift Rule

• If L(i) and L´(i) are different, then obviously shifting by n - L´(i) positions is a greater shift than n - L(i).

• Example (n = 31, i = 26, L(i) = 18, L´(i) = 9, so the shift is 22):

T:       .....................xbaxbaddog.....
P:       slybaddogbadbaddogcatdbadbaddog
P after:                       slybaddogbadbaddogcatdbadbaddog

Page 32: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU

Concept: Suffix Shift Rule

• Let Nj(P) denote the length of the longest suffix of P[1..j] that is also a suffix of P.

• Example 1: N6(P) = 3 and N12(P) = 5.

P: slydogsaddogdbadbaddog

• Example 2: N3(P) = 2, N9(P) = 3, N15(P) = 5, N19(P) = 0.

P: hogslydogsaddogdbadbaddog

Page 33: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU

Concept: Suffix Shift Rule

• Q: How are the concepts of Ni and Zi related?
• Recall that Zi = length of a maximal substring starting at position i, which is a prefix of P.

• In contrast, Ni = length of a maximal substring ending at position i, which is a suffix of P.

• In the case of Boyer-Moore, we are naturally interested in suffixes since we are scanning right-to-left.

Page 34: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU

Concept: Suffix Shift Rule

• Let P^r denote the mirror image of P; then the relationship can be expressed as N_j(P) = Z_{n-j+1}(P^r).

• In words, the length of the substring matching a suffix at position j in P is equal to the length of the corresponding substring matching a prefix in the reverse of P.

• Q: Why must this be true?
• A: Because they are the same substring, except that one is the reverse of the other.

Page 35: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU

Concept: Suffix Shift Rule

• Since N_j(P) = Z_{n-j+1}(P^r), we can use the Z algorithm to compute N in O(n).

• Q: How do we do this?
• A: We create P^r, the reverse of P, and process it with the Z algorithm.
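The reversal trick can be sketched in a few lines. The Z computation below is a standard linear-time formulation, not code from the slides, and the names `z_array` and `n_array` are illustrative. As in the slide examples, the Z value at the first position (and hence N at position n) is left at 0.

```python
def z_array(s):
    """z[i] = length of the longest substring starting at 0-based index i that is a prefix of s."""
    n = len(s)
    z = [0] * n
    left = right = 0                      # current Z-box covers s[left:right]
    for i in range(1, n):
        if i < right:                     # reuse information from inside the Z-box
            z[i] = min(right - i, z[i - left])
        while i + z[i] < n and s[z[i]] == s[i + z[i]]:
            z[i] += 1                     # extend by explicit comparisons
        if i + z[i] > right:
            left, right = i, i + z[i]
    return z


def n_array(pattern):
    """Return [N_1(P), ..., N_n(P)] using N_j(P) = Z_{n-j+1}(P^r)."""
    return z_array(pattern[::-1])[::-1]


print(n_array("qcabdabdab"))
```

Reversing the Z list of the reversed pattern applies the index change j to n - j + 1 in one step.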

Page 36: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU

Concept: Suffix Shift Rule

N is the reverse of Z!

P: the pattern

P^r: the string obtained by reversing P

Then N_j(P) = Z_{n-j+1}(P^r).

j, i:  1 2 3 4 5 6 7 8 9 10
P:     q c a b d a b d a b
N_j:   0 0 0 2 0 0 5 0 0 0
P^r:   b a d b a d b a c q
Z_i:   0 0 0 5 0 0 2 0 0 0

Page 37: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU

Concept: Suffix Shift Rule

For pattern P, Nj (for j = 1, …, n) can be calculated in O(n) using the Z algorithm.

Why do we need to define Nj?

To use the strong good suffix rule, we need to find L’(i) for every i = 1, …, n.

We can get L’(i) from Nj!

[Figure: T contains x followed by t. P contains y followed by t at positions i..n, and z followed by the copy t’ that ends at position L’(i).]

Page 38: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU

Concept: Suffix Shift Rule

• We can then find L´(i) and L(i) values from N values in linear time with the following:

For i = 1 to n { L´(i) = 0; }
For j = 1 to n – 1 {
    i = n - Nj(P) + 1;    // when Nj(P) = 0 this is the unused entry n + 1
    L´(i) = j;
}

// L values (if desired) can be obtained
L(2) = L´(2);
For i = 3 to n { L(i) = max(L(i - 1), L´(i)); }

Page 39: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU

Concept: Suffix Shift Rule

• Example: P = asdbasasas, n = 10
• Values of Nj(P) for j = 1..9: 0, 2, 0, 0, 0, 2, 0, 4, 0
• Computed values of i: 11, 9, 11, 11, 11, 9, 11, 7, 11
• Values of L´(i) for i = 1..9: 0, 0, 0, 0, 0, 0, 8, 0, 6

For i = 1 to n { L´(i) = 0; }
For j = 1 to n – 1 {
    i = n - Nj(P) + 1;
    L´(i) = j;
}
L(2) = L´(2);
For i = 3 to n { L(i) = max(L(i - 1), L´(i)); }
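The two loops translate almost directly into code. In this sketch the N values are passed in as a plain list (here typed in from the example above), and the function name `good_suffix_tables` is an assumption made for illustration.

```python
def good_suffix_tables(n_vals):
    """Return ([L'(1..n)], [L(1..n)]) from n_vals, where n_vals[j-1] = N_j(P)."""
    n = len(n_vals)
    l_prime = [0] * (n + 2)            # 1-based; slot n + 1 absorbs the entries with N_j = 0
    for j in range(1, n):              # j = 1 .. n - 1
        i = n - n_vals[j - 1] + 1
        l_prime[i] = j                 # larger j overwrites smaller j, keeping the right-most copy
    l_plain = [0] * (n + 2)
    if n >= 2:
        l_plain[2] = l_prime[2]
    for i in range(3, n + 1):
        l_plain[i] = max(l_plain[i - 1], l_prime[i])
    return l_prime[1:n + 1], l_plain[1:n + 1]


# N values of P = asdbasasas, with N_10 stored as 0.
l_prime, l_plain = good_suffix_tables([0, 2, 0, 0, 0, 2, 0, 4, 0, 0])
print(l_prime)
print(l_plain)
```

Allocating one extra slot avoids a special case for the positions where Nj(P) = 0, mirroring the computed value i = 11 in the example.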

Page 40: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU

Concept: Suffix Shift Rule

• Let l´(i) denote the length of the longest suffix of P[i..n] that is also a prefix of P. Let l´(i) = 0 if no such suffix exists.

Example: P = asasbsasas
l’(1) = 10 (P itself), l’(2) = 4, l’(3) = 4, l’(4) = 4, l’(5) = 4, l’(6) = 4, l’(7) = 4, l’(8) = 2, l’(9) = 2, l’(10) = 0

Page 41: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU

Concept: Suffix Shift Rule

• Thm: l´(i) = largest j <= n – i + 1 s.t. Nj(P) = j.
• Q: How can we compute l´(i) values in linear time?
• A: This is problem #9 in Chapter 2. This would make an interesting homework problem.
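One possible way to use the theorem in linear time is a single left-to-right pass over the N values that remembers the largest j seen so far with Nj(P) = j. This is only a sketch of one approach; the name `small_l_prime` and the convention of treating N_n(P) as n are assumptions made here.

```python
def small_l_prime(n_vals):
    """Return [l'(1), ..., l'(n)] using l'(i) = largest j <= n - i + 1 with N_j(P) = j."""
    n = len(n_vals)
    result = [0] * n
    best = 0                               # largest j so far with N_j(P) = j
    for j in range(1, n + 1):
        # Assumption: N_n(P) counts as n, since P is trivially a suffix of itself.
        if n_vals[j - 1] == j or j == n:
            best = j
        result[n - j] = best               # this is l'(n - j + 1), stored 0-based
    return result


# N values of P = golgol: only N_3 = 3 is nonzero.
print(small_l_prime([0, 0, 3, 0, 0, 0]))
```

Each position j is visited once, so the pass is linear in n once the N values are known.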


Page 42: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU

Boyer-Moore Algorithm

Preprocessing:
Compute L´(i) and l´(i) for each position i in P.
Compute R(x), the right-most occurrence of x in P, for each character x in Σ.

Search:
k = n;
While k <= m {
    i = n; h = k;
    While i > 0 and P(i) = T(h) {
        i = i – 1; h = h – 1;
    }
    if i = 0 {
        report occurrence of P in T ending at position k (starting at k – n + 1).
        k = k + n - l´(2);
    }
    else Shift P (increase k) by the max amount indicated by the extended bad character rule and the good suffix rule.
}
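A compact sketch of this search loop follows, under several stated assumptions: it uses the simple bad character table R(x) instead of the extended rule, computes Z values with a short quadratic helper to save space, takes the good suffix shift to be 1 when the very first comparison fails, and reports 1-based start positions. The names `boyer_moore` and `z_values` are illustrative and do not come from the slides.

```python
def z_values(s):
    """Z values by direct comparison (quadratic; a linear Z algorithm could replace this)."""
    z = [0] * len(s)
    for i in range(1, len(s)):
        while i + z[i] < len(s) and s[z[i]] == s[i + z[i]]:
            z[i] += 1
    return z


def boyer_moore(pattern, text):
    """Return the 1-based start positions of all occurrences of pattern in text."""
    n, m = len(pattern), len(text)
    if n == 0 or n > m:
        return []
    big_n = [0] + z_values(pattern[::-1])[::-1]      # big_n[j] = N_j(P), 1-based
    l_prime = [0] * (n + 2)                          # L'(i); slot n + 1 is a dummy
    for j in range(1, n):
        l_prime[n - big_n[j] + 1] = j
    small_l = [0] * (n + 2)                          # l'(i)
    best = 0
    for j in range(1, n + 1):
        if big_n[j] == j or j == n:
            best = j
        small_l[n - j + 1] = best
    rightmost = {ch: pos for pos, ch in enumerate(pattern, start=1)}  # R(x)

    hits = []
    k = n                                            # text position under the right end of P
    while k <= m:
        i, h = n, k
        while i > 0 and pattern[i - 1] == text[h - 1]:
            i -= 1
            h -= 1
        if i == 0:
            hits.append(k - n + 1)
            k += n - small_l[2]
        else:
            bad = max(1, i - rightmost.get(text[h - 1], 0))
            if i == n:
                good = 1                             # nothing matched yet
            elif l_prime[i + 1] > 0:
                good = n - l_prime[i + 1]
            else:
                good = n - small_l[i + 1]
            k += max(bad, good)
    return hits


print(boyer_moore("golgol", "lolgolgol"))
```

Both shift amounts are always at least 1, so the loop terminates; taking their maximum is safe because each rule on its own never skips an occurrence.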

Page 43: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU

Boyer-Moore Algorithm

Example: P = golgol
Preprocessing: Compute L´(i) and l´(i) for each position i in P.

Notice that first we need Nj(P) values in order to compute L´(i) and l´(i) for each position i in P.

For i = 1 to n { L´(i) = 0; }
For j = 1 to n – 1 {
    i = n - Nj(P) + 1;
    L´(i) = j;
}

Page 44: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU

Boyer-Moore Algorithm

Example: P = golgol
Recall that Nj(P) is the length of the longest suffix of P[1..j] that is also a suffix of P.

N1(P) = 0, there is no suffix of P that ends with g

N2(P) = 0, there is no suffix of P that ends with o

N3(P) = 3, there is a suffix of P that ends with l

N4(P) = 0, there is no suffix of P that ends with g

N5(P) = 0, there is no suffix of P that ends with o

N1(P) = N2(P) = N4(P) = N5(P) = 0 and N3(P) = 3

Page 45: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU

Boyer-Moore Algorithm

Preprocessing: P = golgol, n = 6

N1(P) = N2(P) = N4(P) = N5(P) = 0 and N3(P) = 3
Compute L´(i) and l´(i) for each position i in P.

For i = 1 to n { L´(i) = 0; }
For j = 1 to n – 1 {
    i = n - Nj(P) + 1;
    L´(i) = j;
}

j = 1: i = 7, therefore L´(7) = 1
j = 2: i = 7, therefore L´(7) = 2
j = 3: i = 4, therefore L´(4) = 3
j = 4: i = 7, therefore L´(7) = 4
j = 5: i = 7, therefore L´(7) = 5

L´(1) = L´(2) = L´(3) = L´(5) = L´(6) = 0 and L´(4) = 3 (position 7 lies outside P, so L´(7) is never used)

Page 46: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU

Boyer-Moore Algorithm

Preprocessing: P = golgol, n = 6
N1(P) = N2(P) = N4(P) = N5(P) = 0 and N3(P) = 3
L´(1) = L´(2) = L´(3) = L´(5) = 0 and L´(4) = 3

Compute l´(i) for each position i in P.
Recall that l´(i) is the length of the longest suffix of P[i..n] that is also a prefix of P.

l´(1) = 6 since golgol is the longest suffix of P[1..n] that is a prefix of P.
l´(2) = 3 since gol is the longest suffix of P[2..n] that is a prefix of P.
l´(3) = 3 since gol is the longest suffix of P[3..n] that is a prefix of P.
l´(4) = 3 since gol is the longest suffix of P[4..n] that is a prefix of P.
l´(5) = 0 since there is no suffix of P[5..n] that is a prefix of P.
l´(6) = 0 since there is no suffix of P[6..n] that is a prefix of P.

l´(1) = 6, l´(2) = l´(3) = l´(4) = 3 and l´(5) = l´(6) = 0

Page 47: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU

Boyer-Moore Algorithm

Preprocessing: P = golgol, n = 6
N1(P) = N2(P) = N4(P) = N5(P) = 0 and N3(P) = 3
L´(1) = L´(2) = L´(3) = L´(5) = 0 and L´(4) = 3
l´(1) = 6, l´(2) = l´(3) = l´(4) = 3 and l´(5) = l´(6) = 0

Compute the list R(x), the right-most occurrences of x in P, for each character x in Σ = {g, o, l}.

R(g) = <4, 1>

R(o) = <5, 2>

R(l) = <6, 3>

Page 48: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU

Boyer-Moore Algorithm

Preprocessing: P = golgol, n = 6, T = lolgolgol, m = 9
L´(1) = L´(2) = L´(3) = L´(5) = 0 and L´(4) = 3
l´(1) = 6, l´(2) = l´(3) = l´(4) = 3 and l´(5) = l´(6) = 0
R(g) = <4, 1>, R(o) = <5, 2>, R(l) = <6, 3>


Page 49: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU

Search

T: lolgolgol
P: golgol        (k = 6)

i = 6, h = 6: l == l
i = 5, h = 5: o == o
i = 4, h = 4: g == g
i = 3, h = 3: l == l
i = 2, h = 2: o == o
i = 1, h = 1: P(1) != T(1)

Bad Character Rule: there is no occurrence of l, the mismatched character in T, to the left of P(1). This suggests shifting only 1 place.

Good Suffix Rule: The mismatch is at i = 1, so the matched suffix starts at position 2. Since L´(2) = 0 and l´(2) = 3, shift P by n - l´(2) places, i.e., 6 - 3 = 3 places. Thus k = k + 3 = 9.

k = 6;
While k <= 9 {
    i = 6; h = k;
    While i > 0 and P(i) = T(h) {
        i = i – 1; h = h – 1;
    }
    if i = 0 {
        report occurrence of P in T ending at position k.
        k = k + 6 - l´(2);
    }
    else Shift P (increase k) by the max amount indicated by the extended bad character rule and the good suffix rule.
}

Page 50: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU

Search

T: lolgolgol
P:    golgol     (k = 9)

i = 6, h = 9: l == l
i = 5, h = 8: o == o
i = 4, h = 7: g == g
i = 3, h = 6: l == l
i = 2, h = 5: o == o
i = 1, h = 4: g == g
i = 0, h = 3: the inner loop ends

i = 0, report occurrence of P in T at position 4, k = k + 6 - l´(2) = 9 + 6 - 3 = 12

T: lolgolgol
P:       golgol  (k = 12 > m = 9, we are done!)


Page 51: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU

Homework 1: Due Next Week
• Implement the Boyer-Moore Algorithm

Page 52: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU

Break

Page 53: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU

KMP Algorithm

• Preliminaries:
– KMP can be easily explained in terms of finite state machines.
– KMP has an easily proved linear bound.
– KMP is usually not the method of choice.

Page 54: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU

KMP Algorithm

• Recall that the naïve approach to string matching is Θ(mn).

• How can we reduce this complexity?
– Avoid redundant comparisons
– Use larger shifts
• Boyer-Moore good suffix rule
• Boyer-Moore extended bad character rule

Page 55: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU

KMP Algorithm

• KMP finds larger shifts by recognizing patterns in P.
– Let spi(P) denote the length of the longest proper suffix of P[1..i] that matches a prefix of P.
– By definition sp1 = 0 for any string.
– Q: Why does this make sense?
– A: The only proper suffix of a one-character string is the empty string.

Page 56: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU

KMP Algorithm

• Example: P = abcaeabcabd
– P[1..2] = ab hence sp2 = ?

– sp2 = 0

– P[1..3] = abc hence sp3 = ?

– sp3 = 0

– P[1..4] = abca hence sp4 = ?

– sp4 = 1

– P[1..5] = abcae hence sp5 = ?

– sp5 = 0

– P[1..6] = abcaea hence sp6 = ?

– sp6 = 1

Page 57: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU

KMP Algorithm

• Example Continued
– P[1..7] = abcaeab hence sp7 = ?

– sp7 = 2

– P[1..8] = abcaeabc hence sp8 = ?

– sp8 = 3

– P[1..9] = abcaeabca hence sp9 = ?

– sp9 = 4

– P[1..10] = abcaeabcab hence sp10 = ?

– sp10 = 2

– P[1..11] = abcaeabcabd hence sp11 = ?

– sp11 = 0
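The sp values in this example can be checked with the classic prefix-function recurrence. Note that this is a different, widely used technique from the Z-based derivation the slides present later; it is swapped in here only because it is short. The name `sp_values` is illustrative.

```python
def sp_values(pattern):
    """Return [sp_1, ..., sp_n]: sp_i is the longest proper suffix of P[1..i] that is a prefix of P."""
    sp = [0] * len(pattern)
    k = 0                                  # length of the prefix currently matched
    for i in range(1, len(pattern)):       # 0-based index i corresponds to position i + 1
        while k > 0 and pattern[i] != pattern[k]:
            k = sp[k - 1]                  # fall back to the next shorter border
        if pattern[i] == pattern[k]:
            k += 1
        sp[i] = k
    return sp


print(sp_values("abcaeabcabd"))
```

The printed list should agree with the values sp1 through sp11 worked out on the two example slides.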

Page 58: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU

KMP Algorithm
• Like the α/α´ concept for Boyer-Moore, there is an analogous spi/sp´i concept.
• Let sp´i(P) denote the length of the longest proper suffix of P[1..i] that matches a prefix of P, with the added condition that characters P(i + 1) and P(sp´i + 1) are unequal.

• Example: P = abcdabce, sp´7 = 3

Obviously sp´i(P) <= spi(P), since the latter is less restrictive.

Page 59: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU

KMP Algorithm
• KMP Shift Rule:
1. Mismatch case:
• Let position i+1 in P and position k in T be the first mismatch in a left-to-right scan.
• Shift P to the right, aligning P[1..sp´i] with T[k-sp´i..k-1].
2. Match case:
• If no mismatch is found, an occurrence of P has been found.
• Shift P by n – sp´n spaces to continue searching for other occurrences.

Page 60: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU

KMP Algorithm

• Observations:
– The prefix P[1..sp´i] of the shifted P is shifted to match the corresponding substring in T.
– Subsequent character matching proceeds from position sp´i + 1.
– Unlike Boyer-Moore, the matched substring is not compared again.
– The shift rule based on sp´i guarantees that the exact same mismatch won’t occur at sp´i + 1 but doesn’t guarantee that P(sp´i+1) = T(k).

Page 61: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU

KMP Algorithm

• Example: P = abcxabcde
– If a mismatch occurs at position 8, P will be shifted 4 positions to the right.
– Q: Where did the 4 position shift come from?
– A: The number of positions is given by i - sp´i; in this example i = 7, sp´7 = 3, 7 – 3 = 4.
– Notice that we know the amount of shift without knowing anything about T other than there was a mismatch at position 8.

Page 62: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU

KMP Algorithm

• Example Continued: P = abcxabcde
– After the shift, P[1..3] lines up with T[k-3..k-1].
– Since it is known that P[1..3] must match T[k-3..k-1], no comparison is needed.
– The scan continues from P(4) & T(k).

• Advantages of KMP Shift Rule
1. P is often shifted by more than 1 character (i - sp´i).
2. The left-most sp´i characters in the shifted P are known to match the corresponding characters in T.

Page 63: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU

KMP Algorithm

Full Example: T = xyabcxabcxadcdqfeg, P = abcxabcde
Assume that we have already shifted past the first two positions in T.

T: xyabcxabcxadcdqfeg
P:   abcxabcde
     ^^^^^^^ comparisons 1 to 7 match
            ^ 8: d != x, shift 4 places
P:       abcxabcde
            ^ 1: start again from position 4

Page 64: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU

Preprocessing for KMP

Approach: show how to derive sp´ values from Z values.

Definition: Position j > 1 maps to i if i = j + Zj(P) – 1.
– Recall that Zj(P) denotes the length of the Z-box starting at position j.
– This says that j maps to i if i is the right end of a Z-box starting at j.

Page 65: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU

Preprocessing for KMP

Theorem. For any i > 1, sp´i(P) = Zj = i – j + 1, where j > 1 is the smallest position that maps to i. If there is no such j, then sp´i(P) = 0.

Similarly for sp: For any i > 1, spi(P) = i – j + 1, where j, with i >= j > 1, is the smallest position that maps to i or beyond. If there is no such j, then spi(P) = 0.

Definition: Position j > 1 maps to i if i = j + Zj(P) – 1.

Page 66: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU

Preprocessing for KMP
Given the theorem from the preceding slide, the sp´i and spi values can be computed in linear time using Zi values:

For i = 1 to n { sp´i = 0; }
For j = n downto 2 {
    i = j + Zj(P) – 1;
    sp´i = Zj;
}

spn(P) = sp´n(P);
For i = n - 1 downto 2 {
    sp_i(P) = max[sp_{i+1}(P) - 1, sp´_i(P)];
}

Page 67: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU

Preprocessing for KMP

Defn. Failure function F´(i) = sp´_{i-1} + 1, 1 <= i <= n + 1, with sp´_0 = 0 (similarly F(i) = sp_{i-1} + 1, 1 <= i <= n + 1, with sp_0 = 0).

T: xyabcxabcxadcdqfeg
P:   abcxabcde
     ^^^^^^^ comparisons 1 to 7 match
            ^ 8: d != x, shift 4 places

Shifting is only conceptual and P is never explicitly shifted: a pointer c moves through T and a pointer i moves through P.

Two special cases:
1. Mismatch at position 1: then F’(1) = 1.
2. Match found: then P shifts by n - sp’n places, which is F’(n+1) = sp’n + 1.

Page 68: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU

Preprocessing for KMP

Defn. Failure function F´(i) = sp´_{i-1} + 1, 1 <= i <= n + 1, with sp´_0 = 0 (similarly F(i) = sp_{i-1} + 1, 1 <= i <= n + 1, with sp_0 = 0).
• Idea:
– We maintain a pointer i in P and c in T.
– After a mismatch at P(i+1) with T(c), shift P to align P(sp´i + 1) with T(c), i.e., i = sp´i + 1.
– Special case 1: i = 1, set i = F´(1) = 1 and c = c + 1.
– Special case 2: we find P in T, shift n - sp´n spaces, i.e., i = F´(n + 1) = sp´n + 1.

Page 69: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU

Full KMP Algorithm
Preprocess P to find F´(k) = sp´_{k-1} + 1 for k from 1 to n + 1.

c = 1; p = 1;
While c + (n – p) <= m {
    While P(p) = T(c) and p <= n {
        p = p + 1;
        c = c + 1;
    }
    If (p = n + 1) then report an occurrence of P at position c – n of T.
    if (p = 1) then c = c + 1;
    p = F´(p);
}

T = xyabcxabcxadcdqfeg, |T| = m
P = abcxabcde, |P| = n
(p is the pointer into P; c is the pointer into T.)
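The algorithm on this slide might be rendered in code as follows. This is a sketch under assumptions: the sp´ values are derived from Z values as on the preprocessing slides, the Z helper is a short quadratic version rather than the linear Z algorithm, and the names `kmp_search` and `z_values` are invented here.

```python
def z_values(s):
    """Z values by direct comparison (quadratic; a linear Z algorithm could replace this)."""
    z = [0] * len(s)
    for i in range(1, len(s)):
        while i + z[i] < len(s) and s[z[i]] == s[i + z[i]]:
            z[i] += 1
    return z


def kmp_search(pattern, text):
    """Return the 1-based start positions of all occurrences of pattern in text."""
    n, m = len(pattern), len(text)
    if n == 0:
        return []
    z = z_values(pattern)                       # z[j-1] = Z_j(P)
    sp_prime = [0] * (n + 1)                    # sp_prime[i] = sp'_i, with sp'_0 = 0
    for j in range(n, 1, -1):                   # j = n downto 2
        sp_prime[j + z[j - 1] - 1] = z[j - 1]   # j maps to i = j + Z_j - 1
    failure = [0] + [sp_prime[k - 1] + 1 for k in range(1, n + 2)]  # failure[k] = F'(k)

    hits = []
    c = p = 1                                   # pointers into T and P (1-based)
    while c + (n - p) <= m:
        while p <= n and pattern[p - 1] == text[c - 1]:
            p += 1
            c += 1
        if p == n + 1:
            hits.append(c - n)
        if p == 1:
            c += 1                              # mismatch at P(1): move on in T
        p = failure[p]
    return hits


print(kmp_search("abcxabcde", "xyabcxabcxabcdefeg"))
```

Testing `p <= n` before comparing characters keeps the index in range; the slide writes the two conditions in the opposite order.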

Page 70: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU

Full KMP Algorithm

T: xyabcxabcxabcdefeg
P: abcxabcde
   ^ 1: a != x

p != n+1

p = 1, so c = 2

p = F’(1) = 1


Page 71: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU

Full KMP Algorithm

T: xyabcxabcxabcdefeg
P:  abcxabcde
    ^ 1: a != y

p != n+1

p = 1, so c = 3

p = F’(1) = 1


Page 72: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU

Full KMP Algorithm

T: xyabcxabcxabcdefeg
P:   abcxabcde
     ^^^^^^^ comparisons 1 to 7 match
            ^ 8: d != x

p != n+1

p = 8, so don’t change c

p = F´(8) = 4


Page 73: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU

Full KMP Algorithm

p = 4, c = 10

T: xyabcxabcxabcdefeg
P:       abcxabcde
            ^^^^^^ positions 4 to 9 of P match

p = n+1: report an occurrence of P at position c – n = 16 – 9 = 7 of T


Page 74: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU

Real-Time KMP

• Q: What is meant by real-time algorithms?
• A: Typically these are algorithms that are meant to interact synchronously in the real world.
– This implies a known fixed turn-around time for processing a task.
– Many embedded scheduling systems are examples involving real-time algorithms.
– For KMP this means that we require a constant time for processing all strings of length n.

Page 75: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU

Real-Time KMP

• Q: Why is KMP not real-time?
• A: For any mismatched character in T, we may try matching it several times.
– Recall that sp´i only guarantees that P(i + 1) and P(sp´i + 1) differ.
– There is NO guarantee that P(sp´i + 1) and T(k) match.

• We need to ensure that a mismatch at T(k) does NOT entail additional matches at T(k).

• This means that we have to compute sp´i values with respect to all characters in Σ since any could appear in T.

Page 76: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU

Real-Time KMP

• Define: sp´(i,x)(P) to be the length of the longest proper suffix of P[1..i] that matches a prefix of P, with the added condition that character P(sp´i + 1) is x.

• This will tell us exactly what shift to use for each possible mismatch.

• A mismatched character T(k) will never be involved in subsequent comparisons.

Page 77: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU

Real-Time KMP

• Q: How do we know that the mismatched character T(k) will never be involved in subsequent comparisons?

• A: Because the shift will move P so that either the matching character aligns with T(k) or P is shifted past T(k).

• This results in a real-time version of KMP.
• Let’s consider how we can find the sp´(i,x)(P) values in linear time.

Page 78: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU

Real-Time KMP

Thm. For P[i + 1] ≠ x, sp´(i,x)(P) = i - j + 1.
– Here j is the smallest position such that j maps to i and P(Zj + 1) = x.
– If there is no such j, then sp´(i,x)(P) = 0.

For i = 1 to n { sp´(i,x) = 0 for every character x; }
For j = n downto 2 {
    i = j + Zj(P) – 1;
    x = P(Zj + 1);
    sp´(i,x) = Zj;
}

Page 79: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU

Real-Time KMP

• Notice how this works:
– Starting from the right:
• Find i, the right end of the Z box associated with j.
• Find x, the character immediately following the prefix corresponding to this Z box.
• Set sp´(i,x) = Zj, the length of this Z box.

For i = 1 to n { sp´(i,x) = 0 for every character x; }
For j = n downto 2 {
    i = j + Zj(P) – 1;
    x = P(Zj + 1);
    sp´(i,x) = Zj;
}
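A small sketch of this loop is given below. It stores only the nonzero entries in a dictionary keyed by (i, x), which is a representation chosen here for brevity rather than anything stated on the slides; the Z helper is again a quadratic stand-in, and the names are illustrative.

```python
def z_values(s):
    """Z values by direct comparison (quadratic; a linear Z algorithm could replace this)."""
    z = [0] * len(s)
    for i in range(1, len(s)):
        while i + z[i] < len(s) and s[z[i]] == s[i + z[i]]:
            z[i] += 1
    return z


def sp_prime_by_char(pattern):
    """Return {(i, x): sp'_(i,x)(P)} for the nonzero entries, with 1-based i."""
    n = len(pattern)
    z = z_values(pattern)
    table = {}
    for j in range(n, 1, -1):          # j = n downto 2, so smaller j overwrites larger j
        zj = z[j - 1]
        if zj > 0:
            i = j + zj - 1             # right end of the Z-box starting at j
            x = pattern[zj]            # P(Z_j + 1) in the slides' 1-based notation
            table[(i, x)] = zj
    return table


print(sp_prime_by_char("abcxabcde"))
```

Entries missing from the dictionary correspond to sp´(i,x) = 0, so a lookup such as `table.get((i, x), 0)` recovers the full table without storing one row per character of the alphabet.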

Page 80: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU

Reference

• Chapter 1, 2: Exact Matching: Fundamental Preprocessing and First Algorithms