MCS 101: Algorithms Instructor Neelima Gupta [email protected]

Post on 02-Jan-2016


MCS 101: Algorithms

Instructor: Neelima Gupta

[email protected]

Table of Contents

• String Matching

– Naïve Method

– Finite Automata Approach

– Rabin Karp

– KMP

Pattern Matching

• Given a text string T[1..n] and a pattern P[1..m], find all occurrences of the pattern within the text.

• Example: for T = ababcabdabcaabc and P = abc, the occurrences are:
– first occurrence starts at T[3]
– second occurrence starts at T[9]
– third occurrence starts at T[13]

Let Σ denote the alphabet, i.e. the finite set of characters.

• Given:

A text T[1..n] of length n and a pattern P[1..m] of length m, where m << n.

• To Find:

Whether the pattern P occurs in the text T. If it does, report the first occurrence of P in T.

The characters of both T and P are drawn from the finite alphabet Σ.

NAÏVE APPROACH

T : a b c a b d a a b c d e
P : a b d

Example (Step 1)

T : a b c a b d a a b c d e
P : a b d

Mismatch after 3 comparisons

Example (Step 2)

T : a b c a b d a a b c d e
P :   a b d

Mismatch after 1 comparison

Example (Step 3)

T : a b c a b d a a b c d e
P :     a b d

Mismatch after 1 comparison

Example (Step 4)

T : a b c a b d a a b c d e
P :       a b d

Match found after 3 comparisons

Thus, after 8 comparisons the substring P is found in T.
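
The steps above can be sketched in Python. This is a minimal illustration, not code from the slides: the function name naive_search and the comparison counter are assumptions added for demonstration, and Python strings are 0-based, so the reported index 3 corresponds to T[4] in the 1-based notation used here.

```python
def naive_search(text, pattern):
    """Return (index, comparisons) for the first occurrence of pattern in text.

    index is 0-based (Python convention), or -1 if the pattern does not occur.
    comparisons counts every character comparison that was made.
    """
    n, m = len(text), len(pattern)
    comparisons = 0
    for start in range(n - m + 1):          # try every alignment, left to right
        k = 0
        while k < m:
            comparisons += 1
            if text[start + k] != pattern[k]:
                break                        # mismatch: shift right by 1
            k += 1
        if k == m:                           # all m characters matched
            return start, comparisons
    return -1, comparisons


print(naive_search("abcabdaabcde", "abd"))   # (3, 8)
```

The count of 8 is the sum 3 + 1 + 1 + 3 from the four steps.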

Worst Case Running Time

T : a a a a a … a a f, of size n

P : a a a f, of size 4

Example (Step 1)

T : a a a a a … a a f
P : a a a f

Mismatch found after 4 comparisons

Example (Step 2)

T : a a a a a … a a f
P :   a a a f

Mismatch found after 4 comparisons

Example (last step)

T : a a a a a … a a f
P :           a a a f

Match found after 4 comparisons

This continues until the pattern is aligned with the last 4 characters of T, where the match is found. Each of the (n-4) earlier alignments costs 4 comparisons, so the number of comparisons required is (n-4)·4 + 4.

Worst Case Running Time

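• In the worst case, a mismatch is found only after m comparisons at every alignment.

• This happens for the first (n-m) alignments of P against T; the last alignment needs m more comparisons.

• Thus the worst-case number of comparisons is (n-m)m + m = (n-m+1)m, i.e. the running time is O(nm).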
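
As a sanity check of the formula, the following sketch counts the comparisons made by the naïve method on the worst-case input T = a a … a f with P = a a a f. The helper name count_naive_comparisons and the chosen values of n are assumptions for illustration only.

```python
def count_naive_comparisons(text, pattern):
    """Count character comparisons made by the naive method until the first match."""
    n, m = len(text), len(pattern)
    comparisons = 0
    for start in range(n - m + 1):
        for k in range(m):
            comparisons += 1
            if text[start + k] != pattern[k]:
                break                        # mismatch: move to the next alignment
        else:
            return comparisons               # inner loop finished: match found
    return comparisons


for n in (10, 50, 200):
    text = "a" * (n - 1) + "f"               # T = a a ... a f
    pattern = "aaaf"                         # P = a a a f, m = 4
    m = len(pattern)
    assert count_naive_comparisons(text, pattern) == (n - m) * m + m
```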

Finite Automata

[Figure: state diagram of a finite automaton with states s0, s1, s2, s3; edge labels a, f and #.]

Worst Case Running Time

• In a finite automaton, each character of the text is scanned at most once. Thus, in the worst case, the searching time is O(n).

• Preprocessing time: for each of the O(m) states, an edge has to be formed for every character in Σ, so the preprocessing time is O(m·|Σ|).

• Thus the total running time is O(n) + O(m·|Σ|).

Drawback:

If the alphabet Σ is very large, then the time required to construct the FA will be very large.
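
A minimal sketch of the automaton approach is given below. It assumes a non-empty pattern whose characters all belong to the given alphabet. The names build_dfa and dfa_search and the restart-state construction are one common way to reach the O(m·|Σ|) preprocessing bound, not necessarily the construction intended in the lecture. Indices are 0-based, as usual in Python.

```python
def build_dfa(pattern, alphabet):
    """Transition table: delta[q][ch] is the next state after reading ch in state q.

    State q means "the last q characters read equal the first q characters of P".
    """
    m = len(pattern)
    delta = [dict.fromkeys(alphabet, 0) for _ in range(m + 1)]
    delta[0][pattern[0]] = 1
    x = 0                                    # state reached after reading pattern[1:q]
    for q in range(1, m + 1):
        for ch in alphabet:
            delta[q][ch] = delta[x][ch]      # on a mismatch, behave like state x
        if q < m:
            delta[q][pattern[q]] = q + 1     # on a match, advance
            x = delta[x][pattern[q]]
    return delta


def dfa_search(text, pattern, alphabet):
    """Return the 0-based start positions of all occurrences of pattern in text."""
    delta = build_dfa(pattern, alphabet)
    m = len(pattern)
    state, hits = 0, []
    for i, ch in enumerate(text):            # each text character is read once
        state = delta[state].get(ch, 0)      # characters outside the alphabet reset
        if state == m:
            hits.append(i - m + 1)
    return hits


print(dfa_search("ababcabdabcaabc", "abc", "abcd"))   # [2, 8, 12]
```

The table has (m+1)·|Σ| entries, which makes the drawback for large alphabets visible directly.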

BRUTE FORCE STRATEGY

• In the brute-force (naïve) strategy, whenever a mismatch is found, the pattern is shifted right by 1 character.

• This is not efficient, as it may require a large number of comparisons. Hence a better algorithm is needed.

T : … T[j] … T[j+r-1] … T[j+k-r-1] … T[j+k-2] T[j+k-1] …
P :   P[1] … P[r]     …            … P[k-1]   P[k] …
P :                     P[1]       … P[r]     P[r+1] …

If T[j+k-1] ≠ P[k]:

The pattern must be shifted. But instead of shifting right by 1 character, we look for the longest prefix of P[1..k-1] that matches a suffix of T[j..j+k-2].

Since T[j..j+k-2] has already been matched with P[1..k-1], this means we need to look for the longest proper prefix of P[1..k-1] that matches its own suffix.

KMP : Knuth Morris Pratt Algorithm

KMP Contd..

• Let r be the length of the longest proper prefix of P[1..k-1] that matches a suffix of P[1..k-1]. Then the pattern can be shifted by (k-1)-r positions instead of 1, and T[j+k-1] should be compared with P[r+1].

• Claim 1: We have not missed any match, i.e. the pattern does not occur starting at any position from j+1 to j+k-r-2.

• Proof: If it did, P[1..k-1] would have a longer proper prefix matching its suffix, contradicting the choice of r.
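
To make r concrete, here is a small brute-force sketch that computes the longest proper prefix of the matched part P[1..k-1] that is also its suffix, together with the resulting shift (k-1)-r. The helper name longest_border and the choice of the pattern abcabcabcaf with a mismatch at k = 11 are illustrative assumptions.

```python
def longest_border(s):
    """Length of the longest proper prefix of s that is also a suffix of s (brute force)."""
    for r in range(len(s) - 1, 0, -1):       # try the longest candidates first
        if s[:r] == s[-r:]:
            return r
    return 0


pattern = "abcabcabcaf"
k = 11                                       # mismatch at P[k], 1-based
matched = pattern[:k - 1]                    # P[1..k-1] = "abcabcabca"
r = longest_border(matched)
print(r, (k - 1) - r)                        # 7 3
```

Here the longest such prefix is abcabca (r = 7), so the pattern moves 3 positions to the right.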

Why LONGEST?

T : a b c a b c a b c a b c a f
P : a b c a b c a b c a f

Mismatch found at P[11].

The matched part is P[1..10] = abcabcabca. Its longest prefix that matches its suffix is abcabca (length 7), so the correct alignment is obtained by shifting the pattern 3 characters right.

T : a b c a b c a b c a b c a f
P :       a b c a b c a b c a f

Pattern found.

If instead a shorter prefix is used, for example abca (length 4), the pattern is shifted 6 characters right:

T : a b c a b c a b c a b c a f
P :             a b c a b c a b c a f

Mismatch. Pattern not found.

By choosing a shorter prefix and aligning the pattern accordingly, the occurrence of the pattern in the text is missed (that is, we shifted by more positions than we should have).

So we need to find the longest prefix of the matched part of the pattern that matches its suffix.

But HOW?

P : P[1] … P[k] …

Let the length of the longest proper prefix of P[1..k-1] that matches its suffix be r.

T : … T[j] … T[j+r-1] … T[j+k-r-1] … T[j+k-2] T[j+k-1] …
P :   P[1] … P[r]     …            … P[k-1]   P[k] …
P :                     P[1]       … P[r]     P[r+1] …

If T[j+k-1] ≠ P[k]:

Let Fail[k] be a pointer that says which character of P should come in place of P[k], by shifting P accordingly, when a mismatch occurs at P[k]. With r as above, Fail[k] = r+1. Fail[1] = 0 means that no character of P can be tried, so the next character of the text is read.

How to compute Fail[k]?

P : P[1] … P[r-1] P[r] P[r+1] … P[k-1] P[k] …

    P[1] … P[r'-1] P[r'] P[r'+1]

    P[1] … P[s-1] P[s] P[s+1]

Look at fail[k-1]. Let it be r'.

If r' = 0 (this happens only for k = 2) then fail[k] = 1.

If P[r'] = P[k-1] (which has already been matched with T[j+k-2]) then fail[k] = r'+1

else

1:  look at fail[r'] = s, say

    if s > 0

    {   if P[s] = P[k-1] then fail[k] = s+1

        else goto 1 with r' = s

    }

    else (i.e. s = 0) fail[k] = 1

EXAMPLE

P: abcabcabcaf

for k=1, fail[1] = 0 (assumed)

for k=2, s = fail[1] = 0, therefore fail[2] = 0+1 = 1

for k=3, s = fail[2] = 1; check whether P[1] = P[2]; since P[1] ≠ P[2], s = fail[1] = 0, therefore fail[3] = 0+1 = 1

for k=4, s = fail[3] = 1; check whether P[1] = P[3]; since P[1] ≠ P[3], s = fail[1] = 0, therefore fail[4] = 0+1 = 1

for k=5, s = fail[4] = 1; check whether P[1] = P[4]; yes, therefore fail[5] = 1+1 = 2

Similarly for the others.

k    fail[k]
1    0
2    1
3    1
4    1
5    2
6    3
7    4
8    5
9    6
10   7
11   8
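
The computation of the fail table can be sketched as follows. This is an illustrative version under stated assumptions: the function name compute_fail is not from the slides, the pattern is padded with one character so that p[k] corresponds to P[k] in the 1-based notation, and fail[0] is unused.

```python
def compute_fail(pattern):
    """Return fail as a list where fail[k] follows the 1-based convention above.

    fail[0] is unused padding; fail[1] = 0 by assumption.
    """
    m = len(pattern)
    p = " " + pattern                        # pad so that p[k] is P[k], 1-based
    fail = [0] * (m + 1)
    for k in range(2, m + 1):
        r = fail[k - 1]
        # Follow fail pointers until P[r] matches P[k-1] or nothing is left.
        while r > 0 and p[r] != p[k - 1]:
            r = fail[r]
        fail[k] = r + 1
    return fail


print(compute_fail("abcabcabcaf")[1:])       # [0, 1, 1, 1, 2, 3, 4, 5, 6, 7, 8]
```

The while loop plays the role of the "goto 1" step: it keeps replacing r by fail[r] until a match is found or r becomes 0.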

Example:

T : a b c a b c a b c a b c a f
P : a b c a b c a b c a f
k : 1 2 3 4 5 6 7 8 9 10 11
P :       a b c a b c a b c a f
k :       1 2 3 4 5 6 7 8 9 10 11

Mismatch found at position k=11. Look at fail[11] = 8, which implies the pattern must be shifted such that P[8] comes in place of P[11].

Example:

T : a b c a b c a b c a b c a f
P :       a b c a b c a b c a f
k :       1 2 3 4 5 6 7 8 9 10 11

Pattern found

Another Example:

T : a b c b a b c b a b c a b c a b c a f
P : a b c a b c a b c a f
k : 1 2 3 4 5 6 7 8 9 10 11
P :       a b c a b c a b c a f
k :       1 2 3 4 5 6 7 8 9 10 11

Mismatch found at position k=4. Look at fail[4] = 1, which implies the pattern must be shifted such that P[1] comes in place of P[4].

Another Example:

T : a b c b a b c b a b c a b c a b c a f
P :       a b c a b c a b c a f
k :       1 2 3 4 5 6 7 8 9 10 11
P :         a b c a b c a b c a f
k :         1 2 3 4 5 6 7 8 9 10 11

Mismatch found at position k=1. Look at fail[1] = 0, which implies: read the next character in the text.

Another Example:

T : a b c b a b c b a b c a b c a b c a f
P :         a b c a b c a b c a f
k :         1 2 3 4 5 6 7 8 9 10 11
P :               a b c a b c a b c a f
k :               1 2 3 4 5 6 7 8 9 10 11

Mismatch found at position k=4. Look at fail[4] = 1, which implies the pattern must be shifted such that P[1] comes in place of P[4].

Another Example:

T : a b c b a b c b a b c a b c a b c a f
P :               a b c a b c a b c a f
k :               1 2 3 4 5 6 7 8 9 10 11
P :                 a b c a b c a b c a f
k :                 1 2 3 4 5 6 7 8 9 10 11

Mismatch found at position k=1. Look at fail[1] = 0, which implies: read the next character in the text.

Another Example:

T : a b c b a b c b a b c a b c a b c a f
P :                 a b c a b c a b c a f
k :                 1 2 3 4 5 6 7 8 9 10 11

Pattern found
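
The two walkthroughs can be reproduced with the sketch below. It is an illustration under assumptions, not the lecture's own code: the names compute_fail and kmp_search are chosen here, the pattern is assumed non-empty, only the first occurrence is reported, and both strings are padded so that indices are 1-based as in the examples.

```python
def compute_fail(pattern):
    """fail[k] for k = 1..m in 1-based convention; fail[0] is unused."""
    m = len(pattern)
    p = " " + pattern
    fail = [0] * (m + 1)
    for k in range(2, m + 1):
        r = fail[k - 1]
        while r > 0 and p[r] != p[k - 1]:
            r = fail[r]
        fail[k] = r + 1
    return fail


def kmp_search(text, pattern):
    """Return the 1-based position of the first occurrence of pattern in text, or 0."""
    n, m = len(text), len(pattern)
    t, p = " " + text, " " + pattern         # pad so that t[i] is T[i], p[k] is P[k]
    fail = compute_fail(pattern)
    i, k = 1, 1                              # T[i] is compared with P[k]
    while i <= n:
        if t[i] == p[k]:
            if k == m:
                return i - m + 1             # whole pattern matched
            i += 1
            k += 1
        else:
            k = fail[k]                      # P[fail[k]] comes in place of P[k]
            if k == 0:                       # nothing to reuse: read next text character
                i += 1
                k = 1
    return 0


print(kmp_search("abcabcabcabcaf", "abcabcabcaf"))         # 4
print(kmp_search("abcbabcbabcabcabcaf", "abcabcabcaf"))    # 9
```

Note that the text index i never decreases; only k is reset through the fail table.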

Analysis of KMP

# of mismatches: For every mismatch, the pattern is shifted by at least 1 position. The size of each shift is determined by the longest prefix of the matched part that matches its suffix.

T: ......a b c a b c a b c a b c d a f d........
P:       d e b                                      mismatch
P:         d e b                                    mismatch

For every mismatch the pattern is shifted by at least 1 position.

Total no. of shifts <= n-m

Total no. of mismatches <= n-m+1

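Analysis of KMP contd.

# of matches: For every match, the pointer in the text moves up by 1 position.

T: ......a b c a b c a b c a b c d a f d........
P:       a b c b d e
P:             a b c b d e
P:                   a b c b d e

For every match the pointer in the text moves up by 1 position, and it never moves back.

=> # of matches <= length of text = n

Total comparisons = matches + mismatches <= n + (n-m+1), so the search takes O(n) time. The same argument applied to the pattern shows that the fail table is computed in O(m) time.

The complexity of KMP is therefore linear: O(m+n).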
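
The two bounds can be checked empirically with the sketch below, which counts matches and mismatches separately. The function names and test inputs are assumptions for illustration. The pattern is assumed non-empty, and the loop stops as soon as the pattern no longer fits in the remaining text, which is what the bound on the number of shifts presumes.

```python
def compute_fail(pattern):
    """fail[k] for k = 1..m in 1-based convention; fail[0] is unused."""
    m = len(pattern)
    p = " " + pattern
    fail = [0] * (m + 1)
    for k in range(2, m + 1):
        r = fail[k - 1]
        while r > 0 and p[r] != p[k - 1]:
            r = fail[r]
        fail[k] = r + 1
    return fail


def kmp_count(text, pattern):
    """Return (position, matches, mismatches); position is 1-based, 0 if absent."""
    n, m = len(text), len(pattern)
    t, p = " " + text, " " + pattern
    fail = compute_fail(pattern)
    matches = mismatches = 0
    i, k = 1, 1
    # i - k + 1 is where P[1] is currently aligned; stop once P no longer fits.
    while i - k + 1 <= n - m + 1:
        if t[i] == p[k]:
            matches += 1
            if k == m:
                return i - m + 1, matches, mismatches
            i += 1
            k += 1
        else:
            mismatches += 1
            k = fail[k]
            if k == 0:
                i += 1
                k = 1
    return 0, matches, mismatches


n, m = 20, 4
pos, matches, mismatches = kmp_count("a" * (n - 1) + "f", "aaaf")
print(pos, matches, mismatches)              # 17 20 16
assert matches <= n and mismatches <= n - m + 1
```

On this input, which is the worst case for the naïve method, KMP makes 36 comparisons where the naïve method makes (n-m)m + m = 68.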

ACKNOWLEDGEMENTS

MSc (CS) 2009

Abhishek Behl (02)
Aarti Sethiya (01)
Akansha Aggarwal (03)
Alok Prakash (04)
Vibha Negi (31)