The MaxSuffix-Matching Algorithm

Rytter, W.: On maximal suffixes and constant-space versions of KMP algorithm. LATIN 2002: Theoretical Informatics, 5th Latin American Symposium,

Cancun, Mexico, April 3-6, 2002, Proceedings.

Advisor: Prof. R. C. T. Lee

Reporter: L. Y. Huang

Maximal Suffix

• The maximal suffix of a string is the suffix that is lexicographically largest among all suffixes of the string.

• The maximal suffix of a string w is denoted by MaxSuf(w).

• Ex: Consider the string w = abaaba.
– The set of its suffixes: {a, ba, aba, aaba, baaba, abaaba}
– The suffixes in sorted order: {a, aaba, aba, abaaba, ba, baaba}

• Thus MaxSuf(w) = baaba.
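The definition can be checked directly by comparing all suffixes. The sketch below is only a naive illustration (quadratic time); the helper name `max_suf` is an assumption made here and does not come from the slides. It relies on Python's built-in string comparison, which is lexicographic.

```python
def max_suf(w: str) -> str:
    """Return the lexicographically largest suffix of a non-empty string w."""
    # Enumerate every suffix w[s:] and let Python's lexicographic
    # string comparison pick the largest one.
    return max(w[s:] for s in range(len(w)))


w = "abaaba"
print(sorted(w[s:] for s in range(len(w))))  # the suffixes in sorted order
print(max_suf(w))                            # the maximal suffix of w
```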

Self-Maximal String

• A string w is said to be self-maximal if MaxSuf(w) = w.

• Ex: Consider the strings w = abaaba and x = baaba.
– MaxSuf(w) = baaba.
– MaxSuf(x) = baaba.

• Hence x is a self-maximal string but w is not.

Important Properties of Self-Maximal Strings

• The following observation about self-maximal strings follows directly from the definition:

• For a self-maximal string P, if a prefix P[1, i] of P is equal to a substring P[k, k+i-1] of P, then P[i+1] ≥ P[k+i].

[Figure: P contains a prefix u followed by the character x and a later occurrence of u followed by the character y; if x ≠ y, then x > y.]

• Example: TCATBTCATA is a self-maximal string.

• But TBATATBATB is not a self-maximal string, because the B that follows the second occurrence of TBAT is lexicographically larger than the A that follows the prefix TBAT.
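Both examples can be verified mechanically. The following sketch tests self-maximality straight from the definition; the function name `is_self_maximal` is hypothetical, and the check is deliberately naive (it compares every proper suffix with the whole string).

```python
def is_self_maximal(w: str) -> bool:
    """True if no proper suffix of w is lexicographically larger than w."""
    # A proper suffix can never be equal to w, so "<=" is the same as "<" here.
    return all(w[s:] <= w for s in range(1, len(w)))


for w in ("baaba", "abaaba", "TCATBTCATA", "TBATATBATB"):
    print(w, is_self_maximal(w))
```

For TBATATBATB the check fails at the suffix TBATB, which is larger than the whole string exactly because of the B that follows the second TBAT.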

The Period of a String

• A period of a string w is an integer p, 0 < p ≤ |w|, such that
  w[i] = w[i + p] for all i ∈ {1, …, |w| − p}.

• Ex: Consider the string w = bbabbabbabba.
– bbabbabbabba → period = 3 and period = 6.
– abcdefg → period = word length = 7.
– abcdeab → period = 5.

• We define period(w) as the smallest period of w.
• If w = bbabbabbabba, period(w) is 3.
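As a quick check of the definition, the sketch below lists all periods of a string by brute force. The names `periods` and `period` are assumptions for illustration, and the code uses 0-based indices, so the condition w[i] = w[i + p] is tested for i = 0, …, |w| − p − 1.

```python
def periods(w: str) -> list:
    """All periods p of w, i.e. 0 < p <= len(w) with w[i] == w[i + p] throughout."""
    n = len(w)
    return [p for p in range(1, n + 1)
            if all(w[i] == w[i + p] for i in range(n - p))]


def period(w: str) -> int:
    """The smallest period of a non-empty string w."""
    return periods(w)[0]


print(periods("bbabbabbabba"))  # contains 3 and 6, among others
print(period("abcdefg"), period("abcdeab"))
```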

• Given a string P, we are actually interested in the period of every prefix.

i            1 2 3 4 5 6 7 8 9
P            a b c a a b c a b
period       1 2 3 3 4 4 4 4 7
prefix       0 0 0 1 1 2 3 4 2
i-prefix(i)  1 2 3 3 4 4 4 4 7

Note that period(i) = i − prefix(i), where prefix is the prefix function of the MP algorithm; this is the number of positions by which we can shift the pattern. (The index starts from 1 in this case.)
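The table can be reproduced with the usual prefix (failure) function of the MP algorithm. This is a minimal sketch: `prefix_function` is a name chosen here, the list it returns is 0-based (entry i−1 holds prefix(i)), and the back pointers it follows are exactly the "looking back" that the period function is meant to avoid.

```python
def prefix_function(p: str) -> list:
    """border[i-1] = length of the longest proper border of p[:i]."""
    border = [0] * len(p)
    k = 0
    for i in range(1, len(p)):
        # Follow the back pointers until p[i] extends a border (or none is left).
        while k > 0 and p[i] != p[k]:
            k = border[k - 1]
        if p[i] == p[k]:
            k += 1
        border[i] = k
    return border


P = "abcaabcab"
prefix = prefix_function(P)
period = [i + 1 - b for i, b in enumerate(prefix)]  # period(i) = i - prefix(i)
print(prefix)
print(period)
```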

Why are we interested in the period function?

• If the period function carries the same information as the prefix function of the MP algorithm, why are we interested in it?

• To calculate the prefix function, we must use pointers which point back to characters far behind the current position.

• In the following, we shall introduce a naïve period function which never looks back.

Naive-Period Function

• The function Naive-Period can be used to compute the period of a string if this string is self-maximal.

• For a general string, the Naive-Period function will not work. This is why our algorithm only works for self-maximal strings.

Algorithm of Naive-Period Function

Function Naive-Period(j);
{ computes the periods of the prefixes of a self-maximal string pat }
  period(1) := 1;
  for i := 2 to j do
    if pat[i] ≠ pat[i − period(i − 1)]
      then period(i) := i
      else period(i) := period(i − 1);
  return period;
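A direct transcription of the function into Python might look as follows. It is a sketch under two assumptions: the pattern is non-empty, and the result is returned as a list whose entry i−1 holds period(i). The second call uses a pattern that is not self-maximal, to illustrate that the function is not meant for general strings.

```python
def naive_period(pat: str) -> list:
    """Periods of all prefixes of pat; correct only if pat is self-maximal."""
    period = [0] * (len(pat) + 1)    # 1-based table, period[0] is unused
    period[1] = 1
    for i in range(2, len(pat) + 1):
        # 1-based pat[i] and pat[i - period(i-1)] become pat[i-1] and
        # pat[i-1-period[i-1]] with Python's 0-based indexing.
        if pat[i - 1] != pat[i - 1 - period[i - 1]]:
            period[i] = i
        else:
            period[i] = period[i - 1]
    return period[1:]


print(naive_period("bbabbabbab"))  # self-maximal input
print(naive_period("abcaabcab"))   # not self-maximal: differs from the true periods
```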

An Example of Naive-Period Function

• Consider the string w = bbabbabbab.
– w is a self-maximal string and period(w) = 3.

i              1 2 3 4 5 6 7 8 9 10
w              b b a b b a b b a b
i-period(i-1)  0 1 2 1 2 3 4 5 6 7
period(i)      1 1 3 3 3 3 3 3 3 3

Why does the naïve period function work for self-maximal strings?

• Given any pattern P, let k be the length of the longest proper suffix of P[1, i-1] that is equal to a prefix P[1, k] of P[1, i-1].

• Let k’ be the length of the longest proper suffix of P[1, i] that is equal to a prefix P[1, k’] of P[1, i].

[Figure: P drawn up to position i-1 with its prefix and suffix of length k marked, and P drawn up to position i with its prefix and suffix of length k’ marked.]

• For any i, we consider the following possibilities:

1. k ≠ 0 and P[k + 1] = P[i] : Period(i) = Period(i - 1)
2. k ≠ 0, P[k + 1] ≠ P[i] and k’ ≠ 0 : Period(i) = i – k’
3. k ≠ 0, P[k + 1] ≠ P[i] and k’ = 0 : Period(i) = i
4. k = 0 and k’ ≠ 0 : Period(i) = i – k’
5. k = 0 and k’ = 0 : Period(i) = i

1. k ≠ 0 and P[k + 1] = P[i] : Period(i) = Period(i - 1)

i       1 2 3 4 5 6 7 8
P       a b c a a b c a
period  1 2 3 3 4 4 4 4

For i = 8, the substring “abc” of length 3 (k = 3) is the longest proper suffix of P[1, 7] that equals a prefix of P[1, 7], and P[8] = P[4].

period(8) = period(7) = 4.

2. k ≠ 0, P[k + 1] ≠ P[i] and k’ ≠ 0 : Period(i) = i – k’

i       1 2 3 4 5 6 7 8 9
P       a b c a a b c a b
period  1 2 3 3 4 4 4 4 7

For i = 9, the substring “abca” of length 4 (k = 4) is the longest proper suffix of P[1, 8] that equals a prefix of P[1, 8], and P[9] ≠ P[5].

The longest proper suffix of P[1, 9] that equals a prefix of P[1, 9] is P[1, 2] = ab, of length 2 (k’ = 2).

period(9) = i - |P[1, 2]| = 9 - 2 = 7.

3. k ≠ 0, P[k + 1] ≠ P[i] and k’ = 0 : Period(i) = i

i       1 2 3 4 5 6 7 8 9
P       a b c c a b c c b
period  1 2 3 4 4 4 4 4 9

For i = 9, the substring “abcc” of length 4 (k = 4) is the longest proper suffix of P[1, 8] that equals a prefix of P[1, 8], and P[9] ≠ P[5].

There is no proper suffix of P[1, 9] that equals a prefix of P[1, 9] (k’ = 0).

period(9) = i = 9.

4. k = 0 and k’ ≠ 0 : Period(i) = i – k’

i       1 2 3 4 5 6 7 8 9
P       a b c c b b c c a
period  1 2 3 4 5 6 7 8 8

For i = 9, there is no proper suffix of P[1, 8] that equals a prefix of P[1, 8] (k = 0).

The substring “a” of length 1 (k’ = 1) is a suffix of P[1, 9] that equals a prefix of P[1, 9], namely P[1, 1] = a.

period(9) = i - |P[1, 1]| = 9 - 1 = 8.

5. k = 0 and k’ = 0 : Period(i) = i

i       1 2 3 4 5 6 7 8 9
P       a b c c b b c c b
period  1 2 3 4 5 6 7 8 9

For i = 9, there is no proper suffix of P[1, 8] that equals a prefix of P[1, 8] (k = 0).

There is no proper suffix of P[1, 9] that equals a prefix of P[1, 9] (k’ = 0).

period(9) = i = 9.
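The five worked examples can be re-checked with a small brute-force classifier. This is an illustrative sketch only: `border` and `classify` are hypothetical helper names, `k2` stands for k’, and the border lengths are computed naively rather than with the prefix function.

```python
def border(s: str) -> int:
    """Length of the longest proper suffix of s that is also a prefix of s."""
    # k = 0 always qualifies (empty border), so the search always succeeds.
    return next(k for k in range(len(s) - 1, -1, -1) if s[:k] == s[len(s) - k:])


def classify(P: str, i: int):
    """Return (case number, Period(i)) for a 1-based position i >= 2 of P."""
    k = border(P[:i - 1])    # k  for P[1, i-1]
    k2 = border(P[:i])       # k' for P[1, i]
    if k != 0 and P[k] == P[i - 1]:   # P[k + 1] = P[i] in 1-based terms
        case = 1
    elif k != 0:
        case = 2 if k2 != 0 else 3
    else:
        case = 4 if k2 != 0 else 5
    return case, i - k2               # Period(i) = i - k' in every case


for P, i in (("abcaabca", 8), ("abcaabcab", 9), ("abccabccb", 9),
             ("abccbbcca", 9), ("abccbbccb", 9)):
    print(P, i, classify(P, i))
```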

Cases 2 (k ≠ 0, P[k + 1] ≠ P[i] and k’ ≠ 0) and 4 (k = 0 and k’ ≠ 0) are the only ones in which Period(i) = i – k’, a value that the Naive-Period function never assigns explicitly. For a self-maximal string, case 2 cannot occur and case 4 is harmless. Why?

In both cases k’ ≠ 0, so there must be a proper suffix of P[1, i] which is equal to a prefix of P[1, i]. Consider the longest such suffix.

2. k ≠ 0, P[k + 1] ≠ P[i] and k’ ≠ 0

[Figure: P with the string u of length k marked both as a prefix and as a suffix of P[1, i-1]; the prefix u is followed by x = P[j], where j = k + 1, and the suffix u is followed by y = P[i].]

Suppose that P is self-maximal. Since P[i] = y ≠ P[j] = x, we have x > y.

Since k’ ≠ 0, there is a string v·y which is the longest proper suffix of P[1, i] equal to a prefix of P[1, i].

[Figure: P with v·y marked as a prefix and as the suffix ending at position i.]

Since k ≠ 0, v is also a suffix of the prefix u, so P contains an occurrence of v that is followed by x.

[Figure: inside the prefix u, a prefix v followed by y and a suffix v followed by x; inside the suffix u, a prefix v followed by y and a suffix v followed by y at position i.]

Since P is a self-maximal string and its prefix v is followed by y, the occurrence of v at the end of the prefix u gives y ≥ x, which contradicts x > y. Contradiction! Hence k ≠ 0, P[k + 1] ≠ P[i] and k’ ≠ 0 cannot hold for self-maximal strings.

Case 4 (k = 0 and k’ ≠ 0) needs no contradiction argument, and it can occur in a self-maximal string (for example in bab at i = 3). Since k’ ≤ k + 1, we get k’ = 1, so Period(i) = i – 1 = Period(i - 1).

Thus we have the following:

For self-maximal strings, Period(i) = Period(i - 1) or Period(i) = i.

That is, the naïve period function works for self-maximal strings.
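The claim can also be tested exhaustively on short strings. The sketch below compares the naive period function with brute-force periods for every self-maximal string over a small alphabet; it is a sanity check under stated assumptions (alphabet "abc", lengths up to 6, helper names invented here), not a proof.

```python
from itertools import product


def is_self_maximal(w):
    """True if no proper suffix of w is lexicographically larger than w."""
    return all(w[s:] <= w for s in range(1, len(w)))


def smallest_period(w):
    """Smallest p with w[i] == w[i + p] for all valid i (brute force)."""
    n = len(w)
    return next(p for p in range(1, n + 1)
                if all(w[i] == w[i + p] for i in range(n - p)))


def naive_period(pat):
    """Naive-Period: entry i-1 of the result is period(i)."""
    period = [1]
    for i in range(1, len(pat)):               # 0-based position i
        if pat[i] != pat[i - period[-1]]:
            period.append(i + 1)               # mismatch: period jumps to the length
        else:
            period.append(period[-1])          # match: period is unchanged
    return period


def counterexamples(alphabet, max_len):
    """Self-maximal strings on which Naive-Period disagrees with brute force."""
    bad = []
    for n in range(1, max_len + 1):
        for letters in product(alphabet, repeat=n):
            w = "".join(letters)
            if is_self_maximal(w):
                exact = [smallest_period(w[:i]) for i in range(1, n + 1)]
                if naive_period(w) != exact:
                    bad.append(w)
    return bad


print(counterexamples("abc", 6))  # an empty list means no disagreement was found
```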

• What is the advantage of the naïve-period function?

• It runs in linear time, and we never need to look back at characters far behind the current position, as we do when calculating the prefix function in the MP algorithm.

• For a string which is not self-maximal, we use the following algorithm, called the MaxSuffix-Matching Algorithm.

MaxSuffix-Matching Algorithm

• First, we decompose the pattern P into u · v, where v = MaxSuf(P) and u is the remaining prefix of P.

• Note that v occurs only once in P; this is a very important property.

• Property 1: No non-empty suffix of u is equal to a prefix of v, because such an overlap would give P a suffix that is lexicographically larger than v.

• Example: P = dababdadad
– MaxSuf(P) = dadad
– P = u · v = dabab · dadad
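The decomposition and the two properties can be illustrated with a few lines of Python. This is a naive sketch: the name `decompose` is an assumption, and the maximal suffix is found by comparing all suffixes rather than by a constant-space method.

```python
def decompose(P: str):
    """Split a non-empty P into (u, v) with v = MaxSuf(P), by naive comparison."""
    start = max(range(len(P)), key=lambda s: P[s:])
    return P[:start], P[start:]


P = "dababdadad"
u, v = decompose(P)
print(u, v)

# v occurs only once in P, namely as its suffix.
print(P.find(v) == len(u))

# Property 1: no non-empty suffix of u is a prefix of v.
print(any(v.startswith(u[len(u) - k:]) for k in range(1, len(u) + 1)))
```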

• If v is found in T, we next check, by naive testing, whether u occurs immediately to the left of that occurrence of v.

• Let i be the position of an occurrence of v in T, and let prev be the position of the previous occurrence of v in T. Because v occurs only once in P, u needs to be tested only when i − prev > |u|.

[Figure: the text with two occurrences of v marked, at positions prev and i.]

MaxSuffix-Matching Algorithm

Algorithm MaxSuffix-Matching;
  i := 0; j := 0; period := 1;
  prev := −1;  { no occurrence of v has been seen yet }
  while i ≤ n − |v| do begin
    while j < |v| and v[j + 1] = T[i + j + 1] do begin
      j := j + 1;
      { Naive-Period function }
      if j > period and v[j] ≠ v[j − period] then period := j
    end;
    { MATCH OF v }
    if j = |v| then begin
      { test u by using any algorithm }
      if i − prev > |u| and u = T[i − |u| + 1 … i] then report match at i − |u|;
      prev := i
    end;
    i := i + period;
    if j ≥ 2 · period then j := j − period
    else begin j := 0; period := 1 end
  end;
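The algorithm can be sketched in Python as follows. Several choices here are assumptions made for illustration rather than part of the slides: the function name `maxsuffix_matching`, 0-based match positions, a naive computation of the decomposition u · v, and `prev` starting at −1 so that an occurrence at the very beginning of the text is not skipped. The slice comparison used to test u is convenient but allocates a temporary string; a character-by-character loop would keep the extra space constant.

```python
def maxsuffix_matching(text: str, pattern: str) -> list:
    """Return the 0-based start positions of a non-empty pattern in text."""
    # Decompose pattern = u + v with v = MaxSuf(pattern) (naive, for clarity).
    start = max(range(len(pattern)), key=lambda s: pattern[s:])
    u, v = pattern[:start], pattern[start:]

    matches = []
    i, j, period, prev = 0, 0, 1, -1
    while i <= len(text) - len(v):
        # Extend the match of v; maintain the period of v[:j] on the fly
        # (this is the Naive-Period rule, valid because v is self-maximal).
        while j < len(v) and v[j] == text[i + j]:
            j += 1
            if j > period and v[j - 1] != v[j - 1 - period]:
                period = j
        if j == len(v):                        # occurrence of v at shift i
            # Test u only if the previous occurrence of v is far enough away.
            if i - prev > len(u) and text[i - len(u):i] == u:
                matches.append(i - len(u))
            prev = i
        i += period                            # shift by the current period
        if j >= 2 * period:
            j -= period                        # keep the overlap, same period
        else:
            j, period = 0, 1                   # restart the match of v
    return matches


print(maxsuffix_matching("adadaddadabababadada", "abababadada"))
```

On the text and pattern of the example slides, v = dada is found three times, and only the third occurrence is far enough from the previous one for u to be tested.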

Example

• Text = adadaddadabababadada
• P = u·v = abababa · dada
• Case 1
– If i < |u|, there is not enough text to the left of v to hold u, so there is no occurrence of u·v at the beginning.

[Figure: the text with the first occurrence of v = dada marked at position i.]

• Case 2
– If i – prev ≤ |u|, then there is no occurrence of u·v at position i - |u|. This is because the maximal suffix v of P starts at only one position in P.

[Figure: the text with the first and second occurrences of v = dada marked; i = 7, |u| = 7, prev = 2.]

• So, in this example, we only need to check whether u occurs to the left of the third occurrence of v.

[Figure: the text with the first, second and third occurrences of v = dada marked.]

Time Complexity and Space Complexity

• Hence, the MaxSuffix-Matching Algorithm finds all occurrences of a pattern in linear time and O(1) extra space (the variables i, j, period and prev).

~Thank You~