A Unifying Framework for Compressed Pattern Matching

Takuya Kida, Masayuki Takeda, Ayumi Shinohara, Yusuke Shibata, Setsuo Arikawa

Department of Informatics, Kyushu University, Japan

Contents

Pattern matching and compressed pattern matching

Previous results

Collage system

Proposed algorithm

Conclusion

Pattern Matching Problem

text := We introduce a general framework which is suitable to capture an essence of compressed pattern matching according to various dictionary based compressions. The goal is to find all occurrences of a pattern in a text without decompression, which is one of the most active topics in string matching. Our framework includes such compression methods as Lempel-Ziv family, (LZ77, LZSS, LZ78, LZW), byte-pair encoding, and the static dictionary based method. Technically, our pattern matching algorithm extremely extends that for LZW compressed text presented by Amir, Benson and Farach.

pattern := compress
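As a baseline for the problem on this slide, here is a minimal Python sketch of pattern matching on uncompressed text. The function name `find_all` and the sample string are assumptions made for illustration; the scan relies on Python's built-in `str.find` and is not the algorithm of this talk.

```python
def find_all(text: str, pattern: str) -> list[int]:
    """Return the 0-based start index of every occurrence of pattern in text."""
    hits = []
    start = text.find(pattern)
    while start != -1:
        hits.append(start)
        # Advance by one so that overlapping occurrences are also reported.
        start = text.find(pattern, start + 1)
    return hits


sample = "compressed pattern matching without decompression"
print(find_all(sample, "compress"))
```

Compressed pattern matching asks for the same set of positions, but with only the compressed form of the text as input.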

Compressed Pattern Matching

[Figure: the conventional route decompresses the compressed text into the original text and then runs a pattern matching machine on it. The goal is a new machine that runs directly on the compressed text.]

Previous Results (1)

year  researcher                            compression
1988  Eilam-Tzoreff and Vishkin             run-length
1992  Amir, Landau, and Vishkin             two-dimensional run-length
1992  Amir and Benson                       two-dimensional run-length
1994  Amir, Benson, and Farach              two-dimensional run-length
1994  Manber                                original compression scheme
1995  Farach and Thorup                     LZ77
1996  Amir, Benson, and Farach              LZW
1996  Gasieniec, et al.                     LZ77
1997  Karpinski, Rytter, and Shinohara      straight-line programs
1997  Miyazaki, Shinohara, and Takeda       straight-line programs
1997  Takeda                                finite state encoding
1998  Shibata                               byte pair encoding
1998  Fukamachi, Shinohara, and Takeda      Huffman encoding
1998  Kida, et al.                          LZW

Previous Results (2)

year  researcher                                     compression
1998  de Moura, Navarro, Ziviani, and Baeza-Yates    Word based encoding
1999  Shibata, Takeda, Shinohara, and Arikawa        Antidictionaries
1999  Kida, Takeda, Shinohara, and Arikawa           LZW
1999  Shibata, et al.                                Byte pair encoding
1999  Navarro and Raffinot                           LZ family
1999  Kida, et al.                                   Dictionary based methods (Collage system)  [Today's talk]

Callout on the slide: "faster than Agrep!"

Motivation

Previous:

Compression A → PM Algorithm A

Compression B → PM Algorithm B

Compression C → PM Algorithm C

Ours:

Compression A, Compression B, Compression C → Collage system → General pattern matching algorithm on the unifying framework

Collage System

Definition and Several Examples

Dictionary Based Compression

[Figure: the original text is factorized into a series of phrases, which are kept in a dictionary structure; encoding the phrases gives the compressed text.]

How to choose the phrases.

How to design the data structure of the dictionary.

How to encode phrases.

Definition of Collage System

A collage system is a pair 〈 D, S 〉

D : A sequence of assignments (Dictionary structure)

    X_1 = expr_1 ; X_2 = expr_2 ; ・・・ ; X_n = expr_n ;

S : A sequence of variables defined in D (Compressed text)

    S := X_{i_1} , X_{i_2} , ・・・ , X_{i_l}   ( each X_{i_k} is defined in D )

||D|| = n : number of assignments in D

|S| = l : number of variables in S

Definition of Collage System

D : A sequence of assignments (Dictionary structure)

    X_1 = expr_1 ; X_2 = expr_2 ; ・・・ ; X_n = expr_n ;

where each expr_k is one of

    a             for a ∈ Σ ∪ {ε}               (primitive assignment)
    X_i ・ X_j     for i, j < k                  (concatenation)
    (X_i)^j       for i < k and integer j       (j times repetition)
    [j]X_i        for i < k and integer j       (prefix truncation)
    X_i^[j]       for i < k and integer j       (suffix truncation)
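The five expression forms can be read as string equations. The block below is a sketch of that reading; the overline notation for "the string a variable stands for" is an assumption introduced here, and the convention that a truncation removes j symbols is inferred from the example slide later in the talk, where [3]X5 turns ababab into bab.

```latex
\begin{align*}
X_k = a             &\;\Rightarrow\; \overline{X_k} = a, \qquad a \in \Sigma \cup \{\varepsilon\} \\
X_k = X_i \cdot X_j &\;\Rightarrow\; \overline{X_k} = \overline{X_i}\;\overline{X_j} \\
X_k = (X_i)^j       &\;\Rightarrow\; \overline{X_k} = \bigl(\overline{X_i}\bigr)^{j} \\
X_k = {}^{[j]}X_i   &\;\Rightarrow\; \overline{X_k} = \overline{X_i}\,\bigl[\,j+1 : |\overline{X_i}|\,\bigr] \\
X_k = X_i^{[j]}     &\;\Rightarrow\; \overline{X_k} = \overline{X_i}\,\bigl[\,1 : |\overline{X_i}| - j\,\bigr] \\
S = X_{i_1}, \dots, X_{i_l} &\;\Rightarrow\; \text{text} = \overline{X_{i_1}} \cdots \overline{X_{i_l}}
\end{align*}
```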

Example of Collage System

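D :

    X_1 = a ;
    X_2 = b ;
    X_3 = X_1 ・ X_2 ;
    X_4 = X_2 ・ X_1 ;
    X_5 = ( X_3 )^3 ;      (3 times repetition)
    X_6 = [ 3 ]X_5 ;       (prefix truncation)
    X_7 = X_6 ・ X_4 ;

S : X_3 , X_6 , X_4 , X_7

Text represented by S : abbabbababba

[Figure: the tree T(X_7). X_7 has children X_6 and X_4; X_6 is the prefix truncation [3] of X_5; X_5 is the 3 times repetition of X_3; X_3 has leaves X_1 = a and X_2 = b; X_4 has leaves X_2 = b and X_1 = a.]

height(X_7) = 4

height(D) = 4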
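The example on this slide can be checked mechanically. The following Python sketch expands each variable and computes tree heights; the tuple tags (`prim`, `cat`, `rep`, `ptrunc`, `strunc`) and function names are assumptions chosen for illustration, and a truncation is taken to drop j symbols, which is the reading that reproduces the slide's text.

```python
from functools import lru_cache

# Dictionary D of the slide, keyed by variable index.
D = {
    1: ("prim", "a"),
    2: ("prim", "b"),
    3: ("cat", 1, 2),
    4: ("cat", 2, 1),
    5: ("rep", 3, 3),       # (X3)^3
    6: ("ptrunc", 5, 3),    # [3]X5
    7: ("cat", 6, 4),
}
S = [3, 6, 4, 7]


@lru_cache(maxsize=None)
def expand(k: int) -> str:
    """String represented by variable X_k."""
    expr = D[k]
    tag = expr[0]
    if tag == "prim":
        return expr[1]
    if tag == "cat":
        return expand(expr[1]) + expand(expr[2])
    if tag == "rep":
        return expand(expr[1]) * expr[2]
    if tag == "ptrunc":     # drop the first j symbols
        return expand(expr[1])[expr[2]:]
    if tag == "strunc":     # drop the last j symbols
        s = expand(expr[1])
        return s[: len(s) - expr[2]]
    raise ValueError(f"unknown expression {expr!r}")


def height(k: int) -> int:
    """Height of the tree T(X_k); a primitive assignment has height 0."""
    expr = D[k]
    if expr[0] == "prim":
        return 0
    children = expr[1:3] if expr[0] == "cat" else expr[1:2]
    return 1 + max(height(c) for c in children)


text = "".join(expand(k) for k in S)
print(text)                                   # abbabbababba
print(height(7), max(height(k) for k in D))   # 4 4
```

Expanding every variable defeats the purpose of compression, so this is only a reference for what D and S denote.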

Example of Collage System: Byte Pair Encoding (BPE)

Original Text: a b a b c b a b c c a b c a c b

After replacing ab with D: D D c b D c c D c a c b

After replacing Dc with E: D E b E c E a c b

Substitution table: ab → D, Dc → E

D :

    X_1 = a ;
    X_2 = b ;
    X_3 = c ;
    X_4 = X_1 ・ X_2 ;
    X_5 = X_4 ・ X_3 ;

S : X_4 , X_5 , X_2 , X_5 , X_3 , X_5 , X_1 , X_3 , X_2
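The two substitution rounds on this slide can be reproduced with a small greedy pair-replacement loop. This Python sketch is a simplification made for illustration: the function name `bpe_collage`, the tuple encoding of assignments, and the stopping rule are assumptions, and a real BPE coder writes unused byte codes rather than variable indices.

```python
from collections import Counter


def bpe_collage(text: str, merges: int):
    """Greedy BPE sketch: return (D, S) with variables numbered from 1."""
    alphabet = sorted(set(text))
    D = [("prim", a) for a in alphabet]              # X_1 .. X_q
    index = {a: i + 1 for i, a in enumerate(alphabet)}
    S = [index[c] for c in text]
    for _ in range(merges):
        pairs = Counter(zip(S, S[1:]))               # adjacent pair frequencies
        if not pairs:
            break
        (x, y), count = pairs.most_common(1)[0]
        if count < 2:
            break                                    # nothing worth replacing
        D.append(("cat", x, y))                      # new variable X_new = X_x . X_y
        new = len(D)
        out, i = [], 0
        while i < len(S):                            # left-to-right replacement
            if i + 1 < len(S) and S[i] == x and S[i + 1] == y:
                out.append(new)
                i += 2
            else:
                out.append(S[i])
                i += 1
        S = out
    return D, S


D, S = bpe_collage("ababcbabccabcacb", merges=2)
print(D)
print(S)
```

Because every new variable is a concatenation of two earlier ones, a BPE-compressed text is a collage system with no repetition and no truncation.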

Example of Collage System (LZSS [gzip])

D :

    X_1 = a_1 ; X_2 = a_2 ; ・・・ ; X_q = a_q ;

    X_{q+1} = ( ( [i_1]X_{l(1)} ・ X_{l(1)+1} ・・・ X_{r(1)} )^{m_1} )^[j_1] b_1 ;

    X_{q+2} = ( ( [i_2]X_{l(2)} ・ X_{l(2)+1} ・・・ X_{r(2)} )^{m_2} )^[j_2] b_2 ;

    ・・・

    X_{q+n} = ( ( [i_n]X_{l(n)} ・ X_{l(n)+1} ・・・ X_{r(n)} )^{m_n} )^[j_n] b_n ;

S : X_{q+1} , X_{q+2} , ・・・ , X_{q+n}

What is ‘Collage’?

This is college!

Collage is ...

an artistic composition technique.

1. Cut or tear up materials.

2. Paste the pieces over a surface.

Our Algorithm

Pattern Matching Algorithm on a Collage System

Compressed pattern matching on a collage system

The problem of compressed pattern matching can be solved in

    O( (||D|| + |S|) ・ height(D) + m^2 + r ) time

using O( ||D|| + m^2 ) space.

If D contains no truncation, it can be solved in

    O( ||D|| + |S| + m^2 + r ) time,

that is, O( compressed text length + m^2 + r ) time.

m : pattern length

r : number of pattern occurrences

||D|| : number of assignments in D

|S| : number of variables in S

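Basic Idea

Pattern π = a b a b b

[Figure: the KMP automaton for π, with goto function 0 -a-> 1 -b-> 2 -a-> 3 -b-> 4 -b-> 5 and failure function arrows.]

original text: a b a b a b b a

state: 0 1 2 3 4 3 4 5 1

S : X_{i_1} X_{i_2} X_{i_3} X_{i_4}

state after each variable of S: 1 2 4 1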
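The state sequence on this slide can be reproduced by running a Knuth-Morris-Pratt automaton over the original text. The Python sketch below uses the textbook failure-function construction; the function names are assumptions for illustration and a non-empty pattern is assumed.

```python
def failure_function(pattern: str) -> list[int]:
    """fail[q] = length of the longest proper border of pattern[:q]."""
    fail = [0] * (len(pattern) + 1)
    k = 0
    for q in range(2, len(pattern) + 1):
        while k > 0 and pattern[q - 1] != pattern[k]:
            k = fail[k]
        if pattern[q - 1] == pattern[k]:
            k += 1
        fail[q] = k
    return fail


def step(pattern: str, fail: list[int], q: int, c: str) -> int:
    """One transition of the KMP automaton from state q on character c."""
    if q == len(pattern):            # after a full match, fall back first
        q = fail[q]
    while q > 0 and pattern[q] != c:
        q = fail[q]                  # follow failure links on a mismatch
    if pattern[q] == c:
        q += 1                       # goto transition
    return q


pattern, text = "ababb", "abababba"
fail = failure_function(pattern)
states = [0]
for c in text:
    states.append(step(pattern, fail, states[-1], c))
print(states)   # [0, 1, 2, 3, 4, 3, 4, 5, 1]
```

State 5 is reached after the seventh character, which is where the single occurrence of the pattern ends. The compressed algorithm aims to reach the same states while touching only one variable of S at a time.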

Jump and Output

The function Jump( j, u ) = δ_KMP( j, u )

• The domain is Q × D.

• It simulates the sequence of state transitions for u.

• Reply in O(1) time.

The set Output( j, u ) = { 1 ≤ i ≤ |u| | P is a suffix of P[1 : j] ・ u[1 : i] }

• This set contains the pattern occurrences.

• Reply in O( l ) time.
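To make the two definitions concrete, here is a brute-force Python sketch that evaluates them directly on an explicit string u. It is a reference for what the functions return, not the O(1) and O( l ) realizations of the talk; the names `jump` and `output` are assumptions, and it uses the fact that the KMP state after reading u from state j equals the length of the longest prefix of P that is a suffix of P[1 : j] ・ u.

```python
def jump(P: str, j: int, u: str) -> int:
    """State reached from state j after reading u (brute force)."""
    w = P[:j] + u
    for k in range(min(len(P), len(w)), -1, -1):
        if w.endswith(P[:k]):        # k = 0 always matches, so the loop returns
            return k
    return 0


def output(P: str, j: int, u: str) -> list[int]:
    """All 1-based i such that P is a suffix of P[1:j] . u[1:i]."""
    return [i for i in range(1, len(u) + 1) if (P[:j] + u[:i]).endswith(P)]


P = "ababb"
print(jump(P, 0, "ab"), jump(P, 2, "abab"), jump(P, 4, "ba"))
print(output(P, 4, "ba"))
```

Each reported i marks a position inside u at which an occurrence of P ends, possibly starting before u.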

Realization of Jump

For Jump( q, X_k ), if X_k is ...

    a, X_i ・ X_j         O(1) time
    ( X_i )^j            O(1) time
    [ j ]X_i, X_i^[j]    O( height(X_i) ) time

If the factor concatenation problem for a string of length m can be solved in O(1) time, Jump can be answered in O(1) time.

Factor Concatenation Problem

Instance: Two factors x and y of a string P, each represented as a node of the suffix trie of P.

Question: Is the string xy a factor of P? If 'yes', then return its node number.

Example: P = COPACABANA

    x = OPA, y = CABAN; concatenating them gives OPACABAN = P[2 : 9], so the answer is 'Yes'.
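The problem statement can be illustrated with a direct substring search. This Python sketch is a simplification: it returns the 1-based position of xy in P instead of a suffix trie node number, and it does not attempt constant-time queries; the function name `factor_concat` is an assumption.

```python
def factor_concat(P: str, x: str, y: str):
    """Return the 1-based (start, end) of the first occurrence of xy in P, or None."""
    if x not in P or y not in P:
        raise ValueError("x and y must both be factors of P")
    pos = P.find(x + y)
    if pos == -1:
        return None                  # xy is not a factor of P
    return pos + 1, pos + len(x) + len(y)


print(factor_concat("COPACABANA", "OPA", "CABAN"))   # (2, 9)
print(factor_concat("COPACABANA", "ANA", "CO"))      # None
```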

Solution to the problem

Using a suffix trie, it can be solved in O(m) time after O(m^2) time and space preprocessing.

Using a two-dimensional lookup table, it can be solved in O(1) time, but O(m^4) time and space preprocessing is needed.

It can be solved in O(1) time after O(m^2) time and space preprocessing.

Realization of Output

For Output( q, X_k ), if X_k is ...

    a, X_i ・ X_j         O(1) time
    ( X_i )^j            O( l ) time
    [ j ]X_i, X_i^[j]    O( l ・ height(X_i) ) time

It can be enumerated in O( l ) time from Output of X_i and X_j.

l : size of the set Output

Outline of Our Algorithm

Input. A pattern P and a collage system 〈 D, S 〉 ( S := X_{i_1} , X_{i_2} , ・・・ , X_{i_n} )
Output. All occurrences of the pattern.

    /* preprocess for D and P */
    preprocess(D); preprocess(P);

    l := 0; q := 0;
    for j := 1 to n do begin
        for each d ∈ Output(q, X_{i_j}) do
            report 'pattern occurs at position l + d';
        q := Jump(q, X_{i_j});      /* state transition */
        l := l + |X_{i_j}|;         /* calculate the offset */
    end
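The main loop can be tried out on a small collage system. The Python sketch below handles only primitive and concatenation assignments and realizes Jump and Output by memoized recursion over D, using Jump(q, X_i ・ X_j) = Jump(Jump(q, X_i), X_j). This is a simplification assumed for illustration, not the realization described in the talk, and the dictionary `D`, the sequence `S`, and all function names are hypothetical.

```python
from functools import lru_cache

PATTERN = "ababb"
# Hypothetical concatenation-only dictionary: a str is a one-character
# primitive, a pair (i, j) stands for X_i . X_j.
D = {1: "a", 2: "b", 3: (1, 2), 4: (3, 3), 5: (2, 1)}
S = [1, 2, 4, 5]          # a, b, abab, ba  ->  abababba


def delta(q: int, c: str) -> int:
    """KMP transition: longest prefix of PATTERN that is a suffix of PATTERN[:q] + c."""
    w = PATTERN[:q] + c
    return next(k for k in range(min(len(w), len(PATTERN)), -1, -1)
                if w.endswith(PATTERN[:k]))


@lru_cache(maxsize=None)
def length(k: int) -> int:
    e = D[k]
    return 1 if isinstance(e, str) else length(e[0]) + length(e[1])


@lru_cache(maxsize=None)
def jump(q: int, k: int) -> int:
    e = D[k]
    if isinstance(e, str):
        return delta(q, e)
    return jump(jump(q, e[0]), e[1])


@lru_cache(maxsize=None)
def output(q: int, k: int) -> tuple:
    e = D[k]
    if isinstance(e, str):
        return (1,) if delta(q, e) == len(PATTERN) else ()
    left = output(q, e[0])
    right = output(jump(q, e[0]), e[1])
    # Occurrences ending in the right part are shifted by the left length.
    return left + tuple(length(e[0]) + d for d in right)


def search(seq):
    """Report the 1-based end position of every pattern occurrence."""
    hits, l, q = [], 0, 0
    for k in seq:
        hits.extend(l + d for d in output(q, k))
        q = jump(q, k)        # state transition
        l += length(k)        # offset of the next variable
    return hits


print(search(S))   # [7]
```

The memo tables hold up to (m + 1) ・ ||D|| entries, so this sketch does not meet the space bound stated in the talk; it only shows that the loop never needs the decompressed text.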

Concluding Remarks

Conclusion and Future Works

Our Results

Complexity of our algorithm:

    O( (||D|| + |S|) ・ height(D) + m^2 + r ) time
    O( ||D|| + m^2 ) space

    (with truncation: LZ77, LZSS, etc.)

If D contains no truncation:

    O( ||D|| + |S| + m^2 + r ) time

    (no truncation: LZ78, LZW, BPE, Run-length, etc.)

1998 Kida, et al. ( LZW ):

    O( n + m^2 + r ) time
    O( n + m^2 ) space

Conclusion

We introduced a general framework for compressed pattern matching (collage system).

We proposed a compressed pattern matching algorithm on a collage system and showed its complexity:

    O( (||D|| + |S|) ・ height(D) + m^2 + r ) time
    O( ||D|| + m^2 ) space
    O( ||D|| + |S| + m^2 + r ) time (if no truncation)

Future Works

Can we reduce the complexity of the preprocessing from O(m^2) to O(m)?

To improve our algorithm for dealing with multiple patterns.

To develop an approximate pattern matching algorithm on a collage system.

To develop a new compression which is suitable for compressed pattern matching.