
Multiple Pattern Matching in LZW Compressed Text

Takuya KIDA, Masayuki TAKEDA, Ayumi SHINOHARA, Masamichi MIYAZAKI, Setsuo ARIKAWA

Department of Informatics, Kyushu University, Japan

[Figure labels: Nagano, Fukuoka]

Our Goal

[Figure: Compressed Text → Original Text → Pattern Matching Machine, contrasted with Compressed Text → New Machine !]

Previous studies

year   researcher                  compression method
1988   Eilam-Tsoreff and Vishkin   run-length
1992   Amir, Landau, and Vishkin   two-dimensional run-length
1992   Amir and Benson             two-dimensional run-length
1995   Farach and Thorup           LZ77
1996   Gasieniec, et al.           LZ77
1996   Amir, Benson and Farach     LZW
1997   Karpinski, et al.           straight-line programs
1997   Miyazaki, et al.            straight-line programs

Previous result vs Our result

Amir, Benson, and Farach's algorithm (JCSS 1996)
"Let sleeping files lie: Pattern matching in Z-compressed files"
– deals with only a single pattern.
– can find only the first occurrence of the pattern.
– takes O(n + m^2) time and space.
n: length of the compressed text, m: length of the pattern.

Our algorithm
– deals with multiple patterns.
– can find all occurrences of the patterns.
– takes O(n + m^2 + r) time and O(n + m^2) space.
m: total length of the patterns, r: number of pattern occurrences.

Lempel-Ziv-Welch compression

Σ = {a, b, c}

original text:   a b ab ab ba b c aba bc abab
compressed text: 1 2 4  4  5  2 3 6   9  11

[Figure: Dictionary trie D with nodes numbered 0 to 12 and edges labeled a, b, c.]

O(|D|) = O(n)
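The factorization on this slide can be reproduced with a few lines of code. The following is a minimal sketch of LZW compression, assuming the node numbering suggested by the figure (0 for the root, 1 to 3 for a, b, c, new nodes numbered in order of creation). The function name `lzw_compress` is hypothetical, and the dictionary trie is kept as a plain string-to-number map rather than a real trie.

```python
def lzw_compress(text, alphabet):
    """Return the dictionary-trie node numbers emitted for `text`."""
    # Node 0 is the root (empty string); single symbols get nodes 1..|alphabet|.
    node_of = {ch: i + 1 for i, ch in enumerate(alphabet)}
    next_node = len(alphabet) + 1
    codes = []
    current = ""
    for ch in text:
        if current + ch in node_of:
            current += ch                       # keep walking down the trie
        else:
            codes.append(node_of[current])      # emit the longest match
            node_of[current + ch] = next_node   # grow the trie by one node
            next_node += 1
            current = ch
    if current:
        codes.append(node_of[current])
    return codes


print(lzw_compress("abababbabcababcabab", "abc"))
# [1, 2, 4, 4, 5, 2, 3, 6, 9, 11]
```

Each emitted code adds at most one node, which is why the trie size is O(n).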

KMP automaton

Pattern: abab

[Figure: KMP automaton with states -1, 0, 1, 2, 3, 4. Goto edges labeled a, b, a, b lead from state 0 to state 4, an edge labeled Σ leaves state -1, and state 4 outputs {abab}. Legend: goto function, failure function, { } : output.]

original text: a a b a b a a b b a b a b

"found !" is shown twice as the automaton scans the text.
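The scan on this slide can be sketched as follows. This is an ordinary KMP matcher, not code from the talk; the names `kmp_failure` and `kmp_search` are hypothetical. The failure value -1 for state 0 plays the role of the extra state -1 in the figure.

```python
def kmp_failure(pattern):
    """fail[j] = state to fall back to from state j (fail[0] = -1)."""
    fail = [0] * (len(pattern) + 1)
    fail[0] = -1
    k = -1
    for j in range(len(pattern)):
        while k >= 0 and pattern[k] != pattern[j]:
            k = fail[k]
        k += 1
        fail[j + 1] = k
    return fail


def kmp_search(pattern, text):
    """Return the 1-based end positions of all occurrences of pattern."""
    fail = kmp_failure(pattern)
    state, hits = 0, []
    for pos, ch in enumerate(text, start=1):
        # Follow failure links until the goto edge for ch exists.
        while state >= 0 and (state == len(pattern) or pattern[state] != ch):
            state = fail[state]
        state += 1
        if state == len(pattern):
            hits.append(pos)
    return hits


print(kmp_failure("abab"))                  # [-1, 0, 0, 1, 2]
print(kmp_search("abab", "aababaabbabab"))  # [5, 13]
```

The two reported positions correspond to the two "found !" marks.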

Basic Idea (Amir et al.)

Pattern: abab

[Figure: the KMP automaton for abab (states -1 to 4, output {abab} at state 4), with dictionary strings attached to the state they lead to from state 0:
state 0: b, c, bc
state 1: a, ca, bca, abca
state 2: ab, bab
state 3: aba
state 4: abab]

Next(0, bab) = 2

Next(2, abc) = 0

Who is watching the occurrences of the pattern?!

Output(2, abc) = { 〈2, abab〉 }
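The two functions can be checked on the slide's values with a small sketch. It assumes that Next(q, u) is the state reached from q after reading the whole string u, and that Output(q, u) lists the positions inside u where the pattern is completed. The helper names (`step`, `next_state`, `output`) are hypothetical, and each transition is computed by brute force instead of through failure links.

```python
PATTERN = "abab"


def step(state, ch):
    """One automaton transition: the longest prefix of PATTERN that is a
    suffix of PATTERN[:state] + ch (brute force, for brevity)."""
    s = PATTERN[:state] + ch
    for k in range(min(len(s), len(PATTERN)), -1, -1):
        if s.endswith(PATTERN[:k]):
            return k


def next_state(state, u):
    """Next(state, u): state reached after reading all of u."""
    for ch in u:
        state = step(state, ch)
    return state


def output(state, u):
    """Output(state, u): pairs (i, pattern) for each occurrence ending at u[i]."""
    found = []
    for i, ch in enumerate(u, start=1):
        state = step(state, ch)
        if state == len(PATTERN):
            found.append((i, PATTERN))
    return found


print(next_state(0, "bab"))   # 2
print(next_state(2, "abc"))   # 0
print(output(2, "abc"))       # [(2, 'abab')]
```

Jumping with Next alone would skip the occurrence that is completed in the middle of abc, which is the reason Output is needed.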

for Multiple Patterns

Aho-Corasick Pattern Matching Machine

Patterns: Π = {aba, ababb, abca, bb}

[Figure: AC machine with states 0 to 9 and goto edges labeled a, b, c. Outputs: {aba} at state 3, {ababb, bb} at state 5, {abca} at state 7, {bb} at state 9. Legend: goto function, failure function, { } : output.]
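For readers who want to reproduce the machine, here is a minimal sketch of the standard Aho-Corasick construction. It assumes states are numbered in order of creation while the patterns are inserted as aba, ababb, abca, bb, which appears to match the numbering in the figure. The function name `build_ac` and the combined transition table `delta` are choices made for this sketch.

```python
from collections import deque


def build_ac(patterns, alphabet):
    """Return (delta, fail, out) for the Aho-Corasick machine of `patterns`."""
    goto = [{}]                      # goto[s][ch] -> child of s in the trie
    out = [set()]
    for p in patterns:               # states are numbered in order of creation
        s = 0
        for ch in p:
            if ch not in goto[s]:
                goto.append({})
                out.append(set())
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        out[s].add(p)

    fail = [0] * len(goto)
    delta = [dict() for _ in goto]   # goto and failure merged into one table
    queue = deque()
    for ch in alphabet:
        t = goto[0].get(ch, 0)       # missing edges at the root loop back to 0
        delta[0][ch] = t
        if t:
            queue.append(t)
    while queue:                     # breadth-first: shallower states first
        s = queue.popleft()
        out[s] |= out[fail[s]]
        for ch in alphabet:
            if ch in goto[s]:
                t = goto[s][ch]
                fail[t] = delta[fail[s]][ch]
                delta[s][ch] = t
                queue.append(t)
            else:
                delta[s][ch] = delta[fail[s]][ch]
    return delta, fail, out


delta, fail, out = build_ac(["aba", "ababb", "abca", "bb"], "abc")
print([delta[q]["a"] for q in range(10)])   # [1, 1, 3, 1, 3, 1, 7, 1, 1, 1]
print(sorted(out[5]))                       # ['ababb', 'bb']
```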

Our Algorithm

Input.  Π: set of patterns,
        u1, u2, …, un: LZW compressed text.
Output. All occurrences of the patterns.

  Construct from Π the AC machine and the generalized suffix trie.
  Initialize the dictionary trie, Next and Output;
  l := 0; state := q0;
  for i := 1 to n do begin
    for each 〈d, π〉 ∈ Output(state, ui) do
      report "pattern π occurs at position l + d";
    state := Next(state, ui);
    l := l + |ui|;
    Update the dictionary trie, Next and Output
  end.

Costs shown on the slide: O(m^2) for the preprocessing, O(n + r) for scanning and reporting, O(n) for the updates.
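The loop above can be exercised end to end with brute-force stand-ins for Next and Output. The sketch below is not the authors' implementation and does not meet the stated time bounds: it identifies each AC state with the pattern prefix it represents and recomputes Next and Output from their definitions on every step. All names (`search_lzw`, `next_state`, `output`, `words`) are hypothetical.

```python
def search_lzw(codes, alphabet, patterns):
    """Report (end_position, pattern) pairs while reading LZW codes one by one."""
    prefixes = {p[:k] for p in patterns for k in range(len(p) + 1)}

    def next_state(q, u):
        # Stand-in for Next: longest suffix of q+u that is a prefix of a pattern.
        s = q + u
        return next(s[k:] for k in range(len(s) + 1) if s[k:] in prefixes)

    def output(q, u):
        # Stand-in for Output: pairs (i, p) with p a suffix of q + u[:i].
        return [(i, p) for i in range(1, len(u) + 1)
                for p in patterns if (q + u[:i]).endswith(p)]

    words = [""] + list(alphabet)    # node number -> string of the dictionary trie
    hits, l, state, prev = [], 0, "", None
    for code in codes:
        # A code equal to len(words) names the node created by this very step.
        u = prev + prev[0] if code == len(words) else words[code]
        for d, p in output(state, u):
            hits.append((l + d, p))  # pattern p ends at position l + d
        state = next_state(state, u)
        l += len(u)
        if prev is not None:
            words.append(prev + u[0])   # update the dictionary trie
        prev = u
    return hits


codes = [1, 2, 4, 4, 5, 2, 3, 6, 9, 11]
print(sorted(search_lzw(codes, "abc", ["aba", "ababb", "abca", "bb"])))
```

The original text is never materialized as a whole; only the dictionary strings are touched, one code at a time.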

Ok! Let's go!

State Transition Function Next(q, u)

Next: Q × D → Q        O(m × |D|) !!

Next(q, u) = N1(q, u)     if u ∈ Factor(Π),     O(m × m^2)
             Next(0, u)   otherwise.            O(|D|)

Q: states of AC machine
D: strings represented by dictionary trie
m: total length of patterns
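The case split rests on one observation: when u is not a factor of any pattern, no pattern prefix can stretch across all of u, so the state reached after u does not depend on the starting state. A brute-force check of that claim on the example pattern set might look as follows. AC states are represented by the pattern prefixes they stand for, and the names `next_state` and `non_factor_words` are hypothetical.

```python
from itertools import product

PATTERNS = ["aba", "ababb", "abca", "bb"]
# AC states, identified with the pattern prefixes they represent.
STATES = sorted({p[:k] for p in PATTERNS for k in range(len(p) + 1)})
# Factor(PATTERNS): all non-empty substrings of the patterns.
FACTORS = {p[i:j] for p in PATTERNS
           for i in range(len(p)) for j in range(i + 1, len(p) + 1)}


def next_state(q, u):
    """Brute-force Next: longest suffix of q+u that is a prefix of a pattern."""
    s = q + u
    return next(s[k:] for k in range(len(s) + 1) if s[k:] in STATES)


def non_factor_words(max_len):
    """All strings over {a, b, c} up to max_len that are not factors."""
    for n in range(1, max_len + 1):
        for letters in product("abc", repeat=n):
            u = "".join(letters)
            if u not in FACTORS:
                yield u


ok = all(next_state(q, u) == next_state("", u)
         for u in non_factor_words(4) for q in STATES)
print(len(STATES), len(FACTORS), ok)   # 10 17 True
```

Only the 17 factors need a row that depends on q, which is where the table N1 comes from.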

Table of N1(q, u) --- O(m × m^2)

u \ state   0  1  2  3  4  5  6  7  8  9
a           1  1  3  1  3  1  7  1  1  1
b           8  2  9  4  5  9  8  2  9  9
c           0  0  6  0  6  0  0  0  0  0
ab          2  2  4  2  4  2  2  2  2  2
ba          1  3  1  3  1  1  1  3  1  1
bb          9  9  9  5  9  9  9  9  9  9
bc          0  6  0  6  0  0  0  6  0  0
ca          1  1  7  1  7  1  1  1  1  1
aba         3  3  3  3  3  3  3  3  3  3
abb         9  9  5  9  5  9  9  9  9  9
abc         6  6  6  6  6  6  6  6  6  6
bab         2  4  2  4  2  2  2  4  2  2
bca         1  7  1  7  1  1  1  7  1  1
abab        4  4  4  4  4  4  4  4  4  4
abca        7  7  7  7  7  7  7  7  7  7
babb        9  5  9  5  9  9  9  5  9  9
ababb       5  5  5  5  5  5  5  5  5  5

O(|D| + m^3)

State Transition Function Next(q, u)

Generalized Suffix Trie

Π = {aba, ababb, abca, bb}

[Figure: generalized suffix trie of Π with edges labeled a, b, c. Legend: explicit node, nonexplicit node. Explicit nodes: O(m); all nodes: O(m^2).]
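The legend does not define "explicit". The sketch below assumes the usual suffix-tree convention: a node is explicit if it spells a suffix of some pattern or has at least two children, and nonexplicit otherwise. Under that assumption the two kinds of nodes for the example set can be listed by brute force. The variable names are hypothetical and the root is left out.

```python
PATTERNS = ["aba", "ababb", "abca", "bb"]

# Every non-empty suffix of every pattern.
suffixes = {p[i:] for p in PATTERNS for i in range(len(p))}
# Non-root nodes of the generalized suffix trie = prefixes of those suffixes.
nodes = {s[:k] for s in suffixes for k in range(1, len(s) + 1)}


def children(node):
    """Nodes one symbol longer than `node` that extend it."""
    return {w for w in nodes if len(w) == len(node) + 1 and w.startswith(node)}


# Assumed definition: explicit = suffix of a pattern, or branching node.
explicit = {w for w in nodes if w in suffixes or len(children(w)) >= 2}
nonexplicit = nodes - explicit

print(len(nodes), len(explicit))
# 17 12
print(sorted(nonexplicit, key=lambda w: (len(w), w)))
# ['c', 'bc', 'abc', 'bab', 'abab']
```

Every nonexplicit node has a single outgoing path to an explicit descendant, so it is enough to store table rows for the explicit nodes.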

[The N1 table is shown again, with the rows for c, bc, abc, bab and abab styled differently from the rows for a, b, ab, ba, bb, ca, aba, abb, bca, abca, babb and ababb.]

O(|D| + m^3) → O(|D| + m^2)

State Transition Function Next(q, u)

[The N1 table is shown once more, with the rows for a, b, ab, ba, bb, ca, aba, abb, bca, abca, babb and ababb emphasized.]

Table of N1(q, u) --- O(m × m)

Ancestor(q, k): the ancestor of node q with distance k in the trie of AC machine.

[marked u]: one of the explicit descendants of node u in the generalized suffix trie.

Output Function

Output(q, u) = { 〈i, π〉 | 1 ≦ i ≦ |u|, π ∈ Π, and π is a suffix of string q・u[1..i] }

O(m × |D|) !!!

Let ũ be the longest prefix of u such that ũ is a suffix of some pattern.

[Figure: the string q followed by u, with pattern occurrences π1, π2, π3 ending at positions i inside u. The prefix ũ splits them into two groups:
– occurrences ending inside ũ: dependent on q, O(m^2)
– occurrences ending after ũ: independent of q, O(|D|)]
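One way to read the split is that Output(q, u) equals Output(q, ũ) together with the occurrences that lie entirely inside u and end after ũ. That reading is an interpretation of the slide, not a quotation; the sketch below checks it by brute force on the example pattern set. States are again represented by pattern prefixes, and the names `output`, `u_tilde` and `output_split` are hypothetical.

```python
from itertools import product

PATTERNS = ["aba", "ababb", "abca", "bb"]
STATES = {p[:k] for p in PATTERNS for k in range(len(p) + 1)}
SUFFIXES = {p[i:] for p in PATTERNS for i in range(len(p) + 1)}  # includes ""


def output(q, u):
    """Output(q, u) straight from the definition."""
    return {(i, p) for i in range(1, len(u) + 1)
            for p in PATTERNS if (q + u[:i]).endswith(p)}


def u_tilde(u):
    """Longest prefix of u that is a suffix of some pattern."""
    return next(u[:k] for k in range(len(u), -1, -1) if u[:k] in SUFFIXES)


def output_split(q, u):
    t = u_tilde(u)
    dependent = output(q, t)                      # the part that needs q
    independent = {(i, p) for (i, p) in output("", u) if i > len(t)}
    return dependent | independent


words = ["".join(w) for n in range(1, 6) for w in product("abc", repeat=n)]
print(all(output(q, u) == output_split(q, u) for q in STATES for u in words))
# True
```

The q-dependent part only ever sees strings ũ that are suffixes of patterns, so it fits in a table indexed by patterns alone, while the q-independent part can be stored once per dictionary node.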

But... Is it really fast?

Uhmm....

Experiment

◆ Method 1: Compressed Text → Decompression ! → Original Text → AC Machine
◆ Method 2: Compressed Text → Decompression ! → AC Machine (the figure shows the decoded string bcbababc and state 9)
◆ Method 3: Without Decompression: Compressed Text → Our Algorithm

Experiment

Original Text: "The Brown corpus", 6.8 Mbytes
Compressed Text: 3.4 Mbytes, produced by compress (UNIX command)
Language: C++ (gcc without optimization)
Machine: Sun SPARCstation 20

Result of the Experiment

[Figure: CPU time (s), axis 0 to 30, against occurrence rate (%), axis 0 to 25, with one curve each for Method 1, Method 2 and Method 3 (Our Algorithm). Occurrence rate = number of pattern occurrences / original text length.]

Conclusion

Previous Result
– deals with only a single pattern
– can find only the first occurrence of the pattern
– takes O(n + m^2) time and space
– no practical evaluation

Our Result
– deals with multiple patterns
– can find all occurrences of the patterns
– takes O(n + m^2) space, can answer in O(n + m^2 + r) time
– about twice as fast as decompression followed by the AC machine