Multiple Pattern Matching in LZW Compressed Text
Takuya KIDA, Masayuki TAKEDA, Ayumi SHINOHARA, Masamichi MIYAZAKI, Setsuo ARIKAWA
Department of Informatics, Kyushu University, Japan
[Figure: map of Japan marking Nagano and Fukuoka, with the authors' names.]
Our Goal
[Figure: Compressed Text -> Original Text -> Pattern Matching Machine, contrasted with Compressed Text -> New Machine!, which works on the compressed text directly.]
Previous studies
year   researcher                   compression method
1988   Eilam-Tsoreff and Vishkin    run-length
1992   Amir, Landau, and Vishkin    two-dimensional run-length
1992   Amir and Benson              two-dimensional run-length
1995   Farach and Thorup            LZ77
1996   Gasieniec, et al.            LZ77
1996   Amir, Benson and Farach      LZW
1997   Karpinski, et al.            straight-line programs
1997   Miyazaki, et al.             straight-line programs
Previous result vs Our result
Amir, Benson, and Farach's algorithm (JCSS 1996)
"Let sleeping files lie: Pattern matching in Z-compressed files"
- deals with only a single pattern.
- can find only the first occurrence of the pattern.
- takes O(n + m^2) time and space.
  n: length of the compressed text, m: length of the pattern.
Our algorithm
- deals with multiple patterns.
- can find all occurrences of the patterns.
- takes O(n + m^2 + r) time and O(n + m^2) space.
  m: total length of the patterns, r: number of pattern occurrences.
Lempel-Ziv-Welch compression
Σ = {a, b, c}
original text:   a b ab ab ba b c aba bc abab
compressed text: 1 2 4 4 5 2 3 6 9 11
Dictionary trie D (node: string): 0: root, 1: a, 2: b, 3: c, 4: ab, 5: ba, 6: aba, 7: abb, 8: bab, 9: bc, 10: ca, 11: abab, 12: bca
O(|D|) = O(n)
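The example on this slide can be reproduced with a small LZW encoder. This is a minimal sketch, assuming codes start at 1 for the alphabet symbols (as the node numbers on the slide suggest) and that every character of the text belongs to the alphabet; the function name `lzw_compress` is hypothetical.

```python
def lzw_compress(text, alphabet):
    """Return (codes, dictionary) for an LZW encoding of text.

    Assumption: single characters get codes 1..len(alphabet), and each new
    dictionary string gets the next free code, matching the slide's trie.
    """
    dictionary = {ch: i + 1 for i, ch in enumerate(alphabet)}
    codes = []
    w = ""
    for ch in text:
        if w + ch in dictionary:
            # Keep extending the current match along the dictionary trie.
            w += ch
        else:
            # Emit the longest match and add its one-character extension.
            codes.append(dictionary[w])
            dictionary[w + ch] = len(dictionary) + 1
            w = ch
    if w:
        codes.append(dictionary[w])
    return codes, dictionary


codes, dictionary = lzw_compress("abababbabcababcabab", "abc")
print(codes)  # [1, 2, 4, 4, 5, 2, 3, 6, 9, 11]
```

Each emitted code adds at most one dictionary entry, which is why the trie size is bounded by the length of the compressed text.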
KMP automaton
Pattern: abab
original text: a a b a b a a b b a b a b
[Figure: KMP automaton with states -1, 0, 1, 2, 3, 4; goto edges a, b, a, b lead from state 0 to state 4, which outputs {abab}; the edge from state -1 is labeled Σ. The automaton is run over the original text and reports "found!" twice.]
Legend: goto function, failure function, { }: output
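The run shown on this slide can be sketched in a few lines. This is an illustrative KMP matcher, not code from the talk; it uses -1 as the failure target of state 0, as the slide's automaton does, and it assumes a non-empty pattern. The names `kmp_failure` and `kmp_search` are hypothetical.

```python
def kmp_failure(pattern):
    """fail[j] is the state to fall back to from state j; fail[0] is -1."""
    fail = [-1] * (len(pattern) + 1)
    k = -1
    for j in range(len(pattern)):
        while k >= 0 and pattern[k] != pattern[j]:
            k = fail[k]
        k += 1
        fail[j + 1] = k
    return fail


def kmp_search(text, pattern):
    """Return the 1-based end positions of all occurrences of pattern."""
    fail = kmp_failure(pattern)
    state, ends = 0, []
    for pos, ch in enumerate(text, start=1):
        # Follow failure links until ch can be read (or state -1 is reached).
        while state >= 0 and (state == len(pattern) or pattern[state] != ch):
            state = fail[state]
        state += 1
        if state == len(pattern):
            ends.append(pos)
    return ends


print(kmp_search("aababaabbabab", "abab"))  # [5, 13]
```

The two reported end positions correspond to the two "found!" marks in the figure.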
Basic Idea (Amir et al.)
Pattern: abab
[Figure: KMP automaton for abab (states -1, 0, 1, 2, 3, 4; output {abab} at state 4) drawn next to the dictionary trie, whose strings include a, b, c, ab, ba, bc, ca, aba, bab, abc, bca, abab, abca.]
Each code of the compressed text is read as a single step: Next(q, u) is the state reached from state q after reading the string u.
Next(0, bab) = 2
Next(2, abc) = 0
Who is watching the occurrences of the pattern?!
Output(2, abc) = { 〈2, abab〉 }
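The two functions can be checked against the slide's numbers by simulating the KMP automaton one character at a time. This is only a definition-level sketch: it spends time proportional to |u| per call, which is exactly what the actual algorithm avoids. The helper name `next_and_output` is hypothetical.

```python
def kmp_failure(pattern):
    """fail[j] is the state to fall back to from state j; fail[0] is -1."""
    fail = [-1] * (len(pattern) + 1)
    k = -1
    for j in range(len(pattern)):
        while k >= 0 and pattern[k] != pattern[j]:
            k = fail[k]
        k += 1
        fail[j + 1] = k
    return fail


def next_and_output(pattern, state, u):
    """Return (Next(state, u), Output(state, u)) by direct simulation.

    Output is a list of pairs (i, pattern): the pattern ends at the i-th
    character of u when the automaton starts reading u in the given state.
    """
    fail = kmp_failure(pattern)
    out = []
    for i, ch in enumerate(u, start=1):
        while state >= 0 and (state == len(pattern) or pattern[state] != ch):
            state = fail[state]
        state += 1
        if state == len(pattern):
            out.append((i, pattern))
    return state, out


print(next_and_output("abab", 0, "bab"))  # (2, [])
print(next_and_output("abab", 2, "abc"))  # (0, [(2, 'abab')])
```

The second call shows why Next alone is not enough: the state after abc is 0, yet an occurrence ended in the middle of the string, so Output has to record it.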
for Multiple Patterns
Aho-Corasick Pattern Matching Machine
Patterns: Π = {aba, ababb, abca, bb}
[Figure: AC machine with states 0 to 9. Goto edges: 0 -a-> 1 -b-> 2 -a-> 3 -b-> 4 -b-> 5; 2 -c-> 6 -a-> 7; 0 -b-> 8 -b-> 9. Outputs: state 3 {aba}, state 5 {ababb, bb}, state 7 {abca}, state 9 {bb}.]
Legend: goto function, failure function, { }: output
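The machine in the figure can be rebuilt with the textbook Aho-Corasick construction. This is a sketch, assuming states are numbered in the order in which they are created while inserting the patterns, which happens to reproduce the figure's numbering for this Π; `build_ac` and `ac_search` are hypothetical names.

```python
from collections import deque


def build_ac(patterns):
    """Return (goto, fail, out) for the Aho-Corasick machine of patterns."""
    goto, out = [{}], [set()]
    for p in patterns:
        s = 0
        for ch in p:
            if ch not in goto[s]:
                goto.append({})
                out.append(set())
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        out[s].add(p)
    # Breadth-first computation of failure links; depth-1 states fail to 0.
    fail = [0] * len(goto)
    queue = deque(goto[0].values())
    while queue:
        s = queue.popleft()
        for ch, t in goto[s].items():
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            out[t] |= out[fail[t]]  # inherit outputs along the failure link
            queue.append(t)
    return goto, fail, out


def ac_search(patterns, text):
    """Return (end position, pattern) pairs, positions being 1-based."""
    goto, fail, out = build_ac(patterns)
    s, hits = 0, []
    for pos, ch in enumerate(text, start=1):
        while s and ch not in goto[s]:
            s = fail[s]
        s = goto[s].get(ch, 0)
        hits.extend((pos, p) for p in sorted(out[s]))
    return hits


print(ac_search(["aba", "ababb", "abca", "bb"], "ababb"))
# [(3, 'aba'), (5, 'ababb'), (5, 'bb')]
```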
Our Algorithm
Input. Π: set of patterns; u1, u2, …, un: LZW compressed text.
Output. All occurrences of the patterns.

Construct from Π the AC machine and the generalized suffix trie;
Initialize the dictionary trie, Next and Output;
l := 0; state := q0;
for i := 1 to n do begin
    for each 〈d, π〉 ∈ Output(state, ui) do
        report "pattern π occurs at position l + d";
    state := Next(state, ui);
    l := l + |ui|;
    Update the dictionary trie, Next and Output
end.

Costs marked on the slide: O(m^2) for the construction from Π; O(n + r) and O(n) inside the main loop.
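The control flow of the main loop can be sketched as follows. This is deliberately a naive stand-in and not the authors' data structures: Next and Output are computed by running the AC machine over the decoded string of each code and memoized per (state, code) pair, and dictionary strings are stored explicitly, so the sketch does not achieve the O(n + m^2 + r) bound. It assumes codes start at 1 for the alphabet symbols; `search_lzw` is a hypothetical name.

```python
from collections import deque


def build_ac(patterns):
    """Aho-Corasick machine: goto edges, failure links, output sets."""
    goto, out = [{}], [set()]
    for p in patterns:
        s = 0
        for ch in p:
            if ch not in goto[s]:
                goto.append({})
                out.append(set())
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        out[s].add(p)
    fail = [0] * len(goto)
    queue = deque(goto[0].values())
    while queue:
        s = queue.popleft()
        for ch, t in goto[s].items():
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            out[t] |= out[fail[t]]
            queue.append(t)
    return goto, fail, out


def search_lzw(codes, alphabet, patterns):
    """Report (end position, pattern) pairs while reading LZW codes."""
    goto, fail, out = build_ac(patterns)
    words = {i + 1: ch for i, ch in enumerate(alphabet)}  # code -> string
    memo = {}  # (state, code) -> (Next, Output), filled lazily

    def next_output(state, code):
        if (state, code) not in memo:
            s, found = state, []
            for d, ch in enumerate(words[code], start=1):
                while s and ch not in goto[s]:
                    s = fail[s]
                s = goto[s].get(ch, 0)
                found.extend((d, p) for p in sorted(out[s]))
            memo[(state, code)] = (s, found)
        return memo[(state, code)]

    hits, l, state, prev = [], 0, 0, None
    for code in codes:
        if prev is not None:
            # Decoder-side dictionary update: previous string plus the first
            # character of the current one (or of itself, if code is new).
            first = words[code][0] if code in words else words[prev][0]
            words[len(words) + 1] = words[prev] + first
        nxt, found = next_output(state, code)
        hits.extend((l + d, p) for d, p in found)  # "occurs at position l + d"
        state = nxt
        l += len(words[code])
        prev = code
    return hits


hits = search_lzw([1, 2, 4, 4, 5, 2, 3, 6, 9, 11], "abc",
                  ["aba", "ababb", "abca", "bb"])
print(hits)
```

The reported positions are end positions in the original text, so the result can be compared with a plain scan of the decompressed text.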
State Transition Function Next(q, u)
Next: Q × D → Q; a naive table takes O(m × |D|) !!
Next(q, u) = N1(q, u)     if u ∈ Factor(Π),     table: O(m × m^2)
           = Next(0, u)   otherwise.            O(|D|)
Q: states of the AC machine
D: strings represented by the dictionary trie
m: total length of the patterns

Table of N1(q, u): O(m × m^2)
(rows: u ∈ Factor(Π); columns: state q)
u \ q    0  1  2  3  4  5  6  7  8  9
a        1  1  3  1  3  1  7  1  1  1
b        8  2  9  4  5  9  8  2  9  9
c        0  0  6  0  6  0  0  0  0  0
ab       2  2  4  2  4  2  2  2  2  2
ba       1  3  1  3  1  1  1  3  1  1
bb       9  9  9  5  9  9  9  9  9  9
bc       0  6  0  6  0  0  0  6  0  0
ca       1  1  7  1  7  1  1  1  1  1
aba      3  3  3  3  3  3  3  3  3  3
abb      9  9  5  9  5  9  9  9  9  9
abc      6  6  6  6  6  6  6  6  6  6
bab      2  4  2  4  2  2  2  4  2  2
bca      1  7  1  7  1  1  1  7  1  1
abab     4  4  4  4  4  4  4  4  4  4
abca     7  7  7  7  7  7  7  7  7  7
babb     9  5  9  5  9  9  9  5  9  9
ababb    5  5  5  5  5  5  5  5  5  5

Total: O(|D| + m^3)
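The table can be regenerated from the definition of an AC state as the longest suffix of the text read so far that is a prefix of some pattern. The sketch below is a brute-force check rather than an efficient construction; it assumes states are numbered in trie-insertion order, which matches the numbering used on the slides for this Π. The names `ac_states`, `factors` and `n1_table` are hypothetical.

```python
def ac_states(patterns):
    """Map each pattern prefix to its state number (0 is the empty string)."""
    states = {"": 0}
    for p in patterns:
        for k in range(1, len(p) + 1):
            states.setdefault(p[:k], len(states))
    return states


def factors(patterns):
    """Factor(Π): all non-empty substrings of the patterns."""
    return {p[i:j] for p in patterns
            for i in range(len(p)) for j in range(i + 1, len(p) + 1)}


def n1_table(patterns):
    """table[u][q] = state reached from state q after reading factor u."""
    states = ac_states(patterns)
    prefix_of = sorted(states, key=states.get)  # state number -> its string
    table = {}
    for u in sorted(factors(patterns), key=lambda s: (len(s), s)):
        row = []
        for q_str in prefix_of:
            s = q_str + u
            # Longest suffix of q·u that is a prefix of some pattern; the
            # empty suffix always qualifies, so the search cannot fail.
            k = next(k for k in range(len(s) + 1) if s[k:] in states)
            row.append(states[s[k:]])
        table[u] = row
    return table


table = n1_table(["aba", "ababb", "abca", "bb"])
for u, row in table.items():
    print(f"{u:6}", *row)
```

With 17 factors and 10 states the table is small here, but in general there are O(m^2) factors and O(m) states, which is where the m^3 term comes from.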
State Transition Function Next(q, u)
Generalized Suffix Trie
Π = {aba, ababb, abca, bb}
[Figure: generalized suffix trie of Π, with edges labeled a, b, c.]
explicit nodes: O(m)
nonexplicit nodes: O(m^2)
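The test u ∈ Factor(Π) amounts to walking down the generalized suffix trie. The following sketch builds the uncompacted trie with nested dictionaries, so it has one node per factor and makes no distinction between explicit and nonexplicit nodes; the function names are hypothetical.

```python
def build_suffix_trie(patterns):
    """Insert every suffix of every pattern into one trie of nested dicts."""
    root = {}
    for p in patterns:
        for i in range(len(p)):
            node = root
            for ch in p[i:]:
                node = node.setdefault(ch, {})
    return root


def is_factor(trie, u):
    """True if u labels a path from the root, i.e. u is in Factor(Π)."""
    node = trie
    for ch in u:
        if ch not in node:
            return False
        node = node[ch]
    return True


def count_nodes(trie):
    """Number of trie nodes, including the root."""
    return 1 + sum(count_nodes(child) for child in trie.values())


trie = build_suffix_trie(["aba", "ababb", "abca", "bb"])
print(count_nodes(trie))       # 18: the root plus 17 factors
print(is_factor(trie, "bab"))  # True
print(is_factor(trie, "cb"))   # False
```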
[Slide repeats the table of N1(q, u) for Π = {aba, ababb, abca, bb}.]
O(|D| + m^3) is reduced to O(|D| + m^2)
State Transition Function Next(q, u)
[Slide repeats the table of N1(q, u) for Π = {aba, ababb, abca, bb}.]
Table of N1(q, u): O(m × m)
Ancestor(q, k): the ancestor of node q with distance k in the trie of the AC machine.
u: one of the explicit descendants of node u in the generalized suffix trie.
Output Function
Output(q, u) = { 〈i, π〉 | 1 ≤ i ≤ |u|, π ∈ Π, and π is a suffix of string q·u[1..i] }
A naive table takes O(m × |D|) !!!
[Figure: the string q followed by u. Occurrences that begin inside q (π1) are dependent on q; occurrences that lie entirely inside u (π2, π3) are independent of q. The prefix ũ of u is marked.]
dependent on q: O(m^2); independent of q: O(|D|)
Let ũ be the longest prefix of u such that ũ is a suffix of some pattern.
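The role of ũ can be illustrated by computing Output(q, u) twice: once straight from the definition, and once split into a part that needs q (positions inside ũ) and a part that does not (positions after ũ). This is a brute-force sketch meant only to show that the split is consistent; states are represented by the strings they spell, and `u_tilde`, `output` and `output_split` are hypothetical names.

```python
def u_tilde(patterns, u):
    """Longest prefix of u that is a suffix of some pattern (may be empty)."""
    for k in range(len(u), 0, -1):
        if any(p.endswith(u[:k]) for p in patterns):
            return u[:k]
    return ""


def output(patterns, q_str, u):
    """Output(q, u) from the definition; q_str is the string of state q."""
    return {(i, p) for i in range(1, len(u) + 1) for p in patterns
            if (q_str + u[:i]).endswith(p)}


def output_split(patterns, q_str, u):
    """Same set, computed as a q-dependent part plus a q-independent part."""
    t = u_tilde(patterns, u)
    # An occurrence that reaches back into q must end inside the prefix ũ.
    dependent = output(patterns, q_str, t)
    # After ũ, every occurrence lies entirely inside u, whatever q is.
    independent = {(i, p) for i in range(len(t) + 1, len(u) + 1)
                   for p in patterns if u[:i].endswith(p)}
    return dependent | independent


patterns = ["aba", "ababb", "abca", "bb"]
print(u_tilde(patterns, "bab"))                   # ba
print(sorted(output_split(patterns, "a", "bab")))  # [(2, 'aba')]
```

The q-independent part can be attached to the dictionary trie node of u once, while the q-dependent part only ever involves suffixes of patterns.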
Experiment
◆ Method 1: Decompression, then AC machine
◆ Method 2: Decompression, then AC machine
◆ Method 3: Without decompression (our algorithm)
[Figure: three pipelines starting from Compressed Text. Methods 1 and 2 are each labeled "Decompression! AC Machine", and an Original Text box appears once as an intermediate; Method 3 feeds the compressed text directly to our algorithm.]
Experiment
Original Text: "The Brown corpus", 6.8 Mbytes
Compressed Text: 3.4 Mbytes, produced by compress (UNIX command)
Language: C++ (gcc without optimization)
Machine: Sun SPARCstation 20
Result of the Experiment
[Figure: CPU time (s), axis from 0 to 30, plotted against occurrence rate (%), axis from 0 to 25, for Method 1, Method 2, and Method 3 (our algorithm). Occurrence rate = number of pattern occurrences / original text length.]
Conclusion
Previous Result
- deals with only a single pattern
- can find only the first occurrence of the pattern
- takes O(n + m^2) time and space
- no practical evaluation
Our Result
- deals with multiple patterns
- can find all occurrences of the patterns
- takes O(n + m^2) space and can answer in O(n + m^2 + r) time
- about twice as fast as decompression followed by the AC machine