![Page 1: Compressed Index for Dictionary Matching](https://reader033.vdocuments.site/reader033/viewer/2022051418/56815295550346895dc0bdfc/html5/thumbnails/1.jpg)
Compressed Index for Dictionary Matching
WK Hon (NTHU), TW Lam (HKU), R Shah (LSU), SL Tam (HKU), JS Vitter (Purdue)
![Page 2: Compressed Index for Dictionary Matching](https://reader033.vdocuments.site/reader033/viewer/2022051418/56815295550346895dc0bdfc/html5/thumbnails/2.jpg)
Outline
• Dictionary Matching Problem
• Summary of Results
• Description of Our Solution (Brief): Based on (I) Suffix Tree, (II) A Simple Sampling Idea, (III) Handling Irregularities
• Open Problems
![Page 3: Compressed Index for Dictionary Matching](https://reader033.vdocuments.site/reader033/viewer/2022051418/56815295550346895dc0bdfc/html5/thumbnails/3.jpg)
Dictionary Matching
• Input: A set of d short patterns, { P1, P2, …, Pd }, of total length n
• Problem: Preprocess the patterns and create an index so that, on receiving any text T, we can report, for each Pj, all positions in T where it occurs
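To make the required output concrete, here is a minimal brute-force sketch of the problem statement. It builds no index at all and is only meant as a reference for what a dictionary-matching index must report. The function name `naive_dictionary_match` and the 0-based positions are assumptions made for illustration; they do not come from the slides.

```python
from typing import Dict, List


def naive_dictionary_match(patterns: List[str], text: str) -> Dict[str, List[int]]:
    """Report, for each pattern, every 0-based start position in text."""
    result: Dict[str, List[int]] = {p: [] for p in patterns}
    for p in patterns:
        start = text.find(p)
        while start != -1:
            result[p].append(start)
            # Resume one character later so overlapping occurrences are found.
            start = text.find(p, start + 1)
    return result


patterns = ["ate", "chair", "chat", "hat", "have", "vet"]
occurrences = naive_dictionary_match(patterns, "chathave")
```

Here `occurrences` maps "chat" to [0], "hat" to [1] and "have" to [4]; the other three patterns map to empty lists.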
![Page 4: Compressed Index for Dictionary Matching](https://reader033.vdocuments.site/reader033/viewer/2022051418/56815295550346895dc0bdfc/html5/thumbnails/4.jpg)
Dictionary Matching
• Relevant parameters to measure the index's performance:
d = # of patterns
n = total length of patterns
|T| = length of T
σ = size of alphabet of T and patterns
occ = total occurrences in search result
![Page 5: Compressed Index for Dictionary Matching](https://reader033.vdocuments.site/reader033/viewer/2022051418/56815295550346895dc0bdfc/html5/thumbnails/5.jpg)
Summary of Results
Space (bits)                 Search Time                          Ref
O( n log n )                 O( |T| + occ )                       [AC 75]
O( n ) when σ = constant     O( (|T| + occ) log^2 n )             [CHLS 07]
O( n log σ )                 O( |T| log log n + occ )             this work
(1 + o(1)) n log σ           O( |T| (log^ε n + log d) + occ )     this work
optimal: |patterns| + o( n log σ )
ε = constant in (0,1)
![Page 6: Compressed Index for Dictionary Matching](https://reader033.vdocuments.site/reader033/viewer/2022051418/56815295550346895dc0bdfc/html5/thumbnails/6.jpg)
Existing Solution I: Patricia Trie
• Compact trie storing all d patterns
[Figure: Patricia trie for { ate, chair, chat, hat, have, vet }]
![Page 7: Compressed Index for Dictionary Matching](https://reader033.vdocuments.site/reader033/viewer/2022051418/56815295550346895dc0bdfc/html5/thumbnails/7.jpg)
Existing Solution I: Patricia Trie
• Advantage: Space: |patterns| + O( d log n ) bits
Very small overhead in addition to the input patterns
![Page 8: Compressed Index for Dictionary Matching](https://reader033.vdocuments.site/reader033/viewer/2022051418/56815295550346895dc0bdfc/html5/thumbnails/8.jpg)
Existing Solution I: Patricia Trie
Searching Strategy: For each position k in T
• Match T from the root starting at k
• Report occurrence of any Pj found
Disadvantage: Searching: worst-case O(|T|n + occ) time
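The searching strategy above can be sketched as follows. This is a simplified illustration, not the authors' structure: it uses a plain (uncompacted) trie with one character per edge instead of a Patricia trie, and the names `TrieNode`, `build_trie` and `search` are hypothetical. The restart from the root at every position k is what makes the worst case slow.

```python
from typing import Dict, List, Tuple


class TrieNode:
    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.pattern_ids: List[int] = []  # indices of patterns ending at this node


def build_trie(patterns: List[str]) -> TrieNode:
    """Insert every pattern character by character (no path compression)."""
    root = TrieNode()
    for j, p in enumerate(patterns):
        node = root
        for ch in p:
            node = node.children.setdefault(ch, TrieNode())
        node.pattern_ids.append(j)
    return root


def search(root: TrieNode, text: str) -> List[Tuple[int, int]]:
    """Return (k, j) pairs: pattern j occurs in text at 0-based position k."""
    hits: List[Tuple[int, int]] = []
    for k in range(len(text)):          # for each position k in T
        node = root                     # match from the root starting at k
        i = k
        while i < len(text) and text[i] in node.children:
            node = node.children[text[i]]
            i += 1
            for j in node.pattern_ids:  # report any pattern found
                hits.append((k, j))
    return hits


patterns = ["ate", "chair", "chat", "hat", "have", "vet"]
hits = search(build_trie(patterns), "chathave")
```

For the text "chathave", `hits` is [(0, 2), (1, 3), (4, 4)], that is, "chat" at position 0, "hat" at position 1 and "have" at position 4.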
![Page 9: Compressed Index for Dictionary Matching](https://reader033.vdocuments.site/reader033/viewer/2022051418/56815295550346895dc0bdfc/html5/thumbnails/9.jpg)
Existing Solution II: Suffix Tree
• Compact trie storing all suffixes of all d patterns
[Figure: suffix tree for { ate, chair, chat, hat, have, vet }]
![Page 10: Compressed Index for Dictionary Matching](https://reader033.vdocuments.site/reader033/viewer/2022051418/56815295550346895dc0bdfc/html5/thumbnails/10.jpg)
Existing Solution II: Suffix Tree
Same Searching Strategy: For each position k in T
• Match T from the root starting at k
• Report occurrence of any Pj found
Matching Time = O(|T|)
Searching: worst-case O(|T| + occ) time
![Page 11: Compressed Index for Dictionary Matching](https://reader033.vdocuments.site/reader033/viewer/2022051418/56815295550346895dc0bdfc/html5/thumbnails/11.jpg)
Existing Solution II: Suffix Tree
Disadvantage: Space: O( n log n ) bits
could be much larger than O( n log σ ), the space for |patterns|
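A small worked calculation shows how large the gap between the two bounds can be. The values n = 2^20 and σ = 4 are hypothetical, and the constants hidden inside the O-notation are ignored, so the ratio is only indicative.

```python
import math


def pointer_based_bits(n: int) -> float:
    """n log n: order of magnitude of an index storing log n bits per character."""
    return n * math.log2(n)


def raw_pattern_bits(n: int, sigma: int) -> float:
    """n log sigma: bits needed to write n characters over an alphabet of size sigma."""
    return n * math.log2(sigma)


n, sigma = 2**20, 4  # hypothetical: about a million characters over a 4-letter alphabet
ratio = pointer_based_bits(n) / raw_pattern_bits(n, sigma)
```

With these values `ratio` is 10.0: the n log n term is ten times the size of the patterns themselves, and the factor log n / log σ keeps growing with n.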
![Page 12: Compressed Index for Dictionary Matching](https://reader033.vdocuments.site/reader033/viewer/2022051418/56815295550346895dc0bdfc/html5/thumbnails/12.jpg)
Our Solution
no suffixes: poor searching
all suffixes: poor space
some suffixes: good space + searching
![Page 13: Compressed Index for Dictionary Matching](https://reader033.vdocuments.site/reader033/viewer/2022051418/56815295550346895dc0bdfc/html5/thumbnails/13.jpg)
Our Solution: Sampling
• Store only one suffix out of every few suffixes (the sampling factor)
[Figure: sampled suffix tree, with sampling factor = 2, for { ate, chair, chat, hat, have, vet }]
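The effect of sampling on the number of stored suffixes can be sketched as below. The slides do not state which suffixes are kept, so this sketch assumes they are taken at offsets 0, step, 2·step, … from the start of each pattern; the function names and the parameter `step` (standing for the sampling factor) are hypothetical.

```python
from typing import List


def all_suffixes(patterns: List[str]) -> List[str]:
    """Every non-empty suffix of every pattern (what the full suffix tree stores)."""
    return [p[i:] for p in patterns for i in range(len(p))]


def sampled_suffixes(patterns: List[str], step: int) -> List[str]:
    """Assumed sampling rule: keep the suffixes starting at offsets 0, step, 2*step, ..."""
    return [p[i:] for p in patterns for i in range(0, len(p), step)]


patterns = ["ate", "chair", "chat", "hat", "have", "vet"]
full = all_suffixes(patterns)
half = sampled_suffixes(patterns, 2)
```

Under this assumption the full structure holds 22 suffixes and the sampled one holds 13; for "chair" the kept suffixes are "chair", "air" and "r".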
![Page 14: Compressed Index for Dictionary Matching](https://reader033.vdocuments.site/reader033/viewer/2022051418/56815295550346895dc0bdfc/html5/thumbnails/14.jpg)
Our Solution: Sampling
• Store only one suffix out of every few suffixes (the sampling factor)
[Figure: sampled suffix tree, with sampling factor = 2, for { ate, chair, chat, hat, have, vet }, annotated "irregularities"]
![Page 15: Compressed Index for Dictionary Matching](https://reader033.vdocuments.site/reader033/viewer/2022051418/56815295550346895dc0bdfc/html5/thumbnails/15.jpg)
Our Solution: Sampling
Same Searching Strategy: For each position k in T
• Match T from the root starting at k
• Report occurrence of any Pj found
Need to handle irregularities
Matching time = O(|T|) despite irregularities
![Page 16: Compressed Index for Dictionary Matching](https://reader033.vdocuments.site/reader033/viewer/2022051418/56815295550346895dc0bdfc/html5/thumbnails/16.jpg)
Handling irregularities
When the sampling factor = log n
• predecessor search in a set of (log n)-bit integers, using a Y-fast trie
Search: O(|T| log log n + occ) time
Space: O( n log σ ) bits
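The predecessor-search primitive named on this slide can be illustrated with a simple stand-in. The sketch below uses binary search over a sorted list via Python's `bisect` module, which costs O(log m) per query for m keys; it is not a Y-fast trie and does not achieve the O(log log n) query time behind the bound above. It also assumes "predecessor" means the largest key less than or equal to the query.

```python
import bisect
from typing import List, Optional


def predecessor(sorted_keys: List[int], x: int) -> Optional[int]:
    """Largest key <= x in sorted_keys, or None if every key is larger than x."""
    i = bisect.bisect_right(sorted_keys, x)  # index of the first key > x
    return sorted_keys[i - 1] if i > 0 else None


keys = sorted([27, 3, 14, 9])  # hypothetical integer keys
```

For example, `predecessor(keys, 10)` returns 9, `predecessor(keys, 27)` returns 27, and `predecessor(keys, 2)` returns None.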
![Page 17: Compressed Index for Dictionary Matching](https://reader033.vdocuments.site/reader033/viewer/2022051418/56815295550346895dc0bdfc/html5/thumbnails/17.jpg)
Handling irregularities
When the sampling factor = (log n) / log σ
• predecessor search in a set of (log n)-bit strings, using a String B-tree
Search: O(|T| (log^ε n + log d) + occ) time
Space: |patterns| + o( n log σ ) bits
![Page 18: Compressed Index for Dictionary Matching](https://reader033.vdocuments.site/reader033/viewer/2022051418/56815295550346895dc0bdfc/html5/thumbnails/18.jpg)
Handling irregularities
When the sampling factor = (log n) / log σ
• predecessor search in a set of (log n)-bit strings, using a String B-tree
Search: O(|T| (log^ε n + log d) + occ) time
Space: n Hk + o( n log σ ) bits [FerVen 07]
![Page 19: Compressed Index for Dictionary Matching](https://reader033.vdocuments.site/reader033/viewer/2022051418/56815295550346895dc0bdfc/html5/thumbnails/19.jpg)
Open Problems
• Compressed + Dynamic Version: Can an index support updates to the set of patterns? Target: Achieve nHk-type space bound
• External Memory Version: Can an index operate in external memory and still support fast searching?