A Pivotal Prefix Based Filtering Algorithm for String Similarity Search
![Page 1: A Pivotal Prefix Based Filtering Algorithm for String Similarity Search](https://reader035.vdocuments.site/reader035/viewer/2022062809/5681578b550346895dc51de5/html5/thumbnails/1.jpg)
Dong Deng, Guoliang Li, Jianhua Feng
Database Group, Tsinghua University
Presented by Dong Deng
A Pivotal Prefix Based Filtering Algorithm for String Similarity Search
![Page 2: A Pivotal Prefix Based Filtering Algorithm for String Similarity Search](https://reader035.vdocuments.site/reader035/viewer/2022062809/5681578b550346895dc51de5/html5/thumbnails/2.jpg)
Search is Important
Source: http://www.internetlivestats.com/google-search-statistics/
Google Searches per Year
![Page 3: A Pivotal Prefix Based Filtering Algorithm for String Similarity Search](https://reader035.vdocuments.site/reader035/viewer/2022062809/5681578b550346895dc51de5/html5/thumbnails/3.jpg)
Speed Matters
Source:
![Page 4: A Pivotal Prefix Based Filtering Algorithm for String Similarity Search](https://reader035.vdocuments.site/reader035/viewer/2022062809/5681578b550346895dc51de5/html5/thumbnails/4.jpg)
Data is Dirty
• Typos: “Argyrios Zymnis” vs. “Argyris Zymnis”
• Typo in “title”: “relaxed” vs. “related”
Example source: DBLP Complete Search
![Page 5: A Pivotal Prefix Based Filtering Algorithm for String Similarity Search](https://reader035.vdocuments.site/reader035/viewer/2022062809/5681578b550346895dc51de5/html5/thumbnails/5.jpg)
Similarity Search
Query
String Dataset
All the strings similar to the query
![Page 6: A Pivotal Prefix Based Filtering Algorithm for String Similarity Search](https://reader035.vdocuments.site/reader035/viewer/2022062809/5681578b550346895dc51de5/html5/thumbnails/6.jpg)
Edit Distance
• ED(r, s): the minimum number of edit operations (insertion, deletion, substitution) needed to transform r into s.
• For example: ED(sigcom, sigmod) = 2
sigcom → sigmom (substitute c with m) → sigmod (substitute m with d)
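The definition above can be checked with the textbook dynamic program. This is a minimal sketch, not the authors' verification code; the function name `edit_distance` is chosen here for illustration.

```python
def edit_distance(r, s):
    """Classic O(|r|*|s|) dynamic program, keeping one row at a time."""
    prev = list(range(len(s) + 1))          # distance from "" to each prefix of s
    for i, cr in enumerate(r, 1):
        cur = [i]                           # distance from r[:i] to ""
        for j, cs in enumerate(s, 1):
            cur.append(min(prev[j] + 1,                 # delete cr
                           cur[j - 1] + 1,              # insert cs
                           prev[j - 1] + (cr != cs)))   # substitute or match
        prev = cur
    return prev[-1]

print(edit_distance("sigcom", "sigmod"))
```

The call reproduces the slide's example: two substitutions turn sigcom into sigmod.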
![Page 7: A Pivotal Prefix Based Filtering Algorithm for String Similarity Search](https://reader035.vdocuments.site/reader035/viewer/2022062809/5681578b550346895dc51de5/html5/thumbnails/7.jpg)
Problem Definition
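Given a string dataset R, a query string s and a threshold τ, find every string r in R with ED(r, s) ≤ τ.
Example: query string s = “yotubecom” and τ = 2
ED(s, r4) ≤ 2, so r4 in the string dataset R is output as a result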
![Page 8: A Pivotal Prefix Based Filtering Algorithm for String Similarity Search](https://reader035.vdocuments.site/reader035/viewer/2022062809/5681578b550346895dc51de5/html5/thumbnails/8.jpg)
Application
• Spell Checking
• Copy Detection
• Entity Linking
• Bioinformatics
• ….
![Page 9: A Pivotal Prefix Based Filtering Algorithm for String Similarity Search](https://reader035.vdocuments.site/reader035/viewer/2022062809/5681578b550346895dc51de5/html5/thumbnails/9.jpg)
Challenge
Naïve Method
Time complexity: for each query
![Page 10: A Pivotal Prefix Based Filtering Algorithm for String Similarity Search](https://reader035.vdocuments.site/reader035/viewer/2022062809/5681578b550346895dc51de5/html5/thumbnails/10.jpg)
Filter-and-Verification Framework
Inputs: dataset R (indexed), threshold τ, query string s
Filter: Signature(s) ∩ Signature(r) = ϕ? If yes, r is pruned.
Verify: if no, check ED(r, s) ≤ τ? If yes, r is added to the results.
![Page 11: A Pivotal Prefix Based Filtering Algorithm for String Similarity Search](https://reader035.vdocuments.site/reader035/viewer/2022062809/5681578b550346895dc51de5/html5/thumbnails/11.jpg)
Preliminary: q-gram
• q-gram: a substring of length q
• Example: the 2-grams of “youtbecom” are yo, ou, ut, tb, be, ec, co, om
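Extracting q-grams is a sliding window over the string. The helper below is an illustrative sketch; the name `qgrams` is not from the slides.

```python
def qgrams(s, q):
    """Return all substrings of length q, in positional order."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]

print(qgrams("youtbecom", 2))
```

A string of length l has l − q + 1 q-grams, so “youtbecom” has 8 2-grams.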
![Page 12: A Pivotal Prefix Based Filtering Algorithm for String Similarity Search](https://reader035.vdocuments.site/reader035/viewer/2022062809/5681578b550346895dc51de5/html5/thumbnails/12.jpg)
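Preliminary: q-gram
• One edit operation destroys at most q grams.
• τ edit operations destroy at most qτ grams.
• If r and s have more than qτ mismatched grams, then ED(r, s) > τ.
Example (q = 2): one edit on the character b of “youtbecom” destroys only the grams tb and be; yo, ou, ut, ec, co and om survive.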
![Page 13: A Pivotal Prefix Based Filtering Algorithm for String Similarity Search](https://reader035.vdocuments.site/reader035/viewer/2022062809/5681578b550346895dc51de5/html5/thumbnails/13.jpg)
Preliminary: Prefix Filter
Sort all q-grams by a global ordering, such as idf
q(r): the sorted q-gram set of string r
q(s): the sorted q-gram set of string s
Pre(•) is the prefix of q(•), with |Pre(•)| = qτ+1
Prefix Filter: if Pre(r) ∩ Pre(s) = ϕ, then ED(r, s) > τ
![Page 14: A Pivotal Prefix Based Filtering Algorithm for String Similarity Search](https://reader035.vdocuments.site/reader035/viewer/2022062809/5681578b550346895dc51de5/html5/thumbnails/14.jpg)
Preliminary: Prefix Filter (example)
Prefix Filter: if Pre(r) ∩ Pre(s) = ϕ, then ED(r, s) > τ
[Figure: two sorted q-gram sets, (g1, g2, g5, g6, g11, g12, g13) and (g3, g4, g7, g8, g9, g10, g12), with Pre(r), Pre(s) and suffix(r) marked; the grams beyond the prefix are labelled “> g10”.]
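The prefix filter can be sketched in a few lines. Grams are represented here by their rank in the global order (1 for g1, 2 for g2, and so on), and which of the two example sets belongs to r and which to s is an assumption; the function names are illustrative.

```python
def prefix(sorted_grams, q, tau):
    """First q*tau + 1 grams of a gram list already sorted by the global order."""
    return sorted_grams[: q * tau + 1]

def prefix_filter_prunes(qr, qs, q, tau):
    """True if the prefixes are disjoint, which implies ED(r, s) > tau."""
    return not (set(prefix(qr, q, tau)) & set(prefix(qs, q, tau)))

qr = [1, 2, 5, 6, 11, 12, 13]
qs = [3, 4, 7, 8, 9, 10, 12]
print(prefix_filter_prunes(qr, qs, q=2, tau=2))
```

With q = 2 and τ = 2 each prefix holds 5 grams, and the two prefixes share nothing, so the pair is pruned without computing the edit distance.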
![Page 15: A Pivotal Prefix Based Filtering Algorithm for String Similarity Search](https://reader035.vdocuments.site/reader035/viewer/2022062809/5681578b550346895dc51de5/html5/thumbnails/15.jpg)
Preliminary: disjoint q-gram
• Disjoint q-grams are q-grams whose positions in the string do not overlap.
• One edit operation destroys at most 1 disjoint gram.
• τ edit operations destroy at most τ disjoint grams.
• If r and s have more than τ mismatched disjoint grams, then ED(r, s) > τ.
![Page 16: A Pivotal Prefix Based Filtering Algorithm for String Similarity Search](https://reader035.vdocuments.site/reader035/viewer/2022062809/5681578b550346895dc51de5/html5/thumbnails/16.jpg)
Pivotal Prefix Filter
Sort all q-grams by a global ordering, such as idf
q(r): the sorted q-gram set of string r
q(s): the sorted q-gram set of string s
Piv(•) is the pivotal prefix of q(•): |Piv(•)| = τ+1 and the q-grams in Piv(•) are disjoint
If Piv(s) ∩ Pre(r) = ϕ and Piv(r) ∩ Pre(s) = ϕ, then ED(r, s) > τ
![Page 17: A Pivotal Prefix Based Filtering Algorithm for String Similarity Search](https://reader035.vdocuments.site/reader035/viewer/2022062809/5681578b550346895dc51de5/html5/thumbnails/17.jpg)
Pivotal Prefix Filter (case last(s) > last(r))
last(•) is the last q-gram of Pre(•) in the global ordering
Pivotal Prefix Filter: if last(s) > last(r) and Piv(r) ∩ Pre(s) = ϕ, then ED(r, s) > τ
[Figure: the sorted q-gram sets of r and s with Pre(•), Piv(•), last(•) and suffix(r) marked; visible gram labels are (g5, g8, g10) and (g1, g3, g6, g9, g11, g13), and the grams beyond the prefix are labelled “> g10”.]
![Page 18: A Pivotal Prefix Based Filtering Algorithm for String Similarity Search](https://reader035.vdocuments.site/reader035/viewer/2022062809/5681578b550346895dc51de5/html5/thumbnails/18.jpg)
Pivotal Prefix Filter (case last(r) > last(s))
last(•) is the last q-gram of Pre(•) in the global ordering
Pivotal Prefix Filter: if last(r) > last(s) and Piv(s) ∩ Pre(r) = ϕ, then ED(r, s) > τ
[Figure: the sorted q-gram sets of r and s with Pre(•), Piv(•), last(•) and suffix(r) marked; visible gram labels are (g1, g4, g6, g9, g12, g13) and (g3, g7, g10, g11), and the grams beyond the prefix are labelled “> g10”.]
![Page 19: A Pivotal Prefix Based Filtering Algorithm for String Similarity Search](https://reader035.vdocuments.site/reader035/viewer/2022062809/5681578b550346895dc51de5/html5/thumbnails/19.jpg)
Pivotal Prefix Filter
• If last(r) > last(s) and Piv(s) ∩ Pre(r) = ϕ, then ED(r, s) > τ
• If last(s) > last(r) and Piv(r) ∩ Pre(s) = ϕ, then ED(r, s) > τ
• Existence: there must exist τ+1 disjoint grams in the prefix
• The pivotal prefix is a subset of the prefix
– The pivotal prefix filter dominates the prefix filter
– Signature sizes are O(τ) and O(qτ), respectively
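The two rules can be combined into one check. In this sketch grams are again represented by their rank in the global order, so last(•) is simply the largest rank in the prefix. The slide does not say what to do when last(r) = last(s); applying the second rule in that case is an assumption made here (either rule is safe then, because a shared gram no larger than both last grams must lie in both prefixes).

```python
def pivotal_prefix_prunes(pre_r, piv_r, pre_s, piv_s):
    """True if the pivotal prefix filter proves ED(r, s) > tau.

    pre_*: prefix grams (ranks), piv_*: pivotal grams (ranks, subset of pre_*).
    """
    if max(pre_r) > max(pre_s):                  # last(r) > last(s)
        return not (set(piv_s) & set(pre_r))
    return not (set(piv_r) & set(pre_s))         # last(s) >= last(r), tie assumed here

# Hypothetical signatures for q = 2, tau = 2 (5 prefix grams, 3 pivotal grams).
print(pivotal_prefix_prunes([1, 3, 6, 9, 11], [1, 6, 11],
                            [2, 4, 5, 8, 10], [2, 5, 10]))
```

Only τ+1 pivotal grams are compared against the other string's prefix, which is where the O(τ) signature size comes from.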
![Page 20: A Pivotal Prefix Based Filtering Algorithm for String Similarity Search](https://reader035.vdocuments.site/reader035/viewer/2022062809/5681578b550346895dc51de5/html5/thumbnails/20.jpg)
Related Work
Signature sizes by method:
• Prefix Filter: |Sig(r)| = O(qτ), |Sig(s)| = O(qτ)
• Mismatch Filter: |Sig(r)| = O(qτ), |Sig(s)| = O(qτ)
• Qchunk Filter: |Sig(r)| = O(τ), |Sig(s)| = O(l)
• Pivotal Prefix Filter: |Sig(r)| = O(τ), |Sig(s)| = O(qτ)
Notes:
• Mismatch Filter [Xiao VLDB08]: shortens the prefix length, but it is still O(qτ)
• Qchunk Filter [Qin SIGMOD11]: shortens one signature to O(τ) but increases the other to O(l)
• Adaptive Prefix [Wang SIGMOD12]
– Increases the prefix length to reduce the number of candidates
– Orthogonal and can be integrated into our method
• Flamingo [Li ICDE08]
– Based on the count filter; accelerates the counting process
– Orthogonal and can be integrated into our method
![Page 21: A Pivotal Prefix Based Filtering Algorithm for String Similarity Search](https://reader035.vdocuments.site/reader035/viewer/2022062809/5681578b550346895dc51de5/html5/thumbnails/21.jpg)
Pivotal Search Algorithm
• Indexing
– Build inverted indexes for both the prefix and the pivotal prefix of the data strings
• Querying
– Generate the prefix and the pivotal prefix for the query string
– Probe the prefix index with the pivotal prefix of the query
– Probe the pivotal prefix index with the prefix of the query
– Verify the candidates and output the results
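The indexing and querying steps can be sketched end to end as follows. Several choices here are assumptions rather than the authors' implementation: the global order is gram frequency with alphabetical tie-breaking, pivotal grams are picked greedily from left to right (not optimally), the indexes store string ids without positions, every string is assumed to have at least qτ+1 q-grams, and the five data strings are reconstructed from the q-gram lists shown on the indexing slides.

```python
from collections import Counter, defaultdict

def edit_distance(r, s):
    prev = list(range(len(s) + 1))
    for i, cr in enumerate(r, 1):
        cur = [i]
        for j, cs in enumerate(s, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (cr != cs)))
        prev = cur
    return prev[-1]

def signatures(s, q, tau, rank):
    """Return (prefix grams, pivotal grams, rank of the last prefix gram)."""
    grams = [(s[i:i + q], i) for i in range(len(s) - q + 1)]
    pre = sorted(grams, key=lambda gp: rank(gp[0]))[: q * tau + 1]
    piv, next_free = [], 0
    for g, pos in sorted(pre, key=lambda gp: gp[1]):   # greedy, left to right
        if pos >= next_free and len(piv) < tau + 1:
            piv.append(g)
            next_free = pos + q                        # keep pivotal grams disjoint
    return [g for g, _ in pre], piv, rank(pre[-1][0])

def build_index(data, q, tau):
    freq = Counter(r[i:i + q] for r in data for i in range(len(r) - q + 1))
    rank = lambda g: (freq[g], g)                      # unseen grams get frequency 0
    pre_index, piv_index, last = defaultdict(set), defaultdict(set), {}
    for rid, r in enumerate(data):
        pre, piv, last[rid] = signatures(r, q, tau, rank)
        for g in pre:
            pre_index[g].add(rid)
        for g in piv:
            piv_index[g].add(rid)
    return rank, pre_index, piv_index, last

def search(data, index, s, q, tau):
    rank, pre_index, piv_index, last = index
    pre, piv, last_s = signatures(s, q, tau, rank)
    cands = set()
    for g in piv:   # pivotal prefix of the query against the prefix index
        cands |= {rid for rid in pre_index.get(g, ()) if last[rid] > last_s}
    for g in pre:   # prefix of the query against the pivotal prefix index
        cands |= {rid for rid in piv_index.get(g, ()) if last[rid] <= last_s}
    results = sorted(data[rid] for rid in cands
                     if edit_distance(data[rid], s) <= tau)
    return sorted(data[rid] for rid in cands), results

R = ["imyouteca", "ubuntucom", "utubbecou", "youtbecom", "yoytubeca"]
index = build_index(R, q=2, tau=2)
candidates, results = search(R, index, "yotubecom", q=2, tau=2)
print(candidates, results)
```

Because the tie-breaking and the pivotal selection differ from the slides, the candidate set may differ from the one shown in the deck, but the verified result is the same string, youtbecom.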
![Page 22: A Pivotal Prefix Based Filtering Algorithm for String Similarity Search](https://reader035.vdocuments.site/reader035/viewer/2022062809/5681578b550346895dc51de5/html5/thumbnails/22.jpg)
Pivotal Prefix Selection
Evaluating different pivotal prefixes: the longer the inverted lists we probe, the more candidates we may have.
For query string: $\min_{piv(s)} \sum_{g \in piv(s)} (\text{length of the inverted list of } g)$
For data string: $\min_{piv(r)} \sum_{g \in piv(r)} (\text{frequency of } g)$
![Page 23: A Pivotal Prefix Based Filtering Algorithm for String Similarity Search](https://reader035.vdocuments.site/reader035/viewer/2022062809/5681578b550346895dc51de5/html5/thumbnails/23.jpg)
Optimal Pivotal Prefix Selection
Objective: select m = τ+1 optimal pivotal q-grams from the first n = qτ+1 q-grams in the prefix
Dynamic Programming:
Select m-1 optimal pivotal q-grams from the first n-1 q-grams in the prefix
Select the n-th q-gram as the last pivotal q-gram
![Page 24: A Pivotal Prefix Based Filtering Algorithm for String Similarity Search](https://reader035.vdocuments.site/reader035/viewer/2022062809/5681578b550346895dc51de5/html5/thumbnails/24.jpg)
Optimal Pivotal Prefix Selection
Dynamic Programming:
Select m-1 optimal pivotal q-grams from the first n-2 q-grams
Select the (n-1)-th q-gram as the last pivotal q-gram
![Page 25: A Pivotal Prefix Based Filtering Algorithm for String Similarity Search](https://reader035.vdocuments.site/reader035/viewer/2022062809/5681578b550346895dc51de5/html5/thumbnails/25.jpg)
Optimal Pivotal Prefix Selection
Dynamic Programming:
Select m-1 optimal pivotal q-grams from the first m-1 q-grams
Select the m-th q-gram as the last pivotal q-gram
Recursive formula: $f(m, n) = \min_{1 \le k \le m} \{\ldots\}$
The weight is the length of the inverted list for the query string and the frequency for a data string.
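One way to realise the selection is sketched below. The recurrence used here is a standard weighted-interval formulation chosen for this sketch, since the body of the slide's formula did not survive extraction: f[i][j] is the minimum total weight of i pairwise disjoint grams taken from the first j prefix grams in positional order. The function name, the 0-based positions and the weights in the example (gram frequencies read off the global gram order slide, with 0 for the unseen gram ot) are assumptions.

```python
import bisect
import math

def optimal_pivots(grams, weight, q, m):
    """grams: list of (gram, position). Returns m disjoint grams of minimum
    total weight in positional order, or None if no such selection exists."""
    gs = sorted(grams, key=lambda gp: gp[1])
    starts = [pos for _, pos in gs]
    n = len(gs)
    f = [[0.0] * (n + 1)] + [[math.inf] * (n + 1) for _ in range(m)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            g, pos = gs[j - 1]
            k = bisect.bisect_right(starts, pos - q)   # grams ending before pos
            f[i][j] = min(f[i][j - 1],                 # skip gram j
                          f[i - 1][k] + weight(g))     # take gram j as the last one
    if f[m][n] == math.inf:
        return None
    chosen, i, j = [], m, n
    while i > 0:                                       # backtrack the choices
        g, pos = gs[j - 1]
        if f[i][j] == f[i][j - 1]:
            j -= 1
        else:
            chosen.append((g, pos))
            i, j = i - 1, bisect.bisect_right(starts, pos - q)
    return chosen[::-1]

w = {"ot": 0, "om": 2, "yo": 3, "ub": 3, "co": 3}
pre_s = [("ot", 1), ("om", 7), ("yo", 0), ("ub", 3), ("co", 6)]  # prefix of "yotubecom"
print(optimal_pivots(pre_s, w.get, q=2, m=3))
```

For the query yotubecom with q = 2 and τ = 2 this selects ot, ub and om, the same pivotal prefix that the querying slides show for s.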
![Page 26: A Pivotal Prefix Based Filtering Algorithm for String Similarity Search](https://reader035.vdocuments.site/reader035/viewer/2022062809/5681578b550346895dc51de5/html5/thumbnails/26.jpg)
Filter-and-Verification Framework
Inputs: dataset R (indexed), threshold τ, query string s
Filter: Signature(s) ∩ Signature(r) = ϕ? If yes, r is pruned.
Verify: if no, apply the alignment filter; if r passes it, check ED(r, s) ≤ τ? If yes, r is added to the results.
Complexity Improvement: Improved from to
![Page 27: A Pivotal Prefix Based Filtering Algorithm for String Similarity Search](https://reader035.vdocuments.site/reader035/viewer/2022062809/5681578b550346895dc51de5/html5/thumbnails/27.jpg)
Alignment Filter
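Intuition of Alignment Filter: suppose that in the best case we need err_i edit operations to transform the i-th pivotal q-gram of s into a substring of r. Because the pivotal q-grams are disjoint, if the err_i values sum to more than τ, then ED(r, s) > τ.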
![Page 28: A Pivotal Prefix Based Filtering Algorithm for String Similarity Search](https://reader035.vdocuments.site/reader035/viewer/2022062809/5681578b550346895dc51de5/html5/thumbnails/28.jpg)
Alignment Filter
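Substring edit distance (sed): sed(g, r) is the minimum edit distance between a q-gram g and any substring of r.
Alignment filter: if the sed values of the pivotal q-grams of s against r sum to more than τ, then ED(r, s) > τ.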
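A sketch of the substring edit distance and of a filter built on it follows. It assumes the alignment filter sums sed over the pivotal q-grams of one string against the other string and prunes when the sum exceeds τ; it ignores the position-based acceleration, and the data strings are reconstructed from the q-gram lists on the indexing slides.

```python
def sed(g, r):
    """Minimum edit distance between g and any substring of r."""
    prev = [0] * (len(r) + 1)               # a match may start anywhere in r
    for i, cg in enumerate(g, 1):
        cur = [i]
        for j, cr in enumerate(r, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (cg != cr)))
        prev = cur
    return min(prev)                        # a match may end anywhere in r

def alignment_filter_prunes(pivots, r, tau):
    """True if the disjoint pivotal grams alone already need more than tau edits."""
    return sum(sed(g, r) for g in pivots) > tau

piv_s = ["ot", "ub", "om"]                  # pivotal prefix of "yotubecom"
print(alignment_filter_prunes(piv_s, "youtbecom", 2))
print(alignment_filter_prunes(piv_s, "imyouteca", 2))
```

The only difference from the ordinary edit distance table is the all-zero first row and the minimum over the last row, which let the gram align to any substring of r.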
![Page 29: A Pivotal Prefix Based Filtering Algorithm for String Similarity Search](https://reader035.vdocuments.site/reader035/viewer/2022062809/5681578b550346895dc51de5/html5/thumbnails/29.jpg)
Alignment Filter
Accelerating Calculation:
• The computation complexity of sed(, r) is O().
• By position filter, can only align to a substring xi of r where |xi|<.
• Thus if , ED(r, s)
• The complexity reduced to
Complexity Improvement: Improved from to
![Page 30: A Pivotal Prefix Based Filtering Algorithm for String Similarity Search](https://reader035.vdocuments.site/reader035/viewer/2022062809/5681578b550346895dc51de5/html5/thumbnails/30.jpg)
Experiments
Settings:
• C++, compiled with g++ 4.8.2 and the -O3 flag
• 64-bit Ubuntu Server 12.04 LTS
• Intel Xeon E5-2650 2.00GHz processor and 16GB memory
![Page 31: A Pivotal Prefix Based Filtering Algorithm for String Similarity Search](https://reader035.vdocuments.site/reader035/viewer/2022062809/5681578b550346895dc51de5/html5/thumbnails/31.jpg)
Evaluating Pivotal Prefix Filter: Average Search Time
• Mismatch: from EDJoin
• CrossFilter: Cross Filter
• PivotalFilter: Pivotal Prefix Filter
• CrossSelect: CrossFilter + Pivotal Prefix Selection
• PivotalSearch: PivotalFilter + Pivotal Prefix Selection
![Page 32: A Pivotal Prefix Based Filtering Algorithm for String Similarity Search](https://reader035.vdocuments.site/reader035/viewer/2022062809/5681578b550346895dc51de5/html5/thumbnails/32.jpg)
Evaluating Pivotal Prefix Filter: Candidate Number
• Mismatch: from EDJoin
• CrossFilter: Cross Filter
• PivotalFilter: Pivotal Prefix Filter
• CrossSelect: CrossFilter + Pivotal Prefix Selection
• PivotalSearch: PivotalFilter + Pivotal Prefix Selection
![Page 33: A Pivotal Prefix Based Filtering Algorithm for String Similarity Search](https://reader035.vdocuments.site/reader035/viewer/2022062809/5681578b550346895dc51de5/html5/thumbnails/33.jpg)
Evaluating Alignment Filter: Average Search Time
• NoFilter: without any filter
• ContentFilter: from EDJoin
• AlignFilter: Alignment Filter
![Page 34: A Pivotal Prefix Based Filtering Algorithm for String Similarity Search](https://reader035.vdocuments.site/reader035/viewer/2022062809/5681578b550346895dc51de5/html5/thumbnails/34.jpg)
Evaluating Alignment Filter: Candidate Number
• NoFilter: without any filter
• ContentFilter: from EDJoin
• AlignFilter: Alignment Filter
• Real: number of results
![Page 35: A Pivotal Prefix Based Filtering Algorithm for String Similarity Search](https://reader035.vdocuments.site/reader035/viewer/2022062809/5681578b550346895dc51de5/html5/thumbnails/35.jpg)
Comparison with the State of the Art
• PivotalSearch: our method
• Adaptive: [Wang2012]
• Flamingo: [Li2008]
• Qchunk: [Qin 2011]
![Page 36: A Pivotal Prefix Based Filtering Algorithm for String Similarity Search](https://reader035.vdocuments.site/reader035/viewer/2022062809/5681578b550346895dc51de5/html5/thumbnails/36.jpg)
Scalability
![Page 37: A Pivotal Prefix Based Filtering Algorithm for String Similarity Search](https://reader035.vdocuments.site/reader035/viewer/2022062809/5681578b550346895dc51de5/html5/thumbnails/37.jpg)
Conclusion
• Pivotal prefix filter
• Pivotal search algorithm
• Optimal pivotal prefix selection
• Alignment filter
![Page 38: A Pivotal Prefix Based Filtering Algorithm for String Similarity Search](https://reader035.vdocuments.site/reader035/viewer/2022062809/5681578b550346895dc51de5/html5/thumbnails/38.jpg)
THANK YOU
Q & A
Project homepage: http://dbgroup.cs.tsinghua.edu.cn/dd/pivotal.html
![Page 39: A Pivotal Prefix Based Filtering Algorithm for String Similarity Search](https://reader035.vdocuments.site/reader035/viewer/2022062809/5681578b550346895dc51de5/html5/thumbnails/39.jpg)
Outline
• Problem Definition
• Pivotal Prefix Filter
• The Similarity Search Algorithm
• Alignment Filter
• Experiment
• Conclusion
![Page 44: A Pivotal Prefix Based Filtering Algorithm for String Similarity Search](https://reader035.vdocuments.site/reader035/viewer/2022062809/5681578b550346895dc51de5/html5/thumbnails/44.jpg)
Complexity
• Space Complexity:
• Time Complexity:
![Page 45: A Pivotal Prefix Based Filtering Algorithm for String Similarity Search](https://reader035.vdocuments.site/reader035/viewer/2022062809/5681578b550346895dc51de5/html5/thumbnails/45.jpg)
Pivotal Prefix Selection
Evaluating different pivotal prefixes: the longer the inverted lists we scan, the larger the filtering cost and the smaller the pruning power.
For query string: $\min_{piv(s)} \sum_{g \in piv(s)} |I^{+}[g]|$
For data string: $\min_{piv(r)} \sum_{g \in piv(r)} \mathit{frequency}[g]$
Existence of Pivotal Prefix: there must exist at least τ+1 disjoint q-grams in the prefix pre(r) for any string r
![Page 46: A Pivotal Prefix Based Filtering Algorithm for String Similarity Search](https://reader035.vdocuments.site/reader035/viewer/2022062809/5681578b550346895dc51de5/html5/thumbnails/46.jpg)
Complexity
• Space Complexity:
– Prefix Inverted Index Size:
– Pivotal Prefix Inverted Index Size:
• Query Time Complexity:
– Preprocess Query s:
– Probing Inverted Indexes: where is the average length of probed prefix inverted lists
• Verification Complexity: where c is the number of candidates and l is the average string length
![Page 48: A Pivotal Prefix Based Filtering Algorithm for String Similarity Search](https://reader035.vdocuments.site/reader035/viewer/2022062809/5681578b550346895dc51de5/html5/thumbnails/48.jpg)
Preliminary: Prefix Filter (example)
Sort all q-grams by a global ordering, such as idf
Pre(•) is the prefix of q(•), with |Pre(•)| = qτ+1
Prefix Filter: if Pre(r) ∩ Pre(s) = ϕ, then ED(r, s) > τ
[Figure: two sorted q-gram sets, (g1, g2, g5, g6, g9, g10, g11) and (g3, g4, g7, g8, g11, g12, g13), with Pre(r) and Pre(s) marked; the grams beyond the prefix are labelled “> g10”.]
![Page 49: A Pivotal Prefix Based Filtering Algorithm for String Similarity Search](https://reader035.vdocuments.site/reader035/viewer/2022062809/5681578b550346895dc51de5/html5/thumbnails/49.jpg)
Alignment Filter
Non-consecutive errors: youtubecom vs. yoytupecxm; with q=3, the 3 non-consecutive errors destroy 8 q-grams
Consecutive errors: youtubecom vs. youtzpxcom; with q=3, the 3 consecutive errors destroy only 5 q-grams
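The two counts can be reproduced with a short script. This sketch only handles substitutions (both strings must have the same length), which is all the slide's examples need; the function name is illustrative.

```python
def destroyed_qgrams(original, edited, q):
    """Count q-grams of `original` that cover at least one substituted position."""
    errors = [i for i, (a, b) in enumerate(zip(original, edited)) if a != b]
    starts = range(len(original) - q + 1)
    return sum(any(i <= e < i + q for e in errors) for i in starts)

print(destroyed_qgrams("youtubecom", "yoytupecxm", 3))
print(destroyed_qgrams("youtubecom", "youtzpxcom", 3))
```

Spread-out errors each hit their own group of q-grams, while adjacent errors hit overlapping groups, which is why the consecutive case destroys fewer grams.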
![Page 50: A Pivotal Prefix Based Filtering Algorithm for String Similarity Search](https://reader035.vdocuments.site/reader035/viewer/2022062809/5681578b550346895dc51de5/html5/thumbnails/50.jpg)
Indexing
• Fix a global gram order: we use gram frequency in ascending order (τ=2, q=2)
Global gram order (grams on the first line, their frequencies on the second):
im my te bu un nt uc bb tb oy yt ca om yo ou ut ub co tu be ec
1 1 1 1 1 1 1 1 1 1 1 2 2 3 3 3 3 3 3 3 4
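The frequency table can be recomputed from the data strings. The five strings below are not listed in this transcript; they are reconstructed from the q-gram lists and positions shown on the indexing slides, so treat them as an assumption. Ties between equally frequent grams are broken alphabetically here, which need not match the order on the slide.

```python
from collections import Counter

def qgrams(s, q):
    return [s[i:i + q] for i in range(len(s) - q + 1)]

def global_order(strings, q):
    """Return (grams sorted by ascending frequency, frequency table)."""
    freq = Counter(g for s in strings for g in qgrams(s, q))
    return sorted(freq, key=lambda g: (freq[g], g)), freq

R = ["imyouteca", "ubuntucom", "utubbecou", "youtbecom", "yoytubeca"]
order, freq = global_order(R, 2)
print(order)
print([freq[g] for g in order])
```

Rare grams come first in this order, so prefixes are made of the grams with the shortest inverted lists.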
![Page 51: A Pivotal Prefix Based Filtering Algorithm for String Similarity Search](https://reader035.vdocuments.site/reader035/viewer/2022062809/5681578b550346895dc51de5/html5/thumbnails/51.jpg)
Indexing
• Build inverted indexes for prefix and pivotal prefix (τ=2, q=2)
Split each string into q-grams and sort them by the global gram order. The first qτ+1 = 5 grams form pre(ri); the bar marks the end of the prefix, so the gram just before it is last(pre(ri)):
q(r1): {im, my, te, ca, yo | ou, ut, ec}
q(r2): {bu, un, nt, uc, om | ub, co, tu}
q(r3): {bb, ou, ut, ub, co | tu, be, ec}
q(r4): {tb, om, yo, ou, ut | co, be, ec}
q(r5): {oy, yt, ca, yo, ub | tu, be, ec}
[Figure labels: pre(ri), Piv(ri), slt(ri), last(pre(ri))]
![Page 52: A Pivotal Prefix Based Filtering Algorithm for String Similarity Search](https://reader035.vdocuments.site/reader035/viewer/2022062809/5681578b550346895dc51de5/html5/thumbnails/52.jpg)
Indexing
• Build inverted indexes for prefix and pivotal prefix
Each entry <ri, p> records a string and the position of the gram in it.
Pivotal prefix index I− (built from Piv(ri)):
im: <r1,1>
te: <r1,6>
bu: <r2,2>
nt: <r2,4>
uc: <r2,6>
tb: <r4,4>
yt: <r5,3>
ca: <r1,8> <r5,8>
om: <r4,8>
yo: <r4,1> <r5,1>
ou: <r3,8>
ut: <r3,1>
ub: <r3,3>
Prefix index I+ (built from pre(ri)):
im: <r1,1>
my: <r1,2>
te: <r1,6>
bu: <r2,2>
un: <r2,3>
nt: <r2,4>
uc: <r2,6>
bb: <r3,4>
tb: <r4,4>
oy: <r5,2>
yt: <r5,3>
co: <r3,7>
ca: <r1,8> <r5,8>
om: <r2,8> <r4,8>
yo: <r1,3> <r4,1> <r5,1>
ou: <r3,8> <r4,2>
ut: <r3,1> <r4,3>
ub: <r3,3> <r5,5>
![Page 53: A Pivotal Prefix Based Filtering Algorithm for String Similarity Search](https://reader035.vdocuments.site/reader035/viewer/2022062809/5681578b550346895dc51de5/html5/thumbnails/53.jpg)
Querying
• Generate prefix and pivotal prefix for the query string
s: yotubecom
pre(s): {ot, om, yo, ub, co}
piv(s): {ot, om, ub}
last(pre(s)): co
![Page 54: A Pivotal Prefix Based Filtering Algorithm for String Similarity Search](https://reader035.vdocuments.site/reader035/viewer/2022062809/5681578b550346895dc51de5/html5/thumbnails/54.jpg)
Querying
• Probe the prefix index with the pivotal prefix of the query
• Probe the pivotal prefix index with the prefix of the query
s: yotubecom
pre(s): {ot, om, yo, ub, co}
piv(s): {ot, om, ub}
[Figure: query steps Preprocess, Probe, Probe; piv(s) is probed against the prefix index I+ and pre(s) against the pivotal prefix index I−, with last(pre(s)) marked.]
![Page 55: A Pivotal Prefix Based Filtering Algorithm for String Similarity Search](https://reader035.vdocuments.site/reader035/viewer/2022062809/5681578b550346895dc51de5/html5/thumbnails/55.jpg)
Querying
• Verify the candidates and output results
s: yotubecom
pre(s): {ot, om, yo, ub, co}
piv(s): {ot, om, ub}
[Figure: query steps Preprocess, Probe, Probe, Verify over the prefix index I+ and the pivotal prefix index I−.]
Candidates: r3, r4, r5
Result: r4