VGRAM: Improving Performance of Approximate Queries on String Collections Using Variable-Length Grams
VLDB 2007
Chen Li (UC Irvine), Bin Wang (Northeastern University), Xiaochun Yang (Northeastern University)
Presented by Jae-won Lee
Copyright 2006 by CEBT
Introduction
Many applications have an increasing need to support approximate string queries on data collections
Examples of approximate string queries
Data Cleaning – the same entity can be represented in slightly different forms
– “PO BOX 23” and “P.O. Box 23”
Query Relaxation – errors in the query, inconsistencies in the data, limited knowledge about the data
– “Steven Spielburg” and “Steve Spielberg”
Spellchecking – find potential candidates for a possibly mistyped word
IDS Lab. Seminar, Center for E-Business Technology
Introduction
Dilemma of Choosing Gram Length
The gram length can greatly affect the performance of string matches
Increasing gram length
– Causes the inverted list to be shorter
This may decrease the time to merge the inverted lists
– Causes a lower threshold on the number of common grams
This makes the filter less selective
Example: five strings and their grams
id  string
0   rich
1   stick
2   stich
3   stuck
4   static
2-grams: at, ch, ck, ic, ri, st, ta, ti, tu, uc (# of common grams >= 3)
3-grams: ati, ich, ick, ric, sta, sti, stu, tat, tic, tuc, uck (# of common grams >= 1)
[Figure: inverted lists of string ids for each 2-gram and each 3-gram]
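The effect of the gram length on the inverted lists can be checked with a minimal Python sketch. It assumes fixed-length grams without positions or padding characters and uses the five example strings rich, stick, stich, stuck, static; the helper names build_inverted_index and average_list_length are hypothetical and not taken from the slides.

```python
from collections import defaultdict

def build_inverted_index(strings, q):
    """Map each q-gram to the sorted list of ids of the strings containing it."""
    index = defaultdict(list)
    for sid, s in enumerate(strings):
        seen = set()
        for i in range(len(s) - q + 1):
            gram = s[i:i + q]
            if gram not in seen:          # record each string id once per gram
                seen.add(gram)
                index[gram].append(sid)
    return dict(index)

def average_list_length(index):
    """Average number of string ids per inverted list."""
    return sum(len(ids) for ids in index.values()) / len(index)

strings = ["rich", "stick", "stich", "stuck", "static"]
index2 = build_inverted_index(strings, 2)
index3 = build_inverted_index(strings, 3)

print(sorted(index2))
print(sorted(index3))
print(average_list_length(index2), average_list_length(index3))
```

On this small collection the average list length drops from 2.0 for 2-grams to about 1.36 for 3-grams, which illustrates the shorter lists that longer grams produce.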
VGRAM: Main Idea
We analyze the frequencies of variable-length grams in the strings and select a set of grams, called the gram dictionary
For a string, we generate a set of grams of variable lengths using the gram dictionary
Challenges
How to generate variable-length grams?
How to construct a high-quality gram dictionary?
What is the relationship between the similarity of two strings and the similarity of their gram sets?
How to adopt VGRAM in existing algorithms?
Challenge 1: Generating Variable-Length Grams
Example
String s = universal
Gram dictionary D = {ni, ivr, sal, uni, vers}
qmin = 2, qmax = 4
Start at position p = 1 with VG = {}
The longest substring starting at u that appears in D is uni, so the algorithm inserts (1, uni) into VG
Move to the next character n: the longest substring in D is ni
– However, the candidate (2, ni) is subsumed by the previous gram (1, uni), so the algorithm does not insert it into VG
Move to the next character i: no substring starting at this character matches a gram in D, so the algorithm produces (3, iv), a gram of length qmin = 2
Final set VG(s) = {(1, uni), (3, iv), (4, vers), (7, sal)}
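The walk-through above can be sketched in Python. This is a minimal sketch under stated assumptions: the dictionary is a plain set of strings rather than an index structure, positions are 1-based as in the example, and the function name vgen is hypothetical. A candidate is treated as subsumed when an already selected gram starts at or before it and ends at or after it.

```python
def vgen(s, dictionary, qmin, qmax):
    """Return the positional variable-length grams of s as (position, gram) pairs."""
    vg = []
    for p in range(1, len(s) - qmin + 2):        # 1-based start positions
        start = p - 1
        gram = s[start:start + qmin]             # fallback: gram of length qmin
        # Look for the longest dictionary gram starting here, longest first.
        for length in range(min(qmax, len(s) - start), qmin - 1, -1):
            candidate = s[start:start + length]
            if candidate in dictionary:
                gram = candidate
                break
        end = p + len(gram) - 1
        # Skip the candidate if a selected gram already covers its whole span.
        subsumed = any(p0 <= p and end <= p0 + len(g0) - 1 for p0, g0 in vg)
        if not subsumed:
            vg.append((p, gram))
    return vg

D = {"ni", "ivr", "sal", "uni", "vers"}
print(vgen("universal", D, qmin=2, qmax=4))
# [(1, 'uni'), (3, 'iv'), (4, 'vers'), (7, 'sal')]
```

The set lookup keeps the sketch short; a prefix tree over D would avoid testing every candidate length separately.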
Challenge 2: Constructing the Gram Dictionary
Step 1: Collecting the frequencies of grams with length in [qmin = 2, qmax = 4]
[Figure: gram-frequency tree with entries such as st, sti, stu, stic, stuc, with leaf nodes marked]
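Step 1 can be approximated with a short Python sketch. It assumes a flat counter in place of the tree shown on the slide and counts every occurrence of every substring whose length lies in [qmin, qmax]; the function name gram_frequencies and the choice of the five example strings are assumptions made for illustration.

```python
from collections import Counter

def gram_frequencies(strings, qmin, qmax):
    """Count occurrences of all grams with length between qmin and qmax."""
    freq = Counter()
    for s in strings:
        for q in range(qmin, qmax + 1):
            for i in range(len(s) - q + 1):
                freq[s[i:i + q]] += 1
    return freq

strings = ["rich", "stick", "stich", "stuck", "static"]
freq = gram_frequencies(strings, qmin=2, qmax=4)
for gram in ["st", "sti", "stu", "stic", "stuc"]:
    print(gram, freq[gram])
```

A tree keyed by characters stores the same counts while sharing common prefixes, which is what makes the pruning in Step 2 a local operation on a node and its children.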
Challenge 2: Constructing the Gram Dictionary
Step 2: Selecting High-Quality Grams
If a gram g has a low frequency, we eliminate from the tree all the extended grams of g
If a gram is very frequent, we keep some of its extended grams
Challenge 2: Constructing the Gram Dictionary
Pruning the tree using a frequency threshold T = 2
Case: the frequency of a node (one that has a leaf node) is ≤ T
– The extended grams under this node are removed
[Figure: pruned tree with the removed nodes marked]
Challenge 2: Constructing the Gram Dictionary
Pruning the tree using a frequency threshold T = 2
Case: the frequency of a node (one that has a leaf node) is > T
Pruning policies used to select a maximal subset of children to remove
– SmallFirst: choose children with the smallest frequencies
– LargeFirst: choose children with the largest frequencies
– Random: randomly choose children so that L.freq (the frequency of the node's leaf L after absorbing the removed children) is not greater than T
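The three policies can be compared with a small Python sketch. It is a simplified illustration, not the full pruning procedure: it assumes the children of one node are given as a mapping from gram to frequency, that removing a child adds its frequency to the leaf frequency, and that a greedy pass in policy order yields the subset. The function name select_children_to_remove and its parameters are hypothetical.

```python
import random

def select_children_to_remove(child_freqs, leaf_freq, threshold,
                              policy="SmallFirst", rng=None):
    """Greedily pick children to remove while the leaf frequency stays <= threshold.

    Returns (removed_grams, resulting_leaf_freq).
    """
    items = list(child_freqs.items())
    if policy == "SmallFirst":
        items.sort(key=lambda kv: kv[1])
    elif policy == "LargeFirst":
        items.sort(key=lambda kv: kv[1], reverse=True)
    elif policy == "Random":
        (rng or random).shuffle(items)
    else:
        raise ValueError(f"unknown policy: {policy}")

    removed, total = [], leaf_freq
    for gram, freq in items:
        if total + freq <= threshold:    # removing this child keeps the bound
            removed.append(gram)
            total += freq
    return removed, total

children = {"a": 1, "b": 1, "c": 3}
print(select_children_to_remove(children, leaf_freq=0, threshold=3, policy="SmallFirst"))
print(select_children_to_remove(children, leaf_freq=0, threshold=3, policy="LargeFirst"))
```

With these made-up frequencies, SmallFirst removes the two rare children and keeps the frequent one, while LargeFirst removes only the frequent child, so the policies leave different extended grams in the dictionary.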
Challenge 3: Similarity of Gram Sets
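Analyzing the effect of an edit operation on the positional grams
These effects are stored in the NAG vector (the vector of the number of affected grams)
For a deletion of the i-th character, each positional gram (p, g) falls into one of four categories
Category 1: p < i-qmax+1 or p+|g|-1 > i+qmax-1 (not affected)
Category 2: p ≤ i ≤ p+|g|-1 (the gram contains the i-th character and is affected)
Category 3: (p, g) lies to the left of the i-th character and is not in Category 1 (potentially affected)
Category 4: (p, g) lies to the right of the i-th character and is not in Category 1 (potentially affected)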
[Figure: string s with a deletion at position i and the window from i-qmax+1 to i+qmax-1; Category 1 grams lie outside the window, Category 3 grams to the left of i, Category 2 grams cover i, Category 4 grams to the right of i]
Challenge 3: Similarity of Gram Sets
Example
s = universal, D = {ni, ivr, sal, uni, vers}, qmin = 2, qmax = 4
VG(s) = {(1, uni), (3, iv), (4, vers), (7, sal)}
Consider a deletion of the 5th character, e, in the string s
i-qmax+1 = 2, i+qmax-1 = 8
Positional grams (1, uni) and (7, sal) are in Category 1
– (1, uni) starts before position 2, and (7, sal) ends after position 8
These grams are not affected by the deletion
(4, vers) is in Category 2
(3, iv) is in Category 3
– Since D contains an extension of iv (ivr), (3, iv) could be affected by the deletion (potentially affected)
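The category assignment in this example can be reproduced with a few lines of Python. This is a sketch that only encodes the position tests; it does not decide whether a Category 3 or Category 4 gram is actually affected, which depends on the extensions present in D. The function name classify is hypothetical.

```python
def classify(p, g, i, qmax):
    """Category (1-4) of positional gram (p, g) for a deletion of the i-th character."""
    end = p + len(g) - 1
    if p < i - qmax + 1 or end > i + qmax - 1:
        return 1                 # outside the window around i
    if p <= i <= end:
        return 2                 # the gram contains the i-th character
    if end < i:
        return 3                 # inside the window, left of i
    return 4                     # inside the window, right of i

vg = [(1, "uni"), (3, "iv"), (4, "vers"), (7, "sal")]
for p, g in vg:
    print((p, g), classify(p, g, i=5, qmax=4))
```

For i = 5 and qmax = 4 this assigns (1, uni) and (7, sal) to Category 1, (4, vers) to Category 2, and (3, iv) to Category 3, matching the example.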
Challenge 3: Similarity of Gram Sets
Number of grams affected by each edit operation
Suppose we transform string s into string s’ with 2 edit operations
– At most 4 grams can be affected
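[Figure: number of grams of s = universal affected by one edit operation at each position; a letter is a deletion/substitution position and _ is a gap (insertion position)]
position:    _ u _ n _ i _ v _ e _ r _ s _ a _ l _
# of grams:  0 1 1 1 1 2 1 2 1 1 1 1 1 2 1 1 1 1 0
[Figure: string s’ and a table mapping the # of edit operations to the # of affected grams]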
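One way to arrive at the bound of 4 is sketched below in Python. It assumes that the bound for k edit operations is taken as the sum of the k largest per-position counts; this rule is an assumption of the sketch, since the slide states only the result. The counts are the 19 values given for s = universal (10 gaps and 9 characters), and the function name nag_upper_bound is hypothetical.

```python
def nag_upper_bound(per_position_counts, k):
    """Sum of the k largest per-position counts of affected grams."""
    return sum(sorted(per_position_counts, reverse=True)[:k])

# Values for _ u _ n _ i _ v _ e _ r _ s _ a _ l _ as listed for s = universal.
counts = [0, 1, 1, 1, 1, 2, 1, 2, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 0]
print(nag_upper_bound(counts, 2))
```

The two largest entries are 2 and 2, so the sketch gives 4 for two edit operations, in line with the slide.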
Challenge 4: Adopting the VGRAM Technique
Example of an algorithm based on inverted lists
Query: EditDistance(shtick, ?) ≤ 1
VG(q) = {(1, sh), (2, ht), (3, tick)}, extracted using the gram dictionary
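[Figure: inverted lists of string ids for grams such as ck, ic, ti, built with 2-grams and with variable-length grams of lengths 2 to 4, over the strings 0 rich, 1 stick, 2 stich, 3 stuck, 4 static]
Fixed-length 2-grams (s1 = shtick, gram length q = 2, k = 1)
Lower bound on the # of common grams
= (|s1| - q + 1) - k * q
= (6 - 2 + 1) - 1 * 2 = 3
Variable-length grams of lengths 2 to 4 (here q is the query shtick)
Lower bound on the # of common grams
= |VG(q)| - NAG(q, k)
= 3 - 2 = 1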
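The two bounds, and the way a bound is used against inverted lists, can be sketched in Python. The sketch assumes fixed-length grams without positions, counts each shared gram once per string, and uses the five example strings; it does not implement the variable-length index itself. All function names are hypothetical.

```python
from collections import Counter

def fixed_gram_bound(query_len, q, k):
    """Lower bound on common grams for fixed-length q-grams: (|s| - q + 1) - k * q."""
    return (query_len - q + 1) - k * q

def vgram_bound(num_query_grams, nag):
    """Lower bound on common grams for variable-length grams: |VG(q)| - NAG(q, k)."""
    return num_query_grams - nag

def qgrams(s, q):
    return {s[i:i + q] for i in range(len(s) - q + 1)}

def candidates(strings, query, q, k):
    """Ids of strings sharing at least the lower-bound number of q-grams with query."""
    bound = fixed_gram_bound(len(query), q, k)
    if bound <= 0:                        # the filter cannot prune anything
        return list(range(len(strings)))
    index = {}
    for sid, s in enumerate(strings):
        for gram in qgrams(s, q):
            index.setdefault(gram, []).append(sid)
    counts = Counter()
    for gram in qgrams(query, q):         # merge the query's inverted lists
        counts.update(index.get(gram, []))
    return sorted(sid for sid, c in counts.items() if c >= bound)

strings = ["rich", "stick", "stich", "stuck", "static"]
print(fixed_gram_bound(6, 2, 1), vgram_bound(3, 2))
print(candidates(strings, "shtick", q=2, k=1))
print(candidates(strings, "shtick", q=3, k=1))
```

With 2-grams the bound of 3 leaves only string 1 (stick) as a candidate, while with 3-grams the bound of 1 leaves strings 1, 2, and 4, which shows how a lower bound weakens the filter. Candidates still need to be verified with an actual edit-distance computation.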
Experiments
Data Sets
Data set 1: Texas Real Estate Commission.
– 151K person names, average length = 33.
Data set 2: English dictionary from the Aspell spellchecker for Cygwin.
– 149,165 words, average length = 8.
Data set 3: DBLP Bibliography.
– 277K titles, average length = 62.
VGRAM Overhead
Data set 3
[Figures: index size; construction time]
Benefits of Using Variable-Length Grams
Data set 1
[Figures: construction time and index size; query time]
Effect of qmax
Data Set 1
[Figures: construction time and query time; query performance]
Effect of Frequency Threshold
Data Set 1
[Figures: construction time; index size; query time]
Conclusion
We developed VGRAM to improve performance of approximate string queries
Variable-length grams, high-quality grams
We gave a full specification of the technique
Index structure
How to generate grams for a string using the index structure
Relationship between the similarity of two strings and the similarity of their grams
We show how to adopt this technique in a variety of existing algorithms