vldb 2007 chen li (uc, irvine) bin wang (northeastern university)

VGRAM: Improving Performance of Approximate Queries on String Collections Using Variable-Length Grams

VLDB 2007

Chen Li (UC Irvine), Bin Wang (Northeastern University), Xiaochun Yang (Northeastern University)

Presented by Jae-won Lee

Copyright 2006 by CEBT

Introduction

Many applications have an increasing need to support approximate string queries on data collections

Examples of approximate string queries

Data Cleaning – the same entity can be represented in slightly different forms

– “PO BOX 23” and “P.O. Box 23”

Query Relaxation – errors in the query, inconsistencies in the data, limited knowledge about the data

– “Steven Spielburg” and “Steve Spielberg”

Spellchecking – find potential candidates for a possibly mistyped word

IDS Lab. Seminar, Center for E-Business Technology


Introduction

Dilemma of Choosing Gram Length

The gram length can greatly affect the performance of string matches

Increasing gram length

– Causes the inverted list to be shorter

This may decrease the time to merge the inverted lists

– Causes a lower threshold on the number of common grams

This makes the count filter less selective


Figure: inverted lists of string ids for fixed-length grams over a collection of five strings

id  string
0   rich
1   stick
2   stich
3   stuck
4   static

2-grams: at, ch, ck, ic, ri, st, ta, ti, tu, uc

# of common grams >= 3

3-grams: ati, ich, ick, ric, sta, sti, stu, tat, tic, tuc, uck

# of common grams >= 1
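The trade-off on this slide can be reproduced with a small sketch. It builds fixed-length gram inverted lists over the five strings and counts common grams against the two thresholds shown (3 for 2-grams, 1 for 3-grams). The query "shtick" is borrowed from the later example slide, and the function names are illustrative assumptions rather than anything defined in the paper.

```python
from collections import defaultdict

STRINGS = ["rich", "stick", "stich", "stuck", "static"]  # ids 0..4 from the slide


def qgrams(s, q):
    """Return the overlapping q-grams of s (no padding characters)."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def build_index(strings, q):
    """Map each q-gram to the sorted list of ids of the strings containing it."""
    index = defaultdict(set)
    for sid, s in enumerate(strings):
        for g in qgrams(s, q):
            index[g].add(sid)
    return {g: sorted(ids) for g, ids in index.items()}


def candidates(query, index, q, threshold):
    """Ids of strings sharing at least `threshold` distinct q-grams with the query."""
    counts = defaultdict(int)
    for g in set(qgrams(query, q)):
        for sid in index.get(g, []):
            counts[sid] += 1
    return sorted(sid for sid, c in counts.items() if c >= threshold)


index2 = build_index(STRINGS, 2)
index3 = build_index(STRINGS, 3)

print(candidates("shtick", index2, 2, 3))  # [1]
print(candidates("shtick", index3, 3, 1))  # [1, 2, 4]
```

With 3-grams the lists are shorter (15 postings in total instead of 20), but the threshold of 1 lets through two strings that the 2-gram threshold of 3 filters out.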


VGRAM: Main Idea

We analyze the frequencies of variable-length grams in the strings and select a set of grams, called the gram dictionary

For a string, we generate a set of grams of variable lengths using the gram dictionary

Challenges

How to generate variable-length grams?

How to construct a high-quality gram dictionary?

What is the relationship between the similarity of two strings and the similarity of their gram sets?

How to adopt VGRAM in existing algorithms?



Challenge 1: Generating Variable-Length Grams

Example

String s = universal

D = {ni, ivr, sal, uni, vers}

qmin = 2, qmax = 4

Start at position p = 1 with VG = {}

The longest substring starting at u that appears in D is uni, so (1, uni) is inserted into VG

Move to the next character n; the longest substring is ni

– However, the candidate (2, ni) is subsumed by the previous gram (1, uni), so the algorithm does not insert it into VG

Move to the next character i; no substring starting at this character matches a gram in D, so the algorithm produces (3, iv) of length qmin = 2

The remaining positions are handled the same way: vers and sal are matched in D, and the qmin-length candidates (5, er), (6, rs), and (8, al) are subsumed

Final set VG(s) = {(1, uni), (3, iv), (4, vers), (7, sal)}
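The walk-through above can be sketched in a few lines. This is a minimal version that assumes the dictionary is a plain Python set (a trie would make the longest-match lookup faster) and that positions are 1-based as on the slide. The function name `vgen` is an assumption made for illustration.

```python
def vgen(s, dictionary, qmin, qmax):
    """Generate positional variable-length grams of s using a gram dictionary."""
    grams = []
    # Positions are 1-based; the last start position still leaves qmin characters.
    for p in range(1, len(s) - qmin + 2):
        rest = s[p - 1:]
        gram = None
        # Longest dictionary gram that is a prefix of the remaining string.
        for length in range(min(qmax, len(rest)), qmin - 1, -1):
            if rest[:length] in dictionary:
                gram = rest[:length]
                break
        if gram is None:
            gram = rest[:qmin]  # fall back to a gram of length qmin
        end = p + len(gram) - 1
        # Skip the candidate if an already selected gram covers its whole span.
        subsumed = any(p1 <= p and end <= p1 + len(g1) - 1 for p1, g1 in grams)
        if not subsumed:
            grams.append((p, gram))
    return grams


D = {"ni", "ivr", "sal", "uni", "vers"}
print(vgen("universal", D, 2, 4))
# [(1, 'uni'), (3, 'iv'), (4, 'vers'), (7, 'sal')]
```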



Challenge 2: Constructing Gram Dictionary

Step 1 : Collecting gram frequencies with length in [qmin = 2, qmax = 4]

Figure: fragment of the gram trie, each gram shown with its list of string ids

st: 0, 1, 3
sti: 0, 1
stu: 3
stic: 0, 1
stuc: 3

Leaf node
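As a rough illustration of Step 1, the sketch below counts every gram with length between qmin and qmax. It uses a flat counter rather than a trie and counts occurrences rather than storing string ids, so it is a simplification and not the structure shown on the slide. The input strings are taken from the introduction example, and the function name is an assumption.

```python
from collections import Counter


def gram_frequencies(strings, qmin, qmax):
    """Count occurrences of every gram whose length lies in [qmin, qmax]."""
    freq = Counter()
    for s in strings:
        for q in range(qmin, qmax + 1):
            for i in range(len(s) - q + 1):
                freq[s[i:i + q]] += 1
    return freq


freq = gram_frequencies(["rich", "stick", "stich", "stuck", "static"], 2, 4)
print(freq["st"], freq["sti"], freq["stic"], freq["stu"])  # 4 2 2 1
```

A trie keeps the same counts but lets all grams sharing a prefix be pruned together, which Step 2 relies on.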



Step 2: Selecting High-Quality Grams

If a gram g has a low frequency, we eliminate from the tree all the extended grams of g

If a gram is very frequent, keep some of its extended grams




Pruning the tree using a frequency threshold T = 2

Frequency of node (which has a leaf node) ≤ T: the extended grams below the node are removed



Pruning the tree using a frequency threshold T = 2

Frequency of node (which has a leaf node) > T

Pruning policies used to select a maximal subset of children to remove

– SmallFirst : choose children with the smallest frequencies

– LargeFirst : choose children with the largest frequencies

– Random : randomly choose children so that L.freq is not greater than T
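The three policies can be compared with a small sketch. It rests on assumptions the slide does not spell out: L is taken to be the leaf child of the node being pruned, removing a child adds that child's frequency to L.freq, and the bound L.freq ≤ T applies to all three policies. The function name, the seed parameter, and the sample frequencies are hypothetical.

```python
import random


def choose_children_to_remove(leaf_freq, child_freqs, threshold,
                              policy="SmallFirst", seed=0):
    """Greedily pick children to remove while keeping the leaf frequency <= threshold.

    Assumption: removing a child adds its frequency to the leaf frequency.
    """
    if policy == "SmallFirst":
        order = sorted(child_freqs, key=child_freqs.get)
    elif policy == "LargeFirst":
        order = sorted(child_freqs, key=child_freqs.get, reverse=True)
    elif policy == "Random":
        order = list(child_freqs)
        random.Random(seed).shuffle(order)
    else:
        raise ValueError(f"unknown policy: {policy}")

    removed, total = [], leaf_freq
    for child in order:
        if total + child_freqs[child] <= threshold:
            total += child_freqs[child]
            removed.append(child)
    return removed, total


children = {"sti": 2, "stu": 1, "sta": 1}  # hypothetical child frequencies
print(choose_children_to_remove(0, children, 2, "SmallFirst"))  # (['stu', 'sta'], 2)
print(choose_children_to_remove(0, children, 2, "LargeFirst"))  # (['sti'], 2)
```

Under these assumptions SmallFirst tends to remove more children and LargeFirst fewer, which changes which extended grams survive in the dictionary.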



Challenge 3: Similarity of Gram Sets

Analyzing the effect of an edit operation at the i-th character on the positional grams

These effects are stored in the NAG vector (the vector of the number of affected grams)

Category 1 : positional gram (p, g) that starts before or ends after the window [i-qmax+1, i+qmax-1]

– p < i-qmax+1 or p+|g|-1 > i+qmax-1

Category 2 : positional gram (p, g) that contains the i-th character, p ≤ i ≤ p+|g|-1

Category 3 : positional gram (p, g) inside the window, on the left of the i-th character

Category 4 : positional gram (p, g) inside the window, on the right of the i-th character


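Figure: a deletion at the i-th character of string s. Grams that start before i-qmax+1 or end after i+qmax-1 are Category 1; inside that window, Category 3 lies to the left of i, Category 2 covers i, and Category 4 lies to the right of i.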



Example

s = universal, D = {ni, ivr, sal, uni, vers}, qmin = 2, qmax = 4

VG(s) = {(1, uni), (3, iv), (4, vers), (7, sal)}

A deletion of the 5th character e in the string s (i = 5)

i-qmax+1 = 2, i+qmax-1 = 8

Positional grams (1, uni) and (7, sal) are category 1

– The starting position of (1, uni) is before 2; the ending position of (7, sal) is after 8

These grams are not affected by the deletion

(4, vers) is category 2

(3, iv) is category 3

– Since there is an extension of iv in D (ivr), (3, iv) could be affected by the deletion (potentially affected)
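The four categories can be checked mechanically. The sketch below only classifies a positional gram relative to a deletion at position i, following the inequalities on the slides. It does not decide whether a Category 3 or 4 gram is actually affected, which depends on the dictionary. The function name is an assumption.

```python
def deletion_category(p, gram, i, qmax):
    """Classify positional gram (p, gram) for a deletion of the i-th character (1-based)."""
    end = p + len(gram) - 1
    if p < i - qmax + 1 or end > i + qmax - 1:
        return 1  # starts before or ends after the window: not affected
    if p <= i <= end:
        return 2  # contains the deleted character
    if end < i:
        return 3  # inside the window, left of the deleted character
    return 4      # inside the window, right of the deleted character


VG = [(1, "uni"), (3, "iv"), (4, "vers"), (7, "sal")]
for p, g in VG:
    print((p, g), deletion_category(p, g, i=5, qmax=4))
# (1, 'uni') 1
# (3, 'iv') 3
# (4, 'vers') 2
# (7, 'sal') 1
```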




# of grams affected by each operation

We want to transform string s to string s’ with 2 edit operations

– At most 4 grams can be affected

Number of grams affected by an edit operation at each position of s = universal (a letter stands for a deletion/substitution of that character, _ for a gap where an insertion can occur):

position   _ u _ n _ i _ v _ e _ r _ s _ a _ l _
# of grams 0 1 1 1 1 2 1 2 1 1 1 1 1 2 1 1 1 1 0
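One way to read the bound of 4 is as the sum of the two largest entries of the per-position vector. The sketch below assumes exactly that: k edit operations affect at most the sum of the k largest entries. The slide's numbers are consistent with this reading, but the slide does not state the rule, so treat it as an assumption. The vector is copied from the slide and the function name is illustrative.

```python
# Per-position counts for s = "universal": gap, u, gap, n, ..., l, gap.
AFFECTED = [0, 1, 1, 1, 1, 2, 1, 2, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 0]


def max_affected_grams(vector, k):
    """Upper bound on grams affected by k edits: sum of the k largest entries (assumed rule)."""
    return sum(sorted(vector, reverse=True)[:k])


print(max_affected_grams(AFFECTED, 2))  # 4
```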


Challenge 4: Adopting the VGRAM Technique

Example of Algorithm based on Inverted Lists

Query : Edit Distance (shtick, ?) ≤ 1

VG(q) = {(1, sh), (2, ht), (3, tick)}, extracted from the query using the gram dictionary


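Figure: inverted lists of string ids (strings 0-4) for the grams of the query, using fixed-length 2-grams and variable-length 2-4 grams

2-grams:

Lower bound on # of common grams

= (|s1| - q + 1) - k * q

= (6 - 2 + 1) - 1 * 2 = 3

2-4 grams:

Lower bound on # of common grams

= |VG(q)| - NAG(q, k)

= 3 - 2 = 1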
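The two bounds above are simple enough to write down directly. The sketch below reproduces the arithmetic for the query shtick with k = 1. The function names are assumptions, and the value NAG(q, 1) = 2 is taken from the slide rather than computed.

```python
def fixed_length_bound(length, q, k):
    """Lower bound on common q-grams for a string of the given length within edit distance k."""
    return (length - q + 1) - k * q


def vgram_bound(num_grams, nag):
    """Lower bound on common variable-length grams: |VG(q)| - NAG(q, k)."""
    return num_grams - nag


print(fixed_length_bound(len("shtick"), q=2, k=1))  # 3
print(vgram_bound(num_grams=3, nag=2))              # 1
```

A bound of zero or less means the count filter cannot prune anything, since every string trivially shares at least zero grams with the query.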


Experiments

Data Sets

Data set 1: Texas Real Estate Commission.

– 151K person names, average length = 33.

Data set 2: English dictionary from the Aspell spellchecker for Cygwin.

– 149,165 words, average length = 8.

Data set 3: DBLP Bibliography.

– 277K titles, average length = 62.



VGRAM Overhead

Data set 3

Figure panels: Index Size, Construction Time


Benefits of Using Variable-Length Grams

Data set 1

Figure panels: Construction Time/Size, Query Time


Effect of qmax

Data Set 1

Figure panels: Construction Time / Query Time, Query Performance


Effect of Frequency Threshold

Data Set 1

Figure panels: Construction Time, Index Size, Query Time


Conclusion

We developed VGRAM to improve performance of approximate string queries

Variable-length grams, high-quality grams

We gave a full specification of the technique

Index structure

How to generate grams for a string using index structure

Relationship between the similarity of two strings and the similarity of their gram sets

We show how to adopt this technique in a variety of existing algorithms

