efficient parallel partition based algorithms for ...€¦ · additional filters parallel basic...
TRANSCRIPT
MotivationOur Approach
Experiment
Efficient Parallel Partition based Algorithms forSimilarity Search and Join with Edit Distance
Constraints
Yu Jiang, Dong Deng, Jiannan Wang, Guoliang Li, andJianhua Feng
Tsinghua University
Similarity Search&Join Competition on EDBT/ICDT 2013
Dong Deng Parallel PassJoin
MotivationOur Approach
Experiment
Outline
1 MotivationProblem DefinitionApplication
2 Our ApproachPass Join AlgorithmAdditional FiltersParallel
3 ExperimentEvaluating Pruning TechniquesEvaluating ParallelismEvaluating Scalability
Dong Deng Parallel PassJoin
MotivationOur Approach
Experiment
Problem DefinitionApplication
Problem DefinitionSTRING SIMILARITY JOINS
Given a set of strings S, the task is to find all pairs of τ -similarstrings from S. A program must output all matches with bothstring identifiers and distance τ .(Track II)
Dong Deng Parallel PassJoin
MotivationOur Approach
Experiment
Problem DefinitionApplication
An Example
Table: A string dataset
ID Strings Lengths1 vankatesh 9s2 avataresha 10s3 kaushic chaduri 15s4 kaushik chakrab 15s5 kaushuk chadhui 15s6 caushik chakrabar 17
Consider the string dataset in Table 1.Suppose τ = 3. 〈s4, s6〉 is a similar pair as ED(s4, s6) ≤ τ
Dong Deng Parallel PassJoin
MotivationOur Approach
Experiment
Problem DefinitionApplication
Application
Data cleaningInformation ExtractionComparison of biological sequences...
Dong Deng Parallel PassJoin
MotivationOur Approach
Experiment
Pass Join AlgorithmAdditional FiltersParallel
Basic Idea
LemmaGiven a string r with τ + 1 segments and a string s, if s issimilar to r within threshold τ , s must contain a segment of r .
Exampleτ = 1, r =“EDBT” has two segments “ED” and “BT”. s =“ICDT”cannot similar to r as s contains none of the two segemtns.
Dong Deng Parallel PassJoin
MotivationOur Approach
Experiment
Pass Join AlgorithmAdditional FiltersParallel
Even Partition Scheme
DefinitionIn even partition scheme, each segment has almost the samelength. (b |s|τ+1c or d |s|τ+1e)
Exampleτ = 3, we partition s1 =“vankatesh” into four segments “va”,“nk ”, “at”, “esh”.
Dong Deng Parallel PassJoin
MotivationOur Approach
Experiment
Pass Join AlgorithmAdditional FiltersParallel
Substring SelectionBasic Methods
Enumeration:Enumerate all substrings for each of the segment.Length-based:For each segment, only select substrings with same length.Shift-based:For segment with start position pi , select substrings withstart position in [pi − τ,pi + τ ]
Dong Deng Parallel PassJoin
MotivationOur Approach
Experiment
Pass Join AlgorithmAdditional FiltersParallel
Substring SelectionPosition-aware Substring Selection
Observation
Theorem (Position-aware Substring Selection)For segment with start position pi , select substrings with startposition in [pi − b τ−42 c,pi + b τ+42 c] where 4 = |s| − |r |.
Dong Deng Parallel PassJoin
MotivationOur Approach
Experiment
Pass Join AlgorithmAdditional FiltersParallel
Substring SelectionPosition-aware Substring Selection
Observation
Theorem (Position-aware Substring Selection)For segment with start position pi , select substrings with startposition in [pi − b τ−42 c,pi + b τ+42 c] where 4 = |s| − |r |.
Dong Deng Parallel PassJoin
MotivationOur Approach
Experiment
Pass Join AlgorithmAdditional FiltersParallel
Substring SelectionPosition-aware Substring Selection
Example
τ = 3, 4 = 1, [pi − b τ−42 c,pi + b τ+42 c] = [pi − 1,pi + 2]
Dong Deng Parallel PassJoin
MotivationOur Approach
Experiment
Pass Join AlgorithmAdditional FiltersParallel
Substring SelectionMulti-match-aware Substring Selection
Observation
There must be another matching between rr and sr .
Theorem (Multi-match-aware Substring Selection)For the i-th segment with start position pi , select substringswithin [pi−i ,pi+i] ∩ [pi+4−(τ+1−i),pi+4+(τ+1−i)].
Dong Deng Parallel PassJoin
MotivationOur Approach
Experiment
Pass Join AlgorithmAdditional FiltersParallel
Substring SelectionMulti-match-aware Substring Selection
Observation
There must be another matching between rr and sr .
Theorem (Multi-match-aware Substring Selection)For the i-th segment with start position pi , select substringswithin [pi−i ,pi+i] ∩ [pi+4−(τ+1−i),pi+4+(τ+1−i)].
Dong Deng Parallel PassJoin
MotivationOur Approach
Experiment
Pass Join AlgorithmAdditional FiltersParallel
Substring SelectionMulti-match-aware Substring Selection
Example
Dong Deng Parallel PassJoin
MotivationOur Approach
Experiment
Pass Join AlgorithmAdditional FiltersParallel
Substring SelectionTheoretical Results
1 The number of selected substrings by themulti-match-aware method is minimum.
2 For strings longer than 2 ∗ (τ + 1), our selection method isthe only way to select minimum number of substrings.
Dong Deng Parallel PassJoin
MotivationOur Approach
Experiment
Pass Join AlgorithmAdditional FiltersParallel
Substring SelectionExperimental Results
1e+006
1e+007
1e+008
1e+009
1 2 3 4
# of
sel
ecte
d su
bstr
ings
Threshold τ
LengthShift
PositonMulti-Match
(a) Author Name(Avg Len = 15)
1e+006
1e+007
1e+008
1e+009
1e+010
4 5 6 7 8
# of
sel
ecte
d su
bstr
ings
Threshold τ
LengthShift
PositonMulti-Match
(b) Query Log(Avg Len = 45)
1e+007
1e+008
1e+009
1e+010
1e+011
5 6 7 8 9 10
# of
sel
ecte
d su
bstr
ings
Threshold τ
LengthShift
PositonMulti-Match
(c) Author+Title(Avg Len = 105)
Figure: Numbers of selected substrings
Dong Deng Parallel PassJoin
MotivationOur Approach
Experiment
Pass Join AlgorithmAdditional FiltersParallel
Substring SelectionExperimental Results
0.1
1
10
100
1 2 3 4
Sel
ectio
n T
ime
(s)
Threshold τ
LengthShift
PositonMulti-Match
(a) Author Name(Avg Len = 15)
1
10
100
1000
4 5 6 7 8
Sel
ectio
n T
ime
(s)
Threshold τ
LengthShift
PositonMulti-Match
(b) Query Log(Avg Len = 45)
1
10
100
1000
10000
5 6 7 8 9 10
Sel
ectio
n T
ime
(s)
Threshold τ
LengthShift
PositonMulti-Match
(c) Author+Title(Avg Len = 105)
Figure: Elapsed time for generating substrings
Dong Deng Parallel PassJoin
MotivationOur Approach
Experiment
Pass Join AlgorithmAdditional FiltersParallel
VerificationLength-aware Verification
Inspired by the position-aware substring selection.Save at least half computation than traditional dynamicmethod.Save even more using improved early termination.
Dong Deng Parallel PassJoin
MotivationOur Approach
Experiment
Pass Join AlgorithmAdditional FiltersParallel
VerificationLength-aware Verification
Inspired by the position-aware substring selection.Save at least half computation than traditional dynamicmethod.Save even more using improved early termination.
Dong Deng Parallel PassJoin
MotivationOur Approach
Experiment
Pass Join AlgorithmAdditional FiltersParallel
VerificationLength-aware Verification
Inspired by the position-aware substring selection.Save at least half computation than traditional dynamicmethod.Save even more using improved early termination.
Dong Deng Parallel PassJoin
MotivationOur Approach
Experiment
Pass Join AlgorithmAdditional FiltersParallel
VerificationExtension-based Verification
Inspired by the multi-match-aware substring selection.Using tighter thresholds to verify the candidate pairs.Verify if ED(rr , sr ) ≤ τ + 1− i and ED(rl , sl) ≤ i − 1.
Dong Deng Parallel PassJoin
MotivationOur Approach
Experiment
Pass Join AlgorithmAdditional FiltersParallel
VerificationExtension-based Verification
Inspired by the multi-match-aware substring selection.Using tighter thresholds to verify the candidate pairs.Verify if ED(rr , sr ) ≤ τ + 1− i and ED(rl , sl) ≤ i − 1.
Dong Deng Parallel PassJoin
MotivationOur Approach
Experiment
Pass Join AlgorithmAdditional FiltersParallel
VerificationExtension-based Verification
Inspired by the multi-match-aware substring selection.Using tighter thresholds to verify the candidate pairs.Verify if ED(rr , sr ) ≤ τ + 1− i and ED(rl , sl) ≤ i − 1.
Dong Deng Parallel PassJoin
MotivationOur Approach
Experiment
Pass Join AlgorithmAdditional FiltersParallel
VerificationExperimental Results
1
10
100
1000
10000
100000
1 2 3 4
Ela
psed
Tim
e (s
)
Threshold τ
2τ+1τ+1
ExtensionSharePrefix
(a) Author Name(Avg Len 15)
10
100
1000
10000
4 5 6 7 8
Ela
psed
Tim
e (s
)
Threshold τ
2τ+1τ+1
ExtensionSharePrefix
(b) Query Log(Avg Len 45)
10
100
1000
10000
5 6 7 8 9 10
Ela
psed
Tim
e (s
)
Threshold τ
2τ+1τ+1
ExtensionSharePrefix
(c) Author+Title(Avg Len 105)
Figure: Elapsed time for verification
Dong Deng Parallel PassJoin
MotivationOur Approach
Experiment
Pass Join AlgorithmAdditional FiltersParallel
Additional FiltersEffective Indexing Strategy
Partition longer strings into segments.Select substrings from shorter strings.Longer segments decrease the possibility of matching.Thus decrease the number of candidates.
Dong Deng Parallel PassJoin
MotivationOur Approach
Experiment
Pass Join AlgorithmAdditional FiltersParallel
Additional FiltersEffective Indexing Strategy
Partition longer strings into segments.Select substrings from shorter strings.Longer segments decrease the possibility of matching.Thus decrease the number of candidates.
Dong Deng Parallel PassJoin
MotivationOur Approach
Experiment
Pass Join AlgorithmAdditional FiltersParallel
Additional FiltersEffective Indexing Strategy
Partition longer strings into segments.Select substrings from shorter strings.Longer segments decrease the possibility of matching.Thus decrease the number of candidates.
Dong Deng Parallel PassJoin
MotivationOur Approach
Experiment
Pass Join AlgorithmAdditional FiltersParallel
Additional FiltersEffective Indexing Strategy
Partition longer strings into segments.Select substrings from shorter strings.Longer segments decrease the possibility of matching.Thus decrease the number of candidates.
Dong Deng Parallel PassJoin
MotivationOur Approach
Experiment
Pass Join AlgorithmAdditional FiltersParallel
Additional FiltersContent Filter
ObservationLet Hr denote the character frequency vector of r .r =“abyyyy”, s =“axxyyyxy”.Hr = {{a,1}, {b,1}, {y ,4}}, Hs = {{a,1}, {x ,3}, {y ,4}}Let H4 = |Hr −Hs|.H4 = |Hr −Hs| = ||1|+ | − 3|| = 4.A deletion or insertion changes H4 by 1 at most.An substitution changes H4 by 2 at most.
Dong Deng Parallel PassJoin
MotivationOur Approach
Experiment
Pass Join AlgorithmAdditional FiltersParallel
Additional FiltersContent Filter
ObservationLet Hr denote the character frequency vector of r .r =“abyyyy”, s =“axxyyyxy”.Hr = {{a,1}, {b,1}, {y ,4}}, Hs = {{a,1}, {x ,3}, {y ,4}}Let H4 = |Hr −Hs|.H4 = |Hr −Hs| = ||1|+ | − 3|| = 4.A deletion or insertion changes H4 by 1 at most.An substitution changes H4 by 2 at most.
Dong Deng Parallel PassJoin
MotivationOur Approach
Experiment
Pass Join AlgorithmAdditional FiltersParallel
Additional FiltersContent Filter
ObservationLet Hr denote the character frequency vector of r .r =“abyyyy”, s =“axxyyyxy”.Hr = {{a,1}, {b,1}, {y ,4}}, Hs = {{a,1}, {x ,3}, {y ,4}}Let H4 = |Hr −Hs|.H4 = |Hr −Hs| = ||1|+ | − 3|| = 4.A deletion or insertion changes H4 by 1 at most.An substitution changes H4 by 2 at most.
Dong Deng Parallel PassJoin
MotivationOur Approach
Experiment
Pass Join AlgorithmAdditional FiltersParallel
Additional FiltersContent Filter
ObservationLet Hr denote the character frequency vector of r .r =“abyyyy”, s =“axxyyyxy”.Hr = {{a,1}, {b,1}, {y ,4}}, Hs = {{a,1}, {x ,3}, {y ,4}}Let H4 = |Hr −Hs|.H4 = |Hr −Hs| = ||1|+ | − 3|| = 4.A deletion or insertion changes H4 by 1 at most.An substitution changes H4 by 2 at most.
Dong Deng Parallel PassJoin
MotivationOur Approach
Experiment
Pass Join AlgorithmAdditional FiltersParallel
Additional FiltersContent Filter
ObservationAt most τ edit operations, H4 ≤ 2τ .At most τ −
∣∣|r | − |s|∣∣ substitutions, H4 ≤ 2τ −∣∣|r | − |s|∣∣.
Group symbols to improve the content-filter running time.Integrate the content filter with the extension-basedverification.
Dong Deng Parallel PassJoin
MotivationOur Approach
Experiment
Pass Join AlgorithmAdditional FiltersParallel
Additional FiltersContent Filter
ObservationAt most τ edit operations, H4 ≤ 2τ .At most τ −
∣∣|r | − |s|∣∣ substitutions, H4 ≤ 2τ −∣∣|r | − |s|∣∣.
Group symbols to improve the content-filter running time.Integrate the content filter with the extension-basedverification.
Dong Deng Parallel PassJoin
MotivationOur Approach
Experiment
Pass Join AlgorithmAdditional FiltersParallel
Additional FiltersContent Filter
ObservationAt most τ edit operations, H4 ≤ 2τ .At most τ −
∣∣|r | − |s|∣∣ substitutions, H4 ≤ 2τ −∣∣|r | − |s|∣∣.
Group symbols to improve the content-filter running time.Integrate the content filter with the extension-basedverification.
Dong Deng Parallel PassJoin
MotivationOur Approach
Experiment
Pass Join AlgorithmAdditional FiltersParallel
Additional FiltersContent Filter
ObservationAt most τ edit operations, H4 ≤ 2τ .At most τ −
∣∣|r | − |s|∣∣ substitutions, H4 ≤ 2τ −∣∣|r | − |s|∣∣.
Group symbols to improve the content-filter running time.Integrate the content filter with the extension-basedverification.
Dong Deng Parallel PassJoin
MotivationOur Approach
Experiment
Pass Join AlgorithmAdditional FiltersParallel
1 Parallel Sorting. Group strings by lengths using existingparallel algorithm.
2 Parallel Building Indexes. Parallel building indexes for eachgroup.
3 Parallel Joins. Parallel perform similarity joins on eachgroups.
Dong Deng Parallel PassJoin
MotivationOur Approach
Experiment
Evaluating Pruning TechniquesEvaluating ParallelismEvaluating Scalability
Experiment Setup
Table: Datasets
Datasets cardinality average len max len min lenGeoNames 400,000 11.106 1 60
GeoNames Query 100,000 10.7 2 43Reads 750,000 101.388 86 106
Reads Query 100,000 101.2 88 116
Dong Deng Parallel PassJoin
MotivationOur Approach
Experiment
Evaluating Pruning TechniquesEvaluating ParallelismEvaluating Scalability
Experiment Setup
0
10000
20000
30000
40000
50000
0 10 20 30 40 50 60
Num
bers
of s
trin
gs
String Lengths
(a) GeoNames
0
100000
200000
300000
400000
85 90 95 100 105
Num
bers
of s
trin
gs
String Lengths
(b) Reads
Figure: Length Distribution.
Dong Deng Parallel PassJoin
MotivationOur Approach
Experiment
Evaluating Pruning TechniquesEvaluating ParallelismEvaluating Scalability
Evaluating Pruning Techniques
0
10
20
30
40
50
1 2 3 4
Ela
psed
Tim
e (s
)
Edit Distance Threshold
BasicContentLonger
ParaJoin
(a) GeoNames
0
200
400
600
800
4 8 12 16
Ela
psed
Tim
e (s
)
Edit Distance Threshold
BasicContentLonger
ParaJoin
(b) Reads
Figure: Evaluating pruning techniques for similarity joins(8 threads).
Dong Deng Parallel PassJoin
MotivationOur Approach
Experiment
Evaluating Pruning TechniquesEvaluating ParallelismEvaluating Scalability
Evaluating Pruning Techniques
0
10
20
30
40
1 2 3 4
Ela
psed
Tim
e (s
)
Edit Distance Threshold
BasicSearchParaSearch
(a) GeoNames
0
50
100
150
200
4 8 12 16
Ela
psed
Tim
e (s
)
Edit Distance Threshold
BasicSearchParaSearch
(b) Reads
Figure: Evaluating pruning techniques for similarity search(8 threads).
Dong Deng Parallel PassJoin
MotivationOur Approach
Experiment
Evaluating Pruning TechniquesEvaluating ParallelismEvaluating Scalability
Evaluating Parallelism
0
20
40
60
80
100
2 4 6 8
Ela
psed
Tim
e (s
)
Number of Threads
tau=4tau=3tau=2tau=1
(a) GeoNames
0
150
300
450
600
750
2 4 6 8
Ela
psed
Tim
e (s
)
Number of Threads
tau=16tau=12tau=8tau=4
(b) Reads
Figure: Evaluating running time of similarity join by varying number ofthreads.
Dong Deng Parallel PassJoin
MotivationOur Approach
Experiment
Evaluating Pruning TechniquesEvaluating ParallelismEvaluating Scalability
Evaluating Speedup
2
4
6
8
2 4 6 8
Spe
edup
Number of Threads
tau=4tau=3tau=2tau=1Ideal
(a) GeoNames
2
4
6
8
2 4 6 8
Spe
edup
Number of Threads
tau=16tau=12
tau=8tau=4Ideal
(b) Reads
Figure: Evaluating speedup of similarity join.
Dong Deng Parallel PassJoin
MotivationOur Approach
Experiment
Evaluating Pruning TechniquesEvaluating ParallelismEvaluating Scalability
Evaluating Parallelism
0
30
60
90
120
150
2 4 6 8
Ela
psed
Tim
e (s
)
Number of Threads
tau=4tau=3tau=2tau=1
(a) GeoNames
0
50
100
150
200
2 4 6 8
Ela
psed
Tim
e (s
)
Number of Threads
tau=16tau=12tau=8tau=4
(b) Reads
Figure: Evaluating running time of similarity search by varyingnumber of threads.
Dong Deng Parallel PassJoin
MotivationOur Approach
Experiment
Evaluating Pruning TechniquesEvaluating ParallelismEvaluating Scalability
Evaluating Speedup
2
4
6
8
2 4 6 8
Spe
edup
Number of Threads
tau=4tau=3tau=2tau=1Ideal
(a) GeoNames
2
4
6
8
2 4 6 8
Spe
edup
Number of Threads
tau=16tau=12
tau=8tau=4Ideal
(b) Reads
Figure: Evaluating speedup of similarity search.
Dong Deng Parallel PassJoin
MotivationOur Approach
Experiment
Evaluating Pruning TechniquesEvaluating ParallelismEvaluating Scalability
Evaluating Scalability
0
40
80
120
0.25 0.5 0.75 1
Ela
psed
Tim
e (s
)
Number of Strings(*1,000,000)
tau=4tau=3tau=2tau=1
(a) GeoNames
0
80
160
240
320
0.25 0.5 0.75 1
Ela
psed
Tim
e (s
)
Number of Strings(*1,000,000)
tau=16tau=12
tau=8tau=4
(b) Reads
Figure: Evaluating the scalability of the similarity join algorithm(8threads).
Dong Deng Parallel PassJoin
MotivationOur Approach
Experiment
Evaluating Pruning TechniquesEvaluating ParallelismEvaluating Scalability
Evaluating Scalability
0
30
60
90
0.25 0.5 0.75 1
Ela
psed
Tim
e (s
)
Number of Strings(*1,000,000)
tau=4tau=3tau=2tau=1
(a) GeoNames
0
10
20
30
40
0.25 0.5 0.75 1
Ela
psed
Tim
e (s
)
Number of Strings(*1,000,000)
tau=16tau=12
tau=8tau=4
(b) Reads
Figure: Evaluating the scalability of the similarity search algorithm(8threads).
Dong Deng Parallel PassJoin
Appendix Our Team
About our team I
We are from Tsinghua University, Beijing, China.
Yu Jiang, Jiannan Wang, Guoliang Li, Jianhua Feng andDong Deng.
Dong Deng Parallel PassJoin
Appendix Our Team
About our team II
Dong Deng Parallel PassJoin
Appendix Our Team
Thank YouQ & A
http://dbgroup.cs.tsinghua.edu.cn/dd
Pass-Join: A Partition based Method for Similarity Joins.Guoliang Li, Dong Deng, Jiannan Wang, Jianhua Feng. VLDB2012.
Dong Deng Parallel PassJoin