efficient parallel partition based algorithms for ...€¦ · additional filters parallel basic...

50
Motivation Our Approach Experiment Efficient Parallel Partition based Algorithms for Similarity Search and Join with Edit Distance Constraints Yu Jiang, Dong Deng, Jiannan Wang, Guoliang Li, and Jianhua Feng Tsinghua University Similarity Search&Join Competition on EDBT/ICDT 2013 Dong Deng Parallel PassJoin

Upload: others

Post on 02-Aug-2020

6 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Efficient Parallel Partition based Algorithms for ...€¦ · Additional Filters Parallel Basic Idea Lemma Given a string r with ˝+1 segments and a string s, if s is similar to r

MotivationOur Approach

Experiment

Efficient Parallel Partition based Algorithms forSimilarity Search and Join with Edit Distance

Constraints

Yu Jiang, Dong Deng, Jiannan Wang, Guoliang Li, andJianhua Feng

Tsinghua University

Similarity Search&Join Competition on EDBT/ICDT 2013

Dong Deng Parallel PassJoin

Page 2: Efficient Parallel Partition based Algorithms for ...€¦ · Additional Filters Parallel Basic Idea Lemma Given a string r with ˝+1 segments and a string s, if s is similar to r

MotivationOur Approach

Experiment

Outline

1 MotivationProblem DefinitionApplication

2 Our ApproachPass Join AlgorithmAdditional FiltersParallel

3 ExperimentEvaluating Pruning TechniquesEvaluating ParallelismEvaluating Scalability

Dong Deng Parallel PassJoin

Page 3: Efficient Parallel Partition based Algorithms for ...€¦ · Additional Filters Parallel Basic Idea Lemma Given a string r with ˝+1 segments and a string s, if s is similar to r

MotivationOur Approach

Experiment

Problem DefinitionApplication

Problem DefinitionSTRING SIMILARITY JOINS

Given a set of strings S, the task is to find all pairs of τ -similarstrings from S. A program must output all matches with bothstring identifiers and distance τ .(Track II)

Dong Deng Parallel PassJoin

Page 4: Efficient Parallel Partition based Algorithms for ...€¦ · Additional Filters Parallel Basic Idea Lemma Given a string r with ˝+1 segments and a string s, if s is similar to r

MotivationOur Approach

Experiment

Problem DefinitionApplication

An Example

Table: A string dataset

ID Strings Lengths1 vankatesh 9s2 avataresha 10s3 kaushic chaduri 15s4 kaushik chakrab 15s5 kaushuk chadhui 15s6 caushik chakrabar 17

Consider the string dataset in Table 1.Suppose τ = 3. 〈s4, s6〉 is a similar pair as ED(s4, s6) ≤ τ

Dong Deng Parallel PassJoin

Page 5: Efficient Parallel Partition based Algorithms for ...€¦ · Additional Filters Parallel Basic Idea Lemma Given a string r with ˝+1 segments and a string s, if s is similar to r

MotivationOur Approach

Experiment

Problem DefinitionApplication

Application

Data cleaningInformation ExtractionComparison of biological sequences...

Dong Deng Parallel PassJoin

Page 6: Efficient Parallel Partition based Algorithms for ...€¦ · Additional Filters Parallel Basic Idea Lemma Given a string r with ˝+1 segments and a string s, if s is similar to r

MotivationOur Approach

Experiment

Pass Join AlgorithmAdditional FiltersParallel

Basic Idea

LemmaGiven a string r with τ + 1 segments and a string s, if s issimilar to r within threshold τ , s must contain a segment of r .

Exampleτ = 1, r =“EDBT” has two segments “ED” and “BT”. s =“ICDT”cannot similar to r as s contains none of the two segemtns.

Dong Deng Parallel PassJoin

Page 7: Efficient Parallel Partition based Algorithms for ...€¦ · Additional Filters Parallel Basic Idea Lemma Given a string r with ˝+1 segments and a string s, if s is similar to r

MotivationOur Approach

Experiment

Pass Join AlgorithmAdditional FiltersParallel

Even Partition Scheme

DefinitionIn even partition scheme, each segment has almost the samelength. (b |s|τ+1c or d |s|τ+1e)

Exampleτ = 3, we partition s1 =“vankatesh” into four segments “va”,“nk ”, “at”, “esh”.

Dong Deng Parallel PassJoin

Page 8: Efficient Parallel Partition based Algorithms for ...€¦ · Additional Filters Parallel Basic Idea Lemma Given a string r with ˝+1 segments and a string s, if s is similar to r

MotivationOur Approach

Experiment

Pass Join AlgorithmAdditional FiltersParallel

Substring SelectionBasic Methods

Enumeration:Enumerate all substrings for each of the segment.Length-based:For each segment, only select substrings with same length.Shift-based:For segment with start position pi , select substrings withstart position in [pi − τ,pi + τ ]

Dong Deng Parallel PassJoin

Page 9: Efficient Parallel Partition based Algorithms for ...€¦ · Additional Filters Parallel Basic Idea Lemma Given a string r with ˝+1 segments and a string s, if s is similar to r

MotivationOur Approach

Experiment

Pass Join AlgorithmAdditional FiltersParallel

Substring SelectionPosition-aware Substring Selection

Observation

Theorem (Position-aware Substring Selection)For segment with start position pi , select substrings with startposition in [pi − b τ−42 c,pi + b τ+42 c] where 4 = |s| − |r |.

Dong Deng Parallel PassJoin

Page 10: Efficient Parallel Partition based Algorithms for ...€¦ · Additional Filters Parallel Basic Idea Lemma Given a string r with ˝+1 segments and a string s, if s is similar to r

MotivationOur Approach

Experiment

Pass Join AlgorithmAdditional FiltersParallel

Substring SelectionPosition-aware Substring Selection

Observation

Theorem (Position-aware Substring Selection)For segment with start position pi , select substrings with startposition in [pi − b τ−42 c,pi + b τ+42 c] where 4 = |s| − |r |.

Dong Deng Parallel PassJoin

Page 11: Efficient Parallel Partition based Algorithms for ...€¦ · Additional Filters Parallel Basic Idea Lemma Given a string r with ˝+1 segments and a string s, if s is similar to r

MotivationOur Approach

Experiment

Pass Join AlgorithmAdditional FiltersParallel

Substring SelectionPosition-aware Substring Selection

Example

τ = 3, 4 = 1, [pi − b τ−42 c,pi + b τ+42 c] = [pi − 1,pi + 2]

Dong Deng Parallel PassJoin

Page 12: Efficient Parallel Partition based Algorithms for ...€¦ · Additional Filters Parallel Basic Idea Lemma Given a string r with ˝+1 segments and a string s, if s is similar to r

MotivationOur Approach

Experiment

Pass Join AlgorithmAdditional FiltersParallel

Substring SelectionMulti-match-aware Substring Selection

Observation

There must be another matching between rr and sr .

Theorem (Multi-match-aware Substring Selection)For the i-th segment with start position pi , select substringswithin [pi−i ,pi+i] ∩ [pi+4−(τ+1−i),pi+4+(τ+1−i)].

Dong Deng Parallel PassJoin

Page 13: Efficient Parallel Partition based Algorithms for ...€¦ · Additional Filters Parallel Basic Idea Lemma Given a string r with ˝+1 segments and a string s, if s is similar to r

MotivationOur Approach

Experiment

Pass Join AlgorithmAdditional FiltersParallel

Substring SelectionMulti-match-aware Substring Selection

Observation

There must be another matching between rr and sr .

Theorem (Multi-match-aware Substring Selection)For the i-th segment with start position pi , select substringswithin [pi−i ,pi+i] ∩ [pi+4−(τ+1−i),pi+4+(τ+1−i)].

Dong Deng Parallel PassJoin

Page 14: Efficient Parallel Partition based Algorithms for ...€¦ · Additional Filters Parallel Basic Idea Lemma Given a string r with ˝+1 segments and a string s, if s is similar to r

MotivationOur Approach

Experiment

Pass Join AlgorithmAdditional FiltersParallel

Substring SelectionMulti-match-aware Substring Selection

Example

Dong Deng Parallel PassJoin

Page 15: Efficient Parallel Partition based Algorithms for ...€¦ · Additional Filters Parallel Basic Idea Lemma Given a string r with ˝+1 segments and a string s, if s is similar to r

MotivationOur Approach

Experiment

Pass Join AlgorithmAdditional FiltersParallel

Substring SelectionTheoretical Results

1 The number of selected substrings by themulti-match-aware method is minimum.

2 For strings longer than 2 ∗ (τ + 1), our selection method isthe only way to select minimum number of substrings.

Dong Deng Parallel PassJoin

Page 16: Efficient Parallel Partition based Algorithms for ...€¦ · Additional Filters Parallel Basic Idea Lemma Given a string r with ˝+1 segments and a string s, if s is similar to r

MotivationOur Approach

Experiment

Pass Join AlgorithmAdditional FiltersParallel

Substring SelectionExperimental Results

1e+006

1e+007

1e+008

1e+009

1 2 3 4

# of

sel

ecte

d su

bstr

ings

Threshold τ

LengthShift

PositonMulti-Match

(a) Author Name(Avg Len = 15)

1e+006

1e+007

1e+008

1e+009

1e+010

4 5 6 7 8

# of

sel

ecte

d su

bstr

ings

Threshold τ

LengthShift

PositonMulti-Match

(b) Query Log(Avg Len = 45)

1e+007

1e+008

1e+009

1e+010

1e+011

5 6 7 8 9 10

# of

sel

ecte

d su

bstr

ings

Threshold τ

LengthShift

PositonMulti-Match

(c) Author+Title(Avg Len = 105)

Figure: Numbers of selected substrings

Dong Deng Parallel PassJoin

Page 17: Efficient Parallel Partition based Algorithms for ...€¦ · Additional Filters Parallel Basic Idea Lemma Given a string r with ˝+1 segments and a string s, if s is similar to r

MotivationOur Approach

Experiment

Pass Join AlgorithmAdditional FiltersParallel

Substring SelectionExperimental Results

0.1

1

10

100

1 2 3 4

Sel

ectio

n T

ime

(s)

Threshold τ

LengthShift

PositonMulti-Match

(a) Author Name(Avg Len = 15)

1

10

100

1000

4 5 6 7 8

Sel

ectio

n T

ime

(s)

Threshold τ

LengthShift

PositonMulti-Match

(b) Query Log(Avg Len = 45)

1

10

100

1000

10000

5 6 7 8 9 10

Sel

ectio

n T

ime

(s)

Threshold τ

LengthShift

PositonMulti-Match

(c) Author+Title(Avg Len = 105)

Figure: Elapsed time for generating substrings

Dong Deng Parallel PassJoin

Page 18: Efficient Parallel Partition based Algorithms for ...€¦ · Additional Filters Parallel Basic Idea Lemma Given a string r with ˝+1 segments and a string s, if s is similar to r

MotivationOur Approach

Experiment

Pass Join AlgorithmAdditional FiltersParallel

VerificationLength-aware Verification

Inspired by the position-aware substring selection.Save at least half computation than traditional dynamicmethod.Save even more using improved early termination.

Dong Deng Parallel PassJoin

Page 19: Efficient Parallel Partition based Algorithms for ...€¦ · Additional Filters Parallel Basic Idea Lemma Given a string r with ˝+1 segments and a string s, if s is similar to r

MotivationOur Approach

Experiment

Pass Join AlgorithmAdditional FiltersParallel

VerificationLength-aware Verification

Inspired by the position-aware substring selection.Save at least half computation than traditional dynamicmethod.Save even more using improved early termination.

Dong Deng Parallel PassJoin

Page 20: Efficient Parallel Partition based Algorithms for ...€¦ · Additional Filters Parallel Basic Idea Lemma Given a string r with ˝+1 segments and a string s, if s is similar to r

MotivationOur Approach

Experiment

Pass Join AlgorithmAdditional FiltersParallel

VerificationLength-aware Verification

Inspired by the position-aware substring selection.Save at least half computation than traditional dynamicmethod.Save even more using improved early termination.

Dong Deng Parallel PassJoin

Page 21: Efficient Parallel Partition based Algorithms for ...€¦ · Additional Filters Parallel Basic Idea Lemma Given a string r with ˝+1 segments and a string s, if s is similar to r

MotivationOur Approach

Experiment

Pass Join AlgorithmAdditional FiltersParallel

VerificationExtension-based Verification

Inspired by the multi-match-aware substring selection.Using tighter thresholds to verify the candidate pairs.Verify if ED(rr , sr ) ≤ τ + 1− i and ED(rl , sl) ≤ i − 1.

Dong Deng Parallel PassJoin

Page 22: Efficient Parallel Partition based Algorithms for ...€¦ · Additional Filters Parallel Basic Idea Lemma Given a string r with ˝+1 segments and a string s, if s is similar to r

MotivationOur Approach

Experiment

Pass Join AlgorithmAdditional FiltersParallel

VerificationExtension-based Verification

Inspired by the multi-match-aware substring selection.Using tighter thresholds to verify the candidate pairs.Verify if ED(rr , sr ) ≤ τ + 1− i and ED(rl , sl) ≤ i − 1.

Dong Deng Parallel PassJoin

Page 23: Efficient Parallel Partition based Algorithms for ...€¦ · Additional Filters Parallel Basic Idea Lemma Given a string r with ˝+1 segments and a string s, if s is similar to r

MotivationOur Approach

Experiment

Pass Join AlgorithmAdditional FiltersParallel

VerificationExtension-based Verification

Inspired by the multi-match-aware substring selection.Using tighter thresholds to verify the candidate pairs.Verify if ED(rr , sr ) ≤ τ + 1− i and ED(rl , sl) ≤ i − 1.

Dong Deng Parallel PassJoin

Page 24: Efficient Parallel Partition based Algorithms for ...€¦ · Additional Filters Parallel Basic Idea Lemma Given a string r with ˝+1 segments and a string s, if s is similar to r

MotivationOur Approach

Experiment

Pass Join AlgorithmAdditional FiltersParallel

VerificationExperimental Results

1

10

100

1000

10000

100000

1 2 3 4

Ela

psed

Tim

e (s

)

Threshold τ

2τ+1τ+1

ExtensionSharePrefix

(a) Author Name(Avg Len 15)

10

100

1000

10000

4 5 6 7 8

Ela

psed

Tim

e (s

)

Threshold τ

2τ+1τ+1

ExtensionSharePrefix

(b) Query Log(Avg Len 45)

10

100

1000

10000

5 6 7 8 9 10

Ela

psed

Tim

e (s

)

Threshold τ

2τ+1τ+1

ExtensionSharePrefix

(c) Author+Title(Avg Len 105)

Figure: Elapsed time for verification

Dong Deng Parallel PassJoin

Page 25: Efficient Parallel Partition based Algorithms for ...€¦ · Additional Filters Parallel Basic Idea Lemma Given a string r with ˝+1 segments and a string s, if s is similar to r

MotivationOur Approach

Experiment

Pass Join AlgorithmAdditional FiltersParallel

Additional FiltersEffective Indexing Strategy

Partition longer strings into segments.Select substrings from shorter strings.Longer segments decrease the possibility of matching.Thus decrease the number of candidates.

Dong Deng Parallel PassJoin

Page 26: Efficient Parallel Partition based Algorithms for ...€¦ · Additional Filters Parallel Basic Idea Lemma Given a string r with ˝+1 segments and a string s, if s is similar to r

MotivationOur Approach

Experiment

Pass Join AlgorithmAdditional FiltersParallel

Additional FiltersEffective Indexing Strategy

Partition longer strings into segments.Select substrings from shorter strings.Longer segments decrease the possibility of matching.Thus decrease the number of candidates.

Dong Deng Parallel PassJoin

Page 27: Efficient Parallel Partition based Algorithms for ...€¦ · Additional Filters Parallel Basic Idea Lemma Given a string r with ˝+1 segments and a string s, if s is similar to r

MotivationOur Approach

Experiment

Pass Join AlgorithmAdditional FiltersParallel

Additional FiltersEffective Indexing Strategy

Partition longer strings into segments.Select substrings from shorter strings.Longer segments decrease the possibility of matching.Thus decrease the number of candidates.

Dong Deng Parallel PassJoin

Page 28: Efficient Parallel Partition based Algorithms for ...€¦ · Additional Filters Parallel Basic Idea Lemma Given a string r with ˝+1 segments and a string s, if s is similar to r

MotivationOur Approach

Experiment

Pass Join AlgorithmAdditional FiltersParallel

Additional FiltersEffective Indexing Strategy

Partition longer strings into segments.Select substrings from shorter strings.Longer segments decrease the possibility of matching.Thus decrease the number of candidates.

Dong Deng Parallel PassJoin

Page 29: Efficient Parallel Partition based Algorithms for ...€¦ · Additional Filters Parallel Basic Idea Lemma Given a string r with ˝+1 segments and a string s, if s is similar to r

MotivationOur Approach

Experiment

Pass Join AlgorithmAdditional FiltersParallel

Additional FiltersContent Filter

ObservationLet Hr denote the character frequency vector of r .r =“abyyyy”, s =“axxyyyxy”.Hr = {{a,1}, {b,1}, {y ,4}}, Hs = {{a,1}, {x ,3}, {y ,4}}Let H4 = |Hr −Hs|.H4 = |Hr −Hs| = ||1|+ | − 3|| = 4.A deletion or insertion changes H4 by 1 at most.An substitution changes H4 by 2 at most.

Dong Deng Parallel PassJoin

Page 30: Efficient Parallel Partition based Algorithms for ...€¦ · Additional Filters Parallel Basic Idea Lemma Given a string r with ˝+1 segments and a string s, if s is similar to r

MotivationOur Approach

Experiment

Pass Join AlgorithmAdditional FiltersParallel

Additional FiltersContent Filter

ObservationLet Hr denote the character frequency vector of r .r =“abyyyy”, s =“axxyyyxy”.Hr = {{a,1}, {b,1}, {y ,4}}, Hs = {{a,1}, {x ,3}, {y ,4}}Let H4 = |Hr −Hs|.H4 = |Hr −Hs| = ||1|+ | − 3|| = 4.A deletion or insertion changes H4 by 1 at most.An substitution changes H4 by 2 at most.

Dong Deng Parallel PassJoin

Page 31: Efficient Parallel Partition based Algorithms for ...€¦ · Additional Filters Parallel Basic Idea Lemma Given a string r with ˝+1 segments and a string s, if s is similar to r

MotivationOur Approach

Experiment

Pass Join AlgorithmAdditional FiltersParallel

Additional FiltersContent Filter

ObservationLet Hr denote the character frequency vector of r .r =“abyyyy”, s =“axxyyyxy”.Hr = {{a,1}, {b,1}, {y ,4}}, Hs = {{a,1}, {x ,3}, {y ,4}}Let H4 = |Hr −Hs|.H4 = |Hr −Hs| = ||1|+ | − 3|| = 4.A deletion or insertion changes H4 by 1 at most.An substitution changes H4 by 2 at most.

Dong Deng Parallel PassJoin

Page 32: Efficient Parallel Partition based Algorithms for ...€¦ · Additional Filters Parallel Basic Idea Lemma Given a string r with ˝+1 segments and a string s, if s is similar to r

MotivationOur Approach

Experiment

Pass Join AlgorithmAdditional FiltersParallel

Additional FiltersContent Filter

ObservationLet Hr denote the character frequency vector of r .r =“abyyyy”, s =“axxyyyxy”.Hr = {{a,1}, {b,1}, {y ,4}}, Hs = {{a,1}, {x ,3}, {y ,4}}Let H4 = |Hr −Hs|.H4 = |Hr −Hs| = ||1|+ | − 3|| = 4.A deletion or insertion changes H4 by 1 at most.An substitution changes H4 by 2 at most.

Dong Deng Parallel PassJoin

Page 33: Efficient Parallel Partition based Algorithms for ...€¦ · Additional Filters Parallel Basic Idea Lemma Given a string r with ˝+1 segments and a string s, if s is similar to r

MotivationOur Approach

Experiment

Pass Join AlgorithmAdditional FiltersParallel

Additional FiltersContent Filter

ObservationAt most τ edit operations, H4 ≤ 2τ .At most τ −

∣∣|r | − |s|∣∣ substitutions, H4 ≤ 2τ −∣∣|r | − |s|∣∣.

Group symbols to improve the content-filter running time.Integrate the content filter with the extension-basedverification.

Dong Deng Parallel PassJoin

Page 34: Efficient Parallel Partition based Algorithms for ...€¦ · Additional Filters Parallel Basic Idea Lemma Given a string r with ˝+1 segments and a string s, if s is similar to r

MotivationOur Approach

Experiment

Pass Join AlgorithmAdditional FiltersParallel

Additional FiltersContent Filter

ObservationAt most τ edit operations, H4 ≤ 2τ .At most τ −

∣∣|r | − |s|∣∣ substitutions, H4 ≤ 2τ −∣∣|r | − |s|∣∣.

Group symbols to improve the content-filter running time.Integrate the content filter with the extension-basedverification.

Dong Deng Parallel PassJoin

Page 35: Efficient Parallel Partition based Algorithms for ...€¦ · Additional Filters Parallel Basic Idea Lemma Given a string r with ˝+1 segments and a string s, if s is similar to r

MotivationOur Approach

Experiment

Pass Join AlgorithmAdditional FiltersParallel

Additional FiltersContent Filter

ObservationAt most τ edit operations, H4 ≤ 2τ .At most τ −

∣∣|r | − |s|∣∣ substitutions, H4 ≤ 2τ −∣∣|r | − |s|∣∣.

Group symbols to improve the content-filter running time.Integrate the content filter with the extension-basedverification.

Dong Deng Parallel PassJoin

Page 36: Efficient Parallel Partition based Algorithms for ...€¦ · Additional Filters Parallel Basic Idea Lemma Given a string r with ˝+1 segments and a string s, if s is similar to r

MotivationOur Approach

Experiment

Pass Join AlgorithmAdditional FiltersParallel

Additional FiltersContent Filter

ObservationAt most τ edit operations, H4 ≤ 2τ .At most τ −

∣∣|r | − |s|∣∣ substitutions, H4 ≤ 2τ −∣∣|r | − |s|∣∣.

Group symbols to improve the content-filter running time.Integrate the content filter with the extension-basedverification.

Dong Deng Parallel PassJoin

Page 37: Efficient Parallel Partition based Algorithms for ...€¦ · Additional Filters Parallel Basic Idea Lemma Given a string r with ˝+1 segments and a string s, if s is similar to r

MotivationOur Approach

Experiment

Pass Join AlgorithmAdditional FiltersParallel

1 Parallel Sorting. Group strings by lengths using existingparallel algorithm.

2 Parallel Building Indexes. Parallel building indexes for eachgroup.

3 Parallel Joins. Parallel perform similarity joins on eachgroups.

Dong Deng Parallel PassJoin

Page 38: Efficient Parallel Partition based Algorithms for ...€¦ · Additional Filters Parallel Basic Idea Lemma Given a string r with ˝+1 segments and a string s, if s is similar to r

MotivationOur Approach

Experiment

Evaluating Pruning TechniquesEvaluating ParallelismEvaluating Scalability

Experiment Setup

Table: Datasets

Datasets cardinality average len max len min lenGeoNames 400,000 11.106 1 60

GeoNames Query 100,000 10.7 2 43Reads 750,000 101.388 86 106

Reads Query 100,000 101.2 88 116

Dong Deng Parallel PassJoin

Page 39: Efficient Parallel Partition based Algorithms for ...€¦ · Additional Filters Parallel Basic Idea Lemma Given a string r with ˝+1 segments and a string s, if s is similar to r

MotivationOur Approach

Experiment

Evaluating Pruning TechniquesEvaluating ParallelismEvaluating Scalability

Experiment Setup

0

10000

20000

30000

40000

50000

0 10 20 30 40 50 60

Num

bers

of s

trin

gs

String Lengths

(a) GeoNames

0

100000

200000

300000

400000

85 90 95 100 105

Num

bers

of s

trin

gs

String Lengths

(b) Reads

Figure: Length Distribution.

Dong Deng Parallel PassJoin

Page 40: Efficient Parallel Partition based Algorithms for ...€¦ · Additional Filters Parallel Basic Idea Lemma Given a string r with ˝+1 segments and a string s, if s is similar to r

MotivationOur Approach

Experiment

Evaluating Pruning TechniquesEvaluating ParallelismEvaluating Scalability

Evaluating Pruning Techniques

0

10

20

30

40

50

1 2 3 4

Ela

psed

Tim

e (s

)

Edit Distance Threshold

BasicContentLonger

ParaJoin

(a) GeoNames

0

200

400

600

800

4 8 12 16

Ela

psed

Tim

e (s

)

Edit Distance Threshold

BasicContentLonger

ParaJoin

(b) Reads

Figure: Evaluating pruning techniques for similarity joins(8 threads).

Dong Deng Parallel PassJoin

Page 41: Efficient Parallel Partition based Algorithms for ...€¦ · Additional Filters Parallel Basic Idea Lemma Given a string r with ˝+1 segments and a string s, if s is similar to r

MotivationOur Approach

Experiment

Evaluating Pruning TechniquesEvaluating ParallelismEvaluating Scalability

Evaluating Pruning Techniques

0

10

20

30

40

1 2 3 4

Ela

psed

Tim

e (s

)

Edit Distance Threshold

BasicSearchParaSearch

(a) GeoNames

0

50

100

150

200

4 8 12 16

Ela

psed

Tim

e (s

)

Edit Distance Threshold

BasicSearchParaSearch

(b) Reads

Figure: Evaluating pruning techniques for similarity search(8 threads).

Dong Deng Parallel PassJoin

Page 42: Efficient Parallel Partition based Algorithms for ...€¦ · Additional Filters Parallel Basic Idea Lemma Given a string r with ˝+1 segments and a string s, if s is similar to r

MotivationOur Approach

Experiment

Evaluating Pruning TechniquesEvaluating ParallelismEvaluating Scalability

Evaluating Parallelism

0

20

40

60

80

100

2 4 6 8

Ela

psed

Tim

e (s

)

Number of Threads

tau=4tau=3tau=2tau=1

(a) GeoNames

0

150

300

450

600

750

2 4 6 8

Ela

psed

Tim

e (s

)

Number of Threads

tau=16tau=12tau=8tau=4

(b) Reads

Figure: Evaluating running time of similarity join by varying number ofthreads.

Dong Deng Parallel PassJoin

Page 43: Efficient Parallel Partition based Algorithms for ...€¦ · Additional Filters Parallel Basic Idea Lemma Given a string r with ˝+1 segments and a string s, if s is similar to r

MotivationOur Approach

Experiment

Evaluating Pruning TechniquesEvaluating ParallelismEvaluating Scalability

Evaluating Speedup

2

4

6

8

2 4 6 8

Spe

edup

Number of Threads

tau=4tau=3tau=2tau=1Ideal

(a) GeoNames

2

4

6

8

2 4 6 8

Spe

edup

Number of Threads

tau=16tau=12

tau=8tau=4Ideal

(b) Reads

Figure: Evaluating speedup of similarity join.

Dong Deng Parallel PassJoin

Page 44: Efficient Parallel Partition based Algorithms for ...€¦ · Additional Filters Parallel Basic Idea Lemma Given a string r with ˝+1 segments and a string s, if s is similar to r

MotivationOur Approach

Experiment

Evaluating Pruning TechniquesEvaluating ParallelismEvaluating Scalability

Evaluating Parallelism

0

30

60

90

120

150

2 4 6 8

Ela

psed

Tim

e (s

)

Number of Threads

tau=4tau=3tau=2tau=1

(a) GeoNames

0

50

100

150

200

2 4 6 8

Ela

psed

Tim

e (s

)

Number of Threads

tau=16tau=12tau=8tau=4

(b) Reads

Figure: Evaluating running time of similarity search by varyingnumber of threads.

Dong Deng Parallel PassJoin

Page 45: Efficient Parallel Partition based Algorithms for ...€¦ · Additional Filters Parallel Basic Idea Lemma Given a string r with ˝+1 segments and a string s, if s is similar to r

MotivationOur Approach

Experiment

Evaluating Pruning TechniquesEvaluating ParallelismEvaluating Scalability

Evaluating Speedup

2

4

6

8

2 4 6 8

Spe

edup

Number of Threads

tau=4tau=3tau=2tau=1Ideal

(a) GeoNames

2

4

6

8

2 4 6 8

Spe

edup

Number of Threads

tau=16tau=12

tau=8tau=4Ideal

(b) Reads

Figure: Evaluating speedup of similarity search.

Dong Deng Parallel PassJoin

Page 46: Efficient Parallel Partition based Algorithms for ...€¦ · Additional Filters Parallel Basic Idea Lemma Given a string r with ˝+1 segments and a string s, if s is similar to r

MotivationOur Approach

Experiment

Evaluating Pruning TechniquesEvaluating ParallelismEvaluating Scalability

Evaluating Scalability

0

40

80

120

0.25 0.5 0.75 1

Ela

psed

Tim

e (s

)

Number of Strings(*1,000,000)

tau=4tau=3tau=2tau=1

(a) GeoNames

0

80

160

240

320

0.25 0.5 0.75 1

Ela

psed

Tim

e (s

)

Number of Strings(*1,000,000)

tau=16tau=12

tau=8tau=4

(b) Reads

Figure: Evaluating the scalability of the similarity join algorithm(8threads).

Dong Deng Parallel PassJoin

Page 47: Efficient Parallel Partition based Algorithms for ...€¦ · Additional Filters Parallel Basic Idea Lemma Given a string r with ˝+1 segments and a string s, if s is similar to r

MotivationOur Approach

Experiment

Evaluating Pruning TechniquesEvaluating ParallelismEvaluating Scalability

Evaluating Scalability

0

30

60

90

0.25 0.5 0.75 1

Ela

psed

Tim

e (s

)

Number of Strings(*1,000,000)

tau=4tau=3tau=2tau=1

(a) GeoNames

0

10

20

30

40

0.25 0.5 0.75 1

Ela

psed

Tim

e (s

)

Number of Strings(*1,000,000)

tau=16tau=12

tau=8tau=4

(b) Reads

Figure: Evaluating the scalability of the similarity search algorithm(8threads).

Dong Deng Parallel PassJoin

Page 48: Efficient Parallel Partition based Algorithms for ...€¦ · Additional Filters Parallel Basic Idea Lemma Given a string r with ˝+1 segments and a string s, if s is similar to r

Appendix Our Team

About our team I

We are from Tsinghua University, Beijing, China.

Yu Jiang, Jiannan Wang, Guoliang Li, Jianhua Feng andDong Deng.

Dong Deng Parallel PassJoin

Page 49: Efficient Parallel Partition based Algorithms for ...€¦ · Additional Filters Parallel Basic Idea Lemma Given a string r with ˝+1 segments and a string s, if s is similar to r

Appendix Our Team

About our team II

Dong Deng Parallel PassJoin

Page 50: Efficient Parallel Partition based Algorithms for ...€¦ · Additional Filters Parallel Basic Idea Lemma Given a string r with ˝+1 segments and a string s, if s is similar to r

Appendix Our Team

Thank YouQ & A

http://dbgroup.cs.tsinghua.edu.cn/dd

Pass-Join: A Partition based Method for Similarity Joins.Guoliang Li, Dong Deng, Jiannan Wang, Jianhua Feng. VLDB2012.

Dong Deng Parallel PassJoin