![Page 1: Efficient Exact Set-Similarity Joins](https://reader036.vdocuments.site/reader036/viewer/2022062410/568159a6550346895dc70a74/html5/thumbnails/1.jpg)
Efficient Exact Set-Similarity Joins
Arvind Arasu, Venkatesh Ganti, Raghav Kaushik
DMX Group, Microsoft Research
![Page 2: Efficient Exact Set-Similarity Joins](https://reader036.vdocuments.site/reader036/viewer/2022062410/568159a6550346895dc70a74/html5/thumbnails/2.jpg)
Sept. 15, 2006 · Set-Similarity Joins

Data Cleaning

| Name | Street | City | State | Zip |
| --- | --- | --- | --- | --- |
| INGRAM MICRO | 1600 ST ANDREWS PL | SANTA ANA | CA | 92799 |
| GTE CORP | 1 STAMFORD FORUM | STAMFORD | CT | |
| LOGISOFT | 274 GOODMAN ST N | ROCHESTER | | 14607 |
| CIEDC | 1800 5TH ST | LINCONL | IL | 92799 |
| INGRAM MCRO | 1600 ST ANDREW’S PL | SANTA ANA | CA | 92799 |
![Page 6: Efficient Exact Set-Similarity Joins](https://reader036.vdocuments.site/reader036/viewer/2022062410/568159a6550346895dc70a74/html5/thumbnails/6.jpg)
Data Cleaning

| Name | Street | City | State | Zip |
| --- | --- | --- | --- | --- |
| INGRAM MICRO | 1600 ST ANDREWS PL | SANTA ANA | CA | 92799 |
| GTE CORP | 1 STAMFORD FORUM | STAMFORD | CT | 06901 |
| LOGISOFT | 274 GOODMAN ST N | ROCHESTER | NY | 14607 |
| CIEDC | 1800 5TH ST | LINCOLN | IL | 92799 |
![Page 7: Efficient Exact Set-Similarity Joins](https://reader036.vdocuments.site/reader036/viewer/2022062410/568159a6550346895dc70a74/html5/thumbnails/7.jpg)
String Similarity Join

Reference Table

| CITY |
| --- |
| ALABASTER |
| ALBERTVILLE |
| … |
| LINCOLN |
| … |
| YUCAIPA |

Input table

| … | … | City | … | … |
| --- | --- | --- | --- | --- |
| … | … | … | … | … |
| … | … | LINCONL | … | … |
| … | … | … | … | … |
![Page 8: Efficient Exact Set-Similarity Joins](https://reader036.vdocuments.site/reader036/viewer/2022062410/568159a6550346895dc70a74/html5/thumbnails/8.jpg)
String Similarity (Self) Join

| Name | Street | City | State | Zip |
| --- | --- | --- | --- | --- |
| INGRAM MICRO | 1600 ST ANDREWS PL | SANTA ANA | CA | 92799 |
| GTE CORP | 1 STAMFORD FORUM | STAMFORD | CT | |
| LOGISOFT | 274 GOODMAN ST N | ROCHESTER | | 14607 |
| CIEDC | 1800 5TH ST | LINCONL | IL | 92799 |
| INGRAM MCRO | 1600 ST ANDREW’S PL | SANTA ANA | CA | 92799 |
![Page 9: Efficient Exact Set-Similarity Joins](https://reader036.vdocuments.site/reader036/viewer/2022062410/568159a6550346895dc70a74/html5/thumbnails/9.jpg)
Strings → Sets [CGK ’06]
2-grams:
microsoft → {mi, ic, cr, ro, os, so, of, ft}
mcrosoft → {mc, cr, ro, os, so, of, ft}
(edit distance ≤ 1) → (Δ ≤ 4)
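
The string-to-set mapping can be sketched in a few lines of Python. The helper names `qgrams` and `sym_diff_size` are illustrative and not from the talk; the sketch assumes plain 2-gram sets (no padding characters, duplicates collapsed), which is consistent with the two sets shown on the slide.

```python
def qgrams(s: str, q: int = 2) -> set[str]:
    """Return the set of contiguous q-grams of s (no padding, duplicates collapsed)."""
    return {s[i:i + q] for i in range(len(s) - q + 1)}


def sym_diff_size(a: set[str], b: set[str]) -> int:
    """Size of the symmetric difference |a Δ b|."""
    return len(a ^ b)


r = qgrams("microsoft")
s = qgrams("mcrosoft")
delta = sym_diff_size(r, s)  # only the grams mi, ic and mc are unmatched
```

A single edit creates or destroys at most two 2-grams on each side, which is why edit distance ≤ 1 implies Δ ≤ 4 for 2-grams.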
![Page 10: Efficient Exact Set-Similarity Joins](https://reader036.vdocuments.site/reader036/viewer/2022062410/568159a6550346895dc70a74/html5/thumbnails/10.jpg)
[Figure: Strings → Sets. Two string collections, R and S; one contains mcrosoft, the other microsoft. String Sim Join with predicate edit distance ≤ 1.]
![Page 11: Efficient Exact Set-Similarity Joins](https://reader036.vdocuments.site/reader036/viewer/2022062410/568159a6550346895dc70a74/html5/thumbnails/11.jpg)
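[Figure: Strings → Sets. The string collections R and S (containing mcrosoft and microsoft) are tokenized into sets (Tokenize), joined with a Set Sim Join under Δ ≤ 4, and the candidates are then post-processed (Post-Process).]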
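
The three stages in the figure (Tokenize, Set Sim Join, Post-Process) can be sketched as below. This is a minimal illustration, not the authors' implementation: the set-similarity join is a naive nested loop, the function names are made up for this sketch, and the strings `logisoft` and `lgisoft` are inferred from the 2-gram sets shown later in the talk.

```python
from itertools import product


def qgrams(s: str, q: int = 2) -> frozenset[str]:
    return frozenset(s[i:i + q] for i in range(len(s) - q + 1))


def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def string_sim_join(R, S, max_edits=1, max_delta=4):
    # Step 1 (Tokenize): map every string to its set of 2-grams.
    r_sets = {r: qgrams(r) for r in R}
    s_sets = {s: qgrams(s) for s in S}
    # Step 2 (Set Sim Join): keep pairs with |Δ| <= max_delta (naive loop here).
    candidates = [(r, s) for r, s in product(R, S)
                  if len(r_sets[r] ^ s_sets[s]) <= max_delta]
    # Step 3 (Post-Process): verify the original string predicate.
    return [(r, s) for r, s in candidates if edit_distance(r, s) <= max_edits]


pairs = string_sim_join(["mcrosoft", "logisoft"], ["microsoft", "lgisoft"])
```

The set-level filter in step 2 never loses a true match because edit distance ≤ 1 implies Δ ≤ 4; step 3 only removes pairs that pass the set predicate but fail the string predicate.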
![Page 12: Efficient Exact Set-Similarity Joins](https://reader036.vdocuments.site/reader036/viewer/2022062410/568159a6550346895dc70a74/html5/thumbnails/12.jpg)
String → Set: Advantages
- Generalizes to many string similarity funcs
  - Powerful primitive
- Sets ≈ Relations
  - Leverage relational data processing [CGK ’06]
![Page 13: Efficient Exact Set-Similarity Joins](https://reader036.vdocuments.site/reader036/viewer/2022062410/568159a6550346895dc70a74/html5/thumbnails/13.jpg)
Contributions
- New algorithms for set-similarity joins
  - Exact answers
  - Performance guarantees
  - Outperform previous exact algorithms
    - Orders of magnitude
- Exact answers are important for operators
![Page 14: Efficient Exact Set-Similarity Joins](https://reader036.vdocuments.site/reader036/viewer/2022062410/568159a6550346895dc70a74/html5/thumbnails/14.jpg)
Outline
- Introduction
- Algorithms
- Experiments
- Conclusion
![Page 15: Efficient Exact Set-Similarity Joins](https://reader036.vdocuments.site/reader036/viewer/2022062410/568159a6550346895dc70a74/html5/thumbnails/15.jpg)
[Figure: two collections of 2-gram sets, labeled R and S. One contains { mi, ic, cr, ro, os, so, of, ft }, { lo, og, gi, is, so, of, ft }, …, { bo, oe, ei, in, ng }; the other contains { mc, cr, ro, os, so, of, ft }, { lg, gi, is, so, of, ft }, ….]
![Page 16: Efficient Exact Set-Similarity Joins](https://reader036.vdocuments.site/reader036/viewer/2022062410/568159a6550346895dc70a74/html5/thumbnails/16.jpg)
[Figure: the collections R and S of 2-gram sets, with the join predicate: intersection size ≥ 5.]
![Page 20: Efficient Exact Set-Similarity Joins](https://reader036.vdocuments.site/reader036/viewer/2022062410/568159a6550346895dc70a74/html5/thumbnails/20.jpg)
[Figure: the collections R and S of 2-gram sets, with the pairs that satisfy intersection size ≥ 5 shown as output:]
({ mc, cr, ro, os, so, of, ft }, { mi, ic, cr, ro, os, so, of, ft })
({ lg, gi, is, so, of, ft }, { lo, og, gi, is, so, of, ft })
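
As a baseline for what the join computes, here is a naive nested-loop sketch over the example sets. Which sets belong to R and which to S is an assumption made for this illustration; the slides do not make the assignment recoverable.

```python
# Assumed split of the example sets into the two inputs.
R = [
    {"mi", "ic", "cr", "ro", "os", "so", "of", "ft"},
    {"lo", "og", "gi", "is", "so", "of", "ft"},
    {"bo", "oe", "ei", "in", "ng"},
]
S = [
    {"mc", "cr", "ro", "os", "so", "of", "ft"},
    {"lg", "gi", "is", "so", "of", "ft"},
]


def intersection_join(R, S, t=5):
    """Return index pairs (i, j) with |R[i] ∩ S[j]| >= t, comparing all pairs."""
    return [(i, j)
            for i, r in enumerate(R)
            for j, s in enumerate(S)
            if len(r & s) >= t]


matches = intersection_join(R, S)
```

The nested loop compares every pair, so its cost grows with the product of the input sizes; the signature-based algorithms in the talk aim to avoid exactly that.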
![Page 22: Efficient Exact Set-Similarity Joins](https://reader036.vdocuments.site/reader036/viewer/2022062410/568159a6550346895dc70a74/html5/thumbnails/22.jpg)
[Figure: collections R = r₁, r₂, r₃, …, rₙ and S = s₁, s₂, s₃, …, sₘ, with the general join predicate Sim(rᵢ, sⱼ) ≥ θ. The two matching pairs from the example, ({ mc, cr, ro, os, so, of, ft }, { mi, ic, cr, ro, os, so, of, ft }) and ({ lg, gi, is, so, of, ft }, { lo, og, gi, is, so, of, ft }), are shown as output. Annotation: "Large".]
![Page 23: Efficient Exact Set-Similarity Joins](https://reader036.vdocuments.site/reader036/viewer/2022062410/568159a6550346895dc70a74/html5/thumbnails/23.jpg)
Set-Similarity Join: Symmetric Difference
Input: R: r₁, r₂, …, rₙ (n sets); S: s₁, s₂, …, sₘ (m sets)
Output: All pairs (rᵢ, sⱼ) such that |rᵢ Δ sⱼ| ≤ k
Running example: k = 4
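
For reference, the symmetric-difference predicate relates to the intersection size used on the earlier slides through a standard set identity (not stated in the talk itself):

```latex
|r \,\Delta\, s| \;=\; |r| + |s| - 2\,|r \cap s|
\qquad\Longrightarrow\qquad
|r \,\Delta\, s| \le k \iff |r \cap s| \ge \tfrac{1}{2}\bigl(|r| + |s| - k\bigr)
```

For the two example pairs this gives 8 + 7 − 2·6 = 3 and 7 + 6 − 2·5 = 3, so both satisfy the running-example predicate with k = 4.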
![Page 24: Efficient Exact Set-Similarity Joins](https://reader036.vdocuments.site/reader036/viewer/2022062410/568159a6550346895dc70a74/html5/thumbnails/24.jpg)
Alternate Set Representation
s = { 4, 10, 13, 24, 29, 35, 41, 46, 48 }
![Page 25: Efficient Exact Set-Similarity Joins](https://reader036.vdocuments.site/reader036/viewer/2022062410/568159a6550346895dc70a74/html5/thumbnails/25.jpg)
[Figure: s = { 4, 10, 13, 24, 29, 35, 41, 46, 48 } drawn as marks on a number line over the element domain, with ticks at 1, 25 and 50.]
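
One way to read this picture, offered as an assumption rather than something the slide spells out, is that a set is a 0/1 vector over the element domain 1 … 50, so the symmetric difference of two sets equals the Hamming distance between their vectors. The second set `r` below is hypothetical.

```python
DOMAIN = range(1, 51)  # element domain 1..50, as on the number line


def to_bits(elems: set[int]) -> list[int]:
    """0/1 vector with a 1 at every position that belongs to the set."""
    return [1 if x in elems else 0 for x in DOMAIN]


def hamming(a: list[int], b: list[int]) -> int:
    """Number of positions where the two vectors differ."""
    return sum(x != y for x, y in zip(a, b))


s = {4, 10, 13, 24, 29, 35, 41, 46, 48}
r = {4, 10, 13, 24, 30, 35, 41, 46}   # hypothetical second set, not from the slides
bits_s, bits_r = to_bits(s), to_bits(r)
distance = hamming(bits_s, bits_r)    # the sets differ on 29, 30 and 48
```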
![Page 29: Efficient Exact Set-Similarity Joins](https://reader036.vdocuments.site/reader036/viewer/2022062410/568159a6550346895dc70a74/html5/thumbnails/29.jpg)
Enumeration
[Figure: two sets r and s with |r Δ s| ≤ 4.]
![Page 31: Efficient Exact Set-Similarity Joins](https://reader036.vdocuments.site/reader036/viewer/2022062410/568159a6550346895dc70a74/html5/thumbnails/31.jpg)
Enumeration
[Figure: sets r and s with |r Δ s| ≤ 4; the positions where they differ are marked "Errors".]
![Page 32: Efficient Exact Set-Similarity Joins](https://reader036.vdocuments.site/reader036/viewer/2022062410/568159a6550346895dc70a74/html5/thumbnails/32.jpg)
Enumeration
[Figure: sets r and s with |r Δ s| ≤ 4; the domain is split into 5 partitions, numbered 1 to 5.]
![Page 34: Efficient Exact Set-Similarity Joins](https://reader036.vdocuments.site/reader036/viewer/2022062410/568159a6550346895dc70a74/html5/thumbnails/34.jpg)
Enumeration: Signature Generation
[Figure: Sig(s) is a set of five elements derived from s, each mapped to a 32-bit value with Hash32().]
Sig(s) = { 0x4f72ba91, 0x29c8af10, 0x594b2c17, 0xa3b0e20f, 0xdd21f32a }
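
A minimal sketch of partition-based signature generation follows. It assumes the five signature elements are the projections of the set onto the five partitions, that the partitions are equal-width ranges of a domain 1 … 50, and it uses `zlib.crc32` as a stand-in for the slide's `Hash32()`. None of these details are confirmed by the slides, and the set `r` is hypothetical.

```python
import zlib

K = 4                      # running example: |r Δ s| <= 4
N_PARTS = K + 1            # 5 partitions, as in the figure
DOMAIN_SIZE = 50           # elements are integers in 1..50 (assumed)


def partition_of(x: int) -> int:
    """Index (0..4) of the equal-width range that element x falls into."""
    return (x - 1) * N_PARTS // DOMAIN_SIZE


def signatures(elems: set[int]) -> set[int]:
    """One 32-bit signature per partition: hash of (partition index, projection)."""
    sigs = set()
    for p in range(N_PARTS):
        projection = sorted(x for x in elems if partition_of(x) == p)
        payload = f"{p}:{projection}".encode()
        sigs.add(zlib.crc32(payload))  # stand-in for the slide's Hash32()
    return sigs


s = {4, 10, 13, 24, 29, 35, 41, 46, 48}
r = {4, 10, 13, 24, 30, 35, 41, 46}   # hypothetical; differs from s in 29, 30, 48
shared = signatures(s) & signatures(r)
```

With at most 4 differing elements and 5 partitions, at least one partition contains no difference (pigeonhole), so the two sets agree on that partition's projection and therefore share its signature.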
![Page 35: Efficient Exact Set-Similarity Joins](https://reader036.vdocuments.site/reader036/viewer/2022062410/568159a6550346895dc70a74/html5/thumbnails/35.jpg)
Property of Signatures
|r Δ s| ≤ 4 ⇒ Sig(r) ∩ Sig(s) ≠ ∅
[Figure: sets r and s over partitions 1 to 5.]
![Page 36: Efficient Exact Set-Similarity Joins](https://reader036.vdocuments.site/reader036/viewer/2022062410/568159a6550346895dc70a74/html5/thumbnails/36.jpg)
Enumeration: Algorithm
1. Generate signatures for each rᵢ, sⱼ
2. Enumerate pairs (rᵢ, sⱼ) s.t. Sig(rᵢ) ∩ Sig(sⱼ) ≠ ∅
3. Output those satisfying |rᵢ Δ sⱼ| ≤ 4
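
The three steps can be sketched with an in-memory index from signature to set ids. This is an illustration under assumptions, not the authors' code: signatures are kept as (partition index, projection) tuples instead of 32-bit hashes, partitions are equal-width ranges, and the input sets are made up.

```python
from collections import defaultdict


def partition_signatures(elems, k, domain_size):
    """One signature per partition: (partition index, projection of the set)."""
    n_parts = k + 1
    buckets = [[] for _ in range(n_parts)]
    for x in sorted(elems):
        buckets[(x - 1) * n_parts // domain_size].append(x)
    return {(p, tuple(b)) for p, b in enumerate(buckets)}


def signature_join(R, S, k, domain_size):
    # Step 1: generate signatures and index S by signature.
    index = defaultdict(set)
    for j, s in enumerate(S):
        for sig in partition_signatures(s, k, domain_size):
            index[sig].add(j)
    # Step 2: candidate pairs are those sharing at least one signature.
    candidates = set()
    for i, r in enumerate(R):
        for sig in partition_signatures(r, k, domain_size):
            candidates.update((i, j) for j in index[sig])
    # Step 3: verify the actual predicate to drop false positives.
    return sorted((i, j) for i, j in candidates if len(R[i] ^ S[j]) <= k)


# Hypothetical inputs over the domain 1..50.
R = [{4, 10, 13, 24, 30, 35, 41, 46}, {1, 2, 3}]
S = [{4, 10, 13, 24, 29, 35, 41, 46, 48}, {1, 2, 3, 20, 25, 33, 44, 50}]
result = signature_join(R, S, k=4, domain_size=50)
```

In this example the pair (1, 1) shares the signature of the first partition but has symmetric difference 5, so it is generated as a candidate and removed in step 3.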
![Page 39: Efficient Exact Set-Similarity Joins](https://reader036.vdocuments.site/reader036/viewer/2022062410/568159a6550346895dc70a74/html5/thumbnails/39.jpg)
Enumeration
[Figure: sets r1 … r5 and s1 … s5 with their signature sets Sig(r1) … Sig(r5) and Sig(s1) … Sig(s5). Pairs whose signature sets intersect, e.g. Sig(r2) ∩ Sig(s1) ≠ ∅, become candidate pairs; candidates are marked either as output or as false positive candidate pairs.]
![Page 40: Efficient Exact Set-Similarity Joins](https://reader036.vdocuments.site/reader036/viewer/2022062410/568159a6550346895dc70a74/html5/thumbnails/40.jpg)
[Query plan: R (Id, Elem) and S (Id, Elem) → Gen Signatures → R’ (Id, Sig) and S’ (Id, Sig) → join on R.Sig = S.Sig → δ R.Id, S.Id → Post-Process each R.Id, S.Id]
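
The candidate-generation part of this plan maps naturally onto SQL. The sketch below uses Python's built-in `sqlite3` module with made-up signature rows; the table names `R_sig` and `S_sig` stand in for R’ and S’, and δ is read here as duplicate elimination. The talk does not say which DBMS or schema was used.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE R_sig (Id INTEGER, Sig TEXT);   -- R'(Id, Sig)
    CREATE TABLE S_sig (Id INTEGER, Sig TEXT);   -- S'(Id, Sig)
""")
# Hypothetical signature rows; in the plan they come from the Gen Signatures step.
conn.executemany("INSERT INTO R_sig VALUES (?, ?)",
                 [(1, "a"), (1, "b"), (2, "c")])
conn.executemany("INSERT INTO S_sig VALUES (?, ?)",
                 [(10, "a"), (10, "b"), (20, "d")])

# Equi-join on the signature, then duplicate elimination on the Id pair.
candidates = conn.execute("""
    SELECT DISTINCT R_sig.Id, S_sig.Id
    FROM R_sig JOIN S_sig ON R_sig.Sig = S_sig.Sig
    ORDER BY R_sig.Id, S_sig.Id
""").fetchall()
```

Sets 1 and 10 share two signatures, so without `DISTINCT` the pair would be produced twice; each surviving pair would then be handed to the post-processing step.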
![Page 41: Efficient Exact Set-Similarity Joins](https://reader036.vdocuments.site/reader036/viewer/2022062410/568159a6550346895dc70a74/html5/thumbnails/41.jpg)
No False Positive Candidate Pair
[Figure: partitions 1 to 5; sets r and s with |r Δ s| = 5 that do not become a candidate pair.]
![Page 42: Efficient Exact Set-Similarity Joins](https://reader036.vdocuments.site/reader036/viewer/2022062410/568159a6550346895dc70a74/html5/thumbnails/42.jpg)
False Positive Candidate Pair
[Figure: partitions 1 to 5; two sets with |r Δ s| = 5 that still become a candidate pair.]
![Page 44: Efficient Exact Set-Similarity Joins](https://reader036.vdocuments.site/reader036/viewer/2022062410/568159a6550346895dc70a74/html5/thumbnails/44.jpg)
Enumeration: Performance
[Plot: probability of a common signature (y-axis, 0 to 1) versus symmetric difference (x-axis, 1 to 20) for k = 4, shown together with the "Ideal Performance" curve.]
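
A curve of this shape can be reproduced under a simplifying model that is an assumption of this sketch, not a statement from the talk: each of the d differing elements falls independently and uniformly into one of the 5 partitions, and two sets share a signature exactly when some partition receives no differing element.

```python
from math import comb


def prob_common_signature(d: int, n_parts: int = 5) -> float:
    """P(at least one of n_parts partitions receives none of d differences)."""
    # Inclusion-exclusion over the set of partitions that stay empty.
    return sum((-1) ** (j + 1) * comb(n_parts, j) * (1 - j / n_parts) ** d
               for j in range(1, n_parts + 1))


curve = {d: prob_common_signature(d) for d in range(1, 21)}
```

Under this model the probability is 1 for d ≤ 4, as the signature property requires, and then decays only gradually (about 0.96 at d = 5). An ideal filter would presumably drop to 0 immediately after k = 4.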
![Page 46: Efficient Exact Set-Similarity Joins](https://reader036.vdocuments.site/reader036/viewer/2022062410/568159a6550346895dc70a74/html5/thumbnails/46.jpg)
Enumeration
[Figure: sets r and s with |r Δ s| ≤ 4; the domain is now split into 6 partitions, numbered 1 to 6.]
![Page 51: Efficient Exact Set-Similarity Joins](https://reader036.vdocuments.site/reader036/viewer/2022062410/568159a6550346895dc70a74/html5/thumbnails/51.jpg)
Enumeration: Signature Generation
[Figure: a set over partitions 1 to 6, with signatures built from selections of the partitions.]
Number of signatures: $\binom{6}{2} = 15$
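
The count of 15 is consistent with generating one signature for every choice of 6 − 4 = 2 partitions. The sketch below follows that reading, which is an interpretation of the slide; it also assumes near-equal-width partitions of a domain 1 … 50, and the set `r` is hypothetical.

```python
from itertools import combinations

K = 4
N_PARTS = 6
DOMAIN_SIZE = 50   # assumed element domain 1..50


def enum_signatures(elems):
    """One signature per choice of N_PARTS - K partitions."""
    def part(x):
        return (x - 1) * N_PARTS // DOMAIN_SIZE
    sigs = set()
    for chosen in combinations(range(N_PARTS), N_PARTS - K):
        projection = tuple(sorted(x for x in elems if part(x) in chosen))
        sigs.add((chosen, projection))
    return sigs


s = {4, 10, 13, 24, 29, 35, 41, 46, 48}
r = {4, 10, 13, 24, 30, 35, 41, 46}     # hypothetical; |r Δ s| = 3
sig_s, sig_r = enum_signatures(s), enum_signatures(r)
shared = sig_s & sig_r
```

If two sets differ in at most 4 elements, at most 4 partitions are affected, so at least 2 of the 6 are identical in both sets, and the signature for that pair of partitions is shared. A pair with 5 differences must now leave two partitions untouched in order to collide.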
![Page 52: Efficient Exact Set-Similarity Joins](https://reader036.vdocuments.site/reader036/viewer/2022062410/568159a6550346895dc70a74/html5/thumbnails/52.jpg)
Algorithm
1. Generate signatures for each rᵢ, sⱼ
2. Enumerate pairs (rᵢ, sⱼ) s.t. Sig(rᵢ) ∩ Sig(sⱼ) ≠ ∅
3. Output those satisfying |rᵢ Δ sⱼ| ≤ 4

Only the signature function changes
![Page 53: Efficient Exact Set-Similarity Joins](https://reader036.vdocuments.site/reader036/viewer/2022062410/568159a6550346895dc70a74/html5/thumbnails/53.jpg)
Enumeration: Performance
[Plot: probability of a common signature (y-axis, 0 to 1) versus symmetric difference (x-axis, 1 to 20) for k = 4; curves for n1 = 5 and n1 = 6.]
![Page 54: Efficient Exact Set-Similarity Joins](https://reader036.vdocuments.site/reader036/viewer/2022062410/568159a6550346895dc70a74/html5/thumbnails/54.jpg)
False Positive Candidate Pair
[Figure: partitions 1 to 6; sets r and s with |r Δ s| = 5 that still become a candidate pair.]
![Page 56: Efficient Exact Set-Similarity Joins](https://reader036.vdocuments.site/reader036/viewer/2022062410/568159a6550346895dc70a74/html5/thumbnails/56.jpg)
Enumeration: Performance
[Plot: probability of a common signature (y-axis, 0 to 1) versus symmetric difference (x-axis, 1 to 20) for k = 4; curves for n1 = 5, n1 = 6, n1 = 7 and n1 = 20.]
Signatures per set: 5 (n1 = 5), 15 (n1 = 6), 35 (n1 = 7), 4845 (n1 = 20)
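
The four signature counts match the binomial coefficients C(n1, 4). The sketch below reproduces them and estimates the filtering power, assuming (as a modelling choice of this sketch, not a claim from the talk) that every differing element falls into a uniformly random partition.

```python
from math import comb

K = 4
counts = {n1: comb(n1, K) for n1 in (5, 6, 7, 20)}


def prob_common_signature(d, n1, k=K):
    """P(d differences touch at most k of n1 partitions), assuming each
    difference lands in a uniformly random partition, independently."""
    occ = [1.0] + [0.0] * n1          # occ[j] = P(exactly j partitions touched)
    for _ in range(d):
        nxt = [0.0] * (n1 + 1)
        for j, p in enumerate(occ):
            if p:
                nxt[j] += p * j / n1                 # lands in a touched partition
                if j < n1:
                    nxt[j + 1] += p * (n1 - j) / n1  # touches a new partition
        occ = nxt
    return sum(occ[: k + 1])


p5 = {n1: prob_common_signature(5, n1) for n1 in counts}
```

This makes the trade-off visible: at symmetric difference 5 the chance of a false candidate falls from about 0.96 with n1 = 5 to about 0.42 with n1 = 20, while the number of signatures per set grows from 5 to 4845.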
![Page 57: Efficient Exact Set-Similarity Joins](https://reader036.vdocuments.site/reader036/viewer/2022062410/568159a6550346895dc70a74/html5/thumbnails/57.jpg)
PartEnum: Divide and Conquer
[Figure: a set whose domain is split into 2 partitions, numbered 1 and 2.]
k = 4; k1 = 2, k2 = 1
Generate signatures using Enumeration
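
One reading of this slide, stated here as an interpretation: since (k1 + 1) + (k2 + 1) = 5 > k, a pair with |r Δ s| ≤ 4 must have at most k1 differences in partition 1 or at most k2 differences in partition 2, so Enumeration can be applied inside each partition with its smaller threshold. The sketch below follows that reading; the number of second-level partitions `N2`, the equal-width ranges and the domain 1 … 50 are hypothetical choices, not values from the talk.

```python
from itertools import combinations

DOMAIN_SIZE = 50
THRESHOLDS = (2, 1)     # k1 = 2, k2 = 1 from the slide; (k1 + 1) + (k2 + 1) = k + 1 = 5
N2 = 4                  # hypothetical number of second-level partitions per half


def partenum_signatures(elems):
    n1 = len(THRESHOLDS)
    width = DOMAIN_SIZE // n1                 # 25 elements per first-level partition
    sigs = set()
    for i, k_i in enumerate(THRESHOLDS):
        lo = i * width                        # partition i covers lo+1 .. lo+width
        local = [x - lo for x in elems if lo < x <= lo + width]
        # Enumeration inside partition i: every choice of N2 - k_i sub-partitions.
        for chosen in combinations(range(N2), N2 - k_i):
            proj = tuple(sorted(x for x in local
                                if (x - 1) * N2 // width in chosen))
            sigs.add((i, chosen, proj))
    return sigs


s = {4, 10, 13, 24, 29, 35, 41, 46, 48}
r = {4, 10, 13, 24, 30, 35, 41, 46}     # hypothetical; |r Δ s| = 3
shared = partenum_signatures(s) & partenum_signatures(r)
```

With these settings each set gets C(4, 2) + C(4, 3) = 10 signatures, compared with C(8, 4) = 70 if Enumeration were run directly over 8 partitions with k = 4.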
![Page 58: Efficient Exact Set-Similarity Joins](https://reader036.vdocuments.site/reader036/viewer/2022062410/568159a6550346895dc70a74/html5/thumbnails/58.jpg)
PartEnum: Asymptotic Performance
Theorem: There is an instance of PartEnum such that:
- If |r Δ s| > 7.5k, then r and s do not share a signature with probability 1 – o(1)
- The number of signatures per set: O(k²)
![Page 59: Efficient Exact Set-Similarity Joins](https://reader036.vdocuments.site/reader036/viewer/2022062410/568159a6550346895dc70a74/html5/thumbnails/59.jpg)
PartEnum: Summary
- Set-Similarity Joins with predicate |r Δ s| ≤ k
- Theoretical guarantees
  - First exact algorithm
![Page 60: Efficient Exact Set-Similarity Joins](https://reader036.vdocuments.site/reader036/viewer/2022062410/568159a6550346895dc70a74/html5/thumbnails/60.jpg)
Other results
- PartEnum extensions:
  - Larger class of set-similarity join predicates
  - Jaccard
  - Basic idea: reduce to symmetric set difference
- WtEnum class of signature functions:
  - Use frequency of elements
  - Weighted set-similarity joins
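
The slide only names the idea of reducing Jaccard to symmetric difference. The identity below is standard set algebra: for non-empty sets, Jaccard(r, s) ≥ γ holds exactly when |r Δ s| ≤ (1 − γ)/(1 + γ) · (|r| + |s|). How the talk turns this size-dependent bound into a join with a fixed k is not shown here; the function names are illustrative.

```python
from fractions import Fraction


def jaccard(r: set, s: set) -> Fraction:
    """Exact Jaccard similarity |r ∩ s| / |r ∪ s| (sets assumed non-empty)."""
    return Fraction(len(r & s), len(r | s))


def sym_diff_bound(size_r: int, size_s: int, gamma: Fraction) -> Fraction:
    """Largest |r Δ s| compatible with Jaccard(r, s) >= gamma for the given sizes."""
    return (1 - gamma) / (1 + gamma) * (size_r + size_s)


r = {"mi", "ic", "cr", "ro", "os", "so", "of", "ft"}
s = {"mc", "cr", "ro", "os", "so", "of", "ft"}
gamma = Fraction(3, 5)
passes_jaccard = jaccard(r, s) >= gamma
passes_delta = len(r ^ s) <= sym_diff_bound(len(r), len(s), gamma)
```

Exact fractions are used so that the two tests agree even at the boundary, where floating-point rounding could otherwise flip one of the comparisons.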
![Page 62: Efficient Exact Set-Similarity Joins](https://reader036.vdocuments.site/reader036/viewer/2022062410/568159a6550346895dc70a74/html5/thumbnails/62.jpg)
Implementation
[Figure: the signature-based plan, R (Id, Elem) and S (Id, Elem) → Gen Signatures → join on R.Sig = S.Sig → δ R.Id, S.Id → Post-Process each R.Id, S.Id, with its steps labeled by where they run: DBMS, Client + DBMS, or Client.]
![Page 63: Efficient Exact Set-Similarity Joins](https://reader036.vdocuments.site/reader036/viewer/2022062410/568159a6550346895dc70a74/html5/thumbnails/63.jpg)
Previous Work
- Prefix Filtering [CGK ’06]
  - Exact
- Locality Sensitive Hashing [IM ’98]
  - Approximate
  - False negative rate: 5%
![Page 64: Efficient Exact Set-Similarity Joins](https://reader036.vdocuments.site/reader036/viewer/2022062410/568159a6550346895dc70a74/html5/thumbnails/64.jpg)
Data Sets
- Organization addresses [MS Sales]
  - Concatenation: Org name, street, city, zip
  - Input size: 1 million
  - Avg. length: 11 words, 58 chars
  - Tokenization: Words, n-grams
![Page 65: Efficient Exact Set-Similarity Joins](https://reader036.vdocuments.site/reader036/viewer/2022062410/568159a6550346895dc70a74/html5/thumbnails/65.jpg)
Jaccard, 1M, MS Sales
[Bar chart: running time in seconds (y-axis, 0 to 4000) for PEN (PartEnum), LSH (Locality Sensitive Hashing) and PF (Prefix Filtering) at thresholds 0.8, 0.85 and 0.9, broken down into SigGen, CandPair and PostFilter.]
![Page 66: Efficient Exact Set-Similarity Joins](https://reader036.vdocuments.site/reader036/viewer/2022062410/568159a6550346895dc70a74/html5/thumbnails/66.jpg)
Evaluation
[Figure: the signature-based plan, R (Id, Elem) and S (Id, Elem) → Gen Signatures → join on R.Sig = S.Sig → δ R.Id, S.Id → Post-Process each R.Id, S.Id, with its steps labeled by where they run (DBMS, Client + DBMS, or Client) and the intermediate result size marked.]
![Page 67: Efficient Exact Set-Similarity Joins](https://reader036.vdocuments.site/reader036/viewer/2022062410/568159a6550346895dc70a74/html5/thumbnails/67.jpg)
Jaccard, 1M, MS Sales
[Bar chart: intermediate result size (y-axis, 0 to 2.5E+08) for PEN, LSH and PF at thresholds 0.8, 0.85 and 0.9.]
![Page 68: Efficient Exact Set-Similarity Joins](https://reader036.vdocuments.site/reader036/viewer/2022062410/568159a6550346895dc70a74/html5/thumbnails/68.jpg)
Jaccard, Synthetic
[Plot: intermediate result size (y-axis, 1.0E+03 to 1.0E+11) versus input size (x-axis, 1.0E+03 to 1.0E+09) for LSH(0.95), PEN and PF.]
![Page 69: Efficient Exact Set-Similarity Joins](https://reader036.vdocuments.site/reader036/viewer/2022062410/568159a6550346895dc70a74/html5/thumbnails/69.jpg)
Similar Results for …
- Other data sets
  - DBLP, Synthetic data sets
- Other similarity functions
  - Weighted Jaccard
  - Edit distance
![Page 70: Efficient Exact Set-Similarity Joins](https://reader036.vdocuments.site/reader036/viewer/2022062410/568159a6550346895dc70a74/html5/thumbnails/70.jpg)
Conclusion
- New algorithms for set-similarity joins
  - Exact
  - Performance guarantees
  - Outperform previous exact algorithms
Search: “data cleaning project”