Weighted Exact Set Similarity Join

Dongwon Lee, The Pennsylvania State University ([email protected])


Page 1: Weighted Exact Set Similarity Join

Weighted Exact Set Similarity Join

The Pennsylvania State University

Dongwon Lee

[email protected]

Page 2: Weighted Exact Set Similarity Join

Set Similarity Join

Def. Set Similarity Join (SSJoin): between collections A and B, find X pairs of objects whose similarity > t
If X = "MOST" → Approximate SSJoin
If X = "ALL" → Exact SSJoin


[Figure: candidate pairs between collections A and B, annotated with similarity scores ranging from 0.1 to 0.9. Example records: from A, {Lake, Monona, Wisc, Dane, County}; from B, {University, Mendota, Wisc, Dane}]

Page 3: Weighted Exact Set Similarity Join

Set Similarity Join

Weighted vs. Unweighted: weighting quantifies the relative importance of each token
Eg, "Microsoft" is more important than "Corp."

How to assign meaningful weights to tokens is an important problem itself

Not further discussed here


Page 4: Weighted Exact Set Similarity Join

Set Similarity Join

Approximate SSJoin: allows some false positives/negatives (Eg, LSH as a solution)

Exact SSJoin: does not allow any false positives/negatives; needs to be scalable

Weighted + Exact SSJoin: we will simply call it "WESSJoin"


        | unweighted | weighted
exact   | UESSJoin   | WESSJoin
approx. | UASSJoin   | WASSJoin

Page 5: Weighted Exact Set Similarity Join

Applications of WESSJoin

Entity resolution

Web document genre classification: find all pairs of documents w. similar contents

Query refinement for web search: for a query, find another one w. similar search results

Movie recommendation: identify users who have similar movie tastes w.r.t. the rented movies

Focus on string data represented as a SET (Eg, document, web page, record)


Page 6: Weighted Exact Set Similarity Join

Research Issues

Why not express WESSJoin in SQL? The join predicate becomes a UDF → Cartesian product followed by UDF processing → inefficient evaluation → special handling for WESSJoin is needed

Scalability

Support diverse similarity (or distance) functions
  o Eg, Overlap, Jaccard, Cosine vs. Edit, …

Support diverse computation models
  o Eg, Threshold vs. Top-k


Page 7: Weighted Exact Set Similarity Join

Similarity/Distance Functions

Jaccard coefficient: J(x,y) = |x ∩ y| / |x ∪ y|

Overlap similarity: O(x,y) = |x ∩ y|

Cosine similarity: C(x,y) = |x ∩ y| / sqrt(|x| · |y|)

Hamming distance: H(x,y) = |(x - y) ∪ (y - x)|, the size of the symmetric difference

Levenshtein distance L(x,y): min # of edit operations to transform x into y
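As a quick illustration, here is a minimal Python sketch of the set-based measures above (unweighted; a weighted version would replace len() with sums of token weights):

import math

def jaccard(x, y):
    # J(x,y) = |x ∩ y| / |x ∪ y|
    return len(x & y) / len(x | y)

def overlap(x, y):
    # O(x,y) = |x ∩ y|
    return len(x & y)

def cosine(x, y):
    # C(x,y) = |x ∩ y| / sqrt(|x| * |y|), the set form used on the next slide
    return len(x & y) / math.sqrt(len(x) * len(y))

def hamming(x, y):
    # H(x,y) = size of the symmetric difference between x and y
    return len(x ^ y)

x = {"Lake", "Mendota", "Monona"}
y = {"Wisc", "Dane", "Mendota", "Lake"}
print(jaccard(x, y), overlap(x, y), round(cosine(x, y), 2), hamming(x, y))
# 0.4 2 0.58 3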

Page 8: Weighted Exact Set Similarity Join

Properties of sim()

Similarity predicates can be rewritten into one another equivalently:
J(x,y) > t ⇔ O(x,y) > t/(1+t) · (|x| + |y|)
O(x,y) > t ⇔ H(x,y) < |x| + |y| - 2t
C(x,y) > t ⇔ O(x,y) > t · sqrt(|x| · |y|)

Eg, x: {Lake, Mendota, Monona}, y: {Wisc, Dane, Mendota, Lake}
J(x,y) > 0.5 ? ⇔ O(x,y) > 0.5/1.5 · 7 ≈ 2.33 ?

Set representation: k-gram, word, phrase, …
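Checking the first rewriting on this example in Python (a small self-contained sketch; everything is computed directly from the two sets):

x = {"Lake", "Mendota", "Monona"}
y = {"Wisc", "Dane", "Mendota", "Lake"}
t = 0.5

jaccard = len(x & y) / len(x | y)                     # 2 / 5 = 0.4
overlap = len(x & y)                                  # 2
overlap_threshold = t / (1 + t) * (len(x) + len(y))   # 0.5/1.5 * 7 ≈ 2.33

# Both predicates agree, as the rewriting requires:
print(jaccard > t, overlap > overlap_threshold)       # False False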

Page 9: Weighted Exact Set Similarity Join

Naïve Solution

All pair-wise comparison between A and B

Nested-loop: |A| · |B| comparisons
The sim() evaluation itself may be costly
Eg, the Generalized Jaccard similarity function costs O(|x|^3)


For x in A:
  For y in B:
    If sim(x,y) > t, return (x,y)

A, B: tables; x, y: records represented as sets

Page 10: Weighted Exact Set Similarity Join

Naïve Solution Example


ID Content

1 {Lake, Mendota}

2 {Lake, Monona, Area}

3 {Lake, Mendota, Monona, Dane}

ID Content

4 {Lake, Monona, University}

5 {Monona, Research, Area}

6 {Lake, Mendota, Monona, Area}

A B

O(x,y) | ID=4 | ID=5 | ID=6
ID=1   |  1   |  0   |  2
ID=2   |  2   |  2   |  3
ID=3   |  2   |  1   |  3

O(x,y) > 2 ?
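The naive join over these two tables can be written directly; a small Python sketch that reproduces the overlap counts above and returns the pairs with O(x,y) > 2:

A = {1: {"Lake", "Mendota"},
     2: {"Lake", "Monona", "Area"},
     3: {"Lake", "Mendota", "Monona", "Dane"}}
B = {4: {"Lake", "Monona", "University"},
     5: {"Monona", "Research", "Area"},
     6: {"Lake", "Mendota", "Monona", "Area"}}

t = 2                                    # overlap threshold

# Naive nested-loop join: |A| * |B| similarity evaluations.
results = []
for xid, x in A.items():
    for yid, y in B.items():
        if len(x & y) > t:               # O(x, y) > 2
            results.append((xid, yid))

print(results)                           # [(2, 6), (3, 6)]

All |A| · |B| = 9 pairs are evaluated here, which is exactly what the 2-step framework on the later slides tries to avoid.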

Page 11: Weighted Exact Set Similarity Join

Naïve Solution Example


ID Content

1 {Lake, Mendota}

2 {Lake, Monona, Area}

3 {Lake, Mendota, Monona, Dane}

ID Content

4 {Lake, Monona, University}

5 {Monona, Research, Area}

6 {Lake, Mendota, Monona, Area}

A B

J(x,y) | ID=4 | ID=5 | ID=6
ID=1   | 0.25 | 0    | 0.5
ID=2   | 0.5  | 0.5  | 0.75
ID=3   | 0.4  | 0.17 | 0.6

J(x,y) > 0.6 ?

Page 12: Weighted Exact Set Similarity Join

2-Step Framework

Step 1: "Blocking": using an index/heuristics/filtering/etc, reduce the # of candidates to compare
Step 2: compute sim() only within the candidate set

Cost: O(|A| · |C|) s.t. |C| << |B|


For x in A:
  Using Foo, find a candidate set C in B
  For y in C:
    If sim(x,y) > t, return (x,y)
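One way to write the framework down in Python (a sketch; the function names are illustrative, not taken from any particular paper):

def wess_join(A, B, sim, t, find_candidates):
    # Generic 2-step WESSJoin: blocking (step 1) + verification (step 2).
    # find_candidates(x, B) plays the role of "Foo"; to keep the join exact,
    # it must never drop a pair that could satisfy sim(x, y) > t.
    results = []
    for xid, x in A.items():
        for yid in find_candidates(x, B):     # step 1: candidate set C, |C| << |B|
            if sim(x, B[yid]) > t:            # step 2: verify only the candidates
                results.append((xid, yid))
    return results

The following slides are, in effect, different ways of building such a find_candidates function.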

Page 13: Weighted Exact Set Similarity Join

Variants for “Foo”

"Foo": how to identify the candidate set C
Fast
Accurate: no false positives/negatives

Many variants for "Foo":
Inverted Index [Sarawagi et al, SIGMOD 04]
Size filtering [Arasu et al, VLDB 06]
Prefix Index [Chaudhuri et al, ICDE 06]
Prefix + Inverted Index [Bayardo et al, WWW 07]
Bound filtering [On et al, ICDE 07]
Position Index [Xiao et al, WWW 08]


Page 14: Weighted Exact Set Similarity Join

Inverted Index [Sarawagi et al, SIGMOD 04]


ID Content

1 {Lake, Mendota}

2 {Lake, Monona, Area}

3 {Lake, Mendota, Monona, Dane}

ID Content

4 {Lake, Monona, University}

5 {Monona, Research, Area}

6 {Lake, Mendota, Monona, Area}

A B

Token in A ID List

Area 2

Dane 3

Lake 1, 2, 3

Mendota 1, 3

Monona 2, 3

Inverted Index (IDX) for A

Token in B ID List

Area 5

Lake 4, 6

Mendota 6

Monona 4, 5, 6

Research 5

University 4

Inverted Index (IDX) for B

Page 15: Weighted Exact Set Similarity Join

Inverted Index [Sarawagi et al, SIGMOD 04]


ID Content

1 {Lake, Mendota}

2 {Lake, Monona, Area}

3 {Lake, Mendota, Monona, Dane}

ID Content

4 {Lake, Monona, University}

5 {Monona, Research, Area}

6 {Lake, Mendota, Monona, Area}

A B

Token in B ID List

Area 5

Lake 4, 6

Mendota 6

Monona 4, 5, 6

Research 5

University 4

Inverted Index (IDX) for B

For x in A:
  Using IDX, find a candidate set C in B
  For y in C:
    If sim(x,y) > t, return (x,y)

ID=1: {Lake, Mendota}

ID=2: …

ID=3: …

Candidate set C: {4,6} + {6} = {4, 6}

Page 16: Weighted Exact Set Similarity Join

Inverted Index [Sarawagi et al, SIGMOD 04]


ID Content

1 {Lake, Mendota}

2 {Lake, Monona, Area}

3 {Lake, Mendota, Monona, Dane}

ID Content

4 {Lake, Monona, University}

5 {Monona, Research, Area}

6 {Lake, Mendota, Monona, Area}

A B

Token in B ID List

Area 5

Lake 4, 6

Mendota 6

Monona 4, 5, 6

Research 5

University 4

Inverted Index (IDX) for B

ID=1: {Lake, Mendota}

ID=2: …

ID=3: …

ID | Freq.
4  | 1
6  | 2

Candidate set C: keep only the IDs whose frequency (which equals O(x,y)) satisfies O(x,y) > 2 → empty for ID=1

For x in A:
  Using IDX, find a candidate set C in B
  For y in C:
    If sim(x,y) > t, return (x,y)
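A minimal Python sketch of this idea (a simplified illustration, not the full algorithm of [Sarawagi et al, SIGMOD 04]); the per-candidate frequency count is exactly O(x,y), so for overlap thresholds the index answers the join directly:

from collections import defaultdict

def build_inverted_index(B):
    idx = defaultdict(list)
    for yid, y in B.items():
        for token in y:
            idx[token].append(yid)
    return idx

def overlap_counts(x, idx):
    # For each record of B sharing at least one token with x, count the shared tokens.
    freq = defaultdict(int)
    for token in x:
        for yid in idx.get(token, []):
            freq[yid] += 1
    return freq                               # freq[yid] == O(x, y)

B = {4: {"Lake", "Monona", "University"},
     5: {"Monona", "Research", "Area"},
     6: {"Lake", "Mendota", "Monona", "Area"}}
idx = build_inverted_index(B)

x = {"Lake", "Mendota"}                       # record ID=1 of A
freq = overlap_counts(x, idx)
print(dict(freq))                             # {4: 1, 6: 2} (order may vary)
print([yid for yid, o in freq.items() if o > 2])   # []: ID=1 matches nothing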

Page 17: Weighted Exact Set Similarity Join

Size Filtering [Arasu et al, VLDB 06]

Idea: build an index on the sizes of the inputs

Jaccard coefficient: J(x,y) = |x ∩ y| / |x ∪ y|
Upper bound for Jaccard: J(x,y) <= min(|x|, |y|) / max(|x|, |y|)
Bounding |y| w.r.t. |x|: if J(x,y) > t, then t < min(|x|, |y|) / max(|x|, |y|), so |y| > t · |x| and |y| < |x| / t
Combining the two: t · |x| <= |y| <= |x| / t

Page 18: Weighted Exact Set Similarity Join

Size Filtering [Arasu et al, VLDB 06]

Intuition: if t and |x| are given, then |y| is bounded

Eg, x: {Lake, Mendota}, y: {Lake, Mendota, Monona, Area}, J(x,y) > 0.8 ?
Then, according to t · |x| <= |y| <= |x| / t with |x| = 2 and t = 0.8: 1.6 <= |y| <= 2.5
However, |y| = 4, so y cannot satisfy t = 0.8 → no need to compute J(x,y) at all

Page 19: Weighted Exact Set Similarity Join

Size Filtering [Arasu et al, VLDB 06]

Algorithm
For all input strings, build a B-tree index on their sizes
Given a set x, use the B-tree index to find candidates y in B s.t. t · |x| <= |y| <= |x| / t

For x in A:
  Using the size index, find a candidate set C in B
  For y in C:
    If sim(x,y) > t, return (x,y)
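A sketch of the size filter in Python, with a sorted list plus binary search standing in for the slide's B-tree:

import bisect

def size_filter_candidates(x, records_by_size, sizes, t):
    # Keep only records whose size lies in [t*|x|, |x|/t] (Jaccard threshold t).
    lo = bisect.bisect_left(sizes, t * len(x))
    hi = bisect.bisect_right(sizes, len(x) / t)
    return records_by_size[lo:hi]

B = {4: {"Lake", "Monona", "University"},
     5: {"Monona", "Research", "Area"},
     6: {"Lake", "Mendota", "Monona", "Area"}}
records_by_size = sorted((len(y), yid) for yid, y in B.items())   # [(3, 4), (3, 5), (4, 6)]
sizes = [s for s, _ in records_by_size]

x = {"Lake", "Mendota"}
print(size_filter_candidates(x, records_by_size, sizes, t=0.8))
# []: the bound 1.6 <= |y| <= 2.5 excludes every record of B (all sizes >= 3)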

Page 20: Weighted Exact Set Similarity Join

Prefix Index [Chaudhuri et al, ICDE 06]

Intuition: If two sets are very similar, their prefixes, when ordered, must have some common tokens

Eg, x: {Dane, University, Monona, Mendota}, y: {Area, Lake, Mendota, Monona, Wisc}, O(x,y) > 3 ?

Ordered: x': {Dane, Mendota, Monona, University}, y': {Area, Lake, Mendota, Monona, Wisc}

The leading tokens of x' and y' are the prefixes that get compared.

Page 21: Weighted Exact Set Similarity Join

Prefix Index [Chaudhuri et al, ICDE 06]

Theorem 1: If there is no overlap btw. Prefix(x) and Prefix(y), then sim(x,y) < t, where the prefix length is:
If sim() = Overlap: |Prefix(x)| = |x| - (t - 1)
If sim() = Jaccard: |Prefix(x)| = |x| - Ceiling(t · |x|) + 1

Algorithm using Theorem 1: given a set x
For each token t_x in the prefix of x:
  o Using an index, locate candidates y that contain t_x in their prefix
  o If sim(x,y) > t, return (x,y)
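A small Python sketch of the prefix test for the example on the previous slide (alphabetical order as the shared global token order, as on that slide; t is the overlap threshold):

x = {"Dane", "University", "Monona", "Mendota"}
y = {"Area", "Lake", "Mendota", "Monona", "Wisc"}
t = 3                                  # overlap threshold: O(x,y) > 3 ?

xs, ys = sorted(x), sorted(y)          # order both sets under one fixed global order
px = set(xs[:len(x) - (t - 1)])        # prefix of x: {"Dane", "Mendota"}
py = set(ys[:len(y) - (t - 1)])        # prefix of y: {"Area", "Lake", "Mendota"}

print(px & py)                         # {'Mendota'}: the prefixes overlap,
                                       # so (x, y) survives the filter and must be verified
print(len(x & y) > t)                  # False: the true overlap is only 2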


Page 22: Weighted Exact Set Similarity Join


ID Content

1 {Lake, Mendota}

2 {Lake, Monona, Area}

3 {Lake, Mendota, Monona, Dane}

ID Content

4 {Lake, Monona, University}

5 {Monona, Research, Area}

6 {Lake, Mendota, Monona, Area}

A B

Token      | ID List       | DF | Order
Area       | 2, 5          | 2  | 4
Dane       | 3             | 1  | 1
Lake       | 1, 2, 3, 4, 6 | 5  | 6
Mendota    | 1, 3, 6       | 3  | 5
Monona     | 2, 3, 4, 5, 6 | 5  | 7
Research   | 5             | 1  | 2
University | 4             | 1  | 3

Inverted Index (IDX) for both A and B

Prefix + Inverted Index [Bayardo et al, WWW 07]

Create a universal order: put rare tokens in front

Order: Dane > Research > University > Area > Mendota > Lake > Monona
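A sketch of building this universal order and the prefix inverted index in Python (document frequencies computed over A and B, ties broken alphabetically), then probing it for record ID=1 as on the next slides:

from collections import defaultdict

A = {1: {"Lake", "Mendota"},
     2: {"Lake", "Monona", "Area"},
     3: {"Lake", "Mendota", "Monona", "Dane"}}
B = {4: {"Lake", "Monona", "University"},
     5: {"Monona", "Research", "Area"},
     6: {"Lake", "Mendota", "Monona", "Area"}}
t = 2                                        # overlap threshold

# Universal order: rare tokens first (low document frequency gets a small rank).
df = defaultdict(int)
for rec in list(A.values()) + list(B.values()):
    for tok in rec:
        df[tok] += 1
rank = {tok: i for i, tok in enumerate(sorted(df, key=lambda tok: (df[tok], tok)))}

def ordered_prefix(rec, k):
    # The k rarest tokens of the record under the universal order.
    return sorted(rec, key=rank.get)[:k]

# Prefix inverted index over B: index only the prefix tokens of each record.
prefix_idx = defaultdict(set)
for yid, y in B.items():
    for tok in ordered_prefix(y, len(y) - (t - 1)):
        prefix_idx[tok].add(yid)

# Candidate generation for record ID=1 of A: probe the index with x's prefix only.
x = A[1]
candidates = set()
for tok in ordered_prefix(x, len(x) - (t - 1)):
    candidates |= prefix_idx.get(tok, set())
print(candidates)        # {6}: smaller than the {4, 6} from the plain inverted index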

Page 23: Weighted Exact Set Similarity Join


ID Content

1 {Mendota, Lake}

2 {Area, Lake, Monona}

3 {Dane, Mendota, Lake, Monona}

ID Content

4 {University, Lake, Monona}

5 {Research, Area, Monona}

6 {Area, Mendota, Lake, Monona}

Ordered A Ordered B

Prefix + Inverted Index [Bayardo et al, WWW 07]

Order: Dane > Research > University > Area > Mendota > Lake > Monona

Page 24: Weighted Exact Set Similarity Join


ID Content

1 {Mendota, Lake}

2 {Area, Lake, Monona}

3 {Dane, Mendota, Lake, Monona}

ID Content

4 {University, Lake, Monona}

5 {Research, Area, Monona}

6 {Area, Mendota, Lake, Monona}

Ordered A Ordered B

Token in B ID List

Area 5

Lake 4, 6

Mendota 6

Research 5

University 4

Prefix Inverted Index for B

ID=1: {Mendota, Lake}

ID=2: …

ID=3: …

Candidate set C: {6}

O(x,y) > 2; |Prefix(x)| = |x| - (t - 1) = |x| - 1

Prefix + Inverted Index [Bayardo et al, WWW 07]

Page 25: Weighted Exact Set Similarity Join


ID Content

1 {Mendota, Lake}

2 {Area, Lake, Monona}

3 {Dane, Mendota, Lake, Monona}

ID Content

4 {University, Lake, Monona}

5 {Research, Area, Monona}

6 {Area, Mendota, Lake, Monona}

Ordered A Ordered B

Token in B ID List

Area 5

Lake 4, 6

Mendota 6

Research 5

University 4

Prefix Inverted Index for B

ID=1: …

ID=2: {Area, Lake, Monona}

ID=3: …

Candidate set C: {5} + {4, 6} = {4, 5, 6}

O(x,y) > 2; |Prefix(x)| = |x| - (t - 1) = |x| - 1

Prefix + Inverted Index [Bayardo et al, WWW 07]

Page 26: Weighted Exact Set Similarity Join


ID Content

1 {Mendota, Lake}

2 {Area, Lake, Monona}

3 {Dane, Mendota, Lake, Monona}

ID Content

4 {University, Lake, Monona}

5 {Research, Area, Monona}

6 {Area, Mendota, Lake, Monona}

Ordered A Ordered B

Token in B ID List

Area 5

Lake 4, 6

Mendota 6

Research 5

University 4

Prefix Inverted Index for B

ID=1: …

ID=2: …

ID=3: {Dane, Mendota, Lake, Monona}

Candidate set C: {6} + {4,6} = {4,6}

O(x,y) > 2; |Prefix(x)| = |x| - (t - 1) = |x| - 1

Prefix + Inverted Index [Bayardo et al, WWW 07]

Page 27: Weighted Exact Set Similarity Join

Position Index [Xiao et al, WWW 08]

Eg, x: {Dane, Research, Area, Mendota, Lake} y: {Research, Area, Mendota, Lake, Monona} O(x,y) > 4 ?

|Prefix(x)| = |Prefix(y)| = 5 - (4 - 1) = 2
x: {Dane, Research, Area, Mendota, Lake}, y: {Research, Area, Mendota, Lake, Monona}
"Research" is common btw. the prefixes → (x,y) is a candidate pair → need to compute sim(x,y)


Order: Dane > Research > University > Area > Mendota > Lake > Monona

Page 28: Weighted Exact Set Similarity Join

Position Index [Xiao et al, WWW 08]

Eg, x: {Dane, Research, Area, Mendota, Lake} y: {Research, Area, Mendota, Lake, Monona} O(x,y) > 4 ?

|Prefix(x)| = |Prefix(y)| = 5 - (4 - 1) = 2
x: {Dane, Research, Area, Mendota, Lake}, y: {Research, Area, Mendota, Lake, Monona}
Estimate of the maximum possible overlap = overlap within the prefixes + min # of unseen tokens = 1 + min(3, 4) = 4, which is not > t = 4 → no need to compute sim(x,y) at all!


Order: Dane > Research > University > Area > Mendota > Lake > Monona
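A sketch of this position-based estimate in Python, for the same x, y, and order (t is the overlap threshold):

x = ["Dane", "Research", "Area", "Mendota", "Lake"]       # already in the universal order
y = ["Research", "Area", "Mendota", "Lake", "Monona"]
t = 4                                                     # O(x,y) > 4 ?

# Prefix filter alone: prefixes of length |x| - (t - 1) = 2 share "Research".
px, py = x[:len(x) - (t - 1)], y[:len(y) - (t - 1)]
common = set(px) & set(py)
print(common)                          # {'Research'}: candidate under the plain prefix filter

# Position filter: bound the best possible overlap using the token positions.
i = x.index("Research") + 1            # "Research" is at position 2 in x
j = y.index("Research") + 1            # and at position 1 in y
max_overlap = len(common) + min(len(x) - i, len(y) - j)   # 1 + min(3, 4) = 4
print(max_overlap > t)                 # False: pruned without computing O(x, y)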

Page 29: Weighted Exact Set Similarity Join


Bound Filtering [On et al, ICDE 07]

Generalized Jaccard (GJ) similarity
Two sets: x = {a1, …, a|x|}, y = {b1, …, b|y|}
GJ is the normalized weight of the maximum weight bipartite matching M in the bipartite graph (N = x ∪ y, E = x × y):

GJ(x,y) = ( Σ_{(ai,bj) ∈ M} sim(ai,bj) ) / ( |x| + |y| - |M| )

Page 30: Weighted Exact Set Similarity Join


Bound Filtering [On et al, ICDE 07]

[Figure: bipartite graph between x and y with element-level similarities 0.7, 0.5, 0.4, 0.9, 0.1, 0.2; the highlighted edges form M, the maximum weight bipartite matching]

GJ(x,y) = ( Σ_{(ai,bj) ∈ M} sim(ai,bj) ) / ( |x| + |y| - |M| )

Worked example: GJ(x,y) = (0.9 + 0.7) / (3 + 2 - 2) = 1.6 / 3 ≈ 0.53

M: maximum weight bipartite matching
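For a concrete check, the maximum weight matching can be computed with SciPy's assignment solver (a sketch, assuming SciPy is available; the 2×3 layout of the six edge weights below is an assumption consistent with the figure, and since all weights are positive the optimal assignment coincides with the maximum weight matching):

import numpy as np
from scipy.optimize import linear_sum_assignment

# Element-level similarities sim(ai, bj); rows = elements of x, columns = elements of y.
S = np.array([[0.7, 0.4, 0.2],
              [0.5, 0.9, 0.1]])

rows, cols = linear_sum_assignment(S, maximize=True)         # maximum weight matching M
matched_weight = S[rows, cols].sum()                         # 0.7 + 0.9 = 1.6
gj = matched_weight / (S.shape[0] + S.shape[1] - len(rows))  # 1.6 / (2 + 3 - 2)
print(round(gj, 2))                                          # 0.53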

Page 31: Weighted Exact Set Similarity Join


Bound Filtering [On et al, ICDE 07]

Issues
GJ captures more semantics btw. two sets (via the weighted bipartite matching) than Jaccard
But it is more costly to compute the maximum weight bipartite matching:
  o Bellman-Ford: O(V^2 E)
  o Hungarian: O(V^3)

For x in A:
  Using Foo, find a candidate set C in B
  For y in C:
    If GJ(x,y) > t, return (x,y)

Page 32: Weighted Exact Set Similarity Join


Bound Filtering [On et al, ICDE 07]

Bipartite matching computation is expensive because of the requirement that no node in the bipartite graph can have more than one edge incident on it

Relax this constraint:
S1: for each element ai in x, find the element bj in y with the highest element-level similarity
S2: for each element bj in y, find the element ai in x with the highest element-level similarity

Complexity becomes linear: O(|x| + |y|)

Page 33: Weighted Exact Set Similarity Join


Bound Filtering [On et al, ICDE 07]

[Figure: the bipartite graph from the previous slide (element-level similarities 0.7, 0.5, 0.4, 0.9, 0.1, 0.2), annotated with the greedy match sets S1 (each element of x paired with its best match in y) and S2 (each element of y paired with its best match in x)]

Page 34: Weighted Exact Set Similarity Join


Bound Filtering [On et al, ICDE 07]

GJ(x,y) = ( Σ_{(ai,bj) ∈ M} sim(ai,bj) ) / ( |x| + |y| - |M| )

UB(x,y) = ( Σ_{(ai,bj) ∈ S1 ∪ S2} sim(ai,bj) ) / ( |x| + |y| - |S1 ∪ S2| )

LB(x,y) = ( Σ_{(ai,bj) ∈ S1 ∩ S2} sim(ai,bj) ) / ( |x| + |y| - |S1 ∩ S2| )

Properties:
The numerator of UB is at least as large as that of GJ
The denominator of UB is no larger than that of GJ
Similar arguments hold for LB

Theorem 2: LB <= GJ <= UB

Page 35: Weighted Exact Set Similarity Join


Bound Filtering [On et al, ICDE 07]

Algorithm
Compute UB(x,y)
If UB(x,y) <= t → GJ(x,y) <= t → (x,y) is not an answer
Else compute LB(x,y)
  If LB(x,y) > t → GJ(x,y) > t → (x,y) is an answer
  Else compute GJ(x,y) exactly

For x in A:
  Using Foo, find a candidate set C in B
  For y in C:
    If GJ(x,y) > t, return (x,y)

LB <= GJ <= UB
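A sketch of the whole bound-filtering test in Python, using the greedy match sets S1 and S2 and the UB/LB forms reconstructed above (a sketch of the idea, not the exact procedure of [On et al, ICDE 07]; the element-similarity matrix is the same assumed layout as before):

def greedy_matches(S):
    # S1: each element of x (row) paired with its best match in y (column);
    # S2: each element of y paired with its best match in x.
    n, m = len(S), len(S[0])
    s1 = {(i, max(range(m), key=lambda j: S[i][j])) for i in range(n)}
    s2 = {(max(range(n), key=lambda i: S[i][j]), j) for j in range(m)}
    return s1, s2

def normalized_weight(S, pairs):
    # GJ-style normalization: matched similarity over |x| + |y| - #pairs.
    return sum(S[i][j] for i, j in pairs) / (len(S) + len(S[0]) - len(pairs))

S = [[0.7, 0.4, 0.2],      # element-level similarities between x (rows) and y (columns)
     [0.5, 0.9, 0.1]]
t = 0.7

s1, s2 = greedy_matches(S)
ub = normalized_weight(S, s1 | s2)     # 1.8 / 2 = 0.9
lb = normalized_weight(S, s1 & s2)     # 1.6 / 3 ≈ 0.53

if ub <= t:
    print("pruned: GJ <= t, not an answer")
elif lb > t:
    print("accepted: GJ > t, an answer")
else:
    print("verify: compute GJ exactly")   # this example falls in the middle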

Page 36: Weighted Exact Set Similarity Join

Takeaways

WESSJoin finds ALL pairs of sets btw. two collections whose similarity > t
A good abstraction for various problems

The 2-step framework is promising
Step 1: reduce the candidates
Step 2: similarity computation among the candidates only

Less-researched issues
Comparison among different WESSJoin methods
WESSJoin + top-k/skyline/MapReduce/etc


Page 37: Weighted Exact Set Similarity Join

References

[Sarawagi et al, SIGMOD 04] Sunita Sarawagi, Alok Kirpal. Efficient Set Joins on Similarity Predicates. SIGMOD 2004.
[Arasu et al, VLDB 06] Arvind Arasu, Venkatesh Ganti, Raghav Kaushik. Efficient Exact Set-Similarity Joins. VLDB 2006.
[Chaudhuri et al, ICDE 06] Surajit Chaudhuri, Venkatesh Ganti, Raghav Kaushik. A Primitive Operator for Similarity Joins in Data Cleaning. ICDE 2006.
[Bayardo et al, WWW 07] R. J. Bayardo, Yiming Ma, Ramakrishnan Srikant. Scaling Up All-Pairs Similarity Search. WWW 2007.
[On et al, ICDE 07] Byung-Won On, Nick Koudas, Dongwon Lee, Divesh Srivastava. Group Linkage. ICDE 2007.
[Xiao et al, WWW 08] Chuan Xiao, Wei Wang, Xuemin Lin, Jeffrey Xu Yu. Efficient Similarity Joins for Near Duplicate Detection. WWW 2008.
Wei Wang. Efficient Exact Similarity Join Algorithms: http://www.cse.unsw.edu.au/~weiw/project/PPJoin-UTS-Oct-2008.pdf
Jeffrey D. Ullman. High-Similarity Algorithms: http://infolab.stanford.edu/~ullman/mining/2009/similarity4.pdf
