Combinatorial Algorithms for Nearest Neighbors, Near-Duplicates and Small-World Design

Yury Lifshits, Yahoo! Research
Shengyu Zhang, The Chinese University of Hong Kong
SODA 2009

Upload: yury-lifshits

Post on 29-Aug-2014



Similarity Search: an Example
Input: Set of objects
Task: Preprocess it
Query: New object
Task: Find the most similar one in the dataset


Similarity Search
Search space: object domain U, similarity function σ
Input: database S = {p_1, ..., p_n} ⊆ U
Query: q ∈ U
Task: find argmax_i σ(p_i, q)
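The task definition above can be read as a plain linear scan, which is the baseline any index tries to beat. The following is a minimal sketch of that baseline only; the function name `most_similar` and the Jaccard similarity standing in for σ are assumptions made for this illustration and do not come from the talk.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def most_similar(database: Sequence[T], query: T,
                 sigma: Callable[[T, T], float]) -> T:
    """Linear scan: return the object p in the database maximizing sigma(p, query)."""
    return max(database, key=lambda p: sigma(p, query))


def jaccard(a: frozenset, b: frozenset) -> float:
    """Hypothetical similarity function: overlap between two sets of words."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


docs = [
    frozenset({"nearest", "neighbor"}),
    frozenset({"small", "world"}),
    frozenset({"near", "duplicates"}),
]
best = most_similar(docs, frozenset({"small", "world", "design"}), jaccard)
```

The scan evaluates σ once per database object, so it costs n similarity computations per query and needs no preprocessing.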

Nearest Neighbors in Theory
Sphere Rectangle Tree, Orchard's Algorithm, k-d-B tree, Geometric near-neighbor access tree, Excluded middle vantage point forest, mvp-tree, Fixed-height fixed-queries tree, AESA, Vantage-point tree, LAESA, R*-tree, Burkhard-Keller tree, BBD tree, Navigating Nets, Voronoi tree, Balanced aspect ratio tree, Metric tree, vps-tree, M-tree, Locality-Sensitive Hashing, SS-tree, R-tree, Spatial approximation tree, Multi-vantage point tree, Bisector tree, mb-tree, Cover tree, Hybrid tree, Generalized hyperplane tree, Slim tree, Spill Tree, Fixed queries tree, X-tree, k-d tree, Balltree, Quadtree, Octree, Post-office tree

Revision: Basic Assumptions
In theory:
  Triangle inequality
  Doubling dimension is o(log n)
Typical web dataset has separation effect:
  For almost all i, j: 1/2 ≤ d(p_i, p_j) ≤ 1
Classic methods fail:
  Branch and bound algorithms visit every object
  Doubling dimension is at least log n/2


Contribution
Navin Goyal, YL, Hinrich Schütze, WSDM 2008:
  Combinatorial framework: new approach to data mining problems that does not require triangle inequality
  Nearest neighbor algorithm
This work:
  Better nearest neighbor search
  Detecting near-duplicates
  Navigability design for small worlds


Outline
1 Combinatorial Framework
2 New Algorithms
3 Combinatorial Nets
4 Directions for Further Research

1 Combinatorial Framework

Comparison Oracle
Dataset p_1, ..., p_n
Objects and distance (or similarity) function are NOT given
Instead, there is a comparison oracle answering queries of the form:
  Who is closer to A: B or C?
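One way to picture the comparison oracle is as an object that hides the distance function and exposes a single three-argument query. The sketch below is only an illustration under that reading: the names `ComparisonOracle`, `closer` and `nearest_by_oracle` are hypothetical, the hidden absolute-difference distance is a stand-in, and the linear tournament shown is a naive baseline rather than the algorithm of the talk.

```python
from typing import Callable, Generic, Sequence, TypeVar

T = TypeVar("T")


class ComparisonOracle(Generic[T]):
    """Answers "who is closer to a: b or c?" without revealing any distances."""

    def __init__(self, hidden_distance: Callable[[T, T], float]) -> None:
        self._distance = hidden_distance  # never exposed to the caller
        self.calls = 0                    # number of oracle queries made so far

    def closer(self, a: T, b: T, c: T) -> T:
        """Return whichever of b, c is closer to a (b on ties)."""
        self.calls += 1
        return b if self._distance(a, b) <= self._distance(a, c) else c


def nearest_by_oracle(oracle: ComparisonOracle, dataset: Sequence[T], q: T) -> T:
    """Naive baseline: a linear tournament using n - 1 oracle queries."""
    best = dataset[0]
    for candidate in dataset[1:]:
        best = oracle.closer(q, best, candidate)
    return best


oracle = ComparisonOracle(lambda a, b: abs(a - b))
nearest = nearest_by_oracle(oracle, [1, 8, 5, 12], 6)
```

Counting calls on the oracle makes it easy to compare this baseline against any structure that answers queries with fewer comparisons.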

Disorder Inequality
Sort all objects by their similarity to p:
  [Figure: p, then r at position rank_p(r) and s at position rank_p(s)]
Then by similarity to r:
  [Figure: r, then s at position rank_r(s)]
Dataset has disorder D if ∀ p, r, s: rank_r(s) ≤ D(rank_p(r) + rank_p(s))
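For a small dataset the disorder constant can be found by brute force over all triples (p, r, s). The sketch below assumes ranks are 0-based, so that rank_x(x) = 0 and the nearest neighbor has rank 1, and it uses a numeric distance only as a convenient way to produce the orderings. The helper names `rank_table` and `disorder_constant` are hypothetical, and the cubic loop is meant for illustration, not for large inputs.

```python
from fractions import Fraction
from itertools import product


def rank_table(points, dist):
    """ranks[x][y] = position of y when all points are sorted by distance to x.

    The point x itself always gets rank 0; remaining ties are broken by index.
    """
    n = len(points)
    ranks = []
    for x in range(n):
        order = sorted(range(n), key=lambda y: (dist(points[x], points[y]), y != x))
        row = [0] * n
        for position, y in enumerate(order):
            row[y] = position
        ranks.append(row)
    return ranks


def disorder_constant(ranks):
    """Smallest D such that rank_r(s) <= D * (rank_p(r) + rank_p(s)) for all p, r, s."""
    n = len(ranks)
    worst = Fraction(0)
    for p, r, s in product(range(n), repeat=3):
        denominator = ranks[p][r] + ranks[p][s]
        if denominator == 0:  # only when r == s == p, where rank_r(s) is 0 as well
            continue
        worst = max(worst, Fraction(ranks[r][s], denominator))
    return worst


points = [0, 1, 2, 3]
ranks = rank_table(points, lambda a, b: abs(a - b))
D = disorder_constant(ranks)
```

Exact fractions are used so that the returned constant is not affected by floating-point rounding.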


Combinatorial Framework
=
Comparison oracle: Who is closer to A: B or C?
+
Disorder inequality: rank_r(s) ≤ D(rank_p(r) + rank_p(s))

Combinatorial Framework: FAQ
Disorder of a metric space? Disorder of R^k?
In what cases is disorder relatively small?
Experimental values of D for some practical datasets?

Disorder vs. Others
If expansion rate is c, disorder constant is at most c^2
Doubling dimension and disorder dimension are incomparable
Disorder inequality implies combinatorial form of "doubling effect"

Combinatorial Framework: Pro & Contra
Advantages:
  Does not require triangle inequality for distances
  Applicable to any data model and any similarity function
  Requires only comparative training information
Limitation: worst-case form of disorder inequality


2 New Algorithms

Nearest Neighbor Search
Assume S ∪ {q} has disorder constant D
Theorem: There is a deterministic and exact algorithm for nearest neighbor search:
  Preprocessing: O(D^7 n log^2 n)
  Search: O(D^4 log n)

Near-Duplicates
Assume the comparison oracle can also tell us whether σ(x, y) > T for some similarity threshold T
Theorem: All pairs with over-T similarity can be found deterministically in time
  poly(D)(n log^2 n + |Output|)

Visibility Graph
Theorem: For any dataset S with disorder D there exists a visibility graph:
  poly(D) n log^2 n construction time
  O(D^4 log n) out-degrees
  Naïve greedy routing deterministically reaches exact nearest neighbor of the given target q in at most log n steps
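Greedy routing itself is simple to state in code: from the current node, step to the out-neighbor that the comparison oracle reports as closest to the target, and stop when no neighbor is closer. The sketch below runs that rule on a small hand-built graph over points on a line. That graph is an assumption for illustration and is not the visibility graph of the theorem, so the log n step bound and the exactness guarantee are not claimed for it. The names `greedy_route` and `closer` are hypothetical.

```python
def closer(a, b, c):
    """Stand-in comparison oracle: whichever of b, c is closer to a (b on ties)."""
    return b if abs(a - b) <= abs(a - c) else c


def greedy_route(graph, closer, start, target):
    """Follow out-links, always moving to the neighbor closest to target.

    Stops at the first node none of whose out-neighbors is strictly closer.
    Returns the list of visited nodes, beginning with start.
    """
    path = [start]
    current = start
    while True:
        best = current
        for neighbor in graph[current]:
            best = closer(target, best, neighbor)
        if best == current:
            return path
        current = best
        path.append(current)


# Hand-built graph on the points 0..7: links at offsets 1, 2 and 4 in both directions.
graph = {
    i: [j for j in (i - 4, i - 2, i - 1, i + 1, i + 2, i + 4) if 0 <= j < 8]
    for i in range(8)
}
path = greedy_route(graph, closer, start=0, target=5.6)
```

Because `closer` keeps the current node on ties, each step strictly decreases the distance to the target, so the loop always terminates.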

[Figure: query q and dataset objects p1, p2, p3, p4]

3 Combinatorial Nets

Combinatorial Ball
B(x, r) = {y : rank_x(y) < r}
In other words, it is a subset of dataset S: the object x itself and its r − 1 nearest neighbors
[Figure: object x and the ball B(x, 10)]
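Since the ball is defined through ranks alone, it can be computed with nothing but comparison queries. The sketch below sorts the dataset around x using a comparator built from a stand-in oracle and keeps the first r objects. It assumes x belongs to the dataset, that objects are distinct, and that there are no ties in closeness to x. The names `combinatorial_ball` and `closer` are hypothetical.

```python
from functools import cmp_to_key


def closer(a, b, c):
    """Stand-in comparison oracle: whichever of b, c is closer to a (b on ties)."""
    return b if abs(a - b) <= abs(a - c) else c


def combinatorial_ball(dataset, x, r, closer):
    """B(x, r): the object x itself plus its r - 1 nearest neighbors."""

    def compare(b, c):
        if b == c:
            return 0
        return -1 if closer(x, b, c) == b else 1

    ordered = sorted(dataset, key=cmp_to_key(compare))  # position in this list = rank_x
    return ordered[:r]


ball = combinatorial_ball([1, 4, 6, 7, 12, 20], x=6, r=3, closer=closer)
```

This full sort costs O(n log n) oracle queries per ball, which is the naive cost that a faster construction has to improve on.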

Combinatorial Net
A subset R ⊆ S is called a combinatorial r-net iff the following two properties hold:
  Covering: ∀ y ∈ S, ∃ x ∈ R, s.t. rank_x(y) < r
  Separation: ∀ x_i, x_j ∈ R, rank_{x_i}(x_j) ≥ r OR rank_{x_j}(x_i) ≥ r
How to construct a combinatorial net?
What upper bound on its size can we guarantee?
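A simple greedy pass already satisfies both properties: scan the objects in order and promote an object to a center whenever no existing center covers it. Covering holds because every object is either covered or becomes a center, and separation holds because a later center y was added only when rank_x(y) ≥ r for every earlier center x. The sketch below is this naive quadratic construction, not the fast one from the talk. It assumes 0-based ranks and takes distinct centers in the separation check. The helper names are hypothetical.

```python
def rank_table(points, dist):
    """ranks[x][y] = position of y when all points are sorted by distance to x."""
    n = len(points)
    ranks = []
    for x in range(n):
        order = sorted(range(n), key=lambda y: (dist(points[x], points[y]), y != x))
        row = [0] * n
        for position, y in enumerate(order):
            row[y] = position
        ranks.append(row)
    return ranks


def greedy_net(ranks, r):
    """Indices of a combinatorial r-net, built by one greedy pass."""
    centers = []
    for y in range(len(ranks)):
        # y becomes a center only if no existing center x has rank_x(y) < r.
        if all(ranks[x][y] >= r for x in centers):
            centers.append(y)
    return centers


def is_net(ranks, centers, r):
    """Check the covering and separation properties directly."""
    n = len(ranks)
    covering = all(any(ranks[x][y] < r for x in centers) for y in range(n))
    separation = all(
        ranks[a][b] >= r or ranks[b][a] >= r
        for a in centers for b in centers if a != b
    )
    return covering and separation


points = [0, 1, 2, 3, 10, 11, 12, 13]
ranks = rank_table(points, lambda a, b: abs(a - b))
centers = greedy_net(ranks, r=4)
```

The result depends on the scan order, and this sketch gives no bound on the number of centers; it only shows that a net exists for every r ≥ 1.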


Basic Data Structure
Combinatorial nets:
  For every 0 ≤ i ≤ log n, construct an (n/2^i)-net
Pointers, pointers, pointers:
  Direct & inverted indices: links between centers and members of their balls
  Cousin links: for every center keep pointers to close centers on the same level
  Navigation links: for every center keep pointers to close centers on the next level


Fast Net Construction
Theorem: Combinatorial nets can be constructed in O(D^7 n log^2 n) time

Up'n'Down Trick
Assume you have a 2r-net for the dataset
To compute an r-ball around some object p:
1 Take a center p′ of a 2r-ball that covers p
2 Take all centers of 2r-balls near p′
3 For all of them, write down all members of their 2r-balls
4 Sort all written objects with respect to p and keep the r most similar ones
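The four steps above can be sketched as a short function over precomputed lookup tables. Everything about the tables is an assumption for illustration: `cover` maps an object to one center whose 2r-ball contains it, `cousins` maps a center to the nearby centers on the same level (the slide does not say how near, so the example simply lists both centers), and `members` maps a center to its 2r-ball. A numeric distance stands in for sorting by oracle comparisons.

```python
def r_ball_from_2r_net(p, r, cover, cousins, members, dist):
    """Compute the r-ball around p by going up to a 2r-net and back down."""
    center = cover[p]                          # step 1: a center whose 2r-ball covers p
    nearby_centers = cousins[center]           # step 2: centers of 2r-balls near it
    written = {y for c in nearby_centers       # step 3: all members of those balls
               for y in members[c]}
    ordered = sorted(written, key=lambda y: (dist(p, y), y))  # step 4: sort around p
    return ordered[:r]                         # keep the r most similar objects


# Hand-built 4-net (2r = 4) over eight points on a line, with centers 0 and 10.
dist = lambda a, b: abs(a - b)
members = {0: [0, 1, 2, 3], 10: [10, 11, 12, 13]}
cover = {y: c for c, ball in members.items() for y in ball}
cousins = {0: [0, 10], 10: [10, 0]}
ball = r_ball_from_2r_net(13, 2, cover, cousins, members, dist)
```

The point of the trick is that only the members of a few 2r-balls are sorted rather than the whole dataset; whether the result is the true r-ball depends on `cousins` listing enough nearby centers.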

4 Directions for Further Research

Future of Combinatorial Framework
Other problems in combinatorial framework:
  Low-distortion embeddings
  Closest pairs
  Community discovery
  Linear arrangement
  Distance labelling
  Dimensionality reduction
What if disorder inequality has exceptions?
Insertions, deletions, changing metric
Experiments & implementation
Unification challenge: disorder + doubling = ?

Summary
Combinatorial framework:
  comparison oracle + disorder inequality
New algorithms:
  Nearest neighbor search
  Deterministic detection of near-duplicates
  Navigability design
Thanks for your attention! Questions?


Links
http://yury.name
http://simsearch.yury.name
  Tutorial, bibliography, people, links, open problems

Yury Lifshits and Shengyu Zhang
Combinatorial Algorithms for Nearest Neighbors, Near-Duplicates and Small-World Design
http://yury.name/papers/lifshits2008similarity.pdf

Navin Goyal, Yury Lifshits, Hinrich Schütze
Disorder Inequality: A Combinatorial Approach to Nearest Neighbor Search
http://yury.name/papers/goyal2008disorder.pdf

Benjamin Hoffmann, Yury Lifshits, Dirk Novotka
Maximal Intersection Queries in Randomized Graph Models
http://yury.name/papers/hoffmann2007maximal.pdf