Combinatorial Algorithms for Nearest Neighbors, Near-Duplicates
and Small-World Design
Yury Lifshits, Yahoo! Research
Shengyu Zhang, The Chinese University of Hong Kong
SODA 2009
Yury Lifshits, Shengyu Zhang ()Combinatorial Nearest Neighbors 1 / 29
Similarity Search: an Example
Input: Set of objects
Task: Preprocess it
Query: New object
Task: Find the most similar one in the dataset
[Figure label: Most similar]
Similarity Search
Search space: object domain U, similarity function σ
Input: database S = {p1, ..., pn} ⊆ U
Query: q ∈ U
Task: find argmax σ(pi, q) over pi ∈ S
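The task statement can be read as a plain linear scan. The sketch below is only that baseline, not any of the algorithms from the talk; the function names and the Jaccard similarity used in the demo are assumptions chosen for illustration.

```python
def most_similar(database, query, similarity):
    """Linear scan: return the object p in database that maximizes similarity(p, query)."""
    return max(database, key=lambda p: similarity(p, query))


def jaccard(a, b):
    """Example similarity function on sets (an assumption, any sigma would do)."""
    return len(a & b) / len(a | b)


database = [frozenset("abc"), frozenset("abd"), frozenset("xyz")]
query = frozenset("abce")
answer = most_similar(database, query, jaccard)
```

The scan costs n similarity evaluations per query; the preprocessing step exists to avoid exactly that.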
Nearest Neighbors in Theory
Sphere Rectangle Tree, Orchard’s Algorithm, k-d-B tree, Geometric near-neighbor access tree,
Excluded middle vantage point forest, mvp-tree, Fixed-height fixed-queries tree, AESA,
Vantage-point tree, LAESA, R∗-tree, Burkhard-Keller tree, BBD tree, Navigating Nets,
Voronoi tree, Balanced aspect ratio tree, Metric tree, vps-tree, M-tree,
Locality-Sensitive Hashing, SS-tree, R-tree, Spatial approximation tree,
Multi-vantage point tree, Bisector tree, mb-tree, Cover tree, Hybrid tree,
Generalized hyperplane tree, Slim tree, Spill Tree, Fixed queries tree, X-tree, k-d tree,
Ball tree, Quadtree, Octree, Post-office tree
Revision: Basic Assumptions
In theory:
  Triangle inequality
  Doubling dimension is o(log n)
Typical web dataset has separation effect
For almost all i, j: 1/2 ≤ d(pi, pj) ≤ 1
Classic methods fail:
  Branch and bound algorithms visit every object
  Doubling dimension is at least log n/2
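Why branch and bound visits every object under the separation effect: the usual pivot-based lower bound |d(q, pivot) − d(pivot, x)| is at most 1/2 when all distances lie in [1/2, 1], while the best distance found so far is at least 1/2, so the bound never exceeds it. The sketch below checks this numerically. The pruning rule is the standard triangle-inequality one, and the random distances are an illustrative assumption, not data from the talk.

```python
import random


def count_pruned(dist_to_pivot, query_to_pivot, best_so_far):
    """Count objects x that a pivot-based bound allows us to skip.

    By the triangle inequality d(q, x) >= |d(q, pivot) - d(pivot, x)|,
    so x can be skipped only when that lower bound exceeds best_so_far.
    """
    return sum(1 for d in dist_to_pivot if abs(query_to_pivot - d) > best_so_far)


rng = random.Random(0)
n = 1000
# Separation effect: every distance falls in [1/2, 1].
dist_to_pivot = [rng.uniform(0.5, 1.0) for _ in range(n)]
query_to_pivot = rng.uniform(0.5, 1.0)
best_so_far = 0.5  # the most optimistic value possible under the separation effect
pruned = count_pruned(dist_to_pivot, query_to_pivot, best_so_far)
```

Even with the most optimistic value of best_so_far, no object is pruned.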
Contribution
Navin Goyal, YL, Hinrich Schütze, WSDM 2008:
Combinatorial framework: new approach to data mining problems that does not require triangle inequality
Nearest neighbor algorithm
This work:
Better nearest neighbor search
Detecting near-duplicates
Navigability design for small worlds
Outline
1 Combinatorial Framework
2 New Algorithms
3 Combinatorial Nets
4 Directions for Further Research
1 Combinatorial Framework
Comparison Oracle
Dataset p1, . . . ,pn
Objects and distance (or similarity) function are NOT given
Instead, there is a comparison oracle answering queries of the form:
Who is closer to A: B or C?
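A minimal sketch of such an oracle, assuming a hidden distance function that callers never see directly. The class name, the method name `closer`, and the call counter are hypothetical names introduced for illustration; the slides only specify the question being answered.

```python
from functools import cmp_to_key


class ComparisonOracle:
    """Answers 'Who is closer to a: b or c?' without exposing distances."""

    def __init__(self, hidden_distance):
        self._d = hidden_distance
        self.calls = 0  # number of oracle queries made so far

    def closer(self, a, b, c):
        """Return True when b is at least as close to a as c is."""
        self.calls += 1
        return self._d(a, b) <= self._d(a, c)


def sort_by_similarity(oracle, a, objects):
    """Order objects from most to least similar to a, using only oracle queries."""

    def compare(b, c):
        if oracle.closer(a, b, c):
            # b is at least as close; a second query tells a tie from a strict win.
            return 0 if oracle.closer(a, c, b) else -1
        return 1

    return sorted(objects, key=cmp_to_key(compare))


oracle = ComparisonOracle(lambda x, y: abs(x - y))
ordering = sort_by_similarity(oracle, 5, [1, 9, 4, 7, 5])
```

Sorting by similarity to a fixed object is all that is needed to define ranks, which is why the oracle suffices for the rest of the framework.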
Disorder Inequality
Sort all objects by their similarity to p:
p ... r ... s   (r at position rank_p(r), s at position rank_p(s))
Then by similarity to r:
r ... s   (s at position rank_r(s))
Dataset has disorder D if ∀p, r, s: rank_r(s) ≤ D(rank_p(r) + rank_p(s))
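For a small dataset the disorder constant can be computed by brute force over all triples. The sketch below assumes rank_x(x) = 0 (consistent with the ball definition later in the talk, where B(x, r) contains x itself) and breaks ties by index; both conventions and all function names are assumptions for illustration.

```python
def rank_table(points, dist):
    """ranks[x][y] = position of y when all objects are sorted by similarity to x."""
    n = len(points)
    ranks = [[0] * n for _ in range(n)]
    for x in range(n):
        # x itself comes first (rank 0); ties are broken by index.
        order = sorted(range(n), key=lambda y: (y != x, dist(points[x], points[y]), y))
        for position, y in enumerate(order):
            ranks[x][y] = position
    return ranks


def disorder_constant(ranks):
    """Smallest D with rank_r(s) <= D * (rank_p(r) + rank_p(s)) for all p, r, s."""
    n = len(ranks)
    worst = 1.0
    for p in range(n):
        for r in range(n):
            for s in range(n):
                denominator = ranks[p][r] + ranks[p][s]
                if denominator > 0:
                    worst = max(worst, ranks[r][s] / denominator)
    return worst


points = [0, 1, 2, 3]
ranks = rank_table(points, lambda a, b: abs(a - b))
D = disorder_constant(ranks)
```

The triple loop takes cubic time, so this is only practical as a check on toy inputs.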
Combinatorial Framework = Comparison oracle + Disorder inequality
Comparison oracle: Who is closer to A: B or C?
Disorder inequality: rank_r(s) ≤ D(rank_p(r) + rank_p(s))
Combinatorial Framework: FAQ
Disorder of a metric space? Disorder of R^k?
In what cases is disorder relatively small?
Experimental values of D for some practical datasets?
Disorder vs. Others
If expansion rate is c, disorder constant is at most c^2
Doubling dimension and disorder dimension are incomparable
Disorder inequality implies combinatorial form of “doubling effect”
Combinatorial Framework: Pro & Contra
Advantages:
Does not require triangle inequality for distances
Applicable to any data model and any similarity function
Requires only comparative training information
Limitation: worst-case form of disorder inequality
2 New Algorithms
Nearest Neighbor Search
Assume S ∪ {q} has disorder constant D
Theorem
There is a deterministic and exact algorithm for nearest neighbor search:
Preprocessing: O(D^7 n log^2 n)
Search: O(D^4 log n)
Near-Duplicates
Assume the comparison oracle can also tell us whether σ(x, y) > T for some similarity threshold T
Theorem
All pairs with similarity over T can be found deterministically in time
poly(D) (n log^2 n + |Output|)
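For comparison, the obvious baseline asks the threshold oracle about every pair, which takes n(n − 1)/2 queries regardless of output size. The sketch below is that quadratic baseline only, not the algorithm behind the theorem; the function names, the Jaccard similarity and the threshold 0.5 are assumptions for illustration.

```python
from itertools import combinations


def near_duplicates_bruteforce(objects, above_threshold):
    """Return all index pairs (i, j), i < j, whose similarity exceeds the threshold."""
    return [
        (i, j)
        for i, j in combinations(range(len(objects)), 2)
        if above_threshold(objects[i], objects[j])
    ]


def jaccard_above_half(a, b):
    """Example threshold oracle: is the Jaccard similarity of two sets over T = 0.5?"""
    return len(a & b) / len(a | b) > 0.5


objects = [frozenset("abcd"), frozenset("abce"), frozenset("wxyz"), frozenset("abcd")]
pairs = near_duplicates_bruteforce(objects, jaccard_above_half)
```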
Visibility Graph
Theorem
For any dataset S with disorder D there exists a visibility graph:
poly(D) n log^2 n construction time
O(D^4 log n) out-degrees
Naïve greedy routing deterministically reaches the exact nearest neighbor of the given target q in at most log n steps
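Greedy routing itself is simple: from the current node, move to the out-neighbor that the oracle reports as closest to q, and stop when no neighbor is closer. The sketch below shows the routing loop only. The demo graph (integers on a line with links at power-of-two offsets) is a stand-in assumption and is not the visibility graph constructed in the paper.

```python
def greedy_route(graph, closer, start, target):
    """Follow edges greedily toward target; return the list of visited nodes.

    closer(q, a, b) must return True when a is strictly closer to q than b.
    """
    path = [start]
    current = start
    while True:
        best = current
        for neighbor in graph[current]:
            if closer(target, neighbor, best):
                best = neighbor
        if best == current:  # no out-neighbor improves on the current node
            return path
        current = best
        path.append(current)


# Hypothetical demo graph: nodes 0..15, edges to i +/- 1, 2, 4, 8.
n = 16
graph = {
    i: [j for k in range(4) for j in (i - 2**k, i + 2**k) if 0 <= j < n]
    for i in range(n)
}


def closer(q, a, b):
    return abs(q - a) < abs(q - b)


path = greedy_route(graph, closer, start=0, target=13.2)
```

Each step strictly decreases the distance to the target, so the loop always terminates. Whether it stops at the exact nearest neighbor depends on the graph, which is what the theorem guarantees for the visibility graph.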
[Figure, shown over several builds: target q and objects p1, p2, p3, p4]
3 Combinatorial Nets
Combinatorial Ball
B(x, r) = {y : rank_x(y) < r}
In other words, it is a subset of dataset S: the object x itself and its r − 1 nearest neighbors
[Figure: object x and its ball B(x, 10)]
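A direct reading of the ball definition, as a sketch: sort the dataset by similarity to x and keep the first r objects. It uses an explicit distance function for brevity where the framework would use the comparison oracle; tie-breaking by index and the function name are assumptions.

```python
def combinatorial_ball(points, x, r, dist):
    """Indices y with rank_x(y) < r: x itself plus its r - 1 nearest neighbors."""
    # x is forced to rank 0; remaining ties are broken by index.
    order = sorted(range(len(points)), key=lambda y: (y != x, dist(points[x], points[y]), y))
    return order[:r]


points = [0, 10, 11, 13, 20]
ball = combinatorial_ball(points, x=2, r=3, dist=lambda a, b: abs(a - b))
```

Note that the ball has exactly r members whatever the actual distances are, which is what makes it a combinatorial rather than a metric notion.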
Combinatorial Net
A subset R ⊆ S is called a combinatorial r-net iff the following two properties hold:
Covering: ∀y ∈ S, ∃x ∈ R s.t. rank_x(y) < r
Separation: ∀xi ≠ xj ∈ R: rank_xi(xj) ≥ r OR rank_xj(xi) ≥ r
How to construct a combinatorial net?
What upper bound on its size can we guarantee?
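One simple answer to the construction question is a greedy pass: scan the objects and promote each one to a center if no existing center covers it. The sketch below is this naive construction, taking quadratic time with precomputed ranks, and should not be read as the fast construction from the talk. It assumes rank_x(x) = 0 and index tie-breaking, and all names are hypothetical.

```python
def rank_table(points, dist):
    """ranks[x][y] = position of y when objects are sorted by similarity to x."""
    n = len(points)
    ranks = [[0] * n for _ in range(n)]
    for x in range(n):
        order = sorted(range(n), key=lambda y: (y != x, dist(points[x], points[y]), y))
        for position, y in enumerate(order):
            ranks[x][y] = position
    return ranks


def greedy_net(ranks, r):
    """Scan objects in index order; an object not covered by any center becomes one."""
    centers = []
    for y in range(len(ranks)):
        if not any(ranks[x][y] < r for x in centers):
            centers.append(y)
    return centers


def is_net(ranks, centers, r):
    """Check the covering and separation properties of a combinatorial r-net."""
    covering = all(any(ranks[x][y] < r for x in centers) for y in range(len(ranks)))
    separation = all(
        ranks[a][b] >= r or ranks[b][a] >= r
        for a in centers
        for b in centers
        if a != b
    )
    return covering and separation


points = list(range(16))
ranks = rank_table(points, lambda a, b: abs(a - b))
centers = greedy_net(ranks, r=4)
```

Covering holds because every object is either covered or becomes a center (covering itself at rank 0). Separation holds because a later center was, by construction, outside the ball of every earlier one.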
Basic Data Structure
Combinatorial nets:
For every 0 ≤ i ≤ log n, construct an n/2^i-net
Pointers, pointers, pointers:
Direct & inverted indices: links between centers and members of their balls
Cousin links: for every center keep pointers to close centers on the same level
Navigation links: for every center keep pointers to close centers on the next level
Fast Net Construction
Theorem
Combinatorial nets can be constructed in O(D^7 n log^2 n) time
Up’n’Down Trick
Assume you have a 2r-net for the dataset
To compute an r-ball around some object p:
1 Take a center p′ of a 2r-ball that covers p
2 Take all centers of 2r-balls near p′
3 For all of them, write down all members of their 2r-balls
4 Sort all written objects with respect to p and keep the r most similar ones
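The four steps can be sketched as follows. The slides do not say how "nearby" centers are chosen, so the `nearby` map here is built with an arbitrary rank threshold of 3 · 2r, which is purely an assumption. The 2r-net comes from a naive greedy pass, and an explicit distance stands in for the comparison oracle.

```python
def order_from(points, x, dist, candidates=None):
    """Sort candidates (default: all objects) by similarity to x, x itself first."""
    pool = range(len(points)) if candidates is None else candidates
    return sorted(pool, key=lambda y: (y != x, dist(points[x], points[y]), y))


def build_2r_net(points, two_r, dist):
    """Naive greedy 2r-net, returned as {center: members of its 2r-ball}."""
    balls = {}
    for y in range(len(points)):
        if not any(y in members for members in balls.values()):
            balls[y] = order_from(points, y, dist)[:two_r]
    return balls


def r_ball_up_n_down(points, p, r, balls, nearby, dist):
    # Step 1: take a center p' whose 2r-ball covers p.
    p_prime = next(c for c, members in balls.items() if p in members)
    # Step 2: take all centers of 2r-balls near p' (p' itself included).
    centers = nearby[p_prime]
    # Step 3: write down all members of their 2r-balls.
    candidates = set()
    for c in centers:
        candidates.update(balls[c])
    # Step 4: sort the candidates with respect to p, keep the r most similar.
    return order_from(points, p, dist, candidates)[:r]


def dist(a, b):
    return abs(a - b)


points = list(range(16))
r = 2
balls = build_2r_net(points, 2 * r, dist)
# Hypothetical notion of "nearby": centers among the 3 * 2r objects closest to c.
nearby = {
    c: [c2 for c2 in order_from(points, c, dist)[: 3 * 2 * r] if c2 in balls]
    for c in balls
}
ball_of_9 = r_ball_up_n_down(points, 9, r, balls, nearby, dist)
```

The point of the trick is that step 4 sorts only a small candidate set instead of the whole dataset. The answer is exact only if the true r-ball lies inside the candidate set, which is what an appropriate choice of nearby centers has to ensure.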
4 Directions for Further Research
Future of Combinatorial Framework
Other problems in combinatorial framework:
  Low-distortion embeddings
  Closest pairs
  Community discovery
  Linear arrangement
  Distance labelling
  Dimensionality reduction
What if disorder inequality has exceptions?
Insertions, deletions, changing metric
Experiments & implementation
Unification challenge: disorder + doubling = ?
Summary
Combinatorial framework:
comparison oracle + disorder inequality
New algorithms:
Nearest neighbor search
Deterministic detection of near-duplicates
Navigability design
Thanks for your attention!
Questions?
Links
http://yury.name
http://simsearch.yury.name
Tutorial, bibliography, people, links, open problems
Yury Lifshits and Shengyu Zhang
Combinatorial Algorithms for Nearest Neighbors, Near-Duplicates and Small-World Design
http://yury.name/papers/lifshits2008similarity.pdf
Navin Goyal, Yury Lifshits, Hinrich Schütze
Disorder Inequality: A Combinatorial Approach to Nearest Neighbor Search
http://yury.name/papers/goyal2008disorder.pdf
Benjamin Hoffmann, Yury Lifshits, Dirk Novotka
Maximal Intersection Queries in Randomized Graph Models
http://yury.name/papers/hoffmann2007maximal.pdf