
Spectral Approaches to Nearest Neighbor Search

arXiv:1408.0751

Robert Krauthgamer (Weizmann Institute)

Joint with: Amirali Abdullah, Alexandr Andoni, Ravi Kannan

Les Houches, January 2015

Nearest Neighbor Search (NNS)

Preprocess: a set $P$ of $n$ points in $\mathbb{R}^d$

Query: given a query point $q$, report the point $p^* \in P$ with the smallest distance to $q$

Motivation

Generic setup: points model objects (e.g. images); distance models a (dis)similarity measure

Application areas: machine learning (k-NN rule), signal processing, vector quantization, bioinformatics, etc.

Distance can be: Hamming, Euclidean, edit distance, earth-mover distance, …

[Figure: two bit strings, the query $q$ and its nearest neighbor $p^*$ under Hamming distance:
000000011100010100000100010100011111
000000001100000100000100110100111111]

Curse of Dimensionality

All exact algorithms degrade rapidly with the dimension

Algorithm                 | Query time                  | Space
Full indexing             | $O(\log n \cdot d^{O(1)})$  | $n^{O(d)}$ (Voronoi diagram size)
No indexing (linear scan) | $O(dn)$                     | $O(dn)$

Approximate NNS

$c$-approximate: given a query point $q$, report a point $p' \in P$ s.t. $\|p' - q\| \le c \cdot \|p^* - q\|$, where $c$ is the approximation factor

randomized: return such a $p'$ with constant probability

Heuristic perspective: the data structure returns a (hopefully small) set of candidate points
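To make the definition concrete, here is a minimal Python sketch (not part of the talk; all names are illustrative) that checks the $c$-approximate guarantee against a brute-force linear scan:

```python
# Minimal sketch: verify the c-approximate NNS condition via an exact linear scan.
import numpy as np

def is_c_approximate_nn(P, q, candidate, c):
    """True iff ||candidate - q|| <= c * min_p ||p - q||."""
    dists = np.linalg.norm(P - q, axis=1)          # distances from q to every data point
    return np.linalg.norm(candidate - q) <= c * dists.min()

# Example: 1000 random points in R^20
rng = np.random.default_rng(0)
P = rng.standard_normal((1000, 20))
q = rng.standard_normal(20)
p_prime = P[np.argmin(np.linalg.norm(P - q, axis=1))]  # here: the exact nearest neighbor
print(is_c_approximate_nn(P, q, p_prime, c=1.1))        # True
```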

NNS algorithms

It's all about space partitions!

Low-dimensional: [Arya-Mount'93], [Clarkson'94], [Arya-Mount-Netanyahu-Silverman-Wu'98], [Kleinberg'97], [Har-Peled'02], [Arya-Fonseca-Mount'11], …

High-dimensional: [Indyk-Motwani'98], [Kushilevitz-Ostrovsky-Rabani'98], [Indyk'98, '01], [Gionis-Indyk-Motwani'99], [Charikar'02], [Datar-Immorlica-Indyk-Mirrokni'04], [Chakrabarti-Regev'04], [Panigrahy'06], [Ailon-Chazelle'06], [Andoni-Indyk'06], [Andoni-Indyk-Nguyen-Razenshteyn'14], [Andoni-Razenshteyn'15]

Low-dimensional

kd-trees, …

Runtime: roughly $(1/\epsilon)^{O(d)} \cdot \log n$ for $(1+\epsilon)$-approximation

High-dimensional: Locality-Sensitive Hashing; crucial use of random projections

Johnson-Lindenstrauss Lemma: project to a random subspace of dimension $O(\epsilon^{-2} \log n)$ for $(1+\epsilon)$-approximation

Runtime: roughly $n^{1/c}$ for $c$-approximation (Hamming; $n^{1/c^2}$ for Euclidean)
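A hedged sketch of the Johnson-Lindenstrauss step, assuming a plain Gaussian projection matrix (the constant in the target dimension is an arbitrary choice for the sketch, not the lemma's tight constant):

```python
# Project n points from R^d down to m = O(eps^-2 * log n) dimensions with a random
# Gaussian matrix and check that a pairwise distance is preserved up to 1 +/- eps.
import numpy as np

rng = np.random.default_rng(1)
n, d, eps = 500, 2000, 0.5
m = int(np.ceil(8 * np.log(n) / eps**2))        # the constant 8 is an arbitrary choice here

X = rng.standard_normal((n, d))
G = rng.standard_normal((d, m)) / np.sqrt(m)    # random projection matrix
Y = X @ G

i, j = 0, 1
orig = np.linalg.norm(X[i] - X[j])
proj = np.linalg.norm(Y[i] - Y[j])
print(m, proj / orig)                           # ratio should be within roughly 1 +/- eps
```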

Practice

Data-aware partitions: optimize the partition to your dataset

PCA-tree [Sproull'91, McNames'01, Verma-Kpotufe-Dasgupta'09]

randomized kd-trees [SilpaAnan-Hartley'08, Muja-Lowe'09]

spectral/PCA/semantic/WTA hashing [Weiss-Torralba-Fergus'08, Wang-Kumar-Chang'09, Salakhutdinov-Hinton'09, Yagnik-Strelow-Ross-Lin'11]


Practice vs Theory


Data-aware projections often outperform (vanilla) random-projection methods

But no guarantees (correctness or performance)

JL is generally optimal [Alon'03, Jayram-Woodruff'11]

Even for some NNS setups! [Andoni-Indyk-Patrascu'06]

Why do data-aware projections outperform random projections ?

Algorithmic framework to study phenomenon?

Plan for the rest

Model

Two spectral algorithms

Conclusion

Our model


“low-dimensional signal + large noise” inside high dimensional space

Signal: points lie in a subspace $U$ of dimension $k$

Data: each point is perturbed by full-dimensional Gaussian noise $N(0, \sigma^2 I_d)$

Model properties


Data points in $P$ have at least unit norm

Query $q$ s.t.: $\|q - p^*\| \le 1$ for the "nearest neighbor" $p^*$, and $\|q - p\| \ge 1 + \epsilon$ for everybody else

Noise entries are $N(0, \sigma^2)$, with $\sigma$ up to a bounded factor; Claim: the exact nearest neighbor is still the same

Noise is large: it has magnitude $\sigma\sqrt{d} \gg \epsilon$; the top $k$ dimensions of the noise capture only a sub-constant fraction of its mass; JL would not work: after the noise, the gap is very close to 1
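A hedged illustration of one way to instantiate this model in Python; the specific values of $n$, $d$, $k$, $\sigma$ below are my choices for the sketch, not the paper's parameter regime:

```python
# Points on a random k-dimensional subspace U of R^d, then full-dimensional Gaussian noise.
import numpy as np

rng = np.random.default_rng(2)
n, d, k, sigma = 2000, 500, 10, 0.02               # illustrative sizes, not the paper's regime

# Signal: points inside U, with norm at least 1.
B, _ = np.linalg.qr(rng.standard_normal((d, k)))   # columns of B: orthonormal basis of U
coords = rng.standard_normal((n, k))
coords /= np.linalg.norm(coords, axis=1, keepdims=True)
coords *= 1 + rng.random((n, 1))                   # norms in [1, 2]
P_clean = coords @ B.T                             # n points lying in U

# Data: each point perturbed by N(0, sigma^2 I_d) noise.
P = P_clean + sigma * rng.standard_normal((n, d))

# Query: within distance <= 1 of its intended nearest neighbor p*.
p_star = P_clean[0]
direction = rng.standard_normal(d)
q = p_star + 0.9 * direction / np.linalg.norm(direction)
```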

Algorithms via PCA


Find the “signal subspace” $U$? Then we can project everything to $U$ and solve NNS there

Use Principal Component Analysis (PCA)?

Extract the top direction(s) from the SVD, e.g., the $k$-dimensional subspace $\tilde{U}$ that minimizes $\sum_{p \in P} d(p, \tilde{U})^2$

If PCA removes the noise “perfectly”, we are done: can reduce to $k$-dimensional NNS
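A minimal sketch of this naive “PCA, then NNS” idea, assuming a plain SVD and a linear scan in the projected space (function names are mine):

```python
import numpy as np

def top_k_pca_subspace(P, k):
    """Top-k right singular vectors: the k-dim subspace (through the origin)
    minimizing the sum of squared distances from the points."""
    _, _, Vt = np.linalg.svd(P, full_matrices=False)
    return Vt[:k]                                  # shape (k, d)

def nns_in_subspace(P, q, V):
    """Project data and query onto span(V) and return the index of the nearest point there."""
    P_proj, q_proj = P @ V.T, q @ V.T
    return int(np.argmin(np.linalg.norm(P_proj - q_proj, axis=1)))

# Usage on an instance (P, q, k) such as the one generated above:
#   V = top_k_pca_subspace(P, k)
#   idx = nns_in_subspace(P, q, V)
```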

Best we can hope for: the dataset may contain a “worst-case” $k$-dimensional instance, so the best possible is a reduction from dimension $d$ to $k$

NNS performance as if we are in $k$ dimensions, for the full model?

Spoiler: Yes

PCA under noise fails


Does PCA find the “signal subspace” under noise?

No! PCA minimizes $\sum_{p \in P} d(p, \tilde{U})^2$, which is good only on “average”, not in the “worst case”: weak signal directions get overpowered by noise directions (a typical noise direction contributes roughly $\sigma^2 n$)

1st Algorithm: intuition

Extract “well-captured” points: points whose signal lies mostly inside the top PCA space; this should work for a large fraction of the points

Iterate on the rest


Iterative PCA

To make this work:

Nearly no noise in the top PCA subspace $\tilde{U}$: ensure $\tilde{U}$ is close to $U$; take $\tilde{U}$ determined by heavy-enough spectral directions (its dimension may be less than $k$)

Capture only points whose signal lies fully in $\tilde{U}$; well-captured: distance to $\tilde{U}$ explained by noise only

• Find the top PCA subspace $\tilde{U}$
• Mark the points well-captured by $\tilde{U}$
• Build an NNS data structure on these points projected onto $\tilde{U}$
• Iterate on the remaining points

Query: query each NNS data structure separately
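A simplified Python sketch of the iterative scheme above, assuming linear scans as the per-iteration NNS data structures; the spectral cutoff and capture threshold are illustrative placeholders rather than the paper's settings:

```python
import numpy as np

def iterative_pca_index(P, k, capture_dist, max_iter=50):
    """Build one low-dimensional NNS instance (here: just projected points) per iteration."""
    remaining = np.arange(len(P))
    instances = []
    for _ in range(max_iter):
        if len(remaining) == 0:
            break
        R = P[remaining]
        _, S, Vt = np.linalg.svd(R, full_matrices=False)
        heavy = int((S >= S[0] / 10).sum())          # keep heavy-enough directions (placeholder rule)
        V = Vt[:min(k, heavy)]
        dist_to_V = np.linalg.norm(R - (R @ V.T) @ V, axis=1)
        captured = dist_to_V <= capture_dist         # "well-captured": residual explained by noise only
        if not captured.any():
            captured[:] = True                       # avoid an infinite loop in this toy version
        instances.append((V, remaining[captured], R[captured] @ V.T))
        remaining = remaining[~captured]
    return instances

def query_iterative_pca(instances, q):
    """Query each per-iteration structure separately (linear scan) and keep the best candidate."""
    best_idx, best_dist = None, np.inf
    for V, idx, proj in instances:
        dists = np.linalg.norm(proj - q @ V.T, axis=1)
        j = int(np.argmin(dists))
        if dists[j] < best_dist:
            best_idx, best_dist = int(idx[j]), dists[j]
    return best_idx
```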

Simpler model

Assume: small noise of bounded magnitude, where the noise can even be adversarial

Algorithm: a point is well-captured if its distance to the top-$k$ PCA subspace is below a (small) threshold

Claim 1: if $p^*$ is captured by the current subspace, the query will find it in the corresponding NNS data structure, since projection onto that subspace changes the distance of any captured point only by the small noise

Claim 2: the number of iterations is $O(\log n)$: the capturing condition can fail for at most a 1/4-fraction of the points, hence a constant fraction is captured in each iteration

• Find the top-$k$ PCA subspace $\tilde{U}$
• Mark the points well-captured by $\tilde{U}$
• Build an NNS data structure on these points projected onto $\tilde{U}$
• Iterate on the remaining points

Query: query each NNS data structure separately

Analysis of general model

Need to use the randomness of the noise

Want to say that the “signal” is stronger than the “noise” (on average)

Use random matrix theory

The noise matrix is random with i.i.d. Gaussian entries; all of its singular values are at most roughly $\sigma(\sqrt{n} + \sqrt{d})$

The signal matrix has rank $\le k$ and squared Frobenius norm $\ge n$ (unit-norm points), so the important signal directions carry squared singular values on the order of $n/k$; we can ignore directions with less spectral weight

Important signal directions are stronger than the noise!
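A quick numerical sanity check of the random-matrix fact being used (the $\sigma(\sqrt{n}+\sqrt{d})$ bound is the standard asymptotic for the top singular value of an i.i.d. Gaussian matrix; the numbers here are illustrative):

```python
# An n x d matrix with i.i.d. N(0, sigma^2) entries has top singular value
# concentrating around sigma * (sqrt(n) + sqrt(d)).
import numpy as np

rng = np.random.default_rng(3)
n, d, sigma = 2000, 500, 0.02
T = sigma * rng.standard_normal((n, d))
top = np.linalg.svd(T, compute_uv=False)[0]
print(top, sigma * (np.sqrt(n) + np.sqrt(d)))   # the two numbers should be close
```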

Closeness of subspaces ?

Trickier than singular values: the top singular vector is not stable under perturbation! It is stable only if the second singular value is much smaller

How to even define “closeness” of subspaces?

To the rescue: Wedin’s sin-theta theorem

[Figure: angle $\theta$ between the subspaces $S$ and $U$]

Wedin’s sin-theta theorem

Developed by [Davis-Kahan'70], [Wedin'72]

Theorem (informally): if $\tilde{U}$ is the top-$k$ singular subspace of the perturbed matrix and $U$ is the $k$-dimensional space containing the signal, then the angle between $U$ and $\tilde{U}$ is small whenever the perturbation is small relative to the $k$-th singular value

Another way to see why we need to take directions with sufficiently heavy singular values
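For reference, one standard textbook form of the bound, stated up to constant factors and not necessarily the exact variant used in the paper: if $\tilde{M} = M + E$ and $U$, $\tilde{U}$ are the top-$k$ singular subspaces of $M$ and $\tilde{M}$, then

```latex
\[
  \|\sin \Theta(U, \tilde{U})\| \;\lesssim\; \frac{\|E\|}{\sigma_k(M) - \sigma_{k+1}(M)} ,
\]
```

so a large gap between the $k$-th and $(k+1)$-st singular values (heavy directions) keeps the computed subspace close to the true one.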


Additional issue: Conditioning

After an iteration, the noise is not random anymore! The non-captured points might be “biased” by the capturing criterion

Fix: estimate top PCA subspace from a small sample of the data

This might be needed purely for the analysis, but it does not sound like a bad idea in practice either

Performance of Iterative PCA

Can prove a bound on the number of iterations; in each iteration, we run NNS in a ($\le k$)-dimensional space

Overall query time: dominated by querying each of these low-dimensional NNS data structures

Reduced to a small number of instances of $k$-dimensional NNS!

2nd Algorithm: PCA-tree

Closer to algorithms used in practice

• Find the top PCA direction
• Partition into slabs of width $\approx \epsilon/k$
• Snap points to the slab hyperplanes
• Recurse on each slab

Query: follow all tree paths that may contain $p^*$
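A hedged, simplified Python sketch of this construction; the slab width, leaf size, depth cap, and query radius are illustrative parameters, snapping is simplified to removing the component along the chosen direction, and the centering/sparsification modifications from the next slide are omitted:

```python
import numpy as np

def build_pca_tree(P, idx, slab_width, leaf_size=32, depth=0, max_depth=40):
    """Partition along the top PCA direction into slabs, snap, and recurse on each slab."""
    if len(idx) <= leaf_size or depth >= max_depth:   # max_depth guards this toy version
        return {"leaf": True, "idx": idx}
    X = P[idx]
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    v = Vt[0]                                         # top PCA direction of this node's points
    slabs = np.floor((X @ v) / slab_width).astype(int)
    P_snapped = P - np.outer(P @ v, v)                # simplified "snap": drop the v-component
    children = {int(s): build_pca_tree(P_snapped, idx[slabs == s], slab_width,
                                       leaf_size, depth + 1, max_depth)
                for s in np.unique(slabs)}
    return {"leaf": False, "v": v, "w": slab_width, "children": children}

def query_pca_tree(node, q, radius):
    """Follow every tree path whose slab could still contain a point within `radius` of q."""
    if node["leaf"]:
        return list(node["idx"])
    v, w = node["v"], node["w"]
    tq = q @ v
    out = []
    for s, child in node["children"].items():
        gap = max(s * w - tq, tq - (s + 1) * w, 0.0)  # distance from q's coordinate to slab s
        if gap <= radius:
            out.extend(query_pca_tree(child, q - tq * v, radius))
    return out

# Usage: tree = build_pca_tree(P, np.arange(len(P)), slab_width=0.05)
#        candidates = query_pca_tree(tree, q, radius=1.0)   # then re-rank candidates against P
```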

2 algorithmic modifications

Centering: need to use centered PCA (subtract the average); otherwise errors from the perturbations accumulate

Sparsification: need to sparsify the set of points in each node of the tree; otherwise we can get a “dense” cluster: not enough variance in the signal, lots of noise

• Find the top PCA direction
• Partition into slabs
• Snap points to the slab hyperplanes
• Recurse on each slab

Query: follow all tree paths that may contain $p^*$

Analysis

An “extreme” version of the Iterative PCA algorithm: just use the top PCA direction, which is guaranteed to contain signal!

Main lemma: the tree depth is bounded, because each discovered direction is close to the signal subspace $U$; snapping acts like orthogonalizing with respect to each such direction, and one cannot have too many such directions

Query runtime: determined by the number of tree paths followed

Overall, performs like $k$-dimensional NNS!

Wrap-up

Here:
Model: “low-dimensional signal + large noise”
NNS performance like in a low-dimensional space, via the “right” adaptation of PCA

Immediate questions:
Other, less-structured signal/noise models?
Algorithms with runtime dependent on the spectrum?

Broader Q: Analysis that explains empirical success?

Why do data-aware projections outperform random projections ?

Algorithmic framework to study phenomenon?