Dimension Reduction in the Hamming Cube (and its Applications) Rafail Ostrovsky UCLA (joint works with Rabani; and Kushilevitz and Rabani)


Page 1:

Dimension Reduction in the Hamming Cube

(and its Applications)

Rafail Ostrovsky UCLA

(joint works with Rabani; and Kushilevitz and Rabani)

Page 2:

http://www.cs.ucla.edu/~rafail/

PLAN

– Problem formulations
– Communication complexity game
– What really happened? (dimension reduction)
– Solutions to 2 problems:
  – ANN
  – k-clustering
– What’s next?

Page 3:

Problem statements

Johnson-Lindenstrauss lemma: n points in a high-dimensional Hilbert space can be embedded into an O(log n)-dimensional subspace with small distortion.

Q: how do we do it for the Hamming cube?

(We show how to avoid the impossibility result of [Charikar-Sahai].)
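
To make the lemma concrete, here is a minimal sketch of the classical Johnson-Lindenstrauss reduction via a random Gaussian projection. This is my own illustration, not the construction from this talk, and the constant 8 in the target dimension is purely illustrative.

```python
import numpy as np

def jl_project(points: np.ndarray, eps: float, seed: int = 0) -> np.ndarray:
    """Map n points in R^d to O(log n / eps^2) dimensions with
    pairwise distances preserved up to (1 +/- eps), w.h.p."""
    rng = np.random.default_rng(seed)
    n, d = points.shape
    k = int(np.ceil(8 * np.log(n) / eps**2))     # target dimension (constant illustrative)
    proj = rng.normal(size=(d, k)) / np.sqrt(k)  # scaling keeps expected norms
    return points @ proj
```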

Page 4:

Many different formulations of ANN

ANN – “approximate nearest neighbor search”

(many applications in computational geometry, biology/stringology, IR, other areas)

Here are different formulations:

Page 5:

Approximate Searching

Motivation: given a DB of “names” and a user with a “target” name, find whether any DB name is “close” to the target name, without doing a linear scan.

Jon

Alice

Bob

Eve

Panconesi

Kate

Fred

A.Panconesi ?

Page 6:

Geometric formulation

Nearest Neighbor Search (NNS): given N blue points (and a distance function, say Euclidean distance in R^d), store all these points somehow.

Page 7:

Data structure question

Given a new red point, find the closest blue point.

Naive solution 1: store the blue points “as is” and, when given a red point, measure distances to all blue points.

Q: can we do better?

Page 8:

Can we do better?

– Easy in small dimensions (Voronoi diagrams)
– “Curse of dimensionality” in high dimensions…
– [KOR]: Can get a good “approximate” solution efficiently!

Page 9:

Hamming Cube Formulation for ANN

Given a DB of N blue n-bit strings, process them somehow. Given an n-bit red string, find its ANN in the hypercube {0,1}^n.

Naïve solution 2: pre-compute all (exponentially many) answers (we want small data structures!)

00101011

01011001

11101001

10110110

11010101

11011000

10101010

10101111

11010100
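
For contrast, here is naive solution 1 on this example DB, in a minimal sketch of my own: a linear scan costing O(N·n) per query, which is exactly the baseline we want to beat.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions where the two bit-strings differ."""
    return sum(c1 != c2 for c1, c2 in zip(a, b))

db = ["00101011", "01011001", "11101001", "10110110", "11010101",
      "11011000", "10101010", "10101111", "11010100"]

def nearest(query: str) -> str:
    """Naive solution 1: scan all N strings."""
    return min(db, key=lambda s: hamming(query, s))

print(nearest("10111110"))  # linear scan over the whole DB
```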

Page 10:

Clustering problem that I’ll discuss in detail

k-clustering

Page 11:

An example of Clustering – find “centers”

Given N points in R^d

Page 12:

A clustering formulation

Find cluster “centers”

Page 13:

Clustering formulation

The “cost” is the sum of distances

Page 14:

Main technique

First, as a communication game; second, interpreted as a dimension reduction.

Page 15:

COMMUNICATION COMPLEXITY GAME

Two players, Alice and Bob: Alice is secretly given a string x, Bob is secretly given a string y, and they want to estimate the Hamming distance between x and y with small communication (and small error), provided that they have common randomness.

How can they do it? (say |x| = |y| = n)

Much easier: how do we check that x = y?
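
The equality question has a classic answer when Alice and Bob share randomness: compare parities of a random subset of positions. A minimal sketch (my own, with hypothetical names); if x ≠ y, each round catches the difference with probability exactly 1/2, at one bit of communication per side per round.

```python
import numpy as np

def equality_test(x: np.ndarray, y: np.ndarray, trials: int = 20, seed: int = 0) -> bool:
    """x, y: 0/1 vectors. Error probability <= 2**(-trials) when x != y."""
    rng = np.random.default_rng(seed)  # stands in for the shared randomness
    for _ in range(trials):
        r = rng.integers(0, 2, size=len(x))   # random subset of positions
        if (r @ x) % 2 != (r @ y) % 2:        # Alice's parity bit vs Bob's
            return False
    return True
```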

Page 16:

Main lemma : an abstract game

How can Alice and Bob estimate the Hamming distance between X and Y with small CC?

We assume Alice and Bob share randomness

ALICE: X1 X2 X3 X4 … Xn
BOB: Y1 Y2 Y3 Y4 … Yn

Page 17:

A simpler question

To estimate the Hamming distance between X and Y (within a factor of (1+ε)) with small CC, it is sufficient for Alice and Bob, for any L, to be able to distinguish:

– H(X,Y) <= L, OR
– H(X,Y) > (1+ε)L

Q: why doesn’t sampling work?


Page 18:

Alice and Bob pick the SAME n-bit string R, where each bit of R is 1 independently with probability 1/(2L).

[Diagram: X and Y are each XORed against the same random vector R; each party outputs the parity (a single 0/1 bit) of the selected positions.]
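
A sketch of one round of this test (my own illustration of the slide’s picture): if H(x,y) = h, the two parity bits differ with probability (1 − (1 − 1/L)^h)/2, which is noticeably larger when h > (1+ε)L than when h <= L.

```python
import numpy as np

def test_bit(x: np.ndarray, y: np.ndarray, L: float, rng) -> tuple:
    """Both parties XOR their string against the SAME sparse R
    (each bit of R is 1 with probability 1/(2L)) and output a parity bit."""
    r = (rng.random(len(x)) < 1.0 / (2 * L)).astype(int)
    return (r @ x) % 2, (r @ y) % 2
```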

Page 19:

What is the difference in probabilities between the cases H(X,Y) <= L and H(X,Y) > (1+ε)L?


Page 20:

How do we amplify?


Page 21:

How do we amplify? Repeat, with many independent R’s, but with the same distribution!


Page 22:

A refined game with small communication

How can Alice and Bob distinguish X and Y:
– H(X,Y) <= L, OR
– H(X,Y) > (1+ε)L

ALICE (X1 X2 X3 X4 … Xn): for each R, XOR the subset of the Xi selected by R.
BOB (Y1 Y2 Y3 Y4 … Yn): for each R, XOR the same subset of the Yi.
Compare the outputs.

Pick O(log N / ε²) R’s with the correct distribution and compare this linear transformation.
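
Putting the rounds together, a sketch of the refined game under the stated distribution. The constant 64 and the midpoint threshold are my own illustrative choices, not constants from the paper.

```python
import numpy as np

def distinguish(x, y, L, eps, N, seed=0):
    """Decide H(x,y) <= L versus H(x,y) > (1+eps)L from m parity bits."""
    rng = np.random.default_rng(seed)                  # shared randomness
    m = int(np.ceil(64 * np.log(N) / eps**2))          # O(log N / eps^2) R's
    R = (rng.random((m, len(x))) < 1.0 / (2 * L)).astype(int)
    ax, by = (R @ x) % 2, (R @ y) % 2                  # Alice's / Bob's messages
    disagree = np.mean(ax != by)
    # Expected disagreement rates at h = L and h = (1+eps)L:
    p_lo = (1 - (1 - 1.0 / L) ** L) / 2
    p_hi = (1 - (1 - 1.0 / L) ** ((1 + eps) * L)) / 2
    return "H > (1+eps)L" if disagree > (p_lo + p_hi) / 2 else "H <= L"
```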

Page 23:

Dimension Reduction in the Hamming Cube [OR]

For each L, we can pick O(log N) R’s and boost the probabilities!

Key property: we get an embedding from the large cube to a small cube that preserves ranges around L very well.

Page 24:

Dimension Reduction in the Hamming Cube [OR]

For each L, we can pick O(log N) R’s and boost the probabilities!

Key property: we get an embedding from the large cube to a small cube that preserves ranges around L.

Key idea in applications: we can build an inverse lookup table for the small cube!

Page 25:

Applications

Applications of the dimension reduction in the Hamming cube:

– For ANN in the Hamming cube and R^d
– For k-clustering

Page 26:

Application to ANN in the Hamming Cube

For each possible L, build a “small cube” and project the original DB into it.

Pre-compute an inverse table for each entry of the small cube.

Why is this efficient? How do we answer any query? How do we navigate between different L?

Page 27:

Putting it All together: User’s private approx search from DB

Each projection is O(log N) R’s. User picks many such projections for each L-range. That defines all the embeddings.

Now, DB builds inverse lookup tables for each projection as new DB’s for each L.

User can now “project” its query into small cube and use binary search on L
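
A toy sketch of this pipeline for one fixed L (my own names and layout; m must be O(log N) so that the 2^m-entry table stays polynomial in N): build the inverse lookup table once, then answer each query with a projection plus one table lookup.

```python
import numpy as np
from itertools import product

def build_table(db: np.ndarray, L: float, m: int, seed: int = 0):
    """db: N x n 0/1 matrix. Project into an m-bit cube and precompute,
    for EVERY small-cube point, a DB index whose projection is nearest."""
    rng = np.random.default_rng(seed)
    R = (rng.random((m, db.shape[1])) < 1.0 / (2 * L)).astype(int)
    proj = (db @ R.T) % 2                       # N x m projected DB
    table = {}
    for entry in product((0, 1), repeat=m):     # all 2^m small-cube points
        e = np.array(entry)
        table[entry] = int(np.argmin(((proj + e) % 2).sum(axis=1)))
    return R, table

def query(q: np.ndarray, R: np.ndarray, table: dict) -> int:
    qp = (R @ q) % 2                            # project the query
    return table[tuple(int(b) for b in qp)]     # O(1) lookup in the small cube
```

Binary search over the per-L structures then locates the right distance scale for a given query.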

Page 28:

MAIN THM [KOR]

We can build a poly-size data structure to do ANN for high-dimensional data in time polynomial in d and poly-logarithmic in N:
– For the Hamming cube
– L_1
– L_2
– Square of the Euclidean distance

[IM] had a similar result, with a slightly weaker guarantee.

Page 29:

Dealing with R^d

Project to random lines, choose “cut” points…

Well, not exactly… we need “navigation”
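
A sketch of just the project-and-cut step this slide alludes to (my own reading of it; the actual construction also needs the “navigation” mentioned here): each random line together with a random cut point contributes one cube coordinate.

```python
import numpy as np

def lines_to_bits(points: np.ndarray, m: int, seed: int = 0) -> np.ndarray:
    """Map n points in R^d to n strings in {0,1}^m: project onto m random
    unit directions and threshold each projection at a random cut point."""
    rng = np.random.default_rng(seed)
    n, d = points.shape
    dirs = rng.normal(size=(m, d))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    t = points @ dirs.T                                 # position along each line
    cuts = rng.uniform(t.min(axis=0), t.max(axis=0))    # one random cut per line
    return (t > cuts).astype(int)
```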

Page 30:

Clustering

Huge number of applications (IR, data mining, analysis of statistical data, biology, automatic taxonomy formation, the web, topic-specific data collections, etc.)

Two independent issues:
– Representation of the data
– Forming “clusters” (many incomparable methods)

Page 31:

Representation of data: examples

– Latent semantic indexing yields points in R^d with the l2 distance (distance indicating similarity)
– The min-wise permutation approach (Broder et al.) yields points in the Hamming metric
– Many other representations from the IR literature lead to other metrics, including the edit-distance metric on strings

Recent news: [OR-05] showed that we can embed the edit-distance metric into l1 with small distortion: distortion = exp(sqrt(log n log log n)).

Page 32:

Geometric Clustering: examples

– Min-sum clustering in R^d: form clusters such that the sum of intra-cluster distances is minimized
– k-clustering: pick k “centers” in the ambient space; the cost is the sum of distances from each data point to its closest center
– Agglomerative clustering (form clusters below some distance threshold)

Q: which is better?

Page 33:

Methods are (in general) incomparable

Page 34:

Min-SUM

Page 35:

2-Clustering

Page 36:

A k-clustering problem: notation

N – number of points
d – dimension
k – number of centers

Page 37:

About k-clustering

– When k is fixed, this is easy for small d
– [Kleinberg, Papadimitriou, Raghavan]: NP-complete for k=2 for the cube
– [Drineas, Frieze, Kannan, Vempala, Vinay]: NP-complete for R^d for the square of the Euclidean distance
– When k is not fixed, this is facility location (Euclidean k-median)
– For fixed d but growing k, a PTAS was given by [Arora, Raghavan, Rao] (using dynamic programming)
– (this talk) [OR]: a PTAS for fixed k, arbitrary d

Page 38:

Common tools in geometric PTAS

– Dynamic programming
– Sampling [Schulman, AS, DLVK]
– [DFKVV] use SVD

Embeddings/dimension reduction seem useless because:
– Too many candidate centers
– May introduce new centers

Page 39:

[OR] k-clustering result

A PTAS for fixed k:
– Hamming cube {0,1}^d
– l1^d
– l2^d (Euclidean distance)
– Square of the Euclidean distance

Page 40:

Main ideas

– For 2-clustering, finding a good partition is as good as solving the problem
– Switch to the cube
– Try partitions in the embedded low-dimensional data set
– Given a partition, compute centers and cost in the original data set (see the sketch below)
– Embedding/dimension reduction is used to reduce the number of partitions
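
The “compute centers and cost in the original data set” step is concrete in the cube: within each cluster, the optimal center is the coordinate-wise majority. A minimal sketch of that step (my own, assuming points are rows of a 0/1 array and the partition is a label per point):

```python
import numpy as np

def cube_cost(db: np.ndarray, labels: np.ndarray) -> int:
    """Optimal cost of a given partition in the ORIGINAL cube: per cluster,
    the best center is the coordinate-wise majority bit, and the cost is
    the sum of Hamming distances to it."""
    cost = 0
    for c in np.unique(labels):
        pts = db[labels == c]
        center = (pts.mean(axis=0) >= 0.5).astype(int)  # majority per coordinate
        cost += int(np.abs(pts - center).sum())
    return cost
```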

Page 41:

Stronger property of [OR] dimension reduction

Our random linear transformation preserves ranges!

Page 42:

THE ALGORITHM

Page 43:

The algorithm yet again

– Guess the 2-center distance
– Map to the small cube
– Partition in the small cube
– Measure the partition in the big cube

THM: gets within (1+ε) of optimal.

Disclaimer: a PTAS is (almost never) practical; this shows feasibility only, and more ideas are needed for a practical solution.

Page 44:

Dealing with k>2

The apex of a tournament is a node of maximum out-degree.

Fact: the apex has a path of length at most 2 to every node (see the sketch below).

Every point is assigned to the apex of its center “tournaments”:
– Guess all (k choose 2) center distances
– Embed into (k choose 2) small cubes
– Guess the center-projections in the small cubes
– For every point, for every pair of centers, define a “tournament”: which center is closer in the projection
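
A minimal sketch of the apex fact (my own illustration): the apex is simply a node of maximum out-degree, and in any tournament such a node reaches every other node by a path of length at most 2.

```python
def apex(beats: list) -> int:
    """beats[i][j] = True iff i beats j (for i != j exactly one of
    beats[i][j], beats[j][i] holds). Returns a max out-degree node."""
    return max(range(len(beats)), key=lambda i: sum(beats[i]))
```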

Page 45:

Conclusions

– Dimension reduction in the cube lets us deal with a huge number of “incomparable” attributes.
– Embeddings of other metrics into the cube allow fast ANN for those metrics.
– Real applications still require considerable additional ideas.
– A fun area to work in!