
Page 1: Locality Sensitive Hashing Basics and applications

Locality Sensitive Hashing

Basics and applications

Page 2: Locality Sensitive Hashing Basics and applications

A well-known problem

Given a large collection of documents, identify the near-duplicate documents

Web search engines face a proliferation of near-duplicate documents

Legitimate: mirrors, local copies, updates, …
Malicious: spam, spider-traps, dynamic URLs, …

About 30% of web pages are near-duplicates [Broder 1997]

Page 3: Locality Sensitive Hashing Basics and applications

Natural Approaches

Fingerprinting: only works for exact matches
Karp-Rabin (rolling hash): collision-probability guarantees
MD5: cryptographically-secure string hashes

Edit-distance metric for approximate string matching:
expensive, even for one pair of documents
impossible, for a billion web documents

Random sampling: sample substrings (phrases, sentences, etc.)
hope: similar documents yield similar samples
But even samples of the same document will differ
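A minimal Python sketch of the Karp-Rabin rolling hash mentioned above; the base B and the Mersenne-prime modulus P are illustrative choices, not fixed by the slides:

    # Karp-Rabin: hash every q-length window of the text in O(n) total time.
    def karp_rabin_hashes(text, q, B=256, P=(1 << 61) - 1):
        if len(text) < q:
            return
        h = 0
        for c in text[:q]:                  # hash of the first window
            h = (h * B + ord(c)) % P
        yield h
        top = pow(B, q - 1, P)              # weight of the outgoing character
        for i in range(q, len(text)):
            # roll: drop text[i-q], append text[i]
            h = ((h - ord(text[i - q]) * top) * B + ord(text[i])) % P
            yield h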

Page 4: Locality Sensitive Hashing Basics and applications

Basic Idea: Shingling [Broder 1997]

Dissect the document into q-grams (shingles)

T = I live and study in Pisa, …

If we set q = 3, the 3-grams are:
<I live and> <live and study> <and study in> <study in Pisa> …

Represent each document by its set of hash(shingle) values

The problem reduces to set intersection among sets of integers
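A minimal sketch of shingling in Python; truncating MD5 to 64 bits is an illustrative choice of string hash, not the one prescribed by Broder's scheme:

    import hashlib

    def shingles(text, q=3):
        # Represent a document by the set of hashed word q-grams (shingles).
        words = text.split()
        grams = (" ".join(words[i:i + q]) for i in range(len(words) - q + 1))
        # Truncate MD5 to a 64-bit integer (any good string hash would do).
        return {int(hashlib.md5(g.encode()).hexdigest()[:16], 16) for g in grams}

    S_A = shingles("I live and study in Pisa")   # {hash("I live and"), ...}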

Page 5: Locality Sensitive Hashing Basics and applications

Basic Idea: Shingling [Broder 1997]

Set intersection ⇒ Jaccard similarity

[Figure: the two shingle sets, S_A for DocA and S_B for DocB, drawn as overlapping circles]

Claim: A and B are near-duplicates if sim(S_A, S_B) is high, where

sim(S_A, S_B) = |S_A ∩ S_B| / |S_A ∪ S_B|
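The set-intersection formulation in a few lines of Python, reusing the shingle sets built in the sketch above (the second document is a hypothetical example):

    def jaccard(SA, SB):
        # sim(SA, SB) = |SA ∩ SB| / |SA ∪ SB|
        return len(SA & SB) / len(SA | SB)

    S_B = shingles("I live and work in Pisa")
    print(jaccard(S_A, S_B))   # high value => near-duplicates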

Page 6: Locality Sensitive Hashing Basics and applications

Sketching of a document

From each shingle set we build a “sketch vector” (of ~200 components)

Postulate: documents that share ≥ t components of their sketch vectors are claimed to be near-duplicates

Sec. 19.6

Page 7: Locality Sensitive Hashing Basics and applications

Sketching by Min-Hashing

Consider S_A, S_B ⊆ P = {0, …, p−1}

Pick a random permutation π of the whole set P (such as π(x) = ax + b mod p)

Let α = min{π(S_A)}, the minimal element of π(S_A)

Let β = min{π(S_B)}, the minimal element of π(S_B)

Lemma:

Pr[α = β] = |S_A ∩ S_B| / |S_A ∪ S_B|

Page 8: Locality Sensitive Hashing Basics and applications

Strengthening it…

Similarity sketch sk(A): the d minimal elements under π(S_A), or take d permutations and the min under each

Note: we can reduce the variance by using a larger d

Typically d is a few hundred mins (~200)
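A minimal min-hashing sketch in Python, using d linear maps π(x) = (a·x + b) mod P as on the previous slide; the Mersenne prime P = 2^89 − 1 is an illustrative modulus chosen to exceed the 64-bit shingle hashes:

    import random

    P = (1 << 89) - 1                      # prime modulus > any 64-bit shingle hash

    def make_permutations(d=200, seed=42):
        # d random linear maps pi(x) = (a*x + b) mod P
        rng = random.Random(seed)
        return [(rng.randrange(1, P), rng.randrange(P)) for _ in range(d)]

    def sketch(S, perms):
        # sk(S)[i] = min of the i-th permutation over the shingle set S
        return [min((a * x + b) % P for x in S) for (a, b) in perms]

    def estimated_similarity(skA, skB):
        # fraction of shared components estimates the Jaccard similarity
        return sum(x == y for x, y in zip(skA, skB)) / len(skA)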

Page 9: Locality Sensitive Hashing Basics and applications

Computing Sketch[i] for Doc1

Document 1

[Figure: the shingles of Document 1, hashed into the space {0, …, 2^64 − 1}]

Start with 64-bit f(shingles)

Permute with π_i

Pick the min value

Sec. 19.6

Page 10: Locality Sensitive Hashing Basics and applications

Test if Doc1.Sketch[i] = Doc2.Sketch[i]

Document 1, Document 2

[Figure: the 64-bit sketch components A and B of the two documents, compared side by side]

Are these equal?

Use 200 random permutations (min-hash), thus creating one 200-dim vector per document, and evaluate the fraction of shared components

Sec. 19.6

Claim: equality of a single sketch component happens with probability Size_of_intersection / Size_of_union

Page 11: Locality Sensitive Hashing Basics and applications

It’s even more difficult…

So we have squeezed a few KBs of data (a web page) into a few hundred bytes

But you still need a brute-force comparison (quadratic time) to compute all nearly-duplicate documents

This is too much even if it is executed in RAM

Page 12: Locality Sensitive Hashing Basics and applications

Locality Sensitive Hashing

The case of the Hamming distance

How to quickly compute the fraction of differing components between d-dim vectors

How to quickly compute the Hamming distance between d-dim vectors

Fraction of differing components = HammingDist / d

Page 13: Locality Sensitive Hashing Basics and applications

A warm-up

Consider the case of binary (sketch) vectors, thus living in the hypercube {0,1}^d

Hamming distance

D(p,q) = # coordinates on which p and q differ

Define the hash function h by choosing a set I of k random coordinates

h(p) = p|I = projection of p on I

Example: if p = 01011 (d = 5) and we pick I = {1,4} (so k = 2), then h(p) = 01. Note the similarity with the Bloom filter.
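A minimal sketch of the projection hash, assuming 0-indexed coordinates (the slide's example uses 1-indexed ones):

    import random

    def projection_hash(k, d, seed=None):
        # h(p) = p restricted to a random set I of k out of d coordinates
        I = sorted(random.Random(seed).sample(range(d), k))
        return lambda p: tuple(p[i] for i in I)

    h = projection_hash(k=2, d=5, seed=0)
    print(h((0, 1, 0, 1, 1)))   # the projection of p = 01011 on the chosen I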

Page 14: Locality Sensitive Hashing Basics and applications

A key property

Pr[picking an equal component] = (d − D(p,q)) / d

We can vary the probability by changing k

[Figure: Pr[h(p) = h(q)] as a function of the distance D(p,q) ∈ {1, 2, …, d}, for k = 1 and k = 2; larger k makes the curve drop faster]

Pr[h(p) = h(q)] = (1 − D(p,q)/d)^k

What about false negatives?

Page 15: Locality Sensitive Hashing Basics and applications

Reiterate

Repeat L times the k-projections hi

Declare a «match» if at least one hi matches

Example: d=5, k=2, p = 01011 and q = 00101

•I1 = {2,4}, we have h1(p) = 11 and h1(q)=00

•I2 = {1,4}, we have h2(p) = 01 and h2(q)=00

•I3 = {1,5}, we have h3(p) = 01 and h3(q)=01

We set g(·) = < h1(·), h2(·), h3(·) >

p and q match !!
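A minimal sketch of the reiterated scheme, with L random k-projections combined by an OR (the helper name make_g is illustrative):

    import random

    def make_g(L, k, d, seed=0):
        # g = <h1, ..., hL>, each hi a projection onto k random coordinates
        rng = random.Random(seed)
        index_sets = [rng.sample(range(d), k) for _ in range(L)]
        return [lambda p, I=I: tuple(p[i] for i in I) for I in index_sets]

    def match(p, q, g):
        # declare a «match» if at least one of the L projections agrees
        return any(h(p) == h(q) for h in g)

    g = make_g(L=3, k=2, d=5)
    print(match((0, 1, 0, 1, 1), (0, 0, 1, 0, 1), g))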

Page 16: Locality Sensitive Hashing Basics and applications

Measuring the error prob.

g(·) consists of L independent k-projections h_1, …, h_L

Pr[g(p) matches g(q)] = 1 − Pr[h_i(p) ≠ h_i(q) for all i = 1, …, L]

Since Pr[h_i(p) = h_i(q)] = (1 − D(p,q)/d)^k, we get

Pr[g(p) matches g(q)] = 1 − (1 − (1 − D(p,q)/d)^k)^L

Writing s = 1 − D(p,q)/d for the fraction of equal components, this is 1 − (1 − s^k)^L: an S-shaped curve in s that rises steeply around s ≈ (1/L)^(1/k)
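A short worked example of the S-curve, with illustrative parameters k and L:

    def match_probability(s, k, L):
        # Pr[g(p) matches g(q)] = 1 - (1 - s**k)**L, with s = 1 - D(p,q)/d
        return 1 - (1 - s**k)**L

    k, L = 5, 20
    threshold = (1 / L) ** (1 / k)          # s where the curve rises steeply
    for s in (0.2, threshold, 0.9):
        print(f"s = {s:.2f}   Pr[match] = {match_probability(s, k, L):.3f}")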

Page 17: Locality Sensitive Hashing Basics and applications

Find groups of similar items

SOL 1: Buckets provide the candidate similar items

«Merge» similar sets if they share items

[Figure: a point p hashed by h_1(p), h_2(p), …, h_L(p) into the tables T_1, T_2, …, T_L]

Points in a bucket are possibly similar objects

Page 18: Locality Sensitive Hashing Basics and applications

Find groups of similar items

SOL 1: Buckets provide the candidate similar items

SOL 2: Sort the items by h_i(), and pick as similar candidates the equal ones; repeat L times, once for each h_i()

«Merge» candidate sets if they share items.

What about clustering?

Check the candidates !!!
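A minimal sketch of SOL 2 in Python; bucketing by each h_i is equivalent to sorting by it and taking runs of equal keys, and the projection hashes at the bottom are illustrative:

    import random
    from collections import defaultdict

    def candidate_pairs(points, hash_fns):
        # For each h_i, group items with equal h_i value; every group
        # yields candidate pairs, to be checked explicitly afterwards.
        pairs = set()
        for h in hash_fns:
            buckets = defaultdict(list)
            for idx, p in enumerate(points):
                buckets[h(p)].append(idx)
            for group in buckets.values():
                pairs.update((group[i], group[j])
                             for i in range(len(group))
                             for j in range(i + 1, len(group)))
        return pairs    # candidates only: verify each pair's true distance

    rng = random.Random(0)
    hash_fns = [(lambda p, I=I: tuple(p[i] for i in I))
                for I in (rng.sample(range(5), 2) for _ in range(3))]
    print(candidate_pairs([(0,1,0,1,1), (0,0,1,0,1), (0,1,0,1,0)], hash_fns))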

Page 19: Locality Sensitive Hashing Basics and applications

LSH versus K-means

What about optimality? K-means is only locally optimal [recently, some researchers showed how to introduce some guarantees]

What about the similarity cost? K-means compares items in Θ(d) time and space [notice that d may be millions or billions]

What about the cost per iteration and their number? Typically K-means requires few iterations, each costing K·n comparisons: I·K·n·d total time over I iterations

What about K? In principle one has to try K = 1, …, n

LSH needs sort(n) time; hence, on disk, a few passes over the data, and with guaranteed error bounds

Page 20: Locality Sensitive Hashing Basics and applications

Also on-line query

Given a query q, check the buckets of h_j(q) for j = 1, …, L

[Figure: the query q hashed by h_1(q), h_2(q), …, h_L(q) into the tables T_1, T_2, …, T_L]

Page 21: Locality Sensitive Hashing Basics and applications

Locality Sensitive Hashing and its applications

More problems, indeed

Page 22: Locality Sensitive Hashing Basics and applications

Another classic problem

The problem: given U users, the goal is to find groups of similar users (or users similar to a query user Q)

Features = personal data, preferences, purchases, navigational behavior, followers/following or +1s, …

A feature is typically a numerical value: binary or real

     1  2  3  4  5
U1   0  1  0  1  1
U2   0  1  1  0  0
U3   0  1  1  1  1

Hamming distance: # different components
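A one-liner for the Hamming distance between the binary feature vectors above (a minimal sketch):

    def hamming(u, v):
        # number of components on which the two vectors differ
        return sum(a != b for a, b in zip(u, v))

    U1, U3 = [0, 1, 0, 1, 1], [0, 1, 1, 1, 1]
    print(hamming(U1, U3))   # 1: U1 and U3 differ on a single feature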

Page 23: Locality Sensitive Hashing Basics and applications

More than Hamming distance

Example:

[Figure: a set of points, the query point q, and its nearest neighbor P*]

q is the query point; P* is its nearest neighbor

Page 24: Locality Sensitive Hashing Basics and applications

Approximation helps

[Figure: the query q, its nearest neighbor p*, and a ball of radius r around q]

Page 25: Locality Sensitive Hashing Basics and applications

A slightly different problem

Approximate Nearest Neighbor

Given an error parameter ε > 0: for a query q with nearest neighbor p', return a point p such that

D(q, p) ≤ (1 + ε) · D(q, p')

Justification: mapping objects to a metric space is heuristic anyway, and we get a tremendous performance improvement

Page 26: Locality Sensitive Hashing Basics and applications

A workable approach

Given an error parameter ε > 0 and a distance threshold t > 0, the (t,ε)-Approximate NN Query:

If there is no point p with D(q,p) < t, return FAILURE; else, return any p' with D(q,p') < (1+ε)·t

Application: Approximate Nearest Neighbor. Assume the maximum distance is T, and run in parallel for

t = 1, (1+ε), (1+ε)^2, (1+ε)^3, …, T

Time/space: O(log_{1+ε} T) overhead
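A tiny sketch of the parallel ladder of thresholds; the starting radius 1 and the maximum T below are illustrative values:

    def thresholds(T, eps, t=1.0):
        # radii t, (1+eps)t, (1+eps)^2 t, ... up to the maximum distance T:
        # O(log_{1+eps} T) parallel copies of the (t, eps)-approximate query
        out = []
        while t <= T:
            out.append(t)
            t *= 1 + eps
        return out

    print(len(thresholds(T=10**6, eps=0.5)))   # 35 ≈ log_{1.5} 10^6 copies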

Page 27: Locality Sensitive Hashing Basics and applications

Locality Sensitive Hashing and its applications

The analysis

Page 28: Locality Sensitive Hashing Basics and applications

LSH Analysis

For a fixed threshold r, we distinguish between
Near: D(p,q) < r
Far: D(p,q) > (1+ε)r

A locality-sensitive hash h should guarantee

Near points are hashed together with Pr[h(a)=h(b)] ≥ P1

Far points may be mapped together but Pr[h(a)=h(c)] ≤ P2

where, of course, we have that P1 > P2

[Figure: a point a, a near point b within radius r, and a far point c beyond radius (1+ε)r]

Page 29: Locality Sensitive Hashing Basics and applications

Family: h_i(p) = p|{c_1, …, c_k}, where the coordinates c_i are chosen randomly

If D(a,b) ≤ r, then Pr[h_i(a) = h_i(b)] = (1 − D(a,b)/d)^k ≥ (1 − r/d)^k = (p_1)^k = P1

If D(a,c) > (1+ε)r, then Pr[h_i(a) = h_i(c)] = (1 − D(a,c)/d)^k < (1 − r(1+ε)/d)^k = (p_2)^k = P2

where, of course, we have that p_1 > p_2 (and hence P1 > P2)

What about the Hamming distance?

Page 30: Locality Sensitive Hashing Basics and applications

LSH Analysis

The LSH algorithm with the L mappings h_i() correctly solves the (r,ε)-NN problem on a query point q if the following hold:

I. The total number of points FAR from q and belonging to the visited buckets h_i(q) is bounded (at most 3L, a constant per hash function).

II. If p* is NEAR to q, then h_i(p*) = h_i(q) for some i (p* is in a visited bucket).

Theorem. Take k = log_{1/p2} n and L = n^ρ, with ρ = ln p1 / ln p2; then the two properties above hold with probability at least 0.298.

Repeating the process O(1/δ) times, we ensure a success probability of at least 1 − δ.

Space ≈ n·L = n^{1+ρ}, where ρ = ln p1 / ln p2 < 1

Query time ≈ L buckets accessed, i.e. n^ρ
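A small sketch that turns the theorem's recipe into numbers; n, p1, p2 below are hypothetical values, not taken from the slides:

    import math

    def lsh_parameters(n, p1, p2):
        # k = log_{1/p2} n and L = n^rho, with rho = ln p1 / ln p2 < 1
        rho = math.log(p1) / math.log(p2)
        k = math.ceil(math.log(n) / math.log(1 / p2))
        L = math.ceil(n ** rho)
        return k, L, rho

    print(lsh_parameters(n=10**6, p1=0.9, p2=0.5))   # (20, 9, 0.152...)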

Page 31: Locality Sensitive Hashing Basics and applications

Proof

p* is a point near to q: D(q,p*) < r

FAR(q) = set of points p s.t. D(q,p) > (1+ε)·r

BUCKET_i(q) = set of points p s.t. h_i(p) = h_i(q)

Let us define the following events:

E1 = the number of far points in the visited buckets is at most 3L, i.e. |FAR(q) ∩ ∪_{j=1,…,L} BUCKET_j(q)| ≤ 3L

E2 = p* occurs in some visited bucket, i.e. ∃ j s.t. h_j(q) = h_j(p*)

Page 32: Locality Sensitive Hashing Basics and applications

Bad collisions: more than 3L

Let p be a point in FAR(q):

Pr[for a fixed j, h_j(p) = h_j(q)] < P2 = (p_2)^k

Given that k = log_{1/p2} n:

Pr[for a fixed j, a far point p satisfies h_j(p) = h_j(q)] ≤ 1/n

Hence E[|FAR(q) ∩ BUCKET_j(q)|] ≤ n · (1/n) = 1 for each fixed j, and so

E[|FAR(q) ∩ ∪_{j=1,…,L} BUCKET_j(q)|] ≤ L

By Markov's inequality, Pr[X > 3·E[X]] ≤ 1/3, so the number of bad collisions exceeds 3L with probability at most 1/3.

Page 33: Locality Sensitive Hashing Basics and applications

Good collision: p* occurs

For any h_j we have

Pr[h_j(p*) = h_j(q)] ≥ P1 = (p_1)^k = (p_1)^{log_{1/p2} n} = n^{ln p1 / ln(1/p2)} = n^{−ρ}

Given that L = n^ρ with ρ = ln p1 / ln p2, this is exactly 1/L.

So we have that Pr[not E2] = Pr[not finding p* in q's buckets] =

= (1 − Pr[h_j(p*) = h_j(q)])^L ≤ (1 − 1/L)^L ≤ 1/e

Finally, Pr[E1 and E2] ≥ 1 − Pr[not E1 OR not E2] ≥ 1 − (Pr[not E1] + Pr[not E2]) ≥ 1 − (1/3) − (1/e) ≈ 0.298