locality sensitive hashing
Locality Sensitive Hashing: Randomized Algorithm
Problem Statement
• Given a query point q,
• Find the closest items to the query point with high probability
• Iterative methods?
• Large volume of data
• Curse of dimensionality
Taxonomy – Near Neighbor Query (NN)
NN
• Trees: K-d Tree, Range Tree, B Tree, Cover Tree
• Grid: Voronoi Diagram
• Hash: Approximate LSH
Approximate LSH
• Simple Idea
• If two points are close together, then after a “projection” operation these two points will remain close together
LSH Requirement
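• For any given points $p, q \in \mathbb{R}^d$
• Hash function h is $(d_1, d_2, P_1, P_2)$-sensitive if
• $\Pr[h(p) = h(q)] \ge P_1$ when $\lVert p - q \rVert \le d_1$
• $\Pr[h(p) = h(q)] \le P_2$ when $\lVert p - q \rVert \ge d_2$
• Ideally we need
• $P_1 - P_2$ to be large
• $d_2 - d_1$ to be small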
[Figure: Probability vs. Distance on candidate pairs; query point q with candidates at distances d, 2d and c·d]
Hash Function (Random)
• Locality-preserving
• Independent
• Deterministic
• Family of hash functions per distance measure: Euclidean, Jaccard, Cosine Similarity, Hamming
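As one concrete instance of a family tied to a distance measure, the sketch below uses bit sampling for Hamming distance: each hash function returns one randomly chosen bit, so two bit vectors of length n at Hamming distance d collide with probability 1 - d/n. The names `make_bit_sampler` and `collision_rate` are illustrative, not taken from the slides.

```python
import random

def make_bit_sampler(n_bits, rng):
    """Draw one hash function from the family: it reads a single random position."""
    i = rng.randrange(n_bits)
    return lambda bits: bits[i]

def collision_rate(p, q, trials=10000, seed=0):
    """Estimate Pr[h(p) = h(q)] over randomly drawn hash functions."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        h = make_bit_sampler(len(p), rng)
        hits += h(p) == h(q)
    return hits / trials

p = [0, 1, 1, 0, 1, 0, 0, 1]
q = [0, 1, 1, 0, 1, 0, 1, 0]  # differs from p in the last two positions
rate = collision_rate(p, q)
print(rate)  # should be close to 1 - 2/8
```

Each drawn function is deterministic once its position is fixed; the randomness lies only in which member of the family is picked.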
LSH Family for Euclidean distance (2d)
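• Points are projected onto a random line divided into buckets of width $a$; two points at distance $d$ make angle $\theta$ with the line
• When $d \ll a$,
• Good chance of colliding
• But not certain
• But can guarantee,
• If $d \le a/2$,
• probability at least $1/2$ to have the same bucket
• If $d \ge 2a$,
• need $60^\circ \le \theta \le 90^\circ$ to have any chance of the same bucket, so probability at most $1/3$
• As LSH: $(a/2, 2a, 1/2, 1/3)$-sensitive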
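The bounds given in Mining of Massive Datasets for the random-line family with bucket width a (collision probability at least 1/2 when d ≤ a/2, at most 1/3 when d ≥ 2a) can be checked empirically. The following is a minimal Monte Carlo sketch in two dimensions; the random offset of the bucket grid and the function names are assumptions made for the simulation.

```python
import math
import random

def bucket(point, angle, offset, a):
    """Project a 2-D point onto a line at `angle` and return its bucket index."""
    proj = point[0] * math.cos(angle) + point[1] * math.sin(angle)
    return math.floor((proj + offset) / a)

def collision_rate(d, a, trials=20000, seed=0):
    """Estimate Pr[h(p) = h(q)] for two points at distance d."""
    rng = random.Random(seed)
    p, q = (0.0, 0.0), (d, 0.0)
    hits = 0
    for _ in range(trials):
        angle = rng.uniform(0.0, math.pi)   # random line direction
        offset = rng.uniform(0.0, a)        # random shift of the bucket grid
        if bucket(p, angle, offset, a) == bucket(q, angle, offset, a):
            hits += 1
    return hits / trials

a = 1.0
near = collision_rate(d=a / 2, a=a)   # expected to be at least 1/2
far = collision_rate(d=2 * a, a=a)    # expected to be at most 1/3
print(near, far)
```

The estimates only need to respect the two bounds; the bounds are not tight, so the observed rates can sit well inside them.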
How to define the projection?
• Scalar projection (Dot product)
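A minimal sketch of a scalar-projection hash, assuming the commonly used form h(v) = floor((x·v + b) / w), where x is a random Gaussian vector, w is the quantization bin width and b is a random offset in [0, w). The helper name `make_hash` and the parameter values are illustrative.

```python
import math
import random

def make_hash(dim, w, rng):
    """Return one hash function h(v) = floor((x . v + b) / w)."""
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # random projection direction
    b = rng.uniform(0.0, w)                        # random offset within one bin
    def h(v):
        dot = sum(xi * vi for xi, vi in zip(x, v))
        return math.floor((dot + b) / w)
    return h

rng = random.Random(42)
h = make_hash(dim=3, w=4.0, rng=rng)
p = [1.0, 2.0, 3.0]
print(h(p), h([1.01, 2.0, 3.0]))  # nearby points usually share a bin
```

Once x and b are drawn, h is deterministic: the same point always lands in the same bin.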
How to define the projection?
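• K-dot product: magnifies the gap, $(P_1/P_2)^k > P_1/P_2$, between the probabilities that points at different separations will fall into the same quantization bin ($P_1$: collision probability of a near pair, $P_2$: of a far pair)
• Perform k independent dot products
• Achieve success,
• if the query and the nearest neighbor are in the same bin in all k dot products
• Success probability = $P_1^k$, which decreases as we include more dot products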
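The effect of using k dot products can be sketched by concatenating k quantized projections into one bucket key and measuring how often a near pair and a far pair collide. This is an illustrative simulation under assumed settings (Gaussian projections, bin width 4, pairs at distance 1 and 4); none of the parameter values come from the slides.

```python
import math
import random

def make_g(dim, k, w, rng):
    """Concatenate k independent quantized dot products into one bucket key."""
    params = [([rng.gauss(0.0, 1.0) for _ in range(dim)], rng.uniform(0.0, w))
              for _ in range(k)]
    def g(v):
        return tuple(math.floor((sum(xi * vi for xi, vi in zip(x, v)) + b) / w)
                     for x, b in params)
    return g

def collision_rate(p, q, k, dim=8, w=4.0, trials=4000, seed=1):
    """Fraction of freshly drawn keys g for which p and q share a bucket."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        g = make_g(dim, k, w, rng)
        hits += g(p) == g(q)
    return hits / trials

origin = [0.0] * 8
near = [1.0] + [0.0] * 7   # distance 1 from origin
far = [4.0] + [0.0] * 7    # distance 4 from origin
for k in (1, 4):
    print(k, collision_rate(origin, near, k), collision_rate(origin, far, k))
```

With larger k the near pair collides less often, but the far pair's collision rate falls much faster, which is the trade-off the bullets describe.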
Multiple-projections
• L independent projections
• True near neighbor will be unlikely to be unlucky in all the projections
• By increasing L,
• we can find the true nearest neighbor with arbitrarily high probability
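A minimal sketch of an index with L independent projections, each keyed by k quantized dot products. A query collects the candidates that share a bucket in any of the L tables and returns the closest one. The class `LSHIndex`, its methods and the parameter values are hypothetical names chosen for illustration.

```python
import math
import random
from collections import defaultdict

class LSHIndex:
    def __init__(self, dim, k, L, w, seed=0):
        rng = random.Random(seed)
        self.w = w
        self.points = []
        self.tables = []  # one (projection parameters, buckets) pair per table
        for _ in range(L):
            params = [([rng.gauss(0.0, 1.0) for _ in range(dim)], rng.uniform(0.0, w))
                      for _ in range(k)]
            self.tables.append((params, defaultdict(list)))

    def _key(self, params, v):
        return tuple(math.floor((sum(xi * vi for xi, vi in zip(x, v)) + b) / self.w)
                     for x, b in params)

    def add(self, v):
        idx = len(self.points)
        self.points.append(v)
        for params, buckets in self.tables:
            buckets[self._key(params, v)].append(idx)

    def query(self, q):
        """Return the index of the closest candidate, or None if no bucket matches."""
        candidates = set()
        for params, buckets in self.tables:
            candidates.update(buckets.get(self._key(params, q), []))
        if not candidates:
            return None
        return min(candidates, key=lambda i: math.dist(q, self.points[i]))

rng = random.Random(1)
data = [[rng.uniform(-10.0, 10.0) for _ in range(8)] for _ in range(200)]
index = LSHIndex(dim=8, k=4, L=10, w=4.0)
for v in data:
    index.add(v)
query = [x + 0.01 for x in data[7]]
print(index.query(query))
```

Only the candidates found in matching buckets are compared by exact distance, so a true neighbor is missed only if it fails to collide in all L tables.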
Accuracy
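• Two close points p and q,
• Separated by $u = \lVert p - q \rVert$
• Probability of collision $P_H(u)$,
$$P_H(u) = \Pr[h(p) = h(q)] = \int_0^w \frac{1}{u} f_s\!\left(\frac{t}{u}\right)\left(1 - \frac{t}{w}\right) dt$$
• $f_s$ - probability density function of H; $w$ - quantization bin width
• As distance u increases, $P_H(u)$ decreases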
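The collision probability for the quantized projection can be evaluated numerically as the integral of (1/u) f(t/u) (1 - t/w) over t from 0 to w. The sketch below assumes Gaussian projection vectors, so f is taken to be the density of the absolute value of a standard normal; the bin width w = 4 is an arbitrary choice.

```python
import math

def half_normal_pdf(x):
    """Density of |Z| for Z ~ N(0, 1), valid for x >= 0."""
    return 2.0 / math.sqrt(2.0 * math.pi) * math.exp(-0.5 * x * x)

def collision_probability(u, w, steps=10000):
    """Midpoint-rule estimate of the collision probability at distance u."""
    dt = w / steps
    total = 0.0
    for i in range(steps):
        t = (i + 0.5) * dt
        total += (1.0 / u) * half_normal_pdf(t / u) * (1.0 - t / w) * dt
    return total

for u in (0.5, 1.0, 2.0, 4.0):
    print(u, round(collision_probability(u, w=4.0), 4))
```

The printed values shrink as u grows, matching the statement that the collision probability decreases with distance.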
Time complexity
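• For a query point q,
• To find the near neighbor: $T_g + T_c$
• Calculate & hash the projections ($T_g$): $O(DkL)$
• Search the bucket for collisions ($T_c$): $O(DLN_c)$
• D: dimension, k: dot products per projection, L: projections
• where $N_c = \sum_{q'} P_H^k(\lVert q - q' \rVert)$, summed over the points $q'$ in the dataset, is the expected number of collisions for a single projection, and $P_H(u)$ is the collision probability of one dot product at distance u
• Analyze
• $T_g$ increases as k & L increase
• $T_c$ decreases as k increases since $P_H^k < P_H^{k-1}$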
How many projections(L)?
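• For query point p & neighbor q,
• For single projection,
• Success probability of collisions: $\ge P_1^k$, where $P_1$ is the collision probability of one dot product for a near pair
• For L projections,
• Failure probability of collisions: $\le (1 - P_1^k)^L$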
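If a single projection of k dot products succeeds with probability P1^k, all L independent projections fail with probability (1 - P1^k)^L. Requiring this to be at most a target failure rate δ (a name assumed here) gives L ≥ log δ / log(1 - P1^k). A minimal sketch, with `tables_needed` as an illustrative helper:

```python
import math

def tables_needed(p1, k, delta):
    """Smallest L with (1 - p1**k)**L <= delta."""
    success = p1 ** k                      # one projection succeeds in all k dot products
    return math.ceil(math.log(delta) / math.log(1.0 - success))

L = tables_needed(p1=0.8, k=4, delta=0.05)
print(L, (1.0 - 0.8 ** 4) ** L)  # chosen L and the resulting failure bound
```

Raising k lowers the per-projection success probability, so the required L grows accordingly.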
LSH in MAXDIVREL Diversity
      #1  #2  #3  …  #k   (dot product)
1      1   0   0  …   1
2      0   1   1  …   1
w      0   0   1  …   0

      #1  #2  #3  …  #k   (dot product)
1      1   1   0  …   1
2      1   0   1  …   1
w      0   1   1  …   0

      #1  #2  #3  …  #k   (dot product)
1      1   0   1  …   0
2      0   0   1  …   0
w      0   1   0  …   0

      #1  #2  #3  …  #k   (dot product)
1      1   0   0  …   1
2      0   1   1  …   1
w      0   0   1  …   0
REFERENCES
[1] A. Rajaraman and J. Ullman, “Chapter Three of ‘Mining of Massive Datasets,’” pp. 72–130.
[2] M. Slaney and M. Casey, “Lecture Note: LSH,” 2008.
[3] N. Sundaram, A. Turmukhametova, N. Satish, T. Mostak, P. Indyk, S. Madden, and P. Dubey, “Streaming similarity search over one billion tweets using parallel locality-sensitive hashing,” Proc. VLDB Endow., vol. 6, no. 14, pp. 1930–1941, Sep. 2013.