finding similar items

38
Finding Similar Items

Upload: marshall-maddox

Post on 04-Jan-2016

25 views

Category:

Documents


1 download

DESCRIPTION

Finding Similar Items. Set Similarity. Problem : Find similar sets. Motivation : Many things can be modeled/represented as sets Applications : Face Recognition Recommender Systems (Collaborative Filtering) Document Similarity/Clustering Similar (but infrequent) Items. - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Finding Similar Items

Finding Similar Items

Page 2: Finding Similar Items

Set Similarity• Problem: Find similar sets.

• Motivation: Many things can be modeled/represented as sets

• Applications:– Face Recognition

– Recommender Systems (Collaborative Filtering)

– Document Similarity/Clustering

– Similar (but infrequent) Items

Page 3: Finding Similar Items

Application: Face Recognition• We have a database of (say) 1 million face images.• We want to find the most similar images in the database.• Represent faces by (relatively) invariant values, e.g.,

ratio of nose width to eye width.

• Many-one problem : given a new face, see if it is close to any of the 1 million old faces.

• Many-Many problem : which pairs of the 1 million faces are similar.

Page 4: Finding Similar Items

Simple Solution• Represent each face by a vector of 1000 values and score

the comparisons.• Sort-of OK for many-one problem.• Out of the question for the many-many problem

(106*106*1000/2 numerical comparisons).• We can do better!

Page 5: Finding Similar Items

Applications - Collaborative Filtering (1)• Represent a customer, e.g., of Netflix, by the set of movies

they rented.– Similar customers have a relatively large fraction of their

choices in common.

• Motivation: A customer can be pitched an item that a similar customer bought, but which they did not buy.

Page 6: Finding Similar Items

Applications - Collaborative Filtering (2)• Products are similar if they are bought by many of the

same customers.– E.g., movies of the same genre are typically rented by

similar sets of Netflix customers.

• Motivation: A customer can be pitched an item that is a similar to an item that he/she already bought.

• …isn’t this mining frequent pairs…• Well, not exactly…

Page 7: Finding Similar Items

Applications: Similar ItemsProblem • Search for pairs of items that appear together a large

fraction of the times that either appears, – even if neither item appears in very many baskets.

– Such items are considered "similar"

– E.g. caviar and crackers

Modeling• Each item is a set: the set of baskets in which it appears.

– Thus, the problem becomes: Find similar sets!

Page 8: Finding Similar Items

Applications: Similar Documents (1)• Given a body of documents, e.g., Web pages, find pairs of

docs that have a lot of text in common, e.g.:– Mirror sites, or approximate mirrors.

– Plagiarism, including large quotations.

– Repetitions of news articles at news sites.

• How do you represent a document so it is easy to compare with others?– Special cases are easy, e.g., identical documents, or one

document contained verbatim in another.

– General case, where many small pieces of one doc appear out of order in another, is hard.

Page 9: Finding Similar Items

Applications: Similar Documents (2)• Represent doc by its set of shingles (or k -grams).• A k-shingle (or k-gram) for a document is a sequence of k

characters that appears in the document.

Example. • k=2; doc = abcab. • Set of 2-shingles = {ab, bc, ca}.

• At that point, doc problem becomes finding similar sets.

Page 10: Finding Similar Items

The Jaccard Measure of Similarity• The similarity of sets S and T is the ratio of the

sizes of the intersection and union of S and T.– Sim (C1,C2) = |ST|/|ST| = Jaccard similarity.

• Disjoint sets have a similarity of 0, and the similarity of a set with itself is 1.

• Another example: similarity of sets {1, 2, 3} and {1, 3, 4, 5} is – 2/5.

Page 11: Finding Similar Items

Solutions1. Divide-Compute-Merge (DCM) uses external sorting,

merging.

2. Locality-Sensitive Hashing (LSH) can be carried out in main memory, but admits some false negatives.

Page 12: Finding Similar Items

Divide-Compute-Merge

• Originally designed for “shingles” and docs.– Applicable for other set similarity problems.

• Based on the merge-sort paradigm.

Page 13: Finding Similar Items

doc1: s11,s12,…,s1kdoc2: s21,s22,…,s2k…

DCM Steps

s11,doc1s12,doc1…s1k,doc1s21,doc2…

Invertt1,doc11t1,doc12…t2,doc21t2,doc22…

sort onshingleId

doc11,doc12,1doc11,doc13,1…doc21,doc22,1…

Invert and pair

doc11,doc12,1doc11,doc12,1…doc11,doc13,1…

sort on<docId1,docId2>

doc11,doc12,2doc11,doc13,10…

Merge

Page 14: Finding Similar Items

DCM Summary1. Start with the pairs <shingleId, docId>.

2. Sort by shingleId.

3. In a sequential scan, generate triplets <docId1, docId2, 1> for pairs of docs that share a shingle.

4. Sort on <docId1, docId2>.

5. Merge triplets with common docIds to generate triplets of the form <docId1,docId2,count>.

6. Output document pairs with count > threshold.

Page 15: Finding Similar Items

Some Optimizations• “Invert and Pair” is the most expensive step.• Speed it up by eliminating very common shingles.

– “the”, “404 not found”, “<A HREF”, etc.

• Also, eliminate exact-duplicate docs first.

Page 16: Finding Similar Items

Locality-Sensitive Hashing• First represent sets as bit vectors, and organize them as

columns of a signature matrix M. • Big idea: hash columns of matrix M several times.• Arrange that (only) similar columns are likely to hash to the

same bucket.• Candidate pairs are those that hash at least once to the

same bucket.

Page 17: Finding Similar Items

LSH Roadmap

Sets

Similarcustomers

Similarproducts

Signatures

Bucketscontaining

mostlysimilar

items (sets)

Technique:Minhashing

Documents

Technique:Shingling

Technique:Locality-SensitiveHashing

Page 18: Finding Similar Items

Minhashing• Suppose that the elements of each set are chosen from a

"universal" set of n elements e0, el,...,en-1. • Pick a random permutation of the n elements. • Minhash value of a set S is the first element, in the

permuted order, that is a member of S.

Example • Suppose the universal set is {1, 2, 3, 4, 5} and the

permutation we choose is (3,5,4,2,1). – Set {2, 3, 5} hashes to

3. – Set {1, 2, 5} hashes to

5. – Set {1,2} hashes to

2.

Page 19: Finding Similar Items

Minhash signatures• Compute signatures for the sets by using m permutations.

– Typically, m would be about 100.

• Signature of a set S is the list of the minhash values of S, for each of the m permutations, in order.

Example • Universal set is {1,2,3,4,5}, m = 3, and the permutations

are: 1= (1,2,3,4,5),

2= (5,4,3,2,1),

3= (3,5,1,4,2).

• Signature of S = {2,3,4} is – (2,4,3).

Page 20: Finding Similar Items

Minhashing and Jaccard DistanceSurprising relationship

If we choose a permutation at random, the probability that it will produce the same minhash values for two sets is the same as the Jaccard similarity of those sets.

Why?

(see the next slides)

• Thus, if we have the signatures of two sets S and T, we can estimate the Jaccard similarity of S and T by the fraction of corresponding minhash values for the two sets that agree.

Page 21: Finding Similar Items

Four Types of Rows• Given columns C1 and C2, each representing a set,

rows may be classified as:

C1 C2

a 1 1b 1 0c 0 1d 0 0

• Also, a = # rows of type a , etc.• Note Sim (C1, C2) = a /(a +b +c ).

Page 22: Finding Similar Items

Minhashing–put differently• Imagine the rows permuted randomly.

• Define “hash” function h (C ) = the number of the first (in the permuted order) row in which column C has 1.

• Use several independent hash functions to create a signature.

Page 23: Finding Similar Items

Minhashing and Jaccard Distance (2)

• The probability (over all permutations of the rows) that h (C1) = h (C2) is the same as Sim (C1, C2).

• Both are a /(a +b +c )!

• Why?– Look down columns C1 and C2 until we see a 1.

– If it’s a type-a row, then h (C1) = h (C2). If a type-b or type-c row, then not.

– Type-d row doesn’t play any role for either of them.

Page 24: Finding Similar Items

Implementing Minhashing• Generating a permutation of all the universe of objects is

infeasible.

• Rather, simulate the choice of a random permutation by picking a random hash function h. – Pretend that the permutation that h represents places

element e in position h(e).

• Several elements might wind up in the same position. • As long as number of buckets is large, we can break ties as we

like, and the simulated permutations will be sufficiently random that the relationship between signatures and similarity still holds.

Page 25: Finding Similar Items

Algorithm for minhashing• To compute the minhash value for a set S = {a1, a2,. . . ,an}

using a hash function h, we can execute:

V =infinity;

FOR i := 1 TO n DO

IF h(ai) < V THEN V = h(ai);

• As a result, V will be set to the hash value of the element of S that has the smallest hash value.

Page 26: Finding Similar Items

Algorithm for set signature• If we have m hash functions h1, h2, . .. , hm, then we can

compute m minhash values in parallel, as we process each member of S.

FOR j := 1 TO m DO

Vj := infinity;

FOR i := 1 TO n DO

FOR j := 1 TO m DO

IF hj(ai) < Vj THEN Vj = hj(ai)

Page 27: Finding Similar Items

Example

S = {1,3,4}T = {2,3,5}

h(x) = x mod 5g(x) = 2x+1 mod 5

h(1) = 1h(3) = 3h(4) = 4g(1) = 3g(3) = 2g(4) = 4

sig(S) = 1,3sig(T) = 5,2

h(2) = 2h(3) = 3h(5) = 0g(2) = 0g(3) = 2g(5) = 1

Page 28: Finding Similar Items

Locality-Sensitive Hashing of Signatures

• Organize signatures of the various sets as a matrix M, with a column for each set's signature and a row for each hash function.

• Big idea: hash columns of signature matrix M several times.• Arrange that (only) similar columns are likely to hash to the

same bucket.• Candidate pairs are those that hash at least once to the

same bucket.

Page 29: Finding Similar Items

Partition Into Bands

Matrix M

r rowsper band

b bands

Page 30: Finding Similar Items

Partition Into Bands• For each band,

hash its portion of each column to a hash table with k buckets.

• Candidate column pairs are those that hash to the same bucket for at least one band.

Matrix M

r rows b bands

Buckets

Page 31: Finding Similar Items

Analysis• Probability that the signatures agree on one row is

s (Jaccard similarity)

• Probability that they agree on all r rows of a given band is s^r.

• Probability that they do not agree on all the rows of a band is 1 - s^r

• Probability that for none of the b bands do they agree in all rows of that band is

(1 - s^r)^b

• Probability that the signatures will agree in all rows of at least one band is

1 - (1 - s^r)^b

• This function is the probability that the signatures will be compared for similarity.

Page 32: Finding Similar Items

Example• Suppose 100,000 columns.• Signatures of 100 integers.• Therefore, signatures take 40Mb.• But 5,000,000,000 pairs of signatures take a while to

compare.• Choose 20 bands of 5 integers/band.

Page 33: Finding Similar Items

Suppose C1, C2 are 80% Similar• Probability C1, C2 agree on one particular band:

– (0.8)5 = 0.328.

• Probability C1, C2 do not agree on any of the 20 bands:

– (1-0.328)20 = .00035 .

– i.e., we miss about 1/3000th of the 80%-similar column pairs.

• The chance that we do find this pair of signatures together in at least one bucket is 1 - 0.00035,or 0.99965.

Page 34: Finding Similar Items

Suppose C1, C2 Only 40% Similar• Probability C1, C2 agree on one particular band:

– (0.4)5 = 0.01 .

• Probability C1, C2 do not agree on any of the 20 bands:

– (1-0.01)^20 .80

– i.e., we miss a lot...

• The chance that we do find this pair of signatures together in at least one bucket is 1 - 0.80,or 0.20 (i.e. only 20%).

Page 35: Finding Similar Items

Analysis of LSH – What We Want

Similarity s of two columns

Probabilityof sharinga bucket

t

No chanceif s < t

Probability= 1 if s > t

Page 36: Finding Similar Items

What One Row Gives You

Similarity s of two columns

Probabilityof sharinga bucket

t

Remember:probability ofequal hash-values= similarity

Page 37: Finding Similar Items

What b Bands of r Rows Gives You

Similarity s of two columns

Probabilityof sharinga bucket

t

s r

All rowsof a bandare equal

1 -

Some rowof a bandunequal

( )b

No bandsidentical

1 -

At leastone bandidentical

t ~ (1/b)1/r

Page 38: Finding Similar Items

LSH Summary

• Tune to get almost all pairs with similar signatures, but eliminate most pairs that do not have similar signatures.

• Check in main memory that candidate pairs really do have similar signatures.

• Optional: In another pass through data, check that the remaining candidate pairs really are similar columns.