![Page 1: Google News Personalization Big Data reading group November 12, 2007 Presented by Babu Pillai](https://reader036.vdocuments.site/reader036/viewer/2022081516/5697bff71a28abf838cbe7d0/html5/thumbnails/1.jpg)
Google News Personalization
Big Data reading group
November 12, 2007
Presented by Babu Pillai
![Page 2: Google News Personalization Big Data reading group November 12, 2007 Presented by Babu Pillai](https://reader036.vdocuments.site/reader036/viewer/2022081516/5697bff71a28abf838cbe7d0/html5/thumbnails/2.jpg)
Problem: finding stuff on the Internet
• Know what you want:
  – content-based filtering
  – search
• Don’t know:
  – browse
• How to handle: “Don’t know, but show me something interesting!”
![Page 3: Google News Personalization Big Data reading group November 12, 2007 Presented by Babu Pillai](https://reader036.vdocuments.site/reader036/viewer/2022081516/5697bff71a28abf838cbe7d0/html5/thumbnails/3.jpg)
Google News
• Top Stories
• Recommendations for registered users
• Based on user click history, community clicks
![Page 4: Google News Personalization Big Data reading group November 12, 2007 Presented by Babu Pillai](https://reader036.vdocuments.site/reader036/viewer/2022081516/5697bff71a28abf838cbe7d0/html5/thumbnails/4.jpg)
Problem Scale
• Lots of users (more is good)
  – Millions of clicks from millions of users
• Problem: high churn in item set
  – Several million items (clusters of news articles about the same story, as identified by GN) per month
  – Continuous addition, deletion
• Strict timing (few hundred ms)
• Existing systems not suitable
![Page 5: Google News Personalization Big Data reading group November 12, 2007 Presented by Babu Pillai](https://reader036.vdocuments.site/reader036/viewer/2022081516/5697bff71a28abf838cbe7d0/html5/thumbnails/5.jpg)
Memory-based Ratings
• General form:
where r is the rating of item s_k for user u_a, and w(u_a, u_i) is the similarity between users u_a and u_i
• Problem: scalability, even when similarity is computed offline
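The general form on this slide can be written out as below. This is a reconstruction from the symbol definitions above and the usual memory-based formulation for click data; the indicator I_(u_i, s_k), equal to 1 if user u_i clicked item s_k and 0 otherwise, is assumed notation.

```latex
r_{u_a, s_k} = \sum_{i \neq a} I_{(u_i, s_k)} \, w(u_a, u_i)
```

The sum runs over every other user, which is where the scalability problem comes from: each rating touches the whole user base, even if w is precomputed.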
![Page 6: Google News Personalization Big Data reading group November 12, 2007 Presented by Babu Pillai](https://reader036.vdocuments.site/reader036/viewer/2022081516/5697bff71a28abf838cbe7d0/html5/thumbnails/6.jpg)
Model-based techniques
• Clustering / segmentation, e.g. based on interests
• Bayesian models, Markov decision processes, …
  – All are computationally expensive
![Page 7: Google News Personalization Big Data reading group November 12, 2007 Presented by Babu Pillai](https://reader036.vdocuments.site/reader036/viewer/2022081516/5697bff71a28abf838cbe7d0/html5/thumbnails/7.jpg)
What’s in this paper?
• Investigate 2 different ways to cluster users: MinHash, and PLSI
• Implement both on MapReduce
![Page 8: Google News Personalization Big Data reading group November 12, 2007 Presented by Babu Pillai](https://reader036.vdocuments.site/reader036/viewer/2022081516/5697bff71a28abf838cbe7d0/html5/thumbnails/8.jpg)
Google News Rating Model
• 1 click = 1 positive vote
• Noisier than a 1-5 rating scale (Netflix)
• No explicit negatives
• Why might it work? Partly due to the fairly significant article clips provided, so a user that clicks is likely genuinely interested
![Page 9: Google News Personalization Big Data reading group November 12, 2007 Presented by Babu Pillai](https://reader036.vdocuments.site/reader036/viewer/2022081516/5697bff71a28abf838cbe7d0/html5/thumbnails/9.jpg)
Design guidelines for a scalable rating system
• Associate users into clusters of similar users (based on prior clicks, offline)
• Users can belong to multiple clusters
• Generate rating using much smaller sets of user clusters, rather than all users:
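A plausible form of the cluster-based rating, written as a sketch. The notation is assumed: c_i ranges over the clusters that user u_a belongs to, w(u_a, c_i) is the user's fractional membership in that cluster, and I_(u_j, s_k) is 1 if user u_j clicked item s_k and 0 otherwise.

```latex
r_{u_a, s_k} \propto \sum_{c_i \,:\, u_a \in c_i} w(u_a, c_i) \sum_{u_j \,:\, u_j \in c_i} I_{(u_j, s_k)}
```

The inner sum is a per-cluster click count that can be maintained incrementally, so the work per rating depends on the number of clusters a user belongs to rather than on the number of users.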
![Page 10: Google News Personalization Big Data reading group November 12, 2007 Presented by Babu Pillai](https://reader036.vdocuments.site/reader036/viewer/2022081516/5697bff71a28abf838cbe7d0/html5/thumbnails/10.jpg)
Technique 1: MinHash
• Probabilistically assign users to clusters based on click history
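• Use Jaccard coefficient: S(u_i, u_j) = |C_i ∩ C_j| / |C_i ∪ C_j|, where C_i is the set of items clicked by user u_i; the distance 1 − S(u_i, u_j) is a metric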
• Using this metric is computationally expensive, not feasible even offline
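As a small illustration of the similarity measure, the sketch below computes the Jaccard coefficient between two click histories represented as Python sets. The function name and the sample histories are made up for the example.

```python
def jaccard(clicks_a: set, clicks_b: set) -> float:
    """Jaccard coefficient: size of intersection over size of union."""
    union = clicks_a | clicks_b
    if not union:
        # Convention chosen for this sketch: two empty histories have similarity 0.
        return 0.0
    return len(clicks_a & clicks_b) / len(union)


alice = {"s1", "s2", "s3"}
bob = {"s2", "s3", "s4"}

similarity = jaccard(alice, bob)  # 2 shared items out of 4 distinct items
distance = 1.0 - similarity
```

Computing this for every pair of users is quadratic in the number of users, which is why the direct approach is ruled out even offline.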
![Page 11: Google News Personalization Big Data reading group November 12, 2007 Presented by Babu Pillai](https://reader036.vdocuments.site/reader036/viewer/2022081516/5697bff71a28abf838cbe7d0/html5/thumbnails/11.jpg)
MinHash as a form of Locality Sensitive Hashing
• Basic idea: assign a hash value to each user based on click history
• How: randomly permute set of all items; assign id of first item in this order that appears in the user’s click history as the hash value for the user
• Probability that 2 users have the same hash is equal to the Jaccard coefficient
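The permutation scheme above can be sketched directly, along with an empirical check that the collision rate approaches the Jaccard coefficient. The item ids, user histories, seed, and trial count are all illustrative assumptions.

```python
import random


def random_rank(items, rng):
    """Return a dict mapping each item to its position in a random permutation."""
    order = list(items)
    rng.shuffle(order)
    return {item: position for position, item in enumerate(order)}


def minhash(click_history, rank):
    """The clicked item that appears first in the permuted order."""
    return min(click_history, key=rank.__getitem__)


items = [f"s{i}" for i in range(20)]
alice = {"s1", "s2", "s3", "s4"}
bob = {"s3", "s4", "s5", "s6"}  # Jaccard with alice: 2 shared / 6 distinct

rng = random.Random(0)
trials = 20000
matches = 0
for _ in range(trials):
    rank = random_rank(items, rng)
    if minhash(alice, rank) == minhash(bob, rank):
        matches += 1

collision_rate = matches / trials  # should be close to 1/3
```

The two hashes agree exactly when the first item of the union under the permutation lies in the intersection, which happens with probability |intersection| / |union|.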
![Page 12: Google News Personalization Big Data reading group November 12, 2007 Presented by Babu Pillai](https://reader036.vdocuments.site/reader036/viewer/2022081516/5697bff71a28abf838cbe7d0/html5/thumbnails/12.jpg)
Using MinHash for clusters
• Concatenate p>1 such hashes as cluster id for increased precision
• Apply q>1 in parallel (users belong to q clusters) to improve recall
• Don’t actually maintain p*q permutations: hash item id with random seed to get proxy for permutation index, for p*q different seeds
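A minimal sketch of the seeded-hash trick, assuming SHA-256 as the hash function and a string format for cluster ids; neither choice comes from the slides. Each of the p*q seeds stands in for one permutation, and the group index is included in the id (also an assumption) so that clusters from different groups stay distinct.

```python
import hashlib


def seeded_hash(item_id: str, seed: int) -> int:
    """Proxy for the item's position under the permutation numbered `seed`."""
    digest = hashlib.sha256(f"{seed}:{item_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_with_seed(click_history, seed: int) -> str:
    """Clicked item with the smallest seeded hash value."""
    return min(click_history, key=lambda item: seeded_hash(item, seed))


def cluster_ids(click_history, p: int, q: int):
    """Return q cluster ids, each the concatenation of p MinHash values."""
    ids = []
    for group in range(q):
        keys = [minhash_with_seed(click_history, group * p + k) for k in range(p)]
        ids.append(f"{group}|" + "|".join(keys))
    return ids


ids_a = cluster_ids({"s1", "s2", "s3"}, p=2, q=3)
ids_b = cluster_ids({"s3", "s2", "s1"}, p=2, q=3)  # same history, same ids
```

Larger p makes two users share a cluster id only if all p hashes agree (higher precision); larger q gives them more chances to share at least one cluster (higher recall).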
![Page 13: Google News Personalization Big Data reading group November 12, 2007 Presented by Babu Pillai](https://reader036.vdocuments.site/reader036/viewer/2022081516/5697bff71a28abf838cbe7d0/html5/thumbnails/13.jpg)
MinHash on MapReduce
• Generate p x q hashes for each user based on click history; generate q p-long cluster ids by concatenation
• Map using cluster ids as keys
• Reduce to form membership lists for each cluster id
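The map and reduce steps can be mimicked in a single process as below. This is a toy simulation, not a real MapReduce job: the function names are hypothetical, the cluster ids are assumed to be precomputed per user, and the in-memory dictionary stands in for the shuffle phase.

```python
from collections import defaultdict


def map_user(user_id, user_cluster_ids):
    """Map: emit (cluster id, user id) for each cluster the user falls into."""
    for cluster_id in user_cluster_ids:
        yield cluster_id, user_id


def reduce_cluster(cluster_id, user_ids):
    """Reduce: collect the membership list for one cluster id."""
    return cluster_id, sorted(user_ids)


def run_job(user_to_cluster_ids):
    """Simulate map, shuffle (group by key), and reduce in memory."""
    shuffled = defaultdict(list)
    for user_id, ids in user_to_cluster_ids.items():
        for key, value in map_user(user_id, ids):
            shuffled[key].append(value)
    return dict(reduce_cluster(key, values) for key, values in shuffled.items())


memberships = run_job({
    "u1": ["0|s1|s2", "1|s3|s1"],
    "u2": ["0|s1|s2", "1|s4|s4"],
    "u3": ["0|s5|s6", "1|s3|s1"],
})
```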
![Page 14: Google News Personalization Big Data reading group November 12, 2007 Presented by Babu Pillai](https://reader036.vdocuments.site/reader036/viewer/2022081516/5697bff71a28abf838cbe7d0/html5/thumbnails/14.jpg)
Technique 2: PLSI clustering
• Probabilistic Latent Semantic Indexing
• Main idea: hidden state z that correlates users and items
• Generate this clustering from training set based on the EM algorithm given by Hofmann04
  – Iterative technique, generates new probability estimates based on previous estimates
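For reference, the model and its EM updates are commonly written as follows. The notation is assumed (u for users, s for items, z for the hidden state, hats for the previous iteration's estimates), and all sums over u and s run over observed click pairs.

```latex
% Model: the hidden state z mediates between users and items
p(s \mid u) = \sum_{z} p(z \mid u)\, p(s \mid z)

% E-step: posterior of z for each observed click (u, s)
q^{*}(z; u, s) = \frac{\hat{p}(s \mid z)\, \hat{p}(z \mid u)}
                      {\sum_{z'} \hat{p}(s \mid z')\, \hat{p}(z' \mid u)}

% M-step: re-estimate both conditional distributions
p(s \mid z) = \frac{\sum_{u} q^{*}(z; u, s)}{\sum_{s'} \sum_{u} q^{*}(z; u, s')}
\qquad
p(z \mid u) = \frac{\sum_{s} q^{*}(z; u, s)}{\sum_{z'} \sum_{s} q^{*}(z'; u, s)}
```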
![Page 15: Google News Personalization Big Data reading group November 12, 2007 Presented by Babu Pillai](https://reader036.vdocuments.site/reader036/viewer/2022081516/5697bff71a28abf838cbe7d0/html5/thumbnails/15.jpg)
PLSI as MapReduce
• Q* can be independently computed for each (u,s), given prior N(z,s), N(z), p(z|u): map to RxK machines (R, K partitions for u, s respectively)
• Reduce is simply addition
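A single-process sketch of one EM iteration in this map/reduce shape, under the assumption that p(s|z) is represented as N(z,s)/N(z) with every N(z) positive. The key layout, function names, and sample numbers are hypothetical, and the partitioning across R x K machines is not modeled.

```python
from collections import defaultdict


def map_click(user, story, n_zs, n_z, p_z_given_u):
    """Map: emit q*(z; u, s) for one click, keyed by the sums it contributes to."""
    weights = {
        z: (n_zs.get((z, story), 0.0) / n_z[z]) * p_z_given_u[user][z]
        for z in n_z
    }
    total = sum(weights.values())
    for z, weight in weights.items():
        # Fall back to a uniform posterior if every weight is zero.
        q = weight / total if total > 0 else 1.0 / len(weights)
        yield ("N(z,s)", z, story), q
        yield ("N(z)", z), q
        yield ("p(z|u)", user, z), q  # unnormalized; divide by the user's click count


def reduce_sum(emitted):
    """Reduce: add up all values that share a key."""
    sums = defaultdict(float)
    for key, value in emitted:
        sums[key] += value
    return dict(sums)


clicks = [("u1", "s1"), ("u1", "s2"), ("u2", "s2")]
n_z = {"z1": 1.5, "z2": 1.5}
n_zs = {("z1", "s1"): 1.0, ("z1", "s2"): 0.5, ("z2", "s2"): 1.5}
p_z_given_u = {"u1": {"z1": 0.5, "z2": 0.5}, "u2": {"z1": 0.2, "z2": 0.8}}

emitted = [pair for u, s in clicks for pair in map_click(u, s, n_zs, n_z, p_z_given_u)]
sums = reduce_sum(emitted)
```

Because each click's q* values sum to 1 across z, the new N(z) totals add up to the number of clicks, which is a handy sanity check.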
![Page 16: Google News Personalization Big Data reading group November 12, 2007 Presented by Babu Pillai](https://reader036.vdocuments.site/reader036/viewer/2022081516/5697bff71a28abf838cbe7d0/html5/thumbnails/16.jpg)
PLSI in a dynamic environment
• Treat Z as user clusters
• On each click, update p(s|z) for all clusters the user belongs to
• This approximates PLSI, but is updated dynamically as additional items are added
• Does not allow additions of users
![Page 17: Google News Personalization Big Data reading group November 12, 2007 Presented by Babu Pillai](https://reader036.vdocuments.site/reader036/viewer/2022081516/5697bff71a28abf838cbe7d0/html5/thumbnails/17.jpg)
Cluster-based recommendation
• For each cluster, maintain number of clicks, decayed by time, for each item visited by a member
• For a candidate item, look up the user’s clusters, add up age-discounted visitation counts, normalized by total clicks
• Do this using both MinHash and PLSI clustering
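The scoring step might look like the sketch below. Several details are assumptions rather than anything stated on the slide: exponential decay with a one-hour half-life, normalization by each cluster's own decayed click total, and equal weighting of the user's clusters. The class and method names are hypothetical.

```python
from collections import defaultdict

HALF_LIFE_S = 3600.0  # hypothetical decay half-life, in seconds


def decay(age_s: float) -> float:
    """Weight of a click that happened age_s seconds ago."""
    return 0.5 ** (age_s / HALF_LIFE_S)


class ClusterClickStats:
    def __init__(self):
        self._clicks = defaultdict(list)  # cluster id -> [(item, timestamp)]

    def record_click(self, user_clusters, item, timestamp):
        """Credit a click to every cluster the clicking user belongs to."""
        for cluster in user_clusters:
            self._clicks[cluster].append((item, timestamp))

    def score(self, user_clusters, item, now):
        """Sum over the user's clusters of the item's share of decayed clicks."""
        total_score = 0.0
        for cluster in user_clusters:
            item_weight = 0.0
            all_weight = 0.0
            for clicked_item, timestamp in self._clicks.get(cluster, []):
                weight = decay(now - timestamp)
                all_weight += weight
                if clicked_item == item:
                    item_weight += weight
            if all_weight > 0:
                total_score += item_weight / all_weight
        return total_score


stats = ClusterClickStats()
stats.record_click(["c1"], "a", timestamp=0.0)
stats.record_click(["c1"], "b", timestamp=3600.0)
score_a = stats.score(["c1"], "a", now=3600.0)  # older click, weight 0.5
score_b = stats.score(["c1"], "b", now=3600.0)  # fresh click, weight 1.0
```

A production version would keep running decayed counters per (cluster, item) instead of rescanning raw clicks; the list is kept here only to make the arithmetic visible.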
![Page 18: Google News Personalization Big Data reading group November 12, 2007 Presented by Babu Pillai](https://reader036.vdocuments.site/reader036/viewer/2022081516/5697bff71a28abf838cbe7d0/html5/thumbnails/18.jpg)
One more technique: Covisitation
• Memory-based technique
• Create adjacency matrix between all pairs of items (can be directed)
• Increment corresponding count if one item visited soon after another
• Recommendation: for candidate item j, sum of all counts from i to j for all items i in recent click history of user, normalized appropriately
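A compact sketch of covisitation counting and scoring. The time window, the directed counts, and the normalization (each source item's count to the candidate divided by that item's total outgoing count) are assumptions filling in "soon after" and "normalized appropriately"; the class and parameter names are hypothetical.

```python
from collections import defaultdict


class Covisitation:
    def __init__(self, window_s: float = 3 * 3600.0):
        self.window_s = window_s  # how recent an earlier click must be to count
        self.counts = defaultdict(lambda: defaultdict(float))  # counts[i][j]: j after i
        self.history = defaultdict(list)  # user -> [(item, timestamp)]

    def record_click(self, user, item, timestamp):
        """Increment i -> item for each item i the user clicked within the window."""
        for earlier_item, earlier_ts in self.history[user]:
            if earlier_item != item and timestamp - earlier_ts <= self.window_s:
                self.counts[earlier_item][item] += 1.0
        self.history[user].append((item, timestamp))

    def score(self, user, candidate, recent: int = 10):
        """Sum normalized counts from the user's recent items to the candidate."""
        total = 0.0
        for item, _ in self.history[user][-recent:]:
            outgoing = sum(self.counts[item].values())
            if outgoing > 0:
                total += self.counts[item].get(candidate, 0.0) / outgoing
        return total


cv = Covisitation()
cv.record_click("u1", "a", 0)
cv.record_click("u1", "b", 10)
cv.record_click("u2", "a", 0)
cv.record_click("u2", "c", 20)
cv.record_click("u2", "b", 30)
cv.record_click("u3", "a", 100)

score_b = cv.score("u3", "b")  # a -> b seen twice out of three transitions from a
score_c = cv.score("u3", "c")  # a -> c seen once out of three
```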
![Page 19: Google News Personalization Big Data reading group November 12, 2007 Presented by Babu Pillai](https://reader036.vdocuments.site/reader036/viewer/2022081516/5697bff71a28abf838cbe7d0/html5/thumbnails/19.jpg)
Whole System
• Offline clustering
• Online click history update, cluster item stats update, covisitation update
![Page 20: Google News Personalization Big Data reading group November 12, 2007 Presented by Babu Pillai](https://reader036.vdocuments.site/reader036/viewer/2022081516/5697bff71a28abf838cbe7d0/html5/thumbnails/20.jpg)
Results
Generally around 30-50% better than popularity-based recommendations
![Page 21: Google News Personalization Big Data reading group November 12, 2007 Presented by Babu Pillai](https://reader036.vdocuments.site/reader036/viewer/2022081516/5697bff71a28abf838cbe7d0/html5/thumbnails/21.jpg)
Techniques don’t work well together, though
![Page 22: Google News Personalization Big Data reading group November 12, 2007 Presented by Babu Pillai](https://reader036.vdocuments.site/reader036/viewer/2022081516/5697bff71a28abf838cbe7d0/html5/thumbnails/22.jpg)
Discussion
• Covisitation appears to work as well as clustering
• Operational details missing: how big are cluster memberships, etc.
• All of the clustering is done offline