CS 124/LINGUIST 180 From Languages to Information Dan Jurafsky Stanford University Recommender Systems & Collaborative Filtering Slides adapted from Jure Leskovec


Page 1: CS 124/LINGUIST 180 From Languages to Information · 2019-02-19 · 1.Content-based 2.Collaborative Filtering 3.Latent factor based CS246! 2/19/19 13 This lecture! SLIDES ADAPTED

CS 124/LINGUIST 180: From Languages to Information

Dan Jurafsky, Stanford University

Recommender Systems & Collaborative Filtering

Slides adapted from Jure Leskovec

Page 2:

Recommender Systems

Customer X
◦ Buys CD of Mozart
◦ Buys CD of Haydn

Customer Y
◦ Does search on Mozart
◦ Recommender system suggests Haydn from data collected about customer X

Jure Leskovec, Stanford C246: Mining Massive Datasets

Page 3:

Recommendations


Examples of items: products, web sites, blogs, news items, …

Users find items via search or via recommendations.

Page 4:

From Scarcity to Abundance
Shelf space is a scarce commodity for traditional retailers
◦ Also: TV networks, movie theaters, …

The web enables near-zero-cost dissemination of information about products
◦ From scarcity to abundance

Page 5:

The Long Tail

Source: Chris Anderson (2004)


Page 6:

More choice requires:

Recommendation engines!

Page 7:

We all need to understand how they work!

1. How Into Thin Air made Touching the Void a bestseller:

http://www.wired.com/wired/archive/12.10/tail.html

2. Societal problems in the news:


https://www.theguardian.com/technology/2018/feb/02/how-youtubes-algorithm-distorts-truth

Page 8:

Types of Recommendations

Editorial and hand curated
◦ List of favorites
◦ Lists of “essential” items

Simple aggregates
◦ Top 10, Most Popular, Recent Uploads

Tailored to individual users (today's class)
◦ Amazon, Netflix, …

Page 9:

Formal Model
X = set of customers
S = set of items

Utility function u: X × S → R
◦ R = set of ratings
◦ R is a totally ordered set
◦ e.g., 0-5 stars, real number in [0,1]

Page 10:

Utility Matrix

         Avatar   LOTR   Matrix   Pirates
Alice      1               0.2
Bob                0.5               0.3
Carol     0.2               1
David              0.4
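Since most cells of a utility matrix are blank, it is typically stored sparsely. A minimal sketch in Python, using a dict of dicts keyed by user (the values mirror the toy matrix above; entry placement was partly reconstructed from the extraction):

```python
# Sparse utility matrix: store only the known ratings.
utility = {
    "Alice": {"Avatar": 1.0, "Matrix": 0.2},
    "Bob":   {"LOTR": 0.5, "Pirates": 0.3},
    "Carol": {"Avatar": 0.2, "Matrix": 1.0},
    "David": {"LOTR": 0.4},
}

def rating(user, item):
    """Return the known rating u(user, item), or None for a blank cell."""
    return utility.get(user, {}).get(item)
```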

Page 11:

Key Problems
(1) Gathering “known” ratings for the matrix
◦ How to collect the data in the utility matrix

(2) Extrapolating unknown ratings from known ones
◦ Mainly interested in high unknown ratings
◦ We are not interested in knowing what you don’t like, but what you like

(3) Evaluating extrapolation methods
◦ How to measure success/performance of recommendation methods

Page 12:

(1) Gathering Ratings

Explicit
◦ Ask people to rate items
◦ Doesn’t work well in practice – people can’t be bothered
◦ Crowdsourcing: pay people to label items

Implicit
◦ Learn ratings from user actions
◦ E.g., purchase implies high rating

Page 13:

(2) Extrapolating Utilities

Key problem: the utility matrix U is sparse
◦ Most people have not rated most items
◦ Cold start:
  ◦ New items have no ratings
  ◦ New users have no history

Three approaches to recommender systems:
1. Content-based (this lecture!)
2. Collaborative filtering (this lecture!)
3. Latent factor based (CS246!)

Page 14:

Content-based Recommender Systems


Page 15:

Content-based Recommendations

Main idea: Recommend to customer x items similar to previous items rated highly by x

Examples:
Movie recommendations
◦ Recommend movies with the same actor(s), director, genre, …
Websites, blogs, news
◦ Recommend other sites with “similar” content

Page 16:

Plan of Action

[Diagram: from the items a user likes, build item profiles (the toy example uses features like red, circles, triangles); aggregate them into a user profile; then match the user profile against item profiles to recommend new items.]

Page 17:

Item Profiles
For each item, create an item profile

A profile is a set (vector) of features
◦ Movies: author, genre, director, actors, year, …
◦ Text: set of “important” words in the document

How to pick important features?
◦ TF-IDF (term frequency * inverse document frequency)
  ◦ Term … Feature
  ◦ Document … Item
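The TF-IDF weighting above can be sketched in a few lines of Python (the tiny corpus and item names are invented for illustration):

```python
import math

# Toy corpus: each "document" is the text associated with one item.
docs = {
    "movie_x": "pirate ship pirate treasure",
    "movie_y": "spy chase treasure",
}

def tf_idf(docs):
    """Weight each term by term frequency * inverse document frequency."""
    n = len(docs)
    df = {}                                  # document frequency of each term
    for text in docs.values():
        for term in set(text.split()):
            df[term] = df.get(term, 0) + 1
    profiles = {}
    for name, text in docs.items():
        words = text.split()
        profiles[name] = {t: (words.count(t) / len(words)) * math.log(n / df[t])
                          for t in set(words)}
    return profiles

profiles = tf_idf(docs)
# "pirate" is distinctive of movie_x, so it gets a positive weight there;
# "treasure" appears in every document, so its IDF (and weight) is 0.
```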

Page 18:

• If everything is 1 or 0 (indicator features)
• But what if we want to have real or ordinal features too?

Content-based Item Profiles

Feature columns are actors (Melissa McCarthy, Johnny Depp, Actor A, Actor B, …) and genres (Pirate, Spy, Comic):

Movie X:  0 1 1 0 1 1 0 1
Movie Y:  1 1 0 1 0 1 1 0

Page 19:

• Maybe we want a scaling factor α between binary and numeric features

Content-based Item Profiles

Same binary features as before, plus an average-rating column:

          (binary features)    Avg Rating
Movie X:  0 1 1 0 1 1 0 1      3
Movie Y:  1 1 0 1 0 1 1 0      4

Page 20:

• Maybe we want a scaling factor α between binary and numeric features
• Or maybe α = 1

Content-based Item Profiles

          (binary features)    Avg Rating
Movie X:  0 1 1 0 1 1 0 1      3α
Movie Y:  1 1 0 1 0 1 1 0      4α

Cosine(Movie X, Movie Y) = (X · Y) / (‖X‖ ‖Y‖)
                         = (2 + 12α²) / (√(5 + 9α²) · √(5 + 16α²))

Page 21:

User Profiles

Want a vector with the same components/dimensions as items
◦ Could be 1s representing user purchases
◦ Or arbitrary numbers from a rating

User profile is an aggregate of the items:
◦ Average (weighted?) of rated item profiles

Page 22:

Sample user profile

• Items are movies
• Utility matrix has 1 if the user has seen the movie
• 20% of the movies user U has seen have Melissa McCarthy
• So U[“Melissa McCarthy”] = 0.2

          Melissa McCarthy   Actor A   Actor B   …
User U:   0.2                .005      0         0 …
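The user-profile construction described above can be sketched directly: the profile is the average of the profiles of the seen items. Feature names and the movie list here are invented for illustration, chosen so that 1 of 5 seen movies features Melissa McCarthy:

```python
features = ["melissa_mccarthy", "pirate_genre"]
item_profiles = {                       # binary item profiles
    "m1": {"melissa_mccarthy": 1},
    "m2": {"pirate_genre": 1},
    "m3": {"pirate_genre": 1},
    "m4": {},
    "m5": {},
}
seen = ["m1", "m2", "m3", "m4", "m5"]   # user U has seen 5 movies

def user_profile(seen, item_profiles, features):
    """User profile = average of the profiles of the items the user has seen."""
    return {f: sum(item_profiles[i].get(f, 0) for i in seen) / len(seen)
            for f in features}

u = user_profile(seen, item_profiles, features)
# 1 of the 5 seen movies features Melissa McCarthy,
# so u["melissa_mccarthy"] == 0.2, matching the slide.
```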

Page 23:

Prediction
◦ Users and items have the same dimensions!
◦ So just recommend the items whose vectors are most similar to the user vector!
◦ Given user profile x and item profile i, estimate:

    u(x, i) = cos(x, i) = (x · i) / (‖x‖ · ‖i‖)

          Melissa McCarthy   Actor A   Actor B   …
User U:   0.2                .005      0     0   0
Movie X:  0                  1         1     0   …
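The prediction step above is just cosine similarity plus a sort. A minimal sketch (the profile values and movie names are illustrative, not from the slide):

```python
import math

def cos_sim(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

# Illustrative profiles over the same feature dimensions.
user_u = [0.2, 0.1, 0.0, 0.0]
item_profiles = {
    "movie_x": [1, 1, 0, 0],   # overlaps the user's features
    "movie_y": [0, 0, 1, 1],   # no overlap
}
# Recommend items ranked by similarity to the user vector.
ranked = sorted(item_profiles,
                key=lambda m: cos_sim(user_u, item_profiles[m]),
                reverse=True)
```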

Page 24:

Pros: Content-based Approach
+ No need for data on other users
  ◦ No cold-start or sparsity problems
+ Able to recommend to users with unique tastes
+ Able to recommend new & unpopular items
  ◦ No first-rater problem
+ Able to provide explanations
  ◦ Just list the content features that caused an item to be recommended

Page 25:

Cons: Content-based Approach
– Finding the appropriate features is hard
  ◦ E.g., images, movies, music
– Recommendations for new users
  ◦ How to build a user profile?
– Overspecialization
  ◦ Never recommends items outside the user’s content profile
  ◦ People might have multiple interests
  ◦ Unable to exploit quality judgments of other users

Page 26:

Collaborative Filtering
Harnessing the quality judgments of other users

Page 27:

Collaborative Filtering
Version 1: "User-User" Collaborative Filtering

Consider user x

Find set N of other users whose ratings are “similar” to x’s ratings

Estimate x’s ratings based on ratings of users in N


Page 28:

Finding Similar Users
Let rx be the vector of user x’s ratings

Cosine similarity measure
◦ sim(x, y) = cos(rx, ry) = (rx · ry) / (‖rx‖ ‖ry‖)

Problem: this treats missing ratings as “negative”
◦ What do I mean?

rx = [*, _, _, *, ***]        as points:  rx = {1, 0, 0, 1, 3}
ry = [*, _, **, **, _]                    ry = {1, 0, 2, 2, 0}

Page 29:

Utility Matrix

(Excerpt from MMDS, Chapter 9, p. 322)

9.3.1 Measuring Similarity

The first question we must deal with is how to measure similarity of users or items from their rows or columns in the utility matrix. We have reproduced Fig. 9.1 here as Fig. 9.4. This data is too small to draw any reliable conclusions, but its small size will make clear some of the pitfalls in picking a distance measure. Observe specifically the users A and C. They rated two movies in common, but they appear to have almost diametrically opposite opinions of these movies. We would expect that a good distance measure would make them rather far apart. Here are some alternative measures to consider.

      HP1  HP2  HP3  TW   SW1  SW2  SW3
A      4              5    1
B      5    5    4
C                     2    4    5
D      (two ratings of 3; column placement lost in extraction)

Figure 9.4: The utility matrix introduced in Fig. 9.1

Jaccard Distance

We could ignore values in the matrix and focus only on the sets of items rated. If the utility matrix only reflected purchases, this measure would be a good one to choose. However, when utilities are more detailed ratings, the Jaccard distance loses important information.

Example 9.7: A and B have an intersection of size 1 and a union of size 5. Thus, their Jaccard similarity is 1/5, and their Jaccard distance is 4/5; i.e., they are very far apart. In comparison, A and C have a Jaccard similarity of 2/4, so their Jaccard distance is the same, 1/2. Thus, A appears closer to C than to B. Yet that conclusion seems intuitively wrong. A and C disagree on the two movies they both watched, while A and B seem both to have liked the one movie they watched in common. ✷

Cosine Distance

We can treat blanks as a 0 value. This choice is questionable, since it has the effect of treating the lack of a rating as more similar to disliking the movie than liking it.

Example 9.8: The cosine of the angle between A and B is

    (4 × 5) / (√(4² + 5² + 1²) · √(5² + 5² + 4²)) = 0.380

Intuitively we want: sim(A, B) > sim(A, C)
Cosine similarity: yes, 0.380 > 0.322

But it only barely works… and it still considers missing ratings as “negative”.
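The two examples can be checked directly (ratings copied from Fig. 9.4; blanks omitted):

```python
import math

ratings = {
    "A": {"HP1": 4, "TW": 5, "SW1": 1},
    "B": {"HP1": 5, "HP2": 5, "HP3": 4},
    "C": {"TW": 2, "SW1": 4, "SW2": 5},
}

def jaccard(x, y):
    """Similarity of the *sets* of items rated, ignoring the values."""
    sx, sy = set(ratings[x]), set(ratings[y])
    return len(sx & sy) / len(sx | sy)

def cosine(x, y):
    """Cosine similarity, treating blanks as 0."""
    dot = sum(v * ratings[y].get(k, 0) for k, v in ratings[x].items())
    norm = lambda u: math.sqrt(sum(v * v for v in ratings[u].values()))
    return dot / (norm(x) * norm(y))

# Jaccard: A looks *closer* to C (0.5) than to B (0.2) -- intuitively wrong.
# Cosine: sim(A,B) = 0.380 > sim(A,C) = 0.322 -- right, but only barely.
```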

Page 30:

Utility Matrix


• Problem with cosine: a 0 acts like a negative review
  • C really loves SW
  • A hates SW
  • B just hasn’t seen it
• Another problem: we’d like to normalize for raters
  • D rated everything the same; not very useful

Page 31:

Modified Utility Matrix:
subtract the mean of each row

(Excerpt from MMDS, Chapter 9, p. 324)

      HP1   HP2   HP3   TW    SW1   SW2   SW3
A     2/3               5/3  −7/3
B     1/3   1/3  −2/3
C                      −5/3   1/3   4/3
D      (two 0s)

Figure 9.6: The matrix of Fig. 9.4 with each user's mean rating subtracted

The cosine of the angle between A and C is

    ((5/3) × (−5/3) + (−7/3) × (1/3)) / (√((2/3)² + (5/3)² + (−7/3)²) · √((−5/3)² + (1/3)² + (4/3)²)) = −0.559

Notice that under this measure, A and C are much further apart than A and B, and neither pair is very close. Both these observations make intuitive sense, given that A and C disagree on the two movies they rated in common, while A and B give similar scores to the one movie they rated in common. ✷

9.3.2 The Duality of Similarity

The utility matrix can be viewed as telling us about users or about items, or both. It is important to realize that any of the techniques we suggested in Section 9.3.1 for finding similar users can be used on columns of the utility matrix to find similar items. There are two ways in which the symmetry is broken in practice.

1. We can use information about users to recommend items. That is, given a user, we can find some number of the most similar users, perhaps using the techniques of Chapter 3. We can base our recommendation on the decisions made by these similar users, e.g., recommend the items that the greatest number of them have purchased or rated highly. However, there is no symmetry. Even if we find pairs of similar items, we need to take an additional step in order to recommend items to users. This point is explored further at the end of this subsection.

2. There is a difference in the typical behavior of users and items, as it pertains to similarity. Intuitively, items tend to be classifiable in simple terms. For example, music tends to belong to a single genre. It is impossible, e.g., for a piece of music to be both 60’s rock and 1700’s baroque. On the other hand, there are individuals who like both 60’s rock and 1700’s baroque, and who buy examples of both types of music. The consequence is that it is easier to discover items that are similar because they belong to the same genre, than it is to detect that two users are similar because they prefer one genre in common, while each also likes some genres that the other doesn’t care for.


• Now a 0 means no information
• And negative ratings mean viewers with opposite opinions will have vectors in opposite directions!

Page 32:

Modified Utility Matrix:
subtract the mean of each row

(Excerpt from MMDS, Chapter 9, p. 323)

The cosine of the angle between A and C is

    (5 × 2 + 1 × 4) / (√(4² + 5² + 1²) · √(2² + 4² + 5²)) = 0.322

Since a larger (positive) cosine implies a smaller angle and therefore a smaller distance, this measure tells us that A is slightly closer to B than to C. ✷

Rounding the Data

We could try to eliminate the apparent similarity between movies a user rates highly and those with low scores by rounding the ratings. For instance, we could consider ratings of 3, 4, and 5 as a “1” and consider ratings 1 and 2 as unrated. The utility matrix would then look as in Fig. 9.5. Now, the Jaccard distance between A and B is 3/4, while between A and C it is 1; i.e., C appears further from A than B does, which is intuitively correct. Applying cosine distance to Fig. 9.5 allows us to draw the same conclusion.

      HP1  HP2  HP3  TW   SW1  SW2  SW3
A      1              1
B      1    1    1
C                          1    1
D      (two 1s)

Figure 9.5: Utilities of 3, 4, and 5 have been replaced by 1, while ratings of 1 and 2 are omitted

Normalizing Ratings

If we normalize ratings, by subtracting from each rating the average rating of that user, we turn low ratings into negative numbers and high ratings into positive numbers. If we then take the cosine distance, we find that users with opposite views of the movies they viewed in common will have vectors in almost opposite directions, and can be considered as far apart as possible. However, users with similar opinions about the movies rated in common will have a relatively small angle between them.

Example 9.9: Figure 9.6 shows the matrix of Fig. 9.4 with all ratings normalized. An interesting effect is that D’s ratings have effectively disappeared, because a 0 is the same as a blank when cosine distance is computed. Note that D gave only 3’s and did not differentiate among movies, so it is quite possible that D’s opinions are not worth taking seriously.

Let us compute the cosine of the angle between A and B:

    ((2/3) × (1/3)) / (√((2/3)² + (5/3)² + (−7/3)²) · √((1/3)² + (1/3)² + (−2/3)²)) = 0.092


Cos(A,B) = 0.092
Cos(A,C) = −0.559

Now A and C are (correctly) much further apart than A and B.
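These two numbers can be reproduced in a few lines (ratings from Fig. 9.4; a sketch, treating blanks as 0 after centering):

```python
import math

# Raw ratings from Fig. 9.4 (blanks omitted; D is all 3s and drops out).
ratings = {
    "A": {"HP1": 4, "TW": 5, "SW1": 1},
    "B": {"HP1": 5, "HP2": 5, "HP3": 4},
    "C": {"TW": 2, "SW1": 4, "SW2": 5},
}

def center(r):
    """Subtract the user's mean rating; blanks stay blank (i.e., 0)."""
    mean = sum(r.values()) / len(r)
    return {k: v - mean for k, v in r.items()}

def cosine(x, y):
    dot = sum(v * y.get(k, 0.0) for k, v in x.items())
    norm = lambda r: math.sqrt(sum(v * v for v in r.values()))
    return dot / (norm(x) * norm(y))

cA, cB, cC = (center(ratings[u]) for u in "ABC")
# After centering, A and B stay mildly similar (about 0.092) while A and C
# become strongly dissimilar (about -0.559), matching Example 9.9.
```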

Page 33:

Fun fact:
Cosine after subtracting the mean turns out to be the same as the Pearson correlation coefficient!

Cosine similarity is correlation when the data is centered at 0

◦ Terminological note: subtracting the mean is zero-centering, not normalizing (normalizing is dividing by a norm, e.g. to turn counts into a probability distribution), but the textbook (and common usage) sometimes overloads the term “normalize”


Page 34:

Finding Similar Users
Let rx be the vector of user x’s ratings

Cosine similarity measure
◦ sim(x, y) = cos(rx, ry) = (rx · ry) / (‖rx‖ ‖ry‖)
◦ Problem: treats missing ratings as “negative”

Pearson correlation coefficient
◦ Sxy = items rated by both users x and y
◦ r̄x, r̄y = average rating of x, y

    sim(x, y) = Σ_{s∈Sxy} (r_xs − r̄x)(r_ys − r̄y)  /  ( √(Σ_{s∈Sxy} (r_xs − r̄x)²) · √(Σ_{s∈Sxy} (r_ys − r̄y)²) )

rx = [*, _, _, *, ***]        as points:  rx = {1, 0, 0, 1, 3}
ry = [*, _, **, **, _]                    ry = {1, 0, 2, 2, 0}
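The Pearson formula above translates directly into code. A minimal sketch over dict-valued rating vectors (here r̄ is each user's mean over all their own ratings, as on the slide):

```python
import math

def pearson(rx, ry):
    """Pearson-style similarity over the items rated by both users.
    rx, ry map item -> rating; missing items are simply unrated."""
    shared = set(rx) & set(ry)           # S_xy: co-rated items
    if not shared:
        return 0.0
    mx = sum(rx.values()) / len(rx)      # user means over all their ratings
    my = sum(ry.values()) / len(ry)
    num = sum((rx[s] - mx) * (ry[s] - my) for s in shared)
    dx = math.sqrt(sum((rx[s] - mx) ** 2 for s in shared))
    dy = math.sqrt(sum((ry[s] - my) ** 2 for s in shared))
    return num / (dx * dy) if dx and dy else 0.0

# Users who rank shared items the same way score near +1;
# opposite rankings score near -1.
```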

Page 35:

Rating Predictions
From similarity metric to recommendations:
Let rx be the vector of user x’s ratings
Let N be the set of k users most similar to x who have rated item i

Prediction for item i of user x:
◦ Rate i as the mean of what the k people like me rated i:

    r_xi = (1/k) · Σ_{y∈N} r_yi

◦ Even better: rate i as the mean weighted by their similarity to me:

    r_xi = Σ_{y∈N} s_xy · r_yi / Σ_{y∈N} s_xy

• Many other tricks possible…

Shorthand: s_xy = sim(x, y)
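The weighted-mean prediction above can be sketched as follows. The similarity function is passed in as a parameter, and the toy data and similarity values are made up for illustration:

```python
def predict(x, item, ratings, sim, k=2):
    """Predict user x's rating of `item` as the similarity-weighted mean of
    the ratings given by the k most similar users who rated the item.
    `ratings` maps user -> {item: rating}; `sim(x, y)` is any user-user
    similarity function (e.g., Pearson)."""
    raters = [y for y in ratings if y != x and item in ratings[y]]
    neighbors = sorted(raters, key=lambda y: sim(x, y), reverse=True)[:k]
    den = sum(sim(x, y) for y in neighbors)
    if not den:
        return None
    return sum(sim(x, y) * ratings[y][item] for y in neighbors) / den

# Hypothetical data: y1 is much more similar to x than y2 is.
toy_ratings = {"x": {"i1": 5},
               "y1": {"i1": 5, "i2": 4},
               "y2": {"i1": 1, "i2": 2}}
toy_sim = lambda a, b: {"y1": 0.9, "y2": 0.1}[b]
# predict("x", "i2", toy_ratings, toy_sim) == (0.9*4 + 0.1*2) / 1.0 == 3.8
```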

Page 36:

Collaborative Filtering Version 2:
Item-Item Collaborative Filtering

So far: user-user collaborative filtering
Alternate view that often works better: item-item
◦ For item i, find other similar items
◦ Estimate the rating for item i based on ratings for similar items
◦ Can use the same similarity metrics and prediction functions as in the user-user model
◦ "Rate i as the mean of my ratings for other items, weighted by their similarity to i"

    r_xi = Σ_{j∈N(i;x)} s_ij · r_xj / Σ_{j∈N(i;x)} s_ij

where
    s_ij … similarity of items i and j
    r_xj … rating of user x on item j
    N(i; x) … set of items rated by x that are similar to i

Page 37:

Item-Item CF (|N|=2)

                      users
          1  2  3  4  5  6  7  8  9 10 11 12
movie 1   1  .  3  .  .  5  .  .  5  .  4  .
movie 2   .  .  5  4  .  .  4  .  .  2  1  3
movie 3   2  4  .  1  2  .  3  .  4  3  5  .
movie 4   .  2  4  .  5  .  .  4  .  .  2  .
movie 5   .  .  4  3  4  2  .  .  .  .  2  5
movie 6   1  .  3  .  3  .  .  2  .  .  4  .

. = unknown rating; numbers are ratings between 1 and 5

Page 38:

Item-Item CF (|N|=2)

(Same utility matrix as on the previous slide, now with a “?” at movie 1, user 5.)

Task: estimate the rating of movie 1 by user 5.

Page 39:

Item-Item CF (|N|=2)

2/19/19 39

users:    1   2   3   4   5   6   7   8   9  10  11  12   sim(1,m)
movie 1:  1   .   3   .   ?   5   .   .   5   .   4   .     1.00
movie 2:  .   .   5   4   .   .   4   .   .   2   1   3    -0.18
movie 3:  2   4   .   1   2   .   3   .   4   3   5   .     0.41
movie 4:  .   2   4   .   5   .   .   4   .   .   2   .    -0.10
movie 5:  .   .   4   3   4   2   .   .   .   .   2   5    -0.31
movie 6:  1   .   3   .   3   .   .   2   .   .   4   .     0.59

Neighbor selection: identify movies similar to movie 1 that were rated by user 5.

Here we use Pearson correlation as similarity:
1) Subtract the mean rating m_i from each movie i
   m_1 = (1+3+5+5+4)/5 = 3.6
   row 1: [-2.6, 0, -0.6, 0, 0, 1.4, 0, 0, 1.4, 0, 0.4, 0]
2) Compute cosine similarities between the rows
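To make steps 1 and 2 concrete, here is a small sketch (in Python, not from the slides) that recomputes two of the slide's sim(1,m) values: mean-center each movie's row over its rated entries, store missing ratings as 0, then take the cosine.

```python
import math

# Centered-cosine (Pearson) similarity between movie rows, as on the slide.
# Rows below are movies 1, 3, and 6 from the ratings table (0 = unrated).

def centered(row):
    rated = [r for r in row if r != 0]
    mean = sum(rated) / len(rated)
    return [r - mean if r != 0 else 0.0 for r in row]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

m1 = [1, 0, 3, 0, 0, 5, 0, 0, 5, 0, 4, 0]
m3 = [2, 4, 0, 1, 2, 0, 3, 0, 4, 3, 5, 0]
m6 = [1, 0, 3, 0, 3, 0, 0, 2, 0, 0, 4, 0]

print(round(cosine(centered(m1), centered(m3)), 2))  # 0.41
print(round(cosine(centered(m1), centered(m6)), 2))  # 0.59
```

These match the s_{1,3} = 0.41 and s_{1,6} = 0.59 weights used on the next slides.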

Page 40:

Item-Item CF (|N|=2)


users:    1   2   3   4   5   6   7   8   9  10  11  12   sim(1,m)
movie 1:  1   .   3   .   ?   5   .   .   5   .   4   .     1.00
movie 2:  .   .   5   4   .   .   4   .   .   2   1   3    -0.18
movie 3:  2   4   .   1   2   .   3   .   4   3   5   .     0.41
movie 4:  .   2   4   .   5   .   .   4   .   .   2   .    -0.10
movie 5:  .   .   4   3   4   2   .   .   .   .   2   5    -0.31
movie 6:  1   .   3   .   3   .   .   2   .   .   4   .     0.59

Compute similarity weights: s_{1,3} = 0.41, s_{1,6} = 0.59
(movies 3 and 6 are the two most similar movies to movie 1 that user 5 has rated)


Page 41:

Item-Item CF (|N|=2)


users:    1   2   3   4   5   6   7   8   9  10  11  12
movie 1:  1   .   3   . 2.6   5   .   .   5   .   4   .
movie 2:  .   .   5   4   .   .   4   .   .   2   1   3
movie 3:  2   4   .   1   2   .   3   .   4   3   5   .
movie 4:  .   2   4   .   5   .   .   4   .   .   2   .
movie 5:  .   .   4   3   4   2   .   .   .   .   2   5
movie 6:  1   .   3   .   3   .   .   2   .   .   4   .

Predict by taking the weighted average:

r_{1,5} = (0.41·2 + 0.59·3) / (0.41 + 0.59) = 2.6

$r_{xi} = \frac{\sum_{j \in N(i;x)} s_{ij}\, r_{xj}}{\sum_{j \in N(i;x)} s_{ij}}$
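The arithmetic of that weighted average, checked in code with the values from the slide:

```python
# Reproduce the slide's prediction r_{1,5} from the two neighbor movies.
s13, s16 = 0.41, 0.59   # similarities of movies 3 and 6 to movie 1
r53, r56 = 2, 3         # user 5's ratings of movies 3 and 6
r15 = (s13 * r53 + s16 * r56) / (s13 + s16)
print(round(r15, 1))  # 2.6
```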

Page 42:

Item-Item vs. User-User


¡ In practice, item-item often works better than user-user

¡ Why? Items are simpler; users have multiple tastes
¡ (People are more complex than objects)

Page 43:

Simplified item-item for our homework

First, assume you've converted all the values to

+1 (like),

0 (no rating)

-1 (dislike)

users:    1   2   3   4   5   6   7   8   9  10  11  12
movie 1:  1   .   3   .   .   5   .   .   5   .   4   .
movie 2:  .   .   5   4   .   .   4   .   .   2   1   3
movie 3:  2   4   .   1   2   .   3   .   4   3   5   .
movie 4:  .   2   4   .   5   .   .   4   .   .   2   .
movie 5:  .   .   4   3   4   2   .   .   .   .   2   5
movie 6:  1   .   3   .   3   .   .   2   .   .   4   .

Page 44:

Simplified item-item for our homework

First, assume you've converted all the values to

+1 (like),

0 (no rating)

-1 (dislike)

users:     1   2   3   4   5   6   7   8   9  10  11  12
movie 1:  -1   .  +1   .   .  +1   .   .  +1   .  +1   .
movie 2:   .   .  +1  +1   .   .  +1   .   .  -1  -1  +1
movie 3:  -1  +1   .  -1  -1   .  +1   .  +1  +1  +1   .
movie 4:   .  -1  +1   .  +1   .   .  +1   .   .  -1   .
movie 5:   .   .  +1  +1  +1  -1   .   .   .   .  -1  +1
movie 6:  -1   .  +1   .  +1   .   .  -1   .   .  +1   .

Page 45:

Simplified item-item for our tiny PA6 dataset

Assume you've binarized, i.e. converted all the values to
◦ +1 (like), 0 (no rating), -1 (dislike)

For this binary case, some tricks that the TAs recommend:
◦ Don't mean-center users; just keep the raw +1, 0, -1
◦ Don't normalize (i.e., don't divide the weighted sum by the sum of the similarities)
◦ i.e., instead of this:

$r_{xi} = \frac{\sum_{j \in N(i;x)} s_{ij}\, r_{xj}}{\sum_{j \in N(i;x)} s_{ij}}$

◦ Just do this:

$r_{xi} = \sum_{j \in N(i;x)} s_{ij}\, r_{xj}$

◦ Don't use Pearson correlation to compute s_ij
◦ Just use cosine

s_ij … similarity of items i and j
r_xj … rating of user x on item j
N(i;x) … set of items rated by x

Page 46:

Simplified item-item for our tiny PA6 dataset

1. Binarize, i.e. convert all values to
◦ +1 (like), 0 (no rating), -1 (dislike)

2. The user x gives you (say) ratings for 2 movies m1 and m2
3. For each movie i in the dataset:
◦ $r_{xi} = \sum_{j \in \{m_1, m_2\}} s_{ij}\, r_{xj}$
◦ where s_ij is the cosine between the vectors for movies i and j

4. Recommend the movie i with the maximum r_xi

r_xj … rating of user x on item j
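The four steps above can be sketched as follows. This is a hand-rolled illustration on made-up data, not the PA6 starter code; in particular, the "rating ≥ 3 counts as a like" rule in `binarize` is an assumption for the toy example.

```python
import math

# Sketch of the simplified binarized item-item recipe (hypothetical data).

def binarize(r):
    # 0 = no rating; >= 3 = like (+1); otherwise dislike (-1) -- an assumption
    return 0 if r == 0 else (1 if r >= 3 else -1)

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

# rows = movies, columns = users; 0 = no rating
raw = [
    [5, 0, 1, 4],
    [4, 2, 0, 5],
    [1, 5, 4, 0],
    [4, 0, 1, 5],
]
R = [[binarize(r) for r in row] for row in raw]

user_ratings = {0: 1, 1: 1}  # the user liked movies 0 and 1
scores = {
    i: sum(cosine(R[i], R[j]) * rxj for j, rxj in user_ratings.items())
    for i in range(len(R)) if i not in user_ratings
}
best = max(scores, key=scores.get)  # unrated movie with the highest r_xi
print(best)  # 3
```

Movie 3's binarized row matches the two liked movies closely, so it wins; movie 2, which anti-correlates with them, gets a negative score.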

Page 47:

Pros/Cons of Collaborative Filtering

+ Works for any kind of item
◦ No feature selection needed

- Cold start:
◦ Need enough users in the system to find a match

- Sparsity:
◦ The user/ratings matrix is sparse
◦ Hard to find users that have rated the same items

- First rater:
◦ Cannot recommend an item that has not been previously rated
◦ New items, esoteric items

- Popularity bias:
◦ Cannot recommend items to someone with unique taste
◦ Tends to recommend popular items


Page 48:

Hybrid Methods

Implement two or more different recommenders and combine predictions
◦ Perhaps using a linear model

Add content-based methods to collaborative filtering
◦ Item profiles for the new-item problem
◦ Demographics to deal with the new-user problem


Page 49:

Evaluation


[Figure: a movies × users utility matrix with known ratings 1–5; rows are movies, columns are users]


Page 50:

Evaluation


[Figure: the same movies × users utility matrix, with some known ratings withheld and marked "?"; these withheld entries form the Test Data Set]

Page 51:

Evaluating Predictions

Compare predictions with known ratings
◦ Root-mean-square error (RMSE)

◦ $\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{xi} \left(r_{xi} - r_{xi}^{*}\right)^{2}}$

where $r_{xi}$ is the predicted rating, $r_{xi}^{*}$ is the true rating of x on i, and the sum runs over the N held-out ratings

◦ Rank correlation:
◦ Spearman's correlation between the system's and the user's complete rankings
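A quick sketch of the RMSE computation on hypothetical predicted and held-out ratings:

```python
import math

# RMSE between predicted and true held-out ratings (hypothetical values).
def rmse(predicted, true):
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(predicted, true)) / len(true))

print(round(rmse([2.6, 4.0, 3.5], [3, 4, 3]), 2))  # 0.37
```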


Page 52:

Problems with Error Measures

Narrow focus on accuracy sometimes misses the point
◦ Prediction diversity
◦ Prediction context

In practice, we mainly care about predicting high ratings:
◦ RMSE might penalize a method that does well for high ratings and badly for others


Page 53:

There’s No Data like More Data

Leverage all the data
◦ Simple methods on large data do best

Add more data
◦ e.g., add IMDB data on genres

More data beats better algorithms


Page 54:

Famous Historical Example: The Netflix Prize

Training data
◦ 100 million ratings, 480,000 users, 17,770 movies
◦ 6 years of data: 2000-2005

Test data
◦ Last few ratings of each user (2.8 million)
◦ Evaluation criterion: root-mean-square error (RMSE)
◦ Netflix Cinematch RMSE: 0.9514
◦ Dumb baseline does really well. For user u and movie m, take the average of:
  ◦ The average rating given by u on all rated movies
  ◦ The average of the ratings for movie m by all users who rated that movie

Competition
◦ 2700+ teams
◦ $1 million prize for 10% improvement on Cinematch
◦ BellKor system won in 2009. Combined many factors:
  ◦ Overall deviations of users/movies
  ◦ Regional effects
  ◦ Local collaborative filtering patterns
  ◦ Temporal biases
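The "dumb baseline" described above can be sketched in a few lines; the ratings dictionary and names below are made-up toy data, not the Netflix set.

```python
# Baseline prediction for user u and movie m: the average of
# (a) u's mean rating and (b) m's mean rating.

ratings = {  # (user, movie) -> rating; hypothetical toy data
    ("u1", "m1"): 5, ("u1", "m2"): 3,
    ("u2", "m1"): 4, ("u2", "m3"): 2,
}

def mean(xs):
    xs = list(xs)
    return sum(xs) / len(xs)

def baseline(u, m):
    user_avg = mean(r for (x, _), r in ratings.items() if x == u)
    movie_avg = mean(r for (_, y), r in ratings.items() if y == m)
    return (user_avg + movie_avg) / 2

print(baseline("u2", "m2"))  # 3.0
```

Here u2's average is (4+2)/2 = 3 and m2's average is 3, so the baseline predicts 3.0.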


Page 55:

Summary on Recommendation Systems

• The Long Tail
• Content-based Systems
• Collaborative Filtering

Page 56:

State of the Art in Collaborative Filtering

Dimensionality reduction (SVD) techniques give the best performance
But it is not clear whether they are necessary at YouTube scale…

Page 57:

Open problem: ethical and societal implications

Filter bubbles

“I realized really fast that YouTube’s recommendation was putting people into filter bubbles,” Chaslot said. “There was no way out. If a person was into Flat Earth conspiracies, it was bad for watch-time to recommend anti-Flat Earth videos, so it won’t even recommend them.”

Page 58:

YouTube announcement 3 weeks ago: "we'll begin reducing recommendations of borderline content and content that could misinform users in harmful ways—such as videos promoting a phony miracle cure for a serious illness, claiming the earth is flat, or making blatantly false claims about historic events like 9/11."