Social/Collaborative Filtering
Outline
• Recap: SVD vs PCA
• Collaborative filtering
  – aka social recommendation
  – k-NN CF methods
  – CF as classification
• CF via MF
• MF vs SGD vs …
Dimensionality Reduction and Principal Components Analysis: Recap
More cartoons
PCA as matrices: the optimization problem
PCA as matrices

V ≈ W H, where
  V is the 1000 images × 10,000 pixels data matrix: V[i,j] = pixel j in image i
  W is 1000 × 2: the 2 mixing weights for each image
  H is 2 × 10,000: its rows are the principal components PC1 and PC2
Poll
• True or false: the weights for an example should add up to 1.
• True or false: the weights for a prototype should add up to 1.
Implementing PCA
Also: they are orthogonal to each other!
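A minimal numpy sketch of this recipe (the data and variable names are stand-ins of mine, not from the slides): center the data, form the covariance matrix, take its top eigenvectors, and project.

```python
import numpy as np

def pca(X, k=2):
    """PCA via the covariance matrix: center, eigendecompose, project."""
    Xc = X - X.mean(axis=0)               # mean-center each feature
    C = Xc.T @ Xc / len(Xc)               # d x d covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)  # eigh sorts ascending for symmetric C
    E = eigvecs[:, ::-1][:, :k]           # top-k principal components (columns)
    return Xc @ E                         # n x k mixing weights

X = np.random.randn(1000, 50)             # stand-in data (e.g., images x pixels)
Z = pca(X, k=2)
# the principal components are orthogonal to each other, as the slide notes:
# E.T @ E is (approximately) the identity matrix
```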
PCA FOR MODELING TEXT (SVD = SINGULAR VALUE DECOMPOSITION)
A Scalability Problem with PCA
• Covariance matrix is large in high dimensions
  – With d features the covariance matrix is d × d
• SVD is a closely related method that can be implemented more efficiently in high dimensions
  – Don't explicitly compute the covariance matrix
  – Instead write the doc-term matrix X as X = U S VT
  – S is k × k where k << d
  – S is diagonal and S[i,i] = sqrt(λi), the square root of the i-th eigenvalue
  – Columns of V ≈ principal components
  – Rows of US ≈ embeddings for the examples
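A sketch of the X = U S VT route in numpy; the doc-term matrix here is random stand-in data, and for a real sparse doc-term matrix you would likely use a truncated sparse solver such as scipy.sparse.linalg.svds instead.

```python
import numpy as np

X = np.random.rand(1000, 10000)   # stand-in doc-term matrix: X[i,j] ~ TFIDF of term j in doc i
k = 50                            # k << d
U, s, Vt = np.linalg.svd(X, full_matrices=False)
S = np.diag(s[:k])                # k x k diagonal; S[i,i] = singular value = sqrt(eigenvalue)
embeddings = U[:, :k] @ S         # rows of US ~ embeddings for the documents
components = Vt[:k, :]            # rows of VT = columns of V ~ principal components
X_hat = embeddings @ components   # best rank-k approximation of X
```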
Recovering latent factors in a matrix

V ≈ W H, where
  V is the n documents × m terms doc-term matrix: DocTerm[i,j] = TF-IDF score of term j in doc i
  W is n × 2 (row i = [xi yi], an embedding for document i)
  H is 2 × m (rows [a1 … am] and [b1 … bm], latent factors over terms)
In SVD notation the same factorization is written U S VT, with the diagonal S = diag(s1, s2) between U and VT.
SVD example
• The Neatest Little Guide to Stock Market Investing
• Investing For Dummies, 4th Edition
• The Little Book of Common Sense Investing: The Only Way to Guarantee Your Fair Share of Stock Market Returns
• The Little Book of Value Investing
• Value Investing: From Graham to Buffett and Beyond
• Rich Dad's Guide to Investing: What the Rich Invest in, That the Poor and the Middle Class Do Not!
• Investing in Real Estate, 5th Edition
• Stock Investing For Dummies
• Rich Dad's Advisors: The ABC's of Real Estate Investing: The Secrets of Finding Hidden Profits Most Investors Miss
https://technowiki.wordpress.com/2011/08/27/latent-semantic-analysis-lsa-tutorial/
[Figure: the books plotted in the 2-d SVD space; labeled points include "Investing for real estate", "Rich Dad's Advisors: The ABCs of Real Estate Investment…", "The little book of common sense investing: …", and "Neatest Little Guide to Stock Market Investing".]
My recap: SVD vs PCA
• Very closely related methods
• As described here:
  – SVD decomposition doesn't require a square matrix
  – PCA decomposition is always applied to CX, which is square and mean-centered
  – You can implement PCA using SVD as a substep
• People sometimes use the terms interchangeably
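A quick numpy check of the last point (my own illustration, not from the slides): on mean-centered data, the right singular vectors from SVD span the same directions as the eigenvectors of CX.

```python
import numpy as np

X = np.random.randn(150, 4)
Xc = X - X.mean(axis=0)

# PCA route: eigenvectors of the square, mean-centered covariance matrix CX
_, E = np.linalg.eigh(Xc.T @ Xc)
E = E[:, ::-1]                          # sort descending by eigenvalue

# SVD route: works directly on the (non-square) data matrix
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)

# same components, up to a sign flip per direction
print(np.allclose(np.abs(Vt), np.abs(E.T)))   # True
```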
Outline
• What is CF?
• Nearest-neighbor methods for CF
  – One old-school paper: BellCore's movie recommender
  – Some general discussion
• CF reduced to classification
• CF reduced to matrix factoring
• Other uses of matrix factoring in ML
WHAT IS COLLABORATIVE FILTERING? (aka social filtering, recommendation systems, …)
What is collaborative filtering?
Other examples of social filtering….
Everyday Examples of Collaborative Filtering...
• Bestseller lists
• Top 40 music lists
• The "recent returns" shelf at the library
• Unmarked but well-used paths through the woods
• The printer room at work
• "Read any good books lately?"
• ....
• Common insight: personal tastes are correlated:
  – If Alice and Bob both like X and Alice likes Y, then Bob is more likely to like Y
  – especially (perhaps) if Bob knows Alice
SOCIAL/COLLABORATIVE FILTERING: NEAREST-NEIGHBOR METHODS
BellCore’s MovieRecommender
• Recommending And Evaluating Choices In A Virtual Community Of Use. Will Hill, Larry Stead, Mark Rosenstein and George Furnas, Bellcore; CHI 1995
"By virtual community we mean 'a group of people who share characteristics and interact in essence or effect only'. In other words, people in a Virtual Community influence each other as though they interacted but they do not interact. Thus we ask: 'Is it possible to arrange for people to share some of the personalized informational benefits of community involvement without the associated communications costs?'"
MovieRecommender Goals
Recommendations should:
• simultaneously ease and encourage rather than replace social processes.... should make it easy to participate while leaving in hooks for people to pursue more personal relationships if they wish.
• be for sets of people, not just individuals... multi-person recommending is often important, for example, when two or more people want to choose a video to watch together.
• be from people, not a black box machine or so-called "agent".
• tell how much confidence to place in them; in other words, they should include indications of how accurate they are.
BellCore's MovieRecommender
• Participants sent email to [email protected]
• System replied with a list of 500 movies to rate on a 1-10 scale (250 random, 250 popular)
  – Only a subset need be rated
• New participant P sends in rated movies via email
• System compares ratings for P to ratings of (a random sample of) previous users
• Most similar users are used to predict scores for unrated movies (more later)
• System returns recommendations in an email message.
Suggested Videos for: John A. Jamus.

Your must-see list with predicted ratings:
• 7.0 "Alien (1979)"
• 6.5 "Blade Runner"
• 6.2 "Close Encounters Of The Third Kind (1977)"

Your video categories with average ratings:
• 6.7 "Action/Adventure"
• 6.5 "Science Fiction/Fantasy"
• 6.3 "Children/Family"
• 6.0 "Mystery/Suspense"
• 5.9 "Comedy"
• 5.8 "Drama"

The viewing patterns of 243 viewers were consulted. Patterns of 7 viewers were found to be most similar. Correlation with target viewer:
• 0.59 viewer-130 ([email protected])
• 0.55 bullert, janer ([email protected])
• 0.51 jan_arst ([email protected])
• 0.46 Ken Cross ([email protected])
• 0.42 rskt ([email protected])
• 0.41 kkgg ([email protected])
• 0.41 bnn ([email protected])

By category, their joint ratings recommend:
• Action/Adventure: "Excalibur" 8.0, 4 viewers; "Apocalypse Now" 7.2, 4 viewers; "Platoon" 8.3, 3 viewers
• Science Fiction/Fantasy: "Total Recall" 7.2, 5 viewers
• Children/Family: "Wizard Of Oz, The" 8.5, 4 viewers; "Mary Poppins" 7.7, 3 viewers
• Mystery/Suspense: "Silence Of The Lambs, The" 9.3, 3 viewers
• Comedy: "National Lampoon's Animal House" 7.5, 4 viewers; "Driving Miss Daisy" 7.5, 4 viewers; "Hannah and Her Sisters" 8.0, 3 viewers
• Drama: "It's A Wonderful Life" 8.0, 5 viewers; "Dead Poets Society" 7.0, 5 viewers; "Rain Man" 7.5, 4 viewers

Correlation of predicted ratings with your actual ratings is: 0.64. This number measures ability to evaluate movies accurately for you. 0.15 means low ability. 0.85 means very good ability. 0.50 means fair ability.
BellCore's MovieRecommender
• Evaluation:
  – Withhold 10% of the ratings of each user to use as a test set
  – Measure correlation between predicted ratings and actual ratings for test-set movie/user pairs
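The protocol in code form, as I read it; `predict` here is a hypothetical scoring function, not something from the paper.

```python
import numpy as np
from collections import defaultdict

def evaluate(ratings, predict, holdout_frac=0.10, seed=0):
    """ratings: dict (user, movie) -> score; predict: hypothetical fn(user, movie) -> score."""
    rng = np.random.default_rng(seed)
    by_user = defaultdict(list)
    for pair in ratings:
        by_user[pair[0]].append(pair)
    test = []
    for user, pairs in by_user.items():     # withhold ~10% of each user's ratings
        rng.shuffle(pairs)
        test += pairs[:max(1, int(holdout_frac * len(pairs)))]
    actual = np.array([ratings[p] for p in test])
    predicted = np.array([predict(*p) for p in test])
    return np.corrcoef(predicted, actual)[0, 1]   # correlation on test movie/user pairs
```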
BellCore's MovieRecommender
• Participants sent email to [email protected]
• System replied with a list of 500 movies to rate
• New participant P sends in rated movies via email
• System compares ratings for P to ratings of (a random sample of) previous users
• Most similar users are used to predict scores for unrated movies
  – Empirical Analysis of Predictive Algorithms for Collaborative Filtering. Breese, Heckerman, Kadie, UAI98
• System returns recommendations in an email message.
recap: k-nearest neighbor learning

Given a test example x:
1. Find the k training-set examples (x1,y1), …, (xk,yk) that are closest to x.
2. Predict the most frequent label in that set.
Breaking it down:
• To train:
  – save the data (very fast!)
• To test:
  – For each test example x:
    1. Find the k training-set examples (x1,y1), …, (xk,yk) that are closest to x.
    2. Predict the most frequent label in that set.
  – Prediction is relatively slow…
  – ...but it doesn't depend on the number of classes, only the number of neighbors.
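The recap in code: a deliberately unoptimized sketch where "training" is just saving the data.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Find the k training examples closest to x, then predict the most frequent label."""
    dists = np.linalg.norm(X_train - x, axis=1)   # distance to every saved example
    nearest = np.argsort(dists)[:k]               # indices of the k closest
    return Counter(y_train[i] for i in nearest).most_common(1)[0][0]
```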
Algorithms for Collaborative Filtering 1: Memory-Based Algorithms (Breese et al, UAI98)
• $v_{i,j}$ = vote of user i on item j
• $I_i$ = items for which user i has voted
• Mean vote for i is $\bar{v}_i = \frac{1}{|I_i|} \sum_{j \in I_i} v_{i,j}$
• Predicted vote for "active user" a on item j is a weighted sum: $p_{a,j} = \bar{v}_a + \kappa \sum_{i=1}^{n} w(a,i)\,(v_{i,j} - \bar{v}_i)$, where $\kappa$ is a normalizer and the $w(a,i)$ are the weights of the n similar users
• The weight $w(a,i)$ is based on the similarity between users a and i
Algorithms for Collaborative Filtering 1: Memory-Based Algorithms (Breese et al, UAI98)
• K-nearest neighbor: $w(a,i) = \begin{cases} 1 & \text{if } i \in \text{neighbors}(a) \\ 0 & \text{else} \end{cases}$
• Pearson correlation coefficient (Resnick '94, GroupLens): $w(a,i) = \frac{\sum_j (v_{a,j} - \bar{v}_a)(v_{i,j} - \bar{v}_i)}{\sqrt{\sum_j (v_{a,j} - \bar{v}_a)^2 \sum_j (v_{i,j} - \bar{v}_i)^2}}$
• Cosine distance, etc, …
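A sketch of this memory-based predictor with Pearson weights, for a small dense ratings matrix with np.nan marking missing votes; the arrangement (dense matrix, nan handling) is my own simplification.

```python
import numpy as np

def predict_vote(V, a, j):
    """Breese-style memory-based prediction: mean vote of active user a plus a
    normalized, Pearson-weighted sum of other users' deviations on item j.
    V is a small dense n_users x n_items array with np.nan for missing votes."""
    means = np.nanmean(V, axis=1)                      # mean vote of each user
    num = denom = 0.0
    for i in range(V.shape[0]):
        if i == a or np.isnan(V[i, j]):
            continue
        both = ~np.isnan(V[a]) & ~np.isnan(V[i])       # items co-rated by a and i
        if both.sum() < 2:
            continue
        w = np.corrcoef(V[a, both], V[i, both])[0, 1]  # Pearson weight w(a, i)
        if np.isnan(w):                                # constant votes -> undefined correlation
            continue
        num += w * (V[i, j] - means[i])
        denom += abs(w)                                # normalizer kappa = 1 / sum_i |w(a,i)|
    return means[a] + (num / denom if denom else 0.0)
```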
SOCIAL/COLLABORATIVE FILTERING: TRADITIONAL CLASSIFICATION
What are other ways to formulate the collaborative filtering problem?
• Treat it like ordinary classification or regression
Collaborative + Content Filtering (Basu et al, AAAI98; Condliff et al, AI-STATS99)

                   Airplane     Matrix        Room w/ View   ...   Hidalgo
                   comedy,$2M   action,$70M   romance,$25M   ...   action,$30M
Joe   27,M,$70k    9            7             2                    7
Carol 53,F,$20k    8            9
...
Kumar 25,M,$22k    9            3                                  6
Ua    48,M,$81k    4            7             ?              ?     ?
Collaborative + Content Filtering As Classification (Basu, Hirsh, Cohen, AAAI98)

                   Airplane   Matrix   Room w/ View   ...   Hidalgo
                   comedy     action   romance        ...   action
Joe   27,M,70k     1          1        0                    1
Carol 53,F,20k     1          1        0
...
Kumar 25,M,22k     1          0        0                    1
Ua    48,M,81k     0          1        ?              ?     ?

Classification task: map a (user,movie) pair into {likes,dislikes}
Training data: known likes/dislikes
Test data: active users
Features: any properties of the user/movie pair
Collaborative + Content Filtering As Classification (Basu et al, AAAI98)

[same table as above]

Features: any properties of the user/movie pair (U,M)
Examples: genre(U,M), age(U,M), income(U,M), ...
• genre(Carol,Matrix) = action
• income(Kumar,Hidalgo) = $22k/year
Collaborative + Content Filtering As Classification (Basu et al, AAAI98)

[same table as above]

Features: any properties of the user/movie pair (U,M)
Examples: usersWhoLikedMovie(U,M):
• usersWhoLikedMovie(Carol,Hidalgo) = {Joe,...,Kumar}
• usersWhoLikedMovie(Ua,Matrix) = {Joe,...}
Collaborative + Content Filtering As Classification (Basu et al, AAAI98)

[same table as above]

Features: any properties of the user/movie pair (U,M)
Examples: moviesLikedByUser(M,U):
• moviesLikedByUser(*,Joe) = {Airplane,Matrix,...,Hidalgo}
• actionMoviesLikedByUser(*,Joe) = {Matrix,Hidalgo}
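A sketch of how these features might be assembled for one (user, movie) pair; the feature names follow the slides, but the data structures and the like-threshold are my own assumptions.

```python
def make_features(user, movie, ratings, user_info, movie_info, like_threshold=7):
    """Build a feature dict for (user, movie), mixing content and collaborative features."""
    liked = {(u, m) for (u, m), v in ratings.items() if v >= like_threshold}
    return {
        "genre": movie_info[movie]["genre"],             # content features
        "age": user_info[user]["age"],
        "income": user_info[user]["income"],
        "usersWhoLikedMovie": {u for (u, m) in liked if m == movie},   # collaborative
        "moviesLikedByUser": {m for (u, m) in liked if u == user},
    }
```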
Collaborative + Content Filtering As Classification (Basu et al, AAAI98)

[same table as above, with Ua's ratings for Airplane and Matrix now both 1]

Features: any properties of the user/movie pair (U,M). One test example for Ua:
genre={romance}, age=48, sex=male, income=$81k, usersWhoLikedMovie={Carol}, moviesLikedByUser={Matrix,Airplane}, ...
Collaborative + Content Filtering As Classification (Basu et al, AAAI98)

[same table as above]

Two test examples for Ua:
• genre={romance}, age=48, sex=male, income=$81k, usersWhoLikedMovie={Carol}, moviesLikedByUser={Matrix,Airplane}, ...
• genre={action}, age=48, sex=male, income=$81k, usersWhoLikedMovie={Joe,Kumar}, moviesLikedByUser={Matrix,Airplane}, ...
Collaborative + Content Filtering As Classification (Basu et al, AAAI98)
• Classification learning algorithm: rule learning (RIPPER)
• If NakedGun33/13 ∈ moviesLikedByUser and Joe ∈ usersWhoLikedMovie and genre=comedy, then predict likes(U,M)
• If age>12 and age<17 and HolyGrail ∈ moviesLikedByUser and director=MelBrooks, then predict likes(U,M)
• If Ishtar ∈ moviesLikedByUser, then predict likes(U,M)
Basu et al 98 - results
• Evaluation:
  – Predict liked(U,M) = "M in top quartile of U's ranking" from features; evaluate recall and precision
  – Features:
    • Collaborative: UsersWhoLikedMovie, UsersWhoDislikedMovie, MoviesLikedByUser
    • Content: Actors, Directors, Genre, MPAA rating, ...
    • Hybrid: ComediesLikedByUser, DramasLikedByUser, UsersWhoLikedFewDramas, ...
• Results: at the same level of recall (about 33%):
  – Ripper with collaborative features only is worse than the original MovieRecommender (by about 5 pts precision – 73 vs 78)
  – Ripper with hybrid features is better than MovieRecommender (by about 5 pts precision)
Matrix Factorization for Collaborative Filtering
Recovering latent factors in a matrix

V is the n users × m movies ratings matrix, with V[i,j] = user i's rating of movie j.
Recovering latent factors in a matrix

V ≈ W H, where
  V is the n users × m movies matrix: V[i,j] = user i's rating of movie j
  W is n × 2 (row i = [xi yi], latent factors for user i)
  H is 2 × m (rows [a1 … am] and [b1 … bm], latent factors over movies)

(talk pilfered from → ….. KDD 2011)
Recovering latent factors in a matrix

More generally, V ≈ W H with W an n users × r matrix, H an r × m movies matrix, and V[i,j] = user i's rating of movie j.
… is like Linear Regression ….

Y ≈ W H, where
  W is the n instances (e.g., 150) × r features (e.g., 4) data matrix, with rows like [pl1 pw1 sl1 sw1]
  H is the r × 1 weight vector [w1 w2 w3 w4]T
  Y is the n × 1 prediction vector: Y[i,1] = instance i's prediction
.. for many outputs at once….

Y ≈ W H, where
  W is the n instances (e.g., 150) × r features (e.g., 4) data matrix
  H is the r × m weight matrix [[w11 w12 …], [w21 ..], …], one column per regression task
  Y is n × m: Y[i,j] = instance i's prediction for regression task j
… where we also have to find the dataset! (In MF, unlike regression, both factors W and H are unknown.)
Matrix factorization as SGD

For each observed entry V[i,j], take a gradient step on the squared error with step size η:
  ε = V[i,j] − W[i,:] H[:,j]
  W[i,:] ← W[i,:] + η ε H[:,j]T
  H[:,j] ← H[:,j] + η ε W[i,:]T
(the factor of 2 from the squared error is folded into the step size η)
Matrix factorization as SGD - why does this work? Here's the key claim: the update for a single entry V[i,j] touches only row W[i,:] and column H[:,j], and in expectation (over a random observed entry) it follows the gradient of the total loss.
Checking the claim
• Think of SGD for logistic regression: the LR loss compares y and ŷ = dot(w,x)
• MF is similar, but now we update both w (the user weights) and x (the movie weights)
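A minimal numpy sketch of MF by SGD under squared-error loss; the factor count, step size, and initialization scale are arbitrary choices of mine.

```python
import numpy as np

def mf_sgd(observed, n, m, r=10, lr=0.01, epochs=20, seed=0):
    """observed: list of (i, j, v) entries of the n x m ratings matrix V.
    Returns W (n x r) and H (r x m) with V[i,j] ~ W[i,:] @ H[:,j]."""
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=0.1, size=(n, r))
    H = rng.normal(scale=0.1, size=(r, m))
    for _ in range(epochs):
        rng.shuffle(observed)                 # visit observed cells in random order
        for i, j, v in observed:
            wi, hj = W[i].copy(), H[:, j].copy()
            err = v - wi @ hj                 # like LR: compare v and v-hat = dot(w, x)
            W[i] += lr * err * hj             # update the user weights...
            H[:, j] += lr * err * wi          # ...and the movie weights
    return W, H
```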
What loss functions are possible?
• squared error (as above)
• "generalized" KL-divergence
• regularized versions of these
ALS = alternating least squares: fix H and solve for W as a least-squares problem, then fix W and solve for H, and repeat.
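An ALS sketch for the fully observed case (a simplification I'm making; CF-style ALS solves each row's least-squares problem over the observed entries only).

```python
import numpy as np

def als(V, r=10, iters=20, reg=0.1, seed=0):
    """Factor a fully observed V (n x m) as W H by alternating ridge solves."""
    rng = np.random.default_rng(seed)
    H = rng.normal(scale=0.1, size=(r, V.shape[1]))
    for _ in range(iters):
        # fix H, solve min_W ||V - W H||^2 + reg ||W||^2 in closed form
        W = np.linalg.solve(H @ H.T + reg * np.eye(r), H @ V.T).T
        # fix W, solve the symmetric problem for H
        H = np.linalg.solve(W.T @ W + reg * np.eye(r), W.T @ V)
    return W, H
```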
Matrix Multiplications in Machine Learning: MF vs PCA vs SGD vs ….
Recovering latent factors in a matrix

V ≈ W H, with W an n users × r matrix, H an r × m movies matrix, and V[i,j] = user i's rating of movie j.
….. vs k-means (1)

X ≈ Z M, where
  X is the original data set (n examples × m features)
  Z is an n × r matrix of 0/1 indicators for the r clusters (rows like [0 1] or [1 0])
  M is the r × m matrix of cluster means
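The same picture in code (my own illustration, ignoring the empty-cluster corner case): each Lloyd iteration builds the indicator matrix Z and recomputes the means M, so X ≈ Z M.

```python
import numpy as np

def kmeans_as_mf(X, r=2, iters=10, seed=0):
    """k-means viewed as factoring X ~ Z M: Z holds one-hot cluster indicators,
    M holds the cluster means."""
    rng = np.random.default_rng(seed)
    M = X[rng.choice(len(X), size=r, replace=False)]       # init means from data points
    for _ in range(iters):
        d = ((X[:, None, :] - M[None, :, :]) ** 2).sum(-1) # squared distance to each mean
        Z = np.eye(r)[d.argmin(axis=1)]                    # n x r one-hot indicators
        M = (Z.T @ X) / Z.sum(axis=0)[:, None]             # recompute cluster means
    return Z, M
```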
Matrix multiplication - 1 (Gram matrix)

V = X XT, where
  X is the n instances (e.g., 150) × r features (e.g., 2) data matrix, XT its transpose
  V[i,j] = <xi,xj>, the inner product of instances i and j (the Gram matrix)
Matrix multiplication - 2 (Covariance matrix)

CX = XT X, where X is n instances × r features (e.g., 2), assuming mean(X) = 0:
  $C_X(i,j) = \sum_{t=1}^{n} x_{ti}\, x_{tj} = n \cdot \mathrm{cov}(i,j)$
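Both products in a few lines of numpy, to make the two slides concrete (illustrative stand-in data).

```python
import numpy as np

X = np.random.randn(150, 2)
Xc = X - X.mean(axis=0)      # enforce the slide's assumption: mean(X) = 0
G = Xc @ Xc.T                # n x n Gram matrix: G[i, j] = <x_i, x_j>
C = Xc.T @ Xc                # r x r: C[i, j] = n * cov(feature i, feature j)
print(np.allclose(C / len(Xc), np.cov(Xc.T, bias=True)))   # True
```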
Matrix multiplication - 2 (PCA)

V = CX = XT X   … variances/covariances
E = eigenvectors(V)
X ET = PCA(X) = Z, where Z(i,j) is the similarity of example i to eigenvector j

I think of the eigenvectors as the fixed point of a process where we predict each feature value from the others using CX.
Matrix multiplication - 2 (PCA)

V = CX = XT X   … variances/covariances
E = eigenvectors(V)
X EKT = PCA(X) = ZK   … using EK = E(1:K, :), the top K eigenvectors of CX, instead of E
Matrix multiplication - 3 (SVD)

V = CX = XT X   … variances/covariances
E = eigenvectors(V)
X ET = PCA(X) = Z
X = Z E = Z Σ-1 Σ E = U Σ E   … where U = Z Σ-1, i.e., a factored version of X
Usually written as X = U Σ VT
Matrix multiplication - 3 (SVD)

V = CX = XT X   … variances/covariances
X EKT = PCA(X) = ZK   … EK = E(1:K, :) instead of E
X ≈ ZK EK = ZK Σ-1 Σ EK = UK Σ EK   … where UK = ZK Σ-1, i.e., a rank-K factored version of X
Matrix multiplication - 3 (SVD)

Original matrix X ≈ UK Σ E, where
  UK is n instances × K: K features with zero covariance and unit variances
  Σ = diag(Σ1, Σ2, …) is diagonal
  E holds the eigenvectors of CX
Recovering latent factors in a matrix

Back to CF: V ≈ W H, with W an n users × r matrix, H an r × m movies matrix, and V[i,j] = user i's rating of movie j.