
Page 1: Learning User Preferences

Learning User Preferences

Jason Rennie, MIT CSAIL

[email protected]

Advisor: Tommi Jaakkola

Page 2: Learning User Preferences

Information Extraction

• Informal Communication: e-mail, mailing lists, bulletin boards

• Issues:
– Context switching
– Abbreviations & shortened forms
– Variable punctuation, formatting, grammar

Page 3: Learning User Preferences

Thesis Advertisement: Outline

• Thesis is not end-to-end IE system

• We address some IE problems:

1. Identifying & Resolving Named Entities

2. Tracking Context

3. Learning User Preferences

Page 4: Learning User Preferences

Identifying Named Entities

• “Rialto is now open until 11pm”

• Facts/Opinions usually about a named entity

• Tools typically rely on punctuation, capitalization, formatting, grammar

• We developed a criterion to identify topic-oriented words using occurrence statistics

[Rennie & Jaakkola, SIGIR 2005]

Page 5: Learning User Preferences

Resolving Named Entities

• “They’re now open until 11pm”

• What does “they” refer to?

• Clustering
– Group noun phrases that co-refer

• McCallum & Wellner (2005)
– Excellent for proper nouns

• Our contribution: better modeling of non-proper nouns (incl. pronouns)

Page 6: Learning User Preferences

Tracking Context

• “The Swordfish was fabulous”
– Indirect comment on a restaurant.
– Restaurant identified by context.

• Use word statistics to find topic switches

• Contribution: new sentence clustering algorithm

Page 7: Learning User Preferences

Learning User Preferences

• Examples:
– “I loved Rialto last night.”
– “Overall, Oleana was worth the money”
– “Radius wasn’t bad, but wasn’t great”
– “Om was purely pretentious”

• Issues:
1. Translate text to a partial ordering or rating
2. Predict unobserved ratings

Page 8: Learning User Preferences

Preference Problems

• Single User w/ Item Features

• Multi-user, no features
– a.k.a. Collaborative Filtering

Page 9: Learning User Preferences

Single User, Item Features

[Slide figure: worked example. Restaurants (columns): 10 Tables, #9 Park, Lumiere, Tanjore, Chennai, Rndzvous.

Feature values:
  Capacity        30  90  60  80  40  80
  Price           30  60  50  30  20  40
  French?          1   0   1   0   0   0
  New American?    0   1   0   0   0   1
  Ethnic?          0   0   0   1   1   0
  Formality        2   4   3   1   0   2
  Location         2   3   1   2   0   2

User weights (one per feature): -0.1, -0.1, +10, +5, 0, 0, +2
Preference scores (one per restaurant): +8, -4, +1, -7, -6, -3
Rating thresholds: 4 = 6, 3 = 3, 2 = -2, 1 = -5; scores are thresholded into 1-5 ratings.]
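The mapping on this slide is just a linear model: preference score = user weights · item features, and thresholds then turn scores into ratings. Below is a minimal NumPy sketch that reproduces the numbers above (my own illustration, not the thesis's MATLAB code; the pairing of weights with feature rows is inferred from the figure).

```python
import numpy as np

# Feature values: rows = features, columns = restaurants
# (10 Tables, #9 Park, Lumiere, Tanjore, Chennai, Rndzvous)
X = np.array([
    [30, 90, 60, 80, 40, 80],   # Capacity
    [30, 60, 50, 30, 20, 40],   # Price
    [ 1,  0,  1,  0,  0,  0],   # French?
    [ 0,  1,  0,  0,  0,  1],   # New American?
    [ 0,  0,  0,  1,  1,  0],   # Ethnic?
    [ 2,  4,  3,  1,  0,  2],   # Formality
    [ 2,  3,  1,  2,  0,  2],   # Location
])

w = np.array([-0.1, -0.1, 10, 5, 0, 0, 2])    # user weights, one per feature
scores = w @ X                                # preference score per restaurant
print(scores)                                 # [ 8. -4.  1. -7. -6. -3.]

# Thresholds from the slide: a score above thetas[r-1] earns a rating above r
thetas = np.array([-5, -2, 3, 6])
ratings = 1 + np.sum(scores[:, None] > thetas[None, :], axis=1)
print(ratings)                                # [5 2 3 1 1 2]
```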

Page 10: Learning User Preferences

Single User, Item Features

[Slide figure: the same restaurants and feature values as the previous slide, but the user weights and preference scores are now unknown ("?"). Observed ratings: 5 2 3 1 ? ?; the task is to learn the weights from the observed ratings and predict the missing ones.]

Page 11: Learning User Preferences

Many Users, No Features

[Slide figure: per-user weight vectors times per-item feature vectors give a matrix of preference scores, which is thresholded into a users x items ratings matrix; the weights and features are unknown, and the figure shows example preference scores and ratings with missing entries marked "?".]

Page 12: Learning User Preferences

Collaborative Filtering

• Possible goals:
– Predict missing entries
– Cluster users or items

• Applications:
– Movies, Books
– Genetic Interaction
– Network routing
– Sports performance

[Slide figure: a users x items matrix of example ratings.]

Page 13: Learning User Preferences

Outline

• Single User, Features
– Loss functions, Convexity, Large Margin
– Loss function for Ratings

• Many Users, No Features
– Feature Selection, Rank, SVD
– Regularization: tie together multiple tasks
– Optimization: scale to large problems

• Extensions

Page 14: Learning User Preferences

This Talk: Contributions

• Implementation and systematic evaluation of loss functions for Single User prediction.

• Scaling multi-user regularization to large problems (thousands of users/items)
– Analysis of optimization

• Extensions
– Hybrid: features + multiple users
– Observation model & multiple ratings

Page 15: Learning User Preferences

Rating Classification

• n ordered classes

• Learn weight vector, thresholds

[Slide figure: items with ratings 1, 2, and 3 projected onto the weight vector w; the learned thresholds split the projection into the rating classes.]

Page 16: Learning User Preferences

Loss Functions

[Slide figure: loss as a function of margin agreement for the 0-1, Hinge, Logistic, Smooth Hinge, and Modified Least Squares losses.]
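For reference, here is a minimal NumPy sketch of the losses plotted above, written as functions of the margin agreement z (my own illustration; the exact scaling constants are an assumption, not taken from the slide).

```python
import numpy as np

def zero_one(z):
    return (z <= 0).astype(float)            # 1 if on the wrong side, else 0

def hinge(z):
    return np.maximum(0.0, 1.0 - z)

def logistic(z):
    return np.log1p(np.exp(-z))

def smooth_hinge(z):
    # Hinge with the corner smoothed: quadratic for 0 < z < 1, linear for z <= 0
    return np.where(z <= 0, 0.5 - z,
           np.where(z < 1, 0.5 * (1.0 - z) ** 2, 0.0))

def modified_least_squares(z):
    return np.maximum(0.0, 1.0 - z) ** 2

z = np.linspace(-2, 2, 9)
for f in (zero_one, hinge, logistic, smooth_hinge, modified_least_squares):
    print(f.__name__, np.round(f(z), 3))
```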

Page 17: Learning User Preferences

Convexity

• Convex function => no local minima

• A set is convex if all line segments between its points lie within the set

Page 18: Learning User Preferences

Convexity of Loss Functions

• 0-1 loss is not convex
– Local minima, sensitive to small changes

• Convex Bound
– Large margin solution with regularization
– Stronger guarantees

Page 19: Learning User Preferences

Proportional Odds

• McCullagh introduced the original rating model
– Linear interaction: weights & features
– Thresholds
– Maximum likelihood

[McCullagh, 1980]

[Slide figure: the same ratings-on-a-line illustration; items with ratings 1, 2, and 3 projected onto w, separated by thresholds.]
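For reference, the proportional-odds model can be written as follows (a standard statement of McCullagh's model, restated here because the slide's equations did not survive the transcript):

\[
\log \frac{P(y \le r \mid x)}{P(y > r \mid x)} \;=\; \theta_r - w^\top x,
\qquad r = 1, \dots, n-1 ,
\]

with weights w shared across all thresholds θ_1 ≤ … ≤ θ_{n-1}, and parameters fit by maximum likelihood.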

Page 20: Learning User Preferences

Immediate-Thresholds

[Slide figure: the immediate-thresholds construction on a 1-5 rating scale; the loss is built from the two thresholds immediately adjacent to the correct rating.]

[Shashua & Levin, 2003]

Page 21: Learning User Preferences

Some Errors are Better than Others

[Slide figure: a user's true ratings compared with the predictions of System 1 and System 2.]

Page 22: Learning User Preferences

Not a Bound on Absolute Diff.

[Slide figure: a 1-5 rating scale example showing that this loss need not bound the absolute difference between predicted and true ratings.]

Page 23: Learning User Preferences

All-Thresholds Loss

[Slide figure: the all-thresholds construction on a 1-5 rating scale; every threshold between the predicted score and the correct rating contributes to the loss.]

[Srebro, Rennie & Jaakkola, NIPS 2004]
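A minimal sketch of the all-thresholds construction (my own NumPy illustration; the choice of smooth hinge as the per-threshold loss and the threshold values are assumptions):

```python
import numpy as np

def smooth_hinge(z):
    return np.where(z <= 0, 0.5 - z,
           np.where(z < 1, 0.5 * (1.0 - z) ** 2, 0.0))

def all_thresholds_loss(score, rating, thetas):
    """Sum a per-threshold loss over every threshold, not just the ones
    adjacent to the true rating; thetas[r-1] separates rating r from r+1."""
    loss = 0.0
    for r, theta in enumerate(thetas, start=1):
        # Thresholds below the true rating should be exceeded,
        # thresholds at or above it should not be.
        direction = 1.0 if rating > r else -1.0
        loss += smooth_hinge(direction * (score - theta))
    return loss

thetas = [-5, -2, 3, 6]                       # four thresholds for a 1-5 scale
print(all_thresholds_loss(8.0, 5, thetas))    # right side of every threshold: 0.0
print(all_thresholds_loss(8.0, 1, thetas))    # off by four levels: large loss
```

Because every violated threshold contributes, the loss grows with the distance between the predicted and true rating, unlike the immediate-thresholds loss on the earlier slide.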

Page 24: Learning User Preferences

Experiments

Test error (lower is better) by per-threshold loss (rows) and construction (columns):

              Multi-Class   Imm-Thresh   All-Thresh   p-value
  MLS            .7486        .7491        .6700      1.7e-18
  Hinge          .7433        .7628        .6702      6.6e-17
  Logistic       .7490        .7248        .6623      7.3e-22

  Least Squares regression: 1.3368

[Rennie & Srebro, IJCAI 2005]

Page 25: Learning User Preferences

Many Users, No Features

[Slide figure: repeat of the earlier many-users illustration; a partially observed users x items ratings matrix modeled as unknown per-user weights times unknown per-item features, giving preference scores that are thresholded into ratings.]

Page 26: Learning User Preferences

Background: Lp-norms

• L0: number of non-zero entries: ||<0,2,0,3,4>||_0 = 3

• L1: sum of absolute values: ||<2,-2,1>||_1 = 5

• L2: Euclidean length: ||<1,-1>||_2 = √2

• General: ||v||_p = (Σ_i |v_i|^p)^(1/p)
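A quick NumPy check of these norms (my own illustration):

```python
import numpy as np

print(np.count_nonzero([0, 2, 0, 3, 4]))      # L0 "norm": 3
print(np.linalg.norm([2, -2, 1], ord=1))      # L1 norm: 5.0
print(np.linalg.norm([1, -1], ord=2))         # L2 norm: sqrt(2) = 1.414...

def lp_norm(v, p):
    # General Lp norm: (sum_i |v_i|^p)^(1/p)
    return np.sum(np.abs(v) ** p) ** (1.0 / p)

print(lp_norm(np.array([1, -1]), 2))          # 1.4142...
```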

Page 27: Learning User Preferences

Background: Feature Selection

• Objective: Loss + Regularization

[Slide figure: the squared L2 regularizer and the L1 regularizer; the L1 penalty pushes weights to exactly zero, performing feature selection.]

Page 28: Learning User Preferences

Singular Value Decomposition

• X = USV'
– U, V: orthogonal (rotation)
– S: diagonal, non-negative

• Eigenvalues of XX' = USV'VSU' = US²U' are the squared singular values of X

• Rank = ||s||_0 (number of non-zero singular values)

• SVD is used to obtain the least-squares low-rank approximation
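A small NumPy sketch of these facts (my own illustration, not the thesis code): the singular values determine the rank, and truncating them gives the least-squares low-rank approximation.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 2)) @ rng.standard_normal((2, 5))   # exactly rank 2

U, s, Vt = np.linalg.svd(X, full_matrices=False)
print(np.round(s, 6))                     # only two singular values are non-zero
print(np.linalg.matrix_rank(X))           # 2

# Eigenvalues of XX' are the squared singular values of X
eig = np.sort(np.linalg.eigvalsh(X @ X.T))[::-1]
print(np.allclose(eig[:2], s[:2] ** 2))   # True

# Truncated SVD gives the best rank-k approximation in the least-squares sense
k = 1
Xk = (U[:, :k] * s[:k]) @ Vt[:k, :]
print(np.linalg.norm(X - Xk))             # Frobenius error = dropped singular value
```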

Page 29: Learning User Preferences

Low Rank Matrix Factorization

[Slide figure: X ≈ U x V', a rank-k factorization, shown next to an example ratings matrix Y.]

• Sum-squared loss, fully observed Y: use SVD to find the global optimum

• Classification error loss, partially observed Y: non-convex, no explicit solution

Page 30: Learning User Preferences

Low-Rank: Non-Convex Set

[Slide figure: the sum of two rank-1 matrices can have rank 2, so the set of low-rank matrices is not convex.]

Page 31: Learning User Preferences

Trace Norm Regularization

[Fazel et al., 2001]

Trace Norm: sum of singular values

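For concreteness, a short NumPy illustration of the trace norm (also called the nuclear norm); this is my own sketch, not from the slides.

```python
import numpy as np

def trace_norm(X):
    # Sum of the singular values of X
    return np.linalg.svd(X, compute_uv=False).sum()

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 4))
print(trace_norm(A))
print(np.linalg.norm(A, ord='nuc'))   # equivalent built-in (nuclear norm)
```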

Page 32: Learning User Preferences

Many Users, No Features

[Slide figure: the many-users illustration again, now labeled with the factorization X = UV': U holds the per-user weights, V' the per-item features, X the preference scores, and Y the partially observed ratings matrix.]

Page 33: Learning User Preferences

Max Margin Matrix Factorization

• The objective is a convex function of X (and the thresholds)

• The trace norm penalty encourages low rank in X

• Objective: All-Thresholds Loss + Trace Norm (regularization)

[Srebro, Rennie & Jaakkola, NIPS 2004]

Page 34: Learning User Preferences

Properties of the Trace Norm

• The factorization U√S, V√S (from the SVD X = USV') minimizes both quantities below
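The two quantities referred to are the standard variational characterizations of the trace norm (restated here because the slide's equations did not survive the transcript):

\[
\|X\|_{\Sigma}
\;=\; \min_{X = UV'} \|U\|_F \, \|V\|_F
\;=\; \min_{X = UV'} \tfrac{1}{2}\left(\|U\|_F^2 + \|V\|_F^2\right).
\]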

Page 35: Learning User Preferences

Factorized Optimization

• Factorized Objective (tight bound):

• Gradient descent: O(n³) per round

• Stationary points, but no local minima

[Rennie & Srebro, ICML 2005]
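Below is a minimal sketch of a factorized objective and a gradient step (my own NumPy illustration under simplifying assumptions: squared error on observed entries instead of the all-thresholds loss, and plain gradient descent):

```python
import numpy as np

def factorized_objective(U, V, Y, mask, lam):
    # loss(UV') on observed entries + (lam/2)(||U||_F^2 + ||V||_F^2),
    # which upper-bounds lam * trace-norm(UV')
    X = U @ V.T
    loss = 0.5 * np.sum(mask * (X - Y) ** 2)
    reg = 0.5 * lam * (np.sum(U ** 2) + np.sum(V ** 2))
    return loss + reg

def gradient_step(U, V, Y, mask, lam, lr=0.01):
    R = mask * (U @ V.T - Y)          # residuals on observed entries only
    dU = R @ V + lam * U
    dV = R.T @ U + lam * V
    return U - lr * dU, V - lr * dV

rng = np.random.default_rng(0)
n_users, n_items, k = 30, 20, 5
Y = rng.integers(1, 6, size=(n_users, n_items)).astype(float)
mask = rng.random((n_users, n_items)) < 0.5     # half the ratings observed
U = 0.1 * rng.standard_normal((n_users, k))
V = 0.1 * rng.standard_normal((n_items, k))

for _ in range(200):
    U, V = gradient_step(U, V, Y, mask, lam=0.1)
print(factorized_objective(U, V, Y, mask, lam=0.1))
```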

Page 36: Learning User Preferences

Collaborative Prediction Results

Dataset size and sparsity: EachMovie 36656 x 1648, 96%; MovieLens 6040 x 3952, 96%.

              EachMovie                 MovieLens
Algorithm     Weak Error  Strong Error  Weak Error  Strong Error
URP           .8596       .8859         .6946       .7104
Attitude      .8787       .8845         .6912       .7000
MMMF          .8548       .8439         .6650       .6725

[URP & Attitude: Marlin, 2004] [MMMF: Rennie & Srebro, 2005]

Page 37: Learning User Preferences

Extensions

• Multi-user + Features

• Observation model
– Predict which restaurants a user will rate, and
– the rating she will make

• Multiple ratings per user/restaurant
– E.g. Food, Service and Décor ratings

• SVD Parameterization

Page 38: Learning User Preferences

Multi-User + Features

• Feature parameters (V):
– Some are fixed
– Some are learned

• Learn weights (U) for all features

• Fixed part of V does not affect regularization

[Slide figure: the matrix V' partitioned into fixed-feature and learned-feature parts.]

Page 39: Learning User Preferences

Observation Model

• Common assumption: ratings observed at random

• Restaurant selection:
– Geography, popularity, price, food style

• Remove bias: model observation process

Page 40: Learning User Preferences

Observation Model

• Model as binary classification

• Add binary classification loss

• Tie together rating and observation models

• X = U_X V',  W = U_W V' (shared item features V', separate user weight matrices)
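One way to tie the two models together is to share the item features V between them, which the following sketch illustrates (my own illustration; the particular losses, squared error for ratings and logistic loss for the observe/not-observe indicator, are assumptions rather than the thesis's exact formulation):

```python
import numpy as np

def combined_objective(Ux, Uw, V, Y, observed, lam):
    """Rating model X = Ux V' and observation model W = Uw V' share the
    item features V; observed[i, j] = 1 if user i rated item j."""
    X = Ux @ V.T                          # rating scores
    W = Uw @ V.T                          # observation scores
    rating_loss = 0.5 * np.sum(observed * (X - Y) ** 2)
    labels = 2.0 * observed - 1.0         # +1 if rated, -1 if not
    obs_loss = np.sum(np.log1p(np.exp(-labels * W)))
    # Regularize the stacked user weights and the shared features together
    reg = 0.5 * lam * (np.sum(Ux ** 2) + np.sum(Uw ** 2) + np.sum(V ** 2))
    return rating_loss + obs_loss + reg
```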

Page 41: Learning User Preferences

Multiple Ratings

• Users may provide multiple ratings:
– Service, Décor, Food

• Add in loss functions

• Stack parameter matrices for regularization

Page 42: Learning User Preferences

SVD Parameterization

• Too many parameters: U A A^(-1) V' = X is another factorization of X for any invertible A

• Alternative: parameterize by U, S, V
– U, V orthogonal, S diagonal

• Advantages:
– Not over-parameterized
– Exact objective (not a bound)
– No stationary points

Page 43: Learning User Preferences

Summary

• Loss function for ratings

• Regularization for multiple users

• Scaled MMMF to large problems (e.g. > 1000x1000)

• Trace norm: widely applicable

• Extensions

Code: http://people.csail.mit.edu/jrennie/matlab

Page 44: Learning User Preferences

Thanks!

• Helen, for supporting me for 7.5 years!
• Tommi Jaakkola, for answering all my questions and directing me to the “end”!
• Mike Collins and Tommy Poggio for add’l guidance.
• Nati Srebro & John Barnett for endless valuable discussions and ideas.
• Amir Globerson, David Sontag, Luis Ortiz, Luis Perez-Breva, Alan Qi, & Patrycja Missiuro & all past members of Tommi’s reading group for paper discussions, conference trips and feedback on my talks.
• Many, many others who have helped me along the way!

Page 45: Learning User Preferences

Low-Rank Optimization

[Slide figure: an objective over matrices, marking the unconstrained objective minimum, the low-rank minimum, and a low-rank local minimum along the low-rank set.]