A random walks perspective on maximizing satisfaction and profit
Matthew Brand, SIAM International Conference on Data Mining, April 21-23, 2005
May 31, 2005
Presented by Daniel Hsu (djhsu@cs) for CSE 254


Page 1: A random walks perspective on maximizing satisfaction and

A random walks perspective on maximizing satisfaction and profit

Matthew Brand, SIAM International Conference on Data Mining, April 21-23, 2005

May 31, 2005

Presented by Daniel Hsu (djhsu@cs) for CSE 254

Page 2: A random walks perspective on maximizing satisfaction and

Outline

Motivation

Collaborative filtering

Basis for recommendation

The random walk model

Using the model

Evaluation

Page 3: A random walks perspective on maximizing satisfaction and

Motivation: collaborative filtering

“Hello, Daniel Hsu. We have recommendations for you.”

Page 4: A random walks perspective on maximizing satisfaction and

Motivation: collaborative filtering

Everyone wins!

- Daniel is more likely to find products he will like.

- Amazon.com is more likely to sell products to Daniel.

How can Amazon.com achieve this glorious end?

Should Amazon.com just recommend The Da Vinci Code to everyone?

Page 5: A random walks perspective on maximizing satisfaction and

Motivation: basis for recommendation

Page 6: A random walks perspective on maximizing satisfaction and

Motivation: basis for recommendation

How can Amazon.com decide which recommendations to make?

- (Satisfaction) “Many customers who bought Debussy: Piano Works also bought Satie: Piano Works... People who like Debussy also like Satie... Daniel will like Satie: Piano Works.”

- (Profit) “Also, we (Amazon.com) make a huge profit margin on Satie: Piano Works, so let's try to sell as many copies of this disc as possible.”

Page 7: A random walks perspective on maximizing satisfaction and

Outline

Motivation

The random walk model

Association graph and Markov chain

Expected hitting and commute times

Connection to resistive networks

Random walk correlation

Using the model

Evaluation

Page 8: A random walks perspective on maximizing satisfaction and

Model: association graph

Let W ∈ R^{n×n}_+ be the weighted adjacency matrix of an association graph.

- Example 1: vertices are events; the weight of edge (i, j) is how many times event j followed event i.

- Example 2: vertices are people and movies; the weight of edge (i, j) is the rating person i gave movie j.

Let P = diag(W1)^{-1} W be the row-normalized version of W. Then we can think of P as the transition matrix of a Markov chain.
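As a concrete illustration, the row-normalization step is a one-liner in NumPy. The weight matrix below is a made-up 4-node example, not data from the talk:

```python
import numpy as np

# Hypothetical 4-node association graph; symmetric weights chosen for illustration.
W = np.array([[0., 2., 1., 0.],
              [2., 0., 1., 1.],
              [1., 1., 0., 3.],
              [0., 1., 3., 0.]])

# P = diag(W 1)^{-1} W: divide each row by its sum.
P = W / W.sum(axis=1, keepdims=True)

# Each row of P is now a probability distribution over next states.
assert np.allclose(P.sum(axis=1), 1.0)
```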

Page 9: A random walks perspective on maximizing satisfaction and

An example association graph

[Figure: an example association graph. Attribute nodes (18-25 year old, female, student, bus driver, ...) link to individuals (314, 315, ...), and individuals link to movies (Star Wars Ep. 3, Napoleon Dynamite, Sleepless in Seattle, The Shawshank Redemption, ...) with edge weights given by ratings (e.g. 1, 2, 3, 5).]

Page 10: A random walks perspective on maximizing satisfaction and

The main assumption

“a random walk on this Markov chain will mimic, over the short term, the behavior of individuals randomly drawn from this population.”

Page 11: A random walks perspective on maximizing satisfaction and

Further assumptions and consequences

Let (Xt : t ≥ 0) be an irreducible and aperiodic Markov chain with transition matrix P. Then the chain has a unique stationary distribution π:

- πj ≥ 0 for each j,

- Σj πj = 1,

- π′P = π′.

Page 12: A random walks perspective on maximizing satisfaction and

Expected hitting and commute times

Suppose the chain is in state i.

- Expected hitting time Hij: how long does it take, on average, to reach state j?

- Expected commute time Cij: how long does it take, on average, to reach state j and then return to state i?

- Cij = Hij + Hji = Cji.

Both Hij and Cij have been previously proposed as a basis for making recommendations. But how are they computed?

Glimpse into the future: the newly proposed basis is also derived from the expected commute times.

Page 13: A random walks perspective on maximizing satisfaction and

A recurrence relation for expected hitting time

Let the random variable Tj|i be the time to reach state j starting in state i. If i ≠ j, then

    Tj|i = 1 + Tj|k, where k is the next state (any k with Pik > 0).

Then, using conditional expectations,

    Hij = E[Tj|i]
        = 1 + Σ_{k: Pik>0} Pr(next state is k | in state i) E[Tj|k]
        = 1 + Σ_{k: Pik>0} Pik Hkj.
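The recurrence pins down all hitting times to a fixed target j: set Hjj = 0 and solve the linear system (I − P restricted to states i ≠ j) h = 1. A minimal sketch on a made-up 4-state chain:

```python
import numpy as np

# Made-up symmetric association graph and its random-walk transition matrix.
W = np.array([[0., 2., 1., 0.],
              [2., 0., 1., 1.],
              [1., 1., 0., 3.],
              [0., 1., 3., 0.]])
P = W / W.sum(axis=1, keepdims=True)

def hitting_times_to(j, P):
    """Solve Hij = 1 + sum_k Pik Hkj for all i != j, with Hjj = 0."""
    n = P.shape[0]
    idx = [i for i in range(n) if i != j]
    A = np.eye(n - 1) - P[np.ix_(idx, idx)]   # (I - P) restricted to states != j
    H = np.zeros(n)
    H[idx] = np.linalg.solve(A, np.ones(n - 1))
    return H

H0 = hitting_times_to(0, P)
# The solution satisfies the recurrence at every state i != 0.
for i in range(1, 4):
    assert np.isclose(H0[i], 1 + P[i] @ H0)
```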

Page 14: A random walks perspective on maximizing satisfaction and

An identity for the frequency of a state

Now we'll derive a direct expression for the hitting time (adapted from Aldous and Fill). We'll use the following lemma:

Lemma (“Occupation measure identity”)
Consider the Markov chain (Xt : t ≥ 0) with stationary distribution π, started at state i. Let 0 < S < ∞ be a random stopping time such that XS = i and E[S | X0 = i] < ∞. Then for any state j,

    E[# of visits to state j before time S | X0 = i] = πj E[S | X0 = i].

For succinctness, write this as Ei[#j before S] = πj Ei[S]. We count visits at time 0 and exclude visits at time S.

Page 15: A random walks perspective on maximizing satisfaction and

Using the identity

Occupation measure identity: Ei[#j before S] = πj Ei[S].

Define Ti = min{t ≥ 0 : Xt = i} as the first hitting time of state i, and Ti⁺ = min{t ≥ 1 : Xt = i} as the first return time to state i. Note: Ti and Ti⁺ are the same unless X0 = i.

Warm-up: let S = Ti⁺. Then Ei[#i before Ti⁺] = 1, so 1 = πi Ei[Ti⁺]. That is,

    Ei[Ti⁺] = 1/πi.    (1)

For S = Ti⁺ and j ≠ i, use the lemma and (1) to get

    Ei[#j before Ti⁺] = πj Ei[Ti⁺] = πj/πi.    (2)

Page 16: A random walks perspective on maximizing satisfaction and

Using the identity

Occupation measure identity: Ei[#j before S] = πj Ei[S].

Let S = the first return to i after the first visit to j (j ≠ i). Then

    Ei[S] = Ei[Tj] + Ej[Ti]

and

    Ei[#j before S] = Ei[#j before Tj] + Ej[#j before Ti].

But Ei[#j before Tj] = 0, so

    Ej[#j before Ti] = πj (Ei[Tj] + Ej[Ti]).    (3)

Page 17: A random walks perspective on maximizing satisfaction and

Using the identity

Use the notation Eρ[·] for the expectation given that the state at time 0 is distributed according to ρ.

Let t0 ≥ 1, and let S be the time of the following:

1. wait time t0, then
2. wait until the chain hits i.

Let Vt be the random variable indicating whether i is visited at time t. Then Σ_{t=0}^{t0−1} Vt is the number of visits to i before S. Now, using the identity,

    Σ_{t=0}^{t0−1} Ei[Vt] = Σ_{t=0}^{t0−1} (P^t)ii = πi (t0 + Eρ[Ti])

with ρk = Pr(X_{t0} = k | X0 = i).

Page 18: A random walks perspective on maximizing satisfaction and

Using the identity

Rearranging

    Σ_{t=0}^{t0−1} (P^t)ii = πi (t0 + Eρ[Ti])

to get

    Σ_{t=0}^{t0−1} [(P^t)ii − πi] = πi Eρ[Ti]

and letting t0 → ∞, we get

    Zii = πi Eπ[Ti],    (4)

where Zij = Σ_{t=0}^{∞} [(P^t)ij − πj].

Page 19: A random walks perspective on maximizing satisfaction and

Using the identity

To actually get an expression for Ei[Tj], this time let S be the time of the following:

1. wait until the chain hits i,
2. then wait time t0 ≥ 1, and then
3. finally wait until the chain hits j.

The occupation measure identity says Ej[#j before S] = πj Ej[S]. Note that

    Ej[S] = Ej[Ti] + t0 + Eρ[Tj],

where ρk = Pr(X_{t0} = k | X0 = i), and

    Ej[#j before S] = Ej[#j before Ti] + Σ_{t=0}^{t0−1} (P^t)ij.

Page 20: A random walks perspective on maximizing satisfaction and

Using the identity

Then, using (3), rearranging, and letting t0 → ∞:

    Ej[#j before Ti] + Σ_{t=0}^{t0−1} (P^t)ij = πj (Ej[Ti] + t0 + Eρ[Tj])

    πj (Ej[Ti] + Ei[Tj]) + Σ_{t=0}^{t0−1} (P^t)ij = πj (Ej[Ti] + t0 + Eρ[Tj])

    Σ_{t=0}^{t0−1} [(P^t)ij − πj] = πj (Eρ[Tj] − Ei[Tj])

    Zij = πj (Eπ[Tj] − Ei[Tj]).

Finally, using (4), we have

    Zjj − Zij = πj Ei[Tj] = πj Hij.    (5)

Page 21: A random walks perspective on maximizing satisfaction and

Computing the expected hitting time

In order to use (5) to compute Hij = Ei[Tj], we need to compute Z. Let Π = 1π′. Then

    Z = Σ_{t=0}^{∞} (P^t − Π)
      = (P^0 − Π) + Σ_{t=1}^{∞} (P^t − Π)
      = (I − Π) + Σ_{t=1}^{∞} (P − Π)^t    (check by induction)
      = −Π + Σ_{t=0}^{∞} (P − Π)^t
      = −Π + (I − (P − Π))^{-1}    (since P^t − Π → 0).

Note: Brand says Z = (I − P − Π)^{-1}, which is probably wrong.
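The closed form is easy to check numerically. A sketch on a made-up symmetric graph (for symmetric W, the stationary distribution is proportional to the weighted degrees, as derived later in the talk):

```python
import numpy as np

W = np.array([[0., 2., 1., 0.],
              [2., 0., 1., 1.],
              [1., 1., 0., 3.],
              [0., 1., 3., 0.]])   # made-up symmetric example
n = len(W)
P = W / W.sum(axis=1, keepdims=True)
pi = W.sum(axis=1) / W.sum()       # stationary distribution for symmetric W
Pi = np.outer(np.ones(n), pi)      # Π = 1 π′

# Z = −Π + (I − (P − Π))^{-1}
Z = -Pi + np.linalg.inv(np.eye(n) - (P - Pi))

# Recover hitting times from (5): Hij = (Zjj − Zij)/πj ...
H = (np.diag(Z)[None, :] - Z) / pi[None, :]

# ... and confirm they satisfy the recurrence Hij = 1 + sum_k Pik Hkj.
for i in range(n):
    for j in range(n):
        if i != j:
            assert np.isclose(H[i, j], 1 + P[i] @ H[:, j])
```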

Page 22: A random walks perspective on maximizing satisfaction and

A slight simplification

Recall that W is the weighted adjacency matrix of the association graph. From now on, assume W is symmetric (i.e. the graph is undirected).

A random walk on such a graph has transition probabilities, for i ≠ j,

    Pij = Wij/Wi,

where Wi = Σj Wij.

Page 23: A random walks perspective on maximizing satisfaction and

A slight simplification

Also assume the graph is connected and not bipartite. Then the stationary distribution of the random walk is

    πi = Wi / Σk Wk.

What about expected hitting and commute times?

It turns out that the expected hitting and commute times are captured by the graph's electrical resistance.
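A quick numerical check of this stationary distribution formula, on the same kind of made-up symmetric W:

```python
import numpy as np

W = np.array([[0., 2., 1., 0.],
              [2., 0., 1., 1.],
              [1., 1., 0., 3.],
              [0., 1., 3., 0.]])   # hypothetical undirected graph
P = W / W.sum(axis=1, keepdims=True)

# πi = Wi / Σk Wk, where Wi is the weighted degree of node i.
pi = W.sum(axis=1) / W.sum()

# π is stationary (π′P = π′) and sums to 1.
assert np.allclose(pi @ P, pi)
assert np.isclose(pi.sum(), 1.0)
```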

Page 24: A random walks perspective on maximizing satisfaction and

Connection to resistive networks

Put a resistor on each edge {i, j} with resistance Rij = 1/Wij (i.e. with conductance Wij). Now consider fixed nodes i and j.

- Inject Wi = Σk Wik current into node i.

- By Kirchhoff's current law (Iin = Iout) and Ohm's law (V = IR),

    Wi = Σ_{(i,k)∈E} Iik = Σ_{(i,k)∈E} Vik/Rik = Σ_{(i,k)∈E} Vik Wik.

- By Kirchhoff's voltage law, Vij = Vik + Vkj.

- We get Wi = Σk Wik (Vij − Vkj). After rearranging, this is

    Vij = 1 + Σk (Wik/Wi) Vkj
        = 1 + Σk Pik Vkj.

Page 25: A random walks perspective on maximizing satisfaction and

Connection to resistive networks

The recurrence relation for the voltage Vij, when Wi units of current are injected into node i, is the same as the recurrence relation for the expected hitting time Hij. So identify Vij ≡ Hij.

To get an explicit formula for the expected commute time Cij = Hij + Hji, we'll use the superposition property of linear equations (resistive networks are characterized by linear equations).

Page 26: A random walks perspective on maximizing satisfaction and

Deriving the expected commute time: four cases

Adapted from Karp (2003). The four cases (k ranges over nodes other than i and j):

    Current into i   | Current into j   | Current into k | Vij  | Vji
    Wi               | −(Σk Wk − Wj)    | Wk             | Hij  | −Hij
    −(Σk Wk − Wi)    | Wj               | Wk             | −Hji | Hji
    Σk Wk − Wi       | −Wj              | −Wk            | Hji  | −Hji
    Σk Wk            | −Σk Wk           | 0              | Cij  | −Cij

Using Ohm's law, we have the expected commute time

    Cij = (Σk Wk) Rij = 2 Wtotal Rij,

where Wtotal is the total weight of the graph, and Rij is the effective resistance between nodes i and j.
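Both identities, Cij = Hij + Hji and Cij = (Σk Wk) Rij, can be confirmed numerically. The sketch below computes effective resistance via the Moore-Penrose pseudoinverse of the graph Laplacian (an expression for Rij in terms of the Laplacian's pseudoinverse appears later in the talk), again on a made-up graph:

```python
import numpy as np

# Made-up symmetric association graph.
W = np.array([[0., 2., 1., 0.],
              [2., 0., 1., 1.],
              [1., 1., 0., 3.],
              [0., 1., 3., 0.]])
n = len(W)

# Effective resistance from the pseudoinverse of the Laplacian L = D − W.
L = np.diag(W.sum(axis=1)) - W
Lp = np.linalg.pinv(L)
E = np.eye(n)
R = np.array([[(E[i] - E[j]) @ Lp @ (E[i] - E[j]) for j in range(n)]
              for i in range(n)])

# Cij = (Σk Wk) Rij; note Σk Wk = W.sum() is twice the total edge weight.
C = W.sum() * R

# Cross-check against hitting times from the fundamental matrix Z.
P = W / W.sum(axis=1, keepdims=True)
pi = W.sum(axis=1) / W.sum()
Pi = np.outer(np.ones(n), pi)
Z = -Pi + np.linalg.inv(np.eye(n) - (P - Pi))
H = (np.diag(Z)[None, :] - Z) / pi[None, :]   # Hij = (Zjj − Zij)/πj

assert np.allclose(C, H + H.T)                # Cij = Hij + Hji
```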


Page 33: A random walks perspective on maximizing satisfaction and

Drawbacks of expected hitting and commute times

Using expected hitting and commute times as a basis forrecommendation is natural.

However, they can be dominated by the stationary distribution, sothe same popular items are recommended to everyone.

Idea: can still use expected hitting and commute times, but notdirectly.

Page 34: A random walks perspective on maximizing satisfaction and

Cosine correlation

Here is a popular idea from information retrieval.

Suppose x and y are count vectors (e.g. word counts of a document). Similarity is measured by the dot product x · y.

Problem: longer similar documents get larger dot products than shorter similar documents.

Solution: just look at the angle θ between x and y:

    x · y = ‖x‖‖y‖ cos θ
    cos θ = (x · y) / (‖x‖‖y‖)

Can we use this cosine correlation with Hij or Cij?
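A tiny illustration of why the cosine fixes the length problem: doubling a document's counts doubles the dot products but leaves the angle unchanged (the count vectors are made up):

```python
import numpy as np

def cosine(x, y):
    """cos θ = (x · y) / (‖x‖ ‖y‖)."""
    return (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))

x = np.array([3., 1., 0., 2.])
y = 2 * x                               # same word proportions, twice as long

assert np.isclose(x @ y, 2 * (x @ x))   # dot product grows with length...
assert np.isclose(cosine(x, y), 1.0)    # ...but the angle is unchanged
```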

Page 35: A random walks perspective on maximizing satisfaction and

Effective resistance as a metric

Recall: Cij = 2 Wtotal Rij. So any metric properties of R will also hold for C.

It is easy to check that

- Rij ≤ Rik + Rkj,
- Rii = 0, and
- Rij = Rji.

So the effective resistance R is a metric.

But a general metric is not enough to talk about angles; we need a Euclidean metric. We will show that the square root of effective resistance is a Euclidean metric.

Page 36: A random walks perspective on maximizing satisfaction and

The Laplacian matrix

The Laplacian matrix L of the graph is

    L = D − W,

where D = diag(W1, W2, ..., Wn). Note that diag(W) = 0.

    L = [  W1   −W12  ···  −W1n ]
        [ −W12   W2   ···  −W2n ]
        [  ···   ···  ···   ··· ]
        [ −W1n  −W2n  ···   Wn  ]

Note that the sum of each row is 0, and so is the sum of each column (by symmetry of W).

Page 37: A random walks perspective on maximizing satisfaction and

The Laplacian matrix and its pseudoinverse

The Laplacian L is

- symmetric, because the original graph is undirected (W is symmetric), and

- positive semidefinite, because

    x′Lx = Σ_{i<j} Wij (xi − xj)² ≥ 0.

L has a pseudoinverse L⁺ given by

    L⁺ = (L − (1/n)11′)^{-1} + (1/n)11′.

L⁺ is also symmetric and positive semidefinite (not obvious, but it's true).
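The closed form for L⁺ can be checked against a generic pseudoinverse routine (made-up graph again):

```python
import numpy as np

W = np.array([[0., 2., 1., 0.],
              [2., 0., 1., 1.],
              [1., 1., 0., 3.],
              [0., 1., 3., 0.]])   # hypothetical symmetric weights
n = len(W)
L = np.diag(W.sum(axis=1)) - W     # L = D − W
J = np.ones((n, n)) / n            # (1/n) 1 1′

# Closed form from the slide vs. the Moore-Penrose pseudoinverse.
Lp = np.linalg.inv(L - J) + J
assert np.allclose(Lp, np.linalg.pinv(L))
```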

Page 38: A random walks perspective on maximizing satisfaction and

A Euclidean metric from the Laplacian’s pseudoinverse

Furthermore,

    Rij = (ei − ej)′ L⁺ (ei − ej),

where ei is the ith elementary vector (1 in the ith entry, zero everywhere else). This comes from yet another, less intuitive derivation of the expected hitting time.

This is a Mahalanobis distance; since L⁺ is symmetric positive semidefinite, its square root is a Euclidean metric.

Page 39: A random walks perspective on maximizing satisfaction and

Cosine correlation for random walk

The square root of effective resistance √Rij defines a Euclidean metric, so the “angle” θij between i and j is well-defined:

    Rij = (ei − ej)′ L⁺ (ei − ej)
        = L⁺ii − 2 L⁺ij + L⁺jj

    cos θij = L⁺ij / √(L⁺ii L⁺jj).
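Putting the pieces together: L⁺, the effective resistances, and the random-walk cosine correlations, computed on a made-up graph:

```python
import numpy as np

W = np.array([[0., 2., 1., 0.],
              [2., 0., 1., 1.],
              [1., 1., 0., 3.],
              [0., 1., 3., 0.]])   # hypothetical symmetric weights
Lp = np.linalg.pinv(np.diag(W.sum(axis=1)) - W)   # L⁺

# Rij = L⁺ii − 2 L⁺ij + L⁺jj
R = np.diag(Lp)[:, None] - 2 * Lp + np.diag(Lp)[None, :]

# cos θij = L⁺ij / sqrt(L⁺ii L⁺jj)
d = np.sqrt(np.diag(Lp))
cos_theta = Lp / np.outer(d, d)

assert np.allclose(np.diag(R), 0.0)           # zero self-resistance
assert np.allclose(np.diag(cos_theta), 1.0)   # each state fully correlates with itself
```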

Page 40: A random walks perspective on maximizing satisfaction and

Interpreting cosine correlation for random walk

Identifying Rij ≡ ‖xi − xj‖², one can deduce

    ‖xi‖ = √(L⁺ii).

For large Markov chains, this is approximately the recurrence time 1/πi, a measure of “generic” popularity.

If the embedded points xi are projected onto a unit hypersphere (thus removing all “generic” popularity), then

    cos θij = 1 − d²ij/2,

where dij is the resulting Euclidean distance between i and j.
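The embedding view can also be checked numerically: take xi = (L⁺)^{1/2} ei, so that ‖xi − xj‖² = Rij and xi · xj = L⁺ij, project the points onto the unit sphere, and compare 1 − d²ij/2 with the cosine (made-up graph once more):

```python
import numpy as np

W = np.array([[0., 2., 1., 0.],
              [2., 0., 1., 1.],
              [1., 1., 0., 3.],
              [0., 1., 3., 0.]])   # hypothetical symmetric weights
Lp = np.linalg.pinv(np.diag(W.sum(axis=1)) - W)

# Embed: column i of X is xi = (L⁺)^{1/2} ei, so xi · xj = L⁺ij.
w, V = np.linalg.eigh(Lp)
X = V @ np.diag(np.sqrt(np.clip(w, 0, None))) @ V.T

norms = np.linalg.norm(X, axis=0)   # ‖xi‖ = sqrt(L⁺ii)
U = X / norms                       # project each xi onto the unit sphere
D2 = ((U[:, :, None] - U[:, None, :]) ** 2).sum(axis=0)   # d²ij

cos_theta = Lp / np.outer(norms, norms)
assert np.allclose(cos_theta, 1 - D2 / 2)   # cos θij = 1 − d²ij/2
```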

Page 41: A random walks perspective on maximizing satisfaction and

Outline

Motivation

The random walk model

Using the model

Making recommendations

Turning a profit

Evaluation

Page 42: A random walks perspective on maximizing satisfaction and

Making recommendations

Recommendations are with respect to a query state (e.g. customer, currently viewed product, search query).

Given a query state i, rank the other states j according to cos θij and recommend the top hits.

The problem is similar to semi-supervised classification (learning with both labeled and unlabeled data), and the cosine correlation is a superior similarity measure compared to other proposed methods... (on one toy example).


Page 44: A random walks perspective on maximizing satisfaction and

Semi-supervised classification

Page 45: A random walks perspective on maximizing satisfaction and

Semi-supervised classification

Page 46: A random walks perspective on maximizing satisfaction and

Turning a profit (at least in expectation)

Goal (from decision theory): “recommend the product (state) with the greatest expected profit, discounted over time.”

Let $ ∈ R^n be the profit (or loss) for each state, and e^{−β} (β > 0) be the discount factor. Then the expected discounted profit is

    v = Σ_{t=0}^{∞} e^{−tβ} P^t $
      = (Σ_{t=0}^{∞} (e^{−β}P)^t) $
      = (I − e^{−β}P)^{-1} $    since (e^{−β}P)^t → 0.

To maximize the expected discounted profit at query state i, choose

    j* = argmax_{j: Pij>0} Pij vj.
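A sketch of this machinery on a made-up chain; the profit vector and discount rate below are invented for illustration:

```python
import numpy as np

W = np.array([[0., 2., 1., 0.],
              [2., 0., 1., 1.],
              [1., 1., 0., 3.],
              [0., 1., 3., 0.]])   # hypothetical association graph
P = W / W.sum(axis=1, keepdims=True)
n = len(W)

profit = np.array([1.0, -0.5, 2.0, 0.3])   # hypothetical per-state profit
beta = 0.1                                  # discount factor e^{-beta} < 1

# v = (I − e^{−β}P)^{-1} $, solved as a linear system.
v = np.linalg.solve(np.eye(n) - np.exp(-beta) * P, profit)

# v satisfies the fixed point v = $ + e^{−β} P v.
assert np.allclose(v, profit + np.exp(-beta) * P @ v)

def recommend(i):
    """From query state i, pick the reachable state j maximizing Pij * vj."""
    scores = np.where(P[i] > 0, P[i] * v, -np.inf)
    return int(np.argmax(scores))
```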

Page 47: A random walks perspective on maximizing satisfaction and

Outline

Motivation

The random walk model

Using the model

Evaluation

Data set and model

Maximizing satisfaction

Maximizing profit

Discussion

Page 48: A random walks perspective on maximizing satisfaction and

Experimental setup: data set and model

Data comes from the MovieLens database:

- Ratings on a 1–5 scale for 1682 movies by 943 individuals
- Each individual viewed 20–737 movies (106 on average)
- Each movie received 1–583 ratings (60 on average)
- The ratings table is 93.7% empty (i.e. most viewers have not seen most movies)
- Movies are classified into 19 genres
- Individuals are classified into 2 genders, 21 vocations, and 8 overlapping age groups

Constructed an n = 2657 node graph with

    Wij = 1    if i belongs to class j,
    Wij = rij  if individual i rated movie j with rating rij.

Page 49: A random walks perspective on maximizing satisfaction and

Task 1: recommending to maximize satisfaction

Randomly partition the data into training and test sets.

1. The test set contains 10 ratings from each viewer.
2. Take the 10 top-ranked movies not in the training set as the recommendations.
3. Score the recommendations by the sum of the individual's held-out ratings for the recommended movies.

Compared different measures of similarity:

- cosine correlation
- expected commute time, expected hitting time
- stationary distribution
- normalized hitting time, normalized commute time

Page 50: A random walks perspective on maximizing satisfaction and

Task 1 results

Compared the average score across all 943 individuals and 500 trials.

- Scores range from 0 to 50.
- An omniscient oracle scores ≤ 35.3 on average (due to the sparsity of the data and the low average rating).
- Random recommendations score 2.2 on average.

Page 51: A random walks perspective on maximizing satisfaction and

Task 1 results

Page 52: A random walks perspective on maximizing satisfaction and

Task 2: recommending to maximize profit

Similar setup as before, except the scoring is changed:

1. A priori, randomly assign each movie j a profit pj ~ N(0, 1).
2. For t = 1 to 10:
   a. Recommend a movie.
   b. If the movie is in the individual's held-out set, receive profit e^{−tβ} pj.

Compared different recommenders:

- maximum expected discounted profit
- cosine correlation, but restricted to movies with positive profit
- cosine correlation, allowing all movies
- expected commute time, expected hitting time, stationary distribution

Page 53: A random walks perspective on maximizing satisfaction and

Task 2 results

Page 54: A random walks perspective on maximizing satisfaction and

Discussion

Cosine correlation performs significantly better than the stationary distribution. This suggests that it is sensitive to individual preferences.

Issues with the experimental study:

- Recommended movies that an individual did not rate in the test set were given a score of 0.
- Only “random walk”-related similarity measures were considered.

Page 55: A random walks perspective on maximizing satisfaction and

References

D. Aldous and J. Fill, Reversible Markov Chains and Random Walks on Graphs. Monograph in preparation, www.stat.berkeley.edu/users/aldous/RWG/book.html.

M. Brand, A random walks perspective on maximizing satisfaction and profit. In Proceedings of the SIAM International Conference on Data Mining, 2005.

F. Fouss, A. Pirotte, J. Renders, and M. Saerens, A novel way of computing dissimilarities between nodes of a graph, with application to collaborative filtering. In Proceedings of the ECML Workshop on Statistical Approaches for Web Mining, 2004.

R. Karp, Lecture. U.C. Berkeley, November 12, 2003.