TRANSCRIPT
A random walks perspective on maximizing satisfaction and profit
Matthew Brand
SIAM International Conference on Data Mining, April 21-23, 2005
May 31, 2005
Presented by Daniel Hsu (djhsu@cs) for CSE 254
Outline
Motivation
Collaborative filtering
Basis for recommendation
The random walk model
Using the model
Evaluation
Motivation: collaborative filtering
“Hello, Daniel Hsu. We have recommendations for you.”
Motivation: collaborative filtering
Everyone wins!
- Daniel is more likely to find products he will like.
- Amazon.com is more likely to sell products to Daniel.
How can Amazon.com achieve this glorious end?
Should Amazon.com just recommend The Da Vinci Code to everyone?
Motivation: basis for recommendation
How can Amazon.com decide which recommendations to make?
- (Satisfaction) “Many customers who bought Debussy: Piano Works also bought Satie: Piano Works. . . People who like Debussy also like Satie. . . Daniel will like Satie: Piano Works.”
- (Profit) “Also, we (Amazon.com) make a huge profit margin on Satie: Piano Works, so let’s try to sell as many copies of this disc as possible.”
Outline
Motivation
The random walk model
Association graph and Markov chain
Expected hitting and commute times
Connection to resistive networks
Random walk correlation
Using the model
Evaluation
Model: association graph
Let W ∈ R_+^{n×n} be the weighted adjacency matrix of an association graph.

- Example 1: vertices are events, weight of edge (i, j) is how many times event j followed event i.
- Example 2: vertices are people and movies, weight of edge (i, j) is the rating person i gave movie j.

Let P = diag(W1)^{-1} W be the row-normalized version of W. Then we can think of P as the transition matrix of a Markov chain.
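For concreteness, the row normalization can be sketched in a few lines of NumPy (the 3-node graph below is made up):

```python
import numpy as np

# Hypothetical weighted adjacency matrix of a small association graph
# (3 nodes; weights are made up for illustration).
W = np.array([[0.0, 2.0, 1.0],
              [2.0, 0.0, 3.0],
              [1.0, 3.0, 0.0]])

# P = diag(W1)^{-1} W: divide each row of W by its row sum.
P = W / W.sum(axis=1, keepdims=True)
```

Each row of P is then a probability distribution over next states.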
An example association graph
[Figure: an association graph linking individuals (individual 314, individual 315, ...), attribute nodes (18-25 year old, female, student, bus driver, ...), and movies (Star Wars Ep. 3, Napoleon Dynamite, Sleepless in Seattle, The Shawshank Redemption, ...), with edge weights given by ratings (e.g. 1, 2, 3, 5, 5).]
The main assumption
“a random walk on this Markov chain will mimic, over the short term, the behavior of individuals randomly drawn from this population.”
Further assumptions and consequences
Let (X_t : t ≥ 0) be an irreducible and aperiodic Markov chain with transition matrix P. Then the chain has a unique stationary distribution π:

- π_j ≥ 0 for each j
- Σ_j π_j = 1
- π′P = π′
Expected hitting and commute times
Suppose the chain is in state i.

- Expected hitting time H_ij: How long does it take, on average, to reach state j?
- Expected commute time C_ij: How long does it take, on average, to reach state j and then state i?
- C_ij = H_ij + H_ji = C_ji

Both H_ij and C_ij have been previously proposed as a basis for making recommendations. But how are they computed?

Glimpse into the future: the newly proposed basis is also derived from the expected commute times.
A recurrence relation for expected hitting time
Let the random variable T_{j|i} be the time to reach state j starting in state i. If i ≠ j, then, conditioning on the next state k (so P_ik > 0),

T_{j|i} = 1 + T_{j|k}.

Then, using conditional expectations,

H_ij = E[T_{j|i}]
     = 1 + Σ_{k : P_ik > 0} Pr(next state is k | in state i) E[T_{j|k}]
     = 1 + Σ_k P_ik H_kj.
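This recurrence, together with the boundary condition H_jj = 0, determines H: for each target j it is a small linear system. A sketch on a made-up 3-node chain (not from the paper):

```python
import numpy as np

# Toy symmetric association graph (made-up weights) and its walk matrix.
W = np.array([[0.0, 2.0, 1.0],
              [2.0, 0.0, 3.0],
              [1.0, 3.0, 0.0]])
P = W / W.sum(axis=1, keepdims=True)
n = len(P)

# For target j, the recurrence H_ij = 1 + sum_k P_ik H_kj (i != j)
# with H_jj = 0 becomes (I - P_restricted) h = 1, where P_restricted
# drops row and column j.
H = np.zeros((n, n))
for j in range(n):
    idx = [i for i in range(n) if i != j]
    A = np.eye(n - 1) - P[np.ix_(idx, idx)]
    H[idx, j] = np.linalg.solve(A, np.ones(n - 1))
```

The solution satisfies the recurrence exactly off the diagonal.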
An identity for the frequency of a state
Now we’ll derive a direct expression for hitting time (adapted from Aldous and Fill). We’ll use the following lemma:

Lemma (“Occupation measure identity”)
Consider the Markov chain (X_t : t ≥ 0) with stationary distribution π started at state i. Let 0 < S < ∞ be a random stopping time such that X_S = i and E[S | X_0 = i] < ∞. Then for any state j,

E[# of visits to state j before time S | X_0 = i] = π_j E[S | X_0 = i].

For succinctness, write this as E_i[#j before S] = π_j E_i[S]. We count visits at time 0 and exclude visits at time S.
Using the identity
Occupation measure identity: E_i[#j before S] = π_j E_i[S].

Define T_i = min{t ≥ 0 : X_t = i} as the first hitting time of state i, and T_i^+ = min{t ≥ 1 : X_t = i} as the first return time to state i. Note: T_i and T_i^+ are the same unless X_0 = i.

Warm-up: Let S = T_i^+. Then E_i[#i before T_i^+] = 1, so 1 = π_i E_i[T_i^+]. That is,

E_i[T_i^+] = 1/π_i.   (1)

For S = T_i^+ and j ≠ i, use the lemma and (1) to get

E_i[#j before T_i^+] = π_j E_i[T_i^+] = π_j/π_i.   (2)
Using the identity
Occupation measure identity: E_i[#j before S] = π_j E_i[S].

Let S = the first return to i after the first visit to j (j ≠ i). Then

E_i[S] = E_i[T_j] + E_j[T_i]

and

E_i[#j before S] = E_i[#j before T_j] + E_j[#j before T_i].

But E_i[#j before T_j] = 0, so

E_j[#j before T_i] = π_j (E_i[T_j] + E_j[T_i]).   (3)
Using the identity
Use the notation E_ρ[·] for the expectation given that the state at time 0 is distributed according to ρ.

Let t_0 ≥ 1 and let S be the time of the following:

1. wait time t_0, then
2. wait until the chain hits i.

Let V_t be the random variable that indicates whether i is visited at time t. Then Σ_{t=0}^{t_0−1} V_t is the number of visits to i before S. Now, using the identity,

Σ_{t=0}^{t_0−1} E_i[V_t] = Σ_{t=0}^{t_0−1} (P^t)_ii = π_i (t_0 + E_ρ[T_i])

with ρ_k = Pr(X_{t_0} = k | X_0 = i).
Using the identity
Rearranging

Σ_{t=0}^{t_0−1} (P^t)_ii = π_i (t_0 + E_ρ[T_i])

to get

Σ_{t=0}^{t_0−1} [(P^t)_ii − π_i] = π_i E_ρ[T_i]

and letting t_0 → ∞, we get

Z_ii = π_i E_π[T_i],   (4)

where Z_ij = Σ_{t=0}^∞ [(P^t)_ij − π_j].
Using the identity
To actually get an expression for E_i[T_j], this time let S be the time of the following:

1. wait until the chain hits i,
2. then wait time t_0 ≥ 1, and then
3. finally wait until the chain hits j.

The occupation measure identity says E_j[#j before S] = π_j E_j[S].

Note that

E_j[S] = E_j[T_i] + t_0 + E_ρ[T_j],

where ρ_k = Pr(X_{t_0} = k | X_0 = i), and

E_j[#j before S] = E_j[#j before T_i] + Σ_{t=0}^{t_0−1} (P^t)_ij.
Using the identity
Then, using (3), rearranging, and letting t_0 → ∞:

E_j[#j before T_i] + Σ_{t=0}^{t_0−1} (P^t)_ij = π_j (E_j[T_i] + t_0 + E_ρ[T_j])

π_j (E_j[T_i] + E_i[T_j]) + Σ_{t=0}^{t_0−1} (P^t)_ij = π_j (E_j[T_i] + t_0 + E_ρ[T_j])

Σ_{t=0}^{t_0−1} [(P^t)_ij − π_j] = π_j (E_ρ[T_j] − E_i[T_j])

Z_ij = π_j (E_π[T_j] − E_i[T_j]).

Finally, using (4), we have

Z_jj − Z_ij = π_j E_i[T_j] = π_j H_ij.   (5)
Computing the expected hitting time
In order to use (5) to compute H_ij = E_i[T_j], we need to compute Z.

Let Π = 1π′. Then

Z = Σ_{t=0}^∞ (P^t − Π)
  = (P^0 − Π) + Σ_{t=1}^∞ (P^t − Π)
  = (I − Π) + Σ_{t=1}^∞ (P − Π)^t   (check by induction)
  = −Π + Σ_{t=0}^∞ (P − Π)^t
  = −Π + (I − (P − Π))^{-1}   (since P^t − Π → 0).

Note: Brand says Z = (I − P − Π)^{-1}, which is probably wrong.
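On a toy chain (made-up weights), the closed form for Z and identity (5) can be checked numerically; the hitting times it produces satisfy the recurrence H_ij = 1 + Σ_k P_ik H_kj from the earlier slide:

```python
import numpy as np

# Toy symmetric association graph (made-up weights).
W = np.array([[0.0, 2.0, 1.0],
              [2.0, 0.0, 3.0],
              [1.0, 3.0, 0.0]])
P = W / W.sum(axis=1, keepdims=True)
n = len(P)

# Stationary distribution (degree-proportional, since W is symmetric).
pi = W.sum(axis=1) / W.sum()

# Z = -Pi + (I - (P - Pi))^{-1}, with Pi = 1 pi'.
Pi = np.outer(np.ones(n), pi)
Z = -Pi + np.linalg.inv(np.eye(n) - (P - Pi))

# Expected hitting times via (5): H_ij = (Z_jj - Z_ij) / pi_j.
H = (np.diag(Z)[None, :] - Z) / pi[None, :]
```

Off the diagonal, 1 + (PH)_ij = H_ij; on the diagonal, 1 + (PH)_jj is the return time 1/π_j from (1).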
A slight simplification
Recall that W is the weighted adjacency matrix of the association graph. From now on, assume W is symmetric (i.e. the graph is undirected).

A random walk on such a graph has transition probabilities, for i ≠ j,

P_ij = W_ij / W_i,

where W_i = Σ_j W_ij.
A slight simplification
Also assume the graph is connected and not bipartite. Then the stationary distribution of the random walk is

π_i = W_i / Σ_k W_k.
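This degree-proportional stationary distribution is easy to verify numerically (again on a made-up 3-node graph):

```python
import numpy as np

# Toy symmetric association graph (made-up weights).
W = np.array([[0.0, 2.0, 1.0],
              [2.0, 0.0, 3.0],
              [1.0, 3.0, 0.0]])
P = W / W.sum(axis=1, keepdims=True)

# pi_i = W_i / sum_k W_k
pi = W.sum(axis=1) / W.sum()
```

The check π′P = π′ below is exactly the stationarity condition.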
What about expected hitting and commute times?
It turns out that the expected hitting and commute times are captured by the graph’s electrical resistance.
Connection to resistive networks
Put a resistor on each edge {i, j} with resistance R_ij = 1/W_ij (i.e. with conductance W_ij). Now consider fixed nodes i and j.

- Inject W_i = Σ_k W_ik current into node i.
- By Kirchhoff’s current law (I_in = I_out) and Ohm’s law (V = IR),

  W_i = Σ_{(i,k)∈E} I_ik = Σ_{(i,k)∈E} V_ik / R_ik = Σ_{(i,k)∈E} V_ik W_ik.

- By Kirchhoff’s voltage law, V_ij = V_ik + V_kj.
- We get W_i = Σ_k W_ik (V_ij − V_kj). After rearranging, this is

  V_ij = 1 + Σ_k (W_ik / W_i) V_kj = 1 + Σ_k P_ik V_kj.
Connection to resistive networks
The recurrence relation for the voltage V_ij when W_i units of current are injected into node i is the same as the recurrence relation for the expected hitting time H_ij. So identify V_ij ≡ H_ij.

To get an explicit formula for the expected commute time C_ij = H_ij + H_ji, we’ll use the superposition property of linear equations (resistive networks are characterized by linear equations).
Deriving the expected commute time: four cases
Adapted from Karp (2003). Inject the following currents, where k ranges over the nodes other than i and j:

Current into i     | Current into j     | Current into each k | V_ij  | V_ji
W_i                | −(Σ_k W_k − W_j)   | W_k                 | H_ij  | −H_ij
−(Σ_k W_k − W_i)   | W_j                | W_k                 | −H_ji | H_ji
Σ_k W_k − W_i      | −W_j               | −W_k                | H_ji  | −H_ji
Σ_k W_k            | −Σ_k W_k           | 0                   | C_ij  | −C_ij

(The third row is the negation of the second; the fourth row is the sum of the first and third.) Using Ohm’s law, we have the expected commute time

C_ij = (Σ_k W_k) R_ij = 2 W_total R_ij,

where W_total is the total weight of the graph, and R_ij is the effective resistance between nodes i and j.
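Numerically (same made-up toy graph), the commute times obtained from the fundamental matrix Z agree with (Σ_k W_k) R_ij, where R is computed from the pseudoinverse of the graph Laplacian, a route developed later in the deck:

```python
import numpy as np

# Toy symmetric association graph (made-up weights).
W = np.array([[0.0, 2.0, 1.0],
              [2.0, 0.0, 3.0],
              [1.0, 3.0, 0.0]])
P = W / W.sum(axis=1, keepdims=True)
n = len(W)
pi = W.sum(axis=1) / W.sum()

# Hitting times via the fundamental matrix Z (identity (5)).
Pi = np.outer(np.ones(n), pi)
Z = -Pi + np.linalg.inv(np.eye(n) - (P - Pi))
H = (np.diag(Z)[None, :] - Z) / pi[None, :]

# Effective resistance from the Laplacian pseudoinverse:
# R_ij = L+_ii - 2 L+_ij + L+_jj.
L = np.diag(W.sum(axis=1)) - W
Lp = np.linalg.pinv(L)
R = np.diag(Lp)[:, None] + np.diag(Lp)[None, :] - 2.0 * Lp

# C_ij = (sum_k W_k) R_ij should equal H_ij + H_ji.
C = W.sum() * R
```

For the triangle above, R_01 is the parallel combination of the direct edge (resistance 1/2) and the two-hop path (resistance 1 + 1/3), i.e. 4/11.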
Drawbacks of expected hitting and commute times
Using expected hitting and commute times as a basis for recommendation is natural.

However, they can be dominated by the stationary distribution, so the same popular items are recommended to everyone.

Idea: we can still use expected hitting and commute times, but not directly.
Cosine correlation
Here is a popular idea from information retrieval.
Suppose x and y are count vectors (e.g. word counts of a document). Similarity is measured by the dot product x · y.

Problem: longer similar documents get larger dot products than shorter similar documents.

Solution: just look at the angle θ between x and y:

x · y = ‖x‖ ‖y‖ cos θ, so cos θ = (x · y) / (‖x‖ ‖y‖).

Can we use this cosine correlation with H_ij or C_ij?
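As a quick sanity check on the length-invariance point, here is the cosine of two made-up count vectors, one document twice as long as the other but with the same topic mix:

```python
import numpy as np

# Hypothetical word-count vectors: same direction, different lengths.
x = np.array([3.0, 1.0, 0.0])
y = np.array([6.0, 2.0, 0.0])   # twice as long

dot = x @ y
cos_theta = dot / (np.linalg.norm(x) * np.linalg.norm(y))
```

The dot product doubles with document length, but the cosine stays at 1.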
Effective resistance as a metric
Recall: C_ij = 2 W_total R_ij. So any metric properties of R will also hold for C.

It is easy to check that

- R_ij ≤ R_ik + R_kj,
- R_ii = 0, and
- R_ij = R_ji.

So the effective resistance R is a metric.

But a general metric is not enough to talk about angles; we need a Euclidean metric. We will show that the square root of effective resistance is a Euclidean metric.
The Laplacian matrix
The Laplacian matrix L of the graph is

L = D − W,

where D = diag(W_1, W_2, . . . , W_n). Note that diag(W) = 0, so

L = [  W_1    −W_12   . . .   −W_1n
      −W_12    W_2    . . .   −W_2n
        ⋮        ⋮      ⋱        ⋮
      −W_1n   −W_2n   . . .    W_n  ].

Note that the sum of each row is 0, and so is the sum of each column (by symmetry of W).
The Laplacian matrix and its pseudoinverse
The Laplacian L is

- symmetric, because the original graph is undirected, and
- positive semidefinite, because

  x′ L x = Σ_{i<j} W_ij (x_i − x_j)² ≥ 0.

L has a pseudoinverse L+ given by

L+ = (L − (1/n) 1 1′)^{-1} + (1/n) 1 1′.

L+ is also symmetric and positive semidefinite (not obvious, but it’s true).
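The closed form for L+ can be checked against a generic pseudoinverse routine (toy made-up graph; the formula requires the graph to be connected):

```python
import numpy as np

# Toy symmetric association graph (made-up weights) and its Laplacian.
W = np.array([[0.0, 2.0, 1.0],
              [2.0, 0.0, 3.0],
              [1.0, 3.0, 0.0]])
L = np.diag(W.sum(axis=1)) - W
n = len(L)
J = np.ones((n, n)) / n   # (1/n) 1 1'

# L+ = (L - (1/n)11')^{-1} + (1/n)11'
Lp = np.linalg.inv(L - J) + J
```

Subtracting (1/n)11′ shifts the zero eigenvalue (whose eigenvector is 1) away from zero so the matrix becomes invertible; adding it back afterwards restores the zero on that direction.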
A Euclidean metric from the Laplacian’s pseudoinverse
Furthermore,

R_ij = (e_i − e_j)′ L+ (e_i − e_j),

where e_i is the ith elementary vector (1 in the ith entry, zero everywhere else). This comes from yet another, less intuitive derivation of expected hitting time.

This is a Mahalanobis distance, and since L+ is symmetric positive semidefinite, its square root √R_ij is a Euclidean metric.
Cosine correlation for random walk
The square root of effective resistance √R_ij defines a Euclidean metric, so the “angle” θ_ij between i and j is well-defined:

R_ij = (e_i − e_j)′ L+ (e_i − e_j) = L+_ii − 2 L+_ij + L+_jj

cos θ_ij = L+_ij / √(L+_ii L+_jj)
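Putting the pieces together (same made-up toy graph), both quantities come straight from L+:

```python
import numpy as np

# Toy symmetric association graph (made-up weights).
W = np.array([[0.0, 2.0, 1.0],
              [2.0, 0.0, 3.0],
              [1.0, 3.0, 0.0]])
L = np.diag(W.sum(axis=1)) - W
Lp = np.linalg.pinv(L)

# R_ij = L+_ii - 2 L+_ij + L+_jj
R = np.diag(Lp)[:, None] + np.diag(Lp)[None, :] - 2.0 * Lp

# cos(theta_ij) = L+_ij / sqrt(L+_ii L+_jj)
norms = np.sqrt(np.diag(Lp))
cos_theta = Lp / np.outer(norms, norms)
```

Since L+ is positive semidefinite, Cauchy-Schwarz guarantees the entries of cos_theta lie in [−1, 1].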
Interpreting cosine correlation for random walk
Identifying R_ij ≡ ‖x_i − x_j‖², one can deduce

‖x_i‖ = √(L+_ii).

For large Markov chains, this is approximately the recurrence time 1/π_i, a measure of “generic” popularity.

If the embedded points x_i are projected onto a unit hypersphere (thus removing all “generic” popularity), then

cos θ_ij = 1 − d²_ij / 2,

where d_ij is the resulting Euclidean distance between i and j.
Outline
Motivation
The random walk model
Using the model
Making recommendations
Turning a profit
Evaluation
Making recommendations
Recommendations are with respect to a query state (e.g. customer, currently viewed product, search query).

Given a query state i, rank the other states j according to cos θ_ij and recommend the top hits.

The problem is similar to semi-supervised classification (learning with both labeled and unlabeled data), and the cosine correlation is a superior similarity measure compared to other proposed methods. . . (on one toy example).
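A minimal sketch of the ranking step (same made-up toy graph; in the real system the query state would be a customer or product node):

```python
import numpy as np

# Toy symmetric association graph (made-up weights).
W = np.array([[0.0, 2.0, 1.0],
              [2.0, 0.0, 3.0],
              [1.0, 3.0, 0.0]])
L = np.diag(W.sum(axis=1)) - W
Lp = np.linalg.pinv(L)

norms = np.sqrt(np.diag(Lp))
cos_theta = Lp / np.outer(norms, norms)

# Rank all other states by cosine correlation with query state i.
i = 0
ranking = [int(j) for j in np.argsort(-cos_theta[i]) if j != i]
```

The top of `ranking` is the recommendation list for state i.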
Semi-supervised classification
Turning a profit (at least in expectation)
Goal (from decision theory): “recommend the product (state) with the greatest expected profit, discounted over time.”

Let $ ∈ R^n be the vector of profit (or loss) for each state, and e^{−β} (β > 0) be the discount factor. Then the expected discounted profit is

v = Σ_{t=0}^∞ e^{−tβ} P^t $
  = (Σ_{t=0}^∞ (e^{−β} P)^t) $
  = (I − e^{−β} P)^{-1} $,   since (e^{−β} P)^t → 0.

To maximize expected discounted profit at query state i, recommend

argmax_{j : P_ij > 0} P_ij v_j.
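A sketch of the profit computation (same made-up toy graph; the per-state profits and discount rate are invented for illustration):

```python
import numpy as np

# Toy symmetric association graph (made-up weights) and its walk matrix.
W = np.array([[0.0, 2.0, 1.0],
              [2.0, 0.0, 3.0],
              [1.0, 3.0, 0.0]])
P = W / W.sum(axis=1, keepdims=True)
n = len(P)

profit = np.array([1.0, -0.5, 2.0])   # hypothetical per-state profits $
beta = 0.1                            # discount factor is e^{-beta}

# v = (I - e^{-beta} P)^{-1} $, solved as a linear system.
v = np.linalg.solve(np.eye(n) - np.exp(-beta) * P, profit)

# At query state i, recommend argmax over reachable j of P_ij v_j.
i = 0
scores = np.where(P[i] > 0, P[i] * v, -np.inf)
best = int(np.argmax(scores))
```

Solving the linear system avoids summing the series explicitly; v satisfies the fixed-point equation v = $ + e^{−β} P v.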
Outline
Motivation
The random walk model
Using the model
Evaluation
Data set and model
Maximizing satisfaction
Maximizing profit
Discussion
Experimental setup: data set and model
Data comes from the MovieLens database:
- Ratings on 1–5 scale for 1682 movies by 943 individuals
- Each individual viewed 20–737 movies (106 on average)
- Each movie received 1–583 ratings (60 on average)
- Ratings table is 93.7% empty (i.e. most viewers have not seen most movies)
- Classify movies into 19 genres
- Classify individuals into 2 genders, 21 vocations, 8 overlapping age groups
Constructed an n = 2657 node graph with

W_ij = { 1     if i belongs in class j,
         r_ij  if individual i rates movie j with rating r_ij. }
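A tiny sketch of this graph construction, with made-up ratings and class memberships (2 viewers, 2 movies, 1 genre node; the real graph has 2657 nodes):

```python
import numpy as np

n_people, n_movies, n_classes = 2, 2, 1
n = n_people + n_movies + n_classes   # 5 nodes total

W = np.zeros((n, n))

def connect(i, j, w):
    # The association graph is undirected, so set both directions.
    W[i, j] = W[j, i] = w

# Ratings r_ij on a 1-5 scale: W_ij = r_ij.
connect(0, 2, 5.0)   # person 0 rated movie 0 as 5
connect(0, 3, 2.0)   # person 0 rated movie 1 as 2
connect(1, 3, 4.0)   # person 1 rated movie 1 as 4

# Class membership: W_ij = 1 if i belongs to class j.
connect(2, 4, 1.0)   # movie 0 belongs to genre 0
connect(3, 4, 1.0)   # movie 1 belongs to genre 0
```

People, movies, and classes all become states of one random walk over this W.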
Task 1: recommending to maximize satisfaction
Randomly partition data into training and test sets.
1. Test set contains 10 ratings from each viewer.
2. Take the 10 top-ranked movies not in the training set as the recommendations.
3. Score the recommendations with the sum of the individual's held-out ratings for the recommended movies.
Compared different similarity measures:

- cosine correlation
- expected commute time, expected hitting time
- stationary distribution
- normalized hitting time, normalized commute time
Task 1 results
Compared average score across all 943 individuals and 500 trials.
- Score ranges from 0 to 50.
- An omniscient oracle scores ≤ 35.3 on average (due to sparsity of data and low average rating).
- Random recommendations score 2.2 on average.
Task 2: recommending to maximize profit
Similar setup as before, except the scoring is changed:
1. A priori, randomly assign each movie j a profit p_j ~ N(0, 1).
2. For t = 1 to 10:
   a. Recommend a movie.
   b. If the recommended movie j is in the individual's held-out set, receive profit e^{−tβ} p_j.
Compared different recommenders:
- maximum expected discounted profit
- cosine correlation, but only movies with positive profit
- cosine correlation, allowing all movies
- expected commute time, expected hitting time, stationary distribution
Task 2 results
Discussion
Cosine correlation performs significantly better than the stationary distribution. This suggests that it is sensitive to individual preferences.
Issues with experimental study:
- Recommended movies not rated by the individual in the test set were given a score of 0.
- Only considered “random walk”-related similarity measures.
References
D. Aldous and J. Fill, Reversible Markov Chains and Random Walks on Graphs. Monograph in preparation, www.stat.berkeley.edu/users/aldous/RWG/book.html.

M. Brand, A random walks perspective on maximizing satisfaction and profit. In Proceedings of the SIAM International Conference on Data Mining, 2005.

F. Fouss, A. Pirotte, J. Renders, and M. Saerens, A Novel Way of Computing Dissimilarities Between Nodes of a Graph, with Application to Collaborative Filtering. In Proceedings of the ECML Workshop on Statistical Approaches for Web Mining, 2004.
R. Karp, Lecture. U.C. Berkeley, November 12, 2003.