TRANSCRIPT
A random walks perspective on maximizing satisfaction and profit
Matthew Brand
SIAM International Conference on Data Mining, April 21-23, 2005
May 31, 2005
Presented by Daniel Hsu (djhsu@cs) for CSE 254
Outline
Motivation
Collaborative filtering
Basis for recommendation
The random walk model
Using the model
Evaluation
Motivation: collaborative filtering
“Hello, Daniel Hsu. We have recommendations for you.”
Motivation: collaborative filtering
Everyone wins!
- Daniel is more likely to find products he will like.
- Amazon.com is more likely to sell products to Daniel.
How can Amazon.com achieve this glorious end?
Should Amazon.com just recommend The Da Vinci Code to everyone?
Motivation: basis for recommendation
How can Amazon.com decide which recommendations to make?
- (Satisfaction) “Many customers who bought Debussy: Piano Works also bought Satie: Piano Works. . . People who like Debussy also like Satie. . . Daniel will like Satie: Piano Works.”
- (Profit) “Also, we (Amazon.com) make a huge profit margin on Satie: Piano Works, so let’s try to sell as many copies of this disc as possible.”
Outline
Motivation
The random walk model
Association graph and Markov chain
Expected hitting and commute times
Connection to resistive networks
Random walk correlation
Using the model
Evaluation
Model: association graph
Let W ∈ R_+^{n×n} be the weighted adjacency matrix of an association graph.

- Example 1: vertices are events, weight of edge (i, j) is how many times event j followed event i.
- Example 2: vertices are people and movies, weight of edge (i, j) is the rating person i gave movie j.

Let P = diag(W1)^{-1} W be the row-normalized version of W. Then we can think of P as the transition matrix of a Markov chain.
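For concreteness, the row normalization can be sketched in a few lines of NumPy (the 3-node graph below is made up):

```python
import numpy as np

# Hypothetical weighted adjacency matrix of a small association graph
# (3 nodes; weights are made up for illustration).
W = np.array([[0.0, 2.0, 1.0],
              [2.0, 0.0, 3.0],
              [1.0, 3.0, 0.0]])

# P = diag(W1)^{-1} W: divide each row of W by its row sum.
P = W / W.sum(axis=1, keepdims=True)
```

Each row of P is then a probability distribution over next states.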
An example association graph
[Figure: an association graph linking individuals (individual 314, individual 315, ...), attribute nodes (18-25 year old, female, student, bus driver, ...), and movies (Star Wars Ep. 3, Napoleon Dynamite, Sleepless in Seattle, The Shawshank Redemption, ...), with edge weights given by ratings (e.g. 1, 2, 3, 5, 5).]
The main assumption
“a random walk on this Markov chain will mimic, over the short term, the behavior of individuals randomly drawn from this population.”
Further assumptions and consequences
Let (X_t : t ≥ 0) be an irreducible and aperiodic Markov chain with transition matrix P. Then the chain has a unique stationary distribution π:

- π_j ≥ 0 for each j
- Σ_j π_j = 1
- π′P = π′
Expected hitting and commute times
Suppose the chain is in state i.

- Expected hitting time H_ij: How long does it take, on average, to reach state j?
- Expected commute time C_ij: How long does it take, on average, to reach state j and then state i?
- C_ij = H_ij + H_ji = C_ji

Both H_ij and C_ij have been previously proposed as a basis for making recommendations. But how are they computed?

Glimpse into the future: the newly proposed basis is also derived from the expected commute times.
A recurrence relation for expected hitting time
Let the random variable T_{j|i} be the time to reach state j starting in state i. If i ≠ j, then, conditioning on the next state k (so P_ik > 0),

T_{j|i} = 1 + T_{j|k}.

Then, using conditional expectations,

H_ij = E[T_{j|i}]
     = 1 + Σ_{k : P_ik > 0} Pr(next state is k | in state i) E[T_{j|k}]
     = 1 + Σ_k P_ik H_kj.
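This recurrence, together with the boundary condition H_jj = 0, determines H: for each target j it is a small linear system. A sketch on a made-up 3-node chain (not from the paper):

```python
import numpy as np

# Toy symmetric association graph (made-up weights) and its walk matrix.
W = np.array([[0.0, 2.0, 1.0],
              [2.0, 0.0, 3.0],
              [1.0, 3.0, 0.0]])
P = W / W.sum(axis=1, keepdims=True)
n = len(P)

# For target j, the recurrence H_ij = 1 + sum_k P_ik H_kj (i != j)
# with H_jj = 0 becomes (I - P_restricted) h = 1, where P_restricted
# drops row and column j.
H = np.zeros((n, n))
for j in range(n):
    idx = [i for i in range(n) if i != j]
    A = np.eye(n - 1) - P[np.ix_(idx, idx)]
    H[idx, j] = np.linalg.solve(A, np.ones(n - 1))
```

The solution satisfies the recurrence exactly off the diagonal.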
An identity for the frequency of a state
Now we’ll derive a direct expression for hitting time (adapted from Aldous and Fill). We’ll use the following lemma:

Lemma (“Occupation measure identity”)
Consider the Markov chain (X_t : t ≥ 0) with stationary distribution π started at state i. Let 0 < S < ∞ be a random stopping time such that X_S = i and E[S | X_0 = i] < ∞. Then for any state j,

E[# of visits to state j before time S | X_0 = i] = π_j E[S | X_0 = i].

For succinctness, write this as E_i[#j before S] = π_j E_i[S]. We count visits at time 0 and exclude visits at time S.
Using the identity
Occupation measure identity: E_i[#j before S] = π_j E_i[S].

Define T_i = min{t ≥ 0 : X_t = i} as the first hitting time of state i, and T_i^+ = min{t ≥ 1 : X_t = i} as the first return time to state i. Note: T_i and T_i^+ are the same unless X_0 = i.

Warm-up: Let S = T_i^+. Then E_i[#i before T_i^+] = 1, so 1 = π_i E_i[T_i^+]. That is,

E_i[T_i^+] = 1/π_i.   (1)

For S = T_i^+ and j ≠ i, use the lemma and (1) to get

E_i[#j before T_i^+] = π_j E_i[T_i^+] = π_j/π_i.   (2)
Using the identity
Occupation measure identity: E_i[#j before S] = π_j E_i[S].

Let S = the first return to i after the first visit to j (j ≠ i). Then

E_i[S] = E_i[T_j] + E_j[T_i]

and

E_i[#j before S] = E_i[#j before T_j] + E_j[#j before T_i].

But E_i[#j before T_j] = 0, so

E_j[#j before T_i] = π_j (E_i[T_j] + E_j[T_i]).   (3)
Using the identity
Use the notation E_ρ[·] for the expectation given that the state at time 0 is distributed according to ρ.

Let t_0 ≥ 1 and let S be the time of the following:

1. wait time t_0, then
2. wait until the chain hits i.

Let V_t be the random variable that indicates whether i is visited at time t. Then Σ_{t=0}^{t_0−1} V_t is the number of visits to i before S. Now, using the identity,

Σ_{t=0}^{t_0−1} E_i[V_t] = Σ_{t=0}^{t_0−1} (P^t)_ii = π_i (t_0 + E_ρ[T_i])

with ρ_k = Pr(X_{t_0} = k | X_0 = i).
Using the identity
Rearranging

Σ_{t=0}^{t_0−1} (P^t)_ii = π_i (t_0 + E_ρ[T_i])

to get

Σ_{t=0}^{t_0−1} [(P^t)_ii − π_i] = π_i E_ρ[T_i]

and letting t_0 → ∞, we get

Z_ii = π_i E_π[T_i],   (4)

where Z_ij = Σ_{t=0}^∞ [(P^t)_ij − π_j].
Using the identity
To actually get an expression for E_i[T_j], this time let S be the time of the following:

1. wait until the chain hits i,
2. then wait time t_0 ≥ 1, and then
3. finally wait until the chain hits j.

The occupation measure identity says E_j[#j before S] = π_j E_j[S].

Note that

E_j[S] = E_j[T_i] + t_0 + E_ρ[T_j],

where ρ_k = Pr(X_{t_0} = k | X_0 = i), and

E_j[#j before S] = E_j[#j before T_i] + Σ_{t=0}^{t_0−1} (P^t)_ij.
Using the identity
Then, using (3), rearranging, and letting t_0 → ∞:

E_j[#j before T_i] + Σ_{t=0}^{t_0−1} (P^t)_ij = π_j (E_j[T_i] + t_0 + E_ρ[T_j])

π_j (E_j[T_i] + E_i[T_j]) + Σ_{t=0}^{t_0−1} (P^t)_ij = π_j (E_j[T_i] + t_0 + E_ρ[T_j])

Σ_{t=0}^{t_0−1} [(P^t)_ij − π_j] = π_j (E_ρ[T_j] − E_i[T_j])

Z_ij = π_j (E_π[T_j] − E_i[T_j]).

Finally, using (4), we have

Z_jj − Z_ij = π_j E_i[T_j] = π_j H_ij.   (5)
Computing the expected hitting time
In order to use (5) to compute H_ij = E_i[T_j], we need to compute Z.

Let Π = 1π′. Then

Z = Σ_{t=0}^∞ (P^t − Π)
  = (P^0 − Π) + Σ_{t=1}^∞ (P^t − Π)
  = (I − Π) + Σ_{t=1}^∞ (P − Π)^t   (check by induction)
  = −Π + Σ_{t=0}^∞ (P − Π)^t
  = −Π + (I − (P − Π))^{-1}   (since P^t − Π → 0).

Note: Brand says Z = (I − P − Π)^{-1}, which is probably wrong.
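On a toy chain (made-up weights), the closed form for Z and identity (5) can be checked numerically; the hitting times it produces satisfy the recurrence H_ij = 1 + Σ_k P_ik H_kj from the earlier slide:

```python
import numpy as np

# Toy symmetric association graph (made-up weights).
W = np.array([[0.0, 2.0, 1.0],
              [2.0, 0.0, 3.0],
              [1.0, 3.0, 0.0]])
P = W / W.sum(axis=1, keepdims=True)
n = len(P)

# Stationary distribution (degree-proportional, since W is symmetric).
pi = W.sum(axis=1) / W.sum()

# Z = -Pi + (I - (P - Pi))^{-1}, with Pi = 1 pi'.
Pi = np.outer(np.ones(n), pi)
Z = -Pi + np.linalg.inv(np.eye(n) - (P - Pi))

# Expected hitting times via (5): H_ij = (Z_jj - Z_ij) / pi_j.
H = (np.diag(Z)[None, :] - Z) / pi[None, :]
```

Off the diagonal, 1 + (PH)_ij = H_ij; on the diagonal, 1 + (PH)_jj is the return time 1/π_j from (1).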
A slight simplification
Recall that W is the weighted adjacency matrix of the association graph. From now on, assume W is symmetric (i.e. the graph is undirected).

A random walk on such a graph has transition probabilities, for i ≠ j,

P_ij = W_ij / W_i,

where W_i = Σ_j W_ij.
A slight simplification
Also assume the graph is connected and not bipartite. Then the stationary distribution of the random walk is

π_i = W_i / Σ_k W_k.
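This degree-proportional stationary distribution is easy to verify numerically (again on a made-up 3-node graph):

```python
import numpy as np

# Toy symmetric association graph (made-up weights).
W = np.array([[0.0, 2.0, 1.0],
              [2.0, 0.0, 3.0],
              [1.0, 3.0, 0.0]])
P = W / W.sum(axis=1, keepdims=True)

# pi_i = W_i / sum_k W_k
pi = W.sum(axis=1) / W.sum()
```

The check π′P = π′ below is exactly the stationarity condition.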
What about expected hitting and commute times?
It turns out that the expected hitting and commute times are captured by the graph’s electrical resistance.
Connection to resistive networks
Put a resistor on each edge {i, j} with resistance R_ij = 1/W_ij (i.e. with conductance W_ij). Now consider fixed nodes i and j.

- Inject W_i = Σ_k W_ik current into node i.
- By Kirchhoff’s current law (I_in = I_out) and Ohm’s law (V = IR),

  W_i = Σ_{(i,k)∈E} I_ik = Σ_{(i,k)∈E} V_ik / R_ik = Σ_{(i,k)∈E} V_ik W_ik.

- By Kirchhoff’s voltage law, V_ij = V_ik + V_kj.
- We get W_i = Σ_k W_ik (V_ij − V_kj). After rearranging, this is

  V_ij = 1 + Σ_k (W_ik / W_i) V_kj = 1 + Σ_k P_ik V_kj.
Connection to resistive networks
The recurrence relation for the voltage V_ij when W_i units of current are injected into node i is the same as the recurrence relation for the expected hitting time H_ij. So identify V_ij ≡ H_ij.

To get an explicit formula for the expected commute time C_ij = H_ij + H_ji, we’ll use the superposition property of linear equations (resistive networks are characterized by linear equations).
Deriving the expected commute time: four cases
Adapted from Karp (2003). Inject the following currents, where k ranges over the nodes other than i and j:

Current into i     | Current into j     | Current into each k | V_ij  | V_ji
W_i                | −(Σ_k W_k − W_j)   | W_k                 | H_ij  | −H_ij
−(Σ_k W_k − W_i)   | W_j                | W_k                 | −H_ji | H_ji
Σ_k W_k − W_i      | −W_j               | −W_k                | H_ji  | −H_ji
Σ_k W_k            | −Σ_k W_k           | 0                   | C_ij  | −C_ij

(The third row is the negation of the second; the fourth row is the sum of the first and third.) Using Ohm’s law, we have the expected commute time

C_ij = (Σ_k W_k) R_ij = 2 W_total R_ij,

where W_total is the total weight of the graph, and R_ij is the effective resistance between nodes i and j.
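Numerically (same made-up toy graph), the commute times obtained from the fundamental matrix Z agree with (Σ_k W_k) R_ij, where R is computed from the pseudoinverse of the graph Laplacian, a route developed later in the deck:

```python
import numpy as np

# Toy symmetric association graph (made-up weights).
W = np.array([[0.0, 2.0, 1.0],
              [2.0, 0.0, 3.0],
              [1.0, 3.0, 0.0]])
P = W / W.sum(axis=1, keepdims=True)
n = len(W)
pi = W.sum(axis=1) / W.sum()

# Hitting times via the fundamental matrix Z (identity (5)).
Pi = np.outer(np.ones(n), pi)
Z = -Pi + np.linalg.inv(np.eye(n) - (P - Pi))
H = (np.diag(Z)[None, :] - Z) / pi[None, :]

# Effective resistance from the Laplacian pseudoinverse:
# R_ij = L+_ii - 2 L+_ij + L+_jj.
L = np.diag(W.sum(axis=1)) - W
Lp = np.linalg.pinv(L)
R = np.diag(Lp)[:, None] + np.diag(Lp)[None, :] - 2.0 * Lp

# C_ij = (sum_k W_k) R_ij should equal H_ij + H_ji.
C = W.sum() * R
```

For the triangle above, R_01 is the parallel combination of the direct edge (resistance 1/2) and the two-hop path (resistance 1 + 1/3), i.e. 4/11.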
Drawbacks of expected hitting and commute times
Using expected hitting and commute times as a basis for recommendation is natural.

However, they can be dominated by the stationary distribution, so the same popular items are recommended to everyone.

Idea: we can still use expected hitting and commute times, but not directly.
Cosine correlation
Here is a popular idea from information retrieval.
Suppose x and y are count vectors (e.g. word counts of a document). Similarity is measured by the dot product x · y.

Problem: longer similar documents get larger dot products than shorter similar documents.

Solution: just look at the angle θ between x and y:

x · y = ‖x‖ ‖y‖ cos θ, so cos θ = (x · y) / (‖x‖ ‖y‖).

Can we use this cosine correlation with H_ij or C_ij?
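As a quick sanity check on the length-invariance point, here is the cosine of two made-up count vectors, one document twice as long as the other but with the same topic mix:

```python
import numpy as np

# Hypothetical word-count vectors: same direction, different lengths.
x = np.array([3.0, 1.0, 0.0])
y = np.array([6.0, 2.0, 0.0])   # twice as long

dot = x @ y
cos_theta = dot / (np.linalg.norm(x) * np.linalg.norm(y))
```

The dot product doubles with document length, but the cosine stays at 1.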
Effective resistance as a metric
Recall: C_ij = 2 W_total R_ij. So any metric properties of R will also hold for C.

It is easy to check that

- R_ij ≤ R_ik + R_kj,
- R_ii = 0, and
- R_ij = R_ji.

So the effective resistance R is a metric.

But a general metric is not enough to talk about angles; we need a Euclidean metric. We will show that the square root of effective resistance is a Euclidean metric.
The Laplacian matrix
The Laplacian matrix L of the graph is

L = D − W,

where D = diag(W_1, W_2, . . . , W_n). Note that diag(W) = 0, so

L = [  W_1    −W_12   . . .   −W_1n
      −W_12    W_2    . . .   −W_2n
        ⋮        ⋮      ⋱        ⋮
      −W_1n   −W_2n   . . .    W_n  ].

Note that the sum of each row is 0, and so is the sum of each column (by symmetry of W).
The Laplacian matrix and its pseudoinverse
The Laplacian L is

- symmetric, because the original graph is undirected, and
- positive semidefinite, because

  x′ L x = Σ_{i<j} W_ij (x_i − x_j)² ≥ 0.

L has a pseudoinverse L+ given by

L+ = (L − (1/n) 1 1′)^{-1} + (1/n) 1 1′.

L+ is also symmetric and positive semidefinite (not obvious, but it’s true).
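The closed form for L+ can be checked against a generic pseudoinverse routine (toy made-up graph; the formula requires the graph to be connected):

```python
import numpy as np

# Toy symmetric association graph (made-up weights) and its Laplacian.
W = np.array([[0.0, 2.0, 1.0],
              [2.0, 0.0, 3.0],
              [1.0, 3.0, 0.0]])
L = np.diag(W.sum(axis=1)) - W
n = len(L)
J = np.ones((n, n)) / n   # (1/n) 1 1'

# L+ = (L - (1/n)11')^{-1} + (1/n)11'
Lp = np.linalg.inv(L - J) + J
```

Subtracting (1/n)11′ shifts the zero eigenvalue (whose eigenvector is 1) away from zero so the matrix becomes invertible; adding it back afterwards restores the zero on that direction.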
A Euclidean metric from the Laplacian’s pseudoinverse
Furthermore,

R_ij = (e_i − e_j)′ L+ (e_i − e_j),

where e_i is the ith elementary vector (1 in the ith entry, zero everywhere else). This comes from yet another, less intuitive derivation of expected hitting time.

This is a Mahalanobis distance, and since L+ is symmetric positive semidefinite, its square root √R_ij is a Euclidean metric.
Cosine correlation for random walk
The square root of effective resistance √R_ij defines a Euclidean metric, so the “angle” θ_ij between i and j is well-defined:

R_ij = (e_i − e_j)′ L+ (e_i − e_j) = L+_ii − 2 L+_ij + L+_jj

cos θ_ij = L+_ij / √(L+_ii L+_jj)
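Putting the pieces together (same made-up toy graph), both quantities come straight from L+:

```python
import numpy as np

# Toy symmetric association graph (made-up weights).
W = np.array([[0.0, 2.0, 1.0],
              [2.0, 0.0, 3.0],
              [1.0, 3.0, 0.0]])
L = np.diag(W.sum(axis=1)) - W
Lp = np.linalg.pinv(L)

# R_ij = L+_ii - 2 L+_ij + L+_jj
R = np.diag(Lp)[:, None] + np.diag(Lp)[None, :] - 2.0 * Lp

# cos(theta_ij) = L+_ij / sqrt(L+_ii L+_jj)
norms = np.sqrt(np.diag(Lp))
cos_theta = Lp / np.outer(norms, norms)
```

Since L+ is positive semidefinite, Cauchy-Schwarz guarantees the entries of cos_theta lie in [−1, 1].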
Interpreting cosine correlation for random walk
Identifying R_ij ≡ ‖x_i − x_j‖², one can deduce

‖x_i‖ = √(L+_ii).

For large Markov chains, this is approximately the recurrence time 1/π_i, a measure of “generic” popularity.

If the embedded points x_i are projected onto a unit hypersphere (thus removing all “generic” popularity), then

cos θ_ij = 1 − d²_ij / 2,

where d_ij is the resulting Euclidean distance between i and j.
Outline
Motivation
The random walk model
Using the model
Making recommendations
Turning a profit
Evaluation
Making recommendations
Recommendations are with respect to a query state (e.g. customer, currently viewed product, search query).

Given a query state i, rank the other states j according to cos θ_ij and recommend the top hits.

The problem is similar to semi-supervised classification (learning with both labeled and unlabeled data), and the cosine correlation is a superior similarity measure compared to other proposed methods. . . (on one toy example).
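A minimal sketch of the ranking step (same made-up toy graph; in the real system the query state would be a customer or product node):

```python
import numpy as np

# Toy symmetric association graph (made-up weights).
W = np.array([[0.0, 2.0, 1.0],
              [2.0, 0.0, 3.0],
              [1.0, 3.0, 0.0]])
L = np.diag(W.sum(axis=1)) - W
Lp = np.linalg.pinv(L)

norms = np.sqrt(np.diag(Lp))
cos_theta = Lp / np.outer(norms, norms)

# Rank all other states by cosine correlation with query state i.
i = 0
ranking = [int(j) for j in np.argsort(-cos_theta[i]) if j != i]
```

The top of `ranking` is the recommendation list for state i.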
Semi-supervised classification
Turning a profit (at least in expectation)
Goal (from decision theory): “recommend the product (state) with the greatest expected profit, discounted over time.”

Let $ ∈ R^n be the vector of profit (or loss) for each state, and e^{−β} (β > 0) be the discount factor. Then the expected discounted profit is

v = Σ_{t=0}^∞ e^{−tβ} P^t $
  = (Σ_{t=0}^∞ (e^{−β} P)^t) $
  = (I − e^{−β} P)^{-1} $,   since (e^{−β} P)^t → 0.

To maximize expected discounted profit at query state i, recommend

argmax_{j : P_ij > 0} P_ij v_j.
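A sketch of the profit computation (same made-up toy graph; the per-state profits and discount rate are invented for illustration):

```python
import numpy as np

# Toy symmetric association graph (made-up weights) and its walk matrix.
W = np.array([[0.0, 2.0, 1.0],
              [2.0, 0.0, 3.0],
              [1.0, 3.0, 0.0]])
P = W / W.sum(axis=1, keepdims=True)
n = len(P)

profit = np.array([1.0, -0.5, 2.0])   # hypothetical per-state profits $
beta = 0.1                            # discount factor is e^{-beta}

# v = (I - e^{-beta} P)^{-1} $, solved as a linear system.
v = np.linalg.solve(np.eye(n) - np.exp(-beta) * P, profit)

# At query state i, recommend argmax over reachable j of P_ij v_j.
i = 0
scores = np.where(P[i] > 0, P[i] * v, -np.inf)
best = int(np.argmax(scores))
```

Solving the linear system avoids summing the series explicitly; v satisfies the fixed-point equation v = $ + e^{−β} P v.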
Outline
Motivation
The random walk model
Using the model
Evaluation
Data set and model
Maximizing satisfaction
Maximizing profit
Discussion
Experimental setup: data set and model
Data comes from the MovieLens database:
- Ratings on 1–5 scale for 1682 movies by 943 individuals
- Each individual viewed 20–737 movies (106 on average)
- Each movie received 1–583 ratings (60 on average)
- Ratings table is 93.7% empty (i.e. most viewers have not seen most movies)
- Classify movies into 19 genres
- Classify individuals into 2 genders, 21 vocations, 8 overlapping age groups
Constructed an n = 2657 node graph with

W_ij = { 1     if i belongs in class j,
         r_ij  if individual i rates movie j with rating r_ij. }
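A tiny sketch of this graph construction, with made-up ratings and class memberships (2 viewers, 2 movies, 1 genre node; the real graph has 2657 nodes):

```python
import numpy as np

n_people, n_movies, n_classes = 2, 2, 1
n = n_people + n_movies + n_classes   # 5 nodes total

W = np.zeros((n, n))

def connect(i, j, w):
    # The association graph is undirected, so set both directions.
    W[i, j] = W[j, i] = w

# Ratings r_ij on a 1-5 scale: W_ij = r_ij.
connect(0, 2, 5.0)   # person 0 rated movie 0 as 5
connect(0, 3, 2.0)   # person 0 rated movie 1 as 2
connect(1, 3, 4.0)   # person 1 rated movie 1 as 4

# Class membership: W_ij = 1 if i belongs to class j.
connect(2, 4, 1.0)   # movie 0 belongs to genre 0
connect(3, 4, 1.0)   # movie 1 belongs to genre 0
```

People, movies, and classes all become states of one random walk over this W.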
Task 1: recommending to maximize satisfaction
Randomly partition data into training and test sets.
1. Test set contains 10 ratings from each viewer.
2. Take the 10 top-ranked movies not in the training set as the recommendations.
3. Score the recommendations with the sum of the individual's held-out ratings for the recommended movies.
Compared different similarity measures:

- cosine correlation
- expected commute time, expected hitting time
- stationary distribution
- normalized hitting time, normalized commute time
Task 1 results
Compared average score across all 943 individuals and 500 trials.
- Score ranges from 0 to 50.
- An omniscient oracle scores ≤ 35.3 on average (due to sparsity of data and low average rating).
- Random recommendations score 2.2 on average.
Task 2: recommending to maximize profit
Similar setup as before, except the scoring is changed:
1. A priori, randomly assign each movie j a profit p_j ~ N(0, 1).
2. For t = 1 to 10:
   a. Recommend a movie.
   b. If the recommended movie j is in the individual's held-out set, receive profit e^{−tβ} p_j.
Compared different recommenders:
- maximum expected discounted profit
- cosine correlation, but only movies with positive profit
- cosine correlation, allowing all movies
- expected commute time, expected hitting time, stationary distribution
Task 2 results
Discussion
Cosine correlation performs significantly better than the stationary distribution. This suggests that it is sensitive to individual preferences.
Issues with experimental study:
- Recommended movies not rated by the individual in the test set were given a score of 0.
- Only considered “random walk”-related similarity measures.
References
D. Aldous and J. Fill, Reversible Markov Chains and Random Walks on Graphs. Monograph in preparation, www.stat.berkeley.edu/users/aldous/RWG/book.html.

M. Brand, A random walks perspective on maximizing satisfaction and profit. In Proceedings of the SIAM International Conference on Data Mining, 2005.

F. Fouss, A. Pirotte, J. Renders, and M. Saerens, A Novel Way of Computing Dissimilarities Between Nodes of a Graph, with Application to Collaborative Filtering. In Proceedings of the ECML Workshop on Statistical Approaches for Web Mining, 2004.
R. Karp, Lecture. U.C. Berkeley, November 12, 2003.