PageRank and related algorithms
PageRank and HITS
Jacob Kogan
Department of Mathematics and Statistics
University of Maryland, Baltimore County
Baltimore, Maryland 21250
kogan@umbc.edu
May 15, 2006
Basic References
L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: bringing order to the web. Stanford Digital Library Technologies Project, 1998, citeseer.ist.psu.edu/page98pagerank.html.
J. Kleinberg. Authoritative Sources in a Hyperlinked Environment. Journal of the ACM, 46:5, pp. 604-632, 1999.
P. Berkhin. A survey on PageRank computing. Internet Mathematics, vol. 2, no. 1, pp. 73–120, 2005.
Jacob Kogan, UMBC PageRank and related algorithms, optimization 2/21
PageRank
PageRank is a global “importance” ranking of every web page.
The method is based on the graph of the web. The model is inspired by academic citation analysis.
If a page is linked from an “important” page (the Yahoo home page, for example), then this link should make a larger contribution to the page's “importance” than links from “obscure” pages.
The graph and the matrix
G(V,E) is a directed graph
V are the vertices/nodes (say n HTML pages)
E are the directed edges (hyperlinks)
The n × n adjacency matrix $A = (A_{ij})$:
\[
A_{ij} =
\begin{cases}
1 & \text{if page } i \longrightarrow j \\
0 & \text{otherwise}
\end{cases}
\]
Transition matrix P
$P = (P_{ij})$, where
\[
P_{ij} = \frac{A_{ij}}{\mathrm{odeg}(i)}
\]
($\mathrm{odeg}(i)$, the out degree of a node $i$, is the number of outgoing links), so that
\[
\sum_j P_{ij} = 1
\]
($P$ is row stochastic)
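As a rough illustration of the two definitions, the sketch below builds A and P for a tiny made-up graph. The function names, the 0-based page numbering, and the example edges are assumptions of this sketch, and it assumes every page has at least one outgoing link.

```python
def adjacency_matrix(n, edges):
    """Return the n x n 0/1 matrix with A[i][j] = 1 when page i links to page j."""
    A = [[0] * n for _ in range(n)]
    for i, j in edges:
        A[i][j] = 1
    return A


def transition_matrix(A):
    """Divide each row of A by its out degree (assumes odeg(i) > 0 for every i)."""
    P = []
    for row in A:
        odeg = sum(row)  # number of outgoing links of this page
        P.append([a / odeg for a in row])
    return P


# Hypothetical 3-page web: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0
edges = [(0, 1), (0, 2), (1, 2), (2, 0)]
A = adjacency_matrix(3, edges)
P = transition_matrix(A)
row_sums = [sum(row) for row in P]  # each row of P should sum to 1
```

Page 0 has two outgoing links, so each of them receives probability 1/2.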
Random Surfer Model
A surfer travels along the directed graph $G$. $P_{ij}$, $j = 1, \dots, n$, is the probability that the surfer moves from node $i$ to node $j$.
If at step $k$ the probability of the surfer being located at node $i$ is $p^{(k)}_i$, so that
\[
p^{(k)} = \left(p^{(k)}_1, \dots, p^{(k)}_n\right),
\]
then
\[
p^{(k+1)} = P^T p^{(k)}.
\]
$p^{(k+1)}$ is a probability distribution!
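The update can be tried on a small example. This is a minimal sketch: the helper name `step`, the 0-based indexing, and the 3-page transition matrix are assumptions made for illustration.

```python
def step(P, p):
    """One move of the random surfer: q = P^T p, i.e. q[j] = sum_i P[i][j] * p[i]."""
    n = len(p)
    return [sum(P[i][j] * p[i] for i in range(n)) for j in range(n)]


# Transition matrix of a hypothetical graph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0
P = [[0.0, 0.5, 0.5],
     [0.0, 0.0, 1.0],
     [1.0, 0.0, 0.0]]

p0 = [1.0, 0.0, 0.0]  # the surfer starts at page 0
p1 = step(P, p0)      # distribution after one move
p2 = step(P, p1)      # distribution after two moves
```

Both `p1` and `p2` have nonnegative entries that sum to 1, as the slide claims.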
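$q = P^T p$
If $p = (p_1, \dots, p_n)$, $p_i \ge 0$, $\sum p_i = 1$, and $q = (q_1, \dots, q_n)$, $q = P^T p$, then $q_i \ge 0$, $\sum q_i = 1$.
\[
\sum_{i=1}^{n} q_i
= \sum_{j=1}^{n} \left( \sum_{i=1}^{n} P_{ij}\, p_i \right)
= \sum_{i=1}^{n} p_i \sum_{j=1}^{n} P_{ij}
= \sum_{i=1}^{n} p_i = 1.
\]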
Dangling Pages
Pages that have no outgoing links are called dangling pages, or sinks, or attractors.
With dangling pages the transition matrix P has zero rows and fails to be stochastic.
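A small experiment makes the failure visible: with a zero row in P, probability mass leaks out at every step. The graph, the helper name `step`, and the uniform starting vector are assumptions of this sketch.

```python
def step(P, p):
    """q = P^T p, i.e. q[j] = sum_i P[i][j] * p[i]."""
    n = len(p)
    return [sum(P[i][j] * p[i] for i in range(n)) for j in range(n)]


# Hypothetical graph 0 -> 1, 0 -> 2, 1 -> 2; page 2 has no outgoing links
P = [[0.0, 0.5, 0.5],
     [0.0, 0.0, 1.0],
     [0.0, 0.0, 0.0]]  # zero row for the dangling page

p = [1 / 3, 1 / 3, 1 / 3]
masses = []
for _ in range(3):
    p = step(P, p)
    masses.append(sum(p))  # total probability left after each move
```

Here the total mass drops to about 2/3, then 1/6, then 0, so the iterates are no longer probability distributions.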
PageRank
Definition. A PageRank vector is a non-negative stationary point of the transformation
\[
q = P^T p,
\]
that is, a vector $p \ge 0$ with $p = P^T p$
(a stationary distribution for a Markov chain)
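For a graph without dangling pages, one way to look for such a stationary point is to apply the transformation repeatedly until the vector stops changing. The sketch below is illustrative only: the function names, the tolerance, the iteration cap, and the example graph are assumptions, and the iteration is not guaranteed to converge for every graph (this example is strongly connected and aperiodic, which is why it does).

```python
def step(P, p):
    """q = P^T p, i.e. q[j] = sum_i P[i][j] * p[i]."""
    n = len(p)
    return [sum(P[i][j] * p[i] for i in range(n)) for j in range(n)]


def stationary(P, tol=1e-12, max_iter=1000):
    """Iterate p <- P^T p from the uniform vector until the L1 change is below tol."""
    n = len(P)
    p = [1.0 / n] * n
    for _ in range(max_iter):
        q = step(P, p)
        if sum(abs(a - b) for a, b in zip(q, p)) < tol:
            return q
        p = q
    return p


# Hypothetical graph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0 (no dangling pages)
P = [[0.0, 0.5, 0.5],
     [0.0, 0.0, 1.0],
     [1.0, 0.0, 0.0]]
rank = stationary(P)  # approaches (0.4, 0.2, 0.4) for this graph
```

One can check by hand that (0.4, 0.2, 0.4) satisfies p = P^T p for this matrix.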
What can be done in the presence of dangling pages?
What can be done?
removal of dangling pages,
renormalization of $P^T p^{(k+1)}$,
adding a self link to each dangling page,
introducing an “ideal page” with a self link, to which each dangling page links,
modifying the matrix P by introducing artificial links that uniformly connect dangling pages to all pages ($P' = P + d v^T$).
PageRank
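\[
v = \begin{pmatrix} \frac{1}{n} \\ \vdots \\ \frac{1}{n} \end{pmatrix},
\qquad
d = \begin{pmatrix} \delta(\mathrm{odeg}(1), 0) \\ \vdots \\ \delta(\mathrm{odeg}(n), 0) \end{pmatrix}
\]
Here $\delta$ is the Kronecker delta, so $d_i = 1$ exactly when page $i$ is dangling, and $e$ below is the vector of all ones.
Consider
\[
P'' = c\left[P + d v^T\right] + (1 - c)\, e v^T.
\]
\[
y = P''^T x = c P^T x + c\, v \left(d^T x\right) + (1 - c)\, v \left(e^T x\right).
\]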
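For a very small graph the matrix P'' can simply be written out entry by entry, which makes it easy to see that it is row stochastic even when P has zero rows. This is a sketch under assumptions: the function name `google_matrix`, the value c = 0.85, and the example graph are not from the slides.

```python
def google_matrix(P, c):
    """Dense P'' = c (P + d v^T) + (1 - c) e v^T, for small examples only."""
    n = len(P)
    v = [1.0 / n] * n                                 # uniform vector v
    d = [1.0 if sum(row) == 0 else 0.0 for row in P]  # d[i] = 1 when page i is dangling
    # (e v^T)[i][j] = v[j] because e is the vector of all ones
    return [[c * (P[i][j] + d[i] * v[j]) + (1.0 - c) * v[j]
             for j in range(n)] for i in range(n)]


# Hypothetical graph 0 -> 1, 0 -> 2, 1 -> 2; page 2 is dangling
P = [[0.0, 0.5, 0.5],
     [0.0, 0.0, 1.0],
     [0.0, 0.0, 0.0]]
G = google_matrix(P, c=0.85)  # c = 0.85 is an assumed value
row_sums = [sum(row) for row in G]
```

The row of the dangling page becomes the uniform vector v, and every entry of the matrix is strictly positive.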
PageRank computation
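Let $x$ be a vector in $\mathbb{R}^n$ (with nonnegative entries, as for a probability vector), and let $P = (P_{ij})$ be an $n \times n$ matrix with nonnegative entries such that, for each $i$,
\[
\text{either } \sum_j P_{ij} = 1, \text{ or } \sum_j P_{ij} = 0.
\]
Let $d \in \mathbb{R}^n$ so that $d_i = \delta(\mathrm{odeg}(i), 0)$, then
\[
\left|P^T x\right| = |x| - d^T x.
\]
(where $|y| = |y|_1 = |y_1| + \cdots + |y_n|$)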
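The identity is easy to check numerically on a small case. The helper names, the example matrix, and the vector x below are assumptions chosen for illustration; x is taken nonnegative.

```python
def l1(y):
    """|y|_1 = |y_1| + ... + |y_n|."""
    return sum(abs(t) for t in y)


def transpose_times(P, x):
    """Return P^T x, i.e. entry j is sum_i P[i][j] * x[i]."""
    n = len(x)
    return [sum(P[i][j] * x[i] for i in range(n)) for j in range(n)]


# Hypothetical matrix whose rows sum to 1 or to 0 (page 2 is dangling)
P = [[0.0, 0.5, 0.5],
     [0.0, 0.0, 1.0],
     [0.0, 0.0, 0.0]]
x = [0.2, 0.3, 0.5]                               # nonnegative test vector
d = [1.0 if sum(row) == 0 else 0.0 for row in P]  # d[i] = delta(odeg(i), 0)

lhs = l1(transpose_times(P, x))                   # |P^T x|
rhs = l1(x) - sum(di * xi for di, xi in zip(d, x))  # |x| - d^T x
```

Both sides come out as 0.5 here (up to rounding): the mass 0.5 sitting on the dangling page is exactly what is lost.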
PageRank computation
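\[
P^T x =
\begin{pmatrix}
P_{11} & P_{21} & \cdots & P_{n1} \\
P_{12} & P_{22} & \cdots & P_{n2} \\
\vdots & \vdots & \ddots & \vdots \\
P_{1n} & P_{2n} & \cdots & P_{nn}
\end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix}
= x_1 \begin{pmatrix} P_{11} \\ P_{12} \\ \vdots \\ P_{1n} \end{pmatrix}
+ x_2 \begin{pmatrix} P_{21} \\ P_{22} \\ \vdots \\ P_{2n} \end{pmatrix}
+ \cdots
+ x_n \begin{pmatrix} P_{n1} \\ P_{n2} \\ \vdots \\ P_{nn} \end{pmatrix}
\]
Hence
\[
\left|P^T x\right| = x_1 \sum_j P_{1j} + x_2 \sum_j P_{2j} + \cdots + x_n \sum_j P_{nj}.
\]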
PageRank computation
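\[
\left|P^T x\right| = x_1 \Big(\sum_j P_{1j}\Big) + \cdots + x_n \Big(\sum_j P_{nj}\Big).
\]
\[
|x| - d^T x = x_1 + \cdots + x_n - \delta(\mathrm{odeg}(1), 0)\, x_1 - \cdots - \delta(\mathrm{odeg}(n), 0)\, x_n.
\]
The two expressions agree because $\sum_j P_{ij} = 1 - \delta(\mathrm{odeg}(i), 0)$ for every $i$.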
PageRank
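\[
y = P''^T x = c P^T x + \underbrace{c\, v \left(d^T x\right) + (1 - c)\, v \left(e^T x\right)}_{\gamma v}.
\]
For $x$ with nonnegative entries $e^T x = |x|$, so the coefficient of $v$ is
\[
c \left(d^T x\right) + (1 - c)\, |x| = |x| - \left(c\, |x| - c \left(d^T x\right)\right) = |x| - \left|c P^T x\right|.
\]
Hence y can be computed as follows:
1. $y \longleftarrow c P^T x$,
2. $\gamma = |x| - |y|$,
3. $y \longleftarrow y + \gamma v$.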
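The three steps translate almost directly into code that never forms P'' and only touches existing links. The following is a sketch, not a reference implementation: the out-link list representation, the function names, c = 0.85, the tolerance, the iteration cap, and the uniform starting vector are all assumptions.

```python
def pagerank_step(out_links, x, c):
    """Compute y = P''^T x by the three steps, for x with nonnegative entries."""
    n = len(x)
    y = [0.0] * n
    # Step 1: y <- c P^T x, spreading c * x[i] evenly over the out-links of page i
    for i, targets in enumerate(out_links):
        if targets:
            share = c * x[i] / len(targets)
            for j in targets:
                y[j] += share
    # Step 2: gamma = |x| - |y| (plain sums, since all entries are nonnegative)
    gamma = sum(x) - sum(y)
    # Step 3: y <- y + gamma v, with v the uniform vector (1/n, ..., 1/n)
    return [yj + gamma / n for yj in y]


def pagerank(out_links, c=0.85, tol=1e-10, max_iter=1000):
    """Repeat the step from the uniform vector until the L1 change is below tol."""
    n = len(out_links)
    x = [1.0 / n] * n
    for _ in range(max_iter):
        y = pagerank_step(out_links, x, c)
        if sum(abs(a - b) for a, b in zip(y, x)) < tol:
            return y
        x = y
    return x


# Hypothetical graph 0 -> 1, 0 -> 2, 1 -> 2; page 2 is dangling
out_links = [[1, 2], [2], []]
ranks = pagerank(out_links)
```

Because the dangling page simply contributes nothing in step 1, its mass is picked up by gamma in step 2 and redistributed uniformly in step 3.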
Hyperlink Induced Topic Search (HITS)
works with a subgraph specific to a particular query (rather than with a full graph),
computes two weights (authority and hub) for each webpage,
allows clustering of results for multi-topic or polarized queries.
Root and Focused Sets
Root set: The top t (around 200) results are recalled for a given query (the results are picked according to a text-based relevance criterion).
Focused set: All pages pointed to by out links of the root set are added, along with up to d (about 50) pages corresponding to inlinks of each page in the root set.
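The construction of the focused set can be sketched as follows. The dictionaries `out_links` and `in_links`, the page names, and the rule for choosing which d in-linking pages to keep (the first d in sorted order) are hypothetical; the slides do not say how those pages are selected.

```python
def focused_set(root, out_links, in_links, d=50):
    """Extend the root set with all out-link targets and up to d in-linking pages per root page."""
    focused = set(root)
    for page in root:
        # every page pointed to by a root page is added
        focused.update(out_links.get(page, ()))
        # at most d pages that point to this root page (selection rule is an assumption)
        focused.update(sorted(in_links.get(page, ()))[:d])
    return focused


# Hypothetical link data for a two-page root set
root = {"a", "b"}
out_links = {"a": ["c"], "b": ["c", "d"]}
in_links = {"a": ["x", "y", "z"], "b": ["w"]}
result = focused_set(root, out_links, in_links, d=2)
```

With d = 2, page "a" contributes only two of its three in-linking pages, so "z" is left out.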
Hubs and Authorities
Define “authorities” and “hubs” as follows:
1. a page p is an authority if it is pointed to by many pages,
2. a page p is a hub if it points to many pages.
To measure the “authority” and the “hub” of the pages we consider $L_2$ unit norm vectors a and h of dimension |V|, so that
a[p] is the “authority”
h[p] is the “hub”
of the page p.
Hubs and Authorities
The following is an iterative process that computes the vectors.
1. set t = 0
2. assign initial values $a^{(t)}$ and $h^{(t)}$
3. normalize vectors $a^{(t)}$ and $h^{(t)}$, so that $\sum_p \left(a^{(t)}[p]\right)^2 = \sum_p \left(h^{(t)}[p]\right)^2 = 1$
4. set $a^{(t+1)}[p] = \sum_{q \longrightarrow p} h^{(t)}[q]$, and $h^{(t+1)}[p] = \sum_{p \longrightarrow q} a^{(t+1)}[q]$
5. if (stopping criterion fails) then increment t by 1, goto Step 3; else stop.
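The iteration can be sketched in a few lines. The edge-list representation, the uniform initial vectors, and the stopping criterion (total L1 change of both vectors below a tolerance) are assumptions; the slides leave the initial values and the criterion open.

```python
import math


def normalize(vec):
    """Scale vec to unit L2 norm (assumes vec is not the zero vector)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def hits(edges, n, tol=1e-12, max_iter=1000):
    """Return (authority, hub) vectors for a graph given as (source, target) pairs."""
    a = normalize([1.0] * n)
    h = normalize([1.0] * n)
    for _ in range(max_iter):
        new_a = [0.0] * n
        for q, p in edges:   # a[p] collects h[q] over all links q -> p
            new_a[p] += h[q]
        new_h = [0.0] * n
        for p, q in edges:   # h[p] collects the new a[q] over all links p -> q
            new_h[p] += new_a[q]
        new_a, new_h = normalize(new_a), normalize(new_h)
        change = (sum(abs(x - y) for x, y in zip(new_a, a))
                  + sum(abs(x - y) for x, y in zip(new_h, h)))
        a, h = new_a, new_h
        if change < tol:     # assumed stopping criterion
            break
    return a, h


# Hypothetical graph: 0 -> 2, 0 -> 3, 1 -> 2
edges = [(0, 2), (0, 3), (1, 2)]
authority, hub = hits(edges, n=4)
```

In this example page 2 gets the largest authority weight (two in-links) and page 0 the largest hub weight (two out-links).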
Adjacency Matrix
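Let A be the adjacency matrix of the graph G, i.e.
\[
A_{ij} =
\begin{cases}
1 & \text{if page } i \longrightarrow j \\
0 & \text{otherwise}
\end{cases}
\]
Note that
\[
a^{(t+1)} = \frac{A^T h^{(t)}}{\left\| A^T h^{(t)} \right\|}, \quad \text{and} \quad h^{(t+1)} = \frac{A a^{(t+1)}}{\left\| A a^{(t+1)} \right\|}.
\]
This yields
\[
a^{(t+1)} = \frac{A^T A\, a^{(t)}}{\left\| A^T A\, a^{(t)} \right\|}, \quad \text{and} \quad h^{(t+1)} = \frac{A A^T h^{(t)}}{\left\| A A^T h^{(t)} \right\|}.
\]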
Eigenvectors
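\[
a^{(t)} = \frac{\left(A^T A\right)^t a^{(0)}}{\left\| \left(A^T A\right)^t a^{(0)} \right\|}, \quad \text{and} \quad h^{(t)} = \frac{\left(A A^T\right)^t h^{(0)}}{\left\| \left(A A^T\right)^t h^{(0)} \right\|}.
\]
Let v and w be unit eigenvectors corresponding to the maximal eigenvalues of the symmetric matrices $A^T A$ and $A A^T$, respectively. The above arguments lead to the following result:
\[
\lim_{t \to \infty} a^{(t)} = v, \quad \lim_{t \to \infty} h^{(t)} = w.
\]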
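The eigenvector claim can be checked numerically for the authority vector on a small graph: iterate with A^T A, then test whether the result satisfies the eigenvalue equation. The helper names, the example matrix, the starting vector, and the fixed number of iterations are assumptions of this sketch; for this particular graph the largest eigenvalue of A^T A is (3 + sqrt(5))/2.

```python
import math


def transpose(M):
    return [list(col) for col in zip(*M)]


def matmul(X, Y):
    """Plain matrix product of two lists of rows."""
    Yt = transpose(Y)
    return [[sum(a * b for a, b in zip(row, col)) for col in Yt] for row in X]


def matvec(M, x):
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]


def normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


# Hypothetical graph: 0 -> 2, 0 -> 3, 1 -> 2
A = [[0, 0, 1, 1],
     [0, 0, 1, 0],
     [0, 0, 0, 0],
     [0, 0, 0, 0]]
M = matmul(transpose(A), A)        # the symmetric matrix A^T A

a = normalize([1.0] * 4)           # a^(0), assumed uniform
for _ in range(100):               # a^(t) proportional to (A^T A)^t a^(0)
    a = normalize(matvec(M, a))

Ma = matvec(M, a)
lam = sum(x * y for x, y in zip(a, Ma))  # Rayleigh quotient, estimate of the eigenvalue
residual = math.sqrt(sum((y - lam * x) ** 2 for x, y in zip(a, Ma)))
```

A residual near zero indicates that the limit vector is an eigenvector of A^T A; the same check with A A^T applies to the hub vector.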