
PageRank and related algorithms

PageRank and HITS

Jacob Kogan

Department of Mathematics and Statistics, University of Maryland, Baltimore County

Baltimore, Maryland 21250

kogan@umbc.edu

May 15, 2006

Basic References

L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: bringing order to the web. Stanford Digital Library Technologies Project, 1998, citeseer.ist.psu.edu/page98pagerank.html.

J. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM, vol. 46, no. 5, pp. 604–632, 1999.

P. Berkhin. A survey on PageRank computing. Internet Mathematics, vol. 2, no. 1, pp. 73–120, 2005.

Jacob Kogan, UMBC PageRank and related algorithms, optimization 2/21

PageRank

PageRank is a global “importance” ranking of every web page.

The method is based on the graph of the web. The model is inspired by academic citation analysis.

If a page is linked from an “important” page (the Yahoo home page, for example), then this link should contribute more to the page's “importance” than links from “obscure” pages.

The graph and the matrix

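$G(V, E)$ is a directed graph, where

$V$ is the set of vertices/nodes (say, $n$ HTML pages),

$E$ is the set of directed edges (hyperlinks).

The $n \times n$ adjacency matrix $A = (A_{ij})$ is defined by

$$A_{ij} = \begin{cases} 1 & \text{if page } i \to j \\ 0 & \text{otherwise} \end{cases}$$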

Transition matrix P

$P = (P_{ij})$, where

$$P_{ij} = \frac{A_{ij}}{\mathrm{odeg}(i)}$$

($\mathrm{odeg}(i)$, the out-degree of node $i$, is the number of outgoing links), so that

$$\sum_{j} P_{ij} = 1$$

($P$ is row stochastic).
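As a minimal sketch of this construction, the following Python snippet row-normalizes a small adjacency matrix. The 3-page graph and the function name `transition_matrix` are hypothetical, and the choice to leave the row of a page with no outgoing links as zeros is an assumption made here so that the code does not divide by zero.

```python
def transition_matrix(A):
    """Return P with P[i][j] = A[i][j] / odeg(i) for an adjacency matrix A."""
    P = []
    for row in A:
        odeg = sum(row)  # out-degree of this node
        if odeg == 0:
            # Assumption: a node without outgoing links keeps a zero row.
            P.append([0.0] * len(row))
        else:
            P.append([a / odeg for a in row])
    return P


# Hypothetical 3-page web: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0
A = [[0, 1, 1],
     [0, 0, 1],
     [1, 0, 0]]
P = transition_matrix(A)
row_sums = [sum(row) for row in P]  # each row sums to 1 (row stochastic)
```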

Random Surfer Model

A surfer travels along the directed graph $G$. $P_{ij}$, $j = 1, \ldots, n$, is the probability that the surfer moves from node $i$ to node $j$.

If at step $k$ the probability of the surfer being located at node $i$ is $p_i^{(k)}$, so that

$$p^{(k)} = \left(p_1^{(k)}, \ldots, p_n^{(k)}\right)^T,$$

then

$$p^{(k+1)} = P^T p^{(k)}.$$

$p^{(k+1)}$ is a probability distribution!

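$q = P^T p$

If $p = (p_1, \ldots, p_n)$, $p_i \ge 0$, $\sum p_i = 1$, and $q = (q_1, \ldots, q_n)$, $q = P^T p$, then $q_i \ge 0$, $\sum q_i = 1$.

$$\sum_{i=1}^{n} q_i = \sum_{i=1}^{n} \sum_{j=1}^{n} P_{ji}\, p_j = \sum_{j=1}^{n} p_j \sum_{i=1}^{n} P_{ji} = \sum_{j=1}^{n} p_j = 1.$$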
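The claim can be checked numerically. The sketch below applies the surfer step twice to a hypothetical row-stochastic matrix; the matrix, the starting distribution, and the helper name `step` are illustrative assumptions.

```python
def step(P, p):
    """One surfer step q = P^T p, i.e. q[j] = sum over i of P[i][j] * p[i]."""
    n = len(P)
    return [sum(P[i][j] * p[i] for i in range(n)) for j in range(n)]


# Hypothetical row-stochastic transition matrix (every row sums to 1).
P = [[0.0, 0.5, 0.5],
     [0.0, 0.0, 1.0],
     [1.0, 0.0, 0.0]]

p = [1.0, 0.0, 0.0]   # the surfer starts at node 0
q = step(P, p)        # distribution after one step
q2 = step(P, q)       # distribution after two steps
```

Both `q` and `q2` have nonnegative entries that sum to 1, as the derivation predicts.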


Dangling Pages

Pages that have no outgoing links are called dangling pages, or sinks, or attractors.

With dangling pages the transition matrix $P$ has zero rows and fails to be stochastic.

PageRank

Definition. A PageRank vector is a non-negative stationary point of the transformation

$$q = P^T p$$

(a stationary distribution for a Markov chain).

What can be done in the presence of dangling pages?


What can be done?

removing dangling pages,

renormalizing $P^T p^{(k+1)}$,

adding a self-link to each dangling page,

introducing an “ideal page” with a self-link, to which each dangling page links,

modifying the matrix $P$ by introducing artificial links that uniformly connect dangling pages to pages ($P' = P + d v^T$).
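The last option can be sketched in a few lines of Python. The sketch assumes that $v$ is the uniform vector with entries $1/n$ and that $d_i = 1$ exactly when page $i$ is dangling; the example matrix and the function name `fix_dangling` are hypothetical.

```python
def fix_dangling(P):
    """Return P' = P + d v^T, replacing each zero row of P by the uniform row v."""
    n = len(P)
    v = [1.0 / n] * n                                  # assumed uniform vector
    d = [1.0 if sum(row) == 0 else 0.0 for row in P]   # dangling indicator
    return [[P[i][j] + d[i] * v[j] for j in range(n)] for i in range(n)]


# Hypothetical matrix in which page 1 is dangling (its row is zero).
P = [[0.0, 0.5, 0.5],
     [0.0, 0.0, 0.0],
     [1.0, 0.0, 0.0]]
P_prime = fix_dangling(P)
```

After the modification every row of `P_prime` sums to 1, so the matrix is row stochastic again.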

PageRank

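Let

$$v = \left(\tfrac{1}{n}, \ldots, \tfrac{1}{n}\right)^T, \qquad d = \left(\delta(\mathrm{odeg}(1), 0), \ldots, \delta(\mathrm{odeg}(n), 0)\right)^T,$$

so that $d_i = 1$ if page $i$ is dangling and $d_i = 0$ otherwise, and let $e = (1, \ldots, 1)^T$.

Consider

$$P'' = c\left[P + d v^T\right] + (1 - c)\, e v^T.$$

Then

$$y = P''^T x = c P^T x + c\, v \left(d^T x\right) + (1 - c)\, v \left(e^T x\right).$$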

PageRank computation

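Let $x \in \mathbb{R}^n$ be a vector with nonnegative entries, and let $P = (P_{ij})$ be an $n \times n$ matrix with nonnegative entries such that, for each row $i$,

either $\sum_j P_{ij} = 1$, or $\sum_j P_{ij} = 0$.

Let $d \in \mathbb{R}^n$ with $d_i = \delta(\mathrm{odeg}(i), 0)$; then

$$\left|P^T x\right| = |x| - d^T x$$

(where $|y| = |y|_1 = |y_1| + \cdots + |y_n|$).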

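$$P^T x = \begin{pmatrix} P_{11} & P_{21} & \cdots & P_{n1} \\ P_{12} & P_{22} & \cdots & P_{n2} \\ \cdots & \cdots & \cdots & \cdots \\ P_{1n} & P_{2n} & \cdots & P_{nn} \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ \cdots \\ x_n \end{pmatrix} = x_1 \begin{pmatrix} P_{11} \\ \cdots \\ P_{1n} \end{pmatrix} + x_2 \begin{pmatrix} P_{21} \\ \cdots \\ P_{2n} \end{pmatrix} + \cdots + x_n \begin{pmatrix} P_{n1} \\ \cdots \\ P_{nn} \end{pmatrix}$$

$$\left|P^T x\right| = x_1 \Big(\sum_j P_{1j}\Big) + \cdots + x_n \Big(\sum_j P_{nj}\Big)$$

$$|x| - d^T x = x_1 + \cdots + x_n - \delta(\mathrm{odeg}(1), 0)\, x_1 - \cdots - \delta(\mathrm{odeg}(n), 0)\, x_n$$

Since $\sum_j P_{ij} = 1 - \delta(\mathrm{odeg}(i), 0)$ for every $i$, the two expressions coincide.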
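A quick numerical check of the identity $|P^T x| = |x| - d^T x$, as a sketch on hypothetical data: the matrix below has one zero (dangling) row, and `x` is an arbitrary nonnegative vector chosen for illustration.

```python
def norm1(y):
    """The 1-norm |y| = |y_1| + ... + |y_n|."""
    return sum(abs(t) for t in y)


# Hypothetical matrix: rows 0 and 2 sum to 1, row 1 is a dangling (zero) row.
P = [[0.0, 0.5, 0.5],
     [0.0, 0.0, 0.0],
     [1.0, 0.0, 0.0]]
x = [0.2, 0.5, 0.3]   # arbitrary nonnegative vector
n = len(P)

PTx = [sum(P[i][j] * x[i] for i in range(n)) for j in range(n)]  # P^T x
d = [1.0 if sum(row) == 0 else 0.0 for row in P]                 # d_i = delta(odeg(i), 0)

lhs = norm1(PTx)                                       # |P^T x|
rhs = norm1(x) - sum(di * xi for di, xi in zip(d, x))  # |x| - d^T x
```

Here the mass `x[1]` sitting on the dangling page is exactly what is lost when `x` is multiplied by $P^T$.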

PageRank

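$$y = P''^T x = c P^T x + c\, v \left(d^T x\right) + (1 - c)\, v \left(e^T x\right) = c P^T x + \underbrace{\left(c\, d^T x + (1 - c)\, |x|\right)}_{\gamma}\, v,$$

where $e^T x = |x|$ for a nonnegative vector $x$. Using $\left|P^T x\right| = |x| - d^T x$,

$$\gamma = c\, d^T x + (1 - c)\, |x| = |x| - \left(c\, |x| - c\, d^T x\right) = |x| - \left|c P^T x\right|.$$

Hence $y$ can be computed as follows:

1. $y \longleftarrow c P^T x$,

2. $\gamma = |x| - |y|$,

3. $y \longleftarrow y + \gamma v$.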
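The three steps can be repeated until the vector stops changing. The sketch below is one possible way to do that, not a definitive implementation: the value `c = 0.85`, the uniform `v`, the tolerance, the iteration cap, and the example matrix are all assumptions that do not come from the slides.

```python
def norm1(y):
    """The 1-norm |y| = |y_1| + ... + |y_n|."""
    return sum(abs(t) for t in y)


def pagerank(P, c=0.85, tol=1e-12, max_iter=1000):
    """Iterate y = P''^T x using the three steps; P may contain zero rows."""
    n = len(P)
    v = [1.0 / n] * n   # assumed uniform vector v
    x = v[:]            # assumed starting distribution
    for _ in range(max_iter):
        # Step 1: y <- c P^T x
        y = [c * sum(P[i][j] * x[i] for i in range(n)) for j in range(n)]
        # Step 2: gamma = |x| - |y|
        gamma = norm1(x) - norm1(y)
        # Step 3: y <- y + gamma v
        y = [y[j] + gamma * v[j] for j in range(n)]
        if norm1([y[j] - x[j] for j in range(n)]) < tol:
            return y
        x = y
    return x


# Hypothetical transition matrix with a dangling page (row 1 is zero).
P = [[0.0, 0.5, 0.5],
     [0.0, 0.0, 0.0],
     [1.0, 0.0, 0.0]]
rank = pagerank(P)
```

Note that the dangling rows never have to be filled in explicitly: the lost mass is recovered through `gamma` in Step 2.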

Hyperlink Induced Topic Search (HITS)

works with a subgraph specific to a particular query (rather than with a full graph),

computes two weights (authority and hub) for each web page,

allows clustering of results for multi-topic or polarized queries.

Root and Focused Sets

Root set: The top $t$ (around 200) results are recalled for a given query (the results are picked according to a text-based relevance criterion).

Focused set: All pages pointed to by outlinks of the root set are added, along with up to $d$ (about 50) pages corresponding to inlinks of each page in the root set.
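A minimal sketch of the focused-set construction, under stated assumptions: the link structure is given as two hypothetical dictionaries (`out_links` and `in_links`), and when a page has more than `d` inlinks the first `d` in list order are kept, a selection rule the slides do not specify.

```python
def focused_set(root, out_links, in_links, d=50):
    """Expand a root set with all outlink targets and up to d inlink sources per page."""
    focused = set(root)
    for page in root:
        focused.update(out_links.get(page, []))      # every page the root page points to
        focused.update(in_links.get(page, [])[:d])   # at most d pages that point to it
    return focused


# Hypothetical link data for a one-page root set.
root = {"a"}
out_links = {"a": ["b", "c"]}
in_links = {"a": ["x", "y", "z"]}
fs = focused_set(root, out_links, in_links, d=2)
```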

Hubs and Authorities

Define “authorities” and “hubs” as follows:

1. a page $p$ is an authority if it is pointed to by many pages,

2. a page $p$ is a hub if it points to many pages.

To measure the “authority” and the “hub” of the pages we consider $L_2$ unit norm vectors $a$ and $h$ of dimension $|V|$, so that

$a[p]$ is the “authority”

$h[p]$ is the “hub”

of the page $p$.

Hubs and Authorities

The following is an iterative process that computes the vectors.

1. set $t = 0$

2. assign initial values $a^{(t)}$ and $h^{(t)}$

3. normalize vectors $a^{(t)}$ and $h^{(t)}$, so that $\sum_p \left(a^{(t)}[p]\right)^2 = \sum_p \left(h^{(t)}[p]\right)^2 = 1$

4. set $a^{(t+1)}[p] = \sum_{q \to p} h^{(t)}[q]$, and $h^{(t+1)}[p] = \sum_{p \to q} a^{(t+1)}[q]$

5. if (stopping criterion fails) then increment $t$ by 1, go to Step 3; else stop.
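The steps above can be sketched in Python as follows. The edge-list representation, the all-ones initial values, and the stopping criterion (largest change between successive normalized vectors below `tol`) are assumptions chosen for illustration; the graph is assumed to contain at least one edge.

```python
import math


def normalize(x):
    """Scale x to unit L2 norm."""
    s = math.sqrt(sum(t * t for t in x))
    return [t / s for t in x]


def hits(edges, n, tol=1e-10, max_iter=1000):
    """edges is a list of (source, target) pairs over nodes 0..n-1."""
    a, h = [1.0] * n, [1.0] * n              # Step 2: assumed initial values
    for _ in range(max_iter):
        a, h = normalize(a), normalize(h)    # Step 3
        new_a = [0.0] * n
        for q, p in edges:                   # Step 4: a[p] = sum of h[q] over q -> p
            new_a[p] += h[q]
        new_h = [0.0] * n
        for p, q in edges:                   # Step 4: h[p] = sum of new a[q] over p -> q
            new_h[p] += new_a[q]
        # Step 5: assumed stopping criterion on the normalized vectors
        change = max(abs(s - t) for s, t in
                     zip(normalize(new_a) + normalize(new_h), a + h))
        a, h = new_a, new_h
        if change < tol:
            break
    return normalize(a), normalize(h)


# Hypothetical graph: pages 0 and 1 act as hubs, pages 2 and 3 as authorities.
edges = [(0, 2), (1, 2), (0, 3)]
a, h = hits(edges, 4)
```

In this example page 2 receives the larger authority weight because both hubs point to it, and page 0 receives the larger hub weight because it points to both authorities.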

Adjacency Matrix

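Let $A$ be the adjacency matrix of the graph $G$, i.e.

$$A_{ij} = \begin{cases} 1 & \text{if page } i \to j \\ 0 & \text{otherwise} \end{cases}$$

Note that

$$a^{(t+1)} = \frac{A^T h^{(t)}}{\left\|A^T h^{(t)}\right\|}, \quad \text{and} \quad h^{(t+1)} = \frac{A a^{(t+1)}}{\left\|A a^{(t+1)}\right\|}.$$

This yields

$$a^{(t+1)} = \frac{A^T A\, a^{(t)}}{\left\|A^T A\, a^{(t)}\right\|}, \quad \text{and} \quad h^{(t+1)} = \frac{A A^T h^{(t)}}{\left\|A A^T h^{(t)}\right\|}.$$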

Eigenvectors

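$$a^{(t)} = \frac{\left(A^T A\right)^t a^{(0)}}{\left\|\left(A^T A\right)^t a^{(0)}\right\|}, \quad \text{and} \quad h^{(t)} = \frac{\left(A A^T\right)^t h^{(0)}}{\left\|\left(A A^T\right)^t h^{(0)}\right\|}.$$

Let $v$ and $w$ be unit eigenvectors corresponding to the maximal eigenvalues of the symmetric matrices $A^T A$ and $A A^T$, respectively. The above arguments lead to the following result:

$$\lim_{t \to \infty} a^{(t)} = v, \qquad \lim_{t \to \infty} h^{(t)} = w.$$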
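The limit can be illustrated with a short power iteration on $A^T A$. This is a sketch on a hypothetical 4-page adjacency matrix; the all-ones starting vector and the fixed count of 100 iterations are assumptions, and the helper names are illustrative.

```python
import math


def matvec(M, x):
    """Matrix-vector product M x for a list-of-rows matrix."""
    return [sum(m * t for m, t in zip(row, x)) for row in M]


def transpose(M):
    return [list(col) for col in zip(*M)]


def normalize(x):
    s = math.sqrt(sum(t * t for t in x))
    return [t / s for t in x]


# Hypothetical adjacency matrix: 0 -> 2, 0 -> 3, 1 -> 2.
A = [[0, 0, 1, 1],
     [0, 0, 1, 0],
     [0, 0, 0, 0],
     [0, 0, 0, 0]]
AT = transpose(A)

a = normalize([1.0] * 4)
for _ in range(100):
    a = normalize(matvec(AT, matvec(A, a)))   # a <- A^T A a / ||A^T A a||

w = matvec(AT, matvec(A, a))                  # A^T A a
lam = sum(s * t for s, t in zip(a, w))        # Rayleigh quotient, estimates the top eigenvalue
residual = max(abs(wi - lam * ai) for wi, ai in zip(w, a))
```

For this matrix the nonzero block of $A^T A$ is $\begin{pmatrix} 2 & 1 \\ 1 & 1 \end{pmatrix}$, whose largest eigenvalue is $(3 + \sqrt{5})/2$, so `lam` approaches that value and `residual` approaches zero.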
