
PageRank and related algorithms

PageRank and HITS

Jacob Kogan

Department of Mathematics and Statistics

University of Maryland, Baltimore County

Baltimore, Maryland 21250

kogan@umbc.edu

May 15, 2006

Basic References

L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: bringing order to the web. Stanford Digital Library Technologies Project, 1998, citeseer.ist.psu.edu/page98pagerank.html.

J. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM, 46:5, pp. 604-632, 1999.

P. Berkhin. A survey on PageRank computing. Internet Mathematics, vol. 2, no. 1, pp. 73–120, 2005.

Jacob Kogan, UMBC PageRank and related algorithms, optimization 2/21

PageRank

PageRank is a global “importance” ranking of every web page.

The method is based on the graph of the web. The model is inspired by academic citation analysis.

If a page is linked from an “important” page (the Yahoo home page, for example), then that link should contribute more to the page's “importance” than links from “obscure” pages.


The graph and the matrix

G(V,E) is a directed graph

V are the vertices/nodes (say n HTML pages)

E are the directed edges (hyperlinks)

The $n \times n$ adjacency matrix $A = (A_{ij})$:

\[
A_{ij} =
\begin{cases}
1 & \text{if page } i \longrightarrow j \\
0 & \text{otherwise}
\end{cases}
\]
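As an illustration of the definition above, the adjacency matrix can be built from a list of hyperlinks. This is a minimal sketch: the function name `adjacency_matrix`, the 0-based page indices, and the four-page example web are assumptions made for the example, not part of the slides.

```python
def adjacency_matrix(n, links):
    """Return the n x n matrix A with A[i][j] = 1 if page i links to page j."""
    A = [[0] * n for _ in range(n)]
    for i, j in links:
        A[i][j] = 1
    return A

# Hypothetical 4-page web (0-based indices):
# 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0; page 3 has no outgoing links.
links = [(0, 1), (0, 2), (1, 2), (2, 0)]
A = adjacency_matrix(4, links)
```

Row i of A lists the pages that page i points to, so the row sum is the out-degree of page i.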


Transition matrix P

$P = (P_{ij})$, where

\[
P_{ij} = \frac{A_{ij}}{\mathrm{odeg}(i)}
\]

($\mathrm{odeg}(i)$, the out-degree of node $i$, is the number of outgoing links), so that

\[
\sum_j P_{ij} = 1
\]

($P$ is row stochastic).
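The construction of P from A can be sketched in a few lines. The function name `transition_matrix` and the three-page example are assumptions for illustration; the sketch also assumes that a page with out-degree 0 is simply left as a zero row, which the definition above does not cover.

```python
def transition_matrix(A):
    """Divide each row of the adjacency matrix by the out-degree of its page."""
    n = len(A)
    P = []
    for i in range(n):
        odeg = sum(A[i])  # number of outgoing links of page i
        if odeg == 0:
            P.append([0.0] * n)  # assumption: leave a zero row for such pages
        else:
            P.append([A[i][j] / odeg for j in range(n)])
    return P

# Hypothetical 3-page web: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0.
A = [[0, 1, 1],
     [0, 0, 1],
     [1, 0, 0]]
P = transition_matrix(A)
row_sums = [sum(row) for row in P]
```

Every page in this example has at least one outgoing link, so each row of P sums to 1.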


Random Surfer Model

A surfer travels along the directed graph $G$. $P_{ij}$, $j = 1, \dots, n$, is the probability that the surfer moves from node $i$ to node $j$.

If at step $k$ the probability of the surfer being located at node $i$ is $p^{(k)}_i$, so that

\[
p^{(k)} = \left(p^{(k)}_1, \dots, p^{(k)}_n\right),
\]

then

\[
p^{(k+1)} = P^T p^{(k)}.
\]

$p^{(k+1)}$ is a probability distribution!
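One step of the surfer can be sketched as a transposed matrix-vector product. The name `step` and the three-page matrix are hypothetical; the example assumes the surfer starts on page 0 with probability 1.

```python
def step(P, p):
    """Return q = P^T p, i.e. q[j] = sum over i of P[i][j] * p[i]."""
    n = len(p)
    return [sum(P[i][j] * p[i] for i in range(n)) for j in range(n)]

# Row-stochastic matrix of a hypothetical 3-page web.
P = [[0.0, 0.5, 0.5],
     [0.0, 0.0, 1.0],
     [1.0, 0.0, 0.0]]

p0 = [1.0, 0.0, 0.0]   # the surfer starts on page 0
p1 = step(P, p0)       # distribution after one move
p2 = step(P, p1)       # distribution after two moves
```

Both `p1` and `p2` have nonnegative entries that sum to 1, as the slide claims.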


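$q = P^T p$

If $p = (p_1, \dots, p_n)$, $p_i \ge 0$, $\sum p_i = 1$, and $q = (q_1, \dots, q_n)$, $q = P^T p$, then $q_i \ge 0$, $\sum q_i = 1$.

Indeed, $q_j = \sum_{i=1}^{n} P_{ij}\, p_i$, so

\[
\sum_{j=1}^{n} q_j
= \sum_{j=1}^{n} \left( \sum_{i=1}^{n} P_{ij}\, p_i \right)
= \sum_{i=1}^{n} p_i \sum_{j=1}^{n} P_{ij}
= \sum_{i=1}^{n} p_i = 1.
\]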


Dangling Pages

Pages that have no outgoing links are called dangling pages, or sinks, or attractors.

With dangling pages the transition matrix P has zero rows, and fails to be stochastic.


PageRank

Definition. A PageRank vector is a non-negative stationary point of the transformation

\[
q = P^T p
\]

(a stationary distribution for a Markov chain).
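A stationary point of this transformation can be approximated by repeating the map until the vector stops changing. The sketch below is one possible way to do this, not a method stated on the slide: the function names, the tolerance, and the iteration cap are assumptions, and it assumes a row-stochastic P (no dangling pages) for which the iteration converges.

```python
def step(P, p):
    """Return q = P^T p."""
    n = len(p)
    return [sum(P[i][j] * p[i] for i in range(n)) for j in range(n)]

def stationary_vector(P, tol=1e-12, max_iter=1000):
    """Iterate p <- P^T p from the uniform distribution until the L1 change is below tol."""
    n = len(P)
    p = [1.0 / n] * n
    for _ in range(max_iter):
        q = step(P, p)
        if sum(abs(a - b) for a, b in zip(p, q)) < tol:
            return q
        p = q
    return p

# Hypothetical 3-page web: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0.
P = [[0.0, 0.5, 0.5],
     [0.0, 0.0, 1.0],
     [1.0, 0.0, 0.0]]
pr = stationary_vector(P)
```

For this small example the stationary equations can be solved by hand, giving (0.4, 0.2, 0.4), which the iteration approaches.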

What can be done in the presence of dangling pages?


What can be done?

removal of dangling pages,

renormalization of $P^T p^{(k+1)}$,

adding a self-link to each dangling page,

introducing an “ideal page” with a self-link, to which each dangling page links,

modifying the matrix P by introducing artificial links that uniformly connect dangling pages to all pages ($P' = P + d v^T$).
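The last option can be sketched directly. The sketch assumes that v is the uniform vector with entries 1/n and that d has a 1 in position i exactly when row i of P is zero; the function name `fix_dangling` and the example matrix are hypothetical.

```python
def fix_dangling(P):
    """Return P' = P + d v^T: every zero row of P is replaced by the uniform row v."""
    n = len(P)
    v = [1.0 / n] * n
    d = [1.0 if sum(row) == 0 else 0.0 for row in P]  # 1 marks a dangling page
    return [[P[i][j] + d[i] * v[j] for j in range(n)] for i in range(n)]

# Hypothetical 4-page web in which page 3 is dangling.
P = [[0.0, 0.5, 0.5, 0.0],
     [0.0, 0.0, 1.0, 0.0],
     [1.0, 0.0, 0.0, 0.0],
     [0.0, 0.0, 0.0, 0.0]]
P_prime = fix_dangling(P)
```

Rows of non-dangling pages are unchanged, and the modified matrix is row stochastic again.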


PageRank

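\[
v = \begin{pmatrix} \frac{1}{n} \\ \vdots \\ \frac{1}{n} \end{pmatrix},
\qquad
d = \begin{pmatrix} \delta(\mathrm{odeg}(1), 0) \\ \vdots \\ \delta(\mathrm{odeg}(n), 0) \end{pmatrix}
\]

(here $\delta$ is the Kronecker delta, so $d_i = 1$ exactly when page $i$ is dangling).

Consider

\[
P'' = c\left[P + d v^T\right] + (1 - c)\, e v^T,
\]

where $e = (1, \dots, 1)^T$ and $0 \le c \le 1$. Then

\[
y = P''^T x = c P^T x + c\, v \left(d^T x\right) + (1 - c)\, v \left(e^T x\right).
\]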
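The three terms of this product can be evaluated without ever forming the dense matrix. The sketch below assumes uniform v and takes c = 0.85 as a default; that value is an assumption for illustration, since the slides do not fix c. The function name and the example matrix are hypothetical.

```python
def apply_modified_matrix(P, x, c=0.85):
    """Return y = c P^T x + c v (d^T x) + (1 - c) v (e^T x) with uniform v."""
    n = len(P)
    v = [1.0 / n] * n
    d = [1.0 if sum(row) == 0 else 0.0 for row in P]  # dangling-page indicator
    PTx = [sum(P[i][j] * x[i] for i in range(n)) for j in range(n)]
    dTx = sum(di * xi for di, xi in zip(d, x))  # mass sitting on dangling pages
    eTx = sum(x)                                # total mass of x
    return [c * PTx[j] + c * v[j] * dTx + (1 - c) * v[j] * eTx for j in range(n)]

# Hypothetical 4-page web in which page 3 is dangling.
P = [[0.0, 0.5, 0.5, 0.0],
     [0.0, 0.0, 1.0, 0.0],
     [1.0, 0.0, 0.0, 0.0],
     [0.0, 0.0, 0.0, 0.0]]
y = apply_modified_matrix(P, [0.25, 0.25, 0.25, 0.25])
```

Starting from a probability vector, the result is again a probability vector, and every page receives a strictly positive share when c < 1.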


PageRank computation

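Let $x$ be a vector in $\mathbb{R}^n$ with nonnegative entries, and let $P = (P_{ij})$ be an $n \times n$ matrix with nonnegative entries such that, for each $i$,

\[
\text{either } \sum_j P_{ij} = 1, \text{ or } \sum_j P_{ij} = 0.
\]

Let $d \in \mathbb{R}^n$ be such that $d_i = \delta(\mathrm{odeg}(i), 0)$; then

\[
\left|P^T x\right| = |x| - d^T x
\]

(where $|y| = |y|_1 = |y_1| + \cdots + |y_n|$).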


PageRank computation

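\[
P^T x =
\begin{pmatrix}
P_{11} & P_{21} & \cdots & P_{n1} \\
P_{12} & P_{22} & \cdots & P_{n2} \\
\vdots & \vdots & \ddots & \vdots \\
P_{1n} & P_{2n} & \cdots & P_{nn}
\end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix}
= x_1 \begin{pmatrix} P_{11} \\ P_{12} \\ \vdots \\ P_{1n} \end{pmatrix}
+ x_2 \begin{pmatrix} P_{21} \\ P_{22} \\ \vdots \\ P_{2n} \end{pmatrix}
+ \cdots
+ x_n \begin{pmatrix} P_{n1} \\ P_{n2} \\ \vdots \\ P_{nn} \end{pmatrix}
\]

Hence

\[
\left|P^T x\right| = x_1 \sum_j P_{1j} + x_2 \sum_j P_{2j} + \cdots + x_n \sum_j P_{nj}.
\]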


PageRank computation

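\[
\left|P^T x\right| = x_1 \Big(\sum_j P_{1j}\Big) + \cdots + x_n \Big(\sum_j P_{nj}\Big).
\]

\[
|x| - d^T x = x_1 + \cdots + x_n - \delta(\mathrm{odeg}(1), 0)\, x_1 - \cdots - \delta(\mathrm{odeg}(n), 0)\, x_n.
\]

The two expressions agree term by term, since $\sum_j P_{ij} = 1 - \delta(\mathrm{odeg}(i), 0)$ for every $i$.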
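The identity can be checked numerically on a small example. The matrix and the nonnegative vector x below are hypothetical values chosen for the check; d is derived from the zero rows of P.

```python
# Hypothetical 4-page web in which page 3 is dangling (zero row).
P = [[0.0, 0.5, 0.5, 0.0],
     [0.0, 0.0, 1.0, 0.0],
     [1.0, 0.0, 0.0, 0.0],
     [0.0, 0.0, 0.0, 0.0]]
x = [0.1, 0.2, 0.3, 0.4]  # nonnegative entries

n = len(P)
d = [1.0 if sum(row) == 0 else 0.0 for row in P]
PTx = [sum(P[i][j] * x[i] for i in range(n)) for j in range(n)]

lhs = sum(abs(t) for t in PTx)                                       # |P^T x|
rhs = sum(abs(t) for t in x) - sum(di * xi for di, xi in zip(d, x))  # |x| - d^T x
```

Both sides come out to 0.6 up to rounding: the 0.4 of mass on the dangling page is exactly what the product loses.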


PageRank

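\[
y = P''^T x = c P^T x + \underbrace{\left[ c \left(d^T x\right) + (1 - c) \left(e^T x\right) \right]}_{\gamma}\, v.
\]

For $x$ with nonnegative entries, $e^T x = |x|$, so the coefficient of $v$ is

\[
\gamma = c \left(d^T x\right) + (1 - c)\, |x| = |x| - \left( c\, |x| - c \left(d^T x\right) \right) = |x| - \left| c P^T x \right|.
\]

Hence y can be computed as follows:

1. $y \longleftarrow c P^T x$,

2. $\gamma = |x| - |y|$,

3. $y \longleftarrow y + \gamma v$.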
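The three steps translate almost line for line into code. This is a sketch under stated assumptions: v is uniform, c = 0.85 is an illustrative default that the slides do not specify, and the fixed count of 200 repetitions stands in for a stopping criterion. The name `pagerank_step` and the example matrix are hypothetical.

```python
def pagerank_step(P, x, c=0.85):
    """One application of the three-step update with uniform v."""
    n = len(P)
    # Step 1: y <- c P^T x
    y = [c * sum(P[i][j] * x[i] for i in range(n)) for j in range(n)]
    # Step 2: gamma = |x| - |y| (L1 norms)
    gamma = sum(abs(t) for t in x) - sum(abs(t) for t in y)
    # Step 3: y <- y + gamma v, with v[j] = 1/n
    return [yj + gamma / n for yj in y]

# Hypothetical 4-page web in which page 3 is dangling.
P = [[0.0, 0.5, 0.5, 0.0],
     [0.0, 0.0, 1.0, 0.0],
     [1.0, 0.0, 0.0, 0.0],
     [0.0, 0.0, 0.0, 0.0]]

x = [0.25, 0.25, 0.25, 0.25]
for _ in range(200):
    x = pagerank_step(P, x)
```

The appeal of this formulation is that the dangling vector d never has to be stored: the mass lost in step 1 is measured by the norms and redistributed through v.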


Hyperlink Induced Topic Search (HITS)

works with a subgraph specific to a particular query (rather than with a full graph),

computes two weights (authority and hub) for each webpage,

allows clustering of results for multi-topic or polarized queries.


Root and Focused Sets

Root set: The top t (around 200) results are retrieved for a given query (the results are picked according to a text-based relevance criterion).

Focused set: All pages pointed to by outlinks of the root set are added, along with up to d (about 50) pages corresponding to inlinks of each page in the root set.
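The expansion from root set to focused set can be sketched as follows. The dictionaries `out_links` and `in_links`, the page names, and the function name are hypothetical. The slides do not say which inlinks are kept when a page has more than d of them, so taking the first d in list order is an assumption.

```python
def focused_set(root, out_links, in_links, d=50):
    """Expand a root set with all outlink targets and up to d inlink sources per root page."""
    focused = set(root)
    for page in root:
        focused.update(out_links.get(page, []))      # every page the root page points to
        focused.update(in_links.get(page, [])[:d])   # at most d pages pointing to it
    return focused

# Hypothetical link data for a two-page root set.
root = {"a", "b"}
out_links = {"a": ["c"], "b": ["c", "d"]}
in_links = {"a": ["e", "f", "g"], "b": ["h"]}
result = focused_set(root, out_links, in_links, d=2)
```

With d = 2, page "g" is left out because "a" already contributed two inlink pages.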


Hubs and Authorities

Define “authorities” and “hubs” as follows:

1. a page p is an authority if it is pointed to by many pages,

2. a page p is a hub if it points to many pages.

To measure the “authority” and the “hub” of the pages we consider L2 unit norm vectors a and h of dimension |V|, so that a[p] is the “authority” and h[p] is the “hub” of the page p.


Hubs and Authorities

The following is an iterative process that computes the vectors.

1. set $t = 0$

2. assign initial values $a^{(t)}$ and $h^{(t)}$

3. normalize the vectors $a^{(t)}$ and $h^{(t)}$, so that $\sum_p \left(a^{(t)}[p]\right)^2 = \sum_p \left(h^{(t)}[p]\right)^2 = 1$

4. set $a^{(t+1)}[p] = \sum_{q \longrightarrow p} h^{(t)}[q]$, and $h^{(t+1)}[p] = \sum_{p \longrightarrow q} a^{(t+1)}[q]$

5. if (stopping criterion fails) then increment $t$ by 1, go to Step 3; else stop.
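The iteration can be sketched as below. Two simplifications are assumptions of the sketch rather than part of the procedure: a fixed number of iterations replaces the stopping criterion, and each vector is normalized right after it is updated, which yields the same unit vectors as normalizing at the start of the next pass. The graph must contain at least one link, otherwise the norm is zero. The function names and the example graph are hypothetical.

```python
import math

def normalize(vec):
    """Scale vec to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def hits(A, iterations=50):
    """Return (authority, hub) vectors for the adjacency matrix A."""
    n = len(A)
    a = normalize([1.0] * n)
    h = normalize([1.0] * n)
    for _ in range(iterations):
        # a[p] = sum of h[q] over pages q that link to p
        a = normalize([sum(A[q][p] * h[q] for q in range(n)) for p in range(n)])
        # h[p] = sum of the updated a[q] over pages q that p links to
        h = normalize([sum(A[p][q] * a[q] for q in range(n)) for p in range(n)])
    return a, h

# Hypothetical 4-page graph: 0 -> 2, 0 -> 3, 1 -> 2.
A = [[0, 0, 1, 1],
     [0, 0, 1, 0],
     [0, 0, 0, 0],
     [0, 0, 0, 0]]
a, h = hits(A)
```

In this example page 2 receives the highest authority weight (two inlinks) and page 0 the highest hub weight (two outlinks).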


Adjacency Matrix

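Let $A$ be the adjacency matrix of the graph $G$, i.e.

\[
A_{ij} =
\begin{cases}
1 & \text{if page } i \longrightarrow j \\
0 & \text{otherwise}
\end{cases}
\]

Note that

\[
a^{(t+1)} = \frac{A^T h^{(t)}}{\left\| A^T h^{(t)} \right\|},
\quad \text{and} \quad
h^{(t+1)} = \frac{A a^{(t+1)}}{\left\| A a^{(t+1)} \right\|}.
\]

This yields

\[
a^{(t+1)} = \frac{A^T A\, a^{(t)}}{\left\| A^T A\, a^{(t)} \right\|},
\quad \text{and} \quad
h^{(t+1)} = \frac{A A^T h^{(t)}}{\left\| A A^T h^{(t)} \right\|}.
\]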
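The step from the first pair of formulas to the second can be written out explicitly. The following is a sketch of the omitted substitution; for the authority vector it assumes that $h^{(t)}$ was itself produced by the update rule (so $t \ge 1$), and $\|\cdot\|$ denotes the L2 norm used for normalization.

```latex
% Authority update: substitute h^{(t)} = A a^{(t)} / \|A a^{(t)}\|.
\[
A^T h^{(t)} = \frac{A^T A\, a^{(t)}}{\left\| A a^{(t)} \right\|}
\quad \Longrightarrow \quad
a^{(t+1)}
= \frac{A^T h^{(t)}}{\left\| A^T h^{(t)} \right\|}
= \frac{A^T A\, a^{(t)} \big/ \left\| A a^{(t)} \right\|}
       {\left\| A^T A\, a^{(t)} \right\| \big/ \left\| A a^{(t)} \right\|}
= \frac{A^T A\, a^{(t)}}{\left\| A^T A\, a^{(t)} \right\|}.
\]

% Hub update: substitute a^{(t+1)} = A^T h^{(t)} / \|A^T h^{(t)}\|.
\[
A a^{(t+1)} = \frac{A A^T h^{(t)}}{\left\| A^T h^{(t)} \right\|}
\quad \Longrightarrow \quad
h^{(t+1)}
= \frac{A a^{(t+1)}}{\left\| A a^{(t+1)} \right\|}
= \frac{A A^T h^{(t)}}{\left\| A A^T h^{(t)} \right\|}.
\]
```

In both cases the positive scalar introduced by the substitution cancels between numerator and denominator, which is why only the direction of the vector matters.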


Eigenvectors

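\[
a^{(t)} = \frac{\left(A^T A\right)^t a^{(0)}}{\left\| \left(A^T A\right)^t a^{(0)} \right\|},
\quad \text{and} \quad
h^{(t)} = \frac{\left(A A^T\right)^t h^{(0)}}{\left\| \left(A A^T\right)^t h^{(0)} \right\|}.
\]

Let $v$ and $w$ be unit eigenvectors corresponding to the maximal eigenvalues of the symmetric matrices $A^T A$ and $A A^T$, respectively. The above arguments lead to the following result:

\[
\lim_{t \to \infty} a^{(t)} = v, \qquad \lim_{t \to \infty} h^{(t)} = w.
\]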
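The limit can be checked numerically on a small graph by iterating with $A^T A$ and testing the eigenvector equation. This is a sketch under the usual power-iteration assumptions, which the slides leave implicit: the maximal eigenvalue is simple and the starting vector is not orthogonal to its eigenvector. The helper names and the example graph are hypothetical.

```python
import math

def matvec(M, x):
    """Return the matrix-vector product M x."""
    return [sum(M[i][j] * x[j] for j in range(len(x))) for i in range(len(M))]

def normalize(vec):
    """Scale vec to unit L2 norm."""
    norm = math.sqrt(sum(t * t for t in vec))
    return [t / norm for t in vec]

def gram(A):
    """Return the symmetric matrix A^T A."""
    n = len(A)
    return [[sum(A[k][i] * A[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

# Hypothetical 4-page graph: 0 -> 2, 0 -> 3, 1 -> 2.
A = [[0, 0, 1, 1],
     [0, 0, 1, 0],
     [0, 0, 0, 0],
     [0, 0, 0, 0]]
M = gram(A)

a = normalize([1.0] * 4)
for _ in range(100):
    a = normalize(matvec(M, a))

Ma = matvec(M, a)
lam = sum(x * y for x, y in zip(a, Ma))                   # Rayleigh quotient
residual = max(abs(Ma[i] - lam * a[i]) for i in range(4))  # size of M a - lam a
```

For this graph the nonzero block of $A^T A$ is [[2, 1], [1, 1]], whose largest eigenvalue is $(3 + \sqrt{5})/2$; the residual shows that the limit vector satisfies the eigenvector equation.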
