PageRank and related algorithms
PageRank and HITS

Jacob Kogan

Department of Mathematics and Statistics
University of Maryland, Baltimore County

Baltimore, Maryland 21250

[email protected]

May 15, 2006

Basic References

L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: bringing order to the web. Stanford Digital Library Technologies Project, 1998, citeseer.ist.psu.edu/page98pagerank.html.

J. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5), pp. 604-632, 1999.

P. Berkhin. A survey on PageRank computing. Internet Mathematics, 2(1), pp. 73-120, 2005.

Jacob Kogan, UMBC PageRank and related algorithms, optimization 2/21

PageRank

PageRank is a global “importance” ranking of every web page.

The method is based on the graph of the web. The model is inspired by academic citation analysis.

If a page is linked from an “important” page (the Yahoo home page, for example), then that link should contribute more to the page's “importance” than links from “obscure” pages.

The graph and the matrix

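G(V, E) is a directed graph.

V is the set of vertices/nodes (say, n HTML pages).

E is the set of directed edges (hyperlinks).

The n × n adjacency matrix is $A = (A_{ij})$, where

\[
A_{ij} =
\begin{cases}
1 & \text{if page } i \longrightarrow j, \\
0 & \text{otherwise.}
\end{cases}
\]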

Transition matrix P

$P = (P_{ij})$, where

\[
P_{ij} = \frac{A_{ij}}{\operatorname{odeg}(i)}
\]

and odeg(i), the out-degree of node i, is the number of its outgoing links, so that

\[
\sum_j P_{ij} = 1
\]

(P is row stochastic).
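As a minimal sketch of the definition above, the following NumPy snippet row-normalizes an adjacency matrix. The function name `transition_matrix` and the 3-page example graph are illustrative assumptions, not part of the slides. Rows with no outgoing links are left as zero rows, which is a choice made for this sketch.

```python
import numpy as np

def transition_matrix(A):
    """Row-normalize an adjacency matrix: P[i, j] = A[i, j] / odeg(i)."""
    A = np.asarray(A, dtype=float)
    odeg = A.sum(axis=1)                 # out-degree of every node
    P = np.zeros_like(A)
    has_links = odeg > 0                 # assumption: zero rows stay zero
    P[has_links] = A[has_links] / odeg[has_links, None]
    return P

# Hypothetical 3-page web: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0
A = np.array([[0, 1, 1],
              [0, 0, 1],
              [1, 0, 0]])
P = transition_matrix(A)
print(P)
print(P.sum(axis=1))                     # row sums of P
```

Every page in this example has at least one outgoing link, so every row of P sums to one.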

Random Surfer Model

A surfer travels along the directed graph G. $P_{ij}$, $j = 1, \dots, n$, is the probability that the surfer moves from node i to node j.

If at step k the probability of the surfer being located at node i is $p_i^{(k)}$, so that

\[
p^{(k)} = \left( p_1^{(k)}, \dots, p_n^{(k)} \right),
\]

then

\[
p^{(k+1)} = P^T p^{(k)}.
\]

$p^{(k+1)}$ is a probability distribution!
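The update rule can be tried on a small example. The sketch below assumes a hypothetical 3-page web with a row-stochastic P and a surfer who starts at page 0; it applies the rule three times and prints the sum of the entries at each step.

```python
import numpy as np

# Row-stochastic transition matrix of a hypothetical 3-page web
P = np.array([[0.0, 0.5, 0.5],
              [0.0, 0.0, 1.0],
              [1.0, 0.0, 0.0]])

p = np.array([1.0, 0.0, 0.0])    # the surfer starts at page 0
for k in range(3):
    p = P.T @ p                  # p^(k+1) = P^T p^(k)
    print(k + 1, p, p.sum())     # the entries keep summing to 1
```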

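$q = P^T p$

If $p = (p_1, \dots, p_n)$, $p_i \ge 0$, $\sum p_i = 1$, and $q = (q_1, \dots, q_n)$, $q = P^T p$, then $q_i \ge 0$, $\sum q_i = 1$.

Indeed, $q_j = \sum_{i=1}^{n} P_{ij} p_i \ge 0$, and

\[
\sum_{j=1}^{n} q_j
= \sum_{j=1}^{n} \left( \sum_{i=1}^{n} P_{ij} p_i \right)
= \sum_{i=1}^{n} p_i \sum_{j=1}^{n} P_{ij}
= \sum_{i=1}^{n} p_i = 1.
\]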

Dangling Pages

Pages that have no outgoing links are called dangling pages, or sinks, or attractors.

With dangling pages the transition matrix P has zero rows and fails to be stochastic.

PageRank

Definition. A PageRank vector is a non-negative stationary point of the transformation

\[
q = P^T p,
\]

that is, a non-negative vector p with $p = P^T p$ (a stationary distribution for a Markov chain).

What can be done in the presence of dangling pages?

What can be done?

removal of dangling pages,

renormalization of $P^T p^{(k+1)}$,

adding a self link to each dangling page,

introducing an “ideal page” with a self link, to which each dangling page links,

modifying the matrix P by introducing artificial links that uniformly connect dangling pages to all pages ($P' = P + d v^T$).

PageRank

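\[
v = \begin{pmatrix} \frac{1}{n} \\ \vdots \\ \frac{1}{n} \end{pmatrix},
\qquad
d = \begin{pmatrix} \delta(\operatorname{odeg}(1), 0) \\ \vdots \\ \delta(\operatorname{odeg}(n), 0) \end{pmatrix}
\]

Here δ is the Kronecker delta, so $d_i = 1$ if page i is dangling and $d_i = 0$ otherwise.

Consider

\[
P'' = c \left[ P + d v^T \right] + (1 - c)\, e v^T,
\]

where e denotes the vector of all ones. Then

\[
y = P''^T x = c P^T x + c v \left( d^T x \right) + (1 - c) v \left( e^T x \right).
\]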
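A small numerical sketch of the construction, under stated assumptions: the value c = 0.85 and the 3-page matrix are illustrative (the slides leave c unspecified), and e is taken to be the vector of all ones. The snippet builds P'' explicitly and compares the dense product with the three-term expression for y.

```python
import numpy as np

c = 0.85                                   # assumed value, not given on the slides
P = np.array([[0.0, 0.5, 0.5],
              [0.0, 0.0, 0.0],             # page 1 is dangling (zero row)
              [1.0, 0.0, 0.0]])
n = P.shape[0]
v = np.full(n, 1.0 / n)                    # uniform vector v
e = np.ones(n)                             # assumed: vector of all ones
d = (P.sum(axis=1) == 0).astype(float)     # d_i = 1 iff odeg(i) = 0

P2 = c * (P + np.outer(d, v)) + (1 - c) * np.outer(e, v)   # P''

x = np.array([0.2, 0.3, 0.5])
y_dense = P2.T @ x                                          # P''^T x
y_split = c * (P.T @ x) + c * v * (d @ x) + (1 - c) * v * (e @ x)

print(P2.sum(axis=1))                      # P'' is row stochastic
print(np.allclose(y_dense, y_split))       # both ways of computing y agree
```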

PageRank computation

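Let x be a vector in $\mathbb{R}^n$ with nonnegative entries, and let $P = (P_{ij})$ be an n × n matrix with nonnegative entries such that, for each i,

\[
\text{either } \sum_j P_{ij} = 1 \quad \text{or} \quad \sum_j P_{ij} = 0.
\]

Let $d \in \mathbb{R}^n$ with $d_i = \delta(\operatorname{odeg}(i), 0)$. Then

\[
\left| P^T x \right| = |x| - d^T x
\]

(where $|y| = |y|_1 = |y_1| + \cdots + |y_n|$).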
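The identity can be checked numerically. The sketch below uses a hypothetical 3-page matrix with one zero row and a nonnegative x; the values are illustrative assumptions.

```python
import numpy as np

P = np.array([[0.0, 0.5, 0.5],
              [0.0, 0.0, 0.0],             # dangling page: zero row
              [1.0, 0.0, 0.0]])
d = (P.sum(axis=1) == 0).astype(float)     # indicator of dangling pages
x = np.array([0.2, 0.3, 0.5])              # nonnegative entries

lhs = np.abs(P.T @ x).sum()                # |P^T x|
rhs = np.abs(x).sum() - d @ x              # |x| - d^T x
print(lhs, rhs)                            # the two sides agree
```

The weight x carries on the dangling page (0.3 here) is exactly what is lost from the 1-norm.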

PageRank computation

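\[
P^T x =
\begin{pmatrix}
P_{11} & P_{21} & \cdots & P_{n1} \\
P_{12} & P_{22} & \cdots & P_{n2} \\
\cdots & \cdots & \cdots & \cdots \\
P_{1n} & P_{2n} & \cdots & P_{nn}
\end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \\ \cdots \\ x_n \end{pmatrix}
= x_1 \begin{pmatrix} P_{11} \\ P_{12} \\ \cdots \\ P_{1n} \end{pmatrix}
+ x_2 \begin{pmatrix} P_{21} \\ P_{22} \\ \cdots \\ P_{2n} \end{pmatrix}
+ \cdots
+ x_n \begin{pmatrix} P_{n1} \\ P_{n2} \\ \cdots \\ P_{nn} \end{pmatrix}
\]

Hence

\[
\left| P^T x \right| = x_1 \sum_j P_{1j} + x_2 \sum_j P_{2j} + \cdots + x_n \sum_j P_{nj}.
\]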

PageRank computation

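\[
\left| P^T x \right| = x_1 \Big( \sum_j P_{1j} \Big) + \cdots + x_n \Big( \sum_j P_{nj} \Big).
\]

\[
|x| - d^T x = x_1 + \cdots + x_n - \delta(\operatorname{odeg}(1), 0)\, x_1 - \cdots - \delta(\operatorname{odeg}(n), 0)\, x_n.
\]

The two expressions agree term by term, because $\sum_j P_{ij} = 1 - \delta(\operatorname{odeg}(i), 0)$ for every i.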

PageRank

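\[
y = P''^T x = c P^T x + \underbrace{c v \left( d^T x \right) + (1 - c) v \left( e^T x \right)}_{\gamma v}.
\]

For x with nonnegative entries, $e^T x = |x|$, so

\[
\gamma = c \left( d^T x \right) + (1 - c) |x|
= |x| - \left( c |x| - c \left( d^T x \right) \right)
= |x| - \left| c P^T x \right|.
\]

Hence y can be computed as follows:

1. $y \longleftarrow c P^T x$,

2. $\gamma = |x| - |y|$,

3. $y \longleftarrow y + \gamma v$.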
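The three steps translate almost line by line into code. The following is a sketch under stated assumptions rather than a reference implementation: the names `pagerank_step` and `pagerank`, the value c = 0.85, the uniform starting vector, the stopping rule based on the 1-norm of the change, and the example matrix are all illustrative choices not fixed by the slides.

```python
import numpy as np

def pagerank_step(P, x, v, c):
    """One application of y = P''^T x without forming P''."""
    y = c * (P.T @ x)                            # step 1: y <- c P^T x
    gamma = np.abs(x).sum() - np.abs(y).sum()    # step 2: gamma = |x| - |y|
    return y + gamma * v                         # step 3: y <- y + gamma v

def pagerank(P, c=0.85, tol=1e-10, max_iter=1000):
    """Repeat the step until the 1-norm change drops below tol (assumed rule)."""
    n = P.shape[0]
    v = np.full(n, 1.0 / n)                      # uniform vector v
    x = v.copy()                                 # assumed starting vector
    for _ in range(max_iter):
        y = pagerank_step(P, x, v, c)
        if np.abs(y - x).sum() < tol:
            return y
        x = y
    return x

# Hypothetical 3-page web; page 1 is dangling (zero row)
P = np.array([[0.0, 0.5, 0.5],
              [0.0, 0.0, 0.0],
              [1.0, 0.0, 0.0]])
r = pagerank(P)
print(r, r.sum())
```

Only the sparse product with P is needed in each step; the dense matrices $d v^T$ and $e v^T$ are never formed, which is the point of the three-step formulation.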

Hyperlink Induced Topic Search (HITS)

works with a subgraph specific to a particular query (rather than with the full graph),

computes two weights (authority and hub) for each web page,

allows clustering of results for multi-topic or polarized queries.

Root and Focused Sets

Root set: The top t (around 200) results are retrieved for a given query (the results are picked according to a text-based relevance criterion).

Focused set: All pages pointed to by outlinks of the root set are added, along with up to d (about 50) pages corresponding to inlinks of each page in the root set.
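The expansion from root set to focused set might be sketched as follows. The function name `focused_set`, the dictionary representation of links, and the tiny example are hypothetical. The slide does not say which d in-linking pages are kept, so this sketch simply takes the first d in the given list.

```python
def focused_set(root, out_links, in_links, d=50):
    """Expand a root set into a focused set.

    out_links / in_links map a page to the list of pages it points to /
    is pointed to by (a hypothetical representation of the link graph).
    """
    focused = set(root)
    for page in root:
        focused.update(out_links.get(page, []))     # every page the root page points to
        focused.update(in_links.get(page, [])[:d])  # assumption: first d in-linking pages
    return focused

# Hypothetical example with d = 2
root = ["a", "b"]
out_links = {"a": ["c"], "b": ["c", "d"]}
in_links = {"a": ["e", "f", "g"], "b": []}
print(sorted(focused_set(root, out_links, in_links, d=2)))
```

Page "g" is dropped here because page "a" already contributes d = 2 in-linking pages.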

Hubs and Authorities

Define “authorities” and “hubs” as follows:

1. a page p is an authority if it is pointed to by many pages,

2. a page p is a hub if it points to many pages.

To measure the “authority” and the “hub” of the pages we consider $L_2$ unit norm vectors a and h of dimension |V|, so that

a[p] is the “authority”

h[p] is the “hub”

of the page p.

Hubs and Authorities

The following iterative process computes the vectors.

1. set t = 0

2. assign initial values $a^{(t)}$ and $h^{(t)}$

3. normalize the vectors $a^{(t)}$ and $h^{(t)}$, so that

\[
\sum_p \left( a^{(t)}[p] \right)^2 = \sum_p \left( h^{(t)}[p] \right)^2 = 1
\]

4. set

\[
a^{(t+1)}[p] = \sum_{q \longrightarrow p} h^{(t)}[q], \quad \text{and} \quad h^{(t+1)}[p] = \sum_{p \longrightarrow q} a^{(t+1)}[q]
\]

5. if (stopping criterion fails) then increment t by 1, goto Step 3; else stop.
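A compact sketch of this iteration with NumPy follows. The function name `hits`, the all-ones initial values, the stopping criterion (1-norm change below `tol`), and the 4-page example are assumptions, since the slide leaves them open. The sketch normalizes right after each update, which produces the same sequence of normalized vectors, and it assumes the graph has at least one edge so that the norms are nonzero.

```python
import numpy as np

def hits(A, tol=1e-10, max_iter=1000):
    """Authority and hub vectors for adjacency matrix A (A[i, j] = 1 if i -> j)."""
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    a = np.ones(n) / np.sqrt(n)          # assumed initial values, unit L2 norm
    h = np.ones(n) / np.sqrt(n)
    for _ in range(max_iter):
        a_new = A.T @ h                  # a[p] = sum of h[q] over q -> p
        h_new = A @ a_new                # h[p] = sum of a_new[q] over p -> q
        a_new /= np.linalg.norm(a_new)   # normalize to unit L2 norm
        h_new /= np.linalg.norm(h_new)
        change = np.abs(a_new - a).sum() + np.abs(h_new - h).sum()
        a, h = a_new, h_new
        if change < tol:                 # assumed stopping criterion
            break
    return a, h

# Hypothetical 4-page graph: 0 -> 2, 1 -> 2, 1 -> 3
A = np.array([[0, 0, 1, 0],
              [0, 0, 1, 1],
              [0, 0, 0, 0],
              [0, 0, 0, 0]])
a, h = hits(A)
print("authority:", a)
print("hub:      ", h)
```

In this example page 2 receives the largest authority weight (two in-links) and page 1 the largest hub weight (two out-links).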

Adjacency Matrix

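Let A be the adjacency matrix of the graph G, i.e.

\[
A_{ij} =
\begin{cases}
1 & \text{if page } i \longrightarrow j, \\
0 & \text{otherwise.}
\end{cases}
\]

Note that

\[
a^{(t+1)} = \frac{A^T h^{(t)}}{\left\| A^T h^{(t)} \right\|}, \quad \text{and} \quad
h^{(t+1)} = \frac{A a^{(t+1)}}{\left\| A a^{(t+1)} \right\|}.
\]

This yields

\[
a^{(t+1)} = \frac{A^T A\, a^{(t)}}{\left\| A^T A\, a^{(t)} \right\|}, \quad \text{and} \quad
h^{(t+1)} = \frac{A A^T h^{(t)}}{\left\| A A^T h^{(t)} \right\|}.
\]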

Eigenvectors

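\[
a^{(t)} = \frac{\left( A^T A \right)^t a^{(0)}}{\left\| \left( A^T A \right)^t a^{(0)} \right\|}, \quad \text{and} \quad
h^{(t)} = \frac{\left( A A^T \right)^t h^{(0)}}{\left\| \left( A A^T \right)^t h^{(0)} \right\|}.
\]

Let v and w be unit eigenvectors corresponding to the maximal eigenvalues of the symmetric matrices $A^T A$ and $A A^T$, respectively. The above arguments lead to the following result:

\[
\lim_{t \to \infty} a^{(t)} = v, \quad \lim_{t \to \infty} h^{(t)} = w.
\]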
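The limit can be illustrated numerically by comparing power iteration with the eigenvectors returned by `numpy.linalg.eigh`. This is a sketch on a hypothetical 4-page graph. It relies on the usual conditions for power iteration, namely that the maximal eigenvalue is simple and the starting vector is not orthogonal to its eigenvector; both hold in this example. The helper names and the fixed count of 100 iterations are illustrative assumptions.

```python
import numpy as np

# Hypothetical 4-page graph: 0 -> 2, 1 -> 2, 1 -> 3
A = np.array([[0, 0, 1, 0],
              [0, 0, 1, 1],
              [0, 0, 0, 0],
              [0, 0, 0, 0]], dtype=float)

def top_eigvec(M):
    """Unit eigenvector for the largest eigenvalue of a symmetric matrix."""
    _, vecs = np.linalg.eigh(M)          # eigenvalues come in ascending order
    u = vecs[:, -1]
    return u if u.sum() >= 0 else -u     # pick the nonnegative orientation

def power_iteration(M, steps=100):
    """Normalized powers of M applied to an all-ones start vector."""
    x = np.ones(M.shape[0])
    for _ in range(steps):
        x = M @ x
        x /= np.linalg.norm(x)
    return x

a = power_iteration(A.T @ A)             # authority iteration
h = power_iteration(A @ A.T)             # hub iteration
v = top_eigvec(A.T @ A)
w = top_eigvec(A @ A.T)
print(np.allclose(a, v), np.allclose(h, w))
```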
