pagerank - carnegie mellon school of computer scienceelaw/pagerank.pdf · the algorithm given a web...

Edith Law PageRank lecture 12 (October 9, 2008)

Page 1: PageRank - Carnegie Mellon School of Computer Scienceelaw/pagerank.pdf · The Algorithm Given a web graph ... are pages and edges are hyperlinks • Assign each node an initial page

Edith Law

PageRank

lecture 12 (October 9, 2008)

Page Rank: What’s the big deal?

Life before PageRank

Life after PageRank

Evolution of the web

Centralization

Idea #1: Centralization

Veronica (1992)

Jughead (1993)

Archie (1990)

Evolution of the web

Centralization

Relevancy

Idea #2: Relevancy

Given a query, how do we know what to retrieve?

1. Web directories

2. More sophisticated indexing methods: filename, description, content

The index size war

Evolution of the web

Centralization

Relevancy

Ranking

Idea #3: Ranking

Page Rank: How it works ...

Main idea

A page is important if it is pointed to by other important pages

Importance

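$$ r^{(t+1)}(P_i) = \sum_{j \in E(i)} \frac{r^{(t)}(P_j)}{l(P_j)} $$

where E(i) is the set of pages that link to P_i and l(P_j) is the number of links out of P_j.

[Figure: example graph with nodes A, B and C; labels 0.2, 0.8, 0.6 and 9,999]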

The Algorithm

Given a web graph with n nodes, where the nodes are pages and edges are hyperlinks

• Assign each node an initial page rank

• repeat until convergence

calculate the page rank of each node (using the equation in the previous slide)
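
The loop above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `links` dictionary is a hypothetical adjacency list whose edges are read off the example matrix H used in this lecture, the function names are made up for the sketch, and a fixed number of steps stands in for a convergence test because the plain iteration is not guaranteed to converge.

```python
from fractions import Fraction

# Hypothetical example graph: page -> pages it links to.
# Assumption: edges inferred from the lecture's example matrix H.
links = {1: [2], 2: [5], 3: [1, 2, 4, 5], 4: [3, 5], 5: [4]}


def pagerank_step(ranks, links):
    """One update: every page splits its current rank evenly over its out-links."""
    new_ranks = {page: Fraction(0) for page in links}
    for page, targets in links.items():
        for target in targets:
            new_ranks[target] += ranks[page] / len(targets)
    return new_ranks


def iterate(links, steps):
    """Start from a uniform page rank and apply the update `steps` times."""
    n = len(links)
    ranks = {page: Fraction(1, n) for page in links}
    for _ in range(steps):
        ranks = pagerank_step(ranks, links)
    return ranks


r1 = iterate(links, 1)
r2 = iterate(links, 2)
print(r1[5], r2[5])
```

Exact fractions are used so the values can be compared directly with a hand calculation; a floating-point version would work the same way.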

Example

[Figure: directed graph of five pages, numbered 1 to 5]

        Iteration 0   Iteration 1   Iteration 2   Page Rank
P1      1/5           1/20          1/40          5
P2      1/5           5/20          3/40          4
P3      1/5           1/10          5/40          3
P4      1/5           5/20          15/40         2
P5      1/5           7/20          16/40         1

$$ r^{(1)}(P_5) = \frac{1}{5} + \frac{1}{5} \times \frac{1}{4} + \frac{1}{5} \times \frac{1}{2} = \frac{7}{20} $$

Matrix representation

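$$
\underbrace{\begin{pmatrix} 1/20 \\ 5/20 \\ 1/10 \\ 5/20 \\ 7/20 \end{pmatrix}}_{r^{(t+1)}}
=
\underbrace{\begin{pmatrix}
0 & 0 & 1/4 & 0 & 0 \\
1 & 0 & 1/4 & 0 & 0 \\
0 & 0 & 0 & 1/2 & 0 \\
0 & 0 & 1/4 & 0 & 1 \\
0 & 1 & 1/4 & 1/2 & 0
\end{pmatrix}}_{H}
\underbrace{\begin{pmatrix} 1/5 \\ 1/5 \\ 1/5 \\ 1/5 \\ 1/5 \end{pmatrix}}_{r^{(t)}}
$$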
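
As a quick check of the matrix form, the product can be computed directly. The sketch below is illustrative only: `matvec` is a hypothetical helper, and H is entered column-oriented as on this slide, so column j holds the shares that page j hands to the pages it links to.

```python
from fractions import Fraction as F

# Example matrix H from the slide (column j = out-links of page j).
H = [
    [0, 0, F(1, 4), 0, 0],
    [1, 0, F(1, 4), 0, 0],
    [0, 0, 0, F(1, 2), 0],
    [0, 0, F(1, 4), 0, 1],
    [0, 1, F(1, 4), F(1, 2), 0],
]


def matvec(M, v):
    """Plain matrix-vector product on nested lists."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


r0 = [F(1, 5)] * 5   # iteration 0: uniform page rank
r1 = matvec(H, r0)   # iteration 1
r2 = matvec(H, r1)   # iteration 2

# Every column of this H sums to 1, so the total rank is preserved.
column_sums = [sum(H[i][j] for i in range(5)) for j in range(5)]
print(r1, r2, column_sums)
```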

Three Questions

• Does this converge?

• Does it converge to what we want?

• Are the results reasonable?

$$ r^{(t+1)} = H r^{(t)} $$

Also known as the power method

Does it converge?

        Iteration 0   Iteration 1   Iteration 2   Iteration 3
P1      1             0             1             0
P2      0             1             0             1

[Figure: two-page graph with nodes 1 and 2]
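
The oscillation in this table is easy to reproduce. The sketch below assumes the two pages link only to each other, which is an inference from the table rather than something the transcript states; `H_cycle` and `matvec` are names invented for the illustration.

```python
def matvec(M, v):
    """Plain matrix-vector product on nested lists."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


# Assumed structure: page 1 links to page 2 and page 2 links back to page 1.
H_cycle = [
    [0, 1],
    [1, 0],
]

r = [1, 0]            # all rank starts on P1
history = [r]
for _ in range(3):
    r = matvec(H_cycle, r)
    history.append(r)

print(history)        # the rank swaps between the two pages and never settles
```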

Does it converge to what we want?

        Iteration 0   Iteration 1   Iteration 2   Iteration 3
P1      1             0             0             0
P2      0             1             0             0

[Figure: two-page graph with nodes 1 and 2]

Does it converge to what we want?

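[Figure: the five-page example graph with a link crossed out, creating a dangling node]

$$
H = \begin{pmatrix}
0 & 0 & 1/4 & 0 & 0 \\
1 & 0 & 1/4 & 0 & 0 \\
0 & 0 & 0 & 1/2 & 0 \\
0 & 0 & 1/4 & 0 & 1 \\
0 & 0 & 1/4 & 1/2 & 0
\end{pmatrix}
$$

The dangling node causes the page ranks to converge to 0.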
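
A short experiment shows the leak. This is a sketch, not part of the lecture: it iterates the matrix above (whose second column is all zeros) from a uniform start and records the total rank after each step. The names `H_dangling`, `matvec` and `totals` are illustrative, and 50 steps is an arbitrary choice.

```python
from fractions import Fraction as F

# Matrix from the slide: column 2 is all zeros, so page 2 passes its rank to nobody.
H_dangling = [
    [0, 0, F(1, 4), 0, 0],
    [1, 0, F(1, 4), 0, 0],
    [0, 0, 0, F(1, 2), 0],
    [0, 0, F(1, 4), 0, 1],
    [0, 0, F(1, 4), F(1, 2), 0],
]


def matvec(M, v):
    """Plain matrix-vector product on nested lists."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


r = [F(1, 5)] * 5
totals = [sum(r)]                 # total rank in the system at each iteration
for _ in range(50):
    r = matvec(H_dangling, r)
    totals.append(sum(r))

print(float(totals[1]), float(totals[-1]))
```

Each step loses exactly the rank that was sitting on the dangling page, so the total shrinks at every iteration.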

Looks a lot like ...

Markov Chains

Set of states X

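Transition matrix P where $P_{ij} = P(X_t = j \mid X_{t-1} = i)$

π specifying the probability of being at each state x ∈ X

Goal is to find π such that $\pi^T = \pi^T P$, i.e. $\pi = P^T \pi$

Compare with $r^{(t+1)} = H r^{(t)}$, where H plays the role of $P^T$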

Why is this analogy useful?

Markov chain theory guarantees that, for any start vector, the power method applied to a Markov transition matrix P converges to a unique positive stationary vector, provided P is stochastic, irreducible and aperiodic.

Make H stochastic

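$$ S = H + a \left( \tfrac{1}{n} e^T \right) $$

[Figure: the five-page example graph]

$$
S = \begin{pmatrix}
0 & 1/5 & 1/4 & 0 & 0 \\
1 & 1/5 & 1/4 & 0 & 0 \\
0 & 1/5 & 0 & 1/2 & 0 \\
0 & 1/5 & 1/4 & 0 & 1 \\
0 & 1/5 & 1/4 & 1/2 & 0
\end{pmatrix}
$$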
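
The dangling-node fix can be sketched as follows. Two assumptions are worth flagging: the formula is written in the row-oriented style common in the Markov chain literature, while the example matrices on these slides are column-oriented, so this sketch fills each all-zero column with 1/n; and `make_stochastic` is a hypothetical helper name, not something from the lecture.

```python
from fractions import Fraction as F

# Column-oriented H with a dangling page 2 (all-zero second column).
H = [
    [0, 0, F(1, 4), 0, 0],
    [1, 0, F(1, 4), 0, 0],
    [0, 0, 0, F(1, 2), 0],
    [0, 0, F(1, 4), 0, 1],
    [0, 0, F(1, 4), F(1, 2), 0],
]


def make_stochastic(H):
    """Return a copy of H where every all-zero column is replaced by the uniform value 1/n."""
    n = len(H)
    S = [row[:] for row in H]
    for j in range(n):
        if all(H[i][j] == 0 for i in range(n)):   # page j has no out-links
            for i in range(n):
                S[i][j] = F(1, n)
    return S


S = make_stochastic(H)
column_sums = [sum(S[i][j] for i in range(5)) for j in range(5)]
print(column_sums)
```

Intuitively, a surfer who reaches a page with no out-links jumps to a page chosen uniformly at random.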

Make H aperiodic

A chain is periodic if there exists k > 1 such that the interval between two visits to some state s is always a multiple of k.

[Figure: example graph on five pages, numbered 1 to 5]

Make H irreducible

From any state, there is a non-zero probability of eventually reaching any other state.

[Figure: example graph on five pages, numbered 1 to 5]
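
For a link graph, irreducibility amounts to every page being reachable from every other page. A minimal sketch of that check, assuming an adjacency-list representation; the function names and both example graphs are hypothetical and not taken from the slide's figure.

```python
def reachable(links, start):
    """Set of pages reachable from `start` by following links (includes `start`)."""
    seen = {start}
    stack = [start]
    while stack:
        page = stack.pop()
        for target in links[page]:
            if target not in seen:
                seen.add(target)
                stack.append(target)
    return seen


def is_irreducible(links):
    """True when every page can reach every other page."""
    pages = set(links)
    return all(reachable(links, page) == pages for page in links)


# Hypothetical graphs for illustration.
connected = {1: [2], 2: [5], 3: [1, 2, 4, 5], 4: [3, 5], 5: [4]}
trapped = {1: [2], 2: [1], 3: [1]}   # once in {1, 2}, page 3 is unreachable

print(is_irreducible(connected), is_irreducible(trapped))
```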

The Google Matrix

$$ G = \alpha S + (1 - \alpha) \tfrac{1}{n} e e^T $$

[Figure: the five-page example graph]

The Random Surfer Model: for each page, time spent ∝ importance.

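Are the results reasonable?

$$ G = \alpha S + (1 - \alpha) \tfrac{1}{n} e e^T $$

G is stochastic, aperiodic and irreducible.

$$ r^{(t+1)} = G r^{(t)} $$

G is dense but computable using the sparse matrix H.

$$
\begin{aligned}
G &= \alpha S + (1 - \alpha) \tfrac{1}{n} e e^T \\
  &= \alpha \left( H + \tfrac{1}{n} a e^T \right) + (1 - \alpha) \tfrac{1}{n} e e^T \\
  &= \alpha H + \left( \alpha a + (1 - \alpha) e \right) \tfrac{1}{n} e^T
\end{aligned}
$$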
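
The point of the expansion is that G never has to be stored: one step needs only the sparse links plus a single uniform correction. A hedged sketch of that idea follows. The adjacency list, the function names, the tolerance and the value alpha = 0.85 are all assumptions made for the illustration (the lecture transcript does not give a value for α), and the code uses the column-oriented convention of the example matrices.

```python
def google_step(ranks, links, alpha=0.85):
    """One step r <- G r using only the sparse link structure.

    Column-oriented form: G r = alpha * H r + ((alpha * d + (1 - alpha) * s) / n) * e,
    where d is the rank held by dangling pages and s is the total rank.
    """
    n = len(links)
    new_ranks = {page: 0.0 for page in links}
    dangling_mass = 0.0
    for page, targets in links.items():
        if targets:
            share = alpha * ranks[page] / len(targets)
            for target in targets:
                new_ranks[target] += share
        else:
            dangling_mass += ranks[page]
    total = sum(ranks.values())
    uniform = (alpha * dangling_mass + (1 - alpha) * total) / n
    return {page: value + uniform for page, value in new_ranks.items()}


def pagerank(links, alpha=0.85, tol=1e-10, max_iter=1000):
    """Power method on G, stopping when successive iterates differ by less than tol."""
    n = len(links)
    ranks = {page: 1.0 / n for page in links}
    for _ in range(max_iter):
        new_ranks = google_step(ranks, links, alpha)
        if sum(abs(new_ranks[p] - ranks[p]) for p in links) < tol:
            return new_ranks
        ranks = new_ranks
    return ranks


# Hypothetical graph with a dangling page 2.
links = {1: [2], 2: [], 3: [1, 2, 4, 5], 4: [3, 5], 5: [4]}
ranks = pagerank(links)
print(ranks)
```

Each step costs time proportional to the number of links rather than n squared, which is what makes the dense matrix G usable at web scale.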

Page Rank: The problems

The Rich Get Richer

(Cho et al., 04)

Google Bombs

Link Farms

(Wu and Davison, 05)

Take-home

Ranking is important.

Relationship between links and the importance of pages.

Why PageRank converges (to the right answer).

How link-based ranking methods can be manipulated.

g2gttyl