n comp 465: data mining more on pagerankcs.rhodes.edu/welshc/comp465_s15/lecture16.pdf4/14/2015 1...

6
4/14/2015 1 COMP 465: Data Mining More on PageRank Slides Adapted From: www.mmds.org (Mining Massive Datasets) Power Iteration: Set =1/N 1: โ€ฒ = โ†’ 2: = โ€ฒ Goto 1 Example: r y 1/3 r a = 1/3 r m 1/3 J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org y a m y a m y ยฝ ยฝ 0 a ยฝ 0 1 m 0 ยฝ 0 2 Iteration 0, 1, 2, โ€ฆ r y = r y /2 + r a /2 r a = r y /2 + r m r m = r a /2 Power Iteration: Set =1/N 1: โ€ฒ = โ†’ 2: = โ€ฒ Goto 1 Example: r y 1/3 1/3 5/12 9/24 6/15 r a = 1/3 3/6 1/3 11/24 โ€ฆ 6/15 r m 1/3 1/6 3/12 1/6 3/15 J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org y a m y a m y ยฝ ยฝ 0 a ยฝ 0 1 m 0 ยฝ 0 3 Iteration 0, 1, 2, โ€ฆ r y = r y /2 + r a /2 r a = r y /2 + r m r m = r a /2 Imagine a random web surfer: At any time , surfer is on some page At time +, the surfer follows an out-link from uniformly at random Ends up on some page linked from Process repeats indefinitely Let: () โ€ฆ vector whose th coordinate is the prob. that the surfer is at page at time So, () is a probability distribution over pages J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 4 j i i j r r (i) d out j i 1 i 2 i 3

Upload: others

Post on 16-Oct-2020

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: N COMP 465: Data Mining More on PageRankcs.rhodes.edu/welshc/COMP465_S15/Lecture16.pdf4/14/2015 1 COMP 465: Data Mining More on PageRank Slides Adapted From: (Mining Massive Datasets)

4/14/2015

1

COMP 465: Data Mining More on PageRank

Slides Adapted From: www.mmds.org (Mining Massive Datasets)

Power Iteration:

Set ๐‘Ÿ๐‘— = 1/N

1: ๐‘Ÿโ€ฒ๐‘— = ๐‘Ÿ๐‘–

๐‘‘๐‘–๐‘–โ†’๐‘—

2: ๐‘Ÿ = ๐‘Ÿโ€ฒ

Goto 1

Example: ry 1/3 1/3 5/12 9/24 6/15

ra = 1/3 3/6 1/3 11/24 โ€ฆ 6/15

rm 1/3 1/6 3/12 1/6 3/15

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

y

a m

y a m

y ยฝ ยฝ 0

a ยฝ 0 1

m 0 ยฝ 0

2

Iteration 0, 1, 2, โ€ฆ

ry = ry /2 + ra /2

ra = ry /2 + rm

rm = ra /2

Power Iteration:

Set ๐‘Ÿ๐‘— = 1/N

1: ๐‘Ÿโ€ฒ๐‘— = ๐‘Ÿ๐‘–

๐‘‘๐‘–๐‘–โ†’๐‘—

2: ๐‘Ÿ = ๐‘Ÿโ€ฒ

Goto 1

Example: ry 1/3 1/3 5/12 9/24 6/15

ra = 1/3 3/6 1/3 11/24 โ€ฆ 6/15

rm 1/3 1/6 3/12 1/6 3/15

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

y

a m

y a m

y ยฝ ยฝ 0

a ยฝ 0 1

m 0 ยฝ 0

3

Iteration 0, 1, 2, โ€ฆ

ry = ry /2 + ra /2

ra = ry /2 + rm

rm = ra /2

Imagine a random web surfer:

At any time ๐’•, surfer is on some page ๐’Š

At time ๐’• + ๐Ÿ, the surfer follows an out-link from ๐’Š uniformly at random

Ends up on some page ๐’‹ linked from ๐’Š

Process repeats indefinitely

Let: ๐’‘(๐’•) โ€ฆ vector whose ๐’Šth coordinate is the

prob. that the surfer is at page ๐’Š at time ๐’•

So, ๐’‘(๐’•) is a probability distribution over pages

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 4

ji

ij

rr

(i)dout

j

i1 i2 i3

Page 2: N COMP 465: Data Mining More on PageRankcs.rhodes.edu/welshc/COMP465_S15/Lecture16.pdf4/14/2015 1 COMP 465: Data Mining More on PageRank Slides Adapted From: (Mining Massive Datasets)

4/14/2015

2

Where is the surfer at time t+1?

Follows a link uniformly at random

๐’‘ ๐’• + ๐Ÿ = ๐‘ด โ‹… ๐’‘(๐’•)

Suppose the random walk reaches a state ๐’‘ ๐’• + ๐Ÿ = ๐‘ด โ‹… ๐’‘(๐’•) = ๐’‘(๐’•)

then ๐’‘(๐’•) is stationary distribution of a random walk

Our original rank vector ๐’“ satisfies ๐’“ = ๐‘ด โ‹… ๐’“

So, ๐’“ is a stationary distribution for the random walk

)(M)1( tptp

j

i1 i2 i3

5 J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

Does this converge?

Does it converge to what we want?

Are results reasonable?

ji

t

it

j

rr

i

)()1(

d Mrr or

equivalently

7 J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

Example: ra 1 0 1 0

rb 0 1 0 1

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 8

=

b a

Iteration 0, 1, 2, โ€ฆ

ji

t

it

j

rr

i

)()1(

d

Page 3: N COMP 465: Data Mining More on PageRankcs.rhodes.edu/welshc/COMP465_S15/Lecture16.pdf4/14/2015 1 COMP 465: Data Mining More on PageRank Slides Adapted From: (Mining Massive Datasets)

4/14/2015

3

Example: ra 1 0 0 0

rb 0 1 0 0

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 9

=

b a

Iteration 0, 1, 2, โ€ฆ

ji

t

it

j

rr

i

)()1(

d

2 problems: (1) Some pages are

dead ends (have no out-links)

Random walk has โ€œnowhereโ€ to go to

Such pages cause importance to โ€œleak outโ€

(2) Spider traps:

(all out-links are within the group)

Random walked gets โ€œstuckโ€ in a trap

And eventually spider traps absorb all importance

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 10

Dead end

Power Iteration:

Set ๐‘Ÿ๐‘— = 1

๐‘Ÿ๐‘— = ๐‘Ÿ๐‘–

๐‘‘๐‘–๐‘–โ†’๐‘—

And iterate

Example: ry 1/3 2/6 3/12 5/24 0

ra = 1/3 1/6 2/12 3/24 โ€ฆ 0

rm 1/3 3/6 7/12 16/24 1

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 11

Iteration 0, 1, 2, โ€ฆ

y

a m

y a m

y ยฝ ยฝ 0

a ยฝ 0 0

m 0 ยฝ 1

ry = ry /2 + ra /2

ra = ry /2

rm = ra /2 + rm

m is a spider trap

All the PageRank score gets โ€œtrappedโ€ in node m.

The Google solution for spider traps: At each time step, the random surfer has two options

With prob. , follow a link at random

With prob. 1-, jump to some random page

Common values for are in the range 0.8 to 0.9

Surfer will teleport out of spider trap within a few time steps

12 J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

y

a m

y

a m

Page 4: N COMP 465: Data Mining More on PageRankcs.rhodes.edu/welshc/COMP465_S15/Lecture16.pdf4/14/2015 1 COMP 465: Data Mining More on PageRank Slides Adapted From: (Mining Massive Datasets)

4/14/2015

4

Power Iteration:

Set ๐‘Ÿ๐‘— = 1

๐‘Ÿ๐‘— = ๐‘Ÿ๐‘–

๐‘‘๐‘–๐‘–โ†’๐‘—

And iterate

Example: ry 1/3 2/6 3/12 5/24 0

ra = 1/3 1/6 2/12 3/24 โ€ฆ 0

rm 1/3 1/6 1/12 2/24 0

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 13

Iteration 0, 1, 2, โ€ฆ

y

a m

y a m

y ยฝ ยฝ 0

a ยฝ 0 0

m 0 ยฝ 0

ry = ry /2 + ra /2

ra = ry /2

rm = ra /2

Here the PageRank โ€œleaksโ€ out since the matrix is not stochastic.

Teleports: Follow random teleport links with probability 1.0 from dead-ends

Adjust matrix accordingly

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 14

y

a m

y a m

y ยฝ ยฝ โ…“

a ยฝ 0 โ…“

m 0 ยฝ โ…“

y a m

y ยฝ ยฝ 0

a ยฝ 0 0

m 0 ยฝ 0

y

a m

Why are dead-ends and spider traps a problem and why do teleports solve the problem? Spider-traps are not a problem, but with traps

PageRank scores are not what we want

Solution: Never get stuck in a spider trap by teleporting out of it in a finite number of steps

Dead-ends are a problem

The matrix is not column stochastic so our initial assumptions are not met

Solution: Make matrix column stochastic by always teleporting when there is nowhere else to go

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 15

Googleโ€™s solution that does it all: At each step, random surfer has two options:

With probability , follow a link at random

With probability 1-, jump to some random page

PageRank equation [Brin-Page, 98]

๐‘Ÿ๐‘— = ๐›ฝ ๐‘Ÿ๐‘–๐‘‘๐‘–

๐‘–โ†’๐‘—

+ (1 โˆ’ ๐›ฝ)1

๐‘

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 16

di โ€ฆ out-degree of node i

This formulation assumes that ๐‘ด has no dead ends. We can either

preprocess matrix ๐‘ด to remove all dead ends or explicitly follow random

teleport links with probability 1.0 from dead-ends.

Page 5: N COMP 465: Data Mining More on PageRankcs.rhodes.edu/welshc/COMP465_S15/Lecture16.pdf4/14/2015 1 COMP 465: Data Mining More on PageRank Slides Adapted From: (Mining Massive Datasets)

4/14/2015

5

PageRank equation [Brin-Page, โ€˜98]

๐‘Ÿ๐‘— = ๐›ฝ๐‘Ÿ๐‘–๐‘‘๐‘–

๐‘–โ†’๐‘—

+ (1 โˆ’ ๐›ฝ)1

๐‘

The Google Matrix A:

๐ด = ๐›ฝ ๐‘€ + 1 โˆ’ ๐›ฝ1

๐‘ ๐‘ร—๐‘

We have a recursive problem: ๐’“ = ๐‘จ โ‹… ๐’“ And the Power method still works!

What is ?

In practice =0.8,0.9 (make 5 steps on avg., jump)

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 17

[1/N]NxNโ€ฆN by N matrix

where all entries are 1/N

y

a =

m

1/3

1/3

1/3

0.33

0.20

0.46

0.24

0.20

0.52

0.26

0.18

0.56

7/33

5/33

21/33

. . .

18 J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

y

a m

13/15

7/15

1/2 1/2 0

1/2 0 0

0 1/2 1

1/3 1/3 1/3

1/3 1/3 1/3

1/3 1/3 1/3

y 7/15 7/15 1/15

a 7/15 1/15 1/15

m 1/15 7/15 13/15

0.8 + 0.2

M [1/N]NxN

A

Key step is matrix-vector multiplication rnew = A โˆ™ rold

Easy if we have enough main memory to hold A, rold, rnew

Say N = 1 billion pages We need 4 bytes for

each entry (say)

2 billion entries for vectors, approx 8GB

Matrix A has N2 entries 1018 is a large number!

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 20

ยฝ ยฝ 0

ยฝ 0 0

0 ยฝ 1

1/3 1/3 1/3

1/3 1/3 1/3

1/3 1/3 1/3

7/15 7/15 1/15

7/15 1/15 1/15

1/15 7/15 13/15

0.8 +0.2

A = โˆ™M + (1-) [1/N]NxN

=

A =

Page 6: N COMP 465: Data Mining More on PageRankcs.rhodes.edu/welshc/COMP465_S15/Lecture16.pdf4/14/2015 1 COMP 465: Data Mining More on PageRank Slides Adapted From: (Mining Massive Datasets)

4/14/2015

6

Suppose there are N pages Consider page i, with di out-links

We have Mji = 1/|di| when i โ†’ j and Mji = 0 otherwise

The random teleport is equivalent to: Adding a teleport link from i to every other page

and setting transition probability to (1-)/N

Reducing the probability of following each out-link from 1/|di| to /|di|

Equivalent: Tax each page a fraction (1-) of its score and redistribute evenly

21 J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

๐’“ = ๐‘จ โ‹… ๐’“, where ๐‘จ๐’‹๐’Š = ๐œท ๐‘ด๐’‹๐’Š +๐Ÿโˆ’๐œท

๐‘ต

๐‘Ÿ๐‘— = ๐ด๐‘—๐‘– โ‹… ๐‘Ÿ๐‘–๐‘i=1

๐‘Ÿ๐‘— = ๐›ฝ ๐‘€๐‘—๐‘– +1โˆ’๐›ฝ

๐‘โ‹… ๐‘Ÿ๐‘–

๐‘๐‘–=1

= ๐›ฝ ๐‘€๐‘—๐‘– โ‹… ๐‘Ÿ๐‘– +1โˆ’๐›ฝ

๐‘๐‘i=1 ๐‘Ÿ๐‘–

๐‘i=1

= ๐›ฝ ๐‘€๐‘—๐‘– โ‹… ๐‘Ÿ๐‘– +1โˆ’๐›ฝ

๐‘๐‘i=1 since ๐‘Ÿ๐‘– = 1

So we get: ๐’“ = ๐œท ๐‘ด โ‹… ๐’“ +๐Ÿโˆ’๐œท

๐‘ต ๐‘ต

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 22

[x]N โ€ฆ a vector of length N with all entries x Note: Here we assumed M

has no dead-ends

We just rearranged the PageRank equation

๐’“ = ๐œท๐‘ด โ‹… ๐’“ +๐Ÿ โˆ’ ๐œท

๐‘ต๐‘ต

where [(1-)/N]N is a vector with all N entries (1-)/N

M is a sparse matrix! (with no dead-ends)

10 links per node, approx 10N entries So in each iteration, we need to:

Compute rnew = M โˆ™ rold

Add a constant value (1-)/N to each entry in rnew

Note if M contains dead-ends then ๐’“๐’‹๐’๐’†๐’˜

๐’‹ < ๐Ÿ and we also have to renormalize rnew so that it sums to 1

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 23

Input: Graph ๐‘ฎ and parameter ๐œท Directed graph ๐‘ฎ (can have spider traps and dead ends) Parameter ๐œท

Output: PageRank vector ๐’“๐’๐’†๐’˜

Set: ๐‘Ÿ๐‘—๐‘œ๐‘™๐‘‘ =

1

๐‘

repeat until convergence: ๐‘Ÿ๐‘—๐‘›๐‘’๐‘ค โˆ’ ๐‘Ÿ๐‘—

๐‘œ๐‘™๐‘‘ > ๐œ€๐‘—

โˆ€๐‘—: ๐’“โ€ฒ๐’‹๐’๐’†๐’˜ = ๐œท

๐’“๐’Š๐’๐’๐’…

๐’…๐’Š๐’Šโ†’๐’‹

๐’“โ€ฒ๐’‹๐’๐’†๐’˜ = ๐ŸŽ if in-degree of ๐’‹ is 0

Now re-insert the leaked PageRank:

โˆ€๐’‹: ๐’“๐’‹๐’๐’†๐’˜ = ๐’“โ€ฒ๐’‹

๐’๐’†๐’˜+๐Ÿโˆ’๐‘บ

๐‘ต

๐’“๐’๐’๐’… = ๐’“๐’๐’†๐’˜

24

where: ๐‘† = ๐‘Ÿโ€ฒ๐‘—๐‘›๐‘’๐‘ค

๐‘—

If the graph has no dead-ends then the amount of leaked PageRank is 1-ฮฒ. But since we have dead-ends

the amount of leaked PageRank may be larger. We have to explicitly account for it by computing S. J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org