pagerank - uestcdm.uestc.edu.cn/wp-content/uploads/seminar/20151209_pagerank.… · data mining...

35
PageRank Data Mining Lab, Big Data Research Center, UESTC School of Computer Science and Engineering Email: [email protected] Xiaolin Yang DM LESS IS MORE 2015/12/09

Upload: others

Post on 25-Sep-2020

10 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: PageRank - UESTCdm.uestc.edu.cn/wp-content/uploads/seminar/20151209_PageRank.… · Data Mining Lab, Big Data Research Center, UESTC School of Computer Science and Engineering

PageRank

Data Mining Lab, Big Data Research Center, UESTCSchool of Computer Science and Engineering Email: [email protected]

Xiaolin Yang

DMLESS IS MORE

2015/12/09

Page 2: PageRank - UESTCdm.uestc.edu.cn/wp-content/uploads/seminar/20151209_PageRank.… · Data Mining Lab, Big Data Research Center, UESTC School of Computer Science and Engineering

CONTENTS 01 Background

02 Markov Chain

03 The Basic PageRank Model

04 The Power Method

05 Discussion about the Model

06 Other Topics

Page 3: PageRank - UESTCdm.uestc.edu.cn/wp-content/uploads/seminar/20151209_PageRank.… · Data Mining Lab, Big Data Research Center, UESTC School of Computer Science and Engineering

Background

Page 4: PageRank - UESTCdm.uestc.edu.cn/wp-content/uploads/seminar/20151209_PageRank.… · Data Mining Lab, Big Data Research Center, UESTC School of Computer Science and Engineering

Background DMLESS IS MOREBackground

At the Seventh International World Wide Web conference(WWW98), Sergey Brin and Larry Page’s paper “The PageRank citation ranking: Bringing order to the Web” made small ripples in the information science community that quickly turned into waves.

Page 5: PageRank - UESTCdm.uestc.edu.cn/wp-content/uploads/seminar/20151209_PageRank.… · Data Mining Lab, Big Data Research Center, UESTC School of Computer Science and Engineering

Background DMLESS IS MOREBackground

Great Success Hyperlink Structure PageRank

Page 6: PageRank - UESTCdm.uestc.edu.cn/wp-content/uploads/seminar/20151209_PageRank.… · Data Mining Lab, Big Data Research Center, UESTC School of Computer Science and Engineering

Background

Page 7: PageRank - UESTCdm.uestc.edu.cn/wp-content/uploads/seminar/20151209_PageRank.… · Data Mining Lab, Big Data Research Center, UESTC School of Computer Science and Engineering

Background DMLESS IS MOREMarkov Chain

Markov Chain:

Stock Market

Irreducible(不可约):any state can reached from any state.

Absorbing states(吸收态):the probability of leaving this state is zero

Page 8: PageRank - UESTCdm.uestc.edu.cn/wp-content/uploads/seminar/20151209_PageRank.… · Data Mining Lab, Big Data Research Center, UESTC School of Computer Science and Engineering

Background DMLESS IS MOREMarkov Chain

Markov Chain:

Limit Theorem(极限定理):

A homogeneous(齐次), irreducible(不可约), aperiodic(非周期) and positive recurrent(正常返) Markoc Chain has:

a limiting distribution:

also a stationary distribution:

Page 9: PageRank - UESTCdm.uestc.edu.cn/wp-content/uploads/seminar/20151209_PageRank.… · Data Mining Lab, Big Data Research Center, UESTC School of Computer Science and Engineering

Background DMLESS IS MOREMarkov Chain

Markov Chain:

Limit Theorem(极限定理):

Page 10: PageRank - UESTCdm.uestc.edu.cn/wp-content/uploads/seminar/20151209_PageRank.… · Data Mining Lab, Big Data Research Center, UESTC School of Computer Science and Engineering

Background DMLESS IS MOREMarkov Chain

Markov Chain(Aperiodic):

Period of a state:

. . : the greatest common divisor

Page 11: PageRank - UESTCdm.uestc.edu.cn/wp-content/uploads/seminar/20151209_PageRank.… · Data Mining Lab, Big Data Research Center, UESTC School of Computer Science and Engineering

Background DMLESS IS MOREMarkov Chain

Markov Chain(Positive Recurrent):

Recurrent (常返):

The first return probability(首返概率):

Positive Recurrent (正常返):

Page 12: PageRank - UESTCdm.uestc.edu.cn/wp-content/uploads/seminar/20151209_PageRank.… · Data Mining Lab, Big Data Research Center, UESTC School of Computer Science and Engineering

Background DMLESS IS MOREMarkov Chain

Markov Chain:

1 2 3 4

p p

qq

1

1

Positive Recurrent (正常返):

State 1:

State 2:

positive recurrent !

transient !

Page 13: PageRank - UESTCdm.uestc.edu.cn/wp-content/uploads/seminar/20151209_PageRank.… · Data Mining Lab, Big Data Research Center, UESTC School of Computer Science and Engineering

Background

Page 14: PageRank - UESTCdm.uestc.edu.cn/wp-content/uploads/seminar/20151209_PageRank.… · Data Mining Lab, Big Data Research Center, UESTC School of Computer Science and Engineering

Background DMLESS IS MOREThe Basic PageRank Model

Page Links:

• is linked by many pages;

If a page

• which link to it is authoritative

Then it will gain high PageRank!

Page 15: PageRank - UESTCdm.uestc.edu.cn/wp-content/uploads/seminar/20151209_PageRank.… · Data Mining Lab, Big Data Research Center, UESTC School of Computer Science and Engineering

Background DMLESS IS MOREThe Basic PageRank Model

Transition Probability Matrix:

Page 16: PageRank - UESTCdm.uestc.edu.cn/wp-content/uploads/seminar/20151209_PageRank.… · Data Mining Lab, Big Data Research Center, UESTC School of Computer Science and Engineering

Background DMLESS IS MOREThe Basic PageRank Model

Transition Probability Matrix:

• Row/Column sums are +1• Every element is NonnegativeMarkov matrix/stochastic matrix:

• Spectral radius(the supremum among the absolute values of spectrums) is 1;

Properties:

Page 17: PageRank - UESTCdm.uestc.edu.cn/wp-content/uploads/seminar/20151209_PageRank.… · Data Mining Lab, Big Data Research Center, UESTC School of Computer Science and Engineering

Background DMLESS IS MOREThe Basic PageRank Model

Transition Probability Matrix:

Every irreducible markov matrix has a stationary vector:

It’s independent of the initial state

If every element is positive, Markov matrix is irreducible(strongly connected):

Page 18: PageRank - UESTCdm.uestc.edu.cn/wp-content/uploads/seminar/20151209_PageRank.… · Data Mining Lab, Big Data Research Center, UESTC School of Computer Science and Engineering

Background DMLESS IS MOREThe Basic PageRank Model

Transition Probability Matrix:

The independence of the initial state:

Page 19: PageRank - UESTCdm.uestc.edu.cn/wp-content/uploads/seminar/20151209_PageRank.… · Data Mining Lab, Big Data Research Center, UESTC School of Computer Science and Engineering

Background DMLESS IS MOREThe Basic PageRank Model

Transition Probability Matrix:

Practical web graphs are not necessarily strongly connected.• For Dangling Nodes:

• For Irreducible:

Page 20: PageRank - UESTCdm.uestc.edu.cn/wp-content/uploads/seminar/20151209_PageRank.… · Data Mining Lab, Big Data Research Center, UESTC School of Computer Science and Engineering

Background DMLESS IS MOREThe Basic PageRank Model

Transition Probability Matrix:

Page 21: PageRank - UESTCdm.uestc.edu.cn/wp-content/uploads/seminar/20151209_PageRank.… · Data Mining Lab, Big Data Research Center, UESTC School of Computer Science and Engineering

Background DMLESS IS MOREThe Basic PageRank Model

Transition Probability Matrix:

Another Interpretation for is:

Even though the customer always browse webpages by hyperlinks, but he can also use URL to “teleport” to a new page.

Page 22: PageRank - UESTCdm.uestc.edu.cn/wp-content/uploads/seminar/20151209_PageRank.… · Data Mining Lab, Big Data Research Center, UESTC School of Computer Science and Engineering

Background

Page 23: PageRank - UESTCdm.uestc.edu.cn/wp-content/uploads/seminar/20151209_PageRank.… · Data Mining Lab, Big Data Research Center, UESTC School of Computer Science and Engineering

Background DMLESS IS MOREThe Power Method

The Power Method:

Page 24: PageRank - UESTCdm.uestc.edu.cn/wp-content/uploads/seminar/20151209_PageRank.… · Data Mining Lab, Big Data Research Center, UESTC School of Computer Science and Engineering

Background DMLESS IS MOREThe Power Method

Four advantages:

• and are never formed or stored.

• At each iteration, the power method only requires the storage of one vector, the current iterate.

• Converges quickly.

Page 25: PageRank - UESTCdm.uestc.edu.cn/wp-content/uploads/seminar/20151209_PageRank.… · Data Mining Lab, Big Data Research Center, UESTC School of Computer Science and Engineering

Background

Page 26: PageRank - UESTCdm.uestc.edu.cn/wp-content/uploads/seminar/20151209_PageRank.… · Data Mining Lab, Big Data Research Center, UESTC School of Computer Science and Engineering

Background DMLESS IS MOREDiscussion about the Model

Google always using = 0.85, so why this choice for ?

• The larger is, the more the true hyperlink structure of the web is used to determine webpage importance.

• The smaller is, the faster the convergence for power method.

1. a trade-off:

Page 27: PageRank - UESTCdm.uestc.edu.cn/wp-content/uploads/seminar/20151209_PageRank.… · Data Mining Lab, Big Data Research Center, UESTC School of Computer Science and Engineering

Background DMLESS IS MOREDiscussion about the Model

Page 28: PageRank - UESTCdm.uestc.edu.cn/wp-content/uploads/seminar/20151209_PageRank.… · Data Mining Lab, Big Data Research Center, UESTC School of Computer Science and Engineering

Background DMLESS IS MORE

• 5/6 of the time a Web surfer randomly clicks on hyperlinks.

• 1/6 of the time this Web surfer will go to the URL line and type the address of a new page.

2. intuitive reality:

Discussion about the Model

Page 29: PageRank - UESTCdm.uestc.edu.cn/wp-content/uploads/seminar/20151209_PageRank.… · Data Mining Lab, Big Data Research Center, UESTC School of Computer Science and Engineering

Background DMLESS IS MORE

• control spamming done by the so-called link farms.

A link farm

Discussion about the Model

Page 30: PageRank - UESTCdm.uestc.edu.cn/wp-content/uploads/seminar/20151209_PageRank.… · Data Mining Lab, Big Data Research Center, UESTC School of Computer Science and Engineering

Background DMLESS IS MOREDiscussion about the Model

Page 31: PageRank - UESTCdm.uestc.edu.cn/wp-content/uploads/seminar/20151209_PageRank.… · Data Mining Lab, Big Data Research Center, UESTC School of Computer Science and Engineering

Background DMLESS IS MORE

Forcing Irreducibility:enforce every node is directly connected to every other node. (alter the true nature of the Web)

add a dummy node to the Web which connects to every other node and to which every other node is connected.

Discussion about the Model

Page 32: PageRank - UESTCdm.uestc.edu.cn/wp-content/uploads/seminar/20151209_PageRank.… · Data Mining Lab, Big Data Research Center, UESTC School of Computer Science and Engineering

Background DMLESS IS MORE

Forcing Irreducibility:LeaderRank:

Ground

• self-adaptive• parameter-free

Discussion about the Model

Page 33: PageRank - UESTCdm.uestc.edu.cn/wp-content/uploads/seminar/20151209_PageRank.… · Data Mining Lab, Big Data Research Center, UESTC School of Computer Science and Engineering

Background

Page 34: PageRank - UESTCdm.uestc.edu.cn/wp-content/uploads/seminar/20151209_PageRank.… · Data Mining Lab, Big Data Research Center, UESTC School of Computer Science and Engineering

Background DMLESS IS MOREOther Topics

• Storage and Speed“The World’s Largest Matrix Computation”

• Spam

• The evolution and dynamics of the Web

• Web’s structure:how to use the scale-free structure to improve PageRank computations

• Community:how do changes within the community affect the PageRank of community pages?

Page 35: PageRank - UESTCdm.uestc.edu.cn/wp-content/uploads/seminar/20151209_PageRank.… · Data Mining Lab, Big Data Research Center, UESTC School of Computer Science and Engineering

Background

DMLESS IS MORE