Presented By: Wang Hao
March 8th, 2011

The PageRank Citation Ranking: Bringing Order to the Web
Lawrence Page, Sergey Brin, Rajeev Motwani, Terry Winograd
January 29th, 1998, Stanford InfoLab
Adaptive Methods for the Computation of PageRank
Sepandar Kamvar, Taher Haveliwala, Gene Golub
2004
Agenda
- Technology Overview
- Introduction & Motivation
- Link Structure of the Web
- Simplified PageRank
- PageRank Definition
- How We Can Get PageRank
- Dangling Links
- PageRank Implementation
- Adaptive Methods for the Computation of PageRank
- Searching with PageRank
- Personalized PageRank
- Applications
- Conclusion
- References
Technology Overview
- Recognized the need for a new kind of server setup
- Linked PCs to quickly find each query's answers; this resulted in:
  - Faster response time
  - Greater scalability
  - Lower costs
- Google uses more than 200 signals (including the PageRank algorithm) to determine which pages are important
- Google then performs hypertext-matching

(Source: Google Corporate Information)
Life of a Google Query
Introduction & Motivation
- The WWW is very large and heterogeneous; web pages are extremely diverse in content, quality, and structure
- This makes information retrieval on the WWW challenging
- Academic citations, by contrast, link to other well-known papers, are peer reviewed, and have quality control
- The web of academic documents is homogeneous in its quality, usage, citation, and length
Problem: how can the most relevant pages be ranked at the top?
Answer: take advantage of the link structure of the web to produce a global ranking of every web page, known as PageRank.
Cont’d
- Most web pages link to other web pages as well
- The quality of a web page, though, is subjective to the user
- The importance of a page is a quantity that is not intuitively possible to capture
Link Structure of the Web
- A and B are backlinks of C
- Every page has some number of forward links (outedges) and backlinks (inedges)
- We can never know all the backlinks of a page, but we know all of its forward links
- Generally, highly linked pages are more "important"
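Since a crawler observes forward links directly but never backlinks, the backlink sets can be derived by inverting the forward-link map it has collected. A minimal Python sketch (the page names are illustrative):

```python
# Forward links observed by a crawl: each page maps to its outlinks.
links = {"A": ["C"], "B": ["C"], "C": []}

# Invert the map to recover backlinks (inedges) for each linked-to page.
backlinks = {}
for page, outlinks in links.items():
    for target in outlinks:
        backlinks.setdefault(target, []).append(page)

print(backlinks)  # A and B are backlinks of C
```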
PageRank Definition
- PageRank: a method for computing a ranking for every web page based on the graph of the web
- A page has high rank if the sum of the ranks of its backlinks is high; this covers both the case where a page has many backlinks and the case where it has a few highly ranked backlinks
- PageRank is a link analysis algorithm that assigns each page a numerical weight representing how important that page is on the web
- The web is democratic, i.e., pages vote for pages
- Google interprets a link from page A to page B as a vote, by page A, for page B; it also analyzes the page that casts the vote
- A page is important if important pages refer to it
Simple Ranking Function:

    R(u) = c * Σ_{v ∈ B_u} R(v) / N_v

- u: a web page
- B_u: the set of pages that link to u (backlinks)
- N_u = |F_u|: the number of forward links from u
- c: a factor used for normalization
In principle, the PageRanks form a probability distribution over web pages, so the sum of all web pages’ PageRanks will be one
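The simplified ranking function can be iterated on a toy graph. A minimal Python sketch (the three-page graph and iteration count are illustrative; c is effectively 1 here because the toy graph has no dangling pages, so each pass preserves the total rank):

```python
# A toy 3-page web: A -> B, A -> C, B -> C, C -> A.
# links maps each page to its forward links.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
pages = list(links)

# Start from the uniform distribution.
rank = {p: 1.0 / len(pages) for p in pages}

for _ in range(50):
    new_rank = {p: 0.0 for p in pages}
    for v, outlinks in links.items():
        share = rank[v] / len(outlinks)   # R(v) / N_v
        for u in outlinks:
            new_rank[u] += share          # accumulate over u's backlinks
    rank = new_rank

# The ranks still form a probability distribution (they sum to 1).
print({p: round(r, 3) for p, r in rank.items()})
```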
Simplified PageRank Calculation
Cont’d
Computing PageRank given a directed graph:
- Write down the transition matrix A for the graph
- A has the eigenvalue λ = 1
- Calculate the corresponding eigenvector u
- Among the scalar multiples of u, choose v to be the unique eigenvector with the sum of all entries equal to 1
- v is the PageRank vector
Cont’d
This is a Markov chain:
- Set the probability distribution at time 0: x_0
- Set the one-step transition probability matrix: A
- What we want is the unique stationary distribution of the chain, obtained by successively iterating x_{k+1} = A x_k until convergence
- The stationary distribution is the principal eigenvector of the matrix A, which is exactly the PageRank vector
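The claim that the stationary distribution is the principal eigenvector can be checked numerically with power iteration. A sketch (the 3x3 transition matrix is an illustrative example, not one taken from the slides):

```python
# Column-stochastic transition matrix of a small 3-page graph
# (entry A[i][j] = probability of moving from page j to page i).
A = [
    [0.0, 0.0, 1.0],   # into page 0: all of page 2's mass
    [0.5, 0.0, 0.0],   # into page 1: half of page 0's mass
    [0.5, 1.0, 0.0],   # into page 2: half of page 0, all of page 1
]

def matvec(M, x):
    return [sum(M[i][j] * x[j] for j in range(len(x))) for i in range(len(M))]

# Power iteration: x_{k+1} = A x_k until the L1 change is tiny.
x = [1 / 3] * 3
while True:
    x_next = matvec(A, x)
    if sum(abs(a - b) for a, b in zip(x_next, x)) < 1e-10:
        break
    x = x_next

# x is (numerically) the principal eigenvector: A x = 1 * x.
residual = sum(abs(a - b) for a, b in zip(matvec(A, x), x))
print(x, residual)
```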
How we can get PageRank
Problem 1: Dangling Links
- Dangling links are links that point to a page with no outgoing links, or to a page not yet downloaded
- Problem: how should their weight be distributed?
- Solution 1: remove them from the system until all the PageRanks are calculated; afterwards, they are added back in without affecting things significantly
Problem 1: Dangling Links (cont'd)
Solution 2 (presented in the second paper): let v be a vector representing a uniform distribution over all nodes, and let D be the matrix that assigns the distribution v to the outgoing transitions of every dangling page.
In terms of the random walk, the effect of D is to modify the transition probabilities so that a surfer visiting a dangling page randomly jumps to another page in the next time step, using the distribution given by v.
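A minimal sketch of this patching step (the matrix is illustrative): each all-zero column, i.e., a dangling page, is replaced by the uniform vector v, which restores column-stochasticity.

```python
# Page 2 is dangling: it has no outlinks, so its column is all zeros.
A = [
    [0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],   # page 0 -> page 1
    [0.0, 1.0, 0.0],   # page 1 -> page 2
]
n = len(A)
v = [1.0 / n] * n      # uniform jump distribution

# Patch each dangling (all-zero) column with v: a surfer on a dangling
# page jumps to a random page drawn from v in the next time step.
for j in range(n):
    if sum(A[i][j] for i in range(n)) == 0.0:
        for i in range(n):
            A[i][j] = v[i]

# Every column sums to 1 again.
print([sum(A[i][j] for i in range(n)) for j in range(n)])
```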
Problem 2: Rank Sink
Problem: some pages form a loop that accumulates rank endlessly (a rank sink).
Solution: the Random Surfer Model: jump to a random page based on some distribution E (a rank source).
Convergence and Random Walks: why does it work?
- Irreducible, aperiodic Markov chains with a primitive transition probability matrix
- The issue: we need a transition matrix model that guarantees convergence, and indeed converges to a unique stationary distribution vector
PageRank Expression: let E(u) be some vector over the web pages that corresponds to a source of rank. Then the PageRank of a set of web pages is an assignment, R', to the web pages which satisfies

    R'(u) = c * Σ_{v ∈ B_u} R'(v) / N_v + c * E(u)

such that c is maximized and ||R'||_1 = 1 (||R'||_1 denotes the L1 norm of R').

In this expression:
- R'(u): PageRank of document u
- R'(v): PageRank of a document v that links to u
- N_v: number of outlinks from document v
- c: normalization factor
- E(u): vector of web pages that the surfer randomly jumps to
Computing PageRank
S: any vector over the web pages (the initial assignment)

    R_0 <- S
    loop:
        R_{i+1} <- A R_i                     (calculate the next vector from R_i)
        d <- ||R_i||_1 - ||R_{i+1}||_1       (calculate the normalizing factor)
        R_{i+1} <- R_{i+1} + d E             (re-inject the lost rank via E)
        delta <- ||R_{i+1} - R_i||_1         (norm of the difference of the two vectors)
    while delta > epsilon                    (loop until convergence)
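The loop translates directly into code. In this sketch (the 3-page graph is illustrative) the matrix passed in is assumed to be pre-scaled by the damping factor c, so the normalizing term d re-injects both the rank lost at dangling pages and the damping mass through E:

```python
def pagerank(A, E, eps=1e-12):
    # A: transition matrix already scaled by the damping factor c
    #    (columns of dangling pages may be all zero).
    # E: rank-source distribution over pages, summing to 1.
    n = len(A)
    R = [1.0 / n] * n                                     # R_0 <- S (uniform)
    while True:
        R_next = [sum(A[i][j] * R[j] for j in range(n))   # R_{i+1} <- A R_i
                  for i in range(n)]
        # Entries are nonnegative, so the plain sum is the L1 norm.
        d = sum(R) - sum(R_next)                          # d <- ||R_i||_1 - ||R_{i+1}||_1
        R_next = [r + d * e for r, e in zip(R_next, E)]   # R_{i+1} <- R_{i+1} + d E
        delta = sum(abs(a - b) for a, b in zip(R_next, R))
        R = R_next
        if delta < eps:                                   # loop while delta > epsilon
            return R

# Illustrative 3-page graph (A -> B, A -> C, B -> C, C -> A), damping c = 0.85.
c = 0.85
A = [[0.0, 0.0, c],
     [c / 2, 0.0, 0.0],
     [c / 2, c, 0.0]]
E = [1 / 3] * 3
R = pagerank(A, E)
print([round(r, 4) for r in R])
```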
PageRank Implementation
1. Convert each URL into a unique integer ID
2. Sort the link structure by ID
3. Remove the dangling links
4. Make an initial assignment of ranks
5. Iteratively compute PageRank until convergence
6. Add the dangling links back and recompute the rankings

NOTE: after adding the dangling links back, we need to iterate as many times as it took to remove the dangling links.
The mechanism
- Web crawler: finds and retrieves pages on the web
- Repository: web pages are compressed and stored here
- Indexer: each index entry has a list of documents in which the term appears, and the location within the text where it occurs
Convergence
- PageRank over 322 million links: 52 iterations
- PageRank over 161 million links: 45 iterations
- The scaling factor is roughly linear in log n
Adaptive Methods for the Computation of PageRank

This paper presents two contributions:
First, it shows that most pages on the web converge to their true PageRank quickly, while relatively few pages take much longer to converge. It further shows that those slow-converging pages generally have high PageRank, and pages that converge quickly generally have low PageRank.
Experimental results support these findings:
Experimental results
Second, the authors develop two algorithms, called Adaptive PageRank and Modified Adaptive PageRank, that exploit this observation to speed up the computation of PageRank by 18% and 28%, respectively.
The main idea of all the proposed algorithms is the same: speed up the computation of PageRank by reducing cost, i.e., by not recomputing the PageRank of already-converged pages at each iteration.
Notation:
- A: one-step transition probability matrix
- x(k): probability distribution vector at time k
- N: set of pages not yet converged; C: set of converged pages
Adaptive Algorithms
Adaptive PageRank
Reordering the matrix A at each iteration is expensive; the cost is reduced by introducing sparse (zero) entries.
Filter-based Adaptive PageRank
Reduces redundant computation by not recomputing the components of the PageRanks of pages in N that come from links from pages in C.
Split A even further.
Filter-based Modified Adaptive PageRank
Performance comparison
Searching with PageRank
Two search engines:
- Title-based search engine
- Full-text search engine

Title-based search engine:
- Searches only the titles
- Finds all the web pages whose titles contain all the query words
- Sorts the results by PageRank
- Very simple and cheap to implement
- Title match ensures high precision, and PageRank ensures high quality

Full-text search engine:
- Called Google
- Examines all the words in every stored document and also performs PageRank (rank merging)
- More precise but more complicated
Title-based search for University
Personalized PageRank
- An important component of the PageRank calculation is E, a vector over the web pages used as a source of rank
- E is a powerful parameter for adjusting the page ranks
- The E vector corresponds to the distribution of web pages that a random surfer periodically jumps to
- An E vector that is uniform over all web pages results in some pages with many related links (e.g., copyright pages or forums) receiving an overly high rank; this suits general search over the internet
- In Personalized PageRank, E instead consists of a single web page
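The effect of a personalized E can be seen by concentrating it on a single page. A sketch (the graph and the chosen page are illustrative):

```python
def pagerank(A, E, c=0.85, eps=1e-12):
    # Plain PageRank iteration: x <- c A x + (1 - c) E.
    n = len(A)
    x = [1.0 / n] * n
    while True:
        x_next = [c * sum(A[i][j] * x[j] for j in range(n)) + (1 - c) * E[i]
                  for i in range(n)]
        if sum(abs(a - b) for a, b in zip(x_next, x)) < eps:
            return x_next
        x = x_next

# Illustrative 3-page graph: page 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0.
A = [[0.0, 0.0, 1.0], [0.5, 0.0, 0.0], [0.5, 1.0, 0.0]]
uniform = pagerank(A, [1 / 3] * 3)       # general search: uniform jumps
personal = pagerank(A, [0.0, 1.0, 0.0])  # E concentrated on page 1
# Page 1's rank rises when the surfer always jumps back to it.
print(round(uniform[1], 4), round(personal[1], 4))
```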
Applications
- Estimating web traffic: analysis of usage statistics found some sites with very high usage but low PageRank (e.g., links to pirated software)
- PageRank as a backlink predictor: the goal is to crawl pages in as close to the optimal order as possible, i.e., in the order of their rank according to an evaluation function; PageRank is a better predictor than citation counting
- User navigation (the PageRank proxy): the user receives some information about a link before clicking on it; this proxy can help users decide which links are more likely to be interesting
Conclusion
- PageRank is a global ranking of all web pages based on their locations in the web graph structure
- PageRank uses information that is external to the web pages themselves: backlinks
- Backlinks from important pages are more significant than backlinks from average pages
- The structure of the web graph is very useful for information retrieval tasks
References
- L. Page, S. Brin, R. Motwani, T. Winograd. The PageRank Citation Ranking: Bringing Order to the Web, 1998.
- S. Kamvar, T. Haveliwala, G. Golub. Adaptive Methods for the Computation of PageRank. Linear Algebra and its Applications 386, Elsevier, 2004, pp. 51-65.
- S. Brin and L. Page. The Anatomy of a Large-Scale Hypertextual Web Search Engine, 1998.
- K. Bryan and T. Leise. The $25,000,000,000 Eigenvector: The Linear Algebra Behind Google.
- Google Corporate Information: http://www.google.com/corporate/tech.html
- http://en.wikipedia.org/wiki/PageRank
- http://en.wikipedia.org/wiki/Eigenvalue,_eigenvector_and_eigenspace
- http://www.googleguide.com/google_works.html
- http://www.math.cornell.edu/~mec/Winter2009/RalucaRemus/Lecture3/lecture3.html
- http://pr.efactory.de/
Thank You!
Q&A