The PageRank Citation Ranking: Bringing Order to the Web Presented by Aishwarya Rengamannan 1000669605 Instructor: Dr. Gautam Das

Post on 25-Feb-2016

Page 1: The  PageRank  Citation Ranking: Bringing Order to the Web

The PageRank Citation Ranking: Bringing Order to the Web

Presented by Aishwarya Rengamannan

1000669605
Instructor: Dr. Gautam Das

Page 2: The  PageRank  Citation Ranking: Bringing Order to the Web

Technology Overview

Page 3: The  PageRank  Citation Ranking: Bringing Order to the Web

Motivation

• WWW is huge and heterogeneous
• Web pages proliferate free of quality control
• Commercial interest to manipulate ranking
• The ‘quality’ of a web page is subjective to the users.

Problem: Necessity to approximate the overall relative ‘importance’ of web pages.

Solution: Take advantage of the link structure of the web.

Page 4: The  PageRank  Citation Ranking: Bringing Order to the Web

Link structure of the Web

• Forward links (outedges): the outgoing links from a web page. C is a forward link of both A and B.

• Backlinks (inedges): the incoming links to a web page. A and B are backlinks of C.

Page 5: The  PageRank  Citation Ranking: Bringing Order to the Web

Related Work

• Academic paper citations
• Link-based analysis
• Clustering methods that take link structure into account
• Modeling the web as Hubs and Authorities

Page 6: The  PageRank  Citation Ranking: Bringing Order to the Web

Ranking Intuition

• The quantity of the backlinks to a webpage makes it important.

• The quality of the back linked pages increases the ranking.

“A page has high rank if the sum of the ranks of its backlinks is high.”

How about having a backlink from www.yahoo.com?

Page 7: The  PageRank  Citation Ranking: Bringing Order to the Web

Naïve PageRank Calculation

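• u & v --> Web pages
• $B_u$ --> Set of backlinks of u (the pages that link to u)
• $N_v$ --> Number of forward links from v
• R --> Ranks of the web pages
• c < 1 --> Factor used for normalization

$R(u) = c \sum_{v \in B_u} \frac{R(v)}{N_v}$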
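A minimal Python sketch of one naive update step, assuming the rule from the original paper that each page v splits its rank evenly over its $N_v$ forward links and each page u sums what its backlinks send it. The function name `naive_step`, the choice c = 1 and the extra link C -> A are illustrative assumptions; only A -> C and B -> C come from the link-structure slide.

```python
def naive_step(links, ranks, c=1.0):
    """One naive PageRank update: R(u) = c * sum(R(v) / N_v for v in B_u)."""
    new_ranks = {page: 0.0 for page in links}
    for v, targets in links.items():
        if not targets:
            continue  # a page with no forward links passes on no rank
        share = ranks[v] / len(targets)  # R(v) / N_v
        for u in targets:
            new_ranks[u] += c * share
    return new_ranks


# Hypothetical graph: A -> C and B -> C (as on the slide), plus C -> A (assumed).
links = {"A": ["C"], "B": ["C"], "C": ["A"]}
ranks = {"A": 1 / 3, "B": 1 / 3, "C": 1 / 3}
updated = naive_step(links, ranks)
```

After one step C holds the rank sent by both of its backlinks, while B, which has no backlinks, drops to zero.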

Page 8: The  PageRank  Citation Ranking: Bringing Order to the Web

Matrix Representation

‘A’ is a square adjacency matrix with
• Rows and columns corresponding to web pages (u & v)
• $A_{u,v} = 1/N_u$ if there is an edge from u to v
• $A_{u,v} = 0$ if there is no edge.
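The definition above can be sketched in Python as follows. This is only an illustration: the helper name `build_matrix`, the use of integer page indices and the four-edge example graph are assumptions, and the sketch assumes the edge list contains no duplicate links.

```python
def build_matrix(edges, n):
    """Return the n x n matrix with A[u][v] = 1/N_u for each edge u -> v, else 0."""
    out_degree = [0] * n
    for u, _ in edges:
        out_degree[u] += 1  # N_u: number of forward links from u
    A = [[0.0] * n for _ in range(n)]
    for u, v in edges:
        A[u][v] = 1.0 / out_degree[u]
    return A


# Hypothetical graph on pages 0, 1, 2: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0
edges = [(0, 1), (0, 2), (1, 2), (2, 0)]
A = build_matrix(edges, 3)
```

Every row that belongs to a page with at least one forward link sums to 1, because the page's rank is split evenly across its links.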

Page 9: The  PageRank  Citation Ranking: Bringing Order to the Web

Matrices Revisited

Eigenvalues and eigenvectors:
• Matrix A (n×n)
• $\lambda$ is an eigenvalue of A if there exists a non-zero vector v such that $Av = \lambda v$
• The vector v is called an eigenvector of A corresponding to $\lambda$.
• We can rewrite $Av = \lambda v$ as $(A - \lambda I)v = 0$, where I is the n×n identity matrix.

Page 10: The  PageRank  Citation Ranking: Bringing Order to the Web

Matrices Revisited(Contd…)

How to solve for Eigen value and Eigen Vector?
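One standard way to answer this is to solve the characteristic equation $\det(A - \lambda I) = 0$ for the eigenvalues and then solve $(A - \lambda I)v = 0$ for each of them. The worked example below uses a hypothetical 2×2 matrix chosen for easy arithmetic; it is not taken from the slides.

```latex
\[
A = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix}, \qquad
\det(A - \lambda I) = (2 - \lambda)^2 - 1 = \lambda^2 - 4\lambda + 3 = (\lambda - 1)(\lambda - 3) = 0
\]
\[
\lambda_1 = 3:\quad (A - 3I)v = \begin{pmatrix} -1 & 1 \\ 1 & -1 \end{pmatrix} v = 0
\;\Rightarrow\; v = \begin{pmatrix} 1 \\ 1 \end{pmatrix}
\]
\[
\lambda_2 = 1:\quad (A - I)v = \begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix} v = 0
\;\Rightarrow\; v = \begin{pmatrix} 1 \\ -1 \end{pmatrix}
\]
```

Here the dominant eigenvalue is 3, and its eigenvector (1, 1) is the kind of vector PageRank looks for in the link matrix.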

Page 11: The  PageRank  Citation Ranking: Bringing Order to the Web
Page 12: The  PageRank  Citation Ranking: Bringing Order to the Web

Sample Calculation

[Figure: example graph with four pages numbered 1 to 4]

Page 13: The  PageRank  Citation Ranking: Bringing Order to the Web

Matrix Representation (contd…)

• A --> square matrix of web pages
• R --> vector over web pages
• To find: the eigenvector corresponding to the dominant (maximum) eigenvalue.
– Can be computed by iterating repeatedly until it converges to the dominant eigenvalue-eigenvector pair

Matrix notation gives

R = c A R

c : eigenvalue
R : eigenvector of A

R =

Normalized R =
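The repeated iteration mentioned above is usually called power iteration. A minimal pure-Python sketch follows, with assumptions stated up front: the function name `power_iteration`, the fixed iteration count and the three-page graph are illustrative, and the matrix passed in is the transpose of the slide's A (entry [u][v] holds $1/N_v$ for an edge v -> u) so that a matrix-vector product sends rank along the links.

```python
def power_iteration(M, iterations=100):
    """Approximate the dominant eigenvector of square matrix M, L1-normalized."""
    n = len(M)
    r = [1.0 / n] * n  # start from a uniform vector
    for _ in range(iterations):
        nxt = [sum(M[i][j] * r[j] for j in range(n)) for i in range(n)]
        total = sum(abs(x) for x in nxt)
        r = [x / total for x in nxt]  # renormalize so the entries sum to 1
    return r


# Transposed link matrix for the hypothetical graph 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0
M = [[0.0, 0.0, 1.0],
     [0.5, 0.0, 0.0],
     [0.5, 1.0, 0.0]]
R = power_iteration(M)
```

For this small graph the iteration settles on ranks of about 0.4, 0.2 and 0.4 for pages 0, 1 and 2.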

Page 14: The  PageRank  Citation Ranking: Bringing Order to the Web

Problem with Naïve PageRank

Rank Sink:
• Two web pages that point to each other but to no other page, and a third page that points to one of them.

• The loop will accumulate rank but never distribute it (since it has no outedges).
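The rank sink can be reproduced with a small Python sketch. The page names X, Y and Z, the helper `naive_step` and the choice of ten iterations are assumptions made for illustration; the update is the naive one with c = 1.

```python
def naive_step(links, ranks):
    """One naive update: every page splits its rank evenly over its forward links."""
    new_ranks = {page: 0.0 for page in links}
    for v, targets in links.items():
        for u in targets:
            new_ranks[u] += ranks[v] / len(targets)
    return new_ranks


# X and Y point only to each other; Z points into the loop.
links = {"X": ["Y"], "Y": ["X"], "Z": ["X"]}
ranks = {page: 1 / 3 for page in links}
for _ in range(10):
    ranks = naive_step(links, ranks)

loop_rank = ranks["X"] + ranks["Y"]  # rank trapped inside the loop
```

After the first step Z has given all its rank to the loop and receives nothing back, so the two looping pages end up holding the entire rank between them.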

Page 15: The  PageRank  Citation Ranking: Bringing Order to the Web

Solution – Extended version of PageRank

Introducing Rank Source:

E(u): a vector over the web pages that corresponds to a source of rank.
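For reference, the extended definition given in the original PageRank paper can be written as below. The symbols $B_u$, $N_v$ and c have the same meaning as in the naive calculation, and R' denotes the extended rank.

```latex
\[
R'(u) = c \sum_{v \in B_u} \frac{R'(v)}{N_v} + c\,E(u),
\qquad \text{with } c \text{ maximized and } \lVert R' \rVert_1 = 1
\]
```

The extra term $c\,E(u)$ feeds rank back into every page that E covers, so a closed loop can no longer drain the rest of the graph.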

Page 16: The  PageRank  Citation Ranking: Bringing Order to the Web

Random Surfer Model

• Random Surfer – Clicks on successive links at random.

• The factor ‘E’ can be viewed as modeling this behavior.

• The “surfer” periodically gets bored and jumps to a random page chosen based on E.
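The random surfer can be simulated directly. The sketch below is an illustration under several assumptions that are not in the slides: E is uniform over all pages, the surfer gets bored with probability 0.15 at each step, a surfer on a page without forward links always jumps, and the names `simulate_surfer`, `jump_prob` and `seed` are made up for this example.

```python
import random


def simulate_surfer(links, steps, jump_prob=0.15, seed=0):
    """Return the fraction of steps a random surfer spends on each page."""
    rng = random.Random(seed)
    pages = sorted(links)
    visits = {page: 0 for page in pages}
    current = rng.choice(pages)
    for _ in range(steps):
        visits[current] += 1
        if not links[current] or rng.random() < jump_prob:
            current = rng.choice(pages)  # bored or stuck: jump according to uniform E
        else:
            current = rng.choice(links[current])  # follow a random forward link
    return {page: count / steps for page, count in visits.items()}


# X and Y link to each other; Z links to X and has no backlinks.
links = {"X": ["Y"], "Y": ["X"], "Z": ["X"]}
freq = simulate_surfer(links, 20000)
```

The visit frequencies approximate the ranks: Z is reached only through random jumps, so it is visited far less often than X or Y, but it no longer ends up with zero.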

Page 17: The  PageRank  Citation Ranking: Bringing Order to the Web

PageRank Computation

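$R_0 \leftarrow S$ - initialize vector over web pages
Loop:
    $R_{i+1} \leftarrow A R_i$ - new ranks: sum of normalized backlink ranks
    $d \leftarrow \lVert R_i \rVert_1 - \lVert R_{i+1} \rVert_1$ - compute normalizing factor
    $R_{i+1} \leftarrow R_{i+1} + dE$ - add escape term
    $\delta \leftarrow \lVert R_{i+1} - R_i \rVert_1$ - control parameter
While $\delta > \epsilon$ - stop when converged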
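The loop above can be sketched in Python as follows. This is one possible reading rather than the authors' implementation: the dictionary-based graph, the uniform starting vector, and the names `pagerank`, `source`, `epsilon` and `max_iter` are all assumptions made for illustration.

```python
def pagerank(links, source, epsilon=1e-10, max_iter=1000):
    """Iterate R <- A R + d E until the L1 change is at most epsilon.

    links:  dict mapping each page to the list of pages it links to
    source: dict E giving each page's share of the rank source (sums to 1)
    """
    pages = list(links)
    ranks = {p: 1.0 / len(pages) for p in pages}  # initial vector (uniform here)
    for _ in range(max_iter):
        new = {p: 0.0 for p in pages}
        for v, targets in links.items():
            for u in targets:
                new[u] += ranks[v] / len(targets)  # sum of normalized backlink ranks
        d = sum(ranks.values()) - sum(new.values())  # rank lost in this step
        for p in pages:
            new[p] += d * source[p]  # add escape term d * E
        delta = sum(abs(new[p] - ranks[p]) for p in pages)  # control parameter
        ranks = new
        if delta <= epsilon:
            break  # converged
    return ranks


# Hypothetical graph in which page C has no forward links.
links = {"A": ["B", "C"], "B": ["C"], "C": []}
source = {p: 1 / 3 for p in links}
ranks = pagerank(links, source)
```

In this example the rank that would otherwise vanish at C is handed back through E, so the total stays at 1 and the ranks converge to roughly 2/11, 3/11 and 6/11 for A, B and C.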

Page 18: The  PageRank  Citation Ranking: Bringing Order to the Web

Another Problem?

Dangling links:
– Links that point to a page with no outgoing links
– Not clear where their weights should be distributed

Solution: Remove them from the system until after calculating all other PageRanks!
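A minimal Python sketch of the removal step. It assumes every link target also appears as a key of the graph dictionary; the function name `remove_dangling` and the four-page graph are hypothetical. Removing one dangling page can leave another page without forward links, so the sketch repeats until none remain.

```python
def remove_dangling(links):
    """Repeatedly drop pages with no forward links, and the links pointing to them."""
    pruned = {page: list(targets) for page, targets in links.items()}
    while True:
        dangling = {page for page, targets in pruned.items() if not targets}
        if not dangling:
            return pruned
        pruned = {
            page: [t for t in targets if t not in dangling]
            for page, targets in pruned.items()
            if page not in dangling
        }


# Hypothetical graph: D has no forward links, and C links only to D.
links = {"A": ["B", "D"], "B": ["A", "C"], "C": ["D"], "D": []}
core = remove_dangling(links)
```

Here D is removed first, which leaves C dangling, so C is removed in the next pass and only the A and B loop remains for the rank calculation.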

Page 19: The  PageRank  Citation Ranking: Bringing Order to the Web

Implementation
• The web crawler keeps a database of URLs so that it can discover all URLs on the web
• To implement PageRank, the web crawler builds an index of the URLs as it crawls

Problems???

• Infinitely large sites
• Incorrect/broken HTML
• Sites are down
• Web is always changing

Page 20: The  PageRank  Citation Ranking: Bringing Order to the Web

PageRank Implementation

• Convert each URL into a unique integer ID
• Sort the link structure by ID
• Remove dangling links
• Make an initial assignment of ranks and iterate until convergence
• Add the dangling links back
• Iterate the process again to assign weights to all dangling links
• The link database, A, is normally kept in RAM
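The first two steps can be sketched in Python as below. The function name `assign_ids`, the in-memory dictionary and the example.com URLs are hypothetical; a real crawler would keep this mapping in its URL database.

```python
def assign_ids(url_links):
    """Map each URL to a unique integer ID and return the links sorted by ID."""
    ids = {}

    def get_id(url):
        if url not in ids:
            ids[url] = len(ids)  # next unused integer
        return ids[url]

    id_links = []
    for source_url, target_url in url_links:
        id_links.append((get_id(source_url), get_id(target_url)))
    id_links.sort()  # order by source ID, then by target ID
    return ids, id_links


# Hypothetical crawl output as (source URL, target URL) pairs.
url_links = [
    ("http://example.com/b", "http://example.com/a"),
    ("http://example.com/a", "http://example.com/c"),
    ("http://example.com/a", "http://example.com/b"),
]
ids, id_links = assign_ids(url_links)
```

Working with sorted integer pairs instead of URL strings keeps the link structure compact and lets each iteration read it in a single linear pass.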

Page 21: The  PageRank  Citation Ranking: Bringing Order to the Web

Convergence Properties

• Interpret the web as an expander-like graph.
– A graph is an expander if every subset of nodes S has a neighborhood that is larger than some factor α times |S|

• Verification - the graph has a good expansion factor if the largest eigenvalue is sufficiently larger than the second-largest eigenvalue

Page 22: The  PageRank  Citation Ranking: Bringing Order to the Web

Applications of Page Rank

• Search, browsing and traffic estimation.
• Help users decide if a site is trustworthy.
• Estimate web traffic.
• Spam detection and prevention.
• Predict citation counts.

Page 23: The  PageRank  Citation Ranking: Bringing Order to the Web

• http://www.techpavan.com/2008/11/20/backend-google-search/

• http://www.math.hmc.edu/calculus/tutorials/eigenstuff/

• http://williamcotton.com/pagerank-explained-with-javascript