HITS + PageRank

Jens Noschinski, Thomas Honné, Kersten Schuster, Andreas Schäfer

The slides are licensed under the Creative Commons Attribution-ShareAlike 3.0 License

WS 2010/2011

Web Technologies – Prof. Dr. Ulrik Schroeder

http://creativecommons.org/licenses/by-sa/3.0/legalcode

http://creativecommons.org/licenses/by-sa/3.0/

Overview

Motivation

HITS: background, algorithm, drawbacks

PageRank: background, algorithm, problems

Summary: HITS, PageRank, differences

Sources

Motivation

Problem: searching for information on the web

A query returns more than 1 million results, but only the first 10-20 results are relevant

How do search engines decide which sites are important?

What else needs to be considered?

Motivation

Fast and efficient: many requests at the same time; a very large set of websites (more than 1,000,000,000,000 in July '08)

Up-to-date results: recent changes are reflected

Availability: of the search engine itself, and of indexed pages that can be searched (cache)

Resistance against manipulation: search result manipulation, spam

HITS

HITS = Hyperlink-Induced Topic Search

Introduced in 1997 by Jon Kleinberg

For broad-topic information discovery: pick out a few relevant sources

Identify authoritative web pages: those most central to a certain topic

Question: When can a page be considered authoritative?

Hubs and Authorities

Two distinct types of pages:

Authorities: highly referenced pages, considered authoritative

Hubs: pages that point to many authorities; the points from which authority is conferred

Mutually reinforcing relationship: a good hub points to many good authorities, and a good authority is pointed to by many good hubs

[Figure: a hub pointing to authorities]

Root Set and Base Set

First step of HITS' processing

Assemble the root set S of pages: execute a user-supplied query using a full-text search engine

Expand to the base set T: add pages that point to any page in S, and pages that are pointed to by any page in S

Restrictions:

The set of pages pointing to an authority can be enormous: consider a fixed-size random subset

Page links can be internal links for site navigation: exclude links between pages on the same host

[Figure: Root Set (S) and Base Set (T)]
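The expansion from S to T can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not Kleinberg's implementation: the link graph is assumed to be available as a dictionary `links` mapping each URL to the set of URLs it links to, and the names `host`, `build_base_set`, and `max_in_per_page`, as well as the example URLs, are hypothetical.

```python
import random
from urllib.parse import urlparse


def host(url):
    """Return the host part of a URL, used to detect same-host links."""
    return urlparse(url).netloc


def build_base_set(root_set, links, max_in_per_page=50, seed=0):
    """Expand the root set S to the base set T (hypothetical helper)."""
    rng = random.Random(seed)
    # Exclude navigation links: keep only links between different hosts.
    external = {
        p: {q for q in targets if host(q) != host(p)}
        for p, targets in links.items()
    }
    base = set(root_set)
    for page in root_set:
        # Add pages that are pointed to by a page in S.
        base |= external.get(page, set())
        # Add pages that point to a page in S, capped at a
        # fixed-size random subset.
        pointing = sorted(p for p, targets in external.items() if page in targets)
        if len(pointing) > max_in_per_page:
            pointing = rng.sample(pointing, max_in_per_page)
        base |= set(pointing)
    return base


# Made-up example graph with four pages on three hosts.
links = {
    "http://a.example/1": {"http://b.example/1", "http://a.example/2"},
    "http://b.example/1": {"http://c.example/1"},
    "http://d.example/1": {"http://a.example/1"},
    "http://a.example/2": {"http://a.example/1"},
}
base = build_base_set({"http://a.example/1"}, links)
```

In this example the page `http://a.example/2` stays out of the base set, because its only connection to the root page is a same-host link.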

Hub Weight and Authority Weight

Weights associated with each page p: hub weight h(p) and authority weight a(p), both initialized to 1

Calculation:

a(p) is the sum of the hub weights of the pages pointing to p

h(p) is the sum of the authority weights of the pages pointed to by p

"p → q" means that page p has a hyperlink to page q

$$a(p) := \sum_{q \to p} h(q)$$

$$h(p) := \sum_{p \to q} a(q)$$
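A single application of the two update rules could look like the following Python sketch. The function name `hits_update` and the tiny example graph are hypothetical. Computing the hub weights from the freshly updated authority weights is an assumption, since the slide does not fix the order of the two updates.

```python
def hits_update(links, auth, hub):
    """Apply the authority update, then the hub update, once."""
    pages = list(auth)
    # a(p) = sum of h(q) over all pages q with q -> p
    new_auth = {
        p: sum(hub[q] for q in pages if p in links.get(q, ()))
        for p in pages
    }
    # h(p) = sum of a(q) over all pages q with p -> q
    # (assumption: uses the new authority weights)
    new_hub = {p: sum(new_auth[q] for q in links.get(p, ())) for p in pages}
    return new_auth, new_hub


# Made-up graph: h1 links to a1 and a2, h2 links to a1.
links = {"h1": {"a1", "a2"}, "h2": {"a1"}}
pages = ["h1", "h2", "a1", "a2"]
auth0 = dict.fromkeys(pages, 1)
hub0 = dict.fromkeys(pages, 1)
auth1, hub1 = hits_update(links, auth0, hub0)
```

After this one step, a1 has authority weight 2 (two hubs point to it) and h1 has hub weight 3 (the sum of the authority weights of a1 and a2).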

Further Processing

Repeat the whole update operation k times

The updates are ongoing, so there is no exact final result for the weights; they converge to certain values over time

k = 20 has been shown to deliver good convergence

Normalize the weights after each iteration to prevent the values from getting too large:

$$\sum_{i=1}^{n} a(i) = 1, \qquad \sum_{i=1}^{n} h(i) = 1$$
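The complete loop, with k repetitions and a normalization after each iteration, might be sketched as below. The function name `hits` and the example graph are hypothetical. The sketch normalizes each weight vector so that its entries sum to 1; other normalizations (for example, by the sum of squares) are also in use.

```python
def hits(links, k=20):
    """Run k HITS iterations and return (authority, hub) weights."""
    pages = sorted(set(links) | {q for targets in links.values() for q in targets})
    auth = dict.fromkeys(pages, 1.0)
    hub = dict.fromkeys(pages, 1.0)
    for _ in range(k):
        # Update step: authority weights first, then hub weights.
        auth = {
            p: sum(hub[q] for q in pages if p in links.get(q, ()))
            for p in pages
        }
        hub = {p: sum(auth[q] for q in links.get(p, ())) for p in pages}
        # Normalization step: scale each vector to sum to 1
        # (falls back to 1.0 to avoid dividing by zero on empty graphs).
        a_total = sum(auth.values()) or 1.0
        h_total = sum(hub.values()) or 1.0
        auth = {p: v / a_total for p, v in auth.items()}
        hub = {p: v / h_total for p, v in hub.items()}
    return auth, hub


# Made-up graph: h1 links to a1 and a2, h2 links to a1.
links = {"h1": {"a1", "a2"}, "h2": {"a1"}}
auth, hub = hits(links)
```

In this small graph a1 ends up with a higher authority weight than a2, and h1 with a higher hub weight than h2.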

Output

Only a few pages from the base set are relevant

Output the n pages with the highest authority weights

Output the n pages with the highest hub weights

n = 10 is reasonable

These pages are the final search results
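Selecting the n best pages from a weight table is a short sort. The following Python sketch uses a hypothetical helper `top_n` and made-up authority weights.

```python
def top_n(weights, n=10):
    """Return the n pages with the highest weight, best first."""
    return sorted(weights, key=weights.get, reverse=True)[:n]


# Made-up authority weights for four pages.
example_auth = {"a1": 0.6, "a2": 0.3, "h1": 0.1, "h2": 0.0}
best_authorities = top_n(example_auth, n=2)
```

The same helper can be applied to the hub weights to obtain the best hubs.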

Drawbacks

No anti-spam capability: link farms can boost the hub score

Topic drift: not all linked pages are thematically related

Minor link changes can cause large result changes

Query-dependent: the algorithm is executed for every single search query

Answering a query is time-consuming: computation of the root and base set, calculation of the hub and authority weights

PAGERANK

Background on PageRank

Published in 1998

Developed and patented at Stanford University by, among others, the Google founders Larry Page and Sergey Brin

Exclusively licensed to Google

Differences from other search technologies: pages are not ranked by content alone; new ranking criteria based on the link structure; harder to manipulate

Main idea

Each website has a numeric value called PageRank or Prestige

PageRank computation is based on in- and outlinks

[Figure: example link graph with pages A, B, C, D]

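PageRank Algorithm

A surfer follows an outlink of page x with probability

$$p_x = \frac{1}{\text{outdegree}_x}$$

Therefore the PageRank of a page is

$$PR(x) = \sum_{i \in \text{inlinks}(x)} PR(i) \cdot p_i$$

Resulting equation system for the example graph with pages A, B, C, D:

$$a = \tfrac{1}{2} d, \qquad b = \tfrac{1}{2} a, \qquad c = \tfrac{1}{2} b + \tfrac{1}{2} d, \qquad d = \tfrac{1}{2} a + \tfrac{1}{2} b + c$$

[Figure: example link graph with pages A, B, C, D]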

PageRank Algorithm

Other scores can be reached by multiplying all values by the same factor

$$PR(x) = \sum_{i} \frac{PR(i)}{\text{outdegree}_i}$$

[Figure: example graph with scores A = 4, B = 2, C = 5, D = 8]
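The scores from the figure can be checked against the PageRank equation with exact fractions. In this Python sketch the edge list (A links to B and D, B links to C and D, C links to D, D links to A and C) is an assumption read off the example's equation system, and `satisfies_pagerank` is a hypothetical helper.

```python
from fractions import Fraction

# Assumed outlinks of the four-page example graph.
links = {"A": ["B", "D"], "B": ["C", "D"], "C": ["D"], "D": ["A", "C"]}


def satisfies_pagerank(pr, links):
    """Check PR(x) = sum over inlinks i of PR(i) / outdegree(i) for all x."""
    for x in links:
        total = sum(
            Fraction(pr[i], len(links[i])) for i in links if x in links[i]
        )
        if total != pr[x]:
            return False
    return True


scores = {"A": 4, "B": 2, "C": 5, "D": 8}
# Multiplying all values by the same factor gives another solution.
scaled = {page: 3 * value for page, value in scores.items()}
uniform = {"A": 1, "B": 1, "C": 1, "D": 1}

scores_ok = satisfies_pagerank(scores, links)
scaled_ok = satisfies_pagerank(scaled, links)
uniform_ok = satisfies_pagerank(uniform, links)
```

Both `scores` and `scaled` satisfy the equations, while the uniform assignment does not.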

Problems of the algorithm

Rank sink: after some iterations A and B will have a PageRank of 0

Solution: RandomSurfer

[Figure: link graph with pages A, B, C, D containing a rank sink]
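The effect can be reproduced with a few undamped iterations. The graph in this Python sketch is hypothetical and not necessarily the one in the figure: A links to B, B links to C, and C and D link only to each other, so C and D form the sink. The function name `iterate_undamped` is likewise an assumption.

```python
def iterate_undamped(links, steps):
    """Repeat PR(x) = sum of PR(i) / outdegree(i) over inlinks i."""
    pages = sorted(links)
    pr = {p: 1.0 / len(pages) for p in pages}
    for _ in range(steps):
        pr = {
            x: sum(pr[i] / len(links[i]) for i in pages if x in links[i])
            for x in pages
        }
    return pr


# Hypothetical graph: C and D only link to each other (rank sink).
sink_links = {"A": ["B"], "B": ["C"], "C": ["D"], "D": ["C"]}
ranks = iterate_undamped(sink_links, steps=3)
```

After three steps A and B hold a PageRank of 0, and the whole rank is trapped in the cycle between C and D.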

RandomSurfer

Idea: simulate real surfing behavior

A real surfer may "teleport" to another website (back button, bookmark, ...)

The "damping factor" d is the probability of following a regular outlink

$$PR(x) = (1 - d) + d \sum_{i} \frac{PR(i)}{\text{outdegree}_i}$$

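Iterative algorithm

PageRank-Iterate(G)
  P_0 ← e/n;
  k ← 0;
  Repeat
    k ← k + 1;
    P_k ← (1 − d)·e/n + d·M^T·P_{k−1};
  Until ‖P_k − P_{k−1}‖_1 < ε;
  Return P_k;

Here e is the all-ones vector, n is the number of pages, and M is the matrix with M_ij = 1/outdegree_i if page i links to page j and 0 otherwise. Compared with the RandomSurfer formula, the term (1 − d) is divided by n, so that the entries of P_k sum to 1 as long as every page has at least one outlink.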


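Step 0 (pages in the order D, C, B, A):

$$d = 0.85, \qquad \varepsilon = 0.00000001$$

$$P_0 = (0.25,\ 0.25,\ 0.25,\ 0.25)^T$$

$$M = \begin{pmatrix} 0 & \tfrac{1}{2} & 0 & \tfrac{1}{2} \\ 1 & 0 & 0 & 0 \\ \tfrac{1}{2} & \tfrac{1}{2} & 0 & 0 \\ \tfrac{1}{2} & 0 & \tfrac{1}{2} & 0 \end{pmatrix}$$

[Figure: example link graph with pages A, B, C, D]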


Step 1:

$$P_1 = \frac{1}{4} \begin{pmatrix} 0.15 \\ 0.15 \\ 0.15 \\ 0.15 \end{pmatrix} + 0.85 \cdot M^T \begin{pmatrix} 0.25 \\ 0.25 \\ 0.25 \\ 0.25 \end{pmatrix} = \begin{pmatrix} 0.4625 \\ 0.25 \\ 0.14375 \\ 0.14375 \end{pmatrix}$$


Final step:

$$P_{60} = (0.40279744,\ 0.26232085,\ 0.12619279,\ 0.20868892)^T$$

$$20 \cdot P_{60} \approx (8.05,\ 5.24,\ 2.52,\ 4.17)^T \approx (8,\ 5,\ 2,\ 4)^T$$
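The worked example can be reproduced with a short Python sketch of the iteration. Several points are assumptions: the edge list (A links to B and D, B links to C and D, C links to D, D links to A and C) is read off the example, the teleport term is taken as (1 − d)/n so that the ranks sum to 1, and the function name `pagerank_iterate` is hypothetical. Pages without outlinks are not handled.

```python
def pagerank_iterate(links, d=0.85, eps=1e-8):
    """Iterate P_k = (1 - d)/n + d * M^T P_{k-1} until the L1 change < eps."""
    pages = sorted(links)
    n = len(pages)
    p = {x: 1.0 / n for x in pages}  # P_0 = e / n
    k = 0
    while True:
        k += 1
        new_p = {
            x: (1 - d) / n
            + d * sum(p[i] / len(links[i]) for i in pages if x in links[i])
            for x in pages
        }
        # L1 norm of the change between two consecutive iterations
        diff = sum(abs(new_p[x] - p[x]) for x in pages)
        p = new_p
        if diff < eps:
            return p, k


# Assumed outlinks of the four-page example graph.
links = {"A": ["B", "D"], "B": ["C", "D"], "C": ["D"], "D": ["A", "C"]}

# A very large eps stops after the first iteration and yields P_1.
first_step, _ = pagerank_iterate(links, eps=10.0)
ranks, iterations = pagerank_iterate(links)
```

With these assumptions `first_step` matches the values of Step 1, and `ranks` agrees with the final vector of the example to about six decimal places.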

Properties

Strengths: pre-computable, fast, spam-resistant; minor changes have minor effects

Weaknesses: pages are only authoritative in general and not on the query topic; link farms; Google bombs

Summary

HITS:

The algorithm is executed after a query is made

Pages get a hub value and an authority value

It calculates whether a page provides good information and/or whether it links to pages that do so

No spam-fighting ability

PageRank:

Each page gets one PageRank that declares its value

Query-independent

Spam-resistant

Sources

Papers about PageRank:

Larry Page et al.: "The PageRank Citation Ranking: Bringing Order to the Web"

Ulrik Brandes, Gabi Dorfmüller: 10. Algorithmus der Woche der RWTH Aachen, 09/05/2006

Peter J. Zehetner: "Der PageRank-Algorithmus", 05/2007

Taher H. Haveliwala: "Efficient Computation of PageRank", 10/1999

Papers about HITS:

Jon Kleinberg: "Authoritative Sources in a Hyperlinked Environment"

Jon Kleinberg et al.: "Inferring Web Communities from Link Topology"

Book:

Bing Liu: "Web Data Mining", 2008
