HITS + PageRank
Jens Noschinski, Thomas Honné, Kersten Schuster, Andreas Schäfer
The slides are licensed under the Creative Commons Attribution-ShareAlike 3.0 License
WS 2010/2011
Web Technologies – Prof. Dr. Ulrik Schroeder
Overview
  Motivation
  HITS: background, algorithm, drawbacks
  PageRank: background, algorithm, problems
  Summary: HITS, PageRank, differences
  Sources
Motivation
Problem: searching for information on the web
  more than 1 million results, but only the first 10-20 results are relevant
How do search engines decide which sites are important?
What else needs to be considered?
Motivation
Fast and efficient
  many requests at the same time
  very large set of websites (more than 1,000,000,000,000 in July 2008)
Up-to-date results
  recent changes must be reflected
Availability
  of the search engine itself
  of indexed pages that can be searched (cache)
Resistance against manipulation
  search result manipulation
  spam
HITS
HITS = Hyperlink-Induced Topic Search
Introduced in 1997 by Jon Kleinberg
For broad-topic information discovery
  pick out a few relevant sources
Identify authoritative web pages
  most central regarding a certain topic
Question: When can a page be considered authoritative?
Hubs and Authorities
Two distinct types of pages
  Authorities
    highly referenced pages
    considered authoritative
  Hubs
    pages that point to many authorities
    points from which authority is conferred
Mutually reinforcing relationship
  a good hub points to many good authorities
  a good authority is pointed to by many good hubs
[Figure: nodes labeled Hub and Authority]
Root Set and Base Set
First step of HITS' processing
Assemble root set S of pages
  execute a user-supplied query
  use a full-text search engine
Expand to base set T
  add pages that point to any page in S
  add pages that are pointed to by any page in S
Restrictions
  the set of pages pointing to an authority can be enormous
    consider only a fixed-size random subset
  page links can be internal links for site navigation
    exclude links between pages on the same host
[Figure: Root Set (S) and Base Set (T)]
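The expansion from root set to base set can be sketched in Python as follows. This is only a minimal sketch under stated assumptions: the link graph is assumed to be available as a dict `links` that maps each page URL to the set of URLs it links to, and the names `build_base_set`, `drop_same_host_links`, `max_in_links`, and the sample URLs are hypothetical, not taken from the slides.

```python
import random
from urllib.parse import urlparse


def build_base_set(root_set, links, max_in_links=50, seed=0):
    """Expand root set S to base set T (hypothetical helper).

    links: dict mapping a page URL to the set of URLs it links to.
    max_in_links: size of the random subset of in-linking pages kept per root page.
    """
    rng = random.Random(seed)
    base = set(root_set)
    for page in root_set:
        # Pages that are pointed to by a page in S.
        base |= links.get(page, set())
        # Pages that point to a page in S; keep only a fixed-size random subset.
        in_pages = sorted(p for p, outs in links.items() if page in outs)
        if len(in_pages) > max_in_links:
            in_pages = rng.sample(in_pages, max_in_links)
        base |= set(in_pages)
    return base


def drop_same_host_links(pages, links):
    """Keep only links between pages of the set that live on different hosts."""
    def host(url):
        return urlparse(url).netloc

    return {
        p: {q for q in links.get(p, set()) if q in pages and host(q) != host(p)}
        for p in pages
    }


# Tiny made-up link graph for illustration.
links = {
    "http://a.example/1": {"http://b.example/1", "http://a.example/2"},
    "http://c.example/1": {"http://a.example/1"},
    "http://b.example/1": set(),
    "http://a.example/2": set(),
}
base = build_base_set({"http://a.example/1"}, links)
graph = drop_same_host_links(base, links)
```

In this made-up graph the base set contains all four pages, and the navigation link from `a.example/1` to `a.example/2` is dropped because both pages share a host.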
Hub Weight and Authority Weight
Weights associated with each page p
  hub weight h(p)
  authority weight a(p)
  both initialized to 1
Calculation
  a(p) is the sum of the hub weights of the pages pointing to p
  h(p) is the sum of the authority weights of the pages pointed to by p
"p → q" means that page p has a hyperlink to page q

$$a(p) := \sum_{q \to p} h(q)$$

$$h(p) := \sum_{p \to q} a(q)$$
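One update of both weights might look like the following Python sketch. It assumes the graph is a dict `links` mapping every page to the set of pages it links to, and it assumes that the hub update uses the freshly computed authority weights (the slides do not state the order). The function name `hits_update` and the four-page example are hypothetical.

```python
def hits_update(links, hub, auth):
    """One HITS update: new authority weights first, then new hub weights.

    links: dict mapping page p to the set of pages q with p -> q.
    """
    pages = list(links)
    # a(p) := sum of h(q) over all pages q with q -> p
    new_auth = {p: sum(hub[q] for q in pages if p in links[q]) for p in pages}
    # h(p) := sum of a(q) over all pages q with p -> q (assumption: uses new a)
    new_hub = {p: sum(new_auth[q] for q in links[p]) for p in pages}
    return new_hub, new_auth


# Hypothetical example: two hub pages pointing to two authority pages.
links = {"h1": {"a1", "a2"}, "h2": {"a1"}, "a1": set(), "a2": set()}
hub = {p: 1 for p in links}    # all weights are initialized to 1
auth = {p: 1 for p in links}
hub, auth = hits_update(links, hub, auth)
```

After this single step, `a1` (linked from both hubs) has authority weight 2 and `a2` has 1, while `h1` collects hub weight 3 and `h2` collects 2.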
Further Processing
Repeat the whole update operation k times
  the updates are ongoing, so there is no exact final result for the weights
  the weights converge to fixed values over time
  k = 20 has been shown to give good convergence
Normalize the weights
  prevents the values from growing too large
  normalize after each iteration

$$\sum_{i=1}^{n} a(i) = 1, \qquad \sum_{i=1}^{n} h(i) = 1$$
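The repeated update with normalization can be sketched as below. This is an illustrative sketch, not the authors' implementation: it assumes the graph is a dict `links` from each page to the set of pages it links to, that the graph contains at least one link (otherwise the normalization would divide by zero), and that weights are scaled so they sum to 1. The name `hits` and the example graph are hypothetical.

```python
def hits(links, k=20):
    """Run k HITS iterations, normalizing the weights after each iteration."""
    pages = list(links)
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(k):
        # Authority update, then hub update with the new authority weights.
        auth = {p: sum(hub[q] for q in pages if p in links[q]) for p in pages}
        hub = {p: sum(auth[q] for q in links[p]) for p in pages}
        # Normalize so that each set of weights sums to 1.
        auth_total = sum(auth.values())
        hub_total = sum(hub.values())
        auth = {p: a / auth_total for p, a in auth.items()}
        hub = {p: h / hub_total for p, h in hub.items()}
    return hub, auth


# Hypothetical example: two hub pages pointing to two authority pages.
links = {"h1": {"a1", "a2"}, "h2": {"a1"}, "a1": set(), "a2": set()}
hub, auth = hits(links)
```

Because the weights are rescaled after every iteration, only their ratios matter; in this small example `a1` ends up with the larger authority weight and `h1` with the larger hub weight.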
Output
Only a few pages from the base set are relevant
  output the n pages with the highest authority weights
  output the n pages with the highest hub weights
  n = 10 is reasonable
These pages are the final search results
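Selecting the n best pages from a weight table is a plain top-n selection. The sketch below assumes the weights are stored in a dict from page to weight; the helper name `top_n` and the sample weights are made up for illustration.

```python
import heapq


def top_n(weights, n=10):
    """Return the n pages with the highest weight, best first."""
    return heapq.nlargest(n, weights, key=weights.get)


# Hypothetical authority weights for five pages.
auth = {"p1": 0.05, "p2": 0.40, "p3": 0.25, "p4": 0.10, "p5": 0.20}
best = top_n(auth, n=3)
```

The same helper can be applied to the hub weights to obtain the second result list.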
Drawbacks
No anti-spam capability
  link farms can boost the hub score
Topic drift
  not all linked pages are thematically related
Minor link changes can cause large changes in the results
Query-dependent
  the algorithm is executed for every single search query
  answering a query is time-consuming
    computation of root and base set
    calculation of hub and authority weights
PageRank
Background on PageRank
Published in 1998
  developed at Stanford University by, among others, the Google founders Larry Page and Sergey Brin
  patented by Stanford University and exclusively licensed to Google
Differences from other search technologies
  pages are not ranked by content alone
  new ranking criteria based on the link structure
  harder to manipulate
Main idea
Each website has a numeric value called PageRank or Prestige
PageRank computation is based on in- and outlinks
[Figure: example link graph with the four pages A, B, C, D]
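PageRank Algorithm
A surfer on page x follows each outlink of x with probability p_x:

$$p_x = \frac{1}{\mathrm{outdegree}_x}$$

Therefore the PageRank of a page x is

$$PR(x) = \sum_{i \in \mathrm{inlinks}(x)} PR(i) \cdot p_i$$

Resulting equation system for the example graph (a, b, c, d are the PageRanks of A, B, C, D):

$$a = \tfrac{1}{2} d$$

$$b = \tfrac{1}{2} a$$

$$c = \tfrac{1}{2} b + \tfrac{1}{2} d$$

$$d = \tfrac{1}{2} a + \tfrac{1}{2} b + c$$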
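PageRank Algorithm
The equation system fixes the values only up to a common factor: other scores can be obtained by multiplying all values by the same factor
[Figure: example graph annotated with the solution A = 4, B = 2, C = 5, D = 8]

$$PR(x) = \sum_{i \in \mathrm{inlinks}(x)} \frac{PR(i)}{\mathrm{outdegree}_i}$$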
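The solution and its scaling property can be checked with exact fractions. The sketch below assumes the example graph has the links A→B, A→D, B→C, B→D, C→D, D→A, D→C, which is the structure implied by the equation system a = d/2, b = a/2, c = b/2 + d/2, d = a/2 + b/2 + c; the function name `satisfies_pagerank_equations` is hypothetical.

```python
from fractions import Fraction

# Assumed link structure of the example: page -> pages it links to.
links = {"A": ["B", "D"], "B": ["C", "D"], "C": ["D"], "D": ["A", "C"]}


def satisfies_pagerank_equations(pr, links):
    """Check PR(x) = sum over inlinks i of PR(i) / outdegree(i) for every x."""
    for x in links:
        incoming = sum(
            Fraction(pr[i], len(outs)) for i, outs in links.items() if x in outs
        )
        if incoming != pr[x]:
            return False
    return True


solution = {"A": 4, "B": 2, "C": 5, "D": 8}
scaled = {page: 3 * value for page, value in solution.items()}
ok = satisfies_pagerank_equations(solution, links)
ok_scaled = satisfies_pagerank_equations(scaled, links)
```

Both `solution` and the tripled `scaled` values pass the check, which illustrates that only the ratios between the scores are determined.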
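Problems of the algorithm
Rank sink
  a group of pages that receives links but has no links leaving the group accumulates all the rank
  after some iterations A and B will have a PageRank of 0
  solution: RandomSurfer
[Figure: link graph with the pages A, B, C, D illustrating a rank sink]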
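The effect can be reproduced with a few undamped iterations. The graph used here is hypothetical (the slide's figure is not recoverable from the text): A links to B, B links to C, and C and D link only to each other, so C and D form the sink. The function name `undamped_step` is also an assumption.

```python
def undamped_step(pr, links):
    """One PageRank update without damping: PR(x) = sum of PR(i) / outdegree(i)."""
    new = {x: 0.0 for x in links}
    for i, outs in links.items():
        for x in outs:
            new[x] += pr[i] / len(outs)
    return new


# Hypothetical graph: C and D only link to each other and form a rank sink.
links = {"A": ["B"], "B": ["C"], "C": ["D"], "D": ["C"]}
pr = {x: 0.25 for x in links}
for _ in range(10):
    pr = undamped_step(pr, links)
```

In this graph the rank of A and B drains into the pair C, D after two iterations, and nothing ever flows back.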
RandomSurfer
Idea: simulate real surfing behavior
  a real surfer may "teleport" to another website (back button, bookmark, ...)
  the "damping factor" d is the probability of following a regular outlink

$$PR(x) = (1 - d) + d \sum_{i \in \mathrm{inlinks}(x)} \frac{PR(i)}{\mathrm{outdegree}_i}$$
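A short sketch shows how the damped formula keeps every page above zero. It uses a hypothetical graph in which C and D link only to each other (a rank sink), starts every page at 1, and applies the formula with d = 0.85; the function name `random_surfer_step` is an assumption.

```python
def random_surfer_step(pr, links, d=0.85):
    """One update with PR(x) = (1 - d) + d * sum of PR(i) / outdegree(i)."""
    new = {x: 1.0 - d for x in links}
    for i, outs in links.items():
        for x in outs:
            new[x] += d * pr[i] / len(outs)
    return new


# Hypothetical graph with a rank sink: C and D only link to each other.
links = {"A": ["B"], "B": ["C"], "C": ["D"], "D": ["C"]}
pr = {x: 1.0 for x in links}
for _ in range(200):
    pr = random_surfer_step(pr, links)
```

With this form of the formula the values sum to the number of pages (4 here); A keeps the teleport share 1 - d = 0.15 even though no page links to it.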
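Iterative algorithm
PageRank-Iterate(G)
  $P_0 \leftarrow e/n$;
  $k \leftarrow 0$;
  Repeat
    $k \leftarrow k + 1$;
    $P_k \leftarrow (1 - d)\, e/n + d\, M^T P_{k-1}$;
  Until $\lVert P_k - P_{k-1} \rVert_1 < \varepsilon$;
  Return $P_k$;
Here n is the number of pages, e is the all-ones vector of length n, d is the damping factor, ε is the convergence threshold, and M is the transition matrix with $M_{ij} = 1/\mathrm{outdegree}_i$ if page i links to page j and 0 otherwise. The teleport term is divided by n so that the entries of $P_k$ sum to 1, matching the start vector $P_0 = e/n$.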
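A minimal Python sketch of this iteration is given below. It assumes the teleport term is scaled by 1/n so that the ranks sum to 1, that every page has at least one outlink, and that the example graph has the links A→B, A→D, B→C, B→D, C→D, D→A, D→C; the function name `pagerank_iterate` and the dict-based graph representation are illustrative choices, not part of the slides.

```python
def pagerank_iterate(links, d=0.85, eps=1e-8):
    """Power iteration: P_k = (1 - d) * e / n + d * M^T * P_(k-1).

    links: dict mapping each page to the list of pages it links to.
    Assumes every page has at least one outlink (no dangling pages).
    Returns the final rank vector as a dict and the number of iterations.
    """
    pages = list(links)
    n = len(pages)
    p = {x: 1.0 / n for x in pages}  # P_0 = e / n
    k = 0
    while True:
        k += 1
        new = {x: (1.0 - d) / n for x in pages}  # teleport term
        for i, outs in links.items():
            for x in outs:
                new[x] += d * p[i] / len(outs)  # contribution of M^T * P_(k-1)
        diff = sum(abs(new[x] - p[x]) for x in pages)  # 1-norm of the change
        p = new
        if diff < eps:
            return p, k


# Assumed example graph with the four pages A, B, C, D.
links = {"A": ["B", "D"], "B": ["C", "D"], "C": ["D"], "D": ["A", "C"]}
ranks, iterations = pagerank_iterate(links)
```

Storing the graph as adjacency lists avoids building the matrix M explicitly; each inner loop step adds one nonzero entry of M^T times the previous rank.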
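Iterative algorithm: Step 0
d = 0.85; ε = 0.00000001
Link matrix of the example graph, rows and columns ordered A, B, C, D (an entry 1 means that the row page links to the column page; dividing each row by its outdegree gives the transition probabilities used as M):

$$M = \begin{pmatrix} 0 & 1 & 0 & 1 \\ 0 & 0 & 1 & 1 \\ 0 & 0 & 0 & 1 \\ 1 & 0 & 1 & 0 \end{pmatrix}$$

$$P_0 = (0.25,\ 0.25,\ 0.25,\ 0.25)^T$$

[Figure: example graph with the pages A, B, C, D]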
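Iterative algorithm: Step 1
With n = 4 pages, d = 0.85, and vectors ordered A, B, C, D (M holds the transition probabilities 1/outdegree):

$$P_1 = \frac{1 - d}{n}\, e + d\, M^T P_0 = \begin{pmatrix} 0.0375 \\ 0.0375 \\ 0.0375 \\ 0.0375 \end{pmatrix} + 0.85 \begin{pmatrix} 0.125 \\ 0.125 \\ 0.25 \\ 0.5 \end{pmatrix} = \begin{pmatrix} 0.14375 \\ 0.14375 \\ 0.25 \\ 0.4625 \end{pmatrix}$$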
Iterative algorithm: Final step
Result after 60 iterations (vectors ordered A, B, C, D):

$$P_{60} = (0.20868892,\ 0.12619279,\ 0.26232085,\ 0.40279744)^T$$

Scaled by 20 for comparison with the solution of the equation system without damping, (4, 2, 5, 8):

$$20 \cdot P_{60} \approx (4.17,\ 2.52,\ 5.24,\ 8.05)^T$$
Properties
Strengths
  pre-computable
  fast
  spam-resistant
    minor changes have minor effects
Weaknesses
  pages are only authoritative in general, not with respect to the query topic
  link farms
  Google bombs
Summary
HITS
  the algorithm is executed after a query is made
  pages get a hub value and an authority value
  determines whether a page provides good information and/or whether it links to pages that do so
  no spam-fighting ability
PageRank
  each page gets one PageRank that expresses its value
  query-independent
  spam-resistant
Sources
Papers about PageRank
  Larry Page et al.: "The PageRank Citation Ranking: Bringing Order to the Web"
  Ulrik Brandes, Gabi Dorfmüller: "10. Algorithmus der Woche der RWTH Aachen", 09/05/2006
  Peter J. Zehetner: "Der PageRank-Algorithmus", 05/2007
  Taher H. Haveliwala: "Efficient Computation of PageRank", 10/1999
Papers about HITS
  Jon Kleinberg: "Authoritative Sources in a Hyperlinked Environment"
  Jon Kleinberg et al.: "Inferring Web Communities from Link Topology"
Book
  Bing Liu: "Web Data Mining", 2008