![Page 1: I/O-Efficient Techniques for Computing Pagerank](https://reader034.vdocuments.site/reader034/viewer/2022042817/55a982c71a28ab6f458b4784/html5/thumbnails/1.jpg)
I/O-Efficient Techniques for Computing Pagerank
Yen-Yu Chen Qingqing Gan Torsten Suel
Polytechnic University, Brooklyn NY
![Page 2: I/O-Efficient Techniques for Computing Pagerank](https://reader034.vdocuments.site/reader034/viewer/2022042817/55a982c71a28ab6f458b4784/html5/thumbnails/2.jpg)
Web Graph
• URL as a node
• Hyperlink as a directed edge
• The graph structure represents the World Wide Web
![Page 3: I/O-Efficient Techniques for Computing Pagerank](https://reader034.vdocuments.site/reader034/viewer/2022042817/55a982c71a28ab6f458b4784/html5/thumbnails/3.jpg)
Page Rank
• Random Surfer model
– A person who surfs the web by randomly clicking links on visited pages.
• PageRank of a page is proportional to the frequency with which a random surfer would visit it.
[Figure: example graph with PageRank values R1 = 0.286, R2 = 0.286, R3 = 0.143, R4 = 0.143, R5 = 0.143]
![Page 4: I/O-Efficient Techniques for Computing Pagerank](https://reader034.vdocuments.site/reader034/viewer/2022042817/55a982c71a28ab6f458b4784/html5/thumbnails/4.jpg)
Practical PageRank
• Two problems:
– Rank leak
– Rank sink
• Pruning
• Add back edges
• Random Jump
[Figure: example graph with d = 0.8 and PageRank values R1 = 0.154, R2 = 0.142, R3 = 0.101, R4 = 0.313, R5 = 0.290]
![Page 5: I/O-Efficient Techniques for Computing Pagerank](https://reader034.vdocuments.site/reader034/viewer/2022042817/55a982c71a28ab6f458b4784/html5/thumbnails/5.jpg)
Topic Sensitive PageRank
• Modified Random Jump
• Only jump to certain pages which are related to a specific topic
• ODP-biasing
Topic T
![Page 6: I/O-Efficient Techniques for Computing Pagerank](https://reader034.vdocuments.site/reader034/viewer/2022042817/55a982c71a28ab6f458b4784/html5/thumbnails/6.jpg)
Challenge
• 3.5 billion pages on the web
• 49 billion hyperlinks between them
• Storing a 4-byte PageRank value for every page requires 14 GB, which is hard to fit in memory
• Goal: calculate the PageRank values in an I/O-efficient way
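The 14 GB figure follows directly from the page count and the size of one rank value. A minimal check of that arithmetic, assuming decimal gigabytes (10^9 bytes); the variable names are illustrative only:

```python
# Back-of-the-envelope check of the memory requirement quoted on the slide.
NUM_PAGES = 3_500_000_000   # 3.5 billion pages (from the slide)
BYTES_PER_RANK = 4          # one 4-byte rank value per page (from the slide)

rank_vector_bytes = NUM_PAGES * BYTES_PER_RANK
rank_vector_gb = rank_vector_bytes / 10**9   # assumption: decimal gigabytes

print(rank_vector_gb)
```

This counts only the rank vector itself; the link structure is stored separately.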
![Page 7: I/O-Efficient Techniques for Computing Pagerank](https://reader034.vdocuments.site/reader034/viewer/2022042817/55a982c71a28ab6f458b4784/html5/thumbnails/7.jpg)
I/O Efficient Algorithms
• Naïve Algorithm
• Haveliwala’s Algorithm
• Our contribution:
– Sort-Merge Algorithm
– Split-Accumulate Algorithm
![Page 8: I/O-Efficient Techniques for Computing Pagerank](https://reader034.vdocuments.site/reader034/viewer/2022042817/55a982c71a28ab6f458b4784/html5/thumbnails/8.jpg)
Related Work
![Page 9: I/O-Efficient Techniques for Computing Pagerank](https://reader034.vdocuments.site/reader034/viewer/2022042817/55a982c71a28ab6f458b4784/html5/thumbnails/9.jpg)
Naïve Algorithm
• Two vectors of 32-bit floating-point numbers.
• Source vector is on disk
• Destination vector is in memory.
$C_{naive} = |V| + |V'| + |L| = 2 \cdot |V| + |L|$
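A minimal in-memory sketch of one naive iteration, under stated assumptions: the link file is modeled as a list of hypothetical `(source, out_degree, targets)` records read in order, the "on-disk" source vector is a plain list, and the random-jump form of PageRank with damping `alpha` and total rank `total_rank` is used. None of these names come from the slides.

```python
def naive_iteration(source, link_records, alpha=0.85, total_rank=1.0):
    """One pass over the link file with the destination vector held in memory."""
    n = len(source)
    # Destination vector V': every page starts with its random-jump share.
    dest = [(1.0 - alpha) * total_rank / n] * n
    # The link file (and the source vector alongside it) is scanned once.
    for src, out_degree, targets in link_records:
        share = alpha * source[src] / out_degree
        for t in targets:
            dest[t] += share   # random access, which is why V' must fit in memory
    return dest


# Hypothetical 4-node graph: 0 -> 1,2   1 -> 2   2 -> 0   3 -> 0,2
links = [(0, 2, [1, 2]), (1, 1, [2]), (2, 1, [0]), (3, 2, [0, 2])]
ranks = naive_iteration([0.25] * 4, links)
print(ranks)
```

The sequential scan accounts for the |V| and |L| terms of the cost, and writing the result back accounts for |V'|. The sketch only works while the whole destination vector fits in memory.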
![Page 10: I/O-Efficient Techniques for Computing Pagerank](https://reader034.vdocuments.site/reader034/viewer/2022042817/55a982c71a28ab6f458b4784/html5/thumbnails/10.jpg)
Haveliwala’s Algorithm
• Partition the destination vector into d blocks V_i' that each fit into main memory.
• Partition the link file into d files L_i, each containing only the links that point to nodes in V_i'.
$C_h = d \cdot |V| + |V'| + \sum_{0 \le i < d} |L_i| = (d+1) \cdot |V| + (1+\epsilon) \cdot |L|$
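The partitioning step can be sketched as follows. This is an illustration under assumptions, not the authors' or Haveliwala's implementation: link records are hypothetical `(source, out_degree, targets)` tuples, blocks are contiguous ranges of node ids, and the random-jump formula with damping `alpha` is used.

```python
def partition_links(link_records, n, d):
    """Split the link file into d files, one per destination block."""
    block_size = -(-n // d)   # ceiling division
    parts = [[] for _ in range(d)]
    for src, out_degree, targets in link_records:
        by_block = {}
        for t in targets:
            by_block.setdefault(t // block_size, []).append(t)
        # The source id and out-degree are repeated in every block that the
        # source links into; this repetition is the overhead behind epsilon.
        for b, ts in by_block.items():
            parts[b].append((src, out_degree, ts))
    return parts, block_size


def haveliwala_iteration(source, parts, block_size, alpha=0.85, total_rank=1.0):
    """Process one destination block at a time; the source is rescanned per block."""
    n = len(source)
    result = []
    for b, part in enumerate(parts):
        lo = b * block_size
        hi = min(lo + block_size, n)
        block = [(1.0 - alpha) * total_rank / n] * (hi - lo)   # V_i' in memory
        for src, out_degree, targets in part:
            share = alpha * source[src] / out_degree
            for t in targets:
                block[t - lo] += share
        result.extend(block)   # the finished block would be written to disk
    return result


links = [(0, 2, [1, 2]), (1, 1, [2]), (2, 1, [0]), (3, 2, [0, 2])]
parts, block_size = partition_links(links, n=4, d=2)
ranks = haveliwala_iteration([0.25] * 4, parts, block_size)
print(ranks)
```

Rescanning the source vector once per block corresponds to the d · |V| term in the cost.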
![Page 11: I/O-Efficient Techniques for Computing Pagerank](https://reader034.vdocuments.site/reader034/viewer/2022042817/55a982c71a28ab6f458b4784/html5/thumbnails/11.jpg)
![Page 12: I/O-Efficient Techniques for Computing Pagerank](https://reader034.vdocuments.site/reader034/viewer/2022042817/55a982c71a28ab6f458b4784/html5/thumbnails/12.jpg)
Sort-Merge Algorithm
• The link file is identical to the one in the naïve algorithm.
• For each link, create a packet that contains the line number of the destination and the amount of rank value that has to be transmitted to that destination.
• 8-byte packet: 4-byte id + 4-byte floating-point number
![Page 13: I/O-Efficient Techniques for Computing Pagerank](https://reader034.vdocuments.site/reader034/viewer/2022042817/55a982c71a28ab6f458b4784/html5/thumbnails/13.jpg)
Sort-Merge Algorithm (continued)
• Route packets by sorting them by destination and combining the ranks into the destination node.
• |P| is the total size of the generated packets, which have to be written out and read back in once.
$C_{sort-merge} = |V| + |V'| + |L| + 2 \cdot |P|$
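A small sketch of the packet routing, assuming hypothetical `(source, out_degree, targets)` link records and the random-jump formula with damping `alpha`. Sorted in-memory runs merged with `heapq.merge` stand in for an external merge sort; `run_size` is an illustrative stand-in for the memory limit.

```python
import heapq
from itertools import groupby
from operator import itemgetter


def make_packets(source, link_records):
    """Emit one (destination, rank amount) packet per link."""
    for src, out_degree, targets in link_records:
        share = source[src] / out_degree
        for t in targets:
            yield (t, share)


def sorted_runs(packets, run_size):
    """Cut the packet stream into sorted runs that would each fit in memory."""
    run = []
    for packet in packets:
        run.append(packet)
        if len(run) == run_size:
            yield sorted(run)
            run = []
    if run:
        yield sorted(run)


def sort_merge_iteration(source, link_records, run_size=4, alpha=0.85, total_rank=1.0):
    n = len(source)
    runs = list(sorted_runs(make_packets(source, link_records), run_size))
    dest = [(1.0 - alpha) * total_rank / n] * n
    # Packets arrive in destination order, so all packets for one node are
    # adjacent and the destination vector could be written sequentially.
    for node, group in groupby(heapq.merge(*runs), key=itemgetter(0)):
        dest[node] += alpha * sum(amount for _, amount in group)
    return dest


links = [(0, 2, [1, 2]), (1, 1, [2]), (2, 1, [0]), (3, 2, [0, 2])]
ranks = sort_merge_iteration([0.25] * 4, links, run_size=2)
print(ranks)
```

Writing the runs out and reading them back during the merge is what the 2 · |P| term in the cost accounts for.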
![Page 14: I/O-Efficient Techniques for Computing Pagerank](https://reader034.vdocuments.site/reader034/viewer/2022042817/55a982c71a28ab6f458b4784/html5/thumbnails/14.jpg)
Split-Accumulate Algorithm
• Split the source vector into d blocks V_i, such that the 4-byte rank values of all nodes in a block fit into memory.
• Link file L_i contains information on all links with a source node in block V_i.
• L_i resembles the reverse of L_i in Haveliwala's algorithm, but the out-degree information is moved into a separate file.
![Page 15: I/O-Efficient Techniques for Computing Pagerank](https://reader034.vdocuments.site/reader034/viewer/2022042817/55a982c71a28ab6f458b4784/html5/thumbnails/15.jpg)
![Page 16: I/O-Efficient Techniques for Computing Pagerank](https://reader034.vdocuments.site/reader034/viewer/2022042817/55a982c71a28ab6f458b4784/html5/thumbnails/16.jpg)
Split-Accumulate Algorithm (continued)
• File O_i is a vector of 2-byte integers, storing the out-degree of each element in the source vector.
• File P_i is defined as containing all packets of rank values with a destination in block V_i, in arbitrary order.
![Page 17: I/O-Efficient Techniques for Computing Pagerank](https://reader034.vdocuments.site/reader034/viewer/2022042817/55a982c71a28ab6f458b4784/html5/thumbnails/17.jpg)
Split-Accumulate Algorithm (continued)
• For each iteration i:
– Initialize block V_i in memory.
– Accumulate phase:
• Scan P_i, whose packets have destinations in V_i, and add the rank value in each packet to the appropriate entry in V_i.
– Scan O_i and divide each rank value in V_i by its out-degree.
![Page 18: I/O-Efficient Techniques for Computing Pagerank](https://reader034.vdocuments.site/reader034/viewer/2022042817/55a982c71a28ab6f458b4784/html5/thumbnails/18.jpg)
Split-Accumulate Algorithm (continued)
– Split phase:
• Read L_i. For each record in L_i, consisting of several sources in V_i and a destination in V_j, write one packet with this destination node and the total amount of rank to be transmitted to it from these sources into output file P_j' (which will become file P_j in the next iteration).
Combining packets is simpler and more efficient. No in-memory sorting of packets is needed.
![Page 19: I/O-Efficient Techniques for Computing Pagerank](https://reader034.vdocuments.site/reader034/viewer/2022042817/55a982c71a28ab6f458b4784/html5/thumbnails/19.jpg)
Split-Accumulate Algorithm (continued)
• In a nutshell, it splits packets into different buckets by destination and then directly accumulates the rank values using a table.
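$$
\begin{aligned}
C_{split} &= \sum_{0 \le i < d} \left( |O_i| + |L_i| + 2 \cdot |P_i| \right) \\
&= 0.5 \cdot |V| + (1+\epsilon') \cdot |L| + 2 \cdot |P| \\
&= (1+\epsilon) \cdot |L| + 2 \cdot |P|
\end{aligned}
$$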
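The accumulate and split phases can be sketched in memory as follows. This is an illustration under assumptions rather than the authors' code: the files L_i, O_i and P_i are modeled as Python lists, blocks are contiguous ranges of node ids, a record in L_i is a hypothetical `(destination, [sources in V_i])` pair, and the random-jump damping `alpha` is applied during accumulation, which the slides do not specify.

```python
def build_files(edges, n, d):
    """Build per-block link files L[i] and out-degree files O[i]."""
    block_size = -(-n // d)   # ceiling division
    out_degree = [0] * n
    grouped = [dict() for _ in range(d)]   # per source block: dest -> sources
    for src, dst in edges:
        out_degree[src] += 1
        grouped[src // block_size].setdefault(dst, []).append(src)
    L = [sorted(g.items()) for g in grouped]
    O = [out_degree[i * block_size:(i + 1) * block_size] for i in range(d)]
    return L, O, block_size


def split(shares, lo, L_i, block_size, P_next):
    """Split phase: one combined packet per record, routed by destination block."""
    for dst, sources in L_i:
        amount = sum(shares[s - lo] for s in sources)
        P_next[dst // block_size].append((dst, amount))


def initial_packets(initial, L, O, block_size):
    """Seed the packet files from an initial rank vector."""
    P = [[] for _ in L]
    for i, O_i in enumerate(O):
        lo = i * block_size
        shares = [initial[lo + k] / deg if deg else 0.0 for k, deg in enumerate(O_i)]
        split(shares, lo, L[i], block_size, P)
    return P


def split_accumulate_iteration(P, L, O, n, block_size, alpha=0.85, total_rank=1.0):
    P_next = [[] for _ in L]
    ranks = []
    for i in range(len(L)):
        lo = i * block_size
        block = [(1.0 - alpha) * total_rank / n] * len(O[i])   # initialize V_i
        for dst, amount in P[i]:                               # accumulate phase
            block[dst - lo] += alpha * amount
        ranks.extend(block)
        # Scan O_i: divide each rank by its out-degree before splitting.
        shares = [r / deg if deg else 0.0 for r, deg in zip(block, O[i])]
        split(shares, lo, L[i], block_size, P_next)
    return ranks, P_next


edges = [(0, 1), (0, 2), (1, 2), (2, 0), (3, 0), (3, 2)]
L, O, block_size = build_files(edges, n=4, d=2)
P = initial_packets([0.25] * 4, L, O, block_size)
ranks, P = split_accumulate_iteration(P, L, O, n=4, block_size=block_size)
print(ranks)
```

Packets are appended to the bucket of their destination block in arbitrary order and added straight into the in-memory block, so no sorting step appears anywhere in the sketch.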
![Page 20: I/O-Efficient Techniques for Computing Pagerank](https://reader034.vdocuments.site/reader034/viewer/2022042817/55a982c71a28ab6f458b4784/html5/thumbnails/20.jpg)
Experimental Setup
• Sun Blade 100 (500 MHz UltraSPARC IIe) running Solaris 8 with a 100 GB, 7200 RPM hard disk.
• Various physical memory configurations: 128M, 256M, 512M, 1G, 2G
• Simulated 32M and 64M settings on the machine with 128M of memory.
![Page 21: I/O-Efficient Techniques for Computing Pagerank](https://reader034.vdocuments.site/reader034/viewer/2022042817/55a982c71a28ab6f458b4784/html5/thumbnails/21.jpg)
Results for Real Data
• 120 M web pages crawled
• 327 M URLs and 1.33 Billion links parsed out.
• After pruning:
– 44.8 M nodes
– 653 M edges
– 15.3 edges/node
![Page 22: I/O-Efficient Techniques for Computing Pagerank](https://reader034.vdocuments.site/reader034/viewer/2022042817/55a982c71a28ab6f458b4784/html5/thumbnails/22.jpg)
Results for Real Data (continued)
• No pruning.
• Add back edges for nodes that have out-degree 0.
• 327 M nodes
• 1.96 billion edges
![Page 23: I/O-Efficient Techniques for Computing Pagerank](https://reader034.vdocuments.site/reader034/viewer/2022042817/55a982c71a28ab6f458b4784/html5/thumbnails/23.jpg)
Results for Scaled Data
![Page 24: I/O-Efficient Techniques for Computing Pagerank](https://reader034.vdocuments.site/reader034/viewer/2022042817/55a982c71a28ab6f458b4784/html5/thumbnails/24.jpg)
Results for Topic-Sensitive PR
[Figure: bar chart comparing Naive, Haveliwala's, and Split-Accumulate for 10 topics (512M), 20 topics (512M), 10 topics (256M), and 20 topics (256M); vertical axis from 0 to 4000]
![Page 25: I/O-Efficient Techniques for Computing Pagerank](https://reader034.vdocuments.site/reader034/viewer/2022042817/55a982c71a28ab6f458b4784/html5/thumbnails/25.jpg)
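Page Rank
• Basic:
$$r(p) = \sum_{q \to p} \frac{r(q)}{d(q)}$$
• Random Jump:
$$r^{(i)}(p) = (1-\alpha) \cdot \frac{R^{(0)}}{n} + \alpha \cdot \sum_{q \to p} \frac{r^{(i-1)}(q)}{d(q)}$$
• Topic-Sensitive:
$$r^{(i)}(p) = \begin{cases} (1-\alpha) \cdot \dfrac{R^{(0)}}{n} + \alpha \cdot \displaystyle\sum_{q \to p} \dfrac{r^{(i-1)}(q)}{d(q)} & \text{if } p \text{ is special} \\[2ex] \alpha \cdot \displaystyle\sum_{q \to p} \dfrac{r^{(i-1)}(q)}{d(q)} & \text{otherwise} \end{cases}$$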
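One update step covering the three variants can be sketched as below. The function and parameter names are hypothetical. As a labeled assumption, the jump term is divided by the number of pages that can receive a jump (all pages for the plain random jump, only the special pages for the topic-sensitive case) so that the total rank stays constant; the slides do not spell out what n means in the topic-sensitive formula.

```python
def pagerank_step(ranks, in_links, out_degree, alpha=0.85, total_rank=1.0, special=None):
    """Compute r^(i) from r^(i-1).

    alpha=1.0 gives the basic formula, special=None the random-jump formula,
    and a set of page ids in `special` the topic-sensitive formula.
    """
    n = len(ranks)
    targets = set(range(n)) if special is None else set(special)
    # Assumption: the jump mass is shared among the pages that can be jumped to.
    jump = (1.0 - alpha) * total_rank / len(targets)
    new_ranks = []
    for p in range(n):
        linked = sum(ranks[q] / out_degree[q] for q in in_links[p])
        new_ranks.append((jump if p in targets else 0.0) + alpha * linked)
    return new_ranks


# Hypothetical 4-node graph: 0 -> 1,2   1 -> 2   2 -> 0   3 -> 0,2
in_links = [[2, 3], [0], [0, 1, 3], []]
out_degree = [2, 1, 1, 2]
start = [0.25] * 4

random_jump = pagerank_step(start, in_links, out_degree)
topic = pagerank_step(start, in_links, out_degree, special={0})
print(random_jump, topic)
```

In the topic-sensitive call only page 0 receives jump mass, so page 3, which has no in-links, drops to zero after one step.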