1 if you were plowing a field, which would you rather use? two oxen, or 1024 chickens? (attributed...
TRANSCRIPT
1
If you were plowing a field, which would you rather
use?Two oxen, or 1024 chickens?(Attributed to S. Cray)
2
Our ‘field’ to plow : Graph processing
|V| = 1.4B, |E| = 6.6B
Abdullah Gharaibeh, Lauro Beltrão Costa, Elizeu Santos-Neto
Matei Ripeanu
NetSysLabThe University of British Columbia
http://netsyslab.ece.ubc.ca
Graph Processing: The Challenges
Data-dependent memory access
patterns
Caches + summary data structures
Large memory footprint
>128GB
CPUs
Poor locality
Varying degrees of parallelism(both intra- and inter- stage)
Low compute-to-memory access
ratio
Graph Processing: The GPU Opportunity
Data-dependent memory access
patterns
Caches + summary data structures
Large memory footprint
>128GB
CPUs
Poor locality
Assemble a heterogeneous platform
Massive hardware multithreading
6GB!
GPUs
Low compute-to-memory access
ratio
Caches + summary data structures
Varying degrees of parallelism(both intra- and inter- stage)
6
YES WE CAN!2x speedup (8 billion edges)
Motivating Question
Can we efficiently use hybrid systems for
large-scale graph processing?
7
Performance Model Predicts speedup Intuitive
Totem A graph processing engine for hybrid
systems Applies algorithm-agnostic optimizations
Evaluation Predicated vs. achieved Hybrid vs. Symmetric
Methodology
8
Core0
Core4
Core1
Core0
.
.
Core1
Core2
Corem
Core3
System MemoryDevice Memory
Host GPUb
The Performance Model (I)
GgpuGcpu
||/|| EEcpuα = ||/|| EEboundaryβ =
)(/|| GTErcpu =
c
rSpeedup
cpu
1
Goal: Predict the speedup obtained from offloading part of the graph to the GPU (when compared to processing only on the host)
)(/ msizeofbc =
9
GgpuGcpu
The Performance Model (II)
β = 20% rcpu = 0.5 BEPS
It is beneficial to process the graph on a hybrid system
if communication overhead is low
Core0
Core4
Core1
Core0
.
.
Core1
Core2
Corem
Core3
System MemoryDevice Memory
Host GPU
||/|| EEcpuα = ||/|| EEboundaryβ =
)(/|| GTErcpu =
)(/ msizeofbc =
c
rSpeedup
cpu
1
Assume PCI-E bus, b ≈ 4 GB/secand per edge state m = 4 bytes
=> c = 1 billion EPS
x
Best reported single-node BFS performance
[Agarwal, V. 2010]
|V| = 32M, |E| = 1B
Worst case (e.g., bipartite graph)
10
Totem: Programming ModelBulk Synchronous Parallel
Rounds of computation and communication phases
Updates to remote vertices are delivered in the next round
Partitions vote to terminate execution
.
.
.
11
Totem: A BSP-based Engine
1
0
3
2
4
5
CPU GPU0
3
2
1S S
outbox
0 20 1
outbox
40
0 1 2 3 4 5
0 1 3 54 6
10 20 30 40 40E
V
01 21 21
0 1 2 3
0 1 3 3
31 40 E
V
2121
Compressed sparse row representation
Computation: kernel manipulates local state
Updates to remote vertices aggregated locally
Comm2: merge with local state
Comm1: transfer outbox buffer to remote input buffer
inbox4
inbox
0 2
12
|E| = 512 Million
Random
The Aggregation Opportunity
real-world graphs are mostly scale-free: skewed
degree distribution
sparse graph: ~5x reduction
Denser graph has better opportunity for aggregation:
~50x reduction
13
Evaluation Setup
Workload R-MAT graphs |V|=32M, |E|=1B, unless otherwise noted
Algorithms Breadth-first Search PageRank
Metrics Speedup compared to processing on the host only
Testbed Host: dual-socket Intel Xeon with 16GB GPU: Nvidia Tesla C2050 with 3GB
14
Predicted vs. Achieved Speedup
Linear speedup with respect to offloaded part
GPU partition fills GPU memory
After aggregation, β = 2%. A low value
is critical for BFS
15
Breakdown of Execution Time
PageRank is dominated by the compute phase
Aggregation significantly reduced communication
overhead
GPU is > 5x faster than the host
16
So far …
Performance modeling Simple Useful for initial system provisioning
Totem Generic graph processing framework Algorithm-agnostic optimizations
Evaluation (Graph500 scale-28) 2x speedup over a symmetric system 1.13 Billion TEPS edges on a dual-socket, dual-GPU
system
But, random partitioning! Can we do better?
Better partitioning strategies.
The search space. Handles large (billion-edge scale) graphs.
o Low space and time complexity.o Ideally, quasi-linear!
Handles well scale-free graphs. Minimizes algorithm’s execution time by
reducing computation time o (rather than communication)
17
The strategies we explore HIGH: vertices with high degree left on the
hostLOW: vertices with low degree left on the
host RAND: random
18The percentage of vertices placed on the CPU for a scale-28 RMAT graph (|V|=256m, |E|=4B)
Evaluation platform
19
Intel Nehalem Fermi GPU Xeon X5650 Tesla C2075
(2x sockets) (2x GPUs)
Core Frequency 2.67GHz 1.15GHz
Num Cores (SMs) 6 14
HW-thread/Core 2 x 48warps (x32/warp)
Last Level Cache 12MB 2MB
Main Memory 144GB 6GB
Memory Bandwidth 32GB/sec 144GB/sec
Total Power (TDP) 95W 225W
BSF performance
20BFS traversal rate for a scale-28 RMAT graph (|V|=256m, |E|=4B)
2x performance gain!
LOW: No gain over random!
Exploring the 75% data point
Host is the bottleneck in all cases !
BSF performance – more details
PageRank performance
22PageRank processing rate for a scale-28 RMAT graph (|V|=256m, |E|=4B)
LOW: Minimal gain over random!
Better packing
25% performance gain!
Small graphs(scale-25 RMAT graphs: |V|=32m, |E|=512m)
23
BFS PageRank
Intelligent partitioning provides benefits Key for performance: load balancing
Uniform graphs (not scale free)
24
BFS on scale-25 uniform graph |V|=32m, |E|=512m) BFS on scale-28
Hybrid techniques not useful for uniform graphs
Scalability Graph size: RMAT graphs: scale 25 to 29 (|V|=512m, |E|=8B)Platform size: 1,2, 4 sockets 2xsockets + 2 x GPU
25
BFS PageRank
Power Normalizing by power (TDP – thermal design power)Metric: million TEPS / watt
26
BFS PageRank
2.0 1.3 1.1 1.3 1.3
1.9
2.3
2.4 1.8 1.01.4
1.8
Conclusions
Q: Does it make sense to use a hybrid system? A: Yes! (for large scale-free graphs)
Q: Can one design a processing engine for hybrid platforms that both generic and efficient?
A: Yes.
Q: Are there near-linear complexity partitioning strategies that enable higher performance?
A: Yes, partitioning strategies based on vertex connectivity provide in all cases better performance than random.
Q: Should one search for partitioning strategies that reduce the communication overheads (and hope for higher performance)?
A: No. (for scale free graphs)
Q: Which strategies work best?A: It depends!
Large graphs: shape the load. Small graphs: load balancing.
28
If you were plowing a field, which would you rather
use?- Two oxen, or 1024 chickens?- Both!
29
code available at: netsyslab.ece.ubc.ca
Papers: • A Yoke of Oxen and a Thousand Chickens for Heavy Lifting Graph
Processing, A. Gharaibeh, L. Costa, E. Santos-Neto, M. Ripeanu, PACT 2012
• On Graphs, GPUs, and Blind Dating: A Workload to Processor Matchmaking Quest, A. Gharaibeh, L. Costa, E. Santos-Neto, M. Ripeanu, IPDPS 2013
29
30
A golf course …
… a (nudist) beach
(… and 199 days of rain each year)
Networked Systems Laboratory (NetSysLab)University of British Columbia