1 if you were plowing a field, which would you rather use? two oxen, or 1024 chickens? (attributed...

1

If you were plowing a field, which would you rather

use?Two oxen, or 1024 chickens?(Attributed to S. Cray)

2

Our ‘field’ to plow : Graph processing

|V| = 1.4B, |E| = 6.6B

Abdullah Gharaibeh, Lauro Beltrão Costa, Elizeu Santos-Neto

Matei Ripeanu

NetSysLabThe University of British Columbia

http://netsyslab.ece.ubc.ca

http://netsyslab.ece.ubc.ca/

Graph Processing: The Challenges

Data-dependent memory access

patterns

Caches + summary data structures

Large memory footprint

>128GB

CPUs

Poor locality

Varying degrees of parallelism(both intra- and inter- stage)

Low compute-to-memory access

ratio

Graph Processing: The GPU Opportunity

Data-dependent memory access

patterns


Large memory footprint

>128GB

CPUs

Poor locality

Assemble a heterogeneous platform

Massive hardware multithreading

6GB!

GPUs

Low compute-to-memory access

ratio


Varying degrees of parallelism(both intra- and inter- stage)

6

YES WE CAN!2x speedup (8 billion edges)

Motivating Question

Can we efficiently use hybrid systems for

large-scale graph processing?

7

Performance Model Predicts speedup Intuitive

Totem A graph processing engine for hybrid

systems Applies algorithm-agnostic optimizations

Evaluation Predicated vs. achieved Hybrid vs. Symmetric

Methodology

8

Core0

Core4

Core1

Core0

.

.

Core1

Core2

Corem

Core3

System MemoryDevice Memory

Host GPUb

The Performance Model (I)

GgpuGcpu

||/|| EEcpuα = ||/|| EEboundaryβ =

)(/|| GTErcpu =

c

rSpeedup

cpu

1

Goal: Predict the speedup obtained from offloading part of the graph to the GPU (when compared to processing only on the host)

)(/ msizeofbc =

9

GgpuGcpu

The Performance Model (II)

β = 20% rcpu = 0.5 BEPS

It is beneficial to process the graph on a hybrid system

if communication overhead is low

Core0

Core4

Core1

Core0

.

.

Core1

Core2

Corem

Core3

System MemoryDevice Memory

Host GPU

||/|| EEcpuα = ||/|| EEboundaryβ =

)(/|| GTErcpu =

)(/ msizeofbc =

c

rSpeedup

cpu

1

Assume PCI-E bus, b ≈ 4 GB/secand per edge state m = 4 bytes

=> c = 1 billion EPS

x

Best reported single-node BFS performance

[Agarwal, V. 2010]

|V| = 32M, |E| = 1B

Worst case (e.g., bipartite graph)

10

Totem: Programming ModelBulk Synchronous Parallel

Rounds of computation and communication phases

Updates to remote vertices are delivered in the next round

Partitions vote to terminate execution

.

.

.

11

Totem: A BSP-based Engine

1

0

3

2

4

5

CPU GPU0

3

2

1S S

outbox

0 20 1

outbox

40

0 1 2 3 4 5

0 1 3 54 6

10 20 30 40 40E

V

01 21 21

0 1 2 3

0 1 3 3

31 40 E

V

2121

Compressed sparse row representation

Computation: kernel manipulates local state

Updates to remote vertices aggregated locally

Comm2: merge with local state

Comm1: transfer outbox buffer to remote input buffer

inbox4

inbox

0 2

12

|E| = 512 Million

Random

The Aggregation Opportunity

real-world graphs are mostly scale-free: skewed

degree distribution

sparse graph: ~5x reduction

Denser graph has better opportunity for aggregation:

~50x reduction

13

Evaluation Setup

Workload R-MAT graphs |V|=32M, |E|=1B, unless otherwise noted

Algorithms Breadth-first Search PageRank

Metrics Speedup compared to processing on the host only

Testbed Host: dual-socket Intel Xeon with 16GB GPU: Nvidia Tesla C2050 with 3GB

14

Predicted vs. Achieved Speedup

Linear speedup with respect to offloaded part

GPU partition fills GPU memory

After aggregation, β = 2%. A low value

is critical for BFS

15

Breakdown of Execution Time

PageRank is dominated by the compute phase

Aggregation significantly reduced communication

overhead

GPU is > 5x faster than the host

16

So far …

Performance modeling Simple Useful for initial system provisioning

Totem Generic graph processing framework Algorithm-agnostic optimizations

Evaluation (Graph500 scale-28) 2x speedup over a symmetric system 1.13 Billion TEPS edges on a dual-socket, dual-GPU

system

But, random partitioning! Can we do better?

Better partitioning strategies.

The search space. Handles large (billion-edge scale) graphs.

o Low space and time complexity.o Ideally, quasi-linear!

Handles well scale-free graphs. Minimizes algorithm’s execution time by

reducing computation time o (rather than communication)

17

The strategies we explore HIGH: vertices with high degree left on the

hostLOW: vertices with low degree left on the

host RAND: random

18The percentage of vertices placed on the CPU for a scale-28 RMAT graph (|V|=256m, |E|=4B)

Evaluation platform

19

Intel Nehalem Fermi GPU Xeon X5650 Tesla C2075

(2x sockets) (2x GPUs)

Core Frequency 2.67GHz 1.15GHz

Num Cores (SMs) 6 14

HW-thread/Core 2 x 48warps (x32/warp)

Last Level Cache 12MB 2MB

Main Memory 144GB 6GB

Memory Bandwidth 32GB/sec 144GB/sec

Total Power (TDP) 95W 225W

BSF performance

20BFS traversal rate for a scale-28 RMAT graph (|V|=256m, |E|=4B)

2x performance gain!

LOW: No gain over random!

Exploring the 75% data point

Host is the bottleneck in all cases !

BSF performance – more details

PageRank performance

22PageRank processing rate for a scale-28 RMAT graph (|V|=256m, |E|=4B)

LOW: Minimal gain over random!

Better packing

25% performance gain!

Small graphs(scale-25 RMAT graphs: |V|=32m, |E|=512m)

23

BFS PageRank

Intelligent partitioning provides benefits Key for performance: load balancing

Uniform graphs (not scale free)

24

BFS on scale-25 uniform graph |V|=32m, |E|=512m) BFS on scale-28

Hybrid techniques not useful for uniform graphs

Scalability Graph size: RMAT graphs: scale 25 to 29 (|V|=512m, |E|=8B)Platform size: 1,2, 4 sockets 2xsockets + 2 x GPU

25

BFS PageRank

Power Normalizing by power (TDP – thermal design power)Metric: million TEPS / watt

26

BFS PageRank

2.0 1.3 1.1 1.3 1.3

1.9

2.3

2.4 1.8 1.01.4

1.8

Conclusions

Q: Does it make sense to use a hybrid system? A: Yes! (for large scale-free graphs)

Q: Can one design a processing engine for hybrid platforms that both generic and efficient?

A: Yes.

Q: Are there near-linear complexity partitioning strategies that enable higher performance?

A: Yes, partitioning strategies based on vertex connectivity provide in all cases better performance than random.

Q: Should one search for partitioning strategies that reduce the communication overheads (and hope for higher performance)?

A: No. (for scale free graphs)

Q: Which strategies work best?A: It depends!

Large graphs: shape the load. Small graphs: load balancing.

28

If you were plowing a field, which would you rather

use?- Two oxen, or 1024 chickens?- Both!

29

code available at: netsyslab.ece.ubc.ca

Papers: • A Yoke of Oxen and a Thousand Chickens for Heavy Lifting Graph

Processing, A. Gharaibeh, L. Costa, E. Santos-Neto, M. Ripeanu, PACT 2012

• On Graphs, GPUs, and Blind Dating: A Workload to Processor Matchmaking Quest, A. Gharaibeh, L. Costa, E. Santos-Neto, M. Ripeanu, IPDPS 2013

29

30

A golf course …

… a (nudist) beach

(… and 199 days of rain each year)

Networked Systems Laboratory (NetSysLab)University of British Columbia

1 if you were plowing a field, which would you rather use? two oxen, or 1024 chickens? (attributed...

Documents