Scalable Parallel Primitives for Massive Graph Computation

Aydın Buluç, University of California, Santa Barbara


TRANSCRIPT

Page 1


Scalable Parallel Primitives for Massive Graph Computation

Aydın Buluç, University of California, Santa Barbara

Page 2

Sources of Massive Graphs

(WWW snapshot, courtesy Y. Hyun) (Yeast protein interaction network, courtesy H. Jeong)

Graphs naturally arise from the internet and from social interactions.

Many scientific datasets (biological, chemical, cosmological, ecological, etc.) are modeled as graphs.

Page 3

Types of Graph Computations

• Tightly coupled. Tool: graph traversal.
  Examples: centrality, shortest paths, network flows, strongly connected components.
• Filtering based. Tool: map/reduce.
  Examples: loop and multi-edge removal, triangle/rectangle enumeration.
• Fuzzy intersection.
  Examples: clustering, algebraic multigrid.

[Figure: an example graph on vertices 1-7, and a map/reduce dataflow of three map tasks feeding two reduce tasks]

Page 4

Tightly Coupled Computations

• Many graph mining algorithms are computationally intensive (e.g., graph clustering, centrality).

• Some computations are inherently latency-bound (e.g., finding shortest paths).

• Interesting graphs are sparse, typically |edges| = O(|vertices|)


Huge graphs + expensive kernels ⇒ high performance and massive parallelism needed
Sparse graphs/data ⇒ sparse data structures (matrices)
⇒ Tightly coupled computations on sparse graphs

Pages 5-8

Software for Graph Computation

"...my main conclusion after spending ten years of my life on the TeX project is that software is hard. It's harder than anything else I've ever had to do." (Donald Knuth)

Dealing with software is hard!

High performance computing (HPC) software is harder!

How do we deal with parallel HPC software?

Page 9

Outline

• The Case for Primitives
• The Case for Sparse Matrices

• Parallel Sparse Matrix-Matrix Multiplication

• Software Design of the Combinatorial BLAS

• An Application in Social Network Analysis

• Other Work

• Future Directions


Page 10

All-Pairs Shortest Paths

• Input: directed graph with "costs" on edges
• Find least-cost paths between all reachable vertex pairs
• Classical algorithm: Floyd-Warshall
• Case study of an implementation on a multicore architecture: the graphics processing unit (GPU)

for k = 1:n   // the induction sequence
  for i = 1:n
    for j = 1:n
      if (w(i→k) + w(k→j) < w(i→j))
        w(i→j) := w(i→k) + w(k→j)

[Figure: 5×5 cost matrix illustrating the k = 1 case of the induction]
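For reference, a minimal serial C++ rendering of the recurrence above; this is a sketch in which the dense cost-matrix representation (with infinity for missing edges) is an assumption for illustration, not the GPU implementation discussed next.

    #include <vector>
    #include <limits>
    #include <algorithm>
    #include <cstddef>

    // Serial Floyd-Warshall on a dense n-by-n cost matrix.
    // w[i][j] holds the cost of edge i->j, or infinity if absent; w[i][i] = 0.
    void floyd_warshall(std::vector<std::vector<double>>& w) {
        const std::size_t n = w.size();
        for (std::size_t k = 0; k < n; ++k)        // the induction sequence
            for (std::size_t i = 0; i < n; ++i)
                for (std::size_t j = 0; j < n; ++j)
                    w[i][j] = std::min(w[i][j], w[i][k] + w[k][j]);
    }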

Page 11

GPU characteristics

• Powerful: two NVIDIA 8800s > 1 TFLOPS
• Inexpensive: $500 each

But:

• Difficult programming model: one instruction stream drives 8 arithmetic units
• Performance is counterintuitive and fragile: the memory access pattern has subtle effects on cost
• Extremely easy to underutilize the device: doing it wrong easily costs 100x in time

[Figure: SIMD diagram of threads t1-t16 driven by one instruction stream]

Page 12

Recursive All-Pairs Shortest Paths

[Figure: matrix partitioned into four blocks: A, B on top; C, D on the bottom]

A = A*;            % recursive call
B = AB;  C = CA;
D = D + CB;
D = D*;            % recursive call
B = BD;  C = DC;
A = A + BC;

+ is "min", × is "add"

Based on the R-Kleene algorithm.

Well suited to the GPU architecture:
• Fast matrix-multiply kernel
• In-place computation ⇒ low memory bandwidth
• Few, large MatMul calls ⇒ low GPU dispatch overhead
• Recursion stack on the host CPU, not on the GPU
• Careful tuning of the GPU code
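Below is a compact serial sketch of the R-Kleene recursion over the (min, +) semiring, following the update order above. The dense row-major storage, the Block view, and the power-of-two dimension are illustrative assumptions. The min-accumulating, in-place updates are safe for shortest paths because every value written is the length of some actual path.

    #include <vector>
    #include <algorithm>
    #include <cstddef>

    // View of a square sub-block of a dense row-major matrix.
    struct Block {
        double* data; std::size_t ld, r, c, n;   // leading dim, offsets, block size
        double& at(std::size_t i, std::size_t j) const {
            return data[(r + i) * ld + c + j];
        }
    };

    // C = C min (A x B) over the (min, +) semiring: + is "min", x is "add".
    void semiring_mm(Block A, Block B, Block C) {
        for (std::size_t i = 0; i < C.n; ++i)
            for (std::size_t k = 0; k < A.n; ++k)
                for (std::size_t j = 0; j < C.n; ++j)
                    C.at(i, j) = std::min(C.at(i, j), A.at(i, k) + B.at(k, j));
    }

    // In-place Kleene closure (APSP). Assumes M.n is a power of two and the
    // diagonal is zero, so min-accumulation subsumes plain assignment.
    void kleene(Block M) {
        if (M.n <= 1) return;
        const std::size_t h = M.n / 2;
        Block A{M.data, M.ld, M.r,     M.c,     h};
        Block B{M.data, M.ld, M.r,     M.c + h, h};
        Block C{M.data, M.ld, M.r + h, M.c,     h};
        Block D{M.data, M.ld, M.r + h, M.c + h, h};
        kleene(A);                  // A = A*
        semiring_mm(A, B, B);       // B = AB
        semiring_mm(C, A, C);       // C = CA
        semiring_mm(C, B, D);       // D = D + CB
        kleene(D);                  // D = D*
        semiring_mm(B, D, B);       // B = BD
        semiring_mm(D, C, C);       // C = DC
        semiring_mm(B, C, A);       // A = A + BC
    }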

Page 13

The Case for Primitives

APSP: experiments and observations

[Plot: APSP performance; lifting Floyd-Warshall directly to the GPU vs. the unorthodox R-Kleene algorithm. The right primitive is 480x faster.]

Conclusions:
• High performance is achievable but not simple
• Carefully chosen and optimized primitives will be key

Page 14

Outline

• The Case for Primitives

• The Case for Sparse Matrices
• Parallel Sparse Matrix-Matrix Multiplication

• Software Design of the Combinatorial BLAS

• An Application in Social Network Analysis

• Other Work

• Future Directions


Page 15


Sparse Adjacency Matrix and Graph

• Every graph is a sparse matrix, and vice versa

• Adjacency matrix: sparse array w/ nonzeros for graph edges

• Storage-efficient implementation from sparse data structures

[Figure: a directed graph on vertices 1-7 and its sparse adjacency matrix Aᵀ]

Page 16

The Case for Sparse Matrices

• Many irregular applications contain sufficient coarse-grained parallelism that can ONLY be exploited using abstractions at the proper level.


Traditional graph computations vs. graphs in the language of linear algebra:

• Data driven, unpredictable communication vs. fixed communication patterns
• Irregular and unstructured, poor locality of reference vs. operations on matrix blocks that exploit the memory hierarchy
• Fine-grained data accesses dominated by latency vs. coarse-grained, bandwidth-limited parallelism

Page 17

Identification of Primitives

Linear algebraic primitives: matrices over semirings, e.g. (×, +), (and, or), (+, min).

• Sparse matrix-matrix multiplication (SpGEMM)
• Sparse matrix-vector multiplication
• Sparse matrix indexing
• Element-wise operations

[Figure: schematics of the four primitives]
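To make the semiring idea concrete, here is a small generic sketch: a CSR sparse matrix-vector product parameterized by add and multiply function objects, so one kernel serves both ordinary arithmetic and shortest-path relaxation. The Csr type and the function names are illustrative assumptions, not the Combinatorial BLAS interface.

    #include <vector>
    #include <functional>
    #include <limits>
    #include <algorithm>
    #include <cstddef>

    // Minimal CSR matrix, for illustration only.
    struct Csr {
        std::size_t n;                      // number of rows
        std::vector<std::size_t> rowptr;    // size n + 1
        std::vector<std::size_t> colind;    // size nnz
        std::vector<double>      val;       // size nnz
    };

    // y = A x over an arbitrary semiring: 'add' and 'mul' are function
    // objects, 'zero' is the additive identity.
    template <class Add, class Mul>
    std::vector<double> spmv(const Csr& A, const std::vector<double>& x,
                             Add add, Mul mul, double zero) {
        std::vector<double> y(A.n, zero);
        for (std::size_t i = 0; i < A.n; ++i)
            for (std::size_t k = A.rowptr[i]; k < A.rowptr[i + 1]; ++k)
                y[i] = add(y[i], mul(A.val[k], x[A.colind[k]]));
        return y;
    }

    // Same kernel, two semirings:
    // (×, +): ordinary SpMV        (+, min): one shortest-path relaxation step
    // auto y1 = spmv(A, x, std::plus<>{}, std::multiplies<>{}, 0.0);
    // auto y2 = spmv(A, x, [](double a, double b){ return std::min(a, b); },
    //                std::plus<>{}, std::numeric_limits<double>::infinity());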

Page 18

Why focus on SpGEMM?

• Graph clustering (Markov, peer pressure)

• Subgraph / submatrix indexing

• Shortest path calculations

• Betweenness centrality

• Graph contraction

• Cycle detection

• Multigrid interpolation & restriction

• Colored intersection searching

• Applying constraints in finite element computations

• Context-free parsing ...

Applications of Sparse GEMM

[Figure: submatrix extraction written as a product with 0-1 selection matrices]
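As one concrete case, submatrix indexing (equivalently, extracting an induced subgraph) is itself an SpGEMM with 0-1 selection matrices; a small worked example, with R and Q chosen purely for illustration:

    % Selecting rows I = {1, 3} and columns J = {2, 3} of a 3x3 matrix A:
    \[
    R = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 0 & 1 \end{pmatrix}, \quad
    Q = \begin{pmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}, \quad
    A(I, J) = R \, A \, Q^{\mathsf{T}} .
    \]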

Page 19

Outline

• The Case for Primitives

• The Case for Sparse Matrices

• Parallel Sparse Matrix-Matrix Multiplication
• Software Design of the Combinatorial BLAS

• An Application in Social Network Analysis

• Other Work

• Future Directions


Page 20


Two Versions of Sparse GEMM

[Figure: 1D block-column distribution of A, B, C into blocks A1..A8, B1..B8, C1..C8, and a checkerboard (2D block) distribution]

1D block-column distribution: Ci = Ci + A · Bi
Checkerboard (2D block) distribution: Cij += Aik · Bkj
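A toy serial rendering of the 1D block-column formulation, assuming a column-map sparse type for brevity; in the parallel algorithm, each of the p processors owns one block column Bi and Ci and applies the needed parts of A to it.

    #include <vector>
    #include <map>
    #include <cstddef>

    // Column-major sparse matrix: cols[j] maps row index -> value (illustrative).
    using SpColumn  = std::map<std::size_t, double>;
    using SpMatCols = std::vector<SpColumn>;

    // C(:, j) += A * B(:, j) for columns j in [lo, hi):
    // one 1D block, i.e. Ci = Ci + A * Bi.
    void spgemm_block(const SpMatCols& A, const SpMatCols& B, SpMatCols& C,
                      std::size_t lo, std::size_t hi) {
        for (std::size_t j = lo; j < hi; ++j)
            for (const auto& [k, bkj] : B[j])        // nonzero B(k, j)
                for (const auto& [i, aik] : A[k])    // column k of A
                    C[j][i] += aik * bkj;            // C(i, j) += A(i, k) * B(k, j)
    }

    // The full product is the union of p independent block computations:
    //   for (b = 0; b < p; ++b)
    //       spgemm_block(A, B, C, b * w, std::min(n, (b + 1) * w));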

Page 21

Comparative Speedup of Sparse 1D & 2D

[Plots: measured comparative speedup and projected performance of the sparse 1D and 2D algorithms as N and P grow]

In practice, 2D algorithms have the potential to scale, if implemented correctly. Overlapping communication and maintaining load balance are crucial.

Page 22

Compressed Sparse Columns (CSC): A Standard Layout

• Stores entries in column-major order
• Dense collection of "sparse columns"
• Uses O(n + nnz) storage for an n×n matrix with nnz nonzeros

[Figure: example matrix with its CSC column-pointer, row-index, and data arrays]
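A minimal sketch of the layout (the names are illustrative): the column-pointer array has n + 1 entries while the row-index and value arrays have nnz entries each, which is exactly where the O(n + nnz) bound comes from.

    #include <vector>
    #include <cstddef>

    struct CscMatrix {
        std::size_t n;                     // matrix dimension
        std::vector<std::size_t> colptr;   // size n + 1; colptr[j]..colptr[j+1]
                                           // delimit column j
        std::vector<std::size_t> rowind;   // size nnz; row index of each entry
        std::vector<double>      val;      // size nnz; value of each entry
    };

    // Scan column j: its nonzeros are A(rowind[k], j) = val[k].
    double column_sum(const CscMatrix& A, std::size_t j) {
        double s = 0.0;
        for (std::size_t k = A.colptr[j]; k < A.colptr[j + 1]; ++k)
            s += A.val[k];
        return s;
    }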

Page 23

Node Level Considerations

Submatrices in the 2D decomposition are "hypersparse" (i.e. nnz << n).

With an average of c nonzeros per column, a √p-by-√p blocking leaves each block with only c/√p nonzeros per column on average, while each block still spans n/√p columns. Total storage for CSC blocks grows to O(n√p + nnz).

• A data structure or algorithm that depends on the matrix dimension n (e.g. CSR or CSC) is asymptotically too wasteful for submatrices.

Page 24

Sequential Hypersparse Kernel

• Strictly O(nnz) data structure
• Outer-product formulation
• Work-efficient

[Figure: diagram relating n, nnz, and flops]

Standard algorithm's complexity: O(flops + nnz(B) + m + n)

New hypersparse kernel: O(flops · lg ni + nzc(A) + nzr(B)), where nzc(A) is the number of nonempty columns of A, nzr(B) the number of nonempty rows of B, and ni the number of indices at which both meet. The dimensions m and n no longer appear.

Page 25

Scaling Results for SpGEMM

• RMat × RMat product (graphs with high variance in degrees)
• Random permutations are useful for the overall computation
• Bulk-synchronous algorithms may still suffer from imbalance within the stages
• Asynchronous algorithm avoids the curse of synchronicity
• One-sided communication via RDMA (using MPI-2)
• Results obtained on TACC/Lonestar for graphs with average degree 8

Page 26

Outline

• The Case for Primitives

• The Case for Sparse Matrices

• Parallel Sparse Matrix-Matrix Multiplication

• Software Design of the Combinatorial BLAS
• An Application in Social Network Analysis

• Other Work

• Future Directions


Page 27


Software Design of the Combinatorial BLAS

Generality over the numeric type of the matrix elements, the algebraic operation performed, and the library interface, without the language abstraction penalty: C++ templates.

• Achieve mixed-precision arithmetic: type traits
• Enforce the interface and strong type checking: CRTP
• General semiring operations: function objects

template <class IT, class NT, class DER>
class SpMat;

• The abstraction penalty is not just a programming-language issue.
• In particular, view matrices as indexed data structures and stay away from single-element access (the interface should discourage it).
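A sketch of how these pieces can fit together; the names below are illustrative, not the library's actual interface. The semiring is a function object bundled into a type, and CRTP routes calls through the common base at compile time, so the abstraction adds no virtual-call overhead.

    #include <limits>

    // A semiring packaged as a type: here (min, +), with +infinity as the
    // additive identity (floating-point NT assumed for illustration).
    template <class T>
    struct MinPlusSemiring {
        static T add(T a, T b)      { return a < b ? a : b; }
        static T multiply(T a, T b) { return a + b; }
        static T zero()             { return std::numeric_limits<T>::infinity(); }
    };

    // CRTP base: IT = index type, NT = numeric type, DER = the concrete class.
    template <class IT, class NT, class DER>
    class SpMat {
    public:
        DER& derived() { return static_cast<DER&>(*this); }

        // Statically dispatched; mismatched operand types fail at compile time.
        template <class Semiring, class RHS>
        void multiply(const SpMat<IT, NT, RHS>& rhs) {
            derived().template multiply_impl<Semiring>(
                static_cast<const RHS&>(rhs));
        }
    };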

Page 28

Software Design of the Combinatorial BLAS

Extensibility of the library while maintaining compatibility and seamless upgrades:
➡ Decouple the parallel logic from the sequential part.

Sequential layer (SpSeq): Tuples, CSC, DCSC. Commonalities:
- Support the sequential API
- Composed of a number of arrays

Parallel layer: SpPar<Comm, SpSeq>, which can host any parallel logic: asynchronous, bulk synchronous, etc.
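One way to express that decoupling, again with illustrative names: the parallel class template composes a communication/parallel-logic policy with any sequential type that supports the sequential API.

    // The parallel matrix is parameterized by a communicator/parallel-logic
    // type and a sequential sparse-matrix type, so either layer can be
    // swapped independently: SpPar<AsyncComm, DCSC>, SpPar<BulkSyncComm, CSC>, ...
    template <class Comm, class SpSeq>
    class SpPar {
        Comm  comm_;    // asynchronous, bulk synchronous, etc.
        SpSeq local_;   // Tuples, CSC, DCSC: anything with the sequential API
    public:
        template <class Semiring>
        void multiply(const SpPar& rhs);   // parallel algorithm built on local_ ops
    };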

Page 29

Outline

• The Case for Primitives

• The Case for Sparse Matrices

• Parallel Sparse Matrix-Matrix Multiplication

• Software Design of the Combinatorial BLAS

• An Application in Social Network Analysis
• Other Work

• Future Directions


Page 30

Social Network Analysis: Applications and Algorithms

Betweenness centrality (BC), C_B(v): among all shortest paths, what fraction pass through the vertex of interest?

C_B(v) = Σ_{s≠v≠t} σ_st(v) / σ_st, where σ_st counts the shortest s-t paths and σ_st(v) counts those passing through v.

Computed with Brandes' algorithm.

[Figure: a typical software stack for an application enabled with the Combinatorial BLAS]

Page 31

Betweenness Centrality using Sparse GEMM

• Parallel breadth-first search is implemented with sparse matrix-matrix multiplication: each level expands the frontier X via (AᵀX) .* ¬X
• Work-efficient algorithm for BC

[Figure: a 7-vertex example graph illustrating one frontier-expansion step]
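To illustrate the algebra on a single frontier (the real code batches 512 roots into a sparse matrix X, so each level is one SpGEMM), here is a sketch of one BFS level, next = (AᵀX) .* ¬X, over the Boolean (or, and) semiring; the adjacency-list representation is assumed for brevity.

    #include <vector>
    #include <cstddef>

    // One BFS level: discover unvisited out-neighbors of the current frontier.
    // graph[v] lists the out-neighbors of v, so scattering from frontier
    // vertices applies A^T to the frontier indicator vector.
    std::vector<bool> bfs_step(
            const std::vector<std::vector<std::size_t>>& graph,
            const std::vector<bool>& frontier,
            std::vector<bool>& visited) {
        std::vector<bool> next(graph.size(), false);
        for (std::size_t v = 0; v < graph.size(); ++v)
            if (frontier[v])
                for (std::size_t u : graph[v])
                    if (!visited[u]) {        // the ".* not-X" mask
                        next[u] = true;       // Boolean "or" accumulation
                        visited[u] = true;
                    }
        return next;
    }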

Page 32

BC Performance on Distributed Memory

• TEPS: traversed edges per second
• Batch of 512 vertices at each iteration
• Code only a few lines longer than the MATLAB version

Input: RMAT graphs of scale N (2^N vertices), average degree 8.
Pure MPI-1 version; no reliance on any particular hardware.

[Plot: BC performance, TEPS score (millions, 0 to 250) vs. number of cores (25 to 484), for RMAT scales 17 through 20]

Page 33

Outline

• The Case for Primitives

• The Case for Sparse Matrices

• Parallel Sparse Matrix-Matrix Multiplication

• Software Design of the Combinatorial BLAS

• An Application in Social Network Analysis

• Other Work
• Future Directions


Page 34

SpMV on Multicore

Our parallel algorithms for y ← Ax and y′ ← Aᵀx′ using the new compressed sparse blocks (CSB) layout have
• O(√n · lg n) span and Θ(nnz) work,
• yielding Θ(nnz / (√n · lg n)) parallelism.

[Plot: SpMV rate in MFlops/sec (0 to 600) vs. number of processors (1 to 8) for our CSB algorithms, Star-P (CSR + block-row distribution), and serial naïve CSR]
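For contrast with the plot, a much-simplified sketch of block-parallel SpMV using row blocks and std::thread. This is not the CSB algorithm: a row-blocked CSR handles y ← Ax well but not y′ ← Aᵀx′, which is exactly the asymmetry the compressed sparse blocks layout removes.

    #include <vector>
    #include <thread>
    #include <algorithm>
    #include <cstddef>

    struct CsrMatrix {
        std::size_t n;
        std::vector<std::size_t> rowptr, colind;
        std::vector<double> val;
    };

    // y = A x with one thread per disjoint row block; threads never write
    // the same y[i], so no synchronization is needed.
    void spmv_rowblocks(const CsrMatrix& A, const std::vector<double>& x,
                        std::vector<double>& y, unsigned nthreads) {
        std::vector<std::thread> pool;
        const std::size_t chunk = (A.n + nthreads - 1) / nthreads;
        for (unsigned t = 0; t < nthreads; ++t)
            pool.emplace_back([&A, &x, &y, t, chunk] {
                const std::size_t lo = t * chunk;
                const std::size_t hi = std::min(A.n, lo + chunk);
                for (std::size_t i = lo; i < hi; ++i) {
                    double s = 0.0;
                    for (std::size_t k = A.rowptr[i]; k < A.rowptr[i + 1]; ++k)
                        s += A.val[k] * x[A.colind[k]];
                    y[i] = s;
                }
            });
        for (auto& th : pool) th.join();
    }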

Page 35

Outline

• The Case for Primitives

• The Case for Sparse Matrices

• Parallel Sparse Matrix-Matrix Multiplication

• Software Design of the Combinatorial BLAS

• An Application in Social Network Analysis

• Other Work

• Future Directions


Page 36

Future Directions

‣ Novel scalable algorithms

‣ Static graphs are just the beginning: dynamic graphs, hypergraphs, tensors

‣ Architectures (mainly nodes) are evolving: heterogeneous multicores, and homogeneous multicores with more cores per node

TACC Lonestar (2006): 4 cores/node
TACC Ranger (2008): 16 cores/node
SDSC Triton (2009): 32 cores/node
XYZ Resource (2020): ?

⇒ Hierarchical parallelism

Page 37


New Architectural Trends

LANL / IBM Roadrunner

NVIDIA Tesla

Intel 80-core chip

Cray XMT

A unified architecture?

Page 38


Remarks

• Graph computations are pervasive in the sciences and will become more so.

• High performance software libraries improve productivity.

• Carefully chosen and implemented primitive operations are key to performance.

• Linear algebraic primitives:
  o General enough to be widely useful
  o Compact enough to be implemented in a reasonable time

Page 39


Related Publications

• Hypersparsity in 2D decomposition, sequential kernel:
  B., Gilbert, "On the Representation and Multiplication of Hypersparse Matrices", IPDPS'08.

• Parallel analysis of sparse GEMM, synchronous implementation:
  B., Gilbert, "Challenges and Advances in Parallel Sparse Matrix-Matrix Multiplication", ICPP'08.

• The case for primitives, APSP on the GPU:
  B., Gilbert, Budak, "Solving Path Problems on the GPU", Parallel Computing, 2009.

• SpMV on multicores:
  B., Fineman, Frigo, Gilbert, Leiserson, "Parallel Sparse Matrix-Vector and Matrix-Transpose-Vector Multiplication using Compressed Sparse Blocks", SPAA'09.

• Betweenness centrality results:
  B., Gilbert, "Parallel Sparse Matrix-Matrix Multiplication and Large Scale Applications".

• Software design of the library:
  B., Gilbert, "Parallel Combinatorial BLAS: Interface and Reference Implementation".

Page 40


Acknowledgments…

David Bader, Erik Boman, Ceren Budak, Alan Edelman, Jeremy Fineman, Matteo Frigo, Bruce Hendrickson, Jeremy Kepner, Charles Leiserson, Kamesh Madduri, Steve Reinhardt, Eric Robinson, Viral Shah, Sivan Toledo