neugraph - usenix

31
NeuGraph: Parallel Deep Neural Network Computation on Large Graphs Lingxiao Ma , Zhi Yang , Youshan Miao , Jilong Xue , Ming Wu , Lidong Zhou , Yafei Dai Peking University Microsoft Research USENIX ATC ’19, Renton, WA, USA

Upload: others

Post on 23-Feb-2022

13 views

Category:

Documents


0 download

TRANSCRIPT

NeuGraph:Parallel Deep Neural Network Computation on Large Graphs

Lingxiao Ma †, Zhi Yang †, Youshan Miao ‡, Jilong Xue ‡, Ming Wu ‡, Lidong Zhou ‡, Yafei Dai †

† Peking University

‡ Microsoft Research

USENIX ATC ’19, Renton, WA, USA

NeuGraph: Parallel Deep Neural Network Computation on Large Graphs 2

Image Object Detection Speech Recognition

Self-Driving Personal Assistant

User-Item Graph Knowledge Graph

Recommendation Question Answering

Neural Networks

Input Feature Vector

Input: Graph Data

Graph Neural Networks

* Figures from Internet

NeuGraph: Parallel Deep Neural Network Computation on Large Graphs

Graph Neural Networks (GNN)

3

- Information propagation via Graph

- Information transformation via Neural Networks

VertexNN Transformation

EdgeNN Transformation

NeuGraph: Parallel Deep Neural Network Computation on Large Graphs

Challenges in Processing GNNs on GPU

4

large, sparse, and irregular

Scalability ProblemEfficiency Problem

small and regular

Graphs

* Figures from Internet

NeuGraph: Parallel Deep Neural Network Computation on Large Graphs 5

Dataflow Graph as

Intermediate Representation

𝛻b

Σ𝐠

𝛻a

𝛻x 𝛻y

𝛻z

+𝐠

*𝐠

x y z

*

a+

b

Σ

c

easy to express NNsefficient for grid structures

Deep Learning Systems

hard to express graph opshard to handle large graphs

Existing Systems are Insufficient

NeuGraph: Parallel Deep Neural Network Computation on Large Graphs

Existing Systems are Insufficient

6

Graph Computing Systems

Vertex Programminge.g.: GAS

Gather(Du, D(u,v), Dv):return Dv.rank / #outNbrs(v)

Sum(a, b):return a + b

Apply(Du, acc):rnew = 0.15 + 0.85 * accDu.delta = (rnew - Du.rank) / #outNbrs(u)Du.rank = rnew

Scatter(Du, D(u,v), Dv):if(|Du.delta|>ε) Activate(v)return delta

easy to program graph appsgraph-aware optimizationsscale to trillion edges

hard to express NNs (e.g., no backprop.)

insufficient NN execution (e.g., vertex-by-vertex)

NeuGraph: Parallel Deep Neural Network Computation on Large Graphs

We propose: NeuGraph- Bridge graph and dataflow models to support efficient and scalable GNN processing

7

easy to express NNsefficient for grid structures

easy to program graph appsgraph-aware optimizationsscale to trillion edges

Graph SystemsDeep Learning Systems

NeuGraph

NeuGraph: Parallel Deep Neural Network Computation on Large Graphs

NeuGraph

- Bridge graph and dataflow models to support efficient and scalable GNN processing

- Key techniques- SAGA-NN model for graph-based neural networks

-> programming GNN apps

- Chunk-based dataflow graph translation & streaming processing out of GPU core-> processing graphs larger than GPU memory

- Highly-optimized graph propagation operators

- Chain-based parallel streaming-> efficient multi-GPU parallel execution

- Performance- Outperform state-of-the-art frameworks (e.g., TensorFlow and DGL) on small graphs

- Scale to large real-world graphs with GPUs

8

NeuGraph: Parallel Deep Neural Network Computation on Large Graphs

SAGA-NN Abstraction

9

V0 V1 V2

v2

edge

v1

edge

v0

Scatter

vertex feature

edge feature

edge output

accumulated

vertex output

Neural

Network

NeuGraph: Parallel Deep Neural Network Computation on Large Graphs

SAGA-NN Abstraction

10

V0 V1 V2

EdgeNNTransformation

v2

edge

v1

edge

v0

Scatter

ApplyEdge

vertex feature

edge feature

edge output

accumulated

vertex output

Neural

Network

NeuGraph: Parallel Deep Neural Network Computation on Large Graphs

SAGA-NN Abstraction

11

V0 V1 V2

EdgeNNTransformation

v2

edge

v1

edge

v0

accum.

Scatter

ApplyEdge

Gather

vertex feature

edge feature

edge output

accumulated

vertex output

Neural

Network

NeuGraph: Parallel Deep Neural Network Computation on Large Graphs

SAGA-NN Abstraction

12

v2

edge

v1

edge

v0

accum.

Scatter

ApplyEdge

Gather

ApplyVertex

V0 V1 V2

EdgeNNTransformation

VertexNNTransformation

vertex feature

edge feature

edge output

accumulated

vertex output

Neural

Network

NeuGraph: Parallel Deep Neural Network Computation on Large Graphs

SAGA-NN Abstraction

13

v2

edge

v1

edge

v0

accum.

Scatter

ApplyEdge

Gather

ApplyVertex

V0 V1 V2

EdgeNNTransformation

VertexNNTransformation

vertex feature

edge feature

edge output

accumulated

vertex output

Neural

Network

NeuGraph: Parallel Deep Neural Network Computation on Large Graphs

SAGA-NN Abstraction

14

v2

edge

v1

edge

v0

vertex feature

edge feature

edge output

accumulated

vertex output

Neural

Network

accum.

Scatter

ApplyEdge

Gather

ApplyVertex

Graph Operations• transparent to users

Neural Network Operations• define NN computation

with dataflow

NeuGraph: Parallel Deep Neural Network Computation on Large Graphs

Example: Graph Convolutional Network (GCN)

15

ScatterApplyEdge

Gather

ApplyVertex

NeuGraph: Parallel Deep Neural Network Computation on Large Graphs

Scaling beyond GPU memory limit

16

10

3 2Graph

0 1 2 1

1 3 3 0

1 2

Input Vertex Feature Edges

Output Vertex Feature

NeuGraph: Parallel Deep Neural Network Computation on Large Graphs

Edge Chunk E0,0

Chunk V1

Input Vertex Feature Chunk V0

Chunk E1,0

Output Vertex Feature Chunk V0

Scaling beyond GPU memory limit

17

10

3 2Graph

E0,0

E0,1

E1,0

E1,1

3 0 0 1

1 32 1

1 2

- 2D graph partitioning

3 0

NeuGraph: Parallel Deep Neural Network Computation on Large Graphs

Edge Chunk E0,0

Chunk V1

Input Vertex Feature Chunk V0

Chunk E1,0

Output Vertex Feature Chunk V0

Scaling beyond GPU memory limit

18

10

3 2Graph

E0,0

E0,1

E1,0

E1,1

3 0 0 1

1 32 1

1 2

- 2D graph partitioning

3 0

column-by-columnchunk scheduling

Accumulate in GPU memory

NeuGraph: Parallel Deep Neural Network Computation on Large Graphs

Chunk-based Dataflow Translation: GCN

19

mul

Gather

ApplyEdgeV0

E0,0

V0 src

dst

Scatter

A0

mul

W

E0,0V0

SAG

A0

weights params.

Accum.

V1

SAG

E1,0

V0 matmul ReLU

ApplyVertex

add

B

E1,0 E1,1

E0,0 E0,1V0

V1

V0 V

NeuGraph: Parallel Deep Neural Network Computation on Large Graphs

E1,0 E1,1

E0,0 E0,1V0

V1

V0 V

E1,2 E1,3

E0,2 E0,3

V0 V

- Naïve solution:- Approach: each GPU takes a “column” of edge chunks

- Problem: redundant data transfer through shared upper links

Scaling to multi-GPU

20

PCIe Switch

GPU

0

GPU

1

PCIe Switch

GPU

2

GPU

3

x16 x16 x16 x16

x16 x16

CPU /

DRAM

PCIe Host

Bridge

GPU0

GPU1

GPU2

GPU3

V0 V1

V0 V0

bandwidthbottleneck E1,0 E1,1

E0,0 E0,1V0

V1

V0 V

E1,2 E1,3

E0,2 E0,3

V V

NeuGraph: Parallel Deep Neural Network Computation on Large Graphs

E1,0 E1,1

E0,0 E0,1V0

V1

V0 V

E1,2 E1,3

E0,2 E0,3

V0 V

- Observation: link to neighbor GPU is empty

- Approach: directly load from neighbor GPU

Scaling to multi-GPU

21

PCIe Switch

GPU

0

GPU

1

PCIe Switch

GPU

2

GPU

3

x16 x16 x16 x16

x16 x16

CPU /

DRAM

PCIe Host

Bridge

GPU0

GPU1

GPU2

GPU3

V0 V1

V0 V0

GPU-to-GPU

E1,0 E1,1

E0,0 E0,1V0

V1

V0 V

E1,2 E1,3

E0,2 E0,3

V V

NeuGraph: Parallel Deep Neural Network Computation on Large Graphs

E1,0 E1,1

E0,0 E0,1V0

V1

V0 V

E1,2 E1,3

E0,2 E0,3

V0 V

- Observation: link to neighbor GPU is empty

- Approach: directly load from neighbor GPU

Scaling to multi-GPU

22

PCIe Switch

GPU

0

GPU

1

PCIe Switch

GPU

2

GPU

3

x16 x16 x16 x16

x16 x16

CPU /

DRAM

PCIe Host

Bridge

GPU0

GPU1

GPU2

GPU3

V0 V1

Chain

E1,0 E1,1

E0,0 E0,1V0

V1

V0 V

E1,2 E1,3

E0,2 E0,3

V V

NeuGraph: Parallel Deep Neural Network Computation on Large Graphs

Chain-based Parallel Streaming

- Each GPU can consume all the chunks in chain for a PCIe switch

23

E1,0 E1,1

E0,0 E0,1V0

V1

V0 V

E1,2 E1,3

E0,2 E0,3

V0 V

GPU0

GPU1

GPU2

GPU3

Step

1

2

3

4

0 0empty empty

1 0 1 00 0

2 1 1 0 2 1 1 0

3 2 2 1 3 2 2 1

Load

Load

#Computing

Chunk

Delete

Data

Load

Data#Receiving

ChunkTransfer

Load

Load

Delete

Delete

V0 V1 V2 V3V0 V1 V2 V3

E1,0 E1,1

E0,0 E0,1V0

V1

V0 V

E1,2 E1,3

E0,2 E0,3

V V

NeuGraph: Parallel Deep Neural Network Computation on Large Graphs

High performance graph propagation kernels

- Traditional multithread parallelization- PageRank, SSSP, CC, etc: one-dimension scalar value on vertices

- A thread takes a vertex/edge

V1V0 edge

24

Thre

ad 0

Thre

ad 1

diff ops. on diff verticesinefficient in GPU(SIMD)

NeuGraph: Parallel Deep Neural Network Computation on Large Graphs

High performance graph propagation kernels

- Observation- High-dimension feature

- More-regular operations along feature-dimension

- Solution: exploit feature-dimension parallelization in GNNs

Thread 0

Thread 1

25

same ops. efficient in GPU(SIMD)

V1V0 edge

Thread 2

Thread 3

NeuGraph: Parallel Deep Neural Network Computation on Large Graphs

Experiment Setup- Typical GNN Applications

- Communication neural network (CommNet)

// no computation in ApplyEdge

- Graph convolutional networks (GCN)

// element-wise Mul in ApplyEdge

- Gated graph neural networks (GG-NN)

// GRU in ApplyVertex

- Testbed

- 2x E5-2690-v4 (14-core with HT)

- 512GB Quad-Channel RAM

- 8x NVIDIA Tesla P100 GPU

- Ubuntu 16.04

- CUDA 9.0 + cuDNN 7

26

dataset vertex edge feature label degree type

blog 10.3K 668.0K 128 39 65 social

pubmed 19.7K 108.4K 500 3 5 cite

reddit_small 58.2K 1.4M 300 41 25 social

reddit_full 2.4M 705.9M 300 50 292 social

enwiki 3.6M 276.1M 300 12 77 knowledge

amazon 8.6M 231.6M 96 22 27 review

- Comparison

- TF: TensorFlow v1.7 (CPU/GPU)

- DGL: Deep Graph Library v0.1.3 with PyTorch v1.0 backend

- TF-SAGA: SAGA-NN with TensorFlow v1.7 backend

- NG: NeuGraph on top of TensorFlow v1.7

NeuGraph: Parallel Deep Neural Network Computation on Large Graphs

Performance on a Single GPU

Compare with TF, DGL on small graphs

up to 5x speedup vs TF

27

higher density

Avg. epoch time; #layer=2, hidden_size=512

up to 19x speedup vs DGL

NeuGraph: Parallel Deep Neural Network Computation on Large Graphs

Scaling-up on a Single GPU

- Compare with TF(CPU), TF-SAGA(CPU), TF-SAGA on large graphs- TF and DGL run out-of-memory (OOM)

28

higher density

Avg. epoch time; #layer=2, hidden_size=16 (amazon), 32 (enwiki), 64 (reddit-full)

up to 47x speedup vs TF(CPU)

up to 5x speedup vs TF-SAGA

NeuGraph: Parallel Deep Neural Network Computation on Large Graphs

Scaling-out on Multiple GPUs

- GCN on three large graphs on different number of GPUs

29

higher density

Avg. speedup over TF-SAGA-1GPU; #layer=2, hidden_size=16 (amazon), 32 (enwiki), 64 (reddit-full)

NeuGraph: Parallel Deep Neural Network Computation on Large Graphs

Conclusion

NeuGraph: advocate unifying deep learning and graph computing for graph neural networks

- Define a new, flexible SAGA-NN model to express GNNs

- Fuse graph-related optimizations into deep learning frameworks

- Demonstrate NeuGraph achieves efficient and scalable GNN processing

30

Thank you!Q&A

For more information, please visit: https://aka.ms/neugraph

NeuGraph