neugraph - usenix
TRANSCRIPT
NeuGraph:Parallel Deep Neural Network Computation on Large Graphs
Lingxiao Ma †, Zhi Yang †, Youshan Miao ‡, Jilong Xue ‡, Ming Wu ‡, Lidong Zhou ‡, Yafei Dai †
† Peking University
‡ Microsoft Research
USENIX ATC ’19, Renton, WA, USA
NeuGraph: Parallel Deep Neural Network Computation on Large Graphs 2
Image Object Detection Speech Recognition
Self-Driving Personal Assistant
User-Item Graph Knowledge Graph
Recommendation Question Answering
Neural Networks
Input Feature Vector
Input: Graph Data
Graph Neural Networks
* Figures from Internet
NeuGraph: Parallel Deep Neural Network Computation on Large Graphs
Graph Neural Networks (GNN)
3
- Information propagation via Graph
- Information transformation via Neural Networks
VertexNN Transformation
EdgeNN Transformation
NeuGraph: Parallel Deep Neural Network Computation on Large Graphs
Challenges in Processing GNNs on GPU
4
large, sparse, and irregular
Scalability ProblemEfficiency Problem
small and regular
Graphs
* Figures from Internet
NeuGraph: Parallel Deep Neural Network Computation on Large Graphs 5
Dataflow Graph as
Intermediate Representation
𝛻b
Σ𝐠
𝛻a
𝛻x 𝛻y
𝛻z
+𝐠
*𝐠
x y z
*
a+
b
Σ
c
easy to express NNsefficient for grid structures
Deep Learning Systems
hard to express graph opshard to handle large graphs
Existing Systems are Insufficient
NeuGraph: Parallel Deep Neural Network Computation on Large Graphs
Existing Systems are Insufficient
6
Graph Computing Systems
Vertex Programminge.g.: GAS
Gather(Du, D(u,v), Dv):return Dv.rank / #outNbrs(v)
Sum(a, b):return a + b
Apply(Du, acc):rnew = 0.15 + 0.85 * accDu.delta = (rnew - Du.rank) / #outNbrs(u)Du.rank = rnew
Scatter(Du, D(u,v), Dv):if(|Du.delta|>ε) Activate(v)return delta
easy to program graph appsgraph-aware optimizationsscale to trillion edges
hard to express NNs (e.g., no backprop.)
insufficient NN execution (e.g., vertex-by-vertex)
NeuGraph: Parallel Deep Neural Network Computation on Large Graphs
We propose: NeuGraph- Bridge graph and dataflow models to support efficient and scalable GNN processing
7
easy to express NNsefficient for grid structures
easy to program graph appsgraph-aware optimizationsscale to trillion edges
Graph SystemsDeep Learning Systems
NeuGraph
NeuGraph: Parallel Deep Neural Network Computation on Large Graphs
NeuGraph
- Bridge graph and dataflow models to support efficient and scalable GNN processing
- Key techniques- SAGA-NN model for graph-based neural networks
-> programming GNN apps
- Chunk-based dataflow graph translation & streaming processing out of GPU core-> processing graphs larger than GPU memory
- Highly-optimized graph propagation operators
- Chain-based parallel streaming-> efficient multi-GPU parallel execution
- Performance- Outperform state-of-the-art frameworks (e.g., TensorFlow and DGL) on small graphs
- Scale to large real-world graphs with GPUs
8
NeuGraph: Parallel Deep Neural Network Computation on Large Graphs
SAGA-NN Abstraction
9
V0 V1 V2
v2
edge
v1
edge
v0
Scatter
vertex feature
edge feature
edge output
accumulated
vertex output
Neural
Network
NeuGraph: Parallel Deep Neural Network Computation on Large Graphs
SAGA-NN Abstraction
10
V0 V1 V2
EdgeNNTransformation
v2
edge
v1
edge
v0
Scatter
ApplyEdge
vertex feature
edge feature
edge output
accumulated
vertex output
Neural
Network
NeuGraph: Parallel Deep Neural Network Computation on Large Graphs
SAGA-NN Abstraction
11
V0 V1 V2
EdgeNNTransformation
v2
edge
v1
edge
v0
accum.
Scatter
ApplyEdge
Gather
vertex feature
edge feature
edge output
accumulated
vertex output
Neural
Network
NeuGraph: Parallel Deep Neural Network Computation on Large Graphs
SAGA-NN Abstraction
12
v2
edge
v1
edge
v0
accum.
Scatter
ApplyEdge
Gather
ApplyVertex
V0 V1 V2
EdgeNNTransformation
VertexNNTransformation
vertex feature
edge feature
edge output
accumulated
vertex output
Neural
Network
NeuGraph: Parallel Deep Neural Network Computation on Large Graphs
SAGA-NN Abstraction
13
v2
edge
v1
edge
v0
accum.
Scatter
ApplyEdge
Gather
ApplyVertex
V0 V1 V2
EdgeNNTransformation
VertexNNTransformation
vertex feature
edge feature
edge output
accumulated
vertex output
Neural
Network
NeuGraph: Parallel Deep Neural Network Computation on Large Graphs
SAGA-NN Abstraction
14
v2
edge
v1
edge
v0
vertex feature
edge feature
edge output
accumulated
vertex output
Neural
Network
accum.
Scatter
ApplyEdge
Gather
ApplyVertex
Graph Operations• transparent to users
Neural Network Operations• define NN computation
with dataflow
NeuGraph: Parallel Deep Neural Network Computation on Large Graphs
Example: Graph Convolutional Network (GCN)
15
ScatterApplyEdge
Gather
ApplyVertex
NeuGraph: Parallel Deep Neural Network Computation on Large Graphs
Scaling beyond GPU memory limit
16
10
3 2Graph
0 1 2 1
1 3 3 0
1 2
Input Vertex Feature Edges
Output Vertex Feature
NeuGraph: Parallel Deep Neural Network Computation on Large Graphs
Edge Chunk E0,0
Chunk V1
Input Vertex Feature Chunk V0
Chunk E1,0
Output Vertex Feature Chunk V0
Scaling beyond GPU memory limit
17
10
3 2Graph
E0,0
E0,1
E1,0
E1,1
3 0 0 1
1 32 1
1 2
- 2D graph partitioning
3 0
NeuGraph: Parallel Deep Neural Network Computation on Large Graphs
Edge Chunk E0,0
Chunk V1
Input Vertex Feature Chunk V0
Chunk E1,0
Output Vertex Feature Chunk V0
Scaling beyond GPU memory limit
18
10
3 2Graph
E0,0
E0,1
E1,0
E1,1
3 0 0 1
1 32 1
1 2
- 2D graph partitioning
3 0
column-by-columnchunk scheduling
Accumulate in GPU memory
NeuGraph: Parallel Deep Neural Network Computation on Large Graphs
Chunk-based Dataflow Translation: GCN
19
mul
Gather
ApplyEdgeV0
E0,0
V0 src
dst
Scatter
A0
mul
W
E0,0V0
SAG
A0
weights params.
Accum.
V1
SAG
E1,0
V0 matmul ReLU
ApplyVertex
add
B
E1,0 E1,1
E0,0 E0,1V0
V1
V0 V
NeuGraph: Parallel Deep Neural Network Computation on Large Graphs
E1,0 E1,1
E0,0 E0,1V0
V1
V0 V
E1,2 E1,3
E0,2 E0,3
V0 V
- Naïve solution:- Approach: each GPU takes a “column” of edge chunks
- Problem: redundant data transfer through shared upper links
Scaling to multi-GPU
20
PCIe Switch
GPU
0
GPU
1
PCIe Switch
GPU
2
GPU
3
x16 x16 x16 x16
x16 x16
CPU /
DRAM
PCIe Host
Bridge
GPU0
GPU1
GPU2
GPU3
V0 V1
V0 V0
bandwidthbottleneck E1,0 E1,1
E0,0 E0,1V0
V1
V0 V
E1,2 E1,3
E0,2 E0,3
V V
NeuGraph: Parallel Deep Neural Network Computation on Large Graphs
E1,0 E1,1
E0,0 E0,1V0
V1
V0 V
E1,2 E1,3
E0,2 E0,3
V0 V
- Observation: link to neighbor GPU is empty
- Approach: directly load from neighbor GPU
Scaling to multi-GPU
21
PCIe Switch
GPU
0
GPU
1
PCIe Switch
GPU
2
GPU
3
x16 x16 x16 x16
x16 x16
CPU /
DRAM
PCIe Host
Bridge
GPU0
GPU1
GPU2
GPU3
V0 V1
V0 V0
GPU-to-GPU
E1,0 E1,1
E0,0 E0,1V0
V1
V0 V
E1,2 E1,3
E0,2 E0,3
V V
NeuGraph: Parallel Deep Neural Network Computation on Large Graphs
E1,0 E1,1
E0,0 E0,1V0
V1
V0 V
E1,2 E1,3
E0,2 E0,3
V0 V
- Observation: link to neighbor GPU is empty
- Approach: directly load from neighbor GPU
Scaling to multi-GPU
22
PCIe Switch
GPU
0
GPU
1
PCIe Switch
GPU
2
GPU
3
x16 x16 x16 x16
x16 x16
CPU /
DRAM
PCIe Host
Bridge
GPU0
GPU1
GPU2
GPU3
V0 V1
Chain
E1,0 E1,1
E0,0 E0,1V0
V1
V0 V
E1,2 E1,3
E0,2 E0,3
V V
NeuGraph: Parallel Deep Neural Network Computation on Large Graphs
Chain-based Parallel Streaming
- Each GPU can consume all the chunks in chain for a PCIe switch
23
E1,0 E1,1
E0,0 E0,1V0
V1
V0 V
E1,2 E1,3
E0,2 E0,3
V0 V
GPU0
GPU1
GPU2
GPU3
Step
1
2
3
4
0 0empty empty
1 0 1 00 0
2 1 1 0 2 1 1 0
3 2 2 1 3 2 2 1
Load
Load
#Computing
Chunk
Delete
Data
Load
Data#Receiving
ChunkTransfer
Load
Load
Delete
Delete
V0 V1 V2 V3V0 V1 V2 V3
E1,0 E1,1
E0,0 E0,1V0
V1
V0 V
E1,2 E1,3
E0,2 E0,3
V V
NeuGraph: Parallel Deep Neural Network Computation on Large Graphs
High performance graph propagation kernels
- Traditional multithread parallelization- PageRank, SSSP, CC, etc: one-dimension scalar value on vertices
- A thread takes a vertex/edge
V1V0 edge
24
Thre
ad 0
Thre
ad 1
diff ops. on diff verticesinefficient in GPU(SIMD)
NeuGraph: Parallel Deep Neural Network Computation on Large Graphs
High performance graph propagation kernels
- Observation- High-dimension feature
- More-regular operations along feature-dimension
- Solution: exploit feature-dimension parallelization in GNNs
Thread 0
Thread 1
25
same ops. efficient in GPU(SIMD)
V1V0 edge
Thread 2
Thread 3
NeuGraph: Parallel Deep Neural Network Computation on Large Graphs
Experiment Setup- Typical GNN Applications
- Communication neural network (CommNet)
// no computation in ApplyEdge
- Graph convolutional networks (GCN)
// element-wise Mul in ApplyEdge
- Gated graph neural networks (GG-NN)
// GRU in ApplyVertex
- Testbed
- 2x E5-2690-v4 (14-core with HT)
- 512GB Quad-Channel RAM
- 8x NVIDIA Tesla P100 GPU
- Ubuntu 16.04
- CUDA 9.0 + cuDNN 7
26
dataset vertex edge feature label degree type
blog 10.3K 668.0K 128 39 65 social
pubmed 19.7K 108.4K 500 3 5 cite
reddit_small 58.2K 1.4M 300 41 25 social
reddit_full 2.4M 705.9M 300 50 292 social
enwiki 3.6M 276.1M 300 12 77 knowledge
amazon 8.6M 231.6M 96 22 27 review
- Comparison
- TF: TensorFlow v1.7 (CPU/GPU)
- DGL: Deep Graph Library v0.1.3 with PyTorch v1.0 backend
- TF-SAGA: SAGA-NN with TensorFlow v1.7 backend
- NG: NeuGraph on top of TensorFlow v1.7
NeuGraph: Parallel Deep Neural Network Computation on Large Graphs
Performance on a Single GPU
Compare with TF, DGL on small graphs
up to 5x speedup vs TF
27
higher density
Avg. epoch time; #layer=2, hidden_size=512
up to 19x speedup vs DGL
NeuGraph: Parallel Deep Neural Network Computation on Large Graphs
Scaling-up on a Single GPU
- Compare with TF(CPU), TF-SAGA(CPU), TF-SAGA on large graphs- TF and DGL run out-of-memory (OOM)
28
higher density
Avg. epoch time; #layer=2, hidden_size=16 (amazon), 32 (enwiki), 64 (reddit-full)
up to 47x speedup vs TF(CPU)
up to 5x speedup vs TF-SAGA
NeuGraph: Parallel Deep Neural Network Computation on Large Graphs
Scaling-out on Multiple GPUs
- GCN on three large graphs on different number of GPUs
29
higher density
Avg. speedup over TF-SAGA-1GPU; #layer=2, hidden_size=16 (amazon), 32 (enwiki), 64 (reddit-full)
NeuGraph: Parallel Deep Neural Network Computation on Large Graphs
Conclusion
NeuGraph: advocate unifying deep learning and graph computing for graph neural networks
- Define a new, flexible SAGA-NN model to express GNNs
- Fuse graph-related optimizations into deep learning frameworks
- Demonstrate NeuGraph achieves efficient and scalable GNN processing
30
Thank you!Q&A
For more information, please visit: https://aka.ms/neugraph
NeuGraph