NV-Group: Link-Efficient Reductions for Distributed Deep Learning on Modern Dense GPU Systems
Ching-Hsiang Chu, Pouya Kousha, Ammar A. Awan, Kawthar Shafie Khorassani, Hari Subramoni, and Dhabaleswar K. (DK) Panda
{chu.368, kousha.2, awan.10, shafiekhorassani.1}@osu.edu
{subramon, panda}@cse.ohio-state.edu
Network-based Computing Laboratory, The Ohio State University
Outline
• Introduction
  – Trends in Modern HPC Systems
  – Allreduce for Distributed Deep Learning on Dense-GPU Systems
• Research Challenge
• Proposed Designs: NV-Group Allreduce
• Performance Evaluation
• Concluding Remarks
Trends in Modern HPC Architecture: Heterogeneous
• Accelerators: high compute density, high performance/watt
• High-Performance Interconnects: InfiniBand, Omni-Path, EFA; <1 usec latency, 200 Gbps+ bandwidth
• Multi-/Many-core Processors
• Node-local storage: SSD, NVMe-SSD, NVRAM
Representative systems (https://www.top500.org/):
• #1 Fugaku (158,976 nodes with A64FX ARM CPU, a GPU-like processor)
• #2 Summit (27,648 GPUs), #3 Sierra (17,280 GPUs), #14 Lassen (2,664 GPUs)
• #6 HPC5 (7,280 GPUs)
• #7 Selene, NVIDIA DGX SuperPOD (2,240 GPUs)
Trends in Modern Large-scale Dense-GPU Systems
• Scale-up (up to 150 GB/s)
  – PCIe, NVLink/NVSwitch
  – Infinity Fabric, Xe Link
• Scale-out (up to 25 GB/s)
  – InfiniBand, Omni-Path, Ethernet
  – Cray Slingshot
Examples: NVIDIA DGX machines; IBM Power System AC922 on the #1 Summit system
GPU-enabled Distributed Deep Learning
• Easy-to-use and high-performance frameworks
• Wide range of applications
  – Image Classification
  – Speech Recognition
  – Self-driving Cars
  – Healthcare
  – Climate Analytics
Kurth T., Treichler S., Romero J., Mudigonda M., Luehr N., Phillips E., Mahesh A., Matheson M., Deslippe J., Fatica M., Houston M. "Exascale Deep Learning for Climate Analytics." SC 2018. (Gordon Bell Prize)
999 PetaFlop/s sustained and 1.13 ExaFlop/s peak FP16 performance over 4,560 nodes (27,360 GPUs)
Reduction Operations for Distributed Deep Learning
• Distributed deep learning training with data parallelism
  – Uses Allreduce operations to exchange and update gradients, weights, etc. (see the sketch below)
• State-of-the-art Ring-based Allreduce for GPUs*
  – Pros: contention-free
  – Cons: not scalable
https://www.oreilly.com/ideas/distributed-tensorflow
Ben-Nun T., Hoefler T. "Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis." arXiv:1802.09941, 2018.
* Please refer to the paper for the analysis of more algorithms
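To make the data-parallel pattern above concrete, here is a minimal sketch (illustrative only, not the NV-Group or MVAPICH2-GDR code) of averaging gradients across ranks with MPI_Allreduce; the buffer name and size are assumptions.

// Minimal data-parallel gradient averaging with MPI_Allreduce.
// Illustrative sketch: the gradient buffer, its size, and the averaging
// step are assumptions, not taken from the NV-Group implementation.
#include <mpi.h>
#include <vector>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int world_size = 0;
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    // Local gradients produced by this rank's backward pass (dummy data here).
    std::vector<float> grads(1 << 20, 1.0f);

    // Sum gradients across all ranks in place; every rank gets the result.
    MPI_Allreduce(MPI_IN_PLACE, grads.data(), (int)grads.size(),
                  MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);

    // Average so each rank applies the same update to its model replica.
    for (float &g : grads) g /= world_size;

    MPI_Finalize();
    return 0;
}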
Motivation
• Ring-based Allreduce cannot efficiently utilize NVLinks
[Chart: Throughput (GB/s) per NVLink pair (GPU-CPU and GPU-GPU pairs on an IBM AC922 node) for SpectrumMPI-10.3, OpenMPI-4.0.3, MV2-GDR-2.3, and NCCL-2.6]
* Profiling tool: P. Kousha et al., "Designing a Profiling and Visualization Tool for Scalable and In-Depth Analysis of High-Performance GPU Clusters," HiPC 2019.
Outline
• Introduction
• Research Challenge
• Proposed Designs: NV-Group
• Performance Evaluation
• Concluding Remarks
Broad Research Challenge
How can we design a link-efficient Allreduce algorithm that maximizes the utilization of the available hardware communication channels to boost the performance of distributed DL training on emerging dense GPU systems?
Outline
• Introduction
• Research Challenge
• Proposed Designs: NV-Group Allreduce
• Performance Evaluation
• Concluding Remarks
Overview of the Proposed NVGroup Allreduce
1. Forming NV-Groups
   – Treat multiple GPUs as one
2. Cooperative reduction kernel within the NV-Group
   – Persistent GPU kernels
   – Exploit load-store primitives over NVLinks
   – High-occupancy kernel
3. Communication across NV-Groups
   – Contention-free over the slowest link, the IB network
Forming NV-Groups
• Topology detection and GPU grouping
  – Discover which GPUs are fully connected by NVLink, using tools such as hwloc[1] and NVML[2]
  – Create logical GPU groups, e.g., an MPI Group or Communicator (see the sketch below)
[1] https://www.open-mpi.org/projects/hwloc/
[2] NVIDIA Management Library (NVML), https://developer.nvidia.com/nvidia-management-library-nvml
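As a rough illustration of this step (a sketch under assumptions, not the actual MVAPICH2-GDR grouping code), the snippet below queries NVLink state with NVML and splits MPI_COMM_WORLD into one communicator per group; the one-GPU-per-rank mapping and the three-GPUs-per-socket coloring are assumed heuristics.

// Sketch: detect NVLink connectivity with NVML, then form one MPI
// communicator per NV-Group. The coloring heuristic below is an
// assumption for illustration, not the MVAPICH2-GDR grouping logic.
#include <mpi.h>
#include <nvml.h>
#include <cstdio>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    nvmlInit();
    nvmlDevice_t dev;
    // Assumption: one GPU per MPI rank, 6 GPUs per node (Summit-like).
    nvmlDeviceGetHandleByIndex(rank % 6, &dev);

    // Count active NVLink connections on this device.
    int active_links = 0;
    for (unsigned int link = 0; link < NVML_NVLINK_MAX_LINKS; ++link) {
        nvmlEnableState_t state;
        if (nvmlDeviceGetNvLinkState(dev, link, &state) == NVML_SUCCESS &&
            state == NVML_FEATURE_ENABLED)
            active_links++;
    }

    // Assumed heuristic: GPUs on the same CPU socket are fully connected by
    // NVLink, so color ranks by socket (3 GPUs per socket on an AC922 node).
    int color = (rank % 6) / 3;
    MPI_Comm nv_group;
    MPI_Comm_split(MPI_COMM_WORLD, color, rank, &nv_group);

    printf("Rank %d: %d active NVLinks, NV-Group color %d\n",
           rank, active_links, color);

    MPI_Comm_free(&nv_group);
    nvmlShutdown();
    MPI_Finalize();
    return 0;
}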
Cooperative Reduction Kernel within NV-Group
• CPU creates a work queue for each Cooperative Thread Array (CTA, or block)
• Persistent GPU kernel (see the sketch below the figure)
  1) Poll the individual work queue
  2) Reduce the data chunks
     • Reduce-scatter among GPUs
     • Direct load-store over NVLink
  3) Signal the CPU upon completion
     • Implicit synchronization[1]
[1] Ching-Hsiang Chu et al., "Designing High-Performance In-Memory Key-Value Operations with Persistent GPU Kernels and OpenSHMEM," OpenSHMEM 2018.
[Figure: Two GPUs (GPU 0, GPU 1), each running CTA 0 and CTA 1, with per-CTA work queues in shared system memory; the CPU enqueues per-chunk work items (e.g., Chunk 0-0: CPU→GPU0, Chunk 1-1: GPU0→GPU1) that the CTAs reduce directly over NVLink.]
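The following CUDA sketch illustrates the persistent-kernel pattern described above; it is a minimal approximation under assumptions (the WorkItem layout and field names are invented for illustration), not the MVAPICH2-GDR implementation.

// Minimal sketch of a persistent cooperative reduction kernel.
// The WorkItem layout is an assumed, illustrative data structure.
#include <cuda_runtime.h>

struct WorkItem {
    volatile int ready;   // set by the CPU when a chunk is ready to reduce
    volatile int done;    // set by the GPU when the reduction has finished
    volatile int exit;    // set by the CPU to terminate the persistent kernel
    const float *src;     // peer GPU buffer, read directly over NVLink (P2P)
    float *dst;           // local GPU buffer receiving the reduced chunk
    size_t count;         // number of elements in this chunk
};

__global__ void persistent_reduce(WorkItem *queue)
{
    WorkItem *w = &queue[blockIdx.x];        // one work-queue entry per CTA
    for (;;) {
        // 1) Poll this CTA's own work-queue entry in host-mapped memory.
        if (threadIdx.x == 0)
            while (!w->ready && !w->exit) { /* spin */ }
        __syncthreads();
        if (w->exit) return;

        // 2) Reduce the chunk: loads from the peer GPU travel over NVLink,
        //    stores land in the local buffer (reduce-scatter style).
        for (size_t i = threadIdx.x; i < w->count; i += blockDim.x)
            w->dst[i] += w->src[i];
        __syncthreads();

        // 3) Signal the CPU that this chunk is complete; the system-wide
        //    fence makes the reduced data visible before the flag flips.
        if (threadIdx.x == 0) { __threadfence_system(); w->ready = 0; w->done = 1; }
        __syncthreads();
    }
}

On the host side, the queue would typically live in host-mapped pinned memory (cudaHostAlloc with cudaHostAllocMapped, with the device pointer obtained via cudaHostGetDevicePointer or used directly under unified addressing), peer access between the GPUs in the group would be enabled with cudaDeviceEnablePeerAccess, and one CTA would be launched per queue entry.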
Cooperative Reduction Kernel – Efficiency
• High-occupancy kernel with low register pressure* (see the sketch below)
  – CPU coordinates the topology and communication paths
  – Enables all threads to perform reduction operations
• Frees resources for applications
  – Low SM consumption, hence low scheduling overhead
  – Enables overlap opportunities
[Chart: Efficiency of the reduction kernel (higher is better); throughput (GB/s) vs. number of CTAs (1-16) for NVGroup (1024 threads) and NCCL-2.6 (256 threads); an annotation marks where the links have been saturated.]
* https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#execution-configuration-optimizations
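To illustrate the execution-configuration trade-off (a hedged sketch, not the NV-Group kernel), the snippet below caps register usage with __launch_bounds__ so a full 1024-thread CTA can stay resident and queries the resulting occupancy; the kernel body is a placeholder.

#include <cstdio>
#include <cuda_runtime.h>

// Stand-in reduction kernel: __launch_bounds__(1024) asks the compiler to
// keep register usage low enough that a 1024-thread CTA can be resident.
__global__ void __launch_bounds__(1024)
reduce_stub(float *dst, const float *src, size_t n)
{
    for (size_t i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += (size_t)gridDim.x * blockDim.x)
        dst[i] += src[i];
}

int main() {
    int blocks_per_sm = 0;
    // Query how many 1024-thread CTAs of this kernel can be resident per SM.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocks_per_sm,
                                                  reduce_stub, 1024, 0);
    printf("Resident CTAs per SM at 1024 threads: %d\n", blocks_per_sm);
    return 0;
}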
Group-Wise Communication – CPU-GPU Cooperation
• CPU processes
  – Inter-group communication: ring-based Reduce-Scatter + Allgather over IB or X-Bus (see the sketch below)
  – Offload reduction to the NV-Group
• GPUs (NV-Group)
  – Process the operations requested by the CPU: direct Reduce-Scatter or Allgather over NVLink
[Figure: Four NV-Groups (Group-1 through Group-4), each containing GPU 0-2, exchanging chunks in a ring across groups.]
* Please check out the paper for more optimization techniques.
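For reference, the ring-based Reduce-Scatter + Allgather schedule that the CPU processes drive across groups looks roughly like the sketch below (host buffers and plain MPI only; in the actual design the per-chunk reductions are offloaded to the GPUs in each NV-Group).

// Sketch of the ring Reduce-Scatter + Allgather pattern used across
// NV-Groups. Illustrates the communication schedule only, not the
// actual implementation.
#include <mpi.h>
#include <vector>

void ring_allreduce(std::vector<float> &buf, MPI_Comm comm) {
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    const int chunk = (int)buf.size() / size;   // assume size divides buf.size()
    const int left = (rank - 1 + size) % size, right = (rank + 1) % size;
    std::vector<float> recv(chunk);

    // Reduce-scatter phase: after size-1 steps, rank r owns the fully
    // reduced chunk (r + 1) % size.
    for (int step = 0; step < size - 1; ++step) {
        int send_idx = (rank - step + size) % size;
        int recv_idx = (rank - step - 1 + size) % size;
        MPI_Sendrecv(buf.data() + send_idx * chunk, chunk, MPI_FLOAT, right, 0,
                     recv.data(), chunk, MPI_FLOAT, left, 0,
                     comm, MPI_STATUS_IGNORE);
        for (int i = 0; i < chunk; ++i)          // local reduction of received chunk
            buf[recv_idx * chunk + i] += recv[i];
    }

    // Allgather phase: circulate each fully reduced chunk around the ring.
    for (int step = 0; step < size - 1; ++step) {
        int send_idx = (rank + 1 - step + size) % size;
        int recv_idx = (rank - step + size) % size;
        MPI_Sendrecv(buf.data() + send_idx * chunk, chunk, MPI_FLOAT, right, 0,
                     buf.data() + recv_idx * chunk, chunk, MPI_FLOAT, left, 0,
                     comm, MPI_STATUS_IGNORE);
    }
}

Each rank exchanges data only with its ring neighbors in every step, which is why the inter-group traffic stays contention-free on the slower inter-node links.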
Outline
• Introduction
• Research Challenge
• Proposed Designs: NV-Group
• Performance Evaluation
• Concluding Remarks
Experimental Environments
                           | #1 Summit                 | #10 Lassen*               | NVIDIA DGX-2
CPU Model                  | IBM POWER9 (AC922)        | IBM POWER9 (AC922)        | Intel Skylake
System memory              | 512 GB                    | 256 GB                    | 1.5 TB
GPU Model                  | NVIDIA Volta V100 x 6     | NVIDIA Volta V100 x 4     | NVIDIA Volta V100 x 16
Interconnect, CPU-GPU      | 2-lane NVLink             | 3-lane NVLink             | PCIe Gen3
Interconnect, GPU-GPU      | NVLink                    | NVLink                    | 6-lane NVLink & NVSwitch
Interconnect, inter-node   | Dual-rail Mellanox IB EDR | Dual-rail Mellanox IB EDR | Mellanox IB EDR x 8 (unused)
NVIDIA driver & CUDA       | 418.116 & 10.1.243        | 418.116 & 10.1.243        | 410.48 & 10.1.243
• Libraries: SpectrumMPI v10.3.1, OpenMPI v4.0.3 + UCX v1.8.0, MVAPICH2-GDR v2.3, NCCL v2.6
• Benchmarks: OSU Micro-Benchmarks (OMB) & modified nccl-test
• Applications: Horovod v0.19 with TensorFlow v1.14 & PyTorch v1.5
* Please refer to the paper for the thorough performance comparison
Overview of the MVAPICH2 Project
• High-performance open-source MPI library
• Support for multiple interconnects
  – InfiniBand, Omni-Path, Ethernet/iWARP, RDMA over Converged Ethernet (RoCE), and AWS EFA
• Support for multiple platforms
  – x86, OpenPOWER, ARM, Xeon Phi, GPGPUs
• Started in 2001, first open-source version demonstrated at SC '02
• Supports the latest MPI-3.1 standard
• http://mvapich.cse.ohio-state.edu
• Additional optimized versions for different systems/environments:
  – MVAPICH2-X (Advanced MPI + PGAS), since 2011
  – MVAPICH2-GDR with support for NVIDIA GPGPUs, since 2014
  – MVAPICH2-MIC with support for Intel Xeon Phi, since 2014
  – MVAPICH2-Virt with virtualization support, since 2015
  – MVAPICH2-EA with support for Energy-Awareness, since 2015
  – MVAPICH2-Azure for Azure HPC IB instances, since 2019
  – MVAPICH2-X-AWS for AWS HPC+EFA instances, since 2019
• Tools:
  – OSU MPI Micro-Benchmarks (OMB), since 2003
  – OSU InfiniBand Network Analysis and Monitoring (INAM), since 2015
• Used by more than 3,100 organizations in 89 countries
• More than 772,000 (> 0.7 million) downloads from the OSU site directly
• Empowering many TOP500 clusters (June 2020 ranking)
  – 4th, 10,649,600-core (Sunway TaihuLight) at NSC, Wuxi, China
  – 8th, 448,448 cores (Frontera) at TACC
  – 12th, 391,680 cores (ABCI) in Japan
  – 18th, 570,020 cores (Nurion) in South Korea, and many others
• Available with the software stacks of many vendors and Linux distros (RedHat, SuSE, OpenHPC, and Spack)
• Partner in the 8th-ranked TACC Frontera system
• Empowering Top500 systems for more than 15 years
Evaluation of Link Utilization on Summit
• The NVGroup design utilizes all NVLinks
[Chart: Throughput (GB/s) of the Allreduce operation per NVLink pair on an IBM AC922 machine for SpectrumMPI-10.3, OpenMPI-4.0.3, MV2-GDR-2.3, NCCL-2.6, and the Proposed design]
Allreduce Benchmarking on Summit
[Charts: (a) Bandwidth (GB/s) vs. message size (32MB-256MB) on 1,536 GPUs, NVGroup vs. NCCL 2.6: up to 1.7X better. (b) Latency (us) vs. message size (4B-16KB) on 1,536 GPUs, NVGroup vs. NCCL 2.6: up to 1.6X better. (c) Bandwidth (GB/s) for a 128MB message vs. number of GPUs (24-1,536) for SpectrumMPI 10.3.1, OpenMPI 4.0.3, NCCL 2.6, and NVGroup: NVGroup up to 1.7X better.]
• NVGroup yields 40% better latency and 30% better bandwidth than NCCL
• NVGroup scales significantly better than production libraries on the Summit system
* Please refer to the paper for the thorough comparison
Distributed Deep Learning Training on a DGX-2 Machine
• ResNet-50 training using Horovod with TensorFlow
  – Synthetic ImageNet dataset with batch size 64 and 100 batches
• NVGroup achieves higher throughput due to the high-occupancy reduction kernel
[Charts: (a) Images per second vs. number of GPUs (1-16) for NCCL-2.6, Proposed, and Ideal: the proposed design is 9% higher. (b) Scaling efficiency (%) vs. number of GPUs (1-16) for NCCL-2.6 and Proposed (annotated: 93%).]
Distributed Deep Learning Training on Summit
• ResNet-50 training using Horovod with TensorFlow and PyTorch
  – Synthetic ImageNet dataset with batch size 64 and 100 batches
[Charts: Images per second vs. number of GPUs (6-384) for SpectrumMPI, MV2-GDR-2.3, NCCL-2.6, and Proposed. (a) TensorFlow v1.14: the proposed design is 5% higher. (b) PyTorch v1.5: the proposed design is 1.48X higher.]
Outline
• Introduction
• Research Challenge
• Proposed Designs
• Performance Evaluation
• Concluding Remarks
Concluding Remarks
• Allreduce operations dominate the performance of distributed deep learning training with data parallelism
• State-of-the-art ring-based Allreduce fails to efficiently utilize the interconnects on modern dense-GPU systems
• The proposed NVGroup Allreduce can
  – Maximize NVLink utilization on dense-GPU systems
  – Perform cooperative reduction on GPUs
  – Achieve faster distributed DL training on dense-GPU systems such as #1 Summit
• Publicly available since the MVAPICH2-GDR 2.3.2 release!
  – http://mvapich.cse.ohio-state.edu/
Thank You!
{chu.368, kousha.2, awan.10, shafiekhorassani.1}@osu.edu
{subramon, panda}@cse.ohio-state.edu
Network-Based Computing Laboratory: http://nowlab.cse.ohio-state.edu/
The High-Performance MPI/PGAS Project: http://mvapich.cse.ohio-state.edu/
The High-Performance Deep Learning Project: http://hidl.cse.ohio-state.edu/
The High-Performance Big Data Project: http://hibd.cse.ohio-state.edu/