TRANSCRIPT
MVAPICH2-GDR: Pushing the Frontier of HPC and Deep Learning
Dhabaleswar K. (DK) Panda
The Ohio State University
E-mail: [email protected]
http://www.cse.ohio-state.edu/~panda
Talk at Mellanox booth (SC ‘19)
Outline
• Overview of the MVAPICH2 Project
• MVAPICH2-GPU with GPUDirect-RDMA (GDR)
• What's new with MVAPICH2-GDR
• High-Performance Deep Learning (HiDL) with MVAPICH2-GDR
• Conclusions
Overview of the MVAPICH2 Project
• High-performance open-source MPI library for InfiniBand, Omni-Path, Ethernet/iWARP, and RDMA over Converged Ethernet (RoCE)
– MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.1), Started in 2001, First version available in 2002 (Supercomputing 2002)
– MVAPICH2-X (MPI + PGAS), Available since 2011
– Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), Available since 2014
– Support for Virtualization (MVAPICH2-Virt), Available since 2015
– Support for Energy-Awareness (MVAPICH2-EA), Available since 2015
– Support for InfiniBand Network Analysis and Monitoring (OSU INAM) since 2015
– Used by more than 3,050 organizations in 89 countries
– More than 615,000 (> 0.6 million) downloads from the OSU site directly
– Empowering many TOP500 clusters (Nov ‘19 ranking)
• 3rd, 10,649,600-core (Sunway TaihuLight) at National Supercomputing Center in Wuxi, China
• 5th, 448,448 cores (Frontera) at TACC
• 8th, 391,680 cores (ABCI) in Japan
• 14th, 570,020 cores (Nurion) in South Korea and many others
– Available with software stacks of many vendors and Linux Distros (RedHat, SuSE, and OpenHPC)
– http://mvapich.cse.ohio-state.edu
• Empowering Top500 systems for over a decade
• Partner in the 5th-ranked TACC Frontera system
MVAPICH2 Release Timeline and Downloads
[Figure: cumulative number of downloads from the OSU site, Sep-04 through Apr-19, rising toward 600,000, annotated with major releases from MV 0.9.4 and MV2 0.9.0 through MV2 1.x, MV2-GDR 2.0b, MV2-MIC 2.0, MV2-GDR 2.3.2, MV2-X 2.3rc2, MV2-Virt 2.2, MV2 2.3.2, OSU INAM 0.9.3, MV2-Azure 2.3.2, and MV2-AWS 2.3]
Architecture of MVAPICH2 Software Family (HPC and DL)
• High-performance parallel programming models: Message Passing Interface (MPI), PGAS (UPC, OpenSHMEM, CAF, UPC++), and Hybrid MPI + X (MPI + PGAS + OpenMP/Cilk)
• High-performance and scalable communication runtime with diverse APIs and mechanisms: point-to-point primitives, collective algorithms, energy-awareness, remote memory access, I/O and file systems, fault tolerance, virtualization, active messages, job startup, and introspection & analysis
• Support for modern networking technology (InfiniBand, iWARP, RoCE, Omni-Path, Elastic Fabric Adapter) and modern multi-/many-core architectures (Intel Xeon, OpenPOWER, Xeon Phi, ARM, NVIDIA GPGPU)
• Transport protocols: RC, SRD, UD, DC; modern features: UMR, ODP, SR-IOV, multi-rail
• Transport mechanisms: shared memory, CMA, IVSHMEM, XPMEM; modern features: Optane*, NVLink, CAPI* (* upcoming)
MVAPICH2 Software Family (Requirements / Library)
• MPI with IB, iWARP, Omni-Path, and RoCE: MVAPICH2
• Advanced MPI features/support, OSU INAM, PGAS and MPI+PGAS with IB, Omni-Path, and RoCE: MVAPICH2-X
• MPI with IB, RoCE & GPU and support for Deep Learning: MVAPICH2-GDR
• HPC Cloud with MPI & IB: MVAPICH2-Virt
• Energy-aware MPI with IB, iWARP, and RoCE: MVAPICH2-EA
• MPI energy monitoring tool: OEMT
• InfiniBand network analysis and monitoring: OSU INAM
• Microbenchmarks for measuring MPI and PGAS performance: OMB
MVAPICH2-GDR: Optimizing MPI Data Movement on GPU Clusters
[Figure: two nodes, each with two CPUs connected via QPI, multiple GPUs attached over PCIe, an IB adapter, and memory buffers on each device]
• GPUs are connected as PCIe devices: flexibility but complexity
• Many communication paths:
  1. Intra-GPU
  2. Intra-socket GPU-GPU
  3. Inter-socket GPU-GPU
  4. Inter-node GPU-GPU
  5. Intra-socket GPU-Host
  6. Inter-socket GPU-Host
  7. Inter-node GPU-Host
  8. Inter-node GPU-GPU with IB adapter on remote socket, and more ...
• For each path, different schemes: shared memory, IPC, GPUDirect RDMA, pipelining, ...
• Critical for runtimes to optimize data movement while hiding the complexity
GPU-Aware (CUDA-Aware) MPI Library: MVAPICH2-GPU
• Standard MPI interfaces used for unified data movement
  At Sender:   MPI_Send(s_devbuf, size, …);
  At Receiver: MPI_Recv(r_devbuf, size, …);
  (device buffers are handled inside MVAPICH2)
• Takes advantage of Unified Virtual Addressing (>= CUDA 4.0)
• Overlaps data movement from GPU with RDMA transfers
• High Performance and High Productivity
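The send/recv fragment above can be expanded into a complete two-rank exchange. The following is a minimal sketch of my own (not from the talk): a cudaMalloc'd device pointer is passed directly to MPI_Send/MPI_Recv; the buffer size and tag are arbitrary, and a CUDA-aware build run with MV2_USE_CUDA=1 (the setting used elsewhere in this talk) is assumed.

    /* Minimal sketch of CUDA-aware point-to-point exchange between two ranks.
     * Assumes a CUDA-aware MPI library (e.g., MVAPICH2-GDR with MV2_USE_CUDA=1);
     * size and tag are illustrative. */
    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int count = 1 << 20;                 /* 1M floats */
        float *devbuf;
        cudaMalloc((void **)&devbuf, count * sizeof(float));

        if (rank == 0) {
            /* Send directly from GPU memory; no explicit staging to host */
            MPI_Send(devbuf, count, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            /* Receive directly into GPU memory */
            MPI_Recv(devbuf, count, MPI_FLOAT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        }

        cudaFree(devbuf);
        MPI_Finalize();
        return 0;
    }

With a CUDA-aware build, the library internally chooses among the schemes mentioned above (GDRCOPY, GPUDirect RDMA, IPC, pipelined copies) based on the path and message size.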
CUDA-Aware MPI: MVAPICH2-GDR 1.8-2.3 Releases
• Support for MPI communication from NVIDIA GPU device memory
• High-performance RDMA-based inter-node point-to-point communication (GPU-GPU, GPU-Host, and Host-GPU)
• High-performance intra-node point-to-point communication for multi-GPU adapters/node (GPU-GPU, GPU-Host, and Host-GPU)
• Taking advantage of CUDA IPC (available since CUDA 4.1) for intra-node communication with multiple GPU adapters/node
• Optimized and tuned collectives for GPU device buffers
• MPI datatype support for point-to-point and collective communication from GPU device buffers
• Unified memory
MVAPICH2-GDR 2.3.2
• Released on 08/08/2019
• Major Features and Enhancements
– Based on MVAPICH2 2.3.1
– Support for CUDA 10.1
– Support for PGI 19.x
– Enhanced intra-node and inter-node point-to-point performance
– Enhanced MPI_Allreduce performance for DGX-2 systems
– Enhanced GPU communication support in MPI_THREAD_MULTIPLE mode
– Enhanced performance of datatype support for GPU-resident data
  • Zero-copy transfer when P2P access is available between GPUs through NVLink/PCIe
– Enhanced GPU-based point-to-point and collective tuning
  • OpenPOWER systems such as ORNL Summit and LLNL Sierra, the ABCI system @AIST, and the Owens and Pitzer systems @Ohio Supercomputer Center
– Scaled Allreduce to 24,576 Volta GPUs on Summit
– Enhanced intra-node and inter-node point-to-point performance for DGX-2 and IBM POWER8 and POWER9 systems
– Enhanced Allreduce performance for DGX-2 and IBM POWER8/POWER9 systems
– Enhanced small-message performance for CUDA-Aware MPI_Put and MPI_Get
– Flexible support for running TensorFlow (Horovod) jobs
Optimized MVAPICH2-GDR Design
[Figure: GPU-GPU inter-node latency, bandwidth, and bi-bandwidth vs. message size (1 B to 8 KB), MV2 (no GDR) vs. MV2-GDR 2.3: 1.85 us latency (11X better), about 9x higher bandwidth, and about 10x higher bi-bandwidth]
Platform: MVAPICH2-GDR 2.3, Intel Haswell (E5-2687W @ 3.10 GHz) node with 20 cores, NVIDIA Volta V100 GPU, Mellanox ConnectX-4 EDR HCA, CUDA 9.0, Mellanox OFED 4.0 with GPUDirect RDMA
Application-Level Evaluation (HOOMD-blue)
• Platform: Wilkes (Intel Ivy Bridge + NVIDIA Tesla K20c + Mellanox Connect-IB)
• HOOMD-blue Version 1.0.5
• GDRCOPY enabled: MV2_USE_CUDA=1 MV2_IBA_HCA=mlx5_0 MV2_IBA_EAGER_THRESHOLD=32768 MV2_VBUF_TOTAL_SIZE=32768 MV2_USE_GPUDIRECT_LOOPBACK_LIMIT=32768 MV2_USE_GPUDIRECT_GDRCOPY=1 MV2_USE_GPUDIRECT_GDRCOPY_LIMIT=16384
[Figure: average time steps per second (TPS) vs. number of processes (4 to 32) for 64K-particle and 256K-particle runs, MV2 vs. MV2+GDR; roughly 2X higher TPS with GDR in both cases]
Application-Level Evaluation (Cosmo) and Weather Forecasting in Switzerland
[Figure: normalized execution time vs. number of GPUs on the Wilkes GPU cluster (4 to 32 GPUs) and the CSCS GPU cluster (16 to 96 GPUs), comparing default, callback-based, and event-based designs]
• 2X improvement on 32 GPU nodes
• 30% improvement on 96 GPU nodes (8 GPUs/node)
C. Chu, K. Hamidouche, A. Venkatesh, D. Banerjee , H. Subramoni, and D. K. Panda, Exploiting Maximal Overlap for Non-Contiguous Data Movement Processing on Modern GPU-enabled Systems, IPDPS’16
On-going collaboration with CSCS and MeteoSwiss (Switzerland) in co-designing MV2-GDR and Cosmo Application
Cosmo model: http://www2.cosmo-model.org/content/tasks/operational/meteoSwiss/
Outline
• Overview of the MVAPICH2 Project
• MVAPICH2-GPU with GPUDirect-RDMA (GDR)
• What's new with MVAPICH2-GDR
  – Multi-stream communication for IPC
  – CMA-based intra-node communication support
  – Support for OpenPOWER and NVLink with GDRCopy
  – Maximal overlap in MPI datatype processing
• High-Performance Deep Learning (HiDL) with MVAPICH2-GDR
• Conclusions
Multi-stream Communication using CUDA IPC on OpenPOWER and DGX-1
• Up to 16% higher Device to Device (D2D) bandwidth on OpenPOWER + NVLink inter-connect
• Up to 30% higher D2D bandwidth on DGX-1 with NVLink
[Figure: point-to-point device-to-device (D-D) bandwidth vs. message size, 1-stream vs. 4-stream CUDA IPC designs, on OpenPOWER + NVLink and on DGX-1 with NVLink]
Available since MVAPICH2-GDR 2.3a
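To illustrate the idea behind the multi-stream design, here is a simplified sketch of my own (not MVAPICH2-GDR's internal code): a large device-to-device copy is chunked and issued on several CUDA streams so the transfers can overlap. The stream count and chunking are arbitrary choices for the example.

    /* Simplified illustration of multi-stream device-to-device copies.
     * Not MVAPICH2-GDR internals; chunking and stream count are arbitrary. */
    #include <cuda_runtime.h>

    #define NUM_STREAMS 4

    void multi_stream_d2d_copy(void *dst, const void *src, size_t bytes)
    {
        cudaStream_t streams[NUM_STREAMS];
        size_t chunk = (bytes + NUM_STREAMS - 1) / NUM_STREAMS;

        for (int i = 0; i < NUM_STREAMS; i++)
            cudaStreamCreate(&streams[i]);

        for (int i = 0; i < NUM_STREAMS; i++) {
            size_t offset = (size_t)i * chunk;
            if (offset >= bytes)
                break;
            size_t len = (offset + chunk > bytes) ? bytes - offset : chunk;
            /* Each chunk is copied on its own stream so transfers can overlap */
            cudaMemcpyAsync((char *)dst + offset, (const char *)src + offset,
                            len, cudaMemcpyDeviceToDevice, streams[i]);
        }

        for (int i = 0; i < NUM_STREAMS; i++) {
            cudaStreamSynchronize(streams[i]);
            cudaStreamDestroy(streams[i]);
        }
    }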
CMA-based Intra-node Communication Support
• Up to 30% lower Host-to-Host (H2H) latency and 30% higher H2H bandwidth
[Figure: intra-node point-to-point H2H latency and bandwidth vs. message size, MV2-GDR with and without CMA]
Platform: MVAPICH2-GDR 2.3.2, Intel Broadwell (E5-2680 v4 @ 2.40 GHz) node with 28 cores, NVIDIA Tesla K80 GPU, Mellanox ConnectX-4 EDR HCA, CUDA 8.0, Mellanox OFED 4.0 with GPUDirect RDMA
Scalable Host-based Collectives on OpenPOWER (Intra-node Reduce & Alltoall)
[Figure: intra-node Reduce and Alltoall latency vs. message size (4 B to 4 KB and 8 KB to 1 MB) at 1 node with 20 PPN, comparing MVAPICH2-GDR, SpectrumMPI 10.1.0.2, and OpenMPI 3.0.0; improvements range from 1.2X to 5.2X]
• Up to 5X and 3X performance improvement by MVAPICH2 for small and large messages, respectively
D-to-D Performance on OpenPOWER w/ GDRCopy (NVLink2 + Volta)
[Figure: intra-node and inter-node D-to-D latency (small and large messages) and bandwidth vs. message size]
Platform: OpenPOWER (POWER9 ppc64le) nodes with a dual-socket CPU, 4 Volta V100 GPUs, and a 2-port EDR InfiniBand interconnect
Intra-node latency: 0.76 us (with GDRCopy); intra-node bandwidth: 65.48 GB/sec for 4 MB (via NVLink2)
Inter-node latency: 2.18 us (with GDRCopy 2.0); inter-node bandwidth: 23 GB/sec for 4 MB (via 2-port EDR)
D-to-H & H-to-D Performance on OpenPOWER w/ GDRCopy (NVLink2 + Volta)
Platform: OpenPOWER (POWER9 ppc64le) nodes with a dual-socket CPU, 4 Volta V100 GPUs, and a 2-port EDR InfiniBand interconnect
Intra-node D-H latency: 0.49 us (with GDRCopy); intra-node D-H bandwidth: 16.70 GB/sec for 2 MB (via NVLink2)
Intra-node H-D latency: 0.49 us (with GDRCopy 2.0); intra-node H-D bandwidth: 26.09 GB/sec for 2 MB (via NVLink2)
[Figure: intra-node D-H and H-D latency (small and large messages) and bandwidth vs. message size, Spectrum MPI vs. MV2-GDR]
Managed Memory Performance (OpenPOWER Intra-node)
[Figure: intra-node latency, bandwidth, and bi-bandwidth for managed-memory (MD) buffers on OpenPOWER]
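As a hedged illustration of what the managed-memory (MD) buffers in this evaluation refer to (my own sketch, not from the talk): a CUDA-aware library such as MVAPICH2-GDR can also be handed buffers allocated with cudaMallocManaged, so the same MPI call works whether the data currently resides on the host or the device. The message size and ranks below are placeholders.

    /* Sketch: MPI communication from a CUDA managed (unified memory) buffer.
     * Assumes a CUDA-aware MPI build; message size and peer rank are illustrative. */
    #include <mpi.h>
    #include <cuda_runtime.h>

    void exchange_managed(int rank, int peer, size_t count)
    {
        double *buf;
        /* Managed allocation is accessible from both CPU and GPU */
        cudaMallocManaged((void **)&buf, count * sizeof(double),
                          cudaMemAttachGlobal);

        if (rank < peer)
            MPI_Send(buf, (int)count, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD);
        else
            MPI_Recv(buf, (int)count, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);

        cudaFree(buf);
    }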
MVAPICH2 with SHARP Support (Preliminary Results)
[Figure: Allreduce latency vs. message size (4 to 512 B) with and without SHARP support, for GPU buffers (MVAPICH2-GDR-Next, 2 and 4 nodes) and host buffers (MVAPICH2 2.3.2, 2 and 8 nodes); lowest latencies of 2.34 us, 3.14 us, 3.41 us, and 3.97 us with SHARP]
Platform: OpenPOWER (POWER9 ppc64le) nodes with a dual-socket CPU, 4 Volta V100 GPUs, and a 2-port EDR InfiniBand interconnect
Non-contiguous Data Exchange
• Multi-dimensional data
  – Row-based organization
  – Contiguous in one dimension, non-contiguous in the others
• Halo data exchange
  – Duplicate the boundary
  – Exchange the boundary in each iteration
[Figure: halo data exchange]
MPI Datatype Support in MVAPICH2
• Datatype support in MPI
  – Operate on customized datatypes to improve productivity
  – Enable the MPI library to optimize non-contiguous data
  At Sender:
    MPI_Type_vector(n_blocks, n_elements, stride, old_type, &new_type);
    MPI_Type_commit(&new_type);
    …
    MPI_Send(s_buf, size, new_type, dest, tag, MPI_COMM_WORLD);
• Inside MVAPICH2
  – Use datatype-specific CUDA kernels to pack data in chunks
  – Efficiently move data between nodes using RDMA
  – In progress: currently optimizes vector and hindexed datatypes
  – Transparent to the user
H. Wang, S. Potluri, D. Bureddy, C. Rosales, and D. K. Panda, "GPU-Aware MPI on RDMA-Enabled Clusters: Design, Implementation and Evaluation," IEEE Transactions on Parallel and Distributed Systems, Vol. 25, No. 10, pp. 2595-2605, Oct 2014.
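The sender-side fragment above can be fleshed out into a small end-to-end example. The sketch below is my own illustration (the block count, block length, and stride are arbitrary): rank 0 sends a strided region of a GPU-resident buffer described by MPI_Type_vector, and rank 1 receives it as a contiguous array. A CUDA-aware MPI library is assumed.

    /* Sketch: sending a strided (non-contiguous) region of a GPU buffer with
     * MPI_Type_vector. Assumes a CUDA-aware MPI library; the geometry is illustrative. */
    #include <mpi.h>
    #include <cuda_runtime.h>

    void send_strided_region(int rank)
    {
        const int n_blocks = 64, n_elements = 16, stride = 128;
        MPI_Datatype vec_type;

        double *s_buf, *r_buf;
        cudaMalloc((void **)&s_buf, n_blocks * stride * sizeof(double));
        cudaMalloc((void **)&r_buf, n_blocks * n_elements * sizeof(double));

        MPI_Type_vector(n_blocks, n_elements, stride, MPI_DOUBLE, &vec_type);
        MPI_Type_commit(&vec_type);

        if (rank == 0) {
            /* One instance of vec_type covers all strided blocks in s_buf */
            MPI_Send(s_buf, 1, vec_type, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            /* The peer receives the packed data as a contiguous array */
            MPI_Recv(r_buf, n_blocks * n_elements, MPI_DOUBLE, 0, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }

        MPI_Type_free(&vec_type);
        cudaFree(s_buf);
        cudaFree(r_buf);
    }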
MPI Datatype Processing (Computation Optimization)
• Comprehensive support
  – Targeted kernels for regular datatypes: vector, subarray, indexed_block
  – Generic kernels for all other irregular datatypes
• Separate non-blocking stream for kernels launched by the MPI library
  – Avoids stream conflicts with application kernels
• Flexible set of parameters for users to tune kernels
  – Vector: MV2_CUDA_KERNEL_VECTOR_TIDBLK_SIZE, MV2_CUDA_KERNEL_VECTOR_YSIZE
  – Subarray: MV2_CUDA_KERNEL_SUBARR_TIDBLK_SIZE, MV2_CUDA_KERNEL_SUBARR_XDIM, MV2_CUDA_KERNEL_SUBARR_YDIM, MV2_CUDA_KERNEL_SUBARR_ZDIM
  – Indexed_block: MV2_CUDA_KERNEL_IDXBLK_XDIM
MPI Datatype Processing (Communication Optimization)
• Common scenario: the application posts a series of non-blocking sends of non-contiguous MPI datatypes and waits on all of them
  MPI_Isend(A, ..., datatype, ...);
  MPI_Isend(B, ..., datatype, ...);
  MPI_Isend(C, ..., datatype, ...);
  MPI_Isend(D, ..., datatype, ...);
  …
  MPI_Waitall(...);
  (A, B, … contain non-contiguous MPI datatypes)
• Without overlap between packing and transfer, this pattern wastes computing resources on the CPU and GPU
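Below is a minimal, compilable rendering of the scenario sketched above, written as my own illustration: several non-blocking sends of a committed non-contiguous datatype are posted and then completed together with MPI_Waitall. The buffer count, peer ranks, and the halo_type datatype are placeholders.

    /* Sketch of the common scenario: several non-blocking sends of
     * non-contiguous (datatype-described) buffers, completed together.
     * Buffer count, peers, and tags are placeholders. */
    #include <mpi.h>

    #define NUM_BUFS 4

    void post_halo_sends(void *bufs[NUM_BUFS], MPI_Datatype halo_type,
                         const int peers[NUM_BUFS])
    {
        MPI_Request reqs[NUM_BUFS];

        for (int i = 0; i < NUM_BUFS; i++) {
            /* Each send hands a non-contiguous region to the MPI library,
             * which can pack it with CUDA kernels and overlap the transfers */
            MPI_Isend(bufs[i], 1, halo_type, peers[i], /* tag = */ i,
                      MPI_COMM_WORLD, &reqs[i]);
        }

        /* All overlap happens between the posts and this completion call */
        MPI_Waitall(NUM_BUFS, reqs, MPI_STATUSES_IGNORE);
    }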
Application: COMB
• 16 GPUs on a POWER9 system (test Comm mpi Mesh cuda Device Buffers mpi_type)
• Phase times reported by the benchmark:

  Library                        pre-comm  post-recv  post-send  wait-recv  wait-send  post-comm  start-up  test-comm  bench-comm
  Spectrum MPI 10.3               0.0001    0.0000     1.6021     1.7204     0.0112     0.0001     0.0004    7.7383     83.6229
  MVAPICH2-GDR 2.3.2              0.0001    0.0000     0.0862     0.0871     0.0018     0.0001     0.0009    0.3558     4.4396
  MVAPICH2-GDR 2.3.3 (Upcoming)   0.0001    0.0000     0.0030     0.0032     0.0001     0.0001     0.0009    0.0133     0.1602

• MVAPICH2-GDR 2.3.2 is about 18x faster than Spectrum MPI on bench-comm, and the upcoming 2.3.3 is about 27x faster than 2.3.2
• Improvements due to enhanced support for GPU kernel-based packing/unpacking routines
• Run scripts pushed to the COMB GitHub repo: https://github.com/LLNL/Comb/pull/2
Application: HYPRE - BoomerAMG
Sum of setup-phase and solve-phase times (seconds):

  # of GPUs              16 GPUs  32 GPUs  64 GPUs  128 GPUs
  Spectrum-MPI 10.3.0.1   0.97     1.196    1.436    1.756
  MVAPICH2-GDR 2.3.2      0.94     1.01     1.082    1.628

• MVAPICH2-GDR 2.3.2 reduces the combined time by up to 25% (64 GPUs) and 16% (32 GPUs) relative to Spectrum-MPI
RUN MVAPICH2-GDR 2.3.2: export MV2_USE_CUDA=1 MV2_USE_GDRCOPY=0 MV2_USE_RDMA_CM=0
export MV2_USE_GPUDIRECT_LOOPBACK=0 MV2_HYBRID_BINDING_POLICY=spread MV2_IBA_HCA=mlx5_0:mlx5_3
OMP_NUM_THREADS=20 lrun -n 128 -N 32 mpibind ./ij -P 8 4 4 -n 50 50 50 -pmis -Pmx 8 -keepT 1 -rlx 18
RUN Spectrum-MPI 10.3.0.1: OMP_NUM_THREADS=20 lrun -n 128 -N 32 --smpiargs "-gpu --disable_gdr" mpibind ./ij -P 8 4 4 -n 50 50 50 -pmis -Pmx 8 -keepT 1 -rlx 18
Outline
• Overview of the MVAPICH2 Project
• MVAPICH2-GPU with GPUDirect-RDMA (GDR)
• What's new with MVAPICH2-GDR
• High-Performance Deep Learning (HiDL) with MVAPICH2-GDR
  – Benefits of CUDA-Aware MPI with TensorFlow
  – Optimized collectives for Deep Learning
  – Out-of-core DNN training
• Conclusions
Data Parallel Training with TensorFlow (TF)
• Need to understand several options currently available
• gRPC (official support)
  – Open-source; can be enhanced by others
  – Accelerated gRPC (add RDMA to gRPC)
• gRPC+X
  – Use gRPC for bootstrap and rendezvous
  – Actual communication is in "X"
  – X = MPI, Verbs, GPUDirect RDMA (GDR), etc.
• No-gRPC
  – Baidu: the first to use MPI collectives for TF
  – Horovod: uses NCCL, MPI, or any other future library (e.g., IBM DDL support recently added)
A. A. Awan, J. Bedorf, C.-H. Chu, H. Subramoni, and D. K. Panda, "Scalable Distributed DNN Training using TensorFlow and CUDA-Aware MPI: Characterization, Designs, and Performance Evaluation," CCGrid '19. https://arxiv.org/abs/1810.11112
Exploiting CUDA-Aware MPI for TensorFlow (Horovod)
• MVAPICH2-GDR offers excellent performance via advanced designs for MPI_Allreduce
• Up to 11% better performance on the RI2 cluster (16 GPUs)
• Near-ideal: 98% scaling efficiency
[Figure: images/second vs. number of GPUs (1 to 16) for Horovod-MPI, Horovod-NCCL2, Horovod-MPI-Opt (proposed), and ideal scaling; MVAPICH2-GDR 2.3 (MPI-Opt) is up to 11% faster than MVAPICH2 2.3 (basic CUDA support)]
A. A. Awan et al., "Scalable Distributed DNN Training using TensorFlow and CUDA-Aware MPI: Characterization, Designs, and Performance Evaluation," CCGrid '19. https://arxiv.org/abs/1810.11112
MVAPICH2-GDR vs. NCCL2: Allreduce Operation
• Optimized designs in MVAPICH2-GDR 2.3 offer better/comparable performance for most cases
• MPI_Allreduce (MVAPICH2-GDR) vs. ncclAllreduce (NCCL2) on 16 GPUs
[Figure: Allreduce latency vs. message size, MVAPICH2-GDR vs. NCCL2; up to ~3X better for small/medium messages and ~1.2X better for large messages]
Platform: Intel Xeon (Broadwell) nodes with a dual-socket CPU, one K-80 GPU, and an EDR InfiniBand interconnect
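For reference, the operation being compared in these slides is a standard MPI_Allreduce issued on GPU-resident buffers, as used for gradient averaging in data-parallel training. A minimal sketch of my own (with an arbitrary element count) looks like this; with MVAPICH2-GDR the device pointer is passed to the collective directly.

    /* Sketch: MPI_Allreduce on a GPU-resident buffer, as used for gradient
     * averaging in data-parallel training. Assumes a CUDA-aware MPI library;
     * the element count is illustrative. */
    #include <mpi.h>
    #include <cuda_runtime.h>

    void allreduce_device_buffer(size_t count)
    {
        float *d_grad;
        cudaMalloc((void **)&d_grad, count * sizeof(float));

        /* In-place sum across all ranks, operating directly on device memory;
         * the library selects GDR/IPC/NVLink paths internally */
        MPI_Allreduce(MPI_IN_PLACE, d_grad, (int)count, MPI_FLOAT, MPI_SUM,
                      MPI_COMM_WORLD);

        cudaFree(d_grad);
    }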
MVAPICH2-GDR vs. NCCL2: Allreduce Optimization (DGX-2)
• Optimized designs in the upcoming MVAPICH2-GDR offer better performance for most cases
• MPI_Allreduce (MVAPICH2-GDR) vs. ncclAllreduce (NCCL2) on a DGX-2 machine
[Figure: Allreduce latency and bandwidth vs. message size (4 B to 64 MB), MVAPICH2-GDR 2.3.2 vs. NCCL 2.4; up to 5.8X lower latency and 2.4X higher bandwidth]
Platform: NVIDIA DGX-2 system @ PSC (16 NVIDIA Volta GPUs connected with NVSwitch), CUDA 9.2
MVAPICH2-GDR: MPI_Allreduce (Device Buffers) on Summit
• Optimized designs in MVAPICH2-GDR offer better performance for most cases
• MPI_Allreduce (MVAPICH2-GDR) vs. ncclAllreduce (NCCL2) up to 1,536 GPUs
[Figure: Allreduce latency (4 B to 16 KB) and bandwidth (32 to 256 MB) on 1,536 GPUs for MVAPICH2-GDR 2.3.2 vs. NCCL 2.4 (up to 1.6X lower latency, 1.7X higher bandwidth), plus 128 MB-message bandwidth scaling from 24 to 1,536 GPUs against SpectrumMPI 10.2.0.11, OpenMPI 4.0.1, and NCCL 2.4]
Platform: dual-socket IBM POWER9 CPU, 6 NVIDIA Volta V100 GPUs, and a 2-port InfiniBand EDR interconnect per node
Distributed Training with TensorFlow and MVAPICH2-GDR on Summit
• ResNet-50 training using the TensorFlow benchmark on Summit with 1,536 Volta GPUs
• ImageNet-1k: 1,281,167 (1.2 million) images
• Time/epoch = 3.6 seconds
• Total time (90 epochs) = 3.6 x 90 = 324 seconds, about 5.4 minutes!
• MVAPICH2-GDR reaches ~0.35 million images per second for ImageNet-1k
[Figure: images per second (thousands) vs. number of GPUs (1 to 1,536), NCCL 2.4 vs. MVAPICH2-GDR 2.3.2; *we observed errors for NCCL2 beyond 96 GPUs]
Platform: the Summit supercomputer (#1 on Top500.org), 6 NVIDIA Volta GPUs per node connected with NVLink, CUDA 9.2
New Benchmark for Image Segmentation on Summit
• Near-linear scaling may be achieved by tuning Horovod/MPI
• Optimizing MPI/Horovod for the large message sizes produced by high-resolution images
• Developed a generic image segmentation benchmark
• Tuned the DeepLabV3+ model using the benchmark and Horovod; up to 1.3X better than the default
*Anthony et al., "Scaling Semantic Image Segmentation using Tensorflow and MVAPICH2-GDR on HPC Systems" (submission under review)
OSU-Caffe: Scalable Deep Learning
• Caffe: a flexible and layered Deep Learning framework
• Benefits and weaknesses
  – Multi-GPU training within a single node
  – Performance degradation for GPUs across different sockets
  – Limited scale-out
• OSU-Caffe: MPI-based parallel training
  – Enables scale-up (within a node) and scale-out (across multi-GPU nodes)
  – Scale-out on 64 GPUs for training the CIFAR-10 network on the CIFAR-10 dataset
  – Scale-out on 128 GPUs for training the GoogLeNet network on the ImageNet dataset
[Figure: GoogLeNet (ImageNet) training time (seconds) on 8 to 128 GPUs for Caffe, OSU-Caffe (1024), and OSU-Caffe (2048); default Caffe is marked as an invalid use case at larger scales]
OSU-Caffe publicly available from http://hidl.cse.ohio-state.edu/
Scalability and Large (Out-of-core) Models?
• Large DNNs cannot be trained on GPUs due to memory limitations!
  – ResNet-50 for image recognition, but current frameworks can only go up to a small batch size of 45
  – Next-generation models such as Neural Machine Translation (NMT) are ridiculously large (billions of parameters) and will require even more memory!
  – Can we exploit new software features in CUDA 8/9 and hardware mechanisms in Pascal/Volta GPUs?
• The general intuition is that managed allocations "will be" slow!
  – The proposed framework, OC-Caffe (Out-of-Core Caffe), shows that managed-memory designs can provide performance with negligible/no overhead
• OC-Caffe-Opt: up to 80% better than Intel-optimized CPU Caffe for ResNet-50 training on the Volta V100 GPU with CUDA 9 and cuDNN 7
A. A. Awan, C.-H. Chu, H. Subramoni, X. Lu, and D. K. Panda, "OC-DNN: Exploiting Advanced Unified Memory Capabilities in CUDA 9 and Volta GPUs for Out-of-Core DNN Training," HiPC '18
HyPar-Flow (HF): Hybrid Parallelism for TensorFlow
• CPU-based results
  – AMD EPYC
  – Intel Xeon
• Excellent speedups for
  – VGG-19
  – ResNet-110
  – ResNet-1000 (1k layers)
• Able to train "future" models
  – e.g., ResNet-5000 (a synthetic 5000-layer model we benchmarked)
• 110x speedup on 128 Intel Xeon Skylake nodes (TACC Stampede2 cluster)
*Awan et al., "HyPar-Flow: Exploiting MPI and Keras for Hybrid Parallel Training of TensorFlow models," arXiv '19. https://arxiv.org/pdf/1911.05146.pdf
Outline
• Overview of the MVAPICH2 Project
• MVAPICH2-GPU with GPUDirect-RDMA (GDR)
• What's new with MVAPICH2-GDR
• High-Performance Deep Learning (HiDL) with MVAPICH2-GDR
• Conclusions
Conclusions
• The MVAPICH2-GDR library provides optimized MPI communication on InfiniBand and RoCE clusters with GPUs
• Supports both x86 and OpenPOWER with NVLink
• Takes advantage of CUDA features like IPC and the GPUDirect RDMA family
• Allows flexible solutions for streaming applications with GPUs
• Provides optimized solutions (scale-up and scale-out) for High-Performance Deep Learning
Commercial Support for MVAPICH2, HiBD, and HiDL Libraries
• Supported through X-ScaleSolutions (http://x-scalesolutions.com)
• Benefits:
  – Help and guidance with installation of the library
  – Platform-specific optimizations and tuning
  – Timely support for operational issues encountered with the library
  – Web portal interface to submit issues and track their progress
  – Advanced debugging techniques
  – Application-specific optimizations and tuning
  – Obtaining guidelines on best practices
  – Periodic information on major fixes and updates
  – Information on major releases
  – Help with upgrading to the latest release
  – Flexible Service Level Agreements
• Support provided to Lawrence Livermore National Laboratory (LLNL) for the last two years
Multiple Events at SC '19
• Presentations at the OSU and X-Scale booth (#2094)
  – Members of the MVAPICH, HiBD, and HiDL teams
  – External speakers
• Presentations in the SC main program (tutorials, workshops, BoFs, posters, and doctoral showcase)
• Presentations at many other booths (Mellanox, Intel, Microsoft, and AWS) and satellite events
• Complete details available at http://mvapich.cse.ohio-state.edu/conference/752/talks/
Funding Acknowledgments
Funding support by: [sponsor logos]
Equipment support by: [vendor logos]
Personnel Acknowledgments
Current Students (Graduate): A. Awan (Ph.D.), M. Bayatpour (Ph.D.), C.-H. Chu (Ph.D.), J. Hashmi (Ph.D.), A. Jain (Ph.D.), K. S. Kandadi (M.S.), K. S. Khorassani (Ph.D.), P. Kousha (Ph.D.), Kamal Raj (M.S.), A. Quentin (Ph.D.), B. Ramesh (M.S.), S. Xu (M.S.), Q. Zhou (Ph.D.)
Current Students (Undergraduate): V. Gangal (B.S.), N. Sarkauskas (B.S.)
Current Research Scientist: H. Subramoni
Current Research Specialist: J. Smith
Current Post-docs: M. S. Ghazimeersaeed, A. Ruhela, K. Manian
Past Students: A. Augustine (M.S.), P. Balaji (Ph.D.), R. Biswas (M.S.), S. Bhagvat (M.S.), A. Bhat (M.S.), D. Buntinas (Ph.D.), L. Chai (Ph.D.), B. Chandrasekharan (M.S.), S. Chakraborthy (Ph.D.), N. Dandapanthula (M.S.), V. Dhanraj (M.S.), T. Gangadharappa (M.S.), K. Gopalakrishnan (M.S.), W. Huang (Ph.D.), W. Jiang (M.S.), J. Jose (Ph.D.), S. Kini (M.S.), M. Koop (Ph.D.), K. Kulkarni (M.S.), R. Kumar (M.S.), S. Krishnamoorthy (M.S.), K. Kandalla (Ph.D.), M. Li (Ph.D.), P. Lai (M.S.), J. Liu (Ph.D.), M. Luo (Ph.D.), A. Mamidala (Ph.D.), G. Marsh (M.S.), V. Meshram (M.S.), A. Moody (M.S.), S. Naravula (Ph.D.), R. Noronha (Ph.D.), X. Ouyang (Ph.D.), S. Pai (M.S.), S. Potluri (Ph.D.), R. Rajachandrasekar (Ph.D.), D. Shankar (Ph.D.), G. Santhanaraman (Ph.D.), A. Singh (Ph.D.), J. Sridhar (M.S.), S. Sur (Ph.D.), H. Subramoni (Ph.D.), K. Vaidyanathan (Ph.D.), A. Vishnu (Ph.D.), J. Wu (Ph.D.), W. Yu (Ph.D.), J. Zhang (Ph.D.)
Past Research Scientists: K. Hamidouche, S. Sur, X. Lu
Past Post-Docs: D. Banerjee, X. Besseron, H.-W. Jin, J. Lin, M. Luo, E. Mancini, S. Marcarelli, J. Vienne, H. Wang
Past Programmers: D. Bureddy, J. Perkins
Past Research Specialist: M. Arnold
Thank You!
Network-Based Computing Laboratory: http://nowlab.cse.ohio-state.edu/
The High-Performance MPI/PGAS Project: http://mvapich.cse.ohio-state.edu/
The High-Performance Deep Learning Project: http://hidl.cse.ohio-state.edu/
The High-Performance Big Data Project: http://hibd.cse.ohio-state.edu/
Follow us on https://twitter.com/mvapich