MVAPICH2-GDR: Pushing the Frontier of HPC and Deep Learning
Dhabaleswar K. (DK) Panda
The Ohio State University
E-mail: panda@cse.ohio-state.edu
http://www.cse.ohio-state.edu/~panda

Hari Subramoni
The Ohio State University
E-mail: subramon@cse.ohio-state.edu
http://www.cse.ohio-state.edu/~subramon

GPU Technology Conference (GTC) 2017
GTC 2017, Network Based Computing Laboratory
Outline
• Overview of the MVAPICH2 Project
• MVAPICH2-GPU with GPUDirect-RDMA (GDR)
• What's new with MVAPICH2-GDR
  – Efficient MPI-3 Non-Blocking Collective support
  – Maximal overlap in MPI Datatype Processing
  – Efficient Support for Managed Memory
  – RoCE and Optimized Collectives
  – Initial support for the GPUDirect Async feature
• Efficient Deep Learning with MVAPICH2-GDR
• OpenACC-Aware support
• Conclusions
Overview of the MVAPICH2 Project
• High-performance open-source MPI library for InfiniBand, Omni-Path, Ethernet/iWARP, and RDMA over Converged Ethernet (RoCE)
– MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.0), Started in 2001, First version available in 2002
– MVAPICH2-X (MPI + PGAS), Available since 2011
– Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), Available since 2014
– Support for Virtualization (MVAPICH2-Virt), Available since 2015
– Support for Energy-Awareness (MVAPICH2-EA), Available since 2015
– Support for InfiniBand Network Analysis and Monitoring (OSU INAM) since 2015
– Used by more than 2,750 organizations in 84 countries
– More than 416,000 (> 0.4 million) downloads from the OSU site directly
– Empowering many TOP500 clusters (Nov ‘16 ranking)
• 1st, 10,649,600-core (Sunway TaihuLight) at National Supercomputing Center in Wuxi, China
• 13th, 241,108-core (Pleiades) at NASA
• 17th, 462,462-core (Stampede) at TACC
• 40th, 74,520-core (Tsubame 2.5) at Tokyo Institute of Technology
– Available with software stacks of many vendors and Linux Distros (RedHat and SuSE)
– http://mvapich.cse.ohio-state.edu
• Empowering Top500 systems for over a decade
– From System-X at Virginia Tech (3rd in Nov 2003, 2,200 processors, 12.25 TFlops)
– to Sunway TaihuLight (1st in Jun '16, 10M cores, 100 PFlops)
MVAPICH2 Release Timeline and Downloads
[Figure: cumulative downloads from the OSU site (0 to 450,000) vs. time (Sep '04 to Mar '17), annotated with releases: MV 0.9.4, MV2 0.9.0, MV2 0.9.8, MV2 1.0, MV 1.0, MV2 1.0.3, MV 1.1, MV2 1.4, MV2 1.5, MV2 1.6, MV2 1.7, MV2 1.8, MV2 1.9, MV2 2.1, MV2-GDR 2.0b, MV2-MIC 2.0, MV2-Virt 2.1rc2, MV2-GDR 2.2rc1, MV2-X 2.2, MV2 2.3a]
MVAPICH2 Software Family
Requirements: MVAPICH2 library to use
MPI with IB, iWARP and RoCE: MVAPICH2
Advanced MPI, OSU INAM, PGAS and MPI+PGAS with IB and RoCE: MVAPICH2-X
MPI with IB & GPU: MVAPICH2-GDR
MPI with IB & MIC: MVAPICH2-MIC
HPC Cloud with MPI & IB: MVAPICH2-Virt
Energy-aware MPI with IB, iWARP and RoCE: MVAPICH2-EA
Architecture of MVAPICH2 Software Family
High Performance Parallel Programming Models: Message Passing Interface (MPI), PGAS (UPC, OpenSHMEM, CAF, UPC++), and Hybrid MPI + X (MPI + PGAS + OpenMP/Cilk)
High Performance and Scalable Communication Runtime with Diverse APIs and Mechanisms: point-to-point primitives, collectives algorithms, energy-awareness, remote memory access, I/O and file systems, fault tolerance, virtualization, active messages, job startup, and introspection & analysis
Support for Modern Networking Technology (InfiniBand, iWARP, RoCE, Omni-Path) and Modern Multi-/Many-core Architectures (Intel Xeon, OpenPOWER, Xeon Phi (MIC, KNL), NVIDIA GPGPU)
Transport protocols: RC, XRC, UD, DC
Modern features: UMR, ODP, SR-IOV, multi-rail
Transport mechanisms: shared memory, CMA, IVSHMEM
Upcoming features: MCDRAM*, NVLink*, CAPI* (* upcoming)
Optimizing MPI Data Movement on GPU Clusters
[Diagram: Node 0 and Node 1, each with two CPUs connected by QPI, GPUs attached via PCIe, an IB adapter, and memory buffers at each level]
Communication paths:
1. Intra-GPU
2. Intra-socket GPU-GPU
3. Inter-socket GPU-GPU
4. Inter-node GPU-GPU
5. Intra-socket GPU-Host
6. Inter-socket GPU-Host
7. Inter-node GPU-Host
8. Inter-node GPU-GPU with the IB adapter on the remote socket, and more...
• GPUs are connected as PCIe devices: flexibility, but complexity
• For each path there are different schemes: shared memory, IPC, GPUDirect RDMA, pipelining, ...
• It is critical for runtimes to optimize data movement while hiding this complexity
GPU-Aware (CUDA-Aware) MPI Library: MVAPICH2-GPU
At Sender: MPI_Send(s_devbuf, size, ...);
At Receiver: MPI_Recv(r_devbuf, size, ...);
All data movement is handled inside MVAPICH2.
• Standard MPI interfaces used for unified data movement
• Takes advantage of Unified Virtual Addressing (CUDA 4.0 and later)
• Overlaps data movement from the GPU with RDMA transfers
High performance and high productivity
CUDA-Aware MPI: MVAPICH2-GDR 1.8-2.3 Releases
• Support for MPI communication from NVIDIA GPU device memory
• High-performance RDMA-based inter-node point-to-point communication (GPU-GPU, GPU-Host and Host-GPU)
• High-performance intra-node point-to-point communication for multi-GPU adapters/node (GPU-GPU, GPU-Host and Host-GPU)
• Taking advantage of CUDA IPC (available since CUDA 4.1) in intra-node communication for multiple GPU adapters/node
• Optimized and tuned collectives for GPU device buffers
• MPI datatype support for point-to-point and collective communication from GPU device buffers
Installing MVAPICH2-GDR
• MVAPICH2-2.2 with GDR support can be downloaded from https://mvapich.cse.ohio-state.edu/download/mvapich2gdr/
• Please select the best-matching package for your system. Most common combinations are covered; if you do not find a match, email OSU with the following details:
– OS version
– OFED version
– CUDA version
– Compiler (GCC, Intel or PGI)
• Install instructions
• With root permissions:
– Default path: rpm -Uvh --nodeps mvapich2-gdr-cuda7.0-gnu-2.2-0.3.el6.x86_64.rpm
– Specific path: rpm --prefix /custom/install/prefix -Uvh --nodeps mvapich2-gdr-cuda7.0-gnu-2.2-0.3.el6.x86_64.rpm
• Without root permissions:
– rpm2cpio mvapich2-gdr-cuda7.0-gnu-2.2-0.3.el6.x86_64.rpm | cpio -id
• For more details on the installation process, refer to:
http://mvapich.cse.ohio-state.edu/userguide/gdr/2.2#_installing_mvapich2_gdr_library
Performance of MVAPICH2-GPU with GPU-Direct RDMA (GDR)
[Charts: GPU-GPU inter-node latency, bandwidth, and bi-bandwidth vs. message size for MV2-GDR 2.2, MV2-GDR 2.0b, and MV2 without GDR; annotations show 2.18 us small-message latency and improvements of up to 10-11x over MV2 without GDR and 2-3x over MV2-GDR 2.0b]
Testbed: MVAPICH2-GDR 2.2, Intel Ivy Bridge (E5-2680 v2) node with 20 cores, NVIDIA Tesla K40c GPU, Mellanox ConnectX-4 EDR HCA, CUDA 8.0, Mellanox OFED 3.0 with GPUDirect RDMA
Application-Level Evaluation (HOOMD-blue)
• Platform: Wilkes (Intel Ivy Bridge + NVIDIA Tesla K20c + Mellanox Connect-IB)
• HOOMD-blue version 1.0.5
• GDRCOPY enabled: MV2_USE_CUDA=1 MV2_IBA_HCA=mlx5_0 MV2_IBA_EAGER_THRESHOLD=32768 MV2_VBUF_TOTAL_SIZE=32768 MV2_USE_GPUDIRECT_LOOPBACK_LIMIT=32768 MV2_USE_GPUDIRECT_GDRCOPY=1 MV2_USE_GPUDIRECT_GDRCOPY_LIMIT=16384
[Charts: average time steps per second (TPS) vs. number of processes (4-32) for MV2 and MV2+GDR, with 64K-particle and 256K-particle workloads; MV2+GDR delivers about 2x the TPS in both cases]
Full and Efficient MPI-3 RMA Support
[Chart: small-message latency (us) vs. message size (0-8K bytes); 2.88 us small-message latency, about 6x improvement]
Testbed: MVAPICH2-GDR 2.2, Intel Ivy Bridge (E5-2680 v2) node with 20 cores, NVIDIA Tesla K40c GPU, Mellanox Connect-IB dual-FDR HCA, CUDA 7, Mellanox OFED 2.4 with GPUDirect RDMA
Performance of MVAPICH2-GDR with GPU-Direct RDMA and Multi-Rail Support
[Chart: GPU-GPU inter-node MPI uni-directional bandwidth (MB/s) vs. message size (1 byte-4 MB), MV2-GDR 2.1 vs. MV2-GDR 2.1 RC2; peak 8,759 MB/s, about 20% improvement]
[Chart: GPU-GPU inter-node bi-directional bandwidth (MB/s) vs. message size (1 byte-4 MB); peak 15,111 MB/s, about 40% improvement]
Testbed: MVAPICH2-GDR 2.2b, Intel Ivy Bridge (E5-2680 v2) node with 20 cores, NVIDIA Tesla K40c GPU, Mellanox Connect-IB dual-FDR HCA, CUDA 7, Mellanox OFED 2.4 with GPUDirect RDMA
Non-Blocking Collectives (NBC) using CORE-Direct Offload
• MPI NBC decouples the initiation (Ialltoall) and completion (Wait) phases and provides overlap potential (Ialltoall + compute + Wait), but the CPU largely drives progress during Wait (resulting in near-zero overlap)
• The CORE-Direct feature provides true overlap by allowing a priori specification of a list of network tasks that the NIC, rather than the CPU, progresses (freeing the CPU)
• We propose a design that combines GPUDirect RDMA and CORE-Direct to efficiently support CUDA-Aware NBC collectives on GPU buffers
– Overlap communication with CPU computation
– Overlap communication with GPU computation
• Extend OMB with CUDA-Aware NBC benchmarks to evaluate
– Latency
– Overlap on both CPU and GPU
A. Venkatesh, K. Hamidouche, H. Subramoni, and D. K. Panda, Offloaded GPU Collectives using CORE-Direct and CUDA Capabilities on IB Clusters, HiPC 2015
CUDA-Aware Non-Blocking Collectives
[Charts: overlap (%) vs. message size (4K-1M bytes) on 64 GPU nodes for Ialltoall and Igather, each with 1 process/node and with 2 processes/node (1 process/GPU)]
Platform: Wilkes (Intel Ivy Bridge + NVIDIA Tesla K20c + Mellanox Connect-IB)
Available since MVAPICH2-GDR 2.2b
A. Venkatesh, K. Hamidouche, H. Subramoni, and D. K. Panda, Offloaded GPU Collectives using CORE-Direct and CUDA Capabilities on IB Clusters, HiPC 2015
Non-contiguous Data Exchange
• Multi-dimensional data
– Row-based organization
– Contiguous in one dimension
– Non-contiguous in the other dimensions
• Halo data exchange
– Duplicate the boundary
– Exchange the boundary in each iteration
MPI Datatype Support in MVAPICH2
• Datatype support in MPI
– Operate on customized datatypes to improve productivity
– Enable the MPI library to optimize non-contiguous data
At Sender:
MPI_Type_vector(n_blocks, n_elements, stride, old_type, &new_type);
MPI_Type_commit(&new_type);
…
MPI_Send(s_buf, size, new_type, dest, tag, MPI_COMM_WORLD);
• Inside MVAPICH2
– Uses datatype-specific CUDA kernels to pack data in chunks
– Efficiently moves data between nodes using RDMA
– In progress: currently optimizes vector and hindexed datatypes
– Transparent to the user
H. Wang, S. Potluri, D. Bureddy, C. Rosales and D. K. Panda, GPU-aware MPI on RDMA-Enabled Clusters: Design, Implementation and Evaluation, IEEE Transactions on Parallel and Distributed Systems, Vol. 25, No. 10, pp. 2595-2605, Oct 2014.
MPI Datatype Processing (Computation Optimization)
• Comprehensive support
– Targeted kernels for regular datatypes: vector, subarray, indexed_block
– Generic kernels for all other irregular datatypes
• Separate non-blocking stream for kernels launched by the MPI library
– Avoids stream conflicts with application kernels
• Flexible set of parameters for users to tune kernels
– Vector
• MV2_CUDA_KERNEL_VECTOR_TIDBLK_SIZE
• MV2_CUDA_KERNEL_VECTOR_YSIZE
– Subarray
• MV2_CUDA_KERNEL_SUBARR_TIDBLK_SIZE
• MV2_CUDA_KERNEL_SUBARR_XDIM
• MV2_CUDA_KERNEL_SUBARR_YDIM
• MV2_CUDA_KERNEL_SUBARR_ZDIM
– Indexed_block
• MV2_CUDA_KERNEL_IDXBLK_XDIM
MPI Datatype Processing (Communication Optimization)
Common scenario (wastes computing resources on both CPU and GPU):
MPI_Isend(A, ..., datatype, ...);
MPI_Isend(B, ..., datatype, ...);
MPI_Isend(C, ..., datatype, ...);
MPI_Isend(D, ..., datatype, ...);
…
MPI_Waitall(…);
*A, B, ... contain non-contiguous MPI datatypes
Application-Level Evaluation (COSMO) and Weather Forecasting in Switzerland
[Charts: normalized execution time vs. number of GPUs for Default, Callback-based and Event-based designs, on the CSCS GPU cluster (16-96 GPUs) and the Wilkes GPU cluster (4-32 GPUs)]
• 2x improvement on 32 GPU nodes
• 30% improvement on 96 GPU nodes (8 GPUs/node)
C. Chu, K. Hamidouche, A. Venkatesh, D. Banerjee, H. Subramoni, and D. K. Panda, Exploiting Maximal Overlap for Non-Contiguous Data Movement Processing on Modern GPU-enabled Systems, IPDPS '16
Ongoing collaboration with CSCS and MeteoSwiss (Switzerland) on co-designing MV2-GDR and the COSMO application
COSMO model: http://www2.cosmo-model.org/content/tasks/operational/meteoSwiss/
Initial (Basic) Support for GPU Managed Memory
● In CUDA 6.0, NVIDIA introduced CUDA Managed (or Unified) Memory, which allows a common memory allocation for GPU and CPU through the cudaMallocManaged() call
● Significant productivity benefits due to the abstraction of explicit allocation and cudaMemcpy()
● Extended MVAPICH2 to perform communication directly from managed buffers (available since MVAPICH2-GDR 2.2b)
● OSU Micro-Benchmarks extended to evaluate the performance of point-to-point and collective communication using managed buffers (available since OMB 5.2)
[Chart: 2D stencil halo exchange time (ms) vs. total dimension size (1 byte-16K bytes) for Device and Managed buffers, halo width = 1]
D. S. Banerjee, K. Hamidouche, and D. K. Panda, Designing High Performance Communication Runtime for GPU Managed Memory: Early Experiences, GPGPU-9 Workshop, held in conjunction with PPoPP '16
Enhanced Support for Intra-node Managed Memory
● CUDA Managed Memory implies no memory pin-down:
– No IPC support for intra-node communication
– No GDR support for inter-node communication
● Initial and basic support in MVAPICH2-GDR
– For both intra- and inter-node, "pipeline through" host memory
● Enhanced intra-node managed memory support uses IPC
– Double-buffered, pair-wise IPC-based scheme
– Brings IPC performance to managed memory
– High performance and high productivity
– 2.5x improvement in bandwidth
● Available since MVAPICH2-GDR 2.2RC1
[Charts: latency (us) for 4K-4M bytes and bandwidth (MB/s) for 32K-2M bytes, Enhanced vs. MV2-GDR 2.2b; the enhanced design delivers up to 2.5x higher bandwidth]
Enhanced Support for Inter-node Managed Memory
● Enhance inter-node managed memory to use GDR
● SGL-based scheme:
– Eager pre-allocated GPU VBUFs
– Register the GPU VBUFs with the HCA
– Scatter-Gather List to combine control and data messages
– Brings GDR performance to managed memory
– High performance and high productivity
● Up to 25% improvement for small and medium messages (micro-benchmark)
● Up to 1.92x improvement for the GPU-LBM application
● Will be available in a future MVAPICH2-GDR release
K. Hamidouche, A. A. Awan, A. Venkatesh and D. K. Panda, "CUDA M3: Designing Efficient CUDA Managed Memory-Aware MPI by Exploiting GDR and IPC," 2016 IEEE 23rd International Conference on High Performance Computing (HiPC), Hyderabad, 2016
RoCE and Optimized Collectives Support
• RoCE v1 and v2 support
• RDMA_CM connection support
• CUDA-Aware collective tuning
– Point-to-point tuning (available since MVAPICH2-GDR 2.0)
• Tuned thresholds for the different communication patterns and features
• Depending on the system configuration (CPU, HCA and GPU models)
– Tuning framework for GPU-based collectives
• Selects the best algorithm depending on message size, system size and system configuration
• Support for Bcast and Gather operations on different GDR-enabled systems
• Available since the MVAPICH2-GDR 2.2RC1 release
Overview of the GPUDirect Async (GDS) Feature: Current MPI+CUDA Interaction
CUDA_Kernel_a<<<...>>>(A, ..., stream1)
cudaStreamSynchronize(stream1)
MPI_Isend(A, ..., req1)
MPI_Wait(req1)
CUDA_Kernel_b<<<...>>>(B, ..., stream1)
[Diagram: GPU, CPU and HCA timelines under 100% CPU control]
• Limits the throughput of a GPU
• Limits asynchronous progress
• Wastes CPU cycles
MVAPICH2-GDS: Decouple GPU Control Flow from CPU
CUDA_Kernel_a<<<...>>>(A, ..., stream1)
MPI_Isend(A, ..., req1, stream1)
MPI_Wait(req1, stream1)  (non-blocking from the CPU)
CUDA_Kernel_b<<<...>>>(B, ..., stream1)
[Diagram: GPU, CPU and HCA timelines; the kernel launch overhead is hidden]
The CPU offloads the compute, communication and synchronization tasks to the GPU:
• CPU is out of the critical path
• Tight interaction between GPU and HCA
• Hides the overhead of kernel launch
• Requires MPI semantics extensions: all operations are asynchronous from the CPU, and MPI semantics are extended with stream-based semantics
MVAPICH2-GDS: Preliminary Micro-benchmark Results
[Chart: latency-oriented (Send+kernel and Recv+kernel) latency (us) vs. message size (8-512 bytes), Default vs. Enhanced-GDS]
[Chart: throughput-oriented (back-to-back) latency (us) vs. message size (16-1024 bytes), Default vs. Enhanced-GDS]
• Latency-oriented: able to hide the kernel launch overhead
– 25% improvement at 256 bytes compared to the default behavior
• Throughput-oriented: asynchronously offloads the communication and computation tasks to the queue
– 14% improvement at 1KB message size
– Requires some tuning; better performance is expected for applications with different kernels
Testbed: Intel Sandy Bridge, NVIDIA K20 and Mellanox FDR HCA
MVAPICH2-GDS: Results with 1D Stencil Kernels
[Charts: point-to-point (kernels+Send and Recv+kernels, 2 bytes-8K bytes) and collective (Broadcast+kernels, 16K-64K bytes) latency (us) vs. message size, Default MPI vs. MPI-GDS; annotations show 1.4x and 1.8x improvements]
• Point-to-point: hides the kernel launch overhead
– 30% improvement at 16K bytes compared to traditional MPI+CUDA programs
• Collective: overlaps computation with GPU communication (GDS)
– 36% improvement at 64K bytes compared to traditional MPI+CUDA programs
Testbed: Intel Broadwell CPU, NVIDIA K80 and Mellanox EDR HCA
Akshay Venkatesh, Ching-Hsiang Chu, Khaled Hamidouche, Sreeram Potluri, Davide Rossetti and Dhabaleswar K. Panda, MPI-GDS: High Performance MPI Designs with GPUDirect-aSync for CPU-GPU Control Flow Decoupling, ICPP 2017, Bristol, UK, Aug 14-17, 2017 (accepted)
Deep Learning Frameworks
• Many Deep Learning (DL) frameworks have emerged
– Berkeley Caffe
– Google TensorFlow
– Microsoft CNTK
– Facebook Torch
– Facebook Caffe2
• Broadly, DL frameworks are being developed along two directions
1. The HPC eco-system: MPI-based Deep Learning
2. The enterprise eco-system: Big Data-based Deep Learning
Courtesy: https://www.microway.com/hpc-tech-tips/deep-learning-frameworks-survey-tensorflow-torch-theano-caffe-neon-ibm-machine-learning-stack/
Parallel Training: Scale-up and Scale-out
• Scale-up: intra-node performance
– Many improvements, e.g. NVIDIA cuDNN, cuBLAS, NCCL
• Scale-out: inter-node performance
– Many DL frameworks are single-node only
– Distributed (parallel) training is an emerging trend
• S-Caffe / OSU-Caffe: MPI-based
• Microsoft CNTK: MPI-based
• Google TensorFlow: gRPC-based
• Facebook Caffe2: hybrid
[Diagram: scale-up performance (cuDNN, cuBLAS, NCCL) vs. scale-out performance (MPI, gRPC, Hadoop)]
Broad Challenge
How can we efficiently scale out a Deep Learning (DL) framework and take advantage of heterogeneous High Performance Computing (HPC) resources?
New Challenges for MPI Runtimes
• Deep Learning frameworks are a different game altogether
– Unusually large message sizes (order of megabytes)
– Most communication based on GPU buffers
• How to address these newer requirements?
– GPU-specific communication libraries: NVIDIA's NCCL provides inter-GPU communication
– CUDA-Aware MPI (MVAPICH2-GDR) provides support for GPU-based communication
• Can we exploit CUDA-Aware MPI and NCCL together to support Deep Learning applications?
– Hierarchical communication (knomial inter-node + NCCL intra-node ring)
Efficient Broadcast: MVAPICH2-GDR and NCCL
• NCCL has some limitations
– Works only within a single node; thus, no scale-out across multiple nodes
– Degradation across the IOH (socket) for scale-up within a node
• We propose an optimized MPI_Bcast for
– Communication of very large GPU buffers (order of megabytes)
– Scale-out on a large number of dense multi-GPU nodes
• Hierarchical communication that efficiently exploits:
– CUDA-Aware MPI_Bcast in MV2-GDR
– The NCCL broadcast primitive
Performance benefits: 25% average improvement for the Microsoft CNTK DL framework
[Chart: time (seconds) vs. number of GPUs (2-64) in OSU micro-benchmarks, MV2-GDR vs. MV2-GDR-Opt; up to 2.2x improvement]
Efficient Large Message Broadcast using NCCL and CUDA-Aware MPI for Deep Learning, A. Awan, K. Hamidouche, A. Venkatesh, and D. K. Panda, The 23rd European MPI Users' Group Meeting (EuroMPI 16), Sep 2016 [Best Paper Runner-Up]
Large Message Optimized Collectives for Deep Learning
• MV2-GDR provides optimized collectives for large message sizes
• Optimized Reduce, Allreduce, and Bcast
• Good scaling with a large number of GPUs
• Available with MVAPICH2-GDR 2.2 GA
[Charts: latency (ms) vs. message size (2-128 MB) for Reduce on 192 GPUs, Allreduce on 64 GPUs, and Bcast on 64 GPUs; and latency (ms) vs. number of GPUs for Reduce at 64 MB (128-192 GPUs), Allreduce at 128 MB (16-64 GPUs), and Bcast at 128 MB (16-64 GPUs)]
OSU-Caffe: Scalable Deep Learning
• Caffe: a flexible and layered Deep Learning framework
• Benefits and weaknesses
– Multi-GPU training within a single node
– Performance degradation for GPUs across different sockets
– Limited scale-out
• OSU-Caffe: MPI-based parallel training
– Enables scale-up (within a node) and scale-out (across multi-GPU nodes)
– GoogLeNet network on the ImageNet dataset
[Chart: training time (seconds) vs. number of GPUs (8-128) for Caffe, OSU-Caffe (1024) and OSU-Caffe (2048), GoogLeNet on ImageNet; Caffe at the largest scale is an invalid use case]
OSU-Caffe is publicly available from: http://hidl.cse.ohio-state.edu
A. A. Awan, K. Hamidouche, J. Hashmi, and D. K. Panda, S-Caffe: Co-designing MPI Runtimes and Caffe for Scalable Deep Learning on Modern GPU Clusters, PPoPP 2017
OpenACC-Aware MPI
• acc_malloc to allocate device memory
– No changes to MPI calls
– MVAPICH2 detects the device pointer and optimizes data movement
• acc_deviceptr to get the device pointer (in OpenACC 2.0)
– Enables MPI communication from memory allocated by the compiler when it is available in OpenACC 2.0 implementations
– MVAPICH2 detects the device pointer and optimizes communication
• Delivers the same performance as with CUDA

Example with acc_malloc:
A = acc_malloc(sizeof(int) * N);
…
#pragma acc parallel loop deviceptr(A)
/* compute loop */
MPI_Send(A, N, MPI_INT, 0, 1, MPI_COMM_WORLD);
…
acc_free(A);

Example with acc_deviceptr:
A = malloc(sizeof(int) * N);
…
#pragma acc data copyin(A)
{
#pragma acc parallel loop
/* compute loop */
MPI_Send(acc_deviceptr(A), N, MPI_INT, 0, 1, MPI_COMM_WORLD);
}
…
free(A);
Conclusions
• MVAPICH2 optimizes MPI communication on InfiniBand and RoCE clusters with GPUs
• Provides optimized designs for point-to-point two-sided and one-sided communication, datatype processing and collective operations
• Efficient and maximal overlap for MPI-3 NBC collectives
• Delivers high performance and high productivity with support for the latest NVIDIA GPUs and InfiniBand/RoCE adapters
• Looking forward to next-generation designs with GPUDirect Async (GDS) and application domains like Deep Learning
• Users are strongly encouraged to use the latest MVAPICH2-GDR release to obtain all features and performance benefits
A Follow-up Talk on PGAS/OpenSHMEM
• S7324 - Bringing NVIDIA GPUs to the PGAS/OpenSHMEM World: Challenges and Solutions
– Day: Today, 05/11
– Time: 15:00 - 15:25
– Location: Room 211B
Acknowledgments
Dr. Davide Rossetti and Dr. Sreeram Potluri
Filippo Spiga and Stuart Rankin, HPCS, University of Cambridge (Wilkes Cluster)
Thank You!
The MVAPICH2 Project: http://mvapich.cse.ohio-state.edu/
The High-Performance Deep Learning Project: http://hidl.cse.ohio-state.edu/
Network-Based Computing Laboratory: http://nowlab.cse.ohio-state.edu/
panda@cse.ohio-state.edu, subramon@cse.ohio-state.edu