
Exploiting Latest Networking and Accelerator Technologies for MPI, Streaming, and Deep Learning:

An MVAPICH2-Based Approach

Dhabaleswar K. (DK) Panda

The Ohio State University

E-mail: [email protected]

http://www.cse.ohio-state.edu/~panda

Talk at NRL booth (SC ‘17)



Drivers of Modern HPC Cluster Architectures

• Multi-core/many-core technologies

• Remote Direct Memory Access (RDMA)-enabled networking (InfiniBand and RoCE)

• Solid State Drives (SSDs), Non-Volatile Random-Access Memory (NVRAM), NVMe-SSD

• Accelerators (NVIDIA GPGPUs and Intel Xeon Phi)

• Available on HPC Clouds, e.g., Amazon EC2, NSF Chameleon, Microsoft Azure, etc.

Figure: building blocks of modern HPC clusters – multi-core processors; high-performance interconnects (InfiniBand, <1 usec latency, 100 Gbps bandwidth); accelerators/coprocessors (high compute density, high performance/watt, >1 TFlop DP on a chip); SSD, NVMe-SSD, NVRAM. Example systems: Tianhe-2, Titan, K Computer, Sunway TaihuLight.


Overview of the MVAPICH2 Project

• High Performance open-source MPI Library for InfiniBand, Omni-Path, Ethernet/iWARP, and RDMA over Converged Ethernet (RoCE)

– MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.0), Started in 2001, First version available in 2002

– MVAPICH2-X (MPI + PGAS), Available since 2011

– Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), Available since 2014

– Support for Virtualization (MVAPICH2-Virt), Available since 2015

– Support for Energy-Awareness (MVAPICH2-EA), Available since 2015

– Support for InfiniBand Network Analysis and Monitoring (OSU INAM) since 2015

– Used by more than 2,825 organizations in 85 countries

– More than 433,000 (> 0.4 million) downloads from the OSU site directly

– Empowering many TOP500 clusters (June ‘17 ranking)

• 1st, 10,649,600-core (Sunway TaihuLight) at National Supercomputing Center in Wuxi, China

• 15th, 241,108-core (Pleiades) at NASA

• 20th, 462,462-core (Stampede) at TACC

• 44th, 74,520-core (Tsubame 2.5) at Tokyo Institute of Technology

– Available with software stacks of many vendors and Linux Distros (RedHat and SuSE)

– http://mvapich.cse.ohio-state.edu

• Empowering Top500 systems for over a decade

– System-X from Virginia Tech (3rd in Nov 2003, 2,200 processors, 12.25 TFlops) -> Sunway TaihuLight (1st in Jun ‘17, 10M cores, 100 PFlops)


Figure: MVAPICH2 Release Timeline and Downloads – cumulative number of downloads (0 to ~500,000) from Sep-04 through Aug-17, with release markers from MV 0.9.4 and MV2 0.9.0 through MV2 2.3b, MV2-X 2.3b, MV2-GDR 2.3a, MV2-MIC 2.0, and MV2-Virt 2.2.


MVAPICH2 Architecture

High Performance Parallel Programming Models: Message Passing Interface (MPI), PGAS (UPC, OpenSHMEM, CAF, UPC++), and Hybrid MPI + X (MPI + PGAS + OpenMP/Cilk)

High Performance and Scalable Communication Runtime with Diverse APIs and Mechanisms: point-to-point primitives, collectives algorithms, energy-awareness, remote memory access, I/O and file systems, fault tolerance, virtualization, active messages, job startup, introspection & analysis

Support for Modern Networking Technology (InfiniBand, iWARP, RoCE, Omni-Path) and Modern Multi-/Many-core Architectures (Intel Xeon, OpenPOWER, Xeon Phi (MIC, KNL*), NVIDIA GPGPU)

– Transport protocols: RC, XRC, UD, DC
– Modern features: UMR, ODP*, SR-IOV, multi-rail
– Transport mechanisms: shared memory, CMA, IVSHMEM
– Modern features: MCDRAM*, NVLink*, CAPI*

* Upcoming


MVAPICH2 Software Family

High-Performance Parallel Programming Libraries
• MVAPICH2 – Support for InfiniBand, Omni-Path, Ethernet/iWARP, and RoCE
• MVAPICH2-X – Advanced MPI features, OSU INAM, PGAS (OpenSHMEM, UPC, UPC++, and CAF), and MPI+PGAS programming models with unified communication runtime
• MVAPICH2-GDR – Optimized MPI for clusters with NVIDIA GPUs
• MVAPICH2-Virt – High-performance and scalable MPI for hypervisor- and container-based HPC cloud
• MVAPICH2-EA – Energy-aware and high-performance MPI
• MVAPICH2-MIC – Optimized MPI for clusters with Intel KNC

Microbenchmarks
• OMB – Microbenchmark suite to evaluate MPI and PGAS (OpenSHMEM, UPC, and UPC++) libraries for CPUs and GPUs

Tools
• OSU INAM – Network monitoring, profiling, and analysis for clusters with MPI and scheduler integration
• OEMT – Utility to measure the energy consumption of MPI applications


Outline
• MVAPICH2-GPU with GPUDirect RDMA (GDR)
• What’s new with MVAPICH2-GDR
  – Maximal overlap in MPI Datatype Processing
  – Efficient Support for Managed Memory
  – Support for OpenPOWER and NVLink
  – Initial support for GPUDirect Async feature
• Streaming Support with IB Multicast and GDR
• High-Performance Deep Learning with MVAPICH2-GDR
• Conclusions


Optimizing MPI Data Movement on GPU Clusters

Figure: two nodes, each with two CPUs (connected by QPI), multiple GPUs attached over PCIe, and an IB HCA. Possible memory-buffer paths:
1. Intra-GPU
2. Intra-Socket GPU-GPU
3. Inter-Socket GPU-GPU
4. Inter-Node GPU-GPU
5. Intra-Socket GPU-Host
6. Inter-Socket GPU-Host
7. Inter-Node GPU-Host
8. Inter-Node GPU-GPU with IB adapter on remote socket
... and more

• Connected as PCIe devices – flexibility but complexity
• For each path, different schemes: shared memory, IPC, GPUDirect RDMA, pipeline, ...
• Critical for runtimes to optimize data movement while hiding the complexity


GPU-Aware (CUDA-Aware) MPI Library: MVAPICH2-GPU

At Sender: MPI_Send(s_devbuf, size, …);
At Receiver: MPI_Recv(r_devbuf, size, …);
(data movement is handled inside MVAPICH2)

• Standard MPI interfaces used for unified data movement
• Takes advantage of Unified Virtual Addressing (>= CUDA 4.0)
• Overlaps data movement from GPU with RDMA transfers

High Performance and High Productivity
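To make the calling convention concrete, the following is a minimal sketch of a two-rank device-buffer exchange with a CUDA-aware MPI library such as MVAPICH2-GDR; the buffer size and tag are illustrative assumptions, not values from the slides.

    /* cuda_aware_pingpong.c -- sketch: MPI from GPU device memory (CUDA-aware MPI build assumed) */
    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int n = 1 << 20;                                   /* illustrative message size (floats) */
        float *devbuf;
        cudaMalloc((void **)&devbuf, n * sizeof(float));         /* device memory, passed directly to MPI */

        if (rank == 0)
            MPI_Send(devbuf, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
        else if (rank == 1)
            MPI_Recv(devbuf, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        /* The runtime chooses the path (IPC, GPUDirect RDMA, pipelined copies) internally. */

        cudaFree(devbuf);
        MPI_Finalize();
        return 0;
    }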


CUDA-Aware MPI: MVAPICH2-GDR 1.8–2.3 Releases
• Support for MPI communication from NVIDIA GPU device memory
• High performance RDMA-based inter-node point-to-point communication (GPU-GPU, GPU-Host, and Host-GPU)
• High performance intra-node point-to-point communication for multi-GPU adapters/node (GPU-GPU, GPU-Host, and Host-GPU)
• Taking advantage of CUDA IPC (available since CUDA 4.1) in intra-node communication for multiple GPU adapters/node
• Optimized and tuned collectives for GPU device buffers
• MPI datatype support for point-to-point and collective communication from GPU device buffers


Optimized MVAPICH2-GDR Design

Figure: OSU micro-benchmark results for small messages, MVAPICH2-GDR 2.3a (MV2-GDR-2.3a) vs. MVAPICH2 without GDR (MV2-(NO-GDR)):
• GPU-GPU inter-node latency: 1.88 us, about 11X lower than without GDR
• GPU-GPU inter-node bandwidth: about 9X higher with GDR
• GPU-GPU inter-node bi-directional bandwidth: about 10X higher with GDR

Platform: Intel Haswell (E5-2687W @ 3.10 GHz) node with 20 cores, NVIDIA Volta V100 GPU, Mellanox ConnectX-4 EDR HCA, CUDA 9.0, Mellanox OFED 4.0 with GPUDirect RDMA.


Application-Level Evaluation (HOOMD-blue)
• Platform: Wilkes (Intel Ivy Bridge + NVIDIA Tesla K20c + Mellanox Connect-IB)
• HOOMD-blue version 1.0.5
• GDRCOPY enabled: MV2_USE_CUDA=1 MV2_IBA_HCA=mlx5_0 MV2_IBA_EAGER_THRESHOLD=32768 MV2_VBUF_TOTAL_SIZE=32768 MV2_USE_GPUDIRECT_LOOPBACK_LIMIT=32768 MV2_USE_GPUDIRECT_GDRCOPY=1 MV2_USE_GPUDIRECT_GDRCOPY_LIMIT=16384

Figure: average time steps per second (TPS) vs. number of processes (4–32) for 64K-particle and 256K-particle HOOMD-blue runs; MV2+GDR achieves about 2X higher TPS than MV2 in both cases.


Application-Level Evaluation (Cosmo) and Weather Forecasting in Switzerland

Figure: normalized execution time vs. number of GPUs for Default, Callback-based, and Event-based designs on the CSCS GPU cluster (16–96 GPUs) and the Wilkes GPU cluster (4–32 GPUs).

• 2X improvement on 32 GPU nodes
• 30% improvement on 96 GPU nodes (8 GPUs/node)

C. Chu, K. Hamidouche, A. Venkatesh, D. Banerjee, H. Subramoni, and D. K. Panda, Exploiting Maximal Overlap for Non-Contiguous Data Movement Processing on Modern GPU-enabled Systems, IPDPS ’16

Ongoing collaboration with CSCS and MeteoSwiss (Switzerland) on co-designing MVAPICH2-GDR and the COSMO application

Cosmo model: http://www2.cosmo-model.org/content/tasks/operational/meteoSwiss/


Non-contiguous Data Exchange
• Multi-dimensional data
  – Row-based organization
  – Contiguous in one dimension
  – Non-contiguous in the other dimensions
• Halo data exchange
  – Duplicate the boundary
  – Exchange the boundary in each iteration

Figure: halo data exchange between neighboring sub-domains.


MPI Datatype Processing (Computation Optimization)

• Comprehensive support
  – Targeted kernels for regular datatypes – vector, subarray, indexed_block
  – Generic kernels for all other irregular datatypes
• Separate non-blocking stream for kernels launched by the MPI library
  – Avoids stream conflicts with application kernels
• Flexible set of parameters for users to tune kernels (a hedged usage sketch follows this list)
  – Vector: MV2_CUDA_KERNEL_VECTOR_TIDBLK_SIZE, MV2_CUDA_KERNEL_VECTOR_YSIZE
  – Subarray: MV2_CUDA_KERNEL_SUBARR_TIDBLK_SIZE, MV2_CUDA_KERNEL_SUBARR_XDIM, MV2_CUDA_KERNEL_SUBARR_YDIM, MV2_CUDA_KERNEL_SUBARR_ZDIM
  – Indexed_block: MV2_CUDA_KERNEL_IDXBLK_XDIM
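As a usage illustration of the datatype support above, here is a hedged sketch of exchanging one non-contiguous column of a 2D grid that lives in GPU memory; the grid layout, peer rank, and tag are assumptions, and the vector-kernel tuning parameters listed above would apply to the packing performed internally by the library.

    /* datatype_column.c -- sketch: strided (non-contiguous) halo column sent from a GPU buffer */
    #include <mpi.h>

    void exchange_column(float *dev_grid, int nx, int ny, int peer, MPI_Comm comm)
    {
        MPI_Datatype column;
        /* ny blocks of 1 float, separated by a full row of nx floats: one grid column */
        MPI_Type_vector(ny, 1, nx, MPI_FLOAT, &column);
        MPI_Type_commit(&column);

        /* Send the last interior column, receive into the halo column (both in device memory). */
        MPI_Sendrecv(dev_grid + (nx - 2), 1, column, peer, 0,
                     dev_grid + (nx - 1), 1, column, peer, 0,
                     comm, MPI_STATUS_IGNORE);

        MPI_Type_free(&column);
    }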


MPI Datatype Processing (Communication Optimization)

Common scenario (A, B, ... contain non-contiguous MPI datatypes):
  MPI_Isend(A, ..., datatype, ...);
  MPI_Isend(B, ..., datatype, ...);
  MPI_Isend(C, ..., datatype, ...);
  MPI_Isend(D, ..., datatype, ...);
  ...
  MPI_Waitall(...);

Waste of computing resources on CPU and GPU


Enhanced Support for GPU Managed Memory
● CUDA Managed => no memory pin-down
  ● No IPC support for intra-node communication
  ● No GDR support for inter-node communication
● Significant productivity benefits due to abstraction of explicit allocation and cudaMemcpy()
● Initial and basic support in MVAPICH2-GDR
  ● For both intra- and inter-node, use “pipeline through” host memory
● Enhanced intra-node managed memory to use IPC
  ● Double-buffering pair-wise IPC-based scheme
  ● Brings IPC performance to managed memory
  ● High performance and high productivity
  ● 2.5X improvement in bandwidth
● OMB extended to evaluate the performance of point-to-point and collective communication using managed buffers
● Available since MVAPICH2-GDR 2.2 (a hedged usage sketch follows this list)
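As noted above, a sketch of the managed-memory usage model follows; the message size and ranks are assumptions. A cudaMallocManaged() buffer is handed to MPI like any other buffer, and the library detects the memory type and selects a pipelined or IPC-based path.

    /* managed_sendrecv.c -- sketch: MPI communication from CUDA managed (unified) memory */
    #include <mpi.h>
    #include <cuda_runtime.h>

    void exchange_managed(int rank, MPI_Comm comm)
    {
        const int n = 1 << 22;                                                 /* illustrative message size */
        float *buf;
        cudaMallocManaged((void **)&buf, n * sizeof(float), cudaMemAttachGlobal);  /* no explicit cudaMemcpy() needed */

        if (rank == 0)
            MPI_Send(buf, n, MPI_FLOAT, 1, 0, comm);
        else if (rank == 1)
            MPI_Recv(buf, n, MPI_FLOAT, 0, 0, comm, MPI_STATUS_IGNORE);

        cudaFree(buf);
    }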

Figure: bandwidth (MB/s) for 32 KB – 2 MB managed-memory messages; the enhanced design achieves up to 2.5X higher bandwidth than MVAPICH2-GDR 2.2b.

D. S. Banerjee, K. Hamidouche, and D. K. Panda, Designing High Performance Communication Runtime for GPU Managed Memory: Early Experiences, GPGPU-9 Workshop, held in conjunction with PPoPP ‘16

Figure: 2D stencil halo-exchange time (ms) vs. total dimension size (1 B – 16 KB) for halo width 1, comparing Device and Managed buffers.


MVAPICH2-GDR: Performance on OpenPOWER (NVLink + Pascal)

Figure: intra-node latency (small and large messages) and bandwidth, comparing intra-socket (NVLink) and inter-socket paths, plus inter-node latency (small and large messages) and bandwidth. Key numbers:
• Intra-node bandwidth: 33.2 GB/sec (NVLink)
• Intra-node latency: 13.8 us (without GPUDirect RDMA)
• Inter-node latency: 23 us (without GPUDirect RDMA)
• Inter-node bandwidth: 6 GB/sec (FDR)

Platform: OpenPOWER (ppc64le) nodes equipped with a dual-socket CPU, 4 Pascal P100-SXM GPUs, and 4X-FDR InfiniBand interconnect

Available in MVAPICH2-GDR 2.3a


Overview of GPUDirect Async (GDS) Feature: Current MPI+CUDA Interaction

  CUDA_Kernel_a<<<...>>>(A, ..., stream1);
  cudaStreamSynchronize(stream1);
  MPI_Isend(A, ..., req1);
  MPI_Wait(req1);
  CUDA_Kernel_b<<<...>>>(B, ..., stream1);

(Figure: GPU/CPU/HCA timeline under CPU-driven control)

100% CPU control:
• Limits the throughput of a GPU
• Limits the asynchronous progress
• Wastes CPU cycles
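For reference, a compilable version of the CPU-driven pattern sketched above might look as follows (the kernel body, sizes, and peer rank are illustrative); note how the CPU must wait at every step before the next one can start.

    /* cpu_driven.cu -- the conventional MPI+CUDA flow that GPUDirect Async aims to streamline */
    #include <mpi.h>
    #include <cuda_runtime.h>

    __global__ void compute(float *a, int n)                    /* illustrative kernel */
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) a[i] *= 2.0f;
    }

    void step(float *dev_a, int n, int peer, cudaStream_t stream1)
    {
        MPI_Request req1;
        compute<<<(n + 255) / 256, 256, 0, stream1>>>(dev_a, n);
        cudaStreamSynchronize(stream1);             /* CPU blocks until the kernel finishes      */
        MPI_Isend(dev_a, n, MPI_FLOAT, peer, 0, MPI_COMM_WORLD, &req1);
        MPI_Wait(&req1, MPI_STATUS_IGNORE);         /* CPU drives communication completion       */
        compute<<<(n + 255) / 256, 256, 0, stream1>>>(dev_a, n);   /* next kernel launches only now */
    }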


MVAPICH2-GDS: Decouple GPU Control Flow from CPU

  CUDA_Kernel_a<<<...>>>(A, ..., stream1);
  MPI_Isend(A, ..., req1, stream1);
  MPI_Wait(req1, stream1);   /* non-blocking from the CPU */
  CUDA_Kernel_b<<<...>>>(B, ..., stream1);

CPU offloads the compute, communication, and synchronization tasks to the GPU:
• CPU is out of the critical path
• Tight interaction between GPU and HCA
• Hides the overhead of kernel launch
• Requires MPI semantics extensions
  – All operations are asynchronous from the CPU
  – Extends MPI semantics with stream-based semantics

Kernel launch overhead can be hidden


MVAPICH2-GDS: Preliminary Results

• Latency oriented (Kernel+Send and Recv+Kernel): able to hide the kernel launch overhead
  – 8–15% improvement compared to default behavior
• Overlap with host computation/communication: communication and computation tasks are offloaded asynchronously to the queue
  – 89% overlap with host computation at 128-byte message size

Intel Sandy Bridge, NVIDIA Tesla K40c, and Mellanox FDR HCA; CUDA 8.0, OFED 3.4; each kernel is ~50 us
Will be available in a public release soon

Figure: latency (us) and achieved overlap (%) for 1 B – 4 KB messages, comparing Default MPI with Enhanced MPI+GDS.


Streaming Applications
• Examples: surveillance, habitat monitoring, proton computed tomography (pCT), etc.
• Require efficient transport of data from/to distributed sources/sinks
• Sensitive to latency and throughput metrics
• Require HPC resources to efficiently carry out compute-intensive tasks

Src: http://www.symmetrymagazine.org/article/april-2012/proton-beam-on


Motivation
• Streaming applications on HPC systems
  1. Communication (MPI) – broadcast-type operations
  2. Computation (CUDA) – multiple GPU nodes as workers

Figure: a data source streams in real time to a sender node, which uses HPC resources (multiple CPU+GPU worker nodes) for real-time analytics via data streaming-like broadcast operations. A hedged worker-side sketch follows.
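To ground the worker-side pattern, here is a hedged sketch (the chunk size, number of chunks, and the analytics wrapper launch_analytics() are assumptions): rank 0 acts as the sender, and each incoming chunk is broadcast directly into the workers' GPU buffers with a CUDA-aware MPI_Bcast before being processed.

    /* streaming_worker.c -- sketch: broadcast-driven streaming analytics on GPU worker nodes */
    #include <mpi.h>
    #include <cuda_runtime.h>

    void launch_analytics(float *dev_chunk, int n, cudaStream_t s);   /* assumed application kernel wrapper */

    void stream_loop(int nchunks, int n, MPI_Comm comm)
    {
        float *dev_chunk;
        cudaStream_t s;
        cudaMalloc((void **)&dev_chunk, (size_t)n * sizeof(float));
        cudaStreamCreate(&s);

        for (int c = 0; c < nchunks; c++) {
            /* Rank 0 (the sender) has filled dev_chunk with the latest frame from the data source. */
            MPI_Bcast(dev_chunk, n, MPI_FLOAT, 0, comm);
            launch_analytics(dev_chunk, n, s);      /* process the chunk on the GPU */
            cudaStreamSynchronize(s);
        }
        cudaStreamDestroy(s);
        cudaFree(dev_chunk);
    }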


IB Multicast Example


Problem Statement
• Can we design a GPU broadcast and allreduce mechanism that can deliver low latency and high throughput for streaming applications?
• Can we combine GPUDirect RDMA (GDR) and IB-MCAST features to
  – Achieve the best performance and scalability?
  – Free up the Host-Device PCIe bandwidth for application needs?
• Can such a design be extended to support heterogeneous configurations (host-to-device)?
• Can we design an efficient MCAST-based broadcast for multi-GPU systems?
• Can we design efficient reliability support on top of the UD-based MCAST broadcast?
• Can we design an efficient MCAST-based allreduce for GPU systems?
• How can we demonstrate such benefits at benchmark and application level?


Related Publications

• Handling Efficient and Reliable Broadcast on Multi-GPU Clusters
  – C.-H. Chu, K. Hamidouche, H. Subramoni, A. Venkatesh, B. Elton, and D. K. Panda, “Designing High Performance Heterogeneous Broadcast for Streaming Applications on GPU Clusters,” SBAC-PAD ’16, Oct 2016.
  – C.-H. Chu, K. Hamidouche, H. Subramoni, A. Venkatesh, B. Elton, and D. K. Panda, “Efficient Reliability Support for Hardware Multicast-based Broadcast in GPU-enabled Streaming Applications,” COMHPC 2016 (SC Workshop), Nov 2016.
• Optimizing Broadcast for GPU-based Deep Learning
  – Ching-Hsiang Chu, Xiaoyi Lu, Ammar A. Awan, Hari Subramoni, Jahanzeb Hashmi, Bracy Elton, and Dhabaleswar K. Panda, “Efficient and Scalable Multi-Source Streaming Broadcast on GPU Clusters for Deep Learning,” ICPP ’17.
• High-Performance Broadcast with IB-MCAST and GDR
  – Ching-Hsiang Chu, Xiaoyi Lu, Ammar A. Awan, Hari Subramoni, Bracy Elton, and Dhabaleswar K. Panda, “Exploiting Hardware Multicast and GPUDirect RDMA for Efficient Broadcast,” submitted to IEEE TPDS (under review).


SL-based Design for Heterogeneous Configuration (Host-Device)

• Combining MCAST+GDR hardware features for heterogeneous configurations:
  – Source on the Host and destinations on the Device
  – SL design: Scatter at destination
    • Source: data and control on the Host
    • Destinations: data on the Device and control on the Host
  – Combines IB MCAST and GDR features at receivers
  – CUDA IPC-based topology-aware intra-node broadcast
  – Minimizes use of PCIe resources (maximizing availability of PCIe Host-Device resources)
• Available in MVAPICH2-GDR 2.3a

Figure: the source node multicasts data and control from host memory through the IB switch; on each destination node (Node 1 ... Node N), the IB HCA delivers control to the host CPU and scatters the data directly into GPU memory (IB SL step).

C.-H. Chu, K. Hamidouche, H. Subramoni, A. Venkatesh, B. Elton, and D. K. Panda, “Designing High Performance Heterogeneous Broadcast for Streaming Applications on GPU Clusters,” SBAC-PAD ’16, Oct 2016.


Scalability Evaluation of the Proposed Design
• Inter-node experiments @ Wilkes cluster, 32 GPUs, 1 GPU/node
  – 1 KB messages

Figure: broadcast latency (us) vs. system size (2–32 GPU nodes) for SL-MCAST, SGL-MCAST, and Host-MCAST.

C.-H. Chu, K. Hamidouche, H. Subramoni, A. Venkatesh, B. Elton, and D. K. Panda, “Designing High Performance Heterogeneous Broadcast for Streaming Applications on GPU Clusters,” SBAC-PAD ’16, Oct 2016.

• Maintains good scalability while yielding up to 64% reduction in latency


Benefits of the Availability of Host-Device PCIe Resources
• Mimic the behavior of streaming applications @ CSCS cluster, 88 GPUs, 8 NVIDIA K80 GPUs per node
  – Broadcast operations overlapped with application-level Host-Device transfers

Figure: achieved throughput (GB/s, higher is better) vs. message size (1 B – 4 MB) for IPC SL-MCAST and SHMEM SL-MCAST; up to 3.2X benefit.

C.-H. Chu, K. Hamidouche, H. Subramoni, A. Venkatesh, B. Elton, and D. K. Panda, “Designing High Performance Heterogeneous Broadcast for Streaming Applications on GPU Clusters,” SBAC-PAD ’16, Oct 2016.

• Maintain near-peak throughput over all message sizes


Efficient Broadcast: MVAPICH2-GDR and NCCL

• NCCL 1.x had some limitations
  – Only worked within a single node; no scale-out to multiple nodes
  – Degradation across the IOH (socket) for scale-up (within a node)
• We propose an optimized MPI_Bcast design that exploits NCCL [1]
  – Communication of very large GPU buffers
  – Scale-out on a large number of dense multi-GPU nodes
• Hierarchical communication that efficiently exploits:
  – CUDA-Aware MPI_Bcast in MV2-GDR
  – NCCL broadcast for intra-node transfers
• Can pure MPI-level designs achieve similar or better performance than the NCCL-based approach? [2] (a hedged sketch of the hierarchical idea follows the figure note)

1. A. A. Awan, K. Hamidouche, A. Venkatesh, and D. K. Panda, Efficient Large Message Broadcast using NCCL and CUDA-Aware MPI for Deep Learning, Proceedings of the 23rd European MPI Users' Group Meeting (EuroMPI 2016). [Best Paper Nominee]
2. A. A. Awan, C.-H. Chu, H. Subramoni, and D. K. Panda, Optimized Broadcast for Deep Learning Workloads on Dense-GPU InfiniBand Clusters: MPI or NCCL?, arXiv ’17 (https://arxiv.org/abs/1707.09414)

Figure: VGG training with CNTK.
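As referenced above, a minimal sketch of the two-level (inter-node + intra-node) broadcast idea follows, assuming the root is global rank 0. It uses plain MPI sub-communicators for both stages purely for illustration; the actual MVAPICH2-GDR designs use NCCL or tuned IPC/shared-memory algorithms for the intra-node stage.

    /* hier_bcast.c -- sketch: two-level broadcast of a (GPU) buffer, root assumed to be global rank 0 */
    #include <mpi.h>

    void hier_bcast(void *devbuf, int count, MPI_Datatype dtype, MPI_Comm comm)
    {
        int rank;
        MPI_Comm_rank(comm, &rank);

        /* Ranks on the same node share a communicator. */
        MPI_Comm node_comm, leader_comm;
        MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, rank, MPI_INFO_NULL, &node_comm);

        int node_rank;
        MPI_Comm_rank(node_comm, &node_rank);

        /* One leader per node (node_rank == 0) joins the inter-node communicator. */
        MPI_Comm_split(comm, node_rank == 0 ? 0 : MPI_UNDEFINED, rank, &leader_comm);

        if (node_rank == 0)
            MPI_Bcast(devbuf, count, dtype, 0, leader_comm);    /* stage 1: across node leaders */
        MPI_Bcast(devbuf, count, dtype, 0, node_comm);          /* stage 2: within each node    */

        if (leader_comm != MPI_COMM_NULL) MPI_Comm_free(&leader_comm);
        MPI_Comm_free(&node_comm);
    }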


Large Message Optimized Collectives for Deep Learning

• MV2-GDR provides optimized collectives for large message sizes
• Optimized Reduce, Allreduce, and Bcast
• Good scaling with large numbers of GPUs
• Available since MVAPICH2-GDR 2.2GA

Figure: latency (ms) of Reduce, Allreduce, and Bcast, shown both vs. message size (2–128 MB at 192, 64, and 64 GPUs, respectively) and vs. number of GPUs (Reduce 64 MB on 128–192 GPUs; Allreduce 128 MB and Bcast 128 MB on 16–64 GPUs).


MVAPICH2-GDR vs. Baidu-allreduce
• Initial evaluation shows promising performance gains for MVAPICH2-GDR 2.3a compared to Baidu-allreduce (a hedged usage sketch follows below)
• Available in MVAPICH2-GDR 2.3a!
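For context, the operation being compared is a CUDA-aware MPI_Allreduce over GPU-resident gradient buffers; the sketch below is illustrative only (the buffer, count, and final scaling step are assumptions), not the benchmark code behind the figure.

    /* grad_allreduce.c -- sketch: summing DL gradients across ranks with a CUDA-aware MPI_Allreduce */
    #include <mpi.h>

    void sum_gradients(float *dev_grads, int count, MPI_Comm comm)
    {
        /* Reduce directly on the GPU buffers of all ranks; every rank gets the summed gradients. */
        MPI_Allreduce(MPI_IN_PLACE, dev_grads, count, MPI_FLOAT, MPI_SUM, comm);
        /* Averaging (scaling by 1/nranks) would typically be done afterwards by a small CUDA kernel. */
    }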

Figure: Allreduce latency (us, log scale) vs. message size on 8 GPUs (4 nodes), Baidu-allreduce vs. MVAPICH2-GDR (MPI_Allreduce); annotations mark ~30X better latency and ~11% improvement for MVAPICH2-GDR at different message sizes.


Exploiting GDR+IB-MCAST Design for Deep Learning Applications

• Optimizing MCAST+GDR broadcast for deep learning:
  – Source and destination buffers are on the GPU device
    • Typically very large messages (>1 MB)
  – Pipelining data from Device to Host
    • Avoids the GDR read limit
    • Leverages the high-performance SL design
  – Combines IB MCAST and GDR features
  – Minimizes use of PCIe resources on the receiver side
    • Maximizing availability of PCIe Host-Device resources
  – Available in MVAPICH2-GDR 2.3a!

Figure: the source node pipelines data from GPU to host (1. pipeline data movement, 2. IB gather), the IB switch replicates it (3. IB hardware multicast), and each destination node's HCA delivers the header to the host and writes the data directly into GPU memory (4. IB scatter + GDR write).

Ching-Hsiang Chu, Xiaoyi Lu, Ammar A. Awan, Hari Subramoni, Jahanzeb Hashmi, Bracy Elton, and Dhabaleswar K. Panda, “Efficient and Scalable Multi-Source Streaming Broadcast on GPU Clusters for Deep Learning,” ICPP ’17.


Application Evaluation: Deep Learning Frameworks
• @ RI2 cluster, 16 GPUs, 1 GPU/node
  – Microsoft Cognitive Toolkit (CNTK) [https://github.com/Microsoft/CNTK]

Figure: training time (seconds, lower is better) on 8 and 16 GPU nodes for the AlexNet and VGG models, comparing MV2-GDR-Knomial, MV2-GDR-Ring, and MCAST-GDR-Opt; MCAST-GDR-Opt reduces training time by up to 24% (AlexNet) and 15% (VGG).

C.-H. Chu, X. Lu, A. A. Awan, H. Subramoni, J. Hashmi, B. Elton, and D. K. Panda, Efficient and Scalable Multi-Source Streaming Broadcast on GPU Clusters for Deep Learning, ICPP’17.

• Reduces latency by up to 24% for the AlexNet model and 15% for the VGG model
• Higher improvement can be observed for larger system sizes


High-Performance Deep Learning (HiDL) with MVAPICH2-GDR

• Caffe: a flexible and layered Deep Learning framework
• Benefits and weaknesses
  – Multi-GPU training within a single node
  – Performance degradation for GPUs across different sockets
  – No scale-out available
• OSU-Caffe: MPI-based parallel training
  – Enables scale-up (within a node) and scale-out (across multi-GPU nodes)
  – Scale-out on 64 GPUs for training the CIFAR-10 network on the CIFAR-10 dataset
  – Scale-out on 128 GPUs for training the GoogLeNet network on the ImageNet dataset

Figure: GoogLeNet (ImageNet) training time (seconds) vs. number of GPUs (8–128), comparing Caffe, OSU-Caffe (1024), and OSU-Caffe (2048); the configuration Caffe cannot handle is marked as an invalid use case.

OSU-Caffe publicly available from http://hidl.cse.ohio-state.edu/


Conclusions
• MVAPICH2 optimizes MPI communication on InfiniBand clusters with GPUs
• Provides optimized designs for point-to-point two-sided and one-sided communication, datatype processing, and collective operations
• Takes advantage of CUDA features such as IPC and the GPUDirect RDMA family
• New designs help deliver good performance for streaming and deep learning applications


Thank You!

[email protected]

Network-Based Computing Laboratory
http://nowlab.cse.ohio-state.edu/

The MVAPICH2 Project
http://mvapich.cse.ohio-state.edu/