

Co-designing Communication Middleware and Deep Learning Frameworks for High-Performance DNN Training on HPC Systems

Ammar A. Awan and DK Panda (Advisor)
[email protected], [email protected]

MOTIVATION

• Resurgence of Deep Learning (DL)
• Availability of large datasets like ImageNet and massively parallel modern hardware like NVIDIA GPUs
• Emergence of DL frameworks (Caffe, TensorFlow, CNTK, etc.)
• Computability of Deep Neural Networks (DNNs): a single GPU/node is not enough!
• Scale-up and scale-out training: an emerging research area
• Various strategies to deal with large DNNs: Data Parallelism, Out-of-Core Training, and Model Parallelism
• Parameter-Server approach / Reduction-Tree approach

RESEARCH CHALLENGES

• Distributed address-space design constraints
• Challenges for communication middleware: very large GPU-based buffers and reduction collectives
• Overlap of computation and communication

SUMMARY OF CONTRIBUTIONS

• Co-design of MPI middleware and tensor communication for efficient DNN training: large-message, CUDA-Aware, and non-blocking communication
• Novel techniques to deal with out-of-core workloads on GPUs: exploiting CUDA Unified Memory, efficient prefetching, and page migration
• In-depth characterization and analysis of DL workloads and frameworks: profiling MPI and NCCL communication for a holistic understanding of single and multiple compute elements (CPU/GPU performance)
• Significant broader impact at the intersection of ML and HPC, a new research area: tutorials and a course (OSU) on High-Performance Deep Learning, and outreach through the MVAPICH2-GDR and HiDL (http://hidl.cse.ohio-state.edu) public releases

PROPOSED DESIGNS AND PERFORMANCE CHARACTERIZATION

2 Accelerating Data-Parallel Training -- EuroMPI ’16, EuroMPI ’18, and J. Parallel Computing ’19

Layer-wise Overlapped Gradient Aggregation
[Figure: the main thread computes the backward pass layer by layer (Ln, Ln-1, …, L2, L1) while a helper thread overlaps communication with computation, issuing Reduce(Ln), Reduce(Ln-1), …, Reduce(L2), Reduce(L1) as each layer's gradients become ready.]
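The overlap scheme can be sketched as follows. This is a minimal illustration, not the OSU-Caffe/MVAPICH2-GDR implementation: it assumes a CUDA-aware MPI library so GPU-resident gradient buffers can be passed to MPI directly, it replaces the helper-thread-plus-blocking-Reduce design of the figure with non-blocking MPI_Iallreduce calls on the main thread, and backward_pass_layer() is a hypothetical placeholder for the framework's per-layer backward computation.

```c
/* Minimal sketch of layer-wise overlapped gradient aggregation using
 * non-blocking reductions. Gradient buffers are assumed GPU-resident and
 * handed to a CUDA-aware MPI library (e.g., MVAPICH2-GDR). */
#include <mpi.h>

#define NUM_LAYERS 16

extern void backward_pass_layer(int layer, float *grad, int count);  /* hypothetical */

void train_step_backward(float *grads[NUM_LAYERS], const int counts[NUM_LAYERS],
                         MPI_Comm comm)
{
    MPI_Request reqs[NUM_LAYERS];

    /* Backward pass runs from the last layer (Ln) down to the first (L1). */
    for (int l = NUM_LAYERS - 1; l >= 0; l--) {
        backward_pass_layer(l, grads[l], counts[l]);

        /* Start reducing this layer's gradients immediately; the next
         * layer's computation overlaps with this communication. */
        MPI_Iallreduce(MPI_IN_PLACE, grads[l], counts[l], MPI_FLOAT,
                       MPI_SUM, comm, &reqs[l]);
    }

    /* All gradients must be aggregated before the weight update. */
    MPI_Waitall(NUM_LAYERS, reqs, MPI_STATUSES_IGNORE);
}
```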


[Figure: MPI_Reduce benchmark on 160 GPUs (10 nodes) -- latency (ms, log scale) vs. message size (1 byte to 128 MB), comparing MV2-GDR-NCCL and MV2-GDR-Opt.]
[Figure: GoogLeNet training (strong scaling), AlexNet training (weak scaling), and VGG training with CNTK -- training time (seconds) vs. number of GPUs (2 to 128), comparing MV2-GDR-NCCL and MV2-GDR-Opt.]
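For context on what the MPI_Reduce numbers measure, a minimal latency sweep in the style of such a benchmark is sketched below. It is not the actual OSU micro-benchmark code; it assumes a CUDA-aware MPI build (e.g., MVAPICH2-GDR) so that cudaMalloc'd device buffers can be passed to MPI_Reduce directly.

```c
/* Minimal sketch of a CUDA-aware MPI_Reduce latency sweep over message sizes. */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

#define MAX_BYTES (128 * 1024 * 1024)  /* sweep up to 128 MB */
#define ITERS 20

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    float *sendbuf, *recvbuf;
    cudaMalloc((void **)&sendbuf, MAX_BYTES);   /* device buffers handed to MPI */
    cudaMalloc((void **)&recvbuf, MAX_BYTES);

    for (size_t bytes = 4; bytes <= MAX_BYTES; bytes *= 2) {
        int count = (int)(bytes / sizeof(float));
        MPI_Barrier(MPI_COMM_WORLD);
        double start = MPI_Wtime();
        for (int i = 0; i < ITERS; i++)
            MPI_Reduce(sendbuf, recvbuf, count, MPI_FLOAT, MPI_SUM,
                       0, MPI_COMM_WORLD);
        double latency_ms = (MPI_Wtime() - start) / ITERS * 1e3;
        if (rank == 0)
            printf("%zu bytes: %.3f ms\n", bytes, latency_ms);
    }

    cudaFree(sendbuf);
    cudaFree(recvbuf);
    MPI_Finalize();
    return 0;
}
```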

PROPOSED FRAMEWORK

[Figure: proposed MVAPICH co-design stack.]
• Application Layer (DNN Training): Caffe, TensorFlow, CNTK, and PyTorch, covering data-parallel and model-parallel training
• Distributed Training Middleware: OSU-Caffe, MPI-Nets, and Horovod
• Communication Middleware (DL-Aware MPI, MVAPICH): CUDA-Aware Broadcast, CUDA-Aware Reductions, and Large-Message Reduction
• Co-Designs: Out-of-Core training and performance/design analysis
• HPC Platforms: multi-/many-core CPUs (Intel Xeon, AMD EPYC, and IBM POWER9), NVIDIA GPUs, and high-performance interconnects (InfiniBand, Omni-Path)

Programming Models and Interfaces for Distributed TensorFlow

[Figure: TensorFlow (TF) programs (e.g., the tf_cnn_benchmarks deep learning application) run over either Parameter-Server interfaces (gRPC and gRPC+X: gRPC+Verbs, gRPC+MPI) or No-gRPC interfaces (Baidu-Allreduce and the Horovod Distributed Optimizer); these sit on communication runtimes/libraries (NCCL2 and CUDA-Aware MPI with CUDA-Aware Allreduce, Pointer Cache, and GDR) over HPC platforms (CPUs, GPUs, InfiniBand, and Cray Aries). Callouts: (1) program-launch optimizations, (2) proposed Allreduce designs and optimizations, (3) characterization and performance analysis.]
[Figure: images/second (higher is better) vs. number of nodes (GPUs), 1 to 16, for Baidu-MPI, Horovod-MPI, Horovod-NCCL2, gRPC+verbs, gRPC+MPI, and gRPC (IPoIB).]
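One of the Allreduce optimizations called out above is a pointer cache for GPU buffers. The sketch below is an illustrative direct-mapped cache keyed by the buffer address, not the MVAPICH2-GDR implementation; it only shows why caching helps: deciding whether a user buffer is device memory otherwise costs a cudaPointerGetAttributes() call on every operation on the same tensor.

```c
/* Illustrative GPU-buffer pointer cache (not the MVAPICH2-GDR design):
 * cache the result of cudaPointerGetAttributes() per buffer address so
 * repeated Allreduce calls on the same tensor skip the CUDA query. */
#include <cuda_runtime.h>
#include <stdint.h>

#define CACHE_SLOTS 1024

static struct { const void *ptr; int is_device; } cache[CACHE_SLOTS];

int is_device_buffer(const void *buf)
{
    size_t slot = ((uintptr_t)buf >> 8) % CACHE_SLOTS;   /* direct-mapped slot */
    if (cache[slot].ptr == buf)
        return cache[slot].is_device;                    /* hit: no CUDA call */

    struct cudaPointerAttributes attr;
    int is_device = 0;
    if (cudaPointerGetAttributes(&attr, buf) == cudaSuccess)
        is_device = (attr.type == cudaMemoryTypeDevice);

    cache[slot].ptr = buf;                               /* fill the slot */
    cache[slot].is_device = is_device;
    return is_device;
}
```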

[Figure: Unified Data Layer -- existing vs. proposed forward and backward propagation. In the existing design, each layer's kernels are fed through explicit file-system I/O plus F2H/H2F, H2D, D2H, and D2D copies between the data layer (L1), Layer 2, …, Layer N. In the proposed design, managed memory replaces the explicit staging (M2M, F2M/M2F), and Prefetch() and Advise(Evict()) hints are issued around each layer's kernels during both forward and backward propagation.]

1 Performance Characterization and Design Analysis: Caffe, CNTK, TensorFlow -- MLHPC ‘17, CCGrid ’19, and HotI ‘19

Pure MPI Design for DL-Aware Broadcast

[Figure: design-selection hierarchy behind MPI_Bcast().]
• Communicator selection: flexible communicator, intra-node communicator (Shared Memory, Loopback, GPUDirect RDMA, GDRCopy, CUDA IPC, Pipelined IPC), and inter-node communicator (GPUDirect RDMA, Host-staging, GDR Write, Pipelined)
• Algorithm selection: K-nomial Tree, Scatter-Allgather, Chain (Ring) (sketched below), and several others
• Collectives design selection: staged designs and direct designs
• Point-to-point (P2P) design selection: intra-node and inter-node P2P built on the MPI_Isend/MPI_Irecv primitives
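As a concrete illustration of the Chain (Ring) algorithm over the P2P primitives, here is a minimal sketch. It is not the MVAPICH2-GDR design: it simply forwards fixed-size chunks down a logical chain with MPI_Irecv/MPI_Isend, and it assumes a CUDA-aware MPI if buf is a device pointer.

```c
/* Minimal chain (ring) broadcast built only on MPI_Isend/MPI_Irecv.
 * Chunks flow rank-to-rank down the chain, so different links carry
 * different chunks concurrently. */
#include <mpi.h>

void chain_bcast(char *buf, long nbytes, int chunk, int root, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    /* Logical position in the chain, relative to the root. */
    int pos  = (rank - root + size) % size;
    int prev = (rank - 1 + size) % size;
    int next = (rank + 1 + size) % size;

    for (long off = 0; off < nbytes; off += chunk) {
        int len = (int)((nbytes - off < chunk) ? (nbytes - off) : chunk);
        MPI_Request req;

        if (pos != 0) {                        /* everyone but the root receives */
            MPI_Irecv(buf + off, len, MPI_BYTE, prev, 0, comm, &req);
            MPI_Wait(&req, MPI_STATUS_IGNORE);
        }
        if (pos != size - 1) {                 /* everyone but the tail forwards */
            MPI_Isend(buf + off, len, MPI_BYTE, next, 0, comm, &req);
            MPI_Wait(&req, MPI_STATUS_IGNORE);
        }
    }
}
```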

[Figure: out-of-core training throughput (Img/sec, higher is better) for Out-of-Core AlexNet, Out-of-Core GoogLeNet, and Out-of-Core ResNet-50, comparing caffe-gpu, oc-caffe-naïve, oc-caffe-opt, caffe-cpu, intel-caffe, and intel-caffe-opt. caffe-gpu cannot run any of the three out-of-core workloads. oc-caffe-opt is 5X better than intel-caffe-opt for AlexNet, 2.7X better than intel-caffe-opt for GoogLeNet, and 80% better than intel-caffe for ResNet-50 (intel-caffe-opt N/A).]

3 Out-of-Core DNN Training -- HiPC ‘18
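The out-of-core design builds on CUDA Unified Memory with prefetching and eviction hints (see the Unified Data Layer figure above). Below is a minimal sketch of that mechanism, assuming a Pascal-or-newer GPU; it illustrates the CUDA managed-memory calls involved, not the OC-Caffe implementation, and alloc_layer/prefetch_layer/evict_layer are hypothetical helper names.

```c
/* Minimal sketch of the managed-memory scheme: layer buffers can exceed GPU
 * memory, the next layer is prefetched before its kernels run, and layers no
 * longer needed are "evicted" back to the host. */
#include <cuda_runtime.h>

float *alloc_layer(size_t bytes)
{
    float *p = NULL;
    cudaMallocManaged((void **)&p, bytes, cudaMemAttachGlobal);
    return p;
}

void prefetch_layer(float *p, size_t bytes, int device, cudaStream_t stream)
{
    /* Migrate the layer's pages to the GPU ahead of its forward/backward kernels. */
    cudaMemPrefetchAsync(p, bytes, device, stream);
}

void evict_layer(float *p, size_t bytes, cudaStream_t stream)
{
    /* Hint that the GPU no longer needs these pages and move them to the host,
     * freeing device memory for upcoming layers. */
    cudaMemAdvise(p, bytes, cudaMemAdviseSetPreferredLocation, cudaCpuDeviceId);
    cudaMemPrefetchAsync(p, bytes, cudaCpuDeviceId, stream);
}
```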

4 Co-designing MPI and Caffe for Data Parallelism – PPoPP ‘17

Layer-wise Overlapped Model Propagation (sketched below)
• Faster convolutions → faster training
• Performance of Intel KNL == NVIDIA P100 for AlexNet training; Volta is in a different league!
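A minimal sketch of layer-wise overlapped model propagation follows. It is an illustration rather than the OSU-Caffe design: parameter buffers are assumed GPU-resident with a CUDA-aware MPI, all per-layer broadcasts are posted as non-blocking MPI_Ibcast operations, and forward_pass_layer() is a hypothetical placeholder for the framework's per-layer compute.

```c
/* Overlap per-layer parameter broadcasts with the forward pass: each layer
 * waits only for its own parameters right before it needs them. */
#include <mpi.h>

#define NUM_LAYERS 16

extern void forward_pass_layer(int layer, float *params, int count);  /* hypothetical */

void forward_with_overlapped_bcast(float *params[NUM_LAYERS],
                                   const int counts[NUM_LAYERS],
                                   int root, MPI_Comm comm)
{
    MPI_Request reqs[NUM_LAYERS];

    /* Post all parameter broadcasts up front; they progress in the background. */
    for (int l = 0; l < NUM_LAYERS; l++)
        MPI_Ibcast(params[l], counts[l], MPI_FLOAT, root, comm, &reqs[l]);

    for (int l = 0; l < NUM_LAYERS; l++) {
        /* Wait only for this layer's parameters, overlapping the remaining
         * broadcasts with the compute of layers already received. */
        MPI_Wait(&reqs[l], MPI_STATUS_IGNORE);
        forward_pass_layer(l, params[l], counts[l]);
    }
}
```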

Proposed Profiling Infrastructure (hvprof)

[Figure: hvprof spans the full stack -- Deep Learning frameworks (TensorFlow, PyTorch, MXNet), distributed training middleware (Horovod), communication middleware (MPI and NCCL), and HPC platforms (CPUs, GPUs, and high-performance interconnects: InfiniBand, Omni-Path, PCIe, NVLink).]
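hvprof's own implementation is not shown on the poster; the sketch below only illustrates the standard PMPI profiling interface that an MPI-level profiler of this kind can use to intercept and time MPI_Allreduce calls without modifying the application.

```c
/* Generic PMPI interception sketch: override the MPI symbol and reach the
 * real implementation through PMPI_*, accumulating per-call timing. */
#include <mpi.h>
#include <stdio.h>

static double total_allreduce_time = 0.0;
static long   allreduce_calls      = 0;

int MPI_Allreduce(const void *sendbuf, void *recvbuf, int count,
                  MPI_Datatype datatype, MPI_Op op, MPI_Comm comm)
{
    double t0 = MPI_Wtime();
    int ret = PMPI_Allreduce(sendbuf, recvbuf, count, datatype, op, comm);
    total_allreduce_time += MPI_Wtime() - t0;
    allreduce_calls++;
    return ret;
}

int MPI_Finalize(void)
{
    int rank;
    PMPI_Comm_rank(MPI_COMM_WORLD, &rank);
    fprintf(stderr, "[rank %d] MPI_Allreduce: %ld calls, %.3f s total\n",
            rank, allreduce_calls, total_allreduce_time);
    return PMPI_Finalize();
}
```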

[Figure: images per second (thousands) vs. number of GPUs (1 to 1,536), comparing NCCL-2.4 and MVAPICH2-GDR-Next.]
MVAPICH2-GDR reaching ~0.35 million images per second for ImageNet-1k! (ImageNet-1k has 1.2 million images.)

Details of all publications are available from: http://go.osu.edu/ammar
