TRANSCRIPT
High-Performance Training for Deep Learning and Computer Vision HPC
Dhabaleswar K. (DK) Panda, The Ohio State University
E-mail: [email protected]
http://www.cse.ohio-state.edu/~panda
Panel at CVPR-ECV ’18
Scale-up and Scale-out
• Scale-up: Intra-node Communication
  – Many improvements, e.g.:
    • NVIDIA cuDNN, cuBLAS, NCCL, etc.
    • CUDA 9 Cooperative Groups
• Scale-out: Inter-node Communication
  – DL frameworks – most are optimized for single-node use only
  – Distributed (parallel) training is an emerging trend
    • OSU-Caffe – MPI-based
    • Microsoft CNTK – MPI/NCCL2
    • Google TensorFlow – gRPC-based/MPI/NCCL2
    • Facebook Caffe2 – hybrid (NCCL2/Gloo/MPI)
[Figure: scale-up performance (y-axis) vs. scale-out performance (x-axis), positioning cuDNN, MKL-DNN, NCCL1, NCCL2, MPI, gRPC, and Hadoop relative to the desired design point that combines both.]
Holistic Evaluation is Important!!
• "My framework is faster than your framework!"
• This needs to be understood in a holistic way.
• Performance depends on the entire execution environment (the full stack)
• Isolated view of performance is not helpful
A. A. Awan, H. Subramoni, and D. K. Panda, "An In-depth Performance Characterization of CPU- and GPU-based DNN Training on Modern Architectures", in Proceedings of the Machine Learning on HPC Environments (MLHPC'17), ACM, New York, NY, USA, Article 8.
Interdependency of HPC, Big Data and Deep Learning
• Big Data (Hadoop, Spark, HBase, Memcached, etc.)
• Deep Learning (Caffe, TensorFlow, BigDL, etc.)
• HPC (MPI, RDMA, Lustre, etc.)
Convergence of HPC, Big Data, and Deep Learning!!!
Drivers of Modern HPC Cluster and Data Center Architecture
• Multi-core/many-core technologies
• Remote Direct Memory Access (RDMA)-enabled networking (InfiniBand and RoCE)
  – Single Root I/O Virtualization (SR-IOV)
• Solid State Drives (SSDs), NVM, Parallel Filesystems, Object Storage Clusters
• Accelerators (NVIDIA GPGPUs and FPGAs)
[Figure: representative hardware – high-performance interconnects such as InfiniBand (with SR-IOV) offering <1 usec latency and 200 Gbps bandwidth; multi-/many-core processors; cloud systems (SDSC Comet, TACC Stampede2); accelerators with high compute density and high performance/watt (>1 TFlop DP on a chip); SSD, NVMe-SSD, and NVRAM storage.]
Research Challenges to Exploit HPC Technologies
1. What are the fundamental issues in designing DL frameworks?
   – Memory requirements
   – Computation requirements
   – Communication overhead
2. Why do we need to support distributed training?
   – To overcome the limits of single-node training
   – To better utilize hundreds of existing HPC clusters
[Figure: deep learning and machine learning frameworks (Caffe/OSU-Caffe, Caffe2, CNTK, TensorFlow, MXNet) with their major computation and communication phases (forward, backward, gradient aggregation, model propagation), layered over communication runtimes that support distributed training, which in turn run on HPC platforms (CPU, GPU, InfiniBand). Markers 1 and 2 indicate where these challenges arise in the stack.]
Research Challenges to Exploit HPC Technologies (Cont'd)
3. What are the new design challenges brought forward by DL frameworks for communication runtimes?
   – Large-message collective communication and reductions
   – GPU buffers (CUDA-awareness) – a minimal sketch of CUDA-aware communication follows this slide
4. Can a co-design approach help in achieving scale-up and scale-out efficiently?
   – Co-design the support at the runtime level and exploit it at the DL framework level
   – What performance benefits can be observed?
   – What needs to be fixed at the communication runtime layer?
[Figure: the same framework/runtime/platform stack – DL/ML frameworks (Caffe/OSU-Caffe, Caffe2, CNTK, TensorFlow, MXNet) and their computation and communication phases (forward, backward, gradient aggregation, model propagation) on top of communication runtimes (MPI/NCCL/Gloo/MLSL) and HPC platforms (CPU, GPU, InfiniBand). Marker 3 highlights CUDA-awareness, large-message collectives, and point-to-point operations at the runtime layer; marker 4 marks the co-design opportunities.]
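To make the CUDA-awareness challenge concrete, here is a minimal sketch (not from the talk) of communicating directly from GPU buffers. It assumes mpi4py built on top of a CUDA-aware MPI library such as MVAPICH2-GDR, and uses CuPy arrays to hold the gradients; with a non-CUDA-aware MPI the buffers would first have to be staged through host memory.

```python
# Minimal sketch: allreduce of a gradient buffer that lives in GPU memory.
# Assumes mpi4py on top of a CUDA-aware MPI (e.g., MVAPICH2-GDR) so that
# device buffers can be passed to MPI directly (and GPUDirect RDMA can be
# used for the inter-node transfers).
from mpi4py import MPI
import cupy as cp

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Stand-in for a flattened 64 MB gradient buffer produced on this rank's GPU.
grads = cp.full(16 * 1024 * 1024, float(rank), dtype=cp.float32)
summed = cp.empty_like(grads)

# The CuPy arrays (device pointers) are handed to MPI without a host copy.
comm.Allreduce(grads, summed, op=MPI.SUM)
summed /= comm.Get_size()   # average the gradients across all ranks
```

This is the pattern that the large-message GPU Allreduce comparisons later in the talk exercise at the MPI level.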
Multiple Approaches taken up by OSU
• MPI-driven Deep Learning
• Co-designing Deep Learning Stacks with High-Performance MPI
• Accelerating TensorFlow on HPC Systems
• Accelerating Big Data Stacks
• Efficient Deep Learning over Big Data
Data Parallel Deep Learning and MPI Collectives
[Figure: data-parallel training loop across four GPUs. (1) Data/parameter propagation: MPI_Bcast from GPU 0 distributes the packed parameter buffer (packed_comm_buff, layers L1..Ln) to all GPUs. (2) Forward/backward pass: each GPU runs forward (F) and backward (B) over its mini-batch and produces gradients. (3) Gradient aggregation: each GPU's packed_reduce_buff is combined with MPI_Reduce to GPU 0, which applies the updates; the loop then repeats.]
• Major MPI collectives involved in designing distributed frameworks:
  • MPI_Bcast – required for DNN parameter exchange
  • MPI_Reduce – needed for gradient accumulation from multiple solvers
  • MPI_Allreduce – use just one Allreduce instead of Reduce and Broadcast (see the sketch below)
A. A. Awan, K. Hamidouche, J. M. Hashmi, and D. K. Panda, "S-Caffe: Co-designing MPI Runtimes and Caffe for Scalable Deep Learning on Modern GPU Clusters", in Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '17).
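The workflow above maps onto a handful of MPI calls. Below is a minimal sketch (illustrative, not OSU-Caffe code) of the loop using mpi4py with NumPy host buffers: rank 0 broadcasts the packed parameters once, every rank computes gradients on its own mini-batch, and a single Allreduce replaces the Reduce-plus-Broadcast pair. compute_gradients and apply_updates are hypothetical placeholders for the framework's forward/backward pass and optimizer.

```python
# Minimal data-parallel training loop built from MPI collectives (sketch).
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
size = comm.Get_size()

num_params = 1_000_000
params = np.zeros(num_params, dtype=np.float32)      # packed_comm_buff

def compute_gradients(params):
    """Placeholder for the forward/backward pass on this rank's mini-batch."""
    return np.random.rand(num_params).astype(np.float32)

def apply_updates(params, grads, lr=0.01):
    params -= lr * grads

# (1) Parameter propagation: broadcast the initial model from rank 0; with
#     Allreduce every rank then applies identical updates and stays in sync.
comm.Bcast(params, root=0)

for step in range(100):
    # (2) Forward/backward pass over the local shard of the batch.
    local_grads = compute_gradients(params)          # packed_reduce_buff

    # (3) Gradient aggregation: one Allreduce instead of Reduce + Bcast,
    #     so every rank ends up with the summed gradients.
    global_grads = np.empty_like(local_grads)
    comm.Allreduce(local_grads, global_grads, op=MPI.SUM)
    global_grads /= size                             # average across solvers

    apply_updates(params, global_grads)
```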
Overview of the MVAPICH2 MPI Project
• High-performance, open-source MPI library for InfiniBand, Omni-Path, Ethernet/iWARP, and RDMA over Converged Ethernet (RoCE)
– MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.1), Started in 2001, First version available in 2002
– MVAPICH2-X (MPI + PGAS), Available since 2011
– Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), Available since 2014
– Support for Virtualization (MVAPICH2-Virt), Available since 2015
– Support for Energy-Awareness (MVAPICH2-EA), Available since 2015
– Support for InfiniBand Network Analysis and Monitoring (OSU INAM) since 2015
– Used by more than 2,900 organizations in 86 countries
– More than 474,000 (> 0.47 million) downloads from the OSU site directly
– Empowering many TOP500 clusters (Nov '17 ranking)
• 1st, 10,649,600-core (Sunway TaihuLight) at National Supercomputing Center in Wuxi, China
• 9th, 556,104 cores (Oakforest-PACS) in Japan
• 12th, 368,928-core (Stampede2) at TACC
• 17th, 241,108-core (Pleiades) at NASA
• 48th, 76,032-core (Tsubame 2.5) at Tokyo Institute of Technology
– Available with software stacks of many vendors and Linux Distros (RedHat and SuSE)
– http://mvapich.cse.ohio-state.edu
• Empowering Top500 systems for over a decade
Optimized MVAPICH2-GDR Design
[Figure: GPU-GPU inter-node latency, bandwidth, and bi-directional bandwidth vs. message size (1 B to 4 KB/8 KB), comparing MV2 (no GDR) with MV2-GDR 2.3a. GPUDirect RDMA brings latency down to 1.88 us (~11x better) and yields roughly 9x–10x higher bandwidth and bi-directional bandwidth for small/medium messages.]
Platform: MVAPICH2-GDR 2.3a on an Intel Haswell (E5-2687W @ 3.10 GHz) node with 20 cores, NVIDIA Volta V100 GPU, Mellanox ConnectX-4 EDR HCA, CUDA 9.0, and Mellanox OFED 4.0 with GPUDirect RDMA.
Exploiting CUDA-Aware MPI for TensorFlow (Horovod)
• MVAPICH2-GDR offers excellent performance via advanced designs for MPI_Allreduce (a Horovod usage sketch follows below)
• Up to 22% better performance on the Wilkes2 cluster (16 GPUs)
[Figure: TensorFlow (Horovod) training throughput, images/sec (higher is better) vs. number of GPUs (1–16, 4 GPUs/node), MVAPICH2 vs. MVAPICH2-GDR; MVAPICH2-GDR is up to 22% faster than MVAPICH2.]
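Horovod drives runs like the one above by wrapping the optimizer so that gradient averaging goes through allreduce in the underlying library; when Horovod is built with MPI support, that maps to MPI_Allreduce in the CUDA-aware MPI (e.g., MVAPICH2-GDR). The sketch below uses the TF 1.x-era API; build_model_loss is a hypothetical placeholder, and the optimizer and learning-rate scaling are illustrative. The job would be launched with the MPI library's launcher, e.g. mpirun -np 16 python train.py.

```python
# Sketch of Horovod-style data-parallel TensorFlow training (TF 1.x API era).
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()

# Pin each MPI rank to one local GPU (4 GPUs per node in the runs above).
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

loss = build_model_loss()   # hypothetical: builds the DNN graph and its loss
opt = tf.train.MomentumOptimizer(learning_rate=0.01 * hvd.size(), momentum=0.9)

# DistributedOptimizer averages gradients with allreduce across all ranks;
# with an MPI-enabled Horovod build this uses MPI_Allreduce underneath.
opt = hvd.DistributedOptimizer(opt)
train_op = opt.minimize(loss)

# Rank 0 broadcasts the initial weights so all ranks start from the same model.
hooks = [hvd.BroadcastGlobalVariablesHook(0)]

with tf.train.MonitoredTrainingSession(hooks=hooks, config=config) as sess:
    while not sess.should_stop():
        sess.run(train_op)
```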
MVAPICH2-GDR: Allreduce Comparison with Baidu and OpenMPI
• 16 GPUs (4 nodes): MVAPICH2-GDR* vs. Baidu-Allreduce and OpenMPI 3.0
*Available with MVAPICH2-GDR 2.3a
[Figure: Allreduce latency vs. message size for MVAPICH2-GDR, Baidu-Allreduce, and OpenMPI across small (4 B–256 KB), medium (512 KB–4 MB), and large (8 MB–512 MB) messages. In-plot annotations: MVAPICH2-GDR is ~2x better than Baidu and up to ~30x better than OpenMPI in one range, ~10x better in another (where OpenMPI is ~5x slower than Baidu), and ~4x better in a third.]
MVAPICH2-GDR vs. NCCL2 – Reduce Operation
• Optimized designs in MVAPICH2-GDR 2.3b* offer better or comparable performance for most cases
• MPI_Reduce (MVAPICH2-GDR) vs. ncclReduce (NCCL2) on 16 GPUs
*Will be available with the upcoming MVAPICH2-GDR 2.3b
[Figure: Reduce latency vs. message size, MVAPICH2-GDR vs. NCCL2; MVAPICH2-GDR is ~2.5x better for small messages (4 B–64 KB) and up to ~5x better for large messages.]
Platform: Intel Xeon (Broadwell) nodes, each with a dual-socket CPU, one K-80 GPU, and an EDR InfiniBand interconnect.
MVAPICH2-GDR vs. NCCL2 – Allreduce Operation
• Optimized designs in MVAPICH2-GDR 2.3b* offer better or comparable performance for most cases
• MPI_Allreduce (MVAPICH2-GDR) vs. ncclAllreduce (NCCL2) on 16 GPUs
*Will be available with the upcoming MVAPICH2-GDR 2.3b
[Figure: Allreduce latency vs. message size, MVAPICH2-GDR vs. NCCL2; MVAPICH2-GDR is ~3x better for small messages (4 B–64 KB) and ~1.2x better for large messages.]
Platform: Intel Xeon (Broadwell) nodes, each with a dual-socket CPU, one K-80 GPU, and an EDR InfiniBand interconnect.
OSU-Caffe: Proposed Co-Design Overview
• To address the limitations of Caffe and existing MPI runtimes, we propose the OSU-Caffe (S-Caffe) framework
• At the application (DL framework) level:
  – Develop a fine-grain workflow – i.e., layer-wise communication instead of communicating the entire model (see the sketch below)
• At the runtime (MPI) level:
  – Develop support to perform reduction of very large GPU buffers
  – Perform reduction using GPU kernels
OSU-Caffe is available from the HiDL project page: http://hidl.cse.ohio-state.edu
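The layer-wise (fine-grain) idea can be illustrated with non-blocking reductions: as backpropagation finishes each layer, its gradient buffer is handed to the runtime so communication overlaps with the remaining backward computation. The sketch below is an mpi4py/NumPy illustration of the pattern, not S-Caffe's actual implementation; backward_layer is a hypothetical placeholder for the per-layer backward pass, and the layer sizes are arbitrary.

```python
# Sketch: layer-wise gradient reduction overlapped with backpropagation.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
size = comm.Get_size()

layer_sizes = [4_000_000, 2_000_000, 1_000_000]   # parameters per layer

def backward_layer(i):
    """Placeholder for the backward pass of layer i, returning its gradients."""
    return np.random.rand(layer_sizes[i]).astype(np.float32)

requests, send_bufs, recv_bufs = [], [], []

# Backprop runs from the last layer to the first; start a non-blocking
# allreduce for each layer's gradients as soon as they are produced, so the
# reduction of layer i overlaps with the backward pass of layer i-1.
for i in reversed(range(len(layer_sizes))):
    grads = backward_layer(i)
    out = np.empty_like(grads)
    send_bufs.append(grads)     # keep the send buffer alive until Waitall
    recv_bufs.append(out)
    requests.append(comm.Iallreduce(grads, out, op=MPI.SUM))

# Wait for all per-layer reductions, then average before applying updates.
MPI.Request.Waitall(requests)
for out in recv_bufs:
    out /= size
```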
S-Caffe vs. Inspur-Caffe and Microsoft CNTK
• AlexNet: notoriously hard to scale out on multiple nodes due to communication overhead!
  – Large number of parameters: ~64 million (comm. buffer size = 256 MB)
• GoogLeNet is a popular DNN
  – 13 million parameters (comm. buffer size ≈ 50 MB)
• S-Caffe delivers better or comparable performance to other multi-node capable DL frameworks
  – Up to 14% improvement (scale-up)
[Figure: training performance comparison, including the impact of HR.]
Multiple Approaches taken up by OSU
• MPI-driven Deep Learning
• Co-designing Deep Learning Stacks with High-Performance MPI
• Accelerating TensorFlow on HPC Systems
• Accelerating Big Data Stacks
• Efficient Deep Learning over Big Data
Performance Benefits for AR-gRPC with Micro-Benchmark
• AR-gRPC (OSU design) point-to-point latency on SDSC-Comet-FDR:
  – Up to 2.7x speedup over default gRPC (IPoIB) for small messages
  – Up to 2.8x speedup over default gRPC (IPoIB) for medium messages
  – Up to 2.5x speedup over default gRPC (IPoIB) for large messages
[Figure: point-to-point latency vs. payload size (2 B–8 KB, 16 KB–512 KB, and 1 MB–8 MB), Default gRPC (IPoIB-56 Gbps) vs. AR-gRPC (RDMA-56 Gbps).]
R. Biswas, X. Lu, and D. K. Panda, "Accelerating TensorFlow with Adaptive RDMA-based gRPC" (under review).
Performance Benefit for TensorFlow (ResNet-50)
• TensorFlow ResNet-50 performance evaluation on an IB EDR cluster:
  – Up to 26% speedup over default gRPC (IPoIB) on 4 nodes
  – Up to 127% speedup over default gRPC (IPoIB) on 8 nodes
  – Up to 133% speedup over default gRPC (IPoIB) on 12 nodes
[Figure: images/sec vs. batch size per GPU (8–64) on 4, 8, and 12 nodes, comparing gRPC (IPoIB-100 Gbps), Verbs (RDMA-100 Gbps), MPI (RDMA-100 Gbps), and AR-gRPC (RDMA-100 Gbps); additional in-plot annotations show gains of 10%, 15%, and 16%.]
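For reference, the gRPC/Verbs/MPI channels compared above are selected in TensorFlow 1.x through the server's protocol string; the "grpc+verbs" and "grpc+mpi" transports are only available when TensorFlow is built with the corresponding support, while AR-gRPC (the OSU design) accelerates the default gRPC channel itself. The cluster spec below (host names and ports) is purely illustrative.

```python
# Sketch: choosing the distributed TensorFlow (1.x) transport under comparison.
import tensorflow as tf

# Hypothetical cluster layout: one parameter server and three workers.
cluster = tf.train.ClusterSpec({
    "ps":     ["node0:2222"],
    "worker": ["node1:2222", "node2:2222", "node3:2222"],
})

# protocol="grpc" is the default; "grpc+verbs" or "grpc+mpi" switch the
# transport if TensorFlow was built with verbs/MPI support.
server = tf.train.Server(cluster, job_name="worker", task_index=0,
                         protocol="grpc")
server.join()
```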
Performance Benefit for TensorFlow (Inception3)
• TensorFlow Inception3 performance evaluation on an IB EDR cluster:
  – Up to 47% speedup over default gRPC (IPoIB) on 4 nodes
  – Up to 116% speedup over default gRPC (IPoIB) on 8 nodes
  – Up to 153% speedup over default gRPC (IPoIB) on 12 nodes
[Figure: images/sec vs. batch size per GPU (8–64) on 4, 8, and 12 nodes, comparing gRPC (IPoIB-100 Gbps), Verbs (RDMA-100 Gbps), MPI (RDMA-100 Gbps), and AR-gRPC (RDMA-100 Gbps); additional in-plot annotations show gains of 7%, 10%, and 12%.]
Multiple Approaches taken up by OSU
• MPI-driven Deep Learning
• Co-designing Deep Learning Stacks with High-Performance MPI
• Accelerating TensorFlow on HPC Systems
• Accelerating Big Data Stacks
• Efficient Deep Learning over Big Data
The High-Performance Big Data (HiBD) Project
• RDMA for Apache Spark
• RDMA for Apache Hadoop 2.x (RDMA-Hadoop-2.x)
  – Plugins for Apache, Hortonworks (HDP), and Cloudera (CDH) Hadoop distributions
• RDMA for Apache HBase
• RDMA for Memcached (RDMA-Memcached)
• RDMA for Apache Hadoop 1.x (RDMA-Hadoop)
• OSU HiBD-Benchmarks (OHB)
  – HDFS, Memcached, HBase, and Spark micro-benchmarks
• http://hibd.cse.ohio-state.edu
• User base: 285 organizations from 34 countries
• More than 26,700 downloads from the project site
• Available for InfiniBand and RoCE; also runs on Ethernet
• Available for x86 and OpenPOWER
• Support for Singularity and Docker
X. Lu, H. Shi, M. H. Javed, R. Biswas, and D. K. Panda, "Characterizing Deep Learning over Big Data (DLoBD) Stacks on RDMA-capable Networks", HotI 2017.
High-Performance Deep Learning over Big Data (DLoBD) Stacks
• Challenges of deep learning over Big Data (DLoBD):
  – Can RDMA-based designs in DLoBD stacks improve performance, scalability, and resource utilization on high-performance interconnects, GPUs, and multi-core CPUs?
  – What are the performance characteristics of representative DLoBD stacks on RDMA networks?
• Characterization of DLoBD stacks:
  – CaffeOnSpark, TensorFlowOnSpark, and BigDL
  – IPoIB vs. RDMA; in-band vs. out-of-band communication; CPU vs. GPU; etc.
  – Performance, accuracy, scalability, and resource utilization
  – RDMA-based DLoBD stacks (e.g., BigDL over RDMA-Spark) can achieve a 2.6x speedup over the IPoIB-based scheme while maintaining similar accuracy
[Figure: epoch time (secs) and accuracy (%) vs. epoch number (1–18) for IPoIB vs. RDMA; the RDMA-based stack is about 2.6x faster while accuracy tracks the IPoIB run.]
Thank You!
Network-Based Computing Laboratory: http://nowlab.cse.ohio-state.edu/
The High-Performance MPI/PGAS Project: http://mvapich.cse.ohio-state.edu/
The High-Performance Deep Learning Project: http://hidl.cse.ohio-state.edu/
The High-Performance Big Data Project: http://hibd.cse.ohio-state.edu/