Aluminum: An Asynchronous, GPU-aware Communication Library Optimized for Large-Scale Training of Deep Neural Networks on HPC systems

Nikoli Dryden∗†, Naoya Maruyama∗, Tim Moon∗, Tom Benson∗, Andy Yoo∗, Marc Snir†, Brian Van Essen∗

∗Lawrence Livermore National Laboratory
{maruyama3,moon13,benson31,yoo2,vanessen1}@llnl.gov

†Department of Computer Science, University of Illinois at Urbana-Champaign
{dryden2,snir}@illinois.edu

Abstract—We identify communication as a major bottleneck for training deep neural networks on large-scale GPU clusters, taking up to 10x as long as computation. To reduce this overhead, we discuss techniques to overlap communication and computation as much as possible. This leads to much of the communication being latency-bound instead of bandwidth-bound, and we find that using a combination of latency- and bandwidth-optimized allreduce algorithms significantly reduces communication costs. We also discuss a semantic mismatch between MPI and CUDA that increases overheads and limits asynchrony, and propose a solution that enables communication to be aware of CUDA streams. We implement these optimizations in the open-source Aluminum communication library, enabling optimized, asynchronous, GPU-aware communication. Aluminum demonstrates improved performance in benchmarks and end-to-end training of deep networks, for both strong and weak scaling.

Index Terms—Deep learning, machine learning, communication optimization, collective algorithms, HPC

I. INTRODUCTION

With the success of deep learning, accelerating the training process has become increasingly important, particularly as model complexity and dataset sizes grow. Many training toolkits have emerged that leverage multiple GPUs, either on a single node or across multiple nodes [1]–[7]. Simultaneously, large clusters of GPUs have begun to be deployed and leveraged for training deep networks. Efficiently utilizing such systems requires careful optimization of many aspects of the training process. In particular, reducing communication overheads stands out as one of the most significant requirements, especially in order to scale to large node counts.

Many of the current approaches to distributed training can be broadly divided into model- and data-parallel techniques. In model-parallel techniques, a neural network layer is partitioned across multiple processors. This is typically applied to fully-connected layers [1], where it is essentially a distributed matrix product, but it has also been demonstrated for locally-connected layers [8]. Scaling matrix products is a well-studied problem in numerical linear algebra [9], [10]. In data-parallel techniques, layers are replicated and a mini-batch’s data is partitioned across multiple processors, which perform forward and backward propagation independently before synchronizing their parameter updates. This is the typical approach to distributed training for convolutional layers, and is also often applied to entire networks.

Scaling data-parallelism typically relies on increasing the training algorithm’s mini-batch size, as scalability is ultimately limited by the number of samples in each mini-batch. Additionally, larger mini-batches help ensure that each processor is efficiently utilized. Increasing mini-batch size is non-trivial, as it can impact the quality of the learned model and has a complex interplay with the model’s learning rate [11]–[14]. Several techniques have demonstrated successful large mini-batch training, including linear warmups [15] and layer-wise adaptive learning rates [16], typically for image classification problems. It remains to be seen how general these approaches are, especially when applied to non-image data.

The communication requirements for data-parallel training are particularly large due to the need to synchronize parameter updates. This operation is a global allreduce operation on each layer’s parameters. As modern networks often have large numbers of parameters and many layers, this allreduce is a significant cost, and communication has been consistently identified as a major bottleneck in scaling [17], [18]. The allreduce is typically implemented either via centralized parameter servers, which accumulate and distribute updates from workers; or in a distributed manner via an operation like MPI_Allreduce. Some systems make use of sparse, quantized, or compressed communication to reduce communication overhead [19]–[21]; we view these approaches as complementary to our work.

In this work, we study the communication requirements for training modern deep networks and identify implementation techniques to reduce them. Our focus is on distributed GPU systems using CUDA and MPI, where each node is interconnected with a high-speed network. This is typical of modern GPU supercomputers.

We begin by examining communication overheads for both strong and weak scaling of training. Using ResNet-50 [22] as our example, we find that even at small scales, communication accounts for a significant portion of total runtime and the overhead worsens rapidly as training is scaled onto more GPUs. This is exacerbated for strong scaling, where the volume of work per processor decreases with scale while the cost of communication increases, making strong scaling to large numbers of GPUs unprofitable. Weak scaling fares better, but communication overheads still prevent optimal scaling, and it yields poor improvements on many GPUs. We then turn to alleviating the communication overheads.

Overlapping communication and computation is a standard approach to help hide communication overheads. The standard formulation of backpropagation and gradient descent for training deep networks enables communication to be overlapped with no algorithmic changes, and we show that when done well, this can significantly reduce communication overheads. To maximize overlap, we aim to begin communication as soon as possible: whenever a layer has finished computing its updates, an allreduce for it is started. This leads to relatively fine-grained communication and requires quality implementations of non-blocking communication. Since communication is done for each individual layer, the volume of data being communicated in each operation is quite small. This results in many of the allreduces being latency-bound rather than bandwidth-bound, contrary to the typical case for training deep networks. Latency also becomes increasingly important at large scales.

Once latency becomes a significant factor in communication performance, local synchronization overhead also becomes a concern. The standard approach to interfacing CUDA-aware MPI for communication with data being computed on GPUs is to synchronize the stream computing the data prior to beginning communication. This imposes overheads both due to synchronization and because kernel launch latencies for GPU computations cannot be pipelined as effectively. We propose instead to make our communication operations aware of the stream a GPU datum is being computed on. This enables the communication operation to function similarly to a CUDA kernel, and minimizes the synchronization overheads without impacting pipelining. Unfortunately, current MPI distributions, even those that are CUDA-aware, do not provide a means to do this.

These improvements enable the communication overhead for training deep networks to be significantly reduced and training to be scaled to larger systems. We summarize our contributions as follows:

• We examine the communication overheads involved in training deep networks and show that overlapping can significantly reduce them.

• We identify getting good performance for fine-grained, often latency-dominated communication as critical. We show how latency-optimized allreduce algorithms can significantly outperform the more common bandwidth-optimized ring algorithms for relevant data sizes, especially at scale.

• We demonstrate techniques that can be used for performing communication on GPU data in a non-blocking manner for both the host and GPU, while reducing synchronization overheads.

• We introduce the Aluminum library, an open-source library that implements our communication techniques¹.

• We evaluate the impact of these methods in both microbenchmarks and end-to-end training within the open-source LBANN toolkit [1].

II. COMMUNICATION REQUIREMENTS

We begin by discussing in more detail the communication involved in training a deep network, including where the communication occurs and what volume of data is moved. This forms the basis of our subsequent discussion on optimizing communication.

A. Where and what is the communication?

Training a deep network can be thought of as involving three phases that are repeated iteratively: forward propagation, backpropagation, and optimization. Forward propagation involves computing the output of the network for the input data, and is essentially inference. Backprop computes gradients to update the network parameters based on its inference, and the optimization phase applies the updates, typically using a variant of stochastic gradient descent. When using a data-parallel approach to parallelize training, communication is performed only during backpropagation² (see II-D for the model-parallel case). This communication is an allreduce that synchronizes the independent updates that each processor computes into a global update that can be applied independently. (See [23]–[25] for overviews of deep learning and its optimization and parallelization.)

Implementations can perform this allreduce either using centralized parameter servers (e.g. as in TensorFlow) or via a decentralized allreduce implementation such as MPI’s MPI_Allreduce or equivalent. We focus on the latter case exclusively in this work.

Backprop is performed sequentially for each layer in a network, beginning with the final layer and ending with the input layer. Each layer receives as input an “error signal” from the subsequent layer, and computes a modified error signal as its output. If a layer additionally has parameters that need to be learned, the layer will compute a gradient based on the input error signal. It is important to note that within a layer, these two operations are independent and can be performed in any order. Once the gradient has been computed, it can be combined with other processors’ gradients to compute the global gradient for that layer.

The granularity of communication can vary depending on the implementation. At one extreme, all data could be combined into a single buffer and allreduced once backprop completes for every layer. Alternatively, allreduces can be done as soon as the gradient computation for a layer completes, and work on a per-buffer basis. Many implementations (including ours) keep separate, non-contiguous buffers for the parameters for each layer for simplicity, so operating on a per-buffer basis is typical.

¹URL removed for anonymization.
²One could equivalently think of communication as being performed during the optimization phase; we choose backprop for convenience.

Fig. 1. Histograms breaking down the number of parameters in each parameter buffer (essentially, a layer) for (a) AlexNet and (b) ResNet-50. In our implementation, each parameter is a 4-byte float.

Fig. 2. Strong scaling results for ResNet-50 using our benchmark on Sierra, with either (a) NCCL or (b) a dynamic minimal-time algorithm, and with and without overlap of communication/computation. The bars show the total time for a mini-batch, broken down by computation and unoverlapped communication. Superimposed is the ratio of communication to computation. Note that each Sierra node has four GPUs.

Fig. 3. Weak scaling results for ResNet-50 using our benchmark on Sierra, with either (a) NCCL or (b) a dynamic minimal-time algorithm, and with and without overlap of communication/computation. The mini-batch size per GPU is fixed to 32, so the global mini-batch size increases with scale. Since the total dataset size is fixed, more GPUs results in fewer, larger iterations. The bars show the total time for an epoch, broken down by computation and unoverlapped communication. Superimposed is the ratio of communication to computation. Note that each Sierra node has four GPUs.

In this work, our layers use 4-byte single-precision floats to store parameters, and we communicate parameters in this format. Within the networks we consider, convolutional, fully-connected, and batch normalization [26] layers have parameters that must be learned. In our implementation, convolutional and fully-connected layers have their parameters stored in a single buffer per layer. Batch normalization, for convenience, has two buffers, one for its scale parameters and one for its bias parameters.

B. Communication volume

We now look to understand the amount of data and number of buffers that must be communicated in an iteration. This depends on the architecture of the network being trained (e.g. number and size of filters in a convolutional layer). Figure 1 plots histograms of parameter buffer size for two representative image classification networks, AlexNet [27] and ResNet-50 [22].

AlexNet is a fairly shallow network that has several large fully-connected layers, and is a commonly used baseline or building block where state-of-the-art accuracy is not necessary. It has relatively few buffers: five convolutional layers and three fully-connected layers, with all but the final layer having a separate bias. The three largest buffers are the fully-connected layers, which contain a majority of the parameters.

ResNet-50 is more representative of modern CNNs, which have many more layers, batch normalization, and fewer fully-connected layers. ResNet architectures do not have biases, but many of the small buffers are due to the parameters for batch normalization layers. Since many recent architectures and benchmarks have focused on ResNet-like architectures or ResNet-50 in particular (e.g. [28]), we will use it for the remainder of the paper.

A key observation to make from these plots is that both networks require allreduces to be performed on many small buffers. For ResNet-50, a majority of the buffers are 8 KiB or less. However, there is also a very large range of buffer sizes, spanning 256 bytes to megabytes. A single algorithm for performing the allreduce is unlikely to perform optimally across this entire regime, as we discuss in Section III.

C. Communication overhead

We now empirically examine the communication overhead involved in training ResNet-50 on ImageNet [29] in various configurations. Our goal in this section is to understand the baseline performance, which can then be improved upon. We utilize a simple benchmark that incorporates the compute cost of convolutional layers (the primary computational cost in ResNet-50) and the communication cost of synchronizing layer gradients.

The compute time is determined by benchmarking the runtime of the relevant cuDNN [30] routines for convolution on the local problem size of each convolutional layer. Communication time is determined by benchmarking allreduces of the relevant sizes, using the NCCL collective communication library [31]. We assume that a separate allreduce is performed on each buffer. We treat the fully-connected layer as being model-parallel (see II-D) and neglect it for simplicity; as it is a small layer, this does not significantly affect our results. Note that this benchmark is meant to illustrate the major sources of communication and computation, and neglects many aspects of a full training pipeline, such as I/O, optimization, activation layer computation, and internal synchronization.
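As a concrete illustration of how the communication half of such a benchmark can be measured, the following sketch times an NCCL allreduce for a single buffer size using CUDA events, with a warmup run and an average over several trials. The function and parameter names are illustrative, and an initialized ncclComm_t and CUDA stream are assumed; error checking is elided.

```cpp
// Sketch of timing an in-place NCCL allreduce for one buffer size.
#include <cuda_runtime.h>
#include <nccl.h>

float time_allreduce_ms(float* d_buf, size_t count,
                        ncclComm_t comm, cudaStream_t stream,
                        int trials = 10) {
  // Warmup run, not timed.
  ncclAllReduce(d_buf, d_buf, count, ncclFloat, ncclSum, comm, stream);
  cudaStreamSynchronize(stream);

  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);
  cudaEventRecord(start, stream);
  for (int i = 0; i < trials; ++i) {
    ncclAllReduce(d_buf, d_buf, count, ncclFloat, ncclSum, comm, stream);
  }
  cudaEventRecord(stop, stream);
  cudaEventSynchronize(stop);

  float ms = 0.0f;
  cudaEventElapsedTime(&ms, start, stop);
  cudaEventDestroy(start);
  cudaEventDestroy(stop);
  return ms / trials;  // Average time per allreduce on this rank.
}
```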

We run this benchmark on the Sierra supercomputer [32], which consists of 4,320 compute nodes with two IBM POWER9 CPUs and four NVIDIA V100 (Volta) GPUs with NVLINK2 per node, interconnected via a dual-rail InfiniBand EDR network. We use CUDA 9.2.148, cuDNN 7.2.1, and NCCL 2.2.13.

1) Strong scaling: To strong scale ResNet-50 training, we keep all parameters constant and increase the number of GPUs being trained on. The mini-batch size is 256, per the original paper. Due to memory constraints, we cannot train ResNet-50 on fewer than 8 GPUs, and the mini-batch limits us to at most 256 GPUs. We additionally neglect issues that may be caused by batch normalization having few samples per node [33], [34].

We plot the mini-batch iteration time, as well as a breakdown of computation versus communication, in Figure 2a (left). As the number of GPUs increases, the computation time decreases, but the scaling is unfortunately sublinear. Simultaneously, communication requirements increase as more nodes are involved while the number of iterations remains constant. Runtimes improve up to 32 GPUs, after which communication overheads outweigh the benefits. The communication/computation ratio rapidly increases, and even at only 32 GPUs accounts for more than half the runtime.

2) Weak scaling: For weak scaling, we keep every parameter but the mini-batch size fixed and train with 32 samples per GPU. This is the same regime as [15] or [16], which demonstrate how to maintain model accuracy despite the large mini-batch, and offers a good compromise between GPU utilization and memory requirements. Note that as the mini-batch size increases, the number of iterations to complete an epoch decreases (it is 4955 iterations when the mini-batch size is 256).

We plot total epoch time, again with a communication/computation breakdown, in Figure 3a (left). In this case, computation scales linearly. The total time for communication decreases as the number of GPUs increases, because fewer iterations are performed, resulting in fewer rounds of communication, although this trend breaks down for large numbers of GPUs. However, the ratio of communication to computation steadily worsens, resulting in a nearly 10x ratio of communication to computation on 1024 GPUs. Despite this, it remains profitable to weak scale ResNet-50 training to this scale, though it suffers from significant diminishing returns.³

Fig. 4. Performing a non-blocking allreduce on a GPU using NCCL. Data is computed on a stream by the application, and CUDA event synchronization is used to cause a separate, internal stream to block until that computation is complete (solid red boxes are CUDA events). The internal stream performs the NCCL allreduce while the data stream can continue with other computation. When the result is needed, the internal stream can be synchronized back to the data stream. (Boxes are not to scale.)

D. Model-parallel fully-connected layers

We briefly discuss the differences in communication when using model-parallel fully-connected layers. These essentially implement a distributed matrix product, which can be thought of as a collective operation involving every processor. Communication is now required in both forward and backward propagation to compute the layer’s output, error signal, and gradients; however, no additional communication is needed to synchronize the gradient update. Since matrix products typically require their input data to have a particular distribution (e.g. blocked), data may need to be moved from a “data-parallel” distribution for this. The communication operations performed depend on the algorithm being used, but typically involve a variety of collectives beyond allreduce.

III. OVERLAP AND LATENCY

We now discuss two basic optimizations for reducing communication overhead and improving performance: overlapping and latency-efficient allreduce algorithms. Neither of these techniques is new. Overlapping communication and computation during training has been discussed before (e.g. [15]), and we provide additional detail on implementing it with GPUs. Latency-efficient allreduces are similarly not new [35]; however, deep learning applications have typically preferred bandwidth-optimized ring-based allreduce implementations as in the Baidu allreduce [5] or NCCL/NCCL2 [31] libraries.

A. Overlapping

Overlapping communication and computation when training deep nets involves concurrently performing allreduces to compute global gradient updates and backpropagation or optimization for other layers. This can be done within the constraints discussed in Section II-A. Thus, to maximize the potential for overlapping, within each layer gradients should be computed first, and then an asynchronous allreduce started on that buffer. The remainder of backprop can be performed in the same manner, and the allreduce completed when the optimization phase for that layer begins. This enables the allreduce to be hidden by the error signal computation in the associated layer, and all computation in all remaining layers.
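The following structural sketch shows this ordering; the layer interface and allreduce handle here are placeholders rather than LBANN's or Aluminum's actual types, and the start/wait functions stand in for whatever non-blocking allreduce mechanism is used.

```cpp
// A structural sketch of the overlapping strategy: gradients first, the
// allreduce started immediately, and its completion deferred to the
// optimization step. All names below are illustrative placeholders.
#include <cstddef>
#include <functional>
#include <vector>

struct Layer {
  bool has_parameters = true;
  std::function<void()> compute_gradients = [] {};     // local gradient
  std::function<void()> compute_error_signal = [] {};  // signal for prev layer
  std::function<void()> apply_update = [] {};          // optimizer step
};

// Placeholders for starting/completing a non-blocking allreduce on a layer's
// gradient buffer (e.g. via NCCL on an internal stream, or via MPI).
using AllreduceHandle = std::function<void()>;
AllreduceHandle start_gradient_allreduce(Layer&) { return [] {}; }
void wait_gradient_allreduce(const AllreduceHandle& h) { h(); }

void backprop_and_update(std::vector<Layer>& layers) {
  std::vector<AllreduceHandle> pending(layers.size());
  // Backprop runs from the last layer to the first.
  for (std::size_t j = layers.size(); j-- > 0;) {
    if (layers[j].has_parameters) {
      layers[j].compute_gradients();                     // gradients first...
      pending[j] = start_gradient_allreduce(layers[j]);  // ...then allreduce
    }
    layers[j].compute_error_signal();  // hides (part of) the allreduce
  }
  // Optimization phase: complete each allreduce only when its update runs.
  for (std::size_t j = 0; j < layers.size(); ++j) {
    if (layers[j].has_parameters) {
      wait_gradient_allreduce(pending[j]);
      layers[j].apply_update();
    }
  }
}
```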

³We were unable to evaluate beyond 1024 GPUs due to bugs in NCCL.

Achieving communication/computation overlap when utilizing the NCCL library requires additional work. The communication operations are enqueued as kernels on a CUDA stream; hence, if they are enqueued on the same stream on which training computations are performed, the operations will block. Instead, the NCCL allreduce should be performed on a separate CUDA stream, and appropriate synchronization added. We illustrate this strategy in more detail in Figure 4. In our experiments and profiling, we have observed that this strategy enables excellent communication/computation overlap.
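A minimal sketch of this strategy, assuming an already-initialized ncclComm_t and omitting error checking, looks roughly as follows; the overlap comes from the other work the caller enqueues on the data stream between starting and completing the operation.

```cpp
// Sketch of the Figure 4 strategy: run the NCCL allreduce on a separate,
// internal stream, ordered after the gradient computation on the data
// stream via CUDA events. Neither call blocks the host.
#include <cuda_runtime.h>
#include <nccl.h>

// Start the allreduce on `internal_stream`, ordered after work already
// enqueued on `data_stream`. Returns an event marking its completion.
cudaEvent_t start_overlapped_allreduce(float* d_grads, size_t count,
                                       ncclComm_t nccl_comm,
                                       cudaStream_t data_stream,
                                       cudaStream_t internal_stream) {
  cudaEvent_t grads_ready, done;
  cudaEventCreateWithFlags(&grads_ready, cudaEventDisableTiming);
  cudaEventCreateWithFlags(&done, cudaEventDisableTiming);

  cudaEventRecord(grads_ready, data_stream);
  cudaStreamWaitEvent(internal_stream, grads_ready, 0);
  ncclAllReduce(d_grads, d_grads, count, ncclFloat, ncclSum,
                nccl_comm, internal_stream);
  cudaEventRecord(done, internal_stream);

  cudaEventDestroy(grads_ready);  // Safe: destruction is deferred until done.
  return done;
}

// Complete the allreduce with respect to the data stream: subsequent work on
// that stream (e.g. the optimizer) waits, but the host does not block.
void complete_overlapped_allreduce(cudaEvent_t done, cudaStream_t data_stream) {
  cudaStreamWaitEvent(data_stream, done, 0);
  cudaEventDestroy(done);
}
```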

Using this method, we perform the same benchmarks as in II-C. Figures 2a and 3a demonstrate the improvements. At small scales for both strong and weak scaling, the bulk of communication is successfully hidden. This further results in significantly improved communication/computation ratios. At larger numbers of GPUs (32 and above for strong scaling, 128 and above for weak scaling), there is insufficient computation to hide the communication at that scale. In particular, because many allreduces can only be started toward the end of backprop, allreduces later in backprop always have less computation available to hide them. Nonetheless, overlapping still reduces communication overhead in these cases.

B. Latency

While performing allreduces as soon as possible helps maximize overlap, it results in many small allreduces being performed, some as small as 64 parameters (256 bytes). This size regime is latency-dominated instead of being bandwidth-dominated, and the size of allreduces that are latency-dominated increases as the number of GPUs increases.

Typically, allreduce libraries for deep learning have been bandwidth-optimized and employ ring-based algorithms [5], [31]. These algorithms perform very well in multi-GPU shared-memory systems (especially ones optimized to have ring topologies, such as the NVIDIA DGX1) or at small distributed-memory scales despite not being latency-optimized. AlexNet-style networks (see Figure 1a) also have far fewer small allreduces and several very large allreduces.

For data lying in host memory, it is easy to rely on optimized MPI distributions, which provide a suite of allreduce algorithms tuned to these different regimes [35]. As our data of interest lies exclusively in GPU memory, we could make use of a CUDA-aware MPI runtime; however, as we discuss in Section IV, directly utilizing MPI results in overheads because of MPI’s interaction with CUDA. We instead make use of our own implementation (discussed later), “host-transfer”, that wraps around MPI and is preferred for small messages at larger scales.

This enables our allreduces to take advantage of tree-based algorithms for latency-bound workloads and large numbers of processes. These can be algorithmically better than ring algorithms in many regimes. In particular, we utilize recursive-doubling and recursive-halving/recursive-doubling algorithms. Recursive-doubling is preferred for small messages and has optimal latency. The recursive-halving/recursive-doubling algorithm (also called Rabenseifner’s algorithm) has slightly worse latency, but better bandwidth utilization, and is preferred for larger messages.
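For reference, the recursive-doubling pattern is simple to express; the following sketch is a textbook formulation for host buffers and a power-of-two number of ranks, not Aluminum's actual implementation.

```cpp
// A minimal sketch of a recursive-doubling allreduce (summation) on host
// buffers, assuming the number of ranks is a power of two.
#include <mpi.h>
#include <vector>

void recursive_doubling_allreduce(float* buf, int count, MPI_Comm comm) {
  int rank, nprocs;
  MPI_Comm_rank(comm, &rank);
  MPI_Comm_size(comm, &nprocs);  // Assumed to be a power of two here.

  std::vector<float> recv(count);
  // At step k, each rank exchanges its full buffer with the partner whose
  // rank differs in bit k, then reduces locally. After log2(nprocs) steps,
  // every rank holds the complete sum.
  for (int mask = 1; mask < nprocs; mask <<= 1) {
    int partner = rank ^ mask;
    MPI_Sendrecv(buf, count, MPI_FLOAT, partner, 0,
                 recv.data(), count, MPI_FLOAT, partner, 0,
                 comm, MPI_STATUS_IGNORE);
    for (int i = 0; i < count; ++i) {
      buf[i] += recv[i];
    }
  }
}
```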

Fig. 5. The fastest allreduce algorithm for a given number of GPUs and buffer size on Sierra. A green dot marks configurations where our host-transfer allreduce is fastest; a red triangle where NCCL is fastest. The host-transfer points for 16 and 32 GPUs appear to be due to a protocol change or something similar within NCCL.

To demonstrate the different regimes in which NCCL and our host-transfer allreduce are better, we conducted a simple benchmark comparing their performance across a range of node/GPU counts (2-256 nodes / 8-1024 GPUs) and buffer sizes (1-2²⁸ parameters) on the Sierra supercomputer. For each configuration we computed the average over ten runs of the in-place version of the allreduce algorithm, after a warmup run. The underlying MPI distribution was the 2018.08.13 release of Spectrum MPI, the default MPI runtime on Sierra.

Figure 5 plots which implementation is faster for a given configuration. At small scales, NCCL is superior, as the latency difference between the ring and tree-based algorithms is relatively small. Once running on 16 nodes (64 GPUs), the host-transfer allreduce outperforms NCCL for messages of up to 32768 parameters. At the largest scale, the host-transfer allreduce is preferred for messages up to 2¹⁹ parameters. We attribute the sudden transition between NCCL and the host-transfer allreduce between 32 and 64 nodes to two factors. First, NCCL is able to take advantage of GPUDirect RDMA [36] and node-local topology information to reduce communication overhead and latency. Second, NCCL’s implementation has likely been significantly more optimized than ours.

Figure 6 plots the actual performance results for each scale. We can see that NCCL has a significant advantage at the smallest scale (two nodes), which gradually disappears as the number of nodes increases. At 128 GPUs, our implementation is over 2x faster than NCCL for small messages, and at 1024 GPUs, it is over 10x faster for some message sizes. Further, the tree-based allreduces scale much better with increasing node count than NCCL’s ring-based allreduce.

It is important to observe that the size range where the host-transfer allreduce outperforms NCCL corresponds to a significant portion of the allreduces required when training AlexNet or ResNet-50 (see Figure 1). While these allreduces also tend to be faster, improving their performance helps to reduce communication overheads during training.

To this end, we repeat the benchmark from Section II-C with a “minimal” algorithm that is a hybrid of the host-transfer and NCCL allreduces. This algorithm uses our prior benchmarking results to select the fastest implementation for a given input configuration. The results for strong and weak scaling (with and without overlap) are presented in Figures 2b and 3b.
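The dispatch itself is straightforward; the sketch below shows its general shape, with crossover points that are merely illustrative (loosely following Figure 5) rather than the measured table the benchmark-driven selection actually uses.

```cpp
// Illustrative sketch of size- and scale-based allreduce selection.
#include <cstddef>

enum class AllreduceImpl { NCCL, HostTransfer };

AllreduceImpl select_allreduce(std::size_t count /* #parameters */,
                               int num_gpus) {
  if (num_gpus < 64) {
    return AllreduceImpl::NCCL;  // Small scale: ring latency penalty is small.
  }
  // Larger scales: latency-bound sizes go to the host-transfer (tree-based)
  // allreduce; the crossover grows with the number of GPUs. These thresholds
  // are placeholders, not measured values.
  std::size_t crossover = (num_gpus >= 1024) ? (1u << 19) : (1u << 15);
  return (count <= crossover) ? AllreduceImpl::HostTransfer
                              : AllreduceImpl::NCCL;
}
```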

Strong scaling benefits less from the better allreduce algorithms, as the regime where it is profitable is not significantly impacted by them. Nonetheless, at larger scales communication overhead is significantly reduced. This implies that with a better implementation and improved compute scaling, we may be able to successfully strong-scale training further.

Weak scaling exhibits a more noticeable impact, dramatically improving the performance at large scales. Whereas NCCL, even with overlap, barely improves performance from 256 to 1024 GPUs, the minimal algorithm sees continued profit in scaling to 1024 GPUs. Furthermore, communication overhead—while still quite high—is significantly improved, by up to ∼2.6x at 1024 GPUs.

An additional optimization, not reflected within these benchmarks, is to run multiple allreduces concurrently. In a latency-dominated regime, we are not limited by packet injection rates or similar issues, but instead by waiting for communication to complete. This enables pipelining the allreduces to further reduce communication overhead. One limitation that we have observed with NCCL is that it performs only a single allreduce at a time, even if multiple allreduces could be executed. Our host-transfer allreduce implementation does not have this restriction.

IV. INTERFACING WITH MPI

Why can we not simply use CUDA-aware MPI directly for allreduces when appropriate? Fundamentally, we argue that because MPI is unaware of users’ CUDA streams, a semantic mismatch between the MPI and CUDA programming models arises, leading to communication and computation overheads due to unnecessary synchronization. We will then discuss approaches to fixing this mismatch.

A. Problems

When using CUDA to compute data on a GPU, one typically launches a sequence of compute kernels on a CUDA stream. The CUDA runtime ensures that kernels launched on a stream are executed in launch order (there is no ordering between multiple streams unless one is imposed using explicit synchronization). This means that, provided kernels are launched in the right order, when a kernel begins execution, all its inputs are ready. Kernel launches (along with most other CUDA operations) are asynchronous and do not block the host, but there is a cost (roughly 10 µs) associated with launching them. For this reason, one typically launches many kernels in a row without waiting for their completion, pipelining the launches and hiding the launch latency for every kernel beyond the first.
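The following small CUDA sketch shows this pipelining pattern; the kernel is a trivial placeholder.

```cpp
// Many kernels are enqueued on one stream without any host synchronization,
// so each launch's cost (roughly 10 microseconds) is hidden behind the
// execution of earlier kernels.
#include <cuda_runtime.h>

__global__ void scale_kernel(float* data, int n, float alpha) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) data[i] *= alpha;
}

void run_pipeline(float* d_data, int n, cudaStream_t stream, int num_kernels) {
  const int threads = 256;
  const int blocks = (n + threads - 1) / threads;
  // Launches return immediately; the stream guarantees in-order execution,
  // so each kernel's inputs are ready when it runs.
  for (int k = 0; k < num_kernels; ++k) {
    scale_kernel<<<blocks, threads, 0, stream>>>(d_data, n, 0.999f);
  }
  // Only synchronize when the results are actually needed on the host.
  cudaStreamSynchronize(stream);
}
```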

Fig. 6. Performance results for our host-transfer allreduce and NCCL’s allreduce on Sierra. Panels (a)–(h) show 8, 16, 32, 64, 128, 256, 512, and 1024 GPUs.

MPI runtimes are unaware of users’ CUDA streams. Therefore, when a user passes a GPU buffer to an MPI routine, MPI has no way to determine whether there is a pending computation on a stream that will write to the buffer. To ensure correctness when a kernel may write to the buffer, the user must synchronize the stream to complete pending computation. This forces the application into a bulk-synchronous model of separated computation and communication phases, and prevents pipelining kernel launches to hide their launch latency or overlapping communication and computation. Similarly, when MPI communication is in progress, there is no way for a stream to wait for a blocking operation’s completion (e.g. MPI_Allreduce or MPI_Wait). This further means that other streams that might synchronize with the first stream also need to be blocked.
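The resulting pattern, sketched below with error checking elided, is what applications are forced into when calling CUDA-aware MPI directly on GPU buffers.

```cpp
// The bulk-synchronous pattern described above: the stream must be fully
// drained before the allreduce, the host blocks for the duration of the
// communication, and no further kernel launches can be pipelined behind it.
#include <cuda_runtime.h>
#include <mpi.h>

void bulk_synchronous_allreduce(float* d_grads, int count,
                                MPI_Comm comm, cudaStream_t stream) {
  // MPI cannot know which stream produces d_grads, so the caller must
  // synchronize the stream before handing the GPU buffer to MPI.
  cudaStreamSynchronize(stream);

  // Blocks the host until the allreduce completes; the GPU is idle during
  // communication and the network is unused during compute phases.
  MPI_Allreduce(MPI_IN_PLACE, d_grads, count, MPI_FLOAT, MPI_SUM, comm);

  // Only now can further kernels be enqueued on the stream.
}
```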

Alternating computation and communication phases in this manner leads to an awkward and error-prone programming model, and underutilization of both the network (during computation phases) and GPU (during communication). As shown in Figures 2 and 3, overlapping is an important performance optimization. Blocking the host to wait for operations to complete also limits its ability to overlap other operations, such as I/O, with computation and/or communication. In the context of training deep nets, I/O can be quite expensive, so hiding it is crucial. Finally, we have shown that latency-optimized communication is important for scaling; the additional synchronization imposed will increase latency.

A further concern with using CUDA-aware MPI is practical. We have observed that CUDA-aware MPI runtimes often do not handle operations with GPU buffers correctly when they are performed from multiple threads, even when MPI_THREAD_MULTIPLE is enabled. We hope that this can be resolved by improved documentation and bugfixes by MPI distributions.

B. Possible solutions

One solution that achieves correctness is to push the synchronization into the MPI library. Since it is unaware of which user stream is producing the buffer to be communicated, the library must synchronize the entire device, either explicitly or via CUDA’s default stream semantics. This resolves none of the performance issues noted above.

A more promising solution comes from the idea that we should treat MPI communication operations as “just another kernel” to be enqueued on a stream. A proof-of-concept that communication can be treated this way is NCCL, where each operation takes a stream as an argument and results in the usual semantics for a kernel launch: it doesn’t block the host, is ordered within the stream, and blocks the stream.

Unfortunately, MPI operations cannot take a stream parameter. However, we find it sufficient to associate a single stream with a communicator. Every operation that uses the communicator and a GPU buffer can then assume that the buffer is written to by some kernel on that stream, and perform the appropriate synchronization with respect to only that stream. Within MPI, this association can be implemented as an attribute attached to the communicator. The details associated with achieving the remaining characteristics are somewhat complicated and implementation-dependent. We have taken this approach and implemented it in our Aluminum library, detailed in the next section.
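A sketch of the attribute-based bookkeeping is shown below; it only records the association, and an MPI library (or a wrapper such as Aluminum) would still need to consult the attribute and synchronize against that stream internally. The function names are illustrative.

```cpp
// Associating a CUDA stream with an MPI communicator via a communicator
// attribute; the stored stream must outlive its use with the communicator.
#include <cuda_runtime.h>
#include <mpi.h>

static int stream_keyval = MPI_KEYVAL_INVALID;

void attach_stream(MPI_Comm comm, cudaStream_t* stream) {
  if (stream_keyval == MPI_KEYVAL_INVALID) {
    MPI_Comm_create_keyval(MPI_COMM_NULL_COPY_FN, MPI_COMM_NULL_DELETE_FN,
                           &stream_keyval, nullptr);
  }
  // Store a pointer to the stream the application computes its buffers on.
  MPI_Comm_set_attr(comm, stream_keyval, stream);
}

cudaStream_t get_attached_stream(MPI_Comm comm) {
  if (stream_keyval == MPI_KEYVAL_INVALID) {
    return nullptr;  // No attribute registered: fall back to the default stream.
  }
  cudaStream_t* stream = nullptr;
  int flag = 0;
  MPI_Comm_get_attr(comm, stream_keyval, &stream, &flag);
  return flag ? *stream : nullptr;
}
```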

While this paper has focused on allreduces, due to their importance in training deep networks, these approaches are in no way exclusive to allreduces and should be applicable to any communication operation.

V. THE ALUMINUM LIBRARY

We have developed the Aluminum library as an open-source communication library. It provides a generic API for communication operations implemented by multiple backends, and currently supports MPI, NCCL, and custom implementations of various operations for both CPU and GPU communication. This library encapsulates the optimizations discussed throughout this paper, such as easy non-blocking operations on both host and GPU (Section III-A), latency-optimized algorithms (Section III-B), and CUDA-friendly synchronization semantics (Section IV). It is currently being leveraged by both the LBANN deep learning toolkit [1] and the Hydrogen distributed linear algebra library [37] (a fork of the Elemental library [38]), as shown in Figure 9.

Fig. 7. Implementation of the blocking host-transfer allreduce. The data buffer to transmit is computed on the data stream, after which a device-to-host memcpy moves the buffer to the host. CUDA event synchronization is used to determine when the transfer has completed, after which an MPI allreduce is performed. Meanwhile, the data stream is blocked with a wait operation until the host signals completion, after which a host-to-device memcpy transfers the buffer back. A second CUDA event signals completion of this transfer, so temporary resources can be released. (Boxes are not to scale.)

Fig. 8. Implementation of the non-blocking host-transfer allreduce. This is similar to the blocking version, except the operation is performed on an internal stream and a separate wait operation is used to invoke the synchronization with the data stream to complete the operation. (Boxes are not to scale.)

A. API and semantics

Aluminum is a C++11 library with an API inspired by MPI’s. It consists of a core providing internal implementation frameworks and three communication backends (and is extendable to support more):

MPI provides both an interface to MPI (by directly calling MPI routines) and custom collective implementations built atop MPI. It is meant to be used with host buffers.

NCCL provides a direct interface to NCCL for use with GPU buffers.

MPI-CUDA implements a variety of custom algorithms that are built on top of MPI and CUDA for use with GPU buffers. This backend implements our “host-transfer” allreduce. (This is independent of CUDA-aware MPI.)

Fig. 9. Integration of Aluminum into the open-source toolkits, LBANN and Hydrogen.

The API to invoke a non-blocking, in-place allreduce (for example) looks like: Al::NonblockingAllreduce<Backend>(buffer, count, op, comm, req), where buffer and count define the buffer to be reduced, op is a reduction operation (e.g. summation), comm is an Aluminum communicator object, and req is a request object. C++ templates are used to infer the type of the buffer and dispatch the operation to the correct backend. Aluminum also handles algorithm selection where appropriate, making a reasonable choice based on the buffer and communicator sizes (this can also be manually specified by the user). The allreduce then proceeds asynchronously, and can be completed via a wait operation: Al::Wait<Backend>(req). Every backend automatically handles Aluminum’s synchronization semantics, described below.

Aluminum currently supports a subset of the standard MPI collective operations in both blocking and non-blocking versions, including: reduce, allreduce, reduce-scatter, allgather, and broadcast. The MPI-CUDA backend additionally supports the basic send, recv, and sendrecv point-to-point operations for GPU buffers. The NCCL backend is currently limited to only the subset of reduction operations that NCCL supports (summation, multiplication, min, and max); our other backends support a more general set of reduction operations.

The semantics of Aluminum’s blocking and non-blocking operations differs from MPI, and it implements the approach discussed in Section IV-B in a manner that provides a fairly generic interface for both CPU and GPU operations. We associate a “stream of computation” with each communicator. For GPU backends, this is a CUDA stream. For the MPI backend, this stream is implicit, and can be thought of as the calling thread or process; this could be made explicit in the future to better support threading or lightweight threading libraries. All operations then synchronize the communicator’s stream as necessary. This is critically important for GPU operations, where it means that no GPU operation blocks the host. From the example above, if the Al::Wait operation were used with the MPI-CUDA backend, it would (perhaps counterintuitively) not block the host, but instead block comm’s CUDA stream until the allreduce completed.
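A usage sketch built around the calls quoted above follows. Only Al::NonblockingAllreduce<Backend> and Al::Wait<Backend> are taken directly from the text; the header name, backend and communicator type names, and the reduction-operator spelling are assumptions for illustration and may not match the library exactly.

```cpp
#include <Al.hpp>  // Assumed header name for the Aluminum library.
#include <cstddef>

void allreduce_gradients(float* d_grads, std::size_t count,
                         Al::NCCLBackend::comm_type& comm) {
  // Request object identifying the in-flight operation.
  Al::NCCLBackend::req_type req;

  // Start a non-blocking, in-place allreduce (summation) on the GPU buffer.
  // The call returns immediately; the operation is ordered with respect to
  // the CUDA stream associated with `comm`.
  Al::NonblockingAllreduce<Al::NCCLBackend>(
      d_grads, count, Al::ReductionOperator::sum, comm, req);

  // ... other kernels may be enqueued on the communicator's stream here ...

  // Complete the operation. With a GPU backend this does not block the host;
  // it blocks the communicator's stream until the allreduce has finished.
  Al::Wait<Al::NCCLBackend>(req);
}
```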

B. Some implementation details

We now describe some notable implementation details of Aluminum.

1) Communication engine: Any communication that must perform operations on the host without blocking the main thread of execution needs to be run in a separate, dedicated thread that serves as the communication or progress engine. This thread is automatically bound by the library to a core, and uses some basic heuristics to avoid conflicting with both other processes that may be on the same node and other threads (e.g. OpenMP compute threads) that the application may spawn. Asynchronous operations are submitted to the communication engine as a state object that encapsulates the operation to be performed and any necessary state (essentially, a closure). Submission is done via a lock-free single-producer, single-consumer queue (implemented as a classic Lamport queue [39] with modifications described in [40], which could be generalized to an MPSC queue). The engine maintains an internal queue of currently running state objects, and invokes a step method on them, which should not block. When the operation has completed, the engine can optionally indicate this to other threads by atomically setting a flag in a request object.
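A much-simplified sketch of this pattern is shown below; it uses a mutex-protected queue and omits core binding to stay short, whereas the real engine uses the lock-free queue and heuristics described above.

```cpp
// Simplified progress-engine sketch: the application submits state objects,
// and a dedicated thread repeatedly calls step() on them until they finish,
// then sets a completion flag in the associated request object.
#include <atomic>
#include <deque>
#include <memory>
#include <mutex>
#include <thread>
#include <vector>

struct AlRequest {
  std::atomic<bool> completed{false};  // Polled by the submitting thread.
};

struct AlState {
  virtual ~AlState() = default;
  virtual bool step() = 0;  // Advance without blocking; true when finished.
  std::shared_ptr<AlRequest> req;
};

class ProgressEngine {
public:
  ProgressEngine() : thread_([this] { run(); }) {}
  ~ProgressEngine() {
    stop_.store(true);
    thread_.join();
  }
  void submit(std::unique_ptr<AlState> state) {
    std::lock_guard<std::mutex> lock(mutex_);
    incoming_.push_back(std::move(state));
  }

private:
  void run() {
    std::vector<std::unique_ptr<AlState>> in_progress;
    while (!stop_.load()) {
      {  // Pull newly submitted operations.
        std::lock_guard<std::mutex> lock(mutex_);
        while (!incoming_.empty()) {
          in_progress.push_back(std::move(incoming_.front()));
          incoming_.pop_front();
        }
      }
      // Make a little progress on every in-flight operation.
      for (auto it = in_progress.begin(); it != in_progress.end();) {
        if ((*it)->step()) {
          if ((*it)->req) (*it)->req->completed.store(true);
          it = in_progress.erase(it);
        } else {
          ++it;
        }
      }
    }
  }

  std::mutex mutex_;
  std::deque<std::unique_ptr<AlState>> incoming_;
  std::atomic<bool> stop_{false};
  std::thread thread_;
};
```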

This implementation approach is inspired by the communication engines that have been used in other high-performance communication libraries [41], [42].

Aluminum’s MPI backend utilizes the progress engine to provide asynchronous progress on the host, both for custom algorithm implementations and via MPI_Test polling for non-blocking MPI operations. We do this because we have observed that MPI implementations often do not make adequate progress on their own without polling (see also e.g. [43]). The host-transfer allreduce also makes use of the progress engine to perform communication, as we describe next.

2) Non-blocking and host-transfer allreduce: Aluminum has a heavy focus on non-blocking communication with GPU buffers. Non-blocking operations via the NCCL backend are automatically run on one of Aluminum’s internal CUDA streams, as discussed in Section III-A and Figure 4. The Al::Wait operation implements the synchronization to complete the communication. All of this is done with asynchronous CUDA operations and does not block the caller.

For latency-dominated workloads, we have implemented a “host-transfer” allreduce that, as described in Section III-B, can be significantly more performant than NCCL in the right regimes. At a high level, this implementation simply transfers the GPU memory to the host, performs the allreduce in host memory using MPI, and transfers the result back to the GPU. To avoid the caller blocking, the operation enqueues the necessary kernels and events on the communicator’s stream, and then delegates communication to Aluminum’s communication engine. Polling on CUDA events is used to determine when memory transfers have completed. To block the stream while communication is in progress, the cuStreamWaitValue32 operation from the CUDA driver API is used. This prevents any work submitted to the stream after the call from beginning until a memory location is written. The entire process is described in more detail in Figure 7. A non-blocking version of this is implemented similarly to non-blocking operations for NCCL, by running on an internal stream (see Figure 8).
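The following condensed sketch shows the sequence of operations from Figure 7, with a plain thread standing in for the progress engine. Error checking and resource cleanup are elided; device support for stream memory operations and an MPI library initialized with sufficient thread support are assumed.

```cpp
// Host-transfer allreduce sketch: nothing here blocks the calling thread;
// the memcpies and the cuStreamWaitValue32 wait are all enqueued on the
// communicator's stream.
#include <cuda.h>
#include <cuda_runtime.h>
#include <mpi.h>
#include <thread>

void host_transfer_allreduce(float* d_buf, int count, MPI_Comm comm,
                             cudaStream_t stream) {
  // Pinned host staging buffer and a mapped 32-bit completion flag.
  float* h_buf;
  cudaMallocHost((void**)&h_buf, count * sizeof(float));
  volatile unsigned* h_flag;
  cudaHostAlloc((void**)&h_flag, sizeof(unsigned), cudaHostAllocMapped);
  *h_flag = 0;
  unsigned* d_flag;
  cudaHostGetDevicePointer((void**)&d_flag, (void*)h_flag, 0);

  cudaEvent_t d2h_done;
  cudaEventCreateWithFlags(&d2h_done, cudaEventDisableTiming);

  // Enqueue on the data stream: copy the buffer to the host once it has
  // been computed, block the stream until the host writes the flag, then
  // copy the reduced result back.
  cudaMemcpyAsync(h_buf, d_buf, count * sizeof(float),
                  cudaMemcpyDeviceToHost, stream);
  cudaEventRecord(d2h_done, stream);
  cuStreamWaitValue32(stream, (CUdeviceptr)d_flag, 1, CU_STREAM_WAIT_VALUE_EQ);
  cudaMemcpyAsync(d_buf, h_buf, count * sizeof(float),
                  cudaMemcpyHostToDevice, stream);

  // Stand-in for the progress engine: wait for the transfer, reduce on the
  // host with MPI, then release the stream by writing the flag.
  std::thread engine([=] {
    cudaEventSynchronize(d2h_done);  // The real engine polls instead.
    MPI_Allreduce(MPI_IN_PLACE, h_buf, count, MPI_FLOAT, MPI_SUM, comm);
    *h_flag = 1;  // Unblocks the cuStreamWaitValue32 wait above.
  });
  engine.detach();
}
```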

Because we transfer the entire GPU buffer to the host, this approach could be significantly optimized by utilizing GPUDirect RDMA [36], and by pipelining for longer messages.

While we have described and implemented this “host-transfer” approach for allreduces, it can be applied to any communication operation. We briefly describe applying this approach to send and recv operations next.

3) Other operations: Send and recv operations that support Aluminum’s semantics for GPU buffers are useful both to support applications that require more irregular communication patterns and as building blocks for custom implementations of collectives. Both operations can be implemented similarly to the host-transfer allreduce.

For a send operation, we transfer the data from the GPU to the host and then use MPI_Isend within the communication engine to perform the completion. The communicator’s stream does not need to be blocked: similarly to MPI’s semantics, we consider it locally complete when the user buffer can be reused. For recv, the communication engine can begin an MPI_Irecv immediately while blocking the communicator’s stream. Once complete, the stream is notified and the buffer transferred to the GPU.
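A sketch of the send half, written as a step()-style state object in the spirit of the progress-engine sketch above, might look like the following; the names are illustrative, and the recv side mirrors it with the stream blocked until the MPI_Irecv completes.

```cpp
// Host-transfer send sketch: copy the GPU buffer to pinned host memory on
// the communicator's stream, then let the progress engine issue MPI_Isend
// once the copy has finished. The stream itself is never blocked.
#include <cuda_runtime.h>
#include <mpi.h>

struct HostTransferSend {
  float* h_staging;      // Pinned host buffer (e.g. from cudaMallocHost).
  int count, dest, tag;
  MPI_Comm comm;
  cudaEvent_t d2h_done;  // Recorded after the device-to-host copy.
  MPI_Request mpi_req = MPI_REQUEST_NULL;

  void enqueue(const float* d_buf, cudaStream_t stream) {
    cudaMemcpyAsync(h_staging, d_buf, count * sizeof(float),
                    cudaMemcpyDeviceToHost, stream);
    cudaEventRecord(d2h_done, stream);
    // The send is locally complete once the user buffer can be reused,
    // matching MPI's semantics, so the stream is not blocked.
  }

  // Called repeatedly by the progress engine; returns true when finished.
  bool step() {
    if (mpi_req == MPI_REQUEST_NULL) {
      if (cudaEventQuery(d2h_done) != cudaSuccess) return false;  // Copy pending.
      MPI_Isend(h_staging, count, MPI_FLOAT, dest, tag, comm, &mpi_req);
      return false;
    }
    int done = 0;
    MPI_Test(&mpi_req, &done, MPI_STATUS_IGNORE);
    return done != 0;
  }
};
```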

Using these operations as primitives, we have implemented our own ring allreduce in Aluminum’s MPI-CUDA backend. This allreduce pipelines communication and host/GPU memory transfers, supports both single- and bi-directional rings, and performs reduction operations on-GPU. While this implementation is not always competitive with NCCL’s (in particular, it does not take advantage of GPUDirect RDMA), it does enable additional flexibility by supporting reduction operations that NCCL lacks.

VI. TRAINING EXPERIMENTS

To evaluate end-to-end training in a real environment, we integrated our Aluminum library into the LBANN toolkit [1], which is optimized for training deep networks on large GPU clusters. We train ResNet-50 on the ImageNet-1K dataset [29], with data being read off a Lustre parallel filesystem.

Strong scaling is performed by fixing the mini-batch size to 256, the default, and increasing the number of GPUs. Weak scaling fixes a per-GPU mini-batch size of 32, increasing the global mini-batch size as the number of GPUs increases. This results in fewer iterations being performed per epoch.

Results comparing Aluminum’s NCCL backend with CUDA-aware MPI (MVAPICH2 version 2.3) are presented in Figure 10, where we observe that Aluminum+NCCL significantly outperforms CUDA-aware MPI. This improvement is primarily a result of using Aluminum’s synchronization semantics, which not only enables communication/computation overlap, but avoids blocking the host or repeatedly paying kernel launch overheads. A major advantage of not blocking the host is that I/O and data preprocessing for a subsequent mini-batch is completely overlapped with the backprop and optimization phases of the prior mini-batch.

VII. RELATED WORK

Many other frameworks for training deep neural networks, including TensorFlow [3], PyTorch [4], FireCaffe [2], LBANN [1], and CNTK [7], aim to scale training, and optimize communication to that end. While these frameworks often implement a large variety of optimizations, they typically rely on either MPI or NCCL to provide the underlying communication layer on dedicated clusters, and therefore can benefit from the optimizations we have discussed and implemented within Aluminum.

Fig. 10. Strong and weak scaling for training ResNet-50 with LBANN on Pascal: (a) strong scaling (2 GPUs per node); (b) weak scaling (32 samples per GPU). (Missing points for CUDA-aware MPI took too long to run.)

There are several communication layers that have been developed primarily to accelerate training deep networks, and can be integrated into existing frameworks such as TensorFlow. These often aim to replace centralized parameter servers with a decentralized allreduce implementation. Baidu’s allreduce [5] was the first attempt to leverage ring allreduces for training deep networks, and is implemented atop CUDA-aware MPI to manage GPU communication. Facebook’s Gloo [44] supports a number of collective algorithms, including multiple optimized allreduce implementations; it builds upon MPI, node-local NCCL (but not distributed NCCL), and custom communication layers. Uber’s Horovod [6] similarly supports allreduces and several other collectives, and builds upon CUDA-aware MPI and NCCL. Horovod supports tensor fusion, which attempts to address issues with latency-bound allreduces by merging the buffers together to perform fewer, larger allreduces. Baidu’s allreduce implements none of the optimizations we have described; Gloo and Horovod will not block host execution, but do not overlap computation on the GPU, and do not implement latency-optimized allreduces.

NCCL [31] and MPI are most similar in approach to Aluminum, and we build upon both in many ways, as discussed throughout the paper. NCCL lacks native support for non-blocking allreduces and is not latency-optimized; MPI suffers from a semantic mismatch with CUDA that limits its performance. NVSHMEM [45] implements highly-optimized point-to-point communication that avoids the semantic issues MPI suffers from while being entirely managed from a GPU. It provides no collective operations, but could be useful as a building block for higher-level systems.

Many works have investigated scaling training by increasing the mini-batch size (weak scaling) [15], [16], [46]. The primary contribution of these works is not the scaling techniques, but the learning techniques used to maintain model accuracy despite the large mini-batch size. [46] does discuss large-scale communication, but similarly to Horovod, addresses latency concerns through tensor fusion.

Another significant body of work has investigated many other approaches to parallelizing training; we refer the reader to [25] for an excellent overview. We view these techniques as orthogonal to our work here: they can be leveraged in concert to improve performance. One important class of work that more closely relates is quantization [19]–[21]. These approaches aim to trade additional local computation to reduce communication volume, and are particularly applicable to optimizing allreduces on large buffers. Quantization complements our work especially well, as it requires good overlap between computation and quantization+communication and shifts more communication into a latency-bound regime. [47] discusses techniques and APIs that enable efficient implementation of quantized communication.

VIII. CONCLUSIONS

We have examined the communication requirements for training deep neural networks, and found that the overhead of communication is significant and becomes the dominant cost at scale. We applied several optimizations to reduce this overhead. Overlapping communication and computation and using latency-optimized algorithms help to directly reduce this overhead. Identifying and working around the semantic mismatch between MPI and CUDA both reduces overheads and enables overlapping of host and GPU computation. We incorporated these improvements into the open-source Aluminum library.

We do not view these optimizations as inherent to Aluminum or the present work, and encourage other libraries, especially MPI distributions, to adopt them. These techniques are also not limited to training deep networks; other applications that leverage GPUs, such as numerical or graph analytics applications, can benefit from them too. Aluminum's semantics and point-to-point communication implementations are a step toward supporting more irregular communication patterns.


Our work has improved the strong and weak scaling of training deep networks significantly. However, both remain heavily communication-bound at large scales. Strong scaling has always been difficult due to diminishing computational work and increased communication requirements. Aluminum enables profitable weak scaling to large numbers of GPUs, but communication overheads limit the benefits. As GPUs continue to improve computational performance, while network bandwidth grows slowly and latencies reach physical limits, communication will only become a more critical bottleneck in the future. To efficiently utilize large GPU clusters, implementors must pay close attention to the optimizations described here, while developing improved techniques to reduce latencies and other overheads and scale communication.

ACKNOWLEDGMENTS

Prepared by LLNL under Contract DE-AC52-07NA27344. Funding provided by LDRD #17-SI-003. Experiments were performed at the Livermore Computing facility. The authors would like to thank the LBANN team for their assistance.

REFERENCES

[1] B. Van Essen, H. Kim, R. Pearce, K. Boakye, and B. Chen, "LBANN: Livermore Big Artificial Neural Network HPC toolkit," in Proceedings of the Workshop on Machine Learning in High-Performance Computing Environments. ACM, 2015, p. 5.

[2] F. N. Iandola, M. W. Moskewicz, K. Ashraf, and K. Keutzer, "FireCaffe: near-linear acceleration of deep neural network training on compute clusters," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2592–2600.

[3] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard et al., "TensorFlow: a system for large-scale machine learning," in OSDI, vol. 16, 2016, pp. 265–283.

[4] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer, "Automatic differentiation in PyTorch," in NIPS-W, 2017.

[5] Baidu Research, "Baidu allreduce," https://github.com/baidu-research/baidu-allreduce, 2018.

[6] A. Sergeev and M. D. Balso, "Horovod: fast and easy distributed deep learning in TensorFlow," arXiv preprint arXiv:1802.05799, 2018.

[7] F. Seide and A. Agarwal, "CNTK: Microsoft's open-source deep-learning toolkit," in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2016, pp. 2135–2135.

[8] A. Coates, B. Huval, T. Wang, D. Wu, B. Catanzaro, and N. Andrew, "Deep learning with COTS HPC systems," in International Conference on Machine Learning, 2013, pp. 1337–1345.

[9] M. D. Schatz, R. A. Van de Geijn, and J. Poulson, "Parallel matrix multiplication: A systematic journey," SIAM Journal on Scientific Computing, vol. 38, no. 6, pp. C748–C781, 2016.

[10] R. A. Van De Geijn and J. Watts, "SUMMA: Scalable universal matrix multiplication algorithm," Concurrency: Practice and Experience, vol. 9, no. 4, pp. 255–274, 1997.

[11] S. Hochreiter and J. Schmidhuber, "Flat minima," Neural Computation, vol. 9, no. 1, pp. 1–42, 1997.

[12] N. S. Keskar, D. Mudigere, J. Nocedal, M. Smelyanskiy, and P. T. P. Tang, "On large-batch training for deep learning: Generalization gap and sharp minima," arXiv preprint arXiv:1609.04836, 2016.

[13] P. Chaudhari, A. Choromanska, S. Soatto, Y. LeCun, C. Baldassi, C. Borgs, J. Chayes, L. Sagun, and R. Zecchina, "Entropy-SGD: Biasing gradient descent into wide valleys," arXiv preprint arXiv:1611.01838, 2016.

[14] L. Dinh, R. Pascanu, S. Bengio, and Y. Bengio, "Sharp minima can generalize for deep nets," arXiv preprint arXiv:1703.04933, 2017.

[15] P. Goyal, P. Dollar, R. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He, "Accurate, large minibatch SGD: training ImageNet in 1 hour," arXiv preprint arXiv:1706.02677, 2017.

[16] Y. You, Z. Zhang, C. Hsieh, J. Demmel, and K. Keutzer, "ImageNet training in minutes," CoRR, abs/1709.05011, 2017.

[17] F. Seide, H. Fu, J. Droppo, G. Li, and D. Yu, "On parallelizability of stochastic gradient descent for speech DNNs," in Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on. IEEE, 2014, pp. 235–239.

[18] J. Keuper and F.-J. Pfreundt, "Distributed training of deep neural networks: Theoretical and practical limits of parallel scalability," in Proceedings of the Workshop on Machine Learning in High Performance Computing Environments. IEEE Press, 2016, pp. 19–26.

[19] F. Seide, H. Fu, J. Droppo, G. Li, and D. Yu, "1-bit stochastic gradient descent and its application to data-parallel distributed training of speech DNNs," in Fifteenth Annual Conference of the International Speech Communication Association, 2014.

[20] N. Dryden, T. Moon, S. A. Jacobs, and B. Van Essen, "Communication quantization for data-parallel training of deep neural networks," in Proceedings of the Workshop on Machine Learning in HPC Environments (MLHPC). IEEE, 2016, pp. 1–8.

[21] D. Alistarh, D. Grubic, J. Li, R. Tomioka, and M. Vojnovic, "QSGD: Communication-efficient SGD via gradient quantization and encoding," in Advances in Neural Information Processing Systems, 2017, pp. 1709–1720.

[22] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.

[23] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016, http://www.deeplearningbook.org.

[24] L. Bottou, F. E. Curtis, and J. Nocedal, "Optimization methods for large-scale machine learning," SIAM Review, vol. 60, no. 2, pp. 223–311, 2018.

[25] T. Ben-Nun and T. Hoefler, "Demystifying parallel and distributed deep learning: An in-depth concurrency analysis," arXiv preprint arXiv:1802.09941, 2018.

[26] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," arXiv preprint arXiv:1502.03167, 2015.

[27] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.

[28] C. Coleman, D. Narayanan, D. Kang, T. Zhao, J. Zhang, L. Nardi, P. Bailis, K. Olukotun, C. Re, and M. Zaharia, "DAWNBench: An end-to-end deep learning benchmark and competition," in Advances in Neural Information Processing Systems, 2017.

[29] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, "ImageNet Large Scale Visual Recognition Challenge," International Journal of Computer Vision (IJCV), vol. 115, no. 3, pp. 211–252, 2015.

[30] S. Chetlur, C. Woolley, P. Vandermersch, J. Cohen, J. Tran, B. Catanzaro, and E. Shelhamer, "cuDNN: Efficient primitives for deep learning," arXiv preprint arXiv:1410.0759, 2014.

[31] NVIDIA, "NVIDIA collective communications library," https://developer.nvidia.com/nccl, 2018.

[32] Lawrence Livermore National Laboratory, "Sierra," https://hpc.llnl.gov/hardware/platforms/sierra, 2018.

[33] S. Ioffe, "Batch renormalization: Towards reducing minibatch dependence in batch-normalized models," in Advances in Neural Information Processing Systems, 2017, pp. 1945–1953.

[34] Y. Wu and K. He, "Group normalization," arXiv preprint arXiv:1803.08494, 2018.

[35] R. Thakur, R. Rabenseifner, and W. Gropp, "Optimization of collective communication operations in MPICH," The International Journal of High Performance Computing Applications, vol. 19, no. 1, pp. 49–66, 2005.

[36] NVIDIA, "GPUDirect RDMA," https://docs.nvidia.com/cuda/gpudirect-rdma/index.html, 2018.

[37] Hydrogen team, "Hydrogen," https://github.com/LLNL/Elemental, 2018.

[38] J. Poulson, B. Marker, R. A. Van de Geijn, J. R. Hammond, and N. A. Romero, "Elemental: A new framework for distributed memory dense matrix computations," ACM Transactions on Mathematical Software (TOMS), vol. 39, no. 2, p. 13, 2013.

[39] L. Lamport, "Proving the correctness of multiprocess programs," IEEE Transactions on Software Engineering, no. 2, pp. 125–143, 1977.


[40] N. M. Le, A. Guatto, A. Cohen, and A. Pop, "Correct and efficient bounded FIFO queues," in Computer Architecture and High Performance Computing (SBAC-PAD), 2013 25th International Symposium on. IEEE, 2013, pp. 144–151.

[41] A. Brooks, H.-V. Dang, N. Dryden, and M. Snir, "PPL: an abstract runtime system for hybrid parallel programming," in Proceedings of the First International Workshop on Extreme Scale Programming Models and Middleware. ACM, 2015, pp. 2–9.

[42] H.-V. Dang, M. Snir, and W. Gropp, "Towards millions of communicating threads," in Proceedings of the 23rd European MPI Users' Group Meeting. ACM, 2016, pp. 1–14.

[43] P. R. Eller and W. Gropp, "Scalable non-blocking preconditioned conjugate gradient methods," in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE Press, 2016, p. 18.

[44] Facebook, "Gloo," https://github.com/facebookincubator/gloo, 2018.

[45] S. Potluri, A. Goswami, D. Rossetti, C. Newburn, M. G. Venkata, and N. Imam, "GPU-centric communication on NVIDIA GPU clusters with InfiniBand: A case study with OpenSHMEM," in 2017 IEEE 24th International Conference on High Performance Computing (HiPC). IEEE, 2017, pp. 253–262.

[46] X. Jia, S. Song, W. He, Y. Wang, H. Rong, F. Zhou, L. Xie, Z. Guo, Y. Yang, L. Yu et al., "Highly scalable deep learning training system with mixed-precision: Training ImageNet in four minutes," arXiv preprint arXiv:1807.11205, 2018.

[47] C. Renggli, D. Alistarh, and T. Hoefler, "SparCML: High-performance sparse communication for machine learning," arXiv preprint arXiv:1802.08021, 2018.