
Tree-based Allreduce Communication on MXNet

Carl Yang
Amazon AWS

2100 University Ave., Palo Alto, CA 94303

[email protected]

Abstract

Allreduce is the key communication primitive used in deep learning. In this paper we propose an algorithm for automatically producing Allreduce communication schedules for arbitrary network topologies. We show that this can be accomplished efficiently by finding a Kernighan-Lin clustering on the class of binary trees. We provide experimental evidence that trees found using such an algorithm show peak 6.6× and 1.42× speed-ups over two state-of-the-art algorithms in end-to-end training. This provides, for the first time, provably optimal aggregation schedules for several common hardware architectures.

1 Introduction

Deep learning, the technology behind recent breakthroughs such as superhuman ability in the board game Go [1], image classification [2] and voice recognition [3], is a very computationally intensive task. Two modes of training deep learning models are popular: the model parallel approach, which shards parameters across processors while training on the same data, and the data parallel approach, which maintains the same parameters across processors while training on different data. We will focus our investigation on the latter approach.

The main communication in the data parallel approach happens after the back-propagation step. Once all processors (typically GPUs) have the gradient for each layer, they must perform an all-to-all reduce (i.e. an "Allreduce") on this data. On a single machine, this Allreduce is typically implemented as a Reduce followed by a Broadcast.
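To make this decomposition concrete, here is a minimal, framework-agnostic NumPy sketch in which a Python list stands in for the per-GPU device buffers; the function name and array sizes are ours and are chosen only for illustration, not taken from the MXNet implementation.

import numpy as np

def allreduce(per_gpu_grads):
    # Reduce: sum every GPU's local gradient onto one root buffer.
    reduced = np.sum(per_gpu_grads, axis=0)
    # Broadcast: every GPU receives a copy of the reduced result.
    return [reduced.copy() for _ in per_gpu_grads]

# 8 GPUs, each holding its own gradient for the same layer.
grads = [np.random.rand(4) for _ in range(8)]
synced = allreduce(grads)
assert all(np.allclose(g, synced[0]) for g in synced)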

The latest generation of GPU servers on AWS (i.e. p3.16xlarge instances) has a complex topology, which matches that of the NVIDIA DGX-1 server (shown in Figure 5a). In the literature, there has not been much work taking advantage of network topology. The vendor-shipped NVIDIA Collective Communications Library (NCCL) is topology-aware and excellent at communicating large messages. However, our experiments show it is not latency-optimal for small messages (Figure 2c). The default single-machine training communication algorithm in MXNet, parameter server on a single machine (abbreviated PS-single), is good at communicating small messages. However, it is not bandwidth-optimal for large messages and not topology-aware.

Another key difference is that traditional high-performance computing (HPC) uses the α-β model, which models communication time as α + nβ, where α is the per-message latency, β the per-byte transfer time, and n the message size in bytes. These models further assume that each process can send and receive one message per time step. However, on p3.16xlarge, by using direct memory access (DMA) through GPUDirect, a GPU can send to as many GPUs as it is connected to (in this case, 4). This implies that typical algorithms found in HPC (single ring, single tree, etc.) are ineffective, because for p processes they use at most p or p − 1 links. This means that in the p3.16xlarge topology, the other 32 − p NVLink connections are idle.
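To make the accounting concrete, the sketch below plugs illustrative (made-up) values of α and β into the textbook α-β costs for a ring Allreduce and for a binary-tree Reduce plus Broadcast, and shows how slicing a message across several parallel trees shrinks the bandwidth term. The formulas are the standard ones from the HPC literature, not measurements from this paper, and the parameter values are assumptions for illustration only.

import math

alpha = 5e-6        # per-message latency (s); illustrative value
beta = 1 / 25e9     # per-byte transfer time (s/B); roughly one single NVLink
p = 8               # number of GPUs

def ring_allreduce_cost(n_bytes):
    # Textbook ring Allreduce: reduce-scatter followed by allgather.
    return 2 * (p - 1) * alpha + 2 * (p - 1) / p * n_bytes * beta

def tree_allreduce_cost(n_bytes, links=1):
    # Binary-tree Reduce then Broadcast; with `links` parallel trees the
    # message can be sliced so each tree carries n_bytes / links.
    steps = 2 * math.ceil(math.log2(p))
    return steps * (alpha + (n_bytes / links) * beta)

for n in (4e3, 4e6, 4e8):   # 1K, 1M, 100M float32 words
    print(f"{int(n):>11,} B: ring {ring_allreduce_cost(n):.2e} s, "
          f"single tree {tree_allreduce_cost(n):.2e} s, "
          f"4 trees {tree_allreduce_cost(n, links=4):.2e} s")

Under these assumptions the tree pays only 2⌈log2 p⌉ latency terms instead of the ring's 2(p − 1), while slicing across multiple trees recovers much of the ring's bandwidth advantage, which is the trade-off exploited in Section 3.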

Preprint. Work in progress.


In recent years, TensorFlow, PyTorch and MXNet have emerged as leading deep learning frameworks. We focus our efforts on MXNet. We achieve latency-optimality in a similar manner as Wang, Li, Liberty and Smola [4], but with a different problem formulation (see Section 3.1). In this paper, our contributions are as follows:

1. We design a new tree-based communication algorithm for single-machine training that seeks to match:
   • PS-single: latency-optimal for small messages
   • NCCL: bandwidth-optimal for large messages

2. We achieve topology-awareness in a novel way by encoding information about the network topology into our trees during construction.

3. We devise an innovative technique targeting the NVLink communication model, which differs from the α-β model used in HPC, by slicing large messages into smaller messages to be sent along different links, so that all 32 NVLink connections are used.

4. We provide experimental evidence that using our tree-based communication algorithm we can achieve a peak 6.6× speed-up over PS-single and 1.42× over NCCL in end-to-end training of image classification models.

2 Background

In this section, we will discuss related work that has been done on single-machine training with MXNet, the benchmark we used to evaluate existing work, and the network topology of the hardware we are primarily targeting (even though our solution generalizes to every other topology as well).

2.1 Related Work

The NVIDIA Collective Communications Library (NCCL) is a vendor-shipped library from NVIDIA [5], optimized for single- and multi-machine communication. It uses the ring communication pattern (Figure 1a), which comes from traditional HPC. It uses many rings to divide large gradients into smaller ones, and uses each ring to communicate one small gradient. Its disadvantages are that it introduces another dependency to MXNet and that successive calls overlap poorly, which may be a consequence of its implementation as a monolithic CUDA kernel rather than using the traditional direct memory access (DMA) call "cudaMemcpyPeerAsync".

Parameter server on single-machine (abbreviated PS-single, see Figure 1b) is a single-machine Reduce and Broadcast implementation written in the MXNet backend, so it does not require any additional dependency. However, on models with large gradients (VGGNet and AlexNet), this method faces scalability issues at 8 GPUs.

2.2 Communication Latency Benchmark

Neither the ring Reduce communication pattern used by NCCL nor the parameter server Reduce currently used in MXNet is optimal for small batch sizes on p3.16xlarge instances with 8 GPUs. We run a benchmark to show this shortcoming in Figure 2. All our experiments are run using the experimental setup described in Section 4.1.

NCCL is clearly the best when only one Reduce and Broadcast is called. However, in workloads that require many Reduces and Broadcasts in sequence, NCCL becomes slower than the parameter server implementation for small message sizes.

Figure 2 explains the end-to-end performance results (see Figure 3), which show that the parameter server is faster for networks that require many Reduces and Broadcasts over relatively small keys (e.g. ResNet-50 and Inception-48 need over 157 Reduces and Broadcasts on keys not exceeding 2M floats in size), but NCCL ring Reduce is faster for networks that only need Reduces and Broadcasts over few keys (e.g. VGG-16 and AlexNet only need fewer than 32 Reduces and Broadcasts on keys that exceed 10M floats in size).

As shown in Figure 4, VGG-16 has one large gradient of ~400MB that accounts for over 74% of the communication traffic. For these very large gradients, NCCL has a significant bandwidth advantage, as shown by Figure 2e. This explains why PS-single has difficulty scaling on VGGNet-16 and AlexNet.


(a) NCCL (b) PS-single

Figure 1: Existing Reduce and Broadcast algorithms currently in MXNet.

2.3 Network Topology

The system we are targeting is the Amazon AWS EC2 p3.16xlarge instance. The network topology of p3.16xlarge is shown in Figure 5. It is composed of 3 types of links, which we later encode as edge weights (see the sketch following the list):

1. Double NVLink connections (50 GB/s)

2. Single NVLink connections (25 GB/s)

3. PCI-E connection + CPU (10 GB/s)
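As an illustration of how these three link types can be turned into the weighted graph used for tree construction (Section 3.1), the sketch below maps each link type to its bandwidth and fills an adjacency matrix. The bandwidth values come from the list above; the helper name and the sample edges are ours and do not reproduce the full Figure 5 wiring.

LINK_BANDWIDTH_GBPS = {
    "double_nvlink": 50,
    "single_nvlink": 25,
    "pcie_cpu": 10,
}

def build_topology(num_gpus, links):
    # links: iterable of (u, v, link_type) describing GPU-to-GPU connections.
    adj = [[0] * num_gpus for _ in range(num_gpus)]
    for u, v, kind in links:
        w = LINK_BANDWIDTH_GBPS[kind]
        adj[u][v] = adj[v][u] = w   # undirected, weighted by bandwidth
    return adj

# Illustrative subset of connections (not the exact p3.16xlarge wiring).
sample_links = [(0, 1, "double_nvlink"), (0, 3, "single_nvlink"),
                (0, 4, "pcie_cpu")]
adj = build_topology(8, sample_links)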

3 Single and Multiple Tree-based Communication

We would like to design a single-machine communication algorithm that: (i) maximizes performance on large gradients like NCCL, (ii) has very low latency on small keys like PS-single, and (iii) takes advantage of network topology information. Our approach is based on the idea of using trees to perform the Reduce and Broadcast. We can use the idea of minimum spanning trees to construct a binary tree Reduce communication pattern and improve it, following the paper by Wang, Li, Liberty and Smola [4]. Our strategy will be to use:

• A single tree: latency-optimal for small messages

• Multiple trees: bandwidth-optimal for large messages

This way, we will be able to enjoy both the bandwidth-optimality of NCCL and the latency-optimality of PS-single. Both methods are shown in Figure 6. In order to satisfy our requirement (iii), we will use network topology information when constructing our binary trees, prioritizing high-bandwidth links (double NVLink) and avoiding low-bandwidth links (PCI-E).
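The following sketch shows the intended dispatch between the two regimes. It models only the arithmetic, not the actual routing or NVLink transfers; the helper names are ours, the slicing via np.array_split is illustrative, and the 10M-word threshold is the default discussed in Section 3.4.

import numpy as np

TREE_ARRAY_BOUND = 10_000_000  # words; see MXNET_KVSTORE_TREE_ARRAY_BOUND

def reduce_over_tree(per_gpu_slices, tree):
    # Stand-in for a tree Reduce: the routing over `tree` is omitted here;
    # only the arithmetic result (an elementwise sum) is modelled.
    return np.sum(per_gpu_slices, axis=0)

def tree_allreduce(per_gpu_grads, trees):
    if per_gpu_grads[0].size < TREE_ARRAY_BOUND:
        # Small gradient: a single balanced tree keeps latency low.
        reduced = reduce_over_tree(per_gpu_grads, trees[0])
    else:
        # Large gradient: slice it so each tree (and its set of links)
        # carries one piece in parallel.
        sliced = [np.array_split(g, len(trees)) for g in per_gpu_grads]
        pieces = [reduce_over_tree([s[i] for s in sliced], trees[i])
                  for i in range(len(trees))]
        reduced = np.concatenate(pieces)
    # Broadcast the reduced gradient back to every GPU.
    return [reduced.copy() for _ in per_gpu_grads]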

3.1 Problem Formulation

Our solution works for any network topology and not only p3.16xlarge. Therefore, our proposed method ought to automatically detect how GPUs are connected in a single machine (topology-awareness) and construct an undirected, weighted graph G whose vertices represent GPUs and whose edges (u, v) represent the bandwidth of the link between u and v.


(a) 1 Reduce and Broadcast  (b) 50 Reduces and Broadcasts  (c) 150 Reduces and Broadcasts  (d) Latency and Bandwidth Cusp  (e) Bandwidth

[Plots: latency (s) or bandwidth (GB/s) vs. message size (words); series: PS-single, NCCL.]

Figure 2: Latency and bandwidth for Reduce and Broadcast of existing communication methods.

(a) VGGNet-16  (b) ResNet-50

[Plots: speed-up vs. NCCL against batch size per GPU.]

Figure 3: End-to-end training performance of PS-single compared against NCCL on various batch sizes using 8 V100 GPUs on p3.16xlarge instances.


(a) VGGNet-16  (b) ResNet-50

[Histograms: frequency of gradients vs. message size (words).]

Figure 4: Distribution of gradients that must be communicated.

(a) 2D (b) 3D (PCI-E not shown)

Figure 5: Network topology of p3.16xlarge instance.

Our task, then, is to find an optimal embedding of the vertices and edges of a tree T onto the vertices and edges of the graph G such that the tree has the following properties:

• Binary - Limiting the class of trees to binary has practical benefits, such as limiting memory consumption overhead and the search space.

• Minimum height - The tree's height determines how many pairwise operations we must do sequentially, so we require the tree be of minimum height (i.e. balanced).

• Maximum weight - Use of the highest bandwidth connections is maximized.

In the case of multiple trees, since we are constructing trees rooted at different vertices sequentially, we introduce another property:

• Penalty term - Multiplicative factor applied to a link in order to dissuade later trees from using an already utilized link.
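One way to read these properties together is as a scoring function over candidate tree embeddings, as sketched below. The function itself is illustrative and not the exact objective used in our implementation: `tree_edges` is the list of (u, v) GPU pairs used by the tree, `depth` its height, `adj` the weighted topology matrix from Section 2.3, and `used_links` the set of links that earlier trees already occupy.

LINK_USAGE_PENALTY = 0.7   # default MXNET_KVSTORE_TREE_LINK_USAGE_PENALTY

def tree_score(tree_edges, depth, adj, used_links, min_depth):
    if depth > min_depth:               # reject unbalanced trees
        return float("-inf")
    score = 0.0
    for u, v in tree_edges:
        w = adj[u][v]
        if w == 0:                      # edge does not exist in the topology
            return float("-inf")
        if (min(u, v), max(u, v)) in used_links:
            w *= LINK_USAGE_PENALTY     # dissuade reuse by later trees
        score += w                      # maximize total link bandwidth
    return score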


(a) Single tree (b) Multiple trees

Figure 6: Proposed tree-based Reduce and Broadcast algorithms.

3.2 Tree Construction

We considered two methods to generate the binary tree T. Since we are looking for a balanced binary tree instead of a maximum spanning tree, we cannot use polynomial-time algorithms such as Kruskal's or Boruvka's MST algorithm, as Wang, Li, Liberty and Smola [4] do. Using the network topology saved in an adjacency matrix as input, we used:

• Kernighan-Lin algorithm: a popular heuristic used to generate graph partitionings and hierarchical clusterings [6]; it forms part of the popular METIS package [7].

• Exhaustive search: We observed that in practice, Kernighan-Lin would get stuck due to our modification to the algorithm. In such cases, we resort to exhaustive search.

Since we did not want to introduce additional dependencies to MXNet, we implemented our own version of the Kernighan-Lin algorithm. We modify the Kernighan-Lin algorithm for a purpose it was not intended for: finding a binary tree embedding on the link topology we are interested in. Our algorithm works as shown in Algorithm 1.

We find that when the Kernighan-Lin heuristic is used in a greedy manner, as Lines 24-30 suggest, it is not guaranteed that we can find binary trees in all circumstances. For example, in the graph shown in Figure 7, at the beginning of the algorithm it has found 2 clusters:

1. Cluster black: 0, 1, 2

2. Cluster red: 3, 4, 5, 6

Now, the Kernighan-Lin heuristic has been tasked with dividing the cluster composed of 3, 4, 5, 6 into two, which it has done successfully into one cluster of (3, 6) and another of (4, 5), yielding:

1. Cluster black: 0, 1

2. Cluster blue: 2

3. Cluster red: 3, 6

4. Cluster green: 4, 5


Algorithm 1 Build binary tree using Kernighan-Lin heuristic.
 1: procedure BUILDTREEKL(graph, root, link_penalty)
 2:   roots ← {root}
 3:   tree ← {root}
 4:   vertices ← graph.vertices
 5:   n_partitions ← 1
 6:   partitions ← {0, 0, ..., 0}
 7:   finished ← false
 8:   while finished = false do
 9:     pairs ← {}
10:     // Histogram determines size of each partition
11:     hist ← HISTOGRAM(partitions, n_partitions)
12:     finished ← true
13:     for i = 0, 1, 2, ..., n_partitions do
14:       if hist[i] = 2 then
15:         pairs.push_back((i, -1))
16:         finished ← false
17:       else if hist[i] >= 3 then
18:         // Kernighan-Lin splits partition i into 2 partitions: i and n_partitions
19:         partitions ← KLPARTITION(graph, partitions, n_partitions)
20:         pairs.push_back((i, n_partitions))
21:         n_partitions ← n_partitions + 1
22:         finished ← false
23:       end if
24:     end for
25:     // Look for an edge (u, v) that satisfies all of the following conditions:
26:     //   (i) it is one of the edges crossing two newly formed partitions,
27:     //   (ii) vertex u of the edge is in roots, and
28:     //   (iii) it has the highest weight of all edges satisfying conditions (i) and (ii).
29:     // If such an edge is found, for parent root u create leaves u and v
30:     // and add u and v to new_roots.
31:     // If no such edge can be found, return false.
32:     success, tree, new_roots ← PROCESSPAIRS(graph, pairs, roots)
33:     roots ← new_roots
34:     if success = false then
35:       // If the KL heuristic gets stuck, use exhaustive search to find the tree
36:       return BUILDTREEES(graph, root)
37:     end if
38:   end while
39:   return tree
40: end procedure


(a) network topology graph (b) binary tree gets stuck

(c) valid binary tree

Figure 7: In the graph (left), Line 31 of Algorithm 1 requires the root vertex of the red cluster (vertex 3) to connect to any vertex of the green cluster, which is impossible. The binary tree (right) shows that there must be communication from GPU4 or GPU5 to GPU3, but there is no link to make that possible.

However, as the comments in Lines 24-30 suggest, what is necessary after this clustering is to find a link between the two clusters. Clearly such a link is possible from vertex 6 to 4 or from vertex 6 to 5. However, an additional requirement, as can be seen from the binary tree (see Figure 7b), is that the vertex from the red cluster must be a root vertex (i.e. vertex 3). The graph shows that there is no link between vertex 3 and either 4 or 5. Nevertheless, as Figure 7c shows, it is possible to build a binary tree rooted at vertex 2, just not using the greedy method described by Algorithm 1.
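A compact way to think about the exhaustive-search fallback is as a backtracking search over balanced splits: keep the current root in one half, pick a connected new root in the other half, and recurse, undoing the choice when no link exists. The sketch below follows that idea; the function name, return format and recursion scheme are ours and may differ from the actual BUILDTREEES routine.

from itertools import combinations

def exhaustive_tree(adj, vertices, root):
    # Return a list of (parent, child) pairs describing a balanced binary
    # Reduce tree rooted at `root`, or None if the topology admits none.
    n = len(vertices)
    if n == 1:
        return []
    others = [v for v in vertices if v != root]
    # Balanced split: the half kept by `root` has floor(n/2) or ceil(n/2) vertices.
    for keep in sorted({n // 2, (n + 1) // 2}):
        for own_rest in combinations(others, keep - 1):
            own = [root] + list(own_rest)
            far = [v for v in others if v not in own_rest]
            # The far half needs a new root that shares a link with `root`.
            for new_root in far:
                if adj[root][new_root] == 0:
                    continue
                left = exhaustive_tree(adj, own, root)
                right = exhaustive_tree(adj, far, new_root)
                if left is not None and right is not None:
                    return left + right + [(root, new_root)]
    return None

Because the splits are always balanced, any tree returned has minimum height, and trying every split and every connected new root is what lets it find trees, such as the one rooted at vertex 2 in Figure 7c, that the greedy procedure misses.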

3.3 Tree-based Communication

Once such a tree T has been found, the tree represents a routing along edges of graph G that allows Reduce and Broadcast to be performed in a way that maximizes usage of the highest bandwidth links of G. An example of this is shown in Figure 8. In this diagram, the red arrows represent GPU1 sending to GPU5, where the combined results of GPU1 and GPU5 are reduced on GPU5 (see Figure 8a). In the next diagram, GPU7 sends to GPU5, where the combined results of GPU1, GPU3, GPU5 and GPU7 are reduced on GPU5. In the final step, GPU4 sends to GPU5, where the combined results of all 8 GPUs are reduced on GPU5.

In order to do a Broadcast using the same tree, Step 3 is done first, followed by Steps 2 and 1. In addition, the direction of the red arrows is reversed.
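The sketch below illustrates how a leaf-to-root schedule of (parent, child) pairs, such as the one produced by the exhaustive_tree sketch above, can drive both phases: the Reduce walks the schedule forward, and the Broadcast walks it in reverse with the arrows flipped. The in-place array arithmetic stands in for NVLink transfers; the toy 4-GPU schedule is ours and not taken from Figure 8.

import numpy as np

def tree_reduce(bufs, schedule):
    # Walk the tree leaf-to-root: each child sends its buffer to its parent,
    # where the partial sums are accumulated.
    for parent, child in schedule:
        bufs[parent] += bufs[child]
    return bufs

def tree_broadcast(bufs, schedule):
    # Same tree, steps in reverse order and arrows flipped: the root pushes
    # the fully reduced result back down towards the leaves.
    for parent, child in reversed(schedule):
        bufs[child] = bufs[parent].copy()
    return bufs

# Toy example: 4 GPUs, tree rooted at GPU 0, schedule ordered leaf-to-root.
bufs = {g: np.full(4, float(g)) for g in range(4)}
schedule = [(0, 1), (2, 3), (0, 2)]
tree_broadcast(tree_reduce(bufs, schedule), schedule)
assert all(np.allclose(bufs[g], 0 + 1 + 2 + 3) for g in bufs)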

3.4 Single and Multiple Tree Threshold

In order to decide the threshold that determines whether a gradient is big enough to use multiple-tree communication, we ran a benchmark on VGGNet-16 (see Figure 9). In our experiments, we see that 10M words yields good results on our experimental setup.


(a) Step 1 (b) Step 2

(c) Step 3 (d) Summary

Figure 8: Reduce on a p3.16xlarge network topology (left), whose schedule is encoded in a single tree (right).

[Plot: end-to-end throughput (samples/s) vs. threshold (words).]

Figure 9: End-to-end throughput on VGGNet-16 in float16 precision as a function of threshold.

3.5 Integration with MXNet

Our proposed integration with MXNet is shown in Figure 10. We propose to use environment variables to give users a trial period, and then consider switching to a command-line parameter. The proposed environment variables, along with their default values, are listed below:

• MXNET_KVSTORE_USE_TREE: Whether our proposed solution is used in place of PS-single when "--kv-store device" is selected (default: false)

• MXNET_KVSTORE_LOGTREE: Whether user wants trees to be printed out (default: false)

• MXNET_KVSTORE_TREE_ARRAY_BOUND: Threshold above which multiple trees will be used instead of a single tree (default: 10,000,000)

• MXNET_KVSTORE_TREE_BACKTRACK: Whether KL will automatically be skipped in favour of exhaustive search (default: false)

• MXNET_KVSTORE_TREE_LINK_USAGE_PENALTY: Multiplicative factor that is applied to a link in order to dissuade later trees from using an already utilized link (default: 0.7)
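For illustration, a launch might set these variables before MXNet initializes its KVStore, either by exporting them in the shell or, as in the sketch below, from Python before importing mxnet. The variable names follow the list above; whether the merged code accepts "1" versus "true" for the boolean flags is an assumption here, as is the exact spelling in the released MXNet version.

import os

# Set before importing mxnet so the backend sees the values at KVStore creation.
os.environ["MXNET_KVSTORE_USE_TREE"] = "1"                   # enable tree-based Reduce/Broadcast
os.environ["MXNET_KVSTORE_TREE_ARRAY_BOUND"] = "10000000"    # single- vs. multiple-tree threshold (words)
os.environ["MXNET_KVSTORE_TREE_LINK_USAGE_PENALTY"] = "0.7"  # penalty for reusing a link

import mxnet as mx
kv = mx.kvstore.create("device")   # the "--kv-store device" code path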

4 Experiments

In this section, we will discuss our experimental setup and experimental results.


Figure 10: Proposed integration with MXNet.

4.1 Experimental Setup

We ran all experiments in this paper on an Amazon AWS p3.16xlarge instance with 64 virtual CPUs, 488 GB of main memory, and 8 NVIDIA V100 GPUs with 16 GB of on-board memory each. The operating system is Ubuntu 16.04. The GPU programs were compiled with NVIDIA's nvcc compiler (version 9.0.176). The C code was compiled using gcc 5.4.0. The NCCL version used is 2.1.15. Our code has been merged into the MXNet repository.¹

4.2 Image Classification Models

In our experiments, we test end-to-end training performance on ResNet-50 v1 ("resnet"), VGGNet-16 ("vgg"), Inception-v3 ("inception"), and AlexNet ("alexnet"). The training script used is "example/image-classification/train_imagenet.py".

4.3 Experimental Results

Our end-to-end performance is illustrated in Figure 11. Looking at the two deeper neural networks, resnet and inception, we show that we match PS-single's superior performance on small batch sizes (4-16 for float32 and 8-32 for float16). On these networks, we attain a peak 33% and 34% speed-up against NCCL on resnet and inception respectively. In addition, we show a peak 19% and 15% speed-up against PS-single. We attribute our speed-up over NCCL to superior overlapping of multiple Reduce and Broadcast operations by the MXNet engine. We provide experimental evidence for this in Figure 12.

Even though NCCL has the best performance of all communication mechanisms when a single gradient is being communicated, its latency falls behind when 50 gradients are being communicated, and the gap increases at 150 gradients. We see that PS-single and the single tree have similar performance for small gradients. Therefore, we have attained our first aim: matching the low latency of PS-single for small messages. This performance characteristic is what allows us to match PS-single's superior performance for small batch sizes on resnet and inception. We attribute our speed-up over PS-single to our topology-aware approach, which allows us to avoid the use of PCI-E links. We have found that when many messages need to be sent over PCI-E, queuing effects can be observed, possibly due to traffic going over the QPI (QuickPath Interconnect) bus on the CPU.

Next, we examine performance on the two shallower networks, vgg and alexnet. On these two networks, we attain a peak 6% and 42% speed-up over NCCL. However, compared to PS-single, we demonstrate a peak 589% and 660% speed-up. We attribute our speed-up to the combined effect of: (i) using a topology-aware approach to communication, and (ii) using multiple trees to slice large gradients into smaller gradients.

As Figure 12d shows, we have significantly narrowed the bandwidth advantage of NCCL with the single- and multiple-tree communication schemes. As mentioned above, this is done by the topology-aware approach and the multiple trees respectively. The impact of these two factors is broken down in Table 1.

¹ https://github.com/apache/incubator-mxnet


(a) resnet float32  (b) resnet float16  (c) vgg float32  (d) vgg float16  (e) inception float32  (f) inception float16  (g) alexnet float32

[Plots: speed-up vs. NCCL against batch size per GPU; series: PS-single, Exhaustive Search, Kernighan-Lin.]

Figure 11: End-to-end training performance of PS-single and the two proposed methods, Exhaustive Search and Kernighan-Lin, compared against NCCL on various batch sizes using 8 V100 GPUs on p3.16xlarge instances.


(a) 1 Reduce and Broadcast  (b) 50 Reduces and Broadcasts  (c) 150 Reduces and Broadcasts  (d) Bandwidth

[Plots: latency (s) or bandwidth (GB/s) vs. message size (words); series: PS-single, NCCL, Single Tree, Multiple Tree.]

Figure 12: Latency and bandwidth for Reduce and Broadcast of existing and proposed communication methods.

Optimization            Performance (samples/s)   Speed-up
Baseline (PS-single)    123                        —
Topology-aware          358                        2.91×
Multiple trees          724                        2.02×

Table 1: Impact of the two optimizations described in this section on end-to-end performance, measured in samples per second processed on VGGNet-16 in fp16 with a batch size of 8 per GPU. The optimizations are cumulative, meaning each one is stacked on top of the previous, and each speed-up is measured relative to the row above.

5 Conclusion

In this paper we proposed an algorithm for automatically producing Allreduce communication schedules for arbitrary network topologies. We showed that this can be accomplished efficiently by finding a Kernighan-Lin clustering on the class of binary trees. We provided experimental evidence that trees found using such an algorithm show peak 6.6× and 1.42× speed-ups over two state-of-the-art algorithms in end-to-end training. This provides, for the first time, provably optimal aggregation schedules for several common hardware architectures.

One possible direction for future research is to devise an analytic cost model that takes as input a network topology and a neural network model, and outputs an Allreduce schedule. Such a cost model would differ from our problem formulation in two ways: (i) it would explicitly include overlap as a parameter (we have only implicitly included it via the penalty term), and (ii) it may expand the search space beyond binary trees, which are a subset of trees, which are in turn a subset of all graphs. It would be particularly interesting to extend the class of graphs to include rings, so that the NCCL algorithm could be accounted for in the cost model.

Another interesting direction for future research is to use an auto-tuner instead of an analytic cost model. The advantage of this method is that, at the cost of running a few iterations of training, one may get a more accurate runtime estimate than an analytic cost model can provide.


In addition, such a system could be combined with the analytic cost model by doing a randomized search over 100 communication schedules and outputting the 10 best ones for the auto-tuner to compare.

6 Acknowledgments

I'd like to thank my mentors Rahul Huilgol and Haibin Lin for providing day-to-day guidance. Without their help, this work would not have been possible. I'd also like to thank Mu Li for the idea of slicing large gradients and using multiple trees. I'd also like to thank Leyuan Wang, Junyuan Xie, and Andrea Olgati for their valuable feedback and for suggesting future directions of research.

References

[1] David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484, 2016.

[2] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision, pages 1026-1034, 2015.

[3] Wayne Xiong, Jasha Droppo, Xuedong Huang, Frank Seide, Mike Seltzer, Andreas Stolcke, Dong Yu, and Geoffrey Zweig. Achieving human parity in conversational speech recognition. arXiv preprint arXiv:1610.05256, 2016.

[4] Leyuan Wang, Mu Li, Edo Liberty, and Alex J. Smola. Optimal message scheduling for aggregation. In ACM Conference on Systems and Machine Learning (SysML), 2018.

[5] Sylvain Jeaugey. NCCL 2.0. GPU Technology Conference (GTC), 2017.

[6] Brian W. Kernighan and Shen Lin. An efficient heuristic procedure for partitioning graphs. The Bell System Technical Journal, 49(2):291-307, 1970.

[7] George Karypis and Vipin Kumar. A software package for partitioning unstructured graphs, partitioning meshes, and computing fill-reducing orderings of sparse matrices. 1998.
