© 2019 Mellanox Technologies | Confidential 1

SC 19, Gil Bloch

Accelerating Deep Learning with In-Network Computing


What is it?

Moore’s Law


Where is it going?

Moore’s Law

▪ In April 2005, Gordon Moore stated in an interview that the projection cannot be sustained indefinitely: "It can't continue forever. ..." It no longer centered its research and development plan on Moore's law.


Moore’s Law

GPU Accelerated Computing


Exponential Data Growth


Cloud Big Data

Enterprise

Business Intelligence

HPC

Storage

Security

Machine Learning

Internet of Things

Exponential Data Growth Everywhere


It is not a wave, it is a Tsunami

Riding The Data Wave

Did you know that 90% of the world’s data has been created in only the last two years?

It has been predicted that by 2020, 40 zettabytes of data will be generated, an increase of 300 times from 2005!


Big Data? No… REALLY BIG DATA

Average data generated in a self-driving vehicle is expected to reach 40 TB for every eight hours of driving (this mostly applies to full service fleet vehicles)

The Pratt & Whitney PW1000G engine has 5,000 sensors installed, generating about 10 GB of data per second. With an average 12-hour flight time, this can produce up to 844 TB of data
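As a rough check of the figures above, the arithmetic can be sketched as follows. This assumes decimal units (1 TB = 1000 GB) and a constant data rate for the whole flight; the `engines` parameter is a hypothetical addition, since the slide does not say how many engines the 844 TB figure covers.

```python
# Figures from the slide; decimal units (1 TB = 1000 GB) are an assumption.
GB_PER_SECOND = 10
FLIGHT_HOURS = 12


def flight_data_tb(gb_per_second, hours, engines=1):
    """Total data in TB produced at a constant rate over a flight."""
    seconds = hours * 3600
    return gb_per_second * seconds * engines / 1000


per_engine_tb = flight_data_tb(GB_PER_SECOND, FLIGHT_HOURS)
twin_engine_tb = flight_data_tb(GB_PER_SECOND, FLIGHT_HOURS, engines=2)

# Self-driving vehicle: 40 TB per eight hours, expressed as a sustained rate.
vehicle_gb_per_second = 40 * 1000 / (8 * 3600)

print(per_engine_tb)    # 432.0
print(twin_engine_tb)   # 864.0
```

Under these assumptions one engine yields 432 TB per flight. The slide's 844 TB is close to twice that, which would be consistent with a twin-engine aircraft, though the slide does not state this.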

Mellanox is the de-facto interconnect for deep learning deployments


Neural Networks Complexity Growth

[Figure: neural network complexity over time, where Complexity = GOPS × Bandwidth]

▪ Image Recognition (2012 to 2016): AlexNet, GoogleNet, ResNet, Inception-V2, Inception-V4; 350X
▪ Speech Recognition (2014 to 2017): DeepSpeech, DeepSpeech-2, DeepSpeech-3; 30X


Mellanox Unleashes the Power of Artificial Intelligence

Enabling World-Leading Artificial Intelligence Solutions

▪ More Data
▪ Better Models
▪ Faster Interconnect

GPUs, CPUs, FPGAs, ASIC, Storage


The Need for Intelligent and Faster Interconnect

Faster Data Speeds and In-Network Computing Enable Higher Performance and Scale

[Figure: CPU and GPU nodes connected by an Onload Network versus by In-Network Computing]

▪ CPU-Centric (Onload): Must Wait for the Data, Creates Performance Bottlenecks
▪ Data-Centric (Offload): Analyze Data as it Moves! Higher Performance and Scale


An Application Example – Pizza Processing

▪ Order Pizza
  ▪ Call (or use Pizza application)
▪ PE 1 – Prepare Pizza
  ▪ Tomato sauce, Cheese, Pepperoni…
▪ PE 1 – Put in the oven
  ▪ And now we wait…
▪ PE 1 – Pack and send
▪ Network (Pizza Delivery)
▪ PE 2 – Pizza Consumption

CPU-Centric (Onload): Must Wait for the Pizza, Creates Performance Bottlenecks

[Figure: PE 1 (Pizza Generation) and PE 2 (Pizza Consumption) shown as CPU and GPU nodes connected by an Onload Network]


What if…


Data Centric Architecture to Overcome Latency Bottlenecks

Intelligent Interconnect Paves the Road to Exascale Performance

[Figure: CPU and GPU nodes in a CPU-centric layout versus a data-centric layout]

▪ CPU-Centric (Onload): Communications Latencies of 30-40us
▪ Data-Centric (Offload): Communications Latencies of 3-4us


In-Network Computing to Enable Data-Centric Data Centers

[Figure: CPU and GPU nodes connected through the in-network computing technologies listed below]

▪ GPUDirect
▪ RDMA
▪ Scalable Hierarchical Aggregation and Reduction Protocol
▪ NVMe Over Fabrics


The Need for Speed


Mellanox Accelerates TensorFlow 1.5

100G is a Must For Large Scale Models

6.5X Faster Training with 100G

[Figure: training speedup chart with 2.5X and 6.5X data labels]


Remote Direct Memory Access RDMA


Mellanox Accelerates TensorFlow

Unmatched Linear Scalability at No Additional Cost

50% Better Performance


Scalable Hierarchical Aggregation and Reduction Protocol (SHARP)


Scalable Hierarchical Aggregation and Reduction Protocol (SHARP)

▪ Reliable Scalable General Purpose Primitive

▪ Applicable to Multiple Use-cases in ML/HPC

▪ Scalable High Performance Collective Offload

[Figure: five hosts send Data up a tree of switches; the switches pass Aggregated Data upward and return the Aggregated Result to the hosts]
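The aggregation tree on this slide can be illustrated with a small simulation. This is a minimal sketch in plain Python, not the SHARP protocol or any Mellanox API: the two-level layout (leaf switches under one root), the function names, and the `hosts_per_leaf` parameter are assumptions made for illustration.

```python
def elementwise_sum(vectors):
    """Sum equal-length vectors position by position."""
    return [sum(column) for column in zip(*vectors)]


def tree_allreduce(host_vectors, hosts_per_leaf):
    """Aggregate host data up a two-level switch tree, then return it to every host."""
    # Level 1: each leaf switch reduces the vectors of its attached hosts.
    leaf_results = [
        elementwise_sum(host_vectors[i:i + hosts_per_leaf])
        for i in range(0, len(host_vectors), hosts_per_leaf)
    ]
    # Level 2: the root switch reduces one vector per leaf switch.
    aggregated = elementwise_sum(leaf_results)
    # The aggregated result travels back down to every host.
    per_host = [list(aggregated) for _ in host_vectors]
    return per_host, len(leaf_results)


hosts = [[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]]
results, vectors_at_root = tree_allreduce(hosts, hosts_per_leaf=3)
print(results[0])        # [25, 30]
print(vectors_at_root)   # 2
```

In this toy layout the root receives two partial sums rather than five host vectors, because each leaf switch has already reduced its hosts' data on the way up.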


SHARP AllReduce Performance Advantages (128 Nodes)

SHARP enables 75% Reduction in Latency

Providing Scalable Flat Latency


SHARP AllReduce Performance Advantages 1500 Nodes, 60K MPI Ranks, Dragonfly+ Topology

SHARP Enables Highest Performance


SHARP Accelerates AI Performance

The CPU in a parameter server becomes the bottleneck

▪ Performs the Gradient Averaging
▪ Replaces all physical parameter servers
▪ Accelerates AI Performance
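The gradient-averaging step that a parameter server would otherwise perform can be sketched as below. This is an illustrative plain-Python example under assumed names (`average_gradients`, `sgd_step`) and made-up gradient values; it shows only the arithmetic, not how SHARP carries it out in the switches.

```python
def average_gradients(worker_gradients):
    """Average one gradient vector per worker, position by position."""
    n = len(worker_gradients)
    return [sum(column) / n for column in zip(*worker_gradients)]


def sgd_step(weights, gradient, learning_rate):
    """Apply one plain SGD update using the averaged gradient."""
    return [w - learning_rate * g for w, g in zip(weights, gradient)]


# Hypothetical gradients from four data-parallel workers.
worker_gradients = [[0.5, 1.0], [1.5, 3.0], [1.0, 2.0], [1.0, 2.0]]

averaged = average_gradients(worker_gradients)
new_weights = sgd_step([10.0, 20.0], averaged, learning_rate=0.5)

print(averaged)      # [1.0, 2.0]
print(new_weights)   # [9.5, 19.0]
```

With a parameter server, every worker's vector arrives at a single CPU that computes this average. An allreduce produces the same averaged gradient on every worker without a dedicated server.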


InfiniBand SHARP Advantage for Deep Learning

Scalable Performance for Distributed AI

▪ Increase System Performance
▪ Better Scalability
▪ Reduces amount of data traversing the network

[Figure: performance chart with 16% and 11% data labels]

System Configuration: Intel E5-2650V4, 12 cores @ 2.2GHz, 30M L2 cache, 9.6GT QPI, 256GB RAM: 16 x 16 GB DDR4, NVIDIA P100 GPUs, ConnectX-6 HCA, IB Quantum Switch (EDR speed), RH 7.5, Mellanox OFED 4.4, HPC-X v2.3, TensorFlow v1.11, Horovod 0.15.0


NCCL SHARP


NCCL Overview

▪ NCCL: NVIDIA Collective Communications Library
▪ Enables Multi GPU Computing
  ▪ Data Parallel multi GPU training
  ▪ NCCL Allreduce: Aggregate gradients across GPUs
▪ DL Frameworks (TensorFlow/Horovod, PyTorch, MXNet, Chainer, …)
▪ NCCL 1.0
  ▪ Single node Ring
▪ NCCL 2.0
  ▪ Ring across multiple nodes
  ▪ RDMA
▪ NCCL 2.4
  ▪ Hierarchical tree algorithm
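The ring approach listed above can be illustrated with a textbook ring allreduce (reduce-scatter followed by allgather). This is a minimal single-process simulation, not NCCL's implementation or API; the function name and the simplification of one value per chunk are assumptions.

```python
def ring_allreduce(rank_data):
    """Simulate a ring allreduce (sum) over n ranks holding n one-value chunks each."""
    n = len(rank_data)
    assert all(len(v) == n for v in rank_data), "this sketch needs n chunks per rank"
    chunks = [list(v) for v in rank_data]

    # Reduce-scatter: in each step every rank sends one chunk to its right
    # neighbour, which adds it to its own copy of that chunk.
    for step in range(n - 1):
        sends = [((r - step) % n, chunks[r][(r - step) % n]) for r in range(n)]
        for r, (c, value) in enumerate(sends):
            chunks[(r + 1) % n][c] += value

    # Allgather: each rank now owns one fully reduced chunk and passes
    # reduced chunks around the ring until every rank has all of them.
    for step in range(n - 1):
        sends = [((r + 1 - step) % n, chunks[r][(r + 1 - step) % n]) for r in range(n)]
        for r, (c, value) in enumerate(sends):
            chunks[(r + 1) % n][c] = value

    return chunks


ranks = [[1, 2, 3], [10, 20, 30], [100, 200, 300]]
reduced = ring_allreduce(ranks)
print(reduced[0])   # [111, 222, 333]
```

The ring takes 2 × (n − 1) communication steps, so the step count grows linearly with the number of ranks, whereas a tree's depth grows logarithmically.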


NCCL SHARP

[Figure: three NICs connected to the Network Fabric]


Thank You