the mlperf benchmarks: deep learning at ......the mlperf benchmarks: deep learning at scale with...

THE MLPERF BENCHMARKS: DEEP LEARNING AT SCALE WITH NVIDIA GPUS

Vishal Mehta, DevTech Compute

HPC-AI Advisory Council, Swiss Conference, Lugano Date:02/04/2019

MLPERF BENCHMARKING AREAS

Problem Model DataSet

Image Classification Resnet-50 v1.5 ImageNet

Object Detection – Heavy Weight Mask R-CNN COCO

Object Detection – Light Weight Single-shot Detector (SSD) COCO

Translation (non-recurrent) Transformer WMT English-German

Translation (recurrent) Neural Machine Translator (NMT) WMT English-German

Recommendation Neural Collaborative Filtering (NCF) MovieLens 20M

Reinforcement Learning Mini Go (Based on Alpha Go)

*Closed Divisions: fixed model parameters, fixed data format, results must be reproducible*Open Divisions: encourage innovations, tricks and model adjustment welcomed

Diverse Use Cases Towards a Full Performance Picture

MLPERFFirst Industry Benchmark for Measuring AI Performance

https://mlperf.org/

• Open source code on single/multi node DGX systems

• Everywhere: Workstation/Clusters with SLURM/Cloud

• Reproducible Performance!

MLPERF RESULTS: CLOSED DIVISIONSSingle Node & At Scale

Results are Time to Complete Model Training

Single Node At Scale

* Full reference: https://mlperf.org/results/

AI TRAINING REQUIRES FULL STACK INNOVATION

36000 Mins (25 Days)1xK80 | 2015

1200 Mins (20 Hours)DGX-1P | 2016

NVLink

480 Mins (8 Hours)DGX-1V | 2017

Tensor Core

70 Minutes on MLPerfDGX-2H | 2018

NVSwitch

6.3 Minutes on MLPerfAt Scale | 2018

DGX Cluster

PILLARS OF GPU PERFORMANCE

CUDA Architecture

NVLink/NVSwitch Integrated Software

Massively Parallel Processing

High Speed Connecting between GPUs for Distributed

Algorithms

Fully Integrated Software and Hardware for Instant

Productivity

NVSwitch

6x NVLink

PYTHON

APACHE ARROW on GPU Memory

RAPIDS

cuMLcuDF

FRAMEWORKS

Tensor Cores

Mixed Precision Matrix Math Support

TENSOR CORESMixed Precision Matrix Math

• CUDA TensorOp instructions & data formats. Automated Mixed Precision

• Using Tensor cores via

• Volta optimized frameworks and libraries (cuDNN, CuBLAS, TensorRT, ..)

• CUDA C++ Warp Level Matrix Operations

• cuTENSOR library (pre-release version available)

cuTENSORA New High-Performance CUDA Library for Tensor Primitives

TENSOR CORE AUTOMATIC MIXED PRECISIONOver 3x Speedup With Just Two Lines of Code

TOOLS AND LIBRARIES MAINTAIN NETWORK ACCURACY

Tensor Core Journey Page

Github

Profiler Tools

Performance increase using automatic mixed precision on a variety of

training data sets. All performance collected using 1xV100-16GB except

bert-squadqa, which ran on 1xV100-32GB.

Enable AMP in TensorFlow (NGC Container 19.03)

export TF_ENABLE_AUTO_MIXED_PRECISION=1

MIXED PRECISION MAINTAINS ACCURACYBenefit From Higher Throughput Without Compromise

ILSVRC12 classification top-1 accuracy.(Sharan Narang, Paulius Micikevicius et al., "Mixed Precision Training“, ICLR 2018)

**Same hyperparameters and learning rate schedule as FP32.

10.00%

20.00%

30.00%

40.00%

50.00%

60.00%

70.00%

80.00%

AlexNet VGG-D GoogleNet(Inception v1)

Inception v2 Inception v3 Resnet50

Model Accura

FP32 Mixed Precision**

TENSOR CORES FOR SCIENCEMulti-precision computing

AI-POWERED WEATHER PREDICTION

PLASMA FUSION APPLICATION EARTHQUAKE SIMULATION

7.815.7

V100 TFLOPS

FP64+ MULTI-PRECISION

FP16 Solver

3.5x times faster

FP16/FP32

1.15x ExaOPS

FP16-FP21-FP32-FP64

25x times faster

NVIDIA DGX-2

6 Two Intel Xeon Platinum CPUs

7 1.5 TB System Memory

30 TB NVME SSDs Internal Storage

NVIDIA Tesla V100 32GB

Two GPU Boards8 V100 32GB GPUs per board6 NVSwitches per board512GB Total HBM2 Memoryinterconnected byPlane Card

Twelve NVSwitches2.4 TB/sec bi-section

bandwidth

Eight EDR Infiniband/100 GigE1600 Gb/sec Total Bi-directional Bandwidth

PCIe Switch Complex

9Dual 10/25 Gb/secEthernet

• 18 NVLINK ports

• @50 GB/s per port bi-directional

• 900 GB/s total bi-directional

• Fully connected crossbar

• 2 billion transistors

NVSWITCH

FULL NON-BLOCKING BANDWIDTH

UNIFIED MEMORY PROVIDES

• Single memory view shared by all GPUs

• Automatic migration of data between GPUs

• User control of data locality

• CUDA cooperative kernels across multiple

NVLINK PROVIDES

• All-to-all high-bandwidth peer mapping

between GPUs

• Full inter-GPU memory interconnect

(incl. Atomics)

NVSWITCH

• Building block that abstracts highly-optimized communication for each topology

• Rings within node over NVLink

• Trees among nodes over network

• Like MPI collectives, but with a CUDA stream

• Essential to deep learning and machine learning, can be relevant to traditional HPC

• Open sourced as of v2.3

NVIDIA Collectives Communication Library

NCCLTrees vs Rings

https://devblogs.nvidia.com/massively-scale-deep-learning-training-nccl-2-4/

Runs on Summit supercomputer on up to 24,576 GPUs.

NCCLPerformance comparison on Resnet-50

NVIDIA GPU CLOUDGPU-optimized Software Hub. Simplifying DL, ML and HPC Workflows

NGC50+ Containers

DL, ML, HPC

Pre-trained ModelsNLP, Classification, Object Detection & more

Industry WorkflowsMedical Imaging, Intelligent Video Analytics

Model Training ScriptsNLP, Image Classification, Object Detection & more

Innovate Faster

Deploy Anywhere

Simplify Deployments

NGC Support Services

https://ngc.nvidia.com

MLPERF KEY POINTSNVIDIA’s Platform Available Everywhere to All Developers

• Software innovations used to achieve this industry-leading

performance is available via the NGC container registry.

• NGC containers are available for all key AI frameworks and can be

used anywhere, on desktops, workstations, servers and all leading

cloud services

MLPERF KEY POINTSNVIDIA Sets New Records in AI Performance

• NVIDIA ran models of all kinds of complexity, in the industry's first

comprehensive AI benchmark.

• Tensor Core GPUs are the fastest and combined with CUDA the most

versatile platform for AI.

• NVIDIA platform, is available everywhere from desktop to cloud services

MLPERF KEY POINTSState-of-the-art AI Computing Requires Full Stack Innovation

APPS &FRAMEWORKS

CUDA-XNVIDIA LIBRARIES

VIRTUAL GPU

CUDA & CORE LIBRARIES - cuBLAS | NCCL

DEEP LEARNING

cuFFTOpenACC

+550 Applications

CUSTOMER USE CASES

VIRTUAL GRAPHICS

Speech Translate Recommender

SCIENTIFIC APPLICATIONS

Molecular Simulations

WeatherForecasting

SeismicMapping

CONSUMER INTERNET & INDUSTRY APPLICATIONS

ManufacturingHealthcare Finance

GPUs & SYSTEMS

SYSTEM OEM CLOUDTESLA GPU NVIDIA HGXNVIDIA DGX FAMILY

MACHINE LEARNING

cuMLcuDF cuGRAPH cuDNN CUTLASS TensorRTvDWS vPC

Creative & Technical

Knowledge Workers

+600 Applications

DX/OGL

RAPIDS: MACHINE LEARNING AT SCALE

THANK YOU

the mlperf benchmarks: deep learning at ......the mlperf benchmarks: deep learning at scale with...

Documents

gurobi 8.0 performance benchmarks · milp competitive...

micro benchmarks

the vision behind mlperf€¦ · opennmt deep speech 2...

channel benchmarks

mlperf tiny benchmark

arxiv:2005.12873v3 [cs.dc] 7 jun 2020processing benchmarks...

mlperf training benchmark · benchmarking efforts in x2.2....

mlperf inference benchmark - arxiv · mlperf inference...

benchmarks ramon zatarain. index benchmarks and benchmarking...

financial benchmarks

issues at the intersection of ai, streaming, hpc, data...

dynatrace performance benchmarks€¦ · performance...

online display advertising benchmarks: doubleclick...

social media - tourism australia · • social audience...

2005 - benchmarks

cpu benchmarks - zszgorlice.iap.pl · cpu benchmarks video...

benchmarks - may, 2011 | benchmarks...

grading benchmarks third grade reading - silver...

accelerating mlperf training v0.6 deep learning...

demystifying hardware infrastructure choices for deep...