HPC AND AI ACCELERATION ON GPU | Yi Cheng (易成), SA HPC&AI | March 2019

Page 1

HPC AND AI ACCELERATION ON GPU

Yi Cheng (易成), SA HPC&AI, March 2019

Page 2

NVIDIA: “THE AI COMPUTING COMPANY”
Artificial Intelligence | Computer Graphics | GPU Computing

Page 3

RISE OF GPU COMPUTING

[Chart, 1980-2020, log scale (10^2 to 10^7): GPU-computing performance grows 1.5X per year, on track for 1000X by 2025; single-threaded CPU performance grew 1.5X per year historically and now grows roughly 1.1X per year. Original data up to 2010 collected and plotted by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond, and C. Batten; new plot and data for 2010-2015 collected by K. Rupp.]

[Stack diagram: APPLICATIONS / SYSTEMS / ALGORITHMS / CUDA / ARCHITECTURE]

Page 4

ELEVEN YEARS OF GPU COMPUTING

2006: CUDA launched
2008: World's first GPU Top500 system
2010: Fermi, the world's first HPC GPU
2012: AlexNet beats expert code by a huge margin using GPUs; Oak Ridge deploys the world's fastest supercomputer with GPUs
2012-2017: Discovered how H1N1 mutates to resist drugs; Stanford builds an AI machine using GPUs; world's first atomic model of the HIV capsid; world's first 3-D mapping of the human genome; Google outperforms humans in ImageNet; GPU-trained AI machine beats the world champion in Go
2017: Top 13 greenest supercomputers powered by NVIDIA GPUs

Page 5

GPU EVOLUTION

[Chart: normalized SGEMM/W, 2008-2016, across Tesla products (C870, C1060, C1070, C2050, C2070, M2070, M2090, K20, K40, K80, M10, M40, M60, P4, P40, P100) and architecture milestones: Tesla (CUDA), Fermi (FP64), Kepler (dynamic parallelism), Maxwell (DX12), Pascal (unified memory, 3D memory, NVLink)]

Page 6

NVIDIA POWERS THE WORLD'S FASTEST SUPERCOMPUTERS
48% More Systems | 22 of Top 25 Greenest

Piz Daint (Europe's Fastest): 5,704 GPUs | 21 PF
ORNL Summit (World's Fastest): 27,648 GPUs | 144 PF
ABCI (Japan's Fastest): 4,352 GPUs | 20 PF
ENI HPC4 (Fastest Industrial): 3,200 GPUs | 12 PF
LLNL Sierra (World's 2nd Fastest): 17,280 GPUs | 95 PF

Page 7

NVIDIA POWERS GORDON BELL WINNERS & 5 OF 6 FINALISTS

GPU Acceleration Critical To HPC At Scale Today

Material Science: 300X higher performance
Genomics: 2.36 ExaFLOPS
Seismic: first soil & structure simulation
Quantum Chromodynamics: <1% uncertainty margin
Weather: 1.15 ExaFLOPS

Page 8

END-TO-END PRODUCT FAMILY

DESKTOP: TITAN / GeForce
WORKSTATION: DGX Station
DATA CENTER, HPC / TRAINING: Tesla V100
DATA CENTER, INFERENCE: Tesla V100, Tesla P4/T4
FULLY INTEGRATED AI SYSTEMS: DGX-1, DGX-2
SERVER PLATFORM: HGX-1 / HGX-2
VIRTUAL WORKSTATION: Virtual GPU
AUTOMOTIVE: Drive AGX Pegasus
EMBEDDED: Jetson AGX Xavier

Page 9

TESLA PRODUCT FAMILY

TESLA V100 (Scale-up): Supercomputing | DL Training & Inference | Machine Learning | Video | Graphics
  Form factors: V100 SXM2 with NVLink; V100 PCIe (2-slot); HGX-2 baseboard (16x V100 + NVSwitch; heat sinks included but not shown)

TESLA T4 (Scale-out): DL Inference & Training | Machine Learning | Video | Graphics
  Form factor: T4 PCIe, low profile, 70 W

Page 10

NVIDIA UNIVERSAL ACCELERATION PLATFORM
Single Platform Drives Utilization and Productivity

CONSUMER INTERNET, INDUSTRIAL, and SCIENTIFIC APPLICATIONS
Customer use cases: Speech | Translate | Recommender | Video | Images | Retail | Molecular Simulations | Weather Forecasting | Seismic Mapping | Healthcare | Manufacturing | Finance

Apps & Frameworks: +580 applications (e.g., Amber, NAMD)

NVIDIA SDK & Libraries:
• Machine Learning / Analytics: cuDF, cuML, cuGRAPH
• Deep Learning: cuDNN, cuBLAS, CUTLASS, NCCL, TensorRT
• HPC: cuBLAS, cuFFT, OpenACC
• CUDA

Tesla GPUs & Systems: HGX-2 (scale up, dense compute) | T4 (scale out, distributed compute)

Page 11

TESLA V100

                  V100 for NVLink Servers              V100 for PCIe Servers
Core              5120 CUDA cores, 640 Tensor cores    5120 CUDA cores, 640 Tensor cores
Compute           7.8 TF DP ∙ 15.7 TF SP ∙ 125 TF DL   7 TF DP ∙ 14 TF SP ∙ 112 TF DL
Memory            HBM2: 900 GB/s ∙ 32 GB               HBM2: 900 GB/s ∙ 32 GB
Interconnect      NVLink (up to 300 GB/s) + PCIe Gen3 (up to 32 GB/s)    PCIe Gen3 (up to 32 GB/s)
Power             300 W                                250 W
Available         Now                                  Now
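The peak numbers follow directly from the core counts and the GPU boost clocks (the clock figures here, roughly 1.53 GHz for SXM2 and 1.37 GHz for PCIe, are an assumption for illustration and are not stated on the slide):

15.7 TF SP ≈ 5120 FP32 cores × 2 FLOP/clock × 1.53 GHz
 7.8 TF DP ≈ half the SP rate (FP64 units are half the FP32 count)
 125 TF DL ≈ 640 Tensor Cores × 64 FMA/clock × 2 FLOP/FMA × 1.53 GHz

The PCIe column uses the same formulas at the lower clock (≈1.37 GHz), giving roughly 14 / 7 / 112 TF.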

Page 12

TESLA V100
The Fastest and Most Productive GPU for AI and HPC

• Volta Architecture: most productive GPU
• Tensor Cores: 125 programmable TFLOPS of deep learning performance
• Improved SIMT model: new algorithms
• Volta MPS: higher inference utilization
• Improved NVLink 2.0 (300 GB/s) & HBM2 (900 GB/s): efficient bandwidth

Page 13

INTRODUCING TESLA P100
New GPU Architecture to Enable the World's Fastest Compute Node

• Pascal Architecture: highest compute performance
• NVLink (160 GB/s): GPU interconnect for maximum scalability
• CoWoS HBM2 (768 GB/s): unifying compute & memory in a single package
• Page Migration Engine / Unified Memory: simple parallel programming with virtually unlimited memory

[Diagram: Tesla P100 GPUs connected to CPUs through PCIe switches, with NVLink between GPUs]
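To make the Unified Memory / Page Migration Engine bullet concrete, here is a minimal CUDA sketch of the programming model it enables (a generic SAXPY example, not code from the deck; the kernel name and sizes are illustrative):

#include <cstdio>
#include <cuda_runtime.h>

// y = a*x + y, one element per thread
__global__ void saxpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

int main() {
    const int n = 1 << 20;
    float *x, *y;
    // Managed allocations are visible to both CPU and GPU; with the Page
    // Migration Engine they can even exceed physical GPU memory, and pages
    // migrate between host and device on demand.
    cudaMallocManaged(&x, n * sizeof(float));
    cudaMallocManaged(&y, n * sizeof(float));
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }  // CPU writes
    saxpy<<<(n + 255) / 256, 256>>>(n, 3.0f, x, y);             // GPU uses the same pointers
    cudaDeviceSynchronize();
    printf("y[0] = %f\n", y[0]);                                // CPU reads the result
    cudaFree(x);
    cudaFree(y);
    return 0;
}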

Page 14

GPU PERFORMANCE COMPARISON

                         P100             V100               Ratio
Training acceleration    10 TOPS          125 TOPS           12x
Inference acceleration   21 TFLOPS        125 TOPS           6x
FP64 / FP32              5 / 10 TFLOPS    7.8 / 15.7 TFLOPS  1.5x
HBM2 bandwidth           720 GB/s         900 GB/s           1.2x
NVLink bandwidth         160 GB/s         300 GB/s           1.9x
L2 cache                 4 MB             6 MB               1.5x
L1 caches                1.3 MB           10 MB              7.7x

Page 15

VOLTA GV100
21B transistors | 815 mm²
80 SMs | 5120 CUDA Cores | 640 Tensor Cores
32 GB HBM2 | 900 GB/s HBM2
300 GB/s NVLink
*full GV100 chip contains 84 SMs

Page 16

VOLTA GV100 SM

                            GV100
FP32 units                  64
FP64 units                  32
INT32 units                 64
Tensor Cores                8
Register File               256 KB
Unified L1 / Shared memory  128 KB
Active Threads              2048
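The chip-level totals on the previous slide are simply these per-SM counts multiplied by the 80 enabled SMs:

80 SMs × 64 FP32 units = 5,120 CUDA cores
80 SMs × 8 Tensor Cores = 640 Tensor Cores
80 SMs × 32 FP64 units = 2,560 FP64 units, i.e. a 2:1 FP32:FP64 ratio, consistent with the 15.7 vs. 7.8 TFLOPS figures.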

Page 17

PASCAL GP100 SM (Streaming Multiprocessor)

                    GP100
SMs / GPU           56
FP32 units          64
FP64 units          32
Tensor Cores        -
Registers / SM      256 KB
Shared Memory / SM  64 KB
L1 Cache            24 KB

Page 18

TENSOR CORE
Mixed-Precision Matrix Math on 4x4 Matrices

D = A × B + C

where A and B are FP16 4x4 matrices and C and D are 4x4 matrices in FP16 or FP32.

[Figure: the 4x4 matrices A, B, and C written out element by element, A(0,0) through C(3,3)]
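Written out per element (a worked expansion of the slide's formula, not additional material from the deck):

D(i,j) = sum over k = 0..3 of A(i,k) · B(k,j), plus C(i,j), for i, j = 0..3

That is 4 × 4 × 4 = 64 fused multiply-adds (128 floating-point operations) per Tensor Core operation, which is the per-clock rate behind the 125 TF deep-learning figure quoted for V100 (640 Tensor Cores × 128 FLOP/clock × clock rate).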

Page 19

BASIC CONCEPTS: VOLTA TRAINING METHOD

Page 20

USING TENSOR CORES

Volta Optimized Frameworks and Libraries

__device__ void tensor_op_16_16_16(float *d, half *a, half *b, float *c)
{
    // Declare warp-wide fragments for the A, B and accumulator tiles
    wmma::fragment<matrix_a, …> Amat;
    wmma::fragment<matrix_b, …> Bmat;
    wmma::fragment<matrix_c, …> Cmat;

    // Load the two 16x16 input tiles (leading dimension 16) and zero the
    // accumulator (the c argument is not used in this snippet)
    wmma::load_matrix_sync(Amat, a, 16);
    wmma::load_matrix_sync(Bmat, b, 16);
    wmma::fill_fragment(Cmat, 0.0f);

    // One warp-synchronous matrix multiply-accumulate: Cmat = Amat * Bmat + Cmat
    wmma::mma_sync(Cmat, Amat, Bmat, Cmat);

    // Store the 16x16 result tile to d in row-major order
    wmma::store_matrix_sync(d, Cmat, 16, wmma::row_major);
}

CUDA C++

Warp-Level Matrix Operations

NVIDIA cuDNN, cuBLAS, TensorRT
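For reference, here is a self-contained version of the same warp-level operation with the template parameters the slide elides filled in under common assumptions (16x16x16 tiles, row-major A and B, FP32 accumulator). This is an illustrative sketch, not code from the deck, and requires compute capability 7.0 or later:

#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// One warp computes D (16x16, FP32) = A (16x16, FP16) * B (16x16, FP16)
__global__ void wmma_16x16x16(float *d, const half *a, const half *b)
{
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> Amat;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> Bmat;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> Cmat;

    wmma::load_matrix_sync(Amat, a, 16);   // leading dimension 16
    wmma::load_matrix_sync(Bmat, b, 16);
    wmma::fill_fragment(Cmat, 0.0f);       // start from a zero accumulator
    wmma::mma_sync(Cmat, Amat, Bmat, Cmat);
    wmma::store_matrix_sync(d, Cmat, 16, wmma::mem_row_major);
}

// Launch with a single warp, e.g.  wmma_16x16x16<<<1, 32>>>(d, a, b);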

Page 21

VOLTA: A GIANT LEAP FOR DEEP LEARNING

[Charts: images per second, P100 vs. V100]
ResNet-50 Training: 2.4x faster (P100 FP32 vs. V100 Tensor Cores)
ResNet-50 Inference: 3.7x faster at 7 ms latency with TensorRT (P100 FP16 vs. V100 Tensor Cores)
V100 measured on pre-production hardware.

Page 22

ANNOUNCING TESLA T4: WORLD'S MOST ADVANCED INFERENCE GPU
Universal Inference Acceleration

320 Turing Tensor Cores
2,560 CUDA cores
65 FP16 TFLOPS | 130 INT8 TOPS | 260 INT4 TOPS
16 GB | 320 GB/s

Page 23

TURING

Turing: Up to 72 Streaming Multiprocessors (SM)

Page 24

TURING
Per Streaming Multiprocessor:

• 64 FP32 lanes

• 2 FP64 lanes

• 64 INT32 lanes

• 16 SFU lanes (transcendentals)

• 32 LD/ST lanes (Gmem/Lmem/Smem)

• 8 Tensor Cores

• 1 RT Core (40 in total on a 40-SM T4)

• 4 TEX lanes

[Diagram: SMs, each with its own L1 cache, sharing a common L2 cache and DRAM]

Up to 72 SMs (T4: 40 SMs)
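As a quick cross-check (simple arithmetic, not from the deck), the T4 totals quoted earlier follow from these per-SM counts:

40 SMs × 64 FP32 lanes = 2,560 CUDA cores
40 SMs × 8 Tensor Cores = 320 Turing Tensor Cores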

Page 25

NEW TURING TENSOR CORE

MULTI-PRECISION FOR AI INFERENCE

65 TFLOPS FP16 | 130 TeraOPS INT8 | 260 TeraOPS INT4
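The three rates scale with operand width, so each halving of precision doubles the Tensor Core throughput:

130 INT8 TOPS = 2 × 65 (FP16 rate)
260 INT4 TOPS = 4 × 65 (FP16 rate)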

Page 26

RT CORES
Turing GPU RT Cores Accelerate Ray Tracing

• Hardware accelerated tracing of rays through the scene

• RT Core performance scales up with the Quadro RTX product family

• Applications access capabilities of RT Cores through OptiX, DXR, and Vulkan APIs

Page 27

TESLA PRODUCTS DECODER

K80: 2x GK210 | peak FP64 2.9, FP32 8.7, FP16 NA TFLOPS | 2x 12 GB GDDR5, 480 GB/s | PCIe Gen3 | ECC: internal + GDDR5 | PCIe dual slot | 300 W
P100 (SXM2): GP100 | FP64 5.3, FP32 10.6, FP16 21.2 TFLOPS | 16 GB HBM2, 732 GB/s | NVLink + PCIe Gen3 | ECC: internal + HBM2 | SXM2 | 300 W
P100 (PCIe): GP100 | FP64 4.7, FP32 9.3, FP16 18.7 TFLOPS | 16/12 GB HBM2, 732/549 GB/s | PCIe Gen3 | ECC: internal + HBM2 | PCIe dual slot | 250 W
P40: GP102 | FP32 12 TFLOPS, 47 TIOPS | 24 GB GDDR5, 346 GB/s | PCIe Gen3 | ECC: GDDR5 | PCIe dual slot | 250 W
P4: GP104 | FP32 5.5 TFLOPS, 22 TIOPS | 8 GB GDDR5, 192 GB/s | PCIe Gen3 | ECC: GDDR5 | PCIe low profile | 50-75 W
V100 (PCIe): GV100 | FP64 7, FP32 14, FP16 112 TFLOPS | 16 GB HBM2, 900 GB/s | PCIe Gen3 | ECC: internal + HBM2 | PCIe dual slot | 250 W
V100 (SXM2): GV100 | FP64 7.8, FP32 15.7, FP16 125 TFLOPS | 16 GB HBM2, 900 GB/s | NVLink + PCIe Gen3 | ECC: internal + HBM2 | SXM2 | 300 W
V100 (FHHL): GV100 | FP64 6.5, FP32 13, FP16 105 TFLOPS | 16 GB HBM2, 900 GB/s | PCIe Gen3 | ECC: internal + HBM2 | PCIe single slot, full height half length | 150 W

Page 28

SPECS OVERVIEW

                       Tesla P100   Tesla P4   Tesla V100   Tesla T4
GPU                    GP100        GP104      GV100        TU104
Compute Capability     6.0          6.1        7.0          7.5
Memory BW (GB/s)       732          192        900          320
FP32 (TFLOPS)          10.0         5.5        15.5         8.1
FP64 (TFLOPS)          5.0          0.2        7.8          0.25
FP16 (TFLOPS)          20.0         0.12       31.1         16.2
Tensor Cores:
  HMMA (TFLOPS)        -            -          124.5        65
  IMMA INT8 (TOPS)     -            -          -            130
  IMMA INT4 (TOPS)     -            -          -            260
TDP                    300 W        75 W       300 W        70 W

Page 29

TESLA V100 (32 GB) VS P100

Tesla V100 Tensor Cores with CUDA 9 deliver a 9x performance improvement on GEMM operations (tested on a pre-production Tesla V100 with CUDA 9 software).

Page 30

DGX COMPUTING PLATFORM

Page 31

DGX PRODUCTS FAMILY

DGX Station: the fastest personal supercomputer for researchers and data scientists
DGX-1: the essential instrument of AI research in the data center
DGX-2: the world's most powerful deep learning system for the most complex deep learning challenges

Page 32

NVIDIA DGX-1 WITH VOLTA
Highest Performance, Fully Integrated HW System

1 PetaFLOPS | 8x Tesla V100 32 GB | 300 GB/s NVLink Hybrid Cube Mesh
2x Xeon | 7 TB SSD RAID 0 | Quad IB/Ethernet 100 Gbps, Dual 10 GbE | 3U, 3,500 W

Page 33

VOLTA NVLINK: 300 GB/sec

50% more links

28% faster signaling
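As a rough breakdown (the per-link figures are an assumption based on public NVLink specs, not stated on the slide): Volta exposes 6 NVLink 2.0 links at about 25 GB/s per direction, so 6 × 25 × 2 = 300 GB/s aggregate; Pascal had 4 links at 20 GB/s per direction, 4 × 20 × 2 = 160 GB/s. Hence the 50% increase in link count (4 → 6) plus faster per-link signaling.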

Page 34

DRAMATIC BOOST IN ACCURACY WITH LARGER, MORE COMPLEX MODELS

[Chart: object-detection accuracy, V100 (16 GB) running VGG-16 (16 layers) vs. V100 (32 GB) running ResNet-152 (152 layers)]

More complex models now possible | Dramatic boost in accuracy | 40% reduced error rate

SAP Brand Impact on DGX-1 (32 GB) for object detection. Dataset: Winter Sports 2018 campaign; high-definition images (1920 x 1080).

Page 35

FASTER RESULTS ON COMPLEX DL AND HPC
Up to 50% Faster Results With 2x The Memory (V100 16 GB vs. V100 32 GB)

HIGHER ACCURACY: 1.4x lower error rate for object detection: 75% accuracy with VGG-16 (16 layers) on V100 16 GB vs. 85% accuracy with ResNet-152 (152 layers) on V100 32 GB.
(NVIDIA customer R-CNN for object detection at 1080p with Caffe | V100 16 GB uses VGG-16 | V100 32 GB uses ResNet-152)

FASTER RESULTS: 1.5x faster neural machine translation (NMT), 0.8 → 1.2 steps/sec, and 1.5x faster 3D FFT (1k x 1k x 1k), 2.5 → 3.8 TF.
(Dual E5-2698 v4 server, 512 GB DDR4, Ubuntu 16.04, CUDA 9, cuDNN 7 | NMT is GNMT-like, run with the TensorFlow NGC container 18.01, batch size 128 for 16 GB and 256 for 32 GB | FFT is cufftbench 1k x 1k x 1k, comparing 2x V100 16 GB (DGX-1V) vs. 2x V100 32 GB (DGX-1V))

HIGHER RESOLUTION: 4x higher resolution for GAN image-to-image generation (unsupervised image translation, e.g. an input winter photo converted to summer): 512x512 images on V100 16 GB vs. 1024x1024 on V100 32 GB.
(GAN by NVIDIA Research, https://arxiv.org/pdf/1703.00848.pdf | V100 16 GB and V100 32 GB with FP32)

Page 36

30% BETTER PERFORMANCE WITH NVLINK THAN PCIE

• Encoder and decoder embedding size of 512

• Batch size of 256 per GPU

• NVIDIA DGX containers version 17.11, processing real data with cuDNN 7.0.4, NCCL 2.1.2

Page 37

2.54X BETTER PERFORMANCE WITH NVLINK

• Performance benefits increase with increasing encoder/decoder embedding size

• Sockeye neural machine translation single-precision training

• NVIDIA DGX containers version 17.11, processing real data with cuDNN 7.0.4, NCCL 2.1.2

Page 38

3.1X FASTER ON DGX-1 V100 THAN DGX-1 P100

Page 39

DGX STATION

Page 40

INTRODUCING NVIDIA DGX STATION: Groundbreaking AI at your desk

The Fastest Personal Supercomputer for Researchers and Data Scientists

Revolutionary form factor: designed for the desk, whisper-quiet

Start experimenting in hours, not weeks, powered by the DGX Stack

Productivity that goes from desk to data center to cloud

Breakthrough performance and precision, powered by Volta

Page 41

INTRODUCING NVIDIA DGX STATION: Groundbreaking AI at your desk
The Personal AI Supercomputer for Researchers and Data Scientists

Key Features
1. 4x NVIDIA Tesla V100 GPUs (now 32 GB)
2. 2nd-gen NVLink (4-way)
3. Water-cooled design
4. 3x DisplayPort (4K resolution)
5. Intel Xeon E5-2698 v4, 20-core
6. 256 GB DDR4 RAM

Page 42

NVIDIA DGX STATION SPECIFICATIONS

At a Glance
GPUs: 4x NVIDIA Tesla V100
TFLOPS (GPU FP16): 500
GPU Memory: 32 GB per GPU
NVIDIA Tensor Cores: 2,560 (total)
NVIDIA CUDA Cores: 20,480 (total)
CPU: Intel Xeon E5-2698 v4, 2.2 GHz (20-core)
System Memory: 256 GB RDIMM DDR4
Storage: Data: 3x 1.92 TB SSD RAID 0; OS: 1x 1.92 TB SSD
Network: Dual 10GBASE-T LAN (RJ45)
Display: 3x DisplayPort, 4K resolution
Additional Ports: 2x eSATA, 2x USB 3.1, 4x USB 3.0
Acoustics: < 35 dB
Maximum Power Requirements: 1,500 W
Operating Temperature Range: 10-30 °C

Software
Ubuntu Desktop Linux OS
DGX Recommended GPU Driver
CUDA Toolkit
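The totals in the table above are simply four V100s added together (and the 500 TFLOPS figure reuses the 125 Tensor TFLOPS per GPU quoted earlier):

4 × 5,120 = 20,480 CUDA cores
4 × 640 = 2,560 Tensor Cores
4 × 125 TFLOPS = 500 FP16 Tensor TFLOPS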

Page 43

NVIDIA DGX-2
THE WORLD'S MOST POWERFUL DEEP LEARNING SYSTEM FOR THE MOST COMPLEX DEEP LEARNING CHALLENGES

• First 2 PFLOPS System

• 16 V100 32GB GPUs Fully Interconnected

• NVSwitch: 2.4 TB/s bisection bandwidth

• 24X GPU-GPU Bandwidth

• 0.5 TB of Unified GPU Memory

• 10X Deep Learning Performance

Page 44

DESIGNED TO TRAIN THE PREVIOUSLY IMPOSSIBLE

• NVIDIA Tesla V100 32 GB
• Two GPU boards: 8 V100 32 GB GPUs per board, 6 NVSwitches per board, 512 GB total HBM2 memory, interconnected by plane card
• Twelve NVSwitches: 2.4 TB/sec bisection bandwidth
• Eight EDR InfiniBand / 100 GigE: 1,600 Gb/sec total bidirectional bandwidth
• PCIe switch complex
• Two Intel Xeon Platinum CPUs
• 1.5 TB system memory
• 30 TB NVMe SSD internal storage
• Dual 10/25 Gb/sec Ethernet

Page 45

NVSWITCH: WORLD'S HIGHEST BANDWIDTH ON-NODE SWITCH

7.2 Terabits/sec or 900 GB/sec

18 NVLINK ports | 50GB/s per port bi-directional

Fully-connected crossbar

2 billion transistors | 47.5mm x 47.5mm package

Page 46

NVSWITCH ENABLES THE WORLD'S LARGEST GPU

16 Tesla V100 32GB Connected by New NVSwitch

2 petaFLOPS of DL Compute

Unified 512GB HBM2 GPU Memory Space

300GB/sec Every GPU-to-GPU

2.4TB/sec of Total Cross-section Bandwidth
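These figures are internally consistent (a quick cross-check, not from the deck):

18 NVLink ports × 50 GB/s = 900 GB/s ≈ 7.2 Tb/s per NVSwitch
16 GPUs × 125 Tensor TFLOPS = 2 PFLOPS of DL compute
16 GPUs × 32 GB = 512 GB of unified HBM2
8 GPUs per board × 300 GB/s each across the NVSwitch plane = 2.4 TB/s bisection bandwidth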

Page 47

Software system

Page 48

Virtual Machine vs. Container

Not so similar

Docker VS VM

Page 49

COMMON SOFTWARE STACK ACROSS DGX FAMILY

Cloud Service Provider

• Single, unified stack for deep learning frameworks

• Predictable execution across platforms

• Pervasive reach

DGX Station | DGX-1 | DGX-2 | NVIDIA GPU Cloud

Page 50

NGC GPU-OPTIMIZED DEEP LEARNING CONTAINERS

NVCaffe

Caffe2

Microsoft Cognitive Toolkit (CNTK)

DIGITS

MXNet

PyTorch

TensorFlow

Theano

Torch

CUDA (base level container for developers)

NEW! – NVIDIA TensorRT inference accelerator with ONNX support

A Comprehensive Catalog of Deep Learning Software

Page 51

VIRTUAL WORKSTATIONS AND PCs

Page 52

OVERVIEW OF COMMON GPU VIRTUALIZATION SOLUTIONS

[Diagram comparing four approaches:
• Software-virtualized GPU: many users share one GPU entirely through software
• GPU sharing: user sessions on Windows Server + XenApp share one GPU through a single driver
• GPU pass-through: each VM runs its own driver and owns a dedicated physical GPU via the hypervisor
• vGPU: the hypervisor presents virtual GPUs (vGPU), so multiple VMs, each with its own NVIDIA driver, share a single physical GPU]

Page 53

GPU VIRTUALIZATION (vGPU): ONE-TO-MANY PARTITIONING OF A GPU

[Diagram: a server with physical GPUs (GPU 1, GPU 2) split into virtual GPUs; each VM runs its own NVIDIA driver and is assigned one vGPU]

Page 54

HOW IS A vGPU PARTITIONED ON A VIRTUALIZATION PLATFORM?

Quarter partitioning | Eighth partitioning

vGPU partitioning characteristics: a single physical GPU cannot mix different partition (profile) types, and a vGPU's resources are released when its VM is shut down.
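As a concrete illustration (the 16 GB card is a hypothetical example, not from the deck): quarter partitioning of a 16 GB GPU yields 4 vGPUs with 16 ÷ 4 = 4 GB of framebuffer each, while eighth partitioning yields 8 vGPUs with 16 ÷ 8 = 2 GB each; all vGPUs on that physical GPU must use the same profile, and a vGPU's framebuffer is returned to the pool only when its VM shuts down.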

Page 55

HISTORY OF NVIDIA GPU VIRTUALIZATION SOLUTIONS
Adding value through software updates

GRID 1.x (2013.12): Kepler architecture; K1, K2 GPUs; 8:1 vGPU; OpenGL/DirectX; XenServer 6.2 SP1
GRID 2.x (2015.8): Maxwell architecture; M6, M60 GPUs; 16:1 vGPU; OpenGL/DirectX; VMware vSphere 6.0; Huawei UVP; license editions (vPC, vWS, vWS ext); Windows 10 support
GRID 3.x (2016.4): Maxwell architecture; 4K display support; vApp license edition introduced (vApp, vPC, vWS); DX12 support
GRID 4.x (2016.8): Maxwell architecture; M10 GPU; host- and VM-level resource monitoring; Citrix Desktop Director support; Windows Server 2016 VMs; vSphere 6.5; XenServer 7.1/7.2; RHEL KVM GPU pass-through
Virtual GPU 5.x (2017.12): Pascal architecture; renamed to "Virtual GPU"; P4, P6, P40, P100 GPUs; CUDA/OpenGL/DirectX with 2x performance improvement; 24:1 vGPU; Nutanix KVM; Linux hardware video encoding; application-level monitoring; license editions (vApp, vPC, vDWS); VMware vROps support; two new GPU scheduling modes; license-server HA deployment; support for more than 1 TB of memory
Virtual GPU 6.x (2018.10): Volta architecture; V100 16/32 GB PCIe/SXM2 GPUs; 32:1 vGPU; new support for RHEL 7.5 / RHV 4.2 KVM, Sangfor VMP, and H3C CAS KVM; vGPU live migration (vGPU Motion); vPC supports 2 GB framebuffer and Linux OS; vPC supports 2x 4K or 4x HD displays

Page 56

THE NVIDIA VIRTUAL GPU 7.1 PLATFORM PROVIDES GRAPHICS, COMPUTE, AND AI CAPABILITIES

Stack: NVIDIA Tesla (data center GPUs) + NVIDIA Virtual GPU software, with support, updates, and maintenance

Page 57

VIRTUAL GPU (vGPU) 7.X NEW FEATURES
Unprecedented Performance & Manageability

• Multi-vGPU support: world's most powerful Quadro vDWS
• vMotion support for vGPU: live migration of vGPU-enabled VMs (Quadro vDWS & GRID)
• Tesla T4 GPU support*: latest-generation Turing (Quadro vDWS)
• NGC with vGPU: available with vGPU (Quadro vDWS)

* Tesla T4 support coming with the vGPU software 7.1 release

Page 58

A WIDER CHOICE OF GPUs
Supports the full Tesla P, V, and T product lines, covering different user scenarios

V100 (Volta, 1 GPU/board): 5,120 CUDA cores; 32 GB / 16 GB HBM2; vGPU profiles 1/2/4/8/16/32 GB; PCIe 3.0 dual slot & SXM2 (rack servers); 250 W / 300 W; passive
P100 (Pascal, 1 GPU/board): 3,584 CUDA cores; 16 GB HBM2; profiles 1/2/4/8/16 GB; PCIe 3.0 dual slot (rack servers); 250 W; passive
P40 (Pascal, 1 GPU/board): 3,840 CUDA cores; 24 GB GDDR5; profiles 1/2/3/4/6/8/12/24 GB; PCIe 3.0 dual slot (rack servers); 250 W; passive
P4 (Pascal, 1 GPU/board): 2,560 CUDA cores; 8 GB GDDR5; profiles 1/2/4/8 GB; PCIe 3.0 single slot (rack servers); 75 W; passive
T4 (Turing, 1 GPU/board): 2,560 CUDA cores; 16 GB GDDR6; profiles 1/2/4/8/16 GB; PCIe 3.0 single slot (rack servers); 70 W; passive
M60 (Maxwell, 2 GPUs/board): 4,096 CUDA cores (2,048 per GPU); 16 GB GDDR5 (8 GB per GPU); profiles 0.5/1/2/4/8 GB; PCIe 3.0 dual slot (rack servers); 300 W (225 W opt); active/passive
M10 (Maxwell, 4 GPUs/board): 2,560 CUDA cores (640 per GPU); 32 GB GDDR5 (8 GB per GPU); profiles 0.5/1/2/4/8 GB; PCIe 3.0 dual slot (rack servers); 225 W; passive
M6 (Maxwell, 1 GPU/board): 1,536 CUDA cores; 8 GB GDDR5; profiles 0.5/1/2/4/8 GB; MXM (blade servers); 100 W (75 W opt); bare board
P6 (Pascal, 1 GPU/board): 2,048 CUDA cores; 16 GB GDDR5; profiles 1/2/4/8/16 GB; MXM (blade servers); 90 W; bare board

Optimized for performance (e.g., V100, P40), for density (e.g., M10), or for blade servers (M6, P6)

Page 59

THANKS