Introduction to Volta (Tesla V100)


TRANSCRIPT

Page 1: Introduction to Volta (Tesla V100)

Akira Naruse, 9th Nov. 2017

VOLTA (TESLA V100)

Page 2: Introduction to Volta (Tesla V100)

VOLTA

Page 3: Introduction to Volta (Tesla V100)

GPU ROADMAPS

[Chart: SGEMM / W, 0 to 72, vs. year, 2008 to 2018, for the Tesla, Fermi, Kepler, Maxwell, Pascal, and Volta architectures]

Page 4: Introduction to Volta (Tesla V100)

VOLTA: TESLA V100

The fastest GPU for both HPC and deep learning:

Volta Architecture: Most Productive GPU
Tensor Core: 120 Programmable TFLOPS for Deep Learning
Improved SIMT Model: New Algorithms
Volta MPS: Inference Utilization
Improved NVLink & HBM2: Efficient Bandwidth

Page 5: Introduction to Volta (Tesla V100)

VOLTA: TESLA V100

21B transistors, 815 mm²
80 SMs*, 5120 CUDA cores, 640 Tensor Cores
16 GB HBM2, 900 GB/s
300 GB/s NVLink

*full GV100 chip contains 84 SMs

Page 6: Introduction to Volta (Tesla V100)

GPU performance comparison

                           P100        V100         Ratio
Compute
  DL ops (FP16 or Mixed)   21 TOPS     120 TOPS     6x
  FP32                     10 TFLOPS   15 TFLOPS    1.5x
  FP64                     5 TFLOPS    7.5 TFLOPS   1.5x
Capacity
  L1 caches                1.3 MB      10 MB        7.7x
  L2 cache                 4 MB        6 MB         1.5x
Bandwidth
  Memory                   720 GB/s    900 GB/s     1.2x
  NVLink                   160 GB/s    300 GB/s     1.9x

Page 7: Introduction to Volta (Tesla V100)

NEW HBM2 MEMORY ARCHITECTURE

[Chart: STREAM Triad, delivered GB/s, P100 vs. V100]

P100: 76% DRAM utilization
V100: 95% DRAM utilization

1.5x effective bandwidth: (900 GB/s × 0.95) / (720 GB/s × 0.76) ≈ 1.5

V100 measured on pre-production hardware.

[Photo: HBM2 stack]

Page 8: Introduction to Volta (Tesla V100)

ROAD TO EXASCALE
Volta to Fuel Most Powerful US Supercomputers

[Chart: Volta HPC application performance, about 1.5x relative to Tesla P100]

Summit supercomputer:
200+ PetaFlops
~3,400 nodes
10 Megawatts

System config info: 2X Xeon E5-2690 v4, 2.6GHz, w/ 1X Tesla P100 or V100. V100 measured on pre-production hardware.

Page 9: Introduction to Volta (Tesla V100)

VOLTA: TESLA V100

21B transistors, 815 mm²
80 SMs*, 5120 CUDA cores, 640 Tensor Cores
16 GB HBM2, 900 GB/s
300 GB/s NVLink

*full GV100 chip contains 84 SMs

Page 10: Introduction to Volta (Tesla V100)

VOLTA GV100 SM

                            GV100
FP32 units                  64
FP64 units                  32
INT units                   64
Tensor Cores                8
Register file               256 KB
Unified L1/shared memory    128 KB
Active threads              2048

Page 11: Introduction to Volta (Tesla V100)

VOLTA GV100 SM

Completely new ISA
Twice the schedulers
Simplified issue logic
Large, fast L1 cache
Improved SIMT model
Tensor acceleration

= The SM that is easiest to get performance out of in GPU history; an easy-to-use architecture

Page 12: Introduction to Volta (Tesla V100)

MPS: MULTI-PROCESS SERVICE
Share a GPU across multiple processes, safely and efficiently

[Diagram: CPU processes A, B, C feeding GPU execution on Pascal vs. Volta]

Pascal GP100: CUDA Multi-Process Service (limited isolation between clients)
Volta GV100: CUDA Multi-Process Service control (hardware isolation between clients)

Page 13: Introduction to Volta (Tesla V100)

VOLTA: INDEPENDENT THREAD SCHEDULING

Communicating algorithms:

Pascal: lock-free algorithms; threads cannot wait for messages
Volta: starvation-free algorithms; threads may wait for messages
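To make the difference concrete, here is a minimal sketch (my own illustration, not code from the talk; the kernel and variable names are invented) of a starvation-free pattern that Volta's independent thread scheduling makes safe: every thread of a warp takes a global spin lock in turn. On Volta the lock holder is guaranteed to keep making forward progress while its sibling threads spin; on Pascal the spinning threads could starve the holder and livelock the kernel.

#include <cstdio>
#include <cuda_runtime.h>

// Each of the 32 threads in the warp acquires a global spin lock,
// increments a counter non-atomically inside the critical section,
// and releases the lock. Safe on Volta (sm_70+); may hang on Pascal.
__global__ void warp_spinlock(int *lock, int *counter)
{
    bool done = false;
    while (!done) {
        if (atomicCAS(lock, 0, 1) == 0) {      // try to acquire
            *(volatile int *)counter += 1;     // critical section
            __threadfence();                   // publish before release
            atomicExch(lock, 0);               // release
            done = true;
        }
        // Threads that failed the CAS loop around and retry; Volta
        // schedules them independently of the current lock holder.
    }
}

int main()
{
    int *lock, *counter, result;
    cudaMalloc(&lock, sizeof(int));
    cudaMalloc(&counter, sizeof(int));
    cudaMemset(lock, 0, sizeof(int));
    cudaMemset(counter, 0, sizeof(int));

    warp_spinlock<<<1, 32>>>(lock, counter);   // one full warp
    cudaMemcpy(&result, counter, sizeof(int), cudaMemcpyDeviceToHost);
    printf("counter = %d\n", result);          // expect 32 on Volta
    cudaFree(lock);
    cudaFree(counter);
    return 0;
}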

Page 14: Introduction to Volta (Tesla V100)

PASCAL SIMT EXECUTION MODEL

Diverged threads within a warp cannot exchange data

[Diagram: over time, the warp diverges, the two paths A;B; and X;Y; execute serially, then the warp reconverges]

if (threadIdx.x < 4) {
    A;
    __syncwarp();
    B;
} else {
    X;
    __syncwarp();
    Y;
}

Page 15: Introduction to Volta (Tesla V100)

VOLTA SIMT EXECUTION MODEL

Diverged threads within a warp can exchange data

[Diagram: over time, the warp diverges into A;B; and X;Y;, and __syncwarp() synchronizes the diverged threads]

if (threadIdx.x < 4) {
    A;
    __syncwarp();
    B;
} else {
    X;
    __syncwarp();
    Y;
}
__syncwarp();
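As a concrete example of the data exchange this enables, the sketch below (my own illustration, not from the talk; the kernel and buffer names are invented) follows the slide's pattern: the two diverged branches publish values to shared memory, re-synchronize with __syncwarp(), and then read values written by threads in the other branch.

#include <cstdio>
#include <cuda_runtime.h>

// Diverged threads of a single warp exchange data through shared memory.
// Launch with exactly one warp (32 threads).
__global__ void diverged_exchange(int *out)
{
    __shared__ int buf[32];
    int lane = threadIdx.x;              // 0..31

    if (lane < 4) {
        buf[lane] = lane;                // "A;": publish this lane's value
        __syncwarp();                    // rejoin the diverged branches
        out[lane] = buf[31 - lane];      // "B;": read the other branch's data
    } else {
        buf[lane] = lane;                // "X;"
        __syncwarp();
        out[lane] = buf[31 - lane];      // "Y;"
    }
    __syncwarp();                        // final reconvergence point
}

int main()
{
    int host[32], *out;
    cudaMalloc(&out, sizeof(host));
    diverged_exchange<<<1, 32>>>(out);
    cudaMemcpy(host, out, sizeof(host), cudaMemcpyDeviceToHost);
    for (int i = 0; i < 32; ++i)
        printf("%d ", host[i]);          // expect 31 30 29 ... 0
    printf("\n");
    cudaFree(out);
    return 0;
}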

Page 16: Introduction to Volta (Tesla V100)

VOLTA TENSOR CORE

Page 17: Introduction to Volta (Tesla V100)

TENSOR CORE: 128 ops/cycle

Mixed-precision matrix multiply-accumulate, D = AB + C: A and B are 4x4 FP16 matrices; C and D are 4x4 FP16 or FP32 matrices.

$$
D =
\begin{pmatrix}
A_{0,0} & A_{0,1} & A_{0,2} & A_{0,3}\\
A_{1,0} & A_{1,1} & A_{1,2} & A_{1,3}\\
A_{2,0} & A_{2,1} & A_{2,2} & A_{2,3}\\
A_{3,0} & A_{3,1} & A_{3,2} & A_{3,3}
\end{pmatrix}
\begin{pmatrix}
B_{0,0} & B_{0,1} & B_{0,2} & B_{0,3}\\
B_{1,0} & B_{1,1} & B_{1,2} & B_{1,3}\\
B_{2,0} & B_{2,1} & B_{2,2} & B_{2,3}\\
B_{3,0} & B_{3,1} & B_{3,2} & B_{3,3}
\end{pmatrix}
+
\begin{pmatrix}
C_{0,0} & C_{0,1} & C_{0,2} & C_{0,3}\\
C_{1,0} & C_{1,1} & C_{1,2} & C_{1,3}\\
C_{2,0} & C_{2,1} & C_{2,2} & C_{2,3}\\
C_{3,0} & C_{3,1} & C_{3,2} & C_{3,3}
\end{pmatrix}
$$

Page 18: Introduction to Volta (Tesla V100)

TENSOR SYNCHRONIZATION

Full warp 16x16 matrix math, across a warp (32 threads):

Synchronizes the threads within the warp
Executes the 16x16 matrix multiply as a combination of 4x4 matrix multiplies
Distributes the results to each thread

Page 19: Introduction to Volta (Tesla V100)

VOLTA TENSOR OPERATION

FP16 storage/input → full-precision product → sum with FP32 accumulator (plus more products) → convert to FP32 result

[Diagram: FP16 × FP16 products feeding an FP32 accumulator, which also sums further products, yielding an FP32 result]
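Per element pair, this pipeline behaves like a scalar fused multiply-add. Below is a minimal sketch (my own illustration; the function name is invented) of the numeric behavior: the product of two FP16 significands fits exactly in FP32, so forming the product at full precision and rounding once, as fmaf() does, mirrors the dataflow above.

#include <cuda_fp16.h>

// One multiply-accumulate step of a tensor op, emulated with scalar
// intrinsics: FP16 inputs, a full-precision product, an FP32 accumulator.
__device__ float tensor_mac_step(half a, half b, float acc)
{
    // fmaf() computes a*b + acc with a single rounding, matching
    // "full precision product" + "sum with FP32 accumulator".
    return fmaf(__half2float(a), __half2float(b), acc);
}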

Page 20: Introduction to Volta (Tesla V100)

USING TENSOR CORES

Volta optimized libraries: NVIDIA cuDNN, cuBLAS, TensorRT

CUDA C++: warp-level matrix operations

__device__ void tensor_op_16_16_16(float *d, half *a, half *b, float *c)
{
    wmma::fragment<matrix_a, …> Amat;
    wmma::fragment<matrix_b, …> Bmat;
    wmma::fragment<matrix_c, …> Cmat;

    wmma::load_matrix_sync(Amat, a, 16);
    wmma::load_matrix_sync(Bmat, b, 16);
    wmma::fill_fragment(Cmat, 0.0f);

    wmma::mma_sync(Cmat, Amat, Bmat, Cmat);

    wmma::store_matrix_sync(d, Cmat, 16, wmma::row_major);
}
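The slide's snippet elides the fragment template parameters. Filled in against the WMMA API that shipped in CUDA 9's mma.h, a complete version might look like the sketch below; note that the released API spells the accumulator fragment wmma::accumulator and the store layout wmma::mem_row_major, so treat this as an illustrative sketch rather than the speaker's exact code.

#include <mma.h>
using namespace nvcuda;

// One warp computes a single 16x16x16 tile: D = A * B (+ zero-filled C).
// a and b hold FP16 data in row-major order; d receives FP32 results.
__global__ void wmma_16_16_16(float *d, const half *a, const half *b)
{
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> Amat;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> Bmat;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> Cmat;

    wmma::load_matrix_sync(Amat, a, 16);     // leading dimension 16
    wmma::load_matrix_sync(Bmat, b, 16);
    wmma::fill_fragment(Cmat, 0.0f);

    wmma::mma_sync(Cmat, Amat, Bmat, Cmat);  // D = A*B + C on Tensor Cores

    wmma::store_matrix_sync(d, Cmat, 16, wmma::mem_row_major);
}

// Launch with one warp, compiled for sm_70:
//   wmma_16_16_16<<<1, 32>>>(d_out, d_a, d_b);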

Page 21: Introduction to Volta (Tesla V100)

VOLTA: A GIANT LEAP FOR DEEP LEARNING

[Charts: images per second, P100 vs. V100]

ResNet-50 training: 2.4x faster (P100 FP32 vs. V100 Tensor Cores)
ResNet-50 inference: 3.7x faster (P100 FP16 vs. V100 Tensor Cores), TensorRT at 7ms latency

V100 measured on pre-production hardware.

Page 22: Introduction to Volta (Tesla V100)

Is accuracy still OK if you train in FP16?

Page 23: Introduction to Volta (Tesla V100)

Yes, it is, if you use Tensor Cores.

Training with Mixed-Precision User Guide:
http://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html

Page 24: Introduction to Volta (Tesla V100)

Yes, it is, if you use Tensor Cores.

• Mixed precision training
• The forward and backward passes can run almost entirely in FP16 without problems (if you use Tensor Cores)
• The update (of the weights) is better done in FP32; the update step takes little time anyway (a sketch follows at the end of this page)
• Some models need a technique called loss scaling (small overhead)
• Available in the major DL frameworks: TensorFlow, MXNet, PyTorch, Caffe2, Theano, MS Cognitive Toolkit, Chainer

Training with Mixed-Precision User Guide:
http://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html
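To make the "update in FP32, with loss scaling" recipe concrete, here is a minimal sketch (my own illustration; the kernel name and parameters such as loss_scale are invented, not taken from the guide). The backward pass produces FP16 gradients of the scaled loss; the update kernel unscales them, applies the step to FP32 master weights, and refreshes the FP16 working copy used by the next forward/backward pass.

#include <cuda_fp16.h>

// SGD step in mixed precision: FP32 master weights, FP16 working copy.
// grad16 holds d(loss * loss_scale)/dw as produced by an FP16 backward pass.
__global__ void sgd_update_fp32_master(float *master_w, half *w16,
                                       const half *grad16,
                                       float lr, float loss_scale, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    // Undo the loss scaling to recover the true gradient in FP32.
    float g = __half2float(grad16[i]) / loss_scale;

    // The weight update itself runs entirely in FP32 (and is cheap
    // relative to the forward/backward passes).
    master_w[i] -= lr * g;

    // Refresh the FP16 copy consumed by the next forward/backward pass.
    w16[i] = __float2half(master_w[i]);
}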

Page 25: Introduction to Volta (Tesla V100)

How much faster does Volta get? P100 FP32 and V100 FP32 vs. V100 Tensor Core

ResNet-50

[Diagram: ResNet-50 bottleneck block: Conv 1x1,64 → BN → ReLU → Conv 3x3,64 → BN → ReLU → Conv 1x1,256 → BN → add (+x) → ReLU]

(*) Using Chainer 3.0.0rc1+ and CuPy 2.0.0rc1+

Page 26: Introduction to Volta (Tesla V100)

How much faster does Volta get? P100 FP32 and V100 FP32 vs. V100 Tensor Core

Time per iteration [ms], ImageNet, ResNet-50, batch 128 (breakdown: Conv, BN, ReLU, Cupy_*, misc.):

P100 FP32         570 ms
V100 FP32         360 ms
V100 Tensor Core  197 ms

(*) Using Chainer 3.0.0rc1+ and CuPy 2.0.0rc1+


Page 28: Introduction to Volta (Tesla V100)

How much faster does Volta get? P100 FP32 and V100 FP32 vs. V100 Tensor Core

Time per iteration [ms], ImageNet, ResNet-50, batch 128 (breakdown: Conv, BN, ReLU, Cupy_*, misc.):

P100 FP32         570 ms
V100 FP32         360 ms
V100 Tensor Core  197 ms   (about 3x faster than P100 FP32)

(*) Using Chainer 3.0.0rc1+ and CuPy 2.0.0rc1+

Page 29: Introduction to Volta (Tesla V100)

Multi-GPU performance: ImageNet, ResNet-50, batch/GPU: 128

Images per second:

                    1 GPU   2 GPUs   4 GPUs   8 GPUs
P100 FP32             224      430      857    1,657
V100 FP32             355      675    1,331    2,530
V100 Tensor Core      649    1,199    2,359    4,064

(*) Using CUDA 9, cuDNN 7, NCCL 2, Chainer 3.0.0rc1+ and CuPy 2.0.0rc1+, on a DGX-1 (V)

Page 30: Introduction to Volta (Tesla V100)