optimizing cuda applications for the …on-demand.gputechconf.com/gtc-il/2018/pdf/sil8140...new in...

Vishal Mehta, Maxim Milakov, NVIDIA, Oct 18, 2018

OPTIMIZING CUDA APPLICATIONS FOR THE VOLTA/TURING ARCHITECTURE

NEW FEATURES IN CUDA ECOSYSTEM

TURING AND NEW SYSTEMS
New GPU Architecture, Tensor Cores, NVSwitch Fabric, DGX2, RTcore

CUDA PLATFORM
CUDA Graphs, Vulkan & DX12 Interop, Warp Matrix Multiply Accumulate (WMMA)

LIBRARIES
GPU-accelerated hybrid JPEG decoding, Symmetric Eigenvalue Solvers, FFT Scaling

DEVELOPER TOOLS
New Nsight Products – Nsight Systems and Nsight Compute


AGENDA

New Features:

Tensor Cores

RTcore

CUDA Graphs

Nsight Developer Tools

Optimization strategies:

Volta/Turing Execution Model

Volta/Turing Memory Subsystem

TENSOR CORES

VOLTA / TURING SM

                      V100      TU102
FP64                  32        2
INT32                 64        64
FP32                  64        64
Tensor Cores          8         8
RT Core               -         1
Register File         256 KB    256 KB
L1 and shmem          128 KB    96 KB
Max threads           2048      1024
Compute Capability    70        75*

*Volta (cc70) code runs on Turing without JIT or recompile!

[Figure: Turing SM]

TENSOR CORES
New in Volta, Extended in Turing

half precision inputs: half / float accumulator
8bit/4bit INT inputs: 32-bit INT accumulator
1bit Binary inputs: 32-bit INT accumulator (XOR + POPC)

Used via CUBLAS, CUDNN, CUTLASS, TensorRT
Exposed in CUDA 10 (4bit INT and 1bit binary are experimental)

GPU      SMs   Total Tensor Cores   Peak Half FLOPS   Peak INT8 OPS   Peak INT4 OPS   Peak Binary OPS
V100     80    640                  125 TFLOPS        N.A.            N.A.            N.A.
TU102    72    576                  130.5 TFLOPS      261 TOPS        522 TOPS        2088 TOPS

TURING TENSOR CORE
New Warp Matrix Functions

WMMA operations now include 8-bit integer along with FP16

▪ Warp Matrix Multiply Accumulate
▪ Signed & unsigned 8-bit input
▪ 32-bit integer accumulator
▪ Input/Output dimensions similar to FP16
▪ 2048 ops per cycle, per SM for 8bit
▪ nvcuda::wmma

Supported shapes (D = A × B + C):

WMMA 16x16x16: D (16x16) = A (16x16) × B (16x16) + C (16x16)
WMMA 32x8x16:  D (32x8)  = A (32x16) × B (16x8)  + C (32x8)
WMMA 8x32x16:  D (8x32)  = A (8x16)  × B (16x32) + C (8x32)
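The multiply-accumulate these shapes describe can be checked on the host with a plain reference loop. The sketch below is not the `nvcuda::wmma` API; `mma_reference` is a hypothetical helper name, and the row-major flat-array layout is an assumption made only for illustration. It uses signed 8-bit inputs and a 32-bit integer accumulator, as listed above.

```cpp
#include <array>
#include <cstdint>

// Host-side reference for D = A x B + C with signed 8-bit inputs and
// 32-bit integer accumulators. Following the WMMA MxNxK naming,
// A is MxK, B is KxN, and C and D are MxN. All matrices are row-major.
template <int M, int N, int K>
std::array<std::int32_t, M * N> mma_reference(
    const std::array<std::int8_t, M * K>& a,
    const std::array<std::int8_t, K * N>& b,
    const std::array<std::int32_t, M * N>& c) {
    static_assert(M > 0 && N > 0 && K > 0, "tile dimensions must be positive");
    std::array<std::int32_t, M * N> d{};
    for (int i = 0; i < M; ++i) {
        for (int j = 0; j < N; ++j) {
            std::int32_t acc = c[i * N + j];  // start from the C element
            for (int k = 0; k < K; ++k) {
                // Widen to 32 bits before multiplying so products cannot overflow.
                acc += static_cast<std::int32_t>(a[i * K + k]) *
                       static_cast<std::int32_t>(b[k * N + j]);
            }
            d[i * N + j] = acc;
        }
    }
    return d;
}
```

For a 32x8x16 tile with every element of A equal to 1, B equal to 2 and C equal to 3, `mma_reference<32, 8, 16>(a, b, c)` yields 16 × 2 + 3 = 35 in every element of D. A reference like this is mainly useful for validating a tensor-core kernel's output on small tiles.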

EXPERIMENTAL WARP MATRIX FUNCTIONS
Turing Enables Experimental Sub-Byte Tensor Core Operations

Experimental Sub-Byte Operations
▪ 4-bit signed & unsigned input
▪ 1-bit input with custom matrix operations
▪ 32-bit accumulator output

Access via special namespace: nvcuda::wmma::experimental

namespace experimental {
    namespace precision {
        struct u4; // 4-bit unsigned
        struct s4; // 4-bit signed
        struct b1; // 1-bit
    }
    enum bmmaBitOp { bmmaBitOpXOR = 1 };
    enum bmmaAccumulateOp { bmmaAccumulateOpPOPC = 1 };
}

Enables researchers to experiment with ultra-low precision!
Experimental: the API is subject to change, the functionality is not.

WMMA – IMMA 4BIT
New for Turing (Experimental)

$D_{i,j} = \sum_{k=0}^{31} A_{i,k} \, B_{k,j} + C_{i,j}$

A: 8-by-32 x 4b (128 bits per row)
B: 32-by-8 x 4b (128 bits per column)
C: 8-by-8 x int32
D: 8-by-8 x int32

WMMA – BINARY - XOR POPC
New for Turing (Experimental)

$D_{i,j} = \sum_{k=0}^{127} \mathrm{popc}(A_{i,k} \oplus B_{k,j}) + C_{i,j}$

A: 8-by-128 x 1b (128 bits per row)
B: 128-by-8 x 1b (128 bits per column)
C: 8-by-8 x int32
D: 8-by-8 x int32
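The XOR + POPC accumulation for the 8 x 8 x 128 binary tile can be written out on the host as a reference. This is a sketch, not the experimental `nvcuda::wmma` interface: the name `bmma_xor_popc` and the choice of `std::bitset<128>` to hold one row of A or one column of B are assumptions made for illustration.

```cpp
#include <array>
#include <bitset>
#include <cstdint>

constexpr int kM = 8;    // rows of A, C and D
constexpr int kN = 8;    // columns of B, C and D
constexpr int kK = 128;  // bits per row of A and per column of B

using BitVec = std::bitset<kK>;  // one 128-bit row of A or column of B

// D[i][j] = popc(A_row[i] XOR B_col[j]) + C[i][j], with C and D row-major.
std::array<std::int32_t, kM * kN> bmma_xor_popc(
    const std::array<BitVec, kM>& a_rows,
    const std::array<BitVec, kN>& b_cols,
    const std::array<std::int32_t, kM * kN>& c) {
    std::array<std::int32_t, kM * kN> d{};
    for (int i = 0; i < kM; ++i) {
        for (int j = 0; j < kN; ++j) {
            // count() is the population count of the 128-bit XOR result.
            const auto differing_bits = (a_rows[i] ^ b_cols[j]).count();
            d[i * kN + j] = static_cast<std::int32_t>(differing_bits) + c[i * kN + j];
        }
    }
    return d;
}
```

With all-zero rows in A, all-one columns in B and C filled with 1, every element of D is 128 + 1 = 129; with identical bit patterns on both sides the XOR is zero and D equals C.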

BINARY TENSOR CORE OPERATION

1-bit input signal -> bitwise XOR operation -> 128-bit population count added to accumulator -> 32-bit integer output per point

[Figure labels: Bitwise XOR + Count; Accumulated 32-bit Integer; Previous Accumulation; Other Row/Column Results]

NEW TURING WARP MATRIX FUNCTIONS

               Input Precision                  Output            Supported Sizes                            Max Ops/Clock/SM
Native Types   half *                           half or float     16 x 16 x 16, 32 x 8 x 16, 8 x 32 x 16     1024
               char                             integer (int32)   16 x 16 x 16, 32 x 8 x 16, 8 x 32 x 16     2048
               unsigned char                    integer (int32)   16 x 16 x 16, 32 x 8 x 16, 8 x 32 x 16     2048
Experimental   precision::u4 (4-bit unsigned)   integer (int32)   8 x 8 x 32                                 4096
               precision::s4 (4-bit signed)     integer (int32)   8 x 8 x 32                                 4096
               precision::b1 (1-bit)            integer (int32)   8 x 8 x 128                                16384

* Also available on Volta sm_70.
Note: WMMA requires recompilation for sm_75 for peak performance
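The per-SM rates in this table line up with the peak figures quoted earlier in the deck if peak throughput is taken as SM count × ops per clock per SM × clock frequency. The sketch below works under that assumption; the function names are hypothetical, and the clock is not stated in the deck, so it is derived from the quoted FP16 peak rather than taken from a datasheet.

```cpp
#include <cassert>

// Peak throughput in tera-operations per second:
// SM count x operations per clock per SM x clock (GHz) / 1000.
double peak_tera_ops(int sm_count, int ops_per_clock_per_sm, double clock_ghz) {
    assert(sm_count > 0 && ops_per_clock_per_sm > 0);
    return sm_count * static_cast<double>(ops_per_clock_per_sm) * clock_ghz / 1000.0;
}

// Clock in GHz implied by a quoted peak figure (inverse of peak_tera_ops).
double implied_clock_ghz(int sm_count, int ops_per_clock_per_sm, double peak_tera) {
    assert(sm_count > 0 && ops_per_clock_per_sm > 0);
    return peak_tera * 1000.0 / (sm_count * static_cast<double>(ops_per_clock_per_sm));
}
```

For TU102, `implied_clock_ghz(72, 1024, 130.5)` comes out at roughly 1.77 GHz. Feeding that clock back in with 2048, 4096 and 16384 ops per clock per SM gives about 261, 522 and 2088 TOPS, matching the INT8, INT4 and binary columns. The same calculation for V100 (80 SMs, 125 TFLOPS) implies a clock of roughly 1.53 GHz.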

CUTLASS 1.1
High-performance Matrix Multiplication in Open Source templated CUDA C++

[Figure: CUTLASS GEMM Structural Model]

https://github.com/NVIDIA/cutlass

CUTLASS 1.1

Turing optimized GEMMs

Integer (8-bit, 4-bit and 1-bit) using WMMA

Batched strided GEMM

Support for CUDA 10.0

Updates to documentation and more examples

[Figure: CUTLASS 1.1 on Volta (GV100). Bar chart of performance relative to peak (0% to 100%) for DGEMM, HGEMM, IGEMM, SGEMM, WMMA (F16) and WMMA (F32) kernels, each in the nn, nt, tn and tt layouts. Caption: > 90% Relative to Peak Performance]

TURING RTCORE

RT CORES
Turing GPU RT Cores accelerate ray tracing

RT Cores perform:
● Ray-BVH (Bounding Volume Hierarchy) Traversal
● Instancing: 1 Level
● Ray-Triangle Intersection

Return to SM for:
● Multi-level Instancing
● Custom Intersection
● Shading

SOFTWARE VS. HARDWARE RAY TRACING

[Figure: pre-Turing SM versus Turing SM, with scene primitives Tri1, Tri2, Tri3 and Circle1]

RTCORE IN OPTIX

• Single-ray shader programming model using C++
• Transparently scales across multiple GPUs
• AI Accelerated rendering
• Easy interop with CUDA

http://developer.nvidia.com/optix
http://on-demand.gputechconf.com

CUDA GRAPHS

ASYNCHRONOUS TASK GRAPHS
Execution Optimization When Workflow is Known Up-Front

DL Inference
Loop & Function offload
Deep Neural Network Training
HPC Simulation
Linear Algebra

ALL CUDA WORK FORMS A GRAPH

Any CUDA stream can be mapped to a graph
Node represents operation
Edge represents dependency

[Figure: the same operations A, B, C, D, E, X, Y drawn as a graph ending in an End node and as CUDA work in streams separated by waits; panels labeled "Implicit dependencies" and "Explicit dependencies"]

DEFINITION OF A CUDA GRAPH
Graph Nodes Are Not Just Kernel Launches

Sequence of operations, connected by dependencies.

Operations are one of:

Kernel Launch        CUDA kernel running on GPU
CPU Function Call    Callback function on CPU
Memcopy/Memset       GPU data management
Sub-Graph            Graphs are hierarchical

NEW EXECUTION MECHANISM
Graphs Can Be Generated Once Then Launched Repeatedly

for (int i = 0; i < 1000; i++) {
    launch_graph(G);
}

EXECUTION OPTIMIZATIONS
Latency & Overhead Reductions

Launch latencies:
▪ CUDA 10.0 takes at least 2.2 us CPU time to launch each CUDA kernel on Linux
▪ Pre-defined graph allows launch of any number of kernels in one single operation

[Figure: timeline. With stream launch, the CPU issues Launch A through Launch E one by one while kernels A to E run. With graph launch, the CPU builds the graph, launches it once, and is then idle while kernels A to E run.]

PERFORMANCE IMPACT
Optimizations for Short-Runtime Operations

Example: Small 3D FFT
25% end-to-end improvement for 32^3 3D-FFT (16 us with stream launch, 12 us with graph launch)

CPU launch time improvements
Typical: 33% faster than stream launch

NOTE: Performance impact is workload-dependent
Benefits especially short-running kernels, where overheads account for more runtime
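As a check on the 3D FFT example, the quoted 25% follows directly from the two timings when "improvement" is read as the relative reduction in end-to-end time (that reading is an assumption; the slide does not define the term):

```latex
\[
\text{improvement}
  = \frac{t_{\text{stream}} - t_{\text{graph}}}{t_{\text{stream}}}
  = \frac{16\,\mu\text{s} - 12\,\mu\text{s}}{16\,\mu\text{s}}
  = 25\%
\]
```

Expressed as a speedup, the same measurement is $16 / 12 \approx 1.33\times$.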

THREE-STAGE EXECUTION MODEL

Define: Single Graph “Template”
Created in host code, or loaded from disk, or built up from libraries

Instantiate: Multiple “Executable Graphs”
Snapshot of template
Sets up & initializes GPU execution structures (create once, run many times)

Execute: Executable Graphs Running in CUDA Streams (s1, s2, s3)
Concurrency in graph is not limited by stream (see later)

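CONVERT CUDA STREAM INTO A GRAPH
Construct a graph from normal CUDA stream syntax

// Start by initiating stream capture (the stream is passed by value;
// CUDA 10.1 and later also take a capture-mode argument)
cudaStreamBeginCapture(stream1);

// Build stream work as usual
A<<< ..., stream1 >>>();
cudaEventRecord(e1, stream1);
B<<< ..., stream1 >>>();
cudaStreamWaitEvent(stream2, e1, 0);   // fork: C waits for A
C<<< ..., stream2 >>>();
// Join: the next two lines are missing from the extracted slide text;
// the second Wait in the diagram implies them
cudaEventRecord(e2, stream2);
cudaStreamWaitEvent(stream1, e2, 0);   // D waits for C
D<<< ..., stream1 >>>();

// Now convert the stream to a graph
cudaStreamEndCapture(stream1, &graph);

[Figure: stream1 runs A, B, Wait, D and stream2 runs Wait, C; the captured graph is A -> {B, C} -> D]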


Capture follows inter-stream dependencies to create forks & joins, e.g. cudaStreamWaitEvent(stream2, e1);

CREATE GRAPHS DIRECTLY
Map Graph-Based Workflows Directly Into CUDA

// Define graph of work + dependencies
// (cudaGraphAddNode is slide shorthand; the CUDA 10 runtime adds typed nodes,
//  e.g. cudaGraphAddKernelNode(&node, graph, deps, numDeps, &params))
cudaGraphCreate(&graph, 0);
cudaGraphAddNode(graph, kernel_a, {}, ...);
cudaGraphAddNode(graph, kernel_b, { kernel_a }, ...);
cudaGraphAddNode(graph, kernel_c, { kernel_a }, ...);
cudaGraphAddNode(graph, kernel_d, { kernel_b, kernel_c }, ...);

// Instantiate graph and apply optimizations
cudaGraphInstantiate(&instance, graph);

// Launch executable graph 100 times
for (int i = 0; i < 100; i++)
    cudaGraphLaunch(instance, stream);

[Figure: graph from framework, A -> {B, C} -> D]
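To make the dependency structure concrete, the four-node graph above can be modelled on the host as a map from each node to the nodes it depends on. The sketch below is a hypothetical illustration, not the CUDA graph API and not how the driver schedules work: `DependencyMap` and `schedule_waves` are invented names. It groups nodes into "waves" in which every node has all of its dependencies satisfied, which shows which nodes are free to run concurrently.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical host-side model: node name -> names of the nodes it depends on.
using DependencyMap = std::map<std::string, std::vector<std::string>>;

// Returns the nodes grouped into waves. Every node in a wave has all of its
// dependencies in earlier waves, so nodes within one wave may run concurrently.
std::vector<std::vector<std::string>> schedule_waves(const DependencyMap& deps) {
    std::set<std::string> done;
    std::vector<std::vector<std::string>> waves;
    while (done.size() < deps.size()) {
        std::vector<std::string> ready;
        for (const auto& entry : deps) {
            if (done.count(entry.first)) continue;  // already scheduled
            bool parents_done = true;
            for (const auto& parent : entry.second) {
                if (!done.count(parent)) parents_done = false;
            }
            if (parents_done) ready.push_back(entry.first);
        }
        // An empty wave means a dependency cycle or an unknown node name.
        assert(!ready.empty());
        done.insert(ready.begin(), ready.end());
        waves.push_back(ready);
    }
    return waves;
}
```

For the graph above (kernel_a with no dependencies, kernel_b and kernel_c depending on kernel_a, kernel_d depending on both), the result is three waves: {kernel_a}, {kernel_b, kernel_c}, {kernel_d}. The middle wave is the fork that a single stream would otherwise serialize.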

GRAPH EXECUTION SEMANTICS
Order Graph Work With Other Non-Graph CUDA Work

void launchWork(cudaGraphExec_t i1, cudaGraphExec_t i2,
                cudaStreamCallback_t cpu, cudaStream_t stream) {
    A<<< 256, 256, 0, stream >>>();                  // Kernel launch
    cudaGraphLaunch(i1, stream);                     // Graph1 launch
    cudaStreamAddCallback(stream, cpu, nullptr, 0);  // CPU callback
    cudaGraphLaunch(i2, stream);                     // Graph2 launch
    cudaStreamSynchronize(stream);
}

If you can put it in a CUDA stream, you can run it together with a graph

[Figure: stream timeline with kernel A, the graphs and the CPU callback]

GRAPHS IGNORE STREAM SERIALIZATION RULES
Launch Stream Is Used Only For Ordering With Other Work

Branches in graph still execute concurrently even though graph is launched into a stream

[Figure: a stream containing kernel A, a graph (nodes A, B, C, D, E, X, Y, End) and a CPU callback]

CROSS-DEVICE DEPENDENCIES
Graphs May Span Multiple GPUs

CUDA is closest to the O/S and the hardware
▪ Can optimize multi-device dependencies
▪ Can optimize heterogeneous dependencies
▪ Define locality per-node

[Figure: multi-device execution (nodes A, B, C, D across GPU 0 and GPU 1) and heterogeneous execution (GPU, CPU, GPU)]

NSIGHT DEVELOPER TOOLS

NSIGHT PRODUCT FAMILY

Nsight Systems: System-wide application algorithm tuning
Nsight Compute: CUDA Kernel Profiling and Debugging
Nsight Graphics: Graphics Shader Profiling and Debugging
IDE Plugins: Nsight Eclipse Edition/Visual Studio (Editor, Debugger)

NSIGHT SYSTEMS
System-wide Performance Analysis

Observe Application Behavior: CPU threads, GPU traces, Memory Bandwidth and more
Locate Optimization Opportunities: CUDA & OpenGL APIs, UVM transfers, User Annotations using NVTX
Ready for Big Data: Fast GUI capable of visualizing in excess of 10 million events.

https://developer.nvidia.com/nsight-systems

[Screenshot: Nsight Systems timeline showing processes and threads, CUDA and OpenGL API trace, multi-GPU, kernel and memory transfer activities, cuDNN and cuBLAS trace, thread/core migration, and thread state]

NVIDIA NSIGHT COMPUTE
Next Generation Kernel Profiler

Interactive CUDA API debugging and kernel profiling
Fast Data Collection
Improved Workflow and Fully Customizable (Baselining, Programmable UI/Rules)
Command Line, Standalone, IDE Integration

Platform Support
OS: Linux (x86, POWER, ARM), Windows
GPUs: Pascal, Volta, Turing

[Screenshot: kernel profile, comparisons with baseline, metric data, source correlation]

EXECUTION MODEL

CUDA BASICS
Blocks of threads, warps

Single Instruction Multiple Threads (SIMT) model
CUDA hierarchy: Grid -> Blocks -> Warps -> Threads
One warp = 32 threads.

Why does it matter?
Many optimizations are based on behavior at the warp level

CUDA BASICS
Mapping threads

Thread blocks can be 1D, 2D, 3D
Only for convenience. Hardware “looks” at threads in 1D
Consecutive 32 threads belong to the same warp

Example: 80 threads (40 threads in X, 2 rows of threads in Y)
3 warps (96 threads), 16 inactive threads in 3rd warp

[Figure: 40 x 2 thread block split into warps 1, 2 and 3; warp 2 spans the end of row 0 and the start of row 1]
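The 1D view of a thread block can be reproduced with a little integer arithmetic. The sketch below runs on the host; `Dim3` is a stand-in for CUDA's `dim3` / `threadIdx`, and all function names are hypothetical. It uses the usual linearization in which x varies fastest, then y, then z.

```cpp
#include <cassert>

constexpr int kWarpSize = 32;

struct Dim3 { int x, y, z; };  // host-side stand-in for CUDA's dim3 / threadIdx

// Linear thread index within a block: x varies fastest, then y, then z.
int linear_thread_id(Dim3 tid, Dim3 block) {
    assert(tid.x < block.x && tid.y < block.y && tid.z < block.z);
    return tid.x + tid.y * block.x + tid.z * block.x * block.y;
}

// Warp index within the block and lane index within that warp.
int warp_id(Dim3 tid, Dim3 block) { return linear_thread_id(tid, block) / kWarpSize; }
int lane_id(Dim3 tid, Dim3 block) { return linear_thread_id(tid, block) % kWarpSize; }

// Number of warps the block occupies; the last one may be partially filled.
int warps_per_block(Dim3 block) {
    const int threads = block.x * block.y * block.z;
    return (threads + kWarpSize - 1) / kWarpSize;
}

// Lanes of the last warp that have no thread assigned to them.
int inactive_lanes(Dim3 block) {
    return warps_per_block(block) * kWarpSize - block.x * block.y * block.z;
}
```

For the 40 x 2 block, `warps_per_block` returns 3 and `inactive_lanes` returns 16. Thread (x = 0, y = 1) has linear index 40, so it is lane 8 of the second warp (`warp_id` returns 1, counting from zero): the second warp mixes threads from both rows.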

CUDA BASICS
Control Flow

Different warps can execute different code
No impact on performance
Each warp maintains its own Program Counter

Different code path inside the same warp?
Threads that don’t participate are masked out, but the whole warp executes both sides of the branch

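CONTROL FLOW

A;
if (threadIdx.y == 0)
    B;
else
    C;
D;

Block of 40 x 2 threads (threadIdx.x = 0..39, threadIdx.y = 0..1), instructions over time:

Warp 1 (all threads have threadIdx.y == 0): A B D
Warp 2 (threads from both rows): A B C D
Warp 3 (all threads have threadIdx.y == 1): A C D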
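The per-warp instruction sequences for this branch can be derived mechanically. The sketch below is a host-side model under a simplifying assumption: a warp issues B if any of its threads takes the "if" side and C if any takes the "else" side, with non-participating threads masked out. `warp_traces` is a hypothetical name.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Instruction trace of each warp for
//   A; if (threadIdx.y == 0) B; else C; D;
// in a block of block_x by block_y threads (x varies fastest).
std::vector<std::string> warp_traces(int block_x, int block_y) {
    assert(block_x > 0 && block_y > 0);
    const int warp_size = 32;
    const int threads = block_x * block_y;
    const int warps = (threads + warp_size - 1) / warp_size;
    std::vector<std::string> traces;
    for (int w = 0; w < warps; ++w) {
        bool any_if = false;
        bool any_else = false;
        // Visit only the lanes of this warp that have a thread assigned.
        for (int t = w * warp_size; t < (w + 1) * warp_size && t < threads; ++t) {
            const int y = t / block_x;  // threadIdx.y of linear thread t
            if (y == 0) any_if = true; else any_else = true;
        }
        std::string trace = "A";
        if (any_if) trace += "B";
        if (any_else) trace += "C";
        traces.push_back(trace + "D");
    }
    return traces;
}
```

`warp_traces(40, 2)` returns "ABD", "ABCD" and "ACD": only the second warp diverges and pays for both sides of the branch. With a 32 x 2 block the branch falls on a warp boundary and the result is "ABD", "ACD", with no divergent warp.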

CONTROL FLOW
Takeaways

Minimize thread divergence inside a warp
Divergence between warps is fine
Maximize “useful” cycles for each warp

THREADS ARE THREADS
New in Volta

Program counter:
Before Volta: Per warp
Volta: Per thread

Volta guarantees Forward Progress for diverged threads in a warp

Allows data exchange between diverged threads in a warp, e.g. mutexes among warp threads.
Allows writing natural code that would have deadlocked before

THREADS ARE THREADS
Example

lock = 0;
while (lock == 0)
    lock = tryGetLock();
doSomething();
releaseLock();

These device functions could be implemented with atomics, or volatile pointers

Pre-Volta: The code might deadlock in the loop, if the thread that gets the lock cannot make forward progress and release the lock

THREADS ARE THREADS
Thread re-convergence

Don’t assume the threads in a warp are re-converged or executing in lock-step mode.
Use __syncwarp() to synchronize the threads in a warp.

Shuffle and warp vote functions are deprecated. Use the new equivalent “_sync” functions.
The extra parameter tells the compiler/hardware which threads are expected to participate, because they might not all reach it at the same time.
E.g.: __shfl_up(value, 1) becomes __shfl_up_sync(0xffffffff, value, 1)

Full efficiency only when all the 32 threads of a warp are converged!

THREADS ARE THREADS
How to deal with warp-synchronous code?

Update/fix the code!
Use Cooperative Groups (GTC 2017 talk s7622)
Compile for an older architecture (disable forward progress):
-arch=compute_60 -code=sm_70 (binary)
-arch=compute_60 (PTX JIT)

MEMORY SUBSYSTEM

VOLTA MEMORY SUBSYSTEM
Tesla V100

80 Streaming Multiprocessors
256 KB register file per SM (20 MB total)

Unified Shared Mem / L1 Cache
128 KB per SM, variable split (10 MB total, 14 TB/s), Volta caches L1 writes

6 MB L2 Cache, L2 is write back

16/32 GB HBM2 (900 GB/s)

[Figure: memory hierarchy with per-SM registers and L1 / SMEM, a shared L2, DRAM, and PCIe / NVLINK]

TURING MEMORY SUBSYSTEM
Quadro RTX 8000

72 Streaming Multiprocessors
256 KB register file per SM (18.5 MB total)

Unified Shared Mem / L1 Cache
96 KB per SM, variable split (7 MB total, 8 TB/s), Turing caches L1 writes

6 MB L2 Cache, L2 is write back

24 GB GDDR6 (672 GB/s)

[Figure: memory hierarchy with per-SM registers and L1 / SMEM, a shared L2, DRAM, and PCIe / NVLINK]

L1, L2 CACHES
Why do GPUs have caches?

In general, not for temporal locality
100s ~ 1000s of threads running per SM, tens of thousands of threads sharing the L2 cache
L1, L2 are small per thread
For example, at 2048 threads/SM, with 80 SMs: 64 bytes L1, 38 Bytes L2 per thread
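The per-thread figures follow from dividing cache capacity by the number of resident threads. The sketch below reproduces them on the host, assuming the V100 values given elsewhere in the deck (128 KB of L1 per SM and a 6 MB L2) and binary units (1 KB = 1024 bytes). The function names are hypothetical.

```cpp
#include <cassert>

// Bytes of L1 available per resident thread on one SM.
double l1_bytes_per_thread(double l1_kb_per_sm, int threads_per_sm) {
    assert(threads_per_sm > 0);
    return l1_kb_per_sm * 1024.0 / threads_per_sm;
}

// Bytes of L2 available per resident thread across the whole GPU.
double l2_bytes_per_thread(double l2_mb, int threads_per_sm, int sm_count) {
    assert(threads_per_sm > 0 && sm_count > 0);
    const double total_threads = static_cast<double>(threads_per_sm) * sm_count;
    return l2_mb * 1024.0 * 1024.0 / total_threads;
}
```

`l1_bytes_per_thread(128, 2048)` gives 64 bytes and `l2_bytes_per_thread(6, 2048, 80)` gives 38.4 bytes, which the slide rounds to 38. At those sizes a thread cannot expect data it loaded earlier to still be cached, which is why the caches are not primarily about temporal locality.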

L1, L2 CACHES
Cache Lines & Sectors

Memory access granularity = 32 Bytes = 1 sector
An L1/L2 cache line is 128 Bytes, made of 4 sectors.
Cache “management” granularity = 1 cache line

128-Byte cache line, 128-Byte alignment: | Sector 0 | Sector 1 | Sector 2 | Sector 3 |

ACCESS PATTERNS
Warps and Sectors

For each warp: How many sectors needed?
Depends on addresses, active threads, access size.
Natural element sizes = 1B, 2B, 4B, 8B, 16B.

The examples below place the accesses of one warp (threads 0 to 31) on a memory address axis marked every 32 bytes (0, 32, 64, ..., 352):

4-Byte element access, aligned: 4 sectors

4-Byte access, unaligned: 5 sectors
128 bytes requested, 160 bytes read (80% efficiency)
With >1 warp per block, the extra sector (shared with the next warp) might be found in L1 or L2

Same address for all threads: 1 sector
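The sector counts in these examples can be reproduced by enumerating which 32-byte sectors each active lane touches. The sketch below is a host-side model, not a profiler metric: `sectors_touched` is a hypothetical name, and it assumes lane i accesses `elem_bytes` bytes starting at `base + i * stride_bytes`.

```cpp
#include <cassert>
#include <cstdint>
#include <set>

constexpr std::uint64_t kSectorBytes = 32;

// Counts the distinct 32-byte sectors touched when each active lane i of a
// warp accesses elem_bytes bytes starting at base + i * stride_bytes.
int sectors_touched(std::uint64_t base, int elem_bytes, std::uint64_t stride_bytes,
                    std::uint32_t active_mask = 0xffffffffu) {
    assert(elem_bytes > 0);
    std::set<std::uint64_t> sectors;
    for (int lane = 0; lane < 32; ++lane) {
        if (((active_mask >> lane) & 1u) == 0) continue;  // lane is masked out
        const std::uint64_t first = base + static_cast<std::uint64_t>(lane) * stride_bytes;
        const std::uint64_t last = first + static_cast<std::uint64_t>(elem_bytes) - 1;
        // An element may straddle a sector boundary, so insert every sector it covers.
        for (std::uint64_t s = first / kSectorBytes; s <= last / kSectorBytes; ++s) {
            sectors.insert(s);
        }
    }
    return static_cast<int>(sectors.size());
}
```

`sectors_touched(0, 4, 4)` returns 4 for the aligned case, `sectors_touched(4, 4, 4)` returns 5 for an unaligned start, and `sectors_touched(64, 4, 0)` returns 1 when every lane reads the same address. In the unaligned case the warp requests 128 bytes but 5 × 32 = 160 bytes are read, giving the 80% efficiency quoted above.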

L1, L2 CACHES
Why do GPUs have caches?

Caches on GPUs can help with:
“Smoothing” irregular, unaligned access patterns
Caching common data accessed by many threads
Faster register spills, local memory
Can help in codes that don’t use shared memory

SHARED MEMORY

Scratch-pad memory on each SM
User-managed cache, hardware does not evict data
Data written to SMEM stays there until the code overwrites it or the threadblock finishes execution

Useful for:
Storing frequently-accessed data, to reduce DRAM accesses
Communication among threads of a threadblock

Performance benefits compared to DRAM:
20-40x lower latency
~15x higher bandwidth

UNIFIED SHARED MEM / L1 CACHE
Variable split

How to specify the L1 / Smem split:
cudaFuncSetAttribute(MyKernel, cudaFuncAttributePreferredSharedMemoryCarveout, carveout);

The driver usually does a pretty good job at choosing the right split.

To overcome the 48 KB per threadblock limitation, call:
cudaFuncSetAttribute(MyKernel, cudaFuncAttributeMaxDynamicSharedMemorySize, maxsize);

Volta: 6 possible smem / L1 splits
96KB / 32KB
64KB / 64KB
32KB / 96KB
16KB / 112KB
8KB / 120KB
0KB / 128KB

Turing: 2 possible smem / L1 splits
64KB / 32KB
32KB / 64KB

https://developer.nvidia.com/computeworks
http://on-demand.gputechconf.com
