Vishal Mehta, Maxim Milakov, NVIDIA, Oct 18, 2018
OPTIMIZING CUDA APPLICATIONS FOR THE VOLTA/TURING ARCHITECTURE
NEW FEATURES IN CUDA ECOSYSTEM
TURING AND NEW SYSTEMS: New GPU Architecture, Tensor Cores, NVSwitch Fabric, DGX2, RTcore
CUDA PLATFORM: CUDA Graphs, Vulkan & DX12 Interop, Warp Matrix Multiply Accumulate (WMMA)
LIBRARIES: GPU-accelerated hybrid JPEG decoding, Symmetric Eigenvalue Solvers, FFT Scaling
DEVELOPER TOOLS: New Nsight Products – Nsight Systems and Nsight Compute
Scientific Computing
AGENDA
New Features:
Tensor Cores
RTcore
CUDA Graphs
Nsight Developer Tools
Optimization strategies:
Volta/Turing Execution Model
Volta/Turing Memory Subsystem
TENSOR CORES
VOLTA / TURING SM
                     V100      TU102
FP64                 32        2
INT32                64        64
FP32                 64        64
Tensor Cores         8         8
RT Core              -         1
Register File        256 KB    256 KB
L1 and shmem         128 KB    96 KB
Max threads          2048      1024
Compute Capability   70        75*
*Volta (cc70) code runs on Turing without JIT or recompile!
Turing SM
TENSOR CORES
New in Volta, Extended in Turing
half precision inputs     half / float accumulator
8bit/4bit INT inputs      32-bit INT accumulator
1bit Binary inputs        32-bit INT accumulator (XOR + POPC)
Used via CUBLAS, CUDNN, CUTLASS, TensorRT
Exposed in CUDA 10 (4bit INT and 1bit binary are experimental)
GPU     SMs   Total Tensor Cores   Peak Half FLOPS   Peak INT8 OPS   Peak INT4 OPS   Peak Binary OPS
V100    80    640                  125 TFLOPS        N.A.            N.A.            N.A.
TU102   72    576                  130.5 TFLOPS      261 TOPS        522 TOPS        2088 TOPS
TURING TENSOR CORE
New Warp Matrix Functions
WMMA operations now include 8-bit integer along with FP16
▪ Warp Matrix Multiply Accumulate
▪ Signed & unsigned 8-bit input
▪ 32-bit integer accumulator
▪ Input/Output dimensions similar to FP16
▪ 2048 ops per cycle, per SM for 8bit
▪ nvcuda::wmma
WMMA 32x8x16:   D (32x8)  = A (32x16) x B (16x8)  + C (32x8)
WMMA 8x32x16:   D (8x32)  = A (8x16)  x B (16x32) + C (8x32)
WMMA 16x16x16:  D (16x16) = A (16x16) x B (16x16) + C (16x16)
EXPERIMENTAL WARP MATRIX FUNCTIONS
Turing Enables Experimental Sub-Byte Tensor Core Operations
Experimental Sub-Byte Operations
▪ 4-bit signed & unsigned input
▪ 1-bit input with custom matrix operations
▪ 32-bit accumulator output
Access via special namespace:
nvcuda::wmma::experimental
namespace experimental {
namespace precision {
struct u4; // 4-bit unsigned
struct s4; // 4-bit signed
struct b1; // 1-bit
}
enum bmmaBitOp { bmmaBitOpXOR = 1 };
enum bmmaAccumulateOp { bmmaAccumulateOpPOPC = 1 };
}
Enables researchers to experiment with ultra-low precision!
Experimental means the API is subject to change, not the functionality.
WMMA – IMMA 4BIT
New for Turing (Experimental)
$D_{i,j} = \sum_{k=0}^{31} A_{i,k} \cdot B_{k,j} + C_{i,j}$
A: 8-by-32 x 4b (128 bits per row)
B: 32-by-8 x 4b (128 bits per column)
C: 8-by-8 x int32
D: 8-by-8 x int32
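The 4-bit multiply-accumulate above can be checked against a plain host-side reference. The sketch below is only an illustration under stated assumptions: the name `imma4_reference` is hypothetical, signed 4-bit inputs are assumed, and each 4-bit value is stored unpacked in its own `int8_t`, whereas the hardware works on packed 128-bit rows and columns.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Tile shape from the slide: D is 8-by-8, the shared dimension K is 32.
constexpr int kM = 8, kN = 8, kK = 32;

// Assumption: one signed 4-bit value per int8_t entry (unpacked for clarity).
using MatA = std::array<std::array<std::int8_t, kK>, kM>;   // 8-by-32 x 4b
using MatB = std::array<std::array<std::int8_t, kN>, kK>;   // 32-by-8 x 4b
using MatC = std::array<std::array<std::int32_t, kN>, kM>;  // 8-by-8 x int32

// Hypothetical host reference: D[i][j] = sum over k of A[i][k] * B[k][j], plus C[i][j].
MatC imma4_reference(const MatA& a, const MatB& b, const MatC& c) {
    MatC d{};
    for (int i = 0; i < kM; ++i) {
        for (int j = 0; j < kN; ++j) {
            std::int32_t acc = c[i][j];
            for (int k = 0; k < kK; ++k) {
                // Signed 4-bit values must lie in [-8, 7].
                assert(a[i][k] >= -8 && a[i][k] <= 7);
                assert(b[k][j] >= -8 && b[k][j] <= 7);
                acc += static_cast<std::int32_t>(a[i][k]) * b[k][j];
            }
            d[i][j] = acc;
        }
    }
    return d;
}
```

A reference like this is mainly useful for validating a tensor core kernel on small tiles, since the 32-bit accumulator cannot overflow at this tile size.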
WMMA – BINARY - XOR POPC
New for Turing (Experimental)
$D_{i,j} = \mathrm{popc}(A_{i,k} \oplus B_{k,j}) + C_{i,j}$, with the population count taken over $k = 0 .. 127$
A: 8-by-128 x 1b (128 bits per row)
B: 128-by-8 x 1b (128 bits per column)
C: 8-by-8 x int32
D: 8-by-8 x int32
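As a rough host-side illustration of the XOR + POPC operation above, the sketch below XORs one 128-bit row of A with one 128-bit column of B, counts the set bits, and adds the accumulator. The function name `bmma_xor_popc_reference` and the choice of `std::bitset` as storage are assumptions made for this example, not part of the WMMA API.

```cpp
#include <array>
#include <bitset>
#include <cstdint>

constexpr int kTile = 8;    // D and C are 8-by-8
constexpr int kBits = 128;  // K dimension: 128 one-bit elements

using BitVec = std::bitset<kBits>;                // one row of A or one column of B
using BitMat = std::array<BitVec, kTile>;         // 8 rows (A) or 8 columns (B)
using AccMat = std::array<std::array<std::int32_t, kTile>, kTile>;

// Hypothetical host reference.
// aRows[i] is row i of A (8-by-128 x 1b); bCols[j] is column j of B (128-by-8 x 1b).
AccMat bmma_xor_popc_reference(const BitMat& aRows, const BitMat& bCols, const AccMat& c) {
    AccMat d{};
    for (int i = 0; i < kTile; ++i) {
        for (int j = 0; j < kTile; ++j) {
            // Bitwise XOR, then a 128-bit population count added to the accumulator.
            const auto differingBits = (aRows[i] ^ bCols[j]).count();
            d[i][j] = static_cast<std::int32_t>(differingBits) + c[i][j];
        }
    }
    return d;
}
```

Each output element is therefore the Hamming distance between a row of A and a column of B, plus the previous accumulation.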
BINARY TENSOR CORE OPERATION
[Figure: 1-bit input signal -> bitwise XOR operation -> 128-bit population count added to accumulator (previous accumulation) -> accumulated 32-bit integer count, one 32-bit integer output per point; other row/column results shown alongside]
NEW TURING WARP MATRIX FUNCTIONS
               Input Precision                  Output            Supported Sizes                           Max Ops/Clock/SM
Native Types   half *                           half or float     16 x 16 x 16, 32 x 8 x 16, 8 x 32 x 16    1024
Native Types   char                             integer (int32)   16 x 16 x 16, 32 x 8 x 16, 8 x 32 x 16    2048
Native Types   unsigned char                    integer (int32)   16 x 16 x 16, 32 x 8 x 16, 8 x 32 x 16    2048
Experimental   precision::u4 (4-bit unsigned)   integer (int32)   8 x 8 x 32                                4096
Experimental   precision::s4 (4-bit signed)     integer (int32)   8 x 8 x 32                                4096
Experimental   precision::b1 (1-bit)            integer (int32)   8 x 8 x 128                               16384
* Also available on Volta sm_70. Note: WMMA requires recompilation for sm_75 for peak performance
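The Max Ops/Clock/SM column can be cross-checked against the TU102 peak figures quoted earlier in the deck (130.5 TFLOPS half, 261 / 522 / 2088 TOPS for INT8 / INT4 / binary). The sketch below assumes that peak throughput scales linearly with ops per clock per SM at a fixed clock and SM count; the helper name `scaledPeak` is made up for this example.

```cpp
// Ops/clock/SM values taken from the table above.
constexpr double kOpsHalf = 1024.0;
constexpr double kOpsInt8 = 2048.0;
constexpr double kOpsInt4 = 4096.0;
constexpr double kOpsBinary = 16384.0;

// Assumption: same clock and SM count for every precision, so the peak rate
// scales with the per-clock throughput relative to half precision.
constexpr double scaledPeak(double peakHalfTera, double opsPerClockPerSm) {
    return peakHalfTera * opsPerClockPerSm / kOpsHalf;
}

// TU102: 130.5 TFLOPS half precision, from the peak table.
static_assert(scaledPeak(130.5, kOpsInt8) == 261.0, "INT8 peak should be 261 TOPS");
static_assert(scaledPeak(130.5, kOpsInt4) == 522.0, "INT4 peak should be 522 TOPS");
static_assert(scaledPeak(130.5, kOpsBinary) == 2088.0, "binary peak should be 2088 TOPS");
```

The ratios 2x, 4x and 16x over half precision reproduce the quoted TU102 numbers exactly.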
CUTLASS 1.1
High-performance Matrix Multiplication in Open Source templated CUDA C++
[Figure: CUTLASS GEMM Structural Model]
https://github.com/NVIDIA/cutlass
CUTLASS 1.1
Turing optimized GEMMs
Integer (8-bit, 4-bit and 1-bit) using WMMA
Batched strided GEMM
Support for CUDA 10.0
Updates to documentation and more examples
[Chart: CUTLASS 1.1 on Volta (GV100). Y axis: % relative to peak, 0% to 100%. X axis: the _nn, _nt, _tn and _tt variants of dgemm, hgemm, igemm, sgemm, wmma_gemm_f16 and wmma_gemm, grouped as DGEMM, HGEMM, IGEMM, SGEMM, WMMA (F16), WMMA (F32)]
> 90% Relative to Peak Performance
TURING RTCORE
RT CORES
Turing GPU RT Cores accelerate ray tracing
RT Cores perform
● Ray-BVH (Bounding Volume Hierarchy) Traversal
● Instancing: 1 Level
● Ray-Triangle Intersection
Return to SM for
● Multi-level Instancing
● Custom Intersection
● Shading
SOFTWARE VS. HARDWARE RAY TRACING
[Figure: pre-Turing SM compared with Turing SM; labels: Tri1, Tri2, Tri3, Circle1]
RTCORE IN OPTIX
• Single-ray shader programming model using C++
• Transparently scales across multiple GPUs
• AI Accelerated rendering
• Easy interop with CUDA
http://developer.nvidia.com/optix
http://on-demand.gputechconf.com
CUDA GRAPHS
ASYNCHRONOUS TASK GRAPHS
Execution Optimization When Workflow is Known Up-Front
DL Inference
Loop & Function offload
Deep Neural Network Training
HPC Simulation
Linear Algebra
ALL CUDA WORK FORMS A GRAPH
CUDA Work in Streams
Any CUDA stream can be mapped to a graph
Node represents operation
Edge represents dependency
[Figure: work items A, B, C, D, E, X, Y shown in streams with Wait operations, and as graph nodes ending in End; captions: Implicit dependencies, Explicit dependencies]
DEFINITION OF A CUDA GRAPH
Graph Nodes Are Not Just Kernel Launches
Sequence of operations, connected by dependencies.
Operations are one of:
Kernel Launch        CUDA kernel running on GPU
CPU Function Call    Callback function on CPU
Memcopy/Memset       GPU data management
Sub-Graph            Graphs are hierarchical
[Figure: graph with nodes A, B, C, D, E, X, Y, End]
NEW EXECUTION MECHANISM
Graphs Can Be Generated Once Then Launched Repeatedly
[Figure: graph G with nodes A, B, C, D, E, X, Y, End]
for (int i = 0; i < 1000; i++) {
    launch_graph( G );
}
EXECUTION OPTIMIZATIONS
Latency & Overhead Reductions
Launch latencies:
▪ CUDA 10.0 takes at least 2.2us CPU time to launch each CUDA kernel on Linux
▪ Pre-defined graph allows launch of any number of kernels in one single operation
[Figure: timeline. With stream launch, the CPU issues Launch A, Launch B, Launch C, Launch D, Launch E for kernels A B C D E. With a graph, the CPU issues Build Graph and Launch Graph, then is idle while A B C D E run]
PERFORMANCE IMPACT
Optimizations for Short-Runtime Operations
CPU launch time improvements
Typical: 33% faster than stream launch
Example: Small 3D FFT
25% end-to-end improvement for 32^3 3D-FFT (16us with stream launch, 12us with graph launch)
NOTE: Performance impact is workload-dependent
Benefits especially short-running kernels, where overheads account for more runtime
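The 25% figure for the small 3D FFT follows directly from the two timings on the slide. A minimal worked check, with a helper name (`improvementPercent`) chosen only for this example:

```cpp
// Relative end-to-end improvement, in percent, when a run goes from
// beforeUs to afterUs microseconds.
constexpr double improvementPercent(double beforeUs, double afterUs) {
    return (beforeUs - afterUs) / beforeUs * 100.0;
}

// 16us with stream launch, 12us with graph launch (numbers from the slide).
static_assert(improvementPercent(16.0, 12.0) == 25.0, "expected a 25% improvement");
```

The saving is a fixed 4us here, so the shorter the total runtime, the larger the relative benefit.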
THREE-STAGE EXECUTION MODEL
Define: Single Graph “Template”
Created in host code, or loaded from disk, or built up from libraries
Instantiate: Multiple “Executable Graphs”
Snapshot of template
Sets up & initializes GPU execution structures (create once, run many times)
Execute: Executable Graphs Running in CUDA Streams (s1, s2, s3)
Concurrency in graph is not limited by stream (see later)
[Figure: one graph template (nodes A, B, C, D, E, X, Y, End) instantiated into three executable graphs, launched into streams s1, s2, s3]
CONVERT CUDA STREAM INTO A GRAPH
Construct a graph from normal CUDA stream syntax
// Start by initiating stream capture
// (takes the stream itself; CUDA 10.1 and later also take a capture mode argument)
cudaStreamBeginCapture(stream1);
// Build stream work as usual
A<<< ..., stream1 >>>();
cudaEventRecord(e1, stream1);
B<<< ..., stream1 >>>();
cudaStreamWaitEvent(stream2, e1, 0);
C<<< ..., stream2 >>>();
cudaEventRecord(e2, stream2);
cudaStreamWaitEvent(stream1, e2, 0);
D<<< ..., stream1 >>>();
// Now convert the stream to a graph
cudaStreamEndCapture(stream1, &graph);
[Figure: stream1 holds A, B, Wait, D; stream2 holds Wait, C; the resulting graph has A -> B, A -> C, B -> D, C -> D]
Capture follows inter-stream dependencies, such as cudaStreamWaitEvent(stream2, e1), to create forks & joins
CREATE GRAPHS DIRECTLY
Map Graph-Based Workflows Directly Into CUDA
[Figure: graph from framework, with A -> B, A -> C, B -> D, C -> D]
// Define graph of work + dependencies
// (simplified pseudocode: the CUDA 10 runtime calls, e.g. cudaGraphAddKernelNode, take additional arguments)
cudaGraphCreate(&graph);
cudaGraphAddNode(graph, kernel_a, {}, ...);
cudaGraphAddNode(graph, kernel_b, { kernel_a }, ...);
cudaGraphAddNode(graph, kernel_c, { kernel_a }, ...);
cudaGraphAddNode(graph, kernel_d, { kernel_b, kernel_c }, ...);
// Instantiate graph and apply optimizations
cudaGraphInstantiate(&instance, graph);
// Launch executable graph 100 times
for (int i = 0; i < 100; i++)
    cudaGraphLaunch(instance, stream);
GRAPH EXECUTION SEMANTICS
Order Graph Work With Other Non-Graph CUDA Work
// CPU_Func is assumed to be compatible with cudaStreamCallback_t
void launchWork(cudaGraphExec_t i1, cudaGraphExec_t i2, CPU_Func cpu, cudaStream_t stream) {
    A <<< 256, 256, 0, stream >>>(); // Kernel launch
    cudaGraphLaunch(i1, stream); // Graph1 launch
    cudaStreamAddCallback(stream, cpu, nullptr, 0); // CPU callback (no user data, flags = 0)
    cudaGraphLaunch(i2, stream); // Graph2 launch
    cudaStreamSynchronize(stream);
}
[Figure: stream containing kernel A, the graph launches and the CPU callback]
If you can put it in a CUDA stream, you can run it together with a graph
GRAPHS IGNORE STREAM SERIALIZATION RULES
Launch Stream Is Used Only For Ordering With Other Work
[Figure: a stream holding kernel A, a graph (nodes A, B, C, D, E, X, Y, End) and a CPU callback]
Branches in graph still execute concurrently even though graph is launched into a stream
CROSS-DEVICE DEPENDENCIES
Graphs May Span Multiple GPUs
CUDA is closest to the O/S and the hardware
▪ Can optimize multi-device dependencies
▪ Can optimize heterogeneous dependencies
▪ Define locality per-node
[Figure: graph nodes A, B, C, D placed across GPU 0 and GPU 1 (Multi-Device Execution) and across GPU, CPU, GPU (Heterogeneous Execution)]
NSIGHT DEVELOPER TOOLS
NSIGHT PRODUCT FAMILY
Nsight Systems: System-wide application algorithm tuning
Nsight Compute: CUDA Kernel Profiling and Debugging
Nsight Graphics: Graphics Shader Profiling and Debugging
IDE Plugins: Nsight Eclipse Edition/Visual Studio (Editor, Debugger)
NSIGHT SYSTEMS
System-wide Performance Analysis
Observe Application Behavior: CPU threads, GPU traces, Memory Bandwidth and more
Locate Optimization Opportunities: CUDA & OpenGL APIs, UVM transfers, User Annotations using NVTX
Ready for Big Data: Fast GUI capable of visualizing in excess of 10 million events.
https://developer.nvidia.com/nsight-systems
[Screenshot callouts: Processes and threads; CUDA and OpenGL API trace; Multi-GPU; Kernel and memory transfer activities; cuDNN and cuBLAS trace; Thread/core migration; Thread state]
NVIDIA NSIGHT COMPUTE
Next Generation Kernel Profiler
Interactive CUDA API debugging and kernel profiling
Fast Data Collection
Improved Workflow and Fully Customizable (Baselining, Programmable UI/Rules)
Command Line, Standalone, IDE Integration
Platform Support
OS: Linux (x86, POWER, ARM), Windows
GPUs: Pascal, Volta, Turing
[Screenshot callouts: Kernel Profile Comparisons with Baseline; Metric Data; Source Correlation]
EXECUTION MODEL
CUDA BASICS
Blocks of threads, warps
Single Instruction Multiple Threads (SIMT) model
CUDA hierarchy: Grid -> Blocks -> Warps -> Threads
One warp = 32 threads.
Why does it matter? Many optimizations are based on behavior at the warp level
CUDA BASICS
Mapping threads
Thread blocks can be 1D, 2D, 3D. Only for convenience: hardware “looks” at threads in 1D
Consecutive 32 threads belong to the same warp
Example, 80 threads: 40 threads in X, 2 rows of threads in Y
3 warps (96 threads), 16 inactive threads in 3rd warp
[Figure: 40 x 2 thread block. The first row holds warp 1 (32 threads) and the start of warp 2; the second row holds the rest of warp 2 and warp 3]
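The mapping in this example can be reproduced with a few lines of host code. This is a sketch, not CUDA device code: `Dim3` is a stand-in for CUDA's `dim3`, the helper names are invented for the example, and warps are numbered from 0 here while the slide counts from 1.

```cpp
constexpr int kWarpSize = 32;

// Host-only stand-in for CUDA's dim3 / threadIdx (assumption for this sketch).
struct Dim3 { int x, y, z; };

// Threads are linearized with x varying fastest, then y, then z.
constexpr int linearThreadId(Dim3 tid, Dim3 block) {
    return tid.x + tid.y * block.x + tid.z * block.x * block.y;
}

// Consecutive groups of 32 linear thread ids form one warp (0-based index).
constexpr int warpOf(Dim3 tid, Dim3 block) {
    return linearThreadId(tid, block) / kWarpSize;
}

// Number of warps needed for a block, rounding up to a whole warp.
constexpr int warpsPerBlock(Dim3 block) {
    return (block.x * block.y * block.z + kWarpSize - 1) / kWarpSize;
}

// Lanes in the last warp that have no thread assigned.
constexpr int inactiveLanes(Dim3 block) {
    return warpsPerBlock(block) * kWarpSize - block.x * block.y * block.z;
}

// The slide's example: 40 threads in X, 2 rows in Y.
static_assert(warpsPerBlock(Dim3{40, 2, 1}) == 3, "80 threads need 3 warps");
static_assert(inactiveLanes(Dim3{40, 2, 1}) == 16, "16 lanes are inactive in the 3rd warp");
```

Note that a row of 40 threads does not align with warp boundaries, so the second warp straddles both rows of the block.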
CUDA BASICS
Control Flow
Different warps can execute different code
No impact on performance
Each warp maintains its own Program Counter
Different code path inside the same warp?
Threads that don’t participate are masked out, but the whole warp executes both sides of the branch
CONTROL FLOW
[Figure: 40 x 2 thread block (threadIdx.x 0..39, threadIdx.y 0..1) split into warps 1, 2, 3]
A;
if (threadIdx.y == 0)
    B;
else
    C;
D;
Instructions over time, per warp (threads 0 … 31):
Warp 1: A B D
Warp 2: A B C D
Warp 3: A C D
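The per-warp instruction sequences above can be derived mechanically from the thread mapping. The following host-side sketch (the function name `warpTrace` is hypothetical, and warps are indexed from 0) walks the lanes of one warp of a 2D block and records which sides of the `threadIdx.y == 0` branch are taken by at least one active thread.

```cpp
#include <string>

constexpr int kWarpSize = 32;

// Instruction sequence issued by one warp for:
//   A; if (threadIdx.y == 0) B; else C; D;
// in a blockDimX x blockDimY thread block. Warp index is 0-based.
std::string warpTrace(int warp, int blockDimX, int blockDimY) {
    const int totalThreads = blockDimX * blockDimY;
    bool anyTakesThen = false;
    bool anyTakesElse = false;
    for (int lane = 0; lane < kWarpSize; ++lane) {
        const int tid = warp * kWarpSize + lane;  // linear thread index
        if (tid >= totalThreads) break;           // inactive lanes of the last warp
        const int y = tid / blockDimX;            // recover threadIdx.y
        if (y == 0) {
            anyTakesThen = true;
        } else {
            anyTakesElse = true;
        }
    }
    std::string trace = "A";
    if (anyTakesThen) trace += "B";  // executed if any lane has threadIdx.y == 0
    if (anyTakesElse) trace += "C";  // executed if any lane has threadIdx.y != 0
    return trace + "D";
}
```

For the 40 x 2 block, `warpTrace(0, 40, 2)` gives "ABD", `warpTrace(1, 40, 2)` gives "ABCD" and `warpTrace(2, 40, 2)` gives "ACD", so only the middle warp pays for both sides of the branch.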
CONTROL FLOW
Takeaways
Minimize thread divergence inside a warp
Divergence between warps is fine
Maximize “useful” cycles for each warp
THREADS ARE THREADS
New in Volta
Program counter:
Before Volta: Per warp
Volta: Per thread
Volta guarantees Forward Progress for diverged threads in a warp
Allows exchanging data between diverged threads in a warp, e.g. mutexes among warp threads.
Allows writing natural code that would have deadlocked before
THREADS ARE THREADS
Example
lock = 0;
while (lock == 0)
    lock = tryGetLock();
doSomething;
releaseLock();
These device functions could be implemented with atomics, or volatile pointers
Pre-Volta: The code might deadlock in the loop, if the thread that gets the lock cannot make forward progress and release the lock
THREADS ARE THREADS
Thread re-convergence
Don’t assume the threads in a warp are re-converged or executing in lock-step mode.
Use __syncwarp() to synchronize the threads in a warp.
The legacy shuffle and warp vote functions are deprecated.
Use the new equivalent “_sync” functions.
The extra parameter tells the compiler/hardware which threads are expected to participate, because they might not all reach the call at the same time.
E.g.: __shfl_up(value, 1) becomes __shfl_up_sync(0xffffffff, value, 1)
Full efficiency only when all the 32 threads of a warp are converged!
THREADS ARE THREADS
How to deal with warp-synchronous code?
Update/fix the code!
Use Cooperative Groups (GTC 2017 talk s7622)
Compile for an older architecture (disable forward progress):
-arch=compute_60 -code=sm_70 (binary)
-arch=compute_60 (PTX JIT)
MEMORY SUBSYSTEM
VOLTA MEMORY SUBSYSTEM
Tesla V100
80 Streaming Multiprocessors
256KB register file per SM (20 MB total)
Unified Shared Mem / L1 Cache
128KB per SM, variable split (10MB Total, 14 TB/s), Volta caches L1 writes
6 MB L2 Cache, L2 is write back
16/32 GB HBM2 (900 GB/s)
[Figure: memory hierarchy. Each SM has Registers and L1 / SMEM; the SMs share the L2, which connects to DRAM, PCIe and NVLINK]
TURING MEMORY SUBSYSTEM
Quadro RTX 8000
72 Streaming Multiprocessors
256KB register file per SM (18.5 MB total)
Unified Shared Mem / L1 Cache
96KB per SM, variable split (7MB Total, 8 TB/s), Turing caches L1 writes
6 MB L2 Cache, L2 is write back
24 GB GDDR6 (672 GB/s)
[Figure: memory hierarchy. Each SM has Registers and L1 / SMEM; the SMs share the L2, which connects to DRAM, PCIe and NVLINK]
L1, L2 CACHES
Why do GPUs have caches?
In general, not for temporal locality
100s ~ 1000s of threads running per SM, tens of thousands of threads sharing the L2 cache
L1, L2 are small per thread
For example, at 2048 threads/SM, with 80 SMs: 64 bytes L1, 38 Bytes L2 per thread
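The per-thread figures in this example follow from the V100 numbers given earlier in the deck. The sketch below assumes the full 128 KB unified L1 / shared memory of an SM is counted as L1 and that the caches are shared evenly among resident threads; the helper name `bytesPerThread` is made up for the illustration.

```cpp
// Cache bytes available per resident thread, assuming an even share.
constexpr double bytesPerThread(double cacheBytes, double threadsSharing) {
    return cacheBytes / threadsSharing;
}

constexpr double kKiB = 1024.0;
constexpr double kMiB = 1024.0 * 1024.0;
constexpr int kThreadsPerSm = 2048;  // from the slide
constexpr int kNumSms = 80;          // V100, from the slide

// L1 is private to an SM: 128 KB shared by 2048 threads.
constexpr double kL1PerThread = bytesPerThread(128 * kKiB, kThreadsPerSm);

// L2 is shared by every SM: 6 MB shared by 2048 * 80 threads (about 38.4 bytes).
constexpr double kL2PerThread = bytesPerThread(6 * kMiB, kThreadsPerSm * kNumSms);

static_assert(kL1PerThread == 64.0, "64 bytes of L1 per thread");
static_assert(kL2PerThread > 38.0 && kL2PerThread < 39.0, "about 38 bytes of L2 per thread");
```

With only a few dozen bytes per thread, a line is unlikely to survive in cache until the same thread touches it again, which is why the caches mostly help with access patterns across threads.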
L1, L2 CACHES
Cache Lines & Sectors
Memory access granularity = 32 Bytes = 1 sector
An L1/L2 cache line is 128 Bytes, made of 4 sectors.
Cache “management” granularity = 1 cache line
[Figure: 128-Byte cache line, 128-Byte aligned, made of Sector 0, Sector 1, Sector 2, Sector 3]
ACCESS PATTERNS
Warps and Sectors
For each warp: How many sectors needed?
Depends on addresses, active threads, access size.
Natural element sizes = 1B, 2B, 4B, 8B, 16B.
[Figure: warp threads 0..31 mapped onto memory addresses 0 to 352, marked in 32-byte steps]
4-Byte element access: 4 sectors
ACCESS PATTERNS
Warps and Sectors
[Figure: warp threads 0..31 mapped onto memory addresses 0 to 352, marked in 32-byte steps]
4-Byte access, unaligned: 5 sectors
128 bytes requested, 160 bytes read (80% efficiency)
ACCESS PATTERNS
Warps and Sectors
[Figure: warp threads 0..31 and the next warp mapped onto memory addresses 0 to 352, marked in 32-byte steps]
4-Byte access, unaligned: 5 sectors
With >1 warp per block, this sector might be found in L1 or L2
ACCESS PATTERNS
Warps and Sectors
[Figure: warp threads 0..31 all mapped onto one memory address; addresses 0 to 352 marked in 32-byte steps]
Same address: 1 sector
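The sector counts for the three access patterns above (aligned, unaligned, same address) can be reproduced with a small host-side model. This is a sketch under simplifying assumptions: all 32 lanes are active, lane i accesses `base + i * strideBytes`, and the function names and the example base addresses (128 and 132) are chosen for illustration only.

```cpp
#include <cstdint>
#include <set>

constexpr std::uint64_t kSectorBytes = 32;  // memory access granularity
constexpr int kWarpSize = 32;

// Number of distinct 32-byte sectors touched when every lane of a warp
// accesses elemBytes bytes at address base + lane * strideBytes.
int sectorsTouched(std::uint64_t base, std::uint64_t strideBytes, std::uint64_t elemBytes) {
    std::set<std::uint64_t> sectors;
    for (int lane = 0; lane < kWarpSize; ++lane) {
        const std::uint64_t first = base + lane * strideBytes;
        const std::uint64_t last = first + elemBytes - 1;
        for (std::uint64_t s = first / kSectorBytes; s <= last / kSectorBytes; ++s) {
            sectors.insert(s);
        }
    }
    return static_cast<int>(sectors.size());
}

// Fraction of the bytes read that the warp actually requested.
double accessEfficiency(double bytesRequested, int sectors) {
    return bytesRequested / (sectors * static_cast<double>(kSectorBytes));
}
```

In this model, `sectorsTouched(128, 4, 4)` returns 4 for the aligned 4-byte case, `sectorsTouched(132, 4, 4)` returns 5 for the unaligned case (128 bytes requested, 160 read, so `accessEfficiency(128, 5)` is 0.8), and `sectorsTouched(128, 0, 4)` returns 1 when every lane reads the same address.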
L1, L2 CACHES
Why do GPUs have caches?
Caches on GPUs can help with:
“Smoothing” irregular, unaligned access patterns
Caching common data accessed by many threads
Faster register spills, local memory
Can help in codes that don’t use shared memory
SHARED MEMORY
Scratch-pad memory on each SM
User-managed cache, hardware does not evict data
Data written to SMEM stays there until the code overwrites it or the threadblock finishes execution
Useful for:
Storing frequently-accessed data, to reduce DRAM accesses
Communication among threads of a threadblock
Performance benefits compared to DRAM:
20-40x lower latency
~15x higher bandwidth
UNIFIED SHARED MEM / L1 CACHE
Variable split
How to specify the L1 / Smem split:
cudaFuncSetAttribute(MyKernel, cudaFuncAttributePreferredSharedMemoryCarveout, carveout);
The driver usually does a pretty good job at choosing the right split.
To overcome the 48 KB per threadblock limitation call:
cudaFuncSetAttribute(MyKernel, cudaFuncAttributeMaxDynamicSharedMemorySize, maxsize);
Volta: 6 possible smem / L1 splits
96KB / 32KB
64KB / 64KB
32KB / 96KB
16KB / 112KB
8KB / 120KB
0KB / 128KB
Turing: 2 possible smem / L1 splits
64KB / 32KB
32KB / 64KB
[Figure: SM with Registers and unified L1 / SMEM]
https://developer.nvidia.com/computeworks
http://on-demand.gputechconf.com