cuda programming - aalto university wiki
NVIDIA Research
- What is the world going to look like, and what should our hardware look like, in 5–10 years? And how do we get there?
- Engage and participate in the academic community
- ~30 researchers around the globe (4 in Helsinki)
- See http://research.nvidia.com
Today
- A brief history of GPU programming
- The CUDA programming model
- Writing CUDA programs
- Designing parallel algorithms
Motivation for GPU Programming
- This thing packs a lot of oomph
- How do we tap into that?
Early Days: GPGPU (bad!)
- General-Purpose GPU programming: the craze around 2004–2006
- Trick the GPU into general-purpose computing by casting the problem as graphics
  - Turn data into images (textures)
  - Turn algorithms into image synthesis (rendering passes)
- Many attempts to handle this automatically: Brook, Sh, PeakStream, MS Accelerator, …
  - Take a "program" and somehow convert it to shaders
Problems with GPGPU
- Highly constrained memory access model
  - No scatter, no read/write access
- Computation must be split into highly constrained passes
  - Limited by what shaders can do
- Tough learning curve
  - To understand the limitations, you must understand graphics hardware
  - Crazy stunts are needed to circumvent the rigidity of the hardware
- Overhead of the graphics API
GPGPU: An Illustrated Guide
Using graphics API to express programs
Designing GPGPU algorithms
The Road to CUDA
- Okay, this GPGPU thing has potential; the only problem is that it sucks
- Let's design the right tool for the job
- Need new hardware capabilities? Build them. We are a hardware company, after all
- Need a better API for poking the GPU? Ok.
- Don't invent a new language
CUDA Design Goals
- A heterogeneous CPU/GPU computing platform
- Easy to program; also easy to integrate GPU code into existing programs
- Close enough to the hardware to get the best performance, for those who know what they're doing
Some Ingredients of CUDA
- SIMT execution model
  - Single Instruction, Multiple Thread
  - Lets you write scalar code instead of explicit SIMD
- Ways to exploit locality
  - Warp and block execution model
  - Shared memory
- Direct memory access
- C/C++ with minimal extensions
To Whet Your Appetite
Reported speedups from early CUDA applications:

- 146X: Interactive visualization of volumetric white matter connectivity
- 36X: Ionic placement for molecular dynamics simulation on GPU
- 19X: Transcoding HD video stream to H.264
- 17X: Fluid mechanics in Matlab using .mex file CUDA function
- 100X: Astrophysics N-body simulation
- 149X: Financial simulation of LIBOR model with swaptions
- 47X: GLAME@lab, an M-script API for GPU linear algebra
- 20X: Ultrasound medical imaging for cancer diagnostics
- 24X: Highly optimized object-oriented molecular dynamics
- 30X: Cmatch exact string matching to find similar proteins and gene sequences
CUDA Programming Model
Programmer’s View of Hardware
[Figure: the CPU with its CPU memory (DRAM), connected over the PCIE bus to the GPU; inside the GPU, multiple SMs each with an L1 cache share an L2 cache and the GPU memory (DRAM)]
Threads, Warps, Blocks
- A thread in CUDA executes scalar code, very much like a usual CPU program
- The hardware packs threads into warps
  - Crucial for efficient execution
  - The programmer can ignore warps (but you shouldn't)
- Threads are logically grouped into blocks
  - Threads in the same block can communicate and synchronize efficiently
Programmer’s View of SM
[Figure: an SM containing a row of cores and warps 0..n; each warp is a row of threads sharing a single PC]
Programmer’s View of SM
[Figure: the same SM diagram, with one thread's state highlighted as "a CUDA thread"]

Each thread has otherwise independent state, but it shares its PC with the other threads of its warp.
Programmer’s View of SM: Execution

[Figure sequence: the SM diagram, animated over several slides. The scheduler selects a warp; the instruction at the warp's PC (r1 = r2 * r3) is issued to the cores, which read r2 and r3 for each thread of the warp, compute the product, and write the result to r1. The warp's shared PC then advances, and the scheduler picks another warp.]
Programmer’s View of SM: Blocks
[Figure: the SM diagram again, with a group of warps outlined as "a CUDA thread block", attached to a piece of shared memory]

Note: Blocks are formed on the fly from the available warps (they don't need to be consecutive).
[Figure: the same diagram, with a different group of warps outlined as "another CUDA thread block", with its own shared memory allocation]
Implications
- All threads in a warp always execute concurrently
  - Same PC, same instruction
  - You can exploit this if you're careful!
- But warps in a block are scheduled irregularly
  - Hence, threads of a block are not implicitly synchronized
  - But they are always in the same SM, and can synchronize efficiently and communicate through shared memory
- Blocks are instantiated in whichever SMs have space
  - There is no way of knowing which blocks end up in which SMs
  - This is key to good load balancing
Occupancy
- How many warps fit in one SM depends on resource usage
  - Number of registers per thread
  - Amount of shared memory per block
- Block size matters too
  - Work is always launched in full blocks
  - The number of blocks per SM is also limited
- Occupancy = percentage of thread slots used
  - A handy occupancy calculator spreadsheet is available
  - Occupancy directly affects latency-hiding capability
Synchronization
- Threads can specify a synchronization point
  - The __syncthreads() intrinsic
  - This prevents a warp from being scheduled until all warps in the same block have arrived at the sync point
  - A very lightweight mechanism
- Atomic operations can be used to avoid race conditions globally
  - E.g., append to an array with atomicAdd()
- There is implicit synchronization between launches, unless asynchronous operation is explicitly allowed
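As a sketch of how the two mechanisms above fit together (a hypothetical kernel, not from the slides): each block reduces 256 values in shared memory, calling __syncthreads() between phases, and one thread per block then publishes the partial sum globally with atomicAdd().

```cuda
// Sketch: per-block reduction, then atomic accumulation into *total.
// Assumes the kernel is launched with 256 threads per block.
__global__ void blockSum(const float* in, float* total)
{
    __shared__ float partial[256];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    partial[threadIdx.x] = in[i];
    __syncthreads();                      // all loads visible to the block

    // Tree reduction in shared memory; each phase must complete
    // for the whole block before the next one starts.
    for (int stride = 128; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            partial[threadIdx.x] += partial[threadIdx.x + stride];
        __syncthreads();
    }

    if (threadIdx.x == 0)
        atomicAdd(total, partial[0]);     // race-free global accumulation
}
```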
SIMT Execution Model
- How can the threads of a warp diverge if they all have the same PC?
- Partial solution: per-instruction execution predication
- Full solution: a hardware-supported execution mask, execution stack, and related instructions
Example: Instruction Predication

    if (a < 10) small++;
    else        big++;

compiles to:

    ISETP.GT.AND P0, pt, R6, 0x9, pt;   // set predicate register P0 if a > 9
    @!P0 IADD R5, R5, 0x1;              // if P0 is cleared, R5 = R5 + 1
    @P0  IADD R4, R4, 0x1;              // if P0 is set, R4 = R4 + 1
What About Complex Cases?
- Nested if-else blocks, loops, recursion, …
- These need the hardware execution mask and execution stack
Non-Predicated Example

    if (a < 10) foo();
    else        bar();

compiles to:

    /*0048*/ ISETP.GT.AND P0, pt, R4, 0x9, pt;
    /*0050*/ @P0 BRA 0x70;    // jump to the else branch if a > 9
    /*0058*/ ...;             // if branch (foo)
    /*0060*/ ...;
    /*0068*/ BRA 0x80;        // skip over the else branch
    /*0070*/ ...;             // else branch (bar)
    /*0078*/ ...;
    /*0080*/ ...              // continue here after the if-block

- Case 1: All threads take the if branch. No thread wants to jump at 0x50, so the warp falls through the if branch and the branch at 0x68 skips the else branch entirely.
- Case 2: All threads take the else branch. All threads want to jump at 0x50, so the warp branches straight to the else branch.
- Case 3: Some threads take the if branch, some take the else branch. Some threads want to jump, so the hardware pushes the execution state, runs one branch with the other threads masked out, pops, runs the other branch, and restores the active thread mask at 0x80.
Benefits of SIMT
- Supports all structured C++ constructs
  - If/else, switch/case, loops, function calls, exceptions
  - goto is a different beast: supported, but best to avoid
- Multi-level constructs are handled efficiently
  - Break/continue from inside multiple levels of conditionals
  - Function return from inside loops and conditionals
  - Retreating to an exception handler from anywhere
- You only need to care about SIMT when tuning for performance
Some Consequences of SIMT
- An if statement takes the same number of cycles for any number of participating threads > 0
  - If nobody participates, it's cheap
  - Also, masked-out threads don't do memory accesses
- A loop is iterated until all active threads in the warp are done
- A warp stays alive until every thread in it has terminated
  - Terminated threads leave "empty slots" in warps
  - Thread utilization = percentage of active threads
Coherent Execution Is Great
- An if statement is perfectly efficient if either everyone takes it or nobody does
  - All threads stay active
- A loop is perfectly efficient if everyone does the same number of iterations
- Note: traditional SIMD requires these
Incoherent Execution Is Okay
- Conditionals are efficient as long as threads usually agree
- Loops are efficient if threads usually take roughly the same number of iterations
- Much easier to program than explicit SIMD
  - SIMT: incoherence is supported, and performance degrades gracefully if control diverges
  - SIMD: performance is fixed, incoherence is not supported
Striving for Execution Coherence
- Learn to spot low-hanging fruit for improving execution coherence
- Process input in a coherent order
  - E.g., process nearby pixels of an image together
- Fold branches together as much as possible
  - Only put the differing part in a conditional
- Simple low-level fixes
  - Favor [f]min / [f]max over conditionals
  - Bitwise operators sometimes help
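As an illustration of the branch-folding and min/max advice (a hypothetical sketch, not from the slides): the common work, the addition, is hoisted out of the conditional, and the remaining difference is expressed with fminf so that no branch is left to diverge.

```cuda
// Divergent version: when threads of a warp disagree on the condition,
// both paths are executed with parts of the warp masked out.
__device__ float accumulateDivergent(float x, float acc)
{
    if (x > 1.0f)
        acc += 1.0f;
    else
        acc += x;
    return acc;
}

// Folded version: the add is shared, and the difference between the
// branches is just a clamp, which fminf expresses without a branch.
__device__ float accumulateCoherent(float x, float acc)
{
    return acc + fminf(x, 1.0f);
}
```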
Memory in CUDA, part 1
- Global memory
  - Accessible from everywhere, including the CPU (memcpy)
  - Requests go through L1, L2, DRAM
- Shared memory
  - Either 16 or 48 KB per SM in Fermi
  - Pieces are allocated to thread blocks when they are launched
  - Accessible from threads in the same block
  - Requests are served directly; very fast
- Local memory
  - Actually a thread-local portion of global memory
  - Used for register spilling and indexed arrays
Memory in CUDA, part 2
- Textures
  - Data can also be fetched from DRAM through the texture units
  - Separate texture caches
  - High latency, extreme pipelining capability
  - Read-only
- Surfaces
  - Read/write access with pixel format conversions
  - Useful for integrating with graphics
- Constants
  - For coherent and frequent access of the same data
Simplified
- Global memory
  - Almost all data access goes here; you will need this
- Shared memory
  - Use it to share data between threads
- Textures
  - Use them to accelerate data fetching
- Local memory, constants, surfaces
  - Let's ignore these for now; details can be found in the manuals
Memory Access Coherence
- GPU memory buses are wide, both external and internal
- When a warp executes a memory instruction, the addresses matter a lot
  - Accesses that land on the same cache line are served together
  - Different cache lines are served sequentially
- This can have a huge impact on performance
  - It is easy to accidentally burden the memory system
  - Incoherent access also easily overflows the caches
Improving Memory Coherence
- Try to access nearby addresses from nearby threads
- If each thread processes just one element, choose wisely which one
- If each thread processes multiple elements, preferably use striding
Striding Example
- We want each thread to process 10 elements of an array, with 64 threads per block

No striding (bad access pattern; columns are successive time steps):

    Thread 0:    0   1   2   3   4   5   6   7   8   9
    Thread 1:   10  11  12  13  14  15  16  17  18  19
    ...
    Thread 63: 630 631 632 633 634 635 636 637 638 639

With a stride of 64 (optimal access pattern; at each time step the threads touch consecutive addresses):

    Thread 0:    0  64 128 192 256 320 384 448 512 576
    Thread 1:    1  65 129 193 257 321 385 449 513 577
    ...
    Thread 63:  63 127 191 255 319 383 447 511 575 639
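The strided pattern above can be sketched as a kernel (hypothetical, not from the slides): each thread of a 64-thread block touches 10 elements, stepping by blockDim.x so that at every step the block's addresses are consecutive.

```cuda
// Sketch: assumes a 64-thread block processing a 640-element array.
__global__ void scaleStrided(float* data)
{
    for (int k = 0; k < 10; ++k) {
        // Step k touches elements k*64 .. k*64+63: one consecutive
        // run of addresses per step, served by few cache lines.
        int i = k * blockDim.x + threadIdx.x;
        data[i] *= 2.0f;
    }
}
```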
Launching Work in CUDA
- Kernel = a function running on the GPU, written in CUDA C
- A kernel is launched for a grid of blocks
  - The blocks and the grid can be 1D, 2D or 3D
  - The extra dimensions are really just syntactic sugar, but convenient if the data lives in a 2D or 3D domain
- Every thread gets to know its
  - Thread location within the block (threadIdx)
  - Block location within the grid (blockIdx)
  - Block and grid dimensions (blockDim, gridDim)
Example
- Each block has 8×8 threads, so 64 threads per block = 2 warps
- We launch a grid of 10×5 blocks, so 50 blocks in total
- For example, the thread at position (1,1) of block (9,0) sees:

    threadIdx.x = 1    threadIdx.y = 1
    blockIdx.x  = 9    blockIdx.y  = 0
    blockDim.x  = 8    blockDim.y  = 8
    gridDim.x   = 10   gridDim.y   = 5
What’s with the Blocks?
- Why did we have blocks again, instead of just a flat gigantic grid of threads?
  - Because a block can be guaranteed to be localized: launched at the same time, in the same SM
- Threads of a block share the same shared memory
  - Load common data together and work on it
- Threads of a block can synchronize efficiently
  - Synchronization points in code
- Individual blocks must be truly independent
Writing CUDA Programs
Two APIs
- CUDA can be used through two APIs
- Driver API
  - Low-level API
  - GPU code is compiled separately into binaries
  - CPU code manually loads the GPU code and invokes it
- Runtime API (the one covered in this talk)
  - User-friendly high-level API
  - Language extensions for launching kernels
  - The compiler automatically splits the code into GPU and CPU parts, compiles them separately, and links them together
Defining and Launching Kernels
    // Kernel definition
    __global__ void VecAdd(float* A, float* B, float* C)
    {
        int i = threadIdx.x;
        C[i] = A[i] + B[i];
    }

    int main()  // N-length vector add
    {
        ...
        // Kernel invocation with N threads:
        // <<<number of blocks, threads per block>>>
        VecAdd<<<1, N>>>(A, B, C);
    }

In general, the launch parameters (number of blocks and threads per block) are of type dim3.
Defining and Launching Kernels (2D)
    // Kernel definition
    __global__ void MatAdd(float A[N][N], float B[N][N], float C[N][N])
    {
        int i = threadIdx.x;
        int j = threadIdx.y;
        C[i][j] = A[i][j] + B[i][j];
    }

    int main()  // N*N matrix add
    {
        ...
        // Kernel invocation with one block of N * N * 1 threads
        int numBlocks = 1;
        dim3 threadsPerBlock(N, N);
        MatAdd<<<numBlocks, threadsPerBlock>>>(A, B, C);
    }
Extending for Multiple Blocks
    // Kernel definition
    __global__ void MatAdd(float A[N][N], float B[N][N], float C[N][N])
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int j = blockIdx.y * blockDim.y + threadIdx.y;
        if (i < N && j < N)
            C[i][j] = A[i][j] + B[i][j];
    }

    int main()  // N*N matrix add
    {
        ...
        // Kernel invocation
        dim3 threadsPerBlock(16, 16);
        dim3 numBlocks(N / threadsPerBlock.x, N / threadsPerBlock.y);
        MatAdd<<<numBlocks, threadsPerBlock>>>(A, B, C);
    }
Function Type Qualifiers
__global__
- A kernel function
- Executed on the GPU
- Callable from the CPU only (using a <<< >>> launch)

__device__
- A GPU-local function
- Executed on the GPU
- Callable from the GPU only (using a standard function call)

__host__
- The default if nothing else is specified
- A CPU-only function
- Can be combined with __device__ to compile for both
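The qualifiers can be combined; a small hypothetical sketch (names are illustrative, not from the slides):

```cuda
// Compiled for both CPU and GPU: callable from either side.
__host__ __device__ float square(float x) { return x * x; }

// GPU-only helper: callable from kernels and other device functions.
__device__ float plusOne(float x) { return square(x) + 1.0f; }

// Kernel: runs on the GPU, launched from the CPU with <<< >>>.
__global__ void apply(float* data)
{
    data[threadIdx.x] = plusOne(data[threadIdx.x]);
}
```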
Variable Type Qualifiers
__device__
- A "global" variable residing in GPU memory
- Accessible from all threads

__shared__
- Resides in the SM's shared memory space
- Accessible from threads of the same block

    #define BLOCK_SIZE 16

    __global__ void MatMulKernel(Matrix A, Matrix B, Matrix C)
    {
        __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
        __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];
        ...
    }
Using Shared Memory

- Threads in a block may operate on largely the same data
  - Convolution-like operations, matrix multiply, …
- Load the data once into shared memory, then operate on it
  - Share the loading between the threads of the block
- Synchronization is important
  - Call __syncthreads() after reading the data, to ensure that it is valid before starting any computation on it
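The pattern above can be sketched as follows (a hypothetical kernel, not from the slides): each thread loads one element of a tile into shared memory, the block synchronizes, and only then does every thread compute using elements loaded by other threads.

```cuda
#define TILE 16

// Sketch: assumes one TILE x TILE block and a TILE*TILE input array.
__global__ void sumTileRows(const float* in, float* out)
{
    __shared__ float tile[TILE][TILE];

    // Cooperative load: one element per thread.
    tile[threadIdx.y][threadIdx.x] = in[threadIdx.y * TILE + threadIdx.x];

    // Wait until every thread's element is visible to the whole block.
    __syncthreads();

    // Each thread may now safely read elements loaded by its neighbors.
    float sum = 0.0f;
    for (int k = 0; k < TILE; ++k)
        sum += tile[threadIdx.y][k];
    out[threadIdx.y * TILE + threadIdx.x] = sum;
}
```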
Using Global Memory

- GPU memory needs to be allocated: cudaMalloc() and cudaFree()
- Data transfers must be done manually: cudaMemcpy()
    // Device code
    __global__ void VecAdd(float* A, float* B, float* C, int N)
    {
        int i = blockDim.x * blockIdx.x + threadIdx.x;
        if (i < N)
            C[i] = A[i] + B[i];
    }

    // Host code; h_* are host (CPU) memory pointers,
    // d_* are device (GPU) memory pointers
    int VecAddWrapper(float* h_A, float* h_B, float* h_C, int N)
    {
        size_t size = N * sizeof(float);

        // Allocate vectors in device memory
        float* d_A;
        float* d_B;
        float* d_C;
        cudaMalloc(&d_A, size);
        cudaMalloc(&d_B, size);
        cudaMalloc(&d_C, size);

        // Copy vectors from host memory to device memory
        cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
        cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

        // Invoke kernel
        ...
        VecAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);

        // Copy result from device memory to host memory
        cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);

        // Free device memory
        cudaFree(d_A);
        cudaFree(d_B);
        cudaFree(d_C);
    }
Smart Use of Memory

- Avoid moving data around unnecessarily
  - Keep intermediate buffers on the GPU
- Only transfer what you need
  - Don't copy unchanged data again
  - Don't copy unnecessary data
- Concurrent data transfer is possible on the latest devices
  - Needs host memory to be page-locked
  - Needs kernel execution to be non-blocking
  - Needs something useful to be done at the same time
  - Non-trivial, so do this only if you know you need it
! All functions return an error code ! cudaSuccess if the call was successful
! Also possible to check for last error ! cudaGetLastError() and cudaPeekAtLastError()
! Error strings available through the API ! cudaGetErrorString()
! Checking errors of asynchronous operations is a little more complex, refer to manual
Error Checking
![Page 65: CUDA programming - Aalto University Wiki](https://reader031.vdocuments.site/reader031/viewer/2022032504/6234bd9b9d5bcb4690152934/html5/thumbnails/65.jpg)
! CUDA C Programming Guide ! Best starting point ! Describes CUDA programming model, language
extensions, built-in functions and types, etc.
! CUDA Toolkit Reference Manual ! Documentation of host-side CUDA functions
! CUDA C Best Practices Guide ! Information on improving performance and ensuring
compatibility
Manuals and Resources
![Page 66: CUDA programming - Aalto University Wiki](https://reader031.vdocuments.site/reader031/viewer/2022032504/6234bd9b9d5bcb4690152934/html5/thumbnails/66.jpg)
! NVCC compilation chain: CUDA C → PTX → SASS ! CPU code is compiled by MSVC
! PTX is device-independent intermediate assembly ! To export PTX: nvcc -ptx foo.cu ! PTX is not supposed to be optimized, so don't expect it to be
! SASS is device-specific low-level assembly ! Compile: nvcc -cubin -arch=sm_<nn> foo.cu ! Dump: cuobjdump -sass foo.cubin ! SASS instruction sets in cuobjdump manual
Peeking Under the Hood
![Page 67: CUDA programming - Aalto University Wiki](https://reader031.vdocuments.site/reader031/viewer/2022032504/6234bd9b9d5bcb4690152934/html5/thumbnails/67.jpg)
Designing Parallel Algorithms
![Page 68: CUDA programming - Aalto University Wiki](https://reader031.vdocuments.site/reader031/viewer/2022032504/6234bd9b9d5bcb4690152934/html5/thumbnails/68.jpg)
! Designing parallel algorithms is not trivial ! Especially so for thousands of threads ! Cannot rely on fine-grained global synchronization
! Active area of research (partially thanks to GPUs) ! E.g., sorting performance going up all the time
! A highly parallel algorithm may need to do more work than a sequential one ! Almost always a higher number of primitive operations ! Or worse complexity, e.g., O(n log n) instead of O(n) ! The price we have to pay for better performance
A Full Can of Worms
![Page 69: CUDA programming - Aalto University Wiki](https://reader031.vdocuments.site/reader031/viewer/2022032504/6234bd9b9d5bcb4690152934/html5/thumbnails/69.jpg)
! A data-parallel program does some computation for a large number of elements ! All computations must be independent ! There must be enough input to utilize GPU properly
! Natural way to parallelize: One thread per output element ! Convolution, Mandelbrot, etc.: thread = pixel ! Ray tracing: thread = ray
! Boost performance by sharing data if possible
Data-Parallel Is Easy
![Page 70: CUDA programming - Aalto University Wiki](https://reader031.vdocuments.site/reader031/viewer/2022032504/6234bd9b9d5bcb4690152934/html5/thumbnails/70.jpg)
! Many interesting tasks are not data-parallel ! Sorting, compression, variable-length data, etc. ! Even simple stuff like finding the maximum element
! Hierarchical processing is often a good idea
! Split input into a number of chunks ! Need enough chunks to utilize the GPU
! Do something per chunk to reduce problem size ! Then process the remains on CPU or continue on GPU
Everything Else
![Page 71: CUDA programming - Aalto University Wiki](https://reader031.vdocuments.site/reader031/viewer/2022032504/6234bd9b9d5bcb4690152934/html5/thumbnails/71.jpg)
! Let’s say we have 100 million elements in an array
! Split into 100000 chunks with 1000 elements each ! To find best performance, experiment with the numbers
! Process one chunk of 1000 elements per thread ! Find and output the maximum
! Now we have 100000 elements, repeat ! Utilization bad from here forward, but the heavy part was
parallelized successfully ! Or: download the 100K elements to CPU, process there
Example 1: Find Maximum
![Page 72: CUDA programming - Aalto University Wiki](https://reader031.vdocuments.site/reader031/viewer/2022032504/6234bd9b9d5bcb4690152934/html5/thumbnails/72.jpg)
Example 2: Cumulative Sum
! Add to each element of an array the sum of its predecessors
! Can be done using n summations ! Looks like an inherently serial algorithm
Input:  1 4 0 3 4
Output: 1 5 5 8 12
! Each summation requires that the previous one has completed ! Impossible to parallelize?! Not at all
![Page 73: CUDA programming - Aalto University Wiki](https://reader031.vdocuments.site/reader031/viewer/2022032504/6234bd9b9d5bcb4690152934/html5/thumbnails/73.jpg)
Parallel Cumulative Sum
! Let’s suppose we have 2000000 elements and 1000 threads
! Slice the input into 1000 segments (in0 … in999), each 2000 elements long
! Calculate the cumulative sum for each segment in parallel (out0 … out999)
! Take the last element of each output segment → sums (1000 elements)
![Page 74: CUDA programming - Aalto University Wiki](https://reader031.vdocuments.site/reader031/viewer/2022032504/6234bd9b9d5bcb4690152934/html5/thumbnails/74.jpg)
Parallel Cumulative Sum
! Let’s suppose we have 2000000 elements and 1000 threads
! Take the last element of each output segment (out0 … out999) → sums
! Compute the cumulative sum over these → Σsums
! Add the result to the output segments
! … and we're done!
![Page 75: CUDA programming - Aalto University Wiki](https://reader031.vdocuments.site/reader031/viewer/2022032504/6234bd9b9d5bcb4690152934/html5/thumbnails/75.jpg)
Parallel Cumulative Sum
! 1st pass: process each input segment ! Perfectly parallelized, perfectly coherent memory access
! 2nd pass: cumulative sum over last elements ! Parallelizes badly, but very small amount of work
! 3rd pass: add bias to every output element ! Perfectly parallelized, perfectly coherent memory access
! Need two additions per element, but still O(n)
![Page 76: CUDA programming - Aalto University Wiki](https://reader031.vdocuments.site/reader031/viewer/2022032504/6234bd9b9d5bcb4690152934/html5/thumbnails/76.jpg)
Wrapping Up
![Page 77: CUDA programming - Aalto University Wiki](https://reader031.vdocuments.site/reader031/viewer/2022032504/6234bd9b9d5bcb4690152934/html5/thumbnails/77.jpg)
Takeaways
! It’s very easy to get started ! Just C++, some extra work needed for managing memory
! But there’s plenty of room for creativity when striving for performance ! Low-level optimizations, data sharing, algorithmic
improvements, concurrent processing, … ! Profiling tools available
! Scalable code will be fast on future hardware as well ! Basically just more blocks running concurrently
![Page 78: CUDA programming - Aalto University Wiki](https://reader031.vdocuments.site/reader031/viewer/2022032504/6234bd9b9d5bcb4690152934/html5/thumbnails/78.jpg)
Thank You