CUDA C/C++ BASICS
NVIDIA Corporation
© NVIDIA 2013
What is CUDA?

• CUDA Architecture
  – Expose GPU parallelism for general-purpose computing
  – Retain performance

• CUDA C/C++
  – Based on industry-standard C/C++
  – Small set of extensions to enable heterogeneous programming
  – Straightforward APIs to manage devices, memory, etc.

• This session introduces CUDA C/C++
Introduction to CUDA C/C++

• What will you learn in this session?
  – Start from "Hello World!"
  – Write and launch CUDA C/C++ kernels
  – Manage GPU memory
  – Manage communication and synchronization
Prerequisites
• You (probably) need experience with C or C++

• You don't need GPU experience

• You don't need parallel programming experience

• You don't need graphics experience
CONCEPTS

• Heterogeneous Computing
• Blocks
• Threads
• Indexing
• Shared memory
• __syncthreads()
• Asynchronous operation
• Handling errors
• Managing devices
HELLO WORLD!
Heterogeneous Computing

• Terminology:
  – Host: the CPU and its memory (host memory)
  – Device: the GPU and its memory (device memory)
Heterogeneous Computing

#include <iostream>
#include <algorithm>

using namespace std;

#define N 1024
#define RADIUS 3
#define BLOCK_SIZE 16

__global__ void stencil_1d(int *in, int *out) {
    __shared__ int temp[BLOCK_SIZE + 2 * RADIUS];
    int gindex = threadIdx.x + blockIdx.x * blockDim.x;
    int lindex = threadIdx.x + RADIUS;

    // Read input elements into shared memory
    temp[lindex] = in[gindex];
    if (threadIdx.x < RADIUS) {
        temp[lindex - RADIUS] = in[gindex - RADIUS];
        temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE];
    }

    // Synchronize (ensure all the data is available)
    __syncthreads();

    // Apply the stencil
    int result = 0;
    for (int offset = -RADIUS; offset <= RADIUS; offset++)
        result += temp[lindex + offset];

    // Store the result
    out[gindex] = result;
}

void fill_ints(int *x, int n) {
    fill_n(x, n, 1);
}

int main(void) {
    int *in, *out;          // host copies of in, out
    int *d_in, *d_out;      // device copies of in, out
    int size = (N + 2*RADIUS) * sizeof(int);

    // Alloc space for host copies and setup values
    in  = (int *)malloc(size); fill_ints(in,  N + 2*RADIUS);
    out = (int *)malloc(size); fill_ints(out, N + 2*RADIUS);

    // Alloc space for device copies
    cudaMalloc((void **)&d_in,  size);
    cudaMalloc((void **)&d_out, size);

    // Copy to device
    cudaMemcpy(d_in,  in,  size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_out, out, size, cudaMemcpyHostToDevice);

    // Launch stencil_1d() kernel on GPU
    stencil_1d<<<N/BLOCK_SIZE,BLOCK_SIZE>>>(d_in + RADIUS, d_out + RADIUS);

    // Copy result back to host
    cudaMemcpy(out, d_out, size, cudaMemcpyDeviceToHost);

    // Cleanup
    free(in); free(out);
    cudaFree(d_in); cudaFree(d_out);
    return 0;
}
[Figure annotations: the host code in main() runs serially, while the stencil_1d() kernel is the parallel device function; serial code surrounds the parallel launch]
Simple Processing Flow
1. Copy input data from CPU memory to GPU memory
[Figure: CPU (host) memory and GPU (device) memory connected by the PCI bus]
Simple Processing Flow
1. Copy input data from CPU memory to GPU memory
2. Load GPU program and execute, caching data on chip for performance
Simple Processing Flow
1. Copy input data from CPU memory to GPU memory
2. Load GPU program and execute, caching data on chip for performance
3. Copy results from GPU memory to CPU memory
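A minimal sketch of these three steps as CUDA calls (the kernel name my_kernel, the buffers h_in/h_out/d_in/d_out, and the launch configuration are illustrative assumptions, not from the slides):

    // 1. Copy input data from CPU memory to GPU memory
    cudaMemcpy(d_in, h_in, size, cudaMemcpyHostToDevice);

    // 2. Load GPU program and execute, caching data on chip for performance
    my_kernel<<<blocks, threads>>>(d_in, d_out);

    // 3. Copy results from GPU memory to CPU memory
    cudaMemcpy(h_out, d_out, size, cudaMemcpyDeviceToHost);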
Hello World!

#include <stdio.h>

int main(void) {
    printf("Hello World!\n");
    return 0;
}
" Standard C that runs on the host
" NVIDIA compiler (nvcc) can be used to compile programs with no device code
Output:
$ nvcc hello_world.cu
$ a.out
Hello World!
$
Hello World! with Device Code

#include <stdio.h>

__global__ void mykernel(void) {
}

int main(void) {
    mykernel<<<1,1>>>();
    printf("Hello World!\n");
    return 0;
}
• Two new syntactic elements…
Hello World! with Device Code

__global__ void mykernel(void) {
}

• CUDA C/C++ keyword __global__ indicates a function that:
  – Runs on the device
  – Is called from host code

• nvcc separates source code into host and device components
  – Device functions (e.g. mykernel()) processed by NVIDIA compiler
  – Host functions (e.g. main()) processed by standard host compiler
    • gcc, cl.exe
Hello World! with Device Code
mykernel<<<1,1>>>();
• Triple angle brackets mark a call from host code to device code
  – Also called a "kernel launch"
  – We'll return to the parameters (1,1) in a moment

• That's all that is required to execute a function on the GPU!
Hello World! with Device Code

#include <stdio.h>

__global__ void mykernel(void) {
}

int main(void) {
    mykernel<<<1,1>>>();
    printf("Hello World!\n");
    return 0;
}
• mykernel() does nothing, somewhat anticlimactic!
Output:
$ nvcc hello.cu
$ a.out
Hello World!
$
Parallel Programming in CUDA C/C++
• But wait… GPU computing is about massive parallelism!
• We need a more interesting example…
• We’ll start by adding two integers and build up to vector addition
Addition on the Device

• A simple kernel to add two integers

__global__ void add(int *a, int *b, int *c) {
    *c = *a + *b;
}

• As before, __global__ is a CUDA C/C++ keyword meaning
  – add() will execute on the device
  – add() will be called from the host
Addition on the Device

• Note that we use pointers for the variables

__global__ void add(int *a, int *b, int *c) {
    *c = *a + *b;
}

• add() runs on the device, so a, b and c must point to device memory

• We need to allocate memory on the GPU
Memory Management

• Host and device memory are separate entities
  – Device pointers point to GPU memory
    May be passed to/from host code
    May not be dereferenced in host code
  – Host pointers point to CPU memory
    May be passed to/from device code
    May not be dereferenced in device code

• Simple CUDA API for handling device memory
  – cudaMalloc(), cudaFree(), cudaMemcpy()
  – Similar to the C equivalents malloc(), free(), memcpy() (see the sketch below)
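A minimal sketch (not part of the original slides) showing how the CUDA calls line up with their C counterparts; the buffer names h_buf/d_buf and the size are illustrative:

    int size = 16 * sizeof(int);

    int *h_buf = (int *)malloc(size);        // host memory: malloc()/free()
    int *d_buf;
    cudaMalloc((void **)&d_buf, size);       // device memory: cudaMalloc()/cudaFree()

    h_buf[0] = 42;                           // fine: host pointer dereferenced in host code
    // d_buf[0] = 42;                        // invalid: device pointers may not be dereferenced in host code

    cudaMemcpy(d_buf, h_buf, size, cudaMemcpyHostToDevice);   // host -> device
    cudaMemcpy(h_buf, d_buf, size, cudaMemcpyDeviceToHost);   // device -> host

    free(h_buf);
    cudaFree(d_buf);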
Addition on the Device: add()

• Returning to our add() kernel

__global__ void add(int *a, int *b, int *c) {
    *c = *a + *b;
}

• Let's take a look at main()…
Addition on the Device: main()

int main(void) {
    int a, b, c;              // host copies of a, b, c
    int *d_a, *d_b, *d_c;     // device copies of a, b, c
    int size = sizeof(int);

    // Allocate space for device copies of a, b, c
    cudaMalloc((void **)&d_a, size);
    cudaMalloc((void **)&d_b, size);
    cudaMalloc((void **)&d_c, size);

    // Setup input values
    a = 2;
    b = 7;
Addition on the Device: main()

    // Copy inputs to device
    cudaMemcpy(d_a, &a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, &b, size, cudaMemcpyHostToDevice);

    // Launch add() kernel on GPU
    add<<<1,1>>>(d_a, d_b, d_c);

    // Copy result back to host
    cudaMemcpy(&c, d_c, size, cudaMemcpyDeviceToHost);

    // Cleanup
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}
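Not part of the original slides, but to verify the result one could print c just before the cleanup (with <stdio.h> included):

    printf("%d + %d = %d\n", a, b, c);    // expected output: 2 + 7 = 9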
RUNNING IN PARALLEL
Moving to Parallel

• GPU computing is about massive parallelism
  – So how do we run code in parallel on the device?

    add<<< 1, 1 >>>();

    add<<< N, 1 >>>();

• Instead of executing add() once, execute N times in parallel
Vector Addition on the Device

• With add() running in parallel we can do vector addition

• Terminology: each parallel invocation of add() is referred to as a block
  – The set of blocks is referred to as a grid
  – Each invocation can refer to its block index using blockIdx.x

__global__ void add(int *a, int *b, int *c) {
    c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];
}

• By using blockIdx.x to index into the array, each block handles a different index
Vector Addition on the Device

__global__ void add(int *a, int *b, int *c) {
    c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];
}

• On the device, each block can execute in parallel:

    Block 0: c[0] = a[0] + b[0];
    Block 1: c[1] = a[1] + b[1];
    Block 2: c[2] = a[2] + b[2];
    Block 3: c[3] = a[3] + b[3];
Vector Addition on the Device: add()

• Returning to our parallelized add() kernel

__global__ void add(int *a, int *b, int *c) {
    c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];
}

• Let's take a look at main()…
Vector Addition on the Device: main()

#define N 512

int main(void) {
    int *a, *b, *c;           // host copies of a, b, c
    int *d_a, *d_b, *d_c;     // device copies of a, b, c
    int size = N * sizeof(int);

    // Alloc space for device copies of a, b, c
    cudaMalloc((void **)&d_a, size);
    cudaMalloc((void **)&d_b, size);
    cudaMalloc((void **)&d_c, size);

    // Alloc space for host copies of a, b, c and setup input values
    a = (int *)malloc(size); random_ints(a, N);
    b = (int *)malloc(size); random_ints(b, N);
    c = (int *)malloc(size);
Vector Addition on the Device: main()

    // Copy inputs to device
    cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);

    // Launch add() kernel on GPU with N blocks
    add<<<N,1>>>(d_a, d_b, d_c);

    // Copy result back to host
    cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);

    // Cleanup
    free(a); free(b); free(c);
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}
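random_ints() is called above but never shown in the deck; a minimal sketch of a compatible helper (an assumption, not NVIDIA's code) might be:

    #include <stdlib.h>

    // Hypothetical helper: fill x[0..n-1] with small random integers.
    void random_ints(int *x, int n) {
        for (int i = 0; i < n; i++)
            x[i] = rand() % 100;
    }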
Review (1 of 2)

• Difference between host and device
  – Host: CPU
  – Device: GPU

• Using __global__ to declare a function as device code
  – Executes on the device
  – Called from the host

• Passing parameters from host code to a device function
Review (2 of 2)

• Basic device memory management
  – cudaMalloc()
  – cudaMemcpy()
  – cudaFree()

• Launching parallel kernels
  – Launch N copies of add() with add<<<N,1>>>(…);
  – Use blockIdx.x to access the block index
INTRODUCING THREADS
CUDA Threads

• Terminology: a block can be split into parallel threads

• Let's change add() to use parallel threads instead of parallel blocks

__global__ void add(int *a, int *b, int *c) {
    c[threadIdx.x] = a[threadIdx.x] + b[threadIdx.x];
}

• We use threadIdx.x instead of blockIdx.x

• Need to make one change in main()…
Vector Addition Using Threads: main()

#define N 512

int main(void) {
    int *a, *b, *c;           // host copies of a, b, c
    int *d_a, *d_b, *d_c;     // device copies of a, b, c
    int size = N * sizeof(int);

    // Alloc space for device copies of a, b, c
    cudaMalloc((void **)&d_a, size);
    cudaMalloc((void **)&d_b, size);
    cudaMalloc((void **)&d_c, size);

    // Alloc space for host copies of a, b, c and setup input values
    a = (int *)malloc(size); random_ints(a, N);
    b = (int *)malloc(size); random_ints(b, N);
    c = (int *)malloc(size);
Vector Addition Using Threads: main()

    // Copy inputs to device
    cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);

    // Launch add() kernel on GPU with N threads
    add<<<1,N>>>(d_a, d_b, d_c);

    // Copy result back to host
    cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);

    // Cleanup
    free(a); free(b); free(c);
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}
COMBINING THREADS AND BLOCKS
Combining Blocks and Threads

• We've seen parallel vector addition using:
  – Many blocks with one thread each
  – One block with many threads

• Let's adapt vector addition to use both blocks and threads

• Why? We'll come to that…

• First let's discuss data indexing…
Indexing Arrays with Blocks and Threads

• No longer as simple as using blockIdx.x and threadIdx.x
  – Consider indexing an array with one element per thread (8 threads/block)

[Figure: 32 array elements, one per thread, split across four blocks; threadIdx.x runs 0–7 within each block and blockIdx.x runs 0–3]

• With M threads per block, a unique index for each thread is given by:

    int index = threadIdx.x + blockIdx.x * M;
Indexing Arrays: Example

• Which thread will operate on the red element?

[Figure: 32-element array; the highlighted (red) element is at global index 21, in the block with blockIdx.x = 2, handled by the thread with threadIdx.x = 5; M = 8 threads per block]

    int index = threadIdx.x + blockIdx.x * M;
              = 5 + 2 * 8;
              = 21;
Vector Addition with Blocks and Threads

• Use the built-in variable blockDim.x for threads per block

    int index = threadIdx.x + blockIdx.x * blockDim.x;

• Combined version of add() to use parallel threads and parallel blocks

__global__ void add(int *a, int *b, int *c) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    c[index] = a[index] + b[index];
}

• What changes need to be made in main()?
Addition with Blocks and Threads: main()

#define N (2048*2048)
#define THREADS_PER_BLOCK 512

int main(void) {
    int *a, *b, *c;           // host copies of a, b, c
    int *d_a, *d_b, *d_c;     // device copies of a, b, c
    int size = N * sizeof(int);

    // Alloc space for device copies of a, b, c
    cudaMalloc((void **)&d_a, size);
    cudaMalloc((void **)&d_b, size);
    cudaMalloc((void **)&d_c, size);

    // Alloc space for host copies of a, b, c and setup input values
    a = (int *)malloc(size); random_ints(a, N);
    b = (int *)malloc(size); random_ints(b, N);
    c = (int *)malloc(size);
Addition with Blocks and Threads: main()

    // Copy inputs to device
    cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);

    // Launch add() kernel on GPU
    add<<<N/THREADS_PER_BLOCK,THREADS_PER_BLOCK>>>(d_a, d_b, d_c);

    // Copy result back to host
    cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);

    // Cleanup
    free(a); free(b); free(c);
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}
Handling Arbitrary Vector Sizes

• Typical problems are not friendly multiples of blockDim.x

• Avoid accessing beyond the end of the arrays:

__global__ void add(int *a, int *b, int *c, int n) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    if (index < n)
        c[index] = a[index] + b[index];
}

• Update the kernel launch (a fuller host-side sketch follows below):

    add<<<(N + M-1) / M, M>>>(d_a, d_b, d_c, N);
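A small host-side sketch of that launch when N is not a multiple of the block size M; the specific values of N and M here are illustrative assumptions:

    #define N 1000000                       // deliberately not a multiple of M
    #define M 512                           // threads per block

    int nblocks = (N + M - 1) / M;          // round up: 1954 blocks cover 1000448 threads
    add<<<nblocks, M>>>(d_a, d_b, d_c, N);  // threads with index >= N fail the (index < n) test and do nothing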
Why Bother with Threads?

• Threads seem unnecessary
  – They add a level of complexity
  – What do we gain?

• Unlike parallel blocks, threads have mechanisms to:
  – Communicate
  – Synchronize

• To look closer, we need a new example…
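The new example is the 1D stencil previewed earlier. As an illustrative warm-up (not from the slides), threads in one block can cooperate through __shared__ memory, synchronizing with __syncthreads() before reading values written by other threads:

    // Illustrative only: reverse a 64-element array using one block of 64 threads.
    __global__ void reverse64(int *d) {
        __shared__ int s[64];      // visible to every thread in the block
        int t = threadIdx.x;
        s[t] = d[t];               // each thread stages one element in shared memory
        __syncthreads();           // wait until all threads have written their element
        d[t] = s[63 - t];          // read an element written by a different thread
    }

    // launched as: reverse64<<<1, 64>>>(d_array);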
Review
• Launching parallel kernels
  – Launch N copies of add() with add<<<N/M,M>>>(…);
  – Use blockIdx.x to access the block index
  – Use threadIdx.x to access the thread index within a block

• Allocate elements to threads:

    int index = threadIdx.x + blockIdx.x * blockDim.x;