CUDA C/C++ BASICS (cont.)
NVIDIA Corporation
© NVIDIA 2013
COOPERATING THREADS
CONCEPTS
Heterogeneous Computing
Blocks
Threads
Indexing
Shared memory
__syncthreads()
Asynchronous operation
Handling errors
Managing devices
1D Stencil
• Consider applying a 1D stencil to a 1D array of elements
  – Each output element is the sum of input elements within a radius
• If radius is 3, then each output element is the sum of 7 input elements:
[Figure: an output element with radius input elements on each side]
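For reference, the stencil can be written as a plain host-side loop. This is a minimal sketch and not part of the original slides: the function name `stencil_1d_reference` is hypothetical, and treating neighbours outside the array as zero is an assumption, since the slides do not say how the array ends are handled.

```cuda
#include <vector>

// Host-side reference: out[i] is the sum of in[i - radius] .. in[i + radius].
// Assumption: neighbours that fall outside the array contribute zero.
std::vector<int> stencil_1d_reference(const std::vector<int> &in, int radius) {
    const int n = static_cast<int>(in.size());
    std::vector<int> out(n, 0);
    for (int i = 0; i < n; ++i) {
        for (int offset = -radius; offset <= radius; ++offset) {
            const int j = i + offset;
            if (j >= 0 && j < n)
                out[i] += in[j];
        }
    }
    return out;
}
```

With an input of all ones and radius 3, every interior output is 7, which makes the function a convenient check for a GPU version.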
Implementing Within a Block
• Each thread processes one output element
  – blockDim.x elements per block
• Input elements are read several times
  – With radius 3, each input element is read seven times
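The repeated reads are easiest to see in a version that does not cache anything. The kernel below is a sketch under stated assumptions, not code from the slides: the name `stencil_1d_naive` is hypothetical, `RADIUS` is fixed at 3, `in` is assumed to point `RADIUS` elements into a padded buffer, and the launch is assumed to create exactly one thread per output element.

```cuda
#define RADIUS 3

// Naive version: every thread reads its 2 * RADIUS + 1 inputs straight from
// global memory, so neighbouring threads fetch the same elements repeatedly.
// Assumption: `in` points RADIUS elements into a padded buffer, so indices
// gindex - RADIUS .. gindex + RADIUS are all valid.
__global__ void stencil_1d_naive(const int *in, int *out) {
    int gindex = threadIdx.x + blockIdx.x * blockDim.x;
    int result = 0;
    for (int offset = -RADIUS; offset <= RADIUS; offset++)
        result += in[gindex + offset];
    out[gindex] = result;
}
```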
Sharing Data Between Threads
• Terminology: within a block, threads share data via shared memory
• Extremely fast on-chip memory, user-managed
• Declare using __shared__, allocated per block
• Data is not visible to threads in other blocks
Implementing With Shared Memory
• Cache data in shared memory
  – Read (blockDim.x + 2 * radius) input elements from global memory to shared memory
  – Compute blockDim.x output elements
  – Write blockDim.x output elements to global memory
  – Each block needs a halo of radius elements at each boundary
[Figure: halo on left | blockDim.x output elements (16 in the example) | halo on right]
Stencil Kernel
__global__ void stencil_1d(int *in, int *out) {
    __shared__ int temp[BLOCK_SIZE + 2 * RADIUS];
    int gindex = threadIdx.x + blockIdx.x * blockDim.x;
    int lindex = threadIdx.x + RADIUS;

    // Read input elements into shared memory
    temp[lindex] = in[gindex];
    if (threadIdx.x < RADIUS) {
        temp[lindex - RADIUS] = in[gindex - RADIUS];
        temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE];
    }

    // Apply the stencil
    int result = 0;
    for (int offset = -RADIUS; offset <= RADIUS; offset++)
        result += temp[lindex + offset];

    // Store the result
    out[gindex] = result;
}
Data Race!
• The stencil example will not work…
• Suppose thread 15 reads the halo before thread 0 has fetched it… (BLOCK_SIZE is 16 and RADIUS is 3 here)
temp[lindex] = in[gindex];                   // thread 15: store at temp[18]
if (threadIdx.x < RADIUS) {                  // thread 15: skipped, threadIdx.x >= RADIUS
    temp[lindex - RADIUS] = in[gindex - RADIUS];
    temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE];
}
int result = 0;
result += temp[lindex + 1];                  // thread 15: load from temp[19], which thread 0 may not have written yet
__syncthreads()
• void __syncthreads();
• Synchronizes all threads within a block
  – Used to prevent RAW / WAR / WAW hazards
• All threads must reach the barrier
  – In conditional code, the condition must be uniform across the block
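The uniformity rule is easiest to see when a bounds check is involved. The kernel below is an illustrative sketch, not code from the slides: the name `adjacent_diff`, the `BLOCK_SIZE` value of 16 and the choice to treat the first thread of each block as having a left neighbour of 0 are all assumptions. The load is guarded by `gindex < n`, but the barrier sits outside the guard so that every thread of the block reaches it.

```cuda
#define BLOCK_SIZE 16

// Each thread subtracts its left neighbour (within the block) from its own value.
// Assumptions: launched with blockDim.x == BLOCK_SIZE; n is the number of valid
// elements and need not be a multiple of BLOCK_SIZE.
__global__ void adjacent_diff(const int *in, int *out, int n) {
    __shared__ int temp[BLOCK_SIZE];
    int gindex = threadIdx.x + blockIdx.x * blockDim.x;

    // The load may be conditional...
    if (gindex < n)
        temp[threadIdx.x] = in[gindex];

    // ...but the barrier must not be: inside the `if` above, some threads of
    // the last block would skip it while others wait.
    __syncthreads();

    if (gindex < n) {
        // Thread 0 has no left neighbour inside this block's tile; use 0.
        int left = (threadIdx.x > 0) ? temp[threadIdx.x - 1] : 0;
        out[gindex] = temp[threadIdx.x] - left;
    }
}
```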
Stencil Kernel
__global__ void stencil_1d(int *in, int *out) {
    __shared__ int temp[BLOCK_SIZE + 2 * RADIUS];
    int gindex = threadIdx.x + blockIdx.x * blockDim.x;
    int lindex = threadIdx.x + RADIUS;

    // Read input elements into shared memory
    temp[lindex] = in[gindex];
    if (threadIdx.x < RADIUS) {
        temp[lindex - RADIUS] = in[gindex - RADIUS];
        temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE];
    }

    // Synchronize (ensure all the data is available)
    __syncthreads();

    // Apply the stencil
    int result = 0;
    for (int offset = -RADIUS; offset <= RADIUS; offset++)
        result += temp[lindex + offset];

    // Store the result
    out[gindex] = result;
}
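The kernel reads `in[gindex - RADIUS]` and `in[gindex + BLOCK_SIZE]`, so the host has to provide halo elements on both sides of the data. The slides do not show the host side; the following is a sketch under stated assumptions: `run_stencil` is a hypothetical wrapper, `BLOCK_SIZE` is 16 and `RADIUS` is 3, `n` is a multiple of `BLOCK_SIZE`, the input is padded with `RADIUS` elements at each end, and error checking is left out for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define RADIUS 3
#define BLOCK_SIZE 16

__global__ void stencil_1d(const int *in, int *out) {
    __shared__ int temp[BLOCK_SIZE + 2 * RADIUS];
    int gindex = threadIdx.x + blockIdx.x * blockDim.x;
    int lindex = threadIdx.x + RADIUS;

    // Read input elements, including both halos, into shared memory
    temp[lindex] = in[gindex];
    if (threadIdx.x < RADIUS) {
        temp[lindex - RADIUS] = in[gindex - RADIUS];
        temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE];
    }
    __syncthreads();

    int result = 0;
    for (int offset = -RADIUS; offset <= RADIUS; offset++)
        result += temp[lindex + offset];
    out[gindex] = result;
}

// Hypothetical host wrapper. Assumptions: n is a multiple of BLOCK_SIZE, and
// padded_in holds n + 2 * RADIUS elements (RADIUS halo elements on each side).
std::vector<int> run_stencil(const std::vector<int> &padded_in, int n) {
    int *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, padded_in.size() * sizeof(int));
    cudaMalloc(&d_out, n * sizeof(int));
    cudaMemcpy(d_in, padded_in.data(), padded_in.size() * sizeof(int),
               cudaMemcpyHostToDevice);

    // Offset the input pointer by RADIUS so that gindex - RADIUS stays in bounds.
    stencil_1d<<<n / BLOCK_SIZE, BLOCK_SIZE>>>(d_in + RADIUS, d_out);

    std::vector<int> out(n);
    cudaMemcpy(out.data(), d_out, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return out;
}
```

Passing a padded input filled with ones should give 7 for every output element, since each sum covers 2 * RADIUS + 1 inputs.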
Review (1 of 2)
• Launching parallel threads
  – Launch N blocks with M threads per block with kernel<<<N,M>>>(…);
  – Use blockIdx.x to access block index within grid
  – Use threadIdx.x to access thread index within block
• Allocate elements to threads:
  int index = threadIdx.x + blockIdx.x * blockDim.x;
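A quick way to check the index formula is to have every thread write its own index. This is a small sketch; the names `fill_index` and `launch_fill_index` are hypothetical and error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Each thread writes its own global index into the output array.
__global__ void fill_index(int *out) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    out[index] = index;
}

// Hypothetical helper: launches N blocks of M threads and returns the N * M values.
std::vector<int> launch_fill_index(int N, int M) {
    int *d_out = nullptr;
    cudaMalloc(&d_out, N * M * sizeof(int));
    fill_index<<<N, M>>>(d_out);
    std::vector<int> out(N * M);
    cudaMemcpy(out.data(), d_out, N * M * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    return out;
}
```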
Review (2 of 2)
• Use __shared__ to declare a variable/array in shared memory
  – Data is shared between threads in a block
  – Not visible to threads in other blocks
• Use __syncthreads() as a barrier
  – Use to prevent data hazards
MANAGING THE DEVICE
CONCEPTS
Heterogeneous Computing
Blocks
Threads
Indexing
Shared memory
__syncthreads()
Asynchronous operation
Handling errors
Managing devices
Coordinating Host & Device
• Kernel launches are asynchronous
  – Control returns to the CPU immediately
• CPU needs to synchronize before consuming the results
  cudaMemcpy()             Blocks the CPU until the copy is complete. Copy begins when all preceding CUDA calls have completed
  cudaMemcpyAsync()        Asynchronous, does not block the CPU
  cudaDeviceSynchronize()  Blocks the CPU until all preceding CUDA calls have completed
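The calls above can be combined as in the following sketch. It is illustrative only: `add_one` and `async_round_trip` are hypothetical names, `n` is assumed to be a multiple of 64, error checking is omitted, and pinned host memory from `cudaMallocHost()` is used because asynchronous copies generally need it to overlap with CPU work.

```cuda
#include <cuda_runtime.h>

__global__ void add_one(int *data) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    data[index] += 1;
}

// Hypothetical demo: fills n values with 0..n-1, adds one to each on the GPU,
// and returns their sum. Assumption: n is a multiple of 64.
int async_round_trip(int n) {
    int *h_data = nullptr, *d_data = nullptr;
    cudaMallocHost(&h_data, n * sizeof(int));   // pinned host memory
    cudaMalloc(&d_data, n * sizeof(int));
    for (int i = 0; i < n; ++i) h_data[i] = i;

    cudaMemcpyAsync(d_data, h_data, n * sizeof(int), cudaMemcpyHostToDevice);
    add_one<<<n / 64, 64>>>(d_data);            // control returns to the CPU immediately
    cudaMemcpyAsync(h_data, d_data, n * sizeof(int), cudaMemcpyDeviceToHost);

    // None of the three operations above is guaranteed to have finished yet.
    cudaDeviceSynchronize();                    // block until all preceding work is done

    int sum = 0;
    for (int i = 0; i < n; ++i) sum += h_data[i];
    cudaFreeHost(h_data);
    cudaFree(d_data);
    return sum;
}
```

Replacing the final `cudaMemcpyAsync()` with `cudaMemcpy()` would make the explicit `cudaDeviceSynchronize()` unnecessary here, since the blocking copy waits for the preceding work itself.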
Reporting Errors
• All CUDA API calls return an error code (cudaError_t)
  – Error in the API call itself
  OR
  – Error in an earlier asynchronous operation (e.g. kernel)
• Get the error code for the last error:
  cudaError_t cudaGetLastError(void)
• Get a string to describe the error:
  const char *cudaGetErrorString(cudaError_t)
  printf("%s\n", cudaGetErrorString(cudaGetLastError()));
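A common way to apply this is to wrap each API call in a checking macro. The slides do not define one; `CUDA_CHECK`, `noop` and `checked_launch` below are hypothetical names, and aborting on the first error is just one possible policy.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper macro: evaluates a CUDA API call, and on failure prints
// the file, line and error string, then exits.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                    cudaGetErrorString(err_));                        \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

__global__ void noop() {}

// Launch a kernel and check for both kinds of error.
void checked_launch() {
    noop<<<1, 1>>>();
    CUDA_CHECK(cudaGetLastError());        // error in the launch itself (e.g. bad configuration)
    CUDA_CHECK(cudaDeviceSynchronize());   // error raised while the kernel was running
}
```

The kernel launch itself returns nothing, which is why the launch is followed by `cudaGetLastError()` rather than being wrapped directly.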
Device Management
• Application can query and select GPUs
  cudaGetDeviceCount(int *count)
  cudaSetDevice(int device)
  cudaGetDevice(int *device)
  cudaGetDeviceProperties(cudaDeviceProp *prop, int device)
• Multiple threads can share a device
• A single thread can manage multiple devices
  cudaSetDevice(i) to select current device
  cudaMemcpy(…) for peer-to-peer copies✝
✝ requires OS and device support
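The query calls listed above might be used as follows. This is a sketch: `list_devices` is a hypothetical helper, and always selecting device 0 is an arbitrary choice made for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Prints every visible device and makes device 0 current.
// Returns the number of devices found (0 if none, or if the query fails).
int list_devices() {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess)
        return 0;

    for (int device = 0; device < count; ++device) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, device);
        printf("Device %d: %s, %zu bytes of global memory\n",
               device, prop.name, prop.totalGlobalMem);
    }
    if (count > 0)
        cudaSetDevice(0);   // later CUDA calls on this host thread target device 0
    return count;
}
```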
Introduction to CUDA C/C++
• What have we learned?
  – Write and launch CUDA C/C++ kernels
    • __global__, blockIdx.x, threadIdx.x, <<<>>>
  – Manage GPU memory
    • cudaMalloc(), cudaMemcpy(), cudaFree()
  – Manage communication and synchronization
    • __shared__, __syncthreads()
    • cudaMemcpy() vs cudaMemcpyAsync(), cudaDeviceSynchronize()
Compute Capability
• The compute capability of a device describes its architecture, e.g.
  – Number of registers
  – Sizes of memories
  – Features & capabilities
• The following presentations concentrate on Fermi devices
  – Compute Capability >= 2.0

| Compute Capability | Selected Features (see CUDA C Programming Guide for complete list) | Tesla models |
|---|---|---|
| 1.0 | Fundamental CUDA support | 870 |
| 1.3 | Double precision, improved memory accesses, atomics | 10-series |
| 2.0 | Caches, fused multiply-add, 3D grids, surfaces, ECC, P2P, concurrent kernels/copies, function pointers, recursion | 20-series |
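The compute capability can be read at run time from the `major` and `minor` fields of `cudaDeviceProp`. A minimal sketch, in which `is_fermi_or_newer` is a hypothetical helper name:

```cuda
#include <cuda_runtime.h>

// Returns true if `device` reports compute capability >= 2.0 (Fermi or newer).
// Returns false if the device cannot be queried.
bool is_fermi_or_newer(int device) {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess)
        return false;
    // prop.major and prop.minor hold the two parts of the version, e.g. 2 and 0.
    return prop.major >= 2;
}
```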
IDs and Dimensions
• A kernel is launched as a grid of blocks of threads
  – blockIdx and threadIdx are 3D
  – We showed only one dimension (x)
• Built-in variables:
  – threadIdx
  – blockIdx
  – blockDim
  – gridDim
[Figure: Device running Grid 1, a 3 × 2 grid of blocks from Block (0,0,0) to Block (2,1,0); Block (1,1,0) is expanded into 5 × 3 threads, from Thread (0,0,0) to Thread (4,2,0)]
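The second dimension works the same way as the first. The sketch below is illustrative rather than taken from the slides: `fill_2d` and `launch_fill_2d` are hypothetical names, the 5 x 3 block shape simply mirrors the thread coordinates listed above, and the bounds check is an addition to handle sizes that are not a multiple of the block shape.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Each thread writes its linear position in a width x height row-major array.
__global__ void fill_2d(int *out, int width, int height) {
    int x = threadIdx.x + blockIdx.x * blockDim.x;
    int y = threadIdx.y + blockIdx.y * blockDim.y;
    if (x < width && y < height)            // guard threads that fall outside the array
        out[y * width + x] = y * width + x;
}

// Hypothetical helper: covers the array with blocks of 5 x 3 threads.
std::vector<int> launch_fill_2d(int width, int height) {
    int *d_out = nullptr;
    cudaMalloc(&d_out, width * height * sizeof(int));

    dim3 block(5, 3);
    dim3 grid((width + block.x - 1) / block.x,      // round up so the grid covers the array
              (height + block.y - 1) / block.y);
    fill_2d<<<grid, block>>>(d_out, width, height);

    std::vector<int> out(width * height);
    cudaMemcpy(out.data(), d_out, width * height * sizeof(int),
               cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    return out;
}
```

Calling `launch_fill_2d(15, 6)` produces a 3 x 2 grid of blocks, the same shape as the grid in the figure.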
Topics we skipped
• We skipped some details, you can learn more:
  – CUDA Programming Guide
  – CUDA Zone – tools, training, webinars and more
    developer.nvidia.com/cuda