Parallel Algorithms Deep Neural Networks OpenCL Compute Shaders Other APIs
Realtime Computer Graphics on GPUs
GPGPU II
Jan Kolomazník
Department of Software and Computer Science Education
Faculty of Mathematics and Physics
Charles University in Prague
Parallel Algorithms
INTRODUCTION
- Algorithms for massively parallel architectures:
  - Often bottom-up design
  - Shallow data structures
  - Memory access patterns considered first
  - Try to make all operations local only
- Problem reformulation:
  - Search for possible constraints
  - Solve the dual problem
  - Cellular automata
  - ...
BASIC META-ALGORITHMS
- Map:
  - ForEach (in place?)
  - Transform
- Spatial filters with limited support:
  - Convolution
  - Morphological operations
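
The two meta-algorithms above can be sketched on the CPU as follows. This is a minimal illustration, not code from the lecture: the names `transform_map` and `convolve1d` are hypothetical, the clamped border handling is an assumption, and each iteration of the outer loop stands for the work one GPU thread would do.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Map: every output element depends only on the input element at the same index.
template <typename F>
std::vector<float> transform_map(const std::vector<float>& in, F op) {
    std::vector<float> out(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        out[i] = op(in[i]);  // one independent "thread" per element
    }
    return out;
}

// Spatial filter with limited support: every output element reads a small,
// fixed-size neighbourhood of the input (here a 1D convolution).
// Assumption: indices outside the input are clamped to the nearest valid index.
std::vector<float> convolve1d(const std::vector<float>& in,
                              const std::vector<float>& kernel) {
    assert(kernel.size() % 2 == 1);  // odd kernel size so it has a centre
    const long radius = static_cast<long>(kernel.size() / 2);
    const long n = static_cast<long>(in.size());
    std::vector<float> out(in.size(), 0.0f);
    for (long i = 0; i < n; ++i) {
        float acc = 0.0f;
        for (long k = -radius; k <= radius; ++k) {
            long j = i + k;
            if (j < 0) j = 0;
            if (j >= n) j = n - 1;
            // kernel is indexed in reverse, which makes this a convolution
            acc += in[static_cast<std::size_t>(j)] *
                   kernel[static_cast<std::size_t>(radius - k)];
        }
        out[static_cast<std::size_t>(i)] = acc;
    }
    return out;
}
```

Both functions write each output element exactly once and never read other outputs, which is what makes them straightforward to distribute over threads.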
REDUCE (FOLD)
- An associative binary operation combines the input elements into a single value:
  - Sum
  - Multiplication
  - Min/Max
- On GPU:
  1. Tree-based approach used within each thread block
  2. Global sync (multiple kernels)
  3. Reduce block results
- Optimizations:
  - Prevent thread idling
  - Shared memory access patterns
  - Ref: Mark Harris, Optimizing Parallel Reduction in CUDA (30x speedup)
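
The three GPU steps listed above can be emulated on the CPU as a rough sketch. The function names `reduce_block` and `reduce_sum` are hypothetical, the power-of-two block size is an assumption made to keep the tree simple, and the sequential loops stand in for threads and kernel launches.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Step 1: tree-based reduction inside one "thread block".
// Each pass of the outer loop corresponds to one barrier-separated step;
// the inner loop iterations are what the active threads do in parallel.
float reduce_block(std::vector<float> shared) {
    assert(!shared.empty() && (shared.size() & (shared.size() - 1)) == 0);
    for (std::size_t stride = shared.size() / 2; stride > 0; stride /= 2) {
        for (std::size_t tid = 0; tid < stride; ++tid) {
            shared[tid] += shared[tid + stride];  // sequential addressing
        }
    }
    return shared[0];
}

// Steps 2 and 3: collect one partial result per block (on a GPU this is
// where the first kernel ends), then reduce the partial results.
float reduce_sum(const std::vector<float>& in, std::size_t block_size) {
    assert(block_size > 0 && (block_size & (block_size - 1)) == 0);
    std::vector<float> partials;
    for (std::size_t start = 0; start < in.size(); start += block_size) {
        // Pad the last block with 0, the identity element of addition.
        std::vector<float> shared(block_size, 0.0f);
        for (std::size_t i = 0; i < block_size && start + i < in.size(); ++i) {
            shared[i] = in[start + i];
        }
        partials.push_back(reduce_block(shared));
    }
    float total = 0.0f;
    for (float p : partials) {
        total += p;
    }
    return total;
}
```

The tree only works because the operation is associative: the partial sums are combined in a different order than a left-to-right loop would use.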
SCAN (PREFIX-SUM)
- An associative binary operation combines, for each processed element, all input elements in front of it
- Inclusive (the element itself is included) vs. exclusive (it is not)
- Similar implementation and optimizations as reduction
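
As a small illustration of the inclusive/exclusive distinction, the sketch below computes an inclusive scan in log2(n) passes and derives the exclusive scan from it. The pass structure is the Hillis-Steele scheme, chosen here for brevity; it is not necessarily the variant the lecture has in mind, and the function names are hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Inclusive prefix sum in log2(n) passes (Hillis-Steele scheme).
// Within one pass every element can be updated independently, so the inner
// loop corresponds to parallel threads; the two buffers avoid read/write races.
std::vector<int> inclusive_scan_passes(std::vector<int> data) {
    std::vector<int> next(data.size());
    for (std::size_t offset = 1; offset < data.size(); offset *= 2) {
        for (std::size_t i = 0; i < data.size(); ++i) {
            next[i] = (i >= offset) ? data[i] + data[i - offset] : data[i];
        }
        data.swap(next);
    }
    return data;
}

// Exclusive prefix sum: shift the inclusive result right by one position
// and put the identity element (0 for addition) in front.
std::vector<int> exclusive_from_inclusive(const std::vector<int>& inclusive) {
    std::vector<int> out(inclusive.size(), 0);
    for (std::size_t i = 1; i < inclusive.size(); ++i) {
        out[i] = inclusive[i - 1];
    }
    return out;
}
```

For the input 3, 1, 7, 0, 4, 1, 6, 3 the inclusive result is 3, 4, 11, 11, 15, 16, 22, 25 and the exclusive result is 0, 3, 4, 11, 11, 15, 16, 22.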
EXAMPLE: SELECT ELEMENTS BY INDICATOR
- Task:
  - Output elements selected by a predicate
- Naive approach:
  - Add elements to the output array directly
  - Bottleneck: every write must update the shared output size
- Better solution:
  - Two-level solution
  - Store local selection in shared memory
  - Update global output size once per block
- Advanced solution:
  - Run parallel prefix-sum on the indicator set (0/1 for each element)
  - Computes indices for all selected elements and the final size
  - Final step: store all selected elements at the precomputed positions
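
The advanced solution can be sketched in three phases: build the indicator set, run an exclusive prefix sum over it, then scatter. The code below is a sequential CPU illustration under that reading; `select_by_scan` is a hypothetical name, and on a GPU each loop would be a parallel step (with the scan replaced by a parallel prefix sum).

```cpp
#include <cstddef>
#include <vector>

template <typename Pred>
std::vector<int> select_by_scan(const std::vector<int>& in, Pred pred) {
    const std::size_t n = in.size();

    // Phase 1 (map): indicator value 0/1 for each element.
    std::vector<std::size_t> flag(n);
    for (std::size_t i = 0; i < n; ++i) {
        flag[i] = pred(in[i]) ? 1 : 0;
    }

    // Phase 2 (exclusive prefix sum): index[i] is the output position of
    // element i if it is selected; the running total is the final size.
    std::vector<std::size_t> index(n);
    std::size_t count = 0;
    for (std::size_t i = 0; i < n; ++i) {
        index[i] = count;
        count += flag[i];
    }

    // Phase 3 (scatter): every selected element writes to its own slot,
    // so no two writes collide and no shared counter is needed.
    std::vector<int> out(count);
    for (std::size_t i = 0; i < n; ++i) {
        if (flag[i] != 0) {
            out[index[i]] = in[i];
        }
    }
    return out;
}
```

Unlike the naive approach, the output keeps the input order, because the output position of each element is fully determined by the prefix sum.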
EXAMPLE: INTEGRAL IMAGES
- Apply prefix-sum to rows and columns
- Sum/average queries for rectangular regions with constant complexity
- Used in feature detectors:
  - Haar features
- Blur filter approximation
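
A minimal CPU sketch of the idea follows: one prefix-sum pass over rows, one over columns, and a four-lookup rectangle query. The `IntegralImage` struct, the function names, and the extra zero row and column (used to avoid special cases at the border) are assumptions of this example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct IntegralImage {
    std::size_t width;
    std::size_t height;
    // (height + 1) x (width + 1) table; row 0 and column 0 stay zero.
    std::vector<long long> sums;
};

IntegralImage build_integral(const std::vector<int>& pixels,
                             std::size_t width, std::size_t height) {
    assert(pixels.size() == width * height);
    const std::size_t stride = width + 1;
    IntegralImage ii{width, height,
                     std::vector<long long>(stride * (height + 1), 0)};
    // Pass 1: prefix sum along every row (rows are independent).
    for (std::size_t y = 0; y < height; ++y) {
        long long running = 0;
        for (std::size_t x = 0; x < width; ++x) {
            running += pixels[y * width + x];
            ii.sums[(y + 1) * stride + (x + 1)] = running;
        }
    }
    // Pass 2: prefix sum along every column (columns are independent).
    for (std::size_t x = 1; x <= width; ++x) {
        for (std::size_t y = 2; y <= height; ++y) {
            ii.sums[y * stride + x] += ii.sums[(y - 1) * stride + x];
        }
    }
    return ii;
}

// Sum over the half-open rectangle [x0, x1) x [y0, y1) with four lookups,
// independent of the rectangle size.
long long rect_sum(const IntegralImage& ii, std::size_t x0, std::size_t y0,
                   std::size_t x1, std::size_t y1) {
    assert(x0 <= x1 && x1 <= ii.width && y0 <= y1 && y1 <= ii.height);
    const std::size_t stride = ii.width + 1;
    return ii.sums[y1 * stride + x1] - ii.sums[y0 * stride + x1]
         - ii.sums[y1 * stride + x0] + ii.sums[y0 * stride + x0];
}
```

Dividing the result of `rect_sum` by the rectangle area gives the average, which is the basis of the box-blur approximation.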
DATA STRUCTURES
- Basic categories:
  - Dense arrays
  - Sparse structures
    - Matrices
    - Graphs
  - Hash tables
- Different criteria:
  - Read/write
  - Space waste
  - How many threads are writing
- Two-level design:
  - Local data structure living in shared memory
  - Main data structure living in global memory
  - Write local instances at the end of computation
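
A histogram is one common case where the two-level design applies. The sketch below emulates it on the CPU; the function name `histogram_two_level`, the 8-bit input range, and the binning rule are assumptions of this example, and the per-block vector stands in for shared memory.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

std::vector<unsigned> histogram_two_level(const std::vector<unsigned char>& values,
                                          std::size_t bins,
                                          std::size_t block_size) {
    assert(bins > 0 && bins <= 256 && block_size > 0);
    std::vector<unsigned> global(bins, 0);  // main structure ("global memory")
    for (std::size_t start = 0; start < values.size(); start += block_size) {
        // Local structure ("shared memory"), private to one block.
        std::vector<unsigned> local(bins, 0);
        const std::size_t end = std::min(start + block_size, values.size());
        for (std::size_t i = start; i < end; ++i) {
            ++local[values[i] * bins / 256];  // cheap, block-local update
        }
        // Write the local instance at the end of the block's computation.
        // On a GPU this would be one atomic add per bin instead of one per input.
        for (std::size_t b = 0; b < bins; ++b) {
            global[b] += local[b];
        }
    }
    return global;
}
```

The point of the design is that contention on the global structure scales with the number of blocks and bins rather than with the number of input elements.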
Deep Neural Networks
DEEP NEURAL NETWORKS
- Another neural network renaissance
- Large neural networks with many layers
  - Convolutional networks
  - Large numbers of identical neurons: highly parallel by nature
- Backpropagation
  - Millions of parameters
  - Large training set
- Training vs. inference
EXAMPLE
CUDNN
- GPU-accelerated library of primitives for deep neural networks
- Used to speed up DNN frameworks such as:
  - TensorFlow
  - Caffe
  - PyTorch
  - Keras
  - MATLAB
  - ...
- Special, highly optimized implementations for selected common cases
TENSORRT
- Also by Nvidia
- Inference optimization
  - Significantly less computationally demanding than training
  - Deployed on embedded systems: memory constraints
- Change network topology without sacrificing inference precision
- Use lower numerical precision
  - Less space occupied by weights
  - Usage of tensor cores
TENSOR CORES
- New in Volta architecture
- Accelerate matrix problems of the form D = A * B + C
  - Single core works on 4x4 matrices
  - fp16 precision for multiplication
  - fp16 or fp32 for accumulation
- Warp matrix functions
  - Special calls for specific matrix sizes
- Significant speedup of NN training and inference
- Denoising in raytracing APIs
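
For reference, the operation D = A * B + C on 4x4 matrices can be written out as a plain CPU function. This only illustrates the arithmetic, not how tensor cores are programmed: `Mat4` and `mma4x4` are hypothetical names, and `float` is used throughout as a stand-in because standard C++ before C++23 has no fp16 type.

```cpp
#include <array>
#include <cstddef>

using Mat4 = std::array<std::array<float, 4>, 4>;

// Fused multiply-accumulate on 4x4 matrices: D = A * B + C.
// On the hardware described above, A and B would be fp16 and the
// accumulator (C and D) fp16 or fp32.
Mat4 mma4x4(const Mat4& a, const Mat4& b, const Mat4& c) {
    Mat4 d = c;  // start from the accumulator
    for (std::size_t i = 0; i < 4; ++i) {
        for (std::size_t j = 0; j < 4; ++j) {
            for (std::size_t k = 0; k < 4; ++k) {
                d[i][j] += a[i][k] * b[k][j];
            }
        }
    }
    return d;
}
```

Larger matrix products are built by tiling: each output tile is accumulated over a sequence of such small multiply-accumulate steps.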
OpenCL
OPENCL
- Alternative to C for CUDA
- Basic idea from C for CUDA, 1:1 equivalence in some parts
- Programming model for execution of massively parallel tasks on CPU, GPU, Cell, ...
- Language:
  - Originally subset of C99
  - Subset of C++14 in OpenCL 2.1
  - Just-in-time compilation
HOST CODE
- Code structure similar to shader programming
- API:
  - clBuildProgram()
  - clCreateCommandQueue()
  - clCreateBuffer()
  - clEnqueueWriteBuffer()
  - ...
OPENCL CONCEPTS
| OpenCL | CUDA equivalent |
| --- | --- |
| kernel | kernel |
| host program | host program |
| NDRange (index space) | grid |
| work item | thread |
| work group | block |
OPENCL THREADS
| OpenCL | CUDA equivalent |
| --- | --- |
| get_global_id(0) | blockIdx.x · blockDim.x + threadIdx.x |
| get_local_id(0) | threadIdx.x |
| get_global_size(0) | gridDim.x · blockDim.x |
| get_local_size(0) | blockDim.x |
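
The index mapping in the table can be checked with a small CPU emulation of a one-dimensional launch. This is a sketch only: `global_id` and `vec_add_emulated` are hypothetical names, and the rounded-up group count with an `id < n` guard is a common pattern assumed here, not something the table prescribes.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// What get_global_id(0) returns for a work item, expressed with the
// CUDA-style terms from the table: blockIdx.x * blockDim.x + threadIdx.x.
std::size_t global_id(std::size_t group_id, std::size_t local_size,
                      std::size_t local_id) {
    return group_id * local_size + local_id;
}

// Emulates launching a vector addition over enough work groups to cover n.
std::vector<float> vec_add_emulated(const std::vector<float>& a,
                                    const std::vector<float>& b,
                                    std::size_t local_size) {
    assert(a.size() == b.size() && local_size > 0);
    const std::size_t n = a.size();
    const std::size_t num_groups = (n + local_size - 1) / local_size;  // round up
    std::vector<float> c(n);
    for (std::size_t group = 0; group < num_groups; ++group) {
        for (std::size_t local = 0; local < local_size; ++local) {
            const std::size_t id = global_id(group, local_size, local);
            if (id < n) {  // work items past the end of the data do nothing
                c[id] = a[id] + b[id];
            }
        }
    }
    return c;
}
```

With 5 elements and a local size of 4, two groups are launched and the last three work items of the second group fall outside the data and are skipped by the guard.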
OPENCL MEMORY
| OpenCL | CUDA equivalent |
| --- | --- |
| global memory | global memory |
| constant memory | constant memory |
| local memory | shared memory |
| private memory | local memory |
SAMPLE OPENCL KERNEL

    // Kernel definition
    __kernel void VecAdd(
        __global const float *A,
        __global const float *B,
        __global float *C)
    {
        int id = get_global_id(0);
        C[id] = A[id] + B[id];
    }
SPIR
- Standard Portable Intermediate Representation
- Distribute device-specific pre-compiled binaries
- SPIR-V incorporated in the core specification of:
  - OpenCL 2.1
  - Vulkan API
  - OpenGL 4.6
Compute Shaders
MOTIVATION
- Why not OpenCL or CUDA?
  - One API for graphics and general-purpose processing
  - Avoid interop
  - Avoid context switches
  - You already know GLSL
- APIs:
  - Core since OpenGL 4.3
  - Part of OpenGL ES 3.1
  - Vulkan
WHERE DOES IT BELONG?
USAGE
- Write compute shader in GLSL
  - Define memory resources
  - Write main() function
- Initialization
  - Allocate GPU memory (buffers, textures)
  - Compile shader, link program
- Run it
  - Bind buffers, textures, images, uniforms
  - Call glDispatchCompute(...)
Other APIs
DIRECTCOMPUTE
- Part of Microsoft DirectX collection of APIs
- Needs GPU support for DX10 or DX11
C++ AMP, OPENACC
- C++ Accelerated Massive Parallelism:
  - Open specification from Microsoft
  - Builds on DX11
  - Language extensions, runtime library
  - Heterogeneous computation
- OpenACC:
  - Similar to OpenMP
  - Compiler directives (pragmas) + runtime library
SYCL
- Standard by Khronos
- Cross-platform abstraction layer
- Builds on the underlying concepts, portability and efficiency of OpenCL
- Single-source style using completely standard C++
SAMPLE SYCL CODE

    #include <CL/sycl.hpp>
    #include <iostream>

    int main() {
        using namespace cl::sycl;
        int data[1024]; // initialize data to be worked on
        // By including all the SYCL work in a {} block, we ensure
        // all SYCL tasks must complete before exiting the block
        {
            queue myQueue;
            buffer<int, 1> resultBuf(data, range<1>(1024));
            // create a command group to issue commands to the queue
            myQueue.submit([&](handler& cgh) {
                // request access to the buffer
                auto writeResult = resultBuf.get_access<access::mode::write>(cgh);
                // enqueue a parallel_for task
                cgh.parallel_for<class simple_test>(range<1>(1024), [=](id<1> idx) {
                    writeResult[idx] = idx[0];
                });
            });
        } // end of scope, so we wait for the queued work to complete
        for (int i = 0; i < 1024; i++)
            std::cout << "data[" << i << "] = " << data[i] << std::endl;
        return 0;
    }
Thank you for your attention!