Realtime Computer Graphics on GPUs

Posted on 25-Dec-2021


Parallel Algorithms Deep Neural Networks OpenCL Compute Shaders Other APIs

Realtime Computer Graphics on GPUs
GPGPU II

Jan Kolomaznı́k

Department of Software and Computer Science Education
Faculty of Mathematics and Physics

Charles University in Prague


Parallel Algorithms


INTRODUCTION

- Algorithms for massively parallel architectures:
  - Often bottom-up design
  - Shallow data structures
  - Memory access patterns considered first
  - Try to make all operations local only
- Problem reformulation:
  - Search for possible constraints
  - Solve the dual problem
  - Cellular automata
  - ...


BASIC META-ALGORITHMS

- Map:
  - ForEach (in place?)
  - Transform
- Spatial filters with limited support:
  - Convolution
  - Morphological operations
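The two map variants can be sketched sequentially for reference; a GPU version would simply assign one element per thread. The function names and the scaling/squaring operations are illustrative choices, not from the slides:

```cpp
#include <vector>

// ForEach: applies an operation to every element in place.
// On a GPU each element would be handled by its own thread.
void for_each_scale(std::vector<float>& data, float s) {
    for (float& v : data) v *= s;            // in-place update
}

// Transform: reads the input and writes results to a separate output.
// Each output element depends only on one input element.
std::vector<float> transform_square(const std::vector<float>& in) {
    std::vector<float> out(in.size());
    for (std::size_t i = 0; i < in.size(); ++i)
        out[i] = in[i] * in[i];              // independent per element
    return out;
}
```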


REDUCE (FOLD)

- An associative binary operation combines the input elements into a single value:
  - Sum
  - Multiplication
  - Min/Max
- On GPU:
  1. Tree-based approach used within each thread block
  2. Global sync (multiple kernels)
  3. Reduce the block results
- Optimizations:
  - Prevent thread idling
  - Shared memory access patterns
  - Ref: Mark Harris – Optimizing Parallel Reduction in CUDA (30x speedup)
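The two-phase, tree-based scheme can be sketched sequentially on the CPU; the block size and function name below are illustrative, and the inner loop mirrors what each thread block would do in shared memory:

```cpp
#include <vector>
#include <cstddef>

// CPU sketch of the two-phase GPU reduction: each "block" of BLOCK
// elements is reduced with a tree-shaped pairwise pass (stride halves
// each step), then the per-block partial results are reduced again,
// standing in for the second kernel launch after a global sync.
float reduce_sum(std::vector<float> data) {
    const std::size_t BLOCK = 8;  // stands in for the thread-block size
    // pad with the identity element so every block is full
    data.resize(((data.size() + BLOCK - 1) / BLOCK) * BLOCK, 0.0f);
    std::vector<float> blockResults;
    for (std::size_t base = 0; base < data.size(); base += BLOCK) {
        // tree-based reduction within one block
        for (std::size_t stride = BLOCK / 2; stride > 0; stride /= 2)
            for (std::size_t i = 0; i < stride; ++i)
                data[base + i] += data[base + i + stride];
        blockResults.push_back(data[base]);  // "thread 0" writes the block result
    }
    // second phase: reduce the block results
    float total = 0.0f;
    for (float r : blockResults) total += r;
    return total;
}
```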


SCAN (PREFIX-SUM)

- An associative binary operation combines all input elements in front of each processed element
- Inclusive vs. exclusive
- Similar implementation and optimizations as reduction
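The inclusive/exclusive distinction is easiest to see in a sequential sketch: at position i the inclusive scan combines elements 0..i, while the exclusive scan combines 0..i-1, starting from the identity of the operation (0 for addition):

```cpp
#include <vector>
#include <cstddef>

// Inclusive prefix-sum: out[i] = in[0] + ... + in[i].
std::vector<int> inclusive_scan(const std::vector<int>& in) {
    std::vector<int> out(in.size());
    int running = 0;
    for (std::size_t i = 0; i < in.size(); ++i) {
        running += in[i];      // element i is included
        out[i] = running;
    }
    return out;
}

// Exclusive prefix-sum: out[i] = in[0] + ... + in[i-1], out[0] = 0.
std::vector<int> exclusive_scan(const std::vector<int>& in) {
    std::vector<int> out(in.size());
    int running = 0;
    for (std::size_t i = 0; i < in.size(); ++i) {
        out[i] = running;      // element i is excluded
        running += in[i];
    }
    return out;
}
```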


EXAMPLE: SELECT ELEMENTS BY INDICATOR

- Task:
  - Output the elements selected by a predicate
- Naive approach:
  - Add elements to the output array directly
  - Bottleneck – size update
- Better solution:
  - Two-level solution
  - Store the local selection in shared memory
  - Update the global output size once per block
- Advanced solution:
  - Run a parallel prefix-sum on the indicator set (0/1 for each element)
  - Computes indices for all selected elements and the final size
  - Final step – store all selected elements at the precomputed positions
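The advanced, prefix-sum based selection (often called stream compaction) can be sketched sequentially; the "keep even values" predicate is an illustrative choice:

```cpp
#include <vector>
#include <cstddef>

// Prefix-sum based selection: an exclusive scan over the 0/1 indicators
// yields, for every selected element, its position in the output, and
// the total count gives the final output size.
std::vector<int> select_even(const std::vector<int>& in) {
    std::vector<int> indicator(in.size());
    for (std::size_t i = 0; i < in.size(); ++i)
        indicator[i] = (in[i] % 2 == 0) ? 1 : 0;  // predicate: keep even values

    // exclusive prefix-sum of the indicators = output index per element
    std::vector<int> index(in.size());
    int count = 0;
    for (std::size_t i = 0; i < in.size(); ++i) {
        index[i] = count;
        count += indicator[i];
    }

    // scatter: every selected element already knows its target slot,
    // so on a GPU all writes could proceed in parallel without atomics
    std::vector<int> out(count);
    for (std::size_t i = 0; i < in.size(); ++i)
        if (indicator[i]) out[index[i]] = in[i];
    return out;
}
```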


EXAMPLE: INTEGRAL IMAGES

- Apply prefix-sum to rows and columns
- Sum/average queries for rectangular regions with constant complexity
- Used in feature detectors:
  - Haar features
  - Blur filter approximation
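A sequential sketch of the construction and of the constant-time rectangle query; the function names are illustrative:

```cpp
#include <vector>
#include <cstddef>

using Image = std::vector<std::vector<int>>;

// Integral image: prefix-sum each row, then each column, so that
// ii[y][x] holds the sum of all pixels above and to the left (inclusive).
Image integral_image(const Image& img) {
    std::size_t h = img.size(), w = img[0].size();
    Image ii = img;
    for (std::size_t y = 0; y < h; ++y)          // prefix-sum along rows
        for (std::size_t x = 1; x < w; ++x)
            ii[y][x] += ii[y][x - 1];
    for (std::size_t x = 0; x < w; ++x)          // prefix-sum along columns
        for (std::size_t y = 1; y < h; ++y)
            ii[y][x] += ii[y - 1][x];
    return ii;
}

// Sum over the rectangle [x0, x1] x [y0, y1] (inclusive bounds) with
// four lookups, regardless of the rectangle's size.
int rect_sum(const Image& ii, int x0, int y0, int x1, int y1) {
    int sum = ii[y1][x1];
    if (x0 > 0) sum -= ii[y1][x0 - 1];
    if (y0 > 0) sum -= ii[y0 - 1][x1];
    if (x0 > 0 && y0 > 0) sum += ii[y0 - 1][x0 - 1];
    return sum;
}
```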


DATA STRUCTURES

- Basic categories:
  - Dense arrays
  - Sparse structures:
    - Matrices
    - Graphs
  - Hash tables
- Different criteria:
  - Read/write
  - Space waste
  - How many threads are writing
- Two-level design:
  - Local data structure living in shared memory
  - Main data structure living in global memory
  - Write the local instances out at the end of the computation


Deep Neural Networks


DEEP NEURAL NETWORKS

- Another neural network renaissance
- Large neural networks with lots of layers:
  - Convolutional networks
  - Large numbers of identical neurons – highly parallel by nature
- Backpropagation:
  - Millions of parameters
  - Large training set
- Training vs. inference



CUDNN

- GPU-accelerated library of primitives for deep neural networks
- Used to speed up DNN frameworks such as:
  - TensorFlow
  - Caffe
  - PyTorch
  - Keras
  - Matlab
  - ...
- Highly optimized special implementations for selected common cases


TENSORRT

- Also by Nvidia
- Inference optimization:
  - Significantly less computationally demanding than training
  - Deployed on embedded systems – memory constraints
- Change network topology without sacrificing inference precision
- Use lower numerical precision:
  - Less space occupied by weights
  - Usage of tensor cores


TENSOR CORES

- New in the Volta architecture
- Accelerate matrix problems of the form D = A · B + C
- A single core works on 4×4 matrices:
  - fp16 precision for multiplication
  - fp16 or fp32 for accumulation
- Warp matrix functions:
  - Special calls for specific matrix sizes
- Significant speedup of NN training and inference
- Denoising in raytracing APIs
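The primitive a tensor core implements can be sketched as one fused operation on 4×4 matrices; real hardware does the multiplications in fp16 and accumulates in fp16 or fp32, whereas plain float is used here for clarity:

```cpp
#include <array>

using Mat4 = std::array<std::array<float, 4>, 4>;

// One tensor-core-style fused operation: D = A * B + C on 4x4 matrices.
// The accumulator starts at C[i][j] and gathers the dot product of
// row i of A with column j of B.
Mat4 mma4x4(const Mat4& A, const Mat4& B, const Mat4& C) {
    Mat4 D{};
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j) {
            float acc = C[i][j];           // accumulation input
            for (int k = 0; k < 4; ++k)
                acc += A[i][k] * B[k][j];  // multiply-accumulate
            D[i][j] = acc;
        }
    return D;
}
```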


OpenCL


OPENCL

- Alternative to C for CUDA
- Basic idea from C for CUDA, 1:1 equivalence in some parts
- Programming model for executing massively parallel tasks on CPUs, GPUs, Cell, ...
- Language:
  - Originally a subset of C99
  - Subset of C++14 in OpenCL 2.1
  - Just-in-time compilation


HOST CODE

- Code structure similar to shader programming
- API:
  - clBuildProgram()
  - clCreateCommandQueue()
  - clCreateBuffer()
  - clEnqueueWriteBuffer()
  - ...


OPENCL CONCEPTS

OpenCL                  CUDA equivalent
kernel                  kernel
host program            host program
NDRange (index space)   grid
work item               thread
work group              block


OPENCL THREADS

OpenCL                CUDA equivalent
get_global_id(0)      blockIdx.x · blockDim.x + threadIdx.x
get_local_id(0)       threadIdx.x
get_global_size(0)    gridDim.x · blockDim.x
get_local_size(0)     blockDim.x


OPENCL MEMORY

OpenCL            CUDA equivalent
global memory     global memory
constant memory   constant memory
local memory      shared memory
private memory    local memory


SAMPLE OPENCL KERNEL

// Kernel definition
__kernel void VecAdd(
    __global const float *A,
    __global const float *B,
    __global float *C)
{
    int id = get_global_id(0);
    C[id] = A[id] + B[id];
}


SPIR

- Standard Portable Intermediate Representation
- Distribute device-independent pre-compiled binaries
- SPIR-V is incorporated in the core specification of:
  - OpenCL 2.1
  - Vulkan API
  - OpenGL 4.6


Compute Shaders


MOTIVATION

- Why not OpenCL or CUDA?
  - One API for graphics and general-purpose processing
  - Avoid interop
  - Avoid context switches
  - You already know GLSL
- APIs:
  - Core since OpenGL 4.3
  - Part of OpenGL ES 3.1
  - Vulkan


WHERE DOES IT BELONG?


USAGE

- Write the compute shader in GLSL:
  - Define memory resources
  - Write the main() function
- Initialization:
  - Allocate GPU memory (buffers, textures)
  - Compile the shader, link the program
- Run it:
  - Bind buffers, textures, images, uniforms
  - Call glDispatchCompute(...)


Other APIs


DIRECTCOMPUTE

- Part of the Microsoft DirectX collection of APIs
- Needs GPU support for DX10 or DX11


C++ AMP, OPENACC

- C++ Accelerated Massive Parallelism:
  - Open specification from Microsoft
  - Builds on DX11
  - Language extensions, runtime library
  - Heterogeneous computation
- OpenACC:
  - Similar to OpenMP
  - Compiler directives (pragmas) + runtime library


SYCL

- Standard by Khronos
- Cross-platform abstraction layer
- Builds on the underlying concepts, portability, and efficiency of OpenCL
- Single-source style using completely standard C++


SAMPLE SYCL CODE

#include <CL/sycl.hpp>
#include <iostream>

int main() {
    using namespace cl::sycl;
    int data[1024];
    // initialize data to be worked on
    // By including all the SYCL work in a {} block, we ensure
    // all SYCL tasks must complete before exiting the block
    {
        queue myQueue;
        buffer<int, 1> resultBuf(data, range<1>(1024));
        // create a command group to issue commands to the queue
        myQueue.submit([&](handler& cgh) {
            // request access to the buffer
            auto writeResult = resultBuf.get_access<access::write>(cgh);
            // enqueue a parallel_for task
            cgh.parallel_for<class simple_test>(range<1>(1024), [=](id<1> idx) {
                writeResult[idx] = idx[0];
            });
        });
    } // end of scope, so we wait for the queued work to complete
    for (int i = 0; i < 1024; i++)
        std::cout << "data[" << i << "] = " << data[i] << std::endl;
    return 0;
}


Thank you for your attention!
