Advanced Topics on Heterogeneous System Architectures: CUDA
Antonio R. Miele, Marco D. Santambrogio, Politecnico di Milano
12 December 2017


Page 1: CUDA

Advanced Topics on Heterogeneous System Architectures

Politecnico di Milano, N.1.1

12 December 2017

Antonio R. Miele, Marco D. Santambrogio

Politecnico di Milano

CUDA

Page 2: CUDA

Introduction

• CUDA (Compute Unified Device Architecture) was introduced in 2007 with the NVIDIA Tesla architecture

• It is a C-like language for writing programs that use the GPU as a processing resource to accelerate functions

• CUDA's abstraction level closely matches the characteristics and organization of NVIDIA GPUs

2

Page 3: CUDA

Structure of a program

• A CUDA program is organized into:
– a serial C code executed on the CPU
– a set of independent kernel functions parallelized on the GPU

3
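A minimal sketch of this two-part structure, assuming a trivial scaling operation (scaleKernel and d_data are illustrative names, not from the slides; launch syntax and memory management are detailed on the following slides):

#include <cuda_runtime.h>

// Kernel function: compiled for and executed in parallel on the GPU
__global__ void scaleKernel(float* data, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per element
    data[i] *= factor;
}

// Serial C code: executed on the CPU, launches the kernel
int main(void) {
    const int N = 1024;
    float* d_data;
    cudaMalloc((void**)&d_data, N * sizeof(float));  // GPU-side buffer (left uninitialized in this sketch)
    scaleKernel<<<N / 256, 256>>>(d_data, 2.0f);     // parallel part: 4 blocks of 256 threads
    cudaDeviceSynchronize();                         // wait for the GPU to finish
    cudaFree(d_data);
    return 0;
}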

Page 4: CUDA

Parallelization of a kernel

• The set of input data is decomposed into a stream of elements

• A single computational function (the kernel) operates on each element
– a "thread" is defined as the execution of the kernel on one data element

• Multiple cores can process multiple threads in parallel

• Suitable for data-parallel problems

4
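As a sketch of this one-thread-per-element mapping (scaleElement is illustrative, not from the slides), each thread computes its own element index and touches exactly one element of the stream; the guard makes threads whose index falls past the end of the data do nothing:

__global__ void scaleElement(int n, float factor, float* v) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // this thread's element of the stream
    if (i < n)                                       // threads beyond the data do nothing
        v[i] *= factor;
}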

Page 5: CUDA

CUDA hierarchy of threads

5

• A kernel executes in parallel across a set of parallel threads
– each thread executes an instance of the kernel on a single data chunk

• Threads are grouped in thread blocks

• Thread blocks are grouped in a grid

• Blocks and grids are organized as an N-dimensional (up to 3) array
– IDs are used to identify elements in the array
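A sketch of the built-in ID variables and how they can be flattened into one global thread number (whoAmI is an illustrative kernel; the flattening assumes the x component varies fastest):

// Built-in variables available inside a kernel:
//   gridDim   - number of blocks in the grid     (.x, .y, .z)
//   blockIdx  - ID of this block in the grid     (.x, .y, .z)
//   blockDim  - number of threads in each block  (.x, .y, .z)
//   threadIdx - ID of this thread in its block   (.x, .y, .z)
__global__ void whoAmI(int* out) {
    int blockId  = blockIdx.x
                 + blockIdx.y * gridDim.x
                 + blockIdx.z * gridDim.x * gridDim.y;
    int threadId = threadIdx.x
                 + threadIdx.y * blockDim.x
                 + threadIdx.z * blockDim.x * blockDim.y;
    int threadsPerBlock = blockDim.x * blockDim.y * blockDim.z;
    out[blockId * threadsPerBlock + threadId] = threadId;   // unique global position
}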

Page 6: CUDA

CUDA hierarchy of threads

• The CUDA hierarchy of threads perfectly matches the NVIDIA two-level architecture:
– a GPU executes one or more kernel grids
– a streaming multiprocessor (SM) concurrently executes one or more interleaved thread blocks
– CUDA cores in the SM execute threads
– the SM executes threads of the same block in groups of 32 threads called a warp

6

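The warp size and the number of SMs are hardware properties that host code can query at run time; a small sketch (assuming the device of interest is device 0):

#include <stdio.h>
#include <cuda_runtime.h>

int main(void) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // properties of device 0
    printf("streaming multiprocessors: %d\n", prop.multiProcessorCount);
    printf("warp size:                 %d\n", prop.warpSize);
    printf("max threads per block:     %d\n", prop.maxThreadsPerBlock);
    return 0;
}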

Page 7: CUDA

CUDA device memory model

7

• There are mainly three distinct memories visible in the device-side code

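A sketch of how device-side code touches these memories, assuming the three memories meant here are global memory, per-block shared memory, and per-thread local memory/registers (constant memory is a further read-only space; memSpacesKernel is an illustrative name):

__global__ void memSpacesKernel(const float* in, float* out) {
    // Global memory: off-chip, visible to all threads; kernel arguments
    // such as 'in' and 'out' typically point into it.
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Shared memory: on-chip, shared by all threads of the same block
    // (this sketch assumes blocks of at most 256 threads).
    __shared__ float tile[256];
    tile[threadIdx.x] = in[i];

    // Local memory / registers: private to each thread.
    float scaled = tile[threadIdx.x] * 2.0f;

    out[i] = scaled;             // result written back to global memory
}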

Page 8: CUDA

CUDA device memory model

• Matching between the memory model and the NVIDIA GPU architecture

8

Page 9: CUDA

Execution flow

1. Copy input data from CPU memory to GPU memory

9

Page 10: CUDA

Execution flow

2. Load GPU program and execute

10

Page 11: CUDA

Execution flow

3. Transfer results from GPU memory to CPU memory

11
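Putting the three steps together, a minimal host-side sketch of the whole flow (d_a, d_b, d_c are illustrative names for the device-side copies; the kernel is the vsumKernel introduced on the next slides):

#include <cuda_runtime.h>
#define N 1024

__global__ void vsumKernel(int* a, int* b, int* c) {   // shown in full on the next slides
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    c[i] = a[i] + b[i];
}

int main(void) {
    int a[N], b[N], c[N];                      // host-side vectors (left unpopulated in this sketch)
    int *d_a, *d_b, *d_c;                      // device-side copies
    cudaMalloc((void**)&d_a, N * sizeof(int));
    cudaMalloc((void**)&d_b, N * sizeof(int));
    cudaMalloc((void**)&d_c, N * sizeof(int));

    // 1. copy input data from CPU memory to GPU memory
    cudaMemcpy(d_a, a, N * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, N * sizeof(int), cudaMemcpyHostToDevice);

    // 2. load the GPU program and execute it
    vsumKernel<<<N / 256, 256>>>(d_a, d_b, d_c);

    // 3. transfer the results from GPU memory back to CPU memory
    //    (this blocking cudaMemcpy also waits for the kernel to finish)
    cudaMemcpy(c, d_c, N * sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}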

Page 12: CUDA

Example of CUDA kernel

• Compute the sum of two vectors
• C program:

12

#define N 256

void vsum(int* a, int* b, int* c) {
    int i;
    for (i = 0; i < N; i++) {
        c[i] = a[i] + b[i];
    }
}

void main() {
    int va[N], vb[N], vc[N];
    ...
    vsum(va, vb, vc);
    ...
}

Page 13: CUDA

Example of CUDA kernel

• CUDA C program:

13

#define N 256

__global__ void vsumKernel(int* a, int* b, int* c) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    c[i] = a[i] + b[i];
}

void main() {
    ...
    dim3 blocksPerGrid(N/256, 1, 1);     // assuming 256 divides N exactly
    dim3 threadsPerBlock(256, 1, 1);
    vsumKernel<<<blocksPerGrid, threadsPerBlock>>>(va, vb, vc);
    ...
}
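The launch above relies on 256 dividing N exactly. A common variant (a sketch, not taken from the slides) rounds the number of blocks up and guards the extra threads of the last block inside the kernel; launchVsum is an illustrative helper and va, vb, vc are assumed to be device pointers:

__global__ void vsumKernelN(int* a, int* b, int* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                                       // extra threads in the last block do nothing
        c[i] = a[i] + b[i];
}

void launchVsum(int* va, int* vb, int* vc, int n) {  // va, vb, vc point to GPU memory
    dim3 threadsPerBlock(256, 1, 1);
    dim3 blocksPerGrid((n + 255) / 256, 1, 1);       // ceil(n / 256) blocks covers all n elements
    vsumKernelN<<<blocksPerGrid, threadsPerBlock>>>(va, vb, vc, n);
}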

Page 14: CUDA

Example of CUDA kernel

• CUDA C program:

14

#define N 1024

__global__ void vsumKernel(int* a, int* b, int* c) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    c[i] = a[i] + b[i];
}

void main() {
    ...
    dim3 blocksPerGrid(N/256, 1, 1);     // assuming 256 divides N exactly
    dim3 threadsPerBlock(256, 1, 1);
    vsumKernel<<<blocksPerGrid, threadsPerBlock>>>(va, vb, vc);
    ...
}

Get the x (y, z) value of the block ID (blockIdx.x)

Get the length of the block in the x (y, z) direction (blockDim.x)

Get the x (y, z) value of the thread ID in the x (y, z) direction (threadIdx.x)

Page 15: CUDA

Example of CUDA kernel

• CUDA C program:

15

#define N 1024

__global__ void vsumKernel(int* a, int* b, int* c) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    c[i] = a[i] + b[i];
}

void main() {
    ...
    dim3 blocksPerGrid(N/256, 1, 1);     // assuming 256 divides N exactly
    dim3 threadsPerBlock(256, 1, 1);
    vsumKernel<<<blocksPerGrid, threadsPerBlock>>>(va, vb, vc);
    ...
}

Execute the kernel (the vsumKernel<<<...>>> call)

Number of blocks in the grid (blocksPerGrid)

Number of threads in each block (threadsPerBlock)

Page 16: CUDA

Data transfer

• GPU memory is separate from CPU memory
– GPU memory has to be allocated
– data have to be transferred explicitly to GPU memory

16

void main() {
    int* va;                       // device pointer
    int mya[N];                    // host array, to be populated
    ...
    cudaMalloc(&va, N*sizeof(int));                               // allocate GPU memory
    cudaMemcpy(va, mya, N*sizeof(int), cudaMemcpyHostToDevice);   // copy host -> device
    ...                            // execute kernel
    cudaMemcpy(mya, va, N*sizeof(int), cudaMemcpyDeviceToHost);   // copy device -> host
    cudaFree(va);                  // release GPU memory
}
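CUDA runtime calls such as cudaMalloc and cudaMemcpy return a cudaError_t; a sketch of checking those return values (checkCuda is an illustrative helper, not part of the CUDA API):

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Illustrative helper: abort with a readable message if a CUDA call failed.
static void checkCuda(cudaError_t err, const char* what) {
    if (err != cudaSuccess) {
        fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
}

// usage, following the slide's example (va is a device pointer, mya a host array):
//   checkCuda(cudaMalloc((void**)&va, N*sizeof(int)), "cudaMalloc");
//   checkCuda(cudaMemcpy(va, mya, N*sizeof(int), cudaMemcpyHostToDevice), "cudaMemcpy");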

Page 17: CUDA

Host/device synchronization

• Kernel calls are non-blocking: the host program continues immediately after it calls the kernel
– this allows overlap of computation on the CPU and GPU

• Use cudaThreadSynchronize() to wait for the kernel to finish

17

• Standard cudaMemcpy calls are blocking
– non-blocking variants exist

• It is also possible to synchronize threads within the same block with the __syncthreads() call (sketched after the code below)

vsumKernel<<<blocksPerGrid, threadsPerBlock>>>(a, b, c);
// do work on host (that doesn't depend on c)
cudaThreadSynchronize();   // wait for kernel to finish
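A sketch of block-level synchronization with __syncthreads(): each block reverses its own segment of the input through shared memory, and the barrier guarantees that every thread has written its element into the shared tile before any thread reads another thread's element back (reverseKernel is illustrative and assumes blocks of exactly 256 threads):

#define BLOCK 256

__global__ void reverseKernel(const int* in, int* out) {
    __shared__ int tile[BLOCK];                    // shared by all threads of the block

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = in[i];                     // each thread loads one element

    __syncthreads();                               // wait until the whole tile is filled

    out[i] = tile[blockDim.x - 1 - threadIdx.x];   // read an element loaded by another thread
}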

Page 18: CUDA

Another example of CUDA kernel

• Sum of matrices: 2D decomposition of the problem

18

__global__ void matrixAdd(float a[N][N], float b[N][N], float c[N][N]) {
    int j = blockIdx.x * blockDim.x + threadIdx.x;   // column index
    int i = blockIdx.y * blockDim.y + threadIdx.y;   // row index
    c[i][j] = a[i][j] + b[i][j];
}

int main() {
    dim3 blocksPerGrid(N/16, N/16, 1);   // (N/16)x(N/16) blocks/grid (2D)
    dim3 threadsPerBlock(16, 16, 1);     // 16x16 = 256 threads/block (2D)
    matrixAdd<<<blocksPerGrid, threadsPerBlock>>>(a, b, c);
}

Page 19: CUDA

References

• NVIDIA website:
– http://www.nvidia.com/object/what-is-gpu-computing.html

19