Introduction to CUDA - PRACE Research Infrastructure

Page 1

Introduction to CUDA

Page 2

GPU Performance History

• GPUs are massively multithreaded many-core chips
• Hundreds of cores, thousands of concurrent threads
• Huge economies of scale
• Still on an aggressive performance growth curve
• High memory bandwidth

[Figure: GPU performance history across the G80 and GT200 generations]

Page 3

CUDA: A Parallel Computing Architecture for NVIDIA GPUs

Supports standard languages and APIs:
• C
• OpenCL
• DX Compute
• Fortran (PGI)

Supported on common operating systems:
• Windows
• Mac OS X
• Linux

NVIDIA supports any initiative that unleashes the massive power of the GPU

Page 4

C for CUDA

CUDA is industry-standard C:
• Write a program for one thread
• Instantiate it on many parallel threads
• Familiar programming model and language

CUDA is a scalable parallel programming model:
• A program runs on any number of processors without recompiling
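As a minimal sketch of that model (not from the original deck; the kernel and array names are hypothetical):

// One thread's worth of work: written once, instantiated across many threads.
// 'add_one', 'out', 'in', and 'n' are illustrative names.
__global__ void add_one(float *out, const float *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // this thread's element
    if (i < n)                                      // guard threads past the end
        out[i] = in[i] + 1.0f;
}

// The same binary scales to any GPU; the launch just specifies how many
// threads, e.g.  add_one<<<(n + 255)/256, 256>>>(d_out, d_in, n);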

Page 5

GPU Sizes Require CUDA Scalability

[Figure: GPUs of different sizes, from 32 SP cores to 128 SP cores to 240 SP cores]

Page 6

CUDA runs on NVIDIA GPUs… Over 100 Million CUDA GPUs Deployed

• Tesla™: High-Performance Computing
• Quadro®: Design & Creation
• GeForce®: Entertainment

Page 7

Pervasive CUDA Parallel Computing

CUDA brings data-parallel computing to the masses:
• Over 100M CUDA-capable GPUs deployed since Nov 2006
• Wide developer acceptance
• Download CUDA from www.nvidia.com/cuda
• Over 50K CUDA developer downloads
• A GPU “developer kit” costs ~$100 for several hundred GFLOPS

Data-parallel supercomputers are everywhere!
• CUDA makes this power readily accessible
• Enables rapid innovations in data-parallel computing

Parallel computing rides the commodity technology wave

Page 8

CUDA Zone: www.nvidia.com/cuda

Resources, examples, and pointers for CUDA developers

Page 9

Introducing the Tesla T10P Processor
…NVIDIA’s 2nd-Generation CUDA Processor

• 1.4 billion transistors
• 1 teraflop of processing power
• 240 SP processing cores
• 30 DP processing cores with IEEE-754 double precision

Page 10

CUDA Computing with Tesla T10

• 240 SP processors at 1.44 GHz: 1 TFLOPS peak
• 30 DP processors at 1.44 GHz: 86 GFLOPS peak
• 128 threads per processor: 30,720 threads total

Page 11

Double Precision Floating Point

| Feature                            | NVIDIA GPU                                      | SSE2                                          | Cell SPE                        |
|------------------------------------|-------------------------------------------------|-----------------------------------------------|---------------------------------|
| Precision                          | IEEE 754                                        | IEEE 754                                      | IEEE 754                        |
| Rounding modes for FADD and FMUL   | All 4 IEEE: round to nearest, zero, inf, -inf   | All 4 IEEE: round to nearest, zero, inf, -inf | Round to zero/truncate only     |
| Denormal handling                  | Full speed                                      | Supported, costs 1000s of cycles              | Flush to zero                   |
| NaN support                        | Yes                                             | Yes                                           | No                              |
| Overflow and infinity support      | Yes                                             | Yes                                           | No infinity, clamps to max norm |
| Flags                              | No                                              | Yes                                           | Some                            |
| FMA                                | Yes                                             | No                                            | Yes                             |
| Square root                        | Software with low-latency FMA-based convergence | Hardware                                      | Software only                   |
| Division                           | Software with low-latency FMA-based convergence | Hardware                                      | Software only                   |
| Reciprocal estimate accuracy       | 24 bit                                          | 12 bit                                        | 12 bit                          |
| Reciprocal sqrt estimate accuracy  | 23 bit                                          | 12 bit                                        | 12 bit                          |
| log2(x) and 2^x estimates accuracy | 23 bit                                          | No                                            | No                              |

Page 12

Tesla C1060 Computing Processor

Processor: 1x Tesla T10P
Core clock: 1.29 GHz
Form factor: Full ATX, 4.736" (H) x 10.5" (L), dual-slot wide
On-board memory: 4 GB
System I/O: PCIe x16 Gen2
Memory I/O: 512-bit, 800 MHz DDR; 102 GB/s peak bandwidth
Display outputs: None
Typical power: 160 W

Page 13

Tesla S1070 1U System

Processors: 4x Tesla T10P
Core clock: 1.44 GHz
Form factor: 1U for an EIA 19" 4-post rack
Total 1U system memory: 16 GB (4.0 GB per GPU)
System I/O: 2x PCIe x16
Memory I/O per processor: 512-bit, 800 MHz GDDR; 102 GB/s peak bandwidth
Display outputs: None
Typical power: 700 W
Chassis dimensions: 1.73" H x 17.5" W x 28.5" D

Page 14

Applications

Page 15

Folding@home Performance Comparison

[Bar chart: nanoseconds of simulation per day; F@H kernel based on GROMACS code]
CPU: 4; PS3: 100; Radeon HD3870: 170; Radeon HD4850: 377; Tesla 8-series: 423; Tesla 10-series: 740

Page 16

Lattice Boltzmann

1000 iterations on a 256x128x128 domain:
• Cluster with 8 GPUs: 7.5 s
• Blue Gene/L, 512 nodes: 21 s

10,000 iterations on an irregular 1057x692x1446 domain with 4M fluid nodes:
• 1 C870: 760 s (53 MLUPS)
• 2 C1060: 159 s (252 MLUPS)
• 8 C1060: 42 s (955 MLUPS)

(MLUPS = millions of lattice-site updates per second; e.g. 4M fluid nodes x 10,000 iterations / 42 s ≈ 955 MLUPS.)

Blood flow pattern in a human coronary artery, Bernaschi et al.

Page 17

Desktop GPU Supercomputer Beats Cluster

FASTRA: 8 GPUs in a desktop

CalcUA: 256 nodes (512 cores)

http://fastra.ua.ac.be/en/index.html

Page 18

CUDA accelerated Linpack

Standard HPL code, with a library that intercepts DGEMM and DTRSM calls and executes them simultaneously on the GPUs and CPU cores. The library is implemented with CUBLAS.

Cluster with 8 nodes:
• Each node has 2 Intel Xeon E5462 (2.8 GHz), 16 GB of memory, and 2 Tesla GPUs (1.44 GHz clock)
• The nodes are connected with SDR InfiniBand

T/V          N       NB   P  Q  Time    Gflops
------------------------------------------------
WR11R2L2     118144  960  4  4  874.26  1.258e+03
------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N) = 0.0031157 ...... PASSED

Page 19

Accelerating MATLAB®

Use MEX files to call CUDA from MATLAB: 17x speed-up.

Pseudo-spectral simulation of 2D isotropic turbulence: 1024x1024 mesh, 400 RK4 steps, Windows XP, Core2 Duo 2.4 GHz vs. GeForce 8800 GTX.

http://developer.nvidia.com/object/matlab_cuda.html

Page 20

Applications in several fields

• 146X: Astrophysics N-body simulation
• 36X: Interactive visualization of volumetric white matter connectivity
• 19X: Ionic placement for molecular dynamics simulation on GPU
• 17X: Transcoding HD video stream to H.264
• 100X: Simulation in Matlab using .mex file CUDA function
• 149X: Financial simulation of LIBOR model with swaptions
• 47X: GLAME@lab: an M-script API for linear algebra operations on GPU
• 20X: Ultrasound medical imaging for cancer diagnostics
• 24X: Highly optimized object-oriented molecular dynamics
• 30X: Cmatch exact string matching to find similar proteins and gene sequences

Page 21

CUDA Basics

Page 22

CUDA: A Parallel Computing Architecture for NVIDIA GPUs

Supports standard languages and APIs:
• C
• OpenCL
• Fortran (PGI)
• DX Compute

Supported on common operating systems:
• Windows
• Mac OS
• Linux

Page 23

Arrays of Parallel Threads

• A CUDA kernel is executed by an array of threads
• All threads run the same code
• Each thread has an ID that it uses to compute memory addresses and make control decisions

[Diagram: threads with threadID 0–7, all executing the same code:]

…
float x = input[threadID];
float y = func(x);
output[threadID] = y;
…

Page 24

Example: Increment Array Elements

Increment N-element vector a by scalar b

Let’s assume N=16, blockDim=4 -> 4 blocks

blockIdx.x=0: blockDim.x=4, threadIdx.x=0,1,2,3 -> idx=0,1,2,3
blockIdx.x=1: blockDim.x=4, threadIdx.x=0,1,2,3 -> idx=4,5,6,7
blockIdx.x=2: blockDim.x=4, threadIdx.x=0,1,2,3 -> idx=8,9,10,11
blockIdx.x=3: blockDim.x=4, threadIdx.x=0,1,2,3 -> idx=12,13,14,15

int idx = blockDim.x * blockIdx.x + threadIdx.x;
maps from the local index threadIdx.x to a global index (e.g. block 2, thread 1: idx = 4*2 + 1 = 9)

NB: blockDim should be >= 32 in real code, this is just an example

Page 25

Example: Increment Array Elements

CPU program:

void increment_cpu(float *a, float b, int N)
{
    for (int idx = 0; idx < N; idx++)
        a[idx] = a[idx] + b;
}

void main()
{
    .....
    increment_cpu(a, b, N);
}

CUDA program:

__global__ void increment_gpu(float *a, float b, int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N)
        a[idx] = a[idx] + b;
}

void main()
{
    …..
    dim3 dimBlock( blocksize );
    dim3 dimGrid( ceil( N / (float)blocksize ) );
    increment_gpu<<<dimGrid, dimBlock>>>(ad, bd, N);
}
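For reference, here is a self-contained version of the CUDA column above: a sketch, not part of the deck, that fills in the elided allocation and copies, keeping N = 16 and blocksize = 4 from the earlier example.

#include <stdio.h>

__global__ void increment_gpu(float *a, float b, int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N)
        a[idx] = a[idx] + b;
}

int main()
{
    const int N = 16, blocksize = 4;   // toy sizes from the earlier slide
    int nbytes = N * sizeof(float);

    float h_a[16] = {0};               // host array, initialized to zeros
    float *d_a = 0;                    // device array
    cudaMalloc( (void**)&d_a, nbytes );
    cudaMemcpy( d_a, h_a, nbytes, cudaMemcpyHostToDevice );

    dim3 dimBlock( blocksize );
    dim3 dimGrid( (N + blocksize - 1) / blocksize );  // integer ceil
    increment_gpu<<<dimGrid, dimBlock>>>( d_a, 1.0f, N );

    cudaMemcpy( h_a, d_a, nbytes, cudaMemcpyDeviceToHost );  // waits for the kernel
    for (int i = 0; i < N; i++)
        printf("%g ", h_a[i]);         // prints sixteen 1s
    printf("\n");

    cudaFree( d_a );
    return 0;
}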

Page 26

Outline of CUDA Basics

• Basic Memory Management
• Basic Kernels and Execution on GPU
• Coordinating CPU and GPU Execution
• Development Resources

See the Programming Guide for the full API

Page 27

Basic Memory Management

Page 28

Memory Spaces

CPU and GPU have separate memory spaces:
• Data is moved across the PCIe bus
• Use functions to allocate/set/copy memory on the GPU
• Very similar to corresponding C functions

Pointers are just addresses:
• Can’t tell from the pointer value whether the address is on the CPU or GPU
• Must exercise care when dereferencing: dereferencing a CPU pointer on the GPU will likely crash, and vice versa

Page 29

GPU Memory Allocation / Release

Host (CPU) manages device (GPU) memory:
• cudaMalloc(void **pointer, size_t nbytes)
• cudaMemset(void *pointer, int value, size_t count)
• cudaFree(void *pointer)

int n = 1024;
int nbytes = n*sizeof(int);
int *d_a = 0;
cudaMalloc( (void**)&d_a, nbytes );
cudaMemset( d_a, 0, nbytes );
cudaFree( d_a );

Page 30

Data Copies

cudaMemcpy( void *dst, void *src, size_t nbytes, enum cudaMemcpyKind direction);

• Returns after the copy is complete
• Blocks the CPU thread until all bytes have been copied
• Doesn’t start copying until previous CUDA calls complete

enum cudaMemcpyKind:
• cudaMemcpyHostToDevice
• cudaMemcpyDeviceToHost
• cudaMemcpyDeviceToDevice

Non-blocking memcopies are provided
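A sketch of both flavors (names are illustrative; note that cudaMemcpyAsync is only truly asynchronous with page-locked host memory):

// Blocking copy: returns only after all nbytes have arrived on the device.
cudaMemcpy( d_a, h_a, nbytes, cudaMemcpyHostToDevice );

// Non-blocking copy: returns immediately and runs in a stream.
float *h_pinned = 0;
cudaMallocHost( (void**)&h_pinned, nbytes );   // page-locked host allocation
cudaStream_t stream;
cudaStreamCreate( &stream );
cudaMemcpyAsync( d_a, h_pinned, nbytes, cudaMemcpyHostToDevice, stream );
// ... CPU work can overlap with the copy here ...
cudaStreamSynchronize( stream );               // wait for the copy to finish
cudaStreamDestroy( stream );
cudaFreeHost( h_pinned );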

Page 31

Code Walkthrough 1

• Allocate CPU memory for n integers
• Allocate GPU memory for n integers
• Initialize GPU memory to 0s
• Copy from GPU to CPU
• Print the values

Page 32

Code Walkthrough 1

#include <stdio.h>

int main()
{
    int dimx = 16;
    int num_bytes = dimx*sizeof(int);

    int *d_a=0, *h_a=0; // device and host pointers

Page 33

Code Walkthrough 1

#include <stdio.h>

int main()
{
    int dimx = 16;
    int num_bytes = dimx*sizeof(int);

    int *d_a=0, *h_a=0; // device and host pointers

    h_a = (int*)malloc(num_bytes);
    cudaMalloc( (void**)&d_a, num_bytes );

    if( 0==h_a || 0==d_a )
    {
        printf("couldn't allocate memory\n");
        return 1;
    }

Page 34

Code Walkthrough 1

#include <stdio.h>

int main()
{
    int dimx = 16;
    int num_bytes = dimx*sizeof(int);

    int *d_a=0, *h_a=0; // device and host pointers

    h_a = (int*)malloc(num_bytes);
    cudaMalloc( (void**)&d_a, num_bytes );

    if( 0==h_a || 0==d_a )
    {
        printf("couldn't allocate memory\n");
        return 1;
    }

    cudaMemset( d_a, 0, num_bytes );
    cudaMemcpy( h_a, d_a, num_bytes, cudaMemcpyDeviceToHost );

Page 35

Code Walkthrough 1

#include <stdio.h>

int main()
{
    int dimx = 16;
    int num_bytes = dimx*sizeof(int);

    int *d_a=0, *h_a=0; // device and host pointers

    h_a = (int*)malloc(num_bytes);
    cudaMalloc( (void**)&d_a, num_bytes );

    if( 0==h_a || 0==d_a )
    {
        printf("couldn't allocate memory\n");
        return 1;
    }

    cudaMemset( d_a, 0, num_bytes );
    cudaMemcpy( h_a, d_a, num_bytes, cudaMemcpyDeviceToHost );

    for(int i=0; i<dimx; i++)
        printf("%d ", h_a[i] );
    printf("\n");

    free( h_a );
    cudaFree( d_a );

    return 0;
}

Page 36

Basic Kernels and Execution on GPU

Page 37

CUDA Programming Model

• Parallel code (kernel) is launched and executed on a device by many threads
• Threads are grouped into thread blocks
• Parallel code is written for a thread
  • Each thread is free to execute a unique code path
  • Built-in thread and block ID variables

Page 38

Thread Hierarchy

• Threads launched for a parallel section are partitioned into thread blocks
• Grid = all blocks for a given launch
• A thread block is a group of threads that can:
  • Synchronize their execution
  • Communicate via shared memory

Page 39

IDs and Dimensions

• Threads: 3D IDs, unique within a block
• Blocks: 2D IDs, unique within a grid
• Dimensions set at launch time; can be unique for each grid
• Built-in variables: threadIdx, blockIdx, blockDim, gridDim

[Diagram: a device executing Grid 1, a 3x2 array of blocks; Block (1,1) expanded into a 5x3 array of threads]

Page 40

Code executed on GPU

A C function with some restrictions:
• Can only access GPU memory
• No variable number of arguments
• No static variables
• No recursion

Must be declared with a qualifier:
• __global__ : launched by the CPU, cannot be called from the GPU, must return void
• __device__ : called from other GPU functions, cannot be launched by the CPU
• __host__ : can be executed by the CPU
• __host__ and __device__ qualifiers can be combined (sample use: overloading operators), as sketched below
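A small sketch of combined qualifiers (the type and operator are illustrative, not from the deck):

// Compiled for both CPU and GPU, callable from either side.
struct vec2 { float x, y; };

__host__ __device__ vec2 operator+( vec2 a, vec2 b )
{
    vec2 r;
    r.x = a.x + b.x;
    r.y = a.y + b.y;
    return r;    // usable in host code and inside kernels alike
}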

Page 41

Code Walkthrough 2

• Build on Walkthrough 1
• Write a kernel to initialize integers
• Copy the result back to CPU
• Print the values

Page 42

Kernel Code (executed on GPU)

__global__ void kernel( int *a )
{
    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    a[idx] = 7;
}

Page 43

Launching kernels on GPU

Launch parameters:
• grid dimensions (up to 2D), dim3 type
• thread-block dimensions (up to 3D), dim3 type
• shared memory: number of bytes per block
  • for extern smem variables declared without size
  • optional, 0 by default
• stream ID
  • optional, 0 by default

dim3 grid(16, 16);
dim3 block(16, 16);
kernel<<<grid, block, 0, 0>>>(...);
kernel<<<32, 512>>>(...);

Page 44

#include <stdio.h>

__global__ void kernel( int *a )
{
    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    a[idx] = 7;
}

int main()
{
    int dimx = 16;
    int num_bytes = dimx*sizeof(int);

    int *d_a=0, *h_a=0; // device and host pointers

    h_a = (int*)malloc(num_bytes);
    cudaMalloc( (void**)&d_a, num_bytes );

    if( 0==h_a || 0==d_a )
    {
        printf("couldn't allocate memory\n");
        return 1;
    }

    cudaMemset( d_a, 0, num_bytes );

    dim3 grid, block;
    block.x = 4;
    grid.x = dimx / block.x;

    kernel<<<grid, block>>>( d_a );

    cudaMemcpy( h_a, d_a, num_bytes, cudaMemcpyDeviceToHost );

    for(int i=0; i<dimx; i++)
        printf("%d ", h_a[i] );
    printf("\n");

    free( h_a );
    cudaFree( d_a );

    return 0;
}

Page 45

Kernel Variations and Output

__global__ void kernel( int *a )
{
    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    a[idx] = 7;
}
Output: 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7

__global__ void kernel( int *a )
{
    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    a[idx] = blockIdx.x;
}
Output: 0 0 0 0 1 1 1 1 2 2 2 2 3 3 3 3

__global__ void kernel( int *a )
{
    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    a[idx] = threadIdx.x;
}
Output: 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3

Page 46

Code Walkthrough 3

• Build on Walkthrough 2
• Write a kernel to increment n×m integers
• Copy the result back to CPU
• Print the values

Page 47

Kernel with 2D Indexing

__global__ void kernel( int *a, int dimx, int dimy )
{
    int ix  = blockIdx.x*blockDim.x + threadIdx.x;
    int iy  = blockIdx.y*blockDim.y + threadIdx.y;
    int idx = iy*dimx + ix;

    a[idx] = a[idx]+1;
}

Page 48

__global__ void kernel( int *a, int dimx, int dimy )
{
    int ix  = blockIdx.x*blockDim.x + threadIdx.x;
    int iy  = blockIdx.y*blockDim.y + threadIdx.y;
    int idx = iy*dimx + ix;

    a[idx] = a[idx]+1;
}

int main()
{
    int dimx = 16;
    int dimy = 16;
    int num_bytes = dimx*dimy*sizeof(int);

    int *d_a=0, *h_a=0; // device and host pointers

    h_a = (int*)malloc(num_bytes);
    cudaMalloc( (void**)&d_a, num_bytes );

    if( 0==h_a || 0==d_a )
    {
        printf("couldn't allocate memory\n");
        return 1;
    }

    cudaMemset( d_a, 0, num_bytes );

    dim3 grid, block;
    block.x = 4;
    block.y = 4;
    grid.x  = dimx / block.x;
    grid.y  = dimy / block.y;

    kernel<<<grid, block>>>( d_a, dimx, dimy );

    cudaMemcpy( h_a, d_a, num_bytes, cudaMemcpyDeviceToHost );

    for(int row=0; row<dimy; row++)
    {
        for(int col=0; col<dimx; col++)
            printf("%d ", h_a[row*dimx+col] );
        printf("\n");
    }

    free( h_a );
    cudaFree( d_a );

    return 0;
}

Page 49

Blocks must be independent

Any possible interleaving of blocks should be valid:
• presumed to run to completion without pre-emption
• can run in any order
• can run concurrently OR sequentially

Blocks may coordinate but not synchronize:
• shared queue pointer: OK
• shared lock: BAD … can easily deadlock (see the sketch below)

Independence requirement gives scalability
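A sketch of the distinction (atomicAdd is the real CUDA intrinsic; the queue names are illustrative):

__device__ int queue_head;   // shared queue pointer in global memory

__global__ void worker( int *work_items, int n_items )
{
    if ( threadIdx.x == 0 )
    {
        // OK: claiming an item is a single atomic step that never waits
        // on another block, so any block ordering remains valid.
        int my_item = atomicAdd( &queue_head, 1 );
        if ( my_item < n_items )
        {
            // ... process work_items[my_item] ...
        }
    }
    // BAD (not shown): spinning on a lock held by another block can
    // deadlock, since that block may never be scheduled while we spin.
}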

Page 50

Blocks must be independent

• Thread blocks can run in any order
• Concurrently or sequentially
• Facilitates scaling of the same code across many devices

Scalability

Page 51

Coordinating CPU and GPU Execution

Page 52

Synchronizing GPU and CPU

All kernel launches are asynchronous:
• control returns to the CPU immediately
• the kernel starts executing once all previous CUDA calls have completed

Memcopies are synchronous:
• control returns to the CPU once the copy is complete
• the copy starts once all previous CUDA calls have completed

cudaThreadSynchronize():
• blocks until all previous CUDA calls complete

Asynchronous CUDA calls provide:
• non-blocking memcopies
• the ability to overlap memcopies and kernel execution
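Putting these rules together (a sketch with hypothetical names):

kernel<<<grid, block>>>( d_a );   // asynchronous: returns to the CPU at once
do_cpu_work();                    // CPU work overlaps with the kernel
cudaThreadSynchronize();          // block until the kernel has finished

// Alternatively, a blocking memcopy implies the same wait, since it
// does not start until all previous CUDA calls complete:
cudaMemcpy( h_a, d_a, nbytes, cudaMemcpyDeviceToHost );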

Page 53

CUDA Error Reporting to CPU

All CUDA calls return an error code:
• except kernel launches
• cudaError_t type

cudaError_t cudaGetLastError(void)
• returns the code for the last error (“no error” has a code)

char* cudaGetErrorString(cudaError_t code)
• returns a null-terminated character string describing the error

printf("%s\n", cudaGetErrorString( cudaGetLastError() ) );
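These two calls are often wrapped in a small checking helper; a sketch, not part of the deck:

#define CHECK_CUDA( call )                                           \
    do {                                                             \
        cudaError_t err = (call);                                    \
        if ( err != cudaSuccess )                                    \
            printf( "CUDA error at %s:%d: %s\n",                     \
                    __FILE__, __LINE__, cudaGetErrorString(err) );   \
    } while (0)

// Usage for API calls, which return the code directly:
//   CHECK_CUDA( cudaMemcpy(h_a, d_a, nbytes, cudaMemcpyDeviceToHost) );
// Kernel launches return nothing, so query afterwards:
//   kernel<<<grid, block>>>(d_a);
//   CHECK_CUDA( cudaGetLastError() );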

Page 54

CUDA Event API

Events are inserted (recorded) into CUDA call streams.

Usage scenarios:
• measure elapsed time for CUDA calls (clock cycle precision)
• query the status of an asynchronous CUDA call
• block the CPU until CUDA calls prior to the event are completed
• see the asyncAPI sample in the CUDA SDK

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start, 0);
kernel<<<grid, block>>>(...);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
float et;
cudaEventElapsedTime(&et, start, stop);
cudaEventDestroy(start);
cudaEventDestroy(stop);

Page 55

Device Management

CPU can query and select GPU devices:
• cudaGetDeviceCount( int* count )
• cudaSetDevice( int device )
• cudaGetDevice( int* current_device )
• cudaGetDeviceProperties( cudaDeviceProp* prop, int device )
• cudaChooseDevice( int* device, cudaDeviceProp* prop )

Multi-GPU setup:
• device 0 is used by default
• one CPU thread can control one GPU
• multiple CPU threads can control the same GPU (calls are serialized by the driver)
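A sketch that enumerates the available devices with these calls (the printed fields are from cudaDeviceProp):

int count = 0;
cudaGetDeviceCount( &count );                 // number of CUDA-capable GPUs
for ( int dev = 0; dev < count; dev++ )
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties( &prop, dev );    // fill in device capabilities
    printf( "Device %d: %s, %d multiprocessors\n",
            dev, prop.name, prop.multiProcessorCount );
}
cudaSetDevice( 0 );                           // make device 0 current explicitly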

Page 56

Shared Memory

Page 57

Shared Memory

On-chip memory:
• 2 orders of magnitude lower latency than global memory
• an order of magnitude higher bandwidth than gmem
• 16 KB per multiprocessor (NVIDIA GPUs contain up to 30 multiprocessors)

Allocated per threadblock:
• accessible by any thread in the threadblock
• not accessible to other threadblocks

Several uses:
• sharing data among threads in a threadblock
• user-managed cache (reducing gmem accesses)

Page 58

Using shared memory

Size known at compile time:

__global__ void kernel(…)
{
    …
    __shared__ float sData[256];
    …
}

int main(void)
{
    …
    kernel<<<nBlocks,blockSize>>>(…);
    …
}

Size known at kernel launch:

__global__ void kernel(…)
{
    …
    extern __shared__ float sData[];
    …
}

int main(void)
{
    …
    smBytes = blockSize*sizeof(float);
    kernel<<<nBlocks, blockSize, smBytes>>>(…);
    …
}

Page 59

Example of Using Shared Memory

Applying a 1D stencil:
• 1D data
• for each output element, sum all elements within a radius

For example, radius = 3: add 7 input elements (the element itself plus 3 on each side)

[Diagram: an output element with a radius of 3 input elements on either side]

Page 60

Implementation with Shared Memory

• 1D threadblocks (partition the output)
• each threadblock outputs BLOCK_DIMX elements
• read input from gmem to smem: needs BLOCK_DIMX + 2*RADIUS input elements
• compute
• write output to gmem

[Diagram: the input elements corresponding to the output (as many as there are threads in a threadblock), flanked by a “halo” of RADIUS elements on each side]

Page 61

Kernel code

__global__ void stencil( int *output, int *input, int dimx, int dimy )
{
    __shared__ int s_a[BLOCK_DIMX+2*RADIUS];

    int global_ix = blockIdx.x*blockDim.x + threadIdx.x;
    int local_ix  = threadIdx.x + RADIUS;

    s_a[local_ix] = input[global_ix];        // each thread loads its own element

    if ( threadIdx.x < RADIUS )              // first RADIUS threads load the halos
    {
        s_a[local_ix - RADIUS]     = input[global_ix - RADIUS];
        s_a[local_ix + BLOCK_DIMX] = input[global_ix + BLOCK_DIMX];
    }
    __syncthreads();                         // wait until all of s_a is populated

    int value = 0;
    for( int offset = -RADIUS; offset <= RADIUS; offset++ )
        value += s_a[ local_ix + offset ];

    output[global_ix] = value;
}
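A hypothetical host-side launch for this kernel; note the kernel has no bounds checks, so real code would pad the input by RADIUS elements at each end and make dimx a multiple of BLOCK_DIMX:

#define RADIUS      3
#define BLOCK_DIMX  64                    // illustrative tile width

dim3 block( BLOCK_DIMX );
dim3 grid( dimx / BLOCK_DIMX );           // one 1D block per output tile
stencil<<<grid, block>>>( d_output, d_input, dimx, dimy );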

Page 62

Thread Synchronization Function

void __syncthreads();

• Synchronizes all threads in a thread block: since threads are scheduled at run-time, no thread proceeds past this point until all threads in the block have reached it; then execution resumes normally
• Used to avoid RAW / WAR / WAW hazards when accessing shared memory
• Should be used in conditional code only if the conditional is uniform across the entire thread block, as illustrated below
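The uniformity rule, illustrated (a sketch):

// OK: the condition is uniform across the block, so either every
// thread reaches the barrier or none does.
if ( blockIdx.x == 0 )
{
    __syncthreads();
}

// BAD: threads within a block diverge; those with threadIdx.x >= 16
// never reach the barrier and the block can hang.
if ( threadIdx.x < 16 )
{
    __syncthreads();
}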

Page 63

Memory Model Review

Local storage:
• each thread has its own local storage
• mostly registers (managed by the compiler)
• data lifetime = thread lifetime

Shared memory:
• each thread block has its own shared memory
• accessible only by threads within that block
• data lifetime = block lifetime

Global (device) memory:
• accessible by all threads as well as the host (CPU)
• data lifetime = from allocation to deallocation

Page 64

Memory Model Review

[Diagram: a thread with its per-thread local storage, and a block with its per-block shared memory]

Page 65

Memory Model Review

[Diagram: sequential kernels (Kernel 0, Kernel 1, …) all accessing the same per-device global memory]

Page 66

Memory Model Review

[Diagram: host memory exchanging data with device 0 memory and device 1 memory via cudaMemcpy()]