Introduction to CUDA - PRACE Research Infrastructure
TRANSCRIPT
Introduction to CUDA
GPU Performance History
• GPUs are massively multithreaded many-core chips
• Hundreds of cores, thousands of concurrent threads
• Huge economies of scale
• Still on aggressive performance growth
• High memory bandwidth
(Performance chart: G80, GT200)
CUDA: A Parallel Computing Architecture for NVIDIA GPUs
Supports standard languages and APIs:
• C
• OpenCL
• DX Compute
• Fortran (PGI)
Supported on common operating systems:
• Windows
• Mac OS X
• Linux
NVIDIA supports any initiative that unleashes the massive power of the GPU
© NVIDIA Corporation 2008
C for CUDA
CUDA is industry-standard C:
• Write a program for one thread
• Instantiate it on many parallel threads
• Familiar programming model and language
CUDA is a scalable parallel programming model:
• Programs run on any number of processors without recompiling
GPU Sizes Require CUDA Scalability
(GPU diagrams: 32 SP cores, 128 SP cores, 240 SP cores)
Tesla™ - High-Performance Computing
Quadro® - Design & Creation
GeForce® - Entertainment
CUDA runs on NVIDIA GPUs... over 100 million CUDA GPUs deployed
Pervasive CUDA Parallel Computing
CUDA brings data-parallel computing to the masses:
• Over 100 million CUDA-capable GPUs deployed since Nov 2006
• Wide developer acceptance: over 50K CUDA developer downloads (download CUDA from www.nvidia.com/cuda)
• A GPU "developer kit" costs ~$100 for several hundred GFLOPS
Data-parallel supercomputers are everywhere!
• CUDA makes this power readily accessible
• Enables rapid innovations in data-parallel computing
Parallel computing rides the commodity technology wave
CUDA Zone: www.nvidia.com/cuda
Resources, examples, and pointers for CUDA developers
Introducing the Tesla T10P Processor
...NVIDIA's 2nd-generation CUDA processor
• 1.4 billion transistors
• 1 teraflop of processing power
• 240 SP processing cores
• 30 DP processing cores with IEEE-754 double precision
CUDA Computing with Tesla T10
• 240 SP processors at 1.44 GHz: 1 TFLOPS peak
• 30 DP processors at 1.44 GHz: 86 GFLOPS peak
• 128 threads per processor: 30,720 threads total
Double-Precision Floating Point (columns: NVIDIA GPU / SSE2 / Cell SPE)
• Precision: IEEE 754 / IEEE 754 / IEEE 754
• Rounding modes for FADD and FMUL: all 4 IEEE modes (nearest, zero, +inf, -inf) / all 4 IEEE modes / round to zero (truncate) only
• Denormal handling: full speed / supported, costs 1000s of cycles / flush to zero
• NaN support: yes / yes / no
• Overflow and infinity support: yes / yes / no infinity, clamps to max norm
• Flags: no / yes / some
• FMA: yes / no / yes
• Square root: software with low-latency FMA-based convergence / hardware / software only
• Division: software with low-latency FMA-based convergence / hardware / software only
• Reciprocal estimate accuracy: 24 bit / 12 bit / 12 bit
• Reciprocal sqrt estimate accuracy: 23 bit / 12 bit / 12 bit
• log2(x) and 2^x estimate accuracy: 23 bit / no / no
Tesla C1060 Computing Processor
• Processor: 1x Tesla T10P
• Core clock: 1.29 GHz
• Form factor: full ATX, 4.736" (H) x 10.5" (L), dual-slot wide
• On-board memory: 4 GB
• System I/O: PCIe x16 gen2
• Memory I/O: 512-bit, 800 MHz DDR; 102 GB/s peak bandwidth
• Display outputs: none
• Typical power: 160 W
Tesla S1070 1U System
• Processors: 4x Tesla T10P
• Core clock: 1.44 GHz
• Form factor: 1U for an EIA 19" 4-post rack
• Total 1U system memory: 16 GB (4.0 GB per GPU)
• System I/O: 2x PCIe x16
• Memory I/O per processor: 512-bit, 800 MHz GDDR; 102 GB/s peak bandwidth
• Display outputs: none
• Typical power: 700 W
• Chassis dimensions: 1.73" H x 17.5" W x 28.5" D
Applications
Folding@home Performance Comparison
(Bar chart, scale 0-800: nanoseconds of simulation per day for CPU, PS3, Radeon HD3870, Radeon HD4850, Tesla 8-series, and Tesla 10-series)
F@H kernel based on GROMACS code
Lattice Boltzmann
1000 iterations on a 256x128x128 domain:
• Cluster with 8 GPUs: 7.5 s
• Blue Gene/L, 512 nodes: 21 s
10000 iterations on an irregular 1057x692x1446 domain with 4M fluid nodes:
• 1 C870: 760 s (53 MLUPS)
• 2 C1060: 159 s (252 MLUPS)
• 8 C1060: 42 s (955 MLUPS)
Blood flow pattern in a human coronary artery, Bernaschi et al.
Desktop GPU Supercomputer Beats Cluster
FASTRA: 8 GPUs in a desktop vs. CalcUA: 256 nodes (512 cores)
http://fastra.ua.ac.be/en/index.html
CUDA-Accelerated Linpack
Standard HPL code, with a library that intercepts DGEMM and DTRSM calls and executes them simultaneously on the GPUs and CPU cores. The library is implemented with CUBLAS.
Cluster with 8 nodes:
• Each node has 2 Intel Xeon E5462 (2.8 GHz), 16 GB of memory, and 2 Tesla GPUs (1.44 GHz clock)
• The nodes are connected with SDR InfiniBand

T/V               N     NB   P  Q    Time    Gflops
----------------------------------------------------
WR11R2L2     118144    960   4  4   874.26   1.258e+03
----------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N) = 0.0031157 ...... PASSED
Accelerating MATLAB®
Pseudo-spectral simulation of 2D isotropic turbulence:
• 1024x1024 mesh, 400 RK4 steps, Windows XP, Core2 Duo 2.4 GHz vs. GeForce 8800 GTX
• Use MEX files to call CUDA from MATLAB: 17x speed-up
http://developer.nvidia.com/object/matlab_cuda.html
Applications in Several Fields
• 146X: Astrophysics N-body simulation
• 36X: Interactive visualization of volumetric white matter connectivity
• 19X: Ionic placement for molecular dynamics simulation on GPU
• 17X: Transcoding HD video stream to H.264
• 100X: Simulation in Matlab using .mex file CUDA function
• 149X: Financial simulation of LIBOR model with swaptions
• 47X: GLAME@lab, an M-script API for linear algebra operations on GPU
• 20X: Ultrasound medical imaging for cancer diagnostics
• 24X: Highly optimized object-oriented molecular dynamics
• 30X: Cmatch exact string matching to find similar proteins and gene sequences
CUDA Basics
Arrays of Parallel Threads
A CUDA kernel is executed by an array of threads:
• All threads run the same code
• Each thread has an ID that it uses to compute memory addresses and make control decisions

threadID: 0 1 2 3 4 5 6 7

...
float x = input[threadID];
float y = func(x);
output[threadID] = y;
...
Example: Increment Array Elements
Increment N-element vector a by scalar b
Let's assume N=16 and blockDim=4 -> 4 blocks (blockDim.x=4 in every block):

blockIdx.x=0: threadIdx.x=0,1,2,3 -> idx=0,1,2,3
blockIdx.x=1: threadIdx.x=0,1,2,3 -> idx=4,5,6,7
blockIdx.x=2: threadIdx.x=0,1,2,3 -> idx=8,9,10,11
blockIdx.x=3: threadIdx.x=0,1,2,3 -> idx=12,13,14,15

int idx = blockDim.x * blockIdx.x + threadIdx.x;
maps from the local index threadIdx.x to a global index.

NB: blockDim should be >= 32 in real code; this is just an example.
Example: Increment Array Elements
CPU program:

void increment_cpu(float *a, float b, int N)
{
    for (int idx = 0; idx < N; idx++)
        a[idx] = a[idx] + b;
}

void main()
{
    .....
    increment_cpu(a, b, N);
}

CUDA program:

__global__ void increment_gpu(float *a, float b, int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N)
        a[idx] = a[idx] + b;
}

void main()
{
    .....
    dim3 dimBlock( blocksize );
    dim3 dimGrid( ceil( N / (float)blocksize ) );
    increment_gpu<<<dimGrid, dimBlock>>>(ad, bd, N);
}
Outline of CUDA Basics
• Basic Memory Management
• Basic Kernels and Execution on GPU
• Coordinating CPU and GPU Execution
• Development Resources
See the Programming Guide for the full API
Basic Memory Management
Memory Spaces
CPU and GPU have separate memory spaces:
• Data is moved across the PCIe bus
• Use functions to allocate/set/copy memory on the GPU
• Very similar to corresponding C functions
Pointers are just addresses:
• Can't tell from the pointer value whether the address is on the CPU or GPU
• Must exercise care when dereferencing: dereferencing a CPU pointer on the GPU will likely crash, and vice versa
GPU Memory Allocation / Release
Host (CPU) manages device (GPU) memory:
• cudaMalloc(void **pointer, size_t nbytes)
• cudaMemset(void *pointer, int value, size_t count)
• cudaFree(void *pointer)

int n = 1024;
int nbytes = 1024*sizeof(int);
int *d_a = 0;
cudaMalloc( (void**)&d_a, nbytes );
cudaMemset( d_a, 0, nbytes );
cudaFree( d_a );
Data Copies
cudaMemcpy(void *dst, void *src, size_t nbytes, enum cudaMemcpyKind direction);
• returns after the copy is complete
• blocks the CPU thread until all bytes have been copied
• doesn't start copying until previous CUDA calls complete
enum cudaMemcpyKind:
• cudaMemcpyHostToDevice
• cudaMemcpyDeviceToHost
• cudaMemcpyDeviceToDevice
Non-blocking memcopies are also provided
Code Walkthrough 1
• Allocate CPU memory for n integers
• Allocate GPU memory for n integers
• Initialize GPU memory to 0s
• Copy from GPU to CPU
• Print the values
Code Walkthrough 1

#include <stdio.h>

int main()
{
    int dimx = 16;
    int num_bytes = dimx*sizeof(int);

    int *d_a=0, *h_a=0; // device and host pointers

    h_a = (int*)malloc(num_bytes);
    cudaMalloc( (void**)&d_a, num_bytes );

    if( 0==h_a || 0==d_a ) {
        printf("couldn't allocate memory\n");
        return 1;
    }

    cudaMemset( d_a, 0, num_bytes );
    cudaMemcpy( h_a, d_a, num_bytes, cudaMemcpyDeviceToHost );

    for(int i=0; i<dimx; i++)
        printf("%d ", h_a[i] );
    printf("\n");

    free( h_a );
    cudaFree( d_a );

    return 0;
}
Basic Kernels and Execution on GPU
CUDA Programming Model
Parallel code (a kernel) is launched and executed on a device by many threads:
• Threads are grouped into thread blocks
• Parallel code is written for a thread
• Each thread is free to execute a unique code path
• Built-in thread and block ID variables
Thread Hierarchy
Threads launched for a parallel section are partitioned into thread blocks:
• Grid = all blocks for a given launch
A thread block is a group of threads that can:
• Synchronize their execution
• Communicate via shared memory
IDs and Dimensions
Threads: 3D IDs, unique within a block
Blocks: 2D IDs, unique within a grid
Dimensions are set at launch time and can be unique for each grid
Built-in variables: threadIdx, blockIdx, blockDim, gridDim

(Diagram: a device running Grid 1 as a 3x2 array of blocks (0,0)-(2,1); Block (1,1) expanded into a 5x3 array of threads (0,0)-(4,2))
Code executed on GPU
A C function with some restrictions:
• Can only access GPU memory
• No variable number of arguments
• No static variables
• No recursion
Must be declared with a qualifier:
• __global__ : launched by the CPU, cannot be called from the GPU, must return void
• __device__ : called from other GPU functions, cannot be launched by the CPU
• __host__ : can be executed by the CPU
• __host__ and __device__ qualifiers can be combined (sample use: overloading operators)
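As an illustration of the combined qualifiers (a hypothetical example, not from the slides; clampf and clamp_kernel are made-up names), a small helper compiled for both host and device might look like:

```cuda
// Compiled twice by nvcc: once for the CPU, once for the GPU.
__host__ __device__ float clampf(float x, float lo, float hi)
{
    return x < lo ? lo : (x > hi ? hi : x);
}

__global__ void clamp_kernel(float *a, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        a[idx] = clampf(a[idx], 0.0f, 1.0f);  // device-side call
}
```

The same clampf can also be called from ordinary host code, which is what makes the combination useful for things like overloaded operators on small vector types.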
Code Walkthrough 2
• Build on Walkthrough 1
• Write a kernel to initialize integers
• Copy the result back to CPU
• Print the values
Kernel Code (executed on GPU)

__global__ void kernel( int *a )
{
    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    a[idx] = 7;
}
Launching kernels on GPU
Launch parameters:
• grid dimensions (up to 2D), dim3 type
• thread-block dimensions (up to 3D), dim3 type
• shared memory: number of bytes per block
  • for extern smem variables declared without size
  • optional, 0 by default
• stream ID
  • optional, 0 by default

dim3 grid(16, 16);
dim3 block(16, 16);
kernel<<<grid, block, 0, 0>>>(...);
kernel<<<32, 512>>>(...);
#include <stdio.h>

__global__ void kernel( int *a )
{
    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    a[idx] = 7;
}

int main()
{
    int dimx = 16;
    int num_bytes = dimx*sizeof(int);

    int *d_a=0, *h_a=0; // device and host pointers

    h_a = (int*)malloc(num_bytes);
    cudaMalloc( (void**)&d_a, num_bytes );

    if( 0==h_a || 0==d_a ) {
        printf("couldn't allocate memory\n");
        return 1;
    }

    cudaMemset( d_a, 0, num_bytes );

    dim3 grid, block;
    block.x = 4;
    grid.x = dimx / block.x;

    kernel<<<grid, block>>>( d_a );

    cudaMemcpy( h_a, d_a, num_bytes, cudaMemcpyDeviceToHost );

    for(int i=0; i<dimx; i++)
        printf("%d ", h_a[i] );
    printf("\n");

    free( h_a );
    cudaFree( d_a );

    return 0;
}
Kernel Variations and Output

__global__ void kernel( int *a )
{
    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    a[idx] = 7;
}
Output: 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7

__global__ void kernel( int *a )
{
    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    a[idx] = blockIdx.x;
}
Output: 0 0 0 0 1 1 1 1 2 2 2 2 3 3 3 3

__global__ void kernel( int *a )
{
    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    a[idx] = threadIdx.x;
}
Output: 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3
Code Walkthrough 3
• Build on Walkthrough 2
• Write a kernel to increment n×m integers
• Copy the result back to CPU
• Print the values
Kernel with 2D Indexing

__global__ void kernel( int *a, int dimx, int dimy )
{
    int ix = blockIdx.x*blockDim.x + threadIdx.x;
    int iy = blockIdx.y*blockDim.y + threadIdx.y;
    int idx = iy*dimx + ix;

    a[idx] = a[idx]+1;
}
int main()
{
    int dimx = 16;
    int dimy = 16;
    int num_bytes = dimx*dimy*sizeof(int);

    int *d_a=0, *h_a=0; // device and host pointers

    h_a = (int*)malloc(num_bytes);
    cudaMalloc( (void**)&d_a, num_bytes );

    if( 0==h_a || 0==d_a ) {
        printf("couldn't allocate memory\n");
        return 1;
    }

    cudaMemset( d_a, 0, num_bytes );

    dim3 grid, block;
    block.x = 4;
    block.y = 4;
    grid.x = dimx / block.x;
    grid.y = dimy / block.y;

    kernel<<<grid, block>>>( d_a, dimx, dimy );

    cudaMemcpy( h_a, d_a, num_bytes, cudaMemcpyDeviceToHost );

    for(int row=0; row<dimy; row++) {
        for(int col=0; col<dimx; col++)
            printf("%d ", h_a[row*dimx+col] );
        printf("\n");
    }

    free( h_a );
    cudaFree( d_a );

    return 0;
}
Blocks must be independent
Any possible interleaving of blocks should be valid:
• presumed to run to completion without pre-emption
• can run in any order
• can run concurrently OR sequentially
Blocks may coordinate but not synchronize:
• shared queue pointer: OK
• shared lock: BAD ... can easily deadlock
The independence requirement gives scalability
Blocks must be independent
• Thread blocks can run in any order
• Concurrently or sequentially
• Facilitates scaling of the same code across many devices
(Diagram: scalability)
Coordinating CPU and GPU Execution
Synchronizing GPU and CPU
All kernel launches are asynchronous:
• control returns to the CPU immediately
• the kernel starts executing once all previous CUDA calls have completed
Memcopies are synchronous:
• control returns to the CPU once the copy is complete
• the copy starts once all previous CUDA calls have completed
cudaThreadSynchronize():
• blocks until all previous CUDA calls complete
Asynchronous CUDA calls provide:
• non-blocking memcopies
• the ability to overlap memcopies and kernel execution
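A minimal sketch of the overlap pattern (hypothetical host code, not from the slides; h_in, d_a, d_b, nbytes, grid, block, and kernel are assumed to be set up already, with h_in allocated as page-locked memory via cudaMallocHost, which cudaMemcpyAsync requires):

```cuda
cudaStream_t s1, s2;
cudaStreamCreate(&s1);
cudaStreamCreate(&s2);

// Copy in stream s1 while an independent kernel runs in stream s2.
cudaMemcpyAsync(d_a, h_in, nbytes, cudaMemcpyHostToDevice, s1);
kernel<<<grid, block, 0, s2>>>(d_b);

// Wait for all device work before using the results on the CPU.
cudaThreadSynchronize();

cudaStreamDestroy(s1);
cudaStreamDestroy(s2);
```

Work in different streams may overlap; work within one stream still executes in order.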
CUDA Error Reporting to CPU
All CUDA calls return an error code:
• except kernel launches
• cudaError_t type
cudaError_t cudaGetLastError(void)
• returns the code for the last error ("no error" has a code)
char* cudaGetErrorString(cudaError_t code)
• returns a null-terminated character string describing the error

printf("%s\n", cudaGetErrorString( cudaGetLastError() ) );
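A common convenience built from these two calls (a typical pattern, not part of the slides; CUDA_CHECK is a made-up name) is a macro that wraps every runtime call:

```cuda
#include <stdio.h>
#include <stdlib.h>

// Print the error string and abort if a runtime call fails.
#define CUDA_CHECK(call)                                          \
    do {                                                          \
        cudaError_t err = (call);                                 \
        if (err != cudaSuccess) {                                 \
            printf("CUDA error at %s:%d: %s\n",                   \
                   __FILE__, __LINE__, cudaGetErrorString(err));  \
            exit(1);                                              \
        }                                                         \
    } while (0)

// Usage:
//   CUDA_CHECK( cudaMalloc((void**)&d_a, nbytes) );
//   kernel<<<grid, block>>>(d_a);
//   CUDA_CHECK( cudaGetLastError() );   // catches launch errors
```

Because kernel launches return no error code themselves, the cudaGetLastError() check after the launch is what surfaces launch failures.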
CUDA Event API
Events are inserted (recorded) into CUDA call streams
Usage scenarios:
• measure elapsed time for CUDA calls (clock-cycle precision)
• query the status of an asynchronous CUDA call
• block the CPU until CUDA calls prior to the event are completed
• see the asyncAPI sample in the CUDA SDK

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start, 0);
kernel<<<grid, block>>>(...);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
float et;
cudaEventElapsedTime(&et, start, stop);
cudaEventDestroy(start);
cudaEventDestroy(stop);
Device Management
The CPU can query and select GPU devices:
• cudaGetDeviceCount( int *count )
• cudaSetDevice( int device )
• cudaGetDevice( int *current_device )
• cudaGetDeviceProperties( cudaDeviceProp *prop, int device )
• cudaChooseDevice( int *device, cudaDeviceProp *prop )
Multi-GPU setup:
• device 0 is used by default
• one CPU thread can control one GPU
• multiple CPU threads can control the same GPU (calls are serialized by the driver)
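These calls compose into the usual enumeration loop (a sketch using only the API listed above):

```cuda
#include <stdio.h>

int main()
{
    int count = 0;
    cudaGetDeviceCount(&count);

    // Print the name and multiprocessor count of each device,
    // then select device 0 explicitly.
    for (int dev = 0; dev < count; dev++) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        printf("device %d: %s, %d multiprocessors\n",
               dev, prop.name, prop.multiProcessorCount);
    }
    cudaSetDevice(0);
    return 0;
}
```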
Shared Memory
Shared Memory
On-chip memory:
• 2 orders of magnitude lower latency than global memory
• an order of magnitude higher bandwidth than gmem
• 16 KB per multiprocessor
  • NVIDIA GPUs contain up to 30 multiprocessors
Allocated per thread block:
• accessible by any thread in the thread block
• not accessible to other thread blocks
Several uses:
• sharing data among threads in a thread block
• user-managed cache (reducing gmem accesses)
Using shared memory
Size known at compile time:

__global__ void kernel(...)
{
    ...
    __shared__ float sData[256];
    ...
}

int main(void)
{
    ...
    kernel<<<nBlocks, blockSize>>>(...);
    ...
}

Size known at kernel launch:

__global__ void kernel(...)
{
    ...
    extern __shared__ float sData[];
    ...
}

int main(void)
{
    ...
    smBytes = blockSize*sizeof(float);
    kernel<<<nBlocks, blockSize, smBytes>>>(...);
    ...
}
Example of Using Shared Memory
Applying a 1D stencil:
• 1D data
• for each output element, sum all elements within a radius
For example, radius = 3:
• add 7 input elements (the element itself plus 3 on each side)

(Diagram: radius elements to the left and right of each output element)
Implementation with Shared Memory
• 1D thread blocks (partition the output)
• Each thread block outputs BLOCK_DIMX elements
• Read input from gmem to smem
  • needs BLOCK_DIMX + 2*RADIUS input elements
• Compute
• Write output to gmem

(Diagram: the input elements corresponding to the output, as many as there are threads in a thread block, flanked by a "halo" of RADIUS elements on each side)
Kernel code
__global__ void stencil( int *output, int *input, int dimx, int dimy )
{
    __shared__ int s_a[BLOCK_DIMX+2*RADIUS];

    int global_ix = blockIdx.x*blockDim.x + threadIdx.x;
    int local_ix  = threadIdx.x + RADIUS;

    s_a[local_ix] = input[global_ix];

    // the first RADIUS threads also load the left and right halos
    if ( threadIdx.x < RADIUS ) {
        s_a[local_ix - RADIUS] = input[global_ix - RADIUS];
        s_a[local_ix + BLOCK_DIMX] = input[global_ix + BLOCK_DIMX];
    }
    __syncthreads();

    int value = 0;
    for( int offset = -RADIUS; offset<=RADIUS; offset++ )
        value += s_a[ local_ix + offset ];

    output[global_ix] = value;
}
Thread Synchronization Function
void __syncthreads();
• Synchronizes all threads in a thread block
  • needed since threads are scheduled at run time
• Once all threads have reached this point, execution resumes normally
• Used to avoid RAW / WAR / WAW hazards when accessing shared memory
• Should be used in conditional code only if the conditional is uniform across the entire thread block
Memory Model Review
Local storage:
• each thread has its own local storage
• mostly registers (managed by the compiler)
• data lifetime = thread lifetime
Shared memory:
• each thread block has its own shared memory
• accessible only by threads within that block
• data lifetime = block lifetime
Global (device) memory:
• accessible by all threads as well as the host (CPU)
• data lifetime = from allocation to deallocation
Memory Model Review
(Diagram: each thread has per-thread local storage; each block has per-block shared memory)
Memory Model Review
(Diagram: sequential kernels, Kernel 0 then Kernel 1, both accessing per-device global memory)
Memory Model Review
(Diagram: host memory connected to device 0 memory and device 1 memory via cudaMemcpy())