TRANSCRIPT

Programming with CUDA, WS 08/09
Lecture 2
Tue, 28 Oct 2008
Previously

Organization
– Course structure, timings, locations
– People, reference material

Brief intro to GPGPU, CUDA

Sign up for access to CUDA machine
– Check today during exercise
Today

Grading

Course website unchanged
– http://theinf2.informatik.uni-jena.de/For+Students/CUDA.html

The CUDA programming model
– exercise
Grading

Need 50% of marks from exercises to qualify for the final project

Final grade will be determined by an exam based on the project
Recap ...

GPU
– Graphics Processing Unit
– Handles the values of the pixels displayed on screen
– Highly parallel computation
– Optimized for parallel computations
Recap ...

GPGPU
– General-Purpose computing on the GPU
– Many non-graphics applications can be parallelized
– Can then be ported to a GPU implementation
Recap ...Recap ...
CUDA – Compute Unified Device CUDA – Compute Unified Device ArchitectureArchitecture– Software: minimal extension to C Software: minimal extension to C
programming languageprogramming language– Hardware: supports the softwareHardware: supports the software
Thus, CUDA enablesThus, CUDA enables– GPGPU for non-graphics peopleGPGPU for non-graphics people
CUDA

The CUDA Programming Model
CUDA

GPU as co-processor

The application runs on the CPU (host)

Compute-intensive parts are delegated to the GPU (device)

These parts are written as C functions (kernels)

The kernel is executed on the device simultaneously by N threads
GPU as co-processor

Compute-intensive tasks are defined as kernels

The host delegates kernels to the device

The device executes a kernel with N parallel threads

Each thread has a thread ID

The thread ID is accessible in a kernel via the threadIdx variable
Example: Vector addition

CPU version

Total time = N * time for 1 addition
[Figure: a single thread (Thread 1) performs all N additions sequentially]
Example: Vector addition

GPU version

Total time = time for 1 addition
[Figure: Threads 1, 2, 3, ..., N each perform one addition in parallel]
CUDA kernel

Example: definition

__global__ void vecAdd(float* A, float* B, float* C)
{
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}
CUDA kernel

Example: invocation

int main() {
    // init host vectors, size N: h_A, h_B, h_C
    // init device
    // copy to device: h_A -> d_A, h_B -> d_B, h_C -> d_C
    vecAdd<<<1, N>>>(d_A, d_B, d_C);
    // copy to host: d_C -> h_C
    // do stuff
    // free host variables
    // free device variables
}
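A minimal, complete version of the skeleton above might look as follows. It is a sketch, not the lecture's reference code: the names h_A/d_A etc. follow the slide, N = 256 is an arbitrary choice, and error checking is omitted for brevity.

```cuda
#include <cstdio>
#include <cstdlib>

#define N 256

__global__ void vecAdd(float* A, float* B, float* C)
{
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}

int main()
{
    size_t size = N * sizeof(float);

    // init host vectors, size N: h_A, h_B, h_C
    float *h_A = (float*)malloc(size);
    float *h_B = (float*)malloc(size);
    float *h_C = (float*)malloc(size);
    for (int i = 0; i < N; ++i) { h_A[i] = (float)i; h_B[i] = 2.0f * i; }

    // init device vectors d_A, d_B, d_C
    float *d_A, *d_B, *d_C;
    cudaMalloc(&d_A, size);
    cudaMalloc(&d_B, size);
    cudaMalloc(&d_C, size);

    // copy to device: h_A -> d_A, h_B -> d_B
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

    // launch the kernel: 1 block of N threads
    vecAdd<<<1, N>>>(d_A, d_B, d_C);

    // copy to host: d_C -> h_C
    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);

    // do stuff: C[i] should equal i + 2i = 3i
    printf("h_C[1] = %f\n", h_C[1]);

    // free device variables
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    // free host variables
    free(h_A); free(h_B); free(h_C);
    return 0;
}
```

Note that d_C never needs a host-to-device copy: the kernel only writes to it, so copying h_C over (as the slide's comment suggests) is harmless but unnecessary.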
Thread organization

Threads are organized in blocks.

A block can be a 1D, 2D, or 3D array of threads
– threadIdx is a 3-component vector
– Depends on how the kernel is called
Thread organization

Example of a 1D block

Invoke (in main):

int N;
// assign some value to N
vecAdd<<<1, N>>>(d_A, d_B, d_C);

Access (in kernel):

int i = threadIdx.x;
Thread organization

Example of a 2D block

Invoke (in main):

dim3 blockDimension(N, N); // N pre-assigned
matAdd<<<1, blockDimension>>>(d_A, d_B, d_C);

Access (in kernel):

__global__ void matAdd(float A[N][N], float B[N][N], float C[N][N])
{
    int i = threadIdx.x;
    int j = threadIdx.y;
    C[i][j] = A[i][j] + B[i][j];
}
Thread organization

Similarly, for a 3D block

Invoke (in main):

dim3 blockDimension(N, N, 3); // N pre-assigned
matAdd<<<1, blockDimension>>>(d_A, d_B, d_C);

Access (in kernel):

int i = threadIdx.x;
int j = threadIdx.y;
int k = threadIdx.z;
Thread organization

Each thread in a block has a unique thread ID.

The thread ID is NOT the same as threadIdx
– 1D block of dim Dx: thread index x has thread ID = x
– 2D block of dim (Dx, Dy): thread index (x, y) has thread ID = x + y*Dx
– 3D block of dim (Dx, Dy, Dz): thread index (x, y, z) has thread ID = x + y*Dx + z*Dx*Dy
Thread organization

All threads in a block have a shared memory
– Very fast access

For efficient/safe cooperation between threads, use __syncthreads()
– All threads complete execution up to that point, and then resume together
Memory available to threads

Kernel definition:

__global__ void vecAdd(float* A, float* B, float* C)
// A, B, C reside in global memory

Global memory is slower than shared memory
Memory available to threads

Good idea:
– global -> shared on entry
– shared -> global on exit

__global__ void doStuff(float* in, float* out) {
    // init SData, shared memory
    // copy in -> SData
    // do stuff with SData
    // copy SData -> out
}
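A concrete instance of this pattern might look as follows. This is a sketch, not from the lecture: the kernel name reverseInBlock, the array name sData, and the choice of "stuff" (reversing the input within one block) are all assumptions for illustration.

```cuda
#include <cstdio>

#define N 64

// Pattern from the slide: stage data in fast shared memory, synchronize,
// then compute and write back to global memory. Reversal is used as the
// example computation because each thread must read an element that a
// DIFFERENT thread loaded -- without __syncthreads() this would be a race.
__global__ void reverseInBlock(float* in, float* out)
{
    __shared__ float sData[N];     // one element per thread
    int i = threadIdx.x;
    sData[i] = in[i];              // global -> shared on entry
    __syncthreads();               // wait until every thread has loaded
    out[i] = sData[N - 1 - i];     // safely read another thread's element
}

int main()
{
    float h_in[N], h_out[N];
    for (int i = 0; i < N; ++i) h_in[i] = (float)i;

    float *d_in, *d_out;
    cudaMalloc(&d_in, N * sizeof(float));
    cudaMalloc(&d_out, N * sizeof(float));
    cudaMemcpy(d_in, h_in, N * sizeof(float), cudaMemcpyHostToDevice);

    reverseInBlock<<<1, N>>>(d_in, d_out);

    cudaMemcpy(h_out, d_out, N * sizeof(float), cudaMemcpyDeviceToHost);
    printf("h_out[0] = %f\n", h_out[0]);  // should be N-1

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```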
Device Compute Capability

The compute capability of a CUDA device is a number of the form Major.Minor
– Major is the major revision number
  – Fundamental change in card architecture
– Minor is the minor revision number
  – Incremental changes within the major revision

A device is CUDA-ready if its compute capability is >= 1.0
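The compute capability can be queried at runtime with the CUDA runtime API: cudaGetDeviceProperties fills a cudaDeviceProp struct whose major and minor fields form the Major.Minor number. A short sketch:

```cuda
#include <cstdio>

// List every CUDA device in the system with its compute capability.
int main()
{
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        printf("Device %d: %s, compute capability %d.%d\n",
               dev, prop.name, prop.major, prop.minor);
    }
    return 0;
}
```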
All for today

Next time
– Grids of thread blocks
– Memory limitations
– The hardware model

On to exercises!