
Page 1

Programming with CUDA
WS 08/09

Lecture 2
Tue, 28 Oct 2008

Page 2

Previously

Organization
– Course structure, timings, locations
– People, reference material

Brief intro to GPGPU, CUDA
Sign up for access to the CUDA machine
– Check today during the exercise

Page 3

Today

Grading
Course website unchanged
– http://theinf2.informatik.uni-jena.de/For+Students/CUDA.html
The CUDA programming model
– exercise

Page 4

Grading

Need 50% of the marks from the exercises to qualify for the final project.

The final grade will be determined by an exam based on the project.

Page 5

Recap ...

GPU
– Graphics Processing Unit
– Handles the values of pixels displayed on screen
Highly parallel computation
– Optimized for parallel computations

Page 6

Recap ...

GPGPU
– General Purpose computing on GPUs
– Many non-graphics applications can be parallelized
– These can then be ported to a GPU implementation

Page 7

Page 8

Recap ...

CUDA – Compute Unified Device Architecture
– Software: a minimal extension to the C programming language
– Hardware: supports the software
Thus, CUDA enables
– GPGPU for non-graphics people

Page 9

CUDA

Page 10

The CUDA Programming Model

Page 11

CUDA

Page 12

GPU as co-processor

The application runs on the CPU (the host)
Compute-intensive parts are delegated to the GPU (the device)
These parts are written as C functions (kernels)
A kernel is executed on the device simultaneously by N threads

Page 13

Page 14

GPU as co-processor

Compute-intensive tasks are defined as kernels
The host delegates kernels to the device
The device executes a kernel with N parallel threads
Each thread has a thread ID
The thread ID is accessible in a kernel via the threadIdx variable

Page 15

Example: Vector addition

CPU version
– A single thread (Thread 1) performs all N additions
– Total time = N * (time for 1 addition)
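
For comparison, a sequential C version of this slide's idea is sketched below (the function name vecAddCPU is an assumption, not from the slides): a single thread loops over all N elements, so the total time grows as N times the cost of one addition.

    // Sequential vector addition: one thread performs all N additions
    void vecAddCPU(const float* A, const float* B, float* C, int N)
    {
        for (int i = 0; i < N; ++i)
            C[i] = A[i] + B[i];
    }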

Page 16

Example: Vector addition

GPU version
– N threads (Thread 1 ... Thread N) each perform one addition in parallel
– Total time = time for 1 addition

Page 17

CUDA kernel

Example: definition

    __global__ void vecAdd(float* A, float* B, float* C)
    {
        int i = threadIdx.x;
        C[i] = A[i] + B[i];
    }

Page 18

CUDA kernel

Example: invocation

    int main() {
        // init host vectors, size N: h_A, h_B, h_C
        // init device
        // copy to device: h_A -> d_A, h_B -> d_B, h_C -> d_C
        vecAdd<<<1, N>>>(d_A, d_B, d_C);
        // copy to host: d_C -> h_C
        // do stuff
        // free host variables
        // free device variables
    }
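
For reference, a minimal runnable version of the sketch above might look like the following. It relies only on standard CUDA runtime calls (cudaMalloc, cudaMemcpy, cudaFree); the vector size N = 256 and the host initialization values are assumptions made purely for illustration.

    #include <cuda_runtime.h>
    #include <stdlib.h>

    __global__ void vecAdd(float* A, float* B, float* C)
    {
        int i = threadIdx.x;
        C[i] = A[i] + B[i];
    }

    int main()
    {
        const int N = 256;                 // assumed size: one block of N threads
        size_t bytes = N * sizeof(float);

        // init host vectors h_A, h_B, h_C
        float* h_A = (float*)malloc(bytes);
        float* h_B = (float*)malloc(bytes);
        float* h_C = (float*)malloc(bytes);
        for (int i = 0; i < N; ++i) { h_A[i] = (float)i; h_B[i] = 2.0f * i; }

        // init device vectors d_A, d_B, d_C
        float *d_A, *d_B, *d_C;
        cudaMalloc((void**)&d_A, bytes);
        cudaMalloc((void**)&d_B, bytes);
        cudaMalloc((void**)&d_C, bytes);

        // copy to device: h_A -> d_A, h_B -> d_B
        cudaMemcpy(d_A, h_A, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(d_B, h_B, bytes, cudaMemcpyHostToDevice);

        // launch the kernel with 1 block of N threads
        vecAdd<<<1, N>>>(d_A, d_B, d_C);

        // copy to host: d_C -> h_C
        cudaMemcpy(h_C, d_C, bytes, cudaMemcpyDeviceToHost);

        // free device variables
        cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);

        // free host variables
        free(h_A); free(h_B); free(h_C);
        return 0;
    }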

Page 19

Thread organization

Threads are organized in blocks.
A block can be a 1D, 2D or 3D array of threads
– threadIdx is a 3-component vector
– The block shape depends on how the kernel is called

Page 20

Page 21

Thread organization

Example of a 1D block
Invoke (in main):

    int N;
    // assign some value to N
    vecAdd<<<1, N>>>(d_A, d_B, d_C);

Access (in kernel):

    int i = threadIdx.x;

Page 22

Thread organization

Example of a 2D block
Invoke (in main):

    dim3 blockDimension(N, N);  // N pre-assigned
    matAdd<<<1, blockDimension>>>(d_A, d_B, d_C);

Access (in kernel):

    __global__ void matAdd(float A[N][N], float B[N][N], float C[N][N])
    {
        int i = threadIdx.x;
        int j = threadIdx.y;
        C[i][j] = A[i][j] + B[i][j];
    }

Page 23

Thread organization

Similarly, for a 3D block
Invoke (in main):

    dim3 blockDimension(N, N, 3);  // N pre-assigned
    matAdd<<<1, blockDimension>>>(d_A, d_B, d_C);

Access (in kernel):

    int i = threadIdx.x;
    int j = threadIdx.y;
    int k = threadIdx.z;

Page 24

Thread organization

Each thread in a block has a unique thread ID.
The thread ID is NOT the same as threadIdx
– 1D block of dim Dx: thread index x, thread ID = x
– 2D block of dim (Dx, Dy): thread index (x, y), thread ID = x + y*Dx
– 3D block of dim (Dx, Dy, Dz): thread index (x, y, z), thread ID = x + y*Dx + z*Dx*Dy
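
As an illustration (not part of the slides), the general formula can be evaluated inside a kernel using the built-in threadIdx and blockDim variables, where blockDim holds (Dx, Dy, Dz):

    // Linear thread ID within a block: ID = x + y*Dx + z*Dx*Dy
    __device__ unsigned int linearThreadId()
    {
        return threadIdx.x
             + threadIdx.y * blockDim.x
             + threadIdx.z * blockDim.x * blockDim.y;
    }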

Page 25

Thread organization

All threads in a block have a shared memory
– Very fast access
For efficient/safe cooperation between threads, use __syncthreads()
– All threads complete execution up to that point, and then resume together
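
A minimal sketch of why the barrier matters (the kernel name and the fixed block size of 256 are assumptions for illustration): each thread first writes its own element into shared memory, and __syncthreads() guarantees that all writes have completed before any thread reads an element written by a neighbour.

    __global__ void shiftLeft(float* in, float* out)
    {
        __shared__ float s[256];       // assumes the block has at most 256 threads
        int i = threadIdx.x;

        s[i] = in[i];                  // each thread loads its own element
        __syncthreads();               // wait until every element is in shared memory

        // now safe to read an element written by a different thread
        out[i] = s[(i + 1) % blockDim.x];
    }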

Page 26

Memory available to threads

Kernel definition:

    __global__ void vecAdd(float* A, float* B, float* C)
    // A, B, C reside in global memory

Global memory is slower than shared memory

Page 27

Memory available to threads

Good idea:
– Global -> shared on entry
– Shared -> global on exit

    __global__ void doStuff(float* in, float* out) {
        // init SData, shared memory
        // copy in -> SData
        // do stuff with SData
        // copy SData -> out
    }
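
One way those comments might be filled in is sketched below; the block size of 256, the array name SData, and the "do stuff" step (squaring each element) are assumptions for illustration only.

    __global__ void doStuff(float* in, float* out)
    {
        __shared__ float SData[256];     // shared memory, one element per thread
        int i = threadIdx.x;

        SData[i] = in[i];                // copy in -> SData (global -> shared on entry)
        __syncthreads();                 // barrier needed if threads read each other's elements

        SData[i] = SData[i] * SData[i];  // do stuff with SData

        out[i] = SData[i];               // copy SData -> out (shared -> global on exit)
    }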

Page 28

Page 29

Device Compute Capability

The compute capability of a CUDA device is a number of the form Major.Minor
– Major is the major revision number
  Fundamental change in card architecture
– Minor is the minor revision number
  Incremental changes within the major revision

A device is CUDA-ready if its compute capability is >= 1.0
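
For reference, the compute capability can be queried at run time through the CUDA runtime API; a minimal sketch, assuming device 0:

    #include <cuda_runtime.h>
    #include <stdio.h>

    int main()
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);   // query device 0
        printf("Compute capability: %d.%d\n", prop.major, prop.minor);
        return 0;
    }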

Page 30

All for today

Next time
– Grids of thread blocks
– Memory limitations
– The hardware model

Page 31

On to exercises!