TRANSCRIPT

Programming with CUDA, WS 08/09
Lecture 2
Tue, 28 Oct 2008
Previously

Organization
– Course structure, timings, locations
– People, reference material

Brief intro to GPGPU, CUDA

Sign up for access to CUDA machine
– Check today during exercise
Today

Grading

Course website unchanged
– http://theinf2.informatik.uni-jena.de/For+Students/CUDA.html

The CUDA programming model
– exercise
Grading

Need 50% of marks from exercises to qualify for the final project

Final grade will be determined by an exam based on the project
Recap ...

GPU
– Graphics Processing Unit
– Handles the values of the pixels displayed on screen
– Highly parallel computation
– Optimized for parallel computations
Recap ...

GPGPU
– General-Purpose computing on the GPU
– Many non-graphics applications can be parallelized
– Can then be ported to a GPU implementation
Recap ...Recap ...
CUDA – Compute Unified Device CUDA – Compute Unified Device ArchitectureArchitecture– Software: minimal extension to C Software: minimal extension to C
programming languageprogramming language– Hardware: supports the softwareHardware: supports the software
Thus, CUDA enablesThus, CUDA enables– GPGPU for non-graphics peopleGPGPU for non-graphics people
CUDA

The CUDA Programming Model
CUDA

GPU as co-processor

The application runs on the CPU (host)

Compute-intensive parts are delegated to the GPU (device)

These parts are written as C functions (kernels)

The kernel is executed on the device simultaneously by N threads
GPU as co-processor

Compute-intensive tasks are defined as kernels

The host delegates kernels to the device

The device executes a kernel with N parallel threads

Each thread has a thread ID

The thread ID is accessible in a kernel via the threadIdx variable
Example: Vector addition

CPU version

Total time = N * time for 1 addition
[Figure: a single thread (Thread 1) performs all N additions sequentially]
Example: Vector addition

GPU version

Total time = time for 1 addition
[Figure: Threads 1, 2, 3, ..., N each perform one addition in parallel]
CUDA kernel

Example: definition

__global__ void vecAdd(float* A, float* B, float* C)
{
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}
CUDA kernel

Example: invocation

int main() {
    // init host vectors, size N: h_A, h_B, h_C
    // init device
    // copy to device: h_A -> d_A, h_B -> d_B, h_C -> d_C
    vecAdd<<<1, N>>>(d_A, d_B, d_C);
    // copy to host: d_C -> h_C
    // do stuff
    // free host variables
    // free device variables
}
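A minimal, complete version of the skeleton above might look as follows. It is a sketch, not the lecture's reference code: the names h_A/d_A etc. follow the slide, N = 256 is an arbitrary choice, and error checking is omitted for brevity.

```cuda
#include <cstdio>
#include <cstdlib>

#define N 256

__global__ void vecAdd(float* A, float* B, float* C)
{
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}

int main()
{
    size_t size = N * sizeof(float);

    // init host vectors, size N: h_A, h_B, h_C
    float *h_A = (float*)malloc(size);
    float *h_B = (float*)malloc(size);
    float *h_C = (float*)malloc(size);
    for (int i = 0; i < N; ++i) { h_A[i] = (float)i; h_B[i] = 2.0f * i; }

    // init device vectors d_A, d_B, d_C
    float *d_A, *d_B, *d_C;
    cudaMalloc(&d_A, size);
    cudaMalloc(&d_B, size);
    cudaMalloc(&d_C, size);

    // copy to device: h_A -> d_A, h_B -> d_B
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

    // launch the kernel: 1 block of N threads
    vecAdd<<<1, N>>>(d_A, d_B, d_C);

    // copy to host: d_C -> h_C
    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);

    // do stuff: C[i] should equal i + 2i = 3i
    printf("h_C[1] = %f\n", h_C[1]);

    // free device variables
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    // free host variables
    free(h_A); free(h_B); free(h_C);
    return 0;
}
```

Note that d_C never needs a host-to-device copy: the kernel only writes to it, so copying h_C over (as the slide's comment suggests) is harmless but unnecessary.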
Thread organization

Threads are organized in blocks.

A block can be a 1D, 2D, or 3D array of threads
– threadIdx is a 3-component vector
– Depends on how the kernel is called
Thread organization

Example of a 1D block

Invoke (in main):

int N;
// assign some value to N
vecAdd<<<1, N>>>(d_A, d_B, d_C);

Access (in kernel):

int i = threadIdx.x;
Thread organization

Example of a 2D block

Invoke (in main):

dim3 blockDimension(N, N); // N pre-assigned
matAdd<<<1, blockDimension>>>(d_A, d_B, d_C);

Access (in kernel):

__global__ void matAdd(float A[N][N], float B[N][N], float C[N][N])
{
    int i = threadIdx.x;
    int j = threadIdx.y;
    C[i][j] = A[i][j] + B[i][j];
}
Thread organization

Similarly, for a 3D block

Invoke (in main):

dim3 blockDimension(N, N, 3); // N pre-assigned
matAdd<<<1, blockDimension>>>(d_A, d_B, d_C);

Access (in kernel):

int i = threadIdx.x;
int j = threadIdx.y;
int k = threadIdx.z;
Thread organization

Each thread in a block has a unique thread ID.

The thread ID is NOT the same as threadIdx
– 1D block of dim Dx: thread index x has thread ID = x
– 2D block of dim (Dx, Dy): thread index (x, y) has thread ID = x + y*Dx
– 3D block of dim (Dx, Dy, Dz): thread index (x, y, z) has thread ID = x + y*Dx + z*Dx*Dy
Thread organization

All threads in a block have a shared memory
– Very fast access

For efficient/safe cooperation between threads, use __syncthreads()
– All threads complete execution up to that point, and then resume together
Memory available to threads

Kernel definition:

__global__ void vecAdd(float* A, float* B, float* C)
// A, B, C reside in global memory

Global memory is slower than shared memory
Memory available to threads

Good idea:
– global -> shared on entry
– shared -> global on exit

__global__ void doStuff(float* in, float* out) {
    // init SData, shared memory
    // copy in -> SData
    // do stuff with SData
    // copy SData -> out
}
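A concrete instance of this pattern might look as follows. This is a sketch, not from the lecture: the kernel name reverseInBlock, the array name sData, and the choice of "stuff" (reversing the input within one block) are all assumptions for illustration.

```cuda
#include <cstdio>

#define N 64

// Pattern from the slide: stage data in fast shared memory, synchronize,
// then compute and write back to global memory. Reversal is used as the
// example computation because each thread must read an element that a
// DIFFERENT thread loaded -- without __syncthreads() this would be a race.
__global__ void reverseInBlock(float* in, float* out)
{
    __shared__ float sData[N];     // one element per thread
    int i = threadIdx.x;
    sData[i] = in[i];              // global -> shared on entry
    __syncthreads();               // wait until every thread has loaded
    out[i] = sData[N - 1 - i];     // safely read another thread's element
}

int main()
{
    float h_in[N], h_out[N];
    for (int i = 0; i < N; ++i) h_in[i] = (float)i;

    float *d_in, *d_out;
    cudaMalloc(&d_in, N * sizeof(float));
    cudaMalloc(&d_out, N * sizeof(float));
    cudaMemcpy(d_in, h_in, N * sizeof(float), cudaMemcpyHostToDevice);

    reverseInBlock<<<1, N>>>(d_in, d_out);

    cudaMemcpy(h_out, d_out, N * sizeof(float), cudaMemcpyDeviceToHost);
    printf("h_out[0] = %f\n", h_out[0]);  // should be N-1

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```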
Device Compute Capability

The compute capability of a CUDA device is a number of the form Major.Minor
– Major is the major revision number
  – Fundamental change in card architecture
– Minor is the minor revision number
  – Incremental changes within the major revision

A device is CUDA-ready if its compute capability is >= 1.0
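The compute capability can be queried at runtime with the CUDA runtime API: cudaGetDeviceProperties fills a cudaDeviceProp struct whose major and minor fields form the Major.Minor number. A short sketch:

```cuda
#include <cstdio>

// List every CUDA device in the system with its compute capability.
int main()
{
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        printf("Device %d: %s, compute capability %d.%d\n",
               dev, prop.name, prop.major, prop.minor);
    }
    return 0;
}
```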
All for today

Next time
– Grids of thread blocks
– Memory limitations
– The hardware model

On to exercises!