Training Program on GPU Programming with CUDA
31st July, 7th Aug, 14th Aug 2011
Sanath Jayasena
CUDA Teaching Center @ UoM

Day 1, Session 2
CUDA Programming Model, CUDA Threads
Outline for Day 1 Session 2

CUDA Programming Model, CUDA Threads
• Data Parallelism
• CUDA Program Structure
• Memory Model & Data Transfer (Brief)
• Kernel Functions & Threading
• (Discussion with Example: Matrix Multiplication)
July-Aug 2011, CUDA Training Program
Data Parallelism

• Data parallelism
– A problem/program property
– Many arithmetic operations can be safely performed on the data structures simultaneously
– Example: matrix multiplication (next slide)
• CUDA devices can exploit data parallelism to accelerate execution of applications
Example: Matrix Multiplication

(Figure: matrices M, N and P, each of size width × width)

P = M · N
• Each element in P is computed as a dot product between a row of M and a column of N
• All elements in P can be computed independently and simultaneously
CUDA Program Structure
• A CUDA program consists of one or more phases, executed on either the host (CPU) or a device (GPU), and supplied as a single source code
• Phases with little or no data parallelism → host code
– ANSI C, compiled with the standard compiler
• Phases with significant data parallelism → device code
– ANSI C extended with keywords to specify kernels and data structures
• The NVIDIA C compiler (nvcc) separates the two during compilation
Execution of a CUDA Program

(Figure: serial host code alternating with kernel invocations, each generating a grid of device threads)
Execution of a CUDA Program
• Execution starts with the host (CPU)
• When a kernel is invoked, execution moves to the device (GPU)
– A large number of threads is generated
– Grid: the collection of all threads generated by a kernel invocation
– (The previous slide shows two grids of threads)
• Once all threads in a grid complete execution, the grid terminates and execution continues on the host
Example: Matrix Multiplication

int main(void)
{
    // 1. Allocate and initialize matrices M, N, P
    //    I/O to read the input matrices M and N
    ...

    // 2. M * N on the device
    MatrixMulOnDevice(M, N, P, width);

    // 3. I/O to write the output matrix P
    //    Free matrices M, N, P
    ...

    return 0;
}

A simple CUDA host code skeleton for matrix multiplication
CUDA Device Memory Model

• Host and devices have separate memory spaces
– E.g., hardware cards with their own DRAM
• To execute a kernel on a device
– Need to allocate memory on the device
– Transfer data: host memory → device memory
• After device execution
– Transfer results: device memory → host memory
– Free device memory that is no longer needed
CUDA Device Memory Model

(Figure: diagram of the host and device memory spaces)
CUDA API: Memory Mgt.

(Figure: table of the memory allocation functions, cudaMalloc() and cudaFree())

CUDA API: Memory Mgt.
• Example

float *Md;
int size = Width * Width * sizeof(float);
cudaMalloc((void**)&Md, size);   // allocate space for Md on the device
...
cudaFree(Md);                    // free it when no longer needed
CUDA API: Data Transfer

(Figure: the cudaMemcpy() data transfer function)
Example: Matrix Multiplication

(Code shown as an image in the original slide)
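Since the slide's code did not survive the transcript, here is a hedged sketch of what a MatrixMulOnDevice() host function would look like at this point in the material, combining the memory-management and data-transfer calls just introduced (the exact code on the original slide may differ):

```cuda
// Host-side wrapper: allocate device memory, copy inputs over,
// launch the kernel, copy the result back, free device memory.
void MatrixMulOnDevice(float *M, float *N, float *P, int Width)
{
    int size = Width * Width * sizeof(float);
    float *Md, *Nd, *Pd;

    // 1. Allocate device memory and transfer M and N to the device
    cudaMalloc((void**)&Md, size);
    cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);
    cudaMalloc((void**)&Nd, size);
    cudaMemcpy(Nd, N, size, cudaMemcpyHostToDevice);
    cudaMalloc((void**)&Pd, size);

    // 2. Launch the kernel: one block of Width x Width threads
    dim3 dimBlock(Width, Width, 1);
    dim3 dimGrid(1, 1, 1);
    MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);

    // 3. Transfer the result back and free device memory
    cudaMemcpy(P, Pd, size, cudaMemcpyDeviceToHost);
    cudaFree(Md); cudaFree(Nd); cudaFree(Pd);
}
```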
Kernel Functions & Threading
• A kernel function specifies the code to be executed by all threads of a parallel phase
– All threads of a parallel phase execute the same code → single-program multiple-data (SPMD), a popular programming style for parallel computing
• Need a mechanism to
– Allow threads to distinguish themselves
– Direct themselves to the specific parts of the data they are supposed to work on
Kernel Functions & Threading
• Keywords threadIdx.x and threadIdx.y
– Thread indices of a thread
– Allow a thread to identify itself at runtime (by accessing hardware registers associated with it)
• Can refer to a thread as Thread(threadIdx.x, threadIdx.y)
• Thread indices reflect a multi-dimensional organization for threads
Example: Matrix Multiplication Kernel

(Kernel code shown as an image in the original slide; see the next slide for more details on accessing the relevant data)
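The kernel image itself was lost in the transcript; a sketch of the single-block kernel that the surrounding slides describe (names Md, Nd, Pd and Width follow the surrounding text; the exact code in the original slide may differ):

```cuda
// Each thread computes one element of Pd, identified by (threadIdx.x, threadIdx.y)
__global__ void MatrixMulKernel(float *Md, float *Nd, float *Pd, int Width)
{
    int tx = threadIdx.x;   // column of Pd this thread computes
    int ty = threadIdx.y;   // row of Pd this thread computes

    float Pvalue = 0;
    for (int k = 0; k < Width; ++k)           // dot product: row ty of Md, column tx of Nd
        Pvalue += Md[ty * Width + k] * Nd[k * Width + tx];

    Pd[ty * Width + tx] = Pvalue;             // write the single result element
}
```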
Thread Indices & Accessing Data Relevant to a Thread

(Figure: matrices Md, Nd and Pd, each width × width, annotated with the thread coordinates tx, ty and the memory layout of Pd)

• How matrix Pd would be laid out in memory (as it is a 1-D array): row 0, then row 1, and so on; element (ty, tx) is at offset ty * width + tx
• Each thread uses tx, ty to identify the relevant row of Md, the column of Nd and the element of Pd in the for loop
• E.g., Thread(2,3) will perform the dot product between row 2 of Md and column 3 of Nd and write the result into element (2,3) of Pd
Threading & Grids
• When a kernel is invoked/launched, it is executed as a grid of parallel threads
• A CUDA thread grid can have millions of lightweight GPU threads per kernel invocation
– To fully utilize the hardware, enough threads are required → large data parallelism is required
• Threads in a grid have a two-level hierarchy
– A grid consists of 1 or more thread blocks
– All blocks in a grid have the same number of threads
CUDA Thread Organization

(Figure: a grid of thread blocks, each block a multi-dimensional array of threads)
Threading with Grids & Blocks
• Each thread block has a unique 2-D coordinate, given by the CUDA keywords blockIdx.x and blockIdx.y
– All blocks must have the same structure and number of threads
• Each block has a 3-D array of threads, up to a total of 1024 threads max
– Coordinates of threads in a block are defined by the indices threadIdx.x, threadIdx.y, threadIdx.z
– (Not all applications will use all 3 dimensions)
Our Example: Matrix Multiplication
• The kernel shown earlier (Example: Matrix Multiplication Kernel)
– Can only use one thread block
– The block is organized as a 2-D array of threads
• The code can compute a product matrix Pd of only up to 1024 elements
– As a block can have a max of 1024 threads
– Each thread computes one element of Pd
– Is this sufficient / acceptable?
Our Example: Matrix Multiplication
• When the host code invokes the kernel, the grid and block dimensions are set by passing them as parameters
• Example

// Setup the execution configuration
dim3 dimBlock(16, 16, 1);   // Width = 16, as an example
dim3 dimGrid(1, 1, 1);      // last 1 ignored
// Launch the device computation threads!
MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, 16);
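To go beyond the one-block (1024-element) limit raised on the previous slide, the blockIdx keywords can be combined with the thread indices so that many blocks tile the output matrix. A hedged sketch, not code from the original slides (TILE_WIDTH and the kernel name are illustrative):

```cuda
#define TILE_WIDTH 16

// Each thread still computes one Pd element; blockIdx extends the
// reach beyond a single block of at most 1024 threads.
__global__ void MatrixMulKernelMultiBlock(float *Md, float *Nd, float *Pd, int Width)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;

    if (row < Width && col < Width) {   // guard for matrices not divisible by TILE_WIDTH
        float Pvalue = 0;
        for (int k = 0; k < Width; ++k)
            Pvalue += Md[row * Width + k] * Nd[k * Width + col];
        Pd[row * Width + col] = Pvalue;
    }
}

// Launch: enough 16x16 blocks to cover the whole Width x Width matrix
dim3 dimBlock(TILE_WIDTH, TILE_WIDTH, 1);
dim3 dimGrid((Width + TILE_WIDTH - 1) / TILE_WIDTH,
             (Width + TILE_WIDTH - 1) / TILE_WIDTH, 1);
MatrixMulKernelMultiBlock<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);
```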
Here is an Exercise…

• Implement Matrix Multiplication
– Execute it with different matrix dimensions using (a) CPU only, (b) GPUs and (c) GPUs with different grid/block organizations
• Fill in a table like the following
Dimensions (M, N)             CPU time (s)    GPU time (s)    Speedup
[400, 800], [400, 400]
[800, 1600], [800, 800]
…
[2400, 4800], [2400, 4800]
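One way to measure the GPU time for the table is with CUDA events (a sketch; CPU time can be measured separately with clock() or gettimeofday(), and speedup is the ratio of the two):

```cuda
cudaEvent_t start, stop;
float elapsed_ms;

cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);                        // timestamp before the launch
MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);
cudaEventRecord(stop, 0);                         // timestamp after the launch
cudaEventSynchronize(stop);                       // wait until the kernel has finished

cudaEventElapsedTime(&elapsed_ms, start, stop);   // elapsed time in milliseconds

cudaEventDestroy(start);
cudaEventDestroy(stop);
```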
Conclusion
• We discussed the CUDA Programming Model and CUDA Thread Basics
– Data Parallelism
– CUDA Program Structure
– Memory Model & Data Transfer (briefly)
– Kernel Functions & Threading
– (Discussion with Example: Matrix Multiplication)
References for this Session
• Chapter 2 of: D. Kirk and W. Hwu, Programming Massively Parallel Processors, Morgan Kaufmann, 2010
• Chapters 4–5 of: J. Sanders and E. Kandrot, CUDA by Example, Addison-Wesley, 2010
• Chapter 2 of: NVIDIA CUDA C Programming Guide, v. 3.2/4.0, NVIDIA Corp., 2010–2011