Training Program on GPU Programming with CUDA
31st July, 7th Aug, 14th Aug 2011
Sanath Jayasena
CUDA Teaching Center @ UoM

Day 1, Session 2
CUDA Programming Model, CUDA Threads
Outline for Day 1 Session 2

CUDA Programming Model, CUDA Threads
• Data Parallelism
• CUDA Program Structure
• Memory Model & Data Transfer (Brief)
• Kernel Functions & Threading
• (Discussion with Example: Matrix Multiplication)
July-Aug 2011, CUDA Training Program
Data Parallelism

• Data parallelism
– A problem/program property
– Many arithmetic operations can be safely performed on the data structures simultaneously
– Example: matrix multiplication (next slide)
• CUDA devices can exploit data parallelism to accelerate execution of applications
Example: Matrix Multiplication

(Figure: matrices M, N and P, each of size width × width)

P = M · N
• Each element in P is computed as a dot product between a row of M and a column of N
• All elements in P can be computed independently and simultaneously
CUDA Program Structure
• A CUDA program consists of one or more phases, executed on either the host (CPU) or a device (GPU), and supplied as a single source code
• Phases with little or no data parallelism → host code
– ANSI C, compiled with the standard compiler
• Phases with significant data parallelism → device code
– ANSI C extended with keywords to specify kernels and data structures
• The NVIDIA C compiler (nvcc) separates the two during compilation
Execution of a CUDA Program

(Figure: serial host code alternating with kernel invocations, each generating a grid of device threads)
Execution of a CUDA Program
• Execution starts with the host (CPU)
• When a kernel is invoked, execution moves to the device (GPU)
– A large number of threads is generated
– Grid: the collection of all threads generated by a kernel invocation
– (The previous slide shows two grids of threads)
• Once all threads in a grid complete execution, the grid terminates and execution continues on the host
Example: Matrix Multiplication

int main(void)
{
    // 1. Allocate and initialize matrices M, N, P
    //    I/O to read the input matrices M and N
    ...

    // 2. M * N on the device
    MatrixMulOnDevice(M, N, P, width);

    // 3. I/O to write the output matrix P
    //    Free matrices M, N, P
    ...

    return 0;
}

A simple CUDA host code skeleton for matrix multiplication
CUDA Device Memory Model

• Host and devices have separate memory spaces
– E.g., hardware cards with their own DRAM
• To execute a kernel on a device
– Need to allocate memory on the device
– Transfer data: host memory → device memory
• After device execution
– Transfer results: device memory → host memory
– Free device memory that is no longer needed
CUDA Device Memory Model

(Figure: diagram of the host and device memory spaces)
CUDA API: Memory Mgt.

(Figure: table of the memory allocation functions, cudaMalloc() and cudaFree())

CUDA API: Memory Mgt.
• Example

float *Md;
int size = Width * Width * sizeof(float);
cudaMalloc((void**)&Md, size);   // allocate space for Md on the device
...
cudaFree(Md);                    // free it when no longer needed
CUDA API: Data Transfer

(Figure: the cudaMemcpy() data transfer function)
Example: Matrix Multiplication

(Code shown as an image in the original slide)
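Since the slide's code did not survive the transcript, here is a hedged sketch of what a MatrixMulOnDevice() host function would look like at this point in the material, combining the memory-management and data-transfer calls just introduced (the exact code on the original slide may differ):

```cuda
// Host-side wrapper: allocate device memory, copy inputs over,
// launch the kernel, copy the result back, free device memory.
void MatrixMulOnDevice(float *M, float *N, float *P, int Width)
{
    int size = Width * Width * sizeof(float);
    float *Md, *Nd, *Pd;

    // 1. Allocate device memory and transfer M and N to the device
    cudaMalloc((void**)&Md, size);
    cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);
    cudaMalloc((void**)&Nd, size);
    cudaMemcpy(Nd, N, size, cudaMemcpyHostToDevice);
    cudaMalloc((void**)&Pd, size);

    // 2. Launch the kernel: one block of Width x Width threads
    dim3 dimBlock(Width, Width, 1);
    dim3 dimGrid(1, 1, 1);
    MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);

    // 3. Transfer the result back and free device memory
    cudaMemcpy(P, Pd, size, cudaMemcpyDeviceToHost);
    cudaFree(Md); cudaFree(Nd); cudaFree(Pd);
}
```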
Kernel Functions & Threading
• A kernel function specifies the code to be executed by all threads of a parallel phase
– All threads of a parallel phase execute the same code → single-program multiple-data (SPMD), a popular programming style for parallel computing
• Need a mechanism to
– Allow threads to distinguish themselves
– Direct themselves to the specific parts of the data they are supposed to work on
Kernel Functions & Threading
• Keywords threadIdx.x and threadIdx.y
– Thread indices of a thread
– Allow a thread to identify itself at runtime (by accessing hardware registers associated with it)
• Can refer to a thread as Thread(threadIdx.x, threadIdx.y)
• Thread indices reflect a multi-dimensional organization for threads
Example: Matrix Multiplication Kernel

(Kernel code shown as an image in the original slide; see the next slide for more details on accessing the relevant data)
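The kernel image itself was lost in the transcript; a sketch of the single-block kernel that the surrounding slides describe (names Md, Nd, Pd and Width follow the surrounding text; the exact code in the original slide may differ):

```cuda
// Each thread computes one element of Pd, identified by (threadIdx.x, threadIdx.y)
__global__ void MatrixMulKernel(float *Md, float *Nd, float *Pd, int Width)
{
    int tx = threadIdx.x;   // column of Pd this thread computes
    int ty = threadIdx.y;   // row of Pd this thread computes

    float Pvalue = 0;
    for (int k = 0; k < Width; ++k)           // dot product: row ty of Md, column tx of Nd
        Pvalue += Md[ty * Width + k] * Nd[k * Width + tx];

    Pd[ty * Width + tx] = Pvalue;             // write the single result element
}
```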
Thread Indices & Accessing Data Relevant to a Thread

(Figure: matrices Md, Nd and Pd, each width × width, annotated with the thread coordinates tx, ty and the memory layout of Pd)

• How matrix Pd would be laid out in memory (as it is a 1-D array): row 0, then row 1, and so on; element (ty, tx) is at offset ty * width + tx
• Each thread uses tx, ty to identify the relevant row of Md, the column of Nd and the element of Pd in the for loop
• E.g., Thread(2,3) will perform the dot product between row 2 of Md and column 3 of Nd and write the result into element (2,3) of Pd
Threading & Grids
• When a kernel is invoked/launched, it is executed as a grid of parallel threads
• A CUDA thread grid can have millions of lightweight GPU threads per kernel invocation
– To fully utilize the hardware, enough threads are required → large data parallelism is required
• Threads in a grid have a two-level hierarchy
– A grid consists of 1 or more thread blocks
– All blocks in a grid have the same number of threads
CUDA Thread Organization

(Figure: a grid of thread blocks, each block a multi-dimensional array of threads)
Threading with Grids & Blocks
• Each thread block has a unique 2-D coordinate, given by the CUDA keywords blockIdx.x and blockIdx.y
– All blocks must have the same structure and number of threads
• Each block has a 3-D array of threads, up to a total of 1024 threads max
– Coordinates of threads in a block are defined by the indices threadIdx.x, threadIdx.y, threadIdx.z
– (Not all applications will use all 3 dimensions)
Our Example: Matrix Multiplication
• The kernel shown earlier (Example: Matrix Multiplication Kernel)
– Can only use one thread block
– The block is organized as a 2-D array of threads
• The code can compute a product matrix Pd of only up to 1024 elements
– As a block can have a max of 1024 threads
– Each thread computes one element of Pd
– Is this sufficient / acceptable?
Our Example: Matrix Multiplication
• When the host code invokes the kernel, the grid and block dimensions are set by passing them as parameters
• Example

// Setup the execution configuration
dim3 dimBlock(16, 16, 1);   // Width = 16, as an example
dim3 dimGrid(1, 1, 1);      // last 1 ignored
// Launch the device computation threads!
MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, 16);
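To go beyond the one-block (1024-element) limit raised on the previous slide, the blockIdx keywords can be combined with the thread indices so that many blocks tile the output matrix. A hedged sketch, not code from the original slides (TILE_WIDTH and the kernel name are illustrative):

```cuda
#define TILE_WIDTH 16

// Each thread still computes one Pd element; blockIdx extends the
// reach beyond a single block of at most 1024 threads.
__global__ void MatrixMulKernelMultiBlock(float *Md, float *Nd, float *Pd, int Width)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;

    if (row < Width && col < Width) {   // guard for matrices not divisible by TILE_WIDTH
        float Pvalue = 0;
        for (int k = 0; k < Width; ++k)
            Pvalue += Md[row * Width + k] * Nd[k * Width + col];
        Pd[row * Width + col] = Pvalue;
    }
}

// Launch: enough 16x16 blocks to cover the whole Width x Width matrix
dim3 dimBlock(TILE_WIDTH, TILE_WIDTH, 1);
dim3 dimGrid((Width + TILE_WIDTH - 1) / TILE_WIDTH,
             (Width + TILE_WIDTH - 1) / TILE_WIDTH, 1);
MatrixMulKernelMultiBlock<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);
```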
Here is an Exercise…

• Implement Matrix Multiplication
– Execute it with different matrix dimensions using (a) CPU only, (b) GPUs and (c) GPUs with different grid/block organizations
• Fill in a table like the following
Dimensions (M, N)             CPU time (s)    GPU time (s)    Speedup
[400, 800], [400, 400]
[800, 1600], [800, 800]
…
[2400, 4800], [2400, 4800]
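One way to measure the GPU time for the table is with CUDA events (a sketch; CPU time can be measured separately with clock() or gettimeofday(), and speedup is the ratio of the two):

```cuda
cudaEvent_t start, stop;
float elapsed_ms;

cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);                        // timestamp before the launch
MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);
cudaEventRecord(stop, 0);                         // timestamp after the launch
cudaEventSynchronize(stop);                       // wait until the kernel has finished

cudaEventElapsedTime(&elapsed_ms, start, stop);   // elapsed time in milliseconds

cudaEventDestroy(start);
cudaEventDestroy(stop);
```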
Conclusion
• We discussed the CUDA Programming Model and CUDA Thread Basics
– Data Parallelism
– CUDA Program Structure
– Memory Model & Data Transfer (briefly)
– Kernel Functions & Threading
– (Discussion with Example: Matrix Multiplication)
References for this Session
• Chapter 2 of: D. Kirk and W. Hwu, Programming Massively Parallel Processors, Morgan Kaufmann, 2010
• Chapters 4–5 of: J. Sanders and E. Kandrot, CUDA by Example, Addison-Wesley, 2010
• Chapter 2 of: NVIDIA CUDA C Programming Guide, v. 3.2/4.0, NVIDIA Corp., 2010–2011