Training Program on GPU Programming with CUDA, 31st July, 7th Aug, 14th Aug 2011, CUDA Teaching Center @ UoM

Page 1: Training Program on GPU Programming  with CUDA

Training Program on GPU Programming with CUDA

31st July, 7th Aug, 14th Aug 2011
CUDA Teaching Center @ UoM

Page 2: Training Program on GPU Programming  with CUDA

Training Program on GPU Programming with CUDA

Sanath Jayasena
CUDA Teaching Center @ UoM

Day 1, Session 2

CUDA Programming Model
CUDA Threads

Page 3: Training Program on GPU Programming  with CUDA

Outline for Day 1 Session 2

CUDA Programming Model, CUDA Threads
• Data Parallelism
• CUDA Program Structure
• Memory Model & Data Transfer (Brief)
• Kernel Functions & Threading

(Discussion with Example: Matrix Multiplication)

July-Aug 2011 CUDA Training Program

Page 4: Training Program on GPU Programming  with CUDA

Data Parallelism

• Data Parallelism
  – A problem/program property
  – Many arithmetic operations can be safely performed on the data structures simultaneously
  – Example: matrix multiplication (next slide)

• CUDA devices can exploit data parallelism to accelerate execution of applications

Page 5: Training Program on GPU Programming  with CUDA

Example: Matrix Multiplication

[Figure: square matrices M, N and P, each width × width, with P = M · N]

• Each element in P is computed as the dot product between a row of M and a column of N

• All elements in P can be computed independently and simultaneously
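The two points above can be made concrete with a plain host-side loop nest. This is a minimal sketch rather than code from the slides: the function name MatrixMulOnHost and the row-major 1-D storage of the square matrices are assumptions made for illustration. No iteration of the two outer loops reads a value written by another iteration, which is the property that makes the computation data parallel.

```cuda
#include <cassert>

// Hypothetical CPU reference (name and layout are assumptions):
// M, N and P are width x width matrices stored row by row in 1-D arrays.
void MatrixMulOnHost(const float* M, const float* N, float* P, int width)
{
    assert(width > 0);
    for (int row = 0; row < width; ++row) {
        for (int col = 0; col < width; ++col) {
            // Dot product of row `row` of M and column `col` of N.
            float sum = 0.0f;
            for (int k = 0; k < width; ++k) {
                sum += M[row * width + k] * N[k * width + col];
            }
            // Each element of P is written exactly once and never read,
            // so the (row, col) iterations are independent of each other.
            P[row * width + col] = sum;
        }
    }
}
```

A function like this can also serve as the CPU-only baseline when comparing against a GPU version.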

Page 6: Training Program on GPU Programming  with CUDA

CUDA Program Structure

• A CUDA program consists of one or more phases executed on either the host (CPU) or a device (GPU), supplied as a single source code

• Little or no data parallelism → host code
  – ANSI C, compiled with standard compiler

• Significant data parallelism → device code
  – Extended ANSI C to specify kernels, data structs

• NVIDIA C Compiler separates the two and …

Page 7: Training Program on GPU Programming  with CUDA

Execution of a CUDA Program


Page 8: Training Program on GPU Programming  with CUDA

Execution of a CUDA Program

• Execution starts with host (CPU)

• When a kernel is invoked, execution moves to the device (GPU)
  – A large number of threads generated
  – Grid: collection of all threads generated by kernel
  – (Previous slide shows two grids of threads)

• Once all threads in a grid complete execution, the grid terminates and execution continues on the host
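To illustrate this flow, here is a minimal sketch (not from the slides) in which host code launches two kernels one after the other, so that two grids run in sequence on the device. The kernel names fillKernel and incrementKernel, the helper runTwoGrids, and the choice of a single block of n threads are assumptions for illustration. One detail worth knowing: a kernel launch returns control to the host right away, and it is the cudaMemcpy call here that waits for the preceding kernels to finish before copying.

```cuda
#include <cassert>
#include <cstdlib>
#include <cuda_runtime.h>

// Device code: each thread of the first grid writes one element.
__global__ void fillKernel(int* data, int value)
{
    data[threadIdx.x] = value;
}

// Device code: each thread of the second grid updates one element.
__global__ void incrementKernel(int* data)
{
    data[threadIdx.x] += 1;
}

// Host code (hypothetical helper): launches two grids in sequence.
// Assumes 0 < n <= 512 so that one thread block holds all n threads.
void runTwoGrids(int* result, int n)
{
    assert(n > 0 && n <= 512);
    int* d_data = NULL;
    cudaMalloc((void**)&d_data, n * sizeof(int));

    fillKernel<<<1, n>>>(d_data, 41);    // first grid
    incrementKernel<<<1, n>>>(d_data);   // second grid, runs after the first completes

    // Waits for both grids to finish, then copies the results to the host.
    cudaMemcpy(result, d_data, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_data);
}
```

Both the host function and the two kernels live in one source file, in line with the single-source program structure described earlier.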

Page 9: Training Program on GPU Programming  with CUDA

Example: Matrix Multiplication

int main(void)
{
    // 1. Allocate and initialize matrices M, N, P
    //    I/O to read the input matrices M and N
    ….

    // 2. M * N on the device
    MatrixMulOnDevice(M, N, P, width);

    // 3. I/O to write the output matrix P
    //    Free matrices M, N, P

    return 0;
}

A simple CUDA host code skeleton for matrix multiplication

Page 10: Training Program on GPU Programming  with CUDA

CUDA Device Memory Model

• Host, devices have separate memory spaces
  – E.g., hardware cards with their own DRAM

• To execute a kernel on a device
  – Need to allocate memory on device
  – Transfer data: host memory → device memory

• After device execution
  – Transfer results: device memory → host memory
  – Free device memory no longer needed

Page 11: Training Program on GPU Programming  with CUDA

CUDA Device Memory Model


Page 12: Training Program on GPU Programming  with CUDA

CUDA API : Memory Mgt.


Page 13: Training Program on GPU Programming  with CUDA

CUDA API : Memory Mgt.

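• Example

float *Md;
int size = Width * Width * sizeof(float);   // bytes needed for a Width x Width matrix

cudaMalloc((void**)&Md, size);   // allocate device global memory; Md receives the device address
cudaFree(Md);                    // release the device memory when it is no longer needed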

Page 14: Training Program on GPU Programming  with CUDA

CUDA API : Data Transfer

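A minimal sketch of host-to-device and device-to-host transfers with cudaMemcpy follows, reusing the Md and size naming of the memory management example. The helper name copyToDeviceAndBack and its error handling are assumptions for illustration and are not taken from the slides.

```cuda
#include <cassert>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: copies a width x width matrix M to the device
// and straight back into `out`. Returns true if every CUDA call succeeded.
bool copyToDeviceAndBack(const float* M, float* out, int width)
{
    assert(width > 0);
    size_t size = (size_t)width * width * sizeof(float);
    float* Md = NULL;

    // Allocate device global memory.
    if (cudaMalloc((void**)&Md, size) != cudaSuccess) return false;

    // Host memory -> device memory.
    cudaError_t err = cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);

    // Device memory -> host memory.
    if (err == cudaSuccess) {
        err = cudaMemcpy(out, Md, size, cudaMemcpyDeviceToHost);
    }

    // Free device memory that is no longer needed.
    cudaFree(Md);
    return err == cudaSuccess;
}
```

The last argument of cudaMemcpy names the direction of the copy; the destination pointer always comes first and the source second.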

Page 15: Training Program on GPU Programming  with CUDA

Example: Matrix Multiplication


Page 16: Training Program on GPU Programming  with CUDA

Kernel Functions & Threading

• A kernel function specifies the code to be executed by all threads of a parallel phase
  – All threads of a parallel phase execute the same code: single-program multiple-data (SPMD), a popular programming style for parallel computing

• Need a mechanism to
  – Allow threads to distinguish themselves
  – Direct themselves to specific parts of data they are supposed to work on

Page 17: Training Program on GPU Programming  with CUDA

Kernel Functions & Threading

• Keywords “threadIdx.x” and “threadIdx.y”
  – Thread indices of a thread
  – Allow a thread to identify itself at runtime (by accessing hardware registers associated with it)

• Can refer to a thread as Thread(threadIdx.x, threadIdx.y)

• Thread indices reflect a multi-dimensional organization for threads

Page 18: Training Program on GPU Programming  with CUDA

Example: Matrix Multiplication Kernel


See next slide for more details on accessing relevant data
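As a hedged illustration of a kernel of this kind, the sketch below has each thread compute one element of Pd using a single thread block. The names Md, Nd, Pd, tx, ty, MatrixMulKernel and MatrixMulOnDevice follow the slides; the kernel body, the body of the host wrapper and the row-major layout are assumptions, so this should be read as one plausible version rather than the exact code presented in the session.

```cuda
#include <cassert>
#include <cstdlib>
#include <cuda_runtime.h>

// Each thread computes one element of Pd (works with a single thread block only).
__global__ void MatrixMulKernel(const float* Md, const float* Nd, float* Pd, int width)
{
    int tx = threadIdx.x;   // column handled by this thread
    int ty = threadIdx.y;   // row handled by this thread

    float Pvalue = 0.0f;
    for (int k = 0; k < width; ++k) {
        // Row ty of Md times column tx of Nd.
        Pvalue += Md[ty * width + k] * Nd[k * width + tx];
    }
    Pd[ty * width + tx] = Pvalue;
}

// Host wrapper (body is an assumption): allocate, copy in, launch, copy out, free.
void MatrixMulOnDevice(const float* M, const float* N, float* P, int width)
{
    // One block must hold width * width threads; 1024 is the limit quoted in the slides.
    assert(width > 0 && width * width <= 1024);
    size_t size = (size_t)width * width * sizeof(float);
    float *Md = NULL, *Nd = NULL, *Pd = NULL;

    cudaMalloc((void**)&Md, size);
    cudaMalloc((void**)&Nd, size);
    cudaMalloc((void**)&Pd, size);
    cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);
    cudaMemcpy(Nd, N, size, cudaMemcpyHostToDevice);

    dim3 dimBlock(width, width, 1);   // one thread per element of Pd
    dim3 dimGrid(1, 1, 1);            // a single block
    MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, width);

    cudaMemcpy(P, Pd, size, cudaMemcpyDeviceToHost);
    cudaFree(Md);
    cudaFree(Nd);
    cudaFree(Pd);
}
```

Compared with a CPU loop nest, the two outer loops over rows and columns disappear: the thread indices take their place, and only the dot-product loop remains.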

Page 19: Training Program on GPU Programming  with CUDA

Thread Indices & Accessing Data Relevant to a Thread

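[Figure: matrices Md, Nd and Pd, each width × width; tx runs along x (columns) and ty along y (rows). A second panel shows how matrix Pd would be laid out in memory (as it is a 1-D array): row 0, then row 1, and so on, with the element for a thread at offset ty * width + tx.]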

• Each thread uses tx, ty to identify the relevant row of Md, column of Nd and the element of Pd in the for loop

• E.g., Thread(2,3), with tx = 2 and ty = 3, will perform the dot product between row 3 of Md and column 2 of Nd and write the result into element (2,3) of Pd, i.e. Pd[3 * width + 2]
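The offset arithmetic for the 1-D layout can be checked with a small helper. This is a sketch: the function name elementOffset is hypothetical, and it assumes the row-major storage implied by the ty * width + tx expression.

```cuda
#include <cassert>

// Hypothetical helper: offset of the element at row ty, column tx
// in a width x width matrix stored row by row in a 1-D array.
// Callable from both host and device code.
__host__ __device__ inline int elementOffset(int tx, int ty, int width)
{
    assert(tx >= 0 && tx < width && ty >= 0 && ty < width);
    return ty * width + tx;   // skip ty full rows, then move tx elements along the row
}
```

For a matrix of width 16, the thread with tx = 2 and ty = 3 works on offset 3 * 16 + 2 = 50: three full rows of 16 elements, then two more elements.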

Page 20: Training Program on GPU Programming  with CUDA

Threading & Grids

• When a kernel is invoked/launched, it is executed as a grid of parallel threads

• A CUDA thread grid can have millions of lightweight GPU threads per kernel invocation
  – To fully utilize hardware → enough threads required → large data parallelism required

• Threads in a grid have a two-level hierarchy
  – A grid consists of 1 or more thread blocks
  – All blocks in a grid have same # of threads

Page 21: Training Program on GPU Programming  with CUDA

CUDA Thread Organization


Page 22: Training Program on GPU Programming  with CUDA

Threading with Grids & Blocks

• Each thread block has a unique 2-D coordinate given by CUDA keywords “blockIdx.x” and “blockIdx.y”
  – All blocks must have the same structure, thread #

• Each block has a 3-D array of threads up to a total of 1024 threads max (512 on older devices of compute capability 1.x)
  – Coordinates of threads in a block are defined by indices: threadIdx.x, threadIdx.y, threadIdx.z
  – (Not all apps will use all 3 dimensions)
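Because the per-block thread limit depends on the device, it can be worth querying it rather than hard-coding it. The following is a minimal sketch using cudaGetDeviceProperties; the helper name blockFits and the choice of device 0 are assumptions for illustration.

```cuda
#include <cassert>
#include <cuda_runtime.h>

// Hypothetical helper: returns true if a block of the given shape
// respects the limits reported by device 0.
bool blockFits(dim3 block)
{
    assert(block.x > 0 && block.y > 0 && block.z > 0);

    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) return false;

    // Total threads in the block, plus the per-dimension limits.
    unsigned int total = block.x * block.y * block.z;
    return total <= (unsigned int)prop.maxThreadsPerBlock
        && block.x <= (unsigned int)prop.maxThreadsDim[0]
        && block.y <= (unsigned int)prop.maxThreadsDim[1]
        && block.z <= (unsigned int)prop.maxThreadsDim[2];
}
```

A 16 x 16 x 1 block has 256 threads, which fits within either the 512 or the 1024 limit.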

Page 23: Training Program on GPU Programming  with CUDA

Our Example: Matrix Multiplication

• The kernel is shown 5 slides before (slide 18)
  – This can only use one thread block
  – The block is organized as a 2D-array

• The code can compute a product matrix Pd of only up to 1024 elements
  – As a block can have a max of 1024 threads
  – Each thread computes one element in Pd
  – Is this sufficient / acceptable?
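One way to lift the single-block limit, sketched here as an assumption rather than as the course's own solution, is to launch a 2-D grid of blocks and combine blockIdx, blockDim and threadIdx into global row and column indices. The names MatrixMulKernelMultiBlock, MatrixMulOnDeviceMultiBlock and TILE are hypothetical.

```cuda
#include <cassert>
#include <cstdlib>
#include <cuda_runtime.h>

#define TILE 16   // threads per block edge (assumed): 16 x 16 = 256 threads per block

__global__ void MatrixMulKernelMultiBlock(const float* Md, const float* Nd, float* Pd, int width)
{
    // Global position of the Pd element handled by this thread.
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row >= width || col >= width) return;   // threads past the matrix edge do nothing

    float Pvalue = 0.0f;
    for (int k = 0; k < width; ++k) {
        Pvalue += Md[row * width + k] * Nd[k * width + col];
    }
    Pd[row * width + col] = Pvalue;
}

void MatrixMulOnDeviceMultiBlock(const float* M, const float* N, float* P, int width)
{
    assert(width > 0);
    size_t size = (size_t)width * width * sizeof(float);
    float *Md = NULL, *Nd = NULL, *Pd = NULL;

    cudaMalloc((void**)&Md, size);
    cudaMalloc((void**)&Nd, size);
    cudaMalloc((void**)&Pd, size);
    cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);
    cudaMemcpy(Nd, N, size, cudaMemcpyHostToDevice);

    dim3 dimBlock(TILE, TILE, 1);
    int blocks = (width + TILE - 1) / TILE;   // round up so the grid covers the whole matrix
    dim3 dimGrid(blocks, blocks, 1);
    MatrixMulKernelMultiBlock<<<dimGrid, dimBlock>>>(Md, Nd, Pd, width);

    cudaMemcpy(P, Pd, size, cudaMemcpyDeviceToHost);
    cudaFree(Md);
    cudaFree(Nd);
    cudaFree(Pd);
}
```

With width = 20, for example, the grid has 2 x 2 blocks of 16 x 16 threads; the bounds check keeps the surplus threads in the edge blocks from writing outside the matrix. Changing TILE is one simple way to try different grid/block organizations.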

Page 24: Training Program on GPU Programming  with CUDA

Our Example: Matrix Multiplication

• When host code invokes the kernel, the grid and block dimensions are set by passing them as parameters

• Example

// Setup the execution configuration
dim3 dimBlock(16, 16, 1);   // Width = 16, as example
dim3 dimGrid(1, 1, 1);      // last 1 ignored

// Launch the device computation threads!
MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, 16);

Page 25: Training Program on GPU Programming  with CUDA

Here is an Exercise…

• Implement Matrix Multiplication
  – Execute it with different matrix dimensions using (a) CPU only, (b) GPUs and (c) GPUs with different grid/block organizations

• Fill a table like the following

Dimensions (M, N)             CPU time (s)   GPU time (s)   Speedup
[400,800] , [400, 400]
[800,1600] , [800, 800]
….
….
[2400,4800] , [2400, 4800]
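For the GPU time column, one option is CUDA event timing. The sketch below is a minimal example under assumptions: busyKernel is only a placeholder standing in for the matrix multiplication kernel being measured, and timeKernelMs is a hypothetical helper. cudaEventElapsedTime reports milliseconds, so divide by 1000 to get seconds; the speedup is then the CPU time divided by the GPU time.

```cuda
#include <cassert>
#include <cstdlib>
#include <cuda_runtime.h>

// Placeholder kernel standing in for the kernel being measured.
__global__ void busyKernel(float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f + 1.0f;
}

// Hypothetical helper: returns the elapsed GPU time of one launch in milliseconds.
float timeKernelMs(int n)
{
    assert(n > 0);
    float* d_data = NULL;
    cudaMalloc((void**)&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;   // round up to cover all n elements

    cudaEventRecord(start, 0);
    busyKernel<<<blocks, threads>>>(d_data, n);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);   // wait until the kernel and the stop event have completed

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return ms;
}
```

This version times the kernel alone. Whether the host-device transfers should count toward GPU time is a choice the exercise leaves open; placing the cudaEventRecord calls around the cudaMemcpy calls as well would include them.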

Page 26: Training Program on GPU Programming  with CUDA

Conclusion

• We discussed CUDA Programming Model and CUDA Thread Basics
  – Data Parallelism
  – CUDA Program Structure
  – Memory Model & Data Transfer (briefly)
  – Kernel Functions & Threading
  – (Discussion with Example: Matrix Multiplication)

Page 27: Training Program on GPU Programming  with CUDA

References for this Session

• Chapter 2 of: D. Kirk and W. Hwu, Programming Massively Parallel Processors, Morgan Kaufmann, 2010

• Chapters 4-5 of: J. Sanders and E. Kandrot, CUDA by Example, Addison-Wesley, 2010

• Chapter 2 of: NVIDIA CUDA C Programming Guide, V. 3.2/4.0, NVIDIA Corp., 2010-2011