
ITCS 4/5010 CUDA Programming, UNC-Charlotte, B. Wilkinson, Jan 23, 2013. MatrixMult.ppt

Matrix Multiplication Performance Improvements

These notes will introduce:

• Basic matrix multiplication on a 2-D grid/block (review)
• Limitations
• Block matrix multiplication
• Memory access patterns in matrix multiplication
• Coalescing memory accesses
• Transposing an array
• Using shared memory tiles

Matrix Multiplication

Matrix multiplication is an important operation in HPC and appears in many applications

C = A * B

where A, B, and C are matrices (two-dimensional arrays).

A restricted case is when B has only one column (matrix-vector multiplication), which appears in the solution of systems of linear equations and partial differential equations.
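For reference, a minimal sequential sketch of the matrix-vector case y = a * x (illustrative names, not from the original slides), with a an N x N array and x, y vectors of length N:

for (i = 0; i < N; i++) {
   y[i] = 0;
   for (k = 0; k < N; k++)
      y[i] += a[i][k] * x[k];
}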

Matrix multiplication, C = A x B

Implementing Matrix Multiplication: Sequential Code

Assume the matrices are square (N x N).

for (i = 0; i < N; i++)
   for (j = 0; j < N; j++) {
      c[i][j] = 0;
      for (k = 0; k < N; k++)
         c[i][j] = c[i][j] + a[i][k] * b[k][j];
   }

Requires N^3 multiplications and N^3 additions, so the sequential time complexity is O(N^3). Very easy to parallelize.

CUDA Kernel for multiplying two arrays

__global__ void gpu_matrixmult(int *gpu_a, int *gpu_b, int *gpu_c, int N) {

   int k, sum = 0;

   int col = threadIdx.x + blockDim.x * blockIdx.x;
   int row = threadIdx.y + blockDim.y * blockIdx.y;

   if (col < N && row < N) {
      for (k = 0; k < N; k++)
         sum += gpu_a[row * N + k] * gpu_b[k * N + col];
      gpu_c[row * N + col] = sum;
   }
}

In this example, one thread computes one element of C, so the number of threads must be equal to or greater than the number of elements.
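For example, the host can choose the launch configuration with ceiling division so the grid always covers all N x N elements (a sketch, assuming a 16 x 16 block; names follow the program shown later in these notes):

dim3 Block(16, 16);
dim3 Grid((N + Block.x - 1) / Block.x, (N + Block.y - 1) / Block.y);   // round up
gpu_matrixmult<<<Grid, Block>>>(dev_a, dev_b, dev_c, N);               // the kernel's bounds test skips any excess threads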


Sequential version with flattened arrays, for comparison

void cpu_matrixmult(int *cpu_a, int *cpu_b, int *cpu_c, int N) {

   int row, col, k, sum;

   for (row = 0; row < N; row++)            // row of a
      for (col = 0; col < N; col++) {       // column of b
         sum = 0;
         for (k = 0; k < N; k++)
            sum += cpu_a[row * N + k] * cpu_b[k * N + col];
         cpu_c[row * N + col] = sum;
      }
}

Matrix mapped onto a 2-D grid and 2-D blocks

(Figure: the array is mapped onto the grid/block structure, one element per thread. The element A[row][column] handled by a thread is given by:

   column = blockIdx.x * blockDim.x + threadIdx.x
   row    = blockIdx.y * blockDim.y + threadIdx.y  )

Basically the array is divided into “tiles” and one tile is mapped onto one block.

Complete Program (several slides)

// Matrix multiplication program MatrixMult.cu, Barry Wilkinson, Dec. 28, 2010.
#include <stdio.h>
#include <cuda.h>
#include <stdlib.h>

__global__ void gpu_matrixmult(int *gpu_a, int *gpu_b, int *gpu_c, int N) {
   …
}

void cpu_matrixmult(int *cpu_a, int *cpu_b, int *cpu_c, int N) {
   …
}

int main(int argc, char *argv[]) {

   int i, j;                              // loop counters
   int Grid_Dim_x = 1, Grid_Dim_y = 1;    // grid structure values
   int Block_Dim_x = 1, Block_Dim_y = 1;  // block structure values
   int noThreads_x, noThreads_y;          // number of threads available in device, each dimension
   int noThreads_block;                   // number of threads in a block
   int N = 10;                            // size of array in each dimension
   int *a, *b, *c, *d;
   int *dev_a, *dev_b, *dev_c;
   int size;                              // number of bytes in arrays
   cudaEvent_t start, stop;               // using CUDA events to measure time,
   float elapsed_time_ms;                 // which is applicable for asynchronous code also

/* -------------------- ENTER INPUT PARAMETERS AND ALLOCATE DATA ----------------------- */

   …                                      // keyboard input

   dim3 Grid(Grid_Dim_x, Grid_Dim_y);     // grid structure
   dim3 Block(Block_Dim_x, Block_Dim_y);  // block structure, threads/block limited by specific device
   size = N * N * sizeof(int);            // number of bytes in total in arrays

   a = (int*) malloc(size);               // dynamically allocated memory for arrays on host
   b = (int*) malloc(size);
   c = (int*) malloc(size);               // results from GPU
   d = (int*) malloc(size);               // results from CPU

   …                                      // load arrays with some numbers

/* -------------------- COMPUTATION DONE ON GPU -------------------- */

   cudaMalloc((void**)&dev_a, size);      // allocate memory on device
   cudaMalloc((void**)&dev_b, size);
   cudaMalloc((void**)&dev_c, size);

   cudaMemcpy(dev_a, a, size, cudaMemcpyHostToDevice);
   cudaMemcpy(dev_b, b, size, cudaMemcpyHostToDevice);

   cudaEventRecord(start, 0);             // start time recorded here, after the copies to the device

   gpu_matrixmult<<<Grid,Block>>>(dev_a, dev_b, dev_c, N);

   cudaMemcpy(c, dev_c, size, cudaMemcpyDeviceToHost);

   cudaEventRecord(stop, 0);              // measure end time
   cudaEventSynchronize(stop);
   cudaEventElapsedTime(&elapsed_time_ms, start, stop);

   printf("Time to calculate results on GPU: %f ms.\n", elapsed_time_ms);

Where you measure time will make a big difference.
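For instance, to time the kernel alone, excluding the copy back to the host, the stop event could be recorded immediately after the launch (a sketch, not part of the original program):

   cudaEventRecord(start, 0);
   gpu_matrixmult<<<Grid,Block>>>(dev_a, dev_b, dev_c, N);
   cudaEventRecord(stop, 0);                               // stop recorded before the copy back
   cudaEventSynchronize(stop);                             // kernel launch is asynchronous, so wait for it
   cudaEventElapsedTime(&elapsed_time_ms, start, stop);
   cudaMemcpy(c, dev_c, size, cudaMemcpyDeviceToHost);     // copy results after timing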


/* ------------- COMPUTATION DONE ON HOST CPU ----------------------------*/

   cudaEventRecord(start, 0);             // use same timing*

   cpu_matrixmult(a, b, d, N);            // do calculation on host

   cudaEventRecord(stop, 0);              // measure end time
   cudaEventSynchronize(stop);
   cudaEventElapsedTime(&elapsed_time_ms, start, stop);

   printf("Time to calculate results on CPU: %f ms.\n", elapsed_time_ms);   // execution time

/* -------------------- check device creates correct results -------------------- */
   …
/* -------------------- repeat program -------------------- */
   …                                      // while loop to repeat calculation with different parameters
/* -------------------- clean up -------------------- */

   free(a); free(b); free(c); free(d);
   cudaFree(dev_a); cudaFree(dev_b); cudaFree(dev_c);
   cudaEventDestroy(start); cudaEventDestroy(stop);
   return 0;
}

* I have found problems if the CPU timing is done before the GPU timing. Anyone else?

Some Preliminaries: Effects of First Launch

The program is written so that it can repeat with different parameters without stopping, to eliminate the effect of the first kernel launch.

Repeating might also take advantage of caching, although that seems not as significant as the first-launch effect.

Some results

32 x 32 array, 1 block of 32 x 32 threads, random numbers 0 - 9.

Speedup = 1.65, first run.

Check that the CPU and GPU give the same answers.

Some results

32 x 32 array, 1 block of 32 x 32 threads.

Speedup = 2.12, second run.

Some results

32 x 32 array, 1 block of 32 x 32 threads.

Speedup = 2.16, third run. Subsequent runs vary between 2.12 and 2.18.

Some results

256 x 256 array, 8 x 8 blocks of 32 x 32 threads.

Speedup = 151.86

Some results

1024 x 1024 array, 32 x 32 blocks of 32 x 32 threads.

Speedup = 860.9

Some results

2048 x 2048 array, 64 x 64 blocks of 32 x 32 threads.

Speedup = ?? The GPU appears to freeze. Why?

2^11 x 2^11 threads = 2^22 threads.

Memory needed per array = 2^22 integers (4 M) = 2^24 bytes = 16 Mbytes.

The maximum number of threads on the GPU appears to be 2^16 x 2^16 = 2^32 threads. The server has 4 Gbytes of main memory.

Different Array Sizes

Block size 32 x 32; the number of blocks is chosen to suit the array size.

Array size      Speedup*    Speedup**
32 x 32         2.18        1.26
256 x 256       151.86      110.63
1024 x 1024     860.9       733.44
2048 x 2048     ??
4096 x 4096

*  These results include the time of the memory copy back from the device but not the memory copy to the device.
** These results include the time of the memory copy back from the device and the memory copy to the device.

GPU Limitations

The previous program has limitations:

• The number of threads must be equal to or greater than the number of array elements (the code will not work if the number of threads is less than the number of array elements).

• The number of threads/block and blocks/grid has GPU limits, which will limit the size of arrays that can be processed.

• Keyboard input must be checked for invalid grid and block values (see the sketch after this list).

•There are memory bandwidth issues.
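A sketch of how that check might be done, using cudaGetDeviceProperties to obtain the limits of the device (variable names as in the main program above):

   cudaDeviceProp prop;
   cudaGetDeviceProperties(&prop, 0);                        // properties of device 0

   if (Block_Dim_x * Block_Dim_y > prop.maxThreadsPerBlock)
      printf("Error: too many threads per block (max %d)\n", prop.maxThreadsPerBlock);

   if (Grid_Dim_x > prop.maxGridSize[0] || Grid_Dim_y > prop.maxGridSize[1])
      printf("Error: grid dimensions exceed device limits\n");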


Compute capability 1.x

Maximum number of threads per block = 512

Maximum sizes of x- and y- dimension of thread block = 512

Maximum size of each dimension of grid = 65535

Maximum number of threads per block of 512 means a square 2-D block cannot be greater than 16 x 16 (256 threads)

So the maximum square array size is 16 x 65535 (approximately 2^4 x 2^16 = 2^20), i.e. an array of 2^20 x 2^20.

Is this a problem?

Compute capability 2.x (coit-grid06.uncc.edu)

Maximum number of threads per block = 1024

Maximum sizes of x- and y- dimension of thread block = 1024

Maximum size of each dimension of grid = 65535

A maximum of 1024 threads per block means a square 2-D block cannot be greater than 32 x 32, which uses all 1024 threads.

So the maximum square array size is 32 x 65535 (approximately 2^5 x 2^16 = 2^21), i.e. an array of 2^21 x 2^21.

Is this a problem?

Increasing the size of arrays beyond the thread limitation of the GPU

-- using fewer threads than array elements*

Actually this is easy and can draw upon the regular technique of sub-matrix (block) multiplication:

(Figure: matrices divided into sub-matrices.)

* Would you want to use fewer threads than elements? Not in the textbooks (it is not their “tiling”).

To demonstrate that sub-matrix multiplication produces the correct final answer, consider simple 2 x 2 arrangements of sub-matrices:
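Sketch of the argument: with A and B each partitioned into a 2 x 2 arrangement of sub-matrices, block multiplication gives

\begin{pmatrix} A_{0,0} & A_{0,1} \\ A_{1,0} & A_{1,1} \end{pmatrix}
\begin{pmatrix} B_{0,0} & B_{0,1} \\ B_{1,0} & B_{1,1} \end{pmatrix}
=
\begin{pmatrix}
A_{0,0}B_{0,0} + A_{0,1}B_{1,0} & A_{0,0}B_{0,1} + A_{0,1}B_{1,1} \\
A_{1,0}B_{0,0} + A_{1,1}B_{1,0} & A_{1,0}B_{0,1} + A_{1,1}B_{1,1}
\end{pmatrix}

Expanding each sub-matrix product element by element gives exactly the terms of the ordinary definition c[i][j] = sum over k of a[i][k] * b[k][j], just grouped by sub-matrix.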


Pseudo Code for Block Multiplication

Suppose N x N matrices, divided into s^2 sub-matrices. Each sub-matrix has N/s x N/s elements.

for (p = 0; p < s; p++)
   for (q = 0; q < s; q++) {
      Cp,q = 0;                       // clear elements of sub-matrix
      for (r = 0; r < s; r++)         // sub-matrix multiplication
         Cp,q = Cp,q + Ap,r * Br,q;   // add to accumulating sub-matrix
   }

Cp,q means the sub-matrix at row p and column q of matrix C. Cp,q = Cp,q + Ap,r * Br,q means multiply sub-matrix Ap,r by Br,q using matrix multiplication and add the result to sub-matrix Cp,q using matrix addition.

Code for Block Multiplication

N = …;      // number of elements in cols/rows of matrices
s = …;      // number of sub-matrices in each dimension
ss = N/s;   // number of elements in sub-matrix cols/rows (N/s must be an integer)

for (i = 0; i < N; i += ss)                 // go through sub-matrices of A
   for (j = 0; j < N; j += ss) {            // and sub-matrices of B

      for (p = i; p < i + ss; p++)          // clear elements of sub-matrix of C
         for (q = j; q < j + ss; q++)
            C[p][q] = 0;

      for (r = 0; r < N; r += ss)           // step along a sub-matrix row of A / column of B
         for (p = i; p < i + ss; p++)       // sub-matrix multiplication
            for (q = j; q < j + ss; q++)
               for (k = r; k < r + ss; k++)
                  C[p][q] += A[p][k] * B[k][q];   // add to accumulating sub-matrix
   }

Not tested. In a parallel version, each thread could do one value of i and j, giving s x s threads in total; a CUDA sketch follows.
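A minimal CUDA sketch of that idea (a hypothetical kernel, not from the original notes): one thread computes one ss x ss sub-matrix of C, so only s x s threads are needed in total.

__global__ void gpu_blockmatmult(int *a, int *b, int *c, int N, int ss) {
   int i = (blockIdx.y * blockDim.y + threadIdx.y) * ss;   // top row of this thread's C sub-matrix
   int j = (blockIdx.x * blockDim.x + threadIdx.x) * ss;   // left column of this thread's C sub-matrix
   if (i >= N || j >= N) return;

   for (int p = i; p < i + ss; p++)                        // clear the sub-matrix
      for (int q = j; q < j + ss; q++)
         c[p * N + q] = 0;

   for (int r = 0; r < N; r += ss)                         // accumulate sub-matrix products
      for (int p = i; p < i + ss; p++)
         for (int q = j; q < j + ss; q++) {
            int sum = 0;
            for (int k = r; k < r + ss; k++)
               sum += a[p * N + k] * b[k * N + q];
            c[p * N + q] += sum;
         }
}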


Effects of memory access in matrix multiplication

One thread is responsible for computing one result element Cij and needs to access one row of A and one column of B:

Each thread accesses one row of A and one column of B: N^2 row/column combinations, N^2 threads.

Seen another way, in the first time period each thread accesses the first element of its row of A:

Thread 0, …, Thread i, …, Thread N-1

Consider the threads that access different rows. Given the row-major order in which A is stored, those threads access locations that are not consecutive in memory

-- bad: memory coalescing is not possible.

Question: how many threads access the same location?

Next, each thread accesses the first element in a column of B:

Thread 0, …, Thread i, …, Thread N-1

Consider the threads that access different columns. Given the row-major order in which B is stored, those threads access consecutive locations

-- good: memory coalescing is possible.

Question: how many threads access the same location?

How can we get better memory accesses and memory coalescing?

1. Transpose one array. Copy all rows of A to columns and all columns of A to rows before accessing A, and modify the program accordingly.

(Not mentioned in the course textbook or other NVIDIA books, although it appears an obvious approach -- see what follows about whether it works!)

Sequential code for an in-place transpose using the same array:

for (i = 0; i < N; i++)
   for (j = 0; j < i; j++) {
      temp = B[i][j];
      B[i][j] = B[j][i];
      B[j][i] = temp;
   }

(In my code, I use separate arrays.)

This could be done on the host prior to copying to the device. What would the code look like if done on the device?
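One possibility is an out-of-place transpose kernel with one thread per element (a sketch with an illustrative name; a tuned version would use shared-memory tiles):

__global__ void transposeArrayGPU(int *in, int *out, int N) {
   int col = threadIdx.x + blockDim.x * blockIdx.x;
   int row = threadIdx.y + blockDim.y * blockIdx.y;
   if (row < N && col < N)
      out[col * N + row] = in[row * N + col];   // element (row, col) becomes element (col, row)
}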


/* ------ COMPUTATION DONE ON GPU USING A TRANSPOSED ARRAY ------ */

   transposeArray(a, a_T, N);              // transpose array on host

   cudaEventRecord(start, 0);              // time measured before the host-device copy,
                                           // but not including the transpose
// cudaEventSynchronize(start);            // Needed?

   cudaMemcpy(dev_a, a_T, size, cudaMemcpyHostToDevice);   // copy transposed A
   cudaMemcpy(dev_b, b, size, cudaMemcpyHostToDevice);     // copy B

   gpu_matrixmult_T<<<Grid,Block>>>(dev_a, dev_b, dev_c, N);

   cudaMemcpy(c_T, dev_c, size, cudaMemcpyDeviceToHost);

   cudaEventRecord(stop, 0);               // measure end time
   cudaEventSynchronize(stop);
   cudaEventElapsedTime(&elapsed_time_ms2, start, stop);

   printf("Time to calculate results on GPU with transposed array: %f ms.\n", elapsed_time_ms2);   // print out execution time


Some results

8 x 8 array, 1 block of 8 x 8 threads.

Speedup = 1.62 over not transposing the array.

Some results

32 x 32 array, 1 block of 32 x 32 threads.

Speedup = 1.17 over not transposing the array.

Some results

256 x 256 array, 8 x 8 blocks of 32 x 32 threads.

Speedup = 0.89!! over not transposing the array.

Some results

1024 x 1024 array, 32 x 32 blocks of 32 x 32 threads.

Speedup = 0.93!! over not transposing the array.

2. Using shared memory with “Tiling”

Programming Massively Parallel Processors: A Hands-on Approach, Fig. 6.10, page 109.

(Figure: copies of the A and B tiles, As and Bs, are made in shared memory while the corresponding C tile, Cs, is computed.)

Note: this is not block matrix multiplication; the fundamental algorithm is just divided into phases.

Developing Code

1. Declaring shared memory:

__global__ void gpu_matrixmult(int* Md, int* Nd, int* Pd, int Width) {

   __shared__ int Mds[TILE_WIDTH][TILE_WIDTH];
   __shared__ int Nds[TILE_WIDTH][TILE_WIDTH];

This uses static memory allocation, so the tile size must be fixed at compile time. It is convenient to choose the same size as the kernel block, say 32 x 32.
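For example (a sketch, assuming the tile size matches the 32 x 32 blocks used earlier):

#define TILE_WIDTH 32                       // fixed at compile time for the static shared-memory arrays

dim3 Block(TILE_WIDTH, TILE_WIDTH);         // block size chosen equal to the tile size
dim3 Grid(N / TILE_WIDTH, N / TILE_WIDTH);  // assumes N is a multiple of TILE_WIDTH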


2. Index into shared memory:

For convenience, declare tile (block) and thread indices, which are later used to form indices into the final C array:

int bx = blockIdx.x;    // tile (block) indices
int by = blockIdx.y;

int tx = threadIdx.x;   // thread indices
int ty = threadIdx.y;

To access (later):

Mds[ty][tx] = …         // element associated with this thread
Nds[ty][tx] = …

3. Global address:

For convenience declare row and column:

int Row = by * TILE_WIDTH + ty;    // global indices
int Col = bx * TILE_WIDTH + tx;

Note: the same as the usual global thread ID and result index in normal matrix multiplication:

int row = threadIdx.y + blockDim.y * blockIdx.y;
int col = threadIdx.x + blockDim.x * blockIdx.x;

4. Loading shared memory:

Done using SIMT thread collaboration (very tricky):

for (int m = 0; m < N/TILE_WIDTH; m++) {

Mds[ty][tx] = Md[Row*N + (m*TILE_WIDTH + tx)];

Nds[ty][tx] = Nd[(m*TILE_WIDTH + ty)*N + Col];

__syncthreads();

// do matrix multiplication operations on pair of tiles …

Annotations on the fragment above:

• m*TILE_WIDTH + tx and m*TILE_WIDTH + ty: thread ID offsets into the current tile
• Mds load: all elements in a row of the tile are transferred, one per thread
• Nds load: all elements in a column of the tile are transferred, one per thread
• __syncthreads(): wait for all threads in the block
• The m loop: one iteration for each tile along the row of Md / column of Nd

The book says this achieves memory coalescing, although it does not look like it does so in both cases.

Example (3 x 3 tiles)

Based upon Programming Massively Parallel Processors: A Hands-on Approach, Fig. 6.10, page 109.

(Figure: global arrays A and B each divided into a 3 x 3 arrangement of tiles, with block indices by/bx 0, 1, 2 and phase number m.)

Global memory address into Md (array A):  Row*N + (m*TILE_WIDTH + tx)
Global memory address into Nd (array B):  (m*TILE_WIDTH + ty)*N + Col

where m = tile (phase) number 0, 1, or 2, Row = by * TILE_WIDTH + ty, and Col = bx * TILE_WIDTH + tx.
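A small worked instance with assumed values (N = 9, TILE_WIDTH = 3): for the thread with tx = 1, ty = 2 in block bx = 2, by = 1, during phase m = 1:

Row = by * TILE_WIDTH + ty = 1*3 + 2 = 5
Col = bx * TILE_WIDTH + tx = 2*3 + 1 = 7

Md address = Row*N + (m*TILE_WIDTH + tx) = 5*9 + (3 + 1) = 49, i.e. element Md[5][4]
Nd address = (m*TILE_WIDTH + ty)*N + Col = (3 + 2)*9 + 7 = 52, i.e. element Nd[5][7]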


5. Matrix multiplication in shared memory:

int Pvalue = 0;

for (int m = 0; m < N/TILE_WIDTH; m++) {

   …                                       // copy tiles to shared memory (step 4)

   for (int k = 0; k < TILE_WIDTH; k++)    // multiply tiles, accumulating values
      Pvalue += Mds[ty][k] * Nds[k][tx];
}

Pd[Row * N + Col] = Pvalue;                // copy result back to global memory
}

__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width) {

   __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
   __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];

   int bx = blockIdx.x;   int by = blockIdx.y;
   int tx = threadIdx.x;  int ty = threadIdx.y;             // thread ID

   int Row = by * TILE_WIDTH + ty;     // identify row, column of the Pd element to work on
   int Col = bx * TILE_WIDTH + tx;

   float Pvalue = 0;

   for (int m = 0; m < Width/TILE_WIDTH; m++) {             // loop over Md, Nd tiles to compute the Pd element

      Mds[ty][tx] = Md[Row*Width + (m*TILE_WIDTH + tx)];    // load Md, Nd tiles into shared memory
      Nds[ty][tx] = Nd[(m*TILE_WIDTH + ty)*Width + Col];
      __syncthreads();

      for (int k = 0; k < TILE_WIDTH; k++)
         Pvalue += Mds[ty][k] * Nds[k][tx];
   }
   Pd[Row][Col] = Pvalue;
}

Code given in the book Programming Massively Parallel Processors: A Hands-on Approach, Fig. 6.11, page 110.

There is a mistake in the book: Pd is a pointer and the size of the array is unknown, so Pd[Row][Col] will not compile as written. A flattened index is needed.

Note that copying to shared memory is a collaborative effort between threads, each thread loading one element of each array. It is in the wrong place on page 110, although that does not affect the final answer!
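A corrected sketch with a flattened output index (and with a second __syncthreads() added so a tile is not overwritten while other threads are still reading it); this is an edited version, not the book's exact text, and assumes Width is a multiple of TILE_WIDTH:

__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width) {

   __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
   __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];

   int tx = threadIdx.x;  int ty = threadIdx.y;
   int Row = blockIdx.y * TILE_WIDTH + ty;
   int Col = blockIdx.x * TILE_WIDTH + tx;

   float Pvalue = 0;

   for (int m = 0; m < Width/TILE_WIDTH; m++) {
      Mds[ty][tx] = Md[Row*Width + (m*TILE_WIDTH + tx)];   // load one element of the Md tile
      Nds[ty][tx] = Nd[(m*TILE_WIDTH + ty)*Width + Col];   // load one element of the Nd tile
      __syncthreads();                                     // wait until the whole tile is loaded

      for (int k = 0; k < TILE_WIDTH; k++)
         Pvalue += Mds[ty][k] * Nds[k][tx];
      __syncthreads();                                     // wait before the tiles are overwritten
   }

   Pd[Row*Width + Col] = Pvalue;                           // flattened index
}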


Some results

32 x 32 array, 1 block of 32 x 32 threads (results of second run).

Speedup, GPU to CPU: 2.12
Speedup, GPU with shared memory to CPU: 2.73
Speedup, GPU with shared memory to GPU without shared memory: 1.28

Some results

256 x 256 array, 8 x 8 blocks of 32 x 32 threads.

Speedup, GPU to CPU: 153
Speedup, GPU with shared memory to CPU: 217
Speedup, GPU with shared memory to GPU without shared memory: 1.41

Some results

1024 x 1024 array, 32 x 32 blocks of 32 x 32 threads.

Speedup, GPU to CPU: 864
Speedup, GPU with shared memory to CPU: 2214 !!!
Speedup, GPU with shared memory to GPU without shared memory: 2.56

Some results

2048 x 2048 array, 64 x 64 blocks of 32 x 32 threads.

Speedup, GPU to CPU: 989
Speedup, GPU with shared memory to CPU: 2962 !!
Speedup, GPU with shared memory to GPU without shared memory: 2.99

Different Array Sizes: Speedup

Block size 32 x 32; the number of blocks is chosen to suit the array size.

Array size      GPU to CPU    GPU using shared memory to CPU    GPU using shared memory to GPU not using shared memory
32 x 32         2.12          2.73                              1.28
256 x 256       153           217                               1.41
1024 x 1024     864           2214                              2.56
2048 x 2048     989           2962                              2.99
4096 x 4096

Bandwidth improvements by using shared memory

Using 32 x 32 tiles reduces the number of global memory accesses by a factor of 32 (each thread makes two global transfers per pair of tiles instead of 2 x 32).

According to the PMPP book, page 90, using 16 x 16 tiles:

“it allows the 86.4 GB/s global bandwidth to serve a much larger floating-point computation rate … Can now support 86.4/4 x 16 = 345.6 gigaflops, very close to the peak floating point performance of the G80… effectively removes global memory bandwidth as the major limiting factor of matrix multiplication performance.”
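The arithmetic behind the quoted figure, assuming 4-byte floats and 16 reuses of each loaded element with 16 x 16 tiles: 86.4 GB/s divided by 4 bytes per float gives 21.6 Gfloat/s loaded, and 21.6 x 16 flops per loaded float = 345.6 gigaflops.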


Conclusions

Using the shared-memory algorithm can make a significant difference: up to about 3 times as fast on the GPU compared with not using it, in the tests presented.

Speedup of almost 3000 over the CPU! (Note, though, that the CPU is an old processor.)

This topic may be explored further in an assignment

Need better tools. NVIDIA offers a debugging tool called Parallel Nsight for Visual Studio/Windows:

http://parallelnsight.nvidia.com/


Further Reading

Programming Massively Parallel Processors: A Hands-on Approach

David B. Kirk and Wen-mei W. Hwu

Morgan Kaufmann, 2010

This book covers only NVIDIA GPUs and CUDA programming, despite its title.

Questions


Things in PMPP book (Ch 6) not covered yet:

• Dynamic partitioning of SM resources
• Data prefetching
• Instruction usage
• Thread granularity

Also note that page 108 says memory coalescing is not needed for shared memory!
