
Page 1

ITCS 4/5010 CUDA Programming, UNC-Charlotte, B. Wilkinson, Jan 23, 2013. SharedMem.ppt

Using Shared Memory

These notes demonstrate the improvements achieved by using shared memory, with code and results from runs on coit-grid06.uncc.edu.

Page 2

Approach

Objective: as in the memory-coalescing demo, to load numbers into a two-dimensional array.

Each thread stores its flattened global thread ID into the array element it accesses, so that one can tell which thread accessed which location when the array is printed out.

For comparison purposes, the access is done three ways:

1. Using global memory only
2. Using shared memory, with a local 2-D array per block that is copied back to global memory
3. As in 2, but with the pointer arithmetic done separately for speed

GPU structure: one or more 2-D blocks in a 2-D grid. Each block is fixed at 32 x 32 threads, 2-D (the maximum for compute capability 2.x). A sketch of the corresponding launch parameters follows.
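The slides do not show how the launch parameters are declared; a minimal sketch consistent with the kernel code on the following pages (BlockSize, Grid, and Block are the names used there, and N is assumed to be a multiple of 32) might be:

   #define BlockSize 32                   // threads per block in each dimension

   dim3 Block(BlockSize, BlockSize);      // 2-D block of 32 x 32 = 1024 threads
   dim3 Grid(N/BlockSize, N/BlockSize);   // 2-D grid covering the N x N array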

Page 3

1. Using global memory only

__global__ void gpu_WithoutSharedMem (int *h, int N, int T) {
   // Array loaded with the global thread ID that accesses that location
   // Coalescing should be possible

   int col = threadIdx.x + blockDim.x * blockIdx.x;
   int row = threadIdx.y + blockDim.y * blockIdx.y;

   int threadID = col + row * N;
   int index = col + row * N;

   for (int t = 0; t < T; t++)   // repeated T times to reduce other time effects
      h[index] = threadID;       // load array with global thread ID
}
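As a quick worked example with assumed values (not from the slides): with N = 64, the thread at threadIdx = (1, 2) in block blockIdx = (1, 0) computes col = 1 + 32*1 = 33 and row = 2 + 32*0 = 2, so threadID = index = 33 + 2*64 = 161. Consecutive threadIdx.x values within a warp give consecutive index values, which is why coalescing should be possible.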

Page 4

2. Using shared memory

__global__ void gpu_SharedMem (int *h, int N, int T) {

   __shared__ int h_local[BlockSize][BlockSize];   // shared memory, one copy per block

   int col = threadIdx.x + blockDim.x * blockIdx.x;
   int row = threadIdx.y + blockDim.y * blockIdx.y;

   int threadID = col + row * N;
   int index = col + row * N;

   // h_local[threadIdx.y][threadIdx.x] = h[index];   Not necessary here,
   // but might be in other calculations

   for (int t = 0; t < T; t++)
      h_local[threadIdx.y][threadIdx.x] = threadID;   // load shared array

   h[index] = h_local[threadIdx.y][threadIdx.x];      // copy back to global memory
}
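Here each thread reads and writes only its own shared-memory element, so no barrier is needed. An aside not in the original slides: if threads were to read elements written by other threads in the block, a __syncthreads() call would have to separate the writes from the reads, for example:

   h_local[threadIdx.y][threadIdx.x] = threadID;   // each thread writes its own element
   __syncthreads();                                // wait until all threads in the block have written
   int left = h_local[threadIdx.y][(threadIdx.x + BlockSize - 1) % BlockSize];   // now safe to read a neighbor's element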

Page 5

3. Using shared memory, with the index calculation outside the loop

__global__ void gpu_SharedMem_ptr (int *h, int N, int T) {

   __shared__ int h_local[BlockSize][BlockSize];

   int col = threadIdx.x + blockDim.x * blockIdx.x;
   int row = threadIdx.y + blockDim.y * blockIdx.y;

   int threadID = col + row * N;
   int index = col + row * N;

   int *ptr = &h_local[0][0];                                  // index calculated once, outside the loop
   int index_local = threadIdx.x + threadIdx.y * BlockSize;    // row stride is the shared array width, BlockSize, not N

   for (int t = 0; t < T; t++)
      ptr[index_local] = threadID;

   h[index] = h_local[threadIdx.y][threadIdx.x];   // copy back to global memory
}

This code I am still checking out.

Page 6

Main program

…

/*------------------------- Allocate memory -------------------------*/

int size = N * N * sizeof(int);     // total number of bytes in the array
int *h, *dev_h;                     // pointers to the arrays holding the numbers on host and device

h = (int*) malloc(size);            // array on host
cudaMalloc((void**)&dev_h, size);   // allocate device memory

/* ------------- GPU computation without shared memory ------------- */

gpu_WithoutSharedMem <<< Grid, Block >>>(dev_h, N, T);   // run once outside the timing

cudaEventRecord( start, 0 );

gpu_WithoutSharedMem <<< Grid, Block >>>(dev_h, N, T);

cudaEventRecord( stop, 0 );
cudaEventSynchronize( stop );
cudaEventElapsedTime( &elapsed_time_ms1, start, stop );

cudaMemcpy(h, dev_h, size, cudaMemcpyDeviceToHost);      // get results back to check

printf("\nComputation without shared memory\n");
printArray(h, N);
printf("\nTime to calculate results on GPU: %f ms.\n", elapsed_time_ms1);

Computations 2 and 3 are timed similarly.
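The excerpt omits the declaration and creation of the timing events. A minimal sketch of the elided setup, assuming the standard CUDA event API (start, stop, and elapsed_time_ms1 are the names used above):

   cudaEvent_t start, stop;   // events recorded around the timed kernel launch
   float elapsed_time_ms1;    // elapsed time in ms, filled in by cudaEventElapsedTime

   cudaEventCreate(&start);
   cudaEventCreate(&stop);

   // ... timed section as above ...

   cudaEventDestroy(start);   // release the events (and free h and dev_h) when finished
   cudaEventDestroy(stop);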

Page 7

Some results

A grid of one block, one iteration. Array 32 x 32.

Shared memory speedup = 1.18

Page 8

A grid of one block, 1,000,000 iterations. Array 32 x 32.

Shared memory speedup = 1.24

Page 9

A repeat of the previous run, just to check that the results are consistent.

Page 10

A grid of 16 x 16 blocks, 10,000 iterations. Array 512 x 512.

Shared memory speedup = 1.74

Different numbers of iterations produce similar results.

Page 11

Different Array Sizes

   Array size     Speedup
   32 x 32        1.24
   64 x 64        1.37
   128 x 128      1.36
   256 x 256      1.78
   512 x 512      1.75
   1024 x 1024    1.82
   2048 x 2048    1.79
   4096 x 4096    1.77

1000 iterations. Block size 32 x 32. Number of blocks chosen to suit the array size.

Page 12

Questions