
Page 1

GPU Programming 1

GPU PROGRAMMING

Page 2

GPU Programming 2

Assignment 4
• Consists of two programming assignments
  • Concurrency
  • GPU programming
• Requires a computer with a CUDA/OpenCL/DirectCompute compatible GPU
• Due Jun 07
• We have no final exams

Page 3

GPU Programming 3

GPU Resources
• Download the CUDA toolkit from the web
• Very good textbook: Programming Massively Parallel Processors
  • Wen-mei Hwu and David Kirk
  • Available at http://courses.engr.illinois.edu/ece498/al/Syllabus.html

Page 4

GPU Programming 4

Acknowledgments
• Slides and material from Wen-mei Hwu (UIUC) and David Kirk (NVIDIA)

Page 5

GPU Programming 5

Why GPU Programming
• More processing power + higher memory bandwidth
• GPU in every PC and workstation – massive volume and potential impact

Page 6

Current CPU
• 4 cores
• 4-float-wide SIMD
• 3 GHz
• 48–96 GFlops
• 2x HyperThreaded
• 64 kB L1 cache per core
• 20 GB/s to memory
• $200, 200 W

[Diagram: CPU 0–3 sharing an L2 cache]

Page 7

Current GPU
• 32 cores
• 32-float-wide SIMD
• 1 GHz
• 1 TeraFlop
• 32x "HyperThreaded"
• 64 kB L1 cache per core
• 150 GB/s to memory
• $200, 200 W

[Diagram: 32 SIMD cores sharing an L2 cache]

Page 8

GPU Programming 8

Bandwidth and Capacity

[Diagram: CPU (~50 GFlops) connected to 4–6 GB of CPU RAM at ~10 GB/s; GPU (~1 TFlop) connected to 1 GB of GPU RAM at ~100 GB/s; ~1 GB/s link between the two. All values are approximate.]

Page 9

GPU Programming 9

CUDA
• "Compute Unified Device Architecture"
• General purpose programming model
  • User kicks off batches of threads on the GPU
  • GPU = dedicated super-threaded, massively data-parallel co-processor
• Targeted software stack
  • Compute-oriented drivers, language, and tools
• Driver for loading computation programs onto the GPU

Page 10

GPU Programming 10

Languages with Similar Capabilities
• CUDA
• OpenCL
• DirectCompute
• You are free to use any of the above for assignment 4
  • I will focus on CUDA for the rest of the lecture
• Same abstractions present in all three, with different (and confusing) names

Page 11

GPU Programming 11

CUDA Programming Model
• The GPU is a compute device that:
  • Is a coprocessor to the CPU (host)
  • Has its own DRAM (device memory)
  • Runs many threads in parallel
• A GPU program is called a kernel
• Differences between GPU and CPU threads
  • GPU threads are extremely lightweight: very little creation overhead
  • A GPU needs 1000s of threads for full efficiency; a multi-core CPU needs only a few

Page 12

GPU Programming 12

A CUDA Program
1. Host performs some CPU computation
2. Host copies input data to the device
3. Host instructs the device to execute a kernel
4. Device executes the kernel and produces results
5. Host copies the results back
6. Go to step 1
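A minimal end-to-end sketch of these six steps (the kernel MyKernel, the element-doubling computation, and the array size are placeholders for illustration, not part of the assignment):

#include <cuda_runtime.h>
#include <cstdlib>

// Placeholder kernel for the sketch: doubles each element.
__global__ void MyKernel(const float* in, float* out, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) out[tid] = 2.0f * in[tid];
}

int main() {
    const int n = 1024;
    const size_t bytes = n * sizeof(float);

    float* h = (float*)malloc(bytes);
    for (int i = 0; i < n; ++i) h[i] = (float)i;           // 1. host computation

    float *d_in, *d_out;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, h, bytes, cudaMemcpyHostToDevice);    // 2. copy input to the device

    MyKernel<<<n / 256, 256>>>(d_in, d_out, n);             // 3. host instructs the device
                                                            // 4. device executes the kernel
    cudaMemcpy(h, d_out, bytes, cudaMemcpyDeviceToHost);    // 5. copy the results back

    cudaFree(d_in); cudaFree(d_out); free(h);               // 6. repeat from step 1 as needed
    return 0;
}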

Page 13

GPU Programming 13

CUDA Kernel Is an SPMD Program
• SPMD = Single Program Multiple Data
• All threads run the same code
• Each thread uses its id to
  • Operate on different memory addresses
  • Make control decisions

Kernel:
  ...
  i = input[tid];
  o = f(i);
  output[tid] = o;
  ...
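As a concrete (hypothetical) version of this sketch, with f as a placeholder computation and tid built from the CUDA thread indices:

// Hypothetical SPMD kernel matching the sketch above: every thread runs the
// same code, and tid selects which element each thread reads and writes.
__device__ float f(float x) { return x * x; }   // placeholder for "f"

__global__ void ApplyF(const float* input, float* output, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) {
        float i = input[tid];
        float o = f(i);
        output[tid] = o;
    }
}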

Page 14

GPU Programming 14

CUDA Kernel Is an SPMD Program
• SPMD = Single Program Multiple Data
• All threads run the same code
• Each thread uses its id to
  • Operate on different memory addresses
  • Make control decisions
• Difference with SIMD
  • Threads can execute different control flow
    • At a performance cost

Kernel:
  ...
  i = input[tid];
  if (i % 2 == 0) o = f(i);
  else            o = g(i);
  output[tid] = o;
  ...
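The same sketch with the branch written out (f and g remain placeholders); even and odd input values take different paths, so threads of one warp may diverge:

// Hypothetical kernel illustrating per-thread control flow.
__device__ float f(int x) { return x * 2.0f; }   // placeholder
__device__ float g(int x) { return x * 0.5f; }   // placeholder

__global__ void ApplyFG(const int* input, float* output, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) {
        int i = input[tid];
        float o;
        if (i % 2 == 0) o = f(i);   // threads whose input is even take this path,
        else            o = g(i);   // the others take this one: the warp executes
        output[tid] = o;            // both paths, at a performance cost
    }
}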

Page 15

GPU Programming 15

Thread Organization
• Kernel threads = grid of thread blocks (1D or 2D)
• Thread block = array of threads (1D, 2D, or 3D)
• Simplifies memory addressing for multidimensional data

[Diagram: the host launches Kernel 1 as Grid 1, made of Blocks (0,0), (1,0), (0,1), (1,1), and Kernel 2 as Grid 2 on the device]

Page 16

GPU Programming 16

Thread Organization
• Kernel threads = grid of thread blocks (1D or 2D)
• Thread block = array of threads (1D, 2D, or 3D)
• Simplifies memory addressing for multidimensional data (see the index sketch after the diagram)

[Diagram: same as the previous slide, with one block expanded into Threads (0,0), (1,0), (0,1), (1,1)]
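A small sketch of how a thread in a 2D grid of 2D blocks turns blockIdx, blockDim, and threadIdx (all CUDA built-ins) into the coordinates of the element it owns; the row-major layout of data is an assumption of the example:

// Illustrative kernel: map 2D block/thread indices to one element of a 2D array.
__global__ void Touch2D(float* data, int width, int height) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < height && col < width)
        data[row * width + col] += 1.0f;   // row-major indexing (assumed layout)
}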

Page 17

GPU Programming 17

Threads within a Block
• Execute in lock step
• Can share memory
• Can synchronize with each other

[Diagram: a CUDA thread block with thread ids 0, 1, 2, 3, ..., m, all running the same thread program]

Courtesy: John Nickolls, NVIDIA

Page 18

GPU Programming 18

CUDA Function Declarations

                                   Executed on the:   Only callable from the:
  __device__ float DeviceFunc()    device             device
  __global__ void  KernelFunc()    device             host
  __host__   float HostFunc()      host               host

• __global__ defines a kernel function
  • Must return void
• __device__ and __host__ can be used together
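A small illustrative example of the three qualifiers, including one function compiled for both host and device (the function names and bodies are placeholders):

__host__ __device__ float Square(float x) {    // usable from both CPU and GPU code
    return x * x;
}

__device__ float DeviceOnly(float x) {         // callable only from device code
    return Square(x) + 1.0f;
}

__global__ void Fill(float* out, int n) {      // kernel: returns void, launched from the host
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) out[tid] = DeviceOnly((float)tid);
}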

Page 19

GPU Programming 19

CUDA Function Declarations (cont.)
• __device__ functions cannot have their address taken
• For functions executed on the device:
  • No recursion
  • No static variable declarations inside the function
  • No variable number of arguments

Page 20

GPU Programming 20

Putting it all together

__global__ void KernelFunc(...);

dim3 DimGrid(100, 50);     // 5000 thread blocks
dim3 DimBlock(4, 8, 8);    // 256 threads per block

KernelFunc<<<DimGrid, DimBlock>>>(...);

Page 21

GPU Programming 21

CUDA Memory Model
• Registers: read/write per thread
• Local memory: read/write per thread
• Shared memory: read/write per block
• Global memory: read/write per grid
• Constant memory: read only, per grid
• Texture memory: read only, per grid

[Diagram: each thread has registers; each block has shared memory; global, constant, and texture memory are shared by the whole grid and accessible from the host]

Page 22

GPU Programming 22

Memory Access Efficiency
• Registers: fast
• Local memory: not cached -> slow; registers spill into local memory
• Shared memory: on chip -> fast
• Global memory: not cached -> slow
• Constant memory: cached – fast if good reuse
• Texture memory: cached – fast if good reuse

[Diagram: the same memory hierarchy as on the previous slide]

Page 23

GPU Programming 23

CUDA Variable Type Qualifiers

  Variable declaration                       Memory     Scope    Lifetime
  __device__ __local__    int LocalVar;      local      thread   thread
  __device__ __shared__   int SharedVar;     shared     block    block
  __device__              int GlobalVar;     global     grid     application
  __device__ __constant__ int ConstantVar;   constant   grid     application

• __device__ is optional when used with __local__, __shared__, or __constant__
• Automatic variables without any qualifier reside in a register
  • Except arrays, which reside in local memory
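A sketch of where each qualifier typically appears in practice (variable and kernel names are placeholders; the block size of 256 is an assumption of the sketch):

__constant__ float coeff;           // constant memory: read-only for the grid, set from the host
__device__   float globalFlag;      // global memory: visible for the whole application

__global__ void Scale(const float* in, float* out, int n) {
    __shared__ float tileBuf[256];  // shared memory: one copy per block (assumes blockDim.x <= 256)
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float x = (tid < n) ? in[tid] : 0.0f;   // automatic scalar -> register
    tileBuf[threadIdx.x] = x;
    __syncthreads();
    if (tid < n) out[tid] = coeff * tileBuf[threadIdx.x];
}

// Host side (sketch): cudaMemcpyToSymbol(coeff, &hostCoeff, sizeof(float));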

Page 24

GPU Programming 24

Variable Type Restrictions
• Pointers can only point to memory allocated or declared in global memory:
  • Allocated on the host and passed to the kernel: __global__ void KernelFunc(float* ptr)
  • Obtained as the address of a global variable: float* ptr = &GlobalVar;

Page 25

GPU Programming 25

Simple Example: Matrix Multiplication

Page 26

GPU Programming 26

Matrix Multiplication
• P = M * N, each of size WIDTH x WIDTH
• Simple strategy
  • One thread calculates one element of P
  • M and N are loaded WIDTH times from global memory

[Diagram: matrices M, N, and P, each WIDTH x WIDTH]

Page 27

GPU Programming 27

GPU Matrix Multiplication: Host

float *M, *N, *P; int width;
float *Md, *Nd, *Pd;                        // device copies
int size = width * width * sizeof(float);

cudaMalloc(&Md, size);
cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);

Page 28

GPU Programming 28

GPU Matrix Multiplication: Host

float *M, *N, *P; int width;
float *Md, *Nd, *Pd;                        // device copies
int size = width * width * sizeof(float);

cudaMalloc(&Md, size);
cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);

cudaMalloc(&Nd, size);
cudaMemcpy(Nd, N, size, cudaMemcpyHostToDevice);

cudaMalloc(&Pd, size);

Page 29

GPU Programming 29

GPU Matrix Multiplication: Host

float *M, *N, *P; int width;
float *Md, *Nd, *Pd;                        // device copies
int size = width * width * sizeof(float);

cudaMalloc(&Md, size);
cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);

cudaMalloc(&Nd, size);
cudaMemcpy(Nd, N, size, cudaMemcpyHostToDevice);

cudaMalloc(&Pd, size);

// call kernel

cudaMemcpy(P, Pd, size, cudaMemcpyDeviceToHost);

Page 30

GPU Programming 30

GPU Matrix Multiplication: Host

float *M, *N, *P; int width;
float *Md, *Nd, *Pd;                        // device copies
int size = width * width * sizeof(float);

cudaMalloc(&Md, size);
cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);

cudaMalloc(&Nd, size);
cudaMemcpy(Nd, N, size, cudaMemcpyHostToDevice);

cudaMalloc(&Pd, size);

// call kernel

cudaMemcpy(P, Pd, size, cudaMemcpyDeviceToHost);
cudaFree(Md); cudaFree(Nd); cudaFree(Pd);

Page 31

GPU Programming 31

GPU Matrix Multiplication: Host
• How many threads do we need?

[Diagram: matrices M, N, and P, each WIDTH x WIDTH]

Page 32

GPU Programming 32

GPU Matrix Multiplication: Host

dim3 dimGrid(1, 1);
dim3 dimBlock(width, width);
MatrixMul<<<dimGrid, dimBlock>>>(Md, Nd, Pd, width);

[Diagram: matrices M, N, and P, each WIDTH x WIDTH; one thread per element of P]

Page 33

GPU Programming 33

GPU Matrix Multiplication: Kernel

// short forms: tx = threadIdx.x; ty = threadIdx.y;

__global__ void MatrixMul(
    float* Md, float* Nd,
    float* Pd, int width)
{
    Pd[ty*width + tx] = ...
}

[Diagram: thread (tx, ty) computes the Pd element in column tx, row ty]

Page 34

GPU Programming 34

GPU Matrix Multiplication: Kernel

__global__ void MatrixMul(...)
{
    float r = 0;
    for (int k = 0; k < width; k++) {
        r += Md[ty*width + k] *
             Nd[k*width + tx];
    }
    Pd[ty*width + tx] = r;
}

[Diagram: thread (tx, ty) reads row ty of Md and column tx of Nd]
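Putting the previous two slides together, the complete naive kernel (for the single-block launch shown earlier) reads:

// Naive matrix multiplication: one thread per element of Pd, all operands
// read from global memory; assembled from the two preceding slides.
__global__ void MatrixMul(float* Md, float* Nd, float* Pd, int width) {
    int tx = threadIdx.x;
    int ty = threadIdx.y;

    float r = 0;
    for (int k = 0; k < width; k++)
        r += Md[ty * width + k] * Nd[k * width + tx];

    Pd[ty * width + tx] = r;
}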

Page 35

GPU Programming 35

Only One Thread Block Used
• One block of threads computes matrix Pd
  • Each thread computes one element of Pd
• Each thread
  • Loads a row of matrix Md
  • Loads a column of matrix Nd
  • Performs one multiply and one addition for each pair of Md and Nd elements
• Compute to off-chip memory access ratio close to 1:1 (not very high)
• Size of matrix limited by the number of threads allowed in a thread block

[Diagram: Grid 1 contains a single Block 1; Thread (2, 2) computes one element of Pd from a row of Md and a column of Nd]

Page 36

GPU Programming 36

How about performance on G80?
• All threads access global memory for their input matrix elements
• Compute: 346.5 GFLOPS
• Memory bandwidth: 86.4 GB/s

[Diagram: the memory hierarchy from earlier; all operands come from global memory]

Page 37

GPU Programming 37

How about performance on G80?
• All threads access global memory for their input matrix elements
  • Two memory accesses (8 bytes) per floating-point multiply-add
  • i.e., 4 bytes of memory traffic per FLOP
  • 4 * 346.5 = 1386 GB/s required to achieve the peak FLOP rating
  • 86.4 GB/s limits the code to 21.6 GFLOPS
• The actual code runs at about 15 GFLOPS
• Need to drastically cut down memory accesses to get closer to the peak 346.5 GFLOPS

[Diagram: the memory hierarchy from earlier]

Page 38

GPU Programming 38

G80 Example: Executing Thread Blocks
• Threads are assigned to Streaming Multiprocessors (SMs) at block granularity
  • Up to 8 blocks per SM, as resources allow
  • An SM in G80 can take up to 768 threads
    • Could be 256 (threads/block) * 3 blocks
    • Or 128 (threads/block) * 6 blocks, etc.
• Threads run concurrently
  • The SM maintains thread/block ids
  • The SM manages/schedules thread execution

[Diagram: thread blocks (threads t0 ... tm) assigned to SM 0 and SM 1, each with an MT issue unit, SPs, and shared memory]

Page 39

GPU Programming 39

G80 Example: Thread Scheduling
• Each block is executed as 32-thread warps
  • Warps are the scheduling units in an SM
• If 3 blocks are assigned to an SM and each block has 256 threads, how many warps are there in the SM?

[Diagram: Block 1 and Block 2 warps (t0 ... t31) feeding a Streaming Multiprocessor with an instruction L1 cache, instruction fetch/dispatch, SPs, SFUs, and shared memory]

Page 40

GPU Programming 40

G80 Example: Thread Scheduling
• Each block is executed as 32-thread warps
  • Warps are the scheduling units in an SM
• If 3 blocks are assigned to an SM and each block has 256 threads, how many warps are there in the SM?
  • Each block is divided into 256/32 = 8 warps
  • There are 8 * 3 = 24 warps

[Diagram: the Streaming Multiprocessor from the previous slide]

Page 41

GPU Programming 41

SM Warp Scheduling
• SM hardware implements zero-overhead warp scheduling
  • Warps whose next instruction has its operands ready for consumption are eligible for execution
  • Eligible warps are selected for execution based on a prioritized scheduling policy
  • All threads in a warp execute the same instruction when it is selected

[Diagram: the SM multithreaded warp scheduler interleaving warps over time, e.g. warp 8 instruction 11, warp 1 instruction 42, warp 3 instruction 95, ..., warp 8 instruction 12, warp 3 instruction 96]

Page 42

GPU Programming 42

G80 Block Granularity Considerations
• For matrix multiplication using multiple blocks, should I use 8x8, 16x16, or 32x32 blocks?
• Each SM can take at most 8 blocks and at most 768 threads

Page 43

GPU Programming 43

G80 Block Granularity Considerations
• For matrix multiplication using multiple blocks, should I use 8x8, 16x16, or 32x32 blocks?
• For 8x8, we have 64 threads per block. Since each SM can take up to 768 threads, it could hold 12 such blocks. However, each SM can only take up to 8 blocks, so only 512 threads will go into each SM!
• For 16x16, we have 256 threads per block. Since each SM can take up to 768 threads, it can take up to 3 blocks and achieve full capacity, unless other resource considerations overrule.
• For 32x32, we have 1024 threads per block. Not even one block can fit into an SM!

Page 44

GPU Programming 44

A Common Programming Strategy
• Global memory resides in device memory (DRAM) – much slower access than shared memory
• So, a profitable way of performing computation on the device is to tile data to take advantage of fast shared memory:
  • Partition data into subsets that fit into shared memory
  • Handle each data subset with one thread block by:
    • Loading the subset from global memory to shared memory, using multiple threads to exploit memory-level parallelism
    • Performing the computation on the subset from shared memory; each thread can efficiently make multiple passes over any data element
    • Copying results from shared memory back to global memory

Page 45

GPU Programming 45

A Common Programming Strategy (Cont.)
• Constant memory also resides in device memory (DRAM) – much slower access than shared memory
  • But... cached!
  • Highly efficient access for read-only data
• Carefully divide data according to access patterns
  • Read-only -> constant memory (very fast if in cache)
  • Read/write, shared within a block -> shared memory (very fast)
  • Read/write within each thread -> registers (very fast)
  • Read/write inputs/results -> global memory (very slow)

Page 46

GPU Programming 46

Idea: Use Shared Memory to Reuse Global Memory Data
• Each input element is read by WIDTH threads
• Load each element into shared memory and have several threads use the local version to reduce the memory bandwidth
  • Tiled algorithms

[Diagram: matrices M, N, and P, each WIDTH x WIDTH]

Page 47

GPU Programming 47

Tiled Multiply
• Break up the execution of the kernel into phases so that the data accesses in each phase are focused on one subset (tile) of Md and Nd

[Diagram: Pd divided into TILE_WIDTH x TILE_WIDTH sub-matrices Pdsub; block indices (bx, by) and thread indices (tx, ty) select one element of one sub-matrix]

Page 48

GPU Programming 48

A Small Example: 2x2 Tiling of P

[Diagram: a 4x4 Pd divided into 2x2 tiles, with the Md and Nd elements that feed each tile]

Page 49

GPU Programming 49

Every Md and Nd element is used exactly twice in generating a 2x2 tile of P

Access order:

               P0,0           P1,0           P0,1           P1,1
               (thread 0,0)   (thread 1,0)   (thread 0,1)   (thread 1,1)
               M0,0 * N0,0    M0,0 * N1,0    M0,1 * N0,0    M0,1 * N1,0
               M1,0 * N0,1    M1,0 * N1,1    M1,1 * N0,1    M1,1 * N1,1
               M2,0 * N0,2    M2,0 * N1,2    M2,1 * N0,2    M2,1 * N1,2
               M3,0 * N0,3    M3,0 * N1,3    M3,1 * N0,3    M3,1 * N1,3


Page 51

GPU Programming 51

Breaking Md and Nd into Tiles
• Break up the inner product loop of each thread into phases
• At the beginning of each phase, load the Md and Nd elements that everyone needs during the phase into shared memory
• Everyone accesses the Md and Nd elements from shared memory during the phase

[Diagram: the 2x2-tiled Pd example, with the corresponding tiles of Md and Nd]


Page 53

GPU Programming 53

Tiled Kernel

__global__ void Tiled(float* Md, float* Nd, float* Pd, int Width)
{
    __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
    __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];

    int bx = blockIdx.x;  int by = blockIdx.y;
    int tx = threadIdx.x; int ty = threadIdx.y;

    // Identify the row and column of the Pd element to work on
    int Row = by * TILE_WIDTH + ty;
    int Col = bx * TILE_WIDTH + tx;

    float Pvalue = 0;
    // compute Pvalue (next slide)
    Pd[Row*Width + Col] = Pvalue;
}

Page 54

GPU Programming 54

Tiled Kernel: Computing Pvalue

// ...
float Pvalue = 0;
// Loop over the Md and Nd tiles required to compute the Pd element
for (int m = 0; m < Width/TILE_WIDTH; ++m) {
    // Collaborative loading of Md and Nd tiles into shared memory
    Mds[ty][tx] = Md[Row*Width + (m*TILE_WIDTH + tx)];
    Nds[ty][tx] = Nd[(m*TILE_WIDTH + ty)*Width + Col];
    __syncthreads();

    for (int k = 0; k < TILE_WIDTH; ++k)
        Pvalue += Mds[ty][k] * Nds[k][tx];
    __syncthreads();
}
Pd[Row*Width + Col] = Pvalue;
// ...
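For reference, the pieces from this slide and the previous one combined into a single complete kernel; like the slides, it assumes Width is a multiple of TILE_WIDTH (here fixed to 16, as in the size discussion that follows):

#define TILE_WIDTH 16

__global__ void Tiled(float* Md, float* Nd, float* Pd, int Width) {
    __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
    __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];

    int bx = blockIdx.x;  int by = blockIdx.y;
    int tx = threadIdx.x; int ty = threadIdx.y;

    int Row = by * TILE_WIDTH + ty;   // row of Pd this thread computes
    int Col = bx * TILE_WIDTH + tx;   // column of Pd this thread computes

    float Pvalue = 0;
    for (int m = 0; m < Width / TILE_WIDTH; ++m) {
        // Collaboratively load one tile of Md and one tile of Nd
        Mds[ty][tx] = Md[Row * Width + (m * TILE_WIDTH + tx)];
        Nds[ty][tx] = Nd[(m * TILE_WIDTH + ty) * Width + Col];
        __syncthreads();

        for (int k = 0; k < TILE_WIDTH; ++k)
            Pvalue += Mds[ty][k] * Nds[k][tx];
        __syncthreads();
    }
    Pd[Row * Width + Col] = Pvalue;
}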

Page 55

GPU Programming 55

CUDA Code – Kernel Execution Configuration

// Set up the execution configuration
dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);
dim3 dimGrid(Width / TILE_WIDTH, Width / TILE_WIDTH);
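The slide stops at the configuration; the corresponding launch would look like this sketch, assuming Width is a multiple of TILE_WIDTH and Md, Nd, Pd were allocated as on the earlier host slides:

dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);
dim3 dimGrid(Width / TILE_WIDTH, Width / TILE_WIDTH);
Tiled<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);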

Page 56

GPU Programming 56

First-order Size Considerations in G80
• Each thread block should have many threads
  • TILE_WIDTH of 16 gives 16*16 = 256 threads
• There should be many thread blocks
  • A 1024*1024 Pd gives 64*64 = 4096 thread blocks
  • TILE_WIDTH of 16 gives each SM 3 blocks, 768 threads (full capacity)
• Each thread block performs 2*256 = 512 float loads from global memory for 256 * (2*16) = 8,192 mul/add operations
  • Memory bandwidth is no longer a limiting factor

Page 57

GPU Programming 57

Tiled Multiply
• Each block computes one square TILE_WIDTH x TILE_WIDTH sub-matrix Pdsub
• Each thread computes one element of Pdsub

[Diagram: block (bx, by) computes Pdsub; the phase index m and inner index k sweep over the corresponding tiles of Md and Nd]

Page 58

GPU Programming 58

G80 Shared Memory and Threading
• Each SM in G80 has 16 kB of shared memory
  • The size is implementation dependent!
  • For TILE_WIDTH = 16, each thread block uses 2*256*4 B = 2 kB of shared memory
  • Shared memory alone could therefore support up to 8 thread blocks actively executing
    • This allows up to 8*512 = 4,096 pending loads (2 per thread, 256 threads per block)
    • The threading model limits the number of thread blocks to 3, so shared memory is not the limiting factor here
  • The next TILE_WIDTH, 32, would lead to 2*32*32*4 B = 8 kB of shared memory usage per thread block, allowing only up to two thread blocks to be active at the same time
• Using 16x16 tiling, we reduce the accesses to global memory by a factor of 16
  • The 86.4 GB/s bandwidth can now support (86.4/4)*16 = 345.6 GFLOPS!

Page 59

GPU Programming 59

Parallel Memory Architecture
• In a parallel machine, many threads access memory
  • Therefore, memory is divided into banks
  • Essential to achieve high bandwidth
• Each bank can service one address per cycle
  • A memory can service as many simultaneous accesses as it has banks
• Multiple simultaneous accesses to a bank result in a bank conflict
  • Conflicting accesses are serialized

[Diagram: shared memory banks 0 through 15]

Page 60

GPU Programming 60

Bank Addressing Examples
• No bank conflicts
  • Linear addressing, stride == 1
• No bank conflicts
  • Random 1:1 permutation

[Diagram: threads 0–15 mapped one-to-one onto banks 0–15, either linearly or by a permutation]

Page 61

GPU Programming 61

Bank Addressing Examples
• 2-way bank conflicts
  • Linear addressing, stride == 2
• 8-way bank conflicts
  • Linear addressing, stride == 8

[Diagram: with stride 2, pairs of threads land on the same bank; with stride 8, eight threads each land on banks 0 and 8]

Page 62

GPU Programming 62

How addresses map to banks on G80
• Each bank has a bandwidth of 32 bits per clock cycle
• Successive 32-bit words are assigned to successive banks
• G80 has 16 banks
  • So bank = address % 16
  • Same as the size of a half-warp
    • No bank conflicts between different half-warps, only within a single half-warp

Page 63

GPU Programming 63

Shared memory bank conflicts
• Shared memory is as fast as registers if there are no bank conflicts
• The fast case:
  • If all threads of a half-warp access different banks, there is no bank conflict
  • If all threads of a half-warp access the identical address, there is no bank conflict (broadcast)
• The slow case:
  • Bank conflict: multiple threads in the same half-warp access the same bank
  • The accesses must be serialized
  • Cost = max # of simultaneous accesses to a single bank

Page 64

GPU Programming 64

Linear Addressing
• Given:

  __shared__ float shared[256];
  float foo = shared[baseIndex + s * threadIdx.x];

• This is only bank-conflict-free if s shares no common factors with the number of banks
  • 16 on G80, so s must be odd

[Diagram: thread-to-bank mappings for strides s = 1 and s = 3, both conflict-free]

Page 65

GPU Programming 65

Control Flow Instructions
• The main performance concern with branching is divergence
  • Threads within a single warp take different paths
  • Different execution paths are serialized in G80
    • The control paths taken by the threads in a warp are traversed one at a time until there are no more
• A common case: avoid divergence when the branch condition is a function of the thread ID
  • Example with divergence (see the sketch after this list):
    • if (threadIdx.x > 2) { }
    • This creates two different control paths for threads in a block
    • Branch granularity < warp size; threads 0, 1, and 2 follow a different path than the rest of the threads in the first warp
  • Example without divergence:
    • if (threadIdx.x / WARP_SIZE > 2) { }
    • Also creates two different control paths for threads in a block
    • Branch granularity is a whole multiple of the warp size; all threads in any given warp follow the same path
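Both branch conditions wrapped into minimal kernels (the values written in each branch are placeholders; WARP_SIZE is 32 on G80):

#define WARP_SIZE 32

// Divergent: within the first warp, threads 0-2 and threads 3-31 take
// different paths, so that warp executes both paths one after the other.
__global__ void Divergent(float* out) {
    if (threadIdx.x > 2) out[threadIdx.x] = 1.0f;
    else                 out[threadIdx.x] = 2.0f;
}

// Not divergent: the condition changes only at warp boundaries, so every
// thread in a given warp takes the same path.
__global__ void Uniform(float* out) {
    if (threadIdx.x / WARP_SIZE > 2) out[threadIdx.x] = 1.0f;
    else                             out[threadIdx.x] = 2.0f;
}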