
Page 1: GPU Computing

GPU Computing

Dr. Bo Yuan
E-mail: [email protected]

Page 2: GPU Computing

2

Overview

Foundation: GPU, CUDA, Thread, Memory Structure

Intermediate: Kernel, Vector Addition, Matrix Multiplication, Shared Memory

Advanced: Warp, Memory Access, Resource Optimization, Dynamic Parallelism

Extension: Floating Point, Stream, Multiple GPUs, Parallel Matlab

Page 3: GPU Computing

3

What is GPU?

• Graphics Processing Unit

• First GPU: GeForce 256 (1999)

• Connected to motherboard via PCI Express

• High computational density and memory bandwidth

• Massively multithreaded many-core chips

• Traditionally used for real-time rendering

• Several million units are sold each year.

Page 4: GPU Computing

4

Graphics Cards

Page 5: GPU Computing

5

GPU Pipeline

Page 6: GPU Computing

6

GPU Pipeline

Rasterization

Page 7: GPU Computing

7

Anti-Aliasing

[Figure] Triangle Geometry → Aliased → Anti-Aliased

Page 8: GPU Computing

8

GPGPU

• General-Purpose Computing on GPUs

• Massively Parallel, Simple Operations

• Suitable for compute-intensive engineering problems

• The original problem needs to be cast into native graphics operations.

• Launched through OpenGL or DirectX API calls

• Input data are stored in texture images and issued to the GPU by submitting triangles.

• Highly restricted access to input/output

• Very tedious; only limited success, achieved with painstaking effort

Page 9: GPU Computing

9

Trend of Computing

Page 10: GPU Computing

10

CPU vs. GPU

[Figure] CPU vs. GPU: the CPU spends its die on a few ALUs plus large control logic and cache, backed by DRAM (multi-core); the GPU spends it on many small ALUs with little control logic and cache, backed by DRAM (many-core). The GPU wins on number of ALUs and on memory bandwidth.

Page 11: GPU Computing

11

Power of the Crowd

[Figure] A Streaming Multiprocessor: instruction fetch/dispatch with an instruction L1 cache, eight SPs, two SFUs, and shared memory.

• SM – Streaming Multiprocessor
  – Multi-threaded processor core
  – Processing unit for a thread block
  – Contains SPs (Streaming Processors) and SFUs (Special Function Units)

• SP – Streaming Processor
  – Scalar ALU for a single CUDA thread

• SIMT – Single-Instruction, Multiple-Thread
  – Shared instruction fetch per 32 threads (warp)

Page 12: GPU Computing

12

Need For Speed

Page 13: GPU Computing

13

Green Computing

[Figure] Bar chart, FLOPS per watt (GFLOPS/W): Intel Core i7-980XE 1.7, GTX 580 6.48, GTX 680 15.85, GTX 750 Ti 21.8.

Page 14: GPU Computing

14

Supercomputing

• TITAN, Oak Ridge National Laboratory

• Speed: 24.8 PFLOPS (Theory), 17.6 PFLOPS (Real)

• CPU: AMD Opteron 6274 (18,688 × 16 cores)

• GPU: NVIDIA Tesla K20 (18,688 × 2496 cores)

• Cost: US$ 97 Million

• Power: 9 MW

Page 16: GPU Computing

16

What is CUDA?

• Compute Unified Device Architecture

• Introduced by NVIDIA in 2007

• Scalable Parallel Programming Model

• Small extensions to standard C/C++

• Enable general-purpose GPU computing

• Straightforward APIs to manage devices, memory etc.

• Only supports NVIDIA GPUs.

http://developer.nvidia.com/category/zone/cuda-zone

Page 17: GPU Computing

17

CUDA-Enabled GPU

[Figure] Architecture of a CUDA-enabled GPU: the host feeds an input assembler and a thread execution manager, which dispatch threads to many SP arrays; each array has a parallel data cache and texture unit, and all of them reach global memory through load/store units.

Page 18: GPU Computing

18

CUDA GPUs

Compute Capability | GPUs | Cards
2.0 | GF100, GF110 | GeForce GTX 470, GTX 480, GTX 570, GTX 580, GTX 590; Tesla C2050, C2070
2.1 | GF104, GF114, GF116, GF108, GF106 | GeForce GT 430, GT 440, GTX 460, GTX 550 Ti, GTX 560 Ti, GT 640, GT 630
3.0 | GK104, GK106, GK107 | GeForce GTX 690, GTX 680, GTX 670, GTX 660, GTX 650 Ti, GTX 650
3.5 | GK110, GK208 | GeForce GTX TITAN, GT 640 (Rev. 2), GT 630 (Rev. 2); Tesla K40, Tesla K20
5.0 | GM107, GM108 | GeForce GTX 750 Ti, GTX 750, GTX 860M

Page 19: GPU Computing

19

Fermi Architecture

Page 20: GPU Computing

20

Kepler Architecture

• GeForce GTX 680 (Mar. 22, 2012)

• GK104, 28 nm process

• 3.5 billion transistors on a 294 mm² die

• CUDA Cores: 1536 (8 SMs × 192 SPs)

• Memory Bandwidth: 192 GB/s

• Peak Performance: 3090 GFLOPS

• TDP: 195 W

• Release Price: $499

Page 21: GPU Computing

21

Maxwell Architecture

• GeForce GTX 750 Ti (Feb. 18, 2014)

• GM107, 28 nm process

• 1.87 billion transistors on a 148 mm² die

• CUDA Cores: 640 (5 SMs × 128 cores)

• Memory Bandwidth: 86.4 GB/s

• Peak Performance: 1306 GFLOPS

• TDP: 60 W

• Release Price: $149

Page 22: GPU Computing

22

CUDA Teaching Lab

• GTX 750 (GM107)
  – Compute Capability: 5.0
  – 512 CUDA Cores
  – 1 GB, 128-bit GDDR5
  – 80 GB/s
  – 1044 GFLOPS
  – TDP: 55 W
  – RMB 799

• GT 630 (GK208)
  – Compute Capability: 3.5
  – 384 CUDA Cores
  – 2 GB, 64-bit GDDR3
  – 14.4 GB/s
  – 692.7 GFLOPS
  – TDP: 25 W
  – RMB 419

Page 23: GPU Computing

23

CUDA Installation

https://developer.nvidia.com/cuda-downloads

Page 24: GPU Computing

24

CUDA: deviceQuery

Page 25: GPU Computing

25

CUDA: bandwidthTest

Page 26: GPU Computing

26

CUDA Applications

Page 27: GPU Computing

27

CUDA Showcase

Page 28: GPU Computing

28

Heterogeneous Computing

Device

Host

Page 29: GPU Computing

29

Heterogeneous Computing

Page 30: GPU Computing

30

Grids, Blocks and Threads

[Figure] The host launches Kernel 1 on the device as Grid 1, a 3×2 arrangement of blocks (0,0)–(2,1), and then Kernel 2 as Grid 2. Inside Block (1,1) there is a 5×3 arrangement of threads (0,0)–(4,2).

Page 31: GPU Computing

31

Thread Block

• Threads have thread ID numbers within their block.

• Threads use their thread ID to select work.

• Threads are assigned to SMs at block granularity.

• Each GT200 SM can hold a maximum of 8 blocks.

• Each GT200 SM can hold a maximum of 1024 threads.

• Threads in the same block can share data and synchronize.

• Threads in different blocks cannot cooperate.

• Each block can execute in any order relative to other blocks.

[Figure] A thread block: threads 0, 1, 2, 3, …, m all run the same thread program and differ only in their thread IDs.

Page 32: GPU Computing

32

Code Example
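The code on this slide is an image in the original; below is a minimal sketch of the idea it illustrates (each thread uses its thread ID to pick the element it works on). The kernel name scale_by_two and the launch are assumptions, not the original code.

__global__ void scale_by_two(int *data) {
    int i = threadIdx.x;      // each thread selects its own element by thread ID
    data[i] = 2 * data[i];
}

// Launched with one block of m threads:
//   scale_by_two<<<1, m>>>(d_data);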

Page 33: GPU Computing

33

Transparent Scalability

[Figure] Transparent scalability: the same kernel grid of blocks 0–7 runs on a small device that executes two blocks at a time and on a larger device that executes four at a time; because blocks may execute in any order, the schedule adapts to the hardware.

Page 34: GPU Computing

34

Memory Space

[Figure] Each block of the grid has its own shared memory, and each thread its own registers; all blocks read and write per-grid global memory and read per-grid constant memory, both of which are also accessible from the host.

• Each thread can:

– Read/write per-thread registers

– Read/write per-block shared memory

– Read/write per-grid global memory

– Read per-grid constant memory (read-only)

GeForce GTX 680

Memory Bandwidth … 192 GB/s

Single-Precision Float … 4 bytes

Peak Performance … 3090 GFLOPS

Practical Performance … 48 GFLOPS (192 GB/s ÷ 4 bytes per operand ≈ 48 G operands/s, so code that fetches every operand from global memory is limited to about 48 GFLOPS)
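To tie the memory types above to CUDA syntax, here is a minimal sketch; the names coef, memory_demo and tile, and the sizes, are illustrative and not from the slides.

__constant__ float coef[16];              // per-grid constant memory (read-only in kernels)

__global__ void memory_demo(float *gdata) // gdata points to per-grid global memory
{
    __shared__ float tile[256];           // per-block shared memory (assumes blockDim.x <= 256)
    float x = gdata[threadIdx.x];         // automatic variable: lives in per-thread registers
    tile[threadIdx.x] = x * coef[0];
    __syncthreads();
    gdata[threadIdx.x] = tile[threadIdx.x];
}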

Page 35: GPU Computing

35

Hello World!

#include <stdio.h>

int main(void) {
    printf("Hello World!\n");
    return 0;
}

#include <stdio.h>

__global__ void mykernel(void) { }

int main(void) {
    mykernel<<<1,1>>>();
    printf("Hello World!\n");
    return 0;
}

Your first CUDA code!

Page 36: GPU Computing

36

Device Code

• The CUDA keyword __global__ indicates a kernel function that:
  – Runs on the device.
  – Is called from the host.

• The CUDA keyword __device__ indicates a device function that:
  – Runs on the device.
  – Is called from a kernel function or another device function.

• Triple angle brackets <<< >>> indicate a call from host code to device code.
  – Kernel launch

• nvcc separates the source code into two components:
  – Device functions are processed by the NVIDIA compiler.
  – Host functions are processed by the standard host compiler.
  – $ nvcc hello.cu
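A minimal sketch (the helper square and kernel square_all are illustrative, not from the slides) showing the two keywords together with a launch:

// __device__ helper: runs on the device, callable only from device code
__device__ int square(int x) { return x * x; }

// __global__ kernel: runs on the device, launched from host code
__global__ void square_all(int *a) {
    a[threadIdx.x] = square(a[threadIdx.x]);
}

// Host code: kernel launch via triple angle brackets
//   square_all<<<1, 64>>>(d_a);
// Compile with: nvcc example.cu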

Page 37: GPU Computing

37

Addition on Device

__global__ void add(int *a, int *b, int *c) {
    *c = *a + *b;
}

• add () will execute on the device.

• add () will be called from the host.

• a, b, c must point to device memory.

• We need to allocate memory on GPU.

Page 38: GPU Computing

38

Memory Management

• Host and device memories are separate entities.

• Device pointers point to GPU memory.
  – May be passed to/from host code.
  – May not be dereferenced in host code.

• Host pointers point to CPU memory.
  – May be passed to/from device code.
  – May not be dereferenced in device code.

• CUDA APIs for handling device memory:
  – cudaMalloc(), cudaFree(), cudaMemcpy()
  – C equivalents: malloc(), free(), memcpy()

Page 39: GPU Computing

39

Addition on Device: main()

int main(void) {
    int a, b, c;              // host copies
    int *d_a, *d_b, *d_c;     // device copies
    int size = sizeof(int);

    // Allocate space for device copies of a, b, c
    cudaMalloc((void **)&d_a, size);
    cudaMalloc((void **)&d_b, size);
    cudaMalloc((void **)&d_c, size);

    a = 2;
    b = 7;

    // Copy inputs to device
    cudaMemcpy(d_a, &a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, &b, size, cudaMemcpyHostToDevice);

Page 40: GPU Computing

40

Addition on Device: main()

    // Launch add() kernel on GPU
    add<<<1,1>>>(d_a, d_b, d_c);

    // Copy result back to host
    cudaMemcpy(&c, d_c, size, cudaMemcpyDeviceToHost);

    // Cleanup
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);

    return 0;
}

Page 41: GPU Computing

41

Moving to Parallel

• Each call to add() adds two integers.

• With add() running in parallel, we can do vector addition in parallel.

• add<<<nblocks, 1>>>(d_a, d_b, d_c)

• Each parallel invocation of add() is referred to as a block.

• By using blockIdx.x to index into the array, each block handles a different index.

• Blocks can be 2D:
  – dim3 nblocks(M, N)
  – blockIdx.x, blockIdx.y

Page 42: GPU Computing

42

Vector Addition on Device

__global__ void add(int *a, int *b, int *c) {
    c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];
}

Block 0: c[0] = a[0] + b[0];
Block 1: c[1] = a[1] + b[1];
Block 2: c[2] = a[2] + b[2];
Block 3: c[3] = a[3] + b[3];

Page 43: GPU Computing

43

Vector Addition on Device: main()

#define N 512

int main(void) {
    int *a, *b, *c;           // host copies
    int *d_a, *d_b, *d_c;     // device copies
    int size = N * sizeof(int);

    // Allocate space for device copies of a, b, c
    cudaMalloc((void **)&d_a, size);
    cudaMalloc((void **)&d_b, size);
    cudaMalloc((void **)&d_c, size);

    // Allocate space for host copies of a, b, c
    // Set up initial values
    a = (int *)malloc(size); rand_ints(a, N);
    b = (int *)malloc(size); rand_ints(b, N);
    c = (int *)malloc(size); rand_ints(c, N);

Page 44: GPU Computing

44

Vector Addition on Device: main()

    // Copy inputs to device
    cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);

    // Launch add() kernel on GPU with N blocks
    add<<<N, 1>>>(d_a, d_b, d_c);

    // Copy results back to host
    cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);

    // Cleanup
    free(a); free(b); free(c);
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);

    return 0;
}

Page 45: GPU Computing

45

CUDA Threads

• Each block can be split into parallel threads.

• Threads can be up to 3D:
  – dim3 nthreads(M, N, P)
  – threadIdx.x, threadIdx.y, threadIdx.z

__global__ void add(int *a, int *b, int *c) {
    c[threadIdx.x] = a[threadIdx.x] + b[threadIdx.x];
}

add<<<1, N>>>(d_a, d_b, d_c);

Page 46: GPU Computing

46

Combining Blocks and Threads

• We have seen parallel vector addition using:
  – Many blocks with one thread each
  – One block with many threads

• Let's adapt vector addition to use both blocks and threads.
  – Why bother?

Page 47: GPU Computing

47

Indexing

M = 8;   // 8 threads/block
int index = threadIdx.x + blockIdx.x * M;

int index = threadIdx.x + blockIdx.x * blockDim.x;

__global__ void add(int *a, int *b, int *c) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    c[index] = a[index] + b[index];
}

Page 48: GPU Computing

48

Indexing

#define N (2048*2048)
#define M 512   // THREADS_PER_BLOCK

add<<<N/M, M>>>(d_a, d_b, d_c);

__global__ void add(int *a, int *b, int *c, int n) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    if (index < n)
        c[index] = a[index] + b[index];
}

add<<<(N+M-1)/M, M>>>(d_a, d_b, d_c, N);
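Putting the pieces together, here is a minimal host-side sketch (not from the original slides) that launches the bounds-checked kernel above for an arbitrary N; it assumes the add() kernel defined on this slide is in the same file.

#include <stdio.h>
#include <stdlib.h>

#define N (2048*2048)
#define M 512                      // threads per block

int main(void) {
    int size = N * sizeof(int);
    int *a = (int *)malloc(size), *b = (int *)malloc(size), *c = (int *)malloc(size);
    int *d_a, *d_b, *d_c;

    cudaMalloc((void **)&d_a, size);
    cudaMalloc((void **)&d_b, size);
    cudaMalloc((void **)&d_c, size);

    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2 * i; }   // simple test inputs

    cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);

    add<<<(N + M - 1) / M, M>>>(d_a, d_b, d_c, N);   // kernel from this slide

    cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);
    printf("c[123] = %d\n", c[123]);                 // expect 369

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(a); free(b); free(c);
    return 0;
}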

Page 49: GPU Computing

49

Data Access Pattern

[Figure] A 1D stencil: each output element is computed from the input elements within radius positions on either side of it. How many times is each input element read?

Page 50: GPU Computing

50

Sharing Data Between Threads

• Each thread generates one output element.
  – blockDim.x elements per block

• Each input element needs to be read several times.
  – High I/O cost

• Within a block, threads can share data via shared memory.
  – Data are not visible to threads in other blocks.

• Extremely fast on-chip memory

• Declared using the keyword __shared__, allocated per block.

• Read (blockDim.x + 2*radius) input elements from global to shared memory.

Page 51: GPU Computing

51

Collaborative Threads

[Figure] Input elements 0–9 are loaded into a shared array temp[]; each block produces blockDim.x output elements. Thread 0 produces the values of temp[i] for i = 0, 3, 13, while Thread 9 requires the values of temp[i] for i = 9, 10, 11, 12, 13, 14, 15. If Thread 9 reads before Thread 0 has written: Data Race!

void __syncthreads();

Page 52: GPU Computing

52

Kernel Synchronization

__global__ void vector_sum(int *in, int *out) {
    __shared__ int temp[BLOCK_SIZE + 2*RADIUS];
    int gindex = threadIdx.x + blockIdx.x * blockDim.x;   // global index
    int lindex = threadIdx.x + RADIUS;                    // local index

    // Read input elements into shared memory
    temp[lindex] = in[gindex];
    if (threadIdx.x < RADIUS) {   // some threads do extra work: load the halo
        temp[lindex - RADIUS] = in[gindex - RADIUS];
        temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE];
    }
    __syncthreads();

    int offset, result = 0;
    for (offset = -RADIUS; offset <= RADIUS; offset++)
        result += temp[lindex + offset];
    out[gindex] = result;
}

Page 53: GPU Computing

void MatrixMulOnHost(float* M, float* N, float* P, int Width)
{
    int i, j, k;
    float a, b, sum;
    for (i = 0; i < Width; ++i)
        for (j = 0; j < Width; ++j) {
            sum = 0;
            for (k = 0; k < Width; ++k) {
                a = M[i * Width + k];
                b = N[k * Width + j];
                sum += a * b;
            }
            P[i * Width + j] = sum;
        }
}

53

Matrix Multiplication

[Figure] P = M × N, all of size WIDTH × WIDTH: row i of M and column j of N are combined over the index k to produce element P(i, j).

Host Code Only

Page 54: GPU Computing

54

Single Thread Block

dim3 dimGrid(1, 1);
dim3 dimBlock(Width, Width);
...
MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);
...

__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
{
    int k = 0;
    float Pvalue = 0, Melement, Nelement;

    for (k = 0; k < Width; ++k) {
        Melement = Md[threadIdx.y*Width + k];   // Md[threadIdx.y, k]
        Nelement = Nd[k*Width + threadIdx.x];   // Nd[k, threadIdx.x]
        Pvalue += Melement * Nelement;
    }

    // Pd[threadIdx.y, threadIdx.x]
    Pd[threadIdx.y*Width + threadIdx.x] = Pvalue;
}

Page 55: GPU Computing

55

Single Thread Block

• What is the maximum size of the matrix?
  – Each thread computes one element of Pd.

• Each thread:
  – Loads a row of matrix Md.
  – Loads a column of matrix Nd.
  – Performs one multiply and one addition for each pair of Md and Nd elements.

• CGMA
  – Compute to Global Memory Access ratio

[Figure] Grid 1 contains a single block; thread (2, 2) computes one element of Pd from a row of Md (3, 2, 5, 4) and a column of Nd (2, 4, 2, 6): 3·2 + 2·4 + 5·2 + 4·6 = 48.

Page 56: GPU Computing

Multiple Blocks

56

[Figure] Pd is broken into square tiles of size TILE_WIDTH × TILE_WIDTH; block (bx, by) computes the sub-matrix Pdsub from the corresponding strips of Md and Nd.

• Break Pd into square tiles.

• Each block calculates one tile:
  – Each thread calculates one element.
  – Block size equals tile size.

• Requires both the block ID and the thread ID.

Page 57: GPU Computing

57

Multiple Blocks: An Example

[Figure] With TILE_WIDTH = 2, the 4×4 result Pd is split into four 2×2 tiles handled by blocks (0,0), (1,0), (0,1) and (1,1); block (0,0), for example, computes Pd0,0, Pd1,0, Pd0,1 and Pd1,1 from the first two rows of Md and the first two columns of Nd.

Page 58: GPU Computing

58

Multiple Blocks: Indexing

• TILE_WIDTH
• Block: blockIdx.x, blockIdx.y
• Thread: threadIdx.x, threadIdx.y

• Row: blockIdx.y * TILE_WIDTH + threadIdx.y
• Col: blockIdx.x * TILE_WIDTH + threadIdx.x

[Figure] A 4×4 element grid (0,0)–(3,3): blockIdx.x and threadIdx.x select the column, blockIdx.y and threadIdx.y select the row.

Page 59: GPU Computing

59

Multiple Blocks: Device Code

__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
{
    // Calculate the row index of the Pd element and Md
    int Row = blockIdx.y*TILE_WIDTH + threadIdx.y;
    // Calculate the column index of the Pd element and Nd
    int Col = blockIdx.x*TILE_WIDTH + threadIdx.x;

    int k;
    float Pvalue = 0;
    // Each thread computes one element of the sub-matrix
    for (k = 0; k < Width; ++k)
        Pvalue += Md[Row*Width + k] * Nd[k*Width + Col];
    Pd[Row*Width + Col] = Pvalue;
}

Page 60: GPU Computing

60

Block Granularity

• Each SM in GT200 can take up to 1024 threads and 8 blocks.

• 8×8: 64 threads per block; 1024/64 = 16 blocks, but only 8 blocks are allowed, so 64×8 = 512 threads per SM (half capacity).

• 16×16: 256 threads per block; 1024/256 = 4 blocks, full capacity!

• 32×32: 1024 threads per block, exceeding the limit of 512 threads per block.

[Figure] Thread blocks queue up for SM 0 and SM 1; each SM runs the threads t0, t1, t2, …, tm of its resident blocks on its SPs, with an MT issue unit and shared memory.

Page 61: GPU Computing

61

Global Memory Access

[Figure] A 2×2 thread block (T0,0, T1,0, T0,1, T1,1) reading from Md and Nd.

• Each thread requires one row from Md and one column from Nd.

• For a k×k thread block, each row/column will be accessed k times.

• To reduce the global memory I/O, it is beneficial to load the required data into the shared memory once.

Page 62: GPU Computing

62

Splitting Md

[Figure] Splitting Md: a 4×2 strip of Md (Md0,0–Md3,1) is brought into the 2×2 shared array Mds in two phases, phase 1 loading the first two columns and phase 2 the next two.

• The shared memory per SM is limited (e.g., 64 KB).

• It is shared among all blocks in the same SM.

• Luckily, not all data needs to be in the shared memory simultaneously.

Page 63: GPU Computing

63

Shared Memory: Device Code

__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
{
    __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
    __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];

    int bx = blockIdx.x;  int by = blockIdx.y;
    int tx = threadIdx.x; int ty = threadIdx.y;

    // Identify the row and column of the Pd element to work on
    int Row = by * TILE_WIDTH + ty;
    int Col = bx * TILE_WIDTH + tx;

    float Pvalue = 0;
    // Loop over the Md and Nd tiles required to compute the Pd element
    for (int m = 0; m < Width/TILE_WIDTH; ++m) {
        // Collaboratively load Md and Nd tiles into shared memory
        Mds[ty][tx] = Md[Row*Width + (m*TILE_WIDTH + tx)];
        Nds[ty][tx] = Nd[Col + (m*TILE_WIDTH + ty)*Width];
        __syncthreads();   // Make sure the shared memory is ready

Page 64: GPU Computing

64

Shared Memory: Device Code

        for (int k = 0; k < TILE_WIDTH; ++k)
            Pvalue += Mds[ty][k] * Nds[k][tx];
        // Make sure all threads have finished working on the shared memory
        __syncthreads();
    }
    Pd[Row*Width + Col] = Pvalue;
}

• Each SM in G80 can take up to 768 threads and 8 blocks.

• Each SM has 16KB shared memory.

• For a tile of size 16 × 16, each block requires 16 × 16 × 4 = 1KB for Mds.

• In total, 2 KB are required for each block, so the 16 KB of shared memory could support 8 blocks.

• However, since 768/(16 × 16) = 3, only 3 blocks and 6KB will be in use.

Page 65: GPU Computing

65

Performance Considerations

• GPU computing is easy:
  – Host, Device, Kernel, Block, Thread
  – GPU cards are readily available.
  – Up and running in a few days

• As long as performance is not a major concern:
  – There are various performance bottlenecks.
  – A 10× speedup is often within your reach.
  – A 100× speedup takes a significant amount of tuning effort.

Page 66: GPU Computing

66

Thread Execution

• Conceptually, threads in a block can execute in any order with respect to each other.
  – Exception: barrier synchronizations

• Each thread block is partitioned into warps.
  – Hardware cost considerations
  – The unit of thread scheduling in SMs
  – 32 threads per warp: [0, …, 31], [32, …, 63], …

• All threads in a warp execute the same instruction.

• SM hardware implements zero-overhead thread scheduling.
  – Long-latency operations can be tolerated when several other warps are around.
  – The GPU does not require as much chip area for cache memories and branch prediction mechanisms as a CPU.

Page 67: GPU Computing

67

Warp Scheduling

• Suppose 1 global memory access is needed for every 4 instructions.

• Instruction: 4 clock cycles

• Memory latency: 200 clock cycles

• Each warp thus issues 4 × 4 = 16 cycles of work before stalling for 200 cycles, and 200/16 = 12.5, so at least 13 other ready warps (14 warps in total) are required to keep the units fully utilized.

[Figure] Warps A, B, C, D are scheduled round-robin: while one waits on memory, the others keep the execution units busy.

Page 68: GPU Computing

68

Flow Divergence

[Figure] The SM's multithreaded warp scheduler interleaves instructions from different warps over time: warp 8 instruction 11, warp 1 instruction 42, warp 3 instruction 95, …, warp 8 instruction 12, warp 3 instruction 96.

• SIMT can reduce the cost of fetching and processing instructions.

• SIMT works well when all threads in a warp follow the same control flow.

• Multiple sequential passes may be required for an if-then-else construct: the warp executes the "then" path (X) and the "else" path (Y) one after the other.

Page 69: GPU Computing

69

Flow Divergence

Page 70: GPU Computing

70

Page 71: GPU Computing

71

Flow Divergence

• The main performance concern with branching is divergence.
  – Threads within a warp take different paths.
  – The control paths are traversed one at a time.

• How can divergence be avoided when the branch condition is a function of the thread ID?
  – With divergence:
    • if (threadIdx.x > 2) { }
    • Threads 0, 1, and 2 follow a different path than the rest of the threads in the warp.
  – Without divergence:
    • if (threadIdx.x / WARP_SIZE >= 2) { }
    • Creates two different paths for threads in a block.
    • Branch granularity is a whole multiple of the warp size.
    • All threads in any given warp follow the same path.
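A small sketch (the kernel branch_demo and its arrays are illustrative, not from the slides) contrasting the two conditions above inside one kernel:

#define WARP_SIZE 32

__global__ void branch_demo(const int *a, const int *b, int *out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int x;

    // Divergent: threads 0-2 of a warp take a different path from threads 3-31,
    // so the warp executes both paths sequentially.
    if (threadIdx.x > 2)  x = a[i];
    else                  x = b[i];

    // Non-divergent: threadIdx.x / WARP_SIZE is identical for all 32 threads
    // of a warp, so each warp follows exactly one path.
    if (threadIdx.x / WARP_SIZE >= 2)  x += a[i];
    else                               x += b[i];

    out[i] = x;
}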

Page 72: GPU Computing

72

Memory Coalescing

• Dynamic Random Access Memory (DRAM)
  – Each bit is stored in a separate capacitor.
  – All storage locations have nearly identical access times.
  – In practice, many consecutive locations are accessed in parallel.
  – All threads in a warp should access consecutive memory locations (coalescing) to maximize memory bandwidth utilization.

[Figure] Threads 1 and 2 accessing Md and Nd (both WIDTH × WIDTH): the Md access pattern is not coalesced, the Nd access pattern is coalesced.
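A minimal sketch (kernel names and arguments are illustrative, not from the slides) contrasting a coalesced access pattern with a strided one:

// Coalesced: consecutive threads read consecutive addresses, so a warp's
// 32 loads can be combined into a few wide DRAM transactions.
__global__ void copy_coalesced(const float *in, float *out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i];
}

// Not coalesced: consecutive threads read addresses 'stride' elements apart
// (e.g., walking down a column of a row-major matrix), so each load may
// require its own memory transaction.
__global__ void copy_strided(const float *in, float *out, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i * stride];
}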

Page 73: GPU Computing

73

Memory Layout of a Matrix in C

[Figure] A 4×4 matrix M is stored linearly in row-major order in C: M0,0 M1,0 M2,0 M3,0, then M0,1 M1,1 M2,1 M3,1, then M0,2 … M3,2, then M0,3 … M3,3. Threads T0–T3 access the matrix during time period 1 and again during time period 2.

Page 74: GPU Computing

74

Memory Layout of a Matrix in C

[Figure] The same linearized row-major layout of M, shown with threads T0–T3 accessing it over time periods 1 and 2 in the alternative access pattern.

Page 75: GPU Computing

75

Shared Memory Architecture

• Many threads access memory:
  – Shared memory is divided into banks.
  – Successive 32-bit words are assigned to successive banks.
  – Each bank has a bandwidth of 32 bits per clock cycle.
  – G80 has 16 banks: bank = address % 16
  – This equals the size of half a warp.

• Each memory bank can service one address per cycle.
  – The banks can service as many simultaneous accesses as there are banks.

• Multiple simultaneous accesses to the same bank may result in a bank conflict.
  – Conflicting accesses are serialized.
  – There are no bank conflicts between different half-warps.

Page 76: GPU Computing

76

Bank Conflicts

• Shared memory is as fast as registers if there are no bank conflicts.

• The fast case:
  – If all threads of a half-warp access different banks, there is no bank conflict.
  – If all threads of a half-warp access the identical address, there is no bank conflict (broadcast).

• The slow case:
  – Bank conflict: multiple threads in the same half-warp access the same bank.
  – The accesses must be serialized.
  – Cost = maximum number of simultaneous accesses to a single bank

[Figure] Banks 0–15 of shared memory.
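A common illustration of this (not from the slides) is padding a shared-memory tile so that column accesses no longer map to a single bank; the sketch below assumes the 16-bank G80 organization, and the kernel name is illustrative.

#define TILE 16

__global__ void bank_demo(float *out) {
    // Pad each row by one word: without the "+1", all 16 threads of a
    // half-warp reading tile[threadIdx.x][threadIdx.y] (a column access)
    // would map to the same bank (stride of 16 words); with the padding the
    // stride becomes 17, so they hit 16 different banks.
    __shared__ float tile[TILE][TILE + 1];

    tile[threadIdx.y][threadIdx.x] = (float)threadIdx.x;
    __syncthreads();

    // Transposed read: each half-warp reads a column of the tile.
    out[threadIdx.y * TILE + threadIdx.x] = tile[threadIdx.x][threadIdx.y];
}

// Launched with one TILE x TILE block, e.g.:
//   bank_demo<<<1, dim3(TILE, TILE)>>>(d_out);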

Page 77: GPU Computing

77

Bank Addressing Example

• No bank conflicts
  – Linear addressing: thread i accesses bank i.

• No bank conflicts
  – Random 1:1 permutation: every thread still lands on a distinct bank.

[Figure] Threads 0–15 mapped onto banks 0–15 in both cases.

Page 78: GPU Computing

78

Bank Addressing Example

• 2-way bank conflicts
  – Linear addressing with stride = 2: threads 0 and 8 hit bank 0, threads 1 and 9 hit bank 2, and so on.

• 8-way bank conflicts
  – Linear addressing with stride = 8: eight threads hit bank 0 and the other eight hit bank 8.

[Figure] Threads 0–15 mapped onto banks 0–15 for strides 2 and 8.

Page 79: GPU Computing

79

Partitioning of SM Resources

• Execution resources in an SM
  – Registers
  – Block slots (GT200: 8)
  – Thread slots (GT200: 1024)
  – Number of 16×16 blocks = 1024/(16×16) = 4
  – These limits determine the number of threads running on an SM.
  – Subtle interactions may cause underutilization of resources.

• Register file
  – Stores the automatic variables declared in a CUDA kernel.
  – G80: 32 KB (8192 entries) per SM
  – Dynamically partitioned across all blocks on the same SM.
  – Each thread can only access the registers assigned to itself.

Page 80: GPU Computing

80

SM Resources Example

• For 16×16 blocks, if each thread uses 10 registers:
  – Each block requires 16×16×10 = 2560 registers.
  – Number of blocks = floor(8192/2560) = 3

• If each thread increases its register use by 1:
  – Each block now requires 16×16×11 = 2816 registers.
  – Number of blocks = floor(8192/2816) = 2
  – Only two blocks can run on an SM.
  – The number of threads drops from 768 to 512.
  – A 1/3 reduction in parallelism due to a single extra automatic variable!

Performance Cliff
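One way to guard against this cliff (not covered on the slide) is to bound register use per thread. A minimal sketch using CUDA's __launch_bounds__ qualifier and the nvcc --maxrregcount flag; the kernel itself is illustrative.

// Ask the compiler to keep register use low enough that at least 3 blocks of
// 256 threads can be resident per SM; it will spill to local memory if needed.
__global__ void __launch_bounds__(256, 3) tuned_kernel(float *data) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] *= 2.0f;
}

// Alternatively, cap registers for a whole compilation unit:
//   nvcc --maxrregcount=10 kernel.cu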

Page 81: GPU Computing

81

Occupancy Calculator

Page 82: GPU Computing

82

Instruction Mix

• Each processor core has limited instruction processing bandwidth.

• Every instruction consumes processing bandwidth:
  – Floating point calculation
  – Load instruction
  – Branch instruction

• We should try to increase the efficiency of instructions.

for (int k = 0; k < BLOCK_SIZE; k++)
    Pvalue += Mds[ty][k] * Nds[k][tx];

Each iteration costs 2 floating point arithmetic instructions, 2 address arithmetic instructions, 1 loop branch instruction and 1 loop counter increment instruction.

Page 83: GPU Computing

83

Instruction Mix

Pvalue += Mds[ty][0]*Nds[0][tx] + … + Mds[ty][15]*Nds[15][tx];

Loop Unrolling

• Express the dot-product computation as one long multiply-add expression.

• Eliminates the loop branch instruction.

• Eliminates the loop counter update.

• Matrix indices become constants rather than variables.

• With the help of the compiler, the address arithmetic instructions can also be eliminated!
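A related alternative, not shown on the slide, is to let the compiler do the unrolling. A minimal sketch using CUDA's #pragma unroll; the kernel dot_row and its arguments are illustrative.

#define TILE_WIDTH 16

__global__ void dot_row(const float *m_row, const float *n_col, float *out) {
    float Pvalue = 0.0f;
    // The trip count is a compile-time constant, so the compiler can fully
    // unroll: the loop branch and counter updates disappear and the array
    // indices become constants.
    #pragma unroll
    for (int k = 0; k < TILE_WIDTH; ++k)
        Pvalue += m_row[k] * n_col[k];
    *out = Pvalue;
}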

Page 84: GPU Computing

84

Dynamic Parallelism

• A child CUDA kernel can be called from within a parent CUDA kernel, without CPU involvement.

• An extension to the flat, single level of parallelism

• Requires Compute Capability 3.5+

• Benefits:
  – Simplified CPU/GPU Cooperation
  – Dynamic Load Balancing
  – Data-Dependent Execution
  – Recursive Parallel Algorithms

[Figure] Without dynamic parallelism the CPU launches every kernel on the GPU; with it, kernels running on the GPU can launch further kernels themselves.

Page 85: GPU Computing

85

What does it mean?

[Figure] CPU–GPU interaction without and with dynamic parallelism.

Page 86: GPU Computing

86

What does it mean?

Page 87: GPU Computing

87

Example

__global__ void ChildKernel(void *data) {
    // Operate on data
}

__global__ void ParentKernel(void *data) {
    ChildKernel<<<1, 16>>>(data);
}

// In host code
ParentKernel<<<256, 64>>>(data);

__global__ void RecursiveKernel(void *data) {
    if (continueRecursion == true)
        RecursiveKernel<<<64, 16>>>(data);
}

Page 88: GPU Computing

88

Matrix Example

for (int i = 0; i < N; i++) {
    for (int j = 0; j < M; j++) {
        convolution_function(i, j);
    }
}

Page 89: GPU Computing

89

Matrix Example

for (int i = 0; i < N; i++) {
    for (int j = 0; j < M[i]; j++) {
        convolution_function(i, j);
    }
}

Oversubscription

Page 90: GPU Computing

90

Matrix Example

__global__ void con_kernel(int i) {
    convolution_function(i, threadIdx.x);
}

__global__ void dynamic_parallelism_kernel(int *M) {
    con_kernel<<<1, M[blockIdx.x]>>>(blockIdx.x);
}

// In host code
dynamic_parallelism_kernel<<<N, 1>>>(M);

Page 91: GPU Computing

91

Synchronization

__global__ void Parent_Kernel() {
    ...   // kernel code
    if (threadIdx.x == 0) {
        Child_Kernel<<<1, 256>>>();   // the thread launches the kernel and keeps going
        cudaDeviceSynchronize();      // make this thread wait for Child_Kernel to complete
    }
    __syncthreads();   // if all threads in the block need Child_Kernel to have completed
    ...   // code that needs data generated by Child_Kernel
}

Page 92: GPU Computing

92

Timing GPU Kernels

float time;
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);     // Place the start event
kernel<<<...>>>(...);          // Returns immediately
cudaEventRecord(stop, 0);      // Place the stop event
cudaEventSynchronize(stop);    // Make sure stop is reached
cudaEventElapsedTime(&time, start, stop);

cudaEventDestroy(start);
cudaEventDestroy(stop);

[Figure] Stream 0: start event, kernel, stop event.

Page 93: GPU Computing

93

Multiple Kernels

// Create two streams
cudaStream_t stream[2];
for (int i = 0; i < 2; ++i)
    cudaStreamCreate(&stream[i]);

// Launch a kernel from each stream
KernelOne<<<64, 512, 0, stream[0]>>>(...);
KernelTwo<<<64, 512, 0, stream[1]>>>(...);

// Destroy the streams
for (int i = 0; i < 2; ++i)
    cudaStreamDestroy(stream[i]);

• Synchronization is implied for events within the same stream.

• More than one stream can be associated with a GPU.

Page 94: GPU Computing

94

Multiple GPUs

int nDevices;
cudaGetDeviceCount(&nDevices);
cudaDeviceProp prop;
for (int i = 0; i < nDevices; i++) {
    cudaGetDeviceProperties(&prop, i);
    printf("Device Number: %d\n", i);
    printf("Device Name: %s\n", prop.name);
    printf("Compute Capability: %d.", prop.major);
    printf("%d\n", prop.minor);
    printf("Memory Bus Width: %d\n", prop.memoryBusWidth);
}

Page 95: GPU Computing

95

Streams and Multiple GPUs

• Streams belong to the GPU that was active when they were created.

• Calls to a stream are invalid if the associated GPU is not active.

cudaSetDevice(0);
cudaStreamCreate(&streamA);
cudaSetDevice(1);
cudaStreamCreate(&streamB);

// Launch kernels (device 1 is now active)
KernelOne<<<..., streamA>>>(...);   // Invalid!
KernelTwo<<<..., streamB>>>(...);   // Valid

Page 96: GPU Computing

96

Floating Point Considerations

• Numeric values are represented as bit patterns.

• IEEE Floating Point Standard
  – Sign (S), Exponent (E) and Mantissa (M)
  – Each (S, E, M) pattern uniquely identifies a floating point number.

• For each bit pattern, its numeric value is derived as:
  – Value = (-1)^S × M × 2^E, where 1.0_B ≤ M < 10.0_B

• The interpretation of S:
  – S=0: positive number
  – S=1: negative number
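As a concrete illustration (not from the slides), here is a small host-side C sketch that decodes the (S, E, M) fields of an IEEE single-precision number, which uses 1 sign bit, 8 exponent bits (excess-127) and 23 stored mantissa bits:

#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(void) {
    float x = -0.5f;
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);          // reinterpret the bit pattern

    uint32_t S = bits >> 31;                 // 1 sign bit
    uint32_t E = (bits >> 23) & 0xFF;        // 8 exponent bits (excess-127)
    uint32_t M = bits & 0x7FFFFF;            // 23 stored mantissa bits ("1." is implied)

    printf("S=%u E=%u (unbiased %d) M=0x%06X\n",
           (unsigned)S, (unsigned)E, (int)E - 127, (unsigned)M);
    return 0;                                // prints: S=1 E=126 (unbiased -1) M=0x000000
}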

Page 97: GPU Computing

97

Normalized Representation of M

• Subscripts D and B denote decimal and binary place values respectively.

• Requiring 1.0_B ≤ M < 10.0_B makes the mantissa value of each floating point number unique.
  – For example: 0.5_D = 1.0_B × 2^-1
  – The only valid mantissa value is M = 1.0_B.
  – Neither 10.0_B × 2^-2 (M = 10.0_B) nor 0.1_B × 2^0 (M = 0.1_B) qualifies.
  – Just as 10.0_D × 10^5 or 0.9_D × 10^-3 are not valid normalized decimal forms.

• Because all mantissa values are of the form 1.XX…, the "1." part can be omitted from the representation.
  – The mantissa value 0.5_D in a 2-bit mantissa is stored as 00, obtained by omitting "1." from 1.00.
  – With the IEEE format, an n-bit mantissa is effectively an (n+1)-bit mantissa.

Page 98: GPU Computing

98

Excess Encoding of E

Decimal Value | Two's Complement | Excess-3
Reserved      | 100              | 111
-3            | 101              | 000
-2            | 110              | 001
-1            | 111              | 010
 0            | 000              | 011
 1            | 001              | 100
 2            | 010              | 101
 3            | 011              | 110

In an n-bit exponent representation, 2^(n-1) - 1 is added to the two's complement value to form its excess representation; the excess codes then increase monotonically with the exponent value.

Value = (-1)^S × 1.M × 2^(E - (2^(n-1) - 1))

Page 99: GPU Computing

99

Representable Numbers

E   M    No-zero (S=0 / S=1)                Abrupt Underflow (S=0 / S=1)   Denormalization (S=0 / S=1)
00  00   2^-1           / -(2^-1)           0 / 0                          0        / 0
00  01   2^-1 + 1·2^-3  / -(2^-1 + 1·2^-3)  0 / 0                          1·2^-2   / -1·2^-2
00  10   2^-1 + 2·2^-3  / -(2^-1 + 2·2^-3)  0 / 0                          2·2^-2   / -2·2^-2
00  11   2^-1 + 3·2^-3  / -(2^-1 + 3·2^-3)  0 / 0                          3·2^-2   / -3·2^-2
01  00   2^0            / -(2^0)            same as No-zero                same as No-zero
01  01   2^0 + 1·2^-2   / -(2^0 + 1·2^-2)   same as No-zero                same as No-zero
01  10   2^0 + 2·2^-2   / -(2^0 + 2·2^-2)   same as No-zero                same as No-zero
01  11   2^0 + 3·2^-2   / -(2^0 + 3·2^-2)   same as No-zero                same as No-zero
10  00   2^1            / -(2^1)            same as No-zero                same as No-zero
10  01   2^1 + 1·2^-1   / -(2^1 + 1·2^-1)   same as No-zero                same as No-zero
10  10   2^1 + 2·2^-1   / -(2^1 + 2·2^-1)   same as No-zero                same as No-zero
10  11   2^1 + 3·2^-1   / -(2^1 + 3·2^-1)   same as No-zero                same as No-zero
11  --   Reserved pattern

Page 100: GPU Computing

100

Representable Numbers

• The exponent bits define the major intervals of representable numbers.

• The mantissa bits define the number of representable numbers in each interval.

• Zero is not representable in this format.

• Representable numbers become closer to each other toward 0.

• There is a gap in representable numbers in the vicinity of 0.

[Figure] Number line of the representable numbers between 0 and 3, showing the major intervals and the gap around 0.

Page 101: GPU Computing

101

Representing Zero

• Abrupt Underflow
  – Treats all bit patterns with E=0 as 0.
  – Takes away several representable numbers near zero and lumps them all into 0.

[Figure] Number lines: with abrupt underflow the numbers between 0 and the first interval disappear; with denormalization they are spread evenly across the gap near 0.

• Denormalization
  – Relaxes the normalization requirement for numbers very close to 0.
  – Whenever E=0, the mantissa is assumed to be of the form 0.xx.
  – The exponent is assumed to be the same as in the previous interval.

Value = 0.M × 2^(-2^(n-1) + 2)

Example: the pattern 0 00 01 (S E M) represents 0.01_B × 2^0 = 2^-2.

Page 102: GPU Computing

102

Accuracy and Rounding

• 1.00 × 2^-2 + 1.00 × 2^1 = 0.001 × 2^1 + 1.00 × 2^1 = 1.001 × 2^1 ≈ 1.00 × 2^1  (error = 0.001 × 2^1)

• 1.00 × 2^0 + 1.00 × 2^0 + 1.00 × 2^-2 + 1.00 × 2^-2 = 1.00 × 2^1 + 1.00 × 2^-2 + 1.00 × 2^-2 = 1.00 × 2^1 + 1.00 × 2^-2 = 1.00 × 2^1  (each small term is rounded away when added to the large running sum)

• [1.00 × 2^0 + 1.00 × 2^0] + [1.00 × 2^-2 + 1.00 × 2^-2] = 1.00 × 2^1 + 1.00 × 2^-1 = 1.01 × 2^1

• Sorting data in ascending order may help achieve greater accuracy.
  – Numbers with similar numerical values are then close to each other in the summation order.

Page 103: GPU Computing

103

Single vs. Double Precision

• GPUs were traditionally not good at double precision calculation.
  – Requires compute capability 1.3 or above.
  – Around 1/8 of single precision performance.
  – Improved greatly, to 1/2, with the Fermi architecture.

• It is important to avoid using double precision when it is not necessary.
  – Add the 'f' suffix to float literals:
    • Y = X * 0.123;    // double assumed
    • Y = X * 0.123f;   // float explicit
  – Use the float versions of standard library functions:
    • Y = sin(X);    // double assumed
    • Y = sinf(X);   // single precision explicit

Page 104: GPU Computing

104

Matlab in Parallel

• Matlab: Numerical Computing Environment

• Parallel Computing Toolbox (PCT)

• Offloads work from one MATLAB session (the client) to other MATLAB sessions (the workers).

• Can run as many as eight MATLAB workers (R2011b) on your local machine in addition to your MATLAB client session.

http://www.mathworks.cn/cn/help/distcomp/index.html

Page 105: GPU Computing

105

Parfor

• Parallel for-loop

• The parfor body is executed on the client and workers.

• The data on which parfor operates is sent from the client to workers, and the results are sent back to the client and pieced together.

• MATLAB workers evaluate iterations in no particular order, and independently of each other.

• Classification of Variables– Loop, Sliced, Reduction, Broadcast, Temporary

Page 106: GPU Computing

106

Classification of Variables

a = 0; c = pi; z = 0;
r = rand(1,10);
parfor i = 1:10        % i: loop variable
    a = i;             % a: temporary
    z = z + i;         % z: reduction
    b(i) = r(i);       % b: sliced (output), r: sliced (input)
    if i <= c          % c: broadcast
        d = 2*a;       % d: temporary
    end
end

Page 107: GPU Computing

107

Parfor Example

X = zeros(1,N);
for i = 1:N
    X(i) = sin(i/N*2*pi);
end

Parallelization:

X = zeros(1,N);
matlabpool open local 8    % create 8 workers
parfor i = 1:N
    X(i) = sin(i/N*2*pi);
end
matlabpool close           % close all workers

Page 108: GPU Computing

108

Notes on Parfor

• Each loop iteration must be independent of the others.

• In the Windows Task Manager, there are multiple Matlab processes:
  – Higher CPU usage
  – Higher memory usage

• It incurs significant overhead: it only pays off for long loops and/or time-consuming iterations.

• Be prepared to be surprised:
  – Some Matlab functions are already optimized for multithreading.
  – The practical speedup is generally quite moderate.

Page 109: GPU Computing

109

GPU Accelerated Matlab

• Matlab users can now easily enjoy the benefits of GPU computing.

• Capabilities
  – Evaluating built-in functions on the GPU.
  – Running MATLAB code on the GPU.

• Requirements
  – Matlab 2014a (recommended)
  – An NVIDIA CUDA-enabled device with compute capability 1.3 or greater
  – NVIDIA CUDA device driver 3.1 or greater

• Check the GPU environment
  – gpuDeviceCount: number of available GPU devices
  – gpuDevice: select and query a GPU device

Page 110: GPU Computing

110

Create Data on GPU

• Transferring data between the workspace and the GPU:

M = rand(6);
G = gpuArray(single(M));   % workspace -> GPU
N = gather(G);             % GPU -> workspace

• Directly creating GPU data:

G = ones(100,100,50, 'single', 'gpuArray');
size(G)              % 100 100 50
classUnderlying(G)   % single

Page 111: GPU Computing

111

Execute Code on GPU

• Run built-in functions:

X = rand(1000,'single','gpuArray');
Gfft = fft(X);
Y = gather(Gfft);

• Run element-wise Matlab code:

function c = myCal(rawdata, gain, offst)
c = (rawdata .* gain) + offst;

meas = ones(1000)*3;                 % CPU
gn   = rand(1000,'gpuArray')/100;    % GPU
offs = rand(1000,'gpuArray')/50;     % GPU
corrected = arrayfun(@myCal, meas, gn, offs);
results = gather(corrected);

Page 112: GPU Computing

112

Timing GPU Code

A = rand(1024,'gpuArray');
fh = @()fft(A);
gputimeit(fh);

gd = gpuDevice();
tic();
B = fft(A);
wait(gd);
t = toc();

A = rand(12000,400,'gpuArray');
B = rand(400,12000,'gpuArray');
f = @()A*B;
t = gputimeit(f);

X = rand(1000,'gpuArray');
f = @()svd(X);
t1 = gputimeit(f,1);
t3 = gputimeit(f,3);

Page 113: GPU Computing

113

Testing Host-GPU Bandwidth

sizeOfDouble = 8;
sizes = power(2, 14:28);
sendTimes = inf(size(sizes));
gatherTimes = inf(size(sizes));
for i = 1:numel(sizes)
    numElements = sizes(i)/sizeOfDouble;
    hostData = randi([0 9], numElements, 1);
    gpuData = gpuArray.randi([0 9], numElements, 1);
    sendFcn = @()gpuArray(hostData);
    sendTimes(i) = gputimeit(sendFcn);
    gatherFcn = @()gather(gpuData);
    gatherTimes(i) = gputimeit(gatherFcn);
end
sendBandwidth = (sizes./sendTimes)/1e9;
[maxSendBandwidth, maxSendIdx] = max(sendBandwidth);
gatherBandwidth = (sizes./gatherTimes)/1e9;
[maxGatherBandwidth, maxGatherIdx] = max(gatherBandwidth);

Page 114: GPU Computing

114

Testing Host-GPU Bandwidth

[Plot] Data Transfer Bandwidth: transfer speed (GB/s, 0–5) vs. array size (bytes, 10^4–10^9), for "Send to GPU" and "Gather from GPU".

Page 115: GPU Computing

115

Testing CPU Bandwidth

sizeOfDouble = 8;
sizes = power(2, 14:28);
memoryTimesHost = inf(size(sizes));
for i = 1:numel(sizes)
    numElements = sizes(i)/sizeOfDouble;
    hostData = randi([0 9], numElements, 1);
    plusFcn = @()plus(hostData, 1.0);
    memoryTimesHost(i) = timeit(plusFcn);
end
memoryBandwidthHost = 2*(sizes./memoryTimesHost)/1e9;
[maxBWHost, maxBWIdxHost] = max(memoryBandwidthHost);

Page 116: GPU Computing

116

Testing GPU Bandwidth

memoryTimesGPU = inf(size(sizes));
for i = 1:numel(sizes)
    numElements = sizes(i)/sizeOfDouble;
    gpuData = gpuArray.randi([0 9], numElements, 1);
    plusFcn = @()plus(gpuData, 1.0);
    memoryTimesGPU(i) = gputimeit(plusFcn);
end
memoryBandwidthGPU = 2*(sizes./memoryTimesGPU)/1e9;
[maxBWGPU, maxBWIdxGPU] = max(memoryBandwidthGPU);

Page 117: GPU Computing

117

Bandwidth: CPU vs. GPU

[Plot] Read+Write Bandwidth: speed (GB/s, 0–150) vs. array size (bytes, 10^4–10^9), for GPU and Host.

Page 118: GPU Computing

118

Testing Matrix Multiplication

sizes = power(2, 12:2:24);
N = sqrt(sizes);
mmTimesHost = inf(size(sizes));
mmTimesGPU = inf(size(sizes));
for i = 1:numel(sizes)
    A = rand(N(i), N(i));
    B = rand(N(i), N(i));
    mmTimesHost(i) = timeit(@()A*B);
    A = gpuArray(A);
    B = gpuArray(B);
    mmTimesGPU(i) = gputimeit(@()A*B);
end
mmGFlopsHost = (2*N.^3 - N.^2)./mmTimesHost/1e9;
[maxGFlopsHost, maxGFlopsHostIdx] = max(mmGFlopsHost);
mmGFlopsGPU = (2*N.^3 - N.^2)./mmTimesGPU/1e9;
[maxGFlopsGPU, maxGFlopsGPUIdx] = max(mmGFlopsGPU);

Page 119: GPU Computing

119

Testing Matrix Multiplication

[Plot] Double precision matrix-matrix multiply: calculation rate (GFLOPS, 0–1000) vs. matrix size (numel, 10^3–10^8), for GPU and Host.

Page 120: GPU Computing

120

Testing Matrix Multiplication

[Plot] Single precision matrix-matrix multiply: calculation rate (GFLOPS, 0–3000) vs. matrix size (numel, 10^3–10^8), for GPU and Host.

Page 121: GPU Computing

121

Review

• What are the differences among MPI, OpenMP and CUDA?

• Why is GPU suitable for high performance computing?

• What is the general framework of CUDA programming?

• What is a kernel function and how to call it from the host code?

• What is the advantage of splitting a block into threads?

• Why do we need multiple thread blocks?

• What are the major memory types in CUDA?

• When should we use shared memory?

• What resource factors are critical to GPU programming?

Page 122: GPU Computing

122

Review

• What is a warp and why do we need it?

• What is flow divergence and how to avoid it?

• What is bank conflict?

• What is instruction mix?

• What are the benefits of Dynamic Parallelism?

• How to measure the performance of GPU code?

• How to run kernel functions in parallel?

• How is a floating point number represented in the IEEE format?

• How to execute Matlab code on GPU?