

Ferienakademie 2007 Alexander Heinecke (TUM) 1

A short introduction to nVidia's CUDA

Alexander Heinecke
Technical University of Munich
http://home.in.tum.de/~heinecke/fa2007


Ferienakademie 2007 Alexander Heinecke (TUM) 2

Overview

1. Differences CPU – GPU (p. 3)
   1. General CPU/GPU properties
   2. Comparison of specifications

2. CUDA Programming Model (p. 10)
   1. Application stack
   2. Thread implementation
   3. Memory model

3. CUDA API (p. 13)
   1. Extension of the C/C++ programming language
   2. Example structure of a CUDA application

4. Examples (p. 15)
   1. Matrix addition
   2. Matrix multiplication
   3. Jacobi & Gauß-Seidel

5. Benchmark Results (p. 21)


Ferienakademie 2007 Alexander Heinecke (TUM) 3

Differences between CPU and GPU

• GPU: nearly all transistors are ALUs
• CPU: most of the transistors are cache

(taken from [NV1])


Ferienakademie 2007 Alexander Heinecke (TUM) 4

AMD Opteron Dieshot


Ferienakademie 2007 Alexander Heinecke (TUM) 5

Intel Itanium2 Dual-Core Dieshot


Ferienakademie 2007 Alexander Heinecke (TUM) 6

Intel Core Architecture Pipeline / Simple Example

(taken from [IN1])

Cycle      1          2          3          4          5          6          7

Step 1     IFETCH #1  IFETCH #2  IFETCH #3  IFETCH #4  IFETCH #5  IFETCH #6  IFETCH #7
Step 2                IDEC #1    IDEC #2    IDEC #3    IDEC #4    IDEC #5    IDEC #6
Step 3                           OFETCH #1  OFETCH #2  OFETCH #3  OFETCH #4  OFETCH #5
Step 4                                      EXEC #1    EXEC #2    EXEC #3    EXEC #4
Step 5                                                 RET #1     RET #2     RET #3


Ferienakademie 2007 Alexander Heinecke (TUM) 7

nVidia G80 Pipeline


Ferienakademie 2007 Alexander Heinecke (TUM) 8

Properties of CPU and GPU

                           Intel Xeon X5355             nVidia G80 (8800 GTX)
Clock speed                2.66 GHz                     575 MHz
#Cores / SPs               4                            128
Floats in registers        96                           131072
Max. GFlop/s (float)       84 (prac.) / 85 (theo.)      460 (prac.) / 500 (theo.)
Max. instructions          RAM limited                  2 million G80 ASM instr.
Typ. instr. duration       1-2 cycles (SSE)             min. 4 cycles
Price (€)                  800                          500


Ferienakademie 2007 Alexander Heinecke (TUM) 9

History: Power of GPUs in the last four years

(taken from [NV1])


Ferienakademie 2007 Alexander Heinecke (TUM) 10

Application stack of CUDA

(taken from [NV1])


Ferienakademie 2007 Alexander Heinecke (TUM) 11

Thread organization in CUDA

(taken from [NV1])


Ferienakademie 2007 Alexander Heinecke (TUM) 12

Memory organization in CUDA

(taken from [NV1])


Ferienakademie 2007 Alexander Heinecke (TUM) 13

Extensions to C (functions and variables)

• CUDA code is saved in special files (*.cu)
• These are precompiled by nvcc (the nVidia compiler)
• Function type qualifiers decide where a function is executed:
  – __host__ (CPU only, called by CPU)
  – __global__ (GPU only, called by CPU)
  – __device__ (GPU only, called by GPU)

• For variables: __device__, __constant__, __shared__


Ferienakademie 2007 Alexander Heinecke (TUM) 14

Example structure of a CUDA application

• At minimum two functions, to isolate the CUDA code from your application

• First function:
  – Init CUDA
  – Copy data to the device
  – Call the kernel with execution settings
  – Copy data back to the host and shut down (automatic)

• Second function (kernel):
  – Contains the problem for ONE thread


Ferienakademie 2007 Alexander Heinecke (TUM) 15

Tested Algorithms (2D Arrays)

All tested algorithms operate on 2D Arrays

• Matrix Addition

• Matrix Multiplication

• Jacobi & Gauß-Seidel (iterative solver)

Matrix addition:        c_{i,j} = a_{i,j} + b_{i,j}

Matrix multiplication:  c_{i,j} = \sum_{k=1}^{n} a_{i,k} \, b_{k,j}

Jacobi:                 u^{new}_{i,j} = \frac{1}{4} \left( \frac{1}{n^2} f_{i,j} + u^{old}_{i-1,j} + u^{old}_{i+1,j} + u^{old}_{i,j-1} + u^{old}_{i,j+1} \right)

Gauß-Seidel:            u_{i,j} = \frac{1}{4} \left( \frac{1}{n^2} f_{i,j} + u_{i-1,j} + u_{i+1,j} + u_{i,j-1} + u_{i,j+1} \right), using already-updated neighbour values within the sweep
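For reference, the Jacobi update above can be written as a plain C++ sweep on the CPU (this sketch is not part of the original slides; the function name, row-major storage, and mesh width parameter `h` are assumptions):

```cpp
#include <vector>
#include <cstddef>

// One Jacobi sweep on an (n+2) x (n+2) grid, row-major storage.
// The boundary values stay fixed; h is the mesh width, so h*h*f
// plays the role of (1/n^2)*f in the slide's formula.
void jacobi_sweep(const std::vector<double>& u_old,
                  std::vector<double>& u_new,
                  const std::vector<double>& f,
                  std::size_t n, double h)
{
    const std::size_t dim = n + 2; // grid width including boundary
    for (std::size_t i = 1; i <= n; ++i) {
        for (std::size_t j = 1; j <= n; ++j) {
            u_new[i * dim + j] = 0.25 * (h * h * f[i * dim + j]
                                         + u_old[(i - 1) * dim + j]
                                         + u_old[(i + 1) * dim + j]
                                         + u_old[i * dim + j - 1]
                                         + u_old[i * dim + j + 1]);
        }
    }
}
```

The GPU kernels on the following slides perform exactly this per-element update, with one thread per grid point instead of the two loops.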


Ferienakademie 2007 Alexander Heinecke (TUM) 16

Example Matrix Addition (Init function)

CUT_DEVICE_INIT();
// allocate device memory
float* d_A;
CUDA_SAFE_CALL(cudaMalloc((void**) &d_A, mem_size));
…
// copy host memory to device
CUDA_SAFE_CALL(cudaMemcpy(d_A, ma_a, mem_size, cudaMemcpyHostToDevice));
…
cudaBindTexture(0, texRef_MaA, d_A, mem_size); // texture binding
…
dim3 threads(BLOCK_SIZE_GPU, BLOCK_SIZE_GPU);
dim3 grid(n_dim / threads.x, n_dim / threads.y);
// execute the kernel
cuMatrixAdd_kernel<<< grid, threads >>>(d_C, n_dim);
cudaUnbindTexture(texRef_MaA); // texture unbinding
…
// copy result from device to host
CUDA_SAFE_CALL(cudaMemcpy(ma_c, d_C, mem_size, cudaMemcpyDeviceToHost));
…
CUDA_SAFE_CALL(cudaFree(d_A));


Ferienakademie 2007 Alexander Heinecke (TUM) 17

Example Matrix Addition (kernel)

// Block index
int bx = blockIdx.x;
int by = blockIdx.y;
// Thread index
int tx = threadIdx.x;
int ty = threadIdx.y;

int start = (n_dim * by * BLOCK_SIZE_GPU) + bx * BLOCK_SIZE_GPU;

C[start + (n_dim * ty) + tx] =
    tex1Dfetch(texRef_MaA, start + (n_dim * ty) + tx) +
    tex1Dfetch(texRef_MaB, start + (n_dim * ty) + tx);
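The kernel's index arithmetic maps each (block, thread) pair to one flat array element. A CPU sketch of the same mapping (not from the original slides; the function name and parameters are hypothetical) makes the addressing explicit:

```cpp
#include <vector>

// CPU analogue of the matrix-addition kernel: the four loops enumerate
// what the GPU launches as blocks and threads, using the same flat
// index computation as the kernel.
void matrix_add_cpu(const std::vector<float>& A, const std::vector<float>& B,
                    std::vector<float>& C, int n_dim, int block_size)
{
    int blocks = n_dim / block_size; // grid is blocks x blocks
    for (int by = 0; by < blocks; ++by)
      for (int bx = 0; bx < blocks; ++bx)
        for (int ty = 0; ty < block_size; ++ty)
          for (int tx = 0; tx < block_size; ++tx) {
            int start = (n_dim * by * block_size) + bx * block_size;
            int idx = start + (n_dim * ty) + tx; // same index as the kernel
            C[idx] = A[idx] + B[idx];
          }
}
```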


Ferienakademie 2007 Alexander Heinecke (TUM) 18

Example Matrix Multiplication (kernel)

int tx2 = tx + BLOCK_SIZE_GPU;
int ty2 = n_dim * ty;
float Csub1 = 0.0f;
float Csub2 = 0.0f;
int b = bBegin;
for (int a = aBegin; a <= aEnd; a += aStep)
{
    __shared__ float As[BLOCK_SIZE_GPU][BLOCK_SIZE_GPU];
    AS(ty, tx) = A[a + ty2 + tx];
    __shared__ float B1s[BLOCK_SIZE_GPU][BLOCK_SIZE_GPU*2];
    B1S(ty, tx)  = B[b + ty2 + tx];
    B1S(ty, tx2) = B[b + ty2 + tx2];
    __syncthreads();
    Csub1 += AS(ty, 0) * B1S(0, tx);
    // more calcs
    b += bStep;
}
__syncthreads();
// Write result back
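The shared-memory tiles above are the GPU form of a blocked multiplication: each tile is loaded once and reused for a whole sub-block of products. A CPU sketch of the same blocking idea (hypothetical function name; not the slide's actual code):

```cpp
#include <vector>
#include <algorithm>

// Blocked (tiled) matrix multiplication: process bs x bs sub-blocks so
// each tile stays resident (cache on the CPU, shared memory on the GPU)
// while it is reused across an entire sub-block of C.
void matmul_blocked(const std::vector<float>& A, const std::vector<float>& B,
                    std::vector<float>& C, int n, int bs)
{
    std::fill(C.begin(), C.end(), 0.0f);
    for (int ii = 0; ii < n; ii += bs)
      for (int kk = 0; kk < n; kk += bs)
        for (int jj = 0; jj < n; jj += bs)
          for (int i = ii; i < std::min(ii + bs, n); ++i)
            for (int k = kk; k < std::min(kk + bs, n); ++k)
              for (int j = jj; j < std::min(jj + bs, n); ++j)
                C[i * n + j] += A[i * n + k] * B[k * n + j];
}
```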


Ferienakademie 2007 Alexander Heinecke (TUM) 19

Example Jacobi (kernel), no internal loops

// Block index
int bx = blockIdx.x;
int by = blockIdx.y;
// Thread index
int tx = threadIdx.x + 1;
int ty = threadIdx.y + 1;

int ustart = ((by * BLOCK_SIZE_GPU) * n_dim) + (bx * BLOCK_SIZE_GPU);

float res = tex1Dfetch(texRef_MaF, ustart + (ty * n_dim) + tx) * qh;

res += tex1Dfetch(texRef_MaU, ustart + (ty * n_dim) + tx - 1) + tex1Dfetch(texRef_MaU, ustart + (ty * n_dim) + tx + 1);

res += tex1Dfetch(texRef_MaU, ustart + ((ty+1) * n_dim) + tx) + tex1Dfetch(texRef_MaU, ustart + ((ty-1) * n_dim) + tx);

res = 0.25f * res;

ma_u[ustart + (ty * n_dim) + tx] = res;


Ferienakademie 2007 Alexander Heinecke (TUM) 20

Example Jacobi (kernel), internal loops

int tx = threadIdx.x + 1;
int ty = threadIdx.y + 1;
// *some more inits*

// load to calc u_ij
__shared__ float Us[BLOCK_SIZE_GPU+2][BLOCK_SIZE_GPU+2];
US(ty, tx) = tex1Dfetch(texRef_MaU, ustart + (ty * n_dim) + tx);
// *init edge u*
…
for (unsigned int i = 0; i < n_intern_loops; i++)
{
    res = funk;
    res += US(ty, tx - 1) + US(ty, tx + 1);
    res += US(ty - 1, tx) + US(ty + 1, tx);
    res = 0.25f * res;

    __syncthreads(); // not used in parallel Jacobi

    US(ty, tx) = res;
}
ma_u[ustart + (ty * n_dim) + tx] = res;


Ferienakademie 2007 Alexander Heinecke (TUM) 21

Performance Results (1)

8800 GTX 768 MB (575 MHz GPU, 900 MHz DDR3 memory; on E6400) vs.
X5355 (2×4×2.66 GHz, 2 GB FB-DIMM, 1333 MHz FSB, Win2k3 x64) vs.
T7600 (2×2.33 GHz, 3.3 GB DDR2, 667 MHz FSB, Win x64)

[Chart: MFLOP/s over matrix dimension (n x n), n = 32 … 3104; series: GeForce 8800 GTX, X5355 4 Threads, T7600 2 Threads, X5355 8 Threads]


Ferienakademie 2007 Alexander Heinecke (TUM) 22

Performance Results (2)

Jacobi

[Chart: MFlop/s over dimension, n = 18 … 4194; series: X5355 (4 Threads), 8800GTX, X5355 (8 Threads)]


Ferienakademie 2007 Alexander Heinecke (TUM) 23

Performance Results (3)

Jacobi parallel (16 internal loops)

[Chart: MFlop/s over dimension, n = 18 … 4194; series: X5355 (4 Threads), 8800GTX, X5355 (8 Threads)]


Ferienakademie 2007 Alexander Heinecke (TUM) 24

Performance Results (4)

Jacobi parallel comparison

[Chart: MFlop/s over dimension, n = 18 … 4194; series: 8800GTX (500 lp.), 8800GTX (200 lp.), 8800GTX (100 lp.), 8800GTX (50 lp.), 8800GTX (16 lp.), X5355 (4 Threads, all lp.), X5355 (8 Threads, all lp.)]


Ferienakademie 2007 Alexander Heinecke (TUM) 25

Conclusion (Points to take care of)

Take care of / you should:

• minimize the number of memory accesses
• use unrolling instead of for loops
• use blocking algorithms
• implement with CUDA only algorithms that are not extremely memory bound (NOT matrix addition)
• avoid if statements and other control-flow statements (slow)
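The unrolling advice can be illustrated with a small CPU example (not from the slides; the function name is hypothetical): processing four elements per iteration with independent accumulators removes loop overhead and lets the hardware overlap the additions.

```cpp
#include <vector>
#include <cstddef>

// 4-way manually unrolled summation. Four independent accumulators
// break the dependency chain of a single running sum, so the additions
// can execute in parallel; a remainder loop handles the leftover tail.
float sum_unrolled4(const std::vector<float>& v)
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    std::size_t i = 0;
    for (; i + 4 <= v.size(); i += 4) {
        s0 += v[i];
        s1 += v[i + 1];
        s2 += v[i + 2];
        s3 += v[i + 3];
    }
    float s = s0 + s1 + s2 + s3;
    for (; i < v.size(); ++i) s += v[i]; // remainder elements
    return s;
}
```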


Ferienakademie 2007 Alexander Heinecke (TUM) 26

Appendix - References

[NV1] NVIDIA CUDA Compute Unified Device Architecture, Programming Guide, Version 1.0; nVidia Corporation, 23.06.2007

[IN1/2/3] Intel Architecture Handbook, November 2006

[NR] Numerical Recipes (online generated PDF)

http://home.in.tum.de/~heinecke/fa2007