

Ferienakademie 2007 Alexander Heinecke (TUM) 1

A short introduction to nVidia's CUDA

Alexander Heinecke
Technical University of Munich
http://home.in.tum.de/~heinecke/fa2007


Ferienakademie 2007 Alexander Heinecke (TUM) 2

Overview

1. Differences CPU – GPU (p. 3)
   1. General CPU/GPU properties
   2. Comparison of specifications

2. CUDA Programming Model (p. 10)
   1. Application stack
   2. Thread implementation
   3. Memory model

3. CUDA API (p. 13)
   1. Extension of the C/C++ programming language
   2. Example structure of a CUDA application

4. Examples (p. 15)
   1. Matrix addition
   2. Matrix multiplication
   3. Jacobi & Gauß-Seidel

5. Benchmark Results (p. 21)


Ferienakademie 2007 Alexander Heinecke (TUM) 3

Differences between CPU and GPU

• GPU: nearly all transistors are ALUs
• CPU: most of the transistors are cache

(taken from [NV1])


Ferienakademie 2007 Alexander Heinecke (TUM) 4

AMD Opteron Dieshot


Ferienakademie 2007 Alexander Heinecke (TUM) 5

Intel Itanium2 Dual-Core Dieshot


Ferienakademie 2007 Alexander Heinecke (TUM) 6

Intel Core Architecture Pipeline / Simple Example

(taken from [IN1])

Cycle      1          2          3          4          5          6          7

Step 1     IFETCH #1  IFETCH #2  IFETCH #3  IFETCH #4  IFETCH #5  IFETCH #6  IFETCH #7
Step 2                IDEC #1    IDEC #2    IDEC #3    IDEC #4    IDEC #5    IDEC #6
Step 3                           OFETCH #1  OFETCH #2  OFETCH #3  OFETCH #4  OFETCH #5
Step 4                                      EXEC #1    EXEC #2    EXEC #3    EXEC #4
Step 5                                                 RET #1     RET #2     RET #3


Ferienakademie 2007 Alexander Heinecke (TUM) 7

nVidia G80 Pipeline


Ferienakademie 2007 Alexander Heinecke (TUM) 8

Properties of CPU and GPU

                           Intel Xeon X5355             nVidia G80 (8800 GTX)
Clock speed                2.66 GHz                     575 MHz
#Cores / SPs               4                            128
Floats in registers        96                           131072
Max. GFlop/s (float)       84 (prac.) / 85 (theo.)      460 (prac.) / 500 (theo.)
Max. instructions          RAM limited                  2 million G80 ASM instr.
Typ. instr. duration       1-2 cycles (SSE)             min. 4 cycles
Price (€)                  800                          500


Ferienakademie 2007 Alexander Heinecke (TUM) 9

History: Power of GPUs in the last four years

(taken from [NV1])


Ferienakademie 2007 Alexander Heinecke (TUM) 10

Application stack of CUDA

(taken from [NV1])


Ferienakademie 2007 Alexander Heinecke (TUM) 11

Thread organization in CUDA

(taken from [NV1])


Ferienakademie 2007 Alexander Heinecke (TUM) 12

Memory organization in CUDA

(taken from [NV1])


Ferienakademie 2007 Alexander Heinecke (TUM) 13

Extensions to C (functions and variables)

• CUDA code is saved in special files (*.cu)
• These are precompiled by nvcc (the nVidia compiler)
• Function type qualifiers decide where a function is executed:
  – __host__ (CPU only, called by CPU)
  – __global__ (GPU only, called by CPU)
  – __device__ (GPU only, called by GPU)

• For variables: __device__, __constant__, __shared__


Ferienakademie 2007 Alexander Heinecke (TUM) 14

Example structure of a CUDA application

• At minimum two functions, to isolate the CUDA code from your application

• First function:
  – Init CUDA
  – Copy data to the device
  – Call the kernel with execution settings
  – Copy data back to the host and shut down (automatic)

• Second function (kernel):
  – Contains the problem for ONE thread


Ferienakademie 2007 Alexander Heinecke (TUM) 15

Tested Algorithms (2D Arrays)

All tested algorithms operate on 2D Arrays

• Matrix Addition

• Matrix Multiplication

• Jacobi & Gauß-Seidel (iterative solver)

Matrix addition:        c_{i,j} = a_{i,j} + b_{i,j}

Matrix multiplication:  c_{i,j} = \sum_{k=1}^{n} a_{i,k} \, b_{k,j}

Jacobi:                 u^{new}_{i,j} = \frac{1}{4} \left( \frac{1}{n^2} f_{i,j} + u^{old}_{i-1,j} + u^{old}_{i+1,j} + u^{old}_{i,j-1} + u^{old}_{i,j+1} \right)

Gauß-Seidel:            u_{i,j} = \frac{1}{4} \left( \frac{1}{n^2} f_{i,j} + u_{i-1,j} + u_{i+1,j} + u_{i,j-1} + u_{i,j+1} \right), using already-updated neighbour values within the sweep
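For reference, the Jacobi update above can be written as a plain C++ sweep on the CPU (this sketch is not part of the original slides; the function name, row-major storage, and mesh width parameter `h` are assumptions):

```cpp
#include <vector>
#include <cstddef>

// One Jacobi sweep on an (n+2) x (n+2) grid, row-major storage.
// The boundary values stay fixed; h is the mesh width, so h*h*f
// plays the role of (1/n^2)*f in the slide's formula.
void jacobi_sweep(const std::vector<double>& u_old,
                  std::vector<double>& u_new,
                  const std::vector<double>& f,
                  std::size_t n, double h)
{
    const std::size_t dim = n + 2; // grid width including boundary
    for (std::size_t i = 1; i <= n; ++i) {
        for (std::size_t j = 1; j <= n; ++j) {
            u_new[i * dim + j] = 0.25 * (h * h * f[i * dim + j]
                                         + u_old[(i - 1) * dim + j]
                                         + u_old[(i + 1) * dim + j]
                                         + u_old[i * dim + j - 1]
                                         + u_old[i * dim + j + 1]);
        }
    }
}
```

The GPU kernels on the following slides perform exactly this per-element update, with one thread per grid point instead of the two loops.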


Ferienakademie 2007 Alexander Heinecke (TUM) 16

Example Matrix Addition (Init function)

CUT_DEVICE_INIT();
// allocate device memory
float* d_A;
CUDA_SAFE_CALL(cudaMalloc((void**) &d_A, mem_size));
…
// copy host memory to device
CUDA_SAFE_CALL(cudaMemcpy(d_A, ma_a, mem_size, cudaMemcpyHostToDevice));
…
cudaBindTexture(0, texRef_MaA, d_A, mem_size); // texture binding
…
dim3 threads(BLOCK_SIZE_GPU, BLOCK_SIZE_GPU);
dim3 grid(n_dim / threads.x, n_dim / threads.y);
// execute the kernel
cuMatrixAdd_kernel<<< grid, threads >>>(d_C, n_dim);
cudaUnbindTexture(texRef_MaA); // texture unbinding
…
// copy result from device to host
CUDA_SAFE_CALL(cudaMemcpy(ma_c, d_C, mem_size, cudaMemcpyDeviceToHost));
…
CUDA_SAFE_CALL(cudaFree(d_A));


Ferienakademie 2007 Alexander Heinecke (TUM) 17

Example Matrix Addition (kernel)

// Block index
int bx = blockIdx.x;
int by = blockIdx.y;
// Thread index
int tx = threadIdx.x;
int ty = threadIdx.y;

int start = (n_dim * by * BLOCK_SIZE_GPU) + bx * BLOCK_SIZE_GPU;

C[start + (n_dim * ty) + tx] =
    tex1Dfetch(texRef_MaA, start + (n_dim * ty) + tx) +
    tex1Dfetch(texRef_MaB, start + (n_dim * ty) + tx);
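The kernel's index arithmetic maps each (block, thread) pair to one flat array element. A CPU sketch of the same mapping (not from the original slides; the function name and parameters are hypothetical) makes the addressing explicit:

```cpp
#include <vector>

// CPU analogue of the matrix-addition kernel: the four loops enumerate
// what the GPU launches as blocks and threads, using the same flat
// index computation as the kernel.
void matrix_add_cpu(const std::vector<float>& A, const std::vector<float>& B,
                    std::vector<float>& C, int n_dim, int block_size)
{
    int blocks = n_dim / block_size; // grid is blocks x blocks
    for (int by = 0; by < blocks; ++by)
      for (int bx = 0; bx < blocks; ++bx)
        for (int ty = 0; ty < block_size; ++ty)
          for (int tx = 0; tx < block_size; ++tx) {
            int start = (n_dim * by * block_size) + bx * block_size;
            int idx = start + (n_dim * ty) + tx; // same index as the kernel
            C[idx] = A[idx] + B[idx];
          }
}
```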


Ferienakademie 2007 Alexander Heinecke (TUM) 18

Example Matrix Multiplication (kernel)

int tx2 = tx + BLOCK_SIZE_GPU;
int ty2 = n_dim * ty;
float Csub1 = 0.0f;
float Csub2 = 0.0f;
int b = bBegin;
for (int a = aBegin; a <= aEnd; a += aStep)
{
    __shared__ float As[BLOCK_SIZE_GPU][BLOCK_SIZE_GPU];
    AS(ty, tx) = A[a + ty2 + tx];
    __shared__ float B1s[BLOCK_SIZE_GPU][BLOCK_SIZE_GPU*2];
    B1S(ty, tx)  = B[b + ty2 + tx];
    B1S(ty, tx2) = B[b + ty2 + tx2];
    __syncthreads();
    Csub1 += AS(ty, 0) * B1S(0, tx);
    // more calcs
    b += bStep;
}
__syncthreads();
// Write result back
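The shared-memory tiles above are the GPU form of a blocked multiplication: each tile is loaded once and reused for a whole sub-block of products. A CPU sketch of the same blocking idea (hypothetical function name; not the slide's actual code):

```cpp
#include <vector>
#include <algorithm>

// Blocked (tiled) matrix multiplication: process bs x bs sub-blocks so
// each tile stays resident (cache on the CPU, shared memory on the GPU)
// while it is reused across an entire sub-block of C.
void matmul_blocked(const std::vector<float>& A, const std::vector<float>& B,
                    std::vector<float>& C, int n, int bs)
{
    std::fill(C.begin(), C.end(), 0.0f);
    for (int ii = 0; ii < n; ii += bs)
      for (int kk = 0; kk < n; kk += bs)
        for (int jj = 0; jj < n; jj += bs)
          for (int i = ii; i < std::min(ii + bs, n); ++i)
            for (int k = kk; k < std::min(kk + bs, n); ++k)
              for (int j = jj; j < std::min(jj + bs, n); ++j)
                C[i * n + j] += A[i * n + k] * B[k * n + j];
}
```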


Ferienakademie 2007 Alexander Heinecke (TUM) 19

Example Jacobi (kernel), no internal loops

// Block index
int bx = blockIdx.x;
int by = blockIdx.y;
// Thread index
int tx = threadIdx.x + 1;
int ty = threadIdx.y + 1;

int ustart = ((by * BLOCK_SIZE_GPU) * n_dim) + (bx * BLOCK_SIZE_GPU);

float res = tex1Dfetch(texRef_MaF, ustart + (ty * n_dim) + tx) * qh;

res += tex1Dfetch(texRef_MaU, ustart + (ty * n_dim) + tx - 1) + tex1Dfetch(texRef_MaU, ustart + (ty * n_dim) + tx + 1);

res += tex1Dfetch(texRef_MaU, ustart + ((ty+1) * n_dim) + tx) + tex1Dfetch(texRef_MaU, ustart + ((ty-1) * n_dim) + tx);

res = 0.25f * res;

ma_u[ustart + (ty * n_dim) + tx] = res;


Ferienakademie 2007 Alexander Heinecke (TUM) 20

Example Jacobi (kernel), internal loops

int tx = threadIdx.x + 1;
int ty = threadIdx.y + 1;
// *some more inits*

// load to calc u_ij
__shared__ float Us[BLOCK_SIZE_GPU+2][BLOCK_SIZE_GPU+2];
US(ty, tx) = tex1Dfetch(texRef_MaU, ustart + (ty * n_dim) + tx);
// *init edge u*
…
for (unsigned int i = 0; i < n_intern_loops; i++)
{
    res = funk;
    res += US(ty, tx - 1) + US(ty, tx + 1);
    res += US(ty - 1, tx) + US(ty + 1, tx);
    res = 0.25f * res;

    __syncthreads(); // not used in parallel Jacobi

    US(ty, tx) = res;
}
ma_u[ustart + (ty * n_dim) + tx] = res;


Ferienakademie 2007 Alexander Heinecke (TUM) 21

Performance Results (1)

8800 GTX 768 MB (575 MHz GPU, 900 MHz DDR3 memory; on E6400) vs.
X5355 (2×4×2.66 GHz, 2 GB FB-DIMM, 1333 MHz FSB, Win2k3 x64) vs.
T7600 (2×2.33 GHz, 3.3 GB DDR2, 667 MHz FSB, Win x64)

[Chart: MFLOP/s over matrix dimension (n x n), n = 32 … 3104; series: GeForce 8800 GTX, X5355 4 Threads, T7600 2 Threads, X5355 8 Threads]


Ferienakademie 2007 Alexander Heinecke (TUM) 22

Performance Results (2)

Jacobi

[Chart: MFlop/s over dimension, n = 18 … 4194; series: X5355 (4 Threads), 8800GTX, X5355 (8 Threads)]


Ferienakademie 2007 Alexander Heinecke (TUM) 23

Performance Results (3)

Jacobi parallel (16 internal loops)

[Chart: MFlop/s over dimension, n = 18 … 4194; series: X5355 (4 Threads), 8800GTX, X5355 (8 Threads)]


Ferienakademie 2007 Alexander Heinecke (TUM) 24

Performance Results (4)

Jacobi parallel comparison

[Chart: MFlop/s over dimension, n = 18 … 4194; series: 8800GTX (500 lp.), 8800GTX (200 lp.), 8800GTX (100 lp.), 8800GTX (50 lp.), 8800GTX (16 lp.), X5355 (4 Threads, all lp.), X5355 (8 Threads, all lp.)]


Ferienakademie 2007 Alexander Heinecke (TUM) 25

Conclusion (Points to take care of)

Take care of / you should:

• minimize the number of memory accesses
• use unrolling instead of for loops
• use blocking algorithms
• implement with CUDA only algorithms that are not extremely memory bound (NOT matrix addition)
• avoid if statements and other control-flow statements (slow)
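The unrolling advice can be illustrated with a small CPU example (not from the slides; the function name is hypothetical): processing four elements per iteration with independent accumulators removes loop overhead and lets the hardware overlap the additions.

```cpp
#include <vector>
#include <cstddef>

// 4-way manually unrolled summation. Four independent accumulators
// break the dependency chain of a single running sum, so the additions
// can execute in parallel; a remainder loop handles the leftover tail.
float sum_unrolled4(const std::vector<float>& v)
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    std::size_t i = 0;
    for (; i + 4 <= v.size(); i += 4) {
        s0 += v[i];
        s1 += v[i + 1];
        s2 += v[i + 2];
        s3 += v[i + 3];
    }
    float s = s0 + s1 + s2 + s3;
    for (; i < v.size(); ++i) s += v[i]; // remainder elements
    return s;
}
```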


Ferienakademie 2007 Alexander Heinecke (TUM) 26

Appendix - References

[NV1] NVIDIA CUDA Compute Unified Device Architecture, Programming Guide, Version 1.0; nVidia Corporation, 23.06.2007

[IN1/2/3] Intel Architecture Handbook, November 2006

[NR] Numerical Recipes (online generated PDF)

http://home.in.tum.de/~heinecke/fa2007