cuda intro

Emergence of GPU systems and clusters for general purpose High Performance Computing ITCS 4145/5145 Nov 8, 2010 © Barry Wilkinson


Page 1: Cuda intro

Emergence of GPU systems and clusters for general purpose High Performance Computing

ITCS 4145/5145 Nov 8, 2010 © Barry Wilkinson

Page 2: Cuda intro

2

Graphics Processing Units (GPUs)

In the last few years GPUs have developed from graphics cards into a platform for HPC.

There is now great interest in using GPUs for scientific high performance computing, and GPUs are being designed with that application in mind.

CUDA programming – C/C++ with a few additional features and routines to support GPU programming. Uses a data-parallel paradigm.

Page 3: Cuda intro

3

http://www.hpcwire.com/blogs/New-China-GPGPU-Super-Outruns-Jaguar-105987389.html

Page 4: Cuda intro

Graphics Processing Units (GPUs) -- Brief History

[Timeline 1970-2010; source of information: http://en.wikipedia.org/wiki/Graphics_Processing_Unit]

• Atari 8-bit computer text/graphics chip
• IBM PC Professional Graphics Controller card
• S3 graphics cards -- single-chip 2D accelerator
• OpenGL graphics API
• Hardware-accelerated 3D graphics
• DirectX graphics API
• PlayStation
• GPUs with programmable shading -- NVIDIA GeForce 3 (2001) with programmable shading
• General-purpose computing on graphics processing units (GPGPU)
• GPU Computing

Page 5: Cuda intro

NVIDIA products

NVIDIA Corp. is the leader in GPUs for high performance computing:

[Timeline 1993-2010; source: http://en.wikipedia.org/wiki/GeForce]

1993 -- Established by Jen-Hsun Huang, Chris Malachowsky, Curtis Priem
1995 -- NV1
1999 -- GeForce 1
2000-2003 -- GeForce 2 series, GeForce FX series
2006 -- GeForce 8 series (GeForce 8800 card, G80 chip -- NVIDIA's first GPU with general purpose processors)
2008 -- GeForce 200 series (GTX260/275/280/285/295)
2010 -- GeForce 400 series (GTX460/465/470/475/480/485) -- Fermi

Tesla line: C870, S870, C1060, S1070, C2050, … (Tesla 2050 GPU has 448 thread processors)
Quadro line

Kepler (2011)
Maxwell (2013)

Page 6: Cuda intro

6

GPU performance gains over CPUs

[Chart: peak GFLOPs (0 to 1400) versus release date, 9/2002 to 7/2009. NVIDIA GPUs (NV30, NV40, G70, G80, GT200, T12) pull far ahead of Intel CPUs (3 GHz dual-core P4, 3 GHz Core2 Duo, 3 GHz quad-core Xeon, Westmere).]

Source © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009, ECE 498AL Spring 2010, University of Illinois, Urbana-Champaign

Page 7: Cuda intro

7

CPU-GPU architecture evolution

Co-processors -- a very old idea that appeared in the 1970s and 1980s with floating point co-processors attached to microprocessors that did not then have floating point capability.

These coprocessors simply executed floating point instructions that were fetched from memory.

Around the same time, there was interest in providing hardware support for displays, especially with the increasing use of graphics and PC games.

This led to graphics processing units (GPUs) attached to the CPU to create the video display.

[Early design: CPU with its memory, plus a graphics card driving the display]

Page 8: Cuda intro

8

Birth of general purpose programmable GPU

Dedicated pipeline (late 1990s - early 2000s)

By the late 1990s, graphics chips needed to support 3-D graphics, especially for games and graphics APIs such as DirectX and OpenGL.

Graphics chips generally had a pipeline structure with individual stages performing specialized operations, finally leading to loading frame buffer for display.

Individual stages may have access to graphics memory for storing intermediate computed data.

[Pipeline: Input stage → Vertex shader stage → Geometry shader stage → Rasterizer stage → Pixel shading stage → Frame buffer, with the individual stages able to access graphics memory]

Page 9: Cuda intro

9

GeForce 6 Series Architecture (2004-5)

From GPU Gems 2, Copyright 2005 by NVIDIA Corporation

Page 10: Cuda intro

10

General-Purpose GPU designs

High performance pipelines call for high-speed (IEEE) floating point operations.

People had been trying to use GPU cards to speed up scientific computations.

Known as GPGPU (general-purpose computing on graphics processing units) -- difficult to do with specialized graphics pipelines, but possible.

By the mid 2000s, it was recognized that individual stages of the graphics pipeline could be implemented by a more general purpose processor core (although with a data-parallel paradigm).

Page 11: Cuda intro

11

GPU design for general high performance computing

2006 -- First GPU for general high performance computing as well as graphics processing: the NVIDIA G80 chip / GeForce 8800 card.

Unified processors that could perform vertex, geometry, pixel, and general computing operations.

Could now write programs in C rather than through graphics APIs.

Single-instruction multiple-thread (SIMT) programming model.

Page 12: Cuda intro

12

Page 13: Cuda intro

13

Evolving GPU design
NVIDIA Fermi architecture (announced Sept 2009)

• 512 stream processing cores
• Organized as 16 streaming multiprocessors (SMs), each having 32 cores
• 3 GB or 6 GB GDDR5 memory
• Many innovations including L1/L2 caches, unified device memory addressing, ECC memory, …

First implementation: Tesla 20 series (single-chip C2050/2070, four-chip S2050/2070), a roughly 3-billion-transistor chip.

New Fermi chips planned (GT 300, GeForce 400 series)

Page 14: Cuda intro

14

Fermi Streaming Multiprocessor (SM)

* Whitepaper NVIDIA’s Next Generation CUDA Compute Architecture: Fermi, NVIDIA, 2008

Page 15: Cuda intro

15

CUDA (Compute Unified Device Architecture)

Architecture and programming model, introduced by NVIDIA in 2007.

Enables GPUs to execute programs written in C.

Within C programs, call SIMT “kernel” routines that are executed on the GPU.

A CUDA syntax extension to C identifies a routine as a kernel.

Very easy to learn, although getting the highest possible execution performance requires an understanding of the hardware architecture.

Page 16: Cuda intro

16

Programming Model

• A program, once compiled, has code executed on the CPU and code executed on the GPU

• Separate memories on the CPU and GPU

Need to:

• Explicitly transfer data from the CPU to the GPU for the GPU computation, and

• Explicitly transfer results in GPU memory back to CPU memory

[Diagram: CPU with its main memory and GPU with its global memory; data is copied from CPU to GPU before the computation and results are copied from GPU to CPU afterwards]

Page 17: Cuda intro

17

Basic CUDA program structure

int main (int argc, char **argv ) {

1. Allocate memory space in device (GPU) for data
2. Allocate memory space in host (CPU) for data

3. Copy data to GPU

4. Call “kernel” routine to execute on GPU (with CUDA syntax that defines the number of threads and their physical structure)

5. Transfer results from GPU to CPU

6. Free memory space in device (GPU)
7. Free memory space in host (CPU)

return;
}

Page 18: Cuda intro

18

1. Allocating memory space in “device” (GPU) for data

Use CUDA malloc routines:

int size = N * sizeof(int);   // space for N integers

int *devA, *devB, *devC;      // devA, devB, devC ptrs

cudaMalloc( (void**)&devA, size );
cudaMalloc( (void**)&devB, size );
cudaMalloc( (void**)&devC, size );

Derived from Jason Sanders, "Introduction to CUDA C" GPU technology conference, Sept. 20, 2010.
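As an optional aside (not from the original slides), each CUDA runtime call such as cudaMalloc returns a cudaError_t status that can be checked; a minimal sketch:

cudaError_t err = cudaMalloc( (void**)&devA, size );
if (err != cudaSuccess) {                                          // cudaSuccess means the allocation worked
    printf("cudaMalloc failed: %s\n", cudaGetErrorString(err));    // assumes stdio.h is included
    exit(1);                                                       // assumes stdlib.h is included
}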

Page 19: Cuda intro

19

2. Allocating memory space in “host” (CPU) for data

Use regular C malloc routines:

int *a, *b, *c;
…
a = (int*)malloc(size);
b = (int*)malloc(size);
c = (int*)malloc(size);

or statically declare variables:

#define N 256
…
int a[N], b[N], c[N];

Page 20: Cuda intro

20

3. Transferring data from host (CPU) to device (GPU)

Use CUDA routine cudaMemcpy

cudaMemcpy( devA, A, size, cudaMemcpyHostToDevice);
cudaMemcpy( devB, B, size, cudaMemcpyHostToDevice);

where devA and devB are pointers to the destinations in the device and A and B are pointers to the host source data.

Page 21: Cuda intro

21

4. Declaring “kernel” routine to execute on device (GPU)

CUDA introduces a syntax addition to C:
Triple angle brackets mark a call from host code to device code. They contain the organization and number of threads in two parameters:

myKernel<<< n, m >>>(arg1, … );

n and m will define the organization of thread blocks and the threads in a block.

For now, we will set n = 1, which says one block, and m = N, which says N threads in this block.

arg1, … -- arguments to routine myKernel, typically pointers to device memory obtained previously from cudaMalloc.

Page 22: Cuda intro

22

Declaring a Kernel Routine

A kernel is defined using the CUDA specifier __global__

Example – Adding two vectors A and B

#define N 256

__global__ void vecAdd(int *A, int *B, int *C) {   // Kernel definition
    int i = threadIdx.x;    // threadIdx is the CUDA structure that provides the thread ID in the block
    C[i] = A[i] + B[i];
}

int main() {
    // allocate device memory &
    // copy data to device
    // device mem. ptrs devA, devB, devC
    vecAdd<<<1, N>>>(devA, devB, devC);   // grid of one block, block has N threads
    …
}

Each of the N threads performs one pair-wise addition:
Thread 0:   devC[0] = devA[0] + devB[0];
Thread 1:   devC[1] = devA[1] + devB[1];
…
Thread N-1: devC[N-1] = devA[N-1] + devB[N-1];

Loosely derived from CUDA C programming guide, v 3.2, 2010, NVIDIA

Page 23: Cuda intro

23

5. Transferring data from device (GPU) to host (CPU)

Use CUDA routine cudaMemcpy

cudaMemcpy( C, devC, size, cudaMemcpyDeviceToHost);

where devC is a pointer in device and C is a pointer in host.

Page 24: Cuda intro

24

6. Free memory space in “device” (GPU)

Use CUDA cudaFree routine:

cudaFree( devA );
cudaFree( devB );
cudaFree( devC );

Page 25: Cuda intro

25

7. Free memory space in host (CPU) (if CPU memory was allocated with malloc)

Use regular C free routine to deallocate memory if previously allocated with malloc:

free( a );
free( b );
free( c );

Page 26: Cuda intro

26

Complete CUDA program

Adding two vectors, A and B

N elements in A and B, and N threads

(without code to load arrays with data)

#define N 256

__global__ void vecAdd(int *A, int *B, int *C) {
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}

int main (int argc, char **argv ) {
    int size = N * sizeof(int);
    int *a, *b, *c, *devA, *devB, *devC;

    cudaMalloc( (void**)&devA, size );
    cudaMalloc( (void**)&devB, size );
    cudaMalloc( (void**)&devC, size );

    a = (int*)malloc(size);
    b = (int*)malloc(size);
    c = (int*)malloc(size);

    cudaMemcpy( devA, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy( devB, b, size, cudaMemcpyHostToDevice);

    vecAdd<<<1, N>>>(devA, devB, devC);

    cudaMemcpy( c, devC, size, cudaMemcpyDeviceToHost);

    cudaFree( devA ); cudaFree( devB ); cudaFree( devC );
    free( a ); free( b ); free( c );

    return (0);
}

Derived from Jason Sanders, "Introduction to CUDA C" GPU technology conference, Sept. 20, 2010.
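The slide notes that the code to load the arrays with data is omitted; one illustrative way to fill and later check them (a sketch, not part of the original program; assumes stdio.h is included) is:

for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2*i; }    // before the cudaMemcpy calls to the GPU

for (int i = 0; i < N; i++)                              // after copying c back to the host
    if (c[i] != a[i] + b[i]) printf("error at element %d\n", i);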

Page 27: Cuda intro

27

CUDA SIMT Thread Structure

Allows flexibility and efficiency in processing 1-D, 2-D, and 3-D data on the GPU.

A kernel launch creates a grid of thread blocks (the grid can be 1 or 2 dimensions); each block contains threads (a block can be 1, 2 or 3 dimensions).

Linked to internal organization: threads in one block execute together.

CUDA C programming guide, v 3.2, 2010, NVIDIA

Page 28: Cuda intro

28

Defining Grid/Block Structure

Need to provide each kernel call with values for two key structures:

• Number of blocks in each dimension
• Threads per block in each dimension

myKernel<<< numBlocks, threadsperBlock >>>(arg1, … );

numBlocks – number of blocks in grid in each dimension (1D or 2D). An integer would define a 1D grid of that size, otherwise use CUDA structure, see next.

threadsperBlock – number of threads in a block in each dimension (1D, 2D, or 3D). An integer would define a 1D block of that size, otherwise use CUDA structure, see next.

Notes: Number of blocks not limited by specific GPU. Number of threads/block is limited by specific GPU.

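As a concrete point of reference (not on the slide): the Fermi generation described earlier allows at most 1024 threads per block (512 on earlier GPUs), while a grid may contain many thousands of blocks per dimension.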

Page 29: Cuda intro

29

CUDA provided with built-in variables and structures to define number of blocks of threads in grid in each dimension and number of threads in a block in each dimension.

CUDA Vector Types/Structures

unit3 and dim3 – can be considered essentially as CUDA-defined structures of unsigned integers: x, y, z, i.e.

struct unit3 { x; y; z; };struct dim3 { x; y; z; };

Used to define grid of blocks and threads, see next.Unassigned structure components automatically set to 1.There are other CUDA vector types.

Built-in CUDA data types and structures

Page 30: Cuda intro

30

Built-in Variables for Grid/Block Sizes

dim3 gridDim -- Size of grid:

gridDim.x * gridDim.y (z not used)

dim3 blockDim -- Size of block:

blockDim.x * blockDim.y * blockDim.z

Example

dim3 grid(16, 16);     // Grid -- 16 x 16 blocks
dim3 block(32, 32);    // Block -- 32 x 32 threads
myKernel<<<grid, block>>>(...);
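As a quick check of these sizes (worked numbers, not on the slide): this launch creates gridDim.x * gridDim.y = 16 x 16 = 256 blocks, each with blockDim.x * blockDim.y = 32 x 32 = 1024 threads, i.e. 262,144 threads in total.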

Page 31: Cuda intro

31

Built-in Variables for Grid/Block Indices

uint3 blockIdx -- block index within grid:

blockIdx.x, blockIdx.y (z not used)

uint3 threadIdx -- thread index within block:

threadIdx.x, threadIdx.y, threadIdx.z

Full global thread ID in x and y dimensions can be computed by:

x = blockIdx.x * blockDim.x + threadIdx.x;y = blockIdx.y * blockDim.y + threadIdx.y;

Page 32: Cuda intro

32

Example -- x direction
A 1-D grid and 1-D blocks: 4 blocks, each having 8 threads.

[Diagram: threadIdx.x runs 0 to 7 within each block; blockIdx.x = 0, 1, 2, 3 across the grid]

gridDim = 4 x 1, blockDim = 8 x 1

Global thread ID = blockIdx.x * blockDim.x + threadIdx.x
e.g. thread 2 of block 3 has global ID 3 * 8 + 2 = 26 with linear global addressing.

Derived from Jason Sanders, "Introduction to CUDA C", GPU Technology Conference, Sept. 20, 2010.

Page 33: Cuda intro

33

Code example with a 1-D grid and 1-D blocks

Number of blocks chosen to map each vector across the grid, one element of each vector per thread:

#define N 2048   // size of vectors
#define T 256    // number of threads per block

__global__ void vecAdd(int *A, int *B, int *C) {
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    C[i] = A[i] + B[i];
}

int main (int argc, char **argv ) {
    …
    vecAdd<<<N/T, T>>>(devA, devB, devC);   // assumes N/T is an integer
    …
    return (0);
}

Page 34: Cuda intro

34

If N/T is not necessarily an integer:

#define N 2048   // size of vectors
#define T 240    // number of threads per block

__global__ void vecAdd(int *A, int *B, int *C) {
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < N) C[i] = A[i] + B[i];   // allows for more threads than vector elements, some unused
}

int main (int argc, char **argv ) {
    int blocks = (N + T - 1) / T;    // efficient way of rounding up to the next integer
    …
    vecAdd<<<blocks, T>>>(devA, devB, devC);
    …
    return (0);
}
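A worked check of the rounding with these values (not on the slide): with N = 2048 and T = 240, blocks = (2048 + 239) / 240 = 2287 / 240 = 9 in integer division, giving 9 * 240 = 2160 threads; the last 112 threads fail the i < N test and do nothing. Plain N/T would give only 8 blocks (1920 threads) and leave 128 elements unprocessed.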

Page 35: Cuda intro

35

Example using a 2-D grid and 2-D blocks
Adding two arrays

#define N 2048 // size of arrays

__global__ void addMatrix (int *a, int *b, int *c) {
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    int j = blockIdx.y*blockDim.y + threadIdx.y;
    int index = i + j * N;
    if (i < N && j < N) c[index] = a[index] + b[index];
}

int main() {
    ...
    dim3 dimBlock (16, 16);
    dim3 dimGrid (N/dimBlock.x, N/dimBlock.y);

    addMatrix<<<dimGrid, dimBlock>>>(devA, devB, devC);
    …
}

Page 36: Cuda intro

36

Memory Structure within GPU

Local private memory -- per thread
Shared memory -- per block
Global memory -- per application

GPU executes one or more kernel grids.

Streaming multiprocessor (SM) executes one or more thread blocks

CUDA cores and other execution units in the SM execute threads.

SM executes threads in groups of 32 threads called a warp.*

* Whitepaper NVIDIA’s Next Generation CUDA Compute Architecture: Fermi, NVIDIA, 2008
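As a brief illustration of the per-block shared memory listed above (a sketch, not from the slides; the kernel name and block size T are made up for the example):

#define T 256   // threads per block assumed for this sketch; launch as kernel<<<N/T, T>>>(...)

__global__ void reverseWithinBlocks(int *A, int *B) {
    __shared__ int tile[T];          // one copy per block, visible to all threads in the block
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    tile[threadIdx.x] = A[i];        // each thread stages one element into shared memory
    __syncthreads();                 // wait until every thread in the block has written its element
    B[i] = tile[blockDim.x - 1 - threadIdx.x];   // read an element written by another thread in the block
}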

Page 37: Cuda intro

37

Compiling code

Linux

Command line. CUDA provides nvcc (an NVIDIA “compiler driver”). Use it instead of gcc:

nvcc -O3 -o <exe> <input> -I/usr/local/cuda/include -L/usr/local/cuda/lib -lcudart

nvcc separates the code for the CPU and for the GPU and compiles both. A regular C compiler must be installed for the CPU code. Makefiles are also provided.
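For example (an illustrative command only, assuming the vector addition program above is saved as vecadd.cu):

nvcc -O3 -o vecadd vecadd.cu
./vecadd

When given a .cu file, nvcc normally locates the CUDA headers and runtime library itself, so the explicit -I, -L and -lcudart options above are often not needed.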

Windows

NVIDIA suggests using Microsoft Visual Studio

Page 38: Cuda intro

38

Debugging

NVIDIA has recently developed a debugging tool called Parallel Nsight.

Available for use with Visual Studio

Page 39: Cuda intro

39

GPU Clusters

GPU systems for HPC: GPU clusters, GPU grids, GPU clouds.

With the advent of GPUs for scientific high performance computing, compute clusters can now incorporate them, greatly increasing their compute capability.

Page 40: Cuda intro

40

Page 41: Cuda intro

41

Maryland CPU-GPU Cluster Infrastructure

http://www.umiacs.umd.edu/research/GPU/facilities.html

Page 42: Cuda intro

42

Hybrid Programming Model for Clusters having Multicore Shared Memory Processors

Combine MPI between nodes and OpenMP within nodes (or other thread libraries such as Pthreads):

MPI/OpenMP compilation:

mpicc -o mpi_out mpi_test.c -fopenmp
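A minimal sketch of this hybrid style (illustrative only, not from the slides):

#include <stdio.h>
#include <mpi.h>
#include <omp.h>

int main(int argc, char **argv) {
    int rank;
    MPI_Init(&argc, &argv);                    // typically one MPI process per node
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    #pragma omp parallel                       // several OpenMP threads within the node
    printf("rank %d, thread %d\n", rank, omp_get_thread_num());
    MPI_Finalize();
    return 0;
}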

Page 43: Cuda intro

43

Hybrid Programming Model for Clusters having Multicore Shared Memory Processors and GPUs

Combine OpenMP and CUDA on one node, for the CPU and GPU respectively, with MPI between nodes.

Note – All three are C-based, so they can be compiled together.

NVIDIA does provide a sample OpenMP/CUDA program.
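One common way to build such a mixed program (a sketch with assumed file names, not from the slides) is to compile the CUDA part with nvcc and the MPI/OpenMP part with mpicc, then link them together:

nvcc -c kernels.cu -o kernels.o
mpicc -fopenmp main.c kernels.o -o hybrid -L/usr/local/cuda/lib -lcudart

The library path mirrors the one on the earlier compilation slide; on 64-bit installations it may be lib64 instead.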

Page 44: Cuda intro

44

http://www.gpugrid.net/

Page 45: Cuda intro

45

Intel’s response to Nvidia and GPUs

Page 46: Cuda intro

Questions