Using GPUs for parallel processing

DESCRIPTION
Short intro to GPU and CUDA programming

TRANSCRIPT
Sci-Prog seminar series: talks on computing and programming related topics, ranging from basic to advanced levels.
Talk: Using GPUs for parallel processing – A. Stephen McGough
Website: http://conferences.ncl.ac.uk/sciprog/index.php
Research community site: contact Matt Wade for access. Alerts mailing list: [email protected] (sign up at http://lists.ncl.ac.uk)
Organisers: Dr Liz Petrie, Dr Matt Wade, Dr Stephen McGough, Dr Ben Allen and Gregg Iceton
Using GPUs for parallel processing
A. Stephen McGough
Why?
• Moore’s law is dead?
• “The number of transistors on integrated circuits doubles approximately every two years”
  – Processors aren’t getting faster…
[Figure: Moore’s observation]
Processor Speed and Energy
• Assume a 1 GHz core consumes 1 watt
• A 4 GHz core then consumes ~64 watts
• Four 1 GHz cores consume ~4 watts
• Power ∝ frequency³
Computers are going many-core
• They’re getting fatter (more cores) rather than faster (higher clock speeds)
What?
• The games industry is a multi-billion dollar business
• Gamers want photo-realistic games
  – Computationally expensive
  – Requires complex physics calculations
• The latest generation of Graphical Processing Units are therefore many-core parallel processors
  – General Purpose Graphical Processing Units – GPGPUs
Not just normal processors
• 1000s of cores
  – But the cores are simpler than a normal processor
  – Multiple cores perform the same action at the same time – Single Instruction Multiple Data (SIMD)
• Conventional processor -> minimize latency of a single program
• GPU -> maximize throughput of all cores
• Potential for orders-of-magnitude speed-up
“If you were plowing a field, which would you rather use: two strong oxen or 1024 chickens?”
• Famous quote from Seymour Cray, arguing for small numbers of processors
  – But the chickens are now winning
• Need a new way to think about programming
– Need hugely parallel algorithms
• Many existing algorithms won’t work (efficiently)
Some Issues with GPGPUs
• Cores are slower than a standard CPU
  – But you have lots more of them
• No direct control over when your code runs on a core
  – The GPGPU decides where and when
• Can’t communicate between cores
• Order of execution is ‘random’
  – Synchronization is through exiting the parallel GPU code
• SIMD only works (efficiently) if all cores are doing the same thing
  – NVIDIA GPUs have warps of 32 cores working together
  – Code divergence means a warp must execute each divergent branch in turn (a sketch of a divergent kernel follows this list)
• Cores can interfere with each other
  – Overwriting each other’s memory
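As a rough sketch of divergence (a hypothetical kernel, not from the talk): when even and odd thread ids take different branches, every warp containing both must run the two paths one after the other.

__global__ void divergent(int *out, int N) {
    int id = blockDim.x * blockIdx.x + threadIdx.x;
    if (id >= N) { return; } // I'm not a valid ID
    // Even and odd ids sit side by side within each warp of 32 threads,
    // so the warp executes both branches in turn, idling half its cores each time
    if (id % 2 == 0) {
        out[id] = id * 2; // even threads take this path
    } else {
        out[id] = id + 1; // odd threads take this path
    }
}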
How?
• Many approaches
  – OpenGL – for the mad Guru
  – Compute Unified Device Architecture (CUDA)
  – OpenCL – an emerging standard
  – Dynamic Parallelism – for existing code loops
• Focus here on CUDA
  – Well developed and supported
  – Exploits the full power of the GPGPU
CUDA
• CUDA is a set of extensions to C/C++
  – (and Fortran)
• Code consists of sequential and parallel parts
  – Parallel parts are written as kernels
• A kernel describes what one thread of the code will do
[Diagram: program flow – Start → sequential code → transfer data to card → execute kernel → transfer data from card → sequential code → Finish]
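As a minimal sketch of a kernel (a hypothetical example; printf in device code is supported on GPUs of compute capability 2.0 and above, such as the C2050 used later):

#include <cstdio>

// Every launched thread runs this same function body once
__global__ void hello() {
    printf("Hello from thread %d of block %d\n", threadIdx.x, blockIdx.x);
}

int main() {
    hello<<<2, 4>>>();       // launch 2 blocks of 4 threads each
    cudaDeviceSynchronize(); // wait for the kernel to finish before exiting
    return 0;
}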
Example: Vector Addition
• One-dimensional data
• Add two vectors (A,B) together to produce C
• Need to define the kernel to run and the main code
• Each thread can compute a single value for C
Example: Vector Addition
• Pseudo code for the kernel:
  – Identify which element in the vector I’m computing: i
  – Compute C[i] = A[i] + B[i]
• How do we identify our index (i)?
Blocks and Threads
• In CUDA the whole data space is the Grid
  – The Grid is divided into a number of blocks
  – Each block is divided into a number of threads
• Blocks can be executed in any order
• Threads in a block are executed together
• Blocks and Threads can be 1D, 2D or 3D
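For example, grid and block shapes are specified at kernel launch with CUDA’s dim3 type; a minimal sketch with a hypothetical empty kernel (any dimension left unspecified defaults to 1):

__global__ void demo() { } // hypothetical kernel, just to show the launch syntax

int main() {
    dim3 grid(4, 2);    // 4 x 2 = 8 blocks in the grid
    dim3 block(16, 16); // 16 x 16 = 256 threads per block
    demo<<<grid, block>>>();
    cudaDeviceSynchronize();
    return 0;
}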
Blocks
• As blocks are executed in arbitrary order, this gives CUDA the opportunity to scale to the number of cores in a particular device
Thread id
• CUDA provides three pieces of data for identifying a thread
  – BlockIdx – block identity
  – BlockDim – the size of a block (number of threads in the block)
  – ThreadIdx – identity of a thread within its block
• These give the absolute thread id: id = BlockIdx * BlockDim + ThreadIdx
• E.g. BlockIdx = 2, BlockDim = 3, ThreadIdx = 1, so id = 2 * 3 + 1 = 7
Absolute thread id:   0 1 2  |  3 4 5  |  6 7 8
Block:                Block0 |  Block1 |  Block2
Thread index:         0 1 2  |  0 1 2  |  0 1 2
Example: Vector Addition – kernel code
__global__ void vector_add(double *A, double *B,
double* C, int N) {
// Find my thread id - block and thread
int id = blockDim.x * blockIdx.x + threadIdx.x;
if (id >= N) {return;} // I'm not a valid ID
C[id] = A[id] + B[id]; // do my work
}
Notes on the kernel: __global__ marks the entry point of a kernel; otherwise it is a normal function definition. The first line computes the absolute thread id. The id may be invalid if the data size is not completely divisible by the block size, hence the bounds check before doing the work.
Example: Vector Addition – pseudo code for the sequential code
• Create data on the host computer
• Create space on the device
• Copy the data to the device
• Run the kernel
• Copy the data back to the host and do something with it
• Clean up
Host and Device
• Data needs copying to / from the GPU (device)
• Often end up with the same data on both
  – Suffix variable names with _device or _host to help identify where the data is
  – e.g. A_host on the host, A_device on the device
Example: Vector Addition – main code

int N = 2000;
double *A_host = new double[N]; // Create data on host computer
double *B_host = new double[N];
double *C_host = new double[N];
for(int i=0; i<N; i++) { A_host[i] = i; B_host[i] = (double)i/N; }
double *A_device, *B_device, *C_device; // allocate space on device GPGPU
cudaMalloc((void**) &A_device, N*sizeof(double));
cudaMalloc((void**) &B_device, N*sizeof(double));
cudaMalloc((void**) &C_device, N*sizeof(double));
// Copy data from host memory to device memory
cudaMemcpy(A_device, A_host, N*sizeof(double), cudaMemcpyHostToDevice);
cudaMemcpy(B_device, B_host, N*sizeof(double), cudaMemcpyHostToDevice);
// How many blocks will we need? Choose a block size of 256 and round up
int blocks = (N + 255) / 256; // integer ceiling of N/256
vector_add<<<blocks, 256>>>(A_device, B_device, C_device, N); // run kernel
// Copy data back
cudaMemcpy(C_host, C_device, N*sizeof(double), cudaMemcpyDeviceToHost);
// do something with result
// free device memory
cudaFree(A_device); cudaFree(B_device); cudaFree(C_device);
delete[] A_host; delete[] B_host; delete[] C_host; // free host memory (allocated with new[], so use delete[])
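The example above omits error checking for brevity. In practice each CUDA runtime call returns a cudaError_t worth inspecting; a minimal sketch (the check helper is hypothetical; cudaGetLastError and cudaGetErrorString are standard CUDA runtime calls):

#include <cstdio>
#include <cstdlib>

// Hypothetical helper: abort with a message if a CUDA call failed
void check(cudaError_t err, const char *what) {
    if (err != cudaSuccess) {
        fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
}

// Usage around the calls above, e.g.:
//   check(cudaMalloc((void**) &A_device, N*sizeof(double)), "cudaMalloc A");
//   vector_add<<<blocks, 256>>>(A_device, B_device, C_device, N);
//   check(cudaGetLastError(), "kernel launch"); // kernel launches report errors this way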
More Complex: Matrix Addition
• Now a 2D problem
  – BlockIdx, BlockDim and ThreadIdx now have x and y components
• But the general principles hold
  – For the kernel: compute the location in a matrix of two dimensions
  – For the main code: define and transmit the data
• But keep the data 1D – why?
Why data in 1D?
• If you define the data as 2D there is no guarantee that it will be a contiguous block of memory
  – So it can’t be transmitted to the card in one command
[Diagram: rows of a 2D array scattered in memory, interleaved with other data]
Faking 2D data
• 2D data of size N*M
• Define a 1D array of size N*M
• Index element [x, y] as y * N + x (row y, column x, rows of length N, matching the kernel below)
• The array can then be transferred to the device in one go (a small indexing helper is sketched below)
[Diagram: Row 1, Row 2, Row 3, Row 4 laid out one after another in a contiguous array]
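A small helper makes the mapping explicit (a hypothetical function; __host__ __device__ lets it be called from both host and kernel code):

// Map 2D element (column x, row y) onto the flat 1D array, rows of length N
__host__ __device__ inline int idx(int x, int y, int N) {
    return y * N + x;
}

// e.g. with N = 20, element (x=3, y=2) lives at idx(3, 2, 20) = 2*20 + 3 = 43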
Example: Matrix Addition – kernel code
__global__ void matrix_add(double *A, double *B, double* C, int N, int M)
{
// Find my thread id - block and thread
int idX = blockDim.x * blockIdx.x + threadIdx.x;
int idY = blockDim.y * blockIdx.y + threadIdx.y;
if (idX >= N || idY >= M) {return;} // I'm not a valid ID
int id = idY * N + idX;
C[id] = A[id] + B[id]; // do my work
}
Notes on the kernel: both dimensions of the thread id are computed, both are bounds-checked, and the (x, y) pair is then folded into a single 1D location in the flat array.
Example: Matrix Addition – main code

int N = 20;
int M = 10;
double *A_host = new double[N * M]; // Create data on host computer
double *B_host = new double[N * M];
double *C_host = new double[N * M];
for(int i=0; i<N; i++) {
for (int j = 0; j < M; j++) {
A_host[i + j * N] = i; B_host[i + j * N] = (double)j/M;
}
}
double *A_device, *B_device, *C_device; // allocate space on device GPGPU
cudaMalloc((void**) &A_device, N*M*sizeof(double));
cudaMalloc((void**) &B_device, N*M*sizeof(double));
cudaMalloc((void**) &C_device, N*M*sizeof(double));
// Copy data from host memory to device memory
cudaMemcpy(A_device, A_host, N*M*sizeof(double), cudaMemcpyHostToDevice);
cudaMemcpy(B_device, B_host, N*M*sizeof(double), cudaMemcpyHostToDevice);
// How many blocks will we need? Choose block size of 16
int blocksX = (N + 15) / 16; // integer ceiling of N/16
int blocksY = (M + 15) / 16; // integer ceiling of M/16
dim3 dimGrid(blocksX, blocksY);
dim3 dimBlocks(16, 16);
matrix_add<<<dimGrid, dimBlocks>>>(A_device, B_device, C_device, N, M);
// Copy data back from device to host
cudaMemcpy(C_host, C_device, N*M*sizeof(double), cudaMemcpyDeviceToHost);
// do something with the result, e.g.
//for (int i = 0; i < N*M; i++) printf("C[%d,%d] = %f\n", i/N, i%N, C_host[i]);
// free device memory
cudaFree(A_device); cudaFree(B_device); cudaFree(C_device);
delete[] A_host; delete[] B_host; delete[] C_host; // free host memory (allocated with new[], so use delete[])
The main code follows the same pattern as before: define the matrices on the host, allocate space on the device, copy the data over, run the kernel, bring the results back, and tidy up.
Running Example
• Computer: condor-gpu01
  – Set path: set path = ( $path /usr/local/cuda/bin/ )
• Compile with nvcc, e.g. nvcc vector_add.cu -o vector_add (assuming the source is saved as vector_add.cu)
• Then just run the binary file
• Tesla C2050: 448 cores, 3 GB RAM
  – Single precision: 1.03 TFLOPS
  – Double precision: 515 GFLOPS
Summary and Questions
• GPGPUs have great potential for parallelism
• But at a cost
  – Not ‘normal’ parallel computing
  – Need to think about problems in a new way
• Further reading
  – NVIDIA CUDA Zone: https://developer.nvidia.com/category/zone/cuda-zone
  – Online courses: https://www.coursera.org/course/hetero