Applications of GPU Computing
Alex Karantza
0306-722 Advanced Computer Architecture, Rochester Institute of Technology, Fall 2011
Outline
• Introduction
• GPU Architecture
▫ Multiprocessing
▫ Vector ISA
• GPUs in Industry
▫ Scientific Computing
▫ Image Processing
▫ Databases
• Examples and Benefits
Introduction
“GPUs have evolved to the point where many real world applications are easily implemented on them and run significantly faster than on multi-core systems. Future computing architectures will be hybrid systems with parallel-core GPUs working in tandem with multi-core CPUs.”
- Prof. Jack Dongarra, director of the Innovative Computing Laboratory at the University of Tennessee
Author of LINPACK
(As typified by NVIDIA CUDA)
GPU Architecture
• Parallel Coprocessor to conventional CPUs
▫ Implement a SIMD structure, multiple threads running the
same code.
• Grid of Blocks of Threads
▫ Thread local registers
▫ Block local memory and control
▫ Global memory
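The thread/block hierarchy above can be sketched on a CPU. This is not real CUDA, just a plain-C emulation of a launch; the names `gridDim`, `blockDim`, `blockIdx`, and `threadIdx` deliberately mirror CUDA's built-ins, and `kernel_scale` is a hypothetical kernel body:

```c
#include <assert.h>

/* CPU sketch of a CUDA-style launch: every (block, thread) pair runs the
 * same kernel body, and a global index selects its element of the data. */
static void kernel_scale(float *out, const float *in, int idx, int n) {
    if (idx < n)                 /* bounds guard, as in a real kernel */
        out[idx] = 2.0f * in[idx];
}

void launch_scale(float *out, const float *in, int n,
                  int gridDim, int blockDim) {
    for (int blockIdx = 0; blockIdx < gridDim; blockIdx++)          /* blocks map to multiprocessors */
        for (int threadIdx = 0; threadIdx < blockDim; threadIdx++) { /* threads map to scalar processors */
            int idx = blockIdx * blockDim + threadIdx;               /* global index into the grid */
            kernel_scale(out, in, idx, n);
        }
}
```

On a GPU the two loops do not exist: all (block, thread) pairs run concurrently, which is why each thread must compute its own global index.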
Grids, Blocks, and Threads
• Thread: runs on a scalar processor
▫ Contains local registers and memory
• Block: runs on a multiprocessor
▫ Shared memory and registers; shared control logic
• Grid: spans a device (or several devices)
▫ Global memory, can be easily distributed across devices
GPU Architecture
• Processors also implement vector instructions
▫ Vectors of length 2, 3, or 4 of any fundamental type
(integer, float, bits, predicate)
▫ Instructions for conversion between vector, scalar
• To encourage uniform execution, rather than
branching for conditionals, use predicates
▫ All instructions can be conditionally executed based on
predicate registers
Vectors and Predicates
.global .v4 .f32 V; // a length-4 vector of floats
.shared .v2 .u16 uv; // a length-2 vector of unsigned shorts
.global .v4 .b8 v; // a length-4 vector of bytes
.reg .s32 a, b; // two 32-bit signed ints
.reg .pred p; // a predicate register
setp.lt.s32 p, a, b; // if a < b, set p
@p add.v4.f32 V, V, {1,0,0,0}; // if p, V.x = V.x + 1
NSF Keeneland
360 NVIDIA Tesla 20-series GPUs
GPUs in Industry
• Many applications have been developed to use GPUs
for supercomputing in various fields
▫ Scientific Computing
CFD, Molecular Dynamics, Genome Sequencing,
Mechanical Simulation, Quantum Electrodynamics
▫ Image Processing
Registration, interpolation, feature detection, recognition,
filtering
▫ Data Analysis
Databases, sorting and searching, data mining
Major Categories of Algorithm
• 2D/3D filtering operations
• n-body simulations
• Parallel tree operations – searching/sorting
• All suited to GPUs because of data-parallel
requirements and uniform kernels
Computational Fluid Dynamics
• Simulate fluids in a discrete volume over time
• Involves solving the Navier-Stokes partial differential
equations iteratively on a grid
▫ Can be considered a filtering operation
• When parallelized on a GPU using multigrid solvers,
10x speedups have been reported
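To make the "filtering operation" view concrete, here is a minimal sketch (not from any cited solver) of one Jacobi relaxation sweep for the 2-D Laplace equation: each interior point becomes the average of its four neighbors, a 5-point stencil that a GPU can evaluate with one independent thread per grid point:

```c
#include <assert.h>

/* One Jacobi sweep on an n x n grid stored row-major in u.
 * Each interior point is replaced by the average of its 4 neighbours --
 * a stencil/filtering operation, so every point is an independent thread
 * on a GPU. Boundary values are left untouched. */
void jacobi_sweep(const double *u, double *u_new, int n) {
    for (int y = 1; y < n - 1; y++)
        for (int x = 1; x < n - 1; x++)
            u_new[y * n + x] = 0.25 * (u[y * n + (x - 1)] + u[y * n + (x + 1)] +
                                       u[(y - 1) * n + x] + u[(y + 1) * n + x]);
}
```

A multigrid solver repeats sweeps like this on a hierarchy of coarser grids, which is what the reported 10x GPU speedups accelerate.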
Molecular Dynamics
• Large set of particles with forces between them –
protein behavior, material simulation
• Calculating forces between particles can be done in
parallel for each particle
• Accumulation of forces can be implemented as
multilevel parallel sums
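A hedged sketch of the idea (a toy 1-D inverse-square interaction, not the cited molecular dynamics codes): the outer loop over particles is what a GPU parallelizes, one thread per particle, while the inner accumulation is what shared-memory tiles or multilevel parallel sums speed up:

```c
#include <assert.h>

/* O(n^2) pairwise force accumulation for a toy 1-D inverse-square force.
 * On a GPU, each particle i gets its own thread; the inner sum over j is
 * the per-particle accumulation done with tiling or parallel reduction. */
void accumulate_forces(const double *pos, double *force, int n) {
    for (int i = 0; i < n; i++) {          /* one GPU thread per particle */
        double f = 0.0;
        for (int j = 0; j < n; j++) {
            if (j == i) continue;
            double d = pos[j] - pos[i];
            f += (d > 0 ? 1.0 : -1.0) / (d * d);   /* direction / r^2 */
        }
        force[i] = f;
    }
}
```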
Genetics
• Large strings of genome sequences must be searched
through to organize and identify samples
• GPUs enable multiple parallel queries to the
database to perform string matching
• Again, order-of-magnitude speedups reported
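The core operation can be sketched as naive exact matching (the cited GPU aligners use more sophisticated algorithms; this only shows why the problem parallelizes): every starting offset, and every query, is independent, so a GPU can assign them to separate threads:

```c
#include <assert.h>
#include <string.h>

/* Naive exact string matching: return the first offset where query occurs
 * in genome, or -1. Each candidate offset s is independent work -- on a
 * GPU, offsets (and whole queries) map to separate threads. */
int find_match(const char *genome, const char *query) {
    int n = (int)strlen(genome), m = (int)strlen(query);
    for (int s = 0; s + m <= n; s++) {    /* each offset: an independent thread */
        int ok = 1;
        for (int k = 0; k < m; k++)
            if (genome[s + k] != query[k]) { ok = 0; break; }
        if (ok) return s;
    }
    return -1;
}
```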
Electrodynamics
• Simulation of electric fields, Coulomb forces
• Requires iterative solving of partial differential
equations
• Cell phone modeling applications have
reported 50x speedups using GPUs
Image Processing
• Medical Imaging was the early adopter
▫ Registration of massive 3D voxel images
▫ Both the cost function for deformable registration and interpolation of results are filtering operations
• Generic feature detection, recognition, object extraction are all filters
• For object recognition, one can search a database of objects in parallel
• Moving these algorithms off the CPU and onto the GPU can allow real-time interaction
Data Analysis
• Huge databases for web services require instant
results for many simultaneous users
• Insufficient room in main memory, disk is too slow and
doesn’t allow parallel reads
• GPUs can split up the data and perform
fast searches, keeping their section
in memory
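A minimal sketch of the partitioning idea (hypothetical, not from the cited database work): the data is split into fixed-size chunks, each chunk scanned independently (one GPU block keeping its section in fast memory), and the per-chunk results combined in a final reduction:

```c
#include <assert.h>

/* Count occurrences of key by scanning independent chunks of the data.
 * Each chunk models one GPU block holding its section in memory; the
 * per-chunk partial counts are combined in a reduction at the end. */
int partitioned_count(const int *data, int n, int chunk, int key) {
    int total = 0;
    for (int start = 0; start < n; start += chunk) {   /* each chunk: one GPU block */
        int end = start + chunk < n ? start + chunk : n;
        int local = 0;                                 /* per-block partial result */
        for (int i = start; i < end; i++)
            if (data[i] == key) local++;
        total += local;                                /* reduction across blocks */
    }
    return total;
}
```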
Example: Filtering Operation
• Many algorithms can be reduced to a filtering
operation. As an example, consider image convolution
for blurring
Kernel = Gaussian2D(size);
for (x,y) in Input {
for (p,q) in Kernel {
Output(x,y) += Input(x+p,y+q) * Kernel(p,q);
}
}
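The pseudocode above, made concrete as sequential C. Border handling is an assumption the slide leaves open; here out-of-range reads are clamped to the nearest edge pixel:

```c
#include <assert.h>

static int clampi(int v, int lo, int hi) { return v < lo ? lo : (v > hi ? hi : v); }

/* Direct 2-D convolution of a w x h image with a ksize x ksize kernel
 * (ksize odd). Out-of-range reads clamp to the nearest edge pixel. */
void convolve2d(const float *in, float *out, int w, int h,
                const float *kernel, int ksize) {
    int r = ksize / 2;
    for (int y = 0; y < h; y++)
        for (int x = 0; x < w; x++) {
            float acc = 0.0f;
            for (int q = -r; q <= r; q++)        /* loop over kernel rows */
                for (int p = -r; p <= r; p++)    /* loop over kernel cols */
                    acc += in[clampi(y + q, 0, h - 1) * w + clampi(x + p, 0, w - 1)]
                         * kernel[(q + r) * ksize + (p + r)];
            out[y * w + x] = acc;
        }
}
```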
Example: Filtering Operation
• A quick optimization applies to many filters: if the kernel is separable, the convolution can be done in one pass per dimension
Kernel = Gaussian1D(size);
for (x,y) in Input {
for (p) in Kernel {
Temp(x,y) += Input(x+p,y) * Kernel(p);
}
}
for (x,y) in Temp {
for (q) in Kernel {
Output(x,y) += Temp(x,y+q) * Kernel(q);
}
}
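A runnable C sketch of the two-pass separable filter: the 1-D kernel is applied along rows into an intermediate buffer, then along columns of that buffer. As before, clamped borders are an assumption:

```c
#include <assert.h>

static int clamp_idx(int v, int hi) { return v < 0 ? 0 : (v > hi ? hi : v); }

/* Separable convolution: rows of `in` into `tmp`, then columns of `tmp`
 * into `out`, using the same length-ksize 1-D kernel k (ksize odd). */
void convolve_separable(const float *in, float *tmp, float *out,
                        int w, int h, const float *k, int ksize) {
    int r = ksize / 2;
    for (int y = 0; y < h; y++)              /* pass 1: rows */
        for (int x = 0; x < w; x++) {
            float acc = 0.0f;
            for (int p = -r; p <= r; p++)
                acc += in[y * w + clamp_idx(x + p, w - 1)] * k[p + r];
            tmp[y * w + x] = acc;
        }
    for (int y = 0; y < h; y++)              /* pass 2: columns */
        for (int x = 0; x < w; x++) {
            float acc = 0.0f;
            for (int q = -r; q <= r; q++)
                acc += tmp[clamp_idx(y + q, h - 1) * w + x] * k[q + r];
            out[y * w + x] = acc;
        }
}
```

With a normalized kernel such as {0.25, 0.5, 0.25}, a constant image passes through unchanged, a quick sanity check on the two-pass structure.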
Example: Filtering Operation
• This is still O(2n²m) work on a sequential processor (two passes over n² pixels with a length-m kernel)
• Each output pixel is independent, but shares spatially local data and a constant kernel
UploadGPU(Kernel, CONSTANT);
UploadGPU(Input, TEXTURE);
ConvolveColumnsGPU<blocks,threads>();
ConvolveRowsGPU<blocks,threads>();
DownloadGPU(Output, TEXTURE);
Example: Filtering Operation
• Complexity remains the same, however each MAC
instruction can be executed on as many processors as
are available, and memory can be accessed quickly
because of the assignment of blocks and texture
memory
• In practice, the overhead of uploading and
downloading from the GPU is far less than the
performance gained in the kernel
Example: Filtering Operation
__global__ void convolutionColumnsKernel(
    float *d_Dst,
    float *d_Src,
    int imageW,
    int imageH,
    int pitch
){
    __shared__ float s_Data[COLUMNS_BLOCKDIM_X]
        [(COLUMNS_RESULT_STEPS + 2 * COLUMNS_HALO_STEPS) * COLUMNS_BLOCKDIM_Y + 1];

    // *snip* Offset d_Src/d_Dst to this thread's base position and populate s_Data
    __syncthreads();

    #pragma unroll
    for(int i = COLUMNS_HALO_STEPS; i < COLUMNS_HALO_STEPS + COLUMNS_RESULT_STEPS; i++){
        float sum = 0;
        #pragma unroll
        for(int j = -KERNEL_RADIUS; j <= KERNEL_RADIUS; j++)
            sum += c_Kernel[KERNEL_RADIUS - j] *
                   s_Data[threadIdx.x][threadIdx.y + i * COLUMNS_BLOCKDIM_Y + j];
        d_Dst[i * COLUMNS_BLOCKDIM_Y * pitch] = sum;
    }
}
Even More Fun
• Some of that overhead can be avoided when the
destination of the GPU’s data is graphics
• Texture memory can be shared between general
purpose computations and normal rendering
• For post-processing effects or visualizing particles, the
pixel/vertex data never needs to leave the GPU
Conclusions
Certain classes of problem appear in many different
fields, and involve very data-parallel operations such
as filtering, sorting, or integration
Taking advantage of the architecture decisions behind
graphics processing units such as their multiprocessing
and native vector operations, these problems can be
solved quickly and cheaply
References
• 1. Ziegler, Gernot. Introduction to the CUDA Architecture. [Online] 2009. http://www.cse.scitech.ac.uk/disco/workshops/200907/Day1_01_Intro_CUDA_Architecture.pdf
• 2. NVIDIA Corporation. NVIDIA Compute PTX: Parallel Thread Execution ISA Version 1.1. 2007.
• 3. Göddeke, Dominik. Fast and Accurate Finite-Element Multigrid Solvers for PDE Simulations on GPU Clusters. Berlin: Logos Verlag, 2010. ISBN 978-3-8325-2768-6.
• 4. John E. Stone, James C. Phillips, Peter L. Freddolino, David J. Hardy, Leonardo G. Trabuco, and Klaus Schulten. Accelerating molecular modeling applications with graphics processors. Journal of Computational Chemistry, 28:2618-2640, 2007.
• 5. Michael C. Schatz, Cole Trapnell, Arthur L. Delcher, and Amitabh Varshney. High-throughput sequence alignment using Graphics Processing Units. BMC Bioinformatics, 2007.
• 6. ANSYS, Inc. ANSYS Unveils GPU Computing for Accelerated Engineering Simulations. [Online] 2010. http://investors.ansys.com/releasedetail.cfm?releaseid=509436
• 7. Warburton, Tim. Parallel Numerical Methods for Partial Differential Equations. Rocky Mountain Mathematics Consortium. [Online] 2008. http://www.caam.rice.edu/~timwar/RMMC/gpuDG.html
• 8. Ansorge, Richard. AIRWC: Accelerated Image Registration With CUDA. BSS Group, Cavendish Laboratory, University of Cambridge, UK. 2008.
• 9. N. Cornelis and L. Van Gool. Fast Scale Invariant Feature Detection and Matching on Programmable Graphics Hardware. CVPR 2008 Workshop, 2008.
• 10. Andrea Di Blas and Tim Kaldewey. Data Monster: Why graphics processors will transform database processing. IEEE Spectrum. [Online] 2009. http://spectrum.ieee.org/computing/software/data-monster/0
• 11. Podlozhnyuk, Victor. Image Convolution with CUDA. [Online] 2007. http://developer.download.nvidia.com/compute/DevZone/C/html/C/src/convolutionSeparable/doc/convolutionSeparable.pdf
• 12. Goodnight, Nolan. CUDA/OpenGL Fluid Simulation. [Online] 2007. http://new.math.uiuc.edu/MA198-2008/schaber2/fluidsGL.pdf
Questions?