Applications of GPU Computing
Alex Karantza
0306-722 Advanced Computer Architecture, Rochester Institute of Technology, Fall 2011
Outline
• Introduction
• GPU Architecture
▫ Multiprocessing
▫ Vector ISA
• GPUs in Industry
▫ Scientific Computing
▫ Image Processing
▫ Databases
• Examples and Benefits
Introduction
“GPUs have evolved to the point where many real world applications are easily implemented on them and run significantly faster than on multi-core systems. Future computing architectures will be hybrid systems with parallel-core GPUs working in tandem with multi-core CPUs.”
- Prof. Jack Dongarra, director of the Innovative Computing Laboratory at the University of Tennessee
Author of LINPACK
(As typified by NVIDIA CUDA)
GPU Architecture
• Parallel Coprocessor to conventional CPUs
▫ Implement a SIMD structure, multiple threads running the
same code.
• Grid of Blocks of Threads
▫ Thread local registers
▫ Block local memory and control
▫ Global memory
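The thread/block hierarchy above can be sketched on a CPU. This is not real CUDA, just a plain-C emulation of a launch; the names `gridDim`, `blockDim`, `blockIdx`, and `threadIdx` deliberately mirror CUDA's built-ins, and `kernel_scale` is a hypothetical kernel body:

```c
#include <assert.h>

/* CPU sketch of a CUDA-style launch: every (block, thread) pair runs the
 * same kernel body, and a global index selects its element of the data. */
static void kernel_scale(float *out, const float *in, int idx, int n) {
    if (idx < n)                 /* bounds guard, as in a real kernel */
        out[idx] = 2.0f * in[idx];
}

void launch_scale(float *out, const float *in, int n,
                  int gridDim, int blockDim) {
    for (int blockIdx = 0; blockIdx < gridDim; blockIdx++)          /* blocks map to multiprocessors */
        for (int threadIdx = 0; threadIdx < blockDim; threadIdx++) { /* threads map to scalar processors */
            int idx = blockIdx * blockDim + threadIdx;               /* global index into the grid */
            kernel_scale(out, in, idx, n);
        }
}
```

On a GPU the two loops do not exist: all (block, thread) pairs run concurrently, which is why each thread must compute its own global index.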
Grids, Blocks, and Threads
• Thread: runs on a scalar processor
▫ Contains local registers and memory
• Block: runs on a multiprocessor
▫ Shared memory and registers; shared control logic
• Grid: spans a device (or several devices)
▫ Global memory, can be easily distributed across devices
GPU Architecture
• Processors also implement vector instructions
▫ Vectors of length 2, 3, or 4 of any fundamental type
(integer, float, bits, predicate)
▫ Instructions for conversion between vector, scalar
• To encourage uniform execution, rather than
branching for conditionals, use predicates
▫ All instructions can be conditionally executed based on
predicate registers
Vectors and Predicates
.global .v4 .f32 V; // a length-4 vector of floats
.shared .v2 .u16 uv; // a length-2 vector of unsigned shorts
.global .v4 .b8 v; // a length-4 vector of bytes
.reg .s32 a, b; // two 32-bit signed ints
.reg .pred p; // a predicate register
setp.lt.s32 p, a, b; // if a < b, set p
@p add.v4.f32 V, V, {1,0,0,0}; // if p, V.x = V.x + 1
NSF Keeneland
360 NVIDIA Tesla 20-series GPUs
GPUs in Industry
• Many applications have been developed to use GPUs
for supercomputing in various fields
▫ Scientific Computing
CFD, Molecular Dynamics, Genome Sequencing,
Mechanical Simulation, Quantum Electrodynamics
▫ Image Processing
Registration, interpolation, feature detection, recognition,
filtering
▫ Data Analysis
Databases, sorting and searching, data mining
Major Categories of Algorithm
• 2D/3D filtering operations
• n-body simulations
• Parallel tree operations – searching/sorting
• All suited to GPUs because of data-parallel
requirements and uniform kernels
Computational Fluid Dynamics
• Simulate fluids in a discrete volume over time
• Involves solving the Navier-Stokes partial differential
equations iteratively on a grid
▫ Can be considered a filtering operation
• When parallelized on a GPU using multigrid solvers,
10x speedups have been reported
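To make the "filtering operation" view concrete, here is a minimal sketch (not from any cited solver) of one Jacobi relaxation sweep for the 2-D Laplace equation: each interior point becomes the average of its four neighbors, a 5-point stencil that a GPU can evaluate with one independent thread per grid point:

```c
#include <assert.h>

/* One Jacobi sweep on an n x n grid stored row-major in u.
 * Each interior point is replaced by the average of its 4 neighbours --
 * a stencil/filtering operation, so every point is an independent thread
 * on a GPU. Boundary values are left untouched. */
void jacobi_sweep(const double *u, double *u_new, int n) {
    for (int y = 1; y < n - 1; y++)
        for (int x = 1; x < n - 1; x++)
            u_new[y * n + x] = 0.25 * (u[y * n + (x - 1)] + u[y * n + (x + 1)] +
                                       u[(y - 1) * n + x] + u[(y + 1) * n + x]);
}
```

A multigrid solver repeats sweeps like this on a hierarchy of coarser grids, which is what the reported 10x GPU speedups accelerate.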
Molecular Dynamics
• Large set of particles with forces between them –
protein behavior, material simulation
• Calculating forces between particles can be done in
parallel for each particle
• Accumulation of forces can be implemented as
multilevel parallel sums
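A hedged sketch of the idea (a toy 1-D inverse-square interaction, not the cited molecular dynamics codes): the outer loop over particles is what a GPU parallelizes, one thread per particle, while the inner accumulation is what shared-memory tiles or multilevel parallel sums speed up:

```c
#include <assert.h>

/* O(n^2) pairwise force accumulation for a toy 1-D inverse-square force.
 * On a GPU, each particle i gets its own thread; the inner sum over j is
 * the per-particle accumulation done with tiling or parallel reduction. */
void accumulate_forces(const double *pos, double *force, int n) {
    for (int i = 0; i < n; i++) {          /* one GPU thread per particle */
        double f = 0.0;
        for (int j = 0; j < n; j++) {
            if (j == i) continue;
            double d = pos[j] - pos[i];
            f += (d > 0 ? 1.0 : -1.0) / (d * d);   /* direction / r^2 */
        }
        force[i] = f;
    }
}
```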
Genetics
• Large strings of genome sequences must be searched
through to organize and identify samples
• GPUs enable multiple parallel queries to the
database to perform string matching
• Again, order-of-magnitude speedups reported
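The core operation can be sketched as naive exact matching (the cited GPU aligners use more sophisticated algorithms; this only shows why the problem parallelizes): every starting offset, and every query, is independent, so a GPU can assign them to separate threads:

```c
#include <assert.h>
#include <string.h>

/* Naive exact string matching: return the first offset where query occurs
 * in genome, or -1. Each candidate offset s is independent work -- on a
 * GPU, offsets (and whole queries) map to separate threads. */
int find_match(const char *genome, const char *query) {
    int n = (int)strlen(genome), m = (int)strlen(query);
    for (int s = 0; s + m <= n; s++) {    /* each offset: an independent thread */
        int ok = 1;
        for (int k = 0; k < m; k++)
            if (genome[s + k] != query[k]) { ok = 0; break; }
        if (ok) return s;
    }
    return -1;
}
```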
Electrodynamics
• Simulation of electric fields, Coulomb forces
• Requires iterative solving of partial differential
equations
• Cell phone modeling applications have
reported 50x speedups using GPUs
Image Processing
• Medical Imaging was the early adopter
▫ Registration of massive 3D voxel images
▫ Both the cost function for deformable registration and interpolation of results are filtering operations
• Generic feature detection, recognition, object extraction are all filters
• For object recognition, one can search a database of objects in parallel
• Moving these algorithms off the CPU and onto the GPU can allow real-time interaction
Data Analysis
• Huge databases for web services require instant
results for many simultaneous users
• Insufficient room in main memory, disk is too slow and
doesn’t allow parallel reads
• GPUs can split up the data and perform
fast searches, keeping their section
in memory
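A minimal sketch of the partitioning idea (hypothetical, not from the cited database work): the data is split into fixed-size chunks, each chunk scanned independently (one GPU block keeping its section in fast memory), and the per-chunk results combined in a final reduction:

```c
#include <assert.h>

/* Count occurrences of key by scanning independent chunks of the data.
 * Each chunk models one GPU block holding its section in memory; the
 * per-chunk partial counts are combined in a reduction at the end. */
int partitioned_count(const int *data, int n, int chunk, int key) {
    int total = 0;
    for (int start = 0; start < n; start += chunk) {   /* each chunk: one GPU block */
        int end = start + chunk < n ? start + chunk : n;
        int local = 0;                                 /* per-block partial result */
        for (int i = start; i < end; i++)
            if (data[i] == key) local++;
        total += local;                                /* reduction across blocks */
    }
    return total;
}
```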
Example: Filtering Operation
• Many algorithms can be reduced to a filtering
operation. As an example, consider image convolution
for blurring
Kernel = Gaussian2D(size);
for (x,y) in Input {
for (p,q) in Kernel {
Output(x,y) += Input(x+p,y+q) * Kernel(p,q);
}
}
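The pseudocode above, made concrete as sequential C. Border handling is an assumption the slide leaves open; here out-of-range reads are clamped to the nearest edge pixel:

```c
#include <assert.h>

static int clampi(int v, int lo, int hi) { return v < lo ? lo : (v > hi ? hi : v); }

/* Direct 2-D convolution of a w x h image with a ksize x ksize kernel
 * (ksize odd). Out-of-range reads clamp to the nearest edge pixel. */
void convolve2d(const float *in, float *out, int w, int h,
                const float *kernel, int ksize) {
    int r = ksize / 2;
    for (int y = 0; y < h; y++)
        for (int x = 0; x < w; x++) {
            float acc = 0.0f;
            for (int q = -r; q <= r; q++)        /* loop over kernel rows */
                for (int p = -r; p <= r; p++)    /* loop over kernel cols */
                    acc += in[clampi(y + q, 0, h - 1) * w + clampi(x + p, 0, w - 1)]
                         * kernel[(q + r) * ksize + (p + r)];
            out[y * w + x] = acc;
        }
}
```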
Example: Filtering Operation
• A quick optimization applies to many filters: if the kernel is separable, the convolution can be done in one pass per dimension
Kernel = Gaussian1D(size);
for (x,y) in Input {
for (p) in Kernel {
Temp(x,y) += Input(x+p,y) * Kernel(p);
}
}
for (x,y) in Temp {
for (q) in Kernel {
Output(x,y) += Temp(x,y+q) * Kernel(q);
}
}
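A runnable C sketch of the two-pass separable filter: the 1-D kernel is applied along rows into an intermediate buffer, then along columns of that buffer. As before, clamped borders are an assumption:

```c
#include <assert.h>

static int clamp_idx(int v, int hi) { return v < 0 ? 0 : (v > hi ? hi : v); }

/* Separable convolution: rows of `in` into `tmp`, then columns of `tmp`
 * into `out`, using the same length-ksize 1-D kernel k (ksize odd). */
void convolve_separable(const float *in, float *tmp, float *out,
                        int w, int h, const float *k, int ksize) {
    int r = ksize / 2;
    for (int y = 0; y < h; y++)              /* pass 1: rows */
        for (int x = 0; x < w; x++) {
            float acc = 0.0f;
            for (int p = -r; p <= r; p++)
                acc += in[y * w + clamp_idx(x + p, w - 1)] * k[p + r];
            tmp[y * w + x] = acc;
        }
    for (int y = 0; y < h; y++)              /* pass 2: columns */
        for (int x = 0; x < w; x++) {
            float acc = 0.0f;
            for (int q = -r; q <= r; q++)
                acc += tmp[clamp_idx(y + q, h - 1) * w + x] * k[q + r];
            out[y * w + x] = acc;
        }
}
```

With a normalized kernel such as {0.25, 0.5, 0.25}, a constant image passes through unchanged, a quick sanity check on the two-pass structure.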
Example: Filtering Operation
• This is still O(2n²m) work on a sequential processor (two passes over n² pixels with a length-m kernel)
• Each output pixel is independent, but shares spatially local data and a constant kernel
UploadGPU(Kernel, CONSTANT);
UploadGPU(Input, TEXTURE);
ConvolveColumnsGPU<blocks,threads>();
ConvolveRowsGPU<blocks,threads>();
DownloadGPU(Output, TEXTURE);
Example: Filtering Operation
• Complexity remains the same, however each MAC
instruction can be executed on as many processors as
are available, and memory can be accessed quickly
because of the assignment of blocks and texture
memory
• In practice, the overhead of uploading and
downloading from the GPU is far less than the
performance gained in the kernel
Example: Filtering Operation
__global__ void convolutionColumnsKernel(
    float *d_Dst,
    float *d_Src,
    int imageW,
    int imageH,
    int pitch
){
    __shared__ float s_Data[COLUMNS_BLOCKDIM_X]
        [(COLUMNS_RESULT_STEPS + 2 * COLUMNS_HALO_STEPS) * COLUMNS_BLOCKDIM_Y + 1];

    // *snip* Offset d_Src/d_Dst to this thread's base position and populate s_Data
    __syncthreads();

    #pragma unroll
    for(int i = COLUMNS_HALO_STEPS; i < COLUMNS_HALO_STEPS + COLUMNS_RESULT_STEPS; i++){
        float sum = 0;
        #pragma unroll
        for(int j = -KERNEL_RADIUS; j <= KERNEL_RADIUS; j++)
            sum += c_Kernel[KERNEL_RADIUS - j] *
                   s_Data[threadIdx.x][threadIdx.y + i * COLUMNS_BLOCKDIM_Y + j];
        d_Dst[i * COLUMNS_BLOCKDIM_Y * pitch] = sum;
    }
}
Even More Fun
• Some of that overhead can be avoided when the
destination of the GPU’s data is graphics
• Texture memory can be shared between general
purpose computations and normal rendering
• For post-processing effects or visualizing particles, the
pixel/vertex data never needs to leave the GPU
Conclusions
Certain classes of problem appear in many different
fields, and involve very data-parallel operations such
as filtering, sorting, or integration
Taking advantage of the architecture decisions behind
graphics processing units such as their multiprocessing
and native vector operations, these problems can be
solved quickly and cheaply
References
• 1. Ziegler, Gernot. Introduction to the CUDA Architecture. [Online] 2009. http://www.cse.scitech.ac.uk/disco/workshops/200907/Day1_01_Intro_CUDA_Architecture.pdf
• 2. NVIDIA Corporation. NVIDIA Compute PTX: Parallel Thread Execution ISA Version 1.1. 2007.
• 3. Göddeke, Dominik. Fast and Accurate Finite-Element Multigrid Solvers for PDE Simulations on GPU Clusters. Berlin: Logos Verlag, 2010. ISBN 978-3-8325-2768-6.
• 4. John E. Stone, James C. Phillips, Peter L. Freddolino, David J. Hardy, Leonardo G. Trabuco, and Klaus Schulten. Accelerating molecular modeling applications with graphics processors. Journal of Computational Chemistry, 28:2618-2640, 2007.
• 5. Michael C. Schatz, Cole Trapnell, Arthur L. Delcher, and Amitabh Varshney. High-throughput sequence alignment using Graphics Processing Units. BMC Bioinformatics, 2007.
• 6. ANSYS, Inc. ANSYS Unveils GPU Computing for Accelerated Engineering Simulations. [Online] 2010. http://investors.ansys.com/releasedetail.cfm?releaseid=509436
• 7. Warburton, Tim. Parallel Numerical Methods for Partial Differential Equations. Rocky Mountain Mathematics Consortium. [Online] 2008. http://www.caam.rice.edu/~timwar/RMMC/gpuDG.html
• 8. Ansorge, Richard. AIRWC: Accelerated Image Registration With CUDA. BSS Group, Cavendish Laboratory, University of Cambridge, UK. 2008.
• 9. N. Cornelis and L. Van Gool. Fast Scale Invariant Feature Detection and Matching on Programmable Graphics Hardware. CVPR 2008 Workshop, 2008.
• 10. Andrea Di Blas and Tim Kaldewey. Data Monster: Why graphics processors will transform database processing. IEEE Spectrum. [Online] 2009. http://spectrum.ieee.org/computing/software/data-monster/0
• 11. Podlozhnyuk, Victor. Image Convolution with CUDA. [Online] 2007. http://developer.download.nvidia.com/compute/DevZone/C/html/C/src/convolutionSeparable/doc/convolutionSeparable.pdf
• 12. Goodnight, Nolan. CUDA/OpenGL Fluid Simulation. [Online] 2007. http://new.math.uiuc.edu/MA198-2008/schaber2/fluidsGL.pdf
Questions?