

On Using Graphics Hardware for Scientific Computing

________________________________________________

Stan Tomov

June 23, 2006


Outline

• Motivation
• Literature review
• The graphics pipeline
• Programmable GPUs
• Some application examples
• Performance results
• Conclusion


Motivation

                  Frames per second using
Problem size    OpenGL (GPU)    Mesa (CPU)
      11,540             189          8.01
      47,636              52          1.71
     193,556              13          0.44
     780,308               3          0.12

Table 1. GPU vs CPU in rendering polygons. The GPU (Quadro2 Pro) is approximately 30 times faster than the CPU (Pentium III, 1 GHz) in rendering polygonal data of various sizes.


Motivation

• High flop count (currently 200 GFlops, single precision)

• Competitive price/performance (less than 1 cent per MFlop)

• Performance doubling every 6 months

• Continuously increasing functionality and programmability
– Realistic games require more complicated physics

(picture: from the GPU Gems 2 book)


Literature review

Using graphics hardware for non-graphics applications (just a few examples):

• Cellular automata
• Reaction-diffusion simulation (Mark Harris, University of North Carolina)
• Matrix multiply (E. Larsen and D. McAllister, University of North Carolina)
• Lattice Boltzmann (Wei Li, Xiaoming Wei, and Arie Kaufman, Stony Brook)
• CG and multigrid (J. Bolz et al., Caltech, and N. Goodnight et al., University of Virginia)
• Convolution (University of Stuttgart)
• BLAS 1, 2; FFT; certain eigensolvers; etc.
• See also GPGPU’s homepage: http://www.gpgpu.org/


Literature review

Typical performance results reported (by the middle of 2003):

• Significant speedups of GPU over CPU are reported when the GPU performs low-precision computations (30 to 60 times, depending on the configuration): integers (8- or 12-bit arithmetic) and 16-bit floating point

• Vendor advertisements about very high performance assume low precision arithmetic

• NCSA, University of Illinois assembled a $50,000 supercomputer out of 70 PlayStation 2 consoles, which could theoretically deliver 0.5 trillion operations/second

• GPU’s 32-bit flops performance is comparable to the CPU’s (may be 2-4 times faster depending on application and configuration)


The graphics pipeline

• GeForce 256 (August 1999)
  - allowed a certain degree of programmability
  - before: fixed-function pipeline

• GeForce 3 (February 2001)
  - considered the first fully programmable GPU

• GeForce 4
  - partial 16-bit floating point arithmetic

• NV30
  - 32-bit floating point

• Cg
  - high-level programming language


The graphics pipeline

• GPUs are on their way to becoming programmable stream processors

• Stream formulation of the graphics pipeline: all data are viewed as streams and computation as kernels

• Streaming
– Efficient computation (enables efficient parallelism; deep pipeline)
– Efficient communication (efficient off-chip communication; intermediate results kept on chip; deep pipelining allows a high degree of latency tolerance)

(picture: from the GPU Gems 2 book)
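
The stream formulation can be illustrated in software. The following is a minimal Python sketch, not GPU code: streams are modeled as iterators and kernels as functions applied element by element. The names `apply_kernel`, `run_pipeline`, `kernel_scale`, and `kernel_offset` are hypothetical and chosen only for illustration. Chaining generators loosely mimics how intermediate results flow from one stage to the next without being stored in full.

```python
from typing import Callable, Iterable, Iterator, List, Sequence

Kernel = Callable[[float], float]


def apply_kernel(kernel: Kernel, stream: Iterable[float]) -> Iterator[float]:
    """Lazily apply one kernel to every element of a stream."""
    for element in stream:
        yield kernel(element)


def run_pipeline(kernels: Sequence[Kernel], stream: Iterable[float]) -> List[float]:
    """Chain kernels so that each element passes through all stages in order."""
    for kernel in kernels:
        stream = apply_kernel(kernel, stream)
    return list(stream)


def kernel_scale(x: float) -> float:
    """Example kernel (hypothetical): scale an element by 2."""
    return 2.0 * x


def kernel_offset(x: float) -> float:
    """Example kernel (hypothetical): shift an element by 1."""
    return x + 1.0


result = run_pipeline([kernel_scale, kernel_offset], [0.0, 1.0, 2.0])
```

Because every element is processed independently of the others, the same kernels could in principle be applied to many elements at once, which is the property the streaming model exploits.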


Programmable GPUs (in particular NV30)

• GPU programming model: streaming
– Naturally addresses parallelism and communication
– Easy when the problem maps well

• Supports floating point operations

• Vertex program
– Replaces the fixed-function pipeline for vertices
– Manipulates single-vertex data
– Executes for every vertex

• Fragment program
– Similar to a vertex program, but for pixels

• Programming in Cg
– High-level language; looks like C; portable; Cg programs are compiled to assembly code
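
To make the "executes for every vertex" and "for every pixel" ideas concrete, here is a small CPU-side emulation in Python. It is only a sketch under stated assumptions: real vertex and fragment programs would be written in Cg and run on the GPU, and the function names, the translation matrix, and the color gradient below are hypothetical examples, not taken from the slides.

```python
from typing import List, Sequence, Tuple

Vec4 = Tuple[float, ...]          # four components: (x, y, z, w) or (r, g, b, a)
Mat4 = Sequence[Sequence[float]]  # 4 x 4 matrix, row-major


def vertex_program(vertex: Vec4, transform: Mat4) -> Vec4:
    """Runs once per vertex: multiply a single vertex by a 4x4 matrix."""
    return tuple(
        sum(transform[row][col] * vertex[col] for col in range(4))
        for row in range(4)
    )


def fragment_program(x: int, y: int, width: int, height: int) -> Vec4:
    """Runs once per pixel: here, a simple RGBA gradient (hypothetical)."""
    return (x / (width - 1), y / (height - 1), 0.0, 1.0)


def run_vertex_stage(vertices: Sequence[Vec4], transform: Mat4) -> List[Vec4]:
    """Each vertex is handled independently, so the loop is parallelizable."""
    return [vertex_program(v, transform) for v in vertices]


def run_fragment_stage(width: int, height: int) -> List[List[Vec4]]:
    """Each pixel is handled independently as well (requires width, height >= 2)."""
    return [
        [fragment_program(x, y, width, height) for x in range(width)]
        for y in range(height)
    ]


# Hypothetical example: translate by (10, 20, 30).
translate = [
    [1.0, 0.0, 0.0, 10.0],
    [0.0, 1.0, 0.0, 20.0],
    [0.0, 0.0, 1.0, 30.0],
    [0.0, 0.0, 0.0, 1.0],
]
moved = run_vertex_stage([(1.0, 2.0, 3.0, 1.0)], translate)
image = run_fragment_stage(2, 2)
```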


Block Diagram of GeForce FX

• AGP 8x graphics bus bandwidth: 2.1 GB/s
• Local memory bandwidth: 16 GB/s
• Chip officially clocked at 500 MHz
• Vertex processor:
  - executes vertex shaders or emulates fixed transformations and lighting (T&L)
• Pixel processor:
  - executes pixel shaders or emulates fixed shaders
  - 2 int & 1 float ops or 2 texture accesses per clock cycle
• Texture & color interpolators:
  - interpolate texture coordinates and color values
• Performance (on processing 4D vectors):
  – Vertex ops/sec: 1.5 Gops
  – Pixel ops/sec: 8 Gops (int) or 4 Gops (float)

Hardware at Digit-Life.com, NVIDIA GeForce FX, or "Cinema show started", November 18, 2002.


GeForce FX: 3 vertex and 8 pixel processors

Latest nVidia card: dual-GPU GeForce 7950 GX2 with 32 vertex and 96 pixel processors
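
The quoted GeForce FX peak rates appear to follow from the processor counts and the clock rate. The short Python calculation below is a sketch of that arithmetic. It assumes one operation per clock for each vertex processor (not stated on the slide) and takes the 2 int or 1 float operations per clock for each pixel processor from the block diagram notes. The function name `peak_gops` is hypothetical.

```python
CLOCK_HZ = 500e6        # official chip clock from the slide
VERTEX_PROCESSORS = 3   # GeForce FX, from the slide
PIXEL_PROCESSORS = 8    # GeForce FX, from the slide


def peak_gops(processors: int, ops_per_clock: int, clock_hz: float = CLOCK_HZ) -> float:
    """Peak rate in Gops = processors * ops per clock * clock frequency."""
    return processors * ops_per_clock * clock_hz / 1e9


# Assumption: each vertex processor issues one 4D-vector op per clock.
vertex_gops = peak_gops(VERTEX_PROCESSORS, 1)

# From the slide: a pixel processor does 2 int ops or 1 float op per clock.
pixel_int_gops = peak_gops(PIXEL_PROCESSORS, 2)
pixel_float_gops = peak_gops(PIXEL_PROCESSORS, 1)
```

Under these assumptions the three values match the 1.5 Gops, 8 Gops, and 4 Gops figures listed above.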


Summary of CPU vs GPU

• General vs specialized hardware
– CPUs have more complex control hardware
– GPUs can have hardware acceleration for specific tasks

• Sequential vs parallel programming models
– In general, CPUs don’t have the GPU’s level of data parallelism (though some is available: Intel’s SSE and PowerPC’s AltiVec instruction sets)

• Memory latency vs bandwidth optimization


Some application examples

• Monte Carlo simulations
– Used in a variety of simulations in physics, finance, chemistry, etc.
– Based on probability and statistics; use random numbers
– A classical example: compute the area of a circle
– Computation of expected values: $E(F) = \sum_{i=1}^{N} F(S_i) P(S_i)$
– N can be very large: on a 1024 x 1024 lattice of particles, with every particle modeled to have k states, $N = k^{1024^2}$
– Random number generation. We used a linear congruential type generator: $R(n) = (a \cdot R(n-1) + b) \bmod N$ (here N is the generator's modulus, not the number of states)
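
The circle-area example and the linear congruential generator can be combined in a few lines. The Python sketch below is illustrative only: the slides do not give the generator's constants, so the multiplier, increment, and modulus used here (1664525, 1013904223, 2^32, a commonly published choice) are assumptions, as are the class and function names.

```python
import math


class LinearCongruential:
    """R(n) = (a * R(n-1) + b) mod m, with assumed (not the author's) constants."""

    def __init__(self, seed: int, a: int = 1664525, b: int = 1013904223,
                 modulus: int = 2**32) -> None:
        self.state = seed % modulus
        self.a = a
        self.b = b
        self.modulus = modulus

    def next_int(self) -> int:
        """Advance the recurrence and return the new integer state."""
        self.state = (self.a * self.state + self.b) % self.modulus
        return self.state

    def next_unit(self) -> float:
        """Return a pseudo-random float in [0, 1)."""
        return self.next_int() / self.modulus


def estimate_circle_area(samples: int, radius: float = 1.0, seed: int = 12345) -> float:
    """Sample points in the bounding square and count those inside the circle."""
    rng = LinearCongruential(seed)
    hits = 0
    for _ in range(samples):
        x = (2.0 * rng.next_unit() - 1.0) * radius
        y = (2.0 * rng.next_unit() - 1.0) * radius
        if x * x + y * y <= radius * radius:
            hits += 1
    square_area = (2.0 * radius) ** 2
    return square_area * hits / samples


area = estimate_circle_area(200_000)
error = abs(area - math.pi)  # exact area of the unit circle is pi
```

The statistical error of such an estimate shrinks roughly as one over the square root of the number of samples, which is why Monte Carlo runs tend to need many independent evaluations and map well to parallel hardware.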


Some application examples

• Monte Carlo simulations
– Ising model
  • Simplified model for magnets
  • Evolve the system into “higher probability” states and compute expected values as an average over only those states
– Percolation
  • In studies of disease spreading, flow in porous media, forest fire propagation, clustering, etc.

• Lattice Boltzmann method
– Simulate fluid flow; particles are allowed to move and collide on a lattice
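
One common way to "evolve the system into higher probability states" for the Ising model is the Metropolis algorithm. The slides do not name the update rule actually used, so the Python sketch below is an assumption, as are all function names and parameters. It uses a checkerboard update order, in which all sites of one color depend only on sites of the other color, so each half-sweep could be carried out in parallel (for example, one fragment per site).

```python
import math
import random
from typing import List

Lattice = List[List[int]]


def make_lattice(size: int, spin: int = 1) -> Lattice:
    """Square lattice with all spins equal; size must be even for the checkerboard."""
    return [[spin] * size for _ in range(size)]


def neighbour_sum(lattice: Lattice, i: int, j: int) -> int:
    """Sum of the four nearest neighbours with periodic boundaries."""
    n = len(lattice)
    return (lattice[(i - 1) % n][j] + lattice[(i + 1) % n][j]
            + lattice[i][(j - 1) % n] + lattice[i][(j + 1) % n])


def metropolis_sweep(lattice: Lattice, beta: float, rng: random.Random,
                     coupling: float = 1.0) -> None:
    """One sweep: update all sites of one colour, then all sites of the other."""
    n = len(lattice)
    for colour in (0, 1):
        for i in range(n):
            for j in range(n):
                if (i + j) % 2 != colour:
                    continue
                # Energy change if the spin at (i, j) were flipped.
                delta_e = 2.0 * coupling * lattice[i][j] * neighbour_sum(lattice, i, j)
                if delta_e <= 0.0 or rng.random() < math.exp(-beta * delta_e):
                    lattice[i][j] = -lattice[i][j]


def magnetization(lattice: Lattice) -> float:
    """Average spin per site."""
    n = len(lattice)
    return sum(sum(row) for row in lattice) / (n * n)


def average_abs_magnetization(size: int, beta: float, sweeps: int,
                              burn_in: int, seed: int = 0) -> float:
    """Expected |magnetization| as an average over the sampled states."""
    rng = random.Random(seed)
    lattice = make_lattice(size)
    total = 0.0
    for sweep in range(burn_in + sweeps):
        metropolis_sweep(lattice, beta, rng)
        if sweep >= burn_in:
            total += abs(magnetization(lattice))
    return total / sweeps


cold = average_abs_magnetization(size=8, beta=10.0, sweeps=20, burn_in=5)
hot = average_abs_magnetization(size=8, beta=0.2, sweeps=200, burn_in=200)
```

At low temperature (large `beta`) the lattice stays ordered, while at high temperature (small `beta`) the average magnetization is much smaller, which is the qualitative behavior the model is meant to capture.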


Some performance results

• saxpy on 512 x 512 (x 4) vectors: 1 GFlop
– speed limited by GPU memory bandwidth (16 GB/s)

• sin, cos, exp, log: 20 times faster than on a Pentium 4, 2.8 GHz
– hardware accelerated, at low accuracy

• Ising model: 7 GFlops
– 44% of theoretical maximum
– On a fragment program compiled to 109 assembly instructions
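
The bandwidth limit on saxpy and the 44% figure can be checked with simple arithmetic. The Python sketch below is a back-of-the-envelope estimate, not a measurement. It assumes that saxpy (y = a*x + y) moves three 4-byte floats per element (read x, read y, write y) for two floating point operations, and that the theoretical maximum is 4 Gops (float) on 4D vectors, i.e. 16 GFlops. Both assumptions are our reading of the numbers above, and the function names are hypothetical.

```python
from typing import List

BYTES_PER_FLOAT = 4  # single precision


def saxpy(a: float, x: List[float], y: List[float]) -> List[float]:
    """Reference saxpy: returns a*x + y element by element."""
    return [a * xi + yi for xi, yi in zip(x, y)]


def bandwidth_bound_gflops(bandwidth_gb_per_s: float) -> float:
    """Upper bound on saxpy throughput when memory traffic is the bottleneck."""
    flops_per_element = 2                    # one multiply, one add
    bytes_per_element = 3 * BYTES_PER_FLOAT  # assumed: read x, read y, write y
    return bandwidth_gb_per_s * flops_per_element / bytes_per_element


def fraction_of_peak(measured_gflops: float, peak_gflops: float) -> float:
    """Measured rate as a fraction of the theoretical peak."""
    return measured_gflops / peak_gflops


saxpy_bound = bandwidth_bound_gflops(16.0)   # 16 GB/s local memory bandwidth

# Assumption: peak = 4 Gops (float) * 4 vector components = 16 GFlops.
PEAK_GFLOPS = 4.0 * 4
ising_fraction = fraction_of_peak(7.0, PEAK_GFLOPS)
```

Under these assumptions the saxpy bound is a few GFlops at most, far below the arithmetic peak, which is consistent with the measured 1 GFlop being limited by memory bandwidth rather than by computation.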


Conclusions

• What to expect from future GPGPUs?

• Can GPGPUs influence future computer systems (HPC, and consequently our models of software development)? Is IBM’s Cell processor already an example?

• Current trends:
– CPU: multi-core
– GPU: more powerful streaming model (gather, scatter, conditional streams, reduction, etc.); more CPU functionality
