
This document contains the draft version of the following paper: A. Balijepalli, T.W. LeBrun, and S.K. Gupta. Stochastic simulations with graphics hardware: Characterization of accuracy and performance. ASME Journal of Computing and Information Science in Engineering, 10(1): 011010, March 2010. Readers are encouraged to get the official version from the journal’s web site or by contacting Dr. S.K. Gupta ([email protected]).


Stochastic Simulations with Graphics Hardware: Characterization of Accuracy and Performance

Arvind Balijepalli 1,2    Thomas W. LeBrun 1    Satyandra K. Gupta 2,*

1 Manufacturing Engineering Laboratory, National Institute of Standards and Technology, Gaithersburg, MD 20899

2 Department of Mechanical Engineering, University of Maryland, College Park, MD 20742

Abstract

Methods to implement stochastic simulations on the Graphics Processing Unit (GPU) have been developed. These algorithms are used in a simulation of micro and nano assembly with optical tweezers, but are also directly compatible with simulations of a wide variety of assembly techniques using either electrophoretic, magnetic or other trapping techniques. Significant speedup is possible for stochastic particle simulations when using the GPU, included in most PCs, rather than the CPU that handles most calculations. However, a careful analysis of the accuracy and precision when using the GPU in stochastic simulations is lacking and is addressed here. A stochastic simulation for spherical particles using Langevin's equation and Verlet integration has been developed and mapped onto stages of the GPU hardware that provide the best performance. The results from the CPU and GPU implementations are then compared with each other and with well-established theory. The error in the mean ensemble energy and the diffusion constant is measured for both the CPU and the GPU implementations. The time taken to complete several simulation experiments on each platform has also been measured and the speedup attained by the GPU is then calculated.

Index Terms: GPU, Stochastic Simulation, Verlet, Optical Tweezers

1 Introduction

Stochastic particle simulations are important when simulating micro or nano assembly operations using a variety of manipulation techniques. Assembly methods that employ optical, magnetic or electrophoretic traps often perform manipulation in a fluid cell where there are several freely diffusing particles and a few that are under the influence of the trapping field. Therefore, an accurate and fast algorithm to update the positions of the particles in the cell is needed. We have developed algorithms for stochastic particle simulations which can easily be interfaced with many trapping force calculations and are therefore widely useful in micro and nano assembly simulations.

Current GPUs have considerable computing power and are able to execute complex calculations to achieve photo-realistic rendering and smooth interaction in multimedia applications. The high computational power that these devices offer also makes them an attractive option to speed up scientific calculations. While some general physical models have been implemented on the GPU and recent work [1] has addressed the issues of accuracy and precision for deterministic systems, such characterization is far from complete and even less satisfactory for iterative systems that generate long trajectories and may accumulate errors over thousands of steps. The focus of this work, therefore, is to a) develop algorithms for stochastic simulations using the GPU, which can be used in many kinds of nanoassembly simulations and b) characterize the accuracy of stochastic simulations implemented on graphics hardware.

∗ Author for correspondence


All GPUs to date store data in 32-bit floating point format compared with the double-precision 64-bit formats that are common on the CPU, degrading accuracy and potentially resulting in large roundoff errors. The problem is further aggravated by the fact that most GPU vendors (including current generation hardware such as the nVIDIA G80 [2]) do not strictly adhere to the IEEE floating point representation standard, causing the round-off errors to vary significantly between GPUs. The availability of 64-bit GPUs in the future is likely to improve the accuracy in some applications, but unless the GPU vendors uniformly adopt the floating point standards (which is unlikely due to architectural and performance tradeoffs) already implemented on the CPU, the characterization of GPU errors will remain an important component in porting scientific calculations to the GPU successfully. In this work, we pick a model stochastic particle system of freely diffusing particles because it forms the basis of optical tweezers based assembly and is also widely useful for the other manipulation methods we described previously. We quantify the accuracy of the GPU in relation to a double precision CPU implementation of the algorithm and also describe the details of the GPU memory and computational model in later sections. The constraints in programming streaming processors that are described in this paper are important even for newer architectures that employ a unified shader model.
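To illustrate the scale of this effect, the following minimal sketch (an illustration added here, not part of the paper's implementation; Python with NumPy is assumed) accumulates many small increments, as an iterative trajectory does over thousands of steps, in both precisions:

```python
import numpy as np

# Accumulate many small increments, as an iterative trajectory does over
# thousands of steps, in 32-bit and in 64-bit floating point.
rng = np.random.default_rng(0)
steps = rng.uniform(0.0, 1e-3, size=100_000)

sum64 = np.sum(steps, dtype=np.float64)
sum32 = np.float32(0.0)
for s in steps.astype(np.float32):
    sum32 = sum32 + s          # round-off enters at every 32-bit addition

print(f"64-bit sum: {sum64:.9f}")
print(f"32-bit sum: {float(sum32):.9f}")
print(f"relative difference: {abs(sum64 - float(sum32)) / sum64:.2e}")
```

On IEEE-754 hardware the 32-bit sum visibly drifts from the 64-bit reference; hardware that chops some operations instead of rounding exactly can change such errors further, as discussed in Section 4.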

Many general purpose calculations have been successfully implemented on the GPU, facilitating its use as a cost-effective parallel processor. Geometric modeling applications have heavily leveraged the capabilities of the GPU [3, 4, 5, 6] and methods to solve Partial Differential Equations (PDE) as well as linear systems of equations have also been successfully implemented on the GPU. This in turn has led to the development of finite element algorithms [1] and spring mass systems. Development of sorting and search methods, such as bitonic sort or a nearest neighbor search, have helped the implementation of fluid dynamics problems on graphics hardware. Kipfer et al. [7] have developed a GPU based particle engine that can simulate a million particles with collision detection using a simple sorting algorithm. Kolb and Cuntz [8] adapted a Lagrangian model that directly simulates particle motion, where they propose to eliminate the need to sort interacting particles to determine the effect of nearest neighbor interactions. A majority of GPU implementations, however, emphasize the performance of their respective algorithms on graphics hardware over accuracy.

Here we develop a physically accurate simulation engine for freely diffusing particles on the GPU by choosing parts of the graphics pipeline that provide the most benefit. The simulation is then tested against the well-known theory of Brownian motion in order to characterize the accuracy when using the GPU in stochastic simulations. It is anticipated that our implementation will yield significant speedup with acceptable accuracy when compared with a CPU implementation.

This stochastic simulation module has been developed to model nanoassembly processes in an optical tweezers-based system. Traditionally, optical tweezers [9, 10] have been used for micromanipulation in biological applications [11, 12] with great success. We have extended the optical tweezers to perform micro and nanoscale manipulations to prototype nanodevices [13, 14]. This system has the ability to easily and intuitively interact with the instrument and test complex operations through scripting. However, flexible algorithms are critically needed to quantitatively measure trapping behavior, improve the understanding of system operation in assembly tasks and develop automation and control functions. Therefore, a robust and physically accurate simulation module is centrally important to facilitate testing and development of such functions. The assembly workspace within the optical tweezers system has several particles that are freely diffusing in a liquid and a few that are under the influence of a trapping beam. At each time-step, the simulation must update the positions, velocities and accelerations of each particle in the workspace, a task that is computationally challenging for real-time operation. However, since similar operations are performed on every particle in the workspace, we can update each particle independently in parallel to improve performance. External forces such as the effect of the optical trap are then easily coupled into the particle simulation by taking advantage of the different time-scales of each process.

The simulation model for a freely diffusing particle system (Section 2) is first described, followed by an overview of the GPU architecture (Section 3). Special attention is given to how the stages of the graphics pipeline map onto our simulation model and which stages of the graphics hardware provide the most benefit for a given operation. Finally, results (Section 4) highlighting the accuracy achieved by running the simulation on the GPU and the measured speedup are presented, followed by a discussion of the results and conclusions (Section 5).


2 Simulation Outline

A large particle in a fluid medium is bombarded by the surrounding liquid molecules undergoing thermal motion. This bombardment gives rise to a fluctuating force on the large particle and consequently imparts some velocity to it. As the particle moves through the liquid with a finite velocity it also feels the effect of a systematic drag force. Langevin's equation is used to formally describe this phenomenon and consists of a properly scaled random force and a systematic drag force [15, 16, 17]. Equation 1 gives the Langevin equation for a particle with mass m, radius r and velocity V(t) at time t in a fluid with viscosity η, which is a function of temperature T. The characteristic time-scale of this model is called the relaxation time, m/γ, where γ = 6πηr is the drag coefficient (from Stokes' law for a spherical particle), and is the time taken for a particle's velocity to asymptotically approach thermal equilibrium when starting from rest or when it is injected with a large initial velocity.
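As a concrete check of these definitions, the short sketch below (illustrative only; Python with NumPy, a water viscosity of 1.002 × 10⁻³ Pa·s at 293 K and a glass density of 2600 kg/m³ are assumptions) computes m, γ and the relaxation time for the particle radii used later in Table 1:

```python
import numpy as np

ETA = 1.002e-3      # viscosity of water at 293 K (Pa*s), assumed value
RHO_GLASS = 2600.0  # density of glass (kg/m^3), implied by the masses in Table 1

for r in (50e-9, 500e-9, 5e-6):                 # particle radii used in the paper
    m = RHO_GLASS * (4.0 / 3.0) * np.pi * r**3  # mass of a glass sphere
    gamma = 6.0 * np.pi * ETA * r               # Stokes drag coefficient
    print(f"r = {r:7.1e} m  m = {m:.4e} kg  relaxation time m/gamma = {m / gamma:.4e} s")
```

The printed relaxation times, approximately 1.442 ns, 144.2 ns and 14.42 µs, match the m/γ column of Table 1.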

$$\frac{dV(t)}{dt} = -\frac{\gamma}{m}\,V(t) + \frac{\zeta}{m}\,\Gamma(t) \qquad (1)$$

The scaling constant, ζ, in Equation 1 is $\sqrt{2\gamma K_B T}$ [18] and is obtained by imposing the requirements of the fluctuation-dissipation theorem [19], where $K_B$ is Boltzmann's constant. The presence of the stochastic force term ζΓ(t) prevents the direct analytical solution of this equation. Therefore we first express Langevin's equation in a finite difference form and then proceed to integrate it numerically using an appropriate integration algorithm. Assuming a uniform time-step δt and constant acceleration, we can write the acceleration of the particle at the end of a time-step in finite difference form as shown in Equation 2. A force term $F_{ext}$ has been added to allow the inclusion of a force potential if needed, but is set to zero for the free particle.

$$\frac{V(t+\delta t) - V(t)}{\delta t} = A(t+\delta t) = -\frac{\gamma}{m}\,V(t) + \frac{\zeta}{m}\,\Gamma(t) + \frac{F_{ext}}{m} \qquad (2)$$

As stated before, a large particle moving with a finite velocity V(t) in a fluid bath at temperature T undergoes many collisions with the molecules in the liquid [20]. The stochastic force in Equation 2 is the result of these collisions over a finite period of observation, δt. Furthermore, the stochastic force that arises from these collisions is found to be normally distributed with width proportional to 1/δt [19]. Therefore we can replace the stochastic term Γ(t) in Equation 2 with a properly scaled normal distribution $N(0, 1/\delta t)$. Using the property of the normal random variable, $N(\alpha m, \alpha^2\sigma^2) = \alpha N(m, \sigma^2)$, we can absorb $(1/\delta t)^{1/2}$ into the scaling constant. The final form of the acceleration equation is then given by Equation 3, where $\chi = \sqrt{2\gamma K_B T/\delta t}$.

$$A(t+\delta t) = -\frac{\gamma}{m}\,V(t) + \frac{\chi}{m}\,N(0,1) + \frac{F_{ext}}{m} \qquad (3)$$

Equation 3 can be integrated explicitly using one of the finite difference methods that have been proposed in the literature [21]. Implicit methods to solve Langevin's equation have also been proposed [22] and offer enhanced numerical stability at larger time-steps than is possible with explicit methods; however, the computation cost for these methods can be very high. Berendsen and van Gunsteren [23] give a detailed comparison of several integration algorithms. Of the explicit finite difference methods, the most popular are either the Gear predictor-corrector method [24] or the Verlet method [25]. The Gear predictor-corrector, as the name suggests, consists of two steps: an initial prediction step that gives the initial estimates of the quantities being calculated, followed by a correction. A fourth or fifth order approximation of the calculated quantities is commonly used and this method provides excellent energy conservation for smaller time-steps. However, the error in the equilibrium energy increases very rapidly for large time-steps. The Verlet algorithm, on the other hand, is a second order method that provides good energy conservation at larger time-steps. The Verlet methods typically underestimate the system energy when tested on a harmonic oscillator [26].


Figure 1: Conceptual map of the GPU rendering pipeline. The three main stages are physically represented in the center of the figure while examples of their memory models (scatter and gather) are shown at the bottom.

Eventually, an integration method must be selected that a) satisfies conservation laws with acceptable accuracy, b) is computationally cost effective and c) permits the use of the largest possible time-step. For our application, we choose the Verlet algorithm [27] due to its ease of implementation and the possibility to use larger time-steps.

$$V(t+\delta t) = V(t) + \frac{A(t) + A(t+\delta t)}{2}\,\delta t + O(\delta t^2) \qquad (4)$$

$$R(t+\delta t) = R(t) + V(t)\,\delta t + \frac{1}{2}A(t)\,\delta t^2 + O(\delta t^4) \qquad (5)$$

Starting with the state of the simulation at time t, we calculate the acceleration of the particle at the next instance (t + δt) given by Equation 3. Next, we calculate the velocity of the particle at t + δt using Equation 4. Finally, we update the position of the particle using the previous velocity and acceleration as shown in Equation 5. Therefore, the velocity Verlet generates a list of positions, velocities and accelerations in one pass. Since the algorithm requires only one evaluation of any external potential at each time-step, it is generally more efficient than other methods. The overall errors in position and velocity for this algorithm are proportional to δt² [24], which makes the velocity Verlet integrator a second order integrator. The result of a single simulation run is a list of the particle's position, velocity and acceleration vectors at each time step.
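For reference, a minimal sketch of this update (illustrative, not the paper's GPU code; Python with NumPy is assumed, and the function name and argument order are ours) is:

```python
import numpy as np

KB = 1.380649e-23  # Boltzmann constant (J/K)

def verlet_step(r, v, a, dt, m, gamma, T, rng, f_ext=0.0):
    """One velocity Verlet step for the Langevin model (Equations 3-5)."""
    chi = np.sqrt(2.0 * gamma * KB * T / dt)            # noise scale from Eq. 3
    # Eq. 3: acceleration at t + dt from drag on V(t), scaled noise and F_ext
    a_new = (-gamma * v + chi * rng.standard_normal(v.shape) + f_ext) / m
    # Eq. 4: velocity from the average of the old and new accelerations
    v_new = v + 0.5 * (a + a_new) * dt
    # Eq. 5: position from the previous velocity and acceleration
    r_new = r + v * dt + 0.5 * a * dt**2
    return r_new, v_new, a_new

# Example: advance 900 one-dimensional trajectories by one 500 ps time-step
# (0.005τ for a 500 nm particle; gamma is the approximate Stokes drag).
rng = np.random.default_rng(0)
r, v, a = (np.zeros(900) for _ in range(3))
r, v, a = verlet_step(r, v, a, dt=5e-10, m=1.3614e-15, gamma=9.44e-9, T=293.0, rng=rng)
```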

3 GPU Implementation

The graphics rendering pipeline sequentially processes data in several steps to transform an arbitrary scene representation into a final result that is output to a display device. We begin by outlining the steps in the rendering process conceptually, followed by a detailed look at the physical embodiment of the rendering pipeline commonly found in most modern GPUs.

The graphics scene in arbitrary world coordinates contains several geometric objects such as cubes, polygons or a mesh of triangles, as well as lighting, point of view and other state information. The geometry of each object is then described using a primitive construct called a vertex. A vertex, in contrast to traditional geometric vertices, is a data structure that contains, in addition to a position in the native coordinate system, attributes such as a Red, Green, Blue (RGB) color, normal vector and other user-defined information. A list of vertices, grouped by geometry, forms the input to the graphics pipeline. Connectivity information between the vertices of a geometry object, e.g. which vertices are connected by edges, is also stored implicitly within this list.

The rendering process consists of three stages: geometric transformation, rasterization and post-processing. The first stage of the rendering process manipulates incoming vertices by ignoring the embedded connectivity information and treating each vertex independently. Operations such as translation, rotation, scaling and lighting are applied to every vertex and the net result of these transformations is to align the user-defined coordinate system with the coordinates of the viewing plane. The output of this stage is the transformed vertex list which is sent to the next stage in the pipeline.

In the rasterization stage, the three-dimensional vertices are first mapped onto the final screen coordinate system and connection information which was implicitly stored in the vertex list is used to render the scene starting with polygons that are farthest from the viewpoint. Pixels between connected vertices are then filled in using appropriate color values. Next, pixels that are obscured by polygons lying on top of them and pixels that fall outside the screen viewing area are discarded. The final two-dimensional array is called a fragment stream and contains a RGB color and transparency value for each pixel, as well as additional information which encodes knowledge of the geometry each fragment represents. This allows a bitmap to be applied to the fragments in the next stage for photorealistic rendering.

The post-processing stage is the last step in the rendering process, where the incoming fragment stream is transformed into the final bitmap before being sent to the display device. Optionally, user-supplied bitmaps are loaded, transformed and applied to individual fragments in this stage. Special operations to control the appearance of individual pixels on the screen can also be applied before the final bitmap is rendered. Bitmaps are sometimes called a pixel stream when referring to the output of the post-processing stage.

The geometric transformation and post-processing stages are user programmable. Data passing through either of these stages can therefore be manipulated by default system instructions (fixed function mode) or by loading a custom program. More than one program can also be chained together sequentially to perform a series of operations on the data list. Both programmable stages are capable of looping as well as writing their output to graphics memory to be re-used in a subsequent pass.

The memory model of the GPU differs considerably from that of the CPU. Important memory access restrictions within the individual stages of the pipeline, which are designed to improve the performance of graphics applications, can impact the utility of the GPU in general purpose applications. When sequentially processing the elements of a list, the GPU employs two memory models called scatter and gather. Given an input and output list, a scatter operation allows the result of a calculation on an input element to be mapped to an arbitrary location in the output list. A gather operation, on the other hand, allows the result of a calculation on an input element to be mapped only to its corresponding location in the output list. Both the scatter and gather operations can read an arbitrary memory location in the input data for use in a calculation. The geometric transformation stage allows both scatter and gather operations, while the post-processing stage allows only gather operations. These memory restrictions are consistent with the use of the GPU in graphics applications but may pose a challenge when porting general purpose calculations to graphics hardware.
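The distinction can be made concrete with a small array example (illustrative only; Python with NumPy is assumed):

```python
import numpy as np

src = np.array([10.0, 20.0, 30.0, 40.0])
idx = np.array([2, 0, 3, 1])

# Gather: output location i is fixed, but it may read src anywhere.
# This is the only pattern the post-processing stage allows.
gathered = src[idx]                # out[i] = src[idx[i]]

# Scatter: the result computed for element i is written to an arbitrary
# output location, available in the geometry stage but not to fragments.
scattered = np.empty_like(src)
scattered[idx] = src               # out[idx[i]] = src[i]

print(gathered)   # [30. 10. 40. 20.]
print(scattered)  # [20. 40. 10. 30.]
```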

The GPU architecture contains several computational blocks or stages that embody the conceptual rendering pipeline (see Figure 1) described previously [28, 29, 30]. Next, we describe each stage of the pipeline in detail and also discuss how we use the GPU in our particle simulation.


Figure 2: Calculate z = dx/dt + y using the post-processing stage; for example, Z(1,2) = (X(1,2) − X(1,1))/δt + Y(1,1) for one cell of the N × N grids. The green squares in the figure are inputs to the program and the orange squares are the outputs.

3.1 Geometric Transformation (Vertex Processors)

The vertex processors transform vertices in the incoming vertex list appropriately to align them with the coordinates of the viewing plane. The processors in this stage support the Single Instruction Multiple Data (SIMD) and the Multiple Instruction Multiple Data (MIMD) computational models and are capable of handling branching and looping. However, they are not heavily used for large calculations as they have fewer Processing Elements (PE) compared with the post-processing stage. In some situations the geometry that is defined in this stage can be designed cleverly so as to leverage the capabilities of the rasterization stage.

The graphics card is able to handle N² particles, arranged in a square grid, at one time. In this stage, we create a single square polygon of N × N pixels that occupies the entire screen. Each pixel in this geometry represents one freely diffusing particle in the simulation. The size of this polygon is chosen so that the rasterization stage (discussed below) will perform no interpolation and the PEs in the post-processing stage perform most of the calculations.

3.2 Rasterization

The transformed vertex list is first mapped onto a surface in screen coordinates and the regions between connected vertices are filled in, starting with polygons that are farthest from the viewpoint. As described previously, attributes associated with each vertex are interpolated, and polygons and pixels that lie outside the viewing volume are discarded before the results are output as a fragment stream. Since the rasterization step can be used to interpolate any user-defined attribute efficiently, we can use this capability in a calculation by defining an appropriate geometry and attributes in the previous step. The resulting fragment stream will then contain interpolated values of all the defined attributes. Sumanaweera and Liu [31] exploit this functionality to perform Fast Fourier Transforms on large image data sets. Interpolations can also be performed in the post-processing stage, but since the rasterizer is designed specifically to perform this operation, utilizing this feature can result in significant performance benefits.

In our simulation the rasterization stage converts the square polygon we defined previously into a 2-D fragment stream with dimensions N × N for use in the next stage of the pipeline.

3.3 Post-Processing (Fragment Processors)

The post-processing stage transforms an incoming fragment stream to the final bitmap that is output to the display device and is the most useful stage in the pipeline for general purpose simulations. The program (user-defined or default system instructions) that manipulates the data can access any attributes that are embedded in the fragment stream as well as additional user-defined bitmaps (2-D data arrays) to calculate the final output. Each location of the user-defined bitmap contains a vector with RGB or RGBA values where each color component can be configured to occupy 16, 24 or 32 bits. The PEs in this stage have the ability to gather information from several sources such as user-defined bitmaps or interpolated attributes embedded within the fragment stream, but they cannot scatter the output of any calculation. Therefore, in the post-processing stage, each processed fragment corresponds to a pixel in the output bitmap.

Figure 3: Three user-defined programs chained together to implement the velocity Verlet integrator in the post-processing stage of the GPU.

A very useful extension to the graphics instruction set allows the output of the fragment processors to be retained on the graphics card instead of being written to the display device. This provides two important advantages: a) it allows the result of a calculation to be returned to the calling program and b) it allows multiple passes or iterations over a data stream using the output of one pass as input to the next. As with any parallel processing hardware, it is advantageous to have an application that iterates multiple times over a data set before it needs to be refreshed or return results. Figure 2 shows an example operation in the post-processing stage. In this example, the gradient calculated using adjacent cells in bitmap X is offset by a constant from bitmap Y to generate the final output. The output can be written to a third bitmap, Z, as in this example, or can be output to the display device. Bitmap Z is then either stored and used as input in future iterations or returned to the calling CPU program.

The parallel architecture of this GPU stage allows us to simultaneously advance several particles through one simulation time-step δt. The bulk of the free particle simulation runs in this stage of the pipeline. Several bitmaps are created to hold the random deviates, position, velocity and acceleration information for N² particles. The incoming fragment stream and bitmaps are combined using three distinct steps that implement Equations 3, 4 and 5. Figure 3 shows the layout of this program in the fragment processors, which produces the final acceleration, velocity and position for each δt. The dark boxes on the left of the figure represent the input textures which hold the initial position, velocity and accelerations. The small boxes on the bottom of the figure are the input constants used in the calculations. The three processing stages combine the inputs to produce the final results shown on the right of the figure. Step 1 calculates the final acceleration at the end of the time-step using the input acceleration, pre-computed random deviates and the scalars for particle mass m, viscous drag γ and time-step δt. Similarly, steps 2 and 3 use the appropriate input bitmaps and scalars shown in Figure 3 to calculate the final positions and velocities for each of the N² particles.


Table 1: Simulation Parameters (SI Units)

Sphere Material: Glass          Fluid Medium: Water          Fluid Temperature (T): 293 K
Number of Particles (N): 900    Data-points per trajectory (n): 10,000

   r        m           m/γ          δt (Simulation Time = δt × n × 100*)
                                     0.005τ            0.01τ             0.05τ
   50 nm    1.3614 fg   1.442 ns     5 ps (5 µs)       10 ps (10 µs)     50 ps (50 µs)
   500 nm   1.3614 pg   144.2 ns     500 ps (500 µs)   1 ns (1 ms)       5 ns (5 ms)
   5 µm     1.3614 ng   14.42 µs     50 ns (50 ms)     100 ns (100 ms)   500 ns (500 ms)

* Since only every 100th point is recorded, the total simulation time is 100 times longer.

A 128-bit representation is used to store the X, Y and Z components of a particle's velocity, position or acceleration within the RGB components of a single pixel of the bitmap, while the alpha component is left empty. The PEs in this stage simultaneously process the four color elements and, depending on the specification of the graphics card, four to sixteen pixels (and therefore 3-D particles) in one clock cycle. At the end of each simulation pass, the N² particle simulation is advanced by one time-step and the output is written to a bitmap held in the graphics memory. The simulation is allowed to run recursively for several passes before the results are read back to the CPU for storage. This strategy allows us to run a tight simulation loop, with a physically relevant time-step, while sampling the simulation output at longer intervals that are consistent with laboratory experiments. In the case of the GPU simulation, we also get the additional benefit of reducing the overall data communication time, thereby improving performance. At the end of the simulation, a time-series of position, velocity and acceleration for each of the N² particles is obtained for analysis.
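The structure of this strategy can be sketched as follows (a schematic stand-in, not the actual shader code; Python with NumPy is assumed and the inner update is a placeholder for the three Verlet passes described above):

```python
import numpy as np

N = 30          # N x N texture, i.e. N**2 = 900 particles
INNER = 100     # passes run back-to-back on the card before a read-back
SAMPLES = 100   # recorded states (10,000 in the full experiments)

# One N x N 'bitmap' per state variable: R, G, B channels hold X, Y, Z and
# the alpha channel is left empty, mirroring the 128-bit pixel layout.
pos = np.zeros((N, N, 4), dtype=np.float32)
vel = np.zeros((N, N, 4), dtype=np.float32)
rng = np.random.default_rng(0)

trajectory = []
for _ in range(SAMPLES):
    for _ in range(INNER):   # tight loop; each output bitmap feeds the next pass
        vel[..., :3] += np.float32(1e-3) * rng.standard_normal(
            (N, N, 3)).astype(np.float32)      # placeholder for the Verlet passes
        pos[..., :3] += vel[..., :3]
    trajectory.append(pos.copy())              # read back every 100th step

print(len(trajectory), trajectory[0].shape)    # 100 (30, 30, 4)
```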

4 Results and Discussion

The simulation returns the position, velocity and acceleration for each particle at regular time intervals, similar to the data obtained from a laboratory experiment when tracking an ensemble of freely diffusing particles. Several simulation experiments were set up to test the physical accuracy of the CPU and GPU implementations and also to measure the speedup achieved when using the parallel GPU architecture. In each experiment the particles were assumed to be spherical, made of glass and suspended in a water bath at room temperature (293 K) in the absence of any external potential field, such as gravity. The simulation was then run using three representative particle radii of 50 nm, 500 nm and 5 µm.

The characteristic time-scale of the simulation model is given by the relaxation time m/γ, which is a function of the particle radius r and the friction coefficient γ from Stokes' law (Equation 2). Choosing a time-step that is much smaller than the characteristic time-scale allows us to capture interesting non-equilibrium behavior that occurs at time-scales shorter than the relaxation time. For the sake of convenience in determining the time-step, we pick a number τ, which is the closest multiple of 10 smaller than m/γ. We then pick the simulation time-step to be some fraction of τ. Figure 4 shows a plot of the normalized mean energy as a function of time-step δt for an ensemble of 30 particles with radius 500 nm. This small ensemble of 30 particles is only used to determine suitable parameters for the full simulation with 900 particles and 10⁶ time-steps. We find that the relative error in the normalized energy increases for longer time-steps, approaching 60 % for a time-step that is approximately one τ. For 30 particles and time-steps shorter than 0.1τ, the relative error in the normalized energy is better than 5 %. As we shall see later in this section, this error reduces to below 1 % for the larger ensemble. We also find that the relative error does not decrease significantly as the size of δt is reduced further. The large error bars (which represent one standard deviation) in the plot decrease approximately by √3 for an ensemble of 100 particles, as may be expected, and are significantly smaller for the full simulation ensemble as the distribution of mean energies approaches a normal distribution.


Figure 4: Log-linear plot of normalized energy (E_Simulated/E_Theoretical) and normalized diffusion constant (D_Simulated/D_Theoretical) against time-step (δt) for an ensemble of 30 glass particles with radius 500 nm suspended in a water bath at 293 K. The data in this plot is used to pick the parameters for the full simulation with 900 particles.

In order to conserve energy and accurately estimate the equilibrium ensemble energy of the system, the change in kinetic energy, and hence the change in velocity per time-step, must be small. This condition imposes an upper limit on the size of the time-step that can be used in our simulation. Equation 2, after ignoring the stochastic force, gives the largest viable time-step as a function of the particle size and dynamical friction to be δt ≪ m/γ (the relaxation time). For a given particle radius and finite simulation time, shorter time-steps require a larger number of iterations, resulting in simulations that take longer to complete. Furthermore, as we reduce the size of the time-step, we expect the Brownian motion model to fail for time-steps shorter than approximately one attosecond, which is a few orders of magnitude larger than the mean collision time given by Chandrasekhar [16]. However, on the GPU the shortest time-step (longer simulation runs) may be limited by the accumulation of large roundoff errors, as we will see later in this section. We pick δt close to the largest allowable value where the error in the equilibrium averages is small. Therefore, we repeat the simulation for each particle radius with three representative time-step values: 0.005τ, 0.01τ and 0.05τ. Table 1 gives the simulation parameters for the nine distinct simulation cases.

Each of the nine simulation experiments was carried out with 900 independent particles using a three-dimensional Cartesian coordinate system. The simulation algorithm was run for a total of 1,000,000 time-steps, but only every 100th data point was recorded, resulting in a total of 10,000 data points per particle trajectory. Although the simulation experiments were run for a fixed number of iterations, the choice of time-step ensures that each simulation run is longer than several hundred relaxation times, thereby giving the system sufficient time to equilibrate. All particles in the simulation were started with initial positions at the origin of the coordinate system and with velocities sampled from an appropriate thermal distribution. Each experiment was run on the CPU (using single, CPU32, and double, CPU64, precision floating point representation) and on the GPU (using single precision, GPU32, floating point representation) to give matrices of position R_ij, velocity V_ij and acceleration A_ij. The rows of the matrix, i, give the particle number and the columns, j, give the index of the time-step, with the X, Y and Z components forming each element. However, since there is no coupling in the X, Y and Z axes for a freely diffusing particle, we can analyze the data as 3N one-dimensional trajectories. Therefore, for 900 particles in three dimensions we get a 2700 × 10,000 matrix.
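The corresponding data manipulation is straightforward (an illustrative sketch with randomly generated stand-in data; Python with NumPy is assumed):

```python
import numpy as np

n_particles, n_samples = 900, 1_000   # 10,000 samples in the actual runs
# Stand-in for the recorded positions: particle i, sample j, (X, Y, Z)
R = np.random.standard_normal((n_particles, n_samples, 3))

# X, Y and Z are uncoupled for a free particle, so fold the components into
# extra rows: 900 particles x 3 axes -> 2700 one-dimensional trajectories.
R1d = np.moveaxis(R, 2, 1).reshape(3 * n_particles, n_samples)
print(R1d.shape)   # (2700, 1000)
```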

The physical accuracy of the recorded data was verified against quantities such as the thermal energy and the diffusion constant obtained from the theory of Brownian motion. The effects of time-step, particle radius, round-off error and statistical error on the accuracy of the simulation results are then discussed. The time required for the simulation to complete was also recorded for the CPU32, CPU64 and GPU32 implementations. This information was subsequently used to calculate the speedup obtained by running the simulation on the GPU. Finally, the scaling of speedup as a function of the number of particles is discussed.


4.1 Physical Validation

4.1.1 Energy Conservation

In order to verify energy conservation, we i) calculate the mean kinetic energy, the energy fluctuations as well as the statistical repeatability for an ensemble of particles and then compare them with theoretical values, and ii) verify that on average the ensemble of particles does not gain or lose energy over time. Using the velocity data output by the simulation, we calculate the mean energy of all the particles at each time-step (ensemble average energy). The mean energy must agree with the thermal energy (1/2)K_B T given by the equipartition theorem for one degree of freedom. Table 2 shows the normalized mean energy for each simulation case and also compares the CPU and GPU implementations. In order to satisfy energy conservation, the net change in the mean energy of the simulation must be statistically consistent with zero.
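A sketch of this test on stand-in data (illustrative only; Python with NumPy is assumed, with thermal velocities generated in place of actual simulation output) is:

```python
import numpy as np

KB, T = 1.380649e-23, 293.0
m = 1.3614e-15                  # 500 nm glass sphere mass from Table 1 (kg)

# Stand-in for the simulation output: one-axis velocities drawn from the
# thermal distribution; v[i, j] is particle i at recorded step j.
rng = np.random.default_rng(1)
v = rng.normal(0.0, np.sqrt(KB * T / m), size=(900, 1_000))

E_ens = 0.5 * m * np.mean(v**2, axis=0)   # ensemble average energy per step
E_th = 0.5 * KB * T                       # equipartition, one degree of freedom
print(np.mean(E_ens) / E_th)              # normalized energy, ~1.0
print(np.polyfit(np.arange(E_ens.size), E_ens, 1)[0])  # drift, ~0 if conserved
```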

4.1.2 Diffusion Constant

The mean squared displacement of a freely diffusing particle in one dimension is given by the Einstein-Smoluchowski relation [20] shown in Equation 6, where K_B T/γ is the diffusion constant, D. Using the position data output by the simulation, we calculate the variance in particle positions at each time-step. This gives the mean-squared displacement as a function of time. Using a linear least squares estimator, we then fit a line to this function; the slope of this line fit must equal the scaling constant 2D from Equation 6. Table 3 shows the normalized diffusion constant for each simulation case.

$$\langle R^2 \rangle = 2\,\frac{K_B T}{\gamma}\,t \qquad (6)$$
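The fitting procedure can be sketched as follows (illustrative only; Python with NumPy is assumed, with synthetic random walks standing in for the recorded trajectories):

```python
import numpy as np

KB, T, ETA, r = 1.380649e-23, 293.0, 1.002e-3, 500e-9
gamma = 6.0 * np.pi * ETA * r
D_th = KB * T / gamma                        # Einstein-Smoluchowski D

# Stand-in trajectories: discrete random walks with diffusion constant D_th.
rng = np.random.default_rng(2)
dt = 1e-6                                    # sampling interval (s), assumed
steps = rng.normal(0.0, np.sqrt(2.0 * D_th * dt), size=(2700, 1_000))
R = np.cumsum(steps, axis=1)

msd = np.var(R, axis=0)                      # mean-squared displacement <R^2>
t = dt * np.arange(1, msd.size + 1)
slope = np.polyfit(t, msd, 1)[0]             # least-squares line through the MSD
print(slope / (2.0 * D_th))                  # normalized D, ~1.0
```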

Table 2 gives the average energy (E), standard deviation (σ), standard error (SE) and normalized energy, E_Norm = E_Simulated/E_Theoretical, calculated for varying particle size and time-step, and for when the simulation is run on the CPU32, CPU64 and GPU32 platforms. The relative error in the equilibrium energy is better than 0.5 % when δt = 0.005τ and δt = 0.01τ and reaches a maximum of 1.77 % for δt = 0.05τ on the GPU. The error in the calculated mean energy of the simulation is always better than the 2.5 % reported in the literature [23, 24] for the velocity Verlet integration algorithm. The magnitude of the calculated error increases with the size of the time-step, as expected, and it is also observed that particle size has no systematic effect on the equilibrium average energy. The difference between the normalized energy calculated using the CPU32/CPU64 generated data and the GPU32 generated data is small for each simulated case. Energy conservation for the CPU and GPU implementations is found to be excellent and the energy trajectory, E, in each case shows no appreciable gain or loss of energy over time.

Table 3 shows the average diffusion constant (D), σ, SE and the normalized diffusion constant, D_Norm = D_Simulated/D_Theoretical, for each simulation case. The columns show the data for each implementation and the three representative particle sizes, and the rows show the normalized diffusion constant for different time-steps. The effect of particle size and time-step on the error in the calculated value of the diffusion constant, D_Norm, is small. The calculated diffusion constant is found to be on average within 1.4 % of the theoretical value for the CPU32 and CPU64 implementations, with maximum errors of 4.2 % and 2.4 % respectively. However, the average error in the diffusion constant for the GPU32 runs was found to be 4.3 %, with a maximum error of 8 %. The small increase in error for the GPU32 implementation is caused by rounding errors.

Equation 6 gives the average displacement of an ensemble of particles as a function of time, and for each of our simulation experiments the final displacement is less than one particle diameter. This is seen more clearly in Figure 5, which shows a log-log plot of the average particle displacement (from Equation 6) over diameter for a finite-time simulation with time-step δt = 0.01τ after 1,000,000 time-steps. The gray region shows the effect of a ±10 % error in the diffusion constant. We see that the average excursion of a particle is only a small fraction of its diameter over the course of the entire simulation and consequently a 10 % deviation in the diffusion constant results in an error of less than 1 nm.


Table 2: Ensemble Average Energy
(E and σ in units of 10⁻²¹ J; SE in units of 10⁻²⁵ J)

                     δt = 0.005τ                      δt = 0.01τ                       δt = 0.05τ
           E       σ       SE      E_Norm    E       σ       SE      E_Norm    E       σ       SE      E_Norm
CPU32
  50 nm    2.0247  2.8636  5.5110  1.0010    2.0300  2.8712  5.5256  1.0036    2.0584  2.9117  5.6035  1.0177
  500 nm   2.0264  2.8649  5.5135  1.0019    2.0308  2.8728  5.5288  1.0040    2.0578  2.9101  5.6005  1.0174
  5 µm     2.0259  2.8664  5.5164  1.0016    2.0299  2.8719  5.5269  1.0036    2.0572  2.9105  5.6013  1.0171
GPU32
  50 nm    2.0261  2.8653  5.5142  1.0017    2.0285  2.8691  5.5216  1.0029    2.0585  2.9105  5.6012  1.0177
  500 nm   2.0252  2.8640  5.5118  1.0013    2.0294  2.8694  5.5222  1.0033    2.0572  2.9088  5.5979  1.0171
  5 µm     2.0246  2.8641  5.5120  1.0010    2.0311  2.8724  5.5280  1.0042    2.0585  2.9112  5.6025  1.0177
CPU64
  50 nm    2.0253  2.8634  5.5107  1.0013    2.0288  2.8680  5.5195  1.0030    2.0585  2.9105  5.6012  1.0177
  500 nm   2.0264  2.8655  5.5146  1.0019    2.0296  2.8704  5.5241  1.0034    2.0577  2.9094  5.5991  1.0173
  5 µm     2.0263  2.8650  5.5137  1.0018    2.0297  2.8698  5.5230  1.0035    2.0578  2.9099  5.6001  1.0174

E_Theoretical = 2.0227 × 10⁻²¹ J at T = 293 K

Table 3: Diffusion Constant
(D_Th and D in units of 10⁻¹³ m²/sec; σ in units of 10⁻¹⁴ m²/sec; Standard Error (SE) = σ × 10⁻² m²/sec for all cases)

                       δt = 0.005τ               δt = 0.01τ                δt = 0.05τ
           D_Th      D        σ       D_Norm   D        σ       D_Norm   D        σ       D_Norm
CPU32
  50 nm    42.8363   42.6539  4.5405  0.9957   42.2938  7.9684  0.9873   42.5465  5.2338  0.9932
  500 nm   4.2836    4.4637   0.4792  1.0420   4.2871   0.4093  1.0008   0.4464   0.4792  1.0420
  5 µm     0.4284    0.4295   0.0568  1.0027   0.4248   0.0781  0.9917   4.3094   0.0414  1.0060
GPU32
  50 nm    42.8363   40.9561  9.0678  0.9561   42.5527  4.9765  0.9934   43.7657  3.6273  1.0217
  500 nm   4.2836    4.0610   0.6134  0.9480   3.9420   0.7599  0.9202   4.0610   0.6134  0.9480
  5 µm     0.4284    0.3981   0.0121  0.9294   0.4070   0.0347  0.9501   0.4247   0.0734  0.9916
CPU64
  50 nm    42.8363   42.2823  7.4140  0.9871   43.8533  4.4342  1.0237   43.2453  4.3501  1.0095
  500 nm   4.2836    4.2563   0.4649  0.9936   4.3589   0.5775  1.0176   4.2563   0.4649  0.9936
  5 µm     0.4284    0.4371   0.0844  1.0204   0.4368   0.0631  1.0196   0.4234   0.0350  0.9884


Figure 5: Log-log plot of the average particle displacement after 1,000,000 time-steps at δt = 0.01τ as a function of particle diameter, from Equation 6. The gray region shows the effect of ±10 % error in the diffusion constant. Smaller particles exhibit larger excursions from their initial positions as a function of particle diameter; however, the average displacement is no larger than d/10 for the finite-time simulation.

In our experiments, the simulation is typically run for a total time less than a few milliseconds. However, real-time nanoassembly tasks will require the simulation to be run for several minutes, in which case the additive roundoff error will increase. In a nanoassembly workspace, the exact positions and trajectories of individual particles are less important than overall system properties such as energy conservation. Moreover, when a particle is confined in an optical trap prior to assembly, the effect of random fluctuations is even less pronounced.

4.1.3 Analysis of GPU Rounding Error

The accuracy of results from the particle simulation depends mainly on errors introduced by the simulation algorithm, the random number generator and roundoff error. In our simulations, algorithmic and random number errors exhibit similar behavior, since the same algorithm is used across the CPU and GPU implementations and the random numbers are always generated on the CPU in double precision. Therefore the difference in simulation results on the GPU and CPU arises primarily from varying rounding errors on each architecture. In order to investigate the GPU rounding error relative to the CPU, we generate a position trajectory for a single 500 nm particle on the CPU (32 and 64-bit) and on the GPU using the same list of random deviates and simulation conditions. We would therefore expect the trajectories generated on the CPU and the GPU to be identical; however, this is not the case. Using the CPU64 trajectory as a reference, we plot the relative error in the CPU32 and GPU32 trajectories for the three separate time-steps in Figures 6(a) and 6(b). The relative errors on the CPU and GPU are of the same magnitude and increase as expected when larger time-steps are used to generate the trajectories. The differences in the two plots can be explained by differences in the method in which the CPU and GPU store the result of floating point operations. The CPU, which follows the IEEE-754 floating point convention, always performs exact rounding in all its operations, while the GPU, depending on the vendor, performs exact rounding in some operations and chops the result of other operations. This can lead to fluctuations in the relative errors by as much as a factor of 2 [32, 33]. However, this error is acceptable since we are more interested in the long term statistical averages, where differences between individual particle trajectories do not severely affect average behavior. This is seen previously in the calculation of the mean energy and the diffusion constant, which have a small error relative to their theoretical values despite deviations in individual trajectories. The error in individual trajectories, and consequently the error in the average diffusion constant, can increase significantly when the particle diffuses far away from zero, due to a loss of precision when adding numbers of significantly different magnitudes in Equation 5. Through several experiments it is found that the onset of this error does not occur until the particle has diffused over 1.5 µm from the starting point. This is not a problem since the simulation is never run long enough to trigger this condition. Also, the problem is easily avoided in longer simulations by shifting the particle back and recording the offset when it diffuses by a certain amount.
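A minimal sketch of this re-centering trick (our illustration, not the paper's implementation; Python with NumPy and a 1.5 µm threshold are assumed) is:

```python
import numpy as np

SHIFT = 1.5e-6   # re-centre once a particle has diffused ~1.5 um (m)

def recentre(pos, offset):
    """Shift far-wandering coordinates back to zero and fold the shift into a
    64-bit offset, so the 32-bit working positions stay small while
    pos + offset still reconstructs the true trajectory."""
    far = np.abs(pos) > SHIFT
    offset[far] += pos[far].astype(np.float64)   # record the displacement
    pos[far] = np.float32(0.0)                   # keep working values near zero
    return pos, offset

pos = np.array([2.0e-6, 3.0e-7, -1.6e-6], dtype=np.float32)
offset = np.zeros(3, dtype=np.float64)
pos, offset = recentre(pos, offset)
print(pos, offset, pos.astype(np.float64) + offset)
```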

Figure 6: Relative error observed by subtracting a reference CPU64 position trajectory for a single 500 nm particle from those of the CPU32 and GPU32 implementations. All trajectories are generated using the same list of random deviates (δt = 0.005τ, 0.01τ and 0.05τ). Panel (a) shows the error for the GPU32 case for the three separate time-step values and panel (b) shows the error for the CPU32 case.

4.2 Speedup

The total simulation time is divided into the computation time, the communication time (the time spent moving data to and from memory to the on-chip cache) and the file I/O time (the time spent writing the results to disk for analysis). The speedup of the GPU over the CPU is then calculated as the ratio of the sum of the computation and communication times for the entire simulation run on each platform. For both the CPU and GPU implementations, the random deviates are pre-calculated for use in the simulation. The random deviates can be calculated on the GPU, but this may not provide any additional performance gains. Speedup was calculated for an ensemble of particles with radius 500 nm and a time-step δt = 0.05τ. The experiments were repeated for ensemble sizes of 100, 256, 400, 625, 900, 1024 and 1600 particles. For each ensemble size, the simulation was run for a constant time and the output sampled every 10, 100 and 500 steps. All experiments were run on a Dell desktop with a dual core Intel Pentium D processor with a clock speed of 2.8 GHz, 1 GB DDR2 RAM and an NVIDIA GeForce 7950GT PCI Express graphics card with 512 MB GDDR3 Video RAM, running the OpenSuSe 10.2 Linux operating system∗. Figure 7 shows the speedup obtained by the GPU over a single-precision CPU implementation. The three curves in the figure show the speedup when the output is sampled every 10, 100 and 500 time-steps. For ensembles with fewer than 256 particles, the CPU is faster because the time required to transfer data to the GPU memory is large compared to the computation time. As the ensemble size grows, the computation and communication times required by the GPU to run one simulation experiment remain constant, while the computation time required by the CPU to run the same experiment grows steadily. This results in large performance gains by the GPU. We also see some performance benefits when the simulation is allowed to loop on the GPU for several cycles before the data is recorded. Figure 7 plots the speedup against the ensemble size in three cases where the output is sampled every 10δt, 100δt and 500δt. We observe that sampling the output every 100 time-steps yields better performance than the 10δt case, but increasing the sampling interval to 500δt does not significantly change the observed speedup. This is due to the fact that the computation times on the GPU and CPU scale linearly as the sampling interval increases beyond 100δt, and therefore the observed speedup remains roughly constant.

∗ Certain commercial entities, equipment, or materials may be identified in this document in order to describe an experimental procedure or concept adequately. Such identification is not intended to imply recommendation or endorsement by the National Institute of Standards and Technology, nor is it intended to imply that the entities, materials, or equipment are necessarily the best available for the purpose.
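The timing procedure can be sketched as follows (a toy harness, not the paper's benchmark code; Python with NumPy is assumed and the update inside the loop is a stand-in):

```python
import time
import numpy as np

def run(n_particles, n_steps, dtype=np.float32):
    """Toy stand-in for one timed run; only the timing pattern matters."""
    state = np.zeros((n_particles, 3), dtype=dtype)
    t0 = time.perf_counter()
    for _ in range(n_steps):
        state += np.random.standard_normal(state.shape).astype(dtype)
    return time.perf_counter() - t0

for n in (100, 256, 400, 625, 900, 1024, 1600):
    t_cpu32 = run(n, 200)
    # A GPU run would be timed the same way (compute plus transfer time);
    # the plotted speedup is then t_cpu32 / t_gpu32.
    print(f"{n:5d} particles: CPU32 time {t_cpu32 * 1e3:7.2f} ms")
```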


Figure 7: Speedup (calculated as the ratio of the single-precision CPU simulation time to the GPU simulation time) as a function of the ensemble size. The three curves show the results when the output was sampled after every 10, 100 and 500 simulation time-steps (δt).

4.3 The GPU as a Coprocessor in Stochastic Simulations

The GPU, when used to run a free particle simulation, yields significant speedup, especially with large particle ensembles. However, the benefits of using the GPU can be leveraged for more general simulations of the optical tweezers system as well as other stochastic systems. Most simulations of micro and nanoscale assembly tasks include an energy potential, where the force exerted by the potential on individual particles is calculated periodically. These forces are often pre-computed and stored in a parametric form on the CPU [34]. When running the stochastic simulation on the GPU, the number of times the simulation loops on the GPU is tuned to closely match the timescale of the problem. For example, in the case of the optical tweezers simulation, we allow the simulation to loop for 100 time-steps on the GPU before we read data back to the CPU. This corresponds to a time less than one relaxation time (the shortest time-scale in the problem) and at this time-scale the particle undergoes very little movement. The optical trapping force on the particle is then assumed to be constant and updated once every 100 time-steps to yield physically accurate results. This same principle is applicable for all stochastic simulations that use a fluidic assembly cell and some trapping potential.
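Schematically, this coupling looks as follows (an illustration under assumed parameters, with a harmonic force standing in for the pre-computed trap model; Python with NumPy is assumed):

```python
import numpy as np

def trap_force(pos, k=1.0e-6):
    """Stand-in for a pre-computed optical-trap force model (harmonic here)."""
    return -k * pos

rng = np.random.default_rng(0)
pos = np.zeros((900, 3))
vel = np.zeros_like(pos)
for outer in range(100):            # read-back boundary on the CPU side
    F = trap_force(pos)             # trap force refreshed once per outer loop
    for _ in range(100):            # inner loop that would stay on the GPU;
        vel += 1e-3 * (F + rng.standard_normal(pos.shape))
        pos += 1e-3 * vel           # F is held constant for all 100 steps
```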

5 Conclusions and Future Work

A versatile tool has been developed to speed up a wide class of stochastic simulations for the assembly of micro and nanoscale components in fluid. Several available choices to develop stochastic simulations on the GPU were critically examined and a simulation for freely diffusing particles has been successfully implemented on highly parallel graphics hardware. The accuracy of the results from the GPU implementation was then compared with those from the CPU. The errors in the mean energy of the ensemble and the diffusion constant were found to be acceptable in each simulation experiment. It was observed that particle size had a negligible effect on the results, but the errors in the ensemble averages increased with larger time-steps, as expected. The speedup scaled with the size of the ensemble and the GPU obtained a speedup of almost 8 in comparison with the single precision CPU implementation for an ensemble of 1600 particles (although the card used in the simulation experiments is capable of simulating over 10 million particles at once). We conclude, therefore, that significant computational speedup can be realized by using the GPU for stochastic simulation models with the desired accuracy.

In future work, models to utilize the more efficient unified shader architecture as well as multiple graphics cards to run several simultaneous simulation experiments will be explored. This will allow us to rapidly collect statistical data for assembly techniques and validate our assembly algorithms. These models will then be compared with results from trapping experiments performed on the optical tweezers. We also plan to estimate and track roundoff errors on the GPU at each iteration and explore methods to eliminate this error wherever possible.

References

[1] D. Göddeke, R. Strzodka, and S. Turek. Accelerating double precision FEM simulations with GPUs. Proceedings of the 18th Symposium on Simulation Technique (ASIM 2005), pages 139–144, 2005.

[2] Technical brief: NVIDIA GeForce 8800 GPU architecture overview. 2006.

[3] S. Spitz, A. Spyridi, and A. Requicha. Accessibility analysis for planning of dimensional inspection with coordinate measuring machines. IEEE Transactions on Robotics and Automation, 15(4):714–727, 1999.

[4] M. Foskey, M. C. Lin, and D. Manocha. Efficient computation of a simplified medial axis. Journal of Computing and Information Science in Engineering, 3(4):274–284, 2003.

[5] A. Priyadarshi and S. Gupta. Finding mold-piece regions using computer graphics hardware. Geometric Modeling and Processing Conference, Pittsburgh, PA, 2006.

[6] R. Khardekar, G. Burton, and S. McMains. Finding feasible mold parting directions using graphics hardware. Computer-Aided Design, 38(4):327–341, 2006.

[7] P. Kipfer, M. Segal, and R. Westermann. UberFlow: A GPU-based particle engine. HWWS '04: Proceedings of the ACM SIGGRAPH/EUROGRAPHICS Conference on Graphics Hardware, pages 115–122, 2004.

[8] A. Kolb and N. Cuntz. Dynamic particle coupling for GPU-based fluid simulation. Proceedings of the 18th Symposium on Simulation Technique, pages 722–727, 2005.

[9] A. Ashkin. Forces of a single-beam gradient laser trap on a dielectric sphere in the ray optics regime. Biophysical Journal, 61(2):569–582, 1992.

[10] E. Fallman and O. Axner. Design for fully steerable dual-trap optical tweezers. Applied Optics, 36:2107–2113, 1997.

[11] M. D. Wang, M. J. Schnitzer, H. Yin, R. Landick, J. Gelles, and S. M. Block. Force and velocity measured for single molecules of RNA polymerase. Science, 282(5390):902–907, 1998.

[12] J. T. Finer, R. M. Simmons, and J. A. Spudich. Single myosin molecule mechanics: piconewton forces and nanometre steps. Nature, 368(6467):113–119, 1994.

[13] A. Balijepalli, T. W. LeBrun, C. Gagnon, Y. Lee, and N. Dagalakis. A modular architecture for agile assembly of nano-components using optical tweezers. In Optical Information Systems III, Proceedings of SPIE, 5908:59080H, 2005.

[14] D. Lee, T. W. LeBrun, A. Balijepalli, J. J. Gorman, C. Gagnon, D. Hong, and E. H. Chang. Development of multiple beam optical tweezers. Spring Conference of the Korean Society of Precision Engineering (KSPE), pages 1501–1506, 2005.

[15] P. Langevin. On the theory of Brownian motion ("Sur la théorie du mouvement brownien"). C. R. Acad. Sci. (Paris), 146:530–533, 1908.

[16] S. Chandrasekhar. Stochastic problems in physics and astronomy. Reviews of Modern Physics, 15(1):1–84, 1943.

[17] M. Weissbluth. Photon-Atom Interactions. Academic Press, Boston, MA, 1989.

[18] R. Kubo. The fluctuation-dissipation theorem. Reports on Progress in Physics, 29(1):255–284, 1966.

[19] P. Grassia. Dissipation, fluctuations, and conservation laws. American Journal of Physics, 69(2):113–119, 2001.

[20] D. T. Gillespie. Fluctuation and dissipation in Brownian motion. American Journal of Physics, 61(12):1077–1083, 1993.

[21] A. C. Brańka and D. M. Heyes. Algorithms for Brownian dynamics simulation. Physical Review E, 58(2):2611–2615, 1998.

[22] G. Zhang and T. Schlick. Implicit discretization schemes for Langevin dynamics. Molecular Physics, 84(6):1077–1098, 1995.

[23] H. J. C. Berendsen and W. F. van Gunsteren. Practical algorithms for dynamic simulations: molecular dynamics simulation of statistical mechanical systems. Proceedings of the Enrico Fermi Summer School, pages 43–64, 1985.

[24] M. P. Allen and D. J. Tildesley. Computer Simulation of Liquids. Clarendon Press, New York, NY, USA, 1987.

[25] L. Verlet. Computer "experiments" on classical fluids. I. Thermodynamical properties of Lennard-Jones molecules. Physical Review, 159:98–103, 1967.

[26] R. W. Pastor, B. R. Brooks, and A. Szabo. An analysis of the accuracy of Langevin and molecular dynamics algorithms. Molecular Physics, 65(6):1409–1419, 1988.

[27] W. C. Swope, H. C. Andersen, P. H. Berens, and K. R. Wilson. A computer simulation method for the calculation of equilibrium constants for the formation of physical clusters of molecules: Application to small water clusters. The Journal of Chemical Physics, 76(1):637–649, 1982.

[28] J. D. Owens, D. Luebke, N. Govindaraju, M. Harris, J. Krüger, A. E. Lefohn, and T. J. Purcell. A survey of general-purpose computation on graphics hardware. Eurographics 2005, State of the Art Reports, pages 21–51, 2005.

[29] E. Kilgariff and R. Fernando. The GeForce 6 series GPU architecture. In M. Pharr and R. Fernando, editors, GPU Gems 2: Programming Techniques for High-Performance Graphics and General-Purpose Computation, Chapter 30. Addison-Wesley Professional, 2005.

[30] I. Buck, T. Foley, D. Horn, J. Sugerman, K. Fatahalian, M. Houston, and P. Hanrahan. Brook for GPUs: Stream computing on graphics hardware. Proceedings of ACM SIGGRAPH 2004, pages 777–786, 2004.

[31] T. Sumanaweera and D. Liu. Medical image reconstruction with the FFT. In M. Pharr and R. Fernando, editors, GPU Gems 2: Programming Techniques for High-Performance Graphics and General-Purpose Computation, Chapter 48. Addison-Wesley Professional, 2005.

[32] D. Goldberg. What every computer scientist should know about floating-point arithmetic. ACM Computing Surveys, 23(1):5–48, 1991.

[33] K. Hillesland and A. Lastra. GPU floating-point paranoia. Proceedings of the ACM Workshop on General Purpose Computing on Graphics Processors, pages C–8, 2004.

[34] A. G. Banerjee, A. Balijepalli, S. K. Gupta, and T. W. LeBrun. Radial basis function based simplified trapping probability models for optical tweezers. Proceedings of the ASME 2008 International Design Engineering Technical Conferences and Computers and Information in Engineering Conference, 2008.
