advances in the solution of ns eqs. in gpgpu hardware. second order scheme and experimental...

Advances in NS solver on GPU por M.Storti et.al.

Advances in the Solution of NS Eqs. in GPGPU

Hardware

by Mario Stortiab, Santiago Costarelliab, Luciano Garellia,

Marcela Cruchagac, Ronald Ausensic, S. Idelsohnabde,

Gilles Perrotf , Raphael Couturierf

aCentro de Investigacion de Metodos Computacionales, CIMEC (CONICET-UNL)Santa Fe, Argentina, [email protected], www.cimec.org.ar/mstorti

bFacultad de Ingenierıa y Ciencias Hıdricas. UN Litoral. Santa Fe, Argentina, fich.unl.edu.ar

cDepto Ing. Mecanica, Univ Santiago de Chile

d International Center for Numerical Methods in Engineering (CIMNE)Technical University of Catalonia (UPC), www.cimne.com

e Institucio Catalana de Recerca i Estudis Avancats(ICREA), Barcelona, Spain, www.icrea.cat

f FEMTO-ST Institute, Franche-Comte Meso-Centre

CIMEC-CONICET-UNL 1((version texstuff-1.2.8-4-g7cfd85f Mon Apr 27 23:01:25 2015 -0300) (date Thu Apr 30 12:04:10 2015 -0300))

http://www.cimec.org.ar/mstorti

http://fich.unl.edu.ar

http://www.cimne.com

http://www.icrea.cat


Scientific computing on GPUs

• Graphics Processing Units (GPU’s) are specialized hardware designed todischarge computation from the CPU for intensive graphics applications.• They have many cores (thread processors), currently the Tesla K40

(Kepler GK110) has 2880 cores at 745 Mhz (boost 875 Mhz).

• The raw computing poweris in the order of Teraflops(4.3 Tflops in SP and1.43 Tflops in DP).• Memory Bandwidth

(GDDR5) 288 GiB/s.Memory size 12 GiB.• Cost: USD 3,100. Low end

version Geforce GTX Titan:USD 1,500.



Scientific computing on GPUs (cont.)

• The difference between the GPUsarchitecture and standard multicoreprocessors is that GPUs have muchmore computing units (ALU’s(Arithmetic-Logic Unit) and SFU’s(Special Function Unit), but few controlunits.• The programming model is SIMD

(Single Instruction Multiple Data).• Branch-divergence happens when a

branch condition gives true in some insome threads an false in others. Dueto the SIMD model all threads can’t beexecuted simultaneously.• GPUs compete with many-core

processors (e.g. Intel’s Xeon Phi)Knights-Corner, Xeon-Phi 60 cores).

if (COND) BODY-TRUE; else BODY-FALSE;



GPUs, MIC, and multi-core processors

• Computing power is limited by the numberof transistors we can put in a chip. This islimited in turn to the largest silicon chipyou can grow (O(400mm2)), and theminimum size of the transistors you canmake (O(22 nm)). So actually the limit is inthe order of 5× 109 transistors per chip.• So with that number of transistors you can

do one of the following. Have a few (6 to 12) very powerful

cores: Xeon E5-26XX processors.. Have 60 or more less powerful cores,

each one with 40 million transistors,similar to a Pentium processor: IntelMany Integrated Core (MIC), Xeon Phi

. Have a very large (O(3000)) simple(without control) cores: Nvidia GPUs.

Die shot of the Nvidia ”Kepler1”

GK104 GPU



Solution of incompressible Navier-Stokes flows on GPU

• GPU’s are less efficient for algorithms that require access to the card’s(device) global memory. In older cards shared memory is much faster butusually scarce (16K per thread block in the Tesla C1060) .• The best algorithms are those that make computations for one cell

requiring only information on that cell and their neighbors. Thesealgorithms are classified as cellular automata (CA).• In newer cards there is a much larger memory cache. Anyway data

locality is fundamental.• Lattice-Boltzmann and explicit F?M (FDM/FVM/FEM) fall in this category.• Structured meshes require less data to exchange between cells (e.g.

neighbor indices are computed, no stored), and so, they require lessshared memory. Also, very fast solvers like FFT-based (Fast FourierTransform) or Geometric Multigrid are available .



Fractional Step Method on structured grids with QUICK

Proposed by Molemaker et.al. SCA’08: 2008 ACM SIGGRAPH, Low viscosityflow simulations for animation.

• Fractional Step Method(a.k.a. pressuresegregation)• u, v, w and continuity

cells are staggered(MAC=Marker AndCell).• QUICK advection

scheme is used in thepredictor stage.• Poisson system is

solved with IOP(Iterated OrthogonalProjection) (to bedescribed later), on topof Geometric MultiGrid

j

j+1

j+2

i+1 i+2(ih,jh)

Vij

PijUij

x

y

i

y-momentum cell

x-momentum nodes

y-momentum nodes

continuity nodesx-momentum cell

continuity cell


http://portal.acm.org/citation.cfm?id=1632592.1632595


Solution of the Poisson with FFT

• Solution of the Poisson equation is, for large meshes, the more CPUconsuming time stage in Fractional-Step like Navier-Stokes solvers.• We have to solve a linear system Ax = b• The Discrete Fourier Transform (DFT) is an orthogonal transformation

x = Ox = fft(x).• The inverse transformation O−1 = OT is the inverse Fourier Transform

x = OT x = ifft(x).• If the operator matrix A is spatially invariant (i.e. the stencil is the same at

all grid points) and the b.c.’s are periodic, then it can be shown that Odiagonalizes A, i.e. OAO−1 = D.• So in the transformed basis the system of equations is diagonal

(OAO−1) (Ox) = (Ob),

Dx = b,(1)

• For N = 2p the Fast Fourier Transform (FFT) is an algorithm thatcomputes the DFT (and its inverse) in O(N log(N)) operations.



Solution of the Poisson with FFT (cont.)

• So the following algorithm computes the solution of the system tomachine precision in O(N log(N)) ops.

. b = fft(b), (transform r.h.s)

. x = D−1b, (solve diagonal system O(N))

. x = ifft(x), (anti-transform to get the sol. vector)• Total cost: 2 FFT’s, plus one element-by-element vector multiply (the

reciprocals of the values of the diagonal of D are precomputed)• In order to precompute the diagonal values of D,. We take any vector z and compute y = Az,. then transform z = fft(z), y = fft(y),. Djj = yj/zj .



FFT computing rates in GPGPU. GTX-580 and Titan Black

1e+5 1e+6 1e+7 1e+80

100

200

300

400

500

600580/DP580/SPTB/DPTB/SP

GTX 580

GTX Titan Black

Launch Feb 2014, Arch: Kepler, USD 1,500, 2880 cores, 5.12 Tflops SP

Launch June 2011, Arch: Fermi, USD 1,000, 772 cores, 1.5 Tflops SP

GTX 580 DP

FF

T p

roc.

rat

e [G

flo

ps/

sec]

Ncell*t

64x64x64128x64x64

128x128x64

128x128x128

256x128x128

256x256x128

256x256x256

512x256x256

GTX 580 SP

GTX TB DP

[*] Ncell=2*Nx*Ny*Nz because FFT is complex

GTX TB SP



Upper bound on computing rate for the NS solver w/FFT

• For instance for a mesh of N = 2563 = 16Mcell the rate is 4.66 Gcell/s(SP).• So that one could perform one Poisson solution (two FFTs) in 4.2ms

(doesn’t take into account the computation of the RHS, which isnegligible).• Poisson eq. is a large component of the computing time in the Fractional

Step Solver for Navier-Stokes (and the only one that is > O(N)) so thatthis puts an upper limit on the computing rate achievable with ouralgorithm (based on FFT).



Solution of NS on embedded geometries: FVM/IBM1

• When solid bodies areincluded in the fluid, wehave to deal with curvedsurfaces that in general donot fit with node planes.• The first approach is to fit

the body surface by astair-case one (a.k.a. Legosurface).• Convergence is degraded

to 1st order.• Stair-case acts like an

artificial roughness so ittends to give higher drag.

solid body

fluid



Solving Poisson with embedded bodies

• FFT solver and GMG are very fast but have several restrictions: invarianceof translation, periodic boundary conditions.• No holes (solid bodies) in the fluid domain.• So they are not well suited for direct use on embedded geometries.• One approach for the solution is the IOP (Iterated Orthogonal Projection)

algorithm.• It is based on solving iteratively the Poisson eq. on the whole domain

(fluid+solid). Solving in the whole domain is fast, because algorithms likeGeometric Multigrid or FFT can be used. Also, they are very efficientrunning on GPU’s .



The IOP (Iterated Orthogonal Projection) method

The method is based on succesively solve for the incompressibility condition(on the whole domain: solid+fluid), and impose the boundary condition.

on the wholedomain (fluid+solid)

violates impenetrability b.c. satisfies impenetrability b.c.



The IOP (Iterated Orthogonal Projection) method (cont.)

• Fixed point iteration

wk+1 = ΠbdyΠdivwk.

• Projection on the space ofdivergence-free velocity fields:

u′ = Πdiv(u)

u′ = u−∇P,

∆P = ∇ · u,

• Projection on the space of velocityfields that satisfy theimpenetrability boundary condition

u′′ = Πbdy(u′)

u′′ = ubdy, in Ωbdy,

u′′ = u′, in Ωfluid.



Accelerated Global Preconditioning (AGP) algorithm

• Storti et.al. ”A FFT preconditioning technique for the solution ofincompressible flow on GPUs.” Computers & Fluids 74 (2013): 44-57.• It shares some features with IOP scheme.• Both solvers are based on the fact that an efficient preconditioning exists.

The preco. consists in solving the Poisson problem on the global domain(fluid+solid, i.e. no “holes” in the fluid domain).• IOP is a stationary method and its limit rate of convergence doesn’t

depend on refinement.• However id does depend on (O(AR)) where AR is the maximum aspect

ratio of bodies inside the fluid. I.e. the method is bad when very slenderbodies are present.• Our proposed method AGP is a preconditioned Krylov space method.• The condition number of the preconditioned system doesn’t depend on

refinement but deteriorates with the AR of immersed bodies.• IOP iterates over both the velocity and pressure fields, whereas AGP

iterates only on the pressure vector (which is better for implementation onGPU’s).


http://dx.doi.org/10.1016/j.compfluid.2012.12.019


MOC+BFECC

• The Method of Characteristics (MOC) is a Lagrangian method. Theposition of the node (“particle”) is tracked backwards in time and thevalues at the previous time step are interpolated into that position. Theseprojections introduce dissipation.• BFECC Back and Forth Error Compensation and Correction can fix this.• Assume we have a low order (dissipative) operator (may be MOC, full

upwind, or any other) Φt+∆t = L(Φt,u).• BFECC allows to eliminate the dissipation error.. Advance forward the state Φt+∆t,∗ = L(Φt,u).. Advance backwards the state Φt,∗ = L(Φt+∆t,∗,−u).



MOC+BFECC (cont.)

exact w/dissipation error

• If L introduces some dissipative error ε, then Φt,∗ 6= Φt, in factΦt,∗ = Φt + 2ε.• So that we can compensate for the error:

Φt+∆t = L(Φt,∆t)− ε,

= Φt+∆t,∗ − 1/2(Φt,∗ − Φt)

(2)



MOC+BFECC (cont.)

Nbr of Cells QUICK-SP BFECC-SP QUICK-DP BFECC-DP

64× 64× 64 29.09 12.38 15.9 5.23

128× 128× 128 75.74 18.00 28.6 7.29

192× 192× 192 78.32 17.81 30.3 7.52

Cubic cavity. Computing rates for the whole NS solver (one step) in [Mcell/sec] obtained with the

BFECC and QUICK algorithms on a NVIDIA GTX 580. 3 Poisson iterations were used.

Costarelli, et.al. ”GPGPU implementation of the BFECC algorithm for pureadvection equations.” Cluster Computing 17(2) (2014): 243-254.


http://dx.doi.org/10.1007/s10586-013-0329-9


Analysis of performance

• Regarding the performance results shown above, it can be seen that thecomputing rate of QUICK is at most 4x faster than that of BFECC. SoBFECC is more efficient than QUICK whenever used with CFL > 2, beingthe critical CFL for QUICK 0.5. The CFL used in our simulations is typicallyCFL≈ 5 and, thus, at this CFL the BFECC version runs 2.5 times fasterthan the QUICK version.• Effective computing rate is more representative than the plain computing

rate

effective computing rate = CFLcrit ·(number of cells)

(elapsed time)

• The speedup of MOC+BFECC versus QUICK increases with the number ofPoisson iterations. In the limit of very large number of iters (very lowtolerance in the tolerance for Poisson) we expect a speedup 10x (equal tothe CFL ratio).



Validation. A buoyant fully submerged sphere

• A fluid structure interaction problem has beenused for validation.• Fluid is water ρfluid = 1000 [kg/m3].• The body is a smooth sphere D =10cm, fully

submerged in a box of L =39cm. Soliddensity is ρbdy = 366 [kg/m3] ≈ 0.4 ρfluid

so the body has positive buoyancy. The buoy(sphere) is tied by a string to an anchor pointat the center of the bottom side of the box.• The box is subject to horizontal harmonic

oscillations

xbox = Abox sin(ωt),



Spring-mass-damper oscillator

• The acceleration of the box creates a driving force on the sphere. Notethat as the sphere is positively buoyant the driving force is in phase withthe displacement. That means that if the box accelerates to the right, thenthe sphere tends to move in the same direction. (launch video p2-f09)

• The buoyancy of the sphere produces a restoring force that tends to keepaligned the string with the vertical direction.• The mass of the sphere plus the added mass of the liquid gives inertia.• The drag of the fluid gives a damping effect.• So the system can be approximated as a (nonlinear) spring-mass-damper

oscillator

inertia︷︸︸︷mtotx+

damp︷︸︸︷0.5ρflCdA|v|x+

spring︷︸︸︷(mflotg/L)x =

driving frc︷︸︸︷mflotabox,

mtot = mbdy +madd, mflot = md −mbdy,

madd = Cmaddmd, Cmadd ≈ 0.5, md = ρflVbdy


/home/scratch/mstorti/TO-BACKUP/DATA-CD/GPUCONF-VIDEOS/p2-f09.avi


Experimental results

• Experiments carried out atUniv Santiago deChile (Dra. Marcela Cruchaga).• Plexiglas box mounted on a shake table,

subject to oscillations of 2cm amplitude infreqs 0.5-1.5 Hz.• Sphere position is obtained using Motion

Capture (MoCap) from high speed (HS) camera(800x800 pixels @ 120fps). HS cam is AOSQ-PRI, can do 3 MPixel @500fps, max framerate is 100’000 fps.• MoCap is performed with a home-made code.

Finds current position by minimization of afunctional. Can also detect rotations. In thisexp setup 1 pixel=0.5mm (D=200 px). Cameradistortion and refraction have been evaluatedto be negligible. MoCap can obtain subpixelresolution. (launch video mocap2)


/home/scratch/mstorti/TO-BACKUP/DATA-CD/GPUCONF-VIDEOS/mocap2.avi


Experimental results (cont.)

0

20

40

60

80

100

120

140

160

180

0.5 0.6 0.7 0.8 0.9 1 1.1 1.2 1.3

phase[deg]

freq[Hz]

anly(CD=0.3,Cmadd=0.55)exp

0

0.02

0.04

0.06

0.08

0.1

0.4 0.6 0.8 1 1.2 1.4 1.6

Amplitude[m

]

freq[Hz]

anly(CD=0.3,Cmadd=0.55)exp

Best fit of the experimental results with the simple spring-mass-damperoscillator are obtained for Cmadd = 0.55 and Cd = 0.25. Reference valuesare:

• Cmadd = 0.5 (potential flow), Cmadd = 0.58 (Neill etal Amer J Phys2013),• Cd = 0.47 (reference value). (launch video description-buoy-resoviz)


/home/scratch/mstorti/TO-BACKUP/DATA-CD/GPUCONF-VIDEOS/description-buoy-resoviz.avi


aboxdriving force

box desplacement

buoy displacement (f<<f0)

buoy displacement (f=f0)

buoy displacement (f>>f0) f

buoy (positive buoyancy)

pendulum (negative buoyancy)

abox

driving forcebox desplacement

buoy displacement (f<<f0)

buoy displacement (f=f0)

buoy displacement (f>>f0)f

driving force = (md −mbdy)abox (3)



Computational FSI model

• The fluid structure interaction is performed with a partitioned FluidStructure Interaction (FSI) technique.• Fluid is solved in step [tn, tn+1 = tn + ∆t] between the known position

xn of the body at tn and an extrapolated (predicted) position xn+1,P attn+1.• Then fluid forces (pressure and viscous tractions) are computed and the

rigid body equations are solved to compute the true position of the solidxn+1.• The scheme is second order precision. It can be put as a fixed point

iteration and iterated to convergence (enhancing stability).



Computational FSI model (cont.)

• The simulation is performed in a non-inertial frame attached to the box.So, according to the Galilean equivalence principle we have just to add aninertial force due to the acceleration of the box.• Due to the incompressibility this is absorbed in an horizontal flotation

force on the sphere, which is the driving force for the system.• The rigid body eqs. are now:

mbdyx+ (mflotg/L)x = Ffluid +mflotabox

where Ffluid = Fdrag + Fadd are now computed with the CFD FVM/IBM1code.



Why use a positive buoyant body?

• Gov. equations for the body:

mbdyx+ (mflotg/L)x =

comp w/CFD︷︸︸︷Ffluid(x) +mflotabox

where Ffluid = Fdrag + Fadd are fluid forces computed with the FVM/IBMcode.• The most important and harder from the point of view of the numerical

method is to compute the the fluid forces (drag and added mass).• Buoyancy is almost trivial and can be easily well captured of factored out

from the simulation.• Then, if the body is too heavy it’s not affected by the fluid forces, so we

prefer a body as lighter as possible.• In the present example mbdy = 0.192[kg], md = 0.523[kg],madd = 0.288[kg]



Comparison of experimental and numerical results

0

0.02

0.04

0.06

0.08

0.1

0.5 0.6 0.7 0.8 0.9 1 1.1 1.2 1.3

Amplitude[m

]

freq[Hz]

experimentalFVM/IBM1

0

50

100

150

200

0.5 0.6 0.7 0.8 0.9 1 1.1 1.2 1.3Phase[deg]

freq[Hz]




Comparison of experimental and numerical results

0

0.02

0.04

0.06

0.08

0.1

0.5 0.6 0.7 0.8 0.9 1 1.1 1.2 1.3

Amplitude[m

]

freq[Hz]


0

50

100

150

200

0.5 0.6 0.7 0.8 0.9 1 1.1 1.2 1.3

Phase[deg]

freq[Hz]


• Results obtained with mesh 2563 = 16.7 Mcell.• Maximum error 22% at resonance (f = 0.9Hz).• Added mass an buoyancy cancel out at resonance and amplitude is

almost controlled by drag. Hence drag is not well computed.• Numerical amplitude is lower than experimental =⇒ numerical drag is

higher than experimental.• Probable cause: staircase representation of surface acts as artificial

roughness and increases drag.• Resonance frequency seems correct =⇒ added mass and buoyancy

forces are correct.



Convergence of FVM/IBM1

• Plot shows amplitude vs mesh size at resonance (f = 0.9Hz).• The scheme is O(h) convergent.• Need for a O(h2) scheme with computational efficiency comparable to

the 1st order one (FVM/IBM1).

0 0.001 0.002 0.003 0.004 0.0054

5

6

7

8

9

10

mesh size h [m]

Abuoy [m

]

FVM/IBM1experimental

256^3 = 16.7 Mcell

192^3 = 7.1 Mcell

128^3 = 2.1 Mcell

96^3 = 0.9 Mcell

actual Lego sphere with O(40^3) blocks



Immersed boundary second order scheme (FVM/IBM2)

• Fadlun, et al. ”Combined immersed-boundary...” JCP (2000)• Hard to combine with staggered grids and Fractional Step schemes.• So switched to Rhie-Chow interpolation for using co-located grids (i.e.

avoiding staggered grids)• Switch to SIMPLE algorithm for iterating the pressure-velocity coupling.• Bodies are represented by level set functions. This allows for easy

computation of boundary operators without branch-divergence.


http://dx.doi.org/10.1006/jcph.2000.6484

http://dx.doi.org/10.2514/3.8284


Immersed boundary second order scheme (FVM/IBM2) (cont.)

1: Data initialization: u0, p0, q0, q0

2: for n = 0, Nstep do

3: u∗ ← un , p∗ ← pn

4: for k = 0, Nsimple do

5: w ← u∗ , f ← 0

6: for l = 0, NIBM do

7: w = Gp∗ + R(w) + f ; w = ΠBC(w, qnS , qnS)

8: f ′ = w −Gp∗ − R(w), εf = ‖f − f ′‖/‖f l=0‖9: f = f ′ ,w = w

10: If εf < tolf break

11: end for

12: u← w

13: SolveA(p− p∗) = A′p∗ +Du for p

14: εu = ‖u− u∗‖/‖u− un‖15: u∗ ← αuu+ (1− αu)u∗; p∗ ← p∗ + αpp

16: If εu < tolu break

17: end for

18: un+1 ← u∗ , pn+1 ← p∗

19: Compute the tractions of the fluid onto the solid tfl20: Compute the new solid position qn+1

S and velocity qn+1S

21: end for



The boundary projection operator

• Given a solid cell node S we have to determinethe fictitious (a.k.a. parasitic) velocity ufS suchthat it guarantees the satisfaction of theboundary condition at B• The closest point xB to S is determined asxS = xB − dnB .• d and nB can be determined from the level set

function ψ

nB = (∇ψ/‖∇ψ‖)B ≈ (∇ψ/‖∇ψ‖)S ,

d = ψS ,

• Mirror fluid point to S is xF = xB + dnS• Velocity at mirror point is determined from

extrapolation ufS = 2uB − uF (*)• Forcing at S is iteratively found such that (*) is

satisfied.Solid cell center Fluid cell center

ψ<0 (solid)

ψ>0 (fluid)

ψ=0

S

B

nB^

F

d

d



Numerical results with FVM/IBM2

0.4 0.6 0.8 1 1.2 1.4 1.60

0.02

0.04

0.06

0.08

0.1

f [Hz]

Ab

[m]

experimental [m]FVM/IBM1[m]FVM/IBM2[m]

experimental

FVM/IBM1 256^3 = 16.7 Mcell


0.4 0.6 0.8 1 1.2 1.4 1.60

50

100

150

f [Hz]

phi [

deg]

experimental [m]FVM/IBM1[m]FVM/IBM2[m]

experimental





Convergence of FVM/IBM2

• FVM/IBM2 exhibits O(h2) convergence.

0 0.001 0.002 0.003 0.004 0.0054

5

6

7

8

9

10

mesh size h [m]

FVM/IBM1FVM/IBM2experimental

Abuoy [m

]



Convergence of FVM/IBM2 (cont.)

• Results for FVM/IBM2 with 1923 are much better than FVM/IBM1 with afiner mesh 2563.• Time step for FVM/IBM1 was chosen so that CFL = 0.2 (CFLcrit = 0.5,

QUICK version).• Tolerance for Poisson 10−6.• Results are almost converged for NIBM = 3, Nsimple = 5.• For FVM/IBM2 we used tolf = 1× 10−3, tolu = 2× 10−3 (typicallyNIBM = 1, Nsimple = 3). CFL = 0.1 (∆t = 2× 10−4 s). (RememberCFLcrit = 0.5− 0.6).• Results have been fitted with the analytic model giving Cd,num = 0.23,

experimental= 0.25 (reference value for sphere drag is Cd = 0.47).• Added mass coefficient: FVM/IBM2: Cmadd = 0.58, experimental = 0.58,

reference: 0.55, potential theory: 0.5.



Computational efficiency of FVM/IBM2 on GPGPUs

• Efficiency of the above algorithm has been measured on Nvidia TeslaK40c, and GTX Titan Black.• K40c has 2880 CUDA cores @876 MHz, 12 GiB ECC RAM. GK110 model

(Kepler arch). (It’s equivalent with K40 but has builtin fansink). Cost:O(USD 1500).• GTX Titan Black has 2880 cores @980 GHz with 6 GiB RAM (no ECC).

Cost: O(USD 3100)



Tesla K40c vs GTX Titan Black

GTX Titan Black

2880 cores @ 980MHz2880 cores @ 875MHz

235W 250W

12 GiB ECC 6 GiB (no ECC)

7.10e9 7.08e9

4.29 Tflop/s 5.12 Tflop/s

336 GB/s288 GB/s

cores /Turbo speed

dissipation

device RAM

# transistors

memory bandwidth

SP performance

TESLA K40

+12%

+7%

0.5x

+19%

=

+17%

cost (aprox.) USD 3,100 USD 1,500 0.5x



Computational efficiency of FVM/IBM2 on GPGPUs

2e+06 4e+06 6e+06 8e+06 1e+07 1.2e+07200

400

600

800

1000

1200

1400

Ncell

com

putin

g ra

te [M

cell/

s]GTX,SP,NSIMPLE=1

K40c,SP,NSIMPLE=1

GTX,DP,NSIMPLE=1

K40c,DP,NSIMPLE=1

K40c,SP,NSIMPLE=3GTX,DP,NSIMPLE=3

K40c,DP,NSIMPLE=3

GTX,SP,NSIMPLE=3



Computing rates for FVM/IBM2

• Maximum attained rate is1.3 Gcell/s for GTX Titan Black inSP, Nsimple = 1, mesh of2243 = 11.2 Mcell.• Computing rates are monotonically

increasing with refinement in allcases.• Comp. rates are roughly∝ (NsimpleNIBM)−1

• Titan Black outperforms K40c forboth SP and DP.• Remarkable good performance of

GTX Titan Black specially in DP.2e+06 4e+06 6e+06 8e+06 1e+07 1.2e+070

200

400

600

800

1000

1200

Ncell

com

putin

g ra

te [M

cell/

s]

GTX,SP,NSIMPLE=1

K40c,SP,NSIMPLE=1

GTX,DP,NSIMPLE=1

K40c,DP,NSIMPLE=1

GTX,SP,NSIMPLE=3

K40c,SP,NSIMPLE=3

GTX,DP,NSIMPLE=3K40c,DP,NSIMPLE=3



Computing rates for FVM/IBM2 (cont.)

• For 1923 (the refinement used for the results presented above) we cancompute 1s simulation in 30s of computation (ST:RT=1:30).• We can estimate un upper bound for the computing rate counting the

transactions to the device RAM.• We have 66 transactions (floats/doubles) per cell. So the theoretical limit is

limit rate (SP) =336GB/s

66 · 4 Bytes/cell= 1.27 Gcell/s,

limit rate (DP) =336GB/s

66 · 8 Bytes/cell= 636 Mcell/s,

(4)



Time consumption breakdown



Numerical results

Vorticity isosurface

velocity at sym plane

tadpolescolored vort isosurfaces

(launch video fsi-boya-gpu-vorticity), (launch video gpu-corte)

(launch video fsi-vort-3D), (launch video particles-buoyp5)


/home/scratch/mstorti/TO-BACKUP/DATA-CD/GPUCONF-VIDEOS/fsi-boya-gpu-vorticity.avi

/home/scratch/mstorti/TO-BACKUP/DATA-CD/GPUCONF-VIDEOS/gpu-corte.avi

/home/scratch/mstorti/TO-BACKUP/DATA-CD/GPUCONF-VIDEOS/fsi-vort-3D.avi

/home/scratch/mstorti/TO-BACKUP/DATA-CD/GPUCONF-VIDEOS/particles-buoyp5.avi


FVM/IBM vs. LBM on GPGPU

• In LBM the Poisson equation is not solved; it usespseudo-compressibility, hence LBM is a true Cellular Automata (CA)algorithm. (FVM/IBM could so too, of course )• So comparatively it can reach higher computing rates, but suffers from the

artificial Mach penalization effect.• Suppose you have a maximum velocity umax in the flow field.• If you use pseudo-compressibility then you must choose an artificial

speed of sound cart > umax.• The compressibility effects are O(M2

art) (Mart = umax/cart) so youchoose, let’s say, Mart = 0.1 so that the error is O(10−2).

∆t < CFLcrit · h/(umax + cart)

= Mart/(1 + Mart) · h/umax ≈ Mart · h/umax,

CFL < Mart · CFLcrit,

• So your CFL is penalized by a factor Mart (0.1 in the example).



FVM/IBM vs. LBM on GPGPU (cont.)

• So in fact not only Mcell/s must be reported, but also the critical CFL forthe algorithm.• So if the computing rate in Mcell/s is doubled but have a penalization of

0.5x in the CFL the algorithm breaks even.• Or rather the effective computing rate should be reported

effective computing rate = CFLcrit ·(number of cells)

(elapsed time)

• The presented algorithm has (experimentally determined)CFLcrit = 0.5− 0.6.



Future work

• We are in the process of buying a HPC cluster (≈ 20 Tflops, w/o GPGPUs)with funds from Provincia de Santa Fe Science Agency ASACTEI (Thanks!

). Includes nodes w/GTX Titan Z (2x2880=5760 cores @876 MHz(boost), 12 GiB RAM, equiv to 2 Titan (not Black! ) on the same card).We estimate reaching 2 Gcell/s on one card.• Replace momentum eqs. solver with MOC+BFECC.• Multi-GPU: (hybrid MPI+CUDA+OpenMP) Some experiments already done

with Poisson eq.• Refinement with multi-block strategy.• LES Smagorinsky turbulence model.



Acknowledgments

• Chilean Council for Scientific and Technological Research (CONICYT-FONDECYT 1130278);

• The Scientific Research Projects Management Department of the Vice Presidency of

Research, Development and Innovation (DICYT-VRID) at Universidad de Santiago de Chile;

• Association of Universities: Montevideo Group (AUGM);

• Argentinean Council for Scientific Research (CONICET project PIP 112-20111-00978);

• Argentinean National Agency for Technological and Scientific Promotion, ANPCyT, (grants

PICT-E-2014-0191, PICT-2014-0191);

• Universidad Nacional del Litoral, Argentina (grant CAI+D-501-201101-00233-LI);

• Santa Fe Science Technology and Innovation Agency (grant ASACTEI-010-18-2014); and

• European Research Council (ERC) Advanced Grant Real Time Computational Mechanics

Techniques for Multi-Fluid Problems (REALTIME, Reference: ERC-2009-AdG).

• We made extensive use of Free Software as GNU/Linux OS, GCC/G++ compilers, Octave, and

Open Source software as VTK, ParaView, among many others.


advances in the solution of ns eqs. in gpgpu hardware. second order scheme and experimental...

Science

nvidia gpus

gpu por

gpus architecture

ns solver

version texstuff

cores thread processors

control cores

powerful cores