To cite this version: Christian Obrecht, Frédéric Kuznik, Bernard Tourancheau, Jean-Jacques Roux. Scalable Lattice Boltzmann Solvers for CUDA GPU Clusters. Parallel Computing, Elsevier, 2013, 39 (6-7), pp. 259-270. doi:10.1016/j.parco.2013.04.001. HAL Id: hal-00931058, https://hal.archives-ouvertes.fr/hal-00931058, submitted on 11 Jun 2014.

Scalable Lattice Boltzmann Solvers for CUDA GPU Clusters

Christian Obrecht a,b,c,∗, Frédéric Kuznik b,c, Bernard Tourancheau d, Jean-Jacques Roux b,c

a EDF R&D, Département EnerBAT, 77818 Moret-sur-Loing Cedex, France
b Université de Lyon, 69361 Lyon Cedex 07, France
c INSA-Lyon, CETHIL UMR5008, 69621 Villeurbanne Cedex, France
d UJF-Grenoble, INRIA, LIG UMR5217, 38041 Grenoble Cedex 9, France

Abstract

The lattice Boltzmann method (LBM) is an innovative and promising approach in computational fluid dynamics. From an algorithmic standpoint it reduces to a regular data-parallel procedure and is therefore well-suited to high performance computations. Numerous works report efficient implementations of the LBM for the GPU, but very few mention multi-GPU versions and even fewer GPU cluster implementations. Yet, to be of practical interest, GPU LBM solvers need to be able to perform large scale simulations. In the present contribution, we describe an efficient LBM implementation for CUDA GPU clusters. Our solver consists of a set of MPI communication routines and a CUDA kernel specifically designed to handle three-dimensional partitioning of the computation domain. Performance measurements were carried out on a small cluster. We show that the results are satisfying, both in terms of data throughput and parallelisation efficiency.

Keywords: GPU clusters, CUDA, lattice Boltzmann method

1. Introduction

A computing device based on a single GPU is not suitable for solving large scale problems because of the limited amount of on-board memory. However, applications running on multiple GPUs have to face the PCI-E bottleneck, and great care has to be taken in design and implementation to minimise inter-GPU communication. Such constraints may be rather challenging; the well-known MAGMA [15] linear algebra library, for instance, did not support multiple GPUs until version 1.1, two years after the first public release.

The lattice Boltzmann method (LBM) is a novel approach in computational fluid dynamics (CFD) which, unlike most other CFD methods, does not consist in directly solving the Navier-Stokes equations by a numerical procedure [9]. Besides many interesting features, such as the ability to easily handle complex geometries, the LBM reduces to a regular data-parallel algorithm and is therefore well-suited to efficient HPC implementations. As a matter of fact, numerous successful attempts to implement the LBM for the GPU have been reported in recent years, starting with the seminal work of Li et al. in 2003 [10].

CUDA capable computation devices may at present manage up to 6 GB of memory, and the most widespread three-dimensional LBM models require the use of at least 19 floating point numbers per node. Such a capacity therefore allows the GPU to process about 8.5 × 10⁷ nodes in single-precision. Taking architectural constraints into account, the former amount is sufficient to store a 416³ cubic lattice. Although large, such a computational domain is likely to be too coarse to perform direct numerical simulation of a fluid flow in many practical situations such as, for instance, urban-scale building aeraulics or thermal modeling of electronic circuit boards.
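The figures above can be checked with the following back-of-envelope sketch in C; the interpretation of 6 GB as 6 GiB and of the architectural constraint as lattice sides being multiples of 32 are our assumptions, not statements from the original.

#include <math.h>
#include <stdio.h>

/* Back-of-envelope check of the figures quoted above: 6 GB of device
 * memory (taken here as 6 GiB) and 19 single-precision densities per node;
 * the lattice side is rounded down to a multiple of 32, which is our
 * reading of the "architectural constraints" mentioned in the text. */
int main(void)
{
    const double mem   = 6.0 * 1024 * 1024 * 1024;  /* bytes          */
    const double node  = 19 * 4;                    /* bytes per node */
    const double nodes = mem / node;

    int side = (int)cbrt(nodes);
    side -= side % 32;

    printf("capacity : %.2e nodes\n", nodes);             /* ~8.5e7         */
    printf("lattice  : %d^3 = %.2e nodes\n",
           side, (double)side * side * side);             /* 416^3 ~ 7.2e7  */
    return 0;
}

Running it yields about 8.5 × 10⁷ nodes and a 416³ cube, in line with the values quoted above.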

To our knowledge, the few single-node multi-GPU LBM solvers described in the literature all use a one-dimensional (1D) partition of the computation domain, which is relevant in this case since the volume of inter-GPU communication is not likely to be a limiting factor given the small number of involved devices. This option does not require any data reordering, provided the appropriate partitioning direction is chosen, thus keeping the computation kernel fairly simple. For a GPU cluster implementation, on the contrary, a kernel able to run on a three-dimensional (3D) partition seems preferable, since it would both provide more flexibility for load balancing and contribute to reducing the volume of communication, as the sketch below illustrates.
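The following sketch compares the per-GPU interface area of a 1D slab partition with that of a 3D block partition for a cubic domain; the domain side and partition sizes are hypothetical and edge effects are ignored.

#include <stdio.h>

/* Rough per-GPU interface area (in nodes) for a cubic domain of side N:
 * 1D slab partition versus a p x p x p block partition (P = p^3 GPUs).
 * Boundary sub-domains and edge contributions are ignored; the values of
 * N and p below are hypothetical. */
int main(void)
{
    const long N = 768;
    const long p = 2;                       /* 3D partition: 8 GPUs       */

    long slab_1d  = 2 * N * N;              /* interior slab: two faces   */
    long block_3d = 6 * (N / p) * (N / p);  /* interior block: six faces  */

    printf("1D slab  : %ld interface nodes per GPU\n", slab_1d);
    printf("3D block : %ld interface nodes per GPU\n", block_3d);
    return 0;
}

With these values an interior slab exchanges data through two full N × N faces, whereas an interior block exchanges data through six (N/p)² faces, and the gap widens quickly as the number of GPUs per direction grows.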

In the present contribution, we describe an implementation of a lattice Boltzmann solver for CUDA GPU clusters. The core computation kernel is designed so as to import and export data efficiently in each spatial direction, thus enabling the use of 3D partitions. The inter-GPU communication is managed by MPI-based routines. This work constitutes the latest extension to the TheLMA project [1], which aims at providing a comprehensive framework for efficient GPU implementations of the LBM.

The remainder of the paper is structured as follows. In Section 2, we give a description of the algorithmic aspects of the LBM as well as a short review of LBM implementations for the GPU. The third section consists of a detailed description of the implementation principles of the computation kernel and the communication routines of our solver. In the fourth section, we present some performance results on a small cluster. The last section concludes and discusses possible extensions to the present work.

2. State of the art

2.1. Lattice Boltzmann Method

The lattice Boltzmann method is generally carried out on a regular orthogonal mesh with a constant time step δt. Each node of the lattice holds a set of scalars fα, α = 0, . . . , N, representing the local particle density distribution. Each particle density fα is associated with a particle velocity ξα and a propagation vector cα = δt · ξα. Usually the propagation vectors link a given node to one of its nearest neighbours, except for c0 which is null.

Figure 1: The D3Q19 stencil — The blue arrows represent the propagation vectors of the stencil linking a given node to some of its nearest neighbours. (In the original figure, the 18 non-zero directions are numbered from 1 to 18.)

For the present work, we implemented the D3Q19 propagation stencil illustrated in Fig. 1. This stencil, which contains 19 elements, is the most commonly used in practice for 3D LBM, being the best trade-off between size and isotropy. The governing equation of the LBM at node x and time t writes:

|fα(x + cα, t + δt)⟩ − |fα(x, t)⟩ = Ω(|fα(x, t)⟩) ,    (1)

where |fα⟩ denotes the distribution vector and Ω denotes the so-called collision operator. The mass density ρ and the momentum j of the fluid are given by:

ρ = ∑α fα ,    j = ∑α fα ξα .    (2)
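As a small illustration of Eq. (2), the following C fragment computes the two moments for a D3Q19 node; the ordering of the velocity set is arbitrary and does not attempt to reproduce the numbering of Fig. 1.

#define Q 19   /* number of particle densities in the D3Q19 stencil */

/* D3Q19 velocity set: one rest velocity, six axis-aligned and twelve
 * diagonal ones. The ordering below is arbitrary. */
static const int xi[Q][3] = {
    { 0, 0, 0},
    { 1, 0, 0}, {-1, 0, 0}, { 0, 1, 0}, { 0,-1, 0}, { 0, 0, 1}, { 0, 0,-1},
    { 1, 1, 0}, {-1,-1, 0}, { 1,-1, 0}, {-1, 1, 0},
    { 1, 0, 1}, {-1, 0,-1}, { 1, 0,-1}, {-1, 0, 1},
    { 0, 1, 1}, { 0,-1,-1}, { 0, 1,-1}, { 0,-1, 1}
};

/* Moments of Eq. (2): rho = sum_a f_a and j = sum_a f_a * xi_a. */
static void moments(const float f[Q], float *rho, float j[3])
{
    *rho = 0.0f;
    j[0] = j[1] = j[2] = 0.0f;
    for (int a = 0; a < Q; ++a) {
        *rho += f[a];
        j[0] += f[a] * xi[a][0];
        j[1] += f[a] * xi[a][1];
        j[2] += f[a] * xi[a][2];
    }
}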

In our solver, we implemented the multiple-relaxation-time collision operator described in [5]. Further information on the physical and numerical aspects of the method is to be found in the aforementioned reference. From an algorithmic perspective, Eq. 1 naturally breaks into two elementary steps:

|f̃α(x, t)⟩ = |fα(x, t)⟩ + Ω(|fα(x, t)⟩) ,    (3)

|fα(x + cα, t + δt)⟩ = |f̃α(x, t)⟩ ,    (4)

where |f̃α⟩ denotes the post-collision particle distribution.

Equation 3 describes the collision step, in which an updated particle distribution is computed. Equation 4 describes the propagation step, in which the updated particle densities are transferred to the neighbouring nodes. This two-step process is outlined in Fig. 2 (in the two-dimensional case, for the sake of clarity).

Figure 2: Collision and propagation — Panels: (a) initial state, (b) post-collision state, (c) post-propagation state. The collision step is represented by the transition between (a) and (b). The pre-collision particle distribution is drawn in black whereas the post-collision one is drawn in blue. The transition from (b) to (c) illustrates the propagation step in which the updated particle distribution is advected to the neighbouring nodes.

Figure 3: In-place propagation — Panels: (a) initial state, (b) pre-collision state, (c) post-collision state. With the in-place propagation scheme, contrary to the out-of-place scheme outlined in Fig. 2, the updated particle distribution of the former time step is advected to the current node before collision.

2.2. GPU implementations of the LBM

Due to substantial hardware evolution, the pioneering work of Fan et al. [6] reporting a GPU cluster LBM implementation is only partially relevant today. The GPU computations were implemented using pre-CUDA techniques that are now obsolete. Yet, the proposed optimisation of the communication pattern still applies, although it was only tested on Gigabit Ethernet; in future work, we plan to evaluate its impact using an InfiniBand interconnect.

In 2008, Tölke and Krafczyk [16] described a single-GPU 3D LBM implementation using CUDA. The authors mainly try to address the problem induced by misaligned memory accesses.


As a matter of fact, with the NVIDIA G80 GPU available at the time, only aligned and ordered memory transactions could be coalesced. The proposed solution consists in partially performing propagation in shared memory. With the GT200 generation, this approach is less relevant, since misalignment has a lower, though not negligible, impact on performance. As shown in [12], the misalignment overhead is significantly higher for store operations than for read operations. We therefore suggested in [13] to use the in-place propagation scheme outlined in Fig. 3 instead of the ordinary out-of-place propagation scheme illustrated in Fig. 2. From an implementation standpoint, this alternative approach consists in performing the propagation when loading the densities from device memory instead of performing it when storing the densities back. The misalignments therefore only occur on read operations. With the GT200, the gain in performance is about 20%. Moreover, the resulting computation kernel is simpler and leaves the shared memory free for possible extensions.
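The difference between the two schemes reduces to where the spatial offset is applied, as the following one-dimensional C sketch illustrates; relax() is a placeholder for the collision operator and the handling of the domain ends is omitted.

#include <stddef.h>

/* Schematic one-dimensional contrast between out-of-place ("push") and
 * in-place ("pull") propagation for a single density travelling in the +x
 * direction; relax() stands for the collision operator. The point is only
 * where the spatial offset is applied: on the store or on the load. */
static float relax(float f) { return f; }   /* placeholder collision */

/* push: collide at node x, then store the result at the downstream node */
static void step_push(const float *f_old, float *f_new, size_t nx)
{
    for (size_t x = 0; x + 1 < nx; ++x)
        f_new[x + 1] = relax(f_old[x]);     /* offset applied on the write */
}

/* pull: load the density arriving at x from the upstream node, collide,
 * and store at x itself */
static void step_pull(const float *f_old, float *f_new, size_t nx)
{
    for (size_t x = 1; x < nx; ++x)
        f_new[x] = relax(f_old[x - 1]);     /* offset applied on the read */
}

In the out-of-place (push) version the offset shifts the writes issued by a warp, whereas in the in-place (pull) version it shifts only the reads, which matches the observation of [12] that misaligned reads are cheaper than misaligned stores.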

Further work led us to develop a single-node multi-GPU solver, with 1D partitioning of the computation domain [14]. Each CUDA device is managed by a specific POSIX thread. Inter-GPU communication is carried out using zero-copy transactions to page-locked host memory. Performance and scalability are satisfying, with up to 2,482 million lattice node updates per second (MLUPS) and 90.5% parallelisation efficiency on a 384³ lattice using eight Tesla C1060 computing devices in single-precision.

In their recent paper [17], Wang and Aoki describe an implementation of the LBM for CUDA GPU clusters. The partition of the computation domain may be either one-, two-, or three-dimensional. Although the authors are elusive on this point, no special care seems to be taken to optimise data transfer between device and host memory, and as a matter of fact, performance is quite low. For instance, on a 384³ lattice with 1D partitioning, the authors report about 862 MLUPS using eight GT200 GPUs in single-precision, i.e. about one third of the performance of our single-node multi-GPU solver using similar hardware. It should also be noted that the given data sizes for communication per rank, denoted M1D, M2D, and M3D, are at least inaccurate. For the 1D and 2D cases, no account is taken of the fact that for the simple bounce-back boundary condition, no external data is required to process boundary nodes. In the 3D case, the proposed formula is erroneous.

3. Proposed implementation

3.1. Computation kernel

To take advantage of the massive hardware parallelism, our single-GPU and our single-node multi-GPU LBM solvers both assign one thread to each node of the lattice. The kernel execution set-up consists of a two-dimensional grid of one-dimensional blocks, mapping the spatial coordinates. The lattice is stored as a four-dimensional array, the direction of the blocks corresponding to the minor dimension. Two instances of the lattice are kept in device memory, one for even time steps and one for odd time steps, in order to avoid local synchronisation issues. This data layout allows the fetch and store operations to be coalesced, since consecutive threads within a warp access consecutive memory locations.


Figure 4: Grid layout for the single-GPU and the single-node multi-GPU LBM solvers — The execution grid is two-dimensional with one-dimensional blocks spanning the width of the domain (or sub-domain).

It also makes it possible, using coalesced zero-copy transactions, to import and export data efficiently at the four sub-domain faces parallel to the blocks, with partial overlapping of communication and computations. For the two sub-domain faces orthogonal to the blocks, however, such an approach is not practicable since only the first and the last thread within a block would be involved in data exchange, leading to individual zero-copy transactions which, as shown in section 6 of [14], dramatically increase the cost of inter-GPU communication.
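A layout consistent with this description is sketched below; the ordering of the major dimensions is our assumption, the essential point being that x is the minor dimension so that a warp touches consecutive words.

#include <cstddef>

/* One possible realisation of the four-dimensional layout described above.
 * The density index a and the z and y coordinates are major dimensions and
 * x is the minor one, so that the threads of a warp (consecutive x) access
 * consecutive 4-byte words and the transactions coalesce. The ordering of
 * the major dimensions is our assumption, not necessarily TheLMA's. Two
 * such arrays are allocated, one read at even and written at odd time
 * steps, and conversely. */
static inline size_t node_index(int a, int x, int y, int z,
                                int sx, int sy, int sz)
{
    return (((size_t)a * sz + z) * sy + y) * (size_t)sx + x;
}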

A possible solution to extend our computation kernel to support 3D partitions would be to use a specific kernel to handle the interfaces orthogonal to the blocks. Leaving aside the overhead of kernel switching, such an approach does not seem satisfying since the corresponding data is scattered across the array and therefore the kernel would only perform non-coalesced accesses to device memory. As a matter of fact, the minimum data access size is 32 bytes for compute capability up to 1.3, and 128 bytes above, whereas only 4 or 8 bytes would be useful. The cache memory available in devices of compute capability 2.0 and 2.1 is likely to have a small impact in this case, taking into account the scattering of the accessed data.

We therefore decided to design a new kernel able to perform propagation and data reordering at once. With this new kernel, blocks are still one-dimensional but, instead of spanning the lattice width, contain only one warp, i.e. W = 32 threads (for all existing CUDA capable GPUs). Each block is assigned to a tile of nodes of size W × W × 1, which requires the sub-domain dimensions to be multiples of W in the x- and y-directions. For an Sx × Sy × Sz sub-domain, we therefore use an (Sx/W) × (Sy/W) × Sz grid.
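In CUDA terms, the execution set-up implied by this description could look as follows; the function and variable names are ours, and a sketch of the kernel body itself is given after Code 1.

#include <assert.h>
#include <cuda_runtime.h>

#define W 32   /* warp size: tile side */

/* Execution set-up implied by the description above: one block of a single
 * warp per W x W x 1 tile. */
static void tile_grid(int sx, int sy, int sz, dim3 *grid, dim3 *block)
{
    assert(sx % W == 0 && sy % W == 0);   /* constraint stated in the text */
    *grid  = dim3(sx / W, sy / W, sz);    /* (Sx/W) x (Sy/W) x Sz blocks   */
    *block = dim3(W, 1, 1);               /* one warp per block            */
}

For a 192 × 96 × 128 sub-domain, as used in the strong scaling test of Table 4, this yields a 6 × 3 × 128 grid of 2,304 single-warp blocks.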

The data access pattern is outlined in Fig. 5. For the sake of clarity, let us call lateral densities the particle densities crossing the tile sides parallel to the y-direction. Using a D3Q19 stencil, the number of lateral densities is M = 10.


Figure 5: Processing of a tile — Each tile of nodes is processed row by row by a CUDA block composed of a single warp. Note that we only drew an 8 × 8 tile instead of a W × W tile in order to improve readability. The current row is framed in red and the direction of the processing is indicated by the bold red arrow. The in-coming lateral densities are drawn in blue whereas the out-going ones are drawn in red. These densities are stored in a temporary array hosted in shared memory.

At each time step, the lateral densities are first loaded into a temporary array in shared memory; the kernel then loops over the tile row by row to process the nodes, saving the updated lateral densities in the temporary array; last, the updated lateral densities are written back.

In addition to the two instances of the lattice, as for the single-GPU and single-node multi-GPU kernels, the new kernel uses an auxiliary array in device memory to store the lateral densities. As demonstrated by the pseudo-code in Code 1, the data transfer operations issued by the kernel, i.e. the statements on lines 6, 7, 15, 23, and 28, are coalesced. These transactions may be either accesses to device memory or, for the nodes located at the interfaces of the sub-domains, zero-copy transactions to communication buffers in host memory. This novel data access pattern thus makes it possible to export data efficiently in every spatial direction.

The data transferred by the kernel consists of the densities stored in the lattice instances and the lateral densities stored in the auxiliary array.


 1. for each block B do
 2.     for each thread T do
 3.         x1 ← (BxW, ByW + Tx, Bz)
 4.         x2 ← (BxW + W − 1, ByW + Tx, Bz)
 5.         for each α ∈ L do
 6.             load fα(x1, t) = fα(x1 − cα, t − δt) into shared memory
 7.             load fᾱ(x2, t) = fᾱ(x2 − cᾱ, t − δt) into shared memory
 8.         end for
 9.         for y = 0 to W − 1 do
10.             x ← (BxW + Tx, ByW + y, Bz)
11.             for α = 0 to N do
12.                 if (Tx = 0 and α ∈ L) or (Tx = W − 1 and ᾱ ∈ L) then
13.                     read fα(x, t) from shared memory
14.                 else
15.                     load fα(x, t) = fα(x − cα, t − δt)
16.                 end if
17.             end for
18.             compute the post-collision distribution |f̃α(x, t)⟩ (Eq. 3)
19.             for α = 0 to N do
20.                 if (Tx = 0 and ᾱ ∈ L) or (Tx = W − 1 and α ∈ L) then
21.                     write f̃α(x, t) to shared memory
22.                 else
23.                     store f̃α(x, t)
24.                 end if
25.             end for
26.         end for
27.         for each α ∈ L do
28.             store f̃ᾱ(x1, t) and f̃α(x2, t)
29.         end for
30.     end for
31. end for

Code 1: Computation kernel — In this pseudo-code, Bx, By, Bz, and Tx denote the indices of block B and thread T; L = {1, 7, 9, 11, 13} lists the indices of the propagation vectors with a strictly positive x-component; ᾱ stands for the direction opposite to α. Note that, for the sake of simplicity, we omitted the processing of boundary conditions.
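The following CUDA sketch is one possible rendering of Code 1. It keeps the overall structure (single-warp blocks, lateral densities staged in shared memory, in-place propagation on loads) but replaces the MRT collision of [5] by an identity placeholder, assumes a fully periodic domain instead of handling boundary conditions and the zero-copy paths to host buffers, and uses a layout of the auxiliary lateral array that is our own choice rather than the one of the TheLMA code; the constant arrays are expected to be filled by the host, e.g. with cudaMemcpyToSymbol.

#include <cstddef>

#define W   32               // warp size = tile side
#define Q   19               // D3Q19
#define NL   5               // directions with a strictly positive x-component

__constant__ int c[Q][3];    // propagation vectors, c[0] = {0, 0, 0}
__constant__ int L[NL];      // indices of the +x directions
__constant__ int OPP[Q];     // OPP[a] = direction opposite to a
__constant__ int LPOS[Q];    // slot of a in L, or -1 if c[a][0] <= 0

__device__ size_t node(int a, int x, int y, int z, int sx, int sy, int sz)
{
    return (((size_t)a * sz + z) * sy + y) * (size_t)sx + x;   // x is minor
}

// aux[plane][side][slot][z][y]: densities crossing a tile interface plane;
// side 0 holds densities travelling towards +x, side 1 towards -x.
__device__ size_t lateral(int plane, int side, int slot, int y, int z,
                          int sy, int sz)
{
    return ((((size_t)plane * 2 + side) * NL + slot) * sz + z) * sy + y;
}

__device__ float collide(float f) { return f; }   // placeholder, not the MRT of [5]

__global__ void tile_step(const float *f_in, float *f_out,
                          const float *aux_in, float *aux_out,
                          int sx, int sy, int sz)
{
    const int tx = threadIdx.x;
    const int x0 = blockIdx.x * W, y0 = blockIdx.y * W, z = blockIdx.z;
    const int ntx = sx / W;

    __shared__ float lat[2][NL][W];   // staged lateral densities, one value per row

    // 1. Load the in-coming lateral densities (cf. lines 5-8 of Code 1):
    //    +x densities entering through the left edge, -x densities through
    //    the right edge, read coalesced from the auxiliary array.
    for (int l = 0; l < NL; ++l) {
        const int cy = c[L[l]][1], cz = c[L[l]][2];
        lat[0][l][tx] = aux_in[lateral(blockIdx.x, 0, l,
                                       (y0 + tx - cy + sy) % sy,
                                       (z - cz + sz) % sz, sy, sz)];
        lat[1][l][tx] = aux_in[lateral((blockIdx.x + 1) % ntx, 1, l,
                                       (y0 + tx + cy) % sy,
                                       (z + cz) % sz, sy, sz)];
    }
    __syncthreads();

    // 2. Process the tile row by row (cf. lines 9-26 of Code 1).
    for (int y = 0; y < W; ++y) {
        const int xg = x0 + tx, yg = y0 + y;
        float f[Q];

        for (int a = 0; a < Q; ++a) {             // in-place ("pull") loads
            const int lp = LPOS[a], lo = LPOS[OPP[a]];
            if (tx == 0 && lp >= 0)          f[a] = lat[0][lp][y];
            else if (tx == W - 1 && lo >= 0) f[a] = lat[1][lo][y];
            else
                f[a] = f_in[node(a, (xg - c[a][0] + sx) % sx,
                                    (yg - c[a][1] + sy) % sy,
                                    (z  - c[a][2] + sz) % sz, sx, sy, sz)];
        }

        for (int a = 0; a < Q; ++a) {             // collision, then stores
            const float fp = collide(f[a]);
            const int lp = LPOS[a], lo = LPOS[OPP[a]];
            if (tx == 0 && lo >= 0)          lat[0][lo][y] = fp;   // out-going, left
            else if (tx == W - 1 && lp >= 0) lat[1][lp][y] = fp;   // out-going, right
            else f_out[node(a, xg, yg, z, sx, sy, sz)] = fp;
        }
    }
    __syncthreads();

    // 3. Write back the out-going lateral densities (cf. lines 27-29 of Code 1).
    for (int l = 0; l < NL; ++l) {
        aux_out[lateral(blockIdx.x, 1, l, y0 + tx, z, sy, sz)]             = lat[0][l][tx];
        aux_out[lateral((blockIdx.x + 1) % ntx, 0, l, y0 + tx, z, sy, sz)] = lat[1][l][tx];
    }
}

Each tile thus reads and writes 10W lateral values per time step through the auxiliary array, which corresponds to the 2 × 10W term of Eq. (5) below.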


At each time step, the volume of data read and written amounts to N × W² floating point numbers per tile for the lattice densities (a whole segment access is performed for each row and each density, although for some densities the first or the last thread does not issue an individual transaction), and to M × W per tile for the lateral densities. The amount of 4-byte (or 8-byte) words read or written per block and per time step is therefore:

QT = 2(19W² + 10W) = 38W² + 20W ,    (5)

and the amount of data read or written in device memory per time step for an Sx × Sy × Sz sub-domain is:

QS = (Sx/W) × (Sy/W) × Sz × QT = Sx Sy Sz (38W + 20)/W .    (6)

We therefore see that this approach only increases the volume of device memory accesses by less than 2% (namely 20/(38 × 32) ≈ 1.6%) with respect to our former implementations [12], while greatly reducing the number of misaligned transactions.

3.2. Multi-GPU solver

To enable our kernel to run across a GPU cluster, we wrote a set of MPI-based initialisation and communication routines. These routines, as well as the new computation kernel, were designed as components of the TheLMA framework, which was first developed for our single-node multi-GPU LBM solver. The main purpose of TheLMA is to improve code reusability. It comes with a set of generic modules providing the basic features required by a GPU LBM solver. This approach allowed us to develop our GPU cluster implementation more efficiently.

The execution set-up, as well as general parameters such as the Reynolds number of the flow simulation or various option flags, are specified by a configuration file in JSON format [4]. The listing in Code 2 gives an example file for a 2 × 2 × 1 partition running on two nodes. The parameters for each sub-domain, such as the size or the target node and computing device, are given in the Subdomains array. The Faces and Edges arrays specify to which sub-domains a given sub-domain is linked, either through its faces or edges. These two arrays follow the same ordering as the propagation vector set displayed in Fig. 1. Being versatile, the JSON format is well-suited for our application. Moreover, its simplicity makes both parsing and automatic generation straightforward. This generic approach brings flexibility. It allows any LBM solver based on our framework to be tuned to the target architecture.

Our implementation requires the use of one MPI process per sub-domain. At start, the rank 0 process is responsible for processing the configuration file. Once this file is parsed, the MPI processes register themselves by sending their MPI processor name to the rank 0 process, which in turn assigns an appropriate sub-domain to each of them and sends back all necessary parameters. The processes then perform local initialisation, setting the assigned CUDA device and allocating the communication buffers.



"Path": "out",

"Prefix": "ldc",

"Re": 1E3,

"U0": 0.1,

"Log": true,

"Duration": 10000,

"Period": 100,

"Images": true,

"Subdomains": [

"Id": 0,

"Host": "node00",

"GPU": 0,

"Offset": [0, 0, 0],

"Size": [128, 128, 256],

"Faces": [ 1, null, 2, null, null, null],

"Edges": [ 3, null, null, null, null, null,

null, null, null, null, null, null]

,

"Id": 1,

"Host": "node00",

"GPU": 1,

"Offset": [128, 0, 0],

"Size": [128, 128, 256],

"Faces": [null, 0, 3, null, null, null],

"Edges": [null, 2, null, null, null, null,

null, null, null, null, null, null]

,

"Id": 2,

"Host": "node01",

"GPU": 0,

"Offset": [0, 128, 0],

"Size": [128, 128, 256],

"Faces": [ 3, null, null, 0, null, null],

"Edges": [null, null, 1, null, null, null,

null, null, null, null, null, null]

,

"Id": 3,

"Host": "node01",

"GPU": 1,

"Offset": [128, 128, 0],

"Size": [128, 128, 256],

"Faces": [null, 2, null, 1, null, null],

"Edges": [null, null, null, 0, null, null,

null, null, null, null, null, null]

]

Code 2: Configuration file
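A minimal MPI sketch of the registration step described above might look as follows; the message layout and the trivial sub-domain assignment are ours, not the actual TheLMA protocol, and MPI_Init is assumed to have been called beforehand.

#include <mpi.h>
#include <cuda_runtime.h>

/* Every process sends its MPI processor name to rank 0, which matches the
 * names against the "Host" fields of the configuration file and answers
 * with a sub-domain id and a GPU index. */
struct assignment { int subdomain, gpu; };

static void register_process(int rank, int size, struct assignment *a)
{
    char name[MPI_MAX_PROCESSOR_NAME];
    int  len;
    MPI_Get_processor_name(name, &len);

    if (rank == 0) {
        a->subdomain = 0;
        a->gpu = 0;                                   /* rank 0's own share */
        for (int r = 1; r < size; ++r) {
            char peer[MPI_MAX_PROCESSOR_NAME];
            MPI_Recv(peer, MPI_MAX_PROCESSOR_NAME, MPI_CHAR, r, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            /* here the real code matches 'peer' against the "Host" fields
             * of the parsed configuration file                            */
            struct assignment out = { r, 0 };
            MPI_Send(&out, sizeof out, MPI_BYTE, r, 1, MPI_COMM_WORLD);
        }
    } else {
        MPI_Send(name, len + 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        MPI_Recv(a, sizeof *a, MPI_BYTE, 0, 1, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    }
    cudaSetDevice(a->gpu);          /* local initialisation then proceeds   */
}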


The communication buffers fall into three categories: send buffers, receive buffers, and read buffers. It is worth noting that both send buffers and read buffers consist of pinned memory allocated using the CUDA API, since they have to be made accessible by the GPU.

The steps of the main computation loop consist of a kernel execution phase and a communication phase. During the first phase, the out-going particle densities are written to the send buffers assigned to the faces, without performing any propagation, as for the densities written to device memory. During the second phase, the following operations are performed:

1. The relevant densities are copied to the send buffers assigned to the edges.

2. Non-blocking send requests are issued for all send buffers.

3. Blocking receive requests are issued for all receive buffers.

4. Once message passing is completed, the particle densities contained in the receive buffers are copied to the read buffers.

This communication phase is outlined in Fig. 6. The purpose of the last operation is to perform propagation for the in-coming particle densities. As a result, the data corresponding to a face and its associated edges is gathered in a single read buffer. This approach avoids misaligned zero-copy transactions and, most importantly, leads to a simpler kernel, since at most six buffers have to be read. It should be mentioned that the read buffers are allocated using the write-combined flag to optimise cache usage. According to [7, 8, 11], this setting is likely to improve performance since the memory pages are locked.
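A minimal sketch of these four operations for one sub-domain is given below; the buffer bookkeeping, neighbour ranks, message tags, and the edge-gathering logic are placeholders, and the copy of step 4 is reduced to a plain memcpy whereas the actual routine applies the propagation offsets while gathering each face and its edges into a single read buffer.

#include <mpi.h>
#include <string.h>

#define FACES 6
#define EDGES 12

/* One communication phase for a single sub-domain. face_send[] and
 * face_read[] point to pinned host memory (cudaHostAlloc, the read buffers
 * with the write-combined flag); face_recv[] and the edge buffers are plain
 * host memory. rank[i] is the MPI rank of the neighbour behind face or edge
 * i (or -1), len[i] the message length, and tag[i] a tag on which both
 * sides of interface i agree. */
static void communication_phase(float *face_send[FACES], float *edge_send[EDGES],
                                float *face_recv[FACES], float *edge_recv[EDGES],
                                float *face_read[FACES],
                                const int rank[FACES + EDGES],
                                const int len[FACES + EDGES],
                                const int tag[FACES + EDGES])
{
    MPI_Request req[FACES + EDGES];
    int nreq = 0;

    /* 1. copy the relevant densities from the face send buffers into the
     *    edge send buffers (gathering logic omitted)                      */

    /* 2. non-blocking sends for all send buffers                          */
    for (int i = 0; i < FACES + EDGES; ++i) {
        float *buf = i < FACES ? face_send[i] : edge_send[i - FACES];
        if (rank[i] >= 0)
            MPI_Isend(buf, len[i], MPI_FLOAT, rank[i], tag[i],
                      MPI_COMM_WORLD, &req[nreq++]);
    }

    /* 3. blocking receives for all receive buffers                        */
    for (int i = 0; i < FACES + EDGES; ++i) {
        float *buf = i < FACES ? face_recv[i] : edge_recv[i - FACES];
        if (rank[i] >= 0)
            MPI_Recv(buf, len[i], MPI_FLOAT, rank[i], tag[i],
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
    MPI_Waitall(nreq, req, MPI_STATUSES_IGNORE);

    /* 4. gather each face and its associated edges into the corresponding
     *    read buffer; the real routine shifts the data so that this copy
     *    also performs the propagation of the in-coming densities          */
    for (int f = 0; f < FACES; ++f)
        if (rank[f] >= 0)
            memcpy(face_read[f], face_recv[f], (size_t)len[f] * sizeof(float));
}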

4. Performance study

We conducted experiments on an eight-node GPU cluster, each node being equipped with two hexa-core X5650 Intel Xeon CPUs, 36 GB of memory, and three NVIDIA Tesla M2070 computing devices; the network interconnect uses QDR InfiniBand. To evaluate raw performance, we simulated a lid-driven cavity [2] in single-precision and recorded execution times for 10,000 time steps using various configurations. Overall performance is good, with at most 8,928 million lattice node updates per second (MLUPS) on a 768³ lattice using all 24 GPUs. For comparison, Wang and Aoki in [17] report at most 7,537 MLUPS for the same problem size using four times as many GPUs. However, it should be mentioned that their results were obtained using hardware of the preceding generation.

The solver was compiled using CUDA 4.0 and OpenMPI 1.4.4. It is also worth mentioning that the computing devices had ECC support enabled. From tests we conducted on a single computing device, we expect the overall performance to be about 20% higher with ECC support disabled.

4.1. Performance model

Our first performance benchmark consisted in running our solver using eight GPUs on a cubic cavity of increasing size. The computation domain is split into a 2 × 2 × 2 regular partition, the size S of the sub-domains ranging from 128 to 288.


Figure 6: Communication phase — The upper part of the graph outlines the path followed by data leaving the sub-domain handled by GPU 0. For each face of the sub-domain, the out-going densities are written by the GPU to pinned buffers in host memory. The associated MPI process then copies the relevant densities into the edge buffers and sends both face and edge buffers to the corresponding MPI processes. The lower part of the graph describes the path followed by data entering the sub-domain handled by GPU 1. Once the reception of in-coming densities for faces and edges is completed, the associated MPI process copies the relevant data for each face of the sub-domain into pinned host memory buffers, which are read by the GPU during kernel execution.


In addition, we recorded the performance of a single GPU on a domain of size S, in order to evaluate the communication overhead and the GPU-to-device-memory data throughput. The results are gathered in Tables 1, 2, and 3.

Table 1 shows that the data throughput between GPU and device memory is stable, only slightly increasing with the size of the domain. (Given the data layout in device memory, the increase of the domain size is likely to reduce the amount of L2 cache misses, having therefore a positive impact on data transfer.) We may therefore conclude that the performance of our kernel is communication bound. The last column accounts for the ratio of the data throughput to the maximum sustained throughput, for which we used the value 102.7 GB/s obtained using the bandwidthTest program that comes with the CUDA SDK. The obtained ratios are fairly satisfying taking into account the complex data access pattern the kernel must follow.

In Tables 2 and 3, the parallel efficiency and the non-overlapped communication time were computed using the single-GPU results. The efficiency is good, with at least 87.3%, and appears to benefit from surface-to-volume effects. In Tab. 3, the third column reports the overall data throughput (intra-node and inter-node), the fourth column gives the amount of data transmitted over the interconnect per time step, and the last column reports the corresponding throughput. Both throughputs remain rather stable when the size of the domain increases from 256 to 576, only decreasing by about 20%, whereas the communication load increases by a factor of 5. Figure 7 displays the obtained performance results.

Figure 7: Performance for a 2 × 2 × 2 regular partition — Performance (MLUPS) as a function of the domain size (256 to 576); measured performance is plotted together with the 100% efficiency line.


Domain size (S)   Runtime (s)   Performance (MLUPS)   Throughput (GB/s)   Ratio to peak throughput
128                54.7          383.2                  59.2                57.6%
160               100.6          407.2                  62.9                61.2%
192               167.6          422.3                  65.2                63.5%
224               260.3          431.8                  66.7                64.9%
256               382.3          438.8                  67.8                66.0%
288               538.7          443.4                  68.5                66.7%

Table 1: Single-GPU performance
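The derived columns of Table 1 follow from the measured runtimes and Eq. (6); the short program below recomputes them (up to the rounding of the published runtimes), using the 102.7 GB/s bandwidthTest figure quoted above as the peak value.

#include <stdio.h>

/* Recomputes the derived columns of Table 1 from the measured runtimes:
 * Eq. (6) gives (38W + 20)/W = 38.625 words of 4 bytes moved per node and
 * per time step. Small deviations from the table come from the rounding of
 * the published runtimes. */
int main(void)
{
    const int    steps = 10000, W = 32;
    const double words = (38.0 * W + 20.0) / W;
    const double peak  = 102.7e9;                       /* bytes/s */
    const int    S[]   = {128, 160, 192, 224, 256, 288};
    const double T[]   = {54.7, 100.6, 167.6, 260.3, 382.3, 538.7};

    for (int i = 0; i < 6; ++i) {
        double nodes = (double)S[i] * S[i] * S[i];
        double mlups = nodes * steps / T[i] / 1e6;
        double bytes = mlups * 1e6 * words * 4.0;       /* bytes/s */
        printf("%3d  %6.1f MLUPS  %5.1f GB/s  %4.1f%% of peak\n",
               S[i], mlups, bytes / 1e9, 100.0 * bytes / peak);
    }
    return 0;
}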


Domain size (2S)   Runtime (s)   Perf. (MLUPS)   Parallel efficiency
256                 62.7          2,678           87.3%
320                114.5          2,862           87.9%
384                186.9          3,030           89.7%
448                289.6          3,105           89.9%
512                418.7          3,206           91.3%
576                587.0          3,256           91.8%

Table 2: Performance for a 2 × 2 × 2 regular partition

Domain size (2S)   Communication (s)   Inter-GPU (GB/s)   Transmission (MB)   Inter-node (GB/s)
256                  7.9                9.9                 5.0                 6.2
320                 13.9                8.9                 7.8                 5.5
384                 19.3                9.6                11.3                 5.9
448                 29.3                8.2                15.3                 5.1
512                 36.4                8.6                20.0                 5.4
576                 48.3                8.2                25.3                 5.1

Table 3: Data throughput for a 2 × 2 × 2 regular partition


4.2. Scalability

In order to study scalability, both weak and strong, we considered seven different partition types with an increasing number of sub-domains. Weak scalability represents the ability to solve larger problems with larger resources, whereas strong scalability accounts for the ability to solve a problem faster using more resources. For weak scalability, we used cubic sub-domains of size 128, and for strong scalability, we used a computation domain of constant size 384 with cuboid sub-domains. Table 4 gives all the details of the tested configurations.

For our weak scaling test, we use fixed-size sub-domains so that the number of processed nodes increases linearly with the number of GPUs. We chose a small, although realistic, sub-domain size in order to reduce as much as possible favourable surface-to-volume effects. Since the workload per GPU is fixed, perfect scaling is achieved when the runtime remains constant. The results of the test are gathered in Tab. 5. Efficiency was computed using the runtime of the smallest tested configuration. Figure 8 displays the runtime with respect to the number of GPUs. As illustrated by this diagram, the weak scalability of our solver is satisfying, taking into account that the volume of communication increases by a factor of up to 11.5. It is worth noting that in this test, the configuration using 18 GPUs performs better than the configuration using 16 GPUs. It seems that a better node occupancy (i.e. three sub-domains per node instead of two) has a positive impact on performance. However, this hypothesis needs to be confirmed by large-scale experiments.

Figure 8: Runtime for the weak scaling test — Runtime (s) as a function of the number of GPUs (4 to 24). Perfect weak scaling would result in a horizontal straight line.


Number of GPUs   Nodes × GPUs   Partition type   Domain (weak scal.)   Sub-dom. (strong scal.)
 4               2 × 2          1 × 2 × 2        128 × 256 × 256       384 × 192 × 192
 6               2 × 3          1 × 3 × 2        128 × 384 × 256       384 × 128 × 192
 8               4 × 2          2 × 2 × 2        256 × 256 × 256       192 × 192 × 192
12               4 × 3          2 × 3 × 2        256 × 384 × 256       192 × 128 × 192
16               8 × 2          2 × 4 × 2        256 × 512 × 256       192 × 96 × 192
18               6 × 3          2 × 3 × 3        256 × 384 × 384       192 × 128 × 128
24               8 × 3          2 × 4 × 3        256 × 512 × 384       192 × 96 × 128

Table 4: Configuration details for the scaling tests


Number of GPUs   Runtime (s)   Efficiency   Performance (MLUPS)   Perf. per GPU (MLUPS)
 4               59.8          100%         1402                  350.5
 6               64.2           93%         1959                  326.6
 8               62.7           95%         2676                  334.5
12               66.8           90%         3767                  313.9
16               71.1           84%         4721                  295.1
18               67.0           89%         5634                  313.0
24               73.2           82%         6874                  286.4

Table 5: Runtime and efficiency for the weak scaling test

Number of GPUs   Runtime (s)   Efficiency   Performance (MLUPS)   Perf. per GPU (MLUPS)
 4               335.0         100%         1690                  422.6
 6               241.9          92%         2341                  390.1
 8               186.1          90%         3043                  380.3
12               134.7          83%         4204                  350.3
16               109.9          76%         5152                  322.0
18                98.4          76%         5753                  319.6
24                80.3          70%         7053                  293.9

Table 6: Runtime and efficiency for the strong scaling test


Figure 9: Runtime for the strong scaling test — Log-log plot of the runtime (s) as a function of the number of GPUs; perfect strong scaling is indicated by the solid red line.

In our strong scalability test, we consider a fixed computation domain processed using an increasing number of computing devices. As a consequence, the volume of communication increases by a factor of up to three, while the size of the sub-domains decreases, leading to less favourable configurations for the computation kernel. The results of the strong scaling test are given in Tab. 6. The runtime with respect to the number of GPUs is represented in Fig. 9 using a log-log diagram. As shown by the trend line, the runtime closely obeys a power law, the correlation coefficient for the log-log regression line being below −0.999. The obtained scaling exponent is approximately −0.8, whereas perfect strong scalability corresponds to an exponent of −1. We may conclude that the strong scalability of our code is good, given the fairly small size of the computation domain.
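The exponent and correlation coefficient quoted above can be recovered from Table 6 with an ordinary least-squares fit in log-log space, for instance as follows.

#include <stdio.h>
#include <math.h>

/* Least-squares fit of log(runtime) against log(number of GPUs) for the
 * strong scaling data of Table 6; the slope is the scaling exponent
 * (about -0.8) and r the correlation coefficient. */
int main(void)
{
    const double g[] = {4, 6, 8, 12, 16, 18, 24};
    const double t[] = {335.0, 241.9, 186.1, 134.7, 109.9, 98.4, 80.3};
    const int n = 7;
    double sx = 0, sy = 0, sxx = 0, syy = 0, sxy = 0;

    for (int i = 0; i < n; ++i) {
        double x = log(g[i]), y = log(t[i]);
        sx += x; sy += y; sxx += x * x; syy += y * y; sxy += x * y;
    }
    double slope = (n * sxy - sx * sy) / (n * sxx - sx * sx);
    double r     = (n * sxy - sx * sy) /
                   sqrt((n * sxx - sx * sx) * (n * syy - sy * sy));
    printf("exponent = %.3f, r = %.4f\n", slope, r);
    return 0;
}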

5. Conclusion

In this paper, we describe the implementation of an efficient and scalable LBM solver for GPU clusters. Our code relies on three main components that were developed for that purpose: a CUDA computation kernel, a set of MPI initialisation routines, and a set of MPI communication routines. The computation kernel's most important feature is the ability to efficiently exchange data in all spatial directions, making possible the use of 3D partitions of the computation domain.


The initialisation routines are designed to distribute the workload across the cluster in a flexible way, following the specifications contained in a configuration file. The communication routines manage to pass data between sub-domains efficiently, performing reordering and partial propagation. These new components were devised as key parts of the TheLMA framework [1], whose main purpose is to facilitate the development of LBM solvers for the GPU. The performance obtained on rather affordable hardware such as small GPU clusters makes it possible to carry out large scale simulations in reasonable time and at moderate cost. We believe these advances will benefit many potential applications of the LBM. Moreover, we expect our approach to be sufficiently generic to apply to a wide range of stencil computations, and therefore to be suitable for numerous applications that operate on a regular grid.

Although the performance and scalability of our solver are good, we believe there is still room for improvement. Possible enhancements include better overlapping between communication and computation, and more efficient communication between sub-domains. For now, only transactions to the send and read buffers may overlap kernel computations; the communication phase starts once the computation phase is completed. One possible solution to improve overlapping would be to split the sub-domains into seven zones: six external zones, one for each face of the sub-domain, and one internal zone for the remainder. Processing the external zones first would allow the communication phase to start while the internal zone is still being processed.

Regarding improvements to the communication phase, we are considering three paths to explore. First of all, we plan to build on the concepts presented in [7] and [8] to improve data transfers involving page-locked buffers. Secondly, we intend to evaluate the optimisation proposed by Fan et al. in [6], which consists in performing data exchange in several synchronous steps, one for each face of the sub-domains, the data corresponding to the edges being transferred in two steps. Last, following [3], we plan to implement a benchmark program able to search heuristically for efficient execution layouts for a given computation domain and to automatically generate the configuration file corresponding to the most efficient one.

Acknowledgments

The authors wish to thank the INRIA PlaFRIM team for allowing us to test our executables on the Mirage GPU cluster.

References

[1] Thermal LBM on Many-core Architectures. www.thelma-project.info.

[2] S. Albensoeder and H. C. Kuhlmann. Accurate three-dimensional lid-driven cavity flow. Journal of Computational Physics, 206(2):536–558, 2005.

[3] R. Clint Whaley, A. Petitet, and J. J. Dongarra. Automated empirical optimizations of software and the ATLAS project. Parallel Computing, 27(1):3–35, 2001.

[4] D. Crockford. The application/json Media Type for JavaScript Object Notation (JSON), RFC 4627. Internet Engineering Task Force, 2006.

[5] D. d'Humières, I. Ginzburg, M. Krafczyk, P. Lallemand, and L.-S. Luo. Multiple-relaxation-time lattice Boltzmann models in three dimensions. Philosophical Transactions of the Royal Society A, 360:437–451, 2002.

[6] Z. Fan, F. Qiu, A. Kaufman, and S. Yoakum-Stover. GPU cluster for high performance computing. In Proceedings of the 2004 ACM/IEEE Conference on Supercomputing, pages 47–58. IEEE, 2004.

[7] P. Geoffray, L. Prylli, and B. Tourancheau. BIP-SMP: High performance message passing over a cluster of commodity SMPs. In Proceedings of the 1999 ACM/IEEE Conference on Supercomputing, pages 20–38. ACM, 1999.

[8] P. Geoffray, C. Pham, and B. Tourancheau. A Software Suite for High-Performance Communications on Clusters of SMPs. Cluster Computing, 5(4):353–363, 2002.

[9] X. He and L.-S. Luo. Theory of the lattice Boltzmann method: From the Boltzmann equation to the lattice Boltzmann equation. Physical Review E, 56(6):6811–6817, 1997.

[10] W. Li, X. Wei, and A. Kaufman. Implementing lattice Boltzmann computation on graphics hardware. The Visual Computer, 19(7):444–456, 2003.

[11] Compute Unified Device Architecture Programming Guide, version 4.0. NVIDIA, June 2011.

[12] C. Obrecht, F. Kuznik, B. Tourancheau, and J.-J. Roux. Global Memory Access Modelling for Efficient Implementation of the LBM on GPUs. In Lecture Notes in Computer Science 6449, High Performance Computing for Computational Science, VECPAR 2010 Revised Selected Papers, pages 151–161. Springer, 2011.

[13] C. Obrecht, F. Kuznik, B. Tourancheau, and J.-J. Roux. A New Approach to the Lattice Boltzmann Method for Graphics Processing Units. Computers and Mathematics with Applications, 61(12):3628–3638, 2011.

[14] C. Obrecht, F. Kuznik, B. Tourancheau, and J.-J. Roux. The TheLMA project: Multi-GPU Implementation of the Lattice Boltzmann Method. International Journal of High Performance Computing Applications, 25(3):295–303, August 2011.

[15] F. Song, S. Tomov, and J. Dongarra. Efficient Support for Matrix Computations on Heterogeneous Multi-core and Multi-GPU Architectures. Technical Report UT-CS-11-668, University of Tennessee, June 2011.

[16] J. Tölke and M. Krafczyk. TeraFLOP computing on a desktop PC with GPUs for 3D CFD. International Journal of Computational Fluid Dynamics, 22(7):443–456, 2008.

[17] X. Wang and T. Aoki. Multi-GPU performance of incompressible flow computation by lattice Boltzmann method on GPU cluster. Parallel Computing, 37(9):521–535, 2011.
