Lattice Boltzmann simulations on heterogeneous CPU-GPU clusters

H. Köstler, Ch. Feichtinger

Computer Science X - System Simulation Group Harald Köstler ([email protected])

2nd International Symposium “Computer Simulations on GPU”, Freudenstadt, 29.05.2013




Contents

Motivation

waLBerla software concepts

LBM simulations on Tsubame

Future Work


Computational Science and Engineering @ LSS


Applications • Multiphysics • fluid, structure • medical imaging • laser

Applied Math • LBM • multigrid • FEM • numerics

Computer Science • HPC / hardware • Performance engineering • software engineering

USE_SweepSection( getLBMsweepUID() )
USE_Sweep()
    swUseFunction( "LBM", sweep::LBMsweep, FSUIDSet::all(), hsCPU, BSUIDSet::all() );
USE_After()
    // Communication


Problems

Hardware: Modern HPC clusters are massively parallel at the intra-core, intra-node, and inter-node levels

Software: Applications become more complex with increasing computational power

More complex (physical) models

Code development in interdisciplinary teams

Algorithm: Many variants exist. Components and parameters depend on the computational domain or grid, the type of problem, …


WALBERLA Applications


waLBerla: parallel block-structured grid framework


waLBerla @ GPU


Geometric multigrid solver on Tsubame

Computational Steering (VIPER)

CFD, fluid-structure interaction

[Plot: runtime in ms versus unknowns in millions (0 to 3500)]


Boltzmann equation

Mesoscopic approach to solving the Navier-Stokes equations

Boltzmann equation describes the statistical distribution of one particle in a fluid

f is the probability distribution function (PDF), ζ is the particle velocity, and Ω(f) is the change due to collisions

Models behavior of fluids in statistical physics

Lattice Boltzmann Method (LBM) solves the discrete Boltzmann equation

$$\partial_t f + \zeta \cdot \nabla f = \Omega(f)$$
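
As a rough illustration of what an LBM kernel computes per lattice cell, the sketch below applies a single-relaxation-time (BGK) collision to the 19 PDFs of one D3Q19 cell. The BGK operator, the relaxation rate `omega`, the velocity ordering, and all names (`Cell`, `equilibrium`, `collideBGK`, `density`) are assumptions made for illustration; the slides do not state which collision operator or data layout waLBerla uses.

```cpp
#include <array>
#include <cassert>

// D3Q19 lattice: 1 rest velocity, 6 axis-aligned and 12 diagonal velocities.
// The ordering below is an arbitrary choice for this sketch.
constexpr int Q = 19;
constexpr int cx[Q] = {0, 1,-1, 0, 0, 0, 0, 1,-1, 1,-1, 1,-1, 1,-1, 0, 0, 0, 0};
constexpr int cy[Q] = {0, 0, 0, 1,-1, 0, 0, 1, 1,-1,-1, 0, 0, 0, 0, 1,-1, 1,-1};
constexpr int cz[Q] = {0, 0, 0, 0, 0, 1,-1, 0, 0, 0, 0, 1, 1,-1,-1, 1, 1,-1,-1};

// Standard D3Q19 lattice weights: 1/3 (rest), 1/18 (axis), 1/36 (diagonal).
constexpr double weight(int i) {
    return i == 0 ? 1.0 / 3.0 : (i <= 6 ? 1.0 / 18.0 : 1.0 / 36.0);
}

using Cell = std::array<double, Q>;  // the 19 PDFs of one lattice cell

// Second-order equilibrium distribution for density rho and velocity u.
inline double equilibrium(int i, double rho, double ux, double uy, double uz) {
    const double cu = cx[i] * ux + cy[i] * uy + cz[i] * uz;
    const double uu = ux * ux + uy * uy + uz * uz;
    return weight(i) * rho * (1.0 + 3.0 * cu + 4.5 * cu * cu - 1.5 * uu);
}

// Density of a cell: zeroth moment of the PDFs.
inline double density(const Cell& f) {
    double rho = 0.0;
    for (int i = 0; i < Q; ++i) rho += f[i];
    return rho;
}

// Collide step (assumed BGK): relax every PDF towards its local equilibrium.
inline void collideBGK(Cell& f, double omega) {
    assert(omega > 0.0 && omega < 2.0);
    const double rho = density(f);
    assert(rho > 0.0);
    double jx = 0.0, jy = 0.0, jz = 0.0;  // momentum: first moment of the PDFs
    for (int i = 0; i < Q; ++i) {
        jx += cx[i] * f[i];
        jy += cy[i] * f[i];
        jz += cz[i] * f[i];
    }
    const double ux = jx / rho, uy = jy / rho, uz = jz / rho;
    for (int i = 0; i < Q; ++i)
        f[i] += omega * (equilibrium(i, rho, ux, uy, uz) - f[i]);
}
```

In a full kernel, the stream step would then copy each post-collision PDF `f[i]` to the neighbouring cell in direction `(cx[i], cy[i], cz[i])`. The collision conserves the cell's density and momentum, which is a convenient property to check in a test.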


Particulate Flow Simulation


D3Q19 LBM cell Collide and Stream

$$F = m \cdot a, \qquad M = J \cdot \alpha$$


WALBERLA CPU-GPU cluster software concepts


waLBerla framework

Main goal: provide a massively parallel and efficient software framework for multi-physics simulations

waLBerla is mainly designed for HPC clusters


waLBerla (C++): code management, standard implementations

Low-level kernels for optimized architecture-specific computations (in C++, CUDA, Assembler)


waLBerla: Block concept


waLBerla: Sweep concept


Challenges on heterogeneous clusters I

Problem: Description of the heterogeneous compute resources
Solution: Description of all compute components per compute node in the input file

Problem: Management of the communication and compute kernels for each architecture
Solution: Kernel management based on metadata


Challenges on heterogeneous clusters II

Problem: Common communication interface
Solution: Data exchange via communication buffers, also for intra-node communication

Problem: Minimization of the heterogeneous communication overhead
Solution: Overlapping of work and communication, non-uniform domain decomposition, and intra-node communication in z-dimension


waLBerla: Communication concept


Overlapping of work and communication


waLBerla: Subblocks

Assumption: A block corresponds to a (shared-memory) compute node

Can possibly be heterogeneous (CPU + GPU)

Distributed memory communication (via MPI) is not required within one block

Divide one block into subblocks of different sizes for (static) load balancing

Subblocks map to (local) devices
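
One way to picture this static load balancing is to split a block's z-extent among the local devices in proportion to their measured LBM throughput. The sketch below does so with largest-remainder rounding. The function name `splitBlockZ`, the use of relative throughput as the weight, and the rounding rule are illustrative assumptions, not waLBerla's actual API or algorithm.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Split nz cell layers among local devices in proportion to their relative
// throughput. Returns the number of layers assigned to each device.
// (Hypothetical helper for illustration only.)
inline std::vector<int> splitBlockZ(int nz, const std::vector<double>& throughput) {
    assert(nz >= 0 && !throughput.empty());
    double total = 0.0;
    for (double t : throughput) {
        assert(t > 0.0);
        total += t;
    }

    std::vector<int> layers(throughput.size());
    std::vector<double> remainder(throughput.size());
    int assigned = 0;
    for (std::size_t d = 0; d < throughput.size(); ++d) {
        const double exact = nz * throughput[d] / total;
        layers[d] = static_cast<int>(exact);  // round down first
        remainder[d] = exact - layers[d];
        assigned += layers[d];
    }
    // Hand the leftover layers to the devices with the largest remainders.
    for (; assigned < nz; ++assigned) {
        std::size_t best = 0;
        for (std::size_t d = 1; d < remainder.size(); ++d)
            if (remainder[d] > remainder[best]) best = d;
        ++layers[best];
        remainder[best] = -1.0;  // prefer devices that have not received an extra layer
    }
    return layers;
}
```

For example, with a CPU and a GPU whose throughputs are assumed to be in a 1:4 ratio, `splitBlockZ(100, {1.0, 4.0})` assigns 20 layers to the CPU subblock and 80 to the GPU subblock.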


Domain decomposition on one compute node


RESULTS LBM Simulations on Tsubame 2.0


Tsubame 2.0 in Japan

Compute nodes: 1442

Processor: Intel Xeon X5670

GPU: 3 x Nvidia Tesla M2050

Peak performance: 2.2 PFlop/s, 633 TB/s memory bandwidth

LINPACK performance: 1.2 PFlop/s

Power consumption: 1.4 MW

Interconnect: QDR Infiniband


Performance Engineering


Create performance model

Identify performance bottlenecks

Create problem-specific, hardware-dependent, and highly efficient kernels

Integrate them into the software framework

Algorithm Hardware


Input

Algorithm: LBM kernel

Generic Implementation

Hardware information (bandwidth, peak performance)

Assumption

Computation time limited by memory bandwidth and instruction throughput

Communication time limited by network bandwidth and latency (for direct and collective communication)

Performance Model I

$$t_{\text{total}} = t_{\text{comp,outer}} + \max\left(t_{\text{comp,inner}},\; t_{\text{buffer}} + t_{\text{comm,CPU-GPU}} + t_{\text{comm,MPI}}\right)$$
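
The model can be evaluated directly once the individual times are known. The sketch below reads it as t_total = t_comp,outer + max(t_comp,inner, t_buffer + t_comm,CPU-GPU + t_comm,MPI), i.e. the inner computation overlaps with buffer handling and communication. The struct, its field names, and the interpretation of each term in the comments are assumptions for illustration.

```cpp
#include <algorithm>
#include <cassert>

// Times for one time step, all in the same unit (e.g. seconds).
// Field meanings are this sketch's interpretation of the model's subscripts.
struct StepTimes {
    double compOuter;   // t_comp,outer: update of the outer (boundary) cells
    double compInner;   // t_comp,inner: update of the inner cells
    double buffer;      // t_buffer: handling of the communication buffers
    double commCpuGpu;  // t_comm,CPU-GPU: transfers between CPU and GPU
    double commMpi;     // t_comm,MPI: MPI communication
};

// Total time per step: the outer computation is serial, while the inner
// computation overlaps with buffer handling and communication.
inline double totalTime(const StepTimes& t) {
    assert(t.compOuter >= 0.0 && t.compInner >= 0.0 && t.buffer >= 0.0 &&
           t.commCpuGpu >= 0.0 && t.commMpi >= 0.0);
    return t.compOuter +
           std::max(t.compInner, t.buffer + t.commCpuGpu + t.commMpi);
}
```

With made-up values `StepTimes{1.0, 5.0, 0.5, 1.0, 1.5}`, communication (3.0 in total) is fully hidden behind the inner computation and `totalTime` returns 6.0. If the inner computation drops to 2.0, communication dominates and the result is 4.0.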


Single node performance on Tsubame

Machine balance

Code balance

Lightspeed estimate (if l < 1, the code is bandwidth-limited)

Performance Model II


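$$B_m = \frac{\text{sustainable bandwidth}}{\text{peak performance}}$$

$$B_c = \frac{\text{no. of bytes loaded and stored}}{\text{no. of FLOPS executed}} = \frac{304}{200}$$

$$l = \min\left(1, \frac{B_m}{B_c}\right)$$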
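
The three quantities can be computed in a few lines. The sketch below uses the code balance of 304 bytes per 200 FLOPS given on the slide. Treating these counts as per-cell-update figures, and all function and constant names, are assumptions for illustration.

```cpp
#include <algorithm>
#include <cassert>

// Counts from the slide, assumed here to refer to one lattice cell update.
constexpr double kBytesPerCellUpdate = 304.0;
constexpr double kFlopsPerCellUpdate = 200.0;

// Code balance B_c in bytes per FLOP.
inline double codeBalance() {
    return kBytesPerCellUpdate / kFlopsPerCellUpdate;
}

// Machine balance B_m: sustainable bandwidth [bytes/s] / peak performance [FLOP/s].
inline double machineBalance(double bandwidth, double peakFlops) {
    assert(bandwidth > 0.0 && peakFlops > 0.0);
    return bandwidth / peakFlops;
}

// Lightspeed estimate l = min(1, B_m / B_c); l < 1 means bandwidth-limited.
inline double lightspeed(double bm, double bc) {
    assert(bm > 0.0 && bc > 0.0);
    return std::min(1.0, bm / bc);
}

// Upper bound on cell updates per second for a bandwidth-limited kernel.
inline double maxCellUpdatesPerSecond(double bandwidth) {
    assert(bandwidth > 0.0);
    return bandwidth / kBytesPerCellUpdate;
}
```

As a purely hypothetical example, a device with 100 GB/s sustainable bandwidth and 500 GFLOP/s peak performance has B_m = 0.2 bytes/FLOP, so l is about 0.13 and the kernel is bandwidth-limited, with an upper bound of roughly 329 million cell updates per second.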


Single Compute Node Performance I


Single Compute Node Performance II


Single Compute Node Performance III


Single Compute Node Performance IV


Communication model


$$t_{\text{total}} = t_{\text{comp,outer}} + \max\left(t_{\text{comp,inner}},\; t_{\text{buffer}} + t_{\text{comm,CPU-GPU}} + t_{\text{comm,MPI}}\right)$$

Communication time for one message depends on
• the size of the message, s
• the number of messages, x, that are concurrently transferred over the communication link (communication pattern)
• the type of communication link, ω
• the relative position of the communication partners, p (e.g. intra- or inter-node communication)
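
The slides do not give the functional form of this dependence. As one possible illustration, the sketch below assumes a simple latency-bandwidth model in which the x concurrent messages share the link bandwidth equally, and in which the link type ω and partner position p only select the latency and bandwidth values. The `Link` struct and `messageTime` function are hypothetical.

```cpp
#include <cassert>

// Hypothetical link description. In the model above, these values would be
// chosen according to the link type (omega) and partner position (p).
struct Link {
    double latency;    // seconds per message
    double bandwidth;  // bytes per second
};

// Assumed model: time for one message of s bytes when x messages are
// transferred concurrently and share the link bandwidth equally.
inline double messageTime(double s, int x, const Link& link) {
    assert(s >= 0.0 && x >= 1);
    assert(link.latency >= 0.0 && link.bandwidth > 0.0);
    return link.latency + s * x / link.bandwidth;
}
```

In practice the latency and bandwidth for each combination of ω and p would be measured with benchmarks on the target machine rather than taken from data sheets.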


Weak scaling, 3 GPUs per node


Strong scaling, 3 GPUs per node


Test case: Packed bed of hollow cylinders


Porous media: 100x100x1536, 1D domain decomposition


Porous media: 100x100x1536, 1D domain decomposition


Porous media: 100x100x1536, 1D/2D/3D


Porous media: 256x256x3600, 1D/2D


Current Work

Focus in waLBerla currently on Juqueen and SuperMUC


Future Work

Tests on Nvidia Kepler cluster

Programming paradigms on future HPC clusters?

Code generation techniques to improve portability

Dynamic load balancing
