optimization and parallelization of the boundary element...

144
Optimization and Parallelization of the Boundary Element Method for the Wave Equation in Time Domain erenger Bramas Inria, Bordeaux - Sud-Ouest PhD defense - Feb. 15th 2016 Advisor: Oliver Coulaud (Inria) Industrial co-advisor: Guillaume Sylvand (Airbus) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Upload: lenguyet

Post on 18-Jul-2018

230 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Optimization and Parallelization of the Boundary Element ...berenger.eu/blog/wp-content/uploads/2010/01/Berenger-Slides.pdf · Context Problem Formulation Matrix Approach FMM Approach

Optimization and Parallelization of the BoundaryElement Method for the Wave Equation

in Time Domain

Berenger Bramas

Inria, Bordeaux - Sud-Ouest

PhD defense - Feb. 15th 2016

Advisor: Oliver Coulaud (Inria)

Industrial co-advisor: Guillaume Sylvand (Airbus)

................ ..................... ..................... ..................... ..................... ................

Page 2: Optimization and Parallelization of the Boundary Element ...berenger.eu/blog/wp-content/uploads/2010/01/Berenger-Slides.pdf · Context Problem Formulation Matrix Approach FMM Approach

..........

.....

....................................................................................

......

.....

.....

.

Context Problem Formulation Matrix Approach FMM Approach Conclusion & Perspectives

(2)

Context

Optimization and Parallelization of the Boundary Element Method for the Wave Equation in Time Domain Berenger Bramas

Page 3: Optimization and Parallelization of the Boundary Element ...berenger.eu/blog/wp-content/uploads/2010/01/Berenger-Slides.pdf · Context Problem Formulation Matrix Approach FMM Approach

..........

.....

....................................................................................

......

.....

.....

.

Context Problem Formulation Matrix Approach FMM Approach Conclusion & Perspectives

(3)

Wave Equation Problems

• Study the wave propagation in acoustics or electromagnetism

• Critical in several industrial fields (design, robustness study)

In our case: antenna placement, electromagnetic compatibility,furtivity, lightning, ...

Image from Airbus Group.

Optimization and Parallelization of the Boundary Element Method for the Wave Equation in Time Domain Berenger Bramas

Page 4: Optimization and Parallelization of the Boundary Element ...berenger.eu/blog/wp-content/uploads/2010/01/Berenger-Slides.pdf · Context Problem Formulation Matrix Approach FMM Approach

..........

.....

....................................................................................

......

.....

.....

.

Context Problem Formulation Matrix Approach FMM Approach Conclusion & Perspectives

(4)

Wave Equation Simulations

Boundary element method (BEM): integral equation over adiscretized mesh

Interest of BEM compared to other approaches

• Better accuracy

• Surfacic mesh (easier to produce)

Disadvantages of BEM

• Dense matrices (specific solvers)

Optimization and Parallelization of the Boundary Element Method for the Wave Equation in Time Domain Berenger Bramas

Page 5: Optimization and Parallelization of the Boundary Element ...berenger.eu/blog/wp-content/uploads/2010/01/Berenger-Slides.pdf · Context Problem Formulation Matrix Approach FMM Approach

..........

.....

....................................................................................

......

.....

.....

.

Context Problem Formulation Matrix Approach FMM Approach Conclusion & Perspectives

(5)

BEM

Boundary Element Method (BEM) for the Wave Equation

In Time Domain (TD) In Frequency Domain (FD)

One solve = a range of frequencies(Fourier Transform)

One solve = one frequency

Advantages/disadvantages depend on the application/configuration

Optimization and Parallelization of the Boundary Element Method for the Wave Equation in Time Domain Berenger Bramas

Page 6: Optimization and Parallelization of the Boundary Element ...berenger.eu/blog/wp-content/uploads/2010/01/Berenger-Slides.pdf · Context Problem Formulation Matrix Approach FMM Approach

..........

.....

....................................................................................

......

.....

.....

.

Context Problem Formulation Matrix Approach FMM Approach Conclusion & Perspectives

(5)

BEM

Boundary Element Method (BEM) for the Wave Equation

In Time Domain (TD) In Frequency Domain (FD)

Accelerated by FMM (Fast-BEM)

Accelerated by H-Matrix

Accelerated by FMM

Dense BEM/Matrix Approach Dense BEM

......

One solve = a range of frequencies One solve = one frequency

Advantages/disadvantages depend on the application/configuration

Optimization and Parallelization of the Boundary Element Method for the Wave Equation in Time Domain Berenger Bramas

Page 7: Optimization and Parallelization of the Boundary Element ...berenger.eu/blog/wp-content/uploads/2010/01/Berenger-Slides.pdf · Context Problem Formulation Matrix Approach FMM Approach

..........

.....

....................................................................................

......

.....

.....

.

Context Problem Formulation Matrix Approach FMM Approach Conclusion & Perspectives

(5)

BEM

Boundary Element Method (BEM) for the Wave Equation

In Time Domain (TD) In Frequency Domain (FD)

Accelerated by FMM (Fast-BEM)

Accelerated by H-Matrix

Accelerated by FMM

Dense BEM/Matrix Approach Dense BEM

......

One solve = a range of frequencies One solve = one frequency

Less studied and used Widely used (academy and industry)

Advantages/disadvantages depend on the application/configuration

Optimization and Parallelization of the Boundary Element Method for the Wave Equation in Time Domain Berenger Bramas

Page 8: Optimization and Parallelization of the Boundary Element ...berenger.eu/blog/wp-content/uploads/2010/01/Berenger-Slides.pdf · Context Problem Formulation Matrix Approach FMM Approach

..........

.....

....................................................................................

......

.....

.....

.

Context Problem Formulation Matrix Approach FMM Approach Conclusion & Perspectives

(5)

BEM

Boundary Element Method (BEM) for the Wave Equation

In Time Domain (TD) In Frequency Domain (FD)

Accelerated by FMM (Fast-BEM)

Accelerated by H-Matrix

Accelerated by FMM

Dense BEM/Matrix Approach Dense BEM

......

One solve = a range of frequencies One solve = one frequency

Less studied and used Widely used (academy and industry)

Advantages/disadvantages depend on the application/configuration

Optimization and Parallelization of the Boundary Element Method for the Wave Equation in Time Domain Berenger Bramas

Page 9: Optimization and Parallelization of the Boundary Element ...berenger.eu/blog/wp-content/uploads/2010/01/Berenger-Slides.pdf · Context Problem Formulation Matrix Approach FMM Approach

..........

.....

....................................................................................

......

.....

.....

.

Context Problem Formulation Matrix Approach FMM Approach Conclusion & Perspectives

(6)

Industrial Context

• In partnership with Airbus Group Innovation (financed jointlywith Region Aquitaine)

• Airbus solvers:• FD-BEM

• Accelerated by FMM or H-Matrix techniques

• TD-BEM (experimental)• No stability problem (formulation based on a full Galerkin

discretization unconditionnaly stable from [Terrasse, 1993])• With FMM [Ergin et al., 2000] (trial)

Objective:

• Reduce the performance gap between FD and TD approaches

Optimization and Parallelization of the Boundary Element Method for the Wave Equation in Time Domain Berenger Bramas

Page 10: Optimization and Parallelization of the Boundary Element ...berenger.eu/blog/wp-content/uploads/2010/01/Berenger-Slides.pdf · Context Problem Formulation Matrix Approach FMM Approach

..........

.....

....................................................................................

......

.....

.....

.

Context Problem Formulation Matrix Approach FMM Approach Conclusion & Perspectives

(6)

Industrial Context

• In partnership with Airbus Group Innovation (financed jointlywith Region Aquitaine)

• Airbus solvers:• FD-BEM

• Accelerated by FMM or H-Matrix techniques

• TD-BEM (experimental)• No stability problem (formulation based on a full Galerkin

discretization unconditionnaly stable from [Terrasse, 1993])• With FMM [Ergin et al., 2000] (trial)

Objective:

• Reduce the performance gap between FD and TD approaches

Optimization and Parallelization of the Boundary Element Method for the Wave Equation in Time Domain Berenger Bramas

Page 11: Optimization and Parallelization of the Boundary Element ...berenger.eu/blog/wp-content/uploads/2010/01/Berenger-Slides.pdf · Context Problem Formulation Matrix Approach FMM Approach

..........

.....

....................................................................................

......

.....

.....

.

Context Problem Formulation Matrix Approach FMM Approach Conclusion & Perspectives

(7)

HPCSuper-computers are mandatory to solve large problems

• Shared/Distributed memory

• Heterogeneous (one or more GPU per node)

CPU CPU

GPU

Node

CPU CPU

GPU

Node

CPU CPU

GPU

Node

CPU CPU

GPU

Node

...

Cluster

Some of the challenges

• Efficient computational algorithm/kernel

• Parallelization

• Balancing

• Hardware abstraction, portable implementation, long-termdevelopment, ...

Optimization and Parallelization of the Boundary Element Method for the Wave Equation in Time Domain Berenger Bramas

Page 12: Optimization and Parallelization of the Boundary Element ...berenger.eu/blog/wp-content/uploads/2010/01/Berenger-Slides.pdf · Context Problem Formulation Matrix Approach FMM Approach

..........

.....

....................................................................................

......

.....

.....

.

Context Problem Formulation Matrix Approach FMM Approach Conclusion & Perspectives

(7)

HPCSuper-computers are mandatory to solve large problems

• Shared/Distributed memory

• Heterogeneous (one or more GPU per node)

CPU CPU

GPU

Node

CPU CPU

GPU

Node

CPU CPU

GPU

Node

CPU CPU

GPU

Node

...

Cluster

Some of the challenges

• Efficient computational algorithm/kernel

• Parallelization

• Balancing

• Hardware abstraction, portable implementation, long-termdevelopment, ...

Optimization and Parallelization of the Boundary Element Method for the Wave Equation in Time Domain Berenger Bramas

Page 13: Optimization and Parallelization of the Boundary Element ...berenger.eu/blog/wp-content/uploads/2010/01/Berenger-Slides.pdf · Context Problem Formulation Matrix Approach FMM Approach

..........

.....

....................................................................................

......

.....

.....

.

Context Problem Formulation Matrix Approach FMM Approach Conclusion & Perspectives

(8)

Outline

• Problem Formulation

• BEM Solver (Matrix Approach)

• Fast-Multipole Method Approach• FMM Algorithm & Parallelization• FMM BEM Solver (Experimental Implementation)

• Conclusion & Perspectives

Optimization and Parallelization of the Boundary Element Method for the Wave Equation in Time Domain Berenger Bramas

Page 14: Optimization and Parallelization of the Boundary Element ...berenger.eu/blog/wp-content/uploads/2010/01/Berenger-Slides.pdf · Context Problem Formulation Matrix Approach FMM Approach

..........

.....

....................................................................................

......

.....

.....

.

Context Problem Formulation Matrix Approach FMM Approach Conclusion & Perspectives

(9)

TD-BEM Application Stages

User inputs, simulation parameters↓

Mesh generator, configuration↓

Solver↓

Post-processing (TD → FD)

Optimization and Parallelization of the Boundary Element Method for the Wave Equation in Time Domain Berenger Bramas

Page 15: Optimization and Parallelization of the Boundary Element ...berenger.eu/blog/wp-content/uploads/2010/01/Berenger-Slides.pdf · Context Problem Formulation Matrix Approach FMM Approach

..........

.....

....................................................................................

......

.....

.....

.

Context Problem Formulation Matrix Approach FMM Approach Conclusion & Perspectives

TD-BEM Formulation (10)

Linear FormulationNotations:

• δΩ discretized in N unknowns/degrees of freedom

• Mk : the convolution matrices (dimension N × N) - input• ln: the incident wave emitted by a source on the unknowns attime step n - input

• an: the state of the system at time step n - to compute

Convolution system:

M0 · an +Kmax∑k≥1

Mk · an−k = ln (1)

Ω

ln

Solve at each time step:

an = (M0)−1

(ln −

Kmax∑k=1

Mk · an−k

)(2)

Optimization and Parallelization of the Boundary Element Method for the Wave Equation in Time Domain Berenger Bramas

Page 16: Optimization and Parallelization of the Boundary Element ...berenger.eu/blog/wp-content/uploads/2010/01/Berenger-Slides.pdf · Context Problem Formulation Matrix Approach FMM Approach

..........

.....

....................................................................................

......

.....

.....

.

Context Problem Formulation Matrix Approach FMM Approach Conclusion & Perspectives

TD-BEM Formulation (10)

Linear FormulationNotations:

• δΩ discretized in N unknowns/degrees of freedom• Mk : the convolution matrices (dimension N × N) - input• ln: the incident wave emitted by a source on the unknowns attime step n - input

• an: the state of the system at time step n - to compute

Convolution system:

M0 · an +Kmax∑k≥1

Mk · an−k = ln (1)

Ω

ln

Solve at each time step:

an = (M0)−1

(ln −

Kmax∑k=1

Mk · an−k

)(2)

Optimization and Parallelization of the Boundary Element Method for the Wave Equation in Time Domain Berenger Bramas

Page 17: Optimization and Parallelization of the Boundary Element ...berenger.eu/blog/wp-content/uploads/2010/01/Berenger-Slides.pdf · Context Problem Formulation Matrix Approach FMM Approach

..........

.....

....................................................................................

......

.....

.....

.

Context Problem Formulation Matrix Approach FMM Approach Conclusion & Perspectives

TD-BEM Formulation (10)

Linear FormulationNotations:

• δΩ discretized in N unknowns/degrees of freedom• Mk : the convolution matrices (dimension N × N) - input• ln: the incident wave emitted by a source on the unknowns attime step n - input

• an: the state of the system at time step n - to compute

Convolution system:

M0 · an +Kmax∑k≥1

Mk · an−k = ln (1)

Ω

ln

Solve at each time step:

an = (M0)−1

(ln −

Kmax∑k=1

Mk · an−k

)(2)

Optimization and Parallelization of the Boundary Element Method for the Wave Equation in Time Domain Berenger Bramas

Page 18: Optimization and Parallelization of the Boundary Element ...berenger.eu/blog/wp-content/uploads/2010/01/Berenger-Slides.pdf · Context Problem Formulation Matrix Approach FMM Approach

..........

.....

....................................................................................

......

.....

.....

.

Context Problem Formulation Matrix Approach FMM Approach Conclusion & Perspectives

TD-BEM Formulation (10)

Linear FormulationNotations:

• δΩ discretized in N unknowns/degrees of freedom

• Mk : the convolution matrices (dimension N × N) - input

• ln: the incident wave emitted by a source on the unknowns attime step n - input

• an: the state of the system at time step n - to compute

Convolution system:

M0 · an +Kmax∑k≥1

Mk · an−k = ln (1)

Solve at each time step:

an = (M0)−1

(ln −

Kmax∑k=1

Mk · an−k

)(2)

Optimization and Parallelization of the Boundary Element Method for the Wave Equation in Time Domain Berenger Bramas

Page 19: Optimization and Parallelization of the Boundary Element ...berenger.eu/blog/wp-content/uploads/2010/01/Berenger-Slides.pdf · Context Problem Formulation Matrix Approach FMM Approach

..........

.....

....................................................................................

......

.....

.....

.

Context Problem Formulation Matrix Approach FMM Approach Conclusion & Perspectives

TD-BEM Formulation (11)

Interaction/Convolution Matrices (Mk)

• Interactions between unknowns

• Symmetric and sparse, Mk(i , j) = 0 if distance(i , j) ≈ k .c.∆t

• Pre-computed (external tool)

M0 M1 M2 MKmax...

Optimization and Parallelization of the Boundary Element Method for the Wave Equation in Time Domain Berenger Bramas

Page 20: Optimization and Parallelization of the Boundary Element ...berenger.eu/blog/wp-content/uploads/2010/01/Berenger-Slides.pdf · Context Problem Formulation Matrix Approach FMM Approach

..........

.....

....................................................................................

......

.....

.....

.

Context Problem Formulation Matrix Approach FMM Approach Conclusion & Perspectives

TD-BEM Formulation (12)

Solve (Schematic View)

an = (M0)−1

(ln −

Kmax∑k=1

Mk · an−k

)(3)

= =

=

-

,

Linear Solver

sn ln sn sn~

sn~an M0

M1 M2M3M4M5

an-1an-2an-3an-4an-5

Optimization and Parallelization of the Boundary Element Method for the Wave Equation in Time Domain Berenger Bramas

Page 21: Optimization and Parallelization of the Boundary Element ...berenger.eu/blog/wp-content/uploads/2010/01/Berenger-Slides.pdf · Context Problem Formulation Matrix Approach FMM Approach

..........

.....

....................................................................................

......

.....

.....

.

Context Problem Formulation Matrix Approach FMM Approach Conclusion & Perspectives

TD-BEM Formulation (13)

SpMV (sparse matrix/vector product)Summation stage → Kmax SpMVs

• Permutation, advanced storages/kernels, blocking[White III and Sadayappan, 1997, Pinar and Heath, 1999,Pichel et al., 2005, Vuduc and Moon, 2005]

• Auto-tuning[Im and Yelick, 2001, Vuduc et al., 2005]

Low Flop-rate:• Memory bound operation

• Flop/Word hardware limit

• Irregular/not contiguous memory accesses• Instruction (pipelining, vectorization)

• Not appropriate for GPUs[Garland, 2008, Baskaran and Bordawekar, 2008,Bell and Garland, 2009]

Optimization and Parallelization of the Boundary Element Method for the Wave Equation in Time Domain Berenger Bramas

Page 22: Optimization and Parallelization of the Boundary Element ...berenger.eu/blog/wp-content/uploads/2010/01/Berenger-Slides.pdf · Context Problem Formulation Matrix Approach FMM Approach

..........

.....

....................................................................................

......

.....

.....

.

Context Problem Formulation Matrix Approach FMM Approach Conclusion & Perspectives

TD-BEM Formulation (13)

SpMV (sparse matrix/vector product)Summation stage → Kmax SpMVs

• Permutation, advanced storages/kernels, blocking[White III and Sadayappan, 1997, Pinar and Heath, 1999,Pichel et al., 2005, Vuduc and Moon, 2005]

• Auto-tuning[Im and Yelick, 2001, Vuduc et al., 2005]

Low Flop-rate:• Memory bound operation

• Flop/Word hardware limit

• Irregular/not contiguous memory accesses• Instruction (pipelining, vectorization)

• Not appropriate for GPUs[Garland, 2008, Baskaran and Bordawekar, 2008,Bell and Garland, 2009]

Optimization and Parallelization of the Boundary Element Method for the Wave Equation in Time Domain Berenger Bramas

Page 23: Optimization and Parallelization of the Boundary Element ...berenger.eu/blog/wp-content/uploads/2010/01/Berenger-Slides.pdf · Context Problem Formulation Matrix Approach FMM Approach

..........

.....

....................................................................................

......

.....

.....

.

Context Problem Formulation Matrix Approach FMM Approach Conclusion & Perspectives

TD-BEM Formulation (14)

SpMV (Performance)

.....

Dense

5000

.

Diagon

al 15/5

00000

.

Rando

m4/2

0000

.

Block

Rando

m(5/

80000)

.

Dense

200(x1

0000)

.0 .

10

.

20

.

30

.

GFlop/s

.

. ..C00 MKL . ..CRS MKL . ..DIA MKL . ..BCSR MKL . ..CRS cuSparse . ..BCSR cuSparse

SpMVs MKL/cuSparse (double precision)

Peak performance: CPU Haswell Intel Xeon E5-2680 2,50 GHz core

20GFlop/s, and K40-M GPU 1.43TFlop/s.

Optimization and Parallelization of the Boundary Element Method for the Wave Equation in Time Domain Berenger Bramas

Page 24: Optimization and Parallelization of the Boundary Element ...berenger.eu/blog/wp-content/uploads/2010/01/Berenger-Slides.pdf · Context Problem Formulation Matrix Approach FMM Approach

..........

.....

....................................................................................

......

.....

.....

.

Context Problem Formulation Matrix Approach FMM Approach Conclusion & Perspectives

TD-BEM Formulation (15)

TD-BEM Application Stages

User inputs, simulation parameters↓

Mesh generator, configuration,interaction matrices pre-computation

↓Solver

· Summation stage· M0 Linear Solver (external tool)

↓Post-processing (TD → FD)

Optimization and Parallelization of the Boundary Element Method for the Wave Equation in Time Domain Berenger Bramas

Page 25: Optimization and Parallelization of the Boundary Element ...berenger.eu/blog/wp-content/uploads/2010/01/Berenger-Slides.pdf · Context Problem Formulation Matrix Approach FMM Approach

..........

.....

....................................................................................

......

.....

.....

.

Context Problem Formulation Matrix Approach FMM Approach Conclusion & Perspectives

(16)

Outline

• Problem Formulation

• BEM Solver (Matrix Approach)

• Fast-Multipole Method Approach• FMM Algorithm & Parallelization• FMM BEM Solver (Experimental Implementation)

• Conclusion & Perspectives

Optimization and Parallelization of the Boundary Element Method for the Wave Equation in Time Domain Berenger Bramas

Page 26: Optimization and Parallelization of the Boundary Element ...berenger.eu/blog/wp-content/uploads/2010/01/Berenger-Slides.pdf · Context Problem Formulation Matrix Approach FMM Approach

..........

.....

....................................................................................

......

.....

.....

.

Context Problem Formulation Matrix Approach FMM Approach Conclusion & Perspectives

Improving the Summation (17)

Computational Ordering

A0

M 6

A1

M 5

A2

M 4

A3

M 3

A4

M 2

A5

M 1

S6

Front (k)/ SpMV

A0

M 6

A1

M 5

A2

M 4

A3

M 3

A4

M 2

A5

M 1

S6

Top (i)

A0

M 6

A1

M 5

A2

M 4

A3

M 3

A4

M 2

A5

M 1

S6

Side (j)

sn(i) =Kmax∑k=1

N∑j=1

Mk(i , j)× an−k(j) , 1 ≤ i ≤ N . (4)

Optimization and Parallelization of the Boundary Element Method for the Wave Equation in Time Domain Berenger Bramas

Page 27: Optimization and Parallelization of the Boundary Element ...berenger.eu/blog/wp-content/uploads/2010/01/Berenger-Slides.pdf · Context Problem Formulation Matrix Approach FMM Approach

..........

.....

....................................................................................

......

.....

.....

.

Context Problem Formulation Matrix Approach FMM Approach Conclusion & Perspectives

Improving the Summation (17)

Computational Ordering

A0

M 6

A1

M 5

A2

M 4

A3

M 3

A4

M 2

A5

M 1

S6

Front (k)/ SpMV

A0

M 6

A1

M 5

A2

M 4

A3

M 3

A4

M 2

A5

M 1

S6

Top (i)

A0

M 6

A1

M 5

A2

M 4

A3

M 3

A4

M 2

A5

M 1

S6

Side (j)

sn(i) =Kmax∑k=1

N∑j=1

Mk(i , j)× an−k(j) , 1 ≤ i ≤ N . (4)

Optimization and Parallelization of the Boundary Element Method for the Wave Equation in Time Domain Berenger Bramas

Page 28: Optimization and Parallelization of the Boundary Element ...berenger.eu/blog/wp-content/uploads/2010/01/Berenger-Slides.pdf · Context Problem Formulation Matrix Approach FMM Approach

..........

.....

....................................................................................

......

.....

.....

.

Context Problem Formulation Matrix Approach FMM Approach Conclusion & Perspectives

Improving the Summation (18)

Structure of a Slice MatrixA Slice j :

• When outer loop index is j• The concatenation of column j of the interaction matrices Mk

(except M0)• Size (N × (Kmax − 1))• There is one dense vector per row• Slice j(i , k) = Mk(i , j) = 0with ks = d(i , j)/(c∆t) and ks ≤ k ≤ ks + p

M1 (*,j)

Slicej

M2 (*,j)

M3 (*,j)

M4 (*,j)

M5 (*,j)

M6 (*,j)

M7 (*,j)

M8 (*,j)

M9 (*,j)

M10 (*,j)

M11 (*,j)

M12 (*,j)

Optimization and Parallelization of the Boundary Element Method for the Wave Equation in Time Domain Berenger Bramas

Page 29: Optimization and Parallelization of the Boundary Element ...berenger.eu/blog/wp-content/uploads/2010/01/Berenger-Slides.pdf · Context Problem Formulation Matrix Approach FMM Approach

..........

.....

....................................................................................

......

.....

.....

.

Context Problem Formulation Matrix Approach FMM Approach Conclusion & Perspectives

Improving the Summation (19)

Computing with a Slice Matrix

a*<n(j)

M1 (*,j)

Slicej

snM

2 (*,j)

M3 (*,j)

M4 (*,j)

M5 (*,j)

M6 (*,j)

M7 (*,j)

M8 (*,j)

M9 (*,j)

M10 (*,j)

M11 (*,j)

M12 (*,j)

Computation with N vector/vector products (one per line):

• Regular memory access (vectorization, pipelining)

• Low Flop/word ratio (same as SpMV)

Optimization and Parallelization of the Boundary Element Method for the Wave Equation in Time Domain Berenger Bramas

Page 30: Optimization and Parallelization of the Boundary Element ...berenger.eu/blog/wp-content/uploads/2010/01/Berenger-Slides.pdf · Context Problem Formulation Matrix Approach FMM Approach

..........

.....

....................................................................................

......

.....

.....

.

Context Problem Formulation Matrix Approach FMM Approach Conclusion & Perspectives

Improving the Summation (20)

Improving the Flop/Word Ratio

an-1an-2an-3an-4an-5an-6an-7an-8an-9

M1 (*,j)

Slicej

sn

M2 (*,j)

M3 (*,j)

M4 (*,j)

M5 (*,j)

M6 (*,j)

M7 (*,j)

M8 (*,j)

M9 (*,j)

State of unknown j

anan+1an+2

an

? ? ?….

Optimization and Parallelization of the Boundary Element Method for the Wave Equation in Time Domain Berenger Bramas

Page 31: Optimization and Parallelization of the Boundary Element ...berenger.eu/blog/wp-content/uploads/2010/01/Berenger-Slides.pdf · Context Problem Formulation Matrix Approach FMM Approach

..........

.....

....................................................................................

......

.....

.....

.

Context Problem Formulation Matrix Approach FMM Approach Conclusion & Perspectives

Improving the Summation (20)

Improving the Flop/Word Ratio

an-1an-2an-3an-4an-5an-6an-7an-8an-9

M1 (*,j)

Slicej

sn+1

M2 (*,j)

M3 (*,j)

M4 (*,j)

M5 (*,j)

M6 (*,j)

M7 (*,j)

M8 (*,j)

M9 (*,j)

anan+1an+2?? ?

an-1an-2an-3an-4an-5an-6an-7an-8an-9anan+1an+2 ….?? ?

….

snn =2g

Optimization and Parallelization of the Boundary Element Method for the Wave Equation in Time Domain Berenger Bramas

Page 32: Optimization and Parallelization of the Boundary Element ...berenger.eu/blog/wp-content/uploads/2010/01/Berenger-Slides.pdf · Context Problem Formulation Matrix Approach FMM Approach

..........

.....

....................................................................................

......

.....

.....

.

Context Problem Formulation Matrix Approach FMM Approach Conclusion & Perspectives

Improving the Summation (20)

Improving the Flop/Word Ratio

an-1an-2an-3an-4an-5an-6an-7an-8an-9

M1 (*,j)

Slicej

sn+1

M2 (*,j)

M3 (*,j)

M4 (*,j)

M5 (*,j)

M6 (*,j)

M7 (*,j)

M8 (*,j)

M9 (*,j)

anan+1an+2?? ?

an-1an-2an-3an-4an-5an-6an-7an-8an-9anan+1an+2 ….?? ?

….

snsn+2

an-1an-2an-3an-4an-5an-6an-7 an-9anan+1an+2?? ? an-8 ….

n =3g

Optimization and Parallelization of the Boundary Element Method for the Wave Equation in Time Domain Berenger Bramas

Page 33: Optimization and Parallelization of the Boundary Element ...berenger.eu/blog/wp-content/uploads/2010/01/Berenger-Slides.pdf · Context Problem Formulation Matrix Approach FMM Approach

..........

.....

....................................................................................

......

.....

.....

.

Context Problem Formulation Matrix Approach FMM Approach Conclusion & Perspectives

Improving the Summation (20)

Improving the Flop/Word Ratio

M1 (*,j)

Slicej

sn+1

M2 (*,j)

M3 (*,j)

M4 (*,j)

M5 (*,j)

M6 (*,j)

M7 (*,j)

M8 (*,j)

M9 (*,j)

an-1an-2an-3an-4an-5an-6an-7an-8an-9anan+1 ….00

snsn+2 n =3g

Optimization and Parallelization of the Boundary Element Method for the Wave Equation in Time Domain Berenger Bramas

Page 34: Optimization and Parallelization of the Boundary Element ...berenger.eu/blog/wp-content/uploads/2010/01/Berenger-Slides.pdf · Context Problem Formulation Matrix Approach FMM Approach

..........

.....

....................................................................................

......

.....

.....

.

Context Problem Formulation Matrix Approach FMM Approach Conclusion & Perspectives

Improving the Summation (20)

Improving the Flop/Word Ratio

M1 (*,j)

Slicej

sn+1

M2 (*,j)

M3 (*,j)

M4 (*,j)

M5 (*,j)

M6 (*,j)

M7 (*,j)

M8 (*,j)

M9 (*,j)

an-1an-2an-3an-4an-5an-6an-7an-8an-9anan+1 ….00

snsn+2 n =3g

Optimization and Parallelization of the Boundary Element Method for the Wave Equation in Time Domain Berenger Bramas

Page 35: Optimization and Parallelization of the Boundary Element ...berenger.eu/blog/wp-content/uploads/2010/01/Berenger-Slides.pdf · Context Problem Formulation Matrix Approach FMM Approach

..........

.....

....................................................................................

......

.....

.....

.

Context Problem Formulation Matrix Approach FMM Approach Conclusion & Perspectives

Multi-Vectors/Vector Product (21)

Flop/Word RatioVector length v = 4, group size ng = 4 (v × ng × 2 Flops):

a b c d

0123r a b c d

0123r

1234

2345

3456

r r r a b c d

0123456r r r r

Vector/vector product

Vector/matrix product

Multi-vectors/vector product

v ng

• Vectors product (≈ SpMV) : ng (2v + 1)• Vector/matrix product : v + ng (v + 1)• Multi-vectors/vector product : (v + ng − 1) + (v) + (ng )

.....0

.2

.4

.6

.8

.10

.12

.14

.16

.18

.20

.0 .2

.

4

.

6

.

v

.F/W

(ng8)

Optimization and Parallelization of the Boundary Element Method for the Wave Equation in Time Domain Berenger Bramas

Page 36: Optimization and Parallelization of the Boundary Element ...berenger.eu/blog/wp-content/uploads/2010/01/Berenger-Slides.pdf · Context Problem Formulation Matrix Approach FMM Approach

..........

.....

....................................................................................

......

.....

.....

.

Context Problem Formulation Matrix Approach FMM Approach Conclusion & Perspectives

Multi-Vectors/Vector Product (21)

Flop/Word RatioVector length v = 4, group size ng = 4 (v × ng × 2 Flops):

a b c d

0123r a b c d

0123r

1234

2345

3456

r r r a b c d

0123456r r r r

Vector/vector product

Vector/matrix product

Multi-vectors/vector product

v ng

• Vectors product (≈ SpMV) : ng (2v + 1)• Vector/matrix product : v + ng (v + 1)• Multi-vectors/vector product : (v + ng − 1) + (v) + (ng )

.....0

.2

.4

.6

.8

.10

.12

.14

.16

.18

.20

.0 .2

.

4

.

6

.

v

.F/W

(ng8)

Optimization and Parallelization of the Boundary Element Method for the Wave Equation in Time Domain Berenger Bramas

Page 37: Optimization and Parallelization of the Boundary Element ...berenger.eu/blog/wp-content/uploads/2010/01/Berenger-Slides.pdf · Context Problem Formulation Matrix Approach FMM Approach

..........

.....

....................................................................................

......

.....

.....

.

Context Problem Formulation Matrix Approach FMM Approach Conclusion & Perspectives

Multi-Vectors/Vector Product (22)

Multi-vectors/vector Product (CPU)

.....

. ..AVX-Asm . ..AVX-Intrinsic . ..AVX-Template . ..SSE-Intrinsic . ..Compiler Version

... ..0

.20

.40

.60

.80

.0 .

5

.

10

.

15

.

20

.

Length of vectors (v)

.

Speed(G

Flop/s)

Figure : Nr = 1024

... ..0

.20

.40

.60

.80

.0 .

5

.

10

.

15

.

20

.

Length of vectors (v)

Figure : Nr = 20 480

Plots show the GFlop/s with ng = 8 for test cases of dimension Nr × v (in double precision).

Haswell Intel Xeon E5-2680 at 2, 50GHz (20GFlop/s)

Optimization and Parallelization of the Boundary Element Method for the Wave Equation in Time Domain Berenger Bramas

Page 38: Optimization and Parallelization of the Boundary Element ...berenger.eu/blog/wp-content/uploads/2010/01/Berenger-Slides.pdf · Context Problem Formulation Matrix Approach FMM Approach

..........

.....

....................................................................................

......

.....

.....

.

Context Problem Formulation Matrix Approach FMM Approach Conclusion & Perspectives

Multi-Vectors/Vector Product on GPUs (23)

GPUs Slice Storages

• Blocking scheme (small conversion overhead)

• Data access appropriate for SIMT/SIMD

• Memory accesses (coalesced, low bank conflicts)

• Data re-use (shared memory)

• CPU/GPU Balancing

Slicej

a*<n(j)

Slicej 0

2

0

1

3

4

5

8

6

5

6

1

2

0

2

(a) (b) (c)

Optimization and Parallelization of the Boundary Element Method for the Wave Equation in Time Domain Berenger Bramas

Page 39: Optimization and Parallelization of the Boundary Element ...berenger.eu/blog/wp-content/uploads/2010/01/Berenger-Slides.pdf · Context Problem Formulation Matrix Approach FMM Approach

..........

.....

....................................................................................

......

.....

.....

.

Context Problem Formulation Matrix Approach FMM Approach Conclusion & Perspectives

Multi-Vectors/Vector Product on GPUs (23)

GPUs Slice Storages

• Blocking scheme (small conversion overhead)

• Data access appropriate for SIMT/SIMD

• Memory accesses (coalesced, low bank conflicts)

• Data re-use (shared memory)

• CPU/GPU Balancing

Slicej

a*<n(j)

Slicej 0

2

0

1

3

4

5

8

6

5

6

1

2

0

2

(a) (b) (c)

Optimization and Parallelization of the Boundary Element Method for the Wave Equation in Time Domain Berenger Bramas

Page 40: Optimization and Parallelization of the Boundary Element ...berenger.eu/blog/wp-content/uploads/2010/01/Berenger-Slides.pdf · Context Problem Formulation Matrix Approach FMM Approach

..........

.....

....................................................................................

......

.....

.....

.

Context Problem Formulation Matrix Approach FMM Approach Conclusion & Perspectives

Parallelization (24)

Parallelization

Sequential algorithm:

= =

=

-

,

Linear Solver

sn ln sn sn~

sn~an M0

M1 M2M3M4M5

an-1an-2an-3an-4an-5

Optimization and Parallelization of the Boundary Element Method for the Wave Equation in Time Domain Berenger Bramas

Page 41: Optimization and Parallelization of the Boundary Element ...berenger.eu/blog/wp-content/uploads/2010/01/Berenger-Slides.pdf · Context Problem Formulation Matrix Approach FMM Approach

..........

.....

....................................................................................

......

.....

.....

.

Context Problem Formulation Matrix Approach FMM Approach Conclusion & Perspectives

Parallelization (25)

Parallel Solver (Schematic View)

=-

ln sn sn~

=

,

Parallel Linear Solver

sn~ anM0

=

sn

M1 M2M3M4M5

an-1n-2n-3

n-4

n-5

=

sn

=

sn

sn

+P1

P2

P0

an-1n-2

n-3n-4n-5

n-5

an-1n-2n-3

n-4

M1 M2M3M4M5

M1 M2M3M4M5

Optimization and Parallelization of the Boundary Element Method for the Wave Equation in Time Domain Berenger Bramas

Page 42: Optimization and Parallelization of the Boundary Element ...berenger.eu/blog/wp-content/uploads/2010/01/Berenger-Slides.pdf · Context Problem Formulation Matrix Approach FMM Approach

..........

.....

....................................................................................

......

.....

.....

.

Context Problem Formulation Matrix Approach FMM Approach Conclusion & Perspectives

Parallelization (26)

Parallel Solver with ng > 1 (Schematic View)

=-

ln+f sn+f sn+f~

=

,

Parallel Linear Solver

sn+f~an+f

M0

=

=

=

+

sn+f

sn sn+1sn+2

M1M2

,

Radiation

sn+f+1sn+f+2

,an+f

Radiation

, ,

Radiation

, ,

+

sn+f

ng loops

T/ng loops

P1

P2

P0

P1

P2

P0

M1M2sn+f+1sn+f+2an+f

M1M2 sn+f+1sn+f+2an+f

snsn+1sn+2

snsn+1sn+2

an-1n-2n-3n-4

n-5

an-1n-2n-3n-4

n-5

an-1n-2n-3n-4

n-5

M1 M2M3M4M5

M1 M2M3M4M5

M1 M2M3M4M5

Optimization and Parallelization of the Boundary Element Method for the Wave Equation in Time Domain Berenger Bramas

Page 43: Optimization and Parallelization of the Boundary Element ...berenger.eu/blog/wp-content/uploads/2010/01/Berenger-Slides.pdf · Context Problem Formulation Matrix Approach FMM Approach

..........

.....

....................................................................................

......

.....

.....

.

Context Problem Formulation Matrix Approach FMM Approach Conclusion & Perspectives

Results (27)

Airplane Simulation

• Acoustics

• N = 23 962

• 10 823 time iterations

• Kmax = 341 interaction matrices Mk

• ng = 8

• 70GB of data

• double precision

• Homogeneous node: 24 Cores CPU (128GB memory)

• Heterogeneous node: 24 Cores CPU (128GB memory) and 4K40M GPUs (12GB memory)

Optimization and Parallelization of the Boundary Element Method for the Wave Equation in Time Domain Berenger Bramas

Page 44: Optimization and Parallelization of the Boundary Element ...berenger.eu/blog/wp-content/uploads/2010/01/Berenger-Slides.pdf · Context Problem Formulation Matrix Approach FMM Approach

..........

.....

....................................................................................

......

.....

.....

.

Context Problem Formulation Matrix Approach FMM Approach Conclusion & Perspectives

Results (28)

Parallel Efficiency/Percentage (Homogeneous)

.....

. ..Full-MPI

.....

. ..Summation . ..Idle

. ..Direct solver M0

... ..1.

10.

20.0 .

0.5

.

1

.

Number of nodes

.

Efficiency

... ..1.

10.

20.0 .

20.

40

.

60

.

80

.

100

.

Number of nodes

.Percentage

(%)

Summation stage M0 Solve →

Optimization and Parallelization of the Boundary Element Method for the Wave Equation in Time Domain Berenger Bramas

Page 45: Optimization and Parallelization of the Boundary Element ...berenger.eu/blog/wp-content/uploads/2010/01/Berenger-Slides.pdf · Context Problem Formulation Matrix Approach FMM Approach

..........

.....

....................................................................................

......

.....

.....

.

Context Problem Formulation Matrix Approach FMM Approach Conclusion & Perspectives

Results (28)

Parallel Efficiency/Percentage (Homogeneous)

.....

. ..Full-MPI

.....

. ..Summation . ..Idle

. ..Direct solver M0

... ..1.

10.

20.0 .

0.5

.

1

.

Number of nodes

.

Efficiency

... ..1.

10.

20.0 .

20.

40

.

60

.

80

.

100

.

Number of nodes

.Percentage

(%)

Summation stage M0 Solve →

Optimization and Parallelization of the Boundary Element Method for the Wave Equation in Time Domain Berenger Bramas

Page 46: Optimization and Parallelization of the Boundary Element ...berenger.eu/blog/wp-content/uploads/2010/01/Berenger-Slides.pdf · Context Problem Formulation Matrix Approach FMM Approach

..........

.....

....................................................................................

......

.....

.....

.

Context Problem Formulation Matrix Approach FMM Approach Conclusion & Perspectives

Results (28)

Parallel Efficiency/Percentage (Homogeneous)

.....

. ..Full-MPI

.....

. ..Summation . ..Idle

. ..Direct solver M0

... ..1.

10.

20.0 .

0.5

.

1

.

Number of nodes

.

Efficiency

... ..1.

10.

20.0 .

20.

40

.

60

.

80

.

100

.

Number of nodes

.Percentage

(%)

Summation stage M0 Solve →

Optimization and Parallelization of the Boundary Element Method for the Wave Equation in Time Domain Berenger Bramas

Page 47: Optimization and Parallelization of the Boundary Element ...berenger.eu/blog/wp-content/uploads/2010/01/Berenger-Slides.pdf · Context Problem Formulation Matrix Approach FMM Approach

..........

.....

....................................................................................

......

.....

.....

.

Context Problem Formulation Matrix Approach FMM Approach Conclusion & Perspectives

Results (29)

With GPUs

...

..

1

.

2

.

3

.

4

.

5

.

300

.

1,000

.

3,000

.

20,000

.

Number of nodes

.

Tim

e(secon

ds)

.

. ..CPU-Only . ..1GPU

Figure : Execution time

.....

1

.

2

.

3

.

4

.

5

.1.0 .

11.0

.

21.0

.

Number of nodes

.

Speedup

.

. ..1GPU

Figure : Speedup against CPU-Only

Problem ≈ 70GB/GPU 12GB memory

Optimization and Parallelization of the Boundary Element Method for the Wave Equation in Time Domain Berenger Bramas

Page 48: Optimization and Parallelization of the Boundary Element ...berenger.eu/blog/wp-content/uploads/2010/01/Berenger-Slides.pdf · Context Problem Formulation Matrix Approach FMM Approach

..........

.....

....................................................................................

......

.....

.....

.

Context Problem Formulation Matrix Approach FMM Approach Conclusion & Perspectives

Results (29)

With GPUs

...

..

1

.

2

.

3

.

4

.

5

.

300

.

1,000

.

3,000

.

20,000

.

Number of nodes

.

Tim

e(secon

ds)

.

. ..CPU-Only . ..1GPU . ..2GPU

Figure : Execution time

.....

1

.

2

.

3

.

4

.

5

.1.0 .

11.0

.

21.0

.

Number of nodes

.

Speedup

.

. ..1GPU . ..2GPU

Figure : Speedup against CPU-Only

Problem ≈ 70GB/GPU 12GB memory

Optimization and Parallelization of the Boundary Element Method for the Wave Equation in Time Domain Berenger Bramas

Page 49: Optimization and Parallelization of the Boundary Element ...berenger.eu/blog/wp-content/uploads/2010/01/Berenger-Slides.pdf · Context Problem Formulation Matrix Approach FMM Approach

..........

.....

....................................................................................

......

.....

.....

.

Context Problem Formulation Matrix Approach FMM Approach Conclusion & Perspectives

Results (29)

With GPUs

...

..

1

.

2

.

3

.

4

.

5

.

300

.

1,000

.

3,000

.

20,000

.

Number of nodes

.

Tim

e(secon

ds)

.

. ..CPU-Only . ..1GPU . ..2GPU

. ..3GPU

Figure : Execution time

.....

1

.

2

.

3

.

4

.

5

.1.0 .

11.0

.

21.0

.

Number of nodes

.

Speedup

.

. ..1GPU . ..2GPU . ..3GPU

Figure : Speedup against CPU-Only

Problem ≈ 70GB/GPU 12GB memory

Optimization and Parallelization of the Boundary Element Method for the Wave Equation in Time Domain Berenger Bramas

Page 50: Optimization and Parallelization of the Boundary Element ...berenger.eu/blog/wp-content/uploads/2010/01/Berenger-Slides.pdf · Context Problem Formulation Matrix Approach FMM Approach

..........

.....

....................................................................................

......

.....

.....

.

Context Problem Formulation Matrix Approach FMM Approach Conclusion & Perspectives

Results (29)

With GPUs

...

..

1

.

2

.

3

.

4

.

5

.

300

.

1,000

.

3,000

.

20,000

.

Number of nodes

.

Tim

e(secon

ds)

.

. ..CPU-Only . ..1GPU . ..2GPU

. ..3GPU . ..4GPU

Figure : Execution time

.....

1

.

2

.

3

.

4

.

5

.1.0 .

11.0

.

21.0

.

Number of nodes

.

Speedup

.

. ..1GPU . ..2GPU . ..3GPU . ..4GPU

Figure : Speedup against CPU-Only

Problem ≈ 70GB/GPU 12GB memory

Optimization and Parallelization of the Boundary Element Method for the Wave Equation in Time Domain Berenger Bramas

Page 51: Optimization and Parallelization of the Boundary Element ...berenger.eu/blog/wp-content/uploads/2010/01/Berenger-Slides.pdf · Context Problem Formulation Matrix Approach FMM Approach

..........

.....

....................................................................................

......

.....

.....

.

Context Problem Formulation Matrix Approach FMM Approach Conclusion & Perspectives

Summary (30)

Summary:

• New computational ordering [Bramas et al., 2014]

• Solver with few communication points

Additional contributions:

• Permutations/SpMV

• Efficient SIMD kernel CPU

• Efficient blocking scheme/kernel forGPU [Bramas et al., 2015]

• Dynamic balancing (CPU/GPU)

Limits:

• M0 Linear solver

• GPUs’ memory

• Interaction matrices construction

• Complexity → O(N2) for each iteration

Optimization and Parallelization of the Boundary Element Method for the Wave Equation in Time Domain Berenger Bramas

Page 52: Optimization and Parallelization of the Boundary Element ...berenger.eu/blog/wp-content/uploads/2010/01/Berenger-Slides.pdf · Context Problem Formulation Matrix Approach FMM Approach

..........

.....

....................................................................................

......

.....

.....

.

Context Problem Formulation Matrix Approach FMM Approach Conclusion & Perspectives

(31)

Outline

• Problem Formulation

• BEM Solver (Matrix Approach)

• Fast-Multipole Method Approach• FMM Algorithm & Parallelization• FMM BEM Solver (Experimental Implementation)

• Conclusion & Perspectives

Optimization and Parallelization of the Boundary Element Method for the Wave Equation in Time Domain Berenger Bramas

Page 53: Optimization and Parallelization of the Boundary Element ...berenger.eu/blog/wp-content/uploads/2010/01/Berenger-Slides.pdf · Context Problem Formulation Matrix Approach FMM Approach

..........

.....

....................................................................................

......

.....

.....

.

Context Problem Formulation Matrix Approach FMM Approach Conclusion & Perspectives

FMM Algorithm (32)

FMM Operators (1D)

• Spatial decomposition → Potential decomposition

fi = f neari + f fari

• Near field by direct interactions (leaves)• Far field with FMM operators (tree)

l = 0

l = 1

l = 2

l = 3

P2P

M2M

M2L

L2LM2L M2L

Optimization and Parallelization of the Boundary Element Method for the Wave Equation in Time Domain Berenger Bramas

Page 54: Optimization and Parallelization of the Boundary Element ...berenger.eu/blog/wp-content/uploads/2010/01/Berenger-Slides.pdf · Context Problem Formulation Matrix Approach FMM Approach

..........

.....

....................................................................................

......

.....

.....

.

Context Problem Formulation Matrix Approach FMM Approach Conclusion & Perspectives

FMM Algorithm (32)

FMM Operators (1D)

• Spatial decomposition → Potential decomposition

fi = f neari + f fari

• Near field by direct interactions (leaves)• Far field with FMM operators (tree)

l = 0

l = 1

l = 2

l = 3

P2P

M2M

M2L

L2LM2L M2L

Optimization and Parallelization of the Boundary Element Method for the Wave Equation in Time Domain Berenger Bramas

Page 55: Optimization and Parallelization of the Boundary Element ...berenger.eu/blog/wp-content/uploads/2010/01/Berenger-Slides.pdf · Context Problem Formulation Matrix Approach FMM Approach

..........

.....

....................................................................................

......

.....

.....

.

Context Problem Formulation Matrix Approach FMM Approach Conclusion & Perspectives

FMM Parallelization (33)

Related work:

• Multicore study [Chandramowlishwaran et al., 2010]

• NVidia GPU [Yokota and Barba, 2011]

• Distributed GPU [Hamada et al., 2009]

• Distributed CPU/GPU [Hu et al., 2011, Lashuk et al., 2012,Malhotra and Biros, 2015]

• Using a runtime system (multicore) [Ltaief and Yokota, 2014]

Optimization and Parallelization of the Boundary Element Method for the Wave Equation in Time Domain Berenger Bramas

Page 56: Optimization and Parallelization of the Boundary Element ...berenger.eu/blog/wp-content/uploads/2010/01/Berenger-Slides.pdf · Context Problem Formulation Matrix Approach FMM Approach

..........

.....

....................................................................................

......

.....

.....

.

Context Problem Formulation Matrix Approach FMM Approach Conclusion & Perspectives

FMM Parallelization (34)

Paradigms

• Fork-join

- Parallel-for (OpenMP)

• Task-based

- Tasks pool (OpenMP 3.1) [Agullo et al., 2014]1

- Tasks-and-dependencies (runtime systems, OpenMP 4)

1 Agullo, E., Bramas, B., Coulaud, O., Darve, E., Messner, M., and Takahashi, T. (2014). Task-based fmm for

multicore architectures. SIAM Journal on Scientific Computing, 36(1):C66C93.

Optimization and Parallelization of the Boundary Element Method for the Wave Equation in Time Domain Berenger Bramas

Page 57: Optimization and Parallelization of the Boundary Element ...berenger.eu/blog/wp-content/uploads/2010/01/Berenger-Slides.pdf · Context Problem Formulation Matrix Approach FMM Approach

..........

.....

....................................................................................

......

.....

.....

.

Context Problem Formulation Matrix Approach FMM Approach Conclusion & Perspectives

FMM Parallelization (34)

Paradigms

• Fork-join

- Parallel-for (OpenMP)

• Task-based

- Tasks pool (OpenMP 3.1) [Agullo et al., 2014]1

- Tasks-and-dependencies (runtime systems, OpenMP 4)

1 Agullo, E., Bramas, B., Coulaud, O., Darve, E., Messner, M., and Takahashi, T. (2014). Task-based fmm for

multicore architectures. SIAM Journal on Scientific Computing, 36(1):C66C93.

Optimization and Parallelization of the Boundary Element Method for the Wave Equation in Time Domain Berenger Bramas

Page 58: Optimization and Parallelization of the Boundary Element ...berenger.eu/blog/wp-content/uploads/2010/01/Berenger-Slides.pdf · Context Problem Formulation Matrix Approach FMM Approach

..........

.....

....................................................................................

......

.....

.....

.

Context Problem Formulation Matrix Approach FMM Approach Conclusion & Perspectives

FMM Parallelization (34)

Paradigms

• Fork-join

- Parallel-for (OpenMP)

• Task-based

- Tasks pool (OpenMP 3.1) [Agullo et al., 2014]1

- Tasks-and-dependencies (runtime systems, OpenMP 4)

1 Agullo, E., Bramas, B., Coulaud, O., Darve, E., Messner, M., and Takahashi, T. (2014). Task-based fmm for

multicore architectures. SIAM Journal on Scientific Computing, 36(1):C66C93.

Optimization and Parallelization of the Boundary Element Method for the Wave Equation in Time Domain Berenger Bramas

Page 59: Optimization and Parallelization of the Boundary Element ...berenger.eu/blog/wp-content/uploads/2010/01/Berenger-Slides.pdf · Context Problem Formulation Matrix Approach FMM Approach

..........

.....

....................................................................................

......

.....

.....

.

Context Problem Formulation Matrix Approach FMM Approach Conclusion & Perspectives

FMM Parallelization (35)

Tasks-and-Dependencies Model (OpenMP 4, StarPU )

Challenges

• Granularity

• Computational kernels

• Scheduling

Optimization and Parallelization of the Boundary Element Method for the Wave Equation in Time Domain Berenger Bramas

Page 60: Optimization and Parallelization of the Boundary Element ...berenger.eu/blog/wp-content/uploads/2010/01/Berenger-Slides.pdf · Context Problem Formulation Matrix Approach FMM Approach

..........

.....

....................................................................................

......

.....

.....

.

Context Problem Formulation Matrix Approach FMM Approach Conclusion & Perspectives

FMM Parallelization (35)

Tasks-and-Dependencies Model (OpenMP 4, StarPU )

Challenges

• Granularity

• Computational kernels

• Scheduling

Optimization and Parallelization of the Boundary Element Method for the Wave Equation in Time Domain Berenger Bramas

Page 61: Optimization and Parallelization of the Boundary Element ...berenger.eu/blog/wp-content/uploads/2010/01/Berenger-Slides.pdf · Context Problem Formulation Matrix Approach FMM Approach

..........

.....

....................................................................................

......

.....

.....

.

Context Problem Formulation Matrix Approach FMM Approach Conclusion & Perspectives

FMM Parallelization (35)

Tasks-and-Dependencies Model (OpenMP 4, *PU)

CPU/GPU

Challenges

• Granularity

• Computational kernels

• Scheduling

Optimization and Parallelization of the Boundary Element Method for the Wave Equation in Time Domain Berenger Bramas

Page 62: Optimization and Parallelization of the Boundary Element ...berenger.eu/blog/wp-content/uploads/2010/01/Berenger-Slides.pdf · Context Problem Formulation Matrix Approach FMM Approach

..........

.....

....................................................................................

......

.....

.....

.

Context Problem Formulation Matrix Approach FMM Approach Conclusion & Perspectives

FMM Parallelization (35)

Tasks-and-Dependencies Model (OpenMP 4, *PU)

Challenges

• Granularity

• Computational kernels

• Scheduling

Optimization and Parallelization of the Boundary Element Method for the Wave Equation in Time Domain Berenger Bramas

Page 63: Optimization and Parallelization of the Boundary Element ...berenger.eu/blog/wp-content/uploads/2010/01/Berenger-Slides.pdf · Context Problem Formulation Matrix Approach FMM Approach

..........

.....

....................................................................................

......

.....

.....

.

Context Problem Formulation Matrix Approach FMM Approach Conclusion & Perspectives

FMM Parallelization (36)

Scheduling

A

D

B

C

E

F

CPU0

CPU1

GPU0

AScheduler

A

D

B

C

E

F

CPU0

CPU1

GPU0

SchedulerB

C

D

• Priority• Work stealing [Blumofe and Leiserson, 1999]• Heterogeneous Earliest Finish Time (Heft)[Topcuouglu et al., 2002]

Drawbacks:• Calibration• Overhead• Ready-tasks view

Optimization and Parallelization of the Boundary Element Method for the Wave Equation in Time Domain Berenger Bramas

Page 64: Optimization and Parallelization of the Boundary Element ...berenger.eu/blog/wp-content/uploads/2010/01/Berenger-Slides.pdf · Context Problem Formulation Matrix Approach FMM Approach

..........

.....

....................................................................................

......

.....

.....

.

Context Problem Formulation Matrix Approach FMM Approach Conclusion & Perspectives

FMM Parallelization (36)

Scheduling

A

D

B

C

E

F

CPU0

CPU1

GPU0

SchedulerB

C

D

• Priority

• Work stealing [Blumofe and Leiserson, 1999]

• Heterogeneous Earliest Finish Time (Heft)[Topcuouglu et al., 2002]

Drawbacks:

• Calibration

• Overhead

• Ready-tasks view

Optimization and Parallelization of the Boundary Element Method for the Wave Equation in Time Domain Berenger Bramas

Page 65: Optimization and Parallelization of the Boundary Element ...berenger.eu/blog/wp-content/uploads/2010/01/Berenger-Slides.pdf · Context Problem Formulation Matrix Approach FMM Approach

..........

.....

....................................................................................

......

.....

.....

.

Context Problem Formulation Matrix Approach FMM Approach Conclusion & Perspectives

FMM Parallelization (36)

Scheduling

A

D

B

C

E

F

CPU0

CPU1

GPU0

SchedulerB

C

D

• Priority

• Work stealing [Blumofe and Leiserson, 1999]

• Heterogeneous Earliest Finish Time (Heft)[Topcuouglu et al., 2002]

Drawbacks:

• Calibration

• Overhead

• Ready-tasks view

Optimization and Parallelization of the Boundary Element Method for the Wave Equation in Time Domain Berenger Bramas

Page 66: Optimization and Parallelization of the Boundary Element ...berenger.eu/blog/wp-content/uploads/2010/01/Berenger-Slides.pdf · Context Problem Formulation Matrix Approach FMM Approach

..........

.....

....................................................................................

......

.....

.....

.

Context Problem Formulation Matrix Approach FMM Approach Conclusion & Perspectives

FMM Parallelization (36)

Scheduling

A

D

B

C

E

F

CPU0

CPU1

GPU0

SchedulerB

C

D

• Priority

• Work stealing [Blumofe and Leiserson, 1999]

• Heterogeneous Earliest Finish Time (Heft)[Topcuouglu et al., 2002]

Drawbacks:

• Calibration

• Overhead

• Ready-tasks view

Optimization and Parallelization of the Boundary Element Method for the Wave Equation in Time Domain Berenger Bramas

Page 67: Optimization and Parallelization of the Boundary Element ...berenger.eu/blog/wp-content/uploads/2010/01/Berenger-Slides.pdf · Context Problem Formulation Matrix Approach FMM Approach

..........

.....

....................................................................................

......

.....

.....

.

Context Problem Formulation Matrix Approach FMM Approach Conclusion & Perspectives

FMM Parallelization (37)

Heteroprio

• Heteroprio [Agullo et al., 2015]1

• Steady-state : execute tasks where they have the bestacceleration factor

• Critical-state : execute a task by a worker if it does not delaythe hypothetical end

1 Agullo, E., Bramas, B., Coulaud, O., Darve, E., Messner, M., and Takahashi, T. (2015). Task-based fmm for

heterogeneous architectures. Concurrency and Computation: Practice and Experience.

Optimization and Parallelization of the Boundary Element Method for the Wave Equation in Time Domain Berenger Bramas

Page 68: Optimization and Parallelization of the Boundary Element ...berenger.eu/blog/wp-content/uploads/2010/01/Berenger-Slides.pdf · Context Problem Formulation Matrix Approach FMM Approach

..........

.....

....................................................................................

......

.....

.....

.

Context Problem Formulation Matrix Approach FMM Approach Conclusion & Perspectives

FMM Parallelization (37)

Heteroprio

• Heteroprio [Agullo et al., 2015]1

• Steady-state : execute tasks where they have the bestacceleration factor

• Critical-state : execute a task by a worker if it does not delaythe hypothetical end

A

D

B

C

E

F

CPU0

CPU1

GPU0

Scheduler

B

CDCPUPrio

GPUPrio

1 Agullo, E., Bramas, B., Coulaud, O., Darve, E., Messner, M., and Takahashi, T. (2015). Task-based fmm for

heterogeneous architectures. Concurrency and Computation: Practice and Experience.

Optimization and Parallelization of the Boundary Element Method for the Wave Equation in Time Domain Berenger Bramas

Page 69: Optimization and Parallelization of the Boundary Element ...berenger.eu/blog/wp-content/uploads/2010/01/Berenger-Slides.pdf · Context Problem Formulation Matrix Approach FMM Approach

..........

.....

....................................................................................

......

.....

.....

.

Context Problem Formulation Matrix Approach FMM Approach Conclusion & Perspectives

FMM Parallelization (37)

Heteroprio

• Heteroprio [Agullo et al., 2015]1

• Steady-state : execute tasks where they have the bestacceleration factor

• Critical-state : execute a task by a worker if it does not delaythe hypothetical end

A

D

B

C

E

F

CPU0

CPU1

GPU0

SchedulerB

CPUPrio

GPUPrio

CD

1 Agullo, E., Bramas, B., Coulaud, O., Darve, E., Messner, M., and Takahashi, T. (2015). Task-based fmm for

heterogeneous architectures. Concurrency and Computation: Practice and Experience.

Optimization and Parallelization of the Boundary Element Method for the Wave Equation in Time Domain Berenger Bramas

Page 70: Optimization and Parallelization of the Boundary Element ...berenger.eu/blog/wp-content/uploads/2010/01/Berenger-Slides.pdf · Context Problem Formulation Matrix Approach FMM Approach

..........

.....

....................................................................................

......

.....

.....

.

Context Problem Formulation Matrix Approach FMM Approach Conclusion & Perspectives

FMM Parallelization (37)

Heteroprio

• Heteroprio [Agullo et al., 2015]1

• Steady-state : execute tasks where they have the bestacceleration factor

• Critical-state : execute a task by a worker if it does not delaythe hypothetical end

A

D

B

C

E

F

CPU0

CPU1

GPU0

SchedulerB

C

CPUPrio

GPUPrio

D

1 Agullo, E., Bramas, B., Coulaud, O., Darve, E., Messner, M., and Takahashi, T. (2015). Task-based fmm for

heterogeneous architectures. Concurrency and Computation: Practice and Experience.

Optimization and Parallelization of the Boundary Element Method for the Wave Equation in Time Domain Berenger Bramas

Page 71: Optimization and Parallelization of the Boundary Element ...berenger.eu/blog/wp-content/uploads/2010/01/Berenger-Slides.pdf · Context Problem Formulation Matrix Approach FMM Approach

..........

.....

....................................................................................

......

.....

.....

.

Context Problem Formulation Matrix Approach FMM Approach Conclusion & Perspectives

FMM Parallelization (37)

Heteroprio

• Heteroprio [Agullo et al., 2015]1

• Steady-state : execute tasks where they have the bestacceleration factor

• Critical-state : execute a task by a worker if it does not delaythe hypothetical end

A

D

B

C

E

F

CPU0

CPU1

GPU0

SchedulerB

D

C

CPUPrio

GPUPrio

1 Agullo, E., Bramas, B., Coulaud, O., Darve, E., Messner, M., and Takahashi, T. (2015). Task-based fmm for

heterogeneous architectures. Concurrency and Computation: Practice and Experience.

Optimization and Parallelization of the Boundary Element Method for the Wave Equation in Time Domain Berenger Bramas

Page 72: Optimization and Parallelization of the Boundary Element ...berenger.eu/blog/wp-content/uploads/2010/01/Berenger-Slides.pdf · Context Problem Formulation Matrix Approach FMM Approach

..........

.....

....................................................................................

......

.....

.....

.

Context Problem Formulation Matrix Approach FMM Approach Conclusion & Perspectives

FMM Particles Interaction Simulations (38)

Test Case

CPU - 24 Cores

GPU 3 GPU 4

GPU 2GPU 1

• N = 30 millions particles

• Spherical Expansion/Rotation Kernel

• Acc = 10−3, h = 7 and Granularity = 1500

Optimization and Parallelization of the Boundary Element Method for the Wave Equation in Time Domain Berenger Bramas

Page 73: Optimization and Parallelization of the Boundary Element ...berenger.eu/blog/wp-content/uploads/2010/01/Berenger-Slides.pdf · Context Problem Formulation Matrix Approach FMM Approach

..........

.....

....................................................................................

......

.....

.....

.

Context Problem Formulation Matrix Approach FMM Approach Conclusion & Perspectives

FMM Particles Interaction Simulations (39)

Trace - Heterogeneous (24CPUs)

0GPU/15.5s

Legend: P2P (), P2M () , M2M () , M2L (), L2L (), L2P () and

Idle ()

Optimization and Parallelization of the Boundary Element Method for the Wave Equation in Time Domain Berenger Bramas

Page 74: Optimization and Parallelization of the Boundary Element ...berenger.eu/blog/wp-content/uploads/2010/01/Berenger-Slides.pdf · Context Problem Formulation Matrix Approach FMM Approach

..........

.....

....................................................................................

......

.....

.....

.

Context Problem Formulation Matrix Approach FMM Approach Conclusion & Perspectives

FMM Particles Interaction Simulations (39)

Trace - Heterogeneous (1GPU/23CPUs)

0GPU/15.5s

1GPU/13.4sLegend: P2P (), P2M () , M2M () , M2L (), L2L (), L2P () and

Idle ()

Optimization and Parallelization of the Boundary Element Method for the Wave Equation in Time Domain Berenger Bramas

Page 75: Optimization and Parallelization of the Boundary Element ...berenger.eu/blog/wp-content/uploads/2010/01/Berenger-Slides.pdf · Context Problem Formulation Matrix Approach FMM Approach

..........

.....

....................................................................................

......

.....

.....

.

Context Problem Formulation Matrix Approach FMM Approach Conclusion & Perspectives

FMM Particles Interaction Simulations (39)

Trace - Heterogeneous (1GPU/23CPUs)

0GPU/15.5s

1GPU/13.4sLegend: P2P (), P2M () , M2M () , M2L (), L2L (), L2P () and

Idle ()

Optimization and Parallelization of the Boundary Element Method for the Wave Equation in Time Domain Berenger Bramas

Page 76: Optimization and Parallelization of the Boundary Element ...berenger.eu/blog/wp-content/uploads/2010/01/Berenger-Slides.pdf · Context Problem Formulation Matrix Approach FMM Approach

..........

.....

....................................................................................

......

.....

.....

.

Context Problem Formulation Matrix Approach FMM Approach Conclusion & Perspectives

FMM Particles Interaction Simulations (39)

Trace - Heterogeneous (2GPUs/22CPUs)

0GPU/15.5s

2GPU/10.9sLegend: P2P (), P2M () , M2M () , M2L (), L2L (), L2P () and

Idle ()

Optimization and Parallelization of the Boundary Element Method for the Wave Equation in Time Domain Berenger Bramas

Page 77: Optimization and Parallelization of the Boundary Element ...berenger.eu/blog/wp-content/uploads/2010/01/Berenger-Slides.pdf · Context Problem Formulation Matrix Approach FMM Approach

..........

.....

....................................................................................

......

.....

.....

.

Context Problem Formulation Matrix Approach FMM Approach Conclusion & Perspectives

FMM Particles Interaction Simulations (39)

Trace - Heterogeneous (3GPUs/21CPUs)

0GPU/15.5s

3GPU/9.4sLegend: P2P (), P2M () , M2M () , M2L (), L2L (), L2P () and

Idle ()

Optimization and Parallelization of the Boundary Element Method for the Wave Equation in Time Domain Berenger Bramas

Page 78: Optimization and Parallelization of the Boundary Element ...berenger.eu/blog/wp-content/uploads/2010/01/Berenger-Slides.pdf · Context Problem Formulation Matrix Approach FMM Approach

..........

.....

....................................................................................

......

.....

.....

.

Context Problem Formulation Matrix Approach FMM Approach Conclusion & Perspectives

FMM Particles Interaction Simulations (39)

Trace - Heterogeneous (4GPUs/20CPUs)

0GPU/15.5s

4GPU/8.7sLegend: P2P (), P2M () , M2M () , M2L (), L2L (), L2P () and

Idle ()

Optimization and Parallelization of the Boundary Element Method for the Wave Equation in Time Domain Berenger Bramas

Page 79: Optimization and Parallelization of the Boundary Element ...berenger.eu/blog/wp-content/uploads/2010/01/Berenger-Slides.pdf · Context Problem Formulation Matrix Approach FMM Approach

..........

.....

....................................................................................

......

.....

.....

.

Context Problem Formulation Matrix Approach FMM Approach Conclusion & Perspectives

FMM Particles Interaction Simulations (40)

Test Case

CPU - 24 Cores

GPU 3 GPU 4

GPU 2GPU 1

• N = 30 millions particles

• Uniform/Lagrange kernel

• Acc = 10−5, 10−7, h = 7 and Granularity = 1500

Optimization and Parallelization of the Boundary Element Method for the Wave Equation in Time Domain Berenger Bramas

Page 80: Optimization and Parallelization of the Boundary Element ...berenger.eu/blog/wp-content/uploads/2010/01/Berenger-Slides.pdf · Context Problem Formulation Matrix Approach FMM Approach

..........

.....

....................................................................................

......

.....

.....

.

Context Problem Formulation Matrix Approach FMM Approach Conclusion & Perspectives

FMM Particles Interaction Simulations (41)

Trace - Heterogeneous (4GPUs)

Acc = 10−5/7.9s

Acc = 10−7/17s

Legend: P2P (), P2M () , M2M () , M2L (), L2L (), L2P () and

Idle ()Optimization and Parallelization of the Boundary Element Method for the Wave Equation in Time Domain Berenger Bramas

Page 81: Optimization and Parallelization of the Boundary Element ...berenger.eu/blog/wp-content/uploads/2010/01/Berenger-Slides.pdf · Context Problem Formulation Matrix Approach FMM Approach

..........

.....

....................................................................................

......

.....

.....

.

Context Problem Formulation Matrix Approach FMM Approach Conclusion & Perspectives

FMM Particles Interaction Simulations (41)

Trace - Heterogeneous (4GPUs)

Acc = 10−5/7.9s

Acc = 10−7/17sLegend: P2P (), P2M () , M2M () , M2L (), L2L (), L2P () and

Idle ()Optimization and Parallelization of the Boundary Element Method for the Wave Equation in Time Domain Berenger Bramas

Page 82: Optimization and Parallelization of the Boundary Element ...berenger.eu/blog/wp-content/uploads/2010/01/Berenger-Slides.pdf · Context Problem Formulation Matrix Approach FMM Approach

..........

.....

....................................................................................

......

.....

.....

.

Context Problem Formulation Matrix Approach FMM Approach Conclusion & Perspectives

FMM Particles Interaction Simulations (42)

Test Cases

Node 1 - 24 Cores

Node 2 - 24 Cores

Node 3 - 24 Cores

Node 4 - 24 Cores

Node 5 - 24 Cores

Node 6 - 24 Cores

Node 7 - 24 Cores

• N = 200 millions particles

• Spherical Expansion/Rotation Kernel

• Acc = 10−3, h = 8 and Granularity = 2000

Optimization and Parallelization of the Boundary Element Method for the Wave Equation in Time Domain Berenger Bramas

Page 83: Optimization and Parallelization of the Boundary Element ...berenger.eu/blog/wp-content/uploads/2010/01/Berenger-Slides.pdf · Context Problem Formulation Matrix Approach FMM Approach

..........

.....

....................................................................................

......

.....

.....

.

Context Problem Formulation Matrix Approach FMM Approach Conclusion & Perspectives

FMM Particles Interaction Simulations (43)

Trace - 7 nodes × 24CPUs

Legend: P2P (), P2M () , M2M () , M2L (), L2L (), L2P () and

Idle () .

Optimization and Parallelization of the Boundary Element Method for the Wave Equation in Time Domain Berenger Bramas

Page 84: Optimization and Parallelization of the Boundary Element ...berenger.eu/blog/wp-content/uploads/2010/01/Berenger-Slides.pdf · Context Problem Formulation Matrix Approach FMM Approach

..........

.....

....................................................................................

......

.....

.....

.

Context Problem Formulation Matrix Approach FMM Approach Conclusion & Perspectives

Summary (44)

Summary:

• Generic

• Kernel independent

• Architecture independent

• Performance portability

Additional contributions:

• Commutativity expression in FMM

• MPI/OpenMP implementation

All included in ScalFMM (C++/HPC library)

Optimization and Parallelization of the Boundary Element Method for the Wave Equation in Time Domain Berenger Bramas

Page 85: Optimization and Parallelization of the Boundary Element ...berenger.eu/blog/wp-content/uploads/2010/01/Berenger-Slides.pdf · Context Problem Formulation Matrix Approach FMM Approach

..........

.....

....................................................................................

......

.....

.....

.

Context Problem Formulation Matrix Approach FMM Approach Conclusion & Perspectives

Summary (44)

Summary:

• Generic

• Kernel independent

• Architecture independent

• Performance portability

Additional contributions:

• Commutativity expression in FMM

• MPI/OpenMP implementation

All included in ScalFMM (C++/HPC library)

Optimization and Parallelization of the Boundary Element Method for the Wave Equation in Time Domain Berenger Bramas

Page 86: Optimization and Parallelization of the Boundary Element ...berenger.eu/blog/wp-content/uploads/2010/01/Berenger-Slides.pdf · Context Problem Formulation Matrix Approach FMM Approach

..........

.....

....................................................................................

......

.....

.....

.

Context Problem Formulation Matrix Approach FMM Approach Conclusion & Perspectives

(45)

Outline

• Problem Formulation

• BEM Solver (Matrix Approach)

• Fast-Multipole Method Approach• FMM Algorithm & Parallelization• FMM BEM Solver (Experimental Implementation)

• Conclusion & Perspectives

Optimization and Parallelization of the Boundary Element Method for the Wave Equation in Time Domain Berenger Bramas

Page 87: Optimization and Parallelization of the Boundary Element ...berenger.eu/blog/wp-content/uploads/2010/01/Berenger-Slides.pdf · Context Problem Formulation Matrix Approach FMM Approach

..........

.....

....................................................................................

......

.....

.....

.

Context Problem Formulation Matrix Approach FMM Approach Conclusion & Perspectives

(46)

Propagation of the Current State to the Future

= =

=

-

,

Linear Solver

sn ln sn sn~

sn~an M0

M1 M2M3M4M5

an-1an-2an-3an-4an-5

=-

ln sn sn~

=

,

Linear Solver

sn~anM0

M5

an

M1

M2

M3

M4sn+5

sn+4

sn+3

sn+2

sn+1

++

++

+

Optimization and Parallelization of the Boundary Element Method for the Wave Equation in Time Domain Berenger Bramas

Page 88: Optimization and Parallelization of the Boundary Element ...berenger.eu/blog/wp-content/uploads/2010/01/Berenger-Slides.pdf · Context Problem Formulation Matrix Approach FMM Approach

..........

.....

....................................................................................

......

.....

.....

.

Context Problem Formulation Matrix Approach FMM Approach Conclusion & Perspectives

(46)

Propagation of the Current State to the Future

= =

=

-

,

Linear Solver

sn ln sn sn~

sn~an M0

M1 M2M3M4M5

an-1an-2an-3an-4an-5

=-

ln sn sn~

=

,

Linear Solver

sn~anM0

M5

an

M1

M2

M3

M4sn+5

sn+4

sn+3

sn+2

sn+1

++

++

+

Optimization and Parallelization of the Boundary Element Method for the Wave Equation in Time Domain Berenger Bramas

Page 89: Optimization and Parallelization of the Boundary Element ...berenger.eu/blog/wp-content/uploads/2010/01/Berenger-Slides.pdf · Context Problem Formulation Matrix Approach FMM Approach

..........

.....

....................................................................................

......

.....

.....

.

Context Problem Formulation Matrix Approach FMM Approach Conclusion & Perspectives

(47)

With FMM

=-

ln sn sn~

=

,

Linear Solver

sn~anM0

an

M1

M2

sn+5

sn+4

sn+3

sn+2

sn+1

++

++

+

FMM

• Far interactions in time (between far elements in space) arecomputed by the FMM

• The spatial decomposition is given by the octree

Optimization and Parallelization of the Boundary Element Method for the Wave Equation in Time Domain Berenger Bramas

Page 90: Optimization and Parallelization of the Boundary Element ...berenger.eu/blog/wp-content/uploads/2010/01/Berenger-Slides.pdf · Context Problem Formulation Matrix Approach FMM Approach

..........

.....

....................................................................................

......

.....

.....

.

Context Problem Formulation Matrix Approach FMM Approach Conclusion & Perspectives

(48)

Overview

• The octree is over a mesh (integration points)

• Interactions matrices between leaves• Approximation/FMM

• development in the time-domain• multipole: what a cell emits to the outside• local: what a cell receives from the outside

• operators in FD or TD• accurate up-to a chosen frequency

• the results in the TD of the matrix approach = FMM

Figure : Complete unit sphere Figure : Truncated unit sphere

Optimization and Parallelization of the Boundary Element Method for the Wave Equation in Time Domain Berenger Bramas

Page 91: Optimization and Parallelization of the Boundary Element ...berenger.eu/blog/wp-content/uploads/2010/01/Berenger-Slides.pdf · Context Problem Formulation Matrix Approach FMM Approach

..........

.....

....................................................................................

......

.....

.....

.

Context Problem Formulation Matrix Approach FMM Approach Conclusion & Perspectives

(49)

Operators (Overview)• P2M

• compute what is emitted by a leaf to the outside

• M2M/L2L• Extrapolation + time shift

• M2L• Convolution product in TD

(term-by-term multiplication in FD)• L2P

• Integration

Optimization and Parallelization of the Boundary Element Method for the Wave Equation in Time Domain Berenger Bramas

Page 92: Optimization and Parallelization of the Boundary Element ...berenger.eu/blog/wp-content/uploads/2010/01/Berenger-Slides.pdf · Context Problem Formulation Matrix Approach FMM Approach

..........

.....

....................................................................................

......

.....

.....

.

Context Problem Formulation Matrix Approach FMM Approach Conclusion & Perspectives

(49)

Operators (Overview)• P2M

• compute what is emitted by a leaf to the outside• M2M/L2L

• Extrapolation + time shift

• M2L• Convolution product in TD

(term-by-term multiplication in FD)• L2P

• Integration

Optimization and Parallelization of the Boundary Element Method for the Wave Equation in Time Domain Berenger Bramas

Page 93: Optimization and Parallelization of the Boundary Element ...berenger.eu/blog/wp-content/uploads/2010/01/Berenger-Slides.pdf · Context Problem Formulation Matrix Approach FMM Approach

..........

.....

....................................................................................

......

.....

.....

.

Context Problem Formulation Matrix Approach FMM Approach Conclusion & Perspectives

(49)

Operators (Overview)• P2M

• compute what is emitted by a leaf to the outside• M2M/L2L

• Extrapolation + time shift• M2L

• Convolution product in TD(term-by-term multiplication in FD)

• L2P• Integration

Optimization and Parallelization of the Boundary Element Method for the Wave Equation in Time Domain Berenger Bramas

Page 94: Optimization and Parallelization of the Boundary Element ...berenger.eu/blog/wp-content/uploads/2010/01/Berenger-Slides.pdf · Context Problem Formulation Matrix Approach FMM Approach

..........

.....

....................................................................................

......

.....

.....

.

Context Problem Formulation Matrix Approach FMM Approach Conclusion & Perspectives

(49)

Operators (Overview)• P2M

• compute what is emitted by a leaf to the outside• M2M/L2L

• Extrapolation + time shift• M2L

• Convolution product in TD(term-by-term multiplication in FD)

• L2P• Integration

Optimization and Parallelization of the Boundary Element Method for the Wave Equation in Time Domain Berenger Bramas

Page 95: Optimization and Parallelization of the Boundary Element ...berenger.eu/blog/wp-content/uploads/2010/01/Berenger-Slides.pdf · Context Problem Formulation Matrix Approach FMM Approach

..........

.....

....................................................................................

......

.....

.....

.

Context Problem Formulation Matrix Approach FMM Approach Conclusion & Perspectives

(49)

Operators (Overview)• P2M

• compute what is emitted by a leaf to the outside• M2M/L2L

• Extrapolation + time shift• M2L

• Convolution product in TD(term-by-term multiplication in FD)

• L2P• Integration

Optimization and Parallelization of the Boundary Element Method for the Wave Equation in Time Domain Berenger Bramas

Page 96: Optimization and Parallelization of the Boundary Element ...berenger.eu/blog/wp-content/uploads/2010/01/Berenger-Slides.pdf · Context Problem Formulation Matrix Approach FMM Approach

..........

.....

....................................................................................

......

.....

.....

.

Context Problem Formulation Matrix Approach FMM Approach Conclusion & Perspectives

Results (50)

Cone-Sphere Test Cases

Case C-927 C-4269 C-10012

Number of unknowns 927 4269 10012FMM tree height 3 4 5Number of leaves 16 64 234

Number of Mk matrices (Kmax) 117 244 370Number of Mk matrices (leaves) 60 64 49

Number of time steps (T ) 2033 4345 6647

Optimization and Parallelization of the Boundary Element Method for the Wave Equation in Time Domain Berenger Bramas

Page 97: Optimization and Parallelization of the Boundary Element ...berenger.eu/blog/wp-content/uploads/2010/01/Berenger-Slides.pdf · Context Problem Formulation Matrix Approach FMM Approach

..........

.....

....................................................................................

......

.....

.....

.

Context Problem Formulation Matrix Approach FMM Approach Conclusion & Perspectives

Results (51)

Sequential Executions

TD vs. FD operators:

FMMStages TD TD + FD-M2L FD Matrix approach

Mk Construction 76 s 76 s 76 s 242 sSolve 58 122 s 53 241 s 97 861s 7.8 s(*)

Total 58 198 s 53 317 s 97 937 s 249.8 s

Execution time TD-FMM Vs. matrix approach to solve the Case C-927 indouble precision.

(*) Our optimized BEM solver.

Optimization and Parallelization of the Boundary Element Method for the Wave Equation in Time Domain Berenger Bramas

Page 98: Optimization and Parallelization of the Boundary Element ...berenger.eu/blog/wp-content/uploads/2010/01/Berenger-Slides.pdf · Context Problem Formulation Matrix Approach FMM Approach

..........

.....

....................................................................................

......

.....

.....

.

Context Problem Formulation Matrix Approach FMM Approach Conclusion & Perspectives

Results (52)

Parallel Executions (FMM Vs. Matrix Approach)

.....

. ..Matrix generation . ..Solve

.....

FMM

.

Matrix

Approach

.0 .

500

.

1,000

.

883

.

237

.

Tim

e(s)

Figure : C-927 (×3.8)

.....

FMM

.

Matrix

Approach

.0 .

0.5

.

1

.

·104

.

9,426

.

10,080

Figure : C-4269 (×1)

.....

FMM

.

Matrix

Approach

.0 .

2

.

4

.

·104

.

33,408

.

25,256

Figure : C-10012 (×1.4)

The captions of the different cases show the overhead of the FMM TD-BEM against the matrix approach.

Optimization and Parallelization of the Boundary Element Method for the Wave Equation in Time Domain Berenger Bramas

Page 99: Optimization and Parallelization of the Boundary Element ...berenger.eu/blog/wp-content/uploads/2010/01/Berenger-Slides.pdf · Context Problem Formulation Matrix Approach FMM Approach

..........

.....

....................................................................................

......

.....

.....

.

Context Problem Formulation Matrix Approach FMM Approach Conclusion & Perspectives

Summary (53)

Summary:

• Preliminary results

• Best configuration: TD + FD M2L

• Not competitive against the direct approach (maybe on largertest cases)

• Any improvement of the matrix creation will make the FMMless competitive

Additional contributions:

• Incomplete/4D FMM

• Sphere discretization/length APS signal

Optimization and Parallelization of the Boundary Element Method for the Wave Equation in Time Domain Berenger Bramas

Page 100: Optimization and Parallelization of the Boundary Element ...berenger.eu/blog/wp-content/uploads/2010/01/Berenger-Slides.pdf · Context Problem Formulation Matrix Approach FMM Approach

..........

.....

....................................................................................

......

.....

.....

.

Context Problem Formulation Matrix Approach FMM Approach Conclusion & Perspectives

(54)

Conclusion & Perspectives

Optimization and Parallelization of the Boundary Element Method for the Wave Equation in Time Domain Berenger Bramas

Page 101: Optimization and Parallelization of the Boundary Element ...berenger.eu/blog/wp-content/uploads/2010/01/Berenger-Slides.pdf · Context Problem Formulation Matrix Approach FMM Approach

..........

.....

....................................................................................

......

.....

.....

.

Context Problem Formulation Matrix Approach FMM Approach Conclusion & Perspectives

(55)

Conclusion

Dense BEM/Matrix Approach

• Based on a new computational order

• Remove the bottleneck of the SpMV

• Implemented efficiently on modern architectures

• Complete BEM solver

Optimization and Parallelization of the Boundary Element Method for the Wave Equation in Time Domain Berenger Bramas

Page 102: Optimization and Parallelization of the Boundary Element ...berenger.eu/blog/wp-content/uploads/2010/01/Berenger-Slides.pdf · Context Problem Formulation Matrix Approach FMM Approach

..........

.....

....................................................................................

......

.....

.....

.

Context Problem Formulation Matrix Approach FMM Approach Conclusion & Perspectives

(56)

Conclusion

FMM

• Generic and state-of-the-art library

• Several parallelization strategies

• Robust OpenMP/MPI implementation (10 billions particles)

• Modern task-based approach

• ScalFMM

Optimization and Parallelization of the Boundary Element Method for the Wave Equation in Time Domain Berenger Bramas

Page 103: Optimization and Parallelization of the Boundary Element ...berenger.eu/blog/wp-content/uploads/2010/01/Berenger-Slides.pdf · Context Problem Formulation Matrix Approach FMM Approach

..........

.....

....................................................................................

......

.....

.....

.

Context Problem Formulation Matrix Approach FMM Approach Conclusion & Perspectives

(57)

Conclusion

FMM BEM Solver (Preliminary)

• Parallelized using ScalFMM

• Best configuration TD operators + FD M2L

• Our implementation is not faster than the direct approach

Optimization and Parallelization of the Boundary Element Method for the Wave Equation in Time Domain Berenger Bramas

Page 104: Optimization and Parallelization of the Boundary Element ...berenger.eu/blog/wp-content/uploads/2010/01/Berenger-Slides.pdf · Context Problem Formulation Matrix Approach FMM Approach

..........

.....

....................................................................................

......

.....

.....

.

Context Problem Formulation Matrix Approach FMM Approach Conclusion & Perspectives

(58)

Perspectives

TD-BEM

• Improve the construction of the interaction matrices

• M0 linear solver: small matrix, lots of nodes

• Compare existing solvers (TD vs. FD)

Optimization and Parallelization of the Boundary Element Method for the Wave Equation in Time Domain Berenger Bramas

Page 105: Optimization and Parallelization of the Boundary Element ...berenger.eu/blog/wp-content/uploads/2010/01/Berenger-Slides.pdf · Context Problem Formulation Matrix Approach FMM Approach

..........

.....

....................................................................................

......

.....

.....

.

Context Problem Formulation Matrix Approach FMM Approach Conclusion & Perspectives

(59)

Perspectives

FMM parallelization

• Task-based with implicit MPI communications

• Group-Tree update

Optimization and Parallelization of the Boundary Element Method for the Wave Equation in Time Domain Berenger Bramas

Page 106: Optimization and Parallelization of the Boundary Element ...berenger.eu/blog/wp-content/uploads/2010/01/Berenger-Slides.pdf · Context Problem Formulation Matrix Approach FMM Approach

..........

.....

....................................................................................

......

.....

.....

.

Context Problem Formulation Matrix Approach FMM Approach Conclusion & Perspectives

(60)

Perspectives

FMM BEM

• Study the cost of the solve compare to the direct approach(complexity for some cases)

• Lots of remaining optimizations to test

Optimization and Parallelization of the Boundary Element Method for the Wave Equation in Time Domain Berenger Bramas

Page 107: Optimization and Parallelization of the Boundary Element ...berenger.eu/blog/wp-content/uploads/2010/01/Berenger-Slides.pdf · Context Problem Formulation Matrix Approach FMM Approach

..........

.....

....................................................................................

......

.....

.....

.

(1)

Optimization and Parallelization of the Boundary Element Method for the Wave Equation in Time Domain Berenger Bramas

Page 108: Optimization and Parallelization of the Boundary Element ...berenger.eu/blog/wp-content/uploads/2010/01/Berenger-Slides.pdf · Context Problem Formulation Matrix Approach FMM Approach

..........

.....

....................................................................................

......

.....

.....

.

(2)

Agullo, E., Bramas, B., Coulaud, O., Darve, E., Messner, M.,and Takahashi, T. (2014).Task-based fmm for multicore architectures.SIAM Journal on Scientific Computing, 36(1):C66–C93.

Agullo, E., Bramas, B., Coulaud, O., Darve, E., Messner, M.,and Takahashi, T. (2015).Task-based fmm for heterogeneous architectures.Concurrency and Computation: Practice and Experience,pages n/a–n/a.cpe.3723.

Baskaran, M. M. and Bordawekar, R. (2008).Optimizing sparse matrix-vector multiplication on gpus usingcompile-time and run-time strategies.IBM Reserach Report, RC24704 (W0812-047).

Bell, N. and Garland, M. (2009).Implementing sparse matrix-vector multiplication onthroughput-oriented processors.In Proceedings of the Conference on High PerformanceComputing Networking, Storage and Analysis, page 18. ACM.

Blumofe, R. D. and Leiserson, C. E. (1999).Scheduling multithreaded computations by work stealing.J. ACM, 46(5):720–748.

Bramas, B., Coulaud, O., and Sylvand, G. (2014).Euro-Par 2014 Parallel Processing: 20th InternationalConference, Porto, Portugal, August 25-29, 2014. Proceedings,chapter Time-Domain BEM for the Wave Equation:Optimization and Hybrid Parallelization, pages 511–523.Springer International Publishing, Cham.

Bramas, B., Coulaud, O., and Sylvand, G. (2015).Time-domain bem for the wave equation ondistributed-heterogeneous architectures.Parallel Comput., 49(C):66–82.

Chandramowlishwaran, A., Williams, S., Oliker, L., Lashuk, I.,Biros, G., and Vuduc, R. (2010).Optimizing and tuning the fast multipole method forstate-of-the-art multicore architectures.In Parallel Distributed Processing (IPDPS), 2010 IEEEInternational Symposium on, pages 1–12.

Ergin, A. A., Shanker, B., and Michielssen, E. (2000).Fast analysis of transient acoustic wave scattering from rigidbodies using the multilevel plane wave time domain algorithm.The Journal of the Acoustical Society of America, 107(3).

Garland, M. (2008).Sparse matrix computations on manycore gpu’s.In Proceedings of the 45th annual Design AutomationConference, pages 2–6. ACM.

Hamada, T., Narumi, T., Yokota, R., Yasuoka, K., Nitadori,K., and Taiji, M. (2009).42 tflops hierarchical n-body simulations on gpus withapplications in both astrophysics and turbulence.In Proceedings of the Conference on High PerformanceComputing Networking, Storage and Analysis, SC ’09, pages62:1–62:12, New York, NY, USA. ACM.

Hu, Q., Gumerov, N. A., and Duraiswami, R. (2011).Scalable fast multipole methods on distributed heterogeneousarchitectures.In Proceedings of 2011 International Conference for HighPerformance Computing, Networking, Storage and Analysis,SC ’11, pages 36:1–36:12, New York, NY, USA. ACM.

Im, E.-J. and Yelick, K. (2001).Optimizing sparse matrix computations for register reuse insparsity.In Computational ScienceICCS 2001, pages 127–136. Springer.

Lashuk, I., Chandramowlishwaran, A., Langston, H., Nguyen,T.-A., Sampath, R., Shringarpure, A., Vuduc, R., Ying, L.,Zorin, D., and Biros, G. (2012).A massively parallel adaptive fast multipole method onheterogeneous architectures.Commun. ACM, 55(5):101–109.

Ltaief, H. and Yokota, R. (2014).Data-driven execution of fast multipole methods.Concurrency and Computation: Practice and Experience,26(11):1935–1946.

Malhotra, D. and Biros, G. (2015).Pvfmm: A parallel kernel independent fmm for particle andvolume potentials.Communications in Computational Physics, 18(03):808–830.

Pichel, J. C., Heras, D. B., Cabaleiro, J. C., and Rivera, F. F.(2005).Performance optimization of irregular codes based on thecombination of reordering and blocking techniques.Parallel Computing, 31(8):858–876.

Pinar, A. and Heath, M. T. (1999).Improving performance of sparse matrix-vector multiplication.In Proceedings of the 1999 ACM/IEEE conference onSupercomputing, page 30. ACM.

Terrasse, I. (1993).Resolution mathematique et numerique des equations deMaxwell instationnaires par une methode de potentielsretardes.PhD thesis.

Topcuouglu, H., Hariri, S., and Wu, M.-y. (2002).Performance-effective and low-complexity task scheduling forheterogeneous computing.IEEE Trans. Parallel Distrib. Syst., 13(3):260–274.

Vuduc, R., Demmel, J. W., and Yelick, K. A. (2005).Oski: A library of automatically tuned sparse matrix kernels.In Journal of Physics: Conference Series, volume 16, page 521.IOP Publishing.

Vuduc, R. W. and Moon, H.-J. (2005).Fast sparse matrix-vector multiplication by exploiting variableblock structure.In High Performance Computing and Communications, pages807–816. Springer.

White III, J. B. and Sadayappan, P. (1997).On improving the performance of sparse matrix-vectormultiplication.In High-Performance Computing, 1997. Proceedings. FourthInternational Conference on, pages 66–71. IEEE.

Yokota, R. and Barba, L. A. (2011).Treecode and fast multipole method for n-body simulationwith cuda.GPU Computing Gems Emerald Edition, page 113.

Optimization and Parallelization of the Boundary Element Method for the Wave Equation in Time Domain Berenger Bramas

Page 109: Optimization and Parallelization of the Boundary Element ...berenger.eu/blog/wp-content/uploads/2010/01/Berenger-Slides.pdf · Context Problem Formulation Matrix Approach FMM Approach

..........

.....

....................................................................................

......

.....

.....

.

(3)

Vectorized Implementation (Overview)

The algorithm is tied to the vectorization type size lSIMD

Example with ng = 4, v = 4, lSIMD = 2:

a*<n(j)

a b c d

Slicej row-vector(i)

2

3

4

5

0

1

sn+1 snsn+2

6

sn+2

sn+1

sn

sn+3

sn+3

buffer

0

1

a b

sn+3

2

1

sn+2

buffer

2

3

3

4

2

3

buffer

a b

a b

sn+1

3

4

a b

sn

buffer

2

3

buffer

c d

sn+3

3

4

c d

sn+2

buffer

4

5

c d

sn+1

6

5

sn

c dr

r

r

r

r

r

r

r

r

• Each value from the vectors is read only once (and maybecopied into the buffer)

• The values in the buffer are shifted to avoid reloading

Optimization and Parallelization of the Boundary Element Method for the Wave Equation in Time Domain Berenger Bramas

Page 110: Optimization and Parallelization of the Boundary Element ...berenger.eu/blog/wp-content/uploads/2010/01/Berenger-Slides.pdf · Context Problem Formulation Matrix Approach FMM Approach

..........

.....

....................................................................................

......

.....

.....

.

(3)

Vectorized Implementation (Overview)

The algorithm is tied to the vectorization type size lSIMD

Example with ng = 4, v = 4, lSIMD = 2:

a*<n(j)

a b c d

Slicej row-vector(i)

2

3

4

5

0

1

sn+1 snsn+2

6

sn+2

sn+1

sn

sn+3

sn+3

buffer

0

1

a b

sn+3

2

1

sn+2

buffer

2

3

3

4

2

3

buffer

a b

a b

sn+1

3

4

a b

sn

buffer

2

3

buffer

c d

sn+3

3

4

c d

sn+2

buffer

4

5

c d

sn+1

6

5

sn

c dr

r

r

r

r

r

r

r

r

• Each value from the vectors is read only once (and maybecopied into the buffer)

• The values in the buffer are shifted to avoid reloading

Optimization and Parallelization of the Boundary Element Method for the Wave Equation in Time Domain Berenger Bramas

Page 111: Optimization and Parallelization of the Boundary Element ...berenger.eu/blog/wp-content/uploads/2010/01/Berenger-Slides.pdf · Context Problem Formulation Matrix Approach FMM Approach

..........

.....

....................................................................................

......

.....

.....

.

(3)

Vectorized Implementation (Overview)

The algorithm is tied to the vectorization type size lSIMD

Example with ng = 4, v = 4, lSIMD = 2:

a*<n(j)

a b c d

Slicej row-vector(i)

2

3

4

5

0

1

sn+1 snsn+2

6

sn+2

sn+1

sn

sn+3

sn+3

buffer

0

1

a b

sn+3

2

1

sn+2

buffer

2

3

3

4

2

3

buffer

a b

a b

sn+1

3

4

a b

sn

buffer

2

3

buffer

c d

sn+3

3

4

c d

sn+2

buffer

4

5

c d

sn+1

6

5

sn

c dr

r

r

r

r

r

r

r

r

• Each value from the vectors is read only once (and maybecopied into the buffer)

• The values in the buffer are shifted to avoid reloading

Optimization and Parallelization of the Boundary Element Method for the Wave Equation in Time Domain Berenger Bramas

Page 112: Optimization and Parallelization of the Boundary Element ...berenger.eu/blog/wp-content/uploads/2010/01/Berenger-Slides.pdf · Context Problem Formulation Matrix Approach FMM Approach

..........

.....

....................................................................................

......

.....

.....

.

(3)

Vectorized Implementation (Overview)

The algorithm is tied to the vectorization type size lSIMD

Example with ng = 4, v = 4, lSIMD = 2:

a*<n(j)

a b c d

Slicej row-vector(i)

2

3

4

5

0

1

sn+1 snsn+2

6

sn+2

sn+1

sn

sn+3

sn+3

buffer

0

1

a b

sn+3

2

1

sn+2

buffer

2

3

3

4

2

3

buffer

a b

a b

sn+1

3

4

a b

sn

buffer

2

3

buffer

c d

sn+3

3

4

c d

sn+2

buffer

4

5

c d

sn+1

6

5

sn

c dr

r

r

r

r

r

r

r

r

• Each value from the vectors is read only once (and maybecopied into the buffer)

• The values in the buffer are shifted to avoid reloading

Optimization and Parallelization of the Boundary Element Method for the Wave Equation in Time Domain Berenger Bramas

Page 113: Optimization and Parallelization of the Boundary Element ...berenger.eu/blog/wp-content/uploads/2010/01/Berenger-Slides.pdf · Context Problem Formulation Matrix Approach FMM Approach

..........

.....

....................................................................................

......

.....

.....

.

(3)

Vectorized Implementation (Overview)

The algorithm is tied to the vectorization type size lSIMD

Example with ng = 4, v = 4, lSIMD = 2:

a*<n(j)

a b c d

Slicej row-vector(i)

2

3

4

5

0

1

sn+1 snsn+2

6

sn+2

sn+1

sn

sn+3

sn+3

buffer

0

1

a b

sn+3

2

1

sn+2

buffer

2

3

3

4

2

3

buffer

a b

a b

sn+1

3

4

a b

sn

buffer

2

3

buffer

c d

sn+3

3

4

c d

sn+2

buffer

4

5

c d

sn+1

6

5

sn

c dr

r

r

r

r

r

r

r

r

• Each value from the vectors is read only once (and maybecopied into the buffer)

• The values in the buffer are shifted to avoid reloading

Optimization and Parallelization of the Boundary Element Method for the Wave Equation in Time Domain Berenger Bramas

Page 114: Optimization and Parallelization of the Boundary Element ...berenger.eu/blog/wp-content/uploads/2010/01/Berenger-Slides.pdf · Context Problem Formulation Matrix Approach FMM Approach

..........

.....

....................................................................................

......

.....

.....

.

(4)

In Time or Frequency Domain

Propagation of the wave for several time steps on a targetdiscretized sphere. The different spheres represent the values thatwill be applied on the included mesh elements.

Optimization and Parallelization of the Boundary Element Method for the Wave Equation in Time Domain Berenger Bramas

Page 115: Optimization and Parallelization of the Boundary Element ...berenger.eu/blog/wp-content/uploads/2010/01/Berenger-Slides.pdf · Context Problem Formulation Matrix Approach FMM Approach

..........

.....

....................................................................................

......

.....

.....

.

(5)

Contiguous-Blocking Computational Kernel

Slicej

Kmax-1

N

a*<n(j)I) Coalesced copy of the vector into shared memory

III) Coalesced add to global

memory

II) Coalesced access in global memory

(column per column)

III) Partial results in registers

snsn+1sn+2

II) Uncoalesced read from shared memory

Global Memory Shared Memory Local Memory

(a) (b) (c)

bc

Thread-1Thread-2Thread-3Thread-4Thread-5Thread-6Thread-7Thread-8Thread-9

(a) the original slice is transformed in a block during thepre-computation stage (ng = 3, bc = 11)

(b) the blocks are moved to the device memory for thesummation stage

(c) a thread-block (nb − threads = 9) is in charge of the blocksfrom a slice interval and computes several summation vectorsat the same timeOptimization and Parallelization of the Boundary Element Method for the Wave Equation in Time Domain Berenger Bramas

Page 116: Optimization and Parallelization of the Boundary Element ...berenger.eu/blog/wp-content/uploads/2010/01/Berenger-Slides.pdf · Context Problem Formulation Matrix Approach FMM Approach

..........

.....

....................................................................................

......

.....

.....

.

(6)

Multi-vectors/vector Product

sn+2 (i)sn+1 (i) sn (i)

0

1

2

3

4

0

sn+2 (i)sn+1 (i)sn (i)

1

2

3

4

5

6

1

2

3

4

5

2

3

4

5

6

(a) (b)

Computing one slice-row with 3 vectors (ng = 3)

(a) using 3 scalar products

(b) using the multi-vectors/vector product

Optimization and Parallelization of the Boundary Element Method for the Wave Equation in Time Domain Berenger Bramas

Page 117: Optimization and Parallelization of the Boundary Element ...berenger.eu/blog/wp-content/uploads/2010/01/Berenger-Slides.pdf · Context Problem Formulation Matrix Approach FMM Approach

..........

.....

....................................................................................

......

.....

.....

.

(7)

In Shared Memory

• OpenMP parallelization

• Summation divided/balanced between the threads

• No communication during the summation

• Multi-threaded M0 linear solver if possible

• NUMA effects are not handled

Optimization and Parallelization of the Boundary Element Method for the Wave Equation in Time Domain Berenger Bramas

Page 118: Optimization and Parallelization of the Boundary Element ...berenger.eu/blog/wp-content/uploads/2010/01/Berenger-Slides.pdf · Context Problem Formulation Matrix Approach FMM Approach

..........

.....

....................................................................................

......

.....

.....

.

(8)

On Heterogeneous Nodes

• OpenMP parallelization

• One thread/core per GPU

• An interval of the slices is moved on each GPU

• Intervals are balanced between each iteration with a greedyalgorithm

• The memory limit of the GPU may reduce their performance

Optimization and Parallelization of the Boundary Element Method for the Wave Equation in Time Domain Berenger Bramas

Page 119: Optimization and Parallelization of the Boundary Element ...berenger.eu/blog/wp-content/uploads/2010/01/Berenger-Slides.pdf · Context Problem Formulation Matrix Approach FMM Approach

..........

.....

....................................................................................

......

.....

.....

.

(9)

Application ExampleFD-BEM + FMM (antenna at 1GHz)

Image from Airbus Group.

Optimization and Parallelization of the Boundary Element Method for the Wave Equation in Time Domain Berenger Bramas

Page 120: Optimization and Parallelization of the Boundary Element ...berenger.eu/blog/wp-content/uploads/2010/01/Berenger-Slides.pdf · Context Problem Formulation Matrix Approach FMM Approach

..........

.....

....................................................................................

......

.....

.....

.

(10)

Cone-Sphere Test Cases

Case C-927 C-4269 C-10012 C-22468

Number of unknowns 927 4269 10012 C-22468FMM tree height 3 4 5 6

Number of leaves in the FMM tree 16 64 234 936Number of NNZ interaction matrices (Kmax ) 117 244 370 551

Number of NNZ matrices between FMM leaves 60 64 49 37Number of time steps (T ) 2033 4345 6647 9957Size of the simulation box 3.3 7.3 11 16

Fmax 348 337 335 334Incomplete FMM coefficient l = h − 1 16 18 13 10

Incomplete FMM coefficient l = 2 16 36 52 80

Optimization and Parallelization of the Boundary Element Method for the Wave Equation in Time Domain Berenger Bramas

Page 121: Optimization and Parallelization of the Boundary Element ...berenger.eu/blog/wp-content/uploads/2010/01/Berenger-Slides.pdf · Context Problem Formulation Matrix Approach FMM Approach

..........

.....

....................................................................................

......

.....

.....

.

(11)

Multi-vectors/vector Product (GPU)

For the Contiguous-Blocking scheme:

Width (bc)16 32 64 128 16 32 64 128

GPU CPUSingle 243 338 431 496 (11%) 4.3 5.5 7.8 6.8 (17%)Double 143 199 248 286 (20%) 3.9 5.6 4.2 4.3 (21%)

GFlop/s for 420 slices (6400 rows and bc columns)(%) percentage of the peak performance

Optimization and Parallelization of the Boundary Element Method for the Wave Equation in Time Domain Berenger Bramas

Page 122: Optimization and Parallelization of the Boundary Element ...berenger.eu/blog/wp-content/uploads/2010/01/Berenger-Slides.pdf · Context Problem Formulation Matrix Approach FMM Approach

..........

.....

....................................................................................

......

.....

.....

.

(12)

Fork-join (OpenMP)

• Level by level

• Critical balancing

• Possible bottleneck (top of the tree)

• Difficult to mix near/far fields

Optimization and Parallelization of the Boundary Element Method for the Wave Equation in Time Domain Berenger Bramas

Page 123: Optimization and Parallelization of the Boundary Element ...berenger.eu/blog/wp-content/uploads/2010/01/Berenger-Slides.pdf · Context Problem Formulation Matrix Approach FMM Approach

..........

.....

....................................................................................

......

.....

.....

.

(12)

Fork-join (OpenMP)

• Level by level

• Critical balancing

• Possible bottleneck (top of the tree)

• Difficult to mix near/far fields

Optimization and Parallelization of the Boundary Element Method for the Wave Equation in Time Domain Berenger Bramas

Page 124: Optimization and Parallelization of the Boundary Element ...berenger.eu/blog/wp-content/uploads/2010/01/Berenger-Slides.pdf · Context Problem Formulation Matrix Approach FMM Approach

..........

.....

....................................................................................

......

.....

.....

.

(12)

Fork-join (OpenMP)

• Level by level

• Critical balancing

• Possible bottleneck (top of the tree)

• Difficult to mix near/far fields

Optimization and Parallelization of the Boundary Element Method for the Wave Equation in Time Domain Berenger Bramas

Page 125: Optimization and Parallelization of the Boundary Element ...berenger.eu/blog/wp-content/uploads/2010/01/Berenger-Slides.pdf · Context Problem Formulation Matrix Approach FMM Approach

..........

.....

....................................................................................

......

.....

.....

.

(12)

Fork-join (OpenMP)

• Level by level

• Critical balancing

• Possible bottleneck (top of the tree)

• Difficult to mix near/far fields

Optimization and Parallelization of the Boundary Element Method for the Wave Equation in Time Domain Berenger Bramas

Page 126: Optimization and Parallelization of the Boundary Element ...berenger.eu/blog/wp-content/uploads/2010/01/Berenger-Slides.pdf · Context Problem Formulation Matrix Approach FMM Approach

..........

.....

....................................................................................

......

.....

.....

.

(12)

Fork-join (OpenMP)

• Level by level

• Critical balancing

• Possible bottleneck (top of the tree)

• Difficult to mix near/far fields

Optimization and Parallelization of the Boundary Element Method for the Wave Equation in Time Domain Berenger Bramas

Page 127: Optimization and Parallelization of the Boundary Element ...berenger.eu/blog/wp-content/uploads/2010/01/Berenger-Slides.pdf · Context Problem Formulation Matrix Approach FMM Approach

..........

.....

....................................................................................

......

.....

.....

.

(13)

Fork-join+Message-passing (Hybrid OpenMP/MPI)

P0 P1 P2 P3

• Distribute the tree between nodes

• Progress level by level

• Communication between all stages

Poor parallelism expression

Optimization and Parallelization of the Boundary Element Method for the Wave Equation in Time Domain Berenger Bramas

Page 128: Optimization and Parallelization of the Boundary Element ...berenger.eu/blog/wp-content/uploads/2010/01/Berenger-Slides.pdf · Context Problem Formulation Matrix Approach FMM Approach

..........

.....

....................................................................................

......

.....

.....

.

(13)

Fork-join+Message-passing (Hybrid OpenMP/MPI)

P0 P1 P2 P3

• Distribute the tree between nodes

• Progress level by level

• Communication between all stages

Poor parallelism expression

Optimization and Parallelization of the Boundary Element Method for the Wave Equation in Time Domain Berenger Bramas

Page 129: Optimization and Parallelization of the Boundary Element ...berenger.eu/blog/wp-content/uploads/2010/01/Berenger-Slides.pdf · Context Problem Formulation Matrix Approach FMM Approach

..........

.....

....................................................................................

......

.....

.....

.

(14)

Trace - Heterogeneous (4GPUs)

h = 7/ng = 1500/Acc = 10−7/17s

h = 6/ng = 300/Acc = 10−7/39s (not on the same scale)

24 threads, N = 30 millions, uniform distribution, Uniform/Lagrange

kernel. Legend: P2P () , P2M () , M2M () , M2L (), L2L (),

L2P () and Idle ()

Optimization and Parallelization of the Boundary Element Method for the Wave Equation in Time Domain Berenger Bramas

Page 130: Optimization and Parallelization of the Boundary Element ...berenger.eu/blog/wp-content/uploads/2010/01/Berenger-Slides.pdf · Context Problem Formulation Matrix Approach FMM Approach

..........

.....

....................................................................................

......

.....

.....

.

(14)

Trace - Heterogeneous (4GPUs)

h = 7/ng = 1500/Acc = 10−7/17s

h = 6/ng = 300/Acc = 10−7/39s (not on the same scale)

24 threads, N = 30 millions, uniform distribution, Uniform/Lagrange

kernel. Legend: P2P () , P2M () , M2M () , M2L (), L2L (),

L2P () and Idle ()

Optimization and Parallelization of the Boundary Element Method for the Wave Equation in Time Domain Berenger Bramas

Page 131: Optimization and Parallelization of the Boundary Element ...berenger.eu/blog/wp-content/uploads/2010/01/Berenger-Slides.pdf · Context Problem Formulation Matrix Approach FMM Approach

..........

.....

....................................................................................

......

.....

.....

.

(15)

Flop/Cost Estimation

...

..

1,000

.

10,000

.

20,000

.

106

.

107

.

108

.

109

.

1

.

6

.

23

.

92

.

Number of unknowns

.

UnitofCost

.

. ..Between FMM leaves

. ..Complete interaction matrices

Figure : Matrix generation costestimation

...

..

1,000

.

10,000

.

20,000

.

1010

.

1013

.

1016

.

1340

.

388

.

203

.

154

.

1320

.

353

.

119

.

57

.

Number of unknowns

.

Number

ofFlop

.

. ..TD-BEM FMM

. ..TD-BEM FMM (FD M2L)

. ..Matrix Approach

Figure : Summation stage Flopestimation

The numbers above the slower plot represent the slow-down factors against the faster method.

Optimization and Parallelization of the Boundary Element Method for the Wave Equation in Time Domain Berenger Bramas

Page 132: Optimization and Parallelization of the Boundary Element ...berenger.eu/blog/wp-content/uploads/2010/01/Berenger-Slides.pdf · Context Problem Formulation Matrix Approach FMM Approach

..........

.....

....................................................................................

......

.....

.....

.

(15)

Flop/Cost Estimation

...

..

1,000

.

10,000

.

20,000

.

106

.

107

.

108

.

109

.

1

.

6

.

23

.

92

.

Number of unknowns

.

UnitofCost

.

. ..Between FMM leaves

. ..Complete interaction matrices

Figure : Matrix generation costestimation

...

..

1,000

.

10,000

.

20,000

.

1010

.

1013

.

1016

.

1340

.

388

.

203

.

154

.

1320

.

353

.

119

.

57

.

Number of unknowns

.

Number

ofFlop

.

. ..TD-BEM FMM

. ..TD-BEM FMM (FD M2L)

. ..Matrix Approach

Figure : Summation stage Flopestimation

The numbers above the slower plot represent the slow-down factors against the faster method.

Optimization and Parallelization of the Boundary Element Method for the Wave Equation in Time Domain Berenger Bramas

Page 133: Optimization and Parallelization of the Boundary Element ...berenger.eu/blog/wp-content/uploads/2010/01/Berenger-Slides.pdf · Context Problem Formulation Matrix Approach FMM Approach

..........

.....

....................................................................................

......

.....

.....

.

(16)

Reducing the Complexity

Direct computation 0(N2)

→ FMM O(N)

..

X

.

Y

.O(N2)

..

X

.

Y

.O(N)

• Spatial decomposition → Potential decomposition

fi = f neari + f fari

• Near field is computed by direct interactions• The far field is done using different operators

Optimization and Parallelization of the Boundary Element Method for the Wave Equation in Time Domain Berenger Bramas

Page 134: Optimization and Parallelization of the Boundary Element ...berenger.eu/blog/wp-content/uploads/2010/01/Berenger-Slides.pdf · Context Problem Formulation Matrix Approach FMM Approach

..........

.....

....................................................................................

......

.....

.....

.

(16)

Reducing the Complexity

Direct computation 0(N2) → FMM O(N)

..

X

.

Y

.O(N2)

..

X

.

Y

.O(N)

• Spatial decomposition → Potential decomposition

fi = f neari + f fari

• Near field is computed by direct interactions• The far field is done using different operators

Optimization and Parallelization of the Boundary Element Method for the Wave Equation in Time Domain Berenger Bramas

Page 135: Optimization and Parallelization of the Boundary Element ...berenger.eu/blog/wp-content/uploads/2010/01/Berenger-Slides.pdf · Context Problem Formulation Matrix Approach FMM Approach

..........

.....

....................................................................................

......

.....

.....

.

(16)

Reducing the Complexity

Direct computation 0(N2) → FMM O(N)

..

X

.

Y

.O(N2)

..

X

.

Y

.O(N)

• Spatial decomposition → Potential decomposition

fi = f neari + f fari

• Near field is computed by direct interactions• The far field is done using different operators

Optimization and Parallelization of the Boundary Element Method for the Wave Equation in Time Domain Berenger Bramas

Page 136: Optimization and Parallelization of the Boundary Element ...berenger.eu/blog/wp-content/uploads/2010/01/Berenger-Slides.pdf · Context Problem Formulation Matrix Approach FMM Approach

..........

.....

....................................................................................

......

.....

.....

.

Results (17)

Airplane Simulation

• Acoustics

• N = 23 962

• 10 823 time iterations

• Kmax = 341 interaction matrices Mk (≈ 5.5× 109 NNZ)

• Computing sn ≈ 11GFlop

• Total ≈ 130 651GFlop

• ng = 8

• 70GB of data

• CPU node : 2 Dodeca-core Haswell Intel Xeon E5-2680 at2, 50GHz and 128GB (DDR4) of shared memory

• GPUs per node: 4 NVIDIA Kepler K40M GPU (745MHz),2880 Cores, 12GB of dedicated memory

Optimization and Parallelization of the Boundary Element Method for the Wave Equation in Time Domain Berenger Bramas

Page 137: Optimization and Parallelization of the Boundary Element ...berenger.eu/blog/wp-content/uploads/2010/01/Berenger-Slides.pdf · Context Problem Formulation Matrix Approach FMM Approach

..........

.....

....................................................................................

......

.....

.....

.

Results (18)

Hybrid MPI/OpenMP

.....

. ..Hybrid MPI/OpenMP

... ..1.

10.

20.

30.

40.

50.

1

.

0.5

.0 .

Number of nodes

.

Efficienty

Efficiency: Uniform distribution, Spherical Expansion/Rotation Kernel,

N = 200 millions, h = 8, Acc = 10−3, from 1 to 50 nodes (24 threads per

node), for np = 50 the execution time is 2.24s

Optimization and Parallelization of the Boundary Element Method for the Wave Equation in Time Domain Berenger Bramas

Page 138: Optimization and Parallelization of the Boundary Element ...berenger.eu/blog/wp-content/uploads/2010/01/Berenger-Slides.pdf · Context Problem Formulation Matrix Approach FMM Approach

..........

.....

....................................................................................

......

.....

.....

.

Results (19)

Group-tree

• Granularity G

• A group → G cells/leaves

• Good locality

• Low iteration complexity

• Dependencies between cells = between groups

Optimization and Parallelization of the Boundary Element Method for the Wave Equation in Time Domain Berenger Bramas

Page 139: Optimization and Parallelization of the Boundary Element ...berenger.eu/blog/wp-content/uploads/2010/01/Berenger-Slides.pdf · Context Problem Formulation Matrix Approach FMM Approach

..........

.....

....................................................................................

......

.....

.....

.

Results (20)

Trace - Shared Memory - 24CPUs

N = 20 millions, ellipsoid distribution, Spherical Expansion/Rotation Kernel,Acc = 10−3, h = 11 and ng = 8000 in 5.2s.

Legend: P2P (), P2M () , M2M () , M2L (), L2L (), L2P () and

Idle ()

Optimization and Parallelization of the Boundary Element Method for the Wave Equation in Time Domain Berenger Bramas

Page 140: Optimization and Parallelization of the Boundary Element ...berenger.eu/blog/wp-content/uploads/2010/01/Berenger-Slides.pdf · Context Problem Formulation Matrix Approach FMM Approach

..........

.....

....................................................................................

......

.....

.....

.

Results (21)

Trace - 24CPUs

N = 30 millions, uniform distribution, Spherical Expansion/Rotation Kernel,Acc = 10−3, h = 7 and ng = 1500 in 15.5s.

Legend: P2P (), P2M () , M2M () , M2L (), L2L (), L2P () and

Idle ()

Optimization and Parallelization of the Boundary Element Method for the Wave Equation in Time Domain Berenger Bramas

Page 141: Optimization and Parallelization of the Boundary Element ...berenger.eu/blog/wp-content/uploads/2010/01/Berenger-Slides.pdf · Context Problem Formulation Matrix Approach FMM Approach

..........

.....

....................................................................................

......

.....

.....

.

Particles Interaction Simulations (22)

Test Cases

Interactions between N particles for two distributions:

Figure : Uniform Figure : Ellipsoid

The height of the tree (h) is chosen such that the execution time isminimal in sequential.

Optimization and Parallelization of the Boundary Element Method for the Wave Equation in Time Domain Berenger Bramas

Page 142: Optimization and Parallelization of the Boundary Element ...berenger.eu/blog/wp-content/uploads/2010/01/Berenger-Slides.pdf · Context Problem Formulation Matrix Approach FMM Approach

..........

.....

....................................................................................

......

.....

.....

.

Particles Interaction Simulations (23)

Parallel Strategies for FMM BEMThree strategies (Fork-join OpenMP)

• Threaded FMM: divide each level between threads (ScalFMMclassic)

• Threaded kernel: divide the work inside the kernel

• Mix FMM/Kernel: two layers of parallelism, one in the FMMand a second in the kernel

Optimization and Parallelization of the Boundary Element Method for the Wave Equation in Time Domain Berenger Bramas

Page 143: Optimization and Parallelization of the Boundary Element ...berenger.eu/blog/wp-content/uploads/2010/01/Berenger-Slides.pdf · Context Problem Formulation Matrix Approach FMM Approach

..........

.....

....................................................................................

......

.....

.....

.

Particles Interaction Simulations (24)

Parallel Executions

FMM Vs. Matrix Approach

.....

. ..Matrix generation . ..Solve

.....

ThreadedFM

M

.

ThreadedKernel

.

MixFM

M/Kernel

.

Matrix

Approach

.0 .

2,000

.

4,000

.

3,752

.

1,897

.

883

.237

.

Tim

e(s)

Figure : C-927

.....

ThreadedFM

M

.

ThreadedKernel

.

MixFM

M/Kernel

.

Matrix

Approach

.0 .

0.5

.

1

.

1.5

.

·104

.

15,732

.

15,371

.

9,426

.

10,080

Figure : C-4269 (×1)

.....

ThreadedFM

M

.

ThreadedKernel

.

MixFM

M/Kernel

.

Matrix

Approach

.0 .

5

.

·104

.

33,408

.

76,931

.

39,227

.

25,256

Figure : C-10012

The captions of the different cases show the overhead of the FMM TD-BEM against the matrix approach.

Optimization and Parallelization of the Boundary Element Method for the Wave Equation in Time Domain Berenger Bramas

Page 144: Optimization and Parallelization of the Boundary Element ...berenger.eu/blog/wp-content/uploads/2010/01/Berenger-Slides.pdf · Context Problem Formulation Matrix Approach FMM Approach

..........

.....

....................................................................................

......

.....

.....

.

Particles Interaction Simulations (24)

Parallel Executions FMM Vs. Matrix Approach

.....

. ..Matrix generation . ..Solve

Optimization and Parallelization of the Boundary Element Method for the Wave Equation in Time Domain Berenger Bramas