

Large Scale Ultrasound Simulations on Accelerated Clusters
Filip Vaverka, supervised by Jiri Jaros

Faculty of Information Technology, Brno University of Technology, Centre of Excellence IT4Innovations, CZ

Overview
The synergy between advancements in medicine, ultrasound modeling, and high performance computing leads to the emergence of many new applications of biomedical ultrasound. Applications such as HIFU treatment planning, (photo-)acoustic imaging, or transcranial ultrasound therapy (see figure) require accurate large-scale ultrasound simulations, while significant pressure is being put on the cost and time needed to compute these simulations. The focal point of the present thesis is the development of k-space pseudo-spectral time-domain methods (KSTD) scalable across a variety of accelerated cluster architectures and dense compute nodes.

Ultrasound Wave Propagation Model
Ultrasound wave propagation in soft tissue can be understood as wave propagation in a fluid, heterogeneous and absorbing medium with weak non-linear effects. The governing acoustic equations for such a problem can be written as:

\frac{\partial \mathbf{u}}{\partial t} = -\frac{1}{\rho_0} \nabla p + \mathbf{S}_F \quad \text{(momentum conservation)}

\frac{\partial \rho}{\partial t} = -(2\rho + \rho_0)\, \nabla \cdot \mathbf{u} - \mathbf{u} \cdot \nabla \rho_0 + S_M \quad \text{(mass conservation)}

p = c_0^2 \left( \rho + \mathbf{d} \cdot \nabla \rho_0 + \frac{B}{2A} \frac{\rho^2}{\rho_0} - L\rho \right) \quad \text{(pressure-density relation)}

The k-space pseudo-spectral method (KSTD) is a highly efficient approach to the discretization of this model. The KSTD takes the known analytic solution of the homogeneous variant of the problem to correct for time-stepping errors, thus allowing for simple 2nd-order time stepping.
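To make the k-space correction concrete, the snippet below is a minimal 1D NumPy sketch, not the thesis code: the spatial derivative is taken in Fourier space and scaled by the commonly used correction factor sinc(c_ref k Δt / 2); the grid size, sound speed, and time step are illustrative assumptions.

# Minimal 1D sketch of a k-space corrected spectral derivative as used by
# KSTD-type schemes. All parameters below are illustrative assumptions.
import numpy as np

N, dx = 256, 1e-4            # number of grid points and spacing [m] (assumed)
c_ref, dt = 1500.0, 2e-8     # reference sound speed [m/s] and time step [s] (assumed)

k = 2.0 * np.pi * np.fft.fftfreq(N, d=dx)          # angular wavenumbers
# np.sinc is normalised, sinc(x) = sin(pi x)/(pi x), so divide the argument by pi
kappa = np.sinc(c_ref * k * dt / (2.0 * np.pi))    # k-space correction sinc(c_ref k dt / 2)

def ksp_gradient(p):
    """Spectral derivative with k-space correction: F^-1{ i k kappa F{p} }."""
    return np.real(np.fft.ifft(1j * k * kappa * np.fft.fft(p)))

# Example: derivative of a smooth Gaussian pulse
x = np.arange(N) * dx
p = np.exp(-((x - x.mean()) / (10 * dx)) ** 2)
dpdx = ksp_gradient(p)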

Local Fourier Basis
However, the implementation of the KSTD method on distributed-memory architectures is tricky due to the global nature of these methods. Efficient decomposition on modern accelerated clusters is achieved by using local Fourier basis (LFB) decomposition. This approach limits the communication to nearest neighbors.

Large-scale Ultrasound Simulations with Local Fourier Basis Decomposition

Jiri Jaros1, Filip Vaverka1 and Bradley E. Treeby2

1 Overview

High-intensity focused ultrasound (HIFU) is an emerging non-invasive cancer therapy that uses tightly focused ultrasound waves to destroy tissue cells through localised heating. The treatment planning goal is to select the best transducer position and transmit parameters to accurately target the tumour. The path of the ultrasound waves can be predicted by solving acoustic equations based on momentum, mass, and energy conservation. However, this is a computationally difficult problem because the domain size is very large compared to the acoustic wavelength.


1Faculty of Information Technology, Brno University of Technology, CZ

The project is financed from the SoMoPro II programme. The research leading to this invention has acquired a financial grant from the People Programme (Marie Curie action) of the Seventh Framework Programme of EU according to the REA Grant Agreement No. 291782. The research is further co-financed by the South-Moravian Region. This work reflects only the author's view and the European Union is not liable for any use that may be made of the information contained therein. This work was supported by The Ministry of Education, Youth and Sports from the Large Infrastructures for Research, Experimental Development and Innovations project "IT4Innovations National Supercomputing Center – LM2015070". The authors would like to acknowledge that the work presented here made use of Emerald, a GPU-accelerated High Performance Computer, made available by the Science & Engineering South Consortium operated in partnership with the STFC Rutherford-Appleton Laboratory.

Nonlinear Ultrasound Wave Propagation in Tissue
The governing equations must account for the nonlinear propagation of ultrasound waves in tissue, which is a heterogeneous and absorbing medium. Accurately accounting for acoustic absorption is critical for predicting ultrasound dose under different conditions. The required acoustic equations are the momentum conservation, mass conservation, and pressure-density relation given above. These equations are discretised using the k-space pseudo-spectral method and solved iteratively. This reduces the number of required grid points per wavelength by an order of magnitude compared to finite element or finite difference methods. For uniform Cartesian grids, the gradients can be calculated using the fast Fourier transform.
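As a small, hedged illustration of this point (the resolutions and wave counts below are made-up test values; the order-of-magnitude figure above comes from the poster, not from this toy), the following NumPy snippet compares the FFT-based derivative of a sine wave with a 2nd-order centred finite difference at a few grid densities.

# Toy comparison: spectral (FFT) versus 2nd-order finite-difference derivative
# of sin(k0 x) at several grid densities. Parameters are illustrative only.
import numpy as np

def derivative_errors(points_per_wavelength, n_waves=8):
    N = int(points_per_wavelength * n_waves)        # grid points over a unit domain
    dx = 1.0 / N
    x = np.arange(N) * dx
    k0 = 2.0 * np.pi * n_waves
    p, exact = np.sin(k0 * x), k0 * np.cos(k0 * x)

    k = 2.0 * np.pi * np.fft.fftfreq(N, dx)
    spectral = np.real(np.fft.ifft(1j * k * np.fft.fft(p)))     # pseudo-spectral
    fd = (np.roll(p, -1) - np.roll(p, 1)) / (2.0 * dx)          # centred FD, periodic

    return (np.max(np.abs(spectral - exact)) / k0,
            np.max(np.abs(fd - exact)) / k0)

for ppw in (3, 4, 8, 16):
    s, f = derivative_errors(ppw)
    print(f"{ppw:2d} points/wavelength: spectral error {s:.1e}, FD error {f:.1e}")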

Local Fourier Basis Accuracy
Since the gradient is not calculated on the whole data, numerical error is introduced. Its level can be tuned by the thickness of the halo region.

Performance Investigation
The strong scaling and simulation time breakdown were investigated on the Emerald and Anselm clusters with up to 128 GPUs.


Local Fourier Basis Decomposition
Local domain decomposition reduces the communication burden by partitioning the domain into a grid of local subdomains where gradients are calculated locally and the global communication is replaced by a nearest-neighbor halo exchange.

The gradient calculation with the halo on the i-th subdomain reads as follows (b is a bell function smoothing the subdomain interface):

\nabla p_i = \mathcal{F}^{-1}\!\left[\, i \mathbf{k}_i \, \mathcal{F}(b \cdot p_i) \,\right]
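A rough, pure-NumPy stand-in for this scheme in 1D is sketched below (the actual implementation uses MPI and CUDA; the subdomain count, overlap width, and bell shape here are illustrative assumptions): each subdomain collects halo points from its periodic neighbours, multiplies the extended chunk by a bell b, and differentiates it with a local FFT, keeping only the interior points.

# Sketch of nearest-neighbour halo exchange plus a windowed local FFT gradient
# for a periodic 1D field split into subdomains (illustrative, single process).
import numpy as np

def exchange_halos(subdomains, overlap):
    """Extend each subdomain with `overlap` points from its periodic neighbours."""
    n = len(subdomains)
    return [np.concatenate([subdomains[(i - 1) % n][-overlap:],
                            s,
                            subdomains[(i + 1) % n][:overlap]])
            for i, s in enumerate(subdomains)]

def simple_bell(n_interior, overlap):
    """Crude bell: cosine roll-off across each halo, one over the interior."""
    ramp = 0.5 * (1.0 - np.cos(np.pi * np.arange(overlap) / overlap))
    return np.concatenate([ramp, np.ones(n_interior), ramp[::-1]])

def local_gradient(p_halo, dx, bell):
    """F^-1{ i k F(b * p_i) } on a subdomain extended by its halo."""
    k = 2.0 * np.pi * np.fft.fftfreq(p_halo.size, dx)
    return np.real(np.fft.ifft(1j * k * np.fft.fft(bell * p_halo)))

# Example: 4 subdomains, 16-point overlap; keep only the interior of the result.
x = np.linspace(0.0, 1.0, 512, endpoint=False)
parts = np.array_split(np.sin(2.0 * np.pi * 5.0 * x), 4)
padded = exchange_halos(parts, overlap=16)
bell = simple_bell(parts[0].size, 16)
dpdx_0 = local_gradient(padded[0], x[1] - x[0], bell)[16:-16]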

(Figure: three adjacent subdomains showing the local data, the bell function, and the periodically extended data in the overlap regions.)

Realistic Simulations and Their Costs
Pressure field from a prostate ultrasound transducer simulated using a domain size of 1536 × 1024 × 2048 grid points (45 mm × 30 mm × 60 mm) with 48,000 time steps (60 μs).


Compute Resources    Simulation Time    Simulation Cost
96 GPUs              14h 09m            $475
128 GPUs             9h 29m             $426
128 CPU cores        6d 18h             $1,826
256 CPU cores        3d 0h              $1,623
512 CPU cores        2d 5h              $2,395

(Figures: simulation time per 100 time steps [s] on Emerald (C2070, up to 128 GPUs) and Anselm (K20, up to 16 GPUs): breakdown into MPI, PCI-E, and computation, and strong scaling for domain sizes from 256x256x256 up to 1024x1024x2048.)

(Figure: intended versus actual focus in a 256 x 256 x 2048 simulation domain.)

2Medical Physics and Biomedical Engineering, University College London, UK

Local Fourier Basis Accuracy
When the gradient is not calculated on the whole data, numerical error is introduced. The error level can be controlled by the shape of the bell function and the size of the overlap region.

(Figures: left, L error versus overlap size (2–32) for the Erf bell and the optimised bell; right, minimum overlap size (4–22) required to keep the L error below 1e-3 and 1e-4 as a function of the number of subdomain cuts crossed (1–9).)
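Under the same caveats as the sketches above (illustrative sizes, an erf-shaped bell rather than the optimised bell, and the maximum pointwise interior error used as a stand-in for the L error in the plot), the snippet below shows how the accuracy of the windowed local gradient can be probed as a function of the overlap size against a global spectral reference.

# Sketch: probe the interior error of the windowed local gradient for several
# overlap sizes against a global spectral gradient (illustrative parameters).
import numpy as np
from scipy.special import erf

N, dx = 1024, 1.0 / 1024
x = np.arange(N) * dx
p = np.sin(2*np.pi*8*x) + 0.3*np.cos(2*np.pi*21*x)          # smooth periodic field
k_glob = 2*np.pi*np.fft.fftfreq(N, dx)
ref = np.real(np.fft.ifft(1j*k_glob*np.fft.fft(p)))          # global spectral gradient

n_loc, start = N // 4, N // 4                                # one interior subdomain
for overlap in (4, 8, 16, 32):
    lo, hi = start - overlap, start + n_loc + overlap
    chunk = p[lo:hi]                                         # local data plus halos
    m = chunk.size
    t = np.arange(m)
    # erf-shaped bell: rises over the left halo, ~1 inside, falls over the right halo
    bell = (0.5*(1 + erf(3*(t - 0.5*overlap)/(0.5*overlap))) *
            0.5*(1 + erf(3*((m - 1 - 0.5*overlap) - t)/(0.5*overlap))))
    k_loc = 2*np.pi*np.fft.fftfreq(m, dx)
    grad = np.real(np.fft.ifft(1j*k_loc*np.fft.fft(bell*chunk)))
    err = np.max(np.abs(grad[overlap:-overlap] - ref[start:start + n_loc]))
    print(f"overlap {overlap:3d}: max interior error {err:.2e}")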

Traditional CPU Cluster (MPI + OpenMP)
We consider a traditional CPU-based cluster in which each compute node consists of two Intel Xeon E5-2680v3 CPUs and the nodes are connected in a 7D hypercube topology with an FDR InfiniBand interconnect.

(Figures: execution time per time step [ms] versus number of CPUs (×12 cores, 1–256) on Salomon; left, global simulation weak scaling for domain sizes 2^21–2^28 grid points; right, simulation weak scaling of the local method for domain sizes 2^18–2^25 grid points.)

The transition from the global (left) to the local (right) KSTD method offers up to a 5× speedup while the same number of CPUs is used. This is in significant part due to the reduction in the order of communication complexity, as all-to-all communication is replaced by nearest-neighbor communication.

Multi-GPU Dense Nodes (MPI + CUDA + P2P)
An Nvidia DGX-2 compute node contains 16 Nvidia V100 GPUs, each with 32 GB of HBM2 memory at 900 GB/s. All 16 GPUs are connected via NVLink 2.0, which provides 300 GB/s of connectivity to each GPU and a bisection bandwidth of 2.4 TB/s.

(Figures: left, simulation strong scaling on the DGX-2, execution time per time step [ms] versus number of GPUs (1–16) for domain sizes 2^24–2^31 grid points; right, simulation breakdown on the DGX-2 by domain size (2^24–2^31 grid points), split into computation, computation (FFTs), and overlap exchange.)

GPU Accelerated Cluster (MPI + CUDA)

(Figure: Piz Daint node and network architecture; each node couples a CPU with host RAM and a P100 GPU with HBM over PCI-E to an Aries ASIC, and the nodes are connected in a Dragonfly topology; the diagram legend indicates scales of 100 GFlop/s, 1 GB, and 100 GB/s.)

(Figures: left, simulation strong scaling on Piz Daint, execution time per time step [ms] versus number of GPUs (1 GPU per node, 1–512) for domain sizes 2^24–2^36 grid points; right, simulation breakdown on Piz Daint by domain size (2^27–2^36 grid points), split into computation and overlap exchange.)

Impact and Outlook
The LFB decomposition allows for an efficient implementation of KSTD ultrasound wave propagation simulation codes that is scalable across a wide range of cluster architectures. The practical implementation of the described approach showed that:
• Up to a 5× speedup was achieved on a traditional CPU-based cluster
• The proposed method scales to simulations with over 7 × 10^10 unknowns across 1024 GPUs while maintaining efficiency over 50%
• The method is able to take advantage of the high-speed interconnects in multi-GPU nodes to achieve efficiency over 65%
In the realm of personalized ultrasound medical procedures, these advancements amount to:
• Achieving a 6× speedup and a 10× cost reduction for typical photoacoustic tomography image reconstruction by enabling efficient use of multi-GPU servers such as the Nvidia DGX-2
• A 17× price reduction for simulations used in HIFU treatment planning

Conclusion
The presented approach shows how the use of LFB decomposition in the context of KSTD ultrasound wave propagation models allows these models to scale efficiently to GPU-accelerated nodes and clusters. This leap in efficiency and reduction in cost enables wider adoption of novel ultrasound medical procedures such as HIFU and photoacoustic tomography. The method can be extended to similar models, such as wave propagation in an elastic medium.

k-Wave: A MATLAB toolbox for the time-domain simulation of acoustic wave fields

IT4Innovations National Supercomputing Center

This work was supported by The Ministry of Education, Youth and Sports from the National Programme of Sustainability (NPU II) project "IT4Innovations excellence in science - LQ1602" and by the IT4Innovations infrastructure, which is supported from the Large Infrastructures for Research, Experimental Development and Innovations project "IT4Innovations National Supercomputing Center - LM2015070". This project has received funding from the European Union's Horizon 2020 research and innovation programme H2020 ICT 2016-2017 under grant agreement No 732411 and is an initiative of the Photonics Public Private Partnership.