Large Scale Ultrasound Simulations on Accelerated Clusters
Filip Vaverka, supervised by Jiri Jaros
Faculty of Information Technology, Brno University of Technology, Centre of Excellence IT4Innovations, CZ
Overview
The synergy between advancements in medicine, ultrasound modeling, and high performance computing leads to the emergence of many new applications of biomedical ultrasound. Applications such as HIFU treatment planning, (photo-)acoustic imaging, or transcranial ultrasound therapy (see figure) require accurate large scale ultrasound simulations, while significant pressure is being put on the cost and time to compute these simulations. The focal point of the present thesis is the development of k-space pseudo-spectral time-domain methods (KSTD) scalable across a variety of accelerated cluster architectures and dense compute nodes.
Ultrasound Wave Propagation Model
Ultrasound wave propagation in soft tissue can be understood as wave propagation in a fluid, heterogeneous and absorbing medium with weak non-linear effects. The governing acoustic equations for such a problem can be written as:
∂u/∂t = −(1/ρ₀) ∇p + S_F                        (momentum conservation)
∂ρ/∂t = −(2ρ + ρ₀) ∇·u − u·∇ρ₀ + S_M            (mass conservation)
p = c₀² (ρ + d·∇ρ₀ + (B/2A) ρ²/ρ₀ − Lρ)         (pressure-density relation)

The k-space pseudo-spectral method (KSTD) is a highly efficient approach to the discretization of this model. The KSTD takes the known analytic solution to the homogeneous variant of the problem to correct for time-stepping errors, thus allowing for simple 2nd-order time-stepping.
Local Fourier Basis
However, implementation of the KSTD method on distributed memory architectures is tricky due to the global nature of these methods. Efficient decomposition on modern accelerated clusters is achieved by using local Fourier basis (LFB) decomposition. This approach limits the communication to nearest neighbors.
Large-scale Ultrasound Simulations with Local Fourier Basis Decomposition
Jiri Jaros1, Filip Vaverka1 and Bradley E. Treeby2
1 Overview
High-intensity focused ultrasound (HIFU) is an emerging non-
invasive cancer therapy that uses tightly focused ultrasound waves to
destroy tissue cells through localised heating. The treatment planning
goal is to select the best transducer position and transmit parameters
to accurately target the tumour. The path of the ultrasound waves can
be predicted by solving acoustic equations based on momentum,
mass, and energy conservation. However, this is a computationally
difficult problem because the domain size is very large compared to
the acoustic wavelength.
1Faculty of Information Technology, Brno University of Technology, CZ
The project is financed from the SoMoPro II programme. The research leading to these results has received funding from the People Programme (Marie Curie Actions) of the EU Seventh Framework Programme under REA Grant Agreement No. 291782. The research is further co-financed by the South-Moravian Region. This work reflects only the author's view and the European Union is not liable for any use that may be made of the information contained therein.
This work was supported by The Ministry of Education, Youth and Sports from the Large Infrastructures for Research, Experimental Development and Innovations project "IT4Innovations National Supercomputing Center – LM2015070". The authors would like to acknowledge that the work presented here made use of Emerald, a GPU-accelerated High Performance Computer, made available by the Science & Engineering South Consortium operated in partnership with the STFC Rutherford-Appleton Laboratory.
Nonlinear Ultrasound Wave Propagation in Tissue
The governing equations must account for the nonlinear propagation
of ultrasound waves in tissue, which is a heterogeneous and
absorbing medium. Accurately accounting for acoustic absorption is
critical for predicting ultrasound dose under different conditions. The
required acoustic equations can be written as:

∂u/∂t = −(1/ρ₀) ∇p + S_F                        (momentum conservation)
∂ρ/∂t = −(2ρ + ρ₀) ∇·u − u·∇ρ₀ + S_M            (mass conservation)
p = c₀² (ρ + d·∇ρ₀ + (B/2A) ρ²/ρ₀ − Lρ)         (pressure-density relation)
These equations are discretised using the k-space pseudo-spectral
method and solved iteratively. This reduces the number of required
grid points per wavelength by an order of magnitude compared to
finite element or finite difference methods. For uniform Cartesian
grids, the gradients can be calculated using the fast Fourier transform.
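The resolution advantage of spectral gradients can be seen in a minimal 1-D comparison (our own example, not the authors' code): at roughly six points per wavelength, the FFT gradient is exact to machine precision while a 2nd-order central difference is off by double-digit percentages.

```python
import numpy as np

def spectral_gradient(f, dx):
    """Gradient of a periodic 1-D field via the FFT (spectral accuracy)."""
    n = f.size
    k = 2.0 * np.pi * np.fft.fftfreq(n, d=dx)
    return np.fft.ifft(1j * k * np.fft.fft(f)).real

n = 32                                           # ~6.4 points per wavelength below
x = np.arange(n) / n
f = np.sin(2.0 * np.pi * 5.0 * x)
exact = 2.0 * np.pi * 5.0 * np.cos(2.0 * np.pi * 5.0 * x)

spec = spectral_gradient(f, dx=1.0 / n)
fd = (np.roll(f, -1) - np.roll(f, 1)) / (2.0 / n)  # 2nd-order central difference

print(np.max(np.abs(spec - exact)))   # machine precision
print(np.max(np.abs(fd - exact)))     # large error at this coarse resolution
```

To push the finite-difference error down to comparable levels, the grid would need roughly an order of magnitude more points per wavelength, which is the saving the text refers to.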
Local Fourier Basis Accuracy
Since the gradient is not calculated on the whole data, numerical error
is introduced. Its level can be tuned by the thickness of the halo
region.
Performance Investigation
The strong scaling and simulation time breakdown were investigated
on the Emerald and Anselm clusters with up to 128 GPUs.
Local Fourier Basis Decomposition
Local domain decomposition reduces the communication burden by
partitioning the domain into a grid of local subdomains where
gradients are calculated locally and the global communication is
replaced by the nearest-neighbor halo exchange.
The gradient calculation with the halo on the i-th subdomain reads as
follows (b is a bell function smoothing the subdomain interface):

∂p_i/∂x = F⁻¹{ i k F(b · p_i) }
[Figure: three subdomains with their local data, overlapping bell functions, and periodic data in the overlap (halo) regions.]
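A minimal 1-D sketch of this idea (our own construction, not the production code; the erf bell shape and the 32-point halo are assumptions): window the halo-padded subdomain with a bell, take the local spectral gradient, and keep only the owned interior points.

```python
import numpy as np
from math import erf

def local_gradient(f_ext, dx, bell=None):
    """Spectral gradient on one subdomain (including its halo), optionally
    windowed by a bell so the local FFT sees a (nearly) periodic signal."""
    g = f_ext if bell is None else bell * f_ext
    k = 2.0 * np.pi * np.fft.fftfreq(g.size, d=dx)
    return np.fft.ifft(1j * k * np.fft.fft(g)).real

# Global periodic field and its exact spectral gradient (the reference).
n, dx = 256, 1.0 / 256
x = np.arange(n) * dx
f = np.sin(2 * np.pi * 3 * x) + 0.5 * np.cos(2 * np.pi * 7 * x)
kg = 2.0 * np.pi * np.fft.fftfreq(n, d=dx)
g_ref = np.fft.ifft(1j * kg * np.fft.fft(f)).real

# One subdomain owning [64, 192) with a halo of h = 32 points on each side.
h, lo, hi = 32, 64, 192
f_ext = f[lo - h:hi + h]                  # a halo exchange would fill these points
m = f_ext.size

# Erf bell: smoothly ~0 at the subdomain edges, ~1 over the owned interior.
ramp = lambda t: 0.5 * (1.0 + erf(3.0 * t))
b = np.array([ramp((j - h / 2) / (h / 2)) *
              ramp((m - 1 - j - h / 2) / (h / 2)) for j in range(m)])

naive = local_gradient(f_ext, dx)         # no bell: wrap-around contaminates
lfb = local_gradient(f_ext, dx, b)        # bell-windowed local Fourier basis

interior = slice(h, h + (hi - lo))        # keep only the owned points
err_naive = np.max(np.abs(naive[interior] - g_ref[lo:hi]))
err_lfb = np.max(np.abs(lfb[interior] - g_ref[lo:hi]))
print(err_naive, err_lfb)                 # the bell cuts the error dramatically
```

Without the bell, the artificial discontinuity at the subdomain boundary produces O(1) Gibbs errors even in the interior; with it, the interior error drops by orders of magnitude while only nearest-neighbor halo data is needed.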
Realistic Simulations and Their Costs
Pressure field from a prostate ultrasound transducer simulated using
a domain size of 1536 x 1024 x 2048 (45mm x 30mm x 60mm) with
48,000 time steps (60μs).
Compute Resources Simulation Time Simulation Cost
96 GPUs 14h 09m $475
128 GPUs 9h 29m $426
128 CPU cores 6d 18h $1,826
256 CPU cores 3d 0h $1,623
512 CPU cores 2d 5h $2,395
[Figures: strong scaling of time per 100 time steps [s] on Anselm (K20, 2-16 GPUs) and Emerald (C2070, 4-128 GPUs), with a breakdown into MPI, PCI-E, and computation time, for domain sizes from 256x256x256 to 1024x1024x2048; simulated pressure field on a 256 x 256 x 2048 grid showing the intended vs. actual focus.]
2Medical Physics and Biomedical Engineering, University College London, UK
Local Fourier Basis Accuracy
When the gradient is not calculated on the whole data, numerical error is introduced. The error level can be controlled by the shape of the bell function and the size of the overlap region.
[Figures: L∞ error vs. overlap size (2-32 points) for the Erf and Optimised bells, spanning roughly 10⁻¹ down to 10⁻⁶; minimum overlap size (4-22 points) needed to keep L∞ < 10⁻³ or < 10⁻⁴ vs. the number of subdomain cuts crossed (1-9).]
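The overlap-size tradeoff can be reproduced in a small experiment (again our own 1-D construction with an erf bell, not the poster's exact setup): sweep the halo width and measure the interior error of the windowed local gradient against the global spectral gradient.

```python
import numpy as np
from math import erf

def lfb_gradient_error(h, n=256, steep=3.0):
    """L_inf interior error of the bell-windowed local spectral gradient on
    one subdomain of a periodic test field, for halo (overlap) width h."""
    dx = 1.0 / n
    x = np.arange(n) * dx
    f = np.sin(2 * np.pi * 3 * x) + 0.5 * np.cos(2 * np.pi * 7 * x)
    kg = 2 * np.pi * np.fft.fftfreq(n, d=dx)
    g_ref = np.fft.ifft(1j * kg * np.fft.fft(f)).real

    lo, hi = 64, 192                      # owned interior of the subdomain
    f_ext = f[lo - h:hi + h]              # interior plus both halos
    m = f_ext.size
    ramp = lambda t: 0.5 * (1.0 + erf(steep * t))
    b = np.array([ramp((j - h / 2) / (h / 2)) *
                  ramp((m - 1 - j - h / 2) / (h / 2)) for j in range(m)])
    kl = 2 * np.pi * np.fft.fftfreq(m, d=dx)
    g_loc = np.fft.ifft(1j * kl * np.fft.fft(b * f_ext)).real
    return np.max(np.abs(g_loc[h:h + hi - lo] - g_ref[lo:hi]))

errs = [lfb_gradient_error(h) for h in (4, 8, 16, 32)]
print(errs)                               # error falls as the overlap widens
```

This mirrors the trend in the figures: a wider (and smoother) bell over a thicker overlap buys accuracy at the price of more halo data to exchange.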
Traditional CPU Cluster (MPI + OpenMP)
Consider a traditional CPU-based cluster where each compute node consists of two Intel Xeon E5-2680v3 CPUs and nodes are connected in a 7D hypercube topology with an FDR InfiniBand interconnect.
[Figures: weak scaling of execution time per time step [ms] on Salomon, 1-256 CPUs (×12 cores), for the global KSTD (domain sizes 2²¹-2²⁸ grid points, left) and the local KSTD (2¹⁸-2²⁵ grid points, right).]
The transition from the global (left) to the local (right) KSTD method offers up to 5× speedup while the same number of CPUs is used. This is in significant part due to a reduction in the order of communication complexity, as all-to-all communication is replaced by nearest-neighbor communication.
Multi-GPU Dense Nodes (MPI + CUDA + P2P)
The Nvidia DGX-2 compute node contains 16 Nvidia V100 GPUs, each with 32 GB of HBM2 memory at 900 GB/s. All 16 GPUs are connected together via NVLink 2.0, which provides 300 GB/s of connectivity to each GPU and a bisection bandwidth of 2.4 TB/s.
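A back-of-envelope estimate (our own, with assumed parameters: one single-precision field, a 1-D slab decomposition, a 32-plane overlap exchanged once per step) shows why NVLink-class bandwidth matters for the halo exchange:

```python
# Per-step halo traffic for a 1-D slab decomposition of a 1024^3
# single-precision grid across 16 GPUs with a 32-plane overlap.
nx = ny = nz = 1024
gpus, overlap, bytes_per_point = 16, 32, 4

plane = ny * nz * bytes_per_point            # one x-plane of the grid
halo_per_gpu = 2 * overlap * plane           # up to two neighbours per GPU
print(halo_per_gpu / 2**20, "MiB per GPU per step")   # 256 MiB

nvlink = 300e9                               # B/s per GPU (NVLink 2.0, vendor figure)
print(halo_per_gpu / nvlink * 1e3, "ms lower bound per exchange")
```

At PCI-E-class bandwidth (~16 GB/s) the same exchange would take well over ten times longer, which is why the high-speed intra-node fabric is what lets the overlap exchange stay a small fraction of each time step.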
[Figures: strong scaling of execution time per time step [ms] on DGX-2, 1-16 GPUs, for domain sizes 2²⁴-2³¹ grid points; simulation time breakdown (computation, computation in FFTs, overlap exchange) vs. domain size.]
GPU Accelerated Cluster (MPI + CUDA)
[Figure: Piz Daint node architecture: each node pairs one CPU (RAM) with one P100 GPU (HBM) over PCI-E, with nodes connected through Aries ASICs in a Dragonfly topology.]
[Figures: strong scaling of execution time per time step [ms] on Piz Daint, 1-512 GPUs (1 GPU per node), for domain sizes 2²⁴-2³⁶ grid points; simulation time breakdown (computation, overlap exchange) vs. domain size (2²⁷-2³⁶ grid points).]
Impact and Outlook
The LFB decomposition allows for efficient implementation of KSTD ultrasound wave propagation simulation codes, which are scalable on a wide range of cluster architectures. The practical implementation of the described approach showed that:
• Up to 5× speedup was achieved on a traditional CPU-based cluster
• The proposed method scales to simulations with over 7 × 10¹⁰ unknowns across 1024 GPUs while maintaining efficiency over 50%
• The method is able to take advantage of high-speed interconnects in multi-GPU nodes to achieve efficiency over 65%
In the realm of personalized ultrasound medical procedures these advancements amount to:
• Achieving 6× speedup and 10× cost reduction of typical photoacoustic tomography image reconstruction by enabling efficient use of multi-GPU servers such as the Nvidia DGX-2
• 17× price reduction of simulations used in HIFU treatment planning
Conclusion
The presented approach shows how the use of LFB decomposition in the context of KSTD ultrasound wave propagation models allows these models to scale efficiently to GPU-accelerated nodes and clusters. This leap in efficiency and reduction in cost allows for a wider spread of novel ultrasound medical procedures such as HIFU and photoacoustic tomography. The method can be extended to similar models, such as wave propagation in an elastic medium.
k-Wave: A MATLAB toolbox for the time-domain simulation of acoustic wave fields
IT4Innovations National Supercomputing Center
This work was supported by The Ministry of Education, Youth and Sports from the National Programme of Sustainability (NPU II) project "IT4Innovations excellence in science – LQ1602" and by the IT4Innovations infrastructure, which is supported by the Large Infrastructures for Research, Experimental Development and Innovations project "IT4Innovations National Supercomputing Center – LM2015070". This project has received funding from the European Union's Horizon 2020 research and innovation programme H2020 ICT 2016-2017 under grant agreement No 732411 and is an initiative of the Photonics Public Private Partnership.