
  • PETTT Distribution Statement A: Approved for public release; distribution is unlimited.

    Performance of a multi-physics code on Cavium ThunderX2

    Presented by John G. Wohlbier (PETTT/Engility), Keith Obenschain, Gopal Patnaik (NRL-DC)

    September 26, 2018

    User Productivity Enhancement, Technology Transfer, and Training (PETTT)

  • Slide 2

    This material is based upon work supported by, or in part by, the Department of Defense (DoD) High Performance Computing Modernization Program (HPCMP) under the User Productivity Enhancement, Technology Transfer, and Training (PETTT) Program, contract number GS04T09DBC0017.

  • Slide 3

    Outline
    – Aster
    – Comparison methodology
    – Initial performance numbers
    – Extract serial kernel
    – Rev 1 performance numbers
    – Extract more kernels
    – Characterize kernels
    – Summary and future work

  • Slide 4

    Aster [1]
    – Direct-drive inertial confinement fusion code
      – Spherical, structured grid
      – Two-temperature explicit CFD
      – Tabular equation of state
      – Implicit species heat conduction
      – Laser ray tracing
      – Fusion reactions
      – Multi-group radiation diffusion
      – Operator-split time stepping (a generic sketch of one such cycle follows)

    [1] I.V. Igumenshchev, et al., "Three-Dimensional Modeling of Direct-Drive Cryogenic Implosions on OMEGA," Physics of Plasmas 23, 052702 (2016).
    [2] Image: https://str.llnl.gov/str/Haan.html
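    "Operator-split time stepping" means the physics packages listed above are applied one after another within each cycle, each advancing the same shared state. Aster's actual loop is not shown in the slides; the sketch below is a generic illustration, and State, hydro_explicit, and the other names are placeholders, not Aster routines.

    /* cycle_sketch.c -- illustrative operator-split cycle, not Aster's code.
     * All names and the State contents are placeholders. */
    #include <stdio.h>

    typedef struct {
        double t;   /* simulation time; a real code would hold the grid fields here */
    } State;

    /* Stub packages: in an operator-split scheme each one advances the same
     * shared state by dt, in a fixed order, once per cycle. */
    static void hydro_explicit(State *s, double dt)           { (void)s; (void)dt; }
    static void heat_conduction_implicit(State *s, double dt) { (void)s; (void)dt; }
    static void laser_ray_trace(State *s, double dt)          { (void)s; (void)dt; }
    static void fusion_reactions(State *s, double dt)         { (void)s; (void)dt; }
    static void radiation_diffusion(State *s, double dt)      { (void)s; (void)dt; }

    static void advance_cycle(State *s, double dt)
    {
        hydro_explicit(s, dt);           /* two-temperature explicit CFD (tabular EOS inside) */
        heat_conduction_implicit(s, dt); /* implicit species heat conduction */
        laser_ray_trace(s, dt);          /* laser energy deposition */
        fusion_reactions(s, dt);         /* fusion reaction sources */
        radiation_diffusion(s, dt);      /* multi-group radiation diffusion */
        s->t += dt;
    }

    int main(void)
    {
        State s = { 0.0 };
        for (int cycle = 0; cycle < 10; cycle++)   /* "ten cycles of Aster" in the later timings */
            advance_cycle(&s, 1.0e-12);
        printf("advanced to t = %g s\n", s.t);
        return 0;
    }

    The "ten cycles of Aster" timed later in the deck correspond to ten passes through a loop of this shape.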

  • Slide 5

    Comparison methodology
    – Fixed problem size on one node
      – 21.3M cells/node
      – 1 MPI rank/core
    – Node characteristics
      – Two sockets
      – 32 ranks on SKL, 2 x 125 W TDP
      – 56/64 ranks on TX2, 2 x 165 W TDP
    – Compilers
      – Intel/Intel MPI on SKL
      – gcc/OpenMPI on TX2

    Manufacturer | Architecture | Part | SIMD width (bits) | Peak GF/s | Peak DRAM GB/s | Cores | Peak DRAM GB/s/core | Memory channels
    Intel        | Skylake      | 6130 | 512               | 1075      | 128            | 16    | 7.1                 | 6
    Cavium       | ThunderX2    | B0   | 128               | 563       | 171            | 32    | 5.3                 | 8
    Cavium       | ThunderX2    | B0   | 128               | 493       | 171            | 28    | 6.1                 | 8

  • Slide 6

    Comparison methodology
    – Compare STREAM Triad and HPCG
      – Triad peak efficiency = (Peak GB/s / 24 B per triad x 2 FLOP per triad) / Peak GF/s (see the worked example after the tables)

    STREAM
    Manufacturer | Architecture | Part | Peak GF/s | Peak DRAM GB/s | Triad peak efficiency (%) | Threads | Triad (GB/s)
    Intel        | Skylake      | 6130 | 2 x 1075  | 2 x 128        | 1                         | 32      | 164
    Cavium       | ThunderX2    | B0   | 2 x 563   | 2 x 171        | 2.5                       | 64      | 195
    Cavium       | ThunderX2    | B1   | 2 x 493   | 2 x 171        | 2.9                       | 56      | 217

    HPCG
    CPU       | Nrank | Stack                 | Read (GB/s) | Write (GB/s) | Total (GB/s) | STREAM Triad (GB/s) | FLOPS (GF/s) | Peak (GF/s) | % peak
    Skylake   | 32    | Intel 19.0, Intel MPI | 133         | 31           | 164          | 164                 | 21           | 2150        | 1
    ThunderX2 | 64    | gcc 7.2, OpenMPI 3.1  | 149         | 35           | 184          | 195                 | 24           | 1126        | 2
    ThunderX2 | 56    | gcc 7.2, OpenMPI 3.1  | 149         | 34           | 183          | 217                 | 23           | 986         | 2
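    The 24 B and 2 FLOP per triad come from the kernel itself: each iteration reads b[i] and c[i], writes a[i] (three 8-byte doubles), and performs one multiply and one add. A minimal C sketch of that accounting and of the efficiency numbers in the tables is below; the node figures are the dual-socket peaks from the comparison table, and the kernel shown is only a stand-in for the official STREAM source.

    /* triad_efficiency.c -- back-of-envelope check of the "Triad peak efficiency"
     * column; a stand-in for STREAM, not the official benchmark source. */
    #include <stdio.h>

    /* STREAM Triad kernel: per iteration, 24 B of DRAM traffic
     * (read b[i], read c[i], write a[i]) and 2 FLOP (one multiply, one add). */
    void triad(double *a, const double *b, const double *c, double q, long n)
    {
        for (long i = 0; i < n; i++)
            a[i] = b[i] + q * c[i];
    }

    int main(void)
    {
        /* Dual-socket peak numbers taken from the comparison tables above. */
        struct { const char *name; double peak_gbs, peak_gfs; } node[] = {
            { "Skylake 6130, 2 x 16c", 2 * 128.0, 2 * 1075.0 },
            { "ThunderX2 B0, 2 x 32c", 2 * 171.0, 2 *  563.0 },
            { "ThunderX2 B1, 2 x 28c", 2 * 171.0, 2 *  493.0 },
        };

        for (int k = 0; k < 3; k++) {
            /* GF/s sustainable while streaming Triad: (GB/s / 24 B) * 2 FLOP */
            double triad_gfs = node[k].peak_gbs / 24.0 * 2.0;
            printf("%-22s triad-limited %4.1f GF/s = %.1f%% of peak\n",
                   node[k].name, triad_gfs, 100.0 * triad_gfs / node[k].peak_gfs);
        }
        return 0;
    }

    This reproduces the roughly 1%, 2.5%, and 2.9% efficiency figures: even at full DRAM bandwidth, a streaming kernel can reach only a few percent of either node's floating-point peak.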

  • Slide 7

    Initial performance numbers
    – Ten cycles of Aster, 21.3M cells

    Architecture | Ranks | Time (s) | Ratio
    SKL          | 32    | 330      | 1
    TX2          | 64    | 437      | 1.3

    – Profile with Arm MAP
      – mpi_recv + gauss_seidel_in_plane are the largest costs, but have very similar absolute run times
      – pow + exp have disparate run times: SKL 15 s, TX2 60 s
      – The difference (60 - 15 = 45 s) is nearly half of the total difference (437 - 330 = 107 s)

  • Slide 8

    Initial performance numbers
    – Profile with Arm MAP (self time)

    TX2 %   | TX2 seconds | Function               | SKL %   | SKL seconds
    13.70%  | 59.88       | mpi_recv_              | 14.40%  | 47.48
    10.80%  | 47.21       | gauss_seidel_in_plane  | 18.70%  | 61.65
    8.90%   | 38.90       | __pow_finite           | 4.60%   | 15.17
    6.00%   | 26.23       | mpi_waitall_           | 3.60%   | 11.87
    5.60%   | 24.48       | mpi_bcast_             | 1.60%   | 5.28
    5.20%   | 22.73       | _gfortran_string_index | 7.20%   | 23.74
    5.00%   | 21.86       | mpi_send_              | 3.00%   | 9.89
    5.10%   | 22.29       | __exp1                 |         |

  • Slide 9

    Extract serial kernel
    – Identified "pow"-heavy Aster function
    – Extract kernel using KGen
      – https://github.com/NCAR/KGen
      – Instruments application and generates verification data for kernel
      – aster_tubr ("temperature update by radiation")
      – Kernel used internally at Arm to work on precision issues with armflang
    – Arm recommends using "Arm Performance Libraries" (ArmPL)
      – Found Arm Optimized-Routines (AOR)
      – https://github.com/ARM-software/optimized-routines
      – Upstream for ArmPL

  • Slide 10

    Extract serial kernel
    – aster_tubr results
      – Run on same input data
      – Weak scaling implies the per-rank TX2 data set would be half the size of the SKL data set
      – Best effective TX2 time ~1.36 s (2.72 s / 2) compared to 0.88 s on SKL
      – A stand-alone pow/exp micro-benchmark sketch follows the table

    Architecture | Time (s) | Compiler, library
    SKL          | 0.88     | Intel
    TX2          | 5.44     | gcc 7.2, default
    TX2          | 5.96     | armflang 18.4.1, default
    TX2          | 3.37     | armflang 18.4.1, -L${ARMPL_LIBRARIES} -lamath
    TX2          | 2.72     | gcc 7.2, Arm Optimized-Routines -lmathlib
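    The gains in the table come from swapping the default libm pow/exp for the vendor-tuned versions. A rough way to compare the libraries outside of Aster is a micro-benchmark like the sketch below; it is hypothetical, not the extracted aster_tubr kernel, and the array size, exponent, and link paths are placeholders to adjust for the local install.

    /* powexp_bench.c -- hypothetical pow/exp micro-benchmark, not aster_tubr.
     *
     * Example link variants (paths/compilers are illustrative):
     *   gcc -O2 powexp_bench.c -lm                                      # default libm
     *   armclang -O2 powexp_bench.c -L${ARMPL_LIBRARIES} -lamath -lm    # ArmPL libamath
     *   gcc -O2 powexp_bench.c -L<AOR build dir> -lmathlib -lm          # Arm Optimized-Routines
     */
    #include <math.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    int main(void)
    {
        const size_t n = 1 << 24;            /* placeholder problem size */
        double *x = malloc(n * sizeof *x);
        double sum = 0.0;

        for (size_t i = 0; i < n; i++)       /* arbitrary positive inputs */
            x[i] = 1.0 + (double)i / n;

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (size_t i = 0; i < n; i++)       /* pow + exp mix, roughly what the profile flags */
            sum += pow(x[i], 1.4) + exp(-x[i]);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double secs = (t1.tv_sec - t0.tv_sec) + 1e-9 * (t1.tv_nsec - t0.tv_nsec);
        printf("checksum %.6e, %zu pow+exp pairs in %.3f s\n", sum, n, secs);
        free(x);
        return 0;
    }

    The same source linked the three ways gives a library-only comparison of the pow/exp cost that the MAP profile attributes to __pow_finite and __exp1.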

  • Slide 11

    Rev 1 performance numbers
    – Ten cycles of Aster, 21.3M cells

    Architecture | Ranks | Time (s) | Ratio
    SKL          | 32    | 330      | 1
    TX2          | 64    | 437      | 1.3
    TX2          | 64    | 349      | 1.06

    – Profile with Arm MAP
      – mpi_recv + mpi_waitall + mpi_send: SKL 69 s, TX2 95 s
      – The difference (95 - 69 = 26 s) is larger than the total difference (349 - 330 = 19 s)

  • Slide 12

    Rev 1 performance numbers
    – Profile with Arm MAP (self time)

    TX2 %   | TX2 seconds | Function               | SKL %   | SKL seconds
    13.60%  | 47.48       | gauss_seidel_in_plane  | 18.70%  | 61.65
    13.40%  | 46.78       | mpi_recv_              | 14.40%  | 47.48
    8.00%   | 27.93       | mpi_waitall_           | 3.60%   | 11.87
    6.70%   | 23.39       | _gfortran_string_index | 7.20%   | 23.74
    6.10%   | 21.30       | ppi                    | 4.00%   | 13.19
    5.70%   | 19.90       | mpi_send_              | 3.00%   | 9.89
    3.80%   | 13.27       | get_diff_coefs         | 6.10%   | 20.11
    3.70%   | 12.92       | _int_free              |         |

  • Slide 13

    Extract more kernels
    – gauss_seidel_in_plane
      – Most expensive function
      – Called many times during l-cycles and V-cycles with variable-sized input data for fine and coarse grids
      – Tridiagonal solver in the radial direction introduces an MPI sweep-like dependency (see the sketch after this list)
      – Characterization useful, but not as important as multigrid itself
    – multigrid
      – Many calls to gauss_seidel_in_plane
      – Accounts for 48% inclusive time
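    The sweep-like dependency comes from the forward elimination and back substitution of the tridiagonal (Thomas) solve running along the radial direction, which is split across ranks: each rank must wait for the boundary coefficients from its radially upstream neighbor before it can start, and the last rank finishes only after every rank before it. The sketch below illustrates that pattern; the interface (NLOC, the coefficient arrays, the toy right-hand side) is made up for illustration and is not Aster's gauss_seidel_in_plane.

    /* sweep_sketch.c -- illustrative only; not Aster's gauss_seidel_in_plane.
     * Each MPI rank owns a contiguous block of one tridiagonal system laid out
     * along the radial direction: a[i] x[i-1] + b[i] x[i] + c[i] x[i+1] = d[i].
     * The Thomas algorithm's forward elimination must run rank 0 -> rank P-1 and
     * the back substitution rank P-1 -> rank 0, so ranks execute one after the
     * other: adding ranks in the radial direction lengthens the critical path. */
    #include <mpi.h>
    #include <stdio.h>

    #define NLOC 4                       /* radial cells per rank (placeholder) */

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Toy coefficients: diagonally dominant so the solve is well behaved. */
        double a[NLOC], b[NLOC], c[NLOC], d[NLOC], x[NLOC];
        for (int i = 0; i < NLOC; i++) { a[i] = -1.0; b[i] = 4.0; c[i] = -1.0; d[i] = 1.0; }

        /* ---- forward elimination: wait for the upstream (inner-radius) rank ---- */
        double up[2] = {0.0, 0.0};       /* {c', d'} of the upstream rank's last row */
        if (rank > 0)
            MPI_Recv(up, 2, MPI_DOUBLE, rank - 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        double cp[NLOC], dp[NLOC];
        double prev_c = up[0], prev_d = up[1];
        for (int i = 0; i < NLOC; i++) {
            double m = b[i] - a[i] * prev_c;
            cp[i] = c[i] / m;
            dp[i] = (d[i] - a[i] * prev_d) / m;
            prev_c = cp[i];
            prev_d = dp[i];
        }
        if (rank < size - 1) {
            double down[2] = {cp[NLOC - 1], dp[NLOC - 1]};
            MPI_Send(down, 2, MPI_DOUBLE, rank + 1, 0, MPI_COMM_WORLD);
        }

        /* ---- back substitution: wait for the downstream (outer-radius) rank ---- */
        double xnext = 0.0;
        if (rank < size - 1)
            MPI_Recv(&xnext, 1, MPI_DOUBLE, rank + 1, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        for (int i = NLOC - 1; i >= 0; i--) {
            x[i] = dp[i] - cp[i] * xnext;
            xnext = x[i];
        }
        if (rank > 0)
            MPI_Send(&x[0], 1, MPI_DOUBLE, rank - 1, 1, MPI_COMM_WORLD);

        printf("rank %d first radial value %.6f\n", rank, x[0]);
        MPI_Finalize();
        return 0;
    }

    With P ranks along the radius, each plane's solve costs on the order of P message latencies regardless of per-rank speed, which is consistent with the larger MPI share seen in the 64-rank TX2 runs.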

  • Slide 14

    Extract more kernels

    gauss_seidel_in_plane
    Architecture | Ranks | Time (s) | Ratio
    SKL          | 32    | 0.28     | 1
    TX2          | 64    | 0.25     | 0.9

    multigrid
    Architecture | Ranks | Time (s) | Ratio
    SKL          | 32    | 37.1     | 1
    TX2          | 64    | 52.0     | 1.4
    TX2          | 56    | 33.2     | 0.9

    MPI in multigrid
    Function    | SKL (s) | TX2 64 (s) | TX2 56 (s)
    mpi_recv    | 12.7    | 17.3       |
    mpi_waitall | 3.2     | 10.1       |
    mpi_send    | 2.5     | 4.1        |
    Total       | 18.4    | 31.5       |

  • Slide 15

    Characterize kernels
    – Run multigrid kernel through Intel VTune on SKL to determine performance characterization
      – Intel performance analysis tools provide extensive detail
    – multigrid kernel is memory bound on SKL
      – 65% of pipeline slots stalled due to load/store
      – ~10% clock ticks stalled on cache
      – 35% clock ticks stalled on DRAM
        – 41% clock ticks stalled for DRAM bandwidth boundedness
        – 16% clock ticks stalled for DRAM latency

  • Slide 16

    Characterize kernels
    – Multigrid time vs. arithmetic intensity
      – Low arithmetic intensity implies memory bandwidth will be the limiting factor (see the roofline relation below)
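    The standard roofline relation behind that claim (not spelled out on the slide) is

    \[
    P_{\text{attainable}} = \min\!\left(P_{\text{peak}},\ I \cdot B_{\text{peak}}\right),
    \qquad
    I = \frac{\text{FLOPs}}{\text{bytes moved to/from DRAM}} .
    \]

    Below the ridge point \(I = P_{\text{peak}}/B_{\text{peak}}\) (about 1075/128 ≈ 8.4 FLOP/byte per SKL 6130 socket and 563/171 ≈ 3.3 FLOP/byte for the 32-core TX2, using the peak table above), attainable performance scales with memory bandwidth rather than with floating-point peak.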

  • Slide 17

    Characterize kernels
    – Multigrid roofline on SKL
      – Heavy vertical lines show bounds of measured arithmetic intensity

  • Slide 18

    Characterize kernels
    – Based on DRAM bandwidth boundedness, expect higher aggregate bandwidth to run the code faster
      – Would like to measure effective bandwidth on Arm
    – Histogram shows MPI imbalance due to the sweep dependency of the tridiagonal solver
      – Larger number of ranks on TX2 than on SKL exacerbates the sweep dependency
      – Number of ranks in the angular dimensions stays the same; only the sweep direction increases in ranks

  • Slide 19

    Multigrid on four CPU architectures
    – Single node performance for multigrid kernel
      – Available memory bandwidth has large impact on performance
      – Intel VTune measured 41% clock ticks limited by DRAM bandwidth boundedness
      – More work needed to understand discrepancy between TX2 and EPYC

    CPU       | Bandwidth (GB/s) | Measured kernel time (s) | Aster time (s)
    Broadwell | 77               | 42.1                     |
    Skylake   | 128              | 32.2                     | 344
    ThunderX2 | 171              | 32.8                     | 341
    EPYC      | 171              | 21.2                     | 263

  • Slide 20

    Summary and future work
    – Node level results for the Aster code are encouraging
      – Initially disparate results were reconciled through profiling and finding the correct math libraries
      – Codes that are clearly bandwidth bound might be expected to perform similarly on TX2 and SKL
      – Shared memory byte transport layers show similar bandwidths and latencies when measured with micro-benchmarks
      – Additional latencies appear to be present in Aster and the extracted kernels, which require further study
    – Preparing Aster to run on Astra
      – Will perform multi-node scaling studies next
    – Sweep algorithm needs to be studied for improvement
      – Will benefit both SKL and TX2

  • Slide 21

    Additional Material

  • Slide 22

    Intel Skylake
    – Xeon Gold 6130 CPU @ 2.10 GHz
      – 16 cores, 32 threads
      – Max turbo frequency: 3.7 GHz
      – 22 MB L3 cache
      – TDP 125 W
      – Max memory speed: 2666 MHz
      – Number of AVX-512 FMA units: 2
      – Max number of memory channels: 6
    – DDR4-2666 channel bandwidth (per-channel arithmetic written out below)
      – Single: 19.87 GiB/s [= (64/8/1024^3) GiB x 2666 MHz] [= 21.3 GB/s]
      – Double: 39.74 GiB/s [= 42.7 GB/s]
      – Quad: 79.47 GiB/s [= 85.4 GB/s]
      – Hexa: 119.21 GiB/s [= 128.0 GB/s]
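    Written as a formula, the per-channel figure in the first bullet is the bus width times the transfer rate, and the Double/Quad/Hexa (and, on the TX2 slide, Octo) rows scale it by the channel count:

    \[
    B_{\text{channel}} = 8\ \text{B/transfer} \times 2666\ \text{MT/s}
    \approx 21.3\ \text{GB/s}
    = \frac{21.3 \times 10^{9}}{1024^{3}}\ \text{GiB/s}
    \approx 19.87\ \text{GiB/s}.
    \]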

  • Slide 23

    Intel Skylake
    – Floating point capacity (a general form of this arithmetic is given after this slide)
      – 2 x 512-bit VPU/core
      – Fused Multiply Add (FMA): 2 FLOP/VPU/cycle
      – Double precision
        – 2 FLOP/VPU/cycle x 2 VPU/core x 8 reals = 32 FLOP/cycle/core
        – 32 FLOP/cycle/core x [2.1 - 3.7] GHz = [67.2 - 118.4] GF/s/core
        – [67.2 - 118.4] GF/s/core x 16 cores = [1075.2 - 1894.4] GF/s
        – Single thread measurement [1]: 110 GF/s/core
      – Single precision
        – 2 FLOP/VPU/cycle x 2 VPU/core x 16 reals = 64 FLOP/cycle/core
        – 64 FLOP/cycle/core x [2.1 - 3.7] GHz = [134.4 - 236.8] GF/s/core
        – [134.4 - 236.8] GF/s/core x 16 cores = [2150.4 - 3788.8] GF/s
        – Single thread measurement [1]: 220 GF/s/core

    [1] Intel Advisor 19
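    The per-slide arithmetic on this and the two ThunderX2 slides instantiates one general expression; writing it out once (this consolidation is mine, not from the slides):

    \[
    P_{\text{peak}} = 2\,\tfrac{\text{FLOP}}{\text{FMA}} \times N_{\text{VPU}}
    \times \frac{\text{SIMD width (bits)}}{\text{bits per real}}
    \times f_{\text{clock}} \times N_{\text{cores}},
    \]

    e.g. 2 x 2 x (512/64) x 2.1 GHz x 16 cores = 1075.2 GF/s double precision for the 6130 at base clock.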

  • Slide 24

    Cavium ThunderX2
    – Cavium ThunderX2 CPU @ 2.2 GHz
      – B0 stepping
      – Some specs are best guess based on public information and the A2 stepping
      – 32 cores, 64 threads (up to 128 threads)
      – Max turbo frequency: ? GHz
      – 32 MB L3 cache
      – TDP 165 W
      – Max memory speed: 2666 MHz
      – Max number of DDR4 memory channels: 8
    – DDR4-2666 channel bandwidth
      – Single: 19.87 GiB/s [= (64/8/1024^3) GiB x 2666 MHz] [= 21.3 GB/s]
      – Double: 39.74 GiB/s [= 42.7 GB/s]
      – Quad: 79.47 GiB/s [= 85.4 GB/s]
      – Hexa: 119.21 GiB/s [= 128.0 GB/s]
      – Octo: 158.96 GiB/s [= 170.7 GB/s]

  • Slide 25

    Cavium ThunderX2 (32 core)
    – Floating point capacity [1]
      – 2 x 128-bit VPU/core
      – Fused Multiply Add (FMA): 2 FLOP/VPU/cycle
      – Double precision
        – 2 FLOP/VPU/cycle x 1 VPU/core x 4 reals = 8 FLOP/cycle/core
        – 8 FLOP/cycle/core x 2.2 GHz = 17.6 GF/s/core
        – 17.6 GF/s/core x 32 cores = 563 GF/s
        – Single thread measurement: ? GF/s/core
      – Single precision
        – 2 FLOP/VPU/cycle x 1 VPU/core x 8 reals = 16 FLOP/cycle/core
        – 16 FLOP/cycle/core x 2.2 GHz = 35.2 GF/s/core
        – 35.2 GF/s/core x 32 cores = 1126 GF/s
        – Single thread measurement: ? GF/s/core

    [1] Based on Broadcom Vulcan: https://www.nextplatform.com/2017/11/27/cavium-truly-contender-one-two-arm-server-punch/

  • Slide 26

    Cavium ThunderX2 (28 core)
    – Floating point capacity [1]
      – 2 x 128-bit VPU/core
      – Fused Multiply Add (FMA): 2 FLOP/VPU/cycle
      – Double precision
        – 2 FLOP/VPU/cycle x 1 VPU/core x 4 reals = 8 FLOP/cycle/core
        – 8 FLOP/cycle/core x 2.2 GHz = 17.6 GF/s/core
        – 17.6 GF/s/core x 28 cores = 493 GF/s
        – Single thread measurement: ? GF/s/core
      – Single precision
        – 2 FLOP/VPU/cycle x 1 VPU/core x 8 reals = 16 FLOP/cycle/core
        – 16 FLOP/cycle/core x 2.2 GHz = 35.2 GF/s/core
        – 35.2 GF/s/core x 28 cores = 986 GF/s
        – Single thread measurement: ? GF/s/core

    [1] Based on Broadcom Vulcan: https://www.nextplatform.com/2017/11/27/cavium-truly-contender-one-two-arm-server-punch/