-
PETTT. Distribution Statement A: Approved for public release; distribution is unlimited.
Performance of a multi-physics code on Cavium ThunderX2
Presented by John G. Wohlbier (PETTT/Engility), Keith Obenschain, Gopal Patnaik (NRL-DC)
September 26, 2018
User Productivity Enhancement, Technology Transfer, and Training (PETTT)
-
This material is based upon work supported by, or in part by, the Department of Defense (DoD)
High Performance Computing Modernization Program (HPCMP) under the User Productivity, Technology Transfer and Training (PETTT) Program,
contract number GS04T09DBC0017.
-
Outline
- Aster
- Comparison methodology
- Initial performance numbers
- Extract serial kernel
- Rev 1 performance numbers
- Extract more kernels
- Characterize kernels
- Summary and future work
-
Aster [1]
- Direct-drive inertial confinement fusion code
  - Spherical, structured grid
  - Two-temperature explicit CFD
  - Tabular equation of state
  - Implicit species heat conduction
  - Laser ray tracing
  - Fusion reactions
  - Multi-group radiation diffusion
  - Operator-split time stepping

[1] I.V. Igumenshchev et al., "Three-Dimensional Modeling of Direct-Drive Cryogenic Implosions on OMEGA," Physics of Plasmas 23, 052702 (2016).
[2] Image: https://str.llnl.gov/str/Haan.html
-
Comparison methodology
- Fixed problem size on one node
  - 21.3M cells/node
  - 1 MPI rank/core
- Node characteristics
  - Two sockets
  - 32 ranks on SKL, 2 x 125 W TDP
  - 56/64 ranks on TX2, 2 x 165 W TDP
- Compilers
  - Intel/Intel MPI on SKL
  - gcc/OpenMPI on TX2

Manufacturer  Architecture  Part  SIMD (bit)  Peak GF/s  Peak DRAM GB/s  Cores  GB/s/core  Mem channels
Intel         Skylake       6130  512         1075       128             16     7.1        6
Cavium        ThunderX2     B0    128         563        171             32     5.3        8
Cavium        ThunderX2     B1    128         493        171             28     6.1        8
-
Comparison methodology
- Compare STREAM Triad and HPCG
  - Triad peak efficiency = (Peak GB/s / 24 B per triad x 2 FLOP per triad) / Peak GF/s

STREAM:
Manufacturer  Architecture  Part  Peak GF/s  Peak DRAM GB/s  Triad peak eff. (%)  Threads  Triad (GB/s)
Intel         Skylake       6130  2 x 1075   2 x 128         1                    32       164
Cavium        ThunderX2     B0    2 x 563    2 x 171         2.5                  64       195
Cavium        ThunderX2     B1    2 x 493    2 x 171         2.9                  56       217

HPCG:
CPU        Nranks  Stack                  Read (GB/s)  Write (GB/s)  Total (GB/s)  Triad (GB/s)  FLOPS (GF/s)  Peak (GF/s)  % peak
Skylake    32      Intel 19.0, Intel MPI  133          31            164           164           21            2150         1
ThunderX2  64      gcc 7.2, OpenMPI 3.1   149          35            184           195           24            1126         2
ThunderX2  56      gcc 7.2, OpenMPI 3.1   149          34            183           217           23            986          2
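The Triad-peak formula above can be checked numerically. The following sketch recomputes the efficiency column from the two-socket node totals in the tables; it is a recomputation for illustration, not output from the tools used on these slides.

```python
# Check the Triad-peak formula numerically, using the two-socket node totals
# from the STREAM table (a recomputation, not data from the slides' tools).
def triad_peak_efficiency(peak_gbs, peak_gfs):
    # STREAM Triad (a[i] = b[i] + s * c[i]) moves 24 bytes and performs
    # 2 FLOPs per element, so peak bandwidth caps the achievable FLOP rate.
    triad_gfs = peak_gbs / 24.0 * 2.0
    return 100.0 * triad_gfs / peak_gfs

skl    = triad_peak_efficiency(2 * 128, 2 * 1075)  # ~1.0%
tx2_b0 = triad_peak_efficiency(2 * 171, 2 * 563)   # ~2.5%
tx2_b1 = triad_peak_efficiency(2 * 171, 2 * 493)   # ~2.9%
```

These match the "Triad peak efficiency" column, and show why low single-digit percentages of peak are the expected ceiling for bandwidth-bound kernels like HPCG.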
-
Initial performance numbers
- Ten cycles of Aster, 21.3M cells

Architecture  Ranks  Time (s)  Ratio
SKL           32     330       1
TX2           64     437       1.3

- Profile with Arm MAP
  - mpi_recv + gauss_seidel_in_plane are the largest costs, but have very similar absolute run times
  - pow + exp have disparate run times: SKL 15 s, TX2 60 s
  - This difference (60 - 15 = 45 s) is nearly half of the total difference (437 - 330 = 107 s)
-
Initial performance numbers
- Profile with Arm MAP (self time)

TX2 %   TX2 (s)  Function                SKL %   SKL (s)
13.70%  59.88    mpi_recv_               14.40%  47.48
10.80%  47.21    gauss_seidel_in_plane   18.70%  61.65
8.90%   38.90    __pow_finite            4.60%   15.17
6.00%   26.23    mpi_waitall_            3.60%   11.87
5.60%   24.48    mpi_bcast_              1.60%   5.28
5.20%   22.73    _gfortran_string_index  7.20%   23.74
5.00%   21.86    mpi_send_               3.00%   9.89
5.10%   22.29    __exp1
-
Extract serial kernel
- Identified a "pow"-heavy Aster function
- Extracted the kernel using KGen
  - https://github.com/NCAR/KGen
  - Instruments the application and generates verification data for the kernel
  - aster_tubr ("temperature update by radiation")
  - Kernel used internally at Arm to work on precision issues with armflang
- Arm recommends using Arm Performance Libraries (ArmPL)
  - Found Arm Optimized-Routines (AOR)
  - https://github.com/ARM-software/optimized-routines
  - Upstream for ArmPL
-
Extract serial kernel
- aster_tubr results
  - Run on the same input data
  - Weak scaling implies the TX2 data set would be half the size of the SKL data set
  - Best effective TX2 time ~1.36 s, compared to 0.88 s on SKL

Architecture  Time (s)  Compiler, library
SKL           0.88      Intel
TX2           5.44      gcc 7.2, default
TX2           5.96      armflang 18.4.1, default
TX2           3.37      armflang 18.4.1, -L${ARMPL_LIBRARIES} -lamath
TX2           2.72      gcc 7.2, Arm Optimized-Routines (-lmathlib)
-
Rev 1 performance numbers
- Ten cycles of Aster, 21.3M cells

Architecture  Ranks  Time (s)  Ratio
SKL           32     330       1
TX2           64     437       1.3
TX2 (rev 1)   64     349       1.06

- Profile with Arm MAP
  - mpi_recv + mpi_waitall + mpi_send: SKL 69 s, TX2 95 s
  - This difference (95 - 69 = 26 s) is larger than the total difference (349 - 330 = 19 s)
-
Rev 1 performance numbers
- Profile with Arm MAP (self time)

TX2 %   TX2 (s)  Function                SKL %   SKL (s)
13.60%  47.48    gauss_seidel_in_plane   18.70%  61.65
13.40%  46.78    mpi_recv_               14.40%  47.48
8.00%   27.93    mpi_waitall_            3.60%   11.87
6.70%   23.39    _gfortran_string_index  7.20%   23.74
6.10%   21.30    ppi                     4.00%   13.19
5.70%   19.90    mpi_send_               3.00%   9.89
3.80%   13.27    get_diff_coefs          6.10%   20.11
3.70%   12.92    _int_free
-
Extract more kernels
- gauss_seidel_in_plane
  - Most expensive function
  - Called many times during l-cycles and V-cycles with variable-sized input data for fine and coarse grids
  - Tridiagonal solver in the radial direction introduces an MPI sweep-like dependency
  - Characterization useful, but not as important as multigrid
- multigrid
  - Many calls to gauss_seidel_in_plane
  - Accounts for 48% of inclusive time
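The sweep dependency of the tridiagonal solve can be seen in a minimal serial Thomas-algorithm sketch (illustrative only, not Aster's solver): both passes traverse the radial index in order, so ranks partitioned along that direction must wait on their upstream neighbors.

```python
# Minimal serial Thomas algorithm for a tridiagonal system. Not Aster's code;
# it only illustrates why each radial solve has a sweep dependency: forward
# elimination and back substitution both walk the radial index in order.
def thomas(a, b, c, d):
    """Solve a tridiagonal system with sub-diagonal a, diagonal b,
    super-diagonal c, and right-hand side d (lists of length n;
    a[0] and c[n-1] are unused)."""
    n = len(d)
    cp, dp = [0.0] * n, [0.0] * n
    cp[0] = c[0] / b[0]
    dp[0] = d[0] / b[0]
    for i in range(1, n):                      # forward sweep (serial)
        m = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / m
        dp[i] = (d[i] - a[i] * dp[i - 1]) / m
    x = [0.0] * n
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):             # backward sweep (serial)
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x
```

When the radial direction is split across MPI ranks, each rank's forward sweep cannot start until the previous rank finishes, which is the imbalance the later histogram shows.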
-
Extract more kernels

gauss_seidel_in_plane:
Architecture  Ranks  Time (s)  Ratio
SKL           32     0.28      1
TX2           64     0.25      0.9

multigrid:
Architecture  Ranks  Time (s)  Ratio
SKL           32     37.1      1
TX2           64     52.0      1.4
TX2           56     33.2      0.9

MPI in multigrid:
Function     SKL (s)  TX2 64 (s)  TX2 56 (s)
mpi_recv     12.7     17.3
mpi_waitall  3.2      10.1
mpi_send     2.5      4.1
Total        18.4     31.5
-
Characterize kernels
- Ran the multigrid kernel through Intel VTune on SKL to characterize its performance
  - Intel performance analysis tools provide extensive detail
- The multigrid kernel is memory bound on SKL
  - 65% of pipeline slots stalled due to load/store
    - ~10% of clock ticks stalled on cache
    - 35% of clock ticks stalled on DRAM
      - 41% of clock ticks stalled for DRAM bandwidth boundedness
      - 16% of clock ticks stalled for DRAM latency
-
Characterize kernels
- Multigrid: time vs. arithmetic intensity
  - Low arithmetic intensity implies memory bandwidth will be the limiting factor
-
Characterize kernels
- Multigrid roofline on SKL
  - Heavy vertical lines show the bounds of the measured arithmetic intensity
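The roofline bound behind this plot is simple to restate. The sketch below uses the node-level SKL peaks from the earlier comparison tables; it is an illustration of the model, not the measured roofline from the slides.

```python
# Roofline sketch: attainable GF/s = min(peak GF/s, AI x peak GB/s), with the
# node-level SKL figures from the comparison tables (2 x 1075 DP GF/s,
# 2 x 128 GB/s). Illustration only; the slide's plot uses measured data.
SKL_PEAK_GFS = 2 * 1075.0
SKL_PEAK_GBS = 2 * 128.0

def roofline_gfs(ai_flop_per_byte):
    """Attainable performance at a given arithmetic intensity (FLOP/byte)."""
    return min(SKL_PEAK_GFS, ai_flop_per_byte * SKL_PEAK_GBS)

# Ridge point: below this intensity the kernel is bandwidth bound.
ridge = SKL_PEAK_GFS / SKL_PEAK_GBS   # ~8.4 FLOP/byte
```

A kernel sitting at, say, 0.1 FLOP/byte is capped at ~26 GF/s on this node regardless of its FLOP rate, which is consistent with the multigrid kernel landing well under the DRAM roof.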
-
Characterize kernels
- Based on the DRAM bandwidth boundedness, we expect higher aggregate bandwidth to run the code faster
  - Would like to measure effective bandwidth on Arm
- Histogram shows MPI imbalance due to the sweep dependency of the tridiagonal solver
  - The larger number of ranks on TX2 than on SKL exacerbates the sweep dependency
  - The number of ranks in the angular dimensions stays the same; only the sweep direction gains ranks
-
Multigrid on four CPU architectures
- Single-node performance for the multigrid kernel
  - Available memory bandwidth has a large impact on performance
  - Intel VTune measured 41% of clock ticks limited by DRAM bandwidth boundedness
  - More work is needed to understand the discrepancy between TX2 and EPYC

CPU        Bandwidth (GB/s)  Kernel time (s)  Aster time (s)
Broadwell  77                42.1
Skylake    128               32.2             344
ThunderX2  171               32.8             341
EPYC       171               21.2             263
-
Summary and future work
- Node-level results for the Aster code are encouraging
  - Initially disparate results were reconciled through profiling and finding the correct math libraries
  - Codes that are clearly bandwidth bound might be expected to perform similarly on TX2 and SKL
  - Shared-memory byte-transport layers show similar bandwidths and latencies when measured with micro-benchmarks
  - Additional latencies appear to be present in Aster and the extracted kernels; this requires further study
- Preparing Aster to run on Astra
  - Multi-node scaling studies are next
- The sweep algorithm needs to be studied for improvement
  - Improvements will benefit both SKL and TX2
-
Additional Material
-
Intel Skylake
- Xeon Gold 6130 CPU @ 2.10 GHz
  - 16 cores, 32 threads
  - Max turbo frequency: 3.7 GHz
  - 22 MB L3 cache
  - TDP 125 W
  - Max memory speed: 2666 MHz
  - AVX-512 FMA units: 2
  - Max memory channels: 6
- DDR4-2666 bandwidth by channel count
  - Single: 19.87 GiB/s [= (64/8/1024^3) GiB x 2666 MHz] = 21.3 GB/s
  - Double: 39.74 GiB/s = 42.7 GB/s
  - Quad: 79.47 GiB/s = 85.4 GB/s
  - Hexa: 119.21 GiB/s = 128.0 GB/s
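The per-channel bandwidth figures above follow from the channel width and transfer rate; a quick recomputation sketch (not vendor data):

```python
# Recompute the per-channel DDR4-2666 bandwidth figures: each channel is
# 64 bits (8 bytes) wide and transfers at 2666 MT/s.
BYTES_PER_TRANSFER = 64 // 8
TRANSFERS_PER_SEC = 2666e6

def channel_bw_gbs(channels):
    """Peak bandwidth in GB/s (decimal) for a given channel count."""
    return channels * BYTES_PER_TRANSFER * TRANSFERS_PER_SEC / 1e9

def channel_bw_gibs(channels):
    """Peak bandwidth in GiB/s (binary) for a given channel count."""
    return channels * BYTES_PER_TRANSFER * TRANSFERS_PER_SEC / 1024**3

single = channel_bw_gbs(1)  # ~21.3 GB/s (~19.87 GiB/s) per channel
hexa = channel_bw_gbs(6)    # ~128 GB/s, the six-channel SKL peak
```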
-
Intel Skylake
- Floating point capacity
  - 2 x 512-bit VPU/core
  - Fused multiply-add (FMA): 2 FLOP/VPU/cycle
  - Double precision
    - 2 FLOP/VPU/cycle x 2 VPU/core x 8 reals = 32 FLOP/cycle/core
    - 32 FLOP/cycle/core x [2.1 - 3.7] GHz = [67.2 - 118.4] GF/s/core
    - [67.2 - 118.4] GF/s/core x 16 cores = [1075.2 - 1894.4] GF/s
    - Single-thread measurement [1]: 110 GF/s/core
  - Single precision
    - 2 FLOP/VPU/cycle x 2 VPU/core x 16 reals = 64 FLOP/cycle/core
    - 64 FLOP/cycle/core x [2.1 - 3.7] GHz = [134.4 - 236.8] GF/s/core
    - [134.4 - 236.8] GF/s/core x 16 cores = [2150.4 - 3788.8] GF/s
    - Single-thread measurement [1]: 220 GF/s/core

[1] Intel Advisor 19
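The peak-FLOPS chains on this and the following slides all have the same shape, so they can be recomputed with one helper. This is a recomputation from the slide figures, not vendor data; note that for TX2 the slides' "1 VPU/core x 4 reals" factorization is algebraically the same as two 128-bit VPUs each holding two doubles.

```python
# Recompute the peak-FLOPS chains for any part on these slides:
#   peak GF/s = FMA FLOP/VPU/cycle x VPUs/core x SIMD lanes x GHz x cores
# Recomputation from the slide figures, not vendor data.
def peak_gfs(vpus, simd_bits, ghz, cores, bytes_per_real=8, fma_flops=2):
    lanes = simd_bits // (8 * bytes_per_real)   # reals per SIMD register
    return fma_flops * vpus * lanes * ghz * cores

skl_dp = peak_gfs(2, 512, 2.1, 16)    # ~1075 GF/s, SKL double precision, base clock
tx2_b0 = peak_gfs(2, 128, 2.2, 32)    # ~563 GF/s, TX2 B0 double precision
tx2_b1 = peak_gfs(2, 128, 2.2, 28)    # ~493 GF/s, TX2 B1 double precision
skl_sp = peak_gfs(2, 512, 2.1, 16, bytes_per_real=4)  # ~2150 GF/s, single precision
```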
-
Cavium ThunderX2
- Cavium ThunderX2 CPU @ 2.2 GHz
  - B0 stepping
  - Some specs are best guesses based on public information and the A2 stepping
  - 32 cores, 64 threads (up to 128 threads)
  - Max turbo frequency: ? GHz
  - 32 MB L3 cache
  - TDP 165 W
  - Max memory speed: 2666 MHz
  - Max DDR4 memory channels: 8
- DDR4-2666 bandwidth by channel count
  - Single: 19.87 GiB/s [= (64/8/1024^3) GiB x 2666 MHz] = 21.3 GB/s
  - Double: 39.74 GiB/s = 42.7 GB/s
  - Quad: 79.47 GiB/s = 85.4 GB/s
  - Hexa: 119.21 GiB/s = 128.0 GB/s
  - Octo: 158.96 GiB/s = 170.7 GB/s
-
Cavium ThunderX2 (32 core)
- Floating point capacity [1]
  - 2 x 128-bit VPU/core
  - Fused multiply-add (FMA): 2 FLOP/VPU/cycle
  - Double precision
    - 2 FLOP/VPU/cycle x 2 VPU/core x 2 reals = 8 FLOP/cycle/core
    - 8 FLOP/cycle/core x 2.2 GHz = 17.6 GF/s/core
    - 17.6 GF/s/core x 32 cores = 563 GF/s
    - Single-thread measurement: ? GF/s/core
  - Single precision
    - 2 FLOP/VPU/cycle x 2 VPU/core x 4 reals = 16 FLOP/cycle/core
    - 16 FLOP/cycle/core x 2.2 GHz = 35.2 GF/s/core
    - 35.2 GF/s/core x 32 cores = 1126 GF/s
    - Single-thread measurement: ? GF/s/core

[1] Based on Broadcom Vulcan: https://www.nextplatform.com/2017/11/27/cavium-truly-contender-one-two-arm-server-punch/
-
Cavium ThunderX2 (28 core)
- Floating point capacity [1]
  - 2 x 128-bit VPU/core
  - Fused multiply-add (FMA): 2 FLOP/VPU/cycle
  - Double precision
    - 2 FLOP/VPU/cycle x 2 VPU/core x 2 reals = 8 FLOP/cycle/core
    - 8 FLOP/cycle/core x 2.2 GHz = 17.6 GF/s/core
    - 17.6 GF/s/core x 28 cores = 493 GF/s
    - Single-thread measurement: ? GF/s/core
  - Single precision
    - 2 FLOP/VPU/cycle x 2 VPU/core x 4 reals = 16 FLOP/cycle/core
    - 16 FLOP/cycle/core x 2.2 GHz = 35.2 GF/s/core
    - 35.2 GF/s/core x 28 cores = 986 GF/s
    - Single-thread measurement: ? GF/s/core

[1] Based on Broadcom Vulcan: https://www.nextplatform.com/2017/11/27/cavium-truly-contender-one-two-arm-server-punch/