
  • PETTT Distribution Statement A: Approved for public release; distribution is unlimited.

    Performance of a multi-physics code on Cavium ThunderX2

    Presented by John G. Wohlbier (PETTT/Engility), Keith Obenschain, Gopal Patnaik (NRL-DC)

    September 26, 2018

    User Productivity Enhancement, Technology Transfer, and Training (PETTT)

  • Slide 2

    This material is based upon work supported by, or in part by, the Department of Defense (DoD) High Performance Computing Modernization Program (HPCMP) under the User Productivity Enhancement, Technology Transfer, and Training (PETTT) Program, contract number GS04T09DBC0017.

  • Slide 3

    Outline
    – Aster
    – Comparison methodology
    – Initial performance numbers
    – Extract serial kernel
    – Rev 1 performance numbers
    – Extract more kernels
    – Characterize kernels
    – Summary and future work

  • Slide 4

    Aster [1]
    – Direct-drive inertial confinement fusion code
      – Spherical, structured grid
      – Two-temperature explicit CFD
      – Tabular equation of state
      – Implicit species heat conduction
      – Laser ray tracing
      – Fusion reactions
      – Multi-group radiation diffusion
      – Operator-split time stepping (a generic sketch of one such cycle follows)

    [1] I.V. Igumenshchev, et al., "Three-Dimensional Modeling of Direct-Drive Cryogenic Implosions on OMEGA," Physics of Plasmas 23, 052702 (2016).
    [2] Image: https://str.llnl.gov/str/Haan.html
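    "Operator-split time stepping" means the physics packages listed above are applied one after another within each cycle, each advancing the same shared state. Aster's actual loop is not shown in the slides; the sketch below is a generic illustration, and State, hydro_explicit, and the other names are placeholders, not Aster routines.

    /* cycle_sketch.c -- illustrative operator-split cycle, not Aster's code.
     * All names and the State contents are placeholders. */
    #include <stdio.h>

    typedef struct {
        double t;   /* simulation time; a real code would hold the grid fields here */
    } State;

    /* Stub packages: in an operator-split scheme each one advances the same
     * shared state by dt, in a fixed order, once per cycle. */
    static void hydro_explicit(State *s, double dt)           { (void)s; (void)dt; }
    static void heat_conduction_implicit(State *s, double dt) { (void)s; (void)dt; }
    static void laser_ray_trace(State *s, double dt)          { (void)s; (void)dt; }
    static void fusion_reactions(State *s, double dt)         { (void)s; (void)dt; }
    static void radiation_diffusion(State *s, double dt)      { (void)s; (void)dt; }

    static void advance_cycle(State *s, double dt)
    {
        hydro_explicit(s, dt);           /* two-temperature explicit CFD (tabular EOS inside) */
        heat_conduction_implicit(s, dt); /* implicit species heat conduction */
        laser_ray_trace(s, dt);          /* laser energy deposition */
        fusion_reactions(s, dt);         /* fusion reaction sources */
        radiation_diffusion(s, dt);      /* multi-group radiation diffusion */
        s->t += dt;
    }

    int main(void)
    {
        State s = { 0.0 };
        for (int cycle = 0; cycle < 10; cycle++)   /* "ten cycles of Aster" in the later timings */
            advance_cycle(&s, 1.0e-12);
        printf("advanced to t = %g s\n", s.t);
        return 0;
    }

    The "ten cycles of Aster" timed later in the deck correspond to ten passes through a loop of this shape.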

  • Slide 5

    Comparison methodology
    – Fixed problem size on one node
      – 21.3M cells/node
      – 1 MPI rank/core
    – Node characteristics
      – Two sockets
      – 32 ranks on SKL, 2 x 125 W TDP
      – 56/64 ranks on TX2, 2 x 165 W TDP
    – Compilers
      – Intel/Intel MPI on SKL
      – gcc/OpenMPI on TX2

    Manufacturer | Architecture | Part | SIMD width (bits) | Peak GF/s | Peak DRAM GB/s | Cores | Peak DRAM GB/s/core | Memory channels
    Intel        | Skylake      | 6130 | 512               | 1075      | 128            | 16    | 7.1                 | 6
    Cavium       | ThunderX2    | B0   | 128               | 563       | 171            | 32    | 5.3                 | 8
    Cavium       | ThunderX2    | B0   | 128               | 493       | 171            | 28    | 6.1                 | 8

  • Slide 6

    Comparison methodology
    – Compare STREAM Triad and HPCG
      – Triad peak efficiency = (Peak GB/s / 24 B per triad x 2 FLOP per triad) / Peak GF/s (see the worked example after the tables)

    STREAM
    Manufacturer | Architecture | Part | Peak GF/s | Peak DRAM GB/s | Triad peak efficiency (%) | Threads | Triad (GB/s)
    Intel        | Skylake      | 6130 | 2 x 1075  | 2 x 128        | 1                         | 32      | 164
    Cavium       | ThunderX2    | B0   | 2 x 563   | 2 x 171        | 2.5                       | 64      | 195
    Cavium       | ThunderX2    | B1   | 2 x 493   | 2 x 171        | 2.9                       | 56      | 217

    HPCG
    CPU       | Nrank | Stack                 | Read (GB/s) | Write (GB/s) | Total (GB/s) | STREAM Triad (GB/s) | FLOPS (GF/s) | Peak (GF/s) | % peak
    Skylake   | 32    | Intel 19.0, Intel MPI | 133         | 31           | 164          | 164                 | 21           | 2150        | 1
    ThunderX2 | 64    | gcc 7.2, OpenMPI 3.1  | 149         | 35           | 184          | 195                 | 24           | 1126        | 2
    ThunderX2 | 56    | gcc 7.2, OpenMPI 3.1  | 149         | 34           | 183          | 217                 | 23           | 986         | 2
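    The 24 B and 2 FLOP per triad come from the kernel itself: each iteration reads b[i] and c[i], writes a[i] (three 8-byte doubles), and performs one multiply and one add. A minimal C sketch of that accounting and of the efficiency numbers in the tables is below; the node figures are the dual-socket peaks from the comparison table, and the kernel shown is only a stand-in for the official STREAM source.

    /* triad_efficiency.c -- back-of-envelope check of the "Triad peak efficiency"
     * column; a stand-in for STREAM, not the official benchmark source. */
    #include <stdio.h>

    /* STREAM Triad kernel: per iteration, 24 B of DRAM traffic
     * (read b[i], read c[i], write a[i]) and 2 FLOP (one multiply, one add). */
    void triad(double *a, const double *b, const double *c, double q, long n)
    {
        for (long i = 0; i < n; i++)
            a[i] = b[i] + q * c[i];
    }

    int main(void)
    {
        /* Dual-socket peak numbers taken from the comparison tables above. */
        struct { const char *name; double peak_gbs, peak_gfs; } node[] = {
            { "Skylake 6130, 2 x 16c", 2 * 128.0, 2 * 1075.0 },
            { "ThunderX2 B0, 2 x 32c", 2 * 171.0, 2 *  563.0 },
            { "ThunderX2 B1, 2 x 28c", 2 * 171.0, 2 *  493.0 },
        };

        for (int k = 0; k < 3; k++) {
            /* GF/s sustainable while streaming Triad: (GB/s / 24 B) * 2 FLOP */
            double triad_gfs = node[k].peak_gbs / 24.0 * 2.0;
            printf("%-22s triad-limited %4.1f GF/s = %.1f%% of peak\n",
                   node[k].name, triad_gfs, 100.0 * triad_gfs / node[k].peak_gfs);
        }
        return 0;
    }

    This reproduces the roughly 1%, 2.5%, and 2.9% efficiency figures: even at full DRAM bandwidth, a streaming kernel can reach only a few percent of either node's floating-point peak.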

  • Slide 7

    Initial performance numbers
    – Ten cycles of Aster, 21.3M cells

    Architecture | Ranks | Time (s) | Ratio
    SKL          | 32    | 330      | 1
    TX2          | 64    | 437      | 1.3

    – Profile with Arm MAP
      – mpi_recv + gauss_seidel_in_plane are the largest costs, but have very similar absolute run times
      – pow + exp have disparate run times: SKL 15 s, TX2 60 s
      – The difference (60 - 15 = 45 s) is nearly half of the total difference (437 - 330 = 107 s)

  • Slide 8

    Initial performance numbers
    – Profile with Arm MAP (self time)

    TX2 %   | TX2 seconds | Function               | SKL %   | SKL seconds
    13.70%  | 59.88       | mpi_recv_              | 14.40%  | 47.48
    10.80%  | 47.21       | gauss_seidel_in_plane  | 18.70%  | 61.65
    8.90%   | 38.90       | __pow_finite           | 4.60%   | 15.17
    6.00%   | 26.23       | mpi_waitall_           | 3.60%   | 11.87
    5.60%   | 24.48       | mpi_bcast_             | 1.60%   | 5.28
    5.20%   | 22.73       | _gfortran_string_index | 7.20%   | 23.74
    5.00%   | 21.86       | mpi_send_              | 3.00%   | 9.89
    5.10%   | 22.29       | __exp1                 |         |

  • Slide 9

    Extract serial kernel
    – Identified "pow"-heavy Aster function
    – Extract kernel using KGen
      – https://github.com/NCAR/KGen
      – Instruments application and generates verification data for kernel
      – aster_tubr ("temperature update by radiation")
      – Kernel used internally at Arm to work on precision issues with armflang
    – Arm recommends using "Arm Performance Libraries" (ArmPL)
      – Found Arm Optimized-Routines (AOR)
      – https://github.com/ARM-software/optimized-routines
      – Upstream for ArmPL

  • Slide 10

    Extract serial kernel
    – aster_tubr results
      – Run on same input data
      – Weak scaling implies the per-rank TX2 data set would be half the size of the SKL data set
      – Best effective TX2 time ~1.36 s (2.72 s / 2) compared to 0.88 s on SKL
      – A stand-alone pow/exp micro-benchmark sketch follows the table

    Architecture | Time (s) | Compiler, library
    SKL          | 0.88     | Intel
    TX2          | 5.44     | gcc 7.2, default
    TX2          | 5.96     | armflang 18.4.1, default
    TX2          | 3.37     | armflang 18.4.1, -L${ARMPL_LIBRARIES} -lamath
    TX2          | 2.72     | gcc 7.2, Arm Optimized-Routines -lmathlib
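    The gains in the table come from swapping the default libm pow/exp for the vendor-tuned versions. A rough way to compare the libraries outside of Aster is a micro-benchmark like the sketch below; it is hypothetical, not the extracted aster_tubr kernel, and the array size, exponent, and link paths are placeholders to adjust for the local install.

    /* powexp_bench.c -- hypothetical pow/exp micro-benchmark, not aster_tubr.
     *
     * Example link variants (paths/compilers are illustrative):
     *   gcc -O2 powexp_bench.c -lm                                      # default libm
     *   armclang -O2 powexp_bench.c -L${ARMPL_LIBRARIES} -lamath -lm    # ArmPL libamath
     *   gcc -O2 powexp_bench.c -L<AOR build dir> -lmathlib -lm          # Arm Optimized-Routines
     */
    #include <math.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    int main(void)
    {
        const size_t n = 1 << 24;            /* placeholder problem size */
        double *x = malloc(n * sizeof *x);
        double sum = 0.0;

        for (size_t i = 0; i < n; i++)       /* arbitrary positive inputs */
            x[i] = 1.0 + (double)i / n;

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (size_t i = 0; i < n; i++)       /* pow + exp mix, roughly what the profile flags */
            sum += pow(x[i], 1.4) + exp(-x[i]);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double secs = (t1.tv_sec - t0.tv_sec) + 1e-9 * (t1.tv_nsec - t0.tv_nsec);
        printf("checksum %.6e, %zu pow+exp pairs in %.3f s\n", sum, n, secs);
        free(x);
        return 0;
    }

    The same source linked the three ways gives a library-only comparison of the pow/exp cost that the MAP profile attributes to __pow_finite and __exp1.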

  • Slide 11

    Rev 1 performance numbers
    – Ten cycles of Aster, 21.3M cells

    Architecture | Ranks | Time (s) | Ratio
    SKL          | 32    | 330      | 1
    TX2          | 64    | 437      | 1.3
    TX2          | 64    | 349      | 1.06

    – Profile with Arm MAP
      – mpi_recv + mpi_waitall + mpi_send: SKL 69 s, TX2 95 s
      – The difference (95 - 69 = 26 s) is larger than the total difference (349 - 330 = 19 s)

  • Slide 12

    Rev 1 performance numbers
    – Profile with Arm MAP (self time)

    TX2 %   | TX2 seconds | Function               | SKL %   | SKL seconds
    13.60%  | 47.48       | gauss_seidel_in_plane  | 18.70%  | 61.65
    13.40%  | 46.78       | mpi_recv_              | 14.40%  | 47.48
    8.00%   | 27.93       | mpi_waitall_           | 3.60%   | 11.87
    6.70%   | 23.39       | _gfortran_string_index | 7.20%   | 23.74
    6.10%   | 21.30       | ppi                    | 4.00%   | 13.19
    5.70%   | 19.90       | mpi_send_              | 3.00%   | 9.89
    3.80%   | 13.27       | get_diff_coefs         | 6.10%   | 20.11
    3.70%   | 12.92       | _int_free              |         |

  • Slide 13

    Extract more kernels
    – gauss_seidel_in_plane
      – Most expensive function
      – Called many times during l-cycles and V-cycles with variable-sized input data for fine and coarse grids
      – Tridiagonal solver in the radial direction introduces an MPI sweep-like dependency (see the sketch after this list)
      – Characterization useful, but not as important as multigrid itself
    – multigrid
      – Many calls to gauss_seidel_in_plane
      – Accounts for 48% inclusive time
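    The sweep-like dependency comes from the forward elimination and back substitution of the tridiagonal (Thomas) solve running along the radial direction, which is split across ranks: each rank must wait for the boundary coefficients from its radially upstream neighbor before it can start, and the last rank finishes only after every rank before it. The sketch below illustrates that pattern; the interface (NLOC, the coefficient arrays, the toy right-hand side) is made up for illustration and is not Aster's gauss_seidel_in_plane.

    /* sweep_sketch.c -- illustrative only; not Aster's gauss_seidel_in_plane.
     * Each MPI rank owns a contiguous block of one tridiagonal system laid out
     * along the radial direction: a[i] x[i-1] + b[i] x[i] + c[i] x[i+1] = d[i].
     * The Thomas algorithm's forward elimination must run rank 0 -> rank P-1 and
     * the back substitution rank P-1 -> rank 0, so ranks execute one after the
     * other: adding ranks in the radial direction lengthens the critical path. */
    #include <mpi.h>
    #include <stdio.h>

    #define NLOC 4                       /* radial cells per rank (placeholder) */

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Toy coefficients: diagonally dominant so the solve is well behaved. */
        double a[NLOC], b[NLOC], c[NLOC], d[NLOC], x[NLOC];
        for (int i = 0; i < NLOC; i++) { a[i] = -1.0; b[i] = 4.0; c[i] = -1.0; d[i] = 1.0; }

        /* ---- forward elimination: wait for the upstream (inner-radius) rank ---- */
        double up[2] = {0.0, 0.0};       /* {c', d'} of the upstream rank's last row */
        if (rank > 0)
            MPI_Recv(up, 2, MPI_DOUBLE, rank - 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        double cp[NLOC], dp[NLOC];
        double prev_c = up[0], prev_d = up[1];
        for (int i = 0; i < NLOC; i++) {
            double m = b[i] - a[i] * prev_c;
            cp[i] = c[i] / m;
            dp[i] = (d[i] - a[i] * prev_d) / m;
            prev_c = cp[i];
            prev_d = dp[i];
        }
        if (rank < size - 1) {
            double down[2] = {cp[NLOC - 1], dp[NLOC - 1]};
            MPI_Send(down, 2, MPI_DOUBLE, rank + 1, 0, MPI_COMM_WORLD);
        }

        /* ---- back substitution: wait for the downstream (outer-radius) rank ---- */
        double xnext = 0.0;
        if (rank < size - 1)
            MPI_Recv(&xnext, 1, MPI_DOUBLE, rank + 1, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        for (int i = NLOC - 1; i >= 0; i--) {
            x[i] = dp[i] - cp[i] * xnext;
            xnext = x[i];
        }
        if (rank > 0)
            MPI_Send(&x[0], 1, MPI_DOUBLE, rank - 1, 1, MPI_COMM_WORLD);

        printf("rank %d first radial value %.6f\n", rank, x[0]);
        MPI_Finalize();
        return 0;
    }

    With P ranks along the radius, each plane's solve costs on the order of P message latencies regardless of per-rank speed, which is consistent with the larger MPI share seen in the 64-rank TX2 runs.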

  • Slide 14

    Extract more kernels

    gauss_seidel_in_plane
    Architecture | Ranks | Time (s) | Ratio
    SKL          | 32    | 0.28     | 1
    TX2          | 64    | 0.25     | 0.9

    multigrid
    Architecture | Ranks | Time (s) | Ratio
    SKL          | 32    | 37.1     | 1
    TX2          | 64    | 52.0     | 1.4
    TX2          | 56    | 33.2     | 0.9

    MPI in multigrid
    Function    | SKL (s) | TX2 64 (s) | TX2 56 (s)
    mpi_recv    | 12.7    | 17.3       |
    mpi_waitall | 3.2     | 10.1       |
    mpi_send    | 2.5     | 4.1        |
    Total       | 18.4    | 31.5       |

  • Slide 15

    Characterize kernels
    – Run multigrid kernel through Intel VTune on SKL to determine performance characterization
      – Intel performance analysis tools provide extensive detail
    – multigrid kernel is memory bound on SKL
      – 65% of pipeline slots stalled due to load/store
      – ~10% clock ticks stalled on cache
      – 35% clock ticks stalled on DRAM
        – 41% clock ticks stalled for DRAM bandwidth boundedness
        – 16% clock ticks stalled for DRAM latency

  • Slide 16

    Characterize kernels
    – Multigrid time vs. arithmetic intensity
      – Low arithmetic intensity implies memory bandwidth will be the limiting factor (see the roofline relation below)
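    The standard roofline relation behind that claim (not spelled out on the slide) is

    \[
    P_{\text{attainable}} = \min\!\left(P_{\text{peak}},\ I \cdot B_{\text{peak}}\right),
    \qquad
    I = \frac{\text{FLOPs}}{\text{bytes moved to/from DRAM}} .
    \]

    Below the ridge point \(I = P_{\text{peak}}/B_{\text{peak}}\) (about 1075/128 ≈ 8.4 FLOP/byte per SKL 6130 socket and 563/171 ≈ 3.3 FLOP/byte for the 32-core TX2, using the peak table above), attainable performance scales with memory bandwidth rather than with floating-point peak.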

  • Slide 17

    Characterize kernels
    – Multigrid roofline on SKL
      – Heavy vertical lines show bounds of measured arithmetic intensity

  • Slide 18

    Characterize kernels
    – Based on DRAM bandwidth boundedness, expect higher aggregate bandwidth to run the code faster
      – Would like to measure effective bandwidth on Arm
    – Histogram shows MPI imbalance due to the sweep dependency of the tridiagonal solver
      – Larger number of ranks on TX2 than on SKL exacerbates the sweep dependency
      – Number of ranks in the angular dimensions stays the same; only the sweep direction increases in ranks

  • Slide 19

    Multigrid on four CPU architectures
    – Single node performance for multigrid kernel
      – Available memory bandwidth has large impact on performance
      – Intel VTune measured 41% clock ticks limited by DRAM bandwidth boundedness
      – More work needed to understand discrepancy between TX2 and EPYC

    CPU       | Bandwidth (GB/s) | Measured kernel time (s) | Aster time (s)
    Broadwell | 77               | 42.1                     |
    Skylake   | 128              | 32.2                     | 344
    ThunderX2 | 171              | 32.8                     | 341
    EPYC      | 171              | 21.2                     | 263

  • Slide 20

    Summary and future work
    – Node level results for the Aster code are encouraging
      – Initially disparate results were reconciled through profiling and finding the correct math libraries
      – Codes that are clearly bandwidth bound might be expected to perform similarly on TX2 and SKL
      – Shared memory byte transport layers show similar bandwidths and latencies when measured with micro-benchmarks
      – Additional latencies appear to be present in Aster and the extracted kernels, which require further study
    – Preparing Aster to run on Astra
      – Will perform multi-node scaling studies next
    – Sweep algorithm needs to be studied for improvement
      – Will benefit both SKL and TX2

  • Slide 21

    Additional Material

  • Slide 22

    Intel Skylake
    – Xeon Gold 6130 CPU @ 2.10 GHz
      – 16 cores, 32 threads
      – Max turbo frequency: 3.7 GHz
      – 22 MB L3 cache
      – TDP 125 W
      – Max memory speed: 2666 MHz
      – Number of AVX-512 FMA units: 2
      – Max number of memory channels: 6
    – DDR4-2666 channel bandwidth (per-channel arithmetic written out below)
      – Single: 19.87 GiB/s [= (64/8/1024^3) GiB x 2666 MHz] [= 21.3 GB/s]
      – Double: 39.74 GiB/s [= 42.7 GB/s]
      – Quad: 79.47 GiB/s [= 85.4 GB/s]
      – Hexa: 119.21 GiB/s [= 128.0 GB/s]
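    Written as a formula, the per-channel figure in the first bullet is the bus width times the transfer rate, and the Double/Quad/Hexa (and, on the TX2 slide, Octo) rows scale it by the channel count:

    \[
    B_{\text{channel}} = 8\ \text{B/transfer} \times 2666\ \text{MT/s}
    \approx 21.3\ \text{GB/s}
    = \frac{21.3 \times 10^{9}}{1024^{3}}\ \text{GiB/s}
    \approx 19.87\ \text{GiB/s}.
    \]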

  • Slide 23

    Intel Skylake
    – Floating point capacity (a general form of this arithmetic is given after this slide)
      – 2 x 512-bit VPU/core
      – Fused Multiply Add (FMA): 2 FLOP/VPU/cycle
      – Double precision
        – 2 FLOP/VPU/cycle x 2 VPU/core x 8 reals = 32 FLOP/cycle/core
        – 32 FLOP/cycle/core x [2.1 - 3.7] GHz = [67.2 - 118.4] GF/s/core
        – [67.2 - 118.4] GF/s/core x 16 cores = [1075.2 - 1894.4] GF/s
        – Single thread measurement [1]: 110 GF/s/core
      – Single precision
        – 2 FLOP/VPU/cycle x 2 VPU/core x 16 reals = 64 FLOP/cycle/core
        – 64 FLOP/cycle/core x [2.1 - 3.7] GHz = [134.4 - 236.8] GF/s/core
        – [134.4 - 236.8] GF/s/core x 16 cores = [2150.4 - 3788.8] GF/s
        – Single thread measurement [1]: 220 GF/s/core

    [1] Intel Advisor 19
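    The per-slide arithmetic on this and the two ThunderX2 slides instantiates one general expression; writing it out once (this consolidation is mine, not from the slides):

    \[
    P_{\text{peak}} = 2\,\tfrac{\text{FLOP}}{\text{FMA}} \times N_{\text{VPU}}
    \times \frac{\text{SIMD width (bits)}}{\text{bits per real}}
    \times f_{\text{clock}} \times N_{\text{cores}},
    \]

    e.g. 2 x 2 x (512/64) x 2.1 GHz x 16 cores = 1075.2 GF/s double precision for the 6130 at base clock.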

  • Slide 24

    Cavium ThunderX2
    – Cavium ThunderX2 CPU @ 2.2 GHz
      – B0 stepping
      – Some specs are best guess based on public information and the A2 stepping
      – 32 cores, 64 threads (up to 128 threads)
      – Max turbo frequency: ? GHz
      – 32 MB L3 cache
      – TDP 165 W
      – Max memory speed: 2666 MHz
      – Max number of DDR4 memory channels: 8
    – DDR4-2666 channel bandwidth
      – Single: 19.87 GiB/s [= (64/8/1024^3) GiB x 2666 MHz] [= 21.3 GB/s]
      – Double: 39.74 GiB/s [= 42.7 GB/s]
      – Quad: 79.47 GiB/s [= 85.4 GB/s]
      – Hexa: 119.21 GiB/s [= 128.0 GB/s]
      – Octo: 158.96 GiB/s [= 170.7 GB/s]

  • Slide 25

    Cavium ThunderX2 (32 core)
    – Floating point capacity [1]
      – 2 x 128-bit VPU/core
      – Fused Multiply Add (FMA): 2 FLOP/VPU/cycle
      – Double precision
        – 2 FLOP/VPU/cycle x 1 VPU/core x 4 reals = 8 FLOP/cycle/core
        – 8 FLOP/cycle/core x 2.2 GHz = 17.6 GF/s/core
        – 17.6 GF/s/core x 32 cores = 563 GF/s
        – Single thread measurement: ? GF/s/core
      – Single precision
        – 2 FLOP/VPU/cycle x 1 VPU/core x 8 reals = 16 FLOP/cycle/core
        – 16 FLOP/cycle/core x 2.2 GHz = 35.2 GF/s/core
        – 35.2 GF/s/core x 32 cores = 1126 GF/s
        – Single thread measurement: ? GF/s/core

    [1] Based on Broadcom Vulcan: https://www.nextplatform.com/2017/11/27/cavium-truly-contender-one-two-arm-server-punch/

  • Slide 26

    Cavium ThunderX2 (28 core)
    – Floating point capacity [1]
      – 2 x 128-bit VPU/core
      – Fused Multiply Add (FMA): 2 FLOP/VPU/cycle
      – Double precision
        – 2 FLOP/VPU/cycle x 1 VPU/core x 4 reals = 8 FLOP/cycle/core
        – 8 FLOP/cycle/core x 2.2 GHz = 17.6 GF/s/core
        – 17.6 GF/s/core x 28 cores = 493 GF/s
        – Single thread measurement: ? GF/s/core
      – Single precision
        – 2 FLOP/VPU/cycle x 1 VPU/core x 8 reals = 16 FLOP/cycle/core
        – 16 FLOP/cycle/core x 2.2 GHz = 35.2 GF/s/core
        – 35.2 GF/s/core x 28 cores = 986 GF/s
        – Single thread measurement: ? GF/s/core

    [1] Based on Broadcom Vulcan: https://www.nextplatform.com/2017/11/27/cavium-truly-contender-one-two-arm-server-punch/