Benchmarking Huawei ARM Multi-Core Processors for HPC Workloads
Key Liao
Center for HPC
Shanghai Jiao Tong University
Jan 9th, 2019
About Me
Key Liao (廖秋承)
B.S. in Environmental Science and Engineering, SJTU.
HPC Engineer at the Center for High Performance Computing, SJTU.
Leader of the ARM Research Team at CHPC, SJTU.
Supervisor of the SJTU Student HPC Competition Team.
Main Research Area:
Computer Architecture
Theoretical Computer
Performance Evaluation
Performance Optimization
Email: [email protected]
Outline
➢ Kunpeng 920
  ➢ Floating-point Arithmetic
  ➢ Memory Subsystem
➢ Proxy Applications
  ➢ TeaLeaf
  ➢ SNAP
  ➢ CloverLeaf
➢ Real-world Applications
  ➢ GTC-P
Chips Information
Model | Intel Xeon Gold 6148 | Hi1616 | Kunpeng 920 (Engineering Sample)
Arch | Skylake-SP | ARM | ARM
Lithography | 14nm | 16nm | 7nm
Main Frequency (GHz) | 2.4 | 2.4 | 2.0
Num of Cores | 20 | 32 | 48
Vectorization Ins/Width | AVX512/512bits | ASIMD/128bits | ASIMD/128bits
Theoretical DP Peak Performance (GFLOPS)* | 1536 | 307.2 | 768
L3 Cache | 1.375 MB | 32MB (shared) | 64MB (shared)
DRAM Support | 6 x DDR4-2666 | 4 x DDR4-2400 | 8 x DDR4-3200
TDP (W) | 150 | 70 | 150
Launch Time | 2017 | 2016 | 2019
* Theoretical DP peak performance is calculated from the frequency we measured while each chip was running its best vectorization instruction set.
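The peak figures in the table can be reproduced as cores x frequency x DP FLOPs per cycle per core. The following is a minimal sketch of that arithmetic; the per-cycle values (32, 4, and 8) are assumptions chosen because they match the listed peaks, not numbers stated on the slide.

```python
# Theoretical DP peak = cores * frequency (GHz) * DP FLOPs per cycle per core.
# The dp_flops_per_cycle values are assumptions inferred from the table:
#   32 = 2 AVX-512 FMA units * 8 DP lanes * 2 FLOPs per FMA
#    4 = 1 128-bit FMA per cycle * 2 DP lanes * 2 FLOPs per FMA
#    8 = value that reproduces the 768 GFLOPS listed for Kunpeng 920
CHIPS = {
    "Xeon Gold 6148": {"cores": 20, "ghz": 2.4, "dp_flops_per_cycle": 32},
    "Hi1616": {"cores": 32, "ghz": 2.4, "dp_flops_per_cycle": 4},
    "Kunpeng 920": {"cores": 48, "ghz": 2.0, "dp_flops_per_cycle": 8},
}


def dp_peak_gflops(cores, ghz, dp_flops_per_cycle):
    """Return the theoretical double-precision peak in GFLOPS."""
    return cores * ghz * dp_flops_per_cycle


PEAKS = {name: dp_peak_gflops(**spec) for name, spec in CHIPS.items()}

for name, peak in PEAKS.items():
    print(f"{name}: {peak:.1f} GFLOPS")
```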
Platform Information
Platform | 6148 | 1616 | 920
CPU | Xeon Gold 6148 | Hi1616 | Kunpeng 920
Number of Sockets | 4 | 4 | 8
DRAM Size (GB) | 2048 | 256 | 256
DRAM Frequency (MHz) | 2666 | 2400 | 2666
Linux | CentOS 7.5, Kernel 3.10.0 | EulerOS, Kernel 4.11.0 | EulerOS, Kernel 4.14.0
Compiler | All with Intel Parallel Studio XE Cluster Version 2019 Update 1 (Education License) | GNU/GCC-8.2.0 | GNU/GCC-8.2.0
MPI Library | Intel Parallel Studio XE | MVAPICH2-2.3 | MVAPICH2-2.3
BLAS Library | Intel Parallel Studio XE | OpenBLAS 0.3.5 | OpenBLAS 0.3.5
Floating-point Arithmetic
HPL Benchmark on Four Platforms (GFLOPS)
Platform | 2683 | 6148 | 1616 | 920
Single Socket | 360.2 | 955.32 | 220.2 | 310.5
Dual Socket | 750 | 2252.2 | 475.2 | 670.7
[Figure: HPL Efficiency on Four Platforms; bar chart from 0% to 90%, Single Socket and Dual Socket, for 2683, 6148, 1616, and 920]
• Kunpeng 920 is 41.1% better than Hi1616 on HPL, compared with a 165.3% increase from Haswell to Skylake in 3 years.
• HPL efficiency on Kunpeng 920 is around 40%, compared with more than 70% on other chips.
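The two bullets above can be checked with a little arithmetic. This is a hedged sketch: it assumes the HPL data labels map to the platforms in the order 2683, 6148, 1616, 920 (single socket first, then dual socket), and it takes the theoretical DP peaks from the Chips Information table. All function and variable names are illustrative.

```python
# HPL results in GFLOPS, read from the chart data labels (mapping is assumed).
HPL_SINGLE = {"1616": 220.2, "920": 310.5}
HPL_DUAL = {"1616": 475.2, "920": 670.7}

# Theoretical DP peak per socket in GFLOPS, from the Chips Information table.
DP_PEAK = {"1616": 307.2, "920": 768.0}


def relative_gain(new, old):
    """Fractional improvement of `new` over `old` (0.41 means 41% better)."""
    return new / old - 1.0


def hpl_efficiency(measured, peak):
    """Fraction of the theoretical peak that HPL actually achieves."""
    return measured / peak


gain_single = relative_gain(HPL_SINGLE["920"], HPL_SINGLE["1616"])
gain_dual = relative_gain(HPL_DUAL["920"], HPL_DUAL["1616"])
eff_920 = hpl_efficiency(HPL_SINGLE["920"], DP_PEAK["920"])
eff_1616 = hpl_efficiency(HPL_SINGLE["1616"], DP_PEAK["1616"])

print(f"920 vs 1616 gain: single {gain_single:.1%}, dual {gain_dual:.1%}")
print(f"single-socket HPL efficiency: 920 {eff_920:.1%}, 1616 {eff_1616:.1%}")
```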
Floating-point Arithmetic
FMA Instruction Throughput
Chip | SP Scalar | DP Scalar | SP Vector | DP Vector
Hi1616 | 2 ins/cycle, 9.596 GFLOPS | 2 ins/cycle, 9.596 GFLOPS | 1 ins/cycle, 19.194 GFLOPS | 1 ins/cycle, 9.596 GFLOPS
Kunpeng 920 | 2 ins/cycle, 7.989 GFLOPS | 2 ins/cycle, 7.989 GFLOPS | 2 ins/cycle, 31.954 GFLOPS | 1 ins/cycle, 7.989 GFLOPS
• Hi1616
  • 128-bit SIMD
  • SP: 614.4 GFLOPS
  • DP: 307.2 GFLOPS
• Hi1620
  • 128-bit SIMD
  • SP: 1,536 GFLOPS
  • DP: 384 GFLOPS
• Throughput of DP SIMD instructions is limited.
• Not a good chip for intensive DP computation.
• DP computation is not as important as people used to think.
• The trend is toward SVE (Scalable Vector Extension) and VLA (vector-length agnostic) programming.
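The throughput numbers above follow from instructions per cycle x SIMD lanes x 2 FLOPs per FMA x frequency. The sketch below reproduces them under stated assumptions: the frequencies (2.4 GHz and 2.0 GHz) and core counts (32 and 48) are taken from the Chips Information table, and a 128-bit register is assumed to hold 4 SP or 2 DP lanes. The helper name `fma_gflops` is illustrative.

```python
FLOPS_PER_FMA = 2  # one fused multiply-add counts as two floating-point operations


def fma_gflops(ins_per_cycle, lanes, ghz, cores=1):
    """FMA throughput in GFLOPS for a given issue rate, lane count, and clock."""
    return ins_per_cycle * lanes * FLOPS_PER_FMA * ghz * cores


# Per-core vector throughput (128-bit ASIMD: 4 SP lanes or 2 DP lanes).
hi1616_sp_core = fma_gflops(1, 4, 2.4)    # slide reports 19.194 measured
kunpeng_sp_core = fma_gflops(2, 4, 2.0)   # slide reports 31.954 measured
kunpeng_dp_core = fma_gflops(1, 2, 2.0)   # slide reports 7.989 measured

# Whole-chip figures from the bullet list.
hi1616_dp_chip = fma_gflops(1, 2, 2.4, cores=32)
kunpeng_sp_chip = fma_gflops(2, 4, 2.0, cores=48)
kunpeng_dp_chip = fma_gflops(1, 2, 2.0, cores=48)

print(f"Hi1616 DP chip: {hi1616_dp_chip:.1f} GFLOPS")
print(f"Kunpeng 920 SP chip: {kunpeng_sp_chip:.1f} GFLOPS")
print(f"Kunpeng 920 DP chip: {kunpeng_dp_chip:.1f} GFLOPS")
```

Because only one DP vector FMA issues per cycle on Kunpeng 920, its DP vector rate equals its DP scalar rate, which is the limitation the first bullet points to.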
Memory Subsystem
Normalized Bandwidth of Different Memory Layers (relative scale)
Chip | L1 Read | L1 Write | L2 Read | L2 Write | L3 Read | L3 Write | DRAM Read | DRAM Write
6148 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1
920 | 1.1 | 1.6 | 1.2 | 1.6 | 1.2 | 2.0 | 2.1 | 3.6

Normalized Average Latency (ns)
Chip | L1 | L2 | L3 | DRAM
6148 | 1x | 1x | 1x | 1x
920 | 0.33x | 0.71x | 1.57x | 1.25x
Chip Communication - Bandwidth
Platform | 2680 | 6148 | 1616 | 920
Technique | QPI | UPI | Hydra Interface | Hydra Interface
Bandwidth (GB/s) | 35.2 | 40.8 | 10.0 | 12.7
Proxy Applications
▪ SNAP
  ▪ A proxy application for a modern deterministic discrete ordinates transport code.
▪ TeaLeaf
  ▪ A proxy app that solves the linear heat conduction equation on a spatially decomposed regular grid, using a five-point finite difference stencil.
▪ CloverLeaf
  ▪ Solves Euler's equations of compressible fluid dynamics, under a Lagrangian-Eulerian scheme, on a two-dimensional spatial regular structured grid.
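To illustrate what a five-point finite difference stencil looks like, here is a minimal sketch of one explicit heat-conduction step on a small 2D grid. This is not TeaLeaf's solver; it is a plain explicit update chosen for brevity, and the function name, grid size, and `alpha` value are all hypothetical.

```python
def five_point_step(u, alpha):
    """Apply one explicit heat-conduction step to a 2D grid.

    u     : list of rows (floats); boundary cells are held fixed.
    alpha : k * dt / dx**2; an explicit update needs alpha <= 0.25 for stability.
    """
    ny, nx = len(u), len(u[0])
    new = [row[:] for row in u]  # copy so every update reads the old values
    for j in range(1, ny - 1):
        for i in range(1, nx - 1):
            # Five points: the cell itself plus its west, east, north, south neighbours.
            new[j][i] = u[j][i] + alpha * (
                u[j][i - 1] + u[j][i + 1] + u[j - 1][i] + u[j + 1][i] - 4.0 * u[j][i]
            )
    return new


# A single hot cell in the middle of a cold 5 x 5 grid.
grid = [[0.0] * 5 for _ in range(5)]
grid[2][2] = 1.0
after = five_point_step(grid, 0.1)
print(after[2])
```

Each cell update reads five values and writes one while doing only a handful of floating-point operations, which is why stencil codes of this kind tend to be limited by memory bandwidth rather than arithmetic throughput.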
Proxy Applications - Results
SNAP Grind Time (ns)
Run | 6148 | 920
Single | 0.77 | 0.766
2-Socket Strong Scaling | 0.916 | 0.56
2-Socket Weak Scaling | 0.91 | 0.462

TeaLeaf Wall Clock (s)
Run | 6148 | 920
Single | 601.18 | 364.42
2-Socket Strong Scaling | 339.5 | 193.05
2-Socket Weak Scaling | 989.88 | 607.13

CloverLeaf-bm16 Wall Clock (s)
Run | 6148 | 920
Single | 1342.61 | 1041.56
2-Socket Strong Scaling | 755.5 | 571.67

CloverLeaf-bm128_short Wall Clock (s)
Run | 6148 | 920
Single | 182.58 | 208.78
2-Socket Strong Scaling | 120.65 | 109.8
Proxy Applications - Results
Normalized Performance of Proxy Applications on Single Socket (920 relative to 6148)
Platform | SNAP | TeaLeaf | CloverLeaf_bm16 | CloverLeaf_bm128_short
920 | 1.005 | 1.65 | 1.289 | 0.875

Normalized Performance of Proxy Applications on Dual Socket (920 relative to 6148)
Platform | SNAP | TeaLeaf | CloverLeaf_bm16 | CloverLeaf_bm128_short
920 | 1.636 | 1.759 | 1.322 | 1.099
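The normalized values appear to be the ratio of the 6148 result to the 920 result for each application, since lower grind time or wall clock is better. The sketch below recomputes them under that assumption, also assuming the chart data labels are listed as (6148, 920) pairs and that the dual-socket chart uses the strong-scaling runs. The names are illustrative.

```python
# (6148, 920) results read from the charts; lower is better.
# SNAP is grind time in ns, the others are wall clock in seconds.
SINGLE = {
    "SNAP": (0.77, 0.766),
    "TeaLeaf": (601.18, 364.42),
    "CloverLeaf_bm16": (1342.61, 1041.56),
    "CloverLeaf_bm128_short": (182.58, 208.78),
}
DUAL_STRONG = {
    "SNAP": (0.916, 0.56),
    "TeaLeaf": (339.5, 193.05),
    "CloverLeaf_bm16": (755.5, 571.67),
    "CloverLeaf_bm128_short": (120.65, 109.8),
}


def normalized_performance(results):
    """Performance of the 920 relative to the 6148; above 1 means the 920 is faster."""
    return {app: t_6148 / t_920 for app, (t_6148, t_920) in results.items()}


single_norm = normalized_performance(SINGLE)
dual_norm = normalized_performance(DUAL_STRONG)

for app in SINGLE:
    print(f"{app}: single {single_norm[app]:.3f}, dual {dual_norm[app]:.3f}")
```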
Proxy Applications - SNAP
• Generally, SNAP loads a relatively large data set and performs random accesses within it (dim3_sweep.f90).
• If OpenMP is enabled, the work is threaded across the data set.
• MPI_Recv becomes a hotspot after scaling across sockets.
[Figure: SNAP Grind Time (ns) on 6148 and 920 for Single, 2-Socket Strong Scaling, and 2-Socket Weak Scaling; 9600 cells, nang=64, ng=332, nstep=100]
• Same single-node performance on 6148 and 920.
• 26.9% speedup on 920 with 2-socket strong scaling.
Proxy Applications - TeaLeaf
• Performance depends on memory subsystem bandwidth.
• Problem size: 3840 x 3840, 10000 steps.
TeaLeaf Relative Speedup
Run | 6148 | 920
Single | 1.0x | 1.0x
2-Socket Strong Scaling | 1.77x | 1.89x
2-Socket Weak Scaling | 0.607x | 0.600x
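The relative speedups are consistent with dividing the single-socket wall clock by the 2-socket wall clock. The following sketch recomputes them and adds a parallel-efficiency figure; it assumes the TeaLeaf wall-clock labels from the results chart map to (single, strong, weak) as shown, and the efficiency definitions are the usual textbook ones rather than anything stated on the slide.

```python
# TeaLeaf wall clock in seconds, read from the results chart (mapping is assumed).
TEALEAF = {
    "6148": {"single": 601.18, "strong": 339.5, "weak": 989.88},
    "920": {"single": 364.42, "strong": 193.05, "weak": 607.13},
}


def relative_speedup(t_single, t_multi):
    """Single-socket time divided by multi-socket time."""
    return t_single / t_multi


def strong_scaling_efficiency(speedup, n_sockets=2):
    """Fraction of the ideal n-fold speedup that was achieved."""
    return speedup / n_sockets


report = {}
for chip, t in TEALEAF.items():
    strong = relative_speedup(t["single"], t["strong"])
    weak = relative_speedup(t["single"], t["weak"])  # ideal weak scaling gives 1.0
    report[chip] = {
        "strong": strong,
        "weak": weak,
        "strong_eff": strong_scaling_efficiency(strong),
    }
    print(f"{chip}: strong {strong:.2f}x, weak {weak:.3f}x, "
          f"strong-scaling efficiency {report[chip]['strong_eff']:.1%}")
```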
Proxy Applications - CloverLeaf
• Performance depends on memory subsystem bandwidth.
• But double-precision arithmetic intensity increases as the number of cells increases and the total number of iterations decreases.
[Figures: CloverLeaf-bm16 and CloverLeaf-bm128_short Wall Clock (s) on 6148 and 920, Single vs. 2-Socket Strong Scaling]
2-Socket Strong Scaling speedup:
Benchmark | 6148 | 920
CloverLeaf-bm16 | 1.77x | 1.82x
CloverLeaf-bm128_short | 1.51x | 1.90x
GTC-P
▪ GTC-P: Gyrokinetic Toroidal Code - Princeton
▪ GTC-P is a Particle-in-Cell code that delivers fusion simulations at extreme scales on supercomputers worldwide, including Tianhe-2, Titan, and TaihuLight, which feature CPU, GPU, and many-core processors.
▪ Supported by NSF SAVI Project
Conclusion
▪ Kunpeng 920 handles scientific computations with relatively low arithmetic intensity (< 4 DP FLOP/byte) better than Intel's recent chip at a similar price.
▪ Pro
  ▪ Good topology design for threading.
  ▪ High bandwidth and low latency; does well in many memory-bound apps.
▪ Con
  ▪ Low bandwidth of the Hydra Interface.
  ▪ Low DP arithmetic capability.
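One way to reason about the arithmetic-intensity remark is a roofline-style estimate, where attainable performance is the smaller of the compute peak and arithmetic intensity times memory bandwidth. The sketch below is an illustration under several assumptions, not the method used in the talk: DRAM bandwidth is the theoretical figure from the DRAM Support row (8 bytes per transfer per DDR4 channel), the DP peaks come from the Chips Information table, and the slides do not say how the 4 FLOP/byte threshold was chosen.

```python
def dram_bandwidth_gbs(channels, mega_transfers_per_s, bytes_per_transfer=8):
    """Theoretical DRAM bandwidth in GB/s for DDR4 channels of the given speed."""
    return channels * mega_transfers_per_s * bytes_per_transfer / 1000.0


def machine_balance(peak_gflops, bandwidth_gbs):
    """Arithmetic intensity (FLOP/byte) at which a chip stops being memory bound."""
    return peak_gflops / bandwidth_gbs


def roofline_gflops(intensity, peak_gflops, bandwidth_gbs):
    """Attainable GFLOPS under the simple roofline model."""
    return min(peak_gflops, intensity * bandwidth_gbs)


# Assumed inputs: DP peak (GFLOPS) and DRAM Support from the Chips Information table.
bw_920 = dram_bandwidth_gbs(8, 3200)
bw_6148 = dram_bandwidth_gbs(6, 2666)
balance_920 = machine_balance(768.0, bw_920)
balance_6148 = machine_balance(1536.0, bw_6148)

print(f"920: {bw_920:.1f} GB/s, balance {balance_920:.2f} FLOP/byte")
print(f"6148: {bw_6148:.1f} GB/s, balance {balance_6148:.2f} FLOP/byte")

# For a low-intensity kernel (1 DP FLOP/byte) the chip with more bandwidth wins.
low_920 = roofline_gflops(1.0, 768.0, bw_920)
low_6148 = roofline_gflops(1.0, 1536.0, bw_6148)
print(f"at 1 FLOP/byte: 920 {low_920:.1f} GFLOPS, 6148 {low_6148:.1f} GFLOPS")
```

Measured bandwidth is lower than these theoretical figures on any real system, so the numbers are best read as an upper bound on where each chip moves from memory-bound to compute-bound.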