Benchmarking Huawei ARM Multi-Core Processors for HPC Workloads

Key Liao

Center for HPC

Shanghai Jiao Tong University

Jan 9th, 2019

About Me

Key Liao (廖秋承)

B.S. in Environmental Science and Engineering, SJTU.

HPC Engineer at the Center for High Performance Computing, SJTU.

Leader of the ARM Research Team at CHPC, SJTU.

Supervisor of the SJTU Student HPC Competition Team.

Main Research Area:

Computer Architecture

Theoretical Computer

Performance Evaluation

Performance Optimization

Email: [email protected]

Outline

➢ Kunpeng 920
  ➢ Floating-point Arithmetic
  ➢ Memory Subsystem
➢ Proxy Applications
  ➢ TeaLeaf
  ➢ SNAP
  ➢ CloverLeaf
➢ Real-world Applications
  ➢ GTC-P

Chips Information

[Figure: processor topology diagram; surviving labels: core 0, core 1, core 3, core 4, grp 0 to grp 5]

Model                                      Intel Xeon Gold 6148   Hi1616           Kunpeng 920 (Engineering Sample)
Arch                                       Skylake-SP             ARM              ARM
Lithography                                14nm                   16nm             7nm
Main Frequency (GHz)                       2.4                    2.4              2.0
Num of Cores                               20                     32               48
Vectorization Ins/Width                    AVX512/512 bits        ASIMD/128 bits   ASIMD/128 bits
Theoretical DP Peak Performance (GFLOPS)*  1536                   307.2            768
L3 Cache                                   1.375 MB               32 MB (shared)   64 MB (shared)
DRAM Support                               6 x DDR4-2666          4 x DDR4-2400    8 x DDR4-3200
TDP (W)                                    150                    70               150
Launch Time                                2017                   2016             2019

* Theoretical DP peak performance is calculated from the frequency we measured while each chip was running its best vectorization instruction set.
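The "DRAM Support" row can be turned into a theoretical peak memory bandwidth per socket. The sketch below assumes each DDR4 channel is 64 bits (8 bytes) wide, which is the standard DDR4 channel width but is not stated on the slide; the function and variable names are illustrative only.

```python
# Theoretical peak DRAM bandwidth per socket, derived from the "DRAM Support" row.
# Assumption (not from the slides): one DDR4 channel moves 8 bytes per transfer.
BYTES_PER_TRANSFER = 8

# (number of channels, mega-transfers per second) as listed in the chips table.
DRAM_SUPPORT = {
    "Xeon Gold 6148": (6, 2666),
    "Hi1616": (4, 2400),
    "Kunpeng 920": (8, 3200),
}


def peak_dram_bandwidth_gbs(channels, mega_transfers_per_s):
    """Return the theoretical peak bandwidth in GB/s (10^9 bytes per second)."""
    return channels * mega_transfers_per_s * BYTES_PER_TRANSFER / 1000.0


BANDWIDTH = {
    name: peak_dram_bandwidth_gbs(channels, rate)
    for name, (channels, rate) in DRAM_SUPPORT.items()
}

for name, gbs in BANDWIDTH.items():
    print(f"{name}: {gbs:.1f} GB/s")
```

These are upper bounds for the supported memory configuration. The tested Kunpeng 920 platform ran its DRAM at 2666 MHz, so its achievable ceiling would be correspondingly lower than the DDR4-3200 figure.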

Platform Information

Platform               6148             1616             920
CPU                    Xeon Gold 6148   Hi1616           Kunpeng 920
Number of Sockets      4                4                8
DRAM Size (GB)         2048             256              256
DRAM Frequency (MHz)   2666             2400             2666
Linux                  CentOS 7.5,      EulerOS,         EulerOS,
                       Kernel 3.10.0    Kernel 4.11.0    Kernel 4.14.0

Software on 6148: all with Intel Parallel Studio XE Cluster Edition 2019 Update 1 (Education License)

Software on 1616 and 920:
Compiler       GNU/GCC-8.2.0
MPI Library    MVAPICH2-2.3
BLAS Library   OpenBLAS 0.3.5

Floating-point Arithmetic

[Figure: HPL Benchmark on Four Platforms (GFLOPS)]

Platform   Single Socket   Dual Socket
2683       360.2           750
6148       955.32          2252.2
1616       220.2           475.2
920        310.5           670.7

[Figure: HPL Efficiency on Four Platforms, Single Socket and Dual Socket, axis from 0.0% to 90.0%; bar values not recoverable]

• Kunpeng 920 scores 41.1% higher than Hi1616 on HPL, compared with a 165.3% increase from Haswell to Skylake in 3 years.

• HPL efficiency on Kunpeng 920 is around 40%, compared with more than 70% on the other chips.
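HPL efficiency is the measured HPL result divided by the theoretical peak. The sketch below applies that ratio to the two Huawei chips, taking the HPL figures as read from the chart (in GFLOPS, with the first four values as single socket and the next four as dual socket) and the peak figures from the chips table. That mapping of chart values to platforms is an interpretation, and the function names are illustrative.

```python
# Theoretical DP peak per socket (GFLOPS), from the chips table.
PEAK_GFLOPS = {"1616": 307.2, "920": 768.0}

# HPL results (GFLOPS) as read from the chart; the single/dual assignment is
# an interpretation of the value order in the figure.
HPL_GFLOPS = {
    "1616": {"single": 220.2, "dual": 475.2},
    "920": {"single": 310.5, "dual": 670.7},
}


def hpl_efficiency(measured_gflops, peak_per_socket, sockets=1):
    """Fraction of the theoretical peak that HPL actually reached."""
    return measured_gflops / (peak_per_socket * sockets)


def relative_gain(new, old):
    """Relative improvement of `new` over `old`, e.g. 0.41 means 41% higher."""
    return new / old - 1.0


for chip in ("1616", "920"):
    eff = hpl_efficiency(HPL_GFLOPS[chip]["single"], PEAK_GFLOPS[chip])
    print(f"{chip}: single-socket HPL efficiency {eff:.1%}")

for config in ("single", "dual"):
    gain = relative_gain(HPL_GFLOPS["920"][config], HPL_GFLOPS["1616"][config])
    print(f"920 vs 1616 ({config} socket): {gain:.1%} higher")
```

With these inputs the Kunpeng 920 lands near 40% of peak and the Hi1616 just above 70%, in line with the bullets above.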

Floating-point Arithmetic

FMA Instruction Throughput

              SP Scalar                   DP Scalar                   SP Vector                    DP Vector
Hi1616        2 ins/cycle, 9.596 GFlops   2 ins/cycle, 9.596 GFlops   1 ins/cycle, 19.194 GFlops   1 ins/cycle, 9.596 GFlops
Kunpeng 920   2 ins/cycle, 7.989 GFlops   2 ins/cycle, 7.989 GFlops   2 ins/cycle, 31.954 GFlops   1 ins/cycle, 7.989 GFlops

• Hi1616
  • 128-bit SIMD
  • SP: 614.4 GFlops
  • DP: 307.2 GFlops

• Hi1620 (Kunpeng 920)
  • 128-bit SIMD
  • SP: 1,536 GFlops
  • DP: 384 GFlops

• Throughput of DP SIMD instructions is limited.

• Not a good chip for intense DP computation.

• DP computation is not as important as people used to think.

• Trend toward SVE and VLA.
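The per-core and per-chip figures on this slide follow from the FMA issue rate, the SIMD width, and the clock. The sketch below reproduces them under the assumption that one FMA counts as two floating-point operations per lane and that the nominal clocks from the chips table apply; the dictionary layout and function names are illustrative.

```python
FLOPS_PER_FMA = 2          # one multiply plus one add per lane
SIMD_BITS = 128            # ASIMD vector width on both chips
ELEMENT_BITS = {"SP": 32, "DP": 64}

# Clock, core count and vector FMA instructions per cycle, taken from the
# chips table and the FMA throughput table.
CHIPS = {
    "Hi1616": {"ghz": 2.4, "cores": 32, "vector_ipc": {"SP": 1, "DP": 1}},
    "Kunpeng 920": {"ghz": 2.0, "cores": 48, "vector_ipc": {"SP": 2, "DP": 1}},
}


def per_core_gflops(chip, precision):
    """Vector FMA throughput of one core in GFlops."""
    spec = CHIPS[chip]
    lanes = SIMD_BITS // ELEMENT_BITS[precision]
    return spec["vector_ipc"][precision] * lanes * FLOPS_PER_FMA * spec["ghz"]


def chip_gflops(chip, precision):
    """Vector FMA throughput of the whole chip in GFlops."""
    return per_core_gflops(chip, precision) * CHIPS[chip]["cores"]


for chip in CHIPS:
    for precision in ("SP", "DP"):
        print(f"{chip} {precision}: {chip_gflops(chip, precision):.1f} GFlops")
```

The measured per-core values in the table (for example 31.954 rather than 32) sit slightly below these nominal results. Note that the Kunpeng 920 issues two SP vector FMAs per cycle but only one DP vector FMA, which is why its DP figure is a quarter of its SP figure rather than half.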

Memory Subsystem

[Figure: Normalized Bandwidth of Different Memory Layers (relative scale, 6148 = 1)]

Layer        6148   920
L1 Read      1      1.1
L1 Write     1      1.6
L2 Read      1      1.2
L2 Write     1      1.6
L3 Read      1      1.2
L3 Write     1      2.0
DRAM Read    1      2.1
DRAM Write   1      3.6

Normalized Average Latency (ns)

Chip   L1      L2      L3      DRAM
6148   1x      1x      1x      1x
920    0.33x   0.71x   1.57x   1.25x

Chip Communication - Bandwidth

Platform           2680   6148   1616              920
Technique          QPI    UPI    Hydra Interface   Hydra Interface
Bandwidth (GB/s)   35.2   40.8   10.0              12.7

Proxy Applications

▪ SNAP
  ▪ A proxy application for a modern deterministic discrete ordinates transport code.

▪ TeaLeaf
  ▪ Proxy app for solving the linear heat conduction equation on a spatially decomposed regular grid, utilising a five-point finite difference stencil.

▪ CloverLeaf
  ▪ Solving Euler’s equations of compressible fluid dynamics, under a Lagrangian-Eulerian scheme, on a two-dimensional spatial regular structured grid.
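To make the five-point stencil mentioned for TeaLeaf concrete, here is a minimal sketch of one explicit time step of the 2-D heat equation on a small regular grid. This is a swapped-in simplification: TeaLeaf's own solvers are implicit and iterative, and the grid size, the coefficient `alpha`, and the function name are all illustrative assumptions.

```python
def heat_step(u, alpha):
    """Advance the grid `u` (a list of rows) by one explicit time step.

    Each interior cell is updated from itself and its four direct
    neighbours (the five-point stencil). Boundary cells are kept fixed.
    `alpha` stands for conductivity * dt / dx^2 and should stay at or
    below 0.25 for this explicit scheme to remain stable.
    """
    ny, nx = len(u), len(u[0])
    new = [row[:] for row in u]  # copy so the update reads only old values
    for j in range(1, ny - 1):
        for i in range(1, nx - 1):
            neighbours = u[j - 1][i] + u[j + 1][i] + u[j][i - 1] + u[j][i + 1]
            new[j][i] = u[j][i] + alpha * (neighbours - 4.0 * u[j][i])
    return new


# A 5 x 5 grid with a single hot cell in the centre.
grid = [[0.0] * 5 for _ in range(5)]
grid[2][2] = 100.0
after = heat_step(grid, alpha=0.1)

for row in after:
    print(" ".join(f"{value:6.1f}" for value in row))
```

Every update touches five grid values and performs only a handful of floating-point operations, which is why stencil codes of this kind tend to be limited by memory bandwidth rather than arithmetic throughput.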

Proxy Applications - Results

[Figure: SNAP Grind Time (ns)]

Configuration             6148     920
Single                    0.77     0.766
2-Socket Strong Scaling   0.916    0.56
2-Socket Weak Scaling     0.91     0.462

[Figure: TeaLeaf Wall Clock (s)]

Configuration             6148     920
Single                    601.18   364.42
2-Socket Strong Scaling   339.5    193.05
2-Socket Weak Scaling     989.88   607.13

[Figure: CloverLeaf-bm16 Wall Clock (s)]

Configuration             6148      920
Single                    1342.61   1041.56
2-Socket Strong Scaling   755.5     571.67

[Figure: CloverLeaf-bm128_short Wall Clock (s)]

Configuration             6148     920
Single                    182.58   208.78
2-Socket Strong Scaling   120.65   109.8

[Figure: Normalized Performance of Proxy Applications on Single Socket and on Dual Socket, 920 relative to 6148]

Application              Single Socket   Dual Socket
SNAP                     1.005           1.636
TeaLeaf                  1.65            1.759
CloverLeaf_bm16          1.289           1.322
CloverLeaf_bm128_short   0.875           1.099
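The normalized performance figures can be recomputed from the raw grind times and wall-clock times on the results slides: since lower time is better, the Kunpeng 920 score is the 6148 time divided by the 920 time. The sketch below assumes the chart values map to platforms and configurations in the order they appear in the figures; the data layout is illustrative.

```python
# (time on 6148, time on 920) for single socket and 2-socket strong scaling.
# SNAP is grind time in ns; the others are wall-clock time in seconds.
RESULTS = {
    "SNAP": {"single": (0.77, 0.766), "dual": (0.916, 0.56)},
    "TeaLeaf": {"single": (601.18, 364.42), "dual": (339.5, 193.05)},
    "CloverLeaf_bm16": {"single": (1342.61, 1041.56), "dual": (755.5, 571.67)},
    "CloverLeaf_bm128_short": {"single": (182.58, 208.78), "dual": (120.65, 109.8)},
}


def normalized_performance(time_6148, time_920):
    """Performance of the 920 relative to the 6148 (above 1 means 920 is faster)."""
    return time_6148 / time_920


NORMALIZED = {
    app: {config: normalized_performance(*times) for config, times in configs.items()}
    for app, configs in RESULTS.items()
}

for app, scores in NORMALIZED.items():
    print(f"{app}: single {scores['single']:.3f}, dual {scores['dual']:.3f}")
```

The only case where the 920 falls behind is CloverLeaf_bm128_short on a single socket.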

Proxy Applications - SNAP

• Generally loads a relatively large data set and performs random accesses within it (dim3_sweep.f90).

• If OpenMP is enabled, threads are spread across the data set.

• MPI_Recv becomes a hotspot after scaling across sockets.

[Figure: SNAP Grind Time (ns), 6148 vs 920, for Single, 2-Socket Strong Scaling and 2-Socket Weak Scaling (9600 cells, nang=64, ng=332, nstep=100)]

Same single-node performance.

26.9% speedup (920, single socket to 2-socket strong scaling).

Proxy Applications - TeaLeaf

• Bound by memory subsystem bandwidth.

• Problem size: 3840 x 3840, 10000 steps.

[Figure: TeaLeaf Relative Speedup (relative to single socket)]

Configuration             6148     920
Single                    1.0x     1.0x
2-Socket Strong Scaling   1.77x    1.89x
2-Socket Weak Scaling     0.607x   0.600x
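The relative speedups for TeaLeaf are the single-socket wall-clock time divided by the two-socket time. The sketch below recomputes them from the TeaLeaf timings on the results slide and also derives a strong-scaling parallel efficiency, which the slides do not report. The assignment of timings to platforms and configurations is read from the chart order, and the function names are illustrative.

```python
# TeaLeaf wall-clock times in seconds, as read from the results chart.
TEALEAF_SECONDS = {
    "6148": {"single": 601.18, "strong": 339.5, "weak": 989.88},
    "920": {"single": 364.42, "strong": 193.05, "weak": 607.13},
}


def relative_speedup(time_single, time_multi):
    """Single-socket time divided by multi-socket time."""
    return time_single / time_multi


def strong_scaling_efficiency(time_single, time_multi, sockets):
    """Speedup divided by the number of sockets; 1.0 would be ideal scaling."""
    return relative_speedup(time_single, time_multi) / sockets


for chip, times in TEALEAF_SECONDS.items():
    strong = relative_speedup(times["single"], times["strong"])
    weak = relative_speedup(times["single"], times["weak"])
    eff = strong_scaling_efficiency(times["single"], times["strong"], sockets=2)
    print(f"{chip}: strong {strong:.2f}x, weak {weak:.3f}x, "
          f"strong-scaling efficiency {eff:.1%}")
```

In the weak-scaling runs the problem grows with the socket count, so a ratio of 1.0x would mean perfect scaling there and the values around 0.6x show the two-socket run taking noticeably longer.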

Proxy Applications - CloverLeaf

• Bound by memory subsystem bandwidth.

• However, double-precision arithmetic intensity increases as the number of cells increases and the total number of iterations decreases.

[Figure: CloverLeaf-bm16 and CloverLeaf-bm128_short Wall Clock (s), 6148 vs 920, Single and 2-Socket Strong Scaling]

2-socket strong-scaling speedup:

Benchmark                6148    920
CloverLeaf-bm16          1.77x   1.82x
CloverLeaf-bm128_short   1.51x   1.90x

GTC-P

▪ GTC-P: Gyrokinetic Toroidal Code - Princeton

▪ GTC-P is a particle-in-cell code that delivers fusion simulations at extreme scales on supercomputers worldwide, including Tianhe-2, Titan, and TaihuLight, which feature CPUs, GPUs, and many-core processors.

Supported by NSF SAVI Project

GTC-P

[Figure: GTC-P Performance With Different Combinations of Processes and Threads on Kunpeng 920]
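The slide does not list which process and thread combinations were run. One common way to build such a sweep for a hybrid process/thread code is to take every factor pair of the core count, so that processes times threads fills the socket exactly. The sketch below assumes the 48 cores of one Kunpeng 920 socket from the chips table; the function name is hypothetical.

```python
def process_thread_combinations(total_cores):
    """Return all (processes, threads_per_process) pairs that use every core."""
    return [
        (processes, total_cores // processes)
        for processes in range(1, total_cores + 1)
        if total_cores % processes == 0
    ]


# Assumption: one Kunpeng 920 socket with 48 cores, as in the chips table.
COMBINATIONS = process_thread_combinations(48)

for processes, threads in COMBINATIONS:
    print(f"{processes:2d} processes x {threads:2d} threads")
```

Each pair would then be timed separately, with the best one depending on how well the thread count matches the chip's core grouping.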

Conclusion

▪ Kunpeng 920 handles scientific computations with relatively low arithmetic intensity (< 4 DP F/B) better than Intel's recent chip at a similar price.

▪ Pro
  ▪ Good topology design for threading.
  ▪ High bandwidth and low latency; does well in many memory-bound apps.

▪ Con
  ▪ Low bandwidth of the Hydra Interface.
  ▪ Low DP arithmetic capability.
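One common way to reason about an arithmetic-intensity threshold like the one in the conclusion is the roofline model, where attainable performance is the smaller of the compute peak and intensity times memory bandwidth. The slides do not say that this model was used, so the sketch below is only an illustration. It takes the theoretical DP peaks from the chips table and derives bandwidth from the DRAM support row, assuming 8 bytes per transfer per channel.

```python
# Theoretical per-socket figures. Peak is from the chips table; bandwidth is
# channels x MT/s x 8 bytes (assumed channel width) converted to GB/s.
CHIPS = {
    "6148": {"peak_gflops": 1536.0, "bandwidth_gbs": 6 * 2666 * 8 / 1000.0},
    "920": {"peak_gflops": 768.0, "bandwidth_gbs": 8 * 3200 * 8 / 1000.0},
}


def attainable_gflops(intensity, peak_gflops, bandwidth_gbs):
    """Roofline bound for a kernel with `intensity` flops per byte of DRAM traffic."""
    return min(peak_gflops, intensity * bandwidth_gbs)


def ridge_point(peak_gflops, bandwidth_gbs):
    """Intensity (flops/byte) above which a kernel stops being memory-bound."""
    return peak_gflops / bandwidth_gbs


for name, spec in CHIPS.items():
    ridge = ridge_point(spec["peak_gflops"], spec["bandwidth_gbs"])
    print(f"{name}: ridge point {ridge:.2f} F/B")

for intensity in (0.5, 1.0, 4.0, 10.0):
    bounds = {
        name: attainable_gflops(intensity, spec["peak_gflops"], spec["bandwidth_gbs"])
        for name, spec in CHIPS.items()
    }
    print(f"{intensity:4.1f} F/B -> 6148: {bounds['6148']:.1f}, 920: {bounds['920']:.1f} GFLOPS")
```

Under these theoretical inputs the Kunpeng 920 has the higher bound at low intensity and the 6148 takes over at high intensity. The exact crossover moves with the inputs: measured bandwidth, the DDR4-2666 memory of the tested platform, or the lower DP FMA throughput reported earlier would all shift it.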
