
HPC Architectures evolution: the case of Marconi, the new CINECA flagship system

Piero Lanucara

BG/Q (Fermi) as a Tier0 Resource

• Many advantages as a supercomputing resource:
– Low energy consumption.
– Limited floor space requirements.
– Fast internal network.
– Homogeneous architecture → simple usage model.

• But:
– Low single-core performance and the I/O structure meant that very high parallelism was necessary (at least 1024 cores).
– For some applications the low memory per core (1 GB) and the I/O performance were also a problem, as were the limited capabilities of the O.S. on compute cores (e.g. no interactive access).
– Cross compilation, needed because login nodes differ from compute nodes, can complicate some build procedures.

FERMI is scheduled to be decommissioned between mid and end 2016.

Replacing Fermi at Cineca - considerations

• A new procurement is a complicated process and considers many factors, but must include (together with the price):
– Minimum peak compute power
– Power consumption
– Floor space required
– Availability
– Disk space, internal network, etc.

• IBM no longer offers the BlueGene range for supercomputers, so it cannot be a solution.

• Many computer centres are instead adopting a heterogeneous model for computer clusters.

Replacing Fermi – the Marconi solution

[Figure: CINECA system roadmap, 2016-2020]

Fermi: 2 PFlops, 0.8 MW
Galileo & PICO: 1.2 PFlops, 0.4 MW
Marconi A1: 2.1 PFlops, 0.7 MW
Marconi A2: 11 PFlops, 1.3 MW
Marconi A3: 7 PFlops, 1.0 MW
System A4 (?): 50 PFlops, 3.2 MW

Totals per year (2016, 2017, 2018, 2019, 2020):
Power: 1.2 MW, 2.4 MW, 2.3 MW, 2.3 MW, 3.2 MW
Racks: 50, 120, 120, 120, 150
Floor space: 100 m², 240 m², 240 m², 240 m², 300 m²

Marconi high-level system characteristics

Partition                          Installation     CPU          # nodes   # racks   Power
A1 – Broadwell (2.1 PFlops)        April 2016       E5-2697 v4   1512      25        700 kW
A2 – Knights Landing (11 PFlops)   September 2016   KNL          3600      50        1300 kW
A3 – Skylake (7 PFlops expected)   June 2017        E5-2680 v5   >1500     >25       1000 kW

Tender proposal

Network: Intel OmniPath

A1 BRD: 2x18 cores, 2.3 GHz; 1500 nodes, 2 PFs
A2 KNL: 68 cores, 1.4 GHz; 3600 nodes, 11 PFs
A3 SKL: 2x20 cores, 2.3 GHz; >1500 nodes, 7 PFs

1 PFs – conventional
1 PFs – non-conventional (KNL)

Storage (GSS*): 5 PB scratch area (50 GB/s); by Jan 2017: 10 PB long-term storage (40 GB/s)

20 PB tape library

Marconi - Compute

A1 (half reserved to EUROfusion)
1512 Lenovo NeXtScale servers -> 2 PFlops
Intel E5-2697 v4 Broadwell, 18 cores @ 2.3 GHz
Dual-socket node: 36 cores and 128 GByte per node

A2 (1 PFlops to EUROfusion)
3600 Intel Adams Pass servers -> 11 PFlops
Intel Xeon Phi, code name Knights Landing (KNL), 68 cores @ 1.4 GHz
Single-socket node: 96 GByte DDR4 + 16 GByte MCDRAM

A3 (largely reserved to EUROfusion)
>1500 Lenovo Stark servers -> 7 PFlops
Intel E5-2680 v5 Skylake, 20 cores @ 2.3 GHz
Dual-socket node: 40 cores and 196 GByte per node

Marconi - Network

Network type: new Intel OmniPath
Largest OmniPath cluster in the world

Network topology: fat-tree with 2:1 oversubscription, tapering at the level of the core switches only

Core switches: 5 x OPA core switch “Sawtooth Forest”, 768 ports each

Edge switches: 216 x OPA edge switch “Eldorado Forest”, 48 ports each

Maximum system configuration: 5 (OPA core switches) x 768 (ports) x 2 (tapering) -> 7680 servers

[Figure: fat-tree layout. 6624 compute nodes; 216 x 48-port edge switches; 5 x 768-port core switches; 32 downlinks per edge switch; each group of 32 nodes forms a fully interconnected island.]
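The maximum-configuration figure on this slide can be reproduced by counting ports. The following is a minimal sketch in C, assuming that each 48-port edge switch uses 32 ports as downlinks to nodes (as in the diagram) and the remaining 16 as uplinks to the core switches, which is what would give the 2:1 tapering. The function names are illustrative only.

```c
#include <assert.h>

/* Uplinks left on an edge switch once the node-facing ports are taken. */
static int edge_uplinks(int edge_ports, int downlinks)
{
    assert(edge_ports > downlinks && downlinks > 0);
    return edge_ports - downlinks;
}

/* Largest number of servers the core layer can serve: every core port
 * terminates one edge uplink, and every edge switch hosts `downlinks` nodes. */
static int max_servers(int core_switches, int core_ports,
                       int edge_ports, int downlinks)
{
    int uplinks = edge_uplinks(edge_ports, downlinks);
    int max_edge_switches = (core_switches * core_ports) / uplinks;
    return max_edge_switches * downlinks;
}
```

With the Marconi values, `max_servers(5, 768, 48, 32)` gives 7680, the same as 5 x 768 x 2. Under the same assumption, the 216 installed edge switches offer 216 x 32 = 6912 node ports, enough for the 6624 compute nodes.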

A1 HPL

Full system Linpack:
• 1 MPI task per node
• Performance range: 1.6 – 1.7 PFs.
• Max perf: 1.72389 PFs with Turbo-OFF.
• Turbo-ON -> throttling

June 2016: Number 46 (Top500 list)

A2 HPL

Full system Linpack: 3556 nodes
• 1 MPI task per node
• Max perf with HyperThreading-OFF.

November 2016: Number 12 (Top500 list)

[0] ================================================================================

[0] T/V N NB P Q Time Gflops

[0] --------------------------------------------------------------------------------

[0] W[0] R00L2L4 6287568 336 28 127 26628.96 6.22304e+06

[0] HPL_pdgesv() start time Fri Nov 4 23:10:08 2016

[0]

[0] HPL_pdgesv() end time Sat Nov 5 06:33:57 2016

[0]

[0] HPL Efficiency by CPU Cycle 2505.196%

[0] HPL Efficiency by BUS Cycle 2439.395%

[0] --------------------------------------------------------------------------------

[0] ||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 0.0006293 ...... PASSED

[0] ================================================================================

[0]

[0] Finished 1 tests with the following results:

[0] 1 tests completed and passed residual checks,

[0] 0 tests completed and failed residual checks,

[0] 0 tests skipped because of illegal input values.

[0] --------------------------------------------------------------------------------

[0]

[0] End of Tests.

[0] ================================================================================
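The figures in the HPL output above can be cross-checked by hand. The sketch below, in C, uses the operation count that HPL credits for an N x N solve (2/3 N^3 + 3/2 N^2) and divides it by the reported run time. The function names are illustrative and are not part of HPL.

```c
#include <assert.h>

/* Floating-point operations credited by HPL for an N x N solve. */
static double hpl_flops(double n)
{
    assert(n > 0.0);
    return (2.0 / 3.0) * n * n * n + 1.5 * n * n;
}

/* Sustained rate in Gflops for a run that took `seconds`. */
static double hpl_gflops(double n, double seconds)
{
    assert(seconds > 0.0);
    return hpl_flops(n) / seconds / 1.0e9;
}
```

With N = 6287568 and Time = 26628.96 s, `hpl_gflops` returns about 6.223e+06 Gflops, in line with the reported 6.22304e+06, i.e. roughly 6.2 PFlops. The process grid P x Q = 28 x 127 = 3556 matches one MPI task on each of the 3556 nodes, and the elapsed time agrees with the start and end timestamps (23:10:08 to 06:33:57). Compared with the nominal 11 PFlops quoted for the full 3600-node A2 partition, this is roughly 57%.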

Intel Xeon PHI KNC (Galileo@CINECA)

• An accelerator (like GPUs) but more similar to a conventional multicore CPU.
• Current version, Knights Corner (KNC), has 57-61 cores at 1.0-1.2 GHz, 8-16 GB RAM and a 512-bit vector unit.
• Cores are connected in a ring topology; MPI is possible.
• No need to write CUDA or OpenCL, as Intel compilers will compile Fortran or C code for the MIC.
• 1-2 Tflops, according to model.

• A2: Knights Landing (KNL)
– A big unknown because very few people currently have access to KNL.
– But we know the architecture of KNL and its differences and similarities with respect to KNC.
– The main differences are:
  • KNL will be a standalone processor, not an accelerator (unlike KNC).
  • KNL has more powerful cores and a faster internal network.
  • On-package high-performance memory (16 GB, MCDRAM).

Intel Xeon PHI KNC-KNL comparison

                       KNC (Galileo)            KNL (Marconi)
#cores                 61 (Pentium)             68 (Atom)
Core frequency         1.238 GHz                1.4 GHz
Memory                 16 GB GDDR5              96 GB DDR4 + 16 GB MCDRAM
Internal network       Bi-directional ring      Mesh
Vectorisation          512 bit / core           2 x AVX-512 / core
Usage                  Co-processor             Standalone
Performance (Gflops)   1208 (dp) / 2416 (sp)    ~3000 (dp)
Power                  ~300 W                   ~200 W

A KNC core can be 10x slower than a Haswell core. A KNL core is expected to be 2-3X slower. Big differences also in memory bandwidth.

Coming next: A3

• A3: Intel Skylake processors (mid-2017)
– Successor to Broadwell (itself the successor to Haswell); the architecture was first launched in 2015.
– Expect an increase in performance and power efficiency.

Marconi A1 and A2 exploitation at its best


Exploiting the parallel universe
Three levels of parallelism supported by Intel hardware

Thread Level Parallelism
• Multi thread/task performance
• Exposed by programming models
• Execute tens/hundreds/thousands of tasks concurrently

Vector Level Parallelism
• Single thread performance
• Exposed by tools and programming models
• Operate on 4/8/16 elements at a time

Instruction Level Parallelism
• Single thread performance
• Automatically exposed by HW/tools
• Effectively limited to a few instructions
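The first two levels are the ones a programmer expresses explicitly. As a minimal sketch (not taken from the slides), the C loop below uses one OpenMP directive to request both: `parallel for` distributes iterations across threads, and `simd` asks the compiler to vectorise each thread's share. The function name `saxpy` is just a conventional label for this kind of loop.

```c
#include <assert.h>
#include <stddef.h>

/* y[i] = alpha * x[i] + y[i].
 * "parallel for" -> thread-level parallelism (iterations split over threads)
 * "simd"         -> vector-level parallelism (several elements per instruction)
 * If OpenMP is not enabled at compile time the pragma is ignored and the
 * loop runs serially, giving the same result. */
static void saxpy(size_t n, float alpha,
                  const float *restrict x, float *restrict y)
{
    assert(x != NULL && y != NULL);
    #pragma omp parallel for simd
    for (size_t i = 0; i < n; i++)
        y[i] = alpha * x[i] + y[i];
}
```

Instruction-level parallelism needs no annotation here: the hardware and compiler schedule the independent multiply and add operations on their own.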

A1 exploitation

• A1: Broadwell nodes
– Similar to the Haswell cores present on Galileo.
– Expect only a small difference in single-core performance with respect to Galileo, but a big difference compared to Fermi.
– More cores per node (36) should mean better OpenMP performance, and MPI performance will also improve (faster network).
– Life is much easier for SPMD programming models.

cores/node: 36
Memory/node: 128 GB

+ Use SIMD vectorization

Single Instruction Multiple Data (SIMD) vectorization

• Technique for exploiting vector-level parallelism (VLP) on a single thread
• Operate on more than one element at a time
• Might decrease instruction counts significantly
• Elements are stored in SIMD registers or vectors
• Code needs to be vectorized
• Vectorization usually applies to inner loops
• Main and remainder loops are generated

Scalar loop:

for (int i = 0; i < N; i++)
    c[i] = a[i] + b[i];

SIMD loop (4 elements per iteration; a[i:4] denotes the 4 elements starting at index i):

for (int i = 0; i < N; i += 4)
    c[i:4] = a[i:4] + b[i:4];
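To make the "main and remainder loops" point concrete, here is a hand-written C sketch of the loop structure a vectorising compiler typically produces for the scalar loop on this slide. It is illustrative only: the width `VLEN` of 4 matches the picture, and the function name `vec_add` is an assumption.

```c
#include <assert.h>
#include <stddef.h>

#define VLEN 4  /* elements per SIMD step, matching the 4-element picture */

static void vec_add(size_t n, const float *a, const float *b, float *c)
{
    assert(a != NULL && b != NULL && c != NULL);
    size_t i = 0;
    size_t main_end = n - (n % VLEN);

    /* Main loop: whole blocks of VLEN elements. A vectorising compiler can
     * replace the VLEN additions of one block with a single SIMD add. */
    for (; i < main_end; i += VLEN)
        for (size_t k = 0; k < VLEN; k++)
            c[i + k] = a[i + k] + b[i + k];

    /* Remainder loop: the last n % VLEN elements, one at a time. */
    for (; i < n; i++)
        c[i] = a[i] + b[i];
}
```

For n = 10 the main loop covers elements 0 to 7 in two blocks and the remainder loop handles elements 8 and 9.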

A2 exploitation

• A2: KNL
– (More) similar to the KNC coprocessor present on Galileo, but with remarkable differences:
  High-bandwidth MCDRAM available
  AVX-512 ISA
  Binary compatible with ‘standard’ Xeon
– Mixing MPI and OpenMP is still the best choice (performance is less dependent on OpenMP).

cores/node: 68
Memory/node: 96 GByte DDR4 + 16 GByte MCDRAM

++ Use SIMD vectorization

A2: MCDRAM

• Memory bandwidth is one of the common performance bottlenecks in HPC.

• To meet the demand for memory bandwidth, KNL has an on-package high-bandwidth memory (HBM) based on multi-channel dynamic random access memory (MCDRAM).

• This memory is capable of delivering up to 5x the bandwidth (≥ 400 GB/s) of DDR4 memory on the same platform (≥ 90 GB/s).

A2: MCDRAM

• HBM on KNL can be used as

1. a last-level cache (LLC)

2. an addressable memory.

• The configuration is determined at boot time, by choosing among three MCDRAM modes in the BIOS settings:

1. Flat mode

2. Cache mode

3. Hybrid mode

A2: MCDRAM

• The best mode to use will depend on the application.

A2: Using HBM as addressable memory

Two methods for this:

• the numactl tool
– Works best if the whole app can fit in MCDRAM

• the memkind library
– Using library calls or compiler directives
– Needs source modification

A2: Using numactl to access MCDRAM

• Run "numactl --hardware" to see the NUMA configuration of your system
• Look for the node that has memory only (no CPU cores): in flat mode this is the MCDRAM.

If the total memory footprint of your app is smaller than the size of MCDRAM:

• Check the footprint (see the RSS value):

ps -C myapp u

• Use numactl to allocate all of its memory from MCDRAM:

numactl --membind=mcdram_id myapp

where mcdram_id is the ID of the MCDRAM "node".

If the total memory footprint of your app is larger than the size of MCDRAM:

• You can still use numactl to allocate part of your app in MCDRAM:

numactl --preferred=mcdram_id myapp

Allocations that don't fit into MCDRAM spill over to DDR.

numactl --interleave=nodes myapp

Allocations are interleaved across all the listed nodes.

A2: Using Memkind to access MCDRAM

• The memkind library is a user-extensible heap manager built on top of jemalloc, a C library providing general-purpose memory allocation functions.

• The library is generalizable to any NUMA architecture, but for Knights Landing processors it is used primarily for manual allocation to HBM, using special allocators for C/C++.

• It has limited support for Fortran.

A2: Using Memkind - C case

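• Allocate 1000 floats from DDR

#include <stdlib.h>
float *fv;
fv = (float *)malloc(sizeof(float) * 1000);
free(fv);

• Allocate 1000 floats from MCDRAM (link with -lmemkind)

#include <hbwmalloc.h>
float *fv;
fv = (float *)hbw_malloc(sizeof(float) * 1000);
hbw_free(fv);  /* memory from hbw_malloc must be released with hbw_free */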
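Code that must also run on nodes without MCDRAM (or without memkind installed) can wrap the two allocators behind one interface. The following C sketch is one possible way to do this, not a recommended or official pattern: `USE_HBW` is a hypothetical build flag and `fast_buf` a hypothetical helper type, while `hbw_check_available`, `hbw_malloc` and `hbw_free` are the hbwmalloc calls provided by memkind.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#ifdef USE_HBW               /* hypothetical build flag, not part of memkind */
#include <hbwmalloc.h>
#endif

typedef struct {
    float *data;
    int    in_hbw;   /* 1 if data came from hbw_malloc, 0 if from malloc */
} fast_buf;

static fast_buf fast_buf_alloc(size_t n)
{
    fast_buf b = { NULL, 0 };
#ifdef USE_HBW
    if (hbw_check_available() == 0) {    /* 0 means HBW memory is usable */
        b.data = hbw_malloc(n * sizeof(float));
        b.in_hbw = (b.data != NULL);
    }
#endif
    if (b.data == NULL)                  /* fall back to ordinary DDR */
        b.data = malloc(n * sizeof(float));
    return b;
}

static void fast_buf_free(fast_buf *b)
{
    assert(b != NULL);
#ifdef USE_HBW
    if (b->in_hbw) {                     /* must match the allocator used */
        hbw_free(b->data);
        b->data = NULL;
        b->in_hbw = 0;
        return;
    }
#endif
    free(b->data);
    b->data = NULL;
}
```

Built without `-DUSE_HBW` the wrapper uses plain `malloc`; built with `-DUSE_HBW` and linked with `-lmemkind` it tries MCDRAM first. Recording which allocator was used matters because a pointer from `hbw_malloc` has to be released with `hbw_free`.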

A2: Using Memkind - Fortran case

! Declare arrays to be dynamic
REAL, ALLOCATABLE :: A(:), B(:), C(:)
! Intel Fortran directive: place A in fast memory
!DEC$ ATTRIBUTES FASTMEM :: A
NSIZE = 1024
! Allocate array 'A' from MCDRAM
ALLOCATE (A(1:NSIZE))
! Allocate arrays that will come from DDR
ALLOCATE (B(NSIZE), C(NSIZE))


Porting Applications onto Marconi A1 and A2

• Application developers must exploit vectorisation (SIMD), MPI processes and/or OpenMP threads, and possibly the fast memory of KNL (MCDRAM).

• From the user perspective, the success of KNL will depend on how well software tools and applications are able to exploit its architecture. Key idea: use a proxy (KNC, …):

• SPECFEM3D_GLOBE. Already reasonable results with KNC in the framework of the IPCC@CINECA activity. A good amount of vectorisation (enabled through the FORCE_VECTORIZATION preprocessing flag and SIMD optimization), suitable for KNC and the future KNL. Scaling to a high number of OpenMP threads (more than 60 on KNC).

It is worth noting that up to now KNC has not been widely supported by developers and users of geophysical application software. A remarkable exception is SPECFEM3D_GLOBE, in whose CIG repository the “native” version is maintained and tested. Again, this should help its start-up on KNL.

Good practice: vectorization (SPECFEM3D_GLOBE KNC investigation)

SPECFEM3D_GLOBE Regional_MiddleEast test case: forward simulation

Computer system     e.t. (sec.)   Speedup wrt Haswell
Haswell (Galileo)   570.20        1.00
KNC (Galileo)       430.35        1.32

Based on a 4-node Galileo partition (16 MPI processes, 4 and 60 OpenMP threads on Haswell and KNC respectively).

SPECFEM3D_GLOBE Regional_MiddleEast test case: no vectorisation comparison

Computer system     e.t. (sec.)   Slowdown factor wrt vectorised
Haswell (Galileo)   687.14        1.20
KNC (Galileo)       848.12        1.97   <- 2x slowdown factor

The impact of vectorisation on Haswell and KNC respectively.
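The factor columns in the two tables are plain ratios of elapsed times. A small C sketch of that arithmetic follows; the helper name `time_ratio` is illustrative.

```c
#include <assert.h>

/* How many times longer `t` is than the reference time `t_ref`. */
static double time_ratio(double t, double t_ref)
{
    assert(t > 0.0 && t_ref > 0.0);
    return t / t_ref;
}
```

The speedup of KNC with respect to Haswell is `time_ratio(570.20, 430.35)`, about 1.32. The slowdown without vectorisation is `time_ratio(687.14, 570.20)`, about 1.2, on Haswell and `time_ratio(848.12, 430.35)`, about 1.97, on KNC, which is why vectorisation matters roughly twice as much on KNC.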

Good practice: choosing MCDRAM memory modes

• MCDRAM cache mode should be a good choice for most applications, but…

• …applications with ‘cache unfriendly’ data are candidates for using other memory modes. In this case, try to identify what to put in MCDRAM:
– timings with selected data items allocated in fast/slow memory (intrusive)
– memory profiling (VTune Amplifier tool)

Conclusions

• Marconi A1: moderate improvements over the years… but a big improvement compared to Fermi.

• High expectations for the performance of Marconi A2 (KNL).

• KNC paves the way for increasing performance…

• …try to manage domain parallelism, increase threading, exploit data parallelism (vectorisation) and improve data locality (new opportunity: use on-package memory).
