HPC Architectures evolution: the case of Marconi, the new CINECA flagship system
Piero Lanucara
BG/Q (Fermi) as a Tier0 Resource
• Many advantages as a supercomputing resource:
– Low energy consumption.
– Limited floor space requirements.
– Fast internal network.
– Homogeneous architecture → simple usage model.
• But:
– Low single-core performance and the I/O structure meant very high parallelism was necessary (at least 1024 cores).
– For some applications the low memory per core (1 GB) and I/O performance were also a problem, as were the limited capabilities of the O.S. on compute cores (e.g. no interactive access).
– Cross compilation, needed because login nodes differ from compute nodes, can complicate some build procedures.
FERMI is scheduled to be decommissioned in mid-to-late 2016.
Replacing Fermi at Cineca: considerations
• A new procurement is a complicated process and considers many factors, but must include (together with the price):
– Minimum peak compute power
– Power consumption
– Floor space required
– Availability
– Disk space, internal network, etc.
• IBM no longer offers the BlueGene range for supercomputers, so it cannot be a solution.
• Many computer centres are instead adopting a heterogeneous model for computer clusters.
[Figure: CINECA systems roadmap, 2016-2020]
Systems shown:
• Fermi: 2 PFlops, 0.8 MW
• Galileo & PICO: 1.2 PFlops, 0.4 MW
• Marconi A1: 2.1 PFlops, 0.7 MW
• Marconi A2: 11 PFlops, 1.3 MW
• Marconi A3: 7 PFlops, 1.0 MW
• System A4(?): 50 PFlops, 3.2 MW
Figure labels: Alpha, Commit Wins, Volume Beta
Power: 2.4 MW, 2.3 MW, 2.3 MW, 3.2 MW, 1.2 MW
Racks: 50, 120, 120, 120, 150
Floor space: 100 m², 240 m², 240 m², 240 m², 300 m²
Replacing Fermi – the Marconi solution
Marconi High level system Characteristics
Partition                          Installation     CPU          # nodes   # racks   Power
A1 – Broadwell (2.1 PFlops)        April 2016       E5-2697 v4   1512      25        700 kW
A2 – Knights Landing (11 PFlops)   September 2016   KNL          3600      50        1300 kW
A3 – Skylake (7 PFlops expected)   June 2017        E5-2680 v5   >1500     >25       1000 kW
Tender proposal
Network: Intel OmniPath
• A1 BRD: 2x18 cores, 2.3 GHz; 1500 nodes, 2 PFs
• A2 KNL: 68 cores, 1.4 GHz; 3600 nodes, 11 PFs
• A3 SKL: 2x20 cores, 2.3 GHz; >1500 nodes, 7 PFs
Figure labels: 1 PFs conventional; 1 PFs non-conventional (KNL); 5 PFs conventional
Storage:
• GSS*: 5 PB scratch area (50 GB/s)
• By Jan17: 10 PB long term storage (40 GB/s)
• 20 PB tape library
Marconi - Compute
A1 (half reserved to EUROfusion)
• 1512 Lenovo NeXtScale servers -> 2 PFlops
• Intel E5-2697 v4 Broadwell, 18 cores @ 2.3 GHz
• Dual-socket node: 36 cores and 128 GByte per node
A2 (1 PFlops to EUROfusion)
• 3600 Intel Adams Pass servers -> 11 PFlops
• Intel Xeon PHI, code name Knights Landing (KNL), 68 cores @ 1.4 GHz
• Single-socket node: 96 GByte DDR4 + 16 GByte MCDRAM
A3 (great part reserved to EUROfusion)
• >1500 Lenovo Stark servers -> 7 PFlops
• Intel E5-2680 v5 SkyLake, 20 cores @ 2.3 GHz
• Dual-socket node: 40 cores and 196 GByte per node
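The quoted peak figures for A1 and A2 can be reproduced from the node counts, core counts and clock rates above. The following is only a sketch: the flops-per-cycle values (16 for Broadwell, 32 for KNL with two AVX-512 units per core) are assumptions not stated on the slides, `peak_pflops` is a hypothetical helper name, and the nominal clock is used even though sustained vector clocks are usually lower.

```c
#include <assert.h>

/* Theoretical double-precision peak of a partition, in PFlops:
 * nodes x cores per node x clock (GHz) x flops per cycle per core.
 * The flops-per-cycle argument is an assumption supplied by the caller. */
double peak_pflops(long nodes, int cores_per_node, double ghz,
                   int flops_per_cycle)
{
    assert(nodes > 0 && cores_per_node > 0 && ghz > 0.0 && flops_per_cycle > 0);
    double gflops = (double)nodes * cores_per_node * ghz * flops_per_cycle;
    return gflops / 1.0e6;   /* 1 PFlops = 1e6 GFlops */
}
```

Called as `peak_pflops(1512, 36, 2.3, 16)` for A1 and `peak_pflops(3600, 68, 1.4, 32)` for A2, this gives values close to the 2 PFlops and 11 PFlops quoted above.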
Marconi - Network
Network type: new Intel Omnipath
Largest Omnipath cluster in the world
Network topology: fat-tree with 2:1 oversubscription; tapering at the level of the core switches only
Core switches: 5 x OPA core switch “Sawtooth Forest”, 768 ports each
Edge switches: 216 x OPA edge switch “Eldorado Forest”, 48 ports each
Maximum system configuration: 5 (OPA core switches) x 768 (ports) x 2 (tapering) -> 7680 servers
[Figure: network diagram]
• 6624 compute nodes
• 216 x 48-port edge switches
• 5 x 768-port core switches
• 32 downlinks per edge switch: 32 nodes form a fully interconnected island
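The sizing arithmetic on this slide can be checked with a few lines of C. This is a sketch under one assumption not stated on the slide: every edge-switch port that is not a downlink is used as an uplink (48 - 32 = 16), which is what yields the 2:1 tapering. The function names are illustrative only.

```c
#include <assert.h>

/* Fat-tree sizing figures taken from the slide. */
enum {
    CORE_SWITCHES  = 5,
    CORE_PORTS     = 768,
    EDGE_SWITCHES  = 216,
    EDGE_PORTS     = 48,
    EDGE_DOWNLINKS = 32    /* ports facing compute nodes */
};

/* Uplinks per edge switch: assumed to be all remaining ports (48 - 32 = 16). */
int edge_uplinks(void) { return EDGE_PORTS - EDGE_DOWNLINKS; }

/* Downlink:uplink ratio on an edge switch, i.e. the tapering factor. */
int tapering(void) { return EDGE_DOWNLINKS / edge_uplinks(); }

/* Maximum servers: core ports x tapering, as on the slide. */
int max_servers(void) { return CORE_SWITCHES * CORE_PORTS * tapering(); }

/* Node-facing ports provided by the installed edge switches. */
int installed_node_ports(void) { return EDGE_SWITCHES * EDGE_DOWNLINKS; }
```

Under that assumption the installed edge switches offer 6912 node-facing ports, enough for the 6624 compute nodes, and need 3456 uplinks out of the 3840 available core ports.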
A1 HPL
Full system Linpack:
• 1 MPI task per node
• Performance range: 1.6 – 1.7 PFs.
• Max performance: 1.72389 PFs with Turbo-OFF.
• Turbo-ON -> throttling
June 2016: Number 46 (Top500 list)
A2 HPL
Full system Linpack: 3556 nodes
• 1 MPI task per node
• Max Perf with HyperThreading-OFF.
November 2016: Number 12 (Top500 list)
[0] ================================================================================
[0] T/V N NB P Q Time Gflops
[0] --------------------------------------------------------------------------------
[0] W[0] R00L2L4 6287568 336 28 127 26628.96 6.22304e+06
[0] HPL_pdgesv() start time Fri Nov 4 23:10:08 2016
[0]
[0] HPL_pdgesv() end time Sat Nov 5 06:33:57 2016
[0]
[0] HPL Efficiency by CPU Cycle 2505.196%
[0] HPL Efficiency by BUS Cycle 2439.395%
[0] --------------------------------------------------------------------------------
[0] ||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 0.0006293 ...... PASSED
[0] ================================================================================
[0]
[0] Finished 1 tests with the following results:
[0] 1 tests completed and passed residual checks,
[0] 0 tests completed and failed residual checks,
[0] 0 tests skipped because of illegal input values.
[0] --------------------------------------------------------------------------------
[0]
[0] End of Tests.
[0] ================================================================================
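The Gflops figure in the log above follows from the problem size N and the run time. A minimal sketch, using the standard HPL operation count of 2/3 N^3 + 3/2 N^2 (the helper names are illustrative, not part of HPL):

```c
#include <assert.h>

/* Operation count used by HPL for an N x N solve: 2/3 N^3 + 3/2 N^2. */
double hpl_flops(double n)
{
    return (2.0 / 3.0) * n * n * n + 1.5 * n * n;
}

/* Sustained rate in Gflops for problem size n solved in `seconds`. */
double hpl_gflops(double n, double seconds)
{
    assert(n > 0.0 && seconds > 0.0);
    return hpl_flops(n) / seconds / 1.0e9;
}
```

With N = 6287568 and 26628.96 s this gives roughly 6.2 million Gflops, i.e. about 6.2 PFlops, in line with the log. The process grid P x Q = 28 x 127 = 3556 also matches one MPI task per node on 3556 nodes.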
Intel Xeon PHI KNC (Galileo@CINECA)
• An accelerator (like GPUs), but more similar to a conventional multicore CPU.
• The current version, Knights Corner (KNC), has 57-61 cores at 1.0-1.2 GHz, 8-16 GB RAM and a 512-bit vector unit.
• Cores are connected in a ring topology, and MPI is possible.
• No need to write CUDA or OpenCL, as Intel compilers will compile Fortran or C code for the MIC.
• 1-2 Tflops, according to model.
• A2: Knights Landing (KNL)
– A big unknown because very few people currently have access to KNL.
– But we know the architecture of KNL and the differences and similarities with respect to KNC.
– The main differences are:
    • KNL will be a standalone processor, not an accelerator (unlike KNC).
    • KNL has more powerful cores and a faster internal network.
    • On-package high-performance memory (16 GB, MCDRAM).
Intel Xeon PHI KNC-KNL comparison
                       KNC (Galileo)             KNL (Marconi)
#cores                 61 (Pentium)              68 (Atom)
Core frequency         1.238 GHz                 1.4 GHz
Memory                 16 GB GDDR5               96 GB DDR4 + 16 GB MCDRAM
Internal network       Bi-directional ring       Mesh
Vectorisation          512 bit /core             2xAVX-512 /core
Usage                  Co-processor              Standalone
Performance (Gflops)   1208 (dp) / 2416 (sp)     ~3000 (dp)
Power                  ~300 W                    ~200 W
A KNC core can be 10x slower than a Haswell core. A KNL core is expected to be 2-3X slower. Big differences also in memory bandwidth.
Coming next: A3
• A3: Intel Skylake processors (mid-2017)
– Successor to Haswell, launched in 2015.
– Expect an increase in performance and power efficiency.
Marconi A1 and A2 exploitation at its best
Exploiting the parallel universe
Three levels of parallelism supported by Intel hardware
Thread Level Parallelism
• Multi thread/task performance
• Exposed by programming models
• Execute tens/hundreds/thousands of tasks concurrently
Vector Level Parallelism
• Single thread performance
• Exposed by tools and programming models
• Operate on 4/8/16 elements at a time
Instruction Level Parallelism
• Single thread performance
• Automatically exposed by HW/tools
• Effectively limited to a few instructions
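As an illustration of how the first two levels are exposed by a programming model, here is a minimal sketch using OpenMP directives in C. It is not taken from the slides: the function name `dot` is illustrative, and the code assumes an OpenMP 4.0 (or later) compiler. Without OpenMP enabled the pragma is ignored and the loop simply runs serially.

```c
#include <assert.h>
#include <stddef.h>

/* Dot product using two of the three levels explicitly:
 * "parallel for" spreads iterations over threads (thread level),
 * "simd" asks the compiler to vectorise each thread's chunk (vector level).
 * Instruction level parallelism is left to the hardware and the compiler. */
double dot(const double *a, const double *b, size_t n)
{
    assert(a != NULL && b != NULL);
    double sum = 0.0;
    #pragma omp parallel for simd reduction(+:sum)
    for (size_t i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}
```

The reduction clause gives each thread and vector lane a private partial sum, so the result does not depend on a data race; the order of floating-point additions may still differ from the serial loop.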
A1 exploitation
• A1: Broadwell nodes (36 cores/node, 128 GB memory/node)
– Similar to the Haswell cores present on Galileo.
– Expect only a small difference in single-core performance with respect to Galileo, but a big difference compared to Fermi.
– More cores per node (36) should mean better OpenMP performance, but MPI performance will also improve (faster network).
– Life much easier for SPMD programming models.
+ Use SIMD vectorization
Single Instruction Multiple Data (SIMD) vectorization
• Technique for exploiting VLP on a single thread
• Operate on more than one element at a time
• Might decrease instruction counts significantly
• Elements are stored in SIMD registers or vectors
• Code needs to be vectorized
• Vectorization usually on inner loops
• Main and remainder loops are generated
Scalar loop:
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];
SIMD loop (4 elements):
    for (int i = 0; i < N; i += 4)
        c[i:4] = a[i:4] + b[i:4];
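The "main and remainder loops" that a vectorising compiler generates can be written out by hand in plain C. This is only a sketch of the loop structure, not what any particular compiler emits: `VLEN` and `vec_add` are illustrative names, and the inner fixed-width loop stands in for a single SIMD instruction.

```c
#include <assert.h>
#include <stddef.h>

#define VLEN 4   /* illustrative vector length: 4 elements per step */

void vec_add(const float *a, const float *b, float *c, size_t n)
{
    assert(a != NULL && b != NULL && c != NULL);
    size_t i = 0;
    size_t main_end = n - (n % VLEN);   /* largest multiple of VLEN <= n */

    /* Main loop: one group of VLEN elements per iteration. */
    for (; i < main_end; i += VLEN)
        for (size_t j = 0; j < VLEN; j++)
            c[i + j] = a[i + j] + b[i + j];

    /* Remainder loop: the last n % VLEN elements, one at a time. */
    for (; i < n; i++)
        c[i] = a[i] + b[i];
}
```

For N = 10 the main loop covers elements 0 to 7 in two steps and the remainder loop handles elements 8 and 9.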
A2 exploitation
• A2: KNL (68 cores/node; 96 GByte DDR4 + 16 GByte MCDRAM per node)
– (More) similar to the KNC coprocessor present on Galileo, but with remarkable differences:
    High bandwidth MCDRAM available
    AVX-512 ISA
    Binary compatible with ‘standard’ Xeon
– Mixing MPI and OpenMP is still the best choice (less dependent on OpenMP performance).
++ Use SIMD vectorization
A2: MCDRAM
• Memory bandwidth is one of the common performance bottlenecks in HPC.
• To meet the demand for memory bandwidth, KNL has an on-package high-bandwidth memory (HBM) based on multi-channel dynamic random access memory (MCDRAM).
• This memory is capable of delivering up to 5x the performance (≥ 400 GB/s) of DDR4 memory on the same platform (≥ 90 GB/s).
A2: MCDRAM
• HBM on KNL can be used as
1. a last-level cache (LLC)
2. an addressable memory.
• The configuration is determined at boot time, by choosing in the BIOS settings among three MCDRAM modes:
1. Flat mode
2. Cache mode
3. Hybrid mode
A2: MCDRAM
• The best mode to use will depend on the application.
A2: Using HBM as addressable memory
Two methods for this:
• the numactl tool
– Works best if the whole app can fit in MCDRAM
• the memkind library
– Using library calls or compiler directives
– Needs source modification
A2: Using numactl to access MCDRAM
• Run "numactl --hardware" to see the NUMA configuration of your system.
• Look for the node that has memory only (no CPUs): that is the MCDRAM node.
• If the total memory footprint of your app is smaller than the size of MCDRAM:
– Check the footprint (see the RSS value):
    ps -C myapp u
– Use numactl to allocate all of its memory from MCDRAM:
    numactl --membind=mcdram_id myapp
  where mcdram_id is the ID of the MCDRAM "node".
• If the total memory footprint of your app is larger than the size of MCDRAM:
– You can still use numactl to allocate part of your app in MCDRAM:
    numactl --preferred=mcdram_id myapp
  Allocations that don't fit into MCDRAM spill over to DDR.
– Alternatively:
    numactl --interleave=nodes myapp
  Allocations are interleaved across all nodes.
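The choice between the two policies above can be sketched as a small decision helper in C. Both function names are hypothetical, the footprint is assumed to be the RSS value in KiB as reported by ps, and the MCDRAM node ID must be read from "numactl --hardware" on the target system.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: pick the numactl option from the slide that matches
 * the application's footprint (e.g. the RSS reported by ps, in KiB). */
const char *numactl_policy(unsigned long footprint_kib,
                           unsigned long mcdram_kib)
{
    /* Whole app fits: bind every allocation to the MCDRAM node. */
    if (footprint_kib <= mcdram_kib)
        return "--membind";
    /* Otherwise prefer MCDRAM and let the rest spill over to DDR. */
    return "--preferred";
}

/* Write the full command line into buf; returns the snprintf result. */
int numactl_command(char *buf, size_t len, unsigned long footprint_kib,
                    unsigned long mcdram_kib, int mcdram_id, const char *app)
{
    assert(buf != NULL && app != NULL);
    return snprintf(buf, len, "numactl %s=%d %s",
                    numactl_policy(footprint_kib, mcdram_kib), mcdram_id, app);
}
```

For an 8 GiB footprint on a 16 GiB MCDRAM node with ID 1 (an assumed value), the helper produces "numactl --membind=1 myapp"; for a 32 GiB footprint it produces "numactl --preferred=1 myapp".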
A2: Using Memkind to access MCDRAM
• The memkind library is a user-extensible heap manager built on top of jemalloc, a C library for general-purpose memory allocation functions.
• The library is generalizable to any NUMA architecture, but for Knights Landing processors it is used primarily for manual allocation to HBM using special allocators for C/C++.
• It has limited support for Fortran.
A2: Using Memkind - C case
• Allocate 1000 floats from DDR
#include <stdlib.h>
float *fv;
fv = (float *)malloc(sizeof(float) * 1000);
• Allocate 1000 floats from MCDRAM
#include <hbwmalloc.h>
float *fv;
fv = (float *)hbw_malloc(sizeof(float) * 1000);  /* release with hbw_free(fv) */
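A slightly fuller sketch of the same idea, with a fallback to DDR when no high-bandwidth memory is present. The build switch `USE_MEMKIND` and the wrapper names `alloc_floats` and `free_floats` are assumptions made for this example; `hbw_check_available`, `hbw_malloc` and `hbw_free` are the memkind calls declared in hbwmalloc.h. Without the switch the code compiles and runs with plain malloc.

```c
#include <assert.h>
#include <stdlib.h>

#ifdef USE_MEMKIND   /* hypothetical build switch: -DUSE_MEMKIND -lmemkind */
#include <hbwmalloc.h>
#endif

/* Allocate n floats from MCDRAM when available, otherwise from DDR.
 * *from_hbw records which allocator was used so the matching free is called. */
float *alloc_floats(size_t n, int *from_hbw)
{
    assert(from_hbw != NULL);
    *from_hbw = 0;
#ifdef USE_MEMKIND
    if (hbw_check_available() == 0) {   /* 0 means high-bandwidth memory exists */
        float *p = (float *)hbw_malloc(sizeof(float) * n);
        if (p != NULL) {
            *from_hbw = 1;
            return p;
        }
    }
#endif
    return (float *)malloc(sizeof(float) * n);
}

/* Release memory with the allocator that produced it. */
void free_floats(float *p, int from_hbw)
{
#ifdef USE_MEMKIND
    if (from_hbw) {
        hbw_free(p);
        return;
    }
#endif
    (void)from_hbw;
    free(p);
}
```

Tracking the allocator explicitly matters because memory obtained from hbw_malloc must be released with hbw_free, not free.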
A2: Using Memkind - Fortran case
c     Declare arrays to be dynamic
      REAL, ALLOCATABLE :: A(:), B(:), C(:)
!DEC$ ATTRIBUTES FASTMEM :: A
      NSIZE=1024
c
c     allocate array 'A' from MCDRAM
c
      ALLOCATE (A(1:NSIZE))
c
c     Allocate arrays that will come from DDR
c
      ALLOCATE (B(NSIZE), C(NSIZE))
Porting Applications onto Marconi A1 and A2
• Application developers must exploit vectorisation (SIMD), MPI processes and/or OpenMP threads, and possibly the fast memory of KNL (MCDRAM).
• From the user perspective, the success of KNL will depend on how well software tools and applications are able to exploit its architecture. Key idea: use a proxy (KNC, …):
• SPECFEM3D_GLOBE: already reasonable results with KNC in the framework of the IPCC@CINECA activity. Good amount of vectorisation (enabled by the FORCE_VECTORIZATION preprocessing flag and SIMD optimization), suitable for KNC and future KNL. OpenMP scales to a high number of threads (more than 60 on KNC).
It is worth noting that up to now KNC has not been widely supported by developers and users of geophysical application software. A remarkable exception is the SPECFEM3D_GLOBE software in the CIG repo, where the “native” version is maintained and tested. Again, this should be a good starting point on KNL.
Good practice: vectorization (SPECFEM3D_GLOBE KNC investigation)

SPECFEM3D_GLOBE Regional_MiddleEast test case: forward simulation
Computer system      Elapsed time (s)   Speedup wrt Haswell
Haswell (Galileo)    570.20             1.00
KNC (Galileo)        430.35             1.32
Based on a 4-node Galileo partition (16 MPI processes, 4 and 60 OpenMP threads on Haswell and KNC respectively).

SPECFEM3D_GLOBE Regional_MiddleEast test case: no vectorisation comparison
Computer system      Elapsed time (s)   Slowdown factor wrt vectorised
Haswell (Galileo)    687.14             1.20
KNC (Galileo)        848.12             1.97
The impact of vectorisation on Haswell and KNC: about a 2x slowdown factor on KNC without it.
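The speedup and slowdown columns follow directly from the elapsed times. A minimal sketch of the arithmetic, with illustrative helper names:

```c
#include <assert.h>

/* Ratio of a reference elapsed time to a measured one (>1 means faster). */
double speedup(double t_reference, double t_measured)
{
    assert(t_reference > 0.0 && t_measured > 0.0);
    return t_reference / t_measured;
}

/* Slowdown of a non-vectorised run relative to the vectorised run. */
double slowdown(double t_vectorised, double t_scalar)
{
    assert(t_vectorised > 0.0 && t_scalar > 0.0);
    return t_scalar / t_vectorised;
}
```

For example, `speedup(570.20, 430.35)` reproduces the KNC speedup over Haswell, while `slowdown(570.20, 687.14)` and `slowdown(430.35, 848.12)` reproduce the cost of disabling vectorisation on Haswell and KNC respectively.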
Good practice: choosing MCDRAM memory modes
• MCDRAM cache mode should be a good choice for most applications, but…
• …applications with ‘cache unfriendly’ data are candidates for using other memory modes. In this case, try to identify what to put in MCDRAM using:
– timings with selected data items allocated in fast/slow memory (intrusive)
– memory profiling (VTune Amplifier tool)
Conclusions
• Marconi A1: moderate improvements over the years… but a big improvement compared to Fermi.
• High expectations of Marconi A2 KNL performance.
• KNC paves the way for increasing performance…
• …try to manage domain parallelism, increase threading, exploit data parallelism (vectorisation) and improve data locality (new chance: use on-package memory).