
Copyright 2016 FUJITSU LIMITED

FUJITSU HPC and the Development of the Post-K Supercomputer

Toshiyuki Shimizu
Vice President, System Development Division, Next Generation Technical Computing Unit


November 16th, 2016

Post-K is currently under development. Information in these slides is subject to change without notice

FUJITSU HPC and Post-K Development

Introduction of HPC solutions, HPC product portfolio

High-end HPC supercomputer development

Performance of high-end machines preceding the Post-K

Post-K Goals and approaches

Post-K hardware

Post-K software

Performance balance

Summary


FUJITSU HPC Solutions to Satisfy Customer Demands

High-end supercomputers, both Fujitsu-developed CPUs and x86 cluster systems

Single system image operation w/ FUJITSU system software

High performance, high availability, and high reliability


[Figure: HPC product portfolio by scale: High-end, Divisional, Departmental, Work Group]

Supercomputer PRIMEHPC: K computer (co-developed with RIKEN), PRIMEHPC FX10, PRIMEHPC FX100

High scalability with Fujitsu-developed CPU and interconnect

Large-Scale SMP System: RX900

x86 Cluster: RX2530/RX2540, CX400/CX600 (KNL), BX900/BX400

PRIMERGY x86 cluster systems support the latest CPUs and accelerators

FUJITSU High-end Supercomputer Development

[Figure: development timeline, 2011 to 2020, with two rows: Japan’s National Projects and FUJITSU]

Timeline labels:

K computer: development, then operation

HPCI strategic apps program

App. review

FS projects

Post-K computer development

PRIMEHPC FX10: 1.8x CPU perf. of K, easier installation

PRIMEHPC FX100: 4x (DP) / 8x (SP) CPU perf. of K, Tofu2, high-density pkg & lower energy


K computer and PRIMEHPC FX10/FX100 in Operation

Many applications are currently running on these machines and being developed for science and various industries

The CPU and interconnect of FX10/FX100 supercomputers inherit the K computer’s architectural concept, featuring state-of-the-art technologies

Post-K Supercomputer

RIKEN and FUJITSU are working together to provide a successor to the K computer with application R&D teams using a “co-design” approach

Technical Computing Suite (TCS)

System software “TCS” supports FUJITSU supercomputers

Handles millions of parallel jobs

FEFS: super scalable file system

MPI: Ultra scalable collective communication libraries

OS: Lower OS jitter w/ assistant core


Achievements with the K computer

Prestigious Benchmark Awards

TOP500: 10.5 Pflops, 93% efficiency

HPCG: 602 Tflops, 5.3% efficiency

Graph500: 38.6 TTEPS

HPC Challenge Class 1: No. 1 in all categories: (1) Global HPL, (2) Global RandomAccess, (3) EP STREAM, (4) Global FFT

Gordon Bell Prize Awards

“First principles calculation of electronic states of a silicon nanowire with 100,000 atoms on the K computer” (2011)

“4.45 Pflops Astrophysical N-Body Simulation on K Computer – The Gravitational Trillion-Body Problem” (2012)

“Simulations of Below-Ground Dynamics of Fungi: 1.184 Pflops Attained by Automated Generation and Autotuning of Temporal Blocking Codes” (2016 finalist)

[Slide callouts: “No. 1” (shown three times); “Nominated as finalist at SC16, 6 years from the shipment”]

Performance of FUJITSU High-end Machines

FUJITSU’s custom CPUs steadily increase their FP performance

Uncompromised data bandwidth for the best use of applications

With the FX100, we introduced the “SMaC” concept, along with the assistant core (AC) and the CMG structure


                              FX100           FX10         K computer
Available year                CY2015          CY2012       CY2010
Double Flops / CPU            1 TF            235 GF       128 GF
Single Flops / CPU            2 TF            235 GF       128 GF
SIMD width                    256 bit         128 bit      128 bit
# of CMG (# of cores/CMG*)    2 (16+1xAC*)    1 (16)       1 (8)
Memory BW                     480 GB/s        83.5 GB/s    64 GB/s
Byte per flop                 0.4 ~ 0.5

*AC (Assistant Core) for OS jitter reduction by processing IO operations and async. MPI handling

*CMG (Core Memory Group) is a group of CPU cores sharing L2 and memory for efficient BW
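The byte-per-flop row can be reproduced by dividing memory bandwidth by the peak double-precision rate. The following is a minimal sketch in C that uses only the figures from the table; the names `machine`, `bytes_per_flop`, and `print_balance` are illustrative and not part of any FUJITSU tool.

```c
#include <stdio.h>

/* Illustrative only: the names are hypothetical, the values come from the table. */
struct machine {
    const char *name;
    double mem_bw_gbs;   /* memory bandwidth in GB/s */
    double peak_dp_gf;   /* peak double-precision rate in Gflops */
};

/* Bytes of memory traffic available per double-precision flop. */
double bytes_per_flop(double mem_bw_gbs, double peak_dp_gf) {
    return mem_bw_gbs / peak_dp_gf;
}

/* Print the balance for each machine listed in the table. */
void print_balance(void) {
    const struct machine m[] = {
        { "FX100",      480.0, 1000.0 },
        { "FX10",        83.5,  235.0 },
        { "K computer",  64.0,  128.0 },
    };
    for (size_t i = 0; i < sizeof m / sizeof m[0]; i++)
        printf("%-10s %.2f B/F\n", m[i].name,
               bytes_per_flop(m[i].mem_bw_gbs, m[i].peak_dp_gf));
}
```

With the table's figures this gives 0.48 for FX100 and 0.50 for the K computer; the FX10 entries work out to roughly 0.36.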

[Figure: FX100 CPU block diagram. Two CMGs, each with 16 cores, one assistant core, and an L2 cache; four MACs and two memory interfaces; Tofu2 interface and Tofu2 controller; PCI interface and PCI controller]

SMaC (Scalable Many Core) Concept & Approach


Many core-oriented, long-lasting architecture

Scalable performance improvement by increasing the number of cores

・Increasing the number of cores would be safe, even in the post-Moore’s Law era, by using 3D stacking and alternative, newer technologies

FX100 CPU implementation

CMG (Shared L2 cache & Memories)

Core Memory Group (CMG), many core building block, ccNUMA between CMGs

・Hierarchical structure for hybrid parallel model

・Optimized area and performance

CORE

Middle-sized, general purpose, out-of-order, superscalar processor core

・Good performance for a variety of apps

・Low power by optimal balance of resources & perf.

Assistant core

・OS jitter reduction by processing IO op, async MPI

・Highly scalable performance by low system noise

Post-K Goals and Approaches

Post-K Goals

High application performance and good power efficiency

Keeping application compatibility while advancing from predecessors

Good usability and better accessibility for users

Our Approaches

Developing high performance and scalable, custom CPU cores

【Performance】 Wider SIMD & high memory BW, mathematical acc. primitives

【Scalability】 SMaC (scalable many core), zero OS jitter (assistant core)

【Power efficiency】 The best device tech, power control functions, optimal resources

Maintaining performance balance and supporting advanced features

•High memory BW, “Tofu” interconnect, and RIKEN advanced system software

Adopting ARM standard architecture

•Co-operation with ARM/Linux community and utilization of open source software

•Getting involved in the ARM HPC ecosystem


Post-K Powered by FUJITSU-designed CPU Cores & Tofu

FUJITSU CPU cores support ARM SVE ISA

FUJITSU, as a lead partner in ARM SVE development, contributes to the specification of ARM SVE (Scalable Vector Extension), for application performance

FUJITSU ARM core incorporates FUJITSU’s proven supercomputer microarchitecture

ARM SVE, plus optional functions and Tofu, maintain programming models and performance balance

Post-K complies with ARM’s standard frameworks (SBSA, etc.), for compatibility among platforms


Functions for Perf.      Post-K        FX100     FX10      K computer

SVE incorporated
  SIMD                   512bit        256bit    128bit    128bit
  FMA4                   ✔             ✔         ✔         ✔
  Math. acc. prim.*      ✔ Enhanced    ✔         ✔         ✔
Optional functions
  Inter-core barrier     ✔             ✔         ✔         ✔
  Sector cache           ✔ Enhanced    ✔         ✔         ✔
  Prefetch modes         ✔ Enhanced    ✔         ✔         ✔
Interconnect
  Tofu                   ✔ Enhanced    ✔         ✔         ✔

*Mathematical acceleration primitives include trigonometric functions, sine & cosines, and exponential...

System Software for Post-K

Currently in development, based on “co-design” scheme with application developers, including system hardware


[Figure: Post-K software stack, listed from top to bottom]

Post-K Applications

FUJITSU Technical Computing Suite / RIKEN Advanced System Software

Management Software: System management for highly available & power saving operation; Job management for higher system utilization & power efficiency

Hierarchical File I/O Software: Lustre-based distributed file system FEFS

Programming Environment: OpenMP, COARRAY, Math Libs.; Compilers (C, C++, Fortran); Debugging and tuning tools; MPI (Open MPI, )

Linux OS / McKernel (Lightweight Kernel)

Post-K System Hardware

FUJITSU Compiler for Post-K

Maximizes the execution performance of HPC applications

Covers a wide range of applications, including those in which integer calculations are dominant

Targets 512-bit-wide vectorization as well as vector-length-agnostic code

Fixed-vector-length facilitates optimizations such as constant folding

Inherits options/features of K computer, PRIMEHPC FX10 and FX100

Language Standard Support

Fully supported : Fortran 2008, C11, C++14, OpenMP 4.5

Partially supported : Fortran 2015, C++1z, OpenMP 5.0

Supports ARM C Language Extensions (ACLE) for SVE

ACLE allows programmers to use SVE instructions as C intrinsic functions

// C intrinsics in ACLE for SVE
svfloat64_t z0 = svld1_f64(p0, &x[i]);
svfloat64_t z1 = svld1_f64(p0, &y[i]);
svfloat64_t z2 = svadd_f64_x(p0, z0, z1);
svst1_f64(p0, &z[i], z2);

// SVE assembler
ld1d z1.d, p0/z, [x19, x3, lsl #3]
ld1d z0.d, p0/z, [x20, x3, lsl #3]
fadd z1.d, p0/m, z1.d, z0.d
st1d z1.d, p0, [x21, x3, lsl #3]
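The fragment above omits the surrounding loop and the definition of the predicate `p0`. The sketch below shows one way they could be filled in as a vector-length-agnostic loop, assuming the standard ACLE intrinsics `svwhilelt_b64` (to build the loop predicate) and `svcntd` (to get the number of 64-bit lanes). The function name `vec_add` and the scalar fallback for compilers without SVE are illustrative additions, not taken from the slides.

```c
#include <stdint.h>
#if defined(__ARM_FEATURE_SVE)
#include <arm_sve.h>
#endif

/* z[i] = x[i] + y[i] for 0 <= i < n (hypothetical helper name). */
void vec_add(const double *x, const double *y, double *z, int64_t n) {
#if defined(__ARM_FEATURE_SVE)
    /* Step by the hardware vector length; no lane count is hard-coded. */
    for (int64_t i = 0; i < n; i += (int64_t)svcntd()) {
        svbool_t p0 = svwhilelt_b64(i, n);        /* active lanes: i + lane < n */
        svfloat64_t z0 = svld1_f64(p0, &x[i]);
        svfloat64_t z1 = svld1_f64(p0, &y[i]);
        svfloat64_t z2 = svadd_f64_x(p0, z0, z1);
        svst1_f64(p0, &z[i], z2);
    }
#else
    /* Scalar fallback so the sketch also builds on non-SVE targets. */
    for (int64_t i = 0; i < n; i++)
        z[i] = x[i] + y[i];
#endif
}
```

Because the predicate masks the final partial vector, no separate remainder loop is needed, and the same binary would run on any SVE vector length.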


Vectorization by FUJITSU Compiler

Dynamic instruction counts of representative loops of NPB 3.3-SER

Vectorized loops in TSVC* (Fortran and C)

// Sample of vectorized loop by SVE
// s482
for (int i = 0; i < LEN; i++) {
    a[i] += b[i] * c[i];
    if (c[i] > b[i]) break;
}
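The s482 loop above has a data-dependent exit, which is what makes it hard to vectorize. SVE provides predicates and predicate break instructions for this pattern. The sketch below emulates the idea in plain C: each chunk of lanes is limited to the elements up to and including the first one that triggers the break. It is an illustration only, not the code the FUJITSU compiler generates. The lane count `VL`, the function names, and the assumption that `a` does not alias `b` or `c` are all hypothetical.

```c
#include <stddef.h>

enum { VL = 8 };  /* assumed lane count: 512 bits / 64-bit doubles */

/* Scalar reference, same semantics as s482. Returns elements updated. */
size_t s482_scalar(double *a, const double *b, const double *c, size_t len) {
    size_t i = 0;
    while (i < len) {
        a[i] += b[i] * c[i];
        if (c[i++] > b[i - 1]) break;
    }
    return i;
}

/* Chunked emulation of predicated execution. Returns elements updated. */
size_t s482_chunked(double *a, const double *b, const double *c, size_t len) {
    size_t i = 0;
    while (i < len) {
        /* Loop predicate: lanes that are still inside the array. */
        size_t lanes = (len - i < VL) ? len - i : VL;
        /* Break predicate: keep lanes up to and including the first hit. */
        size_t active = lanes;
        int hit = 0;
        for (size_t l = 0; l < lanes; l++) {
            if (c[i + l] > b[i + l]) { active = l + 1; hit = 1; break; }
        }
        /* Predicated multiply-add on the active lanes only. */
        for (size_t l = 0; l < active; l++)
            a[i + l] += b[i + l] * c[i + l];
        i += active;
        if (hit) break;
    }
    return i;
}
```

Evaluating the exit condition before the update is valid here only because the condition reads `b` and `c`, which the loop never writes.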

*[Fortran] D. Callahan, J. Dongarra, and D. Levine, “Vectorizing compilers: a test suite and results,” in Supercomputing '88, pp. 98-105.
[C] S. Maleki, Y. Gao, M. J. Garzarán, T. Wong, and D. A. Padua, “An Evaluation of Vectorizing Compilers,” PACT 2011, pp. 372-382.

TSVC (total) FX100 Post-K

Fortran (135) 96 111

C (151) 106 121

[Figure: ratio of the number of executed instructions for representative loops of NPB 3.3-SER (MG, BT, SP, LU); vertical axis from 0 to 1.2. Series: 256b SIMD (FX100), 512b SIMD (Post-K), 512b SIMD (estimated from FX100 result). SIMD labels on the chart: 75%, 69%, 69%, 72%]


Discussion on the Perf. Balance for Applications

Effectiveness for the meteorology application, IFS*, was evaluated

Good performance balance w/ wider SIMD and memory bandwidth from K to FX100 realizes an IFS performance improvement

Keeping the performance balance across generations toward Post-K is expected to provide scalable speed-up


                 K computer    FX100
Flops / CPU      128 Gflops    1 Tflops
SIMD width       128 bit       256 bit
Memory BW        64 GB/s       480 GB/s
Byte per flop    0.4 ~ 0.5

*The Integrated Forecasting System (IFS) is developed by ECMWF

[Figure: Speedup of IFS CNT4 (TL159, 96 cores). Configurations: 0.5x B/F; 2x B/F w/ narrow SIMD; Balanced; 2x B/F. Annotations: (1) BW limits performance; (2) Narrower SIMD limits performance; (3) Insufficient gain by 2x B/F; 1.5x by doubling SIMD]
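The reasoning behind annotations (1) and (2) can be illustrated with a simple roofline bound, in which attainable performance is the smaller of the peak compute rate and memory bandwidth times arithmetic intensity. This is a generic textbook model chosen for illustration, not the method used for the IFS evaluation; the function names and the example intensity are assumptions.

```c
/* Roofline bound in Gflops (illustrative model, hypothetical names).
 * peak_gflops:    peak compute rate of the CPU
 * mem_bw_gbs:     memory bandwidth in GB/s
 * flops_per_byte: arithmetic intensity of the kernel */
double attainable_gflops(double peak_gflops, double mem_bw_gbs,
                         double flops_per_byte) {
    double bw_bound = mem_bw_gbs * flops_per_byte;
    return bw_bound < peak_gflops ? bw_bound : peak_gflops;
}

/* Returns 1 when memory bandwidth, not compute, sets the bound. */
int is_bandwidth_bound(double peak_gflops, double mem_bw_gbs,
                       double flops_per_byte) {
    return mem_bw_gbs * flops_per_byte < peak_gflops;
}
```

For a kernel assumed to perform 1 flop per byte, the FX100 figures (1 Tflops, 480 GB/s) give a bound of 480 Gflops. In this model, doubling the peak rate alone leaves that bound unchanged, and raising bandwidth alone stops helping once the compute peak is reached, which is the argument for keeping the two in balance.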


Summary of Post-K Development

Developing high performance, scalable, custom CPU cores

•SMaC architecture with an assistant core for scalable performance

•ARM instruction set architecture, SVE, as a standard architecture

•ARM standard frameworks, SBSA, etc., for compatibility among platforms

Keeping performance balanced and advancing from preceding machines

•Higher performance and higher data bandwidth

Advanced system software and applications

•Co-design scheme with application developers

•FUJITSU optimizing compilers for Post-K

•Performance balance is a key for application speedup


Post-K will meet requirements & be valuable for science and industries



Post-K: Succeeding the K computer Heritage

Prestigious Benchmark Awards

TOP500: 10.5 Pflops, 93% efficiency

HPCG: 602 Tflops, 5.3% efficiency

Graph500: 38.6 TTEPS

HPC Challenge Class 1: No. 1 in all categories: (1) Global HPL, (2) Global RandomAccess, (3) EP STREAM, (4) Global FFT

Gordon Bell Prize Awards

“First principles calculation of electronic states of a silicon nanowire with 100,000 atoms on the K computer” (2011)

“4.45 Pflops Astrophysical N-Body Simulation on K Computer – The Gravitational Trillion-Body Problem” (2012)

“Simulations of Below-Ground Dynamics of Fungi: 1.184 Pflops Attained by Automated Generation and Autotuning of Temporal Blocking Codes” (2016 finalist)

[Slide callouts: “No. 1” (shown three times); “Nominated as finalist at SC16, 6 years from the shipment”]
