Copyright 2016 FUJITSU LIMITED
FUJITSU HPC and the Development of the Post-K Supercomputer
Toshiyuki Shimizu
Vice President, System Development Division, Next Generation Technical Computing Unit
November 16th, 2016
Post-K is currently under development. Information in these slides is subject to change without notice
FUJITSU HPC and Post-K Development
Introduction of HPC solutions, HPC product portfolio
High-end HPC supercomputer development
Performance of high-end machines preceding the Post-K
Post-K Goals and approaches
Post-K hardware
Post-K software
Performance balance
Summary
FUJITSU HPC Solutions to Satisfy Customer Demands
High-end supercomputers, both Fujitsu-developed CPUs and x86 cluster systems
Single system image operation w/ FUJITSU system software
High performance, high availability, and high reliability
[Figure: HPC product portfolio]
Supercomputer: K computer (co-developed with RIKEN), PRIMEHPC FX10, PRIMEHPC FX100
  High scalability with Fujitsu-developed CPU and interconnect
Large-Scale SMP System: RX900
x86 Cluster: RX2530/RX2540, CX400/CX600 (KNL), BX900/BX400
  PRIMERGY x86 cluster systems support the latest CPUs and accelerators
Segments: High-end, Divisional, Departmental, Work Group
FUJITSU High-end Supercomputer Development
[Figure: development roadmap, 2011 to 2020, with rows for Japan’s National Projects and FUJITSU]
K computer: development and operation
PRIMEHPC FX10: 1.8x CPU perf. of K, easier installation
PRIMEHPC FX100: 4x (DP) / 8x (SP) CPU perf. of K, Tofu2, high-density pkg & lower energy
HPCI strategic apps program
FS projects, app. review
Post-K computer development
K computer and PRIMEHPC FX10/FX100 in Operation
Many applications are currently running on these machines and being developed for science and various industries
The CPU and interconnect of FX10/FX100 supercomputers inherit the K computer’s architectural concept, featuring state-of-the-art technologies
Post-K Supercomputer
RIKEN and FUJITSU are working together to provide a successor to the K computer with application R&D teams using a “co-design” approach
Technical Computing Suite (TCS)
System software “TCS” supports FUJITSU supercomputers
Handles millions of parallel jobs
FEFS: super scalable file system
MPI: Ultra scalable collective communication libraries
OS: Lower OS jitter w/ assistant core
Achievements with the K computer
Prestigious Benchmark Awards
TOP500: 10.5 Pflops, 93% efficiency
HPCG: 602 Tflops, 5.3% efficiency
Graph500: 38.6 TTEPS
HPC Challenge Class 1: No. 1 in all categories: (1) Global HPL, (2) Global RandomAccess, (3) EP STREAM, (4) Global FFT
Gordon Bell Prize Awards
“First principles calculation of electronic states of a silicon nanowire with 100,000 atoms on the K computer” (2011)
“4.45 Pflops Astrophysical N-Body Simulation on K Computer – The Gravitational Trillion-Body Problem” (2012)
“Simulations of Below-Ground Dynamics of Fungi: 1.184 Pflops Attained by Automated Generation and Autotuning of Temporal Blocking Codes” (2016 finalist)
Callouts: No. 1 (three badges); nominated as finalist at SC16, 6 years from the shipment
Performance of FUJITSU High-end Machines
FUJITSU’s custom CPUs steadily increase their FP performance
Uncompromised data bandwidth for the best use of applications
With the FX100, we introduced the “SMaC” concept, followed by the assistant core (AC), and CMG structure
                              FX100           FX10         K computer
Available year                CY2015          CY2012       CY2010
Double Flops / CPU            1 TF            235 GF       128 GF
Single Flops / CPU            2 TF            235 GF       128 GF
SIMD width                    256 bit         128 bit      128 bit
# of CMG (# of cores/CMG*)    2 (16+1xAC*)    1 (16)       1 (8)
Memory BW                     480 GB/s        83.5 GB/s    64 GB/s
Byte per flop                 0.4 ~ 0.5
*AC (Assistant Core) for OS jitter reduction by processing IO operations and async. MPI handling
*CMG (Core Memory Group) is a group of CPU cores sharing L2 and memory for efficient BW
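The "Byte per flop" row can be checked from the other rows: it is the memory bandwidth divided by the double-precision peak. The sketch below is only an illustration using the per-CPU figures from the table; the struct and function names are made up for this example and do not come from the slides.

```c
#include <assert.h>
#include <stdio.h>

/* Per-CPU figures copied from the table above. */
typedef struct {
    const char *name;
    double dp_gflops;   /* double-precision peak, Gflops per CPU */
    double mem_bw_gbs;  /* memory bandwidth, GB/s per CPU */
} cpu_spec;

static const cpu_spec SPECS[] = {
    {"K computer", 128.0, 64.0},
    {"FX10", 235.0, 83.5},
    {"FX100", 1000.0, 480.0},
};

/* Bytes of memory traffic available per double-precision flop. */
double byte_per_flop(const cpu_spec *s) {
    assert(s != NULL && s->dp_gflops > 0.0);
    return s->mem_bw_gbs / s->dp_gflops;
}

/* Prints one line per machine (hypothetical helper for this sketch). */
void print_byte_per_flop(void) {
    for (size_t i = 0; i < sizeof SPECS / sizeof SPECS[0]; i++)
        printf("%-10s %.2f B/F\n", SPECS[i].name, byte_per_flop(&SPECS[i]));
}
```

With these inputs the ratios come out at about 0.50 (K computer), 0.36 (FX10), and 0.48 (FX100), in line with the rounded 0.4 ~ 0.5 range on the slide.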
SMaC (Scalable Many Core) Concept & Approach
[Figure: FX100 CPU block diagram. Two groups of 16 cores, each with an assistant core and an L2 cache; memory interfaces with MACs; Tofu2 interface and controller; PCI interface and controller]
Many core-oriented, long-lasting architecture
Scalable performance improvement by increasing the number of cores
Increasing the number of cores would be safe, even in the post-Moore’s Law era, by using 3D stacking and alternative, newer technologies
FX100 CPU implementation
CMG
Core Memory Group (CMG, shared L2 cache & memories), many-core building block, ccNUMA between CMGs
・Hierarchical structure for hybrid parallel model
・Optimized area and performance
CORE
Middle-sized, general purpose, out-of-order, superscalar processor core
・Good performance for a variety of apps
・Low power by optimal balance of resources & perf.
Assistant core
・OS jitter reduction by processing IO op, async MPI
・Highly scalable performance by low system noise
Post-K Goals and Approaches
Post-K Goals
High application performance and good power efficiency
Keeping application compatibility while advancing from predecessors
Good usability and better accessibility for users
Our Approaches
Developing high performance and scalable, custom CPU cores
【Performance】 Wider SIMD & high memory BW, mathematical acc. primitives
【Scalability】 SMaC (scalable many core), zero OS jitter (assistant core)
【Power efficiency】 The best device tech, power control functions, optimal resources
Maintaining performance balance and supporting advanced features
•High memory BW, “Tofu” interconnect, and RIKEN advanced system software
Adopting ARM standard architecture
•Co-operation with ARM/Linux community and utilization of open source software
•Getting involved in the ARM HPC ecosystem
Post-K Powered by FUJITSU-designed CPU Cores & Tofu
FUJITSU CPU cores support ARM SVE ISA
FUJITSU, as a lead partner in ARM SVE development, contributes to the specification of ARM SVE (Scalable Vector Extension), for application performance
FUJITSU ARM core incorporates FUJITSU’s proven supercomputer microarchitecture
ARM SVE, plus optional functions and Tofu, maintain programming models and performance balance
Post-K complies with ARM’s standard frameworks (SBSA, etc.), for compatibility among platforms
Functions for Perf.       Post-K        FX100     FX10      K computer
SVE incorporated
  SIMD                    512bit        256bit    128bit    128bit
  FMA4                    ✔             ✔         ✔         ✔
  Math. acc. prim.*       ✔ Enhanced    ✔         ✔         ✔
Optional functions
  Inter-core barrier      ✔             ✔         ✔         ✔
  Sector cache            ✔ Enhanced    ✔         ✔         ✔
  Prefetch modes          ✔ Enhanced    ✔         ✔         ✔
Interconnect Tofu         ✔ Enhanced    ✔         ✔         ✔
*Mathematical acceleration primitives include trigonometric functions, sine & cosines, and exponential...
System Software for Post-K
Currently in development, based on “co-design” scheme with application developers, including system hardware
Post-K System Hardware
Linux OS / McKernel (Lightweight Kernel)
FUJITSU Technical Computing Suite / RIKEN Advanced System Software
  Management Software
    System management for highly available & power saving operation
    Job management for higher system utilization & power efficiency
  Hierarchical File I/O Software
    Lustre-based distributed file system FEFS
  Programming Environment
    MPI (Open MPI, )
    OpenMP, COARRAY, Math Libs.
    Compilers (C, C++, Fortran)
    Debugging and tuning tools
Post-K Applications
FUJITSU Compiler for Post-K
Maximizes the execution performance of HPC applications
Covers a wide range of applications, including those in which integer calculations are dominant
Targets 512bit-wide vectorization as well as vector-length-agnostic code
Fixed-vector-length facilitates optimizations such as constant folding
Inherits options/features of K computer, PRIMEHPC FX10 and FX100
Language Standard Support
Fully supported : Fortran 2008, C11, C++14, OpenMP 4.5
Partially supported : Fortran 2015, C++1z, OpenMP 5.0
Supports ARM C Language Extensions (ACLE) for SVE
ACLE allows programmers to use SVE instructions as C intrinsic functions
// C intrinsics in ACLE for SVE
svfloat64_t z0 = svld1_f64(p0, &x[i]);
svfloat64_t z1 = svld1_f64(p0, &y[i]);
svfloat64_t z2 = svadd_f64_x(p0, z0, z1);
svst1_f64(p0, &z[i], z2);
// SVE assembler
ld1d z1.d, p0/z, [x19, x3, lsl #3]
ld1d z0.d, p0/z, [x20, x3, lsl #3]
fadd z1.d, p0/m, z1.d, z0.d
st1d z1.d, p0, [x21, x3, lsl #3]
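The intrinsics above are only a loop body: the predicate p0 and the index i come from a surrounding loop that the slide does not show. The following is a portable C sketch of what such a predicated loop does, written so it runs on any machine. It is an emulation and not SVE code. EMU_LANES and vla_add_emulated are hypothetical names, and the fixed lane count stands in for the hardware vector length, which real vector-length-agnostic code would query at run time (in ACLE, svcntd() for 64-bit lanes, with the predicate built by svwhilelt_b64).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the hardware vector length in 64-bit lanes
   (512 bits / 64 bits). Real SVE code does not know this at compile time. */
#define EMU_LANES 8

/* z[i] = x[i] + y[i] for i in [0, n), written in the predicated style. */
void vla_add_emulated(const double *x, const double *y, double *z, size_t n) {
    assert(n == 0 || (x != NULL && y != NULL && z != NULL));
    for (size_t i = 0; i < n; i += EMU_LANES) {
        /* Emulates building p0: lane k is active while i + k < n,
           so the last, partial chunk needs no separate scalar tail loop. */
        int p0[EMU_LANES];
        for (size_t k = 0; k < EMU_LANES; k++)
            p0[k] = (i + k < n);

        /* Emulates the predicated load, add, and store from the slide:
           inactive lanes touch no memory. */
        for (size_t k = 0; k < EMU_LANES; k++)
            if (p0[k])
                z[i + k] = x[i + k] + y[i + k];
    }
}
```

Because the tail is handled by the predicate rather than by the trip count, the same loop structure stays correct for any lane count, which is the point of the vector-length-agnostic style.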
Vectorization by FUJITSU Compiler
Dynamic instruction counts of representative loops of NPB 3.3-SER
Vectorized loops in TSVC* (Fortran and C)
// Sample of vectorized loop by SVE
// s482
for (int i = 0; i < LEN; i++) {
    a[i] += b[i] * c[i];
    if (c[i] > b[i]) break;
}
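The s482 loop is hard to vectorize because its exit depends on the data. One way to picture a vectorized version is to process a chunk of lanes at a time, find the first lane that wants to exit, and apply the update only to lanes up to and including that one. The portable C sketch below emulates that idea. It is an illustration only and does not show the code the FUJITSU compiler generates. EMU_LANES and s482_chunked are hypothetical names, and the sketch assumes a does not overlap b or c.

```c
#include <assert.h>
#include <stddef.h>

#define EMU_LANES 8  /* hypothetical vector length in elements */

/* Chunked version of s482. Returns the index at which the loop exited,
   or n if it ran to completion. Assumes a does not alias b or c. */
size_t s482_chunked(double *a, const double *b, const double *c, size_t n) {
    assert(n == 0 || (a != NULL && b != NULL && c != NULL));
    for (size_t i = 0; i < n; i += EMU_LANES) {
        size_t lanes = (n - i < EMU_LANES) ? n - i : EMU_LANES;

        /* Step 1: find the first lane in this chunk whose exit
           condition (c > b) holds. */
        size_t first_exit = lanes;
        for (size_t k = 0; k < lanes; k++) {
            if (c[i + k] > b[i + k]) { first_exit = k; break; }
        }

        /* Step 2: the update precedes the break in the source loop, so
           lanes up to and including the exiting lane are active. */
        size_t active = (first_exit < lanes) ? first_exit + 1 : lanes;
        for (size_t k = 0; k < active; k++)
            a[i + k] += b[i + k] * c[i + k];

        if (first_exit < lanes)
            return i + first_exit;
    }
    return n;
}
```

Evaluating the exit condition before the update is valid here only because the condition reads b and c, which the loop never writes.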
*[Fortran] D. Callahan, J. Dongarra, and D. Levine, “Vectorizing compilers: a test suite and results,” Supercomputing '88, pp. 98-105.
*[C] S. Maleki, Y. Gao, M. J. Garzarán, T. Wong, and D. A. Padua, “An Evaluation of Vectorizing Compilers,” PACT2011, pp. 372-382.
TSVC (total) FX100 Post-K
Fortran (135) 96 111
C (151) 106 121
[Figure: bar chart for NPB 3.3-SER loops MG, BT, SP, LU. Y axis: # of executed instruction ratio, 0 to 1.2. Series: 256b SIMD (FX100); 512b SIMD (Post-K); 512b SIMD (estimated from FX100 result). SIMD labels in the order extracted: 75%, 69%, 69%, 72%]
Discussion on the Perf. Balance for Applications
Effectiveness for the meteorology application, IFS*, was evaluated
Good performance balance w/ wider SIMD and higher memory bandwidth from K to FX100 improves IFS performance
Keeping the performance balance across generations toward Post-K is expected to provide scalable speed-up
                 K computer    FX100
Flops / CPU      128 Gflops    1 Tflops
SIMD width       128 bit       256 bit
Memory BW        64 GB/s       480 GB/s
Byte per flop    0.4 ~ 0.5
*The Integrated Forecasting System (IFS) is developed by ECMWF
[Figure: Speedup of IFS CNT4 (TL159, 96 cores). Configurations: .5x B/F; 2x B/F w/ narrow SIMD; Balanced; 2x B/F. Annotations: (1) BW limits performance; (2) Narrower SIMD limits performance; (3) Insufficient gain by 2x B/F; 1.5x by doubling SIMD]
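The two limits named in the chart annotations, bandwidth and compute width, can be pictured with a simple roofline-style bound. This is a generic textbook model swapped in for illustration. It is not the method used for the IFS evaluation in these slides, and the function name and the arithmetic-intensity parameter are assumptions of this sketch.

```c
#include <assert.h>

/* Attainable Gflops under a simple roofline bound:
   min(peak compute, memory bandwidth * arithmetic intensity).
   flops_per_byte is a property of the application kernel, not the CPU. */
double attainable_gflops(double peak_gflops, double mem_bw_gbs,
                         double flops_per_byte) {
    assert(peak_gflops > 0.0 && mem_bw_gbs > 0.0 && flops_per_byte > 0.0);
    double bw_bound = mem_bw_gbs * flops_per_byte;
    /* Bandwidth-limited when bw_bound < peak, compute-limited otherwise. */
    return bw_bound < peak_gflops ? bw_bound : peak_gflops;
}
```

For a hypothetical kernel at 0.5 flop per byte, the per-CPU figures from the table give attainable_gflops(128, 64, 0.5) = 32 for the K computer and attainable_gflops(1000, 480, 0.5) = 240 for the FX100, both on the bandwidth side of the bound. Once a kernel is compute-limited, adding bandwidth alone no longer raises the bound, which matches the general direction of the "insufficient gain by 2x B/F" annotation.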
Summary of Post-K Development
Developing high performance, scalable, custom CPU cores
•SMaC architecture with an assistant core for scalable performance
•ARM instruction set architecture, SVE, as a standard architecture
•ARM standard frameworks, SBSA, etc., for compatibility among platforms
Keeping performance balanced and advancing preceding machines
•Higher performance and higher data bandwidth
Advanced system software and applications
•Co-design scheme with application developers
•FUJITSU optimizing compilers for Post-K
•Performance balance is a key for application speedup
Post-K will meet requirements & be valuable for science and industries
Post-K: Succeeding the K computer Heritage