
System Software for Armv8-A with SVE

Yutaka Ishikawa, Leader of FLAGSHIP2020 Project, RIKEN Center for Computational Science

9:00–9:25, 14th of January, 2019

Open Source HPC Collaboration on Arm Architecture, Linaro workshop, Guangzhou, China

Background: Flagship2020

• Missions
  • Building the Japanese national flagship supercomputer, post K, and
  • Developing a wide range of HPC applications, running on post K, in order to solve social and scientific issues in Japan
• Project organization
  • Post K Computer development
    • RIKEN AICS is in charge of development.
    • Fujitsu is the vendor partner.
    • International collaborations: DOE, CEA, JLESC (NCSA, ANL, UTK, JSC, BSC, INRIA, RIKEN)
  • Applications
    • The government selected 9 social & scientific priority issues and 4 exploratory issues, and their R&D organizations.


Target Applications

Program: Brief description
① GENESIS: MD for proteins
② Genomon: Genome processing (genome alignment)
③ GAMERA: Earthquake simulator (FEM in unstructured & structured grid)
④ NICAM+LETKF: Weather prediction system using big data (structured grid stencil & ensemble Kalman filter)
⑤ NTChem: Molecular electronic structure calculation
⑥ FFB: Large Eddy Simulation (unstructured grid)
⑦ RSDFT: An ab-initio program (density functional theory)
⑧ Adventure: Computational mechanics system for large scale analysis and design (unstructured grid)
⑨ CCS-QCD: Lattice QCD simulation (structured grid Monte Carlo)

Courtesy of FUJITSU LIMITED

Background: Post-K CPU A64FX

Architecture: Armv8.2-A SVE (512 bit SIMD)
Core: 48 cores for compute and 2/4 for OS activities; DP: 2.7+ TF, SP: 5.4+ TF, HP: 10.8 TF
Cache: L1D: 64 KiB, 4 way, 230 GB/s (load), 115 GB/s (store); L2: 8 MiB, 16 way, 115 GB/s (load), 57 GB/s (store)
Memory: HBM2 32 GiB, 1024 GB/s
Interconnect: TofuD (28 Gbps x 2 lane x 10 port)
I/O: PCIe Gen3 x 16 lane
Technology: 7nm FinFET
Performance: Stream triad: 830+ GB/s; Dgemm: 2.5+ TF (90+% efficiency)

ref. Toshio Yoshida, “Fujitsu High Performance CPU for the Post-K Computer,” IEEE Hot Chips: A Symposium on High Performance Chips, San Jose, August 21, 2018.

CMG: CPU Memory Group
NOC: Network On Chip

Background: An Overview of Post-K Hardware

● Compute Node, Compute + I/O Node connected by 6D mesh/torus Interconnect
● 3-level hierarchical storage system
  ● 1st Layer
    ● Cache for global file system
    ● Temporary file systems
      - Local file system for compute node
      - Shared file system for a job
  ● 2nd Layer
    ● Lustre-based global file system
  ● 3rd Layer
    ● Storage for archive

An Overview of System Software Stack

Ease of use is one of our KPIs (Key Performance Indicators)

Providing a wide range of applications/tools/libraries/compilers

Linux Distribution Eco-System

[Figure: system software stack. Components shown:]
Parallel Programming Environments: XMP, FDPS, …
Fortran, C/C++, OpenMP, Java, …
Math libraries
Tuning and Debugging Tools
Application-oriented File I/O
Communication: MPI
Low Level Communication
Process/Thread: PIP
Batch Job System
Hierarchical File System
Parallel File System
File I/O for Hierarchical Storage: LLIO
Multi-Kernel System: Linux and light-weight kernel (McKernel)
Armv8 + SVE

Post-K Programming Environment

● Programming Languages and Compilers provided by Fujitsu
  ● Fortran2008 & Fortran2018 subset
  ● C11 & GNU and Clang extensions
  ● C++14 & C++17 subset and GNU and Clang extensions
  ● OpenMP 4.5 & OpenMP 5.0 subset
  ● Java
  GCC, LLVM, and Arm compiler will also be available
● Parallel Programming Language & Domain Specific Library provided by RIKEN
  ● XcalableMP
  ● FDPS (Framework for Developing Particle Simulator)
● Process/Thread Library provided by RIKEN
  ● PiP (Process in Process)
● Script Languages provided by Linux distributor
  ● E.g., Python+NumPy, SciPy
● Communication Libraries
  ● MPI 3.1 & MPI 4.0 subset
    ● Open MPI base (Fujitsu), MPICH (RIKEN)
  ● Low-level Communication Libraries
    ● uTofu (Fujitsu), LLC (RIKEN)
● File I/O Libraries provided by RIKEN
  ● pnetCDF, DTF, FTAR
● Math Libraries
  ● BLAS, LAPACK, ScaLAPACK, SSL II (Fujitsu)
  ● EigenEXA, Batched BLAS (RIKEN)
● Programming Tools provided by Fujitsu
  ● Profiler, Debugger, GUI

Scalable is also running on Oakforest-PACS, which is operated by the University of Tsukuba and the University of Tokyo.

Open Source Management Tools

● EasyBuild
  ● Used at CEA
  ● RIKEN is evaluating it. As an example, CAFFE, a deep learning tool, has been ported to an Arm machine using EasyBuild.
  ● CAFFE consists of several open source packages:
    - boost, blas, cmake, gflags, google (glog, googletest, snappy, leveldb, protobuf), lmdb, opencv
● Spack
  ● Used in the ECP project
  ● RIKEN is also evaluating Spack.

IHK/McKernel developed at RIKEN

● Partition resources (CPU cores, memory)
  ● Full Linux kernel on some cores
    ● System daemons and in-situ non HPC applications
    ● Device drivers
  ● Light-weight kernel (LWK), McKernel, on other cores
    ● HPC applications
● IHK (Interface for Heterogeneous Kernels): Linux kernel module
  ● Allows dynamic partitioning of node resources: CPU cores, physical memory, …
  ● Enables management of LWKs (assign resources, load, boot, destroy, etc.)
  ● Provides inter-kernel communication, messaging and notification
● McKernel: Light-weight kernel
  ● Is designed for HPC: noiseless, simple
  ● Implements only performance sensitive system calls, e.g., process and memory management; the rest are offloaded to Linux
  ● Executes the same binary as Linux without any recompilation
● IHK/McKernel runs on
  ● Intel Xeon and Xeon Phi
  ● Fujitsu FX10 and FX100 (Experiments)

[Figure: node resources (cores, memory, interrupts) split into two partitions. Labels: Linux (system daemons, in-situ non HPC application, general scheduler, complex memory management, TCP stack, device drivers, VFS, file system drivers); Thin LWK (HPC applications, process/thread management, very simple memory management); Linux API (glibc, /sys/, /proc/).]

How to deploy IHK/McKernel

• Linux Kernel with IHK kernel module is resident
  – Daemons for the job scheduler, etc. run on Linux
• McKernel is dynamically reloaded (rebooted) by IHK for each application
• No hardware reboot

[Figure: deployment timeline]
App A, requiring LWK-without-scheduler, is invoked → Finish
App B, requiring LWK-with-scheduler, is invoked → Finish
App C, using full Linux capability, is invoked → Finish

miniFE (CORAL benchmark suite)

● Conjugate gradient - strong scaling
● Up to 3.5X improvement (Linux falls over..)

Results using the same binary

Oakforest-PACS supercomputer, 25 PF in peak, at JCAHPC organized by U. of Tsukuba and U. of Tokyo

Balazs Gerofi, Rolf Riesen, Robert W. Wisniewski and Yutaka Ishikawa: “Toward Full Specialization of the HPC System Software Stack: Reconciling Application Containers and Lightweight Multi-kernels”, International Workshop on Runtime and Operating Systems for Supercomputers (ROSS), 2017

Support of Software Development/Porting for Post-K

[Figure: timeline from CY2017 to CY2021]
Project phases: Design and Implementation; Manufacturing; Installation and Tuning; Operation
Timeline rows: Specification; Optimization Guidebook; RIKEN Performance Evaluation Environment; Early Access Program
Timeline labels: Armv8-A + SVE Overview; Detailed hardware info.; Publishing Incrementally; Performance estimation tool using FX100; RIKEN Simulator

• CY2018 Q2: the optimization guidebook is published incrementally
• CY2020 Q2: the early access program starts
• CY2021 Q1/Q2: general operation starts

Contribution to Arm HPC (Armv8-A SVE) Ecosystem

Concluding Remarks


https://postk-web.r-ccs.riken.jp/faq.html

BACKUP

MPI Communication implemented using Tofu2 and TofuD

● Tofu2 and TofuD offloading mechanism

● Send commands (PUT, GET, NOP) are posted to a command queue, and the Tofu network interface processes the posted commands.

● Tofu2 has two packet processing modes: Normal Mode and Session Mode. In Session Mode, a special register called the Scheduling Pointer plays an important role.

● Scheduling Pointer: Commands enqueued in the command queue are processed until reaching the entry pointed to by the Scheduling Pointer. The Scheduling Pointer is updated by a packet sent by a remote node.

Evaluation: Latency

    /* Persistent neighborhood all-to-all; argument list as standardized in MPI 4.0 */
    MPI_Neighbor_alltoall_init(sbuf, count, MPI_DOUBLE, rbuf, count, MPI_DOUBLE,
                               comm, MPI_INFO_NULL, &req);
    for (i = 0; …….) {
        /* Computation */
        MPI_Start(&req);
        /* Computation */
        MPI_Wait(&req, &stat);
    }

Tofu2 Offload
● Direct Transfers between User Buffers
● Completely Asynchronous Progression

Persistent pt2pt. (≒ Non-blocking pt2pt.)

[Figure: two plots of latency [us] versus message size [Bytes]]

• The offload version is faster.

• Unlike the point-to-point version, the offload version does not need CPU cycles for communication progress. Thus, the offload version realizes overlap of computation and communication.

• Masayuki Hatanaka, Masamichi Takagi, Atsushi Hori, Yutaka Ishikawa, “Offloaded MPI persistent collectives using persistent generalized request interface,” Proceedings of the 24th European MPI Users' Group Meeting (EuroMPI2017), ACM, 2017.

• Yoshiyuki Morie, Masayuki Hatanaka, Masamichi Takagi, Atsushi Hori, Yutaka Ishikawa, “Prototyping of Offloaded Persistent Broadcast on Tofu2 Interconnect,” SC17, 2017 (poster)

• Yoshiyuki Morie, Masayuki Hatanaka, Masamichi Takagi, Atsushi Hori, Yutaka Ishikawa, “Evaluation of Intra Node of Persistent Collective Communication using NIC Offloading,” SWOPP'18, HPC165, 2018. (In Japanese)

OSS Survey (9 priority issues developers)

● Application
  ● MODYLAS, USQCD, OpenFOAM
● Library
  ● NumPy, SciPy, pysam, FFTW, LAPACK95, lapack, blas, Metis, ParMetis, HDF5, NetCDF, NetCDF-fortran, PnetCDF, scalasca, SCOTCH, Zoltan, openmpi1.8, openmpi1.10, mpich2-1.4.1, boost, FFTE, PETSc/SLEPc, Elemental, BWA, Star, Blat, TopHat, TopHat2, MapSplice2, MPDyn2, ELPA, Trilinos, Eigen3, mesa, MesaGLUT, libxml2, C-LIME, EigenExa
● Tool/Visualization Tool
  ● git, git-flow, gnuplot, Paraview, VisIT, ImageMagick, svn, Samtools, bedtools, Biobambam, Picard, GMT, GrADS, HDF-EOS, wgrib, GRIB API, Climate Data Operators
● Build tool
  ● cmake, GNU Autotools, automake, autoconf, gcc, gfortran, C++, libtool
● Shell script / Programming language / Script language
  ● python2, python3, perl5, R, Ruby2, zsh, ksh, NCAR Command Language

OSS Survey (K computer users)

● Application
  ● ABINIT-MP, AkaiKKR, bedtools, Biobambam, BWA, CUBE, ERmod, fdps, FFV-C, FrontFlow/Red, FrontISTR, GAMESS, GENESIS, GROMACS, HIVE, LAMMPS, MapSplice2, MODYLAS, NEURON, octa, OpenFOAM, PBVR, Picard, PIMD, quantum ESPRESSO, rDock, Samtools, SCALE, Star, TopHat, TopHat 2, WHEEL, xTAPP
● Library
  ● FFTW, matplotlib (python), beautiful soup (python), metis, ParMETIS, NetCDF4, HDF5, NuSDAS1.3, octa, fdps, Zoltan, cgns, Polylib, libsim
● Visualization tool
  ● gnuplot, PBVR, VTK, OSMesa
● Tool
  ● GNU utils, zlib, anaconda (python), itk, PAPI, PMlib, Szip, zip, TextParser, fpzip
● Build tool
  ● make, autoconf, cmake
● Shell script / Programming language / Script language
  ● bash, curl, python, ruby
● ISV
  ● ABAQUS, Advance, AMBER, Ansys fluent, Gaussian, FLUENT, SCRYU/Tetra, LS-DYNA, VPS solver (PAM-CRASH), Helyx, HEETAH, iconCFD, LaBS, JMAG, MIZUHO, NuFD, VASP, VSOP