
SC19: The Most Significant Bits

Martin Thompson

January 2020

Supercomputing Conferences

SC19 - Denver, CO


Rank System Site Launch Cores Rmax (PF/s)

1 Summit ORNL 2018 2,397,824 143.50

2 Sierra LLNL 2018 1,572,480 94.64

3 Sunway TaihuLight NSC Wuxi 2016 10,649,600 93.02

4 Tianhe-2A NSC Guangzhou 2013 4,981,760 61.44

5 Piz Daint CSCS 2012 387,872 21.23

6 Trinity LANL 2015 979,072 20.16

7 ABCI AIST 2018 391,680 19.88

8 SuperMUC-NG LRZ 2018 305,856 19.48

9 Titan ORNL 2012 560,640 17.59

10 Sequoia LLNL 2012 1,572,864 17.17

Top 500 – November 2018

HPL: Solve dense system of linear equations with LU decomposition
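As a reminder of what HPL actually measures, here is a minimal NumPy sketch of LU factorisation with partial pivoting plus the triangular solves. This covers the numerics only; HPL's real difficulty is the blocked, distributed-memory implementation. `lu_factor` and `lu_solve` are illustrative names, not HPL's API.

```python
import numpy as np

def lu_factor(A):
    """LU factorisation with partial pivoting: returns P, L, U with P @ A = L @ U."""
    n = len(A)
    U = A.astype(float).copy()
    L = np.eye(n)
    P = np.eye(n)
    for k in range(n - 1):
        # Partial pivoting: bring the largest |entry| in column k onto the diagonal.
        p = k + int(np.argmax(np.abs(U[k:, k])))
        if p != k:
            U[[k, p], k:] = U[[p, k], k:]
            L[[k, p], :k] = L[[p, k], :k]
            P[[k, p], :] = P[[p, k], :]
        # Gaussian elimination below the pivot.
        for i in range(k + 1, n):
            L[i, k] = U[i, k] / U[k, k]
            U[i, k:] -= L[i, k] * U[k, k:]
    return P, L, U

def lu_solve(A, b):
    """Solve A x = b: factorise, then forward- and back-substitute."""
    P, L, U = lu_factor(A)
    n = len(b)
    pb = P @ b
    y = np.zeros(n)
    for i in range(n):                      # L y = P b (L is unit lower-triangular)
        y[i] = pb[i] - L[i, :i] @ y[:i]
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):          # U x = y (back substitution)
        x[i] = (y[i] - U[i, i + 1:] @ x[i + 1:]) / U[i, i]
    return x
```

HPL scores a system by timing this factorisation on a matrix sized to fill the machine's memory, which is why Rmax tracks dense floating-point throughput rather than memory-bound performance.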

Rank System Site Launch Cores Rmax (PF/s)

1 Summit ORNL 2018 2,414,592 148.60

2 Sierra LLNL 2018 1,572,480 94.64

3 Sunway TaihuLight NSC Wuxi 2016 10,649,600 93.02

4 Tianhe-2A NSC Guangzhou 2013 4,981,760 61.44

5 Frontera TACC 2019 448,448 23.52

6 Piz Daint CSCS 2012 387,872 21.23

7 Trinity LANL 2015 979,072 20.16

8 ABCI AIST 2018 391,680 19.88

9 SuperMUC-NG LRZ 2018 305,856 19.48

10 Lassen LLNL 2018 288,288 18.20

Top 500 – June 2019

Titan (ORNL) and Sequoia (LLNL) drop out of the top 10

Rank System Site Launch Cores Rmax (PF/s)

1 Summit ORNL 2018 2,414,592 148.60

2 Sierra LLNL 2018 1,572,480 94.64

3 Sunway TaihuLight NSC Wuxi 2016 10,649,600 93.02

4 Tianhe-2A NSC Guangzhou 2013 4,981,760 61.44

5 Frontera TACC 2019 448,448 23.52

6 Piz Daint CSCS 2012 387,872 21.23

7 Trinity LANL 2015 979,072 20.16

8 ABCI AIST 2018 391,680 19.88

9 SuperMUC-NG LRZ 2018 305,856 19.48

10 Lassen LLNL 2018 288,288 18.20

Top 500 – November 2019

Titan (ORNL) and K Computer (RIKEN) are decommissioned


47 Gadi (Phase 1) NCI 2019 75,576 4.41

239 Raijin NCI 2013 87,224 1.68

Top 500 – November 2019

Exascale?

• Aurora (ANL) 2021

• Cray/Intel

• Sapphire Rapids + Ponte Vecchio

• Frontier (ORNL) 2021

• Cray/AMD

• EPYC + Radeon

• El Capitan (LLNL) 2022

• Cray

• Tianhe-3 (NSC Tianjin) 2020

• NUDT

• ARM-based? + MT-3000 + 400Gb/s

• Shuguang (NSC Shanghai) 2021?

• Sugon

• Licensed AMD EPYC clone

• Liquid immersion

• Sunway? (NSC Jinan) 2021?

• ShenWei (256C)

• No accelerator

• Fugaku (RIKEN) 2021

• Fujitsu

• A64FX (ARM)

• LUMI (CSC Finland) 2020

• Leonardo (CINECA Italy) 2020

• MareNostrum 5 (BSC Spain) 2020

• 1st Gen ARM/RISC-V 2021

• 3 exascale with 2nd Gen 2023

Fujitsu FX1000

• 4 shelves

• 24 blades per shelf

• 2 x 1S nodes per blade

• A64FX 48C CPU

• 32GB HBM2 memory

• No accelerators

• 384 nodes per rack

• 1 PF per rack

© 2019 FUJITSU

CMU: CPU Memory Unit

A64FX CPU x2 (Two independent nodes)

QSFP28 x3 for Active Optical Cables

Single-side blind mate connectors of signals & water

~100% direct water cooling

[Diagram: CMU water loops, electrical signal paths, and QSFP28 X/Y/Z AOC links (source: Fujitsu, SCAsia2019, March 12)]

Rank System Site Launch Cores GFLOPS/watt

1 Micro-Fugaku Fujitsu 2019 36,864 16.88

2 NA-1 PEZY 2019 1,271,040 16.26

3 AiMOS RPI 2019 130,000 15.77

4 Satori MIT 2019 23,040 15.57

5 Summit ORNL 2018 2,397,824 14.67

6 ABCI AIST 2018 391,680 14.42

7 MareNostrum P9 CTE BSC 2018 18,360 14.13

8 TSUBAME3.0 GSIC 2017 135,828 13.70

9 PANGEA III Total 2019 291,024 13.07

10 Sierra LLNL 2018 1,572,480 12.72

Green 500 – November 2019

Prototype Fugaku ARM-based system straight into #1 spot
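Since the Green 500 ranks by efficiency rather than raw speed, each system's power draw is implied by dividing its Rmax by its GFLOPS/watt figure. A quick sanity check using the Summit row above (Summit's Green 500 entry uses its 143.50 PF/s configuration):

```python
# Implied power draw: watts = Rmax / (GFLOPS per watt).
rmax_gflops = 143.50e6      # Summit Rmax: 143.50 PF/s = 143.50e6 GFLOP/s
gflops_per_watt = 14.67     # Summit's Green 500 efficiency
power_mw = rmax_gflops / gflops_per_watt / 1e6
print(f"Summit draws roughly {power_mw:.2f} MW")   # ~9.78 MW
```

The same arithmetic puts the top-ranked Micro-Fugaku prototype at well under a megawatt, which is how a 36,864-core system can lead this list.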


IBM Power9 systems well represented this year

• Build computational cluster with 3kW power budget

• Benchmarks: HPL, HPCG and IO-500

• Applications: 3 known apps; 1 mystery app

• 16 teams of undergraduate or high school students

• China, Estonia, Germany, Poland, Singapore, Switzerland, Taiwan, USA

• Tsinghua University (China) 9th Student Cluster Competition win!

• 2 AMD and 14 Intel systems

• 14 systems used NVIDIA V100s (avg 3 per node)

Student Cluster Competition

Tutorials

• Containers

• OpenMP

• Quantum Computing

• Parallel Computing 101

• Better Scientific Software

• Data Compression

• I/O Frameworks

• GPU Programming

• Deep Learning

• MPI

• Performance Analysis

• Performance Tuning

• High Speed Networks

• HPC Procurements

• Managing S/W Complexity

• Secure Coding

Intel CPU

• Cooper Lake

• 14nm; 56 cores; Q2 2020

• 8-channel DDR4

• bfloat16

• Ice Lake

• 10nm; 38 cores; Q3 2020

• 2nd Gen Optane

• PCIe Gen 4

• Sapphire Rapids

• 10nm; Late 2021

• PCIe Gen 5
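The bfloat16 format listed under Cooper Lake keeps float32's 8-bit exponent but truncates the mantissa from 23 to 7 bits, i.e. it is the top 16 bits of a float32. A sketch of the conversion in Python (`to_bfloat16` is a hypothetical helper for illustration, not an Intel API; NaN/infinity handling is omitted):

```python
import struct

def to_bfloat16(x: float) -> float:
    """Round a float32 to bfloat16 precision (round-to-nearest-even)."""
    bits = struct.unpack('<I', struct.pack('<f', x))[0]   # raw float32 bits
    # Round-to-nearest-even on the 16 mantissa bits being dropped.
    bits += 0x7FFF + ((bits >> 16) & 1)
    bits &= 0xFFFF0000                                    # keep the top 16 bits
    return struct.unpack('<f', struct.pack('<I', bits))[0]

print(to_bfloat16(3.14159265))   # 3.140625 (7-bit mantissa)
```

The appeal for deep learning is that bfloat16 halves memory and bandwidth versus float32 while keeping the same dynamic range, unlike IEEE fp16.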

Intel GPU

• Xe architecture

• Xe LP; 20W

• Xe HP; 250W

• Xe HPC; 500W

• 1,000s execution units (8T)

• Focus FP64 performance

• Ponte Vecchio

• 7nm; 2021

Intel oneAPI

• Compute Taxonomy

• Scalar – CPU

• Vector – GPU

• Matrix – AI

• Spatial – FPGA

• Aim is to write code once and cross-compile for each target

• Software Stack

• Distributed Parallel C++

• C++ and SYCL

• Domain specific libraries

• MKL, TBB, DNN,…

• Migration tools

• CUDA → oneAPI

• Analysis and Debug tools

• VTune profiler

• Trace analyzer

Dell DSS8440

Dell DSS8440 + Graphcore C2

• 1,216 IPU cores

• 300MB memory

• 45TB/s mem b/w

• 8TB/s internal comms

• 320GB/s inter-chip

• 2 x IPU per C2 card

• 8 x C2 cards per server

• Microsoft Azure
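The per-IPU figures above roll up to per-server totals, assuming (as listed) 2 IPUs per C2 card and 8 C2 cards per DSS8440 server:

```python
# Per-server totals implied by the per-IPU figures above.
cores_per_ipu = 1_216       # IPU cores per chip
sram_per_ipu_mb = 300       # on-chip memory per IPU (MB)
ipus_per_server = 2 * 8     # 2 IPUs per C2 card x 8 cards per server

print(cores_per_ipu * ipus_per_server)    # 19,456 IPU cores per server
print(sram_per_ipu_mb * ipus_per_server)  # 4,800 MB of on-chip memory
```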

Cerebras

• 21.5cm x 21.5cm wafer

• 400,000 cores (SLAC: Sparse Linear Algebra Compute)

• 18GB memory

• 57x larger than V100

• 3,000x more memory

• 10,000x more mem b/w

• Award for innovative AI hardware

Cerebras CS-1
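The slide's comparison ratios can be reproduced from the die sizes. The wafer numbers are from the slide above; the V100 figures (815 mm² die, 6 MB of on-chip L2 SRAM) are NVIDIA's published specs and are assumptions here:

```python
# "57x larger than V100": wafer area vs a V100 die.
wafer_mm2 = 215 * 215      # 21.5 cm x 21.5 cm wafer-scale engine
v100_die_mm2 = 815         # published V100 die size (assumption)
print(round(wafer_mm2 / v100_die_mm2))    # ~57

# "3,000x more memory": on-wafer SRAM vs V100 on-chip SRAM.
cs1_sram_mb = 18 * 1024    # 18 GB of on-wafer memory
v100_sram_mb = 6           # V100 L2 cache (assumption)
print(round(cs1_sram_mb / v100_sram_mb))  # ~3072
```

The "10,000x more mem b/w" claim compares on-wafer SRAM bandwidth against V100's off-chip HBM2, so it is a different kind of comparison than the area and capacity ratios.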