CSCS Site Update HPC Advisory Council Workshop 2018 Colin McMurtrie, Associate Director and Head of HPC Operations. 9th April 2018

Page 1

CSCS Site Update

HPC Advisory Council Workshop 2018

Colin McMurtrie,

Associate Director and Head of HPC Operations.

9th April 2018

Page 2

The Swiss National Supercomputing Centre

Driving innovation in computational research in Switzerland

Established in 1991 as a unit of ETH Zurich

>90 motivated staff from ~15 nations

Now with a division in Zurich working on Scientific Software and Libraries (aka SSL)

Develops and operates the key supercomputing capabilities required to solve problems important to science and/or society

Leads the national strategy for High-Performance Computing and Networking (HPCN) that was passed by the Swiss Parliament in 2009

Has had a dedicated User Laboratory for supercomputing since 2011

Research infrastructure funded by the ETH Domain on a programmatic basis

~1000 users, 200 projects

CSCS Site Update - HPCACW 2018 2

Page 3


Flexible Facilities Infrastructure

Sophisticated Infrastructure

The basement of the computer building houses the “resource deck” containing the basic infrastructure: 960 batteries for the emergency power supply as well as the electricity and water supply systems. Thick cables deliver the power to the computer centre at a medium voltage of 16,000 volts, where braided copper cables, as thick as an arm, distribute it to the twelve currently installed transformers.

The transformers convert the power to 400 volts, before it is taken via power rails to the middle floor, the “installation deck”, and finally from there to the supercomputers. The existing power supply allows the computer centre to operate computers with an output of about 11 megawatts and this could even be extended to operate up to 25 megawatts.

The lake water pipe, measuring 80 centimetres in diameter, enters the building on the south side. Alongside it, a pipe of the same size leads back to the lake. Between the incoming and outgoing pipes, there is a sophisticated cooling system in operation: the lake water and the internal cooling water circuit meet in heat exchangers which are as tall as a person. There the low temperature of the lake water is transferred to the internal cooling circuit. This delivers water at about 8 or at most 9 degrees Celsius to the supercomputers to cool them. By the time the water has passed through this first cooling circuit, it is eight degrees warmer. However, this water is still cold enough to cool the air in the housings of lower-density computers and hard discs. To this end, it is sent through another heat exchanger that is connected to the medium-temperature cooling circuit. This allows one pumping operation to supply two cooling circuits that cool several systems. This, too, saves energy.

A separate storey instead of a raised floor

From the “resources deck”, the processed power and water are sent to the “distribution deck”, the installation floor located directly above. In most conventional computer centres, the installation deck consists of a raised floor measuring 40 to 60 centimetres in height through which kilometres of cable are fed. The cabinets for the power distribution units (PDU) are located in the computer room and so limit the options for installing supercomputers.

In order to avoid this limitation in the new CSCS building, the raised floor has been replaced by a five-metre high storey which houses the entire technical infrastructure, also called the secondary distribution system. The decision to opt for this construction was made on the basis of experience in the previous computer centre in Manno where the raised floor was barely able to accommodate the installation of new computers.

The computer building (left) and the office block (right) are connected by a bridge and an underground tunnel. (Picture: CSCS)

Pump once to cool twice

Once the water has passed through this first cooling circuit, it has been heated up by eight degrees. The now 16 to 17 °C water is sent through a further heat exchanger, connected to a second cooling circuit. This mid-temperature circuit cools the air in the housings for the computers and hard drives of lesser energy density, which can therefore be cooled with water that is less cold. This means that with one pumping operation, cold water is supplied to two circuits to cool two types of systems.

The cold water pipe is designed to cool supercomputers of up to 14 megawatts on the first cooling circuit. The second circuit can cool a further 7 megawatts of computers. The more the second circuit is used, the higher the waste heat absorbed by the water and so the more useful it is to the local industrial works who will be able to use it.
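The figures above can be tied together with a simple heat balance, Q = m_dot * c_p * delta_T. The following is a rough sketch only: it assumes the full 14 MW design load is carried away by water that warms by exactly eight degrees, and it uses textbook values for the specific heat and density of water. None of the function or constant names come from the brochure.

```python
# Heat balance for a water cooling circuit: Q = m_dot * c_p * delta_T.
# c_p and density are textbook values for water, not figures from the brochure.
SPECIFIC_HEAT_WATER_J_PER_KG_K = 4186.0
WATER_DENSITY_KG_PER_M3 = 1000.0


def required_mass_flow_kg_s(heat_load_w: float, delta_t_k: float) -> float:
    """Mass flow needed to remove heat_load_w with a temperature rise of delta_t_k."""
    if delta_t_k <= 0:
        raise ValueError("delta_t_k must be positive")
    return heat_load_w / (SPECIFIC_HEAT_WATER_J_PER_KG_K * delta_t_k)


# Figures quoted in the text: 14 MW design load, eight-degree rise on the first circuit.
first_circuit_flow_kg_s = required_mass_flow_kg_s(14e6, 8.0)
first_circuit_flow_m3_s = first_circuit_flow_kg_s / WATER_DENSITY_KG_PER_M3

print(f"{first_circuit_flow_kg_s:.0f} kg/s, about {first_circuit_flow_m3_s:.2f} m^3/s")
```

Under these assumptions the first circuit would need a flow on the order of 400 litres per second at full design load. The real figure depends on the actual load and temperature rise.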

Before the lake water returns to the lake, it passes through a stilling basin which can hold 120 cubic metres. The basin collects the water and makes sure that it flows freely down the return pipe back to the lake at a constant pressure and with no need for further power to be used. On the contrary, the plan is to use the energy generated as it falls to produce electricity. That is why connections for a microturbine have been provided in the pumping station.

So as not to affect the ecological balance of the lake, the water going back into the lake must never exceed 25 degrees Celsius. To ensure that this is always the case, a back-mixing funnel has been fitted which will add cold water if necessary.


The suction strainers for the lake water pipe, just before they were lowered 45 metres into Lake Lugano. (Picture: CSCS)

Only a trapdoor indicates the existence of the pumping station below the surface of Parco Ciani to visitors. (Picture: CSCS)

The water pipe (green) stretches 2.8 km across the city to connect the lake (right) with the computing centre. On its way it crosses under the Cassarate river twice.

Via Trevano 131

6900 Lugano

Switzerland

Tel +41 (0)91 610 82 11

Fax +41 (0)91 610 82 82

www.cscs.ch

© CSCS 2012

• Flexible Facilities Infrastructure is important since we

cannot be certain about future system requirements

• CSCS’ Data Centre provides:

• power/cooling: 12 MW

• upgradable to 25 MW

• Free cooling with water from Lake Lugano

• Current Power Usage Effectiveness (PUE) = 1.2
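PUE is the ratio of total facility power to the power delivered to the IT equipment. As a small illustration of what a PUE of 1.2 means, the sketch below uses a hypothetical IT load of 10 MW. That load is an assumption chosen for round numbers and is not a figure from the slide, which does not say whether its 12 MW refers to IT load or to total facility power.

```python
def facility_power_mw(it_power_mw: float, pue: float) -> float:
    """Total facility power implied by an IT load and a PUE value."""
    if pue < 1.0:
        raise ValueError("PUE cannot be below 1.0")
    return it_power_mw * pue


def overhead_power_mw(it_power_mw: float, pue: float) -> float:
    """Power spent on cooling, power distribution losses and other non-IT loads."""
    return facility_power_mw(it_power_mw, pue) - it_power_mw


PUE = 1.2                   # value quoted on the slide
HYPOTHETICAL_IT_LOAD = 10   # MW, assumed for illustration only

total = facility_power_mw(HYPOTHETICAL_IT_LOAD, PUE)
overhead = overhead_power_mw(HYPOTHETICAL_IT_LOAD, PUE)
print(f"total {total:.1f} MW, overhead {overhead:.1f} MW")
```

In other words, at a PUE of 1.2 every megawatt of IT load costs roughly an additional 0.2 MW of infrastructure overhead.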

Page 4

Flagship Supercomputer “Piz Daint”

Cray XC40 / Cray XC50

Operational since April 2013

Extension + upgrade to hybrid in late 2013

Upgrade to new GPU in 2016

Compute nodes

5'320 dual-socket nodes with Intel Xeon CPU and NVIDIA Tesla P100 GPU

1'815 dual-socket nodes with Intel Xeon CPUs

Total system memory 521 TB RAM

Peak Performance

Hybrid partition 25.3 Petaflop/s

Multicore partition 1.7 Petaflop/s

Measured Linpack performance of 19.59 Pflop/s

Most powerful petascale supercomputer in the Top10 of the Green500
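The performance figures on this slide can be related with a little arithmetic. The sketch below assumes that the Linpack run was measured on the hybrid (GPU) partition only. The slide does not state this, so the efficiency figure should be read as indicative.

```python
# Figures quoted on the slide, in Pflop/s.
HYBRID_PEAK_PFLOPS = 25.3
MULTICORE_PEAK_PFLOPS = 1.7
LINPACK_PFLOPS = 19.59

# Combined theoretical peak of both partitions.
combined_peak = HYBRID_PEAK_PFLOPS + MULTICORE_PEAK_PFLOPS

# Assumption: Linpack was run on the hybrid partition, so compare against its peak.
linpack_efficiency = LINPACK_PFLOPS / HYBRID_PEAK_PFLOPS

print(f"Combined peak: {combined_peak:.1f} Pflop/s")
print(f"Linpack efficiency vs hybrid peak: {linpack_efficiency:.1%}")
```

Under that assumption the measured Linpack result corresponds to a little over three quarters of the hybrid partition's theoretical peak.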


Page 5

Evolution of the Flagship System - Piz Daint

2013

5,272 hybrid nodes (Cray XC30)
Nvidia Tesla K20x
16x PCIe 2.0
Intel Xeon E5-2670 (SB)
6GB GDDR5
32GB DDR3

No multi-core partition

Cray Aries interconnect
16x PCIe 3.0
Dragonfly topology
~33TB/s bisection bandwidth
Fully provisioned for 28 cabinets

Cray Sonexion Lustre File System
2.7PB Snx1600

Slurm WLM
Slurm + ALPS

2018

5320 hybrid nodes (Cray XC50)
Nvidia Tesla P100
16x PCIe 3.0
Intel Xeon E5-2690 v3 (HSW)
16GB HBM2
64GB DDR4

1815 multi-core nodes (Cray XC40)
Dual socket Intel Xeon E5-2695 v4 (BDW)
64GB and 128GB DDR4

Cray Aries interconnect
16x PCIe 3.0
~36TB/s bisection bandwidth
Public IP routing to CSCS network

Sonexion Lustre file system
9.6PB Snx3000
2.7PB Snx1600
External GPFS on selected nodes

Slurm WLM
Native Slurm (no ALPS)

Page 6

Piz Daint – A Consolidated HPC Environment


Computing

Visualisation

Data Analysis

2013

Pre- & post-processing

Data Mover

Data Warp

Machine Learning

Deep Learning

Support for Docker

2017

Page 7

Data Centre Ecosystem

Dedicated Customer Systems/Platforms

Data Centre Network

(IB, Ethernet)

CSCS LAN

TSM + Tape Library

On-site Cloud IaaS

Internet Access (via SWITCHlan; 100 Gbit/s)


Site-wide Storage

Consolidated HPC Environment

Page 8

Infrastructure-as-a-Service (IaaS)

“IaaS is a service model that delivers computer infrastructure on an outsourced basis to support enterprise operations. Typically, IaaS provides hardware, storage, servers and data center space or network components; it may also include software.” Source: Techopedia.com


Legend: You = Platform Provider; Other = Infrastructure Provider

Page 9

IaaS - Why Use It?


• Enables the hosting of Domain-specific portals that are managed by external entities

• Separation of concerns means we don’t need to get involved with the details of what powers the Portal(s)

• Dynamic Provisioning possible

• But the Web Services themselves need to be scalable in such an environment

• Challenges:

• Infrastructure provider has no visibility into what is happening within the Platform(s)

Example of a Web Service

• Arrows denote functional dependency

IaaS VM Infrastructure

Identity Service

OIDC Service (MITREid Connect based, Java)

OIDC REST API

NGINX OIDC Extension (Lua)

RDBMS (PostgreSQL or MySQL)

Django ORM + Business logic

Django REST

NGINX Reverse Proxy Server (SSL + Caching)

MySQL DB

Log collection and monitoring services

Page 10

OpenStack IaaS Architecture Summary


Page 11

Production OpenStack Environment - Pollux

New system for generic Cloud Services

30 new servers
Directors, Controllers, Compute

Dedicated network
2 x 48 port 40 Gbit/s switches integrated into the CSCS network

Storage
~30 TB usable internal Ceph storage
External Swift-on-GPFS storage

Red Hat OpenStack Platform 11 (RHOSP11)

Integrated with CSCS AAI via Keycloak (RHSSO)

Firewall rules configured for Internet-facing services

Now hosting production platforms for third-party projects

Adding more HW

Augmenting the Cloud offerings (work on-going)

Keycloak/RHSSO

Controller nodes

Compute nodes

Storage nodes

Director node

40 Gb/s

SWIFT (IBM Spectrum Scale)

SAN Storage

Page 12

Bringing Cloud Technologies closer to Piz Daint

Docker Containers and Shifter


Production workflows are using Docker Containers with Shifter on Piz Daint:

1. Build and test containers with Docker on your laptop

• Convenience for the user

2. Run securely and with high performance on Piz Daint using Shifter

• Native GPU and MPI performance

• Improves parallel file system performance for some applications (e.g. Spark)

Current use cases:

• LHConCray

• Data Analytics frameworks (e.g. Spark)

• HBP Neurorobotics Platform

• >5000 container launches per day

• Others coming… watch this space
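The two-step workflow above (build with Docker on a laptop, run with Shifter on the HPC system) might be sketched as follows. This only assembles the command lines and does not execute them. The Shifter syntax shown (`shifterimg pull docker:<image>` and `shifter --image=docker:<image>`) follows the upstream NERSC Shifter documentation and is assumed here; the Shifter deployment on Piz Daint may use a different command-line interface. The image name, application command and task count are hypothetical.

```python
import shlex
from typing import List


def docker_build_cmd(image: str, context_dir: str = ".") -> List[str]:
    """Step 1 (laptop): build the container image from a Dockerfile."""
    return ["docker", "build", "-t", image, context_dir]


def docker_push_cmd(image: str) -> List[str]:
    """Step 1 (laptop): push the image to a registry the HPC system can reach."""
    return ["docker", "push", image]


def shifter_pull_cmd(image: str) -> List[str]:
    """Step 2 (HPC login node): import the image (NERSC-style syntax, assumed)."""
    return ["shifterimg", "pull", f"docker:{image}"]


def srun_shifter_cmd(image: str, app_argv: List[str], ntasks: int) -> List[str]:
    """Step 2 (HPC system): launch the application inside the container via Slurm."""
    return ["srun", "-n", str(ntasks), "shifter", f"--image=docker:{image}", *app_argv]


IMAGE = "myuser/myapp:latest"  # hypothetical image name

workflow = [
    docker_build_cmd(IMAGE),
    docker_push_cmd(IMAGE),
    shifter_pull_cmd(IMAGE),
    srun_shifter_cmd(IMAGE, ["./my_app", "--input", "data.h5"], ntasks=4),
]

for cmd in workflow:
    print(shlex.join(cmd))
```

Keeping the commands as argument lists rather than shell strings makes it straightforward to pass them to `subprocess.run` later without quoting problems.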

Page 13

Bringing it all together

1. IaaS relies on REST APIs to offer Services to Platforms

We collectively term these services Infrastructure Services

2. OpenStack is one way to provide IaaS and this can be done with satellite clusters

3. For Piz Daint we need other mechanisms to provide the necessary Infrastructure Services APIs

Work underway in this area

4. IaaS opens the door for Interactive Supercomputing

There are known Use Cases coming from various communities

Implies some policy-level changes (e.g. job preemption or node sharing for some queues)

5. This does NOT mean we will do away with our usual operations

The User Lab will remain our core business

These new services are a way to augment our capability and open doors to new communities
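To make point 1 concrete, the sketch below shows what a Platform's call to an Infrastructure Service REST API could look like, using the OpenStack Compute API's `POST /servers` request to create a virtual machine. The endpoint URL, token and resource IDs are hypothetical placeholders, and the request is only constructed, never sent. It is an illustration of the REST pattern, not a description of how the CSCS services are actually consumed.

```python
import json
import urllib.request


def build_create_server_request(
    compute_endpoint: str,
    token: str,
    name: str,
    image_id: str,
    flavor_id: str,
    network_id: str,
) -> urllib.request.Request:
    """Build (but do not send) an OpenStack Compute 'create server' request."""
    body = {
        "server": {
            "name": name,
            "imageRef": image_id,
            "flavorRef": flavor_id,
            "networks": [{"uuid": network_id}],
        }
    }
    return urllib.request.Request(
        url=f"{compute_endpoint.rstrip('/')}/servers",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", "X-Auth-Token": token},
        method="POST",
    )


# All values below are hypothetical placeholders.
request = build_create_server_request(
    compute_endpoint="https://cloud.example.org:8774/v2.1/",
    token="token-issued-by-the-identity-service",
    name="portal-frontend-01",
    image_id="11111111-1111-1111-1111-111111111111",
    flavor_id="2",
    network_id="22222222-2222-2222-2222-222222222222",
)

print(request.get_method(), request.full_url)
print(json.dumps(json.loads(request.data), indent=2))
```

Sending the request would be a matter of passing it to `urllib.request.urlopen`, after first obtaining a valid token from the identity service.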


Page 14

Q&A

Contact: [email protected]