
HermitCore – A Low Noise Unikernel for Extreme-Scale Systems

Stefan Lankes¹, Simon Pickartz¹, Jens Breitbart²

¹RWTH Aachen University, Aachen, Germany
²Bosch Chassis Systems Control, Stuttgart, Germany

Agenda

Motivation

Challenges for the Future Systems

OS Architectures

HermitCore Design

Performance Evaluation

Conclusion and Outlook


Motivation

Complexity of high-end HPC systems keeps growing:

Extreme degree of parallelism

Heterogeneous core architectures

Deep memory hierarchies

Power constraints

⇒ Need for scalable, reliable performance and the capability to rapidly adapt to new hardware

Applications have also become complex:

In-situ analysis, workflows

Sophisticated monitoring and tool support, etc.

Isolated, consistent simulation performance

⇒ Dependence on POSIX, MPI and OpenMP

Seemingly contradictory requirements. . .


Operating System = Overhead

Every administration layer has its overhead ⇒ e.g., visible in the hourglass benchmark (a minimal sketch follows the figure)

[Figure: hourglass-benchmark histogram — number of events (log scale) over loop time in cycles]
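The benchmark itself is tiny; the following is a minimal sketch in C (assuming x86-64 and a GCC-compatible compiler), not the exact code behind the plot: it spins reading the time-stamp counter and records the length of every iteration, so long gaps expose OS activity.

/* Minimal hourglass-loop sketch (assumption: x86-64, GCC-compatible compiler).
 * The loop does nothing but read the time-stamp counter; unusually long
 * iterations indicate that the OS stole the core (interrupts, scheduler
 * ticks, other housekeeping). */
#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>

#define SAMPLES (1 << 20)
static uint64_t gap[SAMPLES];

int main(void) {
    uint64_t prev = __rdtsc();

    for (int i = 0; i < SAMPLES; i++) {
        uint64_t now = __rdtsc();
        gap[i] = now - prev;            /* cycles spent in this iteration */
        prev = now;
    }

    /* A histogram of gap[] corresponds to the plot above: most iterations
     * take a handful of cycles, the long tail is OS noise. */
    for (int i = 0; i < SAMPLES; i++)
        if (gap[i] > 1000)
            printf("iteration %d: %llu cycles\n", i, (unsigned long long)gap[i]);

    return 0;
}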

How does the HPC / Cloud Community reduce overhead?


How does the HPC community reduce the overhead?

Light-weight Kernels

Typically derived from an existing fat kernel (e.g., Linux)

Removal of unneeded features to improve scalability

e.g., ZeptoOS

Multi-Kernels

A specialized kernel runs side-by-side with a full-weight kernel (e.g., Linux)

Applications of the full-weight kernel are able to run on the specialized kernel

The specialized kernel catches every system call and delegates it to the full-weight kernel

Binary compatible with the full-weight kernel

Examples: mOS, McKernel


OS Designs for Cloud Computing – Usage of Common OS

[Figure: classical cloud setup — a hypervisor with a software virtual switch hosts several VMs; each VM runs a full OS with its own eth0 below the application]

Two operating systems to maintain for a single computer? ⇒ Double management!

Why is a cloud based on a multi-user/multi-tasking OS?


OS Designs for Cloud Computing – LibraryOS

[Figure: library-OS setup — the hypervisor with a software virtual switch hosts VMs in which each application is linked directly against a library OS (libOS) with its own eth0]

Now every system call is a function call ⇒ low overhead

Whole-image optimization is possible (including the library OS)

Link Time Optimization (LTO)

Removing unneeded code ⇒ reduces the attack surface


Related Unikernels

Rump kernels¹

Part of NetBSD ⇒ e.g., NetBSD's TCP/IP stack is available as a library

Strong dependencies on the hypervisor

Not directly bootable on a standard hypervisor (e.g., KVM)

IncludeOS²

Runs natively on the hardware ⇒ minimal overhead

Only 32-bit support to avoid the overhead of paging ⇒ uncommon in HPC

MirageOS³

Designed for the high-level language OCaml ⇒ uncommon in HPC

¹ A. Kantee and J. Cormack. "Rump Kernels – No OS? No Problem!". In: ;login: 2014.
² A. Bratterud et al. "IncludeOS: A Resource Efficient Unikernel for Cloud Services". In: 7th Int. Conference on Cloud Computing Technology and Science. 2015.
³ A. Madhavapeddy et al. "Unikernels: Library Operating Systems for the Cloud". In: 8th Int. Conference on Architectural Support for Programming Languages and Operating Systems. 2013.

HermitCore Design

Combination of the unikernel and multi-kernel approaches to reduce the overhead

The same binary is able to run:

in a VM (classical unikernel setup)

or bare-metal side-by-side with Linux (multi-kernel setup)

Support for the dominant programming models (MPI, OpenMP)

Single-address-space operating system

⇒ No TLB shootdowns


HermitCore Design

Classical Unikernel Setup

Applications are able to boot directly within a VM

Tested with QEMU/KVM

Tested with uhyve (an experimental KVM-based hypervisor)

QEMU emulates more than HermitCore needs ⇒ large setup time

uhyve reduces the boot time (from 2 s to ~30 ms); a minimal KVM sketch follows this slide

Amazon Web Services (available soon!)

Multi-Kernel Setup

One kernel per NUMA node

Only local memory accesses (UMA)

Message passing between NUMA nodes

One full-weight kernel (Linux) in the system to get access to broader driver support

Only a fallback for pre-/post-processing

The critical path should be handled by HermitCore
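For context, the following is a minimal sketch of how a uhyve-like hypervisor can talk to KVM directly instead of going through QEMU's device emulation. It is illustrative only (error handling, kernel-image loading, and register setup are omitted) and is not HermitCore's actual uhyve code.

/* Sketch of a uhyve-like use of the KVM API (assumptions: Linux with
 * /dev/kvm; error handling, kernel loading, and register setup are
 * omitted, so this only illustrates why so little emulation is needed
 * compared to a full QEMU machine model). */
#include <fcntl.h>
#include <linux/kvm.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/mman.h>

int main(void) {
    int kvm = open("/dev/kvm", O_RDWR | O_CLOEXEC);
    int vm  = ioctl(kvm, KVM_CREATE_VM, 0);

    /* Guest "RAM" is just an anonymous mapping in the host process. */
    size_t size = 64UL << 20;                       /* 64 MiB guest memory */
    void *mem = mmap(NULL, size, PROT_READ | PROT_WRITE,
                     MAP_SHARED | MAP_ANONYMOUS, -1, 0);

    struct kvm_userspace_memory_region region = {
        .slot = 0,
        .guest_phys_addr = 0,
        .memory_size = size,
        .userspace_addr = (uint64_t)mem,
    };
    ioctl(vm, KVM_SET_USER_MEMORY_REGION, &region);

    /* A single vCPU is enough to boot a unikernel image loaded into mem. */
    int vcpu = ioctl(vm, KVM_CREATE_VCPU, 0);
    int run_size = ioctl(kvm, KVM_GET_VCPU_MMAP_SIZE, 0);
    struct kvm_run *run = mmap(NULL, run_size, PROT_READ | PROT_WRITE,
                               MAP_SHARED, vcpu, 0);

    /* ... load the kernel image into mem and set up the vCPU registers ... */
    ioctl(vcpu, KVM_RUN, 0);                        /* enter the guest */
    printf("guest exited, reason %u\n", run->exit_reason);
    return 0;
}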


Booting HermitCore

[Figure: multi-kernel layout — the Linux kernel with libc and the proxy runs on some cores; HermitCore's libOS (LwIP, IRQ handling, etc.) with newlib, OpenMP/MPI, and the application runs on the remaining cores; all on the same hardware]

On detection of a HermitCore app, a proxy is started.

The proxy unplugs a set of cores.

It triggers Linux to boot HermitCore on the unused cores.

A reliable connection is established.

On termination, the cores are set to the HALT state.

Finally, the cores are re-registered with Linux.
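Unplugging cores can be done through Linux's standard CPU-hotplug interface in sysfs; the sketch below shows the idea (hypothetical helper, requires root, and the actual proxy may use additional HermitCore-specific interfaces).

/* Sketch of taking a core away from Linux via the CPU-hotplug sysfs files
 * (assumptions: Linux, root privileges; cpu 3 is only an example). */
#include <stdio.h>

static int set_cpu_online(int cpu, int online) {
    char path[64];
    snprintf(path, sizeof(path), "/sys/devices/system/cpu/cpu%d/online", cpu);

    FILE *f = fopen(path, "w");
    if (!f)
        return -1;
    fprintf(f, "%d\n", online);        /* "0" offlines the core, "1" returns it */
    return fclose(f);
}

int main(void) {
    if (set_cpu_online(3, 0) == 0)
        puts("cpu3 is offline and free for a HermitCore instance");

    /* ... boot HermitCore on the core and run the application ... */

    set_cpu_online(3, 1);              /* re-register the core with Linux */
    return 0;
}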



Runtime Support

SSE, AVX2, AVX-512, FMA, ...

Full C-library support (newlib)

HBM support similar to memkind (see the sketch below)

IP interface & BSD sockets (LwIP)

IP packets are forwarded to Linux

Shared-memory interface

Pthreads

Thread binding at start time

No load balancing ⇒ less housekeeping

OpenMP via Intel's runtime

iRCCE & MPI (via SCC-MPICH)

Full support for the Go runtime
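As a point of reference for the HBM support, the memkind interface it resembles looks roughly like this (assuming libmemkind is installed and high-bandwidth memory is exposed to the OS):

/* Sketch of allocating from high-bandwidth memory with memkind's hbwmalloc
 * interface (assumptions: libmemkind installed, MCDRAM visible to the OS;
 * link with -lmemkind). */
#include <hbwmalloc.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    size_t n = 1 << 24;                       /* 16 Mi doubles, 128 MiB */
    int have_hbm = (hbw_check_available() == 0);

    double *a = have_hbm ? hbw_malloc(n * sizeof(double))   /* placed in HBM */
                         : malloc(n * sizeof(double));      /* fall back to DDR */

    for (size_t i = 0; i < n; i++)            /* touch the buffer */
        a[i] = 1.0;
    printf("allocated in %s memory\n", have_hbm ? "high-bandwidth" : "regular");

    if (have_hbm)
        hbw_free(a);
    else
        free(a);
    return 0;
}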

[Figure: Intel SCC layout — 48 cores (0–47) arranged in tiles of two cores with L2 caches, an MIU, and a message-passing buffer (MPB), connected via routers to four memory controllers (MC 0–3) and an FPGA]


Operating System Micro-Benchmarks

Test systems:

Intel Haswell CPUs (E5-2650 v3) clocked at 2.3 GHz

Intel KNL (Phi 7210) clocked at 1.3 GHz, SNC mode with four NUMA nodes

Results in CPU cycles

System activity                  KNL                      Haswell
                                 HermitCore    Linux      HermitCore    Linux
getpid()                                 15      486              14      143
sched_yield()                           197      983              97      370
malloc()                              3 051   12 806           3 715    6 575
first write access to a page          2 078    3 967           2 018    4 007
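The last row can be reproduced with a small loop that maps fresh pages and times the first store into each one; a sketch (assuming x86-64 and 4 KiB pages, not the exact benchmark code) is shown below.

/* Sketch of timing the first write access to freshly mapped pages
 * (assumptions: x86-64, 4 KiB pages). */
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <x86intrin.h>

#define PAGES 1024
#define PAGE_SIZE 4096

int main(void) {
    char *buf = mmap(NULL, (size_t)PAGES * PAGE_SIZE, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    uint64_t total = 0;

    for (int i = 0; i < PAGES; i++) {
        uint64_t t0 = __rdtsc();
        buf[i * PAGE_SIZE] = 1;       /* first touch: page fault, mapping, zeroing */
        total += __rdtsc() - t0;
    }

    printf("average first-write cost: %llu cycles\n",
           (unsigned long long)(total / PAGES));
    return 0;
}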


[Figure: OS-noise traces — gap in ticks over time in s for Standard Linux, HermitCore, Linux with isolcpus & nohz enabled, and HermitCore without the IP thread]

Overhead of VMs – Determined via NAS Parallel Benchmarks (Class B)

[Figure: NAS Parallel Benchmarks class B (CG, BT, SP, EP, IS) — speed in GMop/s for native Linux vs. HermitCore + uhyve]


Conclusions

It works! ⇒ https://youtu.be/gDYCJ1DOTKw

Binary packages are available

HermitCore reduces the OS noise significantly

Try it out!

http://www.hermitcore.org

Thank you for your kind attention!


Backup slides

Outlook

Fast, direct access to the interconnect is required

SR-IOV simplifies the coordination between Linux & HermitCore

[Figure: planned setup — Node 0 runs Linux with its memory and the physical NIC; Nodes 1–3 run HermitCore with their memory and a virtual NIC (vNIC); all nodes are connected via a virtual IP device / message-passing interface]


OS Designs for Cloud Computing – Container

[Figure: container setup — one OS with eth0 hosts several containers, each running an application]

Building of virtual borders (namespaces)

Containers and their processes do not see each other

Fast access to OS services

Less secure, because an exploit against the container also attacks the host OS

Does not reduce the OS noise of the host system
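The "virtual borders" are built from kernel namespaces; a minimal sketch (assuming Linux and root privileges) that gives a process its own hostname view while it keeps sharing the host kernel:

/* Minimal namespace sketch (assumptions: Linux, root privileges).
 * A new UTS namespace gives this process its own hostname view,
 * illustrating the "virtual border" idea behind containers. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void) {
    if (unshare(CLONE_NEWUTS) != 0) {          /* leave the host's UTS namespace */
        perror("unshare");
        return 1;
    }

    const char *name = "container-0";
    sethostname(name, strlen(name));           /* visible only inside the namespace */

    char buf[64];
    gethostname(buf, sizeof(buf));
    printf("hostname inside the namespace: %s\n", buf);
    return 0;                                  /* the host's hostname stays unchanged */
}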


EPCC OpenMP Micro-Benchmarks

[Figure: overhead in µs over the number of threads for Parallel and Parallel For constructs — on Linux (gcc), Linux (icc), and HermitCore]
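The EPCC numbers measure the cost of entering and leaving an OpenMP construct; a simplified sketch of such a measurement is shown below (the real suite additionally subtracts a sequential reference time).

/* Simplified EPCC-style measurement of the parallel-region overhead
 * (assumption: compile with -fopenmp). */
#include <omp.h>
#include <stdio.h>

#define REPS 10000

int main(void) {
    double t0 = omp_get_wtime();
    for (int i = 0; i < REPS; i++) {
        #pragma omp parallel
        {
            /* empty region: its cost is pure fork/join overhead */
        }
    }
    double per_region = (omp_get_wtime() - t0) / REPS;

    printf("parallel overhead: %.2f us with %d threads\n",
           per_region * 1e6, omp_get_max_threads());
    return 0;
}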


Throughput Results of the Inter-kernel Communication Layer

[Figure: throughput in GiB/s over message size (256 B to 1 MiB) — PingPong via iRCCE, via SCC-MPICH, and via ParaStation MPI]
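All three curves are produced by the same ping-pong pattern; the sketch below expresses it with plain MPI calls (message size and repetition count are example values, run with two ranks).

/* Ping-pong throughput sketch with plain MPI (assumptions: two ranks;
 * 1 MiB messages and 100 repetitions are example values). */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int size = 1 << 20, reps = 100;      /* 1 MiB messages */
    char *buf = malloc(size);

    double t0 = MPI_Wtime();
    for (int i = 0; i < reps; i++) {
        if (rank == 0) {
            MPI_Send(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t = MPI_Wtime() - t0;

    if (rank == 0)                              /* two transfers per iteration */
        printf("throughput: %.2f GiB/s\n",
               (2.0 * reps * size) / t / (1 << 30));

    MPI_Finalize();
    return 0;
}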


Lack of programmability

Non-Uniform Memory Access

Costs for memory accesses may vary

Run processes where the memory is allocated

Allocate memory where the process resides (explicit placement with libnuma is sketched below)

Implications for performance:

Where should the applications store the data?

Who should decide on the location?

The operating system?

The application developers?

[Figure: two sockets with four memory regions (Memory 0–3) connected via an interconnect]
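Explicit placement is one possible answer; a minimal sketch with libnuma (assuming Linux with libnuma installed, link with -lnuma) that allocates a buffer on a chosen node:

/* Sketch of explicit NUMA placement with libnuma (assumptions: Linux,
 * libnuma installed; node 0 is only an example). */
#include <numa.h>
#include <stdio.h>

int main(void) {
    if (numa_available() < 0) {
        puts("no NUMA support on this system");
        return 1;
    }

    size_t bytes = 64UL << 20;                 /* 64 MiB */
    double *a = numa_alloc_onnode(bytes, 0);   /* place the buffer on node 0 */

    for (size_t i = 0; i < bytes / sizeof(double); i++)
        a[i] = 0.0;                            /* initialize / touch the pages */

    numa_free(a, bytes);
    return 0;
}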



Tuning Tricks

Parallelization via Shared Memory (OpenMP)

Many side effects and error-prone

Incremental parallelization

Parallelization via Message Passing (MPI)

Restructuring of the sequential code

Fewer side effects

Performance Tuning

Bind MPI applications to one NUMA node

⇒ No remote memory accesses

[Figure: speed in GFlop/s for LU-MZ.C.16, BT-MZ.C.16, and SP-MZ.C.16 over <thread count> × <process count> configurations (2×8, 4×4, 8×2, 16×1)]


OpenMP Runtime

GCC includes an OpenMP runtime (libgomp)

It reuses the synchronization primitives of the Pthread library

Other OpenMP runtimes scale better

In addition, our Pthread library was originally not designed for HPC

Integration of Intel's OpenMP Runtime

It includes its own synchronization primitives

Binary compatible with GCC's OpenMP runtime

Changes for HermitCore support are small

Mostly deactivation of the functions that define thread affinity

Transparent usage

For the end user, no changes in the build process


Support of compilers beside GCC

Just avoid the standard environment (-ffreestanding)

Set the include path to HermitCore's toolchain

Be sure that the ELF file uses HermitCore's ABI

Patch object files via elfedit

Use GCC to link the binary

LD = x86_64-hermit-gcc
#CC = x86_64-hermit-gcc
#CFLAGS = -O3 -mtune=native -march=native -fopenmp -mno-red-zone
CC = icc -D__hermit__
CFLAGS = -O3 -xHost -mno-red-zone -ffreestanding -I$(HERMIT_DIR) -openmp
ELFEDIT = x86_64-hermit-elfedit

stream.o: stream.c
	$(CC) $(CFLAGS) -c -o $@ $<
	$(ELFEDIT) --output-osabi HermitCore $@

stream: stream.o
	$(LD) -o $@ $< $(LDFLAGS) $(CFLAGS)


Is the software stack difficult to maintain?

Changes to the common software stack determined with cloc

Software Stack        LoC          Changes
binutils           5 121 217           226
gcc                6 850 382         4 821
Linux             15 276 013         1 296
Newlib             1 040 826         5 472
LwIP                  38 883           832
Pthread               13 768           466
OpenMP RT             61 594           324
HermitCore                 –        10 597


[Figure: hourglass histograms — number of events (log scale) over loop time in cycles for Linux, Linux (isolcpus), HermitCore with LwIP, and HermitCore without LwIP]

Hydro (preliminary results)

[Figure: MFLOPS over the number of cores — Linux (1 process × n threads), Linux (1 proc. × n thr., bind-to 0–19), Linux (n proc. × 5 thr., bind-to numa), and HermitCore (n proc. × 5 thr.)]


Thank you for your kind attention!

Stefan Lankes et al. – [email protected]

Institute for Automation of Complex Power Systems
E.ON Energy Research Center, RWTH Aachen University
Mathieustraße 10, 52074 Aachen

www.acs.eonerc.rwth-aachen.de