
A CASE FOR TRANSFORMING PARALLEL RUNTIMES INTO OS KERNELS

Kyle Hale Peter Dinda

halek.co v3vee.org presciencelab.org xstack.sandia.gov/hobbes

HOBBES xstack.sandia.gov/hobbes

v3vee.org

v3vee.org/palacios

THE CURRENT OS/RUNTIME MODEL

[Diagram: software stack, top to bottom: parallel app and parallel runtime (user mode), general-purpose kernel (kernel mode), node HW]

THIS MODEL HAS SOME ISSUES

ARE PROVIDED KERNEL ABSTRACTIONS THE RIGHT ONES?

NOT ALWAYS

runtime (user mode): "I’d like to pin memory to a specific PFN range please"
general OS (kernel mode): "NO!"

runtime (user mode): "I’d like to never be interrupted please"
general OS (kernel mode): "NOPE"

RESTRICTED ACCESS TO HARDWARE

runtime (user mode): "I’d like to set up some custom page mappings please"
general OS (kernel mode): "Uh no"

runtime (user mode): "I’d like to interrupt another processor please"
general OS (kernel mode): "HA!"

What are the consequences?

WORKAROUNDS & COMPROMISES

DUPLICATED FUNCTIONALITY

If the runtime had

FULL HARDWARE ACCESS

CONTROL OVER KERNEL ABSTRACTIONS

we could mitigate these issues

OUR PROPOSED MODEL: THE HYBRID RUNTIME (HRT)

[Diagram: the parallel app and the hybrid runtime (HRT) both run in kernel mode, directly on the node HW]

Mashup of OS and runtime

The runtime IS the kernel, built within a kernel framework

Everything is in kernel space

HRT has full access to the hardware

HRT can control HW access

HRT can pick its own abstractions

MORE POWER!

We built a kernel framework to support HRTs: NAUTILUS

We ported an existing, complex parallel runtime: LEGION (legion.stanford.edu)

We ported our framework to cutting-edge many-core hardware: XEON PHI

We evaluated our port on a standard HPC benchmark: HPCG

XEON PHI + NAUTILUS + LEGION + HPCG

[Figure: speedup over Linux (y-axis, 0% to 25%) versus Legion processor count in cores (x-axis: 1, 2, 4, 8, 16, 32, 64, 128, 200, 220)]

11% average speedup

NAUTILUS

[Diagram: the parallel app and runtime sit on top of the Nautilus layer (threads, sync., events, HW info, bootstrap, paging, timers, IRQs, console), which sits on the hardware. Labels added across the builds: aerokernel, HRT, Kernel]

Nautilus primitives & utilities (HRT can use or not use any of them)

MINIMAL LIGHTWEIGHT PRIMITIVES

FULL HARDWARE ACCESS

VERY FAST BOOT TIMES

LIGHTWEIGHT PRIMITIVES EXAMPLE: THREADS

[Figure: cycles (y-axis, 0 to 7,000,000) for Nautilus and Linux (pthreads) (x-axis: 2, 4, 8, 16, 32, 64); annotations: ~3ms, ~90µs]

x86_64 Opteron: 64 cores, 4 sockets, 8 NUMA zones, 128GB RAM

FULL HARDWARE CONTROL EXAMPLE: INTERRUPT CONTROL

very simple modification: give the runtime control over interrupts in its task scheduler

→ modest speedups

MUCH more to come here

in addition to Legion, we have 2 other high-level, parallel runtimes running as HRTs:

NESL: VCODE interpreter running as HRT

NDPC: home-grown, co-designed HRT

INTEGRATING THE HRT WITH A LEGACY OS

HVM: THE HYBRID VIRTUAL MACHINE

[Diagram: the Hybrid Virtual Machine (HVM) runs on the node hardware and hosts two sides. Under the general virtualization model, a general OS, the Regular OS (ROS), runs in kernel mode with a parallel app and parallel runtime in user mode. Under the specialized virtualization model, the parallel app and hybrid runtime run in kernel mode.]

Legacy functionality from the Regular OS via the HVM!

LINUX FORK + EXEC ~ 714µs

HVM + HRT CORE BOOT ~ 61µs

HRT boot is CHEAP!

NAUTILUS + XEON PHI

A CASE FOR TRANSFORMING PARALLEL RUNTIMES INTO OS KERNELS

Kyle Hale Peter Dinda

my website: halek.co
our development blog: haltloop.com
our lab: presciencelab.org
the Hobbes project: xstack.sandia.gov/hobbes

follow us here for:
- experience report on building OS for Phi
- philix release (soon)

BACKUPS

FULL HARDWARE CONTROL EXAMPLE: INTERRUPT CONTROL

[Figure: speedup (y-axis, 0% to 6%) versus Legion processors (threads) (x-axis: 2, 4, 8, 16, 32, 62)]

[Diagram, built up over several slides: a parallel app and hybrid runtime in kernel mode sit next to a general OS that runs a parallel app and parallel runtime in user mode. Both sides run on the Hybrid Virtual Machine (HVM) over the node hardware. The general OS side uses the general virtualization model; each HRT uses the specialized virtualization model.]

- This is the performance path, through the HRT!
- Legacy functionality from the Regular OS (ROS) via the HVM!
- We can boot these things very quickly!!
- several auxiliary HRTs spawned in less than a millisecond!

HPCG IN LEGION ON XEON PHI

[Figure: execution time in seconds (y-axis, 0 to 6) versus Legion processor count in cores (x-axis: 1, 2, 4, 8, 16, 32, 64, 128, 200, 220) for Nautilus and Linux]

port of NESL

- nested data parallel language aimed at vector machines
- we can run unmodified NESL programs in our kernel-mode VCODE interpreter

the first co-designed HRT: NDPC

- Nested Data Parallelism in C/C++
- subset of NESL
- fork/join parallelism over flattened vector processing
- allows us to explore runtime/kernel co-design
- e.g. smart kernel-mode thread fork

to get started with your own Xeon Phi prototype kernel:

•  follow our blog
•  use our tool (philix) to boot it and leverage the MPSS stack

find out more @ haltloop.com