workshop report - muli ben-yehudac ibm corporation 2006 hrl system research seminar june 26, 2006 a...

c

�

IBM Corporation 2006 HRL System Research Seminar June 26, 2006

Workshop Report

Oleg Goldshmidt

IBM Haifa Research Laboratory

– p.1/18

c

�


FAST-OS

“FAST-OS (Forum to Address Scalable Technology forruntime and Operating Systems) provides a forum forthe presentation and discussion of research resultsrelated to the development of operating and runtimesystems for very large scale systems.”

funding: US government

FAST-OS Workshop serves as the PI meeting for theprojects funded by the FAST-OS program and willinclude progress reports from these projects.

The workshop is open and participation by the OSresearch community is encouraged.

More info at http://www.cs.unm.edu/˜fastos

– p.2/18

http://www.cs.unm.edu/~fastos

c

�


Petascale HPC

– p.3/18

c

�


Focus

very large scale HPC systems: special purposelightweight OS or fully featured OS?

services, versatility, flexibility lead to noise

OS noise: OS activity that is unrelated to the applicationbig problem for HPC, large scale parallel jobsoccasional delays are magnified because manythreads/nodes sufferoften leave a CPU “idle” for system tasks on eachnode (e.g., BG/L)

the “fatter” the OS the stronger the noiseclock interrupts are a major source of noiseand other interrupts are not so good, either

– p.4/18

c

�


A Brief History of Time(-Sharing)

time-sharing computerscomputers are expensive, users share resourcesapplications do not run as fast as they could:expansion factor ( � ��

for typical HPC)SMP, gang scheduling

clustersa node is dedicated to one application at a timeinterrupts, daemons, networkingget rid of parts of the kernel

e.g., networking: Myrinet, Quadrics: apps talk toHW directly

– p.5/18

c

�


Noise Scales

– p.6/18

c

�


ZeptoOS: Small Linux for Big Machines

research focus:fundamental limits, advanced designs and petaflopscalemanaging and optimizing OSgathering and studying performance data

approach: modified Linux on BG/L I/O and computenodes, specialized I/O daemon (ZOID)

“don’t attribute to OS noise what can be adequatelyexplained by stupidity” (e.g., do not start lpd on anHPC node)

injecting noise: on a vanilla Linux with minimalmodifications an application that does 300 barriersper second will run 0.3% slower on 100K nodes

– p.7/18

c

�


PROSE (Eric Van Hensbergen)

Partitioned Reliable Operating System Environment

application-specific kernelsfine-grained control over OS services: scheduling,memory allocation, interrupt handling (or lackthereof)reliability (up to and including formal verifications)HW support: hardware-specific features not suitablefor general purpose OS

– p.8/18

c

�


Virtualization

– p.9/18

c

�


PROSE Approach

– p.10/18

c

�


PROSE Approach II

run applications in dedicated partitions

execution environment makes starting a partition aseasy as starting an application

development environment allows developingspecialized kernels as easily as developing applications(library-OS)

based on rHypehttp://www.research.ibm.com/hypervisor

30K LOC for both x86 and PPCsame interfaces as pHype

– p.11/18

http://www.research.ibm.com/hypervisor

c

�


PROSE Noise

– p.12/18

c

�


Light- and Right-Weight Kernels

LWK — get rid of most of the kernelno filesystems, no sockets, no paging or virtualmemory, no security, no scheduling...no shared librariesno debugging...does it go too far?

RWKdon’t take for granted that LWK are necessarytry to trim down Linux and/or use Plan9 to comparedevelop metrics

– p.13/18

c

�


Right-Weight Kernels

what is light-weight?we probably don’t need a printer daemondo we need a filesystem? maybe...what ro remove?how to benchmark?

what is right-weight?LWK on BG/L c-nodes: 80 system callsLinux: 300 system callsPlan9: 40 system callsbut Plan9 has clock interrupts

develop simulators, measure

– p.14/18

c

�


HPC Colony (LLNL, UIUC, IBM)

difficulty in achieving balanced partitioning and dynamicscheduling on very large machines can limit scaling

today application programmers often manageresources explicitlygoal: delegate these functions to a parallel OS

system management (scheduling, event notification, jobmanagement) does not scale

goal: OS needs to be aware of the requirements ofparallel applications

directions:full featured Linux on Blue Geneparallel aware schedulingvirtualization: fault tolerance, resource management

– p.15/18

c

�


Virtualization on Blue Gene

divide the computation into a large number of pieces,independent of the number of processes

let the runtime system map the pieces to processors

e.g., Adaptive MPI (AMPI)

implement MPI processes as user-level migratablethreads

load balancing:centralized — not scalabledistributed — slow balancinghybrid — group nodes hierarchically, within eachgroup use centralized schemetopology aware

– p.16/18

c

�


Checkpointing, Fault Detection, Restart

migrate when a fault is imminent

if one CPU fails the other 99,999 should not scurry tocheckpoints

everyone logs messagesasynchronous checkpointswhen one CPU fails and restarts, the failed objectsrestart from their checkpoiints, their acquaintancesresend messages, the restarted objects catch upbut sooner or later we will stall, because we need towait for the crashed processor

migrate work from restarted processors to waitingones

– p.17/18

c

�


Future: FAST-OS 2 Topics

resource management: local/global, multicore,heterogeneous, power

system management: usage models, scalability, RAS

virtualization for HPC

adaptability of OS/runtime to application needs

performance measurements

scalable fault handling, communicating fault information

OS noise/interference: design, hardware support,mitigation strategies

interfaces to allow applications query hardwarecharacteristics, communication resources

– p.18/18

workshop report - muli ben-yehudac ibm corporation 2006 hrl system research seminar june 26, 2006 a...

Documents