workshop report - muli ben-yehudac ibm corporation 2006 hrl system research seminar june 26, 2006 a...
TRANSCRIPT
c
�
IBM Corporation 2006 HRL System Research Seminar June 26, 2006
Workshop Report
Oleg Goldshmidt
IBM Haifa Research Laboratory
– p.1/18
c
�
IBM Corporation 2006 HRL System Research Seminar June 26, 2006
FAST-OS
“FAST-OS (Forum to Address Scalable Technology forruntime and Operating Systems) provides a forum forthe presentation and discussion of research resultsrelated to the development of operating and runtimesystems for very large scale systems.”
funding: US government
FAST-OS Workshop serves as the PI meeting for theprojects funded by the FAST-OS program and willinclude progress reports from these projects.
The workshop is open and participation by the OSresearch community is encouraged.
More info at http://www.cs.unm.edu/˜fastos
– p.2/18
c
�
IBM Corporation 2006 HRL System Research Seminar June 26, 2006
Petascale HPC
– p.3/18
c
�
IBM Corporation 2006 HRL System Research Seminar June 26, 2006
Focus
very large scale HPC systems: special purposelightweight OS or fully featured OS?
services, versatility, flexibility lead to noise
OS noise: OS activity that is unrelated to the applicationbig problem for HPC, large scale parallel jobsoccasional delays are magnified because manythreads/nodes sufferoften leave a CPU “idle” for system tasks on eachnode (e.g., BG/L)
the “fatter” the OS the stronger the noiseclock interrupts are a major source of noiseand other interrupts are not so good, either
– p.4/18
c
�
IBM Corporation 2006 HRL System Research Seminar June 26, 2006
A Brief History of Time(-Sharing)
time-sharing computerscomputers are expensive, users share resourcesapplications do not run as fast as they could:expansion factor ( � ��
for typical HPC)SMP, gang scheduling
clustersa node is dedicated to one application at a timeinterrupts, daemons, networkingget rid of parts of the kernel
e.g., networking: Myrinet, Quadrics: apps talk toHW directly
– p.5/18
c
�
IBM Corporation 2006 HRL System Research Seminar June 26, 2006
Noise Scales
– p.6/18
c
�
IBM Corporation 2006 HRL System Research Seminar June 26, 2006
ZeptoOS: Small Linux for Big Machines
research focus:fundamental limits, advanced designs and petaflopscalemanaging and optimizing OSgathering and studying performance data
approach: modified Linux on BG/L I/O and computenodes, specialized I/O daemon (ZOID)
“don’t attribute to OS noise what can be adequatelyexplained by stupidity” (e.g., do not start lpd on anHPC node)
injecting noise: on a vanilla Linux with minimalmodifications an application that does 300 barriersper second will run 0.3% slower on 100K nodes
– p.7/18
c
�
IBM Corporation 2006 HRL System Research Seminar June 26, 2006
PROSE (Eric Van Hensbergen)
Partitioned Reliable Operating System Environment
application-specific kernelsfine-grained control over OS services: scheduling,memory allocation, interrupt handling (or lackthereof)reliability (up to and including formal verifications)HW support: hardware-specific features not suitablefor general purpose OS
– p.8/18
c
�
IBM Corporation 2006 HRL System Research Seminar June 26, 2006
Virtualization
– p.9/18
c
�
IBM Corporation 2006 HRL System Research Seminar June 26, 2006
PROSE Approach
– p.10/18
c
�
IBM Corporation 2006 HRL System Research Seminar June 26, 2006
PROSE Approach II
run applications in dedicated partitions
execution environment makes starting a partition aseasy as starting an application
development environment allows developingspecialized kernels as easily as developing applications(library-OS)
based on rHypehttp://www.research.ibm.com/hypervisor
30K LOC for both x86 and PPCsame interfaces as pHype
– p.11/18
c
�
IBM Corporation 2006 HRL System Research Seminar June 26, 2006
PROSE Noise
– p.12/18
c
�
IBM Corporation 2006 HRL System Research Seminar June 26, 2006
Light- and Right-Weight Kernels
LWK — get rid of most of the kernelno filesystems, no sockets, no paging or virtualmemory, no security, no scheduling...no shared librariesno debugging...does it go too far?
RWKdon’t take for granted that LWK are necessarytry to trim down Linux and/or use Plan9 to comparedevelop metrics
– p.13/18
c
�
IBM Corporation 2006 HRL System Research Seminar June 26, 2006
Right-Weight Kernels
what is light-weight?we probably don’t need a printer daemondo we need a filesystem? maybe...what ro remove?how to benchmark?
what is right-weight?LWK on BG/L c-nodes: 80 system callsLinux: 300 system callsPlan9: 40 system callsbut Plan9 has clock interrupts
develop simulators, measure
– p.14/18
c
�
IBM Corporation 2006 HRL System Research Seminar June 26, 2006
HPC Colony (LLNL, UIUC, IBM)
difficulty in achieving balanced partitioning and dynamicscheduling on very large machines can limit scaling
today application programmers often manageresources explicitlygoal: delegate these functions to a parallel OS
system management (scheduling, event notification, jobmanagement) does not scale
goal: OS needs to be aware of the requirements ofparallel applications
directions:full featured Linux on Blue Geneparallel aware schedulingvirtualization: fault tolerance, resource management
– p.15/18
c
�
IBM Corporation 2006 HRL System Research Seminar June 26, 2006
Virtualization on Blue Gene
divide the computation into a large number of pieces,independent of the number of processes
let the runtime system map the pieces to processors
e.g., Adaptive MPI (AMPI)
implement MPI processes as user-level migratablethreads
load balancing:centralized — not scalabledistributed — slow balancinghybrid — group nodes hierarchically, within eachgroup use centralized schemetopology aware
– p.16/18
c
�
IBM Corporation 2006 HRL System Research Seminar June 26, 2006
Checkpointing, Fault Detection, Restart
migrate when a fault is imminent
if one CPU fails the other 99,999 should not scurry tocheckpoints
everyone logs messagesasynchronous checkpointswhen one CPU fails and restarts, the failed objectsrestart from their checkpoiints, their acquaintancesresend messages, the restarted objects catch upbut sooner or later we will stall, because we need towait for the crashed processor
migrate work from restarted processors to waitingones
– p.17/18
c
�
IBM Corporation 2006 HRL System Research Seminar June 26, 2006
Future: FAST-OS 2 Topics
resource management: local/global, multicore,heterogeneous, power
system management: usage models, scalability, RAS
virtualization for HPC
adaptability of OS/runtime to application needs
performance measurements
scalable fault handling, communicating fault information
OS noise/interference: design, hardware support,mitigation strategies
interfaces to allow applications query hardwarecharacteristics, communication resources
– p.18/18