TRANSCRIPT
Systems Support for Many Task Computing:
Holistic Aggregate Resource Environment
Eric Van Hensbergen (IBM) and Ron Minnich (Sandia National Labs)
IBM Research, Sandia National Labs, Bell Labs, and CMU
11/17/2008
Motivation
Overview of Approach
Targeting Blue Gene/P
– Provide a complementary runtime environment
Using Plan 9 Research Operating System
– “Right Weight Kernel” balances simplicity and function
– Built from the ground up as a distributed system
Leverage HPC interconnects for system services
Distribute system services among compute nodes
Leverage aggregation as a first-class systems construct to help manage complexity and provide a foundation for scalability, reliability, and efficiency.
Related Work
Default Blue Gene runtime
– Linux on I/O nodes + CNK on compute nodes
High Throughput Computing (HTC) Mode
Compute Node Linux
ZeptoOS
Kittyhawk
Foundation: Plan 9 Distributed System
Right Weight Kernel
– General-purpose multithreaded, multiuser environment
– Pleasantly portable
– Relatively Lightweight (compared to Linux)
Core Principles
– All resources are synthetic file hierarchies
– Local & remote resources accessed via simple API
– Each thread can organize local and remote resources into its own dynamic, private namespace
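To make the namespace principle concrete, here is a minimal sketch in Plan 9 C (not from the original slides) that imports a remote machine's /net over 9P into the calling process's private namespace; the host name "fileserver" is a placeholder.

    #include <u.h>
    #include <libc.h>

    /* Splice a remote machine's TCP/IP stack into this process's
       private /net. Only this process and children sharing the
       namespace see the change. "fileserver" is hypothetical. */
    void
    main(void)
    {
        int fd;

        fd = dial("tcp!fileserver!564", nil, nil, nil);
        if(fd < 0)
            sysfatal("dial: %r");
        if(mount(fd, -1, "/net", MBEFORE, "") < 0)
            sysfatal("mount: %r");
        /* from here on, /net/tcp/clone dials from the remote stack */
        exits(nil);
    }

Because the namespace is per-process, two threads on the same node can see entirely different /net hierarchies.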
Everything Represented as File Systems
Hardware devices
– Disk: /dev/hda1, /dev/hda2
– Network: /dev/eth0
– Console, audio, etc.
System services
– TCP/IP stack: /net/arp, /net/udp, /net/tcp/clone, /net/tcp/stats, /net/tcp/0/ctl, /net/tcp/0/data, /net/tcp/0/listen, /net/tcp/0/local, /net/tcp/0/remote, /net/tcp/0/status, /net/tcp/1/...
– DNS: /net/cs, /net/dns
– Process control, debug, etc.
Application services
– GUI: /win/clone, /win/0/ctl, /win/0/data, /win/0/refresh, /win/1/..., /win/2/...
– Wiki, authentication, and service control
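As an illustration of the interface above (not from the original slides), the following hedged sketch dials a TCP connection using nothing but open, read, and write on /net; 192.0.2.1 is a documentation-reserved placeholder address.

    #include <u.h>
    #include <libc.h>

    void
    main(void)
    {
        char dir[40], path[64];
        int n, cfd, dfd;

        cfd = open("/net/tcp/clone", ORDWR);    /* allocate a connection */
        if(cfd < 0)
            sysfatal("clone: %r");
        n = read(cfd, dir, sizeof dir - 1);     /* connection number, e.g. "7" */
        if(n <= 0)
            sysfatal("read: %r");
        dir[n] = 0;
        /* the clone fd now acts as the new connection's ctl file */
        if(fprint(cfd, "connect 192.0.2.1!80") < 0)
            sysfatal("connect: %r");
        snprint(path, sizeof path, "/net/tcp/%s/data", dir);
        dfd = open(path, ORDWR);                /* bytes written here go over TCP */
        if(dfd < 0)
            sysfatal("data: %r");
        write(dfd, "GET / HTTP/1.0\r\n\r\n", 18);
        exits(nil);
    }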
Plan 9 Networks
[Figure: a typical Plan 9 network. CPU servers, a file server, and content-addressable storage share a high-bandwidth (10 Gb/s) network; terminals sit on a 1 Gb/s LAN; PDAs, smartphones, set-top boxes, and screen phones connect over WiFi, cable/DSL, and the Internet.]
An Issue of Scale
– Chip: BG/P, 4-way
– Compute card: 2 chips
– Node card (4x4x2): 32 compute cards, 0-2 I/O cards
– Rack: 32 node cards
– System: 72 racks
Aggregation as a First Class Concept
[Figure: a proxy service presents a single aggregate service that stands in for a local service and several remote services.]
Issues of Topology
File Cache Example
Proxy Service
– Monitors access to remote file server & local resources
– Local cache mode
– Collaborative cache mode
– Designated cache server(s) (see the sketch after this list)
– Integrate replication and redundancy
– Explore write coherence via “territories” à la Envoy
Based on experience with the Xget deployment model
Leverage the natural topology of the machine where possible.
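One plausible reading of "designated cache server(s)" is hashing a file's name to pick which peer in a node group caches it; the C sketch below shows the idea, with the group size and hash being assumptions rather than HARE's actual design.

    #include <stdio.h>
    #include <stdint.h>

    enum { GROUPSIZE = 64 };    /* e.g. compute nodes sharing one I/O node */

    /* FNV-1a hash of the path string */
    uint32_t
    hashpath(const char *path)
    {
        uint32_t h = 2166136261u;
        for(; *path; path++){
            h ^= (uint8_t)*path;
            h *= 16777619u;
        }
        return h;
    }

    int
    cacheowner(const char *path)
    {
        return hashpath(path) % GROUPSIZE;  /* node rank holding the cached copy */
    }

    int
    main(void)
    {
        /* a node asks its owner peer before going to the remote file server */
        printf("owner of /bin/date: node %d\n", cacheowner("/bin/date"));
        return 0;
    }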
Monitoring Example
Distribute monitoring throughout the system
– Use for system health monitoring and load balancing
– Allow for application-specific monitoring agents
Distribute filtering & control agents at key points in topology
Allow for localized monitoring and control as well as high-level global reporting and control
Explore both push and pull methods of modeling (a pull-model sketch follows below)
Based on experience with the supermon system.
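As a sketch of the pull model, assuming each child node exports a one-line load figure at a hypothetical path such as /mnt/mon/<n>/load, a filter agent could aggregate as follows; the summary it prints would itself be served back up the tree as another stats file.

    #include <stdio.h>

    enum { NCHILD = 32 };

    int
    main(void)
    {
        char path[64];
        double v, sum = 0, min = 1e9, max = -1e9;
        int i, n = 0;

        for(i = 0; i < NCHILD; i++){
            snprintf(path, sizeof path, "/mnt/mon/%d/load", i);
            FILE *f = fopen(path, "r");
            if(f == NULL)
                continue;   /* down node: skip, maybe flag for health checks */
            if(fscanf(f, "%lf", &v) == 1){
                sum += v;
                if(v < min) min = v;
                if(v > max) max = v;
                n++;
            }
            fclose(f);
        }
        if(n > 0)
            printf("load %d nodes min %.2f max %.2f avg %.2f\n", n, min, max, sum/n);
        return 0;
    }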
Workload Management Example
Provide a file-system interface to job execution and scheduling (sketched below).
Allows scheduling of new work from within the cluster, using localized as well as global scheduling controls.
Can allow for more organic growth of workloads as well as top-down and bottom-up models.
Can be extended to allow direct access from end-user workstations.
Based on experience with the Xcpu mechanism.
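A hedged sketch of what such a file-system job interface could look like in C; the /mnt/xcpu paths and the "exec" control message are illustrative stand-ins, not Xcpu's actual layout.

    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>

    int
    main(void)
    {
        char dir[32];
        int n, cfd;

        cfd = open("/mnt/xcpu/clone", O_RDWR);  /* allocate a job session */
        if(cfd < 0)
            return 1;
        n = read(cfd, dir, sizeof dir - 1);     /* session id, e.g. "3" */
        if(n <= 0)
            return 1;
        dir[n] = 0;
        /* ask the remote node to run a binary; output would be read
           back from the session's stdout file by the same conventions */
        dprintf(cfd, "exec /bin/date");
        close(cfd);
        return 0;
    }

Because the interface is just files, a job launched this way can itself mount the job file system from other nodes and schedule further work, which is what enables the organic growth noted above.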
Status
Initial Port to BG/P 90% Complete
Applications
– Linux emulation environment
– CNK emulation environment
– Native ports of applications
Also have a port of Inferno Virtual Machine to BG/P
– Runs on Kittyhawk as well as natively
Baseline boot & runtime infrastructure complete
HARE Team
David Eckhardt (Carnegie Mellon University)
Charles Forsyth (Vita Nuova)
Jim McKie (Bell Labs)
Ron Minnich (Sandia National Labs)
Eric Van Hensbergen (IBM Research)
Thanks
Funding
– This material is based upon work supported by the Department of Energy under Award Number DE-FG02-08ER25851
Resources
– This work is being conducted on resources provided by the Department of Energy's Innovative and Novel Computational Impact on Theory and Experiment (INCITE) program.
Information
– The authors would also like to thank the IBM Research Blue Gene Team along with the IBM Research Kittyhawk team for their assistance.
Questions? Discussion?
Links
FastOS Web Site
– http://www.cs.unm.edu/~fastos/
Phase II CFP
– http://www.sc.doe.gov/grants/FAPN0723.html
BlueGene
– http://www.research.ibm.com/bluegene/
Plan 9
– http://plan9.bell-labs.com/plan9
LibraryOS
– http://www.research.ibm.com/prose
Plan 9 Characteristics
Kernel Breakdown: Lines of Code
– Architecture Specific Code
• BG/L: ~10,000 lines of code
– Portable Code
• Port: ~25,000 lines of code
• TCP/IP Stack: ~14,000 lines of code
Binary Sizes
– 415k Text + 140k Data + 107k BSS
Runtime Memory Footprint
– ~4 MB for compute node kernels – could be smaller or larger depending on application-specific tuning.
Why not Linux?
Not a distributed system
Core systems inflexible
– VM based on x86 MMU
– Networking tightly tied to sockets & TCP/IP with a long call path
– Typical installations extremely overweight and noisy
– Benefits of modularity and open source are outweighed by complexity, dependencies, and a rapid rate of change
Community has become conservative
– Support for alternative interfaces waning
– Support for large systems that hurts small systems is not acceptable
Ultimately a customer constraint
– FastOS was developed to prevent OS monoculture in HPC
– Few Linux projects were even invited to submit final proposals
FTQ on BG/L I/O Node Running Linux [plot]
FTQ on BG/L I/O Node Running Plan 9 [plot]
Right Weight Kernels Project (Phase I)
Motivation
– OS Effect on Applications
• Metric is based on OS interference on the FWQ and FTQ (fixed work / fixed time quantum) benchmarks; a minimal FTQ sketch follows below.
– AIX/Linux has more capability than many apps need
– LWK and CNK have less capability than apps want
Approach
– Customize the kernel to the application
Ongoing Challenges
– Need to balance capability with overhead
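For context, FTQ counts how much work completes in each fixed time quantum; dips in the per-quantum count expose OS interference ("noise"). A minimal sketch, with arbitrary quantum length and sample count:

    #include <stdio.h>
    #include <time.h>

    enum { NSAMPLES = 1000 };
    #define QUANTUM_NS 1000000LL    /* 1 ms quantum */

    static long long
    now_ns(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec * 1000000000LL + ts.tv_nsec;
    }

    int
    main(void)
    {
        long long start, end;
        unsigned long count;
        int i;

        for(i = 0; i < NSAMPLES; i++){
            start = now_ns();
            end = start + QUANTUM_NS;
            for(count = 0; now_ns() < end; count++)
                ;   /* unit of work: one loop iteration */
            printf("%d %lu\n", i, count);   /* a flat series means a quiet kernel */
        }
        return 0;
    }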
Why Blue Gene?
Readily available large-scale cluster
– Minimum allocation is 37 nodes
– Easy to get 512 and 1024 node configurations
– Up to 8192 nodes available upon request internally
– FastOS will make a 64k-node configuration available
DOE interest – Blue Gene was a specified target
Variety of interconnects allows exploration of alternatives
Embedded-core design provides a simple architecture that is quick to port to and doesn't require heavyweight systems-software management, device drivers, or firmware
Department of Energy FastOS CFP
aka: Operating and Runtime System for Extreme Scale Scientific Computation (DE-PS02-07ER0723)
Goal: Stimulate R&D related to operating and runtime systems for petascale systems in the 2010 to 2015 time frame.
Expected Output: A unified operating and runtime system that could fully support and exploit petascale and beyond systems.
Near-Term Hardware Targets:
– Blue Gene, Cray XD3, and HPCS machines.
Blue Gene Interconnects
3-Dimensional Torus
– Interconnects all compute nodes (65,536)
– Virtual cut-through hardware routing
– 1.4 Gb/s on all 12 node links (12 × 1.4 Gb/s = 16.8 Gb/s, i.e. 2.1 GB/s per node)
– 1 µs latency between nearest neighbors, 5 µs to the farthest
– 4 µs latency for one hop with MPI, 10 µs to the farthest
– Communications backbone for computations
– 0.7/1.4 TB/s bisection bandwidth, 68 TB/s total bandwidth
Global Tree
– One-to-all broadcast functionality
– Reduction operations functionality
– 2.8 Gb/s of bandwidth per link
– One-way tree traversal latency of 2.5 µs
– ~23 TB/s total binary tree bandwidth (64k machine)
– Interconnects all compute and I/O nodes (1024)
Ethernet
– Incorporated into every node ASIC
– Active in the I/O nodes (1:64)
– All external communication (file I/O, control, user interaction, etc.)
Low Latency Global Barrier and Interrupt
– Round-trip latency of 1.3 µs
Control Network