
Page 1: Systems Support for Many Task Computing Holistic Aggregate

IBM Research, Sandia National Labs, Bell Labs, & CMU

Systems Support for Many Task Computing 11/17/2008 (c) 2008 IBM Corporation

Systems Support for Many Task Computing

Holistic Aggregate Resource Environment

Eric Van Hensbergen (IBM) and Ron Minnich (Sandia National Labs)


Motivation


Overview of Approach

Targeting Blue Gene/P 

– provide a complementary runtime environment

Using Plan 9 Research Operating System

– “Right Weight Kernel”: balances simplicity and function

– Built from the ground up as a distributed system

Leverage HPC interconnects for system services

Distribute system services among compute nodes

Leverage aggregation as a first-class systems construct to help manage complexity and provide a foundation for scalability, reliability, and efficiency.


Related Work

Default Blue Gene runtime

– Linux on I/O nodes + CNK on compute nodes

High Throughput Computing (HTC) Mode

Compute Node Linux

ZeptoOS

Kittyhawk


Foundation: Plan 9 Distributed System

Right Weight Kernel

– General-purpose multi-thread, multi-user environment

– Pleasantly portable

– Relatively Lightweight (compared to Linux)

Core Principles

– All resources are synthetic file hierarchies

– Local & remote resources accessed via simple API

– Each thread can dynamically organize local and remote resources via a private namespace
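To make the "simple API" concrete: in Plan 9's /net, reading a clone file allocates a new connection directory, and writing a dial string to its ctl file establishes the connection. The sketch below mocks that idiom with an in-memory dictionary standing in for the kernel's synthetic file tree; it is illustrative only, not real 9P.

```python
# Mock of Plan 9's clone-file idiom for /net/tcp: reading "clone" allocates
# a connection directory; writing "connect host!port" to its ctl file dials.
# The dict stands in for the kernel's synthetic file tree (not real 9P).

class TcpStack:
    def __init__(self):
        self.conns = {}

    def read_clone(self):
        # Reading /net/tcp/clone returns a fresh connection number.
        n = len(self.conns)
        self.conns[n] = {"ctl": [], "status": "Closed"}
        return str(n)

    def write_ctl(self, n, msg):
        # Writing a dial string to /net/tcp/N/ctl establishes the connection.
        self.conns[n]["ctl"].append(msg)
        if msg.startswith("connect"):
            self.conns[n]["status"] = "Established"

net = TcpStack()
conn = int(net.read_clone())                  # open clone, read conn number
net.write_ctl(conn, "connect 192.168.0.1!80") # dial via the ctl file
print(net.conns[conn]["status"])              # -> Established
```

The same open/read/write calls work on local and remote resources alike, which is what lets a private namespace transparently splice in services from other nodes.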


Everything Represented as File Systems

Hardware Devices
– Disk: /dev/hda1, /dev/hda2
– Network: /dev/eth0
– Console, audio, etc.

System Services
– TCP/IP stack:
  /net
      /arp
      /udp
      /tcp
          /clone
          /stats
          /0
          /1
              /ctl
              /data
              /listen
              /local
              /remote
              /status
– DNS: /net/cs, /net/dns

Application Services
– GUI:
  /win
      /clone
      /0
      /1
          /ctl
          /data
          /refresh
      /2
– Wiki, authentication, and service control
– Process control, debug, etc.


Plan 9 Networks

(Diagram: a Plan 9 network. A high-bandwidth (10 Gb/s) network connects content-addressable storage, file servers, and CPU servers; a LAN (1 Gb/s) connects terminals; WiFi, edge, and cable/DSL links reach PDAs, smartphones, set-top boxes, and screen phones across the Internet.)


An Issue of Scale

– Chip: BG/P, 4-way
– Compute Card: 2 chips
– Node Card (4x4x2): 32 compute, 0-2 I/O cards
– Rack: 32 Node Cards
– System: 72 Racks


Aggregation as a First Class Concept

(Diagram: a local service accesses an aggregate service through a proxy service, which presents many remote services as a single resource.)


Issues of Topology


File Cache Example

Proxy Service

– Monitors access to remote file server & local resources

– Local cache mode

– Collaborative cache mode

– Designated cache server(s)

– Integrate replication and redundancy

– Explore write coherence via “territories” a la Envoy

Based on experiences with Xget deployment model

Leverage natural topology of machine where possible.
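The cache modes above can be sketched as a toy read path: check the local cache, then neighboring caches (collaborative mode), and only then the remote file server. The class and names below (FileCache, peers, a dict-backed origin) are hypothetical illustrations, not the HARE implementation.

```python
# Sketch of the proxy file cache: local cache mode, then collaborative
# (peer) mode, then fall through to the remote file server. Illustrative
# names only -- not the real HARE proxy service.

class FileCache:
    def __init__(self, origin, peers=()):
        self.local = {}           # locally cached file contents
        self.peers = list(peers)  # neighbor caches, e.g. along the tree topology
        self.origin = origin      # authoritative file server (a dict here)
        self.origin_reads = 0     # traffic that actually hit the server

    def read(self, path):
        if path in self.local:                  # local cache mode
            return self.local[path]
        for peer in self.peers:                 # collaborative cache mode
            if path in peer.local:
                self.local[path] = peer.local[path]
                return self.local[path]
        self.origin_reads += 1                  # designated server fallback
        self.local[path] = self.origin[path]
        return self.local[path]

server = {"/bin/app": b"ELF..."}
a = FileCache(server)
b = FileCache(server, peers=[a])
a.read("/bin/app")   # misses everywhere: one origin read
b.read("/bin/app")   # satisfied by peer a: no extra origin read
print(a.origin_reads + b.origin_reads)  # -> 1
```

The point of the peer step is exactly the Xget-style win: N nodes booting the same binary generate one read against the file server instead of N.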


Monitoring Example

Distribute monitoring throughout the system

– Use for system health monitoring and load balancing

– Allow for application-specific monitoring agents

Distribute filtering & control agents at key points in topology

Allow for localized monitoring and control as well as high-level global reporting and control

Explore both push and pull models of monitoring

Based on experiences with supermon system.
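The roll-up idea can be sketched as a tree of agents: filtering/aggregation agents at interior points of the topology combine their children's samples, so a global reading is one pull at the root instead of a poll of every node. The Monitor class below is an illustrative stand-in, not the supermon interface.

```python
# Sketch of hierarchical monitoring: interior agents aggregate the samples
# of their children (pull model), so global reporting reads one rolled-up
# value at the root. Illustrative only -- not the supermon API.

class Monitor:
    def __init__(self, children=()):
        self.children = list(children)
        self.sample = {}

    def record(self, key, value):
        # A node (or application-specific agent) publishes a local sample.
        self.sample[key] = value

    def pull(self):
        # Aggregate own sample plus the aggregates of all children.
        agg = dict(self.sample)
        for child in self.children:
            for key, value in child.pull().items():
                agg[key] = agg.get(key, 0) + value
        return agg

leaves = [Monitor() for _ in range(4)]
for i, m in enumerate(leaves):
    m.record("load", i + 1)          # per-node loads 1..4
root = Monitor(children=leaves)
print(root.pull()["load"])           # -> 10
```

A push model would invert the flow (children send deltas upward on change); the tree shape and the filtering points stay the same.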


Workload Management Example

Provide file system interface to job execution and scheduling.

Allows scheduling of new work from within the cluster, using localized as well as global scheduling controls.

Can allow for more organic growth of workloads as well as top­down and bottom­up models.

Can be extended to allow direct access from end­user workstations.

Based on experiences with Xcpu mechanism.
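A file-system interface to job execution in the spirit of Xcpu can be sketched as follows: cloning allocates a job session, and writing to its exec/ctl files launches or kills work. The JobFS class and file names below are illustrative assumptions, not the actual Xcpu layout.

```python
# Sketch of a file-system interface to job execution (Xcpu-like spirit):
# cloning allocates a session directory; writes to its "exec" and "ctl"
# files drive the job. Names are illustrative, not the real Xcpu layout.

class JobFS:
    def __init__(self):
        self.sessions = {}

    def clone(self):
        # Allocate a new job session directory, like reading a clone file.
        sid = len(self.sessions)
        self.sessions[sid] = {"argv": None, "state": "new"}
        return sid

    def write(self, sid, fname, data):
        session = self.sessions[sid]
        if fname == "exec":            # writing exec launches the command
            session["argv"] = data.split()
            session["state"] = "running"
        elif fname == "ctl" and data == "kill":
            session["state"] = "killed"

    def read_status(self, sid):
        return self.sessions[sid]["state"]

jobs = JobFS()
sid = jobs.clone()
jobs.write(sid, "exec", "simulate --steps 1000")
print(jobs.read_status(sid))   # -> running
```

Because the interface is just files, any node that can mount it can schedule work, which is what enables in-cluster (organic, bottom-up) scheduling alongside top-down control, and extension to end-user workstations.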


Status

Initial Port to BG/P 90% Complete

Applications

– Linux emulation environment

– CNK emulation environment

– Native ports of applications

Also have a port of the Inferno virtual machine to BG/P

– Runs on Kittyhawk as well as natively

Baseline boot & runtime infrastructure complete


HARE Team

David Eckhardt (Carnegie Mellon University)

Charles Forsyth (Vita Nuova)

Jim McKie (Bell Labs)

Ron Minnich (Sandia National Labs)

Eric Van Hensbergen (IBM Research)


Thanks

Funding

– This material is based upon work supported by the Department of Energy under Award Number DE-FG02-08ER25851

Resources

– This work is being conducted on resources provided by the Department of Energy's Innovative and Novel Computational Impact on Theory and Experiment (INCITE) program

Information

– The authors would also like to thank the IBM Research Blue Gene Team along with the IBM Research Kittyhawk team for their assistance.


Questions? Discussion?


Links

FastOS Web Site

– http://www.cs.unm.edu/~fastos/

Phase II CFP

– http://www.sc.doe.gov/grants/FAPN07-23.html

BlueGene

– http://www.research.ibm.com/bluegene/

Plan 9

– http://plan9.bell-labs.com/plan9

LibraryOS

– http://www.research.ibm.com/prose


Plan 9 Characteristics

Kernel Breakdown – Lines of Code

– Architecture Specific Code

• BG/L:  ~10,000 lines of code

– Portable Code

• Port:  ~25,000 lines of code

• TCP/IP Stack:  ~14,000 lines of code

Binary Sizes

– 415k Text + 140k Data + 107k BSS

Runtime Memory Footprint

– ~4 MB for compute node kernels – could be smaller or larger depending on application specific tuning.


Why not Linux?

Not a distributed system

Core systems inflexible

– VM based on x86 MMU

– Networking tightly tied to sockets & TCP/IP, with a long call path

– Typical installations extremely overweight and noisy

– Benefits of modularity and open source are outweighed by complexity, dependencies, and rapid rate of change

Community has become conservative

– Support for alternative interfaces waning

– Support for large systems that hurts small systems is not acceptable

Ultimately a customer constraint

– FastOS was developed to prevent OS monoculture in HPC

– Few Linux projects were even invited to submit final proposals


FTQ on BG/L IO Node running Linux


FTQ on BG/L IO Node Running Plan 9


Right Weight Kernels Project (Phase I)

Motivation

– OS Effect on Applications

• Metric is based on OS Interference on FWQ & FTQ benchmarks.

– AIX/Linux has more capability than many apps need

– LWK and CNK have less capability than apps want

Approach

– Customize the kernel to the application

Ongoing Challenges

– Need to balance capability with overhead


Why Blue Gene?

Readily available large-scale cluster

– Minimum allocation is 37 nodes

– Easy to get 512 and 1024 node configurations

– Up to 8192 nodes available upon request internally

– FastOS will make 64k configuration available

DOE interest – Blue Gene was a specified target

Variety of interconnects allows exploration of alternatives

Embedded core design provides a simple architecture that is quick to port to and doesn't require heavyweight systems-software management, device drivers, or firmware


Department of Energy FastOS CFP
aka: Operating and Runtime System for Extreme Scale Scientific Computation (DE-PS02-07ER07-23)

Goal: Stimulate R&D related to operating and runtime systems for petascale systems in the 2010 to 2015 time frame.

Expected Output: Unified operating and runtime system that could fully support and exploit petascale and beyond systems.

Near Term Hardware Targets:
– Blue Gene, Cray XD3, and HPCS Machines.


Blue Gene Interconnects

3-Dimensional Torus

– Interconnects all compute nodes (65,536)
– Virtual cut-through hardware routing
– 1.4 Gb/s on all 12 node links (2.1 GB/s per node)
– 1 µs latency between nearest neighbors, 5 µs to the farthest
– 4 µs latency for one hop with MPI, 10 µs to the farthest
– Communications backbone for computations
– 0.7/1.4 TB/s bisection bandwidth, 68 TB/s total bandwidth
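The per-node figure is consistent with the link count: in a 3D torus each node has 6 neighbors, with a send and a receive link apiece, i.e. 12 links at 1.4 Gb/s:

```python
# Per-node torus bandwidth: 6 neighbors, each with a send and a receive
# link (12 links total), at 1.4 Gb/s per link.
links = 12
gbits_per_link = 1.4
total_gbits = links * gbits_per_link     # 16.8 Gb/s per node
total_gbytes = total_gbits / 8           # convert bits to bytes
print(round(total_gbytes, 2))            # 2.1 GB/s per node, matching the slide
```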

Global Tree
– One-to-all broadcast functionality
– Reduction operations functionality
– 2.8 Gb/s of bandwidth per link
– Latency of one-way tree traversal 2.5 µs
– ~23 TB/s total binary tree bandwidth (64k machine)
– Interconnects all compute and I/O nodes (1,024)

Ethernet
– Incorporated into every node ASIC
– Active in the I/O nodes (1:64)
– All external comm. (file I/O, control, user interaction, etc.)

Low Latency Global Barrier and Interrupt
– Latency of round trip 1.3 µs

Control Network