Open Resilient Cluster Manager: A Distributed Approach to a Resilient Router Manager
Ralph H. Castain, Ph.D., Cisco Systems, Inc.


Page 1: Open Resilient Cluster Manager: A Distributed Approach to a  Resilient Router Manager

Open Resilient Cluster Manager: A Distributed Approach to a Resilient Router Manager

Ralph H. Castain, Ph.D.
Cisco Systems, Inc.

Page 2: Open Resilient Cluster Manager: A Distributed Approach to a  Resilient Router Manager

Outline

• Overview
• Key pieces: OpenRTE, uPNP
• ORCM: architecture, fault behavior
• Future directions

Page 3: Open Resilient Cluster Manager: A Distributed Approach to a  Resilient Router Manager


System Software Requirements

1) Turn on once with remote access thereafter

2) Non-Stop == max 20 events/day lasting < 200ms each

3) Hitless SW Upgrades and Downgrades

4) Upgrade/downgrade SW components across delta versions

5) Field Patchable

6) Beta Test New Features in situ

7) Extensive Trace Facilities: on Routes, Tunnels, Subscribers,…

8) Configuration

9) Clear APIs; minimize application awareness

10) Extensive remote capabilities for fault management, software maintenance and software installations

Page 4: Open Resilient Cluster Manager: A Distributed Approach to a  Resilient Router Manager

Our Approach

• Distributed redundancy
  • No master
  • Multiple copies of everything, running in tracking mode: parallel, seeing identical input
  • Multiple ways of selecting a leader
• Utilize the component architecture
  • Multiple ways to do something => a framework!
  • Create an initial working base
  • Encourage experimentation

Page 5: Open Resilient Cluster Manager: A Distributed Approach to a  Resilient Router Manager

Methodology

• Exploit open source software
  • Reduce development time
  • Encourage outside participation
  • Cross-fertilize with the HPC community
• Write a new cluster manager (ORCM)
  • Exploit new capabilities
  • Potential dual use for HPC clusters
  • Encourage outside contributions

Page 6: Open Resilient Cluster Manager: A Distributed Approach to a  Resilient Router Manager

Open Source ≠ Free

Pro
• Widespread exposure: ORTE runs on thousands of systems around the world, surfacing and addressing problems
• Community support: others can help solve problems; expanded access to tools (e.g., debuggers)
• Energy: other ideas and methods

Con
• Your timeline ≠ my timeline: no penalty for late contributions; academic contributors have other priorities
• Compromise, a required art: code must be designed to support multiple approaches; nobody wins all the time; adds time to implementation

Page 7: Open Resilient Cluster Manager: A Distributed Approach to a  Resilient Router Manager

Outline

• Overview
• Key pieces: OpenRTE, uPNP
• ORCM: architecture, fault behavior
• Future directions

3-day workshop

Page 8: Open Resilient Cluster Manager: A Distributed Approach to a  Resilient Router Manager

A Convergence of Ideas

[Diagram: prior efforts converging into Open MPI and OpenRTE, and onward to resilient computing systems: LAM/MPI (IU), LA-MPI (LANL), FT-MPI (U of TN), PACX-MPI (HLRS), robustness work (CSU), fault detection (LANL, industry), FDDP (semiconductor manufacturing industry), grid computing (many), autonomous computing (many)]

Page 9: Open Resilient Cluster Manager: A Distributed Approach to a  Resilient Router Manager

Program Objective

*Cell = one or more computers sharing a common launch environment/point

Page 10: Open Resilient Cluster Manager: A Distributed Approach to a  Resilient Router Manager

Participants

Developers
• DOE/NNSA*: Los Alamos Nat Lab, Sandia Nat Lab, Oak Ridge Nat Lab
• Universities: Indiana University, Univ of Tennessee, Univ of Houston, HLRS (Stuttgart)

Support
• Industry: Cisco, Oracle, IBM, Microsoft*, Apple*, multiple interconnect vendors
• Open source teams: OFED, autotools, Mercurial

*Providing funding

Page 11: Open Resilient Cluster Manager: A Distributed Approach to a  Resilient Router Manager

Reliance on Components

• Formalized interfaces: each specifies a "black box" implementation
• Different implementations are available at run time
• Can compose different systems on the fly

[Diagram: a caller selecting among Interface 1, Interface 2, and Interface 3]
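OpenRTE realizes these interfaces in C as tables of function pointers that can be swapped at run time. A minimal sketch of the idea, with invented names rather than the real OpenRTE framework definitions:

/* Sketch of a "black box" interface as a table of function pointers.
 * Names are illustrative, not the actual OpenRTE framework definitions. */
#include <stdio.h>

typedef struct {
    const char *name;
    int  (*init)(void);
    int  (*map_app)(int num_procs);   /* behavior hidden behind the interface */
    void (*finalize)(void);
} mapper_module_t;

static int  basic_init(void)  { return 0; }
static int  basic_map(int n)  { printf("mapping %d procs round-robin\n", n); return 0; }
static void basic_fini(void)  { }

static mapper_module_t basic_mapper = { "basic", basic_init, basic_map, basic_fini };

int main(void)
{
    /* The caller only sees the interface; a different module could be
     * selected at run time without touching this code. */
    mapper_module_t *active = &basic_mapper;
    active->init();
    active->map_app(4);
    active->finalize();
    return 0;
}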

Page 12: Open Resilient Cluster Manager: A Distributed Approach to a  Resilient Router Manager

OpenRTE and Components

• Components are shared libraries: a central set of components lives in the installation tree, and users can also have components under $HOME
• Components can be added or removed after install: no need to recompile or re-link apps; download and install new components; develop new components safely
• Updates can happen "on the fly": add or update components while running; frameworks "pause" during the update
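Since components are plain shared libraries, run-time discovery can be as simple as dlopen()ing candidate files. A hedged sketch of that idea; the paths and the component_init symbol are assumptions for illustration, not the real ORTE layout:

/* Illustrative dynamic component loading. Paths and symbol names are
 * invented for this sketch; link with -ldl. */
#include <dlfcn.h>
#include <stdio.h>

int main(void)
{
    const char *paths[] = {
        "/opt/orcm/lib/components/mca_mapper_basic.so",   /* central installation tree */
        "/home/user/.orcm/components/mca_mapper_test.so"  /* user component under $HOME */
    };
    for (int i = 0; i < 2; i++) {
        void *handle = dlopen(paths[i], RTLD_NOW);
        if (!handle) { fprintf(stderr, "skipping %s: %s\n", paths[i], dlerror()); continue; }
        int (*init)(void) = (int (*)(void)) dlsym(handle, "component_init");
        if (init) init();
        /* dlclose(handle) when the framework pauses and replaces the component */
    }
    return 0;
}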

Page 13: Open Resilient Cluster Manager: A Distributed Approach to a  Resilient Router Manager

Component Benefits

• Stable, production-quality environment for third-party researchers: experiment inside the system without rebuilding everything else; small learning curve (learn a few components, not the entire implementation); allow wide use and experience before exposing new work
• Vendors can quickly roll out support for new platforms: write only the components you want or need to change; protect intellectual property

Page 14: Open Resilient Cluster Manager: A Distributed Approach to a  Resilient Router Manager

ORTE: Resiliency*

• Fault: an event that hinders the correct operation of a process. It may not actually be a "failure" of a component, but it can cause system-level failure or performance degradation below the specified level. The effect may be immediate or may appear some time in the future. Faults are usually rare, so there may not be many data examples.
• Fault prediction: estimate the probability of an incipient fault within some future time period.
• Fault tolerance (reactive, static): the ability to recover from a fault.
• Robustness (a metric): how much the system can absorb without catastrophic consequences.
• Resilience (proactive, dynamic): dynamically configure the system to minimize the impact of potential faults.

*covered in a standalone presentation

Page 15: Open Resilient Cluster Manager: A Distributed Approach to a  Resilient Router Manager

Key Frameworks

Error Manager (Errmgr)
• Receives all process state updates (sensor, waitpid), including predictions
• Determines the response strategy: restart locally, restart globally, or abort
• Executes recovery, accounting for fault groups to avoid repeated failover

Sensor
• Monitors software and hardware state of health: sentinel file size, modification and access times; memory footprint; temperature; heartbeat; ECC errors
• Predicts incipient faults: trend and fingerprint methods, with AI-based algorithms coming (a toy trend check is sketched below)
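As a rough illustration of the sensor idea, the toy loop below samples one metric (say, memory footprint), keeps a short history, and flags either a hard limit breach or a rising trend. The thresholds and the trend rule are invented for the sketch, not ORCM's actual policy:

/* Toy sensor: sample a value, keep a short history, report
 * 1 = limit exceeded, 2 = incipient fault predicted, 0 = ok. */
#include <stdio.h>

#define HISTORY  5
#define LIMIT_MB 512.0

static int check(double history[HISTORY], double sample)
{
    for (int i = 0; i < HISTORY - 1; i++) history[i] = history[i + 1];
    history[HISTORY - 1] = sample;

    if (sample > LIMIT_MB) return 1;       /* hard fault: limit exceeded */

    int rising = 1;                        /* crude trend-based prediction */
    for (int i = 1; i < HISTORY; i++)
        if (history[i] <= history[i - 1]) { rising = 0; break; }
    return rising ? 2 : 0;
}

int main(void)
{
    double hist[HISTORY] = {0};
    double samples[] = {100, 180, 260, 350, 470, 530};
    for (int i = 0; i < 6; i++)
        printf("sample %.0f MB -> state %d\n", samples[i], check(hist, samples[i]));
    return 0;
}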

Page 16: Open Resilient Cluster Manager: A Distributed Approach to a  Resilient Router Manager

Outline

• Overview
• Key pieces: OpenRTE, uPNP
• ORCM: architecture, fault behavior
• Future directions

Page 17: Open Resilient Cluster Manager: A Distributed Approach to a  Resilient Router Manager

Universal PNP

• Widely adopted standard
• ORCM uses only a part: PNP discovery via an announcement on a standard multicast channel (a minimal announcement sketch follows below)
  • The announcement includes the application id and contact info
  • All applications respond
• The wireup "storm" limits scalability; various algorithms exist for storm reduction
• Each application is assigned its own "channel"
  • All output from members of that application goes to its channel
  • Input sent to that application is delivered to all members
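A minimal sketch of the discovery announcement, assuming a UDP multicast group and a made-up wire format; the real ORCM PNP message layout, group address, and port are not shown here:

/* Send a one-shot PNP-style announcement (application id + contact info)
 * on a multicast channel. Group, port, and message format are assumptions. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int sock = socket(AF_INET, SOCK_DGRAM, 0);
    if (sock < 0) { perror("socket"); return 1; }

    struct sockaddr_in grp = {0};
    grp.sin_family = AF_INET;
    grp.sin_port = htons(12345);                       /* assumed "system" channel  */
    inet_pton(AF_INET, "239.255.0.1", &grp.sin_addr);  /* assumed multicast group   */

    char msg[128];
    snprintf(msg, sizeof(msg), "ANNOUNCE app=router-mgr rank=0 tcp://10.0.0.5:7000");
    if (sendto(sock, msg, strlen(msg), 0, (struct sockaddr *)&grp, sizeof(grp)) < 0)
        perror("sendto");
    /* Peers that hear this respond with their own info, which is where the
     * wireup "storm" comes from as the number of applications grows. */
    close(sock);
    return 0;
}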

Page 18: Open Resilient Cluster Manager: A Distributed Approach to a  Resilient Router Manager

Outline

• Overview
• Key pieces: OpenRTE, uPNP
• ORCM: architecture, fault behavior
• Future directions

Page 19: Open Resilient Cluster Manager: A Distributed Approach to a  Resilient Router Manager

ORCM DVM

• One orcmd per node: started at node boot or launched by a tool; locally spawns and monitors processes and system health sensors; small footprint (≤1 MB)
• Each daemon tracks the existence of the others: PNP wireup; knows where all processes are located

[Diagram: orcmd daemons on each node connected via a predefined "system" multicast channel]

Page 20: Open Resilient Cluster Manager: A Distributed Approach to a  Resilient Router Manager

Parallel DVMs

• Allows concurrent development and testing in the production environment, and sharing of development resources
• A unique identifier (the ORTE jobid) maintains separation between orcmd's
• Each application belongs to its respective DVM; no cross-DVM communication is allowed (a filtering sketch follows below)
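The separation rule itself is simple: a daemon compares the jobid carried in each message against its own and drops anything from another DVM. A tiny sketch with invented field names:

/* Cross-DVM isolation sketch: messages carry the sender's ORTE jobid,
 * and a daemon only accepts traffic from its own DVM. Names are invented. */
#include <stdint.h>
#include <stdio.h>

typedef struct { uint32_t jobid; uint32_t vpid; const char *payload; } msg_t;

static const uint32_t my_jobid = 42;   /* identifier of this DVM */

static void recv_msg(const msg_t *m)
{
    if (m->jobid != my_jobid) {        /* no cross-DVM communication allowed */
        printf("drop msg from foreign DVM %u\n", m->jobid);
        return;
    }
    printf("accept msg from daemon %u: %s\n", m->vpid, m->payload);
}

int main(void)
{
    msg_t a = {42, 3, "heartbeat"}, b = {77, 1, "heartbeat"};
    recv_msg(&a);
    recv_msg(&b);
    return 0;
}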

Page 21: Open Resilient Cluster Manager: A Distributed Approach to a  Resilient Router Manager

Configuration Mgmt

[Diagram: at orcm-start, the lowest-vpid orcmd opens the cfgi framework, connects and subscribes to a configuration source (a confd daemon, a tool, or a file), and receives the configuration]

Page 22: Open Resilient Cluster Manager: A Distributed Approach to a  Resilient Router Manager

Configuration Mgmt

[Diagram: the same configuration flow; the subscribing daemon updates any missing config info and assumes "leader" duties]

Page 23: Open Resilient Cluster Manager: A Distributed Approach to a  Resilient Router Manager

Application Launch

[Diagram: a configuration change arrives (number of procs, location); a launch message is sent on the predefined "system" multicast channel to all orcmd daemons]

Page 24: Open Resilient Cluster Manager: A Distributed Approach to a  Resilient Router Manager

Resilient Mapper

• Fault groups: nodes with a common failure mode; a node can belong to multiple fault groups; defined in the system file
• Map instances across fault groups to minimize the probability of cascading failures: one instance per fault group; pick the lightest-loaded node in each group; randomly map any extras (a mapping sketch follows below)
• Next-generation algorithms: failure-mode probability => fault group selection
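In outline, the mapping step places one instance per fault group, choosing the least-loaded node within each group. The sketch below is an assumption-laden rendering of that rule, not the actual ORCM mapper:

/* Resilient-mapper sketch: one replica per fault group, lightest-loaded
 * node within each group. Data layout is illustrative only. */
#include <stdio.h>

#define GROUPS 3
#define NODES_PER_GROUP 2

typedef struct { const char *name; int load; } node_t;

int main(void)
{
    node_t fault_group[GROUPS][NODES_PER_GROUP] = {
        {{"n01", 4}, {"n02", 1}},   /* e.g. nodes sharing a power supply */
        {{"n03", 0}, {"n04", 2}},
        {{"n05", 3}, {"n06", 3}},
    };

    /* One instance per fault group, so a single shared failure mode
     * can only take out one replica. */
    for (int g = 0; g < GROUPS; g++) {
        int best = 0;
        for (int n = 1; n < NODES_PER_GROUP; n++)
            if (fault_group[g][n].load < fault_group[g][best].load) best = n;
        fault_group[g][best].load++;
        printf("replica %d -> %s\n", g, fault_group[g][best].name);
    }
    return 0;
}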

Page 25: Open Resilient Cluster Manager: A Distributed Approach to a  Resilient Router Manager

Multiple Replicas

• Multiple copies of each executable: run on separate fault groups; async, independent
• Shared PNP channel: input is received by all; output is broadcast to all and received by those who registered for it
• The "leader" is determined by the receiver

Page 26: Open Resilient Cluster Manager: A Distributed Approach to a  Resilient Router Manager

Leader Selection

• Two forms of leader selection: internal to the ORCM DVM, and external-facing
• Internal: a framework with app-specific modules; policies include configuration-specified, lowest rank, first contact, or none (a lowest-rank sketch follows below)
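The "lowest rank" policy, for example, amounts to picking the smallest vpid among the replicas currently known to be alive. A hedged sketch, with invented types:

/* "Lowest rank" leader selection sketch: among replicas still alive,
 * the one with the smallest vpid is the leader. Illustrative only. */
#include <stdio.h>

typedef struct { unsigned vpid; int alive; } replica_t;

static int pick_leader(const replica_t *r, int n)
{
    int leader = -1;
    for (int i = 0; i < n; i++)
        if (r[i].alive && (leader < 0 || r[i].vpid < r[leader].vpid))
            leader = i;
    return leader;   /* -1 if no replica is alive */
}

int main(void)
{
    replica_t reps[] = {{0, 0}, {1, 1}, {2, 1}};   /* rank 0 has failed */
    int l = pick_leader(reps, 3);
    if (l >= 0) printf("leader is vpid %u\n", reps[l].vpid);
    return 0;
}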

Page 27: Open Resilient Cluster Manager: A Distributed Approach to a  Resilient Router Manager

External Connections

orcm-connector
• Input: broadcast on the respective PNP channel
• Output: determines a "leader" to supply output to the rest of the world; can utilize any leader method in the framework

Page 28: Open Resilient Cluster Manager: A Distributed Approach to a  Resilient Router Manager

Testing in Production

[Diagram: orcm-logger routes application output through the logger framework to db, file, syslog, or console backends]

Page 29: Open Resilient Cluster Manager: A Distributed Approach to a  Resilient Router Manager

Software Maintenance

• On-the-fly module activation: the configuration manager can select new modules to load, reload, or activate, and can change the priorities of active modules
• Full replacement, when more than a module needs updating: start the replacement version, have the configuration manager switch the "leader", then stop the old version

Page 30: Open Resilient Cluster Manager: A Distributed Approach to a  Resilient Router Manager

Detecting Failures

• Application failures: detected by the local daemon
  • Monitors for self-induced problems (memory and CPU usage) and orders termination if limits are exceeded or trending toward being exceeded
  • Detects unexpected failures via waitpid (a minimal sketch follows below)
• Hardware failures
  • Local hardware sensors continuously report status, read by the local daemon
  • Projected failure modes allow processes to be relocated, or the node shut down, before the failure occurs
  • Detected by the DVM when a daemon misses heartbeats
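The waitpid path is the conventional Unix mechanism. A bare-bones sketch of how a local daemon might notice a child exiting unexpectedly and hand the event to the error manager (restart policy omitted here):

/* Detect unexpected child exit with waitpid(); a real daemon would then
 * report the event to the errmgr framework. Minimal sketch. */
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    pid_t pid = fork();
    if (pid == 0) {                 /* stand-in "application" that dies after 1s */
        sleep(1);
        _exit(3);
    }
    int status;
    if (waitpid(pid, &status, 0) == pid) {
        if (WIFEXITED(status))
            printf("proc %d exited with code %d -> report to errmgr\n",
                   (int)pid, WEXITSTATUS(status));
        else if (WIFSIGNALED(status))
            printf("proc %d killed by signal %d -> report to errmgr\n",
                   (int)pid, WTERMSIG(status));
    }
    return 0;
}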

Page 31: Open Resilient Cluster Manager: A Distributed Approach to a  Resilient Router Manager

Application Failure

• Local daemon: detects (or predicts) the failure; restarts locally up to the specified maximum number of local restarts; uses the resilient mapper to determine the new location; sends the launch message to all daemons (restart bookkeeping sketched below)
• Replacement app: announces itself on the application's public address channel; receives responses and registers its own inputs; begins operation
• Connected applications select a new "leader" based on the current module
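The restart bookkeeping itself is small: count local restarts against the configured maximum, then fall back to relocation through the mapper. A sketch with invented names:

/* Restart policy sketch: restart locally up to max_local_restarts,
 * then ask the resilient mapper for a new location. Names are invented. */
#include <stdio.h>

typedef struct { const char *app; int restarts; int max_local_restarts; } proc_state_t;

static void on_failure(proc_state_t *p)
{
    if (p->restarts < p->max_local_restarts) {
        p->restarts++;
        printf("%s: local restart %d/%d\n", p->app, p->restarts, p->max_local_restarts);
    } else {
        printf("%s: relocate via resilient mapper, launch msg to all daemons\n", p->app);
        p->restarts = 0;
    }
}

int main(void)
{
    proc_state_t p = {"router-mgr", 0, 2};
    for (int i = 0; i < 4; i++) on_failure(&p);   /* simulate repeated failures */
    return 0;
}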

Page 32: Open Resilient Cluster Manager: A Distributed Approach to a  Resilient Router Manager

Node Failure

[Diagram: on node failure, the next-higher orcmd becomes leader; it opens and initializes the cfgi framework, updates any missing config info, marks the node as "down", relocates application processes from the failed node, connected apps fail over their leader per the active leader module, and a restart is attempted]

Page 33: Open Resilient Cluster Manager: A Distributed Approach to a  Resilient Router Manager

Node Replacement/Addition

• Auto-boot of the local daemon on power-up: the daemon announces itself to the DVM, and all DVM members add the node to the available resources
• Reboot/restart: relocate the original procs back, up to some maximum number of times (a smarter algorithm is needed here); leadership remains unaffected to avoid "bounce"
• Processes will map to the new resource as starts/restarts demand
• Future: rebalance the existing load when a node becomes available

Page 34: Open Resilient Cluster Manager: A Distributed Approach to a  Resilient Router Manager

Outline

• Overview
• Key pieces: OpenRTE, uPNP
• ORCM: architecture, fault behavior
• Future directions

Page 35: Open Resilient Cluster Manager: A Distributed Approach to a  Resilient Router Manager


System Software Requirements

1) Turn on once with remote access thereafter

2) Non-Stop == max 20 events/day lasting < 200ms each

3) Hitless SW Upgrades and Downgrades

4) Upgrade/downgrade SW components across delta versions

5) Field Patchable

6) Beta Test New Features in situ

7) Extensive Trace Facilities: on Routes, Tunnels, Subscribers,…

8) Configuration

9) Clear APIs; minimize application awareness

10) Extensive remote capabilities for fault management, software maintenance and software installations

[Slide callouts annotating the requirements above: ~5 ms recovery; start a new app triplet, kill the old one; the new app triplet registers for production input; boot-level startup; start/stop triplets and leader selection]

Page 36: Open Resilient Cluster Manager: A Distributed Approach to a  Resilient Router Manager

Still A Ways To Go

• Security: Who can order ORCM to launch or stop apps? Who can "log" output from which apps? What is the network extent of communications?
• Communications: message size and fragmentation support; speed of the underlying transport; truly reliable multicast; asynchronous messaging

Page 37: Open Resilient Cluster Manager: A Distributed Approach to a  Resilient Router Manager

Still A Ways To Go

• Transfer of state: How does a restarted application replica regain the state of its prior existence? How do we re-sync state across replicas so that outputs track?
• Deterministic outputs: getting the same output from replicas tracking the same inputs assumes deterministic algorithms
• Can we support non-deterministic algorithms? (e.g., random channel selection to balance loads, or decisions based on instantaneous traffic sampling)

Page 38: Open Resilient Cluster Manager: A Distributed Approach to a  Resilient Router Manager

Still A Ways To Go

• Enhanced algorithms: mapping, leader selection
• Fault prediction: implementation and algorithms; expanded sensors
• Replication vs. rapid restart: if we can restart in a few milliseconds, do we really need replication?

Page 39: Open Resilient Cluster Manager: A Distributed Approach to a  Resilient Router Manager

Concluding Remarks

http://www.open-mpi.org
http://www.open-mpi.org/projects/orcm