Open Resilient Cluster Manager: A Distributed Approach to a Resilient Router Manager
Ralph H. Castain, Ph.D., Cisco Systems, Inc.


Page 1: Open Resilient Cluster Manager: A Distributed Approach to a  Resilient Router Manager

Open Resilient Cluster Manager: A Distributed Approach to a Resilient Router Manager

Ralph H. Castain, Ph.D.
Cisco Systems, Inc.

Page 2: Open Resilient Cluster Manager: A Distributed Approach to a  Resilient Router Manager

Outline

• Overview
• Key pieces: OpenRTE, uPNP
• ORCM: architecture, fault behavior
• Future directions

Page 3: Open Resilient Cluster Manager: A Distributed Approach to a  Resilient Router Manager


System Software Requirements

1) Turn on once with remote access thereafter

2) Non-Stop == max 20 events/day lasting < 200ms each

3) Hitless SW Upgrades and Downgrades

4) Upgrade/downgrade SW components across delta versions

5) Field Patchable

6) Beta Test New Features in situ

7) Extensive Trace Facilities: on Routes, Tunnels, Subscribers,…

8) Configuration

9) Clear APIs; minimize application awareness

10) Extensive remote capabilities for fault management, software maintenance and software installations

Page 4: Open Resilient Cluster Manager: A Distributed Approach to a  Resilient Router Manager

Our Approach

• Distributed redundancy
  • No master
  • Multiple copies of everything, running in tracking mode: parallel, seeing identical input
  • Multiple ways of selecting a leader
• Utilize the component architecture
  • Multiple ways to do something => a framework!
  • Create an initial working base
  • Encourage experimentation

Page 5: Open Resilient Cluster Manager: A Distributed Approach to a  Resilient Router Manager

Methodology

• Exploit open source software
  • Reduce development time
  • Encourage outside participation
  • Cross-fertilize with the HPC community
• Write a new cluster manager (ORCM)
  • Exploit new capabilities
  • Potential dual use for HPC clusters
  • Encourage outside contributions

Page 6: Open Resilient Cluster Manager: A Distributed Approach to a  Resilient Router Manager

Open Source ≠ Free

Pro
• Widespread exposure: ORTE runs on thousands of systems around the world, surfacing and addressing problems
• Community support: others can help solve problems; expanded access to tools (e.g., debuggers)
• Energy: other ideas and methods

Con
• Your timeline ≠ my timeline: no penalty for late contributions; academic contributors have other priorities
• Compromise, a required art: code must be designed to support multiple approaches; nobody wins all the time; adds time to implementation

Page 7: Open Resilient Cluster Manager: A Distributed Approach to a  Resilient Router Manager

Outline

• Overview
• Key pieces: OpenRTE, uPNP
• ORCM: architecture, fault behavior
• Future directions

3-day workshop

Page 8: Open Resilient Cluster Manager: A Distributed Approach to a  Resilient Router Manager

A Convergence of Ideas

[Diagram: prior efforts converging into Open MPI and OpenRTE, and onward to resilient computing systems: LAM/MPI (IU), LA-MPI (LANL), FT-MPI (U of TN), PACX-MPI (HLRS), robustness work (CSU), fault detection (LANL, industry), FDDP (semiconductor manufacturing industry), grid computing (many), autonomous computing (many)]

Page 9: Open Resilient Cluster Manager: A Distributed Approach to a  Resilient Router Manager

Program Objective

*Cell = one or more computers sharing a common launch environment/point

Page 10: Open Resilient Cluster Manager: A Distributed Approach to a  Resilient Router Manager

Participants

Developers
• DOE/NNSA*: Los Alamos Nat Lab, Sandia Nat Lab, Oak Ridge Nat Lab
• Universities: Indiana University, Univ of Tennessee, Univ of Houston, HLRS (Stuttgart)

Support
• Industry: Cisco, Oracle, IBM, Microsoft*, Apple*, multiple interconnect vendors
• Open source teams: OFED, autotools, Mercurial

*Providing funding

Page 11: Open Resilient Cluster Manager: A Distributed Approach to a  Resilient Router Manager

Reliance on Components

• Formalized interfaces: each specifies a "black box" implementation
• Different implementations are available at run time
• Can compose different systems on the fly

[Diagram: a caller selecting among Interface 1, Interface 2, and Interface 3]
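OpenRTE realizes these interfaces in C as tables of function pointers that can be swapped at run time. A minimal sketch of the idea, with invented names rather than the real OpenRTE framework definitions:

/* Sketch of a "black box" interface as a table of function pointers.
 * Names are illustrative, not the actual OpenRTE framework definitions. */
#include <stdio.h>

typedef struct {
    const char *name;
    int  (*init)(void);
    int  (*map_app)(int num_procs);   /* behavior hidden behind the interface */
    void (*finalize)(void);
} mapper_module_t;

static int  basic_init(void)  { return 0; }
static int  basic_map(int n)  { printf("mapping %d procs round-robin\n", n); return 0; }
static void basic_fini(void)  { }

static mapper_module_t basic_mapper = { "basic", basic_init, basic_map, basic_fini };

int main(void)
{
    /* The caller only sees the interface; a different module could be
     * selected at run time without touching this code. */
    mapper_module_t *active = &basic_mapper;
    active->init();
    active->map_app(4);
    active->finalize();
    return 0;
}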

Page 12: Open Resilient Cluster Manager: A Distributed Approach to a  Resilient Router Manager

OpenRTE and Components

• Components are shared libraries: a central set of components lives in the installation tree, and users can also have components under $HOME
• Components can be added or removed after install: no need to recompile or re-link apps; download and install new components; develop new components safely
• Updates can happen "on the fly": add or update components while running; frameworks "pause" during the update
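Since components are plain shared libraries, run-time discovery can be as simple as dlopen()ing candidate files. A hedged sketch of that idea; the paths and the component_init symbol are assumptions for illustration, not the real ORTE layout:

/* Illustrative dynamic component loading. Paths and symbol names are
 * invented for this sketch; link with -ldl. */
#include <dlfcn.h>
#include <stdio.h>

int main(void)
{
    const char *paths[] = {
        "/opt/orcm/lib/components/mca_mapper_basic.so",   /* central installation tree */
        "/home/user/.orcm/components/mca_mapper_test.so"  /* user component under $HOME */
    };
    for (int i = 0; i < 2; i++) {
        void *handle = dlopen(paths[i], RTLD_NOW);
        if (!handle) { fprintf(stderr, "skipping %s: %s\n", paths[i], dlerror()); continue; }
        int (*init)(void) = (int (*)(void)) dlsym(handle, "component_init");
        if (init) init();
        /* dlclose(handle) when the framework pauses and replaces the component */
    }
    return 0;
}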

Page 13: Open Resilient Cluster Manager: A Distributed Approach to a  Resilient Router Manager

Component Benefits

• Stable, production-quality environment for third-party researchers: experiment inside the system without rebuilding everything else; small learning curve (learn a few components, not the entire implementation); allow wide use and experience before exposing new work
• Vendors can quickly roll out support for new platforms: write only the components you want or need to change; protect intellectual property

Page 14: Open Resilient Cluster Manager: A Distributed Approach to a  Resilient Router Manager

ORTE: Resiliency*

• Fault: an event that hinders the correct operation of a process. It may not actually be a "failure" of a component, but it can cause system-level failure or performance degradation below the specified level. The effect may be immediate or may appear some time in the future. Faults are usually rare, so there may not be many data examples.
• Fault prediction: estimate the probability of an incipient fault within some future time period.
• Fault tolerance (reactive, static): the ability to recover from a fault.
• Robustness (a metric): how much the system can absorb without catastrophic consequences.
• Resilience (proactive, dynamic): dynamically configure the system to minimize the impact of potential faults.

*covered in a standalone presentation

Page 15: Open Resilient Cluster Manager: A Distributed Approach to a  Resilient Router Manager

Key Frameworks

Error Manager (Errmgr)
• Receives all process state updates (sensor, waitpid), including predictions
• Determines the response strategy: restart locally, restart globally, or abort
• Executes recovery, accounting for fault groups to avoid repeated failover

Sensor
• Monitors software and hardware state of health: sentinel file size, modification and access times; memory footprint; temperature; heartbeat; ECC errors
• Predicts incipient faults: trend and fingerprint methods, with AI-based algorithms coming (a toy trend check is sketched below)
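As a rough illustration of the sensor idea, the toy loop below samples one metric (say, memory footprint), keeps a short history, and flags either a hard limit breach or a rising trend. The thresholds and the trend rule are invented for the sketch, not ORCM's actual policy:

/* Toy sensor: sample a value, keep a short history, report
 * 1 = limit exceeded, 2 = incipient fault predicted, 0 = ok. */
#include <stdio.h>

#define HISTORY  5
#define LIMIT_MB 512.0

static int check(double history[HISTORY], double sample)
{
    for (int i = 0; i < HISTORY - 1; i++) history[i] = history[i + 1];
    history[HISTORY - 1] = sample;

    if (sample > LIMIT_MB) return 1;       /* hard fault: limit exceeded */

    int rising = 1;                        /* crude trend-based prediction */
    for (int i = 1; i < HISTORY; i++)
        if (history[i] <= history[i - 1]) { rising = 0; break; }
    return rising ? 2 : 0;
}

int main(void)
{
    double hist[HISTORY] = {0};
    double samples[] = {100, 180, 260, 350, 470, 530};
    for (int i = 0; i < 6; i++)
        printf("sample %.0f MB -> state %d\n", samples[i], check(hist, samples[i]));
    return 0;
}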

Page 16: Open Resilient Cluster Manager: A Distributed Approach to a  Resilient Router Manager

Outline

• Overview
• Key pieces: OpenRTE, uPNP
• ORCM: architecture, fault behavior
• Future directions

Page 17: Open Resilient Cluster Manager: A Distributed Approach to a  Resilient Router Manager

Universal PNP

• Widely adopted standard
• ORCM uses only a part: PNP discovery via an announcement on a standard multicast channel (a minimal announcement sketch follows below)
  • The announcement includes the application id and contact info
  • All applications respond
• The wireup "storm" limits scalability; various algorithms exist for storm reduction
• Each application is assigned its own "channel"
  • All output from members of that application goes to its channel
  • Input sent to that application is delivered to all members
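A minimal sketch of the discovery announcement, assuming a UDP multicast group and a made-up wire format; the real ORCM PNP message layout, group address, and port are not shown here:

/* Send a one-shot PNP-style announcement (application id + contact info)
 * on a multicast channel. Group, port, and message format are assumptions. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int sock = socket(AF_INET, SOCK_DGRAM, 0);
    if (sock < 0) { perror("socket"); return 1; }

    struct sockaddr_in grp = {0};
    grp.sin_family = AF_INET;
    grp.sin_port = htons(12345);                       /* assumed "system" channel  */
    inet_pton(AF_INET, "239.255.0.1", &grp.sin_addr);  /* assumed multicast group   */

    char msg[128];
    snprintf(msg, sizeof(msg), "ANNOUNCE app=router-mgr rank=0 tcp://10.0.0.5:7000");
    if (sendto(sock, msg, strlen(msg), 0, (struct sockaddr *)&grp, sizeof(grp)) < 0)
        perror("sendto");
    /* Peers that hear this respond with their own info, which is where the
     * wireup "storm" comes from as the number of applications grows. */
    close(sock);
    return 0;
}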

Page 18: Open Resilient Cluster Manager: A Distributed Approach to a  Resilient Router Manager

Outline

• Overview
• Key pieces: OpenRTE, uPNP
• ORCM: architecture, fault behavior
• Future directions

Page 19: Open Resilient Cluster Manager: A Distributed Approach to a  Resilient Router Manager

ORCM DVM

• One orcmd per node: started at node boot or launched by a tool; locally spawns and monitors processes and system health sensors; small footprint (≤1 MB)
• Each daemon tracks the existence of the others: PNP wireup; knows where all processes are located

[Diagram: orcmd daemons on each node connected via a predefined "system" multicast channel]

Page 20: Open Resilient Cluster Manager: A Distributed Approach to a  Resilient Router Manager

Parallel DVMs

• Allows concurrent development and testing in the production environment, and sharing of development resources
• A unique identifier (the ORTE jobid) maintains separation between orcmd's
• Each application belongs to its respective DVM; no cross-DVM communication is allowed (a filtering sketch follows below)
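The separation rule itself is simple: a daemon compares the jobid carried in each message against its own and drops anything from another DVM. A tiny sketch with invented field names:

/* Cross-DVM isolation sketch: messages carry the sender's ORTE jobid,
 * and a daemon only accepts traffic from its own DVM. Names are invented. */
#include <stdint.h>
#include <stdio.h>

typedef struct { uint32_t jobid; uint32_t vpid; const char *payload; } msg_t;

static const uint32_t my_jobid = 42;   /* identifier of this DVM */

static void recv_msg(const msg_t *m)
{
    if (m->jobid != my_jobid) {        /* no cross-DVM communication allowed */
        printf("drop msg from foreign DVM %u\n", m->jobid);
        return;
    }
    printf("accept msg from daemon %u: %s\n", m->vpid, m->payload);
}

int main(void)
{
    msg_t a = {42, 3, "heartbeat"}, b = {77, 1, "heartbeat"};
    recv_msg(&a);
    recv_msg(&b);
    return 0;
}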

Page 21: Open Resilient Cluster Manager: A Distributed Approach to a  Resilient Router Manager

Configuration Mgmt

[Diagram: at orcm-start, the lowest-vpid orcmd opens the cfgi framework, connects and subscribes to a configuration source (a confd daemon, a tool, or a file), and receives the configuration]

Page 22: Open Resilient Cluster Manager: A Distributed Approach to a  Resilient Router Manager

Configuration Mgmt

[Diagram: the same configuration flow; the subscribing daemon updates any missing config info and assumes "leader" duties]

Page 23: Open Resilient Cluster Manager: A Distributed Approach to a  Resilient Router Manager

Application Launch

[Diagram: a configuration change arrives (number of procs, location); a launch message is sent on the predefined "system" multicast channel to all orcmd daemons]

Page 24: Open Resilient Cluster Manager: A Distributed Approach to a  Resilient Router Manager

Resilient Mapper

• Fault groups: nodes with a common failure mode; a node can belong to multiple fault groups; defined in the system file
• Map instances across fault groups to minimize the probability of cascading failures: one instance per fault group; pick the lightest-loaded node in each group; randomly map any extras (a mapping sketch follows below)
• Next-generation algorithms: failure-mode probability => fault group selection
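In outline, the mapping step places one instance per fault group, choosing the least-loaded node within each group. The sketch below is an assumption-laden rendering of that rule, not the actual ORCM mapper:

/* Resilient-mapper sketch: one replica per fault group, lightest-loaded
 * node within each group. Data layout is illustrative only. */
#include <stdio.h>

#define GROUPS 3
#define NODES_PER_GROUP 2

typedef struct { const char *name; int load; } node_t;

int main(void)
{
    node_t fault_group[GROUPS][NODES_PER_GROUP] = {
        {{"n01", 4}, {"n02", 1}},   /* e.g. nodes sharing a power supply */
        {{"n03", 0}, {"n04", 2}},
        {{"n05", 3}, {"n06", 3}},
    };

    /* One instance per fault group, so a single shared failure mode
     * can only take out one replica. */
    for (int g = 0; g < GROUPS; g++) {
        int best = 0;
        for (int n = 1; n < NODES_PER_GROUP; n++)
            if (fault_group[g][n].load < fault_group[g][best].load) best = n;
        fault_group[g][best].load++;
        printf("replica %d -> %s\n", g, fault_group[g][best].name);
    }
    return 0;
}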

Page 25: Open Resilient Cluster Manager: A Distributed Approach to a  Resilient Router Manager

Multiple Replicas

• Multiple copies of each executable: run on separate fault groups; async, independent
• Shared PNP channel: input is received by all; output is broadcast to all and received by those who registered for it
• The "leader" is determined by the receiver

Page 26: Open Resilient Cluster Manager: A Distributed Approach to a  Resilient Router Manager

Leader Selection

• Two forms of leader selection: internal to the ORCM DVM, and external-facing
• Internal: a framework with app-specific modules; policies include configuration-specified, lowest rank, first contact, or none (a lowest-rank sketch follows below)
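The "lowest rank" policy, for example, amounts to picking the smallest vpid among the replicas currently known to be alive. A hedged sketch, with invented types:

/* "Lowest rank" leader selection sketch: among replicas still alive,
 * the one with the smallest vpid is the leader. Illustrative only. */
#include <stdio.h>

typedef struct { unsigned vpid; int alive; } replica_t;

static int pick_leader(const replica_t *r, int n)
{
    int leader = -1;
    for (int i = 0; i < n; i++)
        if (r[i].alive && (leader < 0 || r[i].vpid < r[leader].vpid))
            leader = i;
    return leader;   /* -1 if no replica is alive */
}

int main(void)
{
    replica_t reps[] = {{0, 0}, {1, 1}, {2, 1}};   /* rank 0 has failed */
    int l = pick_leader(reps, 3);
    if (l >= 0) printf("leader is vpid %u\n", reps[l].vpid);
    return 0;
}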

Page 27: Open Resilient Cluster Manager: A Distributed Approach to a  Resilient Router Manager

External Connections

orcm-connector
• Input: broadcast on the respective PNP channel
• Output: determines a "leader" to supply output to the rest of the world; can utilize any leader method in the framework

Page 28: Open Resilient Cluster Manager: A Distributed Approach to a  Resilient Router Manager

Testing in Production

[Diagram: orcm-logger routes application output through the logger framework to db, file, syslog, or console backends]

Page 29: Open Resilient Cluster Manager: A Distributed Approach to a  Resilient Router Manager

Software Maintenance

• On-the-fly module activation: the configuration manager can select new modules to load, reload, or activate, and can change the priorities of active modules
• Full replacement, when more than a module needs updating: start the replacement version, have the configuration manager switch the "leader", then stop the old version

Page 30: Open Resilient Cluster Manager: A Distributed Approach to a  Resilient Router Manager

Detecting Failures

• Application failures: detected by the local daemon
  • Monitors for self-induced problems (memory and CPU usage) and orders termination if limits are exceeded or trending toward being exceeded
  • Detects unexpected failures via waitpid (a minimal sketch follows below)
• Hardware failures
  • Local hardware sensors continuously report status, read by the local daemon
  • Projected failure modes allow processes to be relocated, or the node shut down, before the failure occurs
  • Detected by the DVM when a daemon misses heartbeats
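The waitpid path is the conventional Unix mechanism. A bare-bones sketch of how a local daemon might notice a child exiting unexpectedly and hand the event to the error manager (restart policy omitted here):

/* Detect unexpected child exit with waitpid(); a real daemon would then
 * report the event to the errmgr framework. Minimal sketch. */
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    pid_t pid = fork();
    if (pid == 0) {                 /* stand-in "application" that dies after 1s */
        sleep(1);
        _exit(3);
    }
    int status;
    if (waitpid(pid, &status, 0) == pid) {
        if (WIFEXITED(status))
            printf("proc %d exited with code %d -> report to errmgr\n",
                   (int)pid, WEXITSTATUS(status));
        else if (WIFSIGNALED(status))
            printf("proc %d killed by signal %d -> report to errmgr\n",
                   (int)pid, WTERMSIG(status));
    }
    return 0;
}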

Page 31: Open Resilient Cluster Manager: A Distributed Approach to a  Resilient Router Manager

Application Failure

• Local daemon: detects (or predicts) the failure; restarts locally up to the specified maximum number of local restarts; uses the resilient mapper to determine the new location; sends the launch message to all daemons (restart bookkeeping sketched below)
• Replacement app: announces itself on the application's public address channel; receives responses and registers its own inputs; begins operation
• Connected applications select a new "leader" based on the current module
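The restart bookkeeping itself is small: count local restarts against the configured maximum, then fall back to relocation through the mapper. A sketch with invented names:

/* Restart policy sketch: restart locally up to max_local_restarts,
 * then ask the resilient mapper for a new location. Names are invented. */
#include <stdio.h>

typedef struct { const char *app; int restarts; int max_local_restarts; } proc_state_t;

static void on_failure(proc_state_t *p)
{
    if (p->restarts < p->max_local_restarts) {
        p->restarts++;
        printf("%s: local restart %d/%d\n", p->app, p->restarts, p->max_local_restarts);
    } else {
        printf("%s: relocate via resilient mapper, launch msg to all daemons\n", p->app);
        p->restarts = 0;
    }
}

int main(void)
{
    proc_state_t p = {"router-mgr", 0, 2};
    for (int i = 0; i < 4; i++) on_failure(&p);   /* simulate repeated failures */
    return 0;
}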

Page 32: Open Resilient Cluster Manager: A Distributed Approach to a  Resilient Router Manager

Node Failure

[Diagram: on node failure, the next-higher orcmd becomes leader; it opens and initializes the cfgi framework, updates any missing config info, marks the node as "down", relocates application processes from the failed node, connected apps fail over their leader per the active leader module, and a restart is attempted]

Page 33: Open Resilient Cluster Manager: A Distributed Approach to a  Resilient Router Manager

Node Replacement/Addition

• Auto-boot of the local daemon on power-up: the daemon announces itself to the DVM, and all DVM members add the node to the available resources
• Reboot/restart: relocate the original procs back, up to some maximum number of times (a smarter algorithm is needed here); leadership remains unaffected to avoid "bounce"
• Processes will map to the new resource as starts/restarts demand
• Future: rebalance the existing load when a node becomes available

Page 34: Open Resilient Cluster Manager: A Distributed Approach to a  Resilient Router Manager

Outline

• Overview
• Key pieces: OpenRTE, uPNP
• ORCM: architecture, fault behavior
• Future directions

Page 35: Open Resilient Cluster Manager: A Distributed Approach to a  Resilient Router Manager


System Software Requirements

1) Turn on once with remote access thereafter

2) Non-Stop == max 20 events/day lasting < 200ms each

3) Hitless SW Upgrades and Downgrades

4) Upgrade/downgrade SW components across delta versions

5) Field Patchable

6) Beta Test New Features in situ

7) Extensive Trace Facilities: on Routes, Tunnels, Subscribers,…

8) Configuration

9) Clear APIs; minimize application awareness

10) Extensive remote capabilities for fault management, software maintenance and software installations

[Slide callouts annotating the requirements above: ~5 ms recovery; start a new app triplet, kill the old one; the new app triplet registers for production input; boot-level startup; start/stop triplets and leader selection]

Page 36: Open Resilient Cluster Manager: A Distributed Approach to a  Resilient Router Manager

Still A Ways To Go

• Security: Who can order ORCM to launch or stop apps? Who can "log" output from which apps? What is the network extent of communications?
• Communications: message size and fragmentation support; speed of the underlying transport; truly reliable multicast; asynchronous messaging

Page 37: Open Resilient Cluster Manager: A Distributed Approach to a  Resilient Router Manager

Still A Ways To Go

• Transfer of state: How does a restarted application replica regain the state of its prior existence? How do we re-sync state across replicas so that outputs track?
• Deterministic outputs: getting the same output from replicas tracking the same inputs assumes deterministic algorithms
• Can we support non-deterministic algorithms? (e.g., random channel selection to balance loads, or decisions based on instantaneous traffic sampling)

Page 38: Open Resilient Cluster Manager: A Distributed Approach to a  Resilient Router Manager

Still A Ways To Go

• Enhanced algorithms: mapping, leader selection
• Fault prediction: implementation and algorithms; expanded sensors
• Replication vs. rapid restart: if we can restart in a few milliseconds, do we really need replication?

Page 39: Open Resilient Cluster Manager: A Distributed Approach to a  Resilient Router Manager

Concluding Remarks

http://www.open-mpi.org
http://www.open-mpi.org/projects/orcm