Presented by
Reliability, Availability, and Serviceability (RAS) for High-Performance Computing
Stephen L. Scott and Christian Engelmann
Computer Science Research Group, Computer Science and Mathematics Division
2 Scott_RAS_SC07
Research and development goals
Develop techniques to enable HPC systems to run computational jobs 24/7
Develop proof-of-concept prototypes and production-type RAS solutions
Provide high-level RAS capabilities for current terascale and next-generation petascale high-performance computing (HPC) systems
Eliminate many of the numerous single points of failure and control in today’s HPC systems
3 Scott_RAS_SC07
MOLAR: Adaptive runtime support for high-end computing operating and runtime systems
Addresses the challenges for operating and runtime systems to run large applications efficiently on future ultrascale high-end computers
Part of the Forum to Address Scalable Technology for Runtime and Operating Systems (FAST-OS)
MOLAR is a collaborative research effort (www.fastos.org/molar)
4 Scott_RAS_SC07
Active/active head nodes
Many active head nodes
Workload distribution
Symmetric replication between head nodes
Continuous service
Always up to date
No fail-over necessary
No restore-over necessary
Virtual synchrony model (see the sketch below)
Complex algorithms
Prototypes for Torque and Parallel Virtual File System metadata server
[Diagram: symmetric active/active redundancy, with the active head nodes serving the compute nodes]
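As a concrete illustration of the symmetric active/active idea, here is a minimal sketch (with hypothetical names; it is not the MOLAR, Torque, or PVFS prototype code) of state-machine replication: every active head node applies the same totally ordered request stream, so all replicas stay up to date and no fail-over or restore-over is needed.

```python
# Minimal sketch of symmetric active/active replication (hypothetical names).
# Each head-node replica applies the same totally ordered request stream,
# so every replica stays up to date and any replica can serve requests.

class HeadNodeReplica:
    """One active head node holding a full copy of the service state."""
    def __init__(self, name):
        self.name = name
        self.state = {}          # e.g., job queue or file system metadata

    def apply(self, request):
        """Apply a deterministic request; identical order => identical state."""
        op, key, value = request
        if op == "put":
            self.state[key] = value
        elif op == "delete":
            self.state.pop(key, None)

def totally_ordered_broadcast(replicas, requests):
    """Stand-in for the group communication layer: delivers every request
    to every live replica in the same global order."""
    for request in requests:
        for replica in replicas:
            replica.apply(request)

if __name__ == "__main__":
    replicas = [HeadNodeReplica(f"head{i}") for i in range(3)]
    totally_ordered_broadcast(replicas, [("put", "job42", "queued"),
                                         ("put", "job42", "running")])
    # All replicas hold identical state; any one can answer queries or
    # continue the service immediately if another fails.
    assert all(r.state == replicas[0].state for r in replicas)
```

In the actual prototypes the total order is provided by a virtual synchrony group communication system rather than the simple loop above.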
5 Scott_RAS_SC07
Symmetric active/active Parallel Virtual File System metadata server

Nodes   Availability   Est. annual downtime
1       98.58%         5d, 4h, 21m
2       99.97%         1h, 45m
3       99.9997%       1m, 30s

[Figure: writing and reading throughput (requests/sec) versus number of clients (1, 2, 4, 8, 16, 32) for plain PVFS and for symmetric active/active configurations with 1, 2, and 4 metadata servers]
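The downtime figures in the table above are roughly consistent with the standard model for independent redundant replicas, in which the service is up as long as at least one head node is up. A minimal sketch, assuming independent node failures and the 98.58% single-node availability quoted above (small rounding differences from the table are expected):

```python
# Sketch only: availability of n redundant active/active head nodes,
# assuming independent failures and that one surviving node suffices.
HOURS_PER_YEAR = 8760

def redundant_availability(single_node_availability, n):
    """A_n = 1 - (1 - A_1)^n for n independent replicas."""
    return 1.0 - (1.0 - single_node_availability) ** n

for n in (1, 2, 3):
    a = redundant_availability(0.9858, n)
    downtime_hours = (1.0 - a) * HOURS_PER_YEAR
    print(f"{n} node(s): availability {a:.6%}, "
          f"est. annual downtime {downtime_hours:.2f} h")
# Output is close to the table above: roughly 124 h, 1.8 h, and 1.5 min.
```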
6 Scott_RAS_SC07
Reactive fault tolerance for HPC with LAM/MPI+BLCR job-pause mechanism
Operational nodes (pause): BLCR reuses existing processes; LAM/MPI reuses existing connections; restore partial process state from checkpoint
Failed nodes (migrate): restart process on new node from checkpoint; reconnect with paused processes
Scalable MPI membership management for low overhead
Efficient, transparent, and automatic failure recovery (see the sketch below)
[Diagram: paused MPI processes on live nodes keep their existing connections; the failed MPI process on the failed node is restarted from shared storage on a spare node, and new connections replace the failed ones]
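A minimal control-flow sketch of the job-pause approach described above. It is not the LAM/MPI or BLCR implementation; the Node class and method names are hypothetical stand-ins for the corresponding BLCR and LAM/MPI operations.

```python
# Hypothetical sketch of the LAM/MPI + BLCR job-pause recovery flow.
# The Node class and its methods are stand-ins, not the real BLCR/LAM API.

class Node:
    def __init__(self, rank, healthy=True):
        self.rank, self.healthy = rank, healthy

    def pause_from_checkpoint(self, ckpt):
        # BLCR reuses the existing process; only partial state is rolled back.
        print(f"rank {self.rank}: pause in place, roll back to {ckpt}")

    def restart_from_checkpoint(self, ckpt, rank):
        self.rank = rank
        print(f"spare node: restart rank {rank} from {ckpt} on shared storage")

    def reconnect_and_resume(self, replacements):
        # LAM/MPI reuses existing connections; only migrated ranks reconnect.
        print(f"rank {self.rank}: reconnect to ranks {sorted(replacements)} and resume")

def recover_job(nodes, spares, ckpt="/shared/ckpt"):
    failed = [n for n in nodes if not n.healthy]
    survivors = [n for n in nodes if n.healthy]
    for n in survivors:                 # operational nodes: pause
        n.pause_from_checkpoint(ckpt)
    replacements = {}
    for n in failed:                    # failed nodes: migrate
        spare = spares.pop()
        spare.restart_from_checkpoint(ckpt, rank=n.rank)
        replacements[n.rank] = spare
    for n in survivors:                 # resume without LAM reboot or job restart
        n.reconnect_and_resume(replacements)

recover_job([Node(0), Node(1), Node(2, healthy=False)], [Node(3)])
```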
7 Scott_RAS_SC07
LAM/MPI+BLCR job-pause performance
3.4% overhead over job restart, but no LAM reboot overhead
Transparent continuation of execution
No requeue penalty
Less staging overhead
[Figure: time in seconds for job pause and migrate, LAM reboot, and job restart on the BT, CG, EP, FT, LU, MG, and SP benchmarks]
8 Scott_RAS_SC07
Proactive fault tolerance for HPC using Xen virtualization
Standby Xen host (spare node without guest VM)
Deteriorating health: migrate guest VM to spare node
New host generates unsolicited ARP reply, indicating that the guest VM has moved
ARP tells peers to resend to the new host
Novel fault-tolerance scheme that acts before a failure impacts a system (see the sketch below)
[Diagram: each node runs the Xen VMM with a privileged VM hosting Ganglia and the PFT daemon on top of the hardware BMC; the guest VM running the MPI task is live-migrated from the deteriorating host to the standby Xen host]
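A minimal sketch of the proactive scheme described above, assuming health data is read from Ganglia or the BMC and that the guest VM is moved with Xen live migration (the xm migrate --live command of that era). The thresholds and helper names are illustrative, not the actual PFT daemon implementation.

```python
# Illustrative PFT-daemon loop (not the actual implementation).
# Assumes a read_health() helper that returns BMC/Ganglia sensor readings.
import subprocess
import time

TEMP_LIMIT_C = 75.0          # hypothetical "deteriorating health" threshold

def read_health():
    """Stand-in for querying Ganglia / the hardware BMC (e.g., via IPMI)."""
    return {"cpu_temp_c": 68.0, "fan_ok": True}

def migrate_guest(guest_vm, standby_host):
    """Live-migrate the guest VM; the new host's unsolicited ARP reply then
    tells peers to resend traffic there, so the MPI task keeps running."""
    subprocess.run(["xm", "migrate", "--live", guest_vm, standby_host],
                   check=True)

def pft_daemon(guest_vm="compute-guest-07", standby_host="spare-node-01"):
    while True:
        health = read_health()
        if health["cpu_temp_c"] > TEMP_LIMIT_C or not health["fan_ok"]:
            migrate_guest(guest_vm, standby_host)   # act before the failure
            break
        time.sleep(10)                              # poll interval
```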
9 Scott_RAS_SC07
VM migration performance impact
Single node failure: 0.5–5% additional cost over total wall clock time
Double node failure: 2–8% additional cost over total wall clock time
[Figure: wall-clock time in seconds for BT, CG, EP, LU, and SP without migration, with one migration (single node failure), and with two migrations (double node failure)]
10 Scott_RAS_SC07
HPC reliability analysis and modeling
Programming paradigm and system scale impact reliability
Reliability analysis
Estimate mean time to failure (MTTF); see the MTTF scaling sketch below
Obtain failure distribution: exponential, Weibull, gamma, etc.; see the fitting sketch below
Feedback into fault-tolerance schemes for adaptation
[Figure: total system MTTF in hours versus number of participating nodes (50 to 5000), for the k-of-n AND survivability (k = n) parallel execution model with node MTTF of 1000, 3000, 5000, and 7000 hours]
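A minimal sketch of the k = n (AND survivability) case, assuming independent nodes with exponentially distributed failures; under that assumption the system failure rate is the sum of the node rates, so the system MTTF is the node MTTF divided by the number of nodes. This is the standard approximation behind curves like those in the figure above, not the project's exact model.

```python
# Sketch: system MTTF for the k = n (all nodes must survive) execution model,
# assuming independent nodes with exponentially distributed failures.
# MTTF_system = MTTF_node / n, since failure rates add.

def system_mttf(node_mttf_hours, n_nodes):
    return node_mttf_hours / n_nodes

for node_mttf in (1000, 3000, 5000, 7000):
    for n in (50, 100, 500, 1000, 2000, 5000):
        print(f"node MTTF {node_mttf} h, {n} nodes: "
              f"system MTTF {system_mttf(node_mttf, n):.2f} h")
```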
Distribution fit to the observed times between failures (negative likelihood value): exponential 2653.3, Weibull 3532.8, lognormal 2604.3, gamma 2627.4
[Figure: cumulative probability versus time between failures (TBF) for the fitted distributions]
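A hedged sketch of the fitting step: given observed times between failures, fit the candidate distributions and compare them by negative log-likelihood (smaller is a better fit). The scipy-based code and the sample data below are illustrative only; they are not the tooling or measurements behind the values quoted above.

```python
# Sketch: fit candidate failure distributions to observed time-between-failure
# data and compare them by negative log-likelihood (lower = better fit).
import numpy as np
from scipy import stats

# Hypothetical TBF sample in hours (the real study used logged system failures).
tbf = np.array([12.0, 55.0, 3.5, 80.0, 24.0, 7.0, 150.0, 40.0, 9.0, 66.0])

candidates = {
    "exponential": stats.expon,
    "Weibull": stats.weibull_min,
    "lognormal": stats.lognorm,
    "gamma": stats.gamma,
}

for name, dist in candidates.items():
    # Fix the location at zero, since times between failures are nonnegative.
    params = dist.fit(tbf, floc=0)               # maximum-likelihood fit
    nll = -np.sum(dist.logpdf(tbf, *params))     # negative log-likelihood
    print(f"{name:12s} negative log-likelihood: {nll:.1f}")
# The best-fitting distribution (and its MTTF estimate) can then feed back
# into the fault-tolerance schemes for adaptation.
```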
11 Scott_RAS_SC07
Contacts
Stephen L. Scott
Computer Science Research Group
Computer Science and Mathematics Division
(865) [email protected]

Christian Engelmann
Computer Science Research Group
Computer Science and Mathematics Division
(865) [email protected]