Presented by
Reliability, Availability, and Serviceability (RAS) for High-Performance Computing
Stephen L. Scott and Christian Engelmann
Computer Science Research Group, Computer Science and Mathematics Division
2 Scott_RAS_SC07
Research and development goals
Develop techniques to enable HPC systems to run computational jobs 24/7
Develop proof-of-concept prototypes and production-type RAS solutions
Provide high-level RAS capabilities for current terascale and next-generation petascale high-performance computing (HPC) systems
Eliminate many of the numerous single points of failure and control in today’s HPC systems
3 Scott_RAS_SC07
MOLAR: Adaptive runtime support for high-end computing operating and runtime systems
Addresses the challenges for operating and runtime systems to run large applications efficiently on future ultrascale high-end computers
Part of the Forum to Address Scalable Technology for Runtime and Operating Systems (FAST-OS)
MOLAR is a collaborative research effort (www.fastos.org/molar)
4 Scott_RAS_SC07
Active/active head nodes
Many active head nodes
Workload distribution
Symmetric replication between head nodes
Continuous service
Always up to date
No fail-over necessary
No restore-over necessary
Virtual synchrony model (see the sketch below)
Complex algorithms
Prototypes for Torque and Parallel Virtual File System metadata server
[Diagram: symmetric active/active redundancy, with the active head nodes serving the compute nodes]
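As a concrete illustration of the symmetric active/active idea, here is a minimal sketch (with hypothetical names; it is not the MOLAR, Torque, or PVFS prototype code) of state-machine replication: every active head node applies the same totally ordered request stream, so all replicas stay up to date and no fail-over or restore-over is needed.

```python
# Minimal sketch of symmetric active/active replication (hypothetical names).
# Each head-node replica applies the same totally ordered request stream,
# so every replica stays up to date and any replica can serve requests.

class HeadNodeReplica:
    """One active head node holding a full copy of the service state."""
    def __init__(self, name):
        self.name = name
        self.state = {}          # e.g., job queue or file system metadata

    def apply(self, request):
        """Apply a deterministic request; identical order => identical state."""
        op, key, value = request
        if op == "put":
            self.state[key] = value
        elif op == "delete":
            self.state.pop(key, None)

def totally_ordered_broadcast(replicas, requests):
    """Stand-in for the group communication layer: delivers every request
    to every live replica in the same global order."""
    for request in requests:
        for replica in replicas:
            replica.apply(request)

if __name__ == "__main__":
    replicas = [HeadNodeReplica(f"head{i}") for i in range(3)]
    totally_ordered_broadcast(replicas, [("put", "job42", "queued"),
                                         ("put", "job42", "running")])
    # All replicas hold identical state; any one can answer queries or
    # continue the service immediately if another fails.
    assert all(r.state == replicas[0].state for r in replicas)
```

In the actual prototypes the total order is provided by a virtual synchrony group communication system rather than the simple loop above.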
5 Scott_RAS_SC07
Symmetric active/active Parallel Virtual File System metadata server

Nodes   Availability   Est. annual downtime
1       98.58%         5d, 4h, 21m
2       99.97%         1h, 45m
3       99.9997%       1m, 30s

[Figure: writing and reading throughput (requests/sec) versus number of clients (1, 2, 4, 8, 16, 32) for plain PVFS and for symmetric active/active configurations with 1, 2, and 4 metadata servers]
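The downtime figures in the table above are roughly consistent with the standard model for independent redundant replicas, in which the service is up as long as at least one head node is up. A minimal sketch, assuming independent node failures and the 98.58% single-node availability quoted above (small rounding differences from the table are expected):

```python
# Sketch only: availability of n redundant active/active head nodes,
# assuming independent failures and that one surviving node suffices.
HOURS_PER_YEAR = 8760

def redundant_availability(single_node_availability, n):
    """A_n = 1 - (1 - A_1)^n for n independent replicas."""
    return 1.0 - (1.0 - single_node_availability) ** n

for n in (1, 2, 3):
    a = redundant_availability(0.9858, n)
    downtime_hours = (1.0 - a) * HOURS_PER_YEAR
    print(f"{n} node(s): availability {a:.6%}, "
          f"est. annual downtime {downtime_hours:.2f} h")
# Output is close to the table above: roughly 124 h, 1.8 h, and 1.5 min.
```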
6 Scott_RAS_SC07
Reactive fault tolerance for HPC with LAM/MPI+BLCR job-pause mechanism
Operational nodes (pause): BLCR reuses existing processes; LAM/MPI reuses existing connections; restore partial process state from checkpoint
Failed nodes (migrate): restart process on new node from checkpoint; reconnect with paused processes
Scalable MPI membership management for low overhead
Efficient, transparent, and automatic failure recovery (see the sketch below)
[Diagram: paused MPI processes on live nodes keep their existing connections; the failed MPI process on the failed node is restarted from shared storage on a spare node, and new connections replace the failed ones]
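A minimal control-flow sketch of the job-pause approach described above. It is not the LAM/MPI or BLCR implementation; the Node class and method names are hypothetical stand-ins for the corresponding BLCR and LAM/MPI operations.

```python
# Hypothetical sketch of the LAM/MPI + BLCR job-pause recovery flow.
# The Node class and its methods are stand-ins, not the real BLCR/LAM API.

class Node:
    def __init__(self, rank, healthy=True):
        self.rank, self.healthy = rank, healthy

    def pause_from_checkpoint(self, ckpt):
        # BLCR reuses the existing process; only partial state is rolled back.
        print(f"rank {self.rank}: pause in place, roll back to {ckpt}")

    def restart_from_checkpoint(self, ckpt, rank):
        self.rank = rank
        print(f"spare node: restart rank {rank} from {ckpt} on shared storage")

    def reconnect_and_resume(self, replacements):
        # LAM/MPI reuses existing connections; only migrated ranks reconnect.
        print(f"rank {self.rank}: reconnect to ranks {sorted(replacements)} and resume")

def recover_job(nodes, spares, ckpt="/shared/ckpt"):
    failed = [n for n in nodes if not n.healthy]
    survivors = [n for n in nodes if n.healthy]
    for n in survivors:                 # operational nodes: pause
        n.pause_from_checkpoint(ckpt)
    replacements = {}
    for n in failed:                    # failed nodes: migrate
        spare = spares.pop()
        spare.restart_from_checkpoint(ckpt, rank=n.rank)
        replacements[n.rank] = spare
    for n in survivors:                 # resume without LAM reboot or job restart
        n.reconnect_and_resume(replacements)

recover_job([Node(0), Node(1), Node(2, healthy=False)], [Node(3)])
```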
7 Scott_RAS_SC07
LAM/MPI+BLCR job-pause performance
3.4% overhead over job restart, but no LAM reboot overhead
Transparent continuation of execution
No requeue penalty
Less staging overhead
[Figure: time in seconds for job pause and migrate, LAM reboot, and job restart on the BT, CG, EP, FT, LU, MG, and SP benchmarks]
8 Scott_RAS_SC07
Proactive fault tolerance for HPC using Xen virtualization
Standby Xen host (spare node without guest VM)
Deteriorating health: migrate guest VM to spare node
New host generates unsolicited ARP reply, indicating that the guest VM has moved
ARP tells peers to resend to the new host
Novel fault-tolerance scheme that acts before a failure impacts a system (see the sketch below)
[Diagram: each node runs the Xen VMM with a privileged VM hosting Ganglia and the PFT daemon on top of the hardware BMC; the guest VM running the MPI task is live-migrated from the deteriorating host to the standby Xen host]
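A minimal sketch of the proactive scheme described above, assuming health data is read from Ganglia or the BMC and that the guest VM is moved with Xen live migration (the xm migrate --live command of that era). The thresholds and helper names are illustrative, not the actual PFT daemon implementation.

```python
# Illustrative PFT-daemon loop (not the actual implementation).
# Assumes a read_health() helper that returns BMC/Ganglia sensor readings.
import subprocess
import time

TEMP_LIMIT_C = 75.0          # hypothetical "deteriorating health" threshold

def read_health():
    """Stand-in for querying Ganglia / the hardware BMC (e.g., via IPMI)."""
    return {"cpu_temp_c": 68.0, "fan_ok": True}

def migrate_guest(guest_vm, standby_host):
    """Live-migrate the guest VM; the new host's unsolicited ARP reply then
    tells peers to resend traffic there, so the MPI task keeps running."""
    subprocess.run(["xm", "migrate", "--live", guest_vm, standby_host],
                   check=True)

def pft_daemon(guest_vm="compute-guest-07", standby_host="spare-node-01"):
    while True:
        health = read_health()
        if health["cpu_temp_c"] > TEMP_LIMIT_C or not health["fan_ok"]:
            migrate_guest(guest_vm, standby_host)   # act before the failure
            break
        time.sleep(10)                              # poll interval
```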
9 Scott_RAS_SC07
VM migration performance impact
Single node failure: 0.5–5% additional cost over total wall clock time
Double node failure: 2–8% additional cost over total wall clock time
[Figure: wall-clock time in seconds for BT, CG, EP, LU, and SP without migration, with one migration (single node failure), and with two migrations (double node failure)]
10 Scott_RAS_SC07
HPC reliability analysis and modeling
Programming paradigm and system scale impact reliability
Reliability analysis
Estimate mean time to failure (MTTF); see the MTTF scaling sketch below
Obtain failure distribution: exponential, Weibull, gamma, etc.; see the fitting sketch below
Feedback into fault-tolerance schemes for adaptation
[Figure: total system MTTF in hours versus number of participating nodes (50 to 5000), for the k-of-n AND survivability (k = n) parallel execution model with node MTTF of 1000, 3000, 5000, and 7000 hours]
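A minimal sketch of the k = n (AND survivability) case, assuming independent nodes with exponentially distributed failures; under that assumption the system failure rate is the sum of the node rates, so the system MTTF is the node MTTF divided by the number of nodes. This is the standard approximation behind curves like those in the figure above, not the project's exact model.

```python
# Sketch: system MTTF for the k = n (all nodes must survive) execution model,
# assuming independent nodes with exponentially distributed failures.
# MTTF_system = MTTF_node / n, since failure rates add.

def system_mttf(node_mttf_hours, n_nodes):
    return node_mttf_hours / n_nodes

for node_mttf in (1000, 3000, 5000, 7000):
    for n in (50, 100, 500, 1000, 2000, 5000):
        print(f"node MTTF {node_mttf} h, {n} nodes: "
              f"system MTTF {system_mttf(node_mttf, n):.2f} h")
```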
Distribution fit to the observed times between failures (negative likelihood value): exponential 2653.3, Weibull 3532.8, lognormal 2604.3, gamma 2627.4
[Figure: cumulative probability versus time between failures (TBF) for the fitted distributions]
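A hedged sketch of the fitting step: given observed times between failures, fit the candidate distributions and compare them by negative log-likelihood (smaller is a better fit). The scipy-based code and the sample data below are illustrative only; they are not the tooling or measurements behind the values quoted above.

```python
# Sketch: fit candidate failure distributions to observed time-between-failure
# data and compare them by negative log-likelihood (lower = better fit).
import numpy as np
from scipy import stats

# Hypothetical TBF sample in hours (the real study used logged system failures).
tbf = np.array([12.0, 55.0, 3.5, 80.0, 24.0, 7.0, 150.0, 40.0, 9.0, 66.0])

candidates = {
    "exponential": stats.expon,
    "Weibull": stats.weibull_min,
    "lognormal": stats.lognorm,
    "gamma": stats.gamma,
}

for name, dist in candidates.items():
    # Fix the location at zero, since times between failures are nonnegative.
    params = dist.fit(tbf, floc=0)               # maximum-likelihood fit
    nll = -np.sum(dist.logpdf(tbf, *params))     # negative log-likelihood
    print(f"{name:12s} negative log-likelihood: {nll:.1f}")
# The best-fitting distribution (and its MTTF estimate) can then feed back
# into the fault-tolerance schemes for adaptation.
```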
11 Scott_RAS_SC07
Contacts
Stephen L. Scott
Computer Science Research Group
Computer Science and Mathematics Division
(865) [email protected]

Christian Engelmann
Computer Science Research Group
Computer Science and Mathematics Division
(865) [email protected]