Page 1: Research Efforts toward Non-Stop Services in High End and Enterprise Computing Box  Leangsuksun,

June 20, 2005

Innovation and information technology

Research Efforts toward Non-Stop Services in High End and Enterprise Computing

Box Leangsuksun,

Associate Professor, Computer Science

Director, eXtreme Computing Research (XCR)

HA-OSCAR: unleashing HA Beowulf

Page 2: Research Efforts toward Non-Stop Services in High End and Enterprise Computing Box  Leangsuksun,

Research Collaborators

– National, Academic and Industry Labs

• ORNL
• Intel, Dell, Ericsson
• Lucent, CRAY
• IU, NCSA, OSU, NCSU, UNM, TTU
• Systran
• OSDL (Linus is here)
• ANL, LLNL

Page 3: Research Efforts toward Non-Stop Services in High End and Enterprise Computing Box  Leangsuksun,

Service Unavailability Impacts

• No performance and no functionality
• Losses of $195K to $58M for 3.5 hrs of downtime (Meta Group report, 2000) (enterprise)
• Enterprise/shared major computing resources must run 24/7/365 (enterprise/HPC-HEC)
• Critical HPC apps such as national security (homeland defense) (HPC-HEC)
• Service provider regulation/mandate
  – FCC mandate (Class 5 local switch = five 9's)
• Lost time and opportunities
• Life-threatening

Page 4: Research Efforts toward Non-Stop Services in High End and Enterprise Computing Box  Leangsuksun,

RASS Definitions

• Reliability (MTTF)
  – How quickly does it fail?
• Availability
  – What is the total uptime?
  – Availability = MTTF / (MTTF + MTTR)
• Serviceability
  – How fast can the system be built, managed, and upgraded?
  – Planned outages account for 60% of total outages
• Security will impact availability
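
The availability formula above can be sketched in a few lines. This is only an illustration: the function names and the sample MTTF/MTTR figures are assumptions, not values from the talk.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability = MTTF / (MTTF + MTTR)."""
    if mttf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTTF must be positive and MTTR non-negative")
    return mttf_hours / (mttf_hours + mttr_hours)


def downtime_minutes_per_year(avail: float) -> float:
    """Expected downtime per 365-day year for a given availability."""
    return (1.0 - avail) * 365 * 24 * 60


# Hypothetical example: a node that fails every 5000 hours on average
# and takes 4 hours to repair.
a = availability(5000.0, 4.0)
print(f"availability = {a:.5f}, "
      f"downtime = {downtime_minutes_per_year(a):.0f} min/year")
```

Shortening MTTR (for example through automatic failover) raises availability even when MTTF stays the same, which is why the formula treats the two separately.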

Page 5: Research Efforts toward Non-Stop Services in High End and Enterprise Computing Box  Leangsuksun,

High Availability Open Source Cluster Application Resources (HA-OSCAR)

HA-OSCAR: unleashing HA Beowulf

Page 6: Research Efforts toward Non-Stop Services in High End and Enterprise Computing Box  Leangsuksun,

HA-OSCAR overview

• Production-quality open source Linux cluster project
• HA and HPC clustering techniques to enable critical HPC infrastructure
• Self-configuring multi-head Beowulf system
• HA-enabled HPC services: active/hot standby
• Self-healing with 3-5 sec automatic failover time
• The first known field-grade open source HA Beowulf cluster release

Page 7: Research Efforts toward Non-Stop Services in High End and Enterprise Computing Box  Leangsuksun,

Monitoring & Self-healing cores

• ServiceMonitor: PBS, MAUI, NFS, and HTTP services are monitored
• ResourceMonitor: load_average, disk_usage, and free_memory are monitored
• Healthchannel Monitor: eth0 and eth0:1 interfaces are monitored
• Self-Healing Daemon

Page 8: Research Efforts toward Non-Stop Services in High End and Enterprise Computing Box  Leangsuksun,

Monitoring and recovery

• Enhancements based on the kernel.org MON, IPMI, and Net-SNMP frameworks
• Recovery
  – Associative response
    • Local recovery, e.g. restart, checkpoint
    • Failover (simple or impersonate/clone)
    • Admin-defined actions
  – Adaptive response
    • Previous state and number of retries
    • Acceleration (time series)
    • E.g. maui dies, restart it. After 3 retries within 3 mins, fail over
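
The adaptive response rule (restart first, fail over after 3 retries within 3 minutes) could be sketched roughly as follows. The class and method names are hypothetical and are not taken from the HA-OSCAR code base; only the thresholds come from the slide.

```python
from collections import deque


class AdaptiveRecovery:
    """Hypothetical policy: restart locally, escalate to failover
    once too many restarts fall inside a sliding time window."""

    def __init__(self, max_retries: int = 3, window_seconds: float = 180.0):
        self.max_retries = max_retries
        self.window_seconds = window_seconds
        self._restarts = deque()  # timestamps of recent restarts

    def on_failure(self, now: float) -> str:
        # Forget restarts that are older than the window.
        while self._restarts and now - self._restarts[0] > self.window_seconds:
            self._restarts.popleft()
        if len(self._restarts) >= self.max_retries:
            return "failover"
        self._restarts.append(now)
        return "restart"


policy = AdaptiveRecovery()
# A service that keeps dying every 30 seconds.
actions = [policy.on_failure(t) for t in (0.0, 30.0, 60.0, 90.0)]
print(actions)
```

Keeping the retry history in a sliding window means that isolated failures spread over hours never escalate, while a burst of failures does.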

Page 9: Research Efforts toward Non-Stop Services in High End and Enterprise Computing Box  Leangsuksun,

Featured on the front cover of two major Linux magazines, and in various technical papers and research exhibitions.

Web site: http://xcr.cenit.latech.edu/ha-oscar

HA-OSCAR beta was released to the open source community in March 2004.

Page 10: Research Efforts toward Non-Stop Services in High End and Enterprise Computing Box  Leangsuksun,

Ongoing R&D work (lab-grade enhancements)

Page 11: Research Efforts toward Non-Stop Services in High End and Enterprise Computing Box  Leangsuksun,

Reliability Modeling for Dummies

[Figure: state-transition diagram. States are labeled with 4-tuples, from 1,1,2,2 at the top down to states such as 0,0,1,1 and 1,0,1,0 at the bottom.]

Page 12: Research Efforts toward Non-Stop Services in High End and Enterprise Computing Box  Leangsuksun,

UML-based Approach

Semantic mapping and dependability modeling:

• UML representation of system architecture
• XMI representation with embedded dependability information
• Extracting dependability parameters and building logical representation
• Results showing reliability and availability of the system

Page 13: Research Efforts toward Non-Stop Services in High End and Enterprise Computing Box  Leangsuksun,

An example of UML tools

Page 14: Research Efforts toward Non-Stop Services in High End and Enterprise Computing Box  Leangsuksun,

Examples in UML diagrams

Page 15: Research Efforts toward Non-Stop Services in High End and Enterprise Computing Box  Leangsuksun,

Example of HA-OSCAR

A single head cluster

Component   λ        µ
Node        21E-05   12E-04
Switch      32E-05   1E-03
Client1     67E-05   32E-04
Client2     89E-05   15E-04
Client3     76E-05   16E-04
Client4     54E-05   19E-04

System unreliability: 1.7804E-01
MTTF: 300 days
System instantaneous unavailability: 2.87771E-03
Availability percentage: 99.71
System downtime per year: 25.2 hrs

HA-OSCAR

Component   λ         µ
Node 1      3.4E-05   2E-05
Node 2      8.6E-05   12E-04
Switch 1    1E-05     2E-04
Switch 2    1.3E-05   2.1E-04
Client 1    2.5E-05   32E-04
Client 2    9.8E-05   4E-04
Client 3    6.7E-05   5E-04
Client 4    3.5E-05   21E-05

System unreliability: 92.1138E-03
MTTF: 331 days
System instantaneous unavailability: 2.10727E-05
Availability percentage: 99.997
System downtime per year: 11 min

<RELIABILITY BLOCK DIAGRAM>
<component> <name> Node1 <lambda> 3.4E-5 </lambda> <mu> 2.0E-5 </mu> </name> </component>
<component> <name> Node2 <lambda> 8.6E-5 </lambda> <mu> 0.0012 </mu> </name> </component>
<component> <name> Switch1 <lambda> 1.0E-5 </lambda> <mu> 2.0E-4 </mu> </name> </component>
<component> <name> Switch2 <lambda> 1.3E-5 </lambda> <mu> 2.1E-4 </mu> </name> </component>
<component> <name> Client4 <lambda> 3.5E-5 </lambda> <mu> 2.1E-4 </mu> </name> </component>
<Series id=0> Node1 Switch1 Client1 </Block0> </Series>
<Series id=1> Node1 Switch2 Client1 </Block1> </Series>
<Series id=2> Node1 Switch1 Client2 </Block2> </Series>
<Series id=3> Node1 Switch2 Client2 </Block3> </Series>
<Series id=4> Node1 Switch1 Client3 </Block4> </Series>
<Series id=5> Node1 Switch2 Client3 </Block5> </Series>
<Series id=6> Node1 Switch1 Client4 </Block6> </Series>
<Series id=7> Node1 Switch2 Client4 </Block7> </Series>
<Series id=8> Node2 Switch1 Client1 </Block8> </Series>
<Series id=9> Node2 Switch2 Client1 </Block9> </Series>
<Series id=10> Node2 Switch1 Client2 </Block10> </Series>
<Series id=11> Node2 Switch2 Client2 </Block11> </Series>
<Series id=12> Node2 Switch1 Client3 </Block12> </Series>
<Series id=13> Node2 Switch2 Client3 </Block13> </Series>
<Series id=14> Node2 Switch1 Client4 </Block14> </Series>
<Series id=15> Node2 Switch2 Client4 </Block15> </Series>
<Parallel> id=0 id=1 id=2 id=3 id=4 id=5 id=6 id=7 id=8 id=9 id=10 id=11 id=12 id=13 id=14 id=15 </Parallel>
<System Unreliability> 9.211E-02 </System Unreliability>
<Mean Time to Failure> <days> 331 </days> </Mean Time to Failure>
<System Instantaneous Availability per year> 99.997 </System Instantaneous Availability per year>
<System DownTime per year> <min> 11 </min> </System DownTime per year>
</RELIABILITY BLOCK DIAGRAM>
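
As a sanity check, the downtime figures reported for the two configurations follow directly from the instantaneous unavailability values, assuming a 365-day year. The sketch below only redoes that arithmetic; it does not reproduce the modeling tool itself, and the function name is an assumption.

```python
HOURS_PER_YEAR = 365 * 24


def downtime_hours_per_year(unavailability: float) -> float:
    """Expected yearly downtime implied by a steady-state unavailability."""
    return unavailability * HOURS_PER_YEAR


# Unavailability values as reported on the slide.
single_head_hours = downtime_hours_per_year(2.87771e-3)
ha_oscar_minutes = downtime_hours_per_year(2.10727e-5) * 60

print(f"single head: {single_head_hours:.1f} hrs/year")
print(f"HA-OSCAR:    {ha_oscar_minutes:.1f} min/year")
```

The results agree with the reported 25.2 hrs and roughly 11 min per year.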

Page 16: Research Efforts toward Non-Stop Services in High End and Enterprise Computing Box  Leangsuksun,

Policy-based Fault Prediction, Hardware Management abstraction

Page 17: Research Efforts toward Non-Stop Services in High End and Enterprise Computing Box  Leangsuksun,

Policy-based Fault Prediction, Hardware Management abstraction

Page 18: Research Efforts toward Non-Stop Services in High End and Enterprise Computing Box  Leangsuksun,

Hardware Management abstraction

• Ability to access and control detailed status for better management (CPU temp, baseboard, power status, system ID, up/down, etc.)
• IPMI (Intelligent Platform Management Interface)
• OpenIPMI and OpenHPI (SA Forum)
• HW abstraction hides vendor-specific details
  – CPU
  – Power
  – Memory
  – Baseboard
  – Fan (cooling)

Page 19: Research Efforts toward Non-Stop Services in High End and Enterprise Computing Box  Leangsuksun,

Our early observations

01/25/2004 | 00:31:19 | Sys Fan 1 | critical
01/25/2004 | 00:31:19 | Sys Fan 3 | critical
01/25/2004 | 00:31:19 | Sys Fan 4 | critical
01/25/2004 | 00:31:19 | Processor 1 Fan | ok
01/25/2004 | 00:31:20 | Processor 2 Fan | ok

• Can set thresholds in managed elements to trigger events with severity levels

• Automatic failure trend analysis -> prediction
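
Event records in the pipe-separated layout shown above are easy to parse and filter by severity. The sketch below assumes exactly that four-field layout (date, time, sensor, status); the type and function names are illustrative, not part of any IPMI tool.

```python
from datetime import datetime
from typing import NamedTuple


class SensorEvent(NamedTuple):
    timestamp: datetime
    sensor: str
    status: str


def parse_event(line: str) -> SensorEvent:
    """Parse one 'MM/DD/YYYY | HH:MM:SS | sensor | status' record."""
    date_s, time_s, sensor, status = (field.strip() for field in line.split("|"))
    timestamp = datetime.strptime(f"{date_s} {time_s}", "%m/%d/%Y %H:%M:%S")
    return SensorEvent(timestamp, sensor, status)


LOG = """\
01/25/2004 | 00:31:19 | Sys Fan 1 | critical
01/25/2004 | 00:31:19 | Sys Fan 3 | critical
01/25/2004 | 00:31:19 | Sys Fan 4 | critical
01/25/2004 | 00:31:19 | Processor 1 Fan | ok
01/25/2004 | 00:31:20 | Processor 2 Fan | ok
"""

events = [parse_event(line) for line in LOG.splitlines() if line.strip()]
critical = [e.sensor for e in events if e.status == "critical"]
print(critical)
```

Once events are structured like this, counting critical events per sensor over time is a natural first input for trend analysis.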

Page 20: Research Efforts toward Non-Stop Services in High End and Enterprise Computing Box  Leangsuksun,

Failure prediction and policy-based recovery in cluster management

• Detection: the damage is already done!
• Prediction
  – Trend analysis
  – Anticipate imminent failures
  – Better handling
  – More difficult for multiple events/nodes correlations
• Example of IPMI events and trend analysis
  – E.g. CPU temp rising too fast within 5 min -> prepare to checkpoint, fail over and restart
  – Memory bit error detected -> take the node out
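
One simple way to express the "CPU temp rising too fast within 5 min" rule is a least-squares slope over a sliding window. This is a sketch under assumptions: the 2 °C/min threshold, the sample data, and all names are hypothetical, and the talk does not say which trend-analysis method HA-OSCAR actually uses.

```python
def temperature_slope(samples):
    """Least-squares slope (deg C per minute) of (minute, temp_c) samples."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_c = sum(c for _, c in samples) / n
    num = sum((t - mean_t) * (c - mean_c) for t, c in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    if den == 0:
        raise ValueError("samples must span more than one timestamp")
    return num / den


def recommend_action(samples, window_min=5.0, max_slope=2.0):
    """Return a recovery hint based on the trend inside the last window."""
    latest = samples[-1][0]
    recent = [(t, c) for t, c in samples if latest - t <= window_min]
    if len(recent) < 2:
        return "ok"
    if temperature_slope(recent) > max_slope:
        return "prepare_checkpoint_and_failover"
    return "ok"


rising = [(0, 50.0), (1, 53.0), (2, 56.0), (3, 59.0), (4, 62.0)]
steady = [(0, 50.0), (1, 50.5), (2, 50.0), (3, 50.5), (4, 50.0)]
print(recommend_action(rising), recommend_action(steady))
```

Acting on the slope rather than on an absolute threshold is what turns detection into prediction: the node can checkpoint before the critical temperature is reached.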

Page 21: Research Efforts toward Non-Stop Services in High End and Enterprise Computing Box  Leangsuksun,

HA-OSCAR monitoring, Fault prediction and recovery Restructure

Page 22: Research Efforts toward Non-Stop Services in High End and Enterprise Computing Box  Leangsuksun,

Cluster Power Management (IPMI)

Page 23: Research Efforts toward Non-Stop Services in High End and Enterprise Computing Box  Leangsuksun,

Reliability-aware Runtime

Page 24: Research Efforts toward Non-Stop Services in High End and Enterprise Computing Box  Leangsuksun,

Reliability-Aware Runtime

• Programming paradigm and scalability impact "reliability", especially in HPC environments
• "AND survivability" analysis, in which all N nodes (10, 100, 1000, ...) have to survive
  – Each node MTTF at 5000 hours
  – N=10, MTTF = 492.424242
  – N=100, MTTF = 49.9902931
  – N=1000, MTTF = 4.99999003
  – N=10000, MTTF = ½ hour
• Reliability and availability info enables better job execution (checkpointing, resource management)
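
The scaling trend in the figures above can be approximated with the textbook series model: if node failures are independent and exponentially distributed, the failure rates add, so the system MTTF is the node MTTF divided by N. This is an idealization assumed here for illustration; the slide's values for N = 10 to 1000 are slightly lower than it predicts, so the authors' model evidently includes more than this. The function names are assumptions.

```python
import math


def and_survivability_mttf(node_mttf_hours: float, n_nodes: int) -> float:
    """System MTTF when all n independent, exponential nodes must survive."""
    if n_nodes < 1:
        raise ValueError("need at least one node")
    return node_mttf_hours / n_nodes


def job_survival_probability(node_mttf_hours: float, n_nodes: int,
                             job_hours: float) -> float:
    """Probability that no node fails during a job of the given length."""
    return math.exp(-n_nodes * job_hours / node_mttf_hours)


for n in (10, 100, 1000, 10000):
    print(n, and_survivability_mttf(5000.0, n))
```

Under this model a 5-hour job on 1000 such nodes completes without any node failure only about 37% of the time (exp(-1)), which is the motivation for reliability-aware scheduling and checkpointing.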

Page 25: Research Efforts toward Non-Stop Services in High End and Enterprise Computing Box  Leangsuksun,

MTTF 1000-5000

The more processors, and the faster they are, the higher the failure rate

[Figure: System reliability (MTTF) for k-of-n AND Survivability (k=n), parallel execution model. X axis: Number of Participating Nodes (10, 50, 100, 500, 1000, 2000, 5000). Y axis: Total system MTTF (hrs), 0 to 800. Series: Node MTTF 1000, 3000, 5000, and 7000 hrs.]

E.g. each nodal failure rate 2/year:
N=10, MTTF = 492.424242
N=100, MTTF = 49.9902931
N=1000, MTTF = 4.99999003

Page 26: Research Efforts toward Non-Stop Services in High End and Enterprise Computing Box  Leangsuksun,

Reliability-aware Checkpointing

– Consideration of scalability vs. reliability in runtime
– MTTF vs. application execution time
– HA-OSCAR monitoring -> failure prediction and detection
– System-initiated (transparent) and reliability-aware checkpointing in MPI environments
– Developed smart checkpointing based on the above
– Reduces unnecessary overheads yet stays reliability-aware
– Detailed reports in HAPCW2004 and submitted to IEEE Cluster 2005
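
To illustrate how system MTTF can drive checkpoint frequency, the sketch below uses Young's first-order approximation, interval = sqrt(2 × checkpoint cost × MTTF). This is a well-known general heuristic swapped in for illustration; it is not the smart checkpointing scheme described in the authors' reports, and the cost and MTTF figures are hypothetical.

```python
import math


def young_checkpoint_interval(checkpoint_cost_hours: float,
                              system_mttf_hours: float) -> float:
    """Young's approximation for the interval between checkpoints."""
    if checkpoint_cost_hours <= 0 or system_mttf_hours <= 0:
        raise ValueError("cost and MTTF must be positive")
    return math.sqrt(2.0 * checkpoint_cost_hours * system_mttf_hours)


# Hypothetical job: 6-minute checkpoints, nodes with a 5000-hour MTTF,
# system MTTF approximated as node MTTF / N (independent failures).
for n_nodes in (10, 100, 1000):
    system_mttf = 5000.0 / n_nodes
    interval = young_checkpoint_interval(0.1, system_mttf)
    print(f"N={n_nodes}: checkpoint every {interval:.2f} hrs")
```

The interval shrinks with the square root of the node count, so a larger job checkpoints more often, while a job that is short relative to the system MTTF may not need to checkpoint at all.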

Page 27: Research Efforts toward Non-Stop Services in High End and Enterprise Computing Box  Leangsuksun,

Federated System Architecture (DOE fastOS)

Page 28: Research Efforts toward Non-Stop Services in High End and Enterprise Computing Box  Leangsuksun,

Summary

• Problems in large-scale computing are similar to those in wireless sensor networks
  – Computing node = sensor node (SN)
  – Head node = gateway
• Reliability issues are similar
  – Depends on applications
• Self-config, self-awareness, self-healing
• Routing algorithm = location-aware