Asymmetric / Active-Active High-Availability for High-End Computing

C. Leangsuksun, V.K. Munganuru
Louisiana Tech University, Ruston, Louisiana – USA
{box, vkm001}@latech.edu

T. Liu
Dell Inc., Austin, Texas – USA
[email protected]

S.L. Scott, C. Engelmann
Oak Ridge National Laboratory, Oak Ridge, Tennessee – USA
{scottsl, engelmannc}@ornl.gov

Second International Workshop on Operating Systems, Programming Environments and Management Tools for High-Performance Computing on Clusters

June 19, 2005, Cambridge, Massachusetts (USA)

2

Outline

Motivation

Related Work: OSCAR

HA-OSCAR: RAS Management for HPC Clusters: Self-awareness Approach

Analysis & Experiment

Summary & Future work

3

Motivation

Cluster architectures dominate the HPC community.

Cluster architectures are prone to single points of failure (SPoF).

Cluster sizes have grown significantly.

Size and reliability have an inverse relationship…

Self-aware Reliability, Availability and Serviceability (RAS) management is needed.

[Figure: Size vs. Reliability]
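
The inverse relationship between size and reliability can be illustrated with a simple series model. The sketch below assumes that nodes fail independently with exponentially distributed lifetimes and that any single node failure interrupts the whole system. Neither assumption comes from the slides, and the function name and the 10,000-hour node MTTF are illustrative only.

```python
def system_mttf(node_mttf_hours: float, num_nodes: int) -> float:
    """MTTF of a system that is interrupted by its first node failure.

    Assumes independent, exponentially distributed node lifetimes, so the
    node failure rates add up and the system MTTF shrinks as 1/N.
    """
    if num_nodes < 1:
        raise ValueError("num_nodes must be at least 1")
    return node_mttf_hours / num_nodes


if __name__ == "__main__":
    # Hypothetical node MTTF of 10,000 hours, chosen only for illustration.
    for n in (1, 64, 1024, 16384):
        print(f"{n:>6} nodes: system MTTF = {system_mttf(10_000, n):.2f} hours")
```

Under these assumptions, doubling the node count halves the expected time to the next interrupt, which is why larger clusters need automated RAS management.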

4

Cluster “Beowulf” Architecture

Single Point of Failure

Single Point of Control

5

Availability of HEC Systems

Today’s supercomputers typically need to reboot to recover from a single failure.

Entire systems go down, both on schedule and unexpectedly, for any maintenance or repair.

Compute nodes sit idle while a head or service node is down.

Availability will get worse in the future as the mean time between interrupts (MTBI) decreases with growing system size.

No productive computation is done during the checkpoint/restart process.

6

Availability Measured by the 9’s

9’s  Availability*  Downtime/Year        Examples
1    90.0%          36 days, 12 hours    Personal Computers
2    99.0%          87 hours, 36 min     Entry Level Business
3    99.9%          8 hours, 45.6 min    ISPs, Mainstream Business
4    99.99%         52 min, 33.6 sec     Data Centers
5    99.999%        5 min, 15.4 sec      Banking, Medical
6    99.9999%       31.5 seconds         Military Defense

Enterprise-class hardware + Stable Linux kernel = 5+

Substandard hardware + Good high availability package = 2-3

Today’s supercomputers = 1-2

My desktop = 1-2

* Based on mean time between interrupts (MTBI), counting both software and hardware interrupts.
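
The downtime column follows directly from the number of nines. The sketch below reproduces that arithmetic, assuming a 365-day year; the function names are illustrative and not part of any HA-OSCAR tool.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # assumption: a 365-day year


def downtime_seconds_per_year(nines: int) -> float:
    """Yearly downtime implied by an availability of `nines` nines.

    Three nines means 99.9% availability, i.e. an unavailability of 10**-3.
    """
    if nines < 1:
        raise ValueError("nines must be at least 1")
    return SECONDS_PER_YEAR / 10 ** nines


def format_duration(seconds: float) -> str:
    """Render a duration as days, hours, minutes and seconds."""
    days, rem = divmod(seconds, 86400)
    hours, rem = divmod(rem, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{int(days)}d {int(hours)}h {int(minutes)}m {secs:.1f}s"


if __name__ == "__main__":
    for nines in range(1, 7):
        availability = 100.0 * (1.0 - 10.0 ** -nines)
        downtime = format_duration(downtime_seconds_per_year(nines))
        print(f"{nines} nines ({availability:.4f}%): {downtime} per year")
```

Each additional nine cuts the permitted downtime by a factor of ten, so moving from two nines to four nines means going from days of downtime per year to under an hour.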

7

Solution: Active Redundancy

8

Clustering High-Availability Models

Active – Hot-Standby

Asymmetric / Active – Active

Symmetric / Active – Active

9

What is OSCAR?

Open Source Cluster Application Resources

Framework for cluster installation, configuration and management

Commonly used cluster tools

Wizard-based cluster software installation

Operating system

Cluster environment

Administration

Operation

Automatically configures cluster components

Increases consistency among cluster builds

Reduces time to build / install a cluster

Reduces need for expertise

[Figure: installation wizard, Step 1 (Start…) through Step 8 (Done!)]

10

HA-OSCAR: Active – Hot-Standby

Production-quality open source Linux-cluster project

HA and HPC clustering techniques to enable critical HPC infrastructure

Self-configuring multi-head Beowulf system

HA-enabled HPC services: Active / Hot-Standby

Self-healing with 3-5 sec automatic failover time

The first known field-grade open source HA Beowulf cluster release

11

HA-OSCAR Serviceability

Self-build and configuration of a multi-head Beowulf system

Adopts the same ease of build and operation as OSCAR

Installation takes about 30 minutes

Each disaster recovery takes almost the same time, provided you are prepared

Step 1

Step 2: create head image

Step 3: clone image

Step 4: configure standby

Step 5: use web admin to add/configure more services

12

Adaptive Recovery State Diagram

[State diagram; the labels recovered from the figure are listed below]

- working
- failure
- Detect
- Alert
- previous state, # counter, recovery
- threshold reached after # retry
- switch over & take control at the standby
- Failover
- After the primary node is repaired, optional fallback
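
One possible reading of the diagram is a small state machine: a detected failure raises an alert, local recovery is retried while a counter is kept, and control switches to the standby once the retry threshold is reached. The sketch below follows that reading. The class, method names and default threshold are hypothetical, and the actual HA-OSCAR transitions may differ.

```python
from enum import Enum


class State(Enum):
    WORKING = "working"
    ALERT = "alert"
    FAILOVER = "failover"


class RecoveryController:
    """Hypothetical controller for one interpretation of the state diagram."""

    def __init__(self, retry_threshold: int = 3):
        self.retry_threshold = retry_threshold  # assumed value, not from the slides
        self.state = State.WORKING
        self.retries = 0

    def on_failure_detected(self) -> None:
        """A monitor detected a failure: raise an alert and start counting."""
        if self.state is State.WORKING:
            self.state = State.ALERT
            self.retries = 0

    def on_recovery_attempt(self, succeeded: bool) -> None:
        """Record the outcome of one local recovery attempt."""
        if self.state is not State.ALERT:
            return
        if succeeded:
            # Recovery worked: return to the previous (working) state.
            self.state = State.WORKING
            self.retries = 0
            return
        self.retries += 1
        if self.retries >= self.retry_threshold:
            # Threshold reached: the standby switches over and takes control.
            self.state = State.FAILOVER

    def on_primary_repaired(self, fall_back: bool = False) -> None:
        """After the primary is repaired, fallback is optional."""
        if self.state is State.FAILOVER and fall_back:
            self.state = State.WORKING
            self.retries = 0


if __name__ == "__main__":
    controller = RecoveryController(retry_threshold=2)
    controller.on_failure_detected()
    controller.on_recovery_attempt(succeeded=False)
    controller.on_recovery_attempt(succeeded=False)
    print(controller.state)
```

Keeping the retry counter separate from the state makes the threshold easy to tune: a low threshold fails over quickly, while a higher one gives local recovery more chances before the standby takes over.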

13

Monitoring & Self-healing cores

Service Monitor: PBS, MAUI, NFS and HTTP services are monitored

Resource Monitor: load_average, disk_usage and free_memory are monitored

Health-channel Monitor: eth0 and eth0:1 interfaces are monitored

Self-Healing Daemon
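
As a rough illustration of what the service and resource monitors evaluate, the sketch below checks a status snapshot against a list of required services and a set of resource limits. The service and metric names come from the slide. The function names, data shapes and threshold values are assumptions made for this example and are not HA-OSCAR's configuration or API.

```python
from dataclasses import dataclass
from typing import List, Mapping, Sequence

MONITORED_SERVICES = ("PBS", "MAUI", "NFS", "HTTP")  # names from the slide


@dataclass(frozen=True)
class ResourceLimits:
    """Illustrative thresholds only; not HA-OSCAR defaults."""
    max_load_average: float = 8.0
    max_disk_usage: float = 0.90      # fraction of disk space in use
    min_free_memory_mb: float = 256.0


def failed_services(status: Mapping[str, bool],
                    required: Sequence[str] = MONITORED_SERVICES) -> List[str]:
    """Return required services that are down or missing from the snapshot."""
    return [name for name in required if not status.get(name, False)]


def resource_alerts(load_average: float, disk_usage: float,
                    free_memory_mb: float,
                    limits: ResourceLimits = ResourceLimits()) -> List[str]:
    """Return the names of the metrics that violate their limits."""
    alerts = []
    if load_average > limits.max_load_average:
        alerts.append("load_average")
    if disk_usage > limits.max_disk_usage:
        alerts.append("disk_usage")
    if free_memory_mb < limits.min_free_memory_mb:
        alerts.append("free_memory")
    return alerts


if __name__ == "__main__":
    print(failed_services({"PBS": True, "MAUI": False, "NFS": True}))
    print(resource_alerts(load_average=9.5, disk_usage=0.40, free_memory_mb=2048))
```

A self-healing daemon could feed the returned lists into its recovery logic, for example by restarting a failed service before considering failover.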

14

HA-OSCAR RAS Software Stack

Redundant H/W platform

Intelligent sensors

HPI wrapper

Operating System (OS) hardware Interface

OS Application Services

Monitoring and Self-healing Core

HA-OSCAR Management layer

Application Services

Monitoring & self-healing core

15

Asymmetric / Active-Active Architecture

16

Failover of: Asymmetric / Active-Active Architecture

17

Asymmetric/Symmetric Active/Active

18

Reality Checks

Great! We have a highly reliable HPC system!

But how much improvement?

The total uptime?

Performance?

Analytical model and prediction

Statistical technique to compare uptime

How many 9’s? (downtime per year)

Stochastic Reward Net with SPNP package

Identical hardware parameters between Beowulf and HA-OSCAR multi-heads

19

Availability vs Unavailability

Planned and unplanned downtime

Scheduled downtime = 200 hrs

Repair time = 24 hrs

Monitoring interval = 10 sec

Ours 99.99% vs 91+%

1k vs 10m TFLOP (1T system)

$70k vs $2m ($20m system)

[Figure: HA-OSCAR solution vs traditional Beowulf: Total availability impacted by service nodes. X axis: node-wise mean time to failure (hr); Y axis: availability]

Node-wise MTTF (hr)  1000      2000      4000      6000      8000      10000
Beowulf              0.905797  0.915751  0.920810  0.922509  0.923361  0.923873
HA-OSCAR             0.999684  0.999896  0.999951  0.999962  0.999966  0.999968

Model assumptions:
- scheduled downtime = 200 hrs
- nodal MTTR = 24 hrs
- failover time = 10 s
- During maintenance on the head, the standby node acts as primary

[Figure: Lost investment due to unavailability (based on $20M). X axis: MTTF (hours); Y axis: million $; series: Beowulf, HA-OSCAR]

[Figure: HA-OSCAR solution vs traditional Beowulf: Unavailable performance (based on 1 Tflop machine in a year). X axis: MTTF (hours); Y axis: Gflop, log scale; series: Beowulf, HA-OSCAR]
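
To put the availability values tabulated on this slide in more tangible terms, the sketch below converts them into hours of downtime per year. It is plain arithmetic on the published numbers, assuming a 365-day year. It does not reproduce the Stochastic Reward Net model behind them, and the variable and function names are illustrative.

```python
HOURS_PER_YEAR = 8760  # assumption: a 365-day year

# Availability values copied from the slide's table.
MTTF_HOURS = [1000, 2000, 4000, 6000, 8000, 10000]
BEOWULF = [0.905797, 0.915751, 0.920810, 0.922509, 0.923361, 0.923873]
HA_OSCAR = [0.999684, 0.999896, 0.999951, 0.999962, 0.999966, 0.999968]


def downtime_hours(availability: float) -> float:
    """Hours per year during which the system is unavailable."""
    return (1.0 - availability) * HOURS_PER_YEAR


def compare():
    """Return (MTTF, Beowulf downtime, HA-OSCAR downtime) rows, in hours."""
    return [
        (mttf, downtime_hours(b), downtime_hours(h))
        for mttf, b, h in zip(MTTF_HOURS, BEOWULF, HA_OSCAR)
    ]


if __name__ == "__main__":
    print(f"{'MTTF (hr)':>10} {'Beowulf (hr/yr)':>16} {'HA-OSCAR (hr/yr)':>17}")
    for mttf, beowulf, ha_oscar in compare():
        print(f"{mttf:>10} {beowulf:>16.1f} {ha_oscar:>17.2f}")
```

Read this way, the table says that the single-head Beowulf configuration loses hundreds of hours per year at every node MTTF shown, while the HA-OSCAR configuration loses only a few hours.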