advanced pattern recognition for enhanced dependability of...

1

Advanced Pattern Recognition for Enhanced Dependability of Large-Scale Real-Time Systems and Networks

Presentation for:2002 Workshop on High Performance,Fault Adaptive Large Scale Real-Time Systems

Kenny C. Gross, PhDSr. Physicist

Sun [email protected]

Team coauthors:Larry Votta, Aleksey Urmanov, Wendy Lu

2

Why is Availability Important?

ØTo Customers–Downtime very costly§ Measured by lost

sales, reduced productivity, damaged business reputation, etc.

ØTo Sun Microsystems–Means of differentiation–Key corporate initiative–Goal to provide

continuous application access with predictable performance

In today’s internet-based computing model …

Mission director, Apollo 13

3

• Quantitative– Lost business, lost productivity

• Qualitative– Diminished reputation

0.09

0.11

0.15

2.6

6.45

0 2 4 6 8

Travel/reservations

Retail/home shopping

Media/pay-per-view

Financial Services/credit card

Financial Services/brokerage

Cost of down time/hour ($millions)

Costs of downtime can be substantial

Source: Hennessy and Patterson, Computer Architecture, Morgan & Kaufmann Publishers(2002).

4

Similarly, costs of avoiding downtime can be substantial

11.5

4

10

0123456789

10R

elat

ive

cost

Single system

with RAID storage

ClusterFault tolerant/ redundant HW

5

CPU

SystemStorage

Netw

Appl Dev Env

OSMW

DB

Svcs

Storage

Potential points of failure range from within the data center …

ISV

6

• Goes beyond the glass house to partners and customers

• This complexity makes all interactions mission-critical, demanding highest availability

Internal customers,e.g., fin, mktg, HR

Vendors/suppliers

Customer relationship mgmt

Supply chain mgmt

DatabaseData WarehouseBusiness Intelligence

Service provider

Customers

… to the Extended Enterprise

7

Motivation:Proactive fault monitoring (“An ounce of prevention...”)System Availability Serviceability CostsImproved Root Cause Analysis (RCA)...Each failure is used to build

better future systems

Approach:Adapt advanced statistical pattern recognition algorithm, MSET, that

has been proven in a broad spectrum of safety-critical and mission-critical application domains.

BENEFITS:Ø Enhanced end-to-end stack availability through predictive fault

monitoring.Ø Faster, more accurate RCAØ Provides “Intelligent Agent” functionality when integrated with real

time telemetry harness for monitored assetsØ Improved stability analysis, dynamic resource provisioning for

networks of interacting dynamic elementsØ Closed-loop autonomic control for future servers and networks

Advanced Pattern Recognition

8

Sequential Probability Ratio Test (SPRT)Advanced pattern recognition technique for high sensitivity, high reliability sensor and equipment operability surveillance.Developers proved in refereed journals that the SPRT provides the earliest mathematically possible annunciation of a subtle fault in noisy process variables.

Multivariate State Estimation Technique (MSET)

Online model-based fault detection and identification.

MSET predicts what each process should be on the basis of learned correlations among all process variables.

MSET incorporates the SPRT to monitor the residuals between the actual observations and the estimates MSET predicts on the basis of the correlated variables.

Advanced Pattern Recognition Tools for Ultrahigh-Reliability Surveillance

For Stationary Time Series

For Dynamic Time Series

9

Proactive Fault Monitoring Applications:Ø Continuous signal validation, sensor operability validationØ “Acoustic SPRT” for high sensitivity proactive fault detection in

storage, fan trays, and other electromechanical componentsØ Improved soft error rate discriminationØ Dynamic resource provisioning for optimized energy utilization Ø Software Aging and Rejuvenation

• Proactively detects onset of memory leaks and other resource-contention issues

• Successful pilot project applied MSET to S/W aging problem on business critical enterprise servers

Ø Improved environmental-variable proactive fault detectionØ Monitoring bit error rates in Fibre Channel network-attached-

storage (NAS)Ø Closed-loop autonomic control for improved (and less human-

intensive) performance management of servers and networksØ Proactive fault monitoring and more accurate Root Cause

Analysis Ø Stability assurance for N1 networks of dynamically interacting

entities.

10

0

Sequential Probability Ratio Test (SPRT)High sensitivity for subtle anomaly detection, but without increasing false alarm probability.

Sequential test for residual signals is based upon two hypotheses:

M

f

H0 : µ = µ0= 0H1 : µ = µ1= M

where µ is the mean of the distribution under test and M is a preset alarm threshold.

Accept H0

Signal x1 is faulty

x1

x2

xp

|x1- |

Signal x1 is fine

),...,(ˆˆ 21 pxxfx = 1x̂Accept H1

Sequential Statistical Test

DegradedNormal

Unlike threshold limit tests, SPRTdetects shifts in noise distributions.This breaks the “sea-saw” effectbetween sensitivity and false alarmsinherent in conventional monitoringapproaches.

11

MSETMultivariate State Estimation Technique

Argonne National Laboratory developed an advanced pattern recognition system, MSET, for 24X7 predictive fault monitoring in complex engineering systems for extremely sensitive identification of the onset of component degradation or process anomalies.

MSET possesses unique surveillance capabilities that surpass anyconventional approaches, including neural networks, in sensitivity, reliability, and computational efficiency.

MSET Provides:Ø Continuous signal validation for safety-critical and mission-critical

applicationsØ Incipient fault annunciation in all monitored components and

process variablesØ Extremely low probability of false alarmsØ Improved dynamic feedback/control algorithms (MSET estimate

can be used for control variable)

12

Internal VariablesDynamic Loads on CPU, Memory, Cache

IO Traffic, Queue Lengths, Transaction Latencies

Operational Profiles from OS Kernel “Virtual Sensors”

Intelligent Realtime Monitoring

For Enhanced Availability, Quality-of-Service, and Security

External VariablesService Level Availability Measurement

Distributed Synthetic Transaction Generators

Canary Times (user wait times, monitored 24x7)

Physical VariablesDistributed internal temperaturesRelative HumidityCumulative or differential vibrationTime domain reflectrometry (TDR)Current & voltage noise time seriesFan speedAcousticsEnvironmental Vars (outside cabinet)

Multivariate State Estimation Technique

Monitors hundreds of signals in realtime. Learns “normal” patterns in signals. Detects deviations in patterns to trigger alerts and activate detailed diagnostics.

Proactive: Predictive Failure Annunciation

Reactive: Faster, more accurate RCA

Security: Enhanced Intrusion Detection

Self Healing: Software Aging and Rejuvenation

1)1) Online model is Online model is “learned” from “learned” from operating telemetry operating telemetry datadata

2)2) Online model Online model provides an estimate provides an estimate for each new for each new observation valueobservation value

3)3) MSET alarms when MSET alarms when estimated and estimated and observed data observed data disagreedisagree

How MSET is AppliedTraining

Data

CalibrateModel

OnlineModel

AcquireData

ParameterEstimation

FaultDetection

FaultFound

?

Alarm orControlAction

Asset

Yes

No

Training

Monitoring

14

Legacy Viewgraph: MSET detects instrumentationdegradation in an operating nuclear power plant.

Departure between real signal (yellow) and MSET estimate (red) at onset of instrument degradation event.

Onset of SPRT alarms for Instrument R241

15

ACTUAL PARAMETERS (YELLOW) VS.MSET PREDICTED (RED): MSET

SensitivityMSET can detect degradation

that is still "within the noise band“

Note very subtle disturbance is introduced into a system parameter starting at DAY=0.

Note SPRT alarms start triggering at DAY=13, when the degradation is only 0.06% of the signal, and well within the noise band.

DEGRADATION SHOWN BY SPRT:

<- # Days After Start of Sensor Drift ->

DIFFERENCE IDENTIFIED BY SPRT:

16

Software RejuvenationSoftware Aging Problems:Resource contention phenomenon that can cause servers to hang orcrash. Mechanisms can include:

Memory leaks; Unreleased file locks; Accumulation of unterminated threads; Data corruption/round off accrual; File space fragmentation; Shared memory pool latching; Thread stack bloating and overruns

Software Rejuvenation:

Proactive fault management technique to periodically “cleanse” the system internal state. Mechanisms can include:

Flushing stale locks; Reinitializing application components; Preemptive rollback; Memory defragmentation; Purging DB shared-pool latches; Node/application failover (cluster machines); Therapeutic reboots (primarily Wintel platforms)

17

Software Rejuvenation Reduces Downtime

CRASH!

Software RejuvenationParasitic Resource

Consumption

Time

Physical Limit

Aging Software

18

SAR_AgentInstrumentation:

Throughput, performance, and transaction latency metrics (Adrian Cockcroft’s SEtoolkit)

External synthetic transaction generators

Implementation:

Ø MSET pattern recognition monitors performance and throughput variables to detect incipience or onset of SA mechanisms

Ø Stochastic Reward Net (SRN) algorithm optimizes sequencing of rejuvenation mechanism for lowest compute cost

19

SAR_AgentArchived

Training

ParametersSensitivity Analysis Module

(Sun Microsystems)

Stochastic Reward Network

Rejuvenation Sequencing Optimization

(Duke)

Advanced Pattern Recognition

MSET

Continue

Surveillance

SoftwareAgingAlert?

NOYES

MonitoredServer/Cluster

System ParametersC(t) Cost of Downtime

R(t) Cost of RejuvenationMarkov Parameters

Reference System Load Profile

Rejuvenation Actuator Signal

(Automatic, or via Human SysAdm)

Optimized Parameter Set

For Realtime Surveillance

MeasuredSys Performance

Parameters

20

Initial MSET Experiments with Software Aging and Rejuvenation:•MSET trained on 33 performance variables sampled in realtime•Signals generated by standard Unix utilities (mpstat, iostat, cpustat)•Subtle, linearly degrading memory leaks simulated•MSET consistently demonstrated high alarm sensitivity with no false alarms

MPSTAT Response Variable (1 of 33 variables monitored by MSET)

MSET training during typical user activities Departure between measured signal and MSET estimate

Onset of SPRT alarms

Fault Injection ActuatorT = 500 obs

21

8.5”x11” piece of paper

SPRT Alarms

MSET Detects Coolant Air FlowPerturbations in Enterprise Servers

Occasional cause of service problems withhigh end servers: piece of paper falls fromwall, notebook, etc. Works its way to bottomair inlet for server. Temps are not highenough to trip threshold; but over long term,can lead to accelerated reliability issues.

Experiments conducted with fully loadedE10K server. MSET monitors dozens of performancevariables. Piece of paper put on bottom airinlet.

Immediate SPRT alarms observed.

2nd experiment conducted with 3x5 post card.

Post card

22

Dirty filter installed

Dirty filter replaced

SPRT Alarms in Red

MSET Detects Dirty Air Filters in Enterprise Servers

MSET Detects Degraded/Failed Fans

(Eliminates Need for Hall-Effect RPM Sensors)

SPRT Alarms in Red

DATE: September 23 – 26 2002

SB4PS4 Temperature: Actual and MSET Estimate

23

Use of Pattern Recognition with SunSigma:Note that advanced pattern recognition is useful not only for detecting the onset of degradation. MSET can also detect and quantitatively characterize the onset of process improvements, including the reduction in processes variability.

24

Synthetic Load Gen / Fault Gen

Scripts

System Dynamics, Characterization and Controls

Lab ExperimentsStress-Based Fault InjectionPattern Recognition Evaluation

Networked SystemTestbed

(Various HAConfigs)

.

.

.

.

.

.

.

Test Suite

Script 1

Script 2

….

Script N

DESIGNOf

EXPERIMENTS

LOADBALANCE

FaultInjection

Tests

High

Level

Low

Level

BOX 1

BOX 2

BOX 3

Telemetry

Harness

Circular FileCAPTURE

ANALYZE

MSET

25

System Dynamics, Characterization and Control

Conclusions: Ø System telemetry harness provides “Black Box Flight Recorder”

with FRU-level granularity

Ø Reduces h/w sensor requirements and mitigates complexity in high-end platforms

Ø Use MSET-based pattern recognition technology to interpret monitoring data on line, in real time (or can post-process archived telemetry data)

Ø Can enhance reliability, availability, serviceability, security and QoS; mitigate “no trouble found” warranty returns

Ø Future: Closed loop autonomic control (Use MSET estimate for control variable, instead of sensor signals)

Ø Next generation system for N1 stability assurance and dynamic resource provisioning

26

BibliographyRecent Sun Publications on Adv. Pattern Recognition for Enhanced

Dependability of Computer Systems and Networks“Spectral Decomposition of Performance Variables for Dynamic System Characterization of Web Servers,” K. C. Gross, W. Lu, and K. Mishra, Proc. 2002 SAS Users Group International (SUGI 28), Seattle, Wa. (Mar 30 - Apr 2, 2003).

”Proactive Detection of Software Aging Mechanisms in Performance-Critical Computers,” K. C. Gross, V. Bhardwaj, and R. L. Bickford, 27th Annual IEEE/NASA Software Engineering Symposium, Greenbelt, MD (Dec 4-6, 2002).

“Time-Series Investigation of Anomalous CRC Error Patterns in Fibre Channel Arbitrated Loops,” K. C. Gross, W. Lu, and D. Huang, Proc. 2002 IEEE Int’l Conf. on Machine Learning and Applications (ICMLA), Las Vegas, NV (June 2002).

“Early Detection of Signal and Process Anomalies in Enterprise Computing Systems,” K. C. Gross and W. Lu, Proc. 2002IEEE Int’l Conf. on Machine Learning and Applications (ICMLA), Las Vegas, NV (June 2002).

“Advanced Pattern Recognition for Detection of Complex Software Aging Phenomena in Online Transaction Processing Servers,” Karen J. Cassidy, Kenny C. Gross, and Amir Malekpour, Proc. Intnl. Performance and Dependability Symposium, Washington, DC, (June 23rd - 26th, 2002).

“Software Reliability and System Availability at Sun,” K. C. Gross, Proc. IEEE 11th Intn’l Symp. On Software Reliability Eng., San Jose, CA (Oct 2000).

“Uncertainty Analysis for Multivariate State Estimation in Mission-Critical and Safety-Critical Applications,” N. Zavaljevski and K. C. Gross, Proc. MARCON 2000, “Maintenance and Reliability in the 21st Century”, Knoxville, TN, (May 7-10, 2000).

“Support Vector Machines for Multivariate State Estimation,” N. Zavaljevski and K. C. Gross, Proc. ANS Intnl. Topical Mtg. On “Advances in Reactor Physics, Mathematics, and Computation into the Next Millennium,” Pittsburgh, PA, (May 7-11, 2000).

advanced pattern recognition for enhanced dependability of...

Documents