advanced pattern recognition for enhanced dependability of...
TRANSCRIPT
1
Advanced Pattern Recognition for Enhanced Dependability of Large-Scale Real-Time Systems and Networks
Presentation for:2002 Workshop on High Performance,Fault Adaptive Large Scale Real-Time Systems
Kenny C. Gross, PhDSr. Physicist
Team coauthors:Larry Votta, Aleksey Urmanov, Wendy Lu
2
Why is Availability Important?
ØTo Customers–Downtime very costly§ Measured by lost
sales, reduced productivity, damaged business reputation, etc.
ØTo Sun Microsystems–Means of differentiation–Key corporate initiative–Goal to provide
continuous application access with predictable performance
In today’s internet-based computing model …
Mission director, Apollo 13
3
• Quantitative– Lost business, lost productivity
• Qualitative– Diminished reputation
0.09
0.11
0.15
2.6
6.45
0 2 4 6 8
Travel/reservations
Retail/home shopping
Media/pay-per-view
Financial Services/credit card
Financial Services/brokerage
Cost of down time/hour ($millions)
Costs of downtime can be substantial
Source: Hennessy and Patterson, Computer Architecture, Morgan & Kaufmann Publishers(2002).
4
Similarly, costs of avoiding downtime can be substantial
11.5
4
10
0123456789
10R
elat
ive
cost
Single system
with RAID storage
ClusterFault tolerant/ redundant HW
5
CPU
SystemStorage
Netw
Appl Dev Env
OSMW
DB
Svcs
Storage
Potential points of failure range from within the data center …
ISV
6
• Goes beyond the glass house to partners and customers
• This complexity makes all interactions mission-critical, demanding highest availability
Internal customers,e.g., fin, mktg, HR
Vendors/suppliers
Customer relationship mgmt
Supply chain mgmt
DatabaseData WarehouseBusiness Intelligence
Service provider
Customers
… to the Extended Enterprise
7
Motivation:Proactive fault monitoring (“An ounce of prevention...”)System Availability Serviceability CostsImproved Root Cause Analysis (RCA)...Each failure is used to build
better future systems
Approach:Adapt advanced statistical pattern recognition algorithm, MSET, that
has been proven in a broad spectrum of safety-critical and mission-critical application domains.
BENEFITS:Ø Enhanced end-to-end stack availability through predictive fault
monitoring.Ø Faster, more accurate RCAØ Provides “Intelligent Agent” functionality when integrated with real
time telemetry harness for monitored assetsØ Improved stability analysis, dynamic resource provisioning for
networks of interacting dynamic elementsØ Closed-loop autonomic control for future servers and networks
Advanced Pattern Recognition
8
Sequential Probability Ratio Test (SPRT)Advanced pattern recognition technique for high sensitivity, high reliability sensor and equipment operability surveillance.Developers proved in refereed journals that the SPRT provides the earliest mathematically possible annunciation of a subtle fault in noisy process variables.
Multivariate State Estimation Technique (MSET)
Online model-based fault detection and identification.
MSET predicts what each process should be on the basis of learned correlations among all process variables.
MSET incorporates the SPRT to monitor the residuals between the actual observations and the estimates MSET predicts on the basis of the correlated variables.
Advanced Pattern Recognition Tools for Ultrahigh-Reliability Surveillance
For Stationary Time Series
For Dynamic Time Series
9
Proactive Fault Monitoring Applications:Ø Continuous signal validation, sensor operability validationØ “Acoustic SPRT” for high sensitivity proactive fault detection in
storage, fan trays, and other electromechanical componentsØ Improved soft error rate discriminationØ Dynamic resource provisioning for optimized energy utilization Ø Software Aging and Rejuvenation
• Proactively detects onset of memory leaks and other resource-contention issues
• Successful pilot project applied MSET to S/W aging problem on business critical enterprise servers
Ø Improved environmental-variable proactive fault detectionØ Monitoring bit error rates in Fibre Channel network-attached-
storage (NAS)Ø Closed-loop autonomic control for improved (and less human-
intensive) performance management of servers and networksØ Proactive fault monitoring and more accurate Root Cause
Analysis Ø Stability assurance for N1 networks of dynamically interacting
entities.
10
0
Sequential Probability Ratio Test (SPRT)High sensitivity for subtle anomaly detection, but without increasing false alarm probability.
Sequential test for residual signals is based upon two hypotheses:
M
f
H0 : µ = µ0= 0H1 : µ = µ1= M
where µ is the mean of the distribution under test and M is a preset alarm threshold.
Accept H0
Signal x1 is faulty
x1
x2
xp
|x1- |
Signal x1 is fine
),...,(ˆˆ 21 pxxfx = 1x̂Accept H1
Sequential Statistical Test
DegradedNormal
Unlike threshold limit tests, SPRTdetects shifts in noise distributions.This breaks the “sea-saw” effectbetween sensitivity and false alarmsinherent in conventional monitoringapproaches.
11
MSETMultivariate State Estimation Technique
Argonne National Laboratory developed an advanced pattern recognition system, MSET, for 24X7 predictive fault monitoring in complex engineering systems for extremely sensitive identification of the onset of component degradation or process anomalies.
MSET possesses unique surveillance capabilities that surpass anyconventional approaches, including neural networks, in sensitivity, reliability, and computational efficiency.
MSET Provides:Ø Continuous signal validation for safety-critical and mission-critical
applicationsØ Incipient fault annunciation in all monitored components and
process variablesØ Extremely low probability of false alarmsØ Improved dynamic feedback/control algorithms (MSET estimate
can be used for control variable)
12
Internal VariablesDynamic Loads on CPU, Memory, Cache
IO Traffic, Queue Lengths, Transaction Latencies
Operational Profiles from OS Kernel “Virtual Sensors”
Intelligent Realtime Monitoring
For Enhanced Availability, Quality-of-Service, and Security
External VariablesService Level Availability Measurement
Distributed Synthetic Transaction Generators
Canary Times (user wait times, monitored 24x7)
Physical VariablesDistributed internal temperaturesRelative HumidityCumulative or differential vibrationTime domain reflectrometry (TDR)Current & voltage noise time seriesFan speedAcousticsEnvironmental Vars (outside cabinet)
Multivariate State Estimation Technique
Monitors hundreds of signals in realtime. Learns “normal” patterns in signals. Detects deviations in patterns to trigger alerts and activate detailed diagnostics.
Proactive: Predictive Failure Annunciation
Reactive: Faster, more accurate RCA
Security: Enhanced Intrusion Detection
Self Healing: Software Aging and Rejuvenation
1)1) Online model is Online model is “learned” from “learned” from operating telemetry operating telemetry datadata
2)2) Online model Online model provides an estimate provides an estimate for each new for each new observation valueobservation value
3)3) MSET alarms when MSET alarms when estimated and estimated and observed data observed data disagreedisagree
How MSET is AppliedTraining
Data
CalibrateModel
OnlineModel
AcquireData
ParameterEstimation
FaultDetection
FaultFound
?
Alarm orControlAction
Asset
Yes
No
Training
Monitoring
14
Legacy Viewgraph: MSET detects instrumentationdegradation in an operating nuclear power plant.
Departure between real signal (yellow) and MSET estimate (red) at onset of instrument degradation event.
Onset of SPRT alarms for Instrument R241
15
ACTUAL PARAMETERS (YELLOW) VS.MSET PREDICTED (RED): MSET
SensitivityMSET can detect degradation
that is still "within the noise band“
Note very subtle disturbance is introduced into a system parameter starting at DAY=0.
Note SPRT alarms start triggering at DAY=13, when the degradation is only 0.06% of the signal, and well within the noise band.
DEGRADATION SHOWN BY SPRT:
<- # Days After Start of Sensor Drift ->
DIFFERENCE IDENTIFIED BY SPRT:
16
Software RejuvenationSoftware Aging Problems:Resource contention phenomenon that can cause servers to hang orcrash. Mechanisms can include:
Memory leaks; Unreleased file locks; Accumulation of unterminated threads; Data corruption/round off accrual; File space fragmentation; Shared memory pool latching; Thread stack bloating and overruns
Software Rejuvenation:
Proactive fault management technique to periodically “cleanse” the system internal state. Mechanisms can include:
Flushing stale locks; Reinitializing application components; Preemptive rollback; Memory defragmentation; Purging DB shared-pool latches; Node/application failover (cluster machines); Therapeutic reboots (primarily Wintel platforms)
17
Software Rejuvenation Reduces Downtime
CRASH!
Software RejuvenationParasitic Resource
Consumption
Time
Physical Limit
Aging Software
18
SAR_AgentInstrumentation:
Throughput, performance, and transaction latency metrics (Adrian Cockcroft’s SEtoolkit)
External synthetic transaction generators
Implementation:
Ø MSET pattern recognition monitors performance and throughput variables to detect incipience or onset of SA mechanisms
Ø Stochastic Reward Net (SRN) algorithm optimizes sequencing of rejuvenation mechanism for lowest compute cost
19
SAR_AgentArchived
Training
ParametersSensitivity Analysis Module
(Sun Microsystems)
Stochastic Reward Network
Rejuvenation Sequencing Optimization
(Duke)
Advanced Pattern Recognition
MSET
Continue
Surveillance
SoftwareAgingAlert?
NOYES
MonitoredServer/Cluster
System ParametersC(t) Cost of Downtime
R(t) Cost of RejuvenationMarkov Parameters
Reference System Load Profile
Rejuvenation Actuator Signal
(Automatic, or via Human SysAdm)
Optimized Parameter Set
For Realtime Surveillance
MeasuredSys Performance
Parameters
20
Initial MSET Experiments with Software Aging and Rejuvenation:•MSET trained on 33 performance variables sampled in realtime•Signals generated by standard Unix utilities (mpstat, iostat, cpustat)•Subtle, linearly degrading memory leaks simulated•MSET consistently demonstrated high alarm sensitivity with no false alarms
MPSTAT Response Variable (1 of 33 variables monitored by MSET)
MSET training during typical user activities Departure between measured signal and MSET estimate
Onset of SPRT alarms
Fault Injection ActuatorT = 500 obs
21
8.5”x11” piece of paper
SPRT Alarms
MSET Detects Coolant Air FlowPerturbations in Enterprise Servers
Occasional cause of service problems withhigh end servers: piece of paper falls fromwall, notebook, etc. Works its way to bottomair inlet for server. Temps are not highenough to trip threshold; but over long term,can lead to accelerated reliability issues.
Experiments conducted with fully loadedE10K server. MSET monitors dozens of performancevariables. Piece of paper put on bottom airinlet.
Immediate SPRT alarms observed.
2nd experiment conducted with 3x5 post card.
Post card
22
Dirty filter installed
Dirty filter replaced
SPRT Alarms in Red
MSET Detects Dirty Air Filters in Enterprise Servers
MSET Detects Degraded/Failed Fans
(Eliminates Need for Hall-Effect RPM Sensors)
SPRT Alarms in Red
DATE: September 23 – 26 2002
SB4PS4 Temperature: Actual and MSET Estimate
23
Use of Pattern Recognition with SunSigma:Note that advanced pattern recognition is useful not only for detecting the onset of degradation. MSET can also detect and quantitatively characterize the onset of process improvements, including the reduction in processes variability.
24
Synthetic Load Gen / Fault Gen
Scripts
System Dynamics, Characterization and Controls
Lab ExperimentsStress-Based Fault InjectionPattern Recognition Evaluation
Networked SystemTestbed
(Various HAConfigs)
.
.
.
.
.
.
.
Test Suite
Script 1
Script 2
….
Script N
DESIGNOf
EXPERIMENTS
LOADBALANCE
FaultInjection
Tests
High
Level
Low
Level
BOX 1
BOX 2
BOX 3
Telemetry
Harness
Circular FileCAPTURE
ANALYZE
MSET
25
System Dynamics, Characterization and Control
Conclusions: Ø System telemetry harness provides “Black Box Flight Recorder”
with FRU-level granularity
Ø Reduces h/w sensor requirements and mitigates complexity in high-end platforms
Ø Use MSET-based pattern recognition technology to interpret monitoring data on line, in real time (or can post-process archived telemetry data)
Ø Can enhance reliability, availability, serviceability, security and QoS; mitigate “no trouble found” warranty returns
Ø Future: Closed loop autonomic control (Use MSET estimate for control variable, instead of sensor signals)
Ø Next generation system for N1 stability assurance and dynamic resource provisioning
26
BibliographyRecent Sun Publications on Adv. Pattern Recognition for Enhanced
Dependability of Computer Systems and Networks“Spectral Decomposition of Performance Variables for Dynamic System Characterization of Web Servers,” K. C. Gross, W. Lu, and K. Mishra, Proc. 2002 SAS Users Group International (SUGI 28), Seattle, Wa. (Mar 30 - Apr 2, 2003).
”Proactive Detection of Software Aging Mechanisms in Performance-Critical Computers,” K. C. Gross, V. Bhardwaj, and R. L. Bickford, 27th Annual IEEE/NASA Software Engineering Symposium, Greenbelt, MD (Dec 4-6, 2002).
“Time-Series Investigation of Anomalous CRC Error Patterns in Fibre Channel Arbitrated Loops,” K. C. Gross, W. Lu, and D. Huang, Proc. 2002 IEEE Int’l Conf. on Machine Learning and Applications (ICMLA), Las Vegas, NV (June 2002).
“Early Detection of Signal and Process Anomalies in Enterprise Computing Systems,” K. C. Gross and W. Lu, Proc. 2002IEEE Int’l Conf. on Machine Learning and Applications (ICMLA), Las Vegas, NV (June 2002).
“Advanced Pattern Recognition for Detection of Complex Software Aging Phenomena in Online Transaction Processing Servers,” Karen J. Cassidy, Kenny C. Gross, and Amir Malekpour, Proc. Intnl. Performance and Dependability Symposium, Washington, DC, (June 23rd - 26th, 2002).
“Software Reliability and System Availability at Sun,” K. C. Gross, Proc. IEEE 11th Intn’l Symp. On Software Reliability Eng., San Jose, CA (Oct 2000).
“Uncertainty Analysis for Multivariate State Estimation in Mission-Critical and Safety-Critical Applications,” N. Zavaljevski and K. C. Gross, Proc. MARCON 2000, “Maintenance and Reliability in the 21st Century”, Knoxville, TN, (May 7-10, 2000).
“Support Vector Machines for Multivariate State Estimation,” N. Zavaljevski and K. C. Gross, Proc. ANS Intnl. Topical Mtg. On “Advances in Reactor Physics, Mathematics, and Computation into the Next Millennium,” Pittsburgh, PA, (May 7-11, 2000).