Intro & Overview of RADS goals
Armando Fox & Dave Patterson
CS 444A/CS 294-6, Stanford/UC Berkeley
Fall 2004
© 2004 A. Fox

Administrivia
  Course logistics & registration
  Project expectations and other deliverables
Background and motivation for RADS
ROC and its relationship to RADS
Early case studies
Discussion: projects, research directions, etc.

Administrivia/goals
Stanford enrollment vs. Axess
SLT and CT tutorial VHS/DVD’s available to view
SLT and CT Lab/assignments grading policy
Stanford and Berkeley meeting/transportation logistics
Format of course

Background & motivation for RADS

RADS in One Slide
Philosophy of ROC: focus on lowering MTTR to improve overall availability
ROC achievements: two levels of lowering MTTR
  “Microrecovery”: fine-grained generic recovery techniques recover only the failed part(s) of the system, at much lower cost than whole-system recovery
  Undo: sophisticated tools to help human operators selectively back out destructive actions/changes to a system
  General approach: use microrecovery as “first line of defense”; when it fails, provide support to human operators to avoid having to “reinstall the world”
RADS insight: can combine cheap recovery with statistical anomaly detection techniques

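The link between MTTR and availability can be made concrete with the standard steady-state formula, availability = MTTF / (MTTF + MTTR). The sketch below is illustrative only: the MTTF and MTTR values are made-up assumptions, not ROC measurements.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the long-run fraction of time the system is up."""
    return mttf_hours / (mttf_hours + mttr_hours)


# Hypothetical system that fails about once every 1000 hours.
slow_recovery = availability(1000.0, 1.0)    # one hour to recover
fast_recovery = availability(1000.0, 0.01)   # 36 seconds to recover

print(f"MTTR = 1 h : availability {slow_recovery:.5f}")
print(f"MTTR = 36 s: availability {fast_recovery:.5f}")
```

Cutting MTTR by a factor of 100 has the same effect on unavailability as making failures 100 times rarer, which is usually much harder to do.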
Hence, (at least) 2 parts to RADS
Investigating other microrecovery methods
Investigating analysis techniques
  What to capture/represent in a model
Addressing fundamental open challenges
  stability
  systematic misdiagnosis
  subversion by attackers
  etc.
General insight: “different is bad”
  “law of large numbers” arguments support this for large services

Why RADS
Motivation
  5 9’s availability => 5 down-minutes/year => must recover from (or mask) most failures without human intervention
  a principled way to design “self-*” systems
Technology
  High-traffic large-scale distributed/replicated services => large datasets
  Analysis is CPU-intensive => a way to trade extra CPU cycles for dependability
  Large logs/datasets for models => storage is cheap and getting cheaper
RADS addresses a clear need while exploiting demonstrated technology trends

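The "5 9’s => 5 down-minutes/year" figure follows from simple arithmetic. A minimal sketch, assuming a 365.25-day year (the function name is illustrative):

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60  # about 525,960 minutes


def downtime_minutes_per_year(nines: int) -> float:
    """Allowed downtime for an availability of 0.99...9 with `nines` nines."""
    unavailability = 10.0 ** (-nines)
    return MINUTES_PER_YEAR * unavailability


for n in (3, 4, 5):
    print(f"{n} nines: {downtime_minutes_per_year(n):.1f} down-minutes/year")
```

A budget of roughly five minutes per year leaves no room for paging a human, which is the motivation for automated recovery.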
Cheap Recovery

Complex systems of black boxes
“...our ability to analyze and predict the performance of the enormously complex software systems that lies at the core of our economy is painfully inadequate.” (Chaudhuri & Weikum, 2000 PITAC Report)
Networked services too complex and rapidly-changing to test exhaustively: “collections of black boxes”
  Weekly or biweekly code drops not uncommon
  Market activities lead to integration of whole systems
Need to get humans out of loop for at least some monitoring/recovery loops
  hence interest in “autonomic” approaches
  fast detection is often at odds with false alarms

Consequences
Complexity breeds increased bug counts and bug impact
  Heisenbugs, race conditions, environment-dependent and hard-to-reproduce bugs still account for majority of SW bugs in live systems
  up to 80% of bugs found in production are those for which a fix is not yet available*
  some application-level failures result in user-visible bad behavior before they are detected by site monitors
  Tellme Networks: up to 75% of downtime is “detection” (sometimes by user complaints), followed by localization
  Amazon, Yahoo: gross metrics track second-order effect of bugs, but lag actual bug by minutes or tens of minutes
Result: downtime and increased management costs
* A.P. Wood, Software reliability from the customer view, IEEE Computer, Aug. 2003

“Always adapting, always recovering”
Build statistical models of “acceptable” operating envelope by measurement & analysis on live system
  Control theory, statistical correlation, anomaly detection...
Detect runtime deviations from model
  typical tradeoff is between detection rate & false positive rate
Rely on external control using inexpensive and simple mechanisms that respect the black box, to keep system within its acceptable operating envelope
  invariant: attempting recovery won’t make things worse
  makes inevitable false positives tolerable
  can then reduce false negatives by “tuning” algo’s to be more aggressive and/or deploying multiple detectors
Systems that are “always adapting, always recovering”

Toward recovery management invariants
Observation: instrumentation and analysis
  collect and analyze data from running systems
  rely on “most systems work most of the time” to automatically derive baseline models
Analysis: detect and localize anomalous behavior
Action: close loop automatically with “micro-recovery”
  “Salubrious”: returns some part of system to known state
    • Reclaim resources (memory, DB conns, sockets, DHCP lease...), throw away corrupt transient state, set up to retry operation if appropriate
  Safe: no effect on correctness, minimal effect on performance
  Localized: parts not being microrecovered aren’t affected
Fast recovery simplifies failure detection and recovery management.

Non-goals/complementary work
All of the following are being capably studied by others, and directly compose with our own efforts...
  Byzantine fault tolerance
  In-place repair of persistent data structures
  Hard-real-time response guarantees
  Adding checkpointing to legacy non-componentized applications
  Source code bug finding
  Advancing the state of the art in SLT (analysis algorithms)

Outline
Micro-recoverable systems
  Concept of microrecovery
  A microrecoverable application server & session state store
Application-generic SLT-based failure detection
  Path and component analysis and localization for appserver
  Simple time series analyses for purpose-built state store
Combining SLT detection with microrecoverable systems
Discussion, related work, implications & conclusions

Microrebooting: one kind of microrecovery
60+% of software failures in the field* are reboot-curable, even if root cause is unknown... why?
  Rebooting discards bad temporary data (corrupted data structures that can be rebuilt) and (usually) reclaims used resources
  reestablishes control flow in a predictable way (breaks deadlocks/livelocks, returns thread or process to its start state)
To avoid imperiling correctness, we must...
  Separate data recovery from process recovery
  Safeguard the data
  Reclaim resources with high confidence
Goal: get same benefits of rebooting but at much finer grain (hence faster and less disruptive) - microrebooting
* D. Oppenheimer et al., Why do Internet services fail and what can be done about it?, USITS 2003

Write example: “Write to Many, Wait for Few”
[Diagram: Browser, AppServer stub, Bricks 1-5]
Try to write to W random bricks, W = 4
Must wait for WQ bricks to reply, WQ = 2

Write example: “Write to Many, Wait for Few”
[Diagram: Browser, AppServer stub, Bricks 1-5]
Try to write to W random bricks, W = 4
Must wait for WQ bricks to reply, WQ = 2
Cookie holds metadata (1, 4)
Crashed? Slow?

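The write path above can be sketched as follows. This is a sequential simulation under stated assumptions, not the SSM implementation: the `Brick` class, `WriteFailed`, and the function name are hypothetical, and a real stub would issue the W writes concurrently and return as soon as WQ replies arrive.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Brick:
    brick_id: int
    alive: bool = True
    store: dict = field(default_factory=dict)

    def write(self, key, value) -> bool:
        """Return True (an acknowledgement) only if the brick is responsive."""
        if not self.alive:
            return False
        self.store[key] = value
        return True


class WriteFailed(Exception):
    pass


def write_to_many_wait_for_few(bricks, key, value, w=4, wq=2, rng=random):
    """Send the write to W random bricks; succeed once WQ have acknowledged.

    Returns the ids of the first WQ acknowledging bricks, which the stub
    would store in the browser cookie as metadata.
    """
    targets = rng.sample(bricks, w)
    acked = [b.brick_id for b in targets if b.write(key, value)]
    if len(acked) < wq:
        raise WriteFailed(f"only {len(acked)} of {wq} required acks")
    return acked[:wq]


bricks = [Brick(i) for i in range(1, 6)]
bricks[2].alive = False  # Brick 3 is crashed or slow
cookie = write_to_many_wait_for_few(
    bricks, "session-42", {"cart": ["book"]}, rng=random.Random(0)
)
print("cookie metadata:", cookie)
```

Because W exceeds WQ, a crashed or slow brick among the targets does not delay or fail the write.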
Read example
[Diagram: Browser, AppServer stub, Bricks 1-5; cookie metadata lists Bricks 1, 4]
Try to read from Bricks 1, 4

Read example
[Diagram: Browser, AppServer stub, Bricks 1-5]
Brick 1 crashes

Read example
[Diagram: Browser, AppServer stub, Bricks 2-5; Brick 1 no longer shown]

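The read path tolerates the crash because the cookie names more than one brick. A minimal sketch, assuming the cookie lists Bricks 1 and 4 as in the slides; the data layout and the names `read_from_cookie` and `ReadFailed` are hypothetical, and a crashed brick is modelled as `None` in place of a request timeout.

```python
class ReadFailed(Exception):
    pass


def read_from_cookie(bricks, cookie_brick_ids, key):
    """Try each brick named in the cookie and return the first copy found.

    `bricks` maps brick id -> dict of stored values, or None if the brick
    has crashed.
    """
    for brick_id in cookie_brick_ids:
        store = bricks.get(brick_id)
        if store is not None and key in store:
            return store[key]
    raise ReadFailed(f"no replica of {key!r} reachable")


state = {"cart": ["book"]}
bricks = {1: {"s42": state}, 2: {}, 3: {}, 4: {"s42": state}, 5: {}}
cookie = [1, 4]

bricks[1] = None  # Brick 1 crashes
print(read_from_cookie(bricks, cookie, "s42"))  # served by Brick 4
```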
SSM: Failure and Recovery
Failure of single node
  No data loss, WQ-1 remain
  State is available for R/W during failure
Recovery
  Restart – No special case recovery code
  State is available for R/W during brick restart
  Session state is self-recovering
    • User’s access pattern causes data to be rewritten

Backpressure and Admission Control
[Diagram: two AppServer stubs, Bricks 1-5]
Heavy flow to Brick 3
Drop Requests

Statistical Monitoring
[Diagram: Bricks 1-5 and Pinpoint, with per-brick statistics]
Statistics: NumElements, MemoryUsed, InboxSize, NumDropped, NumReads, NumWrites

SSM Monitoring
N replicated bricks handle read/write requests
  Cannot do structural anomaly detection!
  Alternative features (performance, mem usage, etc)
Activity statistics: How often did a brick do something?
  Msgs received/sec, dropped/sec, etc.
  Same across all peers, assuming balanced workload
  Use anomalies as likely failures
State statistics: Current state of system
  Memory usage, queue length, etc.
  Similar pattern across peers, but may not be in phase
  Look for patterns in time-series; differences in patterns indicate failure at a node.

Detecting Anomalous Conditions
Metrics compared against those of “peer” bricks
  Basic idea: Changes in workload tend to affect all bricks equally
  Underlying (weak) assumption: “Most bricks are doing mostly the right thing most of the time”
Anomaly in 6 or more (out of 9) metrics => reboot brick
Use different techniques for different stats
  “Activity” – absolute median deviation
  “State” – Tarzan time-series analysis

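The peer-comparison rule can be sketched as below. This is a simplification under stated assumptions: it applies a median-absolute-deviation test to every metric (the slides use Tarzan for the "state" metrics, which is not shown here), and the threshold of 3 deviations, the function names, and the sample numbers are all illustrative.

```python
from statistics import median


def mad_outliers(values, threshold=3.0):
    """Indices whose value is more than `threshold` median absolute
    deviations (MAD) away from the median of all peers."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # Peers agree exactly: anything that differs at all is an outlier.
        return {i for i, v in enumerate(values) if v != med}
    return {i for i, v in enumerate(values) if abs(v - med) / mad > threshold}


def bricks_to_reboot(metrics, min_anomalous=6):
    """`metrics` maps metric name -> list of per-brick values.
    A brick is rebooted when it is an outlier in at least `min_anomalous` metrics."""
    votes = {}
    for values in metrics.values():
        for brick in mad_outliers(values):
            votes[brick] = votes.get(brick, 0) + 1
    return sorted(b for b, n in votes.items() if n >= min_anomalous)


healthy = [100, 102, 98, 101, 99]
sick = [100, 102, 10, 101, 99]  # the brick at index 2 deviates from its peers
metrics = {f"metric{i}": (sick if i < 7 else healthy) for i in range(9)}
print(bricks_to_reboot(metrics))  # [2]
```

Comparing against the peer median rather than a fixed threshold means a workload change that shifts all bricks together raises no alarm.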
Network Fault – 70% packet loss in SAN
[Figure: timeline annotated with: network fault injected, fault detected, brick killed, brick restarts]

J2EE as a platform for uRB-based recovery
Java 2 Enterprise Edition, a component framework for Internet request-reply style apps
  App is a collection of components (“EJBs”) created by subclassing a managed container class
  application server provides component creation, thread management, naming/directory services, abstractions for database and HTTP sessions, etc.
  Web pages with embedded servlets and Java Server Pages invoke EJB methods
  potential to improve all apps by modifying the appserver
J2EE has a strong following, encourages modular programming, and there are open source appservers

Separating data recovery from process recovery
For HTTP workloads, session state ≈ app checkpoint
  Store session state in a microrebootable session state subsystem (NSDI’04)
  Recovery==non-state-preserving process restart, redundancy gives probabilistic durability
    • Response time cost of externalizing session state: ~25%
    • SSM, an N-way RAM-based state replication [NSDI 04] behind existing J2EE API
Microreboot EJB’s:
  destroy all instances of EJB and associated threads
  releases appserver-level resources (DB connections, etc)
  discards appserver metadata about EJB’s
  session state preserved across uRB

JBoss+uRB’s+SSM + fault injection
[Diagram: testbed]
Client-based failure detection
Fault injection: null refs, deadlocks/infinite loop, corruption of volatile EJB metadata, resource leaks, Java runtime errors/exc
RUBiS: online auction app (132K items, 1.5M bids, 100K subscribers)
150 simulated users/node
35-45 req/sec/node
Workload mix based on a commercial auction site

uRB vs. full RB - action weighted goodput
Example: corrupt JNDI database entry, RuntimeException, Java error; measure G_aw in 1-second buckets
  Localization is crude: static analysis to associate failed URL with set of EJB’s, incrementing an EJB’s score whenever it’s implicated
  With uRB’s, 89% reduction in failed requests and 9% more successful requests compared to full RB, despite 6 false positives

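The crude localization step can be sketched as a score table. The URL-to-EJB map, the bean names, and the function name below are hypothetical stand-ins for the output of the static analysis; only the scoring rule (increment every EJB implicated by a failed URL) comes from the slide.

```python
from collections import Counter

# Hypothetical result of static analysis: URL -> EJBs that may serve it.
URL_TO_EJBS = {
    "/bid": {"BidBean", "ItemBean", "UserBean"},
    "/browse": {"ItemBean", "CategoryBean"},
    "/login": {"UserBean"},
}


def rank_suspects(failed_urls, url_to_ejbs=URL_TO_EJBS):
    """Increment an EJB's score each time a failed URL implicates it,
    then return EJB names ordered from most to least suspicious."""
    scores = Counter()
    for url in failed_urls:
        for ejb in url_to_ejbs.get(url, ()):
            scores[ejb] += 1
    return sorted(scores, key=lambda ejb: (-scores[ejb], ejb))


print(rank_suspects(["/bid", "/browse", "/bid"]))
# ['ItemBean', 'BidBean', 'UserBean', 'CategoryBean']
```

The ranking implicates healthy beans too (`BidBean` and `UserBean` here), which is acceptable only because microrebooting a wrongly suspected EJB is cheap.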
Performance overhead of JAGR
150 clients/node: latency=38 msec (3 -> 7 nodes)
Human-perceptible delay: 100-200 msec
Real auction site: 41 req/sec, 33-300 msec latency

Improving availability from user’s point of view
uRB improves user-perceived availability vs. full reboot
uRB complements failover
  (a) Initially, excess load on 2nd node brought it down immediately after failover
  (b) uRB results in some failed requests (96% fewer) from temporary overload
  (c,d) Full reboot vs. uRB without failover
For small clusters, should always try uRB first

uRB Tolerates Lax Failure Detection
Tolerates lag in detection latency (up to 53s in our microbenchmark) and high false positive rates
  Our naive detection algorithm had up to 60% false positive rate in terms of what to uRB
  we injected 97% false positives before reduction in overall availability equaled cost of full RB
Always safe to use as “first line of defense”, even when failover is possible
  cost(uRB+other recovery) ≈ cost(other recovery)
  success rate of uRB on reboot-curable failures is comparable to whole-appserver reboot

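The "first line of defense" policy and its cost argument can be sketched as follows. All names, callbacks, and timings are hypothetical; the point is only that trying a cheap action first adds at most its own small cost before escalation.

```python
def recover(component, microreboot, full_reboot, is_healthy):
    """Try the cheap action first; escalate only if it did not help."""
    actions = ["uRB"]
    microreboot(component)
    if not is_healthy(component):
        full_reboot()
        actions.append("full RB")
    return actions


def expected_cost(p_urb_cures, urb_cost, full_cost):
    """Expected recovery time when a uRB is always attempted first."""
    return urb_cost + (1.0 - p_urb_cures) * full_cost


# Hypothetical timings: uRB takes 0.5 s, full reboot takes 20 s.
print(expected_cost(0.8, 0.5, 20.0))  # uRB cures 80% of failures
print(expected_cost(0.0, 0.5, 20.0))  # uRB never helps: barely worse than 20 s

state = {"healthy": False}
log = recover(
    "ItemBean",
    microreboot=lambda c: None,                      # uRB does not cure this fault
    full_reboot=lambda: state.update(healthy=True),  # full reboot does
    is_healthy=lambda c: state["healthy"],
)
print(log)  # ['uRB', 'full RB']
```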
Performance penalties
Baseline workload mix modeled on commercial site
  150 simulated clients per node, ~40-45 reqs/sec per node
  system at ~70% utilization
Throughput ~1% worse due to instrumentation
  worst-case response latency increases from 800 to 1200ms
  Average case: 45ms to 80ms; compare to 35-300ms for commercial service
  Well within “human tolerance” thresholds
  Entirely due to factoring out of session state
Performance penalty is tolerable & worth it

Recovery and maintenance

Microrecovery for Maintenance Operations
Capacity discovery in SSM
  TCP-inspired flow control keeps system from falling off a cliff
  “OK to say no” is essential for this backpressure to work
Microrejuvenation in JAGR (proactively microreboot to fix localized memory leaks)
Splitting/coalescing in Dstore
  Split = failure + reappearance of failed node
  Same safe/non-disruptive recovery mechanisms are used to lazily repair inconsistencies after new node appears
Consequently, performance impact small enough to do this as an online operation

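TCP-inspired flow control is commonly realized as additive increase, multiplicative decrease (AIMD). The sketch below is one plausible shape for capacity discovery, not the SSM code: the class name, the constants, and the idea of one window per stub are assumptions.

```python
class AimdWindow:
    """Additive-increase/multiplicative-decrease cap on outstanding requests."""

    def __init__(self, initial=10.0, increase=1.0, decrease=0.5, floor=1.0):
        self.limit = initial
        self.increase = increase
        self.decrease = decrease
        self.floor = floor

    def on_success(self):
        """A request completed: probe for more capacity, slowly."""
        self.limit += self.increase

    def on_drop(self):
        """A brick said "no" by dropping a request: back off sharply."""
        self.limit = max(self.floor, self.limit * self.decrease)

    def admit(self, outstanding: int) -> bool:
        """Admit a new request only while under the current limit."""
        return outstanding < self.limit


window = AimdWindow()
for _ in range(5):
    window.on_success()   # limit grows from 10 to 15
window.on_drop()          # limit halves to 7.5
print(window.limit, window.admit(7), window.admit(8))
```

The sender learns the system's capacity from drops alone, so no brick ever has to advertise how much load it can take.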
Using microrecovery for maintenance
Capacity discovery in SSM
  redundancy mechanism used for recovery (“write many, wait few”) also used to “say no” while gracefully degrading performance
[Figure: Offered Load vs. Goodput, AI/MD Admission Control; x-axis: number of machines (0-30); y-axis: number of requests per second (0-7000); series: Offered Load, Goodput]

Full rejuvenation vs. microrejuvenation
[Figure: full rejuvenation vs. microrejuvenation; annotated value: 76%]

Splitting/coalescing in Dstore
Splitting/coalescing in Dstore
  Split = failure + reappearance of failed node
  Same mechanisms used to lazily repair inconsistencies

Summary: microrecoverable systems
Separation of data from process recovery
  Special-purpose data stores can be made microrecoverable
OK to initiate microrecovery anytime for any reason
  no loss of correctness, tolerable loss of performance
  likely (but not guaranteed) to fix an important class of transients
  won’t make things worse; can always try “full” recovery afterward
  inexpensive enough to tolerate “sloppy” fault detection
  low-cost first line of defense
some “maintenance” ops can be cast as microrecovery
  due to low cost, “proactive” maintenance can be done online
  can often convert unplanned long downtime into planned shorter performance hit

Anomaly detection as failure detection

Example: Anomaly Finding Techniques
Finding/preventing bugs
  Before runtime*
    Manual “Inspeculation”, code inspection/reviews, using debugging tools [Eisenstadt97]
    model checking, Lint-like tools
    human factors/processes (extreme programming, etc.)
  At runtime**
    Runtime-safe languages, dynamic data analysis
    Self-repairing data structures [Demsky & Rinard]
    Sandboxing/isolation, stack guarding, etc.
    Redundancy (TMR, Byzantine, etc.)
Detecting anomalies
  Before runtime*
    static analysis (even gcc -Wall)
    type-safe languages
    Bugs as anomalous behavior [Chou & Engler 2002]
  At runtime**
    Heuristic detection of data races [Eraser, Savage et al]
    Heuristic detection of possible invariant violation [Hangal & Lam 2002]
    Performance anomalies [Richardson et al]
    Path-based analysis [Chen, Kiciman et al 2002]
    Deadlock detection and repair
* Includes design time and build time
** Includes both offline (invasive) and online detection techniques
Question: does anomaly == bug?

Examples of Badness Inference
Sometimes can detect badness by looking for inconsistencies in runtime behavior
  We can observe program-specific properties (though using automated methods) as well as program-generic properties
  Often, we must be able to first observe program operating “normally”
Eraser: detecting data races [Savage et al. 1997]
  Observe lock/unlock patterns around shared variables
  If a variable usually protected by lock/unlock or mutex is observed to have interleaved reads, report a violation
DIDUCE: inferring invariants, then detecting violations [Hangal & Lam 2002]
  Start with strict invariant (“x is always =3”)
  Relax it as other values seen (“x is in [0,10]”)
  Increase confidence in invariant as more observations seen
  Report violations of invariants that have threshold confidence

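The relax-and-report idea can be sketched with a simple range invariant. This is a simplified illustration in the spirit of the DIDUCE bullets, not DIDUCE's actual invariant representation or confidence formula; the class name, the confidence counter, and the threshold of 5 are assumptions.

```python
class RangeInvariant:
    """Learns "x is in [lo, hi]" and reports violations once confident."""

    def __init__(self, min_confidence=5):
        self.lo = None
        self.hi = None
        self.confidence = 0
        self.min_confidence = min_confidence

    def observe(self, x) -> bool:
        """Record x; return True if x violates an invariant we trust."""
        if self.lo is None:              # first value: "x is always = x0"
            self.lo = self.hi = x
            self.confidence = 1
            return False
        if self.lo <= x <= self.hi:      # consistent: confidence grows
            self.confidence += 1
            return False
        violation = self.confidence >= self.min_confidence
        self.lo, self.hi = min(self.lo, x), max(self.hi, x)  # relax the range
        self.confidence = 1              # the relaxed invariant starts over
        return violation


inv = RangeInvariant(min_confidence=5)
reports = [inv.observe(x) for x in [3, 3, 0, 10, 4, 5, 6, 7, 2, 42]]
print(reports)          # only the final value, 42, is reported
print(inv.lo, inv.hi)   # 0 42
```

Early out-of-range values (0 and 10) silently widen the invariant; only a value that breaks a well-supported range is reported.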
Generic runtime monitoring techniques
What conditions are we monitoring for?
  Fail-stop vs. Fail-silent vs. Fail-stutter
  Byzantine failures
Generic methods
  Heartbeats (what does loss of heartbeat mean? Who monitors them?)
  Resource monitoring (what is “abnormal”?)
  Application-specific monitoring: ask a question you know the answer to
Fault model enforcement
  coerce all observed faults to an “expected faults” subset
  if necessary, take additional actions to completely “induce” the fault
  Simplifies recovery since fewer distinct cases
  Avoids potential misdiagnosis of faults that have common symptoms
  Note, may sometimes appear to make things “worse” (coerce a less-severe fault to a more-severe fault)
  Doesn’t exercise all parts of the system

© 2004A. Fox
Internet performance failure detection
Various approaches, all of which exploit the law of large numbers and (sort of) Central Limit Theorem (which is?)
  Establish “baseline” of quantity to be monitored
    • Take observations, factor out data from known failures
    • Normalize to workload?
  Look for “significant” deviations from baseline
What to measure?
  Coarse-grain: number of reqs/sec
  Finer-grain: Number of TCP connections in Established, Syn_sent, Syn_rcvd state
  Even finer: additional internal request “milestones”
    • Hard to do in an application-generic way...but frameworks can save us

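The baseline-then-deviation recipe can be sketched in a few lines. This is only an illustration under stated assumptions: the baseline is summarized by a sample mean and standard deviation, "significant" means more than `k` standard deviations away, and the function names and the default `k=3.0` are hypothetical choices rather than anything prescribed by the slides.

```python
import statistics

def build_baseline(samples):
    """Summarize fault-free observations (e.g. reqs/sec) as (mean, stdev)."""
    return statistics.mean(samples), statistics.stdev(samples)

def is_significant(value, baseline, k=3.0):
    """True if value deviates from the baseline mean by more than k stdevs."""
    mean, std = baseline
    if std == 0:
        # Degenerate baseline: every sample was identical.
        return value != mean
    return abs(value - mean) > k * std

# Example: a baseline of roughly 100 reqs/sec.
baseline = build_baseline([100, 102, 98, 101, 99])
```

With this baseline a reading of 101 is within normal variation, while a reading of 60 is flagged. Data from known failures would be removed from `samples` before calling `build_baseline`.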
Example 1: Detection & recovery in SSM
9 “State” statistics collected per second from each replica
Tarzan time series analysis* compares relative frequencies of substrings corresponding to discretized time series
“anomalous” => at least 6 stats “anomalous”; works for aperiodic or irregular-period signals
robust against workload changes that affect all replicas equally and against highly-correlated metrics
*Keogh et al., Finding surprising patterns in a time series database in linear time and space, SIGKDD 2002

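The "at least 6 of 9 statistics anomalous" rule, together with comparison across replicas, can be sketched as below. Note the substitution: to keep the example short it uses a median-absolute-deviation test across peers in place of Tarzan time series analysis. The function names, the multiplier `k`, and the replica names in the usage are assumptions for illustration.

```python
import statistics

def is_outlier(value, peer_values, k=3.0):
    """True if value is more than k MADs from the median of peer_values."""
    med = statistics.median(peer_values)
    mad = statistics.median([abs(v - med) for v in peer_values])
    if mad == 0:
        # Most peers agree exactly, so any different value stands out.
        return value != med
    return abs(value - med) > k * mad

def anomalous_replicas(stats_by_replica, votes_needed=6):
    """stats_by_replica maps replica name -> list of statistics (9 in the slide).

    A replica is flagged when at least votes_needed of its statistics are
    outliers relative to the same statistic on all replicas.
    """
    replicas = sorted(stats_by_replica)
    n_stats = len(stats_by_replica[replicas[0]])
    flagged = []
    for name in replicas:
        votes = 0
        for i in range(n_stats):
            peers = [stats_by_replica[r][i] for r in replicas]
            if is_outlier(stats_by_replica[name][i], peers):
                votes += 1
        if votes >= votes_needed:
            flagged.append(name)
    return flagged
```

Because each statistic is judged against the other replicas rather than against a fixed threshold, a workload change that shifts every replica equally produces no votes in this sketch.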
What faults does this handle?
Essentially 100% availability vs. injected faults:
  Node crash/hang/timeout/freeze
  Fail-stutter: Network loss (drop up to 70% of packets randomly)
  Periodic slowdown (eg from garbage collection)
  Persistent slowdown (one node lags the others)
Underlying (weak) assumption: “Most bricks are doing mostly the right thing most of the time”
All anomalies can be safely “coerced” to crash faults
  If reboot doesn’t fix, it didn’t cost you much to try it
  Human notified after threshold number of restarts; system has no concept of “recovery”
Allows SSM to be managed like a farm of stateless servers

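The "coerce every anomaly to a crash fault, then escalate to a human" idea can be sketched as a tiny supervisor. This is a hypothetical illustration, not SSM's code: the class name, the `reboot` and `notify_human` callbacks, and the default `max_restarts` are all assumptions.

```python
from collections import Counter

class RestartSupervisor:
    """Treat every reported anomaly as a crash: the only remedy is a reboot."""

    def __init__(self, reboot, notify_human, max_restarts=3):
        self.reboot = reboot              # callback: reboot(node)
        self.notify_human = notify_human  # callback: notify_human(node, count)
        self.max_restarts = max_restarts
        self.restarts = Counter()

    def on_anomaly(self, node, symptom):
        # The symptom (hang, stutter, packet loss, ...) is deliberately ignored:
        # all observed faults are coerced to the single "crash" case.
        self.restarts[node] += 1
        if self.restarts[node] > self.max_restarts:
            self.notify_human(node, self.restarts[node])
        else:
            self.reboot(node)
```

The supervisor keeps no notion of a recovery procedure beyond restart, which is what lets the nodes be managed like stateless servers in this sketch.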
Detecting anomalies in application logic
Goal: detect failures whose only obvious symptom is change in semantics of application
  Example: wrong item data displayed; wouldn’t be caught by HTML scraping or HTTP logs
  Typically, site responds to HTTP pings, etc. under such failures
  These commonly result from exceptions of the form we injected into RUBiS
Insight: manifestation of bugs is the rare case, so capture “normal” behavior of system under no fault injection
  Then detect threshold deviations from this baseline
  Periodically move the baseline to allow for workload evolution

Patterns: Path shape analysis
[Figure: Middleware; HTTP Frontends, Application Components, Databases]
Model paths as parse trees in probabilistic CFG
Build grammar under “believed normal” conditions, then mark very unlikely paths as anomalous
after classification, build decision tree to correlate path features (components touched) with anomalous paths

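Treating a request path as a parse tree in a probabilistic grammar can be sketched as follows. This is a simplified illustration, not Pinpoint's actual scoring. It assumes a call tree is a `(component, [subtrees])` tuple, that each production is "component calls this ordered list of children", and that a path's likelihood is the product of its production probabilities. The component names in the usage note are hypothetical.

```python
from collections import Counter, defaultdict

def rules(tree):
    """Yield (parent, tuple_of_child_names) productions for a call tree."""
    name, children = tree
    yield name, tuple(child[0] for child in children)
    for child in children:
        yield from rules(child)

class PathGrammar:
    """Production counts learned from paths observed under normal conditions."""

    def __init__(self):
        self.counts = defaultdict(Counter)

    def train(self, tree):
        for lhs, rhs in rules(tree):
            self.counts[lhs][rhs] += 1

    def probability(self, tree):
        """Product of production probabilities; 0.0 if any production is unseen."""
        p = 1.0
        for lhs, rhs in rules(tree):
            seen = self.counts.get(lhs, Counter())
            total = sum(seen.values())
            p *= seen[rhs] / total if total else 0.0
        return p
```

For instance, after training nine times on `("Servlet", [("ItemBean", [("DB", [])])])` and once on a path where `ItemBean` never reaches `DB`, the common path scores 0.9, the rare one 0.1, and a path through a never-seen component scores 0. A threshold on this score then marks the very unlikely paths.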
Patterns: Component Interaction Analysis
[Figure: Middleware; HTTP Frontends, Application Components, Databases]
Model interactions between a component and its n neighbors in the dynamic call graph as a weighted DAG
compare to observed call graph using chi-squared goodness-of-fit
can compare either across peers or against historical data

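The chi-squared goodness-of-fit comparison can be illustrated with the standard Pearson statistic. This is a sketch under assumptions: the reference model is a mapping from neighbor name to a positive weight, every observed neighbor also appears in the reference model, and the function name and neighbor labels are hypothetical.

```python
def chi_squared(expected_weights, observed_counts):
    """Pearson chi-squared statistic of observed call counts vs. a reference model.

    expected_weights: neighbor -> positive weight (historical or peer data).
    observed_counts:  neighbor -> number of interactions seen in the test window.
    """
    total_weight = sum(expected_weights.values())
    total_observed = sum(observed_counts.values())
    stat = 0.0
    for neighbor, weight in expected_weights.items():
        # Scale the model so expected counts sum to the observed total.
        expected = total_observed * weight / total_weight
        observed = observed_counts.get(neighbor, 0)
        stat += (observed - expected) ** 2 / expected
    return stat
```

With three neighbors there are 2 degrees of freedom, so at the 5% significance level the statistic would be compared against the critical value 5.991. Observed counts that match the model give 0, while swapping a heavily used and a lightly used neighbor gives a value far above that threshold.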
Precision and recall (example)
Detection: Recall = % of failures actually detected as anomalies
  Strictly better than HTTP/HTML monitoring
Localization:
  recall = % actually-faulty requests returned
  precision = % requests returned that are faulty = 1-(FP rate)
Tradeoff between recall and precision (false positive rate)
  Even low-recall case corresponds to high detection recall (.83)
[Figures: “Localization: Recall vs. precision” and “Detection: recall, faults affecting >1% of workload”; annotated points: [R]=.68 [P]=.14 and [R]=.34 [P]=.93]

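The localization definitions can be made concrete with a short calculation. This is a minimal sketch assuming requests are identified by hypothetical integer IDs; `precision_recall` is an illustrative helper, not part of Pinpoint.

```python
def precision_recall(returned, faulty):
    """Precision and recall of a set of requests returned as faulty.

    returned: IDs the detector reported as faulty.
    faulty:   IDs that really were faulty (ground truth).
    """
    returned, faulty = set(returned), set(faulty)
    true_positives = len(returned & faulty)
    # Conventions for empty sets (an assumption): nothing returned means no
    # false positives; nothing faulty means nothing was missed.
    precision = true_positives / len(returned) if returned else 1.0
    recall = true_positives / len(faulty) if faulty else 1.0
    return precision, recall
```

If four requests are returned and two of them are among six truly faulty requests, precision is 2/4 and recall is 2/6. Returning more requests tends to raise recall while lowering precision, which is the tradeoff the slide describes.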
Pinpoint key results
Detect 89-96% of injected failures, compared to 20-79% for HTML scraping and HTTP log monitoring
Limited success in detecting injected source bugs
  Example success: caught a bug that prevented shopping cart from iterating over its contents to display them, and correctly identified at-fault component (where bug was injected)
Resilient to “normal” workload changes
  Because we bin analysis by request category
Resilient to “bug fix release” code changes
Currently slow; analysis lags ~20s behind application

Combining uRB’s and Pinpoint
Simple recovery policy:
  uRB all components whose normalized anomaly score >1.0
  if we’ve already done that, reboot the whole application
More sophisticated policies certainly possible

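The simple recovery policy can be written down as a small decision function. This is a sketch under one interpretation: "already done that" is taken to mean that every currently suspect component has already been microrebooted. The function name, return values, and the second component name in the usage note are assumptions.

```python
def choose_recovery(anomaly_scores, already_microrebooted, threshold=1.0):
    """Pick a recovery action from normalized per-component anomaly scores.

    anomaly_scores:        component name -> normalized anomaly score.
    already_microrebooted: set of components microrebooted for this incident.
    Returns (action, components) with action in
    {"none", "microreboot", "reboot_application"}.
    """
    suspects = {c for c, score in anomaly_scores.items() if score > threshold}
    if not suspects:
        return ("none", set())
    if suspects <= already_microrebooted:
        # Microrebooting these components did not help: escalate.
        return ("reboot_application", set())
    return ("microreboot", suspects)
```

For example, scores of `{"SB_viewItem": 2.3, "SB_browse": 0.4}` first yield a microreboot of `SB_viewItem` only. If the same component is still anomalous afterwards, the policy escalates to a whole-application reboot.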
Combining uRB’s and Pinpoint
Example: data structure corruption in SB_viewItem EJB
  350 simulated clients
  18.5s to detect/localize
  <1s to repair
  Note, returned Web page would be valid but incorrect
Robust to typical workload changes & bug patches
More comprehensive deployment in progress

Faulty Request Identification
HTTP monitoring has perfect precision since it’s a “ground truth indicator” of a server fault
Path-shape analysis pulls more points out of the bottom left corner
[Figure annotations: “Failures injected but not detected”; “Failures detected, faulty requests identified as such”; “Failures not detected, but low false positives”; “Failures detected, but high rate of mis-identification of faulty requests (false positive)”]

Tolerating false positives in DStore
Metrics and algorithm comparable to those used in SSM
We inject “fail-stutter” behavior by increasing request latency
  Bottom case: more aggressive detection also results in 2 “unnecessary” reboots
  But they don’t matter much if there is modest replication
Currently some voodoo constants for thresholds in both SSM and DStore
  Recall that these are “off-the-shelf” algorithms; should be able to do better
Trade-off: earlier detection vs. false positives

Summary of case studies
(per subsystem: Instrumentation / Microrecovery / Statistical monitoring / Performance cost)
SSM (diskless session state store) [NSDI 04]
  Instrumentation: State and activity metric ‘sensors’ built into app
  Microrecovery: Whole-node fast reboot (doesn’t preserve state)
  Statistical monitoring: Tarzan time-series analysis; Median absolute deviation
  Performance cost: 20-50% request latency; still competitive with commercial service; <1% thruput reduction
DStore (persistent hashtable) [ACM Trans. on Storage]
  Microrecovery: Whole-node reboot (preserves state)
JAGR (J2EE application server) [OSDI 04]
  Instrumentation: Inter-EJB call info monitored by modifying container; could also use aspects
  Microrecovery: Microreboots of EJB’s
  Statistical monitoring: Anomalous code paths modeled using PCFG; component interactions modeled by comparing dynamic call graphs
  Performance cost: ~1% on request latency and thruput
Detection and localization good even with “simple” algorithms; fits well with localized recovery
Performance penalty is tolerable & worth it
Note, microrecovery can also be used for microrejuvenation

Discussion

Discussion: What makes this work?
What made it work in our examples specifically?
  Recovery speed: Weaker consistency in SSM and DStore in exchange for fast recovery and predictable work done per request
  Recovery correctness: J2EE apps constrained to “checkpoint” by manipulating session state, and this is brought out in the app-writer-visible API’s; good isolation between components and relative lack of shared state
  Anomaly detection: app behavior alternates short sequences of EJB calls with updates to persistent state, so can be characterized in terms of those calls
Observations
  Neither diagnosis => recovery nor recovery => diagnosis
  Localization != diagnosis, but it’s an important optimization

Why are statistical methods appealing?
Large complex systems tend to exercise a lot of their functionality in a fairly short amount of time
  Especially Internet services, with high-volume workloads of largely independent requests
Even if we don’t know what to measure, statistical and data mining techniques can help figure it out
Performance problems are often linked with dependability problems (fail-stutter behavior), for either HW or SW reasons
Most systems work well most of the time
  Corollary: in a replica system, replicas should behave “the same” most of the time

When does it not work?
When SLT-based monitoring does not apply
  Base-rate fallacy: monitoring events so rare that FP rate dominates
  Gaming the system (deliberately or inadvertently)
When failures can’t be cured by any kind of micro-recovery
  Persistent-state corruption (or hardware failure)
  Corrupted configuration data
  “a spectrum of undo”
When you can’t say no
  Backpressure and possibility of caller-retry are used to improve predictability
  Promising you will say “yes” may be difficult...question may be whether end-to-end guarantees are needed at lower layers

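The base-rate fallacy bullet can be made concrete with Bayes' rule. The sketch below computes the fraction of alarms that correspond to real faults. The function name and the numbers in the usage note are assumptions chosen for illustration, not measurements from the case studies.

```python
def alarm_precision(base_rate, detection_rate, false_positive_rate):
    """P(real fault | alarm) via Bayes' rule.

    base_rate:           fraction of monitored events that are real faults.
    detection_rate:      P(alarm | fault).
    false_positive_rate: P(alarm | no fault).
    """
    true_alarms = detection_rate * base_rate
    false_alarms = false_positive_rate * (1.0 - base_rate)
    return true_alarms / (true_alarms + false_alarms)
```

With a 99% detection rate and a 1% false positive rate, a fault that occurs in 1 of every 10,000 events gives `alarm_precision(0.0001, 0.99, 0.01)` of just under 1%, so nearly all alarms are false. If faults occur in 1 of every 10 events, the same detector yields about 92%.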
SSM/DStore as “extreme” design points
Goal was to investigate extremes of “no special recovery”
Could explore erasure coding (RepStore does this dynamically)
Weakened consistency model of DStore vs. 2PC
  Spread cost of repair lazily across many operations (rather than bulk recovery)
  Spread some 2PC state maintenance to client in the form of “write in progress” cookie
  May be that 2PC would be affordable, but we were interested in extreme design point of “no special restart code”

Role of 3-tier architecture
Separation of concerns: really, separation of process recovery (control flow) from data recovery
uRB and reboots recover processes; SSM, DStore, and traditional relational databases recover data
Not addressed is repair of data

Shouldn’t we just make software better?
Yes we should (and many people are), but...
We use commodity HW&SW, despite the fact that they are imperfect, less reliable than “hardened” or purpose-built components, etc. Why?
  Price/performance follows volume
  Allows specialization of efforts and composition of reusable building blocks (vs. building stovepipe system)
  In short, it allows much faster overall pace of innovation and deployment, for both technical and economic reasons, even though the components themselves are imperfect
We should assume “commodity programmers” too (observation from Brewster Kahle)
  Give as much generic support to application as we can

Challenges & open issues
Algorithm issues that impinge on systems work
  Hand-tuned constants/thresholds in algorithms--seems to be an issue in other applications of SLT as well
  Online vs. offline algorithms
  Stability of closed loop
Systems issues
  How do you “know” you’ve checkpointed all important state, or that something is safe to retry?
  How do you debug a “moving target”? Traditional methods/tools are confounded by code obfuscation, sudden loss of transient program state (stack & heap), etc. (a great PhD thesis...)
    debugging today’s real systems is already hard for these reasons
Real apps, faultloads, best practices, etc. hard to get!

RADS message in a nutshell
Statistical techniques can identify “interesting” features and relationships from large datasets, but frequent tradeoff between detection rate (or detection time) and false positives
Make “micro-recovery” so inexpensive that occasional false positives don’t matter
Achievable now on realistic applications & workloads
Synergistic with componentized apps & frameworks
Specific point of leverage for collaboration with machine learning research; lots of headroom for improvement
  Even “simple” algorithms show encouraging initial results

Project possibilities

BACKUP SLIDES