Intro & Overview of RADS goals
Armando Fox & Dave Patterson
CS 444A/CS 294-6, Stanford/UC Berkeley
Fall 2004
© 2004 A. Fox

Administrivia
• Course logistics & registration
• Project expectations and other deliverables
Background and motivation for RADS
ROC and its relationship to RADS
Early case studies
Discussion: projects, research directions, etc.

Administrivia/goals
Stanford enrollment vs. Axess
SLT and CT tutorial VHS/DVDs available to view
SLT and CT Lab/assignments grading policy
Stanford and Berkeley meeting/transportation logistics
Format of course

Background & motivation for RADS

RADS in One Slide
Philosophy of ROC: focus on lowering MTTR to improve overall availability
ROC achievements: two levels of lowering MTTR
• “Microrecovery”: fine-grained generic recovery techniques recover only the failed part(s) of the system, at much lower cost than whole-system recovery
• Undo: sophisticated tools to help human operators selectively back out destructive actions/changes to a system
• General approach: use microrecovery as “first line of defense”; when it fails, provide support to human operators to avoid having to “reinstall the world”
RADS insight: can combine cheap recovery with statistical anomaly detection techniques
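Why lowering MTTR helps: the standard steady-state formula is availability = MTTF / (MTTF + MTTR). A minimal sketch; the MTTF and MTTR figures are illustrative assumptions, not numbers from the course.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability = MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)

# Hypothetical service that fails about once every 1000 hours.
slow = availability(1000.0, 1.0)    # one-hour recovery
fast = availability(1000.0, 0.01)   # 36-second recovery
print(f"{slow:.6f} {fast:.6f}")
```

With MTTF held fixed, shrinking MTTR by a factor of 100 shrinks unavailability by roughly the same factor.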

Hence, (at least) 2 parts to RADS
Investigating other microrecovery methods
Investigating analysis techniques
• What to capture/represent in a model
Addressing fundamental open challenges
• stability
• systematic misdiagnosis
• subversion by attackers
• etc.
General insight: “different is bad”
• “law of large numbers” arguments support this for large services

Why RADS
Motivation
• 5 9’s availability => 5 down-minutes/year => must recover from (or mask) most failures without human intervention
• a principled way to design “self-*” systems
Technology
• High-traffic large-scale distributed/replicated services => large datasets
• Analysis is CPU-intensive => a way to trade extra CPU cycles for dependability
• Large logs/datasets for models => storage is cheap and getting cheaper
RADS addresses a clear need while exploiting demonstrated technology trends
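The “5 down-minutes/year” figure is simple arithmetic: an unavailability of 10^-5 times the minutes in a year. A small sketch (the function name is ours):

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600, ignoring leap years

def downtime_minutes_per_year(nines: int) -> float:
    """Allowed downtime for an availability written with `nines` nines."""
    unavailability = 10.0 ** (-nines)
    return MINUTES_PER_YEAR * unavailability

for n in (3, 4, 5):
    print(n, round(downtime_minutes_per_year(n), 2))
```

Five nines leaves a budget of about 5.26 minutes per year.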

Cheap Recovery

Complex systems of black boxes
“...our ability to analyze and predict the performance of the enormously complex software systems that lies at the core of our economy is painfully inadequate.” (Choudhury & Weikum, 2000 PITAC Report)
Networked services too complex and rapidly-changing to test exhaustively: “collections of black boxes”
• Weekly or biweekly code drops not uncommon
• Market activities lead to integration of whole systems
Need to get humans out of loop for at least some monitoring/recovery loops
• hence interest in “autonomic” approaches
• fast detection is often at odds with false alarms

Consequences
Complexity breeds increased bug counts and bug impact
• Heisenbugs, race conditions, environment-dependent and hard-to-reproduce bugs still account for majority of SW bugs in live systems
• up to 80% of bugs found in production are those for which a fix is not yet available*
• some application-level failures result in user-visible bad behavior before they are detected by site monitors
• Tellme Networks: up to 75% of downtime is “detection” (sometimes by user complaints), followed by localization
• Amazon, Yahoo: gross metrics track second-order effects of bugs, but lag the actual bug by minutes or tens of minutes
Result: downtime and increased management costs
* A.P. Wood, Software reliability from the customer view, IEEE Computer, Aug. 2003

“Always adapting, always recovering”
Build statistical models of “acceptable” operating envelope by measurement & analysis on live system
• Control theory, statistical correlation, anomaly detection...
Detect runtime deviations from model
• typical tradeoff is between detection rate & false positive rate
Rely on external control using inexpensive and simple mechanisms that respect the black box, to keep system within its acceptable operating envelope
• invariant: attempting recovery won’t make things worse
• makes inevitable false positives tolerable
• can then reduce false negatives by “tuning” algorithms to be more aggressive and/or deploying multiple detectors
Systems that are “always adapting, always recovering”

Toward recovery management invariants
Observation: instrumentation and analysis
• collect and analyze data from running systems
• rely on “most systems work most of the time” to automatically derive baseline models
Analysis: detect and localize anomalous behavior
Action: close loop automatically with “micro-recovery”
• “Salubrious”: returns some part of system to known state
  • Reclaim resources (memory, DB conns, sockets, DHCP lease...), throw away corrupt transient state, set up to retry operation if appropriate
• Safe: no effect on correctness, minimal effect on performance
• Localized: parts not being microrecovered aren’t affected
Fast recovery simplifies failure detection and recovery management.

Non-goals/complementary work
All of the following are being capably studied by others, and directly compose with our own efforts...
Byzantine fault tolerance
In-place repair of persistent data structures
Hard-real-time response guarantees
Adding checkpointing to legacy non-componentized applications
Source code bug finding
Advancing the state of the art in SLT (analysis algorithms)

Outline
Micro-recoverable systems
• Concept of microrecovery
• A microrecoverable application server & session state store
Application-generic SLT-based failure detection
• Path and component analysis and localization for appserver
• Simple time series analyses for purpose-built state store
Combining SLT detection with microrecoverable systems
Discussion, related work, implications & conclusions

Microrebooting: one kind of microrecovery
60+% of software failures in the field* are reboot-curable, even if root cause is unknown... why?
• Rebooting discards bad temporary data (corrupted data structures that can be rebuilt) and (usually) reclaims used resources
• reestablishes control flow in a predictable way (breaks deadlocks/livelocks, returns thread or process to its start state)
To avoid imperiling correctness, we must...
• Separate data recovery from process recovery
• Safeguard the data
• Reclaim resources with high confidence
Goal: get same benefits of rebooting but at much finer grain (hence faster and less disruptive) - microrebooting
* D. Oppenheimer et al., Why do Internet services fail and what can be done about it?, USITS 2003

Write example: “Write to Many, Wait for Few”
[Diagram: Browser, AppServer with STUB, Bricks 1-5]
Try to write to W random bricks, W = 4
Must wait for WQ bricks to reply, WQ = 2

Write example: “Write to Many, Wait for Few”
[Diagram: Browser, AppServer with STUB, Bricks 1-5; labels: “14”, “Cookie holds metadata”, “Crashed? Slow?”]
Try to write to W random bricks, W = 4
Must wait for WQ bricks to reply, WQ = 2

Read example:
[Diagram: Browser, AppServer with STUB, Bricks 1-5; label: “14”]
Try to read from Bricks 1, 4
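The write and read examples above can be sketched as a small simulation. This is a toy under stated assumptions, not the SSM implementation: the Brick class, function names, and cookie layout are hypothetical, and replies are collected sequentially, whereas a real stub would send writes concurrently and return after the first WQ replies.

```python
import random

class Brick:
    """Hypothetical in-memory brick; `alive` simulates a crash or slowness."""
    def __init__(self, brick_id):
        self.brick_id = brick_id
        self.alive = True
        self.store = {}

    def write(self, key, value):
        if not self.alive:
            return False          # no reply: crashed or too slow
        self.store[key] = value
        return True

    def read(self, key):
        if not self.alive:
            return None
        return self.store.get(key)

def write_many_wait_few(bricks, key, value, w=4, wq=2, rng=random):
    """Send the write to w random bricks; succeed once wq of them reply."""
    targets = rng.sample(bricks, w)
    acked = [b.brick_id for b in targets if b.write(key, value)]
    if len(acked) < wq:
        raise RuntimeError("fewer than WQ bricks replied")
    return acked[:wq]             # metadata that would go in the cookie

def read_any(bricks, cookie, key):
    """Try the bricks named in the cookie until one returns the value."""
    by_id = {b.brick_id: b for b in bricks}
    for brick_id in cookie:
        value = by_id[brick_id].read(key)
        if value is not None:
            return value
    raise KeyError(key)

bricks = [Brick(i) for i in range(1, 6)]
cookie = write_many_wait_few(bricks, "session-42", {"cart": ["book"]})
bricks[cookie[0] - 1].alive = False   # one of the cookie's bricks crashes
print(read_any(bricks, cookie, "session-42"))
```

Because the cookie names WQ = 2 bricks, the read still succeeds after one of them crashes.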


[Diagram sequence, untitled: Browser, AppServer with STUB, Bricks 1-5, label “14”; then “Brick 1 crashes”; then only Bricks 2-5 remain]

SSM: Failure and Recovery
Failure of single node
• No data loss, WQ-1 remain
• State is available for R/W during failure
Recovery
• Restart – No special case recovery code
• State is available for R/W during brick restart
• Session state is self-recovering
  • User’s access pattern causes data to be rewritten

Backpressure and Admission Control
[Diagram: two AppServer STUBs, Bricks 1-5; labels: “Heavy flow to Brick 3”, “Drop Requests”]

Statistical Monitoring
[Diagram: Bricks 1-5, Pinpoint, Statistics]
Statistics: NumElements, MemoryUsed, InboxSize, NumDropped, NumReads, NumWrites

SSM Monitoring
N replicated bricks handle read/write requests
• Cannot do structural anomaly detection!
• Alternative features (performance, mem usage, etc)
Activity statistics: How often did a brick do something?
• Msgs received/sec, dropped/sec, etc.
• Same across all peers, assuming balanced workload
• Use anomalies as likely failures
State statistics: Current state of system
• Memory usage, queue length, etc.
• Similar pattern across peers, but may not be in phase
• Look for patterns in time-series; differences in patterns indicate failure at a node.

Detecting Anomalous Conditions
Metrics compared against those of “peer” bricks
• Basic idea: Changes in workload tend to affect all bricks equally
• Underlying (weak) assumption: “Most bricks are doing mostly the right thing most of the time”
Anomaly in 6 or more (out of 9) metrics => reboot brick
Use different techniques for different stats
• “Activity” – absolute median deviation
• “State” – Tarzan time-series analysis
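A rough sketch of the peer comparison for “Activity” statistics, using what is commonly called the median absolute deviation (MAD). The threshold of 3 MADs, the handling of MAD = 0, and the function names are assumptions; only the “6 or more out of 9” rule comes from the slide.

```python
from statistics import median

def mad_outliers(values, threshold=3.0):
    """Indices whose distance from the peer median exceeds threshold * MAD."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:                  # assumption: any difference counts when peers agree exactly
        return [i for i, v in enumerate(values) if v != med]
    return [i for i, v in enumerate(values) if abs(v - med) / mad > threshold]

def bricks_to_reboot(metrics, min_anomalous=6):
    """metrics maps metric name -> per-brick values (same brick order).
    A brick is flagged when it is an outlier in at least min_anomalous metrics."""
    counts = {}
    for values in metrics.values():
        for i in mad_outliers(values):
            counts[i] = counts.get(i, 0) + 1
    return sorted(i for i, c in counts.items() if c >= min_anomalous)

# Nine hypothetical metrics for five bricks; brick index 2 deviates in seven.
normal = [100, 102, 98, 101, 99]
broken = [100, 102, 10, 101, 99]
metrics = {f"m{k}": (broken if k < 7 else normal) for k in range(9)}
print(bricks_to_reboot(metrics))
```

Comparing against the peer median rather than a fixed threshold is what lets a workload shift that hits all bricks equally pass without alarms.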

Network Fault – 70% packet loss in SAN
[Chart; event labels in order: Network fault injected, Fault detected, Brick killed, Brick restarts]

J2EE as a platform for uRB-based recovery
Java 2 Enterprise Edition, a component framework for Internet request-reply style apps
• App is a collection of components (“EJBs”) created by subclassing a managed container class
• application server provides component creation, thread management, naming/directory services, abstractions for database and HTTP sessions, etc.
• Web pages with embedded servlets and Java Server Pages invoke EJB methods
• potential to improve all apps by modifying the appserver
J2EE has a strong following, encourages modular programming, and there are open source appservers

Separating data recovery from process recovery
For HTTP workloads, session state acts as the app checkpoint
• Store session state in a microrebootable session state subsystem (NSDI’04)
• Recovery == non-state-preserving process restart, redundancy gives probabilistic durability
  • Response time cost of externalizing session state: ~25%
  • SSM, an N-way RAM-based state replication [NSDI 04] behind existing J2EE API
Microreboot EJBs:
• destroy all instances of EJB and associated threads
• releases appserver-level resources (DB connections, etc)
• discards appserver metadata about EJBs
• session state preserved across uRB

JBoss+uRBs+SSM + fault injection
Client-based failure detection
Fault injection: null refs, deadlocks/infinite loop, corruption of volatile EJB metadata, resource leaks, Java runtime errors/exceptions
RUBiS: online auction app (132K items, 1.5M bids, 100K subscribers)
150 simulated users/node, 35-45 req/sec/node
Workload mix based on a commercial auction site

uRB vs. full RB - action weighted goodput
Example: corrupt JNDI database entry, RuntimeException, Java error; measure G_aw in 1-second buckets
• Localization is crude: static analysis to associate failed URL with set of EJBs, incrementing an EJB’s score whenever it’s implicated
• With uRBs, 89% reduction in failed requests and 9% more successful requests compared to full RB, despite 6 false positives
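The “crude” localization can be sketched as a score table. The URL-to-EJB map, the bean names, and the threshold for choosing a microreboot target are hypothetical; the slide says only that the map comes from static analysis and that an EJB’s score is incremented whenever it is implicated.

```python
from collections import Counter

# Hypothetical URL -> EJB map (in the slide, produced by static analysis).
URL_TO_EJBS = {
    "/bid": {"BidBean", "ItemBean", "UserBean"},
    "/browse": {"ItemBean", "CategoryBean"},
    "/register": {"UserBean"},
}

def update_scores(scores, failed_url):
    """Increment the score of every EJB implicated by a failed URL."""
    for ejb in URL_TO_EJBS.get(failed_url, ()):
        scores[ejb] += 1

def pick_microreboot_target(scores, threshold=3):
    """Highest-scoring EJB once it reaches the (assumed) threshold, else None."""
    if not scores:
        return None
    ejb, score = max(scores.items(), key=lambda kv: (kv[1], kv[0]))
    return ejb if score >= threshold else None

scores = Counter()
for url in ["/bid", "/browse", "/bid"]:
    update_scores(scores, url)
print(pick_microreboot_target(scores))
```

A scheme like this will sometimes blame the wrong EJB, which is tolerable only because a mistaken microreboot is cheap.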

Performance overhead of JAGR
150 clients/node: latency=38 msec (3 -> 7 nodes)
Human-perceptible delay: 100-200 msec
Real auction site: 41 req/sec, 33-300 msec latency

Improving availability from user’s point of view
uRB improves user-perceived availability vs. full reboot
uRB complements failover
• (a) Initially, excess load on 2nd node brought it down immediately after failover
• (b) uRB results in some failed requests (96% fewer) from temporary overload
• (c,d) Full reboot vs. uRB without failover
For small clusters, should always try uRB first

uRB Tolerates Lax Failure Detection
Tolerates lag in detection latency (up to 53s in our microbenchmark) and high false positive rates
• Our naive detection algorithm had up to 60% false positive rate in terms of what to uRB
• we injected 97% false positives before reduction in overall availability equaled cost of full RB
Always safe to use as “first line of defense”, even when failover is possible
• cost(uRB+other recovery) ≈ cost(other recovery)
• success rate of uRB on reboot-curable failures is comparable to whole-appserver reboot
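The “first line of defense” policy amounts to trying the cheap recovery and escalating only if the fault persists. A minimal sketch; the hook functions and the cost model are hypothetical, not the JAGR recovery manager.

```python
def recover(component, microreboot, full_reboot, is_healthy):
    """Try the cheap recovery first; escalate only if the fault persists.
    All three callables are hypothetical hooks supplied by the caller."""
    microreboot(component)
    if is_healthy(component):
        return "microreboot"
    full_reboot()
    return "full reboot"

def expected_cost(c_urb, c_full, p_urb_fixes):
    """Expected recovery cost when a uRB is always tried before a full reboot."""
    return c_urb + (1.0 - p_urb_fixes) * c_full

# Illustrative numbers only: a uRB costing 1 unit versus a full reboot costing 20.
print(expected_cost(1.0, 20.0, 0.0), expected_cost(1.0, 20.0, 0.8))
```

Even when the microreboot never helps, the added cost is just the small uRB term, which is why trying it first is cheap insurance.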

Performance penalties
Baseline workload mix modeled on commercial site
• 150 simulated clients per node, ~40-45 reqs/sec per node
• system at ~70% utilization
Throughput ~1% worse due to instrumentation
worst-case response latency increases from 800 to 1200ms
• Average case: 45ms to 80ms; compare to 35-300ms for commercial service
• Well within “human tolerance” thresholds
• Entirely due to factoring out of session state
Performance penalty is tolerable & worth it

Recovery and maintenance

Microrecovery for Maintenance Operations
Capacity discovery in SSM
• TCP-inspired flow control keeps system from falling off a cliff
• “OK to say no” is essential for this backpressure to work
Microrejuvenation in JAGR (proactively microreboot to fix localized memory leaks)
Splitting/coalescing in Dstore
• Split = failure + reappearance of failed node
• Same safe/non-disruptive recovery mechanisms are used to lazily repair inconsistencies after new node appears
Consequently, performance impact small enough to do this as an online operation

Using microrecovery for maintenance
Capacity discovery in SSM
• redundancy mechanism used for recovery (“write many, wait few”) also used to “say no” while gracefully degrading performance
[Chart: Offered Load vs. Goodput, AI/MD Admission Control. x-axis: number of machines (0-30); y-axis: number of requests per second (0-7000); series: Offered Load, Goodput]
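AI/MD (additive increase, multiplicative decrease) is the rule behind TCP congestion control. A rough sketch of how a stub might apply it to the requests it sends to one brick; the class name, the per-brick window, and the constants are assumptions, since the slides do not give SSM’s actual parameters.

```python
class AimdWindow:
    """Per-brick send window kept by a stub (hypothetical names and constants)."""
    def __init__(self, initial=1.0, increase=1.0, decrease=0.5, minimum=1.0):
        self.size = initial
        self.increase = increase
        self.decrease = decrease
        self.minimum = minimum
        self.in_flight = 0

    def can_send(self):
        """True while fewer requests are outstanding than the window allows."""
        return self.in_flight < self.size

    def on_send(self):
        self.in_flight += 1

    def on_ack(self):
        """Reply arrived in time: additive increase."""
        self.in_flight -= 1
        self.size += self.increase

    def on_drop(self):
        """Request dropped or timed out: multiplicative decrease."""
        self.in_flight -= 1
        self.size = max(self.minimum, self.size * self.decrease)

window = AimdWindow(initial=4.0)
window.on_send(); window.on_ack()    # window grows by one
window.on_send(); window.on_drop()   # window is halved
print(window.size)
```

A brick that drops requests under load (“OK to say no”) is what produces the drop signal this rule reacts to.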

Full rejuvenation vs. microrejuvenation
[Chart; one surviving label: 76%]

Splitting/coalescing in Dstore
Split = failure + reappearance of failed node
Same mechanisms used to lazily repair inconsistencies

Summary: microrecoverable systems
Separation of data from process recovery
• Special-purpose data stores can be made microrecoverable
OK to initiate microrecovery anytime for any reason
• no loss of correctness, tolerable loss of performance
• likely (but not guaranteed) to fix an important class of transients
• won’t make things worse; can always try “full” recovery afterward
• inexpensive enough to tolerate “sloppy” fault detection
• low-cost first line of defense
some “maintenance” ops can be cast as microrecovery
• due to low cost, “proactive” maintenance can be done online
• can often convert unplanned long downtime into planned shorter performance hit

Anomaly detection as failure detection

Example: Anomaly Finding Techniques
Finding/preventing bugs, before runtime*:
• Manual “Inspeculation”, code inspection/reviews, using debugging tools [Eisenstadt97]
• model checking, Lint-like tools
• human factors/processes (extreme programming, etc.)
Finding/preventing bugs, at runtime**:
• Runtime-safe languages, dynamic data analysis
• Self-repairing data structures [Demsky & Rinard]
• Sandboxing/isolation, stack guarding, etc.
• Redundancy (TMR, Byzantine, etc.)
Detecting anomalies, before runtime*:
• static analysis (even gcc -Wall)
• type-safe languages
• Bugs as anomalous behavior [Chou & Engler 2002]
Detecting anomalies, at runtime**:
• Heuristic detection of data races [Eraser, Savage et al]
• Heuristic detection of possible invariant violation [Hangal & Lam 2002]
• Performance anomalies [Richardson et al]
• Path-based analysis [Chen, Kiciman et al 2002]
• Deadlock detection and repair
* Includes design time and build time
** Includes both offline (invasive) and online detection techniques
Question: does anomaly == bug?

Examples of Badness Inference
Sometimes can detect badness by looking for inconsistencies in runtime behavior
• We can observe program-specific properties (though using automated methods) as well as program-generic properties
• Often, we must be able to first observe program operating “normally”
Eraser: detecting data races [Savage et al. 2000]
• Observe lock/unlock patterns around shared variables
• If a variable usually protected by lock/unlock or mutex is observed to have interleaved reads, report a violation
DIDUCE: inferring invariants, then detecting violations [Hangal & Lam 2002]
• Start with strict invariant (“x is always =3”)
• Relax it as other values seen (“x is in [0,10]”)
• Increase confidence in invariant as more observations seen
• Report violations of invariants that have threshold confidence
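The DIDUCE steps listed above can be illustrated with a toy range invariant. This follows the slide’s description only and is not the tool’s actual algorithm; the class name and the confidence rule (count of consistent observations since the last relaxation) are assumptions.

```python
class RangeInvariant:
    """Toy range invariant for one variable, in the spirit of the slide's
    description of DIDUCE (not the tool's actual algorithm)."""
    def __init__(self, min_confidence=5):
        self.lo = None
        self.hi = None
        self.confirmations = 0        # consistent observations since last relaxation
        self.min_confidence = min_confidence

    def observe(self, x):
        """Return True if x violates an invariant we are confident in."""
        if self.lo is None:           # first value: strictest invariant
            self.lo = self.hi = x
            return False
        if self.lo <= x <= self.hi:
            self.confirmations += 1   # more evidence for the invariant
            return False
        violation = self.confirmations >= self.min_confidence
        self.lo, self.hi = min(self.lo, x), max(self.hi, x)   # relax
        self.confirmations = 0
        return violation

inv = RangeInvariant(min_confidence=3)
reports = [inv.observe(v) for v in [3, 0, 10, 5, 4, 6, 50]]
print(reports, (inv.lo, inv.hi))
```

Early out-of-range values only widen the range; a violation is reported once the invariant has accumulated enough confirmations.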

Generic runtime monitoring techniques

What conditions are we monitoring for?
  Fail-stop vs. Fail-silent vs. Fail-stutter
  Byzantine failures

Generic methods
  Heartbeats (what does loss of heartbeat mean? Who monitors them?)
  Resource monitoring (what is “abnormal”?)
  Application-specific monitoring: ask a question you know the answer to

Fault model enforcement
  coerce all observed faults to an “expected faults” subset
  if necessary, take additional actions to completely “induce” the fault
  Simplifies recovery since fewer distinct cases
  Avoids potential misdiagnosis of faults that have common symptoms
  Note, may sometimes appear to make things “worse” (coerce a less-severe fault to a more-severe fault)
  Doesn’t exercise all parts of the system
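
A minimal sketch of heartbeat monitoring combined with fault model enforcement: any node that stops sending heartbeats is coerced to a crash fault, whatever the real cause. The `HeartbeatMonitor` class, the `kill` callback, and the timeout value are hypothetical names chosen for illustration; time is passed in explicitly so the example is deterministic.

```python
class HeartbeatMonitor:
    """Tracks last-heartbeat times; names and timeout are illustrative assumptions."""

    def __init__(self, timeout, kill):
        self.timeout = timeout    # seconds of silence tolerated
        self.kill = kill          # hypothetical callback that forcibly stops a node
        self.last_seen = {}

    def heartbeat(self, node, now):
        self.last_seen[node] = now

    def check(self, now):
        """Coerce every silent node to a crash fault and return the nodes killed."""
        silent = [n for n, t in self.last_seen.items() if now - t > self.timeout]
        for node in silent:
            # We cannot tell hung from slow from partitioned, so we "induce"
            # the one fault that recovery knows how to handle: a crash.
            self.kill(node)
            del self.last_seen[node]
        return silent


killed = []
mon = HeartbeatMonitor(timeout=3.0, kill=killed.append)
mon.heartbeat("a", now=0.0)
mon.heartbeat("b", now=0.0)
mon.heartbeat("a", now=4.0)       # "a" keeps reporting, "b" goes silent
result = mon.check(now=5.0)
```

Here node "b" may only have been slow, which is the "appears to make things worse" case noted above; the payoff is that recovery only ever has to deal with crashes.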

Internet performance failure detection

Various approaches, all of which exploit the law of large numbers and (sort of) Central Limit Theorem (which is?)
  Establish “baseline” of quantity to be monitored
    • Take observations, factor out data from known failures
    • Normalize to workload?
  Look for “significant” deviations from baseline

What to measure?
  Coarse-grain: number of reqs/sec
  Finer-grain: Number of TCP connections in Established, Syn_sent, Syn_rcvd state
  Even finer: additional internal request “milestones”
    • Hard to do in an application-generic way...but frameworks can save us
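
One way to make “significant deviation from baseline” concrete is a z-test on the mean of a recent window, which is where the Central Limit Theorem comes in. The sketch below is an assumption-laden illustration: the `deviates` function, the threshold of 3 standard errors, and the request-rate numbers are all made up for the example.

```python
import math
import statistics


def deviates(baseline, window, z_threshold=3.0):
    """Return (z, flag): how far the window mean sits from the baseline mean.

    By the Central Limit Theorem the mean of n roughly independent samples is
    approximately normal with standard deviation sigma / sqrt(n).
    """
    mu = statistics.mean(baseline)
    sigma = statistics.stdev(baseline)            # sample standard deviation
    std_err = sigma / math.sqrt(len(window))
    z = (statistics.mean(window) - mu) / std_err
    return z, abs(z) > z_threshold


# Requests/sec observed during a period believed to be fault-free.
baseline = [98, 102, 100, 101, 99, 100, 103, 97, 100, 100]
_, normal_flag = deviates(baseline, [99, 101, 100, 102])   # ordinary jitter
_, fault_flag = deviates(baseline, [80, 78, 83, 79])       # sustained drop
```

The threshold plays the same role as the hand-tuned constants discussed later in the deck: lowering it detects sooner but flags more ordinary jitter.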

Example 1: Detection & recovery in SSM

9 “State” statistics collected per second from each replica
  Tarzan time series analysis* compares relative frequencies of substrings corresponding to discretized time series
  “anomalous” => at least 6 stats “anomalous”; works for aperiodic or irregular-period signals
  robust against workload changes that affect all replicas equally and against highly-correlated metrics

*Keogh et al., Finding surprising patterns in a time series database in linear time and space, SIGKDD 2002
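
The “at least 6 of 9 statistics” vote can be sketched with a much simpler per-statistic test than Tarzan. The example below swaps in median absolute deviation across replicas (a different technique from the time-series analysis cited above) purely to show the voting structure; `anomalous_replicas`, `mad_cutoff`, and the sample data are assumptions.

```python
import statistics


def anomalous_replicas(stats_by_replica, mad_cutoff=3.0, min_stats=6):
    """stats_by_replica maps replica name -> list of per-second statistics.

    A statistic is "anomalous" for a replica when it lies more than mad_cutoff
    median absolute deviations from the median over all replicas.
    """
    names = list(stats_by_replica)
    n_stats = len(stats_by_replica[names[0]])
    votes = {name: 0 for name in names}
    for i in range(n_stats):
        column = [stats_by_replica[name][i] for name in names]
        med = statistics.median(column)
        mad = statistics.median([abs(v - med) for v in column])
        for name in names:
            deviation = abs(stats_by_replica[name][i] - med)
            if deviation > 0 and deviation > mad_cutoff * mad:
                votes[name] += 1
    # A replica is anomalous only if enough of its statistics disagree with peers.
    return [name for name in names if votes[name] >= min_stats]


replicas = {"r1": [10] * 9, "r2": [11] * 9, "r3": [9] * 9, "r4": [10] * 9,
            "r5": [50] * 7 + [10] * 2}
flagged = anomalous_replicas(replicas)
```

Because each replica is compared with its peers rather than with a fixed number, a workload change that scales every replica's statistics equally leaves the result unchanged.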

What faults does this handle?

Essentially 100% availability vs. injected faults:
  Node crash/hang/timeout/freeze
  Fail-stutter: Network loss (drop up to 70% of packets randomly)
  Periodic slowdown (e.g., from garbage collection)
  Persistent slowdown (one node lags the others)

Underlying (weak) assumption: “Most bricks are doing mostly the right thing most of the time”
  All anomalies can be safely “coerced” to crash faults
  If reboot doesn’t fix, it didn’t cost you much to try it
  Human notified after threshold number of restarts; system has no concept of “recovery”

Allows SSM to be managed like a farm of stateless servers
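
The reboot-then-escalate policy described above fits in a few lines. This is a hedged sketch, not SSM's code: `RestartPolicy`, the `reboot` and `notify_human` callbacks, the brick name, and the limit of three restarts are hypothetical.

```python
class RestartPolicy:
    """Reboot on any anomaly; escalate to a human after too many restarts."""

    def __init__(self, reboot, notify_human, max_restarts=3):
        self.reboot = reboot                # hypothetical callback: reboot(node)
        self.notify_human = notify_human    # hypothetical callback: notify(node)
        self.max_restarts = max_restarts
        self.restarts = {}                  # node -> restarts so far

    def on_anomaly(self, node):
        count = self.restarts.get(node, 0)
        if count >= self.max_restarts:
            self.notify_human(node)         # rebooting is evidently not helping
            return "escalated"
        self.restarts[node] = count + 1
        self.reboot(node)                   # cheap to try even if it may not fix
        return "rebooted"


rebooted, paged = [], []
policy = RestartPolicy(rebooted.append, paged.append, max_restarts=3)
outcomes = [policy.on_anomaly("brick7") for _ in range(4)]
```

Note that the policy keeps no “recovering” state for a node: every anomaly is handled as if the node had simply crashed.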

Detecting anomalies in application logic

Goal: detect failures whose only obvious symptom is change in semantics of application
  Example: wrong item data displayed; wouldn’t be caught by HTML scraping or HTTP logs
  Typically, site responds to HTTP pings, etc. under such failures
  These commonly result from exceptions of the form we injected into RUBiS

Insight: manifestation of bugs is the rare case, so capture “normal” behavior of system under no fault injection
  Then detect threshold deviations from this baseline
  Periodically move the baseline to allow for workload evolution

Patterns: Path shape analysis

[Figure labels: HTTP Frontends, Application Components, Middleware, Databases]

Model paths as parse trees in probabilistic CFG
  Build grammar under “believed normal” conditions, then mark very unlikely paths as anomalous
  after classification, build decision tree to correlate path features (components touched) with anomalous paths
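
To illustrate the probabilistic-CFG idea, the sketch below treats each request path as a tree of components, learns rule probabilities P(children | parent) from paths believed normal, and scores a new path by the sum of its log rule probabilities. It is a toy under stated assumptions, not Pinpoint's implementation: the tree encoding, the component names, the `floor` probability for unseen rules, and the 0.01 likelihood threshold are all invented for the example.

```python
import math
from collections import Counter, defaultdict

# A path is a tree: (component_name, [child_paths]). Names are made up.


def rules(path):
    """Yield one production (parent, tuple_of_children) per tree node."""
    name, children = path
    yield name, tuple(child[0] for child in children)
    for child in children:
        yield from rules(child)


def train(normal_paths):
    """Estimate P(children | parent) from paths believed to be normal."""
    counts = defaultdict(Counter)
    for path in normal_paths:
        for parent, kids in rules(path):
            counts[parent][kids] += 1
    return {p: {k: c / sum(cs.values()) for k, c in cs.items()}
            for p, cs in counts.items()}


def log_likelihood(grammar, path, floor=1e-6):
    """Sum of log rule probabilities; unseen rules get a small floor value."""
    return sum(math.log(grammar.get(p, {}).get(k, floor)) for p, k in rules(path))


ok = ("frontend", [("ViewItem", [("db", [])])])
grammar = train([ok] * 99 + [("frontend", [("Login", [("db", [])])])])
odd = ("frontend", [("ViewItem", [])])   # ViewItem never reached the database
is_anomalous = log_likelihood(grammar, odd) < math.log(0.01)
```

The odd path still returns through the frontend, so an HTTP-level monitor would see nothing wrong; only its shape is unlikely under the learned grammar.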

Patterns: Component Interaction Analysis

[Figure labels: HTTP Frontends, Application Components, Middleware, Databases]

Model interactions between a component and its n neighbors in the dynamic call graph as a weighted DAG
  compare to observed call graph using chi-squared goodness-of-fit
  can compare either across peers or against historical data
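
A small sketch of the chi-squared goodness-of-fit comparison for a single component, computed by hand so that no statistics library is required. The `chi_squared` helper, the component and edge names, and the call counts are hypothetical; edges that appear only in the observed data are ignored here, which a real monitor would need to handle.

```python
def chi_squared(expected_weights, observed_counts):
    """Pearson chi-squared statistic for one component's outgoing call edges.

    expected_weights: edge -> fraction of calls seen historically (sums to 1).
    observed_counts:  edge -> number of calls seen in the current window.
    """
    total = sum(observed_counts.values())
    stat = 0.0
    for edge, weight in expected_weights.items():
        expected = weight * total
        observed = observed_counts.get(edge, 0)
        stat += (observed - expected) ** 2 / expected
    return stat


# Historical behavior of a hypothetical "ViewItem" component.
history = {"ItemDB": 0.5, "Cache": 0.3, "Auth": 0.2}
same = chi_squared(history, {"ItemDB": 50, "Cache": 30, "Auth": 20})
shifted = chi_squared(history, {"ItemDB": 10, "Cache": 30, "Auth": 60})
CRITICAL_DF2_P05 = 5.991   # chi-squared critical value, 2 degrees of freedom, p = 0.05
anomalous = shifted > CRITICAL_DF2_P05
```

To compare across peers instead of against history, the expected weights would be taken from the pooled call counts of the component's replicas.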

Precision and recall (example)

Detection: Recall = % of failures actually detected as anomalies
  Strictly better than HTTP/HTML monitoring

Localization:
  recall = % actually-faulty requests returned
  precision = % requests returned that are faulty = 1-(FP rate)

Tradeoff between recall and precision (false positive rate)
  Even low-recall case corresponds to high detection recall (.83)

[Figure: “Localization: Recall vs. precision”, with annotated points [R]=.68 [P]=.14 and [R]=.34 [P]=.93]
[Figure: “Detection: recall, faults affecting >1% of workload”]
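
For reference, the localization definitions of precision and recall translate directly into set arithmetic over request IDs. The request IDs and counts below are invented for illustration and are not the measurements reported on the slide.

```python
def precision_recall(returned, faulty):
    """returned: request IDs flagged by the localizer; faulty: truly faulty IDs."""
    returned, faulty = set(returned), set(faulty)
    hits = len(returned & faulty)
    precision = hits / len(returned) if returned else 1.0
    recall = hits / len(faulty) if faulty else 1.0
    return precision, recall


faulty = set(range(100))                       # requests 0..99 really failed
# Aggressive setting: flags many requests, most of them healthy.
p_loose, r_loose = precision_recall(range(20, 420), faulty)
# Conservative setting: flags few requests, nearly all of them faulty.
p_tight, r_tight = precision_recall(list(range(30)) + [500, 501], faulty)
```

The two settings show the tradeoff: the aggressive one finds most faulty requests but buries them among healthy ones, while the conservative one is almost always right about the few requests it returns.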

Pinpoint key results

Detect 89-96% of injected failures, compared to 20-79% for HTML scraping and HTTP log monitoring

Limited success in detecting injected source bugs
  Example success: caught a bug that prevented shopping cart from iterating over its contents to display them, and correctly identified at-fault component (where bug was injected)

Resilient to “normal” workload changes
  Because we bin analysis by request category

Resilient to “bug fix release” code changes

Currently slow; analysis lags ~20s behind application

Combining uRB’s and Pinpoint

Simple recovery policy:
  uRB all components whose normalized anomaly score >1.0
  if we’ve already done that, reboot the whole application

More sophisticated policies certainly possible
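
The simple recovery policy can be written down as a single function. This is a sketch under assumptions, not the deployed code: `recover`, the `microreboot` and `reboot_app` callbacks, the `already_tried` bookkeeping, the `SB_login` component, and the anomaly scores are hypothetical.

```python
def recover(scores, already_tried, microreboot, reboot_app, threshold=1.0):
    """scores: component -> normalized anomaly score from the detector.

    already_tried: set of components microrebooted earlier in this incident.
    microreboot / reboot_app are hypothetical callbacks into the app server.
    """
    suspects = {c for c, s in scores.items() if s > threshold}
    if not suspects:
        return "none"
    if suspects <= already_tried:
        reboot_app()                  # microreboots did not cure it: escalate
        already_tried.clear()
        return "full-reboot"
    for component in sorted(suspects):
        microreboot(component)
    already_tried |= suspects
    return "microreboot"


log, tried = [], set()
scores = {"SB_viewItem": 2.4, "SB_login": 0.3}
first = recover(scores, tried, log.append, lambda: log.append("APP"))
second = recover(scores, tried, log.append, lambda: log.append("APP"))
```

If the same component is still anomalous on the second call, the policy falls back to rebooting the whole application, as in the second bullet above.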

Combining uRB’s and Pinpoint

Example: data structure corruption in SB_viewItem EJB
  350 simulated clients
  18.5s to detect/localize
  <1s to repair

Note, returned Web page would be valid but incorrect

Robust to typical workload changes & bug patches

More comprehensive deployment in progress

Faulty Request Identification

HTTP monitoring has perfect precision since it’s a “ground truth indicator” of a server fault

Path-shape analysis pulls more points out of the bottom left corner

[Figure: plot with regions labeled “Failures injected but not detected”; “Failures detected, faulty requests identified as such”; “Failures not detected, but low false positives”; “Failures detected, but high rate of mis-identification of faulty requests (false positive)”]

Tolerating false positives in DStore

Metrics and algorithm comparable to those used in SSM

We inject “fail-stutter” behavior by increasing request latency
  Bottom case: more aggressive detection also results in 2 “unnecessary” reboots
  But they don’t matter much if there is modest replication

Currently some voodoo constants for thresholds in both SSM and DStore
  Recall that these are “off-the-shelf” algorithms; should be able to do better

Trade-off: earlier detection vs. false positives

Summary of case studies

Table columns: Subsystem; Instrumentation; Microrecovery; Statistical monitoring; Performance cost

SSM (diskless session state store) [NSDI 04]
  Instrumentation: State and activity metric ‘sensors’ built into app
  Microrecovery: Whole-node fast reboot (doesn’t preserve state)
  Statistical monitoring: Tarzan time-series analysis; Median absolute deviation
  Performance cost: 20-50% request latency; still competitive with commercial service; <1% thruput reduction

DStore (persistent hashtable) [ACM Trans. on Storage]
  Microrecovery: Whole-node reboot (preserves state)

JAGR (J2EE application server) [OSDI 04]
  Instrumentation: Inter-EJB call info monitored by modifying container; Could also use aspects
  Microrecovery: Microreboots of EJB’s
  Statistical monitoring: Anomalous code paths modeled using PCFG; component interactions modeled by comparing dynamic call graphs
  Performance cost: ~1% on request latency and thruput

Detection and localization good even with “simple” algorithms; fits well with localized recovery

Performance penalty is tolerable & worth it

Note, microrecovery can also be used for microrejuvenation

Discussion

Discussion: What makes this work?

What made it work in our examples specifically?
  Recovery speed: Weaker consistency in SSM and DStore in exchange for fast recovery and predictable work done per request
  Recovery correctness: J2EE apps constrained to “checkpoint” by manipulating session state, and this is brought out in the app-writer-visible API’s; good isolation between components and relative lack of shared state
  Anomaly detection: app behavior alternates short sequences of EJB calls with updates to persistent state, so can be characterized in terms of those calls

Observations
  Neither diagnosis ⇒ recovery nor recovery ⇒ diagnosis
  Localization != diagnosis, but it’s an important optimization

Why are statistical methods appealing?

Large complex systems tend to exercise a lot of their functionality in a fairly short amount of time
  Especially Internet services, with high-volume workloads of largely independent requests

Even if we don’t know what to measure, statistical and data mining techniques can help figure it out

Performance problems are often linked with dependability problems (fail-stutter behavior), for either HW or SW reasons

Most systems work well most of the time
  Corollary: in a replica system, replicas should behave “the same” most of the time

When does it not work?

When SLT-based monitoring does not apply
  Base-rate fallacy: monitoring events so rare that FP rate dominates
  Gaming the system (deliberately or inadvertently)

When failures can’t be cured by any kind of micro-recovery
  Persistent-state corruption (or hardware failure)
  Corrupted configuration data
  “a spectrum of undo”

When you can’t say no
  Backpressure and possibility of caller-retry are used to improve predictability
  Promising you will say “yes” may be difficult...question may be whether end-to-end guarantees are needed at lower layers
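
The base-rate fallacy mentioned above is easy to see with Bayes' rule. The numbers in this sketch are illustrative assumptions, not measurements from any of the case studies.

```python
def prob_real_failure(base_rate, detection_rate, false_positive_rate):
    """Bayes' rule: P(failure | alarm)."""
    true_alarms = detection_rate * base_rate
    false_alarms = false_positive_rate * (1.0 - base_rate)
    return true_alarms / (true_alarms + false_alarms)


# Illustrative numbers only: 1 in 1000 intervals contains a real failure,
# the detector catches 99% of them and false-alarms on 1% of healthy intervals.
posterior = prob_real_failure(0.001, 0.99, 0.01)
```

With these inputs only about 9% of alarms correspond to real failures, even though the detector looks accurate on paper; this is why recovery needs to be cheap enough that acting on a false alarm does little harm.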

SSM/DStore as “extreme” design points

Goal was to investigate extremes of “no special recovery”

Could explore erasure coding (RepStore does this dynamically)

Weakened consistency model of DStore vs. 2PC
  Spread cost of repair lazily across many operations (rather than bulk recovery)
  Spread some 2PC state maintenance to client in the form of “write in progress” cookie
  May be that 2PC would be affordable, but we were interested in extreme design point of “no special restart code”

Role of 3-tier architecture

Separation of concerns: really, separation of process recovery (control flow) from data recovery

uRB and reboots recover processes; SSM, DStore, and traditional relational databases recover data

Not addressed is repair of data

Shouldn’t we just make software better?

Yes we should (and many people are), but...

We use commodity HW&SW, despite the fact that they are imperfect, less reliable than “hardened” or purpose-built components, etc. Why?
  Price/performance follows volume
  Allows specialization of efforts and composition of reusable building blocks (vs. building stovepipe system)
  In short, it allows much faster overall pace of innovation and deployment, for both technical and economic reasons, even though the components themselves are imperfect

We should assume “commodity programmers” too (observation from Brewster Kahle)
  Give as much generic support to application as we can

Challenges & open issues

Algorithm issues that impinge on systems work
  Hand-tuned constants/thresholds in algorithms--seems to be an issue in other applications of SLT as well
  Online vs. offline algorithms
  Stability of closed loop

Systems issues
  How do you “know” you’ve checkpointed all important state, or that something is safe to retry?
  How do you debug a “moving target”? Traditional methods/tools are confounded by code obfuscation, sudden loss of transient program state (stack & heap), etc. (a great PhD thesis...)
    debugging today’s real systems is already hard for these reasons
  Real apps, faultloads, best practices, etc. hard to get!

RADS message in a nutshell

Statistical techniques can identify “interesting” features and relationships from large datasets, but frequent tradeoff between detection rate (or detection time) and false positives

Make “micro-recovery” so inexpensive that occasional false positives don’t matter

Achievable now on realistic applications & workloads

Synergistic with componentized apps & frameworks

Specific point of leverage for collaboration with machine learning research; lots of headroom for improvement
  Even “simple” algorithms show encouraging initial results

Project possibilities

BACKUP SLIDES
