1 design and implementation of self- healing systems presented by rean griffith phd candidacy exam...

11

Design and implementation of Self-Design and implementation of Self-Healing systemsHealing systems

Presented by Rean GriffithPresented by Rean GriffithPhD Candidacy ExamPhD Candidacy ExamNovember 23November 23rdrd, 2004, 2004

22

Self-Healing Systems Self-Healing Systems

A self healing system “…automatically A self healing system “…automatically detects, diagnoses and repairs localized detects, diagnoses and repairs localized software and hardware problems” – software and hardware problems” – The Vision The Vision of Autonomic Computing 2003 IEEE Computer Societyof Autonomic Computing 2003 IEEE Computer Society

33

Roots of Self-Healing Systems Roots of Self-Healing Systems AreaArea DescriptionDescription Foundation TechniquesFoundation TechniquesFault tolerant Fault tolerant computingcomputing

(’70’s) (’70’s)

Ability of a system to Ability of a system to respond gracefully to an respond gracefully to an unexpected hardware or unexpected hardware or software failuresoftware failure

Fault Model specification [5]Fault Model specification [5]

Fault Avoidance [1,3,5]Fault Avoidance [1,3,5]

Fault Detection [1,3,4,5], Fault Fault Detection [1,3,4,5], Fault Masking [3,4,5]Masking [3,4,5]

Dependable Dependable computing computing

(‘80’s)(‘80’s)

The trustworthiness of a The trustworthiness of a system that allows reliance system that allows reliance to be justifiably placed on to be justifiably placed on the service it deliversthe service it delivers

Focus on reliability and other Focus on reliability and other system qualities e.g. system qualities e.g. Availability[26], Integrity[23,25], Availability[26], Integrity[23,25], Degradation [7], MaintainabilityDegradation [7], Maintainability

Self-Healing Self-Healing systemssystems

(’90’s+)(’90’s+)

Automatically detects, Automatically detects, diagnoses and repairs diagnoses and repairs localized software and localized software and hardware problemshardware problems

Dependability, Modeling Dependability, Modeling [6,7,8], Monitoring [9], [6,7,8], Monitoring [9], Diagnosis [11], Planning [16], Diagnosis [11], Planning [16], Adaptation for Repair [13, 19, Adaptation for Repair [13, 19, 21] (System evolution)21] (System evolution)

44

OverviewOverview

Motivation for self-healing systemsMotivation for self-healing systemsBuilding a self-healing systemBuilding a self-healing systemEvaluating a self-healing system*Evaluating a self-healing system*

* - * - omitted from the presentation for the sake of brevity but will be discussed during Q & A as omitted from the presentation for the sake of brevity but will be discussed during Q & A as necessarynecessary

55

Motivation for self-healing systemsMotivation for self-healing systems

We are building increasingly complex (distributed) systemsWe are building increasingly complex (distributed) systems Harder to build [1,12]Harder to build [1,12] Systems must evolve over time to meet changing user needs [13,14]Systems must evolve over time to meet changing user needs [13,14] Systems still fail embarrassingly often [26]Systems still fail embarrassingly often [26] Complex and expensive to manage and maintain [12] – human in the Complex and expensive to manage and maintain [12] – human in the

loop a potential bottleneck in problem resolutionloop a potential bottleneck in problem resolutionComputer systems are pervading our everyday livesComputer systems are pervading our everyday lives

Increased dependency on systems e.g. financial networks [2]Increased dependency on systems e.g. financial networks [2] The more dependent users are on a system the less tolerant they are of The more dependent users are on a system the less tolerant they are of

interruptions [14]interruptions [14] The cost of downtime is prohibitive – not just monetary cost [18,26]The cost of downtime is prohibitive – not just monetary cost [18,26]

Can we remove the human operator and let the system manage and Can we remove the human operator and let the system manage and maintain itself?maintain itself?

Possible 24/7 monitoring and maintenancePossible 24/7 monitoring and maintenance CheaperCheaper Faster response and resolution than human administratorsFaster response and resolution than human administrators

66

Conceptual implementation of a Conceptual implementation of a self-healing systemself-healing system

77

Building blocksBuilding blocks

Fault model specification (Koopman[5])Fault model specification (Koopman[5])

Design techniquesDesign techniques Fault avoidance (Cheung[1])Fault avoidance (Cheung[1]) Incorporating repair via system modeling Incorporating repair via system modeling

(Koopman[7],Garlan[10])(Koopman[7],Garlan[10])

Implementation of system responses to faultsImplementation of system responses to faults Monitoring and Detection (Avizienis[4],Debusmann[9])Monitoring and Detection (Avizienis[4],Debusmann[9]) Analysis & Diagnosis (Stojanovic[12],Chen[11])Analysis & Diagnosis (Stojanovic[12],Chen[11]) Specific fault responses Specific fault responses

88

Fault Model SpecificationFault Model Specification

The fault model determines response strategy based on:The fault model determines response strategy based on: Fault durationFault duration

Permanent, intermittent, transientPermanent, intermittent, transient Fault manifestationFault manifestation

How will we detect?How will we detect?What happens if we do nothing?What happens if we do nothing?

Fault sourceFault sourceDesign defect, implementation defect, operational mistake, Design defect, implementation defect, operational mistake, malicious attack, “wear-out”/system agingmalicious attack, “wear-out”/system aging

Fault granularityFault granularityFault containment (component, subsystem, node, task)Fault containment (component, subsystem, node, task)

Fault profile expectationsFault profile expectationsAny symptoms that are definite markers of imminent failure?Any symptoms that are definite markers of imminent failure?Need to be robust for unexpected faultsNeed to be robust for unexpected faults

99

Incorporating repair via modelingIncorporating repair via modelingTechniqueTechnique When When

constructedconstructedRemarksRemarks

Architectural Architectural models [8]models [8]

Design timeDesign time Architectures encapsulate knowledge and Architectures encapsulate knowledge and provide a basis for principled adaptationprovide a basis for principled adaptation

Architectural Architectural reflection [6]reflection [6]

Design timeDesign time Reify architectural features as meta-objects Reify architectural features as meta-objects which can be observed and manipulated at which can be observed and manipulated at runtimeruntime

Utility Utility modeling modeling

(Degradation)(Degradation)

[7][7]

Design timeDesign time Use architecture to group components into Use architecture to group components into feature sets that enable reasoning about feature sets that enable reasoning about overall system utility. Gracefully degrading overall system utility. Gracefully degrading systems have positive utility when any systems have positive utility when any combination of components failcombination of components fail

Discovering Discovering architecturearchitecture

(Discotect) (Discotect) [10][10]

RuntimeRuntime Disconnect between design and Disconnect between design and implementationimplementation

Compliment design-time models by filling in Compliment design-time models by filling in gapsgaps

1010






1111

Monitoring & DetectionMonitoring & Detection

DetectionDetection First step in responding to faults (Koopman[5])First step in responding to faults (Koopman[5]) Different granularitiesDifferent granularities

Internal to the component (Self-checking software[1])Internal to the component (Self-checking software[1])Supervisory checks (Recovery blocks [3])Supervisory checks (Recovery blocks [3])Comparisons with replicated components (N-version [4])Comparisons with replicated components (N-version [4])Instrumentation (Garlan[8],Debusmann[9],Valetto[21])Instrumentation (Garlan[8],Debusmann[9],Valetto[21])

Different levels of intrusiveness of detection [1]Different levels of intrusiveness of detection [1]Non-intrusive checking of resultsNon-intrusive checking of resultsExecution of audit/check tasksExecution of audit/check tasksRedundant task executionRedundant task executionOnline self-tests/periodic reboots for self-testsOnline self-tests/periodic reboots for self-testsFault injectionFault injection

1212






1313

Analysis & DiagnosisAnalysis & Diagnosis

Analysis [12] for problem determinationAnalysis [12] for problem determination Data correlation and inference technologies to Data correlation and inference technologies to

support automated continuous system analysis over support automated continuous system analysis over monitoring datamonitoring data

We need reference points – models and/or knowledge We need reference points – models and/or knowledge repositories which we can use/reason over to repositories which we can use/reason over to determine whether a problem existsdetermine whether a problem exists

Diagnosing faults [11]Diagnosing faults [11] Locates source of a fault after detection e.g. Decision Locates source of a fault after detection e.g. Decision

trees [11]trees [11]Specificity down the offending line of code often not Specificity down the offending line of code often not necessarynecessary

Improve adaptation strategy selectionImprove adaptation strategy selection

1414






1515

Specific fault responsesSpecific fault responses

Fault responseFault response ReactiveReactive

Degradation e.g. Killing less important tasks (Koopman[5,7])Degradation e.g. Killing less important tasks (Koopman[5,7])Repair Adaptations (Dynamic updates[18], Reconfigurations Repair Adaptations (Dynamic updates[18], Reconfigurations [2,14,20,23])[2,14,20,23])Roll forward with compensation (System-level undo [25])*Roll forward with compensation (System-level undo [25])*Functional alternatives [28]*Functional alternatives [28]*Requesting help from the outside (Koopman[5])Requesting help from the outside (Koopman[5])

ProactiveProactiveActions triggered by faults that are indicative of possible Actions triggered by faults that are indicative of possible near-term failure (Koopman[5])near-term failure (Koopman[5])

PreventativePreventativePeriodic micro-reboot, sub-system reboot, system reboot Periodic micro-reboot, sub-system reboot, system reboot (Software Rejuventation[24])*(Software Rejuventation[24])*

* - omitted from the presentation for the sake of brevity but will be discussed during Q & A * - omitted from the presentation for the sake of brevity but will be discussed during Q & A as necessaryas necessary

1616

DegradationDegradation

Degradation (Koopman[7])Degradation (Koopman[7]) The degree of degraded operation of a systemThe degree of degraded operation of a system

……is its resilience to damage that exceeds built in is its resilience to damage that exceeds built in redundancy (Koopman[5])redundancy (Koopman[5])

System might not be able to restore 100% System might not be able to restore 100% functionality after a faultfunctionality after a fault

Some systems must fail if they are not 100% Some systems must fail if they are not 100% functionalfunctional

Others can degrade performance, shed tasks or Others can degrade performance, shed tasks or failover to less computationally expensive degraded failover to less computationally expensive degraded mode algorithmsmode algorithms

1717

AdaptationAdaptation

Adaptation allows a system to respond to its Adaptation allows a system to respond to its environmentenvironmentSelf-healing systems are expected to be Self-healing systems are expected to be amenable to evolutionary change with minimal amenable to evolutionary change with minimal downtime (Kramer[13])downtime (Kramer[13]) Characteristic types of evolution (Oreizy[15])Characteristic types of evolution (Oreizy[15])

Corrective – removes software faultsCorrective – removes software faultsPerfective – enhances product functionalityPerfective – enhances product functionality

Self-healing systems can be:Self-healing systems can be: Closed adaptive (Oreizy[16]) – self containedClosed adaptive (Oreizy[16]) – self contained Open adaptive (Oreizy[16]) – amenable to adding Open adaptive (Oreizy[16]) – amenable to adding

new behaviorsnew behaviors

1818

Repair AdaptationsRepair Adaptations

Many kinds of repairsMany kinds of repairs ADT/Module replacement – Fabry[2]ADT/Module replacement – Fabry[2] Method replacement – Dymos [14]Method replacement – Dymos [14] Component replacement (Hot-swaps) DAS [14], K42 [20]Component replacement (Hot-swaps) DAS [14], K42 [20] Server replacement – Conic[13], MTS [14]Server replacement – Conic[13], MTS [14] Verifiable patches – Popcorn [18] (C-like language)Verifiable patches – Popcorn [18] (C-like language)

Risky changes to make in the regular update cycle of:Risky changes to make in the regular update cycle of: Stop systemStop system Apply updateApply update RestartRestart

Now we may have to perform repair as the system runsNow we may have to perform repair as the system runs Cost and risks will determine whether online repair is a viable Cost and risks will determine whether online repair is a viable

prospectprospect Can we make online repair less risky?Can we make online repair less risky?

1919

Adaptation techniquesAdaptation techniquesAdaptation Adaptation granularitygranularity

Language/Language/

compiler compiler supportsupport

IndirectionIndirection QuiescenceQuiescence State State transfertransfer

InterfaceInterface

Changes Changes supported?supported?

Verifiable/Verifiable/

provably provably safe?safe?

ADT/ADT/

ModuleModule

replacementreplacement

Fabry[2]Fabry[2] Fabry[2]Fabry[2] Fabry[2]Fabry[2]

MethodMethod


Dymos[14]Dymos[14] Dymos – Dymos – program program statestate

DymosDymos

[14][14]

Dymos[14]Dymos[14]

ComponentComponent


DAS[14]DAS[14]

K42[20]K42[20]

DAS[14]DAS[14]

K42[20]K42[20]

Zdonik[23]Zdonik[23]

ServerServer Conic[13]Conic[13] Conic[13]Conic[13]

ArbitraryArbitrary

patchespatches

Popcorn[18]Popcorn[18] Popcorn – Popcorn – programmer programmer hintshints

PopcornPopcorn

[18][18]

PopcornPopcorn

[18] – [18] – provably provably safe patchsafe patch

2020

Managing the risk of online repairManaging the risk of online repair

Key activities Key activities Change management [13]Change management [13]

A principled aspect of runtime system evolution A principled aspect of runtime system evolution that:that:

Helps identify what must be changedHelps identify what must be changed Provides context for reasoning about, specifying and Provides context for reasoning about, specifying and

implementing changeimplementing change Controls change to preserve system integrity as well as Controls change to preserve system integrity as well as

meeting the additional extra-functional requirements of meeting the additional extra-functional requirements of dependably systems e.g. availability, reliability etc.dependably systems e.g. availability, reliability etc.

2121

Motivation for Change Motivation for Change ManagementManagement

Why change management is necessaryWhy change management is necessary Need for novel approaches to maintenanceNeed for novel approaches to maintenance Principled approaches – Model-based healing [19] – Principled approaches – Model-based healing [19] –

supported by reusable infrastructure e.g. Rainbow [2], supported by reusable infrastructure e.g. Rainbow [2], Workflakes [21] preferred over ad hoc adaptation Workflakes [21] preferred over ad hoc adaptation mechanismsmechanisms

Objectives of change management (Kramer[13])Objectives of change management (Kramer[13]) Changes specified only in terms of system structure Changes specified only in terms of system structure

e.g. (Wermlinger[17])e.g. (Wermlinger[17]) Change specifications are independent of application Change specifications are independent of application

protocols, algorithms or statesprotocols, algorithms or states Changes leave the system in a consistent stateChanges leave the system in a consistent state Change should minimize disruption to systemChange should minimize disruption to system

2222

Externalized adaptation [19,21,22]Externalized adaptation [19,21,22]

One or more models of a system are One or more models of a system are maintained at runtime and external to the maintained at runtime and external to the application as a basis for identifying application as a basis for identifying problems and resolving themproblems and resolving them

Describe changes in terms of operations Describe changes in terms of operations on the modelon the model

Changes to the model effects changes to Changes to the model effects changes to the underlying systemthe underlying system

2323

Rainbow[22]Rainbow[22]

2424

Kinesthetics eXtreme (KX) Kinesthetics eXtreme (KX) featuring Workflakes[21]featuring Workflakes[21]

2525

ConclusionsConclusionsUsing models as the basis for guiding and Using models as the basis for guiding and validating repair and system integrity is a validating repair and system integrity is a promising principled approachpromising principled approachOntologies allow us to capture, reuse and Ontologies allow us to capture, reuse and extend domain specific knowledgeextend domain specific knowledgeMore flexible to construct the system such that More flexible to construct the system such that there is a separation between functional there is a separation between functional concerns and repair adaptation logic concerns and repair adaptation logic Change management is integral to minimizing Change management is integral to minimizing the risks of repairing a running systemthe risks of repairing a running systemWe need other “benchmarks” for evaluating self-We need other “benchmarks” for evaluating self-healing systems other than raw performancehealing systems other than raw performance

2626

ReferencesReferences1. Design of Self-Checking Software

S. S. Yau, R. C. Cheung Proceedings of the international conference on Reliable software pp 450 - 455 1975.

2. How to Design a System in which Modules can be changed on the flyR. S. Fabry International Conference on Software Engineering 1976.

3. System Structure for Software Fault Tolerance B. Randell Proceedings of the international conference on Reliable software pp 437 - 449 1975.

4. The N-Version Approach to Fault-Tolerant Software Algirdas Avizienis, Fellow, IEEE, IEEE Transactions on Software Engineering Vol. SE-11 No. 12 December 1985.

5. Elements of the Self-Healing Problem Space Philip Koopman ICSE Workshop on Architecting Dependable Systems 2003.

6. Architectural Reflection: Realising Software Architectures via Reflective ActivitiesFrancesco Tisato, Andrea Savigni, Walter Cazzola, Andrea Sosio Engineering Distributed Objects, Second International Workshop, EDO 2000, Davis, CA, USA, November 2-3, 2000, Revised Papers pp 102 - 115.

7. Using Architectural Properties to Model and Measure System-Wide Graceful DegradationC. Shelton and P. Koopman Accepted to the Workshop on Architecting Dependable Systems sponsored by the International Conference on Software Engineering (ICSE2002)

8. Exploiting Architectural Design Knowledge to Support Self-Repairing Systems Bradley Schmerl, David Garlan Proceedings of the 14th international conference on Software engineering and knowledge engineering Pages: 241 - 248 2002.

1. Efficient and transparent Instrumentation of Application Components using an Aspect-oriented Approach M. Debusmann and Kurt Geihs 14th IFIP/IEEE Workshop on Distributed Systems: Operations and Management (DSOM 2003) pp 209--220.

2. Discotect: A System for Discovering Architectures from Running Systems Hong Yan, David Garlan, Bradley Schmerl, Jonathan Aldrich, Rick Kazman International Conference on Software Engineering archive Proceedings of the 26th International Conference on Software Engineering - Volume 00 Pages: 470 - 479 2004.

3. Failure Diagnosis using Decision Trees Mike Chen, Alice X. Zheng, Jim Lloyd, Michael I. Jordan, Eric A. Brewer 1st International Conference on Autonomic Computing (ICAC 2004), 17-19 May 2004, New York, NY, USA pp36 - 43.

2727

ReferencesReferences12. The role of ontologies in autonomic computing systems

L. Stojanovic, J. Schneider, A. Maedche, S. Libischer, R. Studer, Th. Lumpp, A. Abecker, G. Breiter, and J. Dinger IBM Systems Journal Volume 43, Number 3, 2004 Unstructured Information Management.

13. The Evolving Philosophers Problem: Dynamic Change ManagementJeff Kramer, Jeff Magee IEEE Transactions on Software Engineering November 1990 Vol. 16 - No. 11 pp 1293 - 1306.

14. On-The-Fly Program Modification: Systems For Dynamic UpdatingMark E. Segal, Ophir FriederIEEE Software Volume 10, Number 2, March 1993 pp 53 - 65.

15. Architecture-Based Runtime Software EvolutionPeyman Oreizy, Nenad Medovic, Richard N. Taylor In Proceedings of the International Conference on Software Engineering 1998 (ICSE'98), pages 177--186, April 1998.

• An Architecture-Based Approach to Self-Adaptive SoftwarePeyman Oreizy, Michael M. Gorlick, Richard N. Taylor, Dennis Heimbigner, Gregory Johnson, Nenad Medvidovic,Alex Quilici, David S. Rosenblum, Alexander L. WolfIEEE Intelligent Systems archive Volume 14, Issue 3 (May 1999) pp 54 - 62.

17. A Graph Based Architectural (Re)configuration LanguageMichael Wermlinger, Antonia Lopes, Jose Luiz FiadeiroProceedings of the 8th European Software Engineering Conference held jointly with 9th ACM SIGSOFT InternationalSymposium on Foundations of Software Engineering 2001, Vienna, Austria, September 10-14, 2001 pp 21 - 32.

18. Dynamic Software UpdatingMichael W. Hicks, Jonathan T. Moore, Scott NettlesProceedings of the 2001 ACM SIGPLAN Conference on Programming Language Design and Implementation(PLDI), Snowbird, Utah, USA, June 20-22, 2001. SIGPLAN Notices 36(5) (May 2001), ACM, 2001, ISBN 1-58113-414-2 pp 13 - 23.

19. Model-based Adaptation for Self-Healing SystemsDavid Garlan, Bradley R. SchmerlProceedings of the First Workshop on Self-Healing Systems, WOSS 2002, Charleston, South Carolina, USA,November 18-19, 2002 pp 27 - 32.

20. System Support for Online ReconfigurationC. A. N. Soules, J. Appavoo, K. Hui, D. D. Silva, G. R. Ganger, O. Krieger, M. Stumm, R. W. Wisniewski, M.Auslander, M. Ostrowski, B. Rosenburg, and J. XenidisIn Proceedings of the Usenix Technical Conference, San Antonio, TX, USA, June 2003.

21. Using Process Technology to Control and Coordinate Software AdaptationGiuseppe Valetto, Gail KaiserInternational Conference on Software Engineering archive Proceedings of the 25th international conference onSoftware engineering Portland, Oregon Pages: 262 - 272 2003.

2828

ReferencesReferences22. Rainbow: Architecture-based Self-adaptation with Reusable Infrastructure

Shang-Wen Cheng, An-Cheng Huang, David Garlan, Bradley R. Schmerl, Peter Steenkiste 1st International Conference on Autonomic Computing (ICAC 2004), 17-19 May 2004, New York, NY, USA.23. Maintaining Consistency in a Database with Changing Types Stanley B. Zdonik Proceedings of the 1986 SIGPLAN workshop on Object-oriented programming Yorktown Heights,

New York, United States Pages: 120 - 127 1986.24. Software Rejuventation: Analysis, Module and Applications

Yennun Huang, Chandra Kintala, Nick Kolettis, N. Dudley Fulton Proceedings of the 25th International Symposium on Fault-Tolerant Computing (FTCS-25), Pasadena, CA, pp. June

1995, pp. 381-390. 25. Rewind, Repair, Replay: Three R's to Dependability

A. Brown and D. A. Patterson In 10th ACM SIGOPS European Workshop, 2002.

26. Improving Availability with Recursive Micro-Reboots: A Soft-State System Case Study George Candea, James Cutler, Armando Fox Dependable systems and networks-performance and dependability symposium (DSN-PDS) 2002: Selected papers

pp 213 - 248. 27. Predicting Problems Caused by Component Upgrades

Stephen McCamant and Michael D. Ernst In 10th European Software Engineering Conference and the 11th ACM SIGSOFT Symposium on the Foundations of

Software Engineering, pages 287--296, Helsinki, Finland, September 2003. 28. Improving System Dependability with Functional Alternatives

Charles P. Shelton and Philip Koopman Proceedings of the 2004 International Conference on Dependable Systems and Networks (DSN'04) - Volume 00 Page: 295.

2929

Backup slidesBackup slides

3030

Component definitionsComponent definitions

ComponentComponent Software element that conforms to a component Software element that conforms to a component

model and can be independently deployed and model and can be independently deployed and composed without modification according to a composed without modification according to a composition standardcomposition standard

Component modelComponent model Defines specific interaction and composition Defines specific interaction and composition

standardsstandards

Component model implementationComponent model implementation Dedicated set of executable software elements Dedicated set of executable software elements

required to support the execution of components that required to support the execution of components that conform to that modelconform to that model

3131

Efficient and transparent instrumentation of Efficient and transparent instrumentation of application components using an aspect-application components using an aspect-

oriented approachoriented approachOverview of AOPOverview of AOP

Developed at Xerox PARC in the early 90’s (Gregor Kiczales et. Al)Developed at Xerox PARC in the early 90’s (Gregor Kiczales et. Al) Modularize cross-cutting concerns (non-functional requirements) and Modularize cross-cutting concerns (non-functional requirements) and

separate them from functional requirementsseparate them from functional requirements Some termsSome terms

Joinpoint – well defined point in execution flowJoinpoint – well defined point in execution flowPointcut – identifies a joinpoint e.g. using a regexPointcut – identifies a joinpoint e.g. using a regexAdvice – defines additional code to be executed at a joinpoint (before, Advice – defines additional code to be executed at a joinpoint (before, around, after)around, after)Introductions – modify classes, add members, change class relationshipsIntroductions – modify classes, add members, change class relationships

Aspects and application code can be developed separately and weaved Aspects and application code can be developed separately and weaved togethertogether

Compile time weaving – AspectJCompile time weaving – AspectJRuntime weaving – JBoss bytecode weavingRuntime weaving – JBoss bytecode weavingHybridHybrid

3232

Rainbow – architecture-based self-Rainbow – architecture-based self-adaptation with reusable infrastructureadaptation with reusable infrastructure

DefinitionsDefinitions ArchitectureArchitecture

An abstract view of the system as a composition of computational An abstract view of the system as a composition of computational elements and their interconnections [Shaw & Garlan]elements and their interconnections [Shaw & Garlan]

Architectural modelArchitectural modelRepresents the system architecture as a graph of interacting components Represents the system architecture as a graph of interacting components [popular format – ACME, xADL, SADL][popular format – ACME, xADL, SADL]Nodes in the graph are termed components – principal computation Nodes in the graph are termed components – principal computation elements and data stores of the systemelements and data stores of the systemArcs are termed connectors and represent the pathways of interactionArcs are termed connectors and represent the pathways of interaction

Architectural styleArchitectural styleStyle specifies a vocabulary of component and connector types and the Style specifies a vocabulary of component and connector types and the rules of system composition, essentially capturing the units of change rules of system composition, essentially capturing the units of change from system to systemfrom system to systemCharacterizes a family of systems related by shared structural and Characterizes a family of systems related by shared structural and semantic properties {Types, Rules, Properties, Analyses}semantic properties {Types, Rules, Properties, Analyses}

Architecture-based self-adaptationArchitecture-based self-adaptationThe use of architectural models to reason about system behavior and The use of architectural models to reason about system behavior and guide adaptationguide adaptation

3333

System Validation/AssuranceSystem Validation/Assurance

Why should users trust a self-healing system to Why should users trust a self-healing system to do the “right” thing?do the “right” thing? The system must be able to give guarantees about its The system must be able to give guarantees about its

qualitative characteristics before, during and after qualitative characteristics before, during and after adaptationsadaptations

Availability, reliability, performance etc.Availability, reliability, performance etc. What kind of guarantees can the system give?What kind of guarantees can the system give?

AbsoluteAbsoluteProbabilisticProbabilisticPartial assurance of a few key system propertiesPartial assurance of a few key system propertiesAssurance during degraded mode operationsAssurance during degraded mode operationsAssurance between fault occurrence time and detection timeAssurance between fault occurrence time and detection time

Not all faults are detectableNot all faults are detectable

3434


Improving dependabilityImproving dependability Software rejuvenation [24]Software rejuvenation [24]

Addresses faults caused by software aging e.g. resource Addresses faults caused by software aging e.g. resource leaksleaks

System-level undo [25]System-level undo [25]Addresses operator-induced/compounded failuresAddresses operator-induced/compounded failuresRewind – revert state to an earlier timeRewind – revert state to an earlier timeRepair – operator makes any desired changes to the systemRepair – operator makes any desired changes to the systemReplay – roll forward with the “new” systemReplay – roll forward with the “new” system

Functional alternatives rather than redundancy [28]Functional alternatives rather than redundancy [28]Exploit a system’s available features to provide some system Exploit a system’s available features to provide some system redundancy without additional redundant componentsredundancy without additional redundant componentsLeverages components that satisfy overlapping system Leverages components that satisfy overlapping system objectivesobjectives

3535

Rewind, repair, replay: three r’s to Rewind, repair, replay: three r’s to dependabilitydependability

MechanismMechanism How is this different from related work?How is this different from related work?

Backup/restore/checkpointing schemes offer rewind/repair Backup/restore/checkpointing schemes offer rewind/repair but deny the ability to roll forward once changes have been but deny the ability to roll forward once changes have been mademade

Recovery systems for databases use rewind/replay to Recovery systems for databases use rewind/replay to recover from crashes, deadlocks and other fatal events but recover from crashes, deadlocks and other fatal events but do not offer the ability to inject repair into the recovery do not offer the ability to inject repair into the recovery cyclecycle

The standard transaction model does not allow committed The standard transaction model does not allow committed transactions to be altered or removed however extended transactions to be altered or removed however extended transaction models have introduced the notion of a transaction models have introduced the notion of a compensating transactioncompensating transaction

3636


MechanismMechanism Tracking recoverable state for replayTracking recoverable state for replay

When an undo is carried out it is the responsibility of the When an undo is carried out it is the responsibility of the replay step to restore all the state changes that are replay step to restore all the state changes that are important to the userimportant to the user

Defining exactly what state this encompasses is tricky e.g. Defining exactly what state this encompasses is tricky e.g. an upgrade of a mail server could change the underlying an upgrade of a mail server could change the underlying format of the mailbox fileformat of the mailbox file

Solution: Track user actions at the intentional level e.g. in Solution: Track user actions at the intentional level e.g. in an undoable email system a user’s act of deleting a an undoable email system a user’s act of deleting a message is recorded as “delete message with ID x” rather message is recorded as “delete message with ID x” rather than “alter byte range x – y of file z”than “alter byte range x – y of file z”

3737


MechanismMechanism How does intentional tracking workHow does intentional tracking work

In the network service environment being In the network service environment being targeted, users interact with the system through targeted, users interact with the system through standardized application protocols (usually standardized application protocols (usually comprised of a few action verbs e.g. HTTP PUT, comprised of a few action verbs e.g. HTTP PUT, GET, POST).GET, POST).

Intercept and record user interactions at the Intercept and record user interactions at the protocol levelprotocol level

3838


MechanismMechanism Are repairs also tracked?Are repairs also tracked?

No, because of practical reasonsNo, because of practical reasons User interactions are limited by the verbs-action User interactions are limited by the verbs-action

commands permitted by the protocol, however repair commands permitted by the protocol, however repair actions are limited only by the ingenuity of the human actions are limited only by the ingenuity of the human operatoroperator

3939


MechanismMechanism ArchitectureArchitecture

Basic structure of an undoable systemBasic structure of an undoable system A verb-based application interface that uses logical A verb-based application interface that uses logical

state names. This allows the undo system to properly state names. This allows the undo system to properly disambiguate recoverable user state changes from disambiguate recoverable user state changes from other system eventsother system events

The verb based interface should cover all interactions The verb based interface should cover all interactions with the system that affect user-visible state, including with the system that affect user-visible state, including operator interfaces that are used to create, move, operator interfaces that are used to create, move, delete and modify user state repositoriesdelete and modify user state repositories

If existing protocols do not satisfy these requirements If existing protocols do not satisfy these requirements then new protocols can be created and layered above then new protocols can be created and layered above existing interfacesexisting interfaces

4040


Improving availabilityImproving availability Recursive micro-reboots [26]Recursive micro-reboots [26]

Consider failures as facts to be coped withConsider failures as facts to be coped withUnequivocally returns the recovered system to its start state – best Unequivocally returns the recovered system to its start state – best understood/tested stateunderstood/tested stateHigh confidence way to reclaim resourcesHigh confidence way to reclaim resourcesEasy technique to understand and employ, automate, debug, implementEasy technique to understand and employ, automate, debug, implement

Crash-only componentsCrash-only componentsOne way to shutdown – crashOne way to shutdown – crashOne way to start up – recoverOne way to start up – recoverAll persistent state kept in crash-only data storesAll persistent state kept in crash-only data storesStrong modularity, timeout based communication, lease-based resource Strong modularity, timeout based communication, lease-based resource allocationallocation

Improving integrityImproving integrity Managing multiple versions via version handlers [23]Managing multiple versions via version handlers [23] Predicting problems caused by component upgrades [27]Predicting problems caused by component upgrades [27] Model constraints used in change management [13,19]Model constraints used in change management [13,19]

4141

Improving availability with recursive Improving availability with recursive micro-rebootsmicro-reboots

MechanismMechanism Recursive recovery and micro-rebootsRecursive recovery and micro-reboots

A recursive approach to recoveryA recursive approach to recovery A minimal subset of components is recoveredA minimal subset of components is recovered If that fails progressively larger subsets are recoveredIf that fails progressively larger subsets are recovered

A micro-reboot is a low-cost form of reboot applied at the level of A micro-reboot is a low-cost form of reboot applied at the level of fine-grained software componentsfine-grained software componentsFor a system to be recursively recoverable it must contain fine-For a system to be recursively recoverable it must contain fine-grained components that are independently recoverable such that grained components that are independently recoverable such that part of the system can be recovered without touching the restpart of the system can be recovered without touching the restComponents must be loosely coupled AND prepared to be denied Components must be loosely coupled AND prepared to be denied service from other components that may be in the process of service from other components that may be in the process of micro-rebootingmicro-rebooting

Popularity of loosely-coupled componentized systems (EJB,.NET) Popularity of loosely-coupled componentized systems (EJB,.NET) A recursively recoverable system gracefully tolerates successive A recursively recoverable system gracefully tolerates successive restarts at multiple levels in the sense that it does not lose or restarts at multiple levels in the sense that it does not lose or corrupt data or cause other systems to crashcorrupt data or cause other systems to crash

4242

Repair adaptation backup slidesRepair adaptation backup slides

4343

ADT/Module replacementADT/Module replacement

Changing modules on-the-fly [2]Changing modules on-the-fly [2] Simple case – local data onlySimple case – local data only

Indirection mechanismIndirection mechanismOld and new version instances available side-by-side for a short timeOld and new version instances available side-by-side for a short timeOld version services existing requests, new requests are redirected Old version services existing requests, new requests are redirected transparently to the new versiontransparently to the new version

General case – Shared/persistent dataGeneral case – Shared/persistent dataNew version of a module modifies representation of a shared data structure New version of a module modifies representation of a shared data structure on first encounteron first encounterVersion numbers signal whether to modifyVersion numbers signal whether to modifyModifications must be delayed until old versions are completely finished with Modifications must be delayed until old versions are completely finished with shared instancesshared instances

Other requirementsOther requirements Correctness of modifications should be analyzed/provenCorrectness of modifications should be analyzed/proven Access control to limit who can change indirection mechanisms and Access control to limit who can change indirection mechanisms and

initiate module swappinginitiate module swapping

4444

Method replacementMethod replacement

Dynamic Modification System (Dymos) [14]Dynamic Modification System (Dymos) [14] Supports changing a procedure’s interface between versionsSupports changing a procedure’s interface between versions Supports the implementation of static data local to a procedureSupports the implementation of static data local to a procedure Dymos is completely integrated – source code, object code, Dymos is completely integrated – source code, object code,

compilation artifacts (parse trees, ASTs and symbol tables) are compilation artifacts (parse trees, ASTs and symbol tables) are available at all timesavailable at all times

Supports the specification of the modules to update and the Supports the specification of the modules to update and the conditions that must be satisfiedconditions that must be satisfied

Tightly tied to the StarMod language, which was modified to Tightly tied to the StarMod language, which was modified to support dynamic program updatingsupport dynamic program updating

Tools must be on the host system to manipulate the compilation Tools must be on the host system to manipulate the compilation artifactsartifacts

Assumes the source code is always availableAssumes the source code is always available No support for distributed systemsNo support for distributed systems

4545

Component replacementComponent replacement

Dynamically Alterable System (DAS) operating Dynamically Alterable System (DAS) operating system [14]system [14] Supports re-plugging – swapping modules with the Supports re-plugging – swapping modules with the

same interfacesame interface Supports restructuring of data within modulesSupports restructuring of data within modules Does not handle interface changesDoes not handle interface changes

K42 operating system [20]K42 operating system [20] Two basic mechanisms for online reconfigurationTwo basic mechanisms for online reconfiguration

InterpositionInterposition

Hot-swappingHot-swapping

4646

Server replacementServer replacement

Conic [13]Conic [13] Handle configuration at the configuration levelHandle configuration at the configuration level

Components and connectors (architecture)Components and connectors (architecture) Specify change declaratively in terms of system structure onlySpecify change declaratively in terms of system structure only System modification via change transactions generated from change System modification via change transactions generated from change

specificationsspecifications Use change specifications to scope changeUse change specifications to scope change Do not force change rather wait for nodes to quiesce. Nodes remain quiescent Do not force change rather wait for nodes to quiesce. Nodes remain quiescent

during changeduring change For change as a result of failure nodes would need to include recovery actionsFor change as a result of failure nodes would need to include recovery actions

Michigan Terminal System (MTS) [14]Michigan Terminal System (MTS) [14] Assumes interface remains constant between versionsAssumes interface remains constant between versions Coarse granularity for replacements – large system components e.g. a command Coarse granularity for replacements – large system components e.g. a command

interpreterinterpreter Does not address cases where a new service calls some old serviceDoes not address cases where a new service calls some old service Data formats cannot change between versionsData formats cannot change between versions Services disabled for the duration of the upgradeServices disabled for the duration of the upgrade

4747

Verifiable patchesVerifiable patches

Popcorn [18]Popcorn [18] Code and data patchesCode and data patches Build dynamic patches on top of dynamic linking mechanismsBuild dynamic patches on top of dynamic linking mechanisms Program re-linking after loading patchProgram re-linking after loading patch

Programmer specifies when to apply patchesProgrammer specifies when to apply patches Semi-automatic patch generation based on source comparisonSemi-automatic patch generation based on source comparison

Identify changes to functions and dataIdentify changes to functions and dataGenerate stub functions and state transformersGenerate stub functions and state transformers

Native code used in dynamic patches is coupled with Native code used in dynamic patches is coupled with annotations such that the code is provably safe. A well-formed annotations such that the code is provably safe. A well-formed Typed Assembly Language (TAL) program isTyped Assembly Language (TAL) program is

Memory-safe – no pointer forgingMemory-safe – no pointer forgingControl-flow safe – no jumps to arbitrary memory locationsControl-flow safe – no jumps to arbitrary memory locationsStack-safe – no modification of non-local stack framesStack-safe – no modification of non-local stack frames

System includes a TAL verifier and a prototype compiler from a System includes a TAL verifier and a prototype compiler from a safe-C language, Popcorn, to TALsafe-C language, Popcorn, to TAL

1 design and implementation of self- healing systems presented by rean griffith phd candidacy exam...

Documents

selfhealing system slide

self healing system

system qualities

system evolution slide

system modeling koopman7

fault detection

fault masking

necessary slide