1 design and implementation of self- healing systems presented by rean griffith phd candidacy exam...
Post on 20-Dec-2015
224 views
TRANSCRIPT
11
Design and implementation of Self-Design and implementation of Self-Healing systemsHealing systems
Presented by Rean GriffithPresented by Rean GriffithPhD Candidacy ExamPhD Candidacy ExamNovember 23November 23rdrd, 2004, 2004
22
Self-Healing Systems Self-Healing Systems
A self healing system “…automatically A self healing system “…automatically detects, diagnoses and repairs localized detects, diagnoses and repairs localized software and hardware problems” – software and hardware problems” – The Vision The Vision of Autonomic Computing 2003 IEEE Computer Societyof Autonomic Computing 2003 IEEE Computer Society
33
Roots of Self-Healing Systems Roots of Self-Healing Systems AreaArea DescriptionDescription Foundation TechniquesFoundation TechniquesFault tolerant Fault tolerant computingcomputing
(’70’s) (’70’s)
Ability of a system to Ability of a system to respond gracefully to an respond gracefully to an unexpected hardware or unexpected hardware or software failuresoftware failure
Fault Model specification [5]Fault Model specification [5]
Fault Avoidance [1,3,5]Fault Avoidance [1,3,5]
Fault Detection [1,3,4,5], Fault Fault Detection [1,3,4,5], Fault Masking [3,4,5]Masking [3,4,5]
Dependable Dependable computing computing
(‘80’s)(‘80’s)
The trustworthiness of a The trustworthiness of a system that allows reliance system that allows reliance to be justifiably placed on to be justifiably placed on the service it deliversthe service it delivers
Focus on reliability and other Focus on reliability and other system qualities e.g. system qualities e.g. Availability[26], Integrity[23,25], Availability[26], Integrity[23,25], Degradation [7], MaintainabilityDegradation [7], Maintainability
Self-Healing Self-Healing systemssystems
(’90’s+)(’90’s+)
Automatically detects, Automatically detects, diagnoses and repairs diagnoses and repairs localized software and localized software and hardware problemshardware problems
Dependability, Modeling Dependability, Modeling [6,7,8], Monitoring [9], [6,7,8], Monitoring [9], Diagnosis [11], Planning [16], Diagnosis [11], Planning [16], Adaptation for Repair [13, 19, Adaptation for Repair [13, 19, 21] (System evolution)21] (System evolution)
44
OverviewOverview
Motivation for self-healing systemsMotivation for self-healing systemsBuilding a self-healing systemBuilding a self-healing systemEvaluating a self-healing system*Evaluating a self-healing system*
* - * - omitted from the presentation for the sake of brevity but will be discussed during Q & A as omitted from the presentation for the sake of brevity but will be discussed during Q & A as necessarynecessary
55
Motivation for self-healing systemsMotivation for self-healing systems
We are building increasingly complex (distributed) systemsWe are building increasingly complex (distributed) systems Harder to build [1,12]Harder to build [1,12] Systems must evolve over time to meet changing user needs [13,14]Systems must evolve over time to meet changing user needs [13,14] Systems still fail embarrassingly often [26]Systems still fail embarrassingly often [26] Complex and expensive to manage and maintain [12] – human in the Complex and expensive to manage and maintain [12] – human in the
loop a potential bottleneck in problem resolutionloop a potential bottleneck in problem resolutionComputer systems are pervading our everyday livesComputer systems are pervading our everyday lives
Increased dependency on systems e.g. financial networks [2]Increased dependency on systems e.g. financial networks [2] The more dependent users are on a system the less tolerant they are of The more dependent users are on a system the less tolerant they are of
interruptions [14]interruptions [14] The cost of downtime is prohibitive – not just monetary cost [18,26]The cost of downtime is prohibitive – not just monetary cost [18,26]
Can we remove the human operator and let the system manage and Can we remove the human operator and let the system manage and maintain itself?maintain itself?
Possible 24/7 monitoring and maintenancePossible 24/7 monitoring and maintenance CheaperCheaper Faster response and resolution than human administratorsFaster response and resolution than human administrators
66
Conceptual implementation of a Conceptual implementation of a self-healing systemself-healing system
77
Building blocksBuilding blocks
Fault model specification (Koopman[5])Fault model specification (Koopman[5])
Design techniquesDesign techniques Fault avoidance (Cheung[1])Fault avoidance (Cheung[1]) Incorporating repair via system modeling Incorporating repair via system modeling
(Koopman[7],Garlan[10])(Koopman[7],Garlan[10])
Implementation of system responses to faultsImplementation of system responses to faults Monitoring and Detection (Avizienis[4],Debusmann[9])Monitoring and Detection (Avizienis[4],Debusmann[9]) Analysis & Diagnosis (Stojanovic[12],Chen[11])Analysis & Diagnosis (Stojanovic[12],Chen[11]) Specific fault responses Specific fault responses
88
Fault Model SpecificationFault Model Specification
The fault model determines response strategy based on:The fault model determines response strategy based on: Fault durationFault duration
Permanent, intermittent, transientPermanent, intermittent, transient Fault manifestationFault manifestation
How will we detect?How will we detect?What happens if we do nothing?What happens if we do nothing?
Fault sourceFault sourceDesign defect, implementation defect, operational mistake, Design defect, implementation defect, operational mistake, malicious attack, “wear-out”/system agingmalicious attack, “wear-out”/system aging
Fault granularityFault granularityFault containment (component, subsystem, node, task)Fault containment (component, subsystem, node, task)
Fault profile expectationsFault profile expectationsAny symptoms that are definite markers of imminent failure?Any symptoms that are definite markers of imminent failure?Need to be robust for unexpected faultsNeed to be robust for unexpected faults
99
Incorporating repair via modelingIncorporating repair via modelingTechniqueTechnique When When
constructedconstructedRemarksRemarks
Architectural Architectural models [8]models [8]
Design timeDesign time Architectures encapsulate knowledge and Architectures encapsulate knowledge and provide a basis for principled adaptationprovide a basis for principled adaptation
Architectural Architectural reflection [6]reflection [6]
Design timeDesign time Reify architectural features as meta-objects Reify architectural features as meta-objects which can be observed and manipulated at which can be observed and manipulated at runtimeruntime
Utility Utility modeling modeling
(Degradation)(Degradation)
[7][7]
Design timeDesign time Use architecture to group components into Use architecture to group components into feature sets that enable reasoning about feature sets that enable reasoning about overall system utility. Gracefully degrading overall system utility. Gracefully degrading systems have positive utility when any systems have positive utility when any combination of components failcombination of components fail
Discovering Discovering architecturearchitecture
(Discotect) (Discotect) [10][10]
RuntimeRuntime Disconnect between design and Disconnect between design and implementationimplementation
Compliment design-time models by filling in Compliment design-time models by filling in gapsgaps
1010
Building blocksBuilding blocks
Fault model specification (Koopman[5])Fault model specification (Koopman[5])
Design techniquesDesign techniques Fault avoidance (Cheung[1])Fault avoidance (Cheung[1]) Incorporating repair via system modeling Incorporating repair via system modeling
(Koopman[7],Garlan[10])(Koopman[7],Garlan[10])
Implementation of system responses to faultsImplementation of system responses to faults Monitoring and Detection (Avizienis[4],Debusmann[9])Monitoring and Detection (Avizienis[4],Debusmann[9]) Analysis & Diagnosis (Stojanovic[12],Chen[11])Analysis & Diagnosis (Stojanovic[12],Chen[11]) Specific fault responses Specific fault responses
1111
Monitoring & DetectionMonitoring & Detection
DetectionDetection First step in responding to faults (Koopman[5])First step in responding to faults (Koopman[5]) Different granularitiesDifferent granularities
Internal to the component (Self-checking software[1])Internal to the component (Self-checking software[1])Supervisory checks (Recovery blocks [3])Supervisory checks (Recovery blocks [3])Comparisons with replicated components (N-version [4])Comparisons with replicated components (N-version [4])Instrumentation (Garlan[8],Debusmann[9],Valetto[21])Instrumentation (Garlan[8],Debusmann[9],Valetto[21])
Different levels of intrusiveness of detection [1]Different levels of intrusiveness of detection [1]Non-intrusive checking of resultsNon-intrusive checking of resultsExecution of audit/check tasksExecution of audit/check tasksRedundant task executionRedundant task executionOnline self-tests/periodic reboots for self-testsOnline self-tests/periodic reboots for self-testsFault injectionFault injection
1212
Building blocksBuilding blocks
Fault model specification (Koopman[5])Fault model specification (Koopman[5])
Design techniquesDesign techniques Fault avoidance (Cheung[1])Fault avoidance (Cheung[1]) Incorporating repair via system modeling Incorporating repair via system modeling
(Koopman[7],Garlan[10])(Koopman[7],Garlan[10])
Implementation of system responses to faultsImplementation of system responses to faults Monitoring and Detection (Avizienis[4],Debusmann[9])Monitoring and Detection (Avizienis[4],Debusmann[9]) Analysis & Diagnosis (Stojanovic[12],Chen[11])Analysis & Diagnosis (Stojanovic[12],Chen[11]) Specific fault responses Specific fault responses
1313
Analysis & DiagnosisAnalysis & Diagnosis
Analysis [12] for problem determinationAnalysis [12] for problem determination Data correlation and inference technologies to Data correlation and inference technologies to
support automated continuous system analysis over support automated continuous system analysis over monitoring datamonitoring data
We need reference points – models and/or knowledge We need reference points – models and/or knowledge repositories which we can use/reason over to repositories which we can use/reason over to determine whether a problem existsdetermine whether a problem exists
Diagnosing faults [11]Diagnosing faults [11] Locates source of a fault after detection e.g. Decision Locates source of a fault after detection e.g. Decision
trees [11]trees [11]Specificity down the offending line of code often not Specificity down the offending line of code often not necessarynecessary
Improve adaptation strategy selectionImprove adaptation strategy selection
1414
Building blocksBuilding blocks
Fault model specification (Koopman[5])Fault model specification (Koopman[5])
Design techniquesDesign techniques Fault avoidance (Cheung[1])Fault avoidance (Cheung[1]) Incorporating repair via system modeling Incorporating repair via system modeling
(Koopman[7],Garlan[10])(Koopman[7],Garlan[10])
Implementation of system responses to faultsImplementation of system responses to faults Monitoring and Detection (Avizienis[4],Debusmann[9])Monitoring and Detection (Avizienis[4],Debusmann[9]) Analysis & Diagnosis (Stojanovic[12],Chen[11])Analysis & Diagnosis (Stojanovic[12],Chen[11]) Specific fault responses Specific fault responses
1515
Specific fault responsesSpecific fault responses
Fault responseFault response ReactiveReactive
Degradation e.g. Killing less important tasks (Koopman[5,7])Degradation e.g. Killing less important tasks (Koopman[5,7])Repair Adaptations (Dynamic updates[18], Reconfigurations Repair Adaptations (Dynamic updates[18], Reconfigurations [2,14,20,23])[2,14,20,23])Roll forward with compensation (System-level undo [25])*Roll forward with compensation (System-level undo [25])*Functional alternatives [28]*Functional alternatives [28]*Requesting help from the outside (Koopman[5])Requesting help from the outside (Koopman[5])
ProactiveProactiveActions triggered by faults that are indicative of possible Actions triggered by faults that are indicative of possible near-term failure (Koopman[5])near-term failure (Koopman[5])
PreventativePreventativePeriodic micro-reboot, sub-system reboot, system reboot Periodic micro-reboot, sub-system reboot, system reboot (Software Rejuventation[24])*(Software Rejuventation[24])*
* - omitted from the presentation for the sake of brevity but will be discussed during Q & A * - omitted from the presentation for the sake of brevity but will be discussed during Q & A as necessaryas necessary
1616
DegradationDegradation
Degradation (Koopman[7])Degradation (Koopman[7]) The degree of degraded operation of a systemThe degree of degraded operation of a system
……is its resilience to damage that exceeds built in is its resilience to damage that exceeds built in redundancy (Koopman[5])redundancy (Koopman[5])
System might not be able to restore 100% System might not be able to restore 100% functionality after a faultfunctionality after a fault
Some systems must fail if they are not 100% Some systems must fail if they are not 100% functionalfunctional
Others can degrade performance, shed tasks or Others can degrade performance, shed tasks or failover to less computationally expensive degraded failover to less computationally expensive degraded mode algorithmsmode algorithms
1717
AdaptationAdaptation
Adaptation allows a system to respond to its Adaptation allows a system to respond to its environmentenvironmentSelf-healing systems are expected to be Self-healing systems are expected to be amenable to evolutionary change with minimal amenable to evolutionary change with minimal downtime (Kramer[13])downtime (Kramer[13]) Characteristic types of evolution (Oreizy[15])Characteristic types of evolution (Oreizy[15])
Corrective – removes software faultsCorrective – removes software faultsPerfective – enhances product functionalityPerfective – enhances product functionality
Self-healing systems can be:Self-healing systems can be: Closed adaptive (Oreizy[16]) – self containedClosed adaptive (Oreizy[16]) – self contained Open adaptive (Oreizy[16]) – amenable to adding Open adaptive (Oreizy[16]) – amenable to adding
new behaviorsnew behaviors
1818
Repair AdaptationsRepair Adaptations
Many kinds of repairsMany kinds of repairs ADT/Module replacement – Fabry[2]ADT/Module replacement – Fabry[2] Method replacement – Dymos [14]Method replacement – Dymos [14] Component replacement (Hot-swaps) DAS [14], K42 [20]Component replacement (Hot-swaps) DAS [14], K42 [20] Server replacement – Conic[13], MTS [14]Server replacement – Conic[13], MTS [14] Verifiable patches – Popcorn [18] (C-like language)Verifiable patches – Popcorn [18] (C-like language)
Risky changes to make in the regular update cycle of:Risky changes to make in the regular update cycle of: Stop systemStop system Apply updateApply update RestartRestart
Now we may have to perform repair as the system runsNow we may have to perform repair as the system runs Cost and risks will determine whether online repair is a viable Cost and risks will determine whether online repair is a viable
prospectprospect Can we make online repair less risky?Can we make online repair less risky?
1919
Adaptation techniquesAdaptation techniquesAdaptation Adaptation granularitygranularity
Language/Language/
compiler compiler supportsupport
IndirectionIndirection QuiescenceQuiescence State State transfertransfer
InterfaceInterface
Changes Changes supported?supported?
Verifiable/Verifiable/
provably provably safe?safe?
ADT/ADT/
ModuleModule
replacementreplacement
Fabry[2]Fabry[2] Fabry[2]Fabry[2] Fabry[2]Fabry[2]
MethodMethod
replacementreplacement
Dymos[14]Dymos[14] Dymos – Dymos – program program statestate
DymosDymos
[14][14]
Dymos[14]Dymos[14]
ComponentComponent
replacementreplacement
DAS[14]DAS[14]
K42[20]K42[20]
DAS[14]DAS[14]
K42[20]K42[20]
Zdonik[23]Zdonik[23]
ServerServer Conic[13]Conic[13] Conic[13]Conic[13]
ArbitraryArbitrary
patchespatches
Popcorn[18]Popcorn[18] Popcorn – Popcorn – programmer programmer hintshints
PopcornPopcorn
[18][18]
PopcornPopcorn
[18] – [18] – provably provably safe patchsafe patch
2020
Managing the risk of online repairManaging the risk of online repair
Key activities Key activities Change management [13]Change management [13]
A principled aspect of runtime system evolution A principled aspect of runtime system evolution that:that:
Helps identify what must be changedHelps identify what must be changed Provides context for reasoning about, specifying and Provides context for reasoning about, specifying and
implementing changeimplementing change Controls change to preserve system integrity as well as Controls change to preserve system integrity as well as
meeting the additional extra-functional requirements of meeting the additional extra-functional requirements of dependably systems e.g. availability, reliability etc.dependably systems e.g. availability, reliability etc.
2121
Motivation for Change Motivation for Change ManagementManagement
Why change management is necessaryWhy change management is necessary Need for novel approaches to maintenanceNeed for novel approaches to maintenance Principled approaches – Model-based healing [19] – Principled approaches – Model-based healing [19] –
supported by reusable infrastructure e.g. Rainbow [2], supported by reusable infrastructure e.g. Rainbow [2], Workflakes [21] preferred over ad hoc adaptation Workflakes [21] preferred over ad hoc adaptation mechanismsmechanisms
Objectives of change management (Kramer[13])Objectives of change management (Kramer[13]) Changes specified only in terms of system structure Changes specified only in terms of system structure
e.g. (Wermlinger[17])e.g. (Wermlinger[17]) Change specifications are independent of application Change specifications are independent of application
protocols, algorithms or statesprotocols, algorithms or states Changes leave the system in a consistent stateChanges leave the system in a consistent state Change should minimize disruption to systemChange should minimize disruption to system
2222
Externalized adaptation [19,21,22]Externalized adaptation [19,21,22]
One or more models of a system are One or more models of a system are maintained at runtime and external to the maintained at runtime and external to the application as a basis for identifying application as a basis for identifying problems and resolving themproblems and resolving them
Describe changes in terms of operations Describe changes in terms of operations on the modelon the model
Changes to the model effects changes to Changes to the model effects changes to the underlying systemthe underlying system
2323
Rainbow[22]Rainbow[22]
2424
Kinesthetics eXtreme (KX) Kinesthetics eXtreme (KX) featuring Workflakes[21]featuring Workflakes[21]
2525
ConclusionsConclusionsUsing models as the basis for guiding and Using models as the basis for guiding and validating repair and system integrity is a validating repair and system integrity is a promising principled approachpromising principled approachOntologies allow us to capture, reuse and Ontologies allow us to capture, reuse and extend domain specific knowledgeextend domain specific knowledgeMore flexible to construct the system such that More flexible to construct the system such that there is a separation between functional there is a separation between functional concerns and repair adaptation logic concerns and repair adaptation logic Change management is integral to minimizing Change management is integral to minimizing the risks of repairing a running systemthe risks of repairing a running systemWe need other “benchmarks” for evaluating self-We need other “benchmarks” for evaluating self-healing systems other than raw performancehealing systems other than raw performance
2626
ReferencesReferences1. Design of Self-Checking Software
S. S. Yau, R. C. Cheung Proceedings of the international conference on Reliable software pp 450 - 455 1975.
2. How to Design a System in which Modules can be changed on the flyR. S. Fabry International Conference on Software Engineering 1976.
3. System Structure for Software Fault Tolerance B. Randell Proceedings of the international conference on Reliable software pp 437 - 449 1975.
4. The N-Version Approach to Fault-Tolerant Software Algirdas Avizienis, Fellow, IEEE, IEEE Transactions on Software Engineering Vol. SE-11 No. 12 December 1985.
5. Elements of the Self-Healing Problem Space Philip Koopman ICSE Workshop on Architecting Dependable Systems 2003.
6. Architectural Reflection: Realising Software Architectures via Reflective ActivitiesFrancesco Tisato, Andrea Savigni, Walter Cazzola, Andrea Sosio Engineering Distributed Objects, Second International Workshop, EDO 2000, Davis, CA, USA, November 2-3, 2000, Revised Papers pp 102 - 115.
7. Using Architectural Properties to Model and Measure System-Wide Graceful DegradationC. Shelton and P. Koopman Accepted to the Workshop on Architecting Dependable Systems sponsored by the International Conference on Software Engineering (ICSE2002)
8. Exploiting Architectural Design Knowledge to Support Self-Repairing Systems Bradley Schmerl, David Garlan Proceedings of the 14th international conference on Software engineering and knowledge engineering Pages: 241 - 248 2002.
1. Efficient and transparent Instrumentation of Application Components using an Aspect-oriented Approach M. Debusmann and Kurt Geihs 14th IFIP/IEEE Workshop on Distributed Systems: Operations and Management (DSOM 2003) pp 209--220.
2. Discotect: A System for Discovering Architectures from Running Systems Hong Yan, David Garlan, Bradley Schmerl, Jonathan Aldrich, Rick Kazman International Conference on Software Engineering archive Proceedings of the 26th International Conference on Software Engineering - Volume 00 Pages: 470 - 479 2004.
3. Failure Diagnosis using Decision Trees Mike Chen, Alice X. Zheng, Jim Lloyd, Michael I. Jordan, Eric A. Brewer 1st International Conference on Autonomic Computing (ICAC 2004), 17-19 May 2004, New York, NY, USA pp36 - 43.
2727
ReferencesReferences12. The role of ontologies in autonomic computing systems
L. Stojanovic, J. Schneider, A. Maedche, S. Libischer, R. Studer, Th. Lumpp, A. Abecker, G. Breiter, and J. Dinger IBM Systems Journal Volume 43, Number 3, 2004 Unstructured Information Management.
13. The Evolving Philosophers Problem: Dynamic Change ManagementJeff Kramer, Jeff Magee IEEE Transactions on Software Engineering November 1990 Vol. 16 - No. 11 pp 1293 - 1306.
14. On-The-Fly Program Modification: Systems For Dynamic UpdatingMark E. Segal, Ophir FriederIEEE Software Volume 10, Number 2, March 1993 pp 53 - 65.
15. Architecture-Based Runtime Software EvolutionPeyman Oreizy, Nenad Medovic, Richard N. Taylor In Proceedings of the International Conference on Software Engineering 1998 (ICSE'98), pages 177--186, April 1998.
• An Architecture-Based Approach to Self-Adaptive SoftwarePeyman Oreizy, Michael M. Gorlick, Richard N. Taylor, Dennis Heimbigner, Gregory Johnson, Nenad Medvidovic,Alex Quilici, David S. Rosenblum, Alexander L. WolfIEEE Intelligent Systems archive Volume 14, Issue 3 (May 1999) pp 54 - 62.
17. A Graph Based Architectural (Re)configuration LanguageMichael Wermlinger, Antonia Lopes, Jose Luiz FiadeiroProceedings of the 8th European Software Engineering Conference held jointly with 9th ACM SIGSOFT InternationalSymposium on Foundations of Software Engineering 2001, Vienna, Austria, September 10-14, 2001 pp 21 - 32.
18. Dynamic Software UpdatingMichael W. Hicks, Jonathan T. Moore, Scott NettlesProceedings of the 2001 ACM SIGPLAN Conference on Programming Language Design and Implementation(PLDI), Snowbird, Utah, USA, June 20-22, 2001. SIGPLAN Notices 36(5) (May 2001), ACM, 2001, ISBN 1-58113-414-2 pp 13 - 23.
19. Model-based Adaptation for Self-Healing SystemsDavid Garlan, Bradley R. SchmerlProceedings of the First Workshop on Self-Healing Systems, WOSS 2002, Charleston, South Carolina, USA,November 18-19, 2002 pp 27 - 32.
20. System Support for Online ReconfigurationC. A. N. Soules, J. Appavoo, K. Hui, D. D. Silva, G. R. Ganger, O. Krieger, M. Stumm, R. W. Wisniewski, M.Auslander, M. Ostrowski, B. Rosenburg, and J. XenidisIn Proceedings of the Usenix Technical Conference, San Antonio, TX, USA, June 2003.
21. Using Process Technology to Control and Coordinate Software AdaptationGiuseppe Valetto, Gail KaiserInternational Conference on Software Engineering archive Proceedings of the 25th international conference onSoftware engineering Portland, Oregon Pages: 262 - 272 2003.
2828
ReferencesReferences22. Rainbow: Architecture-based Self-adaptation with Reusable Infrastructure
Shang-Wen Cheng, An-Cheng Huang, David Garlan, Bradley R. Schmerl, Peter Steenkiste 1st International Conference on Autonomic Computing (ICAC 2004), 17-19 May 2004, New York, NY, USA.23. Maintaining Consistency in a Database with Changing Types Stanley B. Zdonik Proceedings of the 1986 SIGPLAN workshop on Object-oriented programming Yorktown Heights,
New York, United States Pages: 120 - 127 1986.24. Software Rejuventation: Analysis, Module and Applications
Yennun Huang, Chandra Kintala, Nick Kolettis, N. Dudley Fulton Proceedings of the 25th International Symposium on Fault-Tolerant Computing (FTCS-25), Pasadena, CA, pp. June
1995, pp. 381-390. 25. Rewind, Repair, Replay: Three R's to Dependability
A. Brown and D. A. Patterson In 10th ACM SIGOPS European Workshop, 2002.
26. Improving Availability with Recursive Micro-Reboots: A Soft-State System Case Study George Candea, James Cutler, Armando Fox Dependable systems and networks-performance and dependability symposium (DSN-PDS) 2002: Selected papers
pp 213 - 248. 27. Predicting Problems Caused by Component Upgrades
Stephen McCamant and Michael D. Ernst In 10th European Software Engineering Conference and the 11th ACM SIGSOFT Symposium on the Foundations of
Software Engineering, pages 287--296, Helsinki, Finland, September 2003. 28. Improving System Dependability with Functional Alternatives
Charles P. Shelton and Philip Koopman Proceedings of the 2004 International Conference on Dependable Systems and Networks (DSN'04) - Volume 00 Page: 295.
2929
Backup slidesBackup slides
3030
Component definitionsComponent definitions
ComponentComponent Software element that conforms to a component Software element that conforms to a component
model and can be independently deployed and model and can be independently deployed and composed without modification according to a composed without modification according to a composition standardcomposition standard
Component modelComponent model Defines specific interaction and composition Defines specific interaction and composition
standardsstandards
Component model implementationComponent model implementation Dedicated set of executable software elements Dedicated set of executable software elements
required to support the execution of components that required to support the execution of components that conform to that modelconform to that model
3131
Efficient and transparent instrumentation of Efficient and transparent instrumentation of application components using an aspect-application components using an aspect-
oriented approachoriented approachOverview of AOPOverview of AOP
Developed at Xerox PARC in the early 90’s (Gregor Kiczales et. Al)Developed at Xerox PARC in the early 90’s (Gregor Kiczales et. Al) Modularize cross-cutting concerns (non-functional requirements) and Modularize cross-cutting concerns (non-functional requirements) and
separate them from functional requirementsseparate them from functional requirements Some termsSome terms
Joinpoint – well defined point in execution flowJoinpoint – well defined point in execution flowPointcut – identifies a joinpoint e.g. using a regexPointcut – identifies a joinpoint e.g. using a regexAdvice – defines additional code to be executed at a joinpoint (before, Advice – defines additional code to be executed at a joinpoint (before, around, after)around, after)Introductions – modify classes, add members, change class relationshipsIntroductions – modify classes, add members, change class relationships
Aspects and application code can be developed separately and weaved Aspects and application code can be developed separately and weaved togethertogether
Compile time weaving – AspectJCompile time weaving – AspectJRuntime weaving – JBoss bytecode weavingRuntime weaving – JBoss bytecode weavingHybridHybrid
3232
Rainbow – architecture-based self-Rainbow – architecture-based self-adaptation with reusable infrastructureadaptation with reusable infrastructure
DefinitionsDefinitions ArchitectureArchitecture
An abstract view of the system as a composition of computational An abstract view of the system as a composition of computational elements and their interconnections [Shaw & Garlan]elements and their interconnections [Shaw & Garlan]
Architectural modelArchitectural modelRepresents the system architecture as a graph of interacting components Represents the system architecture as a graph of interacting components [popular format – ACME, xADL, SADL][popular format – ACME, xADL, SADL]Nodes in the graph are termed components – principal computation Nodes in the graph are termed components – principal computation elements and data stores of the systemelements and data stores of the systemArcs are termed connectors and represent the pathways of interactionArcs are termed connectors and represent the pathways of interaction
Architectural styleArchitectural styleStyle specifies a vocabulary of component and connector types and the Style specifies a vocabulary of component and connector types and the rules of system composition, essentially capturing the units of change rules of system composition, essentially capturing the units of change from system to systemfrom system to systemCharacterizes a family of systems related by shared structural and Characterizes a family of systems related by shared structural and semantic properties {Types, Rules, Properties, Analyses}semantic properties {Types, Rules, Properties, Analyses}
Architecture-based self-adaptationArchitecture-based self-adaptationThe use of architectural models to reason about system behavior and The use of architectural models to reason about system behavior and guide adaptationguide adaptation
3333
System Validation/AssuranceSystem Validation/Assurance
Why should users trust a self-healing system to Why should users trust a self-healing system to do the “right” thing?do the “right” thing? The system must be able to give guarantees about its The system must be able to give guarantees about its
qualitative characteristics before, during and after qualitative characteristics before, during and after adaptationsadaptations
Availability, reliability, performance etc.Availability, reliability, performance etc. What kind of guarantees can the system give?What kind of guarantees can the system give?
AbsoluteAbsoluteProbabilisticProbabilisticPartial assurance of a few key system propertiesPartial assurance of a few key system propertiesAssurance during degraded mode operationsAssurance during degraded mode operationsAssurance between fault occurrence time and detection timeAssurance between fault occurrence time and detection time
Not all faults are detectableNot all faults are detectable
3434
System Validation/AssuranceSystem Validation/Assurance
Improving dependabilityImproving dependability Software rejuvenation [24]Software rejuvenation [24]
Addresses faults caused by software aging e.g. resource Addresses faults caused by software aging e.g. resource leaksleaks
System-level undo [25]System-level undo [25]Addresses operator-induced/compounded failuresAddresses operator-induced/compounded failuresRewind – revert state to an earlier timeRewind – revert state to an earlier timeRepair – operator makes any desired changes to the systemRepair – operator makes any desired changes to the systemReplay – roll forward with the “new” systemReplay – roll forward with the “new” system
Functional alternatives rather than redundancy [28]Functional alternatives rather than redundancy [28]Exploit a system’s available features to provide some system Exploit a system’s available features to provide some system redundancy without additional redundant componentsredundancy without additional redundant componentsLeverages components that satisfy overlapping system Leverages components that satisfy overlapping system objectivesobjectives
3535
Rewind, repair, replay: three r’s to Rewind, repair, replay: three r’s to dependabilitydependability
MechanismMechanism How is this different from related work?How is this different from related work?
Backup/restore/checkpointing schemes offer rewind/repair Backup/restore/checkpointing schemes offer rewind/repair but deny the ability to roll forward once changes have been but deny the ability to roll forward once changes have been mademade
Recovery systems for databases use rewind/replay to Recovery systems for databases use rewind/replay to recover from crashes, deadlocks and other fatal events but recover from crashes, deadlocks and other fatal events but do not offer the ability to inject repair into the recovery do not offer the ability to inject repair into the recovery cyclecycle
The standard transaction model does not allow committed The standard transaction model does not allow committed transactions to be altered or removed however extended transactions to be altered or removed however extended transaction models have introduced the notion of a transaction models have introduced the notion of a compensating transactioncompensating transaction
3636
Rewind, repair, replay: three r’s to Rewind, repair, replay: three r’s to dependabilitydependability
MechanismMechanism Tracking recoverable state for replayTracking recoverable state for replay
When an undo is carried out it is the responsibility of the When an undo is carried out it is the responsibility of the replay step to restore all the state changes that are replay step to restore all the state changes that are important to the userimportant to the user
Defining exactly what state this encompasses is tricky e.g. Defining exactly what state this encompasses is tricky e.g. an upgrade of a mail server could change the underlying an upgrade of a mail server could change the underlying format of the mailbox fileformat of the mailbox file
Solution: Track user actions at the intentional level e.g. in Solution: Track user actions at the intentional level e.g. in an undoable email system a user’s act of deleting a an undoable email system a user’s act of deleting a message is recorded as “delete message with ID x” rather message is recorded as “delete message with ID x” rather than “alter byte range x – y of file z”than “alter byte range x – y of file z”
3737
Rewind, repair, replay: three r’s to Rewind, repair, replay: three r’s to dependabilitydependability
MechanismMechanism How does intentional tracking workHow does intentional tracking work
In the network service environment being In the network service environment being targeted, users interact with the system through targeted, users interact with the system through standardized application protocols (usually standardized application protocols (usually comprised of a few action verbs e.g. HTTP PUT, comprised of a few action verbs e.g. HTTP PUT, GET, POST).GET, POST).
Intercept and record user interactions at the Intercept and record user interactions at the protocol levelprotocol level
3838
Rewind, repair, replay: three r’s to Rewind, repair, replay: three r’s to dependabilitydependability
MechanismMechanism Are repairs also tracked?Are repairs also tracked?
No, because of practical reasonsNo, because of practical reasons User interactions are limited by the verbs-action User interactions are limited by the verbs-action
commands permitted by the protocol, however repair commands permitted by the protocol, however repair actions are limited only by the ingenuity of the human actions are limited only by the ingenuity of the human operatoroperator
3939
Rewind, repair, replay: three r’s to Rewind, repair, replay: three r’s to dependabilitydependability
MechanismMechanism ArchitectureArchitecture
Basic structure of an undoable systemBasic structure of an undoable system A verb-based application interface that uses logical A verb-based application interface that uses logical
state names. This allows the undo system to properly state names. This allows the undo system to properly disambiguate recoverable user state changes from disambiguate recoverable user state changes from other system eventsother system events
The verb based interface should cover all interactions The verb based interface should cover all interactions with the system that affect user-visible state, including with the system that affect user-visible state, including operator interfaces that are used to create, move, operator interfaces that are used to create, move, delete and modify user state repositoriesdelete and modify user state repositories
If existing protocols do not satisfy these requirements If existing protocols do not satisfy these requirements then new protocols can be created and layered above then new protocols can be created and layered above existing interfacesexisting interfaces
4040
System Validation/AssuranceSystem Validation/Assurance
Improving availabilityImproving availability Recursive micro-reboots [26]Recursive micro-reboots [26]
Consider failures as facts to be coped withConsider failures as facts to be coped withUnequivocally returns the recovered system to its start state – best Unequivocally returns the recovered system to its start state – best understood/tested stateunderstood/tested stateHigh confidence way to reclaim resourcesHigh confidence way to reclaim resourcesEasy technique to understand and employ, automate, debug, implementEasy technique to understand and employ, automate, debug, implement
Crash-only componentsCrash-only componentsOne way to shutdown – crashOne way to shutdown – crashOne way to start up – recoverOne way to start up – recoverAll persistent state kept in crash-only data storesAll persistent state kept in crash-only data storesStrong modularity, timeout based communication, lease-based resource Strong modularity, timeout based communication, lease-based resource allocationallocation
Improving integrityImproving integrity Managing multiple versions via version handlers [23]Managing multiple versions via version handlers [23] Predicting problems caused by component upgrades [27]Predicting problems caused by component upgrades [27] Model constraints used in change management [13,19]Model constraints used in change management [13,19]
4141
Improving availability with recursive Improving availability with recursive micro-rebootsmicro-reboots
MechanismMechanism Recursive recovery and micro-rebootsRecursive recovery and micro-reboots
A recursive approach to recoveryA recursive approach to recovery A minimal subset of components is recoveredA minimal subset of components is recovered If that fails progressively larger subsets are recoveredIf that fails progressively larger subsets are recovered
A micro-reboot is a low-cost form of reboot applied at the level of A micro-reboot is a low-cost form of reboot applied at the level of fine-grained software componentsfine-grained software componentsFor a system to be recursively recoverable it must contain fine-For a system to be recursively recoverable it must contain fine-grained components that are independently recoverable such that grained components that are independently recoverable such that part of the system can be recovered without touching the restpart of the system can be recovered without touching the restComponents must be loosely coupled AND prepared to be denied Components must be loosely coupled AND prepared to be denied service from other components that may be in the process of service from other components that may be in the process of micro-rebootingmicro-rebooting
Popularity of loosely-coupled componentized systems (EJB,.NET) Popularity of loosely-coupled componentized systems (EJB,.NET) A recursively recoverable system gracefully tolerates successive A recursively recoverable system gracefully tolerates successive restarts at multiple levels in the sense that it does not lose or restarts at multiple levels in the sense that it does not lose or corrupt data or cause other systems to crashcorrupt data or cause other systems to crash
4242
Repair adaptation backup slidesRepair adaptation backup slides
4343
ADT/Module replacementADT/Module replacement
Changing modules on-the-fly [2]Changing modules on-the-fly [2] Simple case – local data onlySimple case – local data only
Indirection mechanismIndirection mechanismOld and new version instances available side-by-side for a short timeOld and new version instances available side-by-side for a short timeOld version services existing requests, new requests are redirected Old version services existing requests, new requests are redirected transparently to the new versiontransparently to the new version
General case – Shared/persistent dataGeneral case – Shared/persistent dataNew version of a module modifies representation of a shared data structure New version of a module modifies representation of a shared data structure on first encounteron first encounterVersion numbers signal whether to modifyVersion numbers signal whether to modifyModifications must be delayed until old versions are completely finished with Modifications must be delayed until old versions are completely finished with shared instancesshared instances
Other requirementsOther requirements Correctness of modifications should be analyzed/provenCorrectness of modifications should be analyzed/proven Access control to limit who can change indirection mechanisms and Access control to limit who can change indirection mechanisms and
initiate module swappinginitiate module swapping
4444
Method replacementMethod replacement
Dynamic Modification System (Dymos) [14]Dynamic Modification System (Dymos) [14] Supports changing a procedure’s interface between versionsSupports changing a procedure’s interface between versions Supports the implementation of static data local to a procedureSupports the implementation of static data local to a procedure Dymos is completely integrated – source code, object code, Dymos is completely integrated – source code, object code,
compilation artifacts (parse trees, ASTs and symbol tables) are compilation artifacts (parse trees, ASTs and symbol tables) are available at all timesavailable at all times
Supports the specification of the modules to update and the Supports the specification of the modules to update and the conditions that must be satisfiedconditions that must be satisfied
Tightly tied to the StarMod language, which was modified to Tightly tied to the StarMod language, which was modified to support dynamic program updatingsupport dynamic program updating
Tools must be on the host system to manipulate the compilation Tools must be on the host system to manipulate the compilation artifactsartifacts
Assumes the source code is always availableAssumes the source code is always available No support for distributed systemsNo support for distributed systems
4545
Component replacementComponent replacement
Dynamically Alterable System (DAS) operating Dynamically Alterable System (DAS) operating system [14]system [14] Supports re-plugging – swapping modules with the Supports re-plugging – swapping modules with the
same interfacesame interface Supports restructuring of data within modulesSupports restructuring of data within modules Does not handle interface changesDoes not handle interface changes
K42 operating system [20]K42 operating system [20] Two basic mechanisms for online reconfigurationTwo basic mechanisms for online reconfiguration
InterpositionInterposition
Hot-swappingHot-swapping
4646
Server replacementServer replacement
Conic [13]Conic [13] Handle configuration at the configuration levelHandle configuration at the configuration level
Components and connectors (architecture)Components and connectors (architecture) Specify change declaratively in terms of system structure onlySpecify change declaratively in terms of system structure only System modification via change transactions generated from change System modification via change transactions generated from change
specificationsspecifications Use change specifications to scope changeUse change specifications to scope change Do not force change rather wait for nodes to quiesce. Nodes remain quiescent Do not force change rather wait for nodes to quiesce. Nodes remain quiescent
during changeduring change For change as a result of failure nodes would need to include recovery actionsFor change as a result of failure nodes would need to include recovery actions
Michigan Terminal System (MTS) [14]Michigan Terminal System (MTS) [14] Assumes interface remains constant between versionsAssumes interface remains constant between versions Coarse granularity for replacements – large system components e.g. a command Coarse granularity for replacements – large system components e.g. a command
interpreterinterpreter Does not address cases where a new service calls some old serviceDoes not address cases where a new service calls some old service Data formats cannot change between versionsData formats cannot change between versions Services disabled for the duration of the upgradeServices disabled for the duration of the upgrade
4747
Verifiable patchesVerifiable patches
Popcorn [18]Popcorn [18] Code and data patchesCode and data patches Build dynamic patches on top of dynamic linking mechanismsBuild dynamic patches on top of dynamic linking mechanisms Program re-linking after loading patchProgram re-linking after loading patch
Programmer specifies when to apply patchesProgrammer specifies when to apply patches Semi-automatic patch generation based on source comparisonSemi-automatic patch generation based on source comparison
Identify changes to functions and dataIdentify changes to functions and dataGenerate stub functions and state transformersGenerate stub functions and state transformers
Native code used in dynamic patches is coupled with Native code used in dynamic patches is coupled with annotations such that the code is provably safe. A well-formed annotations such that the code is provably safe. A well-formed Typed Assembly Language (TAL) program isTyped Assembly Language (TAL) program is
Memory-safe – no pointer forgingMemory-safe – no pointer forgingControl-flow safe – no jumps to arbitrary memory locationsControl-flow safe – no jumps to arbitrary memory locationsStack-safe – no modification of non-local stack framesStack-safe – no modification of non-local stack frames
System includes a TAL verifier and a prototype compiler from a System includes a TAL verifier and a prototype compiler from a safe-C language, Popcorn, to TALsafe-C language, Popcorn, to TAL