aspira dependability prediction with ultrasan · 2000-10-08 · aspira dependability prediction...
TRANSCRIPT
1 10-4-00
Aspira Dependability Prediction with UltraSAN
Aspira Systems Engineering
Bryce Kuhlman
Steve Beaudet
2 10-4-00
Vision Statement
A standardized method of system dependability modeling that streamlines communication between designers and dependability analysts and provides a framework for the development of 99.999% available systems.
3 10-4-00
Goals
• Modeling used as a tool to develop detailed understanding of system dependability performance
• Analysis throughout the entire product life cycle.
• Streamlined communication between product engineers, 3rd-party vendors, and dependability analysts.
• One model, many measures (with distributions).
• Timely modeling and analysis
– Low cycle time for new models and analysis (usually less than one week, depending on the product complexity and familiarity).
– Models evolve with the design; more detail is added as it becomes available.
– Trade studies defined by availability team and system designers drive design for High Availability.
4 10-4-00
Modeling Process
• Two levels of modeling: description and computation
• Dependability Description Model (DDM)
– Explanation of how the system works in an Availability sense.
– Standardized framework based on existing methodologies in dependability analysis and system design which can be easily understood by all.
– Provides detailed system descriptions.
– Designers become an integral part of system dependability evaluation.
– Focus is on description, not evaluation, and therefore circumvents the need for expertise in particular modeling techniques (Markov Process, SPN, SAN, etc.)
– To achieve 99.999% availability, all possible sources of service interruption must be addressed.
• Dependability Computation Model (DCM)
– Calculation of measures defined in DDM
– UltraSAN
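To make the 99.999% target concrete, the sketch below converts an availability figure into a yearly downtime budget and shows how serial blocks erode it. The function names are illustrative, and the series formula assumes the blocks fail independently, which the slides do not state.

```python
import math

MINUTES_PER_YEAR = 365.25 * 24 * 60  # average year, including leap days


def downtime_minutes_per_year(availability):
    """Expected downtime per year implied by a steady-state availability."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def series_availability(block_availabilities):
    """Availability of blocks that must all be up (assumes independent failures)."""
    return math.prod(block_availabilities)


five_nines_budget = downtime_minutes_per_year(0.99999)
three_blocks = series_availability([0.99999, 0.99999, 0.99999])

print(f"five-nines budget: {five_nines_budget:.2f} min/year")
print(f"three serial five-nines blocks: "
      f"{downtime_minutes_per_year(three_blocks):.2f} min/year")
```

Five nines leaves roughly 5.26 minutes of downtime per year for the whole system, so three serial blocks that each meet five nines on their own already exceed the budget together.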
5 10-4-00
Dependability Description Model
• Measures
– Availability = Probability of a user being able to set up a new connection
– Reliability = Probability of an existing connection being dropped
– Maintenance = Number of maintenance events necessary
– Bellcore / TL-9000: outage, DPM, OFM, etc.
• Model Assumptions
• System Description
– Dependability Block Model
• Identification of Serial Blocks – Common failure impact, detection, response, repair
– Dependency Graphs
• Block Dependability Information
– General Information
– Failure Information
– Detection Information
– Recovery Information
– Notification Information
– Repair Information
– Upgrade Information
6 10-4-00
Dependability Description Model
• For each serial block, for each type of information
– Description of design aspect
– Impact on other components
– Applicable Parameters
• Time distribution & parameters
• Probability
– Basis for Parameter Estimate
– Effect(s) of failed activity
• Reference next escalation level of detection or effect of failed detection, next level of response
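The per-block record described above could be captured in a small data structure. This is a minimal sketch only: the class names, field names, and example values (`Activity`, `SerialBlock`, `line_card`, and so on) are hypothetical and are not taken from the Aspira DDM format.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Activity:
    """One type of information for a block (detection, recovery, repair, ...)."""
    description: str                 # description of the design aspect
    impact: str                      # impact on other components
    distribution: str                # time distribution, e.g. "lognormal"
    parameters: Dict[str, float]     # parameters of that distribution
    probability: float               # probability that the activity succeeds
    basis: str                       # basis for the parameter estimate
    on_failure: Optional[str] = None  # next escalation level, if any


@dataclass
class SerialBlock:
    name: str
    activities: Dict[str, Activity] = field(default_factory=dict)

    def escalation_chain(self, start: str) -> List[str]:
        """Follow on_failure references from `start` until the chain ends."""
        chain: List[str] = []
        current = start
        while current in self.activities and current not in chain:
            chain.append(current)
            current = self.activities[current].on_failure
        return chain


# Illustrative values only.
block = SerialBlock("line_card")
block.activities["hw_watchdog"] = Activity(
    "Hardware watchdog timer", "None", "uniform", {"low": 0.0, "high": 1.0},
    0.95, "Design estimate", on_failure="heartbeat_audit")
block.activities["heartbeat_audit"] = Activity(
    "Periodic heartbeat audit", "None", "uniform", {"low": 0.0, "high": 30.0},
    0.90, "Design estimate", on_failure="operator_report")
block.activities["operator_report"] = Activity(
    "Customer or operator report", "Service outage", "lognormal",
    {"mu": 6.0, "sigma": 1.0}, 1.0, "Field experience")

print(block.escalation_chain("hw_watchdog"))
```

Keeping the escalation reference as a name rather than a nested object mirrors the slide's wording ("reference next escalation level") and lets the chain be checked for gaps before any computation model is built.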
7 10-4-00
Dependability Computation Model (DCM)
• Simulation and analysis for the purpose of estimating measurements defined in the DDM.
• Created by dependability analysts based on information contained in the DDM.
• Model precisely how the system behaves.
8 10-4-00
UltraSAN Selection Factors
• Output measures are accompanied by estimated distributions
– Distribution shape gives insight
– Expected performance of individual networks, small populations can be understood
• Monte Carlo simulation utilized to avoid state-space explosion and to support non-exponential time distributions.
– All details specified in the DDM can be modeled using UltraSAN simulation.
• Model how design works. Avoids Markov Model simplifications
• One model to estimate all measures.
• Composed model supports modular programming, model reuse, and development time reduction.
• Exceptionally fast simulation time.
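The idea of Monte Carlo simulation with non-exponential distributions, producing a distribution of the measure rather than a single number, can be illustrated in plain Python. This is not UltraSAN code: the single-component model, the function name, and every parameter value are assumptions chosen for illustration.

```python
import random
import statistics


def simulate_availability(rng, horizon_h, fail_scale_h, fail_shape,
                          repair_mu, repair_sigma):
    """One replication: alternate Weibull up-times and lognormal repairs."""
    t = 0.0
    down = 0.0
    while t < horizon_h:
        t += rng.weibullvariate(fail_scale_h, fail_shape)  # time to next failure
        if t >= horizon_h:
            break
        repair = rng.lognormvariate(repair_mu, repair_sigma)
        repair = min(repair, horizon_h - t)  # truncate at end of observation
        down += repair
        t += repair
    return 1.0 - down / horizon_h


rng = random.Random(2000)  # fixed seed so the run is repeatable
samples = [
    simulate_availability(rng, horizon_h=8766.0, fail_scale_h=5000.0,
                          fail_shape=1.5, repair_mu=0.0, repair_sigma=0.5)
    for _ in range(2000)
]
ordered = sorted(samples)
mean_a = statistics.mean(samples)
p05 = ordered[int(0.05 * len(ordered))]  # roughly the 5th percentile

print(f"mean availability over one year: {mean_a:.6f}")
print(f"5th percentile (a poor individual system): {p05:.6f}")
```

Each replication stands for one system observed for a year, so the spread of `samples` is what lets the performance of an individual network or a small population be read off, in addition to the mean.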
9 10-4-00
Modeling Capability Details
• Rate Distributions (exponential, Weibull, lognormal, etc.)
– Failure time distributions
– Failure detection, response, and notification (time distribution and probability)
• Distributions that reflect real experience
• Detection and Response
– Multiple levels of detection and response escalation
– Effects of protocols and packet networks in fault management
• Software Architecture
– Model details of how the software modules work and fail together, how they interact, and their relationship to the hardware.
• Event edge (time-independent) impacts
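Multi-level detection escalation, where each level has both a time distribution and a success probability, can be sketched as follows. The level definitions and all numbers are hypothetical; the assumption that a missed level still consumes its time before escalating is made for this illustration and is not stated on the slides.

```python
import random


def time_to_detect(rng, levels):
    """Walk the escalation levels in order.

    Each level is (coverage, sample_time): the level takes sample_time(rng)
    hours and succeeds with probability `coverage`; on a miss the fault
    escalates to the next level. Returns the elapsed hours at detection,
    or None if every level misses.
    """
    elapsed = 0.0
    for coverage, sample_time in levels:
        elapsed += sample_time(rng)
        if rng.random() < coverage:
            return elapsed
    return None


# Illustrative levels: fast hardware check, slower audit, operator report.
levels = [
    (0.95, lambda r: r.lognormvariate(-3.0, 0.5)),
    (0.90, lambda r: r.weibullvariate(0.5, 2.0)),
    (1.00, lambda r: r.uniform(1.0, 4.0)),
]

rng = random.Random(42)
times = [time_to_detect(rng, levels) for _ in range(10000)]
slow = sum(1 for t in times if t > 1.0) / len(times)
print(f"fraction of faults needing more than one hour to detect: {slow:.4f}")
```

Passing the time sampler as a callable keeps each level free to use whatever distribution best reflects field experience.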
10 10-4-00
Modeling Capability Details
• Repair Dependency
– Example: If a single port on a multi-port adapter fails, the entire port adapter and all of its connections must be disabled to replace the port adapter.
• Operational Dependency
– Failure of some elements disables other elements.
– Example: If a processor fails, all applications are disabled and cannot fail until the processor is brought back online.
• Procedural Errors
– Failures caused by network operators performing routine or specialized operations on the network.
• Maintenance strategies
• Planned upgrade
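The operational dependency example (an application cannot fail while its processor is down) amounts to pausing the application's failure clock during processor outages. The sketch below shows one way to express that; the function name and the outage schedule are assumptions for illustration, not part of the UltraSAN templates.

```python
def wall_clock_failure_time(operating_hours, processor_outages):
    """Map an application's time-to-failure, measured in operating hours,
    to wall-clock hours, given processor outages as sorted
    (start_hour, duration_hours) pairs in wall-clock time.

    The application's failure clock only advances while the processor is up.
    """
    remaining = operating_hours
    now = 0.0
    for start, duration in processor_outages:
        up_span = start - now          # processor is up from `now` to `start`
        if remaining <= up_span:
            return now + remaining     # application fails before this outage
        remaining -= up_span
        now = start + duration         # clock is paused during the outage
    return now + remaining


outages = [(4.0, 2.0), (8.0, 1.0)]   # illustrative processor outages
print(wall_clock_failure_time(3.0, outages))    # fails before any outage
print(wall_clock_failure_time(10.0, outages))   # delayed by both outages
```

A model that ignored this dependency would let the application fail during a processor outage and count the same downtime twice.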
11 10-4-00
UltraSAN Modeling Process
• Defined UltraSAN templates cover an extensive range of configurations and procedures:
– Component failure (HW and SW)
– Redundancy (active and standby)
– Operational and repair dependency
– Detection and response time and probability
– Detection and response escalation
– Repair / replacement
– Maintenance/Procedural Error
– Upgrades
• Standardized measurement definitions
• Standards for naming conventions, time-increment, variable usage
• Detailed model validation procedures
• Strict configuration management guidelines
12 10-4-00
Desired UltraSAN Enhancements
• GUI enhancements
– cut and paste, rename
– more robust text/code editor
– complete model compilation at all levels of definition to circumvent mandatory subnet->composed->reward->study sequence
• Additional model composition formalisms (graph models, etc.)
• Path-based reward variables
• Integration with Design of Experiments functionality for evaluation of sensitivities
• Token specification (colored tokens, data structures, etc.)
• Easier specification of user-defined functions
• Triangular distribution
• Architecture-independent multi-processor runs
• Port to Windows 2000
• Alternate project documentation format (HTML, PDF, etc.)
• Improved documentation
• Worked complex examples
– Including tricks
13 10-4-00
Summary
• The attainment of 5 NINES Availability performance requires detailed design specifically targeted for availability enhancement
• A process has been developed that drives and records the Availability design detail
• The use of UltraSAN has allowed us to calculate the results of that implementation detail
– Easy to learn without extensive mathematical background
– Deals with large state space reflective of design detail
– Rapid simulation time
– Distributions – as part of inputs and in outputs