Software Dependability
CIS 376
Bruce R. Maxim
UM-Dearborn
Dependability
• The extent to which a critical system is trusted by its users
• Dependability is usually the most important system property of a critical system
• A system does not have to be trusted to be useful
• Dependability reflects the extent of the user’s confidence that the system will not fail in normal operation
Dimensions of Dependability
• Availability – ability of the system to deliver services when requested
• Reliability – ability of the system to deliver services as specified
• Safety – ability of the system to operate without catastrophic failure
• Security – ability of the system to defend itself against intrusion
Maintainability
• Concerned with the ease of repairing a system after failure
• Many critical system failures are caused by faults introduced during maintenance
• Maintainability is the only static dimension of dependability; the other three are dynamic
Survivability
• Ability of a system to deliver services after a deliberate or accidental attack
• This is very important for distributed systems whose security can be compromised
• Resilience – ability of the system to continue operation despite component failures
Dependability Costs
• Tend to increase exponentially as increasing levels of dependability are required
• More expensive development techniques and hardware are required to achieve higher levels of reliability
• Increased testing and validation are required to convince users that higher levels of dependability have been achieved
Dependability and Performance
• Untrustworthy systems are rejected by users
• System failure costs may be high
• It is hard to make existing systems more dependable
• It may be possible to compensate for poor performance
• Untrustworthy systems may lead to information loss
Dependability Economics
• Sometimes it is more cost effective to pay for failures than try to improve dependability
• Having a reputation for products that can’t be trusted can lead to loss of business
• System trustworthiness levels depend on the system type being developed
Availability and Reliability
• Availability – probability that a given system will be operational at a given point in time and able to deliver services
• Reliability – probability of failure-free operation over a specified time period in a given environment for a given purpose
Comparing Availability and Reliability
• If a system is not available when it is needed, it is unreliable
• It is possible to have systems with low reliability and high availability (if failures can be repaired quickly and do not damage data)
• Availability must take repair time into account
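A minimal sketch of how repair time enters availability, assuming the common steady-state approximation MTTF / (MTTF + MTTR). The approximation, the function name, and all numbers are illustrative assumptions and do not come from the lecture.

```python
def steady_state_availability(mttf_hours: float, mttr_hours: float) -> float:
    """Fraction of time the system is up, assuming MTTF / (MTTF + MTTR)."""
    if mttf_hours <= 0 or mttr_hours < 0:
        raise ValueError("mttf_hours must be positive and mttr_hours non-negative")
    return mttf_hours / (mttf_hours + mttr_hours)


# Hypothetical system A: fails often (every 10 h) but restarts in 36 s (0.01 h).
frequent_but_fast = steady_state_availability(10.0, 0.01)

# Hypothetical system B: fails rarely (every 1000 h) but takes 50 h to repair.
rare_but_slow = steady_state_availability(1000.0, 50.0)

print(f"A: {frequent_but_fast:.4f}  B: {rare_but_slow:.4f}")
```

Under these assumed figures, the system that fails far more often is still the more available one, because its repairs are quick.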
Faults and Failures
• Failures are usually the result of system errors derived from system faults
• Faults do not always result in system errors
– a transient faulty state may be corrected before an error occurs
• Errors do not always lead to system failures
– an error can be corrected by built-in error detection and recovery procedures
– failure can be protected against by protecting system resources from damage
User’s Reliability Perceptions
• The formal definition of reliability may not reflect the user’s perception of reliability – the user’s environment may not match the developers’ assumptions about the application environment
• The consequences of failure affect the user’s perception of reliability – failures with serious consequences are given more weight by users than failures that are merely inconvenient
Reliability Achievement
• Fault Avoidance – development techniques that minimize the possibility of mistakes or reduce the consequences of errors
• Fault Detection and Removal – verification and validation techniques that increase the likelihood of detecting and correcting errors before deployment
• Fault Tolerance – run-time techniques used to ensure system faults do not result in system errors and system errors do not result in system failures
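One way to picture run-time fault tolerance is a reader that tries redundant sources, rejects implausible values, and falls back to a safe default. This is only a sketch under assumed conditions: the sensor functions, the plausible range, and the default are all hypothetical.

```python
from typing import Callable, Sequence


def read_with_recovery(
    sensors: Sequence[Callable[[], float]],
    low: float,
    high: float,
    safe_default: float,
) -> float:
    """Return the first plausible reading; fall back to safe_default."""
    for read in sensors:
        try:
            value = read()
        except Exception:
            # The sensor itself failed: move on to the next redundant sensor.
            continue
        if low <= value <= high:
            return value
        # Implausible value detected: discard it so it cannot propagate.
    return safe_default


def broken_sensor() -> float:
    raise OSError("sensor offline")


def stuck_sensor() -> float:
    return -999.0  # outside the plausible range


def healthy_sensor() -> float:
    return 21.5


reading = read_with_recovery(
    [broken_sensor, stuck_sensor, healthy_sensor],
    low=-40.0,
    high=85.0,
    safe_default=20.0,
)
print(reading)
```

The caller never sees the exception or the out-of-range value, so neither one becomes a failure at the service level.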
Reliability Modeling
• You can model a system as an input-output mapping where some inputs lead to erroneous outputs
• The reliability of the system is the probability that a particular input does not lie in the set of inputs which cause erroneous outputs
• This probability is not static and depends on the system’s environment
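The input-output view can be sketched as a small simulation. It estimates the probability that an input drawn from an environment's usage profile falls in the failure-causing set, and reports reliability as the complement. The faulty input range and both usage profiles are invented for illustration.

```python
import random


def has_erroneous_output(x: int) -> bool:
    """Hypothetical faulty program: inputs 90..99 produce wrong outputs."""
    return 90 <= x <= 99


def estimate_reliability(draw_input, trials: int = 100_000, seed: int = 0) -> float:
    """Estimate P(input is not in the failure-causing set) by sampling."""
    rng = random.Random(seed)
    failures = sum(has_erroneous_output(draw_input(rng)) for _ in range(trials))
    return 1.0 - failures / trials


def uniform_draw(rng: random.Random) -> int:
    # Environment 1: users pick inputs uniformly from 0..99.
    return rng.randrange(100)


def cautious_draw(rng: random.Random) -> int:
    # Environment 2: 95% of inputs stay in 0..49, the rest are uniform on 0..99.
    return rng.randrange(50) if rng.random() < 0.95 else rng.randrange(100)


uniform_users = estimate_reliability(uniform_draw)
cautious_users = estimate_reliability(cautious_draw)
print(f"uniform: {uniform_users:.3f}  cautious: {cautious_users:.3f}")
```

The program is identical in both runs; only the assumed input distribution changes, and the estimated reliability changes with it.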
Improving Reliability
• Removing X% of the system faults does not necessarily improve system reliability by X% – remember the 90/10 rule
• Program defects may lie in code rarely executed by the user, so removing them will do little to improve perceived reliability
• A program with known faults may still be perceived by its users as reliable
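The point about rarely executed code can be illustrated with a toy calculation. The fault names and trigger probabilities below are made up, and the faults are assumed to be triggered independently of one another.

```python
# Hypothetical fault list: name -> probability that a typical run triggers it.
faults = {
    "login-timeout": 0.040,
    "report-rounding": 0.005,
    "legacy-import": 0.0004,
    "admin-export": 0.0003,
    "rare-locale": 0.0002,
    "leap-second": 0.0001,
}


def failure_probability(remaining: dict) -> float:
    """P(a run fails), assuming faults are triggered independently."""
    p_ok = 1.0
    for p_trigger in remaining.values():
        p_ok *= 1.0 - p_trigger
    return 1.0 - p_ok


def without(names: set) -> dict:
    """Copy of the fault list with the named faults removed."""
    return {k: v for k, v in faults.items() if k not in names}


baseline = failure_probability(faults)

# Remove the four rarely triggered faults (four of six, about two thirds).
after_rare_fixed = failure_probability(
    without({"legacy-import", "admin-export", "rare-locale", "leap-second"})
)

# Remove only the single most frequently triggered fault (one of six).
after_hot_fixed = failure_probability(without({"login-timeout"}))

print(f"baseline {baseline:.4f}, rare fixed {after_rare_fixed:.4f}, "
      f"hot fixed {after_hot_fixed:.4f}")
```

With these assumed numbers, fixing two thirds of the faults barely moves the failure probability, while fixing the one fault users hit most often removes most of it.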
Safety
• System property that reflects the system’s ability to operate (normally or abnormally) without danger to the system’s environment
• As more devices become software controlled, safety becomes a greater concern
• Safety requirements are exclusive (they exclude undesirable situations rather than specify required system services)
Safety Criticality
• Primary safety-critical systems – embedded software systems whose failure can cause associated hardware to fail and directly threaten people
• Secondary safety-critical systems – systems whose faults can cause other systems to fail, which can then threaten people
Safety and Reliability
• They are related, but not identical
• Reliability – concerned with conformance to a specification and delivery of a service
• Safety – concerned with ensuring a system cannot cause damage, regardless of its conformance (or nonconformance) to its specification
Unsafe Reliable System
• Specification errors
– if the specification is incorrect, conformance to the specification can still cause damage
• Hardware failures generating spurious outputs
– hard to anticipate in the specification
• Context-sensitive commands
– e.g. issuing the right command at the wrong time
– often caused by operator error
Safety Achievement
• Hazard Avoidance – system designed so some hazard cases cannot arise
• Hazard Detection and Removal – system designed so hazards are detected and removed before they result in an accident
• Damage Limitation – system includes protection features that minimize damage that may result from an accident
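The three strategies can be sketched in one small controller. The pressure tank, its thresholds, and every name below are hypothetical and chosen only to show where each strategy would sit in code.

```python
from dataclasses import dataclass

MAX_TARGET_KPA = 300.0  # hypothetical design limit, well below the hazard level
HAZARD_KPA = 400.0      # hypothetical pressure at which operation becomes hazardous
RELIEF_KPA = 500.0      # hypothetical pressure at which damage must be limited


@dataclass
class PressureTank:
    target_kpa: float = 100.0
    pump_on: bool = False
    relief_valve_open: bool = False

    def set_target(self, requested_kpa: float) -> None:
        # Hazard avoidance: a target above the design limit can never be stored.
        self.target_kpa = min(requested_kpa, MAX_TARGET_KPA)

    def step(self, measured_kpa: float) -> None:
        if measured_kpa >= RELIEF_KPA:
            # Damage limitation: vent pressure to reduce the harm of any accident.
            self.relief_valve_open = True
        if measured_kpa >= HAZARD_KPA:
            # Hazard detection and removal: stop pumping before an accident occurs.
            self.pump_on = False
            return
        self.pump_on = measured_kpa < self.target_kpa


tank = PressureTank()
tank.set_target(1000.0)  # an unsafe request is clamped to the design limit
tank.step(200.0)         # normal operation: below target, so the pump runs
pump_during_normal_operation = tank.pump_on
tank.step(450.0)         # hazard detected: pump switched off
tank.step(520.0)         # beyond the relief threshold: valve opens
print(tank)
```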
Accidents
• Rarely have a single cause in a complex system (e.g. credit assignment problem)
• Most accidents are the result of combinations of malfunctions
• Anticipating all combinations of malfunctions may not be possible in a software-controlled system, so complete safety may be impossible
Security
• Reflects a system’s ability to protect itself from attack
• Security is increasingly important when systems are networked to each other
• Security is an essential pre-requisite for availability, reliability, and safety
Fundamental Security
• If a system is networked and insecure then statements about its reliability and safety are unreliable
• Intrusion (attack) can change the system’s operating environment or data and invalidate the assumptions upon which the reliability and safety claims are based
Insecurity Damage
• Denial of Service – system forced into a state where providing service is impossible or significantly degraded
• Corruption of Programs or Data – modifications made by an unauthorized user
• Disclosure of Confidential Information – information managed by the system is exposed to people who are not authorized users
Security Assurance
• Vulnerability Avoidance
– system designed so vulnerabilities cannot occur
– e.g. no network connection
• Attack Detection and Elimination
– system designed so attacks on vulnerabilities are detected and neutralized before they cause damage
– e.g. use of anti-virus software
• Exposure Limitation
– system designed so damage from attacks is minimal
– e.g. a backup policy that allows restoration of damaged files
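A rough sketch of detection plus exposure limitation, using an in-memory stand-in for files and their backups. The class and method names are hypothetical. A real system would need to keep the digests and backup copies somewhere an intruder cannot modify, which this toy example does not attempt.

```python
import hashlib


def digest(data: bytes) -> str:
    """SHA-256 hex digest used as an integrity fingerprint."""
    return hashlib.sha256(data).hexdigest()


class ProtectedStore:
    """In-memory stand-in for files plus a backup copy (illustrative only)."""

    def __init__(self) -> None:
        self._live = {}
        self._backup = {}
        self._known_good = {}

    def save(self, name: str, data: bytes) -> None:
        # An authorized write updates the live copy, the backup, and the digest.
        self._live[name] = data
        self._backup[name] = data
        self._known_good[name] = digest(data)

    def tamper(self, name: str, data: bytes) -> None:
        # Simulates an unauthorized modification that bypasses save().
        self._live[name] = data

    def read(self, name: str) -> bytes:
        data = self._live[name]
        if digest(data) != self._known_good[name]:
            # Detection: the contents no longer match the recorded digest.
            # Exposure limitation: restore the damaged entry from the backup.
            data = self._backup[name]
            self._live[name] = data
        return data


store = ProtectedStore()
store.save("grades.csv", b"alice,A\nbob,B\n")
store.tamper("grades.csv", b"alice,F\nbob,A\n")
restored = store.read("grades.csv")
print(restored.decode())
```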