Software Dependability

CIS 376

Bruce R. Maxim

UM-Dearborn

Dependability

• The extent to which a critical system is trusted by its users

• Dependability is usually the most important system property of a critical system

• A system does not have to be trusted to be useful

• Dependability reflects the extent of the user’s confidence that the system will not fail in normal operation

Dimensions of Dependability

• Availability – ability of the system to deliver services when requested

• Reliability – ability of the system to deliver services as specified

• Safety – ability of the system to operate without catastrophic failure

• Security – ability of the system to defend itself against intrusion

Maintainability

• Concerned with the ease of repairing a system after failure

• Many critical system failures are caused by faults introduced during maintenance

• Maintainability is the only static dimension of dependability; the other dimensions are dynamic

Survivability

• Ability of a system to continue delivering services during and after a deliberate or accidental attack

• This is very important for distributed systems whose security can be compromised

• Resilience – ability of the system to continue operation despite component failures

Dependability Costs

• Tend to increase exponentially as increasing levels of dependability are required

• More expensive development techniques and hardware are required to achieve higher levels of reliability

• Increased testing and validation are required to convince users that higher levels of dependability have been achieved

Dependability and Performance

• Untrustworthy systems are rejected by users

• System failure costs may be high

• It is hard to make existing systems more dependable

• It may be possible to compensate for poor performance

• Untrustworthy systems may lead to information loss

Dependability Economics

• Sometimes it is more cost effective to pay for failures than try to improve dependability

• Having a reputation for products that can’t be trusted can lead to loss of business

• System trustworthiness levels depend on the system type being developed

Availability and Reliability

• Availability – probability that a given system will be operational at a given point in time and able to deliver services

• Reliability – probability of failure-free operation over a specified time period in a given environment for a given purpose

Comparing Availability and Reliability

• If a system is not available when it is needed, it is unreliable

• It is possible to have systems with low reliability and high availability (if failures can be repaired quickly and do not damage data)

• Availability must take repair time into account
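
The point about repair time can be made concrete with the standard steady-state formula, availability = MTTF / (MTTF + MTTR). The sketch below is illustrative only: the function name and both example systems are hypothetical, and the figures are invented to show how a system that fails often but recovers quickly can be more available than one that fails rarely but is slow to repair.

```python
def steady_state_availability(mttf_hours: float, mttr_hours: float) -> float:
    """Long-run fraction of time the system is up: MTTF / (MTTF + MTTR)."""
    if mttf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTTF must be positive and MTTR non-negative")
    return mttf_hours / (mttf_hours + mttr_hours)


# Hypothetical system A: low reliability (fails every 10 hours on average)
# but a restart takes only 0.01 hours (36 seconds).
frequent_but_fast = steady_state_availability(mttf_hours=10.0, mttr_hours=0.01)

# Hypothetical system B: high reliability (fails every 1000 hours on average)
# but each repair takes 50 hours.
rare_but_slow = steady_state_availability(mttf_hours=1000.0, mttr_hours=50.0)

print(f"A (fails often, repairs fast): {frequent_but_fast:.4f}")
print(f"B (fails rarely, repairs slowly): {rare_but_slow:.4f}")
```

Under these assumed numbers, system A is the less reliable of the two yet the more available, which is the situation the slide describes.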

Faults and Failures

• Failures are usually the result of system errors derived from system faults

• Faults do not always result in system errors – a faulty system state may be transient and corrected before an error occurs

• Errors do not always lead to system failures

– an error can be corrected by built-in error detection and recovery procedures

– a failure can be protected against by protecting system resources from damage
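
A minimal sketch of built-in error detection and recovery, under stated assumptions: the `CheckedRecord` class, the duplicated copy, and the sample payload are all hypothetical, and a CRC from Python's standard `zlib` module stands in for whatever detection mechanism a real system would use. The erroneous state is detected and repaired on read, so it never becomes a failure visible to the caller.

```python
import zlib


class CheckedRecord:
    """Keeps two copies of a value plus a CRC so a corrupted copy can be repaired."""

    def __init__(self, payload: bytes) -> None:
        self.primary = payload
        self.backup = payload
        self.crc = zlib.crc32(payload)

    def read(self) -> bytes:
        # Error detection: does the primary copy still match its checksum?
        if zlib.crc32(self.primary) == self.crc:
            return self.primary
        # Error recovery: restore from the backup before the bad state
        # reaches the caller.
        if zlib.crc32(self.backup) == self.crc:
            self.primary = self.backup
            return self.primary
        # Both copies are bad: the error now becomes a visible failure.
        raise RuntimeError("unrecoverable corruption")


record = CheckedRecord(b"dose=5mg")
record.primary = b"dose=50mg"  # simulate an erroneous internal state
value = record.read()          # detected and corrected, no failure observed
print(value)
```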

User’s Reliability Perceptions

• The formal definition of reliability may not reflect the user’s perception of reliability – the user’s environment may not match the developer’s assumptions about the application environment

• The consequences of failure affect the user’s perception of reliability – failures with serious consequences are given more weight by users than failures that are merely inconvenient

Reliability Achievement

• Fault Avoidance – development techniques that minimize the possibility of mistakes or reduce the consequences of errors

• Fault Detection and Removal – verification and validation techniques that increase the probability of detecting and correcting errors before deployment

• Fault Tolerance – run-time techniques used to ensure system faults do not result in system errors and system errors do not result in system failures
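
One common run-time fault tolerance technique is redundancy with majority voting. The slides do not name a specific technique, so the sketch below is only an illustration: `majority_vote` and the three replica functions are hypothetical, and one replica is given a deliberate fault so the masking effect is visible.

```python
from collections import Counter
from typing import Callable, Sequence


def majority_vote(replicas: Sequence[Callable[[int], int]], x: int) -> int:
    """Run every replica on x and return the value most of them agree on."""
    results = []
    for replica in replicas:
        try:
            results.append(replica(x))
        except Exception:
            # A crashing replica simply contributes no vote.
            continue
    if not results:
        raise RuntimeError("all replicas failed")
    value, votes = Counter(results).most_common(1)[0]
    if votes <= len(replicas) // 2:
        raise RuntimeError("no majority among replicas")
    return value


def square_a(x: int) -> int:
    return x * x


def square_b(x: int) -> int:
    return x ** 2


def square_faulty(x: int) -> int:
    # Deliberate fault: wrong answer for the input 3 only.
    return x * x + (1 if x == 3 else 0)


result = majority_vote([square_a, square_b, square_faulty], 3)
print(result)
```

The faulty replica produces an erroneous value for the input 3, but the voter outvotes it, so the fault does not propagate to a system failure.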

Reliability Modeling

• You can model a system as an input-output mapping where some inputs lead to erroneous outputs

• The system fails whenever an input lies in the set of inputs that cause erroneous outputs, so its reliability is the probability that a particular input does not lie in that set

• This probability is not static and depends on the system’s environment
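
The input-output view can be sketched numerically. Everything in this example is an assumption made for illustration: the input names, the set of failing inputs, and the two usage profiles (probability of each input in a given environment) are invented. The same software yields different reliability figures because the environments exercise the failing inputs at different rates.

```python
def failure_probability(profile: dict[str, float], failing_inputs: set[str]) -> float:
    """Probability that the next input falls in the erroneous-output set."""
    total = sum(profile.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError("profile probabilities must sum to 1")
    return sum(p for name, p in profile.items() if name in failing_inputs)


# Hypothetical inputs that produce erroneous outputs in this program.
failing = {"export_pdf", "import_legacy"}

# Two hypothetical environments using the same program differently.
office_profile = {"open": 0.50, "save": 0.40, "export_pdf": 0.09, "import_legacy": 0.01}
archive_profile = {"open": 0.30, "save": 0.10, "export_pdf": 0.20, "import_legacy": 0.40}

office_reliability = 1.0 - failure_probability(office_profile, failing)
archive_reliability = 1.0 - failure_probability(archive_profile, failing)

print(f"office:  {office_reliability:.2f}")
print(f"archive: {archive_reliability:.2f}")
```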

Improving Reliability

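• Removing X% of the system faults does not necessarily improve system reliability by X% – remember the 90/10 rule (commonly stated as roughly 90% of execution time being spent in about 10% of the code)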

• Program defects may lie in code rarely executed by the user, so removing them will do little to improve perceived reliability

• A program with known faults may still be perceived by its users as reliable
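
A small worked example of why fault counts and perceived reliability can diverge. The fault names and trigger probabilities below are hypothetical, and the triggers are assumed to be disjoint so that their probabilities can simply be added.

```python
# Hypothetical faults, each tagged with the assumed probability that a
# randomly chosen user input triggers it (triggers assumed disjoint).
FAULT_TRIGGER_PROBABILITY = {
    "F1_startup_path": 0.0400,
    "F2_rare_report": 0.0005,
    "F3_rare_import": 0.0003,
    "F4_rare_locale": 0.0002,
}


def failure_probability(faults: dict[str, float]) -> float:
    """Probability that an input triggers any remaining fault."""
    return sum(faults.values())


def remove_faults(faults: dict[str, float], removed: set[str]) -> dict[str, float]:
    """Return a copy of the fault table without the repaired faults."""
    return {name: p for name, p in faults.items() if name not in removed}


def relative_improvement(before: float, after: float) -> float:
    """Fractional reduction in failure probability."""
    return (before - after) / before


baseline = failure_probability(FAULT_TRIGGER_PROBABILITY)

# Fix 3 of the 4 faults (75%), all in rarely executed code.
after_rare_fixes = failure_probability(
    remove_faults(
        FAULT_TRIGGER_PROBABILITY,
        {"F2_rare_report", "F3_rare_import", "F4_rare_locale"},
    )
)

# Fix 1 of the 4 faults (25%), the one on the frequently executed path.
after_common_fix = failure_probability(
    remove_faults(FAULT_TRIGGER_PROBABILITY, {"F1_startup_path"})
)

print(f"75% of faults removed: {relative_improvement(baseline, after_rare_fixes):.1%} fewer failures")
print(f"25% of faults removed: {relative_improvement(baseline, after_common_fix):.1%} fewer failures")
```

With these assumed figures, removing three quarters of the faults reduces failures by only a few percent, while removing the single fault on the common path removes nearly all of them.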

Safety

• System property that reflects the system’s ability to operate (normally or abnormally) without danger to system environment

• As more devices become software controlled, safety becomes a greater concern

• Safety requirements are exclusive (they exclude undesirable situations rather than specify required system services)

Safety Criticality

• Primary safety-critical systems – embedded software systems whose failure can cause associated hardware to fail and directly threaten people

• Secondary safety-critical systems – systems whose faults can cause other systems to fail, which can in turn threaten people

Safety and Reliability

• They are related, but not identical

• Reliability – concerned with conformance to a specification and delivery of a service

• Safety – concerned with ensuring a system cannot cause damage, regardless of its conformance (or nonconformance) to its specification

Unsafe Reliable System

• Specification errors – if the specification is incorrect, conformance to the specification can still cause damage

• Hardware failures generating spurious outputs – hard to anticipate in specification

• Context-sensitive commands – e.g. issuing the right command at the wrong time

– often caused by operator error

Safety Achievement

• Hazard Avoidance – the system is designed so that some classes of hazard cannot arise

• Hazard Detection and Removal – the system is designed so that hazards are detected and removed before they result in an accident

• Damage Limitation – the system includes protection features that minimize the damage that may result from an accident
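
The three strategies can be sketched side by side in a toy controller. This is an illustration only: the `Heater` class, the temperature limits, and the smoke input are hypothetical and are not taken from any real control system.

```python
from dataclasses import dataclass

MAX_SETPOINT_C = 80.0    # hypothetical limit used for hazard avoidance
OVERHEAT_LIMIT_C = 90.0  # hypothetical limit used for hazard detection


@dataclass
class Heater:
    setpoint_c: float = 20.0
    heating: bool = False
    extinguisher_fired: bool = False

    def set_setpoint(self, requested_c: float) -> None:
        # Hazard avoidance: an unsafe setpoint can never be stored.
        self.setpoint_c = min(requested_c, MAX_SETPOINT_C)

    def step(self, measured_c: float, smoke_detected: bool = False) -> str:
        if smoke_detected:
            # Damage limitation: an accident has started, so limit its effects.
            self.heating = False
            self.extinguisher_fired = True
            return "extinguish"
        if measured_c >= OVERHEAT_LIMIT_C:
            # Hazard detection and removal: cut power before an accident occurs.
            self.heating = False
            return "overheat_cutoff"
        # Normal operation: heat only while below the setpoint.
        self.heating = measured_c < self.setpoint_c
        return "heating" if self.heating else "idle"


heater = Heater()
heater.set_setpoint(150.0)  # request is clamped to MAX_SETPOINT_C
states = [
    heater.step(25.0),
    heater.step(95.0),
    heater.step(30.0, smoke_detected=True),
]
print(heater.setpoint_c, states)
```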

Accidents

• Rarely have a single cause in a complex system (e.g. credit assignment problem)

• Most accidents are the result of combinations of malfunctions

• Anticipating all combinations of malfunctions may not be possible in a software-controlled system, so complete safety may be impossible

Security

• Reflects a system’s ability to protect itself from attack

• Security is increasingly important when systems are networked to each other

• Security is an essential pre-requisite for availability, reliability, and safety

Fundamental Security

• If a system is networked and insecure, then statements about its reliability and safety are unreliable

• Intrusion (attack) can change the system’s operating environment or data and invalidate the assumptions on which the reliability and safety claims are based

Insecurity Damage

• Denial of Service – system forced into a state where providing service is impossible or significantly degraded

• Corruption of Programs or Data – modifications made by an unauthorized user

• Disclosure of Confidential Information – information managed by the system is exposed to people who are not authorized users

Security Assurance

• Vulnerability Avoidance – system designed so vulnerabilities do not occur

– e.g. no network connection

• Attack Detection and Elimination – system designed so attacks on vulnerabilities are detected and neutralized before they cause damage

– e.g. use of anti-virus software

• Exposure Limitation – system designed so damage from a successful attack is minimal

– e.g. a backup policy that allows restoration of damaged files
