Fault-tolerant systems design, part 1
1. Introduction: Basic Definitions
Fault tolerance is the ability of a system to continue performing its tasks correctly after the occurrence of a fault.
Reliability of a system is the function, R(t), defined as the conditional probability that the system performs correctly throughout the time interval [t0, t], given that the system was performing correctly at time t0.
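As an illustration of R(t), the sketch below evaluates reliability under an assumed constant failure rate (the exponential model). The slides do not state any failure model, so the model, the function name `reliability`, and the numbers used are assumptions for illustration only.

```python
import math


def reliability(failure_rate: float, t: float, t0: float = 0.0) -> float:
    """Return R(t) for the interval [t0, t].

    Assumes a constant failure rate (exponential model), which is an
    assumption of this sketch and not something stated in the slides.
    """
    if failure_rate < 0 or t < t0:
        raise ValueError("failure_rate must be >= 0 and t must be >= t0")
    return math.exp(-failure_rate * (t - t0))


if __name__ == "__main__":
    # Hypothetical numbers: 1e-4 failures per hour over 1000 hours.
    print(reliability(1e-4, 1000.0))
```

Under this assumed model R(t0) is 1 and R(t) decreases as the interval grows.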
Availability is the function, A(t), defined as the probability that the system is operating correctly and is available to perform its tasks at the instant of time t.
Fault-tolerant systems can be designed by means of two basic approaches:
1. Fault masking
2. Fault detection, location, and recovery (via reconfiguration of the system to remove the defective part)
2. Design of FT Systems
If the option is reconfiguration, then:
Before reconfiguration: fault detection techniques and fault location techniques
After reconfiguration: fault recovery techniques
Fault recovery techniques:
Rollback recovery
Forward recovery
All techniques for designing FT systems are based on some type and degree of redundancy.
Redundancy is implemented through the use of hardware (HW), software (SW), information, or time beyond what is necessary for normal system operation.
It has a non-negligible impact on the system in terms of performance, size, weight, power consumption, and reliability.
Redundancy at the HW level:
1. Passive
2. Active
3. Hybrid
HW Redundancy: 1. Passive
Based on the concept of fault masking: it hides the occurrence of faults and prevents them from resulting in errors (developed around the concept of majority voting).
It does not provide fault detection; it simply masks the faults.
[Figure: Basic concept of Triple Modular Redundancy (TMR). Module 1, Module 2, and Module 3 feed a voter, which produces the output.]
[Figure: The use of triplicated voters in a TMR configuration. Proc 1, Proc 2, and Proc 3 feed three voters, which feed Mem 1, Mem 2, and Mem 3.]
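The majority voting behind TMR can be sketched in a few lines. This is a minimal illustration assuming module outputs can be compared for exact equality; the function name `majority_vote` is hypothetical and does not come from the slides.

```python
from collections import Counter
from typing import Hashable, Sequence


def majority_vote(outputs: Sequence[Hashable]) -> Hashable:
    """Return the value produced by a strict majority of the modules.

    A single faulty module out of three is masked. If no strict majority
    exists, the sketch raises instead of guessing.
    """
    value, votes = Counter(outputs).most_common(1)[0]
    if votes * 2 <= len(outputs):
        raise RuntimeError("no majority among module outputs")
    return value


if __name__ == "__main__":
    # Module 3 is faulty; its output is outvoted by modules 1 and 2.
    print(majority_vote([7, 7, 9]))  # prints 7
```

Note that the voter only masks the disagreeing module; it does not report which one was faulty, which matches the slide's remark that passive redundancy masks faults without detecting them.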
[Figure: Example of SW voting. Labels: Proc 1, Proc 2, Proc 3, Task A, Task B, Voter Task.]
HW voting or SW voting? The decision depends on:
1. The availability of a processor to perform the voting
2. The speed at which voting must be performed
3. The criticality of space, power, and weight limitations
4. The number of different voters that must be provided
5. The flexibility required of the voter with respect to future changes in the system
In practical applications of voting, the three results in a TMR system may not completely agree, even in a fault-free environment.
For example, A/D converters in sensors may produce quantities that disagree in the least-significant bits. This disagreement can propagate into larger discrepancies after computation, which can significantly affect the voting process.
Solution: Mid-Value Select Technique
A TMR system selects the value that lies in the middle of the other two.
[Figure: Mid-value selection applied to one corrupted signal and two uncorrupted signals, showing the selected signal.]
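Mid-value selection can be sketched as taking the median of the three results. The function name `mid_value_select` and the sample readings are hypothetical and used only for illustration.

```python
def mid_value_select(a: float, b: float, c: float) -> float:
    """Return the value that lies between the other two (the median of three).

    Unlike exact-match voting, this tolerates small disagreements such as
    least-significant-bit differences between A/D converters.
    """
    return sorted((a, b, c))[1]


if __name__ == "__main__":
    # Two uncorrupted readings that differ slightly, plus one corrupted one.
    print(mid_value_select(10.01, 9.99, 57.3))  # prints 10.01
```

The selected value is always one of the two uncorrupted readings as long as at most one input is corrupted.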
HW Redundancy: 2. Active (or Dynamic)
Attempts to achieve fault tolerance by means of fault detection, fault location, reconfiguration, and recovery. The property of fault masking is not obtained: there is no attempt to prevent faults from producing errors within the system.
More suitable for applications where temporary erroneous results are acceptable, as long as the system reconfigures and regains its operational status in a satisfactory length of time.
Duplication of functional units
Standby blocks: hot standby sparing and cold standby sparing
[Figure: A software implementation of duplication with comparison. Processor A and Processor B each hold their own result in private memory and place a copy in shared memory; each processor runs a comparison task that produces an error signal (A, B).]
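The duplication-with-comparison scheme in the figure can be sketched as follows. This is a single-process simulation under stated assumptions: the two "processors" are plain functions, the shared memory is a dictionary, and the name `duplicate_and_compare` is hypothetical.

```python
from typing import Callable, Dict, Sequence, Tuple

Task = Callable[[Sequence[int]], int]


def duplicate_and_compare(task_a: Task, task_b: Task,
                          data: Sequence[int]) -> Tuple[bool, bool]:
    """Run the same task on two simulated processors and compare results.

    Returns (error_signal_a, error_signal_b). A True signal means that
    processor's comparison task saw a disagreement.
    """
    shared_memory: Dict[str, int] = {}

    # Each processor computes its result and keeps it in private memory.
    private_a = task_a(data)
    private_b = task_b(data)

    # Each processor also publishes its result to shared memory.
    shared_memory["A"] = private_a
    shared_memory["B"] = private_b

    # Each comparison task checks its private copy against the other's result.
    error_a = private_a != shared_memory["B"]
    error_b = private_b != shared_memory["A"]
    return error_a, error_b


if __name__ == "__main__":
    print(duplicate_and_compare(sum, sum, [1, 2, 3]))  # prints (False, False)
```

A disagreement shows that a fault exists but not which unit is faulty, so locating the fault and recovering from it still require additional mechanisms.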
HW Redundancy: 3. Hybrid
Combines the attractive features of both the active and the passive approaches.
SW Redundancy:
Consistency Checks
Capability Checks
N-Self-Checking Programming
N-Version Programming
Recovery Blocks
SW Redundancy: Consistency Checks
Use prior knowledge about the characteristics of a piece of information to check its correctness.
In most applications, it is well known that a given quantity or operand cannot assume values beyond predefined limits.
Examples:
In a typical control application, a processing system samples and stores many sensor readings, each of which can be checked against its expected range.
The amount of cash requested by a patron at a bank's teller machine should never exceed the maximum withdrawal allowed.
Examples (continued):
The address generated by a computer should never lie outside the address range of the available memory.
In a computer, each instruction code can be checked to verify that it is not one of the illegal codes.
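The consistency checks listed in these examples can be sketched as simple predicates. All limits and opcode values below (`MAX_WITHDRAWAL`, `MEMORY_SIZE`, `LEGAL_OPCODES`) are hypothetical placeholders, not values from the slides.

```python
# Hypothetical limits, chosen only to make the sketch runnable.
MAX_WITHDRAWAL = 500                              # maximum cash per request
MEMORY_SIZE = 0x10000                             # bytes of available memory
LEGAL_OPCODES = frozenset({0x00, 0x01, 0x02, 0x10})


def withdrawal_is_consistent(amount: int) -> bool:
    """The requested cash must be positive and within the allowed maximum."""
    return 0 < amount <= MAX_WITHDRAWAL


def address_is_consistent(address: int) -> bool:
    """A generated address must fall inside the available memory range."""
    return 0 <= address < MEMORY_SIZE


def opcode_is_consistent(opcode: int) -> bool:
    """An instruction code must not be one of the illegal codes."""
    return opcode in LEGAL_OPCODES


if __name__ == "__main__":
    print(withdrawal_is_consistent(200))   # prints True
    print(address_is_consistent(0x20000))  # prints False
    print(opcode_is_consistent(0xFF))      # prints False
```

Each check uses only knowledge that is available in advance about the quantity being checked, which is the idea the slides describe.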
SW Redundancy: Capability Checks
Capability checks are performed to verify that a system
possesses the capability expected.
Examples:
Check whether a computer has its complete memory available.
Check whether the processors in a multiprocessor system are working properly.
Periodically, a processor can execute specific instructions on specific data and compare the results to known results stored in a ROM. This checks both the ALU and the memory.
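The periodic self-test described above can be sketched as a known-answer test. The table `KNOWN_ANSWERS` stands in for the results stored in ROM, and the `alu` callable stands in for the hardware under test; both are assumptions of this sketch.

```python
import operator
from typing import Callable, Tuple

Operation = Callable[[int, int], int]

# Stand-in for the known results stored in ROM: (operation, a, b, expected).
KNOWN_ANSWERS: Tuple[Tuple[Operation, int, int, int], ...] = (
    (operator.add, 2, 3, 5),
    (operator.and_, 0b1100, 0b1010, 0b1000),
    (operator.xor, 0b1100, 0b1010, 0b0110),
    (operator.mul, 6, 7, 42),
)


def healthy_alu(op: Operation, a: int, b: int) -> int:
    """A fault-free ALU model: simply applies the operation."""
    return op(a, b)


def alu_self_test(alu: Callable[[Operation, int, int], int] = healthy_alu) -> bool:
    """Return True only if every test vector yields its known result."""
    return all(alu(op, a, b) == expected for op, a, b, expected in KNOWN_ANSWERS)


if __name__ == "__main__":
    print(alu_self_test())  # prints True
```

A unit whose least-significant output bit is stuck at 1, modeled as `lambda op, a, b: op(a, b) | 1`, fails the AND vector and so fails the self-test.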
SW Redundancy: N-Self-Checking Programming
[Figure: The N-Self-Checking Programming approach to software fault tolerance. Program inputs feed Program Version 1 through Program Version n; each version has its own acceptance tests, and selection logic chooses the program outputs.]
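The structure in the figure can be sketched as a list of (program version, acceptance test) pairs with selection logic that returns the first output passing its test. This is a simplified, sequential sketch; the integer-square-root versions and all names are hypothetical.

```python
import math
from typing import Callable, Sequence, Tuple

Version = Callable[[int], int]
AcceptanceTest = Callable[[int, int], bool]


def isqrt_accept(x: int, r: int) -> bool:
    """Acceptance test: r is the integer square root of x."""
    return r >= 0 and r * r <= x < (r + 1) * (r + 1)


def buggy_isqrt(x: int) -> int:
    """A deliberately wrong version, used to exercise the acceptance test."""
    return x // 2


def n_self_checking(versions: Sequence[Tuple[Version, AcceptanceTest]],
                    x: int) -> int:
    """Selection logic: return the first output that passes its own test."""
    for program, accept in versions:
        try:
            result = program(x)
        except Exception:
            continue  # a crashing version is treated as a failed version
        if accept(x, result):
            return result
    raise RuntimeError("no version produced an acceptable output")


if __name__ == "__main__":
    versions = [(buggy_isqrt, isqrt_accept), (math.isqrt, isqrt_accept)]
    print(n_self_checking(versions, 17))  # prints 4
```

In a real system the versions would be developed independently and could run in parallel; the sequential loop here is only a convenience.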
Information Redundancy:
Parity, Berger, and m-of-n codes
Arithmetic codes
Hamming codes
Checksum codes
CRC (cyclic redundancy check) codes
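As one small instance of information redundancy, the sketch below implements a single-bit even parity code, the simplest of the codes listed. The function names are hypothetical, and words are assumed to be non-negative integers.

```python
def count_ones(value: int) -> int:
    """Number of 1 bits in a non-negative integer."""
    return bin(value).count("1")


def encode_even_parity(word: int) -> int:
    """Append one parity bit so the codeword has an even number of 1s."""
    parity_bit = count_ones(word) % 2
    return (word << 1) | parity_bit


def parity_ok(codeword: int) -> bool:
    """A valid codeword has an even number of 1s."""
    return count_ones(codeword) % 2 == 0


if __name__ == "__main__":
    codeword = encode_even_parity(0b1011001)
    print(parity_ok(codeword))          # prints True
    print(parity_ok(codeword ^ 0b100))  # prints False (one bit flipped)
```

A single parity bit detects any odd number of bit errors but cannot correct them; codes such as Hamming codes add more check bits to allow correction.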
Time Redundancy:
Transient fault detection
Permanent fault detection
Re-computation for error correction
Time Redundancy: Transient Fault Detection
The fundamental concept is to perform the same computation
two or more times and compare the results to determine if a
discrepancy exists.
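The repeat-and-compare idea can be sketched as a small wrapper. The exception name `DiscrepancyError` and the function name `compute_twice` are hypothetical, and the computation is assumed to be deterministic when no fault occurs.

```python
from typing import Any, Callable


class DiscrepancyError(Exception):
    """Raised when two runs of the same computation disagree."""


def compute_twice(computation: Callable[..., Any], *args: Any) -> Any:
    """Run the computation twice and compare the results.

    A disagreement suggests that a transient fault affected one of the runs.
    """
    first = computation(*args)
    second = computation(*args)
    if first != second:
        raise DiscrepancyError(f"results differ: {first!r} != {second!r}")
    return first


if __name__ == "__main__":
    print(compute_twice(pow, 2, 10))  # prints 1024
```

This detects a fault only if it affects the two runs differently, which is why a permanent fault needs a different treatment.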
Time Redundancy: Permanent Fault Detection
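[Figure: Permanent fault detection with time redundancy. At time t0 the computation is performed on the data and the result is stored. At time t1 the data is encoded, the computation is repeated, and the result is decoded and stored. The two stored results are compared, and a mismatch raises an error.]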
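The encode, compute, decode, and compare flow in the figure can be sketched with a one-bit left shift as the encoding and a right shift as the decoding. The choice of shift as the encoding, the AND unit with a stuck-at-1 output bit, and all names are assumptions for illustration; the datapath is assumed to be wide enough that the shift loses no bits.

```python
from typing import Callable, Optional

AndUnit = Callable[[int, int], int]


def make_and_unit(stuck_at_one_bit: Optional[int] = None) -> AndUnit:
    """Model an AND unit, optionally with one output bit stuck at 1."""
    def and_unit(a: int, b: int) -> int:
        result = a & b
        if stuck_at_one_bit is not None:
            result |= 1 << stuck_at_one_bit  # the permanent fault
        return result
    return and_unit


def detect_permanent_fault(unit: AndUnit, a: int, b: int) -> bool:
    """Return True if the plain and the encoded computations disagree."""
    result_t0 = unit(a, b)                 # time t0: plain operands
    result_t1 = unit(a << 1, b << 1) >> 1  # time t1: encode, compute, decode
    return result_t0 != result_t1


if __name__ == "__main__":
    print(detect_permanent_fault(make_and_unit(), 0b0011, 0b0101))   # prints False
    print(detect_permanent_fault(make_and_unit(2), 0b0011, 0b0101))  # prints True
```

Because the encoding moves every data bit to a different position, the same faulty bit corrupts a different part of each result, so the two results disagree.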
Time Redundancy: Re-computation for Error Correction
The time redundancy approach can also provide error correction if the computations are repeated three or more times.
Consider the example of a logical AND operation. Suppose the operation is performed three times: first, without shifting the operands; second, with a one-bit logical shift of the operands; and third, with a two-bit logical shift of the operands.
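The three-pass AND example can be sketched as follows, with a bitwise majority vote over the three shifted-back results. The assumptions here are left shifts, a datapath two bits wider than the operands, and a fault modeled as one output bit stuck at 1; the names are hypothetical.

```python
from typing import Callable, Optional

AndUnit = Callable[[int, int], int]


def make_and_unit(stuck_at_one_bit: Optional[int] = None) -> AndUnit:
    """Model an AND unit, optionally with one output bit stuck at 1."""
    def and_unit(a: int, b: int) -> int:
        result = a & b
        if stuck_at_one_bit is not None:
            result |= 1 << stuck_at_one_bit
        return result
    return and_unit


def bitwise_majority(x: int, y: int, z: int) -> int:
    """Each result bit is the majority of the three corresponding bits."""
    return (x & y) | (x & z) | (y & z)


def corrected_and(unit: AndUnit, a: int, b: int) -> int:
    """Compute a AND b three times with 0-, 1-, and 2-bit shifts, then vote."""
    results = [unit(a << shift, b << shift) >> shift for shift in (0, 1, 2)]
    return bitwise_majority(*results)


if __name__ == "__main__":
    faulty = make_and_unit(stuck_at_one_bit=2)
    print(faulty(0b0011, 0b0101))                 # prints 5 (wrong)
    print(corrected_and(faulty, 0b0011, 0b0101))  # prints 1 (correct)
```

After shifting back, the single faulty bit lands on a different bit position in each of the three results, so every position is wrong in at most one result and the majority vote recovers the correct value.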