FAULT-TOLERANT COMPUTING
Jenn-Wei LinDepartment of Computer Science and Information Engineering
Fu Jen Catholic University
Motivation and IntroductionLecture Set 1
ECE 753 Fault Tolerant Computing 2
General Information• Textbook
– Marin L. Shooman: Reliability of Computer Systems and Networks: Fault Tolerance, Analysis, and Design, John Wiley and Sons, 2002.
– D.P. Siewiorek and R.S. Swarz: Reliable Computer Systems: Design and Evaluation, 3rd ed. A. K. Peters, 1999.
– D. K. Pradhan, editor, Fault-Tolerant Computer System Design, Prentice-Hall, 1996. The book is out of print
• Paper– Dependable Computing Conference
• Grading Policy– Exam. 20%– Presentation 40% (four)– Term report & Project 40%
ECE 753 Fault Tolerant Computing 3
Overview• Motivation
• Introduction
• Terminology
• Fundamental Principles
• Fault-Error-Failure concept
ECE 753 Fault Tolerant Computing 4
Motivation
• Informal Definition
• Key Attributes
• Who, What and Why Study
• Examples
ECE 753 Fault Tolerant Computing 5
Motivation
• What is Fault-Tolerance?
A “fault-tolerant system” is one that continues to perform at desired level of service in spite of failurs in some componetns that constitute the system.
ECE 753 Fault Tolerant Computing 6
Motivation (contd.)• Who is concerned about fault-tolerance?
– System Users• Who is concerned at design stages?
– Universities• R, d, and a (Research, development,
applications)– Industry
• r, D, and A (research, Development, Applications)
ECE 753 Fault Tolerant Computing 7
Motivation (contd.)
Examples
• General Purpose Systems– PCs: RAMs with parity checks– Workstations: error detection (HW), occasional
corrective action (SW), ECC (HW), keeping log (SW)
ECE 753 Fault Tolerant Computing 8
Motivation (contd.)
Examples
• Reliable Systems– Telephone systems– Banking systems e.g. ATM– Stock market– Football games display/ticketing
ECE 753 Fault Tolerant Computing 9
Motivation (contd.)
Examples
• Critical and Life Critical Systems– Manned and unmanned space borne systems– Aircraft control systems– Nuclear reactor control systems– Life support systems
ECE 753 Fault Tolerant Computing 10
Motivation (contd.)
Examples
• Reliable -> Critical Systems– 911 telephone switching system– Traffic light control system– Automobile control system (ABS, Fuel
injection system)
ECE 753 Fault Tolerant Computing 11
Introduction
– Historical perspective and major push
– Goals of fault-tolerance
– Applications of fault-tolerance
ECE 753 Fault Tolerant Computing 12
Introduction (contd.)
• Historical Perspective– not a new concept
– first use by J. van Neumann 1956
• Major push– Space program
– HW Fault tolerance - then
– SW Fault tolerance later
– Merge the two
ECE 753 Fault Tolerant Computing 13
Introduction (contd.)
• Applications– Space borne system
• long life system
– Airplane control system• critical system
– Transaction processing system• high availability system
– Switching system• high availability over certain level of performance
ECE 753 Fault Tolerant Computing 14
Terminology
• Reliability and concept of probability– R(t): conditional probability that a system provides continuous
proper service in the interval [0,t] given that it provided desired service at time 0.
• Availability– The probability that an item is up at any point in time– Uptime/(Uptime+Downtime)
• Dependability– Property of computer system that allows
reliance to be placed justifiably on service it delivers
ECE 753 Fault Tolerant Computing 15
Fundamental Principles
• Dependability
• Impairments– Faults, errors, failures
• Means– Fault Avoidance, Fault Tolerance, Fault Removal, Fault
Forecasting
• Measures– Reliability, Availability, Maintainability
ECE 753 Fault Tolerant Computing 16
Fundamental Principles (contd.)
• A set of methods, tools and solutions that enable development of dependable systems.
- Fault Prevention: how to prevent fault occurrence or
introduction,- Fault Tolerance: how to ensure a service up to fulfilling the system’s function in the presence of faults,- Fault Removal: how to reduce the presence (number seriousness) of faults,- Fault Forecasting: how to estimate the present number, the future incidence, and the consequences of faults
ECE 753 Fault Tolerant Computing 17
Fundamental Principles (contd.)
• Fault Avoidance: To prevent by construction fault occurrence. E.g., nearly fault-free components, shielding against electromagnetic fields– Drawbacks:
- Cost of near-perfect components high- Cost of maintenance personnel
• Fault Tolerance: To provide, by redundancy, service complying with specification in spite of faults occurring
ECE 753 Fault Tolerant Computing 18
Fundamental Principles (contd.)
• Fault Removal: To minimize, by verification, the presence of faults. E.g. Am I building the right system? Concepts of coverage,etc.
• Fault Forecasting: To estimate, by evaluation, the presence, occurrence and consequences of faults. E.g. For how long will the system be right ?
ECE 753 Fault Tolerant Computing 19
Fundamental Principles (contd.)
• Reliability: A measure of continuous delivery of proper service (or equivalently, of the time to failure) from a reference initial time
• Availability: A measure of the delivery of the proper service with respect to the alternation of delivery of proper and improper service
• Maintainability: A measure of continuous delivery of improper service (time to restoration or repair)
ECE 753 Fault Tolerant Computing 20
Fundamental Principles (contd.)
• Hardware redundancy• Low level
• High level
• Software Redundancy• Time Redundancy• Information Redundancy
ECE 753 Fault Tolerant Computing 21
Fundamental Principles (contd.)
• Hardware Redundancy - Low level– logic level
• Example 1 - Self checking circuits
• Example 2 - Arithmetic code A modular adder using the mathematical principle
(A+B+|) mod k = ((A mod k) + (B mod k)) mod k
• Hardware Redundancy - High level– Triplicate or 5-copies as in space shuttle
ECE 753 Fault Tolerant Computing 22
Fundamental Principles (contd.)
• Software Redundancy – Use two different programs/algorithms
• Time Redundancy– Re-compute or redo the task and compare the results
– May or may not use the same hardware/software
• Information Redundancy– backup information
– Use of ECC
• Question - What kind of FT is achieved?
ECE 753 Fault Tolerant Computing 23
Fault-Error-Failure concept
• Intuitive definitions• Origins of faults• Methods to break FEF chain• Attribute of faults
ECE 753 Fault Tolerant Computing 24
Fault-Error-Failure concept (contd.)
Intuitive definitions
• Fault -– An anomalous physical condition caused by a
manufacturing problem, fatigue, external disturbance (intentional or un-intentional), desgin flaw, …
– Causes
• Error - Effect of activation of a fault
• Failure - over-all system effect of an error
Fault -> Error -> Failure
ECE 753 Fault Tolerant Computing 25
Fault-Error-Failure concept (contd.)
• Failure occurs when the delivered service deviates from the specified service; failures are caused by errors
• Error is the manifestation of a fault within a program or data structure
• Fault is an incorrect state of hardware or software resulting from failures of components, physical interferences from the environment, operator error or incorrect design
ECE 753 Fault Tolerant Computing 26
Fault-Error-Failure concept (contd.) Causes of faults
• Specification mistakes– Incorrect algorithms, architectures, or hardware and software design
specification• Implementation mistakes
– Process of transforming hardware and software specifications into the physical hardware and the actual software
– Poor design, poor component selection, poor construction, software coding mistakes
• Component defects– Manufacturing imperfections, random device defects, and component
wear-out• External disturbance
– Radiation, electromagnetic interference, operator mistakes, battle damage, and environmental extremes
ECE 753 Fault Tolerant Computing 27
Fault-Error-Failure concept (contd.)
Causes of faultsSpecification
Mistakes
Implementaion Mistakes
External Disturbances
Component Defects
SoftwareFaults
HaredwareFaults
ErrorsSystemFailures
ECE 753 Fault Tolerant Computing 28
Fault-Error-Failure concept (contd.) Characteristics of faults
• Fault nature– Specify the type of fault
• Is the fault a hardware or a software fault?• Fault duration
– Specify the length of time that a fault is active• Permanent fault• Transient fault
– Appear and disappear within a very short period of time• Intermittent fault
– Appear, disappear, and reappear repeatedly• Fault extent
– Fault is localized to a given hardware or software module or globally affects the hardware, the software, or both.
• Fault value– Determinate or indeterminate
• Fault sensitive to either the data or time