a review of fault management techniques used in safety-critical avionic systems

Pergamon Prog. Aerospace Sci. Vol. 32, pp. 415-431, 1996 Copyright © 1996 Elsevier Science Ltd

Printed in Great Britain. All rights reserved 0376-0421/96 $29.00

0376-0421 (95) 000010-0

A REVIEW OF FAULT MANAGEMENT TECHNIQUES USED IN SAFETY-CRITICAL AVIONIC SYSTEMS

David M. Johnson

Department of Aerospace Engineering, University of Bristol, Bristol BS8 1TH, U.K.

(Received for publication 14 November 1995)

Abstract--In order to achieve high integrity levels in complex, real-time, safety-critical systems, it is necessary to detect failures and take appropriate fault recovery action, to maintain safe system operation or fail to a safe state. It may also be necessary to alert the operator to the failure. In order to take appropriate maintenance action it is also necessary to isolate the failed component. This process is termed fault management. Airline experience with modern avionic systems is that, despite the apparent sophistication of the Built-In Test Equipment and Centralised Maintenance Systems, spurious fault detection is unacceptably high. Fault detection coverage is not uniformly good and fault isolation is often inaccurate or imprecise. This paper presents a critical analysis of the methods currently used in fault management, in the light of personal experience of safety-critical systems development within the aircraft industry and work by other researchers. It makes recommendations about the use of the various approaches and attempts to highlight areas where future research could be most usefully directed. It also assesses the impact that new avionics architectures may have on the utility of the various approaches to fault management in future aircraft systems. Copyright © 1996 Elsevier Science Ltd.

CONTENTS

1. INTRODUCTION
2. FAULT DETECTION AND FAULT ISOLATION
2.1. Analysis of fault detection requirements
2.1.1. Failure modes and effect analysis (FMEA)
2.1.2. Fault tree analysis (FTA)
2.1.3. Byzantine resilience (BR)
2.1.4. Analysis of common mode failure (CMF)
2.1.5. Analysis of human error
2.1.6. Summary of fault analysis methods
2.2. Fault detection and isolation methods
2.2.1. Reasonableness tests
2.2.2. Behaviour tests
2.2.3. Continuity tests
2.2.4. Performance tests
2.2.5. Comparisons between redundant elements
2.2.6. Off-line tests
2.2.7. Summary of fault detection and fault isolation methods
3. FAULT TOLERANCE AND SYSTEM RECONFIGURATION
3.1. Fault recovery
3.1.1. Continued operation in degraded mode
3.1.2. Failure to a safe passive state
3.1.3. Failure absorption
3.1.4. System reconfiguration
3.1.5. Direct failure recovery
3.1.6. Operational limitation
3.2. Fault tolerant design
3.2.1. FTA/FMEA approach
3.2.2. Byzantine resilience
3.3. Fault recovery and system reconfiguration summary
4. THE IMPACT OF NEW INTEGRATED AVIONICS ARCHITECTURES
4.1. Conventional systems architectures
4.2. Integrated modular avionics (IMA)
4.3. Analysis of the effects of IMA on existing fault management techniques
5. CONCLUSIONS
REFERENCES


1. INTRODUCTION

The aim of this paper is to review the fault management techniques that are currently used in the design of avionic systems. It is not intended as a review of techniques that might be applied in the future, but aims to highlight the principal problems that any new techniques should resolve. It is primarily concerned with the problems of fault detection and diagnosis and of fault tolerance, rather than the prevention of specification, design or implementation errors. These problems, and the general problem of software integrity, are, however, important issues in choosing and developing a fault management strategy and they are therefore discussed briefly. Safety and the regulatory requirements placed on avionic systems are also discussed briefly, in order to place the fault management problem in context. The paper does not, however, attempt to provide a comprehensive review of recent developments in the regulatory requirements, as these relate more to the evidence of compliance with safety objectives than to the techniques used in detecting, diagnosing and tolerating faults.

The severity of the fault management problem facing the designers of complex, high-integrity systems can be illustrated by Lufthansa's early experience of A320 operations. From their A320 fleet, they reported an average of 2000 entries in the Post Flight Reports each day. Of these 2000 entries, only 70 could be correlated with a pilot report, though there were a further 70 pilot reports with no corresponding entry in the Post Flight Report. As a result of analysis of the Post Flight Reports and pilot reports, on average 17 Line Replaceable Units were removed each day. Of these 17, only two were confirmed faulty with a fault that correlated with the reports.

This high level of spurious fault detection can largely be attributed to the requirement to achieve very high levels of safety. The regulations for commercial aircraft, as defined by JAR 25 (1) and FAR 25, (2) require that the probability of a catastrophic system failure (leading to loss of the aircraft) must be less than 10^-9 per hour of flight. Achievement of this requires the use of redundancy, the detection of failures with very low probabilities of occurrence and often an immediate response to the detection of a failure.

A number of approaches have been taken in attempting to achieve very high integrity levels in digital systems and to resolve the fault management problems. These differing approaches attempt, in various ways, to prevent or to combat the effects of random hardware failures and various types of common mode failure, including specification, design or implementation errors and external interference. A number of approaches have also been used to minimise risks due to human error.

There are four main aspects that must be considered when designing high integrity digital systems. The first is fault prevention, the second is fault detection and isolation, the third is fault tolerance and system reconfiguration, and the fourth is fault reporting and alerting of the operator. Within each of these, the possibilities of random hardware failures, of common mode failures due to specification, design or implementation errors, or external interference, or of human error must be addressed.

Fault prevention will not be discussed in detail except as it relates to human error. This will include errors in specification, design or implementation of the system as well as operator error. The human factors problems related to fault warning systems, though extremely important, are beyond the scope of this paper and will be addressed only briefly. Methods of fault detection and isolation will be examined in detail. Fault recovery procedures will then be addressed and these two aspects will then be drawn together to provide an overall view of the fault management problem. Finally, the impact of new, integrated avionics architectures will be examined so that recommendations can be made for the direction of future research.

2. FAULT DETECTION AND FAULT ISOLATION

Faults may occur in any component of the system, or in the connections or interfaces between system components. As already stated, these faults fall into two main categories:

Random hardware failures, assumed to occur with a constant random failure rate.

Common mode failures (including specification and design errors).

It is important to recognise the existence of these two types of failure. A single random hardware failure will affect only one component or element of the system whereas a common mode failure may affect more than one component or element. This is of great significance when considering the use of redundancy. Many of the fault detection schemes developed over the past 20 years have concentrated on the detection and isolation of random hardware failures. Fault detection schemes have progressed from ad hoc methods to more rigorous and comprehensive approaches. The principal approaches used today are described below. The problem of error on the part of the operator is also analysed, and approaches to the prevention of errors and to ameliorating the effects of error are discussed.

2.1. ANALYSIS OF FAULT DETECTION REQUIREMENTS

Before choosing any method of failure detection, it is first necessary to identify the failures that must be detected. Discussed below are the principal analysis techniques used today.

2.1.1. Failure Modes and Effect Analysis (FMEA)

This is a bottom-up method of analysis, and as such cannot be completed until the system design has been completed. This is a problem. It is quite common when using this approach to find that there are many failures that either cannot be detected or cannot be accurately isolated. The system design may then require modification, or periodic tests or inspections may have to be added, in order to satisfy safety, availability or maintainability requirements.

Basically the method looks at each component of the system. It examines the failure modes of each component (and the failure rate) and assesses the impact of each failure mode on the operation of the system. This is a time-consuming process and it is necessary to show, from the FMEA, that a sufficient level of fault detection can be achieved to accomplish the system safety objectives. Thus, probable failures with a severe effect must be detected and appropriate recovery action must be taken, whereas it may be acceptable not to detect highly improbable failures or failures that have no significant impact on system behaviour. Lala and Harper (3) suggest that such an analysis is not only time-consuming to perform, but also next to impossible to prove. It is very difficult to ensure that all possible failure modes and, where necessary, combinations of failure modes, have been considered and that their effects have been correctly analysed.
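The bookkeeping involved can be sketched in a few lines of Python. The worksheet rows, failure rates and the rate-weighted coverage figure below are hypothetical and serve only to illustrate the kind of record an FMEA produces; they are not taken from any real analysis.

```python
from dataclasses import dataclass

@dataclass
class FailureMode:
    component: str
    mode: str
    rate_per_hour: float   # assumed constant failure rate
    severe: bool           # effect on the system judged severe
    detected: bool         # covered by some monitor or test

# Hypothetical worksheet rows (all values are illustrative).
worksheet = [
    FailureMode("pressure sensor", "open circuit", 2e-6, True, True),
    FailureMode("pressure sensor", "slow drift", 5e-7, True, False),
    FailureMode("servo-valve", "spool jam", 1e-6, True, True),
    FailureMode("indicator lamp", "filament failure", 1e-5, False, False),
]

def undetected_severe(rows):
    """Severe failure modes that no monitor or test currently detects."""
    return [r for r in rows if r.severe and not r.detected]

def severe_coverage(rows):
    """Rate-weighted fraction of severe failure modes that are detected."""
    severe = [r for r in rows if r.severe]
    total = sum(r.rate_per_hour for r in severe)
    covered = sum(r.rate_per_hour for r in severe if r.detected)
    return covered / total

for row in undetected_severe(worksheet):
    print(f"Undetected severe failure: {row.component} / {row.mode}")
print(f"Coverage of severe failures: {severe_coverage(worksheet):.1%}")
```

A listing of this kind makes the gaps visible, but it cannot show that the worksheet itself is complete, which is the difficulty noted above.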

2.1.2. Fault Tree Analysis (FTA)

This is a top-down approach and can therefore be performed at the start of the design process and updated throughout the development. The principle here is to take hazardous events identified by a functional hazard analysis (FHA) and to analyse possible causes of those hazardous events.

This is a useful method in defining the basic architecture and overall system design. It is also useful in identifying, at an early stage of the development cycle, the fault detection requirements arising from considerations of system safety. Due to its top-down nature though, there is little hope of guaranteeing that every possible component failure mode that could cause or contribute to causing a hazardous event has been considered. For this reason, it is normal to use the FTA approach to assist in the earlier stages of the design and then to use the FMEA to try to ensure that all types of failure have been considered. This combined FTA/FMEA approach has been widely applied, and is still essentially the approach recommended by the SAE "Guidelines for Certification of Highly-Integrated or Complex Aircraft Systems". (4)
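As a rough illustration of the top-down calculation, the following Python sketch evaluates a small, entirely hypothetical fault tree. The event names and probabilities are assumptions made for illustration, the basic events are assumed to be independent, and the gate formulas are the usual ones for independent events rather than anything specific to the guidelines cited.

```python
from math import prod

def and_gate(probabilities):
    """Probability that all independent input events occur."""
    return prod(probabilities)

def or_gate(probabilities):
    """Probability that at least one independent input event occurs."""
    return 1.0 - prod(1.0 - p for p in probabilities)

# Hypothetical per-flight-hour probabilities (illustrative values only).
p_channel_a_fails = 1e-4
p_channel_b_fails = 1e-4
p_common_power_loss = 1e-7

# Hazard: loss of function = (A fails AND B fails) OR common power loss.
p_both_channels = and_gate([p_channel_a_fails, p_channel_b_fails])
p_hazard = or_gate([p_both_channels, p_common_power_loss])

print(f"Both channels fail: {p_both_channels:.3e}")
print(f"Top event: {p_hazard:.3e}")
```

In this made-up example the single common event dominates the top event, even though each channel is far less reliable than the shared supply.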


2.1.3. Byzantine Resilience (BR)

The FTA/FMEA approach requires extensive and detailed analysis, and it is argued that this analysis process is prone to error (3) and is unlikely ever to fully succeed. An alternative approach is to design the system to provide 'Byzantine resilience'. A Byzantine fault is a fault that may exhibit any arbitrary behaviour. This may include any behaviour within the physical limits of the component that may corrupt the system. A system has Byzantine resilience if it can tolerate this class of fault. If a system has Byzantine resilience, then it is guaranteed to tolerate any actually occurring component failure mode and hence it is no longer necessary to conduct a detailed analysis of the actual failure modes that may occur. Design for BR thus attempts to remove the tedium of the FTA/FMEA approach and to provide a better guarantee of fault tolerance.

The principle of the BR approach is that any deviation from expected behaviour is regarded as a fault. For components that are subject to variations in performance, due to external conditions, manufacturing tolerances, etc., the BR approach is difficult to apply. It also requires quite high levels of redundancy that may not be cost effective for all types of component. It is thus primarily applicable to fault tolerant computers, since these are less susceptible to random variations due to component tolerances, wear, ageing, etc. Using the BR approach, a computer must consist of 3n + 1 redundant channels, each identical, where n is the number of faults that must be tolerated. The redundant channels must be initialised to exactly the same state, they must receive identical inputs, they must perform identical instructions and they must be synchronised. The outputs of each of the redundant channels are compared and a failure is detected if an exact bit-wise comparison of the outputs reveals any discrepancy between the redundant channels.
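The channel count and the exact comparison described above can be sketched as follows. The function names are hypothetical, and the sketch covers only the 3n + 1 rule and the final bit-wise comparison of outputs; it does not attempt to model the initialisation, synchronisation or data exchange between channels that a complete design also requires.

```python
from collections import Counter

def channels_required(n_faults):
    """Redundant channels needed to tolerate n_faults Byzantine faults."""
    return 3 * n_faults + 1

def compare_outputs(outputs):
    """Exact bit-wise comparison of channel outputs.

    Returns (failure_detected, majority_value, disagreeing_channel_indices).
    majority_value is None when no value is held by a strict majority.
    """
    counts = Counter(outputs)
    value, votes = counts.most_common(1)[0]
    if votes * 2 <= len(outputs):
        return True, None, list(range(len(outputs)))
    suspects = [i for i, out in enumerate(outputs) if out != value]
    return bool(suspects), value, suspects

# Four channels tolerate one fault; channel 2 differs in a single bit.
outputs = [b"\x12\x34", b"\x12\x34", b"\x12\x35", b"\x12\x34"]
print(channels_required(1))
print(compare_outputs(outputs))
```

Because the comparison is exact, a single differing bit is enough to flag a channel, which is why the channels must start from identical states and receive identical inputs.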

Lala and Harper describe the principles of this approach in some detail (3) and claim that it eliminates the need for detailed failure analysis. This claim is valid up to a point but there are a number of important limitations to the approach, not least that the approach is only really applicable to the computing element. These will be addressed in detail later in the paper.

2.1.4. Analysis of Common Mode Failure (CMF)

All of the above analysis techniques deal fundamentally with the problem of random hardware failure of components. Failures due to external interference are generally dealt with by specific analyses and testing to show resilience to the various types of external threat. This might typically include analysis and testing of the effects of electro-magnetic interference (EMI) and lightning, disc burst analysis for uncontained engine rotor failures, fire containment analysis, etc. Alternatively, if the effects of external interference can be predicted, and a probability assigned to the occurrence of the interference, then the FTA and/or FMEA methods may be used.

Failures due to specification, design or implementation errors present a greater problem. The FMEA will not reveal these types of failure and the BR approach is of no use if the redundant channels contain a common fault. FTA may be used in a limited manner, in that the fault tree for a particular event may include a contribution from this category of failure. A probability of the failure cannot be included in the analysis, but an objective or integrity target can be set and fault detection requirements can be identified. It is this FTA approach that is generally used to define the software integrity requirements of a system. Software can, effectively, be treated as a component with Byzantine failure properties. The BR approach, though, is inappropriate for detecting software failures since it relies on identical software running synchronously on identical redundant computing channels.

The SAE guidelines (4) define "System Development Assurance Levels" depending on the criticality of failure for a particular system or system function. Similarly, RTCA DO-178B (5) defines software levels dependent on the worst effect of software failure. The basic principle is that the worse the possible effect of error, the greater the effort that should be put into ensuring that there are no errors. For complex systems, and for software in particular, proof of integrity to the levels required by JAR 25 (1) and FAR 25 (2) remains beyond the scope of the current state of the art, (6) despite improvements in the software development process, the continuing development of formal methods and progress in software reliability analysis techniques. Although the idea of proving the correctness of software is appealing, the use of formal proofs does not take into account the possibility of errors in the original specification on which the proof is based, or the possibility that the proof itself might be wrong. (7) Reliability analysis techniques are also limited and can only be used to demonstrate relatively modest levels of reliability. (8)

Together, the above guidelines and airworthiness regulations provide a much clearer definition of the assurance activities required of system and software developers than previous guidelines, but do not provide developers with any significant new techniques with which to address the fault management problem or to prove very high levels of software reliability. They represent in effect a definition of good practice, that has evolved over a number of years, but that has not previously been universally applied.

Dissimilarity between software (and hardware) used in redundant channels may be used to reduce the probability of malfunction due to design or implementation errors, but the benefit gained from this is difficult to quantify. The independence of dissimilar sets of software, produced from a common requirement or specification, has been questioned by many and has been shown to be unreliable. (9) To achieve greater independence, functional dissimilarity should be employed. This, ideally, should use independently developed specifications. The SAE guidelines (4) allow a reduction in assurance level for the use of functional dissimilarity, though the dissimilarity must be assured to the higher level.

2.1.5. Analysis of Human Error

The analysis of the possibilities and effects of human error is often neglected in system design or at least is not performed explicitly. It is often assumed that the operator will not give erroneous commands to the system, or that if (s)he does then it is not the fault of the system. Similarly, it is often assumed that the operator can be relied upon to take the correct and timely course of action in order to overcome the effects of failure elsewhere in the system. To some extent this approach has to be accepted. It will always be possible for the operator to cause an accident through malicious intent, gross negligence or incompetence. However, the system designer should try to design the system to minimise the risks from accidental errors made by a qualified operator. The SAE guidelines (4) require that analyses are performed to justify the assumptions made about operator behaviour and to show that reasonable precautions have been taken in the design, to protect the system from human error.

There are two separate approaches that may be taken: the prevention of error and the amelioration of the effect of error, (10) but their application has not been as systematic as might be hoped. Application of the SAE guidelines (4) should help ensure that a more systematic approach is adopted in the future, but many of the system design features discussed below are the result of accident investigation leading to system modification, rather than the systematic analysis of the possibilities of human error.

2.1.5.1. Prevention of human error

Human error may be prevented, or at least the risks reduced, by careful design of the man-machine interface. Consideration should be given to the presentation of information to the operator and to the design and positioning of controls. The workload and skill requirements placed on the operator should also be assessed to ensure that they are reasonable and environmental considerations should also be taken into account. The field of ergonomics, the design of man-machine interfaces and the associated physiological and psychological considerations are too large for detailed discussion here, but some examples from aircraft systems will illustrate the type of consideration required of the system designer.

Controls performing different functions should be physically separated and differentiated to prevent operation of the wrong control, for example to prevent inadvertent retraction of the landing gear when trying to operate the flaps. Such design features may seem obvious, but this is one of the problems: system designers, being human, tend to believe that they are automatically expert in human factors and therefore do not see the need for specialist involvement or the need to make a specific analysis.

Altimeter design evolved due to accidents caused by misreading of three-pointer altimeters fitted on early aircraft. This reduced the incidence of 'controlled flight into terrain' (CFIT) accidents but did not eliminate them. Pilots continued to rely on their visual perception and neglected to monitor the altimeter readings on approach. The use of Ground Proximity Warning Systems to try to prevent CFIT accidents is an example of a system dedicated to the prevention of human error. By providing audio warnings of the aircraft height, awareness of the aircraft situation is increased. This has further reduced CFIT accidents but has still not eliminated them.

2.1.5.2. Amelioration of the effects of human error

It is generally accepted that, whilst it is possible to reduce the probability of human error, it will not be possible to eliminate it completely (as in the case of CFIT accidents). It is, therefore, also desirable to design features which reduce the effects of human error. Examples from aircraft systems will again illustrate the approach.

It is common for landing gear control levers to include a mechanism that will prevent the pilot from selecting UP when the aircraft is on the ground. Additionally, safety features may be included to prevent the extension of the landing gear above a certain speed. The pilot may still make the error of trying to operate the landing gear but (s)he will be prevented from doing so.

The Fly-By-Wire system of the A320 includes a number of features to protect the aircraft from the effects of pilot error. These effectively prevent the pilot from operating the aircraft outside the flight envelope for which it was designed, thus preventing stall, over speed, excessive manoeuvre, etc.

A simpler example is the confirmation of inputs made from a keypad before executing the instructions. This allows the pilot to correct an erroneous input before any action is taken.

2.1.6. Summary of Fault Analysis Methods

The FTA/FMEA approach can be used to assist in the system design, to identify fault detection requirements and to show that an adequate level of fault detection coverage has been achieved. The method is widely used and accepted, though it is labour intensive and prone to error. It is important to perform both types of analysis. Neither analysis method on its own is sufficient. The effects of common mode failures can be incorporated into the FTA, though the failure probabilities may not be known. The BR approach, though it only really addresses computer faults, does not require detailed failure analysis. This is an important advantage but unfortunately the BR approach does not address the problem of common mode failures. This does not necessarily rule out the BR approach, but common mode failures must be tackled in some other way. Current certification guidelines (4, 5) allow this approach to be used, on the basis that the application of the highest assurance levels will result in system and software integrity levels of the order required, despite the fact that these levels of integrity cannot be proven.

Both the FTA/FMEA approach and the BR approach, despite their limitations, offer a more systematic analysis of failures than any method currently employed for assessing the possibilities of human error. Human factors research has produced a vast quantity of heuristic rules and guidelines defining good practice but has not yet provided the system designer with a systematic method of attacking the problem. The development and application of failure analysis and fault tolerant design techniques to the problem of human error must therefore be considered.

One further option available to the system designer is to automate functions such that intervention by the human operator is reduced or eliminated. This clearly can remove the possibility of human error in operation of the system, but it must be remembered that humans still perform certain tasks better than machines (pattern recognition and most sensory tasks, reasoning with uncertainty, etc.) and that human error remains possible in the design and implementation of the system. A further problem with automation is that it can result in a degradation of the performance of the human operator, due to an effect known as peripheralisation. (11)

2.2. FAULT DETECTION AND ISOLATION METHODS

Having identified the fault detection requirements, it is necessary to design the system such that it can detect and isolate the faults to allow appropriate recovery action to be taken. This is a major area of difficulty in the development of real-time safety-critical systems. Problems are frequently encountered with spurious fault detection and with failure to detect faults.

The BR approach to the problem uses redundancy, comparisons and voting to detect and isolate failures. It does not attempt to detect specific failure modes of components or to detect specific types of malfunction. Of the fault detection methods described below, only the comparison between redundant channels is directly applicable in the BR approach. The other types of test may, however, be used in conjunction with a BR approach, in order to improve fault isolation and to prevent propagation of failures through the BR fault containment regions.

The following types of fault detection may be appropriate, depending on the type of failure: reasonableness tests; behaviour tests; continuity tests; performance tests; comparisons between redundant elements; and off-line tests. These are discussed below.

2.2.1. Reasonableness Tests

This type of test is particularly applicable to the monitoring of sensors or other system inputs. Such parameters are normally expected to remain within prescribed limits and may be expected to change in prescribed ways. Fault detection schemes can therefore be devised which detect a failure if the parameter is not within prescribed limits or is not changing in the prescribed manner.

For example, if a pressure sensor is expected to produce an analogue output in the range 1-5V, a failure may be detected if the signal lies outside this range. If the thresholds for fault detection are set too tight, then nuisance fault detection may occur, for example, as a result of component tolerances. The thresholds would not therefore normally be set at 1V and 5V. However, if the thresholds are set too wide, then some failures may not be detected. Problems may also occur due to noise or other external, transient disturbance, so it is normal to include some filtering or confirmation of the fault condition.
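A minimal Python sketch of such a range check is given below. The class name, the 0.25 V margin and the three-cycle confirmation count are hypothetical values chosen only to illustrate the trade-off described above; only the 1-5 V nominal range comes from the example in the text.

```python
class RangeMonitor:
    """Range check with a margin and a confirmation count."""

    def __init__(self, low, high, margin, confirm_cycles):
        self.low = low - margin      # widened to absorb component tolerances
        self.high = high + margin
        self.confirm_cycles = confirm_cycles
        self.bad_count = 0
        self.failed = False          # latched once the fault is confirmed

    def update(self, value):
        """Process one sample; return True once a fault is confirmed."""
        if self.low <= value <= self.high:
            self.bad_count = 0       # a good sample resets the confirmation
        else:
            self.bad_count += 1
            if self.bad_count >= self.confirm_cycles:
                self.failed = True
        return self.failed

# Nominal 1-5 V sensor; margin and confirmation count are assumptions.
monitor = RangeMonitor(low=1.0, high=5.0, margin=0.25, confirm_cycles=3)
samples = [3.0, 5.1, 6.0, 3.0, 0.2, 0.2, 0.2]
results = [monitor.update(v) for v in samples]
print(results)
```

In this sequence the 5.1 V sample falls inside the widened limits, the single 6.0 V sample is rejected as a transient, and only the three consecutive 0.2 V samples confirm a fault. A wider margin or a longer confirmation time reduces nuisance detection at the cost of slower or missed detection.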

There are two approaches that can be taken to determine the optimum detection thresholds and filtering or fault confirmation times, though it is not unusual for a more haphazard approach to be used. The first approach is to use the FMEA to identify the actual failure modes and to design the monitor to detect these specific failures. The second approach is to use the FHA and FTA to identify system level failure events and to design the monitors to protect the system from the feared events.

This type of test is usually performed within a single computing channel and is necessary in a BR system, to prevent fault propagation from one channel to another as a result of a failed input.


2.2.2. Behaviour Tests

Components of the system are expected to perform, within some prescribed limits, according to the inputs applied to them. Failure of a component to behave within such limits may be detected as a failure. Determination of monitoring algorithms for this type of failure is generally more difficult and may require the expected component behaviour to be modelled for comparison. For example, the movement of the spool of an electro-hydraulic servo-valve may be expected to respond to the servo-valve current according to a first order lag with a time constant of 10 ms. The monitoring algorithm may then use measured servo-valve current, with a model of the nominal servo-valve behaviour to determine the predicted position of the spool. This may then be compared with the measured spool position and a failure may be detected if the difference exceeds a threshold. The failure condition would normally be confirmed over several computing cycles. This technique may also be referred to as analytical redundancy.
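The servo-valve example can be sketched in Python as follows. Only the first order lag and the 10 ms time constant come from the text; the 1 ms computing cycle, unit gain, normalised threshold of 0.2 and five-cycle confirmation are assumptions, and the backward-Euler discretisation is simply one convenient way of stepping the model.

```python
class ServoValveMonitor:
    """Compares measured spool position with a first-order lag model."""

    def __init__(self, time_constant_s, dt_s, gain, threshold, confirm_cycles):
        self.alpha = dt_s / (time_constant_s + dt_s)  # backward-Euler lag step
        self.gain = gain              # steady-state position per unit current
        self.threshold = threshold
        self.confirm_cycles = confirm_cycles
        self.predicted = 0.0
        self.bad_count = 0
        self.failed = False           # latched once the fault is confirmed

    def update(self, current, measured_position):
        """Advance the model one cycle and check the measured position."""
        target = self.gain * current
        self.predicted += self.alpha * (target - self.predicted)
        if abs(measured_position - self.predicted) > self.threshold:
            self.bad_count += 1
        else:
            self.bad_count = 0
        if self.bad_count >= self.confirm_cycles:
            self.failed = True
        return self.failed

# Hypothetical parameters apart from the 10 ms time constant.
PARAMS = dict(time_constant_s=0.010, dt_s=0.001, gain=1.0,
              threshold=0.2, confirm_cycles=5)
currents = [1.0] * 50  # step demand held for 50 cycles

# Healthy valve: the measured spool position follows the nominal lag.
alpha = PARAMS["dt_s"] / (PARAMS["time_constant_s"] + PARAMS["dt_s"])
healthy_positions, x = [], 0.0
for i in currents:
    x += alpha * (PARAMS["gain"] * i - x)
    healthy_positions.append(x)

# Jammed valve: the spool stays at its initial position.
jammed_positions = [0.0] * len(currents)

healthy_monitor = ServoValveMonitor(**PARAMS)
healthy_flags = [healthy_monitor.update(i, p)
                 for i, p in zip(currents, healthy_positions)]
jammed_monitor = ServoValveMonitor(**PARAMS)
jammed_flags = [jammed_monitor.update(i, p)
                for i, p in zip(currents, jammed_positions)]
print(any(healthy_flags), jammed_flags.index(True))
```

With these assumed values the jammed spool is confirmed as failed on the seventh cycle: the prediction first exceeds the threshold on the third cycle and must then remain in error for five consecutive cycles.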

Once again, either the FMEA or the FHA and FTA may be used to assist in the definition of the algorithm, but the analysis is usually more complex. Once again, it is quite common that neither approach is used.

2.2.3. Continuity Tests

In addition to checking the reasonableness of signals and of the behaviour of components, it is also possible to check electrical continuity between components. This type of test, which may detect open circuit or short circuit failures, is particularly useful in distinguishing component failures from wiring faults. For example, electrical continuity to an electro-hydraulic servo-valve may be tested by measuring voltages across a resistance placed in the line and at the output from and return to the computer. When data bus communications are used to connect two (or more) components, the level and accuracy of fault diagnosis is usually improved, since it is possible to monitor the transmission, refreshment and parity of data.
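For the data bus case, the monitoring of parity and refreshment might be sketched as below. The odd-parity convention, the class name and the 100 ms refresh limit are assumptions for illustration and do not describe any particular bus standard.

```python
def has_odd_parity(word):
    """True if the word (including its parity bit) has an odd number of 1s."""
    return bin(word).count("1") % 2 == 1

class BusInputMonitor:
    """Checks parity and refresh of words received from a data bus."""

    def __init__(self, max_age_s):
        self.max_age_s = max_age_s
        self.last_valid_time = None   # time of the last word with good parity

    def receive(self, word, now_s):
        """Record a received word; return True if it passed the parity check."""
        if has_odd_parity(word):
            self.last_valid_time = now_s
            return True
        return False

    def is_fresh(self, now_s):
        """True if a valid word arrived within the allowed refresh period."""
        if self.last_valid_time is None:
            return False
        return now_s - self.last_valid_time <= self.max_age_s

monitor = BusInputMonitor(max_age_s=0.1)
print(monitor.receive(0b0111, now_s=0.00))   # three 1 bits: parity good
print(monitor.receive(0b0011, now_s=0.05))   # two 1 bits: parity error
print(monitor.is_fresh(now_s=0.08))          # last good word is 80 ms old
print(monitor.is_fresh(now_s=0.25))          # data no longer refreshed
```

Separating the parity result from the refresh check helps to distinguish a corrupted transmission from a transmitter or wiring fault that has stopped the data altogether.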

Requirements for this type of test can be obtained from the FTA or the FMEA, if the connections between system components are included in the analysis. The BR approach would not provide specific monitoring requirements, but would offer a fault detection method based on component and communication redundancy and comparisons between the redundant elements.

2.2.4. Performance Tests

These tests are associated with the behaviour of the processing elements of the system. Their aim is to trap software or processor failures that lead to exceptions or failure to execute in some prescribed manner within real-time constraints. Built-in exception handling (e.g. bus contention, illegal address, divide by zero), often with automatic recovery, is normally included in avionic and other safety-critical software.

In addition, the real-time execution of the software may be monitored by the use of watchdog timers, of various levels of sophistication. There are also state based methods of monitoring software operation and other methods of monitoring the performance of real-time schedulers. These more sophisticated monitoring methods, though interesting, will not be discussed further. With the availability of greatly increased processing power, it should be possible to achieve system functions using simple, deterministic scheduling methods, which do not require sophisticated monitoring techniques.
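The simplest form of watchdog can be sketched in Python as follows. In practice a watchdog timer is normally an independent hardware device; the class below, its method names and the 25 ms timeout are hypothetical and only illustrate the logic. Times are passed in explicitly, in integer milliseconds, so that the behaviour is deterministic.

```python
class WatchdogTimer:
    """Expires if it is not kicked within the timeout period."""

    def __init__(self, timeout_ms, now_ms=0):
        self.timeout_ms = timeout_ms
        self.last_kick_ms = now_ms

    def kick(self, now_ms):
        """Called by the monitored task each time it completes a cycle."""
        self.last_kick_ms = now_ms

    def expired(self, now_ms):
        """True if the task has failed to kick within the timeout."""
        return now_ms - self.last_kick_ms > self.timeout_ms

watchdog = WatchdogTimer(timeout_ms=25)

# The task completes on schedule at 10, 20 and 30 ms, then hangs.
for t in (10, 20, 30):
    watchdog.kick(t)

print(watchdog.expired(50))   # 20 ms since the last kick: still healthy
print(watchdog.expired(60))   # 30 ms since the last kick: timeout exceeded
```

With a simple, deterministic schedule of the kind suggested above, the timeout can be set just above the known cycle time.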

Most of these methods are used, to some degree, in most avionic applications. Although formal processes for determining the required tests are not well defined, there is a wealth of experience, in the use of this type of test, from which to draw. The proof of the effectiveness of the tests is, however, rarely attempted.


2.2.5. Comparisons Between Redundant Elements

Redundancy, with comparisons between the redundant elements, may be used to detect and isolate failures. This is most commonly used in the sensing and processing elements of a system. Redundancy of actuation elements is often also employed, but it is less common to compare their performance to detect and isolate faults. Comparisons may be made between two or more redundant elements. These elements may be identical or dissimilar and the comparisons made may be exact or approximate.

The BR approach to fault management uses identical redundancy and exact bit-wise comparison. A similar approach is used on the space shuttle, though to protect from common mode failure there is additional, manually selectable, dissimilar redundancy. Other approaches use identical redundancy but with less strict synchronisation and initialisation requirements, and approximate comparison (e.g. the Boeing 737 Yaw Damper). Others use dissimilar redundancy and approximate comparison (e.g. the A320 Slat and Flap Control Computer and the Boeing 777 Flight Control Computers) or functionally dissimilar monitoring (e.g. some monitoring performed by the A340 Brakes and Steering Control Unit). It is not possible to use dissimilar redundancy and exact bit-wise comparisons.

Lala and Harper (3) argue that methods based on approximate comparison can never be wholly successful since they must achieve a compromise between detection of all faults and the avoidance of spurious fault detection. The achievement of a satisfactory compromise is undoubtedly problematic, but depending on the objectives of the fault diagnosis it is not impossible. The argument put by Lala and Harper assumes that it is necessary to detect any deviation between the operation of the redundant channels and that components can exhibit any arbitrary mode of failure. This is not the case. A failure may be more usefully defined as the failure of a component, or piece of equipment (or software), to perform within some specified limits under specified conditions. With this definition of failure, supported by knowledge of actual failure modes, approximate comparison schemes can be developed successfully. This task is further eased if it is recognised that it is only necessary to detect failures that represent a hazard. This distinction between reliability and safety is important but is often overlooked (12).
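The compromise involved in approximate comparison can be sketched as a monitor with two tuning parameters: a magnitude threshold and a persistence (confirmation) count. Both parameters, and the function name, are assumptions introduced for this example rather than values taken from any particular system.

```python
def comparison_monitor(channel_a, channel_b, threshold, persistence):
    """Return the sample index at which a miscompare is confirmed, or None.

    A miscompare is confirmed only after `persistence` consecutive samples
    differ by more than `threshold`, so short transients are ignored.
    """
    count = 0
    for index, (a, b) in enumerate(zip(channel_a, channel_b)):
        if abs(a - b) > threshold:
            count += 1
            if count >= persistence:
                return index
        else:
            count = 0  # disagreement did not persist: start again
    return None


healthy = [10.0, 10.1, 10.2, 10.3, 10.4, 10.5]
transient = [10.0, 10.1, 12.0, 10.3, 10.4, 10.5]   # one-sample glitch
drifting = [10.0, 10.1, 12.0, 12.5, 13.0, 13.5]    # sustained divergence

no_trip = comparison_monitor(healthy, transient, threshold=0.5, persistence=3)
trip_at = comparison_monitor(healthy, drifting, threshold=0.5, persistence=3)
```

Here the single-sample glitch is not reported, while the sustained divergence is confirmed at the third consecutive disagreement. Widening the threshold or lengthening the persistence reduces spurious detection at the cost of later detection, so the values should follow from the limits within which the behaviour is considered non-hazardous.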

Based on the FHA and FTA, it should then be possible to define explicitly the behaviour of the system, or of a component of the system, that is considered hazardous. This definition of hazardous behaviour may then be used, independently of the functional requirements, as a basis for the definition of functionally dissimilar monitoring, between redundant system elements.

There are disadvantages to the BR approach as well. Common mode failures are not detected; thus a single failure (e.g. due to a common design error or component fault) may simultaneously affect all redundant channels. The assumption of Byzantine fault properties tends to lead to over-design of the system since it is designed to tolerate failure modes that will never occur, or that are sufficiently improbable not to require consideration. Todd and Yount (13) identify the need to maintain maximum decoupling between redundant channels to prevent the introduction of common mode failure paths. A further problem of the BR approach is the need to ensure fault containment within each of the redundant elements whilst providing identical initialisation conditions, close synchronisation and exchange and comparison of data between the channels. A failure within one channel must have no effect outside it and its operation must not be affected by any outside failure. The achievement of this is not straightforward and requires the application of other fault detection techniques to prevent failure propagation through the communication paths. Thus, design for Byzantine resilience does not completely eliminate the need for other failure analysis techniques and fault detection schemes.

2.2.6. Off-Line Tests

All of the above test methods are primarily applicable to continuous checking of some aspect of component or system behaviour or performance. This type of continuous monitoring may be supported by additional off-line tests. These tests may be run at power-up, on request, during specific phases of system operation (e.g. when the system is in stand-by mode), or following initial fault diagnosis. They may be used to detect failures that cannot be detected during normal system operation, to improve the accuracy of the fault isolation or to support an initial fault diagnosis.

Examples of this type of test include memory checksum tests and processor instruction set tests that may typically be performed at computer power-up. Requirements for this type of test may be determined from the FTA/FMEA.
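A power-up memory test of this kind might be sketched as follows. The use of CRC-32 from Python's standard `zlib` module is an assumption made for the example (the text refers only to a checksum), and the memory image is a stand-in for the contents of program memory.

```python
import zlib


def memory_checksum_ok(image: bytes, expected_crc: int) -> bool:
    """Compare the CRC-32 of a memory image with the stored reference value."""
    return zlib.crc32(image) == expected_crc


# Reference value computed when the (hypothetical) image is built.
program_image = bytes(range(256))
reference_crc = zlib.crc32(program_image)

# Simulate a single-bit memory fault and re-run the power-up test.
corrupted_image = bytearray(program_image)
corrupted_image[10] ^= 0x01

power_up_pass = memory_checksum_ok(program_image, reference_crc)
power_up_fail = memory_checksum_ok(bytes(corrupted_image), reference_crc)
```

The test detects that the memory contents have changed but does not locate the faulty cell, which illustrates why off-line tests are often used to support, rather than replace, other means of fault isolation.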

2.2.7. Summary of Fault Detection and Fault Isolation Methods

All of the above types of fault detection are useful and it is likely that all will be necessary in achieving good fault detection coverage and accurate fault isolation in a complex system. FTA and FMEA are useful in defining the detection requirements and designing the fault detection algorithms, but there is no single methodology available to the system designer that fully supports this task. Consequently schemes for fault detection and isolation tend to be based on a collection of different methods (often applied in an ad hoc fashion), experience and engineering judgement.

Fault detection algorithms based only on the FTA tend not to succeed. The most common problem encountered is that monitor thresholds are set unnecessarily tight resulting in spurious fault detection. Similarly fault detection algorithms based only on the FMEA rarely succeed. The main problem encountered here is the definition of monitoring algorithms without proper consideration of the need to detect the failure. Again the monitoring algorithms may be unnecessarily stringent. Both types of analysis should be used to optimise the design of monitor algorithms, just as both types of analysis should be used to identify the fault detection requirements. Generally the FTA should be used as the primary means to identify requirements and the FMEA as the primary input to the design of the monitor algorithms to meet these requirements. In this way, monitor algorithms should be designed to detect real failure modes that might cause, or contribute to, some hazardous behaviour of the system.

The BR approach to fault management appears to offer a simpler and more reliable approach to the problem. However, its applicability is limited really to the processing elements of the system, it is susceptible to common mode failures and it is reliant on other fault detection methods to prevent failure propagation between the redundant channels.

Fault detection and isolation remain major problems for the designers of complex real-time systems. There is a pressing need for improved methodologies and design tools in this area. Reliability analysis tools have been developed that automate, to some degree, the analysis process, but these tools are of more assistance in proving compliance with reliability and safety requirements at the end of the development than in aiding the design process.

Assistance in the definition of monitoring algorithms would be of great value to the system designer. Research to define methods that would support this crucial part of the design process and to develop tools to automate the process would be of particular benefit. It is recommended that such methods should consider the fault detection requirements necessary to fulfil the system safety requirements separately from those required for other reasons, and that the safety requirements should be used directly as an input to the process.

Simulation and modelling has proved a valuable aid to system developers in refining monitoring algorithms and other system functional requirements. The use of simulation allows the effectiveness and robustness of monitors to be tested more thoroughly than is possible with the real system and allows testing and validation in advance of the implementation. Further development and the wider use of simulation techniques should, therefore, also be encouraged.

Some of the types of test described above may also be applied to the monitoring of commands from the human operator. For example, the operator input may be checked for reasonableness, the operator may be asked to confirm a keyboard input by re-entering the information, or the command may be checked against the system state. Adaptation of the techniques used for the analysis of other types of fault, or the development of new analysis techniques, would aid the system designer in minimising the risks due to human error.
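The three kinds of operator input check mentioned above can be sketched together. The function name, the limits and the state names are hypothetical and serve only to make the example concrete.

```python
def validate_operator_command(value, confirm_value, limits,
                              system_state, allowed_states):
    """Return the reasons for rejecting a command (empty list if accepted)."""
    problems = []
    low, high = limits
    if not (low <= value <= high):
        problems.append("unreasonable value")             # reasonableness check
    if value != confirm_value:
        problems.append("confirmation mismatch")          # re-entry check
    if system_state not in allowed_states:
        problems.append("not permitted in current state") # state check
    return problems


# Hypothetical selected-altitude entry, in feet.
accepted = validate_operator_command(
    12000, 12000, limits=(0, 45000),
    system_state="cruise", allowed_states={"climb", "cruise", "descent"})

rejected = validate_operator_command(
    120000, 12000, limits=(0, 45000),
    system_state="ground", allowed_states={"climb", "cruise", "descent"})
```

Returning every failed check, rather than only the first, allows the operator to be told exactly why a command was refused.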

3. FAULT TOLERANCE AND SYSTEM RECONFIGURATION

Having detected and isolated a failure within the system, or on the part of the operator, it may be necessary to take some action in order to ensure continued safe operation. This action will depend on the design of the system, the effect of the failure, the criticality of the system function and the availability of fail-safe states. It should also be noted that the diagnosis may depend on the type of action to be taken (14). For example, for immediate recovery action, it may be sufficient to know that there is a fault somewhere in one lane of a replicated system, whereas for line maintenance it may be required to identify the failed unit and for shop maintenance it may be desirable to identify the failed component within the unit.

The various types of recovery action are described below, with examples of their use taken from aircraft system applications. This is followed by an assessment of the methods available to the designer, to determine the actions necessary to achieve the safety objectives for the system, and to design the system so that it can provide the required level of fault tolerance.

3.1. FAULT RECOVERY

The following types of recovery action may be taken: continued operation in a degraded mode; failure to a safe passive state; failure absorption; system reconfiguration; direct failure recovery; and operational limitation.

In addition to taking recovery action, it may also be necessary to alert the operator to the failure. This is essential if the operator is required to modify his/her control of the system as a result of the failure, or if the performance of the system is degraded. As stated previously, the methods used to alert the operator to the presence of failures are important, but will not be discussed in detail here. However, a brief outline of the warning philosophy used on the Airbus A320 is given to illustrate the key features of a warning system:

There are three classes of alert: a 'warning' (red), a 'caution' (amber) and an 'advisory' (green). Failures that have warnings require immediate pilot attention and immediate action. The pilot is alerted by a red flashing master warning light, a continuous aural warning and a message on the engine and warning display (EWD). Instructions may also be given on the EWD and additional system status information is provided on the systems display (SD). Cautions are given for failures requiring immediate pilot attention but not requiring immediate action. The pilot's attention is drawn to the caution by illumination of an amber master caution light and a single chime. Information is provided on the EWD and SD, similar to that provided for a warning. Cautions may also be given for failures that require crew awareness but that have no specific action. In this case the master caution light is not illuminated and there is no chime. Advisory messages and information may be displayed on the EWD or SD without specific 'attention getters'. These messages alert the pilot to minor failures and provide information about system status but do not require immediate attention. If several failures are present simultaneously they are prioritised for presentation to the crew. If the failures have a common cause (e.g. loss of an electrical power supply), then the primary failure (the loss of power supply) is presented as the warning or caution. The secondary failures resulting from the primary failure are listed under system status, on the EWD. If the failures do not have a common cause then they are prioritised according to the need for pilot action and pilot attention. The pilot can clear the failure from the EWD (it then remains under system status) in order to see the next failure.
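The prioritisation logic outlined above can be sketched in a few lines. This is a simplified illustration only, not the A320 implementation; the record fields and failure names are hypothetical.

```python
# Lower number = presented first.
PRIORITY = {"warning": 0, "caution": 1, "advisory": 2}


def present_failures(failures):
    """Split active failures into prioritised alerts and secondary status items.

    Each failure is a dict with 'name', 'level' and, optionally, 'caused_by'
    naming another failure. A failure whose cause is itself active is treated
    as secondary and listed under system status only.
    """
    active = {f["name"] for f in failures}
    primary = [f for f in failures if f.get("caused_by") not in active]
    status = [f["name"] for f in failures if f.get("caused_by") in active]
    ordered = sorted(primary, key=lambda f: PRIORITY[f["level"]])
    return [f["name"] for f in ordered], status


alerts, system_status = present_failures([
    {"name": "PUMP 1", "level": "caution", "caused_by": "POWER SUPPLY A"},
    {"name": "TEMP SENSOR 2", "level": "advisory"},
    {"name": "POWER SUPPLY A", "level": "warning"},
])
```

In this example the loss of the power supply is presented first, the pump failure that results from it appears only under system status, and the unrelated sensor failure follows as an advisory.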

3.1.1. Continued Operation in Degraded Mode

Not all failures require recovery action. The failure may have no overall effect upon the system or may simply degrade system performance without significant impact on safety. Alternatively the failure may affect safety, but may be of sufficiently low probability that the effect on safety can be accepted. It is, however, usually still necessary to detect and isolate the fault so that maintenance action can be taken at the appropriate time.

Loss of passive redundancy typically has no immediate effect on system operation. Subsequent failures may then have a severe effect, but no action is required as a result of the initial failure. Loss of braking on one of several braked wheels, for example, due to a tachometer failure, will degrade braking performance slightly but will not have any significant effect on safety. No failure recovery action is required. A servo-valve jam resulting in runaway of a spoiler control surface may have a significant safety effect (particularly on take-off or final approach), but the probability of the failure may be sufficiently low that the effect can be accepted, without the need to take any recovery action.

3.1.2. Failure to a Safe Passive State

A particular component may have hazardous failure modes but its continued operation may not be critical to the continued safe functioning of the system. In this case, on detecting the failure, the component may be switched to a safe passive state.

Erroneous output from a computer may be inhibited (e.g. by switching of relays) if there is redundancy available to take-over the function, or if the function is not required for the continued safe operation of the system. In some cases the complete system may be shut down. For example, certain slat and flap control system failures can be catastrophic leading to loss of the aircraft, but it is quite possible to safely continue flight and land with the system inoperative. An aileron runaway caused by a servo-valve jam may be hazardous, but lateral control of the aircraft can be safely maintained without the use of the ailerons, using spoilers and rudder. Detection of the servo-valve jam may therefore be recovered by removal of hydraulic power to the aileron.

3.1.3. Failure Absorption

Failure absorption is achieved by nullifying the effect of the failure, normally by use of a voting process. This generally requires at least triplex redundancy so that the effect of the failure can be overcome by the action of the un-failed elements.

An example of this is the use of aircraft control surface actuation arrangements that effectively sum the outputs from three (or more) redundant computing channels.
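Failure absorption by voting can be illustrated with a mid-value select of three channel outputs. This is a different mechanism from the summing actuation arrangement described above, and is used here only because it shows the principle compactly; the channel values are invented for the example.

```python
def mid_value_select(a, b, c):
    """Return the median of three redundant channel outputs."""
    return sorted((a, b, c))[1]


# One channel fails hard-over; the selected value stays with the healthy pair.
demand_high_fault = mid_value_select(2.0, 2.1, 50.0)
demand_low_fault = mid_value_select(2.0, -50.0, 2.1)
```

Whatever value the failed channel produces, the selected output is bounded by the two un-failed channels, so no explicit detection or switching is needed to absorb the first failure.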

3.1.4. System Reconfiguration

If a failure occurs which degrades system performance below some acceptable level, then it is necessary to reconfigure the system in some way in order to recover an acceptable level of operation.

Failure of an active redundant element will normally require changeover to a passive element, for example switching of control between computer channels, or use of standby actuation. In other cases a degradation of system performance may be recoverable by modifying system behaviour. For example, roll control laws used to control the ailerons may be modified in the event of failure of roll spoilers.


The system reconfiguration is not necessarily required to recover fully all aspects of system performance, but to maintain system performance at a level compatible with the probability of the fault occurrence.

3.1.5. Direct Failure Recovery

Certain types of transient failure may be directly recoverable, for example, failures caused by external interference (EMI, lightning, etc.) or by software errors.

Failure recovery in these cases may be automatic or may require some action such as a processor reset. Failures detected by the type of test described in Section 2.2.4 are often dealt with in this way, though it is normal to limit the number of reset attempts.
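A reset-based recovery with a limit on the number of attempts might be sketched as follows. The function names, the limit of three resets and the use of an exception to represent a detected failure are all assumptions made for the example.

```python
def recover_with_resets(run_task, max_resets):
    """Run a task, resetting and retrying after a detected failure.

    Returns ('ok', resets_used) if the task eventually completes, or
    ('failed_safe', resets_used) once the reset limit is exhausted.
    """
    resets = 0
    while True:
        try:
            run_task()
            return ("ok", resets)
        except Exception:
            if resets >= max_resets:
                return ("failed_safe", resets)
            resets += 1  # stands in for a processor reset


def make_faulty_task(failures_before_success):
    """Build a task that fails a given number of times, then succeeds."""
    remaining = [failures_before_success]

    def task():
        if remaining[0] > 0:
            remaining[0] -= 1
            raise RuntimeError("transient upset")

    return task


transient_result = recover_with_resets(make_faulty_task(2), max_resets=3)
permanent_result = recover_with_resets(make_faulty_task(10), max_resets=3)
```

The limit prevents a permanent fault from being treated indefinitely as a transient one: once it is reached, some other recovery action (for example, failure to a safe passive state) has to take over.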

3.1.6. Operational Limitation

The final type of action is to place operational limits on the system. This may be achieved by restricting system functionality (effectively the same as degraded operation) or by providing instructions or warnings to the user or to other systems.

For example, if anti-skid protection is lost on the braked wheels of an aircraft, an instruction can be provided to the pilot to restrict brake pressure so as to reduce the risk of tyre burst. An increased minimum landing distance may also be required. Loss of the ability to steer the nose-wheels of the aircraft from the autopilot may result in degradation of automatic landing capability. This limitation may be indicated to the autopilot system and/or the pilot.

3.2. FAULT TOLERANT DESIGN

All of the above types of fault recovery action may be used in order to provide a level of fault tolerance consistent with the safety requirements of the system. The problem for the system designer is to choose a system architecture that will provide the necessary fault tolerance and to decide where, when and how each type of action should be used.

3.2.1. FTA/FMEA Approach

Just as the FTA and FMEA can be used to identify fault detection requirements, so they can be used to identify requirements for fault recovery. It is, however, an iterative process since the requirements for fault recovery may change the system design which in turn will modify the analyses.

For the process to be applied successfully without excessive iteration, existing designs must be analysed, past experience must be used and engineering judgement must be exercised. Provided sufficient care is used in the design process, and provided that fault tolerance is considered from the beginning and throughout the design process, it is possible to achieve satisfactory results. The process relies heavily on the skill of the system designer, however, and it is not uncommon for considerations of fault tolerance to be put aside during some phases of the development. This may result in expensive redesigns or undesirable test and inspection tasks being required.

3.2.2. Byzantine Resilience

This approach has received more rigorous, analytical study. This has resulted in the production of hard rules defining redundancy and other requirements as a function of the level of fault tolerance to be achieved.

This is clearly of benefit to the system designer, though the approach does not cover all aspects of system fault management. The application of the BR approach does, however, tend to result in over-design. For example, for a system to be resilient to a single Byzantine failure, four redundant channels, or fault containment regions, are required. Depending on the criticality of the system, duplex or triplex redundancy may be shown to be sufficient using other methods. The cost of this over-design would have to be outweighed by savings in development costs gained from reduced fault analysis effort.
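The figure of four channels follows from the commonly quoted rules for Byzantine resilience, which for f simultaneous arbitrary faults call for at least 3f + 1 fault containment regions, 2f + 1 disjoint communication paths between them and f + 1 rounds of data exchange. The helper below simply evaluates these rules; its name and the dictionary keys are choices made for this sketch.

```python
def byzantine_requirements(f):
    """Minimum resources to tolerate f simultaneous Byzantine faults."""
    if f < 0:
        raise ValueError("number of faults must be non-negative")
    return {
        "fault_containment_regions": 3 * f + 1,
        "disjoint_paths": 2 * f + 1,
        "exchange_rounds": f + 1,
    }


single_fault = byzantine_requirements(1)
double_fault = byzantine_requirements(2)
```

For a single fault this gives four fault containment regions, compared with the duplex or triplex arrangements that may be justified by other methods; tolerating two such faults would require seven.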

3.3. FAULT RECOVERY AND SYSTEM RECONFIGURATION SUMMARY

Just as there are numerous methods for detecting failures, so there are numerous methods for recovering from or coping with the effects of failure. All types of failure recovery action described are useful in certain situations, but the system designer has little more than past experience, engineering judgement and perhaps some heuristic rules to guide him/her in producing a fault tolerant design.

Combined use of FTA and FMEA is required to assist in the choice of system architecture, component designs and the determination of fault recovery actions, but it is an expensive, iterative and imprecise design technique.

The BR approach to the problem provides a much clearer set of rules but its application is limited really to the processing elements of the system. Though it is potentially useful, a technique that tackles only a part of the total system problem is unlikely to gain wide acceptance. The tendency to produce over-designed solutions is also likely to limit its application.

4. THE IMPACT OF NEW INTEGRATED AVIONICS ARCHITECTURES

Much of the previous discussion, because it has focused on existing approaches to fault management, has concentrated on existing, 'conventional' systems architectures. The future development of fault management techniques must consider the types of system architecture that will be used in the future. The following discussion relates to likely developments in avionic system architectures.

4.1. CONVENTIONAL SYSTEMS ARCHITECTURES

'Conventional' systems architectures, though there are wide differences between different systems and different aircraft, all exhibit the following characteristics:

Systems are largely self-contained, though information may be exchanged between systems.

The processing elements of a system are contained in one or more 'black boxes'. These black boxes are dedicated to that system.

Sensors, actuators, etc., are connected to the black boxes by dedicated wiring carrying analogue or discrete signals.

Communication between black boxes is achieved via data buses and dedicated discrete wiring.

There is little commonality between the components of different systems.

The result of this is that systems and system components (particularly the computers) are designed and developed individually for each new aircraft programme. Consequently, the total development cost is very high and development effort is spread over a large number of separate projects. The dedication of processing elements to particular systems and the separation of the individual systems result in excess total processing capacity and replication of similar functions by different systems. Again this increases the total cost. Cost of ownership is also high since only limited redundancy can be provided, leading to system availability problems. A wide variety of spares must be stocked and maintenance procedures will vary between the different systems.

4.2. INTEGRATED MODULAR AVIONICS (IMA)

New, integrated avionics architectures have been proposed. Their aim is to reduce the costs associated with conventional avionic architectures. The main features of these architectures are listed below:

The boundaries between systems are less distinct. Different systems may use common resources.

The 'black-boxes' are replaced by 'line replaceable modules' (LRMs) mounted in a rack or cabinet. These modules are not necessarily dedicated to a particular function. The module may be utilised by several functions (e.g. a power supply module), or may change its function.

Sensors, actuators, etc., are connected to the modules via data buses. Interfacing of the sensors and actuators to the databus is achieved by localised electronics. Thus electronics and processing facilities are distributed rather than centralised.

Communication between modules is achieved via high-speed parallel data buses within the rack or cabinet.

All modules are of standard types. It is also intended to make extensive re-use of software.

This approach is known as integrated modular avionics (IMA). The introduction of IMA will have a significant impact on the fault management problem.

4.3. ANALYSIS OF THE EFFECTS OF IMA ON EXISTING FAULT MANAGEMENT TECHNIQUES

The most significant distinctions between conventional and IMA systems architectures are that the boundaries between different systems are less clearly defined with IMA and that the use of standard modules greatly increases the potential for common mode failures affecting many system functions. This will tend to complicate analysis using the existing methods and raises the need for common mode failure analysis at an aircraft level. The SAE certification guidelines clearly recognise this, and require the aircraft manufacturer to perform common cause fault analyses starting with the identification of 'aircraft-level functions'.

The results of the FHA and FTA for a particular function will no longer be able to define the processing architecture required, since processing elements will not necessarily be dedicated to that function. Instead the analysis will provide integrity requirements to be satisfied, for that function, by the processing resource. The requirements from each function will then have to be analysed together in order to determine the requirements at the rack or cabinet level. This analysis, particularly if dynamic reconfiguration is used to preserve critical functions, will require the application of new design techniques. It will also require additional coordination between the designers of the different systems so that all of their requirements can be integrated.

The use of dynamic reconfiguration offers substantial economic advantages and the potential to improve safety, but great care must be taken to avoid the introduction of new, and possibly devastating, common mode failures. One possibility that should be studied is for each IMA module to determine its function autonomously, thus avoiding reliance on one or more 'executive' modules to manage the reconfiguration.

The requirements for other components of the system, and the monitoring requirements for these components, can be addressed using the existing analysis techniques, although as already stated, there is a need for development and improvement of these techniques.

Thus there is a distinction between the central IMA processing resource and the other components of the system. The interfaces between the central processing resource and the other system components are greatly simplified with IMA, due to the replacement of dedicated wiring carrying an assortment of signal types, by data bus communications. This should allow the fault management problem to be split in two: one part considering the central processing resource, the other considering the other components of the system.

This division of the problem should make the application of the BR approach more attractive, since one of the criticisms of it was that it only considered the processing part of the system. In order to gain the maximum cost-benefit from an IMA architecture though, it is desirable to keep redundancy to the minimum required, to allow resources to be shared between different functions and to allow resources to be reallocated to new functions in the event of failure. All of this is contrary to the BR philosophy. In particular, it would seem impossible to maintain strict fault containment with dynamic system reconfiguration. The initialisation and synchronisation requirements would also be difficult to achieve.

The BR approach does not therefore appear useful with an IMA architecture. It is more likely that individual modules will employ a dual redundant architecture, probably with some form of dissimilarity, such that the modules will effectively be self-monitoring.

The use of data bus communication within a system offers potential improvements in fault diagnostic capability. It also simplifies significantly the monitoring tasks required within the avionics rack. Faults in other system components (e.g. jamming of a servo-valve) should be detected locally and reported back via the data bus. Faults in data bus communications can be readily diagnosed by monitoring data refreshment, parity, etc.
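Monitoring of data refreshment and parity on a received bus word can be sketched as below. The choice of odd parity, the maximum data age and the function names are assumptions made for the example and do not describe any particular bus standard.

```python
def odd_parity_ok(word: int) -> bool:
    """True if the word (data bits plus parity bit) has an odd number of 1s."""
    return bin(word).count("1") % 2 == 1


def data_is_fresh(last_update_s, now_s, max_age_s):
    """True if the parameter was refreshed within the allowed interval."""
    return (now_s - last_update_s) <= max_age_s


def bus_word_status(word, last_update_s, now_s, max_age_s):
    """Classify a received word as 'valid', 'stale' or 'parity error'."""
    if not data_is_fresh(last_update_s, now_s, max_age_s):
        return "stale"
    if not odd_parity_ok(word):
        return "parity error"
    return "valid"


good = bus_word_status(0b0111, last_update_s=1.00, now_s=1.05, max_age_s=0.1)
corrupt = bus_word_status(0b0110, last_update_s=1.00, now_s=1.05, max_age_s=0.1)
stale = bus_word_status(0b0111, last_update_s=1.00, now_s=1.25, max_age_s=0.1)
```

Both checks are independent of the meaning of the data, which is why communication faults can be diagnosed without knowledge of the sending system.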

This simplification of requirements within the avionics rack is, however, at the expense of placing the monitoring and fault detection requirements in the distributed electronics and processing elements, local to the system components. There is, nevertheless, an overall benefit since analyses can be performed at a component, rather than a system level, and the problems associated with the detection and isolation of wiring failures are largely removed.

If IMA features resource sharing and dynamic reconfiguration of resources, careful consideration will be necessary of how to present system failures to the flight and maintenance crews, since failed LRMs will not be dedicated to any specific function. Generally, failures should be presented to the flight crew in terms of their functional or operational effect and to the maintenance crew in a way that uniquely and clearly identifies the failed unit.

5. CONCLUSIONS

Existing analysis methods for determining fault management requirements rely heavily on the exercise of engineering judgement and the use of heuristics based on previous experience. They are time-consuming and costly to perform and rely on iteration of the design to achieve a satisfactory end result. Design to provide Byzantine resilience is more systematic, but the approach has a number of limitations and is not readily applicable to future integrated avionic architectures since the fault containment requirements would be difficult to achieve with dynamic system reconfiguration.

The existing methods are more suited to fault management considerations associated with random hardware failures. They are less suited to the treatment of common mode failures. There is a need, for safety-critical systems, to avoid common mode failures or to protect the system from their effects. Specification, design or implementation errors must either be avoided, or detected and contained by the use of dissimilar redundancy.

Fault detection coverage, accuracy of fault isolation and spurious fault detection remain major problems in system design. The existing analysis methods support the identification of monitoring requirements and definition of monitoring algorithms, but do not provide definitive solutions to the problem. The problems of monitor design are exacerbated by the use of dissimilar redundancy.


The introduction of integrated modular avionics will simplify some aspects of the fault management problem and should improve overall diagnostic capability. Fault management, in terms of fault recovery and reconfiguration, within the rack or cabinet, will be complicated by the sharing of resources and reallocation of resources in the event of failures.

Improvements in fault management analysis and design techniques will be of limited value unless the problems caused by human error are also addressed. Much research has been carried out in this area but this has not yet yielded any systematic design technique that can be used by the system designer. Research is particularly recommended in the following areas:

The use of functionally dissimilar redundancy to avoid hazardous behaviour due to specification, design or implementation errors.

Automation of the process of identifying fault detection requirements and defining fault detection algorithms and the enhancement of existing simulation techniques.

Application of fault tolerant design techniques to the problem of human error.

Dynamic reconfiguration of resources in an IMA architecture.

REFERENCES

1. JAR 25.1309 Joint airworthiness requirements (+ Advisory Material AMJ 25.1309, System Design and Analysis, Advisory Material Joint), JAA.

2. FAR 25.1309 Federal aviation regulations (+ Advisory Circular AC25.1309-1A, System Design and Analysis, Advisory Circular), FAA.

3. Lala, J. H. and Harper, R. E. (1994) Architectural principles for safety-critical real-time applications. In Proceedings of the IEEE, Vol. 82, No. 1, January 1994 (IEEE), pp. 25-40.

4. ARP 4754, Guidelines for Certification of Highly-integrated or Complex Aircraft Systems, SAE.

5. RTCA DO-178 B, Software Considerations in Airborne Systems and Equipment Certification, RTCA.

6. Shagnea, A. M. and Hayhurst, K. J. (1991) An evaluation of a DO-178 A software development process. In Proceedings of the 10th Digital Avionic Systems Conference (IEEE/AIAA), pp. 97-102.

7. Knight, J. C. and Littlewood, B. (1994) The critical task of writing dependable software. In IEEE Software, January 1994 (IEEE), pp. 16-20.

8. Brocklehurst, S. and Littlewood, B. (1992) New ways to get accurate reliability measures. In IEEE Software, July 1992, pp. 34-42.

9. Knight, J. C. and Leveson, N. G. (1986) An experimental evaluation of the assumption of failure independence in multi-version programming. In IEEE Transactions on Software Engineering, Vol. SE-12, No. 1, January 1986, pp. 96-109.

10. Hawkins, F. H. (1993) Human Factors in Flight, 2nd Edition, Chapter 2, pp. 27-55, H. W. Orlady (ed.). Ashgate Publishing Ltd.

11. Wiener, E. L. (1988) Cockpit automation. In Human Factors in Aviation, pp. 433-461, E. L. Wiener and D. C. Nagel (eds). Academic Press.

12. Leveson, N. G. (1989) Safety. In Aerospace Software Engineering, Progress in Aeronautics and Astronautics, Volume 136, pp. 319-337, C. Anderson and M. Dorfman (eds) (AIAA).

13. Todd, J. R. and Yount, L. J. (1991) Digital flight control systems: some new commercial twists. In Proceedings of the 10th Digital Avionic Systems Conference (IEEE/AIAA), pp. 79-84.

14. Rasmussen, J. (1993) Diagnostic reasoning in action. In IEEE Transactions on Systems, Man and Cybernetics, Vol. 23, No. 4, July/August 1993 (IEEE), pp. 981-992.
