ebooksclub.org reliability engineering basic concepts and applications in ict

Upload: farandi-febrianto-pratama

Post on 15-Sep-2014


Reliability Engineering


Massimo Lazzaroni, Loredana Cristaldi, Lorenzo Peretto, Paola Rinaldi, and Marcantonio Catelani

Reliability Engineering: Basic Concepts and Applications in ICT

Authors

Prof. Dr. PhD. Massimo Lazzaroni
Università degli Studi di Milano
Dipartimento di Tecnologie dell’Informazione
Via Bramante 65
I-26013 Crema
Italy
Email: [email protected]

Prof. Dr. PhD. Loredana Cristaldi
Dipartimento di Elettrotecnica
Politecnico di Milano
Piazza Leonardo da Vinci, 32
I-20133 Milano
Italy

Prof. Dr. PhD. Lorenzo Peretto
Alma Mater Studiorum - Università di Bologna
Dipartimento di Ingegneria Elettrica
Viale Risorgimento, 2
I-40136 Bologna
Italy

Dr. PhD. Paola Rinaldi
Alma Mater Studiorum - Università di Bologna
Dipartimento di Elettronica, Informatica e Sistemistica
Viale Risorgimento, 2
I-40136 Bologna
Italy

Prof. Dr. Marcantonio Catelani
Dipartimento di Elettronica e Telecomunicazioni
Università degli Studi di Firenze
via S. Marta, 3
I-50139 Firenze
Italy

ISBN 978-3-642-20982-6 e-ISBN 978-3-642-20983-3

DOI 10.1007/978-3-642-20983-3

Library of Congress Control Number: 2011928069

© 2011 Springer-Verlag Berlin Heidelberg

This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilm or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law.

The use of general descriptive names, registered names, trademarks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

Typeset & Cover Design: Scientific Publishing Services Pvt. Ltd., Chennai, India.

Printed on acid-free paper

springer.com

Preface

Nowadays, in many fields of application, it is fundamental to define and fulfil dependability requirements. For complex equipment in avionics, automotive and transportation, to name only a few examples, we have to take into account the functional requirements of the system and, in addition, its requirements in terms of Reliability, Maintainability, Availability and Safety. In other words, it is fundamental to evaluate how the functional requirements of the equipment under consideration are maintained over time under specified conditions of use.

For these reasons it is fundamental, as a starting point, to focus on the correct use of the terminology in this field. To this end, Chapter 1 gives an overview of the most important terms related to Dependability. Referring to the International Standard, we can take Dependability to be the collective term used to describe the availability performance and its influencing factors: reliability, maintainability and maintenance support performance. On the basis of this concept, it is evident that dependability gives a general description of the item (a component, a piece of equipment, a system, and so on) in non-quantitative terms. To express its performance in quantitative terms and thus to describe, measure, improve, guarantee and certify such an item, it is necessary to establish the characteristics of reliability, availability, maintainability, maintenance support performance and safety.

Taking the reliability of an item to be the probability that the item will adequately perform the specified function for a well-defined time interval in specified environmental conditions, the importance of probability and statistics in both the definition and the evaluation of reliability is clear. Chapter 2 is therefore devoted to introducing some basic concepts of probability and statistics that are necessary for reliability evaluation. In particular, the statistical point of view is developed and discussed as a first approach to the dependability features of a system under consideration. A brief overview of probability and statistics is given in the first part, and the reliability function is then derived. Furthermore, both the concept and the models of the failure rate are presented.

In Chapter 3 the techniques used to describe the performance of devices in a system are considered. To this end, the system is treated as a combination of elementary devices (subsystems and elements) that follow a well-defined functional structure. The reliability evaluation of series, parallel and mixed structures is shown and discussed in detail, and the Reliability Block Diagram is introduced as an essential tool for this purpose. The theory is developed using many practical examples. The parallel configuration is further developed in order to discuss the different types of redundancy for reliability growth: active, warm and stand-by. At the end of the chapter, two different redundancy approaches are compared: system redundancy and component redundancy. The results so obtained are fundamental during the design phase of a system, when reliability aspects have to be taken into account.

Considering that component reliability is often affected by different external and internal influencing factors (stress, environment, quality, technology, and so on), the operating profile of a component must be taken into account for good reliability predictions. Chapter 4 focuses on such aspects. It is worth remembering that, in many situations, the operating profile changes according to the type of operation of the component: we can have continuous operation or non-continuous operation, including sporadic operation. Moreover, storage conditions may have a deep impact on the reliability of the component when operating. Obviously, environmental factors need to be taken into account as well. The environment contributes to both aging and failures during the life of devices or systems. For this reason, both the duration and the intensity of environmental stresses must be included in the system operational model. In this chapter, after a brief introduction, the stress factors are analyzed and some aspects concerning component degradation are presented. Some concepts regarding the analysis of failure modes and laboratory tests are also presented.

The evaluation of the failure rate of an item is often a very difficult task. To carry out this evaluation for electronic or electromechanical equipment, it is possible to use ad hoc reliability prediction handbooks (HDBKs). In the first part of Chapter 5, a brief historical overview of such handbooks is given, introducing first generation as well as second and third generation HDBKs. As practical applications of reliability prediction in the electronic field, the USA military handbook (MIL) and the FARADA handbook are presented.

In Chapter 6 the concept of Availability is explained and discussed. Availability is defined by the international standards as the ability of an element to perform its required function under given conditions at a given instant of time or during a given time interval, assuming that any required external resources are provided. Availability is thus a concept that refers to repairable systems, that is, systems whose operating life cycle can often be described by a sequence of up (operating) and down (not operating) states. In this case, the important variables to be determined are both the time to failure and the time to repair and/or restore.

For a more detailed and exhaustive description of dependability performance, a set of well-known techniques is available in the literature. Such techniques, normally classified as quantitative or qualitative, are methods of analysis for evaluating the dependability parameters and the failure modes to which a realistically complex system is, or could be, subjected. In Chapter 7 a simplified version of the Markov model is proposed. Other techniques, which give insight into the mechanisms of system failures and identify all the potential weaknesses of the system under evaluation, are presented in Chapter 8. In particular, we refer to Failure Modes and Effects Analysis (FMEA) and Failure Modes, Effects and Criticality Analysis (FMECA). These methods are able to highlight the failure modes leading to a negative final effect, also in terms of criticality for the system, operator and/or environment. A third method presented here is Fault Tree Analysis (FTA), a deductive method for the analysis of a top event, represented by a failure condition of the system, as a function of failures of subsystems and components.

Massimo Lazzaroni
Loredana Cristaldi
Lorenzo Peretto
Paola Rinaldi
Marcantonio Catelani

Contents

1 The Concept of Measurable Quality
  1.1 Introduction
  1.2 Is Conformity Synonymous with Reliability? Some Definitions
  1.3 Failures, Faults and Their Classification
  References

2 The Concept of “Statistical” Reliability
  2.1 Introduction
  2.2 Definition of Probability
    2.2.1 Axioms of Probability
    2.2.2 Law of Large Numbers
  2.3 The Random Variables
  2.4 Probability Distribution
  2.5 The Characteristics of Reliability
    2.5.1 Reliability
    2.5.2 Failure Rate
  2.6 The Frequency Approach
    Example 1
  2.7 Models of Failure Rate
    Example 2
  2.8 Other Laws
    Exponential Law
    Log-normal Distribution
    Weibull Distribution
  References

3 Reliability Analysis in the Design Phase
  3.1 Introduction
  3.2 Reliability Evaluation of Series, Parallel and Mixed Structures
    3.2.1 The Series Functional Configuration
      Example 1
      Example 2
      Example 3
    3.2.2 The Concept of Redundancy: Parallel Functional Configuration
      Example 4
      Example 5
      Example 6
      Example 7
      Example 8
  3.3 Types of Redundancy
  3.4 Functional Configuration k out of n
    Example 9
    Example 10
    Example 11
    Example 12
    Example 13
    Example 14
    Example 15
    Example 16
  References

4 Experimental Reliability and Laboratory Tests
  4.1 Introduction
  4.2 Stress Factors
  4.3 Component Degradation
  4.4 The Prediction Approach
  4.5 Failure Modes
  4.6 Laboratory Tests on Components and Systems
  References

5 Reliability Prediction Handbooks: Evaluation of the System Failure Rate
  5.1 Introduction
  5.2 Second Generation Handbooks
  5.3 MIL-Handbook-217
  5.4 Failure Rate Data Bank (FARADA)
  5.5 Third Generation Data Banks
  5.6 Calculation of the Failure Rate
    Example 1
    Example 2
  5.7 FIT: A More Recent Unit
  References

6 Repairable Systems and Availability
  6.1 Introduction
  6.2 Mean Time To Repair/Restore (MTTR)
    6.2.1 A Particular Case
  6.3 Mean Time Between Failures (MTBF)
  6.4 The Significance of Availability in the Life Cycle of a Product
  6.5 Instantaneous Availability
  6.6 Dependability: An Evaluation of the “Level of Confidence” in Reference to the Correct Functioning of the System
  6.7 The Prerequisites of Dependability
  References

7 Techniques and Methods to Support Dependability
  7.1 Introduction
  7.2 Introduction to Quantitative Techniques
  7.3 Evaluation of Availability Using Analytical Models
  7.4 Markov Models
  7.5 Transition Matrix and Fundamental Equation
  7.6 Diagrams of State
    Case 1 - Analysis of a system with one element
      a) Non-repairable element
      b) Repairable element
    Case 2 - Analysis of a system with two elements
  7.7 Evaluation of Reliability
  7.8 Calculation of Reliability, Unreliability and Availability
  7.9 Markov Analysis of a System: Application Example
  7.10 Numerical Resolution of the System
  7.11 Possible Solutions to the Absence of Memory of the Markov Model
  References

8 Qualitative Techniques
  8.1 Introduction
  8.2 Failure Mode and Effects Analysis (FMEA)
    8.2.1 Operative Procedure
      Step 1 - Definition of the System
      Step 2 - Elaboration of Block Diagrams
      Step 3 - Definition of Basic Principals
      Step 4 - Definition of Failure Modes
      Step 5 - Identifying the causes of failures
      Step 6 - Identifying the effects of failure modes
      Step 7 - Definition of measures and methods for identifying and isolating failures
      Step 8 - Prevention of undesirable events
      Step 9 - Classification of the severity of final effects
      Step 10 - Multiple failure modes
      Step 11 - Recommendations
    8.2.2 FMEA Typology
    8.2.3 The Concept of Criticality
    8.2.4 Final Considerations on FMEA Analysis
      Example 1
      8.2.4.1 Analysis of Failure Modes: Discussion and Exclusions
      8.2.4.2 Drafting the FMEA Table
      Example 2
      Example 3
  8.3 Failure Mode, Effects, and Criticality Analysis (FMECA)
    8.3.1 Failure Modes and Their Probability
    8.3.2 Evaluation of Criticality
    8.3.3 FMECA Based on the Concept of Risk
    8.3.4 FMECA Based on the Failure Rate
      8.3.4.1 Evaluation of the Criticality Coefficient of a Component
      8.3.4.2 Failure Rate Evaluation
    8.3.5 Worksheet Examples
  8.4 Fault Tree Analysis (FTA)
    Phase 1: Fault Tree logical construction
    Phase 2: Probability evaluation of fault tree
    8.4.1 Graphical Constructing of a Fault Tree
    8.4.2 Qualitative Analysis of a Fault Tree
    8.4.3 Quantitative Analysis of a Fault Tree
    8.4.4 Advantages and Disadvantages of Fault Tree Analysis (FTA)
  8.5 An Overview Example
  References

Index

Chapter 1 The Concept of Measurable Quality

Abstract. Although the science of Reliability is long established, the various terms used in this field are often misunderstood. This happens in many contexts, including high-technology industrial and practical applications. The aim of this Chapter is to give an overview of the most important terms related to Dependability, that is, a qualitative performance of an item such as a component, a device, a piece of equipment, and so on. Referring to the International Standard, Dependability is defined as the collective term used to describe the availability performance and its influencing factors: reliability, maintainability and maintenance support performance. On the basis of this definition, it is evident that dependability gives a general description of the item in non-quantitative terms. To express its performance in quantitative terms and thus to describe, measure, improve, guarantee and certify such an item, it is necessary to establish the characteristics of reliability, availability, maintainability and maintenance support performance. Such terms are defined in this Chapter.

1.1 Introduction

In the current technological context, characterized more and more by sudden and important developments, the concept of availability assumes a role of primary importance, both in the design and in the realization of a product, whether a component or a system.

In general terms, we can think of a product as the result of a series of correlated activities, normally developed within a production process. This process transforms incoming elements (inputs), such as raw materials, technology or resources, into the desired product at the end (output) of the process.

In order to express a judgment on the qualitative level of the product thus obtained, it is useful to recall the definition of the term Quality given by the EN ISO 9000 Standard [1] which states: “the totality of characteristics of an entity that bear on its ability to satisfy stated and implicit needs”.

It is evident that any consideration of the correct design and implementation of a product, and therefore of the consequent verification of its quality level, requires a preliminary definition of its characteristics which, in general terms, can be classified as qualitative and/or quantitative. Going into more detail and still following the above-mentioned Standard, the characteristics can be of a physical nature (e.g. mechanical, electrical, chemical, etc.), of a functional nature (the speed of an automobile, the memory capacity of a computer chip, etc.), time dependent (requisites of reliability, maintainability, availability), and so on. Yet, independently of their nature and always with a view to expressing an objective evaluation of the “quality” of the product, it is necessary that such characteristics are adequately defined in measurable terms. Only in this case is it possible to verify that all requirements have been met, that is, that the expressed or implicit needs of those interested in buying or using the product in question are satisfied. In other words, it is necessary to measure and keep under control the capability that the product must achieve and for which it was designed and realized. This capability must be measured over time in order to ensure that the product is able to maintain and supply the necessary characteristics whenever requested.

This demonstrates the multiplicity and heterogeneity of the characteristics that can distinguish a product, even a technologically simple one, and consequently the importance of correctly identifying such characteristics, not only in order to verify that the objectives of the project have been attained but also, and above all, to undertake any improvements aimed at guaranteeing increasingly higher levels of quality.

1.2 Is Conformity Synonymous with Reliability? Some Definitions

Since this book is concerned with reliability and its impact as an essential requirement for a proper, modern design oriented towards competitiveness, we must pay particular attention to the time-dependent characteristics of the product. As noted previously, in addition to reliability, the characteristics of maintainability, availability and safety are also very important. In certain technological contexts and for particular applications, we talk about RAMS performance (Reliability, Availability, Maintainability and Safety) to define the life cycle of a product. It is necessary, however, not to confuse the concept of conformity with the RAMS requirements, which are quite different. In this section, we try to give as complete a picture as possible of the essential terminology. The reader can consult the bibliography for further details.

Let us consider a generic element (entity or item), which may be a component, a device, a subsystem, a functional unit, an apparatus or a system. The IEC 50 (191) standard [2] defines an element as anything that can be considered individually and that can perform one or more functions.

Conformity means the correspondence of the functional parameters of the element (performance) to pre-established values (specifications). Conformity is well defined and measurable, for example, in terms of nominal value and tolerance, percentage of defective elements, etc. We say that an element conforms if it possesses the technical capacity to do what is requested of it and for which it was designed.
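As a minimal illustration of how conformity can be measured, the following Python sketch checks measured values against a nominal value and a tolerance, and computes the percentage of defective elements in a batch. The function names, the symmetric tolerance band and the batch of resistor values are assumptions made here for illustration; they do not come from the Standards cited in the text.

```python
def conforms(measured, nominal, tolerance):
    """Return True if the measured value lies within nominal +/- tolerance.

    Assumption: the specification is a symmetric tolerance band.
    """
    return abs(measured - nominal) <= tolerance


def percent_defective(measurements, nominal, tolerance):
    """Percentage of elements in a batch that do not conform."""
    defective = sum(
        1 for m in measurements if not conforms(m, nominal, tolerance)
    )
    return 100.0 * defective / len(measurements)


# Hypothetical batch: resistors with nominal value 100 ohm, tolerance 5 ohm.
batch = [99.2, 101.7, 94.1, 100.4, 106.3, 98.8, 100.0, 103.9]
share = percent_defective(batch, nominal=100.0, tolerance=5.0)
print(f"Defective elements: {share:.1f} %")
```

Note that such a check describes the element at a single instant only; it says nothing about whether the element will keep conforming over time.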

Under determined and fixed conditions of use, this capacity must be maintained over time. This is commonly referred to as Reliability. In fact, reliability is defined [1] in qualitative terms as “the ability of an item to perform a required function under given conditions for a given time interval”. In this sense, reliability is one of the characteristics of the element that can be expressed quantitatively by means of a probability. After establishing a time interval and assuming that the element is capable of carrying out its required function at the beginning of that interval (the element conforms and the conditions of its use at time “zero” have been set), reliability corresponds to the “probability that the element is capable of performing its required function in the established time interval, under established conditions” [2].

Reliability can be determined through mathematical models (law of reliability) or measured and estimated through statistical parameters, for example the Mean Time To Failure (MTTF) or the Mean Time Between Failures (MTBF).
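As a minimal sketch of both routes, the Python fragment below assumes the exponential law of reliability, R(t) = exp(-λt), which is valid only when the failure rate λ is constant; under that assumption the MTTF equals 1/λ. It also estimates the MTTF from a sample as the arithmetic mean of observed times to failure. The function names and numerical values are hypothetical and chosen only for illustration.

```python
import math


def reliability(t, failure_rate):
    """Exponential law: probability of surviving up to time t.

    Assumption: the failure rate is constant over the interval [0, t].
    """
    return math.exp(-failure_rate * t)


def mttf_exponential(failure_rate):
    """Mean Time To Failure under the exponential law (1 / failure rate)."""
    return 1.0 / failure_rate


def mttf_estimate(times_to_failure):
    """Statistical estimate of the MTTF: mean of observed times to failure."""
    return sum(times_to_failure) / len(times_to_failure)


# Hypothetical failure rate of 2e-4 failures per hour.
lam = 2e-4
print(f"R(1000 h) = {reliability(1000.0, lam):.4f}")
print(f"MTTF (model) = {mttf_exponential(lam):.0f} h")

# Hypothetical observed times to failure, in hours.
observed = [4200.0, 5100.0, 4800.0, 6300.0]
print(f"MTTF (estimate) = {mttf_estimate(observed):.0f} h")
```

Under the exponential law the reliability at t = MTTF is exp(-1), roughly 0.37, so the MTTF should not be read as a time up to which the element is likely to survive.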

As will become clear in the following, the study of reliability makes it possible not only to evaluate the conformity of a device over time but also to compare different design solutions with the same functional characteristics; it can also identify, inside an apparatus, subsystems or critical elements that could cause a failure or malfunction of the apparatus itself, necessitating corrective action. For this reason, reliability has a determining role in modern design and constitutes a competitive element, even in the light of stricter safety requirements.

However, a working apparatus or machine, even though reliable, is still affected by inevitable degradation that will cause it, more or less rapidly, to modify or lose its technical capacity. It is therefore necessary to restore it to proper working order whenever an interruption occurs or its characteristics have become unacceptable. For this reason, merely knowing the requirements of reliability is not sufficient to represent the characteristics of an element during its life cycle. It is also necessary to take into consideration the concept of reconditioning or restoration. In this regard, the Standard [1] introduces the concept of dependability as “the collective term used to describe the availability performance and its influencing factors: reliability, maintainability and maintenance support performance.”

Based on the definition, it is evident that dependability gives a general description of the element in non-quantitative terms. To express its performance in quantitative terms and thus to describe, anticipate, measure, improve, guarantee and certify such an element, it is necessary to establish, in addition to reliability, already introduced, the characteristics of availability, maintainability and maintenance support performance. They are defined [1] below.

Availability: the ability of an item to be in a state to perform a required function under given conditions at a given instant of time or over a given time interval, assuming that the required external resources are provided.

Maintainability: the ability of an item, under given conditions of use, to be retained in, or restored to, a state in which it can perform a required function, when maintenance is performed under given conditions and using stated procedures and resources.

Maintenance Support Performance: the ability of a maintenance organization, under given conditions, to provide upon demand the resources required to maintain an item, under a given maintenance policy.


As with reliability, the characteristics of availability and maintainability can be studied with mathematical models and measured by means of statistical parameters, for example the average time for making repairs and the average time for restoring the element to proper working conditions. Other types of parameters can also be used, such as operative availability.

For some applications, it is fundamental to introduce the concept of safety, meaning the freedom from unacceptable risk of harm, which can be determined by means of the SIL (Safety Integrity Level). Safety can also be defined as the absence of catastrophic consequences on the user(s) and the environment. It should be noted that in Information Technology (IT) further requirements are mandatory for dependability: confidentiality, integrity and security. Confidentiality is the absence of unauthorized disclosure of information and integrity is the absence of improper system alterations. Finally, security can be defined as the simultaneous existence of availability for authorized users only, confidentiality, and integrity, where “improper” means “unauthorized” [3].

Combining this concept with those defined previously, it is possible to speak, for certain applications and in general terms, of RAMS requirements.

As the definition shows, safety is connected to the evaluation of risk, which means the probable rate of occurrence of a hazard causing harm and the degree of severity of the harm. Risk analysis can be carried out using FMECA techniques as presented in Chapter 8. It is therefore evident that availability, reliability, maintainability and safety are in themselves essential characteristics to define, control, maintain and improve the performance of an element over time. This is why they are listed as “key elements” when specifying product prerequisites and must be considered as standards of functional performance, “incoming” information for a correct design. Evaluating them only a posteriori, when a product is completed, is a clear indication of a bad system design and, as such, represents a solution that, from an engineering point of view, cannot be considered valid. Intervening on the availability characteristics of an apparatus already made or, even worse, of a working apparatus can at times bring about unbearable costs for both the user and the company. It creates a bad impression in terms of quality for whoever puts the product on the market. To this can be added the responsibility for defective products and the consequences, at times legal, that could arise in a user-supplier relationship through misinterpretation or by not respecting certain prerequisites of trust stipulated in the contract.

1.3 Failures, Faults and Their Classification

The time interval during which an element functions properly ends when any type of deterioration causes an unacceptable variation in the nominal characteristics of the correct use of such an element. The element ceases to perform its required function and failure occurs. Failure, therefore, is defined as the transition from a state of proper functioning to a malfunctioning state, which can be total or partial.

The time to first failure represents the total operating time of an item from the instant it is first put in an up-state until failure. The time to first failure, or simply the time to failure, represents the random variable that is manifested in the event of a failure. The objective evidence of a failure is called failure mode. Some examples of failure mode are an open circuit, absence of an incoming signal or a valve that remains closed.

Circumstances connected to the design, realization or use of an element that have led to a failure represent the cause of failure; the term failure mechanism refers to the chemical, physical or other type of process that has caused the failure. Failures can be classified according to various criteria. An important classification is made according to the cause responsible for their occurrence. The most recent terms according to [2, 3] are defined below:

1. Misuse failures, due to the application of stresses during use which exceed the stated capabilities of the item.

2. Primary failures, where the direct or indirect cause of the failure is not due to the failure of another device.

3. Induced failures or secondary failures, generated by the failure of another device.

4. Early life failure, attributable to intrinsic weaknesses in construction of the element and whose causes are normally identifiable during the manufacturing process and which are manifested during the initial use of the device.

5. Random failure, due to uncontrollable factors which occur during the “useful life” of the component and with a probability independent of time.

6. Wear out failure, generated by chemical-physical degradation phenomena and with an increasing probability of occurring with the passage of time.

Some of these definitions are useful in understanding particular areas that characterize the temporal progression of the failure rate treated in Chapter 2.

A different classification in function of the consequences as a result of a failure is defined below [2, 3]:

1. Critical failures that can cause, with a high probability, damage to persons or unacceptable material damage to other parts of the system.

2. Failures of primary importance, that, although different than those mentioned previously, can reduce the functionality of a system.

3. Failures of secondary importance are those which do not reduce the functionality of a system.

Such a classification is particularly useful for the development of techniques for analyses of availability, e.g. Failure Modes, Effects and Criticality Analysis (FMECA) and Fault Tree Analysis (FTA), as discussed in Chapter 8.

Considering instead the nature of a failure at a system level, we can identify the following [2, 3]:

1. Total failure, when variations in the characteristics of the element are such to completely compromise its function.

2. Partial failure, when the variations of one or more characteristics of the element do not completely impede its functioning.

3. Intermittent failure, characterized by a succession, generally random, of periods of normal operation and periods of failure, without any maintenance operations carried out on the device.


The occurrence of a failure brings the element to a state of fault, characterized by the inability to perform its required function, not including such inability during preventive maintenance or other planned activities [3].

It is important not to confuse the concept of failure as an event, with the concept of fault, associated with a particular state of a system.

As with failures, faults can also be classified according to appropriate criteria, which will not be treated here. It is useful however to discuss the meaning of some important activities that can be undertaken when an element is damaged. Such activities differ according to their purpose, and in particular they concern [2, 3]:

Fault diagnosis: actions taken for fault recognition, fault location and cause identification.

Fault recognition: the event of a fault being recognized.

Fault location: actions taken to identify the faulty sub-item or items at the appropriate indenture level.

Level of intervention signifies the appropriate level of subdivision of the device (in this case, the system) in regard to maintenance to be carried out [2, 3].

Fault correction: action taken after fault location to restore the ability of the faulty item to perform a required function.

Restoration: the event when the item regains the ability to perform a required function.

Repair: corrective maintenance performed on the faulty item.

Other important definitions can be found in [2, 3] and will be discussed in more detail later.

References

[1] ISO 9000:2005, Quality management systems - Fundamentals and vocabulary

[2] IEC 60050-191 ed1.0, International Electrotechnical Vocabulary. Chapter 191: Dependability and quality of service. Forecast publication date for Ed. 2.0 is 2012-06-02. IEC International Electrotechnical Commission, Geneva (December 31, 1990)

[3] Avizienis, A., Laprie, J.C., Randell, B., Landwehr, C.: Basic Concepts and Taxonomy of Dependable and Secure Computing. IEEE Transactions on Dependable and Secure Computing 1(1), 11–33 (2004)


Chapter 2 The Concept of “Statistical” Reliability

Abstract. The reliability of an item or a system can be thought of, as a first approach, as the probability that the device or the system will adequately perform the specified function for a well-defined time interval in specified environmental conditions. From this first definition, the importance of probability and statistics in both the definition and the evaluation of reliability is clear. This chapter introduces some important probability and statistics concepts necessary for reliability evaluation. In particular, the statistical point of view is developed and discussed as a first approach to the dependability features of a system or device. A brief overview of probability and statistics concepts is given in the first pages of this chapter. In particular, the axioms of probability are given in 2.2.1 and the law of large numbers is discussed in 2.2.2. Random variables are introduced in 2.3 and probability distributions are detailed in 2.4. Finally, the reliability function is derived. Furthermore, the concept of the failure rate model is defined in section 2.7. In 2.8 the most commonly used distribution laws are discussed.

2.1 Introduction

The concept of availability implies that a device or system must be able to be utilized for its life cycle while responding to properties established a priori. The specifications can vary depending on the supplier, the buyer, and/or the user or regulating authorities.

Generally speaking, it would be worthwhile for every device or system on the market to include information on its life cycle. It is therefore necessary to define, based on sufficient and pertinent experimental data, the statistical reliability of a device or system. A statistical study of reliability, which must count on the experience derived from collecting experimental data, presupposes knowledge of the concept of random phenomenon and of the meaning of the laws of distribution. It is from here that the definition reported in IEC 60050-191 [1] arises, where reliability is seen in probabilistic terms and such probability characterizes the ability expressed by “reliability.”

It is necessary to start with the concept of probability and clarify how statistics and the theory of probability work together.


2.2 Definition of Probability

The literature notes three definitions of probability: the axiomatic definition, the definition based on the concept of relative frequency and the “classic” definition. The axiomatic definition is the foundation of the mathematical theory of probability. In order to give the axiomatic definition, the following definitions are necessary [2].

• Random experiment: an experiment that can result in different outcomes, even if it is repeated in the same manner many times.

• Sample Space: the set of all possible outcomes of a random experiment, denoted as S. It is also called the sure event.

• Discrete Sample Space: when a sample space consists of a finite set of outcomes, the space is called a discrete sample space.

• Event: a subset of the sample space, as defined above, of a random experiment. The letter E is used to denote an event.

• Probability of an event: the probability of an event E, in a discrete sample space, is often written as P(E) and it equals the sum of the probabilities of the outcomes in E.

• Union: the union of two or more events is the event that consists of all outcomes that are contained in at least one of the events. The union is here denoted with E1 ∪ E2.

• Intersection: the intersection of two or more events is the event that consists of all outcomes that are contained in all the aforementioned events. The intersection is here denoted with E1 ∩ E2.

• Complement: the complement of an event (in a sample space) is the set of outcomes in the sample space that are not in the event. The complement of the event E is here denoted as $\overline{E}$.

• Mutually exclusive events: two events E1 and E2 such that E1 ∩ E2 = ∅ are called mutually exclusive events.

The following properties are also valid:

• Commutative law: A ∪ B = B ∪ A and A ∩ B = B ∩ A;

• Associative law: A ∪ (B ∪ C) = (A ∪ B) ∪ C and A ∩ (B ∩ C) = (A ∩ B) ∩ C;

• Distributive law: A ∪ (B ∩ C) = (A ∪ B) ∩ (A ∪ C) and A ∩ (B ∪ C) = (A ∩ B) ∪ (A ∩ C);

• Complement law: $A \cap \overline{A} = \emptyset$ and $A \cup \overline{A} = S$;

• Idempotent law: A ∪ A = A and A ∩ A = A;

• De Morgan’s law: $\overline{A \cup B} = \overline{A} \cap \overline{B}$ and $\overline{A \cap B} = \overline{A} \cup \overline{B}$;

• Identity law: $\overline{\overline{A}} = A$ and $A \cup (\overline{A} \cap B) = A \cup B$.
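The set laws listed above can be checked on a small example. The following is a minimal sketch using Python's built-in `set` type; the sample space `S` and the events `A`, `B`, `C` are arbitrary values chosen only for illustration, and `complement` is a helper name introduced here, not something taken from the text.

```python
# Hypothetical sample space and events, chosen only for illustration.
S = set(range(1, 11))          # sample space: the integers 1..10
A = {1, 2, 3, 4}
B = {3, 4, 5, 6}
C = {6, 7, 8}

def complement(event, space=S):
    """Outcomes of the sample space that are not in the event."""
    return space - event

# Distributive laws
distributive_union = (A | (B & C)) == ((A | B) & (A | C))
distributive_inter = (A & (B | C)) == ((A & B) | (A & C))

# Complement laws
complement_ok = (A & complement(A)) == set() and (A | complement(A)) == S

# De Morgan's laws
de_morgan_union = complement(A | B) == (complement(A) & complement(B))
de_morgan_inter = complement(A & B) == (complement(A) | complement(B))

# Identity laws
double_complement = complement(complement(A)) == A
absorption = (A | (complement(A) & B)) == (A | B)
```

A check on one example does not prove the laws, of course; it only illustrates how union, intersection and complement map onto the operators `|`, `&` and `-`.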

Now the axioms of probability can be given.


2.2.1 Axioms of Probability

Probability is the number assigned to each member of a set of events of a random experiment in which the following very important three properties are satisfied. Denoting with S the sample space and with E any event in a random experiment, the following three axioms can be written:

1. P(S) = 1

2. 0 ≤ P(E) ≤ 1

3. For two events, named for example E1 and E2, with E1 ∩ E2 = ∅, P(E1 ∪ E2) = P(E1) + P(E2).

The aforementioned axioms lead us to consider probabilities as relative frequencies. This relative frequency must be between zero and one, as stated by axiom 2. Axiom 1 is due to the fact that an outcome from the sample space occurs on every trial of an experiment, so that the relative frequency of S is one. Finally, axiom 3 implies that if two events have no common outcomes, the relative frequency of outcomes in E1 ∪ E2 is the sum of the relative frequencies of the outcomes in the two events. The reader should note that:

• P(∅) = 0

• $P(\overline{E}) = 1 - P(E)$
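As an illustration of the axioms and of the two consequences just noted, the following minimal sketch uses a fair six-sided die as an assumed discrete sample space. The events `E1` and `E2` and the helper `prob` are hypothetical names introduced here; exact fractions are used to avoid rounding effects.

```python
from fractions import Fraction

# Hypothetical example: a fair six-sided die.
S = frozenset(range(1, 7))
p_outcome = {s: Fraction(1, 6) for s in S}

def prob(event):
    """P(E) as the sum of the probabilities of the outcomes in E."""
    return sum((p_outcome[s] for s in event), Fraction(0))

E1 = {1, 2}          # "at most two"
E2 = {5, 6}          # "at least five"; E1 and E2 are mutually exclusive

axiom1 = prob(S) == 1
axiom2 = all(0 <= prob(E) <= 1 for E in (E1, E2, S, set()))
axiom3 = prob(E1 | E2) == prob(E1) + prob(E2)

p_empty = prob(set())            # consequence: P(empty set) = 0
p_complement = prob(S - E1)      # consequence: P(not E1) = 1 - P(E1)
```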

Starting from these three postulates, we can build the necessary theory for our purpose (deductive approach).

The definition based on the concept of relative frequency is founded on experimental results. If the sample space consists of n possible outcomes which are equally likely - i.e. all single element events have the same probability - then the probability of any event E is evaluable as:

$$P(E) = \frac{\text{number of elements of } E}{n} = \frac{n_E}{n} \tag{2.1}$$

It appears evident that the preceding formulation is not directly verifiable and it is apparently unusable in strict terms; a possible solution comes from the laws of large numbers and thus from the axiomatic approach. Here a brief overview of this law can be very useful.

2.2.2 Law of Large Numbers

The law of large numbers states that in repeated and independent trials having the same probability p of success in each trial, the percentage of successes approaches the chance of success as the number of trials increases (the probability of an event is seen as the limit of the relative frequency of occurrence of this event in a long series of trial repetitions). More precisely, the chance that the percentage of successes differs from the probability p by more than a fixed positive amount ε > 0 converges to zero as the number of trials n goes to infinity.


Two simple considerations are necessary here:

• The difference between the number of successes and the number of trials times the chance of success in each trial (the expected number of successes) grows, as the number of trials increases, like the square root of the number of trials.

• As n grows, the chance of a large difference between the percentage of successes and the chance of success gets smaller and smaller, even if this difference can be large in some sequences of trials. In fact, affirming that the difference between the percentage of successes and the chance of success tends to zero is not equivalent to affirming that it has a large probability of being arbitrarily close to zero.
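The behaviour described by the law of large numbers can be observed in a simple simulation. The sketch below is only illustrative: the success chance `p = 0.3`, the trial counts and the fixed seed are assumptions chosen here, and `relative_frequency` is a hypothetical helper built on Python's standard `random` module.

```python
import random

def relative_frequency(p, n, seed=0):
    """Fraction of successes in n independent trials with success chance p."""
    rng = random.Random(seed)
    successes = sum(1 for _ in range(n) if rng.random() < p)
    return successes / n

p = 0.3  # assumed chance of success, for illustration
for n in (100, 10_000, 100_000):
    freq = relative_frequency(p, n)
    # The gap |freq - p| tends to shrink as n grows, although not monotonically.
    print(n, freq, abs(freq - p))
```

Running the loop with different seeds shows that individual sequences can still deviate noticeably for small n, which is the point made in the second consideration above.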

Returning to our problem, it can be shown that:

$$\lim_{n \to \infty} P\left\{ \frac{n_E}{n} = P(E) \right\} = 1 \tag{2.2}$$

When n can be considered “large enough”, we can affirm with a certain level of confidence that the relative frequency approaches the probability of occurrence of event E.

The classic definition presupposes that P(E) can be evaluated a priori without having to use experimental data. Under the hypothesis of events of equal probability, it is necessary however to count the total number N of possible results of the experiment under consideration. If the event E occurs in N_E of these results, then:

$$P(E) = \frac{N_E}{N} \tag{2.3}$$

As amply demonstrated in the literature, the classic approach requires justifying the premise that the plausible events coincide with the probable events.

2.3 The Random Variables

In order to study a random experiment, or better the results of a random experiment, a single summarizing number can be very useful. In many random experiments the sample space is simply a description of possible outcomes. In some cases it is very useful to associate a well-defined number with each outcome in the sample space [2].

The particular outcome of the experiment is not known a priori, and therefore neither is the resulting value of the considered variable. For this reason, the variable that associates the aforementioned number with the outcome of a random experiment is a random variable, as defined in the following.

Random variable: a function that assigns a real number to each outcome in the sample space of a random experiment.


In particular, the set of possible values of a random variable X is called the range of X. Once the experiment has ended, the measured value of the random variable, sometimes known as the outcome, is denoted by x. In many experiments the measurement of interest can assume any real value in an interval; the random variable that represents this measurement is said to be a continuous random variable. In other types of experiments the measurements are limited to integers or to fractional numbers. These are two examples where the measurements are limited to discrete points on the real line. In this case the random variable is said to be a discrete random variable. The two following definitions are valid [2].

Continuous random variable: a random variable with an interval of real numbers for its range. Please note that the interval can be finite or infinite.

Discrete random variable: a random variable with a finite (or countably infinite) range.

2.4 Probability Distribution

Often, we are interested in evaluating the probability that a random variable assumes a particular value. The probability that a random variable X assumes a well-defined value x is described and evaluated with the probability distribution. For a discrete random variable the probability distribution is very simple to describe: it is a table of the possible values, each associated with its probability. In this case the probability distribution can be drawn as depicted in figure 2.1.

Fig. 2.1 Graphic representation of discrete probability distribution, with $\sum_i p(x_i) = 1$.

In other situations it is necessary or advisable to express the probability distribution by an equation. For example, in many types of experiments the quantity of interest can be represented by a continuous random variable. The range of possible values of these random variables often spans an interval of real numbers. The range of the random variable, in this case, includes all values in an interval of real numbers. Thus, the range can be seen as a continuum [2].

When a continuous random variable is to be described, a probability density function f(x) can be used to describe the probability distribution of the variable X under consideration. The probability density function f(x) can be represented as reported in figure 2.2.

Fig. 2.2 Probability density function.

The probability that X is between a and b is evaluated as the integral of the probability density function f(x) from a to b. This probability is depicted in figure 2.3.

Fig. 2.3 Probability evaluated from the area under probability density function f(x).

The area under the probability density function f(x) between a and b can thus be evaluated with the following formula:

$$P\{a \le X \le b\} = \int_{a}^{b} f(x)\,dx \tag{2.4}$$

Now it is possible to give the following definitions.


Probability density function f(x): for a continuous random variable X the probability density function is a simple description of the probabilities associated with the random variable, and it is a function such that:

$$f(x) \ge 0 \tag{2.5}$$

$$\int_{-\infty}^{+\infty} f(x)\,dx = 1 \tag{2.6}$$

$$P\{a \le X \le b\} = \int_{a}^{b} f(x)\,dx \tag{2.7}$$

It is possible to use a second way of describing the distribution of a continuous random variable as shown in the following definition.

Cumulative distribution function: The Cumulative distribution function of a continuous random variable X is:

$$F(x) = P(X \le x) = \int_{-\infty}^{x} f(\xi)\,d\xi \tag{2.8}$$

for −∞ < x < ∞. F(x), also called simply the distribution function, gives the probability that the random variable will assume a value smaller than or equal to x. Moreover, F(x) is a non-decreasing function, and F(−∞) = 0, F(+∞) = 1. Thus:

$$\int_{-\infty}^{+\infty} f(\xi)\,d\xi = 1 \tag{2.9}$$

as reported in (2.6). The derivative of the cumulative distribution function is the probability density function of the random variable X:

$$f(x) = \frac{dF(x)}{dx} \tag{2.10}$$

as long as the derivative exists. This definition leads to the conclusion that the previous equation (2.7) can be rewritten as (see also Fig. 2.4):

$$P\{a \le X \le b\} = F(b) - F(a) = \int_{a}^{b} f(x)\,dx \tag{2.11}$$
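Relation (2.11) can be checked numerically on a concrete density. The following minimal sketch assumes, purely for illustration, an exponential density with rate `lam = 0.5`; this distribution, the interval [1, 3] and the helper `integrate` (a plain trapezoidal rule) are choices made here and are not part of the text.

```python
import math

lam = 0.5  # assumed rate parameter of an exponential density (illustrative)

def f(x):
    """Probability density: f(x) = lam * exp(-lam * x) for x >= 0."""
    return lam * math.exp(-lam * x) if x >= 0 else 0.0

def F(x):
    """Cumulative distribution: F(x) = 1 - exp(-lam * x) for x >= 0."""
    return 1.0 - math.exp(-lam * x) if x >= 0 else 0.0

def integrate(g, a, b, steps=10_000):
    """Composite trapezoidal rule for the integral of g over [a, b]."""
    h = (b - a) / steps
    total = 0.5 * (g(a) + g(b)) + sum(g(a + i * h) for i in range(1, steps))
    return total * h

a, b = 1.0, 3.0
area = integrate(f, a, b)        # integral of f from a to b
difference = F(b) - F(a)         # F(b) - F(a)
print(area, difference)          # the two values should agree closely
```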

Equation (2.11) can also be derived by reasoning as follows. For a small interval Δx:

$$P\{x \le X \le x + \Delta x\} \approx f(x)\,\Delta x \tag{2.12}$$


Fig. 2.4 Cumulative distribution function; example of relationship between the Cumulative distribution function F(t) and the Probability density function f(t) for a continuous random variable.

and, finally:

$$f(x) = \lim_{\Delta x \to 0} \frac{P\{x \le X \le x + \Delta x\}}{\Delta x} \tag{2.13}$$

2.5 The Characteristics of Reliability

We start by recalling the definition of reliability of a device or system as “the ability of an item to perform a required function under given conditions for a given time interval”. This definition therefore represents a type of reliability “specification” for which it is necessary to define a “measurement” that allows for a quantitative and comparative evaluation. One notes that the concept of “performing a required function” is complementary to that of a failure: termination of the ability of an item to perform a required function (clause 3.2 IEC 60812:2006 [3]). Therefore, as with a failure, there is a life span associated with this (in probabilistic terms, this is the random variable “time to failure” or, more simply, failure time), and the quantitative evaluation of reliability is made by way of reliability as a “performance”, the MTTF (Mean Time To Failure), the failure rate λ and the MTBF (Mean Time Between Failures).

2.5.1 Reliability

For convenience, we indicate with t_f the random variable failure time, defined in the interval [0, ∞). With a fixed subinterval [0, t] and the operating conditions established for the device or system under examination, we can state that the item is reliable if it correctly carries out, in such an interval, the function or functions for which it has been designed. In probabilistic terms, we can write [4, 5]:

$$R(t) = P\{t_f > t\} \tag{2.14}$$

In fact, the reliability function is a survival function and is denoted by R(t). R(t) is the probability that the considered item (device, system and so on) will operate failure-free in [0, t]. It is worth observing that the reliability function, being a probability, is dimensionless. Given the probability density function f(t) of the time to failure, we have

$$R(t) = \int_{t}^{\infty} f(\tau)\,d\tau \tag{2.15}$$

Assuming that the system can be found in only two states, the state of correct functioning or the state of failure, we can define the function of unreliability as complementary to R(t), that is:

$$F(t) = 1 - R(t) = P\{t_f \le t\} \tag{2.16}$$

where t_f is the random variable time to failure. Moreover:

$$F(t) = \int_{0}^{t} f(\tau)\,d\tau \tag{2.17}$$

It therefore appears evident that the device or system is unreliable when there is a condition of failure in the considered interval [0, t]. From (2.9) and (2.14) it is possible to extract the failure probability density function as:

$$f(t) = \frac{dF(t)}{dt} = -\frac{dR(t)}{dt} \tag{2.18}$$

(2.18)

from which one determines the probability that the system will break down in the interval (t, t + dt). The relationship

$$\int_{0}^{\infty} f(t)\,dt = 1 \tag{2.19}$$

expresses the concept according to which the device is destined to break down with the passage of time.

For the item under consideration, we can identify two distinct situations. In the case of a non-repairable device, for example an incandescent lamp or a microprocessor, it is usual to define the Mean Time to Failure (MTTF).

This one-item non-repairable system is characterized by the (cumulative) distribution function F(t) = P(t_f ≤ t), where t_f is the failure-free operating time of the item. The time in this case can be considered as a continuous random variable. It should be noted that t > 0 and F(0) = 0. The reliability function R(t) is the probability that no failure occurs in the interval [0, t]. The problem consists in evaluating the mean of the continuous random variable t_f with density f(t):

$$E\{t_f\} = \int_{-\infty}^{+\infty} t\,f(t)\,dt \tag{2.20}$$

if the integral converges absolutely. Time is a positive random variable, so the previous equation reduces to:

$$E\{t_f\} = \int_{0}^{+\infty} t\,f(t)\,dt \tag{2.21}$$

It follows that:

$$E\{t_f\} = \int_{0}^{+\infty} t\,f(t)\,dt = \int_{0}^{+\infty} t\,\frac{dF(t)}{dt}\,dt = -\int_{0}^{+\infty} t\,\frac{dR(t)}{dt}\,dt = -\int_{0}^{+\infty} t\,dR(t) \tag{2.22}$$

and, finally:

$$E\{t_f\} = -\int_{0}^{+\infty} t\,dR(t) = \int_{0}^{+\infty} R(t)\,dt \tag{2.23}$$
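The last equality is an integration by parts. As a sketch, assuming (this is not stated explicitly in the text) that the mean is finite, so that t R(t) → 0 as t → ∞:

$$-\int_{0}^{+\infty} t\,dR(t) = -\Big[\,t\,R(t)\,\Big]_{0}^{+\infty} + \int_{0}^{+\infty} R(t)\,dt = \int_{0}^{+\infty} R(t)\,dt$$

The boundary term vanishes at t = 0 because of the factor t, and at infinity because of the stated assumption.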

The mean or expected value of the time to failure is called the Mean Time to Failure (MTTF) and is therefore given by

$$\mathrm{MTTF} = E\{t_f\} = \int_{0}^{+\infty} R(t)\,dt = \int_{0}^{+\infty} \left[1 - F(t)\right] dt \tag{2.24}$$

From this we clearly deduce that the mean time to failure for the device under consideration represents the area under the reliability function. Considering instead a system for which, following a failure, correct functioning can be restored, the MTBF (Mean Time Between Failures) can be defined. It is easy to demonstrate that, indicating the reliability of a system with R_S(t), the value of MTBF can be determined as:

$$\mathrm{MTBF} = \int_{0}^{\infty} R_S(t)\,dt \tag{2.25}$$

Both the Mean Time to Failure (MTTF) and the Mean Time Between Failures (MTBF) have the dimension of time and are typically expressed in hours.
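The interpretation of the MTTF as the area under the reliability function can be illustrated numerically. The sketch below assumes, only as an example, an exponential reliability law R(t) = exp(−λt) with a made-up failure rate; for this law the area is known to be 1/λ, which gives a value to compare against. The function `mttf_numeric` is a hypothetical helper based on the trapezoidal rule.

```python
import math

lam = 2.0e-4  # assumed constant failure rate in 1/hours (illustrative value)

def R(t):
    """Reliability of the assumed exponential model: R(t) = exp(-lam * t)."""
    return math.exp(-lam * t)

def mttf_numeric(reliability, t_max, steps=200_000):
    """Trapezoidal approximation of the area under R(t) on [0, t_max]."""
    h = t_max / steps
    total = 0.5 * (reliability(0.0) + reliability(t_max))
    total += sum(reliability(i * h) for i in range(1, steps))
    return total * h

# Truncate the infinite integral where R(t) is negligible (about exp(-40)).
t_max = 40.0 / lam
mttf = mttf_numeric(R, t_max)
print(mttf, 1.0 / lam)   # the two values should be close
```

The truncation point `t_max` is a practical choice; for reliability functions with heavier tails a larger upper limit would be needed.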


2.5.2 Failure Rate

Less intuitive but easily deduced is the expression of the failure rate λ(t). We consider the event of a failure of the device in the interval [t, t+dt]. This event, however, is conditioned by the fact that the system did not fail before time t. Conditional probability is treated in a specific way by probability theory. In fact, given an event B with P(B) > 0, one defines the probability of event A conditioned by event B as the ratio of the probability of the joint event (A ∩ B) to the probability of event B:

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)} \tag{2.26}$$

The conditioned probability of a failure can therefore be expressed as (f = failure):

$$P\{t < t_f < t + dt \mid t_f > t\} = \frac{P\{t < t_f < t + dt\}}{P\{t_f > t\}} \tag{2.27}$$

The Failure rate λ(t) of an item can be defined as:

$$\lambda(t) = \lim_{dt \to 0} \frac{1}{dt}\, P\left(t < t_f < t + dt \mid t_f > t\right) \tag{2.28}$$

thus:

$$\lambda(t) = \lim_{dt \to 0} \frac{1}{dt}\, \frac{P(t < t_f < t + dt)}{P(t_f > t)} \tag{2.29}$$

Recalling now that:

$$P(t_f \le t) = F(t) \;\rightarrow\; P(t_f > t) = 1 - F(t) \tag{2.30}$$

and, if F(t) is differentiable:

$$\lambda(t) = \lim_{dt \to 0} \frac{P(t < t_f < t + dt)}{dt} \cdot \frac{1}{1 - F(t)} = \frac{f(t)}{1 - F(t)} = \frac{f(t)}{R(t)} \tag{2.31}$$

with t > 0, F(0) = 0 and R(0) = 1.

The ratio between the conditional probability that an object breaks down in the interval [t, t+dt] and the duration dt of the interval is defined as the “Instantaneous Failure Rate”

$$\lambda(t) = \frac{f(t)}{R(t)} \tag{2.32}$$

expressed in hours⁻¹. Equation (2.32) can furthermore be expressed as:

$$\lambda(t) = -\frac{1}{R(t)}\,\frac{dR(t)}{dt} = -\frac{d}{dt} \log\left(R(t)\right) \tag{2.33}$$
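The equivalence between (2.32) and (2.33) can be checked numerically. The sketch below assumes, purely for illustration, a Weibull reliability law with made-up shape and scale parameters (this law is not introduced at this point of the text); `failure_rate` evaluates f(t)/R(t) directly, while `failure_rate_from_log` is a hypothetical helper that approximates −d/dt log R(t) with a central difference.

```python
import math

beta, eta = 1.5, 1000.0  # assumed Weibull shape and scale (hours), illustrative

def R(t):
    """Reliability of the assumed Weibull model."""
    return math.exp(-((t / eta) ** beta))

def f(t):
    """Failure density f(t) = -dR/dt for the Weibull model."""
    return (beta / eta) * (t / eta) ** (beta - 1) * R(t)

def failure_rate(t):
    """Instantaneous failure rate, lambda(t) = f(t) / R(t), as in (2.32)."""
    return f(t) / R(t)

def failure_rate_from_log(t, h=1e-3):
    """Central-difference estimate of -d/dt log R(t), as in (2.33)."""
    return -(math.log(R(t + h)) - math.log(R(t - h))) / (2.0 * h)

t = 500.0
print(failure_rate(t), failure_rate_from_log(t))  # should agree closely
```

With a shape parameter greater than one, as assumed here, the failure rate increases with time, which corresponds to the wear-out behaviour described in Chapter 1.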


2.6 The Frequency Approach

Returning to the concept of probability based on relative frequency (expressed by (2.1)) and the relation (2.4), we can define an analysis tool, the experimental histogram of relative frequency:

$$f(x)\,\Delta x = \frac{n(x)}{n} \tag{2.34}$$

From a practical point of view, this means repeating the experiment n times and counting the number of tests n(x) for which the relationship x ≤ X ≤ x+Δx is valid. Thinking in histogram terms, Δx represents the width of the classes which make up the rectangles and whose height is given by:

$$f(x) = \frac{n(x)}{n \cdot \Delta x} \tag{2.35}$$
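The construction of the experimental histogram can be sketched in a few lines. In the example below the measurements are invented purely for illustration, and `relative_frequency_histogram` is a hypothetical helper; as a practical choice the classes are taken as half-open intervals [x, x+Δx), so that a value lying exactly on a boundary is counted only once.

```python
def relative_frequency_histogram(samples, x_min, width, n_classes):
    """Return class heights f(x) = n(x) / (n * width), as in (2.35)."""
    n = len(samples)
    counts = [0] * n_classes
    for x in samples:
        k = int((x - x_min) // width)      # index of the class containing x
        if 0 <= k < n_classes:
            counts[k] += 1
    return [c / (n * width) for c in counts]

# Hypothetical measurements, invented purely for illustration.
samples = [0.2, 0.7, 1.1, 1.4, 1.6, 1.9, 2.3, 2.8, 3.1, 3.6]
heights = relative_frequency_histogram(samples, x_min=0.0, width=1.0, n_classes=4)
total_area = sum(h * 1.0 for h in heights)   # equals 1 when every sample is binned
print(heights, total_area)
```

The total area of the rectangles equals one whenever all samples fall inside the classes, mirroring condition (2.6) for a probability density function.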

We will try to arrive at a definition of reliability starting from the “empirical” definition, extracted from the analysis of failure data. We shall consider n identical and statistically independent items that are put into operation under the same conditions at time t = 0. n_h(t) indicates the number of items, out of n, that have not yet failed at a generic instant of time t (these are the items still working properly at time t). With a fixed interval Δt = t_n − t_{n−1}, we consider the ratio between n_h(t) and the total number of devices:

$$R_N(t) = \frac{n_h(t)}{n} \tag{2.36}$$

Since the definition of probability based on the concept of relative frequency (2.36) determines the probability of an event as the ratio between the number of times a certain event A takes place and the total number of experiments, the function expressed by the ratio RN(t) expresses a probability that we will define as empirical reliability. By the law of large numbers (considering n as a large number of devices), this ratio therefore approximates the function R(t) defined in paragraph 2.5.1.

The extension to the distribution function of random variable tf, that in empirical terms we will indicate with FN(t), immediately becomes

$$F_N(t) = 1 - R_N(t) = 1 - \frac{n_h(t)}{n} = \frac{n - n_h(t)}{n} = \frac{n_f(t)}{n} \tag{2.37}$$

where nf(t) indicates the number of elements that have failed by time t, considering that nh(t) + nf(t) = n (where nh are the healthy items).

The course of failures reported in Figure 2.5 furthermore suggests the definition of an experimental histogram of relative frequency, where Δt (the interval between one breakdown and the next) represents the width of the classes, whose height is given by:

$$f_N(t) = \frac{n_f(t + \Delta t) - n_f(t)}{n \cdot \Delta t} \tag{2.38}$$

From (2.15) and (2.16) we have:

$$f_N(t) = \frac{F_N(t_i + \Delta t_i) - F_N(t_i)}{\Delta t_i} \quad \text{with } t_i \le t \le t_i + \Delta t_i \tag{2.39}$$

Fig. 2.5 Course of failures.

Indicating with t1, t2, ... tn the times to failure observed for the n elements under consideration (Figure 2.5), it is possible to define the empirical mean time to failure (MTTFN) as:

$$MTTF_N = \frac{t_1 + t_2 + \dots + t_n}{n} \tag{2.40}$$

Having defined in (2.32) the “Instantaneous Failure Rate” as the ratio between the probability of the event and the observation interval, in empirical terms we can express the failure rate as the ratio between the number of elements that have broken down in the interval (t, t+Δt] and the number of elements nh(t) functioning at time t, divided by the interval duration Δt, that is:

$$\lambda_N(t) = \frac{1}{\Delta t} \cdot \frac{n_f(t + \Delta t) - n_f(t)}{n_h(t)} = f_N(t) \cdot \frac{n}{n_h(t)} = \frac{f_N(t)}{R_N(t)} \tag{2.41}$$

It is evident that the failure rate has the dimension of the reciprocal of a time and is usually expressed in hours⁻¹.

[Figure 2.5 plot: nh versus t, with the failure times t1, t2, ..., t6 marked on the time axis.]

Example 1

Considering n = 172 elements in the trial, we obtain the failure data reported in Table 2.1.

Table 2.1 Failure data for Example 1.

Time interval (h)    Failures found at end of interval
0 – 1000             59
1000 – 2000          24
2000 – 3000          29
3000 – 4000          30
4000 – 5000          17
5000 – 6000          13
Total failures       172

The evaluation of the Reliability function RE(t) is performed as reported in Table 2.2. The functions of empirical reliability and probability density, evaluated on the basis of equations (2.36) and (2.39), show the trends reported in Figures 2.6 and 2.7 (Tables 2.3 and 2.4 report the results).

Table 2.2 Table for RE(t) evaluation.

t (h)    nh     RE(t)
0        172    1
1000     113    0.657
2000     89     0.517
3000     60     0.349
4000     30     0.174
5000     13     0.076
6000     0      0
Total    172


Fig. 2.6 Trend of the empirical function RE(t) for the data reported in Example 1.

The failure rate will show a trend as seen in Figure 2.8 (Table 2.5 shows results).

Fig. 2.7 Empirical density function from data in Example 1.

[Figure 2.6 plot: RE(t) versus t (hours), from 0 to 6000 h. Figure 2.7 plot: fE(t) [10⁻³] versus t (hours), from 1000 to 6000 h.]

Table 2.3 Evaluation of empirical functions of reliability and distribution of times to failure.

t (h)    RE
0        1.0
1000     (172 - 59)/172 = 0.657
2000     0.517
3000     0.349
4000     0.174
5000     0.076
6000     0

Table 2.4 Empirical density function.

Time interval [t, t+Δt] (h)    RE(t+Δt)    FE(t+Δt)    FE(t+Δt)-FE(t)    fE [10⁻³ h⁻¹]
0-1000                         0.657       0.343       0.343             0.343
1001-2000                      0.517       0.483       0.140             0.140
2001-3000                      0.349       0.651       0.169             0.169
3001-4000                      0.174       0.826       0.174             0.174
4001-5000                      0.076       0.924       0.099             0.099
5001-6000                      0.000       1.000       0.076             0.076

Table 2.5 Failure rate.

Time interval [t, t+Δt] (h)    fE(t) [10⁻³ h⁻¹]    RE(t)    λE [10⁻³ h⁻¹]
0-1000                         0.343               1.000    0.343
1001-2000                      0.140               0.657    0.212
2001-3000                      0.169               0.517    0.326
3001-4000                      0.174               0.349    0.500
4001-5000                      0.099               0.174    0.567
5001-6000                      0.076               0.076    1.000
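The values in Tables 2.2 to 2.5 can be reproduced from the failure counts of Table 2.1 with a few lines of code. The sketch below applies equations (2.36), (2.38) and (2.41) interval by interval; the function name and the row layout are hypothetical, and the code assumes that every item has failed by the end of the trial, as is the case here (172 failures out of 172 items).

```python
def empirical_functions(failures, dt):
    """Return one row per interval: (t, n_h, R_E, f_E, lambda_E)."""
    n = sum(failures)      # assumption: all items fail by the end of the trial
    healthy = n            # n_h(t) at the start of the first interval
    rows = []
    for k, failed in enumerate(failures):
        t = k * dt
        r = healthy / n                    # eq. (2.36): empirical reliability
        f = failed / (n * dt)              # eq. (2.38): empirical density
        lam = failed / (healthy * dt)      # eq. (2.41): empirical failure rate
        rows.append((t, healthy, r, f, lam))
        healthy -= failed
    return rows

FAILURES = [59, 24, 29, 30, 17, 13]        # Table 2.1
ROWS = empirical_functions(FAILURES, dt=1000.0)

if __name__ == "__main__":
    for t, n_h, r, f, lam in ROWS:
        print(f"{t:6.0f} {n_h:4d} {r:6.3f} {f * 1e3:6.3f} {lam * 1e3:6.3f}")
```

Each printed row gives the start of the interval, the number of healthy items, RE(t), and the density and failure rate in units of 10⁻³ h⁻¹.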


Fig. 2.8 Trend of failure rate.

2.7 Models of Failure Rate

The most widespread and widely known model of the failure rate is the model defined as the “bathtub” curve. This failure rate is often exhibited when a large population of statistically identical and independent items is considered. This model arises from actuarial tables (relative to human mortality) from 1800 and is utilized in the field of insurance. Figure 2.9 shows a qualitative drawing of the bathtub curve.

As is well known, three phases characterized by different trends can be identified:

Fig. 2.9 The “bathtub” curve (qualitative).

[Figure 2.8 plot: λE(t) [10⁻³] versus t (hours), from 1000 to 6000 h.]

1. The phase immediately following the start of the life cycle of the device is characterized by a high failure rate that decreases rapidly over time. This phase is called the “infant mortality” or “early failure” phase. This behaviour derives from the existence of a “weak” fraction of the population whose defects cause a failure within a short period of time. It should be mentioned that sometimes the failure rate does not necessarily follow the trend depicted in the figure, but may also oscillate.

2. A phase with an approximately constant failure rate, whose value is determined above all by the level of stress the system is subjected to. Failures are Poisson distributed and often catastrophic.

3. A third phase, known as wear out, characterized by long time intervals and a rapidly increasing failure rate. This is the period of “wear out” failure of the devices. The failures are attributable to wear, aging and fatigue.

Phase 2, where the failure rate is more or less constant, is of particular interest in the evaluation of reliability performance.


Fig. 2.10 Trend assumed by Reliability (a) and Probability Density Functions (b) in the case where λ = constant.

Recalling the expression (2.33)

$$\lambda(t) = -\frac{1}{R(t)}\, \frac{dR(t)}{dt} = -\frac{d}{dt} \log R(t)$$

and considering as an initial condition that reliability at time 0 is at a maximum and is equal to 1, we have:

$$R(t) = e^{-\int_0^t \lambda(\tau)\, d\tau} \tag{2.42}$$

Recalling the bathtub curve and assuming that the device operates in phase 2 (a valid hypothesis above all in the electric and electronic fields), (2.42) is notably simplified, becoming:

$$R(t) = e^{-\lambda t} \tag{2.43}$$

Equations (2.32) and (2.43) lead to the Reliability and Probability Density functions shown in Figures 2.10 (a) and (b).

Based on the preceding, it therefore follows that:

$$MTTF = \int_0^{\infty} t \cdot f(t)\, dt = \int_0^{\infty} R(t)\, dt = \frac{1}{\lambda} \tag{2.44}$$
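The result (2.44) can be checked numerically. The following sketch integrates the exponential reliability (2.43) with a simple trapezoidal rule and compares the result with 1/λ. The function names, the number of steps and the truncation of the integral at 40/λ are choices made for this illustration only.

```python
import math

def reliability(t, lam):
    # Eq. (2.43): exponential reliability with constant failure rate lam
    return math.exp(-lam * t)

def mttf_by_integration(lam, steps=200_000, horizon=40.0):
    # Trapezoidal approximation of the integral of R(t) in eq. (2.44),
    # truncated at t_max = horizon / lam, where R(t_max) = exp(-horizon)
    t_max = horizon / lam
    h = t_max / steps
    total = 0.5 * (reliability(0.0, lam) + reliability(t_max, lam))
    for k in range(1, steps):
        total += reliability(k * h, lam)
    return total * h

if __name__ == "__main__":
    lam = 0.25e-3  # failure rate in 1/h (illustrative value)
    print(mttf_by_integration(lam), 1.0 / lam)
```

With λ = 0.25·10⁻³ h⁻¹ both printed values should be close to 4000 h.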

Example 2

Figure 2.11 shows the plot of reliability under conditions of a constant failure rate for two devices with different values of λ.

For example, at time t = 2·10³ h, we note that device 1, for which λ1 = 0.25·10⁻³ h⁻¹, has a probability of functioning correctly equal to about 61%. The second device, with a failure rate λ2 = 0.5·10⁻³ h⁻¹, has a value of about 37%. With a selected reliability, for example 0.25, the graph shows that the interval of correct functioning for the first device is double that for the second device, since for a given reliability level this interval is inversely proportional to λ.
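The figures quoted in Example 2 follow directly from (2.43). The sketch below evaluates the reliability of the two devices at 2000 h and solves R(t) = r for t, which gives t = −ln(r)/λ. The function and constant names are hypothetical.

```python
import math

def reliability(t, lam):
    # Eq. (2.43)
    return math.exp(-lam * t)

def time_to_reliability(r, lam):
    # Solve exp(-lam * t) = r for t
    return -math.log(r) / lam

LAMBDA_1 = 0.25e-3   # 1/h, device 1
LAMBDA_2 = 0.5e-3    # 1/h, device 2

if __name__ == "__main__":
    print(reliability(2000.0, LAMBDA_1), reliability(2000.0, LAMBDA_2))
    print(time_to_reliability(0.25, LAMBDA_1), time_to_reliability(0.25, LAMBDA_2))
```

For a target reliability of 0.25, the computed intervals are about 5545 h for device 1 and about 2773 h for device 2.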

The information obtained from analyzing the two curves is extremely important when undertaking preventive measures for the reliability performance of the apparatus, in terms of optimizing conditions of use, of selecting appropriate components to guarantee the required functions of the system, and of project revisions.

As will be subsequently illustrated, the calculation of the failure rate can be done by means of specific handbooks.

The exponential law represents, for its simplicity, the most widely used model in the study of reliability. However, this is not the only model that can be used, with the support of experimental data, to describe failure events.

In fact, it will be shown that, for the unreliability function of particular systems, other laws are more appropriate.


Fig. 2.11 Comparison of two devices with different failure rate.

2.8 Other Laws

In this section the following distributions will be presented:

• Exponential law
• Log-Normal distribution
• Weibull distribution

Other distributions can be considered, such as gamma, Muth, Uniform, Log-Logistic, Inverse Gaussian, Exponential Power and Pareto. However, these distributions will not be considered in this book.

For the aforementioned distributions the following characteristic functions will be analyzed:

• reliability,
• (failure) probability density function,
• failure rate,
• MTTF.

Exponential Law

This distribution is characterized by a single positive scale parameter λ, which represents the failure rate, i.e. the number of failures per unit time. For this type of distribution the characteristic functions are:

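$$f(t) = \lambda e^{-\lambda t}, \quad \lambda > 0,\; t \ge 0 \tag{2.45}$$

$$R(t) = e^{-\lambda t} \tag{2.46}$$

$$MTTF = \frac{1}{\lambda} \tag{2.47}$$

$$\sigma^2 = \frac{1}{\lambda^2} \tag{2.48}$$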

The functions are plotted in Figures 2.12, 2.13, 2.14 and 2.15.

Fig. 2.12 Exponential law: plot of the (failure) probability density function.

[Figure 2.12 plot: f(t) versus t for λ = 1, λ = 1.5 and λ = 2.]

Fig. 2.13 Exponential law: plot of the Reliability function.

Fig. 2.14 Exponential law: plot of the failure rate.

Fig. 2.15 Exponential law: plot of the MTTF.

[Figures 2.13, 2.14 and 2.15 plots: R(t), λ(t) and MTTF versus t, each for λ = 1, λ = 1.5 and λ = 2.]

The relation

$$MTTF = \frac{1}{\lambda}$$

is valid only for the exponential distribution, as is the fact that the probability that the system is still functioning at the instant t = MTTF is

$$R(MTTF) = \frac{1}{e} \cong 0.37 \tag{2.49}$$

The exponential model describes well the behaviour of elements after infant mortality, in that the reliability over an interval depends only on the duration of the interval itself and not on its starting moment. In fact, applying the definition of conditional probability, we have

$$R(t + s \mid s) = \frac{e^{-\lambda (t + s)}}{e^{-\lambda s}} = e^{-\lambda t} \tag{2.50}$$

This particular law is often used to model the lifetime of many electronic components. Furthermore, the exponential law is appropriate when a used device that has not failed is statistically as good as a new device. Many properties can be recalled for this distribution. The most important is the memoryless property:

Memoryless property: If T ∼ exponential (λ), then

$$P[T \ge t] = P[T \ge t + s \mid T \ge s]; \quad t \ge 0;\; s \ge 0$$

It should be noted that the exponential distribution is the only continuous distribution with the memoryless property.
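The memoryless property can be illustrated by computing the conditional reliability R(s + t)/R(s) and comparing it with R(t). In the sketch below this ratio is evaluated for an exponential law and, for contrast, for a Weibull reliability of the form (2.57) with b = 2; the helper names and the parameter values are hypothetical.

```python
import math

def conditional_reliability(reliability, s, t):
    # P[T >= s + t | T >= s] = R(s + t) / R(s)
    return reliability(s + t) / reliability(s)

def exponential(lam):
    # Exponential reliability, eq. (2.46)
    return lambda t: math.exp(-lam * t)

def weibull(theta, b):
    # Weibull reliability with gamma = 0, eq. (2.57)
    return lambda t: math.exp(-((t / theta) ** b))

if __name__ == "__main__":
    r_exp = exponential(0.5)
    r_wei = weibull(1.0, 2.0)
    print(conditional_reliability(r_exp, 3.0, 1.0), r_exp(1.0))
    print(conditional_reliability(r_wei, 3.0, 1.0), r_wei(1.0))
```

For the exponential law the two printed values coincide, while for the Weibull law with b = 2 the conditional reliability of an aged item is lower than that of a new one.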

Log-normal Distribution

The Log-normal characteristic functions are:

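$$f(t) = \frac{1}{\sigma\, t \sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{\ln t - \mu}{\sigma}\right)^2} \tag{2.51}$$

$$R(t) = 1 - \int_0^t f(x)\, dx \tag{2.52}$$

$$MTTF = e^{\left(\mu + \frac{\sigma^2}{2}\right)} \tag{2.53}$$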

The Log-normal distribution is used for processing data derived from accelerated life tests, in particular for semiconductor devices. The parameters of the Log-normal law are μ (mean value) and σ (standard deviation) of the logarithm of the time to failure. Since the function λ(t) is decreasing for long periods of time, this distribution is generally used for describing repair events, in that the lack of spare parts can lead to repair times far longer than the average.

It is possible to demonstrate that the probability that the system is still functioning at the instant t = MTTF depends only on σ. In fact, if

$$y = \frac{\ln t - \mu}{\sigma} \tag{2.54}$$

we have:

$$R(MTTF) = \frac{1}{\sqrt{2\pi}} \int_{\sigma/2}^{\infty} e^{-\frac{y^2}{2}}\, dy \tag{2.55}$$
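A short numerical check of this statement is sketched below. The Log-normal reliability is written through the complementary error function, using the standard identity 1 − Φ(y) = ½ erfc(y/√2) for the normal distribution function Φ, and it is evaluated at t = MTTF for several values of μ. The function names are hypothetical.

```python
import math

def lognormal_mttf(mu, sigma):
    # Eq. (2.53)
    return math.exp(mu + sigma ** 2 / 2.0)

def lognormal_reliability(t, mu, sigma):
    # R(t) = 1 - Phi((ln t - mu) / sigma), written with erfc
    y = (math.log(t) - mu) / sigma
    return 0.5 * math.erfc(y / math.sqrt(2.0))

def reliability_at_mttf(sigma):
    # Eq. (2.55): the lower integration limit is sigma / 2, independent of mu
    return 0.5 * math.erfc(sigma / (2.0 * math.sqrt(2.0)))

if __name__ == "__main__":
    for mu in (0.0, 3.0, 7.0):
        print(mu, lognormal_reliability(lognormal_mttf(mu, 1.0), mu, 1.0))
    print(reliability_at_mttf(1.0))
```

With σ = 1 all printed values are approximately 0.31, whatever the value of μ.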

Weibull Distribution

The exponential law is often not well suited. In fact, it is limited in applicability because of the memoryless property, as previously described. Its most demanding assumption is the constant failure rate, which often proves too restrictive and/or inappropriate. The Weibull distribution overcomes these problems.

The Weibull distribution is used as a model to describe infant mortality and is a function of three parameters: γ (minimum life, or the period within which failures do not occur), θ (scale parameter, intended as characteristic life) and b (shape parameter). If γ = 0, we can write the following relations:

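$$f(t) = \frac{b}{\theta} \left(\frac{t}{\theta}\right)^{b-1} e^{-\left(\frac{t}{\theta}\right)^{b}}, \quad b, \theta > 0,\; t \ge 0 \tag{2.56}$$

$$R(t) = e^{-\left(\frac{t}{\theta}\right)^{b}} \tag{2.57}$$

$$\lambda(t) = \frac{b}{\theta} \left(\frac{t}{\theta}\right)^{b-1} \tag{2.58}$$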

This distribution is particularly significant for studying the reliability of systems, since it allows the description of failure events for systems characterized by a failure rate that varies over time. In fact, for b > 1 (b < 1), the Weibull distribution describes a system with an increasing (decreasing) failure rate. For b = 1, the Weibull distribution coincides with the exponential law. Therefore, in such circumstances, the failure rate is constant (θ = 1/λ).

The characteristic functions for the Weibull distribution are plotted in the following figures.
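The role of the shape parameter b can be seen by evaluating (2.58) at a few instants. The sketch below is a minimal illustration with γ = 0; the function name and the chosen parameter values are hypothetical.

```python
def weibull_failure_rate(t, theta, b):
    # Eq. (2.58), evaluated for t > 0 and gamma = 0
    return (b / theta) * (t / theta) ** (b - 1)

if __name__ == "__main__":
    for b in (0.5, 1.0, 3.0):
        rates = [weibull_failure_rate(t, theta=2.0, b=b) for t in (0.5, 1.0, 2.0, 4.0)]
        print(b, rates)
```

For b = 0.5 the printed values decrease with time, for b = 3 they increase, and for b = 1 they are all equal to 1/θ, which corresponds to the exponential case.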


Fig. 2.16 Weibull law: plot of failure probability density function for λ = 1 and three different values for the parameter b.

Fig. 2.17 Weibull law: plot of reliability function for λ = 1 and three different values for the parameter b.

[Figure 2.16 plot: f(t) versus t for b = 1, b = 2 and b = 3. Figure 2.17 plot: R(t) versus t for b = 1, b = 2 and b = 3.]

Fig. 2.18 Weibull law: plot of the failure rate function for θ = 1 and three different values for the parameter b. When b = 1 we have the exponential case with constant failure rate.

References

[1] IEC 60050-191 ed1.0, International Electrotechnical Vocabulary. Chapter 191: Dependability and quality of service, Forecast publication date for Ed. 2.0 is 2012-06-02. IEC International Electrotechnical Commission, Geneve (December 31, 1990)

[2] Montgomery, D.C., Runger, G.C.: Applied Statistics and probability for engineers, 2nd edn. John Wiley & Sons, Chichester, ISBN 0-471-17027-5

[3] IEC 60812:2006 – Analysis techniques for system reliability – Procedure for Failure mode and effects analysis (FMEA)

[4] Leemis, L.M.: Reliability, Probabilistic Models and Statistical methods, 2nd edn., ISBN 978-0-692-00027-4

[5] Birolini, A.: Reliability Engineering – Theory and Practice, Springer, Heidelberg, ISBN: 978-3-642-14951-1

[Figure 2.18 plot: λ(t) versus t for b = 1, b = 2 and b = 3.]

Chapter 3 Reliability Analysis in the Design Phase

Abstract. In this chapter the techniques used to describe the performance of devices in a system will be considered. To this aim it is important to consider the system as a combination of elementary devices that follow a well-defined functional structure. After a brief introduction, the reliability evaluation of series, parallel and mixed structures will be shown and discussed. To this aim, the concept of Reliability Block Diagram is also defined as a mandatory tool. The theory is developed using many practical examples. Parallel configuration is further developed in order to discuss the different types of redundancy for reliability growth: active, warm and stand-by. At the end of this chapter two different types of redundancy approaches are compared: system redundancy and component redundancy. The results so obtained are fundamental during the design phase of a system when reliability aspects have to be taken into account.

3.1 Introduction

In general terms, we can consider a system as a set of elements, subsystems or components, connected among themselves in order to guarantee one or more functional performances. Reliability, and therefore availability of such a system, depends on the characteristics of reliability and availability of the elements which make up the system, and their interconnections. The study of the relationships of the connections between the subsystems is called Combinatory Analysis and can be visualized in a diagram denoted as Reliability Block Diagram (RBD). In this chapter we will discuss some of the most common functional configurations, denoted as canonic configurations, whose combinations give origin to mixed configurations. For each functional configuration, not to be confused with the corresponding electric configuration, it will be possible to determine mathematical models of reliability and, consequently, the value of the Mean Time To Failure relative to the system (MTTFs). As will become more evident, the MTTFs can be determined through a combination of the failure rates of the elements constituting such a system.


3.2 Reliability Evaluation of Series, Parallel and Mixed Structures

3.2.1 The Series Functional Configuration

The series functional configuration, whose block diagram is shown in Figure 3.1, represents the simplest and most common reliability model in certain contexts, e.g. in the field of electronics. Considering the system S, composed of n elements Ei , for i = 1,… n, we say that the system is operative if and only if all the elements Ei are functioning correctly.

Fig. 3.1 Block diagram of reliability for the series functional configuration.

In the simplified hypothesis of independent events, for which we can assume that the performance of every element Ei, in terms of correct functioning or failure, does not depend on the condition assumed by other elements, the reliability of the system corresponds to the product of the reliabilities of the single blocks, that is:

$$R_S(t) = \prod_{i=1}^{n} R_i(t) \tag{3.1}$$

Assuming the condition of random failure and indicating with λi the constant failure rate associated with the generic element Ei, for which we assume that $R_i(t) = e^{-\lambda_i t}$, equation (3.1) becomes:

$$R_S(t) = \prod_{i=1}^{n} e^{-\lambda_i t} = e^{-\left(\sum_{i=1}^{n} \lambda_i\right) t} \tag{3.2}$$

Under the assumed hypotheses, Eq. (3.2) demonstrates an important property of the series functional configuration, according to which the failure rate of the system λS is given by the sum of the failure rates of the constituent elements λi, in hours⁻¹, that is:

$$\lambda_S = \sum_{i=1}^{n} \lambda_i \tag{3.3}$$


Consequently the Mean Time To Failure for the system, in hours, is:

$$MTTF_S = \frac{1}{\lambda_S} = \frac{1}{\sum_{i=1}^{n} \lambda_i} \tag{3.4}$$

It is therefore sufficient to know the failure rate of each element to determine the value of the MTTFs. For electronic equipment, this is called “reliability prediction” and can be carried out by means of specific handbooks, as described in Chapter 5.

From the analysis of Eq. (3.2), fundamental considerations can be made for the series configuration:

1. Since reliability is a probability, that is, at any fixed time a number between 0 and 1, we can deduce that the system reliability cannot exceed the smallest reliability value among the constituent elements, that is:

$$R_S(t) \le \min_i \left\{ R_i(t) \right\}; \quad i = 1, \dots, n \tag{3.5}$$

2. The probability of the system functioning correctly decreases with an increasing number of constituent elements.

Example 1

To justify Eq. (3.5), we consider a system composed of three elements E1 , E2 , E3 whose RBD is reported in Figure 3.2.

Fig. 3.2 RBD for a system composed of three elements in a series configuration.

If the values of reliability of each element, at generic time t, are 0.4, 0.7 and 0.9 respectively, the probability of the system functioning at the same time t is equal to 0.4 · 0.7 · 0.9 = 0.252. Though elementary, this example allows us to make the following practical considerations. First of all, the presence of an intrinsically weak element in the series configuration has a strong negative effect on the system reliability. Moreover, even improving the performance of the other two elements, the probability of the proper functioning of the system cannot exceed 40%. In addition, the probability of the system functioning correctly decreases with an increasing number of constituent elements.
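The series relations (3.1), (3.3) and (3.4) are simple enough to be sketched in a few lines. The code below reproduces the value of Example 1 and shows the effect of making the other two elements ideal; the function names are hypothetical and the failure rates used for the MTTF are illustrative values, not data from the text.

```python
import math

def series_reliability(element_reliabilities):
    # Eq. (3.1): product of the element reliabilities (math.prod needs Python 3.8+)
    return math.prod(element_reliabilities)

def series_failure_rate(rates):
    # Eq. (3.3): sum of the constant failure rates
    return sum(rates)

def series_mttf(rates):
    # Eq. (3.4)
    return 1.0 / series_failure_rate(rates)

if __name__ == "__main__":
    print(series_reliability([0.4, 0.7, 0.9]))    # Example 1
    print(series_reliability([0.4, 1.0, 1.0]))    # ideal E2 and E3
    print(series_mttf([1e-4, 2e-4, 7e-4]))        # illustrative rates in 1/h
```

The second line of output shows the bound imposed by the weakest element: with E2 and E3 assumed perfect, the system reliability equals that of E1.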


Fig. 3.3 Reliability plot of a system with three elements in series configuration.

Example 2

Figure 3.3 shows the reliability plots with constant failure rates λ1 < λ2 < λ3. The lower plot, relative to the series, clearly shows how the high failure rate of the third element negatively influences total reliability which, starting from a value of 1 at time zero, decreases exponentially as a function of the failure rate λS = λ1 + λ2 + λ3.

Example 3

Similar considerations can be made examining the values reported in Table 3.1. Even assuming high values of reliability for a single element, it appears evident that the reliability of a system, at a fixed time t, decreases with the increase in the number of elements which make up the system. If we consider, for example, an RBD with 20 elements set up in series functional configuration, which for simplicity we consider identical, the probability that the system is functioning correctly at time t will be over 65% only if the reliability of the single element is about 0.98 or higher.

Table 3.1 Influence of reliability values of the elements on the system reliability.

Element reliability ►    0.8          0.85         0.9          0.95       0.98       0.99
Number of elements ▼     System reliability ▼
1                        0.8          0.85         0.9          0.95       0.98       0.99
5                        0.32768      0.44370      0.59049      0.77378    0.90392    0.95099
10                       0.10737      0.19687      0.34868      0.59874    0.81707    0.90438
20                       0.01153      0.03876      0.12158      0.35849    0.66761    0.81791
50                       1.43·10⁻⁵    2.96·10⁻⁴    5.15·10⁻³    0.07694    0.36417    0.60501
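For identical elements, each entry of Table 3.1 is simply the element reliability raised to the number of elements, as follows from (3.1). A minimal sketch that regenerates the table is given below; the names are hypothetical.

```python
ELEMENT_RELIABILITIES = [0.8, 0.85, 0.9, 0.95, 0.98, 0.99]
ELEMENT_COUNTS = [1, 5, 10, 20, 50]

def series_table(reliabilities, counts):
    # Eq. (3.1) with n identical elements: R_S = R ** n
    return {n: [r ** n for r in reliabilities] for n in counts}

TABLE = series_table(ELEMENT_RELIABILITIES, ELEMENT_COUNTS)

if __name__ == "__main__":
    for n, row in TABLE.items():
        print(n, ["%.5g" % value for value in row])
```

As a spot check, 0.98 raised to the 20th power is about 0.668, and 0.8 raised to the 50th power is about 1.43·10⁻⁵.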


3.2.2 The Concept of Redundancy: Parallel Functional Configuration

The parallel functional configuration, also called redundant configuration (or active redundancy), plays a determining role whenever it is necessary to increase the reliability of a system.

The RBD for such a configuration is shown in Figure 3.4. The system is operative even if only one of the components allocated in parallel is operative. Vice versa, the system is not functioning when all the elements are faulty. Considering this last definition and recalling the hypotheses made for the series configuration, according to which events are independent and failure rates are constant, it follows that the unreliability of the system corresponds to the product of the unreliabilities of the constituent elements, that is:

$$F_S(t) = \prod_{i=1}^{n} F_i(t) \tag{3.6}$$

from which we extract the reliability of a system as:

$$R_S(t) = 1 - F_S(t) = 1 - \prod_{i=1}^{n} F_i(t) = 1 - \prod_{i=1}^{n} \left(1 - e^{-\lambda_i t}\right) \tag{3.7}$$

and consequently, the mean time to failure:

$$MTTF_S = \int_0^{+\infty} R_S(t)\, dt \tag{3.8}$$

Fig. 3.4 RBD for the parallel configuration.

Considering the example of a system with two independent elements connected in parallel configuration and assuming constant failure rates equal to λ1 and λ2, from Eq. (3.7) we obtain the following expression of reliability:

$$R_S(t) = e^{-\lambda_1 t} + e^{-\lambda_2 t} - e^{-(\lambda_1 + \lambda_2) t} \tag{3.9}$$


and recalling Eq.(3.8)

$$MTTF_S = \frac{1}{\lambda_1} + \frac{1}{\lambda_2} - \frac{1}{\lambda_1 + \lambda_2} \tag{3.10}$$

In the simplified hypothesis according to which the two elements in redundancy are identical:

$$MTTF_S = \frac{3}{2\lambda} \tag{3.11}$$

from which we clearly deduce an increase of 50% of the MTTFs with respect to the case of a single element characterized by the same failure rate. Such a concept is the basis of the allocation of redundancy as a methodology to increase the reliability of a system.
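The closed-form results (3.10) and (3.11) can be compared with a direct numerical integration of (3.8). The sketch below uses a trapezoidal rule truncated at 40 divided by the smallest failure rate; the function names, the number of steps and the rates are assumptions made only for this illustration.

```python
import math

def parallel_reliability(t, rates):
    # Eq. (3.7): one minus the product of the element unreliabilities
    unreliability = 1.0
    for lam in rates:
        unreliability *= 1.0 - math.exp(-lam * t)
    return 1.0 - unreliability

def parallel_mttf_two(lam1, lam2):
    # Eq. (3.10): two independent elements in active redundancy
    return 1.0 / lam1 + 1.0 / lam2 - 1.0 / (lam1 + lam2)

def parallel_mttf_numeric(rates, steps=100_000, horizon=40.0):
    # Trapezoidal approximation of eq. (3.8), truncated at horizon / min(rates)
    t_max = horizon / min(rates)
    h = t_max / steps
    total = 0.5 * (parallel_reliability(0.0, rates) + parallel_reliability(t_max, rates))
    for k in range(1, steps):
        total += parallel_reliability(k * h, rates)
    return total * h

if __name__ == "__main__":
    print(parallel_mttf_two(2.0, 2.0), 1.5 / 2.0)   # identical elements, eq. (3.11)
    print(parallel_mttf_numeric([1.0, 3.0]), parallel_mttf_two(1.0, 3.0))
```

Each printed pair should agree, the first exactly and the second within the accuracy of the numerical integration.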

Finally, it is necessary to evaluate the failure rate for the allocation of redundancy. This is done by taking into consideration a two-component parallel block with identical constant failure rate λ. Recalling (2.33) we obtain:

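$$
\begin{aligned}
\lambda_S(t) &= -\frac{1}{R_S(t)}\, \frac{dR_S(t)}{dt} = -\frac{1}{2e^{-\lambda t} - e^{-2\lambda t}} \cdot \frac{d}{dt}\left\{ 2e^{-\lambda t} - e^{-2\lambda t} \right\} \\
&= -\frac{1}{2e^{-\lambda t} - e^{-2\lambda t}} \cdot \left\{ -2\lambda e^{-\lambda t} + 2\lambda e^{-2\lambda t} \right\} = \frac{1}{2e^{-\lambda t} - e^{-2\lambda t}} \cdot \left\{ 2\lambda e^{-\lambda t} - 2\lambda e^{-2\lambda t} \right\} \\
&= \frac{2\lambda e^{-\lambda t}\left(1 - e^{-\lambda t}\right)}{e^{-\lambda t}\left(2 - e^{-\lambda t}\right)} = \frac{2\lambda \left(1 - e^{-\lambda t}\right)}{2 - e^{-\lambda t}}
\end{aligned}
$$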

Similar considerations can be deduced for the more general case of many blocks connected in parallel with different failure rates. It should be noted that the failure rate of a parallel block is time dependent even if the failure rates of the single blocks are time independent.
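The time dependence of the failure rate of a two-element parallel block can be made concrete with a short sketch. For two identical elements with constant rate λ, applying (2.33) to R_S(t) = 2e^(−λt) − e^(−2λt) gives λ_S(t) = 2λ(1 − e^(−λt))/(2 − e^(−λt)); the code below evaluates this expression at a few instants. The function name and the value of λ are hypothetical.

```python
import math

def parallel_failure_rate(t, lam):
    # Failure rate of two identical elements in active redundancy
    x = math.exp(-lam * t)
    return 2.0 * lam * (1.0 - x) / (2.0 - x)

if __name__ == "__main__":
    lam = 1e-3  # 1/h, illustrative value
    for t in (0.0, 500.0, 1000.0, 5000.0, 50000.0):
        print(t, parallel_failure_rate(t, lam))
```

The rate starts from zero at t = 0 and grows towards λ for large t, when one of the two elements has most likely already failed.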

For the parallel functional configuration it is possible to deduce the following considerations.

1. At a fixed time, the total reliability of the system is not lower than the highest reliability value among the constituent elements, so that we can write:

$$R_S(t) \ge \max_i \left\{ R_i(t) \right\}; \quad i = 1, \dots, n \tag{3.12}$$

2. The probability of the system functioning increases with the increase in the number of constituent elements.

Example 4

Such a property can be verified by considering a system made up of three elements E1, E2, E3 whose RBD is shown in Figure 3.5. If the values of reliability for each element, at the generic time t, are respectively 0.4, 0.7 and 0.9, the probability of the system functioning, at the same time, becomes 1 − (0.6 · 0.3 · 0.1) = 0.982.

Fig. 3.5 RBD for a system made up of three elements in parallel configuration.

Example 5

The reliability values of a system obtained by connecting up to six elements in parallel configuration are reported in Table 3.2 and in Figure 3.6. For simplicity we assume, also in this case, identical values of reliability, equal to 0.8 at time t. From the results in the table, we observe that two elements in parallel give an increase in reliability equal to 20% with respect to the configuration with a single element. Obviously, reliability keeps increasing with the number of elements in active redundancy but, as is logical to expect, each further increase is smaller and may not be such as to justify the cost of a design review. The proposed example demonstrates the importance of the parallel configuration as a technique for increasing the system reliability.

A further application, reported below, is an example of active redundancy allocation for a series configuration. It is important to remember that the active redundancy discussed in this paragraph must not be confused with stand-by redundancy. In general terms, this last configuration, which will not be discussed here, foresees the functioning of a system A and the activation of a system B whenever A enters a failure state. A diagnostic block D monitors the correct functioning of A and causes the activation of B when necessary. It is evident that for this configuration the reliability of the whole system depends on the reliability of blocks A, B and D according to a relation of conditional probability.

Table 3.2 Increase in reliability for a configuration in active redundancy.

Number of elements    System reliability    Increase of reliability (a)    Increase of reliability (b)
1                     0.800000              ---                            ---
2                     0.960000              0.160000                       20.00 %
3                     0.992000              0.032000                       24.00 %
4                     0.998400              0.006400                       24.80 %
5                     0.999680              0.001280                       24.96 %
6                     0.999936              0.000256                       24.99 %

a) with respect to the configuration of the preceding step.
b) with respect to the initial configuration with a single element.
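The values of Example 4 and of Table 3.2 follow from (3.6) and (3.7) written directly in terms of element reliabilities. The sketch below regenerates the table for an element reliability of 0.8; the function names and the row layout are hypothetical.

```python
def active_redundancy_reliability(element_reliabilities):
    # Eqs. (3.6)-(3.7): one minus the product of the element unreliabilities
    unreliability = 1.0
    for r in element_reliabilities:
        unreliability *= 1.0 - r
    return 1.0 - unreliability

def redundancy_table(r, max_elements):
    """Rows of (n, R_S, increase vs previous step, percent increase vs single element)."""
    rows = []
    previous = None
    for n in range(1, max_elements + 1):
        rs = active_redundancy_reliability([r] * n)
        step = None if previous is None else rs - previous
        percent = None if n == 1 else 100.0 * (rs - r) / r
        rows.append((n, rs, step, percent))
        previous = rs
    return rows

if __name__ == "__main__":
    print(active_redundancy_reliability([0.4, 0.7, 0.9]))   # Example 4
    for row in redundancy_table(0.8, 6):
        print(row)
```

The diminishing size of the third column in the output is the numerical counterpart of the remark that each additional redundant element contributes less than the previous one.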


Fig. 3.6 Plot of increase in reliability for a configuration in active redundancy: (a) system reliability and increase of reliability respect to the configuration of the preceding step, (b) percentage increase of reliability respect to the initial configuration with a single element.

Example 6

Consider the RBD for a series configuration (Figure 3.7.a) in which the element Ei has been identified a priori as having a higher failure rate. A possible solution for increasing the reliability of the system is the insertion of an active redundancy, as reported in Figure 3.7.b. Such a configuration usually takes the name of “allocation of redundancy”.

Considering equations (3.1) and (3.7) we can evaluate the expression for the system reliability as:

$$R_S(t) = \left(2 - R_i(t)\right) \cdot R_{series}(t) \tag{3.13}$$

Recalling Example 1 and applying the preceding relations with active redundancy positioned on the element E1, which has a reliability value of 0.4, we obtain a probability of correct functioning of the entire system equal to 0.4032, with an increase of 60% with respect to the initial value of 0.252.
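The allocation of redundancy of Example 6 can be sketched by replacing the reliability of one series element with that of two identical copies in active redundancy, 1 − (1 − R)², and then applying the series product (3.1). The function names below are hypothetical.

```python
import math

def with_active_redundancy(r, copies=2):
    # Reliability of identical copies in active redundancy, from eq. (3.7)
    return 1.0 - (1.0 - r) ** copies

def series_with_duplicated_element(element_reliabilities, index):
    # Series system in which the element at position index is duplicated
    values = list(element_reliabilities)
    values[index] = with_active_redundancy(values[index])
    return math.prod(values)

if __name__ == "__main__":
    elements = [0.4, 0.7, 0.9]                 # Example 1
    base = math.prod(elements)
    improved = series_with_duplicated_element(elements, 0)
    print(base, improved, improved / base)
```

Since 1 − (1 − R)² = R(2 − R), the improved value equals (2 − R) times the original series reliability, here 1.6 · 0.252.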

Fig. 3.7 RBD: (a) series functional (b) allocation of redundancy.

Example 7

Figure 3.8 shows a comparison of the configurations with two equal and independent elements functioning in series and in parallel, together with the plot of reliability of a single element having the same constant failure rate.

a) two elements in parallel: $R_S(t) = 2e^{-\lambda t} - e^{-2\lambda t}$; $MTTF_S = \frac{3}{2\lambda}$

b) single element: $R_S(t) = e^{-\lambda t}$; $MTTF_S = \frac{1}{\lambda}$

c) two elements in series: $R_S(t) = e^{-2\lambda t}$; $MTTF_S = \frac{1}{2\lambda}$

Fig. 3.8 Comparison among basic configurations with the same failure rate.

[Figure 3.8 plot: reliability versus time (h), from 0 to 5 h, for the configurations a, b and c.]

The series and parallel configurations can be suitably combined, giving rise to the so-called mixed configurations. For these, assuming the same hypotheses and recalling Eqs. (3.1) and (3.7), both the plots of reliability vs. time and the value of MTTFs can be immediately determined.

Example 8

A system consisting of seven subsystems, each characterized by a different constant failure rate, is represented by the RBD in Figure 3.9. It is a combination of series and parallel configurations: the entities E1, E2, E3 of the upper path are in series with one another and in parallel with the series E4, E5 of the lower branch, and the whole is in series with the elements E6, E7. We therefore obtain:

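$$R(t) = e^{-(\lambda_1+\lambda_2+\lambda_3+\lambda_6+\lambda_7)t} + e^{-(\lambda_4+\lambda_5+\lambda_6+\lambda_7)t} - e^{-(\lambda_1+\lambda_2+\lambda_3+\lambda_4+\lambda_5+\lambda_6+\lambda_7)t} \qquad (3.14)$$

and consequently:

$$MTTF_S = \frac{1}{\lambda_1+\lambda_2+\lambda_3+\lambda_6+\lambda_7} + \frac{1}{\lambda_4+\lambda_5+\lambda_6+\lambda_7} - \frac{1}{\lambda_1+\lambda_2+\lambda_3+\lambda_4+\lambda_5+\lambda_6+\lambda_7}$$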

Fig. 3.9 Mixed configuration.
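The structure of Example 8 can be evaluated block by block, which gives a quick check of Eq. (3.14). The sketch below is illustrative only; the seven failure rates are hypothetical values, not data from the text.

```python
import math


def mixed_reliability(lam, t):
    """R(t) for Fig. 3.9: (E1, E2, E3 in series) in parallel with (E4, E5 in series),
    the whole in series with E6 and E7. lam holds the seven constant failure rates."""
    upper = math.exp(-(lam[0] + lam[1] + lam[2]) * t)
    lower = math.exp(-(lam[3] + lam[4]) * t)
    tail = math.exp(-(lam[5] + lam[6]) * t)
    return (upper + lower - upper * lower) * tail


def mixed_mttf(lam):
    """Integral of Eq. (3.14) over [0, +inf): one 1/rate term per exponential."""
    a = lam[0] + lam[1] + lam[2] + lam[5] + lam[6]
    b = lam[3] + lam[4] + lam[5] + lam[6]
    return 1.0 / a + 1.0 / b - 1.0 / sum(lam)


# HYPOTHETICAL failure rates in 1/h, for illustration only.
rates = [1e-5, 2e-5, 3e-5, 4e-5, 5e-5, 6e-6, 7e-6]
print(mixed_reliability(rates, 1.0e4), mixed_mttf(rates))
```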

3.3 Types of Redundancy

Redundancy is useful when very high dependability is mandatory, in particular when high reliability, availability and safety of the equipment are required. It is important to underline that we are dealing with reliability block diagrams: items in parallel in an RBD are not automatically or necessarily in parallel in the hardware block diagram. Three different types of redundancy can be defined [6]:

• Active redundancy (also known as parallel or hot redundancy): this is the aforementioned redundancy. The redundant elements are always subject to the same load;

• Warm redundancy: the redundant elements are subject to a lower load until one of the elements fails; load sharing is present;

• Stand-by redundancy (also known as cold redundancy): in this type of redundancy the redundant elements are subject to no load until one of the operating elements fails; load sharing is not possible and, very importantly, the failure rate of the elements with no load (in reserve state) is equal to zero.


3.4 Functional Configuration k out of n

A particular configuration, cited previously, is a system that is operative when at least k elements out of a total of n are functioning normally. The configuration is also called k-out-of-n redundancy, with k ≤ n.

For this configuration we can assume a structure in which k elements are in active redundancy and the remaining (n-k) elements are in stand-by. A typical example is a steel cable formed of n strands that can withstand the foreseen stress if at least k strands are intact.

To calculate the reliability of this configuration we use the binomial distribution, assuming that the generic element of the system can be in only two conditions: correct functioning or failure. Recalling the definition given for this configuration and indicating with R(t) the reliability of the generic element, the reliability of the system RS(t) can be expressed as:

$$R_S(t) = \sum_{i=k}^{n} \binom{n}{i} \left(R(t)\right)^{i} \left(1 - R(t)\right)^{n-i} \qquad (3.15)$$

denoting as

$$\binom{n}{i} = \frac{n!}{i!\,(n-i)!} \qquad (3.16)$$

the binomial coefficient with 0! = 1. Assuming a constant failure rate, we have:

$$R_S(t) = \sum_{i=k}^{n} \binom{n}{i} e^{-i\lambda t}\left(1 - e^{-\lambda t}\right)^{n-i} \qquad (3.17)$$

from which the mean time to failure of the system can be immediately calculated as:

$$MTTF_S = \int_0^{+\infty} \left[\sum_{i=k}^{n} \binom{n}{i} e^{-i\lambda t}\left(1 - e^{-\lambda t}\right)^{n-i}\right] dt \qquad (3.18)$$

It is interesting to observe that for k = 1 this configuration coincides with the parallel configuration, while for k = n it coincides with the series configuration.

Example 9

We wish to determine the probability of functioning of a 2-out-of-3 system at a time $t = 10^4$ hours, with the failure rate of the generic element equal to $3 \cdot 10^{-5}\ \mathrm{h}^{-1}$. From Eq. (3.17) the reliability is:

$$R_S(t) = \sum_{i=2}^{3} \binom{3}{i} e^{-i\lambda t}\left(1 - e^{-\lambda t}\right)^{3-i} = \binom{3}{2} e^{-2\lambda t}\left(1 - e^{-\lambda t}\right) + \binom{3}{3} e^{-3\lambda t}$$

and:

$$R_S(t) = 3e^{-2\lambda t} - 2e^{-3\lambda t}.$$

With the assigned value of the failure rate, $\lambda t = 0.3$ and the reliability of the system at $10^4$ hours is $3e^{-0.6} - 2e^{-0.9} \cong 0.833$.
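Equation (3.15) translates directly into a few lines of Python. The sketch below is an illustration rather than part of the original text; it evaluates Example 9 and checks the two limiting cases k = 1 (parallel) and k = n (series).

```python
import math


def k_out_of_n_reliability(k, n, r):
    """Eq. (3.15): probability that at least k of n identical, independent
    elements, each with reliability r, are functioning."""
    return sum(math.comb(n, i) * r**i * (1.0 - r) ** (n - i) for i in range(k, n + 1))


lam = 3e-5          # failure rate of the generic element, 1/h (Example 9)
t = 1e4             # mission time, h (Example 9)
r = math.exp(-lam * t)

r_2oo3 = k_out_of_n_reliability(2, 3, r)
print(f"2-out-of-3 reliability: {r_2oo3:.3f}")
```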

Example 10

Table 3.3 and Figure 3.10 compare the results and reliability trends for different k out of n functional configurations, under the same hypotheses that led to expressions (3.17) and (3.18).

Table 3.3 Characteristics of reliability for k out of n configurations.

    Configuration      Reliability model                                            MTTF_S
a   Single element     $e^{-\lambda t}$                                             $1/\lambda$
b   1-out-of-2         $2e^{-\lambda t} - e^{-2\lambda t}$                          $9/(6\lambda)$
c   1-out-of-3         $3e^{-\lambda t} - 3e^{-2\lambda t} + e^{-3\lambda t}$       $11/(6\lambda)$
d   2-out-of-3         $3e^{-2\lambda t} - 2e^{-3\lambda t}$                        $5/(6\lambda)$

Fig. 3.10 Reliability of system for different functional configurations with elements having the same failure rate.
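For identical elements with a constant failure rate, integrating (3.17) term by term gives the closed form $MTTF_S = \frac{1}{\lambda}\sum_{i=k}^{n}\frac{1}{i}$. This closed form is a standard result and is not stated in the text; the short sketch below uses it only to cross-check the MTTF column of Table 3.3.

```python
from fractions import Fraction


def k_out_of_n_mttf_factor(k, n):
    """MTTF of a k-out-of-n system of identical exponential elements,
    expressed in units of 1/lambda (sum of 1/i for i = k..n)."""
    return sum(Fraction(1, i) for i in range(k, n + 1))


for label, k, n in (("single element", 1, 1), ("1-out-of-2", 1, 2),
                    ("1-out-of-3", 1, 3), ("2-out-of-3", 2, 3)):
    print(f"{label}: MTTF = {k_out_of_n_mttf_factor(k, n)} / lambda")
```

Exact fractions are used so that 9/6 and 3/2 compare as equal without rounding.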

Fig. 3.11 RBD for example 11: panels (a), (b) and (c).

Further overview examples are presented in the following.

Example 11

We are interested in investigating the reliability of an airplane with four propellers: two are on the left wing and the other two on the right wing, as depicted in Figure 3.11.a. The airplane will fly if at least one propeller on each wing functions. Some preliminary considerations are needed in order to draw the RBD. We denote the four propellers with the letters A, B, C and D; for example, propellers A and B are located on the left wing, whereas propellers C and D are located on the right wing.

The statement “the airplane will fly if at least one propeller on each wing functions” leads us to consider the two wings as a series structure. In fact, failure of the propulsion function on either wing results in system failure. This structure is depicted in Figure 3.11.b. Moreover, each wing can be modeled as a very simple parallel subsystem, composed of two propellers in parallel configuration, because only one propeller on each wing is required to operate correctly. The block diagram is depicted in Figure 3.11.c.

It should be noted that in this example the only airplane failures considered are those due to propeller failures. Obviously, many other failures are possible but, for the sake of simplicity, they are not considered here [7].
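The RBD of Example 11 (two parallel pairs in series) can be composed from two small helper functions. This is a minimal sketch assuming independent propellers; the common reliability value 0.9 is hypothetical and serves only as an illustration.

```python
def series(*reliabilities):
    """Independent blocks in series: product of the reliabilities."""
    result = 1.0
    for r in reliabilities:
        result *= r
    return result


def parallel(*reliabilities):
    """Independent blocks in parallel: one minus the product of the unreliabilities."""
    q = 1.0
    for r in reliabilities:
        q *= 1.0 - r
    return 1.0 - q


def airplane_reliability(r_a, r_b, r_c, r_d):
    """Left wing (A parallel B) in series with right wing (C parallel D)."""
    return series(parallel(r_a, r_b), parallel(r_c, r_d))


r = 0.9  # HYPOTHETICAL reliability of each propeller
print(f"{airplane_reliability(r, r, r, r):.4f}")
```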

Example 12

Many interesting variations can be built on example 11. For instance, consider again an airplane with four propellers, two on the left wing and two on the right wing, as depicted in Figure 3.11.a, but assume now that the airplane will fly if at least two propellers function. In this case two propellers are sufficient to assure the correct functioning of the airplane, and the position of these propellers is not important (please note that this is only an example and the actual situation can be very different!). In other words, the function is available if 2 out of 4 propellers are working correctly. This is clearly a k out of n configuration: the system is operative when at least k elements out of a total of n are functioning normally. Equations (3.15) through (3.18) can therefore be applied.
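Examples 11 and 12 differ only in the rule that decides whether the airplane flies. One way to make this explicit, sketched below under the assumption of independent propellers, is to enumerate all up/down states and sum the probabilities of those for which the rule holds. The function names and the reliability value 0.9 are illustrative, not taken from the text.

```python
from itertools import product


def system_reliability(structure, reliabilities):
    """Sum the probabilities of all element states for which structure(state) is True."""
    total = 0.0
    for state in product((True, False), repeat=len(reliabilities)):
        p = 1.0
        for up, r in zip(state, reliabilities):
            p *= r if up else (1.0 - r)
        if structure(state):
            total += p
    return total


def two_out_of_four(state):
    """Example 12: any two propellers are sufficient."""
    return sum(state) >= 2


def one_per_wing(state):
    """Example 11: A, B on the left wing, C, D on the right wing."""
    a, b, c, d = state
    return (a or b) and (c or d)


r = [0.9] * 4  # HYPOTHETICAL propeller reliabilities
print(f"2-out-of-4: {system_reliability(two_out_of_four, r):.4f}")
print(f"one per wing: {system_reliability(one_per_wing, r):.4f}")
```

Enumeration grows as 2^n, so it is practical only for small systems; for identical elements Eq. (3.15) gives the same 2-out-of-4 value directly.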

Example 13

We now consider the system with the Reliability Block Diagram (RBD) of Figure 3.12, whose original design does not guarantee the required reliability level.

Fig. 3.12 RBD for example 13 (blocks 1, 2 and 3).

A possible solution is to use redundancy in order to improve the reliability. If three further components, identical to the previous ones, are available, two different arrangements are possible:

• component redundancy, as depicted in Figure 3.13.a, and
• system redundancy, as depicted in Figure 3.13.b.

Fig. 3.13 Component redundancy (a) and System redundancy (b).

The difference between the two redundancy configurations is the following: in component redundancy each component is replicated in parallel at its position in the system, whereas in system redundancy the whole system is replicated in parallel to itself. For parallel components the following equation is valid (P = Parallel):

$$R_P(t) = 1 - F_P(t) = 1 - \prod_{i=1}^{n} F_i(t) = 1 - \prod_{i=1}^{n}\left(1 - e^{-\lambda_i t}\right)$$

and, in the case of two components, dropping the explicit time dependence:

$$R_P = 1 - F_P = 1 - \prod_{i=1}^{2} F_i = 1 - \prod_{i=1}^{2}\left(1 - R_i\right) = 1 - (1 - R_1)(1 - R_2) = 1 - 1 + R_1 + R_2 - R_1 R_2 = R_1 + R_2 - R_1 R_2.$$

Taking, for the sake of simplicity, the same reliability value for both components, R = R1 = R2, the reliability becomes:

$$R_P = 2R - R^2$$

If the reliability of each element at the generic time t is, for example, R = 0.9, we obtain:

$$R_P = 1.8 - 0.81 = 0.99.$$

So, for component redundancy, the overall reliability can be evaluated with the notation of Figure 3.14, where the reliability at the generic time t is reported inside each block. The reliability is (S = System, CR = component redundancy):

$$R_{S(CR)} = 0.99 \cdot \left(2 \cdot 0.99 - 0.99^2\right) = 0.9899$$

Fig. 3.14 Component redundancy (each block has reliability 0.99).

In order to evaluate the reliability of the system redundancy solution, it is first necessary to evaluate the reliability of the original design:

$$R_S = 0.9 \cdot \left(2 \cdot 0.9 - 0.9^2\right) = 0.891$$

With the notation of Figure 3.15, the overall reliability in the case of system redundancy is (S = System, SR = system redundancy):

$$R_{S(SR)} = 2R_S - R_S^2 = 2 \cdot 0.891 - 0.891^2 = 0.9881.$$

Fig. 3.15 System redundancy for example 13.

The result is that, in this example:

$$R_{S(CR)} > R_{S(SR)}.$$

The result obtained holds in nearly all possible situations, in the form of the following inequality:

$$R_{S(CR)} \geq R_{S(SR)}. \qquad (3.19)$$

Redundancy at the component level is more effective than redundancy at the system level, even though this arrangement can create design problems and can be difficult to obtain in many situations. Obviously, the choice of a specific configuration must also take economic aspects into account.

Example 14

The results obtained in example 13 can be further analyzed by taking into account different values of the component reliability at a well-defined time instant. In this example the following reliability values are considered: 0.8, 0.85, 0.9, 0.95, 0.99 and 1. The results are summarized in Table 3.4 and in Figure 3.16.

Table 3.4 Reliability evaluation for example 14.

R             0.8        0.85       0.9        0.95       0.99       1

$R_{S(CR)}$   0.958464   0.977005   0.989901   0.997494   0.999900   1

$R_{S(SR)}$   0.946176   0.971397   0.988119   0.997257   0.999898   1
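The entries of Table 3.4 can be reproduced with the relations of example 13. The sketch below assumes, as implied by $R_S = 0.9\,(2 \cdot 0.9 - 0.9^2)$, that the original design is one component in series with a parallel pair; which of the three blocks is the series one does not matter here because the components are identical.

```python
def parallel_pair(r):
    """Two identical independent blocks in parallel: 2R - R^2."""
    return 2.0 * r - r**2


def original_design(r_series, r_a, r_b):
    """ASSUMED layout of Fig. 3.12: one block in series with a parallel pair."""
    return r_series * (r_a + r_b - r_a * r_b)


def component_redundancy(r):
    """Each component is first replaced by a parallel pair of itself."""
    rp = parallel_pair(r)
    return original_design(rp, rp, rp)


def system_redundancy(r):
    """The whole original system is duplicated in parallel."""
    return parallel_pair(original_design(r, r, r))


for r in (0.8, 0.85, 0.9, 0.95, 0.99, 1.0):
    print(f"{r:<5} {component_redundancy(r):.6f} {system_redundancy(r):.6f}")
```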

Fig. 3.16 Plot of reliability for example 14: system reliability versus component reliability, for system redundancy and component redundancy.

Table and plot agree with (3.19), repeated here for convenience:

$$R_{S(CR)} \geq R_{S(SR)}$$

In this example the reliability value 1 has also been taken into account. For this value the two redundancy configurations give:

$$R_{S(CR)} = R_{S(SR)}$$

whereas for the other reliability values considered the strict inequality holds:

$$R_{S(CR)} > R_{S(SR)}.$$

Example 15

Evaluate the MTTF for the system depicted in Figure 3.17. The components are used during the phase of the bathtub curve with an approximately constant failure rate, and λ = λ1 = λ2 = λ3.

Fig. 3.17 RBD for example 15.


In order to solve this configuration, some initial considerations are necessary. Recalling Eq. (2.33), repeated here for the sake of simplicity,

$$\lambda(t) = -\frac{1}{R(t)}\frac{dR(t)}{dt} = -\frac{d}{dt}\log\left(R(t)\right)$$

and taking as initial condition the maximum value of the reliability, equal to 1, at time 0, we obtain:

$$R(t) = e^{-\int_0^t \lambda(t)\,dt}$$

Recalling the bathtub curve and assuming phase 2 (useful life and random failures), the aforementioned equation becomes:

$$R(t) = e^{-\lambda t}$$

We approach the solution in different steps.

First parallel block

The block is depicted in Figure 3.18.

Fig. 3.18 First parallel block.

The reliability of this parallel block is calculated with Eq. (3.7):

$$R_{P1}(t) = 1 - F_{P1}(t) = 1 - \prod_{i=1}^{n} F_i(t) = 1 - \prod_{i=1}^{n}\left(1 - e^{-\lambda_i t}\right) = 1 - \prod_{i=1}^{2}\left(1 - e^{-\lambda_i t}\right) = 2e^{-\lambda t} - e^{-2\lambda t}$$

and consequently the mean time to failure is:

$$MTTF_{P1} = \int_0^{+\infty} R_{P1}(t)\,dt = \int_0^{+\infty}\left(2e^{-\lambda t} - e^{-2\lambda t}\right)dt = 2\int_0^{+\infty} e^{-\lambda t}\,dt - \int_0^{+\infty} e^{-2\lambda t}\,dt = \frac{3}{2\lambda}$$

as found in (3.11). It is now necessary to evaluate the failure rate of this block. Recalling Eq. (2.33) we obtain:

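$$\begin{aligned}
\lambda_{P1}(t) &= -\frac{1}{R_{P1}(t)}\frac{dR_{P1}(t)}{dt} = -\frac{1}{2e^{-\lambda t} - e^{-2\lambda t}} \cdot \frac{d}{dt}\left\{2e^{-\lambda t} - e^{-2\lambda t}\right\} \\
&= -\frac{1}{2e^{-\lambda t} - e^{-2\lambda t}} \cdot \left\{-2\lambda e^{-\lambda t} + 2\lambda e^{-2\lambda t}\right\} = \frac{1}{2e^{-\lambda t} - e^{-2\lambda t}} \cdot \left\{2\lambda e^{-\lambda t} - 2\lambda e^{-2\lambda t}\right\} \\
&= \frac{2\lambda\left(e^{-\lambda t} - e^{-2\lambda t}\right)}{2e^{-\lambda t} - e^{-2\lambda t}} = \frac{2\lambda e^{-\lambda t}\left(1 - e^{-\lambda t}\right)}{e^{-\lambda t}\left(2 - e^{-\lambda t}\right)} = \frac{2\lambda\left(1 - e^{-\lambda t}\right)}{2 - e^{-\lambda t}}
\end{aligned}$$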
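The closed form $\lambda_{P1}(t) = 2\lambda(1 - e^{-\lambda t})/(2 - e^{-\lambda t})$ can be compared with a direct numerical evaluation of Eq. (2.33). The sketch below uses a central finite difference; the failure rate and the evaluation time are hypothetical values chosen only for illustration.

```python
import math


def r_parallel(lam, t):
    """Reliability of two identical exponential elements in parallel."""
    return 2.0 * math.exp(-lam * t) - math.exp(-2.0 * lam * t)


def hazard_closed_form(lam, t):
    """Failure rate of the parallel block, as derived in the text."""
    x = math.exp(-lam * t)
    return 2.0 * lam * (1.0 - x) / (2.0 - x)


def hazard_numeric(lam, t, h=1e-2):
    """Eq. (2.33), -(1/R) dR/dt, with dR/dt estimated by a central difference."""
    d_r = (r_parallel(lam, t + h) - r_parallel(lam, t - h)) / (2.0 * h)
    return -d_r / r_parallel(lam, t)


lam = 1e-3   # HYPOTHETICAL failure rate, 1/h
t = 500.0    # HYPOTHETICAL evaluation time, h
print(hazard_closed_form(lam, t), hazard_numeric(lam, t))
```

The failure rate of the block is zero at t = 0 and tends to λ for large t, so the parallel block does not have a constant failure rate even though its elements do.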

Second parallel block

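The block is depicted in Figure 3.19. Taking into account the value of the failure rate, we obtain:

$$R_{P2}(t) = 2e^{-\lambda t} - e^{-2\lambda t}$$

$$MTTF_{P2} = \frac{3}{2\lambda}$$

$$\lambda_{P2}(t) = \frac{2\lambda\left(1 - e^{-\lambda t}\right)}{2 - e^{-\lambda t}}$$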

as for the previously evaluated block.

Fig. 3.19 Second parallel block.

Third parallel block

The block is depicted in Figure 3.20. Taking into account the value of the failure rate, we have:

$$R_{P3}(t) = 2e^{-\lambda t} - e^{-2\lambda t}$$

$$MTTF_{P3} = \frac{3}{2\lambda}$$

$$\lambda_{P3}(t) = \frac{2\lambda\left(1 - e^{-\lambda t}\right)}{2 - e^{-\lambda t}}$$

as for the previously evaluated blocks.


Fig. 3.20 Third parallel block.

Fig. 3.21 Second and third parallel blocks.

Parallel of the second and third block

The block is depicted in Figure 3.21.

Reliability is:

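$$R_{P23}(t) = 1 - F_{P23}(t) = 1 - \prod_{i=P2}^{P3} F_i(t) = 1 - \prod_{i=P2}^{P3}\left(1 - e^{-\lambda_i(t)\,t}\right) =$$

$$= 1 - \left(1 - e^{-\lambda_{P2}(t)\,t}\right)\left(1 - e^{-\lambda_{P3}(t)\,t}\right) = e^{-\lambda_{P2}(t)\,t} + e^{-\lambda_{P3}(t)\,t} - e^{-\lambda_{P2}(t)\,t}\,e^{-\lambda_{P3}(t)\,t}.$$

However, the blocks are equal, so their failure rates are also equal, $\lambda_{P2}(t) = \lambda_{P3}(t) = \lambda_P(t)$:

$$R_{P23}(t) = e^{-\lambda_P(t)\,t} + e^{-\lambda_P(t)\,t} - e^{-\lambda_P(t)\,t}\,e^{-\lambda_P(t)\,t} = 2e^{-\lambda_P(t)\,t} - e^{-2\lambda_P(t)\,t}$$

$$MTTF_{P23} = \int_0^{+\infty} R_{P23}(t)\,dt = \int_0^{+\infty}\left(2e^{-\lambda_P(t)\,t} - e^{-2\lambda_P(t)\,t}\right)dt = 2\int_0^{+\infty} e^{-\lambda_P(t)\,t}\,dt - \int_0^{+\infty} e^{-2\lambda_P(t)\,t}\,dt$$

Finally, the evaluation of the failure rate is:

$$\lambda_{P23}(t) = -\frac{1}{R_{P23}(t)}\frac{dR_{P23}(t)}{dt}.$$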
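Since the failure rate of each block depends on time, a numerical evaluation is a practical alternative to the analytical route. The sketch below is an assumption-laden illustration, not the authors' procedure: it works directly with the block reliability $R_P(t) = 2e^{-\lambda t} - e^{-2\lambda t}$ already derived, combines two independent blocks as $1 - (1 - R_P(t))^2$, and integrates the result with the trapezoidal rule. The value λ = 1 is hypothetical, so the MTTF is expressed in units of 1/λ.

```python
import math


def r_block(lam, t):
    """Reliability of one block: two identical exponential elements in parallel."""
    return 2.0 * math.exp(-lam * t) - math.exp(-2.0 * lam * t)


def r_p23(lam, t):
    """Two identical, independent blocks in parallel: 1 - (1 - R_P)^2."""
    return 1.0 - (1.0 - r_block(lam, t)) ** 2


def mttf_numeric(reliability, t_max, steps=50_000):
    """Trapezoidal estimate of the integral of R(t) over [0, t_max]."""
    h = t_max / steps
    total = 0.5 * (reliability(0.0) + reliability(t_max))
    for j in range(1, steps):
        total += reliability(j * h)
    return total * h


lam = 1.0  # HYPOTHETICAL failure rate; time is measured in units of 1/lambda
mttf = mttf_numeric(lambda t: r_p23(lam, t), t_max=50.0)
print(f"MTTF of the two blocks in parallel: {mttf:.4f} / lambda")
```

Under these assumptions the combination is equivalent to four identical elements in parallel, for which the 1-out-of-4 case of Eq. (3.18) gives 25/(12λ); the numerical estimate can be compared with that value.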

It is easy to see that from this point on the mathematical treatment becomes very difficult. For more complex structures, software tools for reliability evaluation have been developed and are often used. When the device count becomes large, the complexity of the mathematics makes the use of such software advisable. It should be noted that 4 or 5 devices are sufficient to make the calculation hard. The question is now: how many devices are present in an automobile, an airplane, a PC, etc.?

Example 16

Let us consider a system composed of four identical devices connected as depicted in Figure 3.22. Each item is independent of the others.

Fig. 3.22 RBD for example 16.

Solving first the two series items with (3.1) and the notation used in Figure 3.23, the following reliability is obtained:

$$R_S(t) = \prod_{i=1}^{n} R_i(t) = R^2(t)$$

Now, taking into account the parallel structure:

$$R_{System}(t) = 2R^2(t) - R^4(t)$$

Fig. 3.23 RBD for example 16.


The results obtained are plotted in Figure 3.24.

Fig. 3.24 Item vs. system reliability of Figure 3.23, with the Switch Reliability Point marked.

At low values of the single device reliability, the system reliability is lower than the item reliability. For high values of the item reliability, the system reliability is higher than the item reliability. It is very important to evaluate the point named Switch Reliability Point in Figure 3.24, at which the system reliability is equal to the item reliability:

$$2R^2(t) - R^4(t) = R(t)$$

Rearranging the equation we obtain

$$R^4(t) - 2R^2(t) + R(t) = 0$$

and

$$R(t)\left\{R^3(t) - 2R(t) + 1\right\} = 0$$

$$R^3(t) - 2R(t) + 1 = 0$$

A plot of the left-hand side of the previous equation is reported in Figure 3.25. Reliability can assume only a positive value, so the dashed line depicts the part of the function that is not valid for reliability.


Fig. 3.25 Plot of $R^3(t) - 2R(t) + 1$ vs. $R(t)$.

R(t) = 1 is a solution of the previous equation, so by Ruffini's rule the equation can be rewritten in the following form:

$$\left(R(t) - 1\right)\left(R^2(t) + R(t) - 1\right) = 0.$$

Finally, the solutions are:

$$\begin{cases} R_1 = 1 \\ R_2 = -\dfrac{1}{2} + \dfrac{1}{2}\sqrt{5} \cong 0.618 \\ R_3 = -\dfrac{1}{2} - \dfrac{1}{2}\sqrt{5} \cong -1.618 \end{cases}$$

The switch reliability point is about 0.618. The negative solution R3 is not a valid solution for reliability.
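The switch reliability point can also be located numerically, without factoring the polynomial. The sketch below applies plain bisection to the difference between system and item reliability; the bracketing interval [0.1, 0.9] is a choice made here to exclude the trivial crossings at R = 0 and R = 1.

```python
import math


def system_reliability(r):
    """Two series pairs in parallel, identical items: 2R^2 - R^4."""
    return 2.0 * r**2 - r**4


def bisect(f, lo, hi, tol=1e-12):
    """Bisection for a root of f in [lo, hi]; f(lo) and f(hi) must differ in sign."""
    f_lo = f(lo)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        f_mid = f(mid)
        if (f_mid > 0) == (f_lo > 0):
            lo, f_lo = mid, f_mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


switch_point = bisect(lambda r: system_reliability(r) - r, 0.1, 0.9)
print(f"switch reliability point: {switch_point:.3f}")
```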

If an exponential reliability function with a constant failure rate is considered for each block, the system reliability is:

$$R_{System}(t) = 2e^{-2\lambda t} - e^{-4\lambda t}$$

As far as the MTTF is concerned, we obtain:

$$MTTF_{System} = \int_0^{+\infty} R_{System}(t)\,dt = \frac{1}{\lambda} - \frac{1}{4\lambda} = \frac{3}{4\lambda}$$

and recalling that for a single device

$$MTTF = \frac{1}{\lambda}$$

it is possible to conclude that the MTTF of the system is 75% of the MTTF of the single item:


$$MTTF_{System} = \frac{3}{4\lambda} = \frac{3}{4}\,MTTF = 0.75 \cdot MTTF$$

Item reliability and system reliability are plotted in Figure 3.26.

Fig. 3.26 Plot of Item Reliability, R(t), and System Reliability, RSystem(t), versus λt.

References

[1] IEC 50 (191) International Standard: International Electrotechnical Vocabulary – Chapter 191: Dependability and quality of service, IEC International Electrotechnical Commission, Geneva (December 1990)

[2] Garvin, D.A.: Competing on the eight dimensions of quality. Harvard Business Review (1987)

[3] Michelini, R.C., Razzoli, R.P.: Affidabilità e sicurezza del manufatto industriale: la progettazione integrata per lo sviluppo sostenibile. Tecniche Nuove (2000)

[4] Iuculano, G.: Introduzione a probabilità, statistica e processi stocastici. Pitagora Editrice, Bologna (1996)

[5] Garcia-Diaz, A., Phillips, D.T.: Principles of Experimental Design and Analysis. Chapman & Hall, Boca Raton (1995)

[6] Birolini, A.: Reliability Engineering – Theory and Practice. Springer, Heidelberg, ISBN: 978-3-642-14951-1

[7] Leemis, L.M.: Reliability, Probabilistic Models and Statistical Methods, 2nd edn., ISBN: 978-0-692-00027-4


Chapter 4 Experimental Reliability and Laboratory Tests

Abstract. Component reliability is often affected by different influencing factors. In particular, the operating profile of a component should be taken into account if good reliability predictions are needed. The operating profile changes according to the type of operation of the component: operation can be continuous, non-continuous or even sporadic. Moreover, storage conditions may have a deep impact on the reliability of the component when operating. Obviously, environmental factors need to be taken into account as well: the environment contributes to both aging and failures during the life of the device or system under consideration. To this aim, both the duration and the intensity of environmental stresses should be included in the system operational model. In this chapter, after a brief introduction, the stress factors are analyzed in 4.2. Component degradation is presented in 4.3, and a temperature-based aging model is discussed in depth in 4.4. Finally, the analysis of failure modes (4.5) and laboratory tests (4.6) are presented.

4.1 Introduction

For a specific component, physical reliability is based on the analysis of its “life cycle” through the definition of a model. In any context, mechanical as well as electrical or electronic, the definition of such a model depends on the current state of the component or system. The state is in turn influenced by “inputs” (e.g. loading conditions), “influencing factors” (environmental, mechanical, electrical, etc.) and design performance. This leads to the formalization of equations whose variables represent the phenomenon, the inputs and the measurements.

In the study of physical reliability it is important to remember that any material (and, generalizing, any type of component) is able to store, at the atomic level, “energy” originating from the external environment. This capacity to store energy allows the definition of a critical value above which the mechanism of conservation gives way to the mechanism of modification of the structural bonds. This leads to the breakdown of the material (component) itself.

The anisotropy of materials causes an irregular distribution of energy, which produces breakdown at stored-energy values lower than the theoretical critical value. Interaction with the external environment therefore cannot be predicted “in a deterministic manner”, in that the molecular aspect of materials is also associated with the way energy is exchanged with the external environment: quantity, type and exchange dynamics determine different processes of evolution.

The critical value can be reached through instantaneous breakdown processes (when the nature and dynamics of the stress exceed the resistance of the material) or through slow processes, the latter being associated with the phenomenon of fatigue.

4.2 Stress Factors

The definition of a physical reliability model of a component requires knowledge of both the failure modes and the failure mechanisms. The latter are connected to the various types of stress applied to the device: both the way the device is used in normal functioning conditions and the influencing factors related to the work environment have to be considered.

Depending on the working conditions, the combination of influencing factors that can lead to failures can be of different types. For example, for electronic components the stress factor is often the “work temperature” of the component, while for components in chemical plants it could be the corrosive capacity of the fluids working in the system. The principal influencing factors can be classified into three types:

• climatic factors: an increase in ambient temperature makes heat dissipation more difficult, which obviously affects the performance of electronic components. This can also affect other types of components, e.g. the insulation of electrical machines.

• mechanical factors: shocks involved in transportation and installation and, in particular, the vibration the component undergoes when functioning normally.

• electrical factors (e.g. electromagnetic interference) due to the characteristics of the electricity supply or to mutual interference among machines.

In general, every device is subjected to influencing factors. Obviously, the type of component and its application make some factors more predominant than others.

Since the way components and materials are used has a strong impact on system reliability (note the definition given in Chapter 1), the study of models is aided by standards, which define the parameters to use for the selection and qualification of materials and components. The importance of these aspects can be understood by referring to international standards (e.g. ETSI and IEC).

In fact, these standards specify the stress limits and test conditions relative to temperature, humidity, precipitation, radiation, sand, noise, vibrations, and electrical and mechanical shock.

The standards are also used to classify different environments, defining for each one the values of specific environmental parameters (temperature, relative humidity, vibrations, etc.). Table 4.1 refers to standardized normal conditions. The Table reports the temperature at the central point of a diagram that shows the possible combinations of air temperature and relative humidity. This diagram is known as a climatic plot.

The plot identifies the following areas (e.g. Figure 4.1):

• a more internal area that represents the conditions encountered 90% of the time
• an intermediate area referring to environmental conditions at normal limits
• a more external zone referring to “exceptional” environmental conditions (e.g. a breakdown in the air conditioning system).

The functioning of the apparatus must be guaranteed in the intermediate area defined by the normal climatic limits (see Fig. 4.1). In the zone between the “normal climatic limits” and the “exceptional climatic limits”, the apparatus is allowed to work with degraded performance, but its functionality can be restored when the conditions return to the “normal” area.

It is important to remember that, also when handbooks are used (see chapter 5), all reliability models include the effects of environmental stresses through the environmental factor, πE.

Table 4.1 Example of environmental classification. Ta (°C) is the central point of the climatic plot.

Environment section / Environmental sub section                                           Ta (°C)

SHELTERED
Air-conditioned
    Standard                                                                              25
    Special                                                                               30
Not Air-conditioned
    With partial temperature control                                                      25
    Without temperature control inside walls                                              25
    Without temperature control with greenhouse effect                                    30
    Without temperature control with natural ventilation and without greenhouse effect    30
    Without temperature control inside container, for on line equipments                  30
    Mobile (cockpit or carrier)                                                           30

FREE AIR
Unsheltered
    Cold climate                                                                          15
    Cold climate temperate                                                                15
    Warm climate temperate                                                                20
    Warm climate temperate (tropical)                                                     25
    Warm dry climate                                                                      25
    Warm dry climate temperate                                                            25

One might ask whether it is possible to find a relationship between the applied stress and the strength of a component. This would permit the design of components for which the conditions for failure do not exist. For many components, both the stress (load) and the strength follow a statistical distribution. Assuming a normal distribution (in the literature, however, there are studies in which other distributions are analyzed), we can use statistical parameters to analyze how the load and strength distributions interfere, and hence to evaluate the probability of failure (Figure 4.2). Starting from the mean values and the standard deviations of the strength and stress distributions, it is possible to define the Safety Margin (SM) and the Loading Roughness (LR). Representing the mean values with $\bar{L}$ for the Load distribution and $\bar{S}$ for the Strength distribution (Figure 4.3), and denoting with $\sigma_S$ and $\sigma_L$ the standard deviations of the distributions of the strength of the component and of the stress applied to it, respectively [1], SM and LR are defined as:

$$SM = \frac{\bar{S} - \bar{L}}{\sqrt{\sigma_S^2 + \sigma_L^2}}; \qquad LR = \frac{\sigma_L}{\sqrt{\sigma_S^2 + \sigma_L^2}} \qquad (4.1)$$

The probability that a failure will take place depends on the distance between the two distributions, while the number of components involved in a failure condition depends on this distance as well as on the shape of the distributions.

Fig. 4.1 Climatic plot: an example (ETSI 300 019-1-3) [2].


Fig. 4.2 Relationship between stress and resistance (qualitative plot).

Fig. 4.3 Analysis of links [1] (qualitative plot).


As shown in Figure 4.3, (a) represents the ideal condition, while (b) and (c) refer to two possible cases in which a failure could arise. In (b) the safety margin is low: the stress distribution is narrow, but the resistance distribution is wide. As seen in the figure, the probability of failure involves only a small fraction of the components that follow such distributions.

In (c), instead, a more critical state is represented: because of the more substantial stress, the probability of failure involves a larger fraction of components.

This provides a useful piece of information for quality control: when reliability performance is fundamental but it is impossible to check entire production lots (e.g. systems containing a prevalence of electronic components), a check based on subjecting the components to overstress can be implemented. The components whose resistance falls in the lower tail of the distribution curve are thereby eliminated.

The above considerations are the basis for implementing screening tests, whose goal is to reveal the percentage of components that are intrinsically weak, that is, the elements that fall into the area of premature failures of the bathtub curve, as discussed in Chapter 2.

In order to better understand the previous considerations we propose, as an example, the analysis of a regulation valve. The manufacturer's specifications state that the valve can function up to a maximum pressure of 14000 kPa (with a standard deviation of 5%, i.e. 700 kPa).

The reliability of the device, used with a fluid exerting a pressure of 10000 kPa (with a standard deviation of 1300 kPa), can be estimated from the safety margin:

$$SM = \frac{\bar{S} - \bar{L}}{\sqrt{\sigma_S^2 + \sigma_L^2}} = \frac{14000 - 10000}{\sqrt{700^2 + 1300^2}} \cong 2.71 \qquad (4.2)$$

From this value, using the tables of the standard normal distribution, we can evaluate the reliability as:

$$R = F(SM) = 0.9966 \qquad (4.3)$$
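The valve example can be reproduced with the Python standard library, where the normal distribution table is replaced by the cumulative distribution function of a standard normal variable. This is a minimal sketch under the normality hypothesis stated above; the function names are illustrative.

```python
import math
from statistics import NormalDist


def safety_margin(mean_strength, sd_strength, mean_load, sd_load):
    """Eq. (4.1): distance between the means in units of the combined standard deviation."""
    return (mean_strength - mean_load) / math.hypot(sd_strength, sd_load)


def loading_roughness(sd_strength, sd_load):
    """Eq. (4.1): share of the combined standard deviation that is due to the load."""
    return sd_load / math.hypot(sd_strength, sd_load)


sd_strength = 0.05 * 14000.0   # 5% of the maximum pressure, i.e. 700 kPa
sm = safety_margin(14000.0, sd_strength, 10000.0, 1300.0)
lr = loading_roughness(sd_strength, 1300.0)
reliability = NormalDist().cdf(sm)   # standard normal CDF evaluated at SM

print(f"SM = {sm:.2f}, LR = {lr:.2f}, R = {reliability:.4f}")
```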

4.3 Component Degradation

Only in ideal conditions can a device subjected to various levels or types of stress maintain its characteristics unaltered. In a real situation the component undergoes a degradation process and, consequently, its performance changes with time.

Recalling the hypothesis of normal distribution discussed previously, we can represent the evolution over time of the resistance of a component subjected to external stresses as in Figure 4.4.

The trend plotted in Figure 4.4 depends on the knowledge of the device functionality, evaluated by means of analytical models as well as laboratory tests [1].


Fig. 4.4 Degradation of Resistance.

4.4 The Prediction Approach

The study of both the mechanisms and the types of failure is fundamental for defining reliability models and for evaluating failure rates. However, because the tests are carried out in simulated conditions, it is important to underline that such models represent a best estimate based on the observed data. As a consequence, the use of such models in the reliability field is justified only if the device or the material under examination is used in the same conditions in which the test was carried out.

Temperature is an influencing factor which affects materials, components and processes. Because many processes (chemical reactions, diffusion of gases, etc.) are accelerated when the temperature increases, it is possible to define the Arrhenius model for this influencing factor:

$$R = H \cdot e^{-E_a/(K T)} \qquad (4.4)$$

(R = rate (velocity) of the process, H = constant typical of the process, K = Boltzmann constant, 8.617 · 10⁻⁵ eV/K ≈ 1/11605 eV/K, Ea = activation energy of the degradation process, in eV, which depends on the technology, T = thermodynamic temperature in kelvin (K), i.e. temperature in °C + 273.15).


This acceleration model is often used to predict the life of a component (item) as a function of temperature. It applies specifically to those failure mechanisms that are temperature related, within the range of validity of the model. The model states:

$$\text{Time to failure} = const \cdot e^{E_a/(K T)} \qquad (4.5)$$

where Time to failure is a measure of the life of the item under consideration and const is a parameter evaluated experimentally for the item involved. Note that the electronvolt (eV) is a unit of energy. The value of Ea depends on the failure mode considered. Some values of the activation energy Ea for silicon semiconductor failure mechanisms are reported in Table 4.2.

Another Arrhenius expression involves the failure rate. Denoting by λ1 the component failure rate (in h⁻¹) at the temperature T1 (reference temperature, in K), the failure rate λ2 (in h⁻¹) of the same component at the temperature T2 (stress temperature, in K) is given by:

$$\lambda_2 = \lambda_1 \cdot e^{\frac{E_a}{K}\left(\frac{1}{T_1} - \frac{1}{T_2}\right)} \qquad (4.6)$$
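Equation (4.6) can be related to Eq. (4.5) with one additional step. The sketch below assumes, as is not stated explicitly in the text, that the failure rate is inversely proportional to the time to failure given by Eq. (4.5), so that the constant cancels in the ratio:

```latex
\frac{\lambda_2}{\lambda_1}
  = \frac{\text{Time to failure}(T_1)}{\text{Time to failure}(T_2)}
  = \frac{const \cdot e^{E_a/(K T_1)}}{const \cdot e^{E_a/(K T_2)}}
  = e^{\frac{E_a}{K}\left(\frac{1}{T_1} - \frac{1}{T_2}\right)}
```

For T2 > T1 the exponent is positive, so the failure rate at the stress temperature is higher than at the reference temperature.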

Table 4.2 Approximate values of the activation energy, Ea, for different failure mechanisms in silicon semiconductors.

| Failure mechanism | Ea (eV) |
| --- | --- |
| Corrosion | 0.3 – 1.1 |
| Assembly defects | 0.5 – 0.7 |
| Electromigration, Al line | 0.6 |
| Electromigration, contact/via | 0.9 |
| Mask defects | 0.7 |
| Photoresist defects | 0.7 |
| Contamination | 1.0 |
| Charge injection | 1.3 |
| Dielectric breakdown | 0.2 – 1.0 |
| Au-Al intermetallic growth | 1.0 – 1.05 |


Fig. 4.5 Acceleration Factor (AF), evaluated according to (4.7), versus actual temperature t2 = T2 − 273.15, with t1 = T1 − 273.15 = 35 °C. The plot reports the AF curve for different values of Ea.

Taking into consideration the previous equation, the acceleration factor AF is defined as:

$$AF = e^{\frac{E_a}{K}\left(\frac{1}{T_1} - \frac{1}{T_2}\right)} \qquad (4.7)$$

A plot of the acceleration factor is depicted in Figure 4.5, where a reference temperature t1 = 35 °C is assumed and the activation energy varies from 0.4 to 1.0 eV.
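The acceleration factor of Eq. (4.7) and the failure rate scaling of Eq. (4.6) can be evaluated with a few lines of code. This is only a sketch: the function names, the stress temperature of 125 °C and the reference failure rate are illustrative assumptions, not values taken from the text.

```python
import math

BOLTZMANN_EV_PER_K = 8.617e-5  # Boltzmann constant in eV/K (about 1/11605)

def acceleration_factor(ea_ev, t1_celsius, t2_celsius):
    """Arrhenius acceleration factor of Eq. (4.7); temperatures in degrees Celsius."""
    t1_kelvin = t1_celsius + 273.15
    t2_kelvin = t2_celsius + 273.15
    return math.exp((ea_ev / BOLTZMANN_EV_PER_K) * (1.0 / t1_kelvin - 1.0 / t2_kelvin))

def stressed_failure_rate(lambda1_per_hour, ea_ev, t1_celsius, t2_celsius):
    """Failure rate at the stress temperature according to Eq. (4.6)."""
    return lambda1_per_hour * acceleration_factor(ea_ev, t1_celsius, t2_celsius)

# Reference temperature of 35 degrees C as in Figure 4.5; the stress temperature
# of 125 degrees C and the reference failure rate of 1e-7 per hour are assumptions.
for ea in (0.4, 0.7, 1.0):
    af = acceleration_factor(ea, 35.0, 125.0)
    print(f"Ea = {ea:.1f} eV -> AF = {af:.1f}")
lambda2 = stressed_failure_rate(1e-7, 0.7, 35.0, 125.0)
print(f"lambda2 = {lambda2:.2e} per hour")
```

At equal temperatures the factor is 1, and it grows rapidly with both the temperature difference and the activation energy, which is the behaviour shown by the curves of Figure 4.5.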

An interesting way to plot the acceleration factor is depicted in Figure 4.6, where an inverse absolute temperature scale is used on the horizontal axis. If the same scale is labelled in degrees Celsius, the plot of Figure 4.7 is obtained. A straight line on the plots of Figures 4.6 and 4.7 supports the assumption that an Arrhenius relationship holds.
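The straight-line behaviour follows from taking the logarithm of Eq. (4.5): ln(Time to failure) = ln(const) + (Ea/K)·(1/T), which is linear in 1/T with slope Ea/K. The sketch below uses this to estimate Ea with an ordinary least-squares fit. The test temperatures and lifetimes are synthetic (generated from Eq. (4.5) with assumed parameters), and the function name is hypothetical.

```python
import math

BOLTZMANN_EV_PER_K = 8.617e-5  # Boltzmann constant in eV/K

def fit_activation_energy(temps_celsius, times_to_failure_h):
    """Least-squares line through the points (1/T, ln t); returns (Ea in eV, const in h)."""
    x = [1.0 / (t + 273.15) for t in temps_celsius]
    y = [math.log(ttf) for ttf in times_to_failure_h]
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    slope = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y)) / sum(
        (xi - x_mean) ** 2 for xi in x
    )
    intercept = y_mean - slope * x_mean
    # The slope equals Ea/K and the intercept equals ln(const).
    return slope * BOLTZMANN_EV_PER_K, math.exp(intercept)

# Hypothetical test temperatures; the lifetimes are synthetic, generated from
# Eq. (4.5) with Ea = 0.7 eV and const = 1e-5 h, so the fit should recover them.
temps = [85.0, 105.0, 125.0, 150.0]
lifetimes = [1e-5 * math.exp(0.7 / (BOLTZMANN_EV_PER_K * (t + 273.15))) for t in temps]
ea_estimate, const_estimate = fit_activation_energy(temps, lifetimes)
print(f"Estimated Ea = {ea_estimate:.3f} eV")
```

With real test data the points would scatter around the line, and a marked departure from linearity would suggest that the Arrhenius relationship does not hold over the temperature range considered.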

The Arrhenius model can be considered a simplified degradation model. In fact, with the advance of technology and, consequently, of the functionality of modern components (transistors, microprocessors, custom devices, and so on), the model of Eq. (4.6) may not be adequate. Figure 4.9 shows the difference between the information contained in an electronic database (e.g. MIL-HDBK-217, a database for reliability prediction of electronic equipment [2, 3]) and the actual situation. However, more complex models, which represent the failure rate of electronic components and consider different influencing factors, are derived on the basis of the Arrhenius theory.

[Figure 4.5 plot: AF on a logarithmic vertical axis (1 to 10^6) versus temperature in °C (0 to 200) on the horizontal axis; one curve for each value of Ea from 0.5 to 1.0.]


Fig. 4.6 Acceleration Factor (AF), evaluated according to (4.7), versus inverse absolute temperature 1/T2, with T1 = 308.15 K. The plot reports the AF curve for different values of Ea.

Fig. 4.7 Acceleration Factor (AF), evaluated according to (4.7), versus inverse absolute temperature 1/T2 (axis labelled in degrees Celsius), with t1 = 35 °C. The plot reports the AF curve for different values of Ea.

[Figures 4.6 and 4.7 plots: AF on a logarithmic vertical axis (1 to 10^6) versus inverse absolute temperature 1/T2 (0.00211 to 0.00331) on the horizontal axis; one curve for each value of Ea from 0.5 to 1.0.]


A generic model able to predict the failure rate in the electronic field is represented by the following equation:

$$\lambda = \lambda_0 \cdot \pi_E \cdot \pi_S \cdot \pi_T \cdots \pi_K \qquad (4.8)$$

where λ0 denotes the failure rate in reference conditions, πE is the environmental factor, πS is a factor that depends on the electrical stress applied to the component, πT is a temperature-dependent factor, and πK is a factor that takes into account the complexity, technology and functionality of the component.

This formula has been adopted in the US standard MIL-HDBK-217, which is discussed in Chapter 5 together with other handbooks. The exponential distribution is assumed as a hypothesis for Eq. (4.8).
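The structure of Eq. (4.8), together with the exponential hypothesis, can be illustrated as follows. All numerical values are invented for the purpose of the example and do not come from any handbook; the function names are likewise hypothetical.

```python
import math

def predicted_failure_rate(lambda0_per_hour, pi_factors):
    """Eq. (4.8): reference failure rate multiplied by every pi factor."""
    rate = lambda0_per_hour
    for value in pi_factors.values():
        rate *= value
    return rate

def reliability(rate_per_hour, hours):
    """Reliability under the exponential (constant failure rate) hypothesis."""
    return math.exp(-rate_per_hour * hours)

# All numbers below are made up for illustration; real values come from a handbook.
factors = {"pi_E": 2.0, "pi_S": 1.5, "pi_T": 3.0, "pi_K": 1.2}
rate = predicted_failure_rate(1e-8, factors)
print(f"lambda = {rate:.2e} per hour, R(10 years) = {reliability(rate, 87600):.4f}")
```

Because the factors are multiplicative, each one scales the predicted failure rate independently of the others, which is what makes a tabulated handbook approach practical.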

Fig. 4.8 Arrhenius’s model.

[Figure 4.8 plot: temperature T (°C), from 25 to 400, versus time t (h), from 10^0 to 10^7, with lines labelled from 0.5 to 4.0.]


Fig. 4.9 Temperature Vs reliability for electronic components (qualitative).

4.5 Failure Modes

A failure mode describes how a component can fail. Identifying the failure mode is fundamental when, in a reliability analysis, it is important to know the consequences of a failure for the system. The previous paragraphs showed how the failure mechanisms responsible for processes such as corrosion, wear, vibration, fracture, oxidation, etc. play a fundamental role in the dynamics of the failure. Note, however, that the causes of a particular breakdown often have to be sought in the production process or in the context in which the component works.

Generally, three different operating conditions can be identified: continuously active, standby and intermittent activity (components in standby are normally passive).

Starting from the operating conditions, it is possible to define two categories for the causes of failure: the first is related to failures that occur when a component is called into service from standby (demand related), and the second to failures of components during continuous activity (time related). Components that operate in both modes can obviously exhibit both types of failure.

In addition to catastrophic failure, characterized by the total loss of functionality, we can also consider degradation failure and incipient failure. Degradation failure describes cases in which there is a loss of function but the component is still capable of performing above minimal requirements. Incipient failure refers to cases where circumstances indicate that, without maintenance or repair, the system or component will undergo a loss of function. Furthermore, types of breakdown are distinguished according to the sub-function that is not performed. For example, for certain devices (e.g. a motor), a failure may occur in the start-up phase or in the shutdown phase, although this type of breakdown could clearly evolve in different ways.

4.6 Laboratory Tests on Components and Systems

Recalling the standard IEC 60050 (191), the term test denotes an operation, or series of operations, carried out in order to evaluate, quantify and classify a characteristic or some other property of an item [4]. By item, we normally mean an elementary component, a sub-system or a more complex system. By laboratory test, instead, we mean a compliance test (suitable for verifying a characteristic of an element) or a determination test (carried out to establish a characteristic of an element). These tests are performed in established and controlled conditions, which may or may not simulate field conditions. One procedure for determining and measuring the reliability parameters of a family of components in the laboratory is to subject a representative sample of such components to the same stress that they will undergo in operation, both in type (e.g. temperature, humidity, etc.) and in level (for temperature, 40 °C, 55 °C, etc.). In this case, the test continues until all or most of the samples have failed; this type of test is commonly referred to as a long life test.

When the test ends before all samples have failed, the data analysis can be very difficult. This way of operating is named censoring. Censoring is often present in lifetime data because in many situations it is impossible to observe the lifetime of every device under test. This is particularly true for electronic devices, whose low failure rate implies a very long lifetime and, consequently, very long tests. One way to overcome this problem is the use of accelerated tests, as discussed in the following. It is, however, possible to process data even in the presence of censoring. A censored observation occurs when only a bound on the time of failure is known. Figure 4.10 depicts the typical situations. If n is the number of items and nf is the number of observed failures, three main situations are possible [6]:

• Complete data set: the test ends when nf = n, as depicted in Figure 4.10.a.
• Time censored data set: the test ends at a time defined a priori. The situation is shown in Figure 4.10.b. In this type of censoring, the number of failures nf collected during the test is random. The figure depicts a situation with nf = 4.
• Order statistic censored data set: the test ends when a number of failures nf defined a priori has been observed. In this type of censoring, the time necessary to complete the test is random. Figure 4.10.c reports a case with nf = 3: the test ends when the third failure occurs.

The aforementioned types of censored data are the result of right censoring. Other types of censoring are possible, such as left censoring and interval censoring.
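As an illustration of how right-censored data can still be processed, the sketch below estimates a failure rate from both a time censored and an order statistic censored test. It assumes an exponential (constant failure rate) lifetime distribution, which the text does not prescribe at this point; under that assumption the usual estimate is the number of failures divided by the total time accumulated by all items. The data and the function name are hypothetical.

```python
def exponential_rate_estimate(failure_times_h, n_items, test_end_h):
    """Estimate a constant failure rate from a right-censored test.

    failure_times_h: observed failure instants (nf values, nf <= n_items)
    test_end_h: instant at which the test stops; items still working at that
                instant are right censored and contribute test_end_h hours each.
    """
    n_failures = len(failure_times_h)
    if n_failures == 0:
        raise ValueError("at least one failure is needed for a point estimate")
    survivors = n_items - n_failures
    total_time_on_test = sum(failure_times_h) + survivors * test_end_h
    return n_failures / total_time_on_test

# Hypothetical data for n = 5 items.
# Time censored test stopped at 1000 h with nf = 4 failures:
rate_time_censored = exponential_rate_estimate([120.0, 340.0, 610.0, 900.0], 5, 1000.0)
# Order statistic censored test stopped at the third failure (nf = 3):
rate_order_censored = exponential_rate_estimate([120.0, 340.0, 610.0], 5, 610.0)
print(f"{rate_time_censored:.2e} per hour, {rate_order_censored:.2e} per hour")
```

The only difference between the two cases is the stopping instant: a fixed time in the first, the instant of the last observed failure in the second.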


Fig. 4.10 Examples of uncensored data (a) and censored data (b and c). The symbol × denotes a failure and tf(n) denotes the failure instant of sample n [6].

[Figure 4.10, panels (a), (b) and (c): samples 1 to 5 on the vertical axis versus time (h) on the horizontal axis, with failure instants tf(1) to tf(5).]


Recalling the “bathtub curve” that characterizes the failure rate of an electrical or electronic component (see Chapter 2), and remembering that the useful life (the central part of the bathtub curve) extends for hundreds of thousands of hours, it is evident that, especially for electronic components, this type of test is inadequate: it would provide information on the behaviour of a component only over very long periods of time, comparable with its technological obsolescence.

We must therefore consider an accelerated life test, that is, a test in which the elements are subjected to higher levels of stress than in normal use.

As said in the previous section, we define the acceleration factor as the ratio between the value of the stress applied during the test and the corresponding value that characterizes the conditions of normal use. The aim of this test is to accelerate the degradation phenomena without altering the dominant failure mechanisms (defined in Chapter 1). This allows the failure of the components to be observed in a shorter time.

This category of tests is also useful for carrying out quantitative comparisons among components of the same type but of different origin, for example components coming from different production lines or from different manufacturers. These tests take into consideration a wide variety of types of stress, both climatic (cold, heat, humidity) and, more generally, environmental (vibrations, corrosive conditions).

Table 4.3 Classification of environmental tests (see EN 60068-1 for exact table).

| Test | Environmental stress |
| --- | --- |
| A | Cold |
| B | Dry heat |
| C | Heat with high humidity (continuous) |
| D | Heat with high humidity (not continuous, cyclic) |
| E | Mechanical impacts (bumps and jerks) |
| F | Vibrations (sinusoidal, random occasional) |
| G | Constant acceleration |
| J | Mold |
| K | Corrosive atmosphere (e.g. salty fog) |
| L | Powder and sand |
| M | Atmospheric pressure (high and low) |
| N | Temperature changes |
| Q | Hermetic sealing (for liquids and gas) |
| R | Water (rain, dripping) |
| S | Radiation (solar, excluding electromagnetic radiation) |
| T | Soldering |
| U | Sturdiness of terminals (of components) |


Table 4.3 reports a classification of tests in electrical and electronic settings, derived from the IEC 60068-1 or EN 60068-1 standard (Environmental testing – Part 1: General and guidance) [5].

The tests can be detailed further according to the particular type of stress. For example, test U, for the sturdiness of terminals and of devices integrally assembled with a component, can concern traction (Ua1), compression (Ua2), bending (Ub), torsion (Uc) and torque measurement.

Regardless of the type, level and duration of the stress, laboratory tests, both for compliance and for determination, are normally carried out according to the following sequence:

Phase 1 – Preliminary adjustment: Operation performed on the device (or samples) being tested in order to eliminate the effects of its preceding states or conditions. For example, this could consist in keeping the elements under test at ambient or laboratory temperature for an established time interval before applying the stress.

Phase 2 – Controls and initial measurements: This phase ascertains that all components to be tested are functioning correctly (conformity assessment). Phase 2 is taken as the reference condition for measurements on the components under test.

Phase 3 – Treatment: Components are subjected to a stress profile defined by standards or determined by other experimental criteria. An example could be the application of a temperature (Test B – Dry heat) for a certain time period using an oven, or the application of heat with high humidity (Test D) in a climatic chamber.

Phase 4 – Readjustment: After the stress is applied, it is necessary to restore the components to the reference conditions of Phase 2 and verify the level of degradation or the occurrence of failure.

A sequence test is a repetition of test cycles characterized by the phases described above. Laboratory tests can also be classified as:

• combined test, in which two or more types of environmental stress act simultaneously on the device (e.g. the combined heat and humidity test);

• compound test, in which two or more types of environmental stress are applied in quick succession (e.g. test Z/AD: compound test (Z) of cold (A) and cyclic heat with high humidity (D));

• sequence test, in which the element being tested is subjected successively to two or more types of stress at time intervals that affect the test components (e.g. soldering test (T) followed by rapid temperature changes (test Na) and by non-constant acceleration, i.e. impact tests (Ea)).

Table 4.4 lists some of the principal effects of various environmental agents.


Table 4.4 Principal effects of various environmental agents.

| Agent | Main effect | Type of resulting failure |
| --- | --- | --- |
| High temperature | Thermal aging (oxidation, flaws, chemical reaction); softening, fusion, sublimation; reduced viscosity; dilatation | Wear of mobile parts due to dilatation or to loss of lubricating performance |
| High relative humidity | Adsorption and absorption of humidity; swelling of materials; loss in mechanical resistance; chemical reaction (corrosion, electrolysis); conductivity growth of insulating materials | Physical breakdown, insulation defects, mechanical failure |
| High pressure | Compression, deformation | Mechanical failure, leaks (hermetic sealing defects) |
| Solar radiation | Chemical, physical and photochemical reactions; surface deterioration; discoloration; heating; ozone formation | Insulation defects |
| Sand or dust | Abrasion and erosion; seizure; incrustation; loss of thermal conductivity; electrostatic effects | Increased wear, electrical failure, mechanical failure, overheating |
| Corrosive atmosphere | Chemical reactions; increase in conductivity; increase in contact resistance | Increased wear, electrical failure, mechanical failure |
| Rain | Water absorption; changes of temperature; erosion; corrosion | Electrical failure, flaws, leaks, surface deterioration |
| Rapid temperature changes | Changes of temperature; differential heating | Mechanical failure, fine leaks, seal degradation, cracks |
| Constant acceleration, vibrations, jolts and bumps | Mechanical stress; fatigue; resonance | Mechanical failure, increased wear of moving parts, structural deformation |


References

[1] O’Connor, P.D.T.: Practical Reliability Engineering, 4th edn. Wiley, Chichester. ISBN 0-470-84462-0

[2] ETSI ETS 300 019-1-3, Edition 1 (1992-02): Equipment Engineering (EE); Environmental conditions and environmental tests for telecommunications equipment; Part 1-3: Classification of environmental conditions; Stationary use at weatherprotected locations

[3] MIL-HDBK-217F, Reliability Prediction of Electronic Equipment (December 2, 1991), with Notice 1 (10 July 1992) and Notice 2 (28 February 1995)

[4] IEC 60050-191 ed. 1.0, International Electrotechnical Vocabulary. Chapter 191: Dependability and quality of service. IEC International Electrotechnical Commission, Geneva (December 31, 1990). Forecast publication date for ed. 2.0 is 2012-06-02

[5] IEC 60068-1 ed. 6.0, Environmental testing. Part 1: General and guidance (1998). Forecast publication date for updated version is August 2011

[6] Leemis, L.M.: Reliability, Probabilistic Models and Statistical Methods, 2nd edn. ISBN 978-0-692-00027-4


Chapter 5 Reliability Prediction Handbooks: Evaluation of the System Failure Rate

Abstract. The evaluation of the failure rates of devices, components and systems is often a very difficult task. In order to simplify failure rate evaluation for established electronic and electromechanical devices, it is possible to use ad hoc handbooks, which give failure rate values for many devices. This chapter gives a historical overview of these handbooks. Many handbooks are available in which the laws governing the dependency of the failure rate on different stresses are considered. Section 5.1 gives a brief introduction to the first generation handbooks. Section 5.2 presents the second generation handbooks. A US military handbook is presented in Section 5.3, while Sections 5.4 and 5.5 are devoted to FARADA and to the third generation handbooks, respectively. Section 5.6 discusses examples of failure rate evaluation. Finally, the unit for failure rate, the FIT, is introduced in Section 5.7.

5.1 Introduction

Modern handbooks originated in the United States, from the analysis of data collected in a military environment on systems equipped with electronic components and used during the Second World War. Between 1943 and 1950, the correlation between the frequency of failure in communication and navigation apparatus and the severity of the conditions in which these were required to operate became clear. At that time, particular attention was already being paid to climatic conditions such as temperature and humidity. This drew attention to the decrease in safety levels for troops, also in terms of the “availability” of equipment and the related maintenance costs.

On the basis of such considerations, in 1952 the American government initiated a program known as AGREE (Advisory Group on Reliability of Electronic Equipment), a consulting group that in 1957 published a report on specifications and tests regarding the reliability of such equipment. This gave rise to handbooks to be used as support in design. The objective was to provide an evaluation of the failure rate with a certain level of confidence.

In 1953 the Radio Electronic Television Manufacturer’s Association (RETMA), later to be called the Electronic Industries Association (EIA), formed a commission on electronic applications to determine methods and procedures for the collection, analysis and classification of reliability data. The results of the commission’s work were published in the Electronics Applications Reliability Review bulletins. These represented the first rational collection of data from companies and industrial organizations such as RCA (Radio Corporation of America), GE (General Electric) and Motorola, which published the results of the life cycle tests they performed on their components. A trace of this important work can be seen in the first edition of MIL-HDBK-217 (“Reliability Prediction of Electronic Equipment”), published by the American Department of Defense in 1962.

The first attempt to define a handbook reporting information on mechanical and electromechanical components dates to 1959, with the Martin Titan Handbook, also known as “Procedure and Data for Estimating Reliability and Maintainability”. The merit of this handbook lies in its attempt to present the data according to a criterion of standardization: the Titan handbook is in fact the first collection of data in which failure rates are expressed as a function of working hours and which uses the exponential distribution in its calculations. Although the reported data are not supported by adequate statistical information (the number of components tested, the breakdowns noted and the hours of observation are not reported), they are accompanied by information on the failure mode. The Titan handbook was also the first to propose empirical factors (“K factors”) which take into account the modes of use and the possible presence of redundancy.

5.2 Second Generation Handbooks

In the Sixties, in the wake of the experience with the Titan Handbook and of requests from the Air Force, programs were initiated for the collection and organization of reliability statistics. This work led to the following handbooks:

• MIL-Handbook-217
• Failure Rate Data Bank (FARADA)
• RADC Non Electronic Reliability Notebook.

5.3 MIL-Handbook-217

Starting from the characteristics of the Titan Handbook, the Military Handbook 217 (MIL-HDBK-217) classifies components into categories and subcategories, each characterized by corrective factors.

The latest editions of MIL-HDBK-217 provide some of the most complete collections of data available. Unfortunately, due to the collection and organization methods used, the data are not always reliable.

The models contained in MIL-HDBK-217 refer only to production defects of components that undergo stress connected with their use. Problems related to design, transport and mode of usage are not considered in the models. The data on which the handbook is based are of empirical origin and are not associated with an effective analysis of the real origins of breakdowns; consequently, they cannot be used to identify the onset of possible problems, and no statistical confidence can be assigned to the results obtained through such models.

In particular, variations in tolerance are masked by the use of K factors. Furthermore, the failure rates are considered as fixed measurements of a specific apparatus and not as a general measure of a range of different types of equipment.

5.4 Failure Rate Data Bank (FARADA)

In the Seventies, encouraged by the US Army, a program for the exchange of data on equipment sold in the military sector was initiated. Known as GIDEP (Government/Industry Data Exchange Program), it brought together more than 400 participants, 80% of which were private industrial organizations. The data collected by GIDEP were handled by the first software system for data processing, with the advantage of quick updating and of organization in formats suited to statistical processing. The related handbook, FARADA, provided, in addition to failure rates, information on the replacement rates of components and, where available, the failure modes.

The data come from the field, from accelerated life cycle tests and from reliability demonstration tests. The problem with the statistical analysis adopted by this handbook is that the data come from non-homogeneous populations. Although the chi-square distribution is used to define confidence intervals, the average estimated failure rate is not representative of the subpopulations of samples. One of the problems associated with different databases is the use of purely statistical techniques when defining confidence intervals.
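To make the role of the chi-square distribution concrete, the sketch below computes a two-sided confidence interval for a constant failure rate from the number of failures and the cumulative observation time of a time censored test. It assumes precisely what the text warns may not hold, namely a homogeneous population with a constant failure rate. The data are hypothetical, and the chi-square quantile is obtained with the Wilson-Hilferty approximation rather than an exact routine, so that only the Python standard library is needed.

```python
from statistics import NormalDist

def chi_square_quantile(p, dof):
    """Approximate chi-square quantile (Wilson-Hilferty approximation)."""
    z = NormalDist().inv_cdf(p)
    c = 2.0 / (9.0 * dof)
    return dof * (1.0 - c + z * c ** 0.5) ** 3

def failure_rate_interval(n_failures, total_time_h, confidence=0.90):
    """Two-sided interval for a constant failure rate (time censored test, n_failures >= 1)."""
    alpha = 1.0 - confidence
    lower = chi_square_quantile(alpha / 2.0, 2 * n_failures) / (2.0 * total_time_h)
    upper = chi_square_quantile(1.0 - alpha / 2.0, 2 * n_failures + 2) / (2.0 * total_time_h)
    return lower, upper

# Hypothetical pooled data: 4 failures in 2970 cumulative hours of observation.
low, high = failure_rate_interval(4, 2970.0)
point = 4 / 2970.0
print(f"{low:.2e} <= lambda <= {high:.2e} per hour (point estimate {point:.2e})")
```

With so few failures the interval is wide, and if the pooled data actually come from different subpopulations, neither the point estimate nor the interval describes any one of them.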

Other manuals, commonly denoted as second generation manuals, are similar to MIL-HDBK-217 but collect data essentially for components in the field of telecommunications. Among these, we note the RPP manual published by Bellcore since 1984, the HRD manual published by British Telecom and the Italtel IRPH93 manual (prepared with the collaboration of both the French CNET and British Telecom). The RPP manual reports data mainly for devices and components used in the telecommunications field, covering five different areas. A common element which characterizes these databases is the hypothesis of components with a constant failure rate.

5.5 Third Generation Data Banks

It appears evident that one of the objectives in the definition of a handbook should be the possibility of using data adequately characterized by their uncertainty. The problem that arises is how to define the uncertainty of reliability measurements.

Since there is no single, common methodology for defining uncertainty, handbooks use quartiles (e.g. the Swedish handbook T-BOOK - Reliability Data of Components in Nordic Nuclear Power Plants), confidence intervals (the Italian handbook EIREDA - European Industry Reliability Data Handbook) or provide no such information, which is the typical situation for handbooks used in military environments.

With the “third generation” of databases, attention has moved from military and aerospace industries to the type of intrinsically critical installations such as nuclear power plants, oil rigs and the chemical industry. Modern databases of reliability in the public domain are the following:

• IEEE-Std-500 (Piscataway, NJ, 1984)
• OREDA (Offshore reliability data, Norway, 1984)
• EIREDA (European Industry Reliability Data Handbook, Italy, 1991)
• T-BOOK (Reliability Data of Components in Nordic Nuclear Power Plants, Sweden)
• CCPS (Guidelines of the Center for Chemical Process Safety, New York, 1989)
• NSWC-94/L07 - Handbook of Reliability Prediction Procedures for Mechanical Equipment.

In particular, the last handbook cited above, developed by the Naval Surface Warfare Center – Carderock Division, provides failure rates for basic classes of mechanical components (belts, springs, bearings, brakes and clutches, to cite just a few). The failure rate models take into account the impact of several factors on the reliability of components. For example, for a spring the most common failure modes are due to fatigue and excessive loads, and its reliability depends on the material of which it is made, the working environment and the way in which it is designed. Clearly, the use of these models requires large amounts of data that may not be known to the user. Another aspect of this database is that a parameter related to manufacturing defects is not considered in the evaluations.

It is interesting to observe that there is no unique “profile” of the users of databases. The information collected is obviously useful to the design engineer (who is interested in failure mechanisms and failure modes), to the risk analyst (who needs information on the availability of the system, or rather the probability of a successful mission, obtained from the availability of the components and their breakdown rates), and to maintenance experts, who are ever more attentive to service performance.

The critical element in the design of a handbook is the way in which data are collected and the definition of the attributes which define the intervals for measuring reliability. To better understand this concept, take the example of a very simple and widespread component: the resistor. It is sufficient to consult any catalogue, including online catalogues, from which this elementary component can be purchased to understand how fundamental the definition of the parameters which characterize the resistor is. Under the heading of resistors, many different types of components and applications are listed (e.g. from resistors for printed circuits to resistors for traction applications).

In second generation databases, the determination of the correspondence among the attributes is left to the user, while in third generation databases a hierarchic approach is implemented, which provides the user with a guide when the knowledge of the attributes is insufficient.


5.6 Calculation of the Failure Rate

This section demonstrates the calculation of the failure rate of electronic components following the procedures of MIL-HDBK 217. As made clear by the definition of reliability in Chapter 1, any evaluation of the failure rate, and therefore of reliability, cannot be undertaken without knowledge of the working environment in which the system must operate. Data banks in electronics, including MIL-HDBK 217, propose a classification of working environments into:

• Fixed protected environment: characterized by a high level of insensitivity to the atmospheric environment, both in temperature and in humidity, which is controlled within defined limits. An example is electronic apparatus housed in masonry buildings. MIL-HDBK 217 classifies this as a “land protected environment” (GB, Ground benign), with controlled temperature and humidity and no mechanical stress, easily accessible for maintenance activity.

• Fixed unprotected environment: characterized by thermal and mechanical stress determined directly by natural climatic conditions. MIL-HDBK 217 calls this “fixed land environment” (GF, Ground fixed), characterized by moderately controlled environmental conditions. This is typical of apparatus installed in the open air, e.g. electronic control units for traffic control, environmental monitoring, telecommunication equipment and radar.

• Mobile environment: characterized by mechanical stress and temperature gradients of a certain severity, typical of portable equipment or equipment mounted on mobile platforms. MIL-HDBK 217 identifies this as GM (Ground mobile).

It should be remembered that, in addition to these and unlike other databanks in electronics, MIL-HDBK 217 classifies a further eleven operating environments, including naval (N), aeronautical (A) and space (S), as well as environments characterized by particularly critical conditions with high stress on the component. In all reliability handbooks, the operating environment is identified by means of the πE factor (Environmental Factor).

The prediction models for electronic components given in handbooks also rely on the following hypotheses: system in functional series configuration, independent failures and constant failure rate. It is further assumed that Arrhenius’s law, the model describing the physical-chemical degradation of a component, relates the time to failure to the level of thermal stress applied. Based on these hypotheses, the failure rate of the system can be calculated immediately using the relations for the series functional configuration reported in Chapter 3.
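As a reminder of what these hypotheses imply, the relations can be sketched as follows. The symbols n, λ_i, λ_S, E_a, k, T_1 and T_2 are introduced here only for illustration, and the Arrhenius expression is the usual textbook form of the acceleration factor between two absolute temperatures, not a formula quoted from the handbook.

```latex
% Series configuration, independent failures, constant failure rates
\lambda_S = \sum_{i=1}^{n} \lambda_i ,
\qquad
R_S(t) = e^{-\lambda_S t} ,
\qquad
\mathrm{MTTF}_S = \frac{1}{\lambda_S}

% Arrhenius law: the failure rate grows with absolute temperature T
% (E_a = activation energy, k = Boltzmann constant, T_2 > T_1)
\lambda(T) \propto e^{-E_a/(kT)} ,
\qquad
\frac{\lambda(T_2)}{\lambda(T_1)}
  = \exp\!\left[\frac{E_a}{k}\left(\frac{1}{T_1} - \frac{1}{T_2}\right)\right]
```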

Example 1

Calculation of the failure rate of a signalling lamp of an aircraft. We assume that the lamp is installed on the aircraft, in an inhabited area, and operates continuously at 24 V DC. The model for this component in MIL-HDBK-217E (section 5.1.17) is:


$$\lambda_p = \lambda_b \cdot \pi_u \cdot \pi_A \cdot \pi_E \quad \left[\frac{\text{failures}}{10^{6}\ \text{h}}\right] \tag{5.1}$$

with:

$\lambda_b = 0.074 \cdot 24^{1.29} = 4.5$ failures/$10^{6}$ hours is the base failure rate;
$\pi_u = 1.0$ is the utilization factor;
$\pi_A = 3.3$ is the application factor;
$\pi_E = 4.0$ is the environmental factor.

Environmental conditions are considered as those for an unprotected device. The value of the failure rate for this component is:

$$\lambda_p = 4.5 \cdot 10^{-6} \cdot 1.0 \cdot 3.3 \cdot 4.0 = 59 \cdot 10^{-6}\ \frac{\text{failures}}{\text{h}} \tag{5.2}$$

Consequently, the Mean Time To Failure is:

$$\mathrm{MTTF} = \frac{1}{\lambda_p} = \frac{1}{59 \cdot 10^{-6}} = 1.7 \cdot 10^{4}\ \text{h} \tag{5.3}$$
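The arithmetic of Example 1 can be checked with a short script. This is only a sketch: the function name and argument names are illustrative, and the numeric values are those quoted in the example, not values read from the handbook tables.

```python
# Sketch of the Example 1 calculation (signalling lamp, eq. 5.1 to 5.3).
# Function and variable names are hypothetical; values are those quoted in the text.

def lamp_failure_rate(rated_voltage, pi_u, pi_a, pi_e):
    """Return the predicted failure rate in failures per 10^6 hours."""
    lambda_b = 0.074 * rated_voltage ** 1.29  # base failure rate
    return lambda_b * pi_u * pi_a * pi_e


lambda_p = lamp_failure_rate(rated_voltage=24.0, pi_u=1.0, pi_a=3.3, pi_e=4.0)
mttf_hours = 1.0e6 / lambda_p  # constant failure rate: MTTF = 1 / lambda

print(f"lambda_p = {lambda_p:.1f} failures per 10^6 h")
print(f"MTTF     = {mttf_hours:.3g} h")
```

Because the script does not round the base failure rate to 4.5 before multiplying, its result differs slightly in the last digit from the rounded figures of the example.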

Example 2

Calculation of the failure rate for a bridge rectifier with four equal diodes. From the reliability point of view, these diodes have to be considered in series configuration. According to MIL-HDBK 217 E, the failure rate for a single component (diode) is given by the following equation:

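$$\lambda_p = \lambda_b \cdot \pi_E \cdot \pi_Q \cdot \pi_C \cdot \pi_S \cdot \pi_T \quad \left[\frac{\text{failures}}{10^{6}\ \text{h}}\right] \tag{5.4}$$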

Assuming the diode to be a “power rectifier, fast recovery” type of JAN quality, with a metallic junction, used in a “Ground Fixed” environment with Vs = Vdmax/VRRM = 0.78 and junction temperature Tj = 166 °C, we have: πE = 6.0 is the environmental factor; πQ = 2.4 is the quality factor; πC = 1.0 is the construction factor; πS = 0.547 is the voltage stress factor; πT = 28 is the temperature factor. For the bridge rectifier we obtain:

$$\lambda_{total} = 4 \cdot \lambda_p = 60\ \frac{\text{failures}}{10^{6}\ \text{h}} \tag{5.5}$$

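and the corresponding MTTF for a single diode:

$$\mathrm{MTTF} = \frac{1}{\lambda_p} = 0.067 \cdot 10^{6}\ \text{h} = 6.7 \cdot 10^{4}\ \text{h} \tag{5.6}$$

and for the bridge rectifier as a whole:

$$\mathrm{MTTF}_s = \frac{1}{\lambda_{total}} = 0.017 \cdot 10^{6}\ \text{h} = 1.7 \cdot 10^{4}\ \text{h} \tag{5.7}$$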
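The series rule used in Example 2 (the failure rates of the four diodes simply add) can be sketched as follows. The per-diode rate of 15 failures per 10^6 h is not stated explicitly in the example; it is inferred here from eq. 5.5 (60 / 4), and the function name is illustrative.

```python
# Sketch of the series-configuration step of Example 2 (eq. 5.5 to 5.7).
# lambda_diode is an assumption inferred from eq. 5.5: 60 / 4 = 15 failures per 10^6 h.

def series_failure_rate(rates):
    """Failure rate of a series system with independent, constant-rate parts."""
    return sum(rates)


lambda_diode = 15.0                                     # failures per 10^6 h
lambda_total = series_failure_rate([lambda_diode] * 4)  # four equal diodes

mttf_diode = 1.0e6 / lambda_diode    # hours, single diode
mttf_bridge = 1.0e6 / lambda_total   # hours, whole bridge rectifier

print(f"lambda_total = {lambda_total:.0f} failures per 10^6 h")
print(f"MTTF diode   = {mttf_diode:.3g} h")
print(f"MTTF bridge  = {mttf_bridge:.3g} h")
```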


5.7 FIT: A More Recent Unit

Failure rates λ lie between $10^{-10}\ \mathrm{h}^{-1}$ and $10^{-7}\ \mathrm{h}^{-1}$ for electronic components. In particular, failure rates of about $10^{-10}\ \mathrm{h}^{-1}$ are typical of passive electronic components, while for active devices, in particular VLSI ICs, failure rates of about $10^{-7}\ \mathrm{h}^{-1}$ are more frequent. To make such data easier to handle, a new unit has been introduced: the unit $10^{-9}\ \mathrm{h}^{-1}$ is designated Failures in Time (FIT), that is, failures per $10^{9}$ h [1-4]. For example, the results obtained in Example 1 and Example 2 can be rewritten as follows. Eq. 5.2 can be written as:

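$$\lambda_p = \frac{59}{10^{6}}\ \frac{\text{failures}}{\text{h}} = 59 \cdot 10^{-6}\ \mathrm{h}^{-1} = 59000 \cdot 10^{-9}\ \mathrm{h}^{-1} = 59000\ \text{FIT}, \tag{5.8}$$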

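and eq. 5.5 becomes:

$$\lambda_{total} = \frac{60}{10^{6}}\ \frac{\text{failures}}{\text{h}} = 60 \cdot 10^{-6}\ \mathrm{h}^{-1} = 60000 \cdot 10^{-9}\ \mathrm{h}^{-1} = 60000\ \text{FIT}. \tag{5.9}$$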

If the FIT value is available, for example λ = 200000 FIT, then:

$$\lambda = 200000\ \text{FIT} = 200000 \cdot 10^{-9}\ \mathrm{h}^{-1} = 200 \cdot 10^{-6}\ \mathrm{h}^{-1} \tag{5.10}$$
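The conversions above amount to multiplying or dividing by 10^-9 h^-1. A minimal sketch, with hypothetical helper names, is:

```python
# Sketch of the FIT conversions of eq. 5.8 to 5.10 (helper names are illustrative).

FIT_IN_PER_HOUR = 1.0e-9  # 1 FIT = 10^-9 failures per hour


def per_hour_to_fit(rate_per_hour):
    """Convert a failure rate in h^-1 to FIT."""
    return rate_per_hour / FIT_IN_PER_HOUR


def fit_to_per_hour(rate_fit):
    """Convert a failure rate in FIT to h^-1."""
    return rate_fit * FIT_IN_PER_HOUR


print(round(per_hour_to_fit(59e-6)))   # lamp of Example 1, in FIT
print(round(per_hour_to_fit(60e-6)))   # bridge rectifier of Example 2, in FIT
print(fit_to_per_hour(200000))         # 200000 FIT expressed in h^-1
```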

References

[1] Birolini, A.: Reliability Engineering – Theory and Practice. Springer, Heidelberg, ISBN: 978-3-642-14951-1

[2] U.S.A. Department of Defense, “MIL-HDBK-217F Military Handbook Reliability Prediction of Electronic Equipment” (and later versions) (1991)

[3] IEEE Std, IEEE Guide for Selecting and Using Reliability Prediction Based on IEEE 1413 (February 19, 2003)

[4] U.S.A. Department of Defense, MIL-HDBK-781 Military Handbook for Reliability Test Methods, Plans, and Environments for Engineering, Development, Qualification, and Production-Revision D


Chapter 6 Repairable Systems and Availability

Abstract. In Chapter six the concept of Availability is explained and discussed. Availability is a concept that refers to repairable systems. For such systems the time of operation is not continuous, since their operating life cycles are described by a sequence of up and down states. The system operates until it fails; it is then repaired and returns to its original functioning state. It will fail again after some random time of operation and be repaired again, and this process of failure and repair repeats; hence the state of the system alternates between an operating state and a repair state. In this case, the important variables to be determined are the times to failure and the times to repair. Availability is defined as “the aptitude of the element to perform its required function in given conditions up to a given point in time or during a given time interval assuming that any eventual external resource is assured.” The availability of a machine can also be defined as the percentage of time, with respect to total time, in which the machine is required to function.

6.1 Introduction

When considering repairable systems or components, in addition to defining reliability, the function of availability must also be clarified.

As seen in Chapter 1, reliability is defined as the probability that a device performs a specific function up to a specific time, under pre-established conditions of use. This concept does not allow for an interruption in service. In cases where maintenance is scheduled, it must be carried out in time intervals when the device or component is not in use. In repairable systems, where the system is unavailable for the time necessary to carry out repairs or maintenance, availability implies that the system may not be functioning for a given time interval. Availability is therefore a more general function that takes into account both the reliability of a system and maintenance aspects, including the return to normal functioning after a failure.

The IEC 60050 (191) standard, also known as the International Electrotechnical Vocabulary (IEV), defines availability as “the ability of an item to be in a state to perform a required function under given conditions at a given instant of time or over a given time interval, assuming that the required external resources are provided” [1].

The availability of a machine can also be defined as the percentage of time, in respect to total time, in which the machine is required to function.


6.2 Mean Time To Repair/Restore (MTTR)

In the case of repairable components, the parameter that expresses the mean time from the onset of a failure to its complete repair is fundamental. This is known as the Mean Time To Repair/Restore (MTTR).

Maintainability is a property of repairable systems and is defined as the ease with which a system can be repaired once a malfunction (or failure) has occurred. Maintainability is the probability M(t) that a malfunctioning system can be restored to its correct functioning within time t. This is closely correlated with availability because the shorter the interval for restoring the system to its proper functioning, the higher the probability of finding a functioning system at a given time. In the extreme case M(0) = 1, the system in question will always be available.

Fig. 6.1 Behavior of a repairable system.

Similarly to the MTTF (Mean Time To Failure) that characterizes non-repairable devices, for repairable systems we refer to functions analogous to those already defined for reliability. Together, these functions are called the functions of maintainability.

Table 6.1 Analogy between the functions of maintainability and the functions of reliability.

Functions of maintainability                         Analogous functions of reliability
g(t)   Probability density of normal repair          f(t)   Distribution of failure probability
M(t)   Probability of repair (maintainability)       F(t)   Unreliability
N(t)   Probability of non repair                     R(t)   Reliability
μ(t)   Repair rate (instantaneous)                   λ(t)   Failure rate (instantaneous)


The definition of MTTR is given by:

$$\mathrm{MTTR} = \sum_{i} t_i \cdot g(t_i) \cdot \Delta t_i \tag{6.1}$$

Relationships identical to those of reliability are also valid for these functions. Therefore, taking t = 0 as the time at which a failure occurs, we have:

ti = i-th repair time

g(t)⋅Δt = probability that repairs will finish within the interval [t, t+Δt]

M(t) = probability that repairs will finish within the interval [0, t]

μ(t)⋅Δt = probability that repairs will finish within the interval [t, t+Δt], given that they were not completed at time t.

6.2.1 A Particular Case

If the repair rate is constant so that μ(t) = μ, we have:

$$\mathrm{MTTR} = \frac{1}{\mu} \tag{6.2}$$
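The step from a constant repair rate to this result can be sketched as follows, by analogy with the constant failure rate case of reliability. The integral is the continuous counterpart of the sum in eq. 6.1; this derivation is a standard one added for illustration, not reproduced from the source.

```latex
% Constant repair rate \mu(t) = \mu
g(t) = \mu\, e^{-\mu t}, \qquad
M(t) = \int_0^{t} g(\tau)\, d\tau = 1 - e^{-\mu t}, \qquad
N(t) = 1 - M(t) = e^{-\mu t}

% Mean time to repair: continuous limit of eq. (6.1)
\mathrm{MTTR} = \int_0^{\infty} t\, g(t)\, dt
             = \int_0^{\infty} t\, \mu\, e^{-\mu t}\, dt
             = \frac{1}{\mu}
```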

6.3 Mean Time Between Failures (MTBF)

The mean time between failures (MTBF) can be defined in two ways:

− MTBF is the MTTF of repairable devices;

− MTBF is the sum of the MTTF of the device plus the MTTR (Mean Time To Repair/Restore).

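Fig. 6.2 Definitions for MTBF.

[Figure: up/down time diagrams comparing the two terminologies. Non-repairable component: the up time is the MTTF in both terminologies. Repairable component, Terminology 1: up time MTBF, down time MTTR. Repairable component, Terminology 2 (used in the following): up time MTTF, down time MTTR, with the MTBF spanning both.]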

It may seem more logical to use the second definition because the same terminology for MTTF can be maintained regardless of whether the device can be repaired, considering one a theoretical extension of the other. Figure 6.2 graphically illustrates the difference between the two MTBF definitions.

6.4 The Significance of Availability in the Life Cycle of a Product

Figure 6.3 illustrates the time frames for both the functioning and the fault phases of elements used in the analysis of availability.

“C” and “P” represent periods of time attributed respectively to corrective maintenance (performed after a failure) and preventive maintenance (carried out before the system fails), which often include waiting for the resources necessary to complete the work. Availability is therefore the probability of being able to function correctly at the required moment, regardless of any previous failure that has since been repaired, and not up to a determined point in time, as asserted in the definition of reliability.

This concept implies that the device may be non-functioning at certain times. A system can display high availability levels notwithstanding frequent but short periods of malfunctioning.

[Figure: breakdown of the Total Time TT. First level: Up Time and Down Time. Second level: Operating Time OT, Stand-by Time ST, TMT and ALDT. Third level: TCM, TPM, C and P. A separate block is labelled OFF Time.]

Key - TT: Total Time of use; Up Time: functioning time; Down Time: non-functioning time; OT, Operating Time: part of up time when effective use takes place; ST, Stand-by Time: part of up time during which effective use is waiting to begin and the system is assumed to be operating; TMT: Total Maintenance Time; ALDT, Administrative and Logistic Down Time: time often spent waiting for parts and personnel for maintenance; TCM: Total Corrective Maintenance time; TPM: Total Preventive Maintenance time.

Fig. 6.3 Time frame of a repairable system.

Availability is a good parameter to characterize those systems where malfunctioning is acceptable since, in most circumstances, the system functions correctly. The basic mathematical definition of A (Availability) is:

$$A = \frac{\text{UpTime}}{\text{TotalTime}} = \frac{\text{UpTime}}{\text{UpTime} + \text{DownTime}} \tag{6.3}$$

The actual evaluation of availability is carried out by substituting the time elements with other parameters that realize the desired function. We thus have different formulations aimed at specific objectives.

Under certain circumstances, it is necessary to define the availability of a repairable system only in regard to effective work time and corrective maintenance time. This is known as inherent availability and is represented as:

$$A = \frac{\mathrm{MTTF}}{\mathrm{MTTF} + \mathrm{MTTR}} = \frac{\mathrm{MTTF}}{\mathrm{MTBF}} \tag{6.4}$$

Under these ideal conditions, waiting time and time associated with preventive maintenance are neglected (MTTR is calculated considering only corrective maintenance time). This quantity (dimensionless and between 0 and 1) has a double significance:

1. a posteriori, it is the “efficiency” of a system for which the parameters MTTF, MTTR and MTBF have been determined;

2. instantaneously, it is the probability that the system is available (not under repair).

The complement to 1 of availability is called Unavailability (U):

$$U = 1 - A = \frac{\mathrm{MTTR}}{\mathrm{MTTF} + \mathrm{MTTR}} \tag{6.5}$$
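As a numerical illustration of inherent availability and its complement, the following sketch uses hypothetical values (MTTF = 1000 h, MTTR = 10 h) that do not come from the text; the function names are also illustrative.

```python
# Sketch of inherent availability (eq. 6.4) and unavailability (U = 1 - A).
# The MTTF and MTTR values below are hypothetical, chosen only for illustration.

def inherent_availability(mttf, mttr):
    """A = MTTF / (MTTF + MTTR); both arguments in the same time unit."""
    if mttf < 0 or mttr < 0 or mttf + mttr == 0:
        raise ValueError("MTTF and MTTR must be non-negative and not both zero")
    return mttf / (mttf + mttr)


def unavailability(mttf, mttr):
    """U = 1 - A."""
    return 1.0 - inherent_availability(mttf, mttr)


a = inherent_availability(mttf=1000.0, mttr=10.0)
u = unavailability(mttf=1000.0, mttr=10.0)
print(f"A = {a:.4f}, U = {u:.4f}")
```

The sketch also makes the earlier remark concrete: frequent failures are compatible with high availability as long as the MTTR is small compared with the MTTF.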

6.5 Instantaneous Availability

Stationary availability, or simply “availability”, is the limit value (t → ∞) of another quantity, called “instantaneous availability” A(t); this quantity represents the a priori mean availability estimated at time t.

The plot of instantaneous availability depends on the initial conditions (at the instant t = 0, the system can be “functioning” or “in failure”); in any case, the limit value of A(t) for t → ∞ is always A.
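For the simplest case, a single item with constant failure rate λ and constant repair rate μ, the instantaneous availability has a well-known closed form, sketched below. The constant-rate assumption, the function name and the numeric rates are assumptions made here for illustration; the text does not restrict A(t) to this model.

```python
import math

# Sketch: instantaneous availability of a single repairable item,
# ASSUMING constant failure rate and constant repair rate (two-state model).

def instantaneous_availability(t, failure_rate, repair_rate, up_at_start=True):
    """Return A(t) for the two-state constant-rate model."""
    total = failure_rate + repair_rate
    steady = repair_rate / total              # limit value A for t -> infinity
    decay = math.exp(-total * t)
    if up_at_start:                           # system "functioning" at t = 0
        return steady + (failure_rate / total) * decay
    return steady - steady * decay            # system "in failure" at t = 0


lam, mu = 0.001, 0.1                          # hypothetical rates, in h^-1
for t in (0.0, 10.0, 100.0, 1000.0):
    up = instantaneous_availability(t, lam, mu, up_at_start=True)
    down = instantaneous_availability(t, lam, mu, up_at_start=False)
    print(f"t = {t:7.1f} h   A(t | up) = {up:.5f}   A(t | failed) = {down:.5f}")
```

Both curves converge to the same limit μ/(λ+μ), which equals MTTF/(MTTF+MTTR) when MTTF = 1/λ and MTTR = 1/μ.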

6.6 Dependability: An Evaluation of the “Level of Confidence” in Reference to the Correct Functioning of the System

In general, reliable systems are utilized in situations where it is necessary to guarantee a series of performance characteristics, for example safety or operational availability. Recently, the word “reliability” has been replaced by the term “dependability”, which corresponds to “faith in the correct functioning of the system.”

In order to define dependability, it is necessary to clarify first the concepts of service and user. The service furnished by a system is the behavior of the system itself, as it is perceived by its users.

The user of a system is another system which interacts with it by way of a system interface.

The function of a system is what we expect from the system itself. The description of the function of a system is furnished by its functional specifications. Service is correct if the specified functions of the system are performed.

A system is considered dependable if it has a high probability of successfully carrying out its specific functions. This first presumes that the system is available. Furthermore, in order to completely perform a specific function of the system, it is necessary to define all the environmental and operative requirements for the system to provide the desired service. Dependability is therefore a measurement of how much faith we have in the service given by the system.

The design and implementation of “dependable” systems requires an appropriate methodology for identifying possible causes of malfunctions, commonly known as “impediments.” The technology to eliminate or at least limit the effects of such causes is also necessary. Consequently, in order to deal with the problem of dependability, we need to know what impediments may arise and the technology to avoid their consequences. Systems that utilize such techniques are called Fault Tolerant.

Impediments to dependability assume three aspects: fault, error and failure. A system is in failure when it does not perform its specific function. A failure is therefore a transition from a state of correct service to a state of incorrect service. The periods of time when a system is not performing any service are called outage periods. Conversely, the transition from a period of non-service to a state of correct functioning is the restoration of service (Figure 6.4).

Possible system failures can be subdivided into classes of severity with respect to the possible consequences of the system failure and its effect on the external environment.

A generally used classification separates failures into two categories: benign and catastrophic.

Constructing a dependable system includes the prevention of failures. To attain this, it is necessary to understand the process which may lead to a failure, which originates from a cause (fault), inside or outside the system. The fault may remain dormant for a period of time until its activation. The activation of a fault leads to an error, that is, that part of the state of a system that can cause a subsequent failure. The failure is therefore the externally observable effect of an error in the system. Errors are said to be in a latent state until they become observable and/or lead to a state of failure. Similar failures can correspond to many different errors, just as the same error can cause different failures.


[Figure: two states, “Correct functioning” and “Incorrect functioning”, linked by the transitions “Failure” and “Restoration”.]

Fig. 6.4 State of a system.

Fig. 6.5 Chain fault – error – failure.

Systems are collections of interdependent components (elements, entities) which interact among themselves in accordance with predefined specifications. The chain fault – error – failure presented in Figure 6.5 can therefore be utilized to describe both the failure of a system and the failure of a single component. One fault can lead to successive faults, just as an error, through its propagation, can cause further errors. A system failure is often observed at the end of a chain of propagated errors.

6.7 The Prerequisites of Dependability

In order to measure the level of dependability reached by the system under analysis, it is necessary to evaluate a group of characteristics of the system. These characteristics generally assume a different role and importance with respect to the prerequisites of the system itself.

The principal prerequisites of dependability are:

Reliability, Maintainability, Availability (defined above) and Safety, defined as the probability S(t) that the system will not malfunction when it is required to operate or that, even in the presence of a malfunction, this does not compromise the safety of the personnel or machinery related to the system itself (the absence of unacceptable risks, as defined in Chapter 1). In other words, this is a measure of both the capacity of the system to function correctly and its capacity to fail to carry out its specific function without generating considerable consequences. Note that safety differs from reliability and availability in that reliability and availability regard correct functioning and do not include effects derived from malfunctions.

Testability, defined as the ability to identify characteristics of a system by means of tests and/or measures.

This is clearly related to maintainability, since the easier it is to test a malfunctioning system to identify component failures, the shorter the time needed to restore the system to correct functioning.

Performability, P(L,t), a function of time, is defined as the probability that the value of the level of functionality of the system at time t is at least equal to L. It is a measure of the capacity of a system to furnish a determined quantity of work during a certain time interval, even in the presence of failures. This plays a fundamental role in the design of systems in which the presence of faults does not imply the loss of functioning but only a reduction in the level of such functioning.

References

[1] IEC 60050-191, International Electrotechnical Vocabulary 1.0 edn. ch. 191: Dependability and quality of service, December 31 (1990); Forecast publication date for 2.0 edn. IEC International Electrotechnical Commission, Geneva, June 2 (2012)


Chapter 7 Techniques and Methods to Support Dependability

Abstract. In chapter seven some of the Reliability Techniques are considered. These techniques, classified into quantitative and qualitative, are methods of analysis to evaluate the dependability parameters and the failure modes to which a realistically complex system is or could be subjected. The techniques described here are Markov models. Other techniques, such as FMEA, FMECA and FTA, are discussed in the following Chapter 8. Markov models are characterized by a particular representation of Availability: a matrix in place of a single index. The matrix representation permits studying the behavior of the system, under different hypotheses, as a stochastic process, and studying its temporal evolution.

7.1 Introduction

In regard to the dependability of a real complex system (dependability is defined in Chapter 1), it is possible to conduct a twofold evaluation:

Probabilistic Evaluation or Quantitative Evaluation, the objective of which is to estimate the attributes of dependability for a system and/or its components. To this aim some techniques are discussed in the present Chapter.

Qualitative Evaluation, with the aim of understanding how component malfunctioning can lead to a loss of function and performance as well as to system failure, and of fully understanding the possible consequences. This is discussed in Chapter 8.

Quantitative methods have been widely discussed in the literature and numerous well tested techniques have been developed and utilized with success. Depending on the level of abstraction considered for the system under analysis, analytical (or axiomatic) and experimental techniques can be considered. The evaluation of the behavior of a system can be done in the following ways:

• Experimental (empirical): a prototype of the system is used and the parameters are estimated by means of statistical data. The experimental approach generally has the following characteristics:

− it is much more expensive and complex than the analytical technique;
− a system prototype is often unavailable;
− dependability is difficult to evaluate because long observation times are necessary.

• Analytical and simulative: the parameters are deduced directly from a mathematical model or graph of the system itself (model of the system and its components).

7.2 Introduction to Quantitative Techniques

The use of quantitative techniques is based on the analytical description (through equations) or graphic description (through diagrams) of the behavior of the system. Measurements of the dependability performance are obtained as a function of the parameters of the model, which typically includes probability distributions to represent the random phenomena connected to malfunctions.

Systems under analysis are mainly divided into two classes: discrete systems and continuous systems. In discrete systems one or more quantities of the system change instantaneously, but only at separate time instants. In continuous systems, instead, the quantities can change continuously over time. An example of a discrete system is the queue of data packets of a computer on line: the state of the system changes at separate instants of time (caused by a packet leaving or a new packet arriving). A train is an example of a continuous system, whose speed and position with respect to a railway station vary continuously in time. Often systems are both discrete and continuous at the same time, depending on the quantity being analyzed.

In analytical models the components of the system are represented by state variables and parameters, and their interactions are represented through the relationships among these quantities.

In simulation models, instead, the dynamic behavior of the system is reproduced over time. The evaluation of these models requires the execution of a program, called a simulator, that reproduces the temporal evolution of the system and furnishes an estimate of the measurements of interest. The most useful distinctions in the classification of models are between static and dynamic models, between deterministic and stochastic models and between continuous and discrete models. A static model represents a system at an established time instant, which can be useful in obtaining information on the static characteristics of the system. A dynamic model, on the other hand, represents the temporal evolution of the system, e.g. through the interaction among the components constituting the system. A model is deterministic when it contains no components that present probabilistic behavior; conversely, a model is stochastic when it contains one or more components that exhibit probabilistic behavior. The distinction between continuous and discrete models is similar to that made for continuous and discrete systems. In this case, however, the question is not whether the quantity of interest is itself continuous, but whether we wish to represent it as a variable that varies continuously in time or not.
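To make the idea of a dynamic, stochastic simulation model concrete, the following sketch simulates a single repairable item that alternates between up and down periods and estimates its availability. The exponential distributions, the parameter values and the function name are assumptions made for illustration only.

```python
import random

# Sketch of a tiny simulator: a repairable item alternates between up and
# down periods. Times are ASSUMED exponentially distributed; the MTTF and
# MTTR values are hypothetical.

def simulate_availability(mttf, mttr, cycles, seed=0):
    """Estimate availability as total up time / total time over many cycles."""
    rng = random.Random(seed)          # fixed seed for a reproducible run
    up_time = 0.0
    down_time = 0.0
    for _ in range(cycles):
        up_time += rng.expovariate(1.0 / mttf)    # random time to failure
        down_time += rng.expovariate(1.0 / mttr)  # random time to repair
    return up_time / (up_time + down_time)


estimate = simulate_availability(mttf=100.0, mttr=10.0, cycles=20000)
analytical = 100.0 / (100.0 + 10.0)
print(f"simulated  A = {estimate:.4f}")
print(f"analytical A = {analytical:.4f}")
```

The simulated value is only an estimate; it approaches the analytical value as the number of cycles grows, which is the usual trade-off between simulation and analytical models.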

Models used for the analysis of computerized systems (mainly discrete systems) are almost always dynamic and stochastic.

7.3 Evaluation of Availability Using Analytical Models

Complex systems can therefore be analyzed in regard to the calculation of dependability and availability if they are considered to consist of a group of variously connected entities. In particular, when calculating these quantities, it is first necessary to identify the dependability and availability of the single entities. Then, given the possibility of redundancy, the configurations which permit the system to function in accordance with its design specifications must be identified. Finally, it is necessary to establish the connections between the failures of the individual entities and those of the system as a whole.

In addition, the entities have reliability and availability indexes that depend on their quality levels, on the maintenance policy of the producer and on their interconnections. For these reasons, the use of a single technique may not be sufficient in all cases.

In general, appropriate combinations of these techniques can be used for the construction and solution of a model. The techniques most utilized for constructing models are the combinatorial analysis methods and Markov processes.

The stochastic methods of combinatorial reliability are the simplest and most intuitive, both in the construction and in the solution of the model, but they are often inadequate to represent the frequently complex dependence among the different components of the system. In addition, they do not permit the representation of repairable systems. Furthermore, such methods assume complete knowledge of the structure of the system and can be implemented only when the system is defined in all its constructive details (e.g. series and parallel configurations etc., as discussed in Chapter 3). Therefore, in place of combinatorial methods, Markov processes can be used to model complex systems when the coverage factor, repair and/or maintenance factors have to be considered.

Markov models are considered for analyzing the dependability and availability of a system made up of several entities that can fail independently of one another. They are also used to evaluate the safety performance of the system.

7.4 Markov Models

Compared with other techniques, Markov models are characterized by a different representation of availability: a matrix in place of a single index. The matrix representation permits studying the performance of the system, under different hypotheses, as a stochastic process, and studying its time evolution (not possible with other approaches). Furthermore, unlike other techniques, Markov models take into account repairability and the order in which failures occur in the system.

A Markov model is based on a graphic representation of the system dependability which makes it possible to observe concisely the effect of the unreliability of elementary components on the whole system and to relate the characteristics of single components to the next highest level of complexity. This continues up to the highest level of the system.


The analysis of a system using a Markov model furnishes an analytical, uniform and synthetic method to calculate availability and other associated functions. It is particularly useful in graphically representing situations of failure and repair of component elements. A system consisting of N components can assume $2^N$ mutually exclusive configurations, denoted as system states. This is due to the fact that each individual device can assume only two mutually exclusive conditions: correct functioning or failure. In other words, the model defines the states of the system as those combinations of component conditions that result in the functioning or non-functioning of the system. Analogously, changes in the state of the system (due to changes in single components) are called transitions, and their chronology describes the temporal evolution of the system (path). The procedure requires calculating the probability of finding the system in each of the possible states. The availability of the system is then the sum of the probabilities of the states defined as successful, where the system is functioning correctly.
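The enumeration of the 2^N states and the sum over the successful ones can be sketched as follows. The sketch assumes independent components with known stationary availabilities; the numeric values and the function name are hypothetical, and the success criteria (`all` for a series system, `any` for a parallel one) are just two simple examples.

```python
from itertools import product

# Sketch: enumerate the 2^N states of a system with N independent components
# and sum the probabilities of the states in which the system works.
# Component availabilities below are hypothetical.

def system_availability(component_availabilities, is_success):
    """Sum the probabilities of all states for which is_success(state) is True.

    A state is a tuple of booleans (True = component functioning).
    """
    total = 0.0
    n = len(component_availabilities)
    for state in product((True, False), repeat=n):       # 2^N states
        probability = 1.0
        for up, a in zip(state, component_availabilities):
            probability *= a if up else (1.0 - a)        # independence assumed
        if is_success(state):
            total += probability
    return total


availabilities = [0.9, 0.8]
series_availability = system_availability(availabilities, all)    # both needed
parallel_availability = system_availability(availabilities, any)  # one suffices
print(f"series:   {series_availability:.2f}")
print(f"parallel: {parallel_availability:.2f}")
```

The loop runs over 2^N states, which illustrates directly why the number of states becomes the main difficulty as the system grows.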

The main problem in the construction of a Markov model (Markov chain) is the exponential growth in the number of states as the complexity of the system increases. This growth makes very compact representations necessary when stochastic processes of notable dimensions have to be manipulated. Nevertheless, several well tested algorithms have been developed for such models.

In order to apply a Markov model, it is necessary to verify the following hypotheses:

• the process must be stationary: its behavior must be the same at any moment under consideration and, consequently, the probability of a transition between two states must remain the same during the specific time interval. In the availability model, this implies that the rates of transition between two states (a failure rate λ and a repair rate μ) must remain constant during the time of observation. As is known, in this case the probability density of the observed quantity (time to failure or time to repair) is exponential.

• the process must be without memory: according to this hypothesis, the future random behavior of a system depends only on its current state, and not on preceding states or on the way in which the current state was reached. This means that the probability of transition $p_{ij}$ depends only on the two states i and j and is completely independent of what occurred before the transition to state i. In some cases, such conditions can be considered too simplified.

At the beginning, the system is in the state in which all of its entities are functioning. To determine the probability that the system is in state s at a given instant, it is necessary to know only the probabilities of transition from one state to another (if restore intervals are not foreseen, there is a continual decrease in performance).

When the process is not stationary, the transitions in the system become functions of time. Such processes are referred to as non-Markovian processes. Under certain conditions, there are techniques for solving these problems.


The Markovian approach can be applied both in a continuous time field (Markov Processes) and a discrete time field (Markov Chains).

States can be classified according to their characteristics:

• Ergodic Group: a group of states that the system, once it has entered, can no longer leave (a process is ergodic if the time averages of its sample functions converge to the corresponding ensemble averages).

• Transitory Group: a group characterized by the fact that once a transition has taken the system outside this group, it can no longer reenter it.

• Absorbent State: once a system has reached this state, the state can no longer be changed (e.g. a state where non-repairable damage to a component is observed; once the component has failed, with no probability of repair, the system is obliged to enter this state, from which it can no longer exit).

Markov models are based on the concepts of state and transition.

• State: represents everything that must be known in order to describe a system at any given instant. In the case of a dependability model, a state represents a possible configuration of functioning and failed entities (functioning or non-functioning of the various components of the system). Usually, the starting point of any system is the state in which all the components are functioning. A subsequent state could be one in which a single component of the system is not functioning.

• Transitions: describe the passage from one state to another upon the occurrence of an event: the failure of an entity or the restoration of an entity in failure condition. Transitions between states are characterized by a probability, e.g. the probability of failure of an entity or the probability of repair.

Markov models are characterized by a set of probabilities $p_{ij}$, each representing the probability of a transition from an initial state i to a final state j. These transition probabilities must follow some basic rules:

1) Defining the hazard function - also known as failure rate - as z(t):

$$
z(t) = \frac{R(t) - R(t+\delta t)}{R(t)\cdot\delta t}
\tag{7.1}
$$

the probability of transition from an initial state to a final state relative to the time interval δt is identified by the product z(t)δt. In fact, we see:

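$$
z(t)\cdot\delta t = \frac{R(t) - R(t+\delta t)}{R(t)} = \frac{F(t+\delta t) - F(t)}{R(t)} = P\left[(t < T \le t+\delta t) \mid (T > t)\right]
\tag{7.2}
$$

where $T$ denotes the time to failure.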

2) The probability that two or more transitions will occur in the interval δt is infinitesimal and thus may be ignored.
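As a quick check of rule 1, consider the case of a constant failure rate λ, for which $R(t) = e^{-\lambda t}$ (an assumption made here for illustration, consistent with the exponential hypothesis recalled earlier). The worked step below suggests that the transition probability then reduces to $\lambda\,\delta t$, with a correction of second order in δt:

$$
z(t)\,\delta t = \frac{e^{-\lambda t} - e^{-\lambda (t+\delta t)}}{e^{-\lambda t}}
= 1 - e^{-\lambda\,\delta t}
= \lambda\,\delta t - \frac{(\lambda\,\delta t)^2}{2} + \dots
\approx \lambda\,\delta t \qquad (\delta t \to 0)
$$

The neglected term is of the same order as the probability of two transitions in δt, which rule 2 also discards.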


7.5 Transition Matrix and Fundamental Equation

The transition probabilities $p_{ij}$ can be grouped in a matrix P, denoted as the transition matrix, with a row index i that represents the initial state and a column index j that represents the state of arrival.

The transition matrix has important properties: it is square and its rows are stochastic. This means that the elements of each row, which represent the probabilities of remaining in a certain state or of exiting from it, sum to 1.
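The row-stochastic property can be checked numerically. The sketch below builds a one-step transition matrix from constant transition rates and a small time step, under the assumption that each off-diagonal entry is rate × dt and each diagonal entry is the complement of its row. The function name `transition_matrix` and the rates are hypothetical, chosen only for illustration.

```python
def transition_matrix(rates, dt):
    """Build the one-step transition matrix P from transition rates.

    rates -- dict mapping (i, j) to the constant rate from state i to state j
    dt    -- time step, assumed small enough that every rate * dt << 1
    """
    n = 1 + max(max(i, j) for i, j in rates)
    P = [[0.0] * n for _ in range(n)]
    for (i, j), rate in rates.items():
        P[i][j] = rate * dt              # probability of moving from i to j
    for i in range(n):
        P[i][i] = 1.0 - sum(P[i])        # probability of remaining in i
    return P

# Hypothetical two-state example: failure rate 1e-3 /h, repair rate 5e-2 /h
P = transition_matrix({(0, 1): 1e-3, (1, 0): 5e-2}, dt=1.0)
row_sums = [sum(row) for row in P]       # each row should sum to 1
```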

If we indicate with $\rho_{ij}$ the rate of transition from state i to state j and with $P_{ij}$ the probability of transition between these same two states in δt, assuming that the time interval δt is infinitesimal and treating the probability that two or more transitions occur in the same interval δt as an infinitesimal of higher order, we see:

$$
\rho_{ij}\,\Delta t = z(t)\,\Delta t = \lambda\,\Delta t = P_{ij}
\tag{7.3}
$$

If $P_i(t)$ is the probability of observing the system in state i at time t, the probability of observing the same state at time t+Δt is given by the sum of the probabilities of two mutually exclusive events:

• the system was in state j at the instant t and passed to state i during δt;

• the system was in state i at the instant t and did not pass to any other state during δt:

$$
P_i(t+\Delta t) = \sum_{j \ne i} \rho_{ji}\,\Delta t\,P_j(t) + \Big[1 - \sum_{j \ne i} \rho_{ij}\,\Delta t\Big] P_i(t)
\tag{7.4}
$$

Expanding, dividing by Δt and passing to the limit for Δt→0, we get a system of first-order differential equations that, solved for the given initial conditions (usually 1 for the initial state and 0 for the other states, since the system is presumed to be in the initial state with probability 1), yields the probability of being in each state at a given instant.
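The limiting step can be made explicit. Writing $\rho_{ji}$ for the rate of transition from state j to state i, subtracting $P_i(t)$ from both sides of (7.4) and dividing by Δt gives, in the limit, one equation per state. This is a sketch of the standard derivation rather than a formula quoted from the original text:

$$
\frac{dP_i(t)}{dt} = \lim_{\Delta t \to 0} \frac{P_i(t+\Delta t) - P_i(t)}{\Delta t}
= \sum_{j \ne i} \rho_{ji}\,P_j(t) - P_i(t) \sum_{j \ne i} \rho_{ij}
$$

The first sum collects the flow of probability entering state i and the second the flow leaving it.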

The complexity of the model increases rapidly with the number of states: for N components, $2^N$ configurations are possible and therefore there are $2^N$ states whose probabilities have to be studied.

7.6 Diagrams of State

The graphic representation of Markov models is based on the use of graphic symbols to define states and transitions. In general, states are indicated with circles and transitions with directional paths.

Case 1 – Analysis of a system with one element

a) Non-repairable element

The most elementary situation is that of a system made of only one non-repairable element with a constant failure rate and that assumes only two states: the operative state (S0) and the failure state (S1).


We can define the following quantities:

• P0(t), probability that the component functions at time t

• P1(t), probability that the component is in failure at time t

• λ, constant failure rate.

The probability that the system is found in state S0 at the instant t+δt is given by the probability that it is found in state S0 at time t multiplied by the probability that it does not fail during the interval δt; if at instant t it is found in state S1, the probability of coming back to state S0 is 0, the component being non-repairable. These probabilities can be represented by the diagram of state in Figure 7.1.

From Figure 7.1 we can write the following equations for the probability that the system will be found in S0 or S1 at time t+δt:

$$
\begin{cases}
P_0(t+\delta t) = P_0(t)\,(1-\lambda\,\delta t) + P_1(t)\cdot 0 \\
P_1(t+\delta t) = P_0(t)\,\lambda\,\delta t + P_1(t)\cdot 1
\end{cases}
\tag{7.5}
$$

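Fig. 7.1 Diagram of state of a system made of only one non-repairable element. [Diagram: states S0 and S1; transition S0 → S1 with probability λδt; probability 1−λδt of remaining in S0 and 1 of remaining in S1.]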

Considering the limit for δt→0, we obtain the following first-order differential equations:

$$
\begin{cases}
\dfrac{dP_0(t)}{dt} = -\lambda\,P_0(t) \\[2ex]
\dfrac{dP_1(t)}{dt} = \lambda\,P_0(t)
\end{cases}
\tag{7.6}
$$

These can be solved with the usual analytical methods after having established the initial conditions. It is generally assumed that for t = 0 the element is functioning, that is P0(0) = 1 and P1(0) = 0.
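To see how the discrete update (7.5) approaches the solution of the differential equations (7.6), the following Python sketch applies the update repeatedly with a small time step and compares the result with the analytical solution $P_0(t) = e^{-\lambda t}$. The failure rate, observation time and step size are hypothetical values chosen only for illustration.

```python
import math

def iterate_non_repairable(lam, t_end, dt):
    """Apply the discrete update (7.5) repeatedly, starting from P0 = 1, P1 = 0."""
    p0, p1 = 1.0, 0.0
    for _ in range(int(round(t_end / dt))):
        # Both right-hand sides use the values at time t (tuple assignment)
        p0, p1 = p0 * (1.0 - lam * dt), p0 * lam * dt + p1
    return p0, p1

lam = 1e-3        # hypothetical failure rate [failures/h]
t_end = 1000.0    # observation time [h]
p0, p1 = iterate_non_repairable(lam, t_end, dt=0.1)
exact_p0 = math.exp(-lam * t_end)   # analytical solution of (7.6)
```

With a smaller step the iterated value moves closer to the analytical one, which is the numerical counterpart of taking the limit δt→0.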

b) Repairable element

In the case of a system consisting of a single repairable component with a failure rate λ and a repair rate µ, both constant in time, the diagram of state is modified as indicated in Figure 7.2, where the path from state S1 to state S0 represents the probability of transition µδt from the fault state (S1) to the functional state (S0) after the restoration of the element.

From the diagram of state, we obtain the equations for the probability that the system is found in the state S0 or S1 at time t+δt:


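$$
\begin{cases}
P_0(t+\delta t) = P_0(t)\,(1-\lambda\,\delta t) + P_1(t)\,\mu\,\delta t \\
P_1(t+\delta t) = P_0(t)\,\lambda\,\delta t + P_1(t)\,(1-\mu\,\delta t)
\end{cases}
\tag{7.7}
$$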

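Fig. 7.2 Diagram of State for a system composed of one repairable element. [Diagram: states S0 and S1; transition S0 → S1 with probability λδt and S1 → S0 with probability μδt; probabilities 1−λδt and 1−μδt of remaining in S0 and S1, respectively.]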

The corresponding differential equations are:

$$
\begin{cases}
\dfrac{dP_0(t)}{dt} = -\lambda\,P_0(t) + \mu\,P_1(t) \\[2ex]
\dfrac{dP_1(t)}{dt} = \lambda\,P_0(t) - \mu\,P_1(t)
\end{cases}
\tag{7.8}
$$

Case 2 - Analysis of a system with two elements

Series systems and parallel systems are the easiest to analyze. Moreover, such systems are also those of major interest. In fact, the analysis of more complex systems can frequently be reduced to the analysis of series or parallel systems.

One begins by analyzing simple systems consisting of only two devices. Initially, one assumes that the faults of the various components are independent, even if this simplifying hypothesis is not always justified in reality.

Consider, for example, the reliability performance of two electrical lines, not too far from each other, positioned in a zone subject to seismic phenomena. An earthquake of sufficient intensity can simultaneously render both lines out of service. For faults connected to earthquakes, it is therefore not possible to assume that the two lines are independent of each other.

If the system is composed of two non-repairable elements there are four possible states, since the two cases in which only one element has failed must also be considered. Indicating with $X_1$ and $X_2$ the two functioning elements and with $\bar{X}_1$, $\bar{X}_2$ the same elements in failure, four possible states can be identified:

$$
S_i =
\begin{cases}
S_0 = X_1 X_2 \\
S_1 = \bar{X}_1 X_2 \\
S_2 = X_1 \bar{X}_2 \\
S_3 = \bar{X}_1 \bar{X}_2
\end{cases}
\tag{7.9}
$$


Fig. 7.3 Diagram of state of a system composed of two non-repairable elements where λij·δt represents the probability of transition from the state Si to the state Sj. [Diagram: states S0, S1, S2, S3; transitions S0 → S1 (λ01δt), S0 → S2 (λ02δt), S1 → S3 (λ13δt), S2 → S3 (λ23δt); probabilities of remaining in S0, S1, S2 and S3 equal to 1−(λ01+λ02)δt, 1−λ13δt, 1−λ23δt and 1, respectively.]

The system of equations becomes:

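$$
\begin{cases}
P_0(t+\delta t) = P_0(t)\,[1-(\lambda_{01}+\lambda_{02})\,\delta t] \\
P_1(t+\delta t) = P_0(t)\,\lambda_{01}\,\delta t + P_1(t)\,(1-\lambda_{13}\,\delta t) \\
P_2(t+\delta t) = P_0(t)\,\lambda_{02}\,\delta t + P_2(t)\,(1-\lambda_{23}\,\delta t) \\
P_3(t+\delta t) = P_1(t)\,\lambda_{13}\,\delta t + P_2(t)\,\lambda_{23}\,\delta t + P_3(t)
\end{cases}
\tag{7.10}
$$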

From these, analogously to the case of a system composed of only one element, we can formulate the corresponding differential equations.
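For reference, the differential equations of the two-element system can be written out by subtracting each probability at time t from its value at t+δt, dividing by δt and taking the limit δt→0, with the transition rates λ01, λ02, λ13 and λ23 of Figure 7.3. The system below is a direct derivation under those definitions and is not reproduced from the original text:

$$
\begin{cases}
\dfrac{dP_0(t)}{dt} = -(\lambda_{01}+\lambda_{02})\,P_0(t) \\[2ex]
\dfrac{dP_1(t)}{dt} = \lambda_{01}\,P_0(t) - \lambda_{13}\,P_1(t) \\[2ex]
\dfrac{dP_2(t)}{dt} = \lambda_{02}\,P_0(t) - \lambda_{23}\,P_2(t) \\[2ex]
\dfrac{dP_3(t)}{dt} = \lambda_{13}\,P_1(t) + \lambda_{23}\,P_2(t)
\end{cases}
$$

The right-hand sides sum to zero, as expected, since the four state probabilities must always sum to 1.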

7.7 Evaluation of Reliability

To evaluate the reliability function, the system must contain at least one absorbent state. With t tending to infinity, the probability of such a state tends to 1 while that of the other states tends to 0.

As stated previously, the reliability of the system is given by the sum of the probabilities of the states that assure the proper functioning of the system.

For the system seen previously, consisting of only one non-repairable element and represented in Figure 7.1, reliability coincides with P0(t), which can be deduced from (7.6):

$$
\begin{cases}
\dfrac{dP_0(t)}{dt} = -\lambda\,P_0(t) \\[2ex]
\dfrac{dP_1(t)}{dt} = \lambda\,P_0(t)
\end{cases}
\quad\Rightarrow\quad
R(t) = P_0(t) = e^{-\lambda t}
\tag{7.11}
$$

in accordance with Chapter 2.

For a system with two non-repairable elements, it is necessary to establish if the system is in series or parallel. If the system is considered in series, the only state that represents functioning is S0, and therefore


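$$
\begin{cases}
P_0(t+\delta t) = P_0(t)\,[1-(\lambda_{01}+\lambda_{02})\,\delta t] \\
P_1(t+\delta t) = P_0(t)\,\lambda_{01}\,\delta t + P_1(t)\,(1-\lambda_{13}\,\delta t) \\
P_2(t+\delta t) = P_0(t)\,\lambda_{02}\,\delta t + P_2(t)\,(1-\lambda_{23}\,\delta t) \\
P_3(t+\delta t) = P_1(t)\,\lambda_{13}\,\delta t + P_2(t)\,\lambda_{23}\,\delta t + P_3(t)
\end{cases}
\tag{7.12.a}
$$

and,

$$
R(t) = P_0(t) = e^{-(\lambda_{01}+\lambda_{02})\,t}
\tag{7.12.b}
$$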

Also in this case, this is in accordance with Chapter 2.

If the system is considered parallel, the operative states are S0, S1 and S2 and, since these are mutually exclusive:

$$
R(t) = P_0(t) + P_1(t) + P_2(t)
\tag{7.13}
$$

and the value of R(t) will be calculated using the other differential equations.
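The series and parallel cases can be compared numerically. The Python sketch below iterates the discrete equations of the two-element non-repairable system with a small time step and then sums the working states for each configuration. It assumes two independent elements, so that the rate out of each intermediate state is simply the failure rate of the element still working; the rates, the observation time and the step size are hypothetical.

```python
import math

def two_element_probabilities(l01, l02, l13, l23, t_end, dt=0.01):
    """Iterate the discrete equations of the two-element, non-repairable system.

    Returns [P0, P1, P2, P3] at time t_end, starting from P0 = 1.
    """
    p = [1.0, 0.0, 0.0, 0.0]
    for _ in range(int(round(t_end / dt))):
        p0, p1, p2, p3 = p
        p = [
            p0 * (1.0 - (l01 + l02) * dt),
            p0 * l01 * dt + p1 * (1.0 - l13 * dt),
            p0 * l02 * dt + p2 * (1.0 - l23 * dt),
            p1 * l13 * dt + p2 * l23 * dt + p3,
        ]
    return p

# Hypothetical independent elements: element 1 fails at rate l1, element 2 at l2,
# so that l01 = l23 = l1 and l02 = l13 = l2.
l1, l2, t_end = 0.01, 0.02, 50.0
p = two_element_probabilities(l01=l1, l02=l2, l13=l2, l23=l1, t_end=t_end)

r_series = p[0]                    # only S0 is a working state
r_parallel = p[0] + p[1] + p[2]    # S0, S1 and S2 are working states

# Closed-form values for independent exponential elements, for comparison
r_series_exact = math.exp(-(l1 + l2) * t_end)
r_parallel_exact = (math.exp(-l1 * t_end) + math.exp(-l2 * t_end)
                    - math.exp(-(l1 + l2) * t_end))
```

The closed-form expressions hold only under the independence assumption stated above; with dependent failures the rates λ13 and λ23 would differ and only the numerical iteration would remain applicable.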

7.8 Calculation of Reliability, Unreliability and Availability

To demonstrate the procedure for calculating the availability of a system, we will consider the simple case of a repairable system consisting of only one element. Its state diagram is represented in Figure 7.2. In such a case, the system of differential equations (7.8) can be rewritten in matrix form:

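$$
\begin{bmatrix} \dfrac{dP_0(t)}{dt} & \dfrac{dP_1(t)}{dt} \end{bmatrix}
=
\begin{bmatrix} P_0(t) & P_1(t) \end{bmatrix}
\cdot
\begin{bmatrix} -\lambda & \lambda \\ \mu & -\mu \end{bmatrix}
\tag{7.14}
$$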

From (7.14) the following two equations are obtained:

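$$
\begin{cases}
P_0(t) = \dfrac{\mu}{\lambda+\mu}\,[P_0(0)+P_1(0)] + \dfrac{e^{-(\lambda+\mu)t}}{\lambda+\mu}\,[\lambda P_0(0) - \mu P_1(0)] \\[2ex]
P_1(t) = \dfrac{\lambda}{\lambda+\mu}\,[P_0(0)+P_1(0)] + \dfrac{e^{-(\lambda+\mu)t}}{\lambda+\mu}\,[\mu P_1(0) - \lambda P_0(0)]
\end{cases}
\tag{7.15}
$$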

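Since P0(0) + P1(0) = 1, we have:

$$
\begin{cases}
P_0(t) = \dfrac{\mu}{\lambda+\mu} + \dfrac{e^{-(\lambda+\mu)t}}{\lambda+\mu}\,[\lambda P_0(0) - \mu P_1(0)] \\[2ex]
P_1(t) = \dfrac{\lambda}{\lambda+\mu} + \dfrac{e^{-(\lambda+\mu)t}}{\lambda+\mu}\,[\mu P_1(0) - \lambda P_0(0)]
\end{cases}
\tag{7.16}
$$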

Assuming that the system is functioning at time t = 0, so that P0(0) = 1 and P1(0) = 0, we obtain:


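$$
\begin{cases}
P_0(t) = \dfrac{\mu}{\lambda+\mu} + \dfrac{\lambda}{\lambda+\mu}\,e^{-(\lambda+\mu)t} \\[2ex]
P_1(t) = \dfrac{\lambda}{\lambda+\mu} - \dfrac{\lambda}{\lambda+\mu}\,e^{-(\lambda+\mu)t}
\end{cases}
\tag{7.17}
$$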

P0(t) and P1(t) represent the time-dependent probabilities that the system is functioning (availability) or in a fault condition (unavailability).

It is therefore possible to calculate the asymptotic probability that the system reaches a state, for t →∞:

$$
\begin{cases}
P_0(\infty) = \dfrac{\mu}{\mu+\lambda} \\[2ex]
P_1(\infty) = \dfrac{\lambda}{\mu+\lambda}
\end{cases}
\tag{7.18}
$$

The Availability over time of a repairable system is therefore equal to:

$$
A(t) = P_0(t) = \frac{\mu}{\mu+\lambda} + \frac{\lambda}{\mu+\lambda}\,e^{-(\lambda+\mu)t}
\tag{7.19}
$$

and for t →∞ becomes:

$$
A(\infty) = P_0(\infty) = \frac{\mu}{\mu+\lambda}
\tag{7.20}
$$
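The availability of a single repairable element, A(t) = μ/(μ+λ) + λ/(μ+λ)·exp(−(λ+μ)t), and its asymptotic value μ/(μ+λ) are easy to evaluate numerically. The Python sketch below does so for hypothetical values of the failure and repair rates; the function names are illustrative.

```python
import math

def availability(t, lam, mu):
    """Time-dependent availability A(t) of a single repairable element, eq. (7.19)."""
    total = lam + mu
    return mu / total + (lam / total) * math.exp(-total * t)

def steady_state_availability(lam, mu):
    """Asymptotic availability A(inf), eq. (7.20)."""
    return mu / (lam + mu)

lam, mu = 1e-3, 5e-2   # hypothetical rates [1/h]
a_start = availability(0.0, lam, mu)      # system assumed working at t = 0
a_late = availability(1000.0, lam, mu)    # after many repair time constants
a_inf = steady_state_availability(lam, mu)
```

The transient term decays with time constant 1/(λ+μ), so when the repair rate is much larger than the failure rate the availability settles close to its asymptotic value after a few mean repair times.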

7.9 Markov Analysis of a System: Application Example

With this example, we wish to describe the approach which must be followed in order to evaluate the availability of a system using Markov techniques. The study considers a measurement system based on a GPS-based device capable of measuring the initial instant of fast transients superimposed on a sinusoidal voltage at 50 Hz.

Referring to Figure 7.4, the system consists of:

• a voltage transducer (V-VT), with a failure rate λTV = $1 \cdot 10^{-4}$ [failures/h];

• an event detector (comparator), with parameters λSC = $1 \cdot 10^{-3}$ [failures/h] and μSC = $5 \cdot 10^{-2}$ [repairs/h];

• a GPS unit, with λGPS = $2 \cdot 10^{-4}$ [failures/h];

• a data processing unit.

The signal passes through a voltage transducer V-VT whose output is sent to the event detector. The latter, based on a comparator circuit, generates in the presence of a fast transient an output signal in TTL logic, which is detected by the GPS unit as an event with a resolution of 100 ns. For the sake of simplicity, we will consider the system to be made up of the following three components:


Fig. 7.4 GPS based measuring system of fast voltage transients.

1. the voltage transducer (TV);

2. the event detector (comparator or SC);

3. the GPS unit.

According to the Markovian approach, a system composed of N = 3 elements has $2^N = 8$ states in the availability model. For illustrative purposes, we suppose that only the event detector is repairable, while a breakdown in the voltage transducer or the GPS unit causes a failure of the system as a whole. In practice, these hypotheses imply that of the 8 possible states (which would all be needed if every component were repairable), only 6 are now possible:

State S0: initial state where all devices are functioning. The probability of finding the system in this state at time t = 0 is equal to 1.

State S1: state in which the voltage transducer is in faulted condition. The probability of passing to this state from the initial state depends on λTV. Given that the transducer has been assumed to be non-repairable, this state is absorbent.

State S2: state in which the event detector is not working. Unlike the other failure states, this is not an absorbent state. Therefore, in addition to the bidirectional transition with the initial state, determined by the failure rate λSC and the repair rate μSC of the event detector, two other transitions can occur from this state, due to the failure of the V-VT and of the GPS unit.

State S3: state in which the GPS unit is not working. As in the case of the V-VT, this state is absorbent and the probability of entering it is connected to λGPS.

State S4: the system is in this state due to the simultaneous failure of the event detector and the V-VT (the V-VT fails before the comparator can be repaired).

State S5: the system is in this state when both the GPS unit and the comparator are not operating. Similar to State S4, this is an absorbent state.

[Fig. 7.4 block labels: V-VT voltage transducer; SC event detector; GPS unit; hyper terminal; initial instant.]


The model of the Markov states is represented in Figure 7.5. The probability of remaining in single states is not shown. The following transitions can be identified:

Transitions 1,2,3: respectively, from the initial state to states S1,S2,S3. These are given by TV, SC, GPS failures.

Transitions 4,5: from the state S2 to the state S4 due to a failure of the VT and from the state S2 to the state S5 due to failure of GPS unit.

Transition 0: takes the system from S2 back to the initial state. This depends on the μSC.

Fig. 7.5 State diagram of fast voltage transients detector.

It is useful to recall that states S1, S3, S4 and S5 are absorbent states, while S0 and S2 are those in which the system is not considered in faulted condition. The sum of the probabilities of these two states, P0(t)+P2(t), gives the availability of the system. To apply the Markov model, we use a system of equations that relate the state probabilities at time t+Δt to those at time t:

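$$
\begin{cases}
P_0(t+\Delta t) = P_0(t)\,[1-(\lambda_{TV}+\lambda_{SC}+\lambda_{GPS})\,\Delta t] + P_2(t)\,\mu_{SC}\,\Delta t \\
P_1(t+\Delta t) = P_0(t)\,\lambda_{TV}\,\Delta t + P_1(t) \\
P_2(t+\Delta t) = P_0(t)\,\lambda_{SC}\,\Delta t + P_2(t)\,[1-(\mu_{SC}+\lambda_{TV}+\lambda_{GPS})\,\Delta t] \\
P_3(t+\Delta t) = P_0(t)\,\lambda_{GPS}\,\Delta t + P_3(t) \\
P_4(t+\Delta t) = P_2(t)\,\lambda_{TV}\,\Delta t + P_4(t) \\
P_5(t+\Delta t) = P_2(t)\,\lambda_{GPS}\,\Delta t + P_5(t)
\end{cases}
\tag{7.21}
$$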

This is the system that, solved for appropriate initial conditions, yields the probability associated with each state.
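A minimal way to solve this system numerically is to apply the update equations repeatedly from the initial condition P0(0) = 1. The Python sketch below does this with the rates of the example (λTV = 1·10⁻⁴, λSC = 1·10⁻³, μSC = 5·10⁻², λGPS = 2·10⁻⁴, all per hour) and an assumed step of one hour; it is a simple fixed-step iteration, not the calculation code used by the authors. Adding the equations for S0 and S2 shows that P0 + P2 decays at the total rate λTV + λGPS, so the computed availability at 10000 h should be close to exp(−3).

```python
import math

# Rates taken from the example [1/h]
L_TV, L_SC, MU_SC, L_GPS = 1e-4, 1e-3, 5e-2, 2e-4

def step(p, dt):
    """One application of the update equations to the state probabilities."""
    p0, p1, p2, p3, p4, p5 = p
    return [
        p0 * (1.0 - (L_TV + L_SC + L_GPS) * dt) + p2 * MU_SC * dt,
        p0 * L_TV * dt + p1,
        p0 * L_SC * dt + p2 * (1.0 - (MU_SC + L_TV + L_GPS) * dt),
        p0 * L_GPS * dt + p3,
        p2 * L_TV * dt + p4,
        p2 * L_GPS * dt + p5,
    ]

def state_probabilities(t_end, dt=1.0):
    """Iterate from P0(0) = 1 up to t_end (hours); dt is an assumed step size."""
    p = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
    for _ in range(int(round(t_end / dt))):
        p = step(p, dt)
    return p

p = state_probabilities(10000.0)
availability = p[0] + p[2]   # S0 and S2 are the non-failed states
```

A smaller step, or a dedicated differential equation solver, would reduce the discretization error; the one-hour step is adequate here because every rate multiplied by the step is much smaller than 1.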


7.10 Numerical Resolution of the System

There are calculation codes able to solve the system of equations obtained by using Markov models. They provide the numerical value of the probability associated with the various states of the system. In particular, for the example proposed, the system is available only while it remains in states S0 and S2.

Figure 7.6 represents a state diagram in which transitions with a higher probability have been emphasized with larger arrows. The numerical values indicated in the figure refer to values of the probability of being in the various states after 10000 hours of operation.

The probability of being in state S3 is equal to 65.4% and that of being in S1 to 32.7%.

Table 7.1 shows more detailed results relative to values of the availability calculated at various time instants. The trend of A(t) in various time intervals (for t from 0 to 1 h; from 0 to 1000 h; from 0 to 10000 h) is also presented.

Fig. 7.6 Diagram of state of the system in the example with emphasis on the probability of transition. [Diagram: states S0, S1, S3, S4, S5 with the probability values 1·10^-13, 0.3268, 0.6537, 1·10^-15, 0.0065 and 0.0130.]

Table 7.1 Availability values for the example presented.

Time (h)     Reliability characteristic   Result     R(t)
t = 1        P(State 1)                   0.9987     See Figure 7.7
             P(State 3)                   0.00097
             R(t) ≡ A(t)                  0.9997
t = 10000    P(State 1)                   0.0488     See Figure 7.8


It is observed that after 10000 hours of operation, the availability of the system has decreased to 0.05. As the graphs also show, the probability that the system is still working after one year (8,760 hours) is low, about 7%.

Fig. 7.7 Reliability function for table 7.1 for t = 1. [Plot: Reliability Function R(t) = exp(-3/10000 t); vertical axis R, from 0.99970 to 1; horizontal axis Time [h].]

Fig. 7.8 Reliability function for table 7.1 for t = 10000. [Plot: Reliability Function R(t) = exp(-3/10000 t); vertical axis R; horizontal axis Time [h].]


7.11 Possible Solutions to the Absence of Memory of the Markov Model

The most severe hypothesis in the application and interpretation of the results provided by the Markov model is the lack of memory. As stated previously, it allows evaluating the probability of finding the system in a given state depending only on the probability of exiting from the preceding state and not on the history of the system. In practice, however, this implies disregarding the influence that one fault may have on another. For example, according to Figure 7.9, the λGPS of the GPS unit is the same both in the transition to state S3 and in the transition to state S5, where a failure has already occurred.

Fig. 7.9 Diagram of state of system considering also the contemporaneous aspect of more events.

Therefore, in the Markov model, the failure of the event detector does not influence the behavior of the GPS unit. However, this may not be true. For instance, a short circuit occurring at the comparator could easily damage both the voltage transducer and the GPS unit. In order to understand more fully the effect produced on the system as a whole by a failure, it is necessary to introduce two further transitions that go from the initial state, respectively, to the state of simultaneous failure of the event detector and the transducer and to the state of simultaneous failure of the event detector and the GPS unit (Figure 7.9).

For each of the two new transitions, a failure rate derived from the product of the failure rates of the two components is assumed. We consider that the transition takes place if both elements are in fault condition. In particular, we have:

Transition 7: from state S0 to state S4. The transition depends on the λSC/VT which will be given by a combination of both λSC and λVT.


Transition 8: from state S0 to state S5. The transition depends on the simultaneous failure of both the SC and GPS units. The relevant failure rate λSC/GPS is given by a combination of both λSC and λGPS.

References

[1] Birolini, A.: Reliability Engineering – Theory and Practice. Springer, Heidelberg, ISBN: 978-3-642-14951-1

[2] Leemis, L.M.: Reliability, Probabilistic Models and Statistical methods, 2nd edn., ISBN 978-0-692-00027-4


Chapter 8 Qualitative Techniques

Abstract. In the previous chapter, quantitative methods useful for reliability evaluation were presented. In such methods the evaluation of the behavior of a system is conducted with analytical and graphic methods. A second way to study the behavior of a system is based on a qualitative approach. These methods make it possible to understand the mechanisms of system failures and to identify all the potential weaknesses of the system under evaluation. Two main techniques of analysis are used and will be presented in this chapter: the Failure modes and effects analysis (FMEA) and the Failure modes, effects and criticality analysis (FMECA). Such techniques are able to highlight the failure modes leading to a negative final effect. A third method presented here is based on a quite different approach: Fault Tree Analysis (FTA), a deductive method, starts from the final effect and studies the causes of a particular and well defined failure.

8.1 Introduction

As already pointed out in the preceding chapters, the performance and quality of an industrial product or system also include reliability, availability and maintainability, intrinsic characteristics that are fundamental in defining the requirements of the product: in other words, the dependability of the system or product. This has a large impact on operative costs, on keeping the product in use, and on maintaining acceptable costs throughout the life cycle of the product.

Once the required dependability characteristics have been defined, which is a task of the project designer, their effective realization must be verified. To accomplish this, methods of reliability analysis have been introduced which permit reviewing and forecasting the level of reliability of a product or system. Techniques for analyzing dependability are used, in fact, to review and forecast evaluations of the reliability, availability, maintainability and safety of a system.

Reliability analysis takes place mainly during the concept and definition phases, the design and development phases and the maintenance phases, in order to determine and evaluate the dependability values of a system or installation. To this aim, the most used methods of analysis can be summarized as follows:

1. Failure modes and effects analysis (FMEA);

2. Failure modes, effects and criticality analysis (FMECA);

3. Fault Tree Analysis (FTA);

4. Event Tree Analysis (ETA);

5. Reliability Block Diagram Analysis (RBD);

6. Markov Analysis (MA);

7. Petri Net Analysis;

8. Hazard and operability studies (HAZOP);

9. Reliability forecast through the part count (PC);

10. Analysis of human reliability (HRA);

11. Analysis of force-resistance;

12. Truth table (structure function analysis);

13. Boolean methods;

14. Statistical test and estimation methods for reliability growth.

Only the first three methods will be discussed in the following. Further details on the others can be found in the bibliography. Such methods can adequately define qualitative characteristics, even though, as will be seen, they may include some quantitative evaluations.

The methods of analysis that will be discussed in the following allow one to evaluate the failure modes to which a system is or could be subjected.

Furthermore, the results so obtained permit the project designer to identify any modifications which must be made in order to best improve the RAMS requirements of the product.

In general, there are two types of analysis: inductive (bottom-up analysis such as FMEA and FMECA) and deductive (top-down analysis such as FTA).

Inductive methods start at the lowest level, for example the failure of a single component (mechanical, electrical etc.) that has an effect on all the system. In such a case, a detailed knowledge of the system and its structure is required to study fault conditions and failure propagation, for example.

Inductive methods are generally rather stringent and well designed to identify all the individual failure modes. An analysis of this type gives its best results when conducted in the final planning or design phase although it can be profitably used even in different phases of the design process.

Deductive methods, instead, start from the final effect, for example by studying the causes of a particular and well defined failure. The analysis is implemented starting at a system level and little by little progresses towards the lower levels (e.g. to the analysis of a single mechanical, electronic etc. component). Deductive methods are orientated towards events and are particularly useful during the first phases of the project, when operational details are still to be defined.

It must be remembered that the aforementioned methods for the analysis of reliability are treated in the technical standards edited by the International Electrotechnical Commission (IEC) and in particular by the Technical Committee IEC TC 56 Dependability. Reference is often made to applicable standards when present, aware of the fact that in industry, conformity to standards is often mandatory in the relationship between supplier and user. At the end of the chapter there is a brief summary of reference standards, although certainly not complete. Standards are constantly being updated and it is best to always verify the latest revisions.


8.2 Failure Mode and Effects Analysis (FMEA)

Failure Mode and Effects Analysis (FMEA) is a systematic procedure for analyzing a system with the aim of identifying potential failure modes, their causes and effects on performance and, when applicable, their effects on the safety of the personnel, on the environment as well as on the system [1, 2]. During the advanced design phase, this analysis technique can draw attention to possible weaknesses in the system, in such a way as to suggest the modifications necessary to improve reliability and, more generally, availability. This analysis can include forecasting and preventive measures to be undertaken during the initial phases of the development of a new product. Often the FMEA, concentrating on these aspects, makes it possible to determine whether the component under examination, for example, satisfies safety requirements. At this point, it is best to establish the most appropriate way to discuss the failure modes rather than the failure itself. By failure mode, we mean the manner in which an item fails. By failure effect we mean the consequence of a failure mode (measurable in a quantitative or qualitative sense) of a component or part of a system in terms of operation, function or status of the item [1, 2].

FMEA is, essentially, a method for studying the failure of devices manufactured with different technologies (electrical, mechanical, hydraulic etc.) or with a combination of such technologies. FMEA can also be used for studying software performance and function as well as actions undertaken by the operator.

The analysis is conducted starting from the characteristics of components of a system by means of an inductive process. Hypothesizing the failure of a component, the analysis demonstrates the relationship existing between the failure itself and breakdowns, defects and decreased performance or integrity of the entire system.

FMEA analysis gives a good understanding of the behavior of a component of a system, such as an electronic board, an electronic component or a mechanical device, and of how it influences the functioning of the entire system, particularly in cases of breakdown. It is always necessary to ensure that the malfunctioning of a part of a system does not lead to dangerous situations for personnel or for the environment. For example, in an industrial installation or in a numerically controlled machine, the breakdown of an electronic device or control board must not allow unsafe procedures to be carried out. The electronic board must therefore be designed in such a way that, in the presence of a failure, the system is prevented from functioning or, at least, unsafe operations cannot take place. The solution therefore lies in design: FMEA analysis can identify the breakdowns that, if unforeseen, can give rise to dangerous situations, and it can also verify that the solutions adopted are effective. This is why it is always useful and advisable to carry out an FMEA analysis during the design phase and to keep it updated. If kept updated, the method also becomes an instrument for the a posteriori verification of the design and a test of the conformity of the system to required specifications and standards as well as to the needs of the user.

It is clear that the results of an FMEA analysis establish the priorities for process checks and for the inspections to be performed during construction and installation, as well as for rating, acceptance and operative tests. This does not diminish the importance of an FMEA analysis on an already designed apparatus. In such a case, the analysis is simply conducted as an a posteriori verification of what has already been made. It can be regarded as the first useful step in defining design requirements and criteria when the time comes to redesign or update the product.

An FMEA analysis permits:

1. to identify failures, including induced failures;

2. to determine the need for redundancy and for further overdesign and/or simplifications to be made to the project;

3. to determine the need to choose adequate materials, parts, components or devices;

4. to identify the serious consequences of failures, to determine the need to revise and/or modify the project, to identify safety risks to personnel and to highlight problems of legal responsibility;

5. to define testing and maintenance procedures, suggesting potential failure modes and determining the parameters that must be recorded during testing, verification and operative phases. It is also useful in drawing up a guide to identifying and investigating failures;

6. to define some software characteristics, if present.

8.2.1 Operative Procedure

In principle, a precise operative procedure for carrying out an FMEA analysis can be defined, but it is not possible to go into detail a priori. This derives from the fact that the design of systems and their applications is characterized by widely variable complexity, and it may therefore be necessary to adopt very specific FMEA procedures adapted to the available information.

The classic procedure for implementing a FMEA analysis is the following:

Step 1 – Definition of the System

This phase defines the system, its functional requirements, environmental requirements and legal ordinances to be respected.

In defining functional and operative requirements, it is necessary to identify, in addition to expected performance, any undesirable operations that can result in failure conditions.

Environmental conditions concern the temperature of the environment in which the system works (e.g. one must take into consideration the high temperatures that develop when an apparatus is enclosed, such as in a control box/cabinet), humidity, pressure, vibrations and the presence of dust, mould and high salinity. Furthermore, all problems connected to electromagnetic compatibility (EMC) must be taken into account if the system is affected by this aspect.

Finally, it is always necessary to comply with all legislative requirements, especially in terms of safety and risk evaluation.


Step 2 – Elaboration of Block Diagrams

Block diagrams are elaborated in order to identify the interconnections among subsystems, circuits and components. They show the fundamental elements of the system, the series and parallel relationships present and the functional interdependence among the elements which constitute the system. This step is also referred to as the analysis of the hierarchic levels.
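The series and parallel relationships captured by a block diagram can be illustrated numerically. The following is only a minimal sketch, assuming statistically independent blocks with known reliabilities; the function names and the numeric values are hypothetical and are not taken from the text.

```python
from math import prod

def series(reliabilities):
    """Series arrangement: every block must work."""
    return prod(reliabilities)

def parallel(reliabilities):
    """Parallel (redundant) arrangement: at least one block must work."""
    return 1.0 - prod(1.0 - r for r in reliabilities)

# Hypothetical subsystem: two redundant supplies feeding one controller.
supplies = parallel([0.90, 0.90])
system = series([supplies, 0.95])
print(round(supplies, 4), round(system, 4))
```

In this sketch the redundant pair is more reliable than either supply alone, while the series connection with the controller lowers the overall figure again.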

Step 3 – Definition of Basic Principles

It is first necessary to establish the most adequate level of analysis. This is chosen by the technicians carrying out the FMEA analysis and is the lowest level for which the necessary information is available. In the FMEA analysis of an electronic system, for instance, this is the level at which the failure modes of single electronic components are analyzed.

Step 4 – Definition of Failure Modes

This step consists in identifying the modes, the causes and the effects of failures, their relative importance and in particular, their means of propagation. The failure mode is the objective evidence of the presence of a failure or, as defined in the IEC 60812 standard, the manner in which an item fails.

It is useful at this point to recall the difference between failure and fault. For this purpose, the definitions from the IEC 60812 standard are helpful:

Failure: termination of the ability of an item to perform a required function (clause 3.2 IEC 60812:2006).

Fault: state of an item characterized by the inability to perform a required function, excluding the inability during preventive maintenance or other planned actions, or due to the lack of external resources. A fault is often the result of a failure of the item itself, but may exist without prior failure (clause 3.3 IEC 60812:2006).

So, a failure is the transition from a state of proper functioning to a malfunctioning state (fault).
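The distinction between the failure (an event) and the fault (a state) can be made concrete with a small state model. This is only an illustrative sketch; the class and method names are hypothetical.

```python
from enum import Enum

class State(Enum):
    FUNCTIONING = "functioning"
    FAULT = "fault"

class Item:
    """Hypothetical item: the fault is a state, the failure is the event leading to it."""
    def __init__(self):
        self.state = State.FUNCTIONING
        self.failures = 0  # number of failure events recorded

    def fail(self):
        # A failure is the transition from proper functioning to the fault state.
        if self.state is State.FUNCTIONING:
            self.state = State.FAULT
            self.failures += 1

    def repair(self):
        # Repair brings the item back to the functioning state.
        self.state = State.FUNCTIONING

item = Item()
item.fail()
item.fail()  # already in the fault state: no new failure event is counted
print(item.state, item.failures)
```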

In the automotive field, for example, failure modes can be: the car will not start, the lights do not function, intermittent functioning, deflated tires. In this phase it is necessary to identify the critical elements/components of the system and make a list of failures.

Failure modes can almost always be classified in one of the categories listed in Tables 8.1a, 8.1b and 8.1c. In particular, Tables 8.1a and 8.1b give two different examples of general failure modes (different lists would be required for different types of systems), while Table 8.1c reports a detailed list of failure modes. Furthermore, typical failure modes for electronic components are, for example, opens, shorts, drift and functional faults. For mechanical components, instead, examples of failure modes are brittle rupture, cracking, creep and so on. Table 8.1d reports the relative frequency of failure modes in some electronic components [3].

Table 8.1a First example of a set of general failure modes (in compliance to IEC 60812).

Id General Failure Modes

1 Failure during operation

2 Failure to operate at a prescribed time (when necessary or when wanted)

3 Failure to cease operation at a prescribed time (when necessary or when wanted)

4 Premature operation (too early)

Table 8.1b Second example of a set of general failure modes.

Id General Failure Modes

1 Failure to operate at the proper time

2 Intermittent operation

3 Failure to stop operating at the proper time

4 Loss of output

Table 8.1c Specific failure modes.

Id Specific Failure Modes

1 Structural breakdown

2 Seizing or jamming

3 Vibrations

4 Does not stay in position

5 Does not open

6 Does not close

7 Remains open

8 Remains closed

9 External loss

10 Internal loss

11 Outside tolerance (+)

12 Outside tolerance (-)

13 Functions at inappropriate times

14 Intermittent function

15 Irregular function

16 Display error

17 Reduced flow

18 Activation error

19 Does not stop


20 Does not start

21 Does not switch over

22 Premature intervention

23 Late intervention

24 Input error (excessive)

25 Input error (insufficient)

26 Output error (excessive)

27 Output error (insufficient)

28 No input

29 No output

30 Short circuit (electrical)

31 Open circuit (electrical)

32 Leakage (electrical)

33 Other conditions of failure according to system characteristics, operating conditions and operative restrictions

Table 8.1d Values for failure modes of electronic devices (%) [3]

Device Shorts Opens Drift Functional

Digital Bipolar Integrated Circuits 50 30 20

Digital MOS Integrated Circuits 20 60 20

Linear Integrated Circuits 25 75

Bipolar transistor (BJT) 85 15

Field Effect Transistor (FET) 80 15 5

General Purpose Diode (Si) 80 20

Zener Diode (Si) 70 20 10

Thyristors 20 20 50 10

Optoelectronics device 10 50 40

Fixed Resistors 40 60

Variable Resistors 70 20 10

Foil Capacitor 15 80 5

Ceramic Capacitor 70 10 20

Tantalum Capacitor 80 15 5

Aluminum Capacitor 30 30 40

Coils 20 70 4

Relays 20 80

Quartz Crystal 80 20
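Percentages such as those of Table 8.1d can be used to split an overall failure rate among the individual failure modes. The sketch below uses the thyristor row, the only one that lists a value in all four columns; the overall failure rate is a hypothetical figure introduced purely for illustration.

```python
# Failure mode percentages for thyristors, as listed in Table 8.1d.
thyristor_modes = {"shorts": 20, "opens": 20, "drift": 50, "functional": 10}

# Hypothetical overall failure rate (failures per hour); not from the text.
total_rate = 2e-7

# The percentages of a complete row should add up to 100.
assert sum(thyristor_modes.values()) == 100

# Rate attributed to each mode = fraction of failures in that mode * overall rate.
mode_rates = {mode: pct / 100 * total_rate for mode, pct in thyristor_modes.items()}

for mode, rate in mode_rates.items():
    print(f"{mode:<10} {rate:.1e}")
```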


Once the failure mode has been identified, its consequent effects have to be studied. Local and general failure effects are evaluated, as well as the final effect, that is to say the effect at the highest level of the system. In this phase it is important to list all the possible and/or potential failure modes of the system on which the FMEA analysis is being performed.

To this end, it is worthwhile, although not always possible, to involve the manufacturers of the components or equipment used in determining the failure modes of their products. It is therefore important to possess the most complete documentation possible. Failure modes are generally deduced in different ways depending on whether a component is new or has previously been used. In particular:

• If the component or device is new, or is new to the project designer, so that there is insufficient data regarding its behavior, failure modes can be investigated by looking for similarities with components that have the same function, or from the results of tests on components operating under particularly harsh conditions. Failure modes can also be deduced from a theoretical study of the component or the system.

• If the component has been used previously, failure modes are generally deduced from maintenance and service reports, performance reports, failure records, laboratory tests and other information available to the company, including information furnished by the manufacturer.
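For a component that has already been used, deducing failure modes from service records can be as simple as counting how often each mode was reported. The sketch below assumes a hypothetical list of (component, failure mode) records; neither the data nor the function name come from the text.

```python
from collections import Counter

# Hypothetical service records: (component, observed failure mode).
reports = [
    ("relay", "does not open"),
    ("relay", "does not close"),
    ("relay", "does not open"),
    ("resistor", "open circuit"),
    ("relay", "does not open"),
]

def mode_frequencies(records, component):
    """Relative frequency of each failure mode recorded for one component."""
    counts = Counter(mode for comp, mode in records if comp == component)
    total = sum(counts.values())
    return {mode: n / total for mode, n in counts.items()}

relay_modes = mode_frequencies(reports, "relay")
print(relay_modes)
```

With real data the number of records per component would need to be large enough for such frequencies to be meaningful.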

FMEA analysis studies individual failure modes and their effects on the system. It is less suited to studying combinations of failures and dependence on the sequence of failures. Example 1 demonstrates this concept.

Further consideration must be given to so-called common cause failures, which are quite frequent. These are failures which originate from a single event that causes the simultaneous failure of two or more components. Common cause failures are typically due to environmental conditions, poor performance or characteristics (due to design insufficiency, for example), defects during construction, errors in assembly or installation, or human errors, e.g. during operation or maintenance. FMEA analysis permits a qualitative evaluation of such failures.

Step 5 – Identifying the causes of failures.

It is advisable to indicate the cause of every type of failure; the more serious the effect of the type of failure, the more accurately the cause must be described.

Step 6 – Identifying the effects of failure modes.

A failure mode generally has effects in terms of reduced or absent functioning of the system, and may lead to more serious situations of harm to personnel and/or the environment. Two types of effects are typically recognized: local and final. Local effects involve the component under examination. Final effects are seen at the level of a subsystem or system.


Step 7 – Definition of measures and methods for identifying and isolating failures.

This phase defines the ways and methods for identifying and isolating failures, suggesting the procedures to be followed to detect a failure and the means to do so. The objective is to give the operator or maintenance personnel the information needed to verify the presence or absence of the failure mode under consideration. FMEA analysis can be applied to a process or a product, and also to a design process. The ways and the time needed to identify a failure change from case to case. When FMEA analysis is applied to a process, it is necessary to establish where failures are most efficiently detected, e.g. by an operator while the process is running, by Statistical Process Control (SPC), by Quality Control (QC) etc. When it is applied to a design project, maximum attention must be paid to when and where a failure mode can be easily identified: during project revision, the analysis phase or the test phase.

Step 8 – Prevention of undesirable events.

In this phase, possible design or operative measures to prevent undesirable events are identified. The most direct approach is to get to the root of the problem so that undesirable events do not occur. It is also possible to use redundancy, alarms and monitoring systems and to introduce means of limiting the potential damage.

Step 9 – Classification of the severity of final effects.

This classification is carried out taking into account various aspects: the nature of the system under examination, performance and functional characteristics of the system, contractual and legal requirements, especially in regard to operator safety, and finally, guarantee requirements. An example of this classification is proposed by IEC in [1] as in Table 8.2.

Table 8.2 Example of classification of final effects (*)

Class Severity level Consequence to persons or environment

I Insignificant A failure mode can be so classified when it could potentially degrade system functions but will cause no damage to the system, and the consequent situation or state of the system does not constitute a threat to life or injury.

II Marginal A failure mode can be classified as marginal when it could potentially degrade system performance function(s) without appreciable damage to the system or threat to life or injury.

III Critical A failure mode is critical when it could potentially result in the failure of the system’s primary functions and therefore cause considerable damage to the system and its environment. However, it is very important to check that the failure mode so classified does not constitute a serious threat to life or injury.

IV Catastrophic A failure mode is catastrophic when it could potentially result in the failure of the system’s primary functions and therefore cause serious damage to the system and its environment and/or personal injury.

(*) IEC 60812:2006-01.
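A severity classification of this kind lends itself to an ordered type, so that failure modes can be sorted by the gravity of their final effect. The sketch below encodes the four classes of Table 8.2; the failure modes and the classes assigned to them are hypothetical.

```python
from enum import IntEnum

class Severity(IntEnum):
    """Severity classes I to IV of Table 8.2."""
    INSIGNIFICANT = 1
    MARGINAL = 2
    CRITICAL = 3
    CATASTROPHIC = 4

# Hypothetical failure modes with an assigned severity class.
failure_modes = [
    ("display error", Severity.MARGINAL),
    ("does not stop", Severity.CATASTROPHIC),
    ("reduced flow", Severity.INSIGNIFICANT),
    ("no output", Severity.CRITICAL),
]

# Review the most severe final effects first.
ranked = sorted(failure_modes, key=lambda fm: fm[1], reverse=True)
for name, sev in ranked:
    print(f"{sev.value} {sev.name:<13} {name}")
```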


Step 10 – Multiple failure modes

This step considers specific combinations of multiple failure modes. It should be noted, at this point, that FTA is generally more suitable for taking multiple failure modes into account. This consideration is very important when selecting the preferred methodology of analysis.
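Why combinations deserve separate attention can be seen by enumerating failure sets on a small redundant system. The sketch below is a hypothetical example (two redundant pumps and one controller, with an assumed structure function) and is not taken from the text.

```python
from itertools import combinations

COMPONENTS = ["pump_A", "pump_B", "controller"]

def system_works(failed):
    """Hypothetical structure: the controller plus at least one of two redundant pumps."""
    pumps_ok = "pump_A" not in failed or "pump_B" not in failed
    return "controller" not in failed and pumps_ok

def critical_sets(order):
    """Combinations of `order` simultaneous failures that bring the system down."""
    return [set(c) for c in combinations(COMPONENTS, order) if not system_works(set(c))]

single = critical_sets(1)
double = critical_sets(2)
print(single)
print(double)
```

A one-failure-at-a-time analysis flags only the controller, whereas the pair of pumps appears as critical only when double failures are enumerated.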

Step 11 – Recommendations

Recommendations are made by reporting useful observations that clarify, for example, any aspects not completely analyzed, unusual conditions, the effects of failures of redundant elements, critical aspects of the project, references to other data for the analysis of sequential failures and all observations made upon completing the analysis. Figure 8.1 shows the typical procedure of an FMEA analysis.

8.2.2 FMEA Typology

An FMEA can be applied in various fields. The following types can be defined:

• Service FMEA, which identifies and highlights any possible non-conformity (NC) associated with planning a service. With such an analysis, it is possible to study the effects that an NC can generate while the service is being provided. The objective is to identify the critical points of the service and to eliminate NCs.

• System FMEA, in which the propagation of the effects of possible failures of a component through the functional levels of the system is analyzed. The objective is to minimize the effects of such failures and to identify critical points for corrective actions in order to increase the availability performance of the system.

• Project FMEA, carried out during the design phase of the project, before initiating the production process. The objectives are the same as those of a System FMEA.

• Process FMEA, which identifies the critical points of the process and their influence on the manufacture of the product (criticality/breakdown/damage to the production process that could lead to NC of the product). This analysis evaluates the effects that these failures can have on the functionality and safety of the product. The objective is to identify the critical points of the process to be kept under control, to detect possible causes of failures so as to minimize their effect on the product and, finally, to improve the reliability performance of the process.


Fig. 8.1 Analysis flowchart for FMEA (in compliance to IEC 60812).


8.2.3 The Concept of Criticality

An analysis of criticality is usually the objective of an in-depth FMECA analysis rather than of an FMEA. However, in some cases it is useful to carry out a qualitative criticality analysis even when conducting an FMEA analysis, especially when a complete analysis proves to be excessively taxing but one wishes to have an indication of the criticality of events.

Criticality is a way to quantify the attention that should be given to a given failure, event or non-conformity, and depends both on the probability of its occurrence and on the gravity of the consequences it may have.

The attention to dedicate to an event depends first of all on the effect it may have on the safety of personnel, on the damage it can cause with subsequent losses and on its effect on the availability of service. It is rather difficult to define a generally valid criterion for evaluating criticality, because both the seriousness of the consequences and the probability of their occurrence come into play.

The level of gravity can vary and can be evaluated differently if the objective, for example, regards the safety of personnel, damage and relative losses, or the availability of service.

Criticality is defined by means of a scale of values that permits evaluating the seriousness of consequences as a function of the criteria taken into consideration. Table 8.2 shows a classification with four principal levels of the gravity of consequences. A different number of levels may also be used.
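A qualitative criticality scale of this kind can be sketched by combining the severity class with an ordinal probability level. In the example below the severity classes follow Table 8.2, while the probability levels, the use of a simple product as rank and the thresholds are all assumptions made for illustration only.

```python
# Severity classes follow Table 8.2; the probability levels are hypothetical.
SEVERITY = {"insignificant": 1, "marginal": 2, "critical": 3, "catastrophic": 4}
PROBABILITY = {"remote": 1, "occasional": 2, "frequent": 3}

def criticality(severity, probability):
    """Ordinal rank that grows with both the gravity and the likelihood of the event."""
    return SEVERITY[severity] * PROBABILITY[probability]

def attention(rank):
    # Hypothetical thresholds for deciding how much attention an event deserves.
    if rank >= 8:
        return "unacceptable"
    if rank >= 4:
        return "review"
    return "acceptable"

events = {
    "contact welds": ("critical", "occasional"),
    "display error": ("marginal", "remote"),
    "does not stop": ("catastrophic", "occasional"),
}
for name, (sev, prob) in events.items():
    rank = criticality(sev, prob)
    print(name, rank, attention(rank))
```

Each organization would normally define its own levels and acceptance thresholds according to the criteria it considers relevant.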

8.2.4 Final Considerations on FMEA Analysis

It is clear that when complex systems must be analyzed, an FMEA analysis tends to become cumbersome, tedious and filled with possible repetitions. In these cases, the experience of those conducting the analysis plays an important role. Furthermore, some circumstances tend to simplify the analysis. It is rare that a system is designed from scratch in all its parts. It is often a case of revising an already existing system or, in the case of a new project, of reusing some subsystems that have already been employed satisfactorily in other projects. If the working conditions are the same, it may be possible to use the previous reliability considerations for the new project.

The results of a FMEA can give important information for establishing the priority of a statistical control of the process, incoming samples, inspections and qualifications.

Example 1

A brief example of an FMEA analysis on an industrial controller, based on a microcontroller and consisting of various types of electronic boards, is given here. For simplicity, we will not discuss the initial part of the analysis (definition of the system, drawing up of block diagrams and definition of basic principles). The analysis, conducted at the level of the single component, has as its main objective the analysis of failure modes that can lead to unsafe conditions for personnel.


To this end, the parts with safety functions are validated by demonstrating their conformity to the basic principles of safety and by confirming that the specifications, design, implementation and choice of components comply with standards. The functioning in regard to environmental influences has been verified by means of suitable tests of electromagnetic compatibility (EMC). The criteria of analysis are the following: 1) the effects of failures in sequence (the failure of one component leads to failures in other components) are not studied. The first failure and all successive failures are considered as a single failure; 2) common failure modes are considered as single breakdowns; 3) only one independent failure is verified at a time.

8.2.4.1 Analysis of Failure Modes: Discussion and Exclusions

In microprocessor/microcontroller systems, the possibility that the device locks up must always be considered. The design team evaluates whether to activate a software watchdog function to block the outputs in the case of an infinite loop in the program cycle, or even to use an external watchdog circuit. Memory, which can be internal or external to the device, must be suitably monitored if its failure can lead the system to work in an unsafe manner or cause damage to the system or to the item being manufactured.
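The behavior expected from a watchdog can be illustrated with a small simulation. This is only a sketch of the principle, not firmware: the class, its tick-based timing and the timeout value are hypothetical.

```python
class Watchdog:
    """Simulated watchdog: outputs are blocked if the program stops kicking it."""
    def __init__(self, timeout_ticks):
        self.timeout_ticks = timeout_ticks
        self.remaining = timeout_ticks
        self.outputs_enabled = True

    def kick(self):
        # Called by the main program cycle to signal that it is still running.
        if self.outputs_enabled:
            self.remaining = self.timeout_ticks

    def tick(self):
        # Called by a periodic timer, independently of the main program cycle.
        if self.remaining > 0:
            self.remaining -= 1
        if self.remaining == 0:
            self.outputs_enabled = False  # latch the outputs in the blocked state

wd = Watchdog(timeout_ticks=3)
for _ in range(5):   # healthy cycle: kicked before every timer tick
    wd.kick()
    wd.tick()
healthy = wd.outputs_enabled
for _ in range(3):   # program stuck in an infinite loop: no more kicks
    wd.tick()
print(healthy, wd.outputs_enabled)
```

Once the timeout expires the outputs stay blocked even if the program resumes kicking, which is one possible design choice for keeping the system in a safe state until it is deliberately restarted.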

A system is generally designed in such a way that the loss of data will not cause dangerous situations, either because the data in memory does not influence safety or because, in the event of a failure, the system responds, or refrains from responding, in an appropriate way. Given the criticality of the device, it is furthermore necessary to monitor the supply voltage by means of a supervisor circuit. Based on a careful analysis, we can exclude some failure modes on the grounds of:

1. low probability of the event;

2. acceptable technical experience;

3. technical requirements deriving from the application and the specific risks under consideration.

We therefore have a list of failures excluded a priori. The following examples may be useful in understanding this concept.

For a Safety Relay, the following failures are generally excluded:

• a short circuit between the three terminals of a relay;

• a short circuit between two pairs of contacts and between contacts and a coil;

• the simultaneous closure of normally open (NO) and normally closed (NC) contacts.

The exclusion of such failure modes can be deduced from the documentation furnished by the manufacturer as a guarantee¹.

¹ For example, from a manufacturer datasheet: “…All Safety relays are fitted with forced guided contacts to ensure the safe switching of a control system. All contacts are linked so that if one contact welds the others remain in their current position with open contacts maintaining a contact gap of > 0.5 mm. These relays meet the requirements of EN 50205”.


For contactors, the following failures are excluded:

• normally open contacts remain closed after having been deactivated. This failure mode is avoided, or rendered improbable, by design criteria based on overdimensioning. For this purpose, a component is used with a nominal current equal to double the current in the circuit, a switching frequency capability higher than necessary and a guaranteed total number of switching operations ten times greater than that expected. The circuit on which the contactor works is suitably protected against short circuits. Furthermore, measures are taken so that the electronic control opens the contactor only when the current is very much reduced. From the point of view of mechanical stress, the installation is such that the vibrations and shocks foreseen are much lower than the maximum values indicated by the manufacturer.

• the possibility of a short circuit between the three terminals of a changeover contact and between the contacts and the coil can be excluded based on the manufacturer’s guarantee (provided that the printed circuit is well made and the components are well soldered);

• finally, the simultaneous closure of normally open and normally closed contacts can be excluded whenever a safety relay is used.
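The overdimensioning criteria quoted for the contactor (a nominal current of twice the circuit current and ten times the expected number of switching operations) can be turned into a simple check. The sketch below is illustrative only: the data structure, the use of "at least" for both margins and all numeric values are assumptions.

```python
from dataclasses import dataclass

@dataclass
class Contactor:
    # All values are hypothetical datasheet and application figures.
    rated_current_a: float
    circuit_current_a: float
    rated_operations: int
    expected_operations: int

def overdimensioned(c: Contactor) -> bool:
    """Check the two margins quoted in the text: 2x on current, 10x on operations."""
    current_ok = c.rated_current_a >= 2 * c.circuit_current_a
    life_ok = c.rated_operations >= 10 * c.expected_operations
    return current_ok and life_ok

k1 = Contactor(rated_current_a=16.0, circuit_current_a=6.0,
               rated_operations=1_000_000, expected_operations=80_000)
k2 = Contactor(rated_current_a=10.0, circuit_current_a=6.0,
               rated_operations=1_000_000, expected_operations=80_000)
print(overdimensioned(k1), overdimensioned(k2))
```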

Further typical cases of failure modes that, under some hypotheses, can generally be excluded are those regarding printed circuits. In fact, the Printed Circuit Board (PCB; double sided, FR4-74, with a minimum thickness of 0.8 mm and a minimum copper base of 17 µm) is made with material conforming to the IEC 61249 standards, and both creepage distances and clearances are dimensioned in compliance with the IEC 60664 standards with a pollution degree of 2. The circuit is covered with a protective layer and the case that contains the circuits has a protection index (IP) of at least IP54. Finally, one notes that a short circuit between adjoining PCB tracks could occur when the electronic board works in dirty or humid conditions, in the presence of dripping water or when maintenance is not carried out (frequent in some industrial conditions). Substandard soldering can also give rise to both short circuits and open circuits.

One must also keep in mind that not all failure modes are always possible. Some component breakdowns can also lead to short circuits or open circuits. In this simplified analysis, this was not taken into consideration.

8.2.4.2 Drafting the FMEA Table

The following Table 8.3.1 demonstrates the FMEA table. Only a study of failure modes of a limited number of components is reported.
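When an FMEA table is kept in software, each line can be represented as a record whose fields mirror the column headings. The sketch below follows the headings of Table 8.3.1 and fills in the push button entry 1.2 reported there; the class and field names themselves are hypothetical.

```python
from dataclasses import dataclass, asdict

@dataclass
class FmeaRow:
    """One line of an FMEA worksheet; fields mirror the columns of Table 8.3.1."""
    ref: str
    component: str
    mode_id: str
    failure_mode: str
    possible_cause: str
    detection_method: str
    final_effects: str
    compensating_provision: str
    notes: str = ""

row = FmeaRow(
    ref="P1", component="Push button", mode_id="1.2",
    failure_mode="Fail to close", possible_cause="Mechanical failure",
    detection_method="Simulated", final_effects="The operation is not performed",
    compensating_provision="Not necessary",
)
print(asdict(row))
```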

Example 2

Table 8.3.2 shows a second example of an FMEA analysis. Note that the table of the analysis is partly different from the preceding one.

Table 8.3.1 Extract of FMEA analysis for applicative example 1.

Ref. | Component | Id. | Failure mode | Possible failure cause | Detection method | Final effects | Compensating provision against failure | Notes

P1 | Push button | 1.1 | Fail to open | Contacts weld | Simulated | At the end of the operation it is not possible to start another operation | Not necessary | The system works on the transition.

P1 | Push button | 1.2 | Fail to close | Mechanical failure | Simulated | The operation is not performed | Not necessary |

K1 | Relay | 1.1 | NO contacts always open, NC contacts always closed | Coil breakdown | Analysis of the schematic circuit | The operation is not performed | Not necessary |

K1 | Relay | 1.2 | NO contacts always closed, NC contacts always open | Mechanical failure and contacts weld | Analysis of the schematic circuit | The operation is not performed | Not necessary |

K1 | Relay | 1.3 | Three terminal short circuit | Failure mode not possible (see notes) | − | − | − | Safety relay

K1 | Relay | 1.4 | Other terminal short circuit | Failure mode not possible (see notes) | − | − | − | Safety relay

K1 | Relay | 1.5 | No synchronization between contacts (simultaneous NO/NC) | Failure mode not possible (see notes) | − | − | − | Safety relay

R1 | Resistor | 5.1 | Open circuit | Thermal stress | Simulated | Impossibility to activate movement on X coordinate | Not necessary |

R1 | Resistor | 5.2 | Short circuit | Thermal stress | Simulated | | Not necessary | Not dangerous failure mode

R1 | Resistor | 5.3 | Nominal value modification | Thermal stress | Changing the resistor with another one of different value | Null in the range 60% - +80% of the nominal value | | No malfunction is detectable

Table 8.3.2 Extract of FMEA analysis for applicative example 2.

Ref. | Item | Failure Id | Failure mode | Possible failure causes | Symptom detected by | Local effect | Effect on unit output (total effect) | Compensation provision | Severity class (S) | Recommendations and actions taken

1 | Motor stator | 1.1 | Open circuit | Winding fracture | Low speed roughness | Low power | Trip | Single phase protection temperature trip | 4 | Nothing

1 | Motor stator | 1.2 | Open circuit | Connection fracture | Low speed roughness | Low power | Trip | Single phase protection temperature trip | 3 | Nothing

1 | Motor stator | 1.3 | Isolation breakdown | Persistent high temperature manufacturing defect | Protection system | Overload | No output | Annual inspection temperature trip | 4 | Nothing

1 | Motor stator | 1.4 | Thermistor open circuit | Ageing; connection fracture | Protection system | None | No output | Fitted spare | 3 | Recommend a spare connected through to outside casing

1 | Motor stator | 1.5 | Thermistor short circuit | Protection system | Protection system | Reduced trip margin | No output if load is high | Fitted spare temperature trip | 3 | Recommend a spare connected through to outside casing

2 | Motor cooling system | 2.1 | Inadequate cooling | Blockage low diff. pressure | High temperature stator measured by thermistor | Excessive winding temperature | Excessive motor temperature | Temperature trip stator | 2 | Nothing


Example 3

The following is a brief discussion of a third example of an FMEA analysis. It refers to the IEC 60812:2006 standard, which can be consulted for further information. The analysis is conducted on a subsystem of a system composed of a motor and a generator. The study does not take into consideration the effects of breakdowns on loads fed by groups or systems connected in any way to the system under examination. The system is first of all subdivided into subsystems, as represented by the block diagrams reported in Figure 8.2. Subsystem 2 is further developed down to the level of the component, and it is at this level that the FMEA analysis is carried out. The FMEA is not reported here since, in many respects, it is similar to those already presented. Some aspects, however, deserve to be explicitly described. First of all, the block diagram must be developed in such a way that each block is easily identifiable by way of a clear and simple system. In the case under examination, as one goes progressively into detail, the blocks are identified by an unambiguous numerical code, e.g. 2.1.2. This is a great help when one must write and interpret the table of the results of the analysis. Furthermore, the FMEA analysis is conducted here on a system where electric, electronic and mechanical aspects exist together. This means that FMEA is useful for the analysis of so-called multi-disciplinary systems in various fields.

Fig. 8.2 Block diagram of system under examination (IEC 60812:2006).

[Block diagram. Motor – generator set, subdivided into: 1 Machine structure; 2 Enclosure heating, ventilation and cooling system; 3 DC machine; 4 AC machine; 5 Instrumentation. Subsystem 2 is subdivided into: 2.1 Ventilation and cooling system; 2.2 Emergency cooling (inlet/outlet) doors; 2.3 Condensate/cooling-water drain system; 2.4 Heater system. Block 2.1 is further subdivided into blocks 2.1.1 (…..) and 2.1.2 (…..).]


8.3 Failure Mode, Effects, and Criticality Analysis (FMECA)

The evaluation of the criticality of failure modes and their effects has already been discussed, but always on a qualitative basis, relying on experience and knowledge without going into the detail of a quantitative evaluation. If such an evaluation is required, the analysis is called Failure Mode, Effects, and Criticality Analysis (FMECA). An example of an FMECA flowchart is depicted in Fig. 8.3. In general, a quantitative evaluation of criticality can be applied when data on the failure mode rates of the systems or components to be analyzed are available; otherwise a qualitative evaluation is applied. Establishing what is critical, and the probability that a failure mode will occur, is very useful in determining what corrective action must be taken and in drawing the line between acceptable and unacceptable risks.

Different types of critical failure modes can be identified (each company can define its own categories and classes). A scale of criticality based on the following categories is generally valid:

1. death or injury to the public or company personnel
2. damage to this or other equipment
3. economic damage deriving from loss of output or loss of system functions
4. inability to perform a function due to inability of equipment to properly perform its principal function

An example of a scale of criticality is seen in Table 8.2. The choice of criticality categories requires careful study and prudence. It is necessary to keep in mind all the factors that have an impact on the evaluation of the system: its performance, costs, programs, safety and risks.


Fig. 8.3 FMECA flowchart (in compliance with IEC 60812).

[Flowchart. Start FMECA → Select a level of analysis → Select a component of the item, system or subsystem to analyze → Identify failure modes of the selected component → Select the failure mode to analyze → Identify the immediate effect and the final effect of the selected failure mode → Determine the severity of the final effects → Identify potential causes of that failure mode → Evaluate the frequency or probability of occurrence of the selected failure mode during the predetermined time period → Do severity and/or probability of occurrence warrant the need for action? If YES: propose mitigation methods, corrective actions or compensating provisions (design review), and identify actions and responsible personnel. In either case: write document notes, recommendations, actions, and remarks → Are there further component failure modes to analyze? If YES, return to the selection of the failure mode. If NO → Are there further components to be taken into account for analysis? If YES, return to the selection of the component. If NO → End of FMECA; fix the next revision date as appropriate.]


8.3.1 Failure Modes and Their Probability

After the failure modes have been identified, it is necessary to evaluate the corresponding probability of occurrence. In an FMECA, this evaluation is carried out analytically. To this end, detailed information on the reliability of the components/devices, for example their failure rates, must be available. Whenever a qualitative analysis is applied, by choice or due to a lack of data, the probability with which a failure mode is manifested is generally described by discrete levels. For example, these can be the following:

• Level A, when the failure mode can occur frequently;
• Level B, when the failure mode is reasonably probable;
• Level C, when the failure mode is occasional;
• Level D, when the failure mode is remote;
• Level E, when the failure mode is extremely rare.

For a quantitative analysis, the evaluation of two indexes is made, RPN and Cm, to be discussed below.

8.3.2 Evaluation of Criticality

This evaluation can be carried out by means of a criticality grid where the abscissa represents the probability or frequency of failure and the ordinate represents the class of criticality.

The failure modes, duly classified after their probability has been evaluated, are inserted into one of the squares of the criticality grid reported in Table 8.9, which is discussed in more depth later. Obviously, the farther a square is from the origin, the greater the criticality of the failure mode and thus the greater the necessity to adopt appropriate countermeasures.

The evaluation of criticality involves quantifying the effects of a failure mode/nonconformity. This operation is not always easy to carry out and often involves brainstorming. Criticality can be measured in various ways, from which different types of FMECA are derived. Two are presented here: the first based on risk and the second based on the failure rate.

8.3.3 FMECA Based on the Concept of Risk

This follows the STANDARD IEC 60812, and refers to the concept of Risk R and Risk Priority Number (RPN).

Risk is evaluated by means of a suitable measure of the severity of the effects and an estimate of the expected probability that the failure mode is manifested within a predetermined interval of time. A measure of potential risk is therefore:

$$R = S \cdot P \qquad (8.1)$$


where

• S (Severity) represents an estimate of how strongly the effects of a failure impact the system or the user (personnel or customer, for example). This is the gravity or criticality of the failure and is generally expressed in levels of criticality. Note that S is a dimensionless number.

• P (Probability) denotes the probability of occurrence. This parameter is also a dimensionless number.

Additional information concerning failure detection at system level can be included by means of a further parameter named D (Detection).

The RPN is instead given by the following equation:

$$RPN = S \cdot O \cdot D \qquad (8.2)$$

where:

• O (Occurrence) is the probability that a failure mode will be manifested in an established time, which usually coincides with the useful life of the component under examination. It may be defined as a ranking number (or index number) rather than the actual probability of occurrence. A design change can remove or limit one or more failure modes; this is the only way to reduce the occurrence ranking.

• D (Detection) is an estimate of the possibility of identifying/diagnosing and eliminating/preventing the onset of a breakdown before its effects are manifested on the system or personnel. This number is usually ranked in reverse order with respect to the severity or occurrence numbers: the higher the detection number D, the less probable it is that the failure will be identified, and vice versa. It follows that a lower probability of detection leads to a higher RPN, which indicates the need to resolve the failure mode with maximum priority and speed. Detection capability is mainly obtained or planned in the design phase. Typical design controls are design verification or validation activities such as design review, road testing in the automotive industry, etc. Detection is thus an assessment of the capability of the design review to detect a potential cause, mechanism or design weakness.

The level of severity, together with the RPN, makes it possible to establish on which failure modes resources should be concentrated in order to mitigate or eliminate the effects.

S, O and D are generally assigned values from 1 to 4 or 5, and in some contexts from 1 to 10, as reported in Tables 8.4, 8.5 and 8.6.

Even though such examples can be used as a reference, every evaluation of the values of S, O and D must be related to personal experience and to the type of analysis being carried out (on a product, a process or working conditions).
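The product in (8.2) can be illustrated with a minimal Python sketch. The function name `rpn` and the range check on a 1 to 10 scale are assumptions made for illustration; the (S, O, D) triples are those of the four rows of Table 8.7.

```python
def rpn(severity: int, occurrence: int, detection: int, scale_max: int = 10) -> int:
    """Risk Priority Number, RPN = S * O * D, for rankings on a 1..scale_max scale."""
    for name, value in (("S", severity), ("O", occurrence), ("D", detection)):
        if not 1 <= value <= scale_max:
            raise ValueError(f"{name} must lie in 1..{scale_max}, got {value}")
    return severity * occurrence * detection


# (S, O, D) rankings of the four failure modes listed in Table 8.7
rows = [(2, 1, 1), (2, 2, 1), (8, 6, 10), (7, 4, 5)]
rpn_values = [rpn(s, o, d) for s, o, d in rows]
print(rpn_values)  # [2, 4, 480, 140]
```

Sorting the failure modes by decreasing RPN is then one possible way to rank them, bearing in mind the limits of this index.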


However, even when the RPN has been evaluated accurately, erroneous deductions may be drawn from it. In fact this parameter, as it is defined, presents some problems:

• Gaps in the range: with the values of S, O and D summarized in the tables, the RPN index does not assume 1000 distinct values, as might be expected from multiplying three factors each on a scale of 1-10, but only 120 different values: 88% of the range is empty. Moreover, two identical values of RPN can derive from different values of the parameters S, O and D, and this must be kept in due consideration.

• Duplicate RPNs: different situations can produce identical RPN values.

• Sensitivity to small changes: a small variation in one of the factors implies a notable variation in the RPN value when the other two factors are large, but only a minor variation when the other two factors assume lower values.

• Inadequate scaling: the distance between contiguous RPN values is not always the same.

• Inadequate scale of RPN: a difference in RPN value might appear negligible while in fact being significant. An example helps to clarify the possible situations. Two different classifications might lead to the following values:

$$\text{Scenario 1:}\quad \begin{cases} S_1 = 5 \\ O_1 = 3 \\ D_1 = 3 \end{cases} \;\Rightarrow\; RPN_1 = S_1 \cdot O_1 \cdot D_1 = 45$$

$$\text{Scenario 2:}\quad \begin{cases} S_2 = 5 \\ O_2 = 4 \\ D_2 = 3 \end{cases} \;\Rightarrow\; RPN_2 = S_2 \cdot O_2 \cdot D_2 = 60$$

Note that RPN2 is not twice RPN1, whereas the probability of occurrence associated with O2 is twice that associated with O1, as shown in Table 8.5. In fact, an occurrence rating of 3 corresponds to a probability of 5·10^-4, whereas an occurrence rating of 4 corresponds to a probability of 10·10^-4. This example shows that RPN values cannot be compared linearly.
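The figures quoted for the gaps in the RPN range can be checked by brute-force enumeration. The following is a minimal sketch, assuming all three factors are integers on a 1 to 10 scale; the variable names are illustrative.

```python
from itertools import product

ranks = range(1, 11)  # integer rankings 1..10 for S, O and D

# Every value that the product S * O * D can actually take
distinct = sorted({s * o * d for s, o, d in product(ranks, repeat=3)})
n_distinct = len(distinct)
empty_fraction = 1 - n_distinct / 1000
print(n_distinct, f"{empty_fraction:.0%}")  # 120 88%

# Duplicate RPNs: all (S, O, D) triples that give the same RPN of 60
triples_60 = [t for t in product(ranks, repeat=3) if t[0] * t[1] * t[2] == 60]
print(len(triples_60), "different triples share RPN = 60")
```

The second part shows how quite different combinations, for instance (5, 4, 3) and (10, 6, 1), collapse onto the same RPN.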


Table 8.4 Table for determining the parameter S (according to IEC 60812:2006 and MIL-HDBK-338B). Please note that this is an example of classification used in the automobile industry.

| S | Criteria | Ranking |
|---|---|---|
| None | No discernible effect is present. | 1 |
| Very minor | Fit and finish/squeak and rattle item does not conform. Defect noticed by discriminating customers/operators (or by less than 25%). | 2 |
| Minor | Fit and finish/squeak and rattle item does not conform. Defect noticed by average customer or operator (or by 25% - 75% of customers/operators). | 3 |
| Very low | Fit and finish/squeak and rattle item does not conform. Defect noticed by most customers or operators (for example greater than 75%). | 4 |
| Low | The vehicle(s) or the item(s) under consideration is (are) operable but comfort/convenience item(s) operable at a reduced level of performance. Customer is somewhat dissatisfied. | 5 |
| Moderate | The vehicle or item under consideration is operable but comfort/convenience item(s) is inoperable. Customer is dissatisfied. | 6 |
| High | The vehicle or item under consideration is operable but at a reduced level of performance. Customer is very dissatisfied. | 7 |
| Very high | Vehicle or item under consideration is inoperable (there is a loss of primary function). | 8 |
| Hazardous with warning | Very high severity ranking when a potential failure mode affects safe vehicle operation and/or involves non-compliance with government regulation and/or mandatory standards. | 9 |
| Hazardous without warning | Very high severity ranking when a potential failure mode affects safe vehicle operation and/or involves non-compliance with government regulation and/or mandatory standards without warning. | 10 |

Table 8.5 Recurrence of modes of failure, frequency and probability (according to IEC 60812:2006 and MIL-HDBK-338B).

| Failure mode definition | Failure mode description | Frequency (‰) (IEC) | Probability (IEC) | Possible failure rates (MIL) | Rating, O |
|---|---|---|---|---|---|
| Remote | Failure is unlikely | ≤ 0.010 | ≤ 1·10^-5 | ≤ 1 in 1500000 | 1 |
| Low | Relatively few failures | 0.1 | 1·10^-4 | 1 in 150000 | 2 |
| | | 0.5 | 5·10^-4 | 1 in 15000 | 3 |
| Moderate | Occasional failures | 1 | 1·10^-3 | 1 in 2000 | 4 |
| | | 2 | 2·10^-3 | 1 in 400 | 5 |
| | | 5 | 5·10^-3 | 1 in 80 | 6 |
| High | Repeated failures | 10 | 1·10^-2 | 1 in 20 | 7 |
| | | 20 | 2·10^-2 | 1 in 8 | 8 |
| Very high | Failure is almost inevitable | 50 | 5·10^-2 | 1 in 3 | 9 |
| | | ≥ 100 | ≥ 1·10^-1 | ≥ 1 in 2 | 10 |


Table 8.6 Criteria for evaluating parameter D (according to IEC 60812:2006 and MIL-HDBK-338B).

| D | Criteria: likelihood of detection by Design Control | Ranking |
|---|---|---|
| Almost certain | Design Control (or design review) will almost certainly detect a potential cause or mechanism and subsequent failure mode. | 1 |
| Very high | Very high chance the Design Control (or design review) will detect a potential cause/mechanism and subsequent failure mode. | 2 |
| High | High chance the Design Control (or design review) will detect a potential cause/mechanism and subsequent failure mode. | 3 |
| Moderately high | Moderately high chance the Design Control (or design review) will detect a potential cause/mechanism and subsequent failure mode. | 4 |
| Moderate | Moderate chance the Design Control (or design review) will detect a potential cause/mechanism and subsequent failure mode. | 5 |
| Low | Low chance the Design Control (or design review) will detect a potential cause/mechanism and subsequent failure mode. | 6 |
| Very low | Very low chance the Design Control (or design review) will detect a potential cause/mechanism and subsequent failure mode. | 7 |
| Remote | Remote chance the Design Control (or design review) will detect a potential cause/mechanism and subsequent failure mode. | 8 |
| Very remote | Very remote chance the Design Control (or design review) will detect a potential cause/mechanism and subsequent failure mode. | 9 |
| Absolutely uncertain | Design Control (or design review) will not and/or cannot detect a potential cause/mechanism and subsequent failure mode. This ranking is also selected when there is no design review process. | 10 |

Table 8.7 Evaluation of RPN: an example.

| N | Phase (operation) | Mode of failure | Cause | Effect on component or system | S | O | D | RPN |
|---|---|---|---|---|---|---|---|---|
| 1 | Incoming inspection | Variations not noted by supplier | Deficiency in information system | Product cannot be sold. Delay in delivery. | 2 | 1 | 1 | 2 |
| 2 | | FIFO not respected | Traceability not available | Material does not conform to aging or specifications | 2 | 2 | 1 | 4 |
| 3 | | Exchange of printed circuit (PCB) | Wrong marking | Sending a non-conforming PCB to production | 8 | 6 | 10 | 480 |
| 4 | | Damage from ESD to PCB | Product manipulation does not conform | Printed circuit damaged | 7 | 4 | 5 | 140 |

The aforementioned considerations lead us to deduce that drawing legitimate conclusions from the study of RPN must be done with extreme prudence. Table 8.7 shows some examples of the evaluation of the RPN coefficient.


8.3.4 FMECA Based on the Failure Rate

Estimating the criticality of a failure mode can also be implemented by means of a study of the failure rates of devices, subsystems and constituent parts of the system.

Unfortunately, the failure rates that can generally be found in databanks refer to components and not to the failure modes of components. There is also a further complication: the available data are usually valid only under well-defined environmental and operating conditions. Available failure rates are therefore not immediately usable and cannot be included as such in the final analysis report. An estimate of the failure rate of a given failure mode is calculated through the following formula:

$$\lambda_m = \lambda_c \cdot \alpha_m \cdot \beta_m \qquad (8.3)$$

where

• λm is the failure rate of the single failure mode to be analyzed;
• λc is the failure rate of the item;
• αm is the probability that the item, when it breaks down, fails in failure mode m; obviously, for an item, $\sum_m \alpha_m = 1$;
• βm is the conditional probability of the failure effect given the failure mode m, i.e. the probability that, given that failure mode, the critical effect under examination is produced. The value of this parameter can be selected according to the information in Table 8.8.

Table 8.8 Criteria for evaluation of parameter βm.

| Failure mode effect | Value |
|---|---|
| Real loss | βm = 1 |
| Probable loss | 0.1 < βm < 1 |
| Possible loss | 0 < βm ≤ 0.1 |
| No effect | βm = 0 |

This relationship is valid in the hypothesis of constant failure rate. This is not always true and is one of the limits of this approach.

It is often useful to have an indication regarding time, for example the useful life of the component, denoted as tc. In this case, the coefficient of criticality of the failure mode (sometimes named Failure Mode Criticality Number) is used:

$$C_m = \lambda_m \cdot t_c = \lambda_c \cdot \alpha_m \cdot \beta_m \cdot t_c \qquad (8.4)$$

Note that the time of observation, which often but not always coincides with the useful life of the component, refers to the component and not to the failure mode. A single component can often have several failure modes. If these failure modes are n, then:

$$C_c = \sum_{m=1}^{n} C_m = \sum_{m=1}^{n} \lambda_m \cdot t_c = \sum_{m=1}^{n} \lambda_c \cdot \alpha_m \cdot \beta_m \cdot t_c \qquad (8.5)$$

where Cc is the coefficient of the criticality of the component. The probability that a failure mode is manifested within a certain time interval is:

$$P_m = 1 - e^{-C_m} \qquad (8.6)$$
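Relations (8.3) to (8.6) chain together naturally, as the following minimal Python sketch suggests. The function names and the numerical values (an item failure rate of 2·10^-6 per hour, an observation time of 10 000 hours and two failure modes) are hypothetical and chosen only for illustration; the constant failure rate hypothesis is assumed throughout.

```python
import math


def mode_failure_rate(lambda_c: float, alpha_m: float, beta_m: float) -> float:
    """Eq. (8.3): failure rate of a single failure mode."""
    return lambda_c * alpha_m * beta_m


def mode_criticality(lambda_c: float, alpha_m: float, beta_m: float, t_c: float) -> float:
    """Eq. (8.4): failure mode criticality number C_m."""
    return mode_failure_rate(lambda_c, alpha_m, beta_m) * t_c


def item_criticality(lambda_c: float, modes: list[tuple[float, float]], t_c: float) -> float:
    """Eq. (8.5): item criticality C_c, summed over (alpha_m, beta_m) pairs."""
    return sum(mode_criticality(lambda_c, a, b, t_c) for a, b in modes)


def mode_probability(c_m: float) -> float:
    """Eq. (8.6): probability that the failure mode shows up within t_c."""
    return 1.0 - math.exp(-c_m)


# Hypothetical item: lambda_c in failures per hour, t_c in hours
lambda_c, t_c = 2e-6, 1.0e4
modes = [(0.6, 1.0), (0.4, 0.1)]  # (alpha_m, beta_m); the alphas sum to 1

for alpha_m, beta_m in modes:
    c_m = mode_criticality(lambda_c, alpha_m, beta_m, t_c)
    print(f"C_m = {c_m:.4g}, P_m = {mode_probability(c_m):.4g}")
print(f"C_c = {item_criticality(lambda_c, modes, t_c):.4g}")
```

For small C_m the probability P_m is close to C_m itself, which is why the two quantities are often used interchangeably for rare failure modes.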

The range of Pm can be subdivided into classes as indicated in Table 8.9. Of the two failure modes classified there, one is more severe and the other has a greater probability of occurring. To decide on which of the two modes to concentrate first, it is necessary to take into consideration how the scales of the two axes were created and, above all, the type of application being dealt with.

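Table 8.9 Matrix of criticality

| Class C | Probability Pm | Failure mode plotted in this row |
|---|---|---|
| 5 | Pm > 0.2 | |
| 4 | 0.1 ≤ Pm < 0.2 | Failure Mode A |
| 3 | 0.01 ≤ Pm < 0.1 | |
| 2 | 0.001 ≤ Pm < 0.01 | Failure Mode B |
| 1 | 0 ≤ Pm < 0.001 | |

The vertical axis of the matrix is the probability class C and the horizontal axis is the severity (I, II, III, IV). The region near class 1 is marked "Low risk" and the region near class 5 is marked "High risk".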
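Assigning a probability Pm to one of the five classes of Table 8.9 amounts to a simple interval lookup, sketched below in Python. The function name is illustrative, and treating the boundary value Pm = 0.2 as class 5 is an assumption, since the table leaves that single point unassigned.

```python
from bisect import bisect_right

# Lower bounds of classes 2..5, read from the Pm column of Table 8.9
CLASS_BOUNDS = (0.001, 0.01, 0.1, 0.2)


def probability_class(p_m: float) -> int:
    """Return the class C (1..5) of Table 8.9 for a failure mode probability."""
    if not 0.0 <= p_m <= 1.0:
        raise ValueError(f"P_m must lie in [0, 1], got {p_m}")
    # bisect_right counts how many lower bounds are <= p_m
    return bisect_right(CLASS_BOUNDS, p_m) + 1


print(probability_class(0.15))   # 4, the row of Failure Mode A
print(probability_class(0.005))  # 2, the row of Failure Mode B
```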

In some environments, more importance can be given to severity, while in others the probability that an event can occur is more important.

If the purpose of the analysis is to produce a final criticality matrix, this can take the form seen in Table 8.10, where, for example, the level of severity can be of four types:

• Catastrophic: when a failure can cause the death of the operator or other persons and/or destruction of the system (e.g. a malfunction in an airplane).
• Critical: when a failure can cause damage to persons or the system, or a decrease in performance such as to prevent the objective from being reached.
• Marginal: when a failure can cause damage to persons, but to a lesser extent than in the preceding level, or damage to components or the system, leading to delays, unavailability or decreased performance.
• Insignificant (or minor): when a failure does not cause damage, loss, or decrease in performance etc.; however, unforeseen maintenance must be carried out.


Table 8.10 Risk – Criticality matrix (in compliance with IEC 60812:2006).

| Frequency of occurrence of failure effect | Severity 1: Insignificant | Severity 2: Marginal | Severity 3: Critical | Severity 4: Catastrophic |
|---|---|---|---|---|
| 5: Frequent | Undesirable | Intolerable | Intolerable | Intolerable |
| 4: Probable | Tolerable | Undesirable | Intolerable | Intolerable |
| 3: Occasional | Tolerable | Undesirable | Undesirable | Intolerable |
| 2: Remote | Negligible | Tolerable | Undesirable | Undesirable |
| 1: Improbable | Negligible | Negligible | Tolerable | Tolerable |
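A matrix such as Table 8.10 is easily encoded as a lookup table. The sketch below transcribes the table entries; the names `RISK_MATRIX` and `risk_class` are assumptions made for illustration.

```python
SEVERITY_LEVELS = {1: "Insignificant", 2: "Marginal", 3: "Critical", 4: "Catastrophic"}

# Rows: frequency level 5..1; columns: severity level 1..4 (from Table 8.10)
RISK_MATRIX = {
    5: ("Undesirable", "Intolerable", "Intolerable", "Intolerable"),  # Frequent
    4: ("Tolerable", "Undesirable", "Intolerable", "Intolerable"),    # Probable
    3: ("Tolerable", "Undesirable", "Undesirable", "Intolerable"),    # Occasional
    2: ("Negligible", "Tolerable", "Undesirable", "Undesirable"),     # Remote
    1: ("Negligible", "Negligible", "Tolerable", "Tolerable"),        # Improbable
}


def risk_class(frequency: int, severity: int) -> str:
    """Look up the risk class for a frequency level (1..5) and severity level (1..4)."""
    if frequency not in RISK_MATRIX or severity not in SEVERITY_LEVELS:
        raise ValueError("frequency must be 1..5 and severity 1..4")
    return RISK_MATRIX[frequency][severity - 1]


print(risk_class(3, 4))  # Intolerable: occasional but catastrophic
print(risk_class(2, 1))  # Negligible: remote and insignificant
```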

Failure rates can be found in appropriate databanks such as the MIL-Handbook-217, the Failure Rate Data Bank (FARADA), the RADC Non Electronic Reliability Notebook, or also more recent databanks: IEEE-Std-500 (Piscataway, NJ, 1984), OREDA (Offshore Reliability Data, Norway, 1984), EIREDA (European Industry Reliability Data Handbook, Italia, 1991), T-BOOK (Reliability Data of Components in Nordic Nuclear Power Plants, Sweden), CCPS (Guidelines of the Center for Chemical Process Safety, New York, 1989), NSWC-94/L07 - Handbook of Reliability Prediction Procedures for Mechanical Equipment. Whichever databank is used, it must be clearly indicated in the FMEA report.

8.3.4.1 Evaluation of the Criticality Coefficient of a Component

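The coefficient of criticality of a failure mode, referred to every million hours of life of a component, is expressed by (8.4), repeated here for clarity:

$$C_m = \lambda_m \cdot t_c = \lambda_c \cdot \alpha_m \cdot \beta_m \cdot t_c$$

while the coefficient of criticality of the component is given by (8.5):

$$C_c = \sum_{m=1}^{n} C_m = \sum_{m=1}^{n} \lambda_m \cdot t_c = \sum_{m=1}^{n} \lambda_c \cdot \alpha_m \cdot \beta_m \cdot t_c$$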

The failure rates λc are generally expressed per million hours, and consequently a multiplication by 10^6 appears in the reports. This notably simplifies things when, for example, spreadsheets are used for the calculation of the coefficients.

Assume, for example, λc = 7 × 10^-6 h^-1, which equals 7 failures per million hours, and that there are two failure modes. Two values can therefore be defined for the coefficient α, which estimates the probability that, when a component has failed, the failure is in mode m. As an example, consider the calculation of Cm and Cc for a given mission phase under severity classification Category III, with:

α1 = 0.3 for the first failure mode, with a severity of 3;
α2 = 0.2 for the second failure mode, with a severity of 3.

Furthermore, assume that the conditional probability βm of the failure effect, evaluated in compliance with the preceding Table 8.8, is equal to 0.5 and that the time of observation, i.e. the mission, is equal to one hour.


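Therefore:

$$C_1 = \lambda_1 \cdot t_c \cdot 10^6 = \lambda_c \cdot \alpha_1 \cdot \beta \cdot t_c \cdot 10^6 = 7 \cdot 10^{-6} \cdot 0.3 \cdot 0.5 \cdot 1 \cdot 10^6 = 1.05$$

$$C_2 = \lambda_2 \cdot t_c \cdot 10^6 = \lambda_c \cdot \alpha_2 \cdot \beta \cdot t_c \cdot 10^6 = 7 \cdot 10^{-6} \cdot 0.2 \cdot 0.5 \cdot 1 \cdot 10^6 = 0.7$$

and finally:

$$C_c = \sum_{m=1}^{n} C_m = 1.05 + 0.7 = 1.75$$

under severity classification Category III.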
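The arithmetic of this worked example can be reproduced with a few lines of Python. The sketch uses only the values given in the text (λc = 7·10^-6 h^-1, α = 0.3 and 0.2, β = 0.5, a mission of one hour and the 10^6 scaling factor); the variable names are illustrative.

```python
import math

lambda_c = 7e-6       # item failure rate, failures per hour
beta = 0.5            # conditional probability of the failure effect
t_c = 1.0             # mission time, hours
alphas = [0.3, 0.2]   # failure mode ratios of the two Category III modes
scale = 1e6           # results expressed per million hours

# Failure mode criticality numbers, eq. (8.4) with the 10^6 scaling
c_modes = [lambda_c * alpha * beta * t_c * scale for alpha in alphas]
# Item criticality number, eq. (8.5)
c_item = sum(c_modes)

print([round(c, 2) for c in c_modes], round(c_item, 2))
```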

8.3.4.2 Failure Rate Evaluation

Some examples of failure rate evaluation are reported below, assuming 8760 hours in one year. Table 8.11 reports the evaluation made by a manufacturer of electronic boards. The relative frequencies can differ from the values reported in table 8.3.d: the values used here are tuned by the manufacturer's experience with its own production.

MTTFs are evaluated starting from the specific application. For example, a relay switches a well-determined (average) number of times per day.
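The columns of Table 8.11 are linked by two simple relations: under the constant failure rate hypothesis the device failure rate is the reciprocal of the MTTF, and the failure rate of each mode is the device failure rate multiplied by the relative frequency of that mode. A minimal Python sketch follows, using the resistor row of the table; the function name is an assumption made for illustration.

```python
import math


def mode_rates(mttf_hours: float, relative_freq: dict[str, float]) -> dict[str, float]:
    """Split a device failure rate (1 / MTTF) among its failure modes."""
    if not math.isclose(sum(relative_freq.values()), 1.0):
        raise ValueError("relative frequencies of the failure modes must sum to 1")
    device_rate = 1.0 / mttf_hours  # failures per hour, constant failure rate assumed
    return {mode: freq * device_rate for mode, freq in relative_freq.items()}


# Resistor row of Table 8.11: MTTF = 1e9 hours
resistor = mode_rates(1e9, {"Open circuit": 0.8, "Short circuit": 0.1, "Drift": 0.1})
for mode, rate in resistor.items():
    print(f"{mode}: {rate:.1e} per hour")
```

The same function applies to the other devices of the table once their MTTF and relative frequencies are supplied.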

Table 8.11 Example of failure rate evaluation.

| Device | Failure mode | MTTF (hours) | λ | Relative frequency | λm |
|---|---|---|---|---|---|
| Resistor | Open circuit | 1×10^9 | 1×10^-9 | 0.8 | 8.0×10^-10 |
| | Short circuit | | | 0.1 | 1.0×10^-10 |
| | Drift | | | 0.1 | 1.0×10^-10 |
| Push button | Not able to open | 5×10^6 | 2×10^-7 | 0.2 | 4.0×10^-8 |
| | Not able to close | | | 0.8 | 1.6×10^-7 |
| Optocoupler | Open circuit (input pin) | 6.7×10^7 | 1.5×10^-8 | 0.3 | 4.5×10^-9 |
| | Open circuit (output pin) | | | 0 | 0 |
| | Input short circuit | | | 0.3 | 4.5×10^-9 |
| | Output short circuit | | | 0.3 | 4.5×10^-9 |
| | Input-output short circuit | | | 0.1 | 1.5×10^-9 |
| Capacitor (ceramics) | Open circuit | 2×10^8 | 5.0×10^-9 | 0.4 | 2.0×10^-9 |
| | Short circuit | | | 0.4 | 2.0×10^-9 |
| | Drift (nominal value) | | | 0.1 | 5.0×10^-10 |
| | Drift (tg δ) | | | 0.1 | 5.0×10^-10 |
| IC (Hex Inverter) | Open circuit | 4×10^8 | 2.4×10^-9 | 0.2 | 4.8×10^-10 |
| | Short circuit | | | 0.2 | 4.8×10^-10 |
| | Variation of characteristics | | | 0.2 | 4.8×10^-10 |
| | Component failed | | | 0.4 | 9.6×10^-10 |
| Diode | Open circuit | 1×10^9 | 1×10^-9 | 0.375 | 3.8×10^-10 |
| | Short circuit | | | 0.375 | 3.8×10^-10 |
| | Drift | | | 0.25 | 2.5×10^-10 |
| BJT | Open circuit (Base) | 1.7×10^7 | 5.9×10^-8 | 0.035 | 2.1×10^-9 |
| | Open circuit (Emitter) | | | 0.035 | 2.1×10^-9 |
| | Open circuit (Collector) | | | 0.035 | 2.1×10^-9 |
| | Short circuit (B-E) | | | 0.2 | 1.2×10^-8 |
| | Short circuit (B-C) | | | 0.2 | 1.2×10^-8 |
| | Short circuit (C-E) | | | 0.2 | 1.2×10^-8 |
| | Short circuit (all pins) | | | 0.2 | 1.2×10^-8 |
| | Drift | | | 0.095 | 5.6×10^-9 |
| Inductance | Open circuit | 3.3×10^8 | 3×10^-9 | 0.7 | 2.1×10^-9 |
| | Short circuit | | | 0.1 | 3.0×10^-10 |
| | Drift | | | 0.1 | 3.0×10^-10 |
| | Functional | | | 0.1 | 3.0×10^-10 |
| Transformer | Open circuit | 1×10^8 | 1×10^-8 | 0.7 | 7.0×10^-9 |
| | Short circuit | | | 0.2 | 2.0×10^-9 |
| | Drift (turn ratio) | | | 0.1 | 1.0×10^-9 |
| Relay | Open circuit | 2×10^7 | 5×10^-8 | 0.8 | 4.0×10^-8 |
| | Short circuit | | | 0.2 | 1.0×10^-8 |

8.3.5 Worksheet Examples

The block diagram of an FMECA is shown in Figure 8.1. The procedure differs from a simple FMEA only in the criticality analysis part (the blocks drawn with broken lines, which must now be taken into consideration). Some examples of worksheets for an FMECA follow (Tables 8.12 – 8.15).


Table 8.12 Example of a table for collecting data in a FMECA analysis utilizing the parameter RPN.


Table 8.13 Example of a table for collecting data in a FMECA analysis with the coefficient of criticality method.

[Worksheet template. Column headings: Id | Device | Function | Failure Mode | Failure Cause | Failure Mechanism | Local Effect | Final/External Effect | Compensating Provision Against Failure | Severity Class [1-10] | λ | α | β | t | Failure Mode Criticality Number $C_m = \lambda_c \cdot \alpha_m \cdot \beta_m \cdot t_c$ | Item Criticality Number $C_c = \sum_{m=1}^{n} C_m$]


Table 8.14 Example of a table for collecting data in a FMECA analysis.

[Worksheet template. Column headings: No. | PROCESS | MACHINE | FAILURE MODE | FAILURE CAUSE | LOCAL EFFECTS | FINAL EFFECT(S) | SEVERITY = S | OCCURRENCE = O | DETECTION = D | RPN = SOD | COMMENTS | CONTROLS | JUDGE | PREVENTATIVE MEASURES | DEADLINE]


Table 8.15 Example of a table for collecting data in a FMECA analysis.

[Worksheet template. Column headings: Item and Function | Failure Mode | Failure Effect | Severity [1 – 10] | Class [1 - 5] | Failure Effects | Occurrence [1 - 10] | Design Controls | Detection [1 - 10] | RPN = S × O × D | Action | Completion Date | Actions Taken if different | Final Results: Final Severity | Final Occurrence | Final Detect | Final RPN]


8.4 Fault Tree Analysis (FTA)

A Fault Tree Analysis (FTA) is a systematic method for acquiring information on a system that can be used in a decision-making process. As such, it is a method used in the analysis of the reliability and safety of a device and gives the best results if used from the initial stages of its design. It should be emphasized that knowledge about a system can be obtained in various ways, but it is almost always partial, and decisions must be made in conditions of uncertainty (for example, regarding the information and regarding the model).

The decision-making process must be based on a procedure that, a priori:

• identifies the data and information necessary for making a decision,
• establishes a systematic program for acquiring the necessary information,
• specifies the analysis to conduct on the data and information so acquired.

It can therefore be affirmed that the analysis of a system is a process whose scope is to acquire pertinent information and data necessary for decision making, in an orderly fashion and in a pre-established time. In this sense, a FTA is a deductive method of analyzing systems; in fact it starts from the final effect (also defined as top event) studying the causes of a particular and well defined failure.

This analysis allows the evaluation of the probability that a critical event occurs; in this way it is possible to gain the information needed to reduce the risk related to the system and to personnel safety. The standard regarding this analysis is IEC 61025 [4]. Many other publications on FTA can be found. An interesting example is the Fault Tree Handbook, NUREG-0492 [5].

The fault tree takes the form of a diagram that represents the relationship between the event under study – failure or nonconformity – and what may have caused it.

It should be noted that the choice of fault analysis rather than event analysis is debatable. Although the first tendency might be to select the optimistic view of the system, success rather than the pessimistic view, failure, we shall see that this is not necessarily the best method. There are several advantages that accrue to the failure standpoint and, in particular, it is generally easier to attain concurrence on what constitutes failure than it is to agree on what constitutes success.

The information derived from such an analysis makes it possible to identify:

• factors that influence the reliability and performance of the system, such as failure modes of components, operator errors, environmental conditions etc.;
• requirements that are mutually incompatible;
• the presence of design specifications that lead to a decrease in performance;
• the presence of common events that influence different components and which can consequently annul the benefits connected to redundancy.


Fault Tree Analysis is a widely used technique in many different fields: nuclear power plants, airplanes, communication systems, space missions, automotive and chemical industries etc.

The analysis is conducted according to the following phases:

Phase 1: Fault Tree logical construction

The final critical event (top event) is defined and then one proceeds with research as to its causes, arriving at primary events following a typical top - down technique.

Phase 2: Probability evaluation of fault tree

The probability of the final event is estimated by associating a probability to each primary event and combining the various probabilities, by means of probability calculus, according to the relationships shown by the tree.

In this way, for every final event, one determines the chains of primary events capable of causing it, identifying which among these has the highest probability of occurring. Based on what is deduced from the fault tree, one identifies the design modifications necessary to improve the reliability of the product. This analysis permits a comparison between design proposals, at least from a reliability point of view.
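The probability evaluation of Phase 2 can be sketched in Python for the two basic logic gates, under the hypothesis of statistically independent primary events. The tree structure, the event names A, B and C and their probabilities are hypothetical and serve only as an illustration.

```python
from math import prod


def and_gate(probabilities: list[float]) -> float:
    """Output event occurs only if all independent input events occur."""
    return prod(probabilities)


def or_gate(probabilities: list[float]) -> float:
    """Output event occurs if at least one independent input event occurs."""
    return 1.0 - prod(1.0 - p for p in probabilities)


# Hypothetical tree: TOP = (A AND B) OR C
p = {"A": 0.01, "B": 0.02, "C": 0.001}
p_intermediate = and_gate([p["A"], p["B"]])
p_top = or_gate([p_intermediate, p["C"]])
print(f"P(top event) = {p_top:.6g}")
```

In this hypothetical tree the single event C contributes far more to the top event than the combination of A and B, which is the kind of indication used to decide where a design modification is most effective.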

The implementation of a fault tree is not an easy undertaking. It requires an in-depth analysis of the system, also considering that the methodology presents the following limitations:

1. Fault Tree Analysis is based on the hypothesis of statistically independent, random failures. It does not deal with statistically dependent events, as it has no mechanism to represent conditional relationships. The failure rate of components is considered constant; this assumption is valid in established technological contexts.

2. In general, a Fault Tree Analysis is not well suited to representing failures caused by a sequence of events, that is, failures induced by the particular order in which some events occur.

3. A Fault Tree Analysis does not constitute a model of all possible failure modes: it is centered on the top event, which corresponds to one particular failure mode, and includes only those events that contribute to the occurrence of that top event.

4. A Fault Tree Analysis is simply a qualitative model that can sometimes be evaluated through quantitative means.

For a more in-depth investigation of the limitations of an FTA and of other methods of analysis, refer to Table 2 of IEC 60300-3-1.

To carry out an FTA it is necessary to establish the structure of the system, the events to be considered and the approach to follow. The system under examination should be described by means of:


• an account that sets out the objectives of the project;
• the definition of the limits of the system. These limits can be electrical, mechanical or the interfaces;
• the definition of the physical structure of the system;
• the identification of the predicted operations and performance;
• the definition of environmental conditions.

Subsequently, the events to be taken into consideration are defined. All events must be considered including those deriving from environmental causes, human errors, and from software. After having been considered, an event which is not applicable can be discarded; in such a case, the reasons for making this decision must be well documented.

A failure tree originates from the so-called final event or top event, which represents, for example, a dangerous situation or the failure to achieve a given performance. Through appropriate logical connections, the causes that lead to such an event are identified and represented graphically. The top event must be selected with care: if the event is too general, the analysis tends to become unmanageable, while if it is too specific, the failure tree will not be sufficiently broad.

Graphically, the tree consists of a set of logical blocks whose functions can vary (Table 8.16 shows the meaning of some symbols; the complete list can be found in IEC 61025, Tables A.1 through A.4 of Annex A). The top event is always the output of a gate whose input events are the possible causes and conditions that can lead to the top event occurring.

Inputs can also be viewed as the output events of gates at a lower level. The failure tree is complete when its events do not need to be developed further, are developed in another fault tree, or are by nature not further developable (the latter are also known as primary events).

Two concepts are rather useful in constructing a fault tree:

Direct Causes: These are the direct causes, necessary and sufficient to generate the final event. Subsequently, these causes are considered as further final events for developing the remaining part of the fault tree.

Elementary Units (Basic Units): These are the logical units of the fault tree whose further development would not furnish useful information. An elementary unit can consist of a single component.


8.4.1 Graphical Construction of a Fault Tree

A fault tree is a graphic representation in which every symbol has a well-determined meaning. Table 8.16 shows the symbols and their detailed meanings.

Primary events: These are events that are not, or must not be, further developed. A probability can be associated with such events.

Basic events: Events that are not further developed, conditioned events and external events fall into this category. See also the Basic Unit concept.

Intermediate events: These are events/failures caused by one or more previous events by means of a gate.

Gate: A symbol used to establish a symbolic link (with an easy-to-understand graphic representation) between the output event and the corresponding inputs, such as OR, AND, INHIBIT, EXCLUSIVE-OR (see footnote 2).

Transfer events: These are events originating from other trees or transferred to other trees. They are mainly used to link trees together and to give more order to the trees from a graphic point of view, for example by developing different parts of the fault tree on different sheets. Obviously, a transfer event can appear in more than two places of a fault tree. A typical example is temperature causing the occurrence of two different events in the analyzed system.

2 Sometimes one resorts to another gate called PRIORITY-AND, which is a normal AND gate in which the events must occur in a well-determined sequence. This situation can be specified using conditioned events.


Table 8.16 Symbols utilized in the arrangement of a failure tree (according to IEC 61025). The preferable and alternative graphic symbols are not reproduced.

| Function | Description & Reliability correlation |
|---|---|
| AND (symbol label "&") | Event that takes place only if all incoming events occur. Parallel redundancy, one out of n equal or different branches. |
| OR | Event that takes place when at least one incoming event occurs. Failure occurs if any of the parts of that system fails – series system. |
| OR exclusive | Event takes place only if one of the incoming events occurs singularly. A failure of the system occurring only if one, not both, of the two possible failures happens. |
| NOT | The output event occurs only if the input event does not occur. Exclusive events or preventive measure does not take place. |
| INHIBIT | The output occurs only if both of the input events take place, one of them conditional. Conditional probability of occurrence of the final event. |
| m/n Majority vote gate (symbol label "m") | The output event occurs if at least m (or more) out of n incoming events occur. Redundancy k out of n, where m = n - k + 1. |
| Gate – general symbol | General symbol of a logical gate; function is defined in the symbol. |
| Block for description of events | The name or description of an event, code of the event, probability is indicated inside the symbol. |
| Basic event | Event that cannot be further subdivided. The lowest level event for which probability of occurrence or reliability information is available. Component failure or a failure mode cause. |
| A no further developed event | Event where further subdivisions are not done (also because further information may not be available). A contributor to the probability of failure. Structure of that system part is not yet defined. |
| Event analyzed elsewhere | Event that can be further developed in another failure tree. |
| External event | Event that may not be a failure but is certain to happen: e.g. a phase change in a dynamic system. |
| Transfer out / Transfer in | Gate indicating that this part of the system is developed in another part or page of the diagram. A partial fault tree diagram that is shown in another location of the overall system. "In" means that the developed gate is elsewhere. "Out" means that the same gate developed in this place will be used elsewhere. |
| Conditional event | Every condition or restriction to be applied to a gate. |
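The logical meaning of the gates listed in Table 8.16 can be sketched as Boolean functions. This is only an illustrative Python sketch: the function names are hypothetical, and the exclusive OR is interpreted here as "exactly one input occurs", which is an assumption when more than two inputs are present.

```python
def gate_and(inputs):
    """Output occurs only if all incoming events occur."""
    return all(inputs)


def gate_or(inputs):
    """Output occurs if at least one incoming event occurs."""
    return any(inputs)


def gate_xor(inputs):
    """Output occurs only if exactly one incoming event occurs."""
    return sum(bool(x) for x in inputs) == 1


def gate_not(event):
    """Output occurs only if the input event does not occur."""
    return not event


def gate_inhibit(event, condition):
    """Output occurs only if the input event occurs while the condition holds."""
    return bool(event) and bool(condition)


def gate_vote(m, inputs):
    """Majority vote gate: at least m out of the n incoming events occur."""
    return sum(bool(x) for x in inputs) >= m


# A 2-out-of-3 redundancy (k = 2, n = 3) fails when m = n - k + 1 = 2 units fail.
failed_units = [True, False, True]
print(gate_vote(2, failed_units))
```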


A failure tree must always be constructed by defining the situations that lead, or can lead, to the undesirable event. One of the most common errors is to reason only in terms of failures of subsystems. It must be remembered that an undesirable event can derive from the malfunction of several systems or components. Furthermore, as is easily seen when a design has not been well thought out, an undesirable event can occur even when no component has failed.

8.4.2 Qualitative Analysis of a Fault Tree

The evaluation of a Fault Tree allows the identification of the events that can directly cause a failure in the system and of their probability, the evaluation of the fault tolerance of the system, and the identification of any critical components and failure mechanisms. It also helps to define the maintenance strategies to be adopted for the system under consideration. In order to obtain this information, an in-depth logical analysis of the tree must be carried out.

Figure 8.4 represents a simple tree in which the final event is given by the following logical relationship:

$$A = B \cdot C = B \cdot (D + E) \qquad (8.7)$$

or rather:

$$A = B \cdot D + B \cdot E \qquad (8.8)$$

From this relationship, we can deduce that the final event occurs when events B and D take place simultaneously or when events B and E take place simultaneously (see footnote 3). This does not mean that the two events must happen at the same instant but rather that, at a certain moment, both events hold at the same time. For example, event B can occur a long time before event D. This situation in itself does not lead to the final event. When event D occurs, if event B is still active, the final event will occur.
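The equivalence of relationships (8.7) and (8.8), and the resulting minimal cut sets, can be checked by enumerating all combinations of the events B, D and E. The following Python sketch is a brute-force illustration; the function names are hypothetical, and the enumeration is practical only for small trees without NOT gates.

```python
from itertools import product

NAMES = ("B", "D", "E")


def top_8_7(b, d, e):
    """Final event A as written in (8.7): A = B * (D + E)."""
    return b and (d or e)


def top_8_8(b, d, e):
    """Final event A as written in (8.8): A = B * D + B * E."""
    return (b and d) or (b and e)


states = list(product([False, True], repeat=len(NAMES)))

# The two expressions agree for every combination of input events.
equivalent = all(top_8_7(*s) == top_8_8(*s) for s in states)

# A cut set is a set of events whose joint occurrence causes the top event;
# it is minimal if no proper subset of it is itself a cut set.
cut_sets = [{n for n, v in zip(NAMES, s) if v} for s in states if top_8_7(*s)]
minimal_cut_sets = [c for c in cut_sets if not any(o < c for o in cut_sets)]

print(equivalent)
print(sorted(sorted(c) for c in minimal_cut_sets))
```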

From the above example, simple as it is, we can deduce an important characteristic of an FTA. An FTA permits the identification of the causes and conditions (inputs to the tree) required for the final event (output of the tree) to take place. However, as noted in the example, knowing this connection does not allow any deduction about the temporal relationships among the events.

Common causes can also be represented in a Fault Tree. In the following example, derived from the Standard cited above, the common cause is event B, which is in fact a simultaneous input to two gates (Figure 8.5).

3 B·D and B·E are called minimal cut sets. A cut set is a group of events that, if all occur, would cause the occurrence of the top event.


Fig. 8.4 A simple fault tree. The tree is represented in two equivalent graphic modes. Depending on the context, one or the other is used.

Fig. 8.5 Fault tree with a common cause event (B) that is further analyzed in a second failure tree.

8.4.3 Quantitative Analysis of a Fault Tree

To each event reported in the Fault Tree, it is possible to associate the corresponding probability of occurrence. The probability with which the cause of a failure mode is manifested is usually determined by engineering analysis and can be utilized in evaluating the overall unavailability of a system. The estimate of how much a cause of failure impacts on final unavailability steers the analysis to one branch of the tree rather than another.

In the study of the system shown in Figure 8.6, the top event is that the “light does not switch on” (when it should).

The Fault Tree derived for the study of the system is shown in Figure 8.7. The evaluation of a Fault Tree is quite complicated, and what is reported here is valid except in some particular cases. For example, events with multiple occurrences, or failure modes that appear in more than one place in the fault tree, are not examined here.

If the basic events have the probability assigned in Table 8.17, the probability that the so called top event will occur is therefore equal to:

$$P = P_A + P_B + P_C + P_D + P_E = 1.0 \times 10^{-6} + 1.0 \times 10^{-7} + 1.0 \times 10^{-7} + 1.0 \times 10^{-6} + 1.0 \times 10^{-9} = 2.201 \times 10^{-6}$$

A, B, C, D and E are the cut sets, that is, the event combinations that can cause the top undesired event to occur. Clearly, a correct quantitative analysis of a fault tree requires knowledge of probability calculus and Boolean algebra, the study of which is not treated here.
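The sum above can be reproduced numerically and compared with the exact probability of an OR combination. The Python sketch below assumes that the five basic events are statistically independent, in which case the exact value is 1 minus the product of the complements; the plain sum is then a slightly conservative approximation that is very accurate when all probabilities are small.

```python
from math import prod

# Basic event probabilities from Table 8.17
p = {"A": 1.0e-6, "B": 1.0e-7, "C": 1.0e-7, "D": 1.0e-6, "E": 1.0e-9}

# Sum of the single-event cut set probabilities, as in the text
p_top_sum = sum(p.values())

# Exact OR-gate probability, assuming independent basic events
p_top_exact = 1.0 - prod(1.0 - v for v in p.values())

print(f"sum of probabilities : {p_top_sum:.6e}")
print(f"exact (independent)  : {p_top_exact:.6e}")
print(f"difference           : {p_top_sum - p_top_exact:.3e}")
```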

Fig. 8.6 System under study.

Table 8.17 Probability of basic events [12, 13].

| Event | Description | Probability, P [×10^-6] |
|---|---|---|
| A | Light bulb fails | 1.0 |
| B | Switch A fails | 0.1 |
| C | Switch B fails | 0.1 |
| D | Battery (supply) fails | 1.0 |
| E | Wire fails open | 0.001 |


Fig. 8.7 Failure tree of system under examination [12, 13].

8.4.4 Advantages and Disadvantages of Fault Tree Analysis (FTA)

Here we wish to summarize the principal advantages and disadvantages of the Fault Tree method of analysis. The two lists which follow are obviously not to be considered exhaustive.

Advantages:

• Permits systematic identification of logical courses of damage starting from a specific effect to the main causes;

• Permits the study of parallel, redundant or alternative courses of damage, of most types of combinatory events, of some types of dependence, and of systems with several intersecting subsystems;

• Permits the identification of the causes of failure modes that have an important influence on the final event;

• Facilitates research into the possible causes of a final effect that may not have been foreseen.

Disadvantages:

• Analysis can lead to the generation of very large failure trees if the analysis has a very wide scope;
• The same event may sometimes appear in different parts of the tree;
• The tree does not represent the transition routes between the states of any event;
• For every final event, a dedicated Fault Tree must be developed;
• The principal causes that lead to a final event are correlated only to that specific consequence; they could also lead to other consequences, which may not be evident in the tree itself.


8.5 An Overview Example

We are interested in studying the following simple circuit, which turns an LED on or off. The required function is: when the control voltage Vi is high the LED must be turned ON, whereas when Vi is low the LED must be turned OFF.

Fig. 8.8 Circuit for example.

As a first step, it is recommended to draw the RBD of this simple system. Since the function is obtained only if none of the devices fails, the RBD consists of four blocks in series.

Fig. 8.9 A first RBD of the circuit of Fig. 8.8.

However, in drawing the RBD of Fig. 8.9 a very important aspect is not taken into account. The circuit is mounted on a printed circuit board (PCB) and the devices are connected by a soldering process. It is therefore necessary to modify the previous RBD in order to take into account the possible failure of both the PCB and the solder joints. The resulting RBD is reported in Fig. 8.10.

Fig. 8.10 RBD of the circuit of Fig. 8.8. Block PCB includes PCB and solder joints.

The resistor used in this circuit is a film resistor. Failure modes for this and other resistor types are reported in Table 8.18. Tables 8.19 and 8.20 report the failure modes for the other devices used in this example. For the sake of simplicity, the PCB is assigned a single failure mode, which also accounts for a non-conforming solder joint.

[Fig. 8.8 component labels: Q1 (NPN transistor), D1 (LED), R1, R2, VCC = 5 V, Vi]


Table 8.18 Failure mode distribution for resistors. Data are from MIL-HDBK-338B. The values are valid for R1 and R2.

| Resistor | Failure mode | Mode Probability (%) |
|---|---|---|
| Composition | Parameter change (Drift) | 66 |
| | Open | 31 |
| | Short | 3 |
| Film | Parameter change (Drift) | 36 |
| | Open | 59 |
| | Short | 5 |
| Wirewound | Parameter change (Drift) | 26 |
| | Open | 65 |
| | Short | 9 |
| Network | Parameter change (Drift) | - |
| | Open | 92 |
| | Short | 8 |
| Variable | Erratic output | 40 |
| | Open | 53 |
| | Short | 7 |

Table 8.19 Failure mode distribution for bipolar transistor. Data are from MIL-HDBK-338B. Please note that different sources sometimes report quite different values. For example, in Table 8.1.d the corresponding values are 85% for short and 15% for open failure modes.

| Device | Failure mode | Mode Probability (%) |
|---|---|---|
| Bipolar transistor | Short | 73 |
| | Open | 27 |

Table 8.20 Failure mode distribution for LED. Data are from MIL-HDBK-338B.

| Device | Failure mode | Mode Probability (%) |
|---|---|---|
| Optoelectronic LED | Open | 70 |
| | Short | 30 |

The failure rates are evaluated taking into consideration the data from MIL-HDBK 217. We suppose for this application an environment classified as GB (Ground benign). The failure rates of the devices used are:

• Resistor R1: $\lambda_{R1} = 0.4 \cdot 10^{-9}\ \mathrm{h}^{-1}$
• Resistor R2: $\lambda_{R2} = 0.4 \cdot 10^{-9}\ \mathrm{h}^{-1}$
• Light Emitting Diode (LED): $\lambda_{LED} = 1.5 \cdot 10^{-9}\ \mathrm{h}^{-1}$
• Transistor Q1: $\lambda_{Q1} = 3.2 \cdot 10^{-9}\ \mathrm{h}^{-1}$
• Printed Circuit Board: $\lambda_{PCB} = 2.2 \cdot 10^{-9}\ \mathrm{h}^{-1}$
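Since the RBD of Fig. 8.10 is a series of five blocks, the failure rates listed above can be combined into a system-level figure. The Python sketch below assumes constant failure rates (exponential reliability model) and independent blocks, so that the series failure rate is the sum of the block failure rates; the mission time of ten years is a hypothetical value chosen only for illustration.

```python
import math

# Failure rates in h^-1, as listed above (MIL-HDBK 217, Ground benign)
failure_rates = {
    "R1": 0.4e-9,
    "R2": 0.4e-9,
    "LED": 1.5e-9,
    "Q1": 3.2e-9,
    "PCB": 2.2e-9,
}

# Series system with constant failure rates: the rates add up
lambda_system = sum(failure_rates.values())

# Mean time to failure of the series system under the exponential model
mttf_hours = 1.0 / lambda_system


def reliability(t_hours):
    """System reliability R(t) = exp(-lambda_system * t)."""
    return math.exp(-lambda_system * t_hours)


mission_time = 10 * 8760  # hypothetical mission: ten years of continuous operation
print(f"lambda_system = {lambda_system:.2e} h^-1")
print(f"MTTF          = {mttf_hours:.3e} h")
print(f"R(10 years)   = {reliability(mission_time):.6f}")
```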

The next step is the evaluation of the occurrence. Starting from the failure rate of each item, it is necessary to evaluate how this rate is distributed among the failure modes of the item, by weighting it with the corresponding mode probability. This evaluation is detailed in Table 8.21.

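Table 8.21 Failure mode distribution evaluation.

| Item | Failure rate (10^-9 h^-1) | Mode | Probability (%) | Occurrence (10^-9) |
|---|---|---|---|---|
| Film Resistor | 0.4 | Open | 59 | 0.236 |
| | | Drift | 36 | 0.144 |
| | | Short | 5 | 0.02 |
| | | Sum | | 0.4 |
| LED | 1.5 | Open | 70 | 1.05 |
| | | Short | 30 | 0.45 |
| | | Sum | | 1.5 |
| Transistor | 3.2 | Short | 73 | 2.336 |
| | | Open | 27 | 0.864 |
| | | Sum | | 3.2 |
| PCB | 2.2 | - | 100 | 2.2 |
| | | Sum | | 2.2 |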
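The occurrence values of Table 8.21 follow from multiplying each item failure rate by the probability of the failure mode. A minimal Python sketch of this calculation is given below; the data structure and the function name are hypothetical, while the numbers are those of the table (failure rates in units of 10^-9 h^-1).

```python
# item -> (failure rate in 1e-9 h^-1, {failure mode: mode probability in %})
items = {
    "Film Resistor": (0.4, {"Open": 59, "Drift": 36, "Short": 5}),
    "LED": (1.5, {"Open": 70, "Short": 30}),
    "Transistor": (3.2, {"Short": 73, "Open": 27}),
    "PCB": (2.2, {"-": 100}),
}


def mode_occurrence(failure_rate, mode_probabilities):
    """Split an item failure rate among its failure modes."""
    return {mode: failure_rate * pct / 100.0
            for mode, pct in mode_probabilities.items()}


occurrence = {name: mode_occurrence(rate, modes)
              for name, (rate, modes) in items.items()}

for name, modes in occurrence.items():
    for mode, value in modes.items():
        print(f"{name:14s} {mode:6s} {value:.3f}")
    # The mode occurrences of an item add up to its failure rate
    print(f"{name:14s} Sum    {sum(modes.values()):.3f}")
```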

As far as the Severity is concerned, the discussion may be very difficult. In this simple example we have two possible situations:

• The failure mode leads to a complete failure of the system, so that the system function is not available. In this case the parameter S is 2.

• The failure mode leads to a less severe failure of the system, so that the system function is still available, although with reduced performance. In this case the parameter S is 1.

Parameter D should now be evaluated. For this simple example we can assume that a design review will not detect a potential cause and the subsequent failure mode or, as is also possible, that there is no design review activity. D is therefore set to its maximum value, 10.

The FMECA can now be drawn up. An example concerning resistor R1 is reported in Table 8.22. The study of the FMECA results can be useful in finding methods to improve reliability, such as redundancy, a different choice of devices, etc.


Table 8.22 FMECA [3]

| Id | Item | Failure mode | Possible cause | Local effect | Final effect | Fault detection | S | O | D | RPN |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | R1 | Open | Solder joint non-conforming | Base circuit failure | Complete failure | If Vi is high and VCollector is high too | 2 | 2.2 | 10 | 44 |
| | | | Inherent failure | | | | 2 | 0.236 | 10 | 4.72 |
| | | Short | Inherent failure | Transistor can fail for too high base current and/or driver failure for too high delivered current | Partial failure | When VR1 = 0 V and LED is turned on (or VR3 > 0) | 2 | 0.02 | 10 | 0.4 |
| | | Intermittent failure | Solder joint non-conforming | Circuit works intermittently | Partial or complete failure | - | 2 | 2.2 | 10 | 44 |
| | | Drift | Damage, wearout | The circuit works correctly: design takes the drift into account | No consequence | - | 1 | 0.144 | 10 | 1.44 |
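The RPN values of Table 8.22 can be reproduced under the assumption that the risk priority number is the product RPN = S × O × D, which matches every row of the table. The Python sketch below uses hypothetical variable names; the S, O and D values are those reported for resistor R1.

```python
# (failure mode, possible cause, S, O, D) for resistor R1, from Table 8.22
fmeca_rows = [
    ("Open", "Solder joint non-conforming", 2, 2.2, 10),
    ("Open", "Inherent failure", 2, 0.236, 10),
    ("Short", "Inherent failure", 2, 0.02, 10),
    ("Intermittent failure", "Solder joint non-conforming", 2, 2.2, 10),
    ("Drift", "Damage, wearout", 1, 0.144, 10),
]


def rpn(severity, occurrence, detection):
    """Risk priority number, assumed here to be the product S * O * D."""
    return severity * occurrence * detection


results = [(mode, cause, rpn(s, o, d)) for mode, cause, s, o, d in fmeca_rows]

# Rank the failure modes from the highest to the lowest RPN
for mode, cause, value in sorted(results, key=lambda r: r[2], reverse=True):
    print(f"{mode:22s} {cause:30s} RPN = {value:.2f}")
```

Sorting by RPN highlights the solder joint as the dominant contributor for this item, which is the kind of indication used to decide where a design change is most effective.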


References

The reference papers and standards for the subjects discussed in this chapter are listed below. We recommend that the reader always verify the state of revision of the standards and always consult the regulations in force.

[1] IEC 60812:2006 – Analysis techniques for system reliability – Procedure for Failure mode and effects analysis (FMEA)
[2] IEC 60300-3-1:2003-01 – Dependability management. Part 3-1: Application guide – Analysis techniques for dependability – Guide on methodology
[3] Birolini, A.: Reliability Engineering – Theory and Practice. Springer, Heidelberg, ISBN: 978-3-642-14951-1
[4] IEC 61025:2006 – Fault Tree Analysis (FTA)
[5] NUREG-0492, Fault Tree Handbook, U.S. Nuclear Regulatory Commission (January 1981)
[6] SAE J1739, Potential Failure Mode and Effects Analysis in Design (Design FMEA), Potential Failure Mode and Effects Analysis in Manufacturing and Assembly Processes (Process FMEA), and Potential Failure Mode and Effects Analysis for Machinery (Machinery FMEA), (revision June 2000) (1st edn., July 1994)
[7] AIAG, Potential Failure Mode and Effects Analysis (FMEA), 3rd edn. (July 2001)
[8] A.N.F.I.A., FMEA - Linee Guida per l’applicazione della FMEA, ANFIA QUALITÀ 009, Edizione n. 2 (2006)
[9] Vesely, W.E., Goldberg, F.F., Roberts, N.H., Haasl, D.F.: Fault Tree Handbook, NUREG-0492, U.S. Nuclear Regulatory Commission (January 1981)
[10] Andrews, J.D., Dugan, J.B.: Dependency Modeling Using Fault Tree Analysis. In: Proceedings of the 17th International System Safety Conference (August 1999)
[11] Failure modes, effects and criticality analysis (FMECA) for command, control, communications, computer, intelligence, surveillance, and reconnaissance (C4ISR) facilities, Technical Manual No. 5-698-4, Headquarters Department of the Army, Washington, DC (September 29, 2006)
[12] Ericson II, C.A.: Fault Tree Analysis (1999), http://www.fault-tree.net/papers/ericson-fta-tutorial.pdf
[13] Ericson II, C.A.: Hazard Analysis Techniques for System Safety. John Wiley & Sons, Chichester (2005)
