
IEEE Std 1413.1™-2002


IEEE Guide for Selecting and Using Reliability Predictions Based on IEEE 1413™

Published by The Institute of Electrical and Electronics Engineers, Inc.
3 Park Avenue, New York, NY 10016-5997, USA

19 February 2003

IEEE Standards Coordinating Committee 37 on Reliability Prediction

Print: SH95020
PDF: SS95020


The Institute of Electrical and Electronics Engineers, Inc.
3 Park Avenue, New York, NY 10016-5997, USA

Copyright © 2003 by the Institute of Electrical and Electronics Engineers, Inc. All rights reserved. Published 19 February 2003. Printed in the United States of America.

IEEE is a registered trademark in the U.S. Patent & Trademark Office, owned by the Institute of Electrical and Electronics Engineers, Incorporated.

Print: ISBN 0-7381-3363-9 SH95020
PDF: ISBN 0-7381-3364-7 SS95020

No part of this publication may be reproduced in any form, in an electronic retrieval system or otherwise, without the prior written permission of the publisher.

IEEE Std 1413.1™-2002

IEEE Guide for Selecting and Using Reliability Predictions Based on IEEE 1413™

Sponsor

IEEE Standards Coordinating Committee 37 on Reliability Prediction

Approved 12 September 2002

IEEE-SA Standards Board

Abstract:

A framework for reliability prediction procedures for electronic equipment at all levels is provided in this guide.

Keywords:

baseline, classic reliability, constant failure rate, estimation, failure, goal, item, operating environment, reliability prediction, requirement, system life cycle


IEEE Standards documents are developed within the IEEE Societies and the Standards Coordinating Committees of the IEEE Standards Association (IEEE-SA) Standards Board. The IEEE develops its standards through a consensus development process, approved by the American National Standards Institute, which brings together volunteers representing varied viewpoints and interests to achieve the final product. Volunteers are not necessarily members of the Institute and serve without compensation. While the IEEE administers the process and establishes rules to promote fairness in the consensus development process, the IEEE does not independently evaluate, test, or verify the accuracy of any of the information contained in its standards.

Use of an IEEE Standard is wholly voluntary. The IEEE disclaims liability for any personal injury, property or other damage, of any nature whatsoever, whether special, indirect, consequential, or compensatory, directly or indirectly resulting from the publication, use of, or reliance upon this, or any other IEEE Standard document.

The IEEE does not warrant or represent the accuracy or content of the material contained herein, and expressly disclaims any express or implied warranty, including any implied warranty of merchantability or fitness for a specific purpose, or that the use of the material contained herein is free from patent infringement. IEEE Standards documents are supplied “AS IS.”

The existence of an IEEE Standard does not imply that there are no other ways to produce, test, measure, purchase, market, or provide other goods and services related to the scope of the IEEE Standard. Furthermore, the viewpoint expressed at the time a standard is approved and issued is subject to change brought about through developments in the state of the art and comments received from users of the standard. Every IEEE Standard is subjected to review at least every five years for revision or reaffirmation. When a document is more than five years old and has not been reaffirmed, it is reasonable to conclude that its contents, although still of some value, do not wholly reflect the present state of the art. Users are cautioned to check to determine that they have the latest edition of any IEEE Standard.

In publishing and making this document available, the IEEE is not suggesting or rendering professional or other services for, or on behalf of, any person or entity. Nor is the IEEE undertaking to perform any duty owed by any other person or entity to another. Any person utilizing this, and any other IEEE Standards document, should rely upon the advice of a competent professional in determining the exercise of reasonable care in any given circumstances.

Interpretations: Occasionally questions may arise regarding the meaning of portions of standards as they relate to specific applications. When the need for interpretations is brought to the attention of IEEE, the Institute will initiate action to prepare appropriate responses. Since IEEE Standards represent a consensus of concerned interests, it is important to ensure that any interpretation has also received the concurrence of a balance of interests. For this reason, IEEE and the members of its societies and Standards Coordinating Committees are not able to provide an instant response to interpretation requests except in those cases where the matter has previously received formal consideration.

Comments for revision of IEEE Standards are welcome from any interested party, regardless of membership affiliation with IEEE. Suggestions for changes in documents should be in the form of a proposed change of text, together with appropriate supporting comments. Comments on standards and requests for interpretations should be addressed to:

Secretary, IEEE-SA Standards Board
445 Hoes Lane
P.O. Box 1331
Piscataway, NJ 08855-1331
USA

Authorization to photocopy portions of any individual standard for internal or personal use is granted by the Institute of Electrical and Electronics Engineers, Inc., provided that the appropriate fee is paid to Copyright Clearance Center. To arrange for payment of licensing fee, please contact Copyright Clearance Center, Customer Service, 222 Rosewood Drive, Danvers, MA 01923 USA; +1 978 750 8400. Permission to photocopy portions of any individual standard for educational classroom use can also be obtained through the Copyright Clearance Center.

Note: Attention is called to the possibility that implementation of this standard may require use of subject matter covered by patent rights. By publication of this standard, no position is taken with respect to the existence or validity of any patent rights in connection therewith. The IEEE shall not be responsible for identifying patents for which a license may be required by an IEEE standard or for conducting inquiries into the legal validity or scope of those patents that are brought to its attention.


Introduction

(This introduction is not part of IEEE Std 1413.1™-2002, IEEE Guide for Selecting and Using Reliability Predictions Based on IEEE 1413™.)

IEEE Std 1413-1998, IEEE Standard Methodology for Reliability Predictions and Assessment for Electronic Systems and Equipment, provides a framework for reliability prediction procedures for electronic equipment at all levels. This guide is a supporting document for IEEE Std 1413-1998. This guide describes a wide variety of hardware reliability prediction methodologies.

The scope of this guide is processes and methodologies for conducting reliability predictions for electronic systems and equipment. This guide focuses on hardware reliability prediction methodologies, and specifically excludes software reliability, availability and maintainability, human reliability, and proprietary reliability prediction data and methodologies. These topics may be the subjects for future IEEE 1413 guides.

The purpose of this guide is to assist in the selection and use of reliability prediction methodologies satisfying IEEE Std 1413. The guide also describes the appropriate factors and criteria to consider when selecting reliability prediction methodologies.

Participants

At the time this standard was completed, the Reliability Prediction Standard Development Working Group had the following membership:

Michael Pecht, Chair

Gary Buchanan, Jerry L. Cartwright, Dr. Victor Chien, Dr. Vladimir Crk, Dr. Diganta Das, Dan N. Donahoe, Jon G. Elerath, Lou Gullo, Jeff W. Harms, Harold L. Hart, Tyrone Jackson, Dr. Aridaman Jain, Yvonne Lord, Jack Sherman, Thomas J. Stadterman, Dr. Alan Wood

Other contributors who aided in the development of this standard by providing direction and attending meetings were as follows:

Dr. Glenn Blackwell, Jens Braband, Bill F. Carpenter, Helen Cheung, Lloyd Condra, Dr. Michael J. Cushing, Dr. Krishna Darbha, Dr. Abhijit Dasgupta, Tony DiVenti, Sheri Elliott, Dr. Ralph Evans, Diego Gutierre, Edward B. Hakim, Patrick Hetherington, Zhenya Huang, Nino Ingegneri, Margaret Jackson, Dr. Samuel Keene, Dr. Dingjun Li, Stephen Magee, Dr. Michael Osterman, Arun Ramakrishnan, Jack Remez, Mathew Samuel, Kevin Silke, John W. Sullivan, Ricky Valentin, Nancy Neeld Youens

The following members of the balloting committee voted on this standard. Balloters may have voted for approval, disapproval, or abstention.

Dr. Vladimir Crk, Dr. Michael J. Cushing, Dr. Diganta Das, Richard L. Doyle, Jon G. Elerath, Harold L. Hart, Dennis R. Hoffman, Dr. Aridaman Jain, Jack Sherman, Thomas J. Stadterman, Ricky Valentin, Dr. Alan Wood


When the IEEE-SA Standards Board approved this standard on 12 September 2002, it had the following membership:

James T. Carlo, Chair
James H. Gurney, Vice Chair
Judith Gorman, Secretary

Sid Bennett, H. Stephen Berger, Clyde R. Camp, Richard DeBlasio, Harold E. Epstein, Julian Forster*, Howard M. Frazier, Toshio Fukuda, Arnold M. Greenspan, Raymond Hapeman, Donald M. Heirman, Richard H. Hulett, Lowell G. Johnson, Joseph L. Koepfinger*, Peter H. Lips, Nader Mehravari, Daleep C. Mohla, William J. Moylan, Malcolm V. Thaden, Geoffrey O. Thompson, Howard L. Wolfman, Don Wright

*Member Emeritus

Also included are the following nonvoting IEEE-SA Standards Board liaisons:

Alan Cookson, NIST Representative
Satish K. Aggarwal, NRC Representative

Andrew Ickowicz, IEEE Standards Project Editor


Contents

1. Overview
    1.1 Scope
    1.2 Purpose
    1.3 Glossary
    1.4 Contents
2. References
3. Definitions, abbreviations, and acronyms
    3.1 Definitions
    3.2 Abbreviations and acronyms
4. Background
    4.1 Basic concepts and definitions
    4.2 Reliability prediction uses and timing
    4.3 Considerations for selecting reliability prediction methods
5. Reliability prediction methods
    5.1 Engineering information assessment
    5.2 Predictions based on field data
    5.3 Predictions based on test data
    5.4 Reliability predictions based on stress and damage models
    5.5 Reliability prediction based on handbooks
    5.6 Assessment of reliability prediction methodologies based on IEEE 1413 criteria
6. System reliability models
    6.1 Reliability block diagram
    6.2 Fault-tree analysis (FTA)
    6.3 Reliability of repairable systems
    6.4 Monte Carlo simulation
Annex A (informative) Statistical data analysis
Annex B (informative) Bibliography


IEEE Guide for Selecting and Using Reliability Predictions Based on IEEE 1413

1. Overview

IEEE Std 1413-1998 [B5]1 provides a framework for reliability prediction procedures for electronic equipment at all levels. This guide is a supporting document for IEEE Std 1413-1998. This guide describes a wide variety of hardware reliability prediction methodologies.

1.1 Scope

The scope of this guide is processes and methodologies for conducting reliability predictions for electronic systems and equipment. This guide focuses on hardware reliability prediction methodologies, and specifically excludes software reliability, availability and maintainability, human reliability, and proprietary reliability prediction data and methodologies. These topics may be the subjects for additional future IEEE guides supporting IEEE Std 1413-1998.

1.2 Purpose

The purpose of this guide is to assist in the selection and use of reliability prediction methodologies satisfying IEEE Std 1413-1998. The guide accomplishes this purpose by briefly describing a wide variety of hardware reliability prediction methodologies. The guide also describes the appropriate factors and criteria to consider when selecting reliability prediction methodologies.

1.3 Glossary

Many of the terms used to describe reliability prediction methodologies have multiple meanings. For example, the term reliability has a specific mathematical meaning, but the word is also used to mean an entire field of engineering study. Clause 3 contains definitions of the terms that are used in this document, taken primarily from The Authoritative Dictionary of IEEE Standards Terms, Seventh Edition [B3]. The terms reliability and failure are discussed in more detail in Clause 4.

1.4 Contents

Clause 4 provides background information for reliability prediction methodologies. This background information includes basic reliability concepts and definitions, reliability prediction uses, the reliability prediction relationship with a system life cycle, and factors to consider when selecting reliability prediction methodologies. Clause 5 describes reliability prediction methodology inputs and reliability prediction methodologies for components, assemblies, or subsystems. These methodologies include reliability predictions based on field data, test data, damage simulation, and handbooks. Clause 6 describes methodologies for combining the predictions in Clause 5 to develop system level reliability predictions. These methodologies include reliability block diagrams, fault trees, repairable system techniques, and simulation.

1 The numbers in brackets correspond to those of the bibliography in Annex B.


2. References

This standard shall be used in conjunction with the following publications. When the following specifications are superseded by an approved revision, the revision shall apply.

ABS Group, Inc., Root Cause Analysis Handbook: A Guide to Effective Incident Investigation, Risk & Reliability Division, Rockville, MD, 1999.

Alvarez, M. and Jackson, T., “Quantifying the Effects of Commercial Processes on Availability of Small Manned-Spacecraft,” Proceedings of the 2000 Annual Reliability and Maintainability Symposium (RAMS), pp. 305–310, January 2000.

Ascher, H. and Feingold, H., Repairable Systems Reliability: Modeling, Inference, Misconceptions and Their Causes, Lecture Notes in Statistics, Vol. 7, Marcel Dekker, New York, 1984.

Baxter, L. A. and Tortorella, M., “Dealing With Real Field Reliability Data: Circumventing Incompleteness by Modeling & Iteration,” Proceedings of the Annual RAMS Symposium, pp. 255–262, 1994.

Bhagat, W., “R&M through Avionics/Electronics Integrity Program,” Proceedings of the Annual Reliability and Maintainability Symposium, pp. 216–219, 1989.

Black, J. R., “Physics of Electromigration,” Proceedings of the IEEE International Reliability Physics Symposium, pp. 142–149, 1983.2

Bowles, J. B., “A Survey of Reliability-Prediction Procedures for Microelectronic Devices,” IEEE Transactions on Reliability, Vol. 41, No. 1, pp. 2–12, March 1992.

Braun, E. and MacDonald, S., History and Impact of Semiconductor Electronics, Cambridge University Press, Cambridge, 1977.

British Telecom, Handbook of Reliability Data for Components Used in Telecommunication Systems, Issue 4, January 1987.

Cox, D. R., Renewal Theory, Methuen, London, 1962.

Cunningham, J., Valentin, R., Hillman, C., Dasgupta, A., and Osterman, M., “A Demonstration of Virtual Qualification for the Design of Electronic Hardware,” Proceedings of the ESTECH 2001, IEST, Phoenix, AZ, April 2001.

Cushing, M. J., Krolewski, J. G., Stadterman, T. J., and Hum, B. T., “U.S. Army Reliability Standardization Improvement Policy and Its Impact,” IEEE Transactions on Components, Packaging, and Manufacturing Technology, Part A, Vol. 19, No. 2, pp. 277–278, June 1996.

Cushing, M. J., Mortin, D. E., Stadterman, T. J., and Malhotra, A., “Comparison of Electronics-Reliability Assessment Approaches,” IEEE Transactions on Reliability, Vol. 42, No. 4, pp. 542–546, December 1993.

Dasgupta, A., “Failure Mechanism Models For Cyclic Fatigue,” IEEE Transactions on Reliability, Vol. 42, No. 4, pp. 548–555, December 1993.

Dasgupta, A., Oyan, C., Barker, D., and Pecht, M., “Solder Creep-Fatigue Analysis by an Energy-Partitioning Approach,” ASME Transactions on Electronic Packaging, Vol. 144, pp. 152–160, 1992.3

2 Information on IEEE documents may be obtained by contacting the Institute of Electrical and Electronics Engineers, Inc., at http://www.ieee.org.
3 ASME publications are available from the American Society of Mechanical Engineers, 3 Park Avenue, New York, NY 10016-5990, USA (http://www.asme.org/).


Dasgupta, A. and Pecht, M., “Failure Mechanisms and Damage Models,” IEEE Transactions on Reliability, Vol. 40, No. 5, pp. 531–536, 1991.

Decker, Gilbert F., Assistant Secretary of the Army (Research, Development, and Acquisition), “Memorandum for Commander, U.S. Army Material Command, Program Executive Officers, and Program Managers,” 15 February 1996.

Denson, W., “A Tutorial: PRISM,” RAC Journal, pp. 1–6, 3rd Quarter 1999.

Denson, W., Keene, S., and Caroli, J., “A New System-Reliability Assessment Methodology,” Proceedings of the 1998 Annual Reliability and Maintainability Symposium, pp. 413–420, January 1998.

Denson, W. and Priore, M., “Automotive Electronic Reliability Prediction,” SAE paper 870050.4

Dew, John R., “In Search of the Root Cause,” Quality Progress, pp. 97–102, March 1991.

Elerath, J., Wood, A., Christiansen, D., and Hurst-Hopf, M., “Reliability Management and Engineering in a Commercial Computer Environment,” Proceedings of the Annual Reliability and Maintainability Symposium, pp. 323–329, Washington, D.C., January 18–21, 1999.

Engelmaier, W., “Fatigue Life Of Leadless Chip Carriers Solder Joints During Power Cycling,” IEEE Transactions on Components, Hybrids, and Manufacturing Technology, Vol. CHMT-6, pp. 232–237, September 1983.

Gullo, L., “In-Service Reliability Assessment and Top-Down Approach Provides Alternative Reliability Prediction Method,” Proceedings of the Annual Reliability and Maintainability Symposium, pp. 365–377, Washington, D.C., January 18–21, 1999.

Hahn, Gerald J. and Shapiro, Samuel S., Statistical Models in Engineering, John Wiley and Sons, Inc., New York, New York, 1967.

Hakim, E. B., “Reliability Prediction: Is Arrhenius Erroneous,” Solid State Technology, Vol. 33, No. 8, p. 57, August 1990.

Hallberg, Ö., “Hardware Reliability Assurance and Field Experience in a Telecom Environment,” Quality and Reliability Engineering International, Vol. 10, No. 3, pp. 195–200, 1994.

Hallberg, Ö. and Löfberg, J., “A Time Dependent Field Return Model for Telecommunication Hardware,” Advances in Electronic Packaging 1999: Proceedings of the Pacific Rim/ASME International Intersociety Electronic and Photonic Packaging Conference (InterPACK ’99), pp. 1769–1774, The American Society of Mechanical Engineers, New York, 1999.

Hu, J. M., “Physics-of-failure Based Reliability Qualification of Automotive Electronics,” SAE Communications in RMS Journal, pp. 21–33, 1994.

Hughes, J. A., “Practical Assessment of Current Plastic Encapsulated Microelectronic Devices,” Quality and Reliability Engineering International, Vol. 5, No. 2, pp. 125–129, 1989.

Jackson, T., “Integration of Sneak Circuit Analysis with FMEA,” Proceedings of the 1986 Annual Reliability and Maintainability Symposium, pp. 408–414, 1986.

Jensen, Finn, Electronic Component Reliability, John Wiley and Sons, Inc., New York, New York, 1995.

4 SAE publications are available from the Society of Automotive Engineers, 400 Commonwealth Drive, Warrendale, PA 15096, USA (http://www.sae.org/).


Johnson, B. G. and Gullo, L., “Improvements in Reliability Assessment and Prediction Methodology,” Proceedings of the 2000 Annual Reliability and Maintainability Symposium (RAMS), pp. 181–187, January 2000.

Jones, J. and Hayes, J., “A Comparison of Electronic-Reliability Prediction Models,” IEEE Transactions on Reliability, Vol. 48, No. 2, pp. 127–134, June 1999.

Jordan, J., Pecht, M., and Fink, J., “How Burn-In Can Reduce Quality and Reliability,” International Journal of Microcircuits, Vol. 20, No. 1, pp. 36–40, First Quarter 1997.

Kececioglu, B. D., Reliability Engineering Handbook, Vols. 1 and 2, Prentice Hall, Englewood Cliffs, NJ 07632, 1991.

Kervarrec, G., Monfort, M. L., Riaudel, A., Klimonda, P. Y., Coudrin, J. R., Razavet, D. Le, Boulaire, J. Y., Jeanpierre, P., Perie, D., Meister, R., Casassa, S., Haumont, J. L., and Liagre, A., “Universal Reliability Prediction Model for SMD Integrated Circuits Based on Field Failures,” Microelectronics Reliability, Vol. 39, No. 6-7, pp. 765–771, June-July 1999.

Klion, J., Practical Electronic Reliability Engineering, Van Nostrand Reinhold, New York, New York, 1992.

Knowles, I., “Is It Time For a New Approach?” IEEE Transactions on Reliability, Vol. 42, No. 1, p. 3, March 1993.

Lall, P., Pecht, M., and Hakim, E. B., Influence of Temperature on Microelectronics and System Reliability: A Physics of Failure Approach, CRC Press, New York, New York, 1997.

Latino, R. L. and Latino, K. C., Root Cause Analysis: Improving Performance for Bottom Line Results, CRC Press, Boca Raton, Florida, 1999.

Leonard, C. T., “Failure Prediction Methodology Calculations Can Mislead: Use Them Wisely, Not Blindly,” Proceedings of the National Aerospace and Electronics Conference NAECON, Vol. 4, pp. 1248–1253, May 1989.

Leonard, C. T., “How Failure Prediction Methodology Affects Electronic Equipment Design,” Quality and Reliability Engineering International, Vol. 6, No. 4, pp. 243–249, 1993.

Leonard, C. T., “Mechanical Engineering Issues and Electronic Equipment Reliability: Incurred Costs Without Compensating Benefits,” IEEE Transactions on Components, Hybrids, and Manufacturing Technology, Vol. 13, pp. 895–902, 1990.

Leonard, C. T., “On US MIL-HDBK-217 and Reliability Prediction,” IEEE Transactions on Reliability, Vol. 37, pp. 450–451, 1988.

Leonard, C. T., “Passive Cooling for Avionics Can Improve Airplane Efficiency and Reliability,” Proceedings of the IEEE 1989 National Aerospace and Electronics Conference NAECON, Vol. 2102, pp. 1887–1892, 1989.

Lewis, E. E., Introduction to Reliability Engineering, John Wiley and Sons, Inc., New York, New York, 1996.

Luthra, P., “MIL-HDBK 217: What is Wrong with it?” IEEE Transactions on Reliability, Vol. 39, p. 518, 1990.

Lycoudes, N. and Childers, C. G., “Semiconductor Instability Failure Mechanism Review,” IEEE Transactions on Reliability, Vol. 29, pp. 237–247, 1980.


Meyer, Paul L., Introductory Probability and Statistical Applications, Addison-Wesley, Menlo Park, pp. 328–335, 1970.

MIL-HDBK-217F, Reliability Prediction of Electronic Equipment, Version F, U.S. Department of Defense, U.S. Government Printing Office, February 28, 1995.5

Miner, M. A., “Cumulative Damage in Fatigue,” Journal of Applied Mechanics, A-159, 1945.

Mobley, R. K., Root Cause Failure Analysis (Plant Engineering Maintenance Series), Butterworth-Heinemann, Woburn, Massachusetts, 1999.

Montgomery, D. C. and Runger, G. C., Applied Statistics and Probability for Engineers, John Wiley and Sons, Inc., New York, New York, 1994.

Morris, S. F., “Use and Application of MIL-HDBK-217,” Solid State Technology, pp. 65–69, August 1990.

Nash, F. R., Estimating Device Reliability: Assessment of Credibility, Kluwer Academic Publishers, Boston, MA, 1993.

Nelson, Wayne, Accelerated Testing, John Wiley and Sons, Inc., New York, New York, pp. 71–107, 1990.

O’Connor, P. D. T., “Commentary: Reliability—Past, Present, and Future,” IEEE Transactions on Reliability, Vol. 49, No. 4, pp. 335–341, December 2000.

O’Connor, P. D. T., “Reliability: Measurement or Management?” Quality Assurance, Vol. 12, No. 2, pp. 46–50, 1986.

O’Connor, P. D. T., “Reliability Prediction: A State-Of-The-Art Review,” IEEE Proceedings A, Vol. 133, No. 4, pp. 202–216, 1986.

O’Connor, P. D. T., “Reliability Prediction for Microelectronic Systems,” Reliability Engineering, Vol. 10, No. 3, pp. 129–140, 1985.

O’Connor, P. D. T., “Reliability Prediction: Help or Hoax,” Solid State Technology, Vol. 33, pp. 59–61, 1991.

O’Connor, P. D. T., “Statistics in Quality and Reliability. Lessons from the Past, and Future Opportunities,” Reliability Engineering & System Safety, Vol. 34, No. 1, pp. 23–33, 1991.

O’Connor, P. D. T., “Quantifying Uncertainty in Reliability and Safety Studies,” Microelectronics and Reliability, Vol. 35, No. 9-10, pp. 1347–1356, 1995.

O’Connor, P. D. T., “Undue Faith in US MIL-HDBK-217 for Reliability Prediction,” IEEE Transactions on Reliability, Vol. 37, p. 468, 1988.

Osterman, M. and Stadterman, T., “Failure-Assessment Software for Circuit-Card Assemblies,” Proceedings of the Annual Reliability and Maintainability Symposium, pp. 269–276, January 1999.

Pease, R., “What’s All This MIL-HDBK-217 Stuff Anyhow?” Electronic Design, pp. 82–84, October 24, 1991.

Pecht, J. and Pecht, M., Long-Term Non-Operating Reliability of Electronic Products, CRC Press, Boca Raton, FL, 1995.

5 MIL publications are available from Customer Service, Defense Printing Service, 700 Robbins Ave., Bldg. 4D, Philadelphia, PA 19111-5094.


Pecht, M., Integrated Circuit, Hybrid, and Multichip Module Package Design Guidelines, John Wiley and Sons, Inc., New York, New York, 1993.

Pecht, M., Dasgupta, A., Barker, D., and Leonard, C. T., “The Reliability Physics Approach to Failure Prediction Modelling (sic),” Quality and Reliability Engineering International, Vol. 6, pp. 267–273, 1990.

Pecht, M. and Ko, W., “A Corrosion Rate Equation For Microelectronic Die Metallization,” The Journal of the International Society of Hybrid Microelectronics, Vol. 13, No. 2, pp. 41–52, June 1990.

Pecht, M. and Nash, F., “Predicting the Reliability of Electronic Equipment,” Proceedings of the IEEE, Vol. 82, No. 7, pp. 992–1004, July 1994.

Pecht, M., Nguyen, L. T., and Hakim, E. B., Plastic Encapsulated Microelectronics, John Wiley and Sons, Inc., New York, New York, 1994.

Pecht, M. and Ramappan, V., “Are Components Still the Major Problem: a Review of Electronic System and Device Field Failure Returns,” IEEE Transactions on Components, Hybrids, and Manufacturing Technology, Vol. 15, pp. 1160–1164, December 1992.

Raheja, D., “Death of a Reliability Engineer,” Reliability Review, Vol. 10, March 1990.

Rao, S. S., Reliability-Based Design, McGraw-Hill, Inc., NY, pp. 505–543, 1992.

Reliability Analysis Center, PRISM, Version 1.3, System Reliability Assessment Software, Reliability Analysis Center, Rome, NY, June 2001.

Rome Air Development Center, NONOP-1—Non-Operating Reliability Databook, Rome Air Development Center, 1987.

Rome Air Development Center, RADC-TR-73-248: Dormancy and Power On-Off Cycle Effects on Electronic Equipment and Part Reliability, Rome Air Development Center, August 1973.

Rome Air Development Center, RADC-TR-80-136: Nonoperating Failure Rates for Avionics Study, Rome Air Development Center, April 1980.

Rome Air Development Center, RADC-TR-85-91: Impact of Nonoperating Periods on Equipment Reliability, Rome Air Development Center, May 1985.

Rooney, J. P., “Storage Reliability,” 1989 Proceedings of the Annual Reliability and Maintainability Symposium, pp. 178–182, January 1989.

Ross, S., Stochastic Processes, John Wiley and Sons, Inc., New York, New York, 1983.

SAE G-11 Committee, Aerospace Information Report on Reliability Prediction Methodologies for Electronic Equipment AIR5286, Draft Report, January 1998.

Shetty, S., Lehtinen, V., Dasgupta, A., Halkola, V., and Reinikainen, T., “Fatigue of Chip-Scale Package Interconnects due to Cyclic Bending,” ASME Transactions in Electronic Packaging, Vol. 123, No. 3, pp. 302–308, September 2001.

Siemens AG, Siemens Company Standard SN29500, Version 6.0, Failure Rates of Electronic Components, Siemens Technical Liaison and Standardization, November 9, 1999.


Stadterman, T., Cushing, M., Hum, B., Malhotra, A., and Pecht, M., “The Transition from Statistical-Field Failure Based Models to Physics-of-Failure Based Models for Reliability Assessment of Electronic Packages,” Proceedings of the INTERpack ’95, Lahaina, Maui, HI, pp. 619–625, March 26–30, 1995.

Stensrud, A. C., “Fear of Reform,” Military & Aerospace Electronics, pp. 12–19, December 1994.

Telcordia Technologies, Special Report SR-332: Reliability Prediction Procedure for Electronic Equipment, Issue 1, Telcordia Customer Service, Piscataway, NJ, May 2001.

Tummala, R. and Rymaszewski, E., Microelectronics Packaging Handbook, Van Nostrand Reinhold, New York, NY, 1989.

Union Technique de L’Electricité, Recueil de données de fiabilité: RDF 2000, Modèle universel pour le calcul de la fiabilité prévisionnelle des composants, cartes et équipements électroniques (Reliability Data Handbook: RDF 2000—A universal model for reliability prediction of electronic components, PCBs, and equipment), July 2000.

Upadhyayula, K. and Dasgupta, A., “An Incremental Damage Superposition Approach for Interconnect Reliability Under Combined Accelerated Stresses,” ASME International Mechanical Engineering Congress & Exposition, Dallas, TX, 1997.

U.S. Army MIRADCOM, LC-78-1: Missile Material Reliability Prediction Handbook, U.S. Army MIRADCOM, Redstone Arsenal, February 1978.

Watson, G. F., “MIL Reliability: A New Approach,” IEEE Spectrum, Vol. 29, pp. 46–49, 1992.

Wilson, P. D., Dell, L. D., and Anderson, G. F., Root Cause Analysis: A Tool for Total Quality Management, ASQC Quality Press, Milwaukee, Wisconsin, 1993.

Witzmann, S. and Giroux, Y., “Mechanical Integrity of the IC Device Package: A Key Factor in Achieving Failure Free Product Performance,” Transactions of the First International High Temperature Electronics Conference, Albuquerque, NM, pp. 137–142, June 1991.

Wong, K. L., “A Change in Direction for Reliability Engineering is Long Overdue,” IEEE Transactions on Reliability, Vol. 42, p. 261, 1993.

Wong, K. L., “The Bathtub Curve and Flat Earth Society,” IEEE Transactions on Reliability, Vol. 38, pp. 403–404, 1989.

Wong, K. L., “What Is Wrong with the Existing Reliability Prediction Methods?” Quality and Reliability Engineering International, Vol. 6, No. 4, pp. 251–257, 1990.

Wong, K. L. and Lindstrom, D. L., “Off the Bathtub onto the Roller-Coaster Curve (Electronic Equipment Failure),” Proceedings of the Annual Reliability and Maintainability Symposium, pp. 356–363, 1988.

Wong, K. L., Quart, I., Kallis, J. M., and Burkhard, A. H., “Culprits Causing Avionic Equipment Failures,” Proceedings of the Annual Reliability and Maintainability Symposium, pp. 416–421, 1987.


3. Definitions, abbreviations, and acronyms

3.1 Definitions

3.1.1 baseline: The set of data values selected as reference for comparing other similar sets of future data values.

3.1.2 Bx life (e.g., B10): Time until a specified percent of a device population will have experienced a failure (B10 means the time duration when 10% of a device’s population will have experienced a failure).

3.1.3 Cx confidence level (e.g., C90): Confidence level (C90 means confidence of 90%).

3.1.4 classic reliability: The probability that an item will perform its intended function for a specified interval under stated conditions.

3.1.5 constant failure rate: A hazard (see 3.1.14) rate that is constant or independent of time (applies only to exponentially distributed failure assumption—see 4.1.4.1).

3.1.6 estimation: A systematic procedure for deriving an approximation to the true value of a population parameter.

3.1.7 failure: The termination of the ability of an item to perform a required function.

3.1.8 failure cause (root cause): The circumstances during design, manufacture, or use which have led to a failure.

3.1.9 failure criticality: The combined effect of the quantified consequences of a failure mode and its probability of occurrence. Syn: risk.

3.1.10 failure mechanism: The physical, chemical, or other process that results in failure.

NOTE—The circumstance that induces or activates the process is termed the root cause of the failure.

3.1.11 failure mode: The effect by which a failure is observed to occur.

3.1.12 failure site: The specific location where a failure mechanism occurs.

3.1.13 goal: An objective that is desirable, but not mandatory, to meet.

3.1.14 hazard rate: The instantaneous rate of failure of the product.

3.1.15 item: An all-inclusive term to denote any level of hardware (or system) assembly.

3.1.16 item characteristics: The set of technical parameters that comprehensively define an item, including its goals and performance.

3.1.17 Lx life (e.g., L60): Time until a specified percent of a device population will have experienced a failure (L60 means the time duration when 60% of a device’s population will have experienced a failure). Lx is the same as Bx.

3.1.18 operating environment: The natural or induced environmental conditions, anticipated system interfaces, and user interactions within which the system is expected to be operated.


3.1.19 operating profile: A set of functional requirements that are expected to apply to the system during its operational life.

3.1.20 requirement: A condition or capability that must be met or possessed by a system or system component to satisfy a contract, standard, specification, or other formally imposed documents.

3.1.21 system life cycle: The period of time that begins when a system is conceived and ends when the system is no longer available for use.

3.2 Abbreviations and acronyms

AFR  annualized failure rate
ASIC  application specific integrated circuits
BOM  bill of materials
CCA  circuit card assembly
CDF  cumulative distribution function
CNET  The Centre National d’Etudes des Telecommunications
CSP  chip scale package
CTE  coefficient of thermal expansion
DOA  dead on arrival
DRAM  dynamic random access memory
DSIC  Defense Standards Improvement Council
EEPROM  electrically erasable programmable read-only memory
EOS  electrical overstress
EPROM  erasable programmable read-only memory
ESD  electrostatic discharge
FAIT  fabrication, assembly, integration, and test
FFOP  failure-free operating period
FITs  failures per billion hours
FMEA  failure modes and effects analysis
FMECA  failure modes, effects and criticality analysis
FPMH  failures per million hours
FRACAS  Failure Reporting and Corrective Action System
FTA  fault tree analysis
HALT  highly accelerated life tests
LCC  leadless ceramic capacitor
LCR  leadless ceramic resistor
MCBF or MCTF  mean-cycles/miles-between/before-failure
MFOP  maintenance-free operating period
MIRADCOM  The U.S. Army Missile Research and Development Command
MLE  maximum likelihood estimation
MTBF  mean-time-before/between-failure
MTBR  mean-time-between-return/repair/replacement
MTBSC  mean-time-between-service call
MTBSI  mean-time-between-service interruption
MTBWC  mean-time-between-warranty claim
MTTF  mean time to failure
NFF  no failure found
NTT  Nippon Telegraph and Telephone Corporation
ORT  ongoing reliability tests
OST  over-stress tests
PCB  printed circuit board


PLCC  plastic leaded chip carrier
PROM  programmable read-only memory
RAC  Reliability Analysis Center
RBOC  Regional Bell Operating Companies
RDT  reliability demonstration tests
ROM  read-only memory
SAE  Society of Automotive Engineers
SDDV  stress driven diffusive voiding
SRAM  static random access memory
TDDB  time-dependent dielectric breakdown
UTE  Union Technique de L’Electricité

4. Background

Background information on basic reliability concepts and definitions is provided in this clause. Subclause 4.1 provides basic reliability concepts and definitions commonly used in reliability engineering, such as failure and hazard rate, bathtub curve, statistical distributions, and reliability metrics. It also introduces repairable and non-repairable system concepts. Subclause 4.2 describes some of the uses of reliability predictions and how reliability predictions fit into the system life cycle. Subclause 4.3 describes factors that should be considered when selecting a reliability prediction method.

4.1 Basic concepts and definitions

This subclause provides background information on basic reliability concepts and definitions. Subclause 4.1.1 discusses common usage and definitions of the terms reliability and failure and associated concepts. Subclause 4.1.2 describes the bathtub curve. Subclause 4.1.3 provides the characteristics of statistical distributions commonly used in reliability prediction. Subclause 4.1.4 describes reliability metrics. Subclause 4.1.5 is a brief discussion of the concepts for repairable and non-repairable system reliability.

4.1.1 Reliability and failure

The word reliability is used in many different ways. It can be used to broadly describe an engineering discipline or to narrowly define a specific performance metric. Classic reliability is the probability that an item will perform its intended function for a specified interval under stated conditions (see Pecht [B11]). In classical reliability, a single product, unit, or component is generally considered to have two states, operational and failed. The assumption is made that the product can only transition from the operational to the failed state (no repair). The state of the product is represented by a random variable that takes a value of one when it is operational and zero when it is failed. Classic reliability, R(t), is the probability that the product is in the operational state up to time t. At time 0, the product is assumed to be good, and the product must eventually fail, so R(0) = 1, R(∞) = 0, and R(t) is a non-increasing function. If there is a mission with duration T, the classic reliability for that mission is R(T).
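The following minimal Python sketch (not part of this guide; the function name and the exponential life assumption are illustrative only) shows these properties for a mission of duration T:

import math

def reliability(t, failure_rate=1e-5):
    """Classic reliability R(t), assuming an exponential life distribution.

    R(0) = 1 and R(t) is non-increasing, approaching 0 as t grows.
    failure_rate is an assumed constant failure rate in failures per hour.
    """
    return math.exp(-failure_rate * t)

T = 5_000  # mission duration in hours (illustrative value)
print(f"Mission reliability R({T}) = {reliability(T):.4f}")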

A failure occurs when an item does not perform a required function. However, in practice, the word failure is often used to mean whatever the customer considers as a failure. There is also the concept of transient or intermittent failures, in which an item does not provide the specified performance level for a period, and then once again provides the specified performance level, without repair of the item and sometimes without any intervention.

Although reliability theory is based on the concept of failures, the definitions of a failure may be very different from different perspectives. To a hardware engineer, a “failure” means a component replacement and verification of the replaced component failure. To a manufacturing repair depot, a “failure” is a returned component. To the finance department, a “failure” is a warranty claim. To a service organization, a “failure” is a service call for corrective maintenance (as opposed to planned or preventive maintenance). To a customer, a “failure” is a degradation of service or capability of the system (a failure that is tolerated without service interruption is considered a degradation of the product capability since it is then less able to tolerate future failures). These various definitions of “failure” lead to various types of metrics, such as replacement rate or service call rate, in addition to the classic reliability metric of constant failure rate. For reliability predictions and discussions, it is important to make clear the system hierarchy level to which a failure applies. For example, a component failure may not cause a system failure, particularly in systems that include redundancy or fault tolerance.

When testing or tracking a subsystem, assembly, or component, the definition of failure becomes important. For a simple unit under test with a binary response, defining failure is relatively easy: either the unit operates or it does not. However, even digital devices may experience degraded modes, such as changes in the timing of critical signals or signal bounce under certain switching conditions, which will affect the system performance. Similarly, analog devices may show a slowed response or excessive drift from nominal. How far can it drift before the device is considered failed? If a pull-up resistor on a digital circuit drifts from 10,000 ohms to 7,000 ohms, it may still provide the intended function even though the magnitude of the change is a 30% decrease in resistance. For an electronic control system, the failure criteria may be the number of milliseconds delay in system response. Acceptable levels of drift or change should be established.
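As an illustration of a drift-based failure criterion, a minimal sketch (hypothetical function and threshold; the guide only states that acceptable drift levels should be established per application):

def is_failed(nominal_ohms, measured_ohms, max_drift_fraction=0.30):
    """Declare failure when parameter drift exceeds an agreed criterion.

    The 30% threshold is an assumed example, echoing the 10,000 ohm ->
    7,000 ohm pull-up resistor case discussed above; real criteria must
    be set from circuit requirements.
    """
    drift = abs(measured_ohms - nominal_ohms) / nominal_ohms
    return drift > max_drift_fraction

print(is_failed(10_000, 7_000))  # False: exactly 30% drift, within the example criterion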

4.1.2 Bathtub curve

The idealized bathtub curve, shown in Figure 1, represents three phases of product hazard rates. The vertical axis is the hazard rate, and the horizontal axis is time from initial product operation. The first phase, often called “infant mortality,” represents the early life of the product when manufacturing imperfections or other initial failure mechanisms may appear. The hazard rate decreases during this time as the product becomes less likely to experience failure from one of these mechanisms. The second phase, sometimes called the “useful life,” represents the majority of the product operating time. During this period of time, the hazard rate of the product appears to be constant, i.e., a constant failure rate. The third phase, often called “wear-out,” occurs near the end of the expected product life and often represents failure mechanisms caused by cumulative damage. Electronic components are subject to wear-out due to electromigration, material degradation, and other mechanisms. Weibull, lognormal, or other statistical distributions can be used to describe hazard rates during both “infant mortality” and “wear-out.” Mathematically, the idealized bathtub curve is actually the composite of three distinct distributions representing the three product behavior phases: an initial decreasing hazard rate distribution in the early operation, an exponential distribution (constant failure rate) during the useful life, and an increasing hazard rate distribution for the end of life. However, in practice, the bathtub curve will not be so simple, may be a combination of many distributions representing many different failure modes, and may not have the characteristic shape shown in Figure 1 (see Wong, K. L., and Lindstrom, D. L., “Off the Bathtub onto the Roller-Coaster Curve (Electronic Equipment Failure),” Proceedings of the Annual Reliability and Maintainability Symposium6).

6 Information on references can be found in Clause 2.
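A minimal sketch of such a composite hazard function (all breakpoints and parameters below are illustrative assumptions, not values from this guide):

def weibull_hazard(t, alpha, beta):
    """Weibull hazard rate h(t) = (beta/alpha) * (t/alpha)**(beta - 1)."""
    t = max(t, 1e-9)  # avoid a zero-division at t = 0 when beta < 1
    return (beta / alpha) * (t / alpha) ** (beta - 1)

def bathtub_hazard(t):
    """Piecewise composite: decreasing, constant, then increasing hazard."""
    if t < 1_000:                # infant mortality: Weibull shape < 1
        return weibull_hazard(t, alpha=2_000, beta=0.5)
    elif t < 50_000:             # useful life: constant failure rate
        return 1e-5
    else:                        # wear-out: Weibull shape > 1
        return weibull_hazard(t - 50_000, alpha=20_000, beta=3.0)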


4.1.3 Statistical distributions

Many different statistical distributions are used in reliability prediction (see Table 1). These statistical distributions are sometimes called “life” distributions and usually represent the probability that an item is operating at a particular time. Classic reliability, hazard rate, mean life, and other reliability metrics can be calculated from these distributions. A more detailed discussion of these distributions can be found in several textbooks (e.g., Montgomery, D. C. and Runger, G. C., Applied Statistics and Probability for Engineers, and Pecht, M., Nguyen, L. T., and Hakim, E. B., Plastic Encapsulated Microelectronics).

Table 1—Example distributions used in developing reliability predictions [see Alvarez, M. and Jackson, T., “Quantifying the Effects of Commercial Processes on Availability of Small Manned-Spacecraft”]

Binomial: f(t) = C(n, x) p^x q^(n−x), where n is the number of trials, x ranges from 0 to n, p is the probability of success, and q is 1 − p.

Exponential: f(t) = λ exp(−λt), where λ is the constant failure rate and the inverse of MTBF. Applies to the middle section of the idealized bathtub curve (constant failure rate).

Gamma: f(t) = [1/(α! β^(α+1))] t^α exp(−t/β), where α is the scale parameter and β is the shape parameter.

Lognormal: f(t) = (2π t² σ²)^(−1/2) exp{−[(ln(t) − µ)/σ]²/2}, where µ is the mean and σ is the standard deviation.

Normal: f(t) = (2πσ²)^(−1/2) exp{−[(t − µ)/σ]²/2}, where µ is the mean and σ is the standard deviation.

Poisson: f(t) = (λt)^x exp(−λt)/x!, where x is the number of failures and λ is the constant failure rate. Appropriate distribution for the number of failures from a device population in a time period when the devices have an exponential distribution and are replaced upon failure.

Weibull: f(t) = (β/α)(t/α)^(β−1) exp[−(t/α)^β], where α is the scale parameter and β is the shape parameter. Infant mortality (shape parameter < 1); wear-out (shape parameter > 1); constant failure rate (shape parameter = 1).
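For example, the Poisson row of Table 1 can be used to estimate the chance of observing a given number of failures from a population of devices that are replaced upon failure. A minimal sketch (the fleet size, failure rate, and interval are assumed values, not data from this guide):

import math

def prob_x_failures(x, failure_rate, hours, n_units):
    """P(exactly x failures) when n_units each fail at a constant rate and
    are replaced on failure: Poisson with mean = rate * hours * n_units."""
    mean = failure_rate * hours * n_units
    return mean**x * math.exp(-mean) / math.factorial(x)

# Assumed: 500 units, 2 failures per million hours, one year (8760 h)
p_at_most_2 = sum(prob_x_failures(x, 2e-6, 8760.0, 500) for x in range(3))
print(f"P(at most 2 failures in a year) = {p_at_most_2:.4f}")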

Figure 1—Idealized bathtub curve (hazard rate versus time: decreasing hazard rate (infant mortality), constant hazard rate (random failures), and increasing hazard rate (wearout))


4.1.4 Measuring reliability

The classical definition of reliability is the probability of providing a specified performance level for a specified duration in a specified environment. This probability is a useful metric for mission-oriented, low volume products such as spacecraft. However, reliability metrics for most high-volume products measure the reliability of a product population rather than the performance of a single system or a mission. Specifying a single value such as MTBF is not sufficient for a product that exhibits a time-dependent hazard rate (i.e., non-constant failure rate). In this case, a more appropriate metric is the probability of mission success. This metric may be time dependent, e.g., the probability of mission success may vary depending on the length of the mission, or number of cycles, e.g., the probability of mission success may vary depending on the number of uses. For “one-shot” devices, where the mission is a single event such as a warhead detonation, the probability of success will be a single number. Constant rate metrics are discussed in 4.1.4.1. Probabilities of success metrics are described in 4.1.4.2.

A useful reliability function is the cumulative hazard function, H(t), which can be derived from the equation H(t) = −ln(R(t)). The derivative of the cumulative hazard function is the hazard rate, h(t) (see Pecht [B11]).
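As a worked example of these relationships (a standard derivation, restated here for clarity; see also 4.1.4.1):

\[
R(t) = e^{-\lambda t}
\quad\Rightarrow\quad
H(t) = -\ln R(t) = \lambda t
\quad\Rightarrow\quad
h(t) = \frac{dH(t)}{dt} = \lambda
\]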

4.1.4.1 Constant rate reliability metrics

The hazard rate is the instantaneous rate of failure of the product. When the hazard rate is constant, or independent of time, it is usually designated by the parameter λ. Since

H(t) = ∫₀ᵗ h(τ) dτ = λt

for a constant failure rate, the previous equation becomes the familiar R(t) = exp(−λt), the exponential distribution. The constant parameter λ is usually called the constant failure rate, although sometimes the function h(t) is also called the “failure rate,” and there are many references in the literature to increasing or decreasing failure rates.7

A constant failure rate has many useful properties, one of which is that the mean value of the product’s life distribution is 1/λ. This mean value represents the statistically expected length of time until product failure and is commonly called the mean life, or mean-time-before/between-failure (MTBF). Another useful property of the constant failure rate is that it can be estimated from a population as the number of failures divided by time without having to fit a distribution to failure times. However, it should be noted that the exponential distribution is the only distribution for which the hazard rate is a constant and that the mean life is not 1/h(t) when the hazard rate is not a constant.

MTBF is sometimes misunderstood to be the life of the product rather than an expression of the constant failure rate.8 If a product has an MTBF of 1,000,000 hours, it does not mean that the product will last that long (longer than the average human lifetime). Rather, it means that, on the average, one of the products will fail for every 1,000,000 hours of product operation, i.e., if there are 1,000,000 products in the field, one of them will fail in one hour on the average. In this case, if product failures are truly exponentially distributed, then 63% of the products will have failed after 1,000,000 hours of operation. Products with truly exponentially distributed failures over their entire lifetime almost never occur in practice, but a constant failure rate and MTBF may be a good approximation of product failure behavior.

7 Since failure rate is so often implicitly interpreted as a constant parameter, the term constant failure rate is used throughout this guide to mean the constant parameter λ of the exponential distribution. The term hazard rate is used whenever the derivative of the hazard function varies with time, e.g., decreasing hazard rate or increasing hazard rate.

8 The use of mean time to failure (MTTF) and MTBF is not standard in either reliability literature or industry practice. In some contexts, MTTF is used for non-repairable items, and MTBF is used for repairable items. In some contexts, either or both MTTF and MTBF are implicitly assumed to imply a constant failure rate. For convenience and to help minimize confusion in this guide, MTTF is used in conjunction with non-repairable items, MTBF is used in conjunction with repairable items, and both are used only in conjunction with a constant failure rate. When the hazard rate is not a constant, the mean value of the reliability distribution is referred to as the mean life rather than the MTBF or MTTF.
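The 63% figure above follows directly from the exponential assumption; a minimal numeric check (values taken from the example in the text):

import math

mtbf = 1_000_000.0        # hours, example value from the text
lam = 1.0 / mtbf          # equivalent constant failure rate

# Fraction of an exponentially distributed population failed by t = MTBF:
print(f"{1.0 - math.exp(-lam * mtbf):.3f}")   # 0.632, i.e., about 63%

# Fleet view: 1,000,000 units in the field yield, on average,
print(lam * 1_000_000, "failures per hour")   # 1.0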


If the constant rate is represented by the parameter λ, the mean value of the exponential distribution is 1/λ, as discussed in the preceding subclause. Therefore, constant rate metrics can be described either as a rate or as a mean life, e.g., constant failure rate or MTBF. Constant rate reliability metrics are approximations based on the assumption of the exponential distribution. Predictions for these alternative types of reliability metrics are discussed in 5.2. The term constant rate is used throughout this document to refer to this collection of metrics that includes, but is not limited to, constant failure rate. Constant rate metrics other than constant failure rate are collectively referred to as "non-failure metrics." Table 2 contains a list of constant rate metrics and their equivalent mean life inverses, along with an indication of when these metrics might be appropriate.

Table 2—Example constant rate reliability metrics (NOTE—Equivalent non-constant versions of these metrics can also be used, e.g., hazard rate in place of constant failure rate)

| Constant rate metric | Mean life equivalent | Definition | Use |
|---|---|---|---|
| Constant failure rate | Mean-Time-Between/Before-Failure (MTBF) or Mean-Time-To-Failure (MTTF) | Total failures divided by total population operating time; can be expressed as failures per year (annualized failure rate, AFR), failures per billion hours (FITs), or failures per million hours (FPMH). | Standard metric for reliability predictions; measure of inherent system hardware reliability. |
| Constant failure rate using cycles or distance instead of time | Mean-Cycles/Miles-Between/Before-Failure (MCBF) or MCTF | Total failures divided by total population number of cycles or distance, e.g., miles. | Standard metric for reliability predictions when usage is more relevant than time. These metrics are sometimes converted to time-based metrics by specifying an operating profile. |
| Constant return/repair rate | MTBR (Mean Time Between Return/Repair) | Total returns/repairs divided by total population operating time. | Useful for sizing a repair depot or manufacturing repair line. |
| Constant replacement rate | MTBR (Mean Time Between Replacement) | Total replacements divided by total population operating time. | Used as surrogate for constant failure rate when no failure analysis is available; useful for warranty analysis. |
| Constant service or customer call rate | MTBSC (Mean Time Between Service Call) | Total service/customer calls divided by total population operating time. | Customer perception of constant failure rate; useful for sizing support requirements. |
| Constant warranty claim rate | MTBWC (Mean Time Between Warranty Claim) | Total warranty claims divided by warranted population operating time. | Useful for pricing warranties and setting warranty reserves. |
| Constant service interruption rate | MTBSI (Mean Time Between Service Interruption) | Total service interruptions divided by total population operating time. | Customer perception of constant failure rate; may be an availability metric. |


There are several equivalent ways of expressing the constant rate metrics in Table 2. For example, a constant failure rate of 1% per year is equivalent to 0.01 failures per unit per year, 1.1 failures per million hours, 1100 FITs, and 10 failures per 1000 products per year (assuming replacement; 9.95 failures per 1000 products per year without replacement).
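
These conversions are simple unit arithmetic; the following minimal sketch (illustrative only; continuous operation at 8760 hours per year is assumed) reproduces the figures above:

```python
import math

HOURS_PER_YEAR = 8760  # continuous operation assumed

def afr_to_fpmh(afr):
    """Annualized failure rate (fraction per year) -> failures per million hours."""
    return afr / HOURS_PER_YEAR * 1e6

def fpmh_to_fits(fpmh):
    """Failures per million hours -> failures per billion hours (FITs)."""
    return fpmh * 1e3

afr = 0.01                           # 1% per year, the example in the text
fpmh = afr_to_fpmh(afr)
print(fpmh)                          # ~1.14 (the guide rounds to 1.1)
print(fpmh_to_fits(fpmh))            # ~1142 FITs (the guide rounds to 1100)
print(HOURS_PER_YEAR / afr)          # equivalent MTBF: 876,000 hours
print(1000 * (1 - math.exp(-afr)))   # ~9.95 failures per 1000 units/year without replacement
```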

Constant rate reliability metrics may be used as surrogates for time-dependent metrics by specifying different constant failure rates for different periods of time, such as the infant mortality phase. For example, a constant failure rate goal might be stated as 3λ for the first 3 months of production or operation, 2λ for the next 3 months, and λ after 6 months, where λ is the constant failure rate during the "useful life" phase. Another approach for replacing a time-dependent metric with a constant failure rate is to determine the expected number of failures for a certain time period and to specify a constant failure rate during that time period. For example, a product that follows the idealized bathtub curve might be expected to have 50 failures in a population of 1,000 in the first year, 30 failures in each of the next 3 years, and 60 failures in the 5th year. This product's reliability may be approximated with an average constant failure rate of 4.6 failures per million hours (4,600 FITs) for 5 years, equivalent to 200 failures in a population of 1,000 in 5 years.
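
The averaging in this example is total failures divided by total unit-hours; a minimal sketch (illustrative only, using the numbers from the text and ignoring the small loss of hours from failed units, as the approximation in the text does):

```python
HOURS_PER_YEAR = 8760
population = 1000
failures_per_year = [50, 30, 30, 30, 60]   # idealized bathtub example from the text

total_failures = sum(failures_per_year)                            # 200 failures in 5 years
unit_hours = population * len(failures_per_year) * HOURS_PER_YEAR  # ~43.8 million unit-hours

avg_rate = total_failures / unit_hours
print(avg_rate * 1e6)   # ~4.6 failures per million hours
print(avg_rate * 1e9)   # ~4,600 FITs
```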

4.1.4.2 Probability of success metrics

For non-exponential life distributions, i.e., when the failure rate is not constant, metrics other than failure rates may be more meaningful. An easier-to-manipulate metric for non-exponential distributions is classic reliability, i.e., the probability of success. This metric can be a point on one of the reliability distributions shown in Table 1 or can be a convolution of many distributions.

Another way of expressing probability of success is the percentage of a population that survives a specified duration. For this metric, percentiles of the distribution may be used. These percentiles may be stated as a "B" life or an "L" life. For example, an L10 life of 300,000 hours means that 10% of the product population would have experienced a failure by 300,000 hours of operation, and a B50 life of 5 years means that 50% of the product population would have experienced a failure by 5 years of operation. Some metrics may be stated with a confidence level, e.g., R96C90 in the automotive industry means a reliability of 96% with 90% confidence. Table 3 contains a list of probability of success metrics along with their definitions and uses.
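
For a known life distribution, an Lx or Bx life is simply the x-th percentile of the failure distribution; a minimal sketch (illustrative only; the Weibull parameters below are hypothetical):

```python
import math

def lx_life(x_percent, beta, eta):
    """Time by which x% of a Weibull(beta, eta) population has failed."""
    p = x_percent / 100.0
    return eta * (-math.log(1.0 - p)) ** (1.0 / beta)

# Hypothetical wearout-dominated fan population: shape 2.0, scale 80,000 hours.
print(lx_life(10, beta=2.0, eta=80_000))   # L10 life, ~26,000 hours
print(lx_life(50, beta=2.0, eta=80_000))   # B50 life, ~66,600 hours
```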

Table 3—Example probability of success metrics

| Metric | Definition | Example use |
|---|---|---|
| Classic reliability | Probability that a product performs a required function under stated conditions for a stated period of time. | Classic reliability metric. |
| Lx life, e.g., L10 life | Time until 10% of a device population will have experienced a failure. | Mechanical items, e.g., fans. Electronic components with wearout, e.g., electrolytic capacitors. |
| Bx life, e.g., B10 life | Same as Lx life. | Automobile industry. |
| Failure-Free Operating Period (FFOP) | An operating period where a product performs a required function under stated conditions without failure. | Applications where a period of time without failure is required. |
| Maintenance-Free Operating Period (MFOP) | An operating period where a product performs a required function under stated conditions without a failure or maintenance action. | Applications where no failure or maintenance is allowed or possible. |
| Mean mission duration | Integral of the reliability function R(t) from 0 to the specified design life. | Spacecraft. |


4.1.5 Repairable and non-repairable system concepts

A repairable item is one that can be restored to functionality after failure, e.g., a system can be restored by repair or replacement of a component(s) either physically or functionally. Examples of repairable items are cars and computers. A non-repairable item is one that cannot be restored to functionality after failure (or one for which an organization's maintenance policy does not restore after failure). Examples of non-repairable items are light bulbs and fuses. Note that a repairable item may contain non-repairable lower-level items. For example, if a circuit card assembly is repaired by replacing a resistor that is thrown away, the circuit card assembly is repairable while the resistor is non-repairable.

Repair can affect the reliability prediction. Some redundant systems permit repair of a failed component that restores the component to an operational state while the system continues to operate. Reliability distributions are calculated from multiple failures of the same item in field or test data. If repair of the item does not return it to a "good as new" condition, these multiple failures represent different life distributions, and they cannot be used to estimate the parameters of the original life distribution. When the repaired component has the same statistical life distribution as the original component, it may be considered "good as new."

Examples of techniques for analyzing repairable systems are Poisson processes (homogeneous and non-homogeneous), renewal theory, and Markov analysis. Repairable system concepts are described briefly in Clause 6. If additional information is desired, sources such as Ascher, H., and Feingold, H., Repairable Systems Reliability: Modeling, Inference, Misconceptions and Their Causes may be consulted.

4.2 Reliability prediction uses and timing

As stated in 1.2, the purpose of this guide is to assist in the selection and use of reliability prediction methodologies satisfying IEEE Std 1413-1998. However, before selecting a reliability prediction method, the desired uses of the prediction (why), the appropriate time in the system life cycle to perform the prediction (when), the item(s) for which the reliability prediction is to be performed (what), and the factors that should be considered in selecting the appropriate reliability prediction method (how) should be considered. Subclause 4.2.1 describes some of the uses for a reliability prediction to help define why a prediction is being performed. Subclause 4.2.2 explains how reliability predictions fit into the product life cycle, which helps to identify the appropriate time to use a certain reliability prediction method. The factors that should be considered in selecting the appropriate reliability prediction method are discussed in 4.3 and 5.1.

Predictions can be made using data obtained during engineering development phases. Not only is the raw data useful to determine the reliability, but the rate of change in reliability can also be used to show additional improvements. Reliability improvement (growth) models available in the literature include the Gompertz, Lloyd-Lipow, Logistic Reliability Growth, and Duane and AMSAA models for non-homogeneous Poisson processes (see Kececioglu, B. D., Reliability Engineering Handbook, Vols. 1 and 2).
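
As an illustrative aside (not drawn from the guide), the Crow-AMSAA/Duane postulate, expected cumulative failures N(t) = λt^β, can be fit to development test failure times by log-log regression; the failure times below are hypothetical:

```python
import math

# Hypothetical development test data: cumulative test hours at each failure.
failure_times = [45, 110, 240, 480, 900, 1650, 2800]

# Fit ln N(t) = ln(lam) + beta * ln(t) by least squares, where N(t) is the
# observed cumulative failure count at each failure time.
xs = [math.log(t) for t in failure_times]
ys = [math.log(i + 1) for i in range(len(failure_times))]
n = len(xs)
xbar, ybar = sum(xs) / n, sum(ys) / n
beta = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / \
       sum((x - xbar) ** 2 for x in xs)
lam = math.exp(ybar - beta * xbar)

# Instantaneous failure intensity and demonstrated MTBF at the end of test:
T = failure_times[-1]
intensity = lam * beta * T ** (beta - 1)
print(beta, 1 / intensity)   # beta < 1 indicates reliability growth
```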

4.2.1 Reliability prediction uses

Subclause 1.3 of IEEE Std 1413-1998 lists the uses of a reliability prediction. Those uses are also listed below, followed by text that expands upon the usage and describes the associated value.

— Reliability goal assessment: Reliability predictions are used to help assess if the system will satisfy its reliability goals.

— Comparisons of designs and products: Most systems have design implementation options. Trade-offs must be made among the various options, and reliability prediction is an important input in these trade-offs. These options may even affect the system architecture, e.g., the amount and level of redundancy. Since trade-offs must often be made early in the design process, the reliability prediction may be very preliminary. However, it is still useful since the important information may be the relative reliability and ranking of design choices rather than a precise quantitative value.


— Method to identify potential reliability improvement opportunities: Reliability improvement activities should generally focus on the areas with the greatest opportunity for improvement. A reliability prediction quantifies the opportunity by identifying the relative reliability of various units and by predicting the reliability improvement obtained from a reliability improvement activity.

— Logistics support: Reliability predictions are a key input into spare parts provisioning and calculation of warranty and life cycle costs.

— Reliability and system safety analyses: Design and process analyses such as failure modes, effects, and criticality analysis (FMECA), event tree analysis, and fault tree analysis (FTA) may be performed to uncover items associated with reliability or safety related risk.

— Mission reliability estimation: Missions may have multiple phases with different equipment configurations, and system reliability models can be used to predict reliability for the entire mission (a minimal sketch follows this list).

— Prediction of field reliability performance: A reliability prediction provides an estimate of how reliably a product will perform in future field usage based on how it performed in past field usage. This information may impact operational concepts, contingency planning, and other support planning activities.
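
For the mission reliability item above, a minimal sketch (illustrative only; a series system with exponential phases is assumed, and all rates and durations are hypothetical):

```python
import math

# Hypothetical three-phase mission; each phase has its own constant failure
# rate (per hour) and duration, reflecting different equipment configurations.
phases = [
    ("launch",  2.0e-4, 0.5),
    ("transit", 5.0e-6, 400.0),
    ("landing", 1.0e-4, 1.0),
]

# For a series system with exponential phases, mission reliability is the
# product of the per-phase reliabilities exp(-rate * duration).
mission_r = 1.0
for name, rate, hours in phases:
    r = math.exp(-rate * hours)
    mission_r *= r
    print(f"{name}: R = {r:.5f}")
print(f"mission reliability = {mission_r:.5f}")
```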

4.2.2 Reliability predictions in the system life cycle

There are five system life cycle phases defined in IEEE Std 1220-1998 [B4]. These phases are listed below along with the appropriate reliability prediction output of each phase:

a) System Definition Phase—Defines the system requirements, including system interface specifications. In the requirements definition phase, existing reliability data can be used for a reliability prediction based on similarity with existing design(s) or in-service product(s). Although this reliability prediction will only be a rough estimate, it may be useful to help establish reliability goals, make reliability allocations, use in trade-off studies, and/or aid in defining the high-level system architecture. The output of this phase is reliability metrics and goals.

b) Preliminary Design Phase—Defines the subsystem requirements and functional architecture. The preliminary reliability prediction, which is the output of this phase, is based on a well-defined functional description and a somewhat less well-defined physical description of the system.

c) Detailed Design Phase—Completes the design and requirement allocation to the lowest level. The design reliability prediction, which is the output of this phase, is more precise than the earlier ones because it is based on documentation that defines the ready-to-manufacture system, such as design and performance specifications, parts and materials lists, and circuit and layout drawings.

d) Fabrication, Assembly, Integration, and Test (FAIT) Phase—Verifies that the system satisfies its operational requirements, which may require building prototypes or conducting tests. The operational reliability prediction, which is the output of this phase, includes the anticipated effects of the manufacturer's processes on the field reliability of the system.

e) Production/Support Phase—Manufactures, ships, and supports the system in the field, including resolution of any deficiencies. The field reliability prediction, which is the output of this phase, is based on the field reliability data collected, possibly combined with other predictions. When changes are made to the system's design or manufacturing process, the field reliability prediction is updated. An example of a production change that may affect the field reliability prediction is parts obsolescence. Parts that are available in the Detailed Design phase may not be available in the Production/Support phase, and the subsequent change in the Bill of Materials (BOM) may significantly affect the field reliability of the system.

The reliability prediction methodologies that are described in Clause 5 of this guide may be used in any phase of the system life cycle, as long as the required engineering information is available. However, because of the progressive nature of the system life cycle, there may be times when certain reliability prediction methods are preferred due to the type and quality of the available engineering information. For example, the field data necessary for a reliability prediction based on field data usually becomes available in the Production/Support Phase. However, field data from similar in-service systems can be used for reliability predictions earlier in the life cycle. Similarly, the test data necessary for a reliability prediction based on test data usually becomes available in the FAIT Phase. The engineering information necessary for each type of reliability prediction is explained in more detail in Clause 5.

4.3 Considerations for selecting reliability prediction methods

Subclause 5.1 describes the characteristics and required input data for reliability prediction methods. These characteristics and input data are important criteria for selecting appropriate reliability prediction methods, but there are also other factors that may influence the choice of a reliability prediction method. These additional factors include product technology, consequences of system failure, failure criticality, and available resources.

— Product technology: Product technology may influence the selection of a reliability prediction method in several ways. If the product technology is similar to that used in previous products, reliability prediction methods that make use of historical data or analyses may be appropriate. If the product technology is new, it may be necessary to develop new models.

— Consequences of system failure: The desired reliability prediction precision is a function of the social or business consequences of a system failure. In general, the higher the risk, the higher the desire for accurate predictions, where risk includes both business risk and social risk. The risks include financial losses caused by delays in certification, fines emanating from regulatory requirements, delay in time-to-market, loss of customer confidence, costs and results of litigation, safety, and information privacy and security. Social risk refers to the potential for human injury or environmental disruption.

— Failure criticality: Failure of an item contained in a system does not necessarily imply system failure. The consequences of each item's failure modes can be variable, ranging from system failure to unnoticeable. The probability of occurrence of each failure mode can also be variable. It may be important to spend more resources evaluating those failure modes with the most severe consequences of failure and/or the highest probability of occurrence.

— Available resources: The choice of reliability prediction method may be affected by available resources, including time, budget, and information. Some reliability prediction methods may require engineering information or data that is unavailable, e.g., historical or test data. Time or budget limitations may prevent necessary data from being gathered. The skill levels of the available personnel, and their familiarity with certain prediction types, may also influence reliability prediction method selection.

— External influences: External influences may impact the selection of a reliability prediction method. An organization may have a specified reliability prediction method used for all products or all products of a certain type. Customers and regulators may dictate the type of reliability prediction method used or may require a precision that can only be obtained by certain methods. In addition, a bias for or against certain types of prediction methods on the part of the customers or development organization may influence reliability prediction method selection. The available information on operating environment and profile may limit the applicable reliability prediction methods. The selection of a reliability metric may also limit the applicable reliability prediction methods since some methods are useful for only certain types of metrics, e.g., constant failure rate. The engineering information available from a vendor may only support certain types of reliability prediction methods, or a vendor may only have the capability to perform certain types of reliability prediction methods.

5. Reliability prediction methods

This clause provides information for selecting and using reliability prediction methods. Subclause 5.1 discusses the engineering information that should be considered in selecting a reliability prediction method and performing a reliability prediction. Subclauses 5.2 and 5.3 describe the prediction methods that are based on field data and test data, respectively. Subclause 5.4 discusses reliability predictions based on stress and damage models. Subclause 5.5 describes reliability prediction methods based on reliability handbooks. Subclause 5.6 describes the assessment of different reliability prediction methodologies based on IEEE Std 1413-1998.


5.1 Engineering information assessment

The engineering information used in developing a reliability prediction may be collected from many different sources, including the manufacturer's database, the customer's database, in-house databases, or public domain literature and databases. It is necessary to consider the available resources, in terms of time, money, and labor, for collecting this information and using it to develop a reliability prediction with the desired degree of precision. Engineering information includes the following:

— Reliability requirements
— System architecture
— Operating environment
— Operating profile
— Failure modes, mechanisms, and causes

Prior to the selection of a reliability prediction method(s), the type and quality of the engineering information that is available should be examined. Generally, the level of confidence in the engineering information is commensurate with the life cycle maturity of the system for which it applies (see 4.2).

5.1.1 Reliability requirements and goals

Reliability requirements define the specified system reliability, and reliability goals define the desired system reliability. Reliability requirements and goals may be stated either as a single value, e.g., a minimal level of reliability below which the predicted system reliability is unacceptable, or a range of values, e.g., the 3-sigma values for a normal distribution that is centered at a specific time. Different reliability requirements or goals may be defined for different system functions, e.g., when a system is in a high-power mode or a low-power mode. Reliability predictions may be useful in defining reliability requirements and goals, and conversely, reliability requirements and goals may impact the selection of reliability prediction methods.

A single reliability goal may be sufficient for a simple system. However, for a complex system, it is useful to allocate the system level reliability goal to lower levels, such as a subsystem, assembly, or component level. These lower-level goals provide guidance for design engineers that are responsible for a specific portion of the system, and provide reliability input to supplier specifications. The way in which lower-level goals are allocated depends on the metric. If the metric is classical reliability for a series system, then the product of the reliabilities of lower-level units must be greater than or equal to the system goal. If the metric is constant failure rate, then the sum of the lower-level unit constant failure rates must be less than or equal to the system goal.
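
A minimal sketch of these two allocation checks (illustrative only; all goals and unit values below are hypothetical):

```python
# Series-system allocation checks for the two metric types described above.

# Classic reliability goal: the product of unit reliabilities must meet the goal.
system_goal_r = 0.95
unit_reliabilities = [0.99, 0.985, 0.995, 0.99]
product = 1.0
for r in unit_reliabilities:
    product *= r
print(product, product >= system_goal_r)            # ~0.9606, True

# Constant failure rate goal: the sum of unit rates must not exceed the goal.
system_goal_fits = 5000
unit_fits = [1200, 800, 1500, 900]
print(sum(unit_fits), sum(unit_fits) <= system_goal_fits)   # 4400, True
```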

5.1.2 System architecture

The system architecture consists of both the physical and logical system hierarchy structure. There are several different levels in the system hierarchy structure, such as component, assembly, and subsystem. A component or assembly in one application may be a system in another application. The system hierarchy from IEEE Std 1220-1998 [B4] is used throughout this document:

— System
— Product
— Subsystem
— Assembly
— Subassembly (optional)
— Component
— Subcomponent
— Part


The physical architecture defines the basic system hierarchy and the association of lower-level parts and components with higher-level assemblies and systems. A set of bills of materials (BOM), also called indentured parts lists, is an example of a physical architecture. The level of physical architecture detail that is available may influence the selection of the reliability prediction method. Some reliability prediction methods require only indentured parts lists, while others require the interconnection of the parts as well.

The logical architecture describes the partitioning of physical units into functional units and is often depicted as a functional block diagram. The reliability prediction may include a transformation of the system logical architecture into a reliability model such as a reliability block diagram or a fault tree. This reliability model defines the system success/failure criteria and the level and type of redundancy. It may also identify the failure modes for components and/or assemblies and the effects of those failure modes on system operation. Some systems have missions with multiple phases and different logical architectures for each phase. Therefore, the reliability model should define all the different logical architectures for the system.
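
As an illustrative aside (not drawn from the guide), the simplest reliability block diagrams reduce to series/parallel composition; the structure and failure rates below are hypothetical:

```python
import math

def r_exp(rate_fits, hours):
    """Reliability of an exponential unit with a rate given in FITs."""
    return math.exp(-rate_fits * 1e-9 * hours)

def series(*rs):
    out = 1.0
    for r in rs:
        out *= r                 # all units must survive
    return out

def parallel(*rs):
    out = 1.0
    for r in rs:
        out *= (1.0 - r)         # system fails only if every unit fails
    return 1.0 - out             # active redundancy, any one unit suffices

t = 8760   # one year of operation
controller = r_exp(2000, t)
power_a = r_exp(1500, t)
power_b = r_exp(1500, t)
print(series(controller, parallel(power_a, power_b)))
```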

5.1.3 Operating environment

The operating environment consists of both the physical environment and the human interaction with the system. The physical environment describes the operating conditions and loads under which the system operates. It includes temperature, humidity, shock and vibration, voltage, radiation, power, contaminants, and so forth. It also includes loads applied in packaging, handling, storage, and transportation. The human interaction with the system includes human interfaces, skill level of the operators, opportunities for corrective and preventive maintenance, and so forth. If a system has a mission with multiple phases, the operating environment for each phase needs to be identified. These data are used to estimate the reliability of the system.

5.1.4 Operating profile

The operating profile (referred to as the mission profile by some industries) is a set of functional requirements that are expected to apply to the system during its operating life. A system may have multiple operating profiles with different functional requirements and operating environments. For example, a reusable spacecraft may have operational phases with different requirements and environments, such as launch, flight, re-entry, and landing. Electronic equipment may have a non-operating profile (such as shipping, transportation, and storage) as part of its operating profile. Sometimes system failures can occur during the transition from one operating profile to another (see Jackson, T., "Integration of Sneak Circuit Analysis with FMEA"), e.g., a reverse current path that initiates an unwanted function. The reliability prediction should incorporate the operating profile for the system.

5.1.5 Failures, modes, mechanisms, and causes

IEEE Std 1413-1998 specifies the need to identify the failure site, failure mode, failure mechanism, and failure cause(s). The definitions of these terms are given in Clause 3. For every failure mode, there is an initiating failure mechanism or process that is itself initiated by a root cause. In general, the failure mode at one level of the system becomes the failure cause for the next higher level. This bottom-up failure flow process applies all the way to the system level, as illustrated in Figure 2. This is a generalized and simplified picture assuming a linear hierarchical nature of connections between system elements. In reality, the source of a system failure can be at interactions between various system elements.

Figure 2—Failure process flow (figure: at each level of the system hierarchy, from part through component, assembly, and subsystem to system, a failure mechanism produces a failure mode, which becomes the failure cause at the next higher level)

Failures can be described by their relation to failure precipitation.

— Overstress failure: A failure that arises as a result of a single load (stress) condition. Examples of load conditions that can cause overstress failures are shock, temperature extremes, and electrical overstress.

— Wearout failure: A failure that arises as a result of cumulative load (stress) conditions. Examples of load conditions that cause cumulative damage are temperature cycling, abrasion, and material aging.


— System functional failure: A failure that arises as a result of an anomalous condition of the system output. Examples of anomalous conditions that cause system functional failure are under-voltage input signals, mismatched switching speeds, and sneak paths.

The root cause is the most basic causal factor or factors that, if corrected or removed, will prevent the recurrence of the failure (see ABS Group, Inc., Root Cause Analysis Handbook: A Guide to Effective Incident Investigation). One of the purposes of determining the root cause(s) is to fix the problem at its most basic source so it does not occur again, even in other products, as opposed to merely fixing a failure symptom. Identifying root causes is the key to preventing similar occurrences in the future. Another purpose of determining the root cause(s) is to predict the probability of occurrence of the failure. Examples of sources of root causes of failures are given as follows:

— Design process induced failure causes: Such as design rule violations, design errors resulting from overstressed parts, timing faults, reverse current paths, mechanical interference, software coding errors, documentation or procedural errors, and non-tested or latent failures.

— Manufacturing process induced failure causes: Such as workmanship defects caused by manual or automatic assembly or rework operations, test errors, and test equipment faults.

— Environment induced failure causes: Such as excessive operating temperature, humidity, or vibration; external electromagnetic threshold exceeded; and foreign object damage or mishandling damage.

— Operator or maintenance induced failure causes: Such as operator errors, incorrectly calibrated instruments, false system operating status, and maintenance errors or faulty maintenance equipment.

Many methods are used for root cause analysis, including the cause and effect diagram (Ishikawa diagram/fishbone analysis), failure modes and effects analysis (FMEA), causal factor charts, fault tree analysis (FTA), and Pareto charts. Readers are referred to the following references: ABS Group, Inc., Root Cause Analysis Handbook: A Guide to Effective Incident Investigation; Dew, John R., "In Search of the Root Cause"; Latino, R. L., and Latino, K. C., Root Cause Analysis: Improving Performance for Bottom Line Results; Mobley, R. K., Root Cause Failure Analysis (Plant Engineering Maintenance Series); and Wilson, P. D., Dell, L. D., and Anderson, G. F., Root Cause Analysis: A Tool for Total Quality Management; as well as EIA/JEP131 [B6] and MIL-STD-1629A [B8], for more information on these methods. The reliability prediction process should use these methods for improved product development.


5.1.6 Engineering information quality

The quality of a reliability prediction is generally dependent on the quality of the available engineering information. It is advantageous to determine the quality of the input data as early as possible in the planning stages of the reliability prediction. The following is a list of characteristics that can be used to evaluate the information quality:

— Accuracy: Representing the data value with error bounds.
— Applicability: Suitableness of data for its intended use.
— Completeness: Degree to which all needed attributes are present in the data.
— Consistency: Agreement or logical coherence among data that measure the same quantity.
— Precision: Degree of exactness with which the data or error bounds are stated.
— Relevance: Agreement or logical coherence of the data attributes to the quantity of interest.
— Timeliness: Data item or multiple items that are provided at the time required or specified for the data to be valid.
— Trustworthiness: Degree of confidence in the source of the data.
— Uniqueness: Data values that are constrained to a set of distinct entries where each value is the only one of its kind.
— Utility: Degree to which the data is available in a form that facilitates using it in the manner desired.
— Validity: Conformance of data values to known facts or sound reasoning.
— Verifiability: Degree to which the data can be independently verified.




5.2 Predictions based on field data

Field data represents the actual reliability performance of an item in its actual operational environment. Thus, a reliability prediction based on field data is appropriate for an item already in service (e.g., for logistics planning, warranty reserve, repair department sizing, or future corrective action). Field data is also used when comparing reliability predictions based on test data or analysis with the actual reliability performance of the equipment. The type and quality of field data can range from simple factory ship and return data to sophisticated tracking of installation times, operating hours, and failure times for every unit in service. The ideal data to use for an item's reliability prediction is the field reliability data for that item in the same operating environment. If some information is missing, similar items or similar environments may be found for reliability predictions.

This subclause describes the collection and analysis of field reliability data for reliability predictions. Subclause 5.2.1 describes types of field reliability data, including approximations and adjustments to the data. Subclause 5.2.2 describes field reliability data collection. Field reliability data analysis is the subject of 5.2.3. Subclause 5.2.4 describes how to use field reliability data from existing systems to predict the reliability of new designs. The use of non-failure field reliability data, such as replacements or returns, is described in 5.2.5. Subclause 5.2.6 contains an example of using field reliability data to perform a reliability prediction.

5.2.1 Field reliability data

Reliability predictions based on field data require an estimate of the operating time before failure for failed items and the accumulated operating time for all items that have not failed. This implies that three things are known for each unit: 1) initial operation time, 2) life cycle history and operating profile (along with the operating environment), and 3) failure time (or current time if the item has not failed).9 Field reliability data usually consists of failed units with different failure times intermingled with non-failed units, all of which may have had different installation times.10 In addition, the failures may be due to a number of different causes. This situation is shown in Figure 3. In this figure, the time 0 is the starting time for the operation of unit 1. Units 1 and 5 fail at different operating times due to failure cause X. Unit 3 fails due to failure cause Y. Units 2 and 4 are still operating at the current time, although their initial operating times are different. Unit 4 also is in a non-operating condition for a period of time, as shown by the dashed line in the figure.

9 If the aggregate operating time on installed units and number of failures are the only data available, then an exponential distribution (constant failure rate) is the distribution that must be applied.

10 This type of data is called multiply censored. Units that have not failed are often called suspended items. See Nelson [B10] for more information on censored data.
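
A minimal sketch of the constant-failure-rate estimate under these conditions (illustrative only; per footnote 9, an exponential distribution is assumed, and the unit data below are hypothetical):

```python
# With operating hours known for every unit, the constant-failure-rate
# estimate is total failures divided by total unit operating hours. Units
# that have not failed (suspensions) still contribute their accumulated hours.
units = [
    # (operating_hours, failed)
    (4200, True),    # unit 1, failure cause X
    (6100, False),   # unit 2, still operating
    (3500, True),    # unit 3, failure cause Y
    (5000, False),   # unit 4, still operating
    (2600, True),    # unit 5, failure cause X
]

failures = sum(1 for _, failed in units if failed)
unit_hours = sum(h for h, _ in units)
lam = failures / unit_hours
print(lam * 1e9, "FITs; MTBF =", 1 / lam, "hours")
```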


5.2.1.1 Data approximations

The initial operating time, failure time, operating profile, and failure cause data shown in Figure 3 are critical for accurate predictions but may be unavailable, so approximations for those data may need to be made. Example approximations are shown in Table 4. Approximations for initial operation time, such as installation time or shipment date, are events that occur before initial operation, so estimated delay times may be required to account for the difference between the approximated time and the true time. Approximations for failure time, such as return or replacement time, are events that occur after the failure and may also require estimated delay times. When shipment quantities or numbers of returns are used as approximations, shipment or return dates need to be assigned. For example, uniform shipment rates could be assumed (i.e., the same number of items ship every day of the month).

Operating profile approximations may define cyclical operational periods, possibly in different operating environments, or may simply specify equipment operation some percentage of the time, e.g., continuous operation is 100%. Data analysis may provide a statistical distribution for each failure cause or observed failure mode, or may provide only a single statistical distribution. When a single distribution is utilized, the implicit (but usually unstated) assumption is that there is a single failure cause or that a single distribution can adequately represent the observed set of failure modes or underlying set of failure causes.
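
A minimal sketch of the shipment-quantity approximation from Table 4 (illustrative only; mid-month shipping, a fixed shipment-to-operation delay, and hypothetical counts are assumed):

```python
# Spread monthly shipment counts uniformly (approximated here as mid-month),
# apply an assumed shipment-to-operation delay, and estimate accumulated
# operating hours for a constant-failure-rate prediction.
HOURS_PER_MONTH = 730
shipments = {1: 400, 2: 650, 3: 500}   # units shipped in months 1..3
delay_months = 0.5                     # assumed shipment-to-operation delay
current_month = 6

unit_hours = 0.0
for month, qty in shipments.items():
    start = month - 0.5 + delay_months        # mid-month shipping plus delay
    months_in_service = max(0.0, current_month - start)
    unit_hours += qty * months_in_service * HOURS_PER_MONTH
print(unit_hours)   # accumulated operating hours for the rate estimate
```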

Table 4—Example approximations for field reliability data

| Desired field reliability data | Alternative approximate data | Data adjustments | Notes |
|---|---|---|---|
| Initial operation time | Installation time | Delay from installation to initial operation. | A vendor may know when they deliver and/or install equipment but may not know when the customer actually begins to operate it. |
| Initial operation time | Shipment date | Delay from shipment to initial operation. | A vendor may know when they shipped a unit but not when it is installed or initially operated. |
| Initial operation time | Shipment quantities | Estimated shipment dates; delay from shipment to initial operation. | A vendor may know the number of units shipped per month but may not track them individually. This type of data should be used only for constant failure rate prediction. |
| Failure time | Return or repair data | Some returns may not be failures (see 5.2.1.2). | If failure analysis is not available, see 5.2.5 for non-failure metric calculations. |
| Failure time | Return or replacement time | Delay from failure to return or replacement. | A vendor will probably know when the equipment is returned but may not know the time it actually failed. |
| Failure time | Number of returns | Estimated return date and delay from failure to return. | A vendor may know the number of units returned per month but may not track them individually. This type of data should be used only for constant failure rate (or return/replacement rate) prediction. |
| Failure time | Number of failures or returns for higher-level items | Apportion higher-level failures or returns. | If failure analysis is not available down to the level of the item being analyzed, it may be necessary to apportion failures or returns at higher levels in the system hierarchy to the system hierarchy level of the item being analyzed. |
| Operating profile | Continuous operation | | This may be reasonable if equipment is left powered on most of the time. |
| Operating profile | "Standard" operating profile | | Businesses may have standard operating hours, or there may be standard mission profiles available. |
| Operating profile | Duty cycles | Apply each duty cycle to appropriate percent of population. | Electromechanical devices such as printers often have specified duty cycles with different reliability predictions for each duty cycle. |
| Failure causes (or modes) | Combine all failure causes | Assume a single statistical distribution fits the data. | This is reasonable only if a single distribution can statistically represent the combination of all failure causes/modes. |

Figure 3—Example: field reliability data (timeline for five units from time 0 to the current time: units 1 and 5 fail at different operating times due to failure cause X; unit 3 fails due to failure cause Y; units 2 and 4 are still operating at the current time; unit 4 has a non-operating period, shown as a dashed line)


The approximations in Table 4 must be applied with caution. They will cause statistical estimation of distribution parameters to be less precise because the data is approximated rather than actual. For example, the estimated delay times in Table 4 are averages, so the initial operating time will be the installation time plus an average delay rather than the actual delay for each unit. The variability in the input data may be understated by the approximations, so a distribution fit to these data does not exactly represent the field reliability. On the other hand, field reliability is often measured using the same approximations as the predictions.



5.2.1.2 Data adjustments

There are a number of adjustments that may have to be made to field reliability data before using it in reliability predictions. If one of the approximations described in the previous subclause is used, there may be adjustments to the initial operating time or to the failure time that need to be made, e.g., shipping delays. If units are shipped to a spare parts inventory rather than an operational system, adjustments may need to be made to shipping and installation times and quantities. In addition, there are a number of cases for which replacements or failures may be categorized and possibly discounted:

— Dead on arrival (DOA): A unit that is unable to function immediately following installation.11 Although DOAs should be analyzed, they are often difficult to include in a reliability distribution because their failure time is 0. Approximations can be made to fit DOAs into the infant mortality period, or different metrics such as DOA rate may be used to account for these types of failures as manufacturing or shipping problems.

— Physical or cosmetic damage: A unit that is returned because there is physical or cosmetic damage rather than a functional failure. This may be indicative of a problem with packaging, handling, or transportation. These units might be included in a reliability distribution if the damage causes a functional failure or may be accounted for in other metrics.

— No failure found (NFF): A unit that is returned but passes all the diagnostic tests. For an NFF, it is necessary to determine if the diagnostic tests are insufficient to reveal the failure, if a transient or intermittent failure occurred, or if the unit really is fully operational (and probably should not have been replaced). In the latter case, the unit might not be counted as a failure, and the unit's operational hours could be included in a field reliability calculation as a suspended item (see Annex A). Care should be exercised in removing the item from the failure data since it is often very difficult to diagnose transient or intermittent failures. An NFF can also be treated as a failure in the diagnostics or service manual documentation.

— Inability to troubleshoot: An inability to determine the root cause of the failure. If multiple parts are replaced in a single repair action, then care must be exercised to ensure that a single system-level failure does not end up being counted multiple times. For more information and examples, see Gullo, L., "In-Service Reliability Assessment and Top-Down Approach Provides Alternative Reliability Prediction Method."

5.2.2 Field data collection

Regardless of the type or use of field data, a field failure tracking and reporting system along with a field failure database is essential for providing field data statistics. In addition to the failure reporting, records of initial operating time, operating profile, operating environment, and failure time for each unit should be stored in a database. An example of the type of failure information that needs to be kept is shown in Figure 4. Data on maintenance actions, replacements, and returns should be kept in the Failure Reporting Database to assist in predictions and to aid in corrective action. Replacements include functional restoration (e.g., switching to a backup assembly in a satellite). Returns include detailed failure event data used for diagnostics in lieu of having the failed item to examine. The failure causes in the Failure Reporting Database should be as detailed as possible to allow future design analysis and corrective action as well as reliability predictions. The Failure Reporting Database is often part of a failure reporting and corrective action system (FRACAS). It may also contain inspection and test failure data for analysis or predictions.

Figure 4—Example of field failure reporting database (flowchart: service records/maintenance requests are screened into hardware maintenance actions, filtering out non-hardware and non-maintenance actions such as administrative items and operator comments; maintenance actions are analyzed for replacements, filtering out non-replacements such as repair, realign, or reboot; replacements are analyzed for returns, filtering out non-returns such as throw-away, lost, or inaccessible items; returns are analyzed for failures, filtering out possible non-failures such as NFF or damage; failure causes are recorded in the Failure Reporting Database)

11 The definition of DOA varies. For some items, it may be as simple as a power indicator turning on or not. For others, there may be a complex set of tests that must be passed before the unit is declared operational. DOAs may also be extended to cover a time period, e.g., before the warranty starts or before the system is declared ready for customer use. This type of a measurement may also be called out-of-box quality.


5.2.3 Analysis of field data

After the field data has been collected, statistical analysis tools may be used to help determine trends and identify problems. The first step in any field reliability data analysis is to plot the failure data. If individual unit operating and failure times are available, the data can be used to determine the appropriate statistical distribution.12 There may be several different statistical distributions that represent different failure causes and modes within a single set of field data. If failure analysis results are available and if sufficient data exists, each failure mode and mechanism should be separately analyzed.13 Annex A contains examples of probability plots, hazard plots, and other methods for plotting data and determining the appropriate statistical distribution.

12 It should never be assumed that the data follows an exponential distribution. Plots and goodness-of-fit type tests can be used to determine if the exponential distribution is appropriate. However, if the units are not individually tracked, the exponential distribution is the only statistical distribution that can be applied. Even when an exponential distribution is assumed, there may be a significant amount of data analysis and adjustment required to accurately plot field reliability data.

13 When analyzing modes and mechanisms separately, units that fail due to other failure modes and mechanisms may be treated as suspended items, i.e., items that are still operating at their time of failure.
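
As an illustrative aside (the full plotting methods are in Annex A), a Weibull distribution can be fit to complete failure data by median-rank regression; the failure times below are hypothetical:

```python
import math

# Median-rank regression sketch for complete (uncensored) failure data.
failure_hours = sorted([410, 760, 1030, 1400, 1900, 2600, 3600])
n = len(failure_hours)

xs, ys = [], []
for i, t in enumerate(failure_hours, start=1):
    f = (i - 0.3) / (n + 0.4)                  # Bernard's median-rank approximation
    xs.append(math.log(t))
    ys.append(math.log(-math.log(1.0 - f)))    # linear in ln(t) for a Weibull

xbar, ybar = sum(xs) / n, sum(ys) / n
beta = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / \
       sum((x - xbar) ** 2 for x in xs)
eta = math.exp(-(ybar - beta * xbar) / beta)
print(beta, eta)   # a beta near 1 would suggest a roughly constant hazard rate
```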



It is important to correlate failures with manufacturing builds, process changes, and design changes. Design and process changes are often made to improve reliability.14 Failure data and failure modes and mechanisms can be correlated with:

— Process changes: To help identify process-induced failures.
— Manufacturing builds: To help identify bad lots, problems with specific date codes, or manufacturing process-induced failures. NOTE—It may be possible to approximate manufacturing builds by using month of manufacture (or day, week, etc.).
— Design changes: To demonstrate the effect on reliability of design modifications.
— Operating environment/profile changes: To help identify environment-induced failures or operations/maintenance-induced failures.
— Time: To help identify infant mortality and wearout and specific patterns such as seasonal variations.

Correlation may also show that a perceived reliability distribution is really a combination of different reliability distributions. For example, the plot on the left of Figure 5 shows the hazard rate of an entire field population over time. In this picture, it appears that the hazard rate remains constant. The picture on the right shows that this apparent hazard rate behavior is an artifact of the data and there actually are three different hazard rate curves for three different populations of about the same size that fail by three different failure modes and mechanisms.

Figure 5—Example hazard rate based on field data (left: apparently constant hazard rate of the pooled population versus time; right: three distinct hazard rate curves h1, h2, and h3 versus time)
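
A minimal simulation sketch of this effect (illustrative only; the three Weibull subpopulations and their parameters are hypothetical):

```python
import math, random

random.seed(1)

def weibull_sample(beta, eta):
    """Draw one life from Weibull(beta, eta) by inverting the CDF."""
    return eta * (-math.log(random.random())) ** (1.0 / beta)

# Three equal subpopulations: decreasing, constant, and increasing hazard.
pops = [(0.6, 3000), (1.0, 5000), (3.0, 9000)]   # (shape, scale) per failure mode
lives = [weibull_sample(b, e) for b, e in pops for _ in range(10_000)]

# Piecewise hazard estimate: failures in a window / unit-hours at risk.
for lo in range(0, 8000, 2000):
    hi = lo + 2000
    at_risk = [t for t in lives if t > lo]
    fails = sum(1 for t in at_risk if t <= hi)
    hours = sum(min(t, hi) - lo for t in at_risk)
    print(f"{lo:>5}-{hi:<5} h: hazard ~ {fails / hours:.2e} per hour")
```

The pooled hazard can look roughly flat even though none of the subpopulations has a constant hazard rate, which is the artifact the text describes.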

5.2.4 Similarity analysis

There are many ways in which a new system may be similar to an in-service system. For example, when the new system is compared with an in-service system, it may have a minor design change, similar technology, and a similar operating environment. These similarities permit comparisons that can be used to develop a reliability estimate. When there are similarities between new and in-service systems, reliability may be assessed based on field data for the in-service systems, comparisons of items in the new system with similar items in the in-service systems, and other prediction methods for unique items in the new system.

5.2.4.1 Similarity analysis process

The similarity analysis process is shown in Figure 6; there are six steps:

— Step 1. Select an in-service item that has similarities with the item of interest. Determine the items that have sufficient similarities with existing items to make them candidates for similarity analysis. This includes an examination of the physical and functional characteristics of the items. The appropriate system hierarchy level at which to make a comparison of new and in-service items is selected in this step. This may vary depending on the item and the engineering information available. A close design and operational similarity will improve reliability prediction accuracy.

14 The reliability distributions derived from a series of reliability improvements can be used for reliability growth modeling.




— Step 2. Analyze failure modes, mechanisms, and root causes of new and in-service items. After examining and comparing the engineering information available for the new and in-service products, the next step is to define and compare the failure modes, mechanisms, and root causes for the selected new and in-service items. This information may come from an FMEA or other similar analysis. The level of analysis detail depends on the available engineering information. New item failure modes with low criticality may be aggregated or approximated. The mechanisms and root causes for new item failure modes with high criticality should be examined in detail. Failure modes, mechanisms, and causes that are not similar between the new and in-service items are followed by Step 3, while failure modes/mechanisms/causes that are similar between the new and in-service items are followed by Step 4.

— Step 3. Select an appropriate reliability prediction method. For failure modes, mechanisms, and root causes that are not similar between the new and in-service items, similarity analysis does not apply, and one of the other reliability prediction methods described in this guide may be applied.

— Step 4. Determine field reliability prediction of new and in-service items. For the failure modes, mechanisms, and root causes that are similar between the new and in-service items, a field reliability prediction is performed for the in-service item. If the failure modes, mechanisms, and root causes are identical, then the field reliability prediction for the in-service item may be used for the field reliability prediction for the new item. If they are similar but not identical, the field reliability prediction may be adjusted as described in Step 5.

— Step 5. Adjust field reliability prediction based on similarity between new and in-service items. This step distinguishes similarity analysis from other prediction methods and is described in 5.2.4.2.

— Step 6. Combine reliability predictions to create the new item reliability prediction. In this step, the reliability predictions from similarity analysis are combined with the reliability predictions from other methods.

Figure 6—Similarity analysis process flow (flowchart of the six steps: Step 1 leads to Step 2; non-similar failure modes/mechanisms/causes proceed to Step 3, similar ones to Steps 4 and 5; both branches combine in Step 6)


5.2.4.2 In-service item reliability prediction adjustment

The distinguishing characteristic of similarity analysis is adjusting the in-service item reliability prediction to account for the differences between the in-service item and the new item. A failure mode in an in-service item that is eliminated or reduced in frequency in a new item has the effect of reducing the failure probability of the new item. Table 5, which is a modified version of the Generic Failure Modes Table found in the IEC FMEA standard (see IEC 812 [B2]), gives examples of factors that should cause the reliability for a new item to increase in comparison to an in-service item. The reliability should decrease if the factors were opposite to those shown in Table 5.

5.2.5 Reliability prediction for non-failure metrics

The term “non-failure metrics” refers to metrics such as service call rate or part replacement rate described in 4.1.4.1. Predictions for these metrics are usually based on field experience. The field data can be used either directly or in combination with failure distributions derived from the other prediction methods to predict the metrics. The use of field data and similarity analysis as described in the previous subclause still applies to non-failure metrics: field data may be available for metrics such as warranty return rate, service call rate, or part replacement rate, so it may be possible to use field data for non-failure metrics either directly or via similarity analysis.

The following discussion is an example of predicting some non-failure metrics given a Weibull failure distribution.

Figure 7 shows examples of rate metrics: return rate, warranty claim rate, replacement rate, corrective maintenance rate, and hardware problem call rate. Based on a given failure distribution, an example prediction for each of these metrics is derived as follows. The failure distribution could come from field data or any of the other methods described in this guide.

Table 5—Failure causes and increased reliability

Failure cause: Characteristic of new item that should increase reliability in comparison to an in-service item

Contamination: Fewer parts, fewer foreign objects, or better processing.
Mistimed command: Less complexity in timing circuitry or software commands.
Excessive vibration: Sturdier mounting, fewer parts, or less volatile environment.
Open (electrical): Lower powered circuitry.
Short (electrical): Less dense circuit boards.
Intermittent operation: Less intense or less frequent electrical transients.
Over-temperature: Less heat-sensitivity, better insulation, or cooler operating environment.
Excessive temperature cycling: Lower number of temperature cycles.
Unwanted functions: More testing or greater testability.


— Return rate: Predicted using the failure distribution and repair depot and logistics data. From repair depot data, the No Failure Found (NFF) rate for an item is determined to be 50%. However, from logistics data, only 90% of defective items are returned (the others could be lost in transit or damaged and deemed unrepairable), so predicted return rate = 0.9 x predicted hazard rate/(1 – 0.5) = 1.8 x predicted hazard rate. Note that if items were not returned because customers do not have service contracts or use other service providers, this would need to be factored in.

— Warranty claim rate: A prediction for a warranty claim rate depends on how the warranty works. A vendor may choose to cover under warranty all maintenance actions, all replacements, only returns (meaning a replacement has to be returned for warranty credit), or only failures (meaning the vendor charges for returns that are no defect found). A warranty usually only covers a period of time after product shipment or receipt by the customer, e.g., a 1-year warranty. Assume that a warranty is based on returns as shown in Figure 7, and the warranty period is 1 year. Then the predicted warranty claim rate is the same as the predicted return rate (1.8 x predicted hazard rate from above) for 1 year. If the failure distribution predicted 0.05 failures for the first year, the warranty claim rate would be 0.09 warranty claims (1.8 x 0.05) for the first year, usually quoted as 9% for warranty reserve.

— Replacement rate: Predicted using the failure distribution and repair depot data. Based on the No Failure Found (NFF) rate of 50%, predicted replacement rate = predicted hazard rate/(1 – 0.5) = 2.0 x predicted hazard rate.

— Corrective maintenance rate: Predicted using the replacement rate prediction and field support data. From field support data, 70% of corrective maintenance actions are single part replacements, 10% are double part replacements, and 20% are adjustments (no part replacement). Therefore, the ratio of replacements to corrective maintenance actions is 0.9 (1 x 70% + 2 x 10% + 0 x 20%), and predicted corrective maintenance rate = predicted part replacement rate/0.9 = 2.2 x predicted hazard rate.

— Hardware problem call rate: Predicted using the corrective maintenance rate prediction and call center data. Corrective maintenance actions are determined from call center data to be 80% of hardware problem calls. Therefore, predicted hardware problem call rate = predicted corrective maintenance rate/0.8 = 2.75 x predicted hazard rate.

Note: Numbers in the preceding examples are for illustrative purposes only.
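The relationships above can be collected into a small calculation. The sketch below uses the illustrative NFF rate, return fraction, and call-center fractions from this subclause; the function name and defaults are hypothetical, not part of the guide.

```python
# Sketch of the non-failure metric derivations above, using the guide's
# illustrative numbers (NFF rate, return fraction, etc. are examples only).

def non_failure_metrics(hazard_rate, nff=0.5, returned_fraction=0.9,
                        replacements_per_cm=0.9, cm_fraction_of_calls=0.8):
    """Derive example non-failure metric rates from a predicted hazard rate."""
    replacement_rate = hazard_rate / (1.0 - nff)        # 2.0 x hazard rate
    return_rate = returned_fraction * replacement_rate  # 1.8 x hazard rate
    corrective_maintenance_rate = replacement_rate / replacements_per_cm
    hw_problem_call_rate = corrective_maintenance_rate / cm_fraction_of_calls
    return {
        "return_rate": return_rate,
        "replacement_rate": replacement_rate,
        "corrective_maintenance_rate": corrective_maintenance_rate,
        "hw_problem_call_rate": hw_problem_call_rate,
    }

# Example: 0.05 predicted failures in the first year -> 0.09 warranty claims (9%).
print(non_failure_metrics(hazard_rate=0.05))  # return_rate = 0.09 = 1.8 x 0.05
```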

(Figure: composition of each metric. Return rate: verified failures plus no failure found. Warranty claim rate: verified failures plus no failure found. Replacement rate: verified failures, no failure found, and throw-away items. Corrective maintenance rate: adds adjust/align actions. HW problem call rate: adds calls that are not a problem or not hardware.)

Figure 7—Example non-failure metrics


5.2.6 Example of reliability prediction using field data and similarity analysis

There are many ways to use field data for reliability predictions of similar items and many different ways to adjust and combine reliability predictions (see Gullo, L., “In-Service Reliability Assessment and Top-Down Approach Provides Alternative Reliability Prediction Method;” Johnson, B. G. and Gullo, L., “Improvements in Reliability Assessment and Prediction Methodology;” and Alvarez, M. and Jackson, T., “Quantifying the Effects of Commercial Processes on Availability of Small Manned-Spacecraft”). Constant failure rates for components and subassemblies can simply be summed, and rates derived by one method can be adjusted by multiplicative factors derived from other methods. The remainder of this subclause provides an example of using field data to adjust predicted constant failure rates.

In this example, an initial reliability prediction from a handbook or other method is updated using field data and similarity analysis. The process for combining constant failure rate prediction methods is shown in Figure 8 (see Elerath, J., Wood, A., Christiansen, D., and Hurst-Hopf, M., “Reliability Management and Engineering in a Commercial Computer Environment”). The steps in the process are as follows:

a) New reliability prediction: This is created using one of the methods described in 5.3 through 5.5.

b) Determine field/prediction factors: These factors are developed from ratios of previous product field reliability to previous product predictions. In this example, the new circuit board predicted constant failure rate = 12,556 FITs = 0.110 failures/year from a constant failure rate handbook. The latest 6 months of field data for a similar board is 8,760,000 hours and 10 removals = 0.010 failures/year. The similar board’s predicted constant failure rate from a handbook was 6,283 FITs (0.055 failures/year); therefore the field/prediction factor = 0.010/0.055 = 0.182.

c) Updated reliability prediction: This is created by combining the information from steps a) and b). The updated reliability prediction is created by multiplying the handbook prediction from step a) by the field/prediction factor from step b) to get 0.110 x 0.182 = 0.020 failures/year. For this product, the prediction did not meet the goal of 0.010 failures/year, so process improvements were defined to improve the reliability.

d) Quantify characteristic differences: Using similarity analysis [steps c) through e)], determine the reliability impact of process changes. The distribution of the failure causes for the previous product is as follows: assembly—40%, solder—12%, and components—48% (distributed between DRAM, microprocessor, ASIC, SRAM, and miscellaneous components). Manufacturing process improvements were defined, and it was estimated that these improvements would reduce assembly errors and solder defects by a factor of 3.

e) Updated reliability prediction: This is created by combining the information from steps c) and d). It is possible to continue this process by updating the prediction with test data and field data.
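A minimal sketch of the arithmetic in steps a) through c), using the example numbers above; the FIT conversion assumes 8760 hours per year:

```python
# Sketch of steps a) through c): update a handbook prediction with a
# field/prediction factor from a similar in-service board. Values are the
# example numbers from the text.

FIT_TO_FAILURES_PER_YEAR = 8760e-9  # 1 FIT = 1e-9 failures/hour; 8760 h/year

new_prediction = 12556 * FIT_TO_FAILURES_PER_YEAR           # ~0.110 failures/year

# Field data for the similar in-service board (latest 6 months):
field_rate = 10 / 8_760_000 * 8760                          # 0.010 failures/year
similar_prediction = 6283 * FIT_TO_FAILURES_PER_YEAR        # ~0.055 failures/year

field_prediction_factor = field_rate / similar_prediction   # ~0.182
updated_prediction = new_prediction * field_prediction_factor  # ~0.020 failures/year

print(f"factor = {field_prediction_factor:.3f}, "
      f"updated prediction = {updated_prediction:.3f} failures/year")
```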

(Figure: 1) a new reliability prediction, combined with 2) the field/prediction factor determined from previous product field data, yields 3) an updated prediction; similarity analysis then 4) quantifies characteristic differences to yield 5) the final updated prediction.)

Figure 8—Example reliability prediction based on field data



Table 6 shows all the reliability prediction calculations. The new board is assumed to have the same failure cause distribution as the old board, so the predicted constant failure rate is allocated accordingly. The updated reliability prediction is derived by multiplying by the field/prediction factor as described in step c). This reliability prediction is then adjusted to account for the process improvements described in step d). With the process improvement, the new board just meets the goal of 0.01 failures per year.

5.3 Predictions based on test data

The benefits of reliability predictions based on test data are that they include actual equipment operational experience (albeit in a test environment), and the time required to observe failures can be accelerated to increase the amount of data available. Test data can be used in combination with or as a validation of other methods.

One of the most critical aspects of all reliability tests is careful planning. Tests may be constructed so that they either demonstrate a specific reliability at a specific confidence level or generate valid test hours for general data accumulation. Tests are often conducted to determine or demonstrate the reliability at the component, assembly, subsystem, or system level. Reliability test data at lower levels may be combined to infer reliability at the next higher system hierarchy, if failures resulting from interactions are negligible. The value of test data depends on how well the test environment can be related to the actual use environment. A test should be conducted in a typical operating environment to include failures from sources such as human intervention, thermal environment, electro-magnetic disturbances, and humidity, and to avoid other failures that are not typical of the operating environment.

Subclause 5.3.1 describes some test considerations that apply to all reliability related tests. Subclauses 5.3.2 and 5.3.3 provide guidance for using data from non-accelerated and accelerated life tests, respectively, in reliability predictions. Subclause 5.3.4 provides an example of merging test data and damage simulation results.

Table 6—Constant failure rate prediction example
(columns: failure cause; % allocation per failure cause; constant failure rate per failure cause; previous product field/prediction factor; updated reliability prediction; process improvement factor; failure rate prediction with process improvement)

Assembly defects    40%    0.0440    0.182    0.0080    0.33 (3x)    0.0026
Misc. components    22%    0.0242    0.182    0.0044    1.00         0.0044
Solder defects      12%    0.0132    0.182    0.0024    0.33 (3x)    0.0008
DRAM                10%    0.0110    0.182    0.0020    1.00         0.0020
Microprocessor       8%    0.0088    0.182    0.0016    1.00         0.0016
ASIC                 4%    0.0044    0.182    0.0008    1.00         0.0008
SRAM                 4%    0.0044    0.182    0.0008    1.00         0.0008
Board totals       100%    0.1100    0.182    0.0200    0.50 (2x)    0.0100
Board goal                                                           0.01


5.3.1 Test data considerations

A structured system for collection and storage of data gathered during any test phase is highly desirable. The database should include test start and stop times and dates, as well as test environmental conditions, transients, transient durations, unit responses, etc. If a failure occurs, the database should also store the results of the root cause analysis and identify corrective actions and design changes. This information is useful when determining whether a failure occurred, whether or not it is chargeable, and the test time and conditions prior to failure. At an absolute minimum, the database should contain individual unit test and failure times. The test time should not be aggregated until the analysis of the data confirms that the failure distribution will permit it.

Test times may be collected at the assembly level and used to determine failure distributions, hazard rates, or reliability metrics for constituent elements that comprise the assembly. For example, tests of a computer assembly may be broken down to the capacitor, resistor, microprocessor, driver, etc., and put in the hazard rate database for future predictions at the assembly level.

Some failures may be excluded from the results when analyzing the test data. However, exclusion should be done only after rigorous failure analysis of the failed unit under test is completed and the failure cause can truly be ascribed to the test fixture (hardware), test software, or environmental conditions that will not be present in the actual use environment. Some examples that may justify exclusion include:

— Test fixture failure: The test fixture can make the test unit appear to have failed. For example, if a wire or connector in the fixture fails, signals to or from the test unit or power to the test unit may be lost. If the power supply in the fixture loses regulation, the test fixture may subject the unit to voltages outside the design limits, causing the unit to fail.

— Runaway temperature: If the unit under test experiences temperatures outside the test limits, it may go into a runaway (high) thermal condition that damages the unit.

— Physical damage due to overstress: If the unit is subject to an overstress condition, it may stop functioning properly. Shock, excessive vibration, high levels of electrostatic discharge (ESD), and voltage are possible overstress conditions.

— Data recording storage system out of calibration: If a data recorder is used, it may go out of calibration, causing an erroneous signal that implies the unit under test has failed. If the test unit is connected to a controller using active feedback, the controller may provide (correct) responses that induce failure. Alternatively, the controller itself may be misprogrammed, causing conditions that exceed the test limits of the unit under test and cause failure.

Multiple failures due to the same cause or exhibiting the same mode must not be consolidated. That is, multiple failures due to the same single cause or exhibiting the same single mode or mechanism must all be counted as separate individual failures and not counted as a single failure. The (erroneous) rationale for consolidating failures is that there is only one underlying cause, so it should be counted as only one failure. For example, in the testing of Winchester disk drives, thermal asperities are a significant failure mode. If there were 10 thermal asperities in a reliability demonstration test, they should all be counted separately, resulting in 10 failures. They should not be consolidated and counted as only one failure.

5.3.2 Non-accelerated test data

Non-accelerated tests are conducted at “nominal” load (stress) conditions (e.g., temperatures, power, humidity) within the specification bounds. In these tests, there is no attempt to relate the test temperature, humidity, voltage, or other environmental stimulus to an additional stress level that will increase the hazard rate, thereby reducing the test time.


When analyzing test data, it is important to determine a distribution that provides a good fit to the data and is representative of the failure mechanism type. A good fit is important so that the distribution parameters can be used to extrapolate behavior, or predict the reliability, beyond the period over which the data was generated.15 If the data does not fit any of the more commonly used distributions (see 4.1.3), the ability to predict is severely limited.16

Reliability demonstration tests (RDTs) stay within the design specification environments and may be performed on pre-production or regular production level products. Demonstration tests are often done only once or twice (if the first was unsuccessful) before or near the beginning of production. In a reliability demonstration test, a statistical basis is identified and documented in a test plan.

Guidance for setting up a time-terminated, failure-terminated, or accelerated life test for various distributions is provided in several publications.17 If there is more than one unknown distributional parameter, interpreting the test is very difficult and requires simulation or prior knowledge. An alternative is to simply run a large quantity of systems to accumulate sufficient failures that a distribution can be fit. Then, determine the parameters of the distribution and calculate (predict) the probability of failure for the time interval of interest. In essence, making statements about product reliability for a time interval beyond the test interval is a prediction based on test data.
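As a sketch of the chi-squared approach mentioned in footnote 17 (assuming an exponential time-to-failure distribution and a time-terminated test), a one-sided lower confidence bound on MTBF can be computed as follows; the test hours, failure count, and confidence level below are hypothetical:

```python
# Sketch of the chi-squared bound from footnote 17, assuming an exponential
# time-to-failure distribution (constant failure rate). For a time-terminated
# test: MTBF_lower = 2T / chi2(conf; 2r + 2), where T is total unit-hours
# and r is the number of failures.

from scipy.stats import chi2

def mtbf_lower_bound(total_hours: float, failures: int,
                     confidence: float = 0.90) -> float:
    """One-sided lower confidence bound on MTBF for a time-terminated test."""
    dof = 2 * failures + 2
    return 2.0 * total_hours / chi2.ppf(confidence, dof)

# Example: 500,000 unit-hours with 2 failures, at 90% confidence.
print(f"{mtbf_lower_bound(500_000, 2):,.0f} hours")
```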

5.3.2.1 Manufacturing tests

Several types of data from manufacturing processes can be used for predicting reliability of the same or similar units. The underlying failure distribution must be carefully considered when using this data. Burn-in, run-in, and on-going reliability tests (ORT) are often sources for non-accelerated test data.

Generally, burn-in and run-in tests are performed on 100% of the production line. They are essentially screening tests, designed to remove early failures and marginal product from the production line. Run-in tests are most often conducted at the nominal operating temperature, whereas burn-in tests may be conducted at an elevated temperature, thus making it an accelerated test. If a failure distribution is fitted to the test data, then the probability of surviving time intervals beyond the burn-in or run-in times can be estimated and used in a prediction, assuming that the failure distribution does not change.
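For illustration, a sketch of estimating survival beyond a burn-in period, assuming a Weibull distribution fitted to the data; the burn-in duration and parameters below are hypothetical:

```python
# Sketch: probability of surviving beyond a burn-in period, assuming a
# Weibull failure distribution fitted to burn-in/run-in data and assuming
# the distribution does not change afterward.

import math

def conditional_reliability(t_additional, t_burnin, alpha, beta):
    """R(t_burnin + t_additional) / R(t_burnin) for Weibull(alpha, beta)."""
    def reliability(t):
        return math.exp(-((t / alpha) ** beta))
    return reliability(t_burnin + t_additional) / reliability(t_burnin)

# Example (hypothetical): 48-hour burn-in, one further year of operation,
# alpha = 50,000 h (characteristic life), beta = 0.8 (shape).
print(conditional_reliability(t_additional=8760, t_burnin=48,
                              alpha=50_000, beta=0.8))
```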

ORT is usually conducted in the manufacturing facility, testing units from the current production line. The test is to assess whether there are any significant changes in product quality or reliability. ORT will usually divert a fraction (sample) of the production on a periodic basis for testing. The test usually runs longer than either the burn-in or run-in tests, but still short enough so the product is still considered “new.” The failure modes being sought should be identified and the test constructed to stress the weaknesses of the product for those modes. If the test is too benign, the test results may be overly optimistic. However, if the test environment is too severe, it may become an accelerated test rather than an ORT. The ORT data collected and accumulated is usually the number of failures and total number of hours. If time-to-failure data is collected, the data can be used to determine the underlying failure distribution.

15 For example, suppose the time to failure is available on 1000 units that were run for 1000 hours each under representative environments and conditions. Further assume that the data is plotted on lognormal hazard paper using the techniques shown in Annex A and that the distribution’s parameters, log-mean and log-standard deviation, are calculated. The cumulative probability of failure (unreliability) can be determined for any time interval, even those beyond the 1000-hour test time, if the underlying causes and mechanisms of failure remain the same and it is reasonable to assume the log-normal distribution is still appropriate.

16 Assume the same 1000 units and 1000 hours of test time data is available, as previously discussed. If the distribution is unknown, a cumulative probability of failure can be determined for the first 1000 hours, but because no unit was tested beyond 1000 hours and the distribution is unknown, the expected behavior beyond 1000 hours is unknown. Therefore, estimates of reliability within the first 1000 hours can be made, but predictions of failure (probability of failure) beyond 1000 hours cannot. This latter situation severely limits the ability to predict (extrapolate) reliability at a future point in time.

17 If an exponential failure distribution is assumed, the test can be time terminated or failure terminated and based on the chi-squared statistic. The details of this are documented in numerous reliability texts. For these tests, it is important to make sure that the underlying failure distribution really has a constant failure rate and that “early life” failures have been resolved prior to starting the test. Attempts to demonstrate very low constant failure rates (high MTBFs) may not be practical with these tests.


The environmental conditions during test are usually the same as for product use, so there is no acceleration factor. It is most common to perform ORT at non-accelerated conditions. However, if the test is conducted at elevated temperatures or is otherwise accelerated, acceleration factors may be employed per 5.3.3. Table 7 shows an example of ORT where a rolling set of 80 units is kept under test.

5.3.2.2 Actual usage tests

Testing units in their actual use environment is an excellent source of reliability data for a prediction. These tests are often conducted by the company that designed the system (alpha tests) or at a friendly customer site (beta tests). In either case, it must be understood that the reliability may be lower than desired if the product has not been released for production. In the computer industry, for example, a manufacturer may build and use its own computers to run its internal e-mail. During this test, failures are tracked, analyzed, and corrected. The data can be used later in calculating a constant failure rate or other reliability metric.

Alpha and beta tests usually occur just before a system is ready to ship. Therefore, alpha and beta test data is probably the most applicable to reliability predictions because it is most representative of the final system. Often, failures found in alpha and beta tests are corrected before full production begins, in which case the failures found in these tests may be excluded from the data used to create a prediction.

5.3.3 Accelerated tests

The purpose of accelerated testing is to verify the life-cycle reliability of the product within a short period of time. Thus, the goal in accelerated testing is to accelerate the damage accumulation rate for relevant wearout failure mechanisms (a relevant failure mechanism is one that is expected to occur under life-cycle conditions). The extent of acceleration, usually termed the acceleration factor, is defined as the ratio of the life under life-cycle conditions to that under the accelerated test conditions. This acceleration factor is needed to quantitatively extrapolate reliability metrics (such as time-to-failure and hazard rates) from the accelerated environment to the usage environment, with some reasonable degree of assurance. The acceleration factor depends on the hardware parameters (e.g., material properties, product architecture) of the unit under test (UUT), life-cycle stress conditions, accelerated stress test conditions, and the relevant failure mechanism. Thus, each relevant wearout failure mechanism in the UUT has its own acceleration factor, and the test conditions (e.g., duty cycle, stress levels, stress history, test duration) must be tailored based on these acceleration factors.

Table 7—Example of ORT testing—in this test, 20 samples are swapped with new ones every four weeks, keeping the maximum number under test at eighty

Product taken from production in:   Week 1  Week 2  Week 3  Week 4  Week 5  Week 6
Week 1                                 20      20      20      20
Week 2                                         20      20      20      20
Week 3                                                 20      20      20      20
Week 4                                                         20      20      20
Week 5                                                                 20      20
Week 6                                                                         20
QTY on test                            20      40      60      80      80      80
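A small sketch of the rolling sample plan in Table 7; the batch size and four-week dwell are the example's values:

```python
# Sketch of the rolling ORT sample plan in Table 7: 20 units enter test each
# week and each batch remains on test for four weeks.

WEEKS, BATCH, WEEKS_ON_TEST = 6, 20, 4

for week in range(1, WEEKS + 1):
    # A batch started in week `start` is still on test if week - start < 4.
    on_test = sum(BATCH for start in range(1, week + 1)
                  if week - start < WEEKS_ON_TEST)
    print(f"week {week}: {on_test} units on test")
# Output matches Table 7: 20, 40, 60, 80, 80, 80.
```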


Accelerated life tests attempt to compress the time it takes to observe failures. In some cases, it is possible to do this without actually changing the hazard rate. However, if the hazard function changes, it is termed a “proportional hazard model.” Mathematically, the differences between these two can be seen in the following two equations for a Weibull distribution, in which H_AL(t) is the cumulative hazard function for accelerated life, H_PH(t) is the cumulative hazard function for the proportional hazard model, AF is an acceleration factor due to some sort of stimulus, and (t/α)^β is the unmodified cumulative hazard rate for a Weibull distribution (t = time, α = characteristic life, and β = shape parameter):

H_AL(t) = [(AF x t)/α]^β

H_PH(t) = AF x (t/α)^β

In H_AL(t), time is a linear function of the acceleration factor. In H_PH(t), the hazard function itself is being modified. By rearranging the equation for H_PH(t), it can be seen that time is a non-linear function of the AF; that is, time is multiplied by (AF)^(1/β). The difference between these two types of accelerated tests is that H_AL(t) requires knowledge only of the ratio of the actual test time to calendar time (non-accelerated time) caused by the applied environmental stimulus, whereas H_PH(t) requires knowledge of the manner in which the AF changes as a function of the parameter β. Both of these are discussed in detail in Leemis [B7]. For the Weibull distribution, of which the exponential is a special case, the resultant distribution for either of these two conditions is still a Weibull distribution.
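A numerical sketch of this distinction, with hypothetical parameter values:

```python
# Sketch contrasting the two cumulative hazard forms above for a Weibull
# distribution: H_AL scales time by AF; H_PH scales the hazard itself,
# which is equivalent to scaling time by AF ** (1/beta).

def h0(t, alpha, beta):
    """Unmodified Weibull cumulative hazard, (t / alpha) ** beta."""
    return (t / alpha) ** beta

def h_al(t, af, alpha, beta):
    """Accelerated life: H_AL(t) = (AF * t / alpha) ** beta."""
    return (af * t / alpha) ** beta

def h_ph(t, af, alpha, beta):
    """Proportional hazards: H_PH(t) = AF * (t / alpha) ** beta."""
    return af * (t / alpha) ** beta

t, af, alpha, beta = 1000.0, 10.0, 5000.0, 2.0
print(h_al(t, af, alpha, beta))   # 4.0 == h0 evaluated at AF * t
print(h_ph(t, af, alpha, beta))   # 0.4 == h0 evaluated at AF**(1/beta) * t
assert abs(h_ph(t, af, alpha, beta)
           - h0(af ** (1 / beta) * t, alpha, beta)) < 1e-9
```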

The two most common forms of accelerated life testing are 1) eliminating “dead time” by compressing the duty cycle and 2) reducing the time-to-failure by increasing the stress levels beyond what is expected in the life cycle. Interested readers are referred to the following references: Nelson, Wayne, Accelerated Testing; Jensen, Finn, Electronic Component Reliability; and Lall, P., Pecht, M., and Hakim, E. B., Influence of Temperature on Microelectronics and System Reliability: A Physics of Failure Approach.

“Dead-time” elimination is accomplished by compressing the duty cycle. A good example of duty cycle compression is when a test runs for 24 hours per day, whereas the product in its actual use environment runs for only 8 hours per day. This results in a time compression of 3: each day of test time is equal to 3 days of actual use time. Test data analysis must account for failure modes or mechanisms introduced to reduce dead time. For example, a ball bearing designed for intermittent use may fail due to fretting in its actual expected use environment. However, if an accelerated test is developed that uses the bearing 100% of the time, the fretting mode may not be found due to the lack of sufficient corrosion. Furthermore, new failure modes may be introduced due to increased levels of heat created during the test.

Accelerated stress tests of the second type can be run by enhancing a variety of loads, such as thermal loads (e.g., temperature, temperature cycling, and rates of temperature change); chemical loads (e.g., humidity, corrosive chemicals like acids and salt); electrical loads (e.g., steady-state or transient voltage, current, power); and mechanical loads (e.g., quasi-static cyclic mechanical deformations, vibration, and shock/impulse/impact). The accelerated environment may include a combination of these loads. Interpretation of results for combined loads and extrapolation of the results to the life-cycle conditions requires a quantitative understanding of the relative interactions of the different test stresses and the contribution of each stress type to the overall damage. The stress and damage method discussed in 5.4 provides a basis to interpret the test results.

5.3.3.1 Example

There are numerous models relating the accelerated life to steady-state temperature. The Arrhenius relationship is an example used here for illustrative purposes. The relationship is as follows:

AF = exp[(EA/k)(1/T1 − 1/T2)]


where

AF is the acceleration factor
EA is the activation energy
k is Boltzmann’s constant = 8.617 x 10–5 eV/K
T1 is the “base” temperature, the temperature of the expected use environment, in K
T2 is the “accelerated” temperature, which is usually greater than T1, in K

There is no substitute for experimentally determining the activation energy. However, obtaining this information requires conducting at least two sets of tests at different temperatures. In the absence of experimental data, many activation energy values are available in the public literature. Be sure to use the approximation that best applies to the anticipated failure mechanism. Table 8 gives examples of other acceleration factor models.
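A sketch of the Arrhenius calculation above; the activation energy and temperatures below are hypothetical and must be chosen for the specific failure mechanism:

```python
# Sketch of the Arrhenius acceleration factor AF = exp[(EA/k)(1/T1 - 1/T2)].
# The activation energy here is a placeholder, not a recommended value.

import math

K_BOLTZMANN = 8.617e-5  # eV/K

def arrhenius_af(ea_ev: float, t_use_k: float, t_test_k: float) -> float:
    """Acceleration factor between use and test temperatures (in kelvin)."""
    return math.exp((ea_ev / K_BOLTZMANN) * (1.0 / t_use_k - 1.0 / t_test_k))

# Example: EA = 0.7 eV (assumed), use at 55 C, test at 100 C -> ~20x.
print(arrhenius_af(0.7, 55 + 273.15, 100 + 273.15))
```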

Table 8—Examples of models that can be used to derive acceleration factors

Coffin-Manson inverse power law (simplified acceleration factor for temperature cycling fatigue testing):
  AF = N_use / N_test = [∆T_test / ∆T_use]^β

Inverse power law for vibration (based on Grms, for similar product responses; do not use when product responses are different from one level to the next):
  AF = N_use / N_test = [G_test / G_use]^β

Rudra inverse power law model for conductive filament formation (CFF) failures in printed wiring boards:
  t_f = a f (1000 L_eff)^n / [V^m (M − M_t)]
  where t_f = time to failure (hours); a = filament formation acceleration factor; f = multilayer correction factor; L_eff = effective length between conductors (inches); V = voltage (DC); M = moisture absorbed; M_t = threshold moisture; n = geometry acceleration factor; m = voltage acceleration factor. (Note: No temperature dependence up to ≈50 °C; most CFF occurs below 50 °C.)

Peck’s model for temperature-humidity (Eyring form):
  AF = (M_use / M_test)^(−n) exp[(Ea/k){(1/T_use) − (1/T_test)}]
  where AF = acceleration factor; M_use = moisture level in service; M_test = moisture level in test; T_use = temperature in service use, K; T_test = temperature in test, K; Ea = activation energy for damage mechanism and material; k = Boltzmann’s constant = 8.617 x 10–5 eV/K; n = a material constant. (For aluminum corrosion, n = 3 and Ea = 0.90.)

Kemeny model for accelerated voltage testing:
  Constant failure rate = [exp(C0 − Ea/(k T_j))] [exp(C1 (V_cb/V_cbmax))]
  where T_j = junction temperature; V_cb = collector-base voltage; V_cbmax = maximum collector-base voltage before breakdown; Ea = activation energy for damage mechanism and material; k = Boltzmann’s constant = 8.617 x 10–5 eV/K; C0, C1 = material constants.



5.3.3.2 Cautions on accelerated tests

Accelerated testing should begin by identifying all the possible overstress and wearout failure mechanisms under the operating environment. The load parameters that most directly cause the time-dependent failure should be selected as the acceleration parameters. When increasing the stress levels, care should be taken to avoid overstress failure mechanisms and non-representative material behavior.

Failure due to a particular mechanism can be induced by several acceleration parameters. For example, corrosion can be accelerated by both temperature and humidity, and creep can be accelerated by both mechanical stress and temperature. Furthermore, a single acceleration stress can induce failure by several wearout mechanisms simultaneously. For example, temperature can accelerate wearout damage accumulation not only by electromigration, but also by corrosion, creep, and so on. Failure mechanisms that dominate under usual operating conditions may lose their dominance as the stress is elevated. Conversely, failure mechanisms that are dormant under normal use conditions may contribute to device failure under accelerated conditions. Thus, accelerated tests require careful planning in order to represent the actual usage environments and operating conditions without introducing extraneous failure mechanisms or non-representative physical or material behavior.

In order for an accelerated test to be meaningful, it should not only duplicate the failure mechanisms expected under life-cycle conditions, but it should also be possible to estimate the corresponding acceleration factors. Without acceleration factors, there is no basis for estimating the meaning or relevance of the test. The stress and damage method discussed in 5.4 provides the basis for determining acceleration factors, determining the test period and load levels, as well as the relevance of individual failures.

When conducting the accelerated testing, stress sensors should be used in key locations (preferably close to expected failure sites) so that the response of the UUT to the test environment can be quantitatively recorded. The same sensor can be used to verify the specimen response at the same location under life-cycle loading conditions. These responses must be quantitatively known, either through sensor-based measurements or based on computer simulations, in order to obtain an accurate estimate of the acceleration factors. For example, when conducting accelerated vibration tests of circuit card assemblies, accelerometers and strain gages should be used at key locations to measure the in-plane accelerations and out-of-plane flexure both under accelerated excitation as well as under life-cycle vibration levels. Furthermore, the spectral content of the life-cycle vibration condition should be approximately preserved during the accelerated stress test. Suitable physics-of-failure models (e.g., fatigue models) should then be used to estimate the acceleration factors at the critical failure sites that experience fatigue failures.

Many wearout failure mechanisms are manifested initially as intermittent functional failures. Therefore, failure monitoring during accelerated testing should be conducted in real time while the product is being stressed. Otherwise, the initiation of failure can often be missed and the results will contain non-conservative errors.

5.3.3.3 Overstress tests

Over-stress tests (OST) or highly accelerated life tests (HALT) are conducted during the design process with the intent to stress the product to failure, learn as much as possible about design robustness and weakness through rigorous root cause analyses, then redesign the product to improve resistance to the environments. The purpose of the test is to identify design weaknesses that, due to variability, will eventually show up as failures when larger quantities of the product are used within the design bounds. There is no attempt to predict reliability.

OST and HALT usually include some combination of temperature, vibration, voltage margining, humidity, and on-off power cycling. The initial stress and order of increase or decrease for the various stress levels (stress profiles) are product dependent. Digital electronics may need a different profile than analog devices or inductive devices such as motors.


A common process is to begin well within the design envelope and cycle temperature from high to low while simultaneously inducing 6-axis vibration. Additionally, at various times within the cycle, electric power is turned off and then back on. The temperature extremes and vibration levels are gradually increased until a failure occurs. This type of test is most effective when conducted on assemblies, not individual electronic parts such as ICs. OST and HALT tests are sound reliability engineering practices directed at improving product reliability. However, since the relationship between the stress level and the hazard rate is unknown, the time to failure should not be used as a source of data for a prediction. Certainly, the failures from these tests should not be extrapolated back down to operating conditions to determine reliability within the specifications.

5.3.4 Example of reliability assessment from accelerated test data

This subclause provides an example of assessing durability of Chip Scale Package (CSP) interconnects under quasi-static bending loads caused by keys on a keypad being pressed in a handheld electronic product. The details of this study can be found in Shetty, S., Lehtinen, V., Dasgupta, A., Halkola, V., and Reinikainen, T., “Fatigue of Chip-Scale Package Interconnects due to Cyclic Bending,” and only the key features of this study are summarized here for illustration. The hardware configuration is shown in Figure 9. The life-cycle application condition involves a deflection of 0.1 mm of the printed circuit board (PCB) under the keypad. The loading history is assumed to be a triangular ramp-up and ramp-down over a 1-second period. Taking the life-cycle usage profile (duty cycle) of the equipment into consideration, it is estimated that this bending cycle occurs about 21 times each day.

The accelerated test is intended to verify the life-cycle durability of the solder joints that connect the CSP to the printed circuit board. The failure mechanism of interest is fatigue of the interconnect solder joints due to the cyclic bending of the PCB. The test planning, execution, and data analysis were done in tandem with stress and damage analysis, in accordance with the caveats described in 5.3.3.2. To accelerate the fatigue of the solder joints, the amplitude of the bending deformation is increased in the test. First, the overstress limits for bending are determined. The overstress limits can arise either due to instantaneous fracture of the solder interconnect, or due to a failure mechanism other than cyclic solder fatigue (examples include bond-pad delamination, failures in the CSP, or failures in the PCB), or due to any non-representative material response.

A test specimen, consisting of 33 daisy-chained CSPs per circuit card, was fabricated. All the interconnects were monitored throughout the test sequence to check for intermittent opens, by using the daisy-chains and suitable connectors. A 3-point bend test setup was created to apply bending load in a quasi-static manner. Based on an overstress step-test, it was determined that a maximum displacement of 15 mm could be applied at the center of the test specimen, over a period of one second. A fatigue bending test was performed and failure data was recorded. A failure analysis was done to ensure that the failure mode was relevant (solder joint fatigue). Due to the 3-point bending, each row of CSPs on the specimen experienced a different amount of flexure (measured with strain gages mounted on the specimen), and thus each set of solder joints of the assembly experienced a different level of fatigue load. The cycles to failure for all the components are normalized with respect to the component at the center, which experiences the highest bending curvature κ. The normalized bending load (curvature κ multiplied by half the PWB thickness t/2) versus the cycles to failure (N) is plotted in Figure 10.

(Figure: hardware configuration; cross-sectional view of the CSP assembly showing the molding compound, die, die attach, compliant interposer (polyamide), solder interconnects, copper bond pads, copper and insulator layers, and the PCB.)

Figure 9—Hardware configuration of the CSP assembly


The test setup and the life-cycle loading condition were both simulated using finite element models of the hardware, in order to estimate the acceleration factor as a function of the flexure experienced by the circuit card at the site of each CSP. As an example, the acceleration factor for the row of CSPs at the center of the 3-point bend specimen was calculated to be approximately 10. The interconnects for this row lasted for approximately 8300 cycles, which simulates approximately 83,000 cycles (10 years) under life-cycle conditions. Because of the reasonably large sample size tested here, the variability of the measured lifetime could also be determined, as shown in Figure 10. The subscript on N indicates the failure percentile based on the observed variability.
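The arithmetic of this example, as a sketch; the 21 cycles per day comes from the usage profile above:

```python
# Sketch of the CSP example arithmetic: test cycles times the acceleration
# factor gives equivalent life-cycle bending cycles, converted to years at
# 21 key-press bending cycles per day.

test_cycles = 8300
acceleration_factor = 10
cycles_per_day = 21

life_cycle_cycles = test_cycles * acceleration_factor  # 83,000 cycles
years = life_cycle_cycles / cycles_per_day / 365       # ~10.8 years
print(f"{life_cycle_cycles} cycles ~= {years:.1f} years")
```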

5.4 Reliability predictions based on stress and damage models

The objective of reliability predictions based on stress and damage models is to assess the time to failure and its distribution for a system and its components, evaluating individual failure sites that can be identified and modeled based on the construction of the system and its anticipated life cycle. However, since simulation techniques continue to improve and models for assessing new and known failure mechanisms continue to be developed, this subclause does not attempt, nor is it intended, to provide a complete reference to all of the models that can be used to conduct a reliability prediction based on stress and damage models. Practitioners of this reliability prediction method need to document the acceptance and validity of the simulation and modeling techniques used and be aware of their limitations.

Reliability predictions based on stress and damage models rely on understanding the modes by which a system will fail, the mechanisms that will induce the identified failures, the loading conditions that can produce the failure mechanisms, and the sites18 that are vulnerable to the failure mechanisms. The methodology makes use of the geometry and material construction of the system, its operational requirements (e.g., electrical connectivity), and the anticipated operating (e.g., internal heat dissipation, voltage, current) and environmental (e.g., ambient temperature, vibration, relative humidity) conditions in the anticipated application. The method may be limited by the availability and accuracy of models for quantifying the time to failure of the system. It may also be limited by the ability to combine the results of multiple failure models for a single failure site and the ability to combine results of the same model for multiple stress conditions. However, there are recognized methods for addressing these issues, and research continues to produce improvements.

18 For example, an interconnection. It is essential to know the failure site for predicting failure with this technique.

(Figure: normalized bending load κ x 0.5 x t (approximately 1.00E-04 to 1.00E-03) versus cycles to failure, showing the accelerated test load, the life-cycle load, and prediction curves N50%, N1%, and N0.1% calibrated with experiments; acceleration factor = 10; cycles to failure = 83,000 (approx. 10 years).)

Figure 10—Reliability prediction based on test data


In this approach, reliability predictions depend on the development of a representative model(s) of the system, from which the system’s response to anticipated operating and environmental conditions can be assessed. Stresses19 (responses within the structure to the applied conditions) determined from the simulation are then used as inputs to the damage models, to quantify the likelihood of failure at individual failure sites. The number of failure mechanisms and sites addressed in this method is governed by the availability of stress and damage models, and the quality of the prediction is governed by the quality of the models and the input parameters. The method can be used for identifying and ranking failure sites, determining the design margins and intrinsic (ideal) durability of the system, developing qualification tests for the system, and determining acceleration transforms between test and use conditions. It is recommended that the method be used in combination with physical testing to verify that the modeling has adequately captured the system failures.

Research into failure mechanisms found in electronic systems is actively pursued, and models exist for the majority of failures (see Dasgupta, A., and Pecht, M., “Failure Mechanisms and Damage Models;” Tummala, R. and Rymaszewski, E., Microelectronics Packaging Handbook; Pecht, M., Integrated Circuit, Hybrid, and Multichip Module Package Design Guidelines; and Upadhyayula, K. and Dasgupta, A., “An Incremental Damage Superposition Approach for Interconnect Reliability Under Combined Accelerated Stresses”). However, it should be recognized that models may not exist for all possible failures, and users of this approach should clearly state that the assessment only covers failures that have been modeled. In many cases, the stress and damage models are combined to form a single model, sometimes referred to as a failure model. The stress input is usually derived for a particular condition to which a system is exposed, using an appropriate stress analysis approach. A review of the sensitivity of the assessment to geometric and material inputs, as well as applied loading conditions, should be conducted, and limits of the model and the appropriateness of the model for each assessment situation should be considered. Examples of conditions that are known to induce failure include, but are not limited to, a temperature cycle, a sustained temperature exposure, a repetitive dynamic mechanical load, a sustained electrical bias, a sustained humidity exposure, and an exposure to ionic contamination. Examples of damage include exceeding a material strength, reduction in material strength, removal of material, change in material properties, growth of a conductive path, or separation of joined conductors.

Failure models may be classified as overstress or wearout. Models for overstress calculate whether failure will occur based on a single exposure to a defined stress condition. For an overstress model, the simplest formulation is comparison of an induced stress versus the strength of the material that must sustain that stress. Die fracture, popcorning, seal fracture, and electrical overstress are examples of overstress failures. Models for wearout failures calculate an exposure time required to induce failure based on a defined stress condition. Fatigue, crack growth, creep rupture, stress driven diffusive voiding (SDDV), time dependent dielectric breakdown (TDDB), metallization corrosion, and electromigration are examples of wearout mechanisms. In the case of wearout failures, damage is accumulated over a period until the item is no longer able to withstand the applied load. Therefore, an appropriate method for combining multiple conditions must be determined for assessing the time to failure. Sometimes, the damage due to the individual loading conditions may be analyzed separately, and the failure assessment results may be combined in a cumulative manner, as sketched below. Invariably, the prediction based on stress and damage models relies on failure models that are documented with assumptions and limitations and comparisons to experimental and/or field data.
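One common way to combine damage from separately analyzed loading conditions is linear damage superposition (Miner's rule); the sketch below assumes that rule, and the cycle counts are hypothetical:

```python
# Linear damage superposition (Miner's rule) sketch: sum the damage fraction
# n_i / N_i contributed by each loading condition; a sum of 1.0 or more
# predicts failure.

def miner_damage(cycles_applied_and_to_failure):
    """Sum n / N over loading conditions; >= 1.0 predicts failure."""
    return sum(n / big_n for n, big_n in cycles_applied_and_to_failure)

# Example (hypothetical): thermal cycling and vibration contributions.
damage = miner_damage([(5_000, 20_000),   # thermal cycles applied / to failure
                       (1.0e6, 4.0e6)])   # vibration cycles applied / to failure
print(damage)  # 0.5 -> half of the allowable damage consumed
```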

The variability of the time to failure can be assessed by considering the distribution of the input data in a systematic approach (e.g., Monte Carlo analysis). Alternatively, one could directly apply known distributions, obtained from test or field results that correspond to each specific failure mechanism, to the results of the failure model. Figure 11 depicts time to failure distributions and their means arising from three competing failure sources. A deterministic assessment would consider the single time to failure for each source (i.e., t1, t2, and t3), while a probabilistic assessment would consider the entire time to failure distribution of each source. For the illustration presented in Figure 11, first failure could be attributed to any of the three failure sources; however, failure source number one is clearly dominant. A failure source is a failure that occurs at a specific site.

19 In this discussion, stress refers to the local condition at the failure site, in response to an applied load. For example, vibration (loading) can produce mechanical stresses at an interconnect; or power cycling (load) can produce temperature transients (stress) at an IC gate.
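A minimal Monte Carlo sketch of competing failure sources, in the spirit of Figure 11; the three Weibull distributions and their parameters are hypothetical:

```python
# Monte Carlo sketch of competing failure sources: each source gets a
# time-to-failure distribution, and the item fails at the earliest sampled
# time. Distribution parameters are hypothetical.

import random

def sample_system_ttf(rng: random.Random) -> float:
    """First failure among three competing Weibull sources (alpha, beta)."""
    sources = [(8_000, 2.0), (15_000, 3.0), (25_000, 1.5)]
    return min(rng.weibullvariate(alpha, beta) for alpha, beta in sources)

rng = random.Random(0)
samples = sorted(sample_system_ttf(rng) for _ in range(100_000))
print("median TTF:", samples[len(samples) // 2])
print("1st percentile TTF:", samples[len(samples) // 100])
```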


5.4.1 Stress and damage model reliability prediction method process

A flowchart of the prediction methodology using stress and damage models is depicted in Figure 12 and involves:

— Reviewing geometry and materials of the system (e.g., component interconnects, board metallization, and solder joints), their distributions, and potential manufacturing flaws and defects;

— Reviewing the environmental and operating loading conditions (e.g., voltage, current, relative humidity, temperature, or vibration and their variability) to which the system will be subjected, based on an anticipated operational profile for the system;

— Identifying the modes by which the system can fail (e.g., electrical shorts, opens, or parametric shifts which result from degradation), the locations or sites where the failure would occur, and the mechanisms (e.g., fatigue, fracture, corrosion, voiding, wear, or change in material property) that produce the failure.

— From all possible failure mechanisms identified for the system and components, only a subset of these failure mechanisms will compete to cause the first failure. The following methods can be used to identify the dominant failure mechanisms:

– Field or test data from the system or similar systems;
– Highly accelerated life tests (HALT);
– Vibration/shock/thermal or other environmental data collected on the actual system or similar systems; and
– Engineering judgment.

— The determination of dominant failure mechanisms could be a combination of the above methods. Test and field data could show which failure mechanisms occur during testing conditions or actual field conditions. HALT is testing which uses high stress levels to identify what fails first in a system, although there is no relation to operational life. Environmental data collected on the actual system, and system-level stress analysis, could be used to determine the severity of different environmental stresses.

Figure 11—Time to failure distributions for three competing failure sources



— Identifying physics-of-failure models (e.g., Coffin-Manson fatigue model, crack initiation and Paris fatigue crack growth power law model, creep rupture model, Arrhenius, Eyring) for evaluating the time to failure for the identified failure mechanisms;

— Estimating time to failure and the variation of time to failure based on distributions in the inputs to the failure models;

— Ranking susceptibilities of the system to specific failure mechanisms based on assessment of times to failure, their variation, and their acceptable confidence level.

Figure 12—Generic process of estimating the reliability of an electronic system


5.4.2 Stress and damage model reliability prediction method example

This subclause presents a simple step-by-step example (following the flow chart presented in Figure 12) of applying the stress and damage model approach to reliability prediction, based on an assumed failure mechanism at a few selected failure sites. Examples of evaluating multiple sites/mechanisms at a system level and ranking the time to failure can be found in Engelmaier, W., “Fatigue Life Of Leadless Chip Carriers Solder Joints During Power Cycling,” IEEE Transactions on Components, Hybrids, and Manufacturing Technology, and Dasgupta, A., Oyan, C., Barker, D., and Pecht, M., “Solder Creep-Fatigue Analysis by an Energy-Partitioning Approach.”

Consider a circuit card assembly (CCA) composed of a leadless ceramic capacitor (LCC), a leadless ceramic resistor (LCR), and a 68-pin plastic leaded chip carrier (PLCC) on a printed wiring board. The printed wiring board is 1.5 mm thick, has 4 signal layers, and is constructed of FR4. The CCA is expected to operate for 3 years in a relatively benign environment where it will be powered on one time a day. The ambient environment temperature is expected to be 22 ˚C.

Step 1—Review the geometry and materials of the system

The description in 5.4.2 provides a general discussion of the geometry and material construction of the electronic hardware, its operation, and its anticipated application life cycle. Details of the parts may be obtained from vendor data sheets and related design documentation. For this example, the relevant physical properties of the parts are defined in Table 9. The solder joint height for all parts is 0.1 mm, and the CTE of the board is 17.5 ppm/˚C.

Step 2—Review the load conditions to which the system will be subjected to define its anticipated operational profile

While there are a variety of conditions that can result in the failure of the circuit card, this example is restricted to failure arising from temperature cycling. Temperature cycling is of particular concern, since temperature variations in the assembly, combined with differences in rates of expansion due to temperature, tend to fatigue material interfaces. Since the temperatures of the parts and the printed wiring board will vary based on operation, we need to evaluate or determine the temperature that each part is likely to reach during its anticipated use. This can be accomplished through simulation or measurement. For this example, the daily temperature history of the parts as measured is depicted in Figure 13, and the powered and unpowered part temperatures are provided in Table 10.

Table 9—Part geometry and coefficient of thermal expansion (CTE)

Part | Length (mm) | Width (mm) | CTE (ppm/˚C)

68 PLCC | 24.0 | 24.0 | 22

LCR | 6.3 | 3.2 | 7

LCC | 3.2 | 1.6 | 7


Step 3—Identify potential failure modes, sites, and mechanisms based on expected conditions

Methods for identifying potential failure modes, sites, and mechanisms include using global stress analysis (e.g., coarse finite element analysis), accelerated tests to failure (e.g., HALT or "elephant tests"), or engineering judgment. Failure modes can include opens or shorts in the electrical circuits as well as operational drift. Failure sites may include the individual parts, metallization on the printed wiring board, and part interconnects. Examples of common failures are failure of the capacitor to maintain a charge (see Cunningham, J., Valentin, R., Hillman, C., Dasgupta, A., and Osterman, M., "A Demonstration of Virtual Qualification for the Design of Electronic Hardware"), failure of the semiconductor in the 68-pin PLCC due to electrical opens (see Pecht, M., and Ko, W., "A Corrosion Rate Equation For Microelectronic Die Metallization"), shorting in the PLCC (see Black, J. R., "Physics of Electromigration"), and potential for electrical opens of the solder interconnects (see Engelmaier, W., "Fatigue Life Of Leadless Chip Carriers Solder Joints During Power Cycling," and Dasgupta, A., Oyan, C., Barker, D., and Pecht, M., "Solder Creep-Fatigue Analysis by an Energy-Partitioning Approach"). For the purposes of illustration, opens in the circuit at the solder interconnects due to low cycle fatigue will be evaluated.

Step 4—Identify appropriate failure models and their associated inputs based on identified failure mode, site, and mechanism

Table 10—Part temperatures

Part | Operational (˚C) | Power-off (˚C)

68 PLCC (24 x 24) | 80 | 22

LCR (6.3 x 3.2) | 60 | 22

LCC (3.2 x 1.6) | 60 | 22

[Figure 13 plots the measured part temperature (˚C, 0 to 90) against time of day (minutes, 0 to 1440) for the LCC, LCR, and PLCC parts.]

Figure 13—Daily temperature history of parts


There is a variety of failure models that can be used to estimate the time to failure for low cycle fatigue. Two of the most common are the total inelastic strain range (Coffin-Manson) and the energy partitioning approaches (see Engelmaier, W., "Fatigue Life Of Leadless Chip Carriers Solder Joints During Power Cycling," and Dasgupta, A., Oyan, C., Barker, D., and Pecht, M., "Solder Creep-Fatigue Analysis by an Energy-Partitioning Approach").

For this example, the number of cycles to failure, based on the temperature cycling condition, will be evaluated by a Coffin-Manson low cycle fatigue relationship defined in Equation (1):

\[ N_f = \frac{1}{2}\left(\frac{\Delta\gamma}{2\varepsilon_f}\right)^{1/c} \]    (1)

where c and ε_f are material properties of the joint and Δγ is the strain range of the joint under a cyclic loading condition. Assuming that a eutectic tin-lead solder was used to form the interconnect, the damage model properties in Equation (1) are defined as

\[ \varepsilon_f = 0.325 \]

and

\[ c = -0.442 - 0.0006\,T_s + 0.0172\,\ln\!\left(1 + \frac{360}{t_d}\right) \]    (2)

where T_s is the mean temperature of the solder joint, and t_d is the dwell time in minutes (see Engelmaier, W., "Fatigue Life Of Leadless Chip Carriers Solder Joints During Power Cycling"). Constants for other materials can be found in engineering handbooks, and test methods for experimentally determining constants are well established.

To quantify the time to failure (cycles to failure), it is necessary to evaluate the strain range. The strain range can be evaluated by a number of simulation techniques. Figure 14 provides a simplified representation of the dimensional changes on a part-board assembly under temperature cycling.

Figure 14—Schematic of solder interconnect under temperature cycle


Based on Figure 14, the strain range may be approximated (see Engelmaier, W., "Fatigue Life Of Leadless Chip Carriers Solder Joints During Power Cycling") as

\[ \Delta\gamma = \xi\,\frac{L_d\,(\Delta\alpha)\,\Delta T}{2h} \]    (3)

where L_d is half of the length of the package, Δα is the difference in coefficients of thermal expansion (CTE) between the package and the board, ΔT is the temperature difference between the powered-off state and the operational state, h is the distance from the board to the bottom of the package, and ξ is a model calibration factor.

Step 5—Estimate the time to failure (and its variability) using relevant failure models

Using the failure model listed in Equation (1), the time to failure is estimated for each package. To determine the variation of the time to failure for the individual failure sites, there are two methods: 1) using the variation of the input parameters to determine the variation of outputs (e.g., Monte Carlo techniques); and 2) applying variability to the deterministic result using previous knowledge of what statistical distribution represents the variability (e.g., using the calculated time to failure as the mean/median of a lognormal distribution of time to failure). For this case, if the operation temperature varies by ±5 ˚C, the coefficients of thermal expansion of the package and board vary by ±1 ppm/˚C, and the physical dimensions of the package and interconnect vary by ±0.1 mm, then the cycles to failure can be evaluated by randomly sampling the associated input parameters through a Monte Carlo analysis. Using this process, the distribution of time to failure can be simulated. The results of 1000 runs are presented in Table 11.
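The Monte Carlo procedure described above can be scripted directly from Equations (1) through (3). The following Python sketch does so for the LCR part; the calibration factor ξ, the dwell time, the joint-height tolerance, and the uniform input distributions are illustrative assumptions rather than values given in this guide, so the outputs will not necessarily reproduce Table 11.

import math
import random
import statistics

def cycles_to_failure(L_d, d_alpha, dT, T_s, t_d, h, xi=1.0):
    d_gamma = xi * L_d * d_alpha * dT / (2.0 * h)                     # Equation (3)
    c = -0.442 - 0.0006 * T_s + 0.0172 * math.log(1.0 + 360.0 / t_d)  # Equation (2)
    eps_f = 0.325                                                     # eutectic Sn-Pb
    return 0.5 * (d_gamma / (2.0 * eps_f)) ** (1.0 / c)               # Equation (1)

random.seed(0)
samples = []
for _ in range(1000):
    T_op = random.uniform(55.0, 65.0)        # LCR operating temperature, 60 +/- 5 C
    dT = T_op - 22.0                         # swing relative to the power-off state
    # CTE mismatch: board 17.5 ppm/C vs LCR 7 ppm/C, each varied by +/- 1 ppm/C
    d_alpha = (random.uniform(16.5, 18.5) - random.uniform(6.0, 8.0)) * 1e-6
    L_d = random.uniform(6.2, 6.4) / 2.0     # half package length (mm), +/- 0.1 mm
    h = random.uniform(0.08, 0.12)           # joint height (mm); assumed tolerance
    T_s = (T_op + 22.0) / 2.0                # mean solder joint temperature
    samples.append(cycles_to_failure(L_d, d_alpha, dT, T_s, t_d=720.0, h=h))

# one power cycle per day, so cycles to failure map directly to days to failure
print(f"mean {statistics.mean(samples):.0f}, stdev {statistics.stdev(samples):.0f}, "
      f"min {min(samples):.0f} (days)")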

In some cases, test data may be sufficient to quantify the distribution of failures that arise for a specific failure mechanism. Assuming the distribution is valid over the range of stresses that are being evaluated, the distribution may be applied directly to the model results. This provides an alternative to Monte Carlo analysis.

Step 6—More failure models and/or sites

When reviewing the system, the appropriateness of the failure modeling approach that is used to quantify the likelihood of failure must be evaluated. For certain stress conditions, multiple failure mechanisms may be active. For instance, electromigration of metal that is driven by operating voltage and temperature may cause the integrated circuit in the 68-pin PLCC to fail. Alternatively, the temperature may also accelerate metallization corrosion at the bond pads of the device, driven by ingress of moisture and availability of mobile ions. In cases of multiple failure mechanisms, the dominant failure mechanism is the mechanism that is most likely to cause the first failure (i.e., the mechanism with the lowest predicted time to failure) in the system. If there are a few failure mechanisms that have similar predicted times to failure, these mechanisms will compete for failure of the system. This concept of "competing failure mechanisms" can be addressed probabilistically, as sketched below. In other cases, other stress conditions may produce additional damage to the same site.
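A short simulation illustrates how competing failure mechanisms can be treated probabilistically: sample a time to failure for each active mechanism and take the minimum as the system time to failure. The lognormal parameters below are hypothetical placeholders, not values from this guide.

import math
import random
import statistics

random.seed(0)

# Hypothetical (median days, lognormal sigma) for three competing mechanisms
mechanisms = {
    "solder fatigue":   (1400.0, 0.10),
    "electromigration": (9000.0, 0.40),
    "corrosion":        (20000.0, 0.60),
}

system_ttf = []
first_cause = {name: 0 for name in mechanisms}
for _ in range(10000):
    draws = {name: random.lognormvariate(math.log(median), sigma)
             for name, (median, sigma) in mechanisms.items()}
    cause = min(draws, key=draws.get)   # the mechanism that fails first dominates
    first_cause[cause] += 1
    system_ttf.append(draws[cause])

print(f"median system time to failure: {statistics.median(system_ttf):.0f} days")
print("share of first failures:", {k: v / 10000.0 for k, v in first_cause.items()})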

Table 11—Results of Monte Carlo analysis of 1000 runs

Part | Time to failure mean (days) | Time to failure standard deviation (days) | Minimum observed time to failure (days)

68 PLCC | 50 000 | 1000 | 46 000

LCR | 1400 | 15 | 1350

LCC | 8600 | 600 | 7200



For instance, vibration may induce further damage to the solder interconnect presented in this example. In this case, a method for combining the damage at the interconnect site by temperature cycling and vibration must be developed. Both Miner's rule (see Miner, M. A., "Cumulative Damage in Fatigue") and the incremental damage superposition method (see Upadhyayula, K. and Dasgupta, A., "An Incremental Damage Superposition Approach for Interconnect Reliability Under Combined Accelerated Stresses") have been used to accomplish this goal. Clearly, the accuracy of the method is based on the ability to identify and model failures.
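As a minimal sketch of the simpler of the two, Miner's rule sums the damage fraction contributed by each load condition and predicts failure when the total reaches 1; the cycle counts and capabilities below are illustrative assumptions.

def miners_damage(loads):
    """loads: iterable of (applied cycles n_i, cycles-to-failure capability N_i)."""
    return sum(n / N for n, N in loads)

# e.g., 800 thermal cycles against a 1400-cycle capability plus vibration cycles
damage = miners_damage([(800, 1400), (2.0e6, 1.0e7)])
print(f"accumulated damage index: {damage:.2f} (failure predicted at 1.0)")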

Step 7—Rank failures based on time to failure and determine failure site with the minimum time to failure

Ignoring other failure mechanisms for simplicity, the results of the analysis presented in Table 11 indicate that the most likely failure site is the interconnect of the LCR part. Based on the Monte Carlo analysis, the mean time to failure for the system is 1400 days, with the LCR identified as the failure site.

5.5 Reliability prediction based on handbooks

Handbook prediction methods are appropriate only for predicting the reliability of electronic and electrical components and systems that exhibit constant failure rates. All handbook prediction methods contain one or more of the following types of prediction:

a) Tables of operating and/or non-operating constant failure rate values arranged by part type,
b) Multiplicative factors for different environmental parameters20 to calculate the operating or non-operating constant failure rate, and
c) Multiplicative factors that are applied to a base operating constant failure rate to obtain the non-operating21 constant failure rate.

Reliability prediction for electronic equipment using handbooks can be traced back to MIL-HDBK-217, published in 1960, which was based on curve fitting a mathematical model to historical field failure data to determine the constant failure rate of parts. Several companies and organizations, such as the Society of Automotive Engineers (SAE) (see SAE G-11 Committee, Aerospace Information Report on Reliability Prediction Methodologies for Electronic Equipment AIR5286), Bell Communications Research (now Telcordia) (see Telcordia Technologies, Special Report SR-332: Reliability Prediction Procedure for Electronic Equipment, Issue 1), the Reliability Analysis Center (RAC) (see Denson, W., "A Tutorial: PRISM"), the French National Center for Telecommunication Studies (CNET, now France Telecom R&D) (see Union Technique de L'Electricité, Recueil de données des fiabilite: RDF 2000), Siemens AG (see Siemens AG, Siemens Company Standard SN29500), Nippon Telegraph and Telephone Corporation (NTT), and British Telecom (see British Telecom, Handbook of Reliability Data for Components Used in Telecommunication Systems, Issue 4), decided that it was more appropriate to develop their own 'application-specific' prediction handbooks for their products and systems.22 In most cases, they adapted the MIL-HDBK-217 philosophy of curve-fitting field failure data to some model of the form given in Equation (4).

\[ \lambda_P = f(\lambda_G, \pi_i) \]    (4)

where λP is the calculated constant part failure rate, λG is an assumed (generic) constant part failure rate, and πi is a set of adjustment factors for the assumed constant failure rates. What all of these handbook methods have in common is that they either provide or calculate a constant failure rate. The handbook methods that calculate constant failure rates use one or more multiplicative factors (which may include factors for part quality, temperature, design, environment, and so on) to modify a given constant base failure rate.

20Many handbook prediction methods mistakenly name this method as stress modeling. The handbooks do not model the stress at points of failure but only assume values of multiplicative factors to relate to different environmental and operational conditions.
21Non-operating reliability predictions are made for one of the following conditions: (a) stored prior to operation [see Pecht, J. and Pecht, M., Long-Term Non-Operating Reliability of Electronic Products] or (b) dormant while in a normal operating environment.
22Several of these handbooks are effectively unavailable as the sponsoring organizations have stopped updating the handbooks and associated databases.



The constant failure rate models used in some of the handbooks are reportedly obtained by performing a linear regression analysis on the field data.23 The aim of the regression analysis is to quantify the expected theoretical relationship between the constant part failure rate and the independent variables. The first step in the analysis is to examine the correlation matrix for all variables, which shows the correlation between the dependent variable (the constant failure rate) and each independent variable. The independent variables used in the regression analysis include factors such as the device type, package type, screening level, ambient temperature, and the application stresses. The second step is to apply stepwise multiple linear regression to the data, which expresses the constant failure rate as a function of the relevant independent variables and their respective coefficients. The constant failure rate is then calculated using the regression formula and the input parameters.
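The flavor of this curve fitting can be sketched as a log-linear least-squares regression in which indicator variables stand in for the categorical factors, so that each fitted coefficient exponentiates into a multiplicative π factor. The records below are synthetic and purely illustrative; actual handbook development used much larger proprietary data sets.

import numpy as np

# Synthetic illustrative records: (environment, quality, observed failure rate in FITs)
records = [
    ("ground", "screened", 12.0), ("ground", "commercial", 30.0),
    ("airborne", "screened", 55.0), ("airborne", "commercial", 140.0),
    ("ground", "commercial", 28.0), ("airborne", "screened", 60.0),
]

# Design matrix: intercept, airborne-environment indicator, commercial-quality indicator
X = np.array([[1.0, env == "airborne", qual == "commercial"]
              for env, qual, _ in records], dtype=float)
y = np.log([rate for _, _, rate in records])

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
lam_base = np.exp(beta[0])                      # intercept -> base constant failure rate
pi_E, pi_Q = np.exp(beta[1]), np.exp(beta[2])   # coefficients -> multiplicative pi factors
print(f"lambda_base = {lam_base:.1f} FITs, pi_E = {pi_E:.2f}, pi_Q = {pi_Q:.2f}")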

The regression analysis does not eliminate data entries lacking essential information, since the scarcity of data necessitates that all data be utilized. To accommodate such data entries in the regression analysis, a separate "unknown" category may be constructed for each potential factor where the required information is not available. A regression factor can be calculated for each "unknown" category, considering it a unique operational condition. If the coefficient for the unknown category is significantly smaller than the next lower category or larger than the next higher category, it can be decided that the factor in question cannot be quantified by the available data and that additional data is required before the factor can be fully evaluated.

A constant failure rate model for the non-operating condition can be extrapolated by eliminating from the handbook prediction models all operation-related stresses, such as temperature rise or electrical stress ratio. However, non-operating components were not included in the field data that was used to derive the models. Therefore, using handbooks such as MIL-HDBK-217F to calculate constant non-operating failure rates is essentially an extrapolation of the empirical relationship of the source field data beyond the range in which it was gathered.

Some of the concerns regarding technical assumptions associated with the development of handbook-based methodology are: limitation to constant failure rate assumptions (see O'Connor, P. D. T., "Statistics in Quality and Reliability. Lessons from the Past, and Future Opportunities"), emphasis on steady state temperature dependent failure mechanisms (see Hakim, E. B., "Reliability Prediction: Is Arrhenius Erroneous"), factors based on burn-in and screening tests that predetermine superior reliability of ceramic/metal package types over plastic packages (see MIL-HDBK-217F), and assumption of higher constant failure rates for newer technologies (see Pease, R., "What's All This MIL-HDBK-217 Stuff Anyhow?"). The users of the handbook methods need to consider these concerns and decide how they affect their reliability prediction.

The handbook prediction methods described in this subclause are MIL-HDBK-217F, SAE's Reliability Prediction Method, Telcordia SR-332, CNET Reliability Prediction Method, and PRISM.

5.5.1 MIL-HDBK-217F

The MIL-HDBK-217 reliability prediction methodology was developed under the preparing activity of the Rome Air Development Center (now Rome Laboratory). The last version of the methodology was MIL-HDBK-217 Revision F Notice 2, which was released on February 28, 1995. The last issue of this handbook prohibits its use as a requirement. In 2001, the office of the U.S. Secretary of Defense stated that "…. the Defense Standards Improvement Council (DSIC) made a decision several years ago to let MIL-HDBK-217 'die a natural death.' This is still the current OSD position, i.e., we will not support any updates/revisions to MIL-HDBK-217." (See Desiderio, George, "FW: 56/755/NP/ Proposed MIL Std 217 Replacement.")

The stated purpose of MIL-HDBK-217 was "… to establish and maintain consistent and uniform methods for estimating the inherent reliability (i.e., the reliability of a mature design) of military electronic equipment and systems. The methodology provided a common basis for reliability predictions during acquisition programs for military electronic systems and equipment. It also established a common basis for comparing and evaluating reliability predictions of related competitive designs. The handbook was intended to be used as a tool to increase the reliability of the equipment being designed." (See MIL-HDBK-217F.)

23It is not known if all the different handbooks use this form of regression analysis.

MIL-HDBK-217 provides two constant failure rate prediction methods: parts count and parts stress. The MIL-HDBK-217F parts stress method provides constant failure rate models for electronic parts based on curve-fitting the empirical data obtained from field operation and test. The models have a constant base failure rate modified by environmental, temperature, electrical stress, quality, and other factors. Both methods use the formulation of Equation (4), but one method assumes there are no modifiers to the generic constant failure rate. The MIL-HDBK-217 methodology only provides results for parts, and not for equipment or systems.

5.5.2 SAE’s reliability prediction methodology

The SAE reliability prediction methodology was developed by the Reliability Standards Committee of the Society of Automotive Engineers (SAE) and was implemented through a software package known as PREL. The last version of the software (PREL 5.0) was released in 1990.24 The stated purpose of this methodology was "to estimate the number of warranty failures of electronic parts used in automotive applications as a function of the pertinent component, assembly, and design variables." (See SAE G-11 Committee, Aerospace Information Report on Reliability Prediction Methodologies for Electronic Equipment AIR5286.)

The methodology was developed using MIL-HDBK-217 data combined with automotive field data and empirical data analyses on automotive data collected by the SAE Reliability Standards Committee. The methodology's database included information on part type, screening level, package type, and location in the vehicle (see SAE G-11 Committee, Aerospace Information Report on Reliability Prediction Methodologies for Electronic Equipment AIR5286), but the actual data sources are kept anonymous. The component constant failure rates were predominantly derived from the warranty records of the participating companies. It was also assumed that all vehicles are operated 400 hours/year when calculating time to failure from warranty return time.

It was reported (see Denson, W. and Priore, M., "Automotive Electronic Reliability Prediction") that the modifying factors for constant failure rates were obtained through regression analysis followed by model validation through residual analysis, examination of outliers, and examination against (constant) zero failure rates. Some factor values were later modified manually by the developers to account for being "intuitively incorrect." This method provides for what it calls a "first approximation of the non-operating effect," with only component type as the independent variable.

5.5.3 CNET reliability prediction method

The development of the CNET reliability prediction methodology was led by the Centre National d'Etudes des Telecommunications (CNET) of France (now France Telecom R&D), which carried out this work in conjunction with the work at the Institut de Sûreté de Fonctionnement. The most recent version of this methodology is RDF 2000, which was released in July 2000.

The RDF 2000 methodology is available from the Union Technique de L'Electricité (UTE) of France as standard UTE C80-810, which targets surface-mounted parts. The UTE C80-810 standard has been developed using field failure data for parts operating in ground equipment or employed in commercial aircraft, spread out over the period 1990–1998 (1990–1992 for avionics data). The data is extrapolated to cover military, space, and automotive applications. The data is mainly taken from electronic equipment operating in ground: stationary (or fixed); ground: non-stationary; and airborne: inhabited environments (see Kervarrec, G., Monfort, M. L., Riaudel, A., Klimonda, P. Y., Coudrin, J. R., Razavet, D. Le, Boulaire, J. Y., Jeanpierre, P., Perie, D., Meister, R., Casassa, S., Haumont, J. L., and Liagre, A., "Universal Reliability Prediction Model for SMD Integrated Circuits Based on Field Failures").

24SAE has now discontinued the use of PREL due to difficulties in maintaining the database needed for the software.


5.5.4 Telcordia SR-332

Telcordia (previously known as Bellcore) SR-332 is a reliability prediction methodology developed by Bell Communications Research (or Bellcore) primarily for telecommunications companies (see Telcordia Technologies, Special Report SR-332). Bellcore, which previously was the telecommunications research arm of the Regional Bell Operating Companies (RBOCs), is now known as Telcordia Technologies. The most recent revision of the methodology is dated May 2001.

The stated purpose of Telcordia SR-332 is "to document the recommended methods for predicting device and unit hardware reliability (and also) for predicting serial system hardware reliability" (see Telcordia Technologies, Special Report SR-332). The methodology is based on empirical statistical modeling of commercial telecommunication systems whose physical design, manufacture, installation, and reliability assurance practices meet the appropriate Telcordia (or equivalent) generic and system-specific requirements (see Hughes, J. A., "Practical Assessment of Current Plastic Encapsulated Microelectronic Devices"). In general, Telcordia SR-332 adapts the equations in MIL-HDBK-217 to represent what telecommunications equipment experiences in the field. Results are provided as a constant failure rate, and the handbook provides the upper 90% confidence-level point estimate for the constant failure rate.

The main concepts in MIL-HDBK-217 and Telcordia SR-332 are similar, but Telcordia SR-332 also has the ability to incorporate burn-in, field, and laboratory test data, using a Bayesian analysis. For example, Telcordia SR-332 contains a table of the "first-year multiplier," which is the predicted ratio of the number of failures of the part in the first year of operation in the field to the number of failures of the part in another one year of (steady state) operation. This table contains the first-year multiplier for each value of the part device burn-in time in the factory. Here, the part's total burn-in time can be obtained as the sum of the burn-in times at the part, unit, and system level.

5.5.5 PRISM

PRISM is a reliability assessment method developed by the Reliability Analysis Center (RAC) (see Denson, W., Keene, S., and Caroli, J., "A New System-Reliability Assessment Methodology;" and Reliability Assessment Center, PRISM, Version 1.3). The method is available only as software; the latest version of the software is Version 1.3, released in June 2001.

PRISM combines "empirical" data of users with the built-in database using Bayesian techniques. In this technique, new data is combined in a "weighted average" method, but there is no new regression analysis. PRISM includes some non-part factors such as interface, software, and mechanical problems.

PRISM calculates assembly and system-level constant failure rates in accordance with similarity analysis (see 5.2.4), which is an assessment method that compares the actual life cycle characteristics of a system with predefined process grading criteria, from which an estimated constant failure rate is obtained. The component models used in PRISM are called RACRates™ models and are based on historical field data acquired from a variety of sources over time and under various undefined levels of statistical control and verification.25

Unlike the other handbook constant failure rate models, the RACRates™ models do not have a separate factor for part quality level. Quality level is implicitly accounted for by a method known as process grading. Process grades address factors such as design, manufacturing, part procurement, and system management, which are intended to capture the extent to which measures have been taken to minimize the occurrence of system failures.

25Motorola has evaluated the software and the data, and Pradeep Lall of Motorola reports that much of the data comes from MIL-HDBK-217.


The RACRates™ models consider separately the following five contributions to the total component constant failure rate: 1) operating conditions, 2) non-operating conditions, 3) temperature cycling, 4) solder joint reliability, and 5) electrical overstress (EOS). It needs to be noted that the solder joint failures are combined without consideration of the board material or solder material. These five factors are not independent; for example, solder joint failures depend on the temperature cycling parameters. A constant failure rate is calculated for solder joint reliability, although solder joint failure is primarily a wearout failure mechanism due to cyclic fatigue (see Dasgupta, A., "Failure Mechanism Models For Cyclic Fatigue").

PRISM calculates non-operating constant failure rates with the following assumptions. The daily or seasonal temperature cycling high and low values that are assumed to occur during storage or dormancy represent the major contribution to the non-operating constant failure rate value. The solder joint contribution to the non-operating constant failure rate value is represented by reducing the internal part temperature rise to zero for each part in the system. Lastly, the probability of electrical overstress (EOS) or electrostatic discharge (ESD) contribution to the non-operating constant failure rate value is represented by the assumption that the EOS constant failure rate is independent of the duty cycle. This accounts for parts in storage affected by this failure mode due to handling and transportation.

5.5.6 Non-operating constant failure rate predictions

MIL-HDBK-217 did not have specific methods or data related to the non-operational failure of electronic parts and systems. Several different methods were proposed in the 1970s and 1980s to estimate non-operating constant failure rates. The first methods used multiplicative factors based on the operating constant failure rates obtained using other handbook methods. Reported values of such multiplicative factors are 0.03 and 0.1. The first value of 0.03 is reportedly obtained from an unpublished study of satellite clock failure data from 23 failures. The value of 0.1 is based on a RADC study from 1980 (see Rome Air Development Center, RADC-TR-80-136). RAC followed up these efforts with the RADC-TR-85-91 method (see Rome Air Development Center, RADC-TR-85-91). This method was projected as an equivalent of MIL-HDBK-217 for non-operating conditions and contained the same number of environmental factors and the same type of quality factors as the MIL-HDBK-217 document current at the time of development of the method (see Rooney, J. P., "Storage Reliability"). Some other non-operating constant failure rate tables from the 1970s–1980s include: MIRADCOM Report LC-78-1, RADC-TR-73-248, and NONOP-1.26
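The multiplicative-factor approach amounts to a one-line scaling of an operating prediction; a minimal sketch using the 0.03 and 0.1 factors reported above (the operating rate is an illustrative value):

def non_operating_rate(operating_rate_fits, factor):
    """Scale an operating constant failure rate into a dormant/storage estimate;
    reported factors are 0.1 (RADC-TR-80-136) and 0.03 (satellite clock study)."""
    return factor * operating_rate_fits

lam_op = 50.0  # operating constant failure rate in FITs (illustrative)
print(non_operating_rate(lam_op, 0.1), non_operating_rate(lam_op, 0.03))  # 5.0 1.5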

5.5.7 Examples of constant failure rate predictions

This subclause includes examples of handbook reliability predictions; the examples are taken from Telcordia methods. In the first example, the unit constant failure rate in the Telcordia parts count method is given by the formula:

\[ \lambda_{SS} = \pi_E \sum_{i=1}^{n} \lambda_{Gi}\,\pi_{Qi}\,\pi_{Si}\,\pi_{Ti}\,N_i \]    (5)

where

λGi is the generic constant failure rate for the ith device type,

πQi is the quality factor for the ith device type,

πSi is the stress factor for the ith device type,

πTi is the temperature factor for the ith device type,

Ni is the number of type i devices in the unit,

n is the total number of device types, and

πE is the unit environmental factor.

26The U.S. Army Missile Research and Development Command (MIRADCOM) Report LC-78-1, published in 1978, contains non-operating constant failure rate data by part type along with a 90% upper confidence limit. RADC-TR-73-248, published in 1973, contains non-operating constant failure rates that were developed by Martin-Marietta under a RADC contract. NONOP-1, published in 1987, is based on non-operating field and test data of electronic and mechanical devices.

5.5.7.1 Example 1: Parts count method

To illustrate the application, an example of a circuit card that consists of 3 types of components is used. Table 12 gives the assumed generic constant failure rates and the π factors associated with each part type, along with the total constant failure rate for the circuit card (the environment factor πE is equal to 1). Thus, the predicted constant failure rate for the circuit card under consideration, using the parts count method, is 120 FIT.

Table 12—Example of part count method

Part type i | Number of components of type i | λGi | πQi | πSi | πTi | Total failure rate for type i
1 | 5 | 10 | 1 | 1 | 1.2 | 60
2 | 2 | 15 | 1 | 0.8 | 1 | 24
3 | 1 | 30 | 1 | 1.5 | 0.8 | 36
Total | | | | | | 120
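The Table 12 calculation can be reproduced with a few lines of Python; this is a direct transcription of Equation (5) with πE = 1:

# (N_i, lambda_Gi, pi_Qi, pi_Si, pi_Ti) for the three part types in Table 12
parts = [
    (5, 10.0, 1.0, 1.0, 1.2),
    (2, 15.0, 1.0, 0.8, 1.0),
    (1, 30.0, 1.0, 1.5, 0.8),
]
pi_E = 1.0

lam_ss = pi_E * sum(n * lam_g * pi_q * pi_s * pi_t
                    for n, lam_g, pi_q, pi_s, pi_t in parts)   # Equation (5)
print(f"unit constant failure rate: {lam_ss:.0f} FITs")        # 60 + 24 + 36 = 120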

5.5.7.2 Example 2: Combining laboratory data with parts count data

The constant failure rate for combining laboratory data with parts count or parts stress data is given by the formula:

\[ \lambda_{SS} = \pi_E\,[\,w\,\lambda_G + (1 - w)\,\lambda_{LAB}\,] \]    (6)

where

λG is the generic constant failure rate,

λLAB is the laboratory constant failure rate incorporating the effective burn-in time, if applicable,

πE is the environment factor, and

w is the weight assigned to the generic constant failure rate.27

The constant failure rate for combining field tracking data is obtained by replacing λLAB in Equation (6) by λFIELD.
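A minimal sketch of the Equation (6) weighting; the generic rate, observed rate, and weight below are illustrative values, not figures from SR-332:

def combined_rate(lam_generic, lam_observed, w, pi_E=1.0):
    """Equation (6): blend a generic constant failure rate with laboratory
    (or, equivalently, field tracking) data; w weights the generic rate."""
    return pi_E * (w * lam_generic + (1.0 - w) * lam_observed)

# e.g., generic 120 FITs, laboratory estimate 80 FITs, weight 0.3 on the generic rate
print(combined_rate(120.0, 80.0, w=0.3))  # 92.0 FITs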


27The weight assigned to the generic failure rate for any part of interest is based on the assumption that the generic failure rate of the part is computed such that 2 failures were observed during the normal operating hours of the part of interest.




5.5.8 An example of non-operating constant failure rate predictions

In this example, dormant constant failure rates are calculated for varying types of resistors using RADC-TR-85-91, the MIL-HDBK-217F (Notice 2) parts count method, and PRISM. The following assumptions apply: 1) the environment is ground benign, 2) the duty cycle is 0% in PRISM, and 3) the temperature rise is 0 ˚C in PRISM. The predicted non-operating constant failure rates in FITs for each method are shown in Table 13. As shown in Table 13, the predicted non-operating constant failure rates vary by large factors amongst the prediction methods. Similarly, the predicted non-operating constant failure rates vary by orders of magnitude amongst component types, especially for PRISM and RADC-TR-85-91.

Table 13—Predicted non-operating constant failure rate values from handbook methods

Part type | PRISM | RADC-TR-85-91 | RAC tool-kit | 10% rule-of-thumb | 3% rule-of-thumb | MIRADCOM LC-78-1 | RADC-TR-73-248
Fixed, carbon composition (RC/RCR) | 2.07 | 0.15 | 1.32 | 0.66 | 0.20 | < 0.06 | 0.07
Fixed, film (RN) | 1.10 | 0.24 | 2.22 | 1.11 | 0.33 | 0.02 | 3.00
Fixed, network, film (RZ) | 2.49 | 1.03 | 0.96 | 0.48 | 0.14 | < 909.10 | N/A
Fixed, wirewound, power (RW) | 1.63 | 1.37 | 3.90 | 1.95 | 0.59 | 1.49 | 0.50
Fixed, thermistor (RTH) | 6.75 | 6.48 | 0.84 | 0.42 | 0.13 | 16.90 | 30.00
Variable, wirewound (RT) | 12.10 | 2.38 | 1.44 | 0.72 | 0.22 | 3.79 | 50.00

5.5.9 Comparison of handbook methods for predicting operating reliability

Handbook prediction methodologies have been extensively studied and compared, and results are readily available in the literature (see SAE G-11 Committee, Aerospace Information Report on Reliability Prediction Methodologies for Electronic Equipment AIR5286; Kervarrec, G., Monfort, M. L., Riaudel, A., Klimonda, P. Y., Coudrin, J. R., Razavet, D. Le, Boulaire, J. Y., Jeanpierre, P., Perie, D., Meister, R., Casassa, S., Haumont, J. L., and Liagre, A., "Universal Reliability Prediction Model for SMD Integrated Circuits Based on Field Failures;" Bowles, J. B., "A Survey of Reliability-Prediction Procedures for Microelectronic Devices;" Jones, J. and Hayes, J., "A Comparison of Electronic-Reliability Prediction Models;" Pecht, M. and Nash, F., "Predicting the Reliability of Electronic Equipment;" Cushing, M. J., Mortin, D. E., Stadterman, T. J., and Malhotra, A., "Comparison of Electronics-Reliability Assessment Approaches;" Leonard, C. T., "How Failure Prediction Methodology Affects Electronic Equipment Design;" O'Connor, P. D. T., "Reliability Prediction: A State-Of-The-Art Review;" O'Connor, P. D. T., "Undue Faith in US MIL-HDBK-217 for Reliability Prediction;" and O'Connor, P. D. T., "Reliability Prediction: Help or Hoax."). The handbook reliability prediction methods described above are compared with other reliability prediction methods in 5.6 with respect to IEEE 1413 criteria.

5.6 Assessment of reliability prediction methodologies based on IEEE 1413 criteria

Reliability prediction method selection should be based on how well the prediction satisfies the user's objectives. IEEE Std 1413-1998 was developed to identify the key required elements for an understandable and credible reliability prediction, and to provide its users with sufficient information to evaluate prediction



methodologies and to effectively use their results. A prediction made according to this standard includes sufficient information regarding the inputs, assumptions, and uncertainties associated with the methodology used to make the prediction, enabling the user to understand the risks associated with the methodology. IEEE Std 1413-1998 was formulated to enable the industry to capitalize on the positive aspects of the available prediction methodologies and to benefit from the flexibility of using various methodologies, as appropriate, during equipment development and use.

5.6.1 IEEE 1413 compliance

IEEE Std 1413-1998 identifies the framework for the reliability prediction process for electronic systems (products) and equipment. Since the reasons for performing a reliability prediction vary (e.g., feasibility evaluation, comparing competing designs, spares provisioning, safety analysis, warranties, and cost assessment), a clear statement of the intended use of prediction results obtained from an IEEE 1413-compliant method is required to be included with the final prediction report. Thus, an IEEE 1413-compliant reliability prediction report must include:

— Reasons why the reliability predictions were performed
— The intended use of the reliability prediction results
— Cautions as to how the reliability prediction results must not be used
— Where precautions are necessary

An IEEE 1413-compliant reliability prediction report should also identify the method used for the prediction and identify the approach, rationale, and references to where the method is documented. In addition, the prediction report should include:

— Definition of failures and failure criteria
  – Predicted failure modes
  – Predicted failure mechanisms
— Description of the process to develop the prediction
  – Assumptions made in the assessment
  – Methods and models
  – Source of data
— Required prediction format
  – Prediction metrics
  – Confidence level

Further, IEEE Std 1413-1998 specifically identifies inputs that must be addressed with respect to the extent to which they are known (and can be verified) or unknown for a prediction to be conducted. These include, but are not limited to, usage, environment, lifetime, temperature, shock and vibration, airborne contaminants, humidity, voltage, radiation, power, packaging, handling, transportation, storage, manufacturing, duty cycles, maintenance, prediction metrics, confidence levels, design criteria, derating, material selection, design of printed circuit boards, box and system design parameters, previous reliability data and experience, and limitations of the inputs and other assumptions in the prediction method.

Besides prediction outputs, the prediction results should also contain conclusions, recommendations, system figures of merit, and confidence levels. The report should indicate how the conclusions follow from the outputs and justify the recommendations, where the recommendations are stated in terms of specific engineering and logistic support actions. Since the uncertainty (or the confidence level) is affected by the assumptions regarding the model inputs, the limitations of the model, and the repeatability of the prediction, the reliability prediction results should be presented and included in the report.


In summary, a reliability prediction report complying with IEEE Std 1413-1998 provides documentation of the prediction results, the intended use of prediction results, the method(s) used for the prediction, a list of inputs required to conduct the prediction, the extent to which each input is known, sources of known input data, assumptions used for unknown input data, figures of merit, confidence in the prediction, sources of uncertainty in the prediction results, limitations of the results, and a measure of the repeatability of the prediction results. Thus, any reliability prediction methodology can comply with IEEE Std 1413-1998.

In order to assist the user in the selection and use of a particular reliability prediction methodology complying with IEEE Std 1413-1998, a list of criteria is provided. The criteria consist of questions that concern the inputs, assumptions, and uncertainties associated with each methodology, enabling the risk associated with the methodology to be identified. Table 14 provides the assessment of various reliability prediction methodologies according to the IEEE 1413 criteria (the first eleven questions in the table). Other considerations may be included when selecting the methodology, and these are also included in Table 14 following the IEEE 1413 criteria.

Table 14—Comparison of reliability prediction methodologies

The methodologies compared are field data, test data, stress and damage models, and the handbook methods: MIL-HDBK-217F, RAC's PRISM, SAE's HDBK, Telcordia SR-332, and CNET's HDBK.

Does the methodology identify sources used to develop the prediction methodology and describe the extent to which the source is known?
— Field data: Yes. Test data: Yes. Stress and damage models: Yes. MIL-HDBK-217F: No. PRISM: Yes (a). SAE: No. Telcordia SR-332: No. CNET: No.

Are assumptions used to conduct the prediction according to the methodology identified, including those used for the unknown data?
— Field data: Yes. Test data: Yes. Stress and damage models: Yes. MIL-HDBK-217F: No. PRISM: Yes. SAE: Yes. Telcordia SR-332: Yes. CNET: No.

Are sources of uncertainty in the prediction results identified?
— Field data: Can be. Test data: Can be. Stress and damage models: Can be. MIL-HDBK-217F: No. PRISM: No. SAE: No. Telcordia SR-332: No. CNET: No.

Are limitations of the prediction results identified?
— Yes for all methodologies.

Are failure modes identified?
— Field data: Can be. Test data: Can be. Stress and damage models: Yes. MIL-HDBK-217F: No. PRISM: No. SAE: No. Telcordia SR-332: No. CNET: No.

Are failure mechanisms identified?
— Field data: Can be. Test data: Can be. Stress and damage models: Yes. MIL-HDBK-217F: No. PRISM: No. SAE: No. Telcordia SR-332: No. CNET: No.

Are confidence levels for the prediction results identified?
— Field data: Yes. Test data: Yes. Stress and damage models: Yes. MIL-HDBK-217F: No. PRISM: No. SAE: No. Telcordia SR-332: No. CNET: No.

Does the methodology account for life cycle environmental conditions (b), including those encountered during a) product usage (including power and voltage conditions), b) packaging, c) handling, d) storage, e) transportation, and f) maintenance conditions?
— Field data: Can be, if field data is collected in the same or a similar environment which accounts for all the life cycle conditions.
— Test data: Can be; it can consider them through the design of the tests used to assess product reliability.
— Stress and damage models: Yes, as input to physics-of-failure based models for the failure mechanisms.
— MIL-HDBK-217F: No; it does not consider the different aspects of environment. There is a temperature factor πT and an environment factor πE in the prediction equation.
— PRISM: No; it does not consider the different aspects of environment. Environmental inputs include operating and dormant temperatures, relative humidity, vibration, duty cycle, cycling rate, and power and voltage conditions.
— SAE: No; it does not consider the different aspects of environment. Ambient temperature, application stresses, and duty cycle are used as factors in the prediction equation.
— Telcordia SR-332: No; it does not consider the different aspects of environment. Ambient temperature, vibration and shock, and power and voltage conditions are used as factors in the prediction equation.
— CNET: No; it does not consider the different aspects of environment. It requires a range of parameter values that define each environmental category; parameters include vibration, noise, dust, pressure, relative humidity, and shock.

Does the methodology account for materials, geometry, and architectures that comprise the parts?
— Field data: Can be. Test data: Can be. Stress and damage models: Yes. MIL-HDBK-217F: No. PRISM: No. SAE: No. Telcordia SR-332: No. CNET: No.

Does the methodology account for part quality? (c)
— Field data: Can be; not explicitly considered, but implicitly used from the quality of the parts in the system.
— Test data: Can be; not explicitly considered, but implicitly used from the quality of the parts in the system.
— Stress and damage models: Yes; considered through the design and manufacturing data.
— MIL-HDBK-217F: Quality levels are derived from specific part-dependent data and the number of manufacturer screens the part goes through.
— PRISM: There is no part quality factor in the RACRates models; part quality level is implicitly addressed by process grading factors and the growth factor, πG.
— SAE: There is no part quality factor in the SAE constant failure rate model.
— Telcordia SR-332: Four quality levels that are based on generalities regarding the origin and screening of parts.
— CNET: Seven main levels of quality and several subclasses based on screening and part origin are used.

Does the methodology allow incorporation of reliability data and experience?
— Field data: Yes. Test data: Yes. Stress and damage models: Yes. MIL-HDBK-217F: No. PRISM: Yes, through a Bayesian method of weighted averaging. SAE: No. Telcordia SR-332: Yes, through a Bayesian method of weighted averaging. CNET: No.

Input data required for the analysis:
— Field data: Information on initial operating time, failure time, and operating profile (or approximations) for all units.
— Test data: Detailed test plan and results, including information on operating stresses, failure time(s), and expected application environments.
— Stress and damage models: Information on materials, architectures, design and manufacturing processes, and operating stresses.
— Handbook methods: Information on part count and operational conditions (e.g., temperature, voltage; specifics depend on the handbook used).

Other requirements for performing the analysis:
— Field data: Effort required in creating and maintaining a field data collection system might be high.
— Test data: The analysis typically involves designing and conducting tests.
— Stress and damage models: The analysis typically involves stress and damage simulations that can be performed through commercially available software.
— Handbook methods: Effort required is relatively small for using the handbook method and is limited to obtaining the handbook.

What is the coverage of electronic parts?
— Field data, test data, and stress and damage models: Not limited to a particular set of parts.
— MIL-HDBK-217F: Extensive (d).
— PRISM: Extensive constant failure rate databases are included in the methodology.
— SAE: Five generic part categories (microcircuits, diodes, transistors, capacitors, and resistors).
— Telcordia SR-332: Extensive (e).
— CNET: Extensive (f).

What failure probability distributions are supported?
— Field data: Not limited to a specific distribution; statistical techniques are used to fit a distribution to the field data.
— Test data: Not limited to a specific distribution; statistical techniques are used to fit a distribution to the test data.
— Stress and damage models: Not limited to a specific distribution; the users choose to input and interpret the data in a manner that suits the physical situation.
— Handbook methods: Exponential (g).

What reliability metrics are supported?
— Field data: Many, including time to failure, number of cycles to failure, failure probability distribution, failure percentile, confidence levels, failure-free operating period, and non-failure metrics.
— Test data: Many, including time to failure, number of cycles to failure, failure probability distribution, failure percentile, failure-free operating period, and confidence levels.
— Stress and damage models: Many, including time to failure, number of cycles to failure, failure probability distribution, failure-free operating period, failure percentile, and confidence levels.
— Handbook methods: MTBF (mean time between failure), constant failure rate.

Can it provide a reliability prediction for non-operational conditions?
— Field data: Yes, if field data is collected for non-operational conditions.
— Test data: Yes, if storage and dormant condition loads are used in the tests.
— Stress and damage models: Yes; non-operational conditions can be part of the environmental and operational profile.
— MIL-HDBK-217F: No (h).
— PRISM: Yes.
— SAE: No. Telcordia SR-332: No. CNET: No.

Last revision as of guidebook publication date:
— Field data, test data, and stress and damage models: Not applicable.
— MIL-HDBK-217F: Version F Notice 2, released in February 1995.
— PRISM: Version 1.3, released in June 2001.
— SAE: Version 5.0, released in 1990.
— Telcordia SR-332: Issue 7, released in May 2001.
— CNET: Version RDF 2000, released in July 2000.

Notes:
(a) Some data sources are included in the accompanying database.
(b) The life cycle of a product describes the assembly, storage, handling, and scenario for the use of the product, as well as the expected severity and duration of these environments. Specific conditions include temperature, temperature cycles, temperature gradients, humidity, pressure, vibration or shock loads, chemically aggressive or inert environments, electromagnetic radiation, airborne contaminants, and application-induced stresses caused by current, voltage, power, and duty cycles.
(c) Quality is defined as a measure of a part's ability to meet the workmanship criteria of the manufacturer. Quality levels for parts used by some of the handbook methods are different from quality of the parts. Quality levels are assigned based on the part source and level of screening the part goes through. The concept of quality level comes from the belief that screening improves part quality.
(d) MIL-HDBK-217F covers microcircuits, discrete semiconductors, tubes, lasers, capacitors, resistors, inductive devices, rotating devices, relays, switches, connectors, interconnection assemblies, meters, quartz crystals, lamps, electronic filters, fuses, and miscellaneous parts.
(e) Telcordia SR-332 covers integrated circuits (analog and digital), microprocessors, SRAMs, DRAMs, gate arrays, ROMs, PROMs, EPROMs, optoelectronic devices, displays, LEDs, transistors, diodes, thermistors, resistors, capacitors, inductors, connectors, switches, relays, rotating devices, gyroscopes, batteries, heaters, coolers, oscillators, fuses, lamps, circuit breakers, and computer systems.
(f) UTE C80-810 (or CNET RDF 2000) covers standard circuits (e.g., ROMs, DRAMs, flash memories, EPROMs, and EEPROMs), application-specific integrated circuits (ASICs), bipolar circuits, BiCMOS circuits, and gallium arsenide devices.
(g) Telcordia SR-332 can cover a non-constant failure rate using the first-year multiplier.
(h) Some users use this handbook's predictions under zero stress to calculate non-operational reliability.


5.6.2 Reliability prediction methodology assessment per IEEE 1413

The next four subclauses describe the major categories of reliability prediction methodologies. Additional discussion is provided pertaining to the assessment according to IEEE 1413 criteria and other information in Table 14. This subclause explains the "yes," "no," and "can be" entries in Table 14.

5.6.2.1 Assessment of reliability predictions using field data

Field data can be used for a reliability prediction of an item already in service or a similar item. The methodology is best applicable for high-volume applications from which sufficient data can be obtained to establish statistical confidence. It can also be used to adjust predictions based on other methods by comparing previous reliability predictions based on those methods with the actual field reliability performance of the item. Field data is applicable to many non-failure metrics (see 5.2.5). The answers to the 1413 assessment criteria questions may vary depending on the quality of data and the analysis level, both statistical and physical.

The data sources and assumptions used to conduct a reliability prediction based on field data will be dependent on the field data and the data collection methods. The sources of uncertainty and limitations of the prediction will be based on the quality of the available data and the similarity of the item to other items for which field data is available. Since field data analysis usually consists of fitting a statistical distribution to data, confidence levels for the prediction can be developed from those distributions.
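For example, when the field data supports a constant failure rate (exponential) model with time-censored observation, two-sided confidence bounds on the rate follow from the chi-squared distribution. A minimal sketch, assuming SciPy is available and using an illustrative data set:

from scipy.stats import chi2

def failure_rate_interval(failures, device_hours, confidence=0.90):
    """Two-sided confidence interval for a constant failure rate estimated
    from time-censored field data, using chi-squared bounds."""
    alpha = 1.0 - confidence
    lower = chi2.ppf(alpha / 2.0, 2 * failures) / (2.0 * device_hours)
    upper = chi2.ppf(1.0 - alpha / 2.0, 2 * failures + 2) / (2.0 * device_hours)
    return lower, upper

# e.g., 12 failures observed over 3.0e6 cumulative device-hours in the field
lo, hi = failure_rate_interval(12, 3.0e6)
print(f"point estimate {12 / 3.0e6:.2e} /h, 90% interval [{lo:.2e}, {hi:.2e}] /h")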

The type and quality of field data is highly variable, ranging from simple factory ship and return data to detailed tracking and failure analysis for every unit built. The answers to several of the questions in Table 14 are "Can be"; however, the answer can be "Yes" if sufficiently detailed data is available and used for the analysis. For example, if failure analysis is available, a separate field reliability prediction can be performed for each failure mode, failure mechanism, or failure cause. Since it represents equipment in its actual operational conditions, field data implicitly accounts for life cycle environmental conditions, including non-operational environments such as storage and transportation if suitable records are kept. Field data also implicitly accounts for part materials, geometry, architecture, and quality if field data is used to predict the reliability of an item already in service or an item with similar part materials, geometry, architecture, and quality. Field data can explicitly account for these factors if sufficient data is available, e.g., if the field reliability of a design change that modified part quality is tracked separately from the field reliability of the original design. Although a prediction based on field data includes the impact of the life cycle environment on the product reliability, such as the conditions encountered during product operation (including power and voltage conditions), it may not allow the user to discriminate between the effects of the individual components of the environment on the observed failures. In this case, the methodology will not account for failure modes and mechanisms. In addition, it is often difficult to use field data during the design stage to predict and compare the effect of changes on part characteristics for theoretical designs, e.g., to determine the effect of a part geometry change on the reliability prediction.

The primary strength of field data is that it represents the actual reliability performance of an item in its actual operational environment rather than simulating or estimating that environment. However, performing an accurate reliability prediction based on field data requires extensive data collection on the same or a similar item for a sufficient length of time to fit a statistical distribution with measurable confidence.

5.6.2.2 Assessment of reliability predictions using test data

The process of designing a test, setting it up, and conducting it can provide in-depth knowledge of the test data, the assumptions, approximations, uncertainties, errors, environments, and stresses. Therefore, a reliability prediction based on test data may implicitly satisfy many of the criteria in Table 14.


A test report should clearly identify the source of the data (who conducted the test) and the extent to which the source is known (details surrounding the test). Such a report should also identify all electrical, mechanical, environmental, and statistical bases and assumptions. Knowing the bases upon which the test was developed provides understanding of the sources of uncertainty and the limitations on the use of the results and the data. Such bases and assumptions will include, but not be limited to, mechanical, electrical, statistical, environmental, or other modeling assumptions.

Knowing the processes that went into creating the test articles provides an understanding of the test article's quality. This permits extrapolating the test results from the test article quality to the final product quality. The rigor with which the test is conducted provides insight into the quality of the data itself. Similarly, the materials, geometry, and system architecture can be reproduced to whatever level of accuracy is desired. Regardless of the exactness, the level of precision should be stated explicitly.

When failures occur, the failure modes can be recorded. If failure analysis is performed, failure mechanisms can be identified and, possibly, root causes determined. However, it is sometimes not possible to determine the unequivocal cause of failure, especially for intermittent failures or destructive tests.

In order to derive a reliable result from an accelerated test, it is required that 1) the same failure mechanism that is active during product operation is dominant during accelerated testing, and 2) the acceleration of this failure mechanism by some appropriate change from the operating conditions to the test conditions can be expressed in the form of an acceleration transform. Hence, a successful accelerated test should ensure that the failure modes generated in the test are results of the same mechanism as those that occur in the actual use of the product in the field.
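As an illustration of an acceleration transform, the sketch below evaluates the widely used Arrhenius model for temperature-accelerated failure mechanisms. The activation energy and temperatures are illustrative assumptions, not values drawn from this guide, and the transform is valid only when the same mechanism dominates at both temperatures, as stated above.

```python
import math

BOLTZMANN_EV = 8.617e-5  # Boltzmann constant, eV/K

def arrhenius_acceleration_factor(t_use_c, t_test_c, ea_ev):
    """Arrhenius acceleration transform:
    AF = exp[(Ea/k) * (1/T_use - 1/T_test)], temperatures in kelvin."""
    t_use = t_use_c + 273.15
    t_test = t_test_c + 273.15
    return math.exp((ea_ev / BOLTZMANN_EV) * (1.0 / t_use - 1.0 / t_test))

# Illustrative case: 55 degC use, 125 degC test, assumed Ea = 0.7 eV
af = arrhenius_acceleration_factor(55.0, 125.0, 0.7)
print(f"Acceleration factor: {af:.0f}")
print(f"1000 test hours represent about {1000.0 * af:.0f} field hours")
```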

A source of error in a test is the inability to duplicate or model the actual use environment. In addition, extrapolation beyond the actual test time may not be possible, depending on the failure distribution or a lack of data from which to accurately deduce the failure distribution. The utility of the prediction will also depend on the extent to which the electronics are tested (whether all circuit cards and open slots were tested) and whether or not non-operational tests were conducted.

The quality and type of reliability predictions will be predicated on the extent of the statistical analyses of the data. Such analyses can include determining the underlying failure distribution, calculating confidence limits, and expressing the reliability. These may be limited by the amount of failure data that is developed. Too few failures may prevent accurate estimation of the failure distribution and, therefore, determination of confidence limits (which depend on the distribution). Knowing the level of data quality desired allows one to identify the types of inputs required.
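As an illustration of such an analysis, the following sketch fits a two-parameter Weibull distribution to a small set of hypothetical, complete (uncensored) failure times and uses a simple bootstrap to gauge the uncertainty of the resulting reliability estimate; with so few failures the interval is wide, which is exactly the limitation described above.

```python
import numpy as np
from scipy import stats

# Hypothetical times to failure from a life test, in hours
failures = np.array([412.0, 608.0, 790.0, 955.0, 1122.0, 1310.0, 1540.0, 1860.0])

# Maximum-likelihood fit of a two-parameter Weibull (location fixed at 0)
shape, _, scale = stats.weibull_min.fit(failures, floc=0)
print(f"Weibull shape (beta) = {shape:.2f}, scale (eta) = {scale:.0f} h")

# Reliability at a mission time t: R(t) = exp[-(t/eta)^beta]
t = 500.0
print(f"R({t:.0f} h) = {stats.weibull_min.sf(t, shape, loc=0, scale=scale):.3f}")

# Bootstrap the fit to express the uncertainty of the estimate
rng = np.random.default_rng(1)
boot = []
for _ in range(2000):
    resample = rng.choice(failures, size=failures.size, replace=True)
    b, _, e = stats.weibull_min.fit(resample, floc=0)
    boot.append(stats.weibull_min.sf(t, b, loc=0, scale=e))
print("90% bootstrap interval for R(500 h):", np.percentile(boot, [5, 95]))
```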

5.6.2.3 Assessment of reliability predictions using stress and damage models

For the stress and damage model method, the prediction is based on the evaluation of documented models and their variability, allowing for a ranking of potential failure sites and a determination of a minimum time to failure. The use of documented models provides an identification of the sources used to make the prediction and describes the extent to which the sources are known. Since the failure models are documented, literature may be referenced to provide details of their development, including the underlying assumptions and limitations. Therefore, reliability predictions based on stress and damage models satisfy most of the criteria of Table 14.

The accuracy of a stress and damage model prediction is dependent on the models used and the inputs to those models. The sources of uncertainty can be identified because all the parameters used in the models are associated with physical properties (e.g., material properties, geometry values, and environmental parameters), and the variations in these parameters can be considered to account for the uncertainty of the model results. These failure models are for particular failure mechanisms, and the models predict the mode in which the failure would manifest. These failure models utilize environmental and operational usage profile conditions as inputs, including power and voltage conditions, environmental exposures, duration and duty cycles at various temperatures, exposure to airborne contaminants, shock and vibration, humidity, radiation, maintenance, packaging, handling, storage, and transportation conditions. Hence, this methodology identifies failure modes and mechanisms and accounts for life cycle environmental conditions.

Using numerical methods such as Monte Carlo simulation, the distribution of the time to failure may be developed by considering the range of variability of the input parameters to the failure models. From the calculated distribution, a confidence level can be calculated for a given time-to-failure interval.
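A minimal sketch of this approach is shown below, assuming a hypothetical Coffin-Manson-type thermal fatigue model N_f = C * (delta_T)^(-m) with illustrative input-parameter distributions; the actual models and parameter values would come from the documented failure mechanism literature.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical stress-and-damage model inputs, sampled from assumed
# distributions (all values illustrative)
C = rng.normal(2.5e7, 2.5e6, n)       # material/geometry constant
m = rng.normal(2.0, 0.1, n)           # fatigue exponent
delta_T = rng.normal(60.0, 6.0, n)    # thermal cycle range, degC

# Cycles to failure for each sampled parameter set
n_f = C * delta_T ** (-m)

# The resulting distribution supports confidence statements for a
# given time-to-failure interval
print(f"5th percentile of N_f: {np.percentile(n_f, 5):.0f} cycles")
target = 5000.0
print(f"P(N_f > {target:.0f} cycles) = {(n_f > target).mean():.3f}")
```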

The stress and damage method considers part quality in terms of the variation of material properties and structural geometries. This variation may be accounted for by using worst-case parameter values or by conducting a sensitivity study of the effect of variation in the input parameters to the failure models. Again, a Monte Carlo simulation or other numerical methods may be used to model the effect of variation in these properties.

The stress and damage method incorporates reliability into the design process by establishing a credible basis for evaluating new materials, structures, and electronics technologies. The models are updated with the results of additional tests and observations as they become available. This method focuses on the root-cause failure mechanisms and sites, which is central to good design and manufacturing.

The feasibility of reliability prediction based on stress and damage models is governed by the availability and accuracy of models as well as the input data for those models. Because the method requires a priori knowledge of the relevant failure modes and mechanisms, it is more suited for products where the dominant failure mechanisms and sites are known. Failure mechanisms in systems are the subject of extensive and active study by industry, professional organizations, research institutes, and governments, and there is extensive literature documenting failure mechanisms as well as simulation techniques that can be used for their assessment. Since studies are documented in open literature and subjected to peer review, the accuracy of simulation techniques is reviewed and continues to be improved.

The stress and damage model approach considers the compatibility of the part with the next level of assembly. In fact, the approach uses the next level of assembly to perform failure mechanism modeling (e.g., a circuit card assembly (CCA) thermal analysis uses box-level thermal characteristics, and a CCA vibration analysis uses the structural response of the box and the system).

5.6.2.4 Assessment of handbook prediction methodologies

This subclause assesses the five handbook prediction methodologies described in 5.5, namely MIL-HDBK-217, the Reliability Analysis Center's (RAC) PRISM, the Society of Automotive Engineers' (SAE) PREL, Telcordia SR-332, and the CNET Reliability Prediction Standard, according to the criteria derived from IEEE Std 1413-1998. This method can only be used if the handbook under consideration covers the hardware of interest. The reader may also refer to Bhagat, W., "R&M through Avionics/Electronics Integrity Program;" Bowles, J. B., "A Survey of Reliability-Prediction Procedures for Microelectronic Devices;" Lall, P., Pecht, M., and Hakim, E. B., Influence of Temperature on Microelectronics and System Reliability: A Physics of Failure Approach; Leonard, C. T., "On US MIL-HDBK-217 and Reliability Prediction;" Leonard, C. T., "How Failure Prediction Methodology Affects Electronic Equipment Design;" O'Connor, P. D. T., "Reliability Prediction for Microelectronic Systems;" O'Connor, P. D. T., "Reliability Prediction: A State-Of-The-Art Review;" O'Connor, P. D. T., "Undue Faith in US MIL-HDBK-217 for Reliability Prediction;" O'Connor, P. D. T., "Reliability Prediction: Help or Hoax;" O'Connor, P. D. T., "Statistics in Quality and Reliability. Lessons from the Past, and Future Opportunities;" Wong, K. L., "What Is Wrong with the Existing Reliability Prediction Methods?;" Wong, K. L., "A Change in Direction for Reliability Engineering is Long Overdue;" Wong, K. L., "The Bathtub Curve and Flat Earth Society;" O'Connor, P. D. T., "Reliability: Measurement or Management?;" Nash, F. R., "Estimating Device Reliability: Assessment of Credibility;" and Hallberg, Ö., "Hardware Reliability Assurance and Field Experience in a Telecom Environment" for other assessments of handbook prediction methodologies. All the handbook prediction methods are easy to apply and do not require failure data collection or specific design assessment. However, none of them identifies failure modes and mechanisms, and thus they offer limited insight into reliability issues. Hence, these methods can potentially misguide efforts to design reliable electronic equipment (see Cushing, M. J., Krolewski, J. G., Stadterman, T. J., and Hum, B. T., "U.S. Army Reliability Standardization Improvement Policy and Its Impact;" Hallberg, Ö. and Löfberg, J., "A Time Dependent Field Return Model for Telecommunication Hardware;" Leonard, C. T., "Mechanical Engineering Issues and Electronic Equipment Reliability: Incurred Costs Without Compensating Benefits;" Leonard, C. T., "Passive Cooling for Avionics Can Improve Airplane Efficiency and Reliability;" Pease, R., "What's All This MIL-HDBK-217 Stuff Anyhow?;" Watson, G. F., "MIL Reliability: A New Approach;" Pecht, M. and Nash, F., "Predicting the Reliability of Electronic Equipment;" and Knowles, I., "Is It Time For a New Approach?").

5.6.2.4.1 MIL-HDBK-217

MIL-HDBK-217F was developed through the collection and analysis of historical field failure data, and the constant failure rate prediction models used in the methodology were based on data acquired from various sources over time. However, information regarding the sources of this data, the levels of statistical control and verification, and the data processing used to derive constant failure rates and adjustment factors was not specified in the document, although a bibliography was provided.

The methodology did not specify the assumptions used to predict reliability or the reasons behind the uncertainties in the results, but it identified the limitations of the predicted results. For example, the first limitation cited in the document is that the models for predicting the constant failure rate are only valid for the conditions under which the data was obtained and for the devices covered. However, the handbook did not provide any information about these conditions or the specific devices for which the data was collected.

MIL-HDBK-217F does not predict specific field failures or failures due to environmental conditions such as vibration, humidity, and temperature cycling (except for steady-state temperature), but only the number of failures over time. The methodology assumes an exponential failure distribution, irrespective of the hazard rates, failure modes, and failure mechanisms.28

The impact of failure mechanisms on the failure distribution was not considered, and confidence levels for the prediction results were not addressed. Models were lumped together and used at the package level, and factors were used without determining the relative dominance of the failure mechanisms (i.e., without verifying that the same failure mechanism is being accelerated in the new environment).

MIL-HDBK-217 stated that the applications of systems could be significantly different, even when used in similar environments (for example, two computers may be operating in the same environment, but one may be used more frequently than the other). In other words, the methodology acknowledged that the reliability of a system depended on both its environment and the operational loads to which it was subjected. However, although the prediction methodology covers fourteen environments, it does not account for the actual life cycle environment, which includes temperature cycles, temperature gradients, humidity, pressure, vibration, chemically aggressive or inert environments, radiation, airborne contaminants, and application-induced stresses caused by voltage, power, or duty cycles. The methodology also does not account for the impact of assembly, handling, storage, maintenance, and transportation on reliability. Consequently, the United States Department of Defense (DoD) stated that a "reliability prediction should never be assumed to represent the expected field reliability as measured by the user" (see Lycoudes, N., and Childers, C. G., "Semiconductor Instability Failure Mechanism Review"). Another indication of the effectiveness of the methodology's results was the statement: "MIL-HDBK-217 is not intended to predict field reliability and, in general, does not do a very good job when so applied" (see Morris, S. F., "Use and Application of MIL-HDBK-217").

The methodology assumes that the stress conditions and the predicted life are independent of material properties and geometries, and consequently it does not account for variabilities in part materials and geometries. Parts within the same class and application are assumed to have the same constant failure rate, even when made of different materials and having different geometries.

28 For example, MIL-HDBK-217F assumes that solder joint failures, which are known to be wearout failures, can be modeled by a constant failure rate.

The methodology does not consider quality to be a function of manufacturer process control or part variability, but simply a function of the number of tests to which the part is subjected. In other words, the greater the number of screens a part is subjected to, the higher its quality is assumed to be, irrespective of the damage caused by faulty design, manufacturing, assembly, or screening procedures. The preparing activity recognized this shortcoming and stated: "Poor equipment design, production, and testing facilities can degrade part quality. It would make little sense to procure high quality parts only to have the equipment production procedures damage the parts or introduce latent defects" (see Morris, S. F., "Use and Application of MIL-HDBK-217").

The methodology does not address reliability growth and adds little value for current, new, or futuristic technologies. The handbook stated: "evolutionary changes (in technology) may be handled by extrapolation from existing models; revolutionary changes may defy analysis" (see Lycoudes, N., and Childers, C. G., "Semiconductor Instability Failure Mechanism Review").

5.6.2.4.2 SAE’s reliability prediction methodology

The SAE reliability prediction methodology (implemented through a software package known as PREL) was developed using MIL-HDBK-217 data combined with automotive environments and other automotive data. Since the extent to which the source of the MIL-HDBK-217 data is known was not defined, the extent to which the SAE data is known is also undefined.

Although the methodology does not identify the information used to develop the models, the assumptions used to develop the models and the limitations of the models are identified. The methodology states that it only predicts failures due to common causes and not special causes, and that "… since (the models) were derived from statistical analysis of (constant) failure rate information from a wide variety of manufacturers and module types, the resultant reliability predictions are representative of industry averages. Therefore, predictions for a specific set of conditions will be estimates rather than true values" (see Denson, W. and Priore, M., "Automotive Electronic Reliability Prediction").

Some of the other limitations stated in the methodology are that there is always an uncertainty in the failure data collection process due to the uncertainty in deciding whether a failure is inherent or event-related, that the failure models used do not account for materials, geometry, and manufacturing variations, and that the data used to develop the failure models was obtained from 1982–1985 and is not representative of today's parts. Failure modes and failure mechanisms were also not identified. Confidence levels for the prediction results were addressed by providing a predicted-to-observed constant failure rate ratio.

The models used in PREL included duty cycles for operating, dormant, and non-operating conditions, as well as factors for evaluation of system-level infant mortality failures, allowing reliability to be predicted as a function of any operating scenario and duty cycle. Although actual automotive field data was used to develop the models and predominant environmental parameters (e.g., ambient temperature, duty cycles, power, voltage, and current) were inputs to the failure models, the true part life cycle environment, including the effects of packaging, handling, transportation, and maintenance, was not accounted for.

Although there was no distinct part quality factor in the SAE constant failure rate model, quality was implicitly considered in the regression analysis by a screening factor. In other words, the SAE reliability prediction methodology also considered quality to be only a function of the number of tests to which the part was subjected, and not of manufacturer process control, assembly, or part variability. The effects of materials, part geometry, and part architecture on the final part reliability were also not addressed.


5.6.2.4.3 CNET reliability prediction method

The CNET methodology (see RDF 2000) states that the reliability data was taken mainly from field data concerning electronic equipment operating in three kinds of environments: 1) ground: stationary, 2) ground: non-stationary, and 3) civilian: airborne (see Kervarrec, G., Monfort, M. L., Riaudel, A., Klimonda, P. Y., Coudrin, J. R., Razavet, D. Le, Boulaire, J. Y., Jeanpierre, P., Perie, D., Meister, R., Casassa, S., Haumont, J. L., and Liagre, A., "Universal Reliability Prediction Model for SMD Integrated Circuits Based on Field Failures"). It does not identify the actual data sources or the extent, quantity, or quality to which the data is known. As with other handbook prediction methodologies, the CNET RDF 2000 methodology does not identify failure modes, failure sites, and failure mechanisms. The constant failure rate is calculated by adding the contributions of the die, the package, and electrical overstresses (EOS). The methodology provides some temperature, humidity, and chemical exposure information, but the actual life cycle environment is not considered. Packaging, handling, storage, transportation, and maintenance conditions are also not considered. Environmental conditions are specified by selecting an adjustment factor from an available list, which does not incorporate the actual life cycle environmental conditions.

The methodology identifies certain assumptions made in the reliability prediction, such as failure rates being constant and vibration and shock not being considered to generate significant failures for the selected environment, but it does not describe the assumptions made for unknown data. Although the limitations of the prediction results are identified, the sources of uncertainty in the prediction results are not. The methodology states that the predictions are based only on the intrinsic reliability of parts; they do not therefore account for external overload conditions, design errors, incorrect use of parts, or the risks involved in using parts with poor reliability. Materials, part geometry, and part architecture are also not taken into account.

The methodology does not consider the impact of competing failure mechanisms; for example, environments where severe mechanical stresses are present are not represented in a realistic manner in a model where only thermal fatigue is taken into account. Although a quality factor is included in the methodology, this factor is largely based on criteria such as the time-in-production of the part, qualification or supervision procedures followed by the manufacturer, and conformance of the part to various certifying authorities (such as the IECQ or CECC), but not on the damage caused by faulty design, manufacturing, assembly, or screening procedures.

This prediction methodology does not allow the incorporation of reliability data and experience. The user may modify the adjustment factors according to experience, but no avenue exists to incorporate reliability data and experience into the structure of the methodology.

5.6.2.4.4 Telcordia SR-332

Telcordia SR-332 primarily uses data from the computer and telecommunications industries, but the sources of the data used to develop the prediction methodology and the extent, quantity, quality, and time frame of the data are not identified. Assumptions and limitations in the methodology are identified (e.g., failure rates being held constant). However, the sources of uncertainty in the prediction results are unknown. Furthermore, the failure modes and mechanisms are not identified.

Telcordia SR-332 provides the upper 90% confidence-level point estimate for the generic constant failure rate along with tables of multiplication factors. However, although these factors represent certain application conditions (e.g., steady-state temperature, electrical stress), they do not account for the effects of temperature cycling, vibration, humidity, and other life cycle conditions (except steady-state temperature) on product reliability. They also do not account for conditions encountered during storage, handling, packaging, transportation, and maintenance. The methodology aims to account for the uncertainty in the parameters contributing to reliability, but it does not address the uncertainty in the failure models used.

The methodology accounts for quality as a function of the origin and screening of parts, but it does not account for the materials, part geometry, and part architecture. Four standard quality levels are defined, which are identical for all part types and are based on criteria regarding the origin and screening of the parts. The effect of faulty design, manufacturing, and assembly procedures on part quality is not addressed. A methodology for combining laboratory test data and/or field-tracking data with parts count data is provided, but a constant failure rate must be assumed for that information. The methodology allows for the incorporation of reliability growth and experience in the form of laboratory and field data without additional regression analysis.

5.6.2.4.5 RAC’s PRISM

The constant failure rate calculation models used in PRISM for electrical, electronic, and electromechanical devices are reported to be based on historical field data acquired from a variety of sources over time and under various levels of statistical control and verification, but no details on the quantity, quality, and time frames of the data, or references, are provided. There is no presentation or justification of the algorithm used for finding the constant failure rate, the levels of statistical control and verification, or the data manipulation used to derive the generic constant failure rates and adjustment factors. The methodology predicts constant failure rates.

PRISM specifies the assumptions used to predict reliability (e.g., failure rates are assumed constant) and identifies the limitations of the predicted results, such as the applicability to only parts similar to those used in developing the failure rate models. Although the methodology accounts for the uncertainty in the parameters contributing to reliability (e.g., design and manufacturing), it does not account for the uncertainty in its failure models. Confidence levels are not specified at all, but the software allows integration of the prediction results with Bayesian statistics.

Some environmental conditions are direct inputs to the failure models used in PRISM. However, the models used in the software are based only on an estimate of the influence of these parameters because the RAC databases do not consider failure modes and mechanisms. Hence, the methodology attempts to account for the effects of life cycle processes on end-item operational reliability without accounting for the actual failure modes, failure sites, and mechanisms induced by the processes. If a process is suspected of having an effect on a part's operational reliability, PRISM requires the user to have a working knowledge of the process and to answer a process-related questionnaire that PRISM uses to estimate a modifying factor.

PRISM quality factors are obtained by a method called process grading; the resulting grades are factors that modify a generic constant failure rate. These factors include design, manufacturing, part procurement, and system management, and they are intended to capture the extent to which measures have been taken to minimize the occurrence of system failures. If these grades are not calculated for a part or a subsystem, the model defaults to assuming a "typical" manufacturing process and does not adjust the predicted constant failure rate any further. The effect of materials, geometries, and part architecture on the final quality and reliability of a part is not accounted for. PRISM models do have a growth factor that is meant to address improvements in part reliability and manufacturing. As information about a part (e.g., field or test data) becomes available, PRISM allows the user to update the initial reliability prediction, but this estimate is not based on the analysis of the complete database, so its statistical accuracy is unknown.

6. System reliability models

System reliability prediction is based on the system architecture and the reliability of the lower-level components and assemblies. Clause 5 provides methods for predicting the reliability of the lower-level components and assemblies. This clause describes how to combine these lower-level reliability predictions using the system architecture to create a system reliability prediction. The combinatorial methods described in this clause can also be used for combining reliability predictions at lower levels of the system hierarchy, e.g., for combining component reliability predictions to create an assembly reliability prediction.


Subclause 6.1 describes how reliability block diagrams can be used to represent the logical system architecture and develop system reliability predictions. Subclause 6.2 explains how fault trees can be used to combine lower-level reliability predictions. Subclause 6.3 describes Markov models, which are especially useful for repairable system reliability prediction. Other techniques for repairable system reliability prediction are briefly mentioned at the end of 6.3. Subclause 6.4 describes the use of Monte Carlo simulation to create system reliability predictions. There are many texts that describe how to combine reliability predictions and distributions in much more detail than this guide (see Kececioglu, B. D., Reliability Engineering Handbook, Vols. 1 and 2; Lewis, E. E., Introduction to Reliability Engineering; and Klion, J., Practical Electronic Reliability Engineering).

6.1 Reliability block diagram

A reliability block diagram presents a logical relationship of the system components. Series systems are described in 6.1.1, parallel systems in 6.1.2, stand-by systems in 6.1.3, (k, n), or k-out-of-n, systems in 6.1.4, and complex systems in 6.1.5. All of the above system configurations are analyzed using the principles of probability theory.

6.1.1 Series system

In a series system all subsystems must operate successfully for the system to function. This implies that the failure of any of the subsystems causes the system to fail. The reliability block diagram of a series system is represented by Figure 15.

Figure 15—Series system

The units need not be physically connected in series for the system to be called a series system. The system reliability can be derived from the basic principles of probability theory. The system will fail if any of the subsystems or any of the components fails, or the system will survive the mission time t if all the units survive by time t. Then,

R_s(t) = R_1(t) \cdot R_2(t) \cdots R_n(t) = \prod_{i=1}^{n} R_i(t)        (7)

In general, each unit can have a different failure distribution. Reliability metrics such as hazard rates and mean life can be derived for the system based on the individual component or subsystem failure distributions.

Assuming that the time-to-failure distribution for all units is exponential with constant failure rate \lambda_i, the unit reliability is

R_i(t) = e^{-\lambda_i t}        (8)


then, the system reliability is given by

R_s(t) = \prod_{i=1}^{n} R_i(t) = \prod_{i=1}^{n} e^{-\lambda_i t} = e^{-\left(\sum_{i=1}^{n} \lambda_i\right) t}        (9)

The constant system failure rate is

\lambda_s = \sum_{i=1}^{n} \lambda_i        (10)

and the system mean life is

MTBF = \frac{1}{\lambda_s} = \frac{1}{\sum_{i=1}^{n} \lambda_i}        (11)
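A minimal sketch of Equations (9) through (11), for a hypothetical three-unit series system with illustrative constant failure rates:

```python
import math

lambdas = [2e-6, 5e-6, 1e-5]   # illustrative unit failure rates, per hour
t = 10_000.0                   # mission time, hours

lam_s = sum(lambdas)                    # Equation (10)
r_s = math.exp(-lam_s * t)              # Equation (9)
mtbf = 1.0 / lam_s                      # Equation (11)

print(f"R_s({t:.0f} h) = {r_s:.4f}")    # about 0.8437
print(f"System MTBF = {mtbf:.0f} h")    # about 58824 h
```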

6.1.2 Parallel system

A parallel system is a system that is not considered to have failed unless all components have failed. Sometimes the parallel system is called a (1, n), or 1-out-of-n, system, which implies that only one out of n subsystems needs to operate for the system to be in an operational, non-failed state. The reliability block diagram of a parallel system is given in Figure 16.

Figure 16—Parallel system

The units need not be physically connected in parallel for the system to be called a parallel system. The system will fail if all of the subsystems or all of the components fail by the time t, or the system will survive the mission time t if at least one of the units survives by time t. Then, the system reliability can be expressed as

R_s(t) = 1 - F_s(t)        (12)

where

F_s(t) is the probability of system failure, or

F_s(t) = [1 - R_1(t)] \cdot [1 - R_2(t)] \cdots [1 - R_n(t)] = \prod_{i=1}^{n} [1 - R_i(t)]        (13)


and the system reliability for a mission time t is

R_s(t) = 1 - \prod_{i=1}^{n} [1 - R_i(t)]        (14)

In general, each unit can have a different failure distribution. The system hazard rate is given by

\lambda_s(t) = \frac{f_s(t)}{R_s(t)}        (15)

where

f_s(t) is the system time-to-failure pdf (probability density function)

The mean life, m, of the system can be determined by

m = \int_0^{\infty} R_s(t)\,dt = \int_0^{\infty} \left\{ 1 - \prod_{i=1}^{n} [1 - R_i(t)] \right\} dt        (16)

For example, if the system consists of two units (n = 2) with exponential failure distributions and constant failure rates \lambda_1 and \lambda_2, then the system mean life is given by

m = \frac{1}{\lambda_1} + \frac{1}{\lambda_2} - \frac{1}{\lambda_1 + \lambda_2}        (17)

Note that the system mean life is not equal to the reciprocal of the sum of the component constant failure rates, and the hazard rate is not constant over time although the individual unit failure rates are constant.
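The two-unit case can be checked numerically; the sketch below evaluates Equation (14) for illustrative failure rates and confirms the closed-form mean life of Equation (17) against direct integration of Equation (16).

```python
import numpy as np
from scipy import integrate

lam1, lam2 = 1e-4, 2e-4   # illustrative constant failure rates, per hour

def parallel_reliability(t):
    """Equation (14) for two exponential units."""
    return 1.0 - (1.0 - np.exp(-lam1 * t)) * (1.0 - np.exp(-lam2 * t))

# Mean life by numerical integration of Equation (16) ...
m_numeric, _ = integrate.quad(parallel_reliability, 0.0, np.inf)
# ... and by the closed form of Equation (17)
m_closed = 1.0 / lam1 + 1.0 / lam2 - 1.0 / (lam1 + lam2)

print(f"mean life: numeric {m_numeric:.1f} h, closed form {m_closed:.1f} h")
```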

6.1.3 Stand-by system

A standby system consists of an active unit or subsystem and one or more inactive units, which become active in the event of a failure of the functioning unit. These dormant systems or units may be in quiescent, non-operating, or warm-up modes. The failures of active units are signaled by a sensing subsystem, and the standby unit is brought into action by a switching subsystem. The simplest standby configuration is the two-unit system shown in Figure 17. In the general case, there will be N units with N-1 of them in standby.

Figure 17—Stand-by system

The following assumptions for standby redundancies are generally made:

a) Switching is in one direction only.
b) Standby non-operating units cannot fail if not energized.
c) Switching devices should respond only when directed to switch by the monitor; a false switching operation (static failure) is detected by the monitor as a path failure, and switching is initiated.
d) Switching devices do not fail if not energized.
e) Monitor failure includes both dynamic (failure to switch when the active path fails) and static (switching when not required) failures.

When the active and the standby units have equal constant failure rates, \lambda, and the switching and sensing units are perfect (\lambda_{sw} = 0), the reliability function for such a system is

R(t) = e^{-\lambda t} (1 + \lambda t)        (18)
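The following sketch evaluates Equation (18) for an illustrative failure rate and compares the standby pair with an active parallel pair of the same units; the standby arrangement is more reliable because the spare accrues no failure exposure until switched in (perfect sensing and switching assumed, as above).

```python
import math

lam, t = 1e-4, 10_000.0   # illustrative unit failure rate and mission time

# Equation (18): two-unit standby system with perfect switching
r_standby = math.exp(-lam * t) * (1.0 + lam * t)

# Active parallel pair of the same units, Equation (14), for comparison
r_parallel = 1.0 - (1.0 - math.exp(-lam * t)) ** 2

print(f"standby : {r_standby:.4f}")    # about 0.7358
print(f"parallel: {r_parallel:.4f}")   # about 0.6004
```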

6.1.4 (k, n) Systems

A system consisting of n components is called a (k, n), or k-out-of-n, system, where the system operates only if at least k components are in the operating state. The reliability block diagram for the (k, n) system is the same as for the parallel system (see Figure 18), but at least k items need to be operating for the system to be functional. The parallel system described in 6.1.2 is a special case of the (k, n) system with k = 1.

Figure 18—k-out-of-n system

The reliability function for the system is very complex when the components have different failure distributions. Assuming that all the components have the same failure distribution, F(t), the system reliability can be determined using the binomial distribution; i.e.,

R_s(t) = \sum_{i=k}^{n} \binom{n}{i} [1 - F(t)]^{i} [F(t)]^{n-i}        (19)

and the probability of system failure is then

F_s(t) = 1 - R_s(t) = 1 - \sum_{i=k}^{n} \binom{n}{i} [1 - F(t)]^{i} [F(t)]^{n-i} = \sum_{i=0}^{k-1} \binom{n}{i} [1 - F(t)]^{i} [F(t)]^{n-i}        (20)


The probability density function can be determined from

f_s(t) = \frac{dF_s(t)}{dt} = \frac{n!}{(n-k)!\,(k-1)!} [1 - F(t)]^{k-1} [F(t)]^{n-k} f(t)        (21)

and the hazard rate is given by

\lambda_s(t) = \frac{f_s(t)}{R_s(t)}        (22)
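Equation (19) can be evaluated directly, as in the sketch below for an illustrative 2-out-of-3 system of identical units; k = 1 recovers the parallel case of 6.1.2 and k = n the series case.

```python
import math

def k_out_of_n_reliability(k, n, r_unit):
    """Equation (19) for identical units, with r_unit = 1 - F(t)."""
    return sum(
        math.comb(n, i) * r_unit**i * (1.0 - r_unit) ** (n - i)
        for i in range(k, n + 1)
    )

r = 0.9  # illustrative unit reliability at the mission time
print(f"2-out-of-3: {k_out_of_n_reliability(2, 3, r):.4f}")  # 0.9720
print(f"1-out-of-3: {k_out_of_n_reliability(1, 3, r):.4f}")  # 0.9990 (parallel)
print(f"3-out-of-3: {k_out_of_n_reliability(3, 3, r):.4f}")  # 0.7290 (series)
```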

6.1.5 Complex system

If the system architecture cannot be decomposed into series-parallel structures, it is deemed a complex system. Subclauses 6.1.5.1 through 6.1.5.3 describe three methods for reliability analysis of a complex system, using the system in Figure 19 as an example.

Figure 19—A complex system

6.1.5.1 Complete enumeration method

The complete enumeration method is based on the list of all possible combinations of the unit failures. Table 15 contains all possible states of the system given in Figure 19. The symbol "O" stands for "system in operating state" and "F" stands for "system in failed state." Uppercase letters denote a unit in an operating state and lowercase letters denote a unit in a failed state.

Table 15—Complete enumeration example

System description               System condition (system status)
All components operable          ABCDE (O)
One unit in failed state         aBCDE (O), AbCDE (O), ABcDE (O), ABCdE (O), ABCDe (O)
Two units in failed state        abCDE (F), aBcDE (O), aBCdE (O), aBCDe (O), AbcDE (F),
                                 AbCdE (O), AbCDe (O), ABcdE (O), ABcDe (O), ABCde (O)
Three units in failed state      ABcde (F), AbCde (O), AbcDe (F), AbcdE (F), aBCde (O),
                                 aBcDe (O), aBcdE (O), abCDe (F), abCdE (F), abcDE (F)
Four units in failed state       Abcde (F), aBcde (F), abCde (F), abcDe (F), abcdE (F)
All five units in failed state   abcde (F)


Each combination representing a system status can be written as a product of the probabilities of the units being in a given state; e.g., combination 2 can be written as (1 - R_A) R_B R_C R_D R_E, where (1 - R_A) denotes the probability of failure of unit A by time t. The system reliability can be written as the sum of all the combinations for which the system is in the operating state, O, i.e.,

R_s = R_A R_B R_C R_D R_E + (1 - R_A) R_B R_C R_D R_E + R_A (1 - R_B) R_C R_D R_E + \cdots + (1 - R_A) R_B (1 - R_C)(1 - R_D) R_E        (23)

After simplification the system reliability is given by

R_s = R_A R_C + R_B R_C + R_B R_D + R_B R_E - R_A R_B R_C - R_B R_C R_D - R_B R_C R_E - R_B R_D R_E + R_B R_C R_D R_E        (24)
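Equation (23) can be verified mechanically, as in the sketch below. Because the drawing of Figure 19 is not reproduced here, the operating rule of the example system is expressed through its minimal path sets {A, C}, {B, C}, {B, D}, and {B, E}, which are consistent with the minimal cut sets identified in 6.1.5.3 and with the statuses listed in Table 15; the unit reliabilities are illustrative.

```python
from itertools import product

R = dict.fromkeys("ABCDE", 0.9)   # illustrative unit reliabilities

def system_operates(up):
    """Operating rule of the Figure 19 example, written via its minimal
    path sets {A,C}, {B,C}, {B,D}, {B,E}."""
    return (up["A"] and up["C"]) or (up["B"] and (up["C"] or up["D"] or up["E"]))

# Complete enumeration: sum the probabilities of all operating states,
# exactly as in Table 15 and Equation (23)
r_s = 0.0
for states in product([True, False], repeat=5):
    up = dict(zip("ABCDE", states))
    prob = 1.0
    for unit, working in up.items():
        prob *= R[unit] if working else (1.0 - R[unit])
    if system_operates(up):
        r_s += prob

print(f"R_s by complete enumeration = {r_s:.4f}")  # 0.9801, matching Equation (24)
```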

6.1.5.2 Conditional probability method (the law of the total probability)

This method is based on the law of total probability, which allows system decomposition by a selected unit and its state at time t. For example, the system reliability is equal to the reliability of the system given that unit A is in an operating state at time t, denoted by R_{S|A_G}, times the reliability of unit A, plus the reliability of the system given that unit A is in a failed state at time t, R_{S|A_B}, times the unreliability of unit A, or



R_S = R_{S|A_G} \cdot R_A + R_{S|A_B} \cdot Q_A        (25)

where

Q_A = 1 - R_A

This decomposition process continues until each term is written in terms of the reliabilities and unreliabilities of all the units. As an example of the application of this methodology, consider the system given in Figure 19 and decompose the system using unit C. Then the system reliability can be written as

R_S = R_{S|C_G} \cdot R_C + R_{S|C_B} \cdot Q_C        (26)

If unit C is in an operating state at time t, the system reduces to the configuration shown in Figure 20.

Figure 20—System reduction when unit C is operating

Therefore, the system reliability, given that unit C is in an operating state at time t, is equal to the series-parallel combination shown in Figure 20, or

R_{S|C_G} = 1 - (1 - R_A)(1 - R_B)        (27)

If unit C is in a failed state at time t, the system reduces to the configuration given in Figure 21.

Figure 21—System reduction when unit C fails


Then the system reliability, given that unit C is in a failed state, is given by

R_{S|C_B} = R_B \left[ 1 - (1 - R_D)(1 - R_E) \right]        (28)

The system reliability is obtained by substituting Equation (27) and Equation (28) into Equation (26):

R_S = \left[ 1 - (1 - R_A)(1 - R_B) \right] R_C + R_B \left[ 1 - (1 - R_D)(1 - R_E) \right] (1 - R_C)        (29)

The system reliability is thus expressed in terms of the reliabilities of its components. Simplification of Equation (29) gives the same expression as Equation (24). The component reliabilities can be obtained using the methodologies presented in the preceding subclauses.

6.1.5.3 Cut-sets methodology

A cut set is a set of components with the property that failure of all the components in the set causes the system to fail. A minimal cut set is a set containing the minimum number of components that causes the system to fail: if a single unit is removed (not failed) from the minimal cut set, the system will not fail. This implies that all the units in a minimal cut set must fail in order for the system to fail. The procedure for system reliability calculation using minimal cut sets is as follows:

a) Identify the minimal cut sets for a given system.
b) Model the components in each cut set as a parallel configuration.
c) Model all minimal cut sets as a series configuration.
d) Model the system reliability as a series combination of cut sets with the parallel combination of components in each cut set.

Following a), the minimal cut sets for the system of Figure 19 can be identified as

C_1 = \{A, B\}, \quad C_2 = \{B, C\}, \quad C_3 = \{C, D, E\}        (30)

Following b) and c), the system block diagram in terms of the minimal cut sets is as given in Figure 22.

Figure 22—System block diagram in terms of minimal cut sets

Using the methodologies for the series and parallel systems, the system reliability is

R_S = \left[ 1 - (1 - R_A)(1 - R_B) \right] \cdot \left[ 1 - (1 - R_B)(1 - R_C) \right] \cdot \left[ 1 - (1 - R_C)(1 - R_D)(1 - R_E) \right]        (31)


and upon simplification (applying the idempotency R_i \cdot R_i = R_i to the units shared between cut sets)

R_S = R_A R_C + R_B R_C + R_B R_D + R_B R_E - R_A R_B R_C - R_B R_C R_D - R_B R_C R_E - R_B R_D R_E + R_B R_C R_D R_E        (32)

which is the same result as given by Equation (24).
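The sketch below evaluates both forms for illustrative unit reliabilities. Note that the numerical product of Equation (31) differs slightly from Equation (32): because cut sets C1, C2, and C3 share units B and C, the product form treats the cut sets as if they were independent and is only an approximation, whereas the algebraic simplification with idempotency yields the exact result of Equation (32), matching the complete enumeration of 6.1.5.1.

```python
R = dict.fromkeys("ABCDE", 0.9)            # illustrative unit reliabilities
Q = {u: 1.0 - r for u, r in R.items()}     # unit unreliabilities

# Equation (31): series combination of the parallel cut sets
# C1 = {A, B}, C2 = {B, C}, C3 = {C, D, E}
r_eq31 = (
    (1.0 - Q["A"] * Q["B"])
    * (1.0 - Q["B"] * Q["C"])
    * (1.0 - Q["C"] * Q["D"] * Q["E"])
)

# Equation (32): the simplified exact expression
r_eq32 = (
    R["A"] * R["C"] + R["B"] * R["C"] + R["B"] * R["D"] + R["B"] * R["E"]
    - R["A"] * R["B"] * R["C"] - R["B"] * R["C"] * R["D"]
    - R["B"] * R["C"] * R["E"] - R["B"] * R["D"] * R["E"]
    + R["B"] * R["C"] * R["D"] * R["E"]
)

print(f"Equation (31) product form: {r_eq31:.5f}")   # 0.97912 (approximation)
print(f"Equation (32) simplified  : {r_eq32:.5f}")   # 0.98010 (exact)
```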

6.2 Fault-tree analysis (FTA)

Fault tree analysis is a technique that graphically and logically connects various combinations of possible events occurring in a system. The events, usually failures of components or subsystems, lead to the top undesired event, which may be system failure or its malfunction. The primary benefits of fault tree analysis are as follows:

a) It provides a methodology for tracking down system failures deductively.
b) It addresses system design aspects while dealing with the failures of interest.
c) It provides a graphical tool for describing system functions as well as insight into system behavior.
d) It provides system analysis by considering one failure at a time.
e) It provides qualitative and quantitative reliability analyses of the system of interest.

There are three phases in the fault tree analysis:

1) Develop a logic block diagram or a fault tree using the elements of the fault tree. This phase requires complete system definition and understanding of its operation. Every possible cause and effect of each failure condition should be investigated and related to the top event.

2) Apply Boolean algebra to the logic diagram and develop algebraic relationships between events. If possible, simplify the expressions using Boolean algebra.

3) Apply probabilistic methods to determine the probabilities of each intermediate event and the top event. The probability of occurrence of each event has to be known; i.e., the reliability of each component or subsystem for every possible failure mode has to be considered.

The graphical symbols used to construct the fault tree fall into two categories: gate symbols and event symbols. The basic gate symbols are AND, OR, k-out-of-n voting gate, priority AND, Exclusive OR, and Inhibit gate. The basic event symbols are Basic Event, Undeveloped Event, Conditional Event, Trigger Event, Resultant Event, Transfer-in, and Transfer-out Event. For a complete list of symbols and their graphical presentation, see Rao, S. S., Reliability-Based Design; Kececioglu, B. D., Reliability Engineering Handbook, Vols. 1 and 2; and Lewis, E. E., Introduction to Reliability Engineering. Quantitative evaluation of the fault tree includes calculation of the probability of the occurrence of the top event. This is based on the Boolean expressions for the interaction of the tree events. There are several methodologies for quantitative evaluation of fault trees. Some of them are as follows:

a) Minimal cut set algorithms. A cut set is a set of basic events whose occurrence causes the top event to occur. A minimal cut set is a set that satisfies the following: if any basic event is removed from the set, the remaining events are no longer a cut set. See 6.1.5.3 for more details. The MOCUS algorithm (see Kececioglu, B. D., Reliability Engineering Handbook, Vols. 1 and 2) can be used to determine the minimal cut sets for a given fault tree.

b) Dual trees and the minimal path sets. A path set is a dual set of the cut set. A path set is a set of basic events of the fault tree for which the top event is guaranteed not to occur if none of the events in the set occurs. A path set is a minimal path set if, when any of the basic events is removed from the set, the remaining set is no longer a path set. A dual tree of a given fault tree is a tree in which the OR gates are replaced with AND gates and the AND gates are replaced with OR gates relative to the original tree. The cut sets obtained from the dual tree are, in fact, the path sets of the fault tree.


The ultimate goal of fault tree analysis is to compute the probability of occurrence of the top event. Knowing the minimal cut sets for the tree of interest, the probability of occurrence of the top event can be obtained using the structure function methodology. Assuming that any basic event has only two states, i.e., occurring or not occurring, a binary indicator variable

z_j = \begin{cases} 1, & \text{if basic event } j \text{ occurs} \\ 0, & \text{if basic event } j \text{ does not occur} \end{cases}        (33)

is assigned to basic event j, j = 1, 2, …, n, where n is the number of events (components) in the system. The structure function of the i-th minimal cut set is

\varphi_i(Z) = \prod_{j=1}^{n_i} z_j        (34)

where i = 1, 2, …, m, m is the number of minimal cut sets, and n_i is the number of basic events in the i-th minimal cut set. Then the structure function for the top event in terms of the minimal cut sets is

\varphi(Z) = 1 - \prod_{i=1}^{m} \left[ 1 - \varphi_i(Z) \right]        (35)

The probability of occurrence of the top event is calculated from

P(TE) = E[\varphi(Z)]        (36)

In order to calculate the expectation given in Equation (36), the probability of occurrence of every basic event must be known. If the basic events are, in fact, the component failures, then the probability of a basic event is the probability of component (or subsystem) failure. These probabilities can be calculated using the methodologies presented in the previous clauses of this document. The probability of the occurrence of the top event is then the probability of system failure.
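A minimal sketch of Equations (33) through (36) is shown below; it estimates P(TE) = E[φ(Z)] by sampling the basic-event indicator variables, reusing the minimal cut sets of 6.1.5.3 with illustrative basic-event (component failure) probabilities.

```python
import numpy as np

rng = np.random.default_rng(42)
n_trials = 200_000

# Illustrative basic-event (component failure) probabilities
p = dict.fromkeys("ABCDE", 0.1)
cut_sets = [("A", "B"), ("B", "C"), ("C", "D", "E")]  # from 6.1.5.3

# Sample the binary indicator variables z_j of Equation (33)
z = {event: rng.random(n_trials) < prob for event, prob in p.items()}

# phi_i(Z) of Equation (34): product of the indicators in cut set i
phi_i = [np.logical_and.reduce([z[e] for e in cs]) for cs in cut_sets]

# phi(Z) of Equation (35): the top event occurs if any cut set occurs
phi = np.logical_or.reduce(phi_i)

# Equation (36): P(TE) = E[phi(Z)], estimated by the sample mean
print(f"P(top event) ~ {phi.mean():.4f}")  # exact value is 1 - 0.9801 = 0.0199
```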

6.3 Reliability of repairable systems

This subclause presents reliability analysis models for systems that can be repaired during operation or a mission. A redundant system that contains two or more units can be repaired as long as at least one of the units is functioning while the others are being repaired. Reliability analysis of repairable systems with redundancy includes several techniques, among which are Markov processes. For a parallel system of N identical units, N+1 states of the system existence can be identified: State N implies that all N units are operable; State N-1 implies that one unit has failed and is under repair while N-1 units are operable; …; State 1 implies that one unit is operable and the rest are under repair, either one at a time (single repair) or more than one at a time (multiple repairs); and State 0 implies that all units have failed and the system has failed as well. For Markov process theory to apply, the unit's failure rate, \lambda, and the unit's repair rate, \mu, must be constant (although non-constant rates can be approximated by combinations of constant rates). A general procedure for determining the reliability of repairable systems is as follows:

a) Identify all states of the system existence.
b) Determine the probability of the system being in each state at time t, considering only the transition rates for one state above and one state below the state of interest. Write down the system of differential equations describing all system states.
c) Solve the system of differential equations and define the system reliability from


R_{\lambda,\mu}(t) = 1 - P_0(t)        (37)

where P_0(t) is the probability that the system is in the failed state (State 0) at time t.

There are three methods for completing the above steps that result in the differential equations for a given system. They are the following:

1) System States Analysis Method
2) States Transition Matrix
3) Markov Graph Method

For illustration, consider a redundant system consisting of two identical units in parallel, as shown in Figure 23.

Figure 23—Parallel system with repair

The constant failure rate, \lambda, and the constant repair rate, \mu, are identical for both units. The states of the system existence are as follows:

— State 2: Both units are operable. It is assumed that the system is in State 2 at t = 0.
— State 1: One unit is operable; the other is in a failed state and is undergoing repair.
— State 0: Both units are in a failed state, and the system has failed.

At any point in time the system must be in one of the states, but it cannot be in two states at the same time; i.e., the states are mutually exclusive. The state transition matrix is given in Figure 24, in which the rows correspond to the state at time t and the columns to the state at time t + \Delta t (ordered 2, 1, 0):

        | 1 - 2\lambda \Delta t    2\lambda \Delta t                  0                 |
P =     | \mu \Delta t             1 - (\lambda + \mu) \Delta t       \lambda \Delta t  |
        | 0                        0                                  1                 |

Figure 24—Markov transition matrix for the two-unit parallel system

and the Markov graph for this system is given in Figure 25:

Figure 25—Markov graph for the two-unit parallel system


The system states, or the graph nodes, are identified by S2, S1, and S0, and the transition probabilities are written on each branch. The description of the Markov graph for this system is as follows: The system is in State 2 (S2) at time (t+\Delta t) if neither of the units fails, or if a unit that previously failed at time t is repaired and both units are operational at time (t+\Delta t). The system is in State 1 (S1) at time (t+\Delta t) if one of the units is in a failed state at time t and is not repaired by time (t+\Delta t) and the operational unit has not failed. The system is in State 0 (S0) if the second unit fails by time (t+\Delta t) as well. If in State 0, the system has failed. The system of differential equations for the system is

P_2'(t) = -2\lambda P_2(t) + \mu P_1(t)
P_1'(t) = 2\lambda P_2(t) - (\lambda + \mu) P_1(t)        (38)
P_0'(t) = \lambda P_1(t)

Upon solving the system and using Equation (37), the reliability of the system with two units in parallel withrepair is given by

R_{\lambda,\mu}(t) = \frac{s_1 e^{s_2 t} - s_2 e^{s_1 t}}{s_1 - s_2}        (39)

where

s_1 = -\frac{1}{2}(3\lambda + \mu) + \frac{1}{2}\sqrt{\lambda^2 + 6\lambda\mu + \mu^2}        (40)

and

s_2 = -\frac{1}{2}(3\lambda + \mu) - \frac{1}{2}\sqrt{\lambda^2 + 6\lambda\mu + \mu^2}        (41)

The methodology can be extended to more complex system configurations that include unequal failure and repair rates for the parallel units, stand-by units with perfect or imperfect sensing and switching, systems with single or multiple repairs, etc. However, application of the Markov process implies that the failure and repair rates are constant, which in some instances may not be true. Non-constant failure or repair rates lead to semi-Markov or non-Markov processes, whose application becomes much more complex from the computational point of view (see Ross, S., Stochastic Processes). Another approach to studying the reliability and availability of repairable systems is the application of renewal theory, which is based on the assumption that each system repair, or renewal, restores the system to "as good as new" condition. For more information on renewal theory, see Cox, D. R., Renewal Theory.
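The two-unit example can be checked numerically, as in the sketch below: the closed form of Equations (39) through (41) is compared against a forward-Euler integration of the state equations (38) for illustrative failure and repair rates.

```python
import math

lam, mu = 1e-3, 1e-1   # illustrative failure and repair rates, per hour
t = 1000.0             # mission time, hours

# Closed form, Equations (39)-(41)
disc = math.sqrt(lam**2 + 6.0 * lam * mu + mu**2)
s1 = (-(3.0 * lam + mu) + disc) / 2.0
s2 = (-(3.0 * lam + mu) - disc) / 2.0
r_closed = (s1 * math.exp(s2 * t) - s2 * math.exp(s1 * t)) / (s1 - s2)

# Forward-Euler integration of Equation (38) as a cross-check
dt = 0.01
p2, p1, p0 = 1.0, 0.0, 0.0   # the system starts in State 2
for _ in range(int(t / dt)):
    dp2 = -2.0 * lam * p2 + mu * p1
    dp1 = 2.0 * lam * p2 - (lam + mu) * p1
    dp0 = lam * p1
    p2, p1, p0 = p2 + dt * dp2, p1 + dt * dp1, p0 + dt * dp0

print(f"closed form R(t) = {r_closed:.6f}")
print(f"numerical R(t)   = {1.0 - p0:.6f}")   # Equation (37)
```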


6.4 Monte Carlo simulation

Monte Carlo simulation is a very powerful tool that enables engineers to study complex system behavior and performance. It is particularly effective for complex systems whose performance is difficult to analyze analytically, whose performance evaluation by experiment is long and costly, and whose component and subsystem performances are known in terms of the random variables that describe such performance. If system performance parameters are known to follow certain probability distributions, system behavior can be studied by considering several possible values of those parameters generated using the corresponding distributions. The reliability of a system can be calculated by simulating system performance using random number generation and determining the percentage of successful system performance outcomes.

All Monte Carlo simulation calculations are based on the substitution of the random variables that represent a quantity of interest by a sequence of numbers having the statistical properties of those variables. These numbers are called random numbers. In general, random variables can be classified as continuous, discrete, and correlated random variables, and corresponding methodologies have been developed for their generation.

The first step in predicting complex system reliability using Monte Carlo simulation is to consider as many of the parameters influencing the performance as possible. The next step is to determine the random parameters (variables) and to estimate the corresponding distributions. A sample of system performance is created by generating all random performance parameters and then comparing the sample performance with the performance requirement. If the sample performance meets the requirement, it is considered successful. The process continues until a predetermined number of performance samples are generated. The system reliability is then calculated using

R = \frac{\text{Number of successful "experiments"}}{\text{Total number of "experiments"}, N}    (42)

The flow diagram in Figure 26 presents the Monte Carlo simulation technique as applied to reliability estimation of a complex system.


Figure 26 is a flow diagram with the following steps: define the total number of experiments to be conducted, N; identify the random parameters of the system; assume appropriate distributions for the parameters (random variables); initialize the experiment counter, i = 1; generate a uniformly distributed number for each random variable; generate each random variable according to its distribution; using the set of generated random variables (parameters), evaluate the performance of the system; examine the system performance and determine whether the experiment is a success or a failure; if i < N, set i = i + 1 and repeat; otherwise, calculate the system reliability using R = Number of successful experiments / Total number of experiments, N.

Figure 26—Monte Carlo simulation technique
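As a concrete illustration of the Figure 26 loop, the Python sketch below estimates the reliability of a hypothetical system whose performance requirement is that combined component strength exceed an applied stress. The distributions, parameter values, and success criterion are assumptions made up for this example:

    # A minimal sketch of the Figure 26 simulation loop for a
    # hypothetical stress-strength system.
    import random

    def one_experiment():
        """Generate one sample of system performance and test it."""
        strength_a = random.normalvariate(100.0, 10.0)   # assumed N(100, 10)
        strength_b = random.weibullvariate(120.0, 2.0)   # assumed Weibull, alpha=120, beta=2
        stress = random.normalvariate(150.0, 20.0)       # assumed N(150, 20)
        return strength_a + strength_b > stress          # assumed success criterion

    N = 100000                                           # total "experiments"
    successes = sum(one_experiment() for _ in range(N))
    print("Estimated reliability R =", successes / N)    # Equation (42)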


There are several advantages to Monte Carlo simulation techniques for system reliability prediction. Monte Carlo simulation provides evaluation of complex, real-world systems with stochastic elements, which cannot be evaluated analytically. The simulation allows one to estimate system performance under some projected set of operating conditions. Alternative system designs can be compared via simulation to see which best meets a specified requirement. Better control over experimental conditions can be obtained than when experimenting with a real system.

The disadvantages of the Monte Carlo simulation technique for system reliability prediction include the following.

— Each run of the stochastic simulation produces only estimates of a model's true characteristics for a particular set of input parameters.

— Simulation is not good for optimization, but it is good for comparing different system designs.
— Simulation of complex systems is expensive and time-consuming to develop.
— The validity of the model is critical. If a model is not valid, simulation results are useless.


Annex A

(informative)

Statistical data analysis

This annex contains several standard statistical methodologies that can be used for analysis of reliability data. The methods contained herein are brief synopses. Details of the methods and a greater discussion of the theoretic bases for each are found in the references.

The concepts of "complete," "singly censored," and "multiply censored" data are germane to all life data analysis techniques. Some techniques are more difficult than others to apply to specific data types. "Complete data" means all units were run to failure. "Singly censored" data means there are survivors and all the survivors have the same operation (test) time on them. "Multiply censored" means each surviving unit may have accumulated different operating/test time. That is, there are multiple censor times on the surviving units. Field data is often multiply censored because units are installed at different times, so that the survivors accumulate different amounts of usage time.

Subclause A.1 describes graphical plotting techniques, including both probability plotting and hazard plotting. Subclause A.2 presents an analytic technique, the maximum likelihood method for determining distribution parameters (not a plotting technique). Goodness-of-fit techniques and tests for determining, statistically, whether the data really fit the assumed distribution are not covered, but references are provided.

A.1 Graphical techniques

Graphical plotting techniques are used to determine the parameters of the underlying failure distribution. Once determined, the probability of survival or failure can be estimated for various time intervals. Making a statement about reliability that requires extrapolation beyond the longest point in time for which data has been generated, whether failure or censor, is a prediction. It is predictive because it estimates the reliability while assuming the current distribution continues for the rest of the life of the product and that no new failure mechanisms (distributions) will occur.

Graphical methods consist of plotting data points on paper developed for a specific distribution. Linear regression techniques are often used to fit a straight line to the plotted data. Statistical goodness-of-fit tests are used to determine whether or not the data can be modeled using the assumed distribution. That is, a goodness-of-fit test can help determine whether the data really should be modeled by the assumed distribution. Goodness-of-fit tests are not discussed herein, but can be found in Nelson [B10]. The advantages of the graphical methods are simplicity and speed (ease of plotting). A disadvantage is a less precise estimate of the distribution parameters.

When fitting data to distributions, the process is trial and error, using educated guesses as to the best candidate distributions. Specific distributions are selected as candidates based on the type of data collected. For example, candidate distributions for time-based data (time to failure) include Weibull and exponential. For dimensional measurements, normal is often tried first. For go/no-go (Bernoulli trials) data, binomial is a good start. Lognormal is often a good starting point for cyclic mechanisms such as metal fatigue. Other candidate distributions can be found in Table 4-3 of Leemis [B7]. If the first distribution does not fit the data, another is assumed and evaluated until an appropriate distribution is determined. Sometimes no standard distribution adequately models the data. While more complicated and elaborate models can be developed, the mathematics required frequently becomes very difficult. In these cases, non-parametric analysis techniques may prove helpful.


The two types of plotting techniques discussed are hazard plotting and probability plotting. Both types of plotting depend on the linear dependency of two variables, e.g.,

y = a \cdot x + b    (A.1)

To arrive at a linear equation as shown in Equation (A.1), the distribution must undergo a mathematical transformation. The cumulative distribution function (CDF), which models the cumulative probability of failure as a function of time, is linearized through this transformation. The plot then consists of plotting time (or a function of plotting time) on one axis and the transformed CDF on the other. Having the plotted data follow a straight line verifies that the underlying failure distribution used to generate the axis transform is valid.

Once the plot is made, the distribution's parameters can be read directly off the chart. The transforms are different for every distribution and are different for probability and hazard plots. Some of the transformations are discussed briefly in the following subclauses. Equations for other distributions are available in most textbooks, such as Nelson [B10].

A.1.1 Probability plotting

A probability plot consists of the cumulative probability of failure plotted on the vertical axis and time (or a function of time) plotted on the horizontal axis. The plotting paper usually has additional graphical scales to permit reading the distribution's parameters directly from the plot. Probability plots are easiest to apply to "complete" data. They can be applied to singly and multiply censored data, but special software tools are often required to do the analysis.

In probability plots, the plotting positions are determined based on their rank (order of occurrence). The most commonly used, because of its higher accuracy, is the "median rank." A commonly used estimate of the median rank is shown in Equation (A.2):

F(t_i) = MR_i = \frac{i - 0.3}{n + 0.4} \cdot 100    (A.2)

where

i is the rank, and
n is the number of data points.
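For example, a few lines of Python suffice to compute the Equation (A.2) plotting positions for a complete sample (the failure times below are hypothetical, chosen only to illustrate the calculation):

    # Median-rank plotting positions per Equation (A.2).
    times = sorted([55.0, 120.0, 187.0, 245.0, 300.0])   # hypothetical data
    n = len(times)
    for i, t in enumerate(times, start=1):
        mr = (i - 0.3) / (n + 0.4) * 100.0               # Equation (A.2), percent
        print(f"rank {i}: t = {t:7.1f}  median rank = {mr:5.1f}%")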

The transforms for several different distributions are shown in A.1.1.1 through A.1.1.3.

A.1.1.1 Weibull distribution

The two-parameter Weibull distribution is given by

F(t) = 1 - e^{-(t/\alpha)^{\beta}}

where

β is the shape parameter, and
α is the scale parameter or characteristic life.



The linear transform is determined by taking the logarithm of both sides twice. This results in the following:

\log \ln\left[\frac{1}{1 - F(t)}\right] = \beta \cdot \log(t) - \beta \cdot \log(\alpha)    (A.3)

which is of the form

y = m \cdot x + b

This linear transform is included in the axes of Weibull probability paper, so it is necessary only to determine the median ranks and plot median ranks versus the time for each data point. An example is shown in Figure A.1.

The parameter α can be read off the graph as the time that corresponds to the probability of failure of 63.2%, as illustrated in Figure A.1. The shape parameter β can be calculated as the slope of the fitted line; i.e.,

\beta = \frac{\log \ln[1/(1 - F(t_2))] - \log \ln[1/(1 - F(t_1))]}{\log(t_2) - \log(t_1)}    (A.4)

where

t_2 is not equal to t_1.

Figure A.1—Example Weibull probability plot
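When the plot is done numerically rather than on paper, the same estimates follow from a least-squares fit to the Equation (A.3) transform of the median-rank points. A Python sketch, using hypothetical failure times:

    # Weibull parameter estimation by linear regression on the
    # Equation (A.3) transform of median-rank points (Equation (A.2)).
    import math

    times = sorted([42.0, 110.0, 196.0, 301.0, 480.0, 755.0])  # hypothetical
    n = len(times)
    xs, ys = [], []
    for i, t in enumerate(times, start=1):
        F = (i - 0.3) / (n + 0.4)                   # median rank, Equation (A.2)
        xs.append(math.log10(t))
        ys.append(math.log10(math.log(1.0 / (1.0 - F))))  # Equation (A.3)

    # Ordinary least-squares slope and intercept.
    xbar, ybar = sum(xs) / n, sum(ys) / n
    slope_num = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    slope_den = sum((x - xbar) ** 2 for x in xs)
    slope = slope_num / slope_den
    intercept = ybar - slope * xbar

    beta = slope                                    # shape parameter (the slope)
    alpha = 10.0 ** (-intercept / beta)             # since intercept = -beta*log10(alpha)
    print(f"beta = {beta:.2f}, alpha = {alpha:.0f}")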


The mean life for a Weibull is not the characteristic life, α. For a Weibull the mean life is

m = \alpha \cdot \Gamma\left(1 + \frac{1}{\beta}\right)    (A.5)

where Γ(·) is the Gamma function. The reliability function is given by

R(t) = 1 - F(t) = e^{-(t/\alpha)^{\beta}}    (A.6)
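As a quick numeric companion to Equations (A.5) and (A.6), assume illustrative values of β = 2.0 and α = 1000 hours:

    # Mean life and reliability per Equations (A.5) and (A.6),
    # with assumed (illustrative) parameter values.
    import math

    beta, alpha = 2.0, 1000.0
    mean_life = alpha * math.gamma(1.0 + 1.0 / beta)    # Equation (A.5)
    R_500 = math.exp(-(500.0 / alpha) ** beta)          # Equation (A.6) at t = 500 h
    print(f"mean life = {mean_life:.1f} h, R(500 h) = {R_500:.3f}")

For β = 2, Γ(1.5) is about 0.886, so the mean life (about 886 h) is noticeably below the characteristic life of 1000 h, illustrating the point above.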

A.1.1.2 Exponential distribution

The CDF for the exponential distribution is given by

F(t) = 1 - e^{-\lambda t}    (A.7)

Taking the natural logarithm of both sides yields

\ln[1 - F(t)] = -\lambda \cdot t    (A.8)

which indicates that ln[1 - F(t)] varies linearly with time, t. Thus, the probability paper for the exponential distribution is constructed by plotting the values of [1 - F(t_i)] on a logarithmic scale against the values of t on a linear scale. The process for complete data is as follows:

a) Arrange the times for the n failed units in increasing order, so that t_1 ≤ t_2 ≤ … ≤ t_i ≤ … ≤ t_n.
b) Determine the median rank for each failure using Equation (A.2).
c) Select plotting paper for the assumed distribution (the axes are already transformed).
d) Plot the points as shown on Figure A.2.
e) Draw a straight line through the data points. Least-squares can be used to determine the best fit line.
f) Determine the goodness-of-fit to be sure that the underlying distribution (exponential in this case) is appropriate.

To estimate the distribution parameter, the constant failure rate λ, draw a horizontal line at 63.2% that intersects the fitted line and then draw a vertical line that intersects the x-axis. The value at the intersection with the x-axis is the MTTF, and the constant failure rate is the reciprocal of that value.

In this example, shown in Figure A.2, the MTTF is 1000 hours and λ = 1/1000 = 0.001/hr. The reliability function is given by

R(t) = 1 - F(t) = e^{-\lambda t}    (A.9)



The predicted probability of failure by a specific time can be read directly off the plot. For example, the probability of failing by 4000 hours is 98%.
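This reading can be checked directly against Equation (A.9), using the example's λ = 0.001/hr:

    # Probability of failure by 4000 h per Equation (A.7)/(A.9).
    import math

    lam = 1.0 / 1000.0                       # from the example above
    F_4000 = 1.0 - math.exp(-lam * 4000.0)   # F(t) = 1 - R(t)
    print(round(F_4000, 3))                  # 0.982, i.e., about 98%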

A.1.1.3 Other distributions

The normal and lognormal probability plotting techniques are performed analogously to the methods described for exponential and Weibull distributions. Several commercial software packages that use probability-plotting techniques are available.

A.1.2 Hazard analyses

The advantage of hazard plots is that they are easily used for complete or incomplete data, as well as multiply censored data. Therefore, they are appropriate for analyzing test data in which all units have failed, tests in which only a fraction of the units have failed, and field data in which only a fraction of the units have failed and the remaining functional units may all have different total numbers of operational hours on them. Furthermore, it is often easy to identify multiple failure mechanisms (each having its own hazard rate or parameters) directly off the plot. As with probability plots, each distribution requires its own paper. Exponential hazard paper is semi-log, whereas Weibull paper is log-log. This is a result of the transformation of variables described in A.1.1.

Hazard plotting does not require specialized computer software and can be performed with a simple spreadsheet application. The recommended process (see Nelson [B10]) is included here. Plotting can be done in the spreadsheet or by hand.

Figure A.2—Example probability plot


a) Order the times to failure or censoring from shortest to longest.
b) Number the failure and censoring times from 1 to n, where n is the total number of data points (test articles or field units).
c) Now reverse the order of the numbers, so the shortest test time is n and the longest is 1.
d) Calculate the inverse of the order, h(t), from 1/n to 1/1. These are the instantaneous hazard rates.
e) Use only the failures and maintain a running sum of the instantaneous hazard rates. This is the cumulative hazard rate, H(t). (A computational sketch of these steps follows Figure A.3.)

Using an appropriate hazard paper, plot the time of failure on the ordinate and the cumulative hazard percent on the abscissa for each point. If the plot approximates a straight line, then the data fit the assumed distribution. Linear regression can be used to determine the best fit line.

If plotted on Weibull paper, the slope of the line is the shape parameter, β, and the intercept at 100% failure probability is the characteristic life, α. If the shape parameter is 1.0, then the distribution is exponential and the characteristic life is the MTBF. If the slope is greater than 1.0, then the hazard rate is increasing. If the slope is less than 1.0, the distribution has a decreasing hazard rate. An example set of test data is shown in Table A.1 and plotted in Figure A.3. This data shows a decreasing hazard rate situation. The distribution parameters can be read directly off the plot. The characteristic life is 300,000 hours, corresponding to the 100% value of the cumulative hazard function. For details on the theory of hazard plotting, see Nelson [B10].

Table A.1—Data for Weibull Plot

Unit   Time   Fail?   h(t)       H(t)
  1      10     Y     1/1000      1/1000
  2      16     Y     2/1000      3/1000
  3      30     Y     3/1000      6/1000
  4      50     N
  5     100     Y     5/1000     11/1000
  6     135     Y     6/1000     17/1000
  7     148     Y     7/1000     24/1000
  8     150     N
  9     175     Y     9/1000     33/1000
 10     200     N
 11     210     Y    11/1000     44/1000
 12     250     N
 13     260     Y    13/1000     57/1000
 14     500     Y    14/1000     71/1000
 15    1000     N

(Figure A.3 plots time to failure, 1 hour to 1,000,000 hours, against cumulative hazard, 0.10% to 1000%, on log-log axes; the fitted line gives β = 0.67 and α = 300,000 hrs.)

Figure A.3—Hazard plot
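Steps a) through e) above are simple to carry out in a spreadsheet or a few lines of code. The Python sketch below applies them to a small, made-up multiply censored data set of (time, failed?) records; it does not reproduce Table A.1:

    # Hazard-plotting steps a)-e) for multiply censored data.
    data = [(10, True), (35, False), (48, True), (62, True),
            (80, False), (95, True), (130, False), (160, True)]

    data.sort(key=lambda rec: rec[0])          # step a): order by time
    n = len(data)
    H = 0.0                                    # cumulative hazard, step e)
    for order, (t, failed) in enumerate(data, start=1):
        k = n - order + 1                      # step c): reverse rank
        if failed:                             # step e): failures only
            h = 1.0 / k                        # step d): instantaneous hazard
            H += h
            print(f"t = {t:4d}  h(t) = 1/{k}  H(t) = {100.0 * H:6.1f}%")

Plotting each failure time against its cumulative hazard percent on the appropriate hazard paper (log-log for Weibull) and checking for a straight line completes the procedure.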


A.2 Analytical techniques

Maximum Likelihood Estimation (MLE) is frequently used as an analytic technique to estimate distribution parameters. The advantage of analytical methods such as MLE is the accuracy of the parameter estimates. The disadvantage is computational complexity. In the case of incomplete data, such as suspended units, left and right censoring, or interval data, MLE modeling is the best analytical tool.

The MLE method can accommodate both complete and incomplete data and can provide confidence limits for the model parameters and for the functions of those parameters. The MLE technique is computationally very intensive and requires complex numerical routines. However, many commercial software packages can perform the standard linear regression or MLE calculations and provide analytical estimates of the model parameters. The theoretic bases of the MLE method are not presented here but can be found in Nelson [B10].
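As one concrete case, the exponential model with right-censored (suspended) units has a well-known closed-form MLE: the estimated failure rate is the number of failures divided by the total time on test. A minimal Python sketch, using hypothetical (time, failed?) data:

    # Exponential MLE with right-censored data:
    # lambda-hat = (number of failures) / (total time on test).
    data = [(120.0, True), (340.0, True), (500.0, False),
            (610.0, True), (800.0, False), (950.0, True)]

    failures = sum(1 for _, failed in data if failed)
    total_time_on_test = sum(t for t, _ in data)   # survivors also accrue time
    lam_hat = failures / total_time_on_test        # MLE of the failure rate
    print(f"lambda-hat = {lam_hat:.5f}/hr, MTTF-hat = {1.0 / lam_hat:.0f} hr")

More general cases (Weibull or lognormal parameters, interval data) have no such closed form and require the iterative numerical routines mentioned above.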

If the data is incomplete in a non-traditional way, such as the inclusion of non-operating time periods, not observing the effect of environmental factors, or if failure analysis indicates only the failure of a higher-level assembly, the Expectation-Maximization (EM) algorithm can be used to calculate the MLE parameters (see Albert and Baxter [B1]).



Annex B

(informative)

Bibliography

[B1] Albert, J. R. G., and Baxter, L. A., "Application of EM Algorithm to the analysis of life length data," Applied Statistics, Vol. 44, No. 3, pp. 323–341, 1995.

[B2] IEC 60812 (1985): Analysis Techniques for System Reliability—Procedure for Failure Mode and Effects Analysis (FMEA), pp. 29–35.

[B3] IEEE 100, The Authoritative Dictionary of IEEE Standards Terms, Seventh Edition.

[B4] IEEE Std 1220-1998, IEEE Standard for Application and Management of the Systems Engineering Process.

[B5] IEEE Std 1413-1998, IEEE Standard Methodology for Reliability Predictions and Assessment for Electronic Systems and Equipment.

[B6] JEDEC JEP 131-1998: Process Failure Mode and Effects Analysis (FMEA).

[B7] Leemis, Lawrence M., Reliability: Probabilistic Models and Statistical Methods, Prentice-Hall, Upper Saddle River, New Jersey, pp. 230–247, 1995.

[B8] MIL-STD-1629A: Procedures for Performing a Failure Mode, Effects and Criticality Analysis (Canceled), 1980.

[B9] Modarres, M., What Every Engineer Should Know About Reliability and Risk Analysis, Marcel Dekker, 1993.

[B10] Nelson, Wayne, Applied Life Data Analysis, John Wiley and Sons, New York, 1982.

[B11] Pecht, M., Product Reliability, Maintainability, and Supportability Handbook, CRC Press, New York, New York, 1995.

[B12] U.S. Department of Commerce, Questions and Answers: Quality System Registration ISO 9000 Standard Series, U.S. Department of Commerce, Washington, D.C., 1992.

IEC publications are available from the Sales Department of the International Electrotechnical Commission, Case Postale 131, 3, rue de Varembé, CH-1211, Genève 20, Switzerland/Suisse (http://www.iec.ch/). IEC publications are also available in the United States from the Sales Department, American National Standards Institute, 11 West 42nd Street, 13th Floor, New York, NY 10036, USA.
The IEEE products referred to in this standard are trademarks belonging to the Institute of Electrical and Electronics Engineers, Inc.
IEEE publications are available from the Institute of Electrical and Electronics Engineers, 445 Hoes Lane, P.O. Box 1331, Piscataway, NJ 08855-1331, USA (http://standards.ieee.org/).
JEDEC publications are available from JEDEC, 2001 I Street NW, Washington, DC 20006, USA.
MIL publications are available from Customer Service, Defense Printing Service, 700 Robbins Ave., Bldg. 4D, Philadelphia, PA 19111-5094.
