elec3106 electronics lecture notes: circuit failures and reliability · 2017-02-06 · school of...

School of Electrical Engineering and Telecommunication

ELEC3106 Electronics

Lecture notes: circuit failures and reliability

Objective

The objective of these brief notes is to supplement the textbooks used in the course on the topic of circuit failures and reliability. Much of the material herein is taken from Ohring [1], to which the interested reader is referred. Additional material for the section on failure mode and effect analysis can be found in Stamatis [2], FMEA Info Centre [3], and on the course website.

Circuit failure mechanisms

There are many mechanisms that cause electronic circuits to fail. Normal use, and plain aging, wear down the electronics just as mechanical systems wear down with use. Poor manufacturing techniques (often used on cheap components) lead to early component failures, as does equipment use in adverse environments (such as extreme temperatures or corrosive environments). Trauma, such as physical impact or electro-static discharge (ESD), can also lead to electronics failure. Failures can be sudden and catastrophic, as when a component is destroyed by an ESD event, or they can be gradual, with slowly degrading performance due to a shift in circuit parameters, as may be caused by normal wear and aging. Particularly troublesome are failures that are intermittent (perhaps caused by a loose component) or that happen only under certain circumstances (such as at high temperatures or for a specific set of input conditions); such failures can be very hard to identify. In this section we shall have a look at some typical failure mechanisms found in electronic circuits. There are many other ways that circuits can fail than those described herein, but the present text should give the reader a general idea of how circuits fail.

Electro-static discharge

Figure 1: Diode damaged by electro-static discharge (a). Human-body model (HBM) of ESD event (b). [Panel labels: ESD current causing damaging reverse break-down (a); 1.5 kΩ, 100 pF, 0 kV to 6 kV (b).]

[1] M. Ohring, Reliability and Failure of Electronic Materials and Devices, Academic Press, New York, 1998.
[2] D. H. Stamatis, Failure Mode and Effect Analysis: FMEA from Theory to Execution, 2nd Ed., ASQ Quality Press, Milwaukee, 2003.
[3] FMEA Info Centre, http://www.fmeainfocentre.com, last accessed 21/3/11.

ELEC3106/notes-reliability p. 1/15

Electro-static discharge (ESD), see Figure 1, can damage most devices; semiconductor devices are particularly vulnerable. The damage from an ESD event can leave the device completely short-circuited or open-circuited, or somewhere in between, for instance with increased leakage current (i.e. the component has degraded performance). It may take several smaller ESD events, each giving progressive performance degradation, before the component must be deemed failed; alternatively, the component can be completely destroyed in one large ESD event.
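As a rough worked example of the human-body model in Figure 1(b), the sketch below treats the ESD source as a 100 pF capacitor charged to the test voltage and discharged through 1.5 kΩ. The short-circuit load and the function name `hbm_discharge` are assumptions made for illustration only; real ESD waveforms also depend on parasitic inductance and on the device being stressed.

```python
# Human-body model (HBM) values as labelled in Figure 1(b).
C_HBM = 100e-12   # body capacitance in farads (100 pF)
R_HBM = 1.5e3     # series resistance in ohms (1.5 kOhm)

def hbm_discharge(v_charge):
    """Return (peak current in A, time constant in s, stored energy in J)
    for an ideal RC discharge into a short circuit (an assumed load)."""
    i_peak = v_charge / R_HBM            # initial current, I = V/R
    tau = R_HBM * C_HBM                  # RC time constant
    energy = 0.5 * C_HBM * v_charge**2   # energy stored on the capacitor
    return i_peak, tau, energy

i_peak, tau, energy = hbm_discharge(6e3)  # upper end of the 0 kV to 6 kV range
print(f"peak current  = {i_peak:.1f} A")
print(f"time constant = {tau * 1e9:.0f} ns")
print(f"stored energy = {energy * 1e3:.1f} mJ")
```

Under these assumptions the stored energy is only a couple of millijoules, but it is delivered in a fraction of a microsecond at a peak current of several amperes, which is consistent with small semiconductor junctions being easily damaged.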

Electromigration

Figure 2: Electromigration on integrated circuits. Electron movement in conductive wire (a). Wire material moved over time causes near open-circuit in wire (b). [Panel labels: wire, current flow, electrons (e), material movement.]

On integrated circuits in particular (because the wires used are very thin), electromigration causes conductive wires to go open-circuit over time, see Figure 2. When electrons flow in a conductor they occasionally collide with the atoms in the wire material, causing the atoms to shift slightly. Over time there is thus a movement of material in the direction of the electron flow. The higher the current density in the wire, the faster the migration of atoms. The effect is that material is usually moved away from grain boundaries or faults in the wire, eventually causing the wire to go open-circuit. Electromigration is most pronounced in wires carrying DC currents, but also happens in wires carrying pure AC currents; typically, commercial integrated circuits are designed with a life-time of about 10 years.

Dendritic growth and dielectric break-down

Figure 3: Corrosion and dendrite formation between wires in the presence of moisture (a). Dielectric break-down between conductors where field is strongest (b). [Panel labels: conductor, sustained E field, corrosion, dendrites (a); conductor, large E field, break-down (b).]

Even in the driest environment finite amounts of moisture and ions are present, forming a thin electrolytic coating between conductors. If a sustained electrical field is present between a pair of conductors, the conductive material (for instance copper) can corrode from one wire and be moved towards the other wire, being deposited there, see Figure 3(a). The effect (in addition to the eroding of the first wire) is the growth of conductive “whiskers” or dendrites from the second wire towards the first. This can eventually lead to a short-circuit between the wires.

Large electric fields can also cause a break-down of the dielectric between conductors, causing them to short-circuit. Such break-downs normally occur where there is a weakness in the dielectric or where (for instance due to geometry) the electric field between the conductors is strongest, see Figure 3(b). The break-down can be temporary, such that normal circuit function can resume after the high field has been removed; but it can also be permanent, causing circuit failure. The latter usually occurs when a thin dielectric material is present in a component, such as in a MOS transistor or a capacitor.

Photo-lithographic defects

Figure 4: Defects in wire fabrication due to dust in lithographic process. Excess material where dust particles are present (a). Missing material where dust particles are present (b). [Panel labels: wire, patterns caused by dust particles, near short, short circuit (a); wire, patterns caused by dust particles, near open, open circuit (b).]

Both printed circuit boards (PCBs) and integrated circuits (ICs) are fabricated using photo-lithographic processes whereby the desired circuit patterns are transferred to the substrate from an image using a photographic process. Dust particles present during this process can form undesired patterns on the target substrate, causing missing or excess conductive material to be laid down on the substrate (depending on whether a positive or negative photographic step is used), see Figure 4. If the dust particles are comparable in size to the feature dimensions in the process (e.g. the width of conductive wires), their presence can cause complete failure to fabricate operating circuits (e.g. due to open- or short-circuited wires). Slightly smaller dust particles may only cause near open-circuits or near short-circuits.

Figure 5: Accelerated wear caused by near-failures in photo-lithographic process. Increased field causing accelerated dendrite formation (a). Increased current density causing accelerated electromigration (b). [Panel labels: wire, accelerated dendritic growth causing early short circuit (a); wire, current crowding causing early open circuit (b).]

The near-open and near-short manufacturing defects are more bothersome than the complete manufacturing failures; this is because the latter can be detected at production and thus never make it into final products. The near-failure defects, however, will go undetected, as the circuit would initially operate as desired. They can nevertheless accelerate other failure mechanisms such as dendrite growth or electromigration, see Figure 5, with an early circuit failure of the final product as a consequence. Dust is always present to a certain extent (depending on the cleanliness of the manufacturing environment), thus the occasional manufactured circuit will not work, while some will fail early.

Junction spiking

Figure 6: Silicon dissolving in aluminium and aluminium migration causing a short-circuit across a p-n junction in old-fashioned IC process. [Labels: aluminium interconnect, n-type Si, p-type Si, silicon junction, Si migration, Al migration, spike causing junction short.]

Integrated circuits consist of very complicated arrangements of many different materials, in a state with highly reduced entropy. In the p-n junction, for instance, the boron (p-type) impurities are concentrated on one side of the junction while the phosphorus (n-type) impurities are concentrated on the other. The “preferred” state of the system would be to have an equal concentration of boron and phosphorus throughout the semiconductor. At normal circuit temperatures, fortunately, species movement is so slow that the p-n junction is stable. Keep in mind, however, that at highly elevated temperatures species movement is much faster, causing circuit components to fail at a faster rate. Not all species movements are negligible at normal circuit temperatures, though. Silicon, for instance, is soluble in aluminium. In old-type semiconductor processes, aluminium was used directly as the contact to the active silicon semiconductor. This can cause silicon to migrate into the aluminium interconnect and the aluminium to fill the void left by the silicon, eventually shorting nearby p-n junctions, see Figure 6.

Parameter shift

Figure 7: Caption. [Labels: gate, source, drain, n-type Si, p-type Si, electron flow, high-energy electrons trapped in oxide.]

Material movement in general normally happens slowly, causing a slow shift in component parameters, such as capacitance or resistance. Charge stored in a non-volatile memory, such as an EPROM or flash memory, slowly leaks away, eventually losing the information originally stored. Information stored on magnetic materials is slowly lost due to randomisation of the magnetisation. Again, most consumer electronic devices are designed to last no longer than about a decade. Another example of parameter shift is the shift in threshold voltage in MOS transistors over time: at the drain end of MOS transistors, electrons can gain enough energy to tunnel into the dielectric separating the gate from the drain and remain there, see Figure 7. The threshold voltage shift caused by accumulated trapped charge can eventually lead to circuit failure.

Stress and thermal cycling

Figure 8: Thermal expansion causing stress in IC packages mounted on PCB. Through-hole package (a). Chip-scale package (b). Expanded circuit board; through-hole package pins able to absorb stress (c). Expanded circuit board; with no stress relief, the solder connections on the chip-scale package eventually break (d). [Labels: through-hole packaging, chip-scale packaging, some stress relief, cracked solder bumps.]

Electronic systems are made up of a large number of different materials which have different thermal expansion coefficients; further, throughout the manufacturing processes, a number of different temperatures are often used. This means that there is normally a considerable amount of stress built into an electronic circuit. When the temperature changes, the stress profiles in the circuit will change. Temperature changes in a final product can be quite significant, even over a day. Consider the electronics in a car parked in the sun: at night the temperature may drop to, say, 0 °C, while during the day it may surge to 50 °C. Such thermal cycling can eventually cause material fatigue. Some components are better than others at withstanding the stress cycles caused by thermal cycling. Figure 8(a,c) shows a through-hole mounted IC package where the leads provide an amount of stress relief when the circuit board expands; Figure 8(b,d) shows a chip-scale package without any stress relief, where the solder connections eventually break when the circuit board expands.

Contact wear

Figure 9: Contact wear. Repeated use of contact causes an increase in contact oxide thickness, eventually leading to poor connection. [Labels: slider, contact, conductor, oxidised, cracked oxide, contact deformation, new oxide formation.]


Contacts, such as inter-connectors, switches or other components creating an electrical connection that can be made or removed mechanically, are extensively used in electronic circuits to join different systems and different parts within a system. Such contacts are weak points in the system, often failing early. A common contact failure mechanism is illustrated in Figure 9. The exposed metal surface of the contact is often subject to oxidation; i.e. a thin layer of insulating metal oxide is formed at the contact surface. When the slider (the other half of the contact pair) is brought into contact with the contact, the contact is slightly deformed, the oxide cracks and contact is made. On the newly exposed metal on the contact, a new oxide will form, and so forth. Eventually, the oxide becomes so thick that only a very poor electrical connection is made, and the contact fails. Using noble metals that do not easily oxidise prolongs the life of the contact.

IC corrosion damage

Figure 10: Corrosive ions migrating through crack in IC package onto IC, eventually causing damage to IC. [Labels: chip, package, crack, Cl, ion movement, corrosion.]

Integrated circuits are, in general, less prone to corrosion damage than PCBs due to their encapsulation. They are not immune, though, and corrosive ions can migrate into the IC encapsulation and cause corrosion damage (and hence circuit failure) on the IC, see Figure 10. Some IC packaging materials (ceramic) are better at withstanding corrosion than others (plastic).

Cosmic radiation

Figure 11: Radiation causing generation of electron-hole pairs in semiconductor junction, leading to current flow and potentially memory disruption. [Labels: cosmic ray (photon or particle), n-type Si, p-type Si, silicon junction, built-in field E, electron-hole pairs generated, current flow.]

Radiation of most kinds can cause temporary disruption to proper circuit performance. Leaving electromagnetic compatibility aside, high-energy particles (e.g. from cosmic radiation) cause the most concern. Usually the radiation will not permanently damage electronic devices (though it can) but can cause bit-flips in stored digital memories. Such disruptions are increasingly common with the denser and denser digital memories used in modern systems. When a particle (or photon) enters a semiconductor, its interaction with the material causes generation of electron-hole pairs, which cause a current to flow in the semiconductor, see Figure 11. Such currents can easily be enough to disrupt the content of digital memories. For this reason, critical systems should always be designed in such a way that memory disruption will not cause a system failure. Some technologies, such as Silicon-on-Sapphire, are more robust against radiation.

Failure mathematics

How devices (or components) fail over time is commonly described by their failure distribution function, F(t), which is the probability that a device will have failed by time t; t = 0 is normally taken as the time a device is brought into service. The failure probability density, f(t) = dF(t)/dt, is the rate at which devices fail at time t, while the hazard function, or failure rate, λ(t) = f(t)/(1 − F(t)), is the relative rate at which devices fail (normalised to the number of devices not yet failed) at time t. 1/λ(t) is also known as the mean time between failures (MTBF). Sometimes the reliability function, R(t) = 1 − F(t), is used, which is the probability that a device has not failed at time t. Failure rates are often quoted in the unit FIT (“failures in time”); 1 FIT is one failure per billion operating hours. Figure 12 depicts a typical failure distribution function and its derivatives, while the table below summarises the commonly used notation.

Commonly used notation

| Quantity | Notation |
|---|---|
| Failure distribution function | F(t) |
| Reliability function | R(t) = 1 − F(t) |
| Failure probability density | f(t) = dF(t)/dt |
| Hazard function | λ(t) = f(t)/(1 − F(t)) |
| Failures-in-time (FIT) | 1 FIT = 10^−9 h^−1 |
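The relations between F(t), R(t), f(t) and λ(t) can be tried out numerically. The sketch below is only an illustration: the helper names and the example distribution `F_example` are hypothetical, and the derivative is estimated with a simple central difference rather than computed exactly.

```python
import math

def reliability(F, t):
    """R(t) = 1 - F(t): probability that a device has not failed at time t."""
    return 1.0 - F(t)

def density(F, t, dt=1e-3):
    """f(t) = dF/dt, estimated with a central difference of width 2*dt."""
    return (F(t + dt) - F(t - dt)) / (2.0 * dt)

def hazard(F, t, dt=1e-3):
    """lambda(t) = f(t) / (1 - F(t)): failure rate among surviving devices."""
    return density(F, t, dt) / reliability(F, t)

# Hypothetical failure distribution, chosen only for illustration:
# F(t) = 1 - exp(-(t / 1000 h)^2), which has a rising failure rate.
def F_example(t):
    return 1.0 - math.exp(-(t / 1000.0) ** 2)

for t in (100.0, 500.0, 1000.0):
    print(f"t = {t:6.0f} h  F = {F_example(t):.4f}  "
          f"R = {reliability(F_example, t):.4f}  "
          f"lambda = {hazard(F_example, t):.3e} per hour")
```

Passing the distribution in as a function keeps the three definitions independent of any particular failure model, so the same helpers can be reused with measured or fitted distributions.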

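Figure 12: Typical failure distribution function, F(t), failure probability density, f(t), and hazard function, λ(t). [Panels: F(t), f(t) and λ(t), each plotted against t; the F(t) axis is marked at 1; the λ(t) curve is annotated with infant mortality, random failures and wearout regions.]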

Note the shape of the hazard function in Figure 12 (sometimes dubbed the “bathtub curve”), which is typical. Early on, the failure rate is quite high (high “infant mortality”) due to the fraction of devices that have been poorly manufactured (for instance half-shorts or half-opens caused by dust in a lithographic process). When such poor devices have been weeded out, the failure rate drops to a relatively low constant level where random failures occur (for instance due to ESD damage). Eventually the circuit will start to wear out (for instance due to electromigration) and the failure rate increases again.

For highly reliable systems (whose failure will not endanger life), target failure rates in the order of 10 FIT are often expected.
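To get a feel for what 10 FIT means, the following sketch converts a FIT figure to a failure rate per hour and estimates the fraction of devices failed after a number of years. It assumes a constant failure rate, for which F(t) = 1 − e^(−λt), and continuous operation; the function names are hypothetical.

```python
import math

HOURS_PER_YEAR = 8760  # 365 days of continuous operation

def fit_to_rate(fit):
    """Convert a failure rate in FIT to failures per operating hour."""
    return fit * 1e-9

def fraction_failed(fit, years):
    """Fraction of devices failed after `years` of continuous operation,
    assuming a constant failure rate, so F(t) = 1 - exp(-lambda * t)."""
    lam = fit_to_rate(fit)
    return 1.0 - math.exp(-lam * years * HOURS_PER_YEAR)

rate = fit_to_rate(10)
print(f"10 FIT = {rate:.1e} per hour, 1/lambda = {1 / rate:.1e} hours")
print(f"failed after 10 years: {100 * fraction_failed(10, 10):.3f} %")
```

Under these assumptions, fewer than one device in a thousand would be expected to fail over ten years of continuous use.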


Failure rate dependencies

As should be apparent from the section on failure mechanisms, failure rates depend on parameters such as currents, voltages and temperature; in fact failure rates normally increase very rapidly (typically as a power law or exponentially) in such parameters, see Figure 13. The maximum ratings of these parameters specified for devices (as found in data sheets) are normally set such that a certain (reasonable) maximum failure rate is achieved. Failure rate temperature dependency plays a particularly important role. Most failure mechanisms have an exponential relation with temperature:

Figure 13: Failure rate increasing rapidly in voltage, current and temperature; max rating set at low failure rate. [Axes: λ against V/I/T, with the max rating marked.]

$$\lambda_T = \Lambda_0 \, e^{-E_A/kT}$$

where $\lambda_T$ is the failure rate at temperature $T$; $\Lambda_0 = \Lambda_0(t, V, I, \ldots)$ is a function of time, voltage, current and parameters other than temperature; $E_A$ is the activation energy of that particular failure mechanism (typically in the order of 1 eV = $1.602 \cdot 10^{-19}$ J); $k$ is Boltzmann's constant ($k = 1.38 \cdot 10^{-23}$ J/K); and $T$ is the absolute temperature. The factor $e^{-E_A/kT}$ is called the Arrhenius factor.
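The strength of the temperature dependency is easiest to appreciate with numbers. The sketch below evaluates the Arrhenius factor at two temperatures 10 K apart, with Λ0 held fixed. The function name, the choice of temperatures and the activation energy of 1 eV are illustrative assumptions.

```python
import math

K_BOLTZMANN = 1.38e-23   # Boltzmann's constant in J/K
EV_TO_JOULE = 1.602e-19  # one electronvolt in joules

def arrhenius_factor(temp_kelvin, ea_ev=1.0):
    """Return exp(-E_A / (k*T)) for an activation energy given in eV."""
    return math.exp(-ea_ev * EV_TO_JOULE / (K_BOLTZMANN * temp_kelvin))

# Relative change in failure rate for a 10 K rise around 323 K (50 degC),
# with Lambda_0 held fixed and E_A = 1 eV (an illustrative value).
ratio = arrhenius_factor(333.0) / arrhenius_factor(323.0)
print(f"failure rate ratio 333 K / 323 K = {ratio:.2f}")
```

For these values, a 10 K rise near 50 °C increases the failure rate by a factor of almost three.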

Failure acceleration and MTTF

Figure 14: Mean time to failure at different temperatures. [Panels: F_T1(t) and F_T2(t) plotted against t, with the level 1/2 marked on each curve at MTTF1 and MTTF2 respectively.]

An often quoted measure of a device is its mean time to failure, MTTF, which is defined here as the time at which half of the devices have failed (see Figure 14):

$$\mathrm{MTTF} = F^{-1}(1/2)$$

For consumer electronics with expected device lifetimes in the order of ten years, the MTTF should be well above ten years. Because the expected lifetime is so long, it is rather challenging to verify the lifetime by direct measurement. Clearly it is impractical to have device lifetime verifications running for a decade before a product is released: when the tests are finished, consumers would long since have gone and found alternative products. However, it is rather important to have a good idea of what the product lifetime is, such that consumer confidence is not eroded due to poor lifetime and revenue is not lost due to excessive early failures requiring warranty replacements. By testing circuits at elevated temperatures, an estimate of the MTTF can be obtained in reasonable time. We define the failure rate acceleration factor, $A_F$, as the failure rate ratio at two temperatures: a usage temperature, $T_1$, and a test temperature, $T_2$:

$$A_F = \frac{\lambda_{T_2}}{\lambda_{T_1}} = e^{E_A/kT_1 - E_A/kT_2}$$

that is:

$$\lambda_{T_2} = A_F \cdot \lambda_{T_1}$$

Using the definition of the failure rate function, it is then simple to prove that

$$F_{T_2}(t) = F_{T_1}(A_F \cdot t)$$

thus, the failure distribution function has been contracted along the time axis, see Figure 14, and we find

$$\mathrm{MTTF}_{T_2} = \mathrm{MTTF}_{T_1}/A_F$$

where the indices $T_1$ and $T_2$ refer to the function at that particular temperature. For example, for a usage temperature of $T_1$ = 323 K (50 °C) and a test temperature of $T_2$ = 393 K (120 °C), a failure mechanism with an activation energy of $E_A$ = 1 eV has an acceleration factor of $A_F \approx 600$. Thus, if this device has $\mathrm{MTTF}_{T_1}$ = 10 years at normal usage temperature, at the test temperature we find $\mathrm{MTTF}_{T_2}$ = 10 years/600, or about 6 days, which is a much more realistic time frame over which to conduct a test. One cannot make the acceleration factor too high, though, because failure mechanisms other than the one that dominates at usage temperatures may take over, giving too optimistic estimates.
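The numbers in this example can be reproduced by evaluating the expression for the acceleration factor and then MTTF_T1/A_F directly. This is a minimal sketch; the function and variable names are hypothetical, and the constants are the rounded values quoted in these notes.

```python
import math

K_BOLTZMANN = 1.38e-23   # Boltzmann's constant in J/K
EV_TO_JOULE = 1.602e-19  # one electronvolt in joules

def acceleration_factor(t_use, t_test, ea_ev):
    """A_F = exp(E_A/(k*T1) - E_A/(k*T2)), with T1 = t_use and
    T2 = t_test, both in kelvin, and E_A given in eV."""
    ea = ea_ev * EV_TO_JOULE
    return math.exp(ea / (K_BOLTZMANN * t_use) - ea / (K_BOLTZMANN * t_test))

af = acceleration_factor(t_use=323.0, t_test=393.0, ea_ev=1.0)
mttf_use_years = 10.0
mttf_test_days = mttf_use_years * 365.0 / af   # MTTF_T2 = MTTF_T1 / A_F

print(f"A_F = {af:.0f}")
print(f"MTTF at test temperature = {mttf_test_days:.1f} days")
```

Because the acceleration factor depends exponentially on E_A, a modest error in the assumed activation energy changes the extrapolated lifetime considerably, so the result should be treated as an estimate.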

Another use of the Arrhenius factor is to weed out poorly manufactured devices before products are sold; this is also known as burn-in testing. Devices are operated for some time at elevated temperatures, and often also at elevated voltages and currents, in order to stress the devices and cause the poor ones to fail.

Failure distribution functions

Accurate failure distribution functions for a particular device (unless it is a simple component like a resistor) can be quite difficult to obtain, especially if the failure rate is low, as it would take a prohibitively long time to gather statistically significant data. For the special case of random failures, where the failure rate is constant, λ(t) = Λ, we can find an expression for the failure distribution function, $F_R(t)$:

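$$
\begin{aligned}
\Lambda &= \frac{dF_R(t)/dt}{1 - F_R(t)} \\
\Lambda\,(1 - F_R(t)) &= dF_R(t)/dt \\
\Lambda\,(1/s - F_R(s)) &= s\,F_R(s) \quad \text{(using Laplace)} \\
F_R(s) &= \frac{1}{s} \cdot \frac{\Lambda}{s + \Lambda} \\
F_R(t) &= \int_0^t \Lambda e^{-\Lambda\tau}\,d\tau = 1 - e^{-\Lambda t}
\end{aligned}
$$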
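As a numerical cross-check of this result, the differential equation dF/dt = Λ(1 − F) with F(0) = 0 can be integrated step by step and compared with the closed-form expression. The sketch below uses forward Euler integration, a technique chosen here for simplicity; the rate and time span are hypothetical values.

```python
import math

def integrate_constant_hazard(rate, t_end, steps=100_000):
    """Integrate dF/dt = rate * (1 - F) from F(0) = 0 with forward Euler."""
    dt = t_end / steps
    F = 0.0
    for _ in range(steps):
        F += rate * (1.0 - F) * dt
    return F

rate = 1e-4     # hypothetical constant failure rate, per hour
t_end = 2e4     # hours
numeric = integrate_constant_hazard(rate, t_end)
closed_form = 1.0 - math.exp(-rate * t_end)
print(f"Euler: {numeric:.5f}   closed form: {closed_form:.5f}")
```

The two values agree closely for a sufficiently small step, which supports the exponential form without relying on the Laplace transform.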

Another popular failure distribution function is the Weibull distribution function ($F_W(t)$ below). It is a generalised version of the exponential distribution function above with two fitting parameters: α > 0 is a scaling parameter and β > 0 is a shape parameter. The Weibull distribution function has the advantage that it is a simple function that can model failure rates that are decreasing over time (infant mortality), constant over time (random failures), or increasing over time (wearout):

$$F_W(t) = 1 - e^{-(t/\alpha)^\beta}$$

β < 1: decreasing failure rate
β = 1: constant failure rate
β > 1: increasing failure rate
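Applying the definition λ(t) = f(t)/(1 − F(t)) to the Weibull distribution gives λ(t) = (β/α)(t/α)^(β−1), which makes the three cases above easy to verify. The sketch below evaluates this expression at an early and a late time for three shape parameters. The value of α and the sample times are hypothetical.

```python
import math

def weibull_cdf(t, alpha, beta):
    """F_W(t) = 1 - exp(-(t/alpha)^beta)."""
    return 1.0 - math.exp(-((t / alpha) ** beta))

def weibull_hazard(t, alpha, beta):
    """lambda(t) = f(t)/(1 - F(t)) = (beta/alpha) * (t/alpha)^(beta - 1)."""
    return (beta / alpha) * (t / alpha) ** (beta - 1.0)

alpha = 1000.0  # hypothetical scale parameter, in hours
for beta in (0.5, 1.0, 3.0):
    early = weibull_hazard(100.0, alpha, beta)
    late = weibull_hazard(2000.0, alpha, beta)
    trend = ("decreasing" if late < early
             else "constant" if late == early
             else "increasing")
    print(f"beta = {beta}: lambda(100 h) = {early:.2e}, "
          f"lambda(2000 h) = {late:.2e} ({trend})")
```

With β = 1 the hazard reduces to the constant 1/α, so the exponential distribution is recovered with Λ = 1/α.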

Devices with many components

Devices are made up of components; usually, for the device to operate appropriately, all components need to be operational; and usually it is a reasonable assumption that components fail independently of one another. Thus, we find the failure distribution function for a device made up of $N$ components, $F_{dev}$, to be

$$F_{dev}(t) = 1 - \prod_{i=1}^{N} \left(1 - F_i(t)\right)$$

where $F_i(t)$ is the failure distribution function for component $i$. Likewise, we find the device failure rate, $\lambda_{dev}(t)$, to be

$$\lambda_{dev}(t) = \sum_{i=1}^{N} \lambda_i(t)$$

where λi(t) is the failure rate for component i.
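The two expressions can be applied to a small bill of materials as follows. This is a sketch under stated assumptions: the component names and FIT values are invented for illustration, and each component is assumed to have a constant failure rate so that F_i(t) = 1 − e^(−λ_i t).

```python
import math

def device_failure_probability(component_probs):
    """F_dev = 1 - prod(1 - F_i) for independent components that must all work."""
    surviving = 1.0
    for f_i in component_probs:
        surviving *= 1.0 - f_i
    return 1.0 - surviving

def device_failure_rate(component_rates):
    """lambda_dev = sum(lambda_i)."""
    return sum(component_rates)

# Hypothetical bill of materials with constant failure rates in FIT.
rates_fit = {"regulator": 20.0, "microcontroller": 50.0, "connector": 100.0}
t_hours = 5 * 8760  # five years of continuous operation

# With constant rates, each component has F_i(t) = 1 - exp(-lambda_i * t).
probs = [1.0 - math.exp(-fit * 1e-9 * t_hours) for fit in rates_fit.values()]
f_dev = device_failure_probability(probs)
lam_dev = device_failure_rate(rates_fit.values())

print(f"device failure rate = {lam_dev:.0f} FIT")
print(f"device failure probability after 5 years = {f_dev:.4%}")
```

The component with the highest failure rate dominates the sum, which is why improving the weakest part usually gives the largest gain in device reliability.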

Improving reliability

Making reliable devices is largely a matter of reducing failure rates by design; here are a few guidelines:

Thermal management: Do proper thermal analysis of power components and use appropriately sized and mounted heat sinks to avoid high temperatures.

De-rating: Over-specify component maximum ratings (i.e. use a 50 V device, say, even if only 20 V is required of it). Component ratings that could be de-rated include: voltage, current, power, temperature, and distance (e.g. on PCB).

Over-specify component parameters: Use better (or more accurate) components than strictly required; for instance use more accurate resistors, amplifiers with less offset, or power supplies that can supply higher currents.

Design reviews: Conduct timely design reviews and be prepared to make design changes.

Failure analysis: Conduct formal failure analysis of the device, for instance using FMEA (see later).

Simplicity: Make the device as simple as possible; the fewer components in a system, the better the reliability. Sometimes this makes the design harder; it also means that feature-creep should be avoided.


Source good components: Use highly reliable components (do not just look at price when sourcing components). Components from manufacturers that specify failure rates and are ISO 9000 qualified probably have better reliability than generic components.

Redundancy: Sometimes components can be added in parallel such that they all need to fail before the device fails. Assuming component failures are independent, such redundancy creates more reliable sub-systems within the device; in really critical applications complete parallel systems may even be used. The sub-system failure distribution function, $F_{ss}$, is then:

$$F_{ss}(t) = F_{1a}(t) \cdot F_{1b}(t) \cdot F_{1c}(t) \cdots$$

where F1a(t), F1b(t), . . . are the failure distribution functions of the parallel components.
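The benefit of redundancy can be illustrated by evaluating the product for one, two and three identical parallel branches. In this sketch the branch failure probability of 0.01 is a hypothetical value, and independence between branches is assumed as in the text.

```python
def parallel_failure_probability(branch_probs):
    """F_ss = F_1a * F_1b * ... : the sub-system fails only if every
    independent parallel branch has failed."""
    f_ss = 1.0
    for f_branch in branch_probs:
        f_ss *= f_branch
    return f_ss

f_single = 0.01  # hypothetical failure probability of one branch at some time t
for n in (1, 2, 3):
    f_ss = parallel_failure_probability([f_single] * n)
    print(f"{n} parallel branch(es): F_ss = {f_ss:.0e}")
```

The improvement relies on the independence assumption; a common cause that takes out all branches at once, such as a shared power supply, would not be captured by this product.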

Failure-mode and effects analysis

Today electronic systems are in use in many applications whose failure could lead to death, such as pacemakers, ABS brakes and air bag deployment in cars, aeroplane stability control, and many other places. The reliability of such systems is obviously paramount, but the reliability of less critical electronics can still be very important. There are many different “tools” that can be used to analyse (possible) system failures and to increase reliability. Failure-mode and effects analysis (FMEA) is one such tool, which we will briefly introduce in this section. FMEA provides a structured way to analyse weak points in a system and a means to note down actions taken to improve them. The main part of an FMEA is an evolving spreadsheet (special software programs are also available); see the example on the course website. The spreadsheet may look like the one in Figure 15.

FMEAs are normally conducted by a team of engineers, often with different backgrounds all relevant to the system in question (say, a systems engineer, a mechanical engineer, an electrical engineer, and the engineer that has designed the system). In principle this team has to go through every possible failure mode for each component in the system and classify the consequences of such failures. This can be a rather daunting task for complex systems. A good approach to reduce the complexity of the task is to conduct a hierarchical FMEA: low level circuits and function critical ones are analysed in all detail, looking at each individual component. At the next level, sub-systems can be identified, each with a simplified set of failure modes observable at its connections (or pins) to the other sub-systems.

The purpose of an FMEA is to improve design reliability and to identify critical failure modes, e.g. those that may be life threatening. An FMEA tries to quantify the risk of each failure by assigning it a risk priority number (RPN). It is worth noting that an RPN should not be thought of as an absolute value that can necessarily be compared between designs, because its assignment is somewhat subjective. However, RPNs within the same FMEA should be comparable, and the largest numbers identify where the weak points in the design lie. The RPN is made up of three factors:

$$\mathrm{RPN} = S \cdot L \cdot D$$

where S, L and D are the rankings of a failure's severity, likelihood and detectability, respectively. The severity describes how severe the consequence of a fault is, ranging from “no effect” to “hazardous without warning”; a ranking, S, between 1 and 10 is assigned according to the severity, see Figure 16. An S = 10 ranking is often taken to mean that people will die as a consequence of the fault. The likelihood describes how likely the fault is, ranging from “remote” to “almost inevitable”; a ranking, L, between 1 and 10 is assigned according to the likelihood, see Figure 17.


| 1 Item/Function | 2 Potential Failure Modes | 3 Potential Effects | 4 S | 5 Potential Causes/Mechanism | 6 L | 7 Current Design Controls | 8 D | 9 RPN | 10 Recommended Actions | 11 Responsibility + Date |
|---|---|---|---|---|---|---|---|---|---|---|
| Bridge rectifier: convert AC voltage to DC pulsed voltage | diode open circuit | Half-wave rectification; potential loss of power | 4 | ESD damage | 5 | power supply cut-out detection circuit | 3 | 60 | Possibly increase smoothing capacitor value | |
| | | | 4 | Wear-out | 4 | power supply cut-out detection circuit | 3 | 48 | | |
| | diode short circuit | Half-wave rectification; half-short of AC power source; loss of power | 8 | ESD damage | 5 | primary fuse | 3 | 120 | inclusion of ESD protection | E. Ambikairajah, 19/6/08 |
| | short at bridge output | Soft short of AC power source; loss of power | 6 | dendritic growth | 3 | none | 7 | 126 | De-rating of PCB track distances | T. Hesketh, 19/6/08 |

Columns 12 (Actions Taken) and 13 to 16 (S, L, D and RPN, under the heading “action results”) are blank in this example.

Figure 15: FMEA work sheet.

The detectability describes how easy it is for the system to detect the fault, ranging from “almost certain” to “no detection”; a ranking, D, between 1 and 10 is assigned according to the detectability, see Figure 18. The main work when conducting an FMEA is to identify all the possible failure modes and their causes and consequences, and assign the appropriate ranking numbers. Note that many organisations have their own tables to define the S, L and D ranking numbers that fit their particular type of product and make assignment more consistent between FMEAs than using the generic tables in the figures here. The thinking behind defining the RPN as above is that one can live with the possibility of rather serious failure modes if they are very unlikely to happen or if they are very easy to detect. If a possible serious fault can be detected easily by the system, for instance, it can take action (such as performing a system shut-down or engaging a back-up function) such that the severe consequence does not happen. Thus, if the RPN of a fault is high, it flags the fault as a problem that should be looked into.
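The RPN calculation is simple enough to automate. The sketch below recomputes the RPNs for the four rows of the example work sheet in Figure 15 and sorts them so that the highest-risk failure modes appear first. The data layout and the function name `rpn` are illustrative choices, not part of any FMEA standard.

```python
# (failure mode, cause, S, L, D) taken from the example work sheet in Figure 15.
worksheet = [
    ("diode open circuit",     "ESD damage",       4, 5, 3),
    ("diode open circuit",     "wear-out",         4, 4, 3),
    ("diode short circuit",    "ESD damage",       8, 5, 3),
    ("short at bridge output", "dendritic growth", 6, 3, 7),
]

def rpn(severity, likelihood, detectability):
    """Risk priority number, RPN = S * L * D, each ranking between 1 and 10."""
    for rank in (severity, likelihood, detectability):
        if not 1 <= rank <= 10:
            raise ValueError("rankings must lie between 1 and 10")
    return severity * likelihood * detectability

# Sort so that the failure modes needing attention first come out on top.
ranked = sorted(worksheet, key=lambda row: rpn(*row[2:]), reverse=True)
for mode, cause, s, l, d in ranked:
    print(f"RPN = {rpn(s, l, d):3d}  {mode} ({cause})")
```

Note that the short at the bridge output ranks highest even though its severity is lower than that of the diode short circuit, because no design control currently detects it.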

| Effect | Severity of Effect | Rank S |
|---|---|---|
| Hazardous without warning | Very high severity ranking when a potential failure mode affects safe system operation without warning | 10 |
| Hazardous with warning | Very high severity ranking when a potential failure mode affects safe system operation with warning | 9 |
| Very High | System inoperable with destructive failure without compromising safety | 8 |
| High | System inoperable with equipment damage | 7 |
| Moderate | System inoperable with minor damage | 6 |
| Low | System inoperable without damage | 5 |
| Very Low | System operable with significant degradation of performance | 4 |
| Minor | System operable with some degradation of performance | 3 |
| Very Minor | System operable with minimal interference | 2 |
| None | No effect | 1 |

Figure 16: Severity ranking (from [3])

| Likelihood of Failure | Failure Probability | Rank L |
|---|---|---|
| Very High: Failure is almost inevitable | >1 in 2 | 10 |
| | 1 in 3 | 9 |
| High: Repeated failures | 1 in 8 | 8 |
| | 1 in 20 | 7 |
| Moderate: Occasional failures | 1 in 80 | 6 |
| | 1 in 400 | 5 |
| | 1 in 2,000 | 4 |
| Low: Relatively few failures | 1 in 15,000 | 3 |
| | 1 in 150,000 | 2 |
| Remote: Failure is unlikely | <1 in 1,500,000 | 1 |

Figure 17: Likelihood ranking (from [3])

Once RPNs have been assigned to all failure modes, the faults with high RPNs can be identified, and corrective actions can be taken to reduce the RPNs for those failures. The RPNs, obviously, can be reduced by reducing one or more of the three rankings (severity, likelihood or detectability). The severity, for instance, could be reduced by adding a back-up function that would be engaged if the fault is detected. If detectability is good in the first place, that may be all that is required; otherwise it may be necessary to put in specific circuits that detect the occurrence of that failure mode. Another approach, for instance, is to reduce the likelihood by de-rating the component failing. When actions have been taken, the FMEA is carried out again on the failure mode and a revised RPN is calculated.

Returning to the FMEA example work sheet in Figure 15, we will now explain the meaning of each column (1–16) in more detail:

1 Item/Function: In electronics FMEA this is often a physical component or a sub-system.


| Detectability | Likelihood of Detection by Design Control | Rank D |
|---|---|---|
| Absolute Uncertainty | Design control cannot detect potential cause/mechanism and subsequent failure mode | 10 |
| Very Remote | Very remote chance the design control will detect potential cause/mechanism and subsequent failure mode | 9 |
| Remote | Remote chance the design control will detect potential cause/mechanism and subsequent failure mode | 8 |
| Very Low | Very low chance the design control will detect potential cause/mechanism and subsequent failure mode | 7 |
| Low | Low chance the design control will detect potential cause/mechanism and subsequent failure mode | 6 |
| Moderate | Moderate chance the design control will detect potential cause/mechanism and subsequent failure mode | 5 |
| Moderately High | Moderately high chance the design control will detect potential cause/mechanism and subsequent failure mode | 4 |
| High | High chance the design control will detect potential cause/mechanism and subsequent failure mode | 3 |
| Very High | Very high chance the design control will detect potential cause/mechanism and subsequent failure mode | 2 |
| Almost Certain | Design control will detect potential cause/mechanism and subsequent failure mode | 1 |

Figure 18: Detectability ranking (from [3])

2 Potential Failure Modes: This is what is wrong with the circuit. For electronics, a minimum set of failure modes to investigate is that a component goes open circuit or short circuits; short circuits to ground or an adjacent net are also often included (for complex components with many pins, a short circuit to ground or an adjacent net may be more appropriate than a component short circuit). For a more nuanced failure investigation, “soft” short circuits or open circuits (i.e. via a moderate impedance rather than a zero or infinite impedance) can be included, as well as component parameters being outside their specification. Intermittent faults (of any kind) can also be included, as can software faults. For sub-systems one may have to invent some more complicated failure mechanism (e.g. the apparent failure observed on the pins of a crashed micro-controller may be a rather complicated, seemingly random, pattern of high/low states being generated at one of its ports).

3 Potential Effects: This is the effect of the failure in column 2; i.e. what can be observed. There may be more than one significant effect of one failure mode; they should all be recorded. For a power supply unit, for example (of which the bridge rectifier in the example work sheet would be part), effects might include: loss of power, increased power drain, intermittent fall-out or high-voltage exposure in a low-voltage system.

4 Severity (S): The severity ranking of the effect in column 3, as per Figure 16 or some other agreed-upon scale.

5 Potential Causes/Mechanism: The cause of the failure recorded in column 2. There may be more than one cause that gives rise to the same failure; these should all be recorded. Typical causes for failures in electronics include: ESD damage, normal (electronic) wear, fatigue (e.g. due to thermal cycling or repeated physical displacement), trauma (e.g. impact damage when equipment is dropped or in an accident), dendritic growth, over-current/voltage surges (e.g. if connecting equipment under power), electro-migration, and cosmic radiation.

6 Likelihood (L): The likelihood ranking of the failure cause in column 5, as per Figure 17 or some other agreed-upon scale.

7 Current Design Controls: The parts of the current design that address the effects (column 3) or the causes (column 5). Design controls can be special circuits that detect the effect; design strategies taken to minimise occurrence, such as de-rating or use of redundant components; or systems put in place to reduce the severity; i.e. design controls can be in place to minimise any of the three ranking numbers. Design controls that are assumed to be in place (e.g. the presence of a mains fuse) must also be recorded here. A design control can also be the use of simplified or modified circuits to reduce the likelihood of failure, or it can be general or specific system checks or watch-dogs that monitor that the system operates appropriately.

8 Detectability (D): The detectability ranking of the failure effect in column 3, using the current design and the design controls in column 7, as per Figure 18 or some other agreed-upon scale.

9 Risk Priority Number (RPN): The product of the severity ranking, the likelihood ranking and the detectability ranking.

10 Recommended Actions: Here is noted the action that should be taken (if any) in order to reduce the RPN to acceptable levels.

11 Responsibility + Date: Here is noted the person responsible for implementing the recommended action and the target completion date (usual good management practice).

12 Actions Taken: When the design has been revised, the action that actually was taken is recorded here (this may be different from the recommended action as new issues may have surfaced during the implementation of the recommended action).

13 Revised Severity (S): If the design has been revised, here is recorded the revised severity ranking.

14 Revised Likelihood (L): If the design has been revised, here is recorded the revised likelihood ranking.

15 Revised Detectability (D): If the design has been revised, here is recorded the revised detectability ranking.

16 Revised Risk Priority Number (RPN): If the design has been revised, here is recorded the revised RPN; hopefully this is now acceptably low, or another revision will be required.
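The sixteen columns described above map naturally onto a record with two derived fields (columns 9 and 16). The sketch below is one possible representation, not the format of the work sheet in Figure 15; the class name, field names and the example row values are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class WorksheetRow:
    """One FMEA work sheet row; comments give the column numbers."""
    item: str                                     # column 1
    failure_mode: str                             # column 2
    effect: str                                   # column 3
    severity: int                                 # column 4
    cause: str                                    # column 5
    likelihood: int                               # column 6
    design_controls: str                          # column 7
    detectability: int                            # column 8
    recommended_action: str = ""                  # column 10
    responsibility_and_date: str = ""             # column 11
    action_taken: str = ""                        # column 12
    revised_severity: Optional[int] = None        # column 13
    revised_likelihood: Optional[int] = None      # column 14
    revised_detectability: Optional[int] = None   # column 15

    @property
    def rpn(self) -> int:
        """Column 9: product of the three current rankings."""
        return self.severity * self.likelihood * self.detectability

    @property
    def revised_rpn(self) -> Optional[int]:
        """Column 16: None until all three revised rankings are recorded."""
        revised = (self.revised_severity,
                   self.revised_likelihood,
                   self.revised_detectability)
        if any(rank is None for rank in revised):
            return None
        s, l, d = revised
        return s * l * d


# Hypothetical row; the rankings are illustrative values only.
row = WorksheetRow(
    item="Bridge rectifier",
    failure_mode="Diode short circuit",
    effect="Loss of power",
    severity=8,
    cause="Over-voltage surge",
    likelihood=3,
    design_controls="Mains fuse",
    detectability=5,
)
rpn_before = row.rpn
revised_before = row.revised_rpn  # no revision recorded yet

# Record a design revision: de-rating lowers only the likelihood ranking.
row.action_taken = "Diode replaced by a de-rated part"
row.revised_severity = 8
row.revised_likelihood = 1
row.revised_detectability = 5
rpn_after = row.revised_rpn
```

Computing columns 9 and 16 from the stored rankings, rather than storing them, keeps the RPN consistent whenever a ranking is updated.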

