8/14/2019 Fantastic Failures
Class 304: Fantastic Failures
Embedded Systems Conference
Wednesday, 31 March 2004
By Kim R. Fowler
Historical Case Studies
Ariane 5
(Photographic source is ESA/CNES. You can find these photos at the following website:
www.mssl.ucl.ac.uk/www_plasma/missions/cluster/about_cluster/cluster1/cluster1_images.html)

Ariane 5 recounted
Dual-redundant processors
3 unprotected variables that overflowed
Processors reset on overflow, no graceful recovery
Software used in Ariane 4, no check against Ariane 5 flight dynamics
Ariane 5 had greater horizontal drift velocities than Ariane 4
Reuse is tricky, end-to-end system test necessary
Find report at: www.esa.int/export/esaLA/Pr_33_1996_p_EN.html
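
The inquiry report attributes the overflow to a conversion from a 64-bit floating-point value to a 16-bit signed integer with no range protection. A minimal sketch of a protected conversion follows, in Python for readability (the flight software was not written in Python). The function name `to_int16_saturating`, the NaN fallback of 0, and the choice to clamp rather than raise are illustrative assumptions, not the remedy the report prescribes.

```python
from typing import Tuple

INT16_MIN = -32768
INT16_MAX = 32767


def to_int16_saturating(value: float) -> Tuple[int, bool]:
    """Convert a float to the 16-bit signed range without overflowing.

    Returns (converted_value, saturated). The caller decides how to
    degrade gracefully when `saturated` is True, instead of the
    processor resetting.
    """
    if value != value:           # NaN never compares equal to itself
        return 0, True           # assumed fallback value for illustration
    if value > INT16_MAX:
        return INT16_MAX, True
    if value < INT16_MIN:
        return INT16_MIN, True
    return int(value), False     # int() truncates toward zero
```

The flag matters as much as the clamp: a saturated reading is a signal that the input has left the envelope the code was designed for, which is the situation reuse on a new vehicle can create.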
Therac 25
Medical linear accelerator for treating tumors
Mid-1980s: overdosed six patients
Problems:
  Quick editing by operator caused race condition
  Cryptic error messages ignored
  No explanation in User's Manual of error codes
  50 times full dose, but display showed no dose given
No mechanical interlocks
No software reviews or audits, little documentation
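
One generic way to guard against an edit-during-setup race like the one listed above is to version every operator edit and refuse to proceed if the setup changed after the machine was configured. The sketch below is only an illustration of that idea; it is not how the Therac-25 software worked or was corrected. The names `Prescription`, `SetupState`, `beam_permitted`, the mode labels, and the 300 cGy limit are all hypothetical. A software check like this does not replace the mechanical interlocks the slide says were missing.

```python
import threading
from dataclasses import dataclass


@dataclass(frozen=True)
class Prescription:
    mode: str       # hypothetical labels, e.g. "xray" or "electron"
    dose_cgy: int   # prescribed dose in centigray


class SetupState:
    """Operator-editable setup, protected by a lock and a version counter."""

    def __init__(self, initial: Prescription):
        self._lock = threading.Lock()
        self._rx = initial
        self._version = 0

    def edit(self, new_rx: Prescription) -> None:
        with self._lock:
            self._rx = new_rx
            self._version += 1   # every edit invalidates earlier snapshots

    def snapshot(self):
        with self._lock:
            return self._rx, self._version


def beam_permitted(state, configured_rx, configured_version, max_dose_cgy=300):
    """Allow the beam only if the setup is unchanged since configuration."""
    current_rx, current_version = state.snapshot()
    if current_version != configured_version:
        return False             # operator edited after the machine was set up
    if current_rx != configured_rx:
        return False
    return 0 < current_rx.dose_cgy <= max_dose_cgy
```

A refusal here should come with a plain-language message, since the slide also lists cryptic, undocumented error codes as part of the problem.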

Therac 25 Lessons
Need general plan for system development
The operator interface must be clear, intuitive, and explained
Hardware safeguards must limit software faults
Good design, not testing, makes a safe system
See Appendix A, "Medical Devices: The Therac-25," from Nancy Leveson, Safeware: System Safety and Computers, Addison-Wesley, 1995.
Chernobyl events
Experiment called for by engineers in Moscow
Manual shutdown, automatic control turned off
Power dropped to 1% capacity
Removed more control rods
Power crept up to 7%
Turned on more water to produce more steam
Water cooled reactor, dropping steam and reactivity
Removed even more control rods
Steam production rose until 1:22 a.m., when operators shut off water flow
Heat built up quickly, control rod sleeves bent
Could not insert control rods
Steam explosion

Chernobyl Lessons
Theoretical knowledge vs. hands-on
Humans over-steer dynamic systems
Humans don't handle interacting, nonlinear problems well
Groupthink
Understand human nature
Clarity of function
Reduce confounding problems
Accommodate in system design
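
The over-steering point can be illustrated with a toy loop in which the operator reacts to a reading that is several steps old. This is a generic delayed-feedback sketch, not a reactor model; the function name `simulate_operator`, the gains, and the delay values are assumptions chosen only to show the effect.

```python
def simulate_operator(gain, delay, steps=60, target=1.0):
    """Return the output history of a delayed proportional correction loop.

    At each step the operator sees the output as it was `delay` steps ago
    and applies a correction of gain * (target - observed).
    """
    history = [0.0]
    for k in range(steps):
        observed = history[max(0, k - delay)]   # stale reading
        history.append(history[-1] + gain * (target - observed))
    return history


# Same aggressive correction, with and without a lag in what the operator sees.
prompt_feedback = simulate_operator(gain=0.8, delay=0)
stale_feedback = simulate_operator(gain=0.8, delay=3)
gentle_with_lag = simulate_operator(gain=0.2, delay=3, steps=200)
```

In this toy model the aggressive correction settles when feedback is immediate, swings with growing amplitude when the reading lags by three steps, and settles again once the correction is made gentler. Accommodating that in system design means either shortening the lag or limiting how hard the operator can push.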
Apple Lisa
(Part of the computer collection of Giorgio Ungarelli, photograph used with permission.)
Apple Lisa Legacy
Brilliant concept before its time
Mouse
Graphical file management
People not ready for paradigm shift
Apple Lisa Lessons
Prohibitive price for unappreciated capability
Cost-effective solutions rely on users' understanding
Failure falls into business/political arena: difficult to predict and avoid

Navy Terrier/LEAP
Terrier LEAP outline
Concept for ballistic missile intercept
Use current (early-mid 1990s) technology
Prepare and test quickly
Target launched from Wallops Island
Interceptor launched from cruiser in Atlantic
Basic human error foiled success
LEAP Target
(Photograph courtesy of Raytheon, Inc.)
LEAP General Operation
High-resolution radars at Wallops Island track target (shipboard radars insufficient)
Wallops Island processor collected data from the radars, filtered the target track with a six-state Kalman filter, and transmitted the track to the ship.
Sent target tracks to ship via redundant telephone landlines and Inmarsat satellite links
Ship processor received the data, predicted the intercept time and point, and indicated when to launch the interceptor missile.
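
The slides do not say what the six states were or how the filter was tuned. A common textbook choice is position and velocity on three axes with a constant-velocity model, and the sketch below assumes exactly that, with the axes treated as independent so each one reduces to a two-state filter. The class names `AxisFilter` and `TrackFilter` and every noise value are hypothetical.

```python
class AxisFilter:
    """Two-state (position, velocity) Kalman filter for one axis."""

    def __init__(self, meas_var, accel_var, init_var=1.0e6):
        self.p = 0.0                 # position estimate
        self.v = 0.0                 # velocity estimate
        # Covariance [[a, b], [b, c]], started large to mean "unknown".
        self.a, self.b, self.c = init_var, 0.0, init_var
        self.meas_var = meas_var     # measurement noise variance (R)
        self.accel_var = accel_var   # process noise intensity (q)

    def predict(self, dt):
        # Constant-velocity propagation of state and covariance.
        self.p += self.v * dt
        a, b, c, q = self.a, self.b, self.c, self.accel_var
        self.a = a + 2.0 * dt * b + dt * dt * c + q * dt ** 4 / 4.0
        self.b = b + dt * c + q * dt ** 3 / 2.0
        self.c = c + q * dt * dt

    def update(self, z):
        # Position-only measurement update.
        s = self.a + self.meas_var           # innovation variance
        k_p, k_v = self.a / s, self.b / s    # Kalman gains
        residual = z - self.p
        self.p += k_p * residual
        self.v += k_v * residual
        a, b, c = self.a, self.b, self.c
        self.a = (1.0 - k_p) * a
        self.b = (1.0 - k_p) * b
        self.c = c - k_v * b


class TrackFilter:
    """Six states (x, y, z, vx, vy, vz) as three independent axis filters."""

    def __init__(self, meas_var=25.0, accel_var=0.01):
        self.axes = [AxisFilter(meas_var, accel_var) for _ in range(3)]

    def step(self, dt, measured_xyz):
        for axis, z in zip(self.axes, measured_xyz):
            axis.predict(dt)
            axis.update(z)

    def state(self):
        return tuple(ax.p for ax in self.axes) + tuple(ax.v for ax in self.axes)

    def extrapolate(self, t_ahead):
        """Straight-line position forecast, e.g. toward an intercept point."""
        return tuple(ax.p + ax.v * t_ahead for ax in self.axes)
```

Calling `step` once per radar report and `extrapolate` for the look-ahead time mirrors the split described above: filtering on shore, prediction on the ship.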
LEAP Missile & Intercept
(Photograph courtesy of Raytheon, Inc.)
LEAP Testing Finds Problems
End-to-end tests of the system:
  simulated a target launch,
  transmitted the simulated data through the entire system to the ship,
  calculated an intercept as if we were at sea.
Redundant landlines: switch maintenance in New Jersey cut off early test
Separate landlines:
  one through New Jersey
  other through Pennsylvania

Richmond K. Turner, CG-20
(Photograph courtesy of the Johns Hopkins University Applied Physics Laboratory.)
Testing Finds Problems (cont'd.)
Two shipboard radars caused problems:
  SPS-49 jammed the Inmarsat receivers
  SPS-20 jammed the GPS receivers
Inmarsat situated on port and starboard bridge to reduce superstructure blockage
Too many dropouts with commercial modems, switched to cell phone modems

LEAP Targeting Processor and laboratory test set
(Photographs courtesy of the Johns Hopkins University Applied Physics Laboratory.)
LEAP: Lessons Learned
Technical failure
Simple, human error can interrupt the best designs
Careful development and thorough testing necessary
All components must be tested within the system to uncover interactions

Aegis LEAP
A success story
Three successful intercepts in 2002, more in 2003
Carefully planned development
Kinetic Kill Vehicle and Target Image
(Figure and photograph courtesy of the Johns Hopkins University Applied Physics Laboratory.)

Aegis LEAP Launch
(Photographs courtesy of the Johns Hopkins University Applied Physics Laboratory.)
Thorough Ground Test Program
Separation tests: squibs, batteries, explosive bolts
KW hover test for the closed loop pointing
Air bearing tests of maneuvers: pitch-to-ditch, IR seeker calibration, and pointing before separation
Hardware-in-the-loop simulation and test of avionics
KW tests for the IR seeker characterization, stabilization, third stage interfaces
Vacuum tests: PCB delamination, arcing, and outgassing
Aerothermal testing in a hypersonic wind tunnel for nosecone heating and outgassing, seeker shield function, strake heating and insulation

Types of Failure
Examples: Product Recalls
[. . .] recalled 45,000 heaters for defective thermostats that were improperly positioned, which could lead to overheating.
[. . .] recalled 3.1 million dishwashers. The slide switch (the lever that selects between heat drying and energy saving) can melt and ignite over time, posing a fire hazard.
[. . .] recalled 5,500 toy flashlights because the batteries may overheat or leak, and children can suffer burns from the leaking battery.
[. . .] recalled upright vacuum cleaners because the power cord may break inside the handle, posing electrical shock and burn injury hazards.
http://www.matthewslawfirm.com

Examples: Automotive Recalls
March 12, 2002: [. . .] recalled the [. . .] trailer hitch: circuitry in the converter is inadequate to properly manage voltage spikes that can lead to an electrical short or open circuit within the converter, causing a failure and an inoperative trailer light.
September 11, 2000: [. . .] recalled about 270,000 [cars] for air bags that may deploy unexpectedly because of corrosion in the inflator.
During 2000: [. . .] recalled ignition modules that could cause a car to stall. When the temperature of the ignition module rises above a certain level, the chance of the module cutting out also increases.
http://www.crash-worthiness.com
Examples: More Automotive Recalls
[. . .] recalled 263,000 1995-97 [vehicles] . . . The airbag electronic control module (AECM) could corrode from water or road salt and then accidentally fire the driver-side airbag.
[. . .] recalled 757,000 1992-97 [vehicles] because a higher-than-specified electrical load through the accessory power feed circuit may cause a short circuit and allow current to flow through ground wiring. This could cause overheating and an electrical fire.
[. . .] recalled 1995-97 [vehicles] because an improperly routed wire harness for the air conditioner may permit wires to rub together and short circuit, resulting in a blown fuse, dead battery, or fire.
http://www.matthewslawfirm.com

Examples: More Automotive Recalls
December 11, 1998: [. . .] recalled 226 [electric vehicles] to reprogram the logic in the motor electronic control unit (ECU), which can mistakenly detect a failure of an electrical current sensor at speeds above 50 mph. It can cause the sudden loss of power and unexpected deceleration.
http://autorepair.about.com/library/recalls/
Elements of Unintended Consequences in Previous Examples
Passage of time: usually fielded units
Nonobvious or obscure causes
Environmental interactions, e.g., corrosion, overheating
Failure modes with significant effects, e.g., fire or injury

The Nature of Problems
Confounding complexity:
  unforeseen circumstances
  multiple causes
Human error:
  nonobviousness to user
  improper use
  design oversight, even if it appears to be a manufacturing problem
Remedies
Truth in advertising: expertise, schedule estimation, management style/employee responses
Work hard to develop reasonable schedules:
  review and testing
  plan for contingencies
Continuous learning:
  lessons learned, your own experience
  others' experiences
Reduce complexity:
  understand and define interactions
  do not reinvent the wheel
  limit features
Teamwork

Integrity
The Big Picture
Truth in advertising (your capability and skills)
Estimation and scheduling
Plan for the long term:
  your success and reputation
  your product's viability
  your company's reputation
Failure and How to Handle It
Types of failure:
  technical
  professional
  political/societal
Embrace failure:
  admit and accept responsibility
  understand and learn
  put past behind you because others won't
  forgive others' failures; help them to rebound
Less control
Progression

Personal Examples
Technical Failure
Ultraviolet satellite camera with image intensifier
Automatic gain control for image intensifier
Nonlinear control problem
First version: blooming/collapsing picture
Second version: unreliable transmission of gain value

Technical Failure: 1st Version
(© 2002, Figure courtesy of the Johns Hopkins University Applied Physics Laboratory.)
[Block diagram; labels: Dn, Up, up-down counter, reset, hi-threshold comparator, pixel clock, frame sync, image intensifier, camera, DAC, video signal]
Technical Failure: 1st Version
Problem: blooming/collapsing picture
Background:
  Discrete logic, up-down counters
  Unstable for bright objects
  Not fully simulated or analyzed
  Short development time (flew breadboards)
Shoulda: analyzed/simulated expected scenes during design
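
A few lines of simulation are enough to show why a counter-based loop can misbehave on bright objects. The slides give only the block diagram, so the update rule below is an assumed toy: the counter steps up each frame when no pixel exceeds the threshold and counts down once per pixel that does. The function names, the 8-bit counter, the threshold, and both test scenes are hypothetical.

```python
def simulate_agc(scene, frames=200, threshold=0.8, up_step=4, max_gain=4.0):
    """Return the counter value after each frame for a list of pixel radiances."""
    counter = 0
    history = []
    for _ in range(frames):
        gain = counter / 255 * max_gain
        hot = sum(1 for radiance in scene if radiance * gain > threshold)
        if hot:
            counter -= hot        # one down-count per over-threshold pixel
        else:
            counter += up_step    # otherwise creep the gain upward
        counter = max(0, min(255, counter))
        history.append(counter)
    return history


def gain_swing(history, window=50):
    """Peak-to-peak counter movement over the last `window` frames."""
    tail = history[-window:]
    return max(tail) - min(tail)


# Scene with smoothly graded brightness versus one large uniform bright object.
graded_scene = [i / 100 for i in range(1, 101)]
bright_object = [1.0] * 60 + [0.05] * 40
```

In this toy model the graded scene settles into a small ripple, while the bright object drives the counter all the way to zero and back on every cycle, which is one plausible picture of "blooming/collapsing". Running candidate scenes like these during design is the kind of analysis the "Shoulda" line asks for.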

Technical Failure: 2nd Version
(© 1996, Oxford University Press, used with permission.)
Technical Failure: 2nd Version
Problem: unreliable transmission of gain value
Background:
  Microcontroller implementation of AGC
  AGC stable for all scenes
  Readout of gain by ground equipment unreliable
  Analog encoding of gain into video frame
Shoulda:
  Use digital encoding into video frame for noise margin
  Needed better understanding of noise environment
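
The noise-margin argument can be made concrete with a small sketch. It assumes an 8-bit gain value written into the frame as eight pixels, one per bit, each either black or white, and read back with a mid-scale threshold. The slides do not describe the actual frame format, so the bit width, pixel levels, and function names here are hypothetical.

```python
def encode_digital(gain, bits=8, low=0, high=255):
    """One pixel per bit, most significant bit first."""
    return [high if (gain >> (bits - 1 - i)) & 1 else low for i in range(bits)]


def decode_digital(pixels, threshold=128):
    """Rebuild the value by thresholding each pixel back to a bit."""
    value = 0
    for pixel in pixels:
        value = (value << 1) | (1 if pixel >= threshold else 0)
    return value


def add_noise(pixels, offsets, low=0, high=255):
    """Apply a per-pixel offset and clip to the valid video range."""
    return [max(low, min(high, p + n)) for p, n in zip(pixels, offsets)]


# Analog encoding for comparison: the gain is a single pixel level,
# so any offset added in transmission lands directly in the readout.
noise = [20, -20, 15, -10, 5, -20, 20, -5]
gain = 173
digital_readout = decode_digital(add_noise(encode_digital(gain), noise))
analog_readout = add_noise([gain], noise[:1])[0]
```

With these offsets the digital readout recovers the gain exactly, because each pixel only has to land on the correct side of the threshold, while the analog readout is off by the full offset. How much margin is enough still depends on the second "Shoulda": knowing the noise environment.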

Professional Failure
Asked to finish programming effort while original designer moved on to other projects
False starts and procrastination
Finally removed myself from project
Professional Failure
Problem: did not complete assignment
Background:
  Mounds of documentation to plow through
  Early realization of no-win situation
  Lost motivation
  No real recognition of work obvious to me
Shoulda:
  Either not taken the job in the first place
  Or, if no choice, plowed through assignment while finding another job (setting precedent)
Professional/Business Failure
Business deal
My personal performance
Technical excellence
Professional excellence
Maintained integrity
Accused of bad stuff, which I did not do
Deal fell through
Professional/Business Failure
Problem: business politics outside my control
Background:
  Interesting proposition and product
  Long-term relationships
  Unknown quantities introduced early in deal
  Weirdnesses grew
Shoulda:
  Either not made deal in the first place
  Or left earlier, before weirdness got out of hand
Note: always deal with integrity or don't deal

Political Failure
Satellite subsystem
Team's performance
Technical excellence
Professional excellence
NASA sponsor pulled project in-house
Political Failure
Problem: politics outside my company's control
Background:
  6-month-long set of trade studies to define architecture
  Thorough studies and review
  Schedule well understood, team prepared to build system
  Groups at NASA out of work
  NASA pulled project in-house to feed their own
Shoulda:
  None, politics happen

A Success Story
The Sidewinder Missile: A Success Story
(Courtesy of the U.S. Navy. All U.S. Navy photos are public domain. http://library.thinkquest.org/jo113065/citations.htm)

Sidewinder recounted
Goal: simple, sturdy, cheap missile
Small development team, 1949-1953
Simple, clever combination of ideas:
  Rollerons: simple but important control
  Proportional navigation simplified circuitry
  Torque-balance servo for maneuvering
  Canard control fins reduced wiring and connectors
Simple data acquisition equipment
Extensive testing and prototyping
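
Part of why proportional navigation keeps circuitry simple is that the textbook law needs only two quantities: how fast the line of sight to the target is rotating, and how fast the range is closing. A minimal planar sketch follows. The function name `pn_lateral_accel` and the navigation constant of 3 are assumptions for illustration; the slides do not give the Sidewinder's actual guidance parameters.

```python
import math


def pn_lateral_accel(rel_pos, rel_vel, nav_constant=3.0):
    """Commanded lateral acceleration under textbook proportional navigation.

    rel_pos and rel_vel are (x, y) of the target relative to the missile,
    in consistent units (for example metres and metres per second).
    """
    rx, ry = rel_pos
    vx, vy = rel_vel
    range_sq = rx * rx + ry * ry
    if range_sq == 0.0:
        return 0.0                                   # already at the target
    los_rate = (rx * vy - ry * vx) / range_sq        # line-of-sight rotation, rad/s
    closing_speed = -(rx * vx + ry * vy) / math.sqrt(range_sq)
    return nav_constant * closing_speed * los_rate


# Target 1000 m ahead, closing at 100 m/s and drifting sideways at 10 m/s.
command = pn_lateral_accel((1000.0, 0.0), (-100.0, 10.0))
```

On a true collision course the line of sight does not rotate, so the command is zero; any drift produces a turn proportional to how fast the sight line is moving.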
Sidewinder Lessons
Breakthroughs require vision
Small teams facilitate commitment and communications
Simple and robust design
Careful, thorough, and extensive testing and integration