solving real problems that required a consultant
Post on 11-Jan-2016
Dave Stewart, PhD
Director of Research and Systems Integration
InHand Electronics
dstewart@inhand.com
www.inhand.com
Solving Real Problems that Required a Consultant
ESC 210
DesignWest 2012 – San Jose
Solving Real Problems that Require a Consultant
Dave Stewart, PhD
© 2012 InHand Electronics, Inc. – www.inhand.com
Objective of this Class
- Share some lessons learned
  - If you encounter a similar issue, the flags raised here may give you additional ideas of what to look for
- View hard problems differently
  - If you use the same steps as a consultant, you can be your own consultant and save time and money
Overview
- When is a Consultant Consulted?
- Satellite Modem
- USB Key Transfers Hang
- Locomotive Braking System Lock-ups
- Flash Corruption on Battlefield
- Cryogenic Temperature Cooling
- Degrading Legacy Software
When is a Consultant Consulted?
- A problem has been around for weeks or months
  - Engineers familiar with the system spent months trying to resolve it, unsuccessfully
- Issues are glitches
  - Problem shows up randomly, and is not easily repeatable
- Traditional debug unsatisfactory
  - Problem goes away or functionality breaks when you add debug
- Can’t identify who is responsible
  - Is this a hardware or a software issue?
  - Is this “our” problem or a “vendor” problem?
Challenges Faced by the Consultant
- Expected to find root cause in days
  - Even though engineers familiar with the system could not do it in weeks or longer
- Traditional methods won’t find issue
  - If they did, problem would already have been found
- Root cause is not known
  - Even if customer says it is software or confined to a particular module, those might just be observable effects, not root cause
Who is a Consultant?
- An expert within the organization
  - Usually an “expensive” resource
  - Need to pull the person off a different project
- An FAE or vendor expert
  - When using COTS hardware or software, it could be the organization that sold the product
- An independent contractor or consultant
  - Can leverage skills and experience applied to many other jobs
Reality of Hard Problems
- Most hard problems are fundamentally simple
  - A common or known issue
  - A “silly” bug in software
  - One hardware signal is faulty
- Difficulty solving problem is for three reasons:
  - Trying to fix problem before understanding root cause
  - Failure to use theory to analyze the system
  - Using the wrong tools to collect clues
- Anyone can become a consultant
  - If they use a systematic approach and the right tools
Consultant Triage: A Systematic Approach to Troubleshooting
- Observe
  - Review information already available
  - Identify what information is missing
- Hypothesize
  - Review the design for known common flaws
  - Verify any applicable errata
  - Consider what fundamental theories are likely in play
  - What tools can be used to prove or disprove a hypothesis?
- Investigate
  - Use new additional techniques to increase quality and quantity of clues to identify root cause
- Solutions
  - Usually, once root cause is known, several viable solutions follow rather quickly.
Satellite Modem
Satellite TV Modem: Key Observables
- Picture would occasionally glitch when button on remote pressed
  - Engineers identified it only happened when guide was being downloaded at same time
- Implemented workaround until problem could be solved:
  - Ignore remote buttons when guide being downloaded
  - Leverages user’s default action when no response, which is to just press the same button again
Satellite Modem: Hypothesize
- Multi-threaded system
  - Possible priority inversion or race condition
- Real-time analysis never performed
  - Measurements of execution time not known
  - Possible transient overload not handled correctly
- Need list of threads and execution time measurements
  - Customer was able to provide list of threads, but not execution time
Satellite Modem: Investigate
- Need execution time for each thread
  - Instrumented code to allow measurements
- Needed to restructure some tasks to follow a proper model
  - Can only measure execution time of threads that follow a definitive model (shown next slide)
- Used logic analyzer to measure execution time
  - See this month’s issue of Embedded Systems Design
  - March 2012 issue available on show floor
Model of Real-Time Task
Thread A: each thread has a main loop that does the following:
- Read ITC Inputs/Events
- Do Processing and Read/Write Devices
- Write ITC Outputs
- Wait for Event
For periodic threads, event is time-based. For other threads, event could be an interrupt, message arrival, semaphore wakeup, or any other signal.
Model of Real-Time Task
Thread A: measuring execution time for the purpose of real-time analysis is always done at the same place.
- Read ITC Inputs/Events
- Do Processing and Read/Write Devices
- Write ITC Outputs
- Wait for Event
Diagram labels: Start Thread Cycle, End Thread Cycle.
Frequency of thread is represented by how often the marked point in the loop is reached.
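A minimal sketch of the task model with the two measurement points marked. The function and parameter names (`run_thread_cycles`, `wait_for_event`, `read_inputs`, `process`, `write_outputs`) are hypothetical, and the placement of the markers (start just after the wait returns, end just before the next wait) is an assumption about the diagram. The talk measured with a logic analyzer on instrumented code; a software clock stands in here only to show where the markers sit.

```python
import time
from typing import Callable, List


def run_thread_cycles(
    wait_for_event: Callable[[], bool],
    read_inputs: Callable[[], object],
    process: Callable[[object], object],
    write_outputs: Callable[[object], None],
    clock: Callable[[], float] = time.perf_counter,
) -> List[float]:
    """Run the loop until wait_for_event() returns False.

    Returns the execution time of each cycle. Time spent blocked in
    wait_for_event() is never counted, because both markers sit
    between one wait and the next.
    """
    execution_times: List[float] = []
    while wait_for_event():                      # Wait for Event
        start = clock()                          # Start Thread Cycle (assumed position)
        inputs = read_inputs()                   # Read ITC Inputs/Events
        outputs = process(inputs)                # Do Processing and Read/Write Devices
        write_outputs(outputs)                   # Write ITC Outputs
        execution_times.append(clock() - start)  # End Thread Cycle (assumed position)
    return execution_times
```

The worst case of the returned values is the execution time a real-time analysis would use, and the number of cycles per unit of wall-clock time gives the thread's frequency.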
Satellite Modem: Solutions
- Used Rate Monotonic Analysis
  - Identified system overloaded when guide being downloaded
  - Also identified on average 20% idle time when guide not being downloaded
  - Culprit was one interrupt handler executing for 6 msec, and temporarily using 80% CPU power.
  - Engineer thought it was only a few hundred microseconds
- Solution that Worked
  - Ensure all threads followed model of real-time thread
  - Split interrupt handler into ISR+IST
  - Defined IST as an Aperiodic Server
  - Scheduled system using Rate Monotonic
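A rough sketch of the kind of check a Rate Monotonic Analysis starts with: total utilization compared against the Liu and Layland bound n(2^(1/n) - 1). The task set below is invented for illustration; only the 6 msec handler comes from the talk, and its 7.5 msec minimum inter-arrival time is an assumption chosen so that it accounts for 80% of the CPU.

```python
from typing import List, Tuple

# (worst-case execution time, period), both in the same time unit
Task = Tuple[float, float]


def utilization(tasks: List[Task]) -> float:
    """Fraction of the CPU the task set needs."""
    return sum(c / t for c, t in tasks)


def rm_bound(n: int) -> float:
    """Liu and Layland utilization bound for n periodic tasks."""
    return n * (2 ** (1.0 / n) - 1)


def passes_rm_test(tasks: List[Task]) -> bool:
    """Sufficient, not necessary: a set that fails may still be schedulable."""
    if not tasks:
        return True
    return utilization(tasks) <= rm_bound(len(tasks))


# Hypothetical baseline threads (msec); not taken from the talk.
baseline = [(2.0, 10.0), (5.0, 25.0), (20.0, 100.0)]
# Interrupt handler during guide download: 6 msec, assumed every 7.5 msec.
with_download = baseline + [(6.0, 7.5)]
```

With these numbers the baseline passes the test, while adding the handler pushes utilization above 1.0, which no scheduler can satisfy. That is the kind of "obvious" result the analysis produces once execution times are known.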
Satellite Modem
- Why was Consultant Needed?
  - Customer was not applying fundamental real-time systems theory
    - A simple Rate Monotonic Analysis of the problem showed an obvious root cause
  - Customer did not have the tools needed to measure execution time.
    - Execution time is key input into the analysis.
USB Key Transfer Hang
Can you spot the difference?
USB Key Transfer Hang: Key Observables
- Two apparently identical USB keys connected to embedded system
  - One worked consistently 100% of the time
  - Other one locked up 50% of the time during long transfers.
- Customer kept running tests from user-space.
  - After the first few tests, each further test only provided duplicate information; no new clues.
- The key had a custom mechanical construction
  - Using a different key was not an option.
- On desktop PC, both worked 100% of the time
- Ran controlled tests with different file sizes, 1KB to 1GB.
  - Problems only started above 1MB on embedded system
USB Key Transfer Hang: Hypothesize
- Hangs == deadlock
  - Anytime there is a hang, look for a deadlock
- Looks are deceiving
  - Although the two keys “looked” the same, they might be different versions
- Compare working (PC) to non-working (Embedded System) at the USB interface
  - Focus on large file sizes since small files did not fail
USB Key Transfer Hang: Investigate
- Driver
  - Instrument USB driver at lowest level
  - Every time it sent a message, log event
  - Log every time a lock was obtained
- Protocol
  - Use a USB analyzer to capture the transfer
- Version
  - Analyzer allowed checking firmware version dates … the good one was Rev 8.02, the failing one 8.01
  - This confirmed the keys were in fact not identical
USB Key Transfer Hang: Solutions
- It works fine on PC
  - Since even “bad” key worked fine on PC, reverse engineer what PC was doing compared to embedded system
- USB Analyzer provided key clue
  - The PC broke large blocks into many smaller blocks
- Problem was the USB key
  - Changing hardware is always much more expensive than changing software
  - Can a software workaround be used to avoid the issue with the key?
  - Making change to embedded driver to break larger transfer into smaller ones allowed the bad key to work consistently.
- Problem was not a deadlock.
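The workaround can be sketched as follows. This is only an illustration of splitting one large transfer into smaller ones; `send` is a hypothetical stand-in for the driver's low-level write call, and the 64 KiB default is an assumption, since the talk does not state the block size the PC used.

```python
from typing import Callable, Iterator


def split_transfer(data: bytes, max_chunk: int) -> Iterator[bytes]:
    """Yield consecutive slices of data, each at most max_chunk bytes."""
    if max_chunk <= 0:
        raise ValueError("max_chunk must be positive")
    for offset in range(0, len(data), max_chunk):
        yield data[offset:offset + max_chunk]


def send_in_chunks(
    data: bytes,
    send: Callable[[bytes], None],
    max_chunk: int = 64 * 1024,  # assumed size, not from the talk
) -> int:
    """Send data as a series of smaller transfers; return how many were sent."""
    count = 0
    for chunk in split_transfer(data, max_chunk):
        send(chunk)
        count += 1
    return count
```

In practice the chunk size would be picked from what the USB analyzer capture of the PC showed, then confirmed against the failing key.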
USB Key Transfer Hang
- Why was Consultant Needed?
  - Customer focus on creating more and more tests was not producing more clues
  - Customer was not using the right tool to debug
    - They didn’t have a USB analyzer because it was “expensive”
    - $2000 for a tool was much cheaper than losing month+ labor!
  - USB Analyzer was key tool that provided the clues to quickly zoom into root cause
Locomotive Anti-Lock Braking Hang
Locomotive Anti-Lock Braking Hang: Key Observables
- Randomly entire system would lock up
  - Manual override needed to be engaged
- Debug showed threads were blocked
  - Post-mortem dump showed multiple threads all waiting for message
- Design was a message-passing system
  - It followed guidelines given in RTOS documentation
Locomotive Braking System Hang: Hypothesis
- Most likely causes
  - Deadlock
  - Lost message
- Blocking form of message passing was being used
  - This is known to be problematic in real-time systems
  - Potentially prone to deadlock
Locomotive Braking System Hang: Investigation
- Does system have all four necessary conditions required for a deadlock to occur?
  - Mutual Exclusion
  - Lock one resource while waiting for another
  - Cannot preempt resource usage
  - Circular wait
- Answer was yes!
  - No reason to try to pinpoint the sequence of events that leads to deadlock
  - If a deadlock is possible, change the design
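As an illustration of checking the fourth condition at the design level, the sketch below looks for a cycle in a wait-for graph. It assumes each blocked thread waits on exactly one other thread (for example, the peer of a blocking message call); the thread names are hypothetical and the talk does not describe doing this check in code.

```python
from typing import Dict, List, Optional


def find_circular_wait(waits_on: Dict[str, str]) -> Optional[List[str]]:
    """Return one cycle in the wait-for graph, or None if there is none.

    waits_on maps a blocked thread to the single thread it is waiting on.
    """
    for start in waits_on:
        path: List[str] = []
        seen = set()
        node: Optional[str] = start
        # Follow the chain of waits until it ends or revisits a thread.
        while node is not None and node not in seen:
            seen.add(node)
            path.append(node)
            node = waits_on.get(node)
        if node is not None:
            # node was already on the path, so the tail from node is a cycle.
            return path[path.index(node):]
    return None
```

For example, `find_circular_wait({"brake": "sensor", "sensor": "logger", "logger": "brake"})` reports the three threads as a cycle. If the design permits such a graph at all, the point of the slide applies: change the design rather than chase the exact interleaving.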
Locomotive Braking System Hang: Solution
- Avoid deadlock by eliminating one of the necessary conditions from being possible:
  - Prevented “waiting” for another resource by changing the system to use non-blocking communication
- Implemented this modified design
  - Never encountered subsequent deadlocks
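A minimal sketch of non-blocking communication, assuming a bounded mailbox between two threads. The `Mailbox` class is hypothetical and is built on Python's standard `queue` module purely for illustration; the actual system used an RTOS, whose API is not shown in the talk.

```python
import queue
from typing import Optional


class Mailbox:
    """Bounded mailbox whose send and receive never block the caller."""

    def __init__(self, capacity: int = 8) -> None:
        self._q: "queue.Queue[object]" = queue.Queue(maxsize=capacity)

    def try_send(self, message: object) -> bool:
        """Return False instead of waiting when the mailbox is full."""
        try:
            self._q.put_nowait(message)
            return True
        except queue.Full:
            return False  # caller decides: drop, retry next cycle, or flag an error

    def try_receive(self) -> Optional[object]:
        """Return None instead of waiting when the mailbox is empty."""
        try:
            return self._q.get_nowait()
        except queue.Empty:
            return None  # nothing yet; carry on with the rest of the cycle
```

Because neither call can wait, a thread never holds one resource while blocked on another, which removes the hold-and-wait condition. Note that this sketch cannot distinguish an empty mailbox from a message whose value is `None`.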
Locomotive Braking System Hang
- Why was a Consultant Needed?
  - Missing theoretical foundation to recognize that recommended design by RTOS vendor was flawed
  - This was a design flaw; customer tried to fix implementation by changing priorities and synchronization, but to no avail
Ruggedized PDA Flash Corruption
Ruggedized PDA Flash Corruption: Key Observables
- Some units that were fielded for a year or more started crashing on boot-up
  - Reformatting flash seemed to fix the problem, but only temporarily
  - No other indication of what problem was
- “Damaged” units were sent back for analysis
  - Confirmed flash was corrupted, but no evidence of why
Ruggedized PDA Flash Corruption: Hypothesis
- Unit encountering hard shut-offs
  - Verified file system was transactional
- Possible failure of flash chip
  - Run extensive tests
- Compare image of corrupted flash with good unit
  - Filesystem area expected to be different
  - Focus on read-only parts of memory, ensure no corruption there
Ruggedized PDA Flash Corruption: Investigation
- Flash tests proved there were occasional bit errors
  - That is enough to point to the chip as culprit
  - But why?
- Review of theory indicated flash rated for 100,000 erase/write cycles per block
  - It seems like a lot, but that means 100,000/365 ≈ 274 cycles a day on the same block could damage the flash within a year
Ruggedized PDA Flash Corruption: Solutions
- Enabled logging on a test unit to determine how flash was being used
  - Found that the registry was being written once per minute … or 1440 times per day.
  - Although the file system had wear leveling, when it was mostly full, the number of blocks available for wear leveling was only a handful
  - This meant blocks were being erased/written about a couple of hundred times per day each
- Wearing out of the flash is to be expected.
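The arithmetic behind this conclusion can be sketched as below. The rated 100,000 cycles and the 1440 writes per day come from the slides; the figure of 7 blocks left in the wear-leveling rotation is an assumption standing in for "a handful", and the sketch also assumes each registry write costs one erase/write cycle on one block.

```python
def daily_cycle_budget(rated_cycles: int, target_days: int) -> float:
    """Erase/write cycles per day a block can absorb and still last target_days."""
    return rated_cycles / target_days


def cycles_per_block_per_day(writes_per_day: int, blocks_in_rotation: int) -> float:
    """Average daily wear per block when writes rotate over a few blocks."""
    return writes_per_day / blocks_in_rotation


def days_until_worn(rated_cycles: int, writes_per_day: int, blocks_in_rotation: int) -> float:
    """Days until the rotating blocks reach their rated cycle count."""
    return rated_cycles / cycles_per_block_per_day(writes_per_day, blocks_in_rotation)


WRITES_PER_DAY = 60 * 24   # once per minute, from the slides
ASSUMED_FREE_BLOCKS = 7    # hypothetical value for "a handful"
```

Under these assumptions each block sees roughly 206 cycles a day against a one-year budget of roughly 274, and `days_until_worn(100_000, WRITES_PER_DAY, ASSUMED_FREE_BLOCKS)` comes out near 486 days. That is in the same range as units failing after "a year or more" in the field.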
Ruggedized PDA Flash Corruption: Solutions (cont’d)
- Only fix identified was to replace flash
  - Workarounds to avoid bad blocks did not work, because blocks were scattered, and that only meant even fewer blocks available for wear leveling
- For units that did not fail yet …
  - All units could eventually fail
  - Modified design to write logs by keeping file open then doing a flush, instead of open/write/close each time
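The two logging patterns can be contrasted as follows. This is only a sketch of the call sequence; the function names are hypothetical, and how many flash erase/write cycles each pattern actually costs depends on the file system, which the talk does not detail.

```python
from typing import Iterable


def log_reopen_each_time(path: str, lines: Iterable[str]) -> None:
    """Original pattern: open, write, close for every entry."""
    for line in lines:
        with open(path, "a", encoding="utf-8") as f:
            f.write(line + "\n")


def log_keep_open(path: str, lines: Iterable[str]) -> None:
    """Modified pattern: open once, flush after each entry."""
    with open(path, "a", encoding="utf-8") as f:
        for line in lines:
            f.write(line + "\n")
            f.flush()  # hand buffered data to the OS without closing the file
```

Both produce the same file contents; the difference is that the second avoids the per-entry open and close, which the talk identifies as the change made for units that had not yet failed.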
Ruggedized PDA Flash Corruption
- Why was a consultant needed?
  - Customer did not pay attention to theoretical limits of flash; it was a design oversight
  - Engineers working on project did not have a good set of flash tests that could catch the issue
Cryogenic Temperature Control
Cryogenic Temperature Control: Key Observables
- Temperature needed to be maintained at 4 K
  - Tolerance of +/- 10%
  - But was fluctuating +/- 2 K (200%)
- Engineers using room temperature to troubleshoot
  - Using heater and ice bucket to verify control algorithm
Cryogenic Temperature Control: Hypothesis
- Temperature behavior at room temperature not same as near absolute zero
  - Control algorithm needed to be based on theory of temperature as it approaches zero
- Heating/cooling cycles took tens of minutes
  - Very difficult to monitor behavior of temperature over extended period of time
Cryogenic Temperature Control: Investigation
- Created an emulator
  - A PC external to controller took outputs of embedded system, and returned an analog signal to represent temperature
- New understanding of temperature
  - To create emulator, needed to think like the canister, and understand laws of heating and cooling
- Emulator allowed execution thousands of times faster than real-time
  - In 1 minute, had 60,000 data points.
  - Previously, took 10 minutes to get 1 data point
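The structure of such an emulator can be sketched as a controller stepped in a loop against a plant model, with no waiting on real time. Everything numeric here is invented: the first-order thermal model, the leak rate, and the cooler gain are toy assumptions and are not the heating and cooling laws that apply near 4 K. The real emulator also exchanged analog signals with the embedded controller rather than calling a Python function.

```python
from typing import Callable, List


def emulate(
    controller: Callable[[float], float],
    initial_temp: float,
    steps: int,
    dt: float = 1.0,
    ambient: float = 10.0,     # toy value for the surroundings
    leak_rate: float = 0.01,   # toy heat-leak coefficient
    cooler_gain: float = 0.2,  # toy cooling effect at full command
) -> List[float]:
    """Step a toy first-order thermal model and record every temperature.

    Each step the controller sees the current temperature and returns a
    cooling command, clamped to [0, 1]. Heat leaks in from `ambient`;
    the cooler removes heat in proportion to the command.
    """
    temp = initial_temp
    history: List[float] = []
    for _ in range(steps):
        command = min(1.0, max(0.0, controller(temp)))
        temp += dt * (leak_rate * (ambient - temp) - cooler_gain * command)
        history.append(temp)
    return history


def proportional_controller(setpoint: float, gain: float) -> Callable[[float], float]:
    """Hypothetical controller: command proportional to the error."""
    return lambda temp: gain * (temp - setpoint)
```

Calling `emulate(proportional_controller(4.0, 2.0), 10.0, 60_000)` returns 60,000 data points without any heating or cooling cycle having to physically complete, which is the benefit the slide describes.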
Cryogenic Temperature Control: Solutions
- Emulator created by different engineer than controller
  - Both emulator and controller need to agree in order to get correct results
  - When they don’t agree, both systems get debugged
- Identified reduced number of significant digits near zero resulted in inaccurate calculations
  - Needed to perform computations as micro-degrees instead of milli-degrees
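A small sketch of why the unit matters, assuming the controller works in integer fixed-point (the talk does not say how values were represented, so this is an assumption). The 4.0004 K reading is a made-up example.

```python
MILLI = 1_000        # integer units per kelvin when working in milli-degrees
MICRO = 1_000_000    # integer units per kelvin when working in micro-degrees


def to_fixed(kelvin: float, units_per_kelvin: int) -> int:
    """Convert a temperature to an integer count of the chosen unit."""
    return round(kelvin * units_per_kelvin)


def error_in_units(measured_k: float, setpoint_k: float, units_per_kelvin: int) -> int:
    """Control error as the controller would see it in fixed-point."""
    return to_fixed(measured_k, units_per_kelvin) - to_fixed(setpoint_k, units_per_kelvin)


def relative_step(kelvin: float, units_per_kelvin: int) -> float:
    """Size of one integer step as a fraction of the reading."""
    return 1.0 / (kelvin * units_per_kelvin)
```

With a 4 K setpoint, a reading of 4.0004 K gives an error of 0 in milli-degrees but 400 in micro-degrees, so the coarser unit hides the deviation entirely. One milli-degree step is also 1/4000 of the reading at 4 K versus 1/300000 at 300 K, which is why code that behaved at room temperature lost significant digits near zero.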
Cryogenic Temperature Control
- Why was a consultant needed?
  - Customer not using correct tool, namely an emulator
  - Engineer failed to apply correct theory (heating and cooling laws) near absolute zero
Degrading Legacy Software: Key Observables
- Customer reported, “Software is degrading”
  - Control software for locomotives that had been working for two decades started crashing
  - Watchdog timer causing board resets
- Replacing main boards did not fix problem
  - Customer suspected possible issue with sensor data
- Adding debug caused system to fail worse
Degrading Legacy Software: Hypothesis
- Adding debug made system worse
  - System is overloaded
  - Debug breaks real-time performance
- System only starting to fail now
  - Perhaps a different path of execution was increasing utilization
  - Verify error handling.
Degrading Legacy Software: Investigation
- Used logic analyzer method to troubleshoot
  - Allowed collecting data in real-time where traditional debug methods caused system to fail totally
  - Monitored inputs and outputs of functions.
  - Found data output from sensors had significant noise and had occasional bad samples
- Root cause – sensors are degrading
  - It was determined that it was the sensors that started to degrade
  - The software did not have appropriate error handling code, and faulty calculations were causing processor exceptions
Degrading Legacy Software: Solutions
- Replacing sensors is a very costly solution
  - Changing sensor fixed issue. However, replacing all sensors on all locomotives too costly
- Software workaround: Add filters and error handling
  - Desire to add filtering to clean up the now-noisier data
  - Processor was already fully loaded; adding software filters caused overloads and system failure
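One possible shape for the filtering and error handling is sketched below. The talk does not say which filter was used; a range check followed by a median is my choice for illustration, and the names and thresholds are hypothetical.

```python
from statistics import median
from typing import List, Optional, Sequence


def valid_samples(samples: Sequence[float], low: float, high: float) -> List[float]:
    """Drop samples outside the physically plausible range (NaN is dropped too)."""
    return [s for s in samples if low <= s <= high]


def filtered_reading(
    samples: Sequence[float],
    low: float,
    high: float,
    min_valid: int = 3,
) -> Optional[float]:
    """Median of the in-range samples, or None if too few are usable."""
    good = valid_samples(samples, low, high)
    if len(good) < min_valid:
        return None  # caller reports a sensor fault instead of computing on bad data
    return median(good)
```

Returning `None` rather than a number keeps a bad burst of samples from reaching calculations that could raise a processor exception. The extra work per reading is also exactly the kind of load that the already saturated processor could not absorb until utilization was reduced.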
Degrading Legacy Software: Solutions (cont’d)
- Reduce processor load
  - Performed real-time analysis: near 0% idle time
  - Found a 5-msec control loop using 60% of processor bandwidth
  - Why 5-msec? Customer: “Because it works”
Degrading Legacy Software: Solutions (cont’d)
- Review design decisions
  - How slow can control loop be run?
    - Edge of customer comfort level was at about 50 msec
    - Decided to run at 25-msec instead
- Final solution
  - Revised real-time analysis, idle time up to 48%
    - Now using 3msec every 25msec, instead of 15msec every 25msec
  - Added filtering for sensors starting to fail
  - Added error notifications for sensors that needed replacing
  - Created enough ‘spare’ processing time to add traditional debug
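The numbers on these slides follow from utilization being execution time divided by period. The sketch below reproduces them; the 3 msec execution time is the slide's own figure (60% of a 5 msec period), and the helper names are hypothetical.

```python
def utilization(execution_ms: float, period_ms: float) -> float:
    """Fraction of the CPU a periodic loop consumes."""
    return execution_ms / period_ms


def busy_time_ms(execution_ms: float, period_ms: float, window_ms: float) -> float:
    """CPU time the loop consumes within a window of window_ms."""
    return utilization(execution_ms, period_ms) * window_ms


before = utilization(3.0, 5.0)    # original 5 msec loop: 60% of the CPU
after = utilization(3.0, 25.0)    # same code at 25 msec: 12% of the CPU
freed = before - after            # 48 percentage points returned to idle
```

`busy_time_ms(3.0, 5.0, 25.0)` gives the 15 msec per 25 msec quoted for the old design, and `freed` matches the idle time rising from near 0% to 48%.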
Degrading Legacy Software
- Why was Consultant Needed?
  - Wrong tool: customer did not have a good way to obtain debug in real-time. Using a logic analyzer resolved this
  - No real-time analysis: once a real-time analysis was done, it was obvious which thread needed to be optimized.
  - Real-time systems theory: to reduce utilization, only two things can be done:
    1) Reduce execution time
    2) Increase period of execution
  - #2 is usually easier, so verify the possibility of that first
Summary
- Most hard problems are fundamentally simple
  - A common or known issue
  - A “silly” bug in software
  - One hardware signal is faulty
- Difficulty solving problem is for two reasons:
  - Trying to fix problem before understanding root cause
  - Using the wrong tools to collect clues
- Anyone can become a consultant
  - If they use a systematic approach and the right tools