Solving Real Problems that Required a Consultant

ESC 210

Dave Stewart, PhD
Director of Research and Systems Integration
InHand Electronics
[email protected]

DesignWest 2012 – San Jose
Solving Real Problems that Require a Consultant
Dave Stewart, PhD
© 2012 InHand Electronics, Inc. – www.inhand.com

Objective of this Class

- Share some lessons learned
  - If you encounter a similar issue, the flags raised here may give you additional ideas of what to look for
- View hard problems differently
  - If you use the same steps as a consultant, you can be your own consultant and save time and money

Overview

- When is a Consultant Consulted?
- Satellite Modem
- USB Key Transfers Hang
- Locomotive Braking System Lock-ups
- Flash Corruption on Battlefield
- Cryogenic Temperature Cooling
- Degrading Legacy Software

When is a Consultant Consulted?

- A problem has been around for weeks or months
  - Engineers familiar with the system have spent months trying to resolve it, unsuccessfully
- Issues are glitches
  - Problem shows up randomly and is not easily repeatable
- Traditional debug is unsatisfactory
  - Problem goes away or functionality breaks when you add debug
- Can’t identify who is responsible
  - Is this a hardware or a software issue?
  - Is this “our” problem or a “vendor” problem?

Challenges Faced by the Consultant

- Expected to find root cause in days
  - Even though engineers familiar with the system could not do it in weeks or longer
- Traditional methods won’t find the issue
  - If they did, the problem would already have been found
- Root cause is not known
  - Even if the customer says it is software or confined to a particular module, those might just be observable effects, not the root cause

Who is a Consultant?

- An expert within the organization
  - Usually an “expensive” resource
  - Need to pull the person off a different project
- An FAE or vendor expert
  - When using COTS hardware or software, it could be the organization that sold the product
- An independent contractor or consultant
  - Can leverage skills and experience applied to many other jobs

Reality of Hard Problems

- Most hard problems are fundamentally simple
  - A common or known issue
  - A “silly” bug in software
  - One hardware signal is faulty
- Difficulty solving a problem comes from three causes:
  - Trying to fix the problem before understanding the root cause
  - Failure to use theory to analyze the system
  - Using the wrong tools to collect clues
- Anyone can become a consultant
  - If they use a systematic approach and the right tools

Consultant Triage: A Systematic Approach to Troubleshooting

- Observe
  - Review information already available
  - Identify what information is missing
- Hypothesize
  - Review the design for known common flaws
  - Verify any applicable errata
  - Consider what fundamental theories are likely in play
  - What tools can be used to prove or disprove a hypothesis?
- Investigate
  - Use new additional techniques to increase quality and quantity of clues to identify root cause
- Solutions
  - Usually, once root cause is known, several viable solutions follow rather quickly.

Satellite Modem

Satellite TV Modem: Key Observables

- Picture would occasionally glitch when a button on the remote was pressed
  - Engineers identified that it only happened when the guide was being downloaded at the same time
- Implemented a workaround until the problem could be solved: ignore remote buttons while the guide is being downloaded
  - Leverages the user’s default action when there is no response, which is to just press the same button again

Satellite Modem: Hypothesize

- Multi-threaded system
  - Possible priority inversion or race condition
- Real-time analysis never performed
  - Measurements of execution time not known
  - Possible transient overload not handled correctly
- Need list of threads and execution time measurements
  - Customer was able to provide list of threads, but not execution time

Satellite Modem: Investigate

- Need execution time for each thread
  - Instrumented code to allow measurements
- Needed to restructure some tasks to follow a proper model
  - Can only measure execution time of threads that follow a definitive model (shown on next slide)
- Used logic analyzer to measure execution time
  - See this month’s issue of Embedded Systems Design
  - March 2012 issue available on show floor

Model of Real-Time Task

Thread A: each thread has a main loop that does the following:

1. Read ITC inputs/events
2. Do processing and read/write devices
3. Write ITC outputs
4. Wait for event

For periodic threads, the event is time-based. For other threads, the event could be an interrupt, message arrival, semaphore wakeup, or any other signal.

Model of Real-Time Task

Thread A: measuring execution time for the purpose of real-time analysis is always done at the same place.

1. Read ITC inputs/events
2. Do processing and read/write devices
3. Write ITC outputs
4. Wait for event

Measurement points marked on the diagram: Start Thread Cycle and End Thread Cycle.

The frequency of the thread is represented by how often the marked point is reached.
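The thread model above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `run_thread_cycles` and its callback parameters are hypothetical, a software timer stands in for the logic analyzer used in the talk, and the start/end markers are assumed to bracket the read, process, and write steps.

```python
import time


def run_thread_cycles(read_inputs, process, write_outputs, wait_for_event, cycles):
    """Run the four-step thread loop and return each cycle's execution time in seconds."""
    durations = []
    for _ in range(cycles):
        start = time.perf_counter()      # assumed "Start Thread Cycle" marker
        inputs = read_inputs()           # 1. read ITC inputs/events
        outputs = process(inputs)        # 2. do processing, read/write devices
        write_outputs(outputs)           # 3. write ITC outputs
        durations.append(time.perf_counter() - start)  # assumed "End Thread Cycle" marker
        wait_for_event()                 # 4. wait; waiting time is not counted
    return durations
```

Because every cycle is timed at the same two places, the measurements from different threads are directly comparable and can feed a real-time analysis.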

Satellite Modem: Solutions

- Used Rate Monotonic Analysis
  - Identified that the system was overloaded when the guide was being downloaded
  - Also identified on average 20% idle time when the guide was not being downloaded
  - Culprit was one interrupt handler executing for 6 msec and temporarily using 80% of CPU power
  - Engineer thought it was only a few hundred microseconds
- Solution that Worked
  - Ensured all threads followed the model of a real-time thread
  - Split interrupt handler into ISR+IST
  - Defined IST as an Aperiodic Server
  - Scheduled system using Rate Monotonic
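A first-pass Rate Monotonic check needs only each thread's execution time and period. The sketch below applies the classic Liu and Layland utilization bound, which is a sufficient (not necessary) schedulability test. The task sets are hypothetical numbers chosen for illustration; in particular, the 7.5 msec period paired with the 6 msec handler is an assumption made so that it uses 80% of the CPU, and is not a figure from the talk.

```python
def rm_utilization_bound(n):
    """Liu and Layland bound for n periodic tasks under rate monotonic scheduling."""
    return n * (2 ** (1.0 / n) - 1)


def total_utilization(tasks):
    """tasks is a list of (execution_time, period) pairs in the same time unit."""
    return sum(c / t for c, t in tasks)


def passes_rm_bound(tasks):
    """True if the task set is guaranteed schedulable by the utilization bound."""
    return total_utilization(tasks) <= rm_utilization_bound(len(tasks))


# Hypothetical task sets (msec): (execution time, period)
normal_load = [(1.0, 10.0), (2.0, 20.0), (6.0, 40.0)]
with_long_handler = [(6.0, 7.5), (2.0, 20.0)]  # assumed 6 msec handler at 80% CPU
```

Here `passes_rm_bound(normal_load)` holds, while `passes_rm_bound(with_long_handler)` does not, which flags the handler as the thread to restructure.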

Satellite Modem

Why was Consultant Needed?

- Customer was not applying fundamental real-time systems theory
  - A simple Rate Monotonic Analysis of the problem showed an obvious root cause
- Customer did not have the tools needed to measure execution time
  - Execution time is a key input into the analysis

USB Key Transfer Hang

Can you spot the difference?

USB Key Transfer Hang: Key Observables

- Two apparently identical USB keys connected to embedded system
  - One worked consistently 100% of the time
  - The other one locked up 50% of the time during long transfers
- Customer kept running tests from user-space
  - After the first few tests, each further test only provided duplicate information; no new clues
- The key had a custom mechanical construction
  - Using a different key was not an option
- On desktop PC, both worked 100% of the time
- Ran controlled tests with different file sizes, 1KB to 1GB
  - Problems only started above 1MB on embedded system

USB Key Transfer Hang: Hypothesize

- Hangs == deadlock
  - Anytime there is a hang, look for a deadlock
- Looks are deceiving
  - Although the two keys “looked” the same, they might be different versions
- Compare working (PC) to non-working (Embedded System) at the USB interface
  - Focus on large file sizes since small files did not fail

USB Key Transfer Hang: Investigate

- Driver
  - Instrument USB driver at lowest level
  - Every time it sent a message, log event
  - Log every time a lock was obtained
- Protocol
  - Use a USB analyzer to capture the transfer
- Version
  - Analyzer allowed checking firmware version dates … the good one was Rev 8.02, the failing one 8.01
  - This confirmed the keys were in fact not identical

USB Key Transfer Hang: Solutions

- It works fine on PC
  - Since even the “bad” key worked fine on PC, reverse engineer what the PC was doing compared to the embedded system
- USB Analyzer provided key clue
  - The PC broke large blocks into many smaller blocks
- Problem was the USB key
  - Changing hardware is always much more expensive than changing software
  - Can a software workaround be used to avoid the issue with the key?
  - Making a change to the embedded driver to break larger transfers into smaller ones allowed the bad key to work consistently.
- Problem was not a deadlock.
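The workaround of breaking a large transfer into smaller ones can be sketched as follows. This is an illustration, not the actual driver change: `write_in_chunks`, the `write_block` callback, and the `max_chunk` parameter are hypothetical names, and the talk does not state which chunk size was used.

```python
def write_in_chunks(write_block, data, max_chunk):
    """Send data through write_block in pieces no larger than max_chunk bytes.

    Returns the total number of bytes handed to write_block.
    """
    if max_chunk <= 0:
        raise ValueError("max_chunk must be positive")
    sent = 0
    for offset in range(0, len(data), max_chunk):
        chunk = data[offset:offset + max_chunk]
        write_block(chunk)   # hypothetical low-level transfer call
        sent += len(chunk)
    return sent
```

The caller still sees one logical transfer; only the size of each request on the bus changes, mirroring what the USB analyzer showed the PC doing.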

USB Key Transfer Hang

Why was Consultant Needed?

- Customer focus on creating more and more tests was not producing more clues
- Customer was not using the right tool to debug
  - They didn’t have a USB analyzer because it was “expensive”
  - $2000 for a tool was much cheaper than losing a month or more of labor!
  - USB Analyzer was the key tool that provided the clues to quickly zoom in on root cause

Locomotive Anti-Lock Braking Hang

Locomotive Anti-Lock Braking Hang: Key Observables

- Randomly, the entire system would lock up
  - Manual override needed to be engaged
- Debug showed threads were blocked
  - Post-mortem dump showed multiple threads all waiting for a message
- Design was a message-passing system
  - It followed guidelines given in RTOS documentation

Locomotive Braking System Hang: Hypothesis

- Most likely causes
  - Deadlock
  - Lost message
- Blocking form of message passing was being used
  - This is known to be problematic in real-time systems
  - Potentially prone to deadlock

Locomotive Braking System Hang: Investigation

- Does the system have all four necessary conditions required for a deadlock to occur?
  - Mutual exclusion
  - Lock one resource while waiting for another
  - Cannot preempt resource usage
  - Circular wait
- Answer was yes!
  - No reason to try to pinpoint the sequence of events that leads to deadlock
  - If a deadlock is possible, change the design
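The fourth condition, circular wait, can be checked on paper or with a small script once the design's "who waits on whom" relationships are written down. The sketch below assumes each thread blocks on at most one other thread at a time; the function name and the thread names are hypothetical.

```python
def find_circular_wait(waits_for):
    """Return a list of threads forming a wait cycle, or None if there is none.

    waits_for maps each thread name to the thread it blocks on (or None).
    """
    for start in waits_for:
        seen = []
        current = start
        # Follow the chain of waits until it ends or revisits a thread.
        while current is not None and current not in seen:
            seen.append(current)
            current = waits_for.get(current)
        if current is not None:
            return seen[seen.index(current):]
    return None
```

For example, `find_circular_wait({"A": "B", "B": "C", "C": "A"})` reports the cycle through A, B, and C. A reported cycle shows that deadlock is possible in the design, whether or not it has been observed yet.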

Locomotive Braking System Hang: Solution

- Avoid deadlock by eliminating one of the necessary conditions:
  - Prevented “waiting” for another resource by changing the system to use non-blocking communication
- Implemented this modified design
  - Never encountered subsequent deadlocks
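As a minimal sketch of non-blocking communication, the helper below polls a mailbox and returns immediately when it is empty, so a thread never holds one resource while blocked on another. It uses Python's standard `queue` module as a stand-in for the RTOS message primitives, which the talk does not name; `poll_message` is a hypothetical helper.

```python
import queue


def poll_message(mailbox, default=None):
    """Return the next message if one is ready, otherwise default, without blocking."""
    try:
        return mailbox.get_nowait()
    except queue.Empty:
        return default
```

A thread built on the four-step loop model would call this in its "read inputs" step and carry on with the last known value when nothing new has arrived.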

Locomotive Braking System Hang

Why was a Consultant Needed?

- Missing theoretical foundation to recognize that the design recommended by the RTOS vendor was flawed
- This was a design flaw; the customer tried to fix the implementation by changing priorities and synchronization, but to no avail

Ruggedized PDA Flash Corruption

Ruggedized PDA Flash Corruption: Key Observables

- Some units that had been fielded for a year or more started crashing on boot-up
  - Reformatting flash seemed to fix the problem, but only temporarily
  - No other indication of what the problem was
- “Damaged” units were sent back for analysis
  - Confirmed flash was corrupted, but no evidence of why

Ruggedized PDA Flash Corruption: Hypothesis

- Unit encountering hard shut-offs
  - Verified file system was transactional
- Possible failure of flash chip
  - Run extensive tests
- Compare image of corrupted flash with good unit
  - Filesystem area expected to be different
  - Focus on read-only parts of memory, ensure no corruption there

Ruggedized PDA Flash Corruption: Investigation

- Flash tests proved there were occasional bit errors
  - That is enough to point to the chip as culprit
  - But why?
- Review of theory indicated flash rated for 100,000 erase/write cycles per block
  - It seems like a lot, but 100,000/365 ≈ 274, so about 274 cycles a day on the same block could wear it out within a year

Ruggedized PDA Flash Corruption: Solutions

- Enabled logging on a test unit to determine how flash was being used
  - Found that the registry was being written once per minute … or 1440 times per day
  - Although the file system had wear leveling, when it was mostly full, the number of blocks available for wear leveling was only a handful
  - This meant blocks were being erased/written about a couple of hundred times per day each
  - Wearing out of the flash is to be expected
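The wear estimate can be worked through with a short calculation. This sketch assumes each registry write costs one erase/write cycle and that wear leveling spreads writes evenly over the free blocks; the count of 6 free blocks is a hypothetical stand-in for "a handful", chosen only because it gives a couple of hundred cycles per block per day.

```python
def days_until_wearout(writes_per_day, blocks_in_rotation, rated_cycles=100_000):
    """Estimate days until each block in the wear-leveling pool reaches its rated cycles."""
    cycles_per_block_per_day = writes_per_day / blocks_in_rotation
    return rated_cycles / cycles_per_block_per_day


# 1440 writes/day over an assumed 6 free blocks is 240 cycles per block per day.
estimated_days = days_until_wearout(1440, 6)
```

Under these assumptions the pool reaches its rating after roughly 417 days, which is consistent with units failing after a year or more in the field.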

Ruggedized PDA Flash Corruption: Solutions (cont’d)

- Only fix identified was to replace flash
  - Workarounds to avoid bad blocks did not work, because the bad blocks were scattered, and avoiding them only meant even fewer blocks available for wear leveling
- For units that did not fail yet …
  - All units could eventually fail
  - Modified design to write logs by keeping the file open and then doing a flush, instead of open/write/close each time
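The logging change can be illustrated with a small class that opens the log once and flushes after each write. This is a sketch of the pattern only: `OpenFileLogger` is a hypothetical name, and how much flash traffic the change saves depends on the file system, which the talk does not detail.

```python
import os
import tempfile


class OpenFileLogger:
    """Keep one log file open and flush after each entry, instead of open/write/close."""

    def __init__(self, path):
        self._file = open(path, "a", encoding="utf-8")

    def log(self, message):
        self._file.write(message + "\n")
        self._file.flush()   # push the entry out without closing the file

    def close(self):
        self._file.close()


# Usage: entries are readable after each flush, while the logger is still open.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "unit.log")
    logger = OpenFileLogger(path)
    logger.log("boot ok")
    logger.log("registry saved")
    with open(path, encoding="utf-8") as f:
        lines_seen_before_close = f.read().splitlines()
    logger.close()
```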

Ruggedized PDA Flash Corruption

Why was a consultant needed?

- Customer did not pay attention to theoretical limits of flash; it was a design oversight
- Engineers working on the project did not have a good set of flash tests that could catch the issue

Cryogenic Temperature Control

Cryogenic Temperature Control: Key Observables

- Temperature needed to be maintained at 4 K
  - Tolerance of +/- 10%
  - But was fluctuating +/- 2 K (200%)
- Engineers using room temperature to troubleshoot
  - Using heater and ice bucket to verify control algorithm

Cryogenic Temperature Control: Hypothesis

- Temperature behavior at room temperature is not the same as near absolute zero
  - Control algorithm needed to be based on theory of temperature as it approaches zero
- Heating/cooling cycles took tens of minutes
  - Very difficult to monitor behavior of temperature over extended period of time

Cryogenic Temperature Control: Investigation

- Created an emulator
  - A PC external to the controller took outputs of the embedded system, and returned an analog signal to represent temperature
- New understanding of temperature
  - To create the emulator, needed to think like the canister, and understand laws of heating and cooling
- Emulator allowed execution thousands of times faster than real-time
  - In 1 minute, had 60,000 data points
  - Previously, took 10 minutes to get 1 data point
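The emulator idea can be sketched in software: a plant model takes the controller's output each step and returns the temperature the controller will read next, with no need to wait for real heating and cooling. The model below is a toy first-order heat-leak model with made-up constants, not the cryogenic physics used in the actual project, and all names and parameters are hypothetical.

```python
def emulate_canister(controller, steps, dt=0.01, start_temp=4.5,
                     ambient=300.0, leak=0.001, cooling_gain=0.5):
    """Step a toy thermal model and return the temperature history in kelvin.

    controller(temp) returns a cooling drive between 0.0 and 1.0.
    """
    temp = start_temp
    history = []
    for _ in range(steps):
        drive = min(1.0, max(0.0, controller(temp)))  # clamp controller output
        # Heat leaks in from ambient; the cooler removes heat in proportion to drive.
        temp += dt * (leak * (ambient - temp) - cooling_gain * drive)
        temp = max(temp, 0.0)
        history.append(temp)
    return history


def bang_bang(temp, setpoint=4.0):
    """Hypothetical on/off controller used only to exercise the emulator."""
    return 1.0 if temp > setpoint else 0.0
```

Calling `emulate_canister(bang_bang, 60000)` produces 60,000 data points in well under a minute on a desktop machine, so a control algorithm can be tuned against many simulated cycles before it is tried on hardware.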

Cryogenic Temperature Control: Solutions

- Emulator created by a different engineer than the controller
  - Both emulator and controller need to agree in order to get correct results
  - When they don’t agree, both systems get debugged
- Identified that the reduced number of significant digits near zero resulted in inaccurate calculations
  - Needed to perform computations in micro-degrees instead of milli-degrees
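The loss of significant digits near zero is easy to see with integer arithmetic. The sketch below assumes the controller used fixed-point integers, which the talk does not state, and the 3/10000 scale factor is a hypothetical gain chosen for illustration.

```python
def scale_fixed_point(value, numerator, denominator):
    """Multiply an integer reading by numerator/denominator, truncating like fixed-point code."""
    return value * numerator // denominator


# 4 K expressed in two resolutions
temp_millikelvin = 4_000
temp_microkelvin = 4_000_000

# Hypothetical gain of 3/10000: the exact answer is 1.2 millikelvin.
coarse = scale_fixed_point(temp_millikelvin, 3, 10_000)   # result in millikelvin
fine = scale_fixed_point(temp_microkelvin, 3, 10_000)     # result in microkelvin
```

In milli-degrees the result truncates to 1, an error of about 17%, while in micro-degrees it is 1200, which is exact. At room temperature (about 300,000 milli-degrees) the same truncation would be far less significant, which is why the problem only appeared near zero.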

Cryogenic Temperature Control

Why was a consultant needed?

- Customer not using correct tool, namely an emulator
- Engineer failed to apply correct theory (heating/cooling laws) near absolute zero

Degrading Legacy Software: Key Observables

- Customer reported, “Software is degrading”
  - Control software for locomotives that had been working for two decades started crashing
  - Watchdog timer causing board resets
- Replacing main boards did not fix problem
  - Customer suspected possible issue with sensor data
- Adding debug caused system to fail worse

Degrading Legacy Software: Hypothesis

- Adding debug made system worse
  - System is overloaded
  - Debug breaks real-time performance
- System only starting to fail now
  - Perhaps a different path of execution was increasing utilization
  - Verify error handling

Degrading Legacy Software: Investigation

- Used logic analyzer method to troubleshoot
  - Allowed collecting data in real-time where traditional debug methods caused system to fail totally
  - Monitored inputs and outputs of functions
  - Found data output from sensors had significant noise and occasional bad samples
- Root cause: sensors are degrading
  - It was determined that it was the sensors that had started to degrade
  - The software did not have appropriate error handling code, and faulty calculations were causing processor exceptions

Degrading Legacy Software: Solutions

- Replacing sensors is a very costly solution
  - Changing a sensor fixed the issue. However, replacing all sensors on all locomotives was too costly
- Software workaround: add filters and error handling
  - Desire to add filtering to clean up the now-noisier data
  - Processor was already fully loaded; adding software filters to the data caused overloads and system failure
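A filter with basic error handling might look like the sketch below: out-of-range samples are rejected and counted rather than passed into calculations, and the remaining samples are smoothed with a short moving average. The class name, range limits, and window size are hypothetical; the talk does not say which filter was used.

```python
from collections import deque


class SensorFilter:
    """Reject out-of-range samples and smooth the rest with a moving average."""

    def __init__(self, low, high, window=5):
        self.low = low
        self.high = high
        self.samples = deque(maxlen=window)
        self.rejected = 0   # count of bad samples, usable for error notifications

    def update(self, raw):
        """Feed one raw sample; return the filtered value, or None if no good data yet."""
        if raw is None or not (self.low <= raw <= self.high):
            self.rejected += 1          # bad sample: keep it out of the calculations
        else:
            self.samples.append(raw)
        if not self.samples:
            return None
        return sum(self.samples) / len(self.samples)
```

The rejected-sample counter gives a simple basis for flagging a sensor that needs replacing, and the extra work per sample is exactly the kind of added load that has to be accounted for in the real-time analysis.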

Degrading Legacy Software: Solutions (cont’d)

- Reduce processor load
  - Performed real-time analysis: near 0% idle time
  - Found a 5-msec control loop using 60% of processor bandwidth
  - Why 5-msec? Customer: “Because it works”

Degrading Legacy Software: Solutions (cont’d)

- Review design decisions
  - How slow can the control loop be run?
  - Edge of customer comfort level was at about 50 msec
  - Decided to run at 25 msec instead
- Final solution
  - Revised real-time analysis, idle time up to 48%
  - Now using 3 msec every 25 msec, instead of 15 msec every 25 msec
  - Added filtering for sensors starting to fail
  - Added error notifications for sensors that needed replacing
  - Created enough ‘spare’ processing time to add traditional debug
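The idle-time figure follows directly from the utilization formula U = C/T, where C is execution time and T is the period. The short calculation below uses only the numbers quoted on these slides; the function name is hypothetical.

```python
def utilization(exec_ms, period_ms):
    """Fraction of the processor used by a periodic task."""
    return exec_ms / period_ms


before = utilization(3, 5)    # 3 msec every 5 msec
after = utilization(3, 25)    # 3 msec every 25 msec
freed = before - after        # processor time returned to the system
```

The loop drops from 60% to 12% of the processor, freeing 48%, which matches the revised idle time reported above.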

Degrading Legacy Software

Why was Consultant Needed?

- Wrong tool: customer did not have a good way to obtain debug in real-time. Using a logic analyzer resolved this
- No real-time analysis: once a real-time analysis was done, it was obvious which thread needed to be optimized
- Real-time systems theory: to reduce utilization, only two things can be done:
  1) Reduce execution time
  2) Increase period of execution
  - #2 is usually easier, so verify the possibility of that first

Summary

- Most hard problems are fundamentally simple
  - A common or known issue
  - A “silly” bug in software
  - One hardware signal is faulty
- Difficulty solving a problem comes from two causes:
  - Trying to fix the problem before understanding the root cause
  - Using the wrong tools to collect clues
- Anyone can become a consultant
  - If they use a systematic approach and the right tools