how we analyzed 1000 dumps in one day - dina goldshtein, brightsource - devopsdays tel aviv 2015
Post on 14-Feb-2017
How We Analyzed
1000 Dumps in One Day
DINA GOLDSHTEIN
EMBEDDED TEAM LEADER, BRIGHTSOURCE ENERGY
BLOGS.MICROSOFT.CO.IL/DINAZIL/
@DINAGOZIL
Agenda
What we do and why we need dumps
Manual analysis process
The holy grail: automatic dump analysis
Our automatic triage workflow
About Us
BrightSource Energy builds solar power plants
Power plants have control software
Control software crashes
Our Production Environment
The office (development) network is connected to the Internet
The production (power plant) network is isolated
There is a (very slow) one-way link from production to development
In the Beginning…
Mask all crashes by a nice error dialog and an “orderly” shut-down
Analyze errors using very extensive log files from all components
Alas, the last error in the log doesn’t always point to the actual culprit
Need to know exact exception, when it occurred and where!
Crash Dumps
A dump is a snapshot of a process’s memory: threads, heap, exceptions, locks, etc.
Various tools can open dump files and see what’s inside
How???
An executable can be compiled with debug information - the symbols
Symbol files (.pdb) contain information which allows debuggers to match addresses and other information in the file to names of DLLs, functions, variables, lines of code, etc.
Symbol Server
Symbols can be provided to the debugger explicitly
But they can also reside in a Symbol Server (stored by name and hash)
The debugger can download debugging symbols automatically for the right product version
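As a rough illustration of how a debugger is pointed at a symbol server: WinDbg and other DbgEng-based tools read a symbol path made of `srv*<local cache>*<symbol store>` entries, commonly supplied through the `_NT_SYMBOL_PATH` environment variable. The sketch below only builds such a path. The cache folder and the internal share name are made-up placeholders, and the talk does not say how BrightSource configured theirs.

```csharp
using System;
using System.Linq;

static class SymbolPath
{
    // Builds a DbgEng-style symbol path: one "srv*<cache>*<store>" entry per
    // symbol store, joined with ';'. Downloaded files are cached in localCache.
    public static string Build(string localCache, params string[] stores)
    {
        if (string.IsNullOrWhiteSpace(localCache))
            throw new ArgumentException("A local cache folder is required.", nameof(localCache));
        return string.Join(";", stores.Select(s => "srv*" + localCache + "*" + s));
    }
}

static class Program
{
    static void Main()
    {
        // Hypothetical locations: an internal symbol share plus Microsoft's public server.
        string path = SymbolPath.Build(
            @"C:\SymbolCache",
            @"\\buildserver\symbols",
            "https://msdl.microsoft.com/download/symbols");

        // Debuggers launched from this process inherit the variable.
        Environment.SetEnvironmentVariable("_NT_SYMBOL_PATH", path);
        Console.WriteLine(path);
    }
}
```

With a path like this, the debugger looks up each module's symbols in the listed stores and keeps a local copy, so repeated analysis of dumps from the same product version does not download the same files again.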
Production Crashes
We can’t attach a debugger, or do remote analysis of production errors
Windows can be configured to automatically save a dump when a process crashes
When crashes occur, dump files are generated and transmitted to a central location and then to the office network
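The slides do not say which mechanism was used to make Windows save dumps. One standard option is the Windows Error Reporting "LocalDumps" registry key, and the sketch below sets it from C#. The folder and the dump count are arbitrary example values, and the code has to run elevated on the production machine.

```csharp
using System;
using Microsoft.Win32;

static class LocalDumpsSetup
{
    const string KeyPath =
        @"SOFTWARE\Microsoft\Windows\Windows Error Reporting\LocalDumps";

    // Writes machine-wide WER settings so that crashing user-mode processes
    // leave a dump file in dumpFolder. Requires administrator rights.
    public static void Enable(string dumpFolder, int maxDumps)
    {
        // Use the 64-bit registry view explicitly so a 32-bit process does not
        // get redirected to WOW6432Node.
        using (RegistryKey hklm = RegistryKey.OpenBaseKey(
                   RegistryHive.LocalMachine, RegistryView.Registry64))
        using (RegistryKey key = hklm.CreateSubKey(KeyPath))
        {
            key.SetValue("DumpFolder", dumpFolder, RegistryValueKind.ExpandString);
            key.SetValue("DumpCount", maxDumps, RegistryValueKind.DWord);
            key.SetValue("DumpType", 2, RegistryValueKind.DWord); // 2 = full dump
        }
    }

    static void Main()
    {
        // Example values only; pick a drive with room for multi-gigabyte dumps.
        Enable(@"D:\CrashDumps", 10);
    }
}
```

Full dumps are the large option, which fits the 2-3 GB sizes mentioned later in the talk; setting `DumpType` to 1 would request a much smaller minidump that carries less heap information.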
Manual Dump Analysis
With high failure rates, we’re talking dozens of dumps per day from a single facility
Many errors are exact duplicates
Manual analysis means:
◦ Copy dump to my machine (it’s not uncommon for a dump to be 2-3 GB)
◦ Copy debugger support files and symbols (if no symbol server is present)
◦ Open dump in debugger (Visual Studio/WinDbg)
◦ Locate the exception and call stack
◦ Triage and open a bug for the relevant developer
◦ Probably around 10 minutes per dump…
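At roughly 10 minutes per dump, 1000 dumps would take about 167 hours of manual work, and much of that would be spent on exact duplicates. One common way to collapse duplicates, sketched below, is to group dumps by a signature made of the exception type and the top few stack frames. The `CrashInfo` type, the choice of three frames, and the sample data are illustrative assumptions, not something taken from the talk.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical record of what was extracted from one dump.
sealed class CrashInfo
{
    public string DumpFile { get; set; }
    public string ExceptionType { get; set; }
    public string[] Frames { get; set; }   // topmost frame first
}

static class CrashGrouping
{
    // Signature = exception type plus the top N frames of the call stack.
    public static string Signature(CrashInfo crash, int topFrames = 3)
    {
        return crash.ExceptionType + "|" + string.Join("|", crash.Frames.Take(topFrames));
    }

    // Dumps that share a signature are treated as duplicates of one bug.
    public static Dictionary<string, List<CrashInfo>> GroupDuplicates(
        IEnumerable<CrashInfo> crashes, int topFrames = 3)
    {
        return crashes
            .GroupBy(c => Signature(c, topFrames))
            .ToDictionary(g => g.Key, g => g.ToList());
    }
}

static class Program
{
    static void Main()
    {
        // Made-up sample data.
        var crashes = new[]
        {
            new CrashInfo { DumpFile = "a.dmp", ExceptionType = "System.NullReferenceException",
                            Frames = new[] { "Comms.Link.Send", "Comms.Link.Flush", "Control.Loop.Run" } },
            new CrashInfo { DumpFile = "b.dmp", ExceptionType = "System.NullReferenceException",
                            Frames = new[] { "Comms.Link.Send", "Comms.Link.Flush", "Control.Loop.Run" } },
            new CrashInfo { DumpFile = "c.dmp", ExceptionType = "System.TimeoutException",
                            Frames = new[] { "Db.Client.Query", "Control.Loop.Run" } },
        };

        foreach (var group in CrashGrouping.GroupDuplicates(crashes))
            Console.WriteLine(group.Value.Count + " dump(s): " + group.Key);
    }
}
```

Only one dump per group then needs a human look. Fewer frames in the signature merge more aggressively, more frames split groups more finely.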
Automatic Dump Analysis
ClrMD is a NuGet package which provides a debugger API for dumps and live processes
◦ Works with both native and managed code
The core of our automatic solution uses ClrMD for automatic dump analysis and triage:
◦ Exception information
◦ Call stack
◦ Likely faulting component
Recently became open source on GitHub
Some Code…
    // Open the dump and, if it contains a CLR, create a runtime for managed analysis
    target = DataTarget.LoadCrashDump(dumpPath);
    if (target.ClrVersions.Count > 0) {
        ClrInfo dacVersion = target.ClrVersions[0];
        string dacLocation = dacVersion.TryDownloadDac();
        runtime = target.CreateRuntime(dacLocation);
    }

    // Ask the debugger engine for the last event (the crash) and the thread it occurred on
    var dc = (IDebugControl)target.DebuggerInterface;
    dc.GetLastEventInformation(out eventType, out processId, out threadIndex,
        extraInformation, extraInformationSize, out extraInformationUsed,
        description, descriptionSize, out descriptionUsed);

    // Translate the engine thread index into an OS thread ID
    var dso = (IDebugSystemObjects)target.DebuggerInterface;
    var sysIds = new uint[count];
    dso.GetThreadIdsByIndex(threadIndex, count, null, sysIds);

    // If the faulting thread is managed, take its current CLR exception
    if (IsThreadManaged(sysIds[0])) {
        var td = runtime.Threads.First(t => t.OSThreadId == sysIds[0]);
        clrException = td.CurrentException;
    }
Our Dump Analysis Workflow
At the end of a shift, operators copy dumps to a network share in the office network
A script goes over the dumps one by one and uses ClrMD to find the root cause of the error
According to a configuration file, the faulting module’s owner is alerted and a ticket is opened in Redmine
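The routing step above can be sketched as a lookup from module name to owner. The config format (`Module = owner` lines), the module names, and the fallback owner below are all assumptions made for illustration; the talk does not describe the actual file. The idea is to walk the call stack from the top and pick the first module that has a configured owner, which skips framework modules nobody on the team owns.

```csharp
using System;
using System.Collections.Generic;

static class OwnerRouting
{
    // Parses "ModuleName = owner" lines; blank lines and '#' comments are ignored.
    public static Dictionary<string, string> ParseOwners(IEnumerable<string> lines)
    {
        var owners = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
        foreach (string raw in lines)
        {
            string line = raw.Trim();
            if (line.Length == 0 || line[0] == '#') continue;
            string[] parts = line.Split(new[] { '=' }, 2);
            if (parts.Length == 2) owners[parts[0].Trim()] = parts[1].Trim();
        }
        return owners;
    }

    // Returns the owner of the first (topmost) module that has a configured
    // owner, or the fallback when no module on the stack is known.
    public static string FindOwner(IEnumerable<string> frameModules,
                                   IDictionary<string, string> owners,
                                   string fallback)
    {
        foreach (string module in frameModules)
        {
            string owner;
            if (owners.TryGetValue(module, out owner)) return owner;
        }
        return fallback;
    }
}

static class Program
{
    static void Main()
    {
        // Hypothetical config contents and stack; real input would come from
        // a file on disk and from the analyzed dump.
        var owners = OwnerRouting.ParseOwners(new[]
        {
            "# module = owner",
            "FieldComms = alice",
            "PlantControl = bob",
        });
        var stackModules = new[] { "mscorlib", "FieldComms", "PlantControl" };

        Console.WriteLine(OwnerRouting.FindOwner(stackModules, owners, "triage-team"));
    }
}
```

In this sample the stack starts in `mscorlib`, which has no owner, so the crash is routed to the owner of `FieldComms`. Opening the Redmine ticket for that owner is left out of the sketch.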
From Hours to Seconds
Manual, tedious, error-prone dump analysis by red-eyed developers…
…Automatic, happy, untiring ninja script
DEMO
ANALYZE 74 DUMPS IN A FEW MINUTES
Summary
What we do and why we need dumps
Manual analysis process
The holy grail: automatic dump analysis
Our automatic triage workflow
Resources:
◦ The slides: http://tinyurl.com/dumpstlv
◦ ClrMD on GitHub
◦ DumpAnalyzer on GitHub
◦ msos on GitHub
Questions?
Thank You!
"Retouched Kitty" by Ozan Kilic is licensed under Creative Commons Attribution 2.0