Failure Data Collection and Analysis
Archana Ganapathi, Peter Bodik, Wei Xu
Motivation (1)
My machine crashes…
Since 3/1/04:
  3 system crashes
  18 application errors
  96 application hangs
Who cares?
  I do!
  People who share similar experiences
  In general, customer uproar
Motivation (2)
An Internet service has failures…
Who cares?
  Internet service users
  Internet service system administrators
  Anyone affected by the Internet service’s loss of revenue
Failure breakdown at Online Service (total: 61 user-visible failures in 12 months)
  Operator  35%
  Software  28%
  Hardware  26%
  Unknown   11%
Motivation (3)
ROC/RADS needs real failure/attack information to
  drive benchmarks
  evaluate our prototypes
  help us select what we work on
Data Sources
1000s of individual machines
  Cory/Soda Hall, BOINC
Large clusters at real Internet services
  Internet services
Distributed applications on 100s of machines
  PlanetLab
Individual Machines
Data Collection
Collect minidumps that contain…
  The Stop message/parameters/data
  Loaded drivers
  Processor context for processor that stopped
  Process info/kernel context for process/thread stopped
  The kernel-mode call stack for thread that stopped
Frequency of collection
  synchronized with application and system crashes on computers
Analysis results
What happened that is immediately responsible for the crash
  exact error code
  brief description, primarily for debugging
Bucketing info, e.g.: "driver fault"
Details for debugging, e.g. stack contents
Use Microsoft’s publicly available analysis tools
  Caveat: significant variability in results between internal and public version of tool!
How we collect minidumps (1)
Corporate Error Reporting (CER)
  http://www.microsoft.com/resources/satech/cer/
  Manage error reports/msgs generated by WER and other programs
  Configure clients to redirect reports to CER shared directory
Sample Statistics (25 nodes, 5 days)
Crashed Program   Version           Problem
BESConsole.exe    4.1.3.33          hungapp
CDCopier.exe      5.3.4.21          hungapp
CreateCD50.exe    5.3.4.21          hungapp
CreateCD50.exe    5.3.4.21          hungapp
explorer.exe      6.0.2800.1106     shlwapi.dll
firefox.exe       0.8.0.0           hungapp
IAMAPP.EXE        5.1.1.309         hungapp
iexplore.exe      6.0.2800.1106     hungapp
iexplore.exe      6.0.2800.1106     mshtml.dll
matlab.exe        1.0.0.1           hungapp
mozilla.exe       1.6.20040.11308   ntdll.dll
msmsgs.exe        4.7.0.2009        msmsgs.exe
OUTLOOK.EXE       10.0.4510.0       hungapp
thunde~1.exe      0.6.0.0           xpc3250.dll
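The reports above fall into two buckets: application hangs ("hungapp") and crashes attributed to a faulting module. A minimal Python sketch of that tally follows. The rows are copied from the table; the tuple layout and the name `tally_by_problem` are illustrative assumptions and not part of the CER tooling.

```python
from collections import Counter

# (program, version, problem) rows copied from the table above.
REPORTS = [
    ("BESConsole.exe", "4.1.3.33", "hungapp"),
    ("CDCopier.exe", "5.3.4.21", "hungapp"),
    ("CreateCD50.exe", "5.3.4.21", "hungapp"),
    ("CreateCD50.exe", "5.3.4.21", "hungapp"),
    ("explorer.exe", "6.0.2800.1106", "shlwapi.dll"),
    ("firefox.exe", "0.8.0.0", "hungapp"),
    ("IAMAPP.EXE", "5.1.1.309", "hungapp"),
    ("iexplore.exe", "6.0.2800.1106", "hungapp"),
    ("iexplore.exe", "6.0.2800.1106", "mshtml.dll"),
    ("matlab.exe", "1.0.0.1", "hungapp"),
    ("mozilla.exe", "1.6.20040.11308", "ntdll.dll"),
    ("msmsgs.exe", "4.7.0.2009", "msmsgs.exe"),
    ("OUTLOOK.EXE", "10.0.4510.0", "hungapp"),
    ("thunde~1.exe", "0.6.0.0", "xpc3250.dll"),
]


def tally_by_problem(reports):
    """Count reports per problem bucket (hang or faulting module name)."""
    return Counter(problem for _, _, problem in reports)


counts = tally_by_problem(REPORTS)
hangs = counts["hungapp"]
module_faults = sum(counts.values()) - hangs
print(f"{hangs} hangs, {module_faults} crashes attributed to a module")
```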
How we collect minidumps (2)
BOINC
  For SETI@home-esque apps that pool resources
  Provides client API to send/receive data to/from BOINC server
  Write tools to read info in minidump directory and send to us
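A tool that reads the minidump directory could start by listing the dump files it finds. The Python sketch below does only that. It does not use the BOINC client API, and the upload step is left out; the function name, the `.dmp` extension filter, and the demo file name are assumptions made for illustration.

```python
import tempfile
from pathlib import Path


def list_minidumps(dump_dir):
    """Return (file name, size in bytes, modification time) per .dmp file."""
    entries = []
    for path in sorted(Path(dump_dir).glob("*.dmp")):
        info = path.stat()
        entries.append((path.name, info.st_size, info.st_mtime))
    return entries


# Demo against a temporary directory standing in for the real dump folder.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "Mini030104-01.dmp").write_bytes(b"\x00" * 64)  # fake dump
    (Path(tmp) / "notes.txt").write_text("not a dump")  # should be ignored
    found = list_minidumps(tmp)

for name, size, _mtime in found:
    print(name, size)
```

Each returned entry is what a sender would need to decide which dumps are new since the last upload.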
Sample Statistics (50 system crashes)
Thread stuck in device driver 12
Page Fault in Non-Paged Area 10
System Thread Exception Not Handled 6
Unexpected Kernel Mode Trap 6
Kernel Mode Exception Not Handled 5
IRQL Not Less or Equal 3
Driver IRQL Not Less or Equal 3
NTFS File System 2
Bad Pool Caller 2
PFN List Corrupt 1
Sample Statistics (50 system crashes)
watchdog.sys       7
ar5211.sys         6
ibmpmdrv.sys       6
ati3duag.dll       5
SYMEVENT.SYS       3
ipsecw2k.sys       3
memory_corruption  3
ialmdev5.DLL       2
PSCRIPT4.DLL       2
ntoskrnl.exe       2
CLASSPNP.SYS       2
win32k.sys         2
SynTP.sys          1
TDI.SYS            1
ino_fltr.sys       1
ks.sys             1
drvnddm.sys        1
ntkrnlmp.exe       1
Pool_Corruption    1
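One way to read these counts is to ask how concentrated the crashes are in a few modules. The Python sketch below computes the share of crashes covered by the top k buckets. The counts are copied from the slide; the helper name `top_share` is an illustrative assumption.

```python
# Crash counts per blamed module or bucket, copied from the slide above.
MODULE_CRASHES = {
    "watchdog.sys": 7, "ar5211.sys": 6, "ibmpmdrv.sys": 6,
    "ati3duag.dll": 5, "SYMEVENT.SYS": 3, "ipsecw2k.sys": 3,
    "memory_corruption": 3, "ialmdev5.DLL": 2, "PSCRIPT4.DLL": 2,
    "ntoskrnl.exe": 2, "CLASSPNP.SYS": 2, "win32k.sys": 2,
    "SynTP.sys": 1, "TDI.SYS": 1, "ino_fltr.sys": 1, "ks.sys": 1,
    "drvnddm.sys": 1, "ntkrnlmp.exe": 1, "Pool_Corruption": 1,
}


def top_share(counts, k):
    """Fraction of all crashes covered by the k largest buckets."""
    total = sum(counts.values())
    largest = sorted(counts.values(), reverse=True)[:k]
    return sum(largest) / total


total = sum(MODULE_CRASHES.values())
print(f"total crashes: {total}")
print(f"share of top 4 buckets: {top_share(MODULE_CRASHES, 4):.0%}")
```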
Metrics (Windows & Linux)
Availability: system uptime, % time BOINC running
CPU(s): # processes, processor queue length, % non-idle
Memory: available physical memory, free swap space
Disk(s): free space
Network(s): IP address, packets & bytes sent & received/sec, bandwidth to/from SETI@home server, first-hop bandwidth*, network coordinates*
Static: CPU type, #, and benchmarks; total memory; OS type
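A few of the metrics listed above can be sampled portably with the Python standard library alone. The sketch below covers only disk free space, CPU count, and OS type. Uptime, memory, and network metrics need platform-specific sources and are not shown. The function name and the dictionary keys are assumptions made for illustration, not the collector described in the slides.

```python
import os
import platform
import shutil
import time


def sample_metrics(path="."):
    """Take one sample of the metrics the standard library exposes."""
    usage = shutil.disk_usage(path)  # total, used, free (in bytes)
    return {
        "timestamp": time.time(),
        "os_type": platform.system(),   # e.g. "Windows" or "Linux"
        "cpu_count": os.cpu_count(),    # may be None if undetermined
        "disk_total_bytes": usage.total,
        "disk_free_bytes": usage.free,
    }


sample = sample_metrics()
for key, value in sample.items():
    print(f"{key}: {value}")
```

Calling `sample_metrics` on a timer would answer the "how often" question empirically: the sampling interval becomes a single parameter to vary.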
Questions
Other metrics?
Frequency with which to measure them?
What research questions can we answer with this data set?
  original goal: workload to evaluate our node discovery service
  evaluate effectiveness of network coordinates
  evaluate potential to run more than just “embarrassingly parallel” apps on this type of infrastructure, depending on machines’
    uptime
    network connectivity
    available disk space
  distributed analysis?
  security uses?
Internet Services
Data characteristics
Real companies
Multitude of users
Voluminous data (several terabytes)
Systems are complex
  Treat as black box
  Use statistical learning theory (SLT) algorithms for analysis
    More data => better models
Analysis Results
Study event logs
  Not necessarily failures
  Can derive models of good & bad behavior
Models with varying granularity
  Use different algorithms
  Vary boundary parameters
For more details see poster: “Towards a General Approach for Event Log Analysis”
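The slides do not say which algorithms are used, so the following Python sketch is only a stand-in for the idea of a model with a tunable boundary. It counts events per time window and flags windows whose count is more than `threshold` standard deviations from the mean. The function names, the window size, and the synthetic timestamps are all assumptions.

```python
from collections import Counter
from statistics import mean, pstdev


def window_counts(timestamps, window):
    """Count events in consecutive fixed-size windows (window in seconds)."""
    buckets = Counter(int(t // window) for t in timestamps)
    first, last = min(buckets), max(buckets)
    return [buckets.get(i, 0) for i in range(first, last + 1)]


def flag_windows(counts, threshold):
    """Indices of windows more than `threshold` std devs from the mean."""
    mu, sigma = mean(counts), pstdev(counts)
    if sigma == 0:
        return []
    return [i for i, c in enumerate(counts) if abs(c - mu) > threshold * sigma]


# Synthetic log: ten 60-second windows, 5 events each, a burst in window 7.
timestamps = [w * 60 + j for w in range(10) for j in range(50 if w == 7 else 5)]
counts = window_counts(timestamps, 60)

for threshold in (2, 4):
    print(threshold, flag_windows(counts, threshold))
```

Raising the threshold from 2 to 4 moves the boundary far enough that the burst is no longer flagged, which is the kind of sensitivity a boundary parameter controls.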
Distributed Apps
PlanetLab
“An open platform for developing, deploying, and accessing planetary-scale services”
392 nodes at 164 sites around the world
Per-site system administration
Applications: OceanStore, PIER
Why?
Platform for injecting faults and testing our algorithms
Applications on RADS-like environment
Research platform
  More accessible
  University-developed apps most likely to be tested on PlanetLab
Applications
1) OceanStore
  Global persistent data store
  In the process of running prototype on PlanetLab
  Good source of failure data
2) PIER
  Distributed query processor
  Currently running on PlanetLab
  Good source of failure data + analysis engine
What do we do with these apps?
Instrument applications to collect any type of information
  Choice of granularity
Open source - no longer black box
  Can modify it as much as necessary
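Choice of granularity can be illustrated with log levels: the same instrumented code emits coarse or fine detail depending on configuration. The Python sketch below uses the standard `logging` module. The `handle_request` function and the logger names are hypothetical and are not taken from OceanStore or PIER.

```python
import io
import logging


def make_logger(name, level, stream):
    """Build a logger that writes records at `level` or above to `stream`."""
    logger = logging.getLogger(name)
    logger.setLevel(level)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))
    logger.addHandler(handler)
    logger.propagate = False  # keep the demo output out of the root logger
    return logger


def handle_request(logger, request_id, ok):
    """Hypothetical instrumented operation: fine-grained and coarse events."""
    logger.debug("request %s: start", request_id)
    if not ok:
        logger.error("request %s: failed", request_id)
    logger.debug("request %s: end", request_id)


coarse, fine = io.StringIO(), io.StringIO()
handle_request(make_logger("app.coarse", logging.ERROR, coarse), 1, ok=False)
handle_request(make_logger("app.fine", logging.DEBUG, fine), 1, ok=False)

print(coarse.getvalue(), end="")
print(fine.getvalue(), end="")
```

With the coarse logger only the failure is recorded; with the fine logger the start and end events surround it.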
Questions
What other applications can we use?
What should we measure and model?
What information is useful for industry?
Do you have any failure/attack data you are willing to share with us?