Failure Data Collection and Analysis

Archana Ganapathi, Peter Bodik, Wei Xu

Motivation (1)

My machine crashes…

Since 3/1/04…
  3 system crashes
  18 application errors
  96 application hangs

Who cares?
  I do!
  People who share similar experiences
  In general, customer uproar

Motivation (2)

An Internet service has failures…

Who cares?
  Internet service users
  Internet service system administrators
  Anyone affected by the IS’s loss of revenue

[Chart: causes of user-visible failures at Online Service. Total: 61 user-visible failures in 12 months]
  Operator 35%
  Software 28%
  Hardware 26%
  Unknown 11%

Motivation (3)

ROC/RADS needs real failure/attack information to
  drive benchmarks
  evaluate our prototypes
  help us select which work to attack

Data Sources

1000s of individual machines
  Cory/Soda Hall, BOINC
Large clusters at real Internet services
  Internet services
Distributed applications on 100s of machines
  PlanetLab

Individual Machines

Data Collection

Collect minidumps that contain…
  The Stop message/parameters/data
  Loaded drivers
  Processor context for processor that stopped
  Process info/kernel context for process/thread stopped
  The Kernel-mode call stack for thread that stopped

Frequency of collection
  synchronized with application and system crashes on computers

Analysis results

What happened that is immediately responsible for the crash
  exact error code
  brief description, primarily for debugging
Bucketing info, e.g.: "driver fault"
Details for debugging, e.g. stack contents
Use Microsoft’s publicly available analysis tools
  Caveat: significant variability in results between internal and public version of tool!

How we collect minidumps (1)

Corporate Error Reporting
  http://www.microsoft.com/resources/satech/cer/
  Manage error reports/msgs generated by WER and other programs
  Configure clients to redirect reports to CER shared directory

Sample Statistics (25 nodes, 5 days)

Crashed Program   Version           Problem
BESConsole.exe    4.1.3.33          hungapp
CDCopier.exe      5.3.4.21          hungapp
CreateCD50.exe    5.3.4.21          hungapp
CreateCD50.exe    5.3.4.21          hungapp
explorer.exe      6.0.2800.1106     shlwapi.dll
firefox.exe       0.8.0.0           hungapp
IAMAPP.EXE        5.1.1.309         hungapp
iexplore.exe      6.0.2800.1106     hungapp
iexplore.exe      6.0.2800.1106     mshtml.dll
matlab.exe        1.0.0.1           hungapp
mozilla.exe       1.6.20040.11308   ntdll.dll
msmsgs.exe        4.7.0.2009        msmsgs.exe
OUTLOOK.EXE       10.0.4510.0       hungapp
thunde~1.exe      0.6.0.0           xpc3250.dll

How we collect minidumps (2)

BOINC
  For SETI@home-esque apps that pool resources
  Provides client API to send/receive data to/from BOINC server
  Write tools to read info in minidump directory and send to us
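A tool that reads the minidump directory could start by listing the dump files and their basic metadata. The following is a minimal sketch under stated assumptions, not the authors' tool: the function name `list_minidumps`, the record fields, and the sample file name are hypothetical, and the step that sends data to the BOINC server is left out because the slides do not describe it.

```python
import tempfile
from pathlib import Path


def list_minidumps(dump_dir):
    """Return one record per *.dmp file found directly in dump_dir."""
    records = []
    for path in sorted(Path(dump_dir).glob("*.dmp")):
        info = path.stat()
        records.append({
            "name": path.name,
            "size_bytes": info.st_size,
            "modified": info.st_mtime,  # seconds since the epoch
        })
    return records


def demo():
    """Run the listing on a throwaway directory so the sketch works anywhere."""
    with tempfile.TemporaryDirectory() as tmp:
        # Hypothetical file names; the content is placeholder bytes, not a real dump.
        (Path(tmp) / "Mini030104-01.dmp").write_bytes(b"\0" * 64)
        (Path(tmp) / "notes.txt").write_text("not a dump")
        return list_minidumps(tmp)


if __name__ == "__main__":
    for record in demo():
        print(record)
```

On a real Windows client the directory would be wherever the system is configured to write minidumps; that path is passed in by the caller rather than hard-coded here.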

Sample Statistics (50 system crashes)

Thread stuck in device driver          12
Page Fault in Non-Paged Area           10
System Thread Exception Not Handled     6
Unexpected Kernel Mode Trap             6
Kernel Mode Exception Not Handled       5
IRQL Not Less or Equal                  3
Driver IRQL Not Less or Equal           3
NTFS File System                        2
Bad Pool Caller                         2
PFN List Corrupt                        1
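To read the counts above as proportions, one can compute each category's share of the 50 crashes and the share covered by the most frequent categories. This is a small sketch using only the numbers listed above; the helper names `shares` and `top_k_share` are illustrative.

```python
# Counts copied from the sample statistics above (50 system crashes in total).
STOP_COUNTS = {
    "Thread stuck in device driver": 12,
    "Page Fault in Non-Paged Area": 10,
    "System Thread Exception Not Handled": 6,
    "Unexpected Kernel Mode Trap": 6,
    "Kernel Mode Exception Not Handled": 5,
    "IRQL Not Less or Equal": 3,
    "Driver IRQL Not Less or Equal": 3,
    "NTFS File System": 2,
    "Bad Pool Caller": 2,
    "PFN List Corrupt": 1,
}


def shares(counts):
    """Percentage of all crashes that fall in each category."""
    total = sum(counts.values())
    return {name: 100.0 * n / total for name, n in counts.items()}


def top_k_share(counts, k):
    """Percentage of all crashes covered by the k most frequent categories."""
    total = sum(counts.values())
    top = sorted(counts.values(), reverse=True)[:k]
    return 100.0 * sum(top) / total


if __name__ == "__main__":
    for name, pct in shares(STOP_COUNTS).items():
        print(f"{name:40s}{pct:5.1f}%")
    print("Top 2 categories:", top_k_share(STOP_COUNTS, 2), "%")
```

With these counts, the two most frequent categories account for 22 of the 50 crashes.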

Sample Statistics (50 system crashes)

watchdog.sys        7
ar5211.sys          6
ibmpmdrv.sys        6
ati3duag.dll        5
SYMEVENT.SYS        3
ipsecw2k.sys        3
memory_corruption   3
ialmdev5.DLL        2
PSCRIPT4.DLL        2
ntoskrnl.exe        2
CLASSPNP.SYS        2
win32k.sys          2
SynTP.sys           1
TDI.SYS             1
ino_fltr.sys        1
ks.sys              1
drvnddm.sys         1
ntkrnlmp.exe        1
Pool_Corruption     1

Metrics (Windows & Linux)

Availability   system uptime, % time BOINC running
CPU(s)         # processes, processor queue length, % non-idle
Memory         available physical memory, free swap space
Disk(s)        free space
Network(s)     IP address, packets & bytes sent & received/sec, bandwidth to/from SETI@home server, first-hop bandwidth*, network coordinates*
Static         CPU type, #, and benchmarks; total memory; OS type
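A few of these metrics can be sampled with the Python standard library alone, on both Windows and Linux. The sketch below is an assumption about how a collector might look, not the project's actual client: the function name `collect_snapshot` and the field names are hypothetical, and metrics such as uptime, processor queue length, and network counters are omitted because they need platform-specific interfaces.

```python
import os
import platform
import shutil
import time


def collect_snapshot(path="."):
    """Sample a small, portable subset of the metrics listed above."""
    usage = shutil.disk_usage(path)  # total, used, free in bytes
    return {
        "timestamp": time.time(),         # when the sample was taken
        "os_type": platform.system(),     # static: e.g. "Windows" or "Linux"
        "cpu_count": os.cpu_count(),      # static: number of CPUs (may be None)
        "disk_free_bytes": usage.free,    # disk: free space
        "disk_total_bytes": usage.total,
    }


if __name__ == "__main__":
    print(collect_snapshot())
```

How often to call such a function is exactly the open question of measurement frequency; the sketch takes a single sample and leaves scheduling to the caller.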

Questions

Other metrics?
Frequency with which to measure them?
What research questions can we answer with this data set?
  original goal: workload to evaluate our node discovery service
  evaluate effectiveness of network coordinates
  evaluate potential to run more than just “embarrassingly parallel” apps on this type of infrastructure, depending on machines’
    uptime
    network connectivity
    available disk space
  distributed analysis?
  security uses?

Internet Services

Data characteristics

Real companies
Multitude of users
Voluminous data (several terabytes)
Systems are complex
  Treat as black box
Use SLT algorithms for analysis
  More data => better models

Analysis Results

Study event logs
  Not necessarily failures
  Can derive models of good & bad behavior
Models with varying granularity
  Use different algorithms
  Vary boundary parameters
For more details see poster: “Towards a General Approach for Event Log Analysis”
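As a toy illustration of "varying granularity" and "boundary parameters", the sketch below counts log events per time window and flags windows whose count lies above the mean plus k standard deviations. This simple threshold rule is a stand-in chosen for illustration, not the SLT algorithms used in the project; the function names and the synthetic timestamps are hypothetical.

```python
from collections import Counter
from statistics import mean, pstdev


def counts_per_window(timestamps, window_seconds):
    """Count events in consecutive fixed-size windows (the granularity)."""
    buckets = Counter(int(t // window_seconds) for t in timestamps)
    first, last = min(buckets), max(buckets)
    return [buckets.get(i, 0) for i in range(first, last + 1)]


def unusual_windows(counts, k):
    """Indices of windows whose count exceeds mean + k * std (the boundary)."""
    limit = mean(counts) + k * pstdev(counts)
    return [i for i, c in enumerate(counts) if c > limit]


def synthetic_timestamps():
    """Made-up data: one event every 10 s for 10 minutes, plus a short burst."""
    regular = [10 * i for i in range(60)]
    burst = [300 + 0.4 * i for i in range(20)]
    return regular + burst


if __name__ == "__main__":
    events = synthetic_timestamps()
    for window in (60, 600):          # vary the granularity
        counts = counts_per_window(events, window)
        for k in (2, 4):              # vary the boundary parameter
            print(window, k, unusual_windows(counts, k))
```

With 60-second windows the burst stands out at k = 2 but not at k = 4, and with a single 600-second window it disappears entirely, which is the kind of sensitivity that motivates trying several granularities and boundaries.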

Distributed Apps

PlanetLab

“An open platform for developing, deploying, and accessing planetary-scale services”
392 nodes at 164 sites around the world
Per-site system administration
Applications: OceanStore, PIER

Why?

Platform for injecting faults and testing our algorithms
Applications on RADS-like environment
Research platform
  More accessible
  University-developed apps most likely to be tested on PlanetLab

Applications

1) OceanStore
  Global persistent data store
  In the process of running prototype on PlanetLab
  Good source of failure data
2) PIER
  Distributed query processor
  Currently running on PlanetLab
  Good source of failure data + analysis engine

What do we do with these apps?

Instrument applications to collect any type of information
  Choice of granularity
Open source - no longer black box
  Can modify it as much as necessary

Questions

What other applications can we use?
What should we measure and model?
What information is useful for industry?
Do you have any failure/attack data you are willing to share with us?
