Failure Data Collection and Analysis Archana Ganapathi Peter Bodik Wei Xu

Post on 23-Dec-2015

Page 1: Failure Data Collection and Analysis Archana Ganapathi Peter Bodik Wei Xu

Failure Data Collection and Analysis

Archana Ganapathi
Peter Bodik
Wei Xu

Page 2

Motivation (1)

My machine crashes…
  Since 3/1/04…
    3 system crashes
    18 application errors
    96 application hangs

Who cares?
  I do!
  People who share similar experiences
  In general, customer uproar

Page 3

Motivation (2)

An Internet service has failures…

Who cares?
  Internet service users
  Internet service system administrators
  Anyone affected by the Internet service's loss of revenue

Failure causes (total: 61 user-visible failures in 12 months at Online Service)
  Hardware   26%
  Software   28%
  Unknown    11%
  Operator   35%

Page 4

Motivation (3)

ROC/RADS needs real failure/attack information to
  drive benchmarks
  evaluate our prototypes
  help us select which problems to work on

Page 5

Data Sources

1000s of individual machines
  Cory/Soda Hall, BOINC
Large clusters at real Internet services
  Internet services
Distributed applications on 100s of machines
  PlanetLab

Page 6

Individual Machines

Page 7

Data Collection

Collect minidumps that contain…
  The Stop message/parameters/data
  Loaded drivers
  Processor context for the processor that stopped
  Process info/kernel context for the process/thread that stopped
  The kernel-mode call stack for the thread that stopped
Frequency of collection
  Synchronized with application and system crashes on computers

Page 8

Analysis results

What happened that is immediately responsible for the crash
  Exact error code
  Brief description, primarily for debugging
Bucketing info, e.g. "driver fault"
Details for debugging, e.g. stack contents
Use Microsoft's publicly available analysis tools
  Caveat: significant variability in results between the internal and public versions of the tool!

Page 9

How we collect minidumps (1)

Corporate Error Reporting
  http://www.microsoft.com/resources/satech/cer/
  Manage error reports/msgs generated by WER and other programs
  Configure clients to redirect reports to CER shared directory

Page 10

Sample Statistics (25 nodes, 5 days)

Page 11

Sample Statistics (25 nodes, 5 days)

Crashed Program   Version           Problem
BESConsole.exe    4.1.3.33          hungapp
CDCopier.exe      5.3.4.21          hungapp
CreateCD50.exe    5.3.4.21          hungapp
CreateCD50.exe    5.3.4.21          hungapp
explorer.exe      6.0.2800.1106     shlwapi.dll
firefox.exe       0.8.0.0           hungapp
IAMAPP.EXE        5.1.1.309         hungapp
iexplore.exe      6.0.2800.1106     hungapp
iexplore.exe      6.0.2800.1106     mshtml.dll
matlab.exe        1.0.0.1           hungapp
mozilla.exe       1.6.20040.11308   ntdll.dll
msmsgs.exe        4.7.0.2009        msmsgs.exe
OUTLOOK.EXE       10.0.4510.0       hungapp
thunde~1.exe      0.6.0.0           xpc3250.dll
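The sample rows above can be summarized by problem label. The following is a minimal Python sketch, not the authors' tooling: the rows are transcribed from the slide, and the names REPORTS and tally_by_problem are hypothetical. It assumes the "Problem" column holds either "hungapp" (presumably an application hang) or the name of the faulting module.

```python
from collections import Counter

# (program, version, problem) rows transcribed from the slide.
REPORTS = [
    ("BESConsole.exe", "4.1.3.33", "hungapp"),
    ("CDCopier.exe", "5.3.4.21", "hungapp"),
    ("CreateCD50.exe", "5.3.4.21", "hungapp"),
    ("CreateCD50.exe", "5.3.4.21", "hungapp"),
    ("explorer.exe", "6.0.2800.1106", "shlwapi.dll"),
    ("firefox.exe", "0.8.0.0", "hungapp"),
    ("IAMAPP.EXE", "5.1.1.309", "hungapp"),
    ("iexplore.exe", "6.0.2800.1106", "hungapp"),
    ("iexplore.exe", "6.0.2800.1106", "mshtml.dll"),
    ("matlab.exe", "1.0.0.1", "hungapp"),
    ("mozilla.exe", "1.6.20040.11308", "ntdll.dll"),
    ("msmsgs.exe", "4.7.0.2009", "msmsgs.exe"),
    ("OUTLOOK.EXE", "10.0.4510.0", "hungapp"),
    ("thunde~1.exe", "0.6.0.0", "xpc3250.dll"),
]


def tally_by_problem(reports):
    """Count error reports per problem label (hang or faulting module)."""
    return Counter(problem for _program, _version, problem in reports)


def tally_by_program(reports):
    """Count error reports per crashed program, ignoring version."""
    return Counter(program.lower() for program, _version, _problem in reports)
```

Grouping by problem label rather than by program is one possible bucketing; the slides do not say which grouping the authors used.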

Page 12

How we collect minidumps (2)

BOINC
  For SETI@home-esque apps that pool resources
  Provides client API to send/receive data to/from BOINC server
  Write tools to read info in minidump directory and send to us
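The "read info in minidump directory" step could look roughly like the sketch below. It is an illustration under stated assumptions, not the authors' tool: it assumes dumps are files ending in .dmp in a single directory passed in by the caller, the function names are hypothetical, and the upload itself is left out because the slides do not describe the BOINC calls involved.

```python
from pathlib import Path


def list_minidumps(dump_dir):
    """Return (name, size_bytes, mtime) for each *.dmp file, oldest first."""
    entries = []
    for path in Path(dump_dir).glob("*.dmp"):
        stat = path.stat()
        entries.append((path.name, stat.st_size, stat.st_mtime))
    return sorted(entries, key=lambda entry: entry[2])


def newer_than(entries, last_sent_mtime):
    """Keep only dumps modified after the last one already sent."""
    return [entry for entry in entries if entry[2] > last_sent_mtime]
```

Tracking the modification time of the last dump sent is one simple way to avoid resending files; a real tool would likely also need to handle dumps that are still being written.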

Page 13

Sample Statistics (50 system crashes)

Stop error                            Count
Thread stuck in device driver         12
Page Fault in Non-Paged Area          10
System Thread Exception Not Handled   6
Unexpected Kernel Mode Trap           6
Kernel Mode Exception Not Handled     5
IRQL Not Less or Equal                3
Driver IRQL Not Less or Equal         3
NTFS File System                      2
Bad Pool Caller                       2
PFN List Corrupt                      1
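To compare categories it can help to turn the counts above into shares of the 50 crashes. A minimal Python sketch, with the counts transcribed from the slide and the names STOP_COUNTS and shares chosen here for illustration:

```python
# Crash counts per Stop error, transcribed from the slide.
STOP_COUNTS = {
    "Thread stuck in device driver": 12,
    "Page Fault in Non-Paged Area": 10,
    "System Thread Exception Not Handled": 6,
    "Unexpected Kernel Mode Trap": 6,
    "Kernel Mode Exception Not Handled": 5,
    "IRQL Not Less or Equal": 3,
    "Driver IRQL Not Less or Equal": 3,
    "NTFS File System": 2,
    "Bad Pool Caller": 2,
    "PFN List Corrupt": 1,
}


def shares(counts):
    """Return each category's percentage of the total count."""
    total = sum(counts.values())
    return {name: 100.0 * n / total for name, n in counts.items()}


if __name__ == "__main__":
    for name, pct in sorted(shares(STOP_COUNTS).items(), key=lambda kv: -kv[1]):
        print(f"{name:38s} {pct:5.1f}%")
```

With a sample of only 50 crashes these percentages are rough indicators rather than stable estimates.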

Page 14

Sample Statistics (50 system crashes)

Module / bucket     Count
watchdog.sys        7
ar5211.sys          6
ibmpmdrv.sys        6
ati3duag.dll        5
SYMEVENT.SYS        3
ipsecw2k.sys        3
memory_corruption   3
ialmdev5.DLL        2
PSCRIPT4.DLL        2
ntoskrnl.exe        2
CLASSPNP.SYS        2
win32k.sys          2
SynTP.sys           1
TDI.SYS             1
ino_fltr.sys        1
ks.sys              1
drvnddm.sys         1
ntkrnlmp.exe        1
Pool_Corruption     1

Page 15

Metrics (Windows & Linux)

Availability
  System uptime, % time BOINC running
CPU(s)
  # processes, processor queue length, % non-idle
Memory
  Available physical memory, free swap space
Disk(s)
  Free space
Network(s)
  IP address, packets & bytes sent & received/sec, bandwidth to/from SETI@home server, first-hop bandwidth*, network coordinates*
Static
  CPU type, #, and benchmarks; total memory; OS type
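A small part of this metric list can be gathered portably with Python's standard library alone. The sketch below is illustrative only: MetricSample and collect_sample are hypothetical names, and it covers just OS type, CPU count, and free disk space. Uptime, memory, processor queue length, and network counters need platform-specific sources on Windows and Linux and are not shown.

```python
import os
import platform
import shutil
import socket
import time
from dataclasses import asdict, dataclass


@dataclass
class MetricSample:
    timestamp: float       # seconds since the epoch when the sample was taken
    hostname: str
    os_type: str           # e.g. "Windows" or "Linux"
    cpu_count: int         # logical CPUs; 0 if the platform cannot report it
    disk_free_bytes: int   # free space on the volume containing `path`


def collect_sample(path="."):
    """Take one sample of the portable subset of metrics."""
    return MetricSample(
        timestamp=time.time(),
        hostname=socket.gethostname(),
        os_type=platform.system(),
        cpu_count=os.cpu_count() or 0,
        disk_free_bytes=shutil.disk_usage(path).free,
    )


if __name__ == "__main__":
    print(asdict(collect_sample()))
```

How often to call collect_sample is the open "frequency" question; static values such as OS type would only need to be recorded once per machine.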

Page 16

Questions

Other metrics?
Frequency with which to measure them?
What research questions can we answer with this data set?
  Original goal: workload to evaluate our node discovery service
  Evaluate effectiveness of network coordinates
  Evaluate potential to run more than just "embarrassingly parallel" apps on this type of infrastructure, depending on machines'
    uptime
    network connectivity
    available disk space
  Distributed analysis?
  Security uses?

Page 17

Internet Services

Page 18

Data characteristics

Real companies
Multitude of users
Voluminous data (several terabytes)
Systems are complex
  Treat as black box
  Use SLT algorithms for analysis
    More data => better models

Page 19

Analysis Results

Study event logs
  Not necessarily failures
  Can derive models of good & bad behavior
Models with varying granularity
  Use different algorithms
  Vary boundary parameters
For more details see poster: "Towards a General Approach for Event Log Analysis"
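To make "varying granularity" and "boundary parameters" concrete, here is a deliberately simple stand-in, not the authors' method or any particular SLT algorithm: it counts events per fixed time window and flags windows whose count lies more than k standard deviations from the mean. The window length plays the role of granularity and k the role of a boundary parameter; all names are hypothetical.

```python
from collections import Counter
from statistics import mean, pstdev


def events_per_window(timestamps, window_seconds):
    """Bucket event timestamps (in seconds) into fixed windows.

    Returns counts for every window from the first to the last one seen,
    including empty windows in between.
    """
    buckets = Counter(int(t // window_seconds) for t in timestamps)
    if not buckets:
        return []
    first, last = min(buckets), max(buckets)
    return [buckets.get(i, 0) for i in range(first, last + 1)]


def flag_unusual(counts, k):
    """Return indices of windows deviating from the mean by more than k std devs."""
    if len(counts) < 2:
        return []
    mu, sigma = mean(counts), pstdev(counts)
    if sigma == 0:
        return []  # every window looks the same, so nothing stands out
    return [i for i, c in enumerate(counts) if abs(c - mu) > k * sigma]
```

Shrinking the window or lowering k makes the model flag more windows, which is the kind of trade-off that varying these parameters would expose.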

Page 20

Distributed Apps

Page 21

PlanetLab

"An open platform for developing, deploying, and accessing planetary-scale services"
392 nodes at 164 sites around the world
Per-site system administration
Applications: OceanStore, PIER

Page 22

Why?

Platform for injecting faults and testing our algorithms
Applications on RADS-like environment
Research platform
  More accessible
  University-developed apps most likely to be tested on PlanetLab

Page 23

Applications

1) OceanStore
  Global persistent data store
  In the process of running prototype on PlanetLab
  Good source of failure data
2) PIER
  Distributed query processor
  Currently running on PlanetLab
  Good source of failure data + analysis engine

Page 24

What do we do with these apps?

Instrument applications to collect any type of information
  Choice of granularity
Open source - no longer black box
  Can modify it as much as necessary
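As a generic illustration of instrumenting an open-source application at a chosen granularity, the Python sketch below wraps a function so that each call records an event. It is not taken from OceanStore or PIER; the names EVENTS and instrumented, and the detailed flag that stands in for "choice of granularity", are assumptions made for this example.

```python
import functools
import time

# In-memory event log for the sketch; a real deployment would ship events elsewhere.
EVENTS = []


def instrumented(component, detailed=False):
    """Record one event per call; with detailed=True also record the duration."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.monotonic()
            event = {"component": component, "operation": func.__name__}
            try:
                result = func(*args, **kwargs)
                event["outcome"] = "ok"
                return result
            except Exception as exc:
                # Record the failure, then let it propagate unchanged.
                event["outcome"] = "error"
                event["error_type"] = type(exc).__name__
                raise
            finally:
                if detailed:
                    event["duration_s"] = time.monotonic() - start
                EVENTS.append(event)
        return wrapper
    return decorator
```

Because both successful and failed calls are logged, the resulting events could feed models of good as well as bad behavior.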

Page 25

Questions

What other applications can we use?
What should we measure and model?
What information is useful for industry?
Do you have any failure/attack data you are willing to share with us?