FIG: A Prototype Tool for On-Line Verification of Recovery Mechanisms
Naveen Sastry, Pete Broadwell, Jonathan Traupman, David Patterson
University of California, Berkeley
Presentation Outline
1. Introduction
– Objective/Motivation
– Background
2. Methods
– Implementation
– Test setup
3. Evaluation
– Test results
– Conclusions
The Berkeley/Stanford ROC Project
• Purpose: investigating novel techniques for building highly-dependable Internet services
• Example techniques:
– Advanced support for operator undo
– Stability through targeted restarts
– Integrated root cause analysis
– Online verification of recovery mechanisms
FIG Project Objective/Motivation
Objective:
• Develop a lightweight, extensible tool for injecting errors to test recovery code/mechanisms
Motivation:
• Testing and production environments are always different
• Large systems will require recovery code, which should be tested as part of normal operation
“Software’s Invisible Users”
[Diagram: the application and everything it interacts with: the user interface (user input), other libraries, other apps, system libraries (libc), and the OS]
Concept: Jim Whittaker, Florida Institute of Technology
Related Testing Methods
1. Ballista (DeVale, Koopman, Siewiorek)
• “Top-down” testing of POSIX-compliant OS and library interfaces
2. Fuzz (Miller, Fredriksen, So)
• Tested UNIX applications by feeding them random input streams
3. Holodeck (Whittaker et al.)
• Similar approach to ours, but only for Windows 2000/XP
FIG Implementation
• Thin stub library between app & libraries
• Traps API calls
– Logs them
– Inserts faults
• Can be inserted into any app without modification
– Uses LD_PRELOAD
[Diagram: call stack of Application, libfig.so, libc.so and other libs, OS; legend: normal call path, injected fault]
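
The slide describes FIG as a native stub library loaded with LD_PRELOAD. As a rough in-process analogy only (this is not FIG's code), the Python sketch below wraps a callable in a stub that logs every call and can fail it with a chosen errno instead of forwarding it. The names `make_stub`, `fault_for` and `preload_env` are hypothetical, and `./libfig.so` in the usage note is an assumed path.

```python
import errno
import os


def make_stub(name, real_call, call_log, fault_for):
    """Return a wrapper that logs each call and may inject a fault.

    fault_for(name, call_number) returns an errno value to inject, or None
    to let the call go through to real_call.
    """
    counter = {"n": 0}

    def stub(*args, **kwargs):
        counter["n"] += 1
        injected = fault_for(name, counter["n"])
        call_log.append((name, counter["n"], injected))
        if injected is not None:
            # Fail the way a real I/O call would: raise OSError with this errno.
            raise OSError(injected, os.strerror(injected))
        return real_call(*args, **kwargs)

    return stub


def preload_env(lib_path, base_env):
    """Copy base_env with lib_path placed first in LD_PRELOAD."""
    env = dict(base_env)
    existing = env.get("LD_PRELOAD", "")
    env["LD_PRELOAD"] = lib_path + (":" + existing if existing else "")
    return env
```

To run an unmodified program under a preloaded library, one could pass `env=preload_env("./libfig.so", os.environ)` to `subprocess.run`; the dynamic linker then loads the listed library before libc, which is what lets a stub see the calls first.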
Extensibility
• API stubs are automatically generated
• Very easy to add new APIs to log
• Fault injection is under script control
• Can simulate multiple fault models (e.g., memory pressure)
Sample control file:
    MALLOC_INDEX
        interval 82 to infinity return 0 errno ENOMEM probability 0.03
    OPEN_INDEX
        // device out of space.
        interval 100 to infinity return -1 errno ENOSPC probability 0.001
        // kernel out of memory.
        interval 100 to 120 return -1 errno ENOMEM probability 0.1
        // too many files open.
        callnumber 108 return -1 errno EMFILE probability 1.0
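
The control-file format is only shown by example on this slide, so the following Python sketch is a guess at how such a file could be read and applied, not FIG's actual parser. It assumes one rule per line, `//` comments, and that the first rule covering a call number gets the first chance to fire; `Rule`, `parse_control` and `pick_fault` are hypothetical names.

```python
import math
import random
from dataclasses import dataclass

SAMPLE = """\
MALLOC_INDEX
    interval 82 to infinity return 0 errno ENOMEM probability 0.03
OPEN_INDEX
    // device out of space.
    interval 100 to infinity return -1 errno ENOSPC probability 0.001
    // kernel out of memory.
    interval 100 to 120 return -1 errno ENOMEM probability 0.1
    // too many files open.
    callnumber 108 return -1 errno EMFILE probability 1.0
"""


@dataclass
class Rule:
    first: int          # first call number the rule covers
    last: float         # last call number (math.inf for "infinity")
    retval: int         # value the stub returns when the rule fires
    errno_name: str     # errno to set, e.g. "ENOMEM"
    probability: float  # chance of firing on a covered call


def parse_control(text):
    """Parse the sample format into {section_name: [Rule, ...]}."""
    rules, section = {}, None
    for raw in text.splitlines():
        line = raw.split("//", 1)[0].strip()  # drop comments
        if not line:
            continue
        tok = line.split()
        if len(tok) == 1:  # section header such as "MALLOC_INDEX"
            section = tok[0]
            rules[section] = []
        elif tok[0] == "interval":
            # interval A to B return R errno E probability P
            last = math.inf if tok[3] == "infinity" else int(tok[3])
            rules[section].append(
                Rule(int(tok[1]), last, int(tok[5]), tok[7], float(tok[9])))
        elif tok[0] == "callnumber":
            # callnumber N return R errno E probability P
            n = int(tok[1])
            rules[section].append(Rule(n, n, int(tok[3]), tok[5], float(tok[7])))
        else:
            raise ValueError("unrecognized line: " + raw)
    return rules


def pick_fault(section_rules, call_number, draw=random.random):
    """Return the first covering rule whose probability check passes, else None."""
    for rule in section_rules:
        if rule.first <= call_number <= rule.last and draw() < rule.probability:
            return rule
    return None
```

With `rules = parse_control(SAMPLE)`, calling `pick_fault(rules["OPEN_INDEX"], n)` for each successive `open()` call number `n` yields either `None` (call proceeds normally) or the rule describing the return value and errno to inject.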
Test Setup: Applications
• GNU file utilities (ls, mv, etc.)
• Emacs 20.7.1 – with and without X
• Apache 1.3.22
• Berkeley DB 4.0.14
• Netscape Navigator 4.76
• MySQL server 3.23.36
Test Setup: Instrumented Calls & Their Errors
• malloc() – memory exhaustion
• read() – I/O error, system call was interrupted
• write() – I/O error, no space left on device, call interrupted
• open() – memory exhaustion, no space on device, too many files open
• select() – memory exhaustion
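
For reference, the plain-English errors on this slide can be written as errno names. The mapping below is a transcription sketch in Python: most names match the errno codes used in the control file and results tables elsewhere in the deck, while EINTR for the interrupted `write()` is inferred from the wording rather than stated. `INJECTED_ERRORS` and `describe` are hypothetical names.

```python
import errno
import os

# Call -> errno names, transcribed from the slide's list of injected errors.
INJECTED_ERRORS = {
    "malloc": ["ENOMEM"],                    # memory exhaustion
    "read": ["EIO", "EINTR"],                # I/O error, interrupted call
    "write": ["EIO", "ENOSPC", "EINTR"],     # I/O error, no space, interrupted
    "open": ["ENOMEM", "ENOSPC", "EMFILE"],  # no memory, no space, too many files
    "select": ["ENOMEM"],                    # memory exhaustion
}


def describe(call):
    """Return [(errno_name, errno_number, message), ...] for one call."""
    return [
        (name, getattr(errno, name), os.strerror(getattr(errno, name)))
        for name in INJECTED_ERRORS[call]
    ]
```

Calling `describe("open")` lists the three faults injected into `open()` together with the platform's numeric codes and messages.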
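Test Results: Client Apps

| Application | read() EINTR | read() EIO | write() ENOSPC | write() EIO | select() ENOMEM | malloc() ENOMEM |
|---|---|---|---|---|---|---|
| Emacs – no X | o.k. | exit | warn | warn | o.k. | crash |
| Emacs – w/X | o.k. | crash | o.k. | crash | crash/exit | crash |
| Netscape | warn | exit | exit | exit | n/a | exit |
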
Test Results: Server Apps

| Application | read() EINTR | read() EIO | write() ENOSPC | write() EIO | select() ENOMEM | malloc() ENOMEM |
|---|---|---|---|---|---|---|
| Berkeley DB – Xact | retry | detect | Xact abort | Xact abort | n/a | Xact abort |
| Berkeley DB – no Xact | retry | detect | data loss | data loss | n/a | detect, or data loss |
| MySQL Server | Xact abort | retry, warn | Xact abort | Xact abort | retry | restart process |
| Apache | o.k. | req. drop | req. drop | req. drop | o.k. | n/a |

Netscape Reacts
Test Results: Overhead

| Configuration | Time (s) | Overhead |
|---|---|---|
| No FIG | 33.46 | N/A |
| FIG, no logging | 34.28 | 2.5% |
| Logging w/o timestamps | 47.83 | 42.9% |
| Logging w/timestamps | 61.74 | 84.5% |
| strace (all syscalls) | 112.85 | 237.3% |

Timing using Berkeley DB (non-transactional) to read, sort and write one million words.
• Note: FIG communicates with a separate logging daemon through shared memory to reduce logging overhead.
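
The overhead column follows from the timings as (time − baseline) / baseline, with the "No FIG" run as the baseline. The short Python check below recomputes it from the numbers on the slide; the dictionary keys and the helper name `overhead_pct` are illustrative only.

```python
BASELINE_S = 33.46  # "No FIG" run, in seconds

TIMINGS_S = {
    "FIG, no logging": 34.28,
    "Logging w/o timestamps": 47.83,
    "Logging w/timestamps": 61.74,
    "strace (all syscalls)": 112.85,
}


def overhead_pct(seconds, baseline=BASELINE_S):
    """Slowdown relative to the baseline, as a percentage rounded to 0.1."""
    return round((seconds - baseline) / baseline * 100, 1)


OVERHEADS = {name: overhead_pct(t) for name, t in TIMINGS_S.items()}
```

The recomputed values agree with the percentages reported on the slide.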
Strategies for Reliable Services:
• Intelligent retry
– ls: “bounded retry” of malloc()
• Resource preallocation
– Apache: allocates buffer pool at startup
• Degraded service
– Apache: deactivates logging if disk full
• Process pools
– Apache and MySQL
FIG as a Prototype for Online Error Injection
• Low run-time overhead
• Easy to enable/disable
• Easy to configure
• Extensible
• Can simulate multiple fault models
A Case for Online Error Injection
• Recovery code is not usually exercised during normal operation
• Deployed environments tend to differ from testing environments
• Can run error injection tests on a subset of deployed systems
• FIG can simulate common environmental errors
Conclusions
• FIG exposed a variety of deficiencies in how our test applications handled environmental errors
• Server apps are generally more robust than client applications
• FIG exhibits low overhead
• FIG is suitable for online error injection
Future Directions
• Limitations of FIG:
– Only for UNIX-like OSes
– Limited to app/library interface (proxy for app/OS interaction)
• Make FIG part of a larger test suite
• Include clock-time and event-based error triggers
• Greater flexibility in configuration file
Other Related Work
1. Xept (Vo et al.)
• Instruments object code to ensure that error handling code exists
2. Processor & memory errors
• DOCTOR, HYBRID, DEFINE
3. Process memory corruption
• FERRARI, DEFINE