Exploring Failure Transparency and the Limits of Generic Recovery
![Page 1: Exploring Failure Transparency and the Limits of Generic Recovery](https://reader035.vdocuments.site/reader035/viewer/2022062520/5681596f550346895dc6b11b/html5/thumbnails/1.jpg)
Exploring Failure Transparency and the Limits of Generic Recovery
Dave Lowell, Compaq Western Research Lab
Subhachandra Chandra and Peter M. Chen, University of Michigan
![Page 2: Exploring Failure Transparency and the Limits of Generic Recovery](https://reader035.vdocuments.site/reader035/viewer/2022062520/5681596f550346895dc6b11b/html5/thumbnails/2.jpg)
Introduction
Failure transparency: abstraction of failure-free operation
OS recovers app after hardware, OS, and application failures
– No programmer help
– No slowdown
Will explore theory, performance, and limitations
![Page 3: Exploring Failure Transparency and the Limits of Generic Recovery](https://reader035.vdocuments.site/reader035/viewer/2022062520/5681596f550346895dc6b11b/html5/thumbnails/3.jpg)
Consistent recovery
Visible output equivalent to failure-free run
– equivalence: allows duplicates
– avoids “exactly once” problem
Failure transparency: consistent recovery with generic techniques
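As a rough illustration of the equivalence above, the sketch below treats a recovered run's visible output as equivalent to a failure-free run if the two match once repeated events are dropped. The function name `is_consistent` and the assumption that every visible event in the failure-free run is distinct are hypothetical; the slides do not define the check at this level of detail.

```python
def is_consistent(failure_free, recovered):
    """Return True if `recovered` equals `failure_free` once duplicates are dropped.

    Assumption (not from the slides): every visible event in the failure-free
    run is distinct, so a repeated value in `recovered` can only be a duplicate
    re-emitted after recovery.
    """
    seen = set()
    deduped = []
    for event in recovered:
        if event not in seen:  # keep only the first occurrence of each event
            seen.add(event)
            deduped.append(event)
    return deduped == list(failure_free)


# A run that re-emits "b" after recovery still counts as consistent,
# while a run that loses "b" does not.
print(is_consistent(["a", "b", "c"], ["a", "b", "b", "c"]))
print(is_consistent(["a", "b", "c"], ["a", "c"]))
```

Allowing duplicates is what sidesteps the “exactly once” problem: the check never has to decide whether an event emitted just before the failure actually reached the user.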
![Page 4: Exploring Failure Transparency and the Limits of Generic Recovery](https://reader035.vdocuments.site/reader035/viewer/2022062520/5681596f550346895dc6b11b/html5/thumbnails/4.jpg)
Guaranteeing consistent recovery
Key players: non-deterministic events, visible events, commit events
Save-work invariant (simplified):
– There’s a commit after each non-deterministic event that happens-before a visible event.
– Full theorem handles liveness, distinguishes causality and ordering
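The simplified invariant can be sketched as a check over an event trace. This is a minimal single-process version, assuming program order stands in for happens-before; the event kinds (`"nd"`, `"commit"`, `"visible"`) and the function name `satisfies_save_work` are hypothetical, and the full theorem's treatment of liveness and multi-process causality is not modeled.

```python
def satisfies_save_work(trace):
    """Check the simplified Save-work invariant on a single-process trace.

    `trace` is a list of event kinds in program order. Assumption: within one
    process, program order is the happens-before relation.
    """
    uncommitted_nd = False
    for kind in trace:
        if kind == "nd":
            uncommitted_nd = True       # a non-deterministic result not yet saved
        elif kind == "commit":
            uncommitted_nd = False      # everything so far is now recoverable
        elif kind == "visible" and uncommitted_nd:
            return False                # output depends on an unsaved ND event
    return True


print(satisfies_save_work(["nd", "commit", "visible"]))
print(satisfies_save_work(["nd", "visible"]))
```

Any other event kind is ignored, since only the three key players affect the check.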
![Page 5: Exploring Failure Transparency and the Limits of Generic Recovery](https://reader035.vdocuments.site/reader035/viewer/2022062520/5681596f550346895dc6b11b/html5/thumbnails/5.jpg)
[Figure: Save-work protocol space. Axis: effort to identify/convert ND events. Protocols shown: Commit All, CAND, CAND-LOG.]
![Page 6: Exploring Failure Transparency and the Limits of Generic Recovery](https://reader035.vdocuments.site/reader035/viewer/2022062520/5681596f550346895dc6b11b/html5/thumbnails/6.jpg)
[Figure: Save-work protocol space. Axes: effort to identify/convert ND events; effort to commit only visible events. Protocols shown: CAND, CAND-LOG, CPVS, CPV-2PC, CBNDVS, CBNDV-2PC, CBNDVS-LOG.]
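To make the difference between these protocols concrete, the sketch below counts commits on a single-process trace. It rests on assumptions the slides do not spell out: that CAND commits after each non-deterministic event, that CPVS commits prior to each visible or send event, and that CBNDVS commits before a visible or send event only when a non-deterministic event has occurred since the last commit. The function `count_commits` and the event kinds are hypothetical, and the logging and two-phase-commit variants are not modeled.

```python
ND, VISIBLE, SEND, OTHER = "nd", "visible", "send", "other"


def count_commits(trace, protocol):
    """Count commits a protocol would perform on `trace` (assumed semantics)."""
    if protocol not in ("CAND", "CPVS", "CBNDVS"):
        raise ValueError(f"unsupported protocol: {protocol}")
    commits = 0
    nd_since_commit = False
    for kind in trace:
        if kind == ND:
            if protocol == "CAND":
                commits += 1              # commit right after the ND event
            else:
                nd_since_commit = True
        elif kind in (VISIBLE, SEND):
            if protocol == "CPVS":
                commits += 1              # commit before every visible/send
            elif protocol == "CBNDVS" and nd_since_commit:
                commits += 1              # commit only if an ND event is unsaved
                nd_since_commit = False
    return commits


trace = [ND, OTHER, ND, VISIBLE, VISIBLE, SEND, ND]
for name in ("CAND", "CPVS", "CBNDVS"):
    print(name, count_commits(trace, name))
```

Under these assumptions, the protocol that tracks both kinds of events commits least often on this trace, at the cost of more bookkeeping.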
![Page 7: Exploring Failure Transparency and the Limits of Generic Recovery](https://reader035.vdocuments.site/reader035/viewer/2022062520/5681596f550346895dc6b11b/html5/thumbnails/7.jpg)
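[Figure: Save-work protocol space with systems placed in it. Axes: effort to identify/convert ND events; effort to commit only visible events. Protocols shown: CAND, CAND-LOG, CPVS, CPV-2PC, CBNDVS, CBNDV-2PC, CBNDVS-LOG. Systems shown: Coord. Checkpointing, Manetho, Optimistic Logging, Targon/32, SBL, Hypervisor.]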
![Page 8: Exploring Failure Transparency and the Limits of Generic Recovery](https://reader035.vdocuments.site/reader035/viewer/2022062520/5681596f550346895dc6b11b/html5/thumbnails/8.jpg)
[Figure: trends across the Save-work protocol space. Axes: effort to identify/convert ND events; effort to commit only visible events. Trend labels: increasing recovery time, application failure recovery, increasing simplicity, increasing performance.]
![Page 9: Exploring Failure Transparency and the Limits of Generic Recovery](https://reader035.vdocuments.site/reader035/viewer/2022062520/5681596f550346895dc6b11b/html5/thumbnails/9.jpg)
Performance study
Discount Checking: fast checkpoints to reliable memory (Rio)
– Logging and two-phase commit
– Disk version
Mostly interactive applications
– Localized and distributed
![Page 10: Exploring Failure Transparency and the Limits of Generic Recovery](https://reader035.vdocuments.site/reader035/viewer/2022062520/5681596f550346895dc6b11b/html5/thumbnails/10.jpg)
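Nvi Text Editor
[Figure: percentages per protocol in the Save-work protocol space (two values each, unlabeled in the source). Axes: effort to identify/convert ND events; effort to commit only visible events.]
CAND: 1%, 43%
CAND-LOG: 0%, 13%
CPVS: 1%, 44%
CBNDVS: 1%, 42%
CBNDVS-LOG: 0%, 12%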
![Page 11: Exploring Failure Transparency and the Limits of Generic Recovery](https://reader035.vdocuments.site/reader035/viewer/2022062520/5681596f550346895dc6b11b/html5/thumbnails/11.jpg)
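TreadMarks Barnes-Hut
[Figure: percentages per protocol in the Save-work protocol space (two values each, unlabeled in the source). Axes: effort to identify/convert ND events; effort to commit only visible events.]
CAND: 199%, 11499%
CAND-LOG: 126%, 7700%
CPVS: 129%, 7346%
CPV-2PC: 12%, 319%
CBNDVS: 101%, 5743%
CBNDV-2PC: 12%, 252%
CBNDVS-LOG: 73%, 4973%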
![Page 12: Exploring Failure Transparency and the Limits of Generic Recovery](https://reader035.vdocuments.site/reader035/viewer/2022062520/5681596f550346895dc6b11b/html5/thumbnails/12.jpg)
Have only considered “stop” failures
Committing everything is okay
– Save-work: when we must commit
Some failures affect application state
– Can we commit too much?
![Page 13: Exploring Failure Transparency and the Limits of Generic Recovery](https://reader035.vdocuments.site/reader035/viewer/2022062520/5681596f550346895dc6b11b/html5/thumbnails/13.jpg)
Dangerous Paths
![Page 14: Exploring Failure Transparency and the Limits of Generic Recovery](https://reader035.vdocuments.site/reader035/viewer/2022062520/5681596f550346895dc6b11b/html5/thumbnails/14.jpg)
Dangerous Paths
![Page 15: Exploring Failure Transparency and the Limits of Generic Recovery](https://reader035.vdocuments.site/reader035/viewer/2022062520/5681596f550346895dc6b11b/html5/thumbnails/15.jpg)
Lose-work invariant
To recover from propagation failure, never commit on a “dangerous path”.
Save-work and Lose-work conflict!
– Visible event on dangerous path
– Can’t guarantee consistent recovery from propagation failures
Do we see this conflict in practice?
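The conflict can be sketched by checking both invariants on the same trace. This is a minimal single-process illustration: each event carries a hypothetical `dangerous` flag supplied as input, since the slides define dangerous paths only in a figure, and the function name `check_invariants` is likewise an assumption.

```python
def check_invariants(trace):
    """Return (save_work_ok, lose_work_ok) for a single-process trace.

    `trace` is a list of (kind, dangerous) pairs in program order, where
    `dangerous` marks events assumed to lie on a dangerous path.
    """
    save_work_ok = True
    lose_work_ok = True
    uncommitted_nd = False
    for kind, dangerous in trace:
        if kind == "nd":
            uncommitted_nd = True
        elif kind == "commit":
            uncommitted_nd = False
            if dangerous:
                lose_work_ok = False    # committed on a dangerous path
        elif kind == "visible" and uncommitted_nd:
            save_work_ok = False        # visible output depends on unsaved ND event
    return save_work_ok, lose_work_ok


# With a visible event on a dangerous path, each choice breaks one invariant.
print(check_invariants([("nd", True), ("visible", True)]))
print(check_invariants([("nd", True), ("commit", True), ("visible", True)]))
```

Skipping the commit violates Save-work, and adding it violates Lose-work, which is the conflict the slide describes.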
![Page 16: Exploring Failure Transparency and the Limits of Generic Recovery](https://reader035.vdocuments.site/reader035/viewer/2022062520/5681596f550346895dc6b11b/html5/thumbnails/16.jpg)
Measuring Lose-work violations
Fault-injection study: OS crashes
– injected faults into running kernel
– induced 350 OS crashes
– recovered nvi and postgres using Discount Checking
Results
– nvi: 15% of crashes violate Lose-work
– postgres: 3% of crashes violate Lose-work
![Page 17: Exploring Failure Transparency and the Limits of Generic Recovery](https://reader035.vdocuments.site/reader035/viewer/2022062520/5681596f550346895dc6b11b/html5/thumbnails/17.jpg)
Application crashes
Fault-injection study: ND bugs
– nvi: 37% violate Lose-work
– postgres: 33% violate Lose-work
Published bug distributions: 85-95% of application bugs are deterministic
– intrinsically violate Lose-work
Perhaps > 90% of app crashes violate Lose-work!
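One plausible way to arrive at a figure above 90% is to combine the numbers on this slide: deterministic bugs always violate Lose-work, and the remaining non-deterministic bugs violate it at the measured 33-37% rate. The sketch below works through that arithmetic; the combination formula and the function name `violating_fraction` are assumptions, not something the slides state.

```python
def violating_fraction(deterministic, nd_violation_rate):
    """Fraction of application crashes that violate Lose-work.

    Assumes every deterministic bug violates Lose-work and that
    non-deterministic bugs violate it at `nd_violation_rate`.
    """
    return deterministic + (1.0 - deterministic) * nd_violation_rate


low = violating_fraction(0.85, 0.33)   # fewest deterministic bugs, postgres rate
high = violating_fraction(0.95, 0.37)  # most deterministic bugs, nvi rate
print(low, high)
```

Under these assumptions the range runs from roughly 90% to roughly 97%.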
![Page 18: Exploring Failure Transparency and the Limits of Generic Recovery](https://reader035.vdocuments.site/reader035/viewer/2022062520/5681596f550346895dc6b11b/html5/thumbnails/18.jpg)
Conclusions
Save-work and Lose-work invariants
Save-work protocol space
Invariants fundamentally conflict
Failure transparency performance:
– 0-12% overhead on reliable memory
– 13-40% overhead on disk (interactive apps)
> 90% of application failures violate Lose-work