University of Michigan, Electrical Engineering and Computer Science
Encore: Low-Cost, Fine-Grained Transient Fault Recovery
Authors: Shuguang Feng*, Shantanu Gupta, Amin Ansari, Scott Mahlke, David August
University of Michigan
*Currently with Northrop Grumman, Information Systems Sector
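…many ways to fail
Negative bias temperature instability, oxide breakdown, electromigration, packaging impurities, cosmic radiation, PVT variation [Gupta`09]
NTC computing [Dreslinski`10]
(Figure: transistor cross-section labeled with gate (G), source (S), drain (D), and bulk (B) terminals, N+ regions, a P-well, and the currents Igs, Igd, Igb, Igcs, Igcd)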
“Failure to prepare is preparing to fail…” - Benjamin Franklin
The distinction between a transient and permanent fault is becoming blurred
(Figure: spectrum running from transient (“soft”) faults to permanent (“hard”) faults, with labels Rare, Periodic, and Continuous)
Many permanent faults, particularly wearout-induced faults, initially manifest as timing errors.
The Future of Soft Errors
Past: one failure per MONTH per 100 chips
Present: one failure per DAY per 100 chips
Future: one failure per DAY per chip
Aggressive voltage scaling (near-threshold computing)
Realizing a Reliability “Pipeline”
(Figure: pipeline from Vulnerable Computation to Reliable Output with stages Detection, Recovery, Diagnosis, and Repair)
Detection
Recent interest in low-cost fault detection: ReStore [DSN`05], SWAT [ASPLOS`08], Shoestring [ASPLOS`10]
Not perfect…but very low-cost
Recovery
Generally involves some form of rollback/re-execution
1) Identify fault site
2) Restore processor to pre-fault state, before 1)
3) Resume execution from 1)
Many low-cost detection techniques rely on hardware speculation support
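The three rollback/re-execution steps can be sketched in a few lines. This is a minimal illustration, not Encore itself: it snapshots the entire state at region entry (the heavyweight approach the talk tries to avoid), and the names `FaultDetected`, `run_with_rollback`, and `make_flaky_region` are hypothetical.

```python
import copy


class FaultDetected(Exception):
    """Raised by a (hypothetical) low-cost fault detector."""


def run_with_rollback(state, region, max_retries=3):
    """Run region(state); on a detected fault, roll back and re-execute."""
    for _ in range(max_retries + 1):
        snapshot = copy.deepcopy(state)  # pre-fault state, captured at region entry
        try:
            region(state)
            return state
        except FaultDetected:
            # 1) The fault site lies somewhere inside this region.
            # 2) Restore the pre-fault state.
            state.clear()
            state.update(snapshot)
            # 3) The loop resumes execution from the region entry.
    raise RuntimeError("fault persisted across retries")


def make_flaky_region(faults):
    """Build a region that signals a fault on its first `faults` executions."""
    remaining = [faults]

    def region(state):
        state["x"] += 1  # partial update made before the fault is detected
        if remaining[0] > 0:
            remaining[0] -= 1
            raise FaultDetected()
        state["y"] = state["x"] * 2

    return region


result = run_with_rollback({"x": 1, "y": 0}, make_flaky_region(1))
print(result)  # {'x': 2, 'y': 4}
```

Without step 2, the partial update to `x` would survive the fault and the retry would produce `x == 3`.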
Commodity systems present both challenges and opportunities
Challenge: HW speculation support (if it exists) is limited
Challenge: Cannot afford expensive, heavyweight SW checkpointing
Opportunity: Typically not running mission-critical applications
Sacrifice a small degree of reliability
Exploit (probabilistic) idempotence in program execution
The Role of Idempotence
Mathematical Definition: an operation that can be applied multiple times without changing the result
Computer Science Definition: a region of code without any exposed write-after-read (WAR, anti-) dependencies
(Figure: side-by-side idempotent and non-idempotent example regions built from the statements … = X, X = …, and X++)
Idempotent code regions can be safely re-executed without additional checkpointing
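A small experiment makes the WAR condition concrete. The sketch below models memory as a dictionary and re-executes each region once, as recovery would after a fault detected at the end of the first run. The two regions are illustrative, not taken from the slides.

```python
def non_idempotent(mem):
    """Reads X, then overwrites it: an exposed write-after-read on X."""
    t = mem["X"]
    mem["X"] += 1
    return t


def idempotent(mem):
    """Writes X before reading it, so the read never sees a stale input."""
    mem["X"] = 5
    return mem["X"] + 1


def reexecute(region, mem):
    """Run the region twice on the same memory and return both results."""
    first = region(mem)
    second = region(mem)
    return first, second


print(reexecute(non_idempotent, {"X": 0}))  # (0, 1): re-execution changes the result
print(reexecute(idempotent, {"X": 0}))      # (6, 6): re-execution is safe
```

The non-idempotent region needs the original value of X to be saved before it can be re-executed. The idempotent region needs no checkpoint at all.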
Does Idempotence Exist?
Selectively checkpointing a *few* offending stores
Challenges to Exploiting Idempotence
Must identify where to resume execution
1) Control flow
2) Rollback distance
Statically identifying optimal rollback distance is inherently intractable
↑ rollback dist. → ↑ Pr(recoverable)
↓ rollback dist. → ↑ Pr(idempotent)
Simplifying engineering solution based on single-entry, multiple-exit (SEME) regions
(Figure: control-flow graph of basic blocks bb 1 through bb 7, plus bb’, with one execution path marked)
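The appeal of SEME regions is that a region with a single entry has exactly one place to resume from. The sketch below checks only the single-entry property of a candidate region and lists its exit edges. It does not show how Encore forms regions. The CFG, the block numbers, and the function names are hypothetical.

```python
def is_single_entry(cfg, region, header):
    """True if every edge entering `region` from outside targets `header`.

    cfg maps each block to the list of its successor blocks.
    """
    if header not in region:
        return False
    for src, succs in cfg.items():
        if src in region:
            continue  # edges that start inside the region cannot be entries
        for dst in succs:
            if dst in region and dst != header:
                return False  # side entrance: a second resume point would be needed
    return True


def exit_edges(cfg, region):
    """Edges that leave the region; a SEME region may have several."""
    return {(s, d) for s in region for d in cfg.get(s, []) if d not in region}


# Hypothetical CFG: a diamond (1 -> 2|3 -> 4) followed by a second diamond.
cfg = {1: [2, 3], 2: [4], 3: [4], 4: [5, 6], 5: [7], 6: [7], 7: []}

print(is_single_entry(cfg, {1, 2, 3, 4}, header=1))  # True
print(is_single_entry(cfg, {2, 4}, header=2))        # False: edge 3 -> 4 enters at 4
print(sorted(exit_edges(cfg, {1, 2, 3, 4})))         # [(4, 5), (4, 6)]
```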
Encore Vision
Source Code → Code Partitioning (CFG-based) → Idempotence Analysis (per region) → Instrumentation (per region) → Runtime Behavior (post-fault)
Idempotence Analysis: each region is classified as Idempotent or Non-idempotent (example statements: … = X, X++)
Instrumentation: Chkpt X and Recovery code added to the example region
Runtime Behavior: Fault Detected, Redirect Control, Restore State
Identifying Idempotence (High-level)
(Figure: example CFG with basic blocks bb 1 through bb 8)
With respect to a point, p, in the CFG…
Reachable Stores (RS): a store that may execute after p
Guarded Addresses (GA): an address that is guaranteed to be overwritten before reaching p
Exposed Addresses (EA): an address that may be referenced by an unguarded load prior to p
Idempotent IFF EA ∩ RS = Ø
Additional Details…
1) Applies to both memory and registers
   Static, conservative alias analysis
2) Scalable hierarchical analysis
   Handles cyclic code
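The EA ∩ RS test can be illustrated on a deliberately simplified case. The sketch assumes a straight-line region, exact symbolic addresses (no alias analysis), and re-execution from the region entry. The CFG-based, hierarchical analysis described in the talk is not reproduced here. The function names and the operation encoding are hypothetical.

```python
def analyze_region(ops):
    """Return (exposed, stores, offending) address sets for a straight-line region.

    ops is a list of ("load" | "store", address) pairs in program order.
    """
    guarded = set()  # GA: addresses already overwritten earlier in the region
    exposed = set()  # EA: addresses read by a load that no earlier store guards
    stores = set()   # RS: addresses that a store in the region may write
    for kind, addr in ops:
        if kind == "load":
            if addr not in guarded:
                exposed.add(addr)
        elif kind == "store":
            # In straight-line code every store both guards later loads and
            # counts as reachable; with branches the two sets would differ.
            guarded.add(addr)
            stores.add(addr)
        else:
            raise ValueError(f"unknown operation: {kind}")
    return exposed, stores, exposed & stores


def is_idempotent(ops):
    """Idempotent iff no exposed address is overwritten by a store in the region."""
    return not analyze_region(ops)[2]


print(is_idempotent([("store", "A"), ("load", "A")]))  # True: the load is guarded
print(is_idempotent([("load", "A"), ("store", "A")]))  # False: WAR on A
print(analyze_region([("load", "A"), ("store", "B"), ("load", "B"), ("store", "A")])[2])  # {'A'}
```

The addresses in the third result are the offending stores: the only ones that would need a checkpoint to make the region safe to re-execute.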
Code Instrumentation
“On-demand” Checkpointing: MemCopy B, Save Address[B]
Live-in Checkpointing: Save R1, Save R2, … Save Rn
Recovery Code (upon fault detection): *Restore B, Restore R1, Restore R2, … Restore Rn
(Figure: example CFG with basic blocks bb 0 through bb 8 plus a recovery block bb r, annotated with memory operations 1: Store A, 2: Store B, 3: Store C, 4: Load A, 5: Store C, 6: Load B, 7: Load B, 8: Load C, 9: Store A, 10: Store B, 11: Load C, 12: Store C)
Encore Heuristics
1) Selectively prune dynamically-dead code
   ↓ offending stores → ↑ Pr(idempotent)
2) Selectively fuse adjacent regions
   ↑ region size → ↑ Pr(recoverable)
3) Selectively instrument profitable regions
Lightweight Checkpointing
(Figure: stack layout. Traditional Call Stack: Input Parameters, Return Address, Local Variables, with the Frame Pointer and Stack Pointer marked. Encore Extensions: Live-in Registers and checkpoint entries (data_0, addr_0), (data_1, addr_1), … (data_N, addr_N).)
Checkpoint costs shown: 1 reg2mem store; and 1 reg2mem store, 1 mem2mem copy, 1 stack ptr increment
Stack grows dynamically to accommodate checkpoint storage
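The (data, addr) pairs in the stack figure suggest an undo log. A minimal sketch, assuming memory is modeled as a dictionary and the log as a Python list. The class and method names are hypothetical, and the real scheme lives on the call stack rather than in a separate object.

```python
class CheckpointStack:
    """Undo log of (addr, old_data) pairs, pushed before offending stores."""

    def __init__(self):
        self.entries = []

    def checkpoint(self, memory, addr):
        """Save the address and its current contents before it is overwritten."""
        self.entries.append((addr, memory[addr]))

    def restore(self, memory):
        """Undo in reverse order so the oldest saved value is written last."""
        while self.entries:
            addr, data = self.entries.pop()
            memory[addr] = data

    def release(self):
        """The region finished without a fault: discard its checkpoints."""
        self.entries.clear()


memory = {0x10: 7, 0x14: 9}
log = CheckpointStack()

log.checkpoint(memory, 0x10)
memory[0x10] = 8          # first offending store
log.checkpoint(memory, 0x10)
memory[0x10] = 9          # second offending store to the same address

log.restore(memory)       # a fault was detected: roll the region back
print(memory[0x10])       # 7
```

Popping in reverse order matters when one address is checkpointed more than once: the last value written back is the one saved first, which is the value the region saw at entry.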
Evaluation Methodology
Program analysis/instrumentation performed in the LLVM compiler
In-order, single-issue, embedded-class processor
Dynamic instruction model based on profiled execution
Reliability coverage
Analytical model in lieu of traditional fault injection
Decouples evaluation from microarchitectural details
Inherent Idempotence
(Figure legend: 0% (dynamically-dead), <5%, <10%)
76% of application code is naturally idempotent
Dynamic Execution Breakdown
Impact of detection latency
If control has left the region containing the original fault site, re-execution cannot correct the error
91% of execution time is spent within recoverable regions
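The effect of detection latency can be illustrated with a toy calculation. This is not the analytical model used in the evaluation. It assumes a fault strikes a uniformly random instruction of a region and is detected exactly `latency` instructions later, and it counts the fault as recoverable only if control is still inside the region at that point. The function name and the region length are hypothetical.

```python
def recoverable_fraction(region_len, latency):
    """Fraction of fault sites whose detection still falls inside the region.

    Toy model: fault sites are uniform over region_len dynamic instructions,
    and detection happens exactly `latency` instructions after the fault.
    """
    if region_len <= 0:
        raise ValueError("region_len must be positive")
    if latency < 0:
        raise ValueError("latency must be non-negative")
    return max(0, region_len - latency) / region_len


for latency in (10, 100, 1000):
    print(latency, recoverable_fraction(1000, latency))
# 10 0.99
# 100 0.9
# 1000 0.0
```

Under these assumptions a region shorter than the detection latency recovers nothing, which is one way to read the pressure toward larger regions.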
Full System “Coverage”
(Figure legend: Existing (~100 instrs), Future (~10 instrs), Future (~1000 instrs))
93% − 99.99% coverage, highly application dependent
Overheads
3% − 22% performance degradation
Summary
Large portions of applications, across domains, are (probabilistically) idempotent
Encore is a software-only solution that exploits this property to provide low-cost fault recovery
97% of faults on average are recoverable with current detection schemes, at a 15% performance penalty
Implementing Encore in a runtime system / virtual machine has the potential to yield even better results
Larger dynamic traces v. static intervals
Dynamic v. static memory analysis
Questions?
http://cccp.eecs.umich.edu