swift: software implemented fault tolerance george a. reis, jonathan chang, neil vachharajani, ram...

65
SWIFT: SWIFT: Software Implemented Fault Software Implemented Fault Tolerance Tolerance George A. Reis, Jonathan Chang, Neil George A. Reis, Jonathan Chang, Neil Vachharajani, Vachharajani, Ram Rangan, David I. August Ram Rangan, David I. August Princeton University Princeton University International Symposium on Code Generation and International Symposium on Code Generation and Optimization CGO’05 Optimization CGO’05 Nihan Özman - 2005700452 Nihan Özman - 2005700452

Upload: felicity-gallagher

Post on 26-Dec-2015

220 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: SWIFT: Software Implemented Fault Tolerance George A. Reis, Jonathan Chang, Neil Vachharajani, Ram Rangan, David I. August Princeton University International

SWIFT: SWIFT: Software Implemented Fault ToleranceSoftware Implemented Fault Tolerance

George A. Reis, Jonathan Chang, Neil Vachharajani,George A. Reis, Jonathan Chang, Neil Vachharajani,Ram Rangan, David I. AugustRam Rangan, David I. August

Princeton UniversityPrinceton University

International Symposium on Code Generation and Optimization International Symposium on Code Generation and Optimization CGO’05CGO’05

Nihan Özman - 2005700452Nihan Özman - 2005700452

Page 2: SWIFT: Software Implemented Fault Tolerance George A. Reis, Jonathan Chang, Neil Vachharajani, Ram Rangan, David I. August Princeton University International

OutlineOutline

• IntroductionIntroduction• Prior WorkPrior Work• Software Fault DetectionSoftware Fault Detection

Control Flow CheckingControl Flow Checking SWIFTSWIFT

• Implementation DetailsImplementation Details• EvaluationEvaluation• ConclusionConclusion

Page 3: SWIFT: Software Implemented Fault Tolerance George A. Reis, Jonathan Chang, Neil Vachharajani, Ram Rangan, David I. August Princeton University International

IntroductionIntroduction

• In recent decades, microprocessor performance has been In recent decades, microprocessor performance has been increasing exponentially due to:increasing exponentially due to: smaller and faster transistors with low threshold voltagessmaller and faster transistors with low threshold voltages tighter noise margins enabled by improved fabrication tighter noise margins enabled by improved fabrication

technologytechnology

• While these devices yield performance enhancements, While these devices yield performance enhancements,

they will they will be less reliablebe less reliable make processors that use them more susceptible to make processors that use them more susceptible to

transient faultstransient faults

Page 4: SWIFT: Software Implemented Fault Tolerance George A. Reis, Jonathan Chang, Neil Vachharajani, Ram Rangan, David I. August Princeton University International

Properties of Transient FaultsProperties of Transient Faults

• Known as Known as soft errorssoft errors

• Unlike manifacturing or design faults, do not occur Unlike manifacturing or design faults, do not occur consistently.consistently.

• Caused by external events:Caused by external events: such as energetic particles striking the chipsuch as energetic particles striking the chip not cause permanent physical damage to the processornot cause permanent physical damage to the processor alter signal transfers or stored values and thus cause alter signal transfers or stored values and thus cause

incorrect program executionincorrect program execution

Page 5: SWIFT: Software Implemented Fault Tolerance George A. Reis, Jonathan Chang, Neil Vachharajani, Ram Rangan, David I. August Princeton University International

Hardware Solutions for Transient FaultsHardware Solutions for Transient Faults

• To counter transient faults, designers typically introduce To counter transient faults, designers typically introduce redundant hardwareredundant hardware::

Some storage structures, such as caches and memory, Some storage structures, such as caches and memory, include error correcting codes (ECC) and parity bits: include error correcting codes (ECC) and parity bits: redundant bits can be used to detect or correct the faultredundant bits can be used to detect or correct the fault ..

Combinational logic within the processor can be protected Combinational logic within the processor can be protected by by duplication:duplication: output from the duplicated output from the duplicated combinational logic blocks can be compared to detect combinational logic blocks can be compared to detect faultsfaults..

Page 6: SWIFT: Software Implemented Fault Tolerance George A. Reis, Jonathan Chang, Neil Vachharajani, Ram Rangan, David I. August Princeton University International

Examples of Advanced Hardware SolutionsExamples of Advanced Hardware Solutions

• High-availability systems need more redundancy hardware High-availability systems need more redundancy hardware than that provided by ECC and parity bits, likethan that provided by ECC and parity bits, like:: IBM has IBM has added additional logicadded additional logic within its mainframe within its mainframe

processorsprocessors for fault tolerance. for fault tolerance. During design of S/390 G5, IBM During design of S/390 G5, IBM fully replicated the fully replicated the

processor’s execution unitsprocessor’s execution units to avoid various performance to avoid various performance pitfalls with their previous fault tolerance approach.pitfalls with their previous fault tolerance approach.

Fujitsu used a form of Fujitsu used a form of error protection that includes ALU error protection that includes ALU parity generation and a mul/divide residue checkparity generation and a mul/divide residue check..

Boeing designed its 777 aircraft system with three different Boeing designed its 777 aircraft system with three different processors and data busses using processors and data busses using a majority voting a majority voting scheme to achieve both fault detection and recoveryscheme to achieve both fault detection and recovery ..

Page 7: SWIFT: Software Implemented Fault Tolerance George A. Reis, Jonathan Chang, Neil Vachharajani, Ram Rangan, David I. August Princeton University International

Disadvantages of Hardware SolutionsDisadvantages of Hardware Solutions

• Too expensive for many processor markets, including Too expensive for many processor markets, including highly price-competitive desktop and laptop markets.highly price-competitive desktop and laptop markets.

• May have ECC or parity in the memory subsystem, but May have ECC or parity in the memory subsystem, but certainly do not posses double or triple redundant certainly do not posses double or triple redundant execution cores.execution cores.

• Transient faults in both memory and combinational logic Transient faults in both memory and combinational logic will need to be addressed in will need to be addressed in allall aggressive processor aggressive processor designs, not only in high-availability applications.designs, not only in high-availability applications.

Page 8: SWIFT: Software Implemented Fault Tolerance George A. Reis, Jonathan Chang, Neil Vachharajani, Ram Rangan, David I. August Princeton University International

Proposed Software SolutionProposed Software Solution

• To achieve redundancy and fault tolerance, a To achieve redundancy and fault tolerance, a software-basedsoftware-based, , single-threaded single-threaded approach, SWIFT, is approach, SWIFT, is proposed.proposed.

• It performs fault detection in a manner compatible with It performs fault detection in a manner compatible with most reporting and recovery mechanisms (can be easily most reporting and recovery mechanisms (can be easily extended to incorporate complete fault tolerance)extended to incorporate complete fault tolerance)

• It is a compiler-based transformation thatIt is a compiler-based transformation that duplicates the instructions in a programduplicates the instructions in a program inserts comparison instructions at strategic points during inserts comparison instructions at strategic points during

code generation.code generation.

Page 9: SWIFT: Software Implemented Fault Tolerance George A. Reis, Jonathan Chang, Neil Vachharajani, Ram Rangan, David I. August Princeton University International

Desirable Features of Software SolutionDesirable Features of Software Solution

• The technique does not require any hardware changes.The technique does not require any hardware changes.

• The compiler is free to make use of slack in a program’s The compiler is free to make use of slack in a program’s schedule to minimize performance degradation.schedule to minimize performance degradation.

• Programmers are free to vary transient policy within a Programmers are free to vary transient policy within a program.program.

• A compiler orchestrated relationship between the duplicated A compiler orchestrated relationship between the duplicated instructions allows for simple methods to deal with instructions allows for simple methods to deal with exception-handling, interrupt-handlingexception-handling, interrupt-handling and and shared shared memorymemory

Page 10: SWIFT: Software Implemented Fault Tolerance George A. Reis, Jonathan Chang, Neil Vachharajani, Ram Rangan, David I. August Princeton University International

Improvements of SWIFTImprovements of SWIFT

• Requires no hardware beyond ECC in memory subsystemRequires no hardware beyond ECC in memory subsystem• Eliminates the need to double the memory requirement by Eliminates the need to double the memory requirement by

acknowledging the use of ECC in caches and memoryacknowledging the use of ECC in caches and memory• Increases protection at no additional performance cost by Increases protection at no additional performance cost by

introducing a new control- flow checking mechanismintroducing a new control- flow checking mechanism• Reduces performance overhead by eliminating branch Reduces performance overhead by eliminating branch

validation code made unnnecessary by this enhanced control validation code made unnnecessary by this enhanced control flow mechanism.flow mechanism.

• Performs better than all known single-threaded full software Performs better than all known single-threaded full software detection techniques.detection techniques.

• Deployable in both uniprocessor and multiprocessor Deployable in both uniprocessor and multiprocessor environments (methods to deal with exception-handling, environments (methods to deal with exception-handling, interrupt-handling, shared memory programs)interrupt-handling, shared memory programs)

Page 11: SWIFT: Software Implemented Fault Tolerance George A. Reis, Jonathan Chang, Neil Vachharajani, Ram Rangan, David I. August Princeton University International

Implementation of SWIFTImplementation of SWIFT

• SWIFT can be implemented on any architecture and can SWIFT can be implemented on any architecture and can protect individual code segments to varying degrees.protect individual code segments to varying degrees.

• A full program implementation running on Itanium 2 is A full program implementation running on Itanium 2 is evaluated.evaluated.

• In experiments, SWIFT demonstrates In experiments, SWIFT demonstrates exceptional fault-coverage with a reasonable performance exceptional fault-coverage with a reasonable performance

costcost a 14% average speedup compared to the best known a 14% average speedup compared to the best known

single-threaded approach utilizing an ECC memory systemsingle-threaded approach utilizing an ECC memory system

Page 12: SWIFT: Software Implemented Fault Tolerance George A. Reis, Jonathan Chang, Neil Vachharajani, Ram Rangan, David I. August Princeton University International

Prior Work – Hardware-Based RedundancyPrior Work – Hardware-Based Redundancy

• Mahmood and McCluskey proposed using a Mahmood and McCluskey proposed using a watchdog watchdog processor to compare and validate the outputs against the processor to compare and validate the outputs against the main running processor.main running processor.

• Austin proposed Austin proposed DIVADIVA, uses a main, high-performance, , uses a main, high-performance, out-of-order processor core that executes instructions and out-of-order processor core that executes instructions and a second, simpler core to validates the execution.a second, simpler core to validates the execution.

• Compaq NonStop HimalayaCompaq NonStop Himalaya, real system implementation , real system implementation that replicates part or all of the processor and uses that replicates part or all of the processor and uses checkers to validate the redundant computations.checkers to validate the redundant computations.

• Rotenberg expanded the Rotenberg expanded the SMT (Simultaneous SMT (Simultaneous MultiThreading)MultiThreading) redundancy concept with AR-SMT redundancy concept with AR-SMT (Active Stream/Redundant Stream Simultaneous (Active Stream/Redundant Stream Simultaneous Multithreading).Multithreading).

Page 13: SWIFT: Software Implemented Fault Tolerance George A. Reis, Jonathan Chang, Neil Vachharajani, Ram Rangan, David I. August Princeton University International

Prior Work – Hardware-Based RedundancyPrior Work – Hardware-Based Redundancy

• Reinhardt and Mukherjee proposed Reinhardt and Mukherjee proposed simultaneous simultaneous Redundant MultiThreading (RMT)Redundant MultiThreading (RMT) which increases the which increases the performance of AR-SMT and compares redundant performance of AR-SMT and compares redundant streams before data is stored in the memory.streams before data is stored in the memory.

• Mukherjee proposed a Mukherjee proposed a Chip-level Redundantly Chip-level Redundantly Threaded multiprocessor (CRT)Threaded multiprocessor (CRT)

• Gomma expanded CRT approach withGomma expanded CRT approach with CRTR CRTR to enable to enable recovery.recovery.

• Ray proposed modifying an out-of-order super scalar Ray proposed modifying an out-of-order super scalar processor’s microarchitectural components to implement processor’s microarchitectural components to implement redundancy. redundancy.

• All HW-based approaches require the addition of new All HW-based approaches require the addition of new hardware logic to meet redundancy requirements.hardware logic to meet redundancy requirements.

Page 14: SWIFT: Software Implemented Fault Tolerance George A. Reis, Jonathan Chang, Neil Vachharajani, Ram Rangan, David I. August Princeton University International

Comparison of Various Redundancy Comparison of Various Redundancy ApproachesApproaches

Page 15: SWIFT: Software Implemented Fault Tolerance George A. Reis, Jonathan Chang, Neil Vachharajani, Ram Rangan, David I. August Princeton University International

Prior Work – Software-Based RedundancyPrior Work – Software-Based Redundancy

• Software-only approaches to redundancy comeSoftware-only approaches to redundancy come free of free of costcost

• Oh and McCluskey proposed a novel software redundancy Oh and McCluskey proposed a novel software redundancy approachapproach (EDDI: Error Detection by Duplicating (EDDI: Error Detection by Duplicating Instructions) Instructions) wherein all instructions are duplicated and wherein all instructions are duplicated and appropriate “check” instructions are inserted to validateappropriate “check” instructions are inserted to validate

• Oh et al. developed a pure Oh et al. developed a pure Software Control-Flow Software Control-Flow Checking Scheme (CFCSS) Checking Scheme (CFCSS) wherein each control transfer wherein each control transfer generates a run-time signature that is validated by eror generates a run-time signature that is validated by eror checking code generated by the compiler for every blockchecking code generated by the compiler for every block

• Venkatasubramanian et al. proposed Assertions for Control Venkatasubramanian et al. proposed Assertions for Control Flow CheckingFlow Checking (ACFC) (ACFC) that assigns an execution parity to that assigns an execution parity to each basic block and detect faults based on parity errors. each basic block and detect faults based on parity errors.

Page 16: SWIFT: Software Implemented Fault Tolerance George A. Reis, Jonathan Chang, Neil Vachharajani, Ram Rangan, David I. August Princeton University International

Prior Work – Software-Based RedundancyPrior Work – Software-Based Redundancy

• A sphere of replication (SoR) is the logical domain of A sphere of replication (SoR) is the logical domain of redundant execution.redundant execution.

• SWIFTSWIFT makes several key refinements to EDDImakes several key refinements to EDDI incorporates a software only signature-based control-flow incorporates a software only signature-based control-flow

checking scheme to achieve exceptional fault-coveragechecking scheme to achieve exceptional fault-coverage

• The main difference between EDDI and SWIFT isThe main difference between EDDI and SWIFT is EDDI’s SoR includes entire processor core and the memory EDDI’s SoR includes entire processor core and the memory

subsystemsubsystem SWIFT moves memory out of the SoR (memory structures SWIFT moves memory out of the SoR (memory structures

are already well-protected by hardware schemes like are already well-protected by hardware schemes like parity parity and ECCand ECC, with or without scrubbing), with or without scrubbing)

Page 17: SWIFT: Software Implemented Fault Tolerance George A. Reis, Jonathan Chang, Neil Vachharajani, Ram Rangan, David I. August Princeton University International

Comparison of Various Redundancy Comparison of Various Redundancy ApproachesApproaches

Page 18: SWIFT: Software Implemented Fault Tolerance George A. Reis, Jonathan Chang, Neil Vachharajani, Ram Rangan, David I. August Princeton University International

Software Fault DetectionSoftware Fault Detection

• In this section, the following will be explained:In this section, the following will be explained: foundation of SWIFTfoundation of SWIFT extending EDDI with control-flow checking with software extending EDDI with control-flow checking with software

signaturessignatures introducing novel extensions that comprise SWIFTintroducing novel extensions that comprise SWIFT

• The assumptions should be taken into consideration:The assumptions should be taken into consideration: a a Single-Event Upset (SEU)Single-Event Upset (SEU) fault model, in which exactly fault model, in which exactly

one bit is flipped throughout the entire program.one bit is flipped throughout the entire program. memory subsystem, including processor caches, are already memory subsystem, including processor caches, are already

adequately protected using techniques like parity and ECCadequately protected using techniques like parity and ECC the transformations are used to the transformations are used to detect detect faults (efficacious and faults (efficacious and

cost-effective fault detection is of primary concern)cost-effective fault detection is of primary concern)

Page 19: SWIFT: Software Implemented Fault Tolerance George A. Reis, Jonathan Chang, Neil Vachharajani, Ram Rangan, David I. August Princeton University International

EDDIEDDI

• Software-only fault detection systemSoftware-only fault detection system• Operates by duplicating program instructions and using this Operates by duplicating program instructions and using this

redundant execution to achieve fault tolerance.redundant execution to achieve fault tolerance.• Program instructions:Program instructions:

duplicated by the compiler duplicated by the compiler intertwined with the original program instructionsintertwined with the original program instructions

• Each copy of the program uses different registers and Each copy of the program uses different registers and different memory location for not to interfere with another.different memory location for not to interfere with another.

• Check instructions are inserted at certain synchronization Check instructions are inserted at certain synchronization points by the compiler:points by the compiler: the original instructions and their redundant copies agree on the original instructions and their redundant copies agree on

the computed values.the computed values.

Page 20: SWIFT: Software Implemented Fault Tolerance George A. Reis, Jonathan Chang, Neil Vachharajani, Ram Rangan, David I. August Princeton University International

EDDIEDDI

• Program correctness is defined by the output of a programProgram correctness is defined by the output of a program

• Assuming memory-mapped I/O, a program has executed Assuming memory-mapped I/O, a program has executed correctly if all stores in the program have executed correctly if all stores in the program have executed correctly. correctly.

• Two types of instructions should be used as Two types of instructions should be used as synchronization points for comparing redundant values:synchronization points for comparing redundant values: Store instructionsStore instructions Branch instructions (misdirected branches can cause stores Branch instructions (misdirected branches can cause stores

to be skipped, incorrect stores to be executed, or incorrect to be skipped, incorrect stores to be executed, or incorrect values to ultimately feed a store)values to ultimately feed a store)

Page 21: SWIFT: Software Implemented Fault Tolerance George A. Reis, Jonathan Chang, Neil Vachharajani, Ram Rangan, David I. August Princeton University International

EDDI Fault Detection

1:1: The load from a global constant address is duplicated The load from a global constant address is duplicated

2:2: Add instruction is duplicated (to create redundant chain of Add instruction is duplicated (to create redundant chain of computation)computation)

3 & 4:3 & 4: The store’s operands are compared to their redundant The store’s operands are compared to their redundant copies.copies.

5:5: If any difference is detected, an error is reported If any difference is detected, an error is reported

6:6: If no difference is detected, storing values are executed to If no difference is detected, storing values are executed to non-conflicting addresses.non-conflicting addresses.

Page 22: SWIFT: Software Implemented Fault Tolerance George A. Reis, Jonathan Chang, Neil Vachharajani, Ram Rangan, David I. August Princeton University International

EDDI Fault DetectionEDDI Fault Detection

• An optimizing compiler (or dynamic hardware scheduler) is An optimizing compiler (or dynamic hardware scheduler) is free to schedule the instructions to use additional available free to schedule the instructions to use additional available ILP (minimizing the performance penalty of the ILP (minimizing the performance penalty of the transformaiton).transformaiton).

• Two different types of redundancy is exploited: Two different types of redundancy is exploited: Temporal RedundancyTemporal Redundancy: :

• The redundant duplicates are executed sequentiallyThe redundant duplicates are executed sequentially• Computes the same data value at two different times, usually on Computes the same data value at two different times, usually on

the same hardwarethe same hardware Spatial RedundancySpatial Redundancy::

• The redundant duplicates are executed in paralelThe redundant duplicates are executed in paralel• Computes the same data value in two different pieces of Computes the same data value in two different pieces of

hardware, usually at the same timehardware, usually at the same time

Page 23: SWIFT: Software Implemented Fault Tolerance George A. Reis, Jonathan Chang, Neil Vachharajani, Ram Rangan, David I. August Princeton University International

Eliminating the Memory PenaltyEliminating the Memory Penalty

• EDDI is able to effectively detect transient faults at the cost EDDI is able to effectively detect transient faults at the cost of significant memory overhead.of significant memory overhead.

• Each memory location needs a shadow memory location Each memory location needs a shadow memory location for use with redundant duplicate. This duplication incurs: for use with redundant duplicate. This duplication incurs: a significant hardware costa significant hardware cost a significant performance cost (since cache sizes are a significant performance cost (since cache sizes are

effectively halved and additional memory traffic is created)effectively halved and additional memory traffic is created)

• In the paper, it is proposed to eliminate the use of two In the paper, it is proposed to eliminate the use of two distinct memory locations for all memory values eliminating distinct memory locations for all memory values eliminating duplicate duplicate store instructionsstore instructions. (Load instructions necessary). (Load instructions necessary)

• Modifications will Modifications will not reduce the fault detection not reduce the fault detection coverage of the systemcoverage of the system, but will make , but will make the protected the protected code execute more efficientlycode execute more efficiently and and require less memoryrequire less memory..

Page 24: SWIFT: Software Implemented Fault Tolerance George A. Reis, Jonathan Chang, Neil Vachharajani, Ram Rangan, David I. August Princeton University International

Eliminating the Memory PenaltyEliminating the Memory Penalty

• EDDI with eliminated memory penalty can be referred as:EDDI with eliminated memory penalty can be referred as:

EDDI + ECC EDDI + ECC

Page 25: SWIFT: Software Implemented Fault Tolerance George A. Reis, Jonathan Chang, Neil Vachharajani, Ram Rangan, David I. August Princeton University International

Control Flow CheckingControl Flow Checking

• EDDI also suffers from incomplete protection for control EDDI also suffers from incomplete protection for control flow faults.flow faults.

• A program’s control flow can get errorneously misdirected A program’s control flow can get errorneously misdirected without detection. The corruption can happen without detection. The corruption can happen during the execution of the branchduring the execution of the branch during register corruption after branch check instructionsduring register corruption after branch check instructions due to a fault in the instruction pointer update logicdue to a fault in the instruction pointer update logic

• To make EDDI more robust, additional checks can be To make EDDI more robust, additional checks can be inserted to ensure control flow is being transfered properlyinserted to ensure control flow is being transfered properly

Page 26: SWIFT: Software Implemented Fault Tolerance George A. Reis, Jonathan Chang, Neil Vachharajani, Ram Rangan, David I. August Princeton University International

Control Flow CheckingControl Flow Checking

• EDDI + ECC with control flow validation can be referred as:EDDI + ECC with control flow validation can be referred as:

EDDI + ECC + CF EDDI + ECC + CF

Page 27: SWIFT: Software Implemented Fault Tolerance George A. Reis, Jonathan Chang, Neil Vachharajani, Ram Rangan, David I. August Princeton University International

Control Flow CheckingControl Flow Checking

• Each block will be assigned a signature in order to verify Each block will be assigned a signature in order to verify that control transfer is in the appropriate control block.that control transfer is in the appropriate control block.

• GSR (General Signature Register), a designated general GSR (General Signature Register), a designated general purpose register, will hold the signatures and will be used purpose register, will hold the signatures and will be used to detect faults.to detect faults.

• The procedure will go on in the following manner:The procedure will go on in the following manner: GSR will contain the signature for currently executing blockGSR will contain the signature for currently executing block Upon entry to any block, GSR will be xor’ed with a statistically Upon entry to any block, GSR will be xor’ed with a statistically

determined constant to transform the previous block’s determined constant to transform the previous block’s signature into the current block’s signaturesignature into the current block’s signature

After transformation, GSR can be compared to the After transformation, GSR can be compared to the statistically assigned signature for the block to ensure that a statistically assigned signature for the block to ensure that a legal control transfer is occuredlegal control transfer is occured

Page 28: SWIFT: Software Implemented Fault Tolerance George A. Reis, Jonathan Chang, Neil Vachharajani, Ram Rangan, David I. August Princeton University International

Control Flow CheckingControl Flow Checking

• Using statistically-determined constant forces two blocks Using statistically-determined constant forces two blocks which both jump to a common block (a control flow merge) which both jump to a common block (a control flow merge) to share the same signatureto share the same signature undesirable, since faults which transfer control to or from undesirable, since faults which transfer control to or from

blocks that share the same signature will go undetected.blocks that share the same signature will go undetected.

• Run-time adjusting signatureRun-time adjusting signature can be used instead can be used instead is assigned to another designated registeris assigned to another designated register at entry of a block, this signature, GSR and predetermined at entry of a block, this signature, GSR and predetermined

constant are all xor’ed together to form new GSRconstant are all xor’ed together to form new GSR

• It can be different depending on the source of control It can be different depending on the source of control transfer, so it can be used to compensate for differences in transfer, so it can be used to compensate for differences in signatures between source blockssignatures between source blocks

Page 29: SWIFT: Software Implemented Fault Tolerance George A. Reis, Jonathan Chang, Neil Vachharajani, Ram Rangan, David I. August Princeton University International

Control Flow Checking

1 & 2:1 & 2: Redundant duplicates for add Redundant duplicates for add and compare instructionsand compare instructions

3 to 7:3 to 7: Compare the predicate Compare the predicate p11p11 to to its redundant duplicate its redundant duplicate p21p21 and and branch to error code if a fault is branch to error code if a fault is detecteddetected

8:8: Transforms the GSR from the Transforms the GSR from the previous block to the signature for previous block to the signature for this block (The control flow additions this block (The control flow additions begin)begin)

9 & 10:9 & 10: Ensure that signature is correct Ensure that signature is correct (otherwise error code is invoked)(otherwise error code is invoked)

11 to 13:11 to 13: Handles the synchronization Handles the synchronization point induced by the later store point induced by the later store instructioninstruction

Page 30: SWIFT: Software Implemented Fault Tolerance George A. Reis, Jonathan Chang, Neil Vachharajani, Ram Rangan, David I. August Princeton University International

Control Flow CheckingControl Flow Checking

• The transformationThe transformation detect any fault that causes a control transfer between two detect any fault that causes a control transfer between two

blocks that should not jump to one another (blocks that should not jump to one another (which yields which yields incorrect signatures even if the errorneous transfer jumps to the incorrect signatures even if the errorneous transfer jumps to the middle of a basic blockmiddle of a basic block))

ensures only that the control flow is diverted to the taken or ensures only that the control flow is diverted to the taken or untaken pathuntaken path

does not ensure that the correct direction of the conditional does not ensure that the correct direction of the conditional branch is takenbranch is taken

• The base EDDI transformation The base EDDI transformation provides reasonable guarantees (the branches input provides reasonable guarantees (the branches input

operands are verified prior to its execution)operands are verified prior to its execution) does not detect faults that occur during the execution of does not detect faults that occur during the execution of

a branch instruction which influence branch directiona branch instruction which influence branch direction

Page 31: SWIFT: Software Implemented Fault Tolerance George A. Reis, Jonathan Chang, Neil Vachharajani, Ram Rangan, David I. August Princeton University International

Enhanced Control Flow CheckingEnhanced Control Flow Checking

• To extend fault detection coverage to cases where branch To extend fault detection coverage to cases where branch instruction execution is concerned, an enhanced control instruction execution is concerned, an enhanced control flow checking transformation is proposed:flow checking transformation is proposed: EDDI + ECC + CFEEDDI + ECC + CFE similar for blocks using run-time adjusting signaturessimilar for blocks using run-time adjusting signatures increases the reliability of the control flow checkingincreases the reliability of the control flow checking

• Enhanced mechanism uses a Enhanced mechanism uses a dynamic equivalent of a dynamic equivalent of a run-time adjusting signaturerun-time adjusting signature for all blocks, even those for all blocks, even those that are not control flow mergesthat are not control flow merges

• Each block asserts its target using this signature and each Each block asserts its target using this signature and each target confirms the transfer by checking GSR.target confirms the transfer by checking GSR.

• This This signature combined with the GSR serve as a signature combined with the GSR serve as a redundant duplicate for the program counter.redundant duplicate for the program counter.

Page 32: SWIFT: Software Implemented Fault Tolerance George A. Reis, Jonathan Chang, Neil Vachharajani, Ram Rangan, David I. August Princeton University International

Enhanced Control Flow Checking

1 & 2:1 & 2: Redundant duplicates for add Redundant duplicates for add and compare instructionsand compare instructions

The synchronization check before the The synchronization check before the branch instruction omittedbranch instruction omitted

3:3: Computes the run-time signature for Computes the run-time signature for the target of branch by xor’ing the the target of branch by xor’ing the signature of the current block, with signature of the current block, with signature of target blocksignature of target block

Branch is predicted, so the Branch is predicted, so the assignment to RTS is predicted assignment to RTS is predicted using redundant duplicate for the using redundant duplicate for the predicate registerpredicate register

4:4: Equivalent of 3 for the fall through Equivalent of 3 for the fall through control transfer control transfer

Page 33: SWIFT: Software Implemented Fault Tolerance George A. Reis, Jonathan Chang, Neil Vachharajani, Ram Rangan, David I. August Princeton University International

Enhanced Control Flow Checking

5:5: Xors RTS with the GSR to compute Xors RTS with the GSR to compute the signature of the new block, at the signature of the new block, at the target of a control transferthe target of a control transfer

6:6: Compares the signature in 5 with Compares the signature in 5 with the statistically assigned signaturethe statistically assigned signature

7:7: Error code is invoked if there is a Error code is invoked if there is a mismatch in 6mismatch in 6

8 & 9:8 & 9: Implement the synchronization Implement the synchronization checks for the store instructionchecks for the store instruction

10:10: Error code is invoked if there is a Error code is invoked if there is a mismatch in 8 or 9 mismatch in 8 or 9

Page 34: SWIFT: Software Implemented Fault Tolerance George A. Reis, Jonathan Chang, Neil Vachharajani, Ram Rangan, David I. August Princeton University International

Enhanced Control Flow CheckingEnhanced Control Flow Checking

• Even if a branch is incorrectly executed, the fault will be Even if a branch is incorrectly executed, the fault will be detected since RTS register will have the incorrect value:detected since RTS register will have the incorrect value: more robustly protects against against transient faultsmore robustly protects against against transient faults

• The EDDI + ECC + CF control flow checkingThe EDDI + ECC + CF control flow checking ensures that execution is transfered to a valid control blockensures that execution is transfered to a valid control block does not ensure that correct conditional control path is takendoes not ensure that correct conditional control path is taken

• The enhanced control flow checking detects this case by:The enhanced control flow checking detects this case by: Dynamically updating the target signature based on the Dynamically updating the target signature based on the

redundant conditional instructions (3)redundant conditional instructions (3) Checking at the beginning of each control block (5, 6, 7)Checking at the beginning of each control block (5, 6, 7)

Page 35: SWIFT: Software Implemented Fault Tolerance George A. Reis, Jonathan Chang, Neil Vachharajani, Ram Rangan, David I. August Princeton University International

SWIFTSWIFT

• The following optimizations applied to the EDDI + ECC + The following optimizations applied to the EDDI + ECC + CFE transformation comprise SWIFT:CFE transformation comprise SWIFT:

Control flow checking at blocks with storesControl flow checking at blocks with stores

Redundancy in branch/control flow checkingRedundancy in branch/control flow checking

Page 36: SWIFT: Software Implemented Fault Tolerance George A. Reis, Jonathan Chang, Neil Vachharajani, Ram Rangan, David I. August Princeton University International

Control Flow Checking at Blocks with StoresControl Flow Checking at Blocks with Stores

• It is only the store instructions that ultimately send data out It is only the store instructions that ultimately send data out of the SoR:of the SoR: should execute only if they “meant to”should execute only if they “meant to” should write the correct data to the correct address should write the correct data to the correct address

• This observation can be used to restrict enhanced control This observation can be used to restrict enhanced control flow checking only to blocks which have stores in them.flow checking only to blocks which have stores in them. the updates to GSR and RTS are performed in all blocksthe updates to GSR and RTS are performed in all blocks signature comparisons are restricted to blocks with stores signature comparisons are restricted to blocks with stores

(any deviation from valid control flow path to that point will be (any deviation from valid control flow path to that point will be detected before memory and output is corrupted)detected before memory and output is corrupted)

signature check instructions are removed (SCFOpti)signature check instructions are removed (SCFOpti)

• By this optimization, By this optimization, performance is increasedperformance is increased and and static static size is reducedsize is reduced for for no reduction in reliabilityno reduction in reliability..

Page 37: SWIFT: Software Implemented Fault Tolerance George A. Reis, Jonathan Chang, Neil Vachharajani, Ram Rangan, David I. August Princeton University International

Redundancy in Branch/Control Flow Redundancy in Branch/Control Flow CheckingChecking

• Branch Checking:Branch Checking: branches are taken in proper direction branches are taken in proper direction

• Enhanced Control Flow Checking:Enhanced Control Flow Checking: all control transfers all control transfers are made to the proper addressare made to the proper address

• Verifying Verifying allall control flow control flow subsumessubsumes the notion of branching the notion of branching in the right directionin the right direction

• By removing branch checking (BROpti)By removing branch checking (BROpti) reduction in performance and static size overheadreduction in performance and static size overhead no reduction in reliabilityno reduction in reliability

Page 38: SWIFT: Software Implemented Fault Tolerance George A. Reis, Jonathan Chang, Neil Vachharajani, Ram Rangan, David I. August Princeton University International

Undetected Errors – Points of FailureUndetected Errors – Points of Failure

• Redundancy is introduced solely via software instructionsRedundancy is introduced solely via software instructions a delay between validation and use of the validated register a delay between validation and use of the validated register

valuesvalues any strikes during this gap might corrupt stateany strikes during this gap might corrupt state bit flips in store address or data registers are uncaughtbit flips in store address or data registers are uncaught incorrect store values or address -> incorrect writes going incorrect store values or address -> incorrect writes going

outside the SoR -> outside the SoR -> Incorrect Program ExecutionIncorrect Program Execution

• When an instruction opcode is changed to a store When an instruction opcode is changed to a store instruction by a transient fault:instruction by a transient fault: The compiler did not see instruction: Stores are unprotectedThe compiler did not see instruction: Stores are unprotected The store will be free to execute and its value will leave SoRThe store will be free to execute and its value will leave SoR

Page 39: SWIFT: Software Implemented Fault Tolerance George A. Reis, Jonathan Chang, Neil Vachharajani, Ram Rangan, David I. August Princeton University International

Multibit ErrorsMultibit Errors

• Code transformations are less effective at detecting multibit Code transformations are less effective at detecting multibit faults, which can cause problems in:faults, which can cause problems in: when the same bit is flipped in both the original and when the same bit is flipped in both the original and

redundant computation redundant computation (Case 1)(Case 1) when a bit is flipped in either the original or redundant when a bit is flipped in either the original or redundant

computation and the comparison is also flipped such that it computation and the comparison is also flipped such that it does not branch the code does not branch the code (Case 2)(Case 2)

• These patterns of multibit errors are unlikely enough to be These patterns of multibit errors are unlikely enough to be safely ignoredsafely ignored

• A dual-upset fault model, wherein two faults are injected A dual-upset fault model, wherein two faults are injected into each program with a uniformly random distribution, is into each program with a uniformly random distribution, is used in probability calculatingused in probability calculating

Page 40: SWIFT: Software Implemented Fault Tolerance George A. Reis, Jonathan Chang, Neil Vachharajani, Ram Rangan, David I. August Princeton University International

Probability of Multibit Errors – Case 1Probability of Multibit Errors – Case 1

• ““The same bit is flipped in both the original and redundant The same bit is flipped in both the original and redundant computation”computation”

• Assumption:Assumption: The same fault must occur in the same bit of the same The same fault must occur in the same bit of the same

instruction for the fault to go undetectedinstruction for the fault to go undetected

• Probability of Error:Probability of Error:

Probability of that particular instruction being chosen (Probability of that particular instruction being chosen (average average SPEC benchmark has on the order of 10^9 to 10^11 dynamic instrSPEC benchmark has on the order of 10^9 to 10^11 dynamic instr.).)

timestimes Probability of a particular bit being chosen (64-bit registers)Probability of a particular bit being chosen (64-bit registers)

Page 41: SWIFT: Software Implemented Fault Tolerance George A. Reis, Jonathan Chang, Neil Vachharajani, Ram Rangan, David I. August Princeton University International

Probability of Multibit Errors – Case 2Probability of Multibit Errors – Case 2

• ““a bit is flipped in either the original or redundant a bit is flipped in either the original or redundant computation and the comparison is also flipped such that it computation and the comparison is also flipped such that it does not branch the code”does not branch the code”

• Assumption:Assumption: There is only one comparison for every possible faultThere is only one comparison for every possible fault

• Probability of Error:Probability of Error: P(errorcomparison|errororiginal) = 1 / #instructions

• This is a gross This is a gross overestimateoverestimate because in reality, there may because in reality, there may be many checks on a faulty value.be many checks on a faulty value.

Page 42: SWIFT: Software Implemented Fault Tolerance George A. Reis, Jonathan Chang, Neil Vachharajani, Ram Rangan, David I. August Princeton University International

Implementation DetailsImplementation Details

• Details specific to the implementation and deployment of Details specific to the implementation and deployment of SWIFT:SWIFT:

Different options for calling conventionDifferent options for calling convention

Implementations on multiprocessor systemsImplementations on multiprocessor systems

The effects of using an ISA with prediction (IA64)The effects of using an ISA with prediction (IA64)

Page 43: SWIFT: Software Implemented Fault Tolerance George A. Reis, Jonathan Chang, Neil Vachharajani, Ram Rangan, David I. August Princeton University International

Function CallsFunction Calls

• Function calls are made as synchronization points:Function calls are made as synchronization points: Before function call, all input operands are checked against Before function call, all input operands are checked against

their redundant copiestheir redundant copies• if mismatch, fault is detectedif mismatch, fault is detected• o/w th original versions are passed as parameters to the functiono/w th original versions are passed as parameters to the function

At the beginning of function, parameters must be duplicatedAt the beginning of function, parameters must be duplicated

• Redundant and original versionsRedundant and original versions• Only one version of return (must be duplicated into redundant Only one version of return (must be duplicated into redundant

versions for the remaining redundant code of function)versions for the remaining redundant code of function)

• Adds performance overhead and introduces vulnerability:Adds performance overhead and introduces vulnerability: Faults that occur on the parameters Faults that occur on the parameters after the checks by callerafter the checks by caller

and and before the duplication by the calleebefore the duplication by the callee will not be caught will not be caught

Page 44: SWIFT: Software Implemented Fault Tolerance George A. Reis, Jonathan Chang, Neil Vachharajani, Ram Rangan, David I. August Princeton University International

Function CallsFunction Calls

• The calling convention should be altered The calling convention should be altered to pass multiple sets of computed arguments to a functionto pass multiple sets of computed arguments to a function to return multiple return values from a functionto return multiple return values from a function

• Arguments passed in the registers need to be duplicated, Arguments passed in the registers need to be duplicated, not the ones in memory (not the ones in memory (memory is outside the SoRmemory is outside the SoR))

• Multiple return values require that an extra register be Multiple return values require that an extra register be reserved for the replicated return value:reserved for the replicated return value: additional pressure of twice as many input and output additional pressure of twice as many input and output

registersregisters fault detection is preserved accross function callsfault detection is preserved accross function calls

Page 45: SWIFT: Software Implemented Fault Tolerance George A. Reis, Jonathan Chang, Neil Vachharajani, Ram Rangan, David I. August Princeton University International

Shared Memory, Interrupts, and ExceptionsShared Memory, Interrupts, and Exceptions

• When multiple processes communicate with each other When multiple processes communicate with each other using shared memory, the compiler can not possibly using shared memory, the compiler can not possibly enforce an ordering of reads and writes across processes. enforce an ordering of reads and writes across processes.

• There is always the possibilty of intervening writes from There is always the possibilty of intervening writes from other processes and two loads of a duplicated pair of loads other processes and two loads of a duplicated pair of loads are not guaranteed to return the same valueare not guaranteed to return the same value not reduce the fault-coverage of the system in any waynot reduce the fault-coverage of the system in any way increase the detected fault count by contributing to the increase the detected fault count by contributing to the

number of detected faults that would not caused a failurenumber of detected faults that would not caused a failure

• Similar when an interrupt or exception occurs between the Similar when an interrupt or exception occurs between the two loads of a duplicated pair and the interrupt or exception two loads of a duplicated pair and the interrupt or exception handler changes the contents at the load addresshandler changes the contents at the load address

Page 46: SWIFT: Software Implemented Fault Tolerance George A. Reis, Jonathan Chang, Neil Vachharajani, Ram Rangan, David I. August Princeton University International

Shared Memory, Interrupts, and ExceptionsShared Memory, Interrupts, and Exceptions

• Hardware Solutions:Hardware Solutions: ““Safe” hardware-based load value duplication techniques (the Safe” hardware-based load value duplication techniques (the

Active Load Address Buffer (ALAB) or the Load Value Queue Active Load Address Buffer (ALAB) or the Load Value Queue in RMT machines) adapted to a SWIFT system in RMT machines) adapted to a SWIFT system (costly)(costly)

• No Duplication for Loads:No Duplication for Loads: Compiler does only one load (instead of two) and duplicates Compiler does only one load (instead of two) and duplicates

the loaded value for original & redundant version consumersthe loaded value for original & redundant version consumers Removes redundancy from the load executionRemoves redundancy from the load execution

Page 47: SWIFT: Software Implemented Fault Tolerance George A. Reis, Jonathan Chang, Neil Vachharajani, Ram Rangan, David I. August Princeton University International

Shared Memory, Interrupts, and ExceptionsShared Memory, Interrupts, and Exceptions

• Dealing with Potentially-Excepting Instructions:Dealing with Potentially-Excepting Instructions: Compiler knows a priori, certain instructions may cause faults, Compiler knows a priori, certain instructions may cause faults,

and enforces a schedule in which pairs of loads are not split and enforces a schedule in which pairs of loads are not split across themacross them

Prevents most exceptions to be raised between two verisons Prevents most exceptions to be raised between two verisons of a load instructionof a load instruction

Redundancy in load executionRedundancy in load execution

Asynchronous signals and interrupts can not be handledAsynchronous signals and interrupts can not be handled• Hardware solutionHardware solution• Single-load solutionSingle-load solution

Page 48: SWIFT: Software Implemented Fault Tolerance George A. Reis, Jonathan Chang, Neil Vachharajani, Ram Rangan, David I. August Princeton University International

Logical Masking from PredicationLogical Masking from Predication

• Consider the branch Consider the branch br r1 != r2br r1 != r2 :: in the absence of a fault, if a branch were to be taken, even in the absence of a fault, if a branch were to be taken, even

after a strike to either r1 or r2, condition would still hold trueafter a strike to either r1 or r2, condition would still hold true• the error can be safely ignoredthe error can be safely ignored

• Logical maskingLogical masking allows the fault detection mechanism to be less conservative allows the fault detection mechanism to be less conservative

in detecting errorsin detecting errors reduces the overall false detected unrecoverable fault countreduces the overall false detected unrecoverable fault count

• Special checks are needed to check logical maskingSpecial checks are needed to check logical masking Predicted architectures naturally provide logical masking Predicted architectures naturally provide logical masking

• IA64 ISA:IA64 ISA: conditional branches are executed based on a conditional branches are executed based on a predicate value, compared by prior predicate-defining predicate value, compared by prior predicate-defining instructions (no validation before them)instructions (no validation before them)

Page 49: SWIFT: Software Implemented Fault Tolerance George A. Reis, Jonathan Chang, Neil Vachharajani, Ram Rangan, David I. August Princeton University International

Evaluation - PerformanceEvaluation - Performance

• A pre-release version of OpenIMPACT compiler (modified A pre-release version of OpenIMPACT compiler (modified to add redundancy) targeted at Intel Itanium 2 processors to add redundancy) targeted at Intel Itanium 2 processors running RedHat Advanced Workstation 2.1 with 4Gb running RedHat Advanced Workstation 2.1 with 4Gb

• A version created for each of the A version created for each of the EDDI + ECC + CFE EDDI + ECC + CFE SWIFT techniquesSWIFT techniques

• Versions are also created with each of the specific Versions are also created with each of the specific optimizations removed, to see the effects individuallyoptimizations removed, to see the effects individually SWIFT-SCFopti:SWIFT-SCFopti: to analyze the control-flow checking only at to analyze the control-flow checking only at

blocks with storesblocks with stores SWIFT-BRopti:SWIFT-BRopti: to analyze branch checking optimization to analyze branch checking optimization

Page 50: SWIFT: Software Implemented Fault Tolerance George A. Reis, Jonathan Chang, Neil Vachharajani, Ram Rangan, David I. August Princeton University International

Evaluation - PerformanceEvaluation - Performance

• Compilers used to evaluate techniques on benchmarks:Compilers used to evaluate techniques on benchmarks: SPEC CINT2000, SPEC FP2000,SPEC CINT95,Media BenchSPEC CINT2000, SPEC FP2000,SPEC CINT95,Media Bench

• Executions compared against binaries generated by the Executions compared against binaries generated by the original OpenIMPACT compiler original OpenIMPACT compiler (have no fault detection)(have no fault detection)

• The fault detection code was inserted into the low level The fault detection code was inserted into the low level code immediately before register allocation and schedulingcode immediately before register allocation and scheduling

• Optimizations that would have interfered eith the duplicated Optimizations that would have interfered eith the duplicated and detection code, Common Subexpression Elimination, and detection code, Common Subexpression Elimination, modified to respect the fault detecting codemodified to respect the fault detecting code

Page 51: SWIFT: Software Implemented Fault Tolerance George A. Reis, Jonathan Chang, Neil Vachharajani, Ram Rangan, David I. August Princeton University International

Evaluation - PerformanceEvaluation - Performance

• Normalized execution timesNormalized execution times EDDI + ECC + CFEEDDI + ECC + CFE: : geometric mean of 1,62 geometric mean of 1,62

• no fault detection IMPACT binariesno fault detection IMPACT binaries• does comparisons of the values used before every branchdoes comparisons of the values used before every branch

SWIFT:SWIFT: geometric mean of 1,41 geometric mean of 1,41 Indicates that methods are exploiting the unused processor Indicates that methods are exploiting the unused processor

resources present during the execution of the baseline program resources present during the execution of the baseline program

• Optimization due to control flow checking accounts for differenceOptimization due to control flow checking accounts for difference

Page 52: SWIFT: Software Implemented Fault Tolerance George A. Reis, Jonathan Chang, Neil Vachharajani, Ram Rangan, David I. August Princeton University International

Evaluation - PerformanceEvaluation - Performance

• Normalized IPCNormalized IPC EDDI + ECC + CFEEDDI + ECC + CFE: : geometric mean of 1,53 geometric mean of 1,53

• the additional branch checks enable more independent work and the additional branch checks enable more independent work and increase IPCincrease IPC

SWIFT:SWIFT: geometric mean of 1,48 geometric mean of 1,48

• Scheduling both version of program together enables a Scheduling both version of program together enables a normalized IPC of 1,5 (compared with non-detecting executions)normalized IPC of 1,5 (compared with non-detecting executions)

Page 53: SWIFT: Software Implemented Fault Tolerance George A. Reis, Jonathan Chang, Neil Vachharajani, Ram Rangan, David I. August Princeton University International

Evaluation - PerformanceEvaluation - Performance

• Static sizes of the binaries normalized to baseline with no Static sizes of the binaries normalized to baseline with no detection abilitydetection ability

EDDI + ECC + CFEEDDI + ECC + CFE: : 2,83x larger 2,83x larger SWIFT:SWIFT: 2,40x larger (does not generate the extra instructions 2,40x larger (does not generate the extra instructions

eliminated by the optimizations)eliminated by the optimizations)• Control block %2 – Branch checking %13 reduces static sizeControl block %2 – Branch checking %13 reduces static size

• Techniques duplicate all instructions except for NOPs, stores, Techniques duplicate all instructions except for NOPs, stores, branches and then insert detection code branches and then insert detection code

Page 54: SWIFT: Software Implemented Fault Tolerance George A. Reis, Jonathan Chang, Neil Vachharajani, Ram Rangan, David I. August Princeton University International

Evaluation - PerformanceEvaluation - Performance

• Light-grey region: Fraction of total dynamic instructions (NOP)Light-grey region: Fraction of total dynamic instructions (NOP)• The dynamic instruction counts normalized to baselineThe dynamic instruction counts normalized to baseline

EDDI + ECC + CFEEDDI + ECC + CFE: : geometric mean of 2,73 geometric mean of 2,73 SWIFT:SWIFT: geometric mean of 2,23 geometric mean of 2,23 follows the same trend as the static binary size numbersfollows the same trend as the static binary size numbers however, grows disproportionally to the static binary size increaseshowever, grows disproportionally to the static binary size increases

• programs spend, on the balance, more of execution time in branch-programs spend, on the balance, more of execution time in branch-heavy or store-heavy routines.heavy or store-heavy routines.

Page 55: SWIFT: Software Implemented Fault Tolerance George A. Reis, Jonathan Chang, Neil Vachharajani, Ram Rangan, David I. August Princeton University International

Evaluation – Fault DetectionEvaluation – Fault Detection

• Pin is used to instrument binaries:Pin is used to instrument binaries: Binaries profiled to detect how many times each static Binaries profiled to detect how many times each static

instruction is executedinstruction is executed Libc randLibc rand function is used to select number of faults to insert function is used to select number of faults to insert

into the programinto the program The fault injection rate per dynamic instruction is normalized The fault injection rate per dynamic instruction is normalized

(exactly one fault per run on baseline builds)(exactly one fault per run on baseline builds) For each fault, a number between zero and the number of For each fault, a number between zero and the number of

total dynamic instructions is chosentotal dynamic instructions is chosen This number is used to choose a static instruction using the This number is used to choose a static instruction using the

weights from the profileweights from the profile A specific instance of this static instruction in the dynamic A specific instance of this static instruction in the dynamic

instruction stream to instrument is choseninstruction stream to instrument is chosen

Page 56: SWIFT: Software Implemented Fault Tolerance George A. Reis, Jonathan Chang, Neil Vachharajani, Ram Rangan, David I. August Princeton University International

Evaluation – Fault DetectionEvaluation – Fault Detection

• In the experiments, only the following registers are modifiedIn the experiments, only the following registers are modified general purpose registersgeneral purpose registers floating point registersfloating point registers predicate registerspredicate registers

• Dynamic instruction is instrumented as follows:Dynamic instruction is instrumented as follows: one of the outputs of the instruction chosen at randomone of the outputs of the instruction chosen at random a random bit of this output register is flippeda random bit of this output register is flipped predicates are considered 1-bit entitiespredicates are considered 1-bit entities execution continues normally and result is recordedexecution continues normally and result is recorded execution output is also recorded and compared against execution output is also recorded and compared against

known good inputsknown good inputs

Page 57: SWIFT: Software Implemented Fault Tolerance George A. Reis, Jonathan Chang, Neil Vachharajani, Ram Rangan, David I. August Princeton University International

Evaluation – Fault DetectionEvaluation – Fault Detection

• Correct:Correct: Both the execution and check are successful Both the execution and check are successful• Segfault:Segfault: The execution fails due to The execution fails due to

SIGSEGV:SIGSEGV: Segmentation fault due to the access of an illegal Segmentation fault due to the access of an illegal addressaddress

SIGILL:SIGILL: Illegal instruction due to the consumption of a NaT bit Illegal instruction due to the consumption of a NaT bit generated by a faulting speculative load instructiongenerated by a faulting speculative load instruction

SIGBUS:SIGBUS: Bus error due to unaligned access Bus error due to unaligned access

• Fault Detected:Fault Detected: The execution fails due to a fault being The execution fails due to a fault being detecteddetected

• Incorrect:Incorrect: Runs do not satisfy any of the aboveRuns do not satisfy any of the above

Page 58: SWIFT: Software Implemented Fault Tolerance George A. Reis, Jonathan Chang, Neil Vachharajani, Ram Rangan, David I. August Princeton University International

Evaluation – Fault DetectionEvaluation – Fault Detection

• NOFT:NOFT: No fault tolerance (incorrect results up to 50% time) No fault tolerance (incorrect results up to 50% time)• EDDI + ECC + CFE and SWIFTEDDI + ECC + CFE and SWIFT

detect all but the most pathological single-upset faults -> detect all but the most pathological single-upset faults -> detect all of the faults which yields incorrect outputsdetect all of the faults which yields incorrect outputs

some faults which would have resulted in segfaults are some faults which would have resulted in segfaults are detected in builds before the segfault can actually occur -> detected in builds before the segfault can actually occur -> number of segfaults are reducednumber of segfaults are reduced

despite fault injection into every run, binaries ran successfully despite fault injection into every run, binaries ran successfully with lower rates of success (injected fault number is higher for with lower rates of success (injected fault number is higher for builds with fault detection due to higher dynamic instr. count) builds with fault detection due to higher dynamic instr. count)

Page 59: SWIFT: Software Implemented Fault Tolerance George A. Reis, Jonathan Chang, Neil Vachharajani, Ram Rangan, David I. August Princeton University International

Evaluation – Fault DetectionEvaluation – Fault Detection

• EDDI + ECC + CFE and SWIFT are a bit overzealous in EDDI + ECC + CFE and SWIFT are a bit overzealous in fault detectionfault detection big difference in the correct rates (when compared to NOFT)big difference in the correct rates (when compared to NOFT) in 129.compress, rate of correct execution is 1% (NOFT 63%)in 129.compress, rate of correct execution is 1% (NOFT 63%) the balance largely being made up of faults detectedthe balance largely being made up of faults detected

Page 60: SWIFT: Software Implemented Fault Tolerance George A. Reis, Jonathan Chang, Neil Vachharajani, Ram Rangan, David I. August Princeton University International

Evaluation – Fault DetectionEvaluation – Fault Detection

• Reasons for the fall in correct execution:Reasons for the fall in correct execution: large portion of execution time spent in initializing hash table large portion of execution time spent in initializing hash table

(orders of magnitude larger than input)(orders of magnitude larger than input)

many of the stores are superfluous (not affect output)many of the stores are superfluous (not affect output)

SWIFT nevertheless detects faults on these stores (it can not SWIFT nevertheless detects faults on these stores (it can not know a priori whether or not the output will depend on them)know a priori whether or not the output will depend on them)

Page 61: SWIFT: Software Implemented Fault Tolerance George A. Reis, Jonathan Chang, Neil Vachharajani, Ram Rangan, David I. August Princeton University International

Evaluation – Fault DetectionEvaluation – Fault Detection

• There is a statistically significant difference between fault There is a statistically significant difference between fault detection rate of EDDI + ECC + CFE (1) versus SWIFT (2)detection rate of EDDI + ECC + CFE (1) versus SWIFT (2)

Faults injected on the extra comparison instruction generated by 1Faults injected on the extra comparison instruction generated by 1• a fault on these instructions always generates a fault detected a fault on these instructions always generates a fault detected

where as where as

• a fault on the general population of instruction has a nonzero probability of a fault on the general population of instruction has a nonzero probability of generating correct output (or a segmentation fault) when a fault detectedgenerating correct output (or a segmentation fault) when a fault detected

• SWIFT has a slightly lower fault detection rate because SWIFT SWIFT has a slightly lower fault detection rate because SWIFT binaries do not have extra comparison instructions to fault onbinaries do not have extra comparison instructions to fault on

Larger segmentation fault rate in SWIFT binariesLarger segmentation fault rate in SWIFT binaries• Large number of speculative loads in EDDI+ECC+CFE binaries (from Large number of speculative loads in EDDI+ECC+CFE binaries (from

large number of branches around which compiler must schedule loads)large number of branches around which compiler must schedule loads)

• Injected faultInjected fault Cause a segmentation fault in (2)Cause a segmentation fault in (2) With the insertion of a NaT bit on the output register of (1), this bit is checked With the insertion of a NaT bit on the output register of (1), this bit is checked

at the comparison code and detected as fault (rather than segfault)at the comparison code and detected as fault (rather than segfault)

Page 62: SWIFT: Software Implemented Fault Tolerance George A. Reis, Jonathan Chang, Neil Vachharajani, Ram Rangan, David I. August Princeton University International

ConclusionConclusion

• Detection of most transient faults can be accomplished Detection of most transient faults can be accomplished without the need for specialized hardware.without the need for specialized hardware.

• SWIFT is introducedSWIFT is introduced the best performing single-threaded software-based the best performing single-threaded software-based

approach for full out fault detectionapproach for full out fault detection exploits unused instruction-level parallelism resources to exploits unused instruction-level parallelism resources to

efficiently manage fault detectionefficiently manage fault detection realizes a performance increase through enhanced control realizes a performance increase through enhanced control

flow checking (validation points are unnecessary)flow checking (validation points are unnecessary) achieves a 14% speedupachieves a 14% speedup

• SWIFT can be integrated into production compilers to SWIFT can be integrated into production compilers to provide fault detection on today’s commodity hardwareprovide fault detection on today’s commodity hardware

Page 63: SWIFT: Software Implemented Fault Tolerance George A. Reis, Jonathan Chang, Neil Vachharajani, Ram Rangan, David I. August Princeton University International

ReferencesReferences

• S. Padmanabhan, T. Malkemus, R. Agarwal, and A. Jhingran. S. Padmanabhan, T. Malkemus, R. Agarwal, and A. Jhingran. Block oriented processing of relational database operations in Block oriented processing of relational database operations in modern computer architectures. In modern computer architectures. In Proceedings of ICDE Proceedings of ICDE ConferenceConference, 2001., 2001.

• J. Zhou, and K. A. Ross. Bufferring accesses to memory-resident J. Zhou, and K. A. Ross. Bufferring accesses to memory-resident index structures. index structures. In In Proceedings of VLDB conferenceProceedings of VLDB conference, , 20032003..

• K.A. Ross, J. Cieslewicz, J. Rao, and J. Zhou. Architecture K.A. Ross, J. Cieslewicz, J. Rao, and J. Zhou. Architecture Sensitive Database Design: Examples from the Columbia Group. Sensitive Database Design: Examples from the Columbia Group. Bulletin of Bulletin of the IEEE Computer Society Technical Committe on the IEEE Computer Society Technical Committe on Data EngineeringData Engineering, 2005, 2005

Page 64: SWIFT: Software Implemented Fault Tolerance George A. Reis, Jonathan Chang, Neil Vachharajani, Ram Rangan, David I. August Princeton University International

Questions?Questions?

Page 65: SWIFT: Software Implemented Fault Tolerance George A. Reis, Jonathan Chang, Neil Vachharajani, Ram Rangan, David I. August Princeton University International

Thank YouThank You