Reliability. Threads for Fault Tolerance. Multiprocessors: Transient Fault Detection


Post on 20-Dec-2015


Reliability

Threads for Fault Tolerance

Multiprocessors: Transient fault detection

Transient Faults

Faults that persist for a “short” duration
Cause: cosmic rays, energetic particles originating from outer space
Effect: knock off electrons, discharge capacitor
Solution
  no practical absorbent for cosmic rays
1 fault per 1000 computers per year (estimated fault rate)
Future is worse
  smaller feature size, higher transistor count, reduced noise margin
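As a rough illustration of what the estimated rate means for a fleet of machines, the arithmetic can be sketched as below. The function names are hypothetical, and the Poisson model used for the probability of at least one fault is an assumption of this sketch, not something stated in the slides.

```python
import math

# Estimated rate quoted on the slide: 1 fault per 1000 computers per year.
FAULTS_PER_COMPUTER_YEAR = 1 / 1000


def expected_faults(computers: int, years: float = 1.0) -> float:
    """Expected number of transient faults across a fleet."""
    return computers * years * FAULTS_PER_COMPUTER_YEAR


def prob_at_least_one_fault(computers: int, years: float = 1.0) -> float:
    """Chance of seeing one or more faults, ASSUMING independent Poisson arrivals."""
    return 1.0 - math.exp(-expected_faults(computers, years))
```

Under these assumptions a fleet of 10,000 machines would expect about ten transient faults per year.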

Background

Fault tolerant systems use redundancy to improve reliability:
  Time redundancy: separate executions
  Space redundancy: separate physical copies of resources
    DMR/TMR
  Data redundancy
    ECC: Automatic repeat request (ARQ), Forward error correction (FEC)
    Parity: odd/even
Examples:
  IBM: duplicated pipelines, spare processors, ECC in memories...
  HP: DMR/TMR processors, Parity/ECC in buses, memories...
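Two of the redundancy schemes listed above can be sketched in a few lines: a TMR majority voter (space redundancy) and an even-parity check (data redundancy). This is a software illustration only; the function names are hypothetical and real systems implement both in hardware.

```python
from collections import Counter


def tmr_vote(a, b, c):
    """Return the value produced by at least two of three redundant copies."""
    value, count = Counter([a, b, c]).most_common(1)[0]
    if count < 2:
        # All three copies disagree: the fault is detected but cannot be masked.
        raise RuntimeError("no majority among the three copies")
    return value


def even_parity_bit(word: int) -> int:
    """Parity bit that makes the total number of 1 bits (word + parity) even."""
    return bin(word).count("1") % 2


def parity_ok(word: int, parity: int) -> bool:
    """True if the stored parity bit still matches the word."""
    return even_parity_bit(word) == parity
```

TMR masks a single faulty copy, while a single parity bit detects any odd number of flipped bits and corrects none.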

Multiprocessors: Fault Detection

Chip-level Redundantly Threaded processor
  Replicates register values but not memory values
  The leading thread commits stores only after checking
    Memory is guaranteed to be correct
    Other instructions commit without checking
  The leading thread sends committed values for:
    branch outcomes
    load/store values
    store addresses

Sphere of Replication (SoR)

Logical boundary of redundant execution within a system
  Components within protected via redundant execution
  Components outside must be protected via other means
Its size matters:
  Error detection latency
  Stored-state size

Example Spheres of Replication

Compaq Himalaya
ORH-Dual: On-Chip Replicated Hardware (similar to IBM G5)

Fault Detection in Compaq Himalaya System

Replicated Microprocessors + Cycle-by-Cycle Lockstepping

Fault Detection via Simultaneous Multithreading (SMT)

Replicated Microprocessors + Cycle-by-Cycle Lockstepping

Concept

SMT improves the performance of a processor by:
  allowing independent threads to execute simultaneously
  doing so in different functional units
Redundant Multithreading (RMT):
  leverages SMT’s properties to allow fault detection for microprocessors
  runs two copies of the same program as independent threads
  compares their outputs and initiates recovery in case of mismatch
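The RMT idea (two copies, compare, recover on mismatch) can be sketched in software as follows. This is only an analogy: the two copies run one after the other here, not as simultaneous hardware threads, and `run_redundantly`, `checksum` and `faulty_checksum` are hypothetical names introduced for the illustration.

```python
def run_redundantly(leading_copy, trailing_copy, inputs, recover):
    """Run two copies of the same program and compare their outputs."""
    leading_out = leading_copy(inputs)
    trailing_out = trailing_copy(inputs)
    if leading_out != trailing_out:
        # Outputs differ, so a fault hit one of the copies: start recovery.
        return recover(inputs)
    return leading_out


def checksum(values):
    """Example program: a small checksum over the inputs."""
    return sum(values) % 256


def faulty_checksum(values):
    """Same program with a simulated transient fault (one result bit flipped)."""
    return checksum(values) ^ 0b100
```

Calling `run_redundantly(checksum, faulty_checksum, data, recover=checksum)` would take the recovery path, while two fault-free copies simply return the agreed result.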

Input Replication

Load Value Queue (LVQ)
  Keep threads on same path despite I/O or MP writes
  Out-of-order load issue possible
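A minimal sketch of the load value queue idea: only the leading thread reads memory, and the trailing thread receives the same value from the queue, so an intervening external write cannot make the two threads diverge. The class and method names are hypothetical, the address check is an assumption of this sketch, and entries are consumed in FIFO order, so out-of-order load issue is not modelled.

```python
from collections import deque


class LoadValueQueue:
    """Replicates load values from the leading thread to the trailing thread."""

    def __init__(self):
        self._entries = deque()

    def leading_load(self, memory, address):
        """Leading thread reads memory and records (address, value)."""
        value = memory[address]
        self._entries.append((address, value))
        return value

    def trailing_load(self, address):
        """Trailing thread takes its value from the queue, never from memory."""
        expected_address, value = self._entries.popleft()
        if expected_address != address:
            # The two threads computed different addresses: treat as a fault.
            raise RuntimeError("load address mismatch between threads")
        return value
```

If another processor overwrites the location between the two loads, the trailing thread still sees the value the leading thread saw.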

Output Comparison

Compare & validate output before sending it outside the SoR

Store Queue Comparator (STQ)

Store Queue Comparator
  Compares outputs to data cache
  Catch faults before propagating to rest of system
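The store comparison step can be sketched as below: a leading-thread store is held until the matching trailing-thread store arrives, and memory is written only if address and value agree. The names are hypothetical, and a plain dictionary stands in for the data cache.

```python
from collections import deque


class StoreComparator:
    """Holds leading-thread stores until the trailing thread confirms them."""

    def __init__(self, memory):
        self.memory = memory      # stands in for the data cache
        self._pending = deque()   # leading stores awaiting comparison

    def leading_store(self, address, value):
        """Buffer the leading store; nothing leaves the sphere yet."""
        self._pending.append((address, value))

    def trailing_store(self, address, value):
        """Compare with the oldest leading store and commit only on a match."""
        leading = self._pending.popleft()
        if leading != (address, value):
            raise RuntimeError("store mismatch: fault detected before commit")
        self.memory[address] = value
```

Because the leading store waits in the queue until its partner arrives, a corrupted value is caught before it reaches the rest of the system.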

Store Queue Comparator (cont’d)

Extends residence time of leading-thread stores
  Size constrained by cycle time goal
  Base CPU statically partitions single queue among threads
  Potential solution: per-thread store queues
Deadlock if matching trailing store cannot commit
  Several small but crucial changes to avoid this

Branch Outcome Queue (BOQ)

Branch Outcome Queue
  Forward leading-thread branch targets to trailing fetch
  100% prediction accuracy in absence of faults
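A minimal sketch of the branch outcome queue: the leading thread's branch targets are queued and handed to the trailing thread as its predictions. The names are hypothetical, and treating a mismatch as a sign of a fault is an inference from the "100% prediction accuracy in absence of faults" point, not a detail given in the slides.

```python
from collections import deque


class BranchOutcomeQueue:
    """Forwards leading-thread branch targets to the trailing thread's fetch."""

    def __init__(self):
        self._targets = deque()

    def leading_branch(self, target):
        """Leading thread records the target of a committed branch."""
        self._targets.append(target)

    def trailing_predict(self):
        """Trailing thread uses the oldest recorded target as its prediction."""
        return self._targets.popleft()


def trailing_branch_matches(boq, computed_target):
    """True if the trailing thread's own result agrees with the forwarded target."""
    return boq.trailing_predict() == computed_target
```

With no faults every comparison returns True, so the trailing thread never fetches down a wrong path.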

Simultaneous & Redundantly Threaded Processor (SRT)

SRT = SMT + Fault Detection
Less hardware compared to replicated microprocessors
  SMT needs ~5% more hardware over uniprocessor
  SRT adds very little hardware overhead to existing SMT
Better performance than complete replication
  better use of resources
Lower cost

Issues

Cycle-by-cycle output comparison and input replication:
  Equivalent insts from different threads may execute in different cycles
  Equivalent insts from different threads might execute in different order
Precise scheduling of the threads crucial for optimal performance
  Branch misprediction
  Cache miss