Post on 18-Jan-2016
CS717
Detection and Tolerance of Complex Faults in Computing Systems
CS 717
Greg Bronevetsky
The Problem
• Systems fail
  – Hardware failures
  – Software failures
  – Hacker attacks
• Failures can be made less probable
  – ex: Higher quality hardware
• Cannot be prevented
• Key challenge
  – Detect real-world failures
  – Allow applications to tolerate them
Types of Failures
• Hardware Failures
  – Wear and tear on wires
  – Electric interference
  – Radiation hitting electronics
• Software Failures
  – System misconfigurations
  – Buggy code
• Hacker Attacks
  – Worst-case scenario
  – Arbitrary modifications of system state/code
Classes of Faults
• Permanent Faults
  – Same input will result in same (erroneous) output
  – Ex:
    • Broken wires
    • OS misconfigurations
• Transient Faults
  – Temporarily erroneous output
  – Not replicable
  – Ex:
    • Radiation hitting wires, flipping bits
    • Hacker attacks
  – Typically harder to detect
Fault Models
• To study faults, multiple fault models have been developed
  – Too many to cover here
• Major Types:
  – Fail-stop
  – Random
  – Human-Induced
  – Byzantine
Fail-Stop Faults
• System stops on failure
• No random misbehaving
• Very simple to detect
• Vast body of work exists assuming this model
  – Mostly focused on failure tolerance
Random Faults
• State of computer changed randomly
• Computer may be arbitrarily complex
• Meant to model physical problems in hardware
• Work ranges from theoretical to physical
  – Abstract circuits with randomly failing gates
    vs.
  – Shooting CPUs with protons
Human-Induced Faults
• Targets failures typically caused by humans
• Work focuses on dealing with specific problems
  – Buffer overflow attacks
  – System misconfiguration
  – Buggy code
  …
• Types of research
  – Bug Prevention: Software Engineering, Programming Languages, User Interfaces
  – Hacker Attacks: Security
Byzantine Faults
• Worst-case scenario
• Adversary can change system state/code arbitrarily
• Typically, limits placed on adversary
  – Decidable
  – Polynomial-time probabilistic
  – Longer-running than checker algorithm
  …
• Very hard to detect these efficiently
• Solutions either very specific or very expensive
Goal
• Suppose system fails according to some model
• Can we
  – Detect fault and recover?
  – Ensure that result is correct despite failure?
    • ex: guarantee provided by error-correcting codes
• Is 100% correctness achievable?
  – Probably not
  – High-probability correctness doable
What Might We Want to Do
• Detect any errors that happen in computation
• Detect most errors
  – Random ones
  – Human-induced
  – Byzantine with limited adversary
• Once detected, correct errors
• Tolerate errors without explicit detection
What Might We Want to Do
• Want to have deterministic correctness guarantees
  – May be possible if adversary limited enough
• Probabilistic guarantees may be sufficient
  – 1 in 100 years chance of failure = 0% chance
• Would like to have provable guarantees
• May settle for experimental evidence of effectiveness
Coverage of Course
• Work exists in many subfields
• Theory
  – Checkers for restricted algorithm classes
  – Complexity work on efficient proving systems
• Systems
  – Replicated systems
  – Checking specific algorithms
  – Checking program control flows
  – Theory of placement of manual checks
Coverage of Course
• Software Engineering
  – Programmers helping with system checks
  – Assertions
  – Modeling of programs
• Hardware work
  – Variations on hardware replications
  – Lock-step execution
  – Thread-level speculation
State of the Field
• Many communities, little communication
  – Systems work boring to theorists
  – Theory obscure/useless to systems people
• Solution space ranges from Algorithm-specific to General
• Mostly useless for general programs
• If automatic, then very high time/hardware overheads (100%+)
• If not automatic, then significant manual work for programmer
What Do We Want?
• Want to
  – Never worry about failures
  – Not pay much overhead for the privilege
• This means
  – Arbitrary programs
  – Reasonably powerful adversaries (at least, random faults)
  – No (or minimal) programmer interaction
  – Efficiency
Solution Parameters
Need solution that is both:
• Automatic
  – Applies to any program
  – No programmer interaction
• Intelligent
  – Tailored to each particular program
  – Application-specific knowledge used for better reliability/efficiency
The Black Hole
Little work on solutions both Automatic AND Intelligent
• Theory of checking: establishes limits and capabilities of checkers
• Algorithm-specific solutions (efficient checkers tied to particular algorithms)
• Blind mechanisms (simple, application-independent)
• Automatic AND Intelligent???
The Black Hole
Our Goal: Automatic AND Intelligent
• Theory of checking: establishes limits and capabilities of checkers
• Algorithm-specific solutions (efficient checkers tied to particular algorithms)
• Blind mechanisms (simple, application-independent)
• Automatic AND Intelligent!
Our Research Goal
• Given an algorithm, create custom checker
• Detects errors with high probability
• Runs in time asymptotically smaller than original algorithm (Or just faster by constant factor)
Our Research Goal
• Intelligent checking works at application level
• Lower-level techniques only check simple components
  – Correctness of addition, memory state, etc.
  – Poor ability to cut overhead
    • Since no application knowledge
• Application-level checking: application's own semantics preserved through faults
  – i.e. if app says matrix A=BC, ensure that this actually holds
  – Solution tailored to application
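As an illustration of an application-level check for the A=BC example, here is a minimal sketch of Freivalds' randomized test, a well-known technique that the slides do not name. It verifies a claimed product in O(n^2) time per round instead of recomputing it in O(n^3). The function name, the list-of-lists representation, and the restriction to square integer matrices are assumptions made for this example.

```python
import random

def freivalds_check(a, b, c, rounds=20, rng=None):
    """Probabilistically check that a == b * c for square integer matrices.

    Each round multiplies by a random 0/1 vector r and compares a*r with
    b*(c*r). A wrong `a` survives one round with probability at most 1/2,
    so it survives all rounds with probability at most 2**-rounds.
    """
    rng = rng or random.Random()
    n = len(a)
    for _ in range(rounds):
        r = [rng.randint(0, 1) for _ in range(n)]
        # Three matrix-vector products, each O(n^2).
        cr = [sum(c[i][j] * r[j] for j in range(n)) for i in range(n)]
        bcr = [sum(b[i][j] * cr[j] for j in range(n)) for i in range(n)]
        ar = [sum(a[i][j] * r[j] for j in range(n)) for i in range(n)]
        if ar != bcr:
            return False  # definitely inconsistent
    return True  # consistent with high probability
```

A correct product always passes; a corrupted one is rejected except with probability at most 2^-rounds, which matches the "detects errors with high probability" goal.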
Our Research Goal
• Unknown if Automatic/Intelligent is possible
  – Some subproblems shown impossible via complexity theory
  – Problem looks very hard
• Field very large but largely nonexistent
• Goal of class: Find inspiration from current work to embark on new research
Coverage of Course
• Lectures cover papers in multiple fields
• Different communities bring different techniques
  – Hence, different sources of inspiration
• Multiple fault models
  – Random and Byzantine in particular
• By semester end, may find good leads to follow
Outline of Course
(in no particular order)
• Hardware/Software Replication
• Algorithm Based Fault Tolerance
• Control-Flow Checking
• Checkers for Specific Algorithms
• Data Structure Checkers
• Complexity Work on Provers (PCP Theorem)
• Programmers Helping Checkers
• Fault Detection in Parallel Systems
• Hardware Fault Tolerance
• Fault Tolerant Circuits
• Experimental Evaluation of Checkers
• Physics Experiments
• Machine Learning
Hardware Replication
• Comes from Systems community
• Run same algorithm on multiple processors
• Compare results, take majority vote
• Tolerates almost arbitrary faults in minority of processors
  – 2f+1 replicas needed for f failures
• Triplication of hardware most common
• Used by NASA to secure against errors
  – Recent gravity experiment saved by backup processor
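The majority-vote idea can be sketched in a few lines. This is only an illustrative software simulation: the replicas are modeled as plain Python callables, and the names `majority_vote` and `run_replicated` are invented for this example.

```python
from collections import Counter

def majority_vote(replies):
    """Return the value reported by a strict majority of the replies.

    With 2f+1 replicas and at most f faulty ones, the f+1 correct
    replicas always form a strict majority.
    """
    if not replies:
        raise ValueError("no replies to vote on")
    value, count = Counter(replies).most_common(1)[0]
    if 2 * count <= len(replies):
        raise ValueError("no strict majority among replicas")
    return value

def run_replicated(replicas, x):
    """Run every replica on the same input and vote on the results."""
    return majority_vote([replica(x) for replica in replicas])
```

Results must be hashable to be counted. With triplication (f = 1), one arbitrarily wrong replica is outvoted by the other two.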
Byzantine Quorums
• Basic replication assumes: every processor replies with answer
• Suppose faulty processors can stay silent
• Allow f faulty processors out of n
  – Then we must decide on correct answer after n-f replies
  – But f out of those n-f might be wrong
  – Thus, must take majority decision out of n-2f
• Bottom line: to tolerate f faults, need 3f+1 replicas
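The counting argument behind the 3f+1 bound can be written out as a short derivation. It only restates the steps above; the one added observation is that the correct replies must strictly outnumber the possibly wrong ones.

```latex
\begin{align*}
\text{replies we can wait for} &= n - f \\
\text{replies guaranteed correct} &\ge (n - f) - f = n - 2f \\
\text{correct replies must outnumber wrong ones:}\quad n - 2f &> f \\
\Longrightarrow\quad n > 3f \quad &\Longrightarrow\quad n \ge 3f + 1
\end{align*}
```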
Byzantine Quorums
• Byzantine Quorum Systems provide protocols to manage this 3f+1 replication
• Cryptographically sign all communication
• Maintain known core of 3f+1 good processors at all times
• Protocols somewhat mind-bending but pretty cool
Algorithm-Based Fault Tolerance
• Comes from Scientific Computing/Numerical Analysis community
• Fault tolerance for basic linear algebra algorithms
• Input encoded in algorithm-specific code
  – Input matrices typically encoded per row/column
• Algorithm run on encoded input, returns encoded output
• Encoded output decoded, checked for inconsistencies
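A minimal sketch of the encode / compute / check cycle for matrix multiplication, using row and column checksums as the encoding. This is one common style of ABFT encoding, chosen here for illustration; the slides do not say which code is used. The function names are invented, and integer entries are assumed so that sums can be compared exactly (floating-point data would need a tolerance).

```python
def add_column_checksum(a):
    """Encode A: append a row holding the sum of each column."""
    return a + [[sum(col) for col in zip(*a)]]

def add_row_checksum(b):
    """Encode B: append to each row the sum of that row."""
    return [row + [sum(row)] for row in b]

def matmul(x, y):
    """Plain matrix product; it runs unchanged on the encoded inputs."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*y)]
            for row in x]

def find_inconsistencies(cf):
    """Return (bad_rows, bad_cols) of an encoded product.

    The last column should hold row sums and the last row column sums.
    A single corrupted data entry shows up as one bad row and one bad
    column, which together locate it.
    """
    data_rows = cf[:-1]
    bad_rows = [i for i, row in enumerate(data_rows)
                if sum(row[:-1]) != row[-1]]
    bad_cols = [j for j in range(len(cf[0]) - 1)
                if sum(row[j] for row in data_rows) != cf[-1][j]]
    return bad_rows, bad_cols
```

Multiplying the column-checksummed A by the row-checksummed B yields a product that carries both checksums, so the check costs only O(n^2) on top of the multiplication.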
Algorithm-Based Fault Tolerance
• Encoding guarantees detection/tolerance of up to f errors in each row/column
• Approach meant for parallel systems
• If processor fails, all its results likely wrong
• Thus, algorithms modified s.t. no processor touches >1 entry in a row/column
• Approach fairly general, but each algorithm needs own solution
Algorithm-Based Fault Tolerance
• ABFT produces checkers for data elements
• Can develop theory of check placement in parallel systems
• Given assignment of data to processors and checks to data, can derive number of faults detectable/tolerable by arrangement
  – Detectability/tolerance depends on detailed failure model
• Multiple evaluation algorithms available
Control-Flow Checking
• Checking general programs is hard
• Control-flow follows basic stack pattern
  – Much easier to check
  – Present in most programs
• Solutions typically annotate program, check annotations
• Typically check that
  – Program exits blocks it entered
  – Program executes each block’s correct code
Control-Flow Checking
• Hardware
  – Watchdog processor watches fetched instructions
  – Yells if illegal block sequence or illegal instructions in block
• Software
  – Program modified to check itself
  – Can’t check each instruction
    • Too costly in software
  – Just checks that program moves through blocks/functions correctly
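The software variant can be sketched as a monitor that the instrumented program calls at the start of every block. This is a simplified illustration, not any specific published scheme: the class name, the block labels, and the explicit set of allowed edges are all assumptions made for the example.

```python
class ControlFlowError(Exception):
    """Raised when the program takes a transition outside its flow graph."""

class ControlFlowMonitor:
    def __init__(self, allowed_edges, entry):
        self.allowed = set(allowed_edges)  # legal (from_block, to_block) pairs
        self.entry = entry                 # the only legal first block
        self.current = None

    def enter(self, block):
        """Called by inserted code at the start of each block."""
        if self.current is None:
            legal = block == self.entry
        else:
            legal = (self.current, block) in self.allowed
        if not legal:
            raise ControlFlowError(
                f"illegal transition {self.current!r} -> {block!r}")
        self.current = block
```

Only block-to-block movement is checked, not individual instructions, which is what keeps the software overhead manageable.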
Checkers for Specific Algorithms
• Work from Theory and Software Engineering communities
• Given specific algorithm, can usually develop efficient checker for it
• Checkers exist for whole algorithm classes
  – ex: Linear recurrences
• Self-correctors available
  – Corrector calls faulty algorithm on several random inputs
  – Collects results into (likely) correct answer
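A minimal sketch of a self-corrector, assuming the function being computed is linear modulo a prime p, i.e. f(x) = a·x mod p. That choice of function is an assumption made for this example; the slides do not fix one. Linearity gives f(x) = f(x + r) − f(r), so the corrector can query the faulty program only at random-looking points and take the most common answer.

```python
import random
from collections import Counter

def self_correct(program, x, p, trials=15, rng=None):
    """Recover f(x) from a program that is wrong on a small fraction of inputs.

    Assumes f is linear mod p. Each trial picks a random r and computes
    program(x + r) - program(r); a trial is right whenever both queries
    are answered correctly, so the most common value is (likely) f(x).
    """
    rng = rng or random.Random()
    votes = Counter()
    for _ in range(trials):
        r = rng.randrange(p)
        votes[(program((x + r) % p) - program(r)) % p] += 1
    return votes.most_common(1)[0][0]
```

Even when the program is wrong at x itself, the corrector rarely needs to trust that particular input.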
Checkers for Specific Algorithms
• For example, sorting:
  – Invariants:
    • Output is permutation of input
    • Output is in non-decreasing order
  – To check:
    • Can easily check order in linear time
    • Modify sorter to output for each input value its post-sort index
    • Can use index list to verify permutation in linear time
  – This O(n log n) algorithm has O(n)-time checker
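The sorting checker described above can be sketched as follows. The names `sort_with_trail` and `check_sort` are invented for this example, and `pos[i]` is assumed to mean "the index in the output where input element i ended up".

```python
def sort_with_trail(inp):
    """Sort and also report, for each input element, its post-sort index."""
    order = sorted(range(len(inp)), key=inp.__getitem__)
    out = [inp[i] for i in order]
    pos = [0] * len(inp)
    for rank, i in enumerate(order):
        pos[i] = rank
    return out, pos

def check_sort(inp, out, pos):
    """O(n) check that `out` is a sorted permutation of `inp`."""
    n = len(inp)
    if len(out) != n or len(pos) != n:
        return False
    # Invariant 1: output is in non-decreasing order.
    if any(out[i] > out[i + 1] for i in range(n - 1)):
        return False
    # Invariant 2: pos maps inputs one-to-one onto outputs with equal values,
    # which implies the output is a permutation of the input.
    seen = [False] * n
    for i, p in enumerate(pos):
        if not isinstance(p, int) or not 0 <= p < n:
            return False
        if seen[p] or out[p] != inp[i]:
            return False
        seen[p] = True
    return True
```

The checker never trusts the index list: a bogus `pos` can only cause a rejection, not a false acceptance.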
Checkers for Specific Algorithms
• Different algorithms have different checkers
• Some use “certification trails” (additional mini-proof to help verify correctness)
  – ex: Sorting checker
• Checker for one algorithm rarely applicable to other algorithms
• Specific checkers give technique ideas and show how efficient general algorithms can be
Data Structure Checkers
• Algorithms only produce answers
• Usually interested in maintaining state reliably
• Need checkers for data structures
• Solutions exist for
  – Generic RAMs (most expensive)
  – Stacks
  – Queues
  – Trees
  – Graphs
  …
Data Structure Checkers
• Some solutions use a little secure memory safe from adversary
• Others use certification trails to prove correctness of encoding
• Solution for RAMs applicable to all other data structures
• Custom-tailored checkers more efficient
  – Not directly applicable to other data structures
  – Tend to provide good inspiration though
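To make the "little secure memory" idea concrete, here is a hedged sketch of a stack whose contents live in untrusted memory while a single hash digest is kept in trusted memory. The hash-chain construction and the class name `CheckedStack` are choices made for this example, not taken from any paper covered in the course, and values are assumed to have an unambiguous `repr`.

```python
import hashlib

class CheckedStack:
    """Stack stored in untrusted memory, guarded by one trusted digest."""

    EMPTY = b"\x00" * 32

    def __init__(self):
        self.untrusted = []        # the adversary may modify this list
        self._digest = self.EMPTY  # assumed to be safe from the adversary

    @staticmethod
    def _link(prev_digest, value):
        # Digest of the stack = hash(digest of the rest || top value).
        return hashlib.sha256(prev_digest + repr(value).encode()).digest()

    def push(self, value):
        # Store the old digest next to the value so pop can restore it.
        self.untrusted.append((value, self._digest))
        self._digest = self._link(self._digest, value)

    def pop(self):
        value, prev_digest = self.untrusted.pop()
        # The stored record must hash to the trusted digest.
        if self._link(prev_digest, value) != self._digest:
            raise RuntimeError("stack contents were corrupted")
        self._digest = prev_digest
        return value
```

The trusted state is constant-size regardless of stack depth; corruption anywhere in the stack is caught when the damaged record is popped.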
Complexity Work on Provers
• Complexity community wants to know: How small can proofs get?
• To show a string is in an NP language, need poly-length proof
• How small can proof be if only probabilistic guarantee?
• Big theorems developed
  – IP = PSPACE
  – PCP Theorem
Complexity Work on Provers
• IP = PSPACE
  – IP (Interactive Proofs) = Languages where membership can be probabilistically proven via poly-many queries
  – PSPACE = Languages computable using poly space
• PCP Theorem: can prove a string is in an NP language by
  – Using O(log n) random bits
  – Showing 3 random bits of (very long) proof
Complexity Work on Provers
• Work done by complexity theorists, so:
  – Very cheap for checker
  – VERY expensive for prover
  – Not directly applicable to real world
• However, in theorem development, multiple tools developed
  – Checkers for polynomials
  – Secure encodings via polynomials
  …
• Tools useful for our purposes
Programmers Helping Checkers
• Software Engineering community
• Gives up on purely automated solutions, asks programmer for help
• Simplest example: Assertions & Exceptions
  – Insert boolean checks into code
  – If check fails
    • Assertions: program informs user of failure
    • Exceptions: program executes exception handler to fix problem
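The difference between the two styles can be shown with a small sketch. The sanity check used here (a mean must lie between the minimum and maximum of its inputs) and all function names are assumptions chosen for illustration.

```python
class CheckFailed(Exception):
    """Raised when a programmer-inserted check detects a bad result."""

def average_with_assertion(values, mean=lambda v: sum(v) / len(v)):
    result = mean(values)
    # Assertion style: a failed check only reports the problem.
    assert min(values) <= result <= max(values), "mean outside input range"
    return result

def average_with_exception(values, mean):
    result = mean(values)
    # Exception style: a failed check hands control to a handler.
    if not min(values) <= result <= max(values):
        raise CheckFailed("mean outside input range")
    return result

def resilient_average(values, mean):
    try:
        return average_with_exception(values, mean)
    except CheckFailed:
        # Handler tries to fix the problem by recomputing with a fallback.
        return sum(values) / len(values)
```

In both styles the programmer supplies the application knowledge (what a plausible result looks like), which is exactly the manual effort this approach trades for coverage.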
Programmers Helping Checkers
• Another example: Programmer-implemented checkers
• SCCM – two languages in one
  – Regular imperative language
  – Auxiliary functional language
    • Allows programmer to specify algorithm itself
    • Hooks to associate algorithm’s variables to program variables
  – System checks main program’s state using auxiliary program
• Pro: Guards against bugs & faults
• Con: Annoying to use
Fault Detection in Parallel Systems
• Parallel systems bring new challenges
• May wish to compute average of numbers held by all nodes
  – Some nodes may be faulty
  – Can use polynomials to encode data to help system compute approximate answer
• May wish to know which nodes failed (or number of failed nodes)
  – Hard when faults are Byzantine
  – If this was known, could
    • Move replicas away from failed nodes
    • Increase degree of replication
Fault Detection in Parallel Systems
• Some techniques guard against malicious humans
• SETI@Home has users who return fake data without doing work
• Can develop schemes to make this unprofitable
• Parallel techniques valuable for
  – Scientific computations (usually parallel and large)
  – Large business computations (ex: databases)
Hardware Fault Tolerance
• If hardware can be made more reliable then less need for software-level reliability
• Quality of circuitry will not improve in near future
  – Driven by physics and economics
• Can use replication of circuits to achieve high reliability
• Faults caught at hardware level: fast, invisible correction
Hardware Fault Tolerance
• Simplest version: lock-step execution
  – Two processors execute in lock-step
  – If output of any instruction pair disagrees, redo instruction
• Modern versions
  – Thread-level speculation (TLS) allows threads to be aborted mid-stream
  – Can guess that branch will go TRUE and speculatively execute ahead
  – If guess was wrong, all speculative thread actions aborted
Fault Tolerance via TLS
• Can run multiple threads with identical code
  – Regularly compare results
  – If disagreement, abort threads, try again
• TLS can be done on same processor or multiple processors
• Threads touch hardware differently, so affected differently by physical problems
  – ex: radiation, faulty wires, etc.
• Guarantees very low-level
• Low overhead (if TLS already available)
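The compare-and-retry loop can be imitated in software. This sketch only illustrates the control logic: real TLS is a hardware mechanism, and here the "threads" are simply repeated sequential calls to the same function. The name `run_redundantly` and its parameters are invented for the example.

```python
def run_redundantly(task, arg, copies=2, max_retries=3):
    """Run identical copies of a task and accept only a unanimous result.

    On disagreement all results are discarded (the "abort") and the whole
    group is executed again, up to max_retries times.
    """
    for _ in range(max_retries + 1):
        results = [task(arg) for _ in range(copies)]
        if all(r == results[0] for r in results):
            return results[0]
    raise RuntimeError("redundant copies kept disagreeing")
```

This catches transient faults that hit the copies differently; a permanent fault that corrupts every copy in the same way goes undetected.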
Fault Tolerant Circuits
• Very old field, begun by von Neumann in 50’s
• Model:
  – Computer made up of binary logic gates
  – Each gate fails with constant probability $1/c$
• Approach
  – Replicate each gate log n times
    • n = number of gates in overall circuit
  – Feed output of copies into combiner circuit
  – Failure probability of a replicated gate: $(1/c)^{\log n} = O(1/n)$
  – Thus, probability of 1 of n gates failing = constant
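To see how quickly replication drives down the failure probability, the following sketch computes the exact chance that a majority of k independent copies of a gate is wrong. It assumes a perfect combiner that takes a majority vote, which is a simplification of the model above (where the combiner is itself built from failing gates); `eps` and `k` are names chosen for this example.

```python
from math import comb

def majority_failure_probability(eps, k):
    """Probability that more than half of k copies fail (k should be odd).

    Each copy fails independently with probability eps; the combiner is
    assumed to be a fault-free majority vote.
    """
    return sum(comb(k, i) * eps**i * (1 - eps)**(k - i)
               for i in range(k // 2 + 1, k + 1))
```

For eps = 0.1, three copies already cut the failure probability from 0.1 to 0.028, and the value keeps shrinking exponentially as k grows, which is why about log n copies suffice to reach O(1/n).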
Fault Tolerant Circuits
• Get constant failure probability for any size circuit
  – Note, constant number of replicas becomes worse as replicas become larger (more gates)
• Limitation: can’t replicate <log n since output gate needs log n replication
• Recent work: encoded computation
  – Encode all inputs using Reed-Solomon codes
  – Transform circuit to work on encoded data
  – Output encoded data
  – Allows for <log n replication since output > 1 gate
Experimental Evaluation of Checkers
• Many different approaches to fault tolerance
• Little work on comparing their effectiveness in real-world setting
• Example: control-flow checking
  – Many different papers
  – Each uses different evaluation method
  – Net effect: no insight into how techniques compare
• No surprise: experiments not glamorous
• Will cover few papers that do experimental comparisons
Physics Experiments
• Multiple sources of hardware faults
  – Manufacturing defects
  – Temperature
  – Wire failure
  – Radiation
• Manufacturing defects and wire failure hard to study
• Temperature-induced failures
  – Easy to study
  – However, few papers around (that I’ve seen)
Physics Experiments
• Radiation-induced failures studied extensively by physicists
  – Stick CPU inside particle accelerator
  – Shoot protons or ions at it
  – Run sample programs and watch what happens
• Experiments run on various CPUs
• Give insight into vulnerability of CPUs to radiation
  – Approximate space operation conditions
Machine Learning
• Machine learning algorithms can be thought of as “fault removers”
  – Given true data + adversary-induced noise
  – Return expression describing true data
    • Decision tree, linear plane, neural net, etc.
  – Thus, undo effects of adversary
• If limit data & adversary complexity
  – Can prove given learning algorithm effective
    • ex: PAC learning
  – Use complexity measurements like V-C dimension
Machine Learning
• Can assure correct output of algorithms with low complexity
  – i.e. low V-C dimension
• Applies to broad range of algorithms
• Not very intelligent (i.e. algorithm-specific)
• Would like to cover in course
• I don’t have background to find papers
• Anyone want to help?
CS717
• Failures happen, want to pretend they don’t
• Need techniques for detecting and correcting system faults
• Requirements
  – Automatic – little/no programmer interaction
  – Efficient – small cost for fault tolerance
• Must check correctness at application-level
  – Ensure application-level semantics
  – Checker tailored to application – potential for low overhead
Something for Everyone
• Field spans many different areas
  – Theory
    • Algorithms
    • Complexity
  – Systems
  – Computer Architecture
  – Software Engineering
  – Scientific Computing/Numerical Analysis
  – Machine Learning
• Great place for cross-area collaboration
Pulling Fields Together
• Breadth of field makes collaboration hard
• Difficult to see from one side to another
• Nobody expert in all sub-areas
• This semester will get basic grounding in many
Goal of Semester
• If your computer is lying to you, how do you know?
• Big question, no good answers
• Very fundamental to Computer Science
• Goal:
  – By end of semester get ideas for possible answers
  – May found new field in process
The Black Hole
Many disjoint efforts
Core problem still wide open
Let’s find the answer!
• Theory of checking: establishes limits and capabilities of checkers
• Algorithm-specific solutions (efficient checkers tied to particular algorithms)
• Blind mechanisms (simple, application-independent)
• Automatic AND Intelligent???