Tests and Tolerances for High-Performance Software-Implemented Fault Detection

Michael Turmon, Robert Granat, Daniel S. Katz, John Z. Lou


Post on 05-Feb-2016

Page 1: Tests and Tolerances for High-Performance Software-Implemented Fault Detection

Tests and Tolerances for High-Performance Software-Implemented Fault Detection

Michael Turmon, Robert Granat, Daniel S. Katz, John Z. Lou

Page 2: Tests and Tolerances for High-Performance Software-Implemented Fault Detection

Objective

Software fault detection in common numerical libraries by checking computed output

The fault model here is essentially bit flips in the application's state space

Distinguish fault-induced errors from round-off in computed results

Page 3: Tests and Tolerances for High-Performance Software-Implemented Fault Detection

Faults and EDMs

Single Event Upsets (SEUs)
  Radiation-induced errors causing bit flips in memory and cache
  Affect application data and code
  Data errors are more difficult to detect

Error Detecting Middleware (EDM)
  Wraps existing numerical libraries
  Avoids altering internals of the library
  Must be more efficient than repeating the original computation

Page 4: Tests and Tolerances for High-Performance Software-Implemented Fault Detection

Numerical Error Checking - Summary

Consider common numerical matrix computations

Use “post-conditions” to evaluate correctness
  Post-condition: a necessary relation between inputs and computed outputs

Use well-known upper bounds on error propagation within numerical algorithms for matrix computations

Define tests and tolerances to separate errors and round-offs

Develop input-independent tolerances

Page 5: Tests and Tolerances for High-Performance Software-Implemented Fault Detection

Definitions: Vector & Matrix norms

Vectors:
  $\|v\|_1 = \sum_i |v_i|$
  $\|v\|_\infty = \max_i |v_i|$
  $\|v\|_2 = (\sum_i |v_i|^2)^{1/2}$

Matrices:
  $\|A\|_1$ = maximum absolute column sum of A
  $\|A\|_\infty$ = maximum absolute row sum of A
  $\|A\|_2$ = largest singular value of A
  $\|A\|_F = (\sum_{i,j} |a_{ij}|^2)^{1/2}$
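The norm definitions on this slide can be written out directly. The following is a minimal sketch in plain Python; the function names are illustrative and not taken from the presentation, and the matrix 2-norm is omitted because it needs a singular value computation.

```python
import math

def vec_norm(v, p):
    """Vector p-norm for p in {1, 2, math.inf}."""
    if p == 1:
        return sum(abs(x) for x in v)
    if p == 2:
        return math.sqrt(sum(abs(x) ** 2 for x in v))
    if p == math.inf:
        return max(abs(x) for x in v)
    raise ValueError("p must be 1, 2, or math.inf")

def mat_norm_1(A):
    """Maximum absolute column sum (A is a list of rows)."""
    return max(sum(abs(a) for a in col) for col in zip(*A))

def mat_norm_inf(A):
    """Maximum absolute row sum."""
    return max(sum(abs(a) for a in row) for row in A)

def mat_norm_fro(A):
    """Frobenius norm: square root of the sum of squared magnitudes."""
    return math.sqrt(sum(abs(a) ** 2 for row in A for a in row))

A = [[1.0, -2.0], [3.0, 4.0]]
print(mat_norm_1(A), mat_norm_inf(A), mat_norm_fro(A))
```

The 1-norm and infinity-norm cost only one pass over the matrix, which is one reason they are convenient inside a cheap output check.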

Page 6: Tests and Tolerances for High-Performance Software-Implemented Fault Detection

Matrices review

Orthogonal matrix: $A A^T = I \Rightarrow A^{-1} = A^T$

Unitary matrix: $(A^*)^T = A^{-1}$

Permutation matrix: the rows of $I$, reordered

Sub-multiplicative property:
  $\|Av\| \le \|A\| \|v\|$
  $\|AB\| \le \|A\| \|B\|$

Page 7: Tests and Tolerances for High-Performance Software-Implemented Fault Detection

Numerical Functions

Matrix multiplication

QR decomposition: $A = QR$
  A = input matrix, Q = orthogonal matrix, R = upper triangular matrix

Singular value decomposition: $A = U D V^T$
  A = input matrix, D = diagonal matrix, U and V = orthogonal matrices

Page 8: Tests and Tolerances for High-Performance Software-Implemented Fault Detection

Numerical Functions (contd.)

LU decomposition: $A = PLU$
  P = permutation matrix, L = lower triangular matrix, U = upper triangular matrix

System solution: solve for x in $Ax = b$, given A and b

Matrix inverse: given A, find B such that $AB = I$

Page 9: Tests and Tolerances for High-Performance Software-Implemented Fault Detection

Numerical functions (contd.)

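Fourier transform: given x, find y such that $y = Wx$, where W is the matrix of Fourier bases, $W_{nk} = e^{-j 2\pi k n / N}$, with N the transform length (written n below)

Inverse Fourier transform: given y, find x such that $x = n^{-1} W^T y$, where W is the $n \times n$ matrix of Fourier bases and $W^T$ is read as the conjugate transpose, so that $n^{-1} W^T = W^{-1}$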

Page 10: Tests and Tolerances for High-Performance Software-Implemented Fault Detection

Operations & Post-conditions

Page 11: Tests and Tolerances for High-Performance Software-Implemented Fault Detection

Probe Vector

Checking the post-condition $A = \hat{Q}\hat{R}$ directly is computationally intense

Instead multiply with a probe vector w and compare vectors: $A w$ versus $\hat{Q}\hat{R} w$

Choice of w
  Elements of w should not vary greatly in magnitude
  w should be non-zero everywhere
  Can be a vector of all ones, except for FFT
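The probe-vector idea can be sketched as follows: rather than forming the product of the computed factors, apply them one at a time to w, so only matrix-vector products are needed. This is an illustrative sketch, not the authors' implementation; `matvec` and `probe_residual` are hypothetical helper names, and the 2-by-2 factors are a hand-built example.

```python
def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(a * x for a, x in zip(row, v)) for row in M]

def probe_residual(factors, A, w):
    """Largest entry of |(F1 F2 ... Fk) w - A w|, using matrix-vector products only."""
    v = list(w)
    for F in reversed(factors):  # apply the rightmost factor first
        v = matvec(F, v)
    Aw = matvec(A, w)
    return max(abs(p - q) for p, q in zip(v, Aw))

# Hand-built example: A = Q R with Q orthogonal and R upper triangular.
A = [[3.0, 1.0], [4.0, 2.0]]
Q = [[0.6, -0.8], [0.8, 0.6]]
R = [[5.0, 2.2], [0.0, 0.4]]
w = [1.0, 1.0]  # all-ones probe vector

good = probe_residual([Q, R], A, w)      # round-off level
R_bad = [[5.0, 2.7], [0.0, 0.4]]         # one corrupted entry of R
bad = probe_residual([Q, R_bad], A, w)   # clearly non-zero
print(good, bad)
```

For n-by-n matrices each matrix-vector product costs on the order of n^2 operations, against n^3 for forming the full product, which is what makes the check cheaper than the original computation.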

Page 12: Tests and Tolerances for High-Performance Software-Implemented Fault Detection

Error Propagation – Matrix multiplication

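Error matrix: $E = \hat{P} - AB$, where $\hat{P} = \mathrm{mult}(A, B)$ is the computed product

$\|E\|_\infty \le n \|A\|_\infty \|B\|_\infty u$
  u = difference between unity and the next larger floating-point number
  n = dimension common to A and B

$d = \hat{P} w - A B w = E w$

$\|d\|_\infty = \|E w\|_\infty \le \|E\|_\infty \|w\|_\infty \le n \|A\|_\infty \|B\|_\infty \|w\|_\infty u$

Test: $\|d\|_\infty / (\|A\|_\infty \|B\|_\infty \|w\|_\infty) \gtrless \tau u$, with threshold $\tau$

n is ignored: in the average case, round-off errors are independent of dimension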
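The matrix-multiplication test above can be sketched in a few lines. This is a hedged illustration: the threshold value `tau=10.0` is an arbitrary placeholder (the slides choose it empirically), `mult_check` is a hypothetical name, and `sys.float_info.epsilon` is used for u, matching the definition on this slide for double precision.

```python
import sys

U = sys.float_info.epsilon  # gap between 1.0 and the next larger double

def matvec(M, v):
    return [sum(a * x for a, x in zip(row, v)) for row in M]

def matmul(A, B):
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def norm_inf_mat(M):
    return max(sum(abs(a) for a in row) for row in M)

def norm_inf_vec(v):
    return max(abs(x) for x in v)

def mult_check(A, B, P, tau=10.0):
    """Return (ratio, flagged) for the claimed product P of A and B.

    tau is an assumed placeholder threshold, not a value from the slides.
    """
    w = [1.0] * len(B[0])                    # all-ones probe vector
    Pw = matvec(P, w)
    ABw = matvec(A, matvec(B, w))            # two matrix-vector products
    d = [p - q for p, q in zip(Pw, ABw)]
    ratio = norm_inf_vec(d) / (norm_inf_mat(A) * norm_inf_mat(B) * norm_inf_vec(w))
    return ratio, ratio > tau * U

A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
P = matmul(A, B)
print(mult_check(A, B, P))   # fault-free product

P[0][1] += 1.0               # simulate a corrupted entry
print(mult_check(A, B, P))   # corrupted product
```

Because the ratio is normalized by the norms of A, B, and w, the same threshold can be reused across inputs of very different scale, which is the point of an input-independent tolerance.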

Page 13: Tests and Tolerances for High-Performance Software-Implemented Fault Detection

Error Propagation

QRD: $\|d\|_F / (\|A\|_F \|w\|_F) \gtrless \tau u$, where $d = \hat{Q}\hat{R} w - A w$

SVD: $\|d\| / (\|A\| \|w\|) \gtrless \tau u$, where $d = \hat{U}\hat{D}\hat{V}^T w - A w$

LUD: $\|d\| / (\|A\| \|w\|) \gtrless \tau u$, where $d = \hat{P}\hat{L}\hat{U} w - A w$

Page 14: Tests and Tolerances for High-Performance Software-Implemented Fault Detection

Error Propagation (contd.)

Solve $Ax = b$: $\|d\| / (\|A\| \|\hat{x}\|) \gtrless \tau u$, where $d = A \hat{x} - b$

Matrix inverse: $\|d\| / (\|A\| \|\hat{B}\| \|w\|) \gtrless \tau u$, where $d = \hat{B} A w - w$
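The system-solution test needs no probe vector, since the residual $A\hat{x} - b$ is already a vector. A minimal sketch follows; it assumes the infinity norm (the slide does not fix a norm), a placeholder threshold `tau=10.0`, and a hypothetical function name `solve_check`.

```python
import sys

U = sys.float_info.epsilon  # unit round-off scale for doubles

def solve_check(A, x_hat, b, tau=10.0):
    """Return (ratio, flagged) for a claimed solution x_hat of A x = b.

    Uses infinity norms; tau is an assumed placeholder threshold.
    """
    Ax = [sum(a * x for a, x in zip(row, x_hat)) for row in A]
    d = [p - q for p, q in zip(Ax, b)]                 # residual A x_hat - b
    norm_A = max(sum(abs(a) for a in row) for row in A)
    norm_x = max(abs(x) for x in x_hat)
    ratio = max(abs(e) for e in d) / (norm_A * norm_x)
    return ratio, ratio > tau * U

A = [[2.0, 1.0], [1.0, 3.0]]
b = [3.0, 5.0]
print(solve_check(A, [0.8, 1.4], b))   # correct solution up to round-off
print(solve_check(A, [0.8, 1.9], b))   # corrupted second component
```

The check costs one matrix-vector product, far less than solving the system again.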

Page 15: Tests and Tolerances for High-Performance Software-Implemented Fault Detection

Error Propagation - FFT

Forward transform: $d = (\hat{y} - Wx)^T w$
  W is the $n \times n$ forward transform matrix containing the Fourier basis functions
  w cannot have a sparse transform
  Error propagation: $\|e\| \le 5 n \log_2 n \, \|x\| u$
  Test: $|d| / (n \log_2 n \, \|x\|_2 \|w\|_2) \gtrless \tau u$

Inverse transform: $d = (\hat{x} - n^{-1} W^T y)^T w$
  Test: $|d| / (\log_2 n \, \|y\|_2 \|w\|_2) \gtrless \tau u$

Page 16: Tests and Tolerances for High-Performance Software-Implemented Fault Detection

Comparison Tests

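$\Delta$ = RHS - LHS and $\delta = \|\Delta w\|$ ($\Delta$ is never actually computed); $\tau$ is the threshold

T0: $\delta / \|w\| \gtrless \tau u$
  Trivial test: un-normalized comparison

T1: $\delta / (\sigma_1 \|w\|) \gtrless \tau u$
  Ideal test: may not always be computable

T2: $\delta / (\sigma_2 \|w\|) \gtrless \tau u$
  Approximate matrix test: based on computed quantities

T3: $\delta / (\|w\| + \sigma_3) \gtrless \tau u$
  Approximate vector test: higher chance of false alarms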

Page 17: Tests and Tolerances for High-Performance Software-Implemented Fault Detection
Page 18: Tests and Tolerances for High-Performance Software-Implemented Fault Detection

Experiments

Faults are injected in half the runs by changing a random bit of the algorithm's state space

Faults are injected at a random point during execution

The threshold value is chosen based on the error quantities computed under faulty and fault-free conditions
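A single-bit fault in a stored double can be simulated with the standard `struct` module. The sketch below is only an illustration of the fault model: it flips a bit in a value held in a Python list, whereas the experiments described here inject faults into the running algorithm's state. The names `flip_bit` and `inject_fault` are hypothetical.

```python
import random
import struct

def flip_bit(x, bit):
    """Flip one bit (0 = least significant, 63 = sign) of a 64-bit float."""
    (bits,) = struct.unpack("<Q", struct.pack("<d", x))
    (y,) = struct.unpack("<d", struct.pack("<Q", bits ^ (1 << bit)))
    return y

def inject_fault(values, rng):
    """Return a copy of values with one random bit of one random entry flipped."""
    i = rng.randrange(len(values))
    bit = rng.randrange(64)
    corrupted = list(values)
    corrupted[i] = flip_bit(values[i], bit)
    return corrupted, i, bit

print(flip_bit(1.0, 63))   # sign bit
print(flip_bit(1.0, 52))   # lowest exponent bit
print(flip_bit(1.0, 0))    # lowest mantissa bit
```

Flips in the low mantissa bits produce changes near round-off level, which is why some injected faults are inherently hard to separate from ordinary rounding.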

Page 19: Tests and Tolerances for High-Performance Software-Implemented Fault Detection

Choosing the threshold
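One way to picture the threshold choice is to sweep candidate thresholds over the test quantities recorded in fault-free and faulty runs and tabulate the false-alarm and detection rates at each. The sketch below assumes that setup; `roc_points` is a hypothetical name and the scores are made-up toy numbers, not results from the presentation.

```python
def roc_points(clean_scores, faulty_scores, thresholds):
    """For each threshold, return (threshold, false_alarm_rate, detection_rate).

    A run is flagged when its score exceeds the threshold.
    """
    points = []
    for t in thresholds:
        false_alarm = sum(s > t for s in clean_scores) / len(clean_scores)
        detection = sum(s > t for s in faulty_scores) / len(faulty_scores)
        points.append((t, false_alarm, detection))
    return points

# Toy scores in units of u (illustrative values only).
clean = [1.0, 2.0, 3.0, 4.0]
faulty = [3.0, 50.0, 100.0, 1000.0]
for point in roc_points(clean, faulty, [2.0, 10.0]):
    print(point)
```

Raising the threshold lowers the false-alarm rate at the cost of missing faults whose effect is close to round-off, which is the trade-off the threshold has to settle.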

Page 20: Tests and Tolerances for High-Performance Software-Implemented Fault Detection

[Figure: plot with curves labeled T0, T1, T2, and T3]

Page 21: Tests and Tolerances for High-Performance Software-Implemented Fault Detection


Page 22: Tests and Tolerances for High-Performance Software-Implemented Fault Detection

ROC for FFT

Alternate tests for FFT:

Parseval's condition: $(\|x\|_2 - n^{-1/2} \|y\|_2) / \|x\|_2 \gtrless \tau u$

Choosing a vector $w_2$ with real and imaginary parts equal to $\cos(4(k - n/2)/n)$, $k = 0, 1, \ldots, n-1$, and computing the difference as before
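Parseval's condition can be checked without recomputing the transform, since it only compares two norms. The sketch below uses a naive O(n^2) DFT as a stand-in for a library FFT. The names `dft`, `norm2`, and `parseval_ratio` are illustrative, and taking the absolute value of the difference is an assumption added so the result can be compared against a positive threshold.

```python
import cmath
import math

def dft(x):
    """Naive unnormalized DFT, standing in for a library FFT."""
    n = len(x)
    return [sum(x[m] * cmath.exp(-2j * cmath.pi * k * m / n) for m in range(n))
            for k in range(n)]

def norm2(v):
    return math.sqrt(sum(abs(z) ** 2 for z in v))

def parseval_ratio(x, y):
    """| ||x||_2 - n^(-1/2) ||y||_2 | / ||x||_2 for input x and claimed transform y."""
    n = len(x)
    return abs(norm2(x) - norm2(y) / math.sqrt(n)) / norm2(x)

x = [1.0, 2.0, 3.0, 4.0]
y = dft(x)
print(parseval_ratio(x, y))    # round-off level

y_bad = list(y)
y_bad[1] = 2 * y_bad[1]        # corrupt one output coefficient
print(parseval_ratio(x, y_bad))
```

This test only sees errors that change the energy of the output, so a fault that leaves $\|y\|_2$ unchanged would pass it.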

Page 23: Tests and Tolerances for High-Performance Software-Implemented Fault Detection

Related work

ABFT (algorithm-based fault tolerance): introduced by Huang & Abraham for matrix operations, 1984

Error detection is based on the algorithm employed: the matrix is encoded with a checksum matrix

Vastly extended by others for various numerical operations

Result Checking (RC): introduced by Blum & Wasserman, with a focus on computation errors, 1996

Prata & Silva compared the two and found RC more efficient than ABFT for matrix multiplication and QRD, 1999

Page 24: Tests and Tolerances for High-Performance Software-Implemented Fault Detection

Summary

Faults are detected based on conditions that the numerical output must satisfy

Implemented as wrappers around existing libraries

Run experiments under fault-free and faulty conditions and observe the decision criterion

$\tau_{ub} \gg \tau^*$ => the threshold can be set based on an average-case outlook rather than assuming the worst-case scenario

Select a trade-off between fault detection and false alarms

Can be extended to other common computations such as sorting, integration, etc.