Tests and Tolerances for High-Performance Software-Implemented Fault Detection

Michael Turmon, Robert Granat, Daniel S. Katz, John Z. Lou

Objective

Software fault detection in common numerical libraries by checking computed output

The faulty environment here essentially consists of bit flips in the application's state space

Distinguish between errors and round-offs in computed results

Faults and EDMs

Single Event Upsets

Radiation-induced errors causing bit flips in memory and cache

Affect application data and code

Data errors are more difficult to detect
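A single-bit upset in a stored double can be mimicked in a few lines. This is only an illustrative sketch: `flip_bit` is a hypothetical helper, not part of the authors' tooling. It shows why data errors are hard to spot: a flip in a low mantissa bit looks like round-off, while a flip in an exponent bit is catastrophic.

```python
import struct

def flip_bit(value: float, bit: int) -> float:
    """Return `value` with one bit of its IEEE 754 double encoding flipped."""
    if not 0 <= bit < 64:
        raise ValueError("bit must be in [0, 64)")
    # Reinterpret the 8 bytes of the double as an unsigned 64-bit integer.
    (as_int,) = struct.unpack("<Q", struct.pack("<d", value))
    # Flip the requested bit and reinterpret the bytes as a double again.
    (flipped,) = struct.unpack("<d", struct.pack("<Q", as_int ^ (1 << bit)))
    return flipped

x = 1.0
low = flip_bit(x, 0)    # lowest mantissa bit: a change the size of round-off
high = flip_bit(x, 62)  # top exponent bit: exponent becomes all ones
sign = flip_bit(x, 63)  # sign bit
```

Here `low` differs from `x` by one unit in the last place, which no tolerance-based test can separate from ordinary rounding.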

Error Detecting Middleware

Wrap existing numerical libraries

Avoid altering internals of the library

Checks are more efficient than the original computation
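The wrapping idea can be sketched as a higher-order function that calls the library routine unchanged and then tests a cheap post-condition on its output. Everything named here (`with_check`, `sqrt_postcondition`, the tolerance of `4 * U`) is an assumption made for illustration, with `math.sqrt` standing in for a numerical library call.

```python
import math
import sys

U = sys.float_info.epsilon  # gap between 1.0 and the next larger double

def with_check(library_fn, postcondition):
    """Wrap `library_fn` so every result is checked; internals are untouched."""
    def wrapped(*args):
        result = library_fn(*args)
        if not postcondition(args, result):
            raise ArithmeticError("post-condition failed")
        return result
    return wrapped

def sqrt_postcondition(args, result):
    """Necessary relation between input and output: result**2 is close to x."""
    (x,) = args
    return abs(result * result - x) <= 4 * U * abs(x)  # illustrative tolerance

checked_sqrt = with_check(math.sqrt, sqrt_postcondition)
root = checked_sqrt(2.0)  # passes the check

def faulty_sqrt(x):
    """Stand-in for a library call whose output was corrupted."""
    return math.sqrt(x) * 1.001

try:
    with_check(faulty_sqrt, sqrt_postcondition)(2.0)
    fault_detected = False
except ArithmeticError:
    fault_detected = True
```

The check costs one multiplication, which is the sense in which a post-condition should be cheaper than the computation it guards.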

Numerical Error Checking - Summary

Consider common numerical matrix computations

Use “post-conditions” to evaluate correctness

Post-condition: necessary relation between inputs & computed outputs

Use well-known upper bounds on error propagation within numerical algorithms for matrix computations

Define tests and tolerances to separate errors and round-offs

Develop input-independent tolerances

Definitions: Vector & Matrix norms

Vector: $\|v\|_1 = \sum_i |v_i|$

$\|v\|_\infty = \max_i |v_i|$

$\|v\|_2 = (\sum_i |v_i|^2)^{1/2}$

Matrices: $\|A\|_1$ = maximum absolute column sum of A

$\|A\|_\infty$ = maximum absolute row sum of A

$\|A\|_2$ = largest singular value of A

$\|A\|_F = (\sum_{i,j} |a_{ij}|^2)^{1/2}$
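The norm definitions translate directly into code. The following is a minimal sketch using plain Python lists; the function names are hypothetical, and the matrix 2-norm is left out because it needs a singular value routine.

```python
import math

def vec_norm_1(v):
    return sum(abs(x) for x in v)

def vec_norm_inf(v):
    return max(abs(x) for x in v)

def vec_norm_2(v):
    return math.sqrt(sum(abs(x) ** 2 for x in v))

def mat_norm_1(A):
    """Maximum absolute column sum."""
    return max(sum(abs(row[j]) for row in A) for j in range(len(A[0])))

def mat_norm_inf(A):
    """Maximum absolute row sum."""
    return max(sum(abs(x) for x in row) for row in A)

def mat_norm_fro(A):
    """Frobenius norm: square root of the sum of squared magnitudes."""
    return math.sqrt(sum(abs(x) ** 2 for row in A for x in row))

v = [3.0, -4.0]
A = [[1.0, -2.0], [3.0, 4.0]]
```

For this `A`, the column sums are 4 and 6 and the row sums are 3 and 7, so the 1-norm and the infinity-norm differ.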

Matrices review

Orthogonal matrix: $A A^T = I \Rightarrow A^{-1} = A^T$

Unitary matrix: $A^{*T} = A^{-1}$

Permutation matrix: reordered rows of $I$

Sub-multiplicative property: $\|Av\| \le \|A\| \|v\|$

$\|AB\| \le \|A\| \|B\|$
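The sub-multiplicative property is what later lets a bound on an error matrix become a bound on a probe-vector difference. As a quick numerical illustration (a sketch with arbitrarily chosen matrices and hypothetical helper names), the infinity norm satisfies both inequalities:

```python
def matmul(A, B):
    """Dense matrix product of two list-of-lists matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def norm_inf(A):
    """Maximum absolute row sum; for an n x 1 matrix this is the vector inf-norm."""
    return max(sum(abs(x) for x in row) for row in A)

A = [[1.0, -2.0], [3.0, 4.0]]
B = [[0.5, 1.0], [-1.0, 2.0]]
v = [[1.0], [-1.0]]  # column vector stored as an n x 1 matrix

lhs_mat, rhs_mat = norm_inf(matmul(A, B)), norm_inf(A) * norm_inf(B)
lhs_vec, rhs_vec = norm_inf(matmul(A, v)), norm_inf(A) * norm_inf(v)
```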

Numerical Functions

Matrix multiplication

QR decomposition: $A = QR$

A = input matrix, Q = orthogonal matrix, R = upper triangular matrix

Singular value decomposition: $A = U D V^T$

A = input matrix, D = diagonal matrix, U & V = orthogonal matrices

Numerical Functions (contd.)

LU decomposition: $A = PLU$

P = permutation matrix, L = lower triangular matrix, U = upper triangular matrix

System solution: solve for x in $Ax = b$, given A & b

Matrix inverse: given A, find B such that $AB = I$

Numerical Functions (contd.)

Fourier transform: given x, find y such that $y = Wx$, where W is the $n \times n$ matrix of Fourier basis functions, $W_{kl} = e^{-j 2\pi k l / n}$

Inverse Fourier transform: given y, find x such that $x = n^{-1} W^{*T} y$ (for this unnormalized W, $W^{-1} = n^{-1} W^{*T}$)
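To make the transform pair concrete, the sketch below builds the unnormalized Fourier matrix explicitly and inverts it with the scaled conjugate transpose. It is a naive O(n²) illustration with hypothetical helper names, not how an FFT library computes the transform.

```python
import cmath

def dft_matrix(n):
    """W[k][l] = exp(-2*pi*j*k*l/n), the unnormalized forward transform."""
    return [[cmath.exp(-2j * cmath.pi * k * l / n) for l in range(n)]
            for k in range(n)]

def matvec(M, x):
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def conj_transpose(M):
    """Row l of the result holds the conjugated column l of M."""
    return [[M[k][l].conjugate() for k in range(len(M))]
            for l in range(len(M[0]))]

n = 4
x = [1.0, 2.0, 0.0, -1.0]
W = dft_matrix(n)
y = matvec(W, x)                                        # forward: y = W x
x_back = [z / n for z in matvec(conj_transpose(W), y)]  # inverse: x = W^{*T} y / n
```

The first output `y[0]` is simply the sum of the inputs, and `x_back` agrees with `x` up to round-off.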

Operations & Post-conditions

Post-condition check $A = \hat{Q}\hat{R}$ (with $\hat{Q}$, $\hat{R}$ the computed factors) -> computationally intensive

Instead, multiply by a probe vector w and compare vectors

$A w$ >< $\hat{Q}\hat{R} w$

Choice of w

Elements of w should not vary greatly in magnitude

w should be non-zero everywhere

Can be a vector of all ones, except for FFT
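The saving from the probe vector comes from evaluating the product right to left, so only matrix-vector products are needed. A minimal sketch for the QR case follows. The 2×2 factors are hand-picked so that QR equals A exactly, and `probe_difference` is a hypothetical name.

```python
def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def probe_difference(A, Q, R, w):
    """Return d = Q (R w) - A w without ever forming the product Q R."""
    qrw = matvec(Q, matvec(R, w))  # two O(n^2) products instead of one O(n^3)
    aw = matvec(A, w)
    return [p - q for p, q in zip(qrw, aw)]

A = [[0.0, -2.0], [1.0, 3.0]]
Q = [[0.0, -1.0], [1.0, 0.0]]  # orthogonal (a 90-degree rotation)
R = [[1.0, 3.0], [0.0, 2.0]]   # upper triangular
w = [1.0, 1.0]                 # all-ones probe vector

d_ok = probe_difference(A, Q, R, w)

R_bad = [[1.0, 3.0], [0.0, 2.5]]  # one corrupted entry in the computed factor
d_bad = probe_difference(A, Q, R_bad, w)
```

A w that is zero in some position would hide errors in the matching column of the factor, which is one way to read the requirement that w be non-zero everywhere.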

Probe Vector


Error Propagation – Matrix multiplication

Error matrix: $E = \hat{P} - AB$, where $\hat{P} = \mathrm{mult}(A, B)$

$\|E\|_\infty \le n \|A\|_\infty \|B\|_\infty u$

u = difference between unity & the next larger floating-point number, n = dimension common to A & B

$d = \hat{P} w - A B w = E w$

$\|d\|_\infty = \|E w\|_\infty \le \|E\|_\infty \|w\|_\infty \le n \|A\|_\infty \|B\|_\infty \|w\|_\infty u$

$\|d\|_\infty / (\|A\|_\infty \|B\|_\infty \|w\|_\infty)$ >< $u$

n is ignored: in the average case, round-off errors are independent of dimension
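The normalized test for matrix multiplication can be sketched as follows. The matrices, the size of the injected error, and the multiplier of 100 on u are all illustrative assumptions, and `matmul_error_ratio` is a hypothetical name. Python's `sys.float_info.epsilon` matches the definition of u used here.

```python
import sys

U = sys.float_info.epsilon  # difference between 1.0 and the next larger double

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def norm_inf_mat(A):
    return max(sum(abs(x) for x in row) for row in A)

def norm_inf_vec(v):
    return max(abs(x) for x in v)

def matmul_error_ratio(A, B, P, w):
    """||P w - A (B w)||_inf / (||A||_inf ||B||_inf ||w||_inf)."""
    d = [p - q for p, q in zip(matvec(P, w), matvec(A, matvec(B, w)))]
    return norm_inf_vec(d) / (norm_inf_mat(A) * norm_inf_mat(B) * norm_inf_vec(w))

A = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6], [0.7, 0.8, 0.9]]
B = [[1.1, -0.3, 0.7], [0.2, 0.9, -0.5], [-0.6, 0.4, 1.3]]
w = [1.0, 1.0, 1.0]

P = matmul(A, B)
clean = matmul_error_ratio(A, B, P, w)

P_faulty = [row[:] for row in P]
P_faulty[1][2] += 1e-3  # stand-in for a flipped high-order mantissa bit
faulty = matmul_error_ratio(A, B, P_faulty, w)

threshold = 100 * U  # hypothetical choice, not a value from the slides
```

Because the ratio is divided by the norms of the inputs, the same threshold can be reused for inputs of very different scale.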

Error Propagation

QRD: $\|d\|_F / (\|A\|_F \|w\|_F)$ >< $u$

LUD: $d = \hat{P}\hat{L}\hat{U} w - A w$

Error Propagation (contd.)

Solve $Ax = b$: $\|d\| / (\|A\| \|x\|)$ >< $u$

Inverse: $d = \hat{B} A w - w$
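A sketch of both checks is given below. Two things are assumptions rather than statements from the slide: the difference for the solve is taken to be the usual residual d = A x − b, and the infinity norm is used where the slide leaves the norm unspecified. The function names are hypothetical.

```python
def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def norm_inf_mat(A):
    return max(sum(abs(x) for x in row) for row in A)

def norm_inf_vec(v):
    return max(abs(x) for x in v)

def solve_error_ratio(A, x, b):
    """||A x - b|| / (||A|| ||x||), with d assumed to be the residual A x - b."""
    d = [p - q for p, q in zip(matvec(A, x), b)]
    return norm_inf_vec(d) / (norm_inf_mat(A) * norm_inf_vec(x))

def inverse_difference(A, B, w):
    """d = B (A w) - w, evaluated right to left with matrix-vector products."""
    return [p - q for p, q in zip(matvec(B, matvec(A, w)), w)]

A = [[4.0, 7.0], [2.0, 6.0]]
b = [11.0, 8.0]

ratio_ok = solve_error_ratio(A, [1.0, 1.0], b)   # exact solution
ratio_bad = solve_error_ratio(A, [1.0, 1.5], b)  # corrupted second component

B = [[0.6, -0.7], [-0.2, 0.4]]                   # inverse of A (determinant 10)
d_inv = inverse_difference(A, B, [1.0, 1.0])
```

In the corrupted case the ratio is many orders of magnitude above u, while `d_inv` stays at round-off level.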

Error Propagation - FFT

Forward transform: $d = (\hat{y} - W x)^T w$

W is the $n \times n$ forward transform matrix containing the Fourier basis functions

w cannot have a sparse transform

Error propagation: $\|e\| \le 5 n \log_2 n \|x\| u$

$|d| / (n \log_2 n \|x\|_2 \|w\|_2)$ >< $u$
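One way to evaluate d cheaply is to rewrite it as y^T w − x^T (W^T w), so the transform of the probe vector is computed once and each check is two dot products. The sketch below assumes this rearrangement, uses a naive DFT in place of a library FFT, and picks w_k = 1/(k+1) as an arbitrary probe with a non-sparse transform. All names and the threshold are hypothetical.

```python
import cmath
import math
import sys

U = sys.float_info.epsilon

def dft(x):
    """Naive O(n^2) transform standing in for a library FFT call."""
    n = len(x)
    return [sum(x[l] * cmath.exp(-2j * cmath.pi * k * l / n) for l in range(n))
            for k in range(n)]

def norm2(v):
    return math.sqrt(sum(abs(z) ** 2 for z in v))

def fft_error_ratio(x, y, w, w_transform):
    """|y^T w - x^T (W^T w)| / (n log2(n) ||x||_2 ||w||_2)."""
    n = len(x)
    d = (sum(yk * wk for yk, wk in zip(y, w))
         - sum(xk * tk for xk, tk in zip(x, w_transform)))
    return abs(d) / (n * math.log2(n) * norm2(x) * norm2(w))

n = 8
x = [1.0, 2.0, 0.0, -1.0, 0.5, 3.0, -2.0, 1.5]
w = [1.0 / (k + 1) for k in range(n)]
w_transform = dft(w)  # W is symmetric, so W^T w = W w; computed once

y = dft(x)
clean = fft_error_ratio(x, y, w, w_transform)

y_faulty = list(y)
y_faulty[3] += 1e-3  # stand-in for a corrupted output element
faulty = fft_error_ratio(x, y_faulty, w, w_transform)

# The all-ones vector has a transform with a single non-zero entry.
ones_nonzeros = sum(abs(z) > 1e-9 for z in dft([1.0] * n))
threshold = 100 * U  # hypothetical
```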

Comparison Tests

$\Delta$ = RHS - LHS and $\delta = \|\Delta w\|$ ($\Delta$ is never actually computed)

T0: $\delta / \|w\|$ >< $u$

Approx. vector test: higher chance of false alarms

Experiments

Faults are injected in half the runs by changing a random bit of the algorithm's state space

Faults are injected at a random point of execution

The threshold value is chosen based on the error quantity computed under faulty and fault-free conditions
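A much-simplified version of such an experiment is sketched below for matrix multiplication. Unlike the setup described above, it flips a bit only in the finished product rather than at a random point of execution. The trial count, matrix size, seed, and threshold are arbitrary assumptions, and all function names are hypothetical.

```python
import math
import random
import struct
import sys

U = sys.float_info.epsilon

def flip_bit(value, bit):
    (as_int,) = struct.unpack("<Q", struct.pack("<d", value))
    return struct.unpack("<d", struct.pack("<Q", as_int ^ (1 << bit)))[0]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def error_ratio(A, B, P, w):
    d = [p - q for p, q in zip(matvec(P, w), matvec(A, matvec(B, w)))]
    if not all(math.isfinite(x) for x in d):
        return math.inf  # an inf or NaN in the output is an obvious fault
    norm = lambda M: max(sum(abs(x) for x in row) for row in M)
    return max(abs(x) for x in d) / (norm(A) * norm(B) * max(abs(x) for x in w))

def run_trials(trials=200, n=4, seed=0):
    rng = random.Random(seed)
    w = [1.0] * n
    clean, faulty = [], []
    for t in range(trials):
        A = [[rng.uniform(-1, 1) for _ in range(n)] for _ in range(n)]
        B = [[rng.uniform(-1, 1) for _ in range(n)] for _ in range(n)]
        P = matmul(A, B)
        if t % 2 == 1:  # inject a fault in half the runs
            i, j = rng.randrange(n), rng.randrange(n)
            P[i][j] = flip_bit(P[i][j], rng.randrange(64))
            faulty.append(error_ratio(A, B, P, w))
        else:
            clean.append(error_ratio(A, B, P, w))
    return clean, faulty

clean, faulty = run_trials()
threshold = 100 * U  # hypothetical; in practice chosen from the two score sets
false_alarm_rate = sum(s > threshold for s in clean) / len(clean)
detection_rate = sum(s > threshold for s in faulty) / len(faulty)
```

Flips in the lowest mantissa bits produce scores no larger than round-off, so some faults go undetected at any threshold that avoids false alarms.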

Choosing the threshold $\tau$

[Figure: curves for tests T0, T1, T2, and T3]

ROC for FFT

Alternate tests for FFT:

Parseval's condition: $(\|x\|_2 - n^{-1/2} \|y\|_2) / \|x\|_2$ >< $u$

Choosing a vector $w_2$ with real & imag. parts equal to $\cos(4(k - n/2)/n)$, $k = 0, 1, \ldots, n-1$, and computing the difference as before
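Parseval's condition needs only the two vector norms, so it is cheap to evaluate. The sketch below takes the absolute value of the numerator, which is an assumption added here so the score is non-negative. The naive `dft` stands in for a library FFT, and the names and threshold are hypothetical.

```python
import cmath
import math
import sys

U = sys.float_info.epsilon

def dft(x):
    """Naive unnormalized transform standing in for a library FFT call."""
    n = len(x)
    return [sum(x[l] * cmath.exp(-2j * cmath.pi * k * l / n) for l in range(n))
            for k in range(n)]

def norm2(v):
    return math.sqrt(sum(abs(z) ** 2 for z in v))

def parseval_ratio(x, y):
    """| ||x||_2 - ||y||_2 / sqrt(n) | / ||x||_2 for an unnormalized transform."""
    nx = norm2(x)
    return abs(nx - norm2(y) / math.sqrt(len(x))) / nx

x = [1.0, 2.0, 0.0, -1.0, 0.5, 3.0, -2.0, 1.5]
y = dft(x)
clean = parseval_ratio(x, y)

y_faulty = list(y)
y_faulty[0] += 1e-3  # stand-in for a corrupted output element
faulty = parseval_ratio(x, y_faulty)

threshold = 100 * U  # hypothetical
```

This test is blind to faults that leave the norm of y unchanged, such as a pure phase error in one element.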

Related work

ABFT (algorithm-based fault tolerance): introduced by Huang & Abraham for matrix operations, 1984

Error detection based on the algorithm employed; matrix encoded with a checksum matrix

Vastly extended by others for various numerical operations

Result checking (RC): introduced by Blum & Wasserman, with a focus on computation errors, 1996

Prata & Silva compared the two and found RC more efficient than ABFT for matrix multiplication & QRD, 1999

Summary

Faults detected based on conditions that numerical output must satisfy

Implemented as wrappers around existing libraries

Run experiments under fault-free & faulty conditions and observe the decision criterion

$\tau_{ub} \gg \tau^*$ => the threshold $\tau$ can be set based on an average-case outlook rather than assuming a worst-case scenario

Selecting a trade-off between fault detection & false alarms

Can be extended to other common computations like sorting, integration, etc.
