Introduction | Algorithm-Based Fault Tolerance | Model | Experiments | Conclusions

Combining Algorithm-Based Fault Tolerance and Checkpointing for Iterative Solvers

Massimiliano Fasi

Advisors: Yves Robert and Bora Uçar

25 June 2014

Massimiliano Fasi Combining ABFT and Checkpointing for Iterative Solvers

1 Introduction: Linear solvers; Silent errors

2 Algorithm-Based Fault Tolerance

3 Model

4 Experiments

5 Conclusions

Selective reliability

High energy mode: reliable, energy wasting

Low energy mode: unreliable, energy efficient

[Figure: energy level (low or high) over computational steps 1 to 9; steps are marked as computation or validation]

The Conjugate Gradient Method

Ax = b, with A ∈ R^{n×n} and x, b ∈ R^n

Remarks on line 5

only matrix operation

A is never modified

Require: A ∈ R^{n×n}, b, x_0 ∈ R^n, ε ∈ R

Ensure: x ∈ R^n : ‖Ax − b‖ ≤ ε

1: r_0 ← b − A x_0;

2: p_0 ← r_0;

3: i ← 0;

4: while ‖r_i‖ > ε (‖A‖ · ‖r_0‖ + ‖b‖) do

5: q_i ← A p_i;

6: α_i ← ‖r_i‖² / (p_iᵀ q_i);

7: x_{i+1} ← x_i + α_i p_i;

8: r_{i+1} ← r_i − α_i q_i;

9: β_i ← ‖r_{i+1}‖² / ‖r_i‖²;

10: p_{i+1} ← r_{i+1} + β_i p_i;

11: i ← i + 1;

12: end while

13: return x_i;
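The algorithm above can be transcribed into a short pure-Python sketch. This is an illustration only: the names `conjugate_gradient`, `matvec` and `dot`, the `max_iter` safeguard, and the use of the Frobenius norm for ‖A‖ are assumptions made here and do not come from the slides. A is assumed to be symmetric positive definite.

```python
import math


def matvec(A, x):
    """Dense matrix-vector product; the only operation that touches A."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def conjugate_gradient(A, b, x0, eps=1e-10, max_iter=1000):
    """Sketch of the CG loop; A is only read, never modified."""
    x = list(x0)
    r = [bi - yi for bi, yi in zip(b, matvec(A, x))]   # line 1
    p = list(r)                                        # line 2
    rr = dot(r, r)                                     # ||r_i||^2
    # Stopping threshold eps * (||A|| * ||r_0|| + ||b||).
    # Assumption: the Frobenius norm is used for ||A||.
    norm_A = math.sqrt(sum(a * a for row in A for a in row))
    threshold = eps * (norm_A * math.sqrt(rr) + math.sqrt(dot(b, b)))
    it = 0
    while math.sqrt(rr) > threshold and it < max_iter:  # line 4
        q = matvec(A, p)                                # line 5
        alpha = rr / dot(p, q)                          # line 6
        x = [xi + alpha * pi for xi, pi in zip(x, p)]   # line 7
        r = [ri - alpha * qi for ri, qi in zip(r, q)]   # line 8
        rr_new = dot(r, r)
        beta = rr_new / rr                              # line 9
        p = [ri + beta * pi for ri, pi in zip(r, p)]    # line 10
        rr = rr_new
        it += 1                                         # line 11
    return x


# Small symmetric positive definite example (hypothetical data).
solution = conjugate_gradient([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0], [0.0, 0.0])
```

The matrix-vector product on line 5 is isolated in `matvec`, which is the natural place to attach a checksum test since A is read but never written.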

Fail-stop errors

Easy to detect

Easy to localize and characterize

Expensive to correct

Checkpointing

Well suited for fail-stop errors

Cheaper than restarting from scratch

Trade-off: the best checkpointing interval
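One classical way to quantify this trade-off for fail-stop errors is Young's first-order approximation. It is not stated on the slides and is added here only as an illustration: with checkpoint cost C and mean time between failures μ, the time between two checkpoints is taken as sqrt(2 C μ). The function name and the example values below are hypothetical.

```python
import math


def young_interval(checkpoint_cost, mtbf):
    """First-order estimate of the time between checkpoints (Young's formula)."""
    if checkpoint_cost <= 0 or mtbf <= 0:
        raise ValueError("checkpoint_cost and mtbf must be positive")
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


# Hypothetical values: 60 s per checkpoint, one failure per day on average.
interval = young_interval(60.0, 86400.0)
```

Checkpointing more often than this spends too much time writing checkpoints; checkpointing less often loses too much work at each failure.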

Silent errors

Hard to detect

Hard to localize and characterize

Easy to correct (sometimes)

Checkpointing for silent errors

Is not always necessary

the computation can continue

small perturbations do not impact the solution

iterative methods can compensate for some errors

Requires verification

a validation mechanism has to be devised

some overhead cannot be avoided

finding a checkpointing interval becomes even more difficult
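The interplay between verification and checkpointing can be sketched as a loop that only saves a checkpoint after a successful verification and rolls back otherwise. This is a minimal sketch under stated assumptions, not the scheme of the talk: `run_with_verified_checkpoints`, `step`, `verify`, the chunk size `interval` and the `max_rollbacks` guard are all hypothetical names introduced for illustration.

```python
import copy


def run_with_verified_checkpoints(state, step, verify, n_steps, interval,
                                  max_rollbacks=100):
    """Advance `state` by n_steps, verifying every `interval` steps.

    A checkpoint is taken only after a successful verification, so it is
    assumed to be error-free; a failed verification restores it.
    """
    checkpoint = copy.deepcopy(state)
    done = 0            # number of steps covered by the last checkpoint
    rollbacks = 0
    while done < n_steps:
        chunk = min(interval, n_steps - done)
        for _ in range(chunk):
            state = step(state)
        if verify(state):
            checkpoint = copy.deepcopy(state)
            done += chunk
        else:
            rollbacks += 1
            if rollbacks > max_rollbacks:
                raise RuntimeError("too many rollbacks")
            state = copy.deepcopy(checkpoint)
    return state, rollbacks


# Toy usage: the invariant total == k is broken once by an injected error.
calls = {"n": 0}


def toy_step(s):
    calls["n"] += 1
    total = s["total"] + 1
    if calls["n"] == 5:          # one-off silent error
        total += 100
    return {"k": s["k"] + 1, "total": total}


def toy_verify(s):
    return s["total"] == s["k"]


final_state, n_rollbacks = run_with_verified_checkpoints(
    {"k": 0, "total": 0}, toy_step, toy_verify, n_steps=10, interval=3)
```

In the toy run the error injected at the fifth call is caught at the next verification, the three steps of that chunk are re-executed, and the final state is the error-free one.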

Silent error sources

A × x = y

Arithmetic operations

bit flip of the result

Memory read

in A: bit flip in one entry, horizontal shift, vertical shift (1 row)

in x: bit flip in one entry
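To make "bit flip in one entry" concrete, the following sketch flips a single bit in the IEEE 754 binary64 representation of a Python float. The helper name `flip_bit` is hypothetical; the slides do not prescribe how faults are injected.

```python
import struct


def flip_bit(value, bit):
    """Return `value` with one bit (0 = least significant) of its
    64-bit IEEE 754 representation inverted."""
    if not 0 <= bit <= 63:
        raise ValueError("bit must be in 0..63")
    (bits,) = struct.unpack("<Q", struct.pack("<d", value))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", bits ^ (1 << bit)))
    return flipped


tiny_change = flip_bit(1.0, 0)     # lowest mantissa bit
halved = flip_bit(1.0, 52)         # lowest exponent bit
negated = flip_bit(1.0, 63)        # sign bit
```

The effect depends strongly on the position: a flip in a low mantissa bit is a small perturbation that an iterative method may absorb, while a flip in the exponent or sign changes the value drastically.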

No error

[Figure: sparse 8×8 matrix A multiplied by x = (1, 1, 1, 1, 1, 1, 1, 1)ᵀ, giving y = (1, 2, 2, 1, 5, 3, 4, 6)ᵀ. Column checksums cᵀA = (-1, 0, 1, 7, 0, 6, 0, 11). Checksum values shown: 24, 24, 24.]

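Error in the computation

[Figure: sparse 8×8 matrix A multiplied by x = (1, 1, 1, 1, 1, 1, 1, 1)ᵀ, with one corrupted entry in the result, y = (1, 2, 2, 1, 5, 5, 4, 6)ᵀ. Column checksums cᵀA = (-1, 0, 1, 7, 0, 6, 0, 11). Checksum values shown: 24, 24, 26.]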
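The checksum test behind these two slides can be replayed with the numbers shown on them: the column checksums cᵀA applied to x must equal the sum of the entries of y. The sketch below uses only those vectors; the function names and the `tol` parameter are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def checksum_matches(col_checksums, x, y, tol=0):
    """Compare (c^T A) x with c^T y for c = (1, ..., 1)."""
    return abs(dot(col_checksums, x) - sum(y)) <= tol


col_checksums = [-1, 0, 1, 7, 0, 6, 0, 11]   # c^T A, from the slides
x = [1] * 8
y_correct = [1, 2, 2, 1, 5, 3, 4, 6]         # "No error" slide
y_corrupt = [1, 2, 2, 1, 5, 5, 4, 6]         # "Error in the computation" slide

ok = checksum_matches(col_checksums, x, y_correct)
detected = not checksum_matches(col_checksums, x, y_corrupt)
```

With floating-point data the comparison would need a nonzero tolerance, which is why `tol` is a parameter here; the integer example needs none.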

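Error in x

[Figure: sparse 8×8 matrix A multiplied by a vector whose first entry is corrupted, x = (2, 1, 1, 1, 1, 1, 1, 1)ᵀ, giving y = (2, 2, 2, 1, 5, 3, 4, 4)ᵀ. Column checksums cᵀA = (-1, 0, 1, 7, 0, 6, 0, 11). Checksum values shown: 23, 24, 23.]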

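Error in x

[Figure: sparse 8×8 matrix A multiplied by a vector whose second entry is corrupted, x = (1, 2, 1, 1, 1, 1, 1, 1)ᵀ, giving y = (1, 4, 2, 0, 5, 2, 4, 6)ᵀ. Column checksums cᵀA = (-1, 0, 1, 7, 0, 6, 0, 11). Checksum values shown: 24, 24, 24.]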
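The difference between the two "Error in x" slides can be replayed numerically. The sketch assumes that the check compares (cᵀA)x against an error-free reference value of 24, as the numbers on the slides suggest; how that reference is obtained is not visible in the extracted slides, and the names below are hypothetical.

```python
col_checksums = [-1, 0, 1, 7, 0, 6, 0, 11]   # c^T A, from the slides
reference = 24                                # error-free checksum


def checksum(x):
    return sum(c * xi for c, xi in zip(col_checksums, x))


x_first_corrupt = [2, 1, 1, 1, 1, 1, 1, 1]    # error in entry 1
x_second_corrupt = [1, 2, 1, 1, 1, 1, 1, 1]   # error in entry 2

first_detected = checksum(x_first_corrupt) != reference
second_detected = checksum(x_second_corrupt) != reference

# Entries of x that the checksum cannot see (1-based column numbers).
blind_columns = [j + 1 for j, c in enumerate(col_checksums) if c == 0]
```

An error in an entry of x is multiplied by the checksum of the corresponding column, so it vanishes from the test whenever that column checksum is zero.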

How to overcome that issue

Random weight vector

Checksum shifting

Matrix splitting

Hierarchical partitioning

(cᵀA) x = cᵀ (Ax)

c = (1 1 1 ... 1)

More generally: c = (c1 c2 c3 ... cn)
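The identity (cᵀA) x = cᵀ (Ax) holds for any weight vector c, which is what makes a non-uniform c usable. The sketch below uses a small hypothetical 3×3 matrix whose second column sums to zero, and a fixed vector `c_weighted` standing in for randomly drawn weights so that the example is reproducible; none of these values come from the slides.

```python
def matvec(A, x):
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def column_checksums(A, c):
    """Row vector c^T A."""
    return [sum(c[i] * A[i][j] for i in range(len(A)))
            for j in range(len(A[0]))]


A = [[4, 1, 0],
     [1, -2, 1],
     [0, 1, 3]]            # hypothetical matrix; column 2 sums to zero
x = [1, 1, 1]
x_corrupt = [1, 3, 1]      # error in the second entry

c_ones = [1, 1, 1]
c_weighted = [1, 2, 4]     # stand-in for a random weight vector

ones_row = column_checksums(A, c_ones)
weighted_row = column_checksums(A, c_weighted)

# References computed from the error-free x.
ones_detects = dot(ones_row, x_corrupt) != dot(ones_row, x)
weighted_detects = dot(weighted_row, x_corrupt) != dot(weighted_row, x)
```

With c = (1, 1, 1) the zero column checksum hides the error, while the weighted vector gives that column a nonzero checksum and exposes it.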

Checksum shifting

42 40

18

1

2

3

4

5

6

7

8

-2 2

2 -4 1

-1 -2

-1 1 -3

-3

-2

2 2 2 2 2 2 2 2

1 2 3 9 2 8 2 13

×

1

2

1

1

1

1

1

1

1

1

1

1

1

1

1

1

=

1

4

2

0

5

2

4

6

Massimiliano Fasi Combining ABFT and Checkpointing for Iterative Solvers

Draft

15/25

IntroductionAlgorithm-Based Fault Tolerance

ModelExperimentsConclusions

Checksum shifting

42 4042

18

1

2

3

4

5

6

7

8

-2 2

2 -4 1

-1 -2

-1 1 -3

-3

-2

2 2 2 2 2 2 2 2

1 2 3 9 2 8 2 13

×

1

2

1

1

1

1

1

1

1

1

1

1

1

1

1

1

=

1

4

2

0

5

2

4

6
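The effect of shifting can be reproduced on a much smaller case. The following is a minimal sketch, not the slide's 8 × 8 example: the 3 × 3 matrix, the helper names (`plain_check`, `shifted_check`) and the assumption that a trusted copy of x is available for the reference side of the test are all hypothetical and chosen only for illustration. Integer data is used so that exact equality is meaningful; with floating-point data a tolerance would be needed.

```python
def matvec(A, x):
    """Dense matrix-vector product on plain Python lists."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]


def column_sums(A):
    """Checksum row c^T A for the all-ones weight vector c."""
    return [sum(col) for col in zip(*A)]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def plain_check(A, x_ref, x_used):
    """Compare (c^T A) x_ref with c^T (A x_used), where c is all ones."""
    return dot(column_sums(A), x_ref) == sum(matvec(A, x_used))


def shifted_check(A, x_ref, x_used, shift):
    """Same test with every column checksum shifted by `shift`."""
    shifted = [cj + shift for cj in column_sums(A)]
    y = matvec(A, x_used)
    return dot(shifted, x_ref) == sum(y) + shift * sum(x_used)


# Hypothetical 3 x 3 matrix whose second column sums to zero.
A = [[2, 1, 0],
     [0, -3, 1],
     [1, 2, 4]]
x_ref = [1, 1, 1]   # trusted copy of x (assumption of this sketch)
x_bad = [1, 2, 1]   # x with its second entry corrupted

missed = plain_check(A, x_ref, x_bad)            # zero column sum hides the error
caught = not shifted_check(A, x_ref, x_bad, 2)   # the shift exposes it
clean_ok = shifted_check(A, x_ref, x_ref, 2)     # no false alarm on correct data
print(missed, caught, clean_ok)
```

Any shift that leaves no zero entry in the shifted checksum row would serve here; the value 2 simply mirrors the one shown on the slide.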


Summary of ABFT results

                          checksum computation   SpMxV overhead
single error detection    ∼ nnz                  ∼ 4n
k errors detection        ∼ k nnz                ∼ 4kn
single error correction   ∼ 2 nnz                ∼ 8n
k errors correction       ?                      ?

Table: ABFT techniques


Not all that seems so is an error

Theorem

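Let $A \in \mathbb{R}^{n \times n}$, $x \in \mathbb{R}^{n}$, $c \in \mathbb{R}^{n}$. Then, if all of the sums involved in the matrix operations are performed using some flavour of recursive summation, it holds that

$$\left| \mathrm{fl}\left((c^{T}A)\,x\right) - \mathrm{fl}\left(c^{T}(Ax)\right) \right| \le 2\,\gamma_{2n}\,|c^{T}|\,|A|\,|x| .$$

$$\left| \mathrm{fl}\left((c^{T}A)\,x\right) - \mathrm{fl}\left(c^{T}(Ax)\right) \right| \le 2\,\gamma_{2n}\,n\,\|c^{T}\|_{\infty}\,\|A\|_{1}\,\|x\|_{\infty}$$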
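In practice the bound can serve as the threshold below which a checksum mismatch is attributed to rounding rather than to a fault. The sketch below is one possible reading, under stated assumptions: the slide does not define the constant, so the code assumes the common convention γ_m = m u / (1 − m u) with u the unit roundoff of double precision, and the function names and the small test matrix are hypothetical.

```python
import sys


def rdot(u, v):
    """Dot product accumulated left to right (recursive summation)."""
    s = 0.0
    for a, b in zip(u, v):
        s += a * b
    return s


def gamma(m, u=sys.float_info.epsilon / 2):
    """gamma_m = m*u / (1 - m*u); assumed definition, u is the unit roundoff."""
    mu = m * u
    if mu >= 1.0:
        raise ValueError("m * u must be below 1")
    return mu / (1.0 - mu)


def tolerance(A, x, c):
    """Componentwise bound 2 * gamma_{2n} * |c^T| |A| |x| from the theorem."""
    n = len(x)
    abs_x = [abs(xj) for xj in x]
    abs_Ax = [rdot([abs(a) for a in row], abs_x) for row in A]
    return 2.0 * gamma(2 * n) * rdot([abs(ci) for ci in c], abs_Ax)


def product_is_consistent(A, x, y, c):
    """Accept y as A x when the two checksums differ by no more than the bound."""
    cTA = [rdot(c, col) for col in zip(*A)]   # checksum row c^T A
    return abs(rdot(cTA, x) - rdot(c, y)) <= tolerance(A, x, c)


# Hypothetical data, used only to exercise the test.
A = [[0.1, 0.2, 0.0],
     [0.3, 0.4, 0.5],
     [0.0, 0.6, 0.7]]
x = [0.5, 0.7, 0.9]
c = [1.0, 1.0, 1.0]

y = [rdot(row, x) for row in A]       # fault-free product
y_bad = [y[0] + 1e-6] + y[1:]         # product with one perturbed entry

ok = product_is_consistent(A, x, y, c)
flagged = not product_is_consistent(A, x, y_bad, c)
print(ok, flagged)
```

The sums are written as explicit loops so that they match the recursive summation assumed by the theorem, rather than relying on how a library routine accumulates.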


Preliminaries

Why combine

checkpointing (CP) needs a verification mechanism

ABFT's worst case could require restarting from scratch

Why a trade-off

the CP interval depends on the probability of uncorrectable errors

the per-iteration overhead depends on the kind of ABFT protection

Goal: minimize the expected global execution time

Idea: minimize the expected overhead (ABFT and CP) of a frame


Expected execution time

p = probability that all errors in a frame are correctable

s = checkpoint interval (number of iterations per frame)

k = number of correctable errors

$$E\left(T_s\right) = p\,s\,T_{\mathrm{iter}} + (1 - p)\Bigl(E\left(T_{\mathrm{lost}}\right) + T_{\mathrm{recovery}} + E\left(T_s\right)\Bigr)$$

With $E\left(T_{\mathrm{lost}}\right) = \frac{s + 1}{2}\,T_{\mathrm{iter}}$ and all quantities indexed by k:

$$E\left(T_s^{(k)}\right) = p_k\,s\,T_{\mathrm{iter}}^{(k)} + (1 - p_k)\left(\frac{s + 1}{2}\,T_{\mathrm{iter}}^{(k)} + T_{\mathrm{recovery}} + E\left(T_s^{(k)}\right)\right)$$

$$p_k = \sum_{i=0}^{k} q_i^{(k)}\left(s\,T_{\mathrm{iter}}^{(k)}\right), \qquad q_\ell^{(k)}(T) = \binom{M}{\ell}\left(1 - e^{-\lambda T}\right)^{\ell} e^{-\lambda T (M - \ell)}$$
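Because the expected frame time appears on both sides of the recursion, it can be isolated. The short derivation below is not on the slide; it only rearranges the final form of the equation above and assumes $p_k > 0$.

$$p_k\,E\left(T_s^{(k)}\right) = p_k\,s\,T_{\mathrm{iter}}^{(k)} + (1 - p_k)\left(\frac{s + 1}{2}\,T_{\mathrm{iter}}^{(k)} + T_{\mathrm{recovery}}\right)$$

$$E\left(T_s^{(k)}\right) = s\,T_{\mathrm{iter}}^{(k)} + \frac{1 - p_k}{p_k}\left(\frac{s + 1}{2}\,T_{\mathrm{iter}}^{(k)} + T_{\mathrm{recovery}}\right)$$

The first term is the useful work of the frame; the second is the expected cost of the failed attempts, which grows as $p_k$ decreases.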


A probabilistic model

Model

The checkpoint interval that minimizes the expected wasted time is

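$$s^{*} = \operatorname*{argmin}_{s \in \mathbb{N}} \frac{E\left(T_s^{(k)}\right) - s\,T_{\mathrm{iter}}^{(k)} + T_{\mathrm{checkpoint}}}{s\,T_{\mathrm{iter}}^{(k)}} .$$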
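One way to evaluate this model numerically is a plain search over s. The sketch below is an illustration under stated assumptions, not the authors' code: it uses the closed form obtained by solving the recursion for the expected frame time (valid when p_k > 0), restricts the search to a finite range `s_max`, and treats M, λ and all timing values as given parameters with hypothetical sample values, since the slides do not fix them. All function and parameter names are invented for the example.

```python
from math import comb, exp, inf


def q(ell, T, M, lam):
    """Slide's q_ell(T): binomial term with M trials and rate lam over time T."""
    hit = 1.0 - exp(-lam * T)
    return comb(M, ell) * hit**ell * exp(-lam * T * (M - ell))


def p_frame(s, k, t_iter, M, lam):
    """p_k: sum of q_i over i = 0..k for a frame of s iterations."""
    return sum(q(i, s * t_iter, M, lam) for i in range(k + 1))


def expected_frame_time(s, k, t_iter, t_rec, M, lam):
    """Closed form of the recursion for E(T_s^(k)), valid when p_k > 0."""
    p = p_frame(s, k, t_iter, M, lam)
    if p == 0.0:
        return inf
    lost = (s + 1) / 2 * t_iter
    return s * t_iter + (1.0 - p) / p * (lost + t_rec)


def overhead(s, k, t_iter, t_rec, t_cp, M, lam):
    """Objective of the model: expected wasted time per unit of useful work."""
    useful = s * t_iter
    wasted = expected_frame_time(s, k, t_iter, t_rec, M, lam) - useful + t_cp
    return wasted / useful


def best_interval(k, t_iter, t_rec, t_cp, M, lam, s_max=200):
    """Exhaustive search for the minimizer over 1..s_max (finite-range assumption)."""
    return min(range(1, s_max + 1),
               key=lambda s: overhead(s, k, t_iter, t_rec, t_cp, M, lam))


# Hypothetical parameter values, for illustration only.
params = dict(k=1, t_iter=1.0, t_rec=5.0, t_cp=2.0, M=100, lam=1e-4)
s_best = best_interval(**params)
print(s_best, overhead(s_best, **params))
```

An exhaustive search is used because the objective is cheap to evaluate and s is an integer; for very large ranges a smarter search could replace it.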


Test problems

            n       nnz(A)   κ(A)          Convergence
BCSSTK09    1083    18437    3.10173e+04   linear
P3D         27000   183600   6.45723e+02   quadratic
THERMAL1    82654   574458   4.96250e+05   sublinear

[From similar studies]


Empirical validation

[Plots omitted; the horizontal axes run from 0 to 100.]

Figure: Execution time vs checkpoint interval. The expected execution time (continuous line) is compared with the experimentally obtained one (circles), for both CP + ABFT detection (top) and CP + ABFT correction (bottom).


Experimental comparison

[Plots omitted; legend: CG-1D, CG-2D1C; the horizontal axes run from 10^1 to 10^5.]

Figure: Execution time vs reciprocal of the normalized fault rate for both plain checkpointing (CG-1D) and mixed strategy (CG-2D1C).

            min        max
BCSSTK09    -2.23 %    12.78 %
P3D         -0.60 %    26.76 %
THERMAL1    -0.08 %    40.44 %

Table: Relative gain of CG-2D1C with respect to CG-1D.


Summary

silent errors are treacherous

checkpointing needs a verification mechanism

ABFT detection is cheap and reliable

ABFT correction can improve checkpointing's performance

a trade-off can be established

the same analysis holds for other iterative linear solvers


Future work

General ABFT improvements

extend error correction capabilities for matrix representations

extend the approach to other matrix operations

develop accurate estimates for floating-point errors

Other applications of the ABFT/checkpointing solution

Preconditioned Conjugate Gradient

ABFT for dense iterative methods
