Salishan Workshop, April 28, 2015, Gleneden Beach, Oregon

On Numerical Resiliency in Numerical Linear Algebra Solvers

Luc Giraud, joint work with E. Agullo (Inria), P. Salas (Sherbrooke Univ.), F. Yetkin (Inria) and M. Zounon (Inria)

HiePACS - Inria Project, Inria Bordeaux Sud-Ouest



Outline

1. Hard fault: Interpolation-Restart strategies
   In Krylov subspace linear solvers
   In revisited eigensolvers

2. Soft errors in Conjugate Gradient (preliminary)

L. Giraud - On Numerical Resilience in Linear Algebra 2 / 31


Hard fault: Interpolation-Restart strategies
In Krylov subspace linear solvers

Ax = b

Objective: design resilient algorithms for solving sparse linear systems

Two classes of iterative methods:
Stationary methods (Jacobi, Gauss-Seidel, ...)
Krylov subspace methods (CG, GMRES, Bi-CGStab, ...)

Krylov methods are widely used, but effort is required to take them to extreme scale


[Figure: the system Ax = b distributed across four processes P1 to P4; legend: static data, dynamic data, lost data, interpolated data]

We distinguish two categories of data:
Static data
Dynamic data

What happens when P1 crashes?
The failed process is replaced
Static data are restored

Reset: set x1 (the part of x held by P1) to its initial value and restart
IR: interpolate x1 and restart

General assumptions

1. Large dimensional vectors and matrices are distributed

2. Scalars and low dimensional vectors and matrices are replicated

3. There is a system mechanism to report faults

4. Faulty process is replaced

5. Static data are restored

Interpolation methods

Fault in the linear system

$$
\begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}
\begin{pmatrix} ? \\ x_2 \end{pmatrix}
=
\begin{pmatrix} b_1 \\ b_2 \end{pmatrix}
$$

How to interpolate x1?

Linear Interpolation (LI) [J. Langou et al., 2007]

Solve $A_{11} x_1 = b_1 - A_{12} x_2$ ($A_{11}$ must be nonsingular)

Least Squares Interpolation (LSI)

$$
\begin{pmatrix} A_{11} \\ A_{21} \end{pmatrix} x_1
+ \begin{pmatrix} A_{12} \\ A_{22} \end{pmatrix} x_2
= \begin{pmatrix} b_1 \\ b_2 \end{pmatrix},
\qquad
x_1 = \operatorname*{argmin}_{x}
\left\| \begin{pmatrix} b_1 \\ b_2 \end{pmatrix}
- \begin{pmatrix} A_{12} \\ A_{22} \end{pmatrix} x_2
- \begin{pmatrix} A_{11} \\ A_{21} \end{pmatrix} x \right\|_2
$$
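The two interpolation formulas can be sketched in a few lines of NumPy. This is only an illustration on a small dense system (the talk targets sparse, distributed matrices); the helper names `li_interpolate` and `lsi_interpolate`, the index array `lost`, and the test matrix are assumptions made for this example, not part of the slides.

```python
import numpy as np

def li_interpolate(A, b, x, lost):
    """Linear Interpolation: solve A11 x1 = b1 - A12 x2 for the lost block x1."""
    kept = np.setdiff1d(np.arange(A.shape[0]), lost)
    A11 = A[np.ix_(lost, lost)]
    A12 = A[np.ix_(lost, kept)]
    x_new = x.copy()
    x_new[lost] = np.linalg.solve(A11, b[lost] - A12 @ x[kept])
    return x_new

def lsi_interpolate(A, b, x, lost):
    """Least Squares Interpolation: x1 = argmin || b - A[:, kept] x2 - A[:, lost] x1 ||_2."""
    kept = np.setdiff1d(np.arange(A.shape[0]), lost)
    rhs = b - A[:, kept] @ x[kept]
    x1, *_ = np.linalg.lstsq(A[:, lost], rhs, rcond=None)
    x_new = x.copy()
    x_new[lost] = x1
    return x_new

# Hypothetical test problem: a well-conditioned SPD matrix and a known solution.
rng = np.random.default_rng(0)
n = 12
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)
x_true = rng.standard_normal(n)
b = A @ x_true

# Simulate a fault: the first three entries of x (the block x1) are lost.
lost = np.arange(3)
x_faulty = x_true.copy()
x_faulty[lost] = 0.0

x_li = li_interpolate(A, b, x_faulty, lost)
x_lsi = lsi_interpolate(A, b, x_faulty, lost)
```

In this toy setting the surviving block x2 is exact, so both strategies recover x1 exactly up to rounding. In a running solver x2 is only the current iterate, and the interpolated x1 is an approximation used to restart.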

Main properties of Interpolation-Restart strategies

LI property: the LI strategy preserves the monotone decrease of the A-norm of the forward error of CG or PCG.

LSI property: the LSI strategy preserves the monotone decrease of the residual norm of minimal residual Krylov subspace methods such as GMRES and MinRES.
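Both properties follow from an optimality argument: with x2 fixed, LI picks the x1 that minimizes the A-norm of the error (for symmetric positive definite A), and LSI picks the x1 that minimizes the residual 2-norm. The NumPy check below illustrates this on a small dense problem. The matrix, the perturbed iterate `x_k`, and the helper names are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20
lost = np.arange(5)          # indices of the lost block x1
kept = np.arange(5, n)       # indices of the surviving block x2

# Hypothetical SPD system with a known solution.
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)
x_star = rng.standard_normal(n)
b = A @ x_star

# Stand-in for the iterate just before the fault (not yet converged).
x_k = x_star + 1e-2 * rng.standard_normal(n)

def a_norm_error(x):
    e = x - x_star
    return np.sqrt(e @ A @ e)

def residual_norm(x):
    return np.linalg.norm(b - A @ x)

# LI: solve A11 x1 = b1 - A12 x2 with the surviving x2.
x_li = x_k.copy()
x_li[lost] = np.linalg.solve(A[np.ix_(lost, lost)],
                             b[lost] - A[np.ix_(lost, kept)] @ x_k[kept])

# LSI: least-squares fit of x1 against the full residual.
x_lsi = x_k.copy()
x_lsi[lost] = np.linalg.lstsq(A[:, lost], b - A[:, kept] @ x_k[kept],
                              rcond=None)[0]

print("A-norm error   before / after LI :", a_norm_error(x_k), a_norm_error(x_li))
print("residual norm  before / after LSI:", residual_norm(x_k), residual_norm(x_lsi))
```

The interpolated vector is never worse than the pre-fault iterate in the norm that the corresponding solver minimizes, which is why restarting from it keeps the convergence curve monotone.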

Experimental results

Impact of lost data volume

[Figures: relative residual ||b - Ax|| / ||b|| (log scale, 1e-11 to 1) versus iteration (0 to 980) for GMRES(100) on Averous/epb3. The first plot shows the NF curve alone; the following plots show the Reset, LI, LSI and ER curves for 10 faults with 0.001%, 0.2%, 0.8% and 3% data loss.]

Experimental results

Impact of fault rate

[Figures: relative residual ||b - Ax|| / ||b|| (log scale, 1e-11 to 1) versus iteration (0 to 1400) for GMRES(100) on Kim with 0.2% data loss, for 4, 8, 17 and 40 faults. Curves: Reset, LI, LSI, ER.]

Penalty of restart strategy on PCG

[Figure: A-norm of the error (log scale, 1e-11 to 1) versus iteration (0 to 110) for PCG on Cunningham/qa8fm with 9 faults. Curves: Reset, LI, LSI, ER, NF.]

Penalty of restart strategy on GMRES

[Figure: relative residual ||b - Ax|| / ||b|| (log scale, 1e-11 to 1) versus iteration (0 to 1400) for preconditioned GMRES(100) on Kim1 with 8 faults. Curves: Reset, LI, LSI, ER, NF.]

Summary on resilient Krylov linear solvers

Key features of Interpolation-Restart strategies:

They make sense: key properties of CG and GMRES are preserved

The restarting effect remains reasonable in the restarted GMRES context

Robust even with a high fault rate and a high volume of lost data

No fault implies no overhead

Extended to multiple faults
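The Reset and LI recoveries can be compared on a toy conjugate gradient run. The sketch below is a sequential simulation, not the distributed implementation behind the experiments: the 1D Laplacian test matrix, the fault at iteration 15, the lost index range, and all function names are assumptions made for illustration.

```python
import numpy as np

def cg(A, b, x0, max_iter, tol=1e-12):
    """Plain conjugate gradient started from x0; stops on a small recurrence residual."""
    x = x0.copy()
    r = b - A @ x
    p = r.copy()
    rs = r @ r
    b_norm = np.linalg.norm(b)
    for _ in range(max_iter):
        if np.sqrt(rs) <= tol * b_norm:
            break
        Ap = A @ p
        alpha = rs / (p @ Ap)
        x = x + alpha * p
        r = r - alpha * Ap
        rs_new = r @ r
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

# Hypothetical SPD test problem: 1D Laplacian with a known solution.
n = 50
A = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
rng = np.random.default_rng(0)
x_star = rng.standard_normal(n)
b = A @ x_star
x0 = np.zeros(n)

def a_norm_error(x):
    e = x - x_star
    return np.sqrt(e @ A @ e)

# Run a few iterations, then simulate the loss of the first 10 entries of x.
x_k = cg(A, b, x0, max_iter=15)
lost = np.arange(10)
kept = np.arange(10, n)

# Reset: lost entries go back to their initial value.
x_reset = x_k.copy()
x_reset[lost] = x0[lost]

# LI: lost entries are interpolated from A11 x1 = b1 - A12 x2.
x_li = x_k.copy()
x_li[lost] = np.linalg.solve(A[np.ix_(lost, lost)],
                             b[lost] - A[np.ix_(lost, kept)] @ x_k[kept])

err_before, err_reset, err_li = map(a_norm_error, (x_k, x_reset, x_li))
print("A-norm error before fault:", err_before)
print("A-norm error after Reset :", err_reset)
print("A-norm error after LI    :", err_li)

# Restart CG from each recovered vector.
x_final_reset = cg(A, b, x_reset, max_iter=500)
x_final_li = cg(A, b, x_li, max_iter=500)
```

Because LI minimizes the A-norm of the error over the lost block, its restart point is at least as good as the Reset one. Both restarts still converge; the difference is how much progress is lost at the fault.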

Hard fault: Interpolation-Restart strategies
In revisited eigensolvers

Au = λu with u ≠ 0, where A ∈ C^{n×n}, u ∈ C^n, and λ ∈ C

λ: eigenvalue
u: eigenvector
(λ, u): eigenpair

Two classes of methods:
Fixed-point methods (power method, subspace iteration)
Subspace methods (Jacobi-Davidson, Arnoldi, IRA/Krylov-Schur)

Interpolation strategies

Fault in the eigenproblem: the entries of u1 are lost, λ is replicated

$$
\begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}
\begin{pmatrix} ? \\ u_2 \end{pmatrix}
= \lambda
\begin{pmatrix} ? \\ u_2 \end{pmatrix}
$$

Linear Interpolation (LI)

$$
(A_{11} - \lambda I_1)\, u_1 = -A_{12} u_2
$$

Least Squares Interpolation (LSI)

$$
\begin{pmatrix} A_{11} \\ A_{21} \end{pmatrix} u_1
+ \begin{pmatrix} A_{12} \\ A_{22} \end{pmatrix} u_2
= \lambda \begin{pmatrix} u_1 \\ u_2 \end{pmatrix},
\qquad
u_1 = \operatorname*{argmin}_{u}
\left\| \begin{pmatrix} A_{11} - \lambda I_1 \\ A_{21} \end{pmatrix} u
+ \begin{pmatrix} A_{12} \\ A_{22} - \lambda I_2 \end{pmatrix} u_2 \right\|_2
$$
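The same two interpolations can be written for a lost eigenvector block. The sketch below uses a small real symmetric matrix so that `numpy.linalg.eigh` provides a reference eigenpair; the slides allow complex A, and the matrix, the choice of the largest eigenpair, and the variable names are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 10
lost = np.arange(3)          # indices of the lost block u1
kept = np.arange(3, n)       # indices of the surviving block u2

# Hypothetical symmetric test matrix and a reference eigenpair (lam, u).
M = rng.standard_normal((n, n))
A = (M + M.T) / 2
eigvals, eigvecs = np.linalg.eigh(A)
lam, u = eigvals[-1], eigvecs[:, -1]

# LI: (A11 - lam I1) u1 = -A12 u2
I1 = np.eye(lost.size)
u1_li = np.linalg.solve(A[np.ix_(lost, lost)] - lam * I1,
                        -A[np.ix_(lost, kept)] @ u[kept])

# LSI: u1 = argmin || [A11 - lam I1; A21] u1 + [A12; A22 - lam I2] u2 ||_2.
# The two stacked blocks are the lost and kept columns of (A - lam I).
B = A - lam * np.eye(n)
u1_lsi = np.linalg.lstsq(B[:, lost], -B[:, kept] @ u[kept], rcond=None)[0]

print("LI  recovery error:", np.linalg.norm(u1_li - u[lost]))
print("LSI recovery error:", np.linalg.norm(u1_lsi - u[lost]))
```

With an exact eigenpair both formulas reproduce u1 up to rounding. During an actual eigensolver run, λ and u2 are current approximations, so the interpolated block is only as accurate as they are.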


Eigensolvers revisited

Basic subspace iteration method

Subspace iteration with polynomial acceleration

Arnoldi method

Implicitly Restarted Arnoldi method

Jacobi-Davidson method (JDQR)


Jacobi-Davidson at a glance

JDQR algorithm

Compute a few (nev) eigenpairs associated with the eigenvalues closest to a target τ

Compute a partial Schur decomposition A Z_nconv = Z_nconv T_nconv

The next Schur vectors are searched for in the search space V_k

[Figure: residual ||Ax - λx|| / ||λ|| (log scale, 1e-07 to 1e+01) versus iteration (0 to 240); curve: NF]
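To make the partial Schur relation A Z = Z T concrete, the sketch below builds one for the eigenvalues closest to a target τ. It is not the Jacobi-Davidson iteration: it takes a dense eigendecomposition and orthonormalizes the selected eigenvectors with a QR factorization, purely to show what JDQR delivers. The random test matrix, `nev = 4`, and `tau = 0` are assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
n, nev, tau = 30, 4, 0.0

# Hypothetical nonsymmetric test matrix.
A = rng.standard_normal((n, n))

# Dense eigendecomposition, then keep the nev eigenvalues closest to tau.
w, X = np.linalg.eig(A)
idx = np.argsort(np.abs(w - tau))[:nev]

# Orthonormalize the selected eigenvectors: X_k = Z R.
# Then A Z = Z (R diag(w_k) R^{-1}), and T is upper triangular.
Z, R = np.linalg.qr(X[:, idx])
T = R @ np.diag(w[idx]) @ np.linalg.inv(R)

print("|| A Z - Z T ||      :", np.linalg.norm(A @ Z - Z @ T))
print("|| Z^H Z - I ||      :", np.linalg.norm(Z.conj().T @ Z - np.eye(nev)))
print("eigenvalues on diag T:", np.diag(T))
```

The columns of Z are orthonormal Schur vectors, and the diagonal of T carries the selected eigenvalues. This is the object the resilient variant has to rebuild after a fault.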

Resilient Jacobi-Davidson algorithm

Converged Schur vectors: A Z_nconv = Z_nconv T_nconv
Interpolate the nconv converged vectors

Search space: C_k = V_k^H A V_k
Interpolate a few (s) best candidates

[Figure: residual ||Ax - λx|| / ||λ|| (log scale, 1e-07 to 1e+01) versus iteration (0 to 240); curve: NF]

JD is restarted with a search space of nconv + s vectors
Flexibility to select the dimension of the search space for restarting

Experimental framework

Thermoacoustic instabilities in combustion chambers [Images courtesy of CERFACS]

[Figure: damaged rocket chamber from combustion instabilities]

[Figure: simulation of combustion instabilities in annular chambers]

Compute eigenpairs associated with the smallest-magnitude eigenvalues lying in the periphery of the spectrum

Experimental results

[Figures: residual ||Ax - λx|| / ||λ|| (log scale, 1e-07 to 1e+01) versus iteration (0 to 240), one plot per caption below. Curves: NF and LSI.]

NF: convergence of the 5 eigenpairs associated with the smallest-magnitude eigenvalues

LSI compared with NF: restart with nconv vectors from the converged Schur vectors combined with the 5 best candidates from the search space (6 faults; plot annotations: 0 0 0 2 2 4)

LSI compared with NF: impact of keeping the best Schur vector candidate in the search space after a fault (6 faults; plot annotations: 0 0 0 2 3 4)

Summary on resilient eigensolvers

Specific Interpolation-Restart strategies for each eigensolver:
Basic subspace iteration method
Subspace iteration with polynomial acceleration
Arnoldi method
IRAM
Jacobi-Davidson

More flexibility in eigensolvers than in Krylov linear solvers


Soft errors in Conjugate Gradient (preliminary)

Aim of this study

Context: soft errors in Conjugate Gradient (CG)

Question 1: Impact of soft errors on the convergence of CG

Question 2: Reliability of numerical detection mechanisms?

Question 3: Robust numerical recovery schemes?

Method to address question 1: experimental study with basic statistics


Fault injection

Transient fault injection

Parameters for Fault Injection

Parameter   Explanation                                Values
Type        Type of the Conjugate Gradient method      [Classical, Chrono, Pipelined]
ID          Matrix ID                                  [1 : 30]
Time        Fault injection iteration as percentage    [10%, ..., 90%]
k           Magnitude of fault injection               [-16:2:16]
Location    Location of fault injection                50 random

[A. T. Chronopoulos, C. W. Gear - JCAM, 1989]
[P. Ghysels, W. Vanroose - ParCo, 2014]
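
The parameters above can be illustrated with a minimal sketch of a single transient fault injected into classical CG. This is not the authors' experimental code. The names `laplacian_1d`, `cg` and the `fault` tuple are hypothetical, and the fault model (one entry of the matrix-vector product is multiplied by 1 + 2^k at one chosen iteration) is an assumption made only for illustration, since the slides do not define how the magnitude k is applied.

```python
def laplacian_1d(n):
    """Dense 1D Laplacian (tridiagonal -1, 2, -1), a small SPD test matrix."""
    return [[2.0 if i == j else -1.0 if abs(i - j) == 1 else 0.0
             for j in range(n)] for i in range(n)]


def matvec(A, v):
    """Dense matrix-vector product y = A v."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cg(A, b, tol=1e-10, max_iter=200, fault=None):
    """Classical CG started from x = 0.

    fault is None or a tuple (iteration, location, k): at that iteration,
    entry `location` of q = A p is corrupted once (assumed fault model).
    Returns (x, iterations, true relative residual ||b - A x|| / ||b||).
    """
    x = [0.0] * len(b)
    r = list(b)
    p = list(r)
    rr = dot(r, r)
    b_norm = dot(b, b) ** 0.5
    it = 0
    while it < max_iter and rr ** 0.5 > tol * b_norm:
        q = matvec(A, p)
        if fault is not None and fault[0] == it:
            _, loc, k = fault
            q[loc] *= 1.0 + 2.0 ** k  # ASSUMPTION: illustrative fault model
        alpha = rr / dot(p, q)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * qi for ri, qi in zip(r, q)]
        rr_new = dot(r, r)
        beta = rr_new / rr
        p = [ri + beta * pi for ri, pi in zip(r, p)]
        rr = rr_new
        it += 1
    # The recurrence residual r may be wrong after a fault, so report the
    # residual recomputed from x instead.
    res = [bi - yi for bi, yi in zip(b, matvec(A, x))]
    return x, it, dot(res, res) ** 0.5 / b_norm
```

A sweep in the spirit of the table would first run `cg` without a fault to get the reference iteration count, turn each Time percentage into an iteration index, and then loop over k and over randomly drawn locations. The function returns the recomputed residual so that whatever acceptance criterion the study uses can be applied to it; that criterion is not recoverable from this transcript.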

Protocol: correct convergence

Characterize effect of magnitude (k)

[Figure: Acceptable Cases (y-axis, 0.00 to 1.00) versus Magnitude (k) (x-axis, -16 to 16), one curve per variant of CG: Classical, Chrono/Gear, Pipeline]

Characterize effect of fault injection time

[Figure: Acceptable Cases (y-axis) versus Fault Injection Time (x-axis, 10 to 90), one curve per variant of CG: Classical, Chrono/Gear, Pipeline]

Summary

Qualitative observations: Classical CG and Chronopoulos/Gear CG are less sensitive to soft errors than Pipelined CG.

Future work:
- Extend the study to GMRES
- Detection based on checksum mechanisms to protect the sparse matrix-vector product
- Detection mechanisms based on round-off error analysis (additional benefit: insight on mixed precision calculation)
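
The checksum idea listed under future work can be sketched as follows. This is a generic checksum test for a matrix-vector product, not the mechanism the authors plan to implement. It relies on the identity 1^T (A v) = (A^T 1)^T v: the column sums of A are computed once, and after each product the sum of the output entries is compared against the checksum applied to v. The function names, the `corrupt` argument used to simulate a fault, and the tolerance `rel_tol` are all hypothetical. A dense list-of-lists matrix stands in for a sparse format to keep the example short.

```python
def column_checksums(A):
    """Return (c, c_abs) with c[j] = sum_i A[i][j] and c_abs[j] = sum_i |A[i][j]|.

    Both are computed once, before the iterations start.
    """
    n_cols = len(A[0])
    c = [sum(row[j] for row in A) for j in range(n_cols)]
    c_abs = [sum(abs(row[j]) for row in A) for j in range(n_cols)]
    return c, c_abs


def checked_matvec(A, v, c, c_abs, rel_tol=1e-10, corrupt=None):
    """Compute y = A v and test it against the column checksum.

    corrupt is None or (location, delta); it simulates a soft error by
    adding delta to one entry of y. Returns (y, ok), where ok is False
    when the checksum test flags the product as suspicious.
    """
    y = [sum(a * x for a, x in zip(row, v)) for row in A]
    if corrupt is not None:
        loc, delta = corrupt
        y[loc] += delta  # simulated soft error, for demonstration only
    lhs = sum(y)                                 # 1^T (A v)
    rhs = sum(cj * vj for cj, vj in zip(c, v))   # (A^T 1)^T v
    # Round-off in both sums scales with sum_j c_abs[j] * |v[j]|, so the
    # threshold is set relative to that quantity (rel_tol is an assumption).
    scale = sum(aj * abs(vj) for aj, vj in zip(c_abs, v))
    ok = abs(lhs - rhs) <= rel_tol * scale
    return y, ok
```

The test costs one extra dot product per matrix-vector product. It cannot see errors whose effect on the sum falls below the threshold, which is one reason a threshold derived from round-off error analysis matters.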

Acknowledgement for financial support:

French ANR: RESCUE project
European FP7: Exa2CT project
G8: ECS project

http://hiepacs.bordeaux.inria.fr/

Thank you for your attention. Questions?