Containment Domains: A Scalable, Efficient, and Flexible Resilience Scheme for Exascale Systems
Jinsuk Chung, Ikhwan Lee, Michael Sullivan, Jee Ho Ryoo, Dong Wan Kim, Doe Hyun Yoon+, Larry Kaplan*, and Mattan Erez
UT Austin, + now at HP Labs, * Cray

Post on 23-Mar-2016

TRANSCRIPT

Page 1: Containment Domains A Scalable, Efficient, and Flexible  Resilience Scheme for  Exascale  Systems


Page 3: Motivation and goals

• Resilience bounds performance
– Resilience is a major obstacle to exascale

Containment domains: scalable, efficient resilience
• Hierarchical
– Preserve data where most efficient and effective
• Proportional
– Tunable redundancy and recovery
– Different errors/faults handled differently
• Abstract
– Portable
– Amenable to auto-tuning and analysis

CDs elevate resilience to a first-order application concern

Containment Domain [SC'12] (c) Jinsuk Chung

Page 4: Containment domains

• Single consistent abstraction
– Encapsulates resilience techniques
– Spans levels: programming, system, and analysis
• Components
– Preserve data on domain start
– Compute (domain body)
– Detect faults before domain commits
– Recover from detected errors
• Semantics
– Erroneous data never communicated
– Each CD provides recovery mechanism
• Hierarchy
– Escalation
– Match CD and machine hierarchies

[Figure labels: Root CD, Child CD]
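The four components and the escalation rule can be sketched in a few lines. This is a minimal Python illustration under stated assumptions, not the CD API from the talk: `run_cd`, `DetectedError`, `max_retries`, `checksum_ok`, and `flaky_sum` are hypothetical names, the domain state is a plain dictionary, and preservation is modeled as a deep copy.

```python
import copy


class DetectedError(Exception):
    """Raised by a detection routine when it finds a fault (hypothetical name)."""


def run_cd(state, body, detect, max_retries=2):
    """Run `body` on `state` inside one containment domain.

    `state` is a dict. Raises DetectedError to the caller (the parent
    domain) when every attempt fails, which models escalation.
    """
    preserved = copy.deepcopy(state)            # preserve on domain start
    for _ in range(max_retries + 1):
        working = copy.deepcopy(preserved)      # restore a clean copy
        try:
            body(working)                       # compute (domain body)
            detect(working)                     # detect before commit
        except DetectedError:
            continue                            # recover by re-execution
        state.clear()
        state.update(working)                   # commit: results become visible
        return state
    raise DetectedError("retries exhausted, escalating to parent CD")


def checksum_ok(s):
    """Algorithm-specific detector: the stored sum must match the data."""
    if s["sum"] != sum(s["data"]):
        raise DetectedError("sum mismatch")


attempts = {"count": 0}


def flaky_sum(s):
    """Domain body with a fault injected on the first attempt only."""
    attempts["count"] += 1
    s["sum"] = sum(s["data"]) + (1000 if attempts["count"] == 1 else 0)


result = run_cd({"data": [1, 2, 3]}, flaky_sum, checksum_ok)
print(result["sum"], attempts["count"])
```

Because the working copy is only merged into `state` after detection passes, a faulty attempt never becomes visible outside the domain, which mirrors the "erroneous data never communicated" semantic. A parent domain would wrap its children's `run_cd` calls in its own `run_cd` and catch the escalated error there.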

Page 5: Mapping example: SpMV

void task<inner> SpMV(in M, in Vi, out Ri) {
  forall(…) reduce(…)
    SpMV(M[…], Vi[…], Ri[…]);
}

void task<leaf> SpMV(…) {
  for r=0..N
    for c=rowS[r]..rowS[r+1] {
      resi[r] += data[c] * Vi[cIdx[c]];
      prevC = c;
    }
}

[Figure: matrix M and vector V]

Page 6: Mapping example: SpMV

[Figure: matrix M partitioned into blocks M00, M01, M10, M11; vector V partitioned into V0 and V1. The inner and leaf SpMV tasks from Page 5 are shown again.]
Page 7: Mapping example: SpMV

[Figure: the partitioned matrix M and vector V distributed to 4 nodes, each node holding one matrix block and one vector block: (M00, V0), (M01, V1), (M10, V0), (M11, V1). The inner and leaf SpMV tasks from Page 5 are shown again.]
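The 2x2 decomposition can be checked with a small sketch: each of four leaf tasks multiplies one block M_ij by V_j, and the partial results for the same row block i are reduced by addition. Dense Python lists stand in for the sparse blocks here, and `leaf_spmv` and `inner_spmv` are illustrative names rather than the talk's code.

```python
def leaf_spmv(block, vec):
    """Multiply one dense matrix block by one vector block."""
    return [sum(a * x for a, x in zip(row, vec)) for row in block]


def inner_spmv(blocks, vecs):
    """blocks[i][j] is block M_ij; vecs[j] is vector block V_j.

    Runs one leaf task per (i, j) pair and reduces the partial results
    that belong to the same row block i by element-wise addition.
    """
    result = []
    for row_blocks in blocks:
        partials = [leaf_spmv(b, v) for b, v in zip(row_blocks, vecs)]
        result.append([sum(vals) for vals in zip(*partials)])
    return result


# A 4x4 matrix split into four 2x2 blocks; the vector is split into two halves.
M = [[1, 2, 0, 0],
     [0, 3, 4, 0],
     [5, 0, 0, 6],
     [0, 0, 7, 8]]
V = [1, 2, 3, 4]

blocks = [[[r[0:2] for r in M[0:2]], [r[2:4] for r in M[0:2]]],
          [[r[0:2] for r in M[2:4]], [r[2:4] for r in M[2:4]]]]
vecs = [V[0:2], V[2:4]]

blocked = inner_spmv(blocks, vecs)
flat = [y for part in blocked for y in part]
reference = leaf_spmv(M, V)
print(flat == reference)
```

Each vector block is read by two leaf tasks, so V0 and V1 each end up on two nodes. That replication is what later slides call natural redundancy.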

Page 9: Mapping example: SpMV

[Figure: CD hierarchy for SpMV. A parent CD over M and V performs Preserve (Parent), Detect (Parent), and Recover (Parent). Four child CDs operate on (M00, V0), (M10, V0), (M01, V1), and (M11, V1); each child performs its own Preserve, Detect, and Recover.]

Page 10: Initial CD preservation API and prototype

void task<inner> SpMV(in M, in Vi, out Ri) {
  cd = create_CD(parentCD);
  preserve_via_copy(cd, matrix, …);
  forall(…) reduce(…)
    SpMV(M[…], Vi[…], Ri[…]);
  commit_CD(cd);
}

void task<leaf> SpMV(…) {
  cd = create_CD(parentCD);
  preserve_via_copy(cd, matrix, …);
  preserve_via_parent(cd, veci, …);
  for r=0..N
    for c=rowS[r]..rowS[r+1] {
      resi[r] += data[c] * Vi[cIdx[c]];
      check {fault<fail>(c > prevC);}
      prevC = c;
    }
  commit_CD(cd);
}

API
• create_CD
• preserve_via_copy
• preserve_via_parent
• check
• commit_CD

Preservation components prototype on Cray XK7
http://lph.ece.utexas.edu/public/CDs
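A runnable analogue of the leaf task may help make the pseudocode concrete. The sketch below is plain Python, not the CD API: `leaf_spmv_cd`, `Fault`, and `max_retries` are hypothetical names, the matrix is assumed to be in CSR form (`row_start`, `col_idx`, `data`), preservation is modeled only as a local deep copy (nothing is fetched from a parent), and the check is assumed to mean that the nonzero index `c` must strictly increase across the sweep.

```python
import copy


class Fault(Exception):
    """Signals a failed check inside the domain (hypothetical name)."""


def leaf_spmv_cd(row_start, col_idx, data, vec, max_retries=1):
    """CSR sparse matrix-vector product wrapped in a CD-like retry loop."""
    # Preserve the inputs by copy on domain start.
    saved = copy.deepcopy((row_start, col_idx, data, vec))
    for _ in range(max_retries + 1):
        rs, ci, d, v = copy.deepcopy(saved)     # restore from the preserved copy
        try:
            result = [0.0] * (len(rs) - 1)
            prev_c = -1
            for r in range(len(rs) - 1):
                for c in range(rs[r], rs[r + 1]):
                    # Analogue of check {fault<fail>(c > prevC);}
                    if not c > prev_c:
                        raise Fault(f"index {c} did not advance past {prev_c}")
                    result[r] += d[c] * v[ci[c]]
                    prev_c = c
            return result                       # commit
        except Fault:
            continue                            # recover by re-execution
    raise Fault("retries exhausted, escalating to parent CD")


# [[1, 0, 2], [0, 3, 0], [4, 0, 5]] in CSR form, multiplied by [1, 1, 1].
good = leaf_spmv_cd([0, 2, 3, 5], [0, 2, 1, 0, 2], [1, 2, 3, 4, 5], [1, 1, 1])
print(good)
```

If the preserved row pointers are themselves corrupted (for example `[0, 2, 1, 5]`), every re-execution trips the same check and the error escalates. That is the case where restoring from a parent's copy becomes necessary.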

Page 11: Containment domains long-term design

[Figure: layered design stack. Labels: efficiency-oriented programming model; CD Annotations (resilience model); CD API (resilience interface); Runtime Library Interface; CD control and persistence; Error Reporting Architecture; Hardware Abstraction Layer; ECC, status; Machine. Side labels: language integration, compiler support, runtime components, hardware aspects. Callout: research prototype by Cray for XK7 (Titan).]

Code shown next to the "efficiency-oriented programming model" label:

int main(int argc, char **argv)
{
  main_task here = phalanx::initialize(argc, argv);

  … Create test arrays here …

  // Launch kernel on default CPU ("host")
  openmp_event e1 = async(here, here.processor(), n)
                         (host_saxpy, 2.0f, host_x, host_y);
  // Launch kernel on default GPU ("device")
  cuda_event e2 = async(here, here.cuda_gpu(0), n)
                       (device_saxpy, 2.0f, device_x, device_y);

  wait(here, e1&e2);
  return 0;
}

Page 12: Outline

• Motivation and Goals
• Semantics of Containment Domains
• What do CDs do? When and why are they good?
– Differentiated error handling
– Analyzability
• Evaluation

Page 13: Differentiated Error Handling

Page 14: State preservation and restoration

• Abstract
– Optimized preservation and restoration
– Analyzed, auto-tuned
– Allows explicit application control
• Hierarchical
– Match storage hierarchy
– Maximize locality and minimize overhead
• Partial
– Preserve only when worth it
– Exploit natural redundancy
– Exploit hierarchy
– Enable regeneration

Page 15: SpMV partial preservation tuning

[Figure: the leaf SpMV task from Page 10, annotated with "Natural redundancy" and "Hierarchy". Matrix M is partitioned into blocks M00, M01, M10, M11 and vector V into V0 and V1; in the distributed layout V0 and V1 each appear alongside two matrix blocks.]

Page 16: Concise abstraction for complex behavior

[Figure: the leaf SpMV task from Page 10, annotated with the possible preservation sources: Local copy or regen; Sibling; Parent (unchanged).]

Page 17: Detection

• Abstract
– Utilize most efficient detection mechanism
– Low overhead detection: e.g., algorithm specific detection
• Customized
– Replicate in time, replicate in space, algorithm specific
• Heterogeneous
– Per-CD routines
– E.g., selective multi-granularity DMR
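"Replicate in time" can be illustrated with dual modular redundancy (DMR): run the same computation twice and flag an error when the results differ. The sketch below is a generic Python illustration, not code from the talk; `dmr_in_time` and the fault-injecting helper are hypothetical names.

```python
def dmr_in_time(compute, *args):
    """Run `compute` twice on the same inputs and compare the results.

    Returns (result, ok). ok is False when the two runs disagree, which
    signals a detected error; DMR alone cannot tell which run was wrong.
    """
    first = compute(*args)
    second = compute(*args)
    return first, first == second


def dot(x, y):
    """Reference computation: dot product of two equal-length lists."""
    return sum(a * b for a, b in zip(x, y))


calls = {"count": 0}


def dot_with_transient_fault(x, y):
    """Same dot product, with a fault injected on the second call only."""
    calls["count"] += 1
    return dot(x, y) + (1 if calls["count"] == 2 else 0)


clean = dmr_in_time(dot, [1, 2], [3, 4])
faulty = dmr_in_time(dot_with_transient_fault, [1, 2], [3, 4])
print(clean, faulty)
```

Doubling the work is expensive, which is why the slide lists algorithm-specific checks and selective, per-CD DMR as cheaper alternatives.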

Page 18: Recovery

• Abstract
– Utilize most efficient recovery mechanism
– Maximize local recovery
– Low overhead recovery, e.g., re-materialization or regeneration
• Customized
– Re-execute, ignore, re-materialize, DMR, TMR
• Heterogeneous
– Per-CD routines
– E.g., selective multi-granularity DMR
– App/system specific

[Figure: timeline (time axis) of Preserve, Compute, and Detect phases, with the re-execution overhead marked]
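Of the recovery options listed, triple modular redundancy (TMR) is the one that recovers without re-execution: three replicas run and the majority result wins. A minimal Python sketch follows, with `tmr` and the fault-injecting helper as hypothetical names and results assumed to be hashable values.

```python
from collections import Counter


def tmr(compute, *args):
    """Run `compute` three times and return the majority result.

    Raises RuntimeError when no value appears at least twice, which a
    containment domain would treat as a reason to escalate.
    """
    results = [compute(*args) for _ in range(3)]
    value, votes = Counter(results).most_common(1)[0]
    if votes < 2:
        raise RuntimeError("no majority among replicas")
    return value


calls = {"count": 0}


def dot_with_one_fault(x, y):
    """Dot product with a fault injected on the second replica only."""
    calls["count"] += 1
    total = sum(a * b for a, b in zip(x, y))
    return total + (1 if calls["count"] == 2 else 0)


voted = tmr(dot_with_one_fault, [1, 2], [3, 4])
print(voted)
```

The single faulty replica is outvoted, so the caller sees the correct value without any rollback, at the price of tripling the computation.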

Page 19: Analyzability

Page 20: Analytical Model

• Leverage hierarchy and CD semantics
– Uncoordinated “local” actions
– Solve in out
• Application abstracted to CDs
– CD tree
– Volumes of preservation, computation, and communication
– Preservation and recovery options per CD
• Machine model
– Storage hierarchy
– Communication hierarchy
– Bandwidths and capacities
– Error processes and rates

[Figure: axis label "Execution time"]
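To give a feel for the kind of quantity such a model computes, here is a deliberately simplified sketch. It is not the analytical model from the paper. It assumes errors arrive as a Poisson process, that an error is only detected at the end of a CD (so a failed attempt costs the full CD length), that restoration is free, and that CDs run one after another. `expected_time` and all parameter names and values are hypothetical.

```python
import math


def expected_time(work, error_rate, preserve_cost=0.0, num_cds=1):
    """Expected run time when `work` seconds are split into `num_cds` sequential CDs.

    Each CD preserves once (`preserve_cost`), then repeats its chunk until an
    attempt finishes with no error. With Poisson errors at `error_rate` per
    second, an attempt of length t succeeds with probability exp(-rate * t),
    so the expected number of attempts is exp(rate * t).
    """
    chunk = work / num_cds
    expected_attempts = math.exp(error_rate * chunk)
    return num_cds * (preserve_cost + chunk * expected_attempts)


one_big = expected_time(100.0, 0.01, preserve_cost=1.0, num_cds=1)
ten_small = expected_time(100.0, 0.01, preserve_cost=1.0, num_cds=10)
print(round(one_big, 2), round(ten_small, 2))
```

Even this toy version shows the trade-off the slide points at: finer-grained CDs pay more preservation cost but lose far less work per error, so the best granularity depends on the error rate and the preservation volume.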

Page 21: Power model

• CDs that are not re-executing may remain idle
• Actively executing a CD has a relative power of 1
• A node that is idling consumes a relative power of
– In our experiments

[Figure: parallel domains over time. All domains show Execution; one domain then shows Re-execution while the remaining domains are marked Idle for the duration of the re-execution time.]
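The power model translates directly into a relative-energy calculation. In the sketch below, `idle_power` is a free parameter because the value used in the experiments did not survive in this transcript; the function names and the example numbers (16 domains, 0.25 idle power) are purely illustrative assumptions.

```python
def relative_energy(num_domains, exec_time, reexec_time, idle_power):
    """Energy in units of (relative power 1) x (time).

    All domains execute for `exec_time`. One domain then re-executes for
    `reexec_time` at power 1 while the others idle at `idle_power`.
    """
    active = num_domains * exec_time + reexec_time
    idle = (num_domains - 1) * reexec_time * idle_power
    return active + idle


def energy_overhead(num_domains, exec_time, reexec_time, idle_power):
    """Extra energy relative to an error-free run of the same domains."""
    baseline = num_domains * exec_time
    total = relative_energy(num_domains, exec_time, reexec_time, idle_power)
    return total / baseline - 1.0


total = relative_energy(16, 10.0, 2.0, 0.25)
overhead = energy_overhead(16, 10.0, 2.0, 0.25)
print(total, overhead)
```

With a low idle power, a local re-execution costs little extra energy even though the sibling domains wait; with idle power close to 1, waiting is nearly as expensive as recomputing everywhere.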

Page 22: Evaluation

• What we evaluated
– Performance efficiency
– Energy overhead
• Baseline resiliency approaches
– g-CPR: global checkpoint restart
– h-CPR: hierarchical checkpoint restart (e.g., SCR)
– Optimum interval used for each
• CD advantages
– Preserve only what is needed
– Hierarchical uncoordinated
• Assumptions
– Detection overhead is assumed to be zero
– Capacity of storage for preservation is infinite
– Infinite spares (quick repair)
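The slide does not say how the optimum checkpoint interval was computed. One widely used first-order estimate is Young's approximation, interval = sqrt(2 x checkpoint cost x MTBF), shown below as an assumption rather than as the method used in the talk; `young_interval` and the example numbers are illustrative.

```python
import math


def young_interval(checkpoint_cost, mtbf):
    """Young's first-order estimate of the optimum time between checkpoints.

    Both arguments are in the same time unit; the result is in that unit too.
    """
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


# Example: a 50 s checkpoint on a machine with a 10,000 s mean time between failures.
interval = young_interval(50.0, 10_000.0)
print(interval)
```

Since failure rates in the machine model grow with component counts, MTBF shrinks as the system scales, and the optimum interval shrinks with its square root. That makes checkpointing cost a growing fraction of run time for the g-CPR and h-CPR baselines.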

Page 23: Machine and error models

Component   “Performance”          Error                     Error scaling
Core        10 GFLOP/core          Soft error                ∝ #cores
Memory      1 GB/core              ECC fail                  ∝ #DRAM chips
Socket      200 GB/s per socket    Hard/OS crash             ∝ #sockets
System      Hierarchical network   Power module or network   ∝ #modules and #cabinets

Page 24: Workloads

• Monte Carlo NT
– Embarrassingly parallel
– Infrequent communication
– Small fraction of read/write data
• Iterative hierarchical SpMV
– Recursive decomposition
– Natural redundancy
– Frequent global communication
• Mantevo HPCCG
– Requires little storage
– Conjugate-gradient based linear system solver
– Frequent global communication

Page 25: Evaluation tools

• Simulator
– Executes at granularity of containment domains
– Re-executes when error is detected
– Used to validate the analytical model
• Analytical Model
– Simulation is too slow for evaluating exascale systems
– Inputs to the model: extracted from each application
    • Volume of preservation, restoration, computation and communication
    • Error rates
    • Shape of CD structure
• Validation
– Simulator and analytical model
– Prototype of preservation/restoration on Cray XK7

Page 26: Autotuned CDs perform well

[Figure: performance efficiency (y-axis, 0% to 80%) versus peak system performance (x-axis: 2.5PF, 10PF, 40PF, 160PF, 640PF, 1.2EF, 2.5EF), one panel per workload. Legends: NT panel "CDs, NT" and "h-CPR, 80%"; SpMV panel "CDs, SpMV" and "h-CPR, 50%"; HPCCG panel "CDs, HPCCG", "h-CPR, 10%", and "g-CPR, 10%".]

Page 28: SpMV, HPCCG: local recovery and partial preservation

[Figure: preservation placed across the storage hierarchy: Disk, Remote NVM, Local NVM, DRAM]

Partial preservation via sibling or parent where appropriate

Page 29: NT: hierarchical local recovery and partial preservation

[Figure: preservation placed across the storage hierarchy: Disk, Remote NVM, Local NVM, DRAM]

Partial preservation via sibling, parent, or regeneration where appropriate

Page 31: CDs improve energy efficiency at scale

[Figure: energy overhead (y-axis, 0% to 20%) versus peak system performance (x-axis: 2.5PF, 10PF, 40PF, 160PF, 640PF, 1.2EF, 2.5EF), one panel per workload. Legends: NT panel "CDs, NT" and "h-CPR, 80%"; SpMV panel "CDs, SpMV" and "h-CPR, 50%"; HPCCG panel "CDs, HPCCG", "h-CPR, 10%", and "g-CPR, 10%".]

Page 33: 10X failure rate emphasizes CD benefits

[Figure: performance efficiency and energy overhead (both axes 0% to 100%) versus peak performance (x-axis: 2.5PF, 10PF, 40PF, 160PF, 640PF, 1.2EF, 2.5EF), one panel per workload. Legends: "CDs, NT" and "h-CPR, 80%"; "CDs, SpMV"; "CDs, HPCCG".]

Page 34: More in the paper

• Strict vs. relaxed containment domains
• Analytical model details
• Error and machine model details
• Additional sensitivity studies
• Related work discussion

Page 35: Conclusion

• Containment domains
– Abstract constructs for resilience concerns & techniques
– Proportional and application/machine tuned resilience
– Hierarchical & distributed preservation, restoration, and recovery
– Analyzable and amenable to automatic optimization
– Scalable to large systems with high relative energy efficiency
– Heterogeneous to match emerging architecture
• Good start and exciting work ahead
– Preservation concept prototyped on Cray XK7
– Fine-grained CDs for high error rates
– Compiler optimizations and support
– Application-specific detection/elision
– PGAS support and interactions with system
– Interaction with other models (tasking, DSLs, …)

http://lph.ece.utexas.edu/public/CDs

Page 36: Questions?

Thank you