Towards Exascale
ROMA, June 2014, Journée des doctorants


  • The story starts with a box...

    ...that contains lots of little boxes.


  • The Titan supercomputer:

    • 404 m² (the big box)
    • 299,008 processor cores (the small boxes)
    • 17.59 PetaFlops
    • 8.2 MW
    • 693.6 TiB of RAM
    • 240 GB/s transfer speed to RAM

    Image courtesy of Oak Ridge National Laboratory, U.S. Dept. of Energy

  • Then what is Exascale?

    ×1000

    But in the same box:


  • Linear algebra: problems get bigger and bigger

    Code Aster, Carter (e.g., finite elements) → solution of sparse systems Ax = b,
    often the most expensive part of numerical simulation codes.

    Sparse direct methods to solve Ax = b:

    • Decompose A in the form LU, LDLᵗ, or LLᵗ
    • Solve the triangular systems Ly = b, then Ux = y

    3D example in earth science: acoustic wave propagation, 27-point finite-difference grid.
    Current goal [SEISCOPE project]: LU on the complete earth, n = N³ = 1000³.
    Extrapolation on a 1000 × 1000 × 1000 grid: 55 exaflops, 200 TBytes for factors, 40 TBytes for active memory!
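The decompose-then-solve scheme can be sketched in a few lines. This is only a minimal illustration using SciPy's SuperLU wrapper on a tiny matrix, not the large-scale distributed solvers discussed here:

```python
# Minimal sketch of a sparse direct solve of Ax = b via LU factorization.
# Uses SciPy's SuperLU wrapper; the 3x3 matrix is a stand-in for a
# finite-difference matrix, chosen only for illustration.
import numpy as np
from scipy.sparse import csc_matrix
from scipy.sparse.linalg import splu

A = csc_matrix(np.array([[4.0, -1.0, 0.0],
                         [-1.0, 4.0, -1.0],
                         [0.0, -1.0, 4.0]]))
b = np.array([1.0, 2.0, 3.0])

lu = splu(A)          # step 1: decompose A = LU (with permutations)
x = lu.solve(b)       # step 2: triangular solves Ly = b, then Ux = y

assert np.allclose(A @ x, b)
```

The same two steps (factorize once, then cheap triangular solves) are what make direct methods attractive when the same A must be solved for many right-hand sides.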


  • Resilience

    The main known problem for Exascale is Resilience.

    [Figure: a timeline showing the time to checkpoint between failures]

    What if there is 1000× the processing power? It gets worse.


  • Fault-tolerance techniques

    • Rollback recovery strategies: all processors periodically stop computing and checkpoint (save the state of the parallel application onto resilient storage).

    • Coordinated checkpointing
      + no need to log messages
      − all processors need to roll back
      − I/O congestion

    • Non-coordinated checkpointing
      − need to log messages
      − slows down failure-free execution and increases checkpoint size/time
      + faster re-execution with logged messages

    • Hierarchical checkpointing
      − need to log inter-group messages
      + only processors from the failed group need to roll back
      + faster re-execution with logged messages
      + rumor: scales well to very large platforms
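For rollback recovery, a classical first-order rule (standard in the literature, added here for illustration; the numbers are made up) is the Young/Daly checkpointing period T* = √(2·C·μ) for checkpoint cost C and platform MTBF μ:

```python
# First-order waste model for periodic coordinated checkpointing:
# waste(T) ~ C/T (checkpoint overhead) + T/(2*mu) (expected re-execution),
# minimized by the Young/Daly period T* = sqrt(2*C*mu).
import math

def waste(T, C, mu):
    return C / T + T / (2 * mu)

C, mu = 600.0, 86400.0          # hypothetical: 10-min checkpoint, 1-day MTBF
T_opt = math.sqrt(2 * C * mu)

# The optimal period beats nearby choices:
assert waste(T_opt, C, mu) <= waste(T_opt / 2, C, mu)
assert waste(T_opt, C, mu) <= waste(T_opt * 2, C, mu)
```

Note how μ shrinks with the number of components: 1000× the processors means roughly 1000× the failure rate, hence the "it gets worse" above.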


  • Replication

    Model

    • A parallel application comprising n (sequential) processes
    • Each process replicated g ≥ 2 times
    • A processing element executes a single replica
    • The application fails when all replicas in one replica group have been hit by failures

    [Figure: replica groups 1, 2, ..., i, ..., n]

    Objective

    • Show when replication is beneficial when combined with periodic checkpointing
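Under the simplifying assumption (added here, not from the slide) that each replica has failed independently with probability q by some time horizon, the application survives only if every one of the n groups keeps at least one live replica:

```python
# Probability that the replicated application is still alive, assuming each
# of the n*g replicas has failed independently with probability q.
# A group of g replicas is dead iff all g of its replicas failed (prob q**g);
# the application is alive iff none of the n groups is dead.
def app_alive(n, g, q):
    return (1 - q ** g) ** n

n, q = 10_000, 0.001            # hypothetical scale and per-replica failure prob
assert app_alive(n, 1, q) < 0.001   # no replication: almost surely dead
assert app_alive(n, 2, q) > 0.9     # duplication: very likely alive
```

This is why replication can pay off at scale despite halving the useful number of processing elements.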


  • Prediction

    • Predictor (recall, precision), window-based predictions
    • Predictions must be provided at least Cp seconds in advance

    [Figures: timelines for regular mode, prediction without failure, and prediction with failure, showing work periods T, checkpoints C, proactive checkpoints Cp, downtime D, and recovery R]

    Objective

    • Characterize when prediction is useful.
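A quick back-of-the-envelope on predictor quality (illustrative numbers, not from the talk): with recall r (fraction of failures that are predicted) and precision p (fraction of predictions that are true), N actual failures generate N·r/p predictions, of which N·r·(1/p − 1) are false alarms, each triggering a useless proactive checkpoint:

```python
# Counting true predictions and false alarms from a predictor's
# recall (true_predictions / failures) and precision
# (true_predictions / total_predictions).
def prediction_counts(failures, recall, precision):
    true_pred = failures * recall
    total_pred = true_pred / precision
    false_alarms = total_pred - true_pred
    return true_pred, false_alarms

# Hypothetical predictor: 100 failures, recall 0.8, precision 0.5.
true_pred, false_alarms = prediction_counts(100, 0.8, 0.5)
assert true_pred == 80.0        # 80 failures caught proactively
assert false_alarms == 80.0     # but 80 useless proactive checkpoints
```

Whether prediction is useful thus depends on the cost of a proactive checkpoint Cp versus the rollback it avoids, weighted by these two counts.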


  • Kinds of errors

    Hard errors

    • Easy to detect
    • Easy to localize and characterize
    • Expensive to correct

    Soft errors

    • Hard to detect
    • Hard to localize and characterize
    • Easy to correct (sometimes)

  • Silent errors

    How to spot them

    • Add some redundancy
    • Error-detecting codes
    • Selective reliability

    How to face them

    • Majority vote among the replicas
    • Error-correcting codes
    • Checkpoint/recovery

  • Finding the best trade-off

    Let us consider an iterative method.

    Correction at each step:
    • increases the cost of a single iteration
    • no time wasted on checkpoints
    • good for low error rates

    Checkpointing + detection at each step:
    • small overhead at each iteration (detection)
    • periodic time loss for checkpointing
    • checkpoint interval can be tailored to the error rate

    Solution: combine the two techniques.
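A toy cost comparison of the two strategies (a simplified model with made-up parameters, added here to make the trade-off concrete):

```python
# Toy per-iteration overhead model for the two techniques.
# correction: every iteration pays a fixed correction overhead o_corr.
# checkpointing: every iteration pays a detection overhead o_det, plus a
# checkpoint cost C amortized over a period of k iterations, plus an
# expected rollback of ~k/2 iterations when an error strikes.
def cost_correction(t_iter, o_corr):
    return t_iter + o_corr

def cost_checkpointing(t_iter, o_det, C, k, err_rate):
    return t_iter + o_det + C / k + err_rate * (k / 2) * t_iter

t_iter = 1.0
low, high = 1e-4, 5e-2          # hypothetical error rates per iteration
# With rare errors, paying a correction every step loses; with frequent
# errors it wins: hence the suggestion to combine the two techniques.
assert cost_checkpointing(t_iter, 0.01, 2.0, 40, low) < cost_correction(t_iter, 0.2)
assert cost_correction(t_iter, 0.2) < cost_checkpointing(t_iter, 0.01, 2.0, 40, high)
```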


  • Dealing with verifications

    It is not always possible to use error detection/correction codes at each step. What if we still want to use checkpoints and recoveries?

    Problem

    • We don't know when the error occurred
    • We don't know if the last checkpoint is valid

    We need a verification mechanism to check that there were no silent errors in previous computations and that the checkpoints are correct. But this has a cost!

  • Checkpoints and verifications

    We assume there are no errors during checkpoints (fewer error sources when doing I/O).

    Simple approach: perform a verification (V) before each checkpoint (C) to eliminate the risk of corrupted data:

    w V C  w V C  w V C  w V C

    Is this better?

    w C  w V C  w C  w V C  w C


  • With k checkpoints and one verification

    With multiple checkpoints, the problem is to find when the error occurred:

    V C  w C  w C  w C  w C  w V  R V  R V  R V

    Solution

    • The problem is very similar with k verifications and one checkpoint
    • With constant C, V, and R we can find an optimal solution to this problem (i.e., one that minimizes the expected execution time).
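As a rough illustration of what "optimal" means here, a simplified model (my own reconstruction, not the exact one from the talk) splits total work W into k verified segments, assuming silent errors strike a segment with probability 1 − e^(−λw):

```python
# Simplified expected-time model: total work W is split into k segments,
# each ending with a verification V and a checkpoint C. If the verification
# fails (prob p = 1 - exp(-lam * w)), we pay a recovery R and redo the
# segment, giving the recurrence E = (w + V) + p * (R + E).
import math

def expected_segment(w, V, C, R, lam):
    p = 1.0 - math.exp(-lam * w)
    return ((w + V) + p * R) / (1.0 - p) + C

def expected_total(W, k, V, C, R, lam):
    return k * expected_segment(W / k, V, C, R, lam)

W, V, C, R, lam = 1000.0, 1.0, 2.0, 3.0, 0.01   # hypothetical parameters
best_k = min(range(1, 201), key=lambda k: expected_total(W, k, V, C, R, lam))
# A genuine trade-off: more than one segment, but far from one per work unit.
assert 1 < best_k < 200
```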

  • What about DAGs?

    Let us consider a Directed Acyclic Graph (DAG) where:

    • Nodes represent tasks
    • Edges correspond to precedence constraints

    We make several important assumptions on this model:

    • All tasks are executed by all p processors (which amounts to linearizing the task graph and executing all tasks sequentially)
    • Each task has its own indivisible work of size w

    Problem: where do we have to place the checkpoints and the verifications in order to minimize the expected time to execute all the tasks?

  • Starting with simple graphs

    We have analytical formulas to compute the expected time to successfully execute each of these graphs.

    • We can find the optimal expected time to successfully execute the fork graph and the linear chain using a polynomial dynamic-programming algorithm.
    • The join is probably NP-complete because of the combinatorial explosion of the possibilities.

    [Figures: a fork graph, a join graph, and a linear chain over tasks T0, T1, ..., Ti, ..., Tn]

    Future work: investigate the optimal checkpointing and verification problem for general DAGs.
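For the linear chain, such a polynomial dynamic program can be sketched as follows (my own simplified reconstruction, reusing a segment model with error rate λ; dp[i] is the best expected time to safely complete tasks 1..i with a verified checkpoint after task i):

```python
# Sketch of a polynomial DP placing verified checkpoints on a linear chain.
# Between two checkpoints, an error (caught by the end-of-segment
# verification) forces re-execution of the whole segment. Simplified
# reconstruction, not the exact algorithm from the talk.
import math

def segment_time(work, V, C, R, lam):
    p = 1.0 - math.exp(-lam * work)
    return ((work + V) + p * R) / (1.0 - p) + C

def chain_dp(tasks, V, C, R, lam):
    n = len(tasks)
    dp = [0.0] * (n + 1)                     # dp[0] = 0: nothing executed yet
    for i in range(1, n + 1):
        dp[i] = min(dp[j] + segment_time(sum(tasks[j:i]), V, C, R, lam)
                    for j in range(i))       # last checkpoint sits after task j
    return dp[n]

tasks = [10.0, 40.0, 5.0, 25.0, 60.0]        # hypothetical task weights
one_segment = segment_time(sum(tasks), 1.0, 2.0, 3.0, 0.02)
# Splitting never does worse than a single verified segment:
assert chain_dp(tasks, 1.0, 2.0, 3.0, 0.02) <= one_segment
```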


  • Memory

    Another concern: bandwidth to memory (240 GB/s on Titan).

    When the system grows 10 times, bandwidth to memory should grow 20 times!

    Since we are not good with architecture, we focus on algorithms.


  • Pebble game

    [Figure: a small DAG whose vertices have weights 3, 2, 4, and 1; a label x/y means x pebbles currently sit on a vertex of weight y]

    Two moves:

    • Add a pebble on a vertex.
    • Remove a pebble from a vertex.

    One rule:

    • To add a pebble on a vertex, all its predecessors must hold a number of pebbles equal to their weight.

    One goal:

    • Every vertex has to be filled at least once, and the maximum number of pebbles in use must be minimized.

  • Pebble game: an example run

    [Animation: fill the weight-3 vertex (pebble counter rises to 3), then the weight-4 vertex (counter 7, the maximum reached), empty the weight-4 vertex back to 0 (counter 3), then fill the weight-2 vertex (counter 5) and the weight-1 vertex (counter 6). Maximum number of pebbles used: 7.]
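The rules of the game are easy to replay mechanically. In the sketch below the DAG structure (edges A→B, A→C, C→D with weights 3, 4, 2, 1) is a guess at the example's shape, since the figure itself is lost; the moves and the rule check match the slides:

```python
# Toy pebble-game simulator: replay a schedule of moves and track the peak
# pebble count. The DAG below (edges A->B, A->C, C->D) is an assumed
# reconstruction of the example, not taken from the (lost) figure.
weight = {"A": 3, "B": 4, "C": 2, "D": 1}
preds = {"A": [], "B": ["A"], "C": ["A"], "D": ["C"]}

def play(moves):
    pebbles = {v: 0 for v in weight}
    peak = 0
    for op, v in moves:
        if op == "add":
            # Rule: every predecessor must be full (pebbles == its weight).
            assert all(pebbles[u] == weight[u] for u in preds[v])
            pebbles[v] += 1
        else:
            pebbles[v] -= 1
        peak = max(peak, sum(pebbles.values()))
    return peak

# Fill A, fill B (peak 7), empty B, then fill C and D while A stays full.
moves = ([("add", "A")] * 3 + [("add", "B")] * 4 + [("remove", "B")] * 4
         + [("add", "C")] * 2 + [("add", "D")])
assert play(moves) == 7
```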

  • Another model

    Definition. Let G be a DAG with weighted edges and vertices, and π a topological order.

    • We define Me(π, x) (memory edges) as the set of edges e_uv such that π(u) < π(x) ≤ π(v)
    • We call the cost of π at vertex v the value

      Cost(π, v) = w(v) + Σ_{u ∈ N⁺(v)} c(e_vu) + Σ_{e_ux ∈ Me(π, v)} c(e_ux)

    • We define the cost of an order as:

      Cost(π) = max{ Cost(π, v) : v ∈ G }

    Our goal: minimize Cost(π)
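The definition above can be computed directly from the DAG; a small sketch, on an illustrative DAG and weights of my own choosing:

```python
# Direct computation of Cost(pi) from the definition above, on a tiny DAG.
# Vertices carry weights w, edges carry costs c; pi maps vertex -> rank.
w = {"a": 2, "b": 3, "c": 1}
c = {("a", "b"): 4, ("a", "c"): 5, ("b", "c"): 1}
succ = {"a": ["b", "c"], "b": ["c"], "c": []}

def cost(pi):
    def cost_at(v):
        out = sum(c[(v, u)] for u in succ[v])       # edges produced by v
        # memory edges: edges e_ux already "open" when v is processed
        mem = sum(cst for (u, x), cst in c.items() if pi[u] < pi[v] <= pi[x])
        return w[v] + out + mem
    return max(cost_at(v) for v in pi)

pi = {"a": 0, "b": 1, "c": 2}
# cost_at(a) = 2 + (4 + 5)     = 11
# cost_at(b) = 3 + 1 + (4 + 5) = 13  (edges a->b and a->c are in memory)
# cost_at(c) = 1 + 0 + (5 + 1) = 7
assert cost(pi) == 13
```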


  • Another model: processing a vertex

    [Figures: vertices already processed versus unprocessed vertices, shown before, during, and after the processing of v]


  • Energy

    One last problem: energy.

    [Figure: power densities of 2 W/cm² and 80 W/cm²; Titan draws 8.2 MW]

    Thermal wall: we cannot keep raising the clock frequency of a chip: it would melt.


  • Speed Scaling

    One can modify the execution speed f of any task, f ∈ [fmin, fmax].

    Let Ti, of weight wi, be executed on processor pj:

    [Figure: a Gantt bar of length Exe(wi, fi) on processor pj, the task running at speed fi]


  • The energy consumption of the execution of task Ti at speed fi:

    Ei(fi) = Exe(wi, fi) · fi³ = wi · fi²

    → (dynamic part of the classical energy model; note this implies Exe(wi, fi) = wi / fi)
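The two forms of the energy model agree, as a quick numerical check shows:

```python
# Classical dynamic-power speed-scaling model used above:
# execution time Exe(w, f) = w / f, energy E = Exe(w, f) * f**3 = w * f**2.
def exe(w, f):
    return w / f

def energy(w, f):
    return exe(w, f) * f ** 3

w, f = 12.0, 2.0
assert exe(w, f) == 6.0            # lower speed means longer execution
assert energy(w, f) == w * f ** 2  # the two forms of E_i agree
assert energy(w, f) == 48.0
```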

  • Unfortunately, there are some more drawbacks (reliability):

    Ri(fi) ≈ 1 − λ0 · e^(−d·fi) · Exe(wi, fi)

    [Figure: the reliability Ri(fi) drops as the speed fi decreases below frel, the speed at which the reference reliability Ri(frel) is reached]


  • A solution: two executions!

    Ri = 1 − (1 − Ri(fi^(1))) · (1 − Ri(fi^(2)))

    [Figure: replica Ti^(1) at speed fi^(1) on p1 and replica Ti^(2) at speed fi^(2) on p2, both finishing by the deadline ti]

    Energy consumption with two executions:

    Ei = wi · (fi^(1))² + wi · (fi^(2))²

    [Figure: running both executions at speed frel/√2 gives energy wi·fi² + wi·fi² = 2·Ei(fi), which equals the single-execution energy Ei(frel)]
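The duplication energy trick checks out numerically: two replicas at frel/√2 consume exactly the energy of a single execution at frel (illustrative values):

```python
# Energy of duplicated execution under E(w, f) = w * f**2:
# two replicas at speed f_rel / sqrt(2) cost exactly one execution at f_rel,
# since 2 * w * (f_rel / sqrt(2))**2 = w * f_rel**2.
import math

def energy(w, f):
    return w * f ** 2

w, f_rel = 10.0, 1.6               # hypothetical weight and reference speed
f_dup = f_rel / math.sqrt(2)
assert math.isclose(2 * energy(w, f_dup), energy(w, f_rel))
```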


  • To sum up

    We need to find, for each task:

    • its number of executions (one or two)
    • their speeds
    • their mapping (processors)

    in order to minimize the energy consumption under the constraints:

    • ∀i, ti ≤ D (bounded makespan)
    • ∀i, Ri(Ti) ≥ Ri(frel) (minimum reliability)

  • Two kinds of results

    Theoretical:

    • FPTAS for linear chains;
    • inapproximability for independent tasks;
    • with a relaxation of the makespan constraint by a factor β, we can approximate the optimal solution within 1 + 1/β², for all β ≥ max(2 − 3/(2p+1), 2 − (p+2)/(4p+2)).

    But also simulations for general DAGs.

  • Sparse direct solution: main research issues

    Code Aster, EDF pump, nuclear backup circuit

    [Figure: 3D velocity model (depth, dip, and cross axes in km; 3000–6000 m/s): frequency-domain seismic modeling, Helmholtz equations, SEISCOPE project]

    Extrapolation on a 1000 × 1000 × 1000 grid: 55 exaflops, 200 TBytes for factors, 40 TBytes for active memory!

    Main algorithmic issues

    • Parallel algorithmic issues: synchronization avoidance, mapping irregular data structures, scheduling.
    • Performance scalability: time but also memory per processor when increasing the number of processors (and problem size).
    • Numerical issues: numerical accuracy, hybrid iterative-direct solvers, application-specific (elliptic PDEs) solvers.

  • Execution of malleable task trees

    • It is one of the problems posed by Exascale
    • Motivation: linear algebra, sparse matrix factorizations, ...
    • Principle: many processors are available → we can parallelize the tree but also the tasks
    • Difficulty: parallelization is not perfect; the more processors we allocate to a task, the more losses occur

    [Figure: a task tree with root 0, children 1 to 4, and further subtrees]

  • In the model developed (the time to complete a task of length L with p processors is L/p^α, for 0 < α < 1), the makespan-optimal processor allocation to the tree looks like the distribution of electric charges → a nice structure to work with.

  • This model ignores some relevant constraints, such as memory limits or granularity: other models are designed to handle them.
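Under the L/p^α model, splitting p processors between two independent subtrees so that both finish together can be sketched numerically (illustrative values, continuous processor shares assumed):

```python
# Allocating p processors between two independent tasks of lengths L1, L2
# under the malleable model time(L, p) = L / p**alpha, with 0 < alpha < 1.
# Finishing both tasks at the same moment minimizes the makespan; here we
# simply scan candidate shares for the first task.
def makespan(L1, L2, p1, p, alpha):
    return max(L1 / p1 ** alpha, L2 / (p - p1) ** alpha)

L1, L2, p, alpha = 100.0, 50.0, 10.0, 0.9      # hypothetical values
shares = [i / 100 * p for i in range(1, 100)]
best = min(shares, key=lambda p1: makespan(L1, L2, p1, p, alpha))

# The longer task gets the larger share, and both finish nearly together.
assert best > p / 2
t1 = L1 / best ** alpha
t2 = L2 / (p - best) ** alpha
assert abs(t1 - t2) / t1 < 0.1
```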
