
1

Getting Ready for Mainstream Parallel Computing

Burton Smith

Cray Inc.

2

Parallel computing is upon us†

Uniprocessors are nudging against performance limits
More transistors/processor increases watts, not speed

Meanwhile, logic cost ($/gate-Hz) continues to fall
What are we to do with all that hardware?

New “killer apps” will probably need more performance
Semantic information processing and retrieval
Personal robots
Better human-computer interfaces

Newer microprocessors are multi-core and/or multithreaded
So far, it’s just “more of the same” architecturally

The big question: what are parallel computers good for?
This is a question for both hardware and software
Hardware has gotten ahead of software, as usual

†And it’s about time!

3

Parallel software is almost rudderless
We don’t have a good parallel language story

MPI? PGAS languages?

We don’t have a good debugging story
TotalView? gdb?

We don’t have a good resource management story
Some-weight-or-other kernels? Linux? Lustre?

We don’t have a good robustness story
Malware resistance? User-generated checkpoints (or else)?

4

Better parallel languages

A better parallel programming language could make programmers more productive by:
Making data races impossible
Implementing higher level operations
Presenting a transparent performance model
Exploiting architectural support for parallelism
Global memory operations
Light-weight synchronization
Enabling a more abstract specification of program locality

Give the programmer the hard parts and nothing else

5

Are functional languages the answer?
Imperative languages schedule values into variables

Parallel versions of such languages encourage data races
The basic problem: stores don’t commute with loads

Functional languages avoid races by avoiding variables
In a sense, they compute new constants from old
Data races are impossible (because loads do commute)
We can make dead constant reclamation pretty efficient

Program transformation is enabled by these semantics
High-level operations become practical

Programmer productivity is much improved
Caution: it is pretty easy to opacify performance

Abstracting locality is much easier in functional languages
Copying is safer, for example

6

The bad news
There is no notion of state in functional languages
Attempts to add state while preserving commutativity:

Applicative State Transition systems (Backus)
Monads (Wadler et al.)
M-structures (Arvind et al.)

A related fact: functional programs are deterministic
Introducing state leads to non-determinism (e.g. races)

Some kinds of nondeterminism are good
Any ordering that does not affect final results is OK
Only the programmer knows where the opportunities are
How can we tell good non-determinism from bad?

7

A (serial) histogramming example

const double in[N];   // data to be histogrammed
const int f(double);  // f(x) < M is the bin of x
int hist[M];          // histogram, initially 0

for (i = 0; i < N; i++) {
    int bin = f(in[i]);
    hist[bin]++;
}

/* loop invariant: (∀ int k)( hist[k] = |{ j | 0 ≤ j < i, f(in[j]) = k }| ) */

/* at loop exit:   (∀ int k)( hist[k] = |{ j | 0 ≤ j < N, f(in[j]) = k }| ) */

Don’t try this in parallel with a functional language!

8

Histogramming in parallel

const double in[N];   // data to be histogrammed
const int f(double);  // f(x) < M is the bin of x
int hist[M];          // histogram, initially 0

forall i in 0..N-1 {
    int bin = f(in[i]);
    lock hist[bin];
    hist[bin]++;
    unlock hist[bin];
}

/* invariant: (∀ int k)( hist[k] = |{ j | j ∈ S, f(in[j]) = k }| ), where S is the set of values i processed “so far” */

/* at exit:   (∀ int k)( hist[k] = |{ j | 0 ≤ j < N, f(in[j]) = k }| ) */

• The loop instances commute with respect to the invariant
• Premature reads of hist[] get non-deterministic garbage
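As a concrete illustration of the forall/lock loop above, here is a minimal sketch in C with OpenMP (my rendering, not the talk’s); the per-bin OpenMP locks play the role of lock/unlock hist[bin], and the bin function f and the synthetic input are stand-ins:

#include <stdio.h>
#include <omp.h>

#define N 1000000
#define M 256

static double in[N];             // data to be histogrammed
static int hist[M];              // histogram, initially 0
static omp_lock_t hist_lock[M];  // one lock per bin

// Stand-in bin function: any f with 0 <= f(x) < M would do.
static int f(double x) {
    int b = (int)(x * M);
    return b < 0 ? 0 : (b >= M ? M - 1 : b);
}

int main(void) {
    for (int i = 0; i < N; i++)          // synthetic input data
        in[i] = (double)i / N;
    for (int k = 0; k < M; k++)
        omp_init_lock(&hist_lock[k]);

    #pragma omp parallel for
    for (int i = 0; i < N; i++) {
        int bin = f(in[i]);
        omp_set_lock(&hist_lock[bin]);   // protect the invariant on hist[bin]
        hist[bin]++;
        omp_unset_lock(&hist_lock[bin]);
    }

    long total = 0;                      // cheap check of the exit invariant
    for (int k = 0; k < M; k++) total += hist[k];
    printf("items binned: %ld of %d\n", total, N);

    for (int k = 0; k < M; k++)
        omp_destroy_lock(&hist_lock[k]);
    return 0;
}

The locks here are the lightweight, fine-grained kind the next slide argues for; a single lock over the whole array would serialize the loop.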

9

What do the locks do?
The locks guarantee the integrity of the invariant

They protect whatever makes the invariant temporarily false
As long as invariants describe all we care about in the computation and forward progress is made, all is well
We have non-determinism “beneath the invariants”
In the example, the set captures that non-determinism

Pretty clearly, the locks need to be lightweight
Barriers won’t do the job

Can we automate or at least verify lock insertion?
If we had a language for the invariants, maybe so

A constructive step is to let the language handle the locks
Efficiency with safety is one reason

10

Atomic sections

const double in[N];   // data to be histogrammed
const int f(double);  // f(x) < M is the bin of x
int hist[M];          // histogram, initially 0

forall i in 0..N-1 do atomic {
    int bin = f(in[i]);
    hist[bin]++;
}

This abstraction permits implementation mechanisms other than locking
It works better interprocedurally
All of the proposed HPCS languages have something like it
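For this particular one-word update, C11 atomics already supply one implementation mechanism other than locking; a minimal sketch (my illustration, with f again a stand-in bin function declared elsewhere):

#include <stdatomic.h>

#define M 256

static atomic_int hist[M];   // histogram, initially 0

extern int f(double x);      // stand-in bin function, 0 <= f(x) < M

void histogram(const double *in, int n)
{
    // The atomic-section body "atomic { hist[bin]++; }" reduces, for this
    // one-word case, to an atomic fetch-and-add.
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        int bin = f(in[i]);
        atomic_fetch_add_explicit(&hist[bin], 1, memory_order_relaxed);
    }
}

This is the kind of lightweight, architecture-supported synchronization slide 4 asks the language to exploit; the atomic-section abstraction earns its keep when the body touches more than one word and a bare fetch-and-add no longer suffices, which is the subject of the next slide.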

11

Operations on multiple objects

node *m;   // a node in an undirected graph
/* (∀ node m)(∀ node n)( n ∈ (m->nbr)*  ⇔  m ∈ (n->nbr)* ) */
atomic {   // remove *m from the mesh
    for (n = m->nbr; n != NULL; n = n->nbr) {
        // remove links between *n and *m
        for (p = n->nbr; p != NULL; ...   // etc
    }
}

A naïve implementation would routinely deadlock
If a sequence would deadlock or fail, preservation of the invariant requires that it be “undone”, reversing its side effects

In other words, what we need is linguistic support for nestable multi-object atomic transactions

Operating system implementers really need this stuff
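A sketch of the same removal with a single global mutex standing in for the missing transaction (my illustration; pthreads and this particular adjacency-list layout are assumptions, not the slide’s). It is deadlock-free but serializes every structural update, which is precisely the compromise that nestable multi-object transactions, with their ability to undo a conflicting sequence, are meant to avoid:

#include <pthread.h>
#include <stdlib.h>

typedef struct link {            // one directed half of an undirected edge
    struct node *to;
    struct link *next;
} link;

typedef struct node {
    link *nbrs;                  // singly linked neighbor list
} node;

// One global lock plays the role of "atomic { ... }": crude but cannot deadlock.
static pthread_mutex_t graph_lock = PTHREAD_MUTEX_INITIALIZER;

static void remove_node(node *m)
{
    pthread_mutex_lock(&graph_lock);
    for (link *l = m->nbrs; l != NULL; ) {
        node *n = l->to;
        // unlink the back edge n -> m from n's neighbor list
        for (link **p = &n->nbrs; *p != NULL; p = &(*p)->next) {
            if ((*p)->to == m) {
                link *dead = *p;
                *p = dead->next;
                free(dead);
                break;
            }
        }
        link *next = l->next;    // free m's own forward edge
        free(l);
        l = next;
    }
    m->nbrs = NULL;
    pthread_mutex_unlock(&graph_lock);
}

int main(void) {                 // tiny smoke test: a -- b, then remove a
    node a = {0}, b = {0};
    link *ab = malloc(sizeof *ab), *ba = malloc(sizeof *ba);
    ab->to = &b; ab->next = NULL; a.nbrs = ab;
    ba->to = &a; ba->next = NULL; b.nbrs = ba;
    remove_node(&a);
    return b.nbrs == NULL ? 0 : 1;   // b must no longer point at a
}

Fine-grained per-node locking would scale better, but it reintroduces the deadlock the slide warns about unless some ordering or rollback discipline is imposed.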

12

Better debugging
Don’t worry about:

The ability to single-step any thread
Instead, worry about:

Conditional breakpoints
For both programs and data
Ad-hoc conditional expressions

Whole-program data perusal
Declaration awareness
Run-time data structures, e.g. queues
Ad-hoc data visualization
Ad-hoc data verification

A much higher level user interface language
Don’t make the GUI mandatory (what generates code?)
Anyone familiar with duel?
gdb> duel #/(root-->(left,right)->key)

13

Debugging shared memory parallelism
Data races are a perennial problem

A more disciplined programming paradigm, e.g. using transactions systematically, would help a lot

Another big issue is verifying invariants of data structures
Anyone else ever read a dump with this end in view?

Invariants could be checked continually, say with daemons
Denelcor debugged Unix System 3 on the HEP that way
Trouble is, the daemons race with the code being checked

Of course the daemons can transact along with everyone else
If transactions are protecting each invariant’s domain, a daemon will always see a consistent state by definition
As long as we are at it, why not routinely make sure the invariant is restored whenever a transaction commits?
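A minimal sketch of the daemon idea (my illustration, with a pthread mutex standing in for the transactions): because the checker acquires the same lock that protects the invariant’s domain, it always observes a consistent state, and the mutator re-establishes the invariant before releasing the lock, just as a committing transaction would.

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

#define M 256

static int  hist[M];              // the data structure being checked
static long items_processed;      // updated under the same lock as hist[]
static pthread_mutex_t inv_lock = PTHREAD_MUTEX_INITIALIZER;

// Invariant: sum over k of hist[k] == items_processed.

static void record(int bin) {     // a mutator: restores the invariant before unlocking
    pthread_mutex_lock(&inv_lock);
    hist[bin]++;
    items_processed++;
    pthread_mutex_unlock(&inv_lock);
}

static void *invariant_daemon(void *arg) {
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&inv_lock);    // "transact" along with the mutators
        long sum = 0;
        for (int k = 0; k < M; k++) sum += hist[k];
        if (sum != items_processed)
            fprintf(stderr, "invariant violated: %ld vs %ld\n", sum, items_processed);
        pthread_mutex_unlock(&inv_lock);
        usleep(100000);                   // re-check ten times a second
    }
    return NULL;
}

int main(void) {
    pthread_t d;
    pthread_create(&d, NULL, invariant_daemon, NULL);
    for (int i = 0; i < 1000000; i++) record(i % M);
    return 0;                             // the daemon dies with the process
}

Holding one big lock for the whole scan is the weakness of this stand-in; real transactions would let the daemon read a consistent snapshot without stalling the mutators.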

14

Better resource management
Parallel computers have a plethora of resources to manage

And well-virtualized resources of a given type are fungible
Resource allocation is all about enforcing policies while satisfying constraints
One policy, e.g. for process scheduling, never fits all users

On the other hand, policy implementations are usually so mixed up with mechanism that they are tough to tweak

It’s certainly not something we want users doing
Why do we use imperative languages to implement policy?

Declarative programming, e.g. CLP, may be a better fit

15

What the heck is CLP?
CLP is Constraint Logic Programming: guarded Horn clauses, say, augmented by predicates over some domain
CLP(R) adds <, =, etc. over the real numbers
Another flavor is CLP(FD) over a finite domain

Unification is extended appropriately to the predicates
The cool thing about CLP is that it directly enforces the constraints and searches for a solution that satisfies them all
There is probably a way to make CLP parallel if need be

In any event it needs to be able to transact against the current state of the resource requests and allocations

We are exploring these ideas in Cascade, our HPCS project

16

Better robustness
Our computer systems are too easy to attack

Prevention is best, but detection is a decent runner-up
Continuous verification of a system’s properties is a direct way to detect intrusion
Violations of resource allocation policy
Loss of data structure integrity

The debugging discussion applies here
Transactions are needed to make it work

Another enemy of robustness is hardware failure
Hardware error checking is necessary but hardly sufficient
Continually checking invariant properties of programs, with daemons or upon transaction commits, is the rest of the story
But what should we do to recover from failures?

Checkpoint/restart is the usual answer

17

How to checkpoint
Don’t do it

Just re-run or re-boot as necessary
Let the user do it

Tell the OS where the restart file(s) and entry point are
Let the compiler do it

Ditto, hopefully at a “good time”
Let the OS do it

Maybe at times specified by the user or the compiler
Do it directly to disk
Use 2x the memory and RAID and/or buffer to disk
Use COW and transactions and do it concurrently and incrementally
Mix the above strategies

How do we decide?

18

Productivity
Productivity is utility U divided by cost C

What one gets for what one spends
We want to maximize it

Utility is generally a function of the time-to-solution
It is positive, non-increasing, and eventually zero

Here we consider execution time, which depends on:
The application
The system

These two parameters also strongly influence the cost

We assume the application is fixed in what follows; with t the time-to-solution on system S,

Ψ(S) = max U(t) / C(S, t)

19

The rental execution cost model
Plausibly, the cost of a run is linear in the time used

The rate might be the life-cycle cost of the system times the fraction used over the system’s useful life:

C(S, t) = R(S) · t,   where   R(S) = life_cost(S) · frac_used(S) / useful_life(S)

and so

Ψ(S) = max U(t) / (R(S) · t)

For a given system under the rental model, productivity strictly decreases with compute time until it becomes zero when the utility reaches zero

The implication is an injective functional relationship between time and productivity under the rental model
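For a feel of the magnitudes, a tiny C sketch of the rate calculation; every number in it is an illustrative assumption of mine, not a figure from the talk:

#include <stdio.h>

int main(void) {
    // Assumed: a $100M system of which this workload uses 2%,
    // amortized over a 5-year useful life.
    double life_cost   = 100e6;                       // dollars
    double frac_used   = 0.02;                        // fraction of the system used
    double useful_life = 5 * 365.25 * 86400;          // seconds
    double R = life_cost * frac_used / useful_life;   // rental rate, dollars/s
    printf("R = %.4f $/s; a 10-hour run costs about $%.0f\n", R, R * 10 * 3600);
    return 0;
}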

20

Merits of the productivity approach
Utility functions quantify the value of timeliness

One example is a hard deadline (“box”) utility function
Productivities can be computed from execution times
Given a probability distribution for productivity, we can compute other quantities readily:
The mean and variance of the productivity
The probability that the productivity is greater than a particular value
The impact of policies and system configuration decisions can be assessed, e.g.
Is a more expensive but more reliable system worthwhile?
How frequently should we checkpoint, if at all?
Should memory or disks be added to speed up checkpointing?
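A minimal Monte Carlo sketch of that computation (mine; the box utility, the rental rate, and the exponential run-time distribution are all assumed for illustration): sampling run times yields productivity samples, from which the mean, variance, and exceedance probability follow directly.

#include <math.h>
#include <stdio.h>
#include <stdlib.h>

// Illustrative numbers only, not from the talk:
#define TRIALS   100000
#define U0       1.0e6      // box utility: $1M if the run meets the deadline, else 0
#define DEADLINE 36000.0    // 10-hour deadline, in seconds
#define RATE     20.0       // rental rate R, dollars per second
#define T_MEAN   30000.0    // mean run time, exponentially distributed here

static double utility(double t) { return t <= DEADLINE ? U0 : 0.0; }

int main(void) {
    double sum = 0.0, sumsq = 0.0;
    long over = 0;
    const double threshold = 1.0;   // P(productivity > 1): worth more than it costs

    for (int i = 0; i < TRIALS; i++) {
        double u = (rand() + 1.0) / ((double)RAND_MAX + 2.0);
        double t = -T_MEAN * log(u);              // sample a run time
        double psi = utility(t) / (RATE * t);     // productivity under the rental model
        sum += psi;
        sumsq += psi * psi;
        if (psi > threshold) over++;
    }
    double mean = sum / TRIALS;
    double var  = sumsq / TRIALS - mean * mean;
    printf("mean %.3f  variance %.3f  P(psi > %.1f) = %.3f\n",
           mean, var, threshold, (double)over / TRIALS);
    return 0;
}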

21

Estimating total running time
The distribution of total running time t is determined by:
Failure process parameters, e.g. the failure rate λ
The no-failure execution time Tx
The execution interval between checkpoints Ti
The operating system start-up time To
The time to perform a checkpoint Tc
The time to perform a restart Tr

Without checkpointing, job execution looks like this:

With checkpointing, job execution looks like this:

[Timeline figures: without checkpointing, each failure forces the whole run (To followed by Tx) to start over; with checkpointing, the run is To followed by repeated Ti + Tc intervals, and each failure costs only Tr + To plus the work done since the last checkpoint.]

22

A Markov process model
Here we have a nontrivial Markov process:

[State diagram: states 0 through n, one per completed checkpoint interval, each paired with a restart state; the edges are labeled probability/event using the probabilities p, q and the events a–d defined below.]

where

q = prob(T ≥ Ti + Tc) = e^(−λ(Ti + Tc)),   p = prob(T < Ti + Tc) = 1 − q

with T the (exponentially distributed) time to the next failure, and the delays associated with the events a–d are

a: To    b: Ti + Tc    c: the time elapsed before the failure    d: Tr + To

24

Mean total time
Computing the mean total time, we get

t̄ = To + n · (p/q) · (1/λ) · e^(λ(Tr + To))

Expanding, with q = e^(−λ(Ti + Tc)),

t̄ = To + (n/λ) · e^(λ(Tr + To)) · (e^(λ(Ti + Tc)) − 1)

This approaches n(Ti + Tc) when the MTBF λ⁻¹ is large and (n/λ) · e^(λ(Ti + Tc + Tr + To)) when λ⁻¹ is small compared to Ti + Tc

25

Optimizing the checkpoint interval

Setting n = Tx / Ti allows determination of the optimal checkpoint interval Ťi:

0 = ∂t̄/∂Ti = (Tx/λ) · e^(λ(Tr + To)) · [ λ Ti e^(λ(Ti + Tc)) − e^(λ(Ti + Tc)) + 1 ] / Ti²

whence

e^(−λ(Ťi + Tc)) = 1 − λ Ťi

This is identical to Daly’s result†

† Equation (9) in Daly, J.T. “A Strategy for Running Large Scale Applications Based on a Model that Optimizes the Checkpoint Interval for Restart Dumps”, Proc. Int. Conf. Supercomputing, St. Malo, France, June 2004
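As a sanity check on the algebra above (which is a reconstruction of the slide’s garbled formulas), here is a small C sketch that solves the optimality condition numerically and compares it with the familiar first-order estimate Ťi ≈ sqrt(2 Tc / λ) − Tc; all parameter values are illustrative assumptions of mine:

#include <math.h>
#include <stdio.h>

// Illustrative parameters (assumed, not from the talk):
static const double lambda = 1.0 / 86400.0;  // failure rate: one failure per day
static const double Tc     = 300.0;          // checkpoint time: 5 minutes
static const double Tr     = 300.0;          // restart time
static const double To     = 60.0;           // OS start-up time
static const double Tx     = 360000.0;       // no-failure execution time: 100 hours

// Mean total time for checkpoint interval Ti, per the model above.
static double mean_time(double Ti) {
    double n = Tx / Ti;                      // number of checkpoint intervals
    return To + (n / lambda) * exp(lambda * (Tr + To))
              * (exp(lambda * (Ti + Tc)) - 1.0);
}

// The optimality condition reduces to g(Ti) = e^(-lambda (Ti+Tc)) - (1 - lambda Ti) = 0.
static double g(double Ti) {
    return exp(-lambda * (Ti + Tc)) - (1.0 - lambda * Ti);
}

int main(void) {
    // g is negative near Ti = 0 and positive at Ti = 1/lambda, so bisect between them.
    double lo = 1.0, hi = 1.0 / lambda;
    for (int k = 0; k < 100; k++) {
        double mid = 0.5 * (lo + hi);
        if (g(mid) < 0.0) lo = mid; else hi = mid;
    }
    double Ti_opt   = 0.5 * (lo + hi);
    double Ti_first = sqrt(2.0 * Tc / lambda) - Tc;   // first-order estimate

    printf("optimal interval:     %.0f s (%.2f h)\n", Ti_opt, Ti_opt / 3600.0);
    printf("first-order estimate: %.0f s (%.2f h)\n", Ti_first, Ti_first / 3600.0);
    printf("mean total time:      %.0f s at the optimum, %.0f s with a single interval (Ti = Tx)\n",
           mean_time(Ti_opt), mean_time(Tx));
    return 0;
}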

26

Optimizing checkpoint bandwidth

Discs can be bought to reduce t̄ by reducing Tc (and Tr)
Whether it is worth doing depends on the utility

Assuming a rental cost model, constant utility, an optimal checkpoint interval and a disk rental “rate” of D dollars/s per image/s:

Ψ = U / [ (R + D/Tc) · t̄ ]

Letting Tc = Tr and basing cost on the expected total time,

0 = ∂/∂Tc [ (R + D/Tc) · t̄ ]

The solution looks something like

Ťc = [ −D + sqrt( D² + 4 D R f(t̄, Tc) ) ] / (2 R)

27

Conclusions
It is time to take parallel computing seriously
We have a lot of architectural work to do in software

Not that the hardware is all done!
Software architecture is needed to define and refine it

New ways of thinking about programming may be useful
New ways of thinking about performance are also needed