Art of Multiprocessor Programming: Futures, Scheduling, and Work Distribution. Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit. Modified by Rajeev Alur for CIS 640, University of Pennsylvania.

TRANSCRIPT

Page 1:

Futures, Scheduling, and Work Distribution

Companion slides for The Art of Multiprocessor Programming
by Maurice Herlihy & Nir Shavit
Modified by Rajeev Alur for CIS 640, University of Pennsylvania

Page 2:

How to write Parallel Apps?

• How to
– split a program into parallel parts
– in an effective way
– thread management

Page 3:

Matrix Multiplication

[Figure: C = A · B]

Page 4:

Matrix Multiplication

c_ij = Σ_{k=0}^{N-1} a_ik · b_kj

Page 5:

Matrix Multiplication

class Worker extends Thread {
  int row, col;
  Worker(int row, int col) {
    this.row = row; this.col = col;
  }
  public void run() {
    double dotProduct = 0.0;
    for (int i = 0; i < n; i++)
      dotProduct += a[row][i] * b[i][col];
    c[row][col] = dotProduct;
  }
}

Page 6:

Matrix Multiplication (same Worker code)

Extending Thread makes each Worker a thread.

Page 7:

Matrix Multiplication (same Worker code)

The row and col fields record which matrix entry to compute.

Page 8:

Matrix Multiplication (same Worker code)

The run() loop is the actual computation: a dot product.

Page 9:

Matrix Multiplication

void multiply() {
  Worker[][] worker = new Worker[n][n];
  for (int row …)
    for (int col …)
      worker[row][col] = new Worker(row, col);
  for (int row …)
    for (int col …)
      worker[row][col].start();
  for (int row …)
    for (int col …)
      worker[row][col].join();
}

Page 10:

Matrix Multiplication (same multiply() code)

Create n × n threads.

Page 11:

Matrix Multiplication (same multiply() code)

Start them.

Page 12:

Matrix Multiplication (same multiply() code)

Start them, then wait for them to finish.

Page 13:

Matrix Multiplication (same multiply() code)

What’s wrong with this picture?

Page 14:

Thread Overhead

• Threads require resources
– memory for stacks
– setup, teardown
• Scheduler overhead
• Worse for short-lived threads

Page 15:

Thread Pools

• More sensible to keep a pool of long-lived threads
• Threads are assigned short-lived tasks; each thread
– runs the task
– rejoins the pool
– waits for its next assignment

Page 16:

Thread Pool = Abstraction

• Insulates the programmer from the platform
– big machine, big pool
– and vice versa
• Portable code
– runs well on any platform
– no need to mix algorithm and platform concerns

Page 17:

ExecutorService Interface

• In java.util.concurrent
– Task = Runnable object
• if no result value is expected
• the pool calls its run() method
– Task = Callable<T> object
• if a result value of type T is expected
• the pool calls its T call() method
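A minimal, self-contained sketch of both task flavors submitted to a small pool (the pool size, lambda bodies, and printed values are illustrative, not from the slides):

```java
import java.util.concurrent.*;

public class TaskDemo {
    public static void main(String[] args) throws Exception {
        ExecutorService exec = Executors.newFixedThreadPool(2);

        // Runnable: no result value; the pool calls run()
        Runnable hello = () -> System.out.println("hello from a pool thread");
        Future<?> f1 = exec.submit(hello);
        f1.get();                     // blocks until run() completes

        // Callable<T>: returns a T; the pool calls call()
        Callable<Integer> answer = () -> 6 * 7;
        Future<Integer> f2 = exec.submit(answer);
        System.out.println(f2.get()); // blocks until the value is ready

        exec.shutdown();
    }
}
```

The same submit() call handles both: the Runnable variant yields a Future<?> that only signals completion, while the Callable<Integer> variant yields a Future<Integer> carrying the result.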

Page 18:

Future<T>

Callable<T> task = …;
…
Future<T> future = executor.submit(task);
…
T value = future.get();

Page 19:

Future<T> (same code)

Submitting a Callable<T> task returns a Future<T> object.

Page 20:

Future<T> (same code)

The Future’s get() method blocks until the value is available.

Page 21:

Future<?>

Runnable task = …;
…
Future<?> future = executor.submit(task);
…
future.get();

Page 22:

Future<?> (same code)

Submitting a Runnable task returns a Future<?> object.

Page 23:

Future<?> (same code)

The Future’s get() method blocks until the computation is complete.

Page 24:

Matrix Addition

| C00 C01 |   | A00+B00  A01+B01 |
| C10 C11 | = | A10+B10  A11+B11 |

Page 25:

Matrix Addition (same block-matrix figure)

4 parallel additions

Page 26:

Matrix Addition Task

class AddTask implements Runnable {
  Matrix a, b; // multiply this!
  public void run() {
    if (a.dim == 1) {
      c[0][0] = a[0][0] + b[0][0]; // base case
    } else {
      (partition a, b into half-size matrices aij and bij)
      Future<?> f00 = exec.submit(add(a00,b00));
      …
      Future<?> f11 = exec.submit(add(a11,b11));
      f00.get(); …; f11.get();
      …
    }
  }
}

Page 27:

Matrix Addition Task (same code)

Base case: add directly.

Page 28:

Matrix Addition Task (same code)

Partitioning is a constant-time operation.

Page 29:

Matrix Addition Task (same code)

Submit 4 tasks.

Page 30:

Matrix Addition Task (same code)

Let them finish.

Page 31:

Dependencies

• The matrix example is not typical: its tasks are independent
– don’t need the results of one task …
– to complete another
• Often tasks are not independent

Page 32:

Fibonacci

F(n) = 1 if n = 0 or 1
F(n) = F(n-1) + F(n-2) otherwise

• Note
– potential parallelism
– dependencies

Page 33:

Disclaimer

• This Fibonacci implementation is egregiously inefficient
– so don’t deploy it!
– but it illustrates our point: how to deal with dependencies
• Exercise: make this implementation efficient!

Page 34:

Multithreaded Fibonacci

class FibTask implements Callable<Integer> {
  static ExecutorService exec =
    Executors.newCachedThreadPool();
  int arg;
  public FibTask(int n) { arg = n; }
  public Integer call() throws Exception {
    if (arg > 2) {
      Future<Integer> left  = exec.submit(new FibTask(arg-1));
      Future<Integer> right = exec.submit(new FibTask(arg-2));
      return left.get() + right.get();
    } else {
      return 1;
    }
  }
}

Page 35:

Multithreaded Fibonacci (same code)

The two submit() calls issue parallel calls.

Page 36:

Multithreaded Fibonacci (same code)

The get() calls pick up & combine the results.

Page 37:

Dynamic Behavior

• A multithreaded program is
– a directed acyclic graph (DAG)
– that unfolds dynamically
• Each node is a single unit of work

Page 38:

Fib DAG

[Figure: the DAG unfolded by fib(4): fib(4) submits fib(3) and fib(2); fib(3) submits fib(2) and fib(1); edges are labeled submit and get; leaves are fib(1) and fib(2) base cases]

Page 39:

Arrows Reflect Dependencies

[Figure: the same fib(4) DAG; the submit and get arrows show the dependencies]

Page 40:

How Parallel is That?

• Define work:
– total time on one processor
• Define critical-path length:
– longest dependency path
– can’t beat that!

Page 41:

Fib Work

[Figure: the fib(4) DAG with all of its nodes shown]

Page 42:

Fib Work

[Figure: the fib(4) DAG with its nodes numbered 1 through 17]

Work is 17.

Page 43:

Fib Critical Path

[Figure: the fib(4) DAG]

Page 44:

Fib Critical Path

[Figure: one longest dependency path through the fib(4) DAG, nodes numbered 1 through 8]

Critical-path length is 8.

Page 45:

Notation Watch

• T_P = time on P processors
• T_1 = work (time on 1 processor)
• T_∞ = critical-path length (time on ∞ processors)

Page 46:

Simple Bounds

• T_P ≥ T_1/P
– in one step, can’t do more than P work
• T_P ≥ T_∞
– can’t beat infinite resources

Page 47:

More Notation Watch

• Speedup on P processors
– the ratio T_1/T_P
– how much faster with P processors
• Linear speedup
– T_1/T_P = Θ(P)
• Max speedup (average parallelism)
– T_1/T_∞
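Plugging the fib(4) numbers from the preceding slides (T_1 = 17, T_∞ = 8) into these definitions; the processor count P = 4 is an arbitrary choice for illustration:

```java
public class SpeedupBounds {
    public static void main(String[] args) {
        double t1   = 17;  // work of fib(4), from the slides
        double tInf = 8;   // critical-path length of fib(4), from the slides
        int p = 4;         // example processor count (arbitrary)

        // Both lower bounds apply: T_P >= T_1/P and T_P >= T_inf,
        // so the critical path dominates here (8 > 17/4).
        double bound = Math.max(t1 / p, tInf);
        System.out.println("T_" + p + " >= " + bound);

        // Max speedup (average parallelism) is T_1/T_inf.
        System.out.println("max speedup = " + (t1 / tInf));
    }
}
```

So no matter how many processors we throw at fib(4), the speedup can never exceed 17/8 ≈ 2.125.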

Page 48:

Matrix Addition

[Figure: the same block-matrix addition as Page 24]

Page 49:

Matrix Addition (same figure)

4 parallel additions

Page 50:

Addition

• Let A_P(n) be the running time
– for an n × n matrix
– on P processors
• For example
– A_1(n) is the work
– A_∞(n) is the critical-path length

Page 51:

Addition

• Work is

A_1(n) = 4 A_1(n/2) + Θ(1)

(4 spawned additions; Θ(1) to partition, synch, etc.)

Page 52:

Addition

• Work is

A_1(n) = 4 A_1(n/2) + Θ(1) = Θ(n²)

Same as double-loop summation.

Page 53:

Addition

• Critical-path length is

A_∞(n) = A_∞(n/2) + Θ(1)

(spawned additions run in parallel; Θ(1) to partition, synch, etc.)

Page 54:

Addition

• Critical-path length is

A_∞(n) = A_∞(n/2) + Θ(1) = Θ(log n)

Page 55:

Matrix Multiplication Redux

[Figure: C = A · B]

Page 56:

Matrix Multiplication Redux

[Figure: C, A, and B each partitioned into 2 × 2 blocks with indices 11, 12, 21, 22]

Page 57:

First Phase …

[Figure: the 8 half-size block products A_ik · B_kj that contribute to C’s blocks]

8 multiplications

Page 58:

Second Phase …

[Figure: the pairs of block products are summed into C’s blocks]

4 additions

Page 59:

Multiplication

• Work is

M_1(n) = 8 M_1(n/2) + A_1(n)

(8 parallel multiplications; then the final addition)

Page 60:

Multiplication

• Work is

M_1(n) = 8 M_1(n/2) + Θ(n²) = Θ(n³)

Same as the serial triple-nested loop.

Page 61:

Multiplication

• Critical-path length is

M_∞(n) = M_∞(n/2) + A_∞(n)

(half-size multiplications run in parallel; then the final addition)

Page 62:

Multiplication

• Critical-path length is

M_∞(n) = M_∞(n/2) + A_∞(n)
       = M_∞(n/2) + Θ(log n)
       = Θ(log² n)

Page 63:

Parallelism

• M_1(n)/M_∞(n) = Θ(n³/log² n)
• To multiply two 1000 × 1000 matrices
– 1000³/10² = 10⁷
• Much more than the number of processors on any real machine

Page 64:

Work Distribution

[Figure: one thread buried in work while another sleeps — “zzz…”]

Page 65:

Work Dealing

[Figure: the busy thread deals work out to an idle one — “Yes!”]

Page 66:

The Problem with Work Dealing

[Figure: every thread tries to deal work at once — “D’oh!”]

Page 67:

Work Stealing

[Figure: an idle thread (“No work… zzz”) steals work from a busy one — “Yes!”]

Page 68:

Lock-Free Work Stealing

• Each thread has a pool of ready work
• Remove work without synchronizing
• If you run out of work, steal someone else’s
• Choose the victim at random
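The loop each pool thread runs can be sketched as follows. This uses java.util.concurrent’s lock-free ConcurrentLinkedDeque as a stand-in for the custom DEQueue developed on the following slides; the thread count, task count, and task bodies are illustrative assumptions:

```java
import java.util.*;
import java.util.concurrent.ConcurrentLinkedDeque;
import java.util.concurrent.atomic.AtomicInteger;

public class StealDemo {
    static final int THREADS = 2, TASKS = 100;
    static final List<ConcurrentLinkedDeque<Runnable>> pools = new ArrayList<>();
    static final AtomicInteger done = new AtomicInteger();

    public static void main(String[] args) throws Exception {
        for (int i = 0; i < THREADS; i++) pools.add(new ConcurrentLinkedDeque<>());
        // Put every task in thread 0's pool, so thread 1 can only steal.
        for (int i = 0; i < TASKS; i++) pools.get(0).addLast(done::incrementAndGet);

        Thread[] ts = new Thread[THREADS];
        for (int i = 0; i < THREADS; i++) {
            final int me = i;
            ts[i] = new Thread(() -> {
                Random rnd = new Random();
                while (done.get() < TASKS) {
                    Runnable task = pools.get(me).pollLast();   // my own bottom
                    if (task == null) {                         // out of work:
                        int victim = rnd.nextInt(THREADS);      // pick a random victim
                        task = pools.get(victim).pollFirst();   // steal from its top
                    }
                    if (task != null) task.run();
                }
            });
            ts[i].start();
        }
        for (Thread t : ts) t.join();
        System.out.println("completed " + done.get() + " tasks");
    }
}
```

The key asymmetry, developed in the next slides, is that the owner works at the bottom of its deque while thieves take from the top, so they rarely contend.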

Page 69:

Local Work Pools

Each work pool is a double-ended queue (DEQueue).

Page 70:

Work DEQueue¹

[Figure: a queue of work items; pushBottom and popBottom operate at one end]

1. Double-Ended Queue

Page 71:

Obtain Work

• Obtain work with popBottom
• Run the task until it blocks or terminates

Page 72:

New Work

• Unblock node
• Spawn node

(both added with pushBottom)

Page 73:

Whatcha Gonna do When the Well Runs Dry?

[Figure: a thread finds its DEQueue empty — “@&%$!!”]

Page 74:

Steal Work from Others

Pick a random thread’s DEQueue.

Page 75:

Steal this Task!

[Figure: the thief removes a task from the top of the victim’s DEQueue via popTop]

Page 76:

Task DEQueue

• Methods
– pushBottom
– popBottom
– popTop

pushBottom and popBottom never happen concurrently: only the owner calls them.

Page 77:

Task DEQueue

• Methods
– pushBottom
– popBottom
– popTop

pushBottom and popBottom are the most common — make them fast (minimize use of CAS).

Page 78:

Ideal

• Wait-free
• Linearizable
• Constant time

Page 79:

Compromise

• Method popTop may fail if
– a concurrent popTop succeeds, or
– a concurrent popBottom takes the last work

Blame the victim!

Page 80:

Dreaded ABA Problem

[Figure: a thief reads top and prepares a CAS]

Pages 81–86:

Dreaded ABA Problem

[Figure sequence: while the thief’s CAS is pending, other threads remove and add tasks; top ends up holding its old value again]

Page 87:

Dreaded ABA Problem

[Figure: the thief’s CAS on top now succeeds — “Yes!” — even though the queue has changed underneath it. Uh-oh …]

Page 88:

Fix for Dreaded ABA

[Figure: the DEQueue’s top field carries a stamp alongside the index; bottom is a plain index]
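The stamp turns top into an (index, stamp) pair whose stamp only moves forward, so a delayed CAS cannot be fooled when the index alone returns to an old value. A minimal demonstration with java.util.concurrent.atomic.AtomicStampedReference (the concrete index values are arbitrary; small Integer values compare correctly here because of Java’s Integer cache):

```java
import java.util.concurrent.atomic.AtomicStampedReference;

public class StampDemo {
    public static void main(String[] args) {
        // top holds an index plus a stamp that is bumped on every change
        AtomicStampedReference<Integer> top = new AtomicStampedReference<>(3, 0);

        int[] stampHolder = new int[1];
        int oldTop = top.get(stampHolder);   // read index and stamp together
        int oldStamp = stampHolder[0];

        // meanwhile, other threads move top away and back:
        // the index is 3 again, but the stamp has advanced
        top.compareAndSet(3, 1, 0, 1);
        top.compareAndSet(1, 3, 1, 2);

        // a CAS on the index alone would be fooled by this A-B-A;
        // the stamped CAS fails because the stamp moved from 0 to 2
        boolean ok = top.compareAndSet(oldTop, oldTop + 1, oldStamp, oldStamp + 1);
        System.out.println(ok);
    }
}
```

This prints false: the delayed CAS is correctly rejected, which is exactly the discipline the BDEQueue code on the next slides relies on.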

Page 89:

Bounded DEQueue

public class BDEQueue {
  AtomicStampedReference<Integer> top;
  volatile int bottom;
  Runnable[] tasks;
  …
}

Page 90:

Bounded DEQueue (same code)

top holds the index and stamp together (synchronized via CAS).

Page 91:

Bounded DEQueue (same code)

bottom is the index of the bottom task: no need to synchronize, but it does need a memory barrier (hence volatile).

Page 92

Bounded DEQueue

public class BDEQueue {
  AtomicStampedReference<Integer> top;
  volatile int bottom;
  Runnable[] tasks;
  …
}

Array holding tasks
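Pulling the three fields together, a minimal initialization sketch; the constructor, its capacity parameter, and the isEmpty helper are assumptions added for illustration:

```java
import java.util.concurrent.atomic.AtomicStampedReference;

public class BDEQueue {
    AtomicStampedReference<Integer> top;  // index + stamp, updated atomically
    volatile int bottom;                  // owner-only index; volatile for visibility
    Runnable[] tasks;                     // the bounded task array

    // Hypothetical constructor: the capacity parameter is an assumption.
    public BDEQueue(int capacity) {
        top = new AtomicStampedReference<>(0, 0);  // start at index 0, stamp 0
        bottom = 0;
        tasks = new Runnable[capacity];
    }

    // Empty when the owner's index has not passed the thieves' index.
    public boolean isEmpty() {
        return bottom <= top.getReference();
    }

    public static void main(String[] args) {
        System.out.println(new BDEQueue(16).isEmpty());  // freshly created: empty
    }
}
```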

Page 93

pushBottom()

public class BDEQueue {
  …
  void pushBottom(Runnable r) {
    tasks[bottom] = r;
    bottom++;
  }
  …
}

Page 94

pushBottom()

public class BDEQueue {
  …
  void pushBottom(Runnable r) {
    tasks[bottom] = r;
    bottom++;
  }
  …
}

Bottom is the index at which to store the new task in the array

Page 95

pushBottom()

public class BDEQueue {
  …
  void pushBottom(Runnable r) {
    tasks[bottom] = r;
    bottom++;
  }
  …
}

Adjust the bottom index

(diagram: stamp, top, bottom)
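The slide's pushBottom assumes the array never fills. A sketch with a capacity check added; the boolean return, the class name, and the fixed capacity are assumptions, not from the slides:

```java
public class PushDemo {
    volatile int bottom = 0;
    final Runnable[] tasks = new Runnable[4];   // small capacity for the demo

    boolean pushBottom(Runnable r) {
        if (bottom == tasks.length)
            return false;        // full: caller must handle the overflow
        tasks[bottom] = r;       // store the task first...
        bottom++;                // ...then publish it via the volatile write
        return true;
    }

    public static void main(String[] args) {
        PushDemo q = new PushDemo();
        int pushed = 0;
        while (q.pushBottom(() -> {}))   // fill to capacity
            pushed++;
        System.out.println("pushed " + pushed);
    }
}
```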

Page 96

Steal Work

public Runnable popTop() {
  int[] stamp = new int[1];
  int oldTop = top.get(stamp), newTop = oldTop + 1;
  int oldStamp = stamp[0], newStamp = oldStamp + 1;
  if (bottom <= oldTop)
    return null;
  Runnable r = tasks[oldTop];
  if (top.CAS(oldTop, newTop, oldStamp, newStamp))
    return r;
  return null;
}

Page 97

Steal Work

public Runnable popTop() {
  int[] stamp = new int[1];
  int oldTop = top.get(stamp), newTop = oldTop + 1;
  int oldStamp = stamp[0], newStamp = oldStamp + 1;
  if (bottom <= oldTop)
    return null;
  Runnable r = tasks[oldTop];
  if (top.CAS(oldTop, newTop, oldStamp, newStamp))
    return r;
  return null;
}

Read top (value & stamp)

Page 98

Steal Work

public Runnable popTop() {
  int[] stamp = new int[1];
  int oldTop = top.get(stamp), newTop = oldTop + 1;
  int oldStamp = stamp[0], newStamp = oldStamp + 1;
  if (bottom <= oldTop)
    return null;
  Runnable r = tasks[oldTop];
  if (top.CAS(oldTop, newTop, oldStamp, newStamp))
    return r;
  return null;
}

Compute new value & stamp

Page 99

Steal Work

public Runnable popTop() {
  int[] stamp = new int[1];
  int oldTop = top.get(stamp), newTop = oldTop + 1;
  int oldStamp = stamp[0], newStamp = oldStamp + 1;
  if (bottom <= oldTop)
    return null;
  Runnable r = tasks[oldTop];
  if (top.CAS(oldTop, newTop, oldStamp, newStamp))
    return r;
  return null;
}

Quit if queue is empty

(diagram: stamp, top, bottom)

Page 100

Steal Work

public Runnable popTop() {
  int[] stamp = new int[1];
  int oldTop = top.get(stamp), newTop = oldTop + 1;
  int oldStamp = stamp[0], newStamp = oldStamp + 1;
  if (bottom <= oldTop)
    return null;
  Runnable r = tasks[oldTop];
  if (top.CAS(oldTop, newTop, oldStamp, newStamp))
    return r;
  return null;
}

Try to steal the task

(diagram: stamp, top, CAS, bottom)

Page 101

Steal Work

public Runnable popTop() {
  int[] stamp = new int[1];
  int oldTop = top.get(stamp), newTop = oldTop + 1;
  int oldStamp = stamp[0], newStamp = oldStamp + 1;
  if (bottom <= oldTop)
    return null;
  Runnable r = tasks[oldTop];
  if (top.CAS(oldTop, newTop, oldStamp, newStamp))
    return r;
  return null;
}

Give up if conflict occurs
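A hypothetical thief loop built on top of popTop; the TaskQueue interface, the steal helper, and the yield-based back-off are assumptions, not from the slides. A null result means either an empty queue or a lost CAS, so the thief simply moves on:

```java
public class StealLoop {
    // Assumed minimal interface: just the part of the deque a thief uses.
    interface TaskQueue { Runnable popTop(); }

    // Scan the victims once, returning the first stolen task or null.
    static Runnable steal(TaskQueue[] victims) {
        for (TaskQueue q : victims) {
            Runnable r = q.popTop();   // null: empty queue or lost the CAS
            if (r != null)
                return r;              // stole one task; go run it
            Thread.yield();            // brief back-off before the next victim
        }
        return null;                   // nothing to steal right now
    }

    public static void main(String[] args) {
        TaskQueue empty = () -> null;
        TaskQueue full  = () -> () -> System.out.println("stolen task runs");
        Runnable r = steal(new TaskQueue[] { empty, full });
        if (r != null)
            r.run();
    }
}
```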

Page 102

Take Work

Runnable popBottom() {
  if (bottom == 0)
    return null;
  bottom--;
  Runnable r = tasks[bottom];
  int[] stamp = new int[1];
  int oldTop = top.get(stamp), newTop = 0;
  int oldStamp = stamp[0], newStamp = oldStamp + 1;
  if (bottom > oldTop)
    return r;
  if (bottom == oldTop) {
    bottom = 0;
    if (top.CAS(oldTop, newTop, oldStamp, newStamp))
      return r;
  }
  top.set(newTop, newStamp);
  return null;
}

Page 103

Take Work

Runnable popBottom() {
  if (bottom == 0)
    return null;
  bottom--;
  Runnable r = tasks[bottom];
  int[] stamp = new int[1];
  int oldTop = top.get(stamp), newTop = 0;
  int oldStamp = stamp[0], newStamp = oldStamp + 1;
  if (bottom > oldTop)
    return r;
  if (bottom == oldTop) {
    bottom = 0;
    if (top.CAS(oldTop, newTop, oldStamp, newStamp))
      return r;
  }
  top.set(newTop, newStamp);
  return null;
}

Make sure queue is non-empty

Page 104

Take Work

Runnable popBottom() {
  if (bottom == 0)
    return null;
  bottom--;
  Runnable r = tasks[bottom];
  int[] stamp = new int[1];
  int oldTop = top.get(stamp), newTop = 0;
  int oldStamp = stamp[0], newStamp = oldStamp + 1;
  if (bottom > oldTop)
    return r;
  if (bottom == oldTop) {
    bottom = 0;
    if (top.CAS(oldTop, newTop, oldStamp, newStamp))
      return r;
  }
  top.set(newTop, newStamp);
  return null;
}

Prepare to grab bottom task

Page 105

Take Work

Runnable popBottom() {
  if (bottom == 0)
    return null;
  bottom--;
  Runnable r = tasks[bottom];
  int[] stamp = new int[1];
  int oldTop = top.get(stamp), newTop = 0;
  int oldStamp = stamp[0], newStamp = oldStamp + 1;
  if (bottom > oldTop)
    return r;
  if (bottom == oldTop) {
    bottom = 0;
    if (top.CAS(oldTop, newTop, oldStamp, newStamp))
      return r;
  }
  top.set(newTop, newStamp);
  return null;
}

Read top, & prepare new values

Page 106

Take Work

Runnable popBottom() {
  if (bottom == 0)
    return null;
  bottom--;
  Runnable r = tasks[bottom];
  int[] stamp = new int[1];
  int oldTop = top.get(stamp), newTop = 0;
  int oldStamp = stamp[0], newStamp = oldStamp + 1;
  if (bottom > oldTop)
    return r;
  if (bottom == oldTop) {
    bottom = 0;
    if (top.CAS(oldTop, newTop, oldStamp, newStamp))
      return r;
  }
  top.set(newTop, newStamp);
  return null;
}

If top & bottom are 1 or more apart, no conflict

(diagram: stamp, top, bottom)

Page 107

Take Work

Runnable popBottom() {
  if (bottom == 0)
    return null;
  bottom--;
  Runnable r = tasks[bottom];
  int[] stamp = new int[1];
  int oldTop = top.get(stamp), newTop = 0;
  int oldStamp = stamp[0], newStamp = oldStamp + 1;
  if (bottom > oldTop)
    return r;
  if (bottom == oldTop) {
    bottom = 0;
    if (top.CAS(oldTop, newTop, oldStamp, newStamp))
      return r;
  }
  top.set(newTop, newStamp);
  return null;
}

At most one item left

(diagram: stamp, top, bottom)

Page 108

Take Work

Runnable popBottom() {
  if (bottom == 0)
    return null;
  bottom--;
  Runnable r = tasks[bottom];
  int[] stamp = new int[1];
  int oldTop = top.get(stamp), newTop = 0;
  int oldStamp = stamp[0], newStamp = oldStamp + 1;
  if (bottom > oldTop)
    return r;
  if (bottom == oldTop) {
    bottom = 0;
    if (top.CAS(oldTop, newTop, oldStamp, newStamp))
      return r;
  }
  top.set(newTop, newStamp);
  return null;
}

Try to steal the last task. Always reset bottom because the DEQueue will be empty even if unsuccessful (why?)

Page 109

Take Work

Runnable popBottom() {
  if (bottom == 0)
    return null;
  bottom--;
  Runnable r = tasks[bottom];
  int[] stamp = new int[1];
  int oldTop = top.get(stamp), newTop = 0;
  int oldStamp = stamp[0], newStamp = oldStamp + 1;
  if (bottom > oldTop)
    return r;
  if (bottom == oldTop) {
    bottom = 0;
    if (top.CAS(oldTop, newTop, oldStamp, newStamp))
      return r;
  }
  top.set(newTop, newStamp);
  return null;
}

(diagram: stamp, top, CAS, bottom)

I win CAS

Page 110

Take Work

Runnable popBottom() {
  if (bottom == 0)
    return null;
  bottom--;
  Runnable r = tasks[bottom];
  int[] stamp = new int[1];
  int oldTop = top.get(stamp), newTop = 0;
  int oldStamp = stamp[0], newStamp = oldStamp + 1;
  if (bottom > oldTop)
    return r;
  if (bottom == oldTop) {
    bottom = 0;
    if (top.CAS(oldTop, newTop, oldStamp, newStamp))
      return r;
  }
  top.set(newTop, newStamp);
  return null;
}

(diagram: stamp, top, CAS, bottom)

If I lose the CAS, the thief must have won…

Page 111

Take Work

Runnable popBottom() {
  if (bottom == 0)
    return null;
  bottom--;
  Runnable r = tasks[bottom];
  int[] stamp = new int[1];
  int oldTop = top.get(stamp), newTop = 0;
  int oldStamp = stamp[0], newStamp = oldStamp + 1;
  if (bottom > oldTop)
    return r;
  if (bottom == oldTop) {
    bottom = 0;
    if (top.CAS(oldTop, newTop, oldStamp, newStamp))
      return r;
  }
  top.set(newTop, newStamp);
  return null;
}

Failed to get the last task: must still reset top
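Putting the three slide methods together, a runnable single-threaded sketch; the class name, capacity, and demo harness are assumptions, and the slides' top.CAS shorthand is written out as Java's compareAndSet:

```java
import java.util.concurrent.atomic.AtomicStampedReference;

public class BDEQueueDemo {
    // Indices stay small here, so the JVM's small-Integer cache makes the
    // reference-based compareAndSet behave like a value comparison.
    final AtomicStampedReference<Integer> top = new AtomicStampedReference<>(0, 0);
    volatile int bottom = 0;
    final Runnable[] tasks;

    BDEQueueDemo(int capacity) { tasks = new Runnable[capacity]; }

    void pushBottom(Runnable r) {           // owner only
        tasks[bottom] = r;
        bottom++;                           // volatile write publishes the task
    }

    Runnable popTop() {                     // called by thieves
        int[] stamp = new int[1];
        int oldTop = top.get(stamp), newTop = oldTop + 1;
        int oldStamp = stamp[0], newStamp = oldStamp + 1;
        if (bottom <= oldTop)               // empty
            return null;
        Runnable r = tasks[oldTop];
        if (top.compareAndSet(oldTop, newTop, oldStamp, newStamp))
            return r;
        return null;                        // lost the race
    }

    Runnable popBottom() {                  // called by the owner
        if (bottom == 0)
            return null;
        bottom--;
        Runnable r = tasks[bottom];
        int[] stamp = new int[1];
        int oldTop = top.get(stamp), newTop = 0;
        int oldStamp = stamp[0], newStamp = oldStamp + 1;
        if (bottom > oldTop)                // no possible conflict with a thief
            return r;
        if (bottom == oldTop) {             // exactly one task: race the thieves
            bottom = 0;
            if (top.compareAndSet(oldTop, newTop, oldStamp, newStamp))
                return r;
        }
        top.set(newTop, newStamp);          // queue is empty either way
        return null;
    }

    // Single-threaded smoke test: push three tasks, steal one, take the rest.
    static int demo() {
        BDEQueueDemo q = new BDEQueueDemo(8);
        int[] count = new int[1];
        for (int i = 0; i < 3; i++)
            q.pushBottom(() -> count[0]++);
        Runnable stolen = q.popTop();       // thief takes tasks[0]
        if (stolen != null)
            stolen.run();
        for (Runnable r = q.popBottom(); r != null; r = q.popBottom())
            r.run();
        return count[0];                    // all three tasks should have run
    }

    public static void main(String[] args) {
        System.out.println("tasks run: " + demo());
    }
}
```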

Page 112

Variations

• Stealing is expensive
  – Pay CAS
  – Only one thread taken

• What if
  – Randomly balance loads?

Page 113

Work Stealing & Balancing

• Clean separation between app & scheduling layer

• Works well when number of processors fluctuates.

• Works on “black-box” operating systems