Simplifying Parallel Programming with Compiler Transformations

Page 1: Simplifying Parallel Programming with Compiler Transformations


Matt Frank, University of Illinois

Simplifying Parallel Programming with Compiler Transformations

Page 2: Simplifying Parallel Programming with Compiler Transformations


What I’m ranting about

• Transformations that alleviate tedium
• Analogous to:
  – code generation, register allocation, and instruction scheduling
  – (not really “optimizations”)
• Mainly:
  – loop distribution, reassociation, “scalar” expansion, inspector-executor, hashing
• These cover much more than you might think
• Parallel-language expressivity

Page 3: Simplifying Parallel Programming with Compiler Transformations


Assumptions

• Cache-coherent shared-memory many-cores
  – (I’m not addressing distributed-memory issues)
• Synchronization is somewhat expensive
  – Don’t use barriers gratuitously (but don’t avoid them at all costs)
• Analysis is not my problem
  – The programmer annotates
• Non-determinism is outside the realm of this talk
  – No race detection in this talk either

Page 4: Simplifying Parallel Programming with Compiler Transformations


Compiler Flow

[Diagram: front-end type systems and whole-program analysis -> dependence-graph (PDG) based compiler -> runtime/execution platform, with feedback from the runtime back to the compiler]

New information:
1. Type systems (e.g., DPJ)
2. Domain-specific objects
3. Run-time feedback

Program analysis (information about high-level program invariants) enables more efficient coherence, checkpointing, and quality of service.

New capabilities: checkpointing, QoS guarantees.

Page 5: Simplifying Parallel Programming with Compiler Transformations


I’m leaving out locality

[Diagram: front-end type systems and whole-program analysis -> parallelism-exposing transformations -> runtime/execution platform. Tiling, etc., is left out of this talk]

Page 6: Simplifying Parallel Programming with Compiler Transformations


What’s enabled?

• Loops that contain arbitrary control flow
  – Including early exits, arbitrary function calls, etc.
• Arbitrary iterators (even sequential ones)
  – Though they can’t depend on the main body of the computation
• Arbitrary combinations of data-parallel work, scans, and reductions
• Use of “partial sums” inside the loop
• Buffered printf

Page 7: Simplifying Parallel Programming with Compiler Transformations


The transformations

• Scalar expansion
  – Eliminates anti- and output dependences
  – Can be applied to properly scoped aggregates
• Reassociation
  – Integer reassociation is extraordinarily useful
  – Partial sums can then be used later in the loop!
• Loop distribution
  – Think of it as scheduling
• Inspector-executor
  – Works as long as the data access pattern is invariant in the loop
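To make the first bullet concrete, here is a minimal C sketch (mine, not from the slides; all names are illustrative). The scalar t is written by every iteration, creating output and anti dependences that serialize the loop; expanding t into an array leaves only within-iteration dependences, so the loop becomes a doall:

    #include <stdlib.h>

    /* Before: every iteration writes and reads the single scalar t,
       so iterations cannot safely run in parallel. */
    void before(const double *a, const double *c, double *b, int n) {
        double t;
        for (int i = 0; i < n; i++) {
            t = a[i] * 2.0;
            b[i] = t + c[i];
        }
    }

    /* After scalar expansion: iteration i owns t[i]; the loop is a doall. */
    void after(const double *a, const double *c, double *b, int n) {
        double *t = malloc(n * sizeof *t);
        #pragma omp parallel for
        for (int i = 0; i < n; i++) {
            t[i] = a[i] * 2.0;
            b[i] = t[i] + c[i];
        }
        free(t);
    }

When no later loop reads the expanded values, simple privatization (e.g., OpenMP’s private(t)) would suffice; full expansion matters exactly when partial results are consumed later, as in the sparse-matrix example below.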

Page 8: Simplifying Parallel Programming with Compiler Transformations


You’ve heard of map-reduce

    doall i (1..n)
      private j = f(X[i])
      total = total + j

becomes

    shared j[n]
    doall i (1..n)
      j[i] = f(X[i])
    do i (1..n)
      total = total + j[i]
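The same rewrite in C, as a sketch with my own stand-in f (the slides don’t specify one): scalar expansion turns the private j into an array, and loop distribution splits the parallel map from the sequential reduction:

    #include <stdlib.h>

    static double f(double x) { return x * x; }   /* illustrative stand-in */

    double map_reduce(const double *X, int n) {
        double *j = malloc(n * sizeof *j);        /* scalar-expanded j */
        #pragma omp parallel for
        for (int i = 0; i < n; i++)               /* doall: the map */
            j[i] = f(X[i]);
        double total = 0.0;
        for (int i = 0; i < n; i++)               /* distributed-off reduction */
            total += j[i];
        free(j);
        return total;
    }

The reduction loop could itself be reassociated into a parallel tree, which is the “reduce” half of map-reduce.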

Page 9: Simplifying Parallel Programming with Compiler Transformations


How ‘bout scan-map?

    struct { data; *next; } *p;

    doall p != NULL
      modify(p->data)
      p = p->next

becomes

    n = 0
    do while p != NULL
      a[n++] = p
      p = p->next
    doall i (0..n-1)
      modify(a[i]->data)
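In C, a sketch of the same idea (modify is an illustrative stand-in): a sequential scan gathers the node pointers into an array, after which the modifications are random-access and can run as a doall:

    #include <stdlib.h>

    struct node { double data; struct node *next; };

    static void modify(double *d) { *d *= 2.0; }   /* illustrative stand-in */

    void scan_map(struct node *head) {
        size_t n = 0;                              /* sequential scan: count */
        for (struct node *p = head; p; p = p->next)
            n++;
        struct node **a = malloc(n * sizeof *a);
        size_t k = 0;                              /* sequential scan: gather */
        for (struct node *p = head; p; p = p->next)
            a[k++] = p;
        #pragma omp parallel for
        for (long i = 0; i < (long)n; i++)         /* doall: the map */
            modify(&a[i]->data);
        free(a);
    }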

Page 10: Simplifying Parallel Programming with Compiler Transformations


Sparse matrix construction

    scan int ptr = 0
    shared float data[m]
    shared int rows[n]
    doall row (1..n)
      private j
      rows[row] = ptr
      for j in non_zeros(row)
        data[ptr] = foo(row, j)
        ptr++

[Figure: rows[] holds each row’s starting offset into data[]; ptr advances through data[] as each row’s non-zeros are appended]
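For reference, the sequential pattern in C (with my stand-ins for non_zeros and foo): this is CSR-style construction, and the running counter ptr is a scan, which is exactly what makes the doall illegal as written:

    /* Sequential CSR-style construction: rows[r] records where row r's
       entries start in data[]; the running scan ptr serializes the loop. */
    void build_csr_seq(int n, const int *nnz_in_row,
                       double (*foo)(int row, int j),
                       int *rows, double *data) {
        int ptr = 0;
        for (int row = 0; row < n; row++) {
            rows[row] = ptr;
            for (int j = 0; j < nnz_in_row[row]; j++)
                data[ptr++] = foo(row, j);
        }
    }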

Page 11: Simplifying Parallel Programming with Compiler Transformations


Partial Sum Expansion

    scan int ptr = 0
    shared float data[m]
    shared int rows[n]
    doall row (1..n)
      private j
      rows[row] = ptr
      for j in non_zeros(row)
        data[ptr] = foo(row, j)
        ptr++

Expanding the partial sum gives:

    scan int ptr[n]            # scalar-expand ptr
    shared float data[m]
    shared int rows[n]
    doall row (1..n)
      private j
      ptr[row] = 0
      rows[row] = rows[row-1] + ptr[row-1]
      for j in non_zeros(row)
        data[rows[row] + ptr[row]] = foo(row, j)
        ptr[row]++

Page 12: Simplifying Parallel Programming with Compiler Transformations


Scalar Expansion

    scan int ptr[n]
    shared float data[m]
    shared int rows[n]
    doall row (1..n)
      private j
      ptr[row] = 0
      rows[row] = rows[row-1] + ptr[row-1]
      for j in non_zeros(row)
        data[rows[row] + ptr[row]] = foo(row, j)
        ptr[row]++

Scalar-expanding the data writes into a private vector, with inner loop fission, gives:

    scan int ptr[n]
    shared float data[m]
    shared int rows[n]
    doall row (1..n)
      private j
      private vector mydata
      ptr[row] = 0
      rows[row] = rows[row-1] + ptr[row-1]
      for j in non_zeros(row)
        mydata.pushback(foo(row, j))
        ptr[row]++
      for j (rows[row], rows[row]+ptr[row])
        data[j] = mydata.popfront()

Page 13: Simplifying Parallel Programming with Compiler Transformations


Outer Loop Fission

    scan int ptr[n]
    shared float data[m]
    shared int rows[n]
    doall row (1..n)
      private j
      private vector mydata
      ptr[row] = 0
      rows[row] = rows[row-1] + ptr[row-1]
      for j in non_zeros(row)
        mydata.pushback(foo(row, j))
        ptr[row]++
      for j (rows[row], rows[row]+ptr[row])
        data[j] = mydata.popfront()

becomes

    scan int ptr[n]
    shared float data[m]
    shared int rows[n]
    doall row (1..n)
      private j
      private vector mydata
      ptr[row] = 0
      for j in non_zeros(row)
        mydata.pushback(foo(row, j))
        ptr[row]++
    do row (1..n)
      rows[row] = rows[row-1] + ptr[row-1]
    doall row (1..n)
      for j (rows[row], rows[row]+ptr[row])
        data[j] = mydata.popfront()
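Putting the whole progression together in C, as a sketch with my own stand-ins for non_zeros and foo (the per-row private vectors become malloc’d buffers): a parallel fill into private buffers, a short sequential prefix sum, and a parallel concatenation:

    #include <stdlib.h>

    static int nnz_of(int row) { return row % 4 + 1; }           /* stand-in */
    static double foo(int row, int j) { return row + 0.1 * j; }  /* stand-in */

    /* Three-phase construction. Caller supplies rows[n], ptr[n], and a
       data[] large enough for the total entry count. */
    void build_csr_par(int n, int *rows, int *ptr, double *data) {
        double **mydata = malloc(n * sizeof *mydata);

        #pragma omp parallel for          /* phase 1 (doall): private fill */
        for (int row = 0; row < n; row++) {
            ptr[row] = 0;
            mydata[row] = malloc(nnz_of(row) * sizeof **mydata);
            for (int j = 0; j < nnz_of(row); j++)
                mydata[row][ptr[row]++] = foo(row, j);
        }

        rows[0] = 0;                      /* phase 2 (sequential): prefix sum */
        for (int row = 1; row < n; row++)
            rows[row] = rows[row - 1] + ptr[row - 1];

        #pragma omp parallel for          /* phase 3 (doall): concatenate */
        for (int row = 0; row < n; row++) {
            for (int j = 0; j < ptr[row]; j++)
                data[rows[row] + j] = mydata[row][j];
            free(mydata[row]);
        }
        free(mydata);
    }

Only the O(n) prefix sum stays sequential (and could itself be parallelized); all the O(total non-zeros) work sits in the two doall phases.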

Page 14: Simplifying Parallel Programming with Compiler Transformations


Concatenation

[Figure: each row’s private buffer is concatenated into the shared data[] array at the offset given by rows[]; the prefix-sum step that produces rows[] from ptr[] is sequential, while the fill and copy steps are parallel]

Page 15: Simplifying Parallel Programming with Compiler Transformations


printf() is the same pattern

    doall i (1..n)
      private mystring = s(i)
      printf(mystring)

[Figure: the private strings are concatenated, in iteration order, into the stdout buffer]
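A minimal C sketch of buffered printf (s is my stand-in formatter): formatting into private strings runs as a doall, and a sequential pass flushes them in iteration order:

    #include <stdio.h>
    #include <stdlib.h>

    static char *s(int i) {                    /* illustrative stand-in */
        char *buf = malloc(32);
        snprintf(buf, 32, "iteration %d\n", i);
        return buf;
    }

    void buffered_printf(int n) {
        char **mystring = malloc(n * sizeof *mystring);
        #pragma omp parallel for
        for (int i = 0; i < n; i++)            /* doall: format privately */
            mystring[i] = s(i);
        for (int i = 0; i < n; i++) {          /* sequential: ordered flush */
            fputs(mystring[i], stdout);
            free(mystring[i]);
        }
        free(mystring);
    }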

Page 16: Simplifying Parallel Programming with Compiler Transformations


Sparse array updates

    doall i (1..n)
      private j
      for j in neighbors_of(i)
        private temp = foo(i, j)
        x[i] += temp
        x[j] += temp

Page 17: Simplifying Parallel Programming with Compiler Transformations


Becomes

    doall i (1..n)
      private j
      for j in neighbors_of(i)
        private temp = foo(i, j)
        continue[hash(i)][myproc].push(i, temp)
        continue[hash(j)][myproc].push(j, temp)
    doall p (1..P)
      for t (1..P)
        for (ptr, val) in continue[p][t]
          x[ptr] += val

[Figure: the P x P continuation matrix -> cell [p][t] holds the (index, value) pairs that producer thread t queued for owner thread p]
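A C sketch of the continuation-matrix idea (the neighbor iterator, foo, the hash, and the per-cell capacity bound cap are all my assumptions; real queues would grow dynamically): phase one queues each update with the cell owned by hash(index), and phase two lets each owner apply its own row without locks, since no two owners touch the same x[] entries:

    #include <stdlib.h>
    #include <omp.h>

    #define P 4                           /* thread count, for the sketch */

    struct pair { int idx; double val; };
    struct cell { struct pair *buf; int len; };

    static int owner(int idx) { return idx % P; }    /* stand-in hash */

    void sparse_update(int n, double *x, int cap,
                       int (*nbr_count)(int), int (*nbr)(int, int),
                       double (*foo)(int, int)) {
        struct cell cont[P][P];
        for (int p = 0; p < P; p++)
            for (int t = 0; t < P; t++)
                cont[p][t] = (struct cell){ malloc(cap * sizeof(struct pair)), 0 };

        #pragma omp parallel for num_threads(P)   /* phase 1: produce */
        for (int i = 0; i < n; i++) {
            int me = omp_get_thread_num();        /* only I write column me */
            for (int k = 0; k < nbr_count(i); k++) {
                int j = nbr(i, k);
                double temp = foo(i, j);
                struct cell *ci = &cont[owner(i)][me];
                ci->buf[ci->len++] = (struct pair){ i, temp };
                struct cell *cj = &cont[owner(j)][me];
                cj->buf[cj->len++] = (struct pair){ j, temp };
            }
        }

        #pragma omp parallel for num_threads(P)   /* phase 2: owners apply */
        for (int p = 0; p < P; p++)
            for (int t = 0; t < P; t++) {
                for (int k = 0; k < cont[p][t].len; k++)
                    x[cont[p][t].buf[k].idx] += cont[p][t].buf[k].val;
                free(cont[p][t].buf);
            }
    }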

Page 18: Simplifying Parallel Programming with Compiler Transformations


Graph updates

    doall i (1..n)
      newvalue = value[i]
      for pred in predecessors[i]
        newvalue = f(newvalue, value[pred])
      value[i] = newvalue

Page 19: Simplifying Parallel Programming with Compiler Transformations


Inspector Executor

The inspector computes each node’s wavefront depth:

    int wavefront[n] = {0}
    do i (1..n)
      wavefront[i] = 1 + max(wavefront[p] for p in predecessors[i])

The executor then runs each wavefront as a doall:

    do w (1..maxdepth)
      doall i suchthat wavefront[i] == w
        newvalue = value[i]
        for pred in predecessors[i]
          newvalue = f(newvalue, value[pred])
        value[i] = newvalue

Polychronopoulos ’88; Saltz ’91; Leung/Zahorjan ’93
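A C sketch of the scheme, under two assumptions of mine: the predecessor lists are stored in CSR form, and the graph is acyclic with nodes numbered so predecessors precede their successors (a real implementation would also bucket nodes by wavefront rather than rescanning):

    #include <stdlib.h>

    void graph_update(int n, const int *pred_off, const int *pred,
                      double *value, double (*f)(double, double)) {
        int *wavefront = calloc(n, sizeof *wavefront);
        int maxdepth = 0;
        for (int i = 0; i < n; i++) {              /* inspector (sequential) */
            int w = 0;                             /* max over predecessors */
            for (int k = pred_off[i]; k < pred_off[i + 1]; k++)
                if (wavefront[pred[k]] > w)
                    w = wavefront[pred[k]];
            wavefront[i] = w + 1;
            if (wavefront[i] > maxdepth)
                maxdepth = wavefront[i];
        }
        for (int w = 1; w <= maxdepth; w++) {      /* executor */
            #pragma omp parallel for               /* doall within a wavefront */
            for (int i = 0; i < n; i++) {
                if (wavefront[i] != w)
                    continue;
                double newvalue = value[i];
                for (int k = pred_off[i]; k < pred_off[i + 1]; k++)
                    newvalue = f(newvalue, value[pred[k]]);
                value[i] = newvalue;
            }
        }
        free(wavefront);
    }

Nodes in the same wavefront have no mutual dependences (each node’s predecessors sit in strictly earlier wavefronts), which is what makes the inner loop a legal doall.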

Page 20: Simplifying Parallel Programming with Compiler Transformations


Limits of what we know

    doall node in worklist
      modify graph structure

Page 21: Simplifying Parallel Programming with Compiler Transformations


What I’ve shown you

• Scalar expansion
  – Eliminates anti- and output dependences
  – Can be applied to properly scoped aggregates
• Reassociation
  – Integer reassociation is extraordinarily useful
  – Partial sums can then be used later in the loop!
• Loop distribution
  – Think of it as scheduling
• Inspector-executor
  – Works as long as the data access pattern is invariant in the loop

Page 22: Simplifying Parallel Programming with Compiler Transformations


Where next?

• Relieve tedium
  – (build the compiler, or frameworks, or …)
• Find new patterns
  – Delaunay triangulation
  – Pick an example application: there will be something new you wish could be transformed automatically
• Parallel languages beyond “doall” and “reduce”