
Optimizing Compilers for Modern Architectures

Creating Coarse-grained Parallelism for Loop Nests

Chapter 6, Sections 6.3 through 6.9


Last time…

• Single loop methods

• Privatization

• Loop distribution

• Alignment

• Loop Fusion


Loop Interchange

• Moves dependence-free loops to outermost level

• Theorem: In a perfect nest of loops, a particular loop can be parallelized at the outermost level if and only if its column in the direction matrix for that nest contains only "=" entries

• Vectorization moves loops to innermost level


Loop Interchange

DO I = 1, N
  DO J = 1, N
    A(I+1, J) = A(I, J) + B(I, J)
  ENDDO
ENDDO

• OK for vectorization

• Problematic for parallelization
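The reasoning, spelled out: the only dependence here, from A(I+1, J) to A(I, J), has direction vector (<, =). Its "<" sits in the I column, so the I loop carries the dependence and cannot run in parallel as written, while the J column contains only "=". By the theorem above, interchanging J to the outermost position makes it safe to parallelize, which is exactly what the next version of the loop does.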


Loop Interchange

PARALLEL DO J = 1, N
  DO I = 1, N
    A(I+1, J) = A(I, J) + B(I, J)
  ENDDO
END PARALLEL DO


Loop Interchange

• Working with the direction matrix:
  1. Move a loop whose column contains only "=" entries to the outermost position, parallelize it, and remove its column from the matrix
  2. Move the loop with the most "<" entries to the next outermost position, sequentialize it, and eliminate its column along with any rows representing dependences it carries
  3. Repeat from step 1

Loop Interchange

while L is not empty do begin
  while there exists a column in M containing only "=" entries do begin
    success := true;
    l := a loop whose column in M contains only "=" entries;
    remove l from L;
    parallelize l;
    remove the column corresponding to l from M;
  end
  if L is not empty then begin
    select_loop_and_interchange(L);
    l := outermost loop;
    remove l from L;
    sequentialize l;
    remove the column corresponding to l from M;
    remove all rows corresponding to dependences carried by l from M;
  end
end


Loop Selection

• Goal: generate the most parallelism with adequate granularity
  - The key is to select the proper loops to run in parallel
• Informal parallel code generation strategy:
  - While there are loops that can be run in parallel, move them to the outermost position and parallelize them
  - Then select a sequential loop, run it sequentially, and see what new parallelism has been revealed


Loop Selection

DO I = 2, N+1
  DO J = 2, M+1
    DO K = 1, L
      A(I, J, K+1) = A(I, J-1, K) + A(I-1, J, K+2) + A(I-1, J, K)
    ENDDO
  ENDDO
ENDDO

Direction matrix:
= < <
< = >
< = =


Loop Selection

DO I = 2, N+1
  DO J = 2, M+1
    PARALLEL DO K = 1, L
      A(I, J, K+1) = A(I, J-1, K) + A(I-1, J, K+2) + A(I-1, J, K)
    END PARALLEL DO
  ENDDO
ENDDO


Loop Selection

• Is it possible to derive a selection heuristic that always produces optimal code?
  - Probably not: loop selection is an NP-complete problem
• Consider the simple approach of selecting the loop with the most "<" directions, to eliminate the maximum number of rows from the direction matrix
  - Applied to the direction matrix below, that approach fails

< < = =
< = < =
< = = <
= < = =
= = < =
= = = <


Loop Selection

• Favor the selection of loops that must be sequentialized before parallelism can be uncovered.

• If there exists a loop that can legally be moved to the outermost position and there is a dependence for which that loop has the only "<" direction, sequentialize that loop

• If there are several such loops, they will all need to be sequentialized at some point in the process

Loop Selection

• To show that loop selection is NP-complete:
  - Represent each row of the direction matrix as a bit vector, with a 1 wherever the row has a "<" entry
  - Loop selection then corresponds to finding a minimal basis among the loops, with "logical or" as the combining operation
  - This is the minimum set cover problem, which is known to be NP-complete
• Loop selection is therefore best done by a heuristic

Bit-vector form of the direction matrix shown earlier:
1 1 0 0
1 0 1 0
1 0 0 1
0 1 0 0
0 0 1 0
0 0 0 1
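To make the set-cover view concrete, here is a small C sketch (added for illustration, not from the slides) that applies the greedy rule of sequentializing the loop whose bit vector covers the most still-uncovered dependence rows. On this matrix it ends up sequentializing all four loops, even though sequentializing only loops 2, 3, and 4 covers every row and leaves loop 1 parallel; this is the kind of case where the simple "most <" heuristic falls short.

#include <stdio.h>

int main(void) {
    /* Bit-vector form of the 6 x 4 direction matrix above: bit l of rows[r]
     * is set when dependence row r has a "<" entry in the column of loop l. */
    unsigned rows[6] = { 0x3, 0x5, 0x9, 0x2, 0x4, 0x8 };
    int nrows = 6, nloops = 4;
    unsigned covered = 0;                       /* rows already covered */
    unsigned all = (1u << nrows) - 1;

    while (covered != all) {
        /* Greedy choice: the loop whose "<" entries cover the most rows
         * not yet covered by a sequentialized loop. */
        int best = 0, best_count = -1;
        for (int l = 0; l < nloops; l++) {
            int count = 0;
            for (int r = 0; r < nrows; r++)
                if (!(covered & (1u << r)) && (rows[r] & (1u << l)))
                    count++;
            if (count > best_count) { best_count = count; best = l; }
        }
        printf("sequentialize loop %d (covers %d new rows)\n",
               best + 1, best_count);
        for (int r = 0; r < nrows; r++)
            if (rows[r] & (1u << best))
                covered |= (1u << r);
    }
    return 0;
}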


Loop Selection

• Example of the principles involved in heuristic loop selection

Direction matrix for the loop nest below:
= < =
< = <
= < <
< = >

DO I = 2, N
  DO J = 2, M
    DO K = 2, L
      A(I, J, K) = A(I, J-1, K) + A(I-1, J, K-1) +
                   A(I, J+1, K+1) + A(I-1, J, K+1)
    ENDDO
  ENDDO
ENDDO


Loop Selection

DO J = 2, M
  DO I = 2, N
    PARALLEL DO K = 2, L
      A(I, J, K) = A(I, J-1, K) + A(I-1, J, K-1) +
                   A(I, J+1, K+1) + A(I-1, J, K+1)
    END PARALLEL DO
  ENDDO
ENDDO


Loop Reversal

DO I = 2, N+1
  DO J = 2, M+1
    DO K = 1, L
      A(I, J, K) = A(I, J-1, K+1) + A(I-1, J, K+1)
    ENDDO
  ENDDO
ENDDO

Direction matrix:
= < >
< = >


Loop Reversal

DO K = L, 1, -1
  PARALLEL DO I = 2, N+1
    PARALLEL DO J = 2, M+1
      A(I, J, K) = A(I, J-1, K+1) + A(I-1, J, K+1)
    END PARALLEL DO
  END PARALLEL DO
ENDDO

• Loop reversal increases the range of options available to loop selection heuristics


Loop Skewing

DO I = 2, N+1
  DO J = 2, M+1
    DO K = 1, L
      A(I, J, K) = A(I, J-1, K) + A(I-1, J, K)
      B(I, J, K+1) = B(I, J, K) + A(I, J, K)
    ENDDO
  ENDDO
ENDDO

Direction matrix:
= < =
< = =
= = <
= = =


Loop Skewing

• Skewing with k = K + I + J yields:

DO I = 2, N+1
  DO J = 2, M+1
    DO k = I+J+1, I+J+L
      A(I, J, k-I-J) = A(I, J-1, k-I-J) + A(I-1, J, k-I-J)
      B(I, J, k-I-J+1) = B(I, J, k-I-J) + A(I, J, k-I-J)
    ENDDO
  ENDDO
ENDDO

Direction matrix after skewing:
= < <
< = <
= = <
= = =


Loop Skewing

DO k = 5, N+M+L+2
  PARALLEL DO I = MAX(2, k-M-L-1), MIN(N+1, k-3)
    PARALLEL DO J = MAX(2, k-I-L), MIN(M+1, k-I-1)
      A(I, J, k-I-J) = A(I, J-1, k-I-J) + A(I-1, J, k-I-J)
      B(I, J, k-I-J+1) = B(I, J, k-I-J) + A(I, J, k-I-J)
    END PARALLEL DO
  END PARALLEL DO
ENDDO
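The bounds follow from the skew k = K + I + J (a short derivation, added here for clarity): with 2 <= I <= N+1, 2 <= J <= M+1 and 1 <= K <= L, the skewed index runs from 2+2+1 = 5 up to (N+1)+(M+1)+L = N+M+L+2. For a fixed k, K = k-I-J must lie in [1, L], which bounds J between k-I-L and k-I-1 (intersected with [2, M+1]); for that J range to be nonempty, I must lie between k-M-L-1 and k-3 (intersected with [2, N+1]).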


Loop Skewing

• Skewing transforms the inner loop into one that can be interchanged to the outermost position without changing the meaning of the program
• It can be applied so that, after outward interchange, the skewed loop carries all dependences formerly carried by the loop with respect to which it is skewed


Loop Skewing

• Selection heuristics:
  1. Parallelize the outermost loop if possible
  2. Otherwise, sequentialize at most one outer loop to find parallelism in the next loop
  3. If steps 1 and 2 fail, try skewing
  4. If step 3 fails, sequentialize the loop that can be moved to the outermost position and that covers the most other loops


Unimodular Transformations

• Definition: a transformation represented by a matrix T is unimodular if
  - T is square,
  - all the elements of T are integral, and
  - the absolute value of the determinant of T is 1
• Any composition of unimodular transformations is unimodular
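As a concrete illustration (standard examples, not from the slides): for a two-deep nest with iteration vector (I, J), loop interchange, reversal of the inner loop, and skewing the inner loop by the outer one (J' = J + I) are all unimodular. Written in the same row layout as the direction matrices:

Interchange:
0 1
1 0

Inner-loop reversal:
1 0
0 -1

Skewing (J' = J + I):
1 0
1 1

Each matrix is square and integral, with determinants -1, -1, and 1, so all three satisfy the definition.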


Profitability-Based Methods

• Motivation
  - Parallel loops need sufficient granularity and low synchronization cost
  - There are many alternatives for parallel code generation
• Static performance estimation function
  - Does not need to be accurate
  - Must be good at selecting the better of two alternatives
• Key considerations
  - Cost of memory references
  - Sufficiency of granularity


Profitability-Based Methods

• Picking the best of all possible permutations and parallelizations is impractical
  - The total number of alternatives is exponential in the number of loops in a nest
  - Many of the loop upper bounds are unknown at compile time
• Instead, consider only a subset of the possible code arrangements, based on properties of the cost function


Profitability-Based Methods

• Subdivide all the references in the loop body into reference groups

• Determine whether subsequent accesses to the same reference are
  - Loop invariant: cost = 1
  - Unit stride: cost = number of iterations / cache line size
  - Non-unit stride: cost = number of iterations
• Assign each loop a cost equal to the sum of the reference costs, multiplied by the number of times the loop body would execute if that loop were innermost in the nest
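A minimal sketch of this cost model in C (illustrative only; the helper names and the hard-coded classification are assumptions, not part of the slides). With K innermost in the matrix-multiply nest shown next, C(I, J) is loop invariant, A(I, K) has non-unit stride, and B(K, J) has unit stride in column-major order, reproducing the C = 1, A = N, B = N/L costs quoted there.

#include <stdio.h>

enum access { INVARIANT, UNIT_STRIDE, NONUNIT_STRIDE };

/* Cost of one reference for a candidate innermost loop:
 * loop invariant = 1, unit stride = trips / line size, non-unit stride = trips. */
static double reference_cost(enum access a, double trips, double line) {
    switch (a) {
    case INVARIANT:   return 1.0;
    case UNIT_STRIDE: return trips / line;
    default:          return trips;            /* NONUNIT_STRIDE */
    }
}

/* Loop cost: sum of reference costs times the number of times the loop
 * body executes if this loop is placed innermost in the nest. */
static double loop_cost(const enum access *refs, int n,
                        double trips, double outer_iters, double line) {
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += reference_cost(refs[i], trips, line);
    return sum * outer_iters;
}

int main(void) {
    double N = 512.0, L = 8.0;                 /* trip count and cache line size */
    enum access k_innermost[] = { INVARIANT, NONUNIT_STRIDE, UNIT_STRIDE };
    /* (1 + N + N/L) * N^2  ==  N^3 (1 + 1/L) + N^2 */
    printf("cost with K innermost: %.0f\n",
           loop_cost(k_innermost, 3, N, N * N, L));
    return 0;
}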


Profitability-Based Methods

DO I = 1, N
  DO J = 1, N
    DO K = 1, N
      C(I, J) = C(I, J) + A(I, K) * B(K, J)
    ENDDO
  ENDDO
ENDDO

• Reference costs with K innermost: C = 1, A = N, B = N/L (L = cache line size)
  - Total cost with K innermost: N³(1 + 1/L) + N²
  - Total cost with J innermost: 2N³ + N²
  - Total cost with I innermost: 2N³/L + N²
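The other two totals follow the same reasoning (worked out here for clarity): with J innermost, C(I, J) and B(K, J) both sweep across a row, a non-unit stride in column-major storage, costing N each, while A(I, K) is invariant, giving N²(N + N + 1) = 2N³ + N². With I innermost, C(I, J) and A(I, K) are unit stride (N/L each) and B(K, J) is invariant, giving N²(2N/L + 1) = 2N³/L + N². So ordering by increasing cost, as the next bullet says, places I innermost, then K, then J outermost.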

• Order the loops, from innermost to outermost, by increasing loop cost

• Can’t always have desired loop order


Multilevel Loop Fusion

• Commonly used for imperfect loop nests

• Used after maximal loop distribution


Multilevel Loop Fusion

• Decision making needs look-ahead

• Heuristic: Fuse with the loop that cannot be fused with one of its successors
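One illustrative scenario (an added example, not from the slides): suppose loop nest A can legally fuse with either of its successors B and C, where B cannot fuse with any of its own successors but C can. Fusing A with C first may leave B stranded, while fusing A with B loses nothing, since C's remaining fusion opportunities are still available later; the look-ahead heuristic therefore prefers B.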


Parallel Code Generation

procedure Parallelize(l, Dl);
  ParallelizeNest(l, success);
  if not success then begin
    if l can be distributed then begin
      distribute l into loop nests l1, l2, …, ln;
      for i := 1 to n do begin
        Parallelize(li, Di);
      end
      Merge({l1, l2, …, ln});
    end
    else begin /* l cannot be distributed */
      for each outer loop l0 nested in l do begin
        let D0 be the set of dependences between statements in l0, less the dependences carried by l;
        Parallelize(l0, D0);
      end
      let S be the set of outer loops and statements left in l;
      if ||S|| > 1 then Merge(S);
    end
  end
end Parallelize


Erlebacher

DO J = 1, JMAXD
  DO I = 1, IMAXD
    F(I, J, 1) = F(I, J, 1) * B(1)
  ENDDO
ENDDO
DO K = 2, N-1
  DO J = 1, JMAXD
    DO I = 1, IMAXD
      F(I, J, K) = (F(I, J, K) - A(K) * F(I, J, K-1)) * B(K)
    ENDDO
  ENDDO
ENDDO
DO J = 1, JMAXD
  DO I = 1, IMAXD
    TOT(I, J) = 0.0
  ENDDO
ENDDO
DO J = 1, JMAXD
  DO I = 1, IMAXD
    TOT(I, J) = TOT(I, J) + D(1) * F(I, J, 1)
  ENDDO
ENDDO
DO K = 2, N-1
  DO J = 1, JMAXD
    DO I = 1, IMAXD
      TOT(I, J) = TOT(I, J) + D(K) * F(I, J, K)
    ENDDO
  ENDDO
ENDDO


Erlebacher

PARALLEL DO J = 1, JMAXD
  DO I = 1, IMAXD
    F(I, J, 1) = F(I, J, 1) * B(1)
  ENDDO
  DO K = 2, N-1
    DO I = 1, IMAXD
      F(I, J, K) = (F(I, J, K) - A(K) * F(I, J, K-1)) * B(K)
    ENDDO
  ENDDO
  DO I = 1, IMAXD
    TOT(I, J) = 0.0
  ENDDO
  DO I = 1, IMAXD
    TOT(I, J) = TOT(I, J) + D(1) * F(I, J, 1)
  ENDDO
  DO K = 2, N-1
    DO I = 1, IMAXD
      TOT(I, J) = TOT(I, J) + D(K) * F(I, J, K)
    ENDDO
  ENDDO
END PARALLEL DO


Erlebacher

PARALLEL DO J = 1, JMAXD
  DO I = 1, IMAXD
    F(I, J, 1) = F(I, J, 1) * B(1)
    TOT(I, J) = 0.0
    TOT(I, J) = TOT(I, J) + D(1) * F(I, J, 1)
  ENDDO
  DO K = 2, N-1
    DO I = 1, IMAXD
      F(I, J, K) = (F(I, J, K) - A(K) * F(I, J, K-1)) * B(K)
      TOT(I, J) = TOT(I, J) + D(K) * F(I, J, K)
    ENDDO
  ENDDO
END PARALLEL DO


Strip Mining

• Converts available parallelism into a form more suitable for the hardware

DO I = 1, N
  A(I) = A(I) + B(I)
ENDDO

k = CEIL(N / P)
PARALLEL DO I = 1, N, k
  DO i = I, MIN(I + k - 1, N)
    A(i) = A(i) + B(i)
  ENDDO
END PARALLEL DO


Strip Mining

DO I = 1, N
  DO J = 2, I
    A(I, J) = A(I, J-1) + B(I)
  ENDDO
ENDDO

• Choose a smaller strip size to allow a more balanced load distribution
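A quick calculation (added here, not on the slide) shows why: iteration I of the outer loop does I-1 units of inner-loop work, roughly N²/2 in total. Splitting the I range into P contiguous strips of size N/P gives the last strip about (2P-1)/P times the average work, close to double for large P, whereas many small strips interleaved across processors spread the triangular workload far more evenly.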


Pipeline Parallelism

• Fortran command DOACROSS

• Useful where fully parallel loops are not available

• High synchronization costs

DO I = 2, N-1
  DO J = 2, N-1
    A(I, J) = .25 * (A(I-1, J) + A(I, J-1) + A(I+1, J) + A(I, J+1))
  ENDDO
ENDDO


Pipeline Parallelism

DOACROSS I = 2, N-1
  POST(EV(1))
  DO J = 2, N-1
    WAIT(EV(J-1))
    A(I, J) = .25 * (A(I-1, J) + A(I, J-1) + A(I+1, J) + A(I, J+1))
    POST(EV(J))
  ENDDO
ENDDO


Pipeline Parallelism

DOACROSS I = 2, N-1
  POST(EV(1))
  K = 0
  DO J = 2, N-1, 2
    K = K + 1
    WAIT(EV(K))
    DO j = J, MIN(J+1, N-1)
      A(I, j) = .25 * (A(I-1, j) + A(I, j-1) + A(I+1, j) + A(I, j+1))
    ENDDO
    POST(EV(K+1))
  ENDDO
ENDDO


Scheduling Parallel Work

• Parallel execution is slower than serial execution if the overhead of starting, scheduling, and synchronizing the parallel work exceeds the time saved by running iterations in parallel
• Bakery-counter scheduling
  - Each processor repeatedly takes the next iteration (or small block of iterations) from a shared counter
  - High synchronization overhead: one counter update per dispatch
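A minimal sketch of bakery-counter scheduling using C11 atomics and POSIX threads (an illustration added here, not from the slides): every iteration of the original loop costs one atomic update of the shared counter, which is the synchronization overhead the bullet refers to.

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define N 1000
#define NTHREADS 4

static atomic_int counter = 1;      /* next loop iteration to hand out */
static double a[N + 1], b[N + 1];

/* Each worker repeatedly takes a "ticket" (the next iteration) from the
 * shared counter: one atomic add per iteration of the original loop. */
static void *worker(void *arg) {
    (void)arg;
    for (;;) {
        int i = atomic_fetch_add(&counter, 1);
        if (i > N)
            break;
        a[i] = a[i] + b[i];         /* body of the parallel loop */
    }
    return NULL;
}

int main(void) {
    pthread_t t[NTHREADS];
    for (int i = 1; i <= N; i++) { a[i] = i; b[i] = 2 * i; }
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(&t[i], NULL);
    printf("a[N] = %g\n", a[N]);    /* expect 3 * N */
    return 0;
}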


Guided Self-Scheduling

• Minimize synchronization overhead
  - Schedules groups of iterations, unlike the bakery-counter method
  - Chunks go from large to small as the loop nears completion
• Keep all processors busy at all times
• The number of iterations dispensed at each scheduling step is ceil(R/p), where R is the number of remaining iterations and p is the number of processors
• Alternatively, GSS(k) guarantees that all blocks handed out are of size k or greater
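A small C sketch of how GSS hands out chunks (added for illustration; the ceil(R/p) rule is the one stated above, and the k parameter models the GSS(k) variant):

#include <stdio.h>

/* Guided self-scheduling: each dispatch hands out ceil(R/p) of the R
 * remaining iterations; GSS(k) additionally enforces a minimum chunk of k. */
static int next_chunk(int remaining, int p, int k) {
    int chunk = (remaining + p - 1) / p;       /* ceil(remaining / p) */
    if (chunk < k)
        chunk = k;
    if (chunk > remaining)
        chunk = remaining;
    return chunk;
}

int main(void) {
    int remaining = 100, p = 4;
    /* GSS(1) on 100 iterations, 4 processors:
       25 19 14 11 8 6 5 3 3 2 1 1 1 1 */
    while (remaining > 0) {
        int c = next_chunk(remaining, p, 1);
        printf("%d ", c);
        remaining -= c;
    }
    printf("\n");
    return 0;
}

For 100 iterations on 4 processors this prints 25 19 14 11 8 6 5 3 3 2 1 1 1 1, fourteen dispatches instead of the one hundred a bakery counter would need.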


Guided Self-Scheduling

• GSS(1)
