
Optimizing Compilers for Modern Architectures

Creating Coarse-grained Parallelism for Loop Nests

Chapter 6, Sections 6.3 through 6.9


Last time…

• Single loop methods

• Privatization

• Loop distribution

• Alignment

• Loop Fusion


Loop Interchange

• Moves dependence-free loops to outermost level

• Theorem: In a perfect nest of loops, a particular loop can be parallelized at the outermost level if and only if its column in the direction matrix for that nest contains only "=" entries

• Vectorization moves loops to innermost level


Loop Interchange

DO I = 1, N
  DO J = 1, N
    A(I+1, J) = A(I, J) + B(I, J)
  ENDDO
ENDDO

• OK for vectorization

• Problematic for parallelization
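The reasoning, spelled out: the only dependence here, from A(I+1, J) to A(I, J), has direction vector (<, =). Its "<" sits in the I column, so the I loop carries the dependence and cannot run in parallel as written, while the J column contains only "=". By the theorem above, interchanging J to the outermost position makes it safe to parallelize, which is exactly what the next version of the loop does.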


Loop Interchange

PARALLEL DO J = 1, N
  DO I = 1, N
    A(I+1, J) = A(I, J) + B(I, J)
  ENDDO
END PARALLEL DO


Loop Interchange

• Working with the direction matrix:
  1. Move a loop whose column contains only "=" entries to the outermost position, parallelize it, and remove its column from the matrix
  2. Move the loop with the most "<" entries to the next outermost position, sequentialize it, and eliminate its column along with any rows representing dependences it carries
  3. Repeat from step 1

Loop Interchange

while L is not empty do begin
  while there exists a column in M containing only "=" entries do begin
    success := true;
    l := a loop whose column in M contains only "=" entries;
    remove l from L;
    parallelize l;
    remove the column corresponding to l from M;
  end
  if L is not empty then begin
    select_loop_and_interchange(L);
    l := outermost loop;
    remove l from L;
    sequentialize l;
    remove the column corresponding to l from M;
    remove all rows corresponding to dependences carried by l from M;
  end
end


Loop Selection

• Goal: generate the most parallelism with adequate granularity
  - The key is to select the proper loops to run in parallel
• Informal parallel code generation strategy:
  - While there are loops that can be run in parallel, move them to the outermost position and parallelize them
  - Then select a sequential loop, run it sequentially, and see what new parallelism has been revealed


Loop Selection

DO I = 2, N+1
  DO J = 2, M+1
    DO K = 1, L
      A(I, J, K+1) = A(I, J-1, K) + A(I-1, J, K+2) + A(I-1, J, K)
    ENDDO
  ENDDO
ENDDO

Direction matrix:
= < <
< = >
< = =


Loop Selection

DO I = 2, N+1
  DO J = 2, M+1
    PARALLEL DO K = 1, L
      A(I, J, K+1) = A(I, J-1, K) + A(I-1, J, K+2) + A(I-1, J, K)
    END PARALLEL DO
  ENDDO
ENDDO


Loop Selection

• Is it possible to derive a selection heuristic that always produces optimal code?
  - Probably not: loop selection is an NP-complete problem
• Consider the simple approach of selecting the loop with the most "<" directions, to eliminate the maximum number of rows from the direction matrix
  - Applied to the direction matrix below, that approach fails

< < = =
< = < =
< = = <
= < = =
= = < =
= = = <


Loop Selection

• Favor the selection of loops that must be sequentialized before parallelism can be uncovered.

• If there exists a loop that can legally be moved to the outermost position and there is a dependence for which that loop has the only "<" direction, sequentialize that loop

• If there are several such loops, they will all need to be sequentialized at some point in the process

Loop Selection

• To show that loop selection is NP-complete:
  - Represent each row of the direction matrix as a bit vector, with a 1 wherever the row has a "<" entry
  - Loop selection then corresponds to finding a minimal basis among the loops, with "logical or" as the combining operation
  - This is the minimum set cover problem, which is known to be NP-complete
• Loop selection is therefore best done by a heuristic

Bit-vector form of the direction matrix shown earlier:
1 1 0 0
1 0 1 0
1 0 0 1
0 1 0 0
0 0 1 0
0 0 0 1
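To make the set-cover view concrete, here is a small C sketch (added for illustration, not from the slides) that applies the greedy rule of sequentializing the loop whose bit vector covers the most still-uncovered dependence rows. On this matrix it ends up sequentializing all four loops, even though sequentializing only loops 2, 3, and 4 covers every row and leaves loop 1 parallel; this is the kind of case where the simple "most <" heuristic falls short.

#include <stdio.h>

int main(void) {
    /* Bit-vector form of the 6 x 4 direction matrix above: bit l of rows[r]
     * is set when dependence row r has a "<" entry in the column of loop l. */
    unsigned rows[6] = { 0x3, 0x5, 0x9, 0x2, 0x4, 0x8 };
    int nrows = 6, nloops = 4;
    unsigned covered = 0;                       /* rows already covered */
    unsigned all = (1u << nrows) - 1;

    while (covered != all) {
        /* Greedy choice: the loop whose "<" entries cover the most rows
         * not yet covered by a sequentialized loop. */
        int best = 0, best_count = -1;
        for (int l = 0; l < nloops; l++) {
            int count = 0;
            for (int r = 0; r < nrows; r++)
                if (!(covered & (1u << r)) && (rows[r] & (1u << l)))
                    count++;
            if (count > best_count) { best_count = count; best = l; }
        }
        printf("sequentialize loop %d (covers %d new rows)\n",
               best + 1, best_count);
        for (int r = 0; r < nrows; r++)
            if (rows[r] & (1u << best))
                covered |= (1u << r);
    }
    return 0;
}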


Loop Selection

• Example of the principles involved in heuristic loop selection

Direction matrix for the loop nest below:
= < =
< = <
= < <
< = >

DO I = 2, N
  DO J = 2, M
    DO K = 2, L
      A(I, J, K) = A(I, J-1, K) + A(I-1, J, K-1) +
                   A(I, J+1, K+1) + A(I-1, J, K+1)
    ENDDO
  ENDDO
ENDDO


Loop Selection

DO J = 2, M
  DO I = 2, N
    PARALLEL DO K = 2, L
      A(I, J, K) = A(I, J-1, K) + A(I-1, J, K-1) +
                   A(I, J+1, K+1) + A(I-1, J, K+1)
    END PARALLEL DO
  ENDDO
ENDDO


Loop Reversal

DO I = 2, N+1
  DO J = 2, M+1
    DO K = 1, L
      A(I, J, K) = A(I, J-1, K+1) + A(I-1, J, K+1)
    ENDDO
  ENDDO
ENDDO

Direction matrix:
= < >
< = >


Loop Reversal

DO K = L, 1, -1
  PARALLEL DO I = 2, N+1
    PARALLEL DO J = 2, M+1
      A(I, J, K) = A(I, J-1, K+1) + A(I-1, J, K+1)
    END PARALLEL DO
  END PARALLEL DO
ENDDO

• Loop reversal increases the range of options available to loop selection heuristics


Loop Skewing

DO I = 2, N+1
  DO J = 2, M+1
    DO K = 1, L
      A(I, J, K) = A(I, J-1, K) + A(I-1, J, K)
      B(I, J, K+1) = B(I, J, K) + A(I, J, K)
    ENDDO
  ENDDO
ENDDO

Direction matrix:
= < =
< = =
= = <
= = =


Loop Skewing

• Skewing with k = K + I + J yields:

DO I = 2, N+1
  DO J = 2, M+1
    DO k = I+J+1, I+J+L
      A(I, J, k-I-J) = A(I, J-1, k-I-J) + A(I-1, J, k-I-J)
      B(I, J, k-I-J+1) = B(I, J, k-I-J) + A(I, J, k-I-J)
    ENDDO
  ENDDO
ENDDO

Direction matrix after skewing:
= < <
< = <
= = <
= = =


Loop Skewing

DO k = 5, N+M+L+2
  PARALLEL DO I = MAX(2, k-M-L-1), MIN(N+1, k-3)
    PARALLEL DO J = MAX(2, k-I-L), MIN(M+1, k-I-1)
      A(I, J, k-I-J) = A(I, J-1, k-I-J) + A(I-1, J, k-I-J)
      B(I, J, k-I-J+1) = B(I, J, k-I-J) + A(I, J, k-I-J)
    END PARALLEL DO
  END PARALLEL DO
ENDDO
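The bounds follow from the skew k = K + I + J (a short derivation, added here for clarity): with 2 <= I <= N+1, 2 <= J <= M+1 and 1 <= K <= L, the skewed index runs from 2+2+1 = 5 up to (N+1)+(M+1)+L = N+M+L+2. For a fixed k, K = k-I-J must lie in [1, L], which bounds J between k-I-L and k-I-1 (intersected with [2, M+1]); for that J range to be nonempty, I must lie between k-M-L-1 and k-3 (intersected with [2, N+1]).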


Loop Skewing

• Skewing transforms the inner loop into one that can be interchanged to the outermost position without changing the meaning of the program
• It can be applied so that, after outward interchange, the skewed loop carries all dependences formerly carried by the loop with respect to which it is skewed


Loop Skewing

• Selection heuristics:
  1. Parallelize the outermost loop if possible
  2. Otherwise, sequentialize at most one outer loop to find parallelism in the next loop
  3. If steps 1 and 2 fail, try skewing
  4. If step 3 fails, sequentialize the loop that can be moved to the outermost position and that covers the most other loops


Unimodular Transformations

• Definition: a transformation represented by a matrix T is unimodular if
  - T is square,
  - all the elements of T are integral, and
  - the absolute value of the determinant of T is 1
• Any composition of unimodular transformations is unimodular
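As a concrete illustration (standard examples, not from the slides): for a two-deep nest with iteration vector (I, J), loop interchange, reversal of the inner loop, and skewing the inner loop by the outer one (J' = J + I) are all unimodular. Written in the same row layout as the direction matrices:

Interchange:
0 1
1 0

Inner-loop reversal:
1 0
0 -1

Skewing (J' = J + I):
1 0
1 1

Each matrix is square and integral, with determinants -1, -1, and 1, so all three satisfy the definition.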


Profitability-Based Methods

• Motivation
  - Parallel loops need sufficient granularity and low synchronization cost
  - There are many alternatives for parallel code generation
• Static performance estimation function
  - Does not need to be accurate
  - Must be good at selecting the better of two alternatives
• Key considerations
  - Cost of memory references
  - Sufficiency of granularity


Profitability-Based Methods

• Picking the best of all possible permutations and parallelizations is impractical
  - The total number of alternatives is exponential in the number of loops in a nest
  - Many of the loop upper bounds are unknown at compile time
• Instead, consider only a subset of the possible code arrangements, based on properties of the cost function


Profitability-Based Methods

• Subdivide all the references in the loop body into reference groups

• Determine whether subsequent accesses to the same reference are
  - Loop invariant: cost = 1
  - Unit stride: cost = number of iterations / cache line size
  - Non-unit stride: cost = number of iterations
• Assign each loop a cost equal to the sum of the reference costs, multiplied by the number of times the loop body would execute if that loop were innermost in the nest
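A minimal sketch of this cost model in C (illustrative only; the helper names and the hard-coded classification are assumptions, not part of the slides). With K innermost in the matrix-multiply nest shown next, C(I, J) is loop invariant, A(I, K) has non-unit stride, and B(K, J) has unit stride in column-major order, reproducing the C = 1, A = N, B = N/L costs quoted there.

#include <stdio.h>

enum access { INVARIANT, UNIT_STRIDE, NONUNIT_STRIDE };

/* Cost of one reference for a candidate innermost loop:
 * loop invariant = 1, unit stride = trips / line size, non-unit stride = trips. */
static double reference_cost(enum access a, double trips, double line) {
    switch (a) {
    case INVARIANT:   return 1.0;
    case UNIT_STRIDE: return trips / line;
    default:          return trips;            /* NONUNIT_STRIDE */
    }
}

/* Loop cost: sum of reference costs times the number of times the loop
 * body executes if this loop is placed innermost in the nest. */
static double loop_cost(const enum access *refs, int n,
                        double trips, double outer_iters, double line) {
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += reference_cost(refs[i], trips, line);
    return sum * outer_iters;
}

int main(void) {
    double N = 512.0, L = 8.0;                 /* trip count and cache line size */
    enum access k_innermost[] = { INVARIANT, NONUNIT_STRIDE, UNIT_STRIDE };
    /* (1 + N + N/L) * N^2  ==  N^3 (1 + 1/L) + N^2 */
    printf("cost with K innermost: %.0f\n",
           loop_cost(k_innermost, 3, N, N * N, L));
    return 0;
}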


Profitability-Based Methods

DO I = 1, N
  DO J = 1, N
    DO K = 1, N
      C(I, J) = C(I, J) + A(I, K) * B(K, J)
    ENDDO
  ENDDO
ENDDO

• Reference costs with K innermost: C = 1, A = N, B = N/L (L = cache line size)
  - Total cost with K innermost: N³(1 + 1/L) + N²
  - Total cost with J innermost: 2N³ + N²
  - Total cost with I innermost: 2N³/L + N²
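The other two totals follow the same reasoning (worked out here for clarity): with J innermost, C(I, J) and B(K, J) both sweep across a row, a non-unit stride in column-major storage, costing N each, while A(I, K) is invariant, giving N²(N + N + 1) = 2N³ + N². With I innermost, C(I, J) and A(I, K) are unit stride (N/L each) and B(K, J) is invariant, giving N²(2N/L + 1) = 2N³/L + N². So ordering by increasing cost, as the next bullet says, places I innermost, then K, then J outermost.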

• Order the loops, from innermost to outermost, by increasing loop cost

• Can’t always have desired loop order


Multilevel Loop Fusion

• Commonly used for imperfect loop nests

• Used after maximal loop distribution


Multilevel Loop Fusion

• Decision making needs look-ahead

• Heuristic: Fuse with the loop that cannot be fused with one of its successors
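One illustrative scenario (an added example, not from the slides): suppose loop nest A can legally fuse with either of its successors B and C, where B cannot fuse with any of its own successors but C can. Fusing A with C first may leave B stranded, while fusing A with B loses nothing, since C's remaining fusion opportunities are still available later; the look-ahead heuristic therefore prefers B.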


Parallel Code Generation

procedure Parallelize(l, Dl);
  ParallelizeNest(l, success);
  if not success then begin
    if l can be distributed then begin
      distribute l into loop nests l1, l2, …, ln;
      for i := 1 to n do begin
        Parallelize(li, Di);
      end
      Merge({l1, l2, …, ln});
    end
    else begin /* l cannot be distributed */
      for each outer loop l0 nested in l do begin
        let D0 be the set of dependences between statements in l0, less the dependences carried by l;
        Parallelize(l0, D0);
      end
      let S be the set of outer loops and statements left in l;
      if ||S|| > 1 then Merge(S);
    end
  end
end Parallelize


Erlebacher

DO J = 1, JMAXD
  DO I = 1, IMAXD
    F(I, J, 1) = F(I, J, 1) * B(1)
  ENDDO
ENDDO
DO K = 2, N-1
  DO J = 1, JMAXD
    DO I = 1, IMAXD
      F(I, J, K) = (F(I, J, K) - A(K) * F(I, J, K-1)) * B(K)
    ENDDO
  ENDDO
ENDDO
DO J = 1, JMAXD
  DO I = 1, IMAXD
    TOT(I, J) = 0.0
  ENDDO
ENDDO
DO J = 1, JMAXD
  DO I = 1, IMAXD
    TOT(I, J) = TOT(I, J) + D(1) * F(I, J, 1)
  ENDDO
ENDDO
DO K = 2, N-1
  DO J = 1, JMAXD
    DO I = 1, IMAXD
      TOT(I, J) = TOT(I, J) + D(K) * F(I, J, K)
    ENDDO
  ENDDO
ENDDO


Erlebacher

PARALLEL DO J = 1, JMAXD
  DO I = 1, IMAXD
    F(I, J, 1) = F(I, J, 1) * B(1)
  ENDDO
  DO K = 2, N-1
    DO I = 1, IMAXD
      F(I, J, K) = (F(I, J, K) - A(K) * F(I, J, K-1)) * B(K)
    ENDDO
  ENDDO
  DO I = 1, IMAXD
    TOT(I, J) = 0.0
  ENDDO
  DO I = 1, IMAXD
    TOT(I, J) = TOT(I, J) + D(1) * F(I, J, 1)
  ENDDO
  DO K = 2, N-1
    DO I = 1, IMAXD
      TOT(I, J) = TOT(I, J) + D(K) * F(I, J, K)
    ENDDO
  ENDDO
END PARALLEL DO


Erlebacher

PARALLEL DO J = 1, JMAXD
  DO I = 1, IMAXD
    F(I, J, 1) = F(I, J, 1) * B(1)
    TOT(I, J) = 0.0
    TOT(I, J) = TOT(I, J) + D(1) * F(I, J, 1)
  ENDDO
  DO K = 2, N-1
    DO I = 1, IMAXD
      F(I, J, K) = (F(I, J, K) - A(K) * F(I, J, K-1)) * B(K)
      TOT(I, J) = TOT(I, J) + D(K) * F(I, J, K)
    ENDDO
  ENDDO
END PARALLEL DO


Strip Mining

• Converts available parallelism into a form more suitable for the hardware

DO I = 1, N
  A(I) = A(I) + B(I)
ENDDO

k = CEIL(N / P)
PARALLEL DO I = 1, N, k
  DO i = I, MIN(I + k - 1, N)
    A(i) = A(i) + B(i)
  ENDDO
END PARALLEL DO


Strip Mining

DO I = 1, N
  DO J = 2, I
    A(I, J) = A(I, J-1) + B(I)
  ENDDO
ENDDO

• Choose a smaller strip size to allow a more balanced load distribution
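A quick calculation (added here, not on the slide) shows why: iteration I of the outer loop does I-1 units of inner-loop work, roughly N²/2 in total. Splitting the I range into P contiguous strips of size N/P gives the last strip about (2P-1)/P times the average work, close to double for large P, whereas many small strips interleaved across processors spread the triangular workload far more evenly.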


Pipeline Parallelism

• Fortran command DOACROSS

• Useful where fully parallel loops are not available

• High synchronization costs

DO I = 2, N-1
  DO J = 2, N-1
    A(I, J) = .25 * (A(I-1, J) + A(I, J-1) + A(I+1, J) + A(I, J+1))
  ENDDO
ENDDO


Pipeline Parallelism

DOACROSS I = 2, N-1
  POST(EV(1))
  DO J = 2, N-1
    WAIT(EV(J-1))
    A(I, J) = .25 * (A(I-1, J) + A(I, J-1) + A(I+1, J) + A(I, J+1))
    POST(EV(J))
  ENDDO
ENDDO


Pipeline Parallelism

DOACROSS I = 2, N-1
  POST(EV(1))
  K = 0
  DO J = 2, N-1, 2
    K = K + 1
    WAIT(EV(K))
    DO j = J, MIN(J+1, N-1)
      A(I, j) = .25 * (A(I-1, j) + A(I, j-1) + A(I+1, j) + A(I, j+1))
    ENDDO
    POST(EV(K+1))
  ENDDO
ENDDO


Scheduling Parallel Work

• Parallel execution is slower than serial execution if the overhead of starting, scheduling, and synchronizing the parallel work exceeds the time saved by running iterations in parallel
• Bakery-counter scheduling
  - Each processor repeatedly takes the next iteration (or small block of iterations) from a shared counter
  - High synchronization overhead: one counter update per dispatch
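A minimal sketch of bakery-counter scheduling using C11 atomics and POSIX threads (an illustration added here, not from the slides): every iteration of the original loop costs one atomic update of the shared counter, which is the synchronization overhead the bullet refers to.

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define N 1000
#define NTHREADS 4

static atomic_int counter = 1;      /* next loop iteration to hand out */
static double a[N + 1], b[N + 1];

/* Each worker repeatedly takes a "ticket" (the next iteration) from the
 * shared counter: one atomic add per iteration of the original loop. */
static void *worker(void *arg) {
    (void)arg;
    for (;;) {
        int i = atomic_fetch_add(&counter, 1);
        if (i > N)
            break;
        a[i] = a[i] + b[i];         /* body of the parallel loop */
    }
    return NULL;
}

int main(void) {
    pthread_t t[NTHREADS];
    for (int i = 1; i <= N; i++) { a[i] = i; b[i] = 2 * i; }
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(&t[i], NULL);
    printf("a[N] = %g\n", a[N]);    /* expect 3 * N */
    return 0;
}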


Guided Self-Scheduling

• Minimize synchronization overhead
  - Schedules groups of iterations, unlike the bakery-counter method
  - Chunks go from large to small as the loop nears completion
• Keep all processors busy at all times
• The number of iterations dispensed at each scheduling step is ceil(R/p), where R is the number of remaining iterations and p is the number of processors
• Alternatively, GSS(k) guarantees that all blocks handed out are of size k or greater
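A small C sketch of how GSS hands out chunks (added for illustration; the ceil(R/p) rule is the one stated above, and the k parameter models the GSS(k) variant):

#include <stdio.h>

/* Guided self-scheduling: each dispatch hands out ceil(R/p) of the R
 * remaining iterations; GSS(k) additionally enforces a minimum chunk of k. */
static int next_chunk(int remaining, int p, int k) {
    int chunk = (remaining + p - 1) / p;       /* ceil(remaining / p) */
    if (chunk < k)
        chunk = k;
    if (chunk > remaining)
        chunk = remaining;
    return chunk;
}

int main(void) {
    int remaining = 100, p = 4;
    /* GSS(1) on 100 iterations, 4 processors:
       25 19 14 11 8 6 5 3 3 2 1 1 1 1 */
    while (remaining > 0) {
        int c = next_chunk(remaining, p, 1);
        printf("%d ", c);
        remaining -= c;
    }
    printf("\n");
    return 0;
}

For 100 iterations on 4 processors this prints 25 19 14 11 8 6 5 3 3 2 1 1 1 1, fourteen dispatches instead of the one hundred a bakery counter would need.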


Guided Self-Scheduling

• GSS(1)
