creating coarse-grained parallelism for loop nests
DESCRIPTION
Creating Coarse-grained Parallelism for Loop Nests. Chapter 6, Sections 6.3 through 6.9. Yaniv Carmeli. Last time …. Single loop methods Privatization Loop distribution Alignment Loop Fusion. This time …. Perfect Loop Nests Loop Interchange Loop Selection Loop Reversal Loop Skewing - PowerPoint PPT PresentationTRANSCRIPT
Creating Coarse-Creating Coarse-grained Parallelism grained Parallelism for Loop Nestsfor Loop Nests
Chapter 6, Sections 6.3 Chapter 6, Sections 6.3 through 6.9through 6.9Yaniv CarmeliYaniv Carmeli
Single loop methods
Privatization
Loop distribution
Alignment
Loop Fusion
Last time…
Perfect Loop Nests Loop Interchange Loop Selection Loop Reversal Loop Skewing Profitibility-Based Methods
This time…
This time… Imperfectly Nested Loops
Multilevel Loop Fusion Parallel Code Generation
Packaging Parallelism Strip Mining Pipeline Parallelism Guided Self Scheduling
Vectorization: BadParallelization: Good
Loop Interchange
A(I+1, J) = A(I, J) + B(I, J) ENDDO
ENDDO
Vectorization: OKParallelization: Problematic
DO J = 1, MDO I = 1, NPARALLEL DO J = 1 , M DO I = 1 , N A(I+1, J) = A(I, J) + B(I, J) ENDDOEND PARALLEL DO D = ( < , = )
DO I = 1, N
DO J = 1, M A(I+1, J+1) = A(I, J) + B(I, J) ENDDO
ENDDO
DO I = 1, N
PARALLEL DO J = 1, M A(I+1, J+1) = A(I, J) + B(I, J) END PARALLEL DO
ENDDO
Loop Interchange (Cont.)
Loop Interchange doesn’t work, as both loops carry dependence!!
Best we can do
D = ( < , < )
When can a loop be moved to the outermost position in the nest, and be guaranteed to be parallel?
Loop Interchange (Cont.) Theorem: In a perfect nest of loops, a particular
loop can be parallelized at the outermost level if and only if the column of the direction matrix for that nest contains only ‘=‘ entries.
Proof. If. A column with only “=“ entries represents a loop that can be interchanged, and carries no dependence.
Only If. There is a non “=“ entry in that column: If it is “>” – Can’t interchange loops (dependence
will be reversed) If it is “<“ – Can interchange, but can’t shake the
dependece (Will not allow parallelization anyway...)
Loop Interchange (Cont.) Working with direction matrix
1. Move loops with all “=“ entries into outermost position and parallelize it. Remove the column from the matrix
2. Move loops with most “<“ entries into next outermost position and sequentialize it, eliminate the column and any rows representing carried dependences
3. Repeat step 1
DO I = 1, NDO J = 1, M
DO K = 1, LA(I+1, J,K) = A(I, J,K) + X1B(I, J,K+1) = B(I, J,K) + X2C(I+1, J+1,K+1) = C(I, J,K) + X3
ENDDOENDDO
ENDO
DO I = 1, NPARALLEL DO J = 1, M
DO K = 1, LA(I+1, J,K) = A(I, J,K) + X1B(I, J,K+1) = B(I, J,K) + X2C(I+1, J+1,K+1) = C(I, J,K) + X3
ENDDOEND PARALLEL DO
ENDO
Loop Interchange (Cont.) Example:
< = == = << < <
< < = =< = < =< = = <= < = == = < == = = <
Loop Selection – Optimal? Is the approach of selecting the loop with
the most ‘ < ‘ directions optimal? Will result in NO
parallelization for this matrix
While other selections may allow parallelization
< < = =< = < =< = = <= < = == = < == = = <
Is it possible to derive a selection heuristic that provides optimal code?
The problem of loop selection is NP-complete Loop selection is best done by a heuristic!
Loop Selection
< < = =< = < =< = = <= < = == = < == = = <
Favor the selection of loops that must be sequentialized before parallelism can be uncovered.
Heuristic Loop Selection (Cont.) Example of principals involved
in heuristic loop selectionDO I = 2, N
DO J = 2, MDO K = 2, L
A(I, J, K) = A(I, J-1, K) + A(I-1, J, K-1) + A(I, J+1, K+1) + A(I-1, J, K+1)
ENDDO ENDDO
ENDDO
The I-loop must be sequentialized because of the fourth dependence
The J-loop must be sequentialized because of the first dependence
DO J = 2, M DO I = 2, N PARALLEL DO K = 2, L A(I, J, K) = A(I, J-1, K) + A(I-1, J, K-1) + A(I, J+1, K+1) + A(I-1, J, K+1) END PARALLEL DO ENDDO
ENDDO
= < =< = <= < << = >
Loop Reversal
DO I = 2, N+1 DO J = 2, M+1 DO K = 1, L A(I, J, K) = A(I, J-1, K+1) + A(I-1, J, K+1) ENDDO ENDDOENDDO
Using loop reversal to create coarse-grained parallelism. Consider:
DO I = 2, N+1 DO J = 2, M+1 DO K = L, 1, -1 A(I, J, K) = A(I, J-1, K+1) + A(I-1, J, K+1) ENDDO ENDDOENDDO
DO K = L, 1, -1 DO I = 2, N+1
DO J = 2, M+1 A(I, J, K) = A(I, J-1, K+1) + A(I-1, J, K+1) ENDDO ENDDOENDDO
DO K = L, 1, -1 PARALLEL DO I = 2, N+1
PARALLEL DO J = 2, M+1 A(I, J, K) = A(I, J-1, K+1) + A(I-1, J, K+1) END PARALLEL DO END PARALLEL DOENDDO
= < >< = >
= < << = <
Loop Skewing
DO I = 2, N+1 DO J = 2, M+1 DO K = 1, L A(I, J, K) = A(I, J-1, K) + A(I-1, J, K) B(I, J, K+1) = B(I, J, K) + A(I, J, K) ENDDO ENDDOENDDO
Skewed using k = K + I + J yield:DO I = 2, N+1 DO J = 2, M+1 DO k = I+J+1, I+J+L A(I, J, k-I-J) = A(I, J-1, k-I-J) + A(I-1, J, k-I-J) B(I, J, k-I-J+1) = B(I, J, k-I-J) + A(I, J, k-I-J) ENDDO ENDDOENDDO
= < =< = == = <= = =
0 1 01 0 00 0 10 0 0
= < << = <= = <= = =
Loop Skewing - Main Benefits
Eliminate “>” signs in the matrix
Transforms skewed loops in such a way, that after outward interchange, it will carry all dependences formerly carried by the loop with respect to which it is skewed
Loop Skewing - Drawback
The resulting parallelism is usually unbalanced. (The resulting loop executes a variable amount of iterations each time). As we shall see – It’s not really a problem for
asynchronous parallelism (unlike vectorization).
Loop Skewing (Cont.) Updated strategy
1. Parallelize outermost loop if possible2. Sequentializes at most one outer loop to find parallelism in the next loop3. If 1 and 2 fail, try skewing4. If 3 fails, sequentialize the loop that can be moved to the outermost
position and cover the most other loops
In Practice – Sometimes we get much worse
execution times, than we would have gotten parallelizing less\different loops.
Profitability-Based Methods
Static performance estimation function No need to be accurate, just good at
selecting the better of two alternatives
Key considerations Cost of memory references Sufficiency of granularity
Profitability-Based Methods (Cont.) Impractical to choose from all
arrangements
Consider only subset of the possible code arrangements, based on properties of the cost function In our case: consider only the inner-most loop
Profitability-Based Methods (Cont.)
1. Subdivide all the references in the loop body into reference groups
Two references are in the same group if: There is a loop independent dependence between
them. There is a constant-distance loop carried
dependence between them.
A possible cost evaluation heuristics:
Profitability-Based Methods (Cont.)
2. Determine whether subsequent accesses to the same reference are
Loop invariant Cost = 1
Unit stride Cost = number of iterations / cache line size
Non-unit stride Cost = number of iterations
A possible cost evaluation heuristics:
Profitability-Based Methods (Cont.)
3. Compute loop cost:
A possible cost evaluation heuristics:
#loop _ cost reference cost
aggregate
of iterations
Profitability-Based Methods: ExampleDO I = 1, N
DO J = 1, N DO K = 1, N C(I, J) = C(I, J) + A(I, K) * B(K, J) ENDDO ENDDOENDDO
Profitability-Based Methods: ExampleDO I = 1, N
DO J = 1, N DO K = 1, N C(I, J) = C(I, J) + A(I, K) * B(K, J) ENDDO ENDDOENDDO
2N3/L+N21N/LN/LI
2N3+N2N1NJ
N3(1+1/L)+N2N/LN1K
COSTBACInner-most loop
Worst
Best
Profitability-Based Methods: Example Reorder loop from innermost to outermost by increasing loop cost: I,K,J
Can’t always have desired loop order(as some permutations are illegal) - Try to find the possible permutation closest to the desired one.
DO J = 1, N DO K = 1, N DO I = 1, N C(I, J) = C(I, J) + A(I, K) * B(K, J) ENDDO ENDDOENDDO
Profitability-Based Methods (Cont.) Goal: Given a desired loop order and a direction matrix for a loop nest - find the legal permutation closest to the desired one.
Method: Until there are no more loops:
Choose from all the loops that can be interchanged to the outermost position, the one that is outermost in the desired permutation. Drop that loop. It can be shown that if a legal permutation with the desired innermost loop in the innermost position exists – this algorithm will find such a permutation.
Profitability-Based Methods (Cont.)DO J = 1, N DO K = 1, N DO I = 1, N C(I, J) = C(I, J) + A(I, K) * B(K, J) ENDDO ENDDOENDDO
For performance reasons – the compiler may mark the inner loop as “not meant for parallelization” (sequential performance utilizes locality in memory accesses).
Multilevel Loop Fusion
Commonly used for imperfect loop nests
Used after maximal loop distribution
Multilevel Loop Fusion
DO I = 1, N DO J = 1, M A(I, J+1) = A(I, J) + C
B(I+1, J) = B(I, J) + D ENDDOENDDO
DO I = 1, N DO J = 1, M A(I, J+1) = A(I, J) + C
ENDDOENDDO
DO I = 1, N DO J = 1, M
B(I+1, J) = B(I, J) + D ENDDOENDDO
PARALLEL DO I = 1, N DO J = 1, M A(I, J+1) = A(I, J) + C
ENDDOEND PARALLEL DO
PARALLEL DO J = 1, M DO I = 1, N
B(I+1, J) = B(I, J) + D ENDDOEND PARALLEL DO
After distribution each nest is better with a different outer loop – Can’t fuse!
Multilevel Loop Fusion (Cont.)
DO I = 1, N DO J = 1, M A(I, J) = A(I, J) + X
B(I+1, J) = A(I, J) + B(I,J)C(I, J+1) = A(I, J) + B(I,J)D(I+1, J) = B(I+1, J) + C(I,J) + D(I,J)
ENDDOENDDO
DO I = 1, N DO J = 1, M A(I, J) = A(I, J) + X
ENDDOENDDODO I = 1, N DO J = 1, M
B(I+1, J) = A(I, J) + B(I,J) ENDDOENDDODO I = 1, N DO J = 1, M
C(I, J+1) = A(I, J) + C(I,J) ENDDOENDDODO I = 1, N DO J = 1, M
D(I+1, J) = B(I+1, J) + C(I,J) + D(I,J) ENDDOENDDO
DO I = 1, N DO J = 1, M A(I, J) = A(I, J) + X
ENDDOENDDODO I = 1, N DO J = 1, M
B(I+1, J) = A(I, J) + B(I,J) ENDDOENDDODO I = 1, N DO J = 1, M
C(I, J+1) = A(I, J) + C(I,J) ENDDOENDDODO I = 1, N DO J = 1, M
D(I+1, J) = B(I+1, J) + C(I,J) + D(I,J) ENDDOENDDO
Which loop should be fused into the A loop?
i,jA
jB iC
jD
Multilevel Loop Fusion (Cont.)PARALLEL DO J = 1, M
DO I = 1, N A(I, J) = A(I, J) + X
B(I+1, J) = A(I, J) + B(I,J)ENDDO
ENDDO
PARALLEL DO I = 1, N DO J = 1, M
C(I, J+1) = A(I, J) + C(I,J) ENDDOENDDO
PARALLEL DO J = 1, M DO I = 1, N
D(I+1, J) = B(I+1, J) + C(I,J) + D(I,J) ENDDOENDDO
jAB
iC
jD
Fusing A loop with B loop
2 barriers
Multilevel Loop Fusion (Cont.)PARALLEL DO I = 1, N
DO J = 1, M A(I, J) = A(I, J) + X
C(I, J+1) = A(I, J) + C(I,J)ENDDO
ENDDO
PARALLEL DO J = 1, M DO I = 1, N
B(I+1, J) = A(I, J) + B(I,J)ENDDO
ENDDO
PARALLEL DO J = 1, M DO I = 1, N
D(I+1, J) = B(I+1, J) + C(I,J) + D(I,J) ENDDOENDDO
iAC
jB
jD
PARALLEL DO I = 1, N DO J = 1, M A(I, J) = A(I, J) + X
C(I, J+1) = A(I, J) + C(I,J)ENDDO
ENDDO
PARALLEL DO J = 1, M DO I = 1, N
B(I+1, J) = A(I, J) + B(I,J)D(I+1, J) = B(I+1, J) + C(I,J) + D(I,J)
ENDDO
ENDDO
Now we can also fuse B-D
iAC
jBD
Fusing A loop with C loop
1 barrier
Multilevel Loop Fusion (Cont.) Decision making needs look-ahead Strategy: Fuse with the loop that cannot
be fused with one of its successorsRationale: If it can’t be fused with its successors – a barrier will be formed anyway.
i,jA
jB iC
jD A barrier is inevitable!!
Parallel Code Generation
Parallelize(l,D)1. Try methods for perfect nests (loop
interchange, loop skewing, loop reversal), and stop if parallelism is found.
2. If nest can be distributed: distribute, run recursively on the distributed nests, and merge.
3. Else sequentialize outer loop, eliminate the dependences it carries, and try recursively on each of the loops nested in it.
Code generation scheme:
Parallel Code Generationprocedure Parallelize(l, Dl);
ParallelizeNest(l, success); //(try methods for perfect nests..)if ¬success then begin
if l can be distributed then begindistribute l into loop nests l1, l2, …,
ln;for i:=1 to n do begin
Parallelize(li, Di);endMerge({l1, l2, …, ln});
end
Parallel Code Generation (Cont.)else begin // if l cannot be distributed then
for each outer loop l0 nested in l do beginlet D0 be the set of dependences between statements in l0 less dependences carried by l;
Parallelize(l0,D0);endlet S - the set of outer loops and statements loops left in l;If ||S||>1 then Merge(S);endend
end Parallelize
Parallel Code Generation (Cont.)
DO J = 1, M DO I = 1, N A(I+1, J+1) = A(I+1, J) + C
X(I, J) = A(I, J) + C ENDDOENDDO
DO J = 1, M DO I = 1, N A(I+1, J+1) = A(I+1, J) + C
ENDDOEND DO
DO J = 1, M DO I = 1, N
X(I, J) = A(I, J) + C ENDDOEND DO
Both loops carry dependence – loop interchange will not find sufficient parallelism.
Try distribution…PARALLEL DO I = 1, N DO J = 1, M A(I+1, J+1) = A(I+1, J) + C
ENDDOEND PARALLEL DO
DO J = 1, M DO I = 1, N
X(I, J) = A(I, J) + C ENDDOEND DO
I loop can be parallelizedPARALLEL DO I = 1, N DO J = 1, M A(I+1, J+1) = A(I+1, J) + C
ENDDOEND PARALLEL DO
PARALLEL DO J = 1, M
DO I = 1, N !Left sequential for memory hierarchy
X(I, J) = A(I, J) + C ENDDOEND PARALLEL DO
Both loops can be parallelized
Now fusing…
Type: (I-loop, parallel)
Type: (J-loop, parallel)
Different types – can’t fuse
Parallel Code Generation (Cont.)
DO I = 1, N DO J = 1, M A(I, J) = A(I, J) + X
B(I+1, J) = A(I, J) + B(I,J)C(I, J+1) = A(I, J) + C(I,J)D(I+1, J) = B(I+1, J) + C(I,J) + D(I,J)
ENDDOENDDO
PARALLEL DO J = 1, M DO I = 1, N !Sequentialized for memory hierarchy A(I, J) = A(I, J) + X
ENDDOENDPARALLEL DOPARALLEL DO J = 1, M DO I = 1, N
B(I+1, J) = A(I, J) + B(I,J) ENDDOEND PARALLEL DOPARALLEL DO I = 1, N DO J = 1, M
C(I, J+1) = A(I, J) + C(I,J) ENDDOEND PARALLEL DOPARALLEL DO J = 1, M DO J = 1, N
D(I+1, J) = B(I+1, J) + C(I,J) + D(I,J) ENDDOEND PARLLEL DO
PARALLEL DO J = 1, M DO I = 1, N !Sequentialized for memory hierarchy A(I, J) = A(I, J) + X
ENDDO DO I = 1, N
B(I+1, J) = A(I, J) + B(I,J) ENDDOEND PARALLEL DO
PARALLEL DO I = 1, N DO J = 1, M
C(I, J+1) = A(I, J) + C(I,J) ENDDOEND PARALLEL DOPARALLEL DO J = 1, M DO J = 1, N
D(I+1, J) = B(I+1, J) + C(I,J) + D(I,J) ENDDOEND PARLLEL DO
PARALLEL DO J = 1, M DO I = 1, N !Sequentialized for memory hierarchy A(I, J) = A(I, J) + X
B(I+1, J) = A(I, J) + B(I,J) ENDDOEND PARALLEL DO
PARALLEL DO I = 1, N DO J = 1, M
C(I, J+1) = A(I, J) + C(I,J) ENDDOEND PARALLEL DOPARALLEL DO J = 1, M DO J = 1, N
D(I+1, J) = B(I+1, J) + C(I,J) + D(I,J) ENDDOEND PARLLEL DO
A
B C
D
I loop, parallel
J loop, parallel
J loop, parallel
J loop, parallel
A
C
D
DO J = 1, JMAXDO I = 1, IMAXD
F(I, J, 1) = F(I, J, 1) * B(1)
DO K = 2, N-1DO J = 1, JMAXD
DO I = 1, IMAXDF(I, J, K) = (F(I, J, K) – A(K) * F(I, J, K-1)) * B(K)
DO J = 1, JMAXDDO I = 1, IMAXD
TOT(I, J) = 0.0
DO J = 1, JMAXDDO I = 1, IMAXD
TOT(I, J) = TOT(I, J) + D(1) * F(I, J, 1)
DO K = 2, N-1DO J = 1, JMAXD
DO I = 1, IMAXDTOT(I, J) = TOT(I, J) + D(K) * F(I, J, K)
ErlebacherPARALLEL DO J = 1, JMAX
DO I = 1, IMAXDF(I, J, 1) = F(I, J, 1) * B(1)
DO K = 2, N-1PARALLEL DO J = 1, JMAXD
DO I = 1, IMAXDF(I, J, K) = (F(I, J, K) – A(K) * F(I, J, K-1)) * B(K)
PARALLEL DO J = 1, JMAXDDO I = 1, IMAXD
TOT(I, J) = 0.0
PARALLEL DO J = 1, JMAXDDO I = 1, IMAXD
TOT(I, J) = TOT(I, J) + D(1) * F(I, J, 1)
DO K = 2, N-1PARALLEL DO J = 1, JMAXD
DO I = 1, IMAXDTOT(I, J) = TOT(I, J) + D(K) * F(I, J, K)
ErlebacherPARALLEL DO J= 1, MAXD
L1 : DO I = 1, IMAXD F(I, J, 1) = F(I, J, 1) * B(1)
L2: DO K = 2, N – 1 DO I = 1, IMAXD
F(I, J, K) = ( F(I, J, K) – A(K) * F(I, J, K-1)) * B(K)
L3: DO I = 1, IMAXD TOT(I, J) = 0.0
L4: DO I = 1, IMAXD TOT(I, J) = TOT(I, J) + D(1) * F(I, J, 1)
L5: DO K = 2, N-1 DO I = 1, IMAXD
TOT(I, J) = TOT(I, J) + D(K) * F(I, J, K)
END PARALLEL DO
L1
L4
L2
L3
L5
ErlebacherPARALLEL DO J = 1, JMAXD DO I = 1, IMAXD F(I, J, 1) = F(I, J, 1) * B(1) TOT(I, J) = 0.0 TOT(I, J) = TOT(I, J) + D(1) * F(I, J, 1) ENDDO
DO K = 2, N-1 DO I = 1, IMAXD F(I, J, K) = ( F(I, J, K) – A(K) * F(I, J, K-1)) * B(K) TOT(I, J) = TOT(I, J) + D(K) * F(I, J, K) ENDDO ENDDOEND PARALLEL DO
Packaging of Parallelism Trade off between parallelism and
granularity of synchronization. Larger granularity work-units means
synchronization needs to be done less frequently, but at a cost of less parallelism, and poorer load balance.
Strip Mining Converts available parallelism into a form
more suitable for the hardware
DO I = 1, NA(I) = A(I) + B(I)
ENDDO
Interruptions may be disastrous
k = CEIL (N / P)PARALLEL DO I = 1, N, k
DO i = I, MIN(I + k-1, N)A(i) = A(i) + B(i)
ENDDOEND PARALLEL DO
The value of P is unknown until runtime, so strip mining is often handled by special hardware (Convex C2 and C3)
Strip Mining (Cont.) What if the execution time varies among
iteraions?
PARALLEL DO I = 1, NDO J = 2, I
A(J, I) = A(J-1, I) * 2.0ENDDO
END PARALLEL DO
Solution: smaller unit size to allow more balanced distribution
8
N 7
8
N3
8
N 5
8
N
Pipeline Parallelism Fortran command DOACROSS – pipelines
parallel loop iterations with cross-iteration synchronization.
Useful where parallelization is not available High synchronization costs
DOACROSS I = 2, NS1: A(I) = B(I) + C(I)
POST(EV(I)) IF (I>2) WAIT (EV(I-1))
S2: C(I) = A(I-1) + A(I)ENDDO
Scheduling Parallel Work
Load balance
LittleSychro.
Scheduling Parallel Work Parallel execution is slower than serial execution if
Bakery-counter scheduling Moderate synchronization overhead
N- number of iterationsB- time of one iterationp- number of processorsσ0- constant overhead per processor
0
NB
p
Guided Self-Scheduling Incorporates some level of static
scheduling to guide dynamic self-scheduling Schedules groups of iterations Going from large to small chunks of work
Iterations dispensed at time t follows:1
tt t t t
Nx N N x
p
Guided Self-Scheduling (Cont.) GSS: (20 iteration, 4 processors)
Not completely balanced
Required synchronization: 9In bakery counter: 20
6 45 4
Guided Self-Scheduling (Cont.) In the example, last 4 allocation are for a single iteration. Coincidence? Last p-1 iterations will always be of 1 iteration.
GSS(2): No block of iterations smaller than 2
GSS(k): No block is smaller than k
1
2tt t t t
Nx N N x
p
p
Yaniv
Carmeli
B.A. in CS
Thanks for you attention!