
Page 1: Optimizing Compilers for Modern Architectures
Dependence: Theory and Practice
Allen and Kennedy, Chapter 2, pp. 49-70

Page 2: Loop-carried and Loop-independent Dependences

• If statement S2 depends on statement S1 inside a loop, the dependence can occur in one of two ways:

1. S1 and S2 execute on different iterations—This is called a loop-carried dependence.

2. S1 and S2 execute on the same iteration—This is called a loop-independent dependence.

Page 3: Loop-carried Dependence

• Definition 2.11

• Statement S2 has a loop-carried dependence on statement S1 if and only if S1 references location M on iteration i, S2 references M on iteration j and d(i,j) > 0 (that is, D(i,j) contains a “<” as leftmost non “=” component).

Example:

DO I = 1, N
S1   A(I+1) = F(I)
S2   F(I+1) = A(I)
ENDDO
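To make the definition concrete, the sketch below (plain Python, a made-up value of N, only the A references shown) enumerates the locations touched by S1 and S2 and reports the iteration pairs (i, j) with j > i that touch the same location; every pair it finds is a loop-carried dependence of distance 1.

N = 6
# S1 writes A(I+1) on iteration I; S2 reads A(I) on iteration I.
s1_writes = {i: ("A", i + 1) for i in range(1, N + 1)}
s2_reads = {j: ("A", j) for j in range(1, N + 1)}
# Loop-carried dependence: same location, sink iteration strictly later.
carried = [(i, j) for i in s1_writes for j in s2_reads
           if j > i and s1_writes[i] == s2_reads[j]]
print(carried)   # [(1, 2), (2, 3), ...]: S1 to S2, carried with distance 1

The references to F give a symmetric loop-carried dependence from S2 to S1 in the same way.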

Page 4: Loop-carried dependence

• The level of a loop-carried dependence is the index of the leftmost non-"=" component of D(i,j) for the dependence.

For instance:

DO I = 1, 10
  DO J = 1, 10
    DO K = 1, 10
S1      A(I, J, K+1) = A(I, J, K)
    ENDDO
  ENDDO
ENDDO

• Direction vector for S1 is (=, =, <)

• Level of the dependence is 3

• A level-k dependence between S1 and S2 is denoted by S1 δk S2
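The level is easy to read off a direction vector. A minimal Python sketch (the function name is mine, not the book's):

def dependence_level(direction_vector):
    # Index of the leftmost non-"=" component; None means loop-independent.
    for level, d in enumerate(direction_vector, start=1):
        if d != "=":
            return level
    return None

print(dependence_level(("=", "=", "<")))   # 3, as in the example above
print(dependence_level(("=", "=", "=")))   # None: loop-independent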

Page 5: Loop-carried Transformations

• Theorem 2.4 Any reordering transformation that does not alter the relative order of any loops in the nest and preserves the iteration order of the level-k loop preserves all level-k dependences.

• Proof: D(i, j) has a "<" in the kth position and "=" in positions 1 through k-1, so the source and sink of the dependence are in the same iteration of loops 1 through k-1. A reordering of the iterations of those loops therefore cannot change the sense of the dependence.

• As a result of the theorem, powerful transformations can be applied

Page 6: Loop-carried Transformations

Example:

DO I = 1, 10
S1   A(I+1) = F(I)
S2   F(I+1) = A(I)
ENDDO

can be transformed to:

DO I = 1, 10
S2   F(I+1) = A(I)
S1   A(I+1) = F(I)
ENDDO
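Since both dependences in this example are carried by the I loop (and there is no loop-independent dependence between S1 and S2), swapping the two statements inside the body cannot change the result. A quick Python check with made-up initial values:

N = 10
def run(swapped):
    # A and F play the roles of the Fortran arrays; index 0 is unused.
    A = [float(i) for i in range(N + 2)]
    F = [float(2 * i) for i in range(N + 2)]
    for I in range(1, N + 1):
        if not swapped:
            A[I + 1] = F[I]          # S1
            F[I + 1] = A[I]          # S2
        else:
            F[I + 1] = A[I]          # S2 first
            A[I + 1] = F[I]          # S1 second
    return A, F

assert run(False) == run(True)       # both orderings give identical results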

Page 7: Loop-independent Dependences

• Definition 2.14. Statement S2 has a loop-independent dependence on statement S1 if and only if there exist two iteration vectors i and j such that:

1) Statement S1 refers to memory location M on iteration i, S2 refers to M on iteration j, and i = j.

2) There is a control flow path from S1 to S2 within the iteration.

Example:

DO I = 1, 10
S1   A(I) = ...
S2   ... = A(I)
ENDDO

Page 8: Loop-independent Dependences

More complicated example:

DO I = 1, 9
S1   A(I) = ...
S2   ... = A(10-I)
ENDDO
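A quick enumeration (plain Python, the bounds from the loop above) shows where the two references meet within one iteration: A(I) and A(10-I) name the same element only when I = 10 - I, i.e. at I = 5, which is the loop-independent dependence; matches between different iterations are loop-carried.

writes = {I: I for I in range(1, 10)}        # S1 writes A(I)
reads = {I: 10 - I for I in range(1, 10)}    # S2 reads A(10-I)
print([I for I in writes if writes[I] == reads[I]])   # [5]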

• No common loop is necessary. For instance:

DO I = 1, 10
S1   A(I) = ...
ENDDO
DO I = 1, 10
S2   ... = A(20-I)
ENDDO

Page 9: Loop-independent Dependences

• Theorem 2.5. If there is a loop-independent dependence from S1 to S2, any reordering transformation that does not move statement instances between iterations and preserves the relative order of S1 and S2 in the loop body preserves that dependence.

• A loop-independent dependence of S2 on S1 is denoted by S1 δ∞ S2

• Note that the direction vector has entries that are all "=" for loop-independent dependences

Page 10: Loop-carried and Loop-independent Dependences

• Loop-independent and loop-carried dependences partition all possible data dependences!

• Note that if S1 δ S2, then S1 executes before S2. This can happen only if:

1. The difference vector for the dependence is lexicographically greater than 0, or

2. The difference vector equals 0 and S1 occurs before S2 textually

...precisely the criteria for loop-carried and loop-independent dependences.

Page 11: Simple Dependence Testing

• Theorem 2.7: Let α and β be iteration vectors within the iteration space of the following loop nest:

DO i1 = L1, U1, S1
  DO i2 = L2, U2, S2
    ...
    DO in = Ln, Un, Sn
S1      A(f1(i1,...,in),...,fm(i1,...,in)) = ...
S2      ... = A(g1(i1,...,in),...,gm(i1,...,in))
    ENDDO
    ...
  ENDDO
ENDDO

Page 12: Simple Dependence Testing

DO i1 = L1, U1, S1
  DO i2 = L2, U2, S2
    ...
    DO in = Ln, Un, Sn
S1      A(f1(i1,...,in),...,fm(i1,...,in)) = ...
S2      ... = A(g1(i1,...,in),...,gm(i1,...,in))
    ENDDO
    ...
  ENDDO
ENDDO

• A dependence exists from S1 to S2 if and only if there exist iteration vectors α and β such that (1) α is lexicographically less than or equal to β and (2) the following system of dependence equations is satisfied:

fi(α) = gi(β) for all i, 1 ≤ i ≤ m

• Direct application of Loop Dependence Theorem
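For very small iteration spaces the theorem can be applied literally: enumerate pairs of iteration vectors α lexicographically ≤ β and test the dependence equations. The Python sketch below does exactly that; the bounds and subscript functions are made-up illustrations, not taken from the slides.

from itertools import product

def witnesses(bounds, f, g):
    # bounds: [(L1, U1), ...]; f, g: subscript functions at S1 and S2.
    iters = list(product(*[range(lo, hi + 1) for lo, hi in bounds]))
    return [(a, b) for a in iters for b in iters
            if a <= b and f(*a) == g(*b)]   # tuple <= is lexicographic order

# A(I+1, J) = ...   versus   ... = A(I, J)   on a 3x3 iteration space
print(witnesses([(1, 3), (1, 3)],
                f=lambda i, j: (i + 1, j),
                g=lambda i, j: (i, j)))
# Non-empty, e.g. ((1, 1), (2, 1)): a dependence with distance vector (1, 0)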

Page 13: Simple Dependence Testing: Delta Notation

• The Δ (delta) notation represents index values at the source and sink of a dependence

Example:

DO I = 1, N
S    A(I + 1) = A(I) + B
ENDDO

• Iteration at source denoted by: I0

• Iteration at sink denoted by: I0 + ΔI

• Forming an equality gives us: I0 + 1 = I0 + ΔI

• Solving this gives us: ΔI = 1, a loop-carried dependence with distance vector (1) and direction vector (<)

Page 14: Simple Dependence Testing: Delta Notation

Example:

DO I = 1, 100

DO J = 1, 100

DO K = 1, 100

A(I+1,J,K) = A(I,J,K+1) + B

ENDDO

ENDDO

ENDDO

• I0 + 1 = I0 + ΔI;  J0 = J0 + ΔJ;  K0 = K0 + ΔK + 1

• Solutions: ΔI = 1; ΔJ = 0; ΔK = -1

• Corresponding direction vector: (<, =, >)
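When every subscript has the form index + constant, each dependence equation i0 + c_write = (i0 + Δi) + c_read collapses to Δi = c_write - c_read, so the distance and direction vectors can be read directly off the constant offsets. A small Python sketch for the example above (the offset tuples are just the constants in the two references):

write_offsets = (1, 0, 0)   # A(I+1, J,   K  ) on the left-hand side
read_offsets = (0, 0, 1)    # A(I,   J,   K+1) on the right-hand side
deltas = tuple(w - r for w, r in zip(write_offsets, read_offsets))
directions = tuple("<" if d > 0 else ">" if d < 0 else "=" for d in deltas)
print(deltas, directions)   # (1, 0, -1) ('<', '=', '>')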

Page 15: Simple Dependence Testing: Delta Notation

• If a loop index does not appear, its distance is unconstrained and its direction is “*”

Example:

DO I = 1, 100

DO J = 1, 100

A(I+1) = A(I) + B(J)

ENDDO

ENDDO

• The direction vector for the dependence is (<, *)

Page 16: Simple Dependence Testing: Delta Notation

• "*" denotes the union of all three directions {<, =, >}

Example:

DO J = 1, 100

DO I = 1, 100

A(I+1) = A(I) + B(J)

ENDDO

ENDDO

• (*, <) denotes { (<, <), (=, <), (>, <) }

• Note: (>, <) denotes a level 1 antidependence with direction vector (<, >)
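Expanding a "*" entry is just a cartesian product over the three concrete directions; a tiny Python sketch (the function name is mine):

from itertools import product
def expand(dv):
    return set(product(*[("<", "=", ">") if d == "*" else (d,) for d in dv]))
print(expand(("*", "<")))   # {('<', '<'), ('=', '<'), ('>', '<')}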

Page 17: Parallelization and Vectorization

• Theorem 2.8. It is valid to convert a sequential loop to a parallel loop if the loop carries no dependence.

• Want to convert loops like:

DO I = 1, N
   X(I) = X(I) + C
ENDDO

• to X(1:N) = X(1:N) + C (Fortran 77 to Fortran 90)

• However:

DO I = 1, N
   X(I+1) = X(I) + C
ENDDO

is not equivalent to X(2:N+1) = X(1:N) + C
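The difference is easy to see by running both semantics on the same made-up data in Python: the sequential loop feeds each freshly computed X(I) into the next iteration, while the array statement reads all of X(1:N) before any element is stored.

N, C = 5, 1.0
X = [0.0] + [float(i * i) for i in range(1, N + 2)]   # X(1)..X(N+1); X[0] unused

seq = X[:]
for I in range(1, N + 1):       # DO I = 1, N / X(I+1) = X(I) + C / ENDDO
    seq[I + 1] = seq[I] + C

vec = X[:]
vec[2:N + 2] = [x + C for x in vec[1:N + 1]]   # X(2:N+1) = X(1:N) + C

print(seq[2:N + 2])   # [2.0, 3.0, 4.0, 5.0, 6.0]: the recurrence propagates
print(vec[2:N + 2])   # [2.0, 5.0, 10.0, 17.0, 26.0]: old values of X are used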

Page 18: Loop Distribution

• Can statements in loops that carry dependences be vectorized?

DO I = 1, N
S1   A(I+1) = B(I) + C
S2   D(I) = A(I) + E
ENDDO

• Dependence: S1 δ1 S2. The loop can be converted to:

S1   A(2:N+1) = B(1:N) + C
S2   D(1:N) = A(1:N) + E

Page 19: Loop Distribution

DO I = 1, N

S1 A(I+1) = B(I) + C

S2 D(I) = A(I) + E

ENDDO

• transformed to:

DO I = 1, N
S1   A(I+1) = B(I) + C
ENDDO
DO I = 1, N
S2   D(I) = A(I) + E
ENDDO

• leads to:

S1   A(2:N+1) = B(1:N) + C
S2   D(1:N) = A(1:N) + E
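Because the only dependence, S1 δ1 S2, flows from the first loop to the second after distribution, the transformation is safe. A quick Python check with made-up inputs compares the fused loop against the distributed, vectorized form:

N, C, E = 8, 3.0, 2.0
B = [float(i) for i in range(N + 2)]              # B(1:N); index 0 unused

A1, D1 = [0.0] * (N + 2), [0.0] * (N + 1)
for I in range(1, N + 1):                         # original fused loop
    A1[I + 1] = B[I] + C                          # S1
    D1[I] = A1[I] + E                             # S2

A2, D2 = [0.0] * (N + 2), [0.0] * (N + 1)
A2[2:N + 2] = [B[I] + C for I in range(1, N + 1)]      # A(2:N+1) = B(1:N) + C
D2[1:N + 1] = [A2[I] + E for I in range(1, N + 1)]     # D(1:N) = A(1:N) + E

assert A1 == A2 and D1 == D2                      # same results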

Page 20: Loop Distribution

• Loop distribution fails if there is a cycle of dependences

DO I = 1, N

S1 A(I+1) = B(I) + C

S2 B(I+1) = A(I) + E

ENDDO

S1 δ1 S2 and S2 δ1 S1

• What about:

DO I = 1, N

S1 B(I) = A(I) + E

S2 A(I+1) = B(I) + C

ENDDO

Page 21: Simple Vectorization Algorithm

procedure vectorize(L, D)
  // L is the maximal loop nest containing the statement.
  // D is the dependence graph for statements in L.
  find the set {S1, S2, ..., Sm} of maximal strongly-connected regions in the
    dependence graph D restricted to L (Tarjan);
  construct Lp from L by reducing each Si to a single node and compute Dp,
    the dependence graph naturally induced on Lp by D;
  let {p1, p2, ..., pm} be the m nodes of Lp numbered in an order consistent
    with Dp (use topological sort);
  for i = 1 to m do begin
    if pi is a dependence cycle then
      generate a DO-loop around the statements in pi;
    else
      directly rewrite pi in Fortran 90, vectorizing it with respect to every
        loop containing it;
  end
end vectorize
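The driver can be sketched compactly in Python; here the strongly-connected regions and the topological ordering are delegated to the networkx package, and the dependence graphs fed in at the end are my restatement of the two-statement loops from the loop-distribution slides, not the output of a real dependence analyzer.

import networkx as nx

def vectorize_sketch(statements, dep_edges):
    D = nx.DiGraph()
    D.add_nodes_from(statements)
    D.add_edges_from(dep_edges)
    C = nx.condensation(D)              # strongly-connected regions as nodes
    for n in nx.topological_sort(C):    # order consistent with dependences
        pi = sorted(C.nodes[n]["members"])
        cyclic = len(pi) > 1 or any(D.has_edge(s, s) for s in pi)
        if cyclic:
            print("sequential DO-loop around", pi)
        else:
            print("vector statement for", pi)

vectorize_sketch(["S1", "S2"], [("S1", "S2")])                 # distributable loop
vectorize_sketch(["S1", "S2"], [("S1", "S2"), ("S2", "S1")])   # dependence cycle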

Page 22: Problems With Simple Vectorization

DO I = 1, N
  DO J = 1, M
S1    A(I+1,J) = A(I,J) + B
  ENDDO
ENDDO

• Dependence from S1 to itself with d(i, j) = (1,0)

• Key observation: since the dependence is at level 1, we can manipulate the other loop!

• Can be converted to:

DO I = 1, N

S1 A(I+1,1:M) = A(I,1:M) + B

ENDDO

• The simple algorithm does not capitalize on such opportunities

Page 23: Advanced Vectorization Algorithm

procedure codegen(R, k, D)
  // R is the region for which we must generate code.
  // k is the minimum nesting level of possible parallel loops.
  // D is the dependence graph among statements in R.
  find the set {S1, S2, ..., Sm} of maximal strongly-connected regions in the
    dependence graph D restricted to R;
  construct Rp from R by reducing each Si to a single node and compute Dp,
    the dependence graph naturally induced on Rp by D;
  let {p1, p2, ..., pm} be the m nodes of Rp numbered in an order consistent
    with Dp (use topological sort to do the numbering);
  for i = 1 to m do begin
    if pi is cyclic then begin
      generate a level-k DO statement;
      let Di be the dependence graph consisting of all dependence edges in D
        that are at level k+1 or greater and are internal to pi;
      codegen(pi, k+1, Di);
      generate the level-k ENDDO statement;
    end
    else
      generate a vector statement for pi in r(pi)-k+1 dimensions, where r(pi)
        is the number of loops containing pi;
  end
end codegen
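The recursive structure can be sketched in Python in the same style (again using networkx; the edge levels, statement names, and output format are illustrative assumptions). Loop-independent dependences should be given level float("inf") so they survive every level-stripping step. Run on the single-statement nest of the previous slide, the sketch produces a level-1 DO loop with a vector statement inside, matching A(I+1,1:M) = A(I,1:M) + B.

import networkx as nx

def codegen_sketch(region, k, deps, loop_vars):
    # deps: list of (source, sink, level) edges; use float("inf") as the
    # level of a loop-independent dependence.
    D = nx.DiGraph()
    D.add_nodes_from(region)
    D.add_edges_from((s, t) for s, t, _ in deps)
    C = nx.condensation(D)
    indent = "  " * (k - 1)
    for n in nx.topological_sort(C):
        pi = sorted(C.nodes[n]["members"])
        cyclic = len(pi) > 1 or any(D.has_edge(s, s) for s in pi)
        if cyclic:
            print(f"{indent}DO {loop_vars[k - 1]} = ...")
            inner = [(s, t, l) for s, t, l in deps
                     if s in pi and t in pi and l > k]   # strip level-k edges
            codegen_sketch(pi, k + 1, inner, loop_vars)
            print(f"{indent}ENDDO")
        else:
            print(f"{indent}vector statement for {pi} in loops {k}..{len(loop_vars)}")

# The nest from the previous slide: S1 depends on itself at level 1 only.
codegen_sketch(["S1"], 1, [("S1", "S1", 1)], loop_vars=("I", "J"))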

Page 24: Advanced Vectorization Algorithm

DO I = 1, 100

S1 X(I) = Y(I) + 10

DO J = 1, 100

S2 B(J) = A(J,N)

DO K = 1, 100

S3 A(J+1,K)=B(J)+C(J,K)

ENDDO

S4 Y(I+J) = A(J+1, N)

ENDDO

ENDDO

Page 25: Advanced Vectorization Algorithm

DO I = 1, 100

S1 X(I) = Y(I) + 10

DO J = 1, 100

S2 B(J) = A(J,N)

DO K = 1, 100

S3 A(J+1,K)=B(J)+C(J,K)

ENDDO

S4 Y(I+J) = A(J+1, N)

ENDDO

ENDDO

Simple dependence testing procedure: true dependence from S4 to S1 (via Y).
I0 + J = I0 + ΔI, so ΔI = J. Since J is always positive, the direction is "<".

Page 26: Advanced Vectorization Algorithm

DO I = 1, 100

S1 X(I) = Y(I) + 10

DO J = 1, 100

S2 B(J) = A(J,N)

DO K = 1, 100

S3 A(J+1,K)=B(J)+C(J,K)

ENDDO

S4 Y(I+J) = A(J+1, N)

ENDDO

ENDDO

S2 and S3: dependence via B(J). I does not occur in either subscript, so its direction is "*".
We get: J0 = J0 + ΔJ, so ΔJ = 0. Direction vectors = (*, =).

Page 27: Advanced Vectorization Algorithm

DO I = 1, 100
S1   X(I) = Y(I) + 10
     DO J = 1, 100
S2     B(J) = A(J,N)
       DO K = 1, 100
S3       A(J+1,K) = B(J) + C(J,K)
       ENDDO
S4     Y(I+J) = A(J+1, N)
     ENDDO
ENDDO

DO I = 1, 100
  codegen({S2, S3, S4}, 2)
ENDDO
X(1:100) = Y(1:100) + 10

• codegen called at the outermost level

• S1 will be vectorized

Page 28: Advanced Vectorization Algorithm

DO I = 1, 100
  DO J = 1, 100
    codegen({S2, S3}, 3)
  ENDDO
S4 Y(I+1:I+100) = A(2:101,N)
ENDDO
X(1:100) = Y(1:100) + 10

• codegen({S2, S3, S4}, 2)

• level-1 dependences are stripped off

Page 29: Advanced Vectorization Algorithm

• codegen({S2, S3}, 3)

• level-2 dependences are stripped off

DO I = 1, 100
S1   X(I) = Y(I) + 10
     DO J = 1, 100
S2     B(J) = A(J,N)
       DO K = 1, 100
S3       A(J+1,K) = B(J) + C(J,K)
       ENDDO
S4     Y(I+J) = A(J+1, N)
     ENDDO
ENDDO

DO I = 1, 100
  DO J = 1, 100
    B(J) = A(J,N)
    A(J+1,1:100) = B(J) + C(J,1:100)
  ENDDO
  Y(I+1:I+100) = A(2:101,N)
ENDDO

X(1:100) = Y(1:100) + 10
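As a final sanity check, the Python/NumPy sketch below (random made-up data, and an assumed value for the parameter N that appears in A(J,N)) runs the original loop nest and the fully vectorized version above and confirms that X, Y, A, and B come out the same.

import numpy as np

rng = np.random.default_rng(0)
N = 60                                  # assumed value of the parameter N
Y0 = rng.random(201)                    # Y(1:200); index 0 unused
A0 = rng.random((102, 101))             # A(1:101, 1:100)
C0 = rng.random((101, 101))             # C(1:100, 1:100)

def original(Y, A, C):
    Y, A = Y.copy(), A.copy()
    X, B = np.zeros(101), np.zeros(101)
    for I in range(1, 101):
        X[I] = Y[I] + 10                            # S1
        for J in range(1, 101):
            B[J] = A[J, N]                          # S2
            for K in range(1, 101):
                A[J + 1, K] = B[J] + C[J, K]        # S3
            Y[I + J] = A[J + 1, N]                  # S4
    return X, Y, A, B

def transformed(Y, A, C):
    Y, A = Y.copy(), A.copy()
    X, B = np.zeros(101), np.zeros(101)
    for I in range(1, 101):
        for J in range(1, 101):
            B[J] = A[J, N]
            A[J + 1, 1:101] = B[J] + C[J, 1:101]    # A(J+1,1:100) = B(J)+C(J,1:100)
        Y[I + 1:I + 101] = A[2:102, N]              # Y(I+1:I+100) = A(2:101,N)
    X[1:101] = Y[1:101] + 10                        # X(1:100) = Y(1:100) + 10
    return X, Y, A, B

for got, want in zip(transformed(Y0, A0, C0), original(Y0, A0, C0)):
    assert np.allclose(got, want)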