Loop Optimizations Scheduling

Upload: august-newman

Post on 21-Jan-2016

Loop fusion

• Before: two separate loops

    int acc = 0;
    for (int i = 0; i < n; ++i) {
        acc += a[i];
        a[i] = acc;
    }
    for (int i = 0; i < n; ++i) {
        b[i] += a[i];
    }

• After fusion: a single loop

    int acc = 0;
    for (int i = 0; i < n; ++i) {
        acc += a[i];
        a[i] = acc; // Will DCE pick this up?
        b[i] += acc;
    }
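
As a quick check that the fused loop computes the same thing, here is a minimal sketch in C. The function names are hypothetical, and it assumes that a and b do not overlap, which the slide does not state but which fusion needs in general.

```c
#include <assert.h>
#include <stddef.h>

/* Unfused: prefix-sum a in place, then add the result into b. */
void prefix_then_add(int *a, int *b, size_t n) {
    assert(n == 0 || (a != NULL && b != NULL));
    int acc = 0;
    for (size_t i = 0; i < n; ++i) {
        acc += a[i];
        a[i] = acc;
    }
    for (size_t i = 0; i < n; ++i) {
        b[i] += a[i];
    }
}

/* Fused: one pass over both arrays. acc already holds the value that was
   just stored in a[i], so b[i] can use it without reloading a[i].
   Assumption: a and b do not overlap. */
void prefix_and_add_fused(int *a, int *b, size_t n) {
    assert(n == 0 || (a != NULL && b != NULL));
    int acc = 0;
    for (size_t i = 0; i < n; ++i) {
        acc += a[i];
        a[i] = acc;
        b[i] += acc;
    }
}
```

The store to a[i] is only dead if nothing reads a after the loop, so whether DCE can remove it depends on the surrounding code.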

Loop fission

• Before: one loop with two independent statements

    for (int i = 0; i < n; ++i) {
        a[i] = e1;
        b[i] = e2; // e1 and e2 independent
    }

• After fission: one loop per statement

    for (int i = 0; i < n; ++i) {
        a[i] = e1;
    }
    for (int i = 0; i < n; ++i) {
        b[i] = e2;
    }
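
A compilable instance of the same split, as a sketch: 2 * i and i + 1 are arbitrary stand-ins for e1 and e2, and the function names are hypothetical.

```c
#include <assert.h>

/* One loop, two independent statements (2 * i and i + 1 stand in for e1, e2). */
void fill_together(int *a, int *b, int n) {
    assert(n >= 0);
    for (int i = 0; i < n; ++i) {
        a[i] = 2 * i;
        b[i] = i + 1;
    }
}

/* After fission: each loop writes a single array. */
void fill_split(int *a, int *b, int n) {
    assert(n >= 0);
    for (int i = 0; i < n; ++i) {
        a[i] = 2 * i;
    }
    for (int i = 0; i < n; ++i) {
        b[i] = i + 1;
    }
}
```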

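Loop unrolling

• Before:

    for (int i = 0; i < n; ++i) {
        a[i] = b[i] * 7 + c[i] / 13;
    }

• After unrolling by 3 (the first loop handles the n % 3 leftover iterations):

    int i = 0; // declared outside the loops so that both can use it
    for (; i < n % 3; ++i) {
        a[i] = b[i] * 7 + c[i] / 13;
    }
    for (; i < n; i += 3) {
        a[i] = b[i] * 7 + c[i] / 13;
        a[i + 1] = b[i + 1] * 7 + c[i + 1] / 13;
        a[i + 2] = b[i + 2] * 7 + c[i + 2] / 13;
    }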
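
The unrolled form is easy to get wrong at the boundaries, so here is a small sketch that wraps both versions in functions (hypothetical names) so they can be compared for any n, including values that are not multiples of 3.

```c
#include <assert.h>

/* Reference version: one element per iteration. */
void scale_simple(int *a, const int *b, const int *c, int n) {
    assert(n >= 0);
    for (int i = 0; i < n; ++i) {
        a[i] = b[i] * 7 + c[i] / 13;
    }
}

/* Unrolled by 3: a short loop first consumes the n % 3 leftover elements,
   so the main loop always has a multiple of 3 iterations left. */
void scale_unrolled(int *a, const int *b, const int *c, int n) {
    assert(n >= 0);
    int i = 0;
    for (; i < n % 3; ++i) {
        a[i] = b[i] * 7 + c[i] / 13;
    }
    for (; i < n; i += 3) {
        a[i] = b[i] * 7 + c[i] / 13;
        a[i + 1] = b[i + 1] * 7 + c[i + 1] / 13;
        a[i + 2] = b[i + 2] * 7 + c[i + 2] / 13;
    }
}
```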

Loop interchange

• Before: i is the outer loop

    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            a[i][j] += 1;
        }
    }

• After interchange: j is the outer loop

    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < n; ++i) {
            a[i][j] += 1;
        }
    }
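
Which order is faster depends on the memory layout. The sketch below assumes a flat row-major array (the layout C uses for a[i][j]) and hypothetical function names. Both orders give the same result, but only the i-outer order walks memory with unit stride in its inner loop.

```c
#include <assert.h>
#include <stddef.h>

/* i outer, j inner: consecutive inner iterations touch adjacent ints. */
void increment_ij(int *a, size_t n) {
    assert(n == 0 || a != NULL);
    for (size_t i = 0; i < n; ++i) {
        for (size_t j = 0; j < n; ++j) {
            a[i * n + j] += 1;
        }
    }
}

/* j outer, i inner: consecutive inner iterations are n ints apart. */
void increment_ji(int *a, size_t n) {
    assert(n == 0 || a != NULL);
    for (size_t j = 0; j < n; ++j) {
        for (size_t i = 0; i < n; ++i) {
            a[i * n + j] += 1;
        }
    }
}
```

For a column-major layout the situation is reversed, which is when interchanging in the direction shown on the slide pays off.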

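Loop peeling

• Before: the i == 0 test runs on every iteration

    for (int i = 0; i < n; ++i) {
        b[i] = (i == 0) ? a[i] : a[i] + b[i-1];
    }

• After peeling the first iteration (the guard keeps the n == 0 case unchanged):

    if (n > 0) b[0] = a[0];
    for (int i = 1; i < n; ++i) {
        b[i] = a[i] + b[i-1];
    }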
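
A sketch of both versions as functions (hypothetical names), mainly to show the n == 0 corner case: the peeled iteration must not run when the original loop would not have run at all.

```c
#include <assert.h>

/* Original loop: the i == 0 test is evaluated on every iteration. */
void running_sum_simple(int *b, const int *a, int n) {
    assert(n >= 0);
    for (int i = 0; i < n; ++i) {
        b[i] = (i == 0) ? a[i] : a[i] + b[i - 1];
    }
}

/* Peeled: the first iteration is hoisted out, so the loop body is branch-free. */
void running_sum_peeled(int *b, const int *a, int n) {
    assert(n >= 0);
    if (n > 0) {
        b[0] = a[0];
    }
    for (int i = 1; i < n; ++i) {
        b[i] = a[i] + b[i - 1];
    }
}
```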

Loop tiling

• Before:

    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            for (int k = 0; k < n; ++k) {
                c[i][j] += a[i][k] * b[k][j];
            }
        }
    }

• Very roughly (need outer loops to move y and z):

    for (int i = y; i < y + 10; ++i) {
        for (int j = z; j < z + 10; ++j) {
            for (int k = 0; k < n; ++k) {
                c[i][j] += a[i][k] * b[k][j];
            }
        }
    }
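
One way to fill in the missing outer loops, as a sketch rather than the slide's own code: y and z step through the matrix in tiles of 10, and the tile bounds are clamped so that n need not be a multiple of 10. The flat row-major layout, the TILE constant, and the function names are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

enum { TILE = 10 }; /* tile edge, taken from the "+ 10" on the slide */

static size_t min_size(size_t x, size_t y) { return x < y ? x : y; }

/* Untiled reference: c += a * b for n x n row-major matrices. */
void matmul_naive(size_t n, const int *a, const int *b, int *c) {
    assert(n == 0 || (a != NULL && b != NULL && c != NULL));
    for (size_t i = 0; i < n; ++i)
        for (size_t j = 0; j < n; ++j)
            for (size_t k = 0; k < n; ++k)
                c[i * n + j] += a[i * n + k] * b[k * n + j];
}

/* Tiled over i and j: the two outer loops move the tile origin (y, z),
   the three inner loops are the ones shown on the slide. */
void matmul_tiled(size_t n, const int *a, const int *b, int *c) {
    assert(n == 0 || (a != NULL && b != NULL && c != NULL));
    for (size_t y = 0; y < n; y += TILE) {
        size_t y_end = min_size(y + TILE, n); /* clamp the last partial tile */
        for (size_t z = 0; z < n; z += TILE) {
            size_t z_end = min_size(z + TILE, n);
            for (size_t i = y; i < y_end; ++i)
                for (size_t j = z; j < z_end; ++j)
                    for (size_t k = 0; k < n; ++k)
                        c[i * n + j] += a[i * n + k] * b[k * n + j];
        }
    }
}
```

A good tile size depends on the cache, so 10 is only a placeholder here.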

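Loop parallelization

• Before:

    for (int i = 0; i < n; ++i) {
        a[i] = b[i] + c[i]; // a, b, and c do not overlap
    }

• After: four additions per iteration with a SIMD instruction

    int i = 0; // declared outside the loops so that both can use it
    for (; i < n % 4; ++i)
        a[i] = b[i] + c[i];
    for (; i < n; i = i + 4) {
        __some4SIMDadd(a + i, b + i, c + i);
    }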
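
The slide's __some4SIMDadd is a placeholder, not a real intrinsic. The sketch below keeps the same loop structure but substitutes a hypothetical portable helper, add4, that adds four lanes in plain C; on real hardware a vector intrinsic or the compiler's auto-vectorizer would take its place.

```c
#include <assert.h>

/* Hypothetical stand-in for a 4-wide SIMD add: dst[0..3] = x[0..3] + y[0..3]. */
static void add4(float *dst, const float *x, const float *y) {
    for (int lane = 0; lane < 4; ++lane) {
        dst[lane] = x[lane] + y[lane];
    }
}

/* a[i] = b[i] + c[i] for i in [0, n). Assumption: a, b, and c do not overlap. */
void vector_add(float *a, const float *b, const float *c, int n) {
    assert(n >= 0);
    int i = 0;
    for (; i < n % 4; ++i) {      /* scalar loop for the n % 4 leftovers */
        a[i] = b[i] + c[i];
    }
    for (; i < n; i += 4) {       /* remaining count is a multiple of 4 */
        add4(a + i, b + i, c + i);
    }
}
```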

Instruction scheduling

• An instruction goes through the processor pipeline in one or more cycles

• Several instructions can be processed simultaneously at different stages in the pipeline

• The number of cycles necessary to process an instruction is called its latency

• Examples of instruction latency on some x86:
  – ADD: 1 cycle
  – MUL: 4 cycles
  – DIV (32 bits): 40 cycles

• The simplest form of scheduling is done per CFG block

Instruction scheduling: example

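• Original order: the final ADD must wait for the MUL and issues in cycle 6

    1: ADD R1, R2
    2: MUL R3, R4
    3:
    4:
    5:
    6: ADD R1, R3

• After scheduling: the MUL starts first, so the final ADD issues in cycle 5

    1: MUL R3, R4
    2: ADD R1, R2
    3:
    4:
    5: ADD R1, R3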
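
The cycle numbers in this example can be reproduced with a toy model. The sketch below assumes an in-order machine that issues at most one instruction per cycle and stalls until both operands are ready. The Instr type, the function name, and the two-operand encoding (the destination is also a source, as in x86) are assumptions made for illustration.

```c
#include <assert.h>

enum { NUM_REGS = 8 };

typedef struct {
    int dst;     /* register written, and also read (two-operand form) */
    int src;     /* second register read */
    int latency; /* cycles before the result in dst can be used */
} Instr;

/* Returns the cycle in which the last instruction of the block issues. */
int last_issue_cycle(const Instr *code, int count) {
    int ready[NUM_REGS] = {0}; /* first cycle in which each register is usable */
    int cycle = 0;
    for (int k = 0; k < count; ++k) {
        const Instr *in = &code[k];
        assert(in->dst >= 0 && in->dst < NUM_REGS);
        assert(in->src >= 0 && in->src < NUM_REGS);
        int issue = cycle + 1;                        /* one issue per cycle */
        if (ready[in->dst] > issue) issue = ready[in->dst];
        if (ready[in->src] > issue) issue = ready[in->src];
        ready[in->dst] = issue + in->latency;
        cycle = issue;
    }
    return cycle;
}
```

With latencies of 1 for ADD and 4 for MUL, the block { {1, 2, 1}, {3, 4, 4}, {1, 3, 1} } models the first listing and { {3, 4, 4}, {1, 2, 1}, {1, 3, 1} } the second.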

Beyond blocks: trace scheduling 1

Beyond blocks: trace scheduling 2

Pipelining for loops

• First idea: unrolling the loop and then scheduling
  – It works, but it is not always optimal, and it increases the code size

• Think of a loop with the following body:
  – DIV R1, R3 ; ADD R1, R2
  – We would have to unroll 40 times to hide the latency
  – And in general, it may not always be possible to hide the latency
  – What if the DIV was computing the value for 40 iterations from now?

• Software pipelining
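
To make the idea concrete, here is a source-level sketch in C with a pipelining distance of one iteration instead of 40. The division for iteration i + 1 is started before the addition for iteration i, so the two are independent and can overlap. The function names are hypothetical, and x and y are assumed not to overlap.

```c
#include <assert.h>

/* Original loop: each addition has to wait for its own division. */
void div_add_simple(int *y, const int *x, int d, int c, int n) {
    assert(d != 0 && n >= 0);
    for (int i = 0; i < n; ++i) {
        int q = x[i] / d;
        y[i] = q + c;
    }
}

/* Software-pipelined by one iteration. */
void div_add_pipelined(int *y, const int *x, int d, int c, int n) {
    assert(d != 0 && n >= 0);
    if (n == 0) {
        return;
    }
    int q = x[0] / d;                 /* prologue: division for iteration 0 */
    for (int i = 0; i + 1 < n; ++i) { /* steady state */
        int q_next = x[i + 1] / d;    /* division for iteration i + 1 */
        y[i] = q + c;                 /* addition for iteration i */
        q = q_next;
    }
    y[n - 1] = q + c;                 /* epilogue: last addition */
}
```

The code now has three parts, a prologue, a steady state, and an epilogue, which is where the code growth comes from.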

Software pipelining 1

• There is one last technique in the arsenal of the software optimizer that may be used to make most machines run at tip top speed. It can also lead to severe code bloat and may make for almost unreadable code, so should be considered the last refuge of the truly desperate. However, its performance characteristics are in many cases unmatched by any other approach, so we cover it here. It is called software pipelining [. . . ]

Apple Developer Connection

Software pipelining 2

Symbolic evaluation

• Turning a sequence of instructions back into expressions

• Hides some of the syntactic details

• Example: add a b c ; add d a b ; add e d a becomes

    a -> add(b,c)
    d -> add(add(b,c),b)
    e -> add(add(add(b,c),b),add(b,c))

• For example, it is insensitive to the order of independent instructions
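
A minimal sketch of such an evaluator in C, assuming single-letter variable names a to z and only the add instruction from the example. The SymState type and the function names are hypothetical, and expressions are kept as strings in fixed-size buffers for simplicity.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

enum { NUM_VARS = 26, MAX_EXPR = 256 };

typedef struct {
    char expr[NUM_VARS][MAX_EXPR]; /* expr[v - 'a'] is the symbolic value of v */
} SymState;

/* Initially every variable stands for itself: a -> a, b -> b, ... */
void sym_init(SymState *s) {
    for (int v = 0; v < NUM_VARS; ++v) {
        s->expr[v][0] = (char)('a' + v);
        s->expr[v][1] = '\0';
    }
}

/* Symbolically executes "add dst x y": dst -> add(value of x, value of y). */
void sym_add(SymState *s, char dst, char x, char y) {
    char result[MAX_EXPR]; /* temporary, so that dst may equal x or y */
    assert(dst >= 'a' && dst <= 'z');
    assert(x >= 'a' && x <= 'z' && y >= 'a' && y <= 'z');
    int len = snprintf(result, sizeof result, "add(%s,%s)",
                       s->expr[x - 'a'], s->expr[y - 'a']);
    assert(len >= 0 && (size_t)len < sizeof result); /* sketch: no resizing */
    strcpy(s->expr[dst - 'a'], result);
}

const char *sym_value(const SymState *s, char v) {
    assert(v >= 'a' && v <= 'z');
    return s->expr[v - 'a'];
}
```

Running add x b c ; add y b b in either order leaves the same final state, which is the order-insensitivity mentioned above.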

Is my pipeline correct?

• Let s(I) denote the symbolic evaluation of the block of instructions I

• Let ∘ be the composition of symbolic trees

• If s(P ∘ E) = s(B^m) and s(S ∘ E) = s(E ∘ B), then the pipeline is correct (the converse is not true)
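
A sketch of why the two conditions are enough. The slide does not define its symbols, so this reads P, S, and E as the prologue, steady state, and epilogue of the pipelined loop, B as the original loop body, and Bm as m copies of B. It also assumes that the pipelined loop runs P, then k copies of S, then E, and that equal symbolic evaluations stay equal when the same block is composed on either side.

```latex
\begin{align*}
s(P \circ S^{k} \circ E)
  &= s(P \circ E \circ B^{k}) && \text{apply } s(S \circ E) = s(E \circ B) \text{ $k$ times} \\
  &= s(B^{m} \circ B^{k})     && \text{apply } s(P \circ E) = s(B^{m}) \\
  &= s(B^{m+k})
\end{align*}
```

Under these assumptions, the pipelined loop with k steady-state iterations has the same symbolic evaluation as m + k iterations of the original body.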