Loop Optimizations and Scheduling
Loop fusion
• int acc = 0;
  for (int i = 0; i < n; ++i) {
    acc += a[i];
    a[i] = acc;
  }
  for (int i = 0; i < n; ++i) {
    b[i] += a[i];
  }
• int acc = 0;
  for (int i = 0; i < n; ++i) {
    acc += a[i];
    a[i] = acc; // Will DCE pick this up?
    b[i] += acc;
  }
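The two fusion variants above can be written as complete functions and compared. This is a minimal sketch: the function names are made up for illustration, and it assumes `a` and `b` do not overlap.

```c
#include <stddef.h>

/* Unfused: one pass builds the running sum in a, a second pass adds it to b. */
void prefix_then_add(int *a, int *b, size_t n) {
    int acc = 0;
    for (size_t i = 0; i < n; ++i) {
        acc += a[i];
        a[i] = acc;
    }
    for (size_t i = 0; i < n; ++i) {
        b[i] += a[i];
    }
}

/* Fused: a single pass; b[i] uses acc directly instead of re-reading a[i]. */
void prefix_then_add_fused(int *a, int *b, size_t n) {
    int acc = 0;
    for (size_t i = 0; i < n; ++i) {
        acc += a[i];
        a[i] = acc;
        b[i] += acc;
    }
}
```

Both functions should leave `a` and `b` in the same state; the fused one walks the arrays once.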
Loop fission
• for (int i = 0; i < n; ++i) {
    a[i] = e1;
    b[i] = e2; // e1 and e2 independent
  }
• for (int i = 0; i < n; ++i) {
    a[i] = e1;
  }
  for (int i = 0; i < n; ++i) {
    b[i] = e2;
  }
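A concrete instance of fission, as a sketch. The expressions `x[i] * 2` and `y[i] + 1` are stand-ins chosen here for the independent `e1` and `e2`; the function names are hypothetical, and the arrays are assumed not to overlap.

```c
#include <stddef.h>

/* Original: both independent statements share one loop. */
void fill_both(int *a, int *b, const int *x, const int *y, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        a[i] = x[i] * 2;  /* e1 */
        b[i] = y[i] + 1;  /* e2, independent of e1 */
    }
}

/* After fission: each statement gets its own loop. */
void fill_both_split(int *a, int *b, const int *x, const int *y, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        a[i] = x[i] * 2;
    }
    for (size_t i = 0; i < n; ++i) {
        b[i] = y[i] + 1;
    }
}
```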
Loop unrolling
• for (int i = 0; i < n; ++i) {
    a[i] = b[i] * 7 + c[i] / 13;
  }
• int i = 0; // declared outside so the second loop continues where the first stopped
  for (; i < n % 3; ++i) {
    a[i] = b[i] * 7 + c[i] / 13;
  }
  for (; i < n; i += 3) {
    a[i] = b[i] * 7 + c[i] / 13;
    a[i + 1] = b[i + 1] * 7 + c[i + 1] / 13;
    a[i + 2] = b[i + 2] * 7 + c[i + 2] / 13;
  }
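The unrolled form can be checked against the rolled loop for every remainder of `n`. A sketch with hypothetical function names: the first `n % 3` iterations run one at a time, so the number of remaining iterations is a multiple of 3.

```c
#include <stddef.h>

/* Rolled reference loop. */
void scale(int *a, const int *b, const int *c, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        a[i] = b[i] * 7 + c[i] / 13;
    }
}

/* Unrolled by 3: a short remainder loop, then three iterations per trip. */
void scale_unrolled3(int *a, const int *b, const int *c, size_t n) {
    size_t i = 0;
    for (; i < n % 3; ++i) {
        a[i] = b[i] * 7 + c[i] / 13;
    }
    for (; i < n; i += 3) {
        a[i]     = b[i]     * 7 + c[i]     / 13;
        a[i + 1] = b[i + 1] * 7 + c[i + 1] / 13;
        a[i + 2] = b[i + 2] * 7 + c[i + 2] / 13;
    }
}
```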
Loop interchange
• for (int i = 0; i < n; ++i) {
    for (int j = 0; j < n; ++j) {
      a[i][j] += 1;
    }
  }
• for (int j = 0; j < n; ++j) {
    for (int i = 0; i < n; ++i) {
      a[i][j] += 1;
    }
  }
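Both loop orders compute the same result; what changes is the order in which memory is touched. In the sketch below (fixed size `N` and function names are assumptions for illustration), note that C stores 2D arrays row by row, so the `i`-outer order walks memory contiguously while the `j`-outer order strides by a full row. Which direction of the interchange pays off therefore depends on the array layout.

```c
enum { N = 4 };  /* hypothetical fixed size for the example */

/* i outer, j inner: consecutive accesses are adjacent in memory in C. */
void inc_ij(int a[N][N]) {
    for (int i = 0; i < N; ++i) {
        for (int j = 0; j < N; ++j) {
            a[i][j] += 1;
        }
    }
}

/* j outer, i inner: same result, but each access jumps a whole row. */
void inc_ji(int a[N][N]) {
    for (int j = 0; j < N; ++j) {
        for (int i = 0; i < N; ++i) {
            a[i][j] += 1;
        }
    }
}
```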
Loop peeling
• for (int i = 0; i < n; ++i) {
    b[i] = (i == 0) ? a[i] : a[i] + b[i-1];
  }
• b[0] = a[0]; // assumes n >= 1
  for (int i = 1; i < n; ++i) {
    b[i] = a[i] + b[i-1];
  }
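A sketch of the peeled loop as a function, with an explicit guard for the empty case, since the peeled first iteration would otherwise write `b[0]` even when `n` is 0. Function names are hypothetical; `a` and `b` are assumed not to overlap.

```c
#include <stddef.h>

/* Original: the i == 0 test is evaluated on every iteration. */
void running_sum(const int *a, int *b, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        b[i] = (i == 0) ? a[i] : a[i] + b[i - 1];
    }
}

/* Peeled: the first iteration is hoisted out, so the loop body has no branch. */
void running_sum_peeled(const int *a, int *b, size_t n) {
    if (n == 0) {
        return;  /* nothing to peel */
    }
    b[0] = a[0];
    for (size_t i = 1; i < n; ++i) {
        b[i] = a[i] + b[i - 1];
    }
}
```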
Loop tiling
• for (int i = 0; i < n; ++i) {
    for (int j = 0; j < n; ++j) {
      for (int k = 0; k < n; ++k) {
        c[i][j] += a[i][k] * b[k][j];
      }
    }
  }
• Very roughly: (need outer loops to move y and z)
  for (int i = y; i < y + 10; ++i) {
    for (int j = z; j < z + 10; ++j) {
      for (int k = 0; k < n; ++k) {
        c[i][j] += a[i][k] * b[k][j];
      }
    }
  }
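The rough tiled version above leaves out the outer loops over `y` and `z`. One way to fill them in is sketched below, under these assumptions: matrices are stored flat in row-major order, the tile size is a parameter (`tile`, must be at least 1) rather than the fixed 10, and the tile bounds are clamped so `n` need not be a multiple of the tile size. The function names are hypothetical.

```c
#include <stddef.h>

/* Reference: plain triple loop, c += a * b for n x n row-major matrices. */
void matmul_naive(const int *a, const int *b, int *c, size_t n) {
    for (size_t i = 0; i < n; ++i)
        for (size_t j = 0; j < n; ++j)
            for (size_t k = 0; k < n; ++k)
                c[i * n + j] += a[i * n + k] * b[k * n + j];
}

/* Tiled over i and j: y and z select the top-left corner of each tile. */
void matmul_tiled(const int *a, const int *b, int *c, size_t n, size_t tile) {
    for (size_t y = 0; y < n; y += tile) {
        for (size_t z = 0; z < n; z += tile) {
            /* Clamp the tile so it does not run past the matrix edge. */
            size_t i_end = (y + tile < n) ? y + tile : n;
            size_t j_end = (z + tile < n) ? z + tile : n;
            for (size_t i = y; i < i_end; ++i)
                for (size_t j = z; j < j_end; ++j)
                    for (size_t k = 0; k < n; ++k)
                        c[i * n + j] += a[i * n + k] * b[k * n + j];
        }
    }
}
```

Each `c[i][j]` still accumulates over `k` in the same order, so the two functions produce identical results; only the order in which the `(i, j)` pairs are visited differs.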
Loop parallelization
• for (int i = 0; i < n; ++i) {
    a[i] = b[i] + c[i]; // a, b, and c do not overlap
  }
• int i = 0; // declared outside so the second loop continues where the first stopped
  for (; i < n % 4; ++i)
    a[i] = b[i] + c[i];
  for (; i < n; i = i + 4) {
    __some4SIMDadd(a + i, b + i, c + i);
  }
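The shape of the vectorized loop can be sketched in portable C. Here `add4` is a hypothetical stand-in for the slide's `__some4SIMDadd` placeholder: it adds four lanes with ordinary scalar code, where a real build would use a platform SIMD intrinsic or rely on the compiler's auto-vectorizer. `restrict` (C99) expresses the "do not overlap" assumption.

```c
#include <stddef.h>

/* Hypothetical 4-lane add; stands in for a real SIMD instruction. */
static void add4(float *restrict dst, const float *restrict x,
                 const float *restrict y) {
    for (int lane = 0; lane < 4; ++lane) {
        dst[lane] = x[lane] + y[lane];
    }
}

/* Scalar remainder first, then four elements per trip. */
void vec_add(float *restrict a, const float *restrict b,
             const float *restrict c, size_t n) {
    size_t i = 0;
    for (; i < n % 4; ++i) {
        a[i] = b[i] + c[i];
    }
    for (; i < n; i += 4) {
        add4(a + i, b + i, c + i);
    }
}
```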
Instruction scheduling
• An instruction goes through the processor pipeline in one or more cycles
• Several instructions can be processed simultaneously at different stages in the pipeline
• The number of cycles necessary to process an instruction is called its latency
• Examples of instruction latency on some x86
– ADD: 1 cycle
– MUL: 4 cycles
– DIV (32 bits): 40 cycles
• The simplest form of scheduling is done per CFG block
Instruction scheduling: example
1: ADD R1, R2
2: MUL R3, R4
3:
4:
5:
6: ADD R1, R3

1: MUL R3, R4
2: ADD R1, R2
3:
4:
5: ADD R1, R3
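The cycle numbers in the two listings can be reproduced with a tiny model. This is a sketch under simplifying assumptions that are not stated on the slide: the machine issues at most one instruction per cycle, in program order, and an instruction cannot issue before each instruction it depends on has finished (issue cycle plus latency). The `struct instr` type and `issue_cycles` function are hypothetical names.

```c
#include <stddef.h>

/* One instruction: its latency, and the indices of up to two earlier
   instructions whose results it needs (-1 means no dependence). */
struct instr {
    int latency;
    int dep[2];
};

/* Fills issue[i] with the cycle (numbered from 1) at which instruction i
   issues. Dependences must point at earlier instructions only. */
void issue_cycles(const struct instr *code, int *issue, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        /* Earliest slot: the cycle after the previous instruction issued. */
        int cycle = (i == 0) ? 1 : issue[i - 1] + 1;
        for (int d = 0; d < 2; ++d) {
            int p = code[i].dep[d];
            if (p >= 0) {
                int ready = issue[p] + code[p].latency;
                if (ready > cycle) {
                    cycle = ready;  /* stall until the operand is available */
                }
            }
        }
        issue[i] = cycle;
    }
}
```

With ADD at latency 1 and MUL at latency 4, placing the MUL first lets the independent ADD fill one of the stall cycles, so the final ADD issues one cycle earlier.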
Beyond blocks: trace scheduling 1
Beyond blocks: trace scheduling 2
Pipelining for loops
• First idea: unrolling the loop and then scheduling
– It works, but it is not always optimal, and it increases the code size
• Think of a loop with the following body:
– DIV R1, R3 ; ADD R1, R2
– We would have to unroll 40 times to hide the latency
– And in general, it may not always be possible to hide the latency
– What if the DIV was computing the value for 40 iterations from now?
• Software pipelining
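The idea of letting the divide run ahead can be sketched at source level. The example below is an illustration only: it pipelines by a single iteration rather than 40, the arrays `x`, `y`, `z` and the function names are made up, divisors are assumed nonzero, and `out` is assumed not to overlap the inputs. The rewritten loop has a prologue that starts the first divide, a steady state that overlaps the divide for iteration `i + 1` with the add for iteration `i`, and an epilogue that finishes the last add.

```c
#include <stddef.h>

/* Original: each add waits for the divide of the same iteration. */
void div_add(int *out, const int *x, const int *y, const int *z, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        out[i] = x[i] / y[i] + z[i];
    }
}

/* Software-pipelined by one iteration. */
void div_add_pipelined(int *out, const int *x, const int *y, const int *z,
                       size_t n) {
    if (n == 0) {
        return;
    }
    int q = x[0] / y[0];                  /* prologue: divide for iteration 0 */
    for (size_t i = 0; i + 1 < n; ++i) {  /* steady state */
        int q_next = x[i + 1] / y[i + 1]; /* divide for iteration i + 1 */
        out[i] = q + z[i];                /* add for iteration i */
        q = q_next;
    }
    out[n - 1] = q + z[n - 1];            /* epilogue: last add */
}
```

Whether this helps in practice depends on the compiler and the hardware; the sketch only shows the prologue, steady state, and epilogue structure.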
Software pipelining 1
• There is one last technique in the arsenal of the software optimizer that may be used to make most machines run at tip top speed. It can also lead to severe code bloat and may make for almost unreadable code, so should be considered the last refuge of the truly desperate. However, its performance characteristics are in many cases unmatched by any other approach, so we cover it here. It is called software pipelining [...]
Apple Developer Connection
Software pipelining 2
Symbolic evaluation
• Turning a sequence of instructions back into expressions
• Hides some of the syntactic details
• Example: add a b c ; add d a b ; add e d a becomes
  a -> add(b,c)
  d -> add(add(b,c),b)
  e -> add(add(add(b,c),b),add(b,c))
• For example, it is insensitive to the order of independent instructions
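The example above can be reproduced with a very small symbolic evaluator. This is a sketch, not a full implementation: it assumes single-letter register names `a` to `z`, handles only a three-operand `add`, and stores expressions as fixed-size strings (longer expressions would be truncated). All names in it are hypothetical.

```c
#include <stdio.h>
#include <string.h>

enum { MAX_EXPR = 256 };

/* expr[r] holds the expression currently bound to register 'a' + r. */
struct sym_state {
    char expr[26][MAX_EXPR];
};

/* Initially every register stands for itself: a -> a, b -> b, ... */
void sym_init(struct sym_state *s) {
    for (int r = 0; r < 26; ++r) {
        s->expr[r][0] = (char)('a' + r);
        s->expr[r][1] = '\0';
    }
}

/* Symbolically executes "add dst src1 src2". */
void sym_add(struct sym_state *s, char dst, char src1, char src2) {
    char result[MAX_EXPR];
    /* Build into a temporary first, since dst may also be a source. */
    snprintf(result, sizeof result, "add(%s,%s)",
             s->expr[src1 - 'a'], s->expr[src2 - 'a']);
    strcpy(s->expr[dst - 'a'], result);
}
```

Running `add a b c ; add d a b ; add e d a` through `sym_add` yields the three expressions listed above. Swapping two instructions that do not depend on each other leaves every final expression unchanged, which is the order-insensitivity the last bullet refers to.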
Is my pipeline correct?
• Let s(I) denote the symbolic evaluation of the block of instructions I
• Let o be the composition of symbolic trees
• If s(P o E) = s(B^m) and s(S o E) = s(E o B) then the pipeline is correct (the converse is not true)