
Loop Optimizations and Scheduling

Loop fusion

• int acc = 0;
  for (int i = 0; i < n; ++i) {
    acc += a[i];
    a[i] = acc;
  }
  for (int i = 0; i < n; ++i) {
    b[i] += a[i];
  }

• int acc = 0;
  for (int i = 0; i < n; ++i) {
    acc += a[i];
    a[i] = acc; // Will DCE pick this up?
    b[i] += acc;
  }
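As a sanity check on the transformation above, the two versions can be wrapped in functions and compared on a small input. This is only a sketch: the function names `scan_then_add`, `scan_and_add_fused`, and `check_fusion` are hypothetical, and the code assumes `a` and `b` do not overlap.

```c
#include <assert.h>

/* Unfused: one pass builds the running sum in a, a second pass adds it to b. */
void scan_then_add(int *a, int *b, int n) {
    int acc = 0;
    for (int i = 0; i < n; ++i) {
        acc += a[i];
        a[i] = acc;
    }
    for (int i = 0; i < n; ++i) {
        b[i] += a[i];
    }
}

/* Fused: a single pass; b[i] reads acc instead of reloading a[i]. */
void scan_and_add_fused(int *a, int *b, int n) {
    int acc = 0;
    for (int i = 0; i < n; ++i) {
        acc += a[i];
        a[i] = acc;   /* this store is dead only if a is never read afterwards */
        b[i] += acc;
    }
}

/* Runs both versions on the same sample data and checks that they agree. */
void check_fusion(void) {
    int a1[4] = {1, 2, 3, 4}, b1[4] = {10, 20, 30, 40};
    int a2[4] = {1, 2, 3, 4}, b2[4] = {10, 20, 30, 40};
    scan_then_add(a1, b1, 4);
    scan_and_add_fused(a2, b2, 4);
    for (int i = 0; i < 4; ++i) {
        assert(a1[i] == a2[i]);
        assert(b1[i] == b2[i]);
    }
}
```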

Loop fission

• for (int i = 0; i < n; ++i) {
    a[i] = e1;
    b[i] = e2; // e1 and e2 independent
  }

• for (int i = 0; i < n; ++i) {
    a[i] = e1;
  }
  for (int i = 0; i < n; ++i) {
    b[i] = e2;
  }

Loop unrolling

• for (int i = 0; i < n; ++i) {
    a[i] = b[i] * 7 + c[i] / 13;
  }

• int i = 0; // declared outside so that both loops share it
  for (; i < n % 3; ++i) {
    a[i] = b[i] * 7 + c[i] / 13;
  }
  for (; i < n; i += 3) {
    a[i] = b[i] * 7 + c[i] / 13;
    a[i + 1] = b[i + 1] * 7 + c[i + 1] / 13;
    a[i + 2] = b[i + 2] * 7 + c[i + 2] / 13;
  }

Loop interchange

• for (int i = 0; i < n; ++i) {
    for (int j = 0; j < n; ++j) {
      a[i][j] += 1;
    }
  }

• for (int j = 0; j < n; ++j) {
    for (int i = 0; i < n; ++i) {
      a[i][j] += 1;
    }
  }

Loop peeling

• for (int i = 0; i < n; ++i) {
    b[i] = (i == 0) ? a[i] : a[i] + b[i-1];
  }

• b[0] = a[0]; // requires n >= 1
  for (int i = 1; i < n; ++i) {
    b[i] = a[i] + b[i-1];
  }

Loop tiling

• for (int i = 0; i < n; ++i) {
    for (int j = 0; j < n; ++j) {
      for (int k = 0; k < n; ++k) {
        c[i][j] += a[i][k] * b[k][j];
      }
    }
  }

• Very roughly (need outer loops to move y and z):
  for (int i = y; i < y + 10; ++i) {
    for (int j = z; j < z + 10; ++j) {
      for (int k = 0; k < n; ++k) {
        c[i][j] += a[i][k] * b[k][j];
      }
    }
  }
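The outer loops that move y and z can be sketched as follows. This is an illustration under assumptions, not a tuned implementation: the matrices are stored as flat row-major arrays of `double`, the tile edge `TILE` is fixed at 10 as in the fragment above, and the names `matmul_naive`, `matmul_tiled`, `min_int`, and `check_tiling` are hypothetical. The last tile is clipped so that n need not be a multiple of the tile size.

```c
#include <assert.h>

#define TILE 10  /* tile edge; 10 as in the slide, a real choice depends on the cache */

static int min_int(int x, int y) { return x < y ? x : y; }

/* Reference triple loop: c += a * b for n x n row-major matrices. */
void matmul_naive(int n, const double *a, const double *b, double *c) {
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            for (int k = 0; k < n; ++k)
                c[i * n + j] += a[i * n + k] * b[k * n + j];
}

/* Tiled: the two outer loops move the tile origin (y, z) across c. */
void matmul_tiled(int n, const double *a, const double *b, double *c) {
    for (int y = 0; y < n; y += TILE) {
        for (int z = 0; z < n; z += TILE) {
            int i_end = min_int(y + TILE, n);  /* clip the last partial tile */
            int j_end = min_int(z + TILE, n);
            for (int i = y; i < i_end; ++i)
                for (int j = z; j < j_end; ++j)
                    for (int k = 0; k < n; ++k)
                        c[i * n + j] += a[i * n + k] * b[k * n + j];
        }
    }
}

/* Compares both versions on a size that is not a multiple of TILE. */
void check_tiling(void) {
    enum { N = 13 };
    double a[N * N], b[N * N], c1[N * N], c2[N * N];
    for (int i = 0; i < N * N; ++i) {
        a[i] = i % 7;
        b[i] = i % 5;
        c1[i] = c2[i] = 0.0;
    }
    matmul_naive(N, a, b, c1);
    matmul_tiled(N, a, b, c2);
    /* Each c[i][j] sums over k in the same order, so results match exactly. */
    for (int i = 0; i < N * N; ++i) assert(c1[i] == c2[i]);
}
```

Only the order in which the (i, j) pairs are visited changes; the k loop is left untiled here, as in the fragment above.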

Loop parallelization

• for (int i = 0; i < n; ++i) {
    a[i] = b[i] + c[i]; // a, b, and c do not overlap
  }

• int i = 0; // declared outside so that both loops share it
  for (; i < n % 4; ++i)
    a[i] = b[i] + c[i];
  for (; i < n; i = i + 4) {
    __some4SIMDadd(a+i,b+i,c+i);
  }
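A portable way to experiment with this shape is to replace the SIMD placeholder with a plain C helper. In the sketch below, `add4` is a hypothetical stand-in for the 4-wide add (a vectorizing compiler may or may not turn it into a single vector instruction), and `vector_add` and `check_vector_add` are likewise illustrative names. As above, `a`, `b`, and `c` are assumed not to overlap.

```c
#include <assert.h>

/* Hypothetical stand-in for a 4-wide SIMD add: four independent lanes. */
static void add4(int *a, const int *b, const int *c) {
    for (int k = 0; k < 4; ++k) a[k] = b[k] + c[k];
}

/* a, b, and c must not overlap. */
void vector_add(int *a, const int *b, const int *c, int n) {
    int i = 0;
    for (; i < n % 4; ++i)        /* scalar prologue: n % 4 leftover elements */
        a[i] = b[i] + c[i];
    for (; i < n; i += 4)         /* remaining count is a multiple of 4 */
        add4(a + i, b + i, c + i);
}

/* Checks every length from 0 to 9, including that nothing past n is written. */
void check_vector_add(void) {
    int a[9], b[9], c[9];
    for (int i = 0; i < 9; ++i) { b[i] = i; c[i] = 10 * i; }
    for (int n = 0; n <= 9; ++n) {
        for (int i = 0; i < 9; ++i) a[i] = -1;
        vector_add(a, b, c, n);
        for (int i = 0; i < 9; ++i)
            assert(a[i] == (i < n ? 11 * i : -1));
    }
}
```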

Instruction scheduling

• An instruction goes through the processor pipeline in one or more cycles

• Several instructions can be processed simultaneously at different stages in the pipeline

• The number of cycles necessary to process an instruction is called its latency

• Examples of instruction latency on some x86
  – ADD: 1 cycle
  – MUL: 4 cycles
  – DIV (32 bits): 40 cycles

• The simplest form of scheduling is done per CFG block

Instruction scheduling: example

1: ADD R1, R2
2: MUL R3, R4
3:
4:
5:
6: ADD R1, R3

1: MUL R3, R4
2: ADD R1, R2
3:
4:
5: ADD R1, R3
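The cycle numbers in the two listings can be reproduced with a very small timing model. The model is an assumption made for illustration, not a description of any real processor: at most one instruction issues per cycle, in program order; an instruction issued at cycle t with latency L makes its result usable at cycle t + L; and a two-operand instruction `OP Rd, Rs` reads both Rd and Rs. The type `Insn` and the function `last_issue_cycle` are hypothetical names.

```c
#include <assert.h>

enum { NREGS = 8 };

typedef struct {
    int latency;  /* cycles until the result is usable */
    int dst;      /* register that is read and written */
    int src;      /* register that is read */
} Insn;

/* Returns the cycle at which the last instruction of the block issues. */
int last_issue_cycle(const Insn *code, int count) {
    int ready[NREGS] = {0};  /* first cycle at which each register is usable */
    int cycle = 0;
    for (int i = 0; i < count; ++i) {
        assert(code[i].dst >= 0 && code[i].dst < NREGS);
        assert(code[i].src >= 0 && code[i].src < NREGS);
        int t = cycle + 1;                            /* one issue per cycle */
        if (ready[code[i].dst] > t) t = ready[code[i].dst];  /* wait for Rd */
        if (ready[code[i].src] > t) t = ready[code[i].src];  /* wait for Rs */
        ready[code[i].dst] = t + code[i].latency;
        cycle = t;
    }
    return cycle;
}

/* The two orderings from the example: ADD has latency 1, MUL has latency 4. */
static const Insn before_sched[] = { {1, 1, 2}, {4, 3, 4}, {1, 1, 3} };
static const Insn after_sched[]  = { {4, 3, 4}, {1, 1, 2}, {1, 1, 3} };
```

Under this model `last_issue_cycle(before_sched, 3)` gives 6 and `last_issue_cycle(after_sched, 3)` gives 5, matching the listings: starting the MUL first lets the independent ADD fill one of the waiting cycles.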

Beyond blocks: trace scheduling 1

Beyond blocks: trace scheduling 2

Pipelining for loops

• First idea: unroll the loop and then schedule
  – It works, but it is not always optimal, and it increases the code size
• Think of a loop with the following body:
  – DIV R1, R3 ; ADD R1, R2
  – We would have to unroll 40 times to hide the latency
  – And in general, it may not always be possible to hide the latency
  – What if the DIV was computing the value for 40 iterations from now?
• Software pipelining
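The idea of computing a value for a later iteration can be illustrated at the source level. The sketch below is hypothetical: it uses a loop of the form `out[i] = x[i] / d + y[i]` as a stand-in for the DIV/ADD body, and it looks only one iteration ahead, whereas hiding a 40-cycle latency would need a much larger distance. The names `div_add` and `div_add_pipelined` are illustrative, and `out` is assumed not to overlap `x` or `y`.

```c
#include <assert.h>

/* Original loop: each iteration divides, then immediately needs the quotient. */
void div_add(int *out, const int *x, const int *y, int d, int n) {
    assert(d != 0);
    for (int i = 0; i < n; ++i) {
        int q = x[i] / d;
        out[i] = q + y[i];
    }
}

/* Pipelined by one iteration: the divide for iteration i + 1 is started
   before the add of iteration i, so the two no longer depend on each other. */
void div_add_pipelined(int *out, const int *x, const int *y, int d, int n) {
    assert(d != 0);
    if (n <= 0) return;
    int q = x[0] / d;                  /* prologue: divide for iteration 0 */
    for (int i = 0; i + 1 < n; ++i) {  /* steady state */
        int q_next = x[i + 1] / d;     /* divide for iteration i + 1 */
        out[i] = q + y[i];             /* add for iteration i */
        q = q_next;
    }
    out[n - 1] = q + y[n - 1];         /* epilogue: add for the last iteration */
}
```

The prologue, steady state, and epilogue are the three pieces that any software-pipelined loop has; a compiler would normally perform this restructuring on instructions rather than on source code.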

Software pipelining 1

• There is one last technique in the arsenal of the software optimizer that may be used to make most machines run at tip top speed. It can also lead to severe code bloat and may make for almost unreadable code, so should be considered the last refuge of the truly desperate. However, its performance characteristics are in many cases unmatched by any other approach, so we cover it here. It is called software pipelining [...]

Apple Developer Connection

Software pipelining 2

Symbolic evaluation

• Turning a sequence of instructions back into expressions

• Hides some of the syntactic details

• Example: add a b c ; add d a b ; add e d a
  becomes
    a -> add(b,c)
    d -> add(add(b,c),b)
    e -> add(add(add(b,c),b),add(b,c))

• For example, it is insensitive to the order of independent instructions
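The example can be reproduced with a small symbolic evaluator. This is a minimal sketch under assumptions: variables are the single letters 'a' to 'z', the only instruction is a three-address `add dst lhs rhs`, and expression trees are kept as printed strings in fixed-size buffers. The names `Add`, `symbolic_eval`, `NVARS`, and `MAXLEN` are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

enum { NVARS = 26, MAXLEN = 256 };

typedef struct { char dst, lhs, rhs; } Add;  /* add dst lhs rhs */

/* Symbolically evaluates a block of adds over the variables 'a'..'z'.
   On return, expr[v - 'a'] holds the expression for variable v as text. */
void symbolic_eval(const Add *code, int count, char expr[NVARS][MAXLEN]) {
    for (int v = 0; v < NVARS; ++v) {   /* each variable starts as itself */
        expr[v][0] = (char)('a' + v);
        expr[v][1] = '\0';
    }
    for (int i = 0; i < count; ++i) {
        char tmp[MAXLEN];
        /* Build the new tree from the current trees of the two operands. */
        int len = snprintf(tmp, sizeof tmp, "add(%s,%s)",
                           expr[code[i].lhs - 'a'], expr[code[i].rhs - 'a']);
        assert(len >= 0 && len < MAXLEN);  /* sketch: no handling of overflow */
        strcpy(expr[code[i].dst - 'a'], tmp);
    }
}
```

Running it on `add a b c ; add d a b ; add e d a` yields the three expressions listed above. Swapping two independent instructions, such as `add a b c` and `add d e f`, leaves every resulting expression unchanged.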

Is my pipeline correct?

• Let s(I) denote the symbolic evaluation of the block of instructions I

• Let o be the composition of symbolic trees

• If s(P o E) = s(B^m) and s(S o E) = s(E o B), then the pipeline is correct (the converse is not true)
