
Page 1: Performance and Optimization

Performance and Optimization

Page 2: Performance and Optimization

Measuring Performance
• The key measure of performance for a computing system is speed:
  – Response time (also called execution time or latency).
  – Throughput.
• We have seen how to increase throughput while slightly increasing the execution time of each single instruction:
  – Pipeline design.
• We now concentrate on measuring total execution time.
• Total execution time can mean:
  – Elapsed time -- includes all I/O, OS activity, and time spent on other jobs.
  – CPU time -- time spent by the processor on your job.

Page 3: Performance and Optimization

CPU Execution Time
• We consider CPU execution time on an unloaded system.
• Machine X is n times faster than machine Y if

    CPU Time_Y / CPU Time_X = n,  or equivalently  Performance_X / Performance_Y = n

  where:
  – CPU Time = Execution Time
  – Performance = 1 / CPU Time
• Basic measure of performance:

    CPU Time = (Clock cycles / program) × (Seconds / Clock cycle)
             = Cycle count × Clock cycle time

Page 4: Performance and Optimization

CPU Execution Time
• Clock cycle time is measured in nanoseconds (10^-9 sec) or microseconds (10^-6 sec).
• Clock rate = 1 / (Clock cycle time), measured in:
  – Megahertz (MHz): 10^6 cycles/sec.
  – Gigahertz (GHz): 10^9 cycles/sec.

Page 5: Performance and Optimization

CPI (Cycles Per Instruction)

    Cycle count = (Instructions / program) × (Average clock cycles / Instruction)
                = IC × CPI

• CPI is one way to compare different implementations of the same Instruction Set Architecture (ISA), since the instruction count (IC) for a given program will be the same in both cases.

Page 6: Performance and Optimization

CPI - Pipelined Implementation
• In each cycle, the execute stage processes either an instruction or a bubble, injected due to one of three special cases.
• If a total of Ci instructions and Cb bubbles are processed, the processor requires Ci + Cb clock cycles to execute the Ci instructions.
• In the pipelined implementation, CPI = (Ci + Cb) / Ci = 1 + Cb/Ci.
• Cb/Ci is the average number of bubbles injected per instruction.

  Cause        Frequency   Condition   Bubbles   Product
  Load/Use     0.25        0.20        1         0.05
  Mispredict   0.20        0.40        2         0.16
  Return       0.02        1.00        3         0.06
  Total                                          0.27

Thus, in our implementation CPI = 1.27.
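A short C sketch that reproduces the calculation in the table (the numbers are exactly the ones given above):

#include <stdio.h>

int main(void) {
    /* average bubbles per instruction = frequency × condition × bubbles */
    double load_use   = 0.25 * 0.20 * 1;   /* 0.05 */
    double mispredict = 0.20 * 0.40 * 2;   /* 0.16 */
    double ret        = 0.02 * 1.00 * 3;   /* 0.06 */

    double cpi = 1.0 + load_use + mispredict + ret;
    printf("CPI = %.2f\n", cpi);           /* prints CPI = 1.27 */
    return 0;
}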

Page 7: Performance and Optimization

CPI Example
• We have two machines with different implementations of the same ISA (Instruction Set Architecture). Machine A has a clock cycle time of 10 ns and a CPI of 2.0 for program P; machine B has a clock cycle time of 20 ns and a CPI of 1.2 for the same program. Which machine is faster?
• Let IC be the number of instructions to be executed (the same on both machines). Then:

    Cycle count_A = 2.0 × IC
    Cycle count_B = 1.2 × IC

  Calculate the CPU time for each machine:

    CPU Time_A = 2.0 × IC × 10 ns = 20.0 × IC ns
    CPU Time_B = 1.2 × IC × 20 ns = 24.0 × IC ns

• Machine A is faster; in fact 24/20 = 1.2, i.e., 20% faster.

Page 8: Performance and Optimization

Composite Performance Measure

    CPU Time = (Instructions / program) × (Average clock cycles / Instruction) × (Seconds / Clock cycle)

    or CPU Time = Instruction Count × CPI × Clock cycle time

    or CPU Time = Instruction Count × CPI / Clock rate

• These formulas show that performance is always a function of 3 distinct factors; 1 or 2 factors alone are not sufficient.
• IC (Instruction Count) was once the main factor advertised (VAX); today clock rate is in the headlines (3 GHz Pentiums).
• CPI is more difficult to advertise.
• Changing one factor often affects the others. For example, decreasing the instruction count means each instruction is doing more; hence CPI or cycle time, or both, may increase.
• A smart compiler may decrease CPI by choosing the right kind and order of instructions, without a large increase in instruction count.
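As a small illustration, here is a C sketch of the composite formula applied to the two machines from the CPI example (the helper name is ours, not from the slides):

#include <stdio.h>

/* CPU Time = IC × CPI × clock cycle time (hypothetical helper) */
double cpu_time_ns(double ic, double cpi, double cycle_time_ns) {
    return ic * cpi * cycle_time_ns;
}

int main(void) {
    double ic = 1e6;                        /* any instruction count */
    double a = cpu_time_ns(ic, 2.0, 10.0);  /* machine A */
    double b = cpu_time_ns(ic, 1.2, 20.0);  /* machine B */
    printf("A: %.0f ns, B: %.0f ns, B/A = %.2f\n", a, b, b / a);  /* B/A = 1.20 */
    return 0;
}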

Page 9: Performance and Optimization

Amdahl's Law
• Make the common case fast -- why?
• Denote the fraction of the system that was enhanced as F_enhanced, and the speedup of that part as speedup_enhanced. Then:

    Speedup = CPU Time_old / CPU Time_new
            = CPU Time_old / [CPU Time_old × (1 - F_enhanced) + CPU Time_old × F_enhanced × (1 / speedup_enhanced)]
            = 1 / [(1 - F_enhanced) + F_enhanced / speedup_enhanced]

Page 10: Performance and Optimization

Amdahl's Law (Example)
Suppose we have a technique for improving the performance of FP operations by a factor of 10. What fraction of the code must be floating point to achieve a 200% improvement in performance (i.e., a speedup of 3)?

    3 = 1 / [(1 - F_enhanced) + F_enhanced / 10]   ==>   F_enhanced = 20/27 ≈ 74%

Even dramatic enhancements make a limited contribution unless they apply to a very common case.

Page 11: Performance and Optimization

Amdahl's Law (example cont.)
Let us assume a sequential processor whose time is spent as follows, with a per-stage speedup applied to each stage:

  Stage       Time spent   Speedup
  Fetch       11%          1×
  Decode      18%          5×
  Execute     23%          20×
  Memory      40%          1.6×
  WriteBack   8%           1×

New (relative) time: 0.11/1 + 0.18/5 + 0.23/20 + 0.40/1.6 + 0.08/1 = 0.4875
Speedup = 1/0.4875 ≈ 2
The 5× and 20× speedups have little effect, because Decode and Execute are not where most of the time is spent.
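A minimal C sketch of this multi-part form of Amdahl's Law, checked against the numbers above (the function name and arrays are ours):

#include <stdio.h>

/* Overall speedup when fraction f[i] of the time is sped up by s[i];
   the fractions must sum to 1. */
double amdahl(const double *f, const double *s, int n) {
    double new_time = 0.0;
    for (int i = 0; i < n; i++)
        new_time += f[i] / s[i];
    return 1.0 / new_time;
}

int main(void) {
    double f[] = {0.11, 0.18, 0.23, 0.40, 0.08};   /* Fetch .. WriteBack */
    double s[] = {1.0,  5.0,  20.0, 1.6,  1.0};
    printf("Speedup = %.2f\n", amdahl(f, s, 5));   /* prints about 2.05 */
    return 0;
}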

Page 12: Performance and Optimization

Amdahl's Law (Parallel)
• P is the portion of the code that can be made parallel; N is the number of processors.

    Max speedup with N processors = 1 / [(1 - P) + P/N]

• With a very large number of processors, the speedup is bounded by 1/(1 - P).
• What does that mean about the efficiency of parallel computing? On which kinds of problems?
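The bound is easy to see numerically; a brief sketch (our own, with P = 0.9 chosen purely for illustration):

#include <stdio.h>

int main(void) {
    double p = 0.9;   /* parallel fraction (illustrative) */
    for (int n = 1; n <= 1024; n *= 4)
        printf("N = %4d  speedup = %.2f\n", n, 1.0 / ((1.0 - p) + p / n));
    /* The speedup approaches 1/(1 - p) = 10 no matter how large N gets. */
    return 0;
}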

Page 13: Performance and Optimization

Important to Keep in Mind
• The 90/10 rule: 90% of the time is spent in 10% of the code.
• Readability vs. performance.
• Time vs. memory.

Page 14: Performance and Optimization

Machine-Independent Optimizations
• Optimizations you should do regardless of processor / compiler.
• Code Motion
  – Reduce the frequency with which a computation is performed:
    • If it will always produce the same result.
    • Especially moving code out of a loop.

Before -- n*i is recomputed on every inner iteration:

for (i = 0; i < n; i++)
    for (j = 0; j < n; j++)
        a[n*i + j] = b[j];

After code motion -- n*i is computed once per outer iteration:

for (i = 0; i < n; i++) {
    int ni = n*i;
    for (j = 0; j < n; j++)
        a[ni + j] = b[j];
}

Page 15: Performance and Optimization

Reduction in Strength
• Replace a costly operation with a simpler one, e.g., shift and add instead of multiply or divide:

    16*x --> x << 4

  – The utility is machine dependent: it depends on the cost of the multiply or divide instruction.
  – On a Pentium II or III, an integer multiply requires only 4 CPU cycles.
• Recognize sequences of products:

Before -- a multiply on every inner iteration:

for (i = 0; i < n; i++)
    for (j = 0; j < n; j++)
        a[n*i + j] = b[j];

After strength reduction -- the multiply becomes a running addition:

int ni = 0;
for (i = 0; i < n; i++) {
    for (j = 0; j < n; j++)
        a[ni + j] = b[j];
    ni += n;
}

Page 16: Performance and Optimization

Using more efficient instructions
[The slide shows paired code examples: cases where the compiler can substitute a more efficient instruction, and cases where it cannot. The examples do not survive in this transcript.]

Page 17: Performance and Optimization

Arrays and Loops Example
[The slide shows the original C code and the corresponding assembly code; neither survives in this transcript.]

Page 18: Performance and Optimization

Loop Optimization
Loop optimization is the process of increasing the execution speed and reducing the overheads associated with loops. It plays an important role in improving cache performance and in making effective use of parallel processing capabilities. Most of a program's execution time is spent in loops, so many compiler optimization techniques have been developed to make them faster.

Page 19: Performance and Optimization

Time Scales
• Absolute Time
  – Typically use nanoseconds (10^-9 seconds).
  – The time scale of computer instructions.
• Clock Cycles
  – Most computers are controlled by a high-frequency clock signal.
  – Typical range:
    • 100 MHz: 10^8 cycles per second; clock period = 10 ns.
    • 2 GHz: 2 × 10^9 cycles per second; clock period = 0.5 ns.

Page 20: Performance and Optimization

Cycles Per Element
• A convenient way to express the performance of a program that operates on vectors or lists.
• For a vector of length n: T = CPE × n + Overhead.

[Plot: Cycles vs. Elements (0-200) for two procedures; vsum1 has slope = 4.0 and vsum2 has slope = 3.5. The slope of each line is its CPE.]
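The vsum code itself does not survive in the transcript; a plausible sketch, assuming vsum1 is the straightforward loop and vsum2 a two-way unrolled variant (an assumption on our part):

/* vsum1: one element per iteration */
void vsum1(int *x, int *y, int *z, int n) {
    int i;
    for (i = 0; i < n; i++)
        z[i] = x[i] + y[i];
}

/* vsum2: two elements per iteration, finishing any odd element at the end */
void vsum2(int *x, int *y, int *z, int n) {
    int i;
    for (i = 0; i < n - 1; i += 2) {
        z[i]   = x[i]   + y[i];
        z[i+1] = x[i+1] + y[i+1];
    }
    for (; i < n; i++)
        z[i] = x[i] + y[i];
}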

Page 21: Performance and Optimization

Optimization Example
• Procedure:
  – Compute the sum of all elements of an integer vector.
  – Store the result at the destination location.
  – The vector data structure and operations are defined via an abstract data type.
• Pentium II/III performance, in clock cycles per element:
  – 42.06 (compiled -g), 31.25 (compiled -O2).

void combine1(vec_ptr v, int *dest)
{
    int i;
    *dest = 0;
    for (i = 0; i < vec_length(v); i++) {
        int val;
        get_vec_element(v, i, &val);
        *dest += val;
    }
}
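The vector ADT itself is not shown in the transcript. Below is a minimal sketch consistent with the calls used here, modeled on the CS:APP-style vector ADT these slides appear to follow (the details are our assumption):

typedef struct {
    int len;
    int *data;
} vec_rec, *vec_ptr;

/* Return the number of elements in the vector. */
int vec_length(vec_ptr v) {
    return v->len;
}

/* Retrieve element i into *val; return 0 if i is out of bounds, 1 otherwise. */
int get_vec_element(vec_ptr v, int i, int *val) {
    if (i < 0 || i >= v->len)
        return 0;
    *val = v->data[i];
    return 1;
}

/* Return a pointer to the start of the underlying array. */
int *get_vec_start(vec_ptr v) {
    return v->data;
}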

Page 22: Performance and Optimization

General GCC optimization commands
• Most optimizations are only enabled if -O is set on the command line. Otherwise they are disabled, even if individual optimization flags are specified.
• With -O, the compiler tries to reduce code size and execution time, without performing any optimizations that take a great deal of compilation time.
• With -O2, GCC optimizes even more: it performs nearly all supported optimizations that do not involve a space-speed tradeoff. Compared to -O, this option increases both compilation time and the performance of the generated code.
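For example, a typical invocation (the file and output names are placeholders):

gcc -O2 -o combine combine.c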

Page 23: Performance and Optimization

Understanding the Loop
• Inefficiency:
  – The procedure vec_length is called on every iteration,
  – even though the result is always the same.

void combine1_goto(vec_ptr v, int *dest)
{
    int i = 0;
    int val;
    *dest = 0;
    if (i >= vec_length(v))
        goto done;
loop:                               /* one iteration */
    get_vec_element(v, i, &val);
    *dest += val;
    i++;
    if (i < vec_length(v))          /* vec_length called every iteration */
        goto loop;
done:
    ;
}

Page 24: Performance and Optimization

Move vec_length Call Out of Loop
• Optimization: move the call to vec_length out of the inner loop.
  – Its value does not change from one iteration to the next.
  – This is code motion.
• CPE: 20.66 (compiled -O2).
  – vec_length requires only constant time, but it carries significant call overhead.

void combine2(vec_ptr v, int *dest)
{
    int i;
    int length = vec_length(v);   /* called once, outside the loop */
    *dest = 0;
    for (i = 0; i < length; i++) {
        int val;
        get_vec_element(v, i, &val);
        *dest += val;
    }
}

Page 25: Performance and Optimization

Arrays optimizations - Loops
Arrays and loops optimizations:
• No need for a loop variable.
• Using pointer arithmetic I: instead of increasing the loop variable by one, increase the pointer by the size of the data type.
• Using pointer arithmetic II: compute the address of the final array element, and use a comparison against this address as the loop test (do-while loop).
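The slide's own code is not in the transcript; here is a small sketch of the second variant under the description above (our construction, and assuming length > 0, as a do-while loop requires):

/* Sum an int array using pointer arithmetic and an end-address loop test. */
int sum_ptr(int *data, int length) {
    int *p = data;
    int *end = data + length;   /* one past the final element */
    int sum = 0;
    do {
        sum += *p;
        p++;                    /* advances by sizeof(int) bytes */
    } while (p < end);
    return sum;
}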

Page 26: Performance and Optimization

Optimization techniques (C++ oriented)
• For loops, use ++i instead of i++:
  – i++ must be able to return the unincremented original value and therefore store it, whereas ++i can return the incremented value without storing the previous one. (Mostly an issue for old compilers, but still good practice, because you never know what machine will run your code.)
• Especially on non-primitive types, x += a is more efficient than x = x + a. (This probably doesn't matter to most compilers nowadays, but x = x + a once evaluated x twice.)
• Count down to 0 instead of up: it is usually faster to compare against 0.

Page 27: Performance and Optimization

Reduction in Strength
• Optimization: avoid a procedure call to retrieve each vector element.
  – Get a pointer to the start of the array before the loop.
  – Within the loop, just do an array reference.
  – Not as clean in terms of data abstraction.
• CPE: 6.00 (compiled -O2).
  – Procedure calls are expensive!
  – Bounds checking is expensive!

void combine3(vec_ptr v, int *dest)
{
    int i;
    int length = vec_length(v);
    int *data = get_vec_start(v);   /* direct access to the array */
    *dest = 0;
    for (i = 0; i < length; i++) {
        *dest += data[i];
    }
}

Page 28: Performance and Optimization

Eliminate Unneeded Memory Refs
• Optimization:
  – No need to store to the destination until the end.
  – The local variable sum is held in a register.
  – Avoids 1 memory read and 1 memory write per iteration.
• CPE: 2.00 (compiled -O2).
  – Memory references are expensive!

void combine4(vec_ptr v, int *dest)
{
    int i;
    int length = vec_length(v);
    int *data = get_vec_start(v);
    int sum = 0;                  /* accumulate in a register */
    for (i = 0; i < length; i++)
        sum += data[i];
    *dest = sum;                  /* single store at the end */
}

Page 29: Performance and Optimization

Detecting Unneeded Memory Refs.
• Performance:
  – Combine3: 5 instructions in 6 clock cycles; the addl must read and write memory.
  – Combine4: 4 instructions in 2 clock cycles.

Combine3:
.L18:
    movl (%ecx,%edx,4),%eax
    addl %eax,(%edi)
    incl %edx
    cmpl %esi,%edx
    jl .L18

Combine4:
.L24:
    addl (%eax,%edx,4),%ecx
    incl %edx
    cmpl %esi,%edx
    jl .L24

Page 30: Performance and Optimization

Optimization Blocker: Memory Aliasing
• Aliasing: two different memory references specify a single location.
• Example, with v = [3, 2, 17] and dest = get_vec_start(v)+2, so that dest aliases v's last element:
  – combine3(v, get_vec_start(v)+2) --> v becomes [3, 2, 10]: *dest = 0 destroys the original 17, and the final iteration adds the partial sum stored in v[2] (5) to itself.
  – combine4(v, get_vec_start(v)+2) --> v becomes [3, 2, 22]: the sum is accumulated in a register, so the original 17 is still read.
• Observations:
  – This is easy to have happen in C, since address arithmetic and direct access to storage structures are allowed.
  – Get in the habit of introducing local variables when accumulating within loops: it is your way of telling the compiler not to check for aliasing.
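A runnable sketch of the example above, relying on combine3, combine4, and the ADT sketch given earlier (the driver is our construction, not from the slides):

#include <stdio.h>

int main(void) {
    int a[3] = {3, 2, 17};
    vec_rec v = {3, a};          /* vector over the array a */

    combine3(&v, &a[2]);         /* dest aliases the last element */
    printf("combine3: [%d, %d, %d]\n", a[0], a[1], a[2]);  /* [3, 2, 10] */

    a[0] = 3; a[1] = 2; a[2] = 17;                         /* reset */
    combine4(&v, &a[2]);
    printf("combine4: [%d, %d, %d]\n", a[0], a[1], a[2]);  /* [3, 2, 22] */
    return 0;
}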

Page 31: Performance and Optimization

Loop Unrolling
• Optimization:
  – Combine multiple iterations into a single loop body.
  – Amortizes loop overhead across multiple iterations.
  – Finish any extra elements at the end.
• Measured CPE = 1.33.

void combine5(vec_ptr v, int *dest)
{
    int length = vec_length(v);
    int limit = length - 2;
    int *data = get_vec_start(v);
    int sum = 0;
    int i;
    /* Combine 3 elements at a time */
    for (i = 0; i < limit; i += 3) {
        sum += data[i] + data[i+1] + data[i+2];
    }
    /* Finish any remaining elements */
    for (; i < length; i++) {
        sum += data[i];
    }
    *dest = sum;
}

Page 32: Performance and Optimization

Parallel Loop Unrolling
• Code version: integer product.
• Optimization:
  – Accumulate in two different products, which can be computed simultaneously.
  – Combine them at the end.
  – Splitting the accumulation breaks the sequential dependence between iterations, so the hardware can overlap the multiplies.
• Performance: CPE = 2.0 -- a 2X improvement.

void combine6(vec_ptr v, int *dest)
{
    int length = vec_length(v);
    int limit = length - 1;
    int *data = get_vec_start(v);
    int x0 = 1;
    int x1 = 1;
    int i;
    /* Combine 2 elements at a time */
    for (i = 0; i < limit; i += 2) {
        x0 *= data[i];
        x1 *= data[i+1];
    }
    /* Finish any remaining elements */
    for (; i < length; i++) {
        x0 *= data[i];
    }
    *dest = x0 * x1;
}

Page 33: Performance and Optimization

Code Optimizing
Time:
• Use efficient algorithms.
• Move constant calculations outside of loops.
• Access memory as little as possible.
• Use more efficient instructions (shift vs. multiply).
• Minimize calls to inefficient functions.

Memory:
• Use the smallest data types that fit.
• Use more efficient structures that allow the same functionality with less memory (see example later).
• Use as few variables as possible.
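The example referred to above does not survive in the transcript. As a stand-in, here is a common illustration of both memory points (our own): choosing the smallest sufficient types and ordering struct members from largest to smallest reduces padding.

#include <stdint.h>
#include <stdio.h>

/* 24 bytes on a typical 64-bit ABI: each char is padded so the
   following wider field stays aligned. */
struct wasteful {
    char    flag1;
    double  value;
    char    flag2;
    int32_t count;
};

/* 16 bytes: largest members first, smallest types that fit. */
struct compact {
    double  value;
    int32_t count;
    uint8_t flag1;
    uint8_t flag2;
};

int main(void) {
    printf("wasteful: %zu bytes, compact: %zu bytes\n",
           sizeof(struct wasteful), sizeof(struct compact));
    return 0;
}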