
1

Code Optimization

2

Outline

• Optimization Blockers
  – Memory aliasing
  – Side effects in function calls
• Understanding Modern Processors
  – Superscalar
  – Out-of-order execution
• More Code Optimization Techniques
• Performance Tuning
• Suggested reading
  – 5.1, 5.7 ~ 5.16

3

5.1 Capabilities and Limitations of Optimizing Compilers

Review of:
5.3 Program Example
5.4 Eliminating Loop Inefficiencies
5.5 Reducing Procedure Calls
5.6 Eliminating Unneeded Memory References

4

void combine1(vec_ptr v, data_t *dest)
{
    int i;
    *dest = IDENT;

    for (i = 0; i < vec_length(v); i++) {
        int val;
        get_vec_element(v, i, &val);
        *dest = *dest OPER val;
    }
}

Example P387

5

void combine2(vec_ptr v, int *dest)
{
    int i;
    int length = vec_length(v);
    *dest = IDENT;

    for (i = 0; i < length; i++) {
        int val;
        get_vec_element(v, i, &val);
        *dest = *dest OPER val;
    }
}

Example P388

6

void combine3(vec_ptr v, int *dest)
{
    int i;
    int length = vec_length(v);
    int *data = get_vec_start(v);

    *dest = IDENT;
    for (i = 0; i < length; i++) {
        *dest = *dest OPER data[i];
    }
}

Example P392

7

void combine4(vec_ptr v, int *dest)
{
    int i;
    int length = vec_length(v);
    int *data = get_vec_start(v);
    int x = IDENT;

    for (i = 0; i < length; i++)
        x = x OPER data[i];
    *dest = x;
}

Example P394

8

Machine Independent Opt. Results

• Optimizations
  – Reduce function calls and memory references within the loop

9

Machine Independent Opt. Results

• Performance Anomaly
  – Computing the FP product of all elements is exceptionally slow
  – Very large speedup when accumulating in a temporary
  – Memory uses a 64-bit format, registers use 80 bits
  – The benchmark data caused overflow of 64 bits, but not of 80

Method                        Integer +   Integer *    FP +     FP *
Abstract -g      (Combine1)      42.06       41.86     41.44   160.00
Abstract -O2     (Combine1)      31.25       33.25     31.25   143.00
Move vec_length  (Combine2)      22.61       21.25     21.15   135.00
Data access      (Combine3)       6.00        9.00      8.00   117.00
Accum. in temp   (Combine4)       2.00        4.00      3.00     5.00

(Combine1 P385, Combine2 P388, Combine3 P392, Combine4 P394)

10

Optimization Blockers P394

void combine4(vec_ptr v, int *dest)

{

int i;

int length = vec_length(v);

int *data = get_vec_start(v);

int sum = 0;

for (i = 0; i < length; i++)

sum += data[i];

*dest = sum;

}

11

Optimization Blocker: Memory Aliasing P394

• Aliasing
  – Two different memory references specify a single location
• Example (worked out below)
  – v: [3, 2, 17]
  – combine3(v, get_vec_start(v)+2) --> ?
  – combine4(v, get_vec_start(v)+2) --> ?
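A worked trace of this example, assuming OPER is + and IDENT is 0 (the summation case). With dest aliased to v[2], combine3 keeps overwriting the element it has yet to read, while combine4 reads everything before writing:

combine3(v, get_vec_start(v)+2):   /* dest aliases v[2] */
  init:  *dest = 0             →  v = [3, 2, 0]
  i = 0: *dest = 0 + v[0] = 3  →  v = [3, 2, 3]
  i = 1: *dest = 3 + v[1] = 5  →  v = [3, 2, 5]
  i = 2: *dest = 5 + v[2] = 10 →  v = [3, 2, 10]

combine4(v, get_vec_start(v)+2):   /* accumulates in x first */
  x = 0 + 3 + 2 + 17 = 22, then *dest = 22  →  v = [3, 2, 22]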

12

Optimization Blocker: Memory Aliasing

• Observations

– Easy to have happen in C

• Since allowed to do address arithmetic

• Direct access to storage structures

– Get in the habit of introducing local variables (see the sketch below)

  • Accumulating within loops

  • Your way of telling the compiler not to check for aliasing
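A minimal sketch of that habit, using a plain int array rather than the slides' vector type (the function name and signature here are illustrative):

int sum_into(const int *a, int n, int *dest)
{
    int i;
    int sum = 0;          /* local accumulator: cannot alias a[] */

    for (i = 0; i < n; i++)
        sum += a[i];      /* compiler can keep sum in a register */

    *dest = sum;          /* a single memory write at the end */
    return sum;
}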

13

Optimizing Compilers

• Provide efficient mapping of program to machine
  – register allocation
  – code selection and ordering
  – eliminating minor inefficiencies

14

Optimizing Compilers

• Don't (usually) improve asymptotic efficiency
  – up to the programmer to select the best overall algorithm
  – big-O savings are (often) more important than constant factors
    • but constant factors also matter
• Have difficulty overcoming "optimization blockers"
  – potential memory aliasing
  – potential procedure side effects

15

Limitations of Optimizing Compilers

• Operate under a fundamental constraint
  – Must not cause any change in program behavior under any possible condition
  – This often prevents optimizations that would only affect behavior under pathological conditions

16

Limitations of Optimizing Compilers

• Behavior that may be obvious to the programmer can be obfuscated by languages and coding styles
  – e.g., data ranges may be more limited than variable types suggest
    • e.g., using an "int" in C for what could be an enumerated type

17

Limitations of Optimizing Compilers

• Most analysis is performed only within procedures
  – whole-program analysis is too expensive in most cases
• Most analysis is based only on static information
  – the compiler has difficulty anticipating run-time inputs
• When in doubt, the compiler must be conservative

18

Optimization Blockers P380

• Memory aliasing

void twiddle1(int *xp, int *yp)
{
    *xp += *yp;
    *xp += *yp;
}

void twiddle2(int *xp, int *yp)
{
    *xp += 2 * *yp;
}
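A small sketch of why the compiler cannot rewrite twiddle1 as twiddle2: when the two pointers alias, the results differ (the driver code below is illustrative):

int x = 2, y = 2;
twiddle1(&x, &x);   /* *xp += *yp twice: 2 → 4 → 8 */
twiddle2(&y, &y);   /* *xp += 2 * *yp once: 2 → 6  */
/* x == 8 but y == 6: the two versions disagree when xp == yp */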

19

Optimization Blockers P381

• Function calls and side effects

int f(int);

int func1(int x)
{
    return f(x) + f(x) + f(x) + f(x);
}

int func2(int x)
{
    return 4 * f(x);
}

20

Optimization Blockers P381

• Function calls and side effects

int counter = 0;

int f(int x)
{
    return counter++;
}
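Working this out with counter starting at 0: func1 returns f(x)+f(x)+f(x)+f(x) = 0+1+2+3 = 6 and leaves counter at 4, whereas func2 returns 4*f(x) = 4*0 = 0 and leaves counter at 1. Because f has a side effect, the compiler cannot safely substitute one form for the other.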

21

5.7 Understanding Modern Processors

22

Modern CPU Design Figure 5.11 P396

[Figure 5.11: block diagram of a modern processor — an Instruction Control unit (Fetch Control, Instruction Decode, Instruction Cache) generates operations for the Execution unit's functional units (Integer/Branch, General Integer, FP Add, FP Mult/Div, Load, Store); a Data Cache serves load/store addresses and data, branch-prediction feedback flows back to fetch control, and operation results pass through the Retirement Unit to update the Register File]

23

[The same block diagram repeated with its major components numbered, matching the numbering 1)–4) and (7) used on the following slides]

24

Modern Processor P396

• Superscalar
  – Performs multiple operations on every clock cycle
• Out-of-order execution
  – The order in which instructions execute need not correspond to their ordering in the assembly program

25

Modern Processor P396

• Two main parts
  – Instruction Control Unit (ICU)
    • Responsible for reading a sequence of instructions from memory
    • Generates from those instructions a set of primitive operations to perform on program data
  – Execution Unit (EU)

26

1) Instruction Control Unit

• Instruction Cache
  – A special, high-speed memory containing the most recently accessed instructions

27

1) Instruction Control Unit

• Instruction Decoding Logic
  – Takes actual program instructions
  – Converts them into a set of primitive operations
  – Each primitive operation performs some simple task
    • Simple arithmetic, Load, Store
    • addl %eax, 4(%edx) --- three operations:
        load 4(%edx) → t1
        addl %eax, t1 → t2
        store t2, 4(%edx)
  – Register renaming

P397, P398

28

2) Fetch Control

• Fetch Ahead P396
  – Fetches well ahead of the currently accessed instructions
  – The ICU has enough time to decode them
  – The ICU has enough time to send the decoded operations down to the EU

29

Fetch Control

• Branch Prediction P397
  – Branch taken or fall through
  – Guess whether the branch is taken or not
• Speculative Execution P397
  – Fetch, decode and execute instructions according to the branch prediction
  – Before the branch outcome has actually been determined

30

Multi-functional Units

• Multiple instructions can execute in parallel
  – 1 load
  – 1 store
  – 2 integer (one may be branch)
  – 1 FP addition
  – 1 FP multiplication or division

31

Multi-functional Units Figure 5.12 P400

• Some instructions take > 1 cycle, but can be pipelined:

  Instruction                  Latency   Cycles/Issue
  Load / Store                    3           1
  Integer Multiply                4           1
  Integer Divide                 36          36
  Double/Single FP Multiply       5           2
  Double/Single FP Add            3           1
  Double/Single FP Divide        38          38
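These latencies line up with the best CPE values in the earlier results table: when a loop's iterations form a dependency chain through one unit, the loop cannot run faster than that unit's latency — integer multiply latency 4 ⇒ CPE 4.0 for combine4's integer product, FP add latency 3 ⇒ CPE 3.0, and FP multiply latency 5 ⇒ CPE 5.0.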

32

Execution Unit

• Receives operations from the ICU
• Each cycle it may receive more than one operation
• Operations are queued in a buffer

33

Execution Unit

• An operation is dispatched to one of the functional units whenever
  – All the operands of the operation are ready
  – A suitable functional unit is available
• Execution results are passed among the functional units
• (7) Data Cache P398
  – A high-speed memory containing the most recently accessed data values

34

4) Retirement Unit P398

• Instructions need to commit in serial order
  – Misprediction
  – Exception
• Updates the architectural state
  – Memory and register values

35

Translation Example P401

.L24:                                # Loop:
  imull (%eax,%edx,4),%ecx           #   t *= data[i]
  incl %edx                          #   i++
  cmpl %esi,%edx                     #   i:length
  jl .L24                            #   if < goto Loop

Translated into execution-unit operations:

  load (%eax,%edx.0,4) → t.1
  imull t.1, %ecx.0 → %ecx.1
  incl %edx.0 → %edx.1
  cmpl %esi, %edx.1 → cc.1
  jl-taken cc.1

36

Understanding Translation Example P401

• Split into two operations
  – The load reads from memory to generate the temporary result t.1
  – The multiply operation just operates on registers

imull (%eax,%edx,4),%ecx   →   load (%eax,%edx.0,4) → t.1
                               imull t.1, %ecx.0 → %ecx.1

37

Understanding Translation Example P401

• Operands
  – Register %eax does not change in the loop
  – Its value will be retrieved from the register file during decoding

imull (%eax,%edx,4),%ecx   →   load (%eax,%edx.0,4) → t.1
                               imull t.1, %ecx.0 → %ecx.1

38

Understanding Translation Example P401

• Operands
  – Register %ecx changes on every iteration
  – Uniquely identify the different versions as
    • %ecx.0, %ecx.1, %ecx.2, …
  – Register renaming
    • Values passed directly from producer to consumers

imull (%eax,%edx,4),%ecx   →   load (%eax,%edx.0,4) → t.1
                               imull t.1, %ecx.0 → %ecx.1

39

Understanding Translation Example P402

• Register %edx changes on each iteration

• Renamed as %edx.0, %edx.1, %edx.2, …

incl %edx   →   incl %edx.0 → %edx.1

40

Understanding Translation Example P402

• Condition codes are treated similarly to registers
• Assign a tag to define the connection between producer and consumer

cmpl %esi,%edx   →   cmpl %esi, %edx.1 → cc.1

41

Understanding Translation Example P402

• The instruction control unit determines the destination of the jump
• Predicts whether the branch will be taken
• Starts fetching instructions at the predicted destination

jl .L24   →   jl-taken cc.1

42

Understanding Translation Example P401

• The execution unit simply checks whether or not the prediction was OK
• If not, it signals instruction control
  – Instruction control then "invalidates" any operations generated from the misfetched instructions
  – Begins fetching and decoding instructions at the correct target

jl .L24   →   jl-taken cc.1

43

• Operations
  – Vertical position denotes the time at which an operation executes
    • Cannot begin an operation until its operands are available
  – Height denotes latency
• Operands
  – Arcs are shown only for operands that are passed within the execution unit

  load (%eax,%edx.0,4) → t.1
  imull t.1, %ecx.0 → %ecx.1
  incl %edx.0 → %edx.1
  cmpl %esi, %edx.1 → cc.1
  jl-taken cc.1

[Figure: data-flow timing of one iteration — the load feeds imull on the %ecx chain, while incl, cmpl, and jl form the short %edx / condition-code chain; the vertical axis is time]

Visualizing Operations Figure 5.13 P403

44

• Operations
  – Same as before, except that the add has a latency of 1

  load (%eax,%edx,4) → t.1
  iaddl t.1, %ecx.0 → %ecx.1
  incl %edx.0 → %edx.1
  cmpl %esi, %edx.1 → cc.1
  jl-taken cc.1

[Figure: the same data-flow graph with addl in place of imull; the 1-cycle add no longer dominates the iteration]

Visualizing Operations Figure 5.14 P403

45

[Figure 5.15: three iterations of the product loop (i = 0, 1, 2) plotted over cycles 1–15 — each iteration's load feeds an imull, and the imulls of successive iterations form a serial chain through %ecx.1, %ecx.2, %ecx.3]

• Unlimited Resource Analysis
  – Assume an operation can start as soon as its operands are available
  – Operations for multiple iterations overlap in time
• Performance
  – The limiting factor becomes the latency of the integer multiplier
  – Gives a CPE of 4.0

3 Iterations of Combining Product Figure 5.15 P404

46

• Unlimited Resource Analysis
• Performance
  – Can begin a new iteration on each clock cycle
  – Should give a CPE of 1.0
  – Would require executing 4 integer operations in parallel

[Figure 5.16: four iterations of the sum loop (i = 0..3) plotted over cycles 1–7 — with addl's 1-cycle latency the iterations overlap almost completely, so a new iteration could begin every cycle given 4 integer units]

4 Iterations of Combining Sum Figure 5.16 P405

47

[Figure 5.18: iterations 4–8 of the sum loop (i = 3..7) plotted over cycles 6–18 — with only two integer units, some operations are delayed even though their operands are ready, and iterations complete at a steady rate of one every two cycles]

Combining Sum: Resource Constraints Figure 5.18 P408

48

Combining Sum: Resource Constraints

• Only two integer functional units are available
• Some operations are delayed even though their operands are available
• Priority is set based on program order
• Performance
  – Sustains a CPE of 2.0
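A back-of-the-envelope check of that figure: each iteration issues four integer-unit operations (addl, incl, cmpl, jl), and with only two integer units at most two of them can start per cycle, so an iteration needs at least 4/2 = 2 cycles — a CPE of 2.0.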

49

5.9 Converting to Pointer Code

50

void combine4p(vec_ptr v, int *dest)
{
    int length = vec_length(v);
    int *data = get_vec_start(v);
    int *dend = data + length;
    int x = IDENT;

    for (; data < dend; data++)
        x = x OPER *data;
    *dest = x;
}

Example P413

51

• Some compilers and processors do a better job of optimizing array code

Function     Integer +   Integer *   FP +    FP *
Combine4        2.00        4.00     3.00    5.00
Combine4p       3.00        4.00     3.00    5.00

Pointer Code vs. Array Code P414

52

.L24:                         # Loop (array code):
  addl (%eax,%edx,4),%ecx     #   x += data[i]
  incl %edx                   #   i++
  cmpl %esi,%edx              #   i:length
  jl .L24                     #   if < goto Loop

.L30:                         # Loop (pointer code):
  addl (%eax),%ecx            #   x += *data
  addl $4,%eax                #   data++
  cmpl %edx,%eax              #   data:dend
  jb .L30                     #   if < goto Loop

Pointer vs. Array Code Inner Loops P414

53

• Performance
  – Array code: 4 instructions in 2 clock cycles
  – Pointer code: almost the same 4 instructions in 3 clock cycles

Pointer vs. Array Code Inner Loops

54

5.8 Reducing Loop Overhead

55

Loop unrolling P409

void combine5(vec_ptr v, int *dest)
{
    int i;
    int length = vec_length(v);
    int *data = get_vec_start(v);
    int x = IDENT;

    /* combine 3 elements at a time */
    for (i = 0; i < length-2; i += 3)
        x = x OPER data[i] OPER data[i+1] OPER data[i+2];

    /* finish any remaining elements */
    for (; i < length; i++)
        x = x OPER data[i];
    *dest = x;
}

56

– Loads can pipeline, since they have no dependencies
– Only one set of loop-control operations

  load (%eax,%edx.0,4) → t.1a
  iaddl t.1a, %ecx.0c → %ecx.1a
  load 4(%eax,%edx.0,4) → t.1b
  iaddl t.1b, %ecx.1a → %ecx.1b
  load 8(%eax,%edx.0,4) → t.1c
  iaddl t.1c, %ecx.1b → %ecx.1c
  iaddl $3, %edx.0 → %edx.1
  cmpl %esi, %edx.1 → cc.1
  jl-taken cc.1

Visualizing Unrolled Loop P410

57

[Figure 5.20: data-flow graph of one unrolled iteration — three loads feed a serial chain of three addl operations on the %ecx accumulator, with a single loop-control chain (iaddl $3, cmpl, jl); measured CPE = 1.33]

Visualizing Unrolled Loop Figure 5.20 P410

58

[Figure 5.21: iterations 3 and 4 of the unrolled loop (i = 6 and i = 9) plotted over cycles 5–15 — in practice one unrolled iteration completes about every 4 cycles]

Executing with Loop Unrolling Figure 5.21 P411

59

Executing with Loop Unrolling

• Predicted Performance
  – Can complete an iteration in 3 cycles
  – Should give a CPE of 1.0
• Measured Performance
  – CPE of 1.33
  – One 3-element iteration every 4 cycles (4/3 ≈ 1.33)

60

Unrolling Degree        1      2      3      4      8      16

Integer Sum           2.00   1.50   1.33   1.50   1.25   1.06
Integer Product       4.00
FP Sum                3.00
FP Product            5.00

Effect of Unrolling P411

61

Effect of Unrolling

• Only helps the integer sum for our examples
  – The other cases are constrained by functional unit latencies
• The effect is nonlinear with the degree of unrolling
  – Many subtle effects determine the exact scheduling of operations

62

5.10 Enhancing Parallelism

63

void combine6(vec_ptr v, int *dest)
{
    int i;
    int length = vec_length(v);
    int *data = get_vec_start(v);
    int x0 = IDENT, x1 = IDENT;

    /* combine 2 elements at a time */
    for (i = 0; i < length-1; i += 2) {
        x0 = x0 OPER data[i];
        x1 = x1 OPER data[i+1];
    }

    /* finish any remaining elements */
    for (; i < length; i++)
        x0 = x0 OPER data[i];
    *dest = x0 OPER x1;
}

Loop Splitting P409

64

Loop Splitting

• Optimization
  – Accumulate in two different sums
    • Can be performed simultaneously
  – Combine them at the end
  – Exploits the property that integer addition & multiplication are associative & commutative
  – FP addition & multiplication are not associative, but the transformation is usually acceptable
    (Associative: grouping of operands does not matter; Commutative: order of operands does not matter)
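A small example of why FP addition is not associative (the values are chosen purely for illustration): in double precision, (1e20 + -1e20) + 3.14 evaluates to 3.14, whereas 1e20 + (-1e20 + 3.14) evaluates to 0.0, because 3.14 is completely lost when added to 1e20. Splitting the loop into two accumulators changes the grouping of the FP operations in exactly this way, which is why the transformation is only "usually" acceptable.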

65

  load (%eax,%edx.0,4) → t.1a
  imull t.1a, %ecx.0 → %ecx.1
  load 4(%eax,%edx.0,4) → t.1b
  imull t.1b, %ebx.0 → %ebx.1
  iaddl $2, %edx.0 → %edx.1
  cmpl %esi, %edx.1 → cc.1
  jl-taken cc.1

Visualizing Parallel Loop P417

• The two multiplies within the loop no longer have a data dependency
• This allows them to pipeline

66

[Figure 5.25: data-flow graph of one parallel-loop iteration — two independent load/imull chains update %ecx and %ebx, alongside the addl/cmpl/jl loop-control chain]

Visualizing Parallel Loop Figure 5.25 P417

67

[Figure 5.26: iterations 1–3 of the parallel loop (i = 0, 2, 4) plotted over cycles 1–16 — the two multiply chains overlap in time, keeping the multiplier busy and roughly halving the CPE]

Executing with Parallel Loop Figure 5.26 P418

68

Method               Integer +   Integer *   FP +    FP *
Unroll 4                1.50        4.00     3.00    5.00
Unroll 16               1.06        4.00     3.00    5.00
2 x 2                   1.50        2.00     2.00    2.50
4 x 4                   1.50        2.00     1.50    2.50
8 x 4                   1.25        1.25     1.50    2.00
Theoretical Opt.        1.00        1.00     1.00    2.00

Optimization Results for Combining P419

69

Optimization Results for Combining

• Register spilling
  – Only 6 usable integer registers are available (IA32)
  – Memory (the stack) must be used as storage for the extra accumulators
  – Example:
      movl -12(%ebp), %edi
      imull 24(%eax), %edi
      movl %edi, -12(%ebp)

70

5.11 Putting it Together: Summary of Results for Optimizing Combining Code

71

5.12 Branch Prediction and Misprediction Penalties

72

What About Branches?

• Challenge
  – The Instruction Control Unit must work well ahead of the Execution Unit
  – To generate enough operations to keep the EU busy

73

  80489f3: movl $0x1,%ecx
  80489f8: xorl %edx,%edx
  80489fa: cmpl %esi,%edx
  80489fc: jnl 8048a25
  80489fe: movl %esi,%esi
  8048a00: imull (%eax,%edx,4),%ecx

[Figure: the earlier instructions above are executing while the ICU is already fetching & decoding the later ones]

What About Branches?

74

What About Branches?

• Challenge
  – When it encounters a conditional branch, the ICU cannot reliably determine where to continue fetching

75

Branch Outcomes

• When it encounters a conditional branch, the ICU cannot determine where to continue fetching
  – Branch Taken: transfer control to the branch target
  – Branch Not-Taken: continue with the next instruction in sequence
• Cannot resolve until the outcome is determined by the branch/integer unit

76

  80489f3: movl $0x1,%ecx
  80489f8: xorl %edx,%edx
  80489fa: cmpl %esi,%edx
  80489fc: jnl 8048a25
  80489fe: movl %esi,%esi              <- Branch Not-Taken continues here
  8048a00: imull (%eax,%edx,4),%ecx

  8048a25: cmpl %edi,%edx              <- Branch Taken continues here
  8048a27: jl 8048a20
  8048a29: movl 0xc(%ebp),%eax
  8048a2c: leal 0xffffffe8(%ebp),%esp
  8048a2f: movl %ecx,(%eax)

Branch Outcomes

77

Branch Prediction

• Idea
  – Guess which way the branch will go
  – Begin executing instructions at the predicted position
    • But don't actually modify register or memory data

78

  80489f3: movl $0x1,%ecx
  80489f8: xorl %edx,%edx
  80489fa: cmpl %esi,%edx
  80489fc: jnl 8048a25                 <- Predict Taken
  . . .

  8048a25: cmpl %edi,%edx              <- Execute speculatively from here
  8048a27: jl 8048a20
  8048a29: movl 0xc(%ebp),%eax
  8048a2c: leal 0xffffffe8(%ebp),%esp
  8048a2f: movl %ecx,(%eax)

Branch Prediction

79

Assume vector length = 100

  80488b1: movl (%ecx,%edx,4),%eax
  80488b4: addl %eax,(%edi)
  80488b6: incl %edx
  80488b7: cmpl %esi,%edx
  80488b9: jl 80488b1

[Figure: this loop body repeated for i = 98, 99, 100, 101 — the branch is predicted taken after i = 98 (OK) and again after i = 99 (Oops); the i = 98 and i = 99 bodies have executed, while the i = 100 and i = 101 bodies have only been fetched and would read an invalid location]

Branch Prediction Through Loop

80

Assume vector length = 100

  80488b1: movl (%ecx,%edx,4),%eax
  80488b4: addl %eax,(%edi)
  80488b6: incl %edx
  80488b7: cmpl %esi,%edx
  80488b9: jl 80488b1

[Figure: the same loop body repeated for i = 98, 99, 100, 101 — once the misprediction after i = 99 is detected, the speculatively fetched operations for i = 100 and i = 101 are invalidated]

Branch Misprediction Invalidation

81

Assume vector length = 100

i = 98  (Predict Taken, OK):
  80488b1: movl (%ecx,%edx,4),%eax
  80488b4: addl %eax,(%edi)
  80488b6: incl %edx
  80488b7: cmpl %esi,%edx
  80488b9: jl 80488b1

i = 99  (Definitely not taken):
  80488b1: movl (%ecx,%edx,4),%eax
  80488b4: addl %eax,(%edi)
  80488b6: incl %edx
  80488b7: cmpl %esi,%edx
  80488b9: jl 80488b1
  80488bb: leal 0xffffffe8(%ebp),%esp
  80488be: popl %ebx
  80488bf: popl %esi
  80488c0: popl %edi

Branch Misprediction Recovery

82

Branch Misprediction Recovery P427

• Performance Cost
  – A misprediction on the Pentium III wastes ~14 clock cycles
  – That's a lot of time on a high-performance processor

83

• The misprediction penalty is about 14 cycles on a PIII machine
• A conditional move is used to avoid the misprediction penalty when the branch outcome is not predictable
• For example:

int absval(int val)
{
    return (val < 0) ? -val : val;
}

Conditional Jump Figure 5.29 P427

84

Conditional Jump P428

movl 8(%ebp), %eax       Get val as result
movl %eax, %edx          Copy to %edx
negl %edx                Negate %edx
testl %eax, %eax         Test val
cmovl %edx, %eax         If < 0, copy %edx to result

85

5.13 Understanding Memory Performance

86

typedef struct ELE {

struct ELE *next ;

int data ;

} list_ele, *list_ptr ;

int list_len(list_ptr ls)

{

int len = 0 ;

for (;ls;ls=ls->next)

len++ ;

return len ;

}

Assembly instructions:

.L27:
  incl %eax
  movl (%edx), %edx
  testl %edx, %edx
  jne .L27

Execution unit operations:

  incl %eax.0 → %eax.1
  load (%edx.0) → %edx.1
  testl %edx.1, %edx.1 → cc.1
  jne-taken cc.1

Load Latency P429, P430

Figure 5.30 P430
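One way to read the trace that follows, consistent with the 3-cycle load latency in Figure 5.12: each iteration's load (%edx.i-1) → %edx.i cannot start until the previous load has delivered the pointer it dereferences, so the loads serialize and the traversal cannot run faster than about one element every 3 cycles (CPE ≈ 3.0).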

87

[Figure 5.31: three iterations of the list_len loop (i = 1, 2, 3) plotted over cycles 1–11 — each load (%edx.i-1) → %edx.i must wait for the previous load to complete, so the 3-cycle load latency serializes the iterations]

Figure 5.31 P430

88

Store Latency Figure 5.32 P431

void array_clear(int *dest, int n)

{

int i;

for ( i = 0 ; i < n ; i++)

dest[i] = 0 ;

}

CPE 2.0

89

Store Latency Figure 5.32 P431

void array_clear_8(int *dest, int n)
{
    int i;
    int len = n - 7;

    for (i = 0; i < len; i += 8) {
        dest[i]   = dest[i+1] = dest[i+2] = dest[i+3] = 0;
        dest[i+4] = dest[i+5] = dest[i+6] = dest[i+7] = 0;
    }
    for (; i < n; i++)
        dest[i] = 0;
}

CPE 1.25

90

Store latency Figure 5.33 P432

void write_read(int *src, int *dest, int n)

{

int cnt = n;

int val = 0;

while (cnt--) {

*dest = val;

val = (*src)+1;

}

}

91

Store latency Figure 5.33 P432

write_read(&a[0], &a[1], 3):
         initial     iter. 1     iter. 2     iter. 3
  cnt       3           2           1           0
  a     (-10, 17)   (-10, 0)    (-10, -9)   (-10, -9)
  val       0          -9          -9          -9

write_read(&a[0], &a[0], 3):
         initial     iter. 1     iter. 2     iter. 3
  cnt       3           2           1           0
  a     (-10, 17)    (0, 17)     (1, 17)     (2, 17)
  val       0           1           2           3

92

Store latency

void write_read(int *src, int *dest, int n)

{

int cnt = n;

int val = 0;

while (cnt--) {

*dest = val;

val = (*src)+1;

}

}

93

Store latency P434

.L32:
  movl %edx, (%ecx)
  movl (%ebx), %edx
  incl %edx
  decl %eax
  jnc .L32

Execution unit operations:

  storeaddr (%ecx)
  storedata %edx.0
  load (%ebx) → %edx.1a
  incl %edx.1a → %edx.1b
  decl %eax.0 → %eax.1
  jnc-taken cc.1

94

[Figure 5.35: two iterations of the write_read loop plotted over cycles 1–7 when src and dest are distinct — the load does not depend on the preceding store, so the iterations pipeline]

Store latency Figure 5.35 P434

95

[Figure 5.36: the same two iterations when src and dest refer to the same address — the load must wait for the preceding storedata (its address matches the storeaddr), creating a store/load dependency chain that stretches out each iteration]

Figure 5.36 P435

96

5.14 Life in the Real World: Performance Improvement Techniques

97

5.15 Identifying and Eliminating Performance Bottlenecks

98

Performance Tuning

• Identify
  – Which is the hottest part of the program
  – Using a very useful method: profiling
    • Instrument the program
    • Run it with typical input data
    • Collect information from the result
    • Analyze the result
  – gprof example:
      $ gcc -O2 -pg prog.c -o prog
      $ ./prog file.txt     (generates the file gmon.out)
      $ gprof prog          (reads gmon.out)

99

Example

• Task
  – Count word frequencies in a text document
  – Sort the words in descending order of occurrence
• Steps
  – Convert strings to lowercase
  – Apply a hash function
  – Read words and insert them into a hash table
    • Mostly list operations
    • Maintain a counter for each unique word
  – Sort the results

100

Examples

unix> gcc –O2 –pg prog.c –o prog

unix> ./prog file.txt

unix> gprof prog

  %        cumulative    self                self      total
 time        seconds    seconds     calls   ms/call   ms/call   name
86.60          8.21       8.21          1   8210.00   8210.00   sort_words
 5.80          8.76       0.55     946596      0.00      0.00   lower1
 4.75          9.21       0.45     946596      0.00      0.00   find_ele_rec
 1.27          9.33       0.12     946596      0.00      0.00   h_add


102

                              4872758              find_ele_rec [5]
                0.60   0.01    946596/946596       insert_string [4]
[5]      6.7    0.60   0.01    946596+4872758      find_ele_rec [5]
                0.00   0.01     26946/26946        save_string [9]
                0.00   0.00     26946/26946        new_ele [11]
                              4872758              find_ele_rec [5]

Example P439

103

Principle

• Interval counting
  – Maintain a counter for each function
    • Records the time spent executing that function
  – The program is interrupted at regular intervals (1 ms)
    • Check which function is executing when the interrupt occurs
    • Increment the counter for that function

104

Data Set P439

• Collected works of Shakespeare
• 946,596 total words, 26,596 unique
• Initial implementation: 9.2 seconds

105

Code Optimizations

– First step: use a more efficient sorting function
– Library function qsort

[Figure 5.37 P441: CPU seconds for each version — 1) Initial, 2) Quicksort, 3) Iter First, 4) Iter Last, 5) Big Table, 6) Better Hash, 7) Linear Lower — broken down into time spent in Sort, List, Lower, Hash, and Rest; the y-axis runs from 0 to 10 seconds]

106

Further Optimizations

[Figure: the same breakdown zoomed in to 0–2 CPU seconds for versions 1) Initial through 7) Linear Lower, showing the remaining time in Sort, List, Lower, Hash, and Rest]

107

Example

• 3) Iter first: Use iterative function to insert elements in linked list– Causes code to slow down

• 4) Iter last: Iterative function, places new entry at end of list– Tend to place most common words at front of list

• 5) Big table: Increase number of hash buckets• 6) Better hash: Use more sophisticated hash

function• 7) Linear lower: Move strlen out of loop

108

Code Motion Example #2

void lower(char *s)
{
    int i;
    for (i = 0; i < strlen(s); i++)
        if (s[i] >= 'A' && s[i] <= 'Z')
            s[i] -= ('A' - 'a');
}

Equivalent goto form, showing that strlen is evaluated on every iteration:

void lower(char *s)
{
    int i = 0;
    if (i >= strlen(s))
        goto done;
  loop:
    if (s[i] >= 'A' && s[i] <= 'Z')
        s[i] -= ('A' - 'a');
    i++;
    if (i < strlen(s))
        goto loop;
  done:
    ;
}

109

Lower Case Conversion Performance

– Time quadruples when the string length doubles
– Quadratic performance

[Figure: CPU seconds vs. string length (256 to 262144) for lower1, on a log-log scale]

110

• Time quadruples when the string length doubles
• Quadratic performance

[Figure: the same lower1 plot — CPU seconds vs. string length on a log-log scale]

Lower Case Conversion Performance

111

• Move the call to strlen outside of the loop
• Its result does not change from one iteration to another
• A form of code motion

void lower(char *s)
{
    int i;
    int len = strlen(s);
    for (i = 0; i < len; i++)
        if (s[i] >= 'A' && s[i] <= 'Z')
            s[i] -= ('A' - 'a');
}

Improving Performance

112

Lower Case Conversion Performance

– Time doubles when the string length doubles
– Linear performance

[Figure: CPU seconds vs. string length (256 to 262144) for lower1 and lower2 on a log-log scale — lower2 grows linearly]

113

• Benefits
  – Helps identify performance bottlenecks
  – Especially useful for complex systems with many components
• Limitations
  – Only shows performance for the data tested
    • E.g., linear lower did not show a big gain, since the words are short
    • A quadratic inefficiency could remain lurking in the code
  – The timing mechanism is fairly crude
    • Only works for programs that run for > 3 seconds

Performance Tuning

114

Tnew = (1-α)·Told + (α·Told)/k
     = Told·[(1-α) + α/k]

S = Told / Tnew = 1 / [(1-α) + α/k]

S∞ = 1 / (1-α)

Amdahl’s Law P443
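A small illustrative calculation with made-up numbers: if the part being optimized accounts for α = 0.6 of Told and is sped up by a factor k = 3, then S = 1 / ((1-0.6) + 0.6/3) = 1/0.6 ≈ 1.67; even with k → ∞, the overall speedup is capped at S∞ = 1/(1-0.6) = 2.5.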