© 2004 Wayne Wolf

Topics

Goals of program performance analysis.
Worst-case execution time analysis.
Execution-based timing analysis.

CPUs and software performance

System performance cannot be determined without choosing a CPU.
Software execution times do not scale across CPU architectures, perhaps not even within a CPU family.
Architectural features that influence program performance:
  pipelining
  caching
  bus bandwidth

Ceramic or plastic is a key design decision.

Hierarchical performance modeling

We would like to have a hierarchy of increasingly accurate performance models, from system specification to assembly code.
Very little work has been done on C-level performance modeling: it is hard to separate the program from the CPU.
Two types of questions:
  Which variety of Brand X CPU do I use?
  Should I use Brand X or Brand Y?

Caches and code speed

Worst case:
  A tight-deadline device interrupts.
  The driver is not in the cache.
  Multiple high-priority drivers knock each other out of the cache.
Cache misses cost from a few cycles on up; the faster the CPU, the more costly a miss.
Worst-case execution time is much larger than best-case, leading to extreme overengineering.

Alternative approaches to performance analysis

Conservative analysis: performance is always within bounds.
  WCET gives bounds from analysis and limited simulation.
Detailed analysis: more information on a particular case, but no bounds.
  Execution-based methods provide lots of detail, but only for the given input data.

Performance of HLLs

We would like to bound or estimate program performance from high-level code:
  simplifies identification of paths
  provides early performance estimates
Realistically, we need to know the execution platform.

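Paths and performance

Branches of a conditional have different execution times.
Loops:
  Multiple iterations.
  Varying number of iterations.

[Figure: two control-flow graphs. A conditional (if) whose true (t) and false (f) branches lead to F1() and F2(). A loop whose test i<N at Loopstart takes the true (t) branch into Loopbody and the false (f) branch out of the loop.]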

Measurements of interest

Execution time bounds: worst, best.
  Upper/lower bounds are important in multitasking systems.
Execution time of incomplete code.
  Must be able to handle time estimates for pieces of code.
Bounds of varying quality:
  Loose bounds quickly.
  Tight bounds with more effort.

Early work: explicit path analysis

Shaw developed techniques to prove bounds on the number of times through paths.

Park and Shaw developed techniques for path analysis and for measurement of execution times of HLL statements on a 68000.

Challenges and approaches

Exponential number of paths:
  Limit program constructs.
  Add annotations.
  Implicit path analysis.
Instruction times are not independent (pipeline effects, cache effects):
  State-dependent instruction execution time.
  Simulation.

Uppsala WCET tool

Four major phases:
  Analyze global program paths.
  Analyze global effect of caches, etc.
  Analyze local effects of pipelines, etc.
  Calculate final WCET.

Abstract program flow analysis

Bound the set of feasible paths without exhaustive simulation.
  May include some infeasible paths in the feasible set.
Perform abstract interpretation of the program to find feasible paths.
  Generate safe bounds on the values of variables.

Abstract interpretation example (Engblom et al.)

while (x < 4) {
S1:   if (x < 3) x = x * 2;
S2:   else x = x + 1;
S3:   if (x == 1) x = x + 2;
S4:   else x = x + 1;
}

Input limits: 0 <= x <= 3

S1: start: 0 <= x <= 2; end: 0 <= x <= 4
S2: start: 3 <= x <= 3; end: 4 <= x <= 4
S4 after S1: start: 0 <= x <= 4; end: 1 <= x <= 5
S4 after S2: start: 4 <= x <= 4; end: 5 <= x <= 5
S3 after S1: start: 0 <= x <= 4; end: 3 <= x <= 3
S3 after S2: start: 4 <= x <= 4; INFEASIBLE

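Example (cont'd.)

[Figure: interval propagation over two loop iterations.
Iteration 1: X = [0..3] at loop entry. S1 yields X = [0..4] and S2 yields X = [4..4]. The S3/S4 stage yields X = [3..3], X = [1..3], and X = [5..5], which merge to X = [1..5].
Iteration 2: the loop test splits the merged interval into X = [1..3] (stay in loop) and X = [4..5] (exit). S1 yields X = [2..4] and S2 yields X = [4..4]; the propagation continues in the same way.]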
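The interval propagation in this example can be reproduced with a few lines of code. What follows is a minimal sketch in Python, not the tool's implementation: the helper names (meet, join, drop_point, one_iteration) and the BIG sentinel are assumptions made for illustration. It runs one pass over the loop body starting from the input limits 0 <= x <= 3.

```python
from typing import Dict, Optional, Tuple

Interval = Optional[Tuple[int, int]]  # None stands for "infeasible" (empty set)
BIG = 10**9  # hypothetical stand-in for "unbounded"


def meet(iv: Interval, lo: int, hi: int) -> Interval:
    """Restrict an interval to [lo, hi]; None if nothing remains."""
    if iv is None:
        return None
    new_lo, new_hi = max(iv[0], lo), min(iv[1], hi)
    return (new_lo, new_hi) if new_lo <= new_hi else None


def join(a: Interval, b: Interval) -> Interval:
    """Smallest interval covering both arguments (the merge step)."""
    if a is None:
        return b
    if b is None:
        return a
    return (min(a[0], b[0]), max(a[1], b[1]))


def drop_point(iv: Interval, p: int) -> Interval:
    """Remove the value p when it sits on an endpoint of the interval."""
    if iv is None:
        return None
    lo, hi = iv
    if lo == hi == p:
        return None
    if lo == p:
        return (lo + 1, hi)
    if hi == p:
        return (lo, hi - 1)
    return iv  # interior point: an interval cannot express the hole


def add(iv: Interval, k: int) -> Interval:
    return None if iv is None else (iv[0] + k, iv[1] + k)


def one_iteration(x: Interval) -> Dict[str, Interval]:
    """Propagate an interval for x through S1..S4 once."""
    taken = meet(x, -BIG, 2)
    s1 = None if taken is None else (taken[0] * 2, taken[1] * 2)  # x < 3: x = x*2
    s2 = add(meet(x, 3, BIG), 1)                                  # else: x = x+1
    out: Dict[str, Interval] = {"S1": s1, "S2": s2}
    merged: Interval = None
    for name, mid in (("S1", s1), ("S2", s2)):
        s3 = add(meet(mid, 1, 1), 2)    # x == 1: x = x+2
        s4 = add(drop_point(mid, 1), 1)  # else: x = x+1
        out[name + "->S3"], out[name + "->S4"] = s3, s4
        merged = join(merged, join(s3, s4))
    out["merge"] = merged
    return out


result = one_iteration((0, 3))
```

Under these assumptions, result["S2->S3"] comes out as None, which corresponds to the INFEASIBLE entry, and meet(result["merge"], -BIG, 3) gives the interval that enters the second iteration.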

Bounding loop iterations

Uppsala/FSU: handle complex loops. Four phases:

1. Iteratively identify branches that affect the number of iterations.
2. Identify the loop iteration on which loop index-dependent branches change direction.
3. Use step 2 to determine when the step 1 branches are reached.
4. Calculate bounds on the number of iterations.

Loop iteration example (Engblom et al.)

for (i = 0, j = 1; i < 100; i++, j += 3) {
    if (j > 75 && somecond || j > 300)
        break;
}

Flow graph, with the iteration on which each branch can first change direction:
  i=0, j=1; jump
  j>75, jump if false (iteration 26)
  somecond, jump if false
  return
  j>300, jump if true (iteration 101; redundant code)
  i++, j+=3
  i<100, jump if false (iteration 101)
  jump

Lower bound: 26
Upper bound: 101
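The two bounds can be checked by running the loop under the two extreme behaviors of somecond. The sketch below is illustrative only: it assumes that an "iteration" is counted as one evaluation of the loop test i<100, which appears to be the convention behind the numbers 26 and 101, and the function name loop_test_count is hypothetical.

```python
def loop_test_count(somecond: bool) -> int:
    """Count evaluations of the loop test for a fixed value of somecond."""
    tests = 0
    i, j = 0, 1
    while True:
        tests += 1                 # one evaluation of i < 100
        if not (i < 100):
            break                  # normal loop exit
        if (j > 75 and somecond) or j > 300:
            break                  # early exit from the body
        i += 1
        j += 3
    return tests


lower = loop_test_count(True)   # somecond true as early as possible
upper = loop_test_count(False)  # somecond never true
```

With somecond always false, j only reaches 298 at i = 99, so the j>300 test never fires before the loop test ends the loop. That is why the slide marks it as redundant code.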

Implicit path analysis

Schedl, Li/Malik: find path length without explicitly finding the path.
Formulate as a constraint-solving problem:
  Generate constraints that describe the program and annotations.
  Solve using a constraint solver or ILP (depending on the types of constraints allowed).

Types of constraints

Structural constraints:
  Describe conditionals, etc.
Finiteness and start constraints:
  Bound loop iterations.
Tightening constraints:
  Infeasible paths.
  User constraints.

Structural constraints

[Figure: control-flow graph of a conditional with blocks if, then, else, and cont, and edges labeled d1 through d6.]

di = number of times control flows through i
Conservation of flow:
  d1 = d2 + d3
  d2 = d4
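To make the role of the flow constraints concrete, here is a small sketch that finds the worst-case time of a single conditional by searching over edge counts instead of enumerating paths. It is a brute-force stand-in for an ILP solver, not the method of any particular tool. The block costs are made-up numbers, and the constraints d1 = 1, d3 = d5, and d4 + d5 = d6 are assumptions added to complete the two equations shown above.

```python
from itertools import product

# Hypothetical cycle counts for the four blocks of the conditional.
COST = {"if": 2, "then": 10, "else": 4, "cont": 3}


def feasible(d):
    """Flow conservation for the if/then/else/cont graph."""
    d1, d2, d3, d4, d5, d6 = d
    return (
        d1 == 1              # assumption: the conditional is entered once
        and d1 == d2 + d3    # from the slide
        and d2 == d4         # from the slide
        and d3 == d5         # assumption: else block, flow in equals flow out
        and d4 + d5 == d6    # assumption: both branches rejoin at cont
    )


def worst_case():
    """Maximize total cost over all edge-count vectors satisfying the constraints."""
    best = None
    for d in product(range(2), repeat=6):
        if not feasible(d):
            continue
        d1, d2, d3, d4, d5, d6 = d
        time = (COST["if"] * d1 + COST["then"] * d2
                + COST["else"] * d3 + COST["cont"] * d6)
        if best is None or time > best[0]:
            best = (time, d)
    return best


wcet, counts = worst_case()
```

The search never names a path; the maximizing counts imply it. A real tool would hand the same constraints and objective to an ILP solver, which matters once loops make the counts larger than 0 or 1.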

WCET and optimizing compilers

Optimizing compilers can radically change program control flow.
Must analyze timing of the optimized code.
Annotations must be transferred to the optimized code.
  Must be able to perform the program transformations on the annotations.

Cache analysis extensions

Must segment program units around cache lines.
Different execution times for in-cache and out-of-cache.
  Conservative assumption: use in-cache time only if the statement is known to be in cache.
Add constraints which model cache state based on program flow.

Cache analysis model (Colin, Puaut)

Instruction block (iblock): a basic block fragment that fits into a cache line.
  Decomposition of the program into iblocks depends on cache organization.
Determine paths on which iblocks result in hits, misses.
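The dependence of the decomposition on cache organization can be seen in a short sketch. The function names and the example addresses below are hypothetical, and the cache is assumed to be direct-mapped with a power-of-two line size; the code simply cuts a basic block's address range at cache-line boundaries.

```python
from typing import List, Tuple


def split_into_iblocks(start: int, size: int, line_size: int) -> List[Tuple[int, int]]:
    """Cut the address range [start, start+size) at cache-line boundaries."""
    fragments = []
    addr, end = start, start + size
    while addr < end:
        next_line = (addr // line_size + 1) * line_size
        frag_end = min(next_line, end)
        fragments.append((addr, frag_end))
        addr = frag_end
    return fragments


def cache_line_index(addr: int, line_size: int, num_lines: int) -> int:
    """Line of a direct-mapped cache that the address maps to (assumed organization)."""
    return (addr // line_size) % num_lines


# A 40-byte basic block starting at 0x1008, with 16-byte cache lines.
iblocks = split_into_iblocks(0x1008, 40, 16)
lines = [cache_line_index(lo, 16, 4) for lo, _ in iblocks]
```

Changing line_size changes both the number of iblocks and the lines they compete for, which is the sense in which the decomposition depends on the cache.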

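Cache interference example

for (i = 0; i < N; i++) {
    f1();
    f2();
    f3();
}

[Figure: the code is divided into iblockA, iblockB, iblockC, and iblockD, which map onto the cache; one cache location is shared by iblockB and iblockD, another by iblockB, iblockA, and iblockC.]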

Branch prediction bounding (Colin & Puaut)

Missed branch prediction causes a pipeline bubble.
  Branch predictor has finite capacity.
  Predictor may make a wrong prediction.
Keep track of branch history.
  Memoryless predictors are a special case.
Determine what prediction the machine will make to determine whether a bubble may be caused.
  Known correct predictions cause no bubble.
  Known incorrect predictions cause a bubble.
  Indeterminate results are pessimistically presumed to cause a bubble.
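The three-way classification translates directly into a penalty rule. The sketch below is only an illustration of that rule: the Prediction enum, the function names, and the 3-cycle bubble are assumptions, not details of the Colin and Puaut analysis.

```python
from enum import Enum
from typing import Iterable


class Prediction(Enum):
    CORRECT = "known correct"
    INCORRECT = "known incorrect"
    UNKNOWN = "indeterminate"


def bubble_cycles(outcome: Prediction, penalty: int) -> int:
    """Only a provably correct prediction is charged no bubble."""
    return 0 if outcome is Prediction.CORRECT else penalty


def worst_case_bubbles(outcomes: Iterable[Prediction], penalty: int) -> int:
    """Safe upper bound on bubble cycles along one path."""
    return sum(bubble_cycles(o, penalty) for o in outcomes)


path = [Prediction.CORRECT, Prediction.INCORRECT,
        Prediction.UNKNOWN, Prediction.CORRECT]
total = worst_case_bubbles(path, penalty=3)  # hypothetical 3-cycle bubble
```

Treating UNKNOWN like INCORRECT keeps the bound safe at the price of some pessimism.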

Data caching

Data address may not be known at compile time:
  Pointers.
  Stack variables.
Caching of stack variables is easier to compute.
  Offset from stack pointer is known.
  Can compute cache block based upon sp offset.

Timing through simulation

Use a simulator to time a sequence of instructions.
  Can simulate basic blocks with boundary conditions for branches.
WCET tools often use custom simulators designed for small pieces of code, called as subroutines.

Pipeline interactions

Basic block execution time depends on the path.
  Next block may or may not overlap with current block.
Overlap model: execution time of a block includes (variable) overlap gains from the next block.

[Figure: BB1 branches to BB2 and BB3. T1(BB1) is the time of BB1 alone; T2(BB1,BB2) and T2(BB1,BB3) are its times when followed by BB2 or BB3.]
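A minimal sketch of the overlap model, assuming that T2(b, next) can be written as the isolated time T1(b) minus a gain that depends on the successor. The cycle counts and gains are invented for illustration, and block_time and path_time are hypothetical names.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical isolated block times T1 and overlap gains, in cycles.
T1: Dict[str, int] = {"BB1": 10, "BB2": 6, "BB3": 8}
GAIN: Dict[Tuple[str, str], int] = {("BB1", "BB2"): 2, ("BB1", "BB3"): 1}


def block_time(block: str, successor: Optional[str]) -> int:
    """T1(block) when it runs alone, T2(block, successor) otherwise."""
    if successor is None:
        return T1[block]
    return T1[block] - GAIN.get((block, successor), 0)


def path_time(path: List[str]) -> int:
    """Sum block times, charging each block according to its successor."""
    successors: List[Optional[str]] = list(path[1:]) + [None]
    return sum(block_time(b, s) for b, s in zip(path, successors))


times = {tuple(p): path_time(p) for p in (["BB1", "BB2"], ["BB1", "BB3"])}
worst = max(times.values())
```

The same block BB1 contributes 8 cycles on one path and 9 on the other, which is why block times cannot be summed independently of the path.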

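Pipeline simulation (Lim et al.)

Use reservation tables to capture pipeline state.
Use multiple reservation tables to simulate different path conditions.

[Figure: reservation table with cycles 1, 2, 3 as columns and the resources ALU and reg as rows; an X marks the cycle in which a resource is in use.]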

Behavioral performance analysis

Use program behavior to analyze performance.
Advantages:
  Handles arbitrary programs.
  Captures realistic behavior.
Disadvantages:
  Doesn't guarantee worst-case/best-case behavior.

Methodology

Sources of a behavior:
  Program execution on platform.
  Simulated execution.

[Figure: program and data feed an execute step, which produces a behavior; an analysis tool turns the behavior into results.]

Taxonomy

Behavior analysis: trace vs. execution.
Execution style: simulation vs. direct execution.
Performance analysis: instruction schedulers vs. cycle timers.

Methods for gathering traces

PC sampling. Program instrumentation. Simulation.

PC sampling

Example: Unix prof.
Interrupts are used to sample the PC periodically.
Must run on the platform.
Doesn't provide a complete trace.
Subject to sampling problems: undersampling, periodicity problems.
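The periodicity problem is easy to reproduce in a toy model. In the sketch below, everything is an assumption made for illustration: a program that repeats every 10 ticks, spending 7 ticks in f1 and 3 in f2, and a sampler that reads the "PC" at a fixed period. It is not a model of how prof itself works.

```python
from collections import Counter
from typing import Dict


def function_at(tick: int) -> str:
    """Toy program: each 10-tick period runs f1 for 7 ticks, then f2 for 3."""
    return "f1" if tick % 10 < 7 else "f2"


def sample_profile(period: int, n_samples: int) -> Dict[str, float]:
    """Fraction of samples landing in each function."""
    counts = Counter(function_at(k * period) for k in range(n_samples))
    return {name: counts[name] / n_samples for name in ("f1", "f2")}


locked = sample_profile(period=10, n_samples=100)  # in step with the program
spread = sample_profile(period=7, n_samples=100)   # drifts across all phases
```

Sampling in step with the program attributes all time to f1 and none to f2, while a period that drifts through every phase recovers the true 70/30 split.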

Call graph report

Cumulative execution time:

Main 100%
  f1
  f2
---
f1 37%
  g1
  g2
---
f2 23%
  g3
  g4

Program instrumentation

Example: dinero.
Modify the program to write trace information.
  Track entry into basic blocks.
Requires editing object files.
Provides a complete trace.

Functional simulation

Works on the programming model (registers).
Interprets instructions.
  Instructions are independent.
Doesn't model timing or pipeline state.

Direct execution

Model the target programming model within the simulation host.
  May need to use variables to capture registers not available in the host.
  May need to generate extra instructions to generate the non-native state.
Use simulation host instructions to model target instructions.
  Mapping may be many host instructions to one target instruction.

Cycle-accurate simulator

Models the microarchitecture.
  Simulating one instruction requires executing routines for instruction decode, etc.
Models pipeline state.
  Microarchitectural registers are exposed to the simulator.

[Figure: microarchitecture block diagram showing reg, IR, PC, and I-box.]

Trace-based vs. execution-based

Trace-based:
  Gather the trace first, then generate timing information.
  Basic timing information is simpler to generate.
  Full timing information may require regenerating information from the original execution.
  Requires owning the platform.
Execution-based:
  Simulator fully executes the instruction.
  Requires a more complex simulator.
  Requires explicit knowledge of the microarchitecture, not just instruction execution times.

Sources of timing information

Data book tables:
  Time of individual instructions.
  Penalties for various hazards.
Microarchitecture:
  Depends on the structure of the machine.
  Derived from execution of the instruction in the microarchitecture.

Levels of detail in simulation

Instruction schedulers:
  Model availability of microarchitectural resources.
  May not capture all interactions.
Cycle timers:
  Model the full microarchitecture.
  Most accurate; require an exact model of the microarchitecture.

Modular simulators

Model instructions through a description file.
  Drives assembler, basic behavioral simulation.
Assemble a simulation program from code modules.
  Can add your own code.

Early approaches to power modeling

Instruction macromodels:
  ADD = 1 w, JMP = 2 w, etc.
Data-dependent models:
  Based on data value statistics.
Transition-based models.
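An instruction macromodel amounts to a table lookup summed over the executed instructions. The sketch below uses the two costs quoted above and keeps the slide's unit "w" as an opaque label; the trace and the function name are hypothetical.

```python
from typing import Dict, Iterable

# Per-instruction costs from the slide, in the slide's unit "w".
MACROMODEL: Dict[str, int] = {"ADD": 1, "JMP": 2}


def macromodel_cost(trace: Iterable[str]) -> int:
    """Sum the table cost of every instruction in an execution trace."""
    return sum(MACROMODEL[opcode] for opcode in trace)


# Hypothetical trace: two additions followed by a jump.
cost = macromodel_cost(["ADD", "ADD", "JMP"])
```

Because the table ignores operand values and neighboring instructions, the data-dependent and transition-based models listed above refine it.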

Power simulation

Model capacitance in the processor.
Keep track of activity in the processor.
  Requires full simulation.
Activity determines capacitive charge/discharge, which determines power consumption.
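The link from activity to energy can be sketched with the usual first-order CMOS switching model, E = (1/2) C V^2 per bit transition. That formula is a standard approximation and is not stated on the slide; the capacitance, supply voltage, and bus values below are made-up numbers, and the function names are hypothetical.

```python
from typing import Sequence


def toggles(prev: int, curr: int) -> int:
    """Number of bit positions that change between two successive values."""
    return bin(prev ^ curr).count("1")


def switching_energy(values: Sequence[int], capacitance_f: float, vdd: float) -> float:
    """Energy in joules, charging 0.5 * C * Vdd^2 for every bit transition."""
    n = sum(toggles(a, b) for a, b in zip(values, values[1:]))
    return 0.5 * capacitance_f * vdd ** 2 * n


# Hypothetical 4-bit bus carrying three successive values.
bus_values = [0b0000, 0b1111, 0b1010]
energy = switching_energy(bus_values, capacitance_f=1e-12, vdd=1.0)
```

The simulator's job is to supply the sequence of values, which is why activity tracking requires full simulation.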

SimplePower simulator

Cycle-accurate simulator.
  SimpleScalar-style cycle-accurate simulator.
Transition-based power analysis.
  Estimates energy of data path, memory, and busses on every clock cycle.

RTL power estimation interface

A power estimator is required for each functional unit modeled in the simulator.
  Functional interface makes the simulator more modular.
Power estimator takes the same arguments as the performance simulation module.

Switch capacitance tables

Model functional units such as ALU, register files, multiplexers, etc.
Capture technology-dependent capacitance of the unit.
Two types of model:
  Bit-independent: each bit is independent, so the model is one bit wide.
  Bit-dependent: bits interact (as in an adder), so the model must be multiple bits wide.
Analytical models are used for memories.
The adder model is built from a sub-model for an adder slice.

Wattch power simulator

Built on top of SimpleScalar.
Adds parameterized power models for the functional units.

Array model

Analytical model:
  Decoder.
  Wordline drive.
  Bitline discharge.
  Sense amp output.
Register file word line capacitance:
  Cdiff(word line driver) + Cgate(cell access) * nbit_lines + Cmetal * Word_line_length
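The word line formula is a plain sum of three terms and can be evaluated directly. The sketch below mirrors the expression above; the parameter names follow the slide, while the numeric values and their units are placeholders chosen only to make the arithmetic easy to follow.

```python
def wordline_capacitance(c_diff: float, c_gate: float, n_bit_lines: int,
                         c_metal: float, word_line_length: float) -> float:
    """Cdiff(driver) + Cgate(cell access) * nbit_lines + Cmetal * Word_line_length."""
    return c_diff + c_gate * n_bit_lines + c_metal * word_line_length


# Placeholder values: capacitances in fF, c_metal in fF per unit length.
c_wl = wordline_capacitance(c_diff=2.0, c_gate=1.5, n_bit_lines=32,
                            c_metal=0.1, word_line_length=100.0)
```

With these placeholders the gate term dominates (48 of 60 fF), which illustrates how the capacitance grows with the number of bit lines in the array.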

Bus, function unit models

Bus model based upon length of bus, capacitance of bus lines.
Models for ALUs, etc. based upon transition models.

Clock network power model

Clock is a major power sink in modern designs.
Major elements of the clock power model:
  Global clock lines.
  Global drivers.
  Loads on the clock network.
Must handle gated clocks.
