© 2004 Wayne Wolf
Topics
Goals of program performance analysis.
Worst-case execution time analysis.
Execution-based timing analysis.
CPUs and software performance
System performance cannot be determined without choosing a CPU.
Software execution times do not scale across CPU architectures, perhaps not even within a CPU family.
Architectural features that influence program performance:
  pipelining
  caching
  bus bandwidth
Ceramic versus plastic packaging, by contrast, is not a key design decision for program performance.
Hierarchical performance modeling
We would like to have a hierarchy of increasingly accurate performance models, from system specification to assembly code.
Very little work exists on C-level performance modeling; it is hard to separate the program from the CPU.
Two types of questions:
  Which variety of Brand X CPU do I use?
  Should I use Brand X or Brand Y?
Caches and code speed
Worst case:
  A tight-deadline device interrupts.
  The driver is not in the cache.
  Multiple high-priority drivers knock each other out of the cache.
Cache miss costs range from a few cycles on up; the faster the CPU, the more costly a miss.
Worst-case execution time can be much larger than best-case, which leads to overengineering.
Alternative approaches to performance analysis
Conservative analysis: performance always within bounds.
  WCET gives bounds from analysis, limited simulation.
Detailed analysis: more information on a particular case but no bounds.
  Execution-based methods provide lots of detail but only for given input data.
Performance of HLLs
We would like to bound or estimate program performance from high-level code:
  simplifies identification of paths
  provides early performance estimates
Realistically, we need to know the execution platform.
Paths and performance
Branches of a conditional have different execution times.
Loops:
  Multiple iterations.
  Varying number of iterations.
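[Figure: control flow graphs of a conditional (if, with true and false branches to F1() and F2()) and of a loop (test i<N at Loopstart, with true and false edges, and Loopbody).]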
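The effect of branches and loops on execution time can be sketched with a simple structural rule: a conditional costs its test plus the slower (or faster) branch, and a loop costs its iteration bound times the test and body, plus the final failing test. This is only a minimal sketch; the function names and cycle counts are hypothetical, and cache and pipeline effects are ignored.

```python
# Hypothetical cycle counts for illustration; not taken from any real CPU.

def wcet_if(t_test, t_then, t_else):
    """Worst case of a conditional: the test plus the slower branch."""
    return t_test + max(t_then, t_else)


def bcet_if(t_test, t_then, t_else):
    """Best case of a conditional: the test plus the faster branch."""
    return t_test + min(t_then, t_else)


def wcet_loop(t_test, t_body, max_iters):
    """Worst case of a loop: max_iters passes plus the final failing test."""
    return max_iters * (t_test + t_body) + t_test


if __name__ == "__main__":
    body = wcet_if(t_test=2, t_then=10, t_else=4)
    print(body)                                            # 12
    print(wcet_loop(t_test=1, t_body=body, max_iters=8))   # 105
```

The loop bound (max_iters) has to come from somewhere, either an annotation or an analysis of the loop test.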
Measurements of interest
Execution time bounds: worst, best.
  Upper/lower bounds are important in multitasking systems.
Execution time of incomplete code.
  Must be able to handle time estimates for pieces of code.
Bounds of varying quality:
  Loose bounds quickly.
  Tight bounds with more effort.
Early work: explicit path analysis
Shaw developed techniques to prove bounds on the number of times through paths.
Park and Shaw developed techniques for path analysis and for measurement of execution times of HLL statements on a 68000.
Challenges and approaches
Exponential number of paths.
  Limit program constructs.
  Add annotations.
  Implicit path analysis.
Instruction times are not independent: pipeline effects, cache effects.
  State-dependent instruction execution time.
  Simulation.
Uppsala WCET tool
Four major phases:
1. Analyze global program paths.
2. Analyze global effect of caches, etc.
3. Analyze local effects of pipelines, etc.
4. Calculate final WCET.
Abstract program flow analysis
Bound the set of feasible paths without exhaustive simulation.
  May include some infeasible paths in the feasible set.
Perform abstract interpretation of the program to find feasible paths.
  Generate safe bounds on the values of variables.
Abstract interpretation example (Engblom et al.)
while (x<4) {
S1  if (x<3) x = x*2;
S2  else x = x+1;
S3  if (x == 1) x = x+2;
S4  else x = x+1;
}
Input limits: 0 <= x <= 3
start: 0 <= x <= 2; end: 0 <= x <= 4
start: 3 <= x <= 3; end: 4 <= x <= 4
start: 0 <= x <= 4; end: 1 <= x <= 5
start: 3 <= x <= 3; end: 5 <= x <= 5
start: 0 <= x <= 4; end: 3 <= x <= 3
start: 3 <= x <= 3; INFEASIBLE
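Example (cont’d.)
[Figure: abstract interpretation of the loop over two iterations. Iteration 1 starts with X = [0..3]; S1 and S2 yield X = [0..4] and X = [4..4]; S3 and S4 then yield X = [3..3], X = [1..3], and X = [5..5], which merge to X = [1..5]. Iteration 2 splits into X = [1..3] and X = [4..5]; S1 and S2 yield X = [2..4] and X = [4..4]; and so on.]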
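The interval reasoning in this example can be reproduced with a few lines of code. The sketch below is not the Uppsala tool's implementation; it is a minimal interval-domain interpreter written for this one loop, with hypothetical helper names (meet, join, image). It does not refine the else branch of the x == 1 test, because an interval cannot exclude an interior point; that keeps the result safe but may include infeasible values.

```python
INF = float("inf")


def meet(iv, lo, hi):
    """Restrict interval iv to [lo, hi]; None marks an infeasible path."""
    if iv is None:
        return None
    a, b = max(iv[0], lo), min(iv[1], hi)
    return (a, b) if a <= b else None


def join(a, b):
    """Smallest interval covering both a and b."""
    if a is None:
        return b
    if b is None:
        return a
    return (min(a[0], b[0]), max(a[1], b[1]))


def image(iv, f):
    """Apply a monotonically increasing function f to an interval."""
    return None if iv is None else (f(iv[0]), f(iv[1]))


def loop_body(x):
    """Abstractly execute one iteration; x is the interval at the loop test."""
    x = meet(x, -INF, 3)                                   # while (x < 4)
    after_s1 = image(meet(x, -INF, 2), lambda v: v * 2)    # S1: x < 3
    after_s2 = image(meet(x, 3, INF), lambda v: v + 1)     # S2: else
    paths = {}
    for name, pre in (("S1", after_s1), ("S2", after_s2)):
        paths[name + ",S3"] = image(meet(pre, 1, 1), lambda v: v + 2)  # x == 1
        paths[name + ",S4"] = image(pre, lambda v: v + 1)              # else
    merged = None
    for iv in paths.values():
        merged = join(merged, iv)
    return paths, merged


if __name__ == "__main__":
    paths, merged = loop_body((0, 3))
    print(paths)    # the S2,S3 path comes out as None (infeasible)
    print(merged)   # (1, 5)
```

Calling loop_body again on the merged interval gives the second iteration; the loop test first restricts [1..5] to [1..3], and the remainder [4..5] leaves the loop.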
Bounding loop iterations
Uppsala/FSU: handle complex loops. Four phases:
1. Iteratively identify branches that affect the number of iterations.
2. Identify loop iteration on which loop index-dependent branches change direction.
3. Use step 2 to determine when step 1 branches are reached.
4. Calculate bounds on number of iterations.
Loop iteration example (Engblom et al.)
for (i=0, j=1; i<100; i++, j+=3) {
  if (j>75 && somecond || j>300)
    break;
}
[Figure: control flow graph of the loop. Nodes: "i=0, j=1, jump"; "j>75, jump if false" (changes direction at iteration 26); "somecond, jump if false"; "return"; "j>300, jump if true" (changes direction at iteration 101; redundant code); "i++, j+=3"; "i<100, jump if false" (changes direction at iteration 101); "jump".]
Lower bound: 26
Upper bound: 101
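The two bounds can be checked by running the loop with somecond held at each extreme. The sketch below assumes that an "iteration" is counted as one evaluation of the loop test i<100; under that assumption it gives 26 when somecond is always true and 101 when it is always false. The function name is hypothetical.

```python
def header_tests(somecond):
    """Count evaluations of the loop test i<100 for a fixed somecond."""
    tests = 0
    i, j = 0, 1
    while True:
        tests += 1
        if not (i < 100):                      # normal loop exit
            break
        if (j > 75 and somecond) or j > 300:   # early exit via break
            break
        i += 1
        j += 3
    return tests


if __name__ == "__main__":
    print(header_tests(True))    # 26
    print(header_tests(False))   # 101
```

Because j = 1 + 3i, the test j>300 first becomes true at i = 100, where i<100 has already ended the loop, so that test never decides the exit.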
Implicit path analysis
Schedl, Li/Malik: find path length without explicitly finding the path.
Formulate as a constraint-solving problem:
  Generate constraints that describe the program and annotations.
  Solve using a constraint solver or ILP (depending on the types of constraints allowed).
Types of constraints
Structural constraints:
  Describe conditionals, etc.
Finiteness and start constraints:
  Bound loop iterations.
Tightening constraints:
  Infeasible paths.
  User constraints.
Structural constraints
[Figure: flow graph of a conditional with nodes if, then, else, and cont, and flows labeled d1 through d6.]
di = number of times control flows through i
Conservation of flow:
  d1 = d2 + d3
  d2 = d4
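To see how flow constraints turn into a worst-case bound, the sketch below enumerates all small integer assignments to d1..d6 and keeps the feasible one with the largest total time. Several things here are assumptions, not taken from the slide: the edge layout (d1 enters if, d2 and d3 leave it toward then and else, d4 and d5 enter cont, d6 leaves cont), the start constraint d1 = 1, and the cycle costs. A real tool would hand the same constraints to an ILP solver instead of enumerating.

```python
from itertools import product

# Hypothetical cycle costs per block.
COST = {"if": 2, "then": 10, "else": 4, "cont": 3}


def worst_case():
    """Return (time, (d1..d6)) for the most expensive feasible flow."""
    best = None
    for d in product(range(2), repeat=6):
        d1, d2, d3, d4, d5, d6 = d
        feasible = (
            d1 == 1               # start constraint: the code is entered once
            and d1 == d2 + d3     # conservation of flow at "if"
            and d2 == d4          # conservation of flow at "then"
            and d3 == d5          # conservation of flow at "else"
            and d4 + d5 == d6     # conservation of flow at "cont"
        )
        if not feasible:
            continue
        time = (COST["if"] * d1 + COST["then"] * d2
                + COST["else"] * d3 + COST["cont"] * d6)
        if best is None or time > best[0]:
            best = (time, d)
    return best


if __name__ == "__main__":
    print(worst_case())   # (15, (1, 1, 0, 1, 0, 1))
```

The path itself is never listed; it falls out of the edge counts that maximize the objective, which is the point of implicit path analysis.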
WCET and optimizing compilers
Optimizing compilers can radically change program control flow.
Must analyze timing of the optimized code.
  Annotations must be transferred to the optimized code.
  Must be able to apply the program transformations to the annotations.
Cache analysis extensions
Must segment program units around cache lines.
Different execution times for in-cache and out-of-cache.
  Conservative assumption: use in-cache time only if statement is known to be in cache.
Add constraints which model cache state based on program flow.
Cache analysis model (Colin, Puaut)
Instruction block (iblock): a basic block fragment that fits into a cache line.
  Decomposition of program into iblocks depends on cache organization.
Determine paths on which iblocks result in hits, misses.
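Splitting a basic block into iblocks amounts to cutting its address range at cache line boundaries. The sketch below shows that cut for a block given by start address and size in bytes; the function name and the example addresses are hypothetical, and it is not the decomposition code of any particular tool.

```python
def split_into_iblocks(start, size, line_size):
    """Cut the address range [start, start+size) at cache line boundaries.

    Returns a list of (address, length) fragments, each inside one line.
    """
    iblocks = []
    addr = start
    end = start + size
    while addr < end:
        line_end = (addr // line_size + 1) * line_size   # next line boundary
        stop = min(end, line_end)
        iblocks.append((addr, stop - addr))
        addr = stop
    return iblocks


if __name__ == "__main__":
    print(split_into_iblocks(24, 40, 16))   # [(24, 8), (32, 16), (48, 16)]
    print(split_into_iblocks(24, 40, 32))   # [(24, 8), (32, 32)]
```

The same basic block yields three iblocks with 16-byte lines and two with 32-byte lines, which is why the decomposition depends on cache organization.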
Cache interference example
for (i=0; i<N; i++) {
  f1();
  f2();
  f3();
}
[Figure: instruction blocks iblockA, iblockB, iblockC, and iblockD and the cache contents as they compete for cache lines.]
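Interference of this kind can be illustrated with a small direct-mapped cache model in which each cache line holds one iblock. The mapping of iblocks to lines below is hypothetical and is not read off the figure; it is chosen so that two pairs of iblocks collide, and is compared with a mapping that has no collisions.

```python
def count_misses(trace, line_of, num_lines):
    """Simulate a direct-mapped cache where each line holds one iblock."""
    cache = [None] * num_lines
    misses = 0
    for blk in trace:
        idx = line_of[blk]
        if cache[idx] != blk:    # line holds another iblock (or nothing)
            misses += 1
            cache[idx] = blk
    return misses


if __name__ == "__main__":
    trace = ["A", "B", "C", "D"] * 4             # four loop iterations
    colliding = {"A": 0, "B": 1, "C": 0, "D": 1}  # A/C and B/D share a line
    separate = {"A": 0, "B": 1, "C": 2, "D": 3}
    print(count_misses(trace, colliding, 2))   # 16: every access misses
    print(count_misses(trace, separate, 4))    # 4: only the first pass misses
```

With the colliding mapping each iblock is evicted before it is reused, so the loop never benefits from the cache.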
Branch prediction bounding (Colin & Puaut)
Missed branch prediction causes pipeline bubble.
Branch predictor has finite capacity.
  Predictor may make wrong prediction.
Keep track of branch history.
  Memoryless predictors are a special case.
Determine what prediction the machine will make to determine whether a bubble may be caused.
  Known correct predictions cause no bubble.
  Known incorrect predictions cause a bubble.
  Indeterminate results are pessimistically presumed to cause a bubble.
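The three-way classification translates directly into a worst-case penalty. The sketch below assumes the analysis has already labeled each executed branch; the label names and the penalty value are hypothetical.

```python
CORRECT, INCORRECT, UNKNOWN = "correct", "incorrect", "unknown"


def bubble_cycles(classification, penalty):
    """Cycles charged to one branch: only a known-correct prediction is free."""
    return 0 if classification == CORRECT else penalty


def worst_case_penalty(classifications, penalty):
    """Total bubble cycles charged along a path of classified branches."""
    return sum(bubble_cycles(c, penalty) for c in classifications)


if __name__ == "__main__":
    path = [CORRECT, INCORRECT, UNKNOWN, CORRECT]
    print(worst_case_penalty(path, penalty=3))   # 6
```

Treating UNKNOWN like INCORRECT keeps the bound safe at the price of some pessimism.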
Data caching
Data address may not be known at compile time:
  Pointers.
  Stack variables.
Caching of stack variables is easier to compute.
  Offset from stack pointer is known.
  Can compute cache block based upon sp offset.
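Once the stack pointer value is known, mapping a stack variable to a cache set is simple arithmetic. The sketch below assumes a set-indexed cache whose set is chosen by the address bits just above the line offset, and a stack pointer value known to the analysis; both are assumptions, and the example values are hypothetical.

```python
def cache_set(sp, offset, line_size, num_sets):
    """Cache set holding the stack variable at address sp + offset."""
    addr = sp + offset
    return (addr // line_size) % num_sets


if __name__ == "__main__":
    # Hypothetical 64-set cache with 16-byte lines and sp = 0x7FF0.
    print(cache_set(0x7FF0, 8, 16, 64))    # 63
    print(cache_set(0x7FF0, 16, 16, 64))   # 0
```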
Timing through simulation
Use a simulator to time a sequence of instructions.
  Can simulate basic blocks with boundary conditions for branches.
WCET tools often use custom simulators designed for small pieces of code, called as a subroutine.
Pipeline interactions
Basic block execution time depends on the path.
  Next block may or may not overlap with current block.
Overlap model: execution time of block includes (variable) overlap gains from next block.
[Figure: BB1 branches to BB2 and BB3, with timings T1(BB1), T2(BB1,BB2), and T2(BB1,BB3).]
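One way to read the overlap model is sketched below. It assumes that T1(b) is the time of block b run in isolation, that T2(a,b) is the time of the two-block sequence, and that overlap gains only occur between adjacent blocks; these interpretations and all cycle counts are assumptions, not taken from the slide.

```python
def path_time(path, t1, t2):
    """Time of a block sequence: isolated times minus pairwise overlap gains."""
    total = sum(t1[b] for b in path)
    for a, b in zip(path, path[1:]):
        gain = t1[a] + t1[b] - t2[(a, b)]   # cycles saved by overlapping a, b
        total -= gain
    return total


if __name__ == "__main__":
    # Hypothetical cycle counts.
    t1 = {"BB1": 10, "BB2": 7, "BB3": 9, "BB4": 8}
    t2 = {("BB1", "BB2"): 15, ("BB1", "BB3"): 19, ("BB2", "BB4"): 13}
    print(path_time(["BB1", "BB2"], t1, t2))          # 15
    print(path_time(["BB1", "BB3"], t1, t2))          # 19
    print(path_time(["BB1", "BB2", "BB4"], t1, t2))   # 21
```

In this example BB1 followed by BB2 gains two cycles of overlap, while BB1 followed by BB3 gains none, so the cost attributed to BB1 depends on which successor is taken.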
Pipeline simulation (Lim et al.)
Use reservation tables to capture pipeline state.
Use multiple reservation tables to simulate different path conditions.
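[Figure: reservation table with time steps 1, 2, 3 as columns and the resources ALU and reg as rows; an X marks a time step in which the resource is in use.]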
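A reservation table can be represented as a map from each resource to the set of time steps in which an instruction occupies it. The sketch below uses that representation to find how soon a second instruction can issue without a resource conflict. It is a minimal illustration, not the method of Lim et al.; the tables and function names are hypothetical.

```python
def conflicts(table_a, table_b, delay):
    """True if B, issued `delay` cycles after A, needs a resource A holds."""
    for resource, steps_b in table_b.items():
        steps_a = table_a.get(resource, set())
        if any((step + delay) in steps_a for step in steps_b):
            return True
    return False


def earliest_issue(table_a, table_b):
    """Smallest delay of at least one cycle at which B can follow A."""
    delay = 1
    while conflicts(table_a, table_b, delay):
        delay += 1
    return delay


if __name__ == "__main__":
    # Hypothetical tables: A reads registers in step 1, uses the ALU in 2-3.
    a = {"reg": {1}, "ALU": {2, 3}}
    b = {"reg": {1}, "ALU": {2}}
    print(earliest_issue(a, b))   # 2
```

The loop ends once the delay exceeds the last time step used by A, since no conflict is possible after that.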
Behavioral performance analysis
Use program behavior to analyze performance.
Advantages:
  Handles arbitrary programs.
  Captures realistic behavior.
Disadvantages:
  Doesn’t guarantee worst-case/best-case behavior.
Methodology
Sources of a behavior:
  Program execution on platform.
  Simulated execution.
[Figure: program and data are executed to produce a behavior, which an analysis tool turns into results.]
Taxonomy
Behavior analysis: trace vs. execution.
Execution style: simulation vs. direct execution.
Performance analysis: instruction schedulers vs. cycle timers.
Methods for gathering traces
PC sampling.
Program instrumentation.
Simulation.
PC sampling
Example: Unix prof.
Interrupts are used to sample PC periodically.
  Must run on the platform.
  Doesn’t provide complete trace.
  Subject to sampling problems: undersampling, periodicity problems.
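The periodicity problem is easy to reproduce on a synthetic trace. In the sketch below the "program" is a made-up list with one entry per cycle naming the function that owns the PC; sampling with a period that matches the loop period never sees f2, while a period that does not divide the loop period recovers its true share. The trace and the function name are hypothetical.

```python
from collections import Counter


def sample_profile(trace, period):
    """Fraction of samples per function when the PC is sampled every `period` cycles."""
    samples = trace[::period]
    counts = Counter(samples)
    return {name: counts[name] / len(samples) for name in counts}


if __name__ == "__main__":
    # Synthetic loop: 3 cycles in f1, then 1 cycle in f2, repeated 100 times.
    trace = (["f1"] * 3 + ["f2"]) * 100
    print(sample_profile(trace, 4))   # {'f1': 1.0}: f2 is never sampled
    print(sample_profile(trace, 5))   # f1 at 0.75, f2 at 0.25
```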
Call graph report
Main 100%
  f1
  f2
---
f1 37%
  g1
  g2
---
f2 23%
  g3
  g4
(Percentages are cumulative execution time.)
Program instrumentation
Example: dinero.
Modify the program to write trace information.
  Track entry into basic blocks.
  Requires editing object files.
  Provides complete trace.
Functional simulation
Works on programming model (registers).
Interprets instructions.
  Instructions are independent.
Doesn’t model timing, pipeline state.
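A functional simulator is essentially an interpreter over the programmer-visible state. The sketch below interprets a made-up three-instruction ISA (li, add, bne) that exists only for this illustration; it updates registers and the PC and keeps no notion of cycles or pipeline state.

```python
def run(program, regs=None):
    """Interpret a list of (opcode, operands...) tuples; return the registers."""
    regs = dict(regs or {})
    pc = 0
    while pc < len(program):
        op, *args = program[pc]
        if op == "li":                    # li rd, imm
            regs[args[0]] = args[1]
        elif op == "add":                 # add rd, ra, rb
            rd, ra, rb = args
            regs[rd] = regs[ra] + regs[rb]
        elif op == "bne":                 # bne ra, rb, target
            ra, rb, target = args
            if regs[ra] != regs[rb]:
                pc = target
                continue
        else:
            raise ValueError("unknown opcode: " + op)
        pc += 1
    return regs


if __name__ == "__main__":
    program = [
        ("li", "r0", 0),            # accumulator
        ("li", "r1", 0),            # loop counter
        ("li", "r2", 1),            # constant 1
        ("li", "r3", 5),            # loop limit
        ("add", "r1", "r1", "r2"),  # counter += 1
        ("add", "r0", "r0", "r1"),  # accumulator += counter
        ("bne", "r1", "r3", 4),     # loop until counter == limit
    ]
    print(run(program)["r0"])       # 15
```

Each instruction is handled on its own, which is why this style of simulator can say what the program computes but not how long it takes.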
Direct execution
Model target programming model within the simulation host.
  May need to use variables to capture registers not available in host.
  May need to generate extra instructions to generate the non-native state.
Use simulation host instructions to model target instructions.
  Mapping may be many host->one target.
Cycle-accurate simulator
Models the microarchitecture.
  Simulating one instruction requires executing routines for instruction decode, etc.
Models pipeline state.
  Microarchitectural registers are exposed to the simulator.
[Figure: microarchitecture block diagram showing reg, IR, PC, and I-box.]
Trace-based vs. execution-based
Trace-based:
  Gather trace first, then generate timing information.
  Basic timing information is simpler to generate.
  Full timing information may require regenerating information from the original execution.
  Requires owning the platform.
Execution-based:
  Simulator fully executes the instruction.
  Requires a more complex simulator.
  Requires explicit knowledge of the microarchitecture, not just instruction execution times.
Sources of timing information
Data book tables:
  Time of individual instructions.
  Penalties for various hazards.
Microarchitecture:
  Depends on the structure of the machine.
  Derived from execution of the instruction in the microarchitecture.
Levels of detail in simulation
Instruction schedulers:
  Model availability of microarchitectural resources.
  May not capture all interactions.
Cycle timers:
  Model full microarchitecture.
  Most accurate; require an exact model of the microarchitecture.
Modular simulators
Model instructions through a description file.
  Drives assembler, basic behavioral simulation.
Assemble a simulation program from code modules.
  Can add your own code.
Early approaches to power modeling
Instruction macromodels:
  ADD = 1 w, JMP = 2 w, etc.
Data-dependent models:
  Based on data value statistics.
Transition-based models.
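An instruction macromodel reduces power estimation to a table lookup per executed instruction. The sketch below uses the two per-instruction figures quoted above and sums them over an instruction trace; the trace and the function name are hypothetical, and the units are whatever the macromodel table is expressed in.

```python
# Per-instruction costs from the macromodel example: ADD = 1, JMP = 2.
COST = {"ADD": 1, "JMP": 2}


def macromodel_cost(trace):
    """Sum the per-instruction cost over an executed instruction trace."""
    return sum(COST[op] for op in trace)


if __name__ == "__main__":
    print(macromodel_cost(["ADD", "ADD", "JMP"]))   # 4
```

The model is blind to operand values and to what ran previously, which is what the data-dependent and transition-based models try to address.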
Power simulation
Model capacitance in the processor.
Keep track of activity in the processor.
  Requires full simulation.
Activity determines capacitive charge/discharge, which determines power consumption.
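The link from activity to energy can be sketched by counting bit transitions on a bus or register and charging each one a fixed energy. The sketch below uses the common first-order approximation of one half C times V squared per transition on a node of capacitance C; the capacitance, the supply voltage, and the value sequence are hypothetical.

```python
def toggles(values):
    """Number of bit transitions between successive values."""
    return sum(bin(a ^ b).count("1") for a, b in zip(values, values[1:]))


def switching_energy(values, cap_per_bit, vdd):
    """Energy in joules, charging 0.5 * C * Vdd^2 per bit transition."""
    return 0.5 * cap_per_bit * vdd ** 2 * toggles(values)


if __name__ == "__main__":
    bus = [0b0000, 0b1111, 0b1010, 0b1010]
    print(toggles(bus))                                       # 6
    print(switching_energy(bus, cap_per_bit=1e-12, vdd=2.0))  # about 1.2e-11
```

Because the count depends on the actual sequence of values, the simulator has to run the program in full to produce it.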
SimplePower simulator
Cycle-accurate simulator.
  SimpleScalar-style cycle-accurate simulator.
Transition-based power analysis.
  Estimates energy of data path, memory, and busses on every clock cycle.
RTL power estimation interface
A power estimator is required for each functional unit modeled in the simulator.
  Functional interface makes the simulator more modular.
Power estimator takes the same arguments as the performance simulation module.
Switch capacitance tables
Model functional units such as ALU, register files, multiplexers, etc.
Capture technology-dependent capacitance of the unit.
Two types of model:
  Bit-independent: each bit is independent, model is one bit wide.
  Bit-dependent: bits interact (as in adder), model must be multiple bits.
Analytical models used for memories.
Adder model is built from sub-model for adder slice.
Wattch power simulator
Built on top of SimpleScalar.
Adds parameterized power models for the functional units.
Array model
Analytical model:
  Decoder.
  Wordline drive.
  Bitline discharge.
  Sense amp output.
Register file word line capacitance:
  Cdiff(word line driver) + Cgate(cell access) * nbit_lines + Cmetal * Word_line_length
Bus, function unit models
Bus model based upon length of bus, capacitance of bus lines.
Models for ALUs, etc. based upon transition models.
Clock network power model
Clock is a major power sink in modern designs.
Major elements of the clock power model:
  Global clock lines.
  Global drivers.
  Loads on the clock network.
Must handle gated clocks.