accurate analytical modeling of superscalar processors j. e. smith tejas karkhanis

Accurate Analytical Modeling of Superscalar Processors

J. E. Smith

Tejas Karkhanis

October 27, 2003. Copyright J. E. Smith, 2003

Superscalar Processor Evaluation

Processors typically evaluated via simulation
• Highly detailed simulator
• Many cycles of simulation
• Has a black box character -- provides little insight

Workload Implications
• All workload characteristics are needed for detailed simulation, BUT not all are critical for determining performance
• Workload space limited to specific benchmarks

Alternative Approach: Use an analytical model


Analytical Approach

Analytical Model driven by relevant benchmark properties

Helps isolate important workload characteristics

• If performance estimate is accurate then workload characteristics must be the important ones

Workload characteristics can be varied over a “workload space”

• Apply characteristics directly by short-circuiting simulation

[Figure: block diagram with boxes labeled Benchmarks, Functional Simulator, Extract Relevant Program Properties, Analytical Model, and Performance/Power estimates]


Basis for Model

Consider profile of dynamic instructions issued per cycle:

Background constant IPC
• With never-ending series of transient events

Determine performance with ideal caches & predictors, then account for transient events

[Figure: IPC versus time, with transient events marked: branch mispredicts, i-cache miss, long d-cache miss]


IBID Model

Based on generic superscalar processor

Useful for reasoning about transient events

[Figure: IBID model diagram. Recoverable labels: Ifetch, B, I, P, D; start, stop, and empty controls; mispredict; Icache miss; Long Dcache miss; Size pipe, Size window, Size ROB; IW characteristic]


Series/Parallel Performance Penalties

Branch Misprediction and I-Cache Miss penalties “serialize”
• i.e. penalties add linearly

Long D-Cache Misses may overlap with I-cache and B-predict misses (and with each other)
• Overlap with other long D-cache misses more important
• Short D-cache misses handled differently (later)

[Figure: diagram with elements labeled Branch Mispredicts, I-Cache Misses, and Long D-Cache Misses]


Validating Series/Parallel Model

[Figure: bar chart of IPC (0 to 4) with bars labeled Combined, Independent, and Overlaps Compensated]

Combined: simulated performance with realistic caches/predictor

Independent: ideal performance minus individually determined performance losses

Overlap Compensated: account for overlaps w/ D-cache misses

4-way issue, 48 window, 128 ROB
16K I-cache and D-Cache
8K gshare branch predictor

[Figure: processor block diagram. Blocks: I-cache, Branch Predict, Decode Pipeline, Issue Buffer, three Exec. Units, Data Cache, MSHRs, Reorder Buffer, Register File, Store Q. Parameter labels: mispredict rate, miss rate, # entries, # stages, # values, # and type of units, unit latencies; path widths f, d, i]


IW Characteristic

Key Result (Michaud, Seznec, Jourdan):
• Square Root relationship between Issue Rate and Window size: $I \propto \sqrt{W}$
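A minimal sketch of how the square-root characteristic can be evaluated, assuming a proportionality constant `alpha` of 1 and an exponent `beta` of 0.5. Both are illustrative parameters, not values taken from the slides, and the function name `issue_rate` is hypothetical. The result is clipped at the machine's issue width.

```python
def issue_rate(window_size, issue_width, alpha=1.0, beta=0.5):
    """Steady-state issue rate from an IW characteristic I = alpha * W**beta.

    alpha and beta are illustrative assumptions; beta = 0.5 gives the
    square-root relationship. The result cannot exceed the issue width.
    """
    return min(float(issue_width), alpha * window_size ** beta)


if __name__ == "__main__":
    # Print the modeled issue rate for a few window sizes on a 4-wide machine.
    for w in (4, 8, 16, 32, 64):
        print(w, round(issue_rate(w, issue_width=4), 2))
```

With these assumed parameters, a 4-wide machine saturates once the window holds about 16 instructions; larger windows help only if the issue width grows as well.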


Similar Experiment

Ideal caches, predictor
Efficient I fetch keeps window full
Graph issue rate I, as a fcn of window size W
Straight lines on log log graph

[Figure: lg(IPC) versus lg(window size) for bzip, crafty, eon, gap, gcc, gzip, mcf, parser, perl, twolf, vortex, vpr; annotated with $I \propto \sqrt{W}$]
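A straight line on a log-log graph corresponds to a power law, I = alpha * W^beta, whose slope is beta. The sketch below fits those two parameters by ordinary least squares on the log-transformed points. The function name `fit_power_law` is hypothetical, and the data points are synthetic, generated from I = sqrt(W) purely to exercise the code; they are not measurements from the slides.

```python
import math


def fit_power_law(window_sizes, issue_rates):
    """Least-squares fit of lg(I) = lg(alpha) + beta * lg(W).

    Returns (alpha, beta). Needs at least two distinct window sizes.
    """
    xs = [math.log2(w) for w in window_sizes]
    ys = [math.log2(i) for i in issue_rates]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    beta = sxy / sxx                      # slope of the log-log line
    alpha = 2 ** (mean_y - beta * mean_x)  # intercept, converted back from lg
    return alpha, beta


# Synthetic points generated from I = sqrt(W); not measured data.
windows = [8, 16, 32, 64, 128]
rates = [math.sqrt(w) for w in windows]
alpha, beta = fit_power_law(windows, rates)
```

A fitted slope near 0.5 would indicate the square-root behavior; a benchmark with a different slope would still fit the same power-law form with its own beta.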


IW Characteristic

Allows determination of “background” IPC

Allows evaluation of transients to determine penalties

[Figure: IPC versus time, with transient events marked: branch mispredicts, i-cache miss, long d-cache miss]


Transient #1: Branch Mispredictions

Typical behavior

[Figure: issue-rate timeline around a branch misprediction. Labels: steady state; mispredicted branch enters window; misspeculated instructions; misprediction detected; flush pipeline; re-fill pipeline; instructions re-enter window; issue ramps back up to steady state]


Branch Misprediction Penalty

1) lost opportunity
• performance lost by issuing soon-to-be flushed instructions

2) pipeline re-fill penalty
• obvious penalty; most people equate this with the penalty

3) window fill penalty
• performance lost due to window startup

[Figure: penalty timeline divided into lost opportunity, pipeline re-fill, and window fill]


Use Sqrt Model

[Figure: IPC (0 to 4) versus clock cycle (0 to 24)]
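The window-fill ramp that the square-root model predicts after a flush can be approximated in a few lines. This is a rough sketch under stated assumptions, not the authors' model: dispatch is assumed to deliver `issue_width` instructions per cycle into an initially empty window, and the issue rate in each cycle is taken as the square root of the current occupancy, capped at the issue width. The function name `ramp_up` is hypothetical.

```python
import math


def ramp_up(issue_width, window_size, cycles):
    """Per-cycle issue rate as an empty window refills.

    Assumptions (not from the slides): dispatch adds issue_width
    instructions per cycle, and the issue rate in a cycle is
    sqrt(occupancy), capped at the issue width.
    """
    occupancy = 0.0
    rates = []
    for _ in range(cycles):
        # Dispatch new instructions, limited by the window capacity.
        occupancy = min(float(window_size), occupancy + issue_width)
        # Issue according to the square-root characteristic.
        rate = min(float(issue_width), math.sqrt(occupancy))
        occupancy -= rate
        rates.append(rate)
    return rates


rates = ramp_up(issue_width=4, window_size=48, cycles=24)
```

Under these assumptions the issue rate starts well below the peak and approaches it only gradually, which is the window-fill component of the misprediction penalty.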


Experimental Data

[Figure: GCC. Issue rate (ONLY useful), 0 to 4, versus cycle after the mispredict, 0 to 30, for Load Lat = 1 and Load Lat = 2]


Branch Mispredict Penalty

[Figure: bar chart of penalty in cycles (0 to 16) for short and long pipelines]

short pipeline = 5 stages before issue
long pipeline = 10 stages before issue

Insight from analytical model: Penalty from drain/fill is significant

Insight from analytical model: Penalty similar across all benchmarks for a given pipeline length


Implication of Wider Pipes

Assume 1 mispredict every 96 instructions
• E.g. SPEC benchmark crafty with 4K gshare
• Graph full mispredict “cycle”

[Figure: IPC (0 to 7) versus clock cycle (0 to 60) for issue=8, issue=4, and issue=2]

Issue=8 gives very modest improvement vs issue=4 (window never full enough to issue 8)

Issue=4 barely reaches peak performance


Importance of Branch Prediction

[Figure: instructions between mispredictions (0 to 1800) versus percent time (10 to 50) at 3.5, 7, 14 issues per cycle, for issue width 4, issue width 8, and issue width 16]

Insight: Doubling issue width means predictor has to be four times better for similar performance profile (issue efficiency)
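One way to read this insight, offered as an interpretation rather than the authors' derivation, assumes the square-root IW characteristic with an illustrative constant $\alpha$: the number of instructions that must be in the window to sustain issue rate $I$ grows as $I^2$, so the run of correctly predicted instructions has to grow by the same factor to keep the window equally full.

```latex
I = \alpha \sqrt{W}
\quad\Longrightarrow\quad
W(I) = \left(\frac{I}{\alpha}\right)^{2},
\qquad
\frac{W(2I)}{W(I)} = \frac{(2I/\alpha)^{2}}{(I/\alpha)^{2}} = 4
```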


Implication of Deeper Pipelines

Assume 1 misprediction per 96 instructions

Vary fetch/decode/rename section of pipe

Advantage of wide issue diminishes as pipe deepens

Pentium 4 decode pipe depth = 15 & issue width = 3

[Figure: IPC (0 to 5) versus Fetch/Decode Pipe Length (0 to 16) for issue=8, issue=4, and issue=2]


Transient #2: I-Cache Misses

[Figure: issue-rate timeline around an I-cache miss. Labels: steady state; cache miss occurs; instructions buffered in decode pipe; window drains; miss delay; instructions fill decode pipe; instructions re-enter window; steady state]


I-cache miss penalty

Penalty = Miss delay (L2 or memory latency)
          minus window drain
          plus window re-fill penalty

Instructions buffered in window offset the re-fill penalty

Insight: penalty is independent of pipeline length.

Instructions buffered in pipe compensate for pipe re-fill


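I-cache miss penalty

Estimated i-cache penalty:
• for n consecutive (clustered) misses:

$$\text{Avg. miss penalty} = \frac{\text{miss delay} - \text{drain} + \text{fill} + (n-1)(\text{miss delay} - 1)}{n} \approx \text{miss delay} - 1 + \frac{1}{n}$$

(the approximation takes the drain and fill terms as roughly cancelling)

• For isolated miss: ≈ miss delay
• For long cluster: ≈ miss delay - 1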
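The average-penalty formula above can be checked numerically. The sketch below is a direct transcription; the function name `avg_icache_miss_penalty` and the default of equal (zero) drain and fill terms are assumptions made for illustration.

```python
def avg_icache_miss_penalty(miss_delay, n, drain=0.0, fill=0.0):
    """Average penalty per miss for a cluster of n consecutive I-cache misses.

    Implements (miss_delay - drain + fill + (n - 1) * (miss_delay - 1)) / n.
    drain and fill default to 0, i.e. they are assumed to cancel.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    total = miss_delay - drain + fill + (n - 1) * (miss_delay - 1)
    return total / n


if __name__ == "__main__":
    # With a 10-cycle miss delay, the average moves from 10 toward 9
    # as the cluster of misses gets longer.
    for n in (1, 2, 4, 100):
        print(n, avg_icache_miss_penalty(10, n))
```

When drain and fill cancel, the result equals miss delay - 1 + 1/n, so an isolated miss costs the full miss delay and a long cluster approaches miss delay - 1 per miss.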


Independence from Pipe Length

16 K I-cache; ideal D-cache and predictor
Two different pipeline lengths (4 and 8 cycles)
I-cache miss delay 10 cycles
Penalty independent of pipe length
Similar across benchmarks

[Figure: bar chart of penalty (0 to 12) for pipelen=4 and pipelen=8]


Reducing Miss Penalty – I-Caches

Add Ifetch buffer
• Overlaps execution with miss handling
• Bypassed by miss instructions

To be effective, should be enhanced with high fetch bandwidth
• greater than issue width

[Figure: I-Cache, Decoupling Buffer, and Decode Pipeline, with width labels n, 2n, n. Timeline labels: steady state; cache miss occurs; instructions buffered in decode pipe; instructions fill decode pipe. Annotations: “Increases this” and “Without increasing this”]


Transient #3: D-Cache Misses

More complex than front-end miss events
• Branch mispredict and icache misses block I-fetch
• Data cache misses can be handled in parallel with I-fetch and execution

Divide into:
• Short misses – handle like long latency functional unit
• Long misses – get special treatment


D-cache long miss penalty

Three things can reduce performance

1) Structural hazard
   ROB fills up behind load (or inst dependent on load) and dispatch stalls

2) Data dependences
   Instructions dependent on load pile up and stall window

3) Control dependences
   Mispredicted branch dependent on load data
   Instructions beyond branch wasted


ROB Blockage

Experiment:
• Window size 32, Issue width 4, ROB size 64
• Ideal branch prediction
• Cache miss delay 1000 cycles
• Simulate sampled, isolated cache misses and see what happens


Results

Benchmark   Avg. # insts issued   # insts in window   Fract. samples
            after miss            after miss          where ROB fills
Bzip2       44.1                  13.1                1.0
Crafty      44.6                   9.6                0.9
Eon         55.2                   6.0                1.0
Gap         56.8                  10.7                1.0
Gcc         51.7                   8.2                0.9
Mcf         55.8                   5.5                0.9
Parser      44.2                   7.4                1.0
Twolf       49.6                  12.9                0.8
Vortex      49.7                   3.5                1.0
Vpr         27.0                  16.9                0.6

Full ROB stalls most of the time

Relatively few dependent instructions pile up in window


D-Cache Miss Penalty

For typical ROBs, data and control dependences are not limiters – assume structural (ROB) stall:

if load at tail of window:

Penalty = Miss delay - ROB fill - window drain + ramp-up ≈ Miss delay - ROB fill

if load at head of window:

Penalty = Miss delay - window drain + ramp-up ≈ Miss delay

If second long load miss is within ROB distance of first, then penalty is completely overlapped
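The overlap rule for long misses can be sketched as a single pass over the dynamic instruction positions of the missing loads. The function name `count_penalized_misses` is hypothetical, and one interpretation is assumed: "within ROB distance" is measured from the first (penalty-bearing) miss of each group, not from the immediately preceding miss.

```python
def count_penalized_misses(miss_positions, rob_size):
    """Count long D-cache misses whose penalty is NOT overlapped.

    miss_positions: dynamic instruction indices of long load misses.
    Assumption: a miss is fully overlapped when it lies fewer than
    rob_size instructions after the leading miss of its group.
    """
    penalized = 0
    leader = None  # position of the most recent penalty-bearing miss
    for pos in sorted(miss_positions):
        if leader is None or pos - leader >= rob_size:
            penalized += 1
            leader = pos
    return penalized


if __name__ == "__main__":
    # Misses at 10 and 250 fall within 128 instructions of the misses
    # at 0 and 200, so only three of the five misses pay a penalty.
    print(count_penalized_misses([0, 10, 200, 250, 400], rob_size=128))
```

Measuring distance from the preceding miss instead would chain overlapping misses into longer groups; which reading matches the model more closely is not settled by the slides.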


Transient #3: D-Cache Misses

[Figure: issue-rate timeline around a long D-cache miss. Labels: steady state; d cache miss occurs; ROB fills; window drains; miss data returns; commit resumes, issue ramps to steady state; span labeled miss delay]


Reducing Miss Penalty – D-Caches

Enlarge ROB, Window, Rename Values
• Overlap miss delay with execution

[Figure: issue-rate timeline with a larger ROB. Labels: steady state; d cache miss occurs; window drains independent insts.; ROB fills; ROB full; miss data returns; commit resumes; steady state; span labeled miss delay]


Put it together

Issue width 4, window size 48 => peak CPI
8 cycle L1 I cache miss delay
200 cycle L2 cache miss delay (both I and D)
6.4 cycle branch mispredict delay (4 in pipeline)

Performance (cycles)
= #insts * peak CPI
+ (total # br mispredicts - mispredicts w/in ROB size of long miss) * penalty
+ (total # Icache misses - misses w/in ROB size of long miss) * penalty
+ (total # long misses - long misses w/in ROB size of long miss) * penalty
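The cycle estimate is a sum of a background term and one term per event type, which a short function makes concrete. Everything numeric in the example below is hypothetical except the three delays quoted on the slide (6.4, 8, and 200 cycles): the event counts are invented for illustration, and the peak CPI of 0.25 is an assumption (a 4-wide machine issuing at full width), since the slide does not state the value. The function name `estimated_cycles` is also hypothetical.

```python
def estimated_cycles(n_insts, peak_cpi, events):
    """Background cycles plus penalties for non-overlapped miss events.

    events: iterable of (total_count, overlapped_count, penalty) tuples,
    where overlapped_count is the number of events falling within ROB
    size of a long miss (and therefore hidden by it).
    """
    cycles = n_insts * peak_cpi
    for total, overlapped, penalty in events:
        cycles += (total - overlapped) * penalty
    return cycles


# Hypothetical event counts; penalties are the delays quoted on the slide.
example_events = [
    (10_000, 500, 6.4),   # branch mispredictions
    (2_000, 100, 8.0),    # L1 I-cache misses
    (1_000, 300, 200.0),  # long (L2) misses
]
total = estimated_cycles(1_000_000, 0.25, example_events)
```

Passing the event terms as a list keeps the function independent of how many miss categories a particular study distinguishes.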


Compare with Detailed Simulation

Very accurate

Greatest inaccuracy from Dcache long misses

[Figure: bar chart of IPC (0 to 4), Detailed Simulation versus Analytical model]


Important Workload Characteristics

[Figure: CPI (0 to 1.2) broken down into Ideal, L1 Icache misses, L2 Icache misses, L1 Dcache misses, L2 Dcache misses, and Branch mispredictions]


Conclusions: Key Workload Characteristics

Instruction dependences important:
• For establishing background (ideal) IPC
• Not for performance penalties

All “major” events important
• Branch mispredicts
• I cache misses (both short and long)
• D cache misses (long)

But ONLY “major” events are important in a well-balanced design

Clustering of events only important for D cache misses
• Is miss within ROB distance of preceding miss?


Conclusions: Performance Evaluation

Accurate analytical models can (and should) be developed

Trace driven cache/predictor simulators have an important role

Hybrid analytical/simulation models also should be considered

• Combine real address streams with analytical processor models

• Statistical simulation

If you really need detailed simulation – you’re not doing research, you’re doing development!
