Accurate Analytical Modeling of Superscalar Processors

J. E. Smith

Tejas Karkhanis

October 27, 2003 copyright J. E. Smith, 2003

Superscalar Processor Evaluation

Processors typically evaluated via simulation
• Highly detailed simulator
• Many cycles of simulation
• Has a black box character -- provides little insight

Workload Implications
• All workload characteristics are needed for detailed simulation, BUT not all are critical for determining performance
• Workload space limited to specific benchmarks

Alternative approach: use an analytical model

Analytical Approach

Analytical Model driven by relevant benchmark properties

Helps isolate important workload characteristics

• If performance estimate is accurate then workload characteristics must be the important ones

Workload characteristics can be varied over a “workload space”

• Apply characteristics directly by short-circuiting simulation

[Figure: flow diagram. Benchmarks -> Functional Simulator -> Extract Relevant Program Properties -> Analytical Model -> Performance/Power estimates]

Basis for Model

Consider profile of dynamic instructions issued per cycle:

Background constant IPC
• With never-ending series of transient events

Determine performance with ideal caches & predictors, then account for transient events

[Figure: IPC versus time, with dips labeled branch mispredicts, i-cache miss, and long d-cache miss]

IBID Model

Based on generic superscalar processor
Useful for reasoning about transient events

[Figure: IBID model block diagram. Labels: Ifetch; B, I, P, D; start/stop; empty; mispredict; I-cache miss; long D-cache miss; Size pipe; Size window; Size ROB; IW characteristic]

Series/Parallel Performance Penalties

Branch misprediction and I-cache miss penalties “serialize”
• i.e., penalties add linearly

Long D-cache misses may overlap with I-cache misses and branch mispredictions (and with each other)
• Overlap with other long D-cache misses more important
• Short D-cache misses handled differently (later)

[Figure: series/parallel diagram with Branch Mispredicts, I-Cache Misses, and Long D-Cache Misses]

Validating Series/Parallel Model

[Figure: bar chart of IPC (0 to 4) comparing Combined, Independent, and Overlaps Compensated]

Combined: simulated performance with realistic caches/predictor
Independent: ideal performance minus individually determined performance losses
Overlap Compensated: account for overlaps w/ D-cache misses

4-way issue, 48 window, 128 ROB
16K I-cache and D-cache
8K gshare branch predictor

[Figure: processor block diagram. Blocks: I-cache, Branch Predict, Decode Pipeline, Issue Buffer, Exec. Units, Data Cache, MSHRs, Store Q, Reorder Buffer, Register File. Annotated parameters: mispredict rate, miss rate, # stages, # entries, # values, # and type of units, unit latencies]

IW Characteristic

Key Result (Michaud, Seznec, Jourdan):
• Square root relationship between issue rate I and window size W: I ≈ √W
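The square-root relationship can be sketched in a few lines. This is a minimal illustration only: the power-law form with parameters `alpha` and `beta`, the function name, and the optional cap at the machine's issue width are assumptions for the example, not something stated on the slide.

```python
def issue_rate(window_size, issue_width=None, alpha=1.0, beta=0.5):
    """Approximate steady-state issue rate as alpha * W**beta.

    alpha and beta are hypothetical fit parameters; beta = 0.5 gives the
    square-root relationship. If issue_width is given, the rate is capped
    at the machine's peak issue width.
    """
    rate = alpha * window_size ** beta
    if issue_width is not None:
        rate = min(rate, issue_width)
    return rate


if __name__ == "__main__":
    for w in (8, 16, 32, 64):
        print(w, round(issue_rate(w), 2), issue_rate(w, issue_width=4))
```

Under these assumptions, a 4-wide machine needs roughly 16 useful instructions in the window before the window stops being the limiter.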

Similar Experiment

Ideal caches, predictor
Efficient I-fetch keeps window full
Graph issue rate I as a function of window size W
Straight lines on log-log graph

[Figure: lg(IPC) (0 to 6) versus lg(window size) (3 to 7) for bzip, crafty, eon, gap, gcc, gzip, mcf, parser, perl, twolf, vortex, vpr; annotated I ≈ √W]

IW Characteristic

Allows determination of “background” IPC
Allows evaluation of transients to determine penalties

[Figure: IPC versus time, with dips labeled branch mispredicts, i-cache miss, and long d-cache miss]

Transient #1: Branch Mispredictions

Typical behavior

[Figure: issue rate over time. Labels: steady state; mispredicted branch enters window; misspeculated instructions; misprediction detected; flush pipeline; re-fill pipeline; instructions re-enter window; issue ramps back up to steady state]

Branch Misprediction Penalty

1) Lost opportunity
• performance lost by issuing soon-to-be-flushed instructions

2) Pipeline re-fill penalty
• obvious penalty; most people equate this with the penalty

3) Window fill penalty
• performance lost due to window startup

[Figure: timeline marking lost opportunity, pipeline re-fill, and window fill]

Use Sqrt Model

[Figure: IPC (0 to 4) versus clock cycle (0 to 24)]
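One way to picture how the square-root model produces a ramp after a misprediction is the toy loop below. It is a sketch under stated assumptions, not the authors' procedure: it assumes the front end delivers `issue_width` instructions per cycle into an empty window, and that each cycle the machine issues the smaller of `issue_width` and the square root of the window occupancy. The function name and default values are hypothetical.

```python
import math


def ramp_up(issue_width=4, window_size=48, cycles=24):
    """Per-cycle issue rate while the window refills from empty.

    Assumed model: dispatch issue_width instructions per cycle, then
    issue min(issue_width, sqrt(occupancy)) of them.
    """
    occupancy = 0.0
    rates = []
    for _ in range(cycles):
        # Front end refills the window (bounded by its capacity).
        occupancy = min(window_size, occupancy + issue_width)
        # Square-root characteristic limits how many can issue.
        rate = min(issue_width, math.sqrt(occupancy))
        occupancy -= rate
        rates.append(rate)
    return rates


if __name__ == "__main__":
    for cycle, rate in enumerate(ramp_up(cycles=12)):
        print(cycle, round(rate, 2))
```

In this toy model the issue rate starts well below the peak and climbs gradually, which is the window fill penalty in miniature.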

Experimental Data

[Figure: GCC. Issue rate (ONLY useful), 0 to 4, versus cycle after the mispredict, 0 to 30, for Load Lat = 1 and Load Lat = 2]

Branch Mispredict Penalty

[Figure: penalty in cycles (0 to 16) for short and long pipelines]

short pipeline = 5 stages before issue
long pipeline = 10 stages before issue

Insight from analytical model: Penalty from drain/fill is significant

Insight from analytical model: Penalty similar across all benchmarks for a given pipeline length

Implication of Wider Pipes

Assume 1 mispredict every 96 instructions
• E.g. SPEC benchmark crafty with 4K gshare
• Graph full mispredict “cycle”

[Figure: IPC (0 to 7) versus clock cycle (0 to 60) for issue=8, issue=4, issue=2]

Issue=8 gives very modest improvement vs issue=4 (window never full enough to issue 8)

Issue=4 barely reaches peak performance

Importance of Branch Prediction

[Figure: instructions between mispredictions (0 to 1800) versus percent time at 3.5, 7, 14 issues per cycle (10 to 50), for issue width 4, issue width 8, and issue width 16]

Insight: Doubling issue width means predictor has to be four times better for similar performance profile (issue efficiency)

Implication of Deeper Pipelines

Assume 1 misprediction per 96 instructions
Vary fetch/decode/rename section of pipe

Advantage of wide issue diminishes as pipe deepens

Pentium 4 decode pipe depth = 15 & issue width = 3

[Figure: IPC (0 to 5) versus fetch/decode pipe length (0 to 16) for issue=8, issue=4, issue=2]

Transient #2: I-Cache Misses

[Figure: issue rate over time. Labels: steady state; cache miss occurs; instructions buffered in decode pipe; window drains; miss delay; instructions fill decode pipe; instructions re-enter window; issue ramps back up to steady state]

I-cache miss penalty

Penalty = miss delay (L2 or memory latency)
minus window drain
plus window re-fill penalty

Instructions buffered in window offset re-fill penalty

Insight: penalty is independent of pipeline length.

Instructions buffered in pipe compensate for pipe re-fill

I-cache miss penalty

Estimated i-cache penalty:
• For n consecutive (clustered) misses:

Avg. miss penalty
= (miss delay - drain + fill + (n-1)(miss delay - 1))/n
≈ miss delay - 1 + 1/n

• For isolated miss: ≈ miss delay
• For long cluster: ≈ miss delay - 1
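The clustered-miss average above can be checked numerically. The sketch below evaluates the full expression and the simplified form; the function names are hypothetical, and the simplification is exact only when the window drain and fill terms cancel, which is assumed here by defaulting both to zero.

```python
def avg_icache_miss_penalty(miss_delay, n, drain=0.0, fill=0.0):
    """Average penalty per miss for a cluster of n consecutive I-cache misses.

    Implements (miss_delay - drain + fill + (n - 1) * (miss_delay - 1)) / n.
    drain and fill default to 0, i.e. they are assumed to cancel.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    return (miss_delay - drain + fill + (n - 1) * (miss_delay - 1)) / n


def approx_icache_miss_penalty(miss_delay, n):
    """Simplified form: miss_delay - 1 + 1/n."""
    return miss_delay - 1 + 1 / n


if __name__ == "__main__":
    for n in (1, 2, 4, 16):
        print(n, avg_icache_miss_penalty(10, n), approx_icache_miss_penalty(10, n))
```

With a 10-cycle miss delay, an isolated miss costs 10 cycles and a long cluster approaches 9 cycles per miss, matching the two limiting cases listed above.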

Independence from Pipe Length

16K I-cache; ideal D-cache and predictor
Two different pipeline lengths (4 and 8 cycles)
I-cache miss delay 10 cycles
Penalty independent of pipe length
Similar across benchmarks

[Figure: I-cache miss penalty (0 to 12) for pipelen=4 and pipelen=8]

Reducing Miss Penalty: I-Caches

Add Ifetch buffer
• Overlaps execution with miss handling
• Bypassed by miss instructions

To be effective, should be enhanced with high fetch bandwidth
• greater than issue width

[Figure: I-Cache, Decoupling Buffer, and Decode Pipeline, with width labels n and 2n; timeline with steady state, cache miss occurs, instructions buffered in decode pipe, instructions fill decode pipe; callouts "Increases this" and "Without increasing this"]

Transient #3: D-Cache Misses

More complex than front-end miss events
• Branch mispredicts and I-cache misses block I-fetch
• Data cache misses can be handled in parallel with I-fetch and execution

Divide into:
• Short misses: handle like long latency functional unit
• Long misses: get special treatment

D-cache long miss penalty

Three things can reduce performance

1) Structural hazard
ROB fills up behind load (or instruction dependent on load) and dispatch stalls

2) Data dependences
Instructions dependent on load pile up and stall window

3) Control dependences
Mispredicted branch dependent on load data
Instructions beyond branch wasted

ROB Blockage

Experiment:
• Window size 32, issue width 4, ROB size 64
• Ideal branch prediction
• Cache miss delay 1000 cycles
• Simulate sampled, isolated cache misses and see what happens

Results

Benchmark   Avg. # insts issued   # insts in window   Fraction of samples
            after miss            after miss          where ROB fills
Bzip2       44.1                  13.1                1.0
Crafty      44.6                   9.6                0.9
Eon         55.2                   6.0                1.0
Gap         56.8                  10.7                1.0
Gcc         51.7                   8.2                0.9
Mcf         55.8                   5.5                0.9
Parser      44.2                   7.4                1.0
Twolf       49.6                  12.9                0.8
Vortex      49.7                   3.5                1.0
Vpr         27.0                  16.9                0.6

Full ROB stalls most of the time
Relatively few dependent instructions pile up in window

D-Cache Miss Penalty

For typical ROBs, data and control dependences are not limiters; assume structural (ROB) stall:

If load at tail of window:
Penalty = miss delay minus ROB fill and window drain, plus ramp-up ≈ miss delay minus ROB fill

If load at head of window:
Penalty = miss delay minus window drain, plus ramp-up ≈ miss delay

If second long load miss is within ROB distance of first, then penalty is completely overlapped
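The overlap rule can be applied to a trace of long-miss positions to count how many misses actually expose a penalty. The sketch below is illustrative only: the function name and input format (dynamic instruction indices of long load misses) are hypothetical, and it assumes distance is measured from the first miss of each overlapping group rather than from the immediately preceding miss.

```python
def count_exposed_long_misses(miss_positions, rob_size):
    """Count long D-cache misses whose penalty is not overlapped.

    miss_positions: dynamic instruction indices of long load misses.
    A miss within rob_size instructions of the current group's leading
    miss is treated as completely overlapped (assumed interpretation).
    """
    exposed = 0
    leader = None  # position of the miss that opened the current group
    for pos in sorted(miss_positions):
        if leader is None or pos - leader >= rob_size:
            exposed += 1
            leader = pos
    return exposed


if __name__ == "__main__":
    trace = [0, 10, 200, 250, 400]
    print(count_exposed_long_misses(trace, rob_size=128))
```

Multiplying the exposed count by the per-miss penalty then gives the long-miss contribution to total cycles under this interpretation.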

Transient #3: D-Cache Misses

[Figure: issue rate over time. Labels: steady state; d-cache miss occurs; ROB fills; window drains; miss delay; miss data returns; commit resumes; issue ramps to steady state]

Reducing Miss Penalty: D-Caches

Enlarge ROB, window, rename values
• Overlap miss delay with execution

[Figure: issue rate over time. Labels: steady state; d-cache miss occurs; ROB fills; window drains independent insts.; ROB full; miss delay; miss data returns; commit resumes; issue ramps back up to steady state]

Put it together

Issue width 4, window size 48 => peak CPI
8 cycle L1 I-cache miss delay
200 cycle L2 cache miss delay (both I and D)
6.4 cycle branch mispredict delay (4 in pipeline)

Performance (cycles)
= #insts * peak CPI
+ (total # branch mispredicts - mispredicts within ROB size of long miss) * penalty
+ (total # I-cache misses - misses within ROB size of long miss) * penalty
+ (total # long misses - long misses within ROB size of long miss) * penalty
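The cycle estimate above is a sum of a background term and one term per event class, so it is straightforward to sketch. The names below (`EventCounts`, `estimate_cycles`) and all event counts in the example are hypothetical. The penalties 6.4, 8, and 200 are the delays listed on this slide; the peak CPI of 0.25 is an assumed illustrative value, since the slide does not give the number.

```python
from dataclasses import dataclass


@dataclass
class EventCounts:
    total: int        # all events of this class
    overlapped: int   # events within ROB size of a long miss


def estimate_cycles(num_insts, peak_cpi, events):
    """Background cycles plus penalties for non-overlapped events.

    events: iterable of (EventCounts, penalty_in_cycles) pairs, one per
    event class (branch mispredicts, I-cache misses, long D-cache misses).
    """
    cycles = num_insts * peak_cpi
    for counts, penalty in events:
        cycles += (counts.total - counts.overlapped) * penalty
    return cycles


if __name__ == "__main__":
    # All counts are made up for illustration; peak_cpi is an assumption.
    cycles = estimate_cycles(
        num_insts=1_000_000,
        peak_cpi=0.25,
        events=[
            (EventCounts(total=10_000, overlapped=500), 6.4),   # branch mispredicts
            (EventCounts(total=2_000, overlapped=100), 8),      # L1 I-cache misses
            (EventCounts(total=1_000, overlapped=300), 200),    # long (L2) misses
        ],
    )
    print("estimated cycles:", cycles, "CPI:", cycles / 1_000_000)
```

Keeping each event class as a (counts, penalty) pair makes it easy to add or drop a class without changing the function.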

Compare with Detailed Simulation

Very accurate
Greatest inaccuracy from D-cache long misses

[Figure: IPC (0 to 4), detailed simulation versus analytical model]

Important Workload Characteristics

[Figure: CPI (0 to 1.2) broken down into Ideal, L1 Icache misses, L2 Icache misses, L1 Dcache misses, L2 Dcache misses, and Branch mispredictions]

Conclusions: Key Workload Characteristics

Instruction dependences important:
• For establishing background (ideal) IPC
• Not for performance penalties

All “major” events important
• Branch mispredicts
• I-cache misses (both short and long)
• D-cache misses (long)

But ONLY “major” events are important in a well-balanced design

Clustering of events only important for D-cache misses
• Is miss within ROB distance of preceding miss?

Conclusions: Performance Evaluation

Accurate analytical models can (and should) be developed

Trace driven cache/predictor simulators have an important role

Hybrid analytical/simulation models also should be considered

• Combine real address streams with analytical processor models

• Statistical simulation

If you really need detailed simulation – you’re not doing research, you’re doing development!