![Page 1: Lecture 15: Efficient Runahead Executionece740/f11/lib/exe/... · 15-740/18-740 Computer Architecture Lecture 15: Efficient Runahead Execution Prof. Onur Mutlu Carnegie Mellon University](https://reader030.vdocuments.site/reader030/viewer/2022041100/5ed81efb0fa3e705ec0ddfe9/html5/thumbnails/1.jpg)
15-740/18-740
Computer Architecture
Lecture 15: Efficient Runahead Execution
Prof. Onur Mutlu
Carnegie Mellon University
Fall 2011, 10/14/2011
![Page 2: Lecture 15: Efficient Runahead Executionece740/f11/lib/exe/... · 15-740/18-740 Computer Architecture Lecture 15: Efficient Runahead Execution Prof. Onur Mutlu Carnegie Mellon University](https://reader030.vdocuments.site/reader030/viewer/2022041100/5ed81efb0fa3e705ec0ddfe9/html5/thumbnails/2.jpg)
Review Set
• Due next Wednesday
  • Wilkes, “Slave Memories and Dynamic Storage Allocation,” IEEE Trans. on Electronic Computers, 1965.
  • Jouppi, “Improving Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers,” ISCA 1990.
• Recommended:
  • Hennessy and Patterson, Appendix C.2 and C.3
  • Liptay, “Structural aspects of the System/360 Model 85 II: the cache,” IBM Systems Journal, 1968.
  • Qureshi et al., “A Case for MLP-Aware Cache Replacement,” ISCA 2006.
![Page 3: Lecture 15: Efficient Runahead Executionece740/f11/lib/exe/... · 15-740/18-740 Computer Architecture Lecture 15: Efficient Runahead Execution Prof. Onur Mutlu Carnegie Mellon University](https://reader030.vdocuments.site/reader030/viewer/2022041100/5ed81efb0fa3e705ec0ddfe9/html5/thumbnails/3.jpg)
Announcements
• Milestone I
  • Due this Friday (Oct 14)
  • Format: 2 pages
  • Include results from your initial evaluations. We need to see good progress.
  • Of course, the results need to make sense (i.e., you should be able to explain them)
• Midterm I
  • October 24
• Milestone II
  • Will be postponed. Stay tuned.
![Page 4: Lecture 15: Efficient Runahead Executionece740/f11/lib/exe/... · 15-740/18-740 Computer Architecture Lecture 15: Efficient Runahead Execution Prof. Onur Mutlu Carnegie Mellon University](https://reader030.vdocuments.site/reader030/viewer/2022041100/5ed81efb0fa3e705ec0ddfe9/html5/thumbnails/4.jpg)
Course Feedback
• I have read them
• Fill out the form and return it, if you have not done so
![Page 5: Lecture 15: Efficient Runahead Executionece740/f11/lib/exe/... · 15-740/18-740 Computer Architecture Lecture 15: Efficient Runahead Execution Prof. Onur Mutlu Carnegie Mellon University](https://reader030.vdocuments.site/reader030/viewer/2022041100/5ed81efb0fa3e705ec0ddfe9/html5/thumbnails/5.jpg)
Last Lecture
• More issues in load-related instruction scheduling
• Better utilizing the instruction window
• Runahead execution
![Page 6: Lecture 15: Efficient Runahead Executionece740/f11/lib/exe/... · 15-740/18-740 Computer Architecture Lecture 15: Efficient Runahead Execution Prof. Onur Mutlu Carnegie Mellon University](https://reader030.vdocuments.site/reader030/viewer/2022041100/5ed81efb0fa3e705ec0ddfe9/html5/thumbnails/6.jpg)
Today
• More on runahead execution
![Page 7: Lecture 15: Efficient Runahead Executionece740/f11/lib/exe/... · 15-740/18-740 Computer Architecture Lecture 15: Efficient Runahead Execution Prof. Onur Mutlu Carnegie Mellon University](https://reader030.vdocuments.site/reader030/viewer/2022041100/5ed81efb0fa3e705ec0ddfe9/html5/thumbnails/7.jpg)
Efficient Scaling of Instruction Window Size
• One of the major research issues in out-of-order execution
• How to achieve the benefits of a large window with a small one (or in a simpler way)?
  • Runahead execution?
    • Upon L2 miss, checkpoint architectural state, speculatively execute only for prefetching, re-execute when data is ready
  • Continual flow pipelines?
    • Upon L2 miss, deallocate everything belonging to an L2-miss dependent, reallocate/re-rename and re-execute when data is ready
  • Dual-core execution?
    • One core runs ahead and does not stall on L2 misses, feeds another core that commits instructions
![Page 8: Lecture 15: Efficient Runahead Executionece740/f11/lib/exe/... · 15-740/18-740 Computer Architecture Lecture 15: Efficient Runahead Execution Prof. Onur Mutlu Carnegie Mellon University](https://reader030.vdocuments.site/reader030/viewer/2022041100/5ed81efb0fa3e705ec0ddfe9/html5/thumbnails/8.jpg)
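Runahead Example
[Figure: execution timelines for three cases.]
• Perfect Caches: Load 1 Hit and Load 2 Hit; compute proceeds without stalls.
• Small Window: Compute, Load 1 Miss, stall during Miss 1, compute, Load 2 Miss, stall during Miss 2, compute.
• Runahead: Compute, Load 1 Miss, runahead during Miss 1 (Load 2 Miss is generated, so Miss 2 overlaps Miss 1), then Load 1 Hit and Load 2 Hit, compute. The overlap appears as saved cycles relative to the small window.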
![Page 9: Lecture 15: Efficient Runahead Executionece740/f11/lib/exe/... · 15-740/18-740 Computer Architecture Lecture 15: Efficient Runahead Execution Prof. Onur Mutlu Carnegie Mellon University](https://reader030.vdocuments.site/reader030/viewer/2022041100/5ed81efb0fa3e705ec0ddfe9/html5/thumbnails/9.jpg)
Benefits of Runahead Execution
Instead of stalling during an L2 cache miss:
• Pre-executed loads and stores independent of L2-miss instructions generate very accurate data prefetches:
  • For both regular and irregular access patterns
• Instructions on the predicted program path are prefetched into the instruction/trace cache and L2.
• Hardware prefetcher and branch predictor tables are trained using future access information.
![Page 10: Lecture 15: Efficient Runahead Executionece740/f11/lib/exe/... · 15-740/18-740 Computer Architecture Lecture 15: Efficient Runahead Execution Prof. Onur Mutlu Carnegie Mellon University](https://reader030.vdocuments.site/reader030/viewer/2022041100/5ed81efb0fa3e705ec0ddfe9/html5/thumbnails/10.jpg)
Runahead Execution Pros and Cons
• Advantages:
  + Very accurate prefetches for data/instructions (all cache levels)
    + Follows the program path
  + Simple to implement, most of the hardware is already built in
  + Versus other pre-execution based prefetching mechanisms:
    + Uses the same thread context as main thread, no waste of context
    + No need to construct a pre-execution thread
• Disadvantages/Limitations:
  -- Extra executed instructions
  -- Limited by branch prediction accuracy
  -- Cannot prefetch dependent cache misses. Solution?
  -- Effectiveness limited by available “memory-level parallelism” (MLP)
  -- Prefetch distance limited by memory latency
• Implemented in IBM POWER6, Sun “Rock”
![Page 11: Lecture 15: Efficient Runahead Executionece740/f11/lib/exe/... · 15-740/18-740 Computer Architecture Lecture 15: Efficient Runahead Execution Prof. Onur Mutlu Carnegie Mellon University](https://reader030.vdocuments.site/reader030/viewer/2022041100/5ed81efb0fa3e705ec0ddfe9/html5/thumbnails/11.jpg)
Memory Level Parallelism (MLP)
• Idea: Find and service multiple cache misses in parallel
• Why generate multiple misses?
  • Enables latency tolerance: overlaps latency of different misses
• How to generate multiple misses?
  • Out-of-order execution, multithreading, runahead, prefetching
[Figure: timeline of cache misses A, B, and C over time, with isolated and parallel misses marked.]
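A rough way to see why overlapping misses tolerates latency is to compare total stall time under a deliberately simplified model. The sketch below is an illustration only: the function names, the fixed per-miss latency, and the `issue_gap` parameter are assumptions, not part of the lecture.

```c
#include <assert.h>

/* Simplified model (an assumption for illustration): every miss takes
 * `latency` cycles and the core does no useful work while waiting. */

/* Isolated misses: each miss is serviced only after the previous one
 * returns, so the latencies add up. */
unsigned long stall_cycles_isolated(unsigned n_misses, unsigned long latency)
{
    return (unsigned long)n_misses * latency;
}

/* Parallel misses: miss i is issued `issue_gap` cycles after miss i-1,
 * while the earlier misses are still outstanding. */
unsigned long stall_cycles_parallel(unsigned n_misses, unsigned long latency,
                                    unsigned long issue_gap)
{
    assert(issue_gap <= latency); /* the misses must actually overlap */
    if (n_misses == 0)
        return 0;
    return latency + (unsigned long)(n_misses - 1) * issue_gap;
}
```

Under this model, two 500-cycle misses cost 1000 cycles when isolated but only 510 cycles when the second is issued 10 cycles after the first.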
![Page 12: Lecture 15: Efficient Runahead Executionece740/f11/lib/exe/... · 15-740/18-740 Computer Architecture Lecture 15: Efficient Runahead Execution Prof. Onur Mutlu Carnegie Mellon University](https://reader030.vdocuments.site/reader030/viewer/2022041100/5ed81efb0fa3e705ec0ddfe9/html5/thumbnails/12.jpg)
Memory Latency Tolerance Techniques
• Caching [initially by Wilkes, 1965]
  • Widely used, simple, effective, but inefficient, passive
  • Not all applications/phases exhibit temporal or spatial locality
• Prefetching [initially in IBM 360/91, 1967]
  • Works well for regular memory access patterns
  • Prefetching irregular access patterns is difficult, inaccurate, and hardware-intensive
• Multithreading [initially in CDC 6600, 1964]
  • Works well if there are multiple threads
  • Improving single thread performance using multithreading hardware is an ongoing research effort
• Out-of-order execution [initially by Tomasulo, 1967]
  • Tolerates cache misses that cannot be prefetched
  • Requires extensive hardware resources for tolerating long latencies
![Page 13: Lecture 15: Efficient Runahead Executionece740/f11/lib/exe/... · 15-740/18-740 Computer Architecture Lecture 15: Efficient Runahead Execution Prof. Onur Mutlu Carnegie Mellon University](https://reader030.vdocuments.site/reader030/viewer/2022041100/5ed81efb0fa3e705ec0ddfe9/html5/thumbnails/13.jpg)
Performance of Runahead Execution
[Figure: bar chart of micro-operations per cycle (y-axis 0.0 to 1.3) for the S95, FP00, INT00, WEB, MM, PROD, SERV, and WS benchmark suites and their average (AVG). Four configurations: no prefetcher and no runahead; only prefetcher (baseline); only runahead; prefetcher + runahead. Percentage labels on the chart, in extraction order: 12%, 35%, 13%, 15%, 22%, 12%, 16%, 52%, 22%.]
![Page 14: Lecture 15: Efficient Runahead Executionece740/f11/lib/exe/... · 15-740/18-740 Computer Architecture Lecture 15: Efficient Runahead Execution Prof. Onur Mutlu Carnegie Mellon University](https://reader030.vdocuments.site/reader030/viewer/2022041100/5ed81efb0fa3e705ec0ddfe9/html5/thumbnails/14.jpg)
Runahead Execution vs. Large Windows
[Figure: bar chart of micro-operations per cycle (y-axis 0.0 to 1.5) for S95, FP00, INT00, WEB, MM, PROD, SERV, WS, and AVG. Configurations: 128-entry window (baseline); 128-entry window with runahead; 256-entry window; 384-entry window; 512-entry window.]
![Page 15: Lecture 15: Efficient Runahead Executionece740/f11/lib/exe/... · 15-740/18-740 Computer Architecture Lecture 15: Efficient Runahead Execution Prof. Onur Mutlu Carnegie Mellon University](https://reader030.vdocuments.site/reader030/viewer/2022041100/5ed81efb0fa3e705ec0ddfe9/html5/thumbnails/15.jpg)
Runahead vs. A Real Large Window
• When is one beneficial, when is the other?
• Pros and cons of each
![Page 16: Lecture 15: Efficient Runahead Executionece740/f11/lib/exe/... · 15-740/18-740 Computer Architecture Lecture 15: Efficient Runahead Execution Prof. Onur Mutlu Carnegie Mellon University](https://reader030.vdocuments.site/reader030/viewer/2022041100/5ed81efb0fa3e705ec0ddfe9/html5/thumbnails/16.jpg)
Runahead on In-order vs. Out-of-order
[Figure: bar chart of micro-operations per cycle (y-axis 0.0 to 1.3) for S95, FP00, INT00, WEB, MM, PROD, SERV, WS, and AVG. Configurations: in-order baseline; in-order + runahead; out-of-order baseline; out-of-order + runahead. Percentage labels on the chart, in extraction order: 39%, 50%, 28%, 14%, 20%, 17%, 73%, 73%, 15%, 20%, 47%, 15%, 12%, 22%, 13%, 16%, 23%, 10%.]
![Page 17: Lecture 15: Efficient Runahead Executionece740/f11/lib/exe/... · 15-740/18-740 Computer Architecture Lecture 15: Efficient Runahead Execution Prof. Onur Mutlu Carnegie Mellon University](https://reader030.vdocuments.site/reader030/viewer/2022041100/5ed81efb0fa3e705ec0ddfe9/html5/thumbnails/17.jpg)
Runahead vs. Large Windows (Alpha)
[Figure: two line plots of instructions-per-cycle performance (y-axis 0.00 to 3.00) versus instruction window size (64, 128, 256, 384, 512, 1024, 2048, 4096, 8192), each comparing Baseline and Runahead. First panel: memory latency = 500 cycles. Second panel: memory latency = 1000 cycles.]
![Page 18: Lecture 15: Efficient Runahead Executionece740/f11/lib/exe/... · 15-740/18-740 Computer Architecture Lecture 15: Efficient Runahead Execution Prof. Onur Mutlu Carnegie Mellon University](https://reader030.vdocuments.site/reader030/viewer/2022041100/5ed81efb0fa3e705ec0ddfe9/html5/thumbnails/18.jpg)
In-order vs. Out-of-order Execution (Alpha)
[Figure: line plot of instructions-per-cycle performance (y-axis 0.00 to 3.00) versus memory latency in cycles (100 to 1900 in steps of 200). Curves: OOO+RA, OOO, IO+RA, IO.]
![Page 19: Lecture 15: Efficient Runahead Executionece740/f11/lib/exe/... · 15-740/18-740 Computer Architecture Lecture 15: Efficient Runahead Execution Prof. Onur Mutlu Carnegie Mellon University](https://reader030.vdocuments.site/reader030/viewer/2022041100/5ed81efb0fa3e705ec0ddfe9/html5/thumbnails/19.jpg)
Sun ROCK Cores
• Load miss in L1 cache starts parallelization using 2 HW threads
• Ahead thread
  • Checkpoints state and executes speculatively
  • Instructions independent of load miss are speculatively executed
  • Load miss(es) and dependent instructions are deferred to behind thread
• Behind thread
  • Executes deferred instructions and re-defers them if necessary
• Memory-Level Parallelism (MLP)
  • Run ahead on load miss and generate additional load misses
• Instruction-Level Parallelism (ILP)
  • Ahead and behind threads execute independent instructions from different points in program in parallel
![Page 20: Lecture 15: Efficient Runahead Executionece740/f11/lib/exe/... · 15-740/18-740 Computer Architecture Lecture 15: Efficient Runahead Execution Prof. Onur Mutlu Carnegie Mellon University](https://reader030.vdocuments.site/reader030/viewer/2022041100/5ed81efb0fa3e705ec0ddfe9/html5/thumbnails/20.jpg)
ROCK Pipeline
![Page 21: Lecture 15: Efficient Runahead Executionece740/f11/lib/exe/... · 15-740/18-740 Computer Architecture Lecture 15: Efficient Runahead Execution Prof. Onur Mutlu Carnegie Mellon University](https://reader030.vdocuments.site/reader030/viewer/2022041100/5ed81efb0fa3e705ec0ddfe9/html5/thumbnails/21.jpg)
Effect of Runahead in Sun ROCK
• Chaudhry talk, Aug 2008.
![Page 22: Lecture 15: Efficient Runahead Executionece740/f11/lib/exe/... · 15-740/18-740 Computer Architecture Lecture 15: Efficient Runahead Execution Prof. Onur Mutlu Carnegie Mellon University](https://reader030.vdocuments.site/reader030/viewer/2022041100/5ed81efb0fa3e705ec0ddfe9/html5/thumbnails/22.jpg)
Limitations of the Baseline Runahead Mechanism
• Energy Inefficiency
  • A large number of instructions are speculatively executed
  • Efficient Runahead Execution [ISCA’05, IEEE Micro Top Picks’06]
• Ineffectiveness for pointer-intensive applications
  • Runahead cannot parallelize dependent L2 cache misses
  • Address-Value Delta (AVD) Prediction [MICRO’05, IEEE TC’06]
• Irresolvable branch mispredictions in runahead mode
  • Cannot recover from a mispredicted L2-miss dependent branch
  • Wrong Path Events [MICRO’04]
![Page 23: Lecture 15: Efficient Runahead Executionece740/f11/lib/exe/... · 15-740/18-740 Computer Architecture Lecture 15: Efficient Runahead Execution Prof. Onur Mutlu Carnegie Mellon University](https://reader030.vdocuments.site/reader030/viewer/2022041100/5ed81efb0fa3e705ec0ddfe9/html5/thumbnails/23.jpg)
The Efficiency Problem [ISCA’05]
• A runahead processor pre-executes some instructions speculatively
• Each pre-executed instruction consumes energy
• Runahead execution significantly increases the number of executed instructions, sometimes without providing performance improvement
![Page 24: Lecture 15: Efficient Runahead Executionece740/f11/lib/exe/... · 15-740/18-740 Computer Architecture Lecture 15: Efficient Runahead Execution Prof. Onur Mutlu Carnegie Mellon University](https://reader030.vdocuments.site/reader030/viewer/2022041100/5ed81efb0fa3e705ec0ddfe9/html5/thumbnails/24.jpg)
The Efficiency Problem
[Figure: per-benchmark bar chart (y-axis 0% to 110%) showing % increase in IPC and % increase in executed instructions with runahead execution. Benchmarks: bzip2, crafty, eon, gap, gcc, gzip, mcf, parser, perlbmk, twolf, vortex, vpr, ammp, applu, apsi, art, equake, facerec, fma3d, galgel, lucas, mesa, mgrid, sixtrack, swim, wupwise, and AVG. Labeled values: 235% (a bar that exceeds the axis); 22% and 27% at AVG.]
![Page 25: Lecture 15: Efficient Runahead Executionece740/f11/lib/exe/... · 15-740/18-740 Computer Architecture Lecture 15: Efficient Runahead Execution Prof. Onur Mutlu Carnegie Mellon University](https://reader030.vdocuments.site/reader030/viewer/2022041100/5ed81efb0fa3e705ec0ddfe9/html5/thumbnails/25.jpg)
Causes of Inefficiency
• Short runahead periods
• Overlapping runahead periods
• Useless runahead periods
• Mutlu et al., “Efficient Runahead Execution: Power-Efficient Memory Latency Tolerance,” ISCA 2005, IEEE Micro Top Picks 2006.
![Page 26: Lecture 15: Efficient Runahead Executionece740/f11/lib/exe/... · 15-740/18-740 Computer Architecture Lecture 15: Efficient Runahead Execution Prof. Onur Mutlu Carnegie Mellon University](https://reader030.vdocuments.site/reader030/viewer/2022041100/5ed81efb0fa3e705ec0ddfe9/html5/thumbnails/26.jpg)
Short Runahead Periods
• Processor can initiate runahead mode due to an already in-flight L2 miss generated by
  • the prefetcher, wrong-path, or a previous runahead period
• Short periods
  • are less likely to generate useful L2 misses
  • have high overhead due to the flush penalty at runahead exit
[Figure: timeline: Compute, Load 1 Miss, runahead during Miss 1 in which Load 2 misses (Miss 2); after Load 1 Hit, Load 2 misses again while Miss 2 is already in flight.]
![Page 27: Lecture 15: Efficient Runahead Executionece740/f11/lib/exe/... · 15-740/18-740 Computer Architecture Lecture 15: Efficient Runahead Execution Prof. Onur Mutlu Carnegie Mellon University](https://reader030.vdocuments.site/reader030/viewer/2022041100/5ed81efb0fa3e705ec0ddfe9/html5/thumbnails/27.jpg)
Eliminating Short Periods
• Mechanism to eliminate short periods:
  • Record the number of cycles C an L2 miss has been in flight
  • If C is greater than a threshold T for an L2 miss, disable entry into runahead mode due to that miss
  • T can be determined statically (at design time) or dynamically
  • T=400 for a minimum main memory latency of 500 cycles works well
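The threshold check for short periods can be sketched as a single comparison. This is a minimal illustration, assuming the hardware records the cycle at which each L2 miss was issued; the function name and parameters are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Returns true if the core may enter runahead mode because of an L2 miss
 * that was issued at `issue_cycle`. C is the number of cycles the miss has
 * been in flight; entry is disabled when C is greater than T. */
bool short_period_filter_allows(unsigned long issue_cycle,
                                unsigned long now,
                                unsigned long threshold_t)
{
    assert(now >= issue_cycle);
    unsigned long c = now - issue_cycle; /* cycles in flight so far */
    return c <= threshold_t;
}
```

With T=400, a miss that has been in flight for 50 cycles may trigger runahead, while one that has been in flight for 401 cycles may not, because most of its latency has already elapsed.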
![Page 28: Lecture 15: Efficient Runahead Executionece740/f11/lib/exe/... · 15-740/18-740 Computer Architecture Lecture 15: Efficient Runahead Execution Prof. Onur Mutlu Carnegie Mellon University](https://reader030.vdocuments.site/reader030/viewer/2022041100/5ed81efb0fa3e705ec0ddfe9/html5/thumbnails/28.jpg)
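Overlapping Runahead Periods
• Two runahead periods that execute the same instructions
• Second period is inefficient
[Figure: timeline: Load 1 misses and a runahead period starts during Miss 1; Load 2 is INV in that period. After Load 1 hits, Load 2 misses (Miss 2) and a second runahead period starts. The regions marked OVERLAP are the instructions executed in both periods.]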
![Page 29: Lecture 15: Efficient Runahead Executionece740/f11/lib/exe/... · 15-740/18-740 Computer Architecture Lecture 15: Efficient Runahead Execution Prof. Onur Mutlu Carnegie Mellon University](https://reader030.vdocuments.site/reader030/viewer/2022041100/5ed81efb0fa3e705ec0ddfe9/html5/thumbnails/29.jpg)
Eliminating Overlapping Periods
• Overlapping periods are not necessarily useless
  • The availability of a new data value can result in the generation of useful L2 misses
• But, this does not happen often enough
• Mechanism to eliminate overlapping periods:
  • Keep track of the number of pseudo-retired instructions R during a runahead period
  • Keep track of the number of fetched instructions N since the exit from last runahead period
  • If N < R, do not enter runahead mode
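The N < R rule can be sketched with two counters. The struct and function names below are hypothetical, and counting fetches in batches is an assumption made for brevity.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct overlap_filter {
    unsigned long r; /* pseudo-retired instructions in the last runahead period */
    unsigned long n; /* instructions fetched since exit from that period */
};

/* At runahead exit: remember how far the period got and restart N. */
void overlap_on_runahead_exit(struct overlap_filter *f, unsigned long pseudo_retired)
{
    assert(f != NULL);
    f->r = pseudo_retired;
    f->n = 0;
}

/* In normal mode: account for newly fetched instructions. */
void overlap_on_fetch(struct overlap_filter *f, unsigned long count)
{
    assert(f != NULL);
    f->n += count;
}

/* If N < R, a new period would re-execute instructions that the last
 * period already covered, so entry is not allowed. */
bool overlap_filter_allows(const struct overlap_filter *f)
{
    assert(f != NULL);
    return f->n >= f->r;
}
```

For example, after a period that pseudo-retired 300 instructions, an L2 miss seen 120 fetched instructions later would not start a new period, while one seen 320 instructions later would.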
![Page 30: Lecture 15: Efficient Runahead Executionece740/f11/lib/exe/... · 15-740/18-740 Computer Architecture Lecture 15: Efficient Runahead Execution Prof. Onur Mutlu Carnegie Mellon University](https://reader030.vdocuments.site/reader030/viewer/2022041100/5ed81efb0fa3e705ec0ddfe9/html5/thumbnails/30.jpg)
Useless Runahead Periods
• Periods that do not result in prefetches for normal mode
• They exist due to the lack of memory-level parallelism
• Mechanism to eliminate useless periods:
  • Predict if a period will generate useful L2 misses
  • Estimate a period to be useful if it generated an L2 miss that cannot be captured by the instruction window
  • Useless period predictors are trained based on this estimation
[Figure: timeline: Compute, Load 1 Miss, a runahead period during Miss 1 that generates no further misses, then Load 1 Hit.]
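One possible shape for a useless-period predictor is a small table of 2-bit saturating counters indexed by the PC of the load that would start the period. The slide does not specify the predictor organization, so the table size, the indexing, the counter width, and every name below are assumptions for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define UPP_ENTRIES 64u /* hypothetical table size */

/* One 2-bit saturating counter per entry; values 2 and 3 mean "useful". */
static uint8_t upp_counters[UPP_ENTRIES];

static unsigned upp_index(uint64_t load_pc)
{
    return (unsigned)((load_pc >> 2) % UPP_ENTRIES);
}

/* Start every entry at "weakly useful" so that new loads get a chance. */
void upp_reset(void)
{
    for (unsigned i = 0; i < UPP_ENTRIES; i++)
        upp_counters[i] = 2;
}

/* Estimate from the slide: a period is useful if it generated an L2 miss
 * that the instruction window could not have captured. Here that is taken
 * to mean a miss at least `window_size` instructions past the load that
 * started the period (an assumption). */
bool upp_period_was_useful(const unsigned *miss_distances, size_t n_misses,
                           unsigned window_size)
{
    assert(miss_distances != NULL || n_misses == 0);
    for (size_t i = 0; i < n_misses; i++)
        if (miss_distances[i] >= window_size)
            return true;
    return false;
}

/* Train at runahead exit with the usefulness estimate for that period. */
void upp_train(uint64_t load_pc, bool was_useful)
{
    uint8_t *c = &upp_counters[upp_index(load_pc)];
    if (was_useful && *c < 3)
        (*c)++;
    else if (!was_useful && *c > 0)
        (*c)--;
}

/* Consulted before entering runahead mode on an L2-miss load. */
bool upp_predict_useful(uint64_t load_pc)
{
    return upp_counters[upp_index(load_pc)] >= 2;
}
```

A core using this sketch would skip runahead entry when `upp_predict_useful` returns false and would call `upp_train` with the result of `upp_period_was_useful` at each runahead exit.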
![Page 31: Lecture 15: Efficient Runahead Executionece740/f11/lib/exe/... · 15-740/18-740 Computer Architecture Lecture 15: Efficient Runahead Execution Prof. Onur Mutlu Carnegie Mellon University](https://reader030.vdocuments.site/reader030/viewer/2022041100/5ed81efb0fa3e705ec0ddfe9/html5/thumbnails/31.jpg)
Overall Impact on Executed Instructions
[Figure: per-benchmark bar chart (y-axis 0% to 110%) of the increase in executed instructions for baseline runahead and for all techniques combined. Benchmarks: bzip2, crafty, eon, gap, gcc, gzip, mcf, parser, perlbmk, twolf, vortex, vpr, ammp, applu, apsi, art, equake, facerec, fma3d, galgel, lucas, mesa, mgrid, sixtrack, swim, wupwise, and AVG. Labeled values: 235% (a bar that exceeds the axis); 26.5% and 6.2% at AVG.]
![Page 32: Lecture 15: Efficient Runahead Executionece740/f11/lib/exe/... · 15-740/18-740 Computer Architecture Lecture 15: Efficient Runahead Execution Prof. Onur Mutlu Carnegie Mellon University](https://reader030.vdocuments.site/reader030/viewer/2022041100/5ed81efb0fa3e705ec0ddfe9/html5/thumbnails/32.jpg)
Overall Impact on IPC
[Figure: per-benchmark bar chart (y-axis 0% to 110%) of the increase in IPC for baseline runahead and for all techniques combined. Benchmarks: bzip2, crafty, eon, gap, gcc, gzip, mcf, parser, perlbmk, twolf, vortex, vpr, ammp, applu, apsi, art, equake, facerec, fma3d, galgel, lucas, mesa, mgrid, sixtrack, swim, wupwise, and AVG. Labeled values: 116% (a bar that exceeds the axis); 22.6% and 22.1% at AVG.]
![Page 33: Lecture 15: Efficient Runahead Executionece740/f11/lib/exe/... · 15-740/18-740 Computer Architecture Lecture 15: Efficient Runahead Execution Prof. Onur Mutlu Carnegie Mellon University](https://reader030.vdocuments.site/reader030/viewer/2022041100/5ed81efb0fa3e705ec0ddfe9/html5/thumbnails/33.jpg)
Limitations of the Baseline Runahead Mechanism
• Energy Inefficiency
  • A large number of instructions are speculatively executed
  • Efficient Runahead Execution [ISCA’05, IEEE Micro Top Picks’06]
• Ineffectiveness for pointer-intensive applications
  • Runahead cannot parallelize dependent L2 cache misses
  • Address-Value Delta (AVD) Prediction [MICRO’05]
• Irresolvable branch mispredictions in runahead mode
  • Cannot recover from a mispredicted L2-miss dependent branch
  • Wrong Path Events [MICRO’04]
![Page 34: Lecture 15: Efficient Runahead Executionece740/f11/lib/exe/... · 15-740/18-740 Computer Architecture Lecture 15: Efficient Runahead Execution Prof. Onur Mutlu Carnegie Mellon University](https://reader030.vdocuments.site/reader030/viewer/2022041100/5ed81efb0fa3e705ec0ddfe9/html5/thumbnails/34.jpg)
The Problem: Dependent Cache Misses
• Runahead execution cannot parallelize dependent misses
  • wasted opportunity to improve performance
  • wasted energy (useless pre-execution)
• Runahead performance would improve by 25% if this limitation were ideally overcome
[Figure: runahead timeline in which Load 2 is dependent on Load 1. Load 1 misses (Miss 1) and the processor enters runahead; Load 2 cannot compute its address and is marked INV. After Load 1 hits, Load 2 misses (Miss 2), so Miss 2 is not overlapped with Miss 1.]
![Page 35: Lecture 15: Efficient Runahead Executionece740/f11/lib/exe/... · 15-740/18-740 Computer Architecture Lecture 15: Efficient Runahead Execution Prof. Onur Mutlu Carnegie Mellon University](https://reader030.vdocuments.site/reader030/viewer/2022041100/5ed81efb0fa3e705ec0ddfe9/html5/thumbnails/35.jpg)
The Goal of AVD Prediction
• Enable the parallelization of dependent L2 cache misses in runahead mode with a low-cost mechanism
• How:
  • Predict the values of L2-miss address (pointer) loads
• Address load: loads an address into its destination register, which is later used to calculate the address of another load
  • as opposed to data load
![Page 36: Lecture 15: Efficient Runahead Executionece740/f11/lib/exe/... · 15-740/18-740 Computer Architecture Lecture 15: Efficient Runahead Execution Prof. Onur Mutlu Carnegie Mellon University](https://reader030.vdocuments.site/reader030/viewer/2022041100/5ed81efb0fa3e705ec0ddfe9/html5/thumbnails/36.jpg)
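Parallelizing Dependent Cache Misses
[Figure: two runahead timelines. Without value prediction: Load 1 misses, Load 2 is INV in runahead mode because it cannot compute its address, and Load 2 misses after Load 1 hits, so Miss 2 follows Miss 1. With the value of Load 1 predicted: Load 2 can compute its address and misses during runahead, Miss 2 overlaps Miss 1, and Load 2 hits after runahead exit. The difference is annotated as saved cycles and saved speculative instructions.]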
![Page 37: Lecture 15: Efficient Runahead Executionece740/f11/lib/exe/... · 15-740/18-740 Computer Architecture Lecture 15: Efficient Runahead Execution Prof. Onur Mutlu Carnegie Mellon University](https://reader030.vdocuments.site/reader030/viewer/2022041100/5ed81efb0fa3e705ec0ddfe9/html5/thumbnails/37.jpg)
AVD Prediction [MICRO’05]
• Address-value delta (AVD) of a load instruction defined as:
    AVD = Effective Address of Load – Data Value of Load
• For some address loads, AVD is stable
• An AVD predictor keeps track of the AVDs of address loads
• When a load is an L2 miss in runahead mode, AVD predictor is consulted
• If the predictor returns a stable (confident) AVD for that load, the value of the load is predicted
    Predicted Value = Effective Address – Predicted AVD
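The two formulas above translate directly into integer arithmetic. The sketch below assumes 64-bit addresses and a signed 64-bit AVD on a two's-complement machine; the type and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

typedef int64_t avd_t; /* signed, because the delta can be negative */

/* AVD = Effective Address of Load - Data Value of Load */
avd_t compute_avd(uint64_t effective_addr, uint64_t data_value)
{
    return (avd_t)(effective_addr - data_value);
}

/* Predicted Value = Effective Address - Predicted AVD */
uint64_t predict_value(uint64_t effective_addr, avd_t predicted_avd)
{
    return effective_addr - (uint64_t)predicted_avd;
}
```

For a load at address 0x1000 that returns 0x1040, the AVD is -0x40; if the same load later accesses 0x1040 and misses in L2, the predicted value is 0x1080.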
![Page 38: Lecture 15: Efficient Runahead Executionece740/f11/lib/exe/... · 15-740/18-740 Computer Architecture Lecture 15: Efficient Runahead Execution Prof. Onur Mutlu Carnegie Mellon University](https://reader030.vdocuments.site/reader030/viewer/2022041100/5ed81efb0fa3e705ec0ddfe9/html5/thumbnails/38.jpg)
Why Do Stable AVDs Occur?
• Regularity in the way data structures are
  • allocated in memory AND
  • traversed
• Two types of loads can have stable AVDs
  • Traversal address loads
    • Produce addresses consumed by address loads
  • Leaf address loads
    • Produce addresses consumed by data loads
![Page 39: Lecture 15: Efficient Runahead Executionece740/f11/lib/exe/... · 15-740/18-740 Computer Architecture Lecture 15: Efficient Runahead Execution Prof. Onur Mutlu Carnegie Mellon University](https://reader030.vdocuments.site/reader030/viewer/2022041100/5ed81efb0fa3e705ec0ddfe9/html5/thumbnails/39.jpg)
Traversal Address Loads
Regularly-allocated linked list: nodes at A, A+k, A+2k, A+3k, ...
A traversal address load loads the pointer to next node:
    node = node->next
AVD = Effective Addr – Data Value

| Effective Addr | Data Value | AVD |
|----------------|------------|-----|
| A              | A+k        | -k  |
| A+k            | A+2k       | -k  |
| A+2k           | A+3k       | -k  |

Stable AVD, striding data value
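The traversal case can be reproduced in a small sketch. It assumes the `next` pointer is the first field of the node (so the load address equals the node address) and uses an array to stand in for an allocator that hands out contiguous, fixed-size chunks; all names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct node {
    struct node *next; /* first field, so the load address equals the node address */
    long payload;
};

/* Links nodes[0..n-1] in allocation order; the array stands in for an
 * allocator that returns contiguous, fixed-size chunks. */
void link_in_order(struct node *nodes, size_t n)
{
    assert(nodes != NULL);
    for (size_t i = 0; i < n; i++)
        nodes[i].next = (i + 1 < n) ? &nodes[i + 1] : NULL;
}

/* AVD of the traversal load `node = node->next`. */
intptr_t traversal_avd(const struct node *node)
{
    uintptr_t effective_addr = (uintptr_t)&node->next;
    uintptr_t data_value = (uintptr_t)node->next;
    return (intptr_t)(effective_addr - data_value);
}
```

Here k is `sizeof(struct node)`, and every node except the last yields an AVD of -k, matching the table above.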
![Page 40: Lecture 15: Efficient Runahead Executionece740/f11/lib/exe/... · 15-740/18-740 Computer Architecture Lecture 15: Efficient Runahead Execution Prof. Onur Mutlu Carnegie Mellon University](https://reader030.vdocuments.site/reader030/viewer/2022041100/5ed81efb0fa3e705ec0ddfe9/html5/thumbnails/40.jpg)
Properties of Traversal-based AVDs
• Stable AVDs can be captured with a stride value predictor
• Stable AVDs disappear with the re-organization of the data structure (e.g., sorting)
• Stability of AVDs is dependent on the behavior of the memory allocator
  • Allocation of contiguous, fixed-size chunks is useful
[Figure: a list whose nodes at A, A+k, A+2k, A+3k are linked in allocation order; after sorting, the link order becomes A+3k, A+k, A, A+2k, so the distance between nodes is NOT constant.]
![Page 41: Lecture 15: Efficient Runahead Executionece740/f11/lib/exe/... · 15-740/18-740 Computer Architecture Lecture 15: Efficient Runahead Execution Prof. Onur Mutlu Carnegie Mellon University](https://reader030.vdocuments.site/reader030/viewer/2022041100/5ed81efb0fa3e705ec0ddfe9/html5/thumbnails/41.jpg)
Leaf Address Loads
Sorted dictionary in parser:
• Nodes point to strings (words)
• String and node allocated consecutively
[Figure: tree of dictionary nodes at A+k, B+k, C+k, D+k, E+k, F+k, G+k, each pointing to its string at A, B, C, D, E, F, G.]
Dictionary looked up for an input word.
A leaf address load loads the pointer to the string of each node:
    lookup (node, input) {
        // ...
        ptr_str = node->string;
        m = check_match(ptr_str, input);
        // ...
    }
AVD = Effective Addr – Data Value

| Effective Addr | Data Value | AVD |
|----------------|------------|-----|
| A+k            | A          | k   |
| C+k            | C          | k   |
| F+k            | F          | k   |

Stable AVD, no stride!
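The leaf case can be sketched by placing each word and its node in one allocation unit. The layout below (a fixed 32-byte word buffer followed by the node, with `string` as the node's first field) is an assumption chosen to mirror "string and node allocated consecutively"; it is not the parser benchmark's actual data structure.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct dict_node {
    const char *string; /* loaded by the leaf address load */
    struct dict_node *left, *right;
};

/* One allocation unit: the word comes first and the node right after it. */
struct dict_entry {
    char word[32];
    struct dict_node node;
};

void dict_entry_init(struct dict_entry *e)
{
    assert(e != NULL);
    e->node.string = e->word;
    e->node.left = NULL;
    e->node.right = NULL;
}

/* AVD of the leaf address load `ptr_str = node->string`. */
intptr_t leaf_avd(const struct dict_node *node)
{
    uintptr_t effective_addr = (uintptr_t)&node->string;
    uintptr_t data_value = (uintptr_t)node->string;
    return (intptr_t)(effective_addr - data_value);
}
```

Every entry yields the same AVD, `offsetof(struct dict_entry, node)`, no matter where the entry lives or in which order lookups visit the nodes, which is why the AVD is stable even though the loaded values do not stride.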
![Page 42: Lecture 15: Efficient Runahead Executionece740/f11/lib/exe/... · 15-740/18-740 Computer Architecture Lecture 15: Efficient Runahead Execution Prof. Onur Mutlu Carnegie Mellon University](https://reader030.vdocuments.site/reader030/viewer/2022041100/5ed81efb0fa3e705ec0ddfe9/html5/thumbnails/42.jpg)
Properties of Leaf-based AVDs
• Stable AVDs cannot be captured with a stride value predictor
• Stable AVDs do not disappear with the re-organization of the data structure (e.g., sorting)
• Stability of AVDs is dependent on the behavior of the memory allocator
[Figure: nodes at A+k, B+k, C+k point to strings at A, B, C. After sorting, the nodes are linked in a different order (C+k, A+k, B+k), but the distance between each node and its string is still constant.]
![Page 43: Lecture 15: Efficient Runahead Executionece740/f11/lib/exe/... · 15-740/18-740 Computer Architecture Lecture 15: Efficient Runahead Execution Prof. Onur Mutlu Carnegie Mellon University](https://reader030.vdocuments.site/reader030/viewer/2022041100/5ed81efb0fa3e705ec0ddfe9/html5/thumbnails/43.jpg)
Identifying Address Loads in Hardware
• Insight:
  • If the AVD is too large, the value that is loaded is likely not an address
• Only keep track of loads that satisfy:
    -MaxAVD ≤ AVD ≤ +MaxAVD
• This identification mechanism eliminates many loads from consideration
  • Enables the AVD predictor to be small
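The range check is a two-sided comparison. In this sketch the function name is hypothetical and the MaxAVD value is passed in by the caller, since the slide does not fix a specific value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* A load is tracked as a likely address load only if
 * -MaxAVD <= AVD <= +MaxAVD. */
bool is_candidate_address_load(int64_t avd, int64_t max_avd)
{
    assert(max_avd >= 0);
    return avd >= -max_avd && avd <= max_avd;
}
```

With a hypothetical MaxAVD of 65535, a load with AVD -64 stays under consideration, while a load whose value is far from its address (for example, a small integer loaded from a heap location) is filtered out.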
![Page 44: Lecture 15: Efficient Runahead Executionece740/f11/lib/exe/... · 15-740/18-740 Computer Architecture Lecture 15: Efficient Runahead Execution Prof. Onur Mutlu Carnegie Mellon University](https://reader030.vdocuments.site/reader030/viewer/2022041100/5ed81efb0fa3e705ec0ddfe9/html5/thumbnails/44.jpg)
An Implementable AVD Predictor
• Set-associative prediction table
• Prediction table entry consists of
  • Tag (Program Counter of the load)
  • Last AVD seen for the load
  • Confidence counter for the recorded AVD
• Updated when an address load is retired in normal mode
• Accessed when a load misses in L2 cache in runahead mode
• Recovery-free: No need to recover the state of the processor or the predictor on misprediction
  • Runahead mode is purely speculative
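A table with these three fields might look like the following sketch. The table dimensions, the confidence threshold, the round-robin replacement, and the reset-on-change update policy are all assumptions for illustration, and the MaxAVD range check is assumed to have been applied by the caller before `avd_update`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define NUM_SETS 4u       /* hypothetical sizes and thresholds */
#define NUM_WAYS 2u
#define CONF_MAX 3u
#define CONF_THRESHOLD 2u

struct avd_entry {
    bool valid;
    uint64_t tag;       /* PC of the load */
    int64_t last_avd;   /* last AVD seen for the load */
    uint8_t confidence; /* confidence counter for the recorded AVD */
};

struct avd_predictor {
    struct avd_entry sets[NUM_SETS][NUM_WAYS];
    unsigned next_victim[NUM_SETS]; /* round-robin replacement (assumption) */
};

void avd_init(struct avd_predictor *p)
{
    assert(p != NULL);
    memset(p, 0, sizeof *p);
}

static struct avd_entry *avd_find(struct avd_predictor *p, uint64_t pc)
{
    struct avd_entry *set = p->sets[(pc >> 2) % NUM_SETS];
    for (unsigned w = 0; w < NUM_WAYS; w++)
        if (set[w].valid && set[w].tag == pc)
            return &set[w];
    return NULL;
}

/* Called when an address load retires in normal mode. */
void avd_update(struct avd_predictor *p, uint64_t pc, uint64_t addr, uint64_t value)
{
    int64_t avd = (int64_t)(addr - value);
    struct avd_entry *e = avd_find(p, pc);
    if (e == NULL) { /* allocate a new entry with zero confidence */
        unsigned s = (unsigned)((pc >> 2) % NUM_SETS);
        e = &p->sets[s][p->next_victim[s]];
        p->next_victim[s] = (p->next_victim[s] + 1) % NUM_WAYS;
        *e = (struct avd_entry){ .valid = true, .tag = pc,
                                 .last_avd = avd, .confidence = 0 };
        return;
    }
    if (e->last_avd == avd) { /* same AVD again: gain confidence */
        if (e->confidence < CONF_MAX)
            e->confidence++;
    } else {                  /* AVD changed: record it and start over */
        e->last_avd = avd;
        e->confidence = 0;
    }
}

/* Called when a load misses in L2 during runahead mode. Returns true and
 * writes the predicted value only if a confident AVD is recorded. */
bool avd_predict(struct avd_predictor *p, uint64_t pc, uint64_t addr, uint64_t *value_out)
{
    struct avd_entry *e = avd_find(p, pc);
    if (e == NULL || e->confidence < CONF_THRESHOLD)
        return false;
    *value_out = addr - (uint64_t)e->last_avd;
    return true;
}
```

Because runahead results are discarded at exit, a wrong prediction needs no recovery here: `avd_predict` never modifies the table, and the entry is corrected by `avd_update` the next time the load retires in normal mode.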
![Page 45: Lecture 15: Efficient Runahead Executionece740/f11/lib/exe/... · 15-740/18-740 Computer Architecture Lecture 15: Efficient Runahead Execution Prof. Onur Mutlu Carnegie Mellon University](https://reader030.vdocuments.site/reader030/viewer/2022041100/5ed81efb0fa3e705ec0ddfe9/html5/thumbnails/45.jpg)
AVD Update Logic
![Page 46: Lecture 15: Efficient Runahead Executionece740/f11/lib/exe/... · 15-740/18-740 Computer Architecture Lecture 15: Efficient Runahead Execution Prof. Onur Mutlu Carnegie Mellon University](https://reader030.vdocuments.site/reader030/viewer/2022041100/5ed81efb0fa3e705ec0ddfe9/html5/thumbnails/46.jpg)
AVD Prediction Logic
![Page 47: Lecture 15: Efficient Runahead Executionece740/f11/lib/exe/... · 15-740/18-740 Computer Architecture Lecture 15: Efficient Runahead Execution Prof. Onur Mutlu Carnegie Mellon University](https://reader030.vdocuments.site/reader030/viewer/2022041100/5ed81efb0fa3e705ec0ddfe9/html5/thumbnails/47.jpg)
Baseline Processor
• Execution-driven Alpha simulator
• 8-wide superscalar processor
• 128-entry instruction window, 20-stage pipeline
• 64 KB, 4-way, 2-cycle L1 data and instruction caches
• 1 MB, 32-way, 10-cycle unified L2 cache
• 500-cycle minimum main memory latency
• 32 DRAM banks, 32-byte wide processor-memory bus (4:1 frequency ratio), 128 outstanding misses
  • Detailed memory model
• Pointer-intensive benchmarks from Olden and SPEC INT00
![Page 48: Lecture 15: Efficient Runahead Executionece740/f11/lib/exe/... · 15-740/18-740 Computer Architecture Lecture 15: Efficient Runahead Execution Prof. Onur Mutlu Carnegie Mellon University](https://reader030.vdocuments.site/reader030/viewer/2022041100/5ed81efb0fa3e705ec0ddfe9/html5/thumbnails/48.jpg)
Performance of AVD Prediction
[Figure: bar chart of normalized execution time and normalized executed instructions (y-axis 0.0 to 1.0) for bisort, health, mst, perimeter, treeadd, tsp, voronoi, mcf, parser, twolf, vpr, and AVG. A reference level is labeled "runahead". Labeled values: 14.3% and 15.5%.]
![Page 49: Lecture 15: Efficient Runahead Executionece740/f11/lib/exe/... · 15-740/18-740 Computer Architecture Lecture 15: Efficient Runahead Execution Prof. Onur Mutlu Carnegie Mellon University](https://reader030.vdocuments.site/reader030/viewer/2022041100/5ed81efb0fa3e705ec0ddfe9/html5/thumbnails/49.jpg)
AVD vs. Stride VP Performance
[Figure: bar chart of normalized execution time, excluding health (y-axis 0.80 to 1.00), for the AVD, stride, and hybrid predictors with 16 entries and with 4096 entries. Labeled values, in extraction order: 5.1%, 2.7%, 6.5%, 5.5%, 4.7%, 8.6%.]