18-447: computer...
TRANSCRIPT
![Page 1: 18-447: Computer Architecturecourse.ece.cmu.edu/~ece447/s13/lib/exe/fetch.php?media=onur-447-spring... · Mutlu et al., “Runahead Execution: An Alternative to Very Large Instruction](https://reader033.vdocuments.site/reader033/viewer/2022041703/5e42a8ccd44bf57dd14e5440/html5/thumbnails/1.jpg)
18-447: Computer Architecture
Lecture 28: Runahead Execution
Prof. Onur Mutlu
Carnegie Mellon University
Spring 2013, 4/12/2013
![Page 2: 18-447: Computer Architecturecourse.ece.cmu.edu/~ece447/s13/lib/exe/fetch.php?media=onur-447-spring... · Mutlu et al., “Runahead Execution: An Alternative to Very Large Instruction](https://reader033.vdocuments.site/reader033/viewer/2022041703/5e42a8ccd44bf57dd14e5440/html5/thumbnails/2.jpg)
Homework 6
Due April 19 (Friday)
Topics: Virtual memory and cache interaction, main memory, memory scheduling
Strong suggestion:
Please complete this before the exam to prepare for the exam
Reminder:
Homeworks are mainly for your benefit and learning (and preparation for the exam).
They are not meant to be a large part of your grade
2
![Page 3: 18-447: Computer Architecturecourse.ece.cmu.edu/~ece447/s13/lib/exe/fetch.php?media=onur-447-spring... · Mutlu et al., “Runahead Execution: An Alternative to Very Large Instruction](https://reader033.vdocuments.site/reader033/viewer/2022041703/5e42a8ccd44bf57dd14e5440/html5/thumbnails/3.jpg)
Lab 6: Memory Hierarchy
Due April 22 (Monday)
Cycle-level modeling of L2 cache and DRAM-based main memory
Extra credit: Prefetching
Design your own hardware prefetcher to improve system performance
HW 6 and Lab 6 are synergistic – work on them together
3
![Page 4: 18-447: Computer Architecturecourse.ece.cmu.edu/~ece447/s13/lib/exe/fetch.php?media=onur-447-spring... · Mutlu et al., “Runahead Execution: An Alternative to Very Large Instruction](https://reader033.vdocuments.site/reader033/viewer/2022041703/5e42a8ccd44bf57dd14e5440/html5/thumbnails/4.jpg)
Midterm II Next Week
April 17
Similar format as Midterm I
Suggestion: Do Homework 6 to prepare for the Midterm
4
![Page 5: 18-447: Computer Architecturecourse.ece.cmu.edu/~ece447/s13/lib/exe/fetch.php?media=onur-447-spring... · Mutlu et al., “Runahead Execution: An Alternative to Very Large Instruction](https://reader033.vdocuments.site/reader033/viewer/2022041703/5e42a8ccd44bf57dd14e5440/html5/thumbnails/5.jpg)
Last Lecture
Memory Interference (and Techniques to Manage It)
With a focus on Memory Request Scheduling
Memory Performance Hogs
QoS-Aware Memory Scheduling
STFM
PAR-BS
TCM
ATLAS
Memory Channel Partitioning
Source Throttling
Thread Co-scheduling
Memory Scheduling for Parallel Applications
5
![Page 6: 18-447: Computer Architecturecourse.ece.cmu.edu/~ece447/s13/lib/exe/fetch.php?media=onur-447-spring... · Mutlu et al., “Runahead Execution: An Alternative to Very Large Instruction](https://reader033.vdocuments.site/reader033/viewer/2022041703/5e42a8ccd44bf57dd14e5440/html5/thumbnails/6.jpg)
Today
Start Memory Latency Tolerance
Runahead Execution
Prefetching
6
![Page 7: 18-447: Computer Architecturecourse.ece.cmu.edu/~ece447/s13/lib/exe/fetch.php?media=onur-447-spring... · Mutlu et al., “Runahead Execution: An Alternative to Very Large Instruction](https://reader033.vdocuments.site/reader033/viewer/2022041703/5e42a8ccd44bf57dd14e5440/html5/thumbnails/7.jpg)
Readings
Required
Mutlu et al., “Runahead execution”, HPCA 2003.
Srinath et al., “Feedback directed prefetching”, HPCA 2007.
Optional
Mutlu et al., “Efficient Runahead Execution: Power-Efficient Memory Latency Tolerance,” ISCA 2005, IEEE Micro Top Picks 2006.
Mutlu et al., “Address-Value Delta (AVD) Prediction,” MICRO 2005.
Armstrong et al., “Wrong Path Events,” MICRO 2004.
7
![Page 8: 18-447: Computer Architecturecourse.ece.cmu.edu/~ece447/s13/lib/exe/fetch.php?media=onur-447-spring... · Mutlu et al., “Runahead Execution: An Alternative to Very Large Instruction](https://reader033.vdocuments.site/reader033/viewer/2022041703/5e42a8ccd44bf57dd14e5440/html5/thumbnails/8.jpg)
Handling Interference in the
Entire Memory System
![Page 9: 18-447: Computer Architecturecourse.ece.cmu.edu/~ece447/s13/lib/exe/fetch.php?media=onur-447-spring... · Mutlu et al., “Runahead Execution: An Alternative to Very Large Instruction](https://reader033.vdocuments.site/reader033/viewer/2022041703/5e42a8ccd44bf57dd14e5440/html5/thumbnails/9.jpg)
Memory System is the Major Shared Resource
9
threads’ requests
interfere
![Page 10: 18-447: Computer Architecturecourse.ece.cmu.edu/~ece447/s13/lib/exe/fetch.php?media=onur-447-spring... · Mutlu et al., “Runahead Execution: An Alternative to Very Large Instruction](https://reader033.vdocuments.site/reader033/viewer/2022041703/5e42a8ccd44bf57dd14e5440/html5/thumbnails/10.jpg)
Inter-Thread/Application Interference
Problem: Threads share the memory system, but memory system does not distinguish between threads’ requests
Existing memory systems
Free-for-all, shared based on demand
Control algorithms thread-unaware and thread-unfair
Aggressive threads can deny service to others
Do not try to reduce or control inter-thread interference
10
![Page 11: 18-447: Computer Architecturecourse.ece.cmu.edu/~ece447/s13/lib/exe/fetch.php?media=onur-447-spring... · Mutlu et al., “Runahead Execution: An Alternative to Very Large Instruction](https://reader033.vdocuments.site/reader033/viewer/2022041703/5e42a8ccd44bf57dd14e5440/html5/thumbnails/11.jpg)
How Do We Solve The Problem?
Inter-thread interference is uncontrolled in all memory resources
Memory controller
Interconnect
Caches
We need to control it
i.e., design an interference-aware (QoS-aware) memory system
11
![Page 12: 18-447: Computer Architecturecourse.ece.cmu.edu/~ece447/s13/lib/exe/fetch.php?media=onur-447-spring... · Mutlu et al., “Runahead Execution: An Alternative to Very Large Instruction](https://reader033.vdocuments.site/reader033/viewer/2022041703/5e42a8ccd44bf57dd14e5440/html5/thumbnails/12.jpg)
Designing QoS-Aware Memory Systems: Approaches
Smart resources: Design each shared resource to have a configurable interference control/reduction mechanism
QoS-aware memory controllers [Mutlu+ MICRO’07] [Moscibroda+, Usenix Security’07]
[Mutlu+ ISCA’08, Top Picks’09] [Kim+ HPCA’10] [Kim+ MICRO’10, Top Picks’11] [Ebrahimi+ ISCA’11, MICRO’11] [Ausavarungnirun+, ISCA’12] [Subramanian+, HPCA’13]
QoS-aware interconnects [Das+ MICRO’09, ISCA’10, Top Picks ’11, HPCA’13] [Grot+
MICRO’09, ISCA’11, Top Picks ’12]
QoS-aware caches
Dumb resources: Keep each resource free-for-all, but reduce/control interference by injection control or data mapping
Source throttling to control access to memory system [Ebrahimi+ ASPLOS’10,
ISCA’11, TOCS’12] [Ebrahimi+ MICRO’09] [Nychis+ HotNets’10] [Nychis+ SIGCOMM’12]
QoS-aware data mapping to memory controllers [Muralidhara+ MICRO’11]
QoS-aware thread scheduling to cores [Das+ HPCA’13]
12
![Page 13: 18-447: Computer Architecturecourse.ece.cmu.edu/~ece447/s13/lib/exe/fetch.php?media=onur-447-spring... · Mutlu et al., “Runahead Execution: An Alternative to Very Large Instruction](https://reader033.vdocuments.site/reader033/viewer/2022041703/5e42a8ccd44bf57dd14e5440/html5/thumbnails/13.jpg)
Summary: Memory QoS Approaches and Techniques
Approaches: Smart vs. dumb resources
Smart resources: QoS-aware memory scheduling
Dumb resources: Source throttling; channel partitioning
Both approaches are effective in reducing interference
No single best approach for all workloads
Techniques: Request scheduling, source throttling, memory partitioning
All approaches are effective in reducing interference
No single best technique for all workloads
Combined approaches and techniques are the most powerful
Integrated Memory Channel Partitioning and Scheduling [MICRO’11]
13
![Page 14: 18-447: Computer Architecturecourse.ece.cmu.edu/~ece447/s13/lib/exe/fetch.php?media=onur-447-spring... · Mutlu et al., “Runahead Execution: An Alternative to Very Large Instruction](https://reader033.vdocuments.site/reader033/viewer/2022041703/5e42a8ccd44bf57dd14e5440/html5/thumbnails/14.jpg)
Tolerating Memory Latency
![Page 15: 18-447: Computer Architecturecourse.ece.cmu.edu/~ece447/s13/lib/exe/fetch.php?media=onur-447-spring... · Mutlu et al., “Runahead Execution: An Alternative to Very Large Instruction](https://reader033.vdocuments.site/reader033/viewer/2022041703/5e42a8ccd44bf57dd14e5440/html5/thumbnails/15.jpg)
Latency Tolerance
An out-of-order execution processor tolerates latency of multi-cycle operations by executing independent instructions concurrently
It does so by buffering instructions in reservation stations and reorder buffer
Instruction window: Hardware resources needed to buffer all decoded but not yet retired/committed instructions
What if an instruction takes 500 cycles?
How large of an instruction window do we need to continue decoding?
How many cycles of latency can OoO tolerate?
15
![Page 16: 18-447: Computer Architecturecourse.ece.cmu.edu/~ece447/s13/lib/exe/fetch.php?media=onur-447-spring... · Mutlu et al., “Runahead Execution: An Alternative to Very Large Instruction](https://reader033.vdocuments.site/reader033/viewer/2022041703/5e42a8ccd44bf57dd14e5440/html5/thumbnails/16.jpg)
16
Stalls due to Long-Latency Instructions
When a long-latency instruction is not complete, it blocks instruction retirement.
Because we need to maintain precise exceptions
Incoming instructions fill the instruction window (reorder buffer, reservation stations).
Once the window is full, processor cannot place new instructions into the window.
This is called a full-window stall.
A full-window stall prevents the processor from making progress in the execution of the program.
![Page 17: 18-447: Computer Architecturecourse.ece.cmu.edu/~ece447/s13/lib/exe/fetch.php?media=onur-447-spring... · Mutlu et al., “Runahead Execution: An Alternative to Very Large Instruction](https://reader033.vdocuments.site/reader033/viewer/2022041703/5e42a8ccd44bf57dd14e5440/html5/thumbnails/17.jpg)
17
ADD R2 R2, 64
STOR mem[R2] R4
ADD R4 R4, R5
MUL R4 R4, R3
LOAD R3 mem[R2]
ADD R2 R2, 8
BEQ R1, R0, target
LOAD R1 mem[R5]
Full-window Stall Example
Oldest L2 Miss! Takes 100s of cycles.
8-entry instruction window:
Independent of the L2 miss,
executed out of program order,
but cannot be retired.
Younger instructions cannot be executed
because there is no space in the instruction window.
The processor stalls until the L2 Miss is serviced.
L2 cache misses are responsible for most full-window stalls.
LOAD R3 mem[R2]
![Page 18: 18-447: Computer Architecturecourse.ece.cmu.edu/~ece447/s13/lib/exe/fetch.php?media=onur-447-spring... · Mutlu et al., “Runahead Execution: An Alternative to Very Large Instruction](https://reader033.vdocuments.site/reader033/viewer/2022041703/5e42a8ccd44bf57dd14e5440/html5/thumbnails/18.jpg)
18
Cache Misses Responsible for Many Stalls
05
101520253035404550556065707580859095
100
128-entry window
No
rmali
zed
Execu
tio
n T
ime
Non-stall (compute) time
Full-window stall time
512KB L2 cache, 500-cycle DRAM latency, aggressive stream-based prefetcher
Data averaged over 147 memory-intensive benchmarks on a high-end x86 processor model
L2 Misses
![Page 19: 18-447: Computer Architecturecourse.ece.cmu.edu/~ece447/s13/lib/exe/fetch.php?media=onur-447-spring... · Mutlu et al., “Runahead Execution: An Alternative to Very Large Instruction](https://reader033.vdocuments.site/reader033/viewer/2022041703/5e42a8ccd44bf57dd14e5440/html5/thumbnails/19.jpg)
How Do We Tolerate Stalls Due to Memory?
Two major approaches
Reduce/eliminate stalls
Tolerate the effect of a stall when it happens
Four fundamental techniques to achieve these
Caching
Prefetching
Multithreading
Out-of-order execution
Many techniques have been developed to make these four fundamental techniques more effective in tolerating memory latency
19
![Page 20: 18-447: Computer Architecturecourse.ece.cmu.edu/~ece447/s13/lib/exe/fetch.php?media=onur-447-spring... · Mutlu et al., “Runahead Execution: An Alternative to Very Large Instruction](https://reader033.vdocuments.site/reader033/viewer/2022041703/5e42a8ccd44bf57dd14e5440/html5/thumbnails/20.jpg)
20
Memory Latency Tolerance Techniques
Caching [initially by Wilkes, 1965] Widely used, simple, effective, but inefficient, passive Not all applications/phases exhibit temporal or spatial locality
Prefetching [initially in IBM 360/91, 1967]
Works well for regular memory access patterns Prefetching irregular access patterns is difficult, inaccurate, and hardware-
intensive
Multithreading [initially in CDC 6600, 1964] Works well if there are multiple threads Improving single thread performance using multithreading hardware is an
ongoing research effort
Out-of-order execution [initially by Tomasulo, 1967] Tolerates irregular cache misses that cannot be prefetched Requires extensive hardware resources for tolerating long latencies Runahead execution alleviates this problem (as we will see in a later lecture)
![Page 21: 18-447: Computer Architecturecourse.ece.cmu.edu/~ece447/s13/lib/exe/fetch.php?media=onur-447-spring... · Mutlu et al., “Runahead Execution: An Alternative to Very Large Instruction](https://reader033.vdocuments.site/reader033/viewer/2022041703/5e42a8ccd44bf57dd14e5440/html5/thumbnails/21.jpg)
Runahead Execution
![Page 22: 18-447: Computer Architecturecourse.ece.cmu.edu/~ece447/s13/lib/exe/fetch.php?media=onur-447-spring... · Mutlu et al., “Runahead Execution: An Alternative to Very Large Instruction](https://reader033.vdocuments.site/reader033/viewer/2022041703/5e42a8ccd44bf57dd14e5440/html5/thumbnails/22.jpg)
22
ADD R2 R2, 64
STOR mem[R2] R4
ADD R4 R4, R5
MUL R4 R4, R3
LOAD R3 mem[R2]
ADD R2 R2, 8
BEQ R1, R0, target
LOAD R1 mem[R5]
Small Windows: Full-window Stalls
Oldest L2 Miss! Takes 100s of cycles.
8-entry instruction window:
Independent of the L2 miss,
executed out of program order,
but cannot be retired.
Younger instructions cannot be executed
because there is no space in the instruction window.
The processor stalls until the L2 Miss is serviced.
L2 cache misses are responsible for most full-window stalls.
LOAD R3 mem[R2]
![Page 23: 18-447: Computer Architecturecourse.ece.cmu.edu/~ece447/s13/lib/exe/fetch.php?media=onur-447-spring... · Mutlu et al., “Runahead Execution: An Alternative to Very Large Instruction](https://reader033.vdocuments.site/reader033/viewer/2022041703/5e42a8ccd44bf57dd14e5440/html5/thumbnails/23.jpg)
23
Impact of L2 Cache Misses
05
101520253035404550556065707580859095
100
128-entry window
No
rmali
zed
Execu
tio
n T
ime
Non-stall (compute) time
Full-window stall time
512KB L2 cache, 500-cycle DRAM latency, aggressive stream-based prefetcher
Data averaged over 147 memory-intensive benchmarks on a high-end x86 processor model
L2 Misses
![Page 24: 18-447: Computer Architecturecourse.ece.cmu.edu/~ece447/s13/lib/exe/fetch.php?media=onur-447-spring... · Mutlu et al., “Runahead Execution: An Alternative to Very Large Instruction](https://reader033.vdocuments.site/reader033/viewer/2022041703/5e42a8ccd44bf57dd14e5440/html5/thumbnails/24.jpg)
24
Impact of L2 Cache Misses
05
101520253035404550556065707580859095
100
128-entry window 2048-entry window
No
rmali
zed
Execu
tio
n T
ime
Non-stall (compute) time
Full-window stall time
500-cycle DRAM latency, aggressive stream-based prefetcher
Data averaged over 147 memory-intensive benchmarks on a high-end x86 processor model
L2 Misses
![Page 25: 18-447: Computer Architecturecourse.ece.cmu.edu/~ece447/s13/lib/exe/fetch.php?media=onur-447-spring... · Mutlu et al., “Runahead Execution: An Alternative to Very Large Instruction](https://reader033.vdocuments.site/reader033/viewer/2022041703/5e42a8ccd44bf57dd14e5440/html5/thumbnails/25.jpg)
25
The Problem
Out-of-order execution requires large instruction windows to tolerate today’s main memory latencies.
As main memory latency increases, instruction window size should also increase to fully tolerate the memory latency.
Building a large instruction window is a challenging task if we would like to achieve
Low power/energy consumption (tag matching logic, ld/st buffers)
Short cycle time (access, wakeup/select latencies)
Low design and verification complexity
![Page 26: 18-447: Computer Architecturecourse.ece.cmu.edu/~ece447/s13/lib/exe/fetch.php?media=onur-447-spring... · Mutlu et al., “Runahead Execution: An Alternative to Very Large Instruction](https://reader033.vdocuments.site/reader033/viewer/2022041703/5e42a8ccd44bf57dd14e5440/html5/thumbnails/26.jpg)
Efficient Scaling of Instruction Window Size
One of the major research issues in out of order execution
How to achieve the benefits of a large window with a small one (or in a simpler way)?
How do we efficiently tolerate memory latency with the machinery of out-of-order execution (and a small instruction window)?
26
![Page 27: 18-447: Computer Architecturecourse.ece.cmu.edu/~ece447/s13/lib/exe/fetch.php?media=onur-447-spring... · Mutlu et al., “Runahead Execution: An Alternative to Very Large Instruction](https://reader033.vdocuments.site/reader033/viewer/2022041703/5e42a8ccd44bf57dd14e5440/html5/thumbnails/27.jpg)
Memory Level Parallelism (MLP)
Idea: Find and service multiple cache misses in parallel so that the processor stalls only once for all misses
Enables latency tolerance: overlaps latency of different misses
How to generate multiple misses?
Out-of-order execution, multithreading, runahead, prefetching
27
time
A B
C
isolated miss parallel miss
![Page 28: 18-447: Computer Architecturecourse.ece.cmu.edu/~ece447/s13/lib/exe/fetch.php?media=onur-447-spring... · Mutlu et al., “Runahead Execution: An Alternative to Very Large Instruction](https://reader033.vdocuments.site/reader033/viewer/2022041703/5e42a8ccd44bf57dd14e5440/html5/thumbnails/28.jpg)
Runahead Execution (I)
A technique to obtain the memory-level parallelism benefits of a large instruction window
When the oldest instruction is a long-latency cache miss:
Checkpoint architectural state and enter runahead mode
In runahead mode:
Speculatively pre-execute instructions
The purpose of pre-execution is to generate prefetches
L2-miss dependent instructions are marked INV and dropped
Runahead mode ends when the original miss returns
Checkpoint is restored and normal execution resumes
Mutlu et al., “Runahead Execution: An Alternative to Very Large Instruction Windows for Out-of-order Processors,” HPCA 2003.
28
![Page 29: 18-447: Computer Architecturecourse.ece.cmu.edu/~ece447/s13/lib/exe/fetch.php?media=onur-447-spring... · Mutlu et al., “Runahead Execution: An Alternative to Very Large Instruction](https://reader033.vdocuments.site/reader033/viewer/2022041703/5e42a8ccd44bf57dd14e5440/html5/thumbnails/29.jpg)
Compute
Compute
Compute
Load 1 Miss
Miss 1
Stall Compute
Load 2 Miss
Miss 2
Stall
Load 1 Hit Load 2 Hit
Compute
Load 1 Miss
Runahead
Load 2 Miss Load 2 Hit
Miss 1
Miss 2
Compute
Load 1 Hit
Saved Cycles
Perfect Caches:
Small Window:
Runahead:
Runahead Example
![Page 30: 18-447: Computer Architecturecourse.ece.cmu.edu/~ece447/s13/lib/exe/fetch.php?media=onur-447-spring... · Mutlu et al., “Runahead Execution: An Alternative to Very Large Instruction](https://reader033.vdocuments.site/reader033/viewer/2022041703/5e42a8ccd44bf57dd14e5440/html5/thumbnails/30.jpg)
Benefits of Runahead Execution
Instead of stalling during an L2 cache miss:
Pre-executed loads and stores independent of L2-miss instructions generate very accurate data prefetches:
For both regular and irregular access patterns
Instructions on the predicted program path are prefetched into the instruction/trace cache and L2.
Hardware prefetcher and branch predictor tables are trained using future access information.
![Page 31: 18-447: Computer Architecturecourse.ece.cmu.edu/~ece447/s13/lib/exe/fetch.php?media=onur-447-spring... · Mutlu et al., “Runahead Execution: An Alternative to Very Large Instruction](https://reader033.vdocuments.site/reader033/viewer/2022041703/5e42a8ccd44bf57dd14e5440/html5/thumbnails/31.jpg)
Runahead Execution Mechanism
Entry into runahead mode
Checkpoint architectural register state
Instruction processing in runahead mode
Exit from runahead mode
Restore architectural register state from checkpoint
![Page 32: 18-447: Computer Architecturecourse.ece.cmu.edu/~ece447/s13/lib/exe/fetch.php?media=onur-447-spring... · Mutlu et al., “Runahead Execution: An Alternative to Very Large Instruction](https://reader033.vdocuments.site/reader033/viewer/2022041703/5e42a8ccd44bf57dd14e5440/html5/thumbnails/32.jpg)
Instruction Processing in Runahead Mode
Compute
Load 1 Miss
Runahead
Miss 1
Runahead mode processing is the same as normal instruction processing, EXCEPT:
It is purely speculative: Architectural (software-visible) register/memory state is NOT updated in runahead mode.
L2-miss dependent instructions are identified and treated specially. They are quickly removed from the instruction window.
Their results are not trusted.
![Page 33: 18-447: Computer Architecturecourse.ece.cmu.edu/~ece447/s13/lib/exe/fetch.php?media=onur-447-spring... · Mutlu et al., “Runahead Execution: An Alternative to Very Large Instruction](https://reader033.vdocuments.site/reader033/viewer/2022041703/5e42a8ccd44bf57dd14e5440/html5/thumbnails/33.jpg)
L2-Miss Dependent Instructions
Compute
Load 1 Miss
Runahead
Miss 1
Two types of results produced: INV and VALID
INV = Dependent on an L2 miss
INV results are marked using INV bits in the register file and store buffer.
INV values are not used for prefetching/branch resolution.
![Page 34: 18-447: Computer Architecturecourse.ece.cmu.edu/~ece447/s13/lib/exe/fetch.php?media=onur-447-spring... · Mutlu et al., “Runahead Execution: An Alternative to Very Large Instruction](https://reader033.vdocuments.site/reader033/viewer/2022041703/5e42a8ccd44bf57dd14e5440/html5/thumbnails/34.jpg)
Removal of Instructions from Window
Compute
Load 1 Miss
Runahead
Miss 1
Oldest instruction is examined for pseudo-retirement
An INV instruction is removed from window immediately.
A VALID instruction is removed when it completes execution.
Pseudo-retired instructions free their allocated resources.
This allows the processing of later instructions.
Pseudo-retired stores communicate their data to dependent loads.
![Page 35: 18-447: Computer Architecturecourse.ece.cmu.edu/~ece447/s13/lib/exe/fetch.php?media=onur-447-spring... · Mutlu et al., “Runahead Execution: An Alternative to Very Large Instruction](https://reader033.vdocuments.site/reader033/viewer/2022041703/5e42a8ccd44bf57dd14e5440/html5/thumbnails/35.jpg)
Store/Load Handling in Runahead Mode
Compute
Load 1 Miss
Runahead
Miss 1
A pseudo-retired store writes its data and INV status to a dedicated memory, called runahead cache.
Purpose: Data communication through memory in runahead mode.
A dependent load reads its data from the runahead cache.
Does not need to be always correct Size of runahead cache is very small.
![Page 36: 18-447: Computer Architecturecourse.ece.cmu.edu/~ece447/s13/lib/exe/fetch.php?media=onur-447-spring... · Mutlu et al., “Runahead Execution: An Alternative to Very Large Instruction](https://reader033.vdocuments.site/reader033/viewer/2022041703/5e42a8ccd44bf57dd14e5440/html5/thumbnails/36.jpg)
Branch Handling in Runahead Mode
Compute
Load 1 Miss
Runahead
Miss 1
INV branches cannot be resolved.
A mispredicted INV branch causes the processor to stay on the wrong
program path until the end of runahead execution.
VALID branches are resolved and initiate recovery if mispredicted.
![Page 37: 18-447: Computer Architecturecourse.ece.cmu.edu/~ece447/s13/lib/exe/fetch.php?media=onur-447-spring... · Mutlu et al., “Runahead Execution: An Alternative to Very Large Instruction](https://reader033.vdocuments.site/reader033/viewer/2022041703/5e42a8ccd44bf57dd14e5440/html5/thumbnails/37.jpg)
Runahead Execution Pros and Cons
Advantages: + Very accurate prefetches for data/instructions (all cache levels)
+ Follows the program path
+ Simple to implement, most of the hardware is already built in
+ Versus other pre-execution based prefetching mechanisms:
+ Uses the same thread context as main thread, no waste of context
+ No need to construct a pre-execution thread
Disadvantages/Limitations: -- Extra executed instructions
-- Limited by branch prediction accuracy
-- Cannot prefetch dependent cache misses. Solution?
-- Effectiveness limited by available “memory-level parallelism” (MLP)
-- Prefetch distance limited by memory latency
Implemented in IBM POWER6, Sun “Rock”
37
![Page 38: 18-447: Computer Architecturecourse.ece.cmu.edu/~ece447/s13/lib/exe/fetch.php?media=onur-447-spring... · Mutlu et al., “Runahead Execution: An Alternative to Very Large Instruction](https://reader033.vdocuments.site/reader033/viewer/2022041703/5e42a8ccd44bf57dd14e5440/html5/thumbnails/38.jpg)
38
12%
35%
13%
15%
22% 12%
16% 52%
22%
0.0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1.0
1.1
1.2
1.3
S95 FP00 INT00 WEB MM PROD SERV WS AVG
Mic
ro-o
per
ati
on
s P
er C
ycl
e
No prefetcher, no runahead
Only prefetcher (baseline)
Only runahead
Prefetcher + runahead
Performance of Runahead Execution
![Page 39: 18-447: Computer Architecturecourse.ece.cmu.edu/~ece447/s13/lib/exe/fetch.php?media=onur-447-spring... · Mutlu et al., “Runahead Execution: An Alternative to Very Large Instruction](https://reader033.vdocuments.site/reader033/viewer/2022041703/5e42a8ccd44bf57dd14e5440/html5/thumbnails/39.jpg)
39
Runahead Execution vs. Large Windows
0.0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1.0
1.1
1.2
1.3
1.4
1.5
S95 FP00 INT00 WEB MM PROD SERV WS AVG
Mic
ro-o
per
ati
on
s P
er C
ycl
e
128-entry window (baseline)
128-entry window with Runahead256-entry window
384-entry window512-entry window
![Page 40: 18-447: Computer Architecturecourse.ece.cmu.edu/~ece447/s13/lib/exe/fetch.php?media=onur-447-spring... · Mutlu et al., “Runahead Execution: An Alternative to Very Large Instruction](https://reader033.vdocuments.site/reader033/viewer/2022041703/5e42a8ccd44bf57dd14e5440/html5/thumbnails/40.jpg)
Runahead vs. A Real Large Window
When is one beneficial, when is the other?
Pros and cons of each
40
![Page 41: 18-447: Computer Architecturecourse.ece.cmu.edu/~ece447/s13/lib/exe/fetch.php?media=onur-447-spring... · Mutlu et al., “Runahead Execution: An Alternative to Very Large Instruction](https://reader033.vdocuments.site/reader033/viewer/2022041703/5e42a8ccd44bf57dd14e5440/html5/thumbnails/41.jpg)
41
Runahead on In-order vs. Out-of-order
39%
50%28%
14%
20%
17%
73%
73%
15%
20%
47%15%
12%
22%
13%
16%
23%
10%
0.0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1.0
1.1
1.2
1.3
S95 FP00 INT00 WEB MM PROD SERV WS AVG
Mic
ro-o
per
ati
on
s P
er C
ycl
e
in-order baseline
in-order + runahead
out-of-order baseline
out-of-order + runahead
![Page 42: 18-447: Computer Architecturecourse.ece.cmu.edu/~ece447/s13/lib/exe/fetch.php?media=onur-447-spring... · Mutlu et al., “Runahead Execution: An Alternative to Very Large Instruction](https://reader033.vdocuments.site/reader033/viewer/2022041703/5e42a8ccd44bf57dd14e5440/html5/thumbnails/42.jpg)
Effect of Runahead in Sun ROCK
Shailender Chaudhry talk, Aug 2008.
42
![Page 43: 18-447: Computer Architecturecourse.ece.cmu.edu/~ece447/s13/lib/exe/fetch.php?media=onur-447-spring... · Mutlu et al., “Runahead Execution: An Alternative to Very Large Instruction](https://reader033.vdocuments.site/reader033/viewer/2022041703/5e42a8ccd44bf57dd14e5440/html5/thumbnails/43.jpg)
Runahead Enhancements
![Page 44: 18-447: Computer Architecturecourse.ece.cmu.edu/~ece447/s13/lib/exe/fetch.php?media=onur-447-spring... · Mutlu et al., “Runahead Execution: An Alternative to Very Large Instruction](https://reader033.vdocuments.site/reader033/viewer/2022041703/5e42a8ccd44bf57dd14e5440/html5/thumbnails/44.jpg)
Readings
Required
Mutlu et al., “Runahead Execution”, HPCA 2003, Top Picks 2003.
Recommended
Mutlu et al., “Efficient Runahead Execution: Power-Efficient Memory Latency Tolerance,” ISCA 2005, IEEE Micro Top Picks 2006.
Mutlu et al., “Address-Value Delta (AVD) Prediction,” MICRO 2005.
Armstrong et al., “Wrong Path Events,” MICRO 2004.
44
![Page 45: 18-447: Computer Architecturecourse.ece.cmu.edu/~ece447/s13/lib/exe/fetch.php?media=onur-447-spring... · Mutlu et al., “Runahead Execution: An Alternative to Very Large Instruction](https://reader033.vdocuments.site/reader033/viewer/2022041703/5e42a8ccd44bf57dd14e5440/html5/thumbnails/45.jpg)
Limitations of the Baseline Runahead Mechanism
Energy Inefficiency
A large number of instructions are speculatively executed
Efficient Runahead Execution [ISCA’05, IEEE Micro Top Picks’06]
Ineffectiveness for pointer-intensive applications
Runahead cannot parallelize dependent L2 cache misses
Address-Value Delta (AVD) Prediction [MICRO’05]
Irresolvable branch mispredictions in runahead mode
Cannot recover from a mispredicted L2-miss dependent branch
Wrong Path Events [MICRO’04]
![Page 46: 18-447: Computer Architecturecourse.ece.cmu.edu/~ece447/s13/lib/exe/fetch.php?media=onur-447-spring... · Mutlu et al., “Runahead Execution: An Alternative to Very Large Instruction](https://reader033.vdocuments.site/reader033/viewer/2022041703/5e42a8ccd44bf57dd14e5440/html5/thumbnails/46.jpg)
The Efficiency Problem
0%
10%
20%
30%
40%
50%
60%
70%
80%
90%
100%
110%
bz
ip2
cra
fty
eo
n
ga
p
gc
c
gz
ip
mc
f
pa
rse
r
pe
rlb
mk
two
lf
vo
rte
x
vp
r
am
mp
ap
plu
ap
si
art
eq
ua
ke
fac
ere
c
fma
3d
ga
lge
l
luc
as
me
sa
mg
rid
six
tra
ck
sw
im
wu
pw
ise
AV
G
% In c re a se in IP C
% In c re a se in E xe cu te d In s tru c t io n s
2 3 5 %
22%
27%
![Page 47: 18-447: Computer Architecturecourse.ece.cmu.edu/~ece447/s13/lib/exe/fetch.php?media=onur-447-spring... · Mutlu et al., “Runahead Execution: An Alternative to Very Large Instruction](https://reader033.vdocuments.site/reader033/viewer/2022041703/5e42a8ccd44bf57dd14e5440/html5/thumbnails/47.jpg)
Causes of Inefficiency
Short runahead periods
Overlapping runahead periods
Useless runahead periods
Mutlu et al., “Efficient Runahead Execution: Power-Efficient Memory Latency Tolerance,” ISCA 2005, IEEE Micro Top Picks 2006.
![Page 48: 18-447: Computer Architecturecourse.ece.cmu.edu/~ece447/s13/lib/exe/fetch.php?media=onur-447-spring... · Mutlu et al., “Runahead Execution: An Alternative to Very Large Instruction](https://reader033.vdocuments.site/reader033/viewer/2022041703/5e42a8ccd44bf57dd14e5440/html5/thumbnails/48.jpg)
Short Runahead Periods
Processor can initiate runahead mode due to an already in-flight L2 miss generated by
the prefetcher, wrong-path, or a previous runahead period
Short periods
are less likely to generate useful L2 misses
have high overhead due to the flush penalty at runahead exit
Compute
Load 1 Miss
Runahead
Load 2 Miss Load 2 Miss
Miss 1
Miss 2
Load 1 Hit
![Page 49: 18-447: Computer Architecturecourse.ece.cmu.edu/~ece447/s13/lib/exe/fetch.php?media=onur-447-spring... · Mutlu et al., “Runahead Execution: An Alternative to Very Large Instruction](https://reader033.vdocuments.site/reader033/viewer/2022041703/5e42a8ccd44bf57dd14e5440/html5/thumbnails/49.jpg)
Overlapping Runahead Periods
Compute
Load 1 Miss
Miss 1
Runahead
Load 2 Miss
Miss 2
Load 2 INV Load 1 Hit
OVERLAP OVERLAP
Two runahead periods that execute the same instructions
Second period is inefficient
![Page 50: 18-447: Computer Architecturecourse.ece.cmu.edu/~ece447/s13/lib/exe/fetch.php?media=onur-447-spring... · Mutlu et al., “Runahead Execution: An Alternative to Very Large Instruction](https://reader033.vdocuments.site/reader033/viewer/2022041703/5e42a8ccd44bf57dd14e5440/html5/thumbnails/50.jpg)
Useless Runahead Periods
Periods that do not result in prefetches for normal mode
They exist due to the lack of memory-level parallelism
Mechanism to eliminate useless periods:
Predict if a period will generate useful L2 misses
Estimate a period to be useful if it generated an L2 miss that cannot be captured by the instruction window
Useless period predictors are trained based on this estimation
Compute
Load 1 Miss
Runahead
Miss 1
Load 1 Hit
![Page 51: 18-447: Computer Architecturecourse.ece.cmu.edu/~ece447/s13/lib/exe/fetch.php?media=onur-447-spring... · Mutlu et al., “Runahead Execution: An Alternative to Very Large Instruction](https://reader033.vdocuments.site/reader033/viewer/2022041703/5e42a8ccd44bf57dd14e5440/html5/thumbnails/51.jpg)
0 %
1 0 %
2 0 %
3 0 %
4 0 %
5 0 %
6 0 %
7 0 %
8 0 %
9 0 %
1 0 0 %
1 1 0 %b
zip
2
cra
fty
eo
n
ga
p
gc
c
gz
ip
mc
f
pa
rse
r
pe
rlb
mk
two
lf
vo
rte
x
vp
r
am
mp
ap
plu
ap
si
art
eq
ua
ke
fac
ere
c
fma
3d
ga
lge
l
luc
as
me
sa
mg
rid
six
tra
ck
sw
im
wu
pw
ise
AV
G
Inc
re
as
e i
n E
xe
cu
ted
In
str
uc
tio
ns
b a se lin e ru n ah e ad
a ll te ch n iq u e s
2 3 5 %
Overall Impact on Executed Instructions
26.5%
6.2%
![Page 52: 18-447: Computer Architecturecourse.ece.cmu.edu/~ece447/s13/lib/exe/fetch.php?media=onur-447-spring... · Mutlu et al., “Runahead Execution: An Alternative to Very Large Instruction](https://reader033.vdocuments.site/reader033/viewer/2022041703/5e42a8ccd44bf57dd14e5440/html5/thumbnails/52.jpg)
Overall Impact on IPC
0 %
1 0 %
2 0 %
3 0 %
4 0 %
5 0 %
6 0 %
7 0 %
8 0 %
9 0 %
1 0 0 %
1 1 0 %b
zip
2
cra
fty
eo
n
ga
p
gc
c
gz
ip
mc
f
pa
rse
r
pe
r lb
mk
two
lf
vo
rte
x
vp
r
am
mp
ap
plu
ap
si
art
eq
ua
ke
fac
ere
c
fma
3d
ga
lge
l
luc
as
me
sa
mg
rid
six
tra
ck
sw
im
wu
pw
ise
AV
G
Inc
re
as
e i
n I
PC
b a se lin e ru n ah e ad
a ll te ch n iq u e s
1 1 6 %
22.6%
22.1%
![Page 53: 18-447: Computer Architecturecourse.ece.cmu.edu/~ece447/s13/lib/exe/fetch.php?media=onur-447-spring... · Mutlu et al., “Runahead Execution: An Alternative to Very Large Instruction](https://reader033.vdocuments.site/reader033/viewer/2022041703/5e42a8ccd44bf57dd14e5440/html5/thumbnails/53.jpg)
Taking Advantage of Pure Speculation
Runahead mode is purely speculative
The goal is to find and generate cache misses that would otherwise stall execution later on
How do we achieve this goal most efficiently and with the highest benefit?
Idea: Find and execute only those instructions that will lead to cache misses (that cannot already be captured by the instruction window)
How?
53
![Page 54: 18-447: Computer Architecturecourse.ece.cmu.edu/~ece447/s13/lib/exe/fetch.php?media=onur-447-spring... · Mutlu et al., “Runahead Execution: An Alternative to Very Large Instruction](https://reader033.vdocuments.site/reader033/viewer/2022041703/5e42a8ccd44bf57dd14e5440/html5/thumbnails/54.jpg)
Limitations of the Baseline Runahead Mechanism
Energy Inefficiency
A large number of instructions are speculatively executed
Efficient Runahead Execution [ISCA’05, IEEE Micro Top Picks’06]
Ineffectiveness for pointer-intensive applications
Runahead cannot parallelize dependent L2 cache misses
Address-Value Delta (AVD) Prediction [MICRO’05]
Irresolvable branch mispredictions in runahead mode
Cannot recover from a mispredicted L2-miss dependent branch
Wrong Path Events [MICRO’04]
![Page 55: 18-447: Computer Architecturecourse.ece.cmu.edu/~ece447/s13/lib/exe/fetch.php?media=onur-447-spring... · Mutlu et al., “Runahead Execution: An Alternative to Very Large Instruction](https://reader033.vdocuments.site/reader033/viewer/2022041703/5e42a8ccd44bf57dd14e5440/html5/thumbnails/55.jpg)
Runahead execution cannot parallelize dependent misses
wasted opportunity to improve performance
wasted energy (useless pre-execution)
Runahead performance would improve by 25% if this limitation were ideally overcome
The Problem: Dependent Cache Misses
Compute
Load 1 Miss
Miss 1
Load 2 Miss
Miss 2
Load 2 Load 1 Hit
Runahead: Load 2 is dependent on Load 1
Runahead
Cannot Compute Its Address!
INV
![Page 56: 18-447: Computer Architecturecourse.ece.cmu.edu/~ece447/s13/lib/exe/fetch.php?media=onur-447-spring... · Mutlu et al., “Runahead Execution: An Alternative to Very Large Instruction](https://reader033.vdocuments.site/reader033/viewer/2022041703/5e42a8ccd44bf57dd14e5440/html5/thumbnails/56.jpg)
Parallelizing Dependent Cache Misses
Idea: Enable the parallelization of dependent L2 cache misses in runahead mode with a low-cost mechanism
How: Predict the values of L2-miss address (pointer) loads
Address load: loads an address into its destination register, which is later used to calculate the address of another load
as opposed to data load
Read:
Mutlu et al., “Address-Value Delta (AVD) Prediction,” MICRO 2005.
![Page 57: 18-447: Computer Architecturecourse.ece.cmu.edu/~ece447/s13/lib/exe/fetch.php?media=onur-447-spring... · Mutlu et al., “Runahead Execution: An Alternative to Very Large Instruction](https://reader033.vdocuments.site/reader033/viewer/2022041703/5e42a8ccd44bf57dd14e5440/html5/thumbnails/57.jpg)
Parallelizing Dependent Cache Misses
Compute
Load 1 Miss
Miss 1
Load 2 Hit
Miss 2
Load 2 Load 1 Hit
Value Predicted
Runahead Saved Cycles
Can Compute Its Address
Compute
Load 1 Miss
Miss 1
Load 2 Miss
Miss 2
Load 2 INV Load 1 Hit
Runahead
Cannot Compute Its Address!
Saved Speculative
Instructions Miss
![Page 58: 18-447: Computer Architecturecourse.ece.cmu.edu/~ece447/s13/lib/exe/fetch.php?media=onur-447-spring... · Mutlu et al., “Runahead Execution: An Alternative to Very Large Instruction](https://reader033.vdocuments.site/reader033/viewer/2022041703/5e42a8ccd44bf57dd14e5440/html5/thumbnails/58.jpg)
AVD Prediction [MICRO’05]
Address-value delta (AVD) of a load instruction defined as:
AVD = Effective Address of Load – Data Value of Load
For some address loads, AVD is stable
An AVD predictor keeps track of the AVDs of address loads
When a load is an L2 miss in runahead mode, AVD predictor is consulted
If the predictor returns a stable (confident) AVD for that load, the value of the load is predicted
Predicted Value = Effective Address – Predicted AVD
![Page 59: 18-447: Computer Architecturecourse.ece.cmu.edu/~ece447/s13/lib/exe/fetch.php?media=onur-447-spring... · Mutlu et al., “Runahead Execution: An Alternative to Very Large Instruction](https://reader033.vdocuments.site/reader033/viewer/2022041703/5e42a8ccd44bf57dd14e5440/html5/thumbnails/59.jpg)
Why Do Stable AVDs Occur?
Regularity in the way data structures are
allocated in memory AND
traversed
Two types of loads can have stable AVDs
Traversal address loads
Produce addresses consumed by address loads
Leaf address loads
Produce addresses consumed by data loads
![Page 60: 18-447: Computer Architecturecourse.ece.cmu.edu/~ece447/s13/lib/exe/fetch.php?media=onur-447-spring... · Mutlu et al., “Runahead Execution: An Alternative to Very Large Instruction](https://reader033.vdocuments.site/reader033/viewer/2022041703/5e42a8ccd44bf57dd14e5440/html5/thumbnails/60.jpg)
Traversal Address Loads
Regularly-allocated linked list:
A
A+k
A+2k
A+3k ...
A traversal address load loads the
pointer to next node:
node = nodenext
Effective Addr Data Value AVD
A A+k -k
A+k A+2k -k
A+2k A+3k -k
Stable AVD Striding
data value
AVD = Effective Addr – Data Value
![Page 61: 18-447: Computer Architecturecourse.ece.cmu.edu/~ece447/s13/lib/exe/fetch.php?media=onur-447-spring... · Mutlu et al., “Runahead Execution: An Alternative to Very Large Instruction](https://reader033.vdocuments.site/reader033/viewer/2022041703/5e42a8ccd44bf57dd14e5440/html5/thumbnails/61.jpg)
Leaf Address Loads
Sorted dictionary in parser:
Nodes point to strings (words)
String and node allocated consecutively
A+k
A C+k
C
B+k
B
D+k E+k F+k G+k
D E F G
Dictionary looked up for an input word.
A leaf address load loads the pointer to
the string of each node:
Effective Addr Data Value AVD
A+k A k
C+k C k
F+k F k
lookup (node, input) { // ...
ptr_str = nodestring; m = check_match(ptr_str, input);
// …
}
Stable AVD No stride!
AVD = Effective Addr – Data Value string
node
![Page 62: 18-447: Computer Architecturecourse.ece.cmu.edu/~ece447/s13/lib/exe/fetch.php?media=onur-447-spring... · Mutlu et al., “Runahead Execution: An Alternative to Very Large Instruction](https://reader033.vdocuments.site/reader033/viewer/2022041703/5e42a8ccd44bf57dd14e5440/html5/thumbnails/62.jpg)
AVD Prediction 62
Identifying Address Loads in Hardware
Insight:
If the AVD is too large, the value that is loaded is likely not an address
Only keep track of loads that satisfy:
-MaxAVD ≤ AVD ≤ +MaxAVD
This identification mechanism eliminates many loads from consideration
Enables the AVD predictor to be small
![Page 63: 18-447: Computer Architecturecourse.ece.cmu.edu/~ece447/s13/lib/exe/fetch.php?media=onur-447-spring... · Mutlu et al., “Runahead Execution: An Alternative to Very Large Instruction](https://reader033.vdocuments.site/reader033/viewer/2022041703/5e42a8ccd44bf57dd14e5440/html5/thumbnails/63.jpg)
0 .0
0 .1
0 .2
0 .3
0 .4
0 .5
0 .6
0 .7
0 .8
0 .9
1 .0
bis
ort
health
ms t
per im
ete
r
treeadd
tsp
voro
noi
mc f
pars
er
twolf
v pr
AV
G
No
rm
ali
ze
d E
xe
cu
tio
n T
ime
an
d E
xe
cu
ted
In
str
uc
tio
ns
E x e c u tio n T im e
E x e c u te d In s tru c tio n s
Performance of AVD Prediction
runahead
14.3% 15.5%
![Page 64: 18-447: Computer Architecturecourse.ece.cmu.edu/~ece447/s13/lib/exe/fetch.php?media=onur-447-spring... · Mutlu et al., “Runahead Execution: An Alternative to Very Large Instruction](https://reader033.vdocuments.site/reader033/viewer/2022041703/5e42a8ccd44bf57dd14e5440/html5/thumbnails/64.jpg)
Readings
Required
Mutlu et al., “Runahead Execution”, HPCA 2003, Top Picks 2003.
Recommended
Mutlu et al., “Efficient Runahead Execution: Power-Efficient Memory Latency Tolerance,” ISCA 2005, IEEE Micro Top Picks 2006.
Mutlu et al., “Address-Value Delta (AVD) Prediction,” MICRO 2005.
Armstrong et al., “Wrong Path Events,” MICRO 2004.
64
![Page 65: 18-447: Computer Architecturecourse.ece.cmu.edu/~ece447/s13/lib/exe/fetch.php?media=onur-447-spring... · Mutlu et al., “Runahead Execution: An Alternative to Very Large Instruction](https://reader033.vdocuments.site/reader033/viewer/2022041703/5e42a8ccd44bf57dd14e5440/html5/thumbnails/65.jpg)
We did not cover the following slides in
lecture. They are for your benefit.
![Page 66: 18-447: Computer Architecturecourse.ece.cmu.edu/~ece447/s13/lib/exe/fetch.php?media=onur-447-spring... · Mutlu et al., “Runahead Execution: An Alternative to Very Large Instruction](https://reader033.vdocuments.site/reader033/viewer/2022041703/5e42a8ccd44bf57dd14e5440/html5/thumbnails/66.jpg)
More on Runahead Enhancements
![Page 67: 18-447: Computer Architecturecourse.ece.cmu.edu/~ece447/s13/lib/exe/fetch.php?media=onur-447-spring... · Mutlu et al., “Runahead Execution: An Alternative to Very Large Instruction](https://reader033.vdocuments.site/reader033/viewer/2022041703/5e42a8ccd44bf57dd14e5440/html5/thumbnails/67.jpg)
Eliminating Short Periods
Mechanism to eliminate short periods:
Record the number of cycles C an L2-miss has been in flight
If C is greater than a threshold T for an L2 miss, disable entry into runahead mode due to that miss
T can be determined statically (at design time) or dynamically
T=400 for a minimum main memory latency of 500 cycles works well
67
![Page 68: 18-447: Computer Architecturecourse.ece.cmu.edu/~ece447/s13/lib/exe/fetch.php?media=onur-447-spring... · Mutlu et al., “Runahead Execution: An Alternative to Very Large Instruction](https://reader033.vdocuments.site/reader033/viewer/2022041703/5e42a8ccd44bf57dd14e5440/html5/thumbnails/68.jpg)
Eliminating Overlapping Periods
Overlapping periods are not necessarily useless
The availability of a new data value can result in the generation of useful L2 misses
But, this does not happen often enough
Mechanism to eliminate overlapping periods:
Keep track of the number of pseudo-retired instructions R during a runahead period
Keep track of the number of fetched instructions N since the exit from last runahead period
If N < R, do not enter runahead mode
68
![Page 69: 18-447: Computer Architecturecourse.ece.cmu.edu/~ece447/s13/lib/exe/fetch.php?media=onur-447-spring... · Mutlu et al., “Runahead Execution: An Alternative to Very Large Instruction](https://reader033.vdocuments.site/reader033/viewer/2022041703/5e42a8ccd44bf57dd14e5440/html5/thumbnails/69.jpg)
AVD Prediction 69
Stable AVDs can be captured with a stride value predictor
Stable AVDs disappear with the re-organization of the data structure (e.g., sorting)
Stability of AVDs is dependent on the behavior of the memory allocator
Allocation of contiguous, fixed-size chunks is useful
Properties of Traversal-based AVDs
A
A+k
A+2k
A+3k
A+3k
A+k
A
A+2k
Sorting
Distance between
nodes NOT constant!
![Page 70: 18-447: Computer Architecturecourse.ece.cmu.edu/~ece447/s13/lib/exe/fetch.php?media=onur-447-spring... · Mutlu et al., “Runahead Execution: An Alternative to Very Large Instruction](https://reader033.vdocuments.site/reader033/viewer/2022041703/5e42a8ccd44bf57dd14e5440/html5/thumbnails/70.jpg)
AVD Prediction 70
Properties of Leaf-based AVDs
Stable AVDs cannot be captured with a stride value predictor
Stable AVDs do not disappear with the re-organization of the data structure (e.g., sorting)
Stability of AVDs is dependent on the behavior of the memory allocator
A+k
A
B+k
B C
C+k Sorting
Distance between
node and string
still constant!
C+k
C
A+k
A B
B+k
![Page 71: 18-447: Computer Architecturecourse.ece.cmu.edu/~ece447/s13/lib/exe/fetch.php?media=onur-447-spring... · Mutlu et al., “Runahead Execution: An Alternative to Very Large Instruction](https://reader033.vdocuments.site/reader033/viewer/2022041703/5e42a8ccd44bf57dd14e5440/html5/thumbnails/71.jpg)
AVD Prediction 71
An Implementable AVD Predictor
Set-associative prediction table
Prediction table entry consists of
Tag (Program Counter of the load)
Last AVD seen for the load
Confidence counter for the recorded AVD
Updated when an address load is retired in normal mode
Accessed when a load misses in L2 cache in runahead mode
Recovery-free: No need to recover the state of the processor or the predictor on misprediction
Runahead mode is purely speculative
![Page 72: 18-447: Computer Architecturecourse.ece.cmu.edu/~ece447/s13/lib/exe/fetch.php?media=onur-447-spring... · Mutlu et al., “Runahead Execution: An Alternative to Very Large Instruction](https://reader033.vdocuments.site/reader033/viewer/2022041703/5e42a8ccd44bf57dd14e5440/html5/thumbnails/72.jpg)
AVD Prediction 72
AVD Update Logic
![Page 73: 18-447: Computer Architecturecourse.ece.cmu.edu/~ece447/s13/lib/exe/fetch.php?media=onur-447-spring... · Mutlu et al., “Runahead Execution: An Alternative to Very Large Instruction](https://reader033.vdocuments.site/reader033/viewer/2022041703/5e42a8ccd44bf57dd14e5440/html5/thumbnails/73.jpg)
AVD Prediction 73
AVD Prediction Logic
![Page 74: 18-447: Computer Architecturecourse.ece.cmu.edu/~ece447/s13/lib/exe/fetch.php?media=onur-447-spring... · Mutlu et al., “Runahead Execution: An Alternative to Very Large Instruction](https://reader033.vdocuments.site/reader033/viewer/2022041703/5e42a8ccd44bf57dd14e5440/html5/thumbnails/74.jpg)
AVD Prediction 74
Baseline Processor
Execution-driven Alpha simulator
8-wide superscalar processor
128-entry instruction window, 20-stage pipeline
64 KB, 4-way, 2-cycle L1 data and instruction caches
1 MB, 32-way, 10-cycle unified L2 cache
500-cycle minimum main memory latency
32 DRAM banks, 32-byte wide processor-memory bus (4:1 frequency ratio), 128 outstanding misses
Detailed memory model
Pointer-intensive benchmarks from Olden and SPEC INT00
![Page 75: 18-447: Computer Architecturecourse.ece.cmu.edu/~ece447/s13/lib/exe/fetch.php?media=onur-447-spring... · Mutlu et al., “Runahead Execution: An Alternative to Very Large Instruction](https://reader033.vdocuments.site/reader033/viewer/2022041703/5e42a8ccd44bf57dd14e5440/html5/thumbnails/75.jpg)
AVD Prediction 75
AVD vs. Stride VP Performance
0 .8 0
0 .8 2
0 .8 4
0 .8 6
0 .8 8
0 .9 0
0 .9 2
0 .9 4
0 .9 6
0 .9 8
1 .0 0
1 6 e n tr ie s 4 0 9 6 e n tr ie s
No
rm
ali
ze
d E
xe
cu
tio
n T
ime
(e
xc
lud
ing
he
alt
h)
A V D
s tr id e
h yb rid
5.1%
2.7%
6.5% 5.5%
4.7%
8.6%
16 entries 4096 entries
![Page 76: 18-447: Computer Architecturecourse.ece.cmu.edu/~ece447/s13/lib/exe/fetch.php?media=onur-447-spring... · Mutlu et al., “Runahead Execution: An Alternative to Very Large Instruction](https://reader033.vdocuments.site/reader033/viewer/2022041703/5e42a8ccd44bf57dd14e5440/html5/thumbnails/76.jpg)
Wrong Path Events
![Page 77: 18-447: Computer Architecturecourse.ece.cmu.edu/~ece447/s13/lib/exe/fetch.php?media=onur-447-spring... · Mutlu et al., “Runahead Execution: An Alternative to Very Large Instruction](https://reader033.vdocuments.site/reader033/viewer/2022041703/5e42a8ccd44bf57dd14e5440/html5/thumbnails/77.jpg)
An Observation and A Question
• In an out-of-order processor, some instructions are executed on the mispredicted path (wrong-path instructions).
• Is the behavior of wrong-path instructions different from the behavior of correct-path instructions?
– If so, we can use the difference in behavior for early misprediction detection and recovery.
![Page 78: 18-447: Computer Architecturecourse.ece.cmu.edu/~ece447/s13/lib/exe/fetch.php?media=onur-447-spring... · Mutlu et al., “Runahead Execution: An Alternative to Very Large Instruction](https://reader033.vdocuments.site/reader033/viewer/2022041703/5e42a8ccd44bf57dd14e5440/html5/thumbnails/78.jpg)
What is a Wrong Path Event?
An instance of illegal or unusual behavior
that is more likely to occur on the wrong
path than on the correct path.
Wrong Path Event = WPE
Probability (wrong path | WPE) ~ 1
![Page 79: 18-447: Computer Architecturecourse.ece.cmu.edu/~ece447/s13/lib/exe/fetch.php?media=onur-447-spring... · Mutlu et al., “Runahead Execution: An Alternative to Very Large Instruction](https://reader033.vdocuments.site/reader033/viewer/2022041703/5e42a8ccd44bf57dd14e5440/html5/thumbnails/79.jpg)
Why Does a WPE Occur?
• A wrong-path instruction may be executed
before the mispredicted branch is
executed.
– Because the mispredicted branch may be
dependent on a long-latency instruction.
• The wrong-path instruction may consume
a data value that is not properly initialized.
![Page 80: 18-447: Computer Architecturecourse.ece.cmu.edu/~ece447/s13/lib/exe/fetch.php?media=onur-447-spring... · Mutlu et al., “Runahead Execution: An Alternative to Very Large Instruction](https://reader033.vdocuments.site/reader033/viewer/2022041703/5e42a8ccd44bf57dd14e5440/html5/thumbnails/80.jpg)
WPE Example from eon:
NULL pointer dereference
1 : for (int i=0 ; i< length(); i++) {
2 : structure *ptr = array[i];
3 : if (ptr->x) {
4 : // . . .
5 : }
6 : }
![Page 81: 18-447: Computer Architecturecourse.ece.cmu.edu/~ece447/s13/lib/exe/fetch.php?media=onur-447-spring... · Mutlu et al., “Runahead Execution: An Alternative to Very Large Instruction](https://reader033.vdocuments.site/reader033/viewer/2022041703/5e42a8ccd44bf57dd14e5440/html5/thumbnails/81.jpg)
Beginning of the loop
xEFF8B0 x8ABCD0 x0 x0
Array boundary
Array of pointers
to structs
i = 0
1 : for (int i=0 ; i< length(); i++) {
2 : structure *ptr = array[i];
3 : if (ptr->x) {
4 : // . . .
5 : }
6 : }
![Page 82: 18-447: Computer Architecturecourse.ece.cmu.edu/~ece447/s13/lib/exe/fetch.php?media=onur-447-spring... · Mutlu et al., “Runahead Execution: An Alternative to Very Large Instruction](https://reader033.vdocuments.site/reader033/viewer/2022041703/5e42a8ccd44bf57dd14e5440/html5/thumbnails/82.jpg)
First iteration
xEFF8B0 x8ABCD0 x0 x0
Array boundary
Array of pointers
to structs
i = 0 ptr = x8ABCD0
1 : for (int i=0 ; i< length(); i++) {
2 : structure *ptr = array[i];
3 : if (ptr->x) {
4 : // . . .
5 : }
6 : }
![Page 83: 18-447: Computer Architecturecourse.ece.cmu.edu/~ece447/s13/lib/exe/fetch.php?media=onur-447-spring... · Mutlu et al., “Runahead Execution: An Alternative to Very Large Instruction](https://reader033.vdocuments.site/reader033/viewer/2022041703/5e42a8ccd44bf57dd14e5440/html5/thumbnails/83.jpg)
First iteration
xEFF8B0 x8ABCD0 x0 x0
Array boundary
Array of pointers
to structs
i = 0 ptr = x8ABCD0
*ptr
1 : for (int i=0 ; i< length(); i++) {
2 : structure *ptr = array[i];
3 : if (ptr->x) {
4 : // . . .
5 : }
6 : }
![Page 84: 18-447: Computer Architecturecourse.ece.cmu.edu/~ece447/s13/lib/exe/fetch.php?media=onur-447-spring... · Mutlu et al., “Runahead Execution: An Alternative to Very Large Instruction](https://reader033.vdocuments.site/reader033/viewer/2022041703/5e42a8ccd44bf57dd14e5440/html5/thumbnails/84.jpg)
Loop branch correctly predicted
xEFF8B0 x8ABCD0 x0 x0
Array boundary
Array of pointers
to structs
i = 1
1 : for (int i=0 ; i< length(); i++) {
2 : structure *ptr = array[i];
3 : if (ptr->x) {
4 : // . . .
5 : }
6 : }
![Page 85: 18-447: Computer Architecturecourse.ece.cmu.edu/~ece447/s13/lib/exe/fetch.php?media=onur-447-spring... · Mutlu et al., “Runahead Execution: An Alternative to Very Large Instruction](https://reader033.vdocuments.site/reader033/viewer/2022041703/5e42a8ccd44bf57dd14e5440/html5/thumbnails/85.jpg)
Second iteration
xEFF8B0 x8ABCD0 x0 x0
Array boundary
Array of pointers
to structs
i = 1 ptr = xEFF8B0
1 : for (int i=0 ; i< length(); i++) {
2 : structure *ptr = array[i];
3 : if (ptr->x) {
4 : // . . .
5 : }
6 : }
![Page 86: 18-447: Computer Architecturecourse.ece.cmu.edu/~ece447/s13/lib/exe/fetch.php?media=onur-447-spring... · Mutlu et al., “Runahead Execution: An Alternative to Very Large Instruction](https://reader033.vdocuments.site/reader033/viewer/2022041703/5e42a8ccd44bf57dd14e5440/html5/thumbnails/86.jpg)
Second iteration
xEFF8B0 x8ABCD0 x0 x0
Array boundary
Array of pointers
to structs
i = 1 ptr = xEFF8B0
*ptr
1 : for (int i=0 ; i< length(); i++) {
2 : structure *ptr = array[i];
3 : if (ptr->x) {
4 : // . . .
5 : }
6 : }
![Page 87: 18-447: Computer Architecturecourse.ece.cmu.edu/~ece447/s13/lib/exe/fetch.php?media=onur-447-spring... · Mutlu et al., “Runahead Execution: An Alternative to Very Large Instruction](https://reader033.vdocuments.site/reader033/viewer/2022041703/5e42a8ccd44bf57dd14e5440/html5/thumbnails/87.jpg)
Loop exit branch mispredicted
xEFF8B0 x8ABCD0 x0 x0
Array boundary
Array of pointers
to structs
i = 2
1 : for (int i=0 ; i< length(); i++) {
2 : structure *ptr = array[i];
3 : if (ptr->x) {
4 : // . . .
5 : }
6 : }
![Page 88: 18-447: Computer Architecturecourse.ece.cmu.edu/~ece447/s13/lib/exe/fetch.php?media=onur-447-spring... · Mutlu et al., “Runahead Execution: An Alternative to Very Large Instruction](https://reader033.vdocuments.site/reader033/viewer/2022041703/5e42a8ccd44bf57dd14e5440/html5/thumbnails/88.jpg)
Third iteration on wrong path
xEFF8B0 x8ABCD0 x0 x0
Array boundary
Array of pointers
to structs
i = 2
ptr = 0
1 : for (int i=0 ; i< length(); i++) {
2 : structure *ptr = array[i];
3 : if (ptr->x) {
4 : // . . .
5 : }
6 : }
![Page 89: 18-447: Computer Architecturecourse.ece.cmu.edu/~ece447/s13/lib/exe/fetch.php?media=onur-447-spring... · Mutlu et al., “Runahead Execution: An Alternative to Very Large Instruction](https://reader033.vdocuments.site/reader033/viewer/2022041703/5e42a8ccd44bf57dd14e5440/html5/thumbnails/89.jpg)
Wrong Path Event
xEFF8B0 x8ABCD0 x0 x0
Array boundary
Array of pointers
to structs
NULL pointer dereference!
i = 2
ptr = 0
*ptr
1 : for (int i=0 ; i< length(); i++) {
2 : structure *ptr = array[i];
3 : if (ptr->x) {
4 : // . . .
5 : }
6 : }
![Page 90: 18-447: Computer Architecturecourse.ece.cmu.edu/~ece447/s13/lib/exe/fetch.php?media=onur-447-spring... · Mutlu et al., “Runahead Execution: An Alternative to Very Large Instruction](https://reader033.vdocuments.site/reader033/viewer/2022041703/5e42a8ccd44bf57dd14e5440/html5/thumbnails/90.jpg)
Types of WPEs
• Due to memory instructions
– NULL pointer dereference
– Write to read-only page
– Unaligned access (illegal in the Alpha ISA)
– Access to an address out of segment range
– Data access to code segment
– Multiple concurrent TLB misses
![Page 91: 18-447: Computer Architecturecourse.ece.cmu.edu/~ece447/s13/lib/exe/fetch.php?media=onur-447-spring... · Mutlu et al., “Runahead Execution: An Alternative to Very Large Instruction](https://reader033.vdocuments.site/reader033/viewer/2022041703/5e42a8ccd44bf57dd14e5440/html5/thumbnails/91.jpg)
Types of WPEs (continued)
• Due to control-flow instructions – Misprediction under misprediction
• If three branches are executed and resolved as mispredicts while there are older unresolved branches in the processor, it is almost certain that one of the older unresolved branches is mispredicted.
– Return address stack underflow
– Unaligned instruction fetch address (illegal in Alpha)
• Due to arithmetic instructions – Some arithmetic exceptions
• e.g. Divide by zero
![Page 92: 18-447: Computer Architecturecourse.ece.cmu.edu/~ece447/s13/lib/exe/fetch.php?media=onur-447-spring... · Mutlu et al., “Runahead Execution: An Alternative to Very Large Instruction](https://reader033.vdocuments.site/reader033/viewer/2022041703/5e42a8ccd44bf57dd14e5440/html5/thumbnails/92.jpg)
Two Empirical Questions
1. How often do WPEs occur?
2. When do WPEs occur on the wrong path?