advanced microarchitecture
DESCRIPTION
Advanced Microarchitecture. Prof. Mikko H. Lipasti, University of Wisconsin-Madison. Lecture notes based on notes by Ilhyun Kim, updated by Mikko Lipasti. Outline: instruction scheduling overview, scheduling atomicity, speculative scheduling, scheduling recovery.
![Page 1: Advanced Microarchitecture](https://reader033.vdocuments.site/reader033/viewer/2022051001/5681488b550346895db5a0ad/html5/thumbnails/1.jpg)
Advanced Microarchitecture
Prof. Mikko H. Lipasti
University of Wisconsin-Madison
Lecture notes based on notes by Ilhyun Kim
Updated by Mikko Lipasti
![Page 2: Advanced Microarchitecture](https://reader033.vdocuments.site/reader033/viewer/2022051001/5681488b550346895db5a0ad/html5/thumbnails/2.jpg)
Outline
• Instruction scheduling overview
  – Scheduling atomicity
  – Speculative scheduling
  – Scheduling recovery
• Complexity-effective instruction scheduling techniques
  – CRIB reading
• Scalable load/store handling
  – NoSQ reading
• Building large instruction windows
  – Runahead, CFP, iCFP
• Control Independence
• 3D die stacking
![Page 3: Advanced Microarchitecture](https://reader033.vdocuments.site/reader033/viewer/2022051001/5681488b550346895db5a0ad/html5/thumbnails/3.jpg)
Readings
• Read on your own:
  – Shen & Lipasti, Chapter 10 on Advanced Register Data Flow (skim)
  – I. Kim and M. Lipasti, “Understanding Scheduling Replay Schemes,” in Proceedings of the 10th International Symposium on High-performance Computer Architecture (HPCA-10), February 2004.
  – Srikanth Srinivasan, Ravi Rajwar, Haitham Akkary, Amit Gandhi, and Mike Upton, “Continual Flow Pipelines,” in Proceedings of ASPLOS 2004, October 2004.
  – Ahmed S. Al-Zawawi, Vimal K. Reddy, Eric Rotenberg, Haitham H. Akkary, “Transparent Control Independence,” in Proceedings of ISCA-34, 2007.
• To be discussed in class:
  – T. Sha, M. Martin, A. Roth, “NoSQ: Store-Load Communication without a Store Queue,” in Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture, 2006.
  – Erika Gunadi, Mikko Lipasti, “CRIB: Combined Rename, Issue, and Bypass,” ISCA 2011.
  – Andrew Hilton, Amir Roth, “BOLT: Energy-efficient Out-of-Order Latency-Tolerant execution,” Proceedings of HPCA 2010.
  – Loh, G. H., Xie, Y., and Black, B. 2007. Processor Design in 3D Die-Stacking Technologies. IEEE Micro 27, 3 (May 2007), 31-48.
![Page 4: Advanced Microarchitecture](https://reader033.vdocuments.site/reader033/viewer/2022051001/5681488b550346895db5a0ad/html5/thumbnails/4.jpg)
Register Dataflow
![Page 5: Advanced Microarchitecture](https://reader033.vdocuments.site/reader033/viewer/2022051001/5681488b550346895db5a0ad/html5/thumbnails/5.jpg)
Instruction scheduling
• A process of mapping a series of instructions into execution resources
  – Decides when and where an instruction is executed
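[Figure: data dependence graph over instructions 1 to 6, drawn in three levels: 1; then 2, 3, 4; then 5, 6]
Mapped to two FUs (FU0, FU1):
  Cycle n: 1
  Cycle n+1: 2, 3
  Cycle n+2: 5, 4
  Cycle n+3: 6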
![Page 6: Advanced Microarchitecture](https://reader033.vdocuments.site/reader033/viewer/2022051001/5681488b550346895db5a0ad/html5/thumbnails/6.jpg)
Instruction scheduling
• A set of wakeup and select operations
  – Wakeup
    • Broadcasts the tags of parent instructions selected
    • Dependent instruction gets matching tags, determines if source operands are ready
    • Resolves true data dependences
  – Select
    • Picks instructions to issue among a pool of ready instructions
    • Resolves resource conflicts
      – Issue bandwidth
      – Limited number of functional units / memory ports
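The wakeup and select operations above can be sketched as a small simulation. This is only an illustrative model, not the hardware design from the lecture: the function name `schedule`, the oldest-first select policy, single-cycle latencies, and the dependence edges (5 depends on 2; 6 depends on 3 and 4) are assumptions chosen to resemble the six-instruction example in these slides.

```python
def schedule(deps, issue_width):
    """Simulate an atomic wakeup/select loop.

    deps maps each instruction to the set of parents it waits on.
    Returns a list with the instructions selected in each cycle.
    """
    # Pending source tags for every instruction still in the window.
    waiting = {inst: set(parents) for inst, parents in deps.items()}
    timeline = []
    while waiting:
        # Ready pool: instructions whose source operands have all been woken up.
        ready = sorted(inst for inst, pending in waiting.items() if not pending)
        if not ready:
            raise ValueError("dependence cycle: nothing can issue")
        # Select: resolve resource conflicts (assumed oldest-first policy).
        selected = ready[:issue_width]
        for inst in selected:
            del waiting[inst]
        # Wakeup: broadcast the selected tags to the remaining instructions.
        for pending in waiting.values():
            pending.difference_update(selected)
        timeline.append(selected)
    return timeline


# Assumed dependence edges for the six-instruction example.
example_deps = {1: set(), 2: {1}, 3: {1}, 4: {1}, 5: {2}, 6: {3, 4}}
example_timeline = schedule(example_deps, issue_width=2)
```

With two functional units, instruction 4 is ready in the second cycle but loses the select arbitration, so it issues one cycle later alongside 5.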
![Page 7: Advanced Microarchitecture](https://reader033.vdocuments.site/reader033/viewer/2022051001/5681488b550346895db5a0ad/html5/thumbnails/7.jpg)
Scheduling loop
• Basic wakeup and select operations
[Figure: scheduling loop. Wakeup logic: each window entry compares the broadcast tags (tag 1 … tag W) against tagL and tagR, and the OR of the matches sets readyL and readyR. A ready entry raises a request (request 0 … request n). Select logic returns grants (grant 0 … grant n); the selected instruction issues to an FU and its tag is broadcast, which closes the scheduling loop.]
![Page 8: Advanced Microarchitecture](https://reader033.vdocuments.site/reader033/viewer/2022051001/5681488b550346895db5a0ad/html5/thumbnails/8.jpg)
Wakeup and Select
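[Figure: the six-instruction dependence graph scheduled onto FU0 and FU1]
  Cycle n: ready to issue: 1; select 1, wakeup 2, 3, 4
  Cycle n+1: ready to issue: 2, 3, 4; select 2, 3, wakeup 5, 6
  Cycle n+2: ready to issue: 4, 5; select 4, 5, wakeup 6
  Cycle n+3: ready to issue: 6; select 6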
![Page 9: Advanced Microarchitecture](https://reader033.vdocuments.site/reader033/viewer/2022051001/5681488b550346895db5a0ad/html5/thumbnails/9.jpg)
Scheduling Atomicity
• Operations in the scheduling loop must occur within a single clock cycle
  – For back-to-back execution of dependent instructions
[Figure: dependence graph in which instruction 1 feeds 2 and 3, which feed 4, scheduled two ways]
Atomic scheduling:
  Cycle n: select 1, wakeup 2, 3
  Cycle n+1: select 2, 3, wakeup 4
  Cycle n+2: select 4
Non-atomic 2-cycle scheduling:
  Cycle n: select 1
  Cycle n+1: wakeup 2, 3
  Cycle n+2: select 2, 3
  Cycle n+3: wakeup 4
  Cycle n+4: select 4
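The cost of a non-atomic scheduling loop can be estimated with a short calculation. The sketch below is a simplification under stated assumptions: unlimited issue width, single-cycle execution, and a hypothetical helper `issue_cycles` whose `loop_cycles` parameter is the wakeup-plus-select latency. The dependence graph (1 feeds 2 and 3, which feed 4) follows the example on this slide.

```python
def issue_cycles(deps, loop_cycles):
    """Earliest select cycle of each instruction, relative to cycle n = 0.

    deps maps each instruction to the list of its parents; a child can be
    selected loop_cycles after the last of its parents was selected.
    """
    issue = {}

    def resolve(inst):
        if inst not in issue:
            issue[inst] = max(
                (resolve(parent) + loop_cycles for parent in deps[inst]),
                default=0,  # no parents: selectable immediately
            )
        return issue[inst]

    for inst in deps:
        resolve(inst)
    return issue


chain = {1: [], 2: [1], 3: [1], 4: [2, 3]}
atomic = issue_cycles(chain, loop_cycles=1)      # back-to-back issue
two_cycle = issue_cycles(chain, loop_cycles=2)   # one bubble per dependence level
```

Every dependence level pays the extra loop latency, which is why the loss grows with the length of dependent chains in the program.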
![Page 10: Advanced Microarchitecture](https://reader033.vdocuments.site/reader033/viewer/2022051001/5681488b550346895db5a0ad/html5/thumbnails/10.jpg)
Implication of scheduling atomicity
• Pipelining is a standard way to improve clock frequency
• Hard to pipeline instruction scheduling logic without losing ILP
  – ~10% IPC loss in 2-cycle scheduling
  – ~19% IPC loss in 3-cycle scheduling
• A major obstacle to building high-frequency microprocessors
![Page 11: Advanced Microarchitecture](https://reader033.vdocuments.site/reader033/viewer/2022051001/5681488b550346895db5a0ad/html5/thumbnails/11.jpg)
Scheduler Designs
• Data-Capture Scheduler
  – Keeps the most recent register value in reservation stations
  – Data forwarding and wakeup are combined
[Figure: register file, data-captured scheduling window (reservation station), and functional units, with paths labeled "Forwarding and wakeup" and "Register update"]
![Page 12: Advanced Microarchitecture](https://reader033.vdocuments.site/reader033/viewer/2022051001/5681488b550346895db5a0ad/html5/thumbnails/12.jpg)
Scheduler Designs
• Non-Data-Capture Scheduler
  – Keeps the most recent register value in the RF (physical registers)
  – Data forwarding and wakeup are decoupled
[Figure: register file, non-data-capture scheduling window, and functional units, with separate "Forwarding" and "wakeup" paths]
• Complexity benefits: simpler scheduler / data / wakeup path
![Page 13: Advanced Microarchitecture](https://reader033.vdocuments.site/reader033/viewer/2022051001/5681488b550346895db5a0ad/html5/thumbnails/13.jpg)
Mapping to pipeline stages
• AMD K7 (data-capture)
• Pentium 4 (non-data-capture)
[Figure: pipeline diagrams of the two designs, annotated with "Data", "Data / wakeup", and "wakeup" paths]
![Page 14: Advanced Microarchitecture](https://reader033.vdocuments.site/reader033/viewer/2022051001/5681488b550346895db5a0ad/html5/thumbnails/14.jpg)
Scheduling atomicity & non-data-capture scheduler
[Figure: atomic Sched/Exe pipeline: Fetch, Decode, Sched/Exe, Writeback, Commit]
[Figure: Fetch, Decode, Schedule, Dispatch, RF, Exe, Writeback, Commit, annotated with the wakeup/select loop]
• Multi-cycle scheduling loop
• Scheduling atomicity is not maintained
  – Separated by extra pipeline stages (Disp, RF)
  – Unable to issue dependent instructions consecutively
• Solution: speculative scheduling
![Page 15: Advanced Microarchitecture](https://reader033.vdocuments.site/reader033/viewer/2022051001/5681488b550346895db5a0ad/html5/thumbnails/15.jpg)
Speculative Scheduling
• Speculatively wake up dependent instructions even before the parent instruction starts execution
  – Keeps the scheduling loop within a single clock cycle
• But nobody knows what will happen in the future
• Source of uncertainty in instruction scheduling: loads
  – Cache hit / miss
  – Store-to-load aliasing eventually affects timing decisions
• Scheduler assumes that all types of instructions have pre-determined fixed latencies
  – Load instructions are assumed to have the common-case DL1 cache hit latency (hits account for over 90% of loads in general)
  – If incorrect, subsequent (dependent) instructions are replayed
![Page 16: Advanced Microarchitecture](https://reader033.vdocuments.site/reader033/viewer/2022051001/5681488b550346895db5a0ad/html5/thumbnails/16.jpg)
Speculative Scheduling
• Overview
[Figure: pipeline Fetch, Decode, Schedule, Dispatch, RF, Exe, Writeback/Recover, Commit. Spec wakeup/select runs at Schedule, and speculatively issued instructions fill the stages behind Exe. When a latency changes, an instruction reaches Exe with an invalid input value and the affected instructions are re-scheduled ("Re-schedule when latency mispredicted").]
• Unlike the original Tomasulo’s algorithm
  – Instructions are scheduled BEFORE actual execution occurs
  – Assumes instructions have pre-determined fixed latencies
    • ALU operations: fixed latency
    • Load operations: assumes $DL1 latency (common case)
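As a rough illustration of the fixed-latency assumption, the sketch below shows when a scheduler would wake up dependents and how a latency misprediction would be detected. The latency values and the names `ASSUMED_LATENCY`, `speculative_wakeup_cycle`, and `needs_replay` are hypothetical and are not taken from the lecture.

```python
# Assumed latencies in cycles; "load" stands for the common-case DL1 hit latency.
ASSUMED_LATENCY = {"alu": 1, "load": 3}


def speculative_wakeup_cycle(issue_cycle, kind):
    """Cycle in which dependents are woken up, decided at schedule time."""
    return issue_cycle + ASSUMED_LATENCY[kind]


def needs_replay(kind, actual_latency):
    """True if dependents issued under the assumed latency saw invalid data."""
    return actual_latency > ASSUMED_LATENCY[kind]


wake = speculative_wakeup_cycle(issue_cycle=10, kind="load")
hit_ok = needs_replay("load", actual_latency=3)      # DL1 hit: speculation held
miss_bad = needs_replay("load", actual_latency=20)   # DL1 miss: replay dependents
```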
![Page 17: Advanced Microarchitecture](https://reader033.vdocuments.site/reader033/viewer/2022051001/5681488b550346895db5a0ad/html5/thumbnails/17.jpg)
Scheduling replay
• Speculation needs verification / recovery
  – There’s no free lunch
• If the actual load latency is longer (i.e. cache miss) than what was speculated
  – Best solution (disregarding complexity): replay data-dependent instructions issued under load shadow
[Figure: pipeline Fetch, Decode, Rename, Queue, Sched, Disp, Disp, RF, RF, Exe, Retire / WB, Commit. The instruction flow runs forward; the verification flow runs back to the scheduler once a cache miss is detected.]
![Page 18: Advanced Microarchitecture](https://reader033.vdocuments.site/reader033/viewer/2022051001/5681488b550346895db5a0ad/html5/thumbnails/18.jpg)
Wavefront propagation
• Speculative execution wavefront
  – Speculative image of execution (from scheduler’s perspective)
• Both wavefronts propagate along dependence edges at the same rate (1 level / cycle)
  – The real wavefront runs behind the speculative wavefront
• The load resolution loop delay complicates the recovery process
  – A scheduling miss is reported a couple of clock cycles after issue
[Figure: pipeline Fetch, Decode, Rename, Queue, Sched, Disp, Disp, RF, RF, Exe, Retire / WB, Commit, showing the speculative execution wavefront (dependence linking) ahead of the real execution wavefront (data linking), with the instruction flow and the verification flow]
![Page 19: Advanced Microarchitecture](https://reader033.vdocuments.site/reader033/viewer/2022051001/5681488b550346895db5a0ad/html5/thumbnails/19.jpg)
Load resolution feedback delay in instruction scheduling
• Scheduling runs multiple clock cycles ahead of execution
  – But instructions can keep track of only one level of dependence at a time (using source operand identifiers)
[Figure: loop through the stages Broadcast / wakeup, Select, Dispatch / Payload, RF, Misc., and Execution, holding dependence levels N through N-4. The time delay between scheduling and feedback means the instructions in this path must be recovered.]
![Page 20: Advanced Microarchitecture](https://reader033.vdocuments.site/reader033/viewer/2022051001/5681488b550346895db5a0ad/html5/thumbnails/20.jpg)
Issues in scheduling replay
• Cannot stop speculative wavefront propagation
  – Both wavefronts propagate at the same rate
  – Dependent instructions are unnecessarily issued under load misses
[Figure: Sched / Issue, Exe, and checker stages over cycles n to n+3, with the cache miss signal fed back from the checker]
![Page 21: Advanced Microarchitecture](https://reader033.vdocuments.site/reader033/viewer/2022051001/5681488b550346895db5a0ad/html5/thumbnails/21.jpg)
Requirements of scheduling replay
• Conditions for ideal scheduling replay
  – All mis-scheduled dependent instructions are invalidated instantly
  – Independent instructions are unaffected
• Multiple levels of dependence tracking are needed
  – e.g. Am I dependent on the current cache miss?
  – A longer load resolution loop delay requires tracking more levels
• Propagation of recovery status should be faster than speculative wavefront propagation
• Recovery should be performed on the transitive closure of dependent instructions
[Figure: dependence graph rooted at a load miss]
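The "transitive closure of dependent instructions" requirement can be made concrete with a breadth-first walk over the dependence graph. This is a software sketch only; real hardware cannot afford such a walk, which is why the replay schemes in these notes exist. The function name `replay_set` and the instruction names in the example are illustrative assumptions.

```python
from collections import deque


def replay_set(consumers, missed_load):
    """Return every instruction that depends, directly or transitively,
    on the load that missed.

    consumers maps an instruction to the instructions that read its result.
    """
    to_replay = set()
    work = deque([missed_load])
    while work:
        inst = work.popleft()
        for child in consumers.get(inst, ()):
            if child not in to_replay:
                to_replay.add(child)
                work.append(child)
    return to_replay


# Hypothetical window: ADD reads the load, OR and AND read ADD, XOR/SLL are independent.
window = {"LD": ["ADD"], "ADD": ["OR", "AND"], "XOR": ["SLL"]}
dependents = replay_set(window, "LD")
```

The independent XOR and SLL are left out of the set, which is the "independent instructions are unaffected" condition.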
![Page 22: Advanced Microarchitecture](https://reader033.vdocuments.site/reader033/viewer/2022051001/5681488b550346895db5a0ad/html5/thumbnails/22.jpg)
Scheduling replay schemes
• Alpha 21264: Non-selective replay
  – Replays all dependent and independent instructions issued under load shadow
  – Analogous to squashing recovery in branch misprediction
  – Simple but high performance penalty
    • Independent instructions are unnecessarily replayed
[Figure: LD, ADD, OR, AND, BR moving through Sched, Disp, RF, Exe, Retire. On a cache miss, ALL instructions in the load shadow are invalidated and replayed; they issue again once the miss is resolved.]
![Page 23: Advanced Microarchitecture](https://reader033.vdocuments.site/reader033/viewer/2022051001/5681488b550346895db5a0ad/html5/thumbnails/23.jpg)
Position-based selective replay
• Ideal selective recovery
  – Replay dependent instructions only
• Dependence tracking is managed in a matrix form
  – Column: load issue slot, row: pipeline stages
[Figure: integer pipeline and mem pipeline (width 2) across the stages Sched, Disp, RF, Exe, Retire, shown at cycle n and cycle n+1. Two LDs occupy the mem pipeline; ADD, OR, SLL, AND, and XOR each carry a dependence matrix of 4 rows by 2 columns. On the tag / dep info broadcast, a dependent instruction merges the matrices of its parents (bit merge & shift). When a cache miss is detected, the kill bus broadcast invalidates every instruction whose bits match in the last row; four instructions are marked killed. The wakeup entry (tagL, tagR, ReadyL, ReadyR comparators) is extended with the tag bus, the dependence info bus, and the kill bus.]
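A minimal sketch of the matrix bookkeeping, under assumptions that are not taken from the slide: 4 rows, 2 load slots, row 0 as the stage where a load issues, and the last row as the stage where a miss is detected. The helper names (`load_matrix`, `merge`, `shift`, `killed`) are hypothetical.

```python
STAGES = 4  # assumed number of rows (pipeline stages in the load shadow)
SLOTS = 2   # assumed number of columns (load issue slots)


def empty():
    return [[0] * SLOTS for _ in range(STAGES)]


def load_matrix(slot):
    """Matrix of a load that has just issued in the given slot."""
    m = empty()
    m[0][slot] = 1
    return m


def merge(*parents):
    """A dependent inherits the bitwise OR of its parents' matrices."""
    m = empty()
    for p in parents:
        for r in range(STAGES):
            for c in range(SLOTS):
                m[r][c] |= p[r][c]
    return m


def shift(m):
    """Advance one cycle: every bit moves one row toward the last row."""
    return [[0] * SLOTS] + [row[:] for row in m[:-1]]


def killed(m, slot):
    """Kill bus check: the load in this slot missed, and the miss is
    detected when its bit sits in the last row."""
    return m[-1][slot] == 1


# A load issues in slot 1; ADD is woken one cycle later and inherits its matrix.
ld = shift(load_matrix(1))
add = merge(ld)
ld, add = shift(ld), shift(add)
and_ = merge(add, load_matrix(0))   # AND depends on ADD and on a newer load in slot 0
ld, add, and_ = shift(ld), shift(add), shift(and_)
xor = empty()                       # independent instruction
```

Because each bit records the position of the load rather than of the dependent, a single compare against the last row identifies every transitive dependent at once.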
![Page 24: Advanced Microarchitecture](https://reader033.vdocuments.site/reader033/viewer/2022051001/5681488b550346895db5a0ad/html5/thumbnails/24.jpg)
Low-complexity scheduling techniques
• FIFO (Palacharla, Jouppi, Smith, 1996)
  – Replaces conventional scheduling logic with multiple FIFOs
    • Steering logic puts instructions into different FIFOs considering dependences
    • A FIFO contains a chain of dependent instructions
    • Only the head instructions are considered for issue
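The steering step can be sketched as follows. This is a simplified heuristic in the spirit of dependence-based steering, not the exact rules of the paper: an instruction follows a producer that sits at the tail of some FIFO, otherwise it starts a new chain in an empty FIFO. Issue from the FIFO heads is not modeled, and the function name `steer` and the example dependence edges are assumptions.

```python
from collections import deque


def steer(instructions, num_fifos):
    """Place instructions, given in program order, into FIFOs.

    instructions is a list of (name, sources) pairs, where sources is the
    set of producer names the instruction depends on.
    """
    fifos = [deque() for _ in range(num_fifos)]
    for name, sources in instructions:
        # Prefer a FIFO whose tail is one of our producers (extends a chain).
        target = next((f for f in fifos if f and f[-1] in sources), None)
        if target is None:
            # Otherwise start a new chain in an empty FIFO.
            target = next((f for f in fifos if not f), None)
        if target is None:
            raise RuntimeError("no FIFO available: steering would stall")
        target.append(name)
    return [list(f) for f in fifos]


program = [(1, set()), (2, {1}), (3, {1}), (4, {1}), (5, {2}), (6, {3, 4})]
placement = steer(program, num_fifos=4)
```

Each FIFO ends up holding a dependent chain, so only the head of each FIFO can possibly be ready, which is what lets the select logic ignore the rest of the window.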
![Page 25: Advanced Microarchitecture](https://reader033.vdocuments.site/reader033/viewer/2022051001/5681488b550346895db5a0ad/html5/thumbnails/25.jpg)
FIFO (cont’d)
• Scheduling example
![Page 26: Advanced Microarchitecture](https://reader033.vdocuments.site/reader033/viewer/2022051001/5681488b550346895db5a0ad/html5/thumbnails/26.jpg)
FIFO (cont’d)
• Performance
  • Comparable performance to the conventional scheduling
  • Reduced scheduling logic complexity
  • Many related papers on clustered microarchitecture
![Page 27: Advanced Microarchitecture](https://reader033.vdocuments.site/reader033/viewer/2022051001/5681488b550346895db5a0ad/html5/thumbnails/27.jpg)
CRIB Reading
– Erika Gunadi, Mikko Lipasti: CRIB: Combined Rename, Issue, and Bypass, ISCA 2011.
– Goals
  – Match OOO performance per cycle
  – Match OOO frequency
  – Match OOO area
  – Reduce power significantly
  – Eliminate pipelines, latches, rename structures, issue logic
![Page 28: Advanced Microarchitecture](https://reader033.vdocuments.site/reader033/viewer/2022051001/5681488b550346895db5a0ad/html5/thumbnails/28.jpg)
CRIB Data Movement
[Figure: data movement in two designs. Physical Register File-style: Front-End, ROB, RS, PRF, Bypass, ALU. In-place execution: Front-End, CRIB, ARF.]
![Page 29: Advanced Microarchitecture](https://reader033.vdocuments.site/reader033/viewer/2022051001/5681488b550346895db5a0ad/html5/thumbnails/29.jpg)
In-place Execution
• First proposed by Ultrascalar [1999]
  – Place instructions in execution stations
  – Route operands to instructions
  – Goal: massively wide issue
  – Power constraints not even on the horizon
• CRIB: in-place execution as enabler
  – Eliminate pipelined execution lanes, multiported RF, renaming, wakeup & select, clock loads
  – Enable efficient speculation recovery
  – Enable variable execution latency tolerance
![Page 30: Advanced Microarchitecture](https://reader033.vdocuments.site/reader033/viewer/2022051001/5681488b550346895db5a0ad/html5/thumbnails/30.jpg)
CRIB Concept
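• Data values propagate combinationally (no latches)
  – Completion bit propagates synchronously (latched)
• Instructions stay until all are finished
• When all are finished, latch data into ARF latches
[Figure: one CRIB entry between the previous entry and the next entry. Register columns R0 to R3 each carry a completion bit (C); Source 1, Source 2, and Destination selects connect the columns to an ALU with its own completion bit; WE is the write enable.]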
![Page 31: Advanced Microarchitecture](https://reader033.vdocuments.site/reader033/viewer/2022051001/5681488b550346895db5a0ad/html5/thumbnails/31.jpg)
Renaming in CRIB
• All the connections form in parallel after dispatch
• Dependences are resolved by positional renaming
• Each instruction issues subject to the readiness of its operands
[Figure: four CRIB entries over register columns R0 to R3 with completion bits (C) and Source 1 / Source 2 / Destination selects. The entries hold:
  Add R2, R0, R0
  Sub R3, R0, R2
  Add R2, R2, R3
  Add R2, R0, R3
and are annotated with execution cycles (Cyc 1, Cyc 1, Cyc 2, Cyc 3).]
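Positional renaming can be pictured as each entry reading a register column from the nearest older entry that writes it, or from the ARF if there is none. The sketch below computes that mapping in software for the four instructions on this slide; in CRIB the connections are formed by the wiring of the register columns, so the function `positional_rename` is only an illustrative, assumed equivalent.

```python
def positional_rename(instructions):
    """For each entry, report where its two source operands come from:
    the index of the nearest older entry writing that register, or "ARF".

    instructions is a list of (opcode, dest, src1, src2) in program order.
    """
    last_writer = {}  # register name -> index of the youngest older writer
    links = []
    for idx, (_opcode, dest, src1, src2) in enumerate(instructions):
        links.append(tuple(last_writer.get(src, "ARF") for src in (src1, src2)))
        last_writer[dest] = idx  # later entries see this entry's value
    return links


crib_entries = [
    ("Add", "R2", "R0", "R0"),
    ("Sub", "R3", "R0", "R2"),
    ("Add", "R2", "R2", "R3"),
    ("Add", "R2", "R0", "R3"),
]
links = positional_rename(crib_entries)
```

No rename table or free list appears anywhere: the position of an entry alone determines which value of R2 it sees.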
![Page 32: Advanced Microarchitecture](https://reader033.vdocuments.site/reader033/viewer/2022051001/5681488b550346895db5a0ad/html5/thumbnails/32.jpg)
Scaling Up CRIB
• Multiple CRIB partitions maintained as circular queue
• Only head ARF has committed state
  – Other latches are left transparent
[Figure: front end feeding four CRIB partitions, each with its own ARF, LQ bank, and SQ; Mult/Div unit and cache port]
![Page 33: Advanced Microarchitecture](https://reader033.vdocuments.site/reader033/viewer/2022051001/5681488b550346895db5a0ad/html5/thumbnails/33.jpg)
Data Propagation across partitions
• Transparent latches for data
• Regular latches for complete bit
• Data values take one additional cycle to travel to the next partition
[Figure: CRIB 0 and CRIB 1, each holding Add R2, R2, R1, with register columns R0 to R3 and their completion bits (C), shown over cycles 0 to 3]
![Page 34: Advanced Microarchitecture](https://reader033.vdocuments.site/reader033/viewer/2022051001/5681488b550346895db5a0ad/html5/thumbnails/34.jpg)
CRIB Pipeline Diagram
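• Fewer pipe stages
  – Remove rename stage from front-end
  – Remove issue and RF from middle
• Combine dependence and data linking
[Figure: pipeline stage comparison. The OoO pipeline has separate dependence linking and data linking; the CRIB pipeline has combined dependence / data linking. Stage labels in the figure: Rnm, Disp, Issue, RF, Int, A-Gen, Load, Load WB, WB, Cmt.]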
![Page 35: Advanced Microarchitecture](https://reader033.vdocuments.site/reader033/viewer/2022051001/5681488b550346895db5a0ad/html5/thumbnails/35.jpg)
Load-Store Ordering
• Loads/stores are ordered aggressively
• Recovery: replay in place
• No prediction needed; recovery is cheap
[Figure: CRIB entries ADD R2, R3, R1; ADD R2, R1, R1; LD R3, R1, R2; ST R0, R1, 1 over register columns R0 to R3, with the LQ and SQ holding data and address]
![Page 36: Advanced Microarchitecture](https://reader033.vdocuments.site/reader033/viewer/2022051001/5681488b550346895db5a0ad/html5/thumbnails/36.jpg)
Branch Misprediction
• Mispredicted branch drives a global signal up the CRIB
• Forces younger instructions to transform into NOPs
• Simpler than checkpointing or ROB unrolling
[Figure: CRIB entries over register columns R0 to R3. A branch mispredict raises the flush signal; Instruction 0 is kept, while Instruction 2 and Instruction 3 become NOPs.]
![Page 37: Advanced Microarchitecture](https://reader033.vdocuments.site/reader033/viewer/2022051001/5681488b550346895db5a0ad/html5/thumbnails/37.jpg)
CRIB Findings
• CRIB proposal appears promising
  – Competitive IPC and area
  – Dramatic power reductions
• Over baseline1 (“Bobcat”)
  – 45% less energy per instruction
  – 20-30% better IPC
• Over baseline2 (“Nehalem”)
  – 75% less energy per instruction
  – INT IPC slightly better, FP IPC slightly worse
![Page 38: Advanced Microarchitecture](https://reader033.vdocuments.site/reader033/viewer/2022051001/5681488b550346895db5a0ad/html5/thumbnails/38.jpg)
CRIB Summary
• Instructions are inserted from the front end
• Instructions inside a CRIB execute subject to readiness of operands
• Data propagates without latches
• The complete bit ensures that data propagates synchronously
• A CRIB retires when all of its instructions are done executing
• When a CRIB retires, data are latched in the ARF
![Page 39: Advanced Microarchitecture](https://reader033.vdocuments.site/reader033/viewer/2022051001/5681488b550346895db5a0ad/html5/thumbnails/39.jpg)
Memory Dataflow
![Page 40: Advanced Microarchitecture](https://reader033.vdocuments.site/reader033/viewer/2022051001/5681488b550346895db5a0ad/html5/thumbnails/40.jpg)
Scalable Load/Store Queues
• Load queue/store queue
  – Large instruction window: many loads and stores have to be buffered (25%/15% of mix)
  – Expensive searches
    • Positional-associative searches in SQ
    • Associative lookups in LQ
      – Coherence, speculative load scheduling
  – Power/area/delay are prohibitive
![Page 41: Advanced Microarchitecture](https://reader033.vdocuments.site/reader033/viewer/2022051001/5681488b550346895db5a0ad/html5/thumbnails/41.jpg)
Store Queue/Load Queue Scaling
• Multilevel queues
• Bloom filters (quick check for independence)
• Eliminate associative load queue via replay [Cain 2004]
  – Issue loads again at commit, in order
  – Check to see if same value is returned
  – Filter load checks for efficiency:
    • Most loads don’t issue out of order (no speculation)
    • Most loads don’t coincide with coherence traffic
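The replay-at-commit check with its two filters can be sketched in a few lines. This is a hedged illustration of the idea on the slide, not the mechanism as published: the function `commit_load`, the dictionary-based load record, and the flat `memory` dictionary are assumptions.

```python
def commit_load(load, memory, issued_out_of_order, saw_coherence_event):
    """Decide at commit whether a load's speculative value can be trusted.

    load holds the address it read and the value it obtained at execute.
    """
    # Filter: a load that issued in order and saw no coherence traffic
    # cannot have read a stale value, so the replay is skipped.
    if not (issued_out_of_order or saw_coherence_event):
        return "commit"
    # Replay: read the location again, in order, and compare values.
    if memory[load["addr"]] == load["value"]:
        return "commit"
    return "squash"


memory = {0x40: 7}
safe = commit_load({"addr": 0x40, "value": 7}, memory, True, False)
stale = commit_load({"addr": 0x40, "value": 5}, memory, True, False)
```

Comparing values instead of addresses is what removes the associative search: a replay that returns the same value is harmless no matter which stores intervened.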
![Page 42: Advanced Microarchitecture](https://reader033.vdocuments.site/reader033/viewer/2022051001/5681488b550346895db5a0ad/html5/thumbnails/42.jpg)
SVW and NoSQ
• Store Vulnerability Window (SVW)
  – Assign sequence numbers to stores
  – Track writes to cache with sequence numbers
  – Efficiently filter out safe loads/stores by only checking against writes in vulnerability window
• NoSQ
  – Rely on load/store alias prediction to satisfy dependent pairs
  – Use SVW technique to check
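The SVW filter can be sketched as a small table of store sequence numbers. The class below is a simplified model under stated assumptions: a direct-mapped table indexed by `addr % entries`, one sequence number per committed store, and hypothetical method names (`commit_store`, `load_snapshot`, `must_reexecute`).

```python
class SVWFilter:
    """Simplified Store Vulnerability Window filter (illustrative only)."""

    def __init__(self, entries=64):
        self.entries = entries
        self.table = [0] * entries  # sequence number of the last store per index
        self.ssn_commit = 0         # sequence number of the youngest committed store

    def commit_store(self, addr):
        """A store writes the cache: assign it the next sequence number."""
        self.ssn_commit += 1
        self.table[addr % self.entries] = self.ssn_commit

    def load_snapshot(self):
        """Taken when a load executes: stores up to this number are visible to it."""
        return self.ssn_commit

    def must_reexecute(self, addr, snapshot):
        """At load commit: re-execute only if a store newer than the snapshot
        wrote to the same table index (the load was vulnerable to it)."""
        return self.table[addr % self.entries] > snapshot


svw = SVWFilter()
svw.commit_store(8)
snapshot = svw.load_snapshot()   # a load executes here
svw.commit_store(16)             # an older store commits after the load executed
vulnerable = svw.must_reexecute(16, snapshot)
safe_load = svw.must_reexecute(8, snapshot)
```

Aliasing in the table can only cause extra re-executions, never missed ones, so the filter stays conservative.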
![Page 43: Advanced Microarchitecture](https://reader033.vdocuments.site/reader033/viewer/2022051001/5681488b550346895db5a0ad/html5/thumbnails/43.jpg)
Store/Load Optimizations
• Weakness: predictor still fails
  – Machine should fail gracefully, not fall off a cliff
  – Glass jaw
• Several other concurrent proposals
  – DMDC, Fire-and-forget, …
![Page 44: Advanced Microarchitecture](https://reader033.vdocuments.site/reader033/viewer/2022051001/5681488b550346895db5a0ad/html5/thumbnails/44.jpg)
Key Challenge: MLP
• Tolerate/overlap memory latency
  – Once first miss is encountered, find another one
• Naïve solution
  – Implement a very large ROB, IQ, LSQ
  – Power/area/delay make this infeasible
• Build virtual instruction window
![Page 45: Advanced Microarchitecture](https://reader033.vdocuments.site/reader033/viewer/2022051001/5681488b550346895db5a0ad/html5/thumbnails/45.jpg)
Runahead
• Use poison bits to eliminate miss-dependent load program slice
  – Forward load slice processing is a very old idea
    • Massive Memory Machine [Garcia-Molina et al. 84]
    • Datascalar [Burger, Kaxiras, Goodman 97]
  – Runahead proposed by [Dundas, Mudge 97]
• Checkpoint state, keep running
• When miss completes, return to checkpoint
  – May need runahead cache for store/load communication
![Page 46: Advanced Microarchitecture](https://reader033.vdocuments.site/reader033/viewer/2022051001/5681488b550346895db5a0ad/html5/thumbnails/46.jpg)
Waiting Instruction Buffer
[Lebeck et al. ISCA 2002]
• Capture forward load slice in separate buffer
  – Propagate poison bits to identify slice
• Relieve pressure on issue queue
• Reinsert instructions when load completes
• Very similar to Intel Pentium 4 replay mechanism
  – But not publicly known at the time
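Poison-bit propagation can be sketched as a single in-order pass over the instructions that follow the missing load. The function name `forward_slice`, the tuple layout, and the register names are assumptions made for illustration; memory dependences are ignored.

```python
def forward_slice(program, miss_dest):
    """Identify the forward slice of a load miss with poison bits.

    program lists (name, dest, sources) for the instructions after the load,
    in program order; dest may be None for instructions that write no register.
    """
    poisoned = {miss_dest}  # registers whose value is not yet available
    slice_buffer = []
    for name, dest, sources in program:
        if poisoned.intersection(sources):
            # Reads a poisoned register: defer it and poison its result.
            slice_buffer.append(name)
            if dest:
                poisoned.add(dest)
        elif dest:
            # Produces a valid value: any old poison on dest is overwritten.
            poisoned.discard(dest)
    return slice_buffer


# Hypothetical code following a load into r1 that missed.
after_miss = [
    ("ADD", "r2", ["r1", "r3"]),
    ("SUB", "r4", ["r5", "r6"]),
    ("MUL", "r1", ["r4", "r4"]),
    ("OR", "r7", ["r1", "r2"]),
]
deferred = forward_slice(after_miss, miss_dest="r1")
```

Only the deferred instructions need buffering; the rest can execute and free their issue queue entries while the miss is outstanding.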
![Page 47: Advanced Microarchitecture](https://reader033.vdocuments.site/reader033/viewer/2022051001/5681488b550346895db5a0ad/html5/thumbnails/47.jpg)
Continual Flow Pipelines [Srinivasan et al. 2004]
• Slice buffer extension of WIB
  – Store operands in slice buffer as well to free up buffer entries on OOO window
  – Relieve pressure on rename/physical registers
• Applicable to
  – data-capture machines (Intel P6) or
  – physical register file machines (Pentium 4)
• Also extended to in-order machines (iCFP)
• Challenge: how to buffer loads/stores
• Reading: Hilton & Roth, BOLT, HPCA 2010
![Page 48: Advanced Microarchitecture](https://reader033.vdocuments.site/reader033/viewer/2022051001/5681488b550346895db5a0ad/html5/thumbnails/48.jpg)
Instruction Flow
![Page 49: Advanced Microarchitecture](https://reader033.vdocuments.site/reader033/viewer/2022051001/5681488b550346895db5a0ad/html5/thumbnails/49.jpg)
Transparent Control Independence
• Control flow graph convergence
  – Execution reconverges after branches
  – If-then-else constructs, etc.
• Can we fetch/execute instructions beyond convergence point?
• How do we resolve ambiguous register and memory dependences
  – Writes may or may not occur in branch shadow
• TCI employs CFP-like slice buffer to solve these problems
  – Instructions with ambiguous dependences buffered
  – Reinsert them the same way forward load miss slice is reinserted
• “Best” CI proposal to date, but still very complex and expensive, with moderate payback
![Page 50: Advanced Microarchitecture](https://reader033.vdocuments.site/reader033/viewer/2022051001/5681488b550346895db5a0ad/html5/thumbnails/50.jpg)
Summary of Advanced Microarchitecture
• Instruction scheduling overview
  – Scheduling atomicity
  – Speculative scheduling
  – Scheduling recovery
• Complexity-effective instruction scheduling techniques
  – CRIB reading
• Scalable load/store handling
  – NoSQ reading
• Building large instruction windows
  – Runahead, CFP, iCFP
• Control Independence