A. Moshovos © ECE1773 - Fall ‘07 ECE Toronto
Out-of-Order Execution Structures
MIPS R10000-Like Design
• Based on:
  – Complexity-Effective Superscalar Processors
  – S. Palacharla, N. Jouppi and J. E. Smith, ISCA 97
Fetch Phase
• Fetch:
  – Read instructions from I-Cache
  – Predict branches
  – Pass on to Decode phase
Decode Phase
• Decode:
  – Parse instruction
  – Shuffle opcode parts to appropriate ports for rename
Renaming Phase
• Rename:
  – Map architectural registers to physical registers
  – Eliminate false dependences
  – Pass renamed instructions to the scheduler
    • Called Dispatch
Scheduling Phase
• Wakeup:
  – Instructions check whether they become ready
  – From Writeback: physical register names
• Select:
  – Amongst the ready, select those to execute
  – Structural hazards
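The wakeup and select steps can be sketched in a few lines of Python. This is a minimal illustration, not the circuit from the slides: the `Entry` class, the `wakeup` and `select` names, and the position-based priority in `select` are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    """One issue-window entry (hypothetical layout for illustration)."""
    name: str
    src_tags: tuple                          # physical register names of the sources
    ready: set = field(default_factory=set)  # source tags already broadcast

    def is_ready(self):
        return all(t in self.ready for t in self.src_tags)


def wakeup(window, broadcast_tags):
    """Compare every broadcast tag against every waiting source tag."""
    for e in window:
        for tag in broadcast_tags:
            if tag in e.src_tags:
                e.ready.add(tag)


def select(window, free_units):
    """Pick up to `free_units` ready entries; window position sets priority."""
    chosen = [e for e in window if e.is_ready()][:free_units]
    for e in chosen:
        window.remove(e)
    return chosen


window = [Entry("add", ("p5", "p7")), Entry("sub", ("p7",)), Entry("mul", ("p9",))]
wakeup(window, ["p7"])                 # writeback broadcasts physical register p7
issued = select(window, free_units=1)  # only "sub" has all of its sources ready
```

The `free_units` limit stands in for the structural hazards mentioned above: a ready instruction still waits if no functional unit is available.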
Register File Read Phase
• Read source operands
Bypass and Execute Phase
Data Cache Access Phase
Writeback Phase
• Write result to register file
• Broadcast tag in order to wake up waiting instructions
  – Notice that the tag broadcast should happen TWO cycles in advance of the result production
Reservation Station Model
• Used by Pentium Pro, PowerPC 604
• Re-order buffer holds values
• Renaming points to re-order buffer entries
  – Tomasulo-like
Physical Register File vs. Reservation Station
• Physical Register File:
  – Values reside in the register file
  – At writeback, instructions broadcast the register name
• Reservation Stations:
  – Values reside:
  – In the register file upon commit
    • Non-speculative
  – In reservation stations prior to commit
    • Speculative
Quantifying Complexity
• Critical path delay as a function of architectural parameters
  – Instruction window size (WinSize)
  – Issue width (IW)
• Full-custom implementations
  – Study the critical path
  – Delay model
  – Extrapolate how it will scale with “future” technologies
Renaming
• Inputs:
  – IW instructions
  – Up to 2 x input register names
  – Up to 1 x output register name
• Outputs:
  – 2 x input physical registers
  – 1 x new output physical register
  – 1 x previous physical register name for checkpointing
  – Updated rename table
• Superscalar Issue complicates things a bit
Renaming One Instruction
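[Figure: the RAT (rows labeled p0 ... p31) is read through three read ports, for s1, s2 and the old mapping of d, and written through one write port with the new register for d taken from the free list. The old mapping is kept for misspeculation recovery.]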
Renaming Two Instructions
[Figure: two instructions, each with s1, s2, d and new d, look up the RAT; cross-bundle dependency check logic produces ps1, ps2, old d and new d for each instruction.]
Renaming More Instructions
• Dependency checking logic for instruction i must match against all preceding destinations
• If there are multiple matches, it must enforce priority:
  – Pick the one closest to this instruction
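A small Python sketch of renaming a whole bundle, including the cross-bundle dependency check, may make the priority rule concrete. The function name `rename_group`, the dict-based RAT and the list-based free list are assumptions for illustration; real hardware does the table lookup and the comparisons in parallel rather than in a loop.

```python
def rename_group(rat, free_list, group):
    """Rename a bundle of (s1, s2, d) instructions in program order.

    rat       : dict, architectural register -> physical register
    free_list : list of unused physical registers
    group     : list of (s1, s2, d) architectural register triples
    """
    # Snapshot of the table as it was before this bundle, mirroring the
    # parallel table lookup done for all instructions at once.
    old_rat = dict(rat)
    renamed = []
    for i, (s1, s2, d) in enumerate(group):
        def lookup(reg):
            # Dependency check: scan earlier instructions, nearest first,
            # so the closest preceding destination takes priority.
            for j in range(i - 1, -1, -1):
                if group[j][2] == reg:
                    return renamed[j]["new_d"]
            return old_rat[reg]

        ps1, ps2 = lookup(s1), lookup(s2)
        old_d = lookup(d)              # previous mapping, kept for recovery
        new_d = free_list.pop(0)       # new register from the free list
        renamed.append({"ps1": ps1, "ps2": ps2, "old_d": old_d, "new_d": new_d})
        rat[d] = new_d                 # the last writer in the bundle wins
    return renamed


rat = {"r1": "p1", "r2": "p2", "r3": "p3", "r4": "p4"}
free = ["p10", "p11", "p12"]
group = [("r1", "r2", "r4"), ("r4", "r3", "r4"), ("r4", "r1", "r2")]
out = rename_group(rat, free, group)
```

In this example the third instruction reads r4, which is written by both earlier instructions; the nearest-first scan picks the second instruction's register rather than the first's.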
RAT: SRAM Implementation
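[Figure: SRAM RAT. A decoder indexed by the architectural register selects a row of SRAM cells; bitlines and sense amps deliver the physical register. The array has #ARCH REGS rows of lg(#PHYS REGS) bits.]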
SRAM RAT cell
RAT: CAM Implementation
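[Figure: CAM RAT. The architectural register is matched against the CAM cells and an encoder produces the physical register. The array has #PHYS REGS rows of lg(#ARCH REGS) bits, each with an active bit.]
• One CAM per physical register
• Active bit indicates the current map
• New version by setting active bit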
CAM Cell
[Figure: CAM cell with Match line, Wordline, Bitline and Bitline_B.]
SRAM vs. CAM
• SRAM:
  – Arch reg rows
  – lg(phys reg) cols
  – SRAM read/write
• CAM:
  – Phys reg rows
  – lg(arch reg) cols
  – CAM match
  – Update:
    • Reset previous valid bit
    • Set current valid bit
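The two organizations can be contrasted with a short behavioral sketch. The class names `SramRat` and `CamRat`, the identity initial mapping and the method names are assumptions for illustration only; the sketch models what each table stores and how it is updated, not the circuit.

```python
class SramRat:
    """One row per architectural register; each row stores a physical register."""

    def __init__(self, num_arch):
        self.rows = list(range(num_arch))     # identity mapping to start

    def read(self, arch):
        return self.rows[arch]                # decode the index, read the row

    def write(self, arch, phys):
        self.rows[arch] = phys                # plain SRAM write


class CamRat:
    """One row per physical register; each row stores an architectural
    register name plus a valid (active) bit."""

    def __init__(self, num_arch, num_phys):
        self.arch = [None] * num_phys
        self.valid = [False] * num_phys
        for r in range(num_arch):             # identity mapping to start
            self.arch[r], self.valid[r] = r, True

    def read(self, arch):
        # Associative match over all rows; only the valid row may hit.
        for phys, (a, v) in enumerate(zip(self.arch, self.valid)):
            if v and a == arch:
                return phys
        raise KeyError(arch)

    def write(self, arch, phys):
        old = self.read(arch)
        self.valid[old] = False               # reset previous valid bit
        self.arch[phys] = arch
        self.valid[phys] = True               # set current valid bit


sram, cam = SramRat(4), CamRat(4, 8)
for table in (sram, cam):
    table.write(2, 6)                         # remap architectural r2 to physical p6
```

Note that the CAM version leaves the old row's contents in place and only clears its valid bit, which is why an update is described as resetting one bit and setting another.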
Scheduler: Part #1 - Wakeup
Scheduler: Part #2 - Select
• For a single FU: tree of arbiters
  – REQ signals and GRANT signals
  – Anyreq raised if any req is active; grant issued if arbiter enabled
  – Root enabled if FU available
• Location-based select policy
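The arbiter tree can be modeled recursively. The following is a behavioral sketch under stated assumptions: the name `select_one`, the `fanin` parameter (the slide does not give the arbiter fan-in) and the leftmost-wins priority are all choices made for the example.

```python
def select_one(req, fu_available, fanin=4):
    """Return the window index granted by a tree of arbiters, or None.

    req          : list of booleans, one request line per window entry
    fu_available : enables the root arbiter
    fanin        : inputs per arbiter cell (an assumption of this sketch)
    """
    def arbiter(lo, hi, enable):
        # anyreq travels up the tree; a grant travels down only if enabled.
        anyreq = any(req[lo:hi])
        if not (enable and anyreq):
            return anyreq, None
        if hi - lo == 1:
            return True, lo                       # leaf: grant this entry
        step = max(1, -(-(hi - lo) // fanin))     # ceiling division
        for start in range(lo, hi, step):
            sub_any, grant = arbiter(start, min(start + step, hi), True)
            if sub_any:
                return True, grant                # leftmost requester wins
        return True, None

    return arbiter(0, len(req), fu_available)[1]


requests = [False, False, True, False, True, False, False, False]
granted = select_one(requests, fu_available=True)
```

Because priority depends only on where an entry sits in the window, this is a location-based policy: entry 2 is granted ahead of entry 4 regardless of instruction age.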
Select for More than One FU
• Handling multiple FUs of the same type:
  – Stack select logic blocks in series (hierarchy)
  – Mask the request granted to the previous unit
• NOT feasible for more than 2 FUs
• Alternative:
  – Statically partition the issue window among FUs
  – MIPS R10000, HP PA 8000
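Both options can be compared with a behavioral sketch. The function names `select_stacked` and `select_partitioned`, the leftmost-first policy inside each block, and the equal-sized partitions are assumptions for illustration.

```python
def select_stacked(req, num_fus):
    """Grant up to `num_fus` requests using select blocks in series.

    Each block sees the request vector with earlier grants masked out.
    """
    remaining = list(req)
    grants = []
    for _ in range(num_fus):
        idx = next((i for i, r in enumerate(remaining) if r), None)
        if idx is None:
            break
        grants.append(idx)
        remaining[idx] = False     # mask the request granted to the previous unit
    return grants


def select_partitioned(req, num_fus):
    """Alternative: each FU selects only from its own slice of the window."""
    size = len(req) // num_fus
    grants = []
    for fu in range(num_fus):
        lo = fu * size
        hi = len(req) if fu == num_fus - 1 else lo + size
        grants.append(next((i for i in range(lo, hi) if req[i]), None))
    return grants


requests = [False, True, True, False, False, False, False, False]
stacked = select_stacked(requests, num_fus=2)
partitioned = select_partitioned(requests, num_fus=2)
```

The example shows the trade-off: the stacked version issues both ready instructions but each block must wait for the previous one, while the partitioned version selects in parallel and leaves the second FU idle because both ready entries sit in the first partition.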
Datapath and Bypass
Commonly used layout: 1 bit-slice
[Figure: turn on Tri-State A to pass the result of FU1 to the left operand of FU0.]
Complexity Analysis
• Critical path delay as a function of:
  – Issue width
  – Window size
• Register Renaming Table
• Wakeup and Select
• Bypass paths
Methodology
• A representative CMOS design is selected from published alternatives
• Implemented the circuits for 3 technologies:
  – 0.8 micron, 0.35 micron and 0.18 micron
• Optimize for speed
• Wire parasitics in delay model
  – Rmetal, Cmetal
Methodology
• Feature size scaling: $1/S$
• Voltage scaling: $1/U$
• Logic delay $= (C_L \times V) / I$
• Capacitive load: $C_L = 1 \rightarrow 1/S$
• Supply voltage: $V = 1 \rightarrow 1/U$
• Average charge/discharge current: $I = 1 \rightarrow 1/U$
• So, logic delay $= (1/S \times 1/U) / (1/U) = 1/S$
Wire Delay
• L: wire length
• Intrinsic RC delay
• Rmetal: resistance per unit length
• Cmetal: capacitance per unit length
• 0.5: first-order approximation of the distributed RC model
  – Uniformly distributed R and C
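The terms listed above combine into the usual first-order expression for the intrinsic delay of a uniformly distributed RC wire. The symbol $T_{wire}$ is a label chosen here, not taken from the slide:

```latex
T_{wire} = 0.5 \times R_{metal} \times C_{metal} \times L^{2}
```

Since $R_{metal}$ and $C_{metal}$ are per unit length, the total resistance and total capacitance each grow with $L$, which is where the quadratic dependence on length comes from.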
Wire Delay Scaling
• Metal thickness doesn’t scale much
  – Width ~ 1/S
  – Rmetal ~ S
• Fringe capacitance dominates at smaller feature sizes
  – Edges to parallel wires and the substrate
• Parallel plate
  – Scales with 1/S
  – Cmetal ~ S
• Length scales with 1/S
• Overall scale factor: $S \times S \times (1/S)^2 = 1$
• Wire delay remains constant
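The two scaling arguments, for logic delay and for wire delay, can be checked numerically. This is a sketch of the arithmetic only; the function names and the sample values S = 4 and U = 2 are assumptions chosen for illustration, not figures from the slides.

```python
def logic_delay_scale(S, U):
    """Logic delay = (C_L * V) / I with C_L ~ 1/S, V ~ 1/U and I ~ 1/U."""
    c_load, voltage, current = 1 / S, 1 / U, 1 / U
    return c_load * voltage / current


def wire_delay_scale(S):
    """Wire delay ~ Rmetal * Cmetal * L^2 with Rmetal ~ S, Cmetal ~ S, L ~ 1/S."""
    r_metal, c_metal, length = S, S, 1 / S
    return r_metal * c_metal * length ** 2


S, U = 4, 2                            # hypothetical scaling factors
logic = logic_delay_scale(S, U)        # gates get faster by a factor of S
wire = wire_delay_scale(S)             # wires stay the same
```

Under these assumptions the logic delay shrinks to a quarter while the wire delay is unchanged, so structures dominated by wires take up a growing share of the cycle as feature size shrinks.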
Register Renaming Table
Dependency Checking Logic
• Accessed in parallel with map table
• Every logical reg compared against logical dest regs of current rename group
• For IW = 2, 4, 8, delay less than map table
[Figure: example with registers r1 and r4.]
Renaming Delay
• SRAM scheme
• Delay components:
  – Time to decode the arch reg index
  – Time to drive wordline
  – Time to pull down bitline
  – Time for sense amp to detect pull-down
  – MUX time ignored, as control from dependency check logic comes in advance
Renaming Circuit
Decoder Delay
Decoder Delay
• Predecoding for speed
• Length of predecode lines:
  – Cellheight: height of a single cell excluding wordlines
  – Wordline spacing
• NVREG: number of virtual registers
• x3: three-operand instructions
Decoder Delay
• Tnand: fall delay of NAND
• Tnor: rise delay of NOR
• Rnandpd (NAND pull-down channel resistance) + predecode line metal resistance
• Ceq: diffusion capacitance of NAND + gate capacitance of NOR + interconnect capacitance
Decoder Delay
• Substituting the predecode line length, Req and Ceq, we get:
• c2: intrinsic RC delay of predecode line
• c2 very small
• Decoder delay ~ linearly dependent on IW
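The equation itself is not present in this transcript. A form consistent with the bullets above would be a quadratic in IW; the names $T_{decode}$, $c_0$ and $c_1$ are assumptions, and only $c_2$ appears on the slide:

```latex
T_{decode} = c_0 + c_1 \times IW + c_2 \times IW^{2}
```

With $c_2$ very small, the $IW^{2}$ term contributes little over the range of issue widths studied, which matches the conclusion that the decoder delay is roughly linear in IW.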
Rename Delay
• Wordline
• c2: intrinsic RC delay of wordline
• c2 very small
• Wordline delay ~ linearly dependent on IW
Rename Delay
• Bitline:
• c2 very small
• Bitline delay ~ linearly dependent on IW
• SenseAmp delay ~ linearly dependent on IW
Rename Logic Delay Scaling
• Feature size: increase in bitline and wordline delay with increasing IW
  – 0.8um: IW 2 → 8, bitline delay +37%
  – 0.18um: IW 2 → 8, bitline delay +53%
• Total delay increases linearly with IW
• Each component shows linear increase with IW
• Bitline delay > wordline delay
• Bitline length ~ number of logical registers
• Wordline length ~ width of physical register designator
IW impact on delay worsens with decreasing feature size
Wakeup Delay
• Critical path: mismatch → pull ready signal low
• Delay components:
  – Tag drivers drive tag lines (vertical)
  – Mismatched bit: pull-down stack pulls matchline low (horizontal)
  – Final OR gate ORs all the matchlines of an operand tag
• Ttagdrive ~ driver pull-up R, tagline length and tagline load C
• Quadratic component significant for IW > 2 and 0.18um
Wakeup Delay
• Quadratic component Small for both cases
• Both delays ~linearly dependent on IW
Wakeup Delay: IW and Window Size
• 0.18um process
• Quadratic dependence
• Issue width has the greater effect: it increases all 3 delay components
• As IW and WinSize increase together, the delay follows the trend marked in the figure
Wakeup Delay: Window Size
• 8-way, 0.18um process
• Tag drive delay increases rapidly with increasing WinSize
• Match OR delay constant
Wakeup Delay: Feature size
• 8-way, 64-entry window
• Tag drive and tag match delays do not scale as well as Match OR delay
• Match OR: logic delay
• Others also have wire delays
Selection Logic and Bypass Delay
• Selection
  – Logarithmically dependent on WinSize
• Bypass: delay dependent on $IW^2$