out-of-order execution structures
A. Moshovos © ECE1773 - Fall ‘07 ECE Toronto
Out-of-Order Execution Structures
MIPS R10000-Like Design
• Based on:
  – Complexity-Effective Superscalar Processors
  – S. Palacharla, N. Jouppi and J. E. Smith, ISCA 97
Fetch Phase
• Fetch:
  – Read instructions from I-Cache
  – Predict branches
  – Pass on to Decode phase
Decode Phase
• Decode:
  – Parse instruction
  – Shuffle opcode parts to appropriate ports for rename
Renaming Phase
• Rename:
  – Map architectural registers to physical registers
  – Eliminate false dependences
  – Pass renamed instructions to the scheduler
    • Called Dispatch
Scheduling Phase
• Wakeup:
  – Instructions check whether they have become ready
  – From Writeback: physical register names
• Select:
  – Among the ready instructions, select those to execute
  – Structural hazards
Register File Read Phase
• Read source operands
Bypass and Execute Phase
Data Cache Access Phase
Writeback Phase
• Write result to register file
• Broadcast tag in order to wake up waiting instructions
  – Notice that the tag broadcast should happen TWO cycles in advance of the result production
Reservation Station Model
• Used by Pentium Pro, PowerPC 604
• Re-order buffer holds values
• Renaming points to re-order buffer entries
  – Tomasulo-like
Physical Register File vs. Reservation Station
• Physical Register File:
  – Values reside in the register file
  – At writeback instructions broadcast the register name
• Reservation Stations:
  – Values reside:
  – In the register file upon commit
    • Non-speculative
  – In reservation stations prior to commit
    • Speculative
Quantifying Complexity
• Critical path delay as a function of architectural parameters
  – Instruction Window size (WinSize)
  – Issue Width (IW)
• Full-custom implementations
  – Study the critical path
  – Delay model
  – Extrapolate how it will scale with “future” technologies
Renaming
• Inputs:
  – IW instructions
  – Up to 2 x input register names
  – Up to 1 x output register name
• Outputs:
  – 2 x input physical registers
  – 1 x new output physical register
  – 1 x previous physical register name for checkpointing
  – Updated rename table
• Superscalar issue complicates things a bit
Renaming One Instruction
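[Figure: renaming one instruction. The RAT (entries p0 … p31) is read through three read ports with s1, s2 and d, giving the physical names of s1 and s2 and the old mapping of d (kept for misspeculation recovery). One write port records the new register for d, taken from the free list.]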
Renaming Two Instructions
[Figure: renaming two instructions. Each instruction presents s1, s2, d and new d to the RAT; cross-bundle dependency check logic chooses the outputs ps1, ps2, old d and new d for each instruction.]
Renaming More Instructions
• Dependency checking logic for instruction i must match against all preceding destinations
• If there are multiple matches it must enforce priority:
  – Pick the one closest to this instruction
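The priority rule above can be sketched in software. This is a minimal illustration, not the circuit from the slides: the names `rename_bundle`, `rat` and `free_list` are hypothetical, and every instruction is assumed to have exactly one destination and two sources.

```python
def rename_bundle(bundle, rat, free_list):
    """Rename a bundle of (dest, src1, src2) architectural register triples.

    rat       -- dict: architectural register -> physical register (updated in place)
    free_list -- list of free physical registers (consumed from the front)
    Returns one dict per instruction with ps1, ps2, old_d and new_d.
    """
    # All RAT reads observe the table as it was before this bundle.
    snapshot = dict(rat)
    # Allocate one new physical destination per instruction, in program order.
    new_dests = [free_list.pop(0) for _ in bundle]

    renamed = []
    for i, (d, s1, s2) in enumerate(bundle):
        def lookup(arch):
            # Dependency check: scan preceding destinations, closest first.
            for j in range(i - 1, -1, -1):
                if bundle[j][0] == arch:
                    return new_dests[j]
            return snapshot[arch]  # no match in the bundle: use the RAT value

        renamed.append({"ps1": lookup(s1), "ps2": lookup(s2),
                        "old_d": lookup(d), "new_d": new_dests[i]})

    # Update the RAT in program order, so the last writer of a register wins.
    for (d, _, _), p in zip(bundle, new_dests):
        rat[d] = p
    return renamed


rat = {1: 1, 2: 2, 3: 3, 4: 4}
free_list = [10, 11, 12]
# r4 = r1 op r2 ; r4 = r4 op r3 ; r1 = r4 op r4
out = rename_bundle([(4, 1, 2), (4, 4, 3), (1, 4, 4)], rat, free_list)
```

In the third instruction both sources name r4, which two earlier instructions write; the scan picks the closer one (the second instruction's new register).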
RAT: SRAM Implementation
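[Figure: SRAM-based RAT. A decoder indexed by the architectural register selects a row of SRAM cells; bitlines and sense amps deliver the physical register. The array has #ARCH REGS rows and lg(#PHYS REGS) columns.]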
SRAM RAT cell
RAT: CAM Implementation
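[Figure: CAM-based RAT. The architectural register is matched against CAM cells and an encoder produces the physical register. The array has #PHYS REGS rows and lg(#ARCH REGS) columns, plus an active bit per row.]
• One CAM per physical register
• Active bit indicates the current map
• New version by setting active bit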
CAM Cell
[Figure: CAM cell with Match, Wordline, Bitline and Bitline_B signals.]
SRAM vs. CAM
• SRAM:
  – Arch reg rows
  – lg(phys reg) cols
  – SRAM read/write
• CAM:
  – Phys reg rows
  – lg(arch reg) cols
  – CAM match
  – Update:
    • Reset previous valid bit
    • Set current valid bit
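The two organizations can be contrasted with a small behavioral model. This is a sketch under stated assumptions: the class names `SramRat` and `CamRat` are hypothetical, the reset state maps each architectural register to the physical register of the same number, and the sequential loop in `CamRat.lookup` stands in for a match that hardware performs on all rows in parallel.

```python
class SramRat:
    """One row per architectural register; each row stores a physical register number."""

    def __init__(self, num_arch):
        self.rows = list(range(num_arch))  # assumed identity mapping at reset

    def lookup(self, arch):
        return self.rows[arch]             # indexed read

    def update(self, arch, phys):
        self.rows[arch] = phys             # indexed write


class CamRat:
    """One row per physical register; each row stores an arch register name and a valid bit."""

    def __init__(self, num_arch, num_phys):
        self.arch = [None] * num_phys
        self.valid = [False] * num_phys
        for r in range(num_arch):          # assumed identity mapping at reset
            self.arch[r] = r
            self.valid[r] = True

    def lookup(self, arch):
        # Associative match against every valid row.
        for phys, (name, valid) in enumerate(zip(self.arch, self.valid)):
            if valid and name == arch:
                return phys
        raise KeyError(arch)

    def update(self, arch, phys):
        old = self.lookup(arch)
        self.valid[old] = False            # reset previous valid bit
        self.arch[phys] = arch
        self.valid[phys] = True            # set current valid bit


sram = SramRat(num_arch=4)
cam = CamRat(num_arch=4, num_phys=8)
sram.update(2, 6)
cam.update(2, 6)
```

Both models answer `lookup(2)` with 6 after the update; they differ in which dimension (architectural or physical registers) sets the number of rows.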
Scheduler: Part #1 - Wakeup
Scheduler: Part #2 - Select
• Tree of arbiters
• For a single FU
• Location-based select policy
[Figure: tree of arbiters with REQ signals and GRANT signals. Anyreq is raised if any req is active; a grant is issued if the arbiter is enabled; the root is enabled if the FU is available.]
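A behavioral sketch of the arbiter tree for a single FU follows. The function name `select_one` and the fan-in of 4 are assumptions for illustration; the location-based policy is modelled as "lowest window index wins".

```python
def select_one(requests, fu_available=True, fan_in=4):
    """Return the window index that receives the grant, or None.

    requests     -- list of booleans, one REQ signal per window entry
    fu_available -- enables the root arbiter
    fan_in       -- requests handled per arbiter (assumed value)
    """
    def grant(lo, hi, enable):
        # An arbiter issues a grant only if it is enabled and its anyreq is raised.
        if not enable or not any(requests[lo:hi]):
            return None
        if hi - lo == 1:
            return lo
        span = hi - lo
        step = -(-span // fan_in)  # ceiling division: entries covered per child
        for child_lo in range(lo, hi, step):
            child_hi = min(child_lo + step, hi)
            if any(requests[child_lo:child_hi]):     # child's anyreq
                # Enable only the first requesting child (location priority).
                return grant(child_lo, child_hi, True)
        return None

    return grant(0, len(requests), fu_available)


reqs = [False] * 16
reqs[5] = True
reqs[9] = True
winner = select_one(reqs)
```

With 16 entries and a fan-in of 4 the recursion is two levels deep, which is the software analogue of the tree height.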
Select for More than One FU
• Handling multiple FUs of the same type:
  – Stack select logic blocks in series (hierarchy)
  – Mask the request granted to the previous unit
• NOT feasible for more than 2 FUs
• Alternative:
  – Statically partition the issue window among FUs
  – MIPS R10000, HP PA 8000
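The stacked scheme can be sketched as two select blocks in series, where the second block sees the request vector with the first grant masked out. The helper names `pick_lowest` and `select_two` are hypothetical, and each select block is reduced to a "lowest index wins" priority pick.

```python
def pick_lowest(requests):
    """One select block: grant the lowest-indexed active request, or None."""
    for i, req in enumerate(requests):
        if req:
            return i
    return None


def select_two(requests):
    """Two select blocks in series for two FUs of the same type."""
    first = pick_lowest(requests)
    masked = list(requests)
    if first is not None:
        masked[first] = False      # mask the request granted to the previous unit
    second = pick_lowest(masked)
    return first, second


grants = select_two([False, True, False, True, True])
```

The second block cannot start until the first grant is known, so the delays add; that serial dependence is one way to read the slide's remark about more than 2 FUs.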
Datapath and Bypass
[Figure: commonly used layout, 1 bit-slice. Example: turn on tri-state A to pass the result of FU1 to the left operand of FU0.]
Complexity Analysis
• Critical path delay as a function of:
  – Issue Width
  – Window Size
• Register Renaming Table
• Wakeup and Select
• Bypass paths
Methodology
• A representative CMOS design is selected from published alternatives
• Implemented the circuits for 3 technologies:
  – 0.8 micron, 0.35 micron and 0.18 micron
• Optimize for speed
• Wire parasitics in delay model
  – Rmetal, Cmetal
Methodology
• Feature size scaling: 1 / S
• Voltage scaling: 1 / U
• Logic Delay $= (C_L \times V) / I$
• Capacitive load: $C_L$ scales from 1 to 1 / S
• Supply voltage: $V$ scales from 1 to 1 / U
• Average charge/discharge current: $I$ scales from 1 to 1 / U
• So, Logic Delay scales as $(1/S \times 1/U) / (1/U) = 1/S$
Wire Delay
• L: wire length
• Intrinsic RC delay: $\text{Delay}_{wire} = 0.5 \times R_{metal} \times C_{metal} \times L^2$
• Rmetal: resistance per unit length
• Cmetal: capacitance per unit length
• 0.5: 1st order approximation of distributed RC model – uniformly distributed R & C
Wire Delay Scaling
• Metal thickness doesn’t scale much
  – Width ~ 1/S
  – Rmetal ~ S
• Fringe capacitance dominates in smaller feature sizes – edges to parallel wires and the substrate
• Parallel plate – scales with 1 / S
  – Cmetal ~ S
• Length scales with 1/S
• Overall scale factor: $S \times S \times (1/S)^2 = 1$
• Wire delay remains constant
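The two scaling arguments can be checked numerically. This is a small sketch using only the scale factors listed on the methodology and wire-delay slides; the function names are hypothetical and the values of S and U are arbitrary examples.

```python
def logic_delay_scale(S, U):
    """Scale factor of logic delay = (C_L * V) / I under feature scaling S, voltage scaling U."""
    c_load = 1 / S      # capacitive load scales as 1/S
    voltage = 1 / U     # supply voltage scales as 1/U
    current = 1 / U     # average charge/discharge current scales as 1/U
    return c_load * voltage / current


def wire_delay_scale(S):
    """Scale factor of intrinsic wire delay, proportional to Rmetal * Cmetal * L^2."""
    r_metal = S         # resistance per unit length grows with S
    c_metal = S         # capacitance per unit length grows with S
    length = 1 / S      # wire length shrinks with 1/S
    return r_metal * c_metal * length ** 2


logic = logic_delay_scale(S=2, U=2)
wire = wire_delay_scale(S=2)
```

For S = 2 the logic delay halves while the wire delay factor stays at 1, so wires take a growing share of the cycle as features shrink.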
Register Renaming Table
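Dependency Checking Logic
• Accessed in parallel with the map table
• Every logical register is compared against the logical destination registers of the current rename group
• For IW = 2, 4, 8, delay is less than that of the map table
[Figure: dependency check example using registers r1 and r4.]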
Renaming Delay
• SRAM scheme
• Delay components:
  – Time to decode the arch reg index
  – Time to drive wordline
  – Time to pull down bitline
  – Time for SenseAmp to detect pull-down
  – MUX time ignored, as the control from the dependency check logic comes in advance
Renaming Circuit
Decoder Delay
Decoder Delay
• Predecoding for speed
• Length of predecode lines: $(\text{cellheight} + 3 \times IW \times \text{wordline spacing}) \times NVREG$
  – Cellheight: height of a single cell excluding wordlines
  – Wordline spacing
• NVREG: # of virtual registers
• x3: 3-operand instructions
Decoder Delay
• Tnand: fall delay of NAND
• Tnor: rise delay of NOR
• Req: Rnandpd (NAND pull-down channel resistance) + predecode line metal resistance
• Ceq: diffusion capacitance of NAND + gate capacitance of NOR + interconnect capacitance
Decoder Delay
• Substituting the predecode line length, Req and Ceq, we get: $T_{decode} = c_0 + c_1 \times IW + c_2 \times IW^2$
• c2: intrinsic RC delay of predecode line
• c2 very small
• Decoder delay ~linearly dependent on IW
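To see why a very small c2 makes the delay look linear, here is a sketch that assumes the delay has the quadratic form c0 + c1·IW + c2·IW², which is what the c2 bullets suggest. The coefficient values are invented for illustration and are not taken from the study.

```python
def delay(iw, c0, c1, c2):
    """Quadratic delay model in issue width (assumed form)."""
    return c0 + c1 * iw + c2 * iw ** 2


def quadratic_share(iw, c0, c1, c2):
    """Fraction of the total delay contributed by the IW^2 term."""
    return (c2 * iw ** 2) / delay(iw, c0, c1, c2)


# Hypothetical coefficients (arbitrary units), chosen so that c2 << c1.
C0, C1, C2 = 100.0, 10.0, 0.1

for iw in (2, 4, 8):
    print(iw, delay(iw, C0, C1, C2), round(quadratic_share(iw, C0, C1, C2), 4))
```

With these made-up numbers the quadratic term stays under 4% of the total even at IW = 8, so a straight line through the three points is a close fit.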
Rename Delay
• Wordline: $T_{wordline} = c_0 + c_1 \times IW + c_2 \times IW^2$
• c2: intrinsic RC delay of wordline
• c2 very small
• Wordline delay ~linearly dependent on IW
Rename Delay
• Bitline: $T_{bitline} = c_0 + c_1 \times IW + c_2 \times IW^2$
• c2 very small
• Bitline delay ~linearly dependent on IW
• SenseAmp delay ~linearly dependent on IW
Rename Logic Delay Scaling
• Smaller feature size: larger increase in bitline & wordline delay with increasing IW
  – 0.8um: IW 2 → 8, bitline delay +37%
  – 0.18um: IW 2 → 8, bitline delay +53%
• Total delay increases linearly with IW
• Each component shows a linear increase with IW
• Bitline delay > wordline delay
• Bitline length ~ # of logical registers
• Wordline length ~ width of physical register designator
• IW impact on delay worsens with decreasing feature size
Wakeup Delay
• Critical path: a mismatch pulls the ready signal low
• Delay components:
  – Tag drivers drive the tag lines (vertical)
  – Mismatched bit: the pull-down stack pulls the matchline low (horizontal)
  – Final OR gate ORs all the matchlines of an operand tag
• Ttagdrive ~ driver pull-up R, tagline length and tagline load C
• Quadratic component significant for IW > 2 at 0.18um
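Functionally, wakeup compares every broadcast tag with every operand tag in the window. The sketch below models that behavior only, not the matchline circuit; `WindowEntry` and `wakeup` are hypothetical names, and tags are plain integers standing in for physical register names.

```python
from dataclasses import dataclass, field


@dataclass
class WindowEntry:
    src_tags: list                              # physical register tags of the source operands
    ready: list = field(default_factory=list)   # one ready flag per operand

    def __post_init__(self):
        if not self.ready:
            self.ready = [False] * len(self.src_tags)


def wakeup(window, broadcast_tags):
    """Apply one cycle of tag broadcasts; return indices of entries with all operands ready."""
    for entry in window:
        for k, tag in enumerate(entry.src_tags):
            # One comparison per broadcast tag; their results are ORed together.
            if any(tag == b for b in broadcast_tags):
                entry.ready[k] = True
    return [i for i, e in enumerate(window) if all(e.ready)]


window = [WindowEntry([10, 11]), WindowEntry([11, 12]), WindowEntry([13, 13])]
after_first = wakeup(window, [11])        # wakes one operand in entries 0 and 1
after_second = wakeup(window, [10, 12])   # completes entries 0 and 1
```

The number of comparisons per operand equals the number of broadcast tags, and the number of operands compared grows with the window size, which is where IW and WinSize enter the delay.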
Wakeup Delay
• Quadratic component small for both remaining cases (tag match and match OR)
• Both delays ~linearly dependent on IW
Wakeup Delay: IW and Window Size
• 0.18um process
• Quadratic dependence
• Issue width has the greater effect: it increases all 3 delay components
• As IW and WinSize increase together, the delay changes as shown in the slide's plot
Wakeup Delay: Window Size
• 8-way & 0.18um process
• Tag drive delay increases rapidly as WinSize increases
• Match OR delay constant
Wakeup Delay: Feature size
• 8-way & 64-entry window
• Tag drive and tag match delays do not scale as well as the match OR delay
• Match OR is logic delay
• The others also have wire delays
Selection Logic and Bypass Delay
• Selection
  – Logarithmically dependent on WinSize
• Bypass: delay dependent on $IW^2$
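Both trends can be illustrated with a rough sketch. It assumes an arbiter fan-in of 4 for the select tree, and it assumes that bypass wire length grows in proportion to IW so that an intrinsic RC delay grows with its square; the function names and the baseline of IW = 2 are hypothetical.

```python
def select_tree_levels(win_size, fan_in=4):
    """Number of arbiter levels needed to cover win_size entries (fan-in is an assumption)."""
    levels = 0
    span = 1
    while span < win_size:
        span *= fan_in
        levels += 1
    return levels


def relative_bypass_delay(iw, base_iw=2):
    """Bypass delay relative to a base issue width, assuming delay ~ IW^2."""
    return (iw / base_iw) ** 2


levels = [select_tree_levels(w) for w in (16, 32, 64, 128)]
bypass = [relative_bypass_delay(iw) for iw in (2, 4, 8)]
```

Going from 16 to 128 window entries adds only two tree levels, while going from IW = 2 to IW = 8 multiplies the modelled bypass delay by 16.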