dynamic scheduling ii - yeditepe Üniversitesi bilgisayar...
TRANSCRIPT
Dynamic Scheduling II
• so far: dynamic scheduling (out-of-order execution)• Scoreboard• Tomasulo’s algorithm• register renaming: removing artificial dependences (WAR/WAW)
• now: out-of-order execution + precise state• advanced topic: dynamic load scheduling • PentiumII vs. Pentium4• limits of ILP
COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II
1© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,
Vijaykumar, Lipasti
Readings
H+P• chapter 3 (not required)
Readings in Computer Architecture• Sohi and Vajapeyam: “Instruction Issue Logic”• Yeager: “The MIPS R10000 Superscalar Microprocessor”
Recent Research Papers• Complexity-Effective Superscalar• Optimal Pipeline Depth• DIVA• Power
COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II
2© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,
Vijaykumar, Lipasti
Superscalar + Out-of-Order + Speculation
superscalar + out-of-order + speculation• three concepts that work well (best?) when used together• CPI >= 1?
• overcome with superscalar
• superscalar increases hazards? • overcome with dynamic scheduling
• RAW dependences still a problem?• overcome with a large instruction window
• branches a problem for filling large window?• overcome with speculation
COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II
3© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,
Vijaykumar, Lipasti
Speculation and Precise Interrupts
Q: why are we discussing these together?• sequential (von Neumann) semantics for interrupts• all instructions before interrupt should be complete• all instructions after interrupt should look as if never started (abort)• basically, we also want the same thing for a mis-predicted branch
• what makes precise interrupts hard? • out-of-order completion ⇒ must undo post-interrupt writebacks• in-order ⇒ no post-branch writebacks before branch completes• out-of-order ⇒ can happen
A: with out-of-order, precise interrupts and mis-speculation recovery are same problem ⇒ same solution
COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II
4© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,
Vijaykumar, Lipasti
Solution: Precise State
• speculative execution requirements• ability to abort & restart at every branch
• precise synchronous interrupt requirements• ability to abort & restart at every load, store, FP divide, ??
• precise asynchronous interrupt requirements• ability to abort & restart at every ??
• just bite the bullet • implement ability to abort & restart at every instruction• called “precise state”
COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II
5© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,
Vijaykumar, Lipasti
Ways to Implement Precise State
• imprecise state: ignore the problem!– makes page faults (any restartable exceptions) hard– makes speculative execution almost impossible• IEEE standard strongly suggests precise state
• force in-order completion (WB): stall pipe if necessary– slow
• precise state in software– even slower - would require a trap for every misprediction
• precise state in hardware: save recovery info internally+ everything is better in hardware
COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II
6© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,
Vijaykumar, Lipasti
The Problem with Precise State
problem is in the writeback stage (WB)• mixes two things together that should be separate• (1) broadcasts values to RS, forwards to other instructions
• OK for this to be out-of-order
• (2) writes values to registers• would like this to be in-order
solution to every functionality problem? add a level of indirection
• have already seen this for out-of-order execution• split ID into in-order DS and out-of-order IS• separate using instruction buffer (scoreboard, reservation stations)
COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II
7© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,
Vijaykumar, Lipasti
Re-Order Buffer (ROB)
I$
regfile
DSIF CMEX
F/D D/XPC X/C
IS
ROB
RT
instruction buffer ⇒ re-order buffer (ROB)• buffers completed results en route to register file and D$!!!• may be combined with RS or separate (combined in the picture)
• split writeback (WB) into two stages: Complete and Retire
COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II
8© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,
Vijaykumar, Lipasti
Complete and Retire
I$
regfile
DSIF CMEX
F/D D/XPC X/C
IS
ROB
RT
• CM (complete)• completed values write results to ROB out-of-order• out-of-order stage ⇒ hazards result in waits
• RT (retire, but sometimes called “commit” or “graduate”)• ROB writes results to register file in-order• in-order stage ⇒ hazards result in stalls
COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II
9© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,
Vijaykumar, Lipasti
Memory Ordering Buffer (MOB)
ROB makes register writes in-order, but what about stores?• same as before (i.e., to D$ in MEM stage)?• bad idea: imprecise memory worse than imprecise registers• must do same trick for stores
Memory Ordering Buffer (MOB)• a.k.a. store buffer, store queue, load/store queue (LSQ)• completed (but not retired) stores write to MOB• on RT, head of MOB written to D$• loads look at MOB and D$ in parallel
• forward from MOB if matching store (i.e. to same address)
COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II
10© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,
Vijaykumar, Lipasti
ROB+MOB
I$
regfile
DSIF CMEX
F/D D/XPC X/C
IS
ROB
RT
D$MOBstores
loads/storesstores loads
modulo some gross simplifications, this picture is almost realistic!!
COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II
11© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,
Vijaykumar, Lipasti
Tomasulo+ROB
• add ROB to Tomasulo’s algorithm• combined ROB and RS are called RUU (or Sohi’s method)• RUU = register update unit
• separate ROB and RS are called P6-style (Intel P6 = Pentium Pro)
• our example: Simple-P6• separate ROB and RS• same RS organization as before: 1 ALU, 1 load, 1 store, 2 3-cycle FP
COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II
12© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,
Vijaykumar, Lipasti
P6-style Organization
value
V1 V2
T+
T1 T2T
R
==
FUT
taildispatch
retirehead
value
RSROB
reg status
dispatch
RF
CD
B.V
CD
B.T
• instruction fields and ready bits• tags• values
COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II
13© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,
Vijaykumar, Lipasti
P6 Data Structures
• RS are the same as before• ROB• head, tail: to keep sequential order• R: output register of instruction, V: output value of instruction
• tags are different• was: RS# → now: ROB#
• register status table is different• T+: tag + “ready-in-ROB” bit• tag == 0 ⇒ result ready in register file• tag != 0 ⇒ result not ready• tag != 0 + ⇒ result ready in ROB
COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II
14© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,
Vijaykumar, Lipasti
P6 Data Structures
RS# FU busy op T V1 V2 T1 T21 ALU No2 load No3 store No4 FP1 No
hdtl
ROB + MOB# instruction R V addr IS EX CM1 ldf f0,X(r1)2 mulf f4,f0,f23 stf f4,Z(r1)4 add r1,r1,#85 ldf f0,X(r1)6 mulf f4,f0,f27 stf f4,Z(r1)
Reg. Statusreg T+f0f2f4r1
CDBV T
5 FP2 No
COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II
15© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,
Vijaykumar, Lipasti
P6 Pipeline
new pipeline structure: IF, DS, IS, EX, CM, RT• DS (dispatch)• (RS/ROB/MOB full) ? (stall) : • {allocate RS/ROB/MOB entries, set RS tag to ROB#, set register status entry to ROB# with “ready-in-ROB” bit off, read ready registers into RS}
• EX (execute)• free RS entry• used to be done at WB• can be earlier now because RS# are not tags
COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II
16© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,
Vijaykumar, Lipasti
P6 Pipeline
• CM (complete)• (CDB not available) ? (wait) :• {write value into ROB entry indicated by RS tag, mark ROB entry complete, mark register status entry “ready-in-ROB” bit (+)}
• RT (retire, commit, graduate)• (ROB head not complete) ? (stall) :• {write ROB head result to register file, if store, then write MOB head to D$, handle any exceptions, free ROB/MOB entries}
COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II
17© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,
Vijaykumar, Lipasti
P6: Dispatch (DS) part I
value
V1 V2
T+
T1 T2T
valueR
==
FUT
taildispatch
retirehead
RSROB
reg status
dispatch
RF
CD
B.V
CD
B.T
• stall if RS or ROB or MOB is full• allocate RS+ROB entries (assign ROB# to RS output tag)• set register status entry to ROB# and “ready-in-ROB” bit to 0
COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II
18© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,
Vijaykumar, Lipasti
P6: Dispatch (DS) part II
value
V1 V2
T+
T1 T2T
valueR
==
FUT
taildispatch
retirehead
RSROB
reg status
dispatch
RF
CD
B.V
CD
B.T
• read tags for register inputs from register status table• if tag==0: copy value from RF (not shown)• if tag!=0: copy tag to RS • if tag!=0 +: copy value from ROB
COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II
19© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,
Vijaykumar, Lipasti
P6: Complete (CM)
value
V1 V2
T+
T1 T2T
valueR
==
FUT
tailfetch
retirehead
RSROB
reg status
fetch
RF
CD
B.V
CD
B.T
• wait for CDB• broadcast <result,tag> on CDB• write result into ROB, set reg. status “ready-in-ROB” bit (+)• match tags, write CDB.V into RS of dependent instructions
COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II
20© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,
Vijaykumar, Lipasti
P6: Retire (RT)
value
V1 V2
T+
T1 T2T
R
==
FUT
tailfetch
retirehead
value
RSROB
reg status
fetch
RF
CD
B.V
CD
B.T
• stall until instruction at ROB head has completed• write ROB head result to reg-file (D$ if store), clear reg. status entry• free ROB entry
COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II
21© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,
Vijaykumar, Lipasti
COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II
22© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,
Vijaykumar, Lipasti
RS# FU busy op T V1 V2 T1 T21 ALU No2 load Yes ldf ROB#1 REG[r1]3 store No4 FP1 No5 FP2 No
P6 Example: Cycle 1
hdtl
ROB + MOB# instruction R V addr IS EX CM
ht 1 ldf f0,X(r1) f0 &X[0]2 mulf f4,f0,f23 stf f4,Z(r1)4 add r1,r1,#85 ldf f0,X(r1)6 mulf f4,f0,f27 stf f4,Z(r1)
Reg. Statusreg T+f0 ROB#1f2f4r1
allocate RS
CDBV T
set reg. status
before ROB,this was RS#2
COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II
23© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,
Vijaykumar, Lipasti
P6 Example: Cycle 2
RS# FU busy op T V1 V2 T1 T21 ALU No2 load Yes ldf ROB#1 REG[r1]3 store No4 FP1 Yes mulf ROB#2 REG[f2] ROB#15 FP2 No
allocate RS,
Reg. Statusreg T+f0 ROB#1f2f4 ROB#2r1
CDBV T
set reg. status
allocate ROB,
hdtl
ROB + MOB# instruction R V addr IS EX CM
h 1 ldf f0,X(r1) f0 &X[0] c2t 2 mulf f4,f0,f2 f4
3 stf f4,Z(r1)4 add r1,r1,#85 ldf f0,X(r1)6 mulf f4,f0,f27 stf f4,Z(r1)
COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II
24© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,
Vijaykumar, Lipasti
P6 Example: Cycle 3
RS# FU busy op T V1 V2 T1 T21 ALU No2 load No3 store Yes stf ROB#3 REG[r1] ROB#24 FP1 Yes mulf ROB#2 REG[f2] ROB#15 FP2 No
allocatefree
hdtl
ROB + MOB# instruction R V addr IS EX CM
h 1 ldf f0,X(r1) f0 &X[0] c2 c32 mulf f4,f0,f2 f4
t 3 stf f4,Z(r1) &Z[0]4 add r1,r1,#85 ldf f0,X(r1)6 mulf f4,f0,f27 stf f4,Z(r1)
Reg. Statusreg T+f0 ROB#1f2f4 ROB#2r1
CDBV T
COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II
25© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,
Vijaykumar, Lipasti
P6 Example: Cycle 4
RS# FU busy op T V1 V2 T1 T21 ALU Yes add ROB#4 REG[r1]2 load No3 store Yes stf ROB#3 REG[r1] ROB#24 FP1 Yes mulf ROB#2 CDB.V REG[f2] ROB#15 FP2 No
allocate
ldf finished1. write result to ROB
f0 ready
hdtl
ROB + MOB# instruction R V addr IS EX CM
h 1 ldf f0,X(r1) f0 [f0] &X[0] c2 c3 c42 mulf f4,f0,f2 f4 c43 stf f4,Z(r1) &Z[0]
t 4 add r1,r1,#8 r15 ldf f0,X(r1)6 mulf f4,f0,f27 stf f4,Z(r1)
Reg. Statusreg T+f0 ROB#1+f2f4 ROB#2r1 ROB#4
CDBV T
[f0] ROB#1
2. CDB broadcast
grab from CDB
3. set ready-in-ROB bit
COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II
26© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,
Vijaykumar, Lipasti
P6 Example: Cycle 5
RS# FU busy op T V1 V2 T1 T21 ALU Yes add ROB#4 REG[r1]2 load Yes ldf ROB#5 ROB#43 store Yes stf ROB#3 REG[r1] ROB#24 FP1 No5 FP2 No
retire, write ROBresult into regfile
Reg. Statusreg T+f0 ROB#5f2f4 ROB#2r1 ROB#4
CDBV T
hdtl
ROB + MOB# instruction R V addr IS EX CM1 ldf f0,X(r1) f0 [f0] &X[0] c2 c3 c4
h 2 mulf f4,f0,f2 f4 c4 c53 stf f4,Z(r1) &Z[0]4 add r1,r1,#8 r1 c5
t 5 ldf f0,X(r1) f06 mulf f4,f0,f27 stf f4,Z(r1)
allocate
free
COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II
27© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,
Vijaykumar, Lipasti
P6 Example: Cycle 6
RS# FU busy op T V1 V2 T1 T21 ALU No2 load Yes ldf ROB#5 ROB#43 store Yes stf ROB#3 REG[r1] ROB#24 FP1 No5 FP2 Yes mulf ROB#6 REG[f2] ROB#5 allocate
free
Reg. Statusreg T+f0 ROB#5f2f4 ROB#6r1 ROB#4
CDBV T
hdtl
ROB + MOB# instruction R V addr IS EX CM1 ldf f0,X(r1) f0 [f0] &X[0] c2 c3 c4
h 2 mulf f4,f0,f2 f4 c4 c5+3 stf f4,Z(r1) &Z[0]4 add r1,r1,#8 r1 c5 c65 ldf f0,X(r1) f0
t 6 mulf f4,f0,f2 f47 stf f4,Z(r1)
COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II
28© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,
Vijaykumar, Lipasti
P6 Example: Cycle 7
RS# FU busy op T V1 V2 T1 T21 ALU No2 load Yes ldf ROB#5 CDB.V ROB#43 store Yes stf ROB#3 REG[r1] ROB#24 FP1 No5 FP2 Yes mulf ROB#6 REG[f2] ROB#5
stall DS, no free store RS
r1 ready
Reg. Statusreg T+f0 ROB#5f2f4 ROB#6r1 ROB#4+
CDBV T
[r1] ROB#4
write result into ROB, CDB
hdtl
ROB + MOB# instruction R V addr IS EX CM1 ldf f0,X(r1) f0 [f0] &X[0] c2 c3 c4
h 2 mulf f4,f0,f2 f4 c4 c5+3 stf f4,Z(r1) &Z[0]4 add r1,r1,#8 r1 [r1] c5 c6 c75 ldf f0,X(r1) f0 &X[1] c7
t 6 mulf f4,f0,f2 f47 stf f4,Z(r1)
grab from CDB
add finished
COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II
29© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,
Vijaykumar, Lipasti
P6 Example: Cycle 8
RS# FU busy op T V1 V2 T1 T21 ALU No2 load No3 store Yes stf ROB#3 CDB.V REG[r1] ROB#24 FP1 No5 FP2 Yes mulf ROB#6 REG[f2] ROB#5
stall DS, no free store RS
stall RT
free
Reg. Statusreg T+f0 ROB#5f2f4 ROB#6r1 ROB#4+
CDBV T
[f4] ROB#2
hdtl
ROB + MOB# instruction R V addr IS EX CM1 ldf f0,X(r1) f0 [f0] &X[0] c2 c3 c4
h 2 mulf f4,f0,f2 f4 [f4] c4 c5+ c83 stf f4,Z(r1) &Z[0] c84 add r1,r1,#8 r1 [r1] c5 c6 c75 ldf f0,X(r1) f0 &X[1] c7 c8
t 6 mulf f4,f0,f2 f47 stf f4,Z(r1)
f4 readygrab from CDB
COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II
30© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,
Vijaykumar, Lipasti
P6 Example: Cycle 9
RS# FU busy op T V1 V2 T1 T21 ALU No2 load No3 store Yes stf ROB#7 ROB#4.V ROB#64 FP1 No5 FP2 Yes mulf ROB#6 CDB.V REG[f2] ROB#5
stall RT
Reg. Statusreg T+f0 ROB#5+f2f4 ROB#6r1 ROB#4+
CDBV T
[f0] ROB#5
hdtl
ROB + MOB# instruction R V addr IS EX CM1 ldf f0,X(r1) f0 [f0] &X[0] c2 c3 c42 mulf f4,f0,f2 f4 [f4] c4 c5+ c8
h 3 stf f4,Z(r1) &Z[0] c8 c94 add r1,r1,#8 r1 [r1] c5 c6 c75 ldf f0,X(r1) f0 &X[1] c7 c8 c96 mulf f4,f0,f2 f4 c9
t 7 stf f4,Z(r1) &Z[1]
f0 readygrab from CDB
free (ROB#3)allocate (ROB#7)
read from ROBnot reg. file (+)
COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II
31© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,
Vijaykumar, Lipasti
P6 Example: Cycle 10
RS# FU busy op T V1 V2 T1 T21 ALU No2 load No3 store Yes stf ROB#7 ROB#4.V ROB#64 FP1 No5 FP2 No
stall RT
Reg. Statusreg T+f0 ROB#5+f2f4 ROB#6r1 ROB#4+
CDBV T
hdtl
ROB + MOB# instruction R V addr IS EX CM1 ldf f0,X(r1) f0 [f0] &X[0] c2 c3 c42 mulf f4,f0,f2 f4 [f4] c4 c5+ c8
h 3 stf f4,Z(r1) &Z[0] c8 c9 c104 add r1,r1,#8 r1 [r1] c5 c6 c75 ldf f0,X(r1) f0 &X[1] c7 c8 c96 mulf f4,f0,f2 f4 c9 c10
t 7 stf f4,Z(r1)
free
COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II
32© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,
Vijaykumar, Lipasti
P6 Example: Cycle 11
RS# FU busy op T V1 V2 T1 T21 ALU No2 load No3 store Yes stf ROB#7 ROB#4.V ROB#64 FP1 No5 FP2 No
Reg. Statusreg T+f0 ROB#5+f2f4 ROB#6r1 ROB#4+
CDBV T
hdtl
ROB + MOB# instruction R V addr IS EX CM1 ldf f0,X(r1) f0 [f0] &X[0] c2 c3 c42 mulf f4,f0,f2 f4 [f4] c4 c5+ c83 stf f4,Z(r1) &Z[0] c8 c9 c10
h 4 add r1,r1,#8 r1 [r1] c5 c6 c75 ldf f0,X(r1) f0 &X[1] c7 c8 c96 mulf f4,f0,f2 f4 c9 c10
t 7 stf f4,Z(r1)retire stf
Precise State with ROB
point of ROB is maintaing precise state• how does that work?• easy as 1,2,3
• 1: wait until precise condition reaches retire (RT) stage• 2: clear contents of ROB, RS & register status table• 3: start over
• works because zeros signify the right state for start-over• 0 in ROB/RS ⇒ entry is empty• tag==0 in register status table ⇒ register is written
• and because register file and D$ writes take place at RT• let’s look at example page fault at stf ...
COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II
33© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,
Vijaykumar, Lipasti
COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II
34© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,
Vijaykumar, Lipasti
P6 Precise State Example: Cycle 9
RS# FU busy op T V1 V2 T1 T21 ALU No2 load No3 store Yes stf ROB#7 ROB#4.V ROB#64 FP1 No5 FP2 Yes mulf ROB#6 CDB.V REG[f2] ROB#5
Reg. Statusreg T+f0 ROB#5+f2f4 ROB#6r1 ROB#4+
CDBV T
[f0] ROB#5
hd tl
ROB + MOB# instruction R V addr IS EX CM1 ldf f0,X(r1) f0 [f0] &X[0] c2 c3 c42 mulf f4,f0,f2 f4 [f4] c4 c5+ c8
h 3 stf f4,Z(r1) &Z[0] c8 c94 add r1,r1,#8 r1 [r1] c5 c6 c75 ldf f0,X(r1) f0 &X[1] c7 c8 c96 mulf f4,f0,f2 f4 c9
t 7 stf f4,Z(r1)PAGE FAULT
COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II
35© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,
Vijaykumar, Lipasti
P6 Precise State Example:Cycle 10
RS# FU busy op T V1 V2 T1 T21 ALU No2 load No3 store No4 FP1 No5 FP2 No
faulting instruction at
Reg. Statusreg Tf0f2f4r1
CDBV T
hd tl
ROB + MOB# instruction R V addr IS EX CM1 ldf f0,X(r1) f0 [f0] &X[0] c2 c3 c42 mulf f4,f0,f2 f4 [f4] c4 c5+ c83 stf f4,Z(r1)4 add r1,r1,#85 ldf f0,X(r1)6 mulf f4,f0,f27 stf f4,Z(r1) ROB head?
CLEAR EVERYTHING
COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II
36© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,
Vijaykumar, Lipasti
P6 Precise State Example: Cycle 11
RS# FU busy op T V1 V2 T1 T21 ALU No2 load No3 store Yes stf ROB#3 REG[f4] REG[r1]4 FP1 No5 FP2 No
Reg. Statusreg Tf0f2f4r1
CDBV T
hd tl
ROB + MOB# instruction R V addr IS EX CM1 ldf f0,X(r1) f0 [f0] &X[0] c2 c3 c42 mulf f4,f0,f2 f4 [f4] c4 c5+ c8
ht 3 stf f4,Z(r1) &Z[0]4 add r1,r1,#85 ldf f0,X(r1)6 mulf f4,f0,f27 stf f4,Z(r1) START OVER
P6 Precise State Example: Cycle 12
RS# FU busy op T V1 V2 T1 T21 ALU Yes add ROB#4 REG[r1]2 load No3 store Yes stf ROB#3 REG[f4] REG[r1]4 FP1 No
Reg. Statusreg Tf0f2f4r1 ROB#4
CDBV T
hd tl
ROB + MOB# instruction R V addr IS EX CM1 ldf f0,X(r1) f0 [f0] &X[0] c2 c3 c42 mulf f4,f0,f2 f4 [f4] c4 c5+ c8
h 3 stf f4,Z(r1) &Z[0] c12t 4 add r1,r1,#8 r1
5 ldf f0,X(r1)6 mulf f4,f0,f27 stf f4,Z(r1)
5 FP2 No
COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II
37© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,
Vijaykumar, Lipasti
P6 Performance
in other words: what is the cost of precise interrupts?+ in general: same performance as plain Tomasulo+ maybe a little better (RS freed earlier -> fewer RS structural hazards)
– unless ROB is very small– in which case ROB structural hazards become a problem
• rules of thumb for ROB size• >= superscalar width (N) * number of stages between DS and RT• at least N * thit-L2• what’s the rationale behind these rules?
COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II
38© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,
Vijaykumar, Lipasti
P6 Organization Summary
• popular design for a while+ easy (relatively) to implement correctly• to recover, just zero out everything• examples: PentiumPro, PowerPC, AMD K6
COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II
39© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,
Vijaykumar, Lipasti
Problems with P6
value
V1 V2
T+
T1 T2T
valueR
==
FUT
– problems for high performance implementations• too much value movement (regfile/ROB→RS→ROB→regfile) • too many muxes/buses (values from too many places)• RS mixes values with control tags (long data paths make clock slow)
COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II
40© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,
Vijaykumar, Lipasti
Alternative Implementation: MIPS R10K
valueT+
T1+ T2+T
R
==
FUT
T ToldT
RS
dispatch
map table
ROB
PRF
head
tail
CDB.T
freelist
• separate control (ROB/RS) from data (registers/FU)• one big physical register file (PRF) holds all data → no copies• ROB and RS used only for control and tags → small• register file close to FUs, everything else is on the side
COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II
41© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,
Vijaykumar, Lipasti
R10K Register Renaming
• no architectural register file!• physical register file holds all values• #physical registers = #architectural registers + #ROB• map architectural registers to physical registers• removes WAW&WAR hazards (physical registers replace RS copies)
• register status table replaced by register map table• mappings cannot be 0 (there is no architectural register file)
• free list keeps track of unallocated physical registers• ROB responsible for returning physical registers to free list
• conceptually: true register renaming• have seen an example a few lectures ago ... here it is again ...
COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II
42© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,
Vijaykumar, Lipasti
Register Renaming Example
• names: r1,r2,r3, locations: l1,l2,l3,l4,l5,l6,l7• original mapping: r1→l1, r2→l2, r3→l3 (l4-l7 “free”)map tableraw instruction r1 r2 r3 free locations renamed instruction
l1 l2 l3 l4,l5,l6,l7add r1,r2,r3 l4 l2 l3 l5,l6,l7 add l4,l2,l3sub r3,r2,r1 l4 l2 l5 l6,l7 sub l5,l2,l4mul r1,r2,r3 l6 l2 l5 l7 mul l6,l2,l5div r2,r1,r3 l6 l7 l5 div l7,l6,l5
COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II
43© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,
Vijaykumar, Lipasti
Freeing Registers in R10K
issue: freeing physical registers• P6• no need to free speculative storage explicitly• temporary storage comes with ROB entry• RT: copy value from ROB to register file, free ROB entry
• R10K• can’t free physical register when writing instruction retires• no architectural register to copy value to• but...• we can free phys register previously mapped to same logical register• why? all instructions that will ever read that value have retired
COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II
44© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,
Vijaykumar, Lipasti
Freeing Registers in R10K
• when add commits: free l1• when sub commits: free l3• when mul commits: free ?• when div commits: free ?• see the pattern?
map tableraw instruction r1 r2 r3 free locations renamed instruction
l1 l2 l3 l4,l5,l6,l7add r1,r2,r3 l4 l2 l3 l5,l6,l7 add l4,l2,l3sub r3,r2,r1 l4 l2 l5 l6,l7 sub l5,l2,l4mul r1,r2,r3 l6 l2 l5 l7 mul l6,l2,l5div r2,r1,r3 l6 l7 l5 div l7,l6,l5
COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II
45© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,
Vijaykumar, Lipasti
R10K Pipeline
same pipeline structure: IF,DS,IS,EX,CM,RT• DS (dispatch)• (RS or ROB or MOB full or no physical registers) ? (stall) : • (allocate RS and ROB entries AND physical register)
• IS (issue)• (read physical registers)
• CM (completion)• (writeback destination register, mark ROB entry complete)
• RT (retire, commit, graduate)• (ROB head not complete) ? (stall) :• {if store then write MOB head to D$, handle any exceptions, free ROB/MOB entries, free previous physical register}
COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II
46© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,
Vijaykumar, Lipasti
R10K: Dispatch (DS)
valueT+
T1+ T2+T
R
==
FUT
T ToldT
RS
dispatch
map table
ROB
PRF
head
tail
CDB.T
freelist
• stall if no free RS, ROB, MOB or physical register (preg)• allocate RS and ROB entry• read physical register tags for input registers, store in RS• allocate new preg for output, set in RS, ROB and map table
COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II
47© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,
Vijaykumar, Lipasti
R10K: Complete (CM)
valueT+
T1+ T2+T
R
==
FUT
T ToldT
RS
dispatch
map table
ROB
PRF
head
tail
CDB.T
freelist
• wait for free CDB• broadcast tag on CDB.T• set instruction’s output register ready bit in map table• set ready bits for matching input tags in RS
COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II
48© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,
Vijaykumar, Lipasti
R10K: Retire (RT)
valueT+
T1+ T2+T
R
==
FUT
T ToldT
RS
dispatch
map table
ROB
PRF
head
tail
CDB.T
freelist
• stall until instruction at ROB head is complete• return Told of ROB head to free list• free ROB head entry
COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II
49© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,
Vijaykumar, Lipasti
COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II
50© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,
Vijaykumar, Lipasti
R10K Data Structures
RS# FU busy op T T1+ T2+1 ALU No2 load No3 store No4 FP1 No5 FP2 No
Map Tablereg T+f0 PR#1+f2 PR#2+f4 PR#3+r1 PR#4+
Free ListPR#5,PR#6, PR#7, PR#8
Told for freeing oldphysical register
always T mappings, evenwitn no in-flight instructions
hd tl
ROB + MOB# instruction T Told addr IS EX CM1 ldf f0,X(r1)2 mulf f4,f0,f23 stf f4,Z(r1)4 add r1,r1,#85 ldf f0,X(r1)6 mulf f4,f0,f27 stf f4,Z(r1)
CDBT
only tags in RS and CDBno values, so need ready bits
COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II
51© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,
Vijaykumar, Lipasti
R10K Example: Cycle 1
RS# FU busy op T T1+ T2+1 ALU No2 load Yes ldf PR#5 PR#4+3 store No4 FP1 No5 FP2 No
Map Tablereg T+f0 PR#5f2 PR#2+f4 PR#3+r1 PR#4+
Free ListPR#5, PR#6, PR#7, PR#8
hd tl
ROB + MOB# instruction T Told addr IS EX CM
ht 1 ldf f0,X(r1) PR#5 PR#1 &X[0]2 mulf f4,f0,f23 stf f4,Z(r1)4 add r1,r1,#85 ldf f0,X(r1)6 mulf f4,f0,f27 stf f4,Z(r1)
CDBT
allocate new physical register to f0remember old physical registermapped to f0
COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II
52© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,
Vijaykumar, Lipasti
R10K Example: Cycle 2
RS# FU busy op T T1+ T2+1 ALU No2 load Yes ldf PR#5 PR#4+3 store No4 FP1 Yes mulf PR#6 PR#5 PR#2+5 FP2 No
Map Tablereg T+f0 PR#5f2 PR#2+f4 PR#6r1 PR#4+
Free ListPR#6,
PR#7, PR#8
hd tl
ROB + MOB# instruction T Told addr IS EX CM
h 1 ldf f0,X(r1) PR#5 PR#1 &X[0] c2t 2 mulf f4,f0,f2 PR#6 PR#3
3 stf f4,Z(r1)4 add r1,r1,#85 ldf f0,X(r1)6 mulf f4,f0,f27 stf f4,Z(r1)
CDBT
allocate new physical register to f0remember old physical registermapped to f4
COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II
53© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,
Vijaykumar, Lipasti
R10K Example: Cycle 3
RS# FU busy op T T1+ T2+1 ALU No2 load No3 store Yes stf PR#6 PR#4+4 FP1 Yes mulf PR#6 PR#5 PR#2+5 FP2 No
Map Tablereg T+f0 PR#5f2 PR#2+f4 PR#6r1 PR#4+
Free List
PR#7, PR#8
hd tl
ROB + MOB# instruction T Told addr IS EX CM
h 1 ldf f0,X(r1) PR#5 PR#1 &X[0] c2 c32 mulf f4,f0,f2 PR#6 PR#3
t 3 stf f4,Z(r1) &Z[0]4 add r1,r1,#85 ldf f0,X(r1)6 mulf f4,f0,f27 stf f4,Z(r1)
CDBT
allocate RSfree RS
COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II
54© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,
Vijaykumar, Lipasti
R10K Example: Cycle 4
RS# FU busy op T T1+ T2+1 ALU Yes add PR#7 PR#4+2 load No3 store Yes stf PR#6 PR#4+4 FP1 Yes mulf PR#6 PR#5+ PR#2+5 FP2 No
Map Tablereg T+f0 PR#5+f2 PR#2+f4 PR#6r1 PR#7
Free List
PR#7, PR#8
hd tl
ROB + MOB# instruction T Told addr IS EX CM
h 1 ldf f0,X(r1) PR#5 PR#1 &X[0] c2 c3 c42 mulf f4,f0,f2 PR#6 PR#3 c43 stf f4,Z(r1) &Z[0]
t 4 add r1,r1,#8 PR#7 PR#45 ldf f0,X(r1)6 mulf f4,f0,f27 stf f4,Z(r1)
CDBT
PR#5
match PR#5 tag from CDB & issue
allocate RS
ldf finishedset ready bit, CDB broadcast
COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II
55© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,
Vijaykumar, Lipasti
R10K Example: Cycle 5
RS# FU busy op T T1+ T2+1 ALU Yes add PR#7 PR#4+2 load Yes ldf PR#8 PR#73 store Yes stf PR#6 PR#4+4 FP1 No5 FP2 No
Map Tablereg T+f0 PR#8f2 PR#2+f4 PR#6r1 PR#7
Free ListPR#1 PR#8
hd tl
ROB + MOB# instruction T Told addr IS EX CM1 ldf f0,X(r1) PR#5 PR#1 &X[0] c2 c3 c4
h 2 mulf f4,f0,f2 PR#6 PR#3 c4 c53 stf f4,Z(r1) &Z[0]4 add r1,r1,#8 PR#7 PR#4 c5
t 5 ldf f0,X(r1) PR#8 PR#56 mulf f4,f0,f27 stf f4,Z(r1)
CDBT
free RS
allocate RS
ldf retiresreturn PR#1 to free list,no one will ever read it again
R10K Precise State
problem with R10K way? precise state is more difficult– registers already written• but that’s OK• why? because there is no architectural register file• we can “free” written registers and restore old ones
• we need to restore register map table to the way it was• option 1: roll back ROB serially (slow)• option 2: restore from a checkpoint (fast)
COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II
56© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,
Vijaykumar, Lipasti
R10K Rollback Example: Cycle 5
RS# FU busy op T T1+ T2+1 ALU Yes add PR#7 PR#4+2 load Yes ldf PR#8 PR#73 store Yes stf PR#6 PR#4+4 FP1 No
Map Tablereg T+f0 PR#8f2 PR#2+f4 PR#6r1 PR#7
Free ListPR#1
hd tl
ROB + MOB# instruction T Told addr IS EX CM1 ldf f0,X(r1) PR#5 PR#1 &X[0] c2 c3 c4
h 2 mulf f4,f0,f2 PR#6 PR#3 c4 c53 stf f4,Z(r1) &Z[0]4 add r1,r1,#8 PR#7 PR#4 c5
t 5 ldf f0,X(r1) PR#8 PR#56 mulf f4,f0,f27 stf f4,Z(r1)
CDBT
undo instructions 3–5
5 FP2 No
COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II
57© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,
Vijaykumar, Lipasti
COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II
58© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,
Vijaykumar, Lipasti
R10K Rollback Example: Cycle 6
RS# FU busy op T T1+ T2+1 ALU Yes add PR#7 PR#4+2 load No3 store Yes stf PR#6 PR#4+4 FP1 No5 FP2 No
Map Tablereg T+f0 PR#5f2 PR#2+f4 PR#6r1 PR#7
Free ListPR#1, PR#8
hd tl
ROB + MOB# instruction T Told addr IS EX CM1 ldf f0,X(r1) PR#5 PR#1 &X[0] c2 c3 c4
h 2 mulf f4,f0,f2 PR#6 PR#3 c4 c5+3 stf f4,Z(r1) &Z[0]
t 4 add r1,r1,#8 PR#7 PR#4 c55 ldf f0,X(r1) PR#8 PR#56 mulf f4,f0,f27 stf f4,Z(r1)
CDBT
undo ldf1. free load RS2. free T (PR#8)3. restore f0 mapping to Told (PR#5)4. free ROB entry #5
R10K Rollback Example: Cycle 7
RS# FU busy op T T1+ T2+1 ALU No2 load No3 store Yes stf PR#6 PR#4+4 FP1 No
Map Tablereg T+f0 PR#5f2 PR#2+f4 PR#6r1 PR#4
Free ListPR#1,PR#8
PR#7
hd tl
ROB + MOB# instruction T Told addr IS EX CM1 ldf f0,X(r1) PR#5 PR#1 &X[0] c2 c3 c4
h 2 mulf f4,f0,f2 PR#6 PR#3 c4 c5+t 3 stf f4,Z(r1) &Z[0]
4 add r1,r1,#8 PR#7 PR#4 c55 ldf f0,X(r1)6 mulf f4,f0,f27 stf f4,Z(r1)
CDBT
undo add1. free ALU RS2. free T (PR#7)3. restore r1 mapping to Told (PR#3)4. free ROB entry #4
5 FP2 No
COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II
59© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,
Vijaykumar, Lipasti
R10K Rollback Example: Cycle 8
Map Tablereg T+f0 PR#5f2 PR#2+f4 PR#6r1 PR#4
Free ListPR#1,PR#8
PR#7
hd tl
ROB + MOB# instruction T Told addr IS EX CM1 ldf f0,X(r1) PR#5 PR#1 &X[0] c2 c3 c4
ht 2 mulf f4,f0,f2 PR#6 PR#3 c4 c5+ c83 stf f4,Z(r1)4 add r1,r1,#85 ldf f0,X(r1)6 mulf f4,f0,f27 stf f4,Z(r1)
CDBT
undo stf1. free store RS2. free ROB entry #33. no registers to restore/free4. how is store undone?
RS# FU busy op T T1+ T2+1 ALU No2 load No3 store No4 FP1 No
5 FP2 NoCOMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II
60© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,
Vijaykumar, Lipasti
P6 vs. R10K
• R10K style is more popular today• e.g., Alpha, Pentium4
feature P6 R10Kvalue storage RF, ROB, RS PRF
register read DS: RF/ROB → RS IS: PRF → FU
register write RT: ROB → RF CM: FU → PRF
spec value free RT: automatic with ROB when overwriting instructionretires
datapaths RF/ROB→RS, RS→FU, FU→ROB, ROB→RF
PRF→FU, FU→PRF
precise state simple, zero all structures complex: serial/checkpoint
COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II
61© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,
Vijaykumar, Lipasti
Advanced Topic: Load Scheduling
• all instructions except for loads are easy in Tomasulo• register inputs only• register renaming captures all dependences• tags tell you exactly when you can execute
• loads not so easy• must check for older active stores with same address• register renaming doesn’t tell you that
COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II
62© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,
Vijaykumar, Lipasti
The Data Memory FU
COMP© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,
Vijaykumar, Lipasti
address valueL/S
hea
tail
==
D$/TLB
MOB
data
_out
data
_in
addr
ess
• MOB+D$ = memory FU• just like any other FU• 2 reg inputs: addr, data_in• 1 reg output: data_out
• what actually happens?
d
SCI 220 / ECE 252 Lecture NotesDynamic Scheduling II
63
Store/Load Dispatch
COMP© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,
Vijaykumar, Lipasti
address valueL/S
hea
tail
==
D$/TLB
MOB
data
_out
data
_in
addr
ess
• allocate MOB entry (tail)• indicate store/load• remember MOB# in RS
d
SCI 220 / ECE 252 Lecture NotesDynamic Scheduling II
64
Store/Load Retire
COMP© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,
Vijaykumar, Lipasti
address valueL/S
hea
tail
==
D$/TLB
MOB
data
_out
data
_in
addr
ess
• free MOB entry (head)• load? • done
• store?• address+value to D$/TLB
d
SCI 220 / ECE 252 Lecture NotesDynamic Scheduling II
65
Store Execute
COMP© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,
Vijaykumar, Lipasti
address valueL/S
hea
tail
==
D$/TLB
MOB
data
_out
data
_in
addr
ess
• address+value to MOB• can be done separately• e.g., PentiumII: store→2µops
d
SCI 220 / ECE 252 Lecture NotesDynamic Scheduling II
66
Load Execute
COMP© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,
Vijaykumar, Lipasti
address valueL/S
hea
tail
==
D$/TLB
MOB
data
_out
data
_in
addr
ess
in parallel• address to D$• read value
• address to MOB• compare to older store addrs
• no match ⇒ D$ value• match +value ⇒ MOB value
• multiple matches? youngest• forwarding or bypassing• same latency as D$ hit (why?)
• match –value ⇒ stall• try executing load again later
d
SCI 220 / ECE 252 Lecture NotesDynamic Scheduling II
67
Memory Disambiguation Problem
© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,
Vijaykumar, Lipasti
address L/S
==
addr
ess
lo
h
at load execution• what if older store address unknown?
• how to determine match?• can’t determine, have to guess• called “memory disambiguation”
• what if older load address unknown?• who cares
ad
ead
COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II
68
Memory Disambiguation Alternatives
• conservative: loads in-order with respect to stores• don’t know address? assume match, wait+ pretty simple– many unnecessary waits on non-matching stores
• opportunistic: out-of-order loads• don’t know address? assume no match, go+ higher performance (most cases are not matching)– mis-speculations: went too soon? recover (complex+expensive!)
• selective: combination of conservative and opportunistic• start out opportunistic• load mis-speculation? remember PC in table, next time conservative+ pretty accurate prediction ⇒ pretty good performance
COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II
69© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,
Vijaykumar, Lipasti
Pentium II vs. Pentium4
Pentium II Pentium4Peak Clock 450MHz (Xeon) 2.4 (4.8 internal) GHzPipeline stages 14 22
Branch Prediction 512 entry local 2K entry hybridBTB 512 entries 4K entriesI$ 16KB 64KB T$ + 8KB I$D$ 16KB 8KBL2 512KB–2MB 256KB–2MB
Fetch Width 16 bytes 3 µops (16 bytes on miss)Rename/Retire Width 3 µops 3 µops
Execute Width 5 µops 7 µops (X2)Reservation Stations 20 60
ROB Size 40 128Register Renaming P6-style MIPS R10K-style
Memory Disambiguation Conservative Predictor-BasedAnything else? No Multithreading
COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II
70© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,
Vijaykumar, Lipasti
Limits of Instruction Level Parallelism (ILP)
• no need to build bigger superscalar if there isn’t ILP• ultimately, how much ILP is there?
• ILP study• assume perfect/infinite hardware• successively refine to more realistic hardware• e.g.: [Wall’88], [Wilson+Lam’92]
• some (surprising) results• perfect/infinite/single-cycle everything: int ILP: >50!, FP ILP: >150!!• on actual machines: int ILP: ~2!!, FP ILP: ~3!• culprits? branch prediction, memory latency, finite ROB, issue width• read on your own (H+P, 3.8, 3.9)
COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II
71© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,
Vijaykumar, Lipasti
Summary
• dynamic scheduling + precise state + speculation• re-order buffer (ROB)• WB ⇒ CM (out-of-order) + RT (in-order)• P6 vs. R10K• how is recovery done?
• load scheduling using • MOB+D$, bypassing values from MOB• memory disambiguation
• ILP limits• read on your own
next up: static (compiler) exploitation of ILP
COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II
72© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,
Vijaykumar, Lipasti