lecture 8: data-capture instruction schedulers. the goal is to execute instructions in dataflow...
TRANSCRIPT
Advanced MicroarchitectureLecture 8: Data-Capture Instruction Schedulers
2
Out-of-Order Execution• The goal is to execute instructions in
dataflow order as opposed to the sequential order specified by the ISA
• The renamer plays a critical role by removing all of the false register dependencies
• The scheduler is responsible for:– for each instruction, detecting when all
dependencies have been satisifed (and therefore ready to execute)
– propagating readiness information between instructions
Lecture 8: Data-Capture Instruction Schedulers
3
Out-of-Order Execution (2)
Lecture 8: Data-Capture Instruction Schedulers
StaticProgram
Fetc
hDynamic
InstructionStream
Renam
e
RenamedInstruction
Stream
Sch
edule
DynamicallyScheduled
Instructions
Out-of-order =out of the originalsequential order
4
Superscalar != Out-of-Order
Lecture 8: Data-Capture Instruction Schedulers
A: R1 = Load 16[R2]B: R3 = R1 + R4C: R6 = Load 8[R9]D: R5 = R2 – 4E: R7 = Load 20[R5]F: R4 = R4 – 1G: BEQ R4, #0
C
D
E
cach
e m
iss
B
C
D
E
F
G
10 cycles
B
F
G
7 cycles
A
B
C D
E
F
G
C
D E
F
G
B
5 cycles
B C
D
E F
G
8 cycles
A
cach
e m
iss
1-wideIn-Order
A
cach
e m
iss
2-wideIn-Order
A
1-wideOut-of-Order
A
cach
e m
iss
2-wideOut-of-Order
5
Data-Capture Scheduler• At dispatch, instruction
read all available operands from the register files and store a copy in the scheduler
• Unavailable operands will be “captured” from the functional unit outputs
• When ready, instructions can issue directly from the scheduler without reading additional operands from any other register files
Lecture 8: Data-Capture Instruction Schedulers
Fetch &Dispatch
ARF PRF/ROB
Data-CaptureScheduler
FunctionalUnits
Physica
l reg
ister u
pdate
Bypass
6
Non-Data-Capture Scheduler
Lecture 8: Data-Capture Instruction Schedulers
Fetch &Dispatch
ARF PRF
Scheduler
FunctionalUnits
Physica
l reg
ister
up
date
More on thisnext lecture!
7
Components of a Scheduler
Lecture 8: Data-Capture Instruction Schedulers
A B
C
D
F
E
G
Buffer for unexecutedinstructions
Method for trackingstate of dependencies(resolved or not)
Arbiter BMethod for choosing between multiple ready instructions competing for the same resource
Method for notificationof dependency resolution
“Scheduler Entries” or“Issue Queue” (IQ) or“Reservation Stations” (RS)
8
Lather, Rinse, Repeat…• Scheduling Loop or Wakeup-Select Loop
– Wake-Up Part:• Instructions selected for execution notify dependents
(children) that the dependency has been resolved• For each instruction, check whether all input
dependencies have been resolved– if so, the instruction is “woken up”
– Select Part:• Choose which instructions get to execute
– If 3 add instructions are ready at the same time, but there are only two adders, someone must wait to try again in the next cycle (and again, and again until selected)
Lecture 8: Data-Capture Instruction Schedulers
9
Scalar Scheduler (Issue Width = 1)
Lecture 8: Data-Capture Instruction Schedulers
T14T16
T39T6
T17T39
T15T39
=
=
=
=
=
=
=
=
T39
T8
T17
T42
Sele
ct Log
ic
To E
xecu
te Lo
gic
Tag B
roadca
st Bus
10
Tags,ReadyLogic
SelectLogic
Superscalar Scheduler (detail of one entry)
Lecture 8: Data-Capture Instruction Schedulers
TagBroadcast
Buses
====
====
bid
grants
SrcL RdyLValL IssuedSrcL RdyLValL Dst
11
Interaction with Execution
Lecture 8: Data-Capture Instruction Schedulers
A
Sele
ct Log
ic
SRD SL opcode ValL ValR
ValL ValR
ValL ValRValL ValR
Payload RAM
12
Again, But Superscalar
Lecture 8: Data-Capture Instruction Schedulers
A
B
Sele
ct Log
ic
SR
SR
D
D
SL
SL
opcode ValL ValR
ValL ValR
ValL ValRValL ValR
opcode ValL ValR
ValL ValR
ValL ValR
The schedulercaptures thedata, hence “Data-Capture”
13
Issue Width• Maximum number of instructions selected
for execution each cycle is the issue width– Previous slide showed an issue width of two– The slide before that showed the details of a
scheduler entry for an issue width of four• Hardware requirements:
– Typically, an Issue Width of N requires N tag broadcast buses
– Not always true: can specialize such that, for example, one “issue slot” can only handle branches
Lecture 8: Data-Capture Instruction Schedulers
14
Pipeline Timing
Lecture 8: Data-Capture Instruction Schedulers
Select Payload
Wakeup
A: Execute
CaptureB:
tag broadcastresultbroadcast
enablecapture
on tag match
Select Payload Execute
WakeupC:enablecapture
tag broadcast
Cycle i Cycle i+1
A
B
C
15
Pipelined Timing
Lecture 8: Data-Capture Instruction Schedulers
Select PayloadA: Execute
CaptureB:
tag broadcastresultbroadcast
enablecapture
Select Payload Execute
CaptureC:enablecapture
tag broadcast
Cycle i Cycle i+1
Select Payload Execute
Cycle i+2 Cycle i+3
Wakeup
Wakeup
A
B
CCan’t read and writepayload RAM at the
same time; may needto bypass the results
16
Pipelined Timing (2)
• Previous slide placed the pipeline boundary at the writing of the ready bits
• This slide shows a pipeline where latches are placed right before the tag broadcast
Lecture 8: Data-Capture Instruction Schedulers
Select PayloadA: Execute
CaptureB:
tag broadcastresultbroadcast
enablecapture
Select Payload Execute
Cycle i Cycle i+1 Cycle i+2
Wakeup
A
B
Wakeup
17
More Pipelined Timing
Lecture 8: Data-Capture Instruction Schedulers
Select PayloadA: Execute
CaptureB:tag broadcast
result broadcastand bypass
enablecapture
C:
Cycle i
Wakeup
Select
Wakeup
Payload Execute
Select Payload Exec
Capture
Cycle i+1 Cycle i+2 Cycle i+3
CaptureWakeuptag match
on firstoperand tag match
on secondoperand
(now C is ready)No simultaneous
read/write!
A
B
C
Need a secondlevel of bypassing
18
More-er Pipelined Timing
Lecture 8: Data-Capture Instruction Schedulers
Select PayloadA: Execute
CaptureC:
Cycle i
Wakeup
i+1 i+2 i+3
Select Payload Execute
Wakeup Capture
Select Payload Ex
i+4 i+5
D:
A
C
B
D
Wakeup Capture
B: Select Select Payload Execute
A&B bothready, onlyA selected,B bids again
AC and CD mustbe bypassed, butno bypass for BD
Dependent instructionscannot execute in
back-to-back cycles!
19
Critical Loops• Wakeup-Select Loop cannot be trivially
pipelined while maintaining back-to-back execution of dependent instructions
Lecture 8: Data-Capture Instruction Schedulers
A
B
C
A
B
C
RegularScheduling
No Back-to-Back• Worst-case IPC reduction by ½
• Shouldn’t be that bad (previous slide had IPC of 4/3)
• Studies indicate 10-15% IPC penalty
• “Loose Loops Sink Chips”, Borch et al.
20
IPC vs. Frequency• 10-15% IPC drop doesn’t seem bad if we
can double the clock frequency
Lecture 8: Data-Capture Instruction Schedulers
1000ps 500ps500ps2.0 IPC, 1GHz 1.7 IPC, 2GHz
2 BIPS 3.4 BIPS• Frequency doesn’t double
– latch/pipeline overhead– unbalanced stages
• Other sources of IPC penalties– branches: ↑ pipe depth, ↓ predictor size, ↑ predict-to-
update latency– caches/memory: same time in seconds, ↑ frequency,
more cycles• Power limitations: more logic, higher frequency
P=½CV2f
900ps 450ps 450ps
900ps 350 550
1.5GHz
21
Select Logic• Goal: minimize DFG height (execution time)• NP-Hard
– Precedence Constrained Scheduling Problem– Even harder because the entire DFG is not
known at scheduling time• Scheduling decisions made now may affect the
scheduling of instructions not even fetched yet
• Heuristics– For performance– For ease of implementation
Lecture 8: Data-Capture Instruction Schedulers
22
Simple Select Logic
Lecture 8: Data-Capture Instruction Schedulers
Scheduler Entries
1
S entriesyields O(S)gate delay
Grant0 = 1Grant1 = !Bid0
Grant2 = !Bid0 & !Bid1
Grant3 = !Bid0 & !Bid1 & !Bid2
Grantn-1 = !Bid0 & … & !Bidn-21
x0
x1
x2
x3
x4
x5
x6
x7
x8
grant0
xi = Bidi
granti
grant1
grant2
grant3
grant4
grant5
grant6
grant7
grant8
grant9
O(log S) gates
23
Simple Select Logic• Instructions may be located in scheduler
entries in no particular order– The first ready entry may be the oldest,
youngest, anywhere in between– Simple select results in a “random” schedule
• the schedule is still “correct” in that no dependencies are violated
• it just may be far from optimal
Lecture 8: Data-Capture Instruction Schedulers
24
Oldest First Select• Intuition:
– An instruction that has just entered the scheduler will likely have few, if any, dependents (only intra-group)
– Similarly, the instructions that have been in the scheduler the longest (the oldest) are likely to have the most dependents
– Selecting the oldest instructions has a higher chance of satisfying more dependencies, thus making more instructions ready to execute more parallelism
Lecture 8: Data-Capture Instruction Schedulers
25
Implementing Oldest First Select
Lecture 8: Data-Capture Instruction Schedulers
BA
CDEF
H
Write instructionsinto scheduler inprogram order
G
EF
HG
Compress Up
KL
EF
HG
IJ
Newlydispatched
IJ
BD
26
Implementing Oldest First Select (2)• Compressing buffers are very complex
– gates, wiring, area, power
Lecture 8: Data-Capture Instruction Schedulers
Ex. 4-wideNeed up toshift by 4
An entireinstruction’s
worth ofdata: tags,opcodes,
immediates,readiness, etc.
27
Implementing Oldest First Select (3)
Lecture 8: Data-Capture Instruction Schedulers
B
C
A
D
E
F
H
G
4
6053172
0
3
∞
22
0
0
Age-Aware Select Logic
Grant
28
Handling Multi-Cycle Instructions
Lecture 8: Data-Capture Instruction Schedulers
Sched PayLd Exec
Sched PayLd Exec
Add R1 = R2 + R3
Xor R4 = R1 ^ R5
Sched PayLd Exec Add R4 = R1 + R5
Sched PayLd Exec Mul R1 = R2 × R3Exec Exec
Add attemps to execute too early!Result not ready for another two cycles.
29
Delayed Tag Broadcast
• It works, but…– Must make sure tag broadcast bus will be
available N cycles in the future when needed– Bypass, data-capture potentially get more
complex
Lecture 8: Data-Capture Instruction Schedulers
Sched PayLd Exec Add R4 = R1 + R5
Sched PayLd Exec Mul R1 = R2 × R3Exec Exec
Assume pipelined such that tagbroadcast occurs at cycle boundary
30
Delayed Tag Broadcast (2)
Lecture 8: Data-Capture Instruction Schedulers
Sched PayLd Exec Add R4 = R1 + R5
Sched PayLd Exec Mul R1 = R2 × R3Exec Exec
Sched PayLd Exec Sub R7 = R8 – #1
Sched PayLd Exec Xor R9 = R9 ^ R6
Assumeissue width
equals 2
In this cycle, three instructionsneed to broadcast their tags!
31
Delayed Tag Broadcast (3)• Possible solutions
1. Have one select for issuing, another select for tag broadcast• messes up timing of data-capture
2. Pre-reserve the bus• select logic more complicated, must track usage in
future cycles in addition to the current cycle
3. Hold the issue slot from initial launch until tag broadcast
Lecture 8: Data-Capture Instruction Schedulers
sch payl exec exec exec
Issue width effectively reduced by one for three cycles
32
Delayed Wakeup• Push the delay to the consumer
Lecture 8: Data-Capture Instruction Schedulers
=
Tag Broadcast forR1 = R2 × R3
R1
=
R4
R5 = R1 + R4
ready!
Tag arrives, but we waitthree cycles beforeacknowledging it
Also need to know parent’s latency
33
Non-Deterministic Latencies• Problem with previous approaches is that
they assume that all instruction latencies are known at the time of scheduling– Makes things uglier for the delayed broadcast– This pretty much kills the delayed wakeup
approach• Examples
– Load instructions• Latency {L1_lat, L2_lat, L3_lat, DRAM_lat}
– DRAM_lat is not a constant either, queuing delays– Some architecture specific cases
• PowerPC 603 has a “early out” for multiplication with a low-bit-width multiplicand
• Intel Core 2’s divider also has an early out
Lecture 8: Data-Capture Instruction Schedulers
34
The Wait-and-See Approach• Just wait and see whether a load hits or
misses in the cache
Lecture 8: Data-Capture Instruction Schedulers
Sched PayLd Exec
R2 = R1 + #4Sched PayLd Exec
R1 = 16[$sp]
Exec Exec Cache hit known,can broadcast tag
Load-to-Use latencyincreases by 2 cycles(3 cycle load appears as 5)
Scheduler
DL1 Tags
DL1Data
May be able todesign cache s.t.hit/miss knownbefore data
Exec Exec Exec
Sched PayLd Exec
Penalty reduced to 1 cycle
35
Load-Hit Speculation• Caches work pretty well
– hit rates are high, otherwise caches wouldn’t be too useful
– Just assume all loads will hit in the cache
Lecture 8: Data-Capture Instruction Schedulers
Sched PayLd Exec R2 = R1 + #4
Sched PayLd Exec R1 = 16[$sp]Exec Exec Cache hit,data forwardedBroadcast delayed
by DL1 latency
• Um, ok, what happens when there’s a load miss?
36
Load Miss Scenario
Lecture 8: Data-Capture Instruction Schedulers
Sched PayLd Exec
Sched PayLd Exec Exec ExecBroadcast delayedby DL1 latency
Exec … Exec
Cache Miss Detected!
Value at cache output is bogus
Invalidate the instruction(ALU output ignored)
Sched PayLd Exec
Rescheduled assuminga hit at the DL2 cache
There could be a miss at the L2, and again at the L3 cache. A single load can waste multiple issuing opportunities.
Each mis-scheduling wastes an issue slot:the tag broadcast bus, payload RAM read port, writeback/bypass bus, etc. could have been used for another instruction
Broadcast delayed by L2 latency
L2 hit
37
Scheduler Deallocation• Normally, as soon as an instruction issues,
it can vacate its scheduler entry– The sooner an entry is deallocated, the sooner
another instruction can reuse that entry leads to that instruction executing earlier
• In the case of a load, the load must hang around in the scheduler until it can be guaranteed that it will not have to rebroadcast its destination tag– Decreases the effective size of the scheduler
Lecture 8: Data-Capture Instruction Schedulers
38
“But wait, there’s more!”
Lecture 8: Data-Capture Instruction Schedulers
Sched PayLd Exec
Sched PayLd Exec Exec Exec
DL1 Miss
Sched PayLd Exec
Sched PayLd Exec
Sched PayLd Exec
Squash
Not only children getsquashed, there may be grand-children to squash as well
All waste issue slotsAll must be rescheduledAll waste powerNone may leave scheduler until load hit known
Sched PayLd Ex
Sched PayLd Ex
Sched PayLd Ex
Sched PayLd Exec
39
Squashing• The number of cycles worth of dependents
that must be squashed is equal to Cache-Miss-Known latency minus one– previous example, miss-known latency = 3
cycles, there are two cycles worth of mis-scheduled dependents
– Early miss-detection helps reduce mis-scheduled instructions
– A load may have many children, but the issue width limits how many can possibly be mis-scheduled• Max = Issue-width × (Miss-known-lat – 1)
Lecture 8: Data-Capture Instruction Schedulers
40
Squashing (2)• Simple: squash anything “in-flight”
between schedule and execute
Lecture 8: Data-Capture Instruction Schedulers
SchedPayLd Exec Exec Exec
SchedPayLd Exec
SchedPayLd Exec
SchedPayLd Exec
SchedPayLd Exec
SchedPayLd Exec
SchedPayLd Exec
SchedPayLd Exec
• This may include non-dependent instructions
• All instructions must stay in the scheduler for a few extra cycles to make sure they will not be rescheduled due to a squash
41
Squashing (3)• Selective squashing: use “load colors”
– each load is assigned a unique color– every dependent “inherits” its parents’ colors– on a load miss, the load broadcasts its color and
anyone in the same color group gets squashed• An instruction may end up with many
colors
Lecture 8: Data-Capture Instruction Schedulers
… …Explicitly tracking each color wouldrequire a huge number of comparisons
42
Squashing (4)• Can list “colors” in unary (bit-vector) form
Lecture 8: Data-Capture Instruction Schedulers
Load R1 = 16[R2]1 0 0 0 0 0 0 0
Add R3 = R1 + R41 0 0 0 0 0 0 0
Load R5 = 12[R7]1 0 0 0 0 0 00
Load R8 = 0[R1]1 0 1 0 0 0 0 0
Load R7 = 8[R4]100 0 0 0 00
Add R6 = R8 + R71 0 1 0 0 0 01
• Each instruction’s color vector is the bitwise OR of its parents’ vectors
• A load miss now only squashes the dependent instructions
• Hardware cost increases quadratically with number of load instructions
X
X
X
43
Allocation• Allocate in-order,
Deallocate in-order• Very simple!• Smaller effective
scheduler size– instructions may have
already executed out-of-order, but their RS entries cannot be reused
– Can be very bad if a load goes to main memory
Lecture 8: Data-Capture Instruction Schedulers
Circular Buffer
Head
Tail
Tail
44
Allocation (2)• With arbitrary
placement, entries much better utilized
• Allocator more complex– must scan availability and
find N free entries• Write logic more
complex– must route N instructions
to arbitrary entries of the scheduler
Lecture 8: Data-Capture Instruction Schedulers
RS A
lloca
tor
Entry availabilitybit-vector
0000100100000111
45
Allocation (3)• Segment the entries
– only one entry per segment may be allocated per cycle
– instead of 4-of-16 alloc (previous slide), each allocator only does 1-of-4
– write logic simplified as well
• Still possible inefficiencies– full segment leads to
allocating less than N instructions per cycleLecture 8: Data-Capture Instruction Schedulers
0010101000010110
Allo
cA
lloc
Allo
cA
lloc
A
B
C
DX
Free RS entries exist, justnot in the correct segment
0