lecture 8: data-capture instruction schedulers. the goal is to execute instructions in dataflow...

Advanced MicroarchitectureLecture 8: Data-Capture Instruction Schedulers

2

Out-of-Order Execution• The goal is to execute instructions in

dataflow order as opposed to the sequential order specified by the ISA

• The renamer plays a critical role by removing all of the false register dependencies

• The scheduler is responsible for:– for each instruction, detecting when all

dependencies have been satisifed (and therefore ready to execute)

– propagating readiness information between instructions

Lecture 8: Data-Capture Instruction Schedulers

3

Out-of-Order Execution (2)


StaticProgram

Fetc

hDynamic

InstructionStream

Renam

e

RenamedInstruction

Stream

Sch

edule

DynamicallyScheduled

Instructions

Out-of-order =out of the originalsequential order

4

Superscalar != Out-of-Order


A: R1 = Load 16[R2]B: R3 = R1 + R4C: R6 = Load 8[R9]D: R5 = R2 – 4E: R7 = Load 20[R5]F: R4 = R4 – 1G: BEQ R4, #0

C

D

E

cach

e m

iss

B

C

D

E

F

G

10 cycles

B

F

G

7 cycles

A

B

C D

E

F

G

C

D E

F

G

B

5 cycles

B C

D

E F

G

8 cycles

A

cach

e m

iss

1-wideIn-Order

A

cach

e m

iss

2-wideIn-Order

A

1-wideOut-of-Order

A

cach

e m

iss

2-wideOut-of-Order

5

Data-Capture Scheduler• At dispatch, instruction

read all available operands from the register files and store a copy in the scheduler

• Unavailable operands will be “captured” from the functional unit outputs

• When ready, instructions can issue directly from the scheduler without reading additional operands from any other register files


Fetch &Dispatch

ARF PRF/ROB

Data-CaptureScheduler

FunctionalUnits

Physica

l reg

ister u

pdate

Bypass

6

Non-Data-Capture Scheduler


Fetch &Dispatch

ARF PRF

Scheduler

FunctionalUnits

Physica

l reg

ister

up

date

More on thisnext lecture!

7

Components of a Scheduler


A B

C

D

F

E

G

Buffer for unexecutedinstructions

Method for trackingstate of dependencies(resolved or not)

Arbiter BMethod for choosing between multiple ready instructions competing for the same resource

Method for notificationof dependency resolution

“Scheduler Entries” or“Issue Queue” (IQ) or“Reservation Stations” (RS)

8

Lather, Rinse, Repeat…• Scheduling Loop or Wakeup-Select Loop

– Wake-Up Part:• Instructions selected for execution notify dependents

(children) that the dependency has been resolved• For each instruction, check whether all input

dependencies have been resolved– if so, the instruction is “woken up”

– Select Part:• Choose which instructions get to execute

– If 3 add instructions are ready at the same time, but there are only two adders, someone must wait to try again in the next cycle (and again, and again until selected)


9

Scalar Scheduler (Issue Width = 1)


T14T16

T39T6

T17T39

T15T39

=

=

=

=

=

=

=

=

T39

T8

T17

T42

Sele

ct Log

ic

To E

xecu

te Lo

gic

Tag B

roadca

st Bus

10

Tags,ReadyLogic

SelectLogic

Superscalar Scheduler (detail of one entry)


TagBroadcast

Buses

====

====

bid

grants

SrcL RdyLValL IssuedSrcL RdyLValL Dst

11

Interaction with Execution


A

Sele

ct Log

ic

SRD SL opcode ValL ValR

ValL ValR

ValL ValRValL ValR

Payload RAM

12

Again, But Superscalar


A

B

Sele

ct Log

ic

SR

SR

D

D

SL

SL

opcode ValL ValR

ValL ValR

ValL ValRValL ValR

opcode ValL ValR

ValL ValR

ValL ValR

The schedulercaptures thedata, hence “Data-Capture”

13

Issue Width• Maximum number of instructions selected

for execution each cycle is the issue width– Previous slide showed an issue width of two– The slide before that showed the details of a

scheduler entry for an issue width of four• Hardware requirements:

– Typically, an Issue Width of N requires N tag broadcast buses

– Not always true: can specialize such that, for example, one “issue slot” can only handle branches


14

Pipeline Timing


Select Payload

Wakeup

A: Execute

CaptureB:

tag broadcastresultbroadcast

enablecapture

on tag match

Select Payload Execute

WakeupC:enablecapture

tag broadcast

Cycle i Cycle i+1

A

B

C

15

Pipelined Timing


Select PayloadA: Execute

CaptureB:


enablecapture


CaptureC:enablecapture

tag broadcast

Cycle i Cycle i+1


Cycle i+2 Cycle i+3

Wakeup

Wakeup

A

B

CCan’t read and writepayload RAM at the

same time; may needto bypass the results

16

Pipelined Timing (2)

• Previous slide placed the pipeline boundary at the writing of the ready bits

• This slide shows a pipeline where latches are placed right before the tag broadcast



CaptureB:


enablecapture


Cycle i Cycle i+1 Cycle i+2

Wakeup

A

B

Wakeup

17

More Pipelined Timing



CaptureB:tag broadcast

result broadcastand bypass

enablecapture

C:

Cycle i

Wakeup

Select

Wakeup

Payload Execute

Select Payload Exec

Capture

Cycle i+1 Cycle i+2 Cycle i+3

CaptureWakeuptag match

on firstoperand tag match

on secondoperand

(now C is ready)No simultaneous

read/write!

A

B

C

Need a secondlevel of bypassing

18

More-er Pipelined Timing



CaptureC:

Cycle i

Wakeup

i+1 i+2 i+3


Wakeup Capture

Select Payload Ex

i+4 i+5

D:

A

C

B

D

Wakeup Capture

B: Select Select Payload Execute

A&B bothready, onlyA selected,B bids again

AC and CD mustbe bypassed, butno bypass for BD

Dependent instructionscannot execute in

back-to-back cycles!

19

Critical Loops• Wakeup-Select Loop cannot be trivially

pipelined while maintaining back-to-back execution of dependent instructions


A

B

C

A

B

C

RegularScheduling

No Back-to-Back• Worst-case IPC reduction by ½

• Shouldn’t be that bad (previous slide had IPC of 4/3)

• Studies indicate 10-15% IPC penalty

• “Loose Loops Sink Chips”, Borch et al.

20

IPC vs. Frequency• 10-15% IPC drop doesn’t seem bad if we

can double the clock frequency


1000ps 500ps500ps2.0 IPC, 1GHz 1.7 IPC, 2GHz

2 BIPS 3.4 BIPS• Frequency doesn’t double

– latch/pipeline overhead– unbalanced stages

• Other sources of IPC penalties– branches: ↑ pipe depth, ↓ predictor size, ↑ predict-to-

update latency– caches/memory: same time in seconds, ↑ frequency,

more cycles• Power limitations: more logic, higher frequency

P=½CV2f

900ps 450ps 450ps

900ps 350 550

1.5GHz

21

Select Logic• Goal: minimize DFG height (execution time)• NP-Hard

– Precedence Constrained Scheduling Problem– Even harder because the entire DFG is not

known at scheduling time• Scheduling decisions made now may affect the

scheduling of instructions not even fetched yet

• Heuristics– For performance– For ease of implementation


22

Simple Select Logic


Scheduler Entries

1

S entriesyields O(S)gate delay

Grant0 = 1Grant1 = !Bid0

Grant2 = !Bid0 & !Bid1

Grant3 = !Bid0 & !Bid1 & !Bid2

Grantn-1 = !Bid0 & … & !Bidn-21

x0

x1

x2

x3

x4

x5

x6

x7

x8

grant0

xi = Bidi

granti

grant1

grant2

grant3

grant4

grant5

grant6

grant7

grant8

grant9

O(log S) gates

23

Simple Select Logic• Instructions may be located in scheduler

entries in no particular order– The first ready entry may be the oldest,

youngest, anywhere in between– Simple select results in a “random” schedule

• the schedule is still “correct” in that no dependencies are violated

• it just may be far from optimal


24

Oldest First Select• Intuition:

– An instruction that has just entered the scheduler will likely have few, if any, dependents (only intra-group)

– Similarly, the instructions that have been in the scheduler the longest (the oldest) are likely to have the most dependents

– Selecting the oldest instructions has a higher chance of satisfying more dependencies, thus making more instructions ready to execute more parallelism


25

Implementing Oldest First Select


BA

CDEF

H

Write instructionsinto scheduler inprogram order

G

EF

HG

Compress Up

KL

EF

HG

IJ

Newlydispatched

IJ

BD

26

Implementing Oldest First Select (2)• Compressing buffers are very complex

– gates, wiring, area, power


Ex. 4-wideNeed up toshift by 4

An entireinstruction’s

worth ofdata: tags,opcodes,

immediates,readiness, etc.

27

Implementing Oldest First Select (3)


B

C

A

D

E

F

H

G

4

6053172

0

3

∞

22

0

0

Age-Aware Select Logic

Grant

28

Handling Multi-Cycle Instructions


Sched PayLd Exec

Sched PayLd Exec

Add R1 = R2 + R3

Xor R4 = R1 ^ R5

Sched PayLd Exec Add R4 = R1 + R5

Sched PayLd Exec Mul R1 = R2 × R3Exec Exec

Add attemps to execute too early!Result not ready for another two cycles.

29

Delayed Tag Broadcast

• It works, but…– Must make sure tag broadcast bus will be

available N cycles in the future when needed– Bypass, data-capture potentially get more

complex




Assume pipelined such that tagbroadcast occurs at cycle boundary

30

Delayed Tag Broadcast (2)




Sched PayLd Exec Sub R7 = R8 – #1

Sched PayLd Exec Xor R9 = R9 ^ R6

Assumeissue width

equals 2

In this cycle, three instructionsneed to broadcast their tags!

31

Delayed Tag Broadcast (3)• Possible solutions

1. Have one select for issuing, another select for tag broadcast• messes up timing of data-capture

2. Pre-reserve the bus• select logic more complicated, must track usage in

future cycles in addition to the current cycle

3. Hold the issue slot from initial launch until tag broadcast


sch payl exec exec exec

Issue width effectively reduced by one for three cycles

32

Delayed Wakeup• Push the delay to the consumer


=

Tag Broadcast forR1 = R2 × R3

R1

=

R4

R5 = R1 + R4

ready!

Tag arrives, but we waitthree cycles beforeacknowledging it

Also need to know parent’s latency

33

Non-Deterministic Latencies• Problem with previous approaches is that

they assume that all instruction latencies are known at the time of scheduling– Makes things uglier for the delayed broadcast– This pretty much kills the delayed wakeup

approach• Examples

– Load instructions• Latency {L1_lat, L2_lat, L3_lat, DRAM_lat}

– DRAM_lat is not a constant either, queuing delays– Some architecture specific cases

• PowerPC 603 has a “early out” for multiplication with a low-bit-width multiplicand

• Intel Core 2’s divider also has an early out


34

The Wait-and-See Approach• Just wait and see whether a load hits or

misses in the cache


Sched PayLd Exec

R2 = R1 + #4Sched PayLd Exec

R1 = 16[$sp]

Exec Exec Cache hit known,can broadcast tag

Load-to-Use latencyincreases by 2 cycles(3 cycle load appears as 5)

Scheduler

DL1 Tags

DL1Data

May be able todesign cache s.t.hit/miss knownbefore data

Exec Exec Exec

Sched PayLd Exec

Penalty reduced to 1 cycle

35

Load-Hit Speculation• Caches work pretty well

– hit rates are high, otherwise caches wouldn’t be too useful

– Just assume all loads will hit in the cache


Sched PayLd Exec R2 = R1 + #4

Sched PayLd Exec R1 = 16[$sp]Exec Exec Cache hit,data forwardedBroadcast delayed

by DL1 latency

• Um, ok, what happens when there’s a load miss?

36

Load Miss Scenario


Sched PayLd Exec

Sched PayLd Exec Exec ExecBroadcast delayedby DL1 latency

Exec … Exec

Cache Miss Detected!

Value at cache output is bogus

Invalidate the instruction(ALU output ignored)

Sched PayLd Exec

Rescheduled assuminga hit at the DL2 cache

There could be a miss at the L2, and again at the L3 cache. A single load can waste multiple issuing opportunities.

Each mis-scheduling wastes an issue slot:the tag broadcast bus, payload RAM read port, writeback/bypass bus, etc. could have been used for another instruction

Broadcast delayed by L2 latency

L2 hit

37

Scheduler Deallocation• Normally, as soon as an instruction issues,

it can vacate its scheduler entry– The sooner an entry is deallocated, the sooner

another instruction can reuse that entry leads to that instruction executing earlier

• In the case of a load, the load must hang around in the scheduler until it can be guaranteed that it will not have to rebroadcast its destination tag– Decreases the effective size of the scheduler


38

“But wait, there’s more!”


Sched PayLd Exec

Sched PayLd Exec Exec Exec

DL1 Miss

Sched PayLd Exec

Sched PayLd Exec

Sched PayLd Exec

Squash

Not only children getsquashed, there may be grand-children to squash as well

All waste issue slotsAll must be rescheduledAll waste powerNone may leave scheduler until load hit known

Sched PayLd Ex

Sched PayLd Ex

Sched PayLd Ex

Sched PayLd Exec

39

Squashing• The number of cycles worth of dependents

that must be squashed is equal to Cache-Miss-Known latency minus one– previous example, miss-known latency = 3

cycles, there are two cycles worth of mis-scheduled dependents

– Early miss-detection helps reduce mis-scheduled instructions

– A load may have many children, but the issue width limits how many can possibly be mis-scheduled• Max = Issue-width × (Miss-known-lat – 1)


40

Squashing (2)• Simple: squash anything “in-flight”

between schedule and execute


SchedPayLd Exec Exec Exec

SchedPayLd Exec

SchedPayLd Exec

SchedPayLd Exec

SchedPayLd Exec

SchedPayLd Exec

SchedPayLd Exec

SchedPayLd Exec

• This may include non-dependent instructions

• All instructions must stay in the scheduler for a few extra cycles to make sure they will not be rescheduled due to a squash

41

Squashing (3)• Selective squashing: use “load colors”

– each load is assigned a unique color– every dependent “inherits” its parents’ colors– on a load miss, the load broadcasts its color and

anyone in the same color group gets squashed• An instruction may end up with many

colors


… …Explicitly tracking each color wouldrequire a huge number of comparisons

42

Squashing (4)• Can list “colors” in unary (bit-vector) form


Load R1 = 16[R2]1 0 0 0 0 0 0 0

Add R3 = R1 + R41 0 0 0 0 0 0 0

Load R5 = 12[R7]1 0 0 0 0 0 00

Load R8 = 0[R1]1 0 1 0 0 0 0 0

Load R7 = 8[R4]100 0 0 0 00

Add R6 = R8 + R71 0 1 0 0 0 01

• Each instruction’s color vector is the bitwise OR of its parents’ vectors

• A load miss now only squashes the dependent instructions

• Hardware cost increases quadratically with number of load instructions

X

X

X

43

Allocation• Allocate in-order,

Deallocate in-order• Very simple!• Smaller effective

scheduler size– instructions may have

already executed out-of-order, but their RS entries cannot be reused

– Can be very bad if a load goes to main memory


Circular Buffer

Head

Tail

Tail

44

Allocation (2)• With arbitrary

placement, entries much better utilized

• Allocator more complex– must scan availability and

find N free entries• Write logic more

complex– must route N instructions

to arbitrary entries of the scheduler


RS A

lloca

tor

Entry availabilitybit-vector

0000100100000111

45

Allocation (3)• Segment the entries

– only one entry per segment may be allocated per cycle

– instead of 4-of-16 alloc (previous slide), each allocator only does 1-of-4

– write logic simplified as well

• Still possible inefficiencies– full segment leads to

allocating less than N instructions per cycleLecture 8: Data-Capture Instruction Schedulers

0010101000010110

Allo

cA

lloc

Allo

cA

lloc

A

B

C

DX

Free RS entries exist, justnot in the correct segment

0

lecture 8: data-capture instruction schedulers. the goal is to execute instructions in dataflow...

Documents

cycles b b c c d d e

b b c c d d f f e e

cycles b b f f g g

c c d d e e cache

arbiterarbiter b b method

logic lecture

r1 r4 c

load 16r2 b