computer architecture lecture 6: dynamic scheduling for...

Computer ArchitectureLecture 6 Dynamic Scheduling

for ILP (Chapter 3)

Chih‐Wei Liu 劉志尉

National Chiao Tung Universitycwliutwinseenctuedutw

Data Dependence and Parallelism

bull If 2 instructions are parallelndash they can be executed simultaneously in a pipeline without causing any stalls (except the structural hazards)

ndash their execution order can be swappedbull If 2 instructions are dependent

ndash they must be executed in order or partially overlapped

bull To exploit parallelisms over instructions is equivalent to determine dependences over instructions

CA-Lec6 cwliutwinseenctuedutw

Introduction

2

Software Overcomes Data Hazards

bull For a simple statically scheduled pipelinendash In‐order instruction issue and executionndash fetch an instruction and issue it in program orderndash if there is a data dependence that cannot be hidden (eg forwarding logic) then the hazard detection hardware stalls the pipeline

ndash No new instructions are fetched or issued until the dependence is cleared

ndash Minimize stalls by software to separate dependent instructions so that they will not lead to hazards

CA-Lec6 cwliutwinseenctuedutw 3

Hardware Overcomes Data Hazards

bull For a dynamically schedulingndash the hardware rearranges the instruction execution to reduce the

stalls while maintaining data flow and exception behaviorndash Handling some cases when dependences are unknown at

compiler time (eg memory reference)ndash Simplify the compilerndash (Perhaps most importantly) Allow code compiled with one

pipeline run on a different pipelinendash Will explore hardware speculation

ndash But a cost of significant increase in hardware complexity


Dynamic Scheduling bull Idea

ndash To maintain IPC1 by executing an instruction as early as possiblendash When stalled other instructions can be issued and executed if they do

not depend on any active or stalled instructionsbull Dynamic Scheduling implies Out‐of‐order execution and Out‐of‐order

completion

bull Advantagesndash Compiler doesnrsquot need to have knowledge of microarchitecturendash Handles cases where dependencies are unknown at compile time

bull Disadvantagendash Substantial increase in hardware complexityndash Creates the possibility for WAR and WAW hazardsndash Complicates exceptions


Dynam

ic Scheduling

5

Dynamic Scheduling Introductionbull In classic 5‐stage pipeline both structural and data hazards could be checked

during ID stagendash When an instruction could execute without hazards it was issued from ID

knowing that all data hazards had been resolvedbull Let separate the ID stage into two parts

ndash Issuebull Decode check for structural hazard in the manner of in‐order issue

ndash Read Operandsbull Wait until no data hazards then read operands

bull Out‐of‐order (OOO) executionndash It may introduce WAR WAW hazards

Issue Reg

ALU DM RegIM

EXIF ID MEM WBCA-Lec6 cwliutwinseenctuedutw 6

OOO Example

bull In‐order issue but allow out‐of‐order execution (and thus out‐of‐order completion)

Performance limitation due to hazardhellip


Register Remaining Example

bull Before

DIVD F0F2F4ADDD F6F0F8SD F60(R1)SUBD F8F10F14MULD F6F10F8

bull After

DIVD F0F2F4ADDD SF0F8SD S0(R1)SUBD TF10F14MULD F6F10T


Anti-dependence Only RAW hazards remain

Solving WAR amp WAW when Dynamic Scheduling

bull Scoreboard (used in CDC6600 first 1963)ndash Bookkeeping approachndash Centralized controlndash Stall the instruction and keep track of dependencies between pending instructions

bull Tomasulo approach (used in IBM 36091 Floating‐point Unit 1966)ndash Register remaining approach by using reservation registers

ndash Distributed control


Scoreboard

bull The scoreboard takes full responsibility for instruction issue and execution including hazard detection

bull Three parts to the scoreboardndash Instruction status

bull Indicate the pipeline stage of the instructionndash Functional unit status

bull 9 fields to indicate the state of the functional unit (FU)ndash Register result status

bull Indicate which FU will write the result to register


Scoreboard Example


Instruction status Read Exec WriteInstruction j k Issue Oper Comp ResultLD F6 34+ R2LD F2 45+ R3MULTD F0 F2 F4SUBD F8 F6 F2DIVD F10 F0 F6ADDD F6 F8 F2

Functional unit status dest S1 S2 FU FU Fj FkTime Name Busy Op Fi Fj Fk Qj Qk Rj Rk

Integer NoMult1 NoMult2 NoAdd NoDivide No

Register result statusClock F0 F2 F4 F6 F8 F10 F12 F30

FU

Scoreboard Example Cycle 1Instruction status Read Exec Write

Instruction j k Issue Oper Comp ResultLD F6 34+ R2 1LD F2 45+ R3MULTD F0 F2 F4SUBD F8 F6 F2DIVD F10 F0 F6ADDD F6 F8 F2


Integer Yes Load F6 R2 YesMult1 NoMult2 NoAdd NoDivide No


1 FU Integer



Instruction j k Issue Oper Comp ResultLD F6 34+ R2 1 2LD F2 45+ R3MULTD F0 F2 F4SUBD F8 F6 F2DIVD F10 F0 F6ADDD F6 F8 F2




2 FU Integer

bull Issue 2nd LD CA-Lec6 cwliutwinseenctuedutw 13


Instruction j k Issue Oper Comp ResultLD F6 34+ R2 1 2 3LD F2 45+ R3MULTD F0 F2 F4SUBD F8 F6 F2DIVD F10 F0 F6ADDD F6 F8 F2


Integer Yes Load F6 R2 NoMult1 NoMult2 NoAdd NoDivide No


3 FU Integer

bull Issue MULT CA-Lec6 cwliutwinseenctuedutw 14


Instruction j k Issue Oper Comp ResultLD F6 34+ R2 1 2 3 4LD F2 45+ R3MULTD F0 F2 F4SUBD F8 F6 F2DIVD F10 F0 F6ADDD F6 F8 F2




4 FU Integer



Instruction j k Issue Oper Comp ResultLD F6 34+ R2 1 2 3 4LD F2 45+ R3 5MULTD F0 F2 F4SUBD F8 F6 F2DIVD F10 F0 F6ADDD F6 F8 F2




5 FU Integer



Instruction j k Issue Oper Comp ResultLD F6 34+ R2 1 2 3 4LD F2 45+ R3 5 6MULTD F0 F2 F4 6SUBD F8 F6 F2DIVD F10 F0 F6ADDD F6 F8 F2


Integer Yes Load F2 R3 YesMult1 Yes Mult F0 F2 F4 Integer No YesMult2 NoAdd NoDivide No


6 FU Mult1 Integer



Instruction j k Issue Oper Comp ResultLD F6 34+ R2 1 2 3 4LD F2 45+ R3 5 6 7MULTD F0 F2 F4 6SUBD F8 F6 F2 7DIVD F10 F0 F6ADDD F6 F8 F2


Integer Yes Load F2 R3 NoMult1 Yes Mult F0 F2 F4 Integer No YesMult2 NoAdd Yes Sub F8 F6 F2 Integer Yes NoDivide No


7 FU Mult1 Integer Add

bull Read multiply operandsCA-Lec6 cwliutwinseenctuedutw 18

Scoreboard Example Cycle 8a(First half of clock cycle)

Instruction status Read Exec WriteInstruction j k Issue Oper Comp ResultLD F6 34+ R2 1 2 3 4LD F2 45+ R3 5 6 7MULTD F0 F2 F4 6SUBD F8 F6 F2 7DIVD F10 F0 F6 8ADDD F6 F8 F2


Integer Yes Load F2 R3 NoMult1 Yes Mult F0 F2 F4 Integer No YesMult2 NoAdd Yes Sub F8 F6 F2 Integer Yes NoDivide Yes Div F10 F0 F6 Mult1 No Yes


8 FU Mult1 Integer Add Divide


Scoreboard Example Cycle 8b(Second half of clock cycle)

Instruction status Read Exec WriteInstruction j k Issue Oper Comp ResultLD F6 34+ R2 1 2 3 4LD F2 45+ R3 5 6 7 8MULTD F0 F2 F4 6SUBD F8 F6 F2 7DIVD F10 F0 F6 8ADDD F6 F8 F2


Integer NoMult1 Yes Mult F0 F2 F4 Yes YesMult2 NoAdd Yes Sub F8 F6 F2 Yes YesDivide Yes Div F10 F0 F6 Mult1 No Yes


8 FU Mult1 Add Divide



Instruction j k Issue Oper Comp ResultLD F6 34+ R2 1 2 3 4LD F2 45+ R3 5 6 7 8MULTD F0 F2 F4 6 9SUBD F8 F6 F2 7 9DIVD F10 F0 F6 8ADDD F6 F8 F2


Integer No10 Mult1 Yes Mult F0 F2 F4 Yes Yes

Mult2 No2 Add Yes Sub F8 F6 F2 Yes Yes

Divide Yes Div F10 F0 F6 Mult1 No Yes



bull Read operands for MULT amp SUB Issue ADDD

ClockRemainng





Integer No9 Mult1 Yes Mult F0 F2 F4 No No

Mult2 No1 Add Yes Sub F8 F6 F2 No No






Instruction j k Issue Oper Comp ResultLD F6 34+ R2 1 2 3 4LD F2 45+ R3 5 6 7 8MULTD F0 F2 F4 6 9SUBD F8 F6 F2 7 9 11DIVD F10 F0 F6 8ADDD F6 F8 F2









Instruction j k Issue Oper Comp ResultLD F6 34+ R2 1 2 3 4LD F2 45+ R3 5 6 7 8MULTD F0 F2 F4 6 9SUBD F8 F6 F2 7 9 11 12DIVD F10 F0 F6 8ADDD F6 F8 F2



Mult2 NoAdd NoDivide Yes Div F10 F0 F6 Mult1 No Yes


12 FU Mult1 Divide

bull Read operands for DIVDCA-Lec6 cwliutwinseenctuedutw 24


Instruction j k Issue Oper Comp ResultLD F6 34+ R2 1 2 3 4LD F2 45+ R3 5 6 7 8MULTD F0 F2 F4 6 9SUBD F8 F6 F2 7 9 11 12DIVD F10 F0 F6 8ADDD F6 F8 F2 13



Mult2 NoAdd Yes Add F6 F8 F2 Yes YesDivide Yes Div F10 F0 F6 Mult1 No Yes





Instruction j k Issue Oper Comp ResultLD F6 34+ R2 1 2 3 4LD F2 45+ R3 5 6 7 8MULTD F0 F2 F4 6 9SUBD F8 F6 F2 7 9 11 12DIVD F10 F0 F6 8ADDD F6 F8 F2 13 14



Mult2 No2 Add Yes Add F6 F8 F2 Yes Yes









Mult2 No1 Add Yes Add F6 F8 F2 No No






Instruction j k Issue Oper Comp ResultLD F6 34+ R2 1 2 3 4LD F2 45+ R3 5 6 7 8MULTD F0 F2 F4 6 9SUBD F8 F6 F2 7 9 11 12DIVD F10 F0 F6 8ADDD F6 F8 F2 13 14 16












Mult2 NoAdd Yes Add F6 F8 F2 No NoDivide Yes Div F10 F0 F6 Mult1 No Yes



bull Why not write result of ADD

WAR Hazard











Instruction j k Issue Oper Comp ResultLD F6 34+ R2 1 2 3 4LD F2 45+ R3 5 6 7 8MULTD F0 F2 F4 6 9 19SUBD F8 F6 F2 7 9 11 12DIVD F10 F0 F6 8ADDD F6 F8 F2 13 14 16








Instruction j k Issue Oper Comp ResultLD F6 34+ R2 1 2 3 4LD F2 45+ R3 5 6 7 8MULTD F0 F2 F4 6 9 19 20SUBD F8 F6 F2 7 9 11 12DIVD F10 F0 F6 8ADDD F6 F8 F2 13 14 16


Integer NoMult1 NoMult2 NoAdd Yes Add F6 F8 F2 No NoDivide Yes Div F10 F0 F6 Yes Yes


20 FU Add Divide



Instruction j k Issue Oper Comp ResultLD F6 34+ R2 1 2 3 4LD F2 45+ R3 5 6 7 8MULTD F0 F2 F4 6 9 19 20SUBD F8 F6 F2 7 9 11 12DIVD F10 F0 F6 8 21ADDD F6 F8 F2 13 14 16




21 FU Add Divide

bull WAR Hazard is now gone CA-Lec6 cwliutwinseenctuedutw 33


Instruction j k Issue Oper Comp ResultLD F6 34+ R2 1 2 3 4LD F2 45+ R3 5 6 7 8MULTD F0 F2 F4 6 9 19 20SUBD F8 F6 F2 7 9 11 12DIVD F10 F0 F6 8 21ADDD F6 F8 F2 13 14 16 22


Integer NoMult1 NoMult2 NoAdd No

39 Divide Yes Div F10 F0 F6 No No


22 FU Divide


skip a couple of cycles



Instruction j k Issue Oper Comp ResultLD F6 34+ R2 1 2 3 4LD F2 45+ R3 5 6 7 8MULTD F0 F2 F4 6 9 19 20SUBD F8 F6 F2 7 9 11 12DIVD F10 F0 F6 8 21 61ADDD F6 F8 F2 13 14 16 22





61 FU Divide


Scoreboard Summarybull In‐order issue and out‐of‐order executioncompletionbull Do not issue on structural hazardsbull Solution for WAR wait for WAR hazards

ndash Stall write‐back until registers have been read (flag check)ndash Read registers only during Read‐Operand stage

bull Solution for WAW prevent WAW hazardsndash Detect hazard and stall issue of new instruction until other instruction completes

bull No register renamingbull Scoreboard replaces 3‐stages ie IDEXWB with Issue(ID1)Read‐Operand(ID2)EXWB


Another Dynamic Algorithm Tomasulorsquos Algorithm


Dynam

ic Scheduling

38

Virtual registers

Tomasulo Algorithm

bull Virtual registers amp buffers distributed with Function Units (FU)ndash FU virtual registers called ldquoreservation stations (RSs)rdquo have pending operands

ndash Registers in instruction are renamed by pointers to RSs amp buffers

bull Avoids WAR and WAW hazardsbull RSs amp buffers are more than registers so can do optimizations that compiler canrsquot

ndash Results to FU from RS not through registers overcommon data bus (CDB) that broadcasts to all Fus

ndash Load and Store are treated as FUs with RSs as well


Reservation Station Duties

bull Each RS holds an instruction that has been issued and is awaiting execution at a FU and either the operand values or the RS names that will provide the operand values

bull RS fetches operands from CDB when they appearbull When all operands are present enable the associated

functional unit to executebull Since values are not really written to registers

ndash No WAW or WAR hazards are possible


Three Stages of Tomasulo Algorithm1 Issue

ndash Get the next instruction from the head of OP queuebull The FIFO instruction queue (in‐order issue)

ndash If no RS is availablebull Structural hazards stall the pipeline

ndash If there is an available RSbull Issue the instructionbull If the operands are available in the RFs

ndash Fetch the operands and buffer them in the RSndash To solve WAR hazards (register renaming)

bull If the operand is not available in the RFsndash some FU is currently computing itndash Redirect the operand source to that reservation stationndash To solve WAW hazards (register renaming)


Three Stages of Tomasulo Algorithm2 Execute

ndash If one of operands is not availablebull Monitor (CDB) and wait for itbull When the operand becomes available it is placed into the

corresponding RSndash If all operands are available

bull The operation is performed at FUbull RAW hazards are avoided bull Several insts could become ready at the same clock cycle for the

same FUbull Loads and stores require 2‐step execution process

bull Effective address (EA) calculation LS buffer for memory accessbull LS are maintained in program order through the EA calculation

which will help to prevent hazards through memorybull To preserve exception behavior

ndash No instruction is allowed to initiate execution until all branches that precede it in program order have completed


Three Stages of Tomasulo Algorithm

3 Write resultndash When result is available write it on the CDBndash When both the address and data values are available they are sent

to the memory unit


Summary for 3‐stages of Tomasulo algorithm

1 Issuemdashget instruction from the head of Op Queue (FIFO)If reservation station free (no structural hazard) control issues instr amp sends operands (renames registers)

2 Executemdashoperate on operands (EX)When both operands ready then executeif not ready watch Common Data Bus for result

3 Write resultmdashfinish execution (WB)Write on Common Data Bus to all awaiting units mark reservation station available

bull Normal data bus data + destination (ldquogo tordquo bus)bull Common data bus data + source (ldquocome fromrdquo bus)

ndash 64 bits of data + 4 bits of Functional Unit source addressndash Write if matches expected Functional Unit (produces result)ndash Does the broadcast


Tomasulo ExampleInstruction status Exec Write

Instruction j k Issue Comp Result Busy AddressLD F6 34+ R2 Load1 NoLD F2 45+ R3 Load2 NoMULTD F0 F2 F4 Load3 NoSUBD F8 F6 F2DIVD F10 F0 F6ADDD F6 F8 F2

Reservation Stations S1 S2 RS RSTime Name Busy Op Vj Vk Qj Qk

Add1 NoAdd2 NoAdd3 NoMult1 NoMult2 No


0 FU

Clock cycle counter

FU countdown

Instruction stream

3 LoadBuffers

3 FP Adder RS2 FP Mult RS


Tomasulo Example Cycle 1Instruction status Exec Write

Instruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 Load1 Yes 34+R2LD F2 45+ R3 Load2 NoMULTD F0 F2 F4 Load3 NoSUBD F8 F6 F2DIVD F10 F0 F6ADDD F6 F8 F2




1 FU Load1



Instruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 Load1 Yes 34+R2LD F2 45+ R3 2 Load2 Yes 45+R3MULTD F0 F2 F4 Load3 NoSUBD F8 F6 F2DIVD F10 F0 F6ADDD F6 F8 F2




2 FU Load2 Load1

Note Unlike Scoreboard can have multiple loads outstandingCA-Lec6 cwliutwinseenctuedutw 47


Instruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 Load1 Yes 34+R2LD F2 45+ R3 2 Load2 Yes 45+R3MULTD F0 F2 F4 3 Load3 NoSUBD F8 F6 F2DIVD F10 F0 F6ADDD F6 F8 F2


Add1 NoAdd2 NoAdd3 NoMult1 Yes MULTD R(F4) Load2Mult2 No


3 FU Mult1 Load2 Load1

bull Note registers names are removed (ldquorenamedrdquo) in Reservation Stations MULT issued vs scoreboard

bull Load1 completing what is waiting for Load1 CA-Lec6 cwliutwinseenctuedutw 48


Instruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 Load2 Yes 45+R3MULTD F0 F2 F4 3 Load3 NoSUBD F8 F6 F2 4DIVD F10 F0 F6ADDD F6 F8 F2


Add1 Yes SUBD M(A1) Load2Add2 NoAdd3 NoMult1 Yes MULTD R(F4) Load2Mult2 No


4 FU Mult1 Load2 M(A1) Add1



Instruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTD F0 F2 F4 3 Load3 NoSUBD F8 F6 F2 4DIVD F10 F0 F6 5ADDD F6 F8 F2


2 Add1 Yes SUBD M(A1) M(A2)Add2 NoAdd3 No

10 Mult1 Yes MULTD M(A2) R(F4)Mult2 Yes DIVD M(A1) Mult1


5 FU Mult1 M(A2) M(A1) Add1 Mult2

bull Timer starts down for Add1 Mult1CA-Lec6 cwliutwinseenctuedutw 50


Instruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTD F0 F2 F4 3 Load3 NoSUBD F8 F6 F2 4DIVD F10 F0 F6 5ADDD F6 F8 F2 6


1 Add1 Yes SUBD M(A1) M(A2)Add2 Yes ADDD M(A2) Add1Add3 No



6 FU Mult1 M(A2) Add2 Add1 Mult2

bull Issue ADDD here despite name dependence on F6 vs scoreboard CA-Lec6 cwliutwinseenctuedutw 51


Instruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTD F0 F2 F4 3 Load3 NoSUBD F8 F6 F2 4 7DIVD F10 F0 F6 5ADDD F6 F8 F2 6






bull Add1 completing what is waiting for it CA-Lec6 cwliutwinseenctuedutw 52


Instruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTD F0 F2 F4 3 Load3 NoSUBD F8 F6 F2 4 7 8DIVD F10 F0 F6 5ADDD F6 F8 F2 6


Add1 No2 Add2 Yes ADDD (M-M) M(A2)

Add3 No7 Mult1 Yes MULTD M(A2) R(F4)

Mult2 Yes DIVD M(A1) Mult1


8 FU Mult1 M(A2) Add2 (M-M) Mult2












Instruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTD F0 F2 F4 3 Load3 NoSUBD F8 F6 F2 4 7 8DIVD F10 F0 F6 5ADDD F6 F8 F2 6 10







bull Add2 (ADDD) completing what is waiting for it CA-Lec6 cwliutwinseenctuedutw 55


Instruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTD F0 F2 F4 3 Load3 NoSUBD F8 F6 F2 4 7 8DIVD F10 F0 F6 5ADDD F6 F8 F2 6 10 11


Add1 NoAdd2 NoAdd3 No



11 FU Mult1 M(A2) (M-M+M(M-M) Mult2

bull Write result of ADDD here vs scoreboardbull All quick instructions complete in this cycle



























Instruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTD F0 F2 F4 3 15 Load3 NoSUBD F8 F6 F2 4 7 8DIVD F10 F0 F6 5ADDD F6 F8 F2 6 10 11






bull Mult1 (MULTD) completing what is waiting for it



Instruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTD F0 F2 F4 3 15 16 Load3 NoSUBD F8 F6 F2 4 7 8DIVD F10 F0 F6 5ADDD F6 F8 F2 6 10 11


Add1 NoAdd2 NoAdd3 NoMult1 No

40 Mult2 Yes DIVD MF4 M(A1)


16 FU MF4 M(A2) (M-M+M(M-M) Mult2

bull Now wait for Mult2 (DIVD) to complete











Instruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTD F0 F2 F4 3 15 16 Load3 NoSUBD F8 F6 F2 4 7 8DIVD F10 F0 F6 5 56ADDD F6 F8 F2 6 10 11






bull Mult2 (DIVD) is completing what is waiting for it CA-Lec6 cwliutwinseenctuedutw 63


Instruction j k Issue Comp Result Busy AddressLD F6 34+ R2 1 3 4 Load1 NoLD F2 45+ R3 2 4 5 Load2 NoMULTD F0 F2 F4 3 15 16 Load3 NoSUBD F8 F6 F2 4 7 8DIVD F10 F0 F6 5 56 57ADDD F6 F8 F2 6 10 11


Add1 NoAdd2 NoAdd3 NoMult1 NoMult2 Yes DIVD MF4 M(A1)


56 FU MF4 M(A2) (M-M+M(M-M) Result

bull Once again In-order issue out-of-order execution and completion


Compare to Scoreboard Cycle 62

Instruction status Read Exec Write Exec WriteInstruction j k Issue Oper Comp Result Issue Comp ResultLD F6 34+ R2 1 2 3 4 1 3 4LD F2 45+ R3 5 6 7 8 2 4 5MULTD F0 F2 F4 6 9 19 20 3 15 16SUBD F8 F6 F2 7 9 11 12 4 7 8DIVD F10 F0 F6 8 21 61 62 5 56 57ADDD F6 F8 F2 13 14 16 22 6 10 11

bull Why take longer on scoreboard6600bull Structural Hazardsbull Lack of forwarding


2 Major Advantages of Tomasulo

bull Distribution of the hazard detection logicndash Distributed RS and CDBndash If multiple instructions are waiting on a single result and each already has its other operand then the instruction can be released simultaneously by the broadcast on CDB

ndash If a centralized register file were used the units would have to read their results from the registers when register buses are available

bull Elimination of stalls for WAW and WARndash Rename register using RSndash Store operands into RS as soon as they are availablendash For WAW‐hazard the last write will win


Loop Unrolling in HardwareLoopLD F0 0 R1

MULTD F4 F0 F2SD F4 0 R1SUBI R1 R1 8BNEZ R1 Loop

bull Assume Multiply takes 4 clocksbull Assume first load takes 8 clocks (cache miss) second load

takes 1 clock (hit)bull To be clear will show clocks for SUBI BNEZbull Reality integer instructions ahead


Take‐home Quiz Complete the following table at cycle 18

Instruction status Exec WriteITER Instruction j k Issue CompResult Busy Addr Fu

1 LD F0 0 R1 Load1 No1 MULTD F4 F0 F2 Load2 No1 SD F4 0 R1 Load3 No2 LD F0 0 R1 Store1 No2 MULTD F4 F0 F2 Store2 No2 SD F4 0 R1 Store3 No

Reservation Stations S1 S2 RS Time Name Busy Op Vj Vk Qj Qk Code

Add1 No LD F0 0 R1Add2 No MULTD F4 F0 F2Add3 No SD F4 0 R1Mult1 No SUBI R1 R1 8Mult2 No BNEZ R1 Loop

Register result statusClock R1 F0 F2 F4 F6 F8 F10 F12 F30

0 80 Fu

Tomasulo Drawbacks

bull Performance limited by Common Data Busndash Each CDB must go to multiple functional units high capacitance high wiring density

ndash Number of functional units that can complete per cycle limited to one

bull Multiple CDBs more complexitybull Non‐precise interrupts

ndash Need way to resynchronize execution with instruction stream (ie with issue‐order)

ndash Easiest way is with reorder buffer (ie in‐order completion)


Reorder Buffer Operationbull Holds instructions in FIFO order exactly as issuedbull When instructions complete results placed into ROB

ndash Supplies operands to other instruction between execution complete amp commit more registers like RS

ndash Tag results with ROB buffer number instead of reservation stationbull Instructions commit values at head of ROB placed in registersbull As a result easy to undo speculated instructions

on mispredicted branches or on exceptions ReorderBufferFP

OpQueue

FP Adder FP AdderRes Stations Res Stations

FP Regs

Commit path


Greater ILP by Speculation

bull Essential data flow execution modelndash Operations execute as soon as their operands are available

bull Greater ILPndash Overcome control dependence by hardware speculatingon outcome of branches and executing program as if guesses were correct

bull Prediction vs Speculationndash Dynamic scheduling only fetches and issues instructionsndash Speculation fetch issue and execute instructions as if branch predictions were always correct


Hardware‐Based Speculation3 components of HW‐based speculation1 Dynamic branch prediction to choose which instructions to

execute 2 Dynamic scheduling to deal with scheduling of different

combinations of basic blocks3 Speculation to allow execution of instructions before control

dependences are resolved + ability to undo effects of incorrectly speculated sequence

bull Adding ROB to Tomasulondash Instruction commit when an instruction is no longer speculative

allow it to update the register file or memoryndash ROB is also used to pass results among instructions that are

speculated


Reorder Buffer (ROB)bull Additional registers just like reservation stations

ndash ROB is a source of operandsndash It holds the results of instruction that have finished execution but not

committedndash Use ROB number instead of RS to indicate the source of operands

when execution completes (but not committed)ndash It also uses to pass results among instructions that may be speculatedndash Each (pending) instruction occupies an ROB entry before being

committed ndash Instructions in ROB are committed in order

bull Once instruction commits the result is put into registerndash On misprediction the corresponding ROB entry will be flushedndash In case of exceptions Not recognized until it is ready to commit


The Speculative MIPSReplace store buffer

Observations

bull For an execution result separatendash data forwarding (thru RS) pathndash write‐back (thru ROB) path

bull Data forwarding pathndash still use RS to buffer operandsndash provide speculative register readsndash provide out‐of‐order completion

bull Register write‐back pathndash use ROB to buffer resultsndash when itrsquos committed update RF (in order)

Reorder Buffer Entry

Each entry in the ROB contains four fields1 Instruction type

bull a branch (has no destination result) a store (has a memory address destination) or a register operation (ALU operation or load which has register destinations)

2 Destinationbull Register number (for loads and ALU operations) or

memory address (for stores) where the instruction result should be written

3 Valuebull Value of instruction result until the instruction commits

4 Readybull Indicates that instruction has completed execution and the value is ready

Four Steps of Speculative Tomasulo1 Issuemdashget instruction from FP Op Queue

If reservation station and reorder buffer slot free issue instr amp send operands amp reorder buffer no for destination (this stage sometimes called ldquodispatchrdquo)

2 Executionmdashoperate on operands (EX)When both operands ready then execute if not ready watch CDB for result when both in reservation station execute checks RAW (sometimes called ldquoissuerdquo)

3 Write resultmdashfinish execution (WB)Write on Common Data Bus to all awaiting FUs amp reorder buffer mark reservation station available

4 Commitmdashupdate register with reorder resultWhen instr at head of reorder buffer amp result present update register with result (or store to memory) and remove instr from reorder buffer Mispredicted branch flushes reorder buffer (sometimes called ldquograduationrdquo)

Examplebull The same example as Tomasulo without speculation

ndash LD F6 34(R2)ndash LD F2 45(R3)ndash MULD F0 F2 F4ndash SUBD F8 F6 F2ndash DIVD F10 F0 F6ndash ADDD F6 F8 F2

bull Modified status tablesndash Qj and Qk fields and register status fields use ROB (instead of RS)ndash Add Dest field to RS (ROB to put the operation result)

bull Show the status tables when MULD is ready to go to commitndash At this time only two LD instructions have been committed

AssumeFP ADD 2 cycles

MUL 10 cyclesDIV 40 cycles

Figure 330

Precise Exceptionsbull Consider the case if MULD causes an interrupthellipbull Tomasulo without speculation

ndash SUBD and ADDD have completedbull Tomasulo with speculation

ndash No instruction after the earliest uncompleted instruction (MULD) is allowed to complete

ndash In‐order commit

bull ROB with in‐order instruction commit provides precise exceptionsndash Exceptions are handled in the instruction order

Memory Disambiguation Problem

bull Given a load that follows a store in program order Eg ndash SD 0(R2) R5ndash LD R6 0(R3)

bull Question are the two relatedbull Question can we go ahead and start the load earlyndash We do not know whether 0(R2) 0(R3) in compiler time

ndash Hardware‐based speculation would be helpful


Hardware Support for Memory Disambiguation

bull Need buffer to keep track of all outstanding stores to memory in program order

bull When issuing a load record current head of store queue (in order to know which stores are ahead of you)

bull When have address for load check store queuendash If any store prior to load is waiting for its address stall loadndash If load address matches earlier store address a RAW hazard occurs

bull Actual stores commit in FIFO order so no worry about WARWAW hazards through memory


ROB Avoids Memory Hazardsbull WAW and WAR hazards through memory are eliminated with speculation

because actual updating of memory occurs in order when a store is at head of the ROB and hence no earlier loads or stores can still be pending

bull RAW hazards through memory are maintained by two restrictions 1 not allowing a load to initiate the second step of its execution if any active

ROB entry occupied by a store has a Destination field that matches the value of the A field of the load and

2 maintaining the program order for the computation of an effective address of a load with respect to all earlier stores

bull these restrictions ensure that any load that accesses a memory location written to by an earlier store cannot perform the memory access until the store has written the data

Getting CPI below 1bull CPI ge 1 if issue only 1 instruction every clock cycle bull Multiple‐issue processors come in 3 flavors

1 statically‐scheduled superscalar processors2 dynamically‐scheduled superscalar processors and 3 VLIW (very long instruction word) processors

bull 2 types of superscalar processors issue varying numbers of instructions per clock ndash use in‐order execution if they are statically scheduled or ndash out‐of‐order execution if they are dynamically scheduled

bull VLIW processors in contrast issue a fixed number of instructionsformatted either as one large instruction or as a fixed instruction packet with the parallelism among instructions explicitly indicated by the instruction (IntelHP Itanium)

Multiple Issue Processors


Multiple Issue and S

tatic Scheduling

85

Multi‐issue Superscalar Processor

Instruction Fetchwith Branch Prediction

Out-Of-OrderExecutionUnit

Correctness FeedbackOn Branch Results

Stream of InstructionsTo Execute

bull Instruction fetch decoupled from executionbull Often issue logic (+ rename) included with Fetch

Independent Fetch Unit

Multiple Issue with Speculation

bull To maintain throughput of greater than one instructions per cycle we must handle multiple instruction commits per clock

bull Extend Tomasulo speculation algorithm to multiple‐issue schemendash 2 challenges

bull Instruction issuebull Monitor CDB for instruction completion

ndash In additionbull How to handle multiple instruction commits per clock cycle

Advantages of Superscalar over VLIW

bull Old codes still runndash Like those tools you have that came as binariesndash HW detects whether the instruction pair is a legal dual issue pair

bull If not they are run sequentially

bull Little impact on code densityndash Donrsquot need to fill all of the canrsquot issue here slots with NOPrsquos

bull Compiler issues are very similarndash Still need to do instruction scheduling anywayndash Dynamic issue hardware is there so the compiler does not have to be

too conservative

Examplebull Loop LD R2 0(R1)

DADDIU R2 R2 1SD R2 0(R1)DADDIU R1 R1 4BNE R2 R3 LOOP

bull Assume separate integer FUsndash for effective address calculation ndash ALU operations andndash branch condition evaluation

bull Assume up to 2 instructions of any type can commit per clock

Figure 333 amp 334

R2

R2

R2

No Speculation

R2

R2

R2

Speculation

Out-of-order executing In-order committing

Comparisons bull Without speculation (Tomasulo only)

ndash LD following BNE cannot start execution earlier wait until branch outcome is determinedndash Completion rate is falling behind the issue rate rapidly stall when a few more iterations are issued

bull With speculationndash LD following BNE can start execution early because it is speculative

ndash More complex HW is requiredndash Completion rate is almost equal to issue rate

Advanced Techniques for Instruction Delivery and Speculation

bull High performance instruction deliveryndash For a multiple‐issue processor predicting branches well is not enough

bull Predicated executionbull Branch target buffer (BTB)

ndash Deliver a high‐bandwidth instruction stream is necessary

bull Eg 4~8 instructionscyclebull Increasing instruction fetch bandwidthbull Speculation (branch value prediction)


I-cache

Fetch Buffer

IssueBuffer

FuncUnits

ArchState

Execute

Decode

ResultBuffer Commit

PC

Fetch

Branchexecuted

Next fetch started

Modern processors may have gt 10 pipeline stages between next PC calculation and branch resolution

Control Flow Penalty

How much work is lost if pipeline doesnrsquot follow correct instruction flow

~ Loop length x pipeline width

Branch and Jump Instruction

bull Each instruction fetch depends on one or two pieces of information from the preceding branch instruction1 Is a taken branch2 If so what is the target address

bull Example MIPS branches and jumps


Instruction Taken known Target known

J

JRBEQZBNEZ After Inst Decode

After Inst Decode After Inst Decode

After Inst Decode After Reg Fetch

After Reg Fetch

Assuming zero detect on register read

Branch Penalties in Modern Pipelines

A PC GenerationMuxP Instruction Fetch Stage 1F Instruction Fetch Stage 2B Branch Address CalcBegin DecodeI Complete DecodeJ Steer Instructions to Functional unitsR Register File ReadE Integer Execute

Remainder of execute pipeline (+ another 6 stages)

UltraSPARC-III instruction fetch pipeline stages(in-order issue 4-way superscalar 750MHz 2000)

Branch Target Address Known

Branch Direction ampJump Register Target Known

Reducing Control Flow Penalty

bull Software solutionsndash Loop unrolling eliminate branches

bull To increase the run lengthndash Instruction scheduling reduce resolution time

bull eg delay branch

bull Hardware solutionsndash Branch prediction and Speculationndash Predicated instructionndash Branch target buffer (BTB)


Predicated Execution

bull Avoid branch prediction by turning branches into conditionally executed instructionsif (x) then A = B op C else NOPndash If false then neither store result nor cause exceptionndash Expanded ISA with 1‐bit condition fieldndash This transformation is called ldquoif‐conversionrdquo

bull Drawbacks to predicated instructionsndash Still takes a clock even if ldquoannulledrdquondash Stall if condition evaluated latendash Complex conditions reduce effectiveness

condition becomes known late in pipeline

x

A=B op C

Branch Target Buffer


Steps Handling an Instruction with BTB


Combining BTB and BHTbull BTB entries are considerably more expensive than BHT but can redirect

fetches at earlier stage in pipeline and can accelerate indirect branches (JR)bull BHT can hold many more entries and is more accurate


BTB

BHTBHT in later pipeline stage corrects when BTB misses a predicted taken branch

BTBBHT only updated after branch resolves in E stage

BTB Remarksbull BTB contains useful information for branch and jump instructions

onlyndash Do not update BTB for other instructionsndash For all other instructions the next PC is PC+4

bull Keep both the branch PC and target PC in the BTBndash ldquoBranch foldingrdquondash 0‐cycle unconditional branchesndash Sometimes 0‐cycle conditional branches

bull Only predicted taken branches and jumps held in BTBndash More room to store

bull Subroutine returns (jump to return address)ndash BTB can work well if usually return to the same placendash Return address predictors


Return Address Predictor

bull Most unconditional branches come from function returns

bull The same procedure can be called from multiple sitesndash Causes the buffer to potentially forget about the return address from previous calls

bull Create return address buffer organized as a stack


Subroutine Return Stackbull Small structure to accelerate JR for subroutine returns typically much more accurate than BTBs

ampnextaampnextb

Push return address when function call executed

Pop return address when subroutine return decoded

fa() fb() nexta

fb() fc() nextb

fc() fd() nextc

ampnextc k entries(typically k=8-16)

Special Case Return Addressesbull Register Indirect branch hard to predict address

BTBPC Predicted

Next PC

Fetch Unit

Destination FromCall Instruction[ On Fetch]

Select forIndirect Jumps[ On Fetch ]

Return Address Stack

Mux

Performance Return Address Predictor

bull Cache most recent return addressesndash Call Push a return address on stackndash Return Pop an address off stack amp predict as new PC

bull SPEC95 Benchmarks


0

10

20

30

40

50

60

70

0 1 2 4 8 16Return address buffer entries

Mis

pre

dic

tio

n f

req

ue

ncy

gom88ksimcc1compressxlispijpegperlvortex

More Instruction Fetch Bandwidth

bull Integrated branch prediction branch predictor is part of instruction fetch unit and is constantly predicting branches

bull Instruction prefetch Instruction fetch units prefetch to deliver multiple instructions per clock integrating it with branch prediction

bull Instruction memory access and buffering Fetching multiple instructions per cyclendash May require accessing multiple cache blocks (prefetch to hide cost

of crossing cache blocks) ndash Provides buffering acting as on‐demand unit to provide

instructions to issue stage as needed and in quantity needed

Speculation Register Renaming vs ROB

bull Alternative to ROB is a larger physical set of registers combined with register renamingndash Extended registers replace function of both ROB and reservation

stations

bull Instruction issue maps names of architectural registers to physical register numbers in extended register set ndash On issue allocates a new unused register for the destination

(which avoids WAW and WAR hazards)ndash Speculation recovery easy because a physical register holding an

instruction destination does not become the architectural register until the instruction commits

bull Most Out‐of‐Order processors today use extended registers with renaming

Explicit Register Renaming

bull Instead of virtual registers from reservation stations and reorder buffer create a single (physical) register poolndash Contains visible registers and virtual registers

bull Use hardware‐based map to rename registers during issuebull Still need a ROB‐like queue to update table in orderbull Physical register becomes free when not being used


Fetch DecodeRename Execute

RenameTable

Speculation Performancebull How much to speculate

ndash Mis‐speculation degrades performance and power relative to no speculation

bull May cause additional misses (cache TLB)ndash Prevent speculative code from causing higher costing misses (eg L2)

bull Speculating through multiple branchesndash Complicates speculation recoveryndash No processor can resolve multiple branches per cycle

bull Speculation and energy efficiencyndash Note speculation is only energy efficient when it significantly improves performance


Adv Techniques for Instruction D

elivery and Speculation

110

Value Predictionbull Attempts to predict value produced by instruction

ndash Eg Loads a value that changes infrequentlybull Value prediction is useful only if it significantly increases ILP

ndash Focus of research has been on loads so‐so results no processor uses value prediction

bull Related topic is address aliasing predictionndash RAW for load and store or WAW for 2 stores

bull Address alias prediction is both more stable and simpler since need not actually predict the address values only whether such values conflictndash Has been used by a few processors

Data Value Prediction Example

bull Why do itndash Can ldquoBreak the DataFlow Boundaryrdquondash Before Critical path = 4 operations (probably worse)ndash After Critical path = 1 operation (plus verification)

+

A B

+

Y X

+

A B

+

Y X

Guess

Guess

Guess

In Conclusionhellipbull Interest in multiple‐issue because wanted to improve performance

without affecting uniprocessor programming modelbull Taking advantage of ILP is conceptually simple but design problems are

amazingly complex in practicebull Conservative in ideas just faster clock and biggerbull Processors of Pentium 4 IBM Power 5 and AMD Opteron have the same

basic structure and similar sustained issue rates (3 to 4 instructions per clock) as the 1st dynamically scheduled multiple‐issue processors announced in 1995ndash Clocks 10 to 20X faster caches 4 to 8X bigger 2 to 4X as many

renaming registers and 2X as many load‐store units performance 8 to 16X

bull Peak vs delivered performance gap increasing

Data Dependence and Parallelism

bull If 2 instructions are parallelndash they can be executed simultaneously in a pipeline without causing any stalls (except the structural hazards)

ndash their execution order can be swappedbull If 2 instructions are dependent

ndash they must be executed in order or partially overlapped

bull To exploit parallelisms over instructions is equivalent to determine dependences over instructions


Introduction

2
















completion




Dynam

ic Scheduling

5







Issue Reg

ALU DM RegIM


OOO Example





bull Before


bull After









Scoreboard







Scoreboard Example






FU






1 FU Integer







2 FU Integer







3 FU Integer







4 FU Integer







5 FU Integer







6 FU Mult1 Integer
































ClockRemainng


























12 FU Mult1 Divide













































WAR Hazard























20 FU Add Divide







21 FU Add Divide








22 FU Divide










61 FU Divide









Dynam

ic Scheduling

38

Virtual registers

Tomasulo Algorithm































to the memory unit














0 FU

Clock cycle counter

FU countdown

Instruction stream

3 LoadBuffers








1 FU Load1







2 FU Load2 Load1



































































































































































0 80 Fu

Tomasulo Drawbacks











OpQueue


FP Regs

Commit path













speculated










Observations






















Figure 330






























tatic Scheduling

85


















too conservative





Figure 333 amp 334

R2

R2

R2

No Speculation

R2

R2

R2

Speculation












I-cache

Fetch Buffer

IssueBuffer

FuncUnits

ArchState

Execute

Decode

ResultBuffer Commit

PC

Fetch

Branchexecuted

Next fetch started










J




After Reg Fetch


















x

A=B op C








BTB















ampnextaampnextb



fa() fb() nexta

fb() fc() nextb

fc() fd() nextc



BTBPC Predicted

Next PC

Fetch Unit




Mux





0

10

20

30

40

50

60

70


Mis

pre

dic

tio

n f

req

ue

ncy










stations










RenameTable









110








+

A B

+

Y X

+

A B

+

Y X

Guess

Guess

Guess






















completion




Dynam

ic Scheduling

5







Issue Reg

ALU DM RegIM


OOO Example





bull Before


bull After









Scoreboard







Scoreboard Example






FU






1 FU Integer







2 FU Integer







3 FU Integer







4 FU Integer







5 FU Integer







6 FU Mult1 Integer
































ClockRemainng


























12 FU Mult1 Divide













































WAR Hazard























20 FU Add Divide







21 FU Add Divide








22 FU Divide










61 FU Divide









Dynam

ic Scheduling

38

Virtual registers

Tomasulo Algorithm































to the memory unit














0 FU

Clock cycle counter

FU countdown

Instruction stream

3 LoadBuffers








1 FU Load1







2 FU Load2 Load1



































































































































































0 80 Fu

Tomasulo Drawbacks











OpQueue


FP Regs

Commit path













speculated










Observations






















Figure 330






























tatic Scheduling

85


















too conservative





Figure 333 amp 334

R2

R2

R2

No Speculation

R2

R2

R2

Speculation












I-cache

Fetch Buffer

IssueBuffer

FuncUnits

ArchState

Execute

Decode

ResultBuffer Commit

PC

Fetch

Branchexecuted

Next fetch started










J




After Reg Fetch


















x

A=B op C








BTB















ampnextaampnextb



fa() fb() nexta

fb() fc() nextb

fc() fd() nextc



BTBPC Predicted

Next PC

Fetch Unit




Mux





0

10

20

30

40

50

60

70


Mis

pre

dic

tio

n f

req

ue

ncy










stations










RenameTable









110








+

A B

+

Y X

+

A B

+

Y X

Guess

Guess

Guess













Issue Reg

ALU DM RegIM


OOO Example





bull Before


bull After









Scoreboard







Scoreboard Example






FU






1 FU Integer







2 FU Integer







3 FU Integer







4 FU Integer







5 FU Integer







6 FU Mult1 Integer
































ClockRemainng


























12 FU Mult1 Divide













































WAR Hazard























20 FU Add Divide







21 FU Add Divide








22 FU Divide










61 FU Divide









Dynam

ic Scheduling

38

Virtual registers

Tomasulo Algorithm































to the memory unit














0 FU

Clock cycle counter

FU countdown

Instruction stream

3 LoadBuffers








1 FU Load1







2 FU Load2 Load1



































































































































































0 80 Fu

Tomasulo Drawbacks











OpQueue


FP Regs

Commit path













speculated










Observations






















Figure 330






























tatic Scheduling

85


















too conservative





Figure 333 amp 334

R2

R2

R2

No Speculation

R2

R2

R2

Speculation












I-cache

Fetch Buffer

IssueBuffer

FuncUnits

ArchState

Execute

Decode

ResultBuffer Commit

PC

Fetch

Branchexecuted

Next fetch started










J




After Reg Fetch


















x

A=B op C








BTB















ampnextaampnextb



fa() fb() nexta

fb() fc() nextb

fc() fd() nextc



BTBPC Predicted

Next PC

Fetch Unit




Mux





0

10

20

30

40

50

60

70


Mis

pre

dic

tio

n f

req

ue

ncy










stations










RenameTable









110








+

A B

+

Y X

+

A B

+

Y X

Guess

Guess

Guess







OOO Example





bull Before


bull After









Scoreboard







Scoreboard Example






FU






1 FU Integer







2 FU Integer







3 FU Integer







4 FU Integer







5 FU Integer







6 FU Mult1 Integer
































ClockRemainng


























12 FU Mult1 Divide













































WAR Hazard























20 FU Add Divide







21 FU Add Divide








22 FU Divide










61 FU Divide









Dynam

ic Scheduling

38

Virtual registers

Tomasulo Algorithm































to the memory unit














0 FU

Clock cycle counter

FU countdown

Instruction stream

3 LoadBuffers








1 FU Load1







2 FU Load2 Load1



































































































































































0 80 Fu

Tomasulo Drawbacks











OpQueue


FP Regs

Commit path













speculated










Observations






















Figure 330






























tatic Scheduling

85


















too conservative





Figure 333 amp 334

R2

R2

R2

No Speculation

R2

R2

R2

Speculation












I-cache

Fetch Buffer

IssueBuffer

FuncUnits

ArchState

Execute

Decode

ResultBuffer Commit

PC

Fetch

Branchexecuted

Next fetch started










J




After Reg Fetch


















x

A=B op C








BTB















ampnextaampnextb



fa() fb() nexta

fb() fc() nextb

fc() fd() nextc



BTBPC Predicted

Next PC

Fetch Unit




Mux





0

10

20

30

40

50

60

70


Mis

pre

dic

tio

n f

req

ue

ncy










stations










RenameTable









110








+

A B

+

Y X

+

A B

+

Y X

Guess

Guess

Guess








bull Before


bull After









Scoreboard







Scoreboard Example






FU






1 FU Integer







2 FU Integer







3 FU Integer







4 FU Integer







5 FU Integer







6 FU Mult1 Integer
































ClockRemainng


























12 FU Mult1 Divide













































WAR Hazard























20 FU Add Divide







21 FU Add Divide








22 FU Divide










61 FU Divide









Dynam

ic Scheduling

38

Virtual registers

Tomasulo Algorithm































to the memory unit














0 FU

Clock cycle counter

FU countdown

Instruction stream

3 LoadBuffers








1 FU Load1







2 FU Load2 Load1



































































































































































0 80 Fu

Tomasulo Drawbacks











OpQueue


FP Regs

Commit path













speculated










Observations






















Figure 330






























tatic Scheduling

85


















too conservative





Figure 333 amp 334

R2

R2

R2

No Speculation

R2

R2

R2

Speculation












I-cache

Fetch Buffer

IssueBuffer

FuncUnits

ArchState

Execute

Decode

ResultBuffer Commit

PC

Fetch

Branchexecuted

Next fetch started










J




After Reg Fetch


















x

A=B op C








BTB















ampnextaampnextb



fa() fb() nexta

fb() fc() nextb

fc() fd() nextc



BTBPC Predicted

Next PC

Fetch Unit




Mux





0

10

20

30

40

50

60

70


Mis

pre

dic

tio

n f

req

ue

ncy










stations










RenameTable









110








+

A B

+

Y X

+

A B

+

Y X

Guess

Guess

Guess












Scoreboard







Scoreboard Example






FU






1 FU Integer







2 FU Integer







3 FU Integer







4 FU Integer







5 FU Integer







6 FU Mult1 Integer
































ClockRemainng


























12 FU Mult1 Divide













































WAR Hazard























20 FU Add Divide







21 FU Add Divide








22 FU Divide










61 FU Divide









Dynam

ic Scheduling

38

Virtual registers

Tomasulo Algorithm































to the memory unit














0 FU

Clock cycle counter

FU countdown

Instruction stream

3 LoadBuffers








1 FU Load1







2 FU Load2 Load1



































































































































































0 80 Fu

Tomasulo Drawbacks











OpQueue


FP Regs

Commit path













speculated










Observations






















Figure 330






























tatic Scheduling

85


















too conservative





Figure 333 amp 334

R2

R2

R2

No Speculation

R2

R2

R2

Speculation












I-cache

Fetch Buffer

IssueBuffer

FuncUnits

ArchState

Execute

Decode

ResultBuffer Commit

PC

Fetch

Branchexecuted

Next fetch started










J




After Reg Fetch


















x

A=B op C








BTB















ampnextaampnextb



fa() fb() nexta

fb() fc() nextb

fc() fd() nextc



BTBPC Predicted

Next PC

Fetch Unit




Mux





0

10

20

30

40

50

60

70


Mis

pre

dic

tio

n f

req

ue

ncy










stations










RenameTable









110








+

A B

+

Y X

+

A B

+

Y X

Guess

Guess

Guess







Scoreboard Example






FU






1 FU Integer







2 FU Integer







3 FU Integer







4 FU Integer







5 FU Integer







6 FU Mult1 Integer
































ClockRemainng


























12 FU Mult1 Divide













































WAR Hazard























20 FU Add Divide







21 FU Add Divide








22 FU Divide










61 FU Divide









Dynam

ic Scheduling

38

Virtual registers

Tomasulo Algorithm































to the memory unit














0 FU

Clock cycle counter

FU countdown

Instruction stream

3 LoadBuffers








1 FU Load1







2 FU Load2 Load1



































































































































































0 80 Fu

Tomasulo Drawbacks











OpQueue


FP Regs

Commit path













speculated










Observations






















Figure 330






























tatic Scheduling

85


















too conservative





Figure 333 amp 334

R2

R2

R2

No Speculation

R2

R2

R2

Speculation












I-cache

Fetch Buffer

IssueBuffer

FuncUnits

ArchState

Execute

Decode

ResultBuffer Commit

PC

Fetch

Branchexecuted

Next fetch started










J




After Reg Fetch


















x

A=B op C








BTB















ampnextaampnextb



fa() fb() nexta

fb() fc() nextb

fc() fd() nextc



BTBPC Predicted

Next PC

Fetch Unit




Mux





0

10

20

30

40

50

60

70


Mis

pre

dic

tio

n f

req

ue

ncy










stations










RenameTable









110








+

A B

+

Y X

+

A B

+

Y X

Guess

Guess

Guess












1 FU Integer







2 FU Integer







3 FU Integer







4 FU Integer







5 FU Integer







6 FU Mult1 Integer
































ClockRemainng


























12 FU Mult1 Divide













































WAR Hazard























20 FU Add Divide







21 FU Add Divide








22 FU Divide










61 FU Divide









Dynam

ic Scheduling

38

Virtual registers

Tomasulo Algorithm































to the memory unit














0 FU

Clock cycle counter

FU countdown

Instruction stream

3 LoadBuffers








1 FU Load1







2 FU Load2 Load1



































































































































































0 80 Fu

Tomasulo Drawbacks











OpQueue


FP Regs

Commit path













speculated










Observations






















Figure 330






























tatic Scheduling

85


















too conservative





Figure 333 amp 334

R2

R2

R2

No Speculation

R2

R2

R2

Speculation












I-cache

Fetch Buffer

IssueBuffer

FuncUnits

ArchState

Execute

Decode

ResultBuffer Commit

PC

Fetch

Branchexecuted

Next fetch started










J




After Reg Fetch


















x

A=B op C








BTB















ampnextaampnextb



fa() fb() nexta

fb() fc() nextb

fc() fd() nextc



BTBPC Predicted

Next PC

Fetch Unit




Mux





0

10

20

30

40

50

60

70


Mis

pre

dic

tio

n f

req

ue

ncy










stations










RenameTable









110








+

A B

+

Y X

+

A B

+

Y X

Guess

Guess

Guess












2 FU Integer







3 FU Integer







4 FU Integer







5 FU Integer







6 FU Mult1 Integer
































ClockRemainng


























12 FU Mult1 Divide













































WAR Hazard























20 FU Add Divide







21 FU Add Divide








22 FU Divide










61 FU Divide









Dynam

ic Scheduling

38

Virtual registers

Tomasulo Algorithm































to the memory unit














0 FU

Clock cycle counter

FU countdown

Instruction stream

3 LoadBuffers








1 FU Load1







2 FU Load2 Load1



































































































































































0 80 Fu

Tomasulo Drawbacks











OpQueue


FP Regs

Commit path













speculated










Observations






















Figure 330






























tatic Scheduling

85


















too conservative





Figure 333 amp 334

R2

R2

R2

No Speculation

R2

R2

R2

Speculation












I-cache

Fetch Buffer

IssueBuffer

FuncUnits

ArchState

Execute

Decode

ResultBuffer Commit

PC

Fetch

Branchexecuted

Next fetch started










J




After Reg Fetch


















x

A=B op C








BTB















ampnextaampnextb



fa() fb() nexta

fb() fc() nextb

fc() fd() nextc



BTBPC Predicted

Next PC

Fetch Unit




Mux





0

10

20

30

40

50

60

70


Mis

pre

dic

tio

n f

req

ue

ncy










stations










RenameTable









110








+

A B

+

Y X

+

A B

+

Y X

Guess

Guess

Guess












3 FU Integer







4 FU Integer







5 FU Integer







6 FU Mult1 Integer
































ClockRemainng


























12 FU Mult1 Divide













































WAR Hazard























20 FU Add Divide







21 FU Add Divide








22 FU Divide










61 FU Divide









Dynam

ic Scheduling

38

Virtual registers

Tomasulo Algorithm































to the memory unit














0 FU

Clock cycle counter

FU countdown

Instruction stream

3 LoadBuffers








1 FU Load1







2 FU Load2 Load1



































































































































































0 80 Fu

Tomasulo Drawbacks











OpQueue


FP Regs

Commit path













speculated










Observations






















Figure 330






























tatic Scheduling

85


















too conservative





Figure 333 amp 334

R2

R2

R2

No Speculation

R2

R2

R2

Speculation












I-cache

Fetch Buffer

IssueBuffer

FuncUnits

ArchState

Execute

Decode

ResultBuffer Commit

PC

Fetch

Branchexecuted

Next fetch started










J




After Reg Fetch


















x

A=B op C








BTB















ampnextaampnextb



fa() fb() nexta

fb() fc() nextb

fc() fd() nextc



BTBPC Predicted

Next PC

Fetch Unit




Mux





0

10

20

30

40

50

60

70


Mis

pre

dic

tio

n f

req

ue

ncy










stations










RenameTable









110








+

A B

+

Y X

+

A B

+

Y X

Guess

Guess

Guess












4 FU Integer







5 FU Integer







6 FU Mult1 Integer
































ClockRemainng


























12 FU Mult1 Divide













































WAR Hazard























20 FU Add Divide







21 FU Add Divide








22 FU Divide










61 FU Divide









Dynam

ic Scheduling

38

Virtual registers

Tomasulo Algorithm































to the memory unit














0 FU

Clock cycle counter

FU countdown

Instruction stream

3 LoadBuffers








1 FU Load1







2 FU Load2 Load1



































































































































































0 80 Fu

Tomasulo Drawbacks











OpQueue


FP Regs

Commit path













speculated










Observations






















Figure 330






























tatic Scheduling

85


















too conservative





Figure 333 amp 334

R2

R2

R2

No Speculation

R2

R2

R2

Speculation












I-cache

Fetch Buffer

IssueBuffer

FuncUnits

ArchState

Execute

Decode

ResultBuffer Commit

PC

Fetch

Branchexecuted

Next fetch started










J




After Reg Fetch


















x

A=B op C








BTB















ampnextaampnextb



fa() fb() nexta

fb() fc() nextb

fc() fd() nextc



BTBPC Predicted

Next PC

Fetch Unit




Mux





0

10

20

30

40

50

60

70


Mis

pre

dic

tio

n f

req

ue

ncy










stations










RenameTable









110








+

A B

+

Y X

+

A B

+

Y X

Guess

Guess

Guess












5 FU Integer







6 FU Mult1 Integer
































ClockRemainng


























12 FU Mult1 Divide













































WAR Hazard























20 FU Add Divide







21 FU Add Divide








22 FU Divide










61 FU Divide









Dynam

ic Scheduling

38

Virtual registers

Tomasulo Algorithm































to the memory unit














0 FU

Clock cycle counter

FU countdown

Instruction stream

3 LoadBuffers








1 FU Load1







2 FU Load2 Load1



































































































































































0 80 Fu

Tomasulo Drawbacks











OpQueue


FP Regs

Commit path













speculated










Observations






















Figure 330






























tatic Scheduling

85


















too conservative





Figure 333 amp 334

R2

R2

R2

No Speculation

R2

R2

R2

Speculation












I-cache

Fetch Buffer

IssueBuffer

FuncUnits

ArchState

Execute

Decode

ResultBuffer Commit

PC

Fetch

Branchexecuted

Next fetch started










J




After Reg Fetch


















x

A=B op C








BTB















ampnextaampnextb



fa() fb() nexta

fb() fc() nextb

fc() fd() nextc



BTBPC Predicted

Next PC

Fetch Unit




Mux





0

10

20

30

40

50

60

70


Mis

pre

dic

tio

n f

req

ue

ncy










stations










RenameTable









110








+

A B

+

Y X

+

A B

+

Y X

Guess

Guess

Guess












6 FU Mult1 Integer
































ClockRemainng


























12 FU Mult1 Divide













































WAR Hazard























20 FU Add Divide







21 FU Add Divide








22 FU Divide










61 FU Divide









Dynam

ic Scheduling

38

Virtual registers

Tomasulo Algorithm































to the memory unit














0 FU

Clock cycle counter

FU countdown

Instruction stream

3 LoadBuffers








1 FU Load1







2 FU Load2 Load1



































































































































































0 80 Fu

Tomasulo Drawbacks











OpQueue


FP Regs

Commit path













speculated










Observations






















Figure 330






























tatic Scheduling

85


















too conservative





Figure 333 amp 334

R2

R2

R2

No Speculation

R2

R2

R2

Speculation












I-cache

Fetch Buffer

IssueBuffer

FuncUnits

ArchState

Execute

Decode

ResultBuffer Commit

PC

Fetch

Branchexecuted

Next fetch started










J




After Reg Fetch


















x

A=B op C








BTB















ampnextaampnextb



fa() fb() nexta

fb() fc() nextb

fc() fd() nextc



BTBPC Predicted

Next PC

Fetch Unit




Mux





0

10

20

30

40

50

60

70


Mis

pre

dic

tio

n f

req

ue

ncy










stations










RenameTable









110








+

A B

+

Y X

+

A B

+

Y X

Guess

Guess

Guess





































ClockRemainng


























12 FU Mult1 Divide













































WAR Hazard























20 FU Add Divide







21 FU Add Divide








22 FU Divide










61 FU Divide









Dynam

ic Scheduling

38

Virtual registers

Tomasulo Algorithm































to the memory unit














0 FU

Clock cycle counter

FU countdown

Instruction stream

3 LoadBuffers








1 FU Load1







2 FU Load2 Load1



































































































































































0 80 Fu

Tomasulo Drawbacks











OpQueue


FP Regs

Commit path













speculated










Observations






















Figure 330






























tatic Scheduling

85


















too conservative





Figure 333 amp 334

R2

R2

R2

No Speculation

R2

R2

R2

Speculation












I-cache

Fetch Buffer

IssueBuffer

FuncUnits

ArchState

Execute

Decode

ResultBuffer Commit

PC

Fetch

Branchexecuted

Next fetch started










J




After Reg Fetch


















x

A=B op C








BTB















ampnextaampnextb



fa() fb() nexta

fb() fc() nextb

fc() fd() nextc



BTBPC Predicted

Next PC

Fetch Unit




Mux





0

10

20

30

40

50

60

70


Mis

pre

dic

tio

n f

req

ue

ncy










stations










RenameTable









110








+

A B

+

Y X

+

A B

+

Y X

Guess

Guess

Guess































12 FU Mult1 Divide













































WAR Hazard























20 FU Add Divide







21 FU Add Divide








22 FU Divide










61 FU Divide









Dynam

ic Scheduling

38

Virtual registers

Tomasulo Algorithm































to the memory unit














0 FU

Clock cycle counter

FU countdown

Instruction stream

3 LoadBuffers








1 FU Load1







2 FU Load2 Load1



































































































































































0 80 Fu

Tomasulo Drawbacks











OpQueue


FP Regs

Commit path













speculated










Observations






















Figure 330






























tatic Scheduling

85


















too conservative





Figure 333 amp 334

R2

R2

R2

No Speculation

R2

R2

R2

Speculation












I-cache

Fetch Buffer

IssueBuffer

FuncUnits

ArchState

Execute

Decode

ResultBuffer Commit

PC

Fetch

Branchexecuted

Next fetch started










J




After Reg Fetch


















x

A=B op C








BTB















ampnextaampnextb



fa() fb() nexta

fb() fc() nextb

fc() fd() nextc



BTBPC Predicted

Next PC

Fetch Unit




Mux





0

10

20

30

40

50

60

70


Mis

pre

dic

tio

n f

req

ue

ncy










stations










RenameTable









110








+

A B

+

Y X

+

A B

+

Y X

Guess

Guess

Guess


















































WAR Hazard























20 FU Add Divide







21 FU Add Divide








22 FU Divide










61 FU Divide









Dynam

ic Scheduling

38

Virtual registers

Tomasulo Algorithm































to the memory unit














0 FU

Clock cycle counter

FU countdown

Instruction stream

3 LoadBuffers








1 FU Load1







2 FU Load2 Load1



































































































































































0 80 Fu

Tomasulo Drawbacks











OpQueue


FP Regs

Commit path













speculated










Observations






















Figure 330






























tatic Scheduling

85


















too conservative





Figure 333 amp 334

R2

R2

R2

No Speculation

R2

R2

R2

Speculation












I-cache

Fetch Buffer

IssueBuffer

FuncUnits

ArchState

Execute

Decode

ResultBuffer Commit

PC

Fetch

Branchexecuted

Next fetch started










J




After Reg Fetch


















x

A=B op C








BTB















ampnextaampnextb



fa() fb() nexta

fb() fc() nextb

fc() fd() nextc



BTBPC Predicted

Next PC

Fetch Unit




Mux





0

10

20

30

40

50

60

70


Mis

pre

dic

tio

n f

req

ue

ncy










stations










RenameTable









110








+

A B

+

Y X

+

A B

+

Y X

Guess

Guess

Guess




























20 FU Add Divide







21 FU Add Divide








22 FU Divide










61 FU Divide









Dynam

ic Scheduling

38

Virtual registers

Tomasulo Algorithm































to the memory unit














0 FU

Clock cycle counter

FU countdown

Instruction stream

3 LoadBuffers








1 FU Load1







2 FU Load2 Load1



































































































































































0 80 Fu

Tomasulo Drawbacks











OpQueue


FP Regs

Commit path













speculated










Observations






















Figure 330






























tatic Scheduling

85


















too conservative





Figure 333 amp 334

R2

R2

R2

No Speculation

R2

R2

R2

Speculation












I-cache

Fetch Buffer

IssueBuffer

FuncUnits

ArchState

Execute

Decode

ResultBuffer Commit

PC

Fetch

Branchexecuted

Next fetch started










J




After Reg Fetch


















x

A=B op C








BTB















ampnextaampnextb



fa() fb() nexta

fb() fc() nextb

fc() fd() nextc



BTBPC Predicted

Next PC

Fetch Unit




Mux





0

10

20

30

40

50

60

70


Mis

pre

dic

tio

n f

req

ue

ncy










stations










RenameTable









110








+

A B

+

Y X

+

A B

+

Y X

Guess

Guess

Guess












21 FU Add Divide








22 FU Divide










61 FU Divide









Dynam

ic Scheduling

38

Virtual registers

Tomasulo Algorithm































to the memory unit














0 FU

Clock cycle counter

FU countdown

Instruction stream

3 LoadBuffers








1 FU Load1







2 FU Load2 Load1



































































































































































0 80 Fu

Tomasulo Drawbacks











OpQueue


FP Regs

Commit path













speculated










Observations






















Figure 330






























tatic Scheduling

85


















too conservative





Figure 333 amp 334

R2

R2

R2

No Speculation

R2

R2

R2

Speculation












I-cache

Fetch Buffer

IssueBuffer

FuncUnits

ArchState

Execute

Decode

ResultBuffer Commit

PC

Fetch

Branchexecuted

Next fetch started










J




After Reg Fetch


















x

A=B op C








BTB















ampnextaampnextb



fa() fb() nexta

fb() fc() nextb

fc() fd() nextc



BTBPC Predicted

Next PC

Fetch Unit




Mux





0

10

20

30

40

50

60

70


Mis

pre

dic

tio

n f

req

ue

ncy










stations










RenameTable









110








+

A B

+

Y X

+

A B

+

Y X

Guess

Guess

Guess













22 FU Divide










61 FU Divide









Dynam

ic Scheduling

38

Virtual registers

Tomasulo Algorithm































to the memory unit














0 FU

Clock cycle counter

FU countdown

Instruction stream

3 LoadBuffers








1 FU Load1







2 FU Load2 Load1



































































































































































0 80 Fu

Tomasulo Drawbacks











OpQueue


FP Regs

Commit path













speculated










Observations






















Figure 330






























tatic Scheduling

85


















too conservative





Figure 333 amp 334

R2

R2

R2

No Speculation

R2

R2

R2

Speculation












I-cache

Fetch Buffer

IssueBuffer

FuncUnits

ArchState

Execute

Decode

ResultBuffer Commit

PC

Fetch

Branchexecuted

Next fetch started










J




After Reg Fetch


















x

A=B op C








BTB















ampnextaampnextb



fa() fb() nexta

fb() fc() nextb

fc() fd() nextc



BTBPC Predicted

Next PC

Fetch Unit




Mux





0

10

20

30

40

50

60

70


Mis

pre

dic

tio

n f

req

ue

ncy










stations










RenameTable









110








+

A B

+

Y X

+

A B

+

Y X

Guess

Guess

Guess















61 FU Divide









Dynam

ic Scheduling

38

Virtual registers

Tomasulo Algorithm































to the memory unit














0 FU

Clock cycle counter

FU countdown

Instruction stream

3 LoadBuffers








1 FU Load1







2 FU Load2 Load1



































































































































































0 80 Fu

Tomasulo Drawbacks











OpQueue


FP Regs

Commit path













speculated










Observations






















Figure 330






























tatic Scheduling

85


















too conservative





Figure 333 amp 334

R2

R2

R2

No Speculation

R2

R2

R2

Speculation












I-cache

Fetch Buffer

IssueBuffer

FuncUnits

ArchState

Execute

Decode

ResultBuffer Commit

PC

Fetch

Branchexecuted

Next fetch started










J




After Reg Fetch


















x

A=B op C








BTB















ampnextaampnextb



fa() fb() nexta

fb() fc() nextb

fc() fd() nextc



BTBPC Predicted

Next PC

Fetch Unit




Mux





0

10

20

30

40

50

60

70


Mis

pre

dic

tio

n f

req

ue

ncy










stations










RenameTable









110








+

A B

+

Y X

+

A B

+

Y X

Guess

Guess

Guess














Dynam

ic Scheduling

38

Virtual registers

Tomasulo Algorithm































to the memory unit














0 FU

Clock cycle counter

FU countdown

Instruction stream

3 LoadBuffers








1 FU Load1







2 FU Load2 Load1



































































































































































0 80 Fu

Tomasulo Drawbacks











OpQueue


FP Regs

Commit path













speculated










Observations






















Figure 330






























tatic Scheduling

85


















too conservative





Figure 333 amp 334

R2

R2

R2

No Speculation

R2

R2

R2

Speculation












I-cache

Fetch Buffer

IssueBuffer

FuncUnits

ArchState

Execute

Decode

ResultBuffer Commit

PC

Fetch

Branchexecuted

Next fetch started










J




After Reg Fetch


















x

A=B op C








BTB















ampnextaampnextb



fa() fb() nexta

fb() fc() nextb

fc() fd() nextc



BTBPC Predicted

Next PC

Fetch Unit




Mux





0

10

20

30

40

50

60

70


Mis

pre

dic

tio

n f

req

ue

ncy










stations










RenameTable









110








+

A B

+

Y X

+

A B

+

Y X

Guess

Guess

Guess







Tomasulo Algorithm































to the memory unit














0 FU

Clock cycle counter

FU countdown

Instruction stream

3 LoadBuffers








1 FU Load1







2 FU Load2 Load1



































































































































































0 80 Fu

Tomasulo Drawbacks











OpQueue


FP Regs

Commit path













speculated










Observations






















Figure 330






























tatic Scheduling

85


















too conservative





Figure 333 amp 334

R2

R2

R2

No Speculation

R2

R2

R2

Speculation












I-cache

Fetch Buffer

IssueBuffer

FuncUnits

ArchState

Execute

Decode

ResultBuffer Commit

PC

Fetch

Branchexecuted

Next fetch started










J




After Reg Fetch


















x

A=B op C








BTB















ampnextaampnextb



fa() fb() nexta

fb() fc() nextb

fc() fd() nextc



BTBPC Predicted

Next PC

Fetch Unit




Mux





0

10

20

30

40

50

60

70


Mis

pre

dic

tio

n f

req

ue

ncy










stations










RenameTable









110








+

A B

+

Y X

+

A B

+

Y X

Guess

Guess

Guess































to the memory unit














0 FU

Clock cycle counter

FU countdown

Instruction stream

3 LoadBuffers








1 FU Load1







2 FU Load2 Load1



































































































































































0 80 Fu

Tomasulo Drawbacks











OpQueue


FP Regs

Commit path













speculated










Observations






















Figure 330






























tatic Scheduling

85


















too conservative





Figure 333 amp 334

R2

R2

R2

No Speculation

R2

R2

R2

Speculation












I-cache

Fetch Buffer

IssueBuffer

FuncUnits

ArchState

Execute

Decode

ResultBuffer Commit

PC

Fetch

Branchexecuted

Next fetch started










J




After Reg Fetch


















x

A=B op C








BTB















ampnextaampnextb



fa() fb() nexta

fb() fc() nextb

fc() fd() nextc



BTBPC Predicted

Next PC

Fetch Unit




Mux





0

10

20

30

40

50

60

70


Mis

pre

dic

tio

n f

req

ue

ncy










stations










RenameTable









110








+

A B

+

Y X

+

A B

+

Y X

Guess

Guess

Guess



















0 FU

Clock cycle counter

FU countdown

Instruction stream

3 LoadBuffers








1 FU Load1







2 FU Load2 Load1



































































































































































0 80 Fu

Tomasulo Drawbacks











OpQueue


FP Regs

Commit path













speculated










Observations






















Figure 330






























tatic Scheduling

85


















too conservative





Figure 333 amp 334

R2

R2

R2

No Speculation

R2

R2

R2

Speculation












I-cache

Fetch Buffer

IssueBuffer

FuncUnits

ArchState

Execute

Decode

ResultBuffer Commit

PC

Fetch

Branchexecuted

Next fetch started










J




After Reg Fetch


















x

A=B op C








BTB















ampnextaampnextb



fa() fb() nexta

fb() fc() nextb

fc() fd() nextc



BTBPC Predicted

Next PC

Fetch Unit




Mux





0

10

20

30

40

50

60

70


Mis

pre

dic

tio

n f

req

ue

ncy










stations










RenameTable









110








+

A B

+

Y X

+

A B

+

Y X

Guess

Guess

Guess












1 FU Load1







2 FU Load2 Load1



































































































































































0 80 Fu

Tomasulo Drawbacks











OpQueue


FP Regs

Commit path













speculated










Observations






















Figure 330






























tatic Scheduling

85


















too conservative





Figure 333 amp 334

R2

R2

R2

No Speculation

R2

R2

R2

Speculation












I-cache

Fetch Buffer

IssueBuffer

FuncUnits

ArchState

Execute

Decode

ResultBuffer Commit

PC

Fetch

Branchexecuted

Next fetch started










J




After Reg Fetch


















x

A=B op C








BTB















ampnextaampnextb



fa() fb() nexta

fb() fc() nextb

fc() fd() nextc



BTBPC Predicted

Next PC

Fetch Unit




Mux





0

10

20

30

40

50

60

70


Mis

pre

dic

tio

n f

req

ue

ncy










stations










RenameTable









110








+

A B

+

Y X

+

A B

+

Y X

Guess

Guess

Guess












2 FU Load2 Load1



































































































































































0 80 Fu

Tomasulo Drawbacks











OpQueue


FP Regs

Commit path













speculated










Observations






















Figure 330






























tatic Scheduling

85


















too conservative





Figure 333 amp 334

R2

R2

R2

No Speculation

R2

R2

R2

Speculation












I-cache

Fetch Buffer

IssueBuffer

FuncUnits

ArchState

Execute

Decode

ResultBuffer Commit

PC

Fetch

Branchexecuted

Next fetch started










J




After Reg Fetch


















x

A=B op C








BTB















ampnextaampnextb



fa() fb() nexta

fb() fc() nextb

fc() fd() nextc



BTBPC Predicted

Next PC

Fetch Unit




Mux





0

10

20

30

40

50

60

70


Mis

pre

dic

tio

n f

req

ue

ncy










stations










RenameTable









110








+

A B

+

Y X

+

A B

+

Y X

Guess

Guess

Guess








































































































































































0 80 Fu

Tomasulo Drawbacks











OpQueue


FP Regs

Commit path













speculated










Observations






















Figure 330






























tatic Scheduling

85


















too conservative





Figure 333 amp 334

R2

R2

R2

No Speculation

R2

R2

R2

Speculation












I-cache

Fetch Buffer

IssueBuffer

FuncUnits

ArchState

Execute

Decode

ResultBuffer Commit

PC

Fetch

Branchexecuted

Next fetch started










J




After Reg Fetch


















x

A=B op C








BTB















ampnextaampnextb



fa() fb() nexta

fb() fc() nextb

fc() fd() nextc



BTBPC Predicted

Next PC

Fetch Unit




Mux





0

10

20

30

40

50

60

70


Mis

pre

dic

tio

n f

req

ue

ncy










stations










RenameTable









110








+

A B

+

Y X

+

A B

+

Y X

Guess

Guess

Guess







Tomasulo Drawbacks











OpQueue


FP Regs

Commit path













speculated










Observations






















Figure 330






























tatic Scheduling

85


















too conservative





Figure 333 amp 334

R2

R2

R2

No Speculation

R2

R2

R2

Speculation












I-cache

Fetch Buffer

IssueBuffer

FuncUnits

ArchState

Execute

Decode

ResultBuffer Commit

PC

Fetch

Branchexecuted

Next fetch started










J




After Reg Fetch


















x

A=B op C








BTB















ampnextaampnextb



fa() fb() nexta

fb() fc() nextb

fc() fd() nextc



BTBPC Predicted

Next PC

Fetch Unit




Mux





0

10

20

30

40

50

60

70


Mis

pre

dic

tio

n f

req

ue

ncy










stations










RenameTable









110








+

A B

+

Y X

+

A B

+

Y X

Guess

Guess

Guess











OpQueue


FP Regs

Commit path













speculated










Observations






















Figure 330






























tatic Scheduling

85


















too conservative





Figure 333 amp 334

R2

R2

R2

No Speculation

R2

R2

R2

Speculation












I-cache

Fetch Buffer

IssueBuffer

FuncUnits

ArchState

Execute

Decode

ResultBuffer Commit

PC

Fetch

Branchexecuted

Next fetch started










J




After Reg Fetch


















x

A=B op C








BTB















ampnextaampnextb



fa() fb() nexta

fb() fc() nextb

fc() fd() nextc



BTBPC Predicted

Next PC

Fetch Unit




Mux





0

10

20

30

40

50

60

70


Mis

pre

dic

tio

n f

req

ue

ncy










stations










RenameTable









110








+

A B

+

Y X

+

A B

+

Y X

Guess

Guess

Guess


















speculated










Observations






















Figure 330






























tatic Scheduling

85


















too conservative





Figure 333 amp 334

R2

R2

R2

No Speculation

R2

R2

R2

Speculation












I-cache

Fetch Buffer

IssueBuffer

FuncUnits

ArchState

Execute

Decode

ResultBuffer Commit

PC

Fetch

Branchexecuted

Next fetch started










J




After Reg Fetch


















x

A=B op C








BTB















ampnextaampnextb



fa() fb() nexta

fb() fc() nextb

fc() fd() nextc



BTBPC Predicted

Next PC

Fetch Unit




Mux





0

10

20

30

40

50

60

70


Mis

pre

dic

tio

n f

req

ue

ncy










stations










RenameTable









110








+

A B

+

Y X

+

A B

+

Y X

Guess

Guess

Guess















Observations






















Figure 330






























tatic Scheduling

85


















too conservative





Figure 333 amp 334

R2

R2

R2

No Speculation

R2

R2

R2

Speculation












I-cache

Fetch Buffer

IssueBuffer

FuncUnits

ArchState

Execute

Decode

ResultBuffer Commit

PC

Fetch

Branchexecuted

Next fetch started










J




After Reg Fetch


















x

A=B op C








BTB















ampnextaampnextb



fa() fb() nexta

fb() fc() nextb

fc() fd() nextc



BTBPC Predicted

Next PC

Fetch Unit




Mux





0

10

20

30

40

50

60

70


Mis

pre

dic

tio

n f

req

ue

ncy










stations










RenameTable









110








+

A B

+

Y X

+

A B

+

Y X

Guess

Guess

Guess

























Figure 330






























tatic Scheduling

85


















too conservative





Figure 333 amp 334

R2

R2

R2

No Speculation

R2

R2

R2

Speculation












I-cache

Fetch Buffer

IssueBuffer

FuncUnits

ArchState

Execute

Decode

ResultBuffer Commit

PC

Fetch

Branchexecuted

Next fetch started










J




After Reg Fetch


















x

A=B op C








BTB















ampnextaampnextb



fa() fb() nexta

fb() fc() nextb

fc() fd() nextc



BTBPC Predicted

Next PC

Fetch Unit




Mux





0

10

20

30

40

50

60

70


Mis

pre

dic

tio

n f

req

ue

ncy










stations










RenameTable









110








+

A B

+

Y X

+

A B

+

Y X

Guess

Guess

Guess




































tatic Scheduling

85


















too conservative





Figure 333 amp 334

R2

R2

R2

No Speculation

R2

R2

R2

Speculation












I-cache

Fetch Buffer

IssueBuffer

FuncUnits

ArchState

Execute

Decode

ResultBuffer Commit

PC

Fetch

Branchexecuted

Next fetch started










J




After Reg Fetch


















x

A=B op C








BTB















ampnextaampnextb



fa() fb() nexta

fb() fc() nextb

fc() fd() nextc



BTBPC Predicted

Next PC

Fetch Unit




Mux





0

10

20

30

40

50

60

70


Mis

pre

dic

tio

n f

req

ue

ncy










stations










RenameTable









110








+

A B

+

Y X

+

A B

+

Y X

Guess

Guess

Guess
























too conservative





Figure 333 amp 334

R2

R2

R2

No Speculation

R2

R2

R2

Speculation












I-cache

Fetch Buffer

IssueBuffer

FuncUnits

ArchState

Execute

Decode

ResultBuffer Commit

PC

Fetch

Branchexecuted

Next fetch started










J




After Reg Fetch


















x

A=B op C








BTB















ampnextaampnextb



fa() fb() nexta

fb() fc() nextb

fc() fd() nextc



BTBPC Predicted

Next PC

Fetch Unit




Mux





0

10

20

30

40

50

60

70


Mis

pre

dic

tio

n f

req

ue

ncy










stations










RenameTable









110








+

A B

+

Y X

+

A B

+

Y X

Guess

Guess

Guess











Figure 333 amp 334

R2

R2

R2

No Speculation

R2

R2

R2

Speculation












I-cache

Fetch Buffer

IssueBuffer

FuncUnits

ArchState

Execute

Decode

ResultBuffer Commit

PC

Fetch

Branchexecuted

Next fetch started










J




After Reg Fetch


















x

A=B op C








BTB















ampnextaampnextb



fa() fb() nexta

fb() fc() nextb

fc() fd() nextc



BTBPC Predicted

Next PC

Fetch Unit




Mux





0

10

20

30

40

50

60

70


Mis

pre

dic

tio

n f

req

ue

ncy










stations










RenameTable









110








+

A B

+

Y X

+

A B

+

Y X

Guess

Guess

Guess







R2

R2

R2

Speculation












I-cache

Fetch Buffer

IssueBuffer

FuncUnits

ArchState

Execute

Decode

ResultBuffer Commit

PC

Fetch

Branchexecuted

Next fetch started










J




After Reg Fetch


















x

A=B op C








BTB















ampnextaampnextb



fa() fb() nexta

fb() fc() nextb

fc() fd() nextc



BTBPC Predicted

Next PC

Fetch Unit




Mux





0

10

20

30

40

50

60

70


Mis

pre

dic

tio

n f

req

ue

ncy










stations










RenameTable









110








+

A B

+

Y X

+

A B

+

Y X

Guess

Guess

Guess

















I-cache

Fetch Buffer

IssueBuffer

FuncUnits

ArchState

Execute

Decode

ResultBuffer Commit

PC

Fetch

Branchexecuted

Next fetch started










J




After Reg Fetch


















x

A=B op C








BTB















ampnextaampnextb



fa() fb() nexta

fb() fc() nextb

fc() fd() nextc



BTBPC Predicted

Next PC

Fetch Unit




Mux





0

10

20

30

40

50

60

70


Mis

pre

dic

tio

n f

req

ue

ncy










stations










RenameTable









110








+

A B

+

Y X

+

A B

+

Y X

Guess

Guess

Guess












J




After Reg Fetch


















x

A=B op C








BTB















ampnextaampnextb



fa() fb() nexta

fb() fc() nextb

fc() fd() nextc



BTBPC Predicted

Next PC

Fetch Unit




Mux





0

10

20

30

40

50

60

70


Mis

pre

dic

tio

n f

req

ue

ncy










stations










RenameTable









110








+

A B

+

Y X

+

A B

+

Y X

Guess

Guess

Guess























x

A=B op C








BTB















ampnextaampnextb



fa() fb() nexta

fb() fc() nextb

fc() fd() nextc



BTBPC Predicted

Next PC

Fetch Unit




Mux





0

10

20

30

40

50

60

70


Mis

pre

dic

tio

n f

req

ue

ncy










stations










RenameTable









110








+

A B

+

Y X

+

A B

+

Y X

Guess

Guess

Guess














BTB















ampnextaampnextb



fa() fb() nexta

fb() fc() nextb

fc() fd() nextc



BTBPC Predicted

Next PC

Fetch Unit




Mux





0

10

20

30

40

50

60

70


Mis

pre

dic

tio

n f

req

ue

ncy










stations










RenameTable









110








+

A B

+

Y X

+

A B

+

Y X

Guess

Guess

Guess



















ampnextaampnextb



fa() fb() nexta

fb() fc() nextb

fc() fd() nextc



BTBPC Predicted

Next PC

Fetch Unit




Mux





0

10

20

30

40

50

60

70


Mis

pre

dic

tio

n f

req

ue

ncy










stations










RenameTable









110








+

A B

+

Y X

+

A B

+

Y X

Guess

Guess

Guess








BTBPC Predicted

Next PC

Fetch Unit




Mux





0

10

20

30

40

50

60

70


Mis

pre

dic

tio

n f

req

ue

ncy










stations










RenameTable









110








+

A B

+

Y X

+

A B

+

Y X

Guess

Guess

Guess











0

10

20

30

40

50

60

70


Mis

pre

dic

tio

n f

req

ue

ncy










stations










RenameTable









110








+

A B

+

Y X

+

A B

+

Y X

Guess

Guess

Guess















stations










RenameTable









110








+

A B

+

Y X

+

A B

+

Y X

Guess

Guess

Guess












RenameTable









110








+

A B

+

Y X

+

A B

+

Y X

Guess

Guess

Guess















110








+

A B

+

Y X

+

A B

+

Y X

Guess

Guess

Guess














+

A B

+

Y X

+

A B

+

Y X

Guess

Guess

Guess







computer architecture lecture 6: dynamic scheduling for...

Documents