Module: Speculative Execution
© Krishna V. Palem, Weng Fai Wong, and Sudhakar Yalamanchili, Georgia Institute of Technology (slides contributed by Prof. Weng Fai Wong were prepared while visiting, and employed by, Georgia Tech)
2ECE 4100/6100 (2)
Reading for This Module
Speculative Execution – Section 3.7
The Reorder Buffer and Register Renaming – Section 3.7
Multithreading – Section 6.9
Additional Reading – Section 3.10, Section 4.5 (pp. 345-350)
Hyperthreading – ftp://download.intel.com/technology/itj/2002/volume06issue01/art01_hyper/vol6iss1_art01.pdf
P4 Microarchitecture – http://www.intel.com/technology/itj/q12001/articles/art_2.htm
Speculation
Speculative execution is the execution of instructions before it is known if it is safe to do so
Rely on branch prediction to get the branch direction right in most cases
Speculation vs. Prediction
Prediction is targeted at instruction fetch
– Prediction is de-coupled from the decision to execute fetched instructions
– Prediction helps boost the issue rate
Speculation refers to the execution of predicted instructions
[Figure: pipeline IF, ID, parallel execution units (EX INT, EX FP, EX BR), MEM, WB]
Keep the instruction pipeline full via prediction
Keep the out-of-order execution core full via speculation
Maintain correctness of out-of-order execution
Speculation
Hardware-based speculation, as an extension of dynamic scheduling, is composed of
– Branch prediction to select instructions to be speculatively executed
– Dynamic scheduling: what we have seen so far
– Execution
– Commitment: update machine state
– Exception handling
Challenges
– Handling multiple execution completions per cycle
– Enforcing dependencies to ensure correctness
– Handling exceptions
The Reorder Buffer
Principle
Basic block sizes are not very large
– Prediction can increase the issue rate but not the completion rate
– Boosting issue rate by itself is insufficient
The completion rate has to be increased to keep up with the issue rate
– Need speculative execution
Key idea: separate instruction execution from instruction commitment
– Compute on a need-to-know basis until speculation outcome is determined
[Figure: processor datapath: I-Fetch, Execution Core, Retire]
Issues
What is commitment?
– Updating the register file!
– Permanent update to the machine state
What should be the criteria?
– Commitment is performed in program order
How to enforce the criteria?
– Reorder the instructions that complete out-of-order
– Reorder Buffer
The Reorder Buffer
Initially proposed to support precise interrupts
Handles output and anti-dependences
– Another form of register renaming
Does not take care of flow dependences
A FIFO circular queue
Three Simple Steps
Every instruction gets a reorder entry allocated in-order as it is issued - the entry is marked “invalid”
When an instruction completes, it writes its result to the corresponding entry in the reorder buffer – the entry is now valid
When the entry at the head of the reorder buffer is “valid” it is committed to the register file
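The three steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the hardware design: the names `ReorderBuffer`, `issue`, `complete`, and `commit` are hypothetical, the ROB is modeled as a software queue, and the commit step drains every valid entry at the head rather than one per cycle.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass
class RobEntry:
    dest: str                    # architectural destination register
    value: Optional[int] = None  # result, filled in at completion
    valid: bool = False          # "invalid" until the instruction completes


class ReorderBuffer:
    def __init__(self, size: int):
        self.size = size
        self.entries = deque()   # head of the queue is the oldest instruction
        self.regfile = {}        # committed (architectural) register state

    def issue(self, dest: str) -> RobEntry:
        """Step 1: allocate an entry in program order, marked invalid."""
        if len(self.entries) == self.size:
            raise RuntimeError("ROB full: issue must stall")
        entry = RobEntry(dest)
        self.entries.append(entry)
        return entry

    def complete(self, entry: RobEntry, value: int) -> None:
        """Step 2: write the result into the entry; it is now valid."""
        entry.value = value
        entry.valid = True

    def commit(self) -> list:
        """Step 3: retire valid entries from the head, in program order."""
        committed = []
        while self.entries and self.entries[0].valid:
            head = self.entries.popleft()
            self.regfile[head.dest] = head.value
            committed.append(head.dest)
        return committed


rob = ReorderBuffer(size=4)
i1 = rob.issue("F0")
i2 = rob.issue("F2")
rob.complete(i2, 20)     # i2 finishes first (out of order)
blocked = rob.commit()   # nothing retires: the head (i1) is still invalid
rob.complete(i1, 10)
retired = rob.commit()   # both retire, in program order
```

The second instruction completes first, yet nothing reaches the register file until the instruction ahead of it has also completed.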
[Figure: floating-point unit with a ROB: From Instruction Unit, FP Ops "Stack", Decoder, FP Registers, Load Buffer (From Memory), Store Buffer (To Memory), Operand Busses, Operation Bus, Reservation Stations, FP Add (2 stage), FP Mul/Div (6 stage), Common Data Bus, ROB]
Structure/Operation of the ROB
Issue/dispatch must now issue a ROB entry
– ROB tag is used in renaming
Execute in a data-driven manner
– Write results on the CDB using the ROB tag
Commit instructions in-order
– Commit valid instructions at the head of the ROB
– Incorrect branches cause the ROB to be flushed and execution restarted
[Figure: ROB entry fields: I-Type (branch, memory, register), Dest (register, memory address), Value, Ready (status), Speculation info (speculative? identify which block?)]
Why do you need this information?
The Result
Results are written into the register file in-order
Destination registers are effectively renamed to reorder buffer entries
– each instruction writes to a new “destination register”
Using the Speculative Bits
Speculative instructions are marked in the reorder buffer by a special “speculative” bit
– Should a branch become confirmed, it will turn the “speculative” bits of the corresponding speculative instructions to “confirm”
– If it is not confirmed, status is set to “not confirmed”
When an instruction reaches the head of the reorder buffer
– If it is marked “speculative”, commitment is stalled until its status is determined, i.e., it is no longer speculative
– If it is marked “confirm”, commit the instruction
– If it is marked as “not confirmed”, its result is discarded
These are known as speculative writebacks
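The three-way decision at the head of the reorder buffer can be sketched as follows. This is an illustrative model only: the `Status` enum, the dictionary-based entries, and the `branch` field that ties an instruction to the branch it depends on are all assumptions made for this example.

```python
from enum import Enum


class Status(Enum):
    SPECULATIVE = "speculative"
    CONFIRM = "confirm"
    NOT_CONFIRMED = "not confirmed"


def resolve_branch(rob, branch_id, predicted_correctly):
    """Update every speculative entry that was issued under the given branch."""
    new_status = Status.CONFIRM if predicted_correctly else Status.NOT_CONFIRMED
    for entry in rob:
        if entry["branch"] == branch_id and entry["status"] is Status.SPECULATIVE:
            entry["status"] = new_status


def commit_head(rob, regfile):
    """Process the head entry; return 'empty', 'stall', 'commit', or 'discard'."""
    if not rob:
        return "empty"
    head = rob[0]
    if head["status"] is Status.SPECULATIVE:
        return "stall"                 # outcome unknown: commitment waits
    rob.pop(0)
    if head["status"] is Status.CONFIRM:
        regfile[head["dest"]] = head["value"]
        return "commit"
    return "discard"                   # mis-speculated result is dropped


rob = [
    {"dest": "R1", "value": 5, "branch": "b1", "status": Status.SPECULATIVE},
    {"dest": "R2", "value": 7, "branch": "b2", "status": Status.SPECULATIVE},
]
regfile = {}
first = commit_head(rob, regfile)      # head is still speculative
resolve_branch(rob, "b1", True)
second = commit_head(rob, regfile)     # R1 is committed
resolve_branch(rob, "b2", False)
third = commit_head(rob, regfile)      # R2's result is discarded
```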
Speculative Memory References
Loads/stores do exhibit flow, output and anti-dependences
Speculative stores are a problem
– Use a store buffer and manage it like a re-order buffer
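A store buffer managed like a reorder buffer might look like the sketch below. The slide only names the idea; the class `StoreBuffer`, the forwarding of pending stores to later loads, and the `squash` operation (which assumes every pending store is speculative) are assumptions added for illustration.

```python
class StoreBuffer:
    def __init__(self, memory):
        self.memory = memory   # committed memory state (address -> value)
        self.pending = []      # uncommitted stores, oldest first

    def store(self, addr, value):
        """Hold the store in the buffer instead of writing memory."""
        self.pending.append((addr, value))

    def load(self, addr):
        """Return the youngest pending store to addr, else committed memory."""
        for pending_addr, value in reversed(self.pending):
            if pending_addr == addr:
                return value
        return self.memory.get(addr, 0)

    def commit_oldest(self):
        """Write the oldest pending store to memory, in program order."""
        addr, value = self.pending.pop(0)
        self.memory[addr] = value

    def squash(self):
        """Drop all pending stores after a misprediction."""
        self.pending.clear()


memory = {0x10: 1}
sb = StoreBuffer(memory)
sb.store(0x10, 5)
seen_speculatively = sb.load(0x10)   # the pending store is visible to loads
in_memory = memory[0x10]             # memory itself is unchanged
sb.squash()
after_squash = sb.load(0x10)
sb.store(0x10, 7)
sb.commit_oldest()
```

Memory is updated only at commit, so a squashed store never becomes visible outside the buffer.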
Simple solution
Only one load/store unit
The reservation stations for this single load/store unit form a queue
Process memory operations in strict queue order (in-order)
Inefficient
Separate Load/Store Units
An Example - Alpha 21264
[Figure: Alpha 21264 block diagram: instruction cache; instruction processing and dispatch; integer issue and FP issue; integer execution unit and FP execution unit; load queue and store queue (both 32 entry reorder buffers); data cache; memory interface]
Parallel Retirement
Retiring only one instruction per cycle is also a bottleneck
Can retire instructions in parallel
Advantage: free up more reorder buffer entries quickly
– Does not affect instruction execution directly as instructions can read from the reorder buffer directly
Parallel Retirement
[Figure: parallel retirement datapath: Instruction Results, Reorder Buffer, Retirement Logic, Register File, Instruction Operands]
Parallel Retirement
Although retiring in parallel, retirement logic must guarantee in-order retirement
– must check valid bits in sequence
– must check destination register number
Requires more ports to the register file
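The two checks named on this slide can be sketched as a function that retires up to a fixed number of entries per cycle. The function name `retire_parallel`, the `width` parameter, and the dictionary-based entries are assumptions for this example; real retirement logic performs these checks in hardware within one cycle.

```python
def retire_parallel(rob, regfile, width):
    """Retire up to `width` entries, stopping at the first invalid one."""
    writes = {}
    retired = 0
    # Valid bits are checked in sequence, so retirement stays in order.
    while retired < width and retired < len(rob) and rob[retired]["valid"]:
        entry = rob[retired]
        # Destination numbers are compared: if two retiring entries write
        # the same register, only the later value survives.
        writes[entry["dest"]] = entry["value"]
        retired += 1
    del rob[:retired]
    regfile.update(writes)
    return retired


rob = [
    {"dest": "R1", "value": 1, "valid": True},
    {"dest": "R1", "value": 2, "valid": True},
    {"dest": "R2", "value": 3, "valid": False},
    {"dest": "R3", "value": 4, "valid": True},
]
regfile = {}
count = retire_parallel(rob, regfile, width=4)
```

The fourth entry is valid but cannot retire, because the invalid third entry ahead of it blocks in-order retirement.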
Forwarding from the ROB
Results from the ROB can be forwarded directly to executing instructions
– Can read valid results directly from the reorder buffer
Suppose two reorder buffer entries write to the same destination register, say R0
For an instruction reading from R0, extra hardware must decide which is the right one to read from
– The later of the two instructions writing to R0 has the higher priority
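The priority rule can be sketched as a search from the youngest ROB entry backwards. The function `read_operand` and its return convention are hypothetical; in hardware the same decision is made by an associative lookup with a priority encoder rather than a loop.

```python
def read_operand(reg, rob, regfile):
    """Return (source, value) for a read of `reg`.

    `rob` is ordered oldest to youngest, so scanning it in reverse
    finds the most recent writer of `reg` first.
    """
    for entry in reversed(rob):
        if entry["dest"] == reg:
            if entry["valid"]:
                return ("rob", entry["value"])
            return ("wait", None)   # latest writer has not completed yet
    return ("regfile", regfile[reg])


regfile = {"R0": 0, "R1": 9}
rob = [
    {"dest": "R0", "value": 1, "valid": True},
    {"dest": "R0", "value": 2, "valid": True},
]
from_rob = read_operand("R0", rob, regfile)      # the later writer wins
from_regfile = read_operand("R1", rob, regfile)  # no writer in flight
rob[1]["valid"] = False
must_wait = read_operand("R0", rob, regfile)     # older value must not be used
```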
Register Renaming
Dependencies and Register Pressure
Registers are re-used over the life of a program
Compilers provide a static scheme for re-using registers
Speculative execution creates a greater demand for registers to eliminate name dependencies
Register renaming increases issue rate
Renaming Used at Different Points
Extend the resources available for renaming
– More physical registers are available than are visible in the ISA
Renaming performed at/during ID or prior to issue
– Number of registers determines how many instructions can exist between issue and commit
[Figure: pipeline IF, ID, parallel execution units (EX INT, EX FP, EX BR), MEM, MEM, WB, marking where values become available for forwarding and where they become available for commitment]
Principle
Instructions specify logical or architecture registers
At instruction issue a logical register is re-mapped or re-named to one of a larger pool of physical registers
[Figure: Register Re-Map Table (Logical Register File) with entries R0, R1, ..., R7, each entry containing the name of a physical register; Physical Register File with registers P0, P1, ..., P11]
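A small sketch shows how the re-map table removes name dependences. It assumes 8 logical and 12 physical registers as in the figure, an initial identity mapping, and a simple list of free registers; the function `rename` and the instruction encoding as `(dst, src1, src2)` triples are hypothetical, and freeing of registers is left out.

```python
def rename(instructions, num_logical, num_physical):
    """Rename (dst, src1, src2) triples of logical register numbers."""
    remap = list(range(num_logical))               # logical i -> physical i
    free = list(range(num_logical, num_physical))  # unused physical registers
    renamed = []
    for dst, src1, src2 in instructions:
        p1, p2 = remap[src1], remap[src2]  # sources read the current mapping
        pd = free.pop(0)                   # destination gets a fresh register
        remap[dst] = pd
        renamed.append((pd, p1, p2))
    return renamed, remap


# R1 = R2 + R3 ; R4 = R1 + R2 ; R1 = R5 + R6  (the last write re-uses R1)
program = [(1, 2, 3), (4, 1, 2), (1, 5, 6)]
renamed, remap = rename(program, num_logical=8, num_physical=12)
```

After renaming, the two writes to R1 target different physical registers (P8 and P10), so the third instruction no longer has to wait for the second one to read R1.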
Example: IBM RS 6000
Add a few extra registers to be re-used over the life of the program
[Figure: RS 6000 scheme: logical registers R0, R1, R2, ..., R7 plus extra registers; free registers; registers in use]
How do we keep track of this mapping information?
– Index a table with register number: mapping table
– Keep track of free registers available for renaming
– Keep track of registers currently in use
When Is It Safe to Re-Use a Register?
If no active instruction is using that register, it can be re-used
One approach is to check the registers being used by all active instructions
– Expensive
Another approach is to perform checks at instruction commitment
Case Study: MIPS R10000
MIPS R10000
There are 32 logical registers
– 5 bit logical register specifiers
There are 64 physical registers
– 6 bit physical register identifiers
Main Data Structures
The Register Map Table
The Free Register List
The Active List
The Busy Bit Table
Duplicated for General Purpose and Floating Point Registers
The Register Map Table
A multi-ported Static Random Access Memory (SRAM)
Takes 5 bit addresses
Deliver 6 bit results
For each instruction that may be issued in one cycle, requires three read ports
– ADD.D F0, F2, F4
– Need at least one write port per instruction that can be retired in a cycle (recall parallel retirement)
Active List
A FIFO queue - similar in function to the reorder buffer
Each instruction has a corresponding active list entry
Processing the head of the active list is called instruction retirement or graduation (what we referred to as commitment)
Free Register List
A FIFO queue of physical registers that are available for reuse
Busy Bit Table
A table to indicate the availability of source operands
Busy bit in the instruction queue entry must be updated constantly
– Each time a physical register is being written, all corresponding busy bits in the instruction queues must be updated
Functional Unit Instruction Queue
Equivalent of reservation stations for each functional unit
Consists of
– opcode
– ready bit of physical register operands
– physical source register identifiers
– physical destination register identifier
– a TAG field for locating the corresponding active list entry
MIPS R10000 RMT
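[Figure: MIPS R10000 renaming datapath: an instruction (op, src1, src2, dst) indexes the Register Map Table; the Free Register List supplies the New Pdst; the Busy Bit Table supplies the Ready field; FU Instruction Queue entries hold Op, Ready, Psrc1, Psrc2, Pdst, Tag; active list entries hold Old Pdst, Dst, Done Bit]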
Upon Instruction Issue...
Each instruction gets the following allocated
an entry in the corresponding FU instruction queue (reservation station)
an entry in the active list (reorder buffer)
a new physical destination register from the free register list (register renaming)
Next...
The two 5 bit logical source register specifiers are used to access the RMT to obtain the corresponding physical registers
The 5 bit logical destination register specifier is used to access the RMT
– The output is written to the corresponding active list entry
– The busy bit for the physical destination is set
Instruction Execution
When both physical source registers are ready, proceed with operand read
– Takes care of flow dependences
Result is written directly to the physical destination register
Update Busy Bit Table
DONE bit in active list entry is set
Instruction Retirement
When the entry at the head of the active list is marked “DONE”, proceed to retire instruction
– Old physical register is released to free register list for reuse
Each allocated physical register is written exactly once
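The issue, execution, and retirement steps of the R10000 scheme can be tied together in one sketch. It is a simplified software model under stated assumptions: the class `Renamer` and its method names are hypothetical, only one register class is modeled, the initial mapping is the identity, and the FU instruction queues are omitted.

```python
from collections import deque

NUM_LOGICAL, NUM_PHYSICAL = 32, 64


class Renamer:
    def __init__(self):
        self.rmt = list(range(NUM_LOGICAL))                  # Register Map Table
        self.free = deque(range(NUM_LOGICAL, NUM_PHYSICAL))  # Free Register List
        self.busy = [False] * NUM_PHYSICAL                   # Busy Bit Table
        self.active = deque()                                # Active List (FIFO)

    def issue(self, dst, src1, src2):
        psrc = (self.rmt[src1], self.rmt[src2])  # look up physical sources
        old_pdst = self.rmt[dst]                 # kept in the active list entry
        new_pdst = self.free.popleft()           # IndexError if none are free
        self.rmt[dst] = new_pdst
        self.busy[new_pdst] = True               # result not yet available
        entry = {"old_pdst": old_pdst, "pdst": new_pdst,
                 "psrc": psrc, "done": False}
        self.active.append(entry)
        return entry

    def ready(self, entry):
        return not any(self.busy[p] for p in entry["psrc"])

    def execute(self, entry):
        if not self.ready(entry):
            raise RuntimeError("source operands are still busy")
        self.busy[entry["pdst"]] = False         # update the Busy Bit Table
        entry["done"] = True                     # set DONE in the active list

    def retire(self):
        """Retire done entries from the head; release their old registers."""
        freed = []
        while self.active and self.active[0]["done"]:
            entry = self.active.popleft()
            self.free.append(entry["old_pdst"])
            freed.append(entry["old_pdst"])
        return freed


r = Renamer()
i1 = r.issue(dst=1, src1=2, src2=3)   # R1 = R2 op R3
i2 = r.issue(dst=4, src1=1, src2=2)   # R4 = R1 op R2, depends on i1
waiting = r.ready(i2)                 # False: P32 is still busy
r.execute(i1)
r.execute(i2)
freed = r.retire()                    # old mappings P1 and P4 are released
```

In this model each physical register is written once between allocation and release, and the old mapping of a destination is released only when the instruction that replaced it retires.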
When Is It Safe?
When is it safe to reuse a physical register?
Example: R1 was previously mapped to P7; now a new instruction, I1, will write to R1 and gets assigned P5
It is safe to reuse P7 once I1 has completed execution (and has written to P5) and commits
[Figure: code sequence "R1 = ...; ... = R1; ... = R1" with R1 remapped to P7, followed by "R1 = ...; ... = R1" with R1 remapped to P5]
From the behavior of the ROB, we know all prior instructions have committed, i.e., P7 can now be freed after this instruction commits
Why?
Because the logical register R1 has been overwritten
Any subsequent read of R1 should be done to P5
Handling Flow Dependences
Each time we allocate a new destination register, we update the RMT
Any subsequent read will get the correct map from the RMT
The busy bit system, comprising the Busy Bit Table and the constant updating of the busy fields in the instruction queue entries, ensures data availability checking
Handling Output and Anti-Dependences
Each instruction writes to a newly allocated physical register
Registers are renamed from logical to physical
– Can use more physical registers
Case Study: Intel Pentium III and Pentium IV (NETBURST)
Intel IA32
Due to backward compatibility
– Complex instructions
– Limited number of registers
Each complex instruction is translated into several micro-ops (uops)
Register renaming used to allow for more registers
The Pipeline
[Figure: Basic Pentium 3 Misprediction Pipeline, stages 1 to 10: fetch, fetch, dec, dec, dec, rename, ROB rd, Rdy/sch, dispatch, exec]
[Figure: Basic Pentium 4 Misprediction Pipeline, key stages, numbered 1 to 20: TC Nxt IP, TC Fetch, drv, all, rename, que, sch, sch, sch, disp, disp, RF, RF, EX, drv]
ROB
The Reorder Buffer (ROB) in the IA32 is implemented as a content-addressable memory
Serves as an instruction pool
[Figure: system bus, bus interface, L2 cache, L1 I-cache, L1 D-cache, fetch/decode unit, dispatch/execute unit, retire unit, and the ROB as instruction pool, with fetch, load, and store paths]
ROB Entries
Each ROB entry has a data and a status field
ROB data field stores the data result of a uop
ROB status field tracks the status of the uop producing the result that is to go into the corresponding data field
Register Renaming in P-III
A Register Alias Table (multi-ported SRAM) keeps track of the latest alias for logical registers
ROB is managed like a reorder buffer
Tracks availability of data
Once retired, data is copied from ROB to the Retirement Register File (RRF)
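The interplay of the Register Alias Table, the ROB, and the RRF can be sketched as below. This is an illustrative model only: the class `AliasTable`, its method names, and the rule that a retiring uop redirects the alias back to the RRF when no younger writer exists are assumptions for this example, not a description of Intel's implementation.

```python
class AliasTable:
    def __init__(self, regs):
        self.rrf = dict.fromkeys(regs, 0)          # Retirement Register File
        self.rat = {r: ("RRF", r) for r in regs}   # latest alias per register
        self.rob = {}                              # ROB id -> entry, oldest first
        self.next_id = 0

    def allocate(self, dest):
        """Allocate a ROB entry for a uop and alias its destination to it."""
        rob_id = self.next_id
        self.next_id += 1
        self.rob[rob_id] = {"dest": dest, "data": None, "done": False}
        self.rat[dest] = ("ROB", rob_id)
        return rob_id

    def write_back(self, rob_id, data):
        self.rob[rob_id].update(data=data, done=True)

    def retire(self):
        """Retire the oldest uop if done: copy its data from ROB to RRF."""
        if not self.rob:
            return False
        rob_id = next(iter(self.rob))   # dicts preserve insertion order
        entry = self.rob[rob_id]
        if not entry["done"]:
            return False
        del self.rob[rob_id]
        self.rrf[entry["dest"]] = entry["data"]
        if self.rat[entry["dest"]] == ("ROB", rob_id):
            self.rat[entry["dest"]] = ("RRF", entry["dest"])
        return True


table = AliasTable(["EAX", "EBX"])
first = table.allocate("EAX")
second = table.allocate("EAX")         # RAT now points at the younger uop
table.write_back(first, 5)
table.retire()                         # EAX = 5 is copied to the RRF
alias_after_first = table.rat["EAX"]   # still the second ROB entry
table.write_back(second, 9)
table.retire()
```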
Pentium 3
[Figure: Pentium 3 renaming: a RAT with entries for EAX, EBX, ECX, EDX, ESI, EDI, ESP, EBP pointing into a 40 entry ROB (Status and Data fields) or the RRF]
Register Alias Table (RAT): remembers the most current version of each register
Register Renaming
A RAT entry may point to a ROB entry or to the RRF
No physical EAX, EBX etc. exist
Pentium IV
Introduced the NETBURST architecture
Eliminates the copying of ROB data values to the RRF
Consists of two RATs
– Frontend RAT
– Retirement RAT
Pentium IV
The 128-entry Register File (RF) is separated from the ROB, which now consists only of status fields
A unique, in-order sequence number is allocated for each uop that points to the corresponding ROB entry
Pentium IV
[Figure: NetBurst renaming: Front End RAT and Retirement RAT, each with entries for EAX, EBX, ECX, EDX, ESI, EDI, ESP, EBP, pointing into the RF (Data, 128 physical registers); the ROB holds Status only; shown against the Pentium 4 pipeline stages 1 to 20 (TC Nxt IP, TC Fetch, drv, all, rename, que, sch, disp, RF, EX, drv)]
Pentium IV Execution Core
Up to 126 instructions in flight and up to 48 loads and 24 stores pending
The front end feeds the execution core
– Allocator
  – allocate ROB entry, rename registers, allocate μop queue entry, allocate load/store buffer
Front end μop supply and backend μop retirement bandwidth is 3 μops
Dispatch bandwidth into the execution core is 6 μops
Multi-clock bypass network for double speed integer ALUs
Pentium IV Execution Core
Dispatch Ports
– Exec Port 0: ALU (2X): Add/Sub, Logic, Store Data, Branches; FP Move: FP/SSE Move, FP/SSE Store
– Exec Port 1: ALU (2X): Add/Sub; Integer: Shift/Rotate; FP: FP/SSE Add, Mul, Div
– Load Port: Load
– Store Port: Store
Out-of-order schedulers feed dispatch ports
[Figure: the compute μop queue and the memory μop queue feed four schedulers, which feed the dispatch ports]
Some Observations
Applications have a high level of thread parallelism
Within a thread, high latency operations have to be tolerated, e.g., cache misses
5X increase in performance for a 15X increase in effective (scaled) chip area and 18X increase in power!
Transistors have been invested to improve the performance of a single thread
– Sub-linear relationship between investment (chip area) and return (execution speed): utilization is the key!
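Taking the figures on this slide at face value, the return per unit of investment can be worked out directly. The ratio labels below are introduced only for this illustration.

```latex
\[
\frac{\text{performance increase}}{\text{area increase}} = \frac{5}{15} \approx 0.33,
\qquad
\frac{\text{performance increase}}{\text{power increase}} = \frac{5}{18} \approx 0.28
\]
```

Each unit of added area or power therefore bought roughly a third of a unit of performance or less, which is the sub-linear relationship the slide refers to.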
What Next?
Exploit thread level parallelism
– Use multiple processors and keep them busy
– Time sharing
  – Switch-on-event time sharing
  – Need to flush the deep pipelines
  – Fine grained multi-threading to keep the pipelines full
Simultaneous multithreading to maximize resource utilization with minimal overhead
Forms of Multithreading
[Figure: issue slots over time, including stall cycles, for four designs: Superscalar, Coarse Grain Multithreading, Fine Grained Multithreading, Simultaneous Multithreading]
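The difference between two of these forms can be illustrated with a deliberately simple slot-counting model. It assumes a fixed table `ready[t][c]` giving how many instructions thread `t` could issue in cycle `c`, independent of what was issued earlier; real processors do not behave this way, and the function names are hypothetical.

```python
def fine_grained(ready, width):
    """One thread per cycle, chosen round robin; returns slots filled per cycle."""
    threads = len(ready)
    cycles = len(ready[0])
    return [min(width, ready[c % threads][c]) for c in range(cycles)]


def simultaneous(ready, width):
    """All threads may fill the issue slots of the same cycle."""
    cycles = len(ready[0])
    return [min(width, sum(thread[c] for thread in ready))
            for c in range(cycles)]


# Two threads, four cycles, a 4-wide machine.
ready = [
    [2, 0, 1, 3],   # thread 0
    [1, 2, 0, 1],   # thread 1
]
fg_slots = fine_grained(ready, width=4)
smt_slots = simultaneous(ready, width=4)
fg_utilization = sum(fg_slots) / (4 * len(fg_slots))
smt_utilization = sum(smt_slots) / (4 * len(smt_slots))
```

In this toy trace the fine-grained scheme fills 6 of 16 issue slots, while the simultaneous scheme fills 10 of 16, because slots left empty by one thread are used by the other in the same cycle.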
Increasing Utilization in the NetBurst Microarchitecture
Observations for dynamically scheduled processors
– Have large register sets with support for renaming
– Tag support enables tracking of instructions across threads
– Schedulers and execution units track dependencies
Idea: provide support for sharing resources across threads with little additional hardware support: Hyper-threading
Abstraction: Logical processors
– This is what the programmer and operating system see
Hyper-threading in the Xeon Processor Family
Goals:
– Minimize die area cost
– Independent forward progress for a logical processor
– Do not penalize single thread performance: minimize static allocation of resources
Implementation of Hyper-threading adds less than 5% to the chip area
Principle: share major logic components by adding or partitioning buffering logic
[Figure: 2 CPU without Hyper-threading (each set of Processor Execution Resources has one Arch State) versus 2 CPU with Hyper-threading (each set of Processor Execution Resources has two Arch States)]
The Xeon Pipeline
[Figure: Xeon pipeline with Hyper-threading: Fetch Logic, trace cache (TC) access, μop queue, Rename (allocator, register rename), Queue, Schedule, Register Read, Execute (EX INT, EX FP, EX BR), L1 cache, WB, Retire, ROB]
Annotations on the figure:
– Duplicate ITLBs and PCs; independent I-buffers for decode; RAS duplicated and some sharing of branch prediction logic
– Round robin access, dynamic sharing; shared ucode ROM; independent code pointers
– Fairness enforced by limits on buffer sharing
– Separate RATs
– Schedulers oblivious to logical processors
– Execution units oblivious to logical processors; forwarding feasible due to shared register file
– Shared DTLB with logical processor tags
Performance
65% performance increase for high end server applications for 4-way server platform
~20%-30% performance improvement for categories such as transactions, web server, and server side Java environment
Operating system can optimize scheduling of threads across logical/physical processor combinations
Power 5
Power 5: Key Features
Shared I-cache, fetching 8 instructions/thread/cycle
Shared BHT
5 instr/thread/ decoded and grouped
Group dispatch and commitment
– Instructions tracked as a group via GCT
Register renaming dynamically shares registers between threads as well as LRQ and SRQ
Issue is independent of group membership
I and D caches fully shared via increased associativity
Resource balancing logic to prevent starvation
Recall…
[Figure: division of work between Compiler and Hardware across the steps Front-End & Optimizer, Determine Dependences, Determine Independences, Bind Resources, Execute, for four designs: Sequential (superscalar), Dependence Architecture (dataflow), Independence Architecture (Horizon), Independence Architecture (VLIW)]
Review of the Superscalar Datapath
[Figure: in-order fetch and issue logic, out-of-order execution core, in-order completion logic]
Instruction Issue
– Renaming
– Allocate reservation stations
– Allocate re-order buffer entry
– Check for structural hazards
Instruction Execution
– Data driven execution: all dependencies have been resolved
– Issue to functional unit
– De-allocate reservation stations
– Forwarding
– Check load/store dependencies
Instruction Completion
– Enable waiting instructions
– Retire from re-order buffer
– Forward from re-order buffer
Concluding Remarks
Degree of speculation
– Speculate the bad along with the good, e.g., cache misses
Speculating through multiple branches
– Hide long functional unit delays
– May need to speculate through multiple branches in one cycle
Use SATSIM
– Follow the execution and understand the use of register renaming and of the re-order buffer
– http://www.ece.gatech.edu/research/pica/SATSim/satsim.html
Check the data sheets for modern processors. What techniques do they use?
Study Guide
Given a code sequence
– What is the state of the ROB at some point in time?
Exception handling
– Using a ROB
Register renaming
– Given a code sequence, what would be the contents of the rename table or rename register file (depending on which technique is used) at some point in time?
– Which physical registers are available?
Forms of speculation – understanding how they work
– Across branches
– Speculating memory accesses