Advanced MicroarchitectureLecture 13: Commit, Exceptions, Interrupts
2
The End of the Road (um… Pipe)• Commit is typically the last stage of the
pipeline• Anything that an instruction does at this
point is irrevocable– therefore, only actions that correspond to a
sequential execution can be allowed to pass this point
– E.g., wrong path instructions may not commit because they do not exist in the sequential execution
Lecture 13: Commit, Exceptions, Interrupts
3
Everything In-Order• The ISA defines program execution with
respect to the sequential order of the instructions
• To the outside world, the CPU must appear to execute in-order
• What does it mean to “appear”?– … when someone looks– ok, so what does it mean to “look”?
Lecture 13: Commit, Exceptions, Interrupts
4
“Looking” at CPU State• Examples
– when the OS task swaps a program out, it copies the current program state (requires “looking”) so the program can be resumed later
– when a program has a fault (e.g., page fault), again the OS usually steps in and it needs to “look” at the “current” CPU state
Lecture 13: Commit, Exceptions, Interrupts
5
When Can Someone Look?• Divide-by-Zero
– only on a divide• Page fault
– only on a load or store?– can occur anytime if instruction fetch is on an
unmapped page• OS Task/Process Scheduling
– anytime
Lecture 13: Commit, Exceptions, Interrupts
6
“Proper” State Always Required• So what does it mean to have a proper
state?– one that corresponds to sequential execution
• For a superscalar processor with degree N, the commit stage must be able to “retire” at least N instructions per cycle– retirement includes updating processor state in
the same fashion that an instruction would for sequential execution
Lecture 13: Commit, Exceptions, Interrupts
7
Superscalar Commit is like Sampling
Lecture 13: Commit, Exceptions, Interrupts
A
AB
ABC
ABCD
ABCDE
ABCDEF
ABCDEFG
ABCDEFGH
ABC
ABCDE
ABCDEFGH
Each “state” in thesuperscalar machinealways correspondsto one state of thescalar machine (butnot necessarily theother way around),and the ordering ofstates is preserved
Scalar Commit Processor States Superscalar Commit Processor States
8
Implementation in the CPU• The architected register file contains the
state of the machine corresponding only to the committed instructions– commit happens in-order, so the ARF states
always correspond to some RF state of a sequential execution
– … this means anyone who wants to “look” will have to look in the ARF
– What about the other instructions that have executed out of order?
Lecture 13: Commit, Exceptions, Interrupts
9
Nothing Else Matters
Lecture 13: Commit, Exceptions, Interrupts
LSQ
PRF
RS
fPC
ROBPC
Memory
SequentialView of theProcessor
PC
Memory
RF
State of the SuperscalarOut-of-Order Processor
Just ignoreeverything else!
RF
10
Committing Instructions• “Retire” vs. “Commit”
– sometimes people use this to mean the same thing
– sometimes they mean different things• check the context!
• An instruction commits by making its “effects” known/visible to the architected state– architected state: (A)RF, Memory/$, PC, other
regs– speculative state: everything else (ROB, RS,
LSQ, etc.)Lecture 13: Commit, Exceptions, Interrupts
11
Committing Instructions (2)• When an instruction executes, it modifies
processor state– update a register– update memory– update the PC (almost all instructions do this)
• To “make appear”, the processor just copies values– copy value from Physical Reg to Architected Reg– copy value from LSQ to memory/cache– copy value from ROB to Architected PC
Lecture 13: Commit, Exceptions, Interrupts
12
Blocked Commit• To commit N instructions per cycle, the
ROB needs to support N read ports– (in addition to other ports such as for reading
from the PRF, write ports for allocation, and write ports for updates/writebacks)
Lecture 13: Commit, Exceptions, Interrupts
inst 1inst 2inst 3inst 4
Four re
ad p
orts
for fo
ur
com
mits
ROB
inst 1 inst 2 inst 3 inst 4
One wide read port
ROB
Can’t reuse (reallocate) any of these ROB entries until all have committed
Helps to keep ROB port requirements under control, at cost of slight IPC loss
due to occasional alloc stalls
13
Commit Restrictions• If any N instructions can commit per cycle, may
require heavy multi-porting of other structures– Stores: need N extra DL1 write ports, N extra DTLB
read ports– Branches: need N branch predictor update ports;
also need to deallocate N RAT checkpoints– Others: if processor has memory dependence
predictor, then may need N update ports for loads
• Solution: limit maximum number of commits per cycle of special types of instructions (e.g., max one branch per cycle)
Lecture 13: Commit, Exceptions, Interrupts
Don’t we need to check DTLB during store-address
computation anyway?Do we need to do it again
here?
14
x86 Commit (macros vs. uops)• ROB contains uops, but outside world only
knows about macro-ops
Lecture 13: Commit, Exceptions, Interrupts
uop 1 (SUB)uop 1 (LD)
uop 2 (ADD)
ROB
uop 1uop 2uop 3uop 4uop 5uop 6
uop 1 (ADD)
com
mit
ADD EAX, EBX
SUB EBX, ECX
ADD EDX, [EAX]
com
mit
POPA ????
uop 7uop 8
If we take an interrupt right now, we’ll see a half-executed
instruction!
15
x86 Commit (con’t)• Works fine when uop-flow length ≤ commit
width• What to do with long flows?
– In all cases: can’t commit until all uops in a flow have completed
– Just commit N uops per cycle– ... but make commit uninteruptable
Lecture 13: Commit, Exceptions, Interrupts
ROBuop 1uop 2uop 3uop 4uop 5uop 6uop 7uop 8
com
mit
com
mit
POPA
Timer interrupt! Defer: Can’t act on this yet...
Now do something about the interrupt.
16
Handling REP-prefixed Instructions• Ex. REP STOSB (memset with EAX’s value)
– The entire sequence is effectively one x86 instruction
– What if this REP’s for 1,000,000,000 iterations?• Does the CPU “lock up” for a billion or so cycles while
we wait for the entire instruction to commit?• We also can’t wait until the entire instruction has
completed to commit, because we can’t even fetch the entire instruction!
• At the ISA level, REP iterations are interruptable...– so treat each iteration as a separate “macro-
op”
Lecture 13: Commit, Exceptions, Interrupts
17
REP (con’t)• MOV EDI, <pointer> ; the array we want to
memset• SUB EAX, EAX ; zero• CLD ; clear direction flag
(REP fwd)• MOV ECX, 4 ; do for 100 iterations• REP STOSB ; memset!• ADD EBX, EDX ; unrelated instruction
Lecture 13: Commit, Exceptions, Interrupts
MOV EDI, xxx
SUB EAX, EAX
CLD
MOV ECX, 4
uCMP ECX, 0
uJCZ
STA tmp, EDI
STD EAX, tmp
SUB ECX, 1
ADD EDI, 1
uCMP ECX, 0
uJCZ
STA tmp, EDI
STD EAX, tmp
SUB ECX, 1
ADD EDI, 1
uCMP ECX, 0
uJCZ
STA tmp, EDI
STD EAX, tmp
SUB ECX, 1
ADD EDI, 1
uCMP ECX, 0
uJCZ
STA tmp, EDI
STD EAX, tmp
SUB ECX, 1
ADD EDI, 1
uCMP ECX, 0
uJCZ
ADD EBX, EDX
Check for zero iterations(could happen with MOV ECX, 0 )
MOVS flow
REP overhead for1st iteration
MOVS flow
REP overheadfor 2nd iter.
MOVS flow
REP overheadfor 3rd iter
MOVS flow
REP4th iter
All of these are interruptible points (commit can stop and effects be seen by outside world), since they all have well-defined ISA-level
states:A: ECX=3, EDI = ptr+1B: ECX=2, EDI = ptr+2C: ECX=1, EDI = ptr+3D: ECX=0, EDI = ptr+4
A:
B:
C:
D:
18
Store Retirement• Stores need to forward values to later loads
from the same address– Normally, LSQ provides this facility
Lecture 13: Commit, Exceptions, Interrupts
ld
17st
ld
33st
ld
D$
17
At commit, storeUpdates cache
st
ld
After store has leftthe LSQ, the D$can provide the
correct value
D$ D$
19
Store Retirement (2)• Store shouldn’t vacate LSQ entry until
actual writeback has completed– this way it can continue to forward values until
it is guaranteed that later loads can get the value from the cache
• This can stall commit– what if there’s a cache miss?– what if there’s a TLB miss?
Lecture 13: Commit, Exceptions, Interrupts
store
All instructions may have successfullyexecuted, but none can commit!
20
Writeback Buffer• Want to get stores out of the way
Lecture 13: Commit, Exceptions, Interrupts
storeD$
store
WB Buffer
ld
Even if the store misses incache, entering the WB buffer
counts as committing.
This allows other insts tocommit as well.
The WB buffer is part of the cachehierarchy now. It may need
to provide values to later loads
Eventually, the cacheupdate occurs, the WBbuffer entry is emptied
ld
The cache can nowprovide the correct value
21
Writeback Buffer (2)• The WBB is part of the “Architected” state
Lecture 13: Commit, Exceptions, Interrupts
ld
D$
WBB
• Loads must check both structures for a hit– more routing to get request
to both places– yet another possible
bypass source– also want to normalize
latencies for easier scheduling
stcache
coherenceinterface As an “official” write, the store must
also be visible to other processors
To o
ther C
PU
s
Processor stalls if there are noavailable writeback buffer entries
22
Writeback Buffer (3)• Stores enter WB Buffer in program order• If there are multiple stores to the same
address, only the last one is “visible”
Lecture 13: Commit, Exceptions, Interrupts
123442
Addr Value
-113
909018
oldest
youngest
next to writeto cache
Load 42
567842
Store 42
Load 42
No one can “see” this store anymore!
23
Write Combining Buffer• Augment WBB to combine writes together
Lecture 13: Commit, Exceptions, Interrupts
123442
Addr ValueLoad 42 567842
Store 42
Load 42
Only one writebackNow instead of two
If writing to same address, just combine the writes
(similar to writeback cache)
24
Write Combining Buffer (2)• Can combine stores to same cache line
Lecture 13: Commit, Exceptions, Interrupts
123480
$-LineAddr
Cache Line Data
Store 84
One cache writecan serve multiple
original storeinstructions
Benefit: reduces cachetraffic, reduces pressure
on store buffers
5678
Aggressiveness of write-combining may be limited by
memory ordering model
Writeback/combining buffer can be implemented in/integrated with the
MSHRs
Only certain memory regions may be “write-combinable” (e.g., USWC
in x86)
25
Senior Store Queue• Use STQ as WBB (not necessarily write
combining)
Lecture 13: Commit, Exceptions, Interrupts
STQ
StoreStoreStoreStoreStoreStore
STQ head
STQ tail
DL1 L2STQ headSTQ head
While stores are completing, other accesses (loads, snoops) can continue getting the values
from the “senior” STQ
StoreStore
StoreStore
STQ tail
New stores cannot allocate into Senior STQ entries until stores
complete
Can continue allocating stores
Benefit: don’t need separate WBB, but we don’t need to stall commit
on stores either
26
Silent Stores• Many stores write the same value to
memory that’s already there
Lecture 13: Commit, Exceptions, Interrupts
Store 4242
This is a “silent store”
D$
Ideally, this store shouldn’t have to happen
Silent Stores are another way to exploit value locality;they are like the “dual” of value-predictable loads that
always (frequently) load the same value
27
Store executesas a Load
CommitWBExecuteScheduleDispatchRenameDecodeFetch
Silent Store Verification
• Pro: stores can commit faster (reduces commit-time cache port pressure)
• Con: more total accesses– every store converted to
load– non-silent still has to store
Lecture 13: Commit, Exceptions, Interrupts
D$
=
data
hit
Check load value againststore data for equality
If equal, then store issilent, suppress store
at commit
Silent?
28
Other Duties• Besides updating architected state, commit
needs to deallocate microarchitecture resources– ROB/LSQ entries– Physical register– “colors” of various sorts if used– RAT checkpoints
• Most are FIFO’s or Queues, so alloc/dealloc is usually just inc/dec head/tail pointers
• Unified PRF requires a little more work
Lecture 13: Commit, Exceptions, Interrupts
29
In-Order Commit is Sufficient• … but not necessary• Necessary conditions for retirement:
1. Finished– result produced as defined by ISA semantics
2. No WAR hazard– in the architected register file
3. No unresolved branches4. No unresolved exceptions5. No memory ordering violations6. No memory consistency problems
Lecture 13: Commit, Exceptions, Interrupts
30
Commitability
Lecture 13: Commit, Exceptions, Interrupts
% o
f RO
B
NoneNot Finished
Earlier BranchWAR
CombinationEarlier agen Many instructions could commit
(all requirements satisfied) butdon’t due to in-order constraint
Figure from Bell and Lipasti, “Deconstructing Commit,” in ISPASS’04
31
Backup Slides on Interrupts/Faults
Lecture 13: Commit, Exceptions, Interrupts
32
External Appearance Maintained
Lecture 13: Commit, Exceptions, Interrupts
LSQ
PRF
RS
fPC
ROBPC
Memory
ARF
If you want to peek in,all you get to see arethe (arch) register file,the PC and the cachestate (incl. WB buffers)
But what if there’s no ARF?
33
View of the Unified Register File
Lecture 13: Commit, Exceptions, Interrupts
PRFsRAT
aRAT
If you need to “see”a register, you gothrough the aRATfirst.
ARF
34
View of Branch Mispredictions
Lecture 13: Commit, Exceptions, Interrupts
ROB
LSQ
PRF
RS
fPC
PC
Memory
ARF
MispredictedBranch
Wrong-path instructionsare flushed…
architected state hasnever been touched
Fetch correct pathinstructions
Which can update thearchitected state when
they commit
35
Faults• Div-by-Zero, Overflow, Page-Fault
• All occur at a specific point in execution (precise)
Lecture 13: Commit, Exceptions, Interrupts
• CPU must maintain the appearance of sequential execution
DBZ!
Trap
(resumeexecution)
Divide may have executedbefore other instructionsdue to OOO scheduling!
DBZ!
Trap?(when?)
36
Timing of DBZ Fault• Need to hold on to your faults
Lecture 13: Commit, Exceptions, Interrupts
RS
ROB
Exec:DBZ
ArchitectedState
Let earlier instructions commitNow, the arch. state is the sameas just before the divide executed
in the sequential order
Now, raise the DBZ fault andwhen you switch to the kernel,everything appears as it should
On a fault, flush themachine and switch
to the kernel
Just make note of the fault,but don’t do anything (yet)
37
Speculative Faults• Faults might not be faults…
Lecture 13: Commit, Exceptions, Interrupts
ROB
DBZ!
BranchMispredict
The fault goes away, too
Which is what we want since in asequential execution, the wrong-path divide
would not have executed (and faulted)
(flush wrong-path)
Buffering faults until commit filtersout speculative faults
38
Timing of TLB Miss• Store must re-execute
(or re-commit, really), but this means it cannot leave the ROB
• Store TLB miss can stall the processor
Lecture 13: Commit, Exceptions, Interrupts
TLB miss
Trap
(resumeexecution)
…
Walk page-table,may find a page fault
Re-executestore
39
Load Faults are Similar• Load issues, misses in TLB• Must wait! When load is oldest, allow
switch to kernel to do page-table walk
• …could be painful; there are lots of loads
• Some processors have hardware page-table walkers– OS loads a few registers with PT information
(pointers) and then simple logic fetches mapping info from memory
Lecture 13: Commit, Exceptions, Interrupts
OS must support particular PT format tomake use of HW page-table walker
40
Asynchronous Interrupts• Some interrupts are not associated with
any particular instruction– timer interrupt– disk, network interrupts (I/O)– low battery, UPS shutdown
• When the CPU “notices” doesn’t matter (too much)
Lecture 13: Commit, Exceptions, Interrupts
KeyPressed Key
Pressed
KeyPressed
41
Handling Async Interrupts• Handle right away
– Just use the current architected state, and flush the pipeline
• Deferred– Stop fetching, let the processor drain, then
switch to the interrupt handler• What if CPU takes a fault in the mean time?• Which came “first”, the External interrupt or the fault?
Lecture 13: Commit, Exceptions, Interrupts