Trace Caches
J. Nelson Amaral
Difficulties of Instruction Fetching
• Where to fetch the next instruction from?
  – Use branch prediction
    • Sometimes there is a misprediction
• Likely can only fetch from one I-cache line
  – The m instructions may spread over two lines
  – I-cache misses are even worse
• Taken branches
  – The target address may be in the middle of a cache line
    • Instructions before the target must be discarded
  – The remainder of the m instructions fetched need to be discarded
Baer p. 159
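To make the line-boundary problem concrete, here is a small sketch (not from Baer; the line and instruction sizes are illustrative) of how many of m sequential instructions the fetch unit can obtain from the single I-cache line holding the current PC:

```python
# Sketch: fetch bandwidth limited by the I-cache line holding the PC.
# Assumes fixed 4-byte instructions and a 32-byte line (hypothetical values).

LINE_SIZE = 32   # bytes per I-cache line
INSTR_SIZE = 4   # bytes per instruction

def fetchable(pc: int, m: int) -> int:
    """Instructions obtainable this cycle from the line containing pc."""
    offset = pc % LINE_SIZE                         # position inside the line
    left_in_line = (LINE_SIZE - offset) // INSTR_SIZE
    return min(m, left_in_line)

# A fetch starting mid-line gets fewer than m instructions:
print(fetchable(pc=0, m=4))    # line-aligned: all 4
print(fetchable(pc=24, m=4))   # only 2 remain in the line
```

A fetch that starts late in a line delivers only the tail of the line; the rest of the m-wide slot is wasted unless a second line can be accessed.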
Getting more from I-cache
• How can we increase the probability that more of the m instructions needed are in a cache line?
  – Increase the cache line size
    • Increasing it too much increases cache misses
  – Fetch the "next" line
    • What is "next" in a set-associative cache?
    • After replacements:
      – the next line may not contain the right instructions
      – address checking and repair are needed even with no branches
Baer p. 159
Line-and-way Predictor
• Instead of predicting a branch target address, predict the next line and set in the I-cache.
  – Called a Next Cache Line and Set Predictor by Calder and Grunwald (ISCA '95)
    • NLS-cache: associate the predictor bits with a cache line
    • NLS-table: store the predictor bits in a separate direct-mapped, tagless buffer
• Effective for programs containing many branches
Baer p. 160
NLS-Cache
Calder, Brad and Grunwald, Dirk, "Next Cache Line and Set Prediction," International Symposium on Computer Architecture (ISCA), 1995, pp. 287-296.

NLS: a tagless table of pointers into the instruction cache, identifying the next instruction to be executed.

NLS also predicts indirect branches and provides the branch type.

Three predicted addresses.

Needs an early distinction between branch and non-branch instructions.
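As a rough illustration of the NLS-table variant, the sketch below models a direct-mapped, tagless buffer of (line, way) predictions indexed by low-order PC bits. The table size and field layout are invented for the example, not Calder and Grunwald's actual design:

```python
# Hypothetical NLS-table sketch: a direct-mapped, tagless buffer of
# (line, way) pointers into the I-cache, indexed by low PC bits.

TABLE_ENTRIES = 256

class NLSTable:
    def __init__(self):
        # Each entry predicts which I-cache line and way to fetch next.
        self.entries = [(0, 0)] * TABLE_ENTRIES

    def _index(self, pc: int) -> int:
        return (pc >> 2) % TABLE_ENTRIES   # tagless: no tag check, may alias

    def predict(self, pc: int):
        return self.entries[self._index(pc)]

    def update(self, pc: int, line: int, way: int):
        # On a misfetch, train the entry with the correct line/way.
        self.entries[self._index(pc)] = (line, way)

nls = NLSTable()
nls.update(pc=0x40, line=5, way=1)
print(nls.predict(0x40))   # (5, 1)
```

Because the table is tagless, two PCs that map to the same entry silently share a prediction; a wrong pointer is caught later, when the fetched instructions are checked.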
Trace Caches
• A cache that records instructions in the sequence in which they were fetched (or committed)
  – The PC indexes into the trace cache
  – If predictions are correct:
    • the whole trace is fetched in one cycle
    • all instructions in the trace are executed
Baer p. 161
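A minimal sketch of this lookup, assuming a simple PC-indexed structure (a later slide refines what the tag must contain):

```python
# Minimal trace cache sketch (assumed structure, not Baer's figure):
# indexed by the starting PC; a hit returns the whole trace in one access.

class TraceCache:
    def __init__(self):
        self.traces = {}   # start PC -> list of instructions

    def insert(self, start_pc, instructions):
        self.traces[start_pc] = list(instructions)

    def fetch(self, pc):
        # On a hit, the entire trace is delivered in one cycle;
        # on a miss, fall back to the conventional I-cache path.
        return self.traces.get(pc)

tc = TraceCache()
tc.insert(0x100, ["add", "beq", "sub", "jmp"])
print(tc.fetch(0x100))   # whole trace in one access
print(tc.fetch(0x200))   # None: miss, use I-cache
```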
Trace Caches Design Issues
• How to design the trace cache?
• How to build a trace?
• When to build a trace?
• How to fetch a trace?
• When to fetch a trace?
• When to replace a trace?
Baer p. 161
Instruction Fetch with I-cache
Baer p. 161
Fetch with I-cache and Trace Cache
Baer p. 161
Trace Selection Criteria
• Number of conditional branches in a trace– number of consecutive correct
predictions is limited• Merging next block may exceed trace
line– no partial blocks in a trace
• Indirect jump or call-return terminate traces
Baer p. 161
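The selection criteria above can be sketched as a fill-unit loop (hypothetical logic; limits are illustrative): stop growing a trace when the next block would overflow the line, when the conditional-branch limit is reached, or at an indirect jump, call, or return.

```python
# Sketch of trace selection: grow a trace block by block until a
# termination criterion fires.

MAX_INSTRS = 16        # trace line capacity (illustrative)
MAX_COND_BRANCHES = 3  # illustrative branch limit

def build_trace(blocks):
    """blocks: list of (instructions, kind) pairs, where kind is
    'fallthrough', 'cond', 'indirect', 'call', or 'return'."""
    trace, branches = [], 0
    for instrs, kind in blocks:
        if len(trace) + len(instrs) > MAX_INSTRS:
            break                      # no partial blocks in a trace
        trace.extend(instrs)
        if kind == "cond":
            branches += 1
            if branches == MAX_COND_BRANCHES:
                break                  # branch-prediction limit reached
        if kind in ("indirect", "call", "return"):
            break                      # these terminate the trace
    return trace

blocks = [(["i1", "i2"], "cond"), (["i3"], "call"), (["i4"], "cond")]
print(build_trace(blocks))   # stops after the call: ['i1', 'i2', 'i3']
```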
Trace Tags
• What should be the tag of a trace?
• Is it sufficient to use the address of the first instruction as tag?
Baer p. 161
Tags for Trace Cache Entries
Assume a trace may contain up to 16 instructions.
There are two possible traces:
T1: B1-B2-B4
T2: B1-B3-B4
T1 and T2 start at the same address.
Possible solution: add the predicted branch outcomes to the tag.
Baer p. 162
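The tag fix can be sketched as follows (the encoding is an assumption for illustration): tag a trace with its starting PC plus the outcomes of the conditional branches inside it, so that T1 and T2, which start at the same address, get distinct tags.

```python
# Sketch: trace tag = (start PC, branch count, packed branch outcomes).

def trace_tag(start_pc: int, outcomes):
    """outcomes: tuple of booleans, one per conditional branch in the trace."""
    bits = 0
    for taken in outcomes:
        bits = (bits << 1) | int(taken)   # pack outcomes into an integer
    return (start_pc, len(outcomes), bits)

# T1 = B1-B2-B4 (branch in B1 not taken), T2 = B1-B3-B4 (branch taken):
t1 = trace_tag(0x100, (False,))
t2 = trace_tag(0x100, (True,))
print(t1 != t2)   # True: same start address, distinguishable tags
```

A lookup then hits only when both the fetch PC and the current branch predictions match the stored trace.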
Fetch with I-cache and Trace Cache
(Figure: fetch pipeline; the trace cache path bypasses the decode stage into register renaming.)
Instructions in a trace may not need to be decoded: store a trace of μops.
A big advantage on CISC ISAs (Intel IA-32), where decoding is expensive.
Baer p. 162
Where to build a trace from?
(Figure: Decoder → Fill Unit → Trace Cache)
Traces from mispredicted paths are added to the trace cache.
Baer p. 163
Where to build a trace from?
(Figure: Reorder Buffer → Fill Unit → Trace Cache)
Long delay to build a trace.
Not much performance difference between building from the decoder and from the ROB.
Baer p. 163
Next Trace Predictor
• To predict the next trace:
  – Need to predict the outcomes of several branches at the same time
  – An expanded BTB can be used
  – Can base the prediction on a path history of past traces
    • Use bits from the tags of previous traces to index a trace predictor
Baer p. 163
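A hypothetical sketch of such a path-history predictor: hash bits from the tags of the last few traces into a table index, and store the predicted next-trace tag there. The table size, history depth, and hash are all illustrative assumptions.

```python
# Sketch of a next-trace predictor indexed by a hash of recent trace tags.

PRED_ENTRIES = 1024
HISTORY_LEN = 3

class NextTracePredictor:
    def __init__(self):
        self.table = [None] * PRED_ENTRIES
        self.history = [0] * HISTORY_LEN   # tags of recent traces

    def _index(self) -> int:
        h = 0
        for tag in self.history:
            h = (h * 31 + tag) % PRED_ENTRIES   # fold the path history
        return h

    def predict(self):
        return self.table[self._index()]

    def update(self, actual_next_tag: int):
        # Train the entry, then shift the new tag into the path history.
        self.table[self._index()] = actual_next_tag
        self.history = self.history[1:] + [actual_next_tag]

p = NextTracePredictor()
print(p.predict())   # None: untrained
p.update(0x2A)       # record that trace 0x2A followed this path
```

Predicting one trace tag implicitly predicts every branch inside that trace at once, which is the point of the mechanism.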
Intel Pentium 4 Trace Cache
• The trace cache contains up to 6 μops per line
  – Can store 12K μops (2K lines)
• Intel claims it delivers a hit rate equal to an 8 to 16 KB I-cache
• There is no I-cache
  – On a trace cache miss, the fetch comes from L2
• The trace cache has its own BTB (512 entries)
  – Another independent 4K-entry BTB serves fetches from L2
• Advantages over using an I-cache:
  – Increased fetch bandwidth
  – Bypass of the decoder
Baer p. 163
A 2-to-4 Decoder

An 8-to-256 Decoder
(Figure: address inputs A0-A7 fan out to 256 output lines; the gate pattern is repeated 256 times.)
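The behavior of the n-to-2^n decoders pictured above can be sketched in a few lines: n address bits select exactly one of 2^n output lines.

```python
# Sketch of an n-to-2**n decoder: one-hot output selected by the address.

def decode(address_bits, n: int):
    """Return the 2**n one-hot output lines for an n-bit address.
    address_bits[0] is A0, the least significant bit."""
    index = 0
    for bit in reversed(address_bits):
        index = (index << 1) | bit
    return [1 if i == index else 0 for i in range(2 ** n)]

print(decode([0, 1], 2))   # 2-to-4: address 2 -> [0, 0, 1, 0]
```

The 8-to-256 case simply repeats the same selection logic over 256 output lines, which is why decoder area grows quickly with opcode width.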
Decoding Complexities
• Need m 8-to-256 decoders
• To speculatively compute branch targets for each of m instructions:
  – Need m adders in the decode stage
  – Solution:
    • limit the number of branches decoded per cycle
    • if a branch is resolved in c cycles, there can still be c branches in flight
      – c = 10 is typical
Baer p. 164
Alleviating Decoding Complexities
• Use predecoded bits appended to instructions
  – predecode on transfers from L2 to the I-cache
• CISC: limit the number of complex instructions decoded in a single cycle

Intel P6: 3 decoders
  2 for 1-μop instructions
  1 for complex instructions
Baer p. 164
Alleviating Decoding Complexities
• Use an extra decoding stage to steer instructions towards instruction queues.
Baer p. 164
Pre-decoded Bits
• Append 4 bits to each instruction
  – designate the class (integer, floating point, branch, load-store) and the execution-unit queue
• Partial decode on transfer from L2 to the I-cache.
MIPS R10000
Baer p. 165
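An illustrative sketch of 4 predecode bits per instruction follows; the encoding below is made up for the example, and the R10000's actual bit assignment differs.

```python
# Sketch: pack a 2-bit instruction class and a 2-bit queue selector into
# 4 predecode bits on the L2-to-I-cache fill, then steer at fetch time.

CLASSES = {"integer": 0b00, "float": 0b01, "branch": 0b10, "loadstore": 0b11}
QUEUES = {"int_queue": 0b00, "fp_queue": 0b01, "addr_queue": 0b10, "br_queue": 0b11}

def predecode(cls: str, queue: str) -> int:
    """Computed once, when the line is transferred from L2."""
    return (CLASSES[cls] << 2) | QUEUES[queue]

def steer(bits: int) -> str:
    """At fetch, the predecode bits give the class without a full decode."""
    for name, code in CLASSES.items():
        if code == bits >> 2:
            return name

tag = predecode("branch", "br_queue")
print(steer(tag))   # branch
```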
Pre-decoded Bits
• 3 bits appended to each byte:
  – indicate how many bytes away the start of the next instruction is
  – stored in a predecode cache
  – the predecode cache is accessed in parallel with the I-cache
  – Advantage: detection of instruction boundaries is done only once (not at every execution)
    • saves power dissipation
  – Disadvantage: the size of the I-cache is almost doubled
AMD K7
Baer p. 165
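The boundary-marking idea can be sketched as follows (simplified; the K7's real predecode bits are more involved): each byte carries the distance to the next instruction start, so boundaries are recovered without re-parsing the variable-length bytes.

```python
# Sketch: recover instruction boundaries from per-byte predecode distances.

def find_boundaries(predecode_bits, length):
    """predecode_bits[i] = bytes from byte i to the next instruction start."""
    starts, i = [], 0
    while i < length:
        starts.append(i)
        i += predecode_bits[i]   # jump straight to the next boundary
    return starts

# Three variable-length x86-style instructions of 2, 3, and 1 bytes:
bits = [2, 0, 3, 0, 0, 1]
print(find_boundaries(bits, 6))   # [0, 2, 5]
```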
Instruction Buffer (Queue)
• A pipeline stage brings instructions into a buffer.
  – Boundaries are determined in the buffer
  – Instructions are steered to either:
    • a simple decoder
    • a complex decoder
  – Can detect opportunities for instruction fusion (e.g., a compare-and-test followed by a branch)
    • may send the fused instruction to a simple decoder
Intel P6
Baer p. 165
Impact of Decoding on Superscalar
• The complexity of the decoder is one of the major limitations to increasing m.
Baer p. 165
Three Approaches to Register Renaming
• Reorder Buffer
• Monolithic Physical Register File
• Architectural Register File with a Physical Extension to the Register File
Baer p. 165
Implementation of Register Renaming
• Where and when to allocate/release physical registers?
• What to do on a branch misprediction?
• What to do on an exception?
Baer p. 166
Example
i1: R1 ← R2/R3   # division takes a long time
i2: R4 ← R1 + R5
i3: R5 ← R6 + R7
i4: R1 ← R8 + R9

After renaming:
i1: R32 ← R2/R3  # division takes a long time
i2: R33 ← R32 + R5
i3: R34 ← R6 + R7
i4: R35 ← R8 + R9
R35 will receive a value before instruction i2 is issued.
When/how can R32 be released?
As soon as i2 is issued.
How does the hardware know that i2 is the last use of R32?
Use a counter?
  a use is renamed → count up
  the instruction issues → count down
Too expensive!
Baer p. 166
Example
i1: R1 ← R2/R3   # division takes a long time
i2: R4 ← R1 + R5
i3: R5 ← R6 + R7
i4: R1 ← R8 + R9

After renaming:
i1: R32 ← R2/R3  # division takes a long time
i2: R33 ← R32 + R5
i3: R34 ← R6 + R7
i4: R35 ← R8 + R9
R35 will receive a value before instruction i2 is issued.
When R1 is renamed again, all uses of the first renaming must have been issued!
Release R32 when i4 commits!
Baer p. 166
(Figure: register lifecycle. Architectural registers R1-R9 are assigned; physical registers R32-R35 start in the Free state. Renaming i1 allocates R32.)

i1: R1 ← R2/R3   # division takes a long time
i2: R4 ← R1 + R5
i3: R5 ← R6 + R7
i4: R1 ← R8 + R9

Renaming stage: allocated as a result
End of execution: result generated
Commit: physical register becomes an architectural register
Release: when the next instruction that renames the same register commits
Baer p. 167
(Figure, continued: after renaming, i1-i4 write R32-R35; instructions that finish execution move their destination registers to the Executed state.)
Baer p. 167
(Figure, continued: instructions commit in order; a committed physical register becomes the architectural copy while later instructions are still executing.)
Baer p. 167
(Figure, continued: once all four instructions commit, R35 holds the architectural value of R1; R32, R33, and R34 are shown returning to the Free state.)
Baer p. 167
Register Releasing
• Need to know the previous renaming
  – maintain a previous vector for each architectural register
  – when renaming R1 to R35 in i4, record that the previous renaming of R1 was R32
  – when i4 commits, R35 becomes the previous renaming of R1

i1: R32 ← R2/R3  # division takes a long time
i2: R33 ← R32 + R5
i3: R34 ← R6 + R7
i4: R35 ← R8 + R9
Baer p. 167
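The previous-vector release scheme can be sketched as follows (the data structures are illustrative, not a hardware description): each rename records the prior physical register of its destination, and when the renaming instruction commits, that prior register goes back to the free list.

```python
# Sketch: rename with a previous-mapping record; release on commit.

class Renamer:
    def __init__(self, arch_regs, phys_regs):
        self.map = {r: r for r in arch_regs}       # arch -> current physical
        self.free = list(phys_regs)                # free physical registers
        self.prev = {}                             # inst -> (arch, old phys)

    def rename(self, inst, dest):
        new = self.free.pop(0)
        self.prev[inst] = (dest, self.map[dest])   # remember previous mapping
        self.map[dest] = new
        return new

    def commit(self, inst):
        _, old = self.prev.pop(inst)
        self.free.append(old)                      # release previous renaming

r = Renamer(["R1"], ["R32", "R35"])
r.rename("i1", "R1")     # R1 -> R32
r.rename("i4", "R1")     # R1 -> R35, previous renaming was R32
r.commit("i4")           # i4 commits: R32 is released
print(r.free)            # ['R32']
```

In real hardware commits happen in program order, so every use of the previous renaming is guaranteed to have issued before the release.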
Extended Register File
• Only physical registers can be in the free state.
• At commit, the result must be stored into the architectural register mapped to the physical register.
  – Needs an associative search of the mapping, OR
  – the ROB has a field with the name of the architectural register

(Figure: Architectural Register File with a Physical Extension to the Register File)
Baer p. 168
Repair Mechanism - Misprediction
• How to repair the register renaming after a branch misprediction?
  – ROB:
    • discarding the ROB entries of mis-speculated instructions invalidates all their mappings from architectural to physical registers
  – Monolithic:
    • make a copy of the mapping table at each branch prediction
    • save the copies in a circular queue
    • in case of a misprediction, use the saved copy to restore the mapping
Baer p. 169
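The monolithic repair scheme can be sketched as checkpoint-and-restore (the structures are illustrative): snapshot the map table at every branch, and on a misprediction discard the younger snapshots and restore the one taken at that branch.

```python
# Sketch: checkpointing the rename map at branches for fast repair.

from collections import deque

class MapCheckpoints:
    def __init__(self, mapping):
        self.mapping = dict(mapping)
        self.saved = deque()           # circular queue of checkpoints

    def on_branch(self, branch_id):
        self.saved.append((branch_id, dict(self.mapping)))  # copy the table

    def on_mispredict(self, branch_id):
        # Discard younger checkpoints, restore the one taken at this branch.
        while self.saved:
            bid, snapshot = self.saved.pop()
            if bid == branch_id:
                self.mapping = snapshot
                return

m = MapCheckpoints({"R1": "R32"})
m.on_branch("b1")
m.mapping["R1"] = "R40"        # speculative rename past the branch
m.on_mispredict("b1")
print(m.mapping)               # {'R1': 'R32'}: mapping restored
```

The cost is one table copy per predicted branch, which is why exceptions, for which no per-instruction checkpoint exists, need the slower undo path described next.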
Repair Mechanism - Exceptions
• Monolithic
  – Mappings are not saved at every instruction
    • there may be no saved map to restore
  – Have to undo the mappings
    • from the last renamed instruction
    • back to the one that caused the exception
  – Undoing the map is costly, but it only occurs when an exception is being handled
Baer p. 169
Comparison: ROB-based
Drawbacks:
• Space wasting
  – ROB fields for renaming even for instructions that do not use registers
• Time wasting
  – Two cycles to store a result into the register file
    • retiring and writing the result can be pipelined
• Power and space wasting
  – the ROB needs too many read and write ports

Advantages:
• Easy repair
• No map saving
• Compelling simplicity
Baer p. 169
• ROB-based: Intel P6
  – 40-μop ROB in Pentium III and Pentium M
  – >80-μop ROB in Intel Core
• Monolithic: MIPS R10000, Alpha 21264
• Extended: IBM PowerPC
Baer p. 170