RR
®® Lihu Rappoport1
XBC - eXtended Block Cache
Lihu Rappoport
Stephan Jourdan
Yoav Almog
Mattan Erez
Adi Yoaz
Ronny Ronen
Intel Corporation
RR
®® Lihu Rappoport2
The Frontend
Frontend goal: supply instructions to execution – Predict which instructions to fetch – Fetch the instructions from cache / memory – Decode the instructions– Deliver the decoded instructions to execution
FrontendFrontend
MemoryMemoryExecutionExecution
Instructions
Data
The processor:
RR
®® Lihu Rappoport3
Requirements from the Frontend
High bandwidth
Low latency
RR
®® Lihu Rappoport4
The Traditional Solution: Instruction Cache Basic unit: cache line
– A sequence of consecutive instructions in memory
Deficiencies:– Low Bandwidth Jump into
the lineJump out of the line
jmpjmp
– High Latency– Instructions need decoding
RR
®® Lihu Rappoport5
TC Goals: high bandwidth & low latency Basic unit: trace
– A sequence of dynamically executed instructions
Trace Cache
Instructions are decoded into uops – Fixed length, RISC like instructions
Traces have a single entry, and multiple exits
Trace end condition
jmpjmp jmpjmpjmpjmpjmpjmp
– Trace tag/index is derived from starting IP
RR
®® Lihu Rappoport6
Redundancy in the TC
CodeIf (cond) AB
Possible Traces
(i) AB (ii) BBB
AA
Space inefficiencySpace inefficiency low hit rate low hit rate
RR
®® Lihu Rappoport7
XBC Goals
High bandwidth
Low latency
High hit rate
RR
®® Lihu Rappoport8
XBC - eXtended Block Cache Basic unit: XB - eXtended Block
jccjccjmpjmp
XB features – Multiple entry, single exit– Tag / index derived from ending instruction IP
– Instructions are decoded
XB end conditions – Conditional or indirect branches– Call/Return– Quota (16 uops)
RR
®® Lihu Rappoport9
XBC Fetch Bandwidth Fetch multiple XBs per cycle
– A conditional branch ends a XB– Need to predict only 1 branch/ XB– Predicting 2 branch/cyc fetch 2 XB/cyc
Promote 99% biased conditional branches*
Build longer XBs Maximize XBC bandwidth for a given #pred/cyc
99% biased99% biased
jccjcc jccjccjccjccjmpjmp
*[Patel 98]
RR
®® Lihu Rappoport10
XB LengthBlock types Average Length
BB basic block 7.7
0%
5%
10%
15%
20%
25%
30%
35%
40%
45%
50%
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
BB
XB
XBp
DBL
XB don’t break on uncond 8.0
XBp XB + promotion 10.0 DBL group 2 XBp 12.7
RR
®® Lihu Rappoport11
XBC StructureA banked structure which supports Variable length XBs (minimize fragmentation) Fetching multiple XBs/cycle
Reorder & AlignReorder & Align
Bank0 Bank1 Bank2 Bank3
4 uop4 uop 4 uop4 uop 4 uop4 uop 4 uop4 uop
RR
®® Lihu Rappoport12
Support Variable Length XBs An XB may spread over several Banks on the same
set
Reorder & AlignReorder & Align
bank0 bank1 bank2 bank3
00 11
RR
®® Lihu Rappoport13
Support Fetching 2 XBs/cycle Data may be received from all Banks in the same cycle
Reorder & AlignReorder & Align
bank0 bank1 bank2 bank3
00 11
11 00
00 11 00 11
RR
®® Lihu Rappoport14
Support Fetching 2 XBs/cycle Actual bandwidth may be sometimes less than 4 banks
per cycle
Reorder & AlignReorder & Align
bank0 bank1 bank2 bank3
00 11
11 00
00 11 00
RR
®® Lihu Rappoport15
Reordering and Aligning Uops
bankbank00 bankbank11 bankbank22 bankbank33
bankbanki2i2 bankbanki3i3
Reorder Reorder BanksBanks
Mux 1
bankbanki0i0 bankbanki1i1
Align UopsAlign UopsMux 2
bnkbnki0i0 bnkbnki1i1 bankbanki2i2
Empty uopsEmpty uops
RR
®® Lihu Rappoport16
XBC Structure The average XB length is >8 uops
16 uop/line is < 2-XB set associative
Reorder & AlignReorder & Align
bank0 bank1 bank2 bank3
00 1100 22 11
16 uop16 uop
RR
®® Lihu Rappoport17
XBC Structure The average XB length is >8 uops
make each bank set-associative
Reorder & AlignReorder & Align
bank0 bank1 bank2 bank3
1100 1100 22
RR
®® Lihu Rappoport18
The XBTB The XBTB provides the next XB for each XB
– XBs are indexed according to ending IP Cannot directly lookup next IP in the XBC XBC can only be accessed using the XBTB
XBTB provides info needed to access next XB – The IP of the next XB
– Defines the set in which the XB resides – A masking vector, indicating the banks in which the
XB resides– The #uops counted backward from the end of XB
– Defines where to enter the XB XBTB provides next 2 XBs
RR
®® Lihu Rappoport19
XBTBXBC
Decoder
Memory / Cache BTB
Delivery mode
PriorityEncode XBQ
XBC Structure: the whole picture
Build mode
Fill Unit
RR
®® Lihu Rappoport20
XB Build Algorithm XBTB lookup fails build a new XB into the fill buffer End-of-XB condition reached lookup XBC for the new XB
– No match store new XB in the XBC, and update XBTB – Match there are three cases:
XBnew XBexist
Update XBTB
IP1XBexist
IP1XBnew
Extend XBexist
Update XBTB
XBnew XBexist
XBexist IP1
IP1XBnew
Complex XB,Update XBTB
XBnewXBexist
IP1
IP1
XBexist
XBnew
The XBC has NO RedundancyThe XBC has NO Redundancy
RR
®® Lihu Rappoport21
XBnew and XBexist have same suffix but different
prefix:
– Possible solution, complying to no-redundancy:
Complex XBs
– Drawback: we get 2 short XBs instead of a single long XB
Wrong Way
IP1
IP1
XBexist
XBnew
IP1XBexist
Prefixnew
RR
®® Lihu Rappoport22
XBnew and XBexist have same suffix but different
prefix:– Second solution: a single “complex XB”
Complex XBs
Complex XBs: no redundancy, but still high bandwidth
Right Way
PrefixcurIP1
IP1
IP1
XBexist
XBnew
Prefixnew
Suffix
RR
®® Lihu Rappoport23
bank0 bank1 bank2 bank3
Extending an Existing XB An XB can only be extended at its beginning
991 2 3 41 2 3 4 5 6 7 85 6 7 8 8 98 91 2 31 2 3 4 5 6 74 5 6 700
Since the existing uops move, the pointers in the XBTB become stale
If we store XB in the usual way, when an XB is extended, we need to move all its uops
RR
®® Lihu Rappoport24
Storing Uops in Reverse Order The solution is to store the uops of an XB in a
reversed order
11 2 3 4 52 3 4 5 6 7 8 96 7 8 9
bank0 bank1 bank2 bank3
00
XB IP is the IP of the ending instruction extending the XB does not change the XB IP
when an XB is extended, no need to move uops
RR
®® Lihu Rappoport25
Set Search XB is replaced and then placed again
– Not on same set different XB– Same set, same banks no problem– Same set but not on the same banks
XBTB entries which point to the old location of the XB are erroneous
Solution - Set Search– On an XBTB hit & XBC miss, try to locate the XB in
other banks in the same set– Calculate new mask according to offset– Only a small penalty: cycle loss, but no switch to
build
RR
®® Lihu Rappoport26
XB Replacement Use a LRU among all the lines in a given set LRU also makes sure that we do not evict a line
other than the first line of a XB (a head line)–There is no point in retaining the head line while
evicting another line– if we enter the XB in the head line, we will get a miss
when we reach the evicted line – if a head line is evicted, but we enter the XB in its
middle, we may still avoid a miss A non-head line is always accessed after a head
line is accessed its LRU will be higher it will not be evicted before the head line
RR
®® Lihu Rappoport27
XB Placement Build-mode placement algorithm
– New XB is placed in banks such that it does not have bank conflict with the previous XB (if possible)
– LRU ordering is maintained by switching the LRU line with the non-conflicting line before the new XB is placed
– Set-search repairs the XBTB Delivery mode placement algorithm
– repeating bandwidth losses due to bank conflicts found conflicting lines are moved to non-conflicting banks
– Each XB is augmented with a counter – incremented when XB has a bank conflict – when counter reaches threshold, the conflicting lines
are switched with other lines in non-conflicting banks – A line can be switched with another line, only if its LRU is
higher, or if both gain from the switch
RR
®® Lihu Rappoport28
0
1
2
3
4
5
6
7
Games SpecINT SysmarkNT Average
Uo
p p
er C
ycle
XBC vs. TC Delivery Bandwidth
TC XBC
RR
®® Lihu Rappoport29
0%1%2%3%4%5%
6%7%
8%9%
10%
16K 32K 64K
Size - KUops
Uo
p M
iss
Rat
e
Miss Rate as a Function of Size
TC XBC
29%
>50%
RR
®® Lihu Rappoport30
0%
1%
2%
3%
4%
5%
6%
7%
8%
9%
1 2 4Associativity
Uo
p M
iss
Rat
e
Miss Rate as a Function of Size
XBCTC
RR
®® Lihu Rappoport31
XBC Features Summary Basic unit - XB
– Ends with a conditional branch– Multiple entries, single exit – Indexed according to ending IP– Branch promotion longer XBs
XBC uses a banked structure– Supports fetching multiple XBs/cycle– Supports variable length XBs
– Uops within XBs are stored in reverse order
RR
®® Lihu Rappoport32
Conclusions Instruction Cache has high hit rate, but …
–Low bandwidth, high latency
TC has high bandwidth, low latency, but …–Low hit rate
XBC combines the best of both worlds–High bandwidth, low latency and high hit rate
RR
®® Lihu Rappoport33