Scalable Load and Store Processing in Latency Tolerant Processors
Amit Gandhi (1, 2)
Haitham Akkary (1)
Ravi Rajwar (1)
Srikanth T. Srinivasan (1)
Konrad Lai (1)
(1) Intel
(2) Portland State University
2
Problem: tolerating miss latencies
• Increasing miss latencies to memory
  – large instruction windows tolerate latencies
  – naïve window scaling impractical
• Resource-efficient large instruction windows
  – sustain 1000s of instructions in flight
  – need small register files and schedulers
  – do not address memory buffer efficiency
Must track all memory operations
Memory consistency, ordering, and forwarding
3
Why is this a problem?
• Memory operations tracked in load & store buffers
  – buffers require CAMs for scanning and matching
  – CAMs have high area and power requirements
• Don’t always need large memory buffers
  – L2 cache hit → small buffers sufficient
  – L2 cache miss → large buffers necessary
• Scaling CAMs is difficult
• Why pay the price when not necessary?
Must eliminate CAMs from large buffers
4
Loads: Unordered buffer
• Hierarchical load buffers
• Conventional level one load buffer
  – effective in the absence of a miss
• Unordered level two load buffer
  – used only when a long-latency miss occurs
  – set-associative cache structure
    • no scan, only indexed lookup necessary
  – does not track precise order of loads
    • sufficient to know if a violation occurred (not where)
    • checkpoint rollback
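The unordered level two load buffer can be sketched as a small set-associative table. This is a minimal illustration only: the class name, the set and way counts, and the modulo indexing are assumptions made for the example, not details from the talk.

```python
class UnorderedLoadBuffer:
    """Toy set-associative load buffer: indexed lookup only, no age order."""

    def __init__(self, num_sets=4, ways=2):
        # num_sets and ways are illustrative sizes (assumptions).
        self.num_sets = num_sets
        self.ways = ways
        self.sets = [set() for _ in range(num_sets)]

    def _index(self, addr):
        # Assumed indexing function: low-order address bits via modulo.
        return addr % self.num_sets

    def record_load(self, addr):
        """Record a completed load. Returns False if the set is full."""
        entries = self.sets[self._index(addr)]
        if addr in entries:
            return True
        if len(entries) >= self.ways:
            return False  # no free way in this set
        entries.add(addr)
        return True

    def store_conflicts(self, addr):
        """True if some recorded load touched addr. Only one set is
        probed, and the buffer cannot say which load it was, so the
        caller's only recovery is a rollback to the checkpoint."""
        return addr in self.sets[self._index(addr)]

    def clear(self):
        """Forget all recorded loads, e.g. after a checkpoint rollback."""
        for entries in self.sets:
            entries.clear()


buf = UnorderedLoadBuffer()
buf.record_load(0x40)
hit = buf.store_conflicts(0x40)    # same address: violation detected
miss = buf.store_conflicts(0x44)   # same set, different address: no hit
```

Because the lookup probes a single set, no entry-by-entry scan is needed; the cost is that a hit only says a violation occurred, not where.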
5
Stores: CAM-free buffers
• Hierarchical store queue
• Conventional level one store queue
  – effective in the absence of a miss
• CAM-free level two store queue
  – used only when a long-latency miss occurs
  – used only for ordering
No scanning or matching necessary in queue
Decouple ordering from forwarding
1. Redo stores to enforce order
2. Forward from cache instead of queue
6
Outline
• Motivation
• Resource efficient processors
  – Continual Flow Pipelines
  – memory buffer demands
• Store processing
• Results
• Summary
7
Implications of a miss
• Long-latency misses to memory
  – place pressure on critical resources
  – pipeline quickly stalls due to blocked resources
• Large instruction window processors
  – execute useful instructions in shadow of miss
  – tolerate latency by overlapping miss with useful work
  – naïve scaling impractical
• Resource-efficient instruction windows
  – scale window to thousands
  – do not require scaled cycle-critical structures
8
Resource-efficient latency tolerance
Significant fraction of instructions in the shadow of a miss are independent of the miss
Exploit above program property
Treat and process miss-dependent and miss-independent instructions differently
9
Continual Flow Pipeline processor
• Miss-dependent instructions
  – release critical resources
  – leave pipeline, and wait outside pipeline in slice buffer
• Miss-independent instructions
  – execute
  – release critical resources and retire
• When miss returns
  – miss-dependent instructions re-acquire resources
  – execute and retire
• After miss-dependent instructions execute
  – results automatically integrated
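The split between miss-dependent and miss-independent instructions can be illustrated with a small sketch. It assumes dependence is tracked by marking the destination register of the missing load and propagating that mark through source registers; the function name, the tuple encoding of instructions, and the register names are hypothetical and chosen only for this example.

```python
def split_by_miss(instructions, miss_dest):
    """Partition instructions (in program order) around a load miss.

    instructions: list of (dest_register, [source_registers]).
    miss_dest:    destination register of the load that missed.
    Returns (slice_buffer, executed), each in program order.
    """
    waiting = {miss_dest}          # registers whose value awaits the miss
    slice_buffer, executed = [], []
    for dest, srcs in instructions:
        if any(src in waiting for src in srcs):
            # Miss-dependent: leaves the pipeline and waits in the slice buffer.
            waiting.add(dest)
            slice_buffer.append((dest, srcs))
        else:
            # Miss-independent: executes now; its result overwrites dest,
            # so dest no longer depends on the miss.
            waiting.discard(dest)
            executed.append((dest, srcs))
    return slice_buffer, executed


program = [
    ("r2", ["r1"]),        # uses the missing load's result
    ("r3", ["r4"]),        # independent
    ("r5", ["r2", "r3"]),  # dependent through r2
    ("r1", ["r4"]),        # independent, redefines r1
    ("r6", ["r1"]),        # reads the new r1, so independent
]
deferred, done = split_by_miss(program, miss_dest="r1")
```

In this toy run, the instructions writing r2 and r5 are deferred, while the other three execute in the shadow of the miss.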
10
Continual Flow Pipeline processor
• Critical resource efficient
  – don’t require large register files, large schedulers
• Need to track all memory operations
  – large load buffer → large CAM footprint and power
  – hierarchical store queue
    • small, fast L1 store queue (32 entries)
    • large, slow L2 store queue (~512 entries)
      – large CAM footprint
      – high leakage power
    • good performance
11
Why track all memory operations?
• Stores must update in program order
• Load/store dependence speculation
• Multiprocessor memory consistency
• Continual Flow Pipeline processors
  – execute independents ahead of dependents
  – aggressively reorder memory operation execution
12
Outline
• Motivation
• Resource efficient processors
• Store processing
  – store queue overview
  – SRL key idea
  – SRL workings
• Results
• Summary
13
Functions of a store queue
• Ordering
  – ensure memory updates are in program order
  – correctness
• Forwarding
  – provide data to subsequent loads
  – performance
  – CAM
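[Figure: a load (LD) to address Z is matched against the address (A) field of the store queue (STQ) entries; the data (D) of the matching entry is forwarded to the load.]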
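The two store queue functions can be sketched in a few lines. This is a simplified software model with hypothetical names; the address scan in `forward` stands in for the associative (CAM) match that hardware performs in parallel.

```python
class StoreQueue:
    """Toy conventional store queue: one structure orders and forwards."""

    def __init__(self):
        self.entries = []  # (address, data) pairs, oldest first

    def add_store(self, addr, data):
        # Ordering: stores are appended in program order.
        self.entries.append((addr, data))

    def forward(self, load_addr):
        """Forwarding: return data of the youngest matching store, or None.

        Every entry's address may need comparing against the load
        address, which is the work a CAM does in hardware."""
        for addr, data in reversed(self.entries):
            if addr == load_addr:
                return data
        return None

    def drain_oldest(self, memory):
        # Ordering: memory is updated strictly oldest store first.
        addr, data = self.entries.pop(0)
        memory[addr] = data


stq = StoreQueue()
stq.add_store("X", 1)
stq.add_store("Z", 2)
stq.add_store("X", 3)
value = stq.forward("X")     # youngest store to X supplies the data
nothing = stq.forward("K")   # no matching store: the load reads the cache
```

The match cost grows with the number of entries, which is why scaling this structure to hundreds of entries is expensive.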
14
Conventional store queue
• Single structure for ordering, forwarding
• Large sizes increase CAM area & leakage
  – CAM contribution to area and power dominates
Efficiency → Eliminate CAMs
15
Decoupling ordering from forwarding
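[Figure: the CAM-based L2 STQ (address A, data D fields) shown next to an SRAM structure (address A, data D fields).]
Store Redo Log (SRL)
• FIFO
• Program order
• No CAM
Data cache
• Forwarding
• No CAM
No CAMs for ordering/forwarding!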
16
Store Redo Log workings (1)
In shadow of a miss
• Allocate FIFO L2 store queue (SRL) entry for all stores
  – records program order for stores
• Dependent stores
  – not ready, release L1 store queue entry, and enter SRL
• Independent stores
  – update cache temporarily, and enter SRL
• Loads
  – independent loads forward from cache & retire
  – dependent loads go to slice buffer
  – do not scan L2 store queue for forwarding
17
Store Redo Log workings (2)
When miss returns
• Discard all independent store updates to cache
  – these stores don’t re-execute
  – their dependents don’t re-execute
• Drain the SRL in program order
  – reconstruct memory live-outs
  – program order maintained
  – no re-execution, only re-update
    • no extra cache ports required
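The two SRL phases can be sketched as a small software model. It is a rough illustration under stated assumptions: the class and method names are hypothetical, temporary cache updates are modeled as a separate overlay dictionary, and data for dependent stores is supplied through a `resolved` mapping when the miss returns.

```python
class StoreRedoLog:
    """Toy model of a FIFO, CAM-free store redo log plus a data cache."""

    def __init__(self, cache):
        self.cache = cache   # committed cache contents: address -> value
        self.temp = {}       # temporary updates made in the miss shadow
        self.log = []        # FIFO of [address, data]; data None = not ready

    def store(self, addr, data=None):
        """Log a store in program order; data=None marks a dependent store.

        Returns the log slot so a dependent store can be resolved later."""
        self.log.append([addr, data])
        if data is not None:
            self.temp[addr] = data   # independent store: temporary update
        return len(self.log) - 1

    def load(self, addr):
        """Independent load: forwards from the cache, never scans the log."""
        return self.temp.get(addr, self.cache.get(addr))

    def miss_returns(self, resolved):
        """Discard temporary updates, then redo all stores in program order.

        resolved maps log slot -> data for dependent stores that have
        now executed (an assumption of this sketch)."""
        self.temp.clear()
        for slot, (addr, data) in enumerate(self.log):
            if data is None:
                data = resolved[slot]
            self.cache[addr] = data   # re-update only, no re-execution
        self.log.clear()


cache = {"X": 38, "Y": 2}
srl = StoreRedoLog(cache)
slot = srl.store("X")            # older, miss-dependent store to X
srl.store("Y", 17)               # independent store
srl.store("X", 12)               # younger, independent store to X
seen = srl.load("X")             # forwarded from the temporary update
srl.miss_returns({slot: 5})      # the dependent store resolves to 5
```

Draining in program order means the younger store to X is applied last, so the older dependent store cannot overwrite it even though its data arrived later.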
19
Handling hazards: WAW
[Figure: example with stores ST X, ST Y, ST X in program order, showing the contents of the L1 STQ, the SRL, and the cache before and after the miss returns.]
20
Handling hazards: WAR
[Figure: example with LD X, ST X, ST Y in program order, showing the contents of the L1 STQ, L1 LDQ, slice buffer, SRL, and cache before and after the miss returns.]
21
Handling hazards: RAW
• Detect by snooping completed stores
• Restart execution in case of violations
  – restore to checkpoint
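RAW detection by snooping can be sketched as follows. The function names and the representation of the load buffer as a set of addresses are assumptions for illustration; the check is deliberately conservative, treating any address match as a violation.

```python
def snoop_for_raw(loaded_addrs, completed_store_addrs):
    """Return the first completed store address that hits a recorded load.

    loaded_addrs:          addresses read by loads that already executed.
    completed_store_addrs: addresses of stores completing afterwards.
    Returns None when no store matches any recorded load."""
    for addr in completed_store_addrs:
        if addr in loaded_addrs:
            return addr
    return None


def finish_miss(state, checkpoint, loaded_addrs, completed_store_addrs):
    """Return (state_to_continue_from, restarted)."""
    if snoop_for_raw(loaded_addrs, completed_store_addrs) is not None:
        # Violation: a load may have read stale data, so restart
        # execution from a copy of the checkpointed state.
        return dict(checkpoint), True
    return state, False


checkpoint = {"r1": 0}
speculative = {"r1": 7}
state, restarted = finish_miss(speculative, checkpoint, {"X"}, ["Y", "X"])
```

Here the completed store to X hits a recorded load, so execution restarts from the checkpoint rather than repairing the individual load.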
23
Evaluation
• Ideal store queue
  – large L1 STQ (latency = 3 cycles)
  – gives upper bound (impractical to build)
• Hierarchical store queue
  – L1 STQ (latency = 3 cycles)
  – L2 STQ (with CAMs) (latency = 8 cycles)
• SRL store processing
  – L1 STQ (latency = 3 cycles)
  – FIFO CAM-free Store Redo Log
• Baseline
  – L1 STQ (latency = 3 cycles)
24
SRL performance
[Figure: bar chart of % speedup over baseline (y-axis, 0 to 30) for the benchmark suites SFP2K, SINT2K, WEB, MM, PROD, SERVER, and WS. Series: SRL store processing, Hierarchical STQ, Ideal STQ.]
Performance within 6% of ideal store queue
25
Power and area comparison
• Hierarchical store queue
  – 90nm CMOS technology
  – SPICE simulations
  – circuit optimized to reduce leakage power
  – banked structure to reduce dynamic power
• SRL over Hierarchical STQ
  – more than 50% reduction in leakage power
  – more than 90% reduction in dynamic power
  – 75% reduction in area