TRANSCRIPT
MadCache: A PC-aware Cache Insertion Policy
Andrew Nere, Mitch Hayenga, and Mikko Lipasti PHARM Research Group
University of Wisconsin – Madison
June 20, 2010
Executive Summary
• Problem: Changing hardware and workloads encourage investigation of cache replacement/insertion policy designs
• Proposal: MadCache uses PC history to choose the cache insertion policy
  – Last-level cache granularity
  – Individual PC granularity
• Performance improvements over LRU
  – 2.5% IPC improvement (single-threaded)
  – 4.5% weighted speedup and 6% throughput improvement (multithreaded)
Motivation
• Importance of investigating cache insertion policies
  – Direct effect on performance
  – LRU has dominated hardware designs for many years
  – Changing workloads and levels of caches
• Shared last-level cache
  – Cache behavior now depends on multiple running applications
  – One streaming thread can ruin the cache for everyone
Previous Work
• Dynamic insertion policies
  – DIP – Qureshi et al. – ISCA ’07
    • Dueling sets select the best of multiple policies
    • Bimodal Insertion Policy (BIP) offers thrash protection
  – TADIP – Jaleel et al. – PACT ’08
    • Awareness of other threads’ workloads
• Utilizing program counter information
  – PCs exhibit a useful amount of predictable behavior
  – Dead-block prediction and prefetching – ISCA ’01
  – PC-based load miss prediction – MICRO ’95
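As background, the BIP mechanism referenced above can be sketched as follows. This is a minimal illustration, not the championship implementation; the default epsilon of 1/32 is an assumption in the spirit of Qureshi et al.'s low insertion probability.

```python
import random

def bip_insert(recency_stack, line, epsilon=1/32, rng=random.random):
    """Bimodal Insertion Policy: insert at the LRU position most of
    the time, and at the MRU position with small probability epsilon.
    Streaming lines thus usually leave quickly (thrash protection),
    while a rare MRU insertion lets genuinely reused lines survive.
    recency_stack is ordered MRU -> LRU."""
    if rng() < epsilon:
        recency_stack.insert(0, line)   # rare MRU insertion
    else:
        recency_stack.append(line)      # usual LRU insertion
    return recency_stack
```

Passing `rng` explicitly makes the bimodal choice deterministic for testing; hardware would use a simple pseudo-random source instead.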
MadCache Proposal
• Problem: With changing hardware and workloads, caches are subject to suboptimal insertion policies
• Solution: Use PC information to create a better policy
  – Adaptive default cache insertion policy
  – Track PCs to determine the policy at a finer grain than DIP
  – Filter out streaming PCs
• Introducing MadCache!
MadCache Design
• Tracker Sets
  – Sample the behavior of the cache
  – Enter PCs into the PC-Predictor table
  – Determine the default policy of the cache
    • Uses set dueling – Qureshi et al. – ISCA ’07
    • LRU versus Bypassing Bimodal Insertion Policy (BBIP)
• Follower Sets
  – Majority of the last-level cache
  – Typically follow the default policy
  – Can override the default cache policy (via the PC-Predictor table)
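The set dueling that picks the default policy can be sketched with a saturating policy-selection (PSEL) counter: misses in the LRU tracker sets push it one way, misses in the BBIP tracker sets the other, and its high bit selects the default. The counter width, starting point, and polarity below are assumptions for illustration, not taken from the slides.

```python
class SetDuel:
    """Saturating policy-selection (PSEL) counter for set dueling."""

    def __init__(self, bits=10):
        self.max = (1 << bits) - 1
        self.psel = 1 << (bits - 1)   # assumed midpoint start

    def miss_in_lru_tracker(self):
        # LRU tracker sets are missing: evidence against LRU
        self.psel = min(self.psel + 1, self.max)

    def miss_in_bbip_tracker(self):
        # BBIP tracker sets are missing: evidence against BBIP
        self.psel = max(self.psel - 1, 0)

    def default_policy(self):
        # High half of the range: LRU misses more, so default to BBIP
        return "BBIP" if self.psel >= (self.max + 1) // 2 else "LRU"
```

The follower sets (the bulk of the LLC) would consult `default_policy()` on each fill unless a per-PC override applies.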
Tracker and Follower Sets
[Figure: the last-level cache divided into BBIP tracker sets, LRU tracker sets, and follower sets; each tracker-set line carries a reuse bit and an index into the PC-Predictor table]
• Tracker Sets overhead
  – 1 bit to indicate whether the line was accessed again
  – 10/11 bits to index the PC-Predictor table
MadCache Design
• PC-Predictor Table
  – Stores PCs that have accessed the Tracker Sets
  – Tracks behavior history using a counter
    • Decremented if an address is reused many times in the LLC
    • Incremented if a line is evicted without ever being reused
  – Per-PC default policy override
    • LRU (default) plus BBIP override
    • BBIP (default) plus LRU override
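A hedged sketch of the per-PC counter behavior described above: only the 6-bit width and the increment/decrement rules come from the slide; the class name, starting value, and override threshold are assumptions.

```python
class PCEntry:
    """One PC-Predictor entry: a saturating reuse/streaming counter."""

    def __init__(self, bits=6):
        self.max = (1 << bits) - 1        # 6-bit counter per the slide
        self.counter = self.max // 2      # assumed starting point

    def on_reuse(self):
        # A line inserted by this PC was accessed again in the LLC
        self.counter = max(self.counter - 1, 0)

    def on_dead_eviction(self):
        # A line was evicted without any reuse: the PC looks streaming
        self.counter = min(self.counter + 1, self.max)

    def override(self, default_policy, threshold=32):
        """High counter => streaming PC: override an LRU default with
        BBIP. Low counter => reusable PC: override a BBIP default with
        LRU. The threshold value is an assumption."""
        if default_policy == "LRU" and self.counter >= threshold:
            return "BBIP"
        if default_policy == "BBIP" and self.counter < threshold:
            return "LRU"
        return default_policy
```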
PC-Predictor Table
[Figure: PC-Predictor table entry — policy + PC (MSB) tag (1 + 64 bits), counter (6 bits), # entries (9 bits); on a miss the tag probes the table, and a multiplexer selects between the entry’s override policy and the global default policy]
• In parallel with a cache miss, the PC + current policy index the PC-Predictor
• If the lookup hits in the table, follow that PC’s override policy
• If it misses in the table, follow the global default policy
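The lookup flow on a miss can be sketched as below; the dictionary keyed by (policy, PC) is an assumed software stand-in for the tagged hardware table in the figure.

```python
def insertion_policy(table, default_policy, pc):
    """Probe the PC-Predictor in parallel with an LLC miss.

    table maps (current default policy, PC) -> override policy.
    A hit yields the PC's stored override; a miss falls back to the
    global default chosen by set dueling. The keying scheme is an
    assumption based on the slide's 'policy + PC (MSB)' tag."""
    entry = table.get((default_policy, pc))
    if entry is not None:
        return entry          # per-PC override wins
    return default_policy     # no history for this PC: follow default
```

For example, a PC that the trackers identified as streaming while the default was LRU would carry a "BBIP" override, causing its future fills to bypass or insert near-LRU.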
Multi-Threaded MadCache
• Thread-aware MadCache
  – Similar structures to single-threaded MadCache
  – Tracks based on the current policy of the other threads
• Multithreaded MadCache extensions
  – Separate tracker sets for each thread
    • Each thread still tracks LRU and BBIP
  – PC-Predictor table
    • Extended number of entries
    • Indexed by thread ID, policy, and PC
  – Set dueling per thread
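The thread-aware indexing can be illustrated by packing a tag from the thread ID, the current policies of all threads, and the PC, as described above. The exact bit packing and the one-bit policy encoding (0 = LRU, 1 = BBIP) are assumptions for illustration.

```python
def mt_predictor_tag(tid, thread_policies, pc):
    """Build a multithreaded PC-Predictor tag in the spirit of the
    'TID + <P0,P1,P2,P3> + PC' layout: a 2-bit thread ID, one policy
    bit per thread, and the PC. Field positions are illustrative."""
    policy_bits = 0
    for i, p in enumerate(thread_policies):          # P0..P3
        policy_bits |= (1 if p == "BBIP" else 0) << i
    # PC in the low 64 bits, policy vector above it, TID on top
    return (tid << 68) | (policy_bits << 64) | (pc & ((1 << 64) - 1))
```

Including every thread's current policy in the tag lets one PC learn different behavior depending on what the co-running threads are doing to the shared LLC.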
Multi-threaded MadCache
[Figure: multithreaded PC-Predictor table and last-level cache. Table entry — TID + <P0,P1,P2,P3> + PC (MSB) tag (2 + 4 + 64 bits), counter (6 bits), # entries (9/10 bits). On a miss the tag probes the table; a hit selects the override policy, a miss the default policy. The LLC holds BBIP and LRU tracker sets for each of TID-0 through TID-3, plus follower sets]
MadCache – Example Application
• Deep Packet Inspection¹
  – Large match tables (1 MB+) commonly used for DFA/XFA regular expression matching
  – Incoming byte stream from packets causes different table traversals
    • Table exhibits reuse between packets
    • Packets are mostly streaming (backtracking is implementation dependent)
¹Evaluating GPUs for Network Packet Signature Matching – ISPASS ’09
MadCache – Example Application
• Packets are mostly streaming
• Frequently accessed Match Table contents are held in L1/L2
  – Less frequently accessed elements live in the LLC/memory
[Figure: packets stream past the current processing element, which traverses the Match Table]
MadCache – Example Application
• DIP
  – Would favor the BIP policy due to packet data streaming
  – LLC becomes a mixture of Match Table and useless packet data
• MadCache
  – Would identify the PCs associated with the Match Table as useful
  – LLC populated almost entirely by the Match Table
[Figure: the DIP LLC holds a mix of packet data and table data; the MadCache LLC holds almost exclusively table data]
Experimentation
Processor: 8-stage, 4-wide pipeline
Instruction window size: 128 entries
Branch predictor: perfect
L1 inst. cache: 32KB, 64B line size, 4-way SA, LRU, 1-cycle hit
L1 data cache: 32KB, 64B line size, 8-way SA, LRU, 1-cycle hit
L2 cache: 32KB, 64B line size, 8-way SA, LRU, 10-cycle hit
L3 cache (1 thread): 1MB, 64B line size, 30-cycle hit
L3 cache (4 threads): 4MB, 64B line size, 30-cycle hit
Main memory: 200 cycles
• 15 benchmarks from SPEC CPU2006
• 15 workload mixes for multithreaded experiments
• 200 million cycle simulations
Results – Single-threaded
• IPC normalized to LRU
  – 2.5% improvement across benchmarks tested
  – Slight improvement over DIP
[Figure: IPC normalized to LRU (y-axis 0.88–1.10) for astar, gcc, hmmer, and geomean; bars for RAND, DIP, and MAD]
Results – Multithreaded
• Throughput normalized to LRU
  – 6% improvement across mixes tested
  – DIP performs similarly to LRU
[Figure: throughput normalized to LRU (y-axis 0.95–1.20); bars for DIP and MAD]
Results
• Weighted speedup normalized to LRU
  – 4.5% improvement across benchmarks tested
  – DIP performs similarly to LRU
[Figure: weighted speedup normalized to LRU (y-axis 0.96–1.14); bars for DIP and MAD]
Future Work
• MadderCache?
  – Optimize the size of the structures
    • PC-Predictor table size
    • Replace the CAM with a hashed PC and tag
  – Detailed analysis of benchmarks with MadCache
  – Extend PC predictions
    • Currently do not take sharers into account
Conclusions
• Cache behavior is still evolving
  – Changing cache levels, sharing, and workloads
• The MadCache insertion policy uses PC information
  – PCs exhibit a useful amount of predictable behavior
• MadCache performance
  – 2.5% IPC improvement for single-threaded
  – 4.5% weighted speedup, 6% throughput improvement for 4 threads
  – Sized to the competition bit budget
    • Preliminary investigations show little impact from reducing the structures
Questions?