TRANSCRIPT
MadCache: A PC-aware Cache Insertion Policy
Andrew Nere, Mitch Hayenga, and Mikko Lipasti PHARM Research Group
University of Wisconsin – Madison
June 20, 2010
Executive Summary
• Problem: Changing hardware and workloads encourage investigation of cache replacement/insertion policy designs
• Proposal: MadCache uses PC history to choose the cache insertion policy
  – Last-level cache granularity
  – Individual PC granularity
• Performance improvements over LRU
  – 2.5% IPC improvement (single-threaded)
  – 4.5% weighted speedup and 6% throughput improvement (multithreaded)
Motivation
• Importance of investigating cache insertion policies
  – Direct effect on performance
  – LRU has dominated hardware designs for many years
  – Changing workloads and levels of caches
• Shared last-level cache
  – Cache behavior now depends on multiple running applications
  – One streaming thread can ruin the cache for everyone
Previous Work
• Dynamic insertion policies
  – DIP – Qureshi et al. – ISCA ’07
    • Dueling sets select the best of multiple policies
    • Bimodal Insertion Policy (BIP) offers thrash protection
  – TADIP – Jaleel et al. – PACT ’08
    • Awareness of other threads’ workloads
• Utilizing program counter information
  – PCs exhibit a useful amount of predictable behavior
  – Dead-block prediction and prefetching – ISCA ’01
  – PC-based load miss prediction – MICRO ’95
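As background, the BIP mechanism referenced above can be sketched as follows. This is a minimal illustration, not the championship implementation; the default epsilon of 1/32 is an assumption in the spirit of Qureshi et al.'s low insertion probability.

```python
import random

def bip_insert(recency_stack, line, epsilon=1/32, rng=random.random):
    """Bimodal Insertion Policy: insert at the LRU position most of
    the time, and at the MRU position with small probability epsilon.
    Streaming lines thus usually leave quickly (thrash protection),
    while a rare MRU insertion lets genuinely reused lines survive.
    recency_stack is ordered MRU -> LRU."""
    if rng() < epsilon:
        recency_stack.insert(0, line)   # rare MRU insertion
    else:
        recency_stack.append(line)      # usual LRU insertion
    return recency_stack
```

Passing `rng` explicitly makes the bimodal choice deterministic for testing; hardware would use a simple pseudo-random source instead.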
MadCache Proposal
• Problem: With changing hardware and workloads, caches are subject to suboptimal insertion policies
• Solution: Use PC information to create a better policy
  – Adaptive default cache insertion policy
  – Track PCs to determine the policy at a finer grain than DIP
  – Filter out streaming PCs
• Introducing MadCache!
MadCache Design
• Tracker Sets
  – Sample the behavior of the cache
  – Enter PCs into the PC-Predictor table
  – Determine the default policy of the cache
    • Uses set dueling – Qureshi et al. – ISCA ’07
    • LRU versus Bypassing Bimodal Insertion Policy (BBIP)
• Follower Sets
  – Majority of the last-level cache
  – Typically follow the default policy
  – Can override the default cache policy (via the PC-Predictor table)
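The set dueling that picks the default policy can be sketched with a saturating policy-selection (PSEL) counter: misses in the LRU tracker sets push it one way, misses in the BBIP tracker sets the other, and its high bit selects the default. The counter width, starting point, and polarity below are assumptions for illustration, not taken from the slides.

```python
class SetDuel:
    """Saturating policy-selection (PSEL) counter for set dueling."""

    def __init__(self, bits=10):
        self.max = (1 << bits) - 1
        self.psel = 1 << (bits - 1)   # assumed midpoint start

    def miss_in_lru_tracker(self):
        # LRU tracker sets are missing: evidence against LRU
        self.psel = min(self.psel + 1, self.max)

    def miss_in_bbip_tracker(self):
        # BBIP tracker sets are missing: evidence against BBIP
        self.psel = max(self.psel - 1, 0)

    def default_policy(self):
        # High half of the range: LRU misses more, so default to BBIP
        return "BBIP" if self.psel >= (self.max + 1) // 2 else "LRU"
```

The follower sets (the bulk of the LLC) would consult `default_policy()` on each fill unless a per-PC override applies.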
Tracker and Follower Sets
[Figure: the last-level cache divided into BBIP tracker sets, LRU tracker sets, and follower sets; each tracker-set line carries a reuse bit and an index into the PC-Predictor table]
• Tracker Sets overhead
  – 1 bit to indicate whether the line was accessed again
  – 10/11 bits to index the PC-Predictor table
MadCache Design
• PC-Predictor Table
  – Stores PCs that have accessed the Tracker Sets
  – Tracks behavior history using a counter
    • Decremented if an address is reused many times in the LLC
    • Incremented if a line is evicted without ever being reused
  – Per-PC default policy override
    • LRU (default) plus BBIP override
    • BBIP (default) plus LRU override
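A hedged sketch of the per-PC counter behavior described above: only the 6-bit width and the increment/decrement rules come from the slide; the class name, starting value, and override threshold are assumptions.

```python
class PCEntry:
    """One PC-Predictor entry: a saturating reuse/streaming counter."""

    def __init__(self, bits=6):
        self.max = (1 << bits) - 1        # 6-bit counter per the slide
        self.counter = self.max // 2      # assumed starting point

    def on_reuse(self):
        # A line inserted by this PC was accessed again in the LLC
        self.counter = max(self.counter - 1, 0)

    def on_dead_eviction(self):
        # A line was evicted without any reuse: the PC looks streaming
        self.counter = min(self.counter + 1, self.max)

    def override(self, default_policy, threshold=32):
        """High counter => streaming PC: override an LRU default with
        BBIP. Low counter => reusable PC: override a BBIP default with
        LRU. The threshold value is an assumption."""
        if default_policy == "LRU" and self.counter >= threshold:
            return "BBIP"
        if default_policy == "BBIP" and self.counter < threshold:
            return "LRU"
        return default_policy
```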
PC-Predictor Table
[Figure: PC-Predictor table entry — policy + PC (MSB) tag (1 + 64 bits), counter (6 bits), # entries (9 bits); on a miss the tag probes the table, and a multiplexer selects between the entry’s override policy and the global default policy]
• In parallel with a cache miss, the PC + current policy index the PC-Predictor
• If the lookup hits in the table, follow that PC’s override policy
• If it misses in the table, follow the global default policy
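The lookup flow on a miss can be sketched as below; the dictionary keyed by (policy, PC) is an assumed software stand-in for the tagged hardware table in the figure.

```python
def insertion_policy(table, default_policy, pc):
    """Probe the PC-Predictor in parallel with an LLC miss.

    table maps (current default policy, PC) -> override policy.
    A hit yields the PC's stored override; a miss falls back to the
    global default chosen by set dueling. The keying scheme is an
    assumption based on the slide's 'policy + PC (MSB)' tag."""
    entry = table.get((default_policy, pc))
    if entry is not None:
        return entry          # per-PC override wins
    return default_policy     # no history for this PC: follow default
```

For example, a PC that the trackers identified as streaming while the default was LRU would carry a "BBIP" override, causing its future fills to bypass or insert near-LRU.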
Multi-Threaded MadCache
• Thread-aware MadCache
  – Similar structures to single-threaded MadCache
  – Tracks based on the current policy of the other threads
• Multithreaded MadCache extensions
  – Separate tracker sets for each thread
    • Each thread still tracks LRU and BBIP
  – PC-Predictor table
    • Extended number of entries
    • Indexed by thread ID, policy, and PC
  – Set dueling per thread
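The thread-aware indexing can be illustrated by packing a tag from the thread ID, the current policies of all threads, and the PC, as described above. The exact bit packing and the one-bit policy encoding (0 = LRU, 1 = BBIP) are assumptions for illustration.

```python
def mt_predictor_tag(tid, thread_policies, pc):
    """Build a multithreaded PC-Predictor tag in the spirit of the
    'TID + <P0,P1,P2,P3> + PC' layout: a 2-bit thread ID, one policy
    bit per thread, and the PC. Field positions are illustrative."""
    policy_bits = 0
    for i, p in enumerate(thread_policies):          # P0..P3
        policy_bits |= (1 if p == "BBIP" else 0) << i
    # PC in the low 64 bits, policy vector above it, TID on top
    return (tid << 68) | (policy_bits << 64) | (pc & ((1 << 64) - 1))
```

Including every thread's current policy in the tag lets one PC learn different behavior depending on what the co-running threads are doing to the shared LLC.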
Multi-threaded MadCache
[Figure: multithreaded PC-Predictor table and last-level cache. Table entry — TID + <P0,P1,P2,P3> + PC (MSB) tag (2 + 4 + 64 bits), counter (6 bits), # entries (9/10 bits). On a miss the tag probes the table; a hit selects the override policy, a miss the default policy. The LLC holds BBIP and LRU tracker sets for each of TID-0 through TID-3, plus follower sets]
MadCache – Example Application
• Deep Packet Inspection¹
  – Large match tables (1 MB+) commonly used for DFA/XFA regular expression matching
  – Incoming byte stream from packets causes different table traversals
    • Table exhibits reuse between packets
    • Packets are mostly streaming (backtracking is implementation dependent)
¹Evaluating GPUs for Network Packet Signature Matching – ISPASS ’09
MadCache – Example Application
• Packets are mostly streaming
• Frequently accessed Match Table contents are held in L1/L2
  – Less frequently accessed elements live in the LLC/memory
[Figure: packets stream past the current processing element, which traverses the Match Table]
MadCache – Example Application
• DIP
  – Would favor the BIP policy due to packet data streaming
  – LLC becomes a mixture of Match Table and useless packet data
• MadCache
  – Would identify the PCs associated with the Match Table as useful
  – LLC populated almost entirely by the Match Table
[Figure: the DIP LLC holds a mix of packet data and table data; the MadCache LLC holds almost exclusively table data]
Experimentation
Processor: 8-stage, 4-wide pipeline
Instruction window size: 128 entries
Branch predictor: perfect
L1 inst. cache: 32KB, 64B line size, 4-way SA, LRU, 1-cycle hit
L1 data cache: 32KB, 64B line size, 8-way SA, LRU, 1-cycle hit
L2 cache: 32KB, 64B line size, 8-way SA, LRU, 10-cycle hit
L3 cache (1 thread): 1MB, 64B line size, 30-cycle hit
L3 cache (4 threads): 4MB, 64B line size, 30-cycle hit
Main memory: 200 cycles
• 15 benchmarks from SPEC CPU2006
• 15 workload mixes for multithreaded experiments
• 200 million cycle simulations
Results – Single-threaded
• IPC normalized to LRU
  – 2.5% improvement across benchmarks tested
  – Slight improvement over DIP
[Figure: IPC normalized to LRU (y-axis 0.88–1.10) for astar, gcc, hmmer, and geomean; bars for RAND, DIP, and MAD]
Results – Multithreaded
• Throughput normalized to LRU
  – 6% improvement across mixes tested
  – DIP performs similarly to LRU
[Figure: throughput normalized to LRU (y-axis 0.95–1.20); bars for DIP and MAD]
Results
• Weighted speedup normalized to LRU
  – 4.5% improvement across benchmarks tested
  – DIP performs similarly to LRU
[Figure: weighted speedup normalized to LRU (y-axis 0.96–1.14); bars for DIP and MAD]
Future Work
• MadderCache?
  – Optimize the size of the structures
    • PC-Predictor table size
    • Replace the CAM with a hashed PC and tag
  – Detailed analysis of benchmarks with MadCache
  – Extend PC predictions
    • Currently do not take sharers into account
Conclusions
• Cache behavior is still evolving
  – Changing cache levels, sharing, and workloads
• The MadCache insertion policy uses PC information
  – PCs exhibit a useful amount of predictable behavior
• MadCache performance
  – 2.5% IPC improvement for single-threaded
  – 4.5% weighted speedup, 6% throughput improvement for 4 threads
  – Sized to the competition bit budget
    • Preliminary investigations show little impact from reducing the structures
Questions?