Improving Cache Performance by Combining Cost-Sensitivity and Locality Principles in Cache Replacement Algorithms

Rami Sheikh (North Carolina State University), Mazen Kharbutli (Jordan Univ. of Science and Technology)

ICCD 2010, Amsterdam, the Netherlands


TRANSCRIPT

Page 1:

ICCD 2010, Amsterdam, the Netherlands

Rami Sheikh, North Carolina State University

Mazen Kharbutli, Jordan Univ. of Science and Technology

Improving Cache Performance by Combining Cost-Sensitivity and Locality Principles in Cache Replacement Algorithms

Page 2:

Outline

Page 3:

Motivation

The processor-memory performance gap makes L2 cache performance crucial.

Traditionally, L2 cache replacement algorithms focus on improving the hit rate. But cache misses have different costs, so it is better to take the cost of a miss into consideration.

The processor’s ability to (partially) hide the L2 cache miss latency differs between misses. It depends on the dependency chain, miss bursts, etc.

Page 4:

Motivation

Issued Instructions per Miss Histogram.

Page 5:

Contributions

A novel, effective, but simple cost estimation method. Based on the number of instructions a processor manages to issue during the miss latency. A reflection of the processor’s ability to hide the miss latency.

Number of issued instructions during the miss:

Small: high-cost miss/block.

Large: low-cost miss/block.
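The mapping from issued-instruction count to miss cost can be illustrated with a minimal Python sketch. The function name `classify_miss` and the value of `EXAMPLE_THRESHOLD` are assumptions made for illustration; the slides do not state a threshold value.

```python
def classify_miss(issued_during_miss: int, threshold: int) -> str:
    """Label a miss by how many instructions issued while it was outstanding."""
    # Few issued instructions: the processor mostly stalled, so the miss was expensive.
    # Many issued instructions: the latency was largely hidden, so the miss was cheap.
    return "low-cost" if issued_during_miss > threshold else "high-cost"


# Hypothetical threshold, chosen only for this example.
EXAMPLE_THRESHOLD = 100

for count in (12, 450):
    print(count, classify_miss(count, EXAMPLE_THRESHOLD))
```

With this reading, a larger count always means a cheaper miss, which matches the later low-cost test `X.cost > threshold`.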

Page 6:

Contributions

LACS: Locality-Aware Cost-Sensitive Cache Replacement Algorithm.

Integrates our novel cost estimation method with a locality algorithm (e.g. LRU).

Attempts to reserve high cost blocks in the cache while their locality is still high.

On a cache miss, a low-cost block is chosen for eviction.

Excellent performance improvement at feasible cost. Performance improvement: 15% average and up to 85%. Effective in uniprocessors and CMPs. Effective for different cache configurations.

Page 7:

Outline

Page 8:

Related Work

Cache replacement algorithms traditionally attempt to reduce the cache miss rate: Belady’s OPT algorithm [Belady 1966], dead block predictors [Kharbutli 2008, etc.], and OPT emulators [Rajan 2007].

Cache misses are not uniform and have different costs [Srinivasan 1998, Puzak 2008]. This motivates a new class of replacement algorithms. Miss cost can be latency, power consumption, penalty, etc.

Page 9:

Related Work

Jeong and Dubois [1999, 2003, 2006]: In the context of CC-NUMA multiprocessors. The cost of a miss mapping to remote memory is higher than that of a miss mapping to local memory. LACS estimates cost based on the processor’s ability to tolerate the miss latency, not the miss latency value itself.

Jeong et al. [2008]: In the context of uniprocessors. The next access is predicted: load (high cost) or store (low cost). All load misses are treated equally. LACS does not treat load misses equally (they have different costs), and a store miss may have a high cost.

Page 10:

Related Work

Srinivasan et al. [2001]: Critical blocks are preserved in a special critical cache. Criticality is estimated from the load’s dependence chain. No significant improvement under realistic configurations. LACS does not track the dependence chain; it uses a simpler cost heuristic. LACS achieves considerable performance improvement under realistic configurations.

Page 11:

Related Work

Qureshi et al. [2006]: Based on memory-level parallelism (MLP). Cache misses occur in isolation (high cost) or concurrently (low cost). Suffers from pathological cases. Integrated with a tournament predictor to choose between it and LRU (SBAR). LACS does not slow down any of the 20 benchmarks in our study. LACS outperforms MLP-SBAR in our study.

Page 12:

Outline

Page 13:

LACS Storage Organization

[Diagram: processor (P), L1$, L2$, and MSHR, annotated with the added LACS structures.]

IIC (32 bits).

IIRs in the MSHR (32 bits each).

Prediction Table. Each entry: 6-bit hashed tag, 5-bit cost, 1-bit confidence (8K sets x 4 ways x 1.5 bytes/entry = 48 KB).

Total Storage Overhead ≈ 48 KB: 9.4% of a 512 KB cache, 4.7% of a 1 MB cache.
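The overhead figures above can be rechecked with a few lines of Python. The constant and function names are chosen for this sketch only. Like the slide's approximate total, it counts the prediction table alone and leaves out the 32-bit IIC and the per-MSHR IIRs.

```python
SETS = 8 * 1024          # 8K sets
WAYS = 4                 # 4 ways per set
ENTRY_BITS = 6 + 5 + 1   # hashed tag + cost + confidence = 12 bits = 1.5 bytes


def table_bytes(sets: int = SETS, ways: int = WAYS, entry_bits: int = ENTRY_BITS) -> float:
    """Size of the prediction table in bytes."""
    return sets * ways * entry_bits / 8


def overhead_percent(cache_kb: int) -> float:
    """Prediction-table size as a percentage of an L2 cache of cache_kb kilobytes."""
    return 100 * table_bytes() / (cache_kb * 1024)


print(table_bytes() / 1024)              # table size in KB
print(round(overhead_percent(512), 1))   # overhead relative to a 512 KB cache
print(round(overhead_percent(1024), 1))  # overhead relative to a 1 MB cache
```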

Page 14:

Outline

Page 15:

LACS Implementation

On an L2 cache miss on block B in set S:

MSHR[B].IIR = IIC

Page 16:

LACS Implementation

On an L2 cache miss on block B in set S:

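Identify all low-cost blocks in set S. If there is at least one, choose a victim randomly from among them. Otherwise, the LRU block is the victim.

Block X is a low-cost block if: X.cost > threshold, and X.conf == 1. (X.cost records the instructions issued during X’s previous miss, so a larger value means a cheaper miss.)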
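The victim selection rule above could be sketched in Python as follows. This is an illustration under assumptions, not the authors' implementation: the `Block` fields, the `lru_age` encoding of recency, and the `rng` parameter are all hypothetical names introduced here.

```python
import random
from dataclasses import dataclass


@dataclass
class Block:
    tag: int
    cost: int     # instructions issued during the block's previous miss
    conf: int     # 1 if the stored cost is trusted, else 0
    lru_age: int  # hypothetical recency encoding: larger means less recently used


def is_low_cost(block: Block, threshold: int) -> bool:
    # Low-cost test from the slide: X.cost > threshold and X.conf == 1.
    return block.cost > threshold and block.conf == 1


def choose_victim(cache_set, threshold, rng=random):
    """Pick a random low-cost block if one exists, otherwise the LRU block."""
    low_cost = [b for b in cache_set if is_low_cost(b, threshold)]
    if low_cost:
        return rng.choice(low_cost)
    # No confident low-cost block: fall back to plain LRU.
    return max(cache_set, key=lambda b: b.lru_age)
```

Passing a seeded `random.Random` instance as `rng` makes the random choice reproducible in a simulator.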

Page 17:

LACS Implementation

On an L2 cache miss on block B in set S:

When the miss returns, calculate B’s new cost:
newCost = IIC - MSHR[B].IIR

Update B’s table info:
if (newCost ≈ B.cost) B.conf = 1, else B.conf = 0
B.cost = newCost
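The bookkeeping across the miss (snapshot the IIC when the miss occurs, then compute and store the cost when it returns) might look like the Python sketch below. It reads IIC as a running count of issued instructions and IIR as a per-miss snapshot of it, which is an interpretation of the slides. The class name, the `tolerance` parameter standing in for "≈", and the dictionary-based tables are assumptions. The sketch also ignores 32-bit counter wraparound and the quantization of cost into 5 bits.

```python
class LacsCostTracker:
    def __init__(self, tolerance: int = 0):
        self.iic = 0                # running count of issued instructions
        self.mshr_iir = {}          # block -> IIC snapshot taken at miss time
        self.table = {}             # block -> (cost, conf)
        self.tolerance = tolerance  # hypothetical meaning of "approximately equal"

    def issue(self, n: int = 1) -> None:
        """Advance the counter as the processor issues n instructions."""
        self.iic += n

    def on_miss(self, block) -> None:
        # MSHR[B].IIR = IIC
        self.mshr_iir[block] = self.iic

    def on_miss_return(self, block):
        # newCost = IIC - MSHR[B].IIR
        new_cost = self.iic - self.mshr_iir.pop(block)
        old = self.table.get(block)
        # Confidence is set only when the new cost agrees with the stored one.
        agrees = old is not None and abs(new_cost - old[0]) <= self.tolerance
        self.table[block] = (new_cost, 1 if agrees else 0)
        return self.table[block]
```

In this sketch a block seen for the first time always gets `conf = 0`, so it cannot be treated as low-cost until a second miss confirms its cost.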

Page 18:

Outline

Page 19:

Evaluation Environment

Evaluation using SESC: a detailed, cycle-accurate, execution-driven simulator.

20 of the 26 SPEC2000 benchmarks are used. Reference input sets. 2 billion instructions simulated after skipping the first 2 billion instructions.

Benchmarks divided into two groups (GrpA, GrpB). GrpA (L2 cache performance-constrained): ammp, applu, art, equake, gcc, mcf, mgrid, swim, twolf, and vpr.

L2 cache: 512 KB, 8-way, WB, LRU.

Page 20:

Outline

Page 21:

Evaluation

Performance Improvement:

L2 Cache Miss Rates:

Page 22:

Evaluation

L2 Cache Miss Rates:

Fraction of LRU blocks reserved by LACS that get re-used:

ammp 94%, applu 22%, art 51%, equake 15%, gcc 89%, mcf 1%, mgrid 33%, swim 11%, twolf 21%, vpr 22%

Low-cost blocks in the cache: <20%

OPT evicted blocks that were low-cost: 40% to 98%

Strong correlation between blocks evicted by OPT and their cost.

Page 23:

Evaluation

Performance improvement in a CMP architecture:

Page 24:

Evaluation

Sensitivity to cache parameters:

Configuration | Minimum | Average | Maximum
256 KB, 8-way | 0% | 3% | 9%
512 KB, 8-way | 0% | 15% | 85%
1 MB, 8-way | -3% | 8% | 47%
2 MB, 8-way | -3% | 19% | 195%
512 KB, 4-way | 0% | 12% | 69%
512 KB, 16-way | -1% | 17% | 101%

Page 25:

Outline

Page 26:

Conclusion

LACS’s Exquisite Features:

Novelty: New metric for measuring cost-sensitivity.

Combines Two Principles: Locality and cost-sensitivity.

Performance Improvements at Feasible Cost: 15% average speedup in L2 cache performance-constrained benchmarks. Effective in uniprocessor and CMP architectures. Effective for different cache configurations.

Page 27:

Thank You!

Questions?