TRANSCRIPT
Extending the Effectiveness of 3D-Stacked DRAM Caches with an Adaptive Multi-Queue
Policy (G. H. Loh).
Bismita Srichandan, Semra Kul, Rasanjalee Disanayaka
Outline
• Introduction
• What are 3D-stacked caches?
• Multi-queue cache algorithm
• Implementation
• Adaptive Multi-Queue
• Conclusion
Introduction
• Main goal of the paper: a multi-queue cache replacement policy that removes
dead cache lines and maintains isolation between cores.
• 3D-integration technology
– 3D integration to combat the "memory wall"
• Memory wall: the growing disparity in speed between the CPU and memory outside the CPU chip.
3D-stacked cache
• Why do we need a 3D-stacked cache?
• Build a 3D last-level cache (LLC) before full 3D core architectures
• The LLC is made up of DRAM
• Each cache set is organized into multiple queues
3D-stacked cache (cont.)
• DRAM – the machine must refresh the data periodically
• SRAM – data remains in the RAM as long as there is power
• Set-dueling approach
– Dynamically adapt the sizes of the queues
– Decide how to advance lines between queues
3D cache organizations
Figure (d) is the cost-effective and efficient organization: 8 times more efficient than the baseline in Figure (a).
Physical organization of 3D Caches
• The access time for any row of an SRAM array is the same
• Reading a DRAM row destroys its contents (a destructive read)
• Precharge policy
• Open-page policy
Basic Multi-queue Algorithm
• Each queue entry has a 'u' (used) bit, set to zero on insertion.
• A hit on the line sets the 'u' bit to 1.
• If the 'u' bit is still zero when the line reaches the queue head, the line is evicted.
• In a multi-core system, each core has its own FIFO queue.
Cache behavior 1
• Some cache lines are inserted into the cache but never referenced again.
• Under LRU, this unused data lingers until its turn comes to be released.
• With the queueing system, such lines are evicted quickly.
Cache behavior 2
• LRU and multi-queue behave differently with respect to temporal locality
• LRU: eviction happens late, even if the line is unused
• Multi-queue: eviction happens as soon as the line leaves the first queue.
Cache behavior 3
• Isolation and protection between the different cores.
• In the figure to the left, Core 1 has a high access rate with no reuse, causing many misses for Core 0.
Implementation Issues
• Processors: clock-based pseudo-LRU replacement policy
• Overhead: one 'u' bit per entry, a single counter for the current clock position
• One extra counter per queue to track the queue head
• 4 cores
Implementation Issues
• Row buffer (RB): single-cycle line shuffling between queues
– A multi-queue design complication: moving data between queues
– With the DRAM array: supply source and destination column addresses to the mux and demux
– The power to manipulate data in the RB is less than in an SRAM-based cache (precharging bit lines, powering the sense amplifiers, …)
– The RB is not efficient in an SRAM cache; it might slow down access patterns
Configuration
- Baseline system: quad-core processor
- Shared, inclusive 4MB, 16-way cache
- Clock replacement policy
- Multiple prefetchers applying FIFO
Configuration
- DRAM with the same footprint as the 4MB SRAM: 32MB capacity (up to 8 banks)
- Set-associative (up to 128-way)
- Line sizes: up to 512 bytes
- The best configuration for 32MB DRAM: 4 banks, 64-way set-associative, 128-byte cache lines
- Queues (per core): Q = 8, S = 12, SLRU = 20
- Multi-programmed workloads of several memory-intensive programs (SPEC 2006)
- LLC metrics: MPKI (misses per thousand instructions) and IPC (instructions per cycle)
Evaluation
- For the multi-core performance simulations:
- Fast-forward each program 500 million instructions while warming the cache
- Then simulate until each program has committed 250 million instructions
- Statistics are collected up to that limit; each core continues executing, contending with the other cores for shared resources
- Throughput :
- Speedup :
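The formulas themselves were in the slide's figure and are not in the transcript; the standard definitions for multi-programmed workloads, which these slides appear to follow (an assumption, the paper may normalize differently), are:

```latex
\text{Throughput} = \sum_{i=1}^{n} \mathrm{IPC}^{\text{shared}}_{i},
\qquad
\text{Weighted speedup} = \sum_{i=1}^{n}
    \frac{\mathrm{IPC}^{\text{shared}}_{i}}{\mathrm{IPC}^{\text{alone}}_{i}},
\qquad
\text{Fair speedup} = \frac{n}{\sum_{i=1}^{n}
    \mathrm{IPC}^{\text{alone}}_{i} \,/\, \mathrm{IPC}^{\text{shared}}_{i}}
```

Here IPC^shared is a program's IPC when running alongside the other cores and IPC^alone is its IPC when running by itself; fair speedup is the harmonic mean of the per-program speedups.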
Performance
- Clock replacement
- Stacking additional cache -> a larger fraction of the working set stays on the chip
Performance
32MB 3D-stacked DRAM, four policies:
- Baseline clock replacement
- TADIP (thread-aware dynamic insertion policy)
- UCP (utility-based cache partitioning)
- Multi-queue (MQ) cache management
Performance
- MQ: 23.6% more performance than UCP
- UCP performs better on some workloads: it adapts to dynamic changes in per-core memory requirements
- Inclusion:
- TADIP performs poorly because of the inclusion property the LLC enforces
- MQ avoids the problem: 64-way set-associative, Q = 8 entries (no quick eviction from the queue)
- Without a second-level queue, the speedup drops to 15.4%.
Performance
4 cores:
- Core 0 (MIX01): 58.6% of LLC hits in the first-level queue, 35.8% in the shared second-level queue, 5.6% in the clock-managed region.
- Core 1 (MIX01): all hits in the first-level queue…
ADAPTIVE MULTI-QUEUE (AMQ)
• For some workloads, UCP can achieve higher IPC throughput than the MQ approach
• Reasons:
– UCP dynamically partitions the cache to reduce overall misses
– MQ uses statically partitioned queues, which may sometimes result in:
• Over-provisioned queues for some cores: dead lines stay longer than they should
• Under-provisioned queues for some cores: early eviction of lines that will be referenced in the near future
• Solution: an adaptive MQ (AMQ) that uses dynamic partitioning
Adaptive Multi-Queue (AMQ)
• AMQ dynamically adjusts queue sizes based on the needs of each core.
• Instead of allowing arbitrary queue sizes, authors restrict the queues to only a few choices (simpler approach).
• But still need a method to choose among these few choices!
Multi-Set Dueling
• All possible unique queue-size configurations for an n-core system: |Q|^n × |S|
– Given:
• |Q| possible selections for the size of each of the first-level queues, and
• |S| selections for the second-level queue
• Finding the best parameters in such a potentially large configuration space may be daunting.
• To tackle this problem, the authors propose a simple generalization of the set-dueling principle.
Set-Dueling Principle –(DIP)
• Proposed for the Dynamic Insertion Policy (DIP)
• Objective: adaptively choose the better of two different policies
• The idea: dedicate a small (but statistically significant) number of cache sets where the sets follow fixed policies.
Set-Dueling Principle – (DIP)
• Process:
– A few leader sets always manage their lines using a fixed policy P0, and a few other leader sets always use policy P1.
– Policy selection counter (PSEL):
• is decremented when misses occur in leader sets following P0
• is incremented when misses occur in leader sets following P1
• estimates which policy causes more misses based on the observed behavior of these sampled leader sets.
– The remaining follower sets simply use the policy that should result in fewer misses, as indicated by the PSEL counter.
Set-Dueling Principle in a Multi-Core Context – (TADIP)
• In a multi-core context, each individual core may wish to follow a different policy.
• TADIP, the multi-core extension of DIP, introduced per-core leader sets with per-core PSEL counters.
Figure: Multi-Core, two-policy-per-core selection
Set-Dueling Principle in a Multi-Core Context – (TADIP)
• Each set is annotated with a policy vector <ρc0, ρc1, ρc2, ρc3>, where ci represents Core i and ρci indicates the policy followed by Core i for this set.
• In each group of leader sets there is one leader set per policy, per core.
• Example:
– The first leader set always applies policy P0 to Core 0
– The second leader set always uses P1 for Core 0
– The remaining cores (Core 1 through Core 3) do not use a fixed policy and simply follow the policy specified by their respective PSEL counters.
Figure: a group of 8 leader sets (Leader Sets 1–8), two per core for Cores 0–3.
Set-Dueling Principle in a Multi-Core Context – (TADIP)
• A miss in a set where Core 0 is forced to always follow P0 decrements PSEL0.
• Misses in sets where Core 0 is forced to always follow P1 increment PSEL0.
• For all remaining sets, including the leader sets of other cores, cache decisions involving Core 0 use the policy f0 chosen by PSEL0.
• The leader-set structure is symmetric for all remaining cores.
• Each core can choose the policy that works best for it, but the determination of what is "best" accounts for the policy selections of the other cores.
Set-Dueling – (AMQ)
• The set-dueling approaches for both DIP and TADIP assume that each core has only one of two policies to choose from.
• The selection of a queue size in the MQ approach is effectively a "policy" decision.
• For |Q| > 2, the authors use a multi-set dueling approach.
Multi-Set-Dueling – (AMQ)
• Consider the case Q = {Qa, Qb, Qc, Qd} shown in the figure.
• Example: Core 0's first-level queue:
– For the first leader set, Core 0 always uses a first-level queue of size Qa.
– For the second set, Core 0 always uses size Qb.
• Misses in the first leader set decrement the counter PSELab0.
• Misses in the second leader set increment the counter.
– The third set follows the policy φab0, which sets Core 0's queue size (in this set) to Qa or Qb based on PSELab0.
• φ indicates a partial follower (partial because the sizes Qc and Qd are not considered).
• A miss in the set following φab0 decrements a "meta-policy" counter MPSEL0.
Figure: Sets 1 and 2 are leader sets, Set 3 a partial follower; Sets 4 and 5 are leader sets, Set 6 a partial follower.
Multi-Set-Dueling – (AMQ)
• The next three sets (Sets 4, 5, 6) are similar to the first three, except that:
– one always sets Core 0's first-level queue size to Qc,
– the next to Qd,
– and the third to the better of these two (φcd0).
– A miss in the set following policy φcd0 increments MPSEL0.
• Finally, all other follower sets set Core 0's first-level queue size according to policy f0.
• f0 is determined by the results of PSELab0, PSELcd0, and MPSEL0.
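The tournament among PSELab0, PSELcd0, and MPSEL0 described above can be sketched as follows; the sign conventions mirror the decrement/increment rules on these slides (misses under a pair's first option decrement its counter), and the function name is illustrative:

```python
def choose_queue_size(psel_ab, psel_cd, mpsel, sizes=('Qa', 'Qb', 'Qc', 'Qd')):
    """Sketch of AMQ's tournament for one core's first-level queue size.
    A negative counter means its pair's first option missed more, so the
    second option wins; likewise, a negative meta-counter MPSEL means the
    a/b partial follower missed more, so the c/d winner is chosen."""
    qa, qb, qc, qd = sizes
    winner_ab = qb if psel_ab < 0 else qa   # negative: Qa missed more
    winner_cd = qd if psel_cd < 0 else qc   # negative: Qc missed more
    return winner_cd if mpsel < 0 else winner_ab
```

For instance, a negative PSELab0 with a non-negative MPSEL0 selects Qb.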
Multi-Set-Dueling – (AMQ)
• The figure shows how the next six sets repeat the process to determine the size of Core 1's first-level queue.
• This repeats again for Core 2 (not fully shown) and Core 3 (not shown at all).
• Likewise, another six leader sets (also not shown) determine the size of the shared second-level queue.
Multi-Set-Dueling – (AMQ)
• For the adaptive multi-queue (AMQ) approach:
– the first-level queues use one of four policies Q = {0s, 0m, 4, 8}, and
– the second-level queue selects one of four sizes S = {0, 4, 8, 12}.
• For the first-level queues, there are actually two choices for zero-sized queues:
– Policy 0s: the queue has no entries; incoming cache lines are inserted into the second-level queue.
– Policy 0m: similar, except that lines are inserted directly into the main clock-based region of the set.
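A minimal sketch of where an incoming line lands under each first-level policy (function name and return labels are illustrative):

```python
def insertion_target(first_level_policy):
    """Routing of an incoming line under AMQ's first-level queue
    policies Q = {0s, 0m, 4, 8}: 0s bypasses to the second-level queue,
    0m bypasses to the clock-managed main region, and a nonzero size
    means a normal first-level queue insertion."""
    if first_level_policy == '0s':
        return 'second_level_queue'
    if first_level_policy == '0m':
        return 'main_region'
    return 'first_level_queue'   # policy is a nonzero queue size (4 or 8)
```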
Stability Issues
• Problem: instability
– With so many policy and meta-policy decisions, the overall system can become unstable, rapidly switching through many different configurations without converging on a good one.
• Solution: slow down the rate of policy change.
• Two methods:
1. A simple time delay: independent of the actual PSEL values, once a policy change has been made, no other change may occur until at least δ cycles have elapsed, although the PSEL counters are still updated.
2. Hysteresis on the PSEL counters: when a PSEL counter goes negative, it must actually be decremented below −h before the change in policy is invoked; similarly, the counter must be incremented above +h to switch the policy back.
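The two stabilization methods can be combined in one counter wrapper; a sketch, with a zero-centered counter and illustrative values for δ and h:

```python
class StablePsel:
    """Sketch combining the two stabilization methods from the slides:
    a minimum time delay (delta cycles) between policy changes, plus
    hysteresis (the counter must cross +/- h before the policy flips).
    Here a high counter favors P0 and a low counter favors P1."""

    def __init__(self, delta=100_000, h=8):
        self.delta = delta
        self.h = h
        self.psel = 0
        self.policy = 'P0'
        self.last_change = -delta            # allow an immediate first change

    def update(self, now, miss_in_p0_leader):
        # the PSEL counter is always updated, even during the delay window
        self.psel += -1 if miss_in_p0_leader else 1
        if now - self.last_change < self.delta:
            return self.policy               # time delay: no change yet
        # hysteresis: require the counter to cross +/- h to flip the policy
        if self.policy == 'P0' and self.psel < -self.h:
            self.policy, self.last_change = 'P1', now
        elif self.policy == 'P1' and self.psel > self.h:
            self.policy, self.last_change = 'P0', now
        return self.policy
```

With a small h, a single stray miss no longer flips the policy; the counter must drift clearly past the threshold first.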
Occasional Lines with Long Reuse Distances
• Problem: early eviction
– The queue size must match up reasonably well with the actual reuse distances of each core.
– The multi-set dueling approach selects the queue size that most closely covers the majority of a core's cache-line reuse patterns.
– There may still be a significant number of lines whose reuse distances are simply longer than the queue size (early eviction).
• Solution:
– Include a pardon probability and statistical trace cache filtering.
• Method:
– If a line's u bit is set, it is always advanced to the next region of the cache.
– If the u bit is zero, then with some probability Ppardon the line is advanced anyway. Four possible pardon probabilities: P = {0, 1/32, 1/8, 1}. Multi-set dueling selects Ppardon on a per-core basis.
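The pardon mechanism can be sketched in a few lines (function name and return values are illustrative):

```python
import random

def advance_on_dequeue(u_bit, p_pardon, rng=random.random):
    """Sketch of the pardon mechanism for lines with long reuse
    distances: a line leaving the queue head with its u bit set always
    advances; with u == 0 it is pardoned (advanced anyway) with
    probability p_pardon, chosen per core from P = {0, 1/32, 1/8, 1}."""
    if u_bit:
        return 'advance'
    return 'advance' if rng() < p_pardon else 'evict'
```

With Ppardon = 0 the behavior reduces to the basic algorithm; with Ppardon = 1 every line advances regardless of its u bit.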
IPC throughput results for AMQ
• Additional performance gains beyond the simple stacking of DRAM as a cache:
– UCP: 18.9%
– MQ-static: 23.6%
– AMQ: 25.7% improvement
• The stability mechanisms provide a small net benefit, increasing the geometric-mean improvement to 27.6% over the baseline 32MB cache.
• Dynamic pardon-probability selection provides another small boost, bringing the performance gain to 29.1%.
• The AMQ technique achieves 75.6% of the performance difference between the 32MB and 64MB clock-managed DRAM caches.
AMQ’s adaptations over time
• The first four columns (dark shading on top) of each workload correspond to the per-core queues from Core 0 to Core 3.
• The 0s and 0m policies correspond to a zero-sized first-level queue with direct insertion into the second-level queue and the main region, respectively.
• The fifth column (light shading on top) is for the shared queue. While a few individual programs find a queue size and then stick with it throughout the traced execution, others clearly vary (i.e., adapt) over time.
Weighted speedup and Fair speedup
• Overall, AMQ performs well on these metrics.
• For the fair speedup metric, AMQ with stability and pardoning performs better than UCP and always better than clock, indicating that there are no significant concerns over fairness.
Conclusion
• In this paper, authors have revisited the simple application of using 3D integration to stack a DRAM layer as a large last-level cache.
• Authors have shown that the physical architecture of the DRAM and its peripheral logic, which traditionally increases the complexity of the memory interface, actually provides us with an opportunity to derive benefit from these otherwise inconvenient structures.
Questions?