TRANSCRIPT
Extending the Effectiveness of 3D-Stacked DRAM Caches with an Adaptive Multi-Queue
Policy (G. H. Loh).
Bismita Srichandan, Semra Kul, Rasanjalee Disanayaka
Outline
• Introduction
• What are 3D-stacked caches?
• Multi-queue cache algorithm
• Implementation
• Adaptive Multi-Queue
• Conclusion
Introduction
• Main goal of the paper: a multi-queue cache replacement policy that removes
dead cache lines and maintains isolation between cores.
• 3D-integration technology
– 3D integration to combat the "memory wall"
• Memory wall: the growing disparity in speed between the CPU and memory outside the CPU chip.
3D-stacked cache
• Why do we need a 3D-stacked cache?
• Build a 3D last-level cache (LLC) before full 3D core architectures
• The LLC is made up of DRAM
• Each cache set is organized into multiple queues
3D-stacked cache (cont.)
• DRAM – the machine must refresh the data periodically
• SRAM – data remains in the RAM as long as there is power
• Set-dueling approach
– Dynamically adapt the sizes of the queues
– Decide how to advance lines between queues
3D cache organizations
Figure (d) is the cost-effective and efficient organization: 8 times more efficient than the baseline in Figure (a).
Physical organization of 3D Caches
• The access time for any row of an SRAM array is the same
• Reading a DRAM row destroys its contents (a destructive read)
• Precharge policy
• Open-page policy
Basic Multi-queue Algorithm
• Each queue entry has a 'u' (used) bit, set to zero on insertion.
• A hit on the line sets the 'u' bit to 1.
• If the 'u' bit is still zero when the line reaches the queue head, the line is evicted.
• In a multi-core system, each core has its own FIFO queue.
Cache behavior 1
• Some cache lines are inserted into the cache but never referenced again.
• Under LRU, this unused data lingers until its turn comes to be released.
• With the queueing system, such lines are evicted quickly.
Cache behavior 2
• LRU and multi-queue behave differently with respect to temporal locality
• LRU: eviction happens late, even if the line is unused
• Multi-queue: eviction happens as soon as the line leaves the first queue.
Cache behavior 3
• Isolation and protection between the different cores.
• In the figure to the left, Core 1 has a high access rate with no reuse, causing many misses for Core 0.
Implementation Issues
• Processors: clock-based pseudo-LRU replacement policy
• Overhead: one 'u' bit per entry, a single counter for the current clock position
• One extra counter per queue to track the queue head
• 4 cores
Implementation Issues
• Row buffer (RB): single-cycle line shuffling between queues
– A multi-queue design complication: moving data between queues
– With the DRAM array: supply source and destination column addresses to the mux and demux
– The power to manipulate data in the RB is less than in an SRAM-based cache (precharging bit lines, powering the sense amplifiers, …)
– The RB is not efficient in an SRAM cache; it might slow down access patterns
Configuration
- Baseline system: quad-core processor
- Shared, inclusive 4MB, 16-way cache
- Clock replacement policy
- Multiple prefetchers applying FIFO
Configuration
- DRAM with the same footprint as the 4MB SRAM: 32MB capacity (up to 8 banks)
- Set-associative (up to 128-way)
- Line sizes: up to 512 bytes
- The best configuration for 32MB DRAM: 4 banks, 64-way set-associative, 128-byte cache lines
- Queues (per core): Q = 8, S = 12, SLRU = 20
- Multi-programmed workloads of several memory-intensive programs (SPEC 2006)
- LLC metrics: MPKI (misses per thousand instructions) and IPC (instructions per cycle)
Evaluation
- For the multi-core performance simulations:
- Fast-forward each program 500 million instructions while warming the cache
- Then simulate until each program has committed 250 million instructions
- Statistics are collected up to that limit; each core continues executing, contending with the other cores for shared resources
- Throughput :
- Speedup :
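The formulas themselves were in the slide's figure and are not in the transcript; the standard definitions for multi-programmed workloads, which these slides appear to follow (an assumption, the paper may normalize differently), are:

```latex
\text{Throughput} = \sum_{i=1}^{n} \mathrm{IPC}^{\text{shared}}_{i},
\qquad
\text{Weighted speedup} = \sum_{i=1}^{n}
    \frac{\mathrm{IPC}^{\text{shared}}_{i}}{\mathrm{IPC}^{\text{alone}}_{i}},
\qquad
\text{Fair speedup} = \frac{n}{\sum_{i=1}^{n}
    \mathrm{IPC}^{\text{alone}}_{i} \,/\, \mathrm{IPC}^{\text{shared}}_{i}}
```

Here IPC^shared is a program's IPC when running alongside the other cores and IPC^alone is its IPC when running by itself; fair speedup is the harmonic mean of the per-program speedups.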
Performance
- Clock replacement
- Stacking additional cache -> a larger fraction of the working set stays on the chip
Performance
32MB 3D-stacked DRAM, four policies:
- Baseline clock replacement
- TADIP (thread-aware dynamic insertion policy)
- UCP (utility-based cache partitioning)
- Multi-queue (MQ) cache management
Performance
- MQ: 23.6% more performance than UCP
- UCP performs better on some workloads: it adapts to dynamic changes in per-core memory requirements
- Inclusion:
- TADIP performs poorly because of the inclusion property the LLC enforces
- MQ avoids the problem: 64-way set-associative, Q = 8 entries (no quick eviction from the queue)
- Without a second-level queue, the speedup drops to 15.4%.
Performance
4 cores:
- Core 0 (MIX01): 58.6% of LLC hits in the first-level queue, 35.8% in the shared second-level queue, 5.6% in the clock-managed region.
- Core 1 (MIX01): all hits in the first-level queue…
ADAPTIVE MULTI-QUEUE (AMQ)
• For some workloads, UCP can achieve higher IPC throughput than the MQ approach
• Reasons:
– UCP dynamically partitions the cache to reduce overall misses
– MQ uses statically partitioned queues, which may sometimes result in:
• Over-provisioned queues for some cores: dead lines stay longer than they should
• Under-provisioned queues for some cores: early eviction of lines that will be referenced in the near future
• Solution: an adaptive MQ (AMQ) that uses dynamic partitioning
Adaptive Multi-Queue (AMQ)
• AMQ dynamically adjusts queue sizes based on the needs of each core.
• Instead of allowing arbitrary queue sizes, authors restrict the queues to only a few choices (simpler approach).
• But still need a method to choose among these few choices!
Multi-Set Dueling
• All possible unique queue-size configurations for an n-core system: |Q|^n × |S|
– Given:
• |Q| possible selections for the size of each of the first-level queues, and
• |S| selections for the second-level queue
• Finding the best parameters in such a potentially large configuration space may be daunting.
• To tackle this problem, the authors propose a simple generalization of the set-dueling principle.
Set-Dueling Principle –(DIP)
• Proposed for the Dynamic Insertion Policy (DIP)
• Objective: adaptively choose the better of two different policies
• The idea: dedicate a small (but statistically significant) number of cache sets where the sets follow fixed policies.
Set-Dueling Principle – (DIP)
• Process:
– A few leader sets always manage their lines using a fixed policy P0, and a few other leader sets always use policy P1.
– Policy selection counter (PSEL):
• is decremented when misses occur in leader sets following P0
• is incremented when misses occur in leader sets following P1
• estimates which policy causes more misses based on the observed behavior of these sampled leader sets.
– The remaining follower sets simply use the policy that should result in fewer misses, as indicated by the PSEL counter.
Set-Dueling Principle in a Multi-Core Context – (TADIP)
• In a multi-core context, each individual core may wish to follow a different policy.
• TADIP, the multi-core extension of DIP, introduced per-core leader sets with per-core PSEL counters.
Figure: Multi-Core, two-policy-per-core selection
Set-Dueling Principle in a Multi-Core Context – (TADIP)
• Each set is annotated with a policy vector <ρc0, ρc1, ρc2, ρc3>, where ci represents Core i and ρci indicates the policy followed by Core i for this set.
• In each group of leader sets there is one leader set per policy, per core.
• Example:
– The first leader set always applies policy P0 to Core 0
– The second leader set always uses P1 for Core 0
– The remaining cores (Core 1 through Core 3) do not use a fixed policy and simply follow the policy specified by their respective PSEL counters.
Figure: a group of 8 leader sets (Leader Sets 1–8), two per core for Cores 0–3.
Set-Dueling Principle in a Multi-Core Context – (TADIP)
• A miss in a set where Core 0 is forced to always follow P0 decrements PSEL0.
• Misses in sets where Core 0 is forced to always follow P1 increment PSEL0.
• For all remaining sets, including the leader sets of other cores, cache decisions involving Core 0 use the policy f0 chosen by PSEL0.
• The leader-set structure is symmetric for all remaining cores.
• Each core can choose the policy that works best for it, but the determination of what is "best" accounts for the policy selections of the other cores.
Set-Dueling – (AMQ)
• The set-dueling approaches for both DIP and TADIP assume that each core has only one of two policies to choose from.
• The selection of a queue size in the MQ approach is effectively a "policy" decision.
• For |Q| > 2, the authors use a multi-set dueling approach.
Multi-Set-Dueling – (AMQ)
• Consider the case Q = {Qa, Qb, Qc, Qd} shown in the figure.
• Example: Core 0's first-level queue:
– For the first leader set, Core 0 always uses a first-level queue of size Qa.
– For the second set, Core 0 always uses size Qb.
• Misses in the first leader set decrement the counter PSELab0.
• Misses in the second leader set increment the counter.
– The third set follows the policy φab0, which sets Core 0's queue size (in this set) to Qa or Qb based on PSELab0.
• φ indicates a partial follower (partial because the sizes Qc and Qd are not considered).
• A miss in the set following φab0 decrements a "meta-policy" counter MPSEL0.
Figure: Sets 1 and 2 are leader sets, Set 3 a partial follower; Sets 4 and 5 are leader sets, Set 6 a partial follower.
Multi-Set-Dueling – (AMQ)
• The next three sets (Sets 4, 5, 6) are similar to the first three, except that:
– one always sets Core 0's first-level queue size to Qc,
– the next to Qd,
– and the third to the better of these two (φcd0).
– A miss in the set following policy φcd0 increments MPSEL0.
• Finally, all other follower sets set Core 0's first-level queue size according to policy f0.
• f0 is determined by the results of PSELab0, PSELcd0, and MPSEL0.
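The tournament among PSELab0, PSELcd0, and MPSEL0 described above can be sketched as follows; the sign conventions mirror the decrement/increment rules on these slides (misses under a pair's first option decrement its counter), and the function name is illustrative:

```python
def choose_queue_size(psel_ab, psel_cd, mpsel, sizes=('Qa', 'Qb', 'Qc', 'Qd')):
    """Sketch of AMQ's tournament for one core's first-level queue size.
    A negative counter means its pair's first option missed more, so the
    second option wins; likewise, a negative meta-counter MPSEL means the
    a/b partial follower missed more, so the c/d winner is chosen."""
    qa, qb, qc, qd = sizes
    winner_ab = qb if psel_ab < 0 else qa   # negative: Qa missed more
    winner_cd = qd if psel_cd < 0 else qc   # negative: Qc missed more
    return winner_cd if mpsel < 0 else winner_ab
```

For instance, a negative PSELab0 with a non-negative MPSEL0 selects Qb.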
Multi-Set-Dueling – (AMQ)
• The figure shows how the next six sets repeat the process to determine the size of Core 1's first-level queue.
• This repeats again for Core 2 (not fully shown) and Core 3 (not shown at all).
• Likewise, another six leader sets (also not shown) determine the size of the shared second-level queue.
Multi-Set-Dueling – (AMQ)
• For the adaptive multi-queue (AMQ) approach:
– the first-level queues use one of four policies Q = {0s, 0m, 4, 8}, and
– the second-level queue selects one of four sizes S = {0, 4, 8, 12}.
• For the first-level queues, there are actually two choices for zero-sized queues:
– Policy 0s: the queue has no entries; incoming cache lines are inserted into the second-level queue.
– Policy 0m: similar, except that lines are inserted directly into the main clock-based region of the set.
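A minimal sketch of where an incoming line lands under each first-level policy (function name and return labels are illustrative):

```python
def insertion_target(first_level_policy):
    """Routing of an incoming line under AMQ's first-level queue
    policies Q = {0s, 0m, 4, 8}: 0s bypasses to the second-level queue,
    0m bypasses to the clock-managed main region, and a nonzero size
    means a normal first-level queue insertion."""
    if first_level_policy == '0s':
        return 'second_level_queue'
    if first_level_policy == '0m':
        return 'main_region'
    return 'first_level_queue'   # policy is a nonzero queue size (4 or 8)
```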
Stability Issues
• Problem: instability
– With so many policy and meta-policy decisions, the overall system can become unstable, rapidly switching through many different configurations without converging on a good one.
• Solution: slow down the rate of policy change.
• Two methods:
1. A simple time delay: independent of the actual PSEL values, once a policy change has been made, no other change may occur until at least δ cycles have elapsed, although the PSEL counters are still updated.
2. Hysteresis on the PSEL counters: when a PSEL counter goes negative, it must actually be decremented below −h before the change in policy is invoked; similarly, the counter must be incremented above +h to switch the policy back.
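The two stabilization methods can be combined in one counter wrapper; a sketch, with a zero-centered counter and illustrative values for δ and h:

```python
class StablePsel:
    """Sketch combining the two stabilization methods from the slides:
    a minimum time delay (delta cycles) between policy changes, plus
    hysteresis (the counter must cross +/- h before the policy flips).
    Here a high counter favors P0 and a low counter favors P1."""

    def __init__(self, delta=100_000, h=8):
        self.delta = delta
        self.h = h
        self.psel = 0
        self.policy = 'P0'
        self.last_change = -delta            # allow an immediate first change

    def update(self, now, miss_in_p0_leader):
        # the PSEL counter is always updated, even during the delay window
        self.psel += -1 if miss_in_p0_leader else 1
        if now - self.last_change < self.delta:
            return self.policy               # time delay: no change yet
        # hysteresis: require the counter to cross +/- h to flip the policy
        if self.policy == 'P0' and self.psel < -self.h:
            self.policy, self.last_change = 'P1', now
        elif self.policy == 'P1' and self.psel > self.h:
            self.policy, self.last_change = 'P0', now
        return self.policy
```

With a small h, a single stray miss no longer flips the policy; the counter must drift clearly past the threshold first.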
Occasional Lines with Long Reuse Distances
• Problem: early eviction
– The queue size must match up reasonably well with the actual reuse distances of each core.
– The multi-set dueling approach selects the queue size that most closely covers the majority of a core's cache-line reuse patterns.
– There may still be a significant number of lines whose reuse distances are simply longer than the queue size (early eviction).
• Solution:
– Include a pardon probability and statistical trace cache filtering.
• Method:
– If a line's u bit is set, it is always advanced to the next region of the cache.
– If the u bit is zero, then with some probability Ppardon the line is advanced anyway. Four possible pardon probabilities: P = {0, 1/32, 1/8, 1}. Multi-set dueling selects Ppardon on a per-core basis.
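The pardon mechanism can be sketched in a few lines (function name and return values are illustrative):

```python
import random

def advance_on_dequeue(u_bit, p_pardon, rng=random.random):
    """Sketch of the pardon mechanism for lines with long reuse
    distances: a line leaving the queue head with its u bit set always
    advances; with u == 0 it is pardoned (advanced anyway) with
    probability p_pardon, chosen per core from P = {0, 1/32, 1/8, 1}."""
    if u_bit:
        return 'advance'
    return 'advance' if rng() < p_pardon else 'evict'
```

With Ppardon = 0 the behavior reduces to the basic algorithm; with Ppardon = 1 every line advances regardless of its u bit.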
IPC throughput results for AMQ
• Additional performance gains beyond the simple stacking of DRAM as a cache:
– UCP: 18.9%
– MQ-static: 23.6%
– AMQ: 25.7% improvement
• The stability mechanisms provide a small net benefit, increasing the geometric-mean improvement to 27.6% over the baseline 32MB cache.
• Dynamic pardon-probability selection provides another small boost, bringing the performance gain to 29.1%.
• The AMQ technique achieves 75.6% of the performance difference between the 32MB and 64MB clock-managed DRAM caches.
AMQ’s adaptations over time
• The first four columns (dark shading on top) of each workload correspond to the per-core queues from Core 0 to Core 3.
• The 0s and 0m policies correspond to a zero-sized first-level queue with direct insertion into the second-level queue and the main region, respectively.
• The fifth column (light shading on top) is for the shared queue. While a few individual programs find a queue size and then stick with it throughout the traced execution, others clearly vary (i.e., adapt) over time.
Weighted speedup and Fair speedup
• Overall, AMQ performs well on these metrics.
• For the fair speedup metric, AMQ with stability and pardoning performs better than UCP and always better than clock, indicating that there are no significant concerns over fairness.
Conclusion
• In this paper, authors have revisited the simple application of using 3D integration to stack a DRAM layer as a large last-level cache.
• Authors have shown that the physical architecture of the DRAM and its peripheral logic, which traditionally increases the complexity of the memory interface, actually provides us with an opportunity to derive benefit from these otherwise inconvenient structures.
Questions?