HPCA-16 2010
Laboratory for Computer Architecture 1/11/2010
Dimitris Kaseridis¹, Jeff Stuecheli¹·², Jian Chen¹ & Lizy K. John¹
¹University of Texas – Austin  ²IBM – Austin
Motivation
Datacenters
– Widely deployed
– Multiple cores/sockets available
– Hierarchical cost of communication
• Core-to-core, socket-to-socket, and board-to-board
Datacenter-like CMP multi-chip systems
Motivation
Virtualization is the norm
– Multiple single-thread workloads per system
– Decisions based on high-level scheduling algorithms
CMPs rely heavily on shared resources
– Destructive interference
– Unfairness
– Lack of QoS
Limiting optimization to a single chip yields suboptimal solutions – explore opportunities both within and beyond a single chip
Most important shared resources in CMPs
– Last-level cache → capacity limits
– Memory bandwidth → bandwidth limits
Capacity and bandwidth partitioning are promising means of resource management
Motivation
Previous work focuses on a single chip
– Trial-and-error
+ lower complexity  − less efficient  − slow to react
– Artificial intelligence
+ better performance  − black box, difficult to tune  − high cost for accurate schemes
– Predictive, evaluating multiple solutions
+ more accurate  − higher complexity  − high cost of a wrong decision (drastic changes to configurations)
Need for low-overhead, non-invasive monitoring that efficiently drives resource-management algorithms
Outline
Applications’ Profiling Mechanisms
– Cache capacity
– Memory bandwidth
Bandwidth-aware Resource Management Scheme
– Intra-chip allocation algorithm
– Inter-chip resource management
Evaluation
Applications’ Profiling Mechanisms
Overview: Resource Requirements Profiling
Based on Mattson’s stack-distance algorithm (MSA)
Non-invasive, predictive
– Parallel monitoring on each core, assuming each core is assigned the whole LLC
Cache misses for all partition assignments
– Monitor/predict cache misses
– Help estimate ideal cache-partition sizes
Memory bandwidth: two components
• Memory read traffic → cache fills
• Memory write traffic → dirty write-backs from cache to main memory
LLC Miss Profiling
Mattson stack algorithm (MSA)
– Originally proposed to simulate many cache sizes concurrently
– Based on the LRU inclusion property
– The structure is a true LRU cache
– The stack distance from MRU of each reference is recorded
– Misses can be calculated for any fraction of the ways
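The point of MSA is that one LRU pass over a trace yields the miss count for every possible way count at once. A minimal single-set software sketch (illustrative; not the paper’s hardware, which adds set sampling and partial tags):

```python
from collections import defaultdict

def msa_profile(trace, max_ways):
    """One pass over `trace` returns misses for every cache size
    from 1 to max_ways ways (single-set, true-LRU model)."""
    stack = []                     # true LRU stack, MRU at index 0
    hits = defaultdict(int)        # hits per stack distance
    for line in trace:
        if line in stack:
            # LRU inclusion property: a hit at distance d is a hit
            # in every cache with more than d ways
            hits[stack.index(line)] += 1
            stack.remove(line)
        stack.insert(0, line)
        if len(stack) > max_ways:
            stack.pop()
    # misses for a w-way cache = accesses not hitting at distance < w
    return [len(trace) - sum(hits[d] for d in range(w))
            for w in range(1, max_ways + 1)]
```

For example, `msa_profile(['A','B','A','C','B','A'], 3)` returns `[6, 5, 3]`: a one-way cache misses on every access, while three ways capture all three reuses.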
MSA-based Bandwidth Profiling
Read traffic
– Proportional to misses
– Derived from the LLC miss profile
Write traffic
– Cache evictions of dirty lines are sent back to memory
– With write-back caches, traffic depends on the assigned cache partition
– Hit to a dirty line:
• If the hit’s stack distance is larger than the assigned capacity, the line would already have been written back to main memory → traffic
• Otherwise it is a cache hit → no traffic
• Only one write-back per store should be counted
[Figure: monitoring mechanism]
MSA-based Bandwidth Profiling
Additions to the profiler
– Dirty bit: marks a dirty line
– Dirty stack distance (register): largest stack distance at which a dirty line has been accessed
– Dirty_Counter: dirty accesses for every LRU distance
Rules
– Track traffic for all cache allocations
– The dirty bit is reset when the line is evicted from the whole monitored cache
– Track the greatest stack distance at which each store is referenced before eviction
– Keep a counter (Dirty_Counter) of these maximal distances
Traffic estimation
– For a cache-size projection that uses W ways:
– Write traffic = sum of Dirty_Counter_i, i = [W : max_ways + 1]
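The dirty-line bookkeeping above can be sketched in software (an illustrative single-set model; the class and method names are mine, the real design uses set sampling and partial hashed tags, and the end-of-interval `flush` is a simplification):

```python
from collections import defaultdict

class MSABandwidthProfiler:
    """MSA miss profiler extended with dirty bits, per-line dirty
    stack-distance registers, and Dirty_Counter buckets."""

    def __init__(self, max_ways):
        self.max_ways = max_ways
        self.stack = []                  # true LRU order, MRU at index 0
        self.hit_ctr = defaultdict(int)  # hits per stack distance
        self.misses = 0
        self.dirty = {}                  # line -> dirty bit
        self.reg = {}                    # line -> largest dirty stack distance
        self.dirty_ctr = defaultdict(int)

    def access(self, line, is_store):
        if line in self.stack:
            d = self.stack.index(line)   # stack distance (0 = MRU)
            self.hit_ctr[d] += 1
            if self.dirty.get(line):
                # any partition smaller than d would already have
                # written this line back: remember the max distance
                self.reg[line] = max(self.reg[line], d)
            self.stack.remove(line)
        else:
            self.misses += 1
            if len(self.stack) == self.max_ways:
                victim = self.stack.pop()
                if self.dirty.pop(victim, False):
                    # eviction from the full monitored cache is a
                    # write-back for every projected partition size
                    self.reg.pop(victim)
                    self.dirty_ctr[self.max_ways] += 1
        self.stack.insert(0, line)
        if is_store and not self.dirty.get(line):
            self.dirty[line] = True      # new dirty epoch for this line
            self.reg[line] = 0

    def flush(self):
        # lines still resident and dirty: charge the largest distance
        # at which a smaller partition would have written them back
        for line in [l for l, d in self.dirty.items() if d]:
            self.dirty_ctr[self.reg[line]] += 1
        self.dirty.clear()

    def write_traffic(self, w):
        """Estimated write-backs if only w ways were assigned."""
        return sum(c for d, c in self.dirty_ctr.items() if d >= w)
```

Charging each dirty epoch at its maximal stack distance is what enforces "one write-back per store": summing the counters at distances ≥ W then counts each store at most once for a W-way projection.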
MSA-based Bandwidth Example
Profiling Examples: milc, calculix, gcc
Different behavior in write traffic
– milc: no fit; updates complex matrix structures
– calculix: cache blocking of matrix and dot-product operations; with the data contained in cache, only read traffic remains beyond the blocking size
– gcc: code generation; small caches are read-dominated due to data tables, bigger ones are write-dominated due to the code output
Accurate monitoring of memory-bandwidth use is important
Hardware MSA Implementation
The naïve implementation is prohibitive
– Fully associative
– Complete cache directory of the maximum cache size for every core on the CMP
Hardware overhead reduction
– Set sampling
– Partial hashed tags (XOR tree of tag bits)
– Maximum capacity assignable per core
Sensitivity analysis (details in paper)
– 1-in-32 set sampling
– 11-bit partial hashed tags
– 9/16 maximal capacity
– LRU and dirty-stack registers: 6 bits; hit and dirty counters: 32 bits
– Overall: 117 Kbit, 1.4% of an 8 MB LLC
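A quick sanity check of the quoted overhead, under one plausible reading (my assumption) that the 117 Kbit figure is per core and the 1.4% is the aggregate over the 8-core CMP relative to the 8 MB LLC:

```python
# Back-of-the-envelope check: 8 cores x 117 Kbit of profiler state
# versus an 8 MB last-level cache (data array only).
llc_bits = 8 * 1024 * 1024 * 8        # 8 MB LLC in bits
profiler_bits = 8 * 117 * 1024        # 117 Kbit per core, 8 cores
overhead = profiler_bits / llc_bits
print(f"{overhead:.1%}")              # prints 1.4%
```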
Resource Management Scheme
Overall Scheme
Two-level approach
– Intra-chip partitioning algorithm: assign LLC capacity on a single chip to minimize misses
– Inter-chip partitioning algorithm: use the LLC assignments and memory-bandwidth profiles to find a better workload placement over the whole system
Epochs of 100M instructions to re-evaluate and initiate migrations
Intra-chip Partitioning Algorithm
Based on Marginal Utility
Miss rate relative to capacity is non-linear and heavily workload-dependent
Dramatic miss-rate reduction as data structures become cache-contained
In practice
– Iteratively assign cache to the core that produces the most hits per unit of capacity
– O(n²) complexity
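The marginal-utility loop can be sketched as follows (illustrative; `partition_llc` and the `miss_curves` layout, where `miss_curves[c][w]` is core c’s predicted misses with w ways as produced by the MSA profilers, are my assumptions):

```python
def partition_llc(miss_curves, total_ways, min_ways=1):
    """Greedy marginal-utility partitioning: repeatedly grant one
    way to the core that gains the most hits from it."""
    ways = [min_ways] * len(miss_curves)        # start from the minimum
    for _ in range(total_ways - min_ways * len(miss_curves)):
        # marginal utility of one extra way = misses removed by it
        gains = [curve[w] - curve[w + 1] if w + 1 < len(curve) else 0
                 for curve, w in zip(miss_curves, ways)]
        winner = max(range(len(gains)), key=gains.__getitem__)
        ways[winner] += 1
    return ways
```

With a cache-friendly core (`[100, 50, 20, 10, 5]`) and a streaming one (`[100, 90, 85, 84, 83]`) sharing 4 ways, the loop gives the first core 3 ways and the second 1, matching the intuition that capacity should go where data structures become cache-contained.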
Inter-chip Partitioning Algorithm
Workload-to-chip assignments become suboptimal as each workload’s execution phase changes
Two greedy algorithms looking across multiple chips
– Cache capacity
– Memory bandwidth
Cache capacity
1. Estimate the ideal capacity assignment, assuming the whole cache belongs to the core
2. Find the worst-served core on each chip
3. Find the chips with the largest surplus of ways (ways not significantly contributing to miss reduction)
4. Greedily propose workload swaps between chips
5. Bound each swap with a threshold to keep migrations down
6. Perform the finally selected migrations
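One round of steps 2–5 might look like this (a rough sketch only; the data layout, threshold value, and return convention are my assumptions, not the paper’s):

```python
def inter_chip_swap(chips, threshold=2):
    """Propose one capacity-driven migration. Each chip is a dict
    core -> (assigned_ways, ideal_ways) from the intra-chip step."""
    # step 2: the most under-served core on each chip
    def shortfall(chip, core):
        assigned, ideal = chip[core]
        return ideal - assigned
    worst = [(max(shortfall(chip, c) for c in chip), i)
             for i, chip in enumerate(chips)]
    need, src = max(worst)
    # step 3: the chip with the most surplus ways
    def surplus(chip):
        return sum(max(a - i, 0) for a, i in chip.values())
    dst = max(range(len(chips)), key=lambda i: surplus(chips[i]))
    # steps 4-5: propose the swap only if the shortfall clears the
    # threshold, bounding the number of migrations
    if src != dst and need > threshold:
        return src, dst
    return None
```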
Bandwidth Algorithm Example
[Figure: four jobs, A = lbm, B = calculix, C = bwaves, D = zeusmp, placed across two chips]
Memory Bandwidth
– The algorithm finds combinations of cores with low and high bandwidth demands
– Migrate jobs from high- to low-bandwidth chips
– Migrated jobs should have similar partition sizes (within 10% bounds)
– Repeat until no chip is over-committed or no additional reduction is possible
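A greedy sketch of this bandwidth step (illustrative; the `(job, bw, ways)` layout, the `capacity` limit, and the round bound are my assumptions):

```python
def bandwidth_swaps(chips, capacity, bound=0.10):
    """Swap jobs between over- and under-committed chips.
    chips[i] is a list of (job, bandwidth, assigned_ways)."""
    swaps = []
    for _ in range(sum(map(len, chips))):      # bound the rounds
        load = [sum(bw for _, bw, _ in chip) for chip in chips]
        hot = max(range(len(chips)), key=load.__getitem__)
        cold = min(range(len(chips)), key=load.__getitem__)
        if load[hot] <= capacity or hot == cold:
            break                              # nothing over-committed
        # highest-demand job on the hot chip, lowest on the cold chip
        j_hot = max(chips[hot], key=lambda j: j[1])
        j_cold = min(chips[cold], key=lambda j: j[1])
        # only jobs with similar partitions may trade places, so the
        # swap does not disturb the capacity assignment
        if abs(j_hot[2] - j_cold[2]) > bound * j_hot[2]:
            break
        if j_hot[1] <= j_cold[1]:
            break                              # no additional reduction
        chips[hot].remove(j_hot); chips[cold].remove(j_cold)
        chips[hot].append(j_cold); chips[cold].append(j_hot)
        swaps.append((j_hot[0], cold, j_cold[0], hot))
    return swaps
```

On the slide’s example, with lbm and bwaves (high demand) sharing a chip and calculix and zeusmp (low demand) on another, the loop swaps lbm with zeusmp and stops once neither chip exceeds its bandwidth budget.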
Evaluation
Methodology
Workloads
– 64 cores: 8 chips of 8-core CMPs running mixes of the 29 SPEC CPU2006 workloads
– Which benchmark mix? ≈30 million possible mixes of 8 benchmarks
High level: Monte Carlo
– Compare the intra- and inter-chip algorithms to an equal-partitions assignment
• Show the algorithm works for many cases/configurations
• 1000 experiments
Detailed simulation
– Cycle-accurate, full-system: Simics + GEMS + CMP-DNUCA + profiling mechanisms + cache partitions
Comparison
– Utility-based Cache Partitioning (UCP+), modified for our DNUCA CMP
• Considers only last-level cache misses
• Uses marginal utility on a single chip to assign capacity
High Level: LLC Misses
25.7% miss reduction over simple equal partitions
Average 7.9% reduction over UCP+
Significant reductions with only 1.4% overhead for monitoring mechanisms, much of which UCP+ already requires
As the LLC grows, the surplus of ways grows, creating more migration opportunities for the inter-chip algorithm
[Figures: relative miss rate; relative reduction of BW-aware over UCP+]
High Level: Memory Bandwidth
UCP+’s reductions come from its miss-rate reduction: 19% over equal partitions
Average 18% reduction over UCP+ and 36% over equal partitions
Gains are larger with smaller caches, due to contention
As the number of chips increases, the inter-chip algorithm finds more opportunities
[Figures: relative bandwidth reduction; relative reduction of BW-aware over UCP+]
Full system case studies
Case 1
– 8.6% IPC improvement and 15.3% MPKI reduction
– Swap: Chip 4 {bwaves, mcf} ↔ Chip 7 {povray, calculix}
Case 2
– 8.5% IPC improvement and 11% MPKI reduction
– Chip 7 was over-committed in memory bandwidth
– Migrations: bwaves (Chip 7) ↔ zeusmp (Chip 2); gcc (Chip 7) ↔ gamess (Chip 6)
Conclusions
As the number of cores in a system increases, resource contention becomes a dominating factor
Memory bandwidth is a significant factor in system performance and should always be considered in memory-resource management
The bandwidth-aware scheme achieved an 18% reduction in memory bandwidth and 8% in miss rate over existing partitioning techniques, and more than 25% over no partitioning
The overall improvement can justify the proposed monitoring mechanisms’ cost of only 1.4% overhead, part of which predictive single-chip schemes already incur
Thank You
Questions?
Laboratory for Computer Architecture
The University of Texas at Austin
Backup Slides
Misses absolute and effective error
Bandwidth absolute and effective error
Overhead analysis