Multi-Core Architectures and Shared Resource Management
Lecture 1.2: Cache Management

Prof. Onur Mutlu
http://www.ece.cmu.edu/~omutlu
[email protected]
Bogazici University
June 7, 2013
Last Lecture

- Parallel Computer Architecture Basics
  - Amdahl’s Law
  - Bottlenecks in the Parallel Portion (3 fundamental ones)
- Why Multi-Core? Alternatives, shortcomings, tradeoffs
- Evolution of Symmetric Multi-Core Systems
  - Sun NiagaraX, IBM PowerX, Runahead Execution
- Asymmetric Multi-Core Systems
  - Accelerated Critical Sections
  - Bottleneck Identification and Scheduling
What Will You Learn?

- Tentative, aggressive schedule
  - Lecture 1: Why multi-core? Basics, alternatives, tradeoffs; symmetric versus asymmetric multi-core systems
  - Lecture 2: Shared cache design for multi-cores; interconnect design for multi-cores (if time permits)
  - Lecture 3: Data parallelism and GPUs; prefetcher design and management (if time permits)
Agenda for Today and Beyond

- Wrap up Asymmetric Multi-Core Systems
  - Handling Private Data Locality
  - Asymmetry Everywhere
- Cache design for multi-core systems
- Interconnect design for multi-core systems
- Prefetcher design for multi-core systems
- Data Parallelism and GPUs
Readings for Lecture June 6 (Lecture 1.1)

- Required – Symmetric and Asymmetric Multi-Core Systems
  - Mutlu et al., “Runahead Execution: An Alternative to Very Large Instruction Windows for Out-of-order Processors,” HPCA 2003, IEEE Micro 2003.
  - Suleman et al., “Accelerating Critical Section Execution with Asymmetric Multi-Core Architectures,” ASPLOS 2009, IEEE Micro 2010.
  - Suleman et al., “Data Marshaling for Multi-Core Architectures,” ISCA 2010, IEEE Micro 2011.
  - Joao et al., “Bottleneck Identification and Scheduling for Multithreaded Applications,” ASPLOS 2012.
  - Joao et al., “Utility-Based Acceleration of Multithreaded Applications on Asymmetric CMPs,” ISCA 2013.
- Recommended
  - Amdahl, “Validity of the single processor approach to achieving large scale computing capabilities,” AFIPS 1967.
  - Olukotun et al., “The Case for a Single-Chip Multiprocessor,” ASPLOS 1996.
  - Mutlu et al., “Techniques for Efficient Processing in Runahead Execution Engines,” ISCA 2005, IEEE Micro 2006.
Videos for Lecture June 6 (Lecture 1.1)

- Runahead Execution
  - http://www.youtube.com/watch?v=z8YpjqXQJIA&list=PL5PHm2jkkXmidJOd59REog9jDnPDTG6IJ&index=28
- Multiprocessors
  - Basics: http://www.youtube.com/watch?v=7ozCK_Mgxfk&list=PL5PHm2jkkXmidJOd59REog9jDnPDTG6IJ&index=31
  - Correctness and Coherence: http://www.youtube.com/watch?v=U-VZKMgItDM&list=PL5PHm2jkkXmidJOd59REog9jDnPDTG6IJ&index=32
  - Heterogeneous Multi-Core: http://www.youtube.com/watch?v=r6r2NJxj3kI&list=PL5PHm2jkkXmidJOd59REog9jDnPDTG6IJ&index=34
Readings for Lecture June 7 (Lecture 1.2)

- Required – Caches in Multi-Core
  - Qureshi et al., “A Case for MLP-Aware Cache Replacement,” ISCA 2006.
  - Seshadri et al., “The Evicted-Address Filter: A Unified Mechanism to Address both Cache Pollution and Thrashing,” PACT 2012.
  - Pekhimenko et al., “Base-Delta-Immediate Compression: Practical Data Compression for On-Chip Caches,” PACT 2012.
  - Pekhimenko et al., “Linearly Compressed Pages: A Main Memory Compression Framework with Low Complexity and Low Latency,” SAFARI Technical Report 2013.
- Recommended
  - Qureshi et al., “Utility-Based Cache Partitioning: A Low-Overhead, High-Performance, Runtime Mechanism to Partition Shared Caches,” MICRO 2006.
Videos for Lecture June 7 (Lecture 1.2)

- Cache basics: http://www.youtube.com/watch?v=TpMdBrM1hVc&list=PL5PHm2jkkXmidJOd59REog9jDnPDTG6IJ&index=23
- Advanced caches: http://www.youtube.com/watch?v=TboaFbjTd-E&list=PL5PHm2jkkXmidJOd59REog9jDnPDTG6IJ&index=24
Readings for Lecture June 10 (Lecture 1.3)

- Required – Interconnects in Multi-Core
  - Moscibroda and Mutlu, “A Case for Bufferless Routing in On-Chip Networks,” ISCA 2009.
  - Fallin et al., “CHIPPER: A Low-Complexity Bufferless Deflection Router,” HPCA 2011.
  - Fallin et al., “MinBD: Minimally-Buffered Deflection Routing for Energy-Efficient Interconnect,” NOCS 2012.
  - Das et al., “Application-Aware Prioritization Mechanisms for On-Chip Networks,” MICRO 2009.
  - Das et al., “Aergia: Exploiting Packet Latency Slack in On-Chip Networks,” ISCA 2010, IEEE Micro 2011.
- Recommended
  - Grot et al., “Preemptive Virtual Clock: A Flexible, Efficient, and Cost-effective QOS Scheme for Networks-on-Chip,” MICRO 2009.
  - Grot et al., “Kilo-NOC: A Heterogeneous Network-on-Chip Architecture for Scalability and Service Guarantees,” ISCA 2011, IEEE Micro 2012.
Videos for Lecture June 10 (Lecture 1.3)

- Interconnects
  - http://www.youtube.com/watch?v=6xEpbFVgnf8&list=PL5PHm2jkkXmidJOd59REog9jDnPDTG6IJ&index=33
- GPUs and SIMD processing
  - Vector/array processing basics: http://www.youtube.com/watch?v=f-XL4BNRoBA&list=PL5PHm2jkkXmidJOd59REog9jDnPDTG6IJ&index=15
  - GPUs versus other execution models: http://www.youtube.com/watch?v=dl5TZ4-oao0&list=PL5PHm2jkkXmidJOd59REog9jDnPDTG6IJ&index=19
  - GPUs in more detail: http://www.youtube.com/watch?v=vr5hbSkb1Eg&list=PL5PHm2jkkXmidJOd59REog9jDnPDTG6IJ&index=20
Readings for Prefetching

- Prefetching
  - Srinath et al., “Feedback Directed Prefetching: Improving the Performance and Bandwidth-Efficiency of Hardware Prefetchers,” HPCA 2007.
  - Ebrahimi et al., “Coordinated Control of Multiple Prefetchers in Multi-Core Systems,” MICRO 2009.
  - Ebrahimi et al., “Techniques for Bandwidth-Efficient Prefetching of Linked Data Structures in Hybrid Prefetching Systems,” HPCA 2009.
  - Ebrahimi et al., “Prefetch-Aware Shared Resource Management for Multi-Core Systems,” ISCA 2011.
  - Lee et al., “Prefetch-Aware DRAM Controllers,” MICRO 2008.
- Recommended
  - Lee et al., “Improving Memory Bank-Level Parallelism in the Presence of Prefetching,” MICRO 2009.
Videos for Prefetching

- Prefetching
  - http://www.youtube.com/watch?v=IIkIwiNNl0c&list=PL5PHm2jkkXmidJOd59REog9jDnPDTG6IJ&index=29
  - http://www.youtube.com/watch?v=yapQavK6LUk&list=PL5PHm2jkkXmidJOd59REog9jDnPDTG6IJ&index=30
Readings for GPUs and SIMD

- GPUs and SIMD processing
  - Narasiman et al., “Improving GPU Performance via Large Warps and Two-Level Warp Scheduling,” MICRO 2011.
  - Jog et al., “OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance,” ASPLOS 2013.
  - Jog et al., “Orchestrated Scheduling and Prefetching for GPGPUs,” ISCA 2013.
  - Lindholm et al., “NVIDIA Tesla: A Unified Graphics and Computing Architecture,” IEEE Micro 2008.
  - Fung et al., “Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow,” MICRO 2007.
Videos for GPUs and SIMD

- GPUs and SIMD processing
  - Vector/array processing basics: http://www.youtube.com/watch?v=f-XL4BNRoBA&list=PL5PHm2jkkXmidJOd59REog9jDnPDTG6IJ&index=15
  - GPUs versus other execution models: http://www.youtube.com/watch?v=dl5TZ4-oao0&list=PL5PHm2jkkXmidJOd59REog9jDnPDTG6IJ&index=19
  - GPUs in more detail: http://www.youtube.com/watch?v=vr5hbSkb1Eg&list=PL5PHm2jkkXmidJOd59REog9jDnPDTG6IJ&index=20
Online Lectures and More Information

- Online Computer Architecture Lectures
  - http://www.youtube.com/playlist?list=PL5PHm2jkkXmidJOd59REog9jDnPDTG6IJ
- Online Computer Architecture Courses
  - Intro: http://www.ece.cmu.edu/~ece447/s13/doku.php
  - Advanced: http://www.ece.cmu.edu/~ece740/f11/doku.php
  - Advanced: http://www.ece.cmu.edu/~ece742/doku.php
- Recent Research Papers
  - http://users.ece.cmu.edu/~omutlu/projects.htm
  - http://scholar.google.com/citations?user=7XyGUGkAAAAJ&hl=en
Asymmetric Multi-Core Systems Continued
Handling Private Data Locality: Data Marshaling

M. Aater Suleman, Onur Mutlu, Jose A. Joao, Khubaib, and Yale N. Patt,
"Data Marshaling for Multi-core Architectures"
Proceedings of the 37th International Symposium on Computer Architecture (ISCA), pages 441-450, Saint-Malo, France, June 2010.
Staged Execution Model (I)

- Goal: speed up a program by dividing it up into pieces
- Idea
  - Split program code into segments
  - Run each segment on the core best-suited to run it
  - Each core is assigned a work-queue, storing segments to be run
- Benefits
  - Accelerates segments/critical paths using specialized/heterogeneous cores
  - Exploits inter-segment parallelism
  - Improves locality of within-segment data
- Examples (see the work-queue sketch below)
  - Accelerated critical sections, bottleneck identification and scheduling
  - Producer-consumer pipeline parallelism
  - Task parallelism (Cilk, Intel TBB, Apple Grand Central Dispatch)
  - Special-purpose cores and functional units
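To make the work-queue structure concrete, here is a minimal C sketch of the model (my own illustration, not from the lecture: the three-stage split, the queue size, and the trivial per-segment "work" are all assumed). Each worker thread stands in for a core draining its own work-queue; completing a segment instance enqueues the next segment's instance on the next core's queue, as in the producer-consumer pipeline example.

```c
/* Staged Execution sketch: one work-queue per "core" (pthread). */
#include <pthread.h>
#include <stdio.h>

#define NSTAGES 3   /* segments S0, S1, S2 (assumed split) */
#define NITEMS  4   /* work items flowing through the stages */

typedef struct { int id, value; } item_t;

typedef struct {            /* bounded work-queue, one per core */
    item_t buf[NITEMS];
    int head, tail;         /* monotonically increasing indices */
    pthread_mutex_t lock;
    pthread_cond_t nonempty;
} queue_t;

static queue_t q[NSTAGES];

static void enqueue(queue_t *w, item_t it) {
    pthread_mutex_lock(&w->lock);
    w->buf[w->tail++ % NITEMS] = it;
    pthread_cond_signal(&w->nonempty);
    pthread_mutex_unlock(&w->lock);
}

static item_t dequeue(queue_t *w) {
    pthread_mutex_lock(&w->lock);
    while (w->head == w->tail)
        pthread_cond_wait(&w->nonempty, &w->lock);
    item_t it = w->buf[w->head++ % NITEMS];
    pthread_mutex_unlock(&w->lock);
    return it;
}

static void *core(void *arg) {      /* each core runs one segment type */
    int stage = *(int *)arg;
    for (int n = 0; n < NITEMS; n++) {
        item_t it = dequeue(&q[stage]);
        it.value += 1;                          /* the segment's work */
        if (stage + 1 < NSTAGES)
            enqueue(&q[stage + 1], it);         /* spawn next segment */
        else
            printf("item %d finished, value %d\n", it.id, it.value);
    }
    return NULL;
}

int main(void) {
    pthread_t t[NSTAGES];
    int id[NSTAGES];
    for (int s = 0; s < NSTAGES; s++) {
        pthread_mutex_init(&q[s].lock, NULL);
        pthread_cond_init(&q[s].nonempty, NULL);
        id[s] = s;
        pthread_create(&t[s], NULL, core, &id[s]);
    }
    for (int i = 0; i < NITEMS; i++)            /* seed S0 instances */
        enqueue(&q[0], (item_t){ i, 0 });
    for (int s = 0; s < NSTAGES; s++)
        pthread_join(t[s], NULL);
    return 0;
}
```

In a real Staged Execution system each worker would be pinned to a specific (possibly heterogeneous) core; the point here is only the queue-per-core structure and segment spawning.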
Staged Execution Model (II)

[Figure: example code before splitting: LOAD X; STORE Y / LOAD Y; …; STORE Z / LOAD Z; …]
Staged Execution Model (III)

Split code into segments:

[Figure: the same code split into Segment S0 (LOAD X; STORE Y), Segment S1 (LOAD Y; …; STORE Z), and Segment S2 (LOAD Z; …)]
Staged Execution Model (IV)

[Figure: Cores 0, 1, and 2, each with a work-queue holding instances of S0, S1, and S2, respectively]
Staged Execution Model: Segment Spawning

[Figure: S0 on Core 0 (LOAD X; STORE Y) spawns S1 on Core 1 (LOAD Y; …; STORE Z), which spawns S2 on Core 2 (LOAD Z; …)]
Staged Execution Model: Two Examples

- Accelerated Critical Sections [Suleman et al., ASPLOS 2009]
  - Idea: Ship critical sections to a large core in an asymmetric CMP (sketched in code below)
    - Segment 0: Non-critical section
    - Segment 1: Critical section
  - Benefit: Faster execution of the critical section, reduced serialization, improved lock and shared data locality
- Producer-Consumer Pipeline Parallelism
  - Idea: Split a loop iteration into multiple “pipeline stages,” where one stage consumes data produced by the previous stage → each stage runs on a different core
    - Segment N: Stage N
  - Benefit: Stage-level parallelism, better locality → faster execution
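The following is a minimal software sketch of the critical-section-shipping idea (my own illustration; the actual mechanism of Suleman et al. uses CSCALL/CSRET instructions and a large core in an asymmetric CMP, whereas here an ordinary "server" thread plays the large core). Worker code hands each critical section to the server instead of taking the lock itself, so all critical sections serialize in one place and the lock and shared data stay hot in one cache.

```c
/* ACS sketch: "ship" critical sections to one server thread. */
#include <pthread.h>
#include <stdio.h>

typedef struct req {
    void (*fn)(void *);          /* body of the critical section */
    void *arg;
    int done;
    pthread_mutex_t m;
    pthread_cond_t c;
    struct req *next;
} req_t;

static req_t *pending;           /* request queue (LIFO for brevity) */
static pthread_mutex_t qm = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t qc = PTHREAD_COND_INITIALIZER;

static void cscall(void (*fn)(void *), void *arg) {   /* CSCALL analog */
    req_t r = { .fn = fn, .arg = arg, .done = 0, .next = NULL };
    pthread_mutex_init(&r.m, NULL);
    pthread_cond_init(&r.c, NULL);
    pthread_mutex_lock(&qm);
    r.next = pending; pending = &r;
    pthread_cond_signal(&qc);
    pthread_mutex_unlock(&qm);
    pthread_mutex_lock(&r.m);                         /* wait for CSRET */
    while (!r.done) pthread_cond_wait(&r.c, &r.m);
    pthread_mutex_unlock(&r.m);
}

static void *large_core(void *unused) {
    for (;;) {
        pthread_mutex_lock(&qm);
        while (!pending) pthread_cond_wait(&qc, &qm);
        req_t *r = pending; pending = r->next;
        pthread_mutex_unlock(&qm);
        r->fn(r->arg);            /* critical section runs here, so the
                                     shared data stays in this core's cache */
        pthread_mutex_lock(&r->m);
        r->done = 1;              /* CSRET analog: wake the requester */
        pthread_cond_signal(&r->c);
        pthread_mutex_unlock(&r->m);
    }
    return unused;
}

static long counter;                              /* shared data */
static void bump(void *a) { counter += *(long *)a; }

int main(void) {
    pthread_t server;
    pthread_create(&server, NULL, large_core, NULL);
    long one = 1;
    for (int i = 0; i < 1000; i++)
        cscall(bump, &one);      /* instead of lock(); counter++; unlock(); */
    printf("counter = %ld\n", counter);
    return 0;
}
```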
Problem: Locality of Inter-segment Data

[Figure: S0 on Core 0 produces Y, which must be transferred to Core 1, so S1's LOAD Y incurs a cache miss; S1 produces Z, which must be transferred to Core 2, so S2's LOAD Z incurs a cache miss as well]
Problem: Locality of Inter-segment Data

- Accelerated Critical Sections [Suleman et al., ASPLOS 2009]
  - Idea: Ship critical sections to a large core in an ACMP
  - Problem: The critical section incurs a cache miss when it touches data produced in the non-critical section (i.e., thread-private data)
- Producer-Consumer Pipeline Parallelism
  - Idea: Split a loop iteration into multiple “pipeline stages” → each stage runs on a different core
  - Problem: A stage incurs a cache miss when it touches data produced by the previous stage
- Performance of Staged Execution is limited by inter-segment cache misses
What if We Eliminated All Inter-segment Misses?
Terminology

- Inter-segment data: Cache block written by one segment and consumed by the next segment
- Generator instruction: The last instruction to write to an inter-segment cache block in a segment

[Figure: in the running example, Y and Z are the inter-segment data transferred between Cores 0, 1, and 2; the STORE Y in S0 and the STORE Z in S1 are the generator instructions]
Key Observation and Idea

- Observation: The set of generator instructions is stable over execution time and across input sets
- Idea (see the marshal-buffer sketch below):
  - Identify the generator instructions
  - Record cache blocks produced by generator instructions
  - Proactively send such cache blocks to the next segment’s core before initiating the next segment
- Suleman et al., “Data Marshaling for Multi-Core Architectures,” ISCA 2010, IEEE Micro Top Picks 2011.
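As a concrete sketch of the bookkeeping this implies, the fragment below models the per-core marshal buffer in C (my own simplification: only the 16-entry size comes from the DM Support/Cost slide later in this lecture; 64-byte cache blocks and everything else are assumed). A generator-prefixed store records its cache-block address; the marshal operation drains the recorded blocks toward the next segment's core.

```c
#include <stdint.h>
#include <stdio.h>

#define MB_ENTRIES 16            /* 16 entries suffice per the slides */
#define BLOCK_BITS 6             /* 64-byte cache blocks (assumed) */

typedef struct {                 /* per-core marshal buffer (sketch) */
    uintptr_t block[MB_ENTRIES];
    int n;
} marshal_buf_t;

/* Called for every store executed with a generator prefix. */
static void record_generator(marshal_buf_t *mb, uintptr_t addr) {
    uintptr_t blk = addr >> BLOCK_BITS;
    for (int i = 0; i < mb->n; i++)
        if (mb->block[i] == blk) return;      /* block already recorded */
    if (mb->n < MB_ENTRIES) mb->block[mb->n++] = blk;
}

/* Called when the marshal operation executes: push the recorded
 * blocks toward the next segment's core (here just printed). */
static void marshal(marshal_buf_t *mb, int next_core) {
    for (int i = 0; i < mb->n; i++)
        printf("push block 0x%lx -> core %d's cache\n",
               (unsigned long)(mb->block[i] << BLOCK_BITS), next_core);
    mb->n = 0;
}

int main(void) {
    marshal_buf_t mb = { .n = 0 };
    record_generator(&mb, 0x1040);   /* G: STORE Y */
    record_generator(&mb, 0x1044);   /* same block: deduplicated */
    marshal(&mb, 1);                 /* MARSHAL C1 */
    return 0;
}
```

The 16-entry bound matches the cost estimate given on the DM Support/Cost slide (96 bytes per core).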
Data Marshaling

- Compiler/Profiler:
  1. Identify generator instructions
  2. Insert marshal instructions
- Hardware:
  1. Record generator-produced addresses
  2. Marshal recorded blocks to the next core
- Output: a binary containing generator prefixes and marshal instructions
Profiling Algorithm

[Figure: profiling identifies Y and Z as inter-segment data and marks the STORE Y in S0 and the STORE Z in S1 as generator instructions]
Marshal Instructions

[Figure: the example code with generator prefixes and marshal instructions inserted: S0 ends with "G: STORE Y; MARSHAL C1", S1 ends with "G: STORE Z; MARSHAL C2", and S2 begins at "0x5: LOAD Z"]

- The marshal instruction encodes when to send (when MARSHAL executes) and where to send (the next segment's core, e.g., C1)
DM Support/Cost

- Profiler/Compiler: Generators, marshal instructions
- ISA: Generator prefix, marshal instructions
- Library/Hardware: Bind the next segment ID to a physical core
- Hardware
  - Marshal Buffer
    - Stores physical addresses of cache blocks to be marshaled
    - 16 entries are enough for almost all workloads → 96 bytes per core
  - Ability to execute generator prefixes and marshal instructions
  - Ability to push data to another cache
DM: Advantages, Disadvantages

- Advantages
  - Timely data transfer: Pushes data to the core before it is needed
  - Can marshal any arbitrary sequence of lines: Identifies generators, not patterns
  - Low hardware cost: The profiler marks generators; no need for hardware to find them
- Disadvantages
  - Requires profiler and ISA support
  - Not always accurate (the generator set is conservative): Pollution at the remote core, wasted bandwidth on the interconnect
    - Not a large problem, as the number of inter-segment blocks is small
Accelerated Critical Sections with DM

[Figure: the small core runs "LOAD X; G: STORE Y; CSCALL", recording Addr Y in its Marshal Buffer; Data Y is pushed from the small core's L2 cache into the large core's L2 cache, so the critical section's LOAD Y on the large core is a cache hit; the critical section ends with "G: STORE Z; CSRET"]
Accelerated Critical Sections: Methodology

- Workloads: 12 critical-section-intensive applications
  - Data mining kernels, sorting, database, web, networking
  - Different training and simulation input sets
- Multi-core x86 simulator
  - 1 large and 28 small cores
  - Aggressive stream prefetcher employed at each core
- Details:
  - Large core: 2 GHz, out-of-order, 128-entry ROB, 4-wide, 12-stage
  - Small core: 2 GHz, in-order, 2-wide, 5-stage
  - Private 32 KB L1, private 256 KB L2, 8 MB shared L3
  - On-chip interconnect: Bi-directional ring, 5-cycle hop latency
DM on Accelerated Critical Sections: Results

[Figure: speedup of DM and Ideal over ACS (y-axis: % speedup, 0 to 140, with two bars clipped at 168 and 170) across is, pagemine, puzzle, qsort, tsp, maze, nqueen, sqlite, iplookup, mysql-1, mysql-2, webcache, and hmean. DM achieves an 8.7% average (hmean) speedup over ACS, close to the Ideal case.]
Pipeline Parallelism

[Figure: Core 0 runs S0 ("LOAD X; G: STORE Y; MARSHAL C1"), recording Addr Y in its Marshal Buffer; Data Y is pushed from Core 0's L2 cache into Core 1's L2 cache, so S1's LOAD Y ("LOAD Y; …; G: STORE Z; MARSHAL C2") is a cache hit]
Pipeline Parallelism: Methodology

- Workloads: 9 applications with pipeline parallelism
  - Financial, compression, multimedia, encoding/decoding
  - Different training and simulation input sets
- Multi-core x86 simulator
  - 32-core CMP: 2 GHz, in-order, 2-wide, 5-stage
  - Aggressive stream prefetcher employed at each core
  - Private 32 KB L1, private 256 KB L2, 8 MB shared L3
  - On-chip interconnect: Bi-directional ring, 5-cycle hop latency
DM on Pipeline Parallelism: Results

[Figure: speedup of DM and Ideal over the baseline (y-axis: % speedup, 0 to 160) across black, compress, dedupD, dedupE, ferret, image, mtwist, rank, sign, and hmean. DM achieves a 16% average (hmean) speedup, close to the Ideal case.]
DM Coverage, Accuracy, Timeliness

- High coverage of inter-segment misses in a timely manner
- Medium accuracy does not impact performance
  - Only 5.0 and 6.8 cache blocks are marshaled for the average segment

[Figure: coverage, accuracy, and timeliness percentages (0 to 100) for the ACS and Pipeline workloads]
Scaling Results

- DM's performance improvement increases with
  - More cores
  - Higher interconnect latency
  - Larger private L2 caches
- Why? Inter-segment data misses become a larger bottleneck:
  - More cores → more communication
  - Higher latency → longer stalls due to communication
  - Larger L2 caches → communication misses remain
Other Applications of Data Marshaling

- Can be applied to other Staged Execution models
  - Task parallelism models (Cilk, Intel TBB, Apple Grand Central Dispatch)
  - Special-purpose remote functional units
  - Computation spreading [Chakraborty et al., ASPLOS’06]
  - Thread motion/migration [e.g., Rangan et al., ISCA’09]
- Can be an enabler for more aggressive SE models
  - Lowers the cost of data migration, an important overhead in remote execution of code segments
  - Remote execution of finer-grained tasks becomes more feasible → finer-grained parallelization in multi-cores
Data Marshaling Summary

- Inter-segment data transfers between cores limit the benefit of promising Staged Execution (SE) models
- Data Marshaling is a hardware/software cooperative solution: detect inter-segment data generator instructions and push their data to the next segment’s core
  - Significantly reduces cache misses for inter-segment data
  - Low-cost, high-coverage, and timely for arbitrary address sequences
  - Achieves most of the potential of eliminating such misses
- Applicable to several existing Staged Execution models
  - Accelerated Critical Sections: 9% performance benefit
  - Pipeline Parallelism: 16% performance benefit
- Can enable new models → very fine-grained remote execution
A Case for Asymmetry Everywhere

Onur Mutlu,
"Asymmetry Everywhere (with Automatic Resource Management)"
CRA Workshop on Advancing Computer Architecture Research: Popular Parallel Programming, San Diego, CA, February 2010.
Position paper
The Setting

- Hardware resources are shared among many threads/apps in a many-core based system
  - Cores, caches, interconnects, memory, disks, power, lifetime, …
- Management of these resources is a very difficult task
  - When optimizing parallel/multiprogrammed workloads
  - Threads interact unpredictably/unfairly in shared resources
- Power/energy is arguably the most valuable shared resource
  - Main limiter to efficiency and performance
Shield the Programmer from Shared Resources

- Writing even sequential software is hard enough
  - Optimizing code for a complex shared-resource parallel system will be a nightmare for most programmers
- The programmer should not worry about (hardware) resource management
  - What should be executed where with what resources
- Future cloud computer architectures should be designed to
  - Minimize programmer effort to optimize (parallel) programs
  - Maximize the runtime system’s effectiveness in automatic shared resource management
Shared Resource Management: Goals

- Future many-core systems should manage power and performance automatically across threads/applications
- Minimize energy/power consumption
- While satisfying performance/SLA requirements
  - Provide predictability and Quality of Service
- Minimize programmer effort
  - In creating optimized parallel programs
- Asymmetry and configurability in system resources are essential to achieve these goals
Asymmetry Enables Customization

- Symmetric: One size fits all
  - Energy and performance suboptimal for different phase behaviors
- Asymmetric: Enables tradeoffs and customization
  - Processing requirements vary across applications and phases
  - Execute code on best-fit resources (minimal energy, adequate performance)

[Figure: a symmetric chip made of identical cores (C) versus an asymmetric chip mixing cores of different sizes (C1, C2, C3, and clusters of C4/C5 cores)]
Thought Experiment: Asymmetry Everywhere

- Design each hardware resource with asymmetric, (re-)configurable, partitionable components
  - Different power/performance/reliability characteristics
  - To fit different computation/access/communication patterns

[Figure: a chip in which every resource is asymmetric: asymmetric & configurable cores and accelerators (high-power, high-perf.); an asymmetric & partitionable interconnect; asymmetric & partitionable memory hierarchies, power/performance optimized for each access pattern; asymmetric main memories built from different technologies with different power characteristics]
Thought Experiment: Asymmetry Everywhere

- Design the runtime system (HW & SW) to automatically choose the best-fit components for each workload/phase
  - Satisfy performance/SLA requirements with minimal energy
  - Dynamically stitch together the “best-fit” chip for each phase

[Figure: same asymmetric-chip diagram as above]
Thought Experiment: Asymmetry Everywhere

- Morph software components to match asymmetric HW components
  - Multiple versions for different resource characteristics

[Figure: same asymmetric-chip diagram as above]
Many Research and Design Questions

- How to design asymmetric components?
  - Fixed, partitionable, reconfigurable components?
  - What types of asymmetry? Access patterns, technologies?
- What monitoring to perform cooperatively in HW/SW?
  - Automatically discover phase/task requirements
- How to design the feedback/control loop between components and the runtime system software?
- How to design the runtime to automatically manage resources?
  - Track task behavior, pick “best-fit” components for the entire workload
Exploiting Asymmetry: Simple Examples

- Execute critical/serial sections on high-power, high-performance cores/resources [Suleman+ ASPLOS’09, ISCA’10, Top Picks’10’11, Joao+ ASPLOS’12]
- The programmer can write less optimized, but more likely correct, programs

[Figure: same asymmetric-chip diagram as above]
Exploiting Asymmetry: Simple Examples

- Execute streaming “memory phases” on streaming-optimized cores and memory hierarchies
- More efficient and higher performance than a general-purpose hierarchy

[Figure: same asymmetric-chip diagram as above]
Exploiting Asymmetry: Simple Examples

- Partition memory controller and on-chip network bandwidth asymmetrically among threads [Kim+ HPCA 2010, MICRO 2010, Top Picks 2011] [Nychis+ HotNets 2010] [Das+ MICRO 2009, ISCA 2010, Top Picks 2011]
- Higher performance and energy-efficiency than symmetric/free-for-all sharing

[Figure: same asymmetric-chip diagram as above]
Exploiting Asymmetry: Simple Examples

- Have multiple different memory scheduling policies; apply them to different sets of threads based on thread behavior [Kim+ MICRO 2010, Top Picks 2011] [Ausavarungnirun+ ISCA 2012]
- Higher performance and fairness than a homogeneous policy

[Figure: same asymmetric-chip diagram as above]
Exploiting Asymmetry: Simple Examples

- Build main memory with different technologies with different characteristics (energy, latency, wear, bandwidth) [Meza+ IEEE CAL’12]
- Map pages/applications to the best-fit memory resource
- Higher performance and energy-efficiency than single-level memory

[Figure: a CPU with both a DRAM controller and a PCM controller. DRAM: fast and durable, but small, leaky, volatile, and high-cost. Phase Change Memory (or Tech. X): large, non-volatile, and low-cost, but slow, wears out, and has high active energy.]
Vector Machine Organization (CRAY-1)

- CRAY-1
- Russell, “The CRAY-1 computer system,” CACM 1978.
- Scalar and vector modes
- 8 64-element vector registers
- 64 bits per element
- 16 memory banks
- 8 64-bit scalar registers
- 8 24-bit address registers
Research in Asymmetric Multi-Core

- How to design asymmetric cores
  - Static
  - Dynamic
    - Can you fuse in-order cores easily to build an OoO core?
  - How to create asymmetry
- How to divide the program to best take advantage of asymmetry?
  - Explicit vs. transparent
- How to match arbitrary program phases to the best-fitting core?
- How to minimize code/data migration overhead?
- How to satisfy shared resource requirements of different cores?
Related Ongoing/Future Work

- Dynamically asymmetric cores
- Memory system design for asymmetric cores
- Asymmetric memory systems
  - Phase Change Memory (or Technology X) + DRAM
  - Hierarchies optimized for different access patterns
- Asymmetric on-chip interconnects
  - Interconnects optimized for different application requirements
- Asymmetric resource management algorithms
  - E.g., network congestion control
- Interaction of multiprogrammed multithreaded workloads
Summary: Asymmetric Design

- Applications and phases have varying performance requirements
- Designs are evaluated on multiple metrics/constraints: energy, performance, reliability, fairness, …
- A one-size-fits-all design cannot satisfy all requirements and metrics: it cannot get the best of all worlds
- Asymmetry in design enables tradeoffs: it can get the best of all worlds
  - Asymmetry in core microarchitecture → Accelerated Critical Sections, BIS, DM → good parallel performance + good serialized performance
  - Asymmetry in memory scheduling → Thread Cluster Memory Scheduling → good throughput + good fairness
  - Asymmetry in main memory → best of multiple technologies
- Simple asymmetric designs can be effective and low-cost
Further Reading
Shared Resource Design for Multi-Core Systems
The Multi-Core System: A Shared Resource View

[Figure: the multi-core system drawn as a collection of shared resources, down to shared storage]
Resource Sharing Concept

- Idea: Instead of dedicating a hardware resource to a hardware context, allow multiple contexts to use it
  - Example resources: functional units, pipeline, caches, buses, memory
- Why?
  - + Resource sharing improves utilization/efficiency → throughput
    - When a resource is left idle by one thread, another thread can use it; no need to replicate shared data
  - + Reduces communication latency
    - For example, shared data is kept in the same cache in SMT processors
  - + Compatible with the shared memory model
Resource Sharing Disadvantages

- Resource sharing results in contention for resources
  - When the resource is not idle, another thread cannot use it
  - If space is occupied by one thread, another thread needs to re-occupy it
- − Sometimes reduces each or some threads’ performance
  - Thread performance can be worse than when the thread runs alone
- − Eliminates performance isolation → inconsistent performance across runs
  - Thread performance depends on co-executing threads
- − Uncontrolled (free-for-all) sharing degrades QoS
  - Causes unfairness, starvation

We need to efficiently and fairly utilize shared resources.
Need for QoS and Shared Resource Mgmt.

- Why is unpredictable performance (or lack of QoS) bad?
- Makes the programmer’s life difficult
  - An optimized program can get low performance (and performance varies widely depending on co-runners)
- Causes discomfort to the user
  - An important program can starve
  - Examples abound in shared software resources
- Makes system management difficult
  - How do we enforce a Service Level Agreement when the sharing of hardware resources is uncontrollable?
Resource Sharing vs. Partitioning

- Sharing improves throughput
  - Better utilization of space
- Partitioning provides performance isolation (predictable performance)
  - Dedicated space
- Can we get the benefits of both?
- Idea: Design shared resources such that they are efficiently utilized, controllable, and partitionable (see the way-partitioning sketch below)
  - No wasted resources + QoS mechanisms for threads
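One concrete form such a design can take is way-partitioning of a shared set-associative cache, sketched below in C (my own illustration in the spirit of utility-based cache partitioning, not any specific paper's mechanism): a thread's miss may only evict blocks from the ways assigned to that thread, so each thread keeps a dedicated share while the data arrays themselves remain shared.

```c
#include <stdio.h>

#define WAYS 8

/* One set of a shared cache, as seen by one thread (sketch). */
typedef struct {
    int thread_may_use[WAYS];  /* 1 if the way is in this thread's partition */
    unsigned lru_stamp[WAYS];  /* smaller = least recently used */
} set_view_t;

/* Victim selection: LRU among only the ways this thread owns, so one
 * thread can never shrink another thread's allocation. */
static int pick_victim(const set_view_t *s) {
    int victim = -1;
    for (int w = 0; w < WAYS; w++)
        if (s->thread_may_use[w] &&
            (victim < 0 || s->lru_stamp[w] < s->lru_stamp[victim]))
            victim = w;
    return victim;   /* -1 would mean the thread was given no ways */
}

int main(void) {
    /* The thread owns ways 0-3; way 2 is the oldest among them. */
    set_view_t s = {
        .thread_may_use = { 1, 1, 1, 1, 0, 0, 0, 0 },
        .lru_stamp      = { 9, 7, 3, 8, 1, 2, 4, 5 },
    };
    printf("evict way %d\n", pick_victim(&s));   /* prints: evict way 2 */
    return 0;
}
```

Utility-based cache partitioning (Qureshi and Patt, MICRO 2006, cited later in this lecture) chooses how many ways each thread gets by monitoring how much each thread's miss rate would improve with additional ways.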
Shared Hardware Resources

- Memory subsystem (in both MT and CMP)
  - Non-private caches
  - Interconnects
  - Memory controllers, buses, banks
- I/O subsystem (in both MT and CMP)
  - I/O, DMA controllers
  - Ethernet controllers
- Processor (in MT)
  - Pipeline resources
  - L1 caches
Multi-core Issues in Caching

- How does the cache hierarchy change in a multi-core system?
- Private cache: The cache belongs to one core (a shared block can be in multiple caches)
- Shared cache: The cache is shared by multiple cores

[Figure: two organizations of four cores (CORE 0 to CORE 3) in front of the DRAM memory controller: a private L2 cache per core, versus a single L2 cache shared by all four cores]
Shared Caches Between Cores

- Advantages:
  - High effective capacity
  - Dynamic partitioning of available cache space
    - No fragmentation due to static partitioning
  - Easier to maintain coherence (a cache block is in a single location)
  - Shared data and locks do not ping-pong between caches
- Disadvantages
  - Slower access
  - Cores incur conflict misses due to other cores’ accesses
    - Misses due to inter-core interference
    - Some cores can destroy the hit rate of other cores
  - Guaranteeing a minimum level of service (or fairness) to each core is harder (how much space, how much bandwidth?)
Shared Caches: How to Share?

- Free-for-all sharing
  - Placement/replacement policies are the same as in a single-core system (usually LRU or pseudo-LRU)
  - Not thread/application aware
  - An incoming block evicts a block regardless of which threads the blocks belong to
- Problems
  - Inefficient utilization of the cache: LRU is not the best policy
  - A cache-unfriendly application can destroy the performance of a cache-friendly application
  - Not all applications benefit equally from the same amount of cache: free-for-all might prioritize those that do not benefit
  - Reduced performance, reduced fairness
Controlled Cache Sharing

- Utility-based cache partitioning
  - Qureshi and Patt, “Utility-Based Cache Partitioning: A Low-Overhead, High-Performance, Runtime Mechanism to Partition Shared Caches,” MICRO 2006.
  - Suh et al., “A New Memory Monitoring Scheme for Memory-Aware Scheduling and Partitioning,” HPCA 2002.
- Fair cache partitioning
  - Kim et al., “Fair Cache Sharing and Partitioning in a Chip Multiprocessor Architecture,” PACT 2004.
- Shared/private mixed cache mechanisms
  - Qureshi, “Adaptive Spill-Receive for Robust High-Performance Caching in CMPs,” HPCA 2009.
  - Hardavellas et al., “Reactive NUCA: Near-Optimal Block Placement and Replication in Distributed Caches,” ISCA 2009.
Efficient Cache Utilization

- Qureshi et al., “A Case for MLP-Aware Cache Replacement,” ISCA 2006.
- Seshadri et al., “The Evicted-Address Filter: A Unified Mechanism to Address both Cache Pollution and Thrashing,” PACT 2012.
- Pekhimenko et al., “Base-Delta-Immediate Compression: Practical Data Compression for On-Chip Caches,” PACT 2012.
- Pekhimenko et al., “Linearly Compressed Pages: A Main Memory Compression Framework with Low Complexity and Low Latency,” SAFARI Technical Report 2013.
MLP-Aware Cache Replacement

Moinuddin K. Qureshi, Daniel N. Lynch, Onur Mutlu, and Yale N. Patt,
"A Case for MLP-Aware Cache Replacement"
Proceedings of the 33rd International Symposium on Computer Architecture (ISCA), pages 167-177, Boston, MA, June 2006.
Memory Level Parallelism (MLP)

- Memory Level Parallelism (MLP) means generating and servicing multiple memory accesses in parallel [Glew’98]
- Several techniques improve MLP (e.g., out-of-order execution, runahead execution)
- MLP varies: some misses are isolated and some are parallel

How does this affect cache replacement?

[Figure: timeline showing an isolated miss A, followed by misses B and C serviced in parallel]
Traditional Cache Replacement Policies

- Traditional cache replacement policies try to reduce miss count
- Implicit assumption: Reducing miss count reduces memory-related stall time
- Misses with varying cost/MLP break this assumption!
  - Eliminating an isolated miss helps performance more than eliminating a parallel miss
  - Eliminating a higher-latency miss could help performance more than eliminating a lower-latency miss
An Example

- For a fully associative cache containing 4 blocks
- Misses to blocks P1, P2, P3, P4 can be parallel; misses to blocks S1, S2, and S3 are isolated
- Two replacement algorithms:
  1. Minimize miss count (Belady’s OPT)
  2. Reduce isolated misses (MLP-Aware)

[Figure: the access stream used in the example, mixing the parallel-miss blocks P1 through P4 with the isolated-miss blocks S1, S2, S3]
Fewest Misses = Best Performance?

[Figure: simulating the access stream on the 4-block fully associative cache. Belady's OPT replacement minimizes misses (Misses = 4) but every miss stalls the processor (Stalls = 4). MLP-aware replacement takes more misses (Misses = 6), but they overlap as parallel misses, leaving fewer stall episodes (Stalls = 2) and saving cycles. The fewest misses do not yield the best performance.]
81
Motivation
q MLP varies. Some misses are more costly than others
q MLP-aware replacement can improve performance by reducing costly misses
Outline
q Introduction
q MLP-Aware Cache Replacement § Model for Computing Cost § Repeatability of Cost § A Cost-Sensitive Replacement Policy
q Practical Hybrid Replacement § Tournament Selection § Dynamic Set Sampling § Sampling Based Adaptive Replacement
q Summary
Computing MLP-Based Cost
q Cost of a miss is the number of cycles the miss stalls the processor
q Easy to compute for an isolated miss
q Divide each stall cycle equally among all parallel misses
[Figure: timeline of misses A, B, and C; while a miss is outstanding alone it accrues cost 1 per cycle, and while two misses overlap each accrues ½ per cycle.]
A First-Order Model
q Miss Status Holding Register (MSHR) tracks all in-flight misses
q Add a field mlp-cost to each MSHR entry
q Every cycle, for each demand entry in the MSHR:
    mlp-cost += (1/N), where N = number of demand misses in the MSHR
Machine Configuration
q Processor § aggressive, out-of-order, 128-entry instruction window
q L2 Cache § 1MB, 16-way, LRU replacement, 32 entry MSHR
q Memory § 400 cycle bank access, 32 banks
q Bus § Roundtrip delay of 11 bus cycles (44 processor cycles)
Distribution of MLP-Based Cost
Cost varies. Does it repeat for a given cache block?
[Figure: distribution of MLP-based cost; x-axis: MLP-based cost, y-axis: % of all L2 misses.]
Repeatability of Cost
q An isolated miss can be a parallel miss next time
q Can current cost be used to estimate future cost?
q Let δ = difference in cost for successive misses to a block
  § Small δ ⇒ cost repeats
  § Large δ ⇒ cost varies significantly
Repeatability of Cost
[Figure: per-benchmark breakdown of δ into three buckets: δ < 60, 59 < δ < 120, and δ > 120.]
q In general δ is small ⇒ repeatable cost
q When δ is large (e.g., parser, mgrid) ⇒ performance loss
The Framework
[Block diagram: PROCESSOR (ICACHE, DCACHE) → MSHR → L2 CACHE → MEMORY. Cost Calculation Logic (CCL) computes the mlp-based cost from the MSHR and feeds the Cost-Aware Replacement Engine (CARE) in the L2 cache.]
Quantization of Cost: the computed mlp-based cost is quantized to a 3-bit value
Design of the MLP-Aware Replacement Policy
q LRU considers only recency and no cost
    Victim-LRU = min { Recency(i) }
q Decisions based only on cost and no recency hurt performance: the cache stores useless high-cost blocks
q A Linear (LIN) function considers both recency and cost
    Victim-LIN = min { Recency(i) + S*cost(i) }
    S = significance of cost; Recency(i) = position in LRU stack; cost(i) = quantized cost
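A sketch of victim selection under LIN, assuming recency is encoded so that the LRU block has the smallest value (matching the min formulation above); the struct and parameter names are illustrative:

```cpp
#include <vector>

// Illustrative per-way replacement state.
struct BlockState {
    int recency;  // position in LRU stack; smallest value = LRU block
    int cost;     // quantized 3-bit MLP-based cost (0..7)
};

// Victim-LIN = argmin_i { Recency(i) + S*cost(i) }.
// S weighs cost against recency; S = 0 degenerates to plain LRU.
// Assumes a non-empty set.
int selectVictimLIN(const std::vector<BlockState>& ways, int S) {
    int victim = 0;
    int best = ways[0].recency + S * ways[0].cost;
    for (int i = 1; i < static_cast<int>(ways.size()); ++i) {
        int score = ways[i].recency + S * ways[i].cost;
        if (score < best) { best = score; victim = i; }
    }
    return victim;  // way index of the block to evict
}
```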
Results for the LIN policy
Performance loss for parser and mgrid is due to large δ.
Effect of the LIN Policy on Cost
[Figure: three benchmark cases. Misses +4% with IPC +4%; misses +30% with IPC −33%; misses −11% with IPC +22% — miss count alone does not determine performance.]
Outline
q Introduction
q MLP-Aware Cache Replacement § Model for Computing Cost § Repeatability of Cost § A Cost-Sensitive Replacement Policy
q Practical Hybrid Replacement § Tournament Selection § Dynamic Set Sampling § Sampling Based Adaptive Replacement
q Summary
Tournament Selection (TSEL) of Replacement Policies for a Single Set
[Figure: two auxiliary tag directories, ATD-LIN and ATD-LRU, both track set A and update a saturating counter (SCTR) that steers the main tag directory (MTD) for set A.]

| ATD-LIN | ATD-LRU | Saturating Counter (SCTR) |
|---------|---------|----------------------------|
| HIT     | HIT     | Unchanged                  |
| MISS    | MISS    | Unchanged                  |
| HIT     | MISS    | += cost of miss in ATD-LRU |
| MISS    | HIT     | −= cost of miss in ATD-LIN |

If the MSB of SCTR is 1, the MTD uses LIN; else the MTD uses LRU.
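A sketch of the SCTR update under these rules; the counter width and midpoint initialization are assumptions (the slide does not specify them):

```cpp
#include <algorithm>

// Saturating counter steering the main tag directory (MTD).
// A 10-bit unsigned counter starting at the midpoint is assumed;
// "MSB of SCTR is 1" then corresponds to sctr >= 512.
struct Tsel {
    static constexpr int kMax = 1023;
    int sctr = 512;

    // Update on each access, given hit/miss in both ATDs and the
    // MLP-based cost of the miss in whichever ATD missed.
    void onAtdOutcome(bool hitLin, bool hitLru,
                      int costLinMiss, int costLruMiss) {
        if (hitLin && !hitLru)      sctr = std::min(kMax, sctr + costLruMiss);
        else if (!hitLin && hitLru) sctr = std::max(0,    sctr - costLinMiss);
        // both hit or both miss: counter unchanged
    }

    bool mtdUsesLin() const { return sctr >= 512; }  // MSB set -> LIN
};
```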
Extending TSEL to All Sets
Implementing TSEL on a per-set basis is expensive. Counter overhead can be reduced by using a single global counter.
[Figure: ATD-LIN and ATD-LRU each track all sets A-H; one global SCTR decides the policy for all sets in the MTD.]
Dynamic Set Sampling
Not all sets are required to decide the best policy — keep ATD entries for only a few sets.
[Figure: ATD-LIN and ATD-LRU contain entries only for sets B, E, and G; the global SCTR still decides the policy for all sets in the MTD.]
Sets that have ATD entries (B, E, G) are called leader sets.
Dynamic Set Sampling
How many sets are required to choose the best-performing policy?
q Bounds using analytical model and simulation (in paper)
q DSS with 32 leader sets performs similar to having all sets
q A last-level cache typically contains 1000s of sets, thus ATD entries are required for only 2%-3% of the sets
ATD overhead can be reduced further by using the MTD to always simulate one of the policies (say LIN).
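One simple way to choose leader sets, assuming a fixed stride over the set index space (real designs may hash or randomize the choice; this is only a sketch with assumed geometry):

```cpp
// With kNumSets cache sets and kNumLeaders leader sets, sample the
// index space at a fixed stride. Only leader sets carry ATD-LIN /
// ATD-LRU entries and update the global SCTR; all follower sets
// simply obey the policy the SCTR currently selects.
constexpr int kNumSets    = 1024;  // assumed cache geometry
constexpr int kNumLeaders = 32;    // value suggested by the slide

bool isLeaderSet(int setIndex) {
    return setIndex % (kNumSets / kNumLeaders) == 0;
}
```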
Sampling Based Adaptive Replacement (SBAR)
Decide the policy only for follower sets.
[Figure: the MTD's leader sets (B, E, G) are compared against an ATD-LRU that holds only those leader sets; the SCTR picks the policy for the follower sets.]
The storage overhead of SBAR is less than 2KB (0.2% of the baseline 1MB cache).
Results for SBAR
SBAR adaptation to phases
[Figure: IPC of ammp over time; LIN is better in some phases, LRU in others.]
SBAR selects the best policy for each phase of ammp.
Outline
q Introduction
q MLP-Aware Cache Replacement § Model for Computing Cost § Repeatability of Cost § A Cost-Sensitive Replacement Policy
q Practical Hybrid Replacement § Tournament Selection § Dynamic Set Sampling § Sampling Based Adaptive Replacement
q Summary
Summary
q MLP varies. Some misses are more costly than others
q MLP-aware cache replacement can reduce costly misses
q Proposed a runtime mechanism to compute MLP-Based cost and the LIN policy for MLP-aware cache replacement
q SBAR allows dynamic selection between LIN and LRU with low hardware overhead
q Dynamic set sampling used in SBAR also enables other cache-related optimizations
The Evicted-Address Filter
Vivek Seshadri, Onur Mutlu, Michael A. Kozuch, and Todd C. Mowry, "The Evicted-Address Filter: A Unified Mechanism to Address Both Cache Pollution and Thrashing," Proceedings of the 21st ACM International Conference on Parallel Architectures and Compilation Techniques (PACT), Minneapolis, MN, September 2012. Slides (pptx)
Executive Summary
• Two problems degrade cache performance
  – Pollution and thrashing
  – Prior works don't address both problems concurrently
• Goal: A mechanism to address both problems
• EAF-Cache
  – Keep track of recently evicted block addresses in the EAF
  – Insert low-reuse blocks with low priority to mitigate pollution
  – Clear the EAF periodically to mitigate thrashing
  – Low-complexity implementation using a Bloom filter
• EAF-Cache outperforms five prior approaches that address pollution or thrashing
Cache Utilization is Important
[Figure: multiple cores share one last-level cache, with increasing contention for the cache and large latency to memory.]
Effective cache utilization is important
Reuse Behavior of Cache Blocks
Access Sequence: A B C A B C S T U V W X Y Z A B C
Different blocks have different reuse behavior: A, B, C are high-reuse blocks; S, T, U, V, W, X, Y, Z are low-reuse blocks.
An ideal cache would retain A, B, C.
Cache Pollution
Problem: Low-reuse blocks evict high-reuse blocks
[Figure: under the LRU policy, missed low-reuse blocks S, T, U are inserted at the MRU position and push the high-reuse blocks A, B, C down the LRU stack and out of the cache.]
Prior work: Predict the reuse behavior of missed blocks; insert low-reuse blocks at the LRU position.
[Figure: with LRU-position insertion, S, T, U enter at LRU while A, B, C remain near MRU.]
Cache Thrashing
Problem: High-reuse blocks evict each other
[Figure: under the LRU policy, a working set A ... K larger than the cache causes every MRU insertion to evict another block that is about to be reused.]
Prior work: Insert at the MRU position with a very low probability (Bimodal insertion policy; a sketch follows below), so a fraction of the working set stays in the cache.
[Figure: with bimodal insertion, most missed blocks enter at the LRU position and a fraction of A ... K stays resident.]
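A sketch of the bimodal insertion decision; the probability 1/64 is an assumed value in the spirit of the DIP line of work, not taken from this slide:

```cpp
#include <cstdlib>

// Bimodal Insertion Policy: insert at MRU only with a small
// probability epsilon; otherwise insert at LRU, so most low-reuse
// blocks are evicted quickly while a fraction of the working set
// stays resident. epsilon = 1/64 is an assumed value.
constexpr int kBipEpsilonInv = 64;

enum class InsertPos { Mru, Lru };

// std::rand stands in for a hardware pseudo-random source.
InsertPos bipInsertPosition() {
    return (std::rand() % kBipEpsilonInv == 0) ? InsertPos::Mru
                                               : InsertPos::Lru;
}
```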
Shortcomings of Prior Works
Prior works do not address both pollution and thrashing concurrently.
Prior work on cache pollution: no control over the number of blocks inserted with high priority into the cache.
Prior work on cache thrashing: no mechanism to distinguish high-reuse blocks from low-reuse blocks.
Our goal: Design a mechanism to address both pollution and thrashing concurrently.
Outline
• Background and Motivation
• Evicted-Address Filter
  – Reuse Prediction
  – Thrash Resistance
• Final Design
• Advantages and Disadvantages
• Evaluation
• Conclusion
Reuse Prediction
On a miss, should the missed block be predicted to have high reuse or low reuse?
Keeping track of the reuse behavior of every cache block in the system is impractical:
1. High storage overhead
2. Look-up latency
Prior Work on Reuse Prediction
Use program counter or memory region information:
1. Group blocks 2. Learn group behavior 3. Predict reuse
[Figure: blocks fetched by PC 1 (A, B) and PC 2 (S, T) form groups; a counter per group learns its reuse behavior and predicts reuse for new blocks from the same PC.]
Shortcomings: 1. Same group → same predicted reuse behavior 2. No control over the number of high-reuse blocks
Our Approach: Per-block Prediction
Use recency of eviction to predict reuse
[Figure: block A is accessed soon after its eviction → predict high reuse; block S is accessed a long time after its eviction → predict low reuse.]
Evicted-Address Filter (EAF)
[Figure: the EAF holds the addresses of recently evicted blocks. On an eviction, the evicted block's address is inserted into the EAF. On a miss, the missed block's address is tested — in EAF? Yes → high reuse, insert at MRU; No → low reuse, insert at LRU.]
Naïve Implementation: Full Address Tags
[Figure: the EAF stores full recently evicted addresses and is searched associatively on each miss.]
1. Large storage overhead
2. Associative lookups – high energy
But the EAF need not be 100% accurate.
Low-Cost Implementation: Bloom Filter
Implement the EAF using a Bloom filter: low storage overhead + low energy.
Since the EAF need not be 100% accurate, occasional false positives are acceptable.
Bloom Filter
A compact representation of a set:
1. Bit vector 2. Set of hash functions
[Figure: inserting X and Y sets the bits selected by hash functions H1 and H2 for each address. Testing an inserted element (X, Y) finds all its bits set (✓); testing W finds a zero bit (✗); testing Z can find all bits set even though Z was never inserted — a false positive. Removing one element may remove multiple addresses, so the only safe removal is clearing the whole filter.]
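A minimal Bloom-filter sketch matching the slide's insert/test/clear operations; the filter size and hash functions are illustrative:

```cpp
#include <bitset>
#include <cstdint>

// Tiny Bloom filter: insert and test set bits via two hash
// functions; the only safe "removal" is clearing everything.
class BloomFilter {
    static constexpr size_t kBits = 4096;   // assumed filter size
    std::bitset<kBits> bits_;

    static size_t h1(uint64_t a) { return (a * 0x9E3779B97F4A7C15ULL) % kBits; }
    static size_t h2(uint64_t a) { return ((a ^ (a >> 29)) * 0xBF58476D1CE4E5B9ULL) % kBits; }

public:
    void insert(uint64_t addr) { bits_.set(h1(addr)); bits_.set(h2(addr)); }
    // May return true for an address never inserted (false positive),
    // but never false for one that was inserted (no false negatives).
    bool test(uint64_t addr) const { return bits_[h1(addr)] && bits_[h2(addr)]; }
    void clear() { bits_.reset(); }
};
```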
EAF using a Bloom Filter
[Figure: on a cache eviction, insert the evicted block address into the Bloom filter; on a cache miss, test the missed block address. Unlike the naïve FIFO EAF, a Bloom filter cannot remove an individual address when it is found, so instead the filter is cleared when the number of inserted addresses reaches the EAF capacity.]
Bloom-filter EAF: 4x reduction in storage overhead, 1.47% compared to cache size.
Outline
• Background and Motivation
• Evicted-Address Filter
  – Reuse Prediction
  – Thrash Resistance
• Final Design
• Advantages and Disadvantages
• Evaluation
• Conclusion
Large Working Set: 2 Cases
1. Cache < working set < Cache + EAF
[Figure: the cache holds part of the working set and the EAF tracks the rest.]
2. Cache + EAF < working set
[Figure: the working set overflows both the cache and the EAF.]
Large Working Set: Case 1 (Cache < working set < Cache + EAF)
Sequence: A B C D E F G H I J K L A B C
[Figure: with a naïve EAF, every access in the sequence misses — blocks keep cycling through the cache and the EAF.]
Large Working Set: Case 1 (Cache < working set < Cache + EAF)
Sequence: A B C D E F G H I J K L A B C
[Figure: with the Bloom-filter EAF, tested addresses are not removed from the filter, so on the second pass some blocks are retained and hit (✓) while others are no longer present in the EAF and are inserted with low priority (✗).]
The Bloom-filter-based EAF mitigates thrashing.
Large Working Set: Case 2 (Cache + EAF < working set)
Problem: All blocks are predicted to have low reuse.
Use the Bimodal Insertion Policy for low-reuse blocks: insert a few of them at the MRU position.
This allows a fraction of the working set to stay in the cache.
Outline
• Background and Motivation
• Evicted-Address Filter
  – Reuse Prediction
  – Thrash Resistance
• Final Design
• Advantages and Disadvantages
• Evaluation
• Conclusion
EAF-Cache: Final Design
1. On a cache eviction: insert the evicted address into the Bloom filter and increment a counter.
2. On a cache miss: test whether the missed address is present in the filter — yes: insert at MRU; no: insert with BIP.
3. When the counter reaches its maximum: clear the filter and the counter.
(A sketch of this flow follows below.)
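A sketch of the three events, reusing the BloomFilter and bipInsertPosition sketches from earlier in this section; the EAF capacity value is an assumption:

```cpp
#include <cstdint>

// Assumes the BloomFilter class and InsertPos/bipInsertPosition()
// sketched earlier in this section.
struct EafCache {
    BloomFilter eaf;
    int counter = 0;
    // Assumed capacity: roughly one address per cache block.
    static constexpr int kEafCapacity = 32768;

    // 1. Cache eviction: remember the evicted block's address.
    // 3. Counter reaches max: clear the filter and the counter.
    void onEviction(uint64_t evictedAddr) {
        eaf.insert(evictedAddr);
        if (++counter >= kEafCapacity) {
            eaf.clear();
            counter = 0;
        }
    }

    // 2. Cache miss: present in the filter -> insert at MRU,
    //    otherwise insert with BIP.
    InsertPos onMiss(uint64_t missedAddr) const {
        return eaf.test(missedAddr) ? InsertPos::Mru : bipInsertPosition();
    }
};
```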
Outline
• Background and Motivation
• Evicted-Address Filter
  – Reuse Prediction
  – Thrash Resistance
• Final Design
• Advantages and Disadvantages
• Evaluation
• Conclusion
EAF: Advantages
[Figure: the cache augmented with a Bloom filter and counter, updated on cache evictions and queried on cache misses.]
1. Simple to implement
2. Easy to design and verify
3. Works with other techniques (replacement policy)
EAF: Disadvantage
[Figure: a block's first access misses and is inserted with low priority; only the second access, once the address is in the EAF, is treated as high reuse.]
Problem: For an LRU-friendly application, EAF incurs one additional miss for most blocks.
Dueling-EAF: set dueling between EAF and LRU.
Outline
• Background and Motivation
• Evicted-Address Filter
  – Reuse Prediction
  – Thrash Resistance
• Final Design
• Advantages and Disadvantages
• Evaluation
• Conclusion
Methodology
• Simulated system
  – In-order cores, single issue, 4 GHz
  – 32 KB L1 cache, 256 KB L2 cache (private)
  – Shared L3 cache (1MB to 16MB)
  – Memory: 150-cycle row hit, 400-cycle row conflict
• Benchmarks
  – SPEC 2000, SPEC 2006, TPC-C, 3 TPC-H queries, Apache
• Multi-programmed workloads
  – Varying memory intensity and cache sensitivity
• Metrics
  – 4 different metrics for performance and fairness
  – Weighted speedup presented here
Comparison with Prior Works Addressing Cache Pollution
Run-time Bypassing (RTB) – Johnson+ ISCA'97: memory-region-based reuse prediction
Single-usage Block Prediction (SU) – Piquet+ ACSAC'07 and Signature-based Hit Prediction (SHIP) – Wu+ MICRO'11: program-counter-based reuse prediction
Miss Classification Table (MCT) – Collins+ MICRO'99: tracks one most recently evicted block
Common shortcoming: no control over the number of blocks inserted with high priority ⇒ thrashing
Comparison with Prior Works Addressing Cache Thrashing
TA-DIP – Qureshi+ ISCA'07, Jaleel+ PACT'08, and TA-DRRIP – Jaleel+ ISCA'10: use set dueling to determine thrashing applications
Common shortcoming: no mechanism to filter low-reuse blocks ⇒ pollution
Results – Summary
[Figure: performance improvement over LRU for 1-, 2-, and 4-core systems, comparing TA-DIP, TA-DRRIP, RTB, MCT, SHIP, EAF, and D-EAF; EAF and D-EAF show the largest improvements.]
4-Core: Performance
[Figure: weighted speedup improvement over LRU across 135 4-core workloads (x-axis: workload number), with curves for LRU, SHIP, EAF, and D-EAF.]
Effect of Cache Size
[Figure: weighted speedup improvement over LRU for SHIP, EAF, and D-EAF as the shared cache grows — 1MB to 8MB for 2 cores and 2MB to 16MB for 4 cores.]
Effect of EAF Size
[Figure: weighted speedup improvement over LRU versus the ratio of EAF addresses to cache blocks (0 to 1.6) for 1, 2, and 4 cores; performance peaks when the EAF tracks roughly as many addresses as the cache has blocks.]
Other Results in Paper
• EAF is orthogonal to replacement policies – LRU, RRIP [Jaleel+ ISCA'10]
• The performance improvement of EAF increases with increasing memory latency
• EAF performs well on four different metrics – performance and fairness
• Alternative EAF-based designs perform comparably – Segmented EAF, Decoupled-clear EAF
Conclusion
• Cache utilization is critical for system performance
  – Pollution and thrashing degrade cache performance
  – Prior works don't address both problems concurrently
• EAF-Cache
  – Keeps track of recently evicted block addresses in the EAF
  – Inserts low-reuse blocks with low priority to mitigate pollution
  – Clears the EAF periodically and uses BIP to mitigate thrashing
  – Low-complexity implementation using a Bloom filter
• EAF-Cache outperforms five prior approaches that address pollution or thrashing
Base-Delta-Immediate Cache Compression
Gennady Pekhimenko, Vivek Seshadri, Onur Mutlu, Philip B. Gibbons, Michael A. Kozuch, and Todd C. Mowry, "Base-Delta-Immediate Compression: Practical Data Compression for On-Chip Caches," Proceedings of the 21st ACM International Conference on Parallel Architectures and Compilation Techniques (PACT), Minneapolis, MN, September 2012. Slides (pptx)
Executive Summary
• Off-chip memory latency is high
  – Large caches can help, but at significant cost
• Compressing data in cache enables a larger cache at low cost
• Problem: Decompression is on the execution critical path
• Goal: Design a new compression scheme with 1. low decompression latency, 2. low cost, 3. high compression ratio
• Observation: Many cache lines have low-dynamic-range data
• Key Idea: Encode cache lines as a base + multiple differences
• Solution: Base-Delta-Immediate compression, with low decompression latency and high compression ratio
  – Outperforms three state-of-the-art compression mechanisms
Motivation for Cache Compression
Significant redundancy in data, e.g.: 0x00000000 0x0000000B 0x00000003 0x00000004 ...
How can we exploit this redundancy?
– Cache compression helps
– Provides the effect of a larger cache without making it physically larger
Background on Cache Compression
[Figure: CPU → L1 cache (uncompressed) → L2 cache (compressed); on an L2 hit, data is decompressed before being returned to the L1.]
• Key requirements:
  – Fast (low decompression latency)
  – Simple (avoid complex hardware changes)
  – Effective (good compression ratio)
Shortcomings of Prior Work (✓ = good, ✗ = poor)

| Compression Mechanism | Decompression Latency | Complexity | Compression Ratio |
|---|---|---|---|
| Zero | ✓ | ✓ | ✗ |
| Frequent Value | ✗ | ✗ | ✓ |
| Frequent Pattern | ✗ | ✗/✓ | ✓ |
| Our proposal: BΔI | ✓ | ✓ | ✓ |
Outline
• Motivation & Background
• Key Idea & Our Mechanism
• Evaluation
• Conclusion
Key Data Patterns in Real Applications
Zero Values: initialization, sparse matrices, NULL pointers — 0x00000000 0x00000000 0x00000000 0x00000000 ...
Repeated Values: common initial values, adjacent pixels — 0x000000FF 0x000000FF 0x000000FF 0x000000FF ...
Narrow Values: small values stored in a big data type — 0x00000000 0x0000000B 0x00000003 0x00000004 ...
Other Patterns: pointers to the same memory region — 0xC04039C0 0xC04039C8 0xC04039D0 0xC04039D8 ...
How Common Are These Patterns?
[Figure: cache coverage (%) of zero, repeated-value, and other patterns across SPEC2006, database, and web workloads with a 2MB L2 cache; "Other Patterns" include Narrow Values.]
43% of the cache lines belong to key patterns
Key Data Patterns in Real Applications
These patterns share a common property — Low Dynamic Range: differences between values are significantly smaller than the values themselves.
Key Idea: Base+Delta (B+Δ) Encoding
32-byte uncompressed cache line (eight 4-byte values): 0xC04039C0 0xC04039C8 0xC04039D0 ... 0xC04039F8
12-byte compressed cache line: base 0xC04039C0 (4 bytes) + 1-byte deltas 0x00, 0x08, 0x10, ..., 0x38 → 20 bytes saved
✓ Fast decompression: vector addition
✓ Simple hardware: arithmetic and comparison
✓ Effective: good compression ratio
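A sketch of the basic B+Δ check for one configuration (eight 4-byte values, the base taken from the first element, 1-byte signed deltas — matching the example above); struct and function names are illustrative:

```cpp
#include <cstdint>
#include <optional>

// One B+Delta configuration: eight 4-byte values encoded as a
// 4-byte base plus eight 1-byte signed deltas (12 bytes total).
struct BDelta {
    uint32_t base;
    int8_t   delta[8];
};

std::optional<BDelta> tryBaseDelta(const uint32_t (&line)[8]) {
    BDelta out{line[0], {}};           // base = first element
    for (int i = 0; i < 8; ++i) {
        int64_t d = int64_t(line[i]) - int64_t(out.base);
        if (d < INT8_MIN || d > INT8_MAX)
            return std::nullopt;       // some delta doesn't fit: store raw
        out.delta[i] = static_cast<int8_t>(d);
    }
    return out;                        // 12 bytes instead of 32: 20 saved
}
```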
Can We Do Better?
• Uncompressible cache line (with a single base): 0x00000000 0x09A40178 0x0000000B 0x09A4A838 ...
• Key idea: Use more bases, e.g., two instead of one
• Pro:
  – More cache lines can be compressed
• Cons:
  – Unclear how to find these bases efficiently
  – Higher overhead (due to additional bases)
B+Δ with Multiple Arbitrary Bases
[Figure: geometric-mean compression ratio with 1, 2, 3, 4, 8, 10, and 16 bases.]
✓ 2 bases is the best option based on evaluations
How to Find Two Bases Efficiently?
1. First base – the first element in the cache line (the Base+Delta part)
2. Second base – an implicit base of 0 (the Immediate part)
Advantages over two arbitrary bases:
– Better compression ratio
– Simpler compression logic
Together: Base-Delta-Immediate (BΔI) Compression
B+Δ (with two arbitrary bases) vs. BΔI
[Figure: per-benchmark compression ratios of B+Δ (2 arbitrary bases) and BΔI across SPEC2006, database, and web workloads.]
Average compression ratio is close, but BΔI is simpler
BΔI Implementation
• Decompressor design – low latency
• Compressor design – low cost and complexity
• BΔI cache organization – modest complexity
BΔI Decompressor Design
[Figure: the compressed line supplies base B0 and deltas Δ0-Δ3 to four parallel adders, producing V0 = B0+Δ0 through V3 = B0+Δ3 — a vector addition.]
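Decompression is the inverse of the encoder sketched above: add the base back to every delta. In hardware all the additions happen in parallel as one vector add, which is why the latency is low:

```cpp
#include <cstdint>

// Inverse of tryBaseDelta above: Vi = base + delta[i]. A hardware
// decompressor performs these additions as a single vector add.
void decodeBaseDelta(const BDelta& c, uint32_t (&line)[8]) {
    for (int i = 0; i < 8; ++i)
        line[i] = static_cast<uint32_t>(int64_t(c.base) + c.delta[i]);
}
```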
BΔI Compressor Design
[Figure: the 32-byte uncompressed line is fed in parallel to eight compression units (CUs) — 8-byte base with 1-, 2-, or 4-byte Δ; 4-byte base with 1- or 2-byte Δ; 2-byte base with 1-byte Δ; zero; and repeated values. Each CU outputs a compression flag and compressed cache line (CFlag & CCL); selection logic picks the encoding with the smallest compressed size.]
BΔI Compression Unit: 8-byte B0, 1-byte Δ
[Figure: the base B0 is the first 8-byte element V0 of the 32-byte line; subtractors compute Δi = Vi − B0 for each element and check whether every Δ is within 1-byte range. If yes, the output is B0 followed by Δ0-Δ3; if no, this unit reports failure.]
BΔI Cache Organization
[Figure: a conventional 2-way cache with 32-byte lines has one tag per data line. BΔI instead uses a 4-way tag store (twice as many tags, each with compression-encoding bits C) over 8-byte segmented data storage; tags map to multiple adjacent segments.]
2.3% overhead for a 2 MB cache
Qualitative Comparison with Prior Work
• Zero-based designs
  – ZCA [Dusser+, ICS'09]: zero-content augmented cache
  – ZVC [Islam+, PACT'09]: zero-value cancelling
  – Limited applicability (only zero values)
• FVC [Yang+, MICRO'00]: frequent value compression
  – High decompression latency and complexity
• Pattern-based compression designs
  – FPC [Alameldeen+, ISCA'04]: frequent pattern compression
    • High decompression latency (5 cycles) and complexity
  – C-Pack [Chen+, T-VLSI Systems'10]: practical implementation of an FPC-like algorithm
    • High decompression latency (8 cycles)
Outline
• Motivation & Background
• Key Idea & Our Mechanism
• Evaluation
• Conclusion
Methodology
• Simulator
  – x86 event-driven simulator based on Simics [Magnusson+, Computer'02]
• Workloads
  – SPEC2006 benchmarks, TPC, Apache web server
  – 1-4 core simulations for 1 billion representative instructions
• System parameters
  – L1/L2/L3 cache latencies from CACTI [Thoziyoor+, ISCA'08]
  – 4GHz x86 in-order core, 512kB-16MB L2, simple memory model (300-cycle latency for row misses)
Compression Ratio: BΔI vs. Prior Work
[Figure: per-benchmark compression ratios of ZCA, FVC, FPC, and BΔI on SPEC2006, database, and web workloads with a 2MB L2; BΔI averages 1.53.]
BΔI achieves the highest compression ratio
Single-Core: IPC and MPKI
[Figure, left: normalized IPC versus L2 cache size for the baseline (no compression) and BΔI, with BΔI gains of 8.1%, 5.2%, 5.1%, 4.9%, 5.6%, and 3.6% across sizes. Right: normalized MPKI, with reductions of 16%, 24%, 21%, 13%, 19%, and 14%.]
BΔI achieves the performance of a 2X-size cache; performance improves due to the decrease in MPKI.
Multi-Core Workloads
• Application classification based on:
  – Compressibility: effective cache size increase (Low Compr. (LC) < 1.40, High Compr. (HC) >= 1.40)
  – Sensitivity: performance gain with more cache (Low Sens. (LS) < 1.10, High Sens. (HS) >= 1.10; 512kB -> 2MB)
• Three classes of applications:
  – LCLS, HCLS, HCHS; no LCHS applications
• For 2-core: random mixes of each possible class pair (20 each, 120 total workloads)
Multi-Core: Weighted Speedup
[Figure: normalized weighted speedup of ZCA, FVC, FPC, and BΔI for the class pairs; BΔI improvements of 4.5%, 3.4%, and 4.3% for the low-sensitivity pairs (LCLS-LCLS, LCLS-HCLS, HCLS-HCLS) and 10.9%, 16.5%, and 18.0% for the high-sensitivity pairs (LCLS-HCHS, HCLS-HCHS, HCHS-HCHS), with a 9.5% geometric mean.]
BΔI's performance improvement is the highest (9.5%). If at least one application is sensitive, then performance improves.
Other Results in Paper
• IPC comparison against upper bounds – BΔI almost achieves the performance of a 2X-size cache
• Sensitivity study of having more than 2X tags – up to 1.98 average compression ratio
• Effect on bandwidth consumption – 2.31X decrease on average
• Detailed quantitative comparison with prior work
• Cost analysis of the proposed changes – 2.3% L2 cache area increase
Conclusion
• A new Base-Delta-Immediate compression mechanism
• Key insight: many cache lines can be efficiently represented using base + delta encoding
• Key properties:
  – Low-latency decompression
  – Simple hardware implementation
  – High compression ratio with high coverage
• Improves cache hit ratio and performance of both single-core and multi-core workloads
  – Outperforms state-of-the-art cache compression techniques: FVC and FPC
Linearly Compressed Pages
Gennady Pekhimenko, Vivek Seshadri, Yoongu Kim, Hongyi Xin, Onur Mutlu, Michael A. Kozuch, Phillip B. Gibbons, and Todd C. Mowry, "Linearly Compressed Pages: A Main Memory Compression Framework with Low Complexity and Low Latency," SAFARI Technical Report, TR-SAFARI-2012-005, Carnegie Mellon University, September 2012.
Executive Summary
§ Main memory is a limited shared resource
§ Observation: Significant data redundancy
§ Idea: Compress data in main memory
§ Problem: How to avoid latency increase?
§ Solution: Linearly Compressed Pages (LCP): fixed-size, cache-line-granularity compression
  1. Increases capacity (69% on average)
  2. Decreases bandwidth consumption (46%)
  3. Improves overall performance (9.5%)
Challenges in Main Memory Compression
1. Address computation
2. Mapping and fragmentation
3. Physically tagged caches
Address Computation
[Figure: in an uncompressed page, cache line Li (64B each) sits at offset 64*i; in a compressed page the offsets of L1, L2, ..., LN-1 are no longer fixed — computing a line's address requires knowing the compressed sizes of all preceding lines.]
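LCP's answer (detailed on the following slides) is to give every compressed line within a page the same fixed size, so the offset becomes a single multiply again. A sketch contrasting the three cases, with assumed field names:

```cpp
#include <cstdint>

// Uncompressed page: line i always starts at byte 64*i.
uint64_t uncompressedOffset(int lineIndex) {
    return 64ull * static_cast<uint64_t>(lineIndex);
}

// Variable-size compression: line i's offset depends on the sizes
// of all preceding lines -- a serial prefix sum per access.
uint64_t variableOffset(const uint8_t* compressedSize, int lineIndex) {
    uint64_t off = 0;
    for (int i = 0; i < lineIndex; ++i) off += compressedSize[i];
    return off;
}

// LCP: every line in the page is compressed to one fixed size
// (e.g., 16B at 4:1), so the offset is again a single multiply.
uint64_t lcpOffset(int lineIndex, uint64_t fixedLineSize) {
    return fixedLineSize * static_cast<uint64_t>(lineIndex);
}
```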
Mapping and Fragmentation
[Figure: a 4kB virtual page maps to a physical page of unknown compressed size; variable sizes cause fragmentation.]
Physically Tagged Caches
[Figure: the core's virtual address must be translated through the TLB to a physical address before tag comparison against the L2 cache lines; address translation is on the critical path.]
Shortcomings of Prior Work (✓ = good, ✗ = poor)

| Compression Mechanism | Access Latency | Decompression Latency | Complexity | Compression Ratio |
|---|---|---|---|---|
| IBM MXT [IBM J.R.D. '01] | ✗ | ✗ | ✗ | ✓ |
| Robust Main Memory Compression [ISCA'05] | ✗ | ✓ | ✗ | ✓ |
| LCP: Our Proposal | ✓ | ✓ | ✓ | ✓ |
Linearly Compressed Pages (LCP): Key Idea
[Figure: a 4kB uncompressed page (64 x 64B lines) is compressed 4:1 into 1kB of fixed-size compressed data, followed by metadata (64B) and an exception storage region for incompressible lines.]
LCP Overview
180
• Page Table entry extension – compression type and size – extended physical base address
• Opera&ng System management support – 4 memory pools (512B, 1kB, 2kB, 4kB)
• Changes to cache tagging logic – physical page base address + cache line index (within a page)
• Handling page overflows
• Compression algorithms: BDI [PACT'12], FPC [ISCA'04]
![Page 181: Multi-Core Architectures and Shared Resource …users.ece.cmu.edu/~omutlu/pub/onur-Bogazici-June-7-2013-lecture1-2..." Mutlu et al., “Runahead Execution: An Alternative to Very Large](https://reader031.vdocuments.site/reader031/viewer/2022030416/5aa2476a7f8b9a1f6d8d067a/html5/thumbnails/181.jpg)
LCP Optimizations
181
• Metadata cache – avoids additional requests to metadata
• Memory bandwidth reduction: four compressed 64B lines come back in 1 transfer instead of 4
• Zero pages and zero cache lines – handled separately in the TLB (1 bit) and in the metadata (1 bit per cache line)
• Integration with cache compression – BDI and FPC
![Page 182: Multi-Core Architectures and Shared Resource …users.ece.cmu.edu/~omutlu/pub/onur-Bogazici-June-7-2013-lecture1-2..." Mutlu et al., “Runahead Execution: An Alternative to Very Large](https://reader031.vdocuments.site/reader031/viewer/2022030416/5aa2476a7f8b9a1f6d8d067a/html5/thumbnails/182.jpg)
Methodology
• Simulator
– x86 event-driven simulators: Simics-based [Magnusson+, Computer'02] for CPU, Multi2Sim [Ubal+, PACT'12] for GPU
• Workloads
– SPEC2006 benchmarks, TPC, Apache web server, GPGPU applications
• System Parameters
– L1/L2/L3 cache latencies from CACTI [Thoziyoor+, ISCA'08]
– 512kB - 16MB L2, simple memory model
182
![Page 183: Multi-Core Architectures and Shared Resource …users.ece.cmu.edu/~omutlu/pub/onur-Bogazici-June-7-2013-lecture1-2..." Mutlu et al., “Runahead Execution: An Alternative to Very Large](https://reader031.vdocuments.site/reader031/viewer/2022030416/5aa2476a7f8b9a1f6d8d067a/html5/thumbnails/183.jpg)
Compression Ratio Comparison
183
(Chart: GeoMean compression ratios: Zero Page 1.30, FPC 1.59, LCP (BDI) 1.62, LCP (BDI+FPC-fixed) 1.69, MXT 2.31, LZ 2.60.)
SPEC2006, databases, web workloads, 2MB L2 cache
LCP-based frameworks achieve competitive average compression ratios with prior work
![Page 184: Multi-Core Architectures and Shared Resource …users.ece.cmu.edu/~omutlu/pub/onur-Bogazici-June-7-2013-lecture1-2..." Mutlu et al., “Runahead Execution: An Alternative to Very Large](https://reader031.vdocuments.site/reader031/viewer/2022030416/5aa2476a7f8b9a1f6d8d067a/html5/thumbnails/184.jpg)
Bandwidth Consumption Decrease
184
SPEC2006, databases, web workloads, 2MB L2 cache
(Chart: normalized BPKI, GeoMean, lower is better: FPC-cache 0.92, BDI-cache 0.89, FPC-memory 0.57, (None, LCP-BDI) 0.63, (FPC, FPC) 0.54, (BDI, LCP-BDI) 0.55, (BDI, LCP-BDI+FPC-fixed) 0.54.)
LCP frameworks significantly reduce bandwidth (46%)
![Page 185: Multi-Core Architectures and Shared Resource …users.ece.cmu.edu/~omutlu/pub/onur-Bogazici-June-7-2013-lecture1-2..." Mutlu et al., “Runahead Execution: An Alternative to Very Large](https://reader031.vdocuments.site/reader031/viewer/2022030416/5aa2476a7f8b9a1f6d8d067a/html5/thumbnails/185.jpg)
Performance Improvement
185
| Cores | LCP-BDI | (BDI, LCP-BDI) | (BDI, LCP-BDI+FPC-fixed) |
|---|---|---|---|
| 1 | 6.1% | 9.5% | 9.3% |
| 2 | 13.9% | 23.7% | 23.6% |
| 4 | 10.7% | 22.6% | 22.5% |
LCP frameworks significantly improve performance
![Page 186: Multi-Core Architectures and Shared Resource …users.ece.cmu.edu/~omutlu/pub/onur-Bogazici-June-7-2013-lecture1-2..." Mutlu et al., “Runahead Execution: An Alternative to Very Large](https://reader031.vdocuments.site/reader031/viewer/2022030416/5aa2476a7f8b9a1f6d8d067a/html5/thumbnails/186.jpg)
Conclusion
• A new main memory compression framework called LCP (Linearly Compressed Pages)
– Key idea: fixed size for compressed cache lines within a page and fixed compression algorithm per page
• LCP evaluation:
– Increases capacity (69% on average)
– Decreases bandwidth consumption (46%)
– Improves overall performance (9.5%)
– Decreases energy of the off-chip bus (37%)
186
![Page 187: Multi-Core Architectures and Shared Resource …users.ece.cmu.edu/~omutlu/pub/onur-Bogazici-June-7-2013-lecture1-2..." Mutlu et al., “Runahead Execution: An Alternative to Very Large](https://reader031.vdocuments.site/reader031/viewer/2022030416/5aa2476a7f8b9a1f6d8d067a/html5/thumbnails/187.jpg)
Controlled Shared Caching
187
![Page 188: Multi-Core Architectures and Shared Resource …users.ece.cmu.edu/~omutlu/pub/onur-Bogazici-June-7-2013-lecture1-2..." Mutlu et al., “Runahead Execution: An Alternative to Very Large](https://reader031.vdocuments.site/reader031/viewer/2022030416/5aa2476a7f8b9a1f6d8d067a/html5/thumbnails/188.jpg)
Controlled Cache Sharing n Utility based cache partitioning
q Qureshi and Patt, “Utility-Based Cache Partitioning: A Low-Overhead, High-Performance, Runtime Mechanism to Partition Shared Caches,” MICRO 2006.
q Suh et al., “A New Memory Monitoring Scheme for Memory-Aware Scheduling and Partitioning,” HPCA 2002.
n Fair cache partitioning
q Kim et al., “Fair Cache Sharing and Partitioning in a Chip Multiprocessor Architecture,” PACT 2004.
n Shared/private mixed cache mechanisms q Qureshi, “Adaptive Spill-Receive for Robust High-Performance Caching in
CMPs,” HPCA 2009. q Hardavellas et al., “Reactive NUCA: Near-Optimal Block Placement and
Replication in Distributed Caches,” ISCA 2009. 188
![Page 189: Multi-Core Architectures and Shared Resource …users.ece.cmu.edu/~omutlu/pub/onur-Bogazici-June-7-2013-lecture1-2..." Mutlu et al., “Runahead Execution: An Alternative to Very Large](https://reader031.vdocuments.site/reader031/viewer/2022030416/5aa2476a7f8b9a1f6d8d067a/html5/thumbnails/189.jpg)
Utility Based Shared Cache Partitioning
n Goal: Maximize system throughput
n Observation: Not all threads/applications benefit equally from caching → simple LRU replacement is not good for system throughput
n Idea: Allocate more cache space to applications that obtain the most benefit from more space
n The high-level idea can be applied to other shared resources as well.
n Qureshi and Patt, “Utility-Based Cache Partitioning: A Low-Overhead, High-Performance, Runtime Mechanism to Partition Shared Caches,” MICRO 2006.
n Suh et al., “A New Memory Monitoring Scheme for Memory-Aware Scheduling and Partitioning,” HPCA 2002.
189
![Page 190: Multi-Core Architectures and Shared Resource …users.ece.cmu.edu/~omutlu/pub/onur-Bogazici-June-7-2013-lecture1-2..." Mutlu et al., “Runahead Execution: An Alternative to Very Large](https://reader031.vdocuments.site/reader031/viewer/2022030416/5aa2476a7f8b9a1f6d8d067a/html5/thumbnails/190.jpg)
Marginal Utility of a Cache Way
190
Utility U_a^b = Misses with a ways - Misses with b ways
(Plot: misses per 1000 instructions vs. number of ways from a 16-way 1MB L2, with regions of low utility, high utility, and saturating utility.)
![Page 191: Multi-Core Architectures and Shared Resource …users.ece.cmu.edu/~omutlu/pub/onur-Bogazici-June-7-2013-lecture1-2..." Mutlu et al., “Runahead Execution: An Alternative to Very Large](https://reader031.vdocuments.site/reader031/viewer/2022030416/5aa2476a7f8b9a1f6d8d067a/html5/thumbnails/191.jpg)
Utility Based Shared Cache Partitioning Motivation
191
(Plot: misses per 1000 instructions (MPKI) vs. number of ways from a 16-way 1MB L2 for equake and vpr, with the LRU and UTIL allocation points marked.)
Improve performance by giving more cache to the application that benefits more from cache
![Page 192: Multi-Core Architectures and Shared Resource …users.ece.cmu.edu/~omutlu/pub/onur-Bogazici-June-7-2013-lecture1-2..." Mutlu et al., “Runahead Execution: An Alternative to Very Large](https://reader031.vdocuments.site/reader031/viewer/2022030416/5aa2476a7f8b9a1f6d8d067a/html5/thumbnails/192.jpg)
Utility Based Cache Partitioning (III)
192
Three components:
q Utility Monitors (UMON) per core
q Partitioning Algorithm (PA)
q Replacement support to enforce partitions
(Diagram: Core1 and Core2, each with an I$, a D$, and a UMON, share the L2 cache backed by main memory; the PA reads UMON1 and UMON2 to set the partition.)
![Page 193: Multi-Core Architectures and Shared Resource …users.ece.cmu.edu/~omutlu/pub/onur-Bogazici-June-7-2013-lecture1-2..." Mutlu et al., “Runahead Execution: An Alternative to Very Large](https://reader031.vdocuments.site/reader031/viewer/2022030416/5aa2476a7f8b9a1f6d8d067a/html5/thumbnails/193.jpg)
Utility Monitors q For each core, simulate LRU policy using ATD
q Hit counters in ATD to count hits per recency position
q LRU is a stack algorithm: hit counts → utility. E.g., hits(2 ways) = H0+H1
193
(Diagram: the main tag directory (MTD) alongside an auxiliary tag directory (ATD) holding the same sets A–H; the ATD has one hit counter per recency position, (MRU) H0 H1 H2 … H15 (LRU).)
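Because LRU is a stack algorithm, these recency counters answer "how many hits would this core see with n ways?" for every n at once, via a prefix sum. A minimal sketch with hypothetical names:

```c
#include <stdint.h>

#define NUM_WAYS 16

/* Hypothetical UMON state: hits[r] counts ATD hits at recency
 * position r (0 = MRU, NUM_WAYS-1 = LRU). */
typedef struct { uint64_t hits[NUM_WAYS]; } umon_t;

/* A hit at recency position r would also hit in any cache with more
 * than r ways, so hits(n ways) = H0 + H1 + ... + H(n-1). */
static uint64_t hits_with_ways(const umon_t *u, int ways)
{
    uint64_t sum = 0;
    for (int r = 0; r < ways; r++)
        sum += u->hits[r];
    return sum;   /* e.g., hits(2 ways) = H0 + H1 */
}
```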
![Page 194: Multi-Core Architectures and Shared Resource …users.ece.cmu.edu/~omutlu/pub/onur-Bogazici-June-7-2013-lecture1-2..." Mutlu et al., “Runahead Execution: An Alternative to Very Large](https://reader031.vdocuments.site/reader031/viewer/2022030416/5aa2476a7f8b9a1f6d8d067a/html5/thumbnails/194.jpg)
Utility Monitors
194
![Page 195: Multi-Core Architectures and Shared Resource …users.ece.cmu.edu/~omutlu/pub/onur-Bogazici-June-7-2013-lecture1-2..." Mutlu et al., “Runahead Execution: An Alternative to Very Large](https://reader031.vdocuments.site/reader031/viewer/2022030416/5aa2476a7f8b9a1f6d8d067a/html5/thumbnails/195.jpg)
Dynamic Set Sampling q Extra tags incur hardware and power overhead
q Dynamic Set Sampling reduces overhead [Qureshi, ISCA’06]
q 32 sets sufficient (analytical bounds)
q Storage < 2kB/UMON
195
(Diagram: with Dynamic Set Sampling, the UMON's ATD keeps only a sampled subset of the MTD's sets, e.g., sets B, E, and G, with the same (MRU) H0 H1 H2 … H15 (LRU) hit counters.)
![Page 196: Multi-Core Architectures and Shared Resource …users.ece.cmu.edu/~omutlu/pub/onur-Bogazici-June-7-2013-lecture1-2..." Mutlu et al., “Runahead Execution: An Alternative to Very Large](https://reader031.vdocuments.site/reader031/viewer/2022030416/5aa2476a7f8b9a1f6d8d067a/html5/thumbnails/196.jpg)
Partitioning Algorithm q Evaluate all possible partitions and select the best
q With a ways to core1 and (16-a) ways to core2:
Hitscore1 = H0 + H1 + … + H(a-1) (from UMON1)
Hitscore2 = H0 + H1 + … + H(16-a-1) (from UMON2)
q Select a that maximizes (Hitscore1 + Hitscore2)
q Partitioning done once every 5 million cycles
196
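For two cores, "evaluate all possible partitions" is just a loop over the 15 ways splits, reusing umon_t and hits_with_ways() from the sketch above (names hypothetical):

```c
/* Return the number of ways to give core 1; core 2 gets the rest.
 * Each core is assumed to keep at least one way. */
static int best_partition(const umon_t *u1, const umon_t *u2)
{
    int best_a = 1;
    uint64_t best_hits = 0;
    for (int a = 1; a <= NUM_WAYS - 1; a++) {
        uint64_t total = hits_with_ways(u1, a)
                       + hits_with_ways(u2, NUM_WAYS - a);
        if (total > best_hits) { best_hits = total; best_a = a; }
    }
    return best_a;   /* invoked once every 5 million cycles */
}
```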
![Page 197: Multi-Core Architectures and Shared Resource …users.ece.cmu.edu/~omutlu/pub/onur-Bogazici-June-7-2013-lecture1-2..." Mutlu et al., “Runahead Execution: An Alternative to Very Large](https://reader031.vdocuments.site/reader031/viewer/2022030416/5aa2476a7f8b9a1f6d8d067a/html5/thumbnails/197.jpg)
Way Partitioning
197
Way partitioning support: [Suh+ HPCA’02, Iyer ICS’04] 1. Each line has core-id bits
2. On a miss, count ways_occupied in set by miss-causing app
3. If ways_occupied < ways_given, the victim is the LRU line from the other app; otherwise, the victim is the LRU line from the miss-causing app
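Enforcement only changes victim selection on a miss. A self-contained sketch, assuming each line records the id of the core that allocated it and an LRU rank within the set (an illustrative data layout, not the exact hardware):

```c
#define WAYS 16

typedef struct { int core_id; int lru_rank; } line_t;  /* rank 0 = MRU */
typedef struct { line_t way[WAYS]; } set_t;

/* LRU way among lines whose owner does (match = 1) or does not
 * (match = 0) equal `core`; returns -1 if no such line exists. */
static int lru_way(const set_t *s, int core, int match)
{
    int victim = -1, worst = -1;
    for (int w = 0; w < WAYS; w++) {
        int owned = (s->way[w].core_id == core);
        if (owned == match && s->way[w].lru_rank > worst) {
            worst = s->way[w].lru_rank;
            victim = w;
        }
    }
    return victim;
}

/* On a miss by `miss_core`, enforce its quota of `ways_given` ways. */
int choose_victim(const set_t *s, int miss_core, int ways_given)
{
    int occupied = 0;
    for (int w = 0; w < WAYS; w++)
        occupied += (s->way[w].core_id == miss_core);
    if (occupied < ways_given)            /* below quota: grow */
        return lru_way(s, miss_core, 0);  /* evict another app's LRU line */
    return lru_way(s, miss_core, 1);      /* at quota: evict own LRU line */
}
```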
![Page 198: Multi-Core Architectures and Shared Resource …users.ece.cmu.edu/~omutlu/pub/onur-Bogazici-June-7-2013-lecture1-2..." Mutlu et al., “Runahead Execution: An Alternative to Very Large](https://reader031.vdocuments.site/reader031/viewer/2022030416/5aa2476a7f8b9a1f6d8d067a/html5/thumbnails/198.jpg)
Performance Metrics
n Three metrics for performance:
1. Weighted Speedup (default metric) → perf = IPC1/SingleIPC1 + IPC2/SingleIPC2 → correlates with reduction in execution time
2. Throughput → perf = IPC1 + IPC2 → can be unfair to a low-IPC application
3. Hmean-fairness → perf = hmean(IPC1/SingleIPC1, IPC2/SingleIPC2) → balances fairness and performance
198
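Written out for a two-app workload, the three metrics are a direct transcription of the formulas above:

```c
/* ipc[i]: IPC of app i when the apps run together;
 * single_ipc[i]: IPC of app i when it runs alone. */
double weighted_speedup(const double ipc[2], const double single_ipc[2])
{
    return ipc[0] / single_ipc[0] + ipc[1] / single_ipc[1];
}

double throughput(const double ipc[2])
{
    return ipc[0] + ipc[1];   /* can be unfair to a low-IPC app */
}

double hmean_fairness(const double ipc[2], const double single_ipc[2])
{
    double s0 = ipc[0] / single_ipc[0];
    double s1 = ipc[1] / single_ipc[1];
    return 2.0 / (1.0 / s0 + 1.0 / s1);   /* harmonic mean of speedups */
}
```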
![Page 199: Multi-Core Architectures and Shared Resource …users.ece.cmu.edu/~omutlu/pub/onur-Bogazici-June-7-2013-lecture1-2..." Mutlu et al., “Runahead Execution: An Alternative to Very Large](https://reader031.vdocuments.site/reader031/viewer/2022030416/5aa2476a7f8b9a1f6d8d067a/html5/thumbnails/199.jpg)
Weighted Speedup Results for UCP
199
![Page 200: Multi-Core Architectures and Shared Resource …users.ece.cmu.edu/~omutlu/pub/onur-Bogazici-June-7-2013-lecture1-2..." Mutlu et al., “Runahead Execution: An Alternative to Very Large](https://reader031.vdocuments.site/reader031/viewer/2022030416/5aa2476a7f8b9a1f6d8d067a/html5/thumbnails/200.jpg)
IPC Results for UCP
200
UCP improves average throughput by 17%
![Page 201: Multi-Core Architectures and Shared Resource …users.ece.cmu.edu/~omutlu/pub/onur-Bogazici-June-7-2013-lecture1-2..." Mutlu et al., “Runahead Execution: An Alternative to Very Large](https://reader031.vdocuments.site/reader031/viewer/2022030416/5aa2476a7f8b9a1f6d8d067a/html5/thumbnails/201.jpg)
Any Problems with UCP So Far? - Scalability - Non-convex curves?
n Time complexity of partitioning low for two cores (number of possible partitions ≈ number of ways)
n Possible partitions increase exponentially with cores
n For a 32-way cache, possible partitions:
q 4 cores → 6545
q 8 cores → 15.4 million
n Problem is NP-hard → need a scalable partitioning algorithm 201
![Page 202: Multi-Core Architectures and Shared Resource …users.ece.cmu.edu/~omutlu/pub/onur-Bogazici-June-7-2013-lecture1-2..." Mutlu et al., “Runahead Execution: An Alternative to Very Large](https://reader031.vdocuments.site/reader031/viewer/2022030416/5aa2476a7f8b9a1f6d8d067a/html5/thumbnails/202.jpg)
Greedy Algorithm [Stone+ ToC ’92] n GA allocates 1 block to the app that has the max utility for
one block. Repeat till all blocks allocated
n Optimal partitioning when utility curves are convex
n Pathological behavior for non-convex curves
202
![Page 203: Multi-Core Architectures and Shared Resource …users.ece.cmu.edu/~omutlu/pub/onur-Bogazici-June-7-2013-lecture1-2..." Mutlu et al., “Runahead Execution: An Alternative to Very Large](https://reader031.vdocuments.site/reader031/viewer/2022030416/5aa2476a7f8b9a1f6d8d067a/html5/thumbnails/203.jpg)
Problem with Greedy Algorithm
n Problem: GA considers benefit only from the immediate block. Hence, it fails to exploit large gains from looking ahead
203
(Plot: misses vs. blocks assigned for applications A and B; A's misses fall steadily, while B's stay flat before dropping sharply.)
In each iteration, the utility for 1 block: U(A) = 10 misses, U(B) = 0 misses
→ All blocks are assigned to A, even if B has the same miss reduction with fewer blocks
![Page 204: Multi-Core Architectures and Shared Resource …users.ece.cmu.edu/~omutlu/pub/onur-Bogazici-June-7-2013-lecture1-2..." Mutlu et al., “Runahead Execution: An Alternative to Very Large](https://reader031.vdocuments.site/reader031/viewer/2022030416/5aa2476a7f8b9a1f6d8d067a/html5/thumbnails/204.jpg)
Lookahead Algorithm n Marginal Utility (MU) = Utility per cache resource
q MU_a^b = U_a^b / (b - a)
n GA considers MU for 1 block. LA considers MU for all possible allocations
n Select the app that has the max value for MU. Allocate it as many blocks as required to get the max MU
n Repeat till all blocks assigned
204
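A sketch of the lookahead loop for two apps, assuming a miss curve miss[a][b] (misses of app a when given b blocks) derived from the UMON counters; all names are illustrative:

```c
#include <stdint.h>

#define APPS   2
#define BLOCKS 16

void lookahead_partition(const uint64_t miss[APPS][BLOCKS + 1],
                         int alloc[APPS])
{
    for (int a = 0; a < APPS; a++) alloc[a] = 0;
    int remaining = BLOCKS;
    while (remaining > 0) {
        int best_app = 0, best_take = 1;
        double best_mu = -1.0;
        for (int a = 0; a < APPS; a++) {
            int cur = alloc[a];
            /* Unlike GA, consider the marginal utility of every
             * feasible allocation size, not just a single block. */
            for (int take = 1; take <= remaining; take++) {
                double mu =
                    (double)(miss[a][cur] - miss[a][cur + take]) / take;
                if (mu > best_mu) {
                    best_mu = mu; best_app = a; best_take = take;
                }
            }
        }
        alloc[best_app] += best_take;  /* max-MU app gets its blocks */
        remaining       -= best_take;
    }
}
```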
![Page 205: Multi-Core Architectures and Shared Resource …users.ece.cmu.edu/~omutlu/pub/onur-Bogazici-June-7-2013-lecture1-2..." Mutlu et al., “Runahead Execution: An Alternative to Very Large](https://reader031.vdocuments.site/reader031/viewer/2022030416/5aa2476a7f8b9a1f6d8d067a/html5/thumbnails/205.jpg)
Lookahead Algorithm Example
205
Time complexity ≈ ways²/2 (512 ops for 32 ways)
(Plot: misses vs. blocks assigned for applications A and B, as on the previous slide.)
Iteration 1: MU(A) = 10 misses/1 block, MU(B) = 80 misses/3 blocks → B gets 3 blocks
Next five iterations: MU(A) = 10/1 block, MU(B) = 0 → A gets 1 block each time
Result: A gets 5 blocks and B gets 3 blocks (Optimal)
![Page 206: Multi-Core Architectures and Shared Resource …users.ece.cmu.edu/~omutlu/pub/onur-Bogazici-June-7-2013-lecture1-2..." Mutlu et al., “Runahead Execution: An Alternative to Very Large](https://reader031.vdocuments.site/reader031/viewer/2022030416/5aa2476a7f8b9a1f6d8d067a/html5/thumbnails/206.jpg)
UCP Results
206
Four cores sharing a 2MB 32-way L2
(Chart: performance under LRU, UCP(Greedy), UCP(Lookahead), and UCP(EvalAll) for Mix1 (gap-applu-apsi-gzp), Mix2 (swm-glg-mesa-prl), Mix3 (mcf-applu-art-vrtx), and Mix4 (mcf-art-eqk-wupw).)
LA performs similar to EvalAll, with low time complexity
![Page 207: Multi-Core Architectures and Shared Resource …users.ece.cmu.edu/~omutlu/pub/onur-Bogazici-June-7-2013-lecture1-2..." Mutlu et al., “Runahead Execution: An Alternative to Very Large](https://reader031.vdocuments.site/reader031/viewer/2022030416/5aa2476a7f8b9a1f6d8d067a/html5/thumbnails/207.jpg)
Utility Based Cache Partitioning n Advantages over LRU
+ Improves system throughput + Better utilizes the shared cache
n Disadvantages - Fairness, QoS?
n Limitations - Scalability: Partitioning limited to ways. What if you have
numWays < numApps? - Scalability: How is utility computed in a distributed cache? - What if past behavior is not a good predictor of utility?
207
![Page 208: Multi-Core Architectures and Shared Resource …users.ece.cmu.edu/~omutlu/pub/onur-Bogazici-June-7-2013-lecture1-2..." Mutlu et al., “Runahead Execution: An Alternative to Very Large](https://reader031.vdocuments.site/reader031/viewer/2022030416/5aa2476a7f8b9a1f6d8d067a/html5/thumbnails/208.jpg)
Fair Shared Cache Partitioning n Goal: Equalize the slowdowns of multiple threads sharing
the cache n Idea: Dynamically estimate slowdowns due to sharing and
assign cache blocks to balance slowdowns
n Approximate slowdown with change in miss rate + Simple - Not accurate. Why?
q Kim et al., “Fair Cache Sharing and Partitioning in a Chip
Multiprocessor Architecture,” PACT 2004.
208
![Page 209: Multi-Core Architectures and Shared Resource …users.ece.cmu.edu/~omutlu/pub/onur-Bogazici-June-7-2013-lecture1-2..." Mutlu et al., “Runahead Execution: An Alternative to Very Large](https://reader031.vdocuments.site/reader031/viewer/2022030416/5aa2476a7f8b9a1f6d8d067a/html5/thumbnails/209.jpg)
Problem with Shared Caches
209
(Diagram: two cores with private L1s share an L2; thread t1 on Core 1 has filled the shared L2 with its data.)
![Page 210: Multi-Core Architectures and Shared Resource …users.ece.cmu.edu/~omutlu/pub/onur-Bogazici-June-7-2013-lecture1-2..." Mutlu et al., “Runahead Execution: An Alternative to Very Large](https://reader031.vdocuments.site/reader031/viewer/2022030416/5aa2476a7f8b9a1f6d8d067a/html5/thumbnails/210.jpg)
Problem with Shared Caches
210
(Diagram: thread t2 on Core 2, running alone, would likewise fill the shared L2 with its own data.)
![Page 211: Multi-Core Architectures and Shared Resource …users.ece.cmu.edu/~omutlu/pub/onur-Bogazici-June-7-2013-lecture1-2..." Mutlu et al., “Runahead Execution: An Alternative to Very Large](https://reader031.vdocuments.site/reader031/viewer/2022030416/5aa2476a7f8b9a1f6d8d067a/html5/thumbnails/211.jpg)
Problem with Shared Caches
211
(Diagram: running together, t1 and t2 share the L2, with t1 occupying most of the capacity.)
t2's throughput is significantly reduced due to unfair cache sharing.
![Page 212: Multi-Core Architectures and Shared Resource …users.ece.cmu.edu/~omutlu/pub/onur-Bogazici-June-7-2013-lecture1-2..." Mutlu et al., “Runahead Execution: An Alternative to Very Large](https://reader031.vdocuments.site/reader031/viewer/2022030416/5aa2476a7f8b9a1f6d8d067a/html5/thumbnails/212.jpg)
Problem with Shared Caches
212
(Plots: gzip's normalized cache misses and normalized IPC when running alone vs. with applu, apsi, art, or swim; every co-runner inflates gzip's misses and lowers its IPC.)
![Page 213: Multi-Core Architectures and Shared Resource …users.ece.cmu.edu/~omutlu/pub/onur-Bogazici-June-7-2013-lecture1-2..." Mutlu et al., “Runahead Execution: An Alternative to Very Large](https://reader031.vdocuments.site/reader031/viewer/2022030416/5aa2476a7f8b9a1f6d8d067a/html5/thumbnails/213.jpg)
Fairness Metrics
213
• Uniform slowdown: T_shared_i / T_alone_i = T_shared_j / T_alone_j
• Minimize M_ij = |X_i - X_j|, where:
– Ideally (M0): X_i = T_shared_i / T_alone_i
– Using miss counts (M1): X_i = Miss_shared_i / Miss_alone_i
– Using miss rates (M3): X_i = MissRate_shared_i / MissRate_alone_i
![Page 214: Multi-Core Architectures and Shared Resource …users.ece.cmu.edu/~omutlu/pub/onur-Bogazici-June-7-2013-lecture1-2..." Mutlu et al., “Runahead Execution: An Alternative to Very Large](https://reader031.vdocuments.site/reader031/viewer/2022030416/5aa2476a7f8b9a1f6d8d067a/html5/thumbnails/214.jpg)
Block-Granularity Partitioning
214
(Diagram: on a P2 miss, with current partition P1: 448B / P2: 576B and target partition P1: 384B / P2: 640B, the modified policy picks one of P1's LRU lines as the victim so that P2 can grow toward its target.)
• Modified LRU cache replacement policy – G. Suh, et al., HPCA 2002
![Page 215: Multi-Core Architectures and Shared Resource …users.ece.cmu.edu/~omutlu/pub/onur-Bogazici-June-7-2013-lecture1-2..." Mutlu et al., “Runahead Execution: An Alternative to Very Large](https://reader031.vdocuments.site/reader031/viewer/2022030416/5aa2476a7f8b9a1f6d8d067a/html5/thumbnails/215.jpg)
Block-Granularity Partitioning
215
(Diagram, continued: after the replacement, the current partition matches the target, P1: 384B / P2: 640B.)
• Modified LRU cache replacement policy – G. Suh, et al., HPCA 2002
![Page 216: Multi-Core Architectures and Shared Resource …users.ece.cmu.edu/~omutlu/pub/onur-Bogazici-June-7-2013-lecture1-2..." Mutlu et al., “Runahead Execution: An Alternative to Very Large](https://reader031.vdocuments.site/reader031/viewer/2022030416/5aa2476a7f8b9a1f6d8d067a/html5/thumbnails/216.jpg)
Dynamic Fair Caching Algorithm
216
Example: optimizing the M3 metric. Per thread, the algorithm tracks MissRate_alone and MissRate_shared and maintains a target partition; it repartitions once per repartitioning interval.
![Page 217: Multi-Core Architectures and Shared Resource …users.ece.cmu.edu/~omutlu/pub/onur-Bogazici-June-7-2013-lecture1-2..." Mutlu et al., “Runahead Execution: An Alternative to Very Large](https://reader031.vdocuments.site/reader031/viewer/2022030416/5aa2476a7f8b9a1f6d8d067a/html5/thumbnails/217.jpg)
217
Dynamic Fair Caching Algorithm
1st interval: MissRate_alone is P1: 20%, P2: 5%; the measured MissRate_shared is P1: 20%, P2: 15%; the target partition starts at P1: 256KB, P2: 256KB.
![Page 218: Multi-Core Architectures and Shared Resource …users.ece.cmu.edu/~omutlu/pub/onur-Bogazici-June-7-2013-lecture1-2..." Mutlu et al., “Runahead Execution: An Alternative to Very Large](https://reader031.vdocuments.site/reader031/viewer/2022030416/5aa2476a7f8b9a1f6d8d067a/html5/thumbnails/218.jpg)
218
Dynamic Fair Caching Algorithm
Repartition! Evaluate M3: P1: 20% / 20% = 1.0 vs. P2: 15% / 5% = 3.0 → P2 is slowed down more, so the target partition moves from P1: 256KB / P2: 256KB to P1: 192KB / P2: 320KB (partition granularity: 64KB).
![Page 219: Multi-Core Architectures and Shared Resource …users.ece.cmu.edu/~omutlu/pub/onur-Bogazici-June-7-2013-lecture1-2..." Mutlu et al., “Runahead Execution: An Alternative to Very Large](https://reader031.vdocuments.site/reader031/viewer/2022030416/5aa2476a7f8b9a1f6d8d067a/html5/thumbnails/219.jpg)
219
Dynamic Fair Caching Algorithm
2nd interval: under the new target partition (P1: 192KB / P2: 320KB), the measured MissRate_shared becomes P1: 20%, P2: 10%.
![Page 220: Multi-Core Architectures and Shared Resource …users.ece.cmu.edu/~omutlu/pub/onur-Bogazici-June-7-2013-lecture1-2..." Mutlu et al., “Runahead Execution: An Alternative to Very Large](https://reader031.vdocuments.site/reader031/viewer/2022030416/5aa2476a7f8b9a1f6d8d067a/html5/thumbnails/220.jpg)
220
Dynamic Fair Caching Algorithm
Repartition! Evaluate M3: P1: 20% / 20% = 1.0 vs. P2: 10% / 5% = 2.0 → still unbalanced, so the target partition moves from P1: 192KB / P2: 320KB to P1: 128KB / P2: 384KB.
![Page 221: Multi-Core Architectures and Shared Resource …users.ece.cmu.edu/~omutlu/pub/onur-Bogazici-June-7-2013-lecture1-2..." Mutlu et al., “Runahead Execution: An Alternative to Very Large](https://reader031.vdocuments.site/reader031/viewer/2022030416/5aa2476a7f8b9a1f6d8d067a/html5/thumbnails/221.jpg)
221
Dynamic Fair Caching Algorithm
3rd interval: under the target partition P1: 128KB / P2: 384KB, the measured MissRate_shared becomes P1: 25%, P2: 9%.
![Page 222: Multi-Core Architectures and Shared Resource …users.ece.cmu.edu/~omutlu/pub/onur-Bogazici-June-7-2013-lecture1-2..." Mutlu et al., “Runahead Execution: An Alternative to Very Large](https://reader031.vdocuments.site/reader031/viewer/2022030416/5aa2476a7f8b9a1f6d8d067a/html5/thumbnails/222.jpg)
222
Dynamic Fair Caching Algorithm
Repartition! Do rollback if Δ < T_rollback for P2, where Δ = MR_old - MR_new. Here P2 improved only from 10% to 9% while P1 degraded from 20% to 25%, so the move is rolled back: the target partition returns from P1: 128KB / P2: 384KB to P1: 192KB / P2: 320KB.
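Putting the frames together, one repartitioning step might look like the following sketch (hypothetical structure: the slides fix the granularity at 64KB but leave T_rollback open, and the rollback check is shown for P2 only, as in the example):

```c
#define GRAN_KB 64   /* partition granularity from the slides */

typedef struct {
    double mr_alone;    /* MissRate_alone (profiled) */
    double mr_shared;   /* MissRate_shared, last interval */
    double mr_prev;     /* MissRate_shared, interval before that */
    int    quota_kb;    /* target partition */
} thread_t;

/* X_i for the M3 metric: shared miss rate over alone miss rate. */
static double x3(const thread_t *t) { return t->mr_shared / t->mr_alone; }

void repartition(thread_t *p1, thread_t *p2, double t_rollback)
{
    /* Rollback: if the thread that last received capacity (P2 here)
     * improved by less than t_rollback, undo the transfer. */
    double delta = p2->mr_prev - p2->mr_shared;   /* MR_old - MR_new */
    if (delta < t_rollback && p2->quota_kb > GRAN_KB) {
        p2->quota_kb -= GRAN_KB;
        p1->quota_kb += GRAN_KB;
        return;
    }
    /* Otherwise move one unit toward the more-slowed-down thread. */
    if (x3(p2) > x3(p1) && p1->quota_kb > GRAN_KB) {
        p1->quota_kb -= GRAN_KB;
        p2->quota_kb += GRAN_KB;
    } else if (x3(p1) > x3(p2) && p2->quota_kb > GRAN_KB) {
        p2->quota_kb -= GRAN_KB;
        p1->quota_kb += GRAN_KB;
    }
}
```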
![Page 223: Multi-Core Architectures and Shared Resource …users.ece.cmu.edu/~omutlu/pub/onur-Bogazici-June-7-2013-lecture1-2..." Mutlu et al., “Runahead Execution: An Alternative to Very Large](https://reader031.vdocuments.site/reader031/viewer/2022030416/5aa2476a7f8b9a1f6d8d067a/html5/thumbnails/223.jpg)
Dynamic Fair Caching Results
n Improves both fairness and throughput
223
(Charts: normalized combined IPC and normalized fairness M1 for apsi+art, gzip+art, swim+gzip, tree+mcf, and AVG18 under PLRU, FairM1Dyn, FairM3Dyn, and FairM4Dyn.)
![Page 224: Multi-Core Architectures and Shared Resource …users.ece.cmu.edu/~omutlu/pub/onur-Bogazici-June-7-2013-lecture1-2..." Mutlu et al., “Runahead Execution: An Alternative to Very Large](https://reader031.vdocuments.site/reader031/viewer/2022030416/5aa2476a7f8b9a1f6d8d067a/html5/thumbnails/224.jpg)
Effect of Partitioning Interval
n Fine-grained partitioning is important for both fairness and throughput
224
(Charts: normalized combined IPC and normalized fairness M1 over AVG18 for repartitioning intervals of 10K, 20K, 40K, and 80K.)
![Page 225: Multi-Core Architectures and Shared Resource …users.ece.cmu.edu/~omutlu/pub/onur-Bogazici-June-7-2013-lecture1-2..." Mutlu et al., “Runahead Execution: An Alternative to Very Large](https://reader031.vdocuments.site/reader031/viewer/2022030416/5aa2476a7f8b9a1f6d8d067a/html5/thumbnails/225.jpg)
Benefits of Fair Caching n Problems of unfair cache sharing
q Sub-optimal throughput q Thread starvation q Priority inversion q Thread-mix dependent performance
n Benefits of fair caching q Better fairness q Better throughput q Fair caching likely simplifies OS scheduler design
225
![Page 226: Multi-Core Architectures and Shared Resource …users.ece.cmu.edu/~omutlu/pub/onur-Bogazici-June-7-2013-lecture1-2..." Mutlu et al., “Runahead Execution: An Alternative to Very Large](https://reader031.vdocuments.site/reader031/viewer/2022030416/5aa2476a7f8b9a1f6d8d067a/html5/thumbnails/226.jpg)
Advantages/Disadvantages of the Approach
n Advantages + No (reduced) starvation + Better average throughput
n Disadvantages - Scalable to many cores? - Is this the best (or a good) fairness metric? - Does this provide performance isolation in cache? - Alone miss rate estimation can be incorrect (estimation interval
different from enforcement interval)
226
![Page 227: Multi-Core Architectures and Shared Resource …users.ece.cmu.edu/~omutlu/pub/onur-Bogazici-June-7-2013-lecture1-2..." Mutlu et al., “Runahead Execution: An Alternative to Very Large](https://reader031.vdocuments.site/reader031/viewer/2022030416/5aa2476a7f8b9a1f6d8d067a/html5/thumbnails/227.jpg)
Software-Based Shared Cache Management n Assume no hardware support (demand based cache sharing, i.e.
LRU replacement) n How can the OS best utilize the cache?
n Cache sharing aware thread scheduling q Schedule workloads that “play nicely” together in the cache
n E.g., working sets together fit in the cache n Requires static/dynamic profiling of application behavior n Fedorova et al., “Improving Performance Isolation on Chip
Multiprocessors via an Operating System Scheduler,” PACT 2007.
n Cache sharing aware page coloring q Dynamically monitor miss rate over an interval and change
virtual to physical mapping to minimize miss rate n Try out different partitions
227
![Page 228: Multi-Core Architectures and Shared Resource …users.ece.cmu.edu/~omutlu/pub/onur-Bogazici-June-7-2013-lecture1-2..." Mutlu et al., “Runahead Execution: An Alternative to Very Large](https://reader031.vdocuments.site/reader031/viewer/2022030416/5aa2476a7f8b9a1f6d8d067a/html5/thumbnails/228.jpg)
OS Based Cache Partitioning n Lin et al., “Gaining Insights into Multi-Core Cache Partitioning: Bridging
the Gap between Simulation and Real Systems,” HPCA 2008. n Cho and Jin, “Managing Distributed, Shared L2 Caches through OS-
Level Page Allocation,” MICRO 2006.
n Static cache partitioning q Predetermines the amount of cache blocks allocated to each
program at the beginning of its execution q Divides shared cache to multiple regions and partitions cache
regions through OS page address mapping n Dynamic cache partitioning
q Adjusts cache quota among processes dynamically q Page re-coloring q Dynamically changes processes’ cache usage through OS page
address re-mapping 228
![Page 229: Multi-Core Architectures and Shared Resource …users.ece.cmu.edu/~omutlu/pub/onur-Bogazici-June-7-2013-lecture1-2..." Mutlu et al., “Runahead Execution: An Alternative to Very Large](https://reader031.vdocuments.site/reader031/viewer/2022030416/5aa2476a7f8b9a1f6d8d067a/html5/thumbnails/229.jpg)
Page Coloring n Physical memory divided into colors n Colors map to different cache sets n Cache partitioning
q Ensure two threads are allocated pages of different colors
229
(Diagram: memory pages of Thread A and Thread B, colored differently, map to disjoint sets across Way-1 … Way-n of the cache.)
![Page 230: Multi-Core Architectures and Shared Resource …users.ece.cmu.edu/~omutlu/pub/onur-Bogazici-June-7-2013-lecture1-2..." Mutlu et al., “Runahead Execution: An Alternative to Very Large](https://reader031.vdocuments.site/reader031/viewer/2022030416/5aa2476a7f8b9a1f6d8d067a/html5/thumbnails/230.jpg)
Page Coloring
(Diagram: virtual address = virtual page number + page offset; address translation yields physical address = physical page number + page offset; the physically indexed cache splits the address into cache tag, set index, and block offset. The page color bits are the bits the physical page number shares with the set index, and they are under OS control.)
• Physically indexed caches are divided into multiple regions (colors).
• All cache lines in a physical page are cached in one of those regions (colors).
OS can control the page color of a virtual page through address mapping (by selecting a physical page with a specific value in its page color bits).
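Extracting a page's color is a shift and a mask. A minimal sketch, assuming 4kB pages and the 16 colors used in the experimental setup later in this lecture:

```c
#include <stdint.h>

#define PAGE_SHIFT 12   /* 4kB pages */
#define NUM_COLORS 16   /* power of two, as in the setup below */

/* The page color bits are the low-order bits of the physical page
 * number that overlap the cache set index. */
static unsigned page_color(uint64_t phys_addr)
{
    return (unsigned)((phys_addr >> PAGE_SHIFT) & (NUM_COLORS - 1));
}
```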
![Page 231: Multi-Core Architectures and Shared Resource …users.ece.cmu.edu/~omutlu/pub/onur-Bogazici-June-7-2013-lecture1-2..." Mutlu et al., “Runahead Execution: An Alternative to Very Large](https://reader031.vdocuments.site/reader031/viewer/2022030416/5aa2476a7f8b9a1f6d8d067a/html5/thumbnails/231.jpg)
Static Cache Partitioning using Page Coloring
(Diagram: physical pages are grouped into page bins according to their page color: 1, 2, 3, 4, …, i, i+1, i+2, …. Through OS address mapping, Process 1 and Process 2 receive pages from disjoint bins, so they occupy disjoint regions of the physically indexed cache.)
Shared cache is partitioned between two processes through address mapping.
Cost: Main memory space needs to be partitioned, too.
![Page 232: Multi-Core Architectures and Shared Resource …users.ece.cmu.edu/~omutlu/pub/onur-Bogazici-June-7-2013-lecture1-2..." Mutlu et al., “Runahead Execution: An Alternative to Very Large](https://reader031.vdocuments.site/reader031/viewer/2022030416/5aa2476a7f8b9a1f6d8d067a/html5/thumbnails/232.jpg)
Dynamic Cache Partitioning via Page Re-Coloring
(Diagram: a page color table with entries 0, 1, 2, 3, …, N-1 tracks the colors currently allocated to the process.)
n Page re-coloring:
q Allocate page in new color
q Copy memory contents
q Free old page
n Pages of a process are organized into linked lists by their colors.
n Memory allocation guarantees that pages are evenly distributed into all the lists (colors) to avoid hot spots.
![Page 233: Multi-Core Architectures and Shared Resource …users.ece.cmu.edu/~omutlu/pub/onur-Bogazici-June-7-2013-lecture1-2..." Mutlu et al., “Runahead Execution: An Alternative to Very Large](https://reader031.vdocuments.site/reader031/viewer/2022030416/5aa2476a7f8b9a1f6d8d067a/html5/thumbnails/233.jpg)
Dynamic Partitioning in Dual Core
Init: partition the cache as (8:8)
Until the workload finishes, repeat:
1. Run the current partition (P0:P1) for one epoch
2. Try one epoch for each of the two neighboring partitions: (P0 - 1 : P1 + 1) and (P0 + 1 : P1 - 1)
3. Choose the next partition using the best policy-metric measurement (e.g., cache miss rate)
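The loop is a one-dimensional hill climb over the color split. A sketch with assumed hooks; run_epoch() and finished() stand in for the real measurement machinery and are not an actual API:

```c
/* Run one epoch with p0 colors for process 0 (16 - p0 for process 1)
 * and return the policy metric (e.g., combined miss rate; lower is
 * better). Assumed hooks, declared here only for the sketch. */
extern double run_epoch(int p0);
extern int    finished(void);

void dynamic_partition(void)
{
    int p0 = 8;                       /* init: partition as (8:8) */
    while (!finished()) {
        double cur   = run_epoch(p0);
        double left  = (p0 > 1)  ? run_epoch(p0 - 1) : 1e30;
        double right = (p0 < 15) ? run_epoch(p0 + 1) : 1e30;
        /* Keep whichever of the three tried partitions measured best. */
        if (left < cur && left <= right)      p0 -= 1;
        else if (right < cur && right < left) p0 += 1;
    }
}
```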
![Page 234: Multi-Core Architectures and Shared Resource …users.ece.cmu.edu/~omutlu/pub/onur-Bogazici-June-7-2013-lecture1-2..." Mutlu et al., “Runahead Execution: An Alternative to Very Large](https://reader031.vdocuments.site/reader031/viewer/2022030416/5aa2476a7f8b9a1f6d8d067a/html5/thumbnails/234.jpg)
Experimental Environment
n Dell PowerEdge1950 q Two-way SMP, Intel dual-core Xeon 5160 q Shared 4MB L2 cache, 16-way q 8GB Fully Buffered DIMM
n Red Hat Enterprise Linux 4.0 q 2.6.20.3 kernel q Performance counter tools from HP (Pfmon) q Divide L2 cache into 16 colors
![Page 235: Multi-Core Architectures and Shared Resource …users.ece.cmu.edu/~omutlu/pub/onur-Bogazici-June-7-2013-lecture1-2..." Mutlu et al., “Runahead Execution: An Alternative to Very Large](https://reader031.vdocuments.site/reader031/viewer/2022030416/5aa2476a7f8b9a1f6d8d067a/html5/thumbnails/235.jpg)
Performance – Static & Dynamic
n Aim to minimize combined miss rate
n For RG-type, and some RY-type workloads: static partitioning outperforms dynamic partitioning
n For RR-type, and the remaining RY-type workloads: dynamic partitioning outperforms static partitioning
Software vs. Hardware Cache Management n Software advantages
+ No need to change hardware + Easier to upgrade/change algorithm (not burned into hardware)
n Disadvantages
- Less flexible: large granularity (page-based instead of way/block) - Limited page colors à reduced performance per application
(limited physical memory space!), reduced flexibility - Changing partition size has high overhead à page mapping
changes - Adaptivity is slow: hardware can adapt every cycle (possibly) - Not enough information exposed to software (e.g., number of
misses due to inter-thread conflict) 236
![Page 237: Multi-Core Architectures and Shared Resource …users.ece.cmu.edu/~omutlu/pub/onur-Bogazici-June-7-2013-lecture1-2..." Mutlu et al., “Runahead Execution: An Alternative to Very Large](https://reader031.vdocuments.site/reader031/viewer/2022030416/5aa2476a7f8b9a1f6d8d067a/html5/thumbnails/237.jpg)
Private/Shared Caching n Example: Adaptive spill/receive caching
n Goal: Achieve the benefits of private caches (low latency, performance isolation) while sharing cache capacity across cores
n Idea: Start with a private cache design (for performance isolation), but dynamically steal space from other cores that do not need all their private caches q Some caches can spill their data to other cores’ caches
dynamically
n Qureshi, “Adaptive Spill-Receive for Robust High-Performance Caching in CMPs,” HPCA 2009.
237
![Page 238: Multi-Core Architectures and Shared Resource …users.ece.cmu.edu/~omutlu/pub/onur-Bogazici-June-7-2013-lecture1-2..." Mutlu et al., “Runahead Execution: An Alternative to Very Large](https://reader031.vdocuments.site/reader031/viewer/2022030416/5aa2476a7f8b9a1f6d8d067a/html5/thumbnails/238.jpg)
238
Revisiting Private Caches on CMP
Private caches avoid the need for shared interconnect ++ fast latency, tiled design, performance isolation
(Diagram: Cores A–D, each with its own I$, D$, and private CACHE A–D, connect directly to memory with no shared interconnect.)
Problem: When one core needs more cache and other core has spare cache, private-cache CMPs cannot share capacity
![Page 239: Multi-Core Architectures and Shared Resource …users.ece.cmu.edu/~omutlu/pub/onur-Bogazici-June-7-2013-lecture1-2..." Mutlu et al., “Runahead Execution: An Alternative to Very Large](https://reader031.vdocuments.site/reader031/viewer/2022030416/5aa2476a7f8b9a1f6d8d067a/html5/thumbnails/239.jpg)
239
Cache Line Spilling
Spill evicted line from one cache to neighbor cache
- Co-operative caching (CC) [Chang+ ISCA'06]
Problem with CC:
1. Performance depends on the parameter (spill probability)
2. All caches spill as well as receive → limited improvement
(Diagram: Caches A–D in a row; an evicted line spills from one cache into its neighbor.)
Goal: Robust High-Performance Capacity Sharing with Negligible Overhead
![Page 240: Multi-Core Architectures and Shared Resource …users.ece.cmu.edu/~omutlu/pub/onur-Bogazici-June-7-2013-lecture1-2..." Mutlu et al., “Runahead Execution: An Alternative to Very Large](https://reader031.vdocuments.site/reader031/viewer/2022030416/5aa2476a7f8b9a1f6d8d067a/html5/thumbnails/240.jpg)
240
Spill-Receive Architecture
Each Cache is either a Spiller or Receiver but not both - Lines from spiller cache are spilled to one of the receivers - Evicted lines from receiver cache are discarded
What is the best N-bit binary string that maximizes the performance of the Spill-Receive Architecture? → Dynamic Spill-Receive (DSR)
(Diagram: Caches A and C are spillers (S/R = 1), Caches B and D are receivers (S/R = 0); spilled lines flow from the spillers into the receivers.)
![Page 241: Multi-Core Architectures and Shared Resource …users.ece.cmu.edu/~omutlu/pub/onur-Bogazici-June-7-2013-lecture1-2..." Mutlu et al., “Runahead Execution: An Alternative to Very Large](https://reader031.vdocuments.site/reader031/viewer/2022030416/5aa2476a7f8b9a1f6d8d067a/html5/thumbnails/241.jpg)
241
Dynamic Spill-Receive via “Set Dueling”
(Diagram: the cache's sets are divided into spiller sets, receiver sets, and follower sets.)
Divide the cache in three:
– Spiller sets
– Receiver sets
– Follower sets (follow the winner of spiller vs. receiver)
n-bit PSEL counter: a miss to a spiller set decrements PSEL; a miss to a receiver set increments PSEL.
The MSB of PSEL decides the policy for the follower sets:
– MSB = 0 → use receive
– MSB = 1 → use spill
monitor → choose → apply (using a single counter)
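The set-dueling machinery is a single saturating counter per cache. A sketch consistent with the update rules above (a later slide notes the implementation uses 32 dedicated sets and 10-bit PSEL counters):

```c
#define PSEL_BITS 10
#define PSEL_MAX  ((1 << PSEL_BITS) - 1)

typedef struct { int psel; } dsr_cache_t;

/* Misses in the dedicated sets steer PSEL: misses caused while
 * spilling push it down, misses caused while receiving push it up. */
void dsr_on_miss(dsr_cache_t *c, int in_spiller_set, int in_receiver_set)
{
    if (in_spiller_set && c->psel > 0)
        c->psel--;
    else if (in_receiver_set && c->psel < PSEL_MAX)
        c->psel++;
}

/* Follower sets use the MSB: 1 -> spill, 0 -> receive. */
int dsr_follower_spills(const dsr_cache_t *c)
{
    return (c->psel >> (PSEL_BITS - 1)) & 1;
}
```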
![Page 242: Multi-Core Architectures and Shared Resource …users.ece.cmu.edu/~omutlu/pub/onur-Bogazici-June-7-2013-lecture1-2..." Mutlu et al., “Runahead Execution: An Alternative to Very Large](https://reader031.vdocuments.site/reader031/viewer/2022030416/5aa2476a7f8b9a1f6d8d067a/html5/thumbnails/242.jpg)
242
Dynamic Spill-Receive Architecture
(Diagram: each of Caches A–D dedicates Set X as AlwaysSpill and Set Y as AlwaysRecv. A miss in Set X in any cache decrements that cache's PSEL (-), a miss in Set Y increments it (+); PSEL A then decides the policy for all sets of Cache A (except X and Y), and likewise for PSEL B, C, and D.)
Each cache learns whether it should act as a spiller or receiver
![Page 243: Multi-Core Architectures and Shared Resource …users.ece.cmu.edu/~omutlu/pub/onur-Bogazici-June-7-2013-lecture1-2..." Mutlu et al., “Runahead Execution: An Alternative to Very Large](https://reader031.vdocuments.site/reader031/viewer/2022030416/5aa2476a7f8b9a1f6d8d067a/html5/thumbnails/243.jpg)
243
Experimental Setup
q Baseline Study:
n 4-core CMP with in-order cores
n Private Cache Hierarchy: 16KB L1, 1MB L2
n 10-cycle latency for local hits, 40 cycles for remote hits
q Benchmarks:
n 6 benchmarks that have extra cache: “Givers” (G)
n 6 benchmarks that benefit from more cache: “Takers” (T)
n All 4-thread combinations of 12 benchmarks: 495 total
n Five types of workloads: G4T0, G3T1, G2T2, G1T3, G0T4
![Page 244: Multi-Core Architectures and Shared Resource …users.ece.cmu.edu/~omutlu/pub/onur-Bogazici-June-7-2013-lecture1-2..." Mutlu et al., “Runahead Execution: An Alternative to Very Large](https://reader031.vdocuments.site/reader031/viewer/2022030416/5aa2476a7f8b9a1f6d8d067a/html5/thumbnails/244.jpg)
244
Results for Throughput
(Chart: throughput normalized to NoSpill for Shared (LRU), CC (Best), DSR, and StaticBest, over Gmean-G4T0, Gmean-G3T1, Gmean-G2T2, Gmean-G1T3, Gmean-G0T4, and Avg (All 495).)
On average, DSR improves throughput by 18%, co-operative caching by 7%. DSR provides 90% of the benefit of knowing the best decisions a priori.
* DSR implemented with 32 dedicated sets and 10-bit PSEL counters
![Page 245: Multi-Core Architectures and Shared Resource …users.ece.cmu.edu/~omutlu/pub/onur-Bogazici-June-7-2013-lecture1-2..." Mutlu et al., “Runahead Execution: An Alternative to Very Large](https://reader031.vdocuments.site/reader031/viewer/2022030416/5aa2476a7f8b9a1f6d8d067a/html5/thumbnails/245.jpg)
245
Results for Weighted Speedup
(Chart: weighted speedup for Shared (LRU), Baseline (NoSpill), DSR, and CC (Best), over Gmean-G4T0 … Gmean-G0T4 and Avg (All 495).)
On average, DSR improves weighted speedup by 13%
![Page 246: Multi-Core Architectures and Shared Resource …users.ece.cmu.edu/~omutlu/pub/onur-Bogazici-June-7-2013-lecture1-2..." Mutlu et al., “Runahead Execution: An Alternative to Very Large](https://reader031.vdocuments.site/reader031/viewer/2022030416/5aa2476a7f8b9a1f6d8d067a/html5/thumbnails/246.jpg)
246
Results for Hmean Speedup
(Chart: Hmean speedup for Baseline (NoSpill), DSR, and CC (Best), over Gmean-G4T0 … Gmean-G0T4 and Avg All (495).)
On average, DSR improves Hmean Fairness from 0.58 to 0.78
![Page 247: Multi-Core Architectures and Shared Resource …users.ece.cmu.edu/~omutlu/pub/onur-Bogazici-June-7-2013-lecture1-2..." Mutlu et al., “Runahead Execution: An Alternative to Very Large](https://reader031.vdocuments.site/reader031/viewer/2022030416/5aa2476a7f8b9a1f6d8d067a/html5/thumbnails/247.jpg)
247
DSR vs. Faster Shared Cache
DSR (with 40 cycles extra for remote hits) performs similarly to a shared cache with zero latency overhead and a crossbar interconnect
(Chart: throughput normalized to NoSpill for Shared (same latency as private), DSR (0-cycle remote hit), DSR (10-cycle remote hit), and DSR (40-cycle remote hit), over Gmean-G4T0 … Gmean-G0T4 and Avg All (495).)
![Page 248: Multi-Core Architectures and Shared Resource …users.ece.cmu.edu/~omutlu/pub/onur-Bogazici-June-7-2013-lecture1-2..." Mutlu et al., “Runahead Execution: An Alternative to Very Large](https://reader031.vdocuments.site/reader031/viewer/2022030416/5aa2476a7f8b9a1f6d8d067a/html5/thumbnails/248.jpg)
248
Scalability of DSR
DSR improves average throughput by 19% for both systems (No performance degradation for any of the workloads)
(Chart: normalized throughput over NoSpill for 100 workloads, 8/16 SPEC benchmarks chosen randomly, on 8-core and 16-core systems.)
![Page 249: Multi-Core Architectures and Shared Resource …users.ece.cmu.edu/~omutlu/pub/onur-Bogazici-June-7-2013-lecture1-2..." Mutlu et al., “Runahead Execution: An Alternative to Very Large](https://reader031.vdocuments.site/reader031/viewer/2022030416/5aa2476a7f8b9a1f6d8d067a/html5/thumbnails/249.jpg)
249
Quality of Service with DSR
For 1% of the 495×4 = 1980 apps, DSR causes an IPC loss of > 5%. In some cases it is important to ensure that performance does not degrade compared to the dedicated private cache → QoS.
DSR can ensure QoS by changing the PSEL counters by the weight of a miss:
ΔMiss = MissesWithDSR - MissesWithNoSpill (MissesWithNoSpill is estimated by the spiller sets)
Weight of Miss = 1 + Max(0, f(ΔMiss))
Over time, ΔMiss > 0 if DSR is causing more misses.
Calculate the weight every 4M cycles. Needs 3 counters per core.
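A sketch of the QoS bookkeeping; the slides leave f unspecified, so a simple positive-part scaling is assumed here purely for illustration:

```c
#include <stdint.h>

typedef struct {
    uint64_t misses_dsr;      /* misses observed with DSR */
    uint64_t misses_nospill;  /* misses estimated by the spiller sets */
    int      miss_weight;     /* scales this core's PSEL updates */
} qos_state_t;

/* Recomputed every 4M cycles; needs 3 counters per core. */
void qos_update_weight(qos_state_t *q)
{
    int64_t delta = (int64_t)q->misses_dsr - (int64_t)q->misses_nospill;
    /* delta > 0 means DSR is causing extra misses for this core, so
     * weight its misses more and the policy backs off toward the
     * private-cache behavior. f() assumed linear here. */
    int64_t f = delta / 1024;                   /* assumed scaling */
    q->miss_weight = 1 + (int)(f > 0 ? f : 0);  /* 1 + Max(0, f(delta)) */
}
```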
![Page 250: Multi-Core Architectures and Shared Resource …users.ece.cmu.edu/~omutlu/pub/onur-Bogazici-June-7-2013-lecture1-2..." Mutlu et al., “Runahead Execution: An Alternative to Very Large](https://reader031.vdocuments.site/reader031/viewer/2022030416/5aa2476a7f8b9a1f6d8d067a/html5/thumbnails/250.jpg)
250
IPC of QoS-Aware DSR
IPC curves for other categories almost overlap for the two schemes. Avg. throughput improvement across all 495 workloads similar (17.5% vs. 18%)
(Chart: IPC normalized to NoSpill for DSR vs. QoS-Aware DSR over the 60 apps (15 workloads × 4 apps each) in category G0T4.)
![Page 251: Multi-Core Architectures and Shared Resource …users.ece.cmu.edu/~omutlu/pub/onur-Bogazici-June-7-2013-lecture1-2..." Mutlu et al., “Runahead Execution: An Alternative to Very Large](https://reader031.vdocuments.site/reader031/viewer/2022030416/5aa2476a7f8b9a1f6d8d067a/html5/thumbnails/251.jpg)
Distributed Caches
251
![Page 252: Multi-Core Architectures and Shared Resource …users.ece.cmu.edu/~omutlu/pub/onur-Bogazici-June-7-2013-lecture1-2..." Mutlu et al., “Runahead Execution: An Alternative to Very Large](https://reader031.vdocuments.site/reader031/viewer/2022030416/5aa2476a7f8b9a1f6d8d067a/html5/thumbnails/252.jpg)
Caching for Parallel Applications
252
(Diagram: a tiled multicore; each tile contains a core and an L2 cache slice.)
n Data placement determines performance
n Goal: place data on chip close to where they are used
![Page 253: Multi-Core Architectures and Shared Resource …users.ece.cmu.edu/~omutlu/pub/onur-Bogazici-June-7-2013-lecture1-2..." Mutlu et al., “Runahead Execution: An Alternative to Very Large](https://reader031.vdocuments.site/reader031/viewer/2022030416/5aa2476a7f8b9a1f6d8d067a/html5/thumbnails/253.jpg)
Shared Cache Management: Research Topics n Scalable partitioning algorithms
q Distributed caches have different tradeoffs
n Configurable partitioning algorithms q Many metrics may need to be optimized at different times or
at the same time q It is not only about overall performance
n Ability to have high capacity AND high locality (fast access) n Within vs. across-application prioritization n Holistic design
q How to manage caches, NoC, and memory controllers together?
n Cache coherence in shared/private distributed caches
253