Handling the Problems and Opportunities Posed by Multiple
On-Chip Memory Controllers
Manu Awasthi , David Nellans, Kshitij Sudan, Rajeev Balasubramonian, Al Davis
University of Utah
Takeaway
• Multiple on-chip MCs will be common in future CMPs, with multiple cores sharing one MC
  – NUMA memory hierarchies across multiple sockets
  – Intelligent data mapping is required to reduce average memory access delay
• A hardware-software co-design approach is required for efficient data placement
  – Minimal software involvement
• Data placement needs to be aware of system parameters
  – Row-buffer hit rates, queuing delays, physical proximity, etc.
NUMA - Today
[Figure: conceptual representation of a four-socket Nehalem machine. Each socket holds four cores and one on-chip memory controller (MC) driving three DIMMs over a memory channel; sockets are connected by QPI links. Legend: MC = on-chip memory controller, QPI = interconnect, DIMM = DRAM, with socket boundaries marked.]
NUMA - Future
[Figure: a future CMP with multiple on-chip MCs: 16 cores, each with a private L2 cache, connected by an on-chip interconnect; four memory controllers (MC1-MC4) sit at the chip edges, each with its own memory channel to a DIMM.]
Local Memory Access
• Accessing local memory is fast!
[Figure: on the four-socket machine, a core in Socket 1 sends an address to its own socket's MC and receives data from a locally attached DIMM.]
Problem 1 - Remote Memory Access
• Data for Core N can be anywhere!
[Figure: a core in Socket 1 issues an address that is routed over QPI to a remote socket's MC; the data comes back from that socket's DIMMs, crossing QPI again on the return trip.]
Memory Access Stream - Single Core
[Figure: the memory controller request queue contains requests almost entirely from one program (Prog 1 on CPU 0), with the occasional context-switched program (Prog 2) mixed in.]
• Single cores executed a handful of context-switched programs.
• Spatio-temporal locality can be exploited!
Problem 2 - Memory Access Stream - CMPs
[Figure: the memory controller request queue now mixes requests from many programs (Prog 0 through Prog 6) running on different CPUs.]
• Memory accesses from different cores get interleaved, leading to loss of spatio-temporal locality.
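The locality loss can be illustrated with a toy open-row model (not from the paper: a single DRAM bank with one open row, where an access hits if it falls in the same row as the previous access; the row size and address layout below are assumptions):

```python
# Toy model of DRAM row-buffer locality (illustrative; all parameters assumed).
ROW_SIZE = 2048  # bytes covered by one open DRAM row

def row_hit_rate(addresses):
    """Fraction of accesses hitting the currently open row of a single bank."""
    hits, open_row = 0, None
    for addr in addresses:
        row = addr // ROW_SIZE
        if row == open_row:
            hits += 1
        open_row = row
    return hits / len(addresses)

# Four "programs", each streaming sequentially through its own 1 MB region.
streams = [[base + 64 * i for i in range(256)]
           for base in (0, 1 << 20, 2 << 20, 3 << 20)]

solo = streams[0]  # single core: one program's requests arrive in order
interleaved = [s[i] for i in range(256) for s in streams]  # 4 cores, round-robin

print(row_hit_rate(solo))         # high: sequential accesses reuse the open row
print(row_hit_rate(interleaved))  # near zero: every access opens a new row
```

Even though each stream is perfectly sequential on its own, round-robin interleaving makes consecutive requests land in different rows, which is exactly the row-buffer interference the slide describes.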
Problem 3 - Increased Overheads for Memory Accesses
[Figure: queuing delays grow sharply when moving from 1 core/1 thread to 16 cores/16 threads.]
Problem 4 - Pin Limitations
[Figure: 16-core chips drawn with 8 and with 16 memory controllers; pin bandwidth cannot support ever more controllers.]
• Pin bandwidth is limited: the number of MCs cannot keep growing with core counts.
• A small number of MCs will have to handle all traffic.
Problems Summary - I
• Pin limitations imply an increase in queuing delay
  – Almost 8x increase in queuing delay from 1 core/1 thread to 16 cores/16 threads
• Multi-core implies an increase in row-buffer interference
  – Increasingly randomized memory access stream
  – Row-buffer hit rates bound to go down
• Longer on- and off-chip wire delays imply an increase in the NUMA factor
  – NUMA factor is already around 1.5 today
Problems Summary - II
• DRAM access time in systems with multiple on-chip MCs is governed by
  – Distance between the requesting core and the responding MC
  – Load on the on-chip interconnect
  – Average queuing delay at the responding MC
  – Bank and rank contention at the target DIMM
  – Row-buffer hit rate at the responding MC

Bottom line: intelligent management of data is required.
Adaptive First Touch Policy
• Basic idea: assign each new virtual page to a DRAM (physical) page belonging to the MC j that minimizes the cost function

cost_j = α × load_j + β × rowhits_j + λ × distance_j

where load_j is a measure of queuing delay at the MC, rowhits_j a measure of locality at the DRAM, and distance_j a measure of physical proximity to the requesting core. The constants α, β and λ can be made programmable.
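As a concrete sketch, the first-touch decision is a minimization over the MCs. The per-MC statistics, dictionary layout, and example numbers below are hypothetical placeholders for illustration, not the paper's implementation:

```python
# Adaptive First Touch sketch: place a newly touched virtual page at the MC j
# minimizing cost_j = alpha*load_j + beta*rowhits_j + lambda*distance_j.
ALPHA, BETA, LAM = 10, 20, 100  # programmable weights (values from the talk's setup)

def aft_cost(mc, core):
    return (ALPHA * mc["load"]              # queue occupancy: queuing-delay proxy
            + BETA * mc["rowhit_penalty"]   # penalty for poor expected row-buffer locality
            + LAM * mc["distance"][core])   # hops from requesting core to this MC

def assign_page(mcs, core):
    """Index of the MC that should receive the newly touched page."""
    return min(range(len(mcs)), key=lambda j: aft_cost(mcs[j], core))

# Hypothetical snapshot: MC0 is close but busy, MC1 is idle but far away.
mcs = [
    {"load": 12, "rowhit_penalty": 3, "distance": {0: 1}},
    {"load": 2,  "rowhit_penalty": 5, "distance": {0: 4}},
]
print(assign_page(mcs, core=0))  # → 0 (proximity outweighs the load difference here)
```

With λ much larger than α, distance dominates unless an MC's queue grows very long, which matches the slide's point that the weights trade off queuing delay, locality, and proximity.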
Dynamic Page Migration Policy
• Programs change phases!
  – They can completely stop touching new pages
  – They can change the frequency of access to a subset of pages
• This leads to imbalance in MC accesses
  – For long-running programs with varying working sets, AFT can leave some MCs overloaded

Solution: dynamically migrate pages between MCs at runtime to reduce the imbalance.
Dynamic Page Migration Policy
[Figure: on the 16-core CMP, MC3 is heavily loaded (the donor MC) while the remaining MCs are lightly loaded.]
Dynamic Page Migration Policy (contd.)
[Figure: the migration steps on the same CMP: select N pages at the donor MC (MC3), select a recipient MC, and copy the N pages from donor to recipient.]
Dynamic Page Migration Policy - Challenges
• Selecting the recipient MC
  – Move pages to the MC k with the least value of the cost function

cost_k = Λ × distance_k + Γ × rowhits_k

  – The distance term favors a physically proximal MC; the row-hits term minimizes interference at the recipient MC
• Selecting the N pages to migrate
  – Empirically select the best possible value
  – Can also be made programmable
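Recipient selection can be sketched the same way as first touch (names, statistics, and numbers below are hypothetical placeholders, not the paper's implementation):

```python
# Recipient-selection sketch: among the non-donor MCs, pick k minimizing
# cost_k = Lambda*distance_k + Gamma*rowhits_k.
LAM_MIG, GAMMA = 100, 100  # programmable weights

def pick_recipient(mcs, donor, core):
    """Index of the MC that should receive the migrated pages."""
    candidates = [k for k in range(len(mcs)) if k != donor]
    return min(candidates,
               key=lambda k: LAM_MIG * mcs[k]["distance"][core]
                             + GAMMA * mcs[k]["rowhit_penalty"])

# Hypothetical snapshot: MC0 is the overloaded donor.
mcs = [
    {"distance": {0: 1}, "rowhit_penalty": 9},  # donor, excluded from candidates
    {"distance": {0: 2}, "rowhit_penalty": 2},  # close and lightly interfered
    {"distance": {0: 5}, "rowhit_penalty": 1},  # lightly interfered but far away
]
print(pick_recipient(mcs, donor=0, core=0))  # → 1
```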
Dynamic Page Migration Policy - Overheads
• Pages are physically copied to new addresses
  – The original address mapping has to be invalidated
  – Cache lines belonging to copied pages must be invalidated
• Copying pages can block resources, leading to unnecessary stalls
• Instant TLB invalidates could cause misses to memory even when the data is still present
• Solution: Lazy Copying
  – Essentially, a delayed write-back
Issues with TLB Invalidates
[Figure: while pages A and B are being copied from the donor MC to the recipient MC, TLB invalidates are sent to all cores immediately; a read of A at its new address A' must stall in the OS until the copy completes.]
Lazy Copying
[Figure: while pages A and B are being copied, their TLB entries are marked read-only instead of being invalidated, so a read of A is still serviced from the old location. The OS flushes dirty cache lines; once the copy completes, TLB updates are broadcast and a read of A is serviced from the new address A'.]
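The ordering above can be sketched schematically (the helper callbacks and the page/TLB data layout are placeholders, not a real OS interface):

```python
# Lazy-copying schematic for one migrating page: keep the old mapping valid
# but read-only during the copy, flush dirty lines, then update all TLBs once.

def lazy_copy(page, tlbs, copy_page, flush_dirty_lines):
    page["state"] = "read-only"              # reads keep hitting the old address
    copy_page(page["old_addr"], page["new_addr"])
    flush_dirty_lines(page["old_addr"])      # make the new copy consistent
    for tlb in tlbs:                         # one bulk TLB update at the end,
        tlb[page["vpn"]] = page["new_addr"]  # instead of early invalidates + stalls
    page["state"] = "mapped"

ops = []
page = {"vpn": 0xA, "old_addr": 0x1000, "new_addr": 0x9000, "state": "mapped"}
tlbs = [{0xA: 0x1000} for _ in range(4)]     # four cores' TLB entries for the page
lazy_copy(page, tlbs,
          copy_page=lambda src, dst: ops.append(("copy", src, dst)),
          flush_dirty_lines=lambda addr: ops.append(("flush", addr)))
print(ops)                      # copy happens before the flush, both before TLB updates
print(tlbs[0][0xA] == 0x9000)   # True: every core now maps the page to its new address
```

The key property is that no TLB entry is touched until the copy is complete, which is why reads never stall the way they do with instant invalidates.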
Methodology
• Simics-based simulation platform
• DRAMSim-based DRAM timing
• DRAM energy figures from CACTI 6.5
• Baseline: assign pages to the closest MC

CPU: 16-core out-of-order CMP, 3 GHz
L1 inst. and data caches: private, 32 KB/2-way, 1-cycle access
L2 unified cache: shared, 2 MB/8-way, 4x4 S-NUCA, 3-cycle bank access
Total DRAM capacity: 4 GB
DIMM configuration: 8 DIMMs, 1 rank/DIMM, 64-bit channel, 8 devices/DIMM
α, β, λ, Λ, Γ: 10, 20, 100, 100, 100
Results - Throughput
[Figure: throughput improvement over the baseline. AFT: 17.1%, Dynamic Page Migration: 34.8%.]
Results - DRAM Locality
[Figure: improvement in DRAM locality. AFT: 16.6%, Dynamic Page Migration: 22.7%. The standard deviation across MCs goes down, indicating increased fairness.]
Results - Reasons for Benefits
[Figure]
Sensitivity Studies
• Lazy Copying does help, a little
  – Average 3.2% improvement over migration without lazy copying
• Terms/variables in the cost function
  – Very sensitive to load and row-buffer hit rates, not as much to distance
• Cost of TLB shootdowns
  – Negligible, since fairly uncommon
• Physical placement of MCs (center or periphery)
  – Most workloads are agnostic to physical placement
Summary
• Multiple on-chip MCs will be common in future CMPs, with multiple cores sharing one MC
  – Intelligent data mapping will be needed to reduce average memory access delay
• Adaptive First Touch policy
  – Increases performance by 17.1%
  – Decreases DRAM energy consumption by 14.1%
• Dynamic page migration, an improvement on AFT
  – Further improves on AFT by 17.7%, for 34.8% over the baseline
  – Increases energy consumption by 5.2%
Thank You
http://www.cs.utah.edu/arch-research