1
Towards Scalable and Energy-Efficient Memory System Architectures
Rajeev Balasubramonian
School of Computing, University of Utah
2
Main Memory Problems
[Diagram: processor connected to two DIMMs]
1. Energy
2. High capacity at high bandwidth
3. Reliability
3
Motivation: Memory Energy
Contributions of memory to overall system energy:
• 25-40% in IBM, Sun, and Google servers (data summarized by Meisner et al., ASPLOS’09)
• HP servers: 175 W out of ~785 W for 256 GB memory (HP power calculator)
• Intel SCC: memory controller contributes 19-69% of chip power, ISSCC’10
4
Motivation: Reliability
• DRAM data from Schroeder et al., SIGMETRICS’09: 25K-70K errors per billion device hours per Mbit; 8% of DRAM DIMMs are affected by errors every year
• DRAM error rates may get worse as scalability limits are reached; PCM error rates (both hard and soft) are expected to be high as well
• Primary concern: storage and energy overheads for error detection and correction
• ECC support is not too onerous; chip-kill is much worse
5
Motivation: Capacity, Bandwidth
[Diagram: processor connected to two DIMMs]
6
Motivation: Capacity, Bandwidth
[Diagram: processor connected to two DIMMs]
Cores are increasing, but pins are not
7
Motivation: Capacity, Bandwidth
[Diagram: processor connected to two DIMMs]
Cores are increasing, but pins are not
High channel frequency → fewer DIMMs per channel
Will eventually need disruptive shifts: NVM, optics
Can’t have high capacity, high bandwidth, and low energy. Pick 2 of the 3!
8
Memory System Basics
[Diagram: processor with multiple on-chip memory controllers (M), each driving a channel of DIMMs]
Multiple on-chip memory controllers that handle multiple 64-bit channels
9
Memory System Basics: FB-DIMM
[Diagram: processor with memory controllers driving daisy-chained FB-DIMM channels]
FB-DIMM: can boost capacity with narrow channels and buffering at each DIMM
10
What’s a Rank?
[Diagram: a DIMM with eight x8 DRAM chips on a 64-bit channel]
Rank: the DRAM chips required to provide the 64-bit output expected by a JEDEC standard bus
For example: eight x8 DRAM chips
11
What’s a Bank?
[Diagram: a DIMM with eight x8 DRAM chips; one bank spans a portion of each chip]
Bank: a portion of a rank that is tied up when servicing a request; multiple banks in a rank enable parallel handling of multiple requests
12
What’s an Array?
[Diagram: a bank within the DRAM chips, composed of many arrays]
Array: a matrix of cells; one array provides 1 bit/cycle
Each array reads out an entire row; large array → high density
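To make the channel/rank/bank/row hierarchy concrete, here is a minimal sketch of how a memory controller might decompose a physical address into DRAM coordinates; the field widths are illustrative assumptions, not a JEDEC-mandated mapping:

```python
# Sketch: decompose a physical address into DRAM coordinates.
# Field widths are illustrative assumptions, not a JEDEC mapping.

def decode(addr):
    offset = addr & 0x3F               # 6 bits: byte within a 64 B cache line
    addr >>= 6
    channel = addr & 0x1; addr >>= 1   # assume 2 channels
    rank    = addr & 0x1; addr >>= 1   # assume 2 ranks per channel
    bank    = addr & 0x7; addr >>= 3   # assume 8 banks per rank
    column  = addr & 0x7F; addr >>= 7  # assume 128 cache lines per row
    row     = addr                     # remaining bits select the row
    return dict(channel=channel, rank=rank, bank=bank,
                row=row, column=column, offset=offset)

print(decode(0x12345678))
```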
13
What’s a Row Buffer?
[Diagram: a DRAM array with wordlines and bitlines feeding a row buffer; RAS selects the row, CAS selects the column driven to the output pin]
14
Row Buffer Management
• Row buffer: collection of rows read out by arrays in a bank
• Row buffer hits incur low latency and low energy
• Bitlines must be precharged before a new row can be read
• Open page policy: delays the precharge until a different row is encountered
• Close page policy: issues the precharge immediately
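A toy sketch of the two policies on a single bank; the latency constants and the model are illustrative assumptions, not DRAM datasheet values:

```python
# Toy model of open-page vs. close-page row buffer policies.
# Latency constants are illustrative assumptions.
T_PRECHARGE = 15   # close the open row (restore bitlines)
T_ACTIVATE  = 15   # read a row into the row buffer
T_CAS       = 15   # column access from the row buffer

def access_latency(requests, policy="open"):
    """Return total latency for a sequence of row numbers hitting one bank."""
    total, open_row = 0, None
    for row in requests:
        if policy == "open":
            if row == open_row:            # row buffer hit: column access only
                total += T_CAS
            elif open_row is None:         # bank idle: activate + access
                total += T_ACTIVATE + T_CAS
            else:                          # conflict: precharge + activate + access
                total += T_PRECHARGE + T_ACTIVATE + T_CAS
            open_row = row                 # keep the row open
        else:  # close-page: precharge immediately after every access
            total += T_ACTIVATE + T_CAS    # the precharge overlaps with idle time
            open_row = None
    return total

reqs = [3, 3, 3, 7, 7, 3]                  # locality favors open-page
print(access_latency(reqs, "open"))        # 165: hits on repeated rows
print(access_latency(reqs, "close"))       # 180: every access pays activation
```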
15
Primary Sources of Energy Inefficiency
• Overfetch: 8 KB of data read out for each cache line request (128x for a 64 B line)
• Poor row buffer hit rates: diminished locality in multi-cores
• Electrical medium: bus speeds have been increasing, raising signaling energy per bit
• Reliability measures: overhead in building a reliable system from inherently unreliable parts
16
SECDED Support
64-bit data word + 8-bit ECC
• One extra x8 chip per rank
• Storage and energy overhead of 12.5%
• Cannot handle complete failure in one chip
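For concreteness, a sketch of one standard (72,64) SECDED construction: Hamming parity at power-of-two positions plus an overall parity bit. Actual DIMM codes may differ from this layout.

```python
# Sketch of a (72,64) SECDED code: 7 Hamming parity bits at power-of-two
# positions plus one overall parity bit. Real DIMM codes may differ.

def encode72(data_bits):
    """data_bits: list of 64 ints (0/1). Returns a 72-bit codeword as a list."""
    assert len(data_bits) == 64
    code = [0] * 72                       # position 0: overall parity; 1..71: Hamming
    it = iter(data_bits)
    for pos in range(1, 72):
        if pos & (pos - 1):               # not a power of two -> data position
            code[pos] = next(it)
    # each parity bit p covers the positions whose index has bit p set
    for p in (1, 2, 4, 8, 16, 32, 64):
        code[p] = sum(code[i] for i in range(1, 72) if i & p) % 2
    code[0] = sum(code) % 2               # overall parity enables double-error detection
    return code

def correct(code):
    """Single-error correction: the syndrome points at the flipped position."""
    syndrome = 0
    for p in (1, 2, 4, 8, 16, 32, 64):
        if sum(code[i] for i in range(1, 72) if i & p) % 2:
            syndrome |= p
    overall = sum(code) % 2
    if syndrome and overall:              # single-bit error: flip it back
        code[syndrome] ^= 1
    elif syndrome:                        # syndrome set but overall parity clean
        raise ValueError("double-bit error detected (uncorrectable)")
    return code
```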
17
Chipkill Support I
• Use 72 DRAM chips to read out 72 bits
• Dramatic increase in activation energy and overfetch
• Storage overhead is still 12.5%
64-bit data word + 8-bit ECC
At most one bit from each DRAM chip
18
Chipkill Support II
• Use 13 DRAM chips to read out 13 bits
• Storage and energy overhead: 62.5%
• Other options exist; trade-off between energy and storage
8-bit data word + 5-bit ECC
At most one bit from each DRAM chip
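A back-of-the-envelope comparison of the two chipkill organizations, mirroring the numbers on the slides:

```python
# Storage overhead for the two chipkill organizations. Each takes at most
# one bit per access from each DRAM chip, so a whole-chip failure looks
# like a correctable one-bit error per word.

def overhead(data_bits, ecc_bits):
    return ecc_bits / data_bits

# Chipkill I: 72 chips activated, 64 data bits + 8 ECC bits per access
print(f"Chipkill I : {overhead(64, 8):.1%} overhead, 72 chips activated")

# Chipkill II: 13 chips activated, 8 data bits + 5 ECC bits per access
print(f"Chipkill II: {overhead(8, 5):.1%} overhead, 13 chips activated")
```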
19
Summary So Far
We now understand…
• why memory energy is a problem: overfetch, row buffer miss rates
• why reliability incurs high energy overheads: chipkill support requires high activation per useful bit
• why capacity and bandwidth increases cost energy: need high frequency and buffering per hop
20
Crucial Timing
Disruptive changes may be compelling today…
• Increasing role of memory energy
• Increasing role of memory errors
• Impact of multi-core: high bandwidth needs, loss of locality
• Emerging technologies (NVM, optics) will require a revamp of memory architecture; these ideas can be easily applied to NVM; the role of DRAM may change
21
Attacking the Problem
• Find ways to maximize row buffer utility
• Find ways to reduce overfetch
• Treat reliability as a first-class design constraint
• Use photonics and 3D to boost capacity and bandwidth
Solutions must be very cost-sensitive
22
Maximizing Row Buffer Locality
• Micro-pages (ASPLOS’10)
• Handling multiple memory controllers (PACT’10)
• On-going work: better write scheduling, better bank management (data mapping, row closure)
23
Micro-Pages
• Key observation: most accesses to a page are localized to a small region (micro-page)
24
Solution
• Identify hot micro-pages
• Co-locate hot micro-pages in reserved DRAM rows
• Memory controller keeps track of re-direction
• Low overheads if applications have few hot micro-pages that account for most memory accesses (see the sketch below)
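A minimal sketch of the redirection mechanism, assuming a hypothetical counter table in the memory controller; all names, sizes, and thresholds are illustrative, not parameters from the ASPLOS’10 paper:

```python
# Sketch: count accesses at micro-page granularity and redirect hot
# micro-pages into reserved DRAM rows. All names, sizes, and thresholds
# are illustrative assumptions.
from collections import Counter

MICRO_PAGE = 1024              # bytes: a small region within a 4 KB OS page
HOT_THRESHOLD = 32             # accesses before a micro-page is deemed hot
RESERVED_BASE = 0xF0000000     # assumed base of the reserved DRAM rows

class MicroPageMapper:
    def __init__(self, reserved_slots=64):
        self.counts = Counter()
        self.redirect = {}                      # micro-page id -> reserved slot
        self.free = list(range(reserved_slots))

    def translate(self, addr):
        mp = addr // MICRO_PAGE
        self.counts[mp] += 1
        # promote a micro-page once it proves hot and a slot is available
        if mp not in self.redirect and self.counts[mp] >= HOT_THRESHOLD and self.free:
            self.redirect[mp] = self.free.pop()
        if mp in self.redirect:                 # serve from the reserved row
            return RESERVED_BASE + self.redirect[mp] * MICRO_PAGE + addr % MICRO_PAGE
        return addr                             # cold access: untouched address
```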
25
Results
• Overall 9% improvement in performance and 15% reduction in energy
26
Handling Multiple Memory Controllers
• Data mapping across multiple memory controllers is key:
- Must equalize load and queuing delays
- Must minimize “distance”
- Must maximize row buffer hit rates
[Diagram: processor with multiple memory controllers, each driving DIMMs]
27
Solution
• Cost function to guide initial page placement
• Similar cost function to guide page migration (sketched below)
• Initial page placement improves performance by 7%, page migration by 9%
• Row buffer hit rates can be doubled
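A sketch of such a cost function; the three terms mirror the bullets above, but the formula and the weights are illustrative assumptions, not the PACT’10 cost function itself:

```python
# Sketch: choose a memory controller for a page by scoring each candidate.
# The three terms mirror the slide; the weights are illustrative assumptions.

def place_page(controllers, alpha=1.0, beta=0.5, gamma=2.0):
    """controllers: list of dicts with 'queue_len' (current load),
    'distance_hops' (on-chip distance), and 'row_hit_rate' (estimated
    row buffer affinity of this page at that controller)."""
    def cost(c):
        return (alpha * c["queue_len"]        # equalize load and queuing delay
                + beta * c["distance_hops"]   # minimize "distance"
                - gamma * c["row_hit_rate"])  # maximize row buffer hit rate
    return min(range(len(controllers)), key=lambda i: cost(controllers[i]))

mcs = [{"queue_len": 8, "distance_hops": 1, "row_hit_rate": 0.2},
       {"queue_len": 3, "distance_hops": 2, "row_hit_rate": 0.6}]
print(place_page(mcs))   # -> 1: lighter load and better row locality win
```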
28
Reducing Overfetch
Key idea: eliminate overfetch by employing smaller arrays and activating a single array in a single chip: Single Subarray Access (SSA), ISCA’10 (see the sketch below)
Positive effects:
• Minimizes activation energy
• Small activation footprint: more arrays can be asleep longer
• Enables higher parallelism and reduces queuing delays
Negative effects:
• Longer transfer time
• Drop in density
• No row buffer hits
• Vulnerable to chip failure
• Change to standards
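A sketch contrasting the two organizations at first order; all sizes are illustrative assumptions, not the ISCA’10 parameters:

```python
# Sketch: first-order effects of Single Subarray Access (SSA) vs. the
# baseline rank organization. All parameters are illustrative assumptions.

BUS_BITS = 64                 # baseline channel width
CHIP_BITS = 8                 # one x8 chip's share of the bus
LINE_BITS = 64 * 8            # a 64 B cache line

def fetch(organization):
    if organization == "baseline":
        chips = 8                               # the whole rank activates
        activated_bytes = 8 * 1024              # 8 KB row across the rank
        transfer_cycles = LINE_BITS // BUS_BITS # 8 beats, all chips in parallel
    else:  # SSA
        chips = 1                               # single chip, single subarray
        activated_bytes = 64                    # only the line itself
        transfer_cycles = LINE_BITS // CHIP_BITS  # 64 beats from one chip
    return chips, activated_bytes, transfer_cycles

print(fetch("baseline"))   # (8, 8192, 8)  -> high activation energy, fast transfer
print(fetch("ssa"))        # (1, 64, 64)   -> low activation energy, longer transfer
```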
29
Energy Results
• Dynamic energy reduction of 6x
• In some cases, 3x reduction in leakage
30
Performance Results
• SSA better on half the programs (mem-intensive ones)
[Chart: normalized memory latency breakdown (data transfer, DRAM core access, rank switching delay (ODT), command/address transfer, queuing delay) for baseline open-page FR-FCFS, baseline closed-row FCFS, SBA, and SSA]
31
Support for Reliability
• Checksum support per row allows low-cost error detection
• Can build a 2nd tier error-correction scheme, based on RAID
[Diagram: a checksum stored alongside each data row in a DRAM chip; a parity row striped across the DRAM chips]
• Reads: single array read
• Writes: two array reads and two array writes (see the sketch below)
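A sketch of the RAID-5-style small write implied by the last bullet (read old data and old parity, then write new data and updated parity); the dedicated-parity layout and the names here are illustrative assumptions:

```python
# Sketch of the two-tier scheme: a per-row checksum detects errors on a
# read; a parity row across chips (RAID style) reconstructs a failed
# chip's data. Layout and names are illustrative assumptions.

def small_write(chips, chip_id, row, new_data):
    """RAID-style small write: 2 reads + 2 writes, matching the slide."""
    parity_id = len(chips) - 1                 # assumed dedicated parity chip
    old_data = chips[chip_id][row]             # read 1: old data
    old_parity = chips[parity_id][row]         # read 2: old parity
    new_parity = old_parity ^ old_data ^ new_data   # incremental parity update
    chips[chip_id][row] = new_data             # write 1: new data
    chips[parity_id][row] = new_parity         # write 2: new parity

def reconstruct(chips, failed_id, row):
    """Rebuild a failed chip's row by XOR-ing all surviving rows."""
    value = 0
    for cid, chip in enumerate(chips):
        if cid != failed_id:
            value ^= chip[row]
    return value
```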
32
Capacity and Bandwidth
• Silicon photonics to break the pin barrier at the processor
• But, several concerns at the DIMM:
- Breaking the DRAM pin barrier will impact cost!
- High capacity → daisy-chaining and loss of power
- High static power for photonics; need high utilization
- Scheduling for large capacities
33
Exploiting 3D Stacks (ISCA’11)
[Diagram: the processor’s memory controller linked by an optical waveguide to a 3D-stacked DIMM: DRAM chips atop an interface die with a stack controller]
• Interface die for photonic penetration
• Does not impact DRAM design
• Few photonic hops; high utilization
• Interface die schedules low-level operations
34
Packet-Based Scheduling Protocol
• High capacity → high scheduling complexity
• Move to a packet-based interface (sketched below):
- Processor issues an address request
- Processor reserves a slot for data return
- Scheduling minutiae are handled by the stack controller
- Data is returned at the correct time
- Back-up slot in case the deadline is not met
• Better plug’n’play
• Reduced complexity at the processor
• Can handle heterogeneity
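A sketch of the slot-reservation handshake described above; the timing constants and message format are illustrative assumptions:

```python
# Sketch of the packet-based protocol: the processor reserves a data-return
# slot; the stack controller handles low-level scheduling and either meets
# the deadline or falls back to the back-up slot. Timings are assumptions.

def issue_read(now, addr, stack_latency_estimate=50, backup_gap=20):
    primary = now + stack_latency_estimate     # reserved data-return slot
    backup = primary + backup_gap              # back-up slot if deadline missed
    return {"addr": addr, "primary": primary, "backup": backup}

def stack_controller(request, actual_latency):
    """The stack controller picks whichever reserved slot it can meet."""
    ready = actual_latency                     # when data is really available
    if ready <= request["primary"]:
        return request["primary"]              # data returned at the usual slot
    return request["backup"]                   # miss: use the reserved spare

req = issue_read(now=0, addr=0x1000)
print(stack_controller(req, actual_latency=45))   # meets primary slot: 50
print(stack_controller(req, actual_latency=60))   # misses: back-up slot at 70
```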
35
Summary
• Treat reliability as a first-order constraint
• Possible to use photonics to break the pin barrier and not disrupt memory chip design: boosts bandwidth and capacity!
• Can reduce memory chip energy by reducing overfetch and with better row buffer management
36
Acks
• Terrific students in the Utah Arch group
• Prof. Al Davis (Utah) and collaborators at HP, Intel, IBM
• Funding from NSF, Intel, HP, University of Utah