1
Towards Scalable and Energy-Efficient Memory System Architectures
Rajeev Balasubramonian
School of Computing, University of Utah
2
Main Memory Problems
[Diagram: processor connected to two DIMMs]
1. Energy
2. High capacity at high bandwidth
3. Reliability
3
Motivation: Memory Energy
Contributions of memory to overall system energy:
• 25-40% in IBM, Sun, and Google servers (data summarized by Meisner et al., ASPLOS’09)
• HP servers: 175 W out of ~785 W for 256 GB memory (HP power calculator)
• Intel SCC: memory controller contributes 19-69% of chip power, ISSCC’10
4
Motivation: Reliability
• DRAM data from Schroeder et al., SIGMETRICS’09: 25K-70K errors per billion device hours per Mbit; 8% of DRAM DIMMs are affected by errors every year
• DRAM error rates may get worse as scalability limits are reached; PCM error rates (both hard and soft) are expected to be high as well
• Primary concern: storage and energy overheads for error detection and correction
• ECC support is not too onerous; chip-kill is much worse
5
Motivation: Capacity, Bandwidth
[Diagram: processor connected to two DIMMs]
6
Motivation: Capacity, Bandwidth
[Diagram: processor connected to two DIMMs]
Cores are increasing, but pins are not
7
Motivation: Capacity, Bandwidth
[Diagram: processor connected to two DIMMs]
Cores are increasing, but pins are not
High channel frequency → fewer DIMMs per channel
Will eventually need disruptive shifts: NVM, optics
Can’t have high capacity, high bandwidth, and low energy. Pick 2 of the 3!
8
Memory System Basics
[Diagram: processor with multiple on-chip memory controllers (M), each driving a channel of DIMMs]
Multiple on-chip memory controllers that handle multiple 64-bit channels
9
Memory System Basics: FB-DIMM
[Diagram: processor with memory controllers driving daisy-chained FB-DIMM channels]
FB-DIMM: can boost capacity with narrow channels and buffering at each DIMM
10
What’s a Rank?
[Diagram: a DIMM with eight x8 DRAM chips on a 64-bit channel]
Rank: the DRAM chips required to provide the 64-bit output expected by a JEDEC standard bus
For example: eight x8 DRAM chips
11
What’s a Bank?
[Diagram: a DIMM with eight x8 DRAM chips; one bank spans a portion of each chip]
Bank: a portion of a rank that is tied up when servicing a request; multiple banks in a rank enable parallel handling of multiple requests
12
What’s an Array?
[Diagram: a bank within the DRAM chips, composed of many arrays]
Array: a matrix of cells; one array provides 1 bit/cycle
Each array reads out an entire row; large array → high density
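To make the channel/rank/bank/row hierarchy concrete, here is a minimal sketch of how a memory controller might decompose a physical address into DRAM coordinates; the field widths are illustrative assumptions, not a JEDEC-mandated mapping:

```python
# Sketch: decompose a physical address into DRAM coordinates.
# Field widths are illustrative assumptions, not a JEDEC mapping.

def decode(addr):
    offset = addr & 0x3F               # 6 bits: byte within a 64 B cache line
    addr >>= 6
    channel = addr & 0x1; addr >>= 1   # assume 2 channels
    rank    = addr & 0x1; addr >>= 1   # assume 2 ranks per channel
    bank    = addr & 0x7; addr >>= 3   # assume 8 banks per rank
    column  = addr & 0x7F; addr >>= 7  # assume 128 cache lines per row
    row     = addr                     # remaining bits select the row
    return dict(channel=channel, rank=rank, bank=bank,
                row=row, column=column, offset=offset)

print(decode(0x12345678))
```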
13
What’s a Row Buffer?
[Diagram: a DRAM array with wordlines and bitlines feeding a row buffer; RAS selects the row, CAS selects the column driven to the output pin]
14
Row Buffer Management
• Row buffer: collection of rows read out by arrays in a bank
• Row buffer hits incur low latency and low energy
• Bitlines must be precharged before a new row can be read
• Open page policy: delays the precharge until a different row is encountered
• Close page policy: issues the precharge immediately
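A toy sketch of the two policies on a single bank; the latency constants and the model are illustrative assumptions, not DRAM datasheet values:

```python
# Toy model of open-page vs. close-page row buffer policies.
# Latency constants are illustrative assumptions.
T_PRECHARGE = 15   # close the open row (restore bitlines)
T_ACTIVATE  = 15   # read a row into the row buffer
T_CAS       = 15   # column access from the row buffer

def access_latency(requests, policy="open"):
    """Return total latency for a sequence of row numbers hitting one bank."""
    total, open_row = 0, None
    for row in requests:
        if policy == "open":
            if row == open_row:            # row buffer hit: column access only
                total += T_CAS
            elif open_row is None:         # bank idle: activate + access
                total += T_ACTIVATE + T_CAS
            else:                          # conflict: precharge + activate + access
                total += T_PRECHARGE + T_ACTIVATE + T_CAS
            open_row = row                 # keep the row open
        else:  # close-page: precharge immediately after every access
            total += T_ACTIVATE + T_CAS    # the precharge overlaps with idle time
            open_row = None
    return total

reqs = [3, 3, 3, 7, 7, 3]                  # locality favors open-page
print(access_latency(reqs, "open"))        # 165: hits on repeated rows
print(access_latency(reqs, "close"))       # 180: every access pays activation
```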
15
Primary Sources of Energy Inefficiency
• Overfetch: 8 KB of data read out for each cache line request (128x for a 64 B line)
• Poor row buffer hit rates: diminished locality in multi-cores
• Electrical medium: bus speeds have been increasing, raising signaling energy per bit
• Reliability measures: overhead in building a reliable system from inherently unreliable parts
16
SECDED Support
64-bit data word + 8-bit ECC
• One extra x8 chip per rank
• Storage and energy overhead of 12.5%
• Cannot handle complete failure in one chip
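For concreteness, a sketch of one standard (72,64) SECDED construction: Hamming parity at power-of-two positions plus an overall parity bit. Actual DIMM codes may differ from this layout.

```python
# Sketch of a (72,64) SECDED code: 7 Hamming parity bits at power-of-two
# positions plus one overall parity bit. Real DIMM codes may differ.

def encode72(data_bits):
    """data_bits: list of 64 ints (0/1). Returns a 72-bit codeword as a list."""
    assert len(data_bits) == 64
    code = [0] * 72                       # position 0: overall parity; 1..71: Hamming
    it = iter(data_bits)
    for pos in range(1, 72):
        if pos & (pos - 1):               # not a power of two -> data position
            code[pos] = next(it)
    # each parity bit p covers the positions whose index has bit p set
    for p in (1, 2, 4, 8, 16, 32, 64):
        code[p] = sum(code[i] for i in range(1, 72) if i & p) % 2
    code[0] = sum(code) % 2               # overall parity enables double-error detection
    return code

def correct(code):
    """Single-error correction: the syndrome points at the flipped position."""
    syndrome = 0
    for p in (1, 2, 4, 8, 16, 32, 64):
        if sum(code[i] for i in range(1, 72) if i & p) % 2:
            syndrome |= p
    overall = sum(code) % 2
    if syndrome and overall:              # single-bit error: flip it back
        code[syndrome] ^= 1
    elif syndrome:                        # syndrome set but overall parity clean
        raise ValueError("double-bit error detected (uncorrectable)")
    return code
```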
17
Chipkill Support I
• Use 72 DRAM chips to read out 72 bits
• Dramatic increase in activation energy and overfetch
• Storage overhead is still 12.5%
64-bit data word + 8-bit ECC
At most one bit from each DRAM chip
18
Chipkill Support II
• Use 13 DRAM chips to read out 13 bits
• Storage and energy overhead: 62.5%
• Other options exist; trade-off between energy and storage
8-bit data word + 5-bit ECC
At most one bit from each DRAM chip
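A back-of-the-envelope comparison of the two chipkill organizations, mirroring the numbers on the slides:

```python
# Storage overhead for the two chipkill organizations. Each takes at most
# one bit per access from each DRAM chip, so a whole-chip failure looks
# like a correctable one-bit error per word.

def overhead(data_bits, ecc_bits):
    return ecc_bits / data_bits

# Chipkill I: 72 chips activated, 64 data bits + 8 ECC bits per access
print(f"Chipkill I : {overhead(64, 8):.1%} overhead, 72 chips activated")

# Chipkill II: 13 chips activated, 8 data bits + 5 ECC bits per access
print(f"Chipkill II: {overhead(8, 5):.1%} overhead, 13 chips activated")
```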
19
Summary So Far
We now understand…
• why memory energy is a problem: overfetch, row buffer miss rates
• why reliability incurs high energy overheads: chipkill support requires high activation per useful bit
• why capacity and bandwidth increases cost energy: need high frequency and buffering per hop
20
Crucial Timing
Disruptive changes may be compelling today…
• Increasing role of memory energy
• Increasing role of memory errors
• Impact of multi-core: high bandwidth needs, loss of locality
• Emerging technologies (NVM, optics) will require a revamp of memory architecture; these ideas can be easily applied to NVM; the role of DRAM may change
21
Attacking the Problem
• Find ways to maximize row buffer utility
• Find ways to reduce overfetch
• Treat reliability as a first-class design constraint
• Use photonics and 3D to boost capacity and bandwidth
Solutions must be very cost-sensitive
22
Maximizing Row Buffer Locality
• Micro-pages (ASPLOS’10)
• Handling multiple memory controllers (PACT’10)
• On-going work: better write scheduling, better bank management (data mapping, row closure)
23
Micro-Pages
• Key observation: most accesses to a page are localized to a small region (micro-page)
24
Solution
• Identify hot micro-pages
• Co-locate hot micro-pages in reserved DRAM rows
• Memory controller keeps track of re-direction
• Low overheads if applications have few hot micro-pages that account for most memory accesses (see the sketch below)
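A minimal sketch of the redirection mechanism, assuming a hypothetical counter table in the memory controller; all names, sizes, and thresholds are illustrative, not parameters from the ASPLOS’10 paper:

```python
# Sketch: count accesses at micro-page granularity and redirect hot
# micro-pages into reserved DRAM rows. All names, sizes, and thresholds
# are illustrative assumptions.
from collections import Counter

MICRO_PAGE = 1024              # bytes: a small region within a 4 KB OS page
HOT_THRESHOLD = 32             # accesses before a micro-page is deemed hot
RESERVED_BASE = 0xF0000000     # assumed base of the reserved DRAM rows

class MicroPageMapper:
    def __init__(self, reserved_slots=64):
        self.counts = Counter()
        self.redirect = {}                      # micro-page id -> reserved slot
        self.free = list(range(reserved_slots))

    def translate(self, addr):
        mp = addr // MICRO_PAGE
        self.counts[mp] += 1
        # promote a micro-page once it proves hot and a slot is available
        if mp not in self.redirect and self.counts[mp] >= HOT_THRESHOLD and self.free:
            self.redirect[mp] = self.free.pop()
        if mp in self.redirect:                 # serve from the reserved row
            return RESERVED_BASE + self.redirect[mp] * MICRO_PAGE + addr % MICRO_PAGE
        return addr                             # cold access: untouched address
```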
25
Results
• Overall 9% improvement in performance and 15% reduction in energy
26
Handling Multiple Memory Controllers
• Data mapping across multiple memory controllers is key:
- Must equalize load and queuing delays
- Must minimize “distance”
- Must maximize row buffer hit rates
[Diagram: processor with multiple memory controllers, each driving DIMMs]
27
Solution
• Cost function to guide initial page placement
• Similar cost function to guide page migration (sketched below)
• Initial page placement improves performance by 7%, page migration by 9%
• Row buffer hit rates can be doubled
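A sketch of such a cost function; the three terms mirror the bullets above, but the formula and the weights are illustrative assumptions, not the PACT’10 cost function itself:

```python
# Sketch: choose a memory controller for a page by scoring each candidate.
# The three terms mirror the slide; the weights are illustrative assumptions.

def place_page(controllers, alpha=1.0, beta=0.5, gamma=2.0):
    """controllers: list of dicts with 'queue_len' (current load),
    'distance_hops' (on-chip distance), and 'row_hit_rate' (estimated
    row buffer affinity of this page at that controller)."""
    def cost(c):
        return (alpha * c["queue_len"]        # equalize load and queuing delay
                + beta * c["distance_hops"]   # minimize "distance"
                - gamma * c["row_hit_rate"])  # maximize row buffer hit rate
    return min(range(len(controllers)), key=lambda i: cost(controllers[i]))

mcs = [{"queue_len": 8, "distance_hops": 1, "row_hit_rate": 0.2},
       {"queue_len": 3, "distance_hops": 2, "row_hit_rate": 0.6}]
print(place_page(mcs))   # -> 1: lighter load and better row locality win
```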
28
Reducing Overfetch
Key idea: eliminate overfetch by employing smaller arrays and activating a single array in a single chip: Single Subarray Access (SSA), ISCA’10 (see the sketch below)
Positive effects:
• Minimizes activation energy
• Small activation footprint: more arrays can be asleep longer
• Enables higher parallelism and reduces queuing delays
Negative effects:
• Longer transfer time
• Drop in density
• No row buffer hits
• Vulnerable to chip failure
• Change to standards
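A sketch contrasting the two organizations at first order; all sizes are illustrative assumptions, not the ISCA’10 parameters:

```python
# Sketch: first-order effects of Single Subarray Access (SSA) vs. the
# baseline rank organization. All parameters are illustrative assumptions.

BUS_BITS = 64                 # baseline channel width
CHIP_BITS = 8                 # one x8 chip's share of the bus
LINE_BITS = 64 * 8            # a 64 B cache line

def fetch(organization):
    if organization == "baseline":
        chips = 8                               # the whole rank activates
        activated_bytes = 8 * 1024              # 8 KB row across the rank
        transfer_cycles = LINE_BITS // BUS_BITS # 8 beats, all chips in parallel
    else:  # SSA
        chips = 1                               # single chip, single subarray
        activated_bytes = 64                    # only the line itself
        transfer_cycles = LINE_BITS // CHIP_BITS  # 64 beats from one chip
    return chips, activated_bytes, transfer_cycles

print(fetch("baseline"))   # (8, 8192, 8)  -> high activation energy, fast transfer
print(fetch("ssa"))        # (1, 64, 64)   -> low activation energy, longer transfer
```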
29
Energy Results
• Dynamic energy reduction of 6x
• In some cases, 3x reduction in leakage
30
Performance Results
• SSA better on half the programs (mem-intensive ones)
[Chart: normalized memory latency breakdown (data transfer, DRAM core access, rank switching delay (ODT), command/address transfer, queuing delay) for baseline open-page FR-FCFS, baseline closed-row FCFS, SBA, and SSA]
31
Support for Reliability
• Checksum support per row allows low-cost error detection
• Can build a 2nd tier error-correction scheme, based on RAID
[Diagram: a checksum stored alongside each data row in a DRAM chip; a parity row striped across the DRAM chips]
• Reads: single array read
• Writes: two array reads and two array writes (see the sketch below)
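A sketch of the RAID-5-style small write implied by the last bullet (read old data and old parity, then write new data and updated parity); the dedicated-parity layout and the names here are illustrative assumptions:

```python
# Sketch of the two-tier scheme: a per-row checksum detects errors on a
# read; a parity row across chips (RAID style) reconstructs a failed
# chip's data. Layout and names are illustrative assumptions.

def small_write(chips, chip_id, row, new_data):
    """RAID-style small write: 2 reads + 2 writes, matching the slide."""
    parity_id = len(chips) - 1                 # assumed dedicated parity chip
    old_data = chips[chip_id][row]             # read 1: old data
    old_parity = chips[parity_id][row]         # read 2: old parity
    new_parity = old_parity ^ old_data ^ new_data   # incremental parity update
    chips[chip_id][row] = new_data             # write 1: new data
    chips[parity_id][row] = new_parity         # write 2: new parity

def reconstruct(chips, failed_id, row):
    """Rebuild a failed chip's row by XOR-ing all surviving rows."""
    value = 0
    for cid, chip in enumerate(chips):
        if cid != failed_id:
            value ^= chip[row]
    return value
```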
32
Capacity and Bandwidth
• Silicon photonics to break the pin barrier at the processor
• But, several concerns at the DIMM:
- Breaking the DRAM pin barrier will impact cost!
- High capacity → daisy-chaining and loss of power
- High static power for photonics; need high utilization
- Scheduling for large capacities
33
Exploiting 3D Stacks (ISCA’11)
[Diagram: the processor’s memory controller linked by an optical waveguide to a 3D-stacked DIMM: DRAM chips atop an interface die with a stack controller]
• Interface die for photonic penetration
• Does not impact DRAM design
• Few photonic hops; high utilization
• Interface die schedules low-level operations
34
Packet-Based Scheduling Protocol
• High capacity → high scheduling complexity
• Move to a packet-based interface (sketched below):
- Processor issues an address request
- Processor reserves a slot for data return
- Scheduling minutiae are handled by the stack controller
- Data is returned at the correct time
- Back-up slot in case the deadline is not met
• Better plug’n’play
• Reduced complexity at the processor
• Can handle heterogeneity
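A sketch of the slot-reservation handshake described above; the timing constants and message format are illustrative assumptions:

```python
# Sketch of the packet-based protocol: the processor reserves a data-return
# slot; the stack controller handles low-level scheduling and either meets
# the deadline or falls back to the back-up slot. Timings are assumptions.

def issue_read(now, addr, stack_latency_estimate=50, backup_gap=20):
    primary = now + stack_latency_estimate     # reserved data-return slot
    backup = primary + backup_gap              # back-up slot if deadline missed
    return {"addr": addr, "primary": primary, "backup": backup}

def stack_controller(request, actual_latency):
    """The stack controller picks whichever reserved slot it can meet."""
    ready = actual_latency                     # when data is really available
    if ready <= request["primary"]:
        return request["primary"]              # data returned at the usual slot
    return request["backup"]                   # miss: use the reserved spare

req = issue_read(now=0, addr=0x1000)
print(stack_controller(req, actual_latency=45))   # meets primary slot: 50
print(stack_controller(req, actual_latency=60))   # misses: back-up slot at 70
```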
35
Summary
• Treat reliability as a first-order constraint
• Possible to use photonics to break the pin barrier and not disrupt memory chip design: boosts bandwidth and capacity!
• Can reduce memory chip energy by reducing overfetch and with better row buffer management
36
Acks
• Terrific students in the Utah Arch group
• Prof. Al Davis (Utah) and collaborators at HP, Intel, IBM
• Funding from NSF, Intel, HP, University of Utah