TRANSCRIPT
Runtime Solutions to Apply Non-volatile
Memories in Future Computer Systems
Hoda Aghaei Khouzani
Fall 2016
Outline
• Introduction
• Prolonging PCM Limited Lifetime
• Addressing PCM Write Latency and Energy
• Reducing DWM Access Latency
• Summary and Future Works
Trends Affecting Main Memory
DRAM has been used as main memory since 1970. However, DRAM technology scaling is ending.
[Figure: relative DRAM capacity vs. number of cores per chip, 2003-2016; memory capacity per core is expected to drop by 30% every two years [Mutlu, IMW'13] [Lim, ISCA'09]]
[Figure: chip power density (W/cm²), split into leakage and dynamic power, across technology nodes from 90nm to 20nm [Borkar, MICRO'05]]
DRAM consumes up to 40% of system energy due to the need for constant refresh cycles [Udipi, ISCA'10].
Emerging Memory Technologies
[Figure: access time (1ns to 1ms) vs. cell area per bit (F²) for SRAM, DRAM, NOR-Flash, NAND-Flash, and the emerging non-volatile technologies DWM, PCM, FeRAM, and STT-RAM; volatile technologies have high static power, non-volatile ones near-zero static power [ITRS, 2013]]
Potential Candidates for Future Main Memory
(1) Phase Change Memory (PCM)
• Advantages:
– Higher scalability than DRAM
– Near zero static power
– Similar read performance and energy to
DRAM
• Challenges:
– Limited write endurance (10⁶~10⁹)
– Long write latency (10x DRAM)
– High write energy (4x DRAM)
(2) Domain Wall Memory (DWM)
• Advantages:
– Ultra dense
– Near zero static power
– Similar read/write performance and energy
to DRAM
• Challenge:
– Sequential access structure
Outline
• Introduction
• Prolonging PCM Limited Lifetime
• Addressing PCM Write Latency and Energy
• Reducing DWM Access Latency
• Summary and Future Works
Related Works on PCM Lifetime
[1] S. Cho, et al., "Flip-N-Write: A Simple Deterministic Technique to Improve PRAM Write Performance, Energy and Endurance," in MICRO, 2009.
[2] P. Zhou, et al., "A Durable and Energy Efficient Main Memory Using Phase Change Memory Technology," in ISCA, 2009.
[3] A. Ferreira, et al., "Increasing PCM main memory lifetime," in DATE, 2010.
[4] M. Qureshi, et al., "Enhancing lifetime and security of PCM-based main memory with start-gap wear leveling," in MICRO, 2009.
[5] P. Zhou, et al., "A Durable and Energy Efficient Main Memory Using Phase Change Memory Technology," in ISCA, 2009.
[6] N. H. Seong, et al., "SAFER: Stuck-at-fault Error Recovery for Memories," in MICRO, 2010.
[7] J. Fan, et al., "Aegis: partitioning data block for efficient recovery of stuck-at-faults in phase change memory," in MICRO, 2013.
Write reduction:
• Works at the bit level
• Reduces the average number of writes
• Lifetime is still limited by the most-written cell
Write balancing (wear leveling):
• Applied at different levels of granularity
• Balances writes across cells
• Incurs extra writes due to remapping
Fault tolerance:
• Applies once cells die
• Increases access latency
Goal and Challenges
In the physical domain, pages have received either a lot of writes (old pages) or a few (young pages); in the virtual domain, pages perform either a lot of writes (hot pages) or a few (cold pages).
Goal: the operating system maps hot virtual pages to young physical pages and cold virtual pages to old physical pages.
Challenges:
• How to identify the age of physical pages?
• How to identify the temperature of virtual pages?
• How frequent should remapping be?
Identification Solution
How to identify the age of physical pages? (physical domain)
• Keep an N-bit write counter per physical page, where N = log2(wear-out limit).
• For a wear-out limit of 10⁹, about 30 bits are needed per 4KB page, so the storage overhead is below 0.001 (a sizing sketch follows below).
How to identify the temperature of virtual pages? (virtual domain)
• Counters again? No:
  (1) The virtual domain is larger than the physical domain.
  (2) More importantly, there is no need, because virtual page regions have known write characteristics:
      Text: read only
      Stack: high spatial locality
      Data: high temporal locality
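As a quick check of the counter sizing above, here is a minimal Python sketch; the 4KB page size and 10⁹ wear-out limit come from the slide, and the function name is illustrative.

```python
import math

# A minimal sketch (not the dissertation's code) checking the per-page age
# counter sizing: N = ceil(log2(wear-out limit)) bits per physical page.
def age_counter_overhead(wear_out_limit=10**9, page_size_bytes=4096):
    n_bits = math.ceil(math.log2(wear_out_limit))   # ~30 bits for a 10^9 limit
    page_bits = page_size_bytes * 8                  # 32768 bits in a 4KB page
    return n_bits, n_bits / page_bits

bits, overhead = age_counter_overhead()
print(bits, f"{overhead:.5f}")   # 30 0.00092 -> overhead below 0.001
```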
How frequent should remapping be?
In existing works: wear leveling relies on many extra remaps, which results in intensive extra writes. [Figure: page ages at the start and after processes A, B, and C under existing schemes]
My idea: wear leveling can be done across different processes, so pages aged by one process are reused appropriately by the next. [Figure: page ages at the start and after processes A, B, and C under the proposed scheme]
Implementation: the mapping is decided upon page allocation in the OS, so almost no extra remaps are needed and extra writes are very limited.
System Overview
[Flowchart: on a memory access, a miss triggers wear-resistant page allocation; if too few free frames remain, age-aware page replacement runs first. On a hit that is a write and turns the frame old, the page is remapped. If too few young frames remain, the Young vs. Mid-Age boundary is redefined; if too many old frames accumulate, the Old vs. Mid-Age boundary is redefined.]
The following are added to the system:
• Three procedures
  − Wear-resistant page allocation
  − Age-aware page replacement
  − Remap hot page
• Two thresholds
  − Young-to-Midage threshold
  − Midage-to-Old threshold
Wear-resistant Page Allocation
How to quickly find the appropriate physical page? An age-aware free list with three segments is maintained:
• Young: WriteCount ≤ Y-to-M threshold
• Mid-Age: Y-to-M threshold < WriteCount ≤ M-to-O threshold
• Old: WriteCount > M-to-O threshold
On a page request (a sketch follows below):
• A hot virtual page is always served with a Young page.
• A cold virtual page is served with an Old page; if the Old list is empty, with a Mid-age page; if that is also empty, with a Young page.
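Below is a minimal Python sketch of this allocation policy, using the segment boundaries listed above; the class and method names are hypothetical, not the dissertation's implementation.

```python
from collections import deque

# A minimal sketch of the age-aware free list; segment boundaries follow the
# slide, while the class layout is an assumption.
class AgeAwareFreeList:
    def __init__(self, y_to_m, m_to_o):
        self.y_to_m, self.m_to_o = y_to_m, m_to_o
        self.young, self.mid_age, self.old = deque(), deque(), deque()

    def add_free_page(self, page, write_count):
        # Freed physical pages return to the segment matching their age.
        if write_count <= self.y_to_m:
            self.young.append(page)
        elif write_count <= self.m_to_o:
            self.mid_age.append(page)
        else:
            self.old.append(page)

    def allocate(self, virtual_page_is_hot):
        if virtual_page_is_hot:
            # Hot virtual pages are always served with a young physical page
            # (age-aware replacement is expected to keep this list non-empty).
            return self.young.popleft()
        # Cold virtual pages prefer old pages, then mid-age, then young.
        for segment in (self.old, self.mid_age, self.young):
            if segment:
                return segment.popleft()
        raise MemoryError("no free physical pages")
```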
Age-aware Page Deallocation
In the OS there is an in-use list per process. [Figure: in-use pages of processes A and B being returned to the Old / Mid-Age / Young segments of the age-aware free list]
How to select a page for deallocation? A Constrained Clock Algorithm is used (a sketch follows below), where the constraint is an upper bound on the write count of the evicted page:
• If the Young free list is empty: upper bound = Y-to-M threshold.
• Otherwise: upper bound = wear-out limit.
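The following is a minimal Python sketch of such a constrained clock sweep, assuming a simple circular list of pages with a referenced bit and a write count; it is an illustration, not the dissertation's code.

```python
# A minimal sketch of the constrained clock algorithm: a victim is chosen with
# the usual clock sweep, but only pages whose write count is within the current
# upper bound qualify. Names and data layout are hypothetical.
def select_victim(in_use_pages, young_list_empty, y_to_m, wear_out_limit):
    """in_use_pages: circular list of dicts with 'referenced' and 'write_count'."""
    upper_bound = y_to_m if young_list_empty else wear_out_limit
    n = len(in_use_pages)
    hand = 0
    for _ in range(2 * n):              # at most two sweeps over the list
        page = in_use_pages[hand]
        if page['referenced']:
            page['referenced'] = False  # give a second chance
        elif page['write_count'] <= upper_bound:
            return hand                 # unreferenced and young enough: evict
        hand = (hand + 1) % n
    return None                         # no page satisfies the constraint
```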
System Overview (recap)
• The scheme effectively balances page write counts across processes.
• Remapping is only needed to handle an extremely hot virtual page.
• The two thresholds are redefined across the PCM lifetime.
Evaluation: Methodology
• Benchmarks: SPEC 2000/2006; memory traces collected with Pin.
• PCM main memory: 128MB with a wear-out limit of 10⁶ writes per cell.
• Y-to-M threshold: 10⁴ initially.
• M-to-O threshold: 2×10⁵ initially.
Compared with:
• no-WL: no wear leveling.
• Start Gap [1]: maintains an extra empty line; every 50 writes, swaps the empty line with the next line.
• Random Swap [2]: every 512 writes, swaps the last written page with a randomly selected page.
• Segment Swap [3]: groups pages into 1MB segments; swaps the hottest and coldest segments every 2×10⁵ writes.
[1] M. Qureshi, et al., "Enhancing lifetime and security of PCM-based main memory with start-gap wear leveling," in MICRO, 2009.
[2] A. Ferreira, et al., "Increasing PCM main memory lifetime," in DATE, 2010.
[3] P. Zhou, et al., "A Durable and Energy Efficient Main Memory Using Phase Change Memory Technology," in ISCA, 2009.
Benchmark groups:
Group1: leslie3d, omnetpp, vpr
Group2: gzip, gamess, hmmer
Group3: calculix, facerec, fma3d
Group4: gcc, gromacs, milc
Lifetime Improvement
Normalized Lifetime = Total LLC Writes / (Wear-out limit × Total pages)
[Figure: normalized lifetime for single-process runs (leslie3d, omnetpp, vpr, gzip, gamess, hmmer, calculix, facerec, fma3d, gcc, gromacs, milc) and multi-process runs (Groups 1-4), comparing no-WL, Proposed, StartGap, RandomSwap, and SegmentSwap]
• no-WL exploits only 1% of the potential writes.
• The proposed scheme exploits more than 98% of the potential PCM writes.
• The other schemes exploit up to 89% of the writes; the gap is mostly due to their extra remaps.
Overhead
Overhead = (Total Page Faults + Total Remaps) × 128 / Total LLC Writes,
where 128 = Page Size / LLC Block Size = 2¹² / 2⁵.

Benchmark   no-WL   Proposed   StartGap   RandomSwap   SegmentSwap
Group1      2%      7%         145%       37%          34%
Group2      21%     10%        179%       58%          45%
Group3      2%      3%         119%       123%         33%
Group4      2%      9%         141%       51%          34%

• no-WL overhead is due to page faults.
• The proposed scheme imposes at most 10% extra writes.
• StartGap imposes the most extra writes because of its high remap frequency.
Outline
• Introduction
• Prolonging PCM Limited Lifetime
• Addressing PCM Write Latency and Energy
• Reducing DWM Access Latency
• Summary and Future Works
DRAM vs PCM

                        DRAM cell        PCM cell
Read Latency (ns)       ~10              ~10
Read Energy (pJ/bit)    4.4              2.5
Write Latency (ns)      ~10              ~100
Write Energy (pJ/bit)   5.5              14 (set) ~ 20 (reset)
Write Endurance         10¹⁶             10⁶~10⁹
Leakage Power           High             Low
Scalability             Not below 20nm   Predicted 9nm

What if a memory combined DRAM's latency, energy, and endurance (~10ns read and write, 4.4/5.5 pJ/bit, 10¹⁶ endurance) with PCM's low leakage and predicted 9nm scalability?
One attractive solution is a DRAM-PCM hybrid architecture.
DRAM-PCM Hybrid Structures
[Figure: hierarchical organization; requests from the LLC first go to DRAM, and on a miss they go to the PCM address space]
• Uses a hierarchical organization (DRAM acts as a cache for PCM).
• DRAM size is about 3% of the PCM size.
• The access policy has two steps: first access DRAM; on a miss, access PCM (a sketch follows below).
Problem: the two-step access strategy degrades memory performance.
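Here is a minimal, self-contained Python sketch of that two-step policy, with a naive eviction used purely for illustration; the class and its fields are hypothetical stand-ins, not the dissertation's simulator.

```python
# A minimal sketch of the two-step access: probe the DRAM cache first; on a
# miss, fetch from PCM, install in DRAM, and possibly write a dirty victim back
# to PCM (the "critical" case that also wears the PCM).
class HybridMemory:
    def __init__(self, dram_capacity_blocks):
        self.dram = {}                     # addr -> {'data': ..., 'dirty': bool}
        self.capacity = dram_capacity_blocks
        self.pcm = {}                      # backing store: addr -> data
        self.pcm_writes = 0

    def access(self, addr, is_write, data=None):
        if addr in self.dram:              # step 1: DRAM hit
            if is_write:
                self.dram[addr] = {'data': data, 'dirty': True}
            return self.dram[addr]['data']
        # step 2: DRAM miss, go to PCM
        block = self.pcm.get(addr, 0)
        if len(self.dram) >= self.capacity:
            victim_addr, victim = self.dram.popitem()   # naive eviction policy
            if victim['dirty']:
                self.pcm[victim_addr] = victim['data']  # critical: extra PCM write
                self.pcm_writes += 1
        self.dram[addr] = {'data': data if is_write else block,
                           'dirty': is_write}
        return block
```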
Related Work on the Hierarchical Architecture
[1] M. Qureshi, et al., "Scalable High Performance Main Memory System Using Phase-Change Memory Technology," in ISCA, 2009.
[2] A. Ferreira, et al., "Increasing PCM main memory lifetime," in DATE, 2010.
[3] H. G. Lee, et al., "An Energy- and Performance-Aware DRAM Cache Architecture for Hybrid DRAM/PCM Main Memory Systems," in ICCD, 2011.
• First hierarchical DRAM-PCM [1]: DRAM misses limit its performance.
• Clean-first replacement [2]: lower PCM writes, but a higher DRAM miss rate.
• Miss penalty reduction [3]: lower miss latency, but a higher DRAM miss rate.
Criticality of DRAM Misses
Goal: reduce DRAM misses, more specifically those misses that generate PCM writes.
Operations upon a DRAM miss: DRAM read, PCM read, DRAM write, and possibly a PCM write.
Affected system metrics: performance, energy, and endurance.
Distribution of Conflict Misses
• DRAM conflict misses: misses that hit in PCM.
• Critical conflicts: conflict misses that generate a writeback to PCM.
[Figure: number of critical conflicts per DRAM set for omnetpp, h264ref, vpr, and vortex]
There is high variation across DRAM sets!
A Motivating Example
Consider a direct-mapped DRAM with two sets. [Figure: pages A, B, C, and D accessed over time under two different mappings to the two sets; one mapping sends the interfering pages to the same set and causes repeated conflicts, while the other causes no conflicts.]
Idea: when bringing a page into DRAM, map it to a less conflicting set.
How Is the DRAM Set Determined?
[Figure: the OS translates the virtual page number into a physical page number (2ⁿ pages in PCM) upon a page fault; the low-order m bits of the physical page number give the DRAM set number.]
The DRAM set of a page therefore follows from the virtual-to-physical page mapping, which is determined by the OS upon a page fault. The proposed scheme exploits this flexibility: the OS picks a physical page whose low-order bits map the faulting page to a less conflicting DRAM set (a sketch follows below).
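A minimal Python sketch of that bit-level relationship, assuming a power-of-two number of DRAM sets; the helper names and the fallback policy are illustrative only.

```python
# Illustrative sketch: the low-order m bits of the physical page number select
# the DRAM set, so the OS can steer a faulting virtual page toward a chosen set
# by picking a suitable free physical page.
def dram_set_of(physical_page_number, num_dram_sets):
    return physical_page_number & (num_dram_sets - 1)   # low-order m bits

def pick_physical_page(free_pages, target_set, num_dram_sets):
    # Prefer a free physical page that maps to the (less conflicting) target set.
    for ppn in free_pages:
        if dram_set_of(ppn, num_dram_sets) == target_set:
            return ppn
    return free_pages[0]       # fall back to any free page (assumes one exists)
```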
How to Identify a Less Conflicting Set?
• A 2-bit saturating counter per DRAM set is used.
• Counter size is a trade-off: larger counters mean higher overhead (power and storage), smaller counters mean lower accuracy.
• For a 2MB, 4-way associative DRAM, the storage overhead is 256 bits out of 2MB (< 0.01%).
• A higher priority is given to critical conflicts.
[Flowchart: on a DRAM miss, if the access hits in PCM the set's counter is incremented; if it also causes a writeback to PCM (a critical conflict), the counter is incremented again. The updates happen in parallel with the data transfer, so there is no performance overhead.]
A counter-update sketch follows below.
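A minimal Python sketch of the per-set saturating counters, following the increment rules above; weighting a critical conflict as two increments reflects the flowchart, while the class layout itself is an assumption.

```python
# Per-set 2-bit saturating conflict counters: each DRAM miss that hits in PCM
# increments the set's counter, and a critical conflict (one that also writes
# back to PCM) counts twice. The least-conflicting set has the smallest value.
class ConflictCounters:
    def __init__(self, num_sets, bits=2):
        self.max_val = (1 << bits) - 1
        self.counters = [0] * num_sets

    def record_miss(self, set_idx, hit_in_pcm, writeback_to_pcm):
        if not hit_in_pcm:
            return                                   # not a conflict miss
        inc = 2 if writeback_to_pcm else 1           # critical conflicts weigh more
        self.counters[set_idx] = min(self.max_val, self.counters[set_idx] + inc)

    def least_conflicting_set(self):
        return min(range(len(self.counters)), key=self.counters.__getitem__)
```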
Look for a Page More Efficiently
Previously: the reference (R) bits of all in-use pages are searched by the conventional clock algorithm.
Proposed algorithm (a sketch follows below):
• Group the PCM pages mapped to the same DRAM set into a list.
• Use an outer clock over the DRAM sets (Clock_Set) and an inner clock over the pages of the selected set (Clock_Page), so only the list associated with the selected DRAM set is searched.
With n = total number of in-use pages and m = total number of DRAM sets:
• Complexity of the conventional clock: O(n)
• Complexity of our algorithm: O(m) + O(n/m)
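The sketch below illustrates this two-level search in Python, assuming pages are pre-grouped per DRAM set and each carries a referenced bit; the concrete data layout and sweep bounds are assumptions, not the dissertation's implementation.

```python
# Two-level clock: an outer hand (Clock_Set) walks the DRAM sets and an inner
# hand (Clock_Page) walks only the pages grouped under the selected set, giving
# O(m) + O(n/m) work instead of O(n).
def two_level_clock(pages_by_set, clock_set, clock_page):
    """pages_by_set: one list of page dicts (with a 'referenced' bit) per DRAM set.
    Returns (victim, new clock_set, new clock_page); victim is None if none found."""
    m = len(pages_by_set)
    for _ in range(2 * m):                     # outer hand over DRAM sets
        pages = pages_by_set[clock_set]
        if not pages:
            clock_set = (clock_set + 1) % m
            clock_page = 0
            continue
        for _ in range(2 * len(pages)):        # inner hand within one set
            page = pages[clock_page % len(pages)]
            if page['referenced']:
                page['referenced'] = False     # give a second chance
            else:
                return page, clock_set, (clock_page + 1) % len(pages)
            clock_page = (clock_page + 1) % len(pages)
        clock_set = (clock_set + 1) % m        # no victim here: advance to next set
        clock_page = 0
    return None, clock_set, clock_page
```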
Evaluation: Methodology
• Benchmarks: SPEC 2000/2006; memory traces collected with Pin.
• Last Level Cache (LLC): 1MB, 128B blocks, 4-way associative.
• PCM main memory: 1GB, 4KB page size.
• DRAM cache: 32MB, 4-way associative.
Compared with:
• Baseline [1]: basic hierarchical DRAM-PCM.
• N-Chance [2]: clean-first replacement in DRAM.
• Reduced Miss Penalty [3]: maintains an empty block per set in DRAM.
[1] M. Qureshi, et al., "Scalable High Performance Main Memory System Using Phase-Change Memory Technology," in ISCA, 2009.
[2] A. Ferreira, et al., "Increasing PCM main memory lifetime," in DATE, 2010.
[3] H. G. Lee, et al., "An Energy- and Performance-Aware DRAM Cache Architecture for Hybrid DRAM/PCM Main Memory Systems," in ICCD, 2011.
Groups of Benchmarks
Benchmarks
Group1 gromacs, h264ref, hmmer, leslie3d, omnetpp, sjeng, tonto, zeusmp
Group2 ammp, gromacs, leslie3d, mgrid, milc, sjeng, tonto, wupwise
Group3 bwaves, gzip, leslie3d, milc, sjeng, vortex, wupwise, zeusmp
Group4 facerec, gzip, h264ref, mgrid, omnetpp, tonto, vpr, wupwise
Group5 facerec, gamess, gcc, gzip, h264ref, vortex, vpr, zeusmp
Benchmarks are grouped to simulate multi-process environment
Balanced Critical Conflicts
[Figure: number of critical conflicts per DRAM set, baseline vs. proposed, for Groups 1-5]
The proposed scheme reduces the standard deviation of critical conflicts across sets by 64.78%.
Results – DRAM Misses and PCM Writes
[Figure: number of DRAM misses and number of PCM writes for Groups 1-5 under N-Chance, ReducedMissPenalty, and Proposed; all values are normalized to the baseline (basic hierarchical DRAM-PCM)]
• The proposed scheme reduces both DRAM misses and PCM writes:
  – 7% reduction in DRAM misses
  – 6% reduction in PCM writes
• N-Chance only reduces PCM writes and induces more misses.
• Reduced miss penalty increases both.
Outline
• Introduction
• Prolonging PCM Limited Lifetime
• Addressing PCM Write Latency and Energy
• Reducing DWM Access Latency
• Summary and Future Works
Domain Wall Memory (DWM)
• Made of hundreds of millions of ferromagnetic nanowires.
• Each nanowire stores many bits (domains).
• To read/write data, several access ports are placed at fixed positions.
• A shift operation is required to bring each bit under a port.
Access latency and energy are therefore affected by the shift operation.
[Figure: a nanowire with access ports and shift current applied at both ends]
Related Work on DWM
DWM has been explored at all levels of the memory hierarchy (registers, caches, main memory, storage):
• Registers: reduced shifts via register remapping and instruction scheduling [1].
• Caches: reduced shifts through compression techniques [3] and by adding read-only ports [2].
• Main memory: reduced shifts by storing data vertically [4].
[1] M. Mao, et al., "Exploration of GPGPU Register File Architecture Using Domain-wall-shift-write based Racetrack Memory," in DAC, 2014.
[2] R. Venkatesan, et al., "TapeCache: A High Density, Energy Efficient Cache Based on Domain Wall Memory," in ISLPED, 2012.
[3] H. Xu, et al., "Multilane Racetrack Caches: Improving Efficiency through Compression and Independent Shifting," in ASP-DAC, 2015.
[4] Q. Hu, et al., "Exploring Main Memory Design Based on Racetrack Memory Technology," in GLVLSI, 2016.
My work differs from them by considering metadata accesses, specifically the page table, via
(1) reducing shifts, and
(2) leveraging the access port position for metadata interpretation.
Page Table
• Contains the virtual-to-physical page address mapping.
• Modern systems adopt a hierarchical page table.
[Figure: the 4-level walk; CR3 points to the Level 4 (PML4) table, and the virtual address is split into PML4 index (bits 47-39), directory pointer index (bits 38-30), directory index (bits 29-21), table index (bits 20-12), and page offset (bits 11-0); the levels are walked in main memory down to the physical page number.]
A sketch of this index split follows below.
The latency of this 4-step process can be as much as 50% of total execution time! [1]
[1] V. Karakostas, et al., "Redundant Memory Mappings for Fast Access to Large Memories," in ISCA, 2015.
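As a concrete illustration of the index split above, here is a small Python sketch; the bit ranges are the standard x86-64 4-level split shown on the slide.

```python
# Each level's index is a 9-bit slice of the 48-bit virtual address,
# and the page offset is the low 12 bits (4KB pages).
def split_virtual_address(va):
    return {
        "pml4_idx":    (va >> 39) & 0x1FF,
        "dir_ptr_idx": (va >> 30) & 0x1FF,
        "dir_idx":     (va >> 21) & 0x1FF,
        "table_idx":   (va >> 12) & 0x1FF,
        "page_offset": va & 0xFFF,
    }
```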
Page Table Entry (PTE)
A PTE consists of control bits and address bits: V (valid), R/W, R (referenced), U/S, PWT, PCD, D (dirty), PAT, G, available bits, and the physical address. The most frequently accessed control bits are V, R, and D, which the following slides focus on.
• The Translation Lookaside Buffer (TLB) stores recently accessed PTEs.
  – Today's systems suffer more TLB misses [1].
[1] V. Karakostas, et al., "Redundant Memory Mappings for Fast Access to Large Memories," in ISCA, 2015.
Intuitive Read/Write of a PTE
[Figure: a PTE (V, R, D, address bits) laid out along a DWM track between access ports]
Currently, upon each access to the page table:
• The first bit of the PTE is aligned to the access port.
• The whole entry is read or written.
#Shifts = 2 × MaxShift, where MaxShift is the distance between two neighbouring ports.
My observation: there is no need to read the entire PTE.
Observation on PTE States
In a system with a TLB, a PTE has three states: In TLB (cached in the TLB and present in main memory), Not in TLB (present in main memory only), and Not in Memory.
[State diagram: "brought to memory" and "missed in TLB" move a PTE into In TLB; "evicted from TLB" moves it to Not in TLB; "evicted from memory" moves it to Not in Memory.]

Transition                      Action   Accessed field in PTE
Not in Memory → In TLB          Write    The entire entry
In TLB → Not in TLB             Update   Referenced (R) and/or Dirty (D) bit
Not in TLB → In TLB             Read     The entire entry
Not in TLB → Not in Memory      Update   Valid (V) bit
An Example
Consider the transition In TLB → Not in TLB, which only updates the Referenced (R) and/or Dirty (D) bit.
[Figure: a PTE (V, R, D, address) on a track; the shifts performed for this access split into three kinds]
• Alignment shifts: aligning the first bit with the access port.
• Extra shifts: shifting through bits that this transition does not access.
• Necessary shifts: shifting through the bits that it does access.
Pre-Alignment Technique
Pre-align the track based on the PTE's state:
• In TLB: pre-align to the R bit.
• Not in TLB: pre-align to the V bit.
• Not in Memory: pre-align to the bit adjacent to the V bit.
Two advantages:
(1) The alignment shifts are removed from the PTE access path.
(2) The state of the PTE is known prior to reading it.
A sketch of the rule, together with the per-transition accessed fields, follows below.
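A minimal Python sketch combining the transition table and the pre-alignment rule; the PTE field offsets and widths are illustrative assumptions, not the layout used in the dissertation.

```python
# Transition table from the slide: which PTE fields each state transition touches.
ACCESSED_FIELDS = {
    ("not_in_memory", "in_tlb"):        ["V", "R", "D", "address"],  # write entire entry
    ("in_tlb",        "not_in_tlb"):    ["R", "D"],                  # update R and/or D
    ("not_in_tlb",    "in_tlb"):        ["V", "R", "D", "address"],  # read entire entry
    ("not_in_tlb",    "not_in_memory"): ["V"],                       # update V
}

# Illustrative PTE layout as (bit offset, width); the real layout may differ.
PTE_FIELDS = {"V": (0, 1), "R": (1, 1), "D": (2, 1), "address": (3, 61)}

def pre_alignment_target(state):
    # Park the port on the bit the next expected access will need.
    if state == "in_tlb":
        return PTE_FIELDS["R"][0]        # TLB eviction updates R and/or D
    if state == "not_in_tlb":
        return PTE_FIELDS["V"][0]        # a TLB miss reads the entry from V
    return PTE_FIELDS["V"][0] + 1        # not in memory: bit adjacent to V (assumed)

def necessary_shifts(transition):
    # With pre-alignment, only the accessed bits must pass under the port.
    fields = ACCESSED_FIELDS[transition]
    first = min(PTE_FIELDS[f][0] for f in fields)
    last = max(PTE_FIELDS[f][0] + PTE_FIELDS[f][1] for f in fields)
    return last - first - 1              # shifts between first and last accessed bit
```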
Placement – Motivating Example
Multiple PTEs can share a track, but this can create a conflict of interest.
[Figure: PTE0 and PTE1 share a track. At the beginning both are Not in Memory; later PTE0 becomes In TLB while PTE1 remains Not in Memory, so the two PTEs now prefer different pre-alignment positions for the shared port.]
Idea: place PTEs with minimum conflict on the same track.
Placement
PTEs are stored at the granularity of pages (PTPages).
• PTEs in the same PTPage have high spatial locality ⇒ highly conflicting.
• PTEs of different PTPages have less spatial locality ⇒ less conflicting.
Horizontal placement: consecutive PTEs of a PTPage are packed onto the same track (Track0 holds PTE0 and PTE1 of PTPage0, Track1 holds PTE2 and PTE3, and so on), so highly conflicting PTEs share a track.
Vertical placement (proposed): the PTEs are rearranged across the tracks so that the PTEs sharing a track are not neighbouring entries of the same PTPage, keeping the highly conflicting PTEs apart.
Evaluation: Methodology
• Benchmarks: SPEC 2000/2006; TLB traces collected with Pin.
• Instruction and data TLBs: 4-way associative, 128 entries each.
• Main memory: 4GB, 4KB page size.
• Page table reserved size: 2MB.
Compared with:
• DWM-based baseline: DWM main memory, no pre-alignment, horizontal placement.
• Traditional DRAM-based baseline: DRAM main memory.
Groups of Benchmarks
Benchmarks
Group1 gemsFDTD, gobmk, gromacs, h264ref
Group2 astar, bwaves, bzip, cactus
Group3 calculix, deal, gamess, gcc
Group4 hmmer, zeusmp, leslie3d, libquantum
Group5 mcf, milc, namd, omnetpp
Group6 perlbench, sjeng, soplex, tonto
Benchmarks are grouped to simulate multi-process environment
Comparison with the DWM-based Baseline
[Figure: latency of address translation, latency and energy of context switching, and energy of address translation for Groups 1-6, comparing BaseLine, Proposed, and (where shown) the Lower bound]
• Latency of address translation: 69% reduction.
• Energy of address translation: 71% reduction.
• Latency and energy of context switching: 15% reduction.
• Lower bound = necessary shifts only.
Outline
• Introduction
• Prolonging PCM Limited Lifetime
• Addressing PCM Write Latency and Energy
• Reducing DWM Access Latency
• Summary and Future Works
Summary
• Showed the necessity of changing the main memory technology.
• Introduced two candidates:
  − Phase Change Memory (PCM)
  − Domain Wall Memory (DWM)
• Prolonged PCM main memory lifetime
  − Proposed a segment-aware (age-aware) page allocation.
• Overcame the write limitations of PCM
  − Proposed a conflict-aware page allocation for the PCM-DRAM hybrid main memory.
• Improved the performance and energy of page table accesses
  − Proposed a pre-alignment technique as well as a new PTE placement in DWM main memory.
Future Work
[Figure: the memory hierarchy (registers, caches, main memory, storage) annotated with the completed work on PCM-based and DWM-based main memory and the proposed future work on a DWM-based stack architecture and an STT-RAM-based cache]
Future Work – DWM-based Stack Architecture
Energy harvesting devices are becoming popular.
Concerns: (1) intermittent power supply; (2) devices tend to get smaller every day.
Solutions: non-volatile register files [1], or an alternative architecture such as a stack architecture (my proposal); DWM is a good choice for the stack.
Goal: improve the reliability, energy, and execution time of energy harvesting devices.
[1] K. Ma, et al., "Nonvolatile Processor Architecture Exploration for Energy-Harvesting Applications," in MICRO, 2015.
Future Work – STT-RAM-based Cache
• Over 50% of chip area is devoted to caches.
• STT-RAM is more scalable than SRAM.
• STT-RAM has a long write latency and high write energy; however, both can be traded off against the retention (non-volatility) time.

Retention Time          10 years     1 sec        10 ms
Write Latency @2GHz     22 cycles    12 cycles    6 cycles
[Jog, DAC'12]

Goal: explore the flexibility of the cache replacement algorithm to reduce the required retention time and improve access latency and energy.
Thank You!
Q&A