ThyNVM Software-Transparent Crash Consistency
for Persistent Memory
Onur Mutlu [email protected]
(joint work with Jinglei Ren, Jishen Zhao, Samira Khan, Jongmoo Choi, Yongwei Wu)
August 8, 2016 Flash Memory Summit 2016, Santa Clara, CA
Original Paper (I)
2
Original Paper (II) n Presented at ACM/IEEE MICRO Conference in Dec 2015.
n Full paper for details: q Jinglei Ren, Jishen Zhao, Samira Khan, Jongmoo Choi, Yongwei
Wu, and Onur Mutlu, "ThyNVM: Enabling Software-Transparent Crash Consistency in Persistent Memory Systems" Proceedings of the 48th International Symposium on Microarchitecture (MICRO), Waikiki, Hawaii, USA, December 2015. [Slides (pptx) (pdf)] [Lightning Session Slides (pptx) (pdf)] [Poster (pptx) (pdf)] [Source Code]
q https://users.ece.cmu.edu/~omutlu/pub/ThyNVM-transparent-crash-consistency-for-persistent-memory_micro15.pdf
3
The Main Memory System
n Main memory is a critical component of all computing systems: server, mobile, embedded, desktop, sensor
n Main memory system must scale (in size, technology, efficiency, cost, and management algorithms) to maintain performance growth and technology scaling benefits
4
Processor and caches
Main Memory Storage (SSD/HDD)
Limits of Charge Memory n Difficult charge placement and control
q Flash: floating gate charge q DRAM: capacitor charge, transistor leakage
n Reliable sensing becomes difficult as charge storage unit size reduces
5
Emerging NVM Technologies n Some emerging resistive memory technologies seem more
scalable than DRAM (and they are non-volatile)
n Example: Phase Change Memory q Data stored by changing phase of material q Data read by detecting material’s resistance q Expected to scale to 9nm (2022 [ITRS]) q Prototyped at 20nm (Raoux+, IBM JRD 2008) q Expected to be denser than DRAM: can store multiple bits/cell
n But, emerging technologies have (many) shortcomings q Can they be enabled to replace/augment/surpass DRAM?
6
Promising NVM Technologies n PCM
q Inject current to change material phase q Resistance determined by phase
n STT-MRAM q Inject current to change magnet polarity q Resistance determined by polarity
n Memristors/RRAM/ReRAM q Inject current to change atomic structure q Resistance determined by atom distance
7
NVM as Main Memory Replacement n Very promising
q persistence, high capacity, OK latency, low idle power
n Can enable merging of memory and storage
n Two example works that show benefits q Lee, Ipek, Mutlu, Burger, “Architecting Phase Change
Memory as a Scalable DRAM Alternative,” ISCA 2009. q Kultursay, Kandemir, Sivasubramaniam, Mutlu,
“Evaluating STT-RAM as an Energy-Efficient Main Memory Alternative,” ISPASS 2013.
8
Two Example Works
9
Architected STT-MRAM as Main Memory n 4-core, 4GB main memory, multiprogrammed workloads n ~6% performance loss, ~60% energy savings vs. DRAM
10
[Charts: performance vs. DRAM (88%-98%) for STT-RAM (base) and STT-RAM (opt), and energy vs. DRAM (0%-100%) broken into ACT+PRE, WB, and RB components.]
Kultursay+, “Evaluating STT-RAM as an Energy-Efficient Main Memory Alternative,” ISPASS 2013.
A More Viable Approach: Hybrid Memory Systems
Meza+, “Enabling Efficient and Scalable Hybrid Memories,” IEEE Comp. Arch. Letters, 2012. Yoon+, “Row Buffer Locality Aware Caching Policies for Hybrid Memories,” ICCD 2012 Best Paper Award.
[Diagram: CPU with a DRAM controller and an NVM controller. DRAM: fast, durable, but small, leaky, volatile, high-cost. NVM (Type X): large, non-volatile, low-cost, but slow, wears out, high active energy.]
Hardware/software manage data allocation and movement to achieve the best of multiple technologies
Some Opportunities with Emerging Technologies
n Merging of memory and storage q e.g., a single interface to manage all data
n New applications q e.g., ultra-fast checkpoint and restore
n More robust system design q e.g., reducing data loss
n Processing tightly-coupled with memory q e.g., enabling efficient search and filtering
12 Meza+, “A Case for Efficient Hardware-Software Cooperative Management of Storage and Memory,” WEED 2013.
TWO-LEVEL STORAGE MODEL
[Diagram: the CPU accesses MEMORY (volatile, fast, byte-addressable: DRAM) via Ld/St, and STORAGE (nonvolatile, slow, block-addressable) via FILE I/O.]
13
TWO-LEVEL STORAGE MODEL
[Diagram: the same two-level model, with NVM (PCM, STT-RAM) bridging the two levels.]
14
Non-volatile memories combine characteristics of memory and storage
Coordinated Memory and Storage with NVM (I)
n The traditional two-level storage model is a bottleneck with NVM
q Volatile data in memory → a load/store interface
q Persistent data in storage → a file system interface
q Problem: Operating system (OS) and file system (FS) code to locate, translate, and buffer data become performance and energy bottlenecks with fast NVM stores (see the sketch after the diagram below)
15
Two-Level Store
[Diagram: the processor and caches reach Main Memory through Load/Store with virtual memory address translation, and reach Storage (SSD/HDD) through the operating system and file system (fopen, fread, fwrite, ...). Persistent (e.g., Phase-Change) Memory could serve both roles.]
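To make the bottleneck concrete, here is a minimal C sketch of the same one-integer update through both interfaces; the file name, array, and index are hypothetical, and error handling is mostly omitted:

#include <stdio.h>

/* Two-level store: the OS and file system must locate, translate, and
   buffer the data on every update. */
void update_two_level(long i) {
    int v;
    FILE *f = fopen("counts.dat", "r+b");   /* FS metadata lookup (syscalls) */
    if (f == NULL) return;
    fseek(f, i * (long)sizeof(int), SEEK_SET);
    fread(&v, sizeof v, 1, f);              /* buffer the block into DRAM */
    v++;
    fseek(f, i * (long)sizeof(int), SEEK_SET);
    fwrite(&v, sizeof v, 1, f);             /* write back through the FS */
    fclose(f);
}

/* Single-level store: the persistent array is mapped into the address
   space, so the update is ordinary loads and stores, no system calls. */
void update_persistent(int *counts, long i) {
    counts[i]++;
}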
Coordinated Memory and Storage with NVM (II)
n Goal: Unify memory and storage management in a single unit to eliminate wasted work to locate, transfer, and translate data q Improves both energy and performance q Simplifies programming model as well
16
Unified Memory/Storage
[Diagram: the processor and caches issue Load/Store requests to a Persistent Memory Manager, which manages Persistent (e.g., Phase-Change) Memory and provides feedback.]
Meza+, “A Case for Efficient Hardware-Software Cooperative Management of Storage and Memory,” WEED 2013.
PERSISTENT MEMORY
[Diagram: the CPU accesses PERSISTENT MEMORY (NVM) directly via Ld/St.]
Provides an opportunity to manipulate persistent data directly
17
Performance Benefits of Persistent Memory
18
[Chart: normalized execution time, broken into User CPU, User Memory, Syscall CPU, and Syscall I/O, for HDD 2-level, NVM 2-level, and Persistent Memory systems. NVM 2-level cuts execution time to 0.044 (~24X); the persistent memory organization cuts it further to 0.009 (a further ~5X).]
Meza+, “A Case for Efficient Hardware-Software Cooperative Management of Storage and Memory,” WEED 2013.
Energy Benefits of Persistent Memory
19
[Chart: fraction of total energy, broken into User CPU, Syscall CPU, DRAM, NVM, and HDD, for HDD 2-level, NVM 2-level, and Persistent Memory systems. NVM 2-level reduces energy to 0.065 (~16X); the persistent memory organization reduces it further to 0.013 (a further ~5X).]
Meza+, “A Case for Efficient Hardware-Software Cooperative Management of Storage and Memory,” WEED 2013.
On Persistent Memory Benefits & Challenges
n Justin Meza, Yixin Luo, Samira Khan, Jishen Zhao, Yuan Xie, and Onur Mutlu, "A Case for Efficient Hardware-Software Cooperative Management of Storage and Memory" Proceedings of the 5th Workshop on Energy-Efficient Design (WEED), Tel-Aviv, Israel, June 2013. Slides (pptx) Slides (pdf)
20
The Persistent Memory Manager (PMM) n Exposes a load/store interface to access persistent data
q Applications can directly access persistent memory → no conversion, translation, location overhead for persistent data
n Manages data placement, location, persistence, security q To get the best of multiple forms of storage
n Manages metadata storage and retrieval q This can lead to overheads that need to be managed
n Exposes hooks and interfaces for system software q To enable better data placement and management decisions
n Meza+, “A Case for Efficient Hardware-Software Cooperative Management of Storage and Memory,” WEED 2013.
21
The Persistent Memory Manager (PMM)
22
2.2.1 Efficient Hardware and Software Support: We propose to investigate the efficient software and hardware support needed for single-level stores. A single-level store system should provide an abstraction that maps persistent user data to physical addresses in memory. A software interface for programs would map a pointer to the actual persistent data. Programs would be able to access any part of the data using normal load and store instructions. Figure 2 shows two examples of high-level abstractions which could be provided to programs to access persistent data in a single-level store system. In it, a program creates a persistent file (Figure 2 left) or object (Figure 2 right) using the handle "file.dat" and allocates an array of integers in it. Later, perhaps after the application or system is restarted, when the program executes the updateValue() function, the system retrieves the persistent data for the same handle, and the program modifies its state. With such an abstraction, a single-level store can eliminate the OS system calls to transfer data to and from disk. In addition, it eliminates the need for a file system to track physical file addresses by traversing metadata (such as inodes) in the OS. In this way, single-level stores provide the opportunity to design a simple and efficient persistent data lookup system in hardware. We plan to research efficient ways to map files or objects to the virtual address space. In such a hardware-based design, the processor would manage how data handles correspond to physical addresses. Note that single-level stores can use alternative design choices, such as segments, to provide the high-level abstraction instead of files or objects. Regardless, segments, files, or objects will be mapped to physical addresses with hardware support. Prior works tried to make file lookup and update efficient in software [27, 28] in the presence of persistent memory, and other works proposed using complex and potentially inefficient hardware directory techniques (e.g., [15]). Our goal is to design fast and efficient techniques that take into account the byte addressability of persistent memory in a single-level store. To this end, we will research the following:
• The efficient use of hash table and B-tree indices for locating files in a single-level store.
• How techniques such as key-value stores can provide fast and efficient lookups in single-level stores.
• Policies for intelligently caching some entries of these indices in hardware to improve system performance.

Every access in the single-level store needs to be translated from a virtual address used by a program to a physical address used to access a device. We will investigate how to efficiently manage address translation so that locating data is simple and fast. We intend to explore the following directions to solve this problem:
• We will design mechanisms to predict access patterns based on program behavior and pre-compute virtual-to-physical address translations. We are interested in answering questions such as: What is the pattern of data accesses to a single-level store, and how can prefetching techniques be redesigned with single-level stores in mind to enable efficient address translation? How can simple application-level or profile-based hints on access patterns be communicated to and used by hardware to make address translation and prefetching efficient?
• We will design efficient translation lookaside buffer (TLB)-like structures which will cache the translation between virtual and physical addresses, but for a much larger amount of physical memory than in existing systems. In the presence of such a single-level store, many disparate data accesses could need a large translation table to be serviced effectively. To reduce the overhead of such hardware, we are interested in exploring whether TLB-like structures should favor storing translations only for particular classes of data, such as data with locality or data which is on the critical path of execution, which get the most benefit out of the limited structure space. In addition, we will investigate centralized versus distributed translation structures to design techniques that, for example, buffer private data translation entries near the processor, while distributing shared entries across processors to minimize translation overheads. Such translation structures [...]
int main(void) {
    // data in file.dat is persistent
    FILE myData = "file.dat";
    myData = new int[64];
}

void updateValue(int n, int value) {
    FILE myData = "file.dat";
    myData[n] = value;  // value is persistent
}

int main(void) {
    // data in file.dat is persistent
    int *myData = new PersistentObject("file.dat");
    myData = new int[64];
}

void updateValue(int n, int value) {
    int *myData = PersistentObject.open("file.dat");
    myData[n] = value;  // value is persistent
}
Figure 2: Sample program with access to file-based (left) and object-based (right) persistent data.
[Diagram: the Persistent Memory Manager sits in hardware between Load/Store accesses to persistent objects and a heterogeneous array of devices (DRAM, Flash, NVM, HDD); software supplies data layout, persistence, metadata, security, ..., plus hints from SW/OS/runtime.]
PMM uses access and hint information to allocate, locate, migrate and access data in the heterogeneous array of devices
One Key Challenge
n How to ensure consistency of system/data if all memory is persistent?
n Two extremes q Programmer transparent: Let the system handle it q Programmer only: Let the programmer handle it
n Many alternatives in-between…
23
CHALLENGE: CRASH CONSISTENCY
System crash can result in permanent data corruption in NVM
24
Persistent Memory System
CURRENT SOLUTIONS
Explicit interfaces to manage consistency
– NV-Heaps [ASPLOS'11], BPFS [SOSP'09], Mnemosyne [ASPLOS'11]
AtomicBegin { Insert a new node; } AtomicEnd;
Limits adoption of NVM
Have to rewrite code with a clear partition between volatile and non-volatile data
Burden on the programmers
25
OUR APPROACH: ThyNVM
26
Goal: Software-transparent crash consistency
in persistent memory systems
ThyNVM: Summary
27
• Checkpoints at multiple granularities to reduce both checkpointing latency and metadata overhead
• Overlaps checkpointing and execution to reduce checkpointing latency
• Adapts to DRAM and NVM characteristics
Performs within 4.9% of an idealized DRAM with zero-cost consistency
A new hardware-based checkpointing mechanism
OUTLINE
28
Crash Consistency Problem
Current Solutions
ThyNVM
Evaluation
Conclusion
CRASH CONSISTENCY PROBLEM
29
Add a node to a linked list
1. Link to next 2. Link to prev
System crash can result in inconsistent memory state
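A minimal C sketch of why the two-step insert is dangerous when the list lives in persistent memory (types and names are illustrative):

typedef struct node {
    struct node *next;
    struct node *prev;
    int value;
} node_t;

/* Insert new_node between prev and prev->next in a doubly linked list
   kept in NVM. The two list updates cannot both happen atomically. */
void list_insert_after(node_t *prev, node_t *new_node) {
    node_t *next = prev->next;
    new_node->next = next;
    new_node->prev = prev;
    prev->next = new_node;    /* 1. link to next */
    /* a crash here persists a half-updated list: prev->next and
       next->prev now disagree, and the corruption survives reboot */
    next->prev = new_node;    /* 2. link to prev */
}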
OUTLINE
30
Crash Consistency Problem
Current Solutions
ThyNVM
Evaluation
Conclusion
CURRENT SOLUTIONS
31
void hashtable_update(hashtable_t* ht, void *key, void *data) {
    list_t* chain = get_chain(ht, key);
    pair_t* pair;
    pair_t updatePair;
    updatePair.first = key;
    pair = (pair_t*) list_find(chain, &updatePair);
    pair->second = data;
}
Example code: update a node in a persistent hash table
Explicit interfaces to manage consistency – NV-Heaps [ASPLOS'11], BPFS [SOSP'09], Mnemosyne [ASPLOS'11]
CURRENT SOLUTIONS
32
void TMhashtable_update(TM_ARGDECL hashtable_t* ht, void *key, void *data) {
    list_t* chain = get_chain(ht, key);
    pair_t* pair;
    pair_t updatePair;
    updatePair.first = key;
    pair = (pair_t*) TMLIST_FIND(chain, &updatePair);
    pair->second = data;
}
CURRENT SOLUTIONS
33
Manual declaration of persistent components
CURRENT SOLUTIONS
34
Need a new implementation
CURRENT SOLUTIONS
35
Third-party code can be inconsistent
CURRENT SOLUTIONS
36
Prohibited operation
Burden on the programmers
OUTLINE
37
Crash Consistency Problem
Current Solutions
ThyNVM
Evaluation
Conclusion
OUR GOAL
38
Software transparent consistency in persistent memory systems
• Execute legacy applications
• Reduce burden on programmers
• Enable easier integration of NVM
NO MODIFICATION IN THE CODE
void hashtable_update(hashtable_t* ht, void *key, void *data) {
    list_t* chain = get_chain(ht, key);
    pair_t* pair;
    pair_t updatePair;
    updatePair.first = key;
    pair = (pair_t*) list_find(chain, &updatePair);
    pair->second = data;
}
RUN THE EXACT SAME CODE…
Persistent Memory System
Software-transparent memory crash consistency
40
ThyNVM APPROACH
[Timeline: execution is divided into epochs; each epoch runs and is then checkpointed (Epoch 0: Running → Checkpointing; Epoch 1: Running → Checkpointing; ...).]
41
Periodic checkpointing of data managed by hardware
Transparent to application and system
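A minimal sketch of the epoch idea in C (the two-copy cell below is illustrative, not ThyNVM's actual structure): every datum keeps a working copy and a committed checkpoint copy, and the end-of-epoch commit is a single atomic flip, so a crash always recovers to the end of the last completed epoch.

typedef struct {
    int slot[2];    /* two copies of the datum in NVM                   */
    int committed;  /* index of the last checkpointed (consistent) copy */
} ckpt_cell_t;

/* During an epoch, updates go only to the working copy. */
void epoch_write(ckpt_cell_t *c, int value) {
    c->slot[1 - c->committed] = value;
}

/* End of epoch: once the working copy is durably in NVM, flip the
   committed index; the flip is the epoch's single commit point. */
void checkpoint_commit(ckpt_cell_t *c) {
    c->committed = 1 - c->committed;
}

/* Crash recovery reads only committed copies. */
int recover_read(const ckpt_cell_t *c) {
    return c->slot[c->committed];
}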
CHECKPOINTING OVERHEAD
[Timeline: the same Running → Checkpointing epochs.]
42
1. Metadata overhead
[Metadata table: maps working locations to checkpoint locations (X → X′, Y → Y′).]
2. Checkpointing latency
1. METADATA AND CHECKPOINTING GRANULARITY
PAGE granularity: one entry per page → small metadata
CACHE-BLOCK granularity: one entry per block → huge metadata
43
[Diagram: metadata tables at page vs. block granularity, mapping working locations (X, Y) to checkpoint locations (X′, Y′).]
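The tradeoff is easy to quantify with a back-of-the-envelope sketch (the entry layout and capacities below are hypothetical, not ThyNVM's actual numbers):

#include <stdint.h>

/* One metadata entry: working location -> checkpoint location. */
typedef struct {
    uint64_t working_addr;
    uint64_t checkpoint_addr;   /* 16 bytes per entry */
} meta_entry_t;

/* For 16 GB of checkpointed data:
 *   page granularity (4 KB):  4M entries   -> ~64 MB of metadata
 *   block granularity (64 B): 256M entries -> ~4 GB of metadata
 * Fine granularity tracks updates precisely but the table explodes;
 * coarse granularity keeps the table small but checkpoints whole pages. */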
2. LATENCY AND LOCATION
44
DRAM-BASED WRITEBACK
Long latency of writing back data to NVM
[Diagram: working copy X in DRAM, checkpoint copy X′ in NVM. 1. Write back data from DRAM; 2. Update the metadata table.]
2. LATENCY AND LOCATION
45
NVM-BASED REMAPPING
Short latency in NVM-based remapping
[Diagram: data lives in NVM; writes go to a new location. 1. No copying of data; 2. Update the metadata table; 3. Write in a new location.]
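In C-style pseudocode, the latency difference between the two schemes comes down to whether any data has to move at checkpoint time (all helper names are illustrative stubs):

#include <stdint.h>
#include <stddef.h>

#define PAGE_SIZE 4096

void nvm_copy(uint64_t dst, uint64_t src, size_t len);    /* stub: copy data into NVM      */
void meta_update(uint64_t working, uint64_t checkpoint);  /* stub: metadata table update   */

/* DRAM-based writeback: checkpoint latency includes copying a full page. */
void checkpoint_writeback(uint64_t dram_page, uint64_t nvm_page) {
    nvm_copy(nvm_page, dram_page, PAGE_SIZE);   /* 1. write back data from DRAM */
    meta_update(dram_page, nvm_page);           /* 2. update the metadata table */
}

/* NVM-based remapping: the epoch's writes already went to a new NVM
   location, so checkpointing only commits the mapping: no data copy. */
void checkpoint_remap(uint64_t block_addr, uint64_t new_nvm_location) {
    meta_update(block_addr, new_nvm_location);
}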
ThyNVM KEY MECHANISMS
Checkpointing granularity
• Small granularity: large metadata
• Large granularity: small metadata
Latency and location
• Writeback from DRAM: long latency
• Remap in NVM: short latency
Based on these, we propose two key mechanisms
1. Dual-granularity checkpointing
2. Overlap of execution and checkpointing
1. DUAL-GRANULARITY CHECKPOINTING
High-write-locality pages in DRAM; low-write-locality pages in NVM
Page Writeback in DRAM: good for streaming writes
Block Remapping in NVM: good for random writes
47
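A sketch of the placement decision in C (the counter and threshold are hypothetical; ThyNVM's actual policy is more involved):

typedef enum {
    PAGE_WRITEBACK_DRAM,   /* dense, streaming writes: one page copy amortizes well */
    BLOCK_REMAP_NVM        /* sparse, random writes: avoid copying whole pages      */
} ckpt_scheme_t;

#define HOT_PAGE_THRESHOLD 16   /* hypothetical */

/* Classify a page by its write locality in the last epoch. */
ckpt_scheme_t choose_scheme(unsigned writes_to_page_last_epoch) {
    return (writes_to_page_last_epoch >= HOT_PAGE_THRESHOLD)
               ? PAGE_WRITEBACK_DRAM
               : BLOCK_REMAP_NVM;
}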
TRADEOFF SPACE
[Timelines: alternative epoch organizations across Epochs 0-2, varying epoch length and the degree of overlap between Running and Checkpointing.]
2. OVERLAPPING CHECKPOINTING AND EXECUTION
Hides the long latency of Page Writeback
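A sketch of the overlap in C-style pseudocode (function names are illustrative stubs): epoch N executes while epoch N-1's pages are written back in the background.

void run_one_epoch(int epoch);               /* stub: execute the program          */
void start_background_writeback(int epoch);  /* stub: persist that epoch's updates */
void wait_for_writeback(int epoch);          /* stub: completion barrier           */

void overlapped_epochs(void) {
    for (int epoch = 1; ; epoch++) {
        start_background_writeback(epoch - 1);  /* checkpoint the previous epoch...  */
        run_one_epoch(epoch);                   /* ...while the current one executes */
        wait_for_writeback(epoch - 1);          /* the previous checkpoint must
                                                   finish before this epoch commits  */
    }
}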
OUTLINE
50
Crash Consistency Problem
Current Solutions
ThyNVM
Evaluation
Conclusion
SYSTEM ARCHITECTURE
MEMORY ADDRESS SPACE
METHODOLOGY
53
Cycle-accurate x86 simulator: gem5
Comparison Points:
Ideal DRAM: DRAM-based, no cost for consistency
– Lowest latency system
Ideal NVM: NVM-based, no cost for consistency
– NVM has higher latency than DRAM
Journaling: Hybrid, commit dirty cache blocks
– Leverages DRAM to buffer dirty blocks
Shadow Paging: Hybrid, copy-on-write pages
– Leverages DRAM to buffer dirty pages
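Simplified sketches of the two baseline schemes (not the evaluated implementations; all types and helpers are illustrative):

#include <stddef.h>

typedef struct { char bytes[64]; }   block_t;   /* a dirty cache block */
typedef struct { char bytes[4096]; } page_t;    /* a memory page       */

void log_append(const block_t *b);          /* stub: append to a redo log in NVM   */
void log_commit(void);                      /* stub: durable, atomic commit record */
page_t *page_alloc_copy(const page_t *p);   /* stub: allocate and copy a page      */

/* Journaling: log each dirty cache block, then write one commit record;
   after the commit is durable, the blocks may be written in place. */
void journal_checkpoint(const block_t *dirty, size_t n) {
    for (size_t i = 0; i < n; i++)
        log_append(&dirty[i]);
    log_commit();
}

/* Shadow paging: copy-on-write an entire page; the commit later swaps
   a page-table pointer to the shadow copy atomically. */
page_t *shadow_write(const page_t *p) {
    return page_alloc_copy(p);
}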
ADAPTIVITY TO ACCESS PATTERN
54
[Two charts: write traffic to NVM, normalized (lower is better), for Journal, Shadow, and ThyNVM under RANDOM and SEQUENTIAL access patterns.]
ThyNVM adapts to both access patterns
Journaling is better for random and shadow paging is better for sequential access patterns
OVERLAPPING CHECKPOINTING AND EXECUTION
55
[Two charts: percentage of execution time spent on checkpointing (lower is better) for Journal, Shadow, and ThyNVM under RANDOM and SEQUENTIAL access patterns.]
Stalls the application for a negligible time
ThyNVM spends only 2.4%/5.5% of the execution time on checkpointing
Journaling and shadow paging can spend 35-45% of the execution time on checkpointing
PERFORMANCE OF LEGACY CODE
56
[Chart: normalized IPC of Ideal DRAM, Ideal NVM, and ThyNVM across gcc, bwaves, milc, leslie., soplex, Gems, lbm, and omnet.]
Provides consistency without significant performance overhead
Within -4.9%/+2.7% of an idealized DRAM/NVM system
KEY-VALUE STORE TX THROUGHPUT
Storage throughput close to Ideal DRAM
OUTLINE
58
Crash Consistency Problem
Current Solutions
ThyNVM
Evaluation
Conclusion
ThyNVM
59
• Checkpoints at multiple granularities to minimize both latency and metadata overhead
• Overlaps checkpointing and execution
• Adapts to DRAM and NVM characteristics
Can enable widespread adoption of persistent memory
A new hardware-based checkpointing mechanism,
with no programming effort
ThyNVM: Enabling Software-Transparent
Crash Consistency in Persistent Memory Systems
Source Code and More Available at http://persper.com/thynvm
Our Other FMS 2016 Talks n "A Large-Scale Study of Flash Memory Errors in the Field”
q Onur Mutlu (ETH Zurich & CMU) August 10 @ 3:50pm q Study of flash-based SSD errors in Facebook data centers
over the course of 4 years q First large-scale field study of flash memory reliability q Forum F-22: SSD Testing (Testing Track)
n Practical Threshold Voltage Distribution Modeling q Yixin Luo (CMU PhD Student) August 10 @ 4:20pm q Forum E-22: Controllers and Flash Technology
n "WARM: Improving NAND Flash Memory Lifetime with Write-hotness Aware Retention Management” q Saugata Ghose (CMU Researcher) August 10 @ 5:45pm q Forum C-22: SSD Concepts (SSDs Track)
61
Referenced Papers and Talks
n All are available at http://users.ece.cmu.edu/~omutlu/projects.htm http://users.ece.cmu.edu/~omutlu/talks.htm
n And, many other previous works on q NVM & Persistent Memory q DRAM q Hybrid memories q NAND flash memory
62
Thank you.
Feel free to email me with any questions & feedback
[email protected] http://users.ece.cmu.edu/~omutlu/
ThyNVM Software-Transparent Crash Consistency
for Persistent Memory
Onur Mutlu [email protected]
(joint work with Jinglei Ren, Jishen Zhao, Samira Khan, Jongmoo Choi, Yongwei Wu)
August 8, 2016 Flash Memory Summit 2016, Santa Clara, CA
References to Papers and Talks
Challenges and Opportunities in Memory
n Onur Mutlu, "Rethinking Memory System Design" Keynote talk at 2016 ACM SIGPLAN International Symposium on Memory Management (ISMM), Santa Barbara, CA, USA, June 2016. [Slides (pptx) (pdf)] [Abstract]
n Onur Mutlu and Lavanya Subramanian, "Research Problems and Opportunities in Memory Systems" Invited Article in Supercomputing Frontiers and Innovations (SUPERFRI), 2015.
66
Phase Change Memory As DRAM Replacement
n Benjamin C. Lee, Engin Ipek, Onur Mutlu, and Doug Burger, "Architecting Phase Change Memory as a Scalable DRAM Alternative" Proceedings of the 36th International Symposium on Computer Architecture (ISCA), pages 2-13, Austin, TX, June 2009. Slides (pdf)
n Benjamin C. Lee, Ping Zhou, Jun Yang, Youtao Zhang, Bo Zhao, Engin Ipek, Onur Mutlu, and Doug Burger, "Phase Change Technology and the Future of Main Memory" IEEE Micro, Special Issue: Micro's Top Picks from 2009 Computer Architecture Conferences (MICRO TOP PICKS), Vol. 30, No. 1, pages 60-70, January/February 2010.
67
STT-MRAM As DRAM Replacement
n Emre Kultursay, Mahmut Kandemir, Anand Sivasubramaniam, and Onur Mutlu, "Evaluating STT-RAM as an Energy-Efficient Main Memory Alternative" Proceedings of the 2013 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), Austin, TX, April 2013. Slides (pptx) (pdf)
68
Taking Advantage of Persistence in Memory n Justin Meza, Yixin Luo, Samira Khan, Jishen Zhao, Yuan Xie, and
Onur Mutlu, "A Case for Efficient Hardware-Software Cooperative Management of Storage and Memory" Proceedings of the 5th Workshop on Energy-Efficient Design (WEED), Tel-Aviv, Israel, June 2013. Slides (pptx) Slides (pdf)
n Jinglei Ren, Jishen Zhao, Samira Khan, Jongmoo Choi, Yongwei Wu, and Onur Mutlu, "ThyNVM: Enabling Software-Transparent Crash Consistency in Persistent Memory Systems" Proceedings of the 48th International Symposium on Microarchitecture (MICRO), Waikiki, Hawaii, USA, December 2015. [Slides (pptx) (pdf)] [Lightning Session Slides (pptx) (pdf)] [Poster (pptx) (pdf)] [Source Code]
69
Hybrid DRAM + NVM Systems (I) n HanBin Yoon, Justin Meza, Rachata Ausavarungnirun, Rachael Harding,
and Onur Mutlu, "Row Buffer Locality Aware Caching Policies for Hybrid Memories" Proceedings of the 30th IEEE International Conference on Computer Design (ICCD), Montreal, Quebec, Canada, September 2012. Slides (pptx) (pdf) Best paper award (in Computer Systems and Applications track).
n Justin Meza, Jichuan Chang, HanBin Yoon, Onur Mutlu, and Parthasarathy Ranganathan, "Enabling Efficient and Scalable Hybrid Memories Using Fine-Granularity DRAM Cache Management" IEEE Computer Architecture Letters (CAL), February 2012.
70
Hybrid DRAM + NVM Systems (II)
n Dongwoo Kang, Seungjae Baek, Jongmoo Choi, Donghee Lee, Sam H. Noh, and Onur Mutlu, "Amnesic Cache Management for Non-Volatile Memory" Proceedings of the 31st International Conference on Massive Storage Systems and Technologies (MSST), Santa Clara, CA, June 2015. [Slides (pdf)]
71
NVM Design and Architecture
n HanBin Yoon, Justin Meza, Naveen Muralimanohar, Norman P. Jouppi, and Onur Mutlu, "Efficient Data Mapping and Buffering Techniques for Multi-Level Cell Phase-Change Memories" ACM Transactions on Architecture and Code Optimization (TACO), Vol. 11, No. 4, December 2014. [Slides (ppt) (pdf)] Presented at the 10th HiPEAC Conference, Amsterdam, Netherlands, January 2015. [Slides (ppt) (pdf)]
n Justin Meza, Jing Li, and Onur Mutlu, "Evaluating Row Buffer Locality in Future Non-Volatile Main Memories" SAFARI Technical Report, TR-SAFARI-2012-002, Carnegie Mellon University, December 2012.
72
Our FMS Talks and Posters
• Onur Mutlu, ThyNVM: Software-Transparent Crash Consistency for Persistent Memory, FMS 2016.
• Onur Mutlu, Large-Scale Study of In-the-Field Flash Failures, FMS 2016.
• Yixin Luo, Practical Threshold Voltage Distribution Modeling, FMS 2016.
• Saugata Ghose, Write-hotness Aware Retention Management, FMS 2016.
• Onur Mutlu, Read Disturb Errors in MLC NAND Flash Memory, FMS 2015.
• Yixin Luo, Data Retention in MLC NAND Flash Memory, FMS 2015.
• Onur Mutlu, Error Analysis and Management for MLC NAND Flash Memory, FMS 2014.
• FMS 2016 posters:
- WARM: Improving NAND Flash Memory Lifetime with Write-hotness Aware Retention Management
- Read Disturb Errors in MLC NAND Flash Memory
- Data Retention in MLC NAND Flash Memory
73
Our Flash Memory Works (I)
1. Retention noise study and management
1) Yu Cai, Gulay Yalcin, Onur Mutlu, Erich F. Haratsch, Adrian Cristal, Osman Unsal, and Ken Mai, Flash Correct-and-Refresh: Retention-Aware Error Management for Increased Flash Memory Lifetime, ICCD 2012.
2) Yu Cai, Yixin Luo, Erich F. Haratsch, Ken Mai, and Onur Mutlu, Data Retention in MLC NAND Flash Memory: Characterization, Optimization and Recovery, HPCA 2015.
3) Yixin Luo, Yu Cai, Saugata Ghose, Jongmoo Choi, and Onur Mutlu, WARM: Improving NAND Flash Memory Lifetime with Write-hotness Aware Retention Management, MSST 2015.
2. Flash-based SSD prototyping and testing platform
4) Yu Cai, Erich F. Haratsch, Mark McCartney, Ken Mai, FPGA-based solid-state drive prototyping platform, FCCM 2011.
74
Our Flash Memory Works (II)
3. Overall flash error analysis
5) Yu Cai, Erich F. Haratsch, Onur Mutlu, and Ken Mai, Error Patterns in MLC NAND Flash Memory: Measurement, Characterization, and Analysis, DATE 2012.
6) Yu Cai, Gulay Yalcin, Onur Mutlu, Erich F. Haratsch, Adrian Cristal, Osman Unsal, and Ken Mai, Error Analysis and Retention-Aware Error Management for NAND Flash Memory, ITJ 2013.
4. Program and erase noise study
7) Yu Cai, Erich F. Haratsch, Onur Mutlu, and Ken Mai, Threshold Voltage Distribution in MLC NAND Flash Memory: Characterization, Analysis and Modeling, DATE 2013.
75
Our Flash Memory Works (III)
5. Cell-to-cell interference characterization and tolerance
8) Yu Cai, Onur Mutlu, Erich F. Haratsch, and Ken Mai, Program Interference in MLC NAND Flash Memory: Characterization, Modeling, and Mitigation, ICCD 2013.
9) Yu Cai, Gulay Yalcin, Onur Mutlu, Erich F. Haratsch, Osman Unsal, Adrian Cristal, and Ken Mai, Neighbor-Cell Assisted Error Correction for MLC NAND Flash Memories, SIGMETRICS 2014.
6. Read disturb noise study
10) Yu Cai, Yixin Luo, Saugata Ghose, Erich F. Haratsch, Ken Mai, and Onur Mutlu, Read Disturb Errors in MLC NAND Flash Memory: Characterization and Mitigation, DSN 2015.
76
Our Flash Memory Works (IV)
7. Flash errors in the field
11) Justin Meza, Qiang Wu, Sanjeev Kumar, and Onur Mutlu, A Large-Scale Study of Flash Memory Errors in the Field, SIGMETRICS 2015.
8. Persistent memory
12) Jinglei Ren, Jishen Zhao, Samira Khan, Jongmoo Choi, Yongwei Wu, and Onur Mutlu, ThyNVM: Enabling Software-Transparent Crash Consistency in Persistent Memory Systems, MICRO 2015.
77
Referenced Papers and Talks
n All are available at http://users.ece.cmu.edu/~omutlu/projects.htm http://users.ece.cmu.edu/~omutlu/talks.htm
n And, many other previous works on NAND flash memory errors and management
78
Related Videos and Course Materials n Undergraduate Computer Architecture Course Lecture
Videos (2013, 2014, 2015)
n Undergraduate Computer Architecture Course Materials (2013, 2014, 2015)
n Graduate Computer Architecture Lecture Videos (2013, 2015)
n Parallel Computer Architecture Course Materials (Lecture Videos)
n Memory Systems Short Course Materials (Lecture Video on Main Memory and DRAM Basics)
79
Additional Slides on Persistent Memory and NVM
Phase Change Memory: Pros and Cons n Pros over DRAM
q Better technology scaling (capacity and cost) q Non volatility q Low idle power (no refresh)
n Cons q Higher latencies: ~4-15x DRAM (especially write) q Higher active energy: ~2-50x DRAM (especially write) q Lower endurance (a cell dies after ~10^8 writes) q Reliability issues (resistance drift)
n Challenges in enabling PCM as DRAM replacement/helper: q Mitigate PCM shortcomings q Find the right way to place PCM in the system
81
PCM-based Main Memory (I) n How should PCM-based (main) memory be organized?
n Hybrid PCM+DRAM [Qureshi+ ISCA’09, Dhiman+ DAC’09]: q How to partition/migrate data between PCM and DRAM
82
PCM-based Main Memory (II) n How should PCM-based (main) memory be organized?
n Pure PCM main memory [Lee et al., ISCA’09, Top Picks’10]:
q How to redesign entire hierarchy (and cores) to overcome PCM shortcomings
83
An Initial Study: Replace DRAM with PCM n Lee, Ipek, Mutlu, Burger, “Architecting Phase Change
Memory as a Scalable DRAM Alternative,” ISCA 2009. q Surveyed prototypes from 2003-2008 (e.g. IEDM, VLSI, ISSCC) q Derived “average” PCM parameters for F=90nm
84
Results: Naïve Replacement of DRAM with PCM n Replace DRAM with PCM in a 4-core, 4MB L2 system n PCM organized the same as DRAM: row buffers, banks, peripherals n 1.6x delay, 2.2x energy, 500-hour average lifetime
n Lee, Ipek, Mutlu, Burger, “Architecting Phase Change Memory as a
Scalable DRAM Alternative,” ISCA 2009. 85
Results: Architected PCM as Main Memory n 1.2x delay, 1.0x energy, 5.6-year average lifetime n Scaling improves energy, endurance, density
n Caveat 1: Worst-case lifetime is much shorter (no guarantees) n Caveat 2: Intensive applications see large performance and energy hits n Caveat 3: Optimistic PCM parameters?
86
A More Viable Approach: Hybrid Memory Systems
Meza+, “Enabling Efficient and Scalable Hybrid Memories,” IEEE Comp. Arch. Letters, 2012. Yoon+, “Row Buffer Locality Aware Caching Policies for Hybrid Memories,” ICCD 2012 Best Paper Award.
[Diagram: CPU with a DRAM controller and a PCM controller. DRAM: fast, durable, but small, leaky, volatile, high-cost. Phase Change Memory (or Tech. X): large, non-volatile, low-cost, but slow, wears out, high active energy.]
Hardware/software manage data allocation and movement to achieve the best of multiple technologies
Data Placement Between DRAM and PCM n Idea: Characterize data access patterns and guide data
placement in hybrid memory
n Streaming accesses: As fast in PCM as in DRAM
n Random accesses: Much faster in DRAM
n Idea: Place random access data with some reuse in DRAM; streaming data in PCM
n Yoon+, “Row Buffer Locality-Aware Data Placement in Hybrid Memories,” ICCD 2012 Best Paper Award.
88
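A sketch of the resulting caching test in C (thresholds are hypothetical; see Yoon+ ICCD 2012 for the actual RBLA policy):

#include <stdbool.h>

/* Cache a PCM row in DRAM only if it often misses the row buffer
   (DRAM helps most on row-buffer misses) and shows enough reuse to
   amortize the cost of migrating it. */
bool should_cache_in_dram(unsigned row_buffer_misses, unsigned accesses) {
    const unsigned MISS_THRESHOLD  = 4;   /* hypothetical */
    const unsigned REUSE_THRESHOLD = 2;   /* hypothetical */
    return row_buffer_misses >= MISS_THRESHOLD && accesses >= REUSE_THRESHOLD;
}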
Hybrid vs. All-PCM/DRAM [ICCD'12]
[Charts: normalized weighted speedup, maximum slowdown, and performance per watt for 16GB PCM, RBLA-Dyn, and 16GB DRAM.]
31% better performance than all-PCM, within 29% of all-DRAM performance
Yoon+, "Row Buffer Locality-Aware Data Placement in Hybrid Memories," ICCD 2012 Best Paper Award.
STT-MRAM as Main Memory n Magnetic Tunnel Junction (MTJ) device
q Reference layer: Fixed magnetic orientation q Free layer: Parallel or anti-parallel
n Magnetic orientation of the free layer determines logical state of device q High vs. low resistance
n Write: Push large current through MTJ to change orientation of free layer
n Read: Sense current flow
n Kultursay et al., “Evaluating STT-RAM as an Energy-Efficient Main Memory Alternative,” ISPASS 2013.
[Diagram: MTJ structure with a fixed reference layer and a free layer separated by a barrier; parallel orientation encodes logical 0, anti-parallel encodes logical 1. Each cell pairs an MTJ with an access transistor connected to a word line, bit line, and sense line.]
STT-MRAM: Pros and Cons n Pros over DRAM
q Better technology scaling q Non volatility q Low idle power (no refresh)
n Cons q Higher write latency q Higher write energy q Reliability?
n Another level of freedom q Can trade off non-volatility for lower write latency/energy (by
reducing the size of the MTJ)
91
Architected STT-MRAM as Main Memory n 4-core, 4GB main memory, multiprogrammed workloads n ~6% performance loss, ~60% energy savings vs. DRAM
92
[Charts: performance vs. DRAM (88%-98%) for STT-RAM (base) and STT-RAM (opt), and energy vs. DRAM (0%-100%) broken into ACT+PRE, WB, and RB components.]
Kultursay+, “Evaluating STT-RAM as an Energy-Efficient Main Memory Alternative,” ISPASS 2013.
Other Opportunities with Emerging Technologies
n Merging of memory and storage q e.g., a single interface to manage all data
n New applications q e.g., ultra-fast checkpoint and restore
n More robust system design q e.g., reducing data loss
n Processing tightly-coupled with memory q e.g., enabling efficient search and filtering
93
Coordinated Memory and Storage with NVM (I)
n The traditional two-level storage model is a bottleneck with NVM
q Volatile data in memory → a load/store interface
q Persistent data in storage → a file system interface
q Problem: Operating system (OS) and file system (FS) code to locate, translate, and buffer data become performance and energy bottlenecks with fast NVM stores
94
Two-Level Store
[Diagram: the processor and caches reach Main Memory through Load/Store with virtual memory address translation, and reach Storage (SSD/HDD) through the operating system and file system (fopen, fread, fwrite, ...). Persistent (e.g., Phase-Change) Memory could serve both roles.]
Coordinated Memory and Storage with NVM (II)
n Goal: Unify memory and storage management in a single unit to eliminate wasted work to locate, transfer, and translate data q Improves both energy and performance q Simplifies programming model as well
95
Unified Memory/Storage
[Diagram: the processor and caches issue Load/Store requests to a Persistent Memory Manager, which manages Persistent (e.g., Phase-Change) Memory and provides feedback.]
Meza+, “A Case for Efficient Hardware-Software Cooperative Management of Storage and Memory,” WEED 2013.
The Persistent Memory Manager (PMM) n Exposes a load/store interface to access persistent data
q Applications can directly access persistent memory → no conversion, translation, location overhead for persistent data
n Manages data placement, location, persistence, security q To get the best of multiple forms of storage
n Manages metadata storage and retrieval q This can lead to overheads that need to be managed
n Exposes hooks and interfaces for system software q To enable better data placement and management decisions
n Meza+, “A Case for Efficient Hardware-Software Cooperative Management of Storage and Memory,” WEED 2013.
96
The Persistent Memory Manager (PMM)
97
[Diagram: the Persistent Memory Manager sits in hardware between Load/Store accesses to persistent objects and a heterogeneous array of devices (DRAM, Flash, NVM, HDD); software supplies data layout, persistence, metadata, security, ..., plus hints from SW/OS/runtime.]
PMM uses access and hint information to allocate, locate, migrate and access data in the heterogeneous array of devices
Performance Benefits of a Single-Level Store
98
[Chart: normalized execution time, broken into User CPU, User Memory, Syscall CPU, and Syscall I/O, for HDD 2-level, NVM 2-level, and Persistent Memory systems. NVM 2-level cuts execution time to 0.044 (~24X); the persistent memory organization cuts it further to 0.009 (a further ~5X).]
Meza+, “A Case for Efficient Hardware-Software Cooperative Management of Storage and Memory,” WEED 2013.
Energy Benefits of a Single-Level Store
99
[Chart: fraction of total energy, broken into User CPU, Syscall CPU, DRAM, NVM, and HDD, for HDD 2-level, NVM 2-level, and Persistent Memory systems. NVM 2-level reduces energy to 0.065 (~16X); the persistent memory organization reduces it further to 0.013 (a further ~5X).]
Meza+, “A Case for Efficient Hardware-Software Cooperative Management of Storage and Memory,” WEED 2013.