pmfs - pages.cs.wisc.edu
TRANSCRIPT
PMFS
CS 839 - Persistence
Learning objectives
• Appreciate the difference between redo & undo logging
• Understand how hardware features can optimize PM software
• Understand XXX
Project
• Proposals should be presented in class on 10/14 – 2 weeks
• Some ideas are up on the web page
• Work in groups of 2-3; I can help find groups if you need
• You can overlap with your own research & courses, but needs to be a distinct effort (can’t turn in same regular-sized project for two courses)
Notes from reviews
• When making a claim, please explain:
• “Mapping PM to kernel's virtual address space is not secure.”
• Please say why: what bad thing could happen, and why this is different from current systems. If current systems already have the same flaw, it is generally an accepted risk
• How should we think about papers on PM from before PM was commercially available? What do we expect for evaluation?
Background story
• Intel is developing 3D XPoint
• Others have proposed various crazy file system designs
• Their engineers/researchers have deep knowledge of real architectural issues, architectural features
• Seek to show various low-level options
What are the biggest ideas in PMFS?
Big ideas in PMFS
• Fine-grained redo logging
• Leveraging unique Intel processor features for PMFS:
• Hardware transactions
• Memory ordering rules
• Write-protect disable
• Intel’s programming model for PM (with hardware)
• Use of large pages
What are the concerns PMFS addresses?
• How to control ordering
• How to provide atomicity
• How to handle memory mapping
• How to handle stray writes
• …
Real Intel support for persistent memory
• Problem:
• How to implement flush/fence operations?
• Options:
• MTRR to enable write-through caching
• NTSTORE – what happens if data is in the cache?
• SFENCE – only waits for data to hit the controller, not memory
• Solution:
• CLFLUSHOPT – asynchronous flush
• PM_WBARRIER – guarantees durability of data previously flushed
• All flushes from any core? Only flushes from this core?
Hardware support for atomic & ordered writes
• 8 byte: regular store instruction
• 16 byte: cmpxchg16b – compare with RDX:RAX and swap with data from RCX:RBX
• Useful for setting size & modification time atomically (the BPFS problem)
• 64 byte: transactions using Intel hardware transactional memory
• XBEGIN
• Write to a cache line multiple times
• XEND
• CLFLUSH
• Writes to the same cacheline happen in order
• STORE A, 14
• STORE A+8, 27
• Guarantees 27 never reaches pmem if 14 doesn't also reach pmem
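As a toy illustration of why the 8-byte atomic store case matters (pack size and modification time into a single word so a crash can never expose a torn pair), here is a Python sketch; the `PMWord` class and its field layout are invented for illustration, not the PMFS format:

```python
import struct

class PMWord:
    """Toy model of PM where only whole 8-byte stores are atomic:
    one store makes both 4-byte fields durable together."""
    def __init__(self):
        self.value = struct.pack("<II", 0, 0)   # size=0, mtime=0

    def store8(self, size, mtime):
        # A single 8-byte store: size and mtime persist together.
        self.value = struct.pack("<II", size, mtime)

    def load(self):
        return struct.unpack("<II", self.value)

pm = PMWord()
pm.store8(4096, 17)                 # update size & mtime atomically
assert pm.load() == (4096, 17)

# Contrast: with two separate 4-byte stores, a crash in between
# could leave the torn state (new size, old mtime).
torn = struct.pack("<II", 4096, 0)
assert struct.unpack("<II", torn) == (4096, 0)
```

The 16-byte cmpxchg16b variant extends the same idea to a pair of 8-byte fields in one cacheline.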
Mapping PM into kernel address space
• Why do this?
[Figure: volatile data and PM both mapped into the kernel virtual address space]
Protection from stray writes
• How real is the problem?
• How is this handled in normal file systems?
Handling stray writes
• Hardware protection: block writes in hardware
• Software protection: block writes in software• Type-safety
• Software fault isolation
• Software protection: hide PM• Map it far away from anything else
• Don’t tell normal code the real address
• Only reveal the real address to internal, super-correct code
Write protection in hardware
• What hardware features are possible?
• Kernel on kernel
• Kernel on user
• User on kernel
• User on user
Idea for protection
• Have accessor functions for reading/writing PM
• Normal stores shouldn't have access
• Accessors enable access, do the write, disable access
• Example: TX_BEGIN allows access / TX_END removes access
• Example: D_RW(ptr) allows access, *ptr does not
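The accessor idea can be sketched in a few lines; this is a toy Python model (the `ProtectedPM` class and `tx_begin`/`tx_end` method names are invented here), not the PMFS implementation:

```python
class ProtectedPM:
    """Toy accessor-based protection: writes to 'PM' succeed only
    inside a TX_BEGIN/TX_END window; stray writes raise."""
    def __init__(self, n):
        self.cells = [0] * n
        self.writable = False

    def tx_begin(self):      # like TX_BEGIN: open the write window
        self.writable = True

    def tx_end(self):        # like TX_END: close the write window
        self.writable = False

    def write(self, i, v):   # like D_RW(ptr): the only legal write path
        if not self.writable:
            raise PermissionError("stray write to PM blocked")
        self.cells[i] = v

pm = ProtectedPM(4)
pm.tx_begin(); pm.write(0, 42); pm.tx_end()
assert pm.cells[0] == 42

try:
    pm.write(1, 7)           # stray write outside any transaction
    assert False, "stray write should have been blocked"
except PermissionError:
    pass
```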
Write protection in hardware
• What hardware features are possible?
• Kernel on kernel
• CR0.WP: disable write protection, for compatibility with the 80386 (1985)
• Kernel on user
• SMAP: blocks kernel access to user data – prevents tricking the kernel into revealing user data (2012)
• User on kernel
• Normal page permissions
• User on user
• MPK: memory protection keys allow turning access on/off from usermode
[Figure: with CR0 WP=1, normal data is R/W but PM is read-only; setting WP=0 temporarily makes PM R/W in the same address space]
Intel Memory Protection Keys (MPK)
• Available in Skylake server CPUs
• Tag memory pages with a PKEY in the page table entry (PTE)
• Permission Register (PKRU): per-core register with read/write permission bits for each of the 16 keys
• Userspace instruction to update PKRU
• Fast switch: 11–260 cycles/switch (~50 typical)
• By itself, MPK does not protect against malicious attacks
[Figure: pages in the address space tagged with a PKEY in their PTEs; the CPU core's PKRU register holds R and W bits for keys 0–15]
Using Memory protection keys
Safe_write(object *ptr, object &some_data) {
    MPK_WRPKRU(0b0...01100);   // switch PKRU to the PM-writable setting
    *ptr = some_data;          // write to PM
    MPK_WRPKRU(0b0...00000);   // switch PKRU back to the default setting
}
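For intuition about the PKRU bits, here is a small Python model of the permission check; the 2-bits-per-key layout (access-disable in bit 0, write-disable in bit 1 of each pair) follows Intel's documented PKRU format, but `can_access` is an invented helper for illustration:

```python
# Toy model of PKRU: 32-bit register, 2 bits per key, 16 keys.
AD, WD = 0, 1   # bit offsets within a key's 2-bit field

def can_access(pkru, key, write):
    """Return True if the PKRU value permits this access to a page
    tagged with the given protection key."""
    field = (pkru >> (2 * key)) & 0b11
    if field & (1 << AD):
        return False                 # access-disable: no reads or writes
    if write and field & (1 << WD):
        return False                 # write-disable: reads only
    return True

pkru = 0
pkru |= (1 << WD) << (2 * 3)         # write-disable key 3
assert can_access(pkru, 3, write=False)       # reads still allowed
assert not can_access(pkru, 3, write=True)    # writes blocked
assert can_access(pkru, 0, write=True)        # other keys unaffected
```

A Safe_write-style accessor just flips the PM key's WD bit around each store.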
Efficient layout
• What layout changes are possible with NVM?
Efficient layout
• What layout changes are possible with NVM?
• Use memory-optimized data structures: B-tree
• (was also good for disk, but can choose different block sizes)
• Allocate blocks in MMU sizes: 4KB, 2MB, 1GB
• Allows use of huge pages in the TLB
• Policy: when to use large pages?
What consistency mechanism is best?
• Logging/journaling?
• Shadow paging/CoW?
What consistency mechanism is best? Why?
• Logging/journaling?
• Good for small updates – not much data to write twice
• Low write amplification
• Shadow paging/CoW?
• Good for large writes – avoid writing twice
Undo/redo logging
Transaction (initially x = 2, y = 2):
TX_BEGIN
  x = 3
  y = x
TX_COMMIT

Undo logging:
  write(log, x = 2);                        // log old value
  CLFLUSH(log); MFENCE; WBARRIER
  x = 3;
  write(log, y = 2);                        // log old value
  CLFLUSH(log); MFENCE; WBARRIER
  y = x;
  CLFLUSH(x); CLFLUSH(y); MFENCE; WBARRIER

Redo logging:
  tmp_x = 3;
  tmp_y = tmp_x;
  write(log, x = 3);                        // log new values
  write(log, y = 3);
  CLFLUSH(log); MFENCE; WBARRIER
  x = tmp_x;
  y = tmp_y;
  CLFLUSH(x); CLFLUSH(y); MFENCE; WBARRIER

Is the final WBARRIER needed?
What makes one better?
Undo logging
+ directly read/write to new data
- More flushes/fences: before every write
Redo logging
+ only need to write log before end of transaction
+ only one fence between log and data writes
- Need to store new values somewhere else, track where they are
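The redo-logging guarantee can be made concrete with a toy crash simulation: new values go to the log first, data is updated in place only after the log (and its COMMIT record) is durable, so recovery always sees all-old or all-new. The `run_tx`/`recover` functions and the step-counting crash model are invented for illustration:

```python
def run_tx(pm, log, updates, crash_at):
    """Run a redo-logged transaction, 'crashing' after crash_at steps."""
    step = 0
    for k, v in updates:                  # 1. write new values to the log
        if step == crash_at: return
        log.append((k, v)); step += 1
    if step == crash_at: return
    log.append(("COMMIT", None)); step += 1   # flush+fence, then commit
    for k, v in updates:                  # 2. write data in place
        if step == crash_at: return
        pm[k] = v; step += 1

def recover(pm, log):
    if ("COMMIT", None) in log:           # committed: replay new values
        for k, v in log:
            if k != "COMMIT":
                pm[k] = v
    # uncommitted: discard the log; pm still holds only old values

# Crash at every possible point of the x=3; y=x transaction.
for crash_at in range(6):
    pm, log = {"x": 2, "y": 2}, []
    run_tx(pm, log, [("x", 3), ("y", 3)], crash_at)
    recover(pm, log)
    # After recovery pm is always all-old or all-new, never mixed.
    assert pm in ({"x": 2, "y": 2}, {"x": 3, "y": 3})
```

Undo logging passes the same check, but only because of the extra flush/fence before every in-place write.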
PMFS log record
• Records are 64 bytes
• Stored in a circular buffer
• COMMIT record indicates TX committed
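A 64-byte record is exactly one cacheline. A hypothetical layout (field names and sizes here are illustrative, not the actual PMFS on-media format) can be sketched with Python's struct:

```python
import struct

# Hypothetical 64-byte log record: address (8) | value (8) |
# size (4) | type (4) | transaction id (4) | gen_id (4) |
# payload/padding (32). Little-endian, no implicit padding.
REC = struct.Struct("<QQIIII32s")

def make_record(addr, value, size, rtype, txid, gen, payload=b""):
    return REC.pack(addr, value, size, rtype, txid, gen, payload)

rec = make_record(0x1000, 42, 8, 1, 7, 1)
assert len(rec) == 64      # fits exactly in one cacheline
```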
Efficient log management
• Log is circular buffer
• On failure: how to know which entries are valid?
[Log contents: X=1 | Y=2 | S=3 | C | F=4 | G=7]
Efficient log management
• Add generation ID to each entry
• After failure: only entries with latest generation_ID are valid
• Rely on ordered write to cacheline• Write gen_ID last to commit log entry
[Log contents with generation IDs: X=1,1 | Y=2,1 | S=3,1 | C,1 | F=4,2 | G=7,2]
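The recovery rule can be sketched directly from the example log; `valid_entries` is an invented helper:

```python
def valid_entries(ring):
    """After a crash, only entries tagged with the latest
    generation ID belong to the live log tail."""
    if not ring:
        return []
    latest = max(gen for _, gen in ring)
    return [entry for entry, gen in ring if gen == latest]

# Circular buffer from the example: generation 1 has committed (C),
# generation 2 is the in-flight transaction.
ring = [("X=1", 1), ("Y=2", 1), ("S=3", 1), ("C", 1),
        ("F=4", 2), ("G=7", 2)]
assert valid_entries(ring) == ["F=4", "G=7"]
```

Writing the gen_ID last, relying on ordered writes within a cacheline, is what makes an entry's appearance atomic.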
Testing
What can go wrong with code?
• Missing flushes
• Missing fences
• Missing WBARRIERS
How to test?
• Collect trace of store/fence/wbarrier
• Replay all possible orderings of stores:
• Reordering stores
• Crashing before wbarrier
• Check data structure consistency everywhere
• Is a doubly-linked list still doubly linked?
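The replay idea can be sketched as a brute-force enumerator over possible crash states; `crash_states` and the invariant check are illustrative, not the actual test harness:

```python
from itertools import permutations

def crash_states(stores):
    """Yield every set of stores that could have reached pmem at a
    crash: without ordering guarantees, any prefix of any
    permutation of the trace may be durable."""
    for perm in permutations(stores):
        for i in range(len(perm) + 1):
            yield perm[:i]

# Trace between fences from the earlier example: x = 3 then y = x.
# Invariant: y's new value must never be durable unless x's is.
trace = [("x", 3), ("y", 3)]
bad = [s for s in crash_states(trace)
       if ("y", 3) in s and ("x", 3) not in s]
assert bad == [(("y", 3),)]   # the reordering a missing fence allows
```

A real tool would collect the store/fence/wbarrier trace from the file system under test and only reorder across missing fences.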
Memory mapping
• Map PM pages right into user address spaces
• For mmap: register a page-fault handler
• Attach PM page on fault
• For read/write: on access, copy directly from PM page to user buffer (bypass page cache)
Evaluation
• PM emulator platform:
• Intel special sauce – modify microcode in the processor
• Add latency periodically to model added latency of PM
• Throttle bandwidth of access to model lower bandwidth
• How good is this compared to gem5/software emulators?
Evaluation
File-based Access
File I/O & utilities
Evaluation
3. Memory-mapped I/O
Neo4j Graph Database
Logging overhead
Evaluation
4. Write Protection
Outcome
• PMFS showed what was possible, but …
• It didn’t scale well to large numbers of cores
• It had bugs – writing an FS is hard
• But applying the most important concepts to Ext4 gave good/better performance:
• Using large pages
• Bypassing the page cache
• Fine-grained logging (maybe – not sure)
Questions from reviews
• Application control over durability?
• Why map all PM into kernel address space?
• Should we buffer some data in DRAM for performance?
• What happened to PMFS?
• Could we use PMFS as a cache in front of HDD?
• Why can we skip the final wbarrier?
• How good is huge page support?