pmfs - pages.cs.wisc.edu
TRANSCRIPT
PMFS
CS 839 - Persistence
Learning objectives
• Appreciate the difference between redo & undo logging
• Understand how hardware features can optimize PM software
• Understand XXX
Project
• Proposals should be presented in class on 10/14 – 2 weeks
• Some ideas are up on the web page
• Work in groups of 2-3; I can help find groups if you need
• You can overlap with your own research & courses, but needs to be a distinct effort (can’t turn in same regular-sized project for two courses)
Notes from reviews
• When making a claim, please explain:
• “Mapping PM to kernel's virtual address space is not secure.”
• Please say why: what bad thing could happen, and why this is different from current systems. If current systems already have the same flaw, it is generally an accepted risk
• How should we think about papers on PM from before PM was commercially available? What do we expect for evaluation?
Background story
• Intel is developing 3D XPoint
• Others have proposed various crazy file system designs
• Their engineers/researchers have deep knowledge of real architectural issues, architectural features
• Seek to show various low-level options
What are the biggest ideas in PMFS?
Big ideas in PMFS
• Fine-grained redo logging
• Leveraging unique Intel processor features for PMFS:
• Hardware transactions
• Memory ordering rules
• Write-protect disable
• Intel’s programming model for PM (with hardware)
• Use of large pages
What are the concerns PMFS addresses?
• How to control ordering
• How to provide atomicity
• How to handle memory mapping
• How to handle stray writes
• …
Real Intel support for persistent memory
• Problem:
• How to implement flush/fence operations?
• Options:
• MTRR to enable write-through caching
• NTSTORE – what happens if data is in the cache?
• SFENCE – only waits for data to hit the controller, not memory
• Solution:
• CLFLUSHOPT – asynchronous flush
• PM_WBARRIER – guarantees durability of data previously flushed
• All flushes from any core? Only flushes from this core?
Hardware support for atomic & ordered writes
• 8 byte: regular store instruction
• 16 byte: cmpxchg16b – compare with RDX:RAX and swap with data from RCX:RBX
• Useful for setting size & modification time atomically (the BPFS problem)
• 64 byte: transactions using Intel hardware transactional memory
• XBEGIN
• Write to a cache line multiple times
• XEND
• CLFLUSH
• Writes to the same cacheline happen in order
• STORE A, 14
• STORE A+8, 27
• Guarantees 27 never reaches pmem if 14 doesn't also reach pmem
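As a toy illustration of why the 8-byte atomic store case matters (pack size and modification time into a single word so a crash can never expose a torn pair), here is a Python sketch; the `PMWord` class and its field layout are invented for illustration, not the PMFS format:

```python
import struct

class PMWord:
    """Toy model of PM where only whole 8-byte stores are atomic:
    one store makes both 4-byte fields durable together."""
    def __init__(self):
        self.value = struct.pack("<II", 0, 0)   # size=0, mtime=0

    def store8(self, size, mtime):
        # A single 8-byte store: size and mtime persist together.
        self.value = struct.pack("<II", size, mtime)

    def load(self):
        return struct.unpack("<II", self.value)

pm = PMWord()
pm.store8(4096, 17)                 # update size & mtime atomically
assert pm.load() == (4096, 17)

# Contrast: with two separate 4-byte stores, a crash in between
# could leave the torn state (new size, old mtime).
torn = struct.pack("<II", 4096, 0)
assert struct.unpack("<II", torn) == (4096, 0)
```

The 16-byte cmpxchg16b variant extends the same idea to a pair of 8-byte fields in one cacheline.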
Mapping PM into kernel address space
• Why do this?
[Figure: volatile data and PM both mapped into the kernel virtual address space]
Protection from stray writes
• How real is the problem?
• How is this handled in normal file systems?
Handling stray writes
• Hardware protection: block writes in hardware
• Software protection: block writes in software• Type-safety
• Software fault isolation
• Software protection: hide PM• Map it far away from anything else
• Don’t tell normal code the real address
• Only reveal the real address to internal, super-correct code
Write protection in hardware
• What hardware features are possible?
• Kernel on kernel
• Kernel on user
• User on kernel
• User on user
Idea for protection
• Have accessor functions for reading/writing PM
• Normal stores shouldn't have access
• Accessors enable access, do the write, disable access
• Example: TX_BEGIN allows access / TX_END removes access
• Example: D_RW(ptr) allows access, *ptr does not
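The accessor idea can be sketched in a few lines; this is a toy Python model (the `ProtectedPM` class and `tx_begin`/`tx_end` method names are invented here), not the PMFS implementation:

```python
class ProtectedPM:
    """Toy accessor-based protection: writes to 'PM' succeed only
    inside a TX_BEGIN/TX_END window; stray writes raise."""
    def __init__(self, n):
        self.cells = [0] * n
        self.writable = False

    def tx_begin(self):      # like TX_BEGIN: open the write window
        self.writable = True

    def tx_end(self):        # like TX_END: close the write window
        self.writable = False

    def write(self, i, v):   # like D_RW(ptr): the only legal write path
        if not self.writable:
            raise PermissionError("stray write to PM blocked")
        self.cells[i] = v

pm = ProtectedPM(4)
pm.tx_begin(); pm.write(0, 42); pm.tx_end()
assert pm.cells[0] == 42

try:
    pm.write(1, 7)           # stray write outside any transaction
    assert False, "stray write should have been blocked"
except PermissionError:
    pass
```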
Write protection in hardware
• What hardware features are possible?
• Kernel on kernel
• CR0.WP: disable write protection, for compatibility with the 80386 (1985)
• Kernel on user
• SMAP: blocks kernel access to user data – prevents tricking the kernel into revealing user data (2012)
• User on kernel
• Normal page permissions
• User on user
• MPK: memory protection keys allow turning access on/off from usermode
[Figure: with CR0 WP=1, normal data is R/W but PM is read-only; setting WP=0 temporarily makes PM R/W in the same address space]
Intel Memory Protection Keys (MPK)
• Available in Skylake server CPUs
• Tag memory pages with a PKEY in the page table entry (PTE)
• Permission Register (PKRU): per-core register with read/write permission bits for each of the 16 keys
• Userspace instruction to update PKRU
• Fast switch: 11–260 cycles/switch (~50 typical)
• By itself, MPK does not protect against malicious attacks
[Figure: pages in the address space tagged with a PKEY in their PTEs; the CPU core's PKRU register holds R and W bits for keys 0–15]
Using Memory protection keys
Safe_write(object *ptr, object &some_data) {
    MPK_WRPKRU(0b0...01100);   // switch PKRU to the PM-writable setting
    *ptr = some_data;          // write to PM
    MPK_WRPKRU(0b0...00000);   // switch PKRU back to the default setting
}
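For intuition about the PKRU bits, here is a small Python model of the permission check; the 2-bits-per-key layout (access-disable in bit 0, write-disable in bit 1 of each pair) follows Intel's documented PKRU format, but `can_access` is an invented helper for illustration:

```python
# Toy model of PKRU: 32-bit register, 2 bits per key, 16 keys.
AD, WD = 0, 1   # bit offsets within a key's 2-bit field

def can_access(pkru, key, write):
    """Return True if the PKRU value permits this access to a page
    tagged with the given protection key."""
    field = (pkru >> (2 * key)) & 0b11
    if field & (1 << AD):
        return False                 # access-disable: no reads or writes
    if write and field & (1 << WD):
        return False                 # write-disable: reads only
    return True

pkru = 0
pkru |= (1 << WD) << (2 * 3)         # write-disable key 3
assert can_access(pkru, 3, write=False)       # reads still allowed
assert not can_access(pkru, 3, write=True)    # writes blocked
assert can_access(pkru, 0, write=True)        # other keys unaffected
```

A Safe_write-style accessor just flips the PM key's WD bit around each store.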
Efficient layout
• What layout changes are possible with NVM?
Efficient layout
• What layout changes are possible with NVM?
• Use memory-optimized data structures: B-tree
• (was also good for disk, but can choose different block sizes)
• Allocate blocks in MMU sizes: 4KB, 2MB, 1GB
• Allows use of huge pages in the TLB
• Policy: when to use large pages?
What consistency mechanism is best?
• Logging/journaling?
• Shadow paging/CoW?
What consistency mechanism is best? Why?
• Logging/journaling?
• Good for small updates – not much data to write twice
• Low write amplification
• Shadow paging/CoW?
• Good for large writes – avoid writing twice
Undo/redo logging
Transaction (initially x = 2, y = 2):
TX_BEGIN
  x = 3
  y = x
TX_COMMIT

Undo logging:
  write(log, x = 2);                        // log old value
  CLFLUSH(log); MFENCE; WBARRIER
  x = 3;
  write(log, y = 2);                        // log old value
  CLFLUSH(log); MFENCE; WBARRIER
  y = x;
  CLFLUSH(x); CLFLUSH(y); MFENCE; WBARRIER

Redo logging:
  tmp_x = 3;
  tmp_y = tmp_x;
  write(log, x = 3);                        // log new values
  write(log, y = 3);
  CLFLUSH(log); MFENCE; WBARRIER
  x = tmp_x;
  y = tmp_y;
  CLFLUSH(x); CLFLUSH(y); MFENCE; WBARRIER

Is the final WBARRIER needed?
What makes one better?
Undo logging
+ directly read/write to new data
- More flushes/fences: before every write
Redo logging
+ only need to write log before end of transaction
+ only one fence between log and data writes
- Need to store new values somewhere else, track where they are
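The redo-logging guarantee can be made concrete with a toy crash simulation: new values go to the log first, data is updated in place only after the log (and its COMMIT record) is durable, so recovery always sees all-old or all-new. The `run_tx`/`recover` functions and the step-counting crash model are invented for illustration:

```python
def run_tx(pm, log, updates, crash_at):
    """Run a redo-logged transaction, 'crashing' after crash_at steps."""
    step = 0
    for k, v in updates:                  # 1. write new values to the log
        if step == crash_at: return
        log.append((k, v)); step += 1
    if step == crash_at: return
    log.append(("COMMIT", None)); step += 1   # flush+fence, then commit
    for k, v in updates:                  # 2. write data in place
        if step == crash_at: return
        pm[k] = v; step += 1

def recover(pm, log):
    if ("COMMIT", None) in log:           # committed: replay new values
        for k, v in log:
            if k != "COMMIT":
                pm[k] = v
    # uncommitted: discard the log; pm still holds only old values

# Crash at every possible point of the x=3; y=x transaction.
for crash_at in range(6):
    pm, log = {"x": 2, "y": 2}, []
    run_tx(pm, log, [("x", 3), ("y", 3)], crash_at)
    recover(pm, log)
    # After recovery pm is always all-old or all-new, never mixed.
    assert pm in ({"x": 2, "y": 2}, {"x": 3, "y": 3})
```

Undo logging passes the same check, but only because of the extra flush/fence before every in-place write.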
PMFS log record
• Records are 64 bytes
• Stored in a circular buffer
• COMMIT record indicates TX committed
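A 64-byte record is exactly one cacheline. A hypothetical layout (field names and sizes here are illustrative, not the actual PMFS on-media format) can be sketched with Python's struct:

```python
import struct

# Hypothetical 64-byte log record: address (8) | value (8) |
# size (4) | type (4) | transaction id (4) | gen_id (4) |
# payload/padding (32). Little-endian, no implicit padding.
REC = struct.Struct("<QQIIII32s")

def make_record(addr, value, size, rtype, txid, gen, payload=b""):
    return REC.pack(addr, value, size, rtype, txid, gen, payload)

rec = make_record(0x1000, 42, 8, 1, 7, 1)
assert len(rec) == 64      # fits exactly in one cacheline
```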
Efficient log management
• Log is circular buffer
• On failure: how to know which entries are valid?
[Log contents: X=1 | Y=2 | S=3 | C | F=4 | G=7]
Efficient log management
• Add generation ID to each entry
• After failure: only entries with latest generation_ID are valid
• Rely on ordered write to cacheline• Write gen_ID last to commit log entry
[Log contents with generation IDs: X=1,1 | Y=2,1 | S=3,1 | C,1 | F=4,2 | G=7,2]
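The recovery rule can be sketched directly from the example log; `valid_entries` is an invented helper:

```python
def valid_entries(ring):
    """After a crash, only entries tagged with the latest
    generation ID belong to the live log tail."""
    if not ring:
        return []
    latest = max(gen for _, gen in ring)
    return [entry for entry, gen in ring if gen == latest]

# Circular buffer from the example: generation 1 has committed (C),
# generation 2 is the in-flight transaction.
ring = [("X=1", 1), ("Y=2", 1), ("S=3", 1), ("C", 1),
        ("F=4", 2), ("G=7", 2)]
assert valid_entries(ring) == ["F=4", "G=7"]
```

Writing the gen_ID last, relying on ordered writes within a cacheline, is what makes an entry's appearance atomic.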
Testing
What can go wrong with code?
• Missing flushes
• Missing fences
• Missing WBARRIERS
How to test?
• Collect trace of store/fence/wbarrier
• Replay all possible orderings of stores:
• Reordering stores
• Crashing before wbarrier
• Check data structure consistency everywhere
• Is a doubly-linked list still doubly linked?
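The replay idea can be sketched as a brute-force enumerator over possible crash states; `crash_states` and the invariant check are illustrative, not the actual test harness:

```python
from itertools import permutations

def crash_states(stores):
    """Yield every set of stores that could have reached pmem at a
    crash: without ordering guarantees, any prefix of any
    permutation of the trace may be durable."""
    for perm in permutations(stores):
        for i in range(len(perm) + 1):
            yield perm[:i]

# Trace between fences from the earlier example: x = 3 then y = x.
# Invariant: y's new value must never be durable unless x's is.
trace = [("x", 3), ("y", 3)]
bad = [s for s in crash_states(trace)
       if ("y", 3) in s and ("x", 3) not in s]
assert bad == [(("y", 3),)]   # the reordering a missing fence allows
```

A real tool would collect the store/fence/wbarrier trace from the file system under test and only reorder across missing fences.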
Memory mapping
• Map PM pages right into user address spaces
• For mmap: register a page-fault handler
• Attach PM page on fault
• For read/write: on access, copy directly from PM page to user buffer (bypass page cache)
Evaluation
• PM emulator platform:
• Intel special sauce – modify microcode in the processor
• Add latency periodically to model added latency of PM
• Throttle bandwidth of access to model lower bandwidth
• How good is this compared to gem5/software emulators?
Evaluation
File-based Access
File I/O & utilities
Evaluation
3. Memory-mapped I/O
Neo4j Graph Database
Logging overhead
Evaluation
4. Write Protection
Outcome
• PMFS showed what was possible, but …
• It didn’t scale well to large numbers of cores
• It had bugs – writing an FS is hard
• But applying the most important concepts to Ext4 gave good/better performance:
• Using large pages
• Bypassing the page cache
• Fine-grained logging (maybe – not sure)
Questions from reviews
• Application control over durability?
• Why map all PM into kernel address space?
• Should we buffer some data in DRAM for performance?
• What happened to PMFS?
• Could we use PMFS as a cache in front of HDD?
• Why can we skip the final wbarrier?
• How good is huge page support?