mysql acceleration through new flash storage api primitives
DESCRIPTION
Torben Mathiasen presented at Percona Live 2012 in New York in October 2012 on non-volative memory software interfaces and MySQL atomics support. His presentation includes benchmarks run by Percona with Atomic I/O and directFS pre-release.TRANSCRIPT
MYSQL ACCELERATION THROUGH NEW FLASH STORAGE API PRIMITIVES
TORBEN MATHIASEN PERCONA LIVE 2012 – NEW YORK
Running Native on Flash
TOPICS – NVM SOFTWARE INTERFACES
1. What are we building?
2. Why are we building it?
3. Examples
4. MySQL atomics support
5. Where are we headed?
October 29, 2012 2
CONVENTIONAL I/O ACCESS
October 29, 2012 3
Traditional Storage
Proprietary Storage OS
Storage Media
Native Flash Translation Layer
Storage Media
Software Defined Storage
Conventional I/O access
DIRECT-ACCESS I/O THROUGH NATIVE INTERFACES
October 29, 2012 4
Traditional Storage
Proprietary Storage OS
Storage Media
Native Flash Translation Layer
Storage Media
Software Defined Storage
Conventional I/O access
Direct access I/O
MEMORY-ACCESS THROUGH NATIVE INTERFACES
October 29, 2012 5
Traditional Storage
Proprietary Storage OS
Storage Media
Native Flash Translation Layer
Storage Media
Software Defined Storage
Conventional I/O access
Memory access
Direct access I/O
COMPARING I/O AND MEMORY ACCESS SEMANTICS
October 29, 2012 6
I/O I/O semantics examples:
• Open file descriptor – open(), read(), write(), seek(), close() • (New) Write multiple data blocks atomically, nvm_vectored_write() • (New) Open key-value store – nvm_kv_open(), kv_put(), kv_get(), kv_batch_*()
Volatile memory semantics example: • Allocate virtual memory, e.g. malloc() • memcpy/pointer dereference writes (or reads) to memory address • (Improved) Page-faulting transparently loads data from NVM into memory
Non-volatile memory semantics example: • (New) Allocate and map Auto-Commit Memory™ (ACM) virtual memory pages • memcpy/pointer dereference writes (or reads) to memory address • (New) Call checkpoint() to create application-consistent ACM page snapshots • (New) After system failure, remap ACM snapshot pages to recover memory state • (New) De-stage completed ACM pages to NVM namespace • (New) Remap and access ACM pages from NVM namespace at any time
OPEN NVM DEVELOPMENT INTERFACES
October 29, 2012 7
I/O Semantics (explicit persistence) Memory Semantics (implicit persistence)
API Libraries: Open Source
Open Interface
Key-Value Store
Key-Value Cache
User-Defined Libraries
High-speed Logging
Transactional mmap
Native Filesystem: Open Source
Open Interface
directFS (POSIX filesystem namespace implemented with native Primitives)
Primitives: Open Interface
Traditional Block I/O Primitives
Atomic I/O Transaction Primitives
Sparse-address
Primitives
Clone/ Merge
Primitives
Cache Eviction
Primitives
Memory Transaction Primitives
Native Flash Translation Layer
DIRECTFS – NATIVE FILESYSTEM NAMESPACE
October 29, 2012 8
Application
Linux VFS (virtual file system) abstraction layer
Native Flash Translation Layer
ext3 btrfs xfs directFS
Kernel block layer
kernel-space
user-space
DIRECTFS – ELIMINATING DUPLICATE LOGIC
October 29, 2012 9
Native Flash Translation Layer block allocation, mapping, recycling
ACID updates, logging/journaling, crash-recovery
directFS file metadata mgmt
Kernel block layer
kernel-space
user-space
Ext3 file metadata mgmt,
block allocation, mapping, recycling, ACID updates, logging/journaling, crash-recovery
Primitive Interfaces
Application
Linux VFS (virtual file system) abstraction layer
TOPICS – NVM SOFTWARE INTERFACES
1. What are we building?
2. Why are we building it?
3. Examples
4. MySQL atomics support
5. Where are we headed?
October 29, 2012 10
NVM SOFTWARE INTERFACES
Industry-first, direct API access to non-volatile memory’s unique characteristics.
The ioMemory SDK was introduced to help developers:
▸ Write less code to create high-performing apps
▸ Tap into performance not available with conventional I/O access to SSDs
▸ Reduce operating costs by decreasing RAM while increasing NVM
October 29, 2012 11
DIRECTFS – BENEFITS IN ELIMINATING DUPLICATE LOGIC
October 29, 2012 12
File System Lines of Code
directFS 6879
ReiserFS 19996
ext4 25837
btrfs 51925
XFS 63230
ATOMIC I/O PRIMITIVES: SAMPLE USES AND BENEFITS
October 29, 2012 13
Databases Transactional Atomicity: Replace various workarounds implemented in database code to provide write atomicity (double-buffered writes, etc.)
Filesystems File Update Atomicity: Replace various workarounds implemented in filesystem code to provide file/directory update atomicity (journaling, etc.)
▸ 99% performance of raw writes Smarter media now natively understands atomic updates, with no additional metadata overhead.
▸ 2x longer flash media life Atomic Writes increase the life of flash media up to 2x due to reduction in write-ahead-logging and double-write buffering.
▸ 50% less code in key modules Atomic operations dramatically reduce application logic, such as journaling, built as work-arounds.
DIRECTFS WITH ATOMIC WRITES - ACHIEVING RAW DEVICE PERFORMANCE
October 29, 2012 14
Ban
dwid
th (M
iB/s
)
I/O Size
Ban
dwid
th (M
iB/s
)
I/O Size Block directFS
1 Thread 8 Threads
Filesystem convenience AND atomic writes with the performance of simple writes to raw device
KEY-VALUE STORE API LIBRARY: SAMPLE USES AND BENEFITS
October 29, 2012 15
NoSQL Applications Reduce overprovisioning due to lack of coordination between two-layers of garbage collection (application-layer and flash-layer). Some top NoSQL applications recommend over-provisioning by 3x due to this.
Reduce application I/Os through batched put and get operations.
Increase performance by eliminating packing and unpacking blocks, defragmentation, and duplicate metadata at app layer.
▸ 95% performance of raw device Smarter media now natively understands a key-value I/O interface with lock-free updates, crash recovery, and no additional metadata overhead.
▸ Up to 3x capacity increase Dramatically reduces over-provisioning with coordinated garbage collection and automated key expiry.
▸ 3x throughput on same SSD Early benchmarks comparing against memcached with BerkeleyDB persistence show up to 3x improvement.
KEY-VALUE STORE API LIBRARY BENCHMARKS: (VS MEMCACHEDB)
October 29, 2012 16
TOPICS – NVM SOFTWARE INTERFACES
1. What are we building?
2. Why are we building it?
3. Examples
4. MySQL atomics support
5. Where are we headed?
October 29, 2012 17
EXAMPLE: ATOMIC I/O PRIMITIVES
October 29, 2012 18
iov[0] iov[1] iov[2] iov[3] iov[4]
LBA 7 + range
LBA 24 + range
LBA 42 + range
LBA 68 + range
LBA 24 + range
LBA 7 + range
iov[0] iov[1]
TRANSACTION ENVELOPES
ATOMIC I/O PRIMITIVES BENCHMARKS (ATOMIC I/O VS NON-ATOMIC I/O)
1U HP blade server with 16 GB RAM, 8 CPU cores - Intel(R) Xeon(R) CPU X5472 @ 3.00GHz with single 1.2 TB ioDrive2 mono
Significantly more functionality with negligible performance cost
October 29, 2012 19
EXAMPLE: KEY-VALUE STORE API LIBRARY
October 29, 2012 20
key value 1-128B 64b-1MB
kv_put() kv_get() or kv_batch_get()
key expiration timer marks KV pair for VSL garbage collection
key hashed into sparse address
space to simplify collision
management
value returned through single I/O operation, regardless of
value size
key value key value key value
pool A pool B
pool C
kv_get_current(), kv_next()
Iterate through each KV pair in
a pool of related keys
Atomic transaction envelope
key
value
KEY-VALUE STORE API LIBRARY BENCHMARKS (NATIVE KV GET/PUT VS. RAW READS/WRITES)
!"
#!!!!"
$!!!!"
%!!!!"
&!!!!"
'!!!!!"
'#!!!!"
'$!!!!"
!" #!" $!" %!" &!" '!!" '#!" '$!"
!"#$%$&
#'()*+$&
,*-./)&0)(12(-*34)&5&!"#&
('#)"
$*)"
'%*)"
%$*)"
!"
#!!!!"
$!!!!"
%!!!!"
&!!!!"
'!!!!!"
'#!!!!"
!" #!" $!" %!" &!" '!!" '#!" '$!"
!"#$%
$&
#'()*+$&
,*-./)&!)(01(-*23)&4&!"#&
('#)"
$*)"
'%*)"
%$*)"
!"
#!!!!"
$!!!!"
%!!!!"
&!!!!"
'!!!!!"
'#!!!!"
'$!!!!"
!" #!" $!" %!" &!"
!"#$%
&
'()*+,%&
"*)-.)/+0*&)*1+23*&4.&5.6)53*&
('#)"*+,"-./"
'*)"012"3.45"
!"
#!!!!"
$!!!!"
%!!!!"
&!!!!"
'!!!!!"
'#!!!!"
!" '!" #!" (!" $!" )!" %!" *!"
!"#$%
&
'()*+,%&
"*)-.)/+01*&)*2+34*&5.&6.7)64*&
)'#+",-."/01"
',2345"67418"
Sample Performance - GET
Performance relative to ioDrive
Sample Performance - PUT
Performance relative to ioDrive
1U HP blade server with 16 GB RAM, 8 CPU cores - Intel(R) Xeon(R) CPU X5472 @ 3.00GHz with single 1.2 TB ioDrive2 mono
Significantly more functionality with negligible performance cost
October 29, 2012 21
RANGE OF MEMORY-ACCESS SEMANTICS
October 29, 2012 22
Extended Memory Volatile
Transparently extends DRAM onto flash, extending application virtual memory
Checkpointed Memory
Volatile with non-volatile checkpoints
Region of application virtual memory with ability to preserve snapshots to flash namespace
Auto-Commit Memory™ Non-volatile
Region of application memory automatically persisted to non-volatile memory and recoverable post-system failure
OS SWAP VS. EXTENDED MEMORY
▸ Originally designed as a last resort to prevent OOM (out-of-memory) failures ▸ Never tuned for high-performance demand-paging ▸ Never tuned for multi-threaded apps ▸ Poor performance, ex. < 30 MB/sec throughput
▸ No application code changes required ▸ Designed to migrate hot pages to DRAM and cold pages to ioMemory ▸ Tuned to run natively on flash (leverages native characteristics) ▸ Tuned for multi-threaded apps ▸ 10-15x throughput improvement over standard OS Swap
October 29, 2012 23
Non-Volatile Storage (Disks, SSDs, etc.) System Memory
OS SWAP Mechanism
ioMemory System Memory Extended Memory
Mechanism
CHECKPOINTED MEMORY PERSISTENCE PATH
October 29, 2012 24
1. Application designates virtual address space range to be checkpointed a. Causes creation of independently-addressable linked clone of the
checkpointed address range (no data moves or copies) b. Checkpoint appears as addressable file in the directFS
native filesystem namespace.
2. Application can continue manipulating contents of designated virtual address range without affecting contents of persisted checkpoint file.
3. Application can load or manipulate persisted checkpoint file at a later time.
System Memory ioMemory Checkpointed
Memory
d d d d
EXAMPLE: MEMORY TRANSACTION PRIMITIVES
October 29, 2012 25
1. Allocate pages app virtual address space
CPU loads/stores data to ACMTM
pages
Process allocate ACMTM pages and maps to virtual address space
2. memcpy() Process writes and reads to pages through standard memory-access semantics
3. Checkpoint pages
d d d d
d d d d
System
failure event
5. Restore state app virtual address space
Re-started process maps restored snapshot pages into virtual address space
Application-consistent snapshot pages are created or updated
working pages
snapshot pages
d d d d
d d d d
d d d d
System failure automatically preserves snapshot pages
d d d d 4. Destage pages
d d d d Completed
pages
d d d d Pages persisted
to flash namespace
TOPICS – NVM SOFTWARE INTERFACES
1. What are we building?
2. Why are we building it?
3. Examples
4. MySQL atomics support
5. Where are we headed?
October 29, 2012 26
PERCONA SERVER DEFAULT DOUBLE-WRITE
27
• Writes data twice • Extra data flush • Required for ACID
compliance Pages copied into memory
OS Page
A Page
B Page
C
Page A
Page B
Page C
Two Writes
Page C
Page B
Page A
Double-write buffer
Actual tablespace
DB Server Page
A Page
B Page
C
October 29, 2012
1
2
3
ioMemory
PERCONA SERVER WITH ATOMIC-WRITE
October 29, 2012 28
Pages copied into memory
ioMemory
DB Server
OS Page
A Page
B Page
C
Page C
Page B
Page A
Atomic-Write
Page A
Page B
Page C
• Less Writes • Better Performance • ACID Compliance
1
2
CASE STUDY: PERCONA SERVER (MYSQL)
Percona has added atomics support to Percona Server 5.5
▸ Removes the need of the MySQL double write buffer
▸ Ensures data integrity in case of system crashes
▸ Writes 50% less, great for flash
▸ Removes complexity from the software stack
▸ Improves both transaction bandwidth and latency
▸ Works though the DirectFS filesystem or on RAW devices
October 29, 2012 29
PERCONA SERVER 5.5 TPC-C BENCHMARKS
Benchmarks run by Percona with Atomic I/O and directFS pre-release
October 29, 2012 30
Percona Server on DirectFS with Atomic I/O
Non-ACID (Double-write
disabled)
ACID (Double-write)
50% more transactions with same atomic durability
ACID (Atomic writes)
Percona Server on ext4
PERCONA SERVER ATOMICS IMPLEMENTATION
Initialize Fusion-io atomics!os_aio_atomic_writes_init(void)!
!if (srv_atomic_writes_device != NULL &&!
! (vsl_iom_create_by_name(srv_atomic_writes_device,!! ! ! ! &os_aio_atomic_iom_handle,!! ! ! ! &error) != FIO_SUCCESS ||!
! vsl_open(os_aio_atomic_iom_handle, &error) != FIO_SUCCESS ||!! vsl_iom_get_raw_handle(os_aio_atomic_iom_handle,!
! ! ! ! &os_aio_atomic_writes_fd,!! ! ! ! &error) != FIO_SUCCESS)) {!! !ut_print_timestamp(stderr);!
! !fprintf(stderr, " Unable to open atomic writes device %s: %s\n",!! ! !srv_atomic_writes_device, error.message);!
! !return(FALSE);!!}!
!if (srv_atomic_writes_device != NULL) {!
! !fprintf(stderr, " Atomic I/O has been initialized on %s\n",!! ! !srv_atomic_writes_device);!!} !!
!return(TRUE);!}!
October 29, 2012 31
PERCONA SERVER ATOMICS IMPLEMENTATION CONT.
Issue atomics write using vector os_aio_do_atomic_write(!
!memset(iov, 0, IO_VECTOR_LIMIT * sizeof(struct vsl_iovec));!
!do {!! !iovcnt = 0;!! !iov_total_len = 0;!
! !do {!! ! !iov_len = ut_min(n - written,!
! ! ! ! ! FTL_ATOMIC_MAX_VECTOR_SIZE);!! ! !iov[iovcnt].iov_base = (uint64_t) ((byte *) buf +!! ! ! ! ! ! ! written);!
! ! !iov[iovcnt].iov_len = iov_len;!! ! !iov[iovcnt].sector = sector_start + !
! ! ! !os_aio_atomic_sector(0, written);!! ! !iov[iovcnt].iov_flag = VSL_IOV_WRITE;!
! ! !iovcnt++;!! ! !written += iov_len;!
! ! !iov_total_len += iov_len;!
! !} while (iovcnt < IO_VECTOR_LIMIT && written < n);!
! !rc = vsl_vectored_write(file, iov, iovcnt, O_ATOMIC);!
October 29, 2012 32
TOPICS – NVM SOFTWARE INTERFACES
1. What are we building?
2. Why are we building it?
3. Examples
4. MySQL atomics support
5. Where are we headed?
October 29, 2012 33
OPEN INTERFACES AND OPEN SOURCE
▸ Primitives: Open Interface
▸ directFS: Open Source
▸ API Libraries: Open Source, Open Interface
▸ INCITS SCSI (T10) active standards proposals: • SBC-4 SPC-5 Atomic-Write
http://www.t10.org/cgi-bin/ac.pl?t=d&f=11-229r6.pdf
• SBC-4 SPC-5 Scattered writes, optionally atomic http://www.t10.org/cgi-bin/ac.pl?t=d&f=12-086r3.pdf
• SBC-4 SPC-5 Gathered reads, optionally atomic http://www.t10.org/cgi-bin/ac.pl?t=d&f=12-087r3.pdf
▸ SNIA NVM-Programming TWG active member
34 October 29, 2012
Tradi&onal SSDs
ioMemory™ with Conven&onal I/O
ioMemory™ as Transparent Cache
ioMemory™ with direct access I/O
ioMemory™ with memory seman&cs
App
lica&
on
Applica&on
App
lica&
on Applica&on Applica&on Applica&on Applica&on
User-‐defined I/O API Libraries
User-‐defined Memory API Libraries
OS Block I/O OS Block I/O OS Block I/O Direct-‐access I/O API Libraries
Memory Seman&cs API Libraries
Host
Host
File System File System File System directFS – NVM filesystem
directFS – NVM filesystem
Block Layer Block Layer Block Layer I/O Primi&ves Memory Primi&ves
SAS/SATA VSL™
expanded flash transla&on layer
directCache™ VSL™ VSL™
Network VSL™
Remote
RAID Controller
Read/Write Read/Write Read/Write Read/Write CPU Load/Store Flash Transla&on
Layer
Read/Write
Na&ve Access |
35 October 29, 2012
FLASH MEMORY EVOLUTION
T H A N K Y O U