accelerating rocksdb with zoned namespaces · 2019. 12. 21. · zone states empty, implicitly...

26
2019 Storage Developer Conference. © Western Digital Corporation or its affiliates. All Rights Reserved. 1 Accelerating RocksDB with NVMe Zoned SSDs Hans Holmberg, R&D Technologist Emerging System Architectures Group Western Digital

Upload: others

Post on 12-Sep-2020

3 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Accelerating RocksDB with Zoned Namespaces · 2019. 12. 21. · Zone States Empty, Implicitly Opened, Explicitly Opened, Closed, Full, Read Only, and Offline. Changes state upon writes,

2019 Storage Developer Conference. © Western Digital Corporation or its affiliates. All Rights Reserved. 1

Accelerating RocksDB with

NVMe™ Zoned SSDs

Hans Holmberg, R&D Technologist

Emerging System Architectures Group

Western Digital

Page 2: Accelerating RocksDB with Zoned Namespaces · 2019. 12. 21. · Zone States Empty, Implicitly Opened, Explicitly Opened, Closed, Full, Read Only, and Offline. Changes state upon writes,

2019 Storage Developer Conference. © Western Digital Corporation or its affiliates. All Rights Reserved. 2

RocksDB on Zoned NVMe™ SSDs

Optimal data placement

No drive

garbage

collection

Minimal Device Write

amplification

Lower latencyLonger

life

Higher

throughput

ZNS NVMe SSD Development Platform

Page 3: Accelerating RocksDB with Zoned Namespaces · 2019. 12. 21. · Zone States Empty, Implicitly Opened, Explicitly Opened, Closed, Full, Read Only, and Offline. Changes state upon writes,

2019 Storage Developer Conference. © Western Digital Corporation or its affiliates. All Rights Reserved. 3

Agenda

▪ Zoned Namespaces 101

▪ Adapting RocksDB for Zoned SSDs

▪ Demo

▪ Results

▪ What’s next?

Page 4: Accelerating RocksDB with Zoned Namespaces · 2019. 12. 21. · Zone States Empty, Implicitly Opened, Explicitly Opened, Closed, Full, Read Only, and Offline. Changes state upon writes,

2019 Storage Developer Conference. © Western Digital Corporation or its affiliates. All Rights Reserved. 4

Zoned Namespaces 101

Page 5: Accelerating RocksDB with Zoned Namespaces · 2019. 12. 21. · Zone States Empty, Implicitly Opened, Explicitly Opened, Closed, Full, Read Only, and Offline. Changes state upon writes,

2019 Storage Developer Conference. © Western Digital Corporation or its affiliates. All Rights Reserved. 5

What are Zoned Block Devices?

▪ The storage device logical

block addresses are divided

into ranges of zones.

▪ Writes within a zone must

be sequential.

▪ The zone must be erased

before it can be rewritten.

The new paradigm in storage

Zone 0 Zone 1 Zone 2 Zone 3 Zone X

Write pointer

position

Device LBA range divided in zones

Write commands

advance the write

pointer

Reset write pointer commands

rewind the write pointer

Page 6: Accelerating RocksDB with Zoned Namespaces · 2019. 12. 21. · Zone States Empty, Implicitly Opened, Explicitly Opened, Closed, Full, Read Only, and Offline. Changes state upon writes,

2019 Storage Developer Conference. © Western Digital Corporation or its affiliates. All Rights Reserved. 6

Zoned Storage on SMR▪ SMR (Shingled Magnetic Recording)

▪ Enables areal density growth

▪ Shares flash access model

▪ Erase before re-write

▪ Zoned Access

▪ Zoned Block I/F standardized in INCITS

▪ Zoned Block Commands (ZBC): SAS

▪ Zoned ATA Commands (ZAC): SATA

▪ Host/Device cooperate to optimize RMW

aspect of SMR by enforcing sequential

writes and enabling host FTL model

Conventional HDDDiscrete Tracks

SMR HDDOverlapped Tracks

Zo

ne

Page 7: Accelerating RocksDB with Zoned Namespaces · 2019. 12. 21. · Zone States Empty, Implicitly Opened, Explicitly Opened, Closed, Full, Read Only, and Offline. Changes state upon writes,

2019 Storage Developer Conference. © Western Digital Corporation or its affiliates. All Rights Reserved. 7

Solid-State

Drive

Ubiquitous WorkloadsThe cloud applies multiple workloads to a single SSD

Database

s

Sensors

Analytics

Virtualization

Video

SSDs write log-structured to the media

that requires garbage collection

Multiplex data streams onto the same

garbage collection units

Increases

Write Amplification, Over-Provisioning

and thereby Cost

Decreases

throughput and latency predictability

Page 8: Accelerating RocksDB with Zoned Namespaces · 2019. 12. 21. · Zone States Empty, Implicitly Opened, Explicitly Opened, Closed, Full, Read Only, and Offline. Changes state upon writes,

2019 Storage Developer Conference. © Western Digital Corporation or its affiliates. All Rights Reserved. 8

Zones for Solid State Drives

Application 1 Application 2 Application 3

Conventional SSD Controller

LBA Space

Application 1 Application 2 Application 3

ZNS SSD Controller

LBA Space

ZoneEliminate data streams multiplexing:

• Significantly decreases write amplification, over-provisioning and thereby reduces cost

• Increases throughput and latency predictability

Page 9: Accelerating RocksDB with Zoned Namespaces · 2019. 12. 21. · Zone States Empty, Implicitly Opened, Explicitly Opened, Closed, Full, Read Only, and Offline. Changes state upon writes,

2019 Storage Developer Conference. © Western Digital Corporation or its affiliates. All Rights Reserved. 9

ZNS: Synergies w/ ZAC/ZBC software ecosystem

▪ Device exposed as a Zoned Block Device (ZBD)

▪ Reuse existing work already done for ZAC/ZBC

devices

▪ Existing ZBD-aware file systems & device

mappers “just work”

• Few additions to support to ZNS

▪ Integrates with file-systems and applications

▪ RocksDB, Ceph, fio, libzbd, …

▪ ZAC/ZBC devices are already in production at

technology adopters and a mature storage stack

is available through the Linux® eco-systemZNS SSD

Traditional Block Storage Applications

fio

Logical Block

Device

(dm-zoned)

File-System with

ZBD Support

(f2fs, btrfs, xfs)

SCSI/ATA

Regular File-

Systems (xfs)

User Space

Linux

Kernel

Block Layer

NVMe

*= Enhanced data paths for SMR/ZNS drives

RocksDB Ceph

libzbd

Optimized Applications

ZNS NVMe SSD Development Platform

ZNS NVMe SSD Development PlatformZNS NVMe SSD

Development Platform

Page 10: Accelerating RocksDB with Zoned Namespaces · 2019. 12. 21. · Zone States Empty, Implicitly Opened, Explicitly Opened, Closed, Full, Read Only, and Offline. Changes state upon writes,

2019 Storage Developer Conference. © Western Digital Corporation or its affiliates. All Rights Reserved. 10

Zoned Namespaces

▪ Ongoing Technical Proposal in the NVMeworking group

▪ New Zoned Command Set – Inherits the NVM Command Set and adds zone support.

▪ Aligns to the existing host-managed models defined in the ZAC/ZBC specifications.

▪ Note that it does not map 1:1. Beware of the details.

▪ Optimized for Solid State Drives

▪ Zone Capacity

▪ Zone Append

▪ Zone Descriptors

Under review

Page 11: Accelerating RocksDB with Zoned Namespaces · 2019. 12. 21. · Zone States Empty, Implicitly Opened, Explicitly Opened, Closed, Full, Read Only, and Offline. Changes state upon writes,

2019 Storage Developer Conference. © Western Digital Corporation or its affiliates. All Rights Reserved. 11

Host-Managed Zoned Block Devices

▪ Zone States

▪ Empty, Implicitly Opened, Explicitly Opened, Closed,

Full, Read Only, and Offline.

▪ Changes state upon writes, zone management

commands, and device resets.

▪ Zone Management

▪ Open Zone, Close Zone, Finish Zone, and Reset Zone

▪ Zone Size & Zone Capacity(NEW)

▪ Zone Size is fixed

▪ Zone Capacity is the writeable area within a zone

Zone Capacity (E.g., 500MB)

Zone Size (e.g., 512MB)

Zone Start LBA

Zone XZone X - 1 Zone X + 1

Page 12: Accelerating RocksDB with Zoned Namespaces · 2019. 12. 21. · Zone States Empty, Implicitly Opened, Explicitly Opened, Closed, Full, Read Only, and Offline. Changes state upon writes,

2019 Storage Developer Conference. © Western Digital Corporation or its affiliates. All Rights Reserved. 12

ZonedStorage.IO

Page 13: Accelerating RocksDB with Zoned Namespaces · 2019. 12. 21. · Zone States Empty, Implicitly Opened, Explicitly Opened, Closed, Full, Read Only, and Offline. Changes state upon writes,

2019 Storage Developer Conference. © Western Digital Corporation or its affiliates. All Rights Reserved. 13

RocksDB on ZNS

Page 14: Accelerating RocksDB with Zoned Namespaces · 2019. 12. 21. · Zone States Empty, Implicitly Opened, Explicitly Opened, Closed, Full, Read Only, and Offline. Changes state upon writes,

2019 Storage Developer Conference. © Western Digital Corporation or its affiliates. All Rights Reserved. 14

RocksDB, a good fit for ZNS

▪ Persistent key-value store

for fast storage

environments

▪ Log-structured, flash

friendly

▪ Customizable storage

back ends

ZNS NVMe SSD Development Platform

Page 15: Accelerating RocksDB with Zoned Namespaces · 2019. 12. 21. · Zone States Empty, Implicitly Opened, Explicitly Opened, Closed, Full, Read Only, and Offline. Changes state upon writes,

2019 Storage Developer Conference. © Western Digital Corporation or its affiliates. All Rights Reserved. 15

WA on a conventional SSD

Page 16: Accelerating RocksDB with Zoned Namespaces · 2019. 12. 21. · Zone States Empty, Implicitly Opened, Explicitly Opened, Closed, Full, Read Only, and Offline. Changes state upon writes,

2019 Storage Developer Conference. © Western Digital Corporation or its affiliates. All Rights Reserved. 16

RocksDB

Data placement

Target: End-to-end-integration

RocksDB

Data placement

File system

Block allocator

FTL

Media Management

MEDIA

Media Management

MEDIA

Page 17: Accelerating RocksDB with Zoned Namespaces · 2019. 12. 21. · Zone States Empty, Implicitly Opened, Explicitly Opened, Closed, Full, Read Only, and Offline. Changes state upon writes,

2019 Storage Developer Conference. © Western Digital Corporation or its affiliates. All Rights Reserved. 17

Challenges

▪ Multiple, parallel files being written to

▪ Map each file to a set of zones

▪ All writes must be sequential and ordered

▪ Use direct I/O and the deadline scheduler

▪ Limits on number of open zones

▪ Finish zones when done with writes

Page 18: Accelerating RocksDB with Zoned Namespaces · 2019. 12. 21. · Zone States Empty, Implicitly Opened, Explicitly Opened, Closed, Full, Read Only, and Offline. Changes state upon writes,

2019 Storage Developer Conference. © Western Digital Corporation or its affiliates. All Rights Reserved. 18

RocksDB on-disk data structuresWriteahed-log (WAL)

WAL

Memtable

<KEY, VALUE>

Page 19: Accelerating RocksDB with Zoned Namespaces · 2019. 12. 21. · Zone States Empty, Implicitly Opened, Explicitly Opened, Closed, Full, Read Only, and Offline. Changes state upon writes,

2019 Storage Developer Conference. © Western Digital Corporation or its affiliates. All Rights Reserved. 19

RocksDB on-disk data structuresSorted string tables

Per

sist

edIn

mem

ory

Ho

t D

ata

A ZZ A Z A Z A

A Z

sync

compaction

compaction

Append-only

A Z

Most Updated

Least Updated

e.g. 64MB

e.g. 128MB

e.g. 1G

e.g. 100G

Grows 10X

Co

ld D

ata

Page 20: Accelerating RocksDB with Zoned Namespaces · 2019. 12. 21. · Zone States Empty, Implicitly Opened, Explicitly Opened, Closed, Full, Read Only, and Offline. Changes state upon writes,

2019 Storage Developer Conference. © Western Digital Corporation or its affiliates. All Rights Reserved. 20

Mapping files to zones

WAL File WAL File SST File SST File SST File SST File

Zones

Page 21: Accelerating RocksDB with Zoned Namespaces · 2019. 12. 21. · Zone States Empty, Implicitly Opened, Explicitly Opened, Closed, Full, Read Only, and Offline. Changes state upon writes,

2019 Storage Developer Conference. © Western Digital Corporation or its affiliates. All Rights Reserved. 21

Approach

▪ Files are mapped to zones

▪ Zone management through file

management

▪ Zones are allocated when

creating a new file

▪ Zones are released after file

deletion

▪ Zones can be rewritten after

being reset

No device-side garbage collection

Page 22: Accelerating RocksDB with Zoned Namespaces · 2019. 12. 21. · Zone States Empty, Implicitly Opened, Explicitly Opened, Closed, Full, Read Only, and Offline. Changes state upon writes,

2019 Storage Developer Conference. © Western Digital Corporation or its affiliates. All Rights Reserved. 22

Demo (Random Insert Workload)

ZNS NVMe SSD Development Platform

Page 23: Accelerating RocksDB with Zoned Namespaces · 2019. 12. 21. · Zone States Empty, Implicitly Opened, Explicitly Opened, Closed, Full, Read Only, and Offline. Changes state upon writes,

2019 Storage Developer Conference. © Western Digital Corporation or its affiliates. All Rights Reserved. 23

Results

device write amplification

3-6X WA’s measured on conventional drives (28% OP)

increase in capacity

Compared to a conventional 28% OP SSD

TCO reduction

Increases lifetime/writes significantly

Smart data placement

No on-drive garbage

collection

Page 24: Accelerating RocksDB with Zoned Namespaces · 2019. 12. 21. · Zone States Empty, Implicitly Opened, Explicitly Opened, Closed, Full, Read Only, and Offline. Changes state upon writes,

2019 Storage Developer Conference. © Western Digital Corporation or its affiliates. All Rights Reserved. 24

Conclusions

▪ Easy to leverage flash-friendy data placement

▪ ZNS enables applications to become flash-optimal

▪ Zoned Block Device Software Eco-system already available

▪ Libraries, tools, emulation

▪ Easy to integrate with existing storage stack

Page 25: Accelerating RocksDB with Zoned Namespaces · 2019. 12. 21. · Zone States Empty, Implicitly Opened, Explicitly Opened, Closed, Full, Read Only, and Offline. Changes state upon writes,

2019 Storage Developer Conference. © Western Digital Corporation or its affiliates. All Rights Reserved. 25

What’s next?

▪ Upstream support to RocksDB

▪ More ZNS end-to-end-integration:

▪ Databases (LSM-based, logs, ...)

▪ Filesystems (btrfs, ceph, ...)

▪ Cloud infrastructure

Page 26: Accelerating RocksDB with Zoned Namespaces · 2019. 12. 21. · Zone States Empty, Implicitly Opened, Explicitly Opened, Closed, Full, Read Only, and Offline. Changes state upon writes,

2019 Storage Developer Conference. © Western Digital Corporation or its affiliates. All Rights Reserved. 26

Thanks!

Western Digital and the Western Digital logo are registered

trademarks or trademarks of Western Digital Corporation or its

affiliates in the US and/or other countries. The NVMe word

mark and NVM Express design mark are trademarks of NVM

Express, Inc. All other marks are the property of their

respective owners.