open-channel solid state drives matias bjørling linux ... · rocksdb on open-channel ssds...

38
Open-Channel Solid State Drives Matias Bjørling Linux Software Foundation Vault 2016 1 Copyright © 2016 CNEX Labs

Upload: others

Post on 21-May-2020

8 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Open-Channel Solid State Drives Matias Bjørling Linux ... · RocksDB on Open-Channel SSDs •Objective: Optimize RocksDB for Open-Channel SSDs - Control data placement: •Align

Open-Channel Solid State Drives

Matias Bjørling

Linux Software Foundation Vault 2016

1

Copyright © 2016 CNEX Labs

Page 2: Open-Channel Solid State Drives Matias Bjørling Linux ... · RocksDB on Open-Channel SSDs •Objective: Optimize RocksDB for Open-Channel SSDs - Control data placement: •Align

Solid State Drives

2

High Throughput High ParallelismLow Latency

Page 3: Open-Channel Solid State Drives Matias Bjørling Linux ... · RocksDB on Open-Channel SSDs •Objective: Optimize RocksDB for Open-Channel SSDs - Control data placement: •Align

Solid State Drives and Flash Translation Layer

3

Read/Write/EraseFlash Interface

Block I/O

Read/Write • Flash Translation Layer (FTL)- Read/Write/Erase -> Read/Write

- Translation map• Data placement decisions

- Garbage Collection

• Device performs garbage collection in background

- Wear-leveling

• Flash has a limited amount of writes available

• Applications- Applications work with (and optimize for) LBAs

- Use trim to hint FTL

- Makes assumption about good data placement strategies

Page 4: Open-Channel Solid State Drives Matias Bjørling Linux ... · RocksDB on Open-Channel SSDs •Objective: Optimize RocksDB for Open-Channel SSDs - Control data placement: •Align

Embedded FTL Solid State Drives

• Hardwires design decisions about data placement, scheduling, garbage collection, and wear leveling.

• Designed with explicit assumptions about application workload.

• Non-optimal write workloads increase write-amplification

• Performance is tied to the cost of heavy over-provisioning

• Hints from applications are best effort

• Predictable latencies cannot be guaranteed –99 percentiles

4

Metadata StateManagement

NAND Bad Block Management

Error Handling

XOR engines ECC enginesSMART/health Management

Temperature Monitor and Performance

Throttle

Asynchronous Events Management

Other Etc. Management

NANDChannel 0

NANDChannel 1

NANDChannel n

NAND LUN 0

NAND LUN 1

NAND LUN 2

NAND LUN m

NAND LUN 0

NAND LUN 1

NAND LUN 2

NAND LUN m

Garbage CollectionWear-levelingTranslation Map

NAND LUN 0

NAND LUN 1

NAND LUN 2

NAND LUN m

Read/Write/Trim

Hardwired

CommonFeatures

Solid State Drive

Controller Internals

Page 5: Open-Channel Solid State Drives Matias Bjørling Linux ... · RocksDB on Open-Channel SSDs •Objective: Optimize RocksDB for Open-Channel SSDs - Control data placement: •Align

Solid State Drive Workloads

• SSDs are targeted typical workloads:

• Accesses (e.g., 90%, 75%, 50% reads)

• I/O Applications (e.g., MySQL and RocksDB)

• Latency requirements (Quality of Service)

• Cost and lack of flexibility for these “hard-wired” solutions is prohibitive:

• What if the workload changes (at run-time)?

• What about new workloads?

• And new applications?

5

Page 6: Open-Channel Solid State Drives Matias Bjørling Linux ... · RocksDB on Open-Channel SSDs •Objective: Optimize RocksDB for Open-Channel SSDs - Control data placement: •Align

6

Common Solid State Drive Design

0 1 2 3

0 W R

1 W

2 W R

3 W R

4 W

5 W R

6 W

7 W

LUN -> One unit of parallelismEither read (~50-80us), write (~1-2ms), or erase (~1-2ms)

Ch

ann

els

LUNs0 1 2 3

0 W

1 RW

2 WR

3 W

4 W

5 WR

6 EW

7 W

• Write Line

- Also known as virtual block, raid line, erase unit, …

- Optimizes for Throughput – Across all channels and LUNs

- Considerably reduces metadata requirements

- Single digit GB write lines are not uncommon

• Critical Sections

- Write/Erase before Read -> Read must wait 1-2ms

• Common Strategies for efficient reads

- Erase/Program Suspend -> Delays erases and writes

- Reconstruction from parity -> Extra reads

Best case Common case

0 1 2 3

0 W R

1 W

2 W R

3 W R

4 W

5 WR R R R

6 W

7 W

Recontruct data by Placing parity data inNear-by LUNs

R Extra reads

R Read recovered from parities

WLUN R

R Scheduled

Page 7: Open-Channel Solid State Drives Matias Bjørling Linux ... · RocksDB on Open-Channel SSDs •Objective: Optimize RocksDB for Open-Channel SSDs - Control data placement: •Align

• Host in Control- Data placement

• I/O scheduling, Wear-leveling, Over-provisioning

- Garbage Collection

- Physical Page Addresses (PPAs) vs LBA + Range

- Mapping Table <> Block-based vs Page-based

• Device maintains responsibilities- Bad block management

- ECC Calculations

- XOR Calculations

• Media optimized I/O Commands- Programming Mode, Multiple PPAs, Read retry, ...

7

Open-Channel Solid State Drives

Metadata StateManagement

NAND Bad Block Management

Error Handling

XOR engines ECC enginesSMART/health Management

Temperature Monitor and Performance

Throttle

Asynchronous Events Management

Other Etc. Management

NANDChannel 0

NANDChannel 1

NANDChannel n

NAND LUN 0

NAND LUN 1

NAND LUN 2

NAND LUN m

NAND LUN 0

NAND LUN 1

NAND LUN 2

NAND LUN m

NAND LUN 0

NAND LUN 1

NAND LUN 2

NAND LUN m

Read/Write/Erase

Dynamic

CommonFeatures

Solid State Drive

Controller Internals

Garbage CollectionWear-levelingTranslation Map

Host System

Page 8: Open-Channel Solid State Drives Matias Bjørling Linux ... · RocksDB on Open-Channel SSDs •Objective: Optimize RocksDB for Open-Channel SSDs - Control data placement: •Align

Open-Channel SSD: Usages

• Predictability

- Predictable I/O with consistent and low latency QoS requirements

- Decrease write amplification

- Reduce ‘noisy neighbor’ problem

• Management- Manage all flash media globally

- Petabytes of storage with flash

- Single level address space

• Application-Driven Storage- Programmable SSDs versus

Hardware/Firmware configurable

- Integrate with I/O application

8

Web-scaleData Centers

Flash ArrayVendors

High-PerformanceComputing

Hyper-convergedInfrastructure

Adopters

Page 9: Open-Channel Solid State Drives Matias Bjørling Linux ... · RocksDB on Open-Channel SSDs •Objective: Optimize RocksDB for Open-Channel SSDs - Control data placement: •Align

Open-Channel SSD: Architecture

9

Provisioning Interface

Get block

Put block

LightNVM Spec.

Interface

Common

Data Structures

Append-only

Key-Value

Hardware Offload

Block Storage

Page 10: Open-Channel Solid State Drives Matias Bjørling Linux ... · RocksDB on Open-Channel SSDs •Objective: Optimize RocksDB for Open-Channel SSDs - Control data placement: •Align

• LightNVM Specification- Works with NVMe

- Vendor-based command set

- Identification through PCI IDs

- Path to standardilization• ”Physical Page Addressing” Command Set

• Vendors implement their own- ECC algorithms

- XOR engines

- Offload engines

10

Open-Channel SSD: Specification

Generic Media

Manager

Target X

Vendor A SSD Vendor B SSD Vendor C SSD

Generic Media

Manager

Generic Media

Manager

PCIe EthernetPCIe

Target Y

Apps Apps

Page 11: Open-Channel Solid State Drives Matias Bjørling Linux ... · RocksDB on Open-Channel SSDs •Objective: Optimize RocksDB for Open-Channel SSDs - Control data placement: •Align

• To support Physical Page Addressing, a new identify structure is implemented

• Identify• Flash Media Type (SLC, MLC, TLC, QLC)

• Number of channels

• Number of LUNs, planes, block, pages, sector size, etc.

• Multi-plane Operation Support (Single, Dual, Quad, …)

• Media and Controller Capabilities (SLC, Scramble, Encryption, etc.)

• Page Index Table – Let host know the order of writes to MLC/TLC flash media

• PPA Format (Device-side PPA format)

• Bad block Management• Get and Set Bad Block State – Easy Recovery

11

LightNVM Specification: Identify Command

Page 12: Open-Channel Solid State Drives Matias Bjørling Linux ... · RocksDB on Open-Channel SSDs •Objective: Optimize RocksDB for Open-Channel SSDs - Control data placement: •Align

• Write PPA – Enable host to write to a PPA in a unit of a sector

• Read PPA – Enable host to read PPAs in a unit of a sector

• Erase PPA – Enable host to erase a one or more NV memory blocks

• Read/Write- Limited Retry (Only try to read once, else apply all means to recover data)

- FUA – Force data to media

- SLC Mode – Write only to lower page of MLC/TLC media

- Suspend Enable – Apply Erase/Program Suspend to the command

- Plane operation – Access with either single, dual or quad plane commands

12

LightNVM Specification: I/O Commands

Page 13: Open-Channel Solid State Drives Matias Bjørling Linux ... · RocksDB on Open-Channel SSDs •Objective: Optimize RocksDB for Open-Channel SSDs - Control data placement: •Align

Open-Channel SSD: General Media Manager

13

• One media manager instance per SSD• Block media information

• Manages physical address space of a device• Initialization/recovery of block states• Can manage multiple blocks as a single block• Manages addressing mode for device

• Global PPA <-> Device PPA• Provides Wear-leveling• API – Manage address ranges of physical address space

• Get/Put Block of the Physical Address Space• Block size is typically one or more physical flash blocks.

Global PPA (64bit)

Device PPA (32bit)

Host

Device

Page 14: Open-Channel Solid State Drives Matias Bjørling Linux ... · RocksDB on Open-Channel SSDs •Objective: Optimize RocksDB for Open-Channel SSDs - Control data placement: •Align

Open-Channel SSD: pblk target (FTL)

14

pblk – Physical Block Target (FTL)• Exposes media as a LBA block

device• Implements write buffering • Manages logical to physical

address table (L2P)• Orchestrates data placement• Garbage collection• Multiple targets per device

Page 15: Open-Channel Solid State Drives Matias Bjørling Linux ... · RocksDB on Open-Channel SSDs •Objective: Optimize RocksDB for Open-Channel SSDs - Control data placement: •Align

Open-Channel SSD: Steady State Example

• Manage your over-provisioning- SSDs employ 10-30% over-provisioning for performance

- Improve the steady state of the drive

- Predictable latency

- Reduce I/O outliers significantly

15

IOP

S

Time

Page 16: Open-Channel Solid State Drives Matias Bjørling Linux ... · RocksDB on Open-Channel SSDs •Objective: Optimize RocksDB for Open-Channel SSDs - Control data placement: •Align

CNEX Labs Westlake ASIC

Performance Measurements:16 Channels2TB MLC Flash400MT/sHost Interface PCIe G3x8User space performancePreconditioning & garbagecollection enabled

ASIC Performance (NAND @400MT)Seq. READ: 4.0 GB/s Seq. WRITE: 4.1 GB/sRan. 4K READ: 983K IOPSRan. 4K WRITE: 892K IOPS

Westlake ASIC Platform NVMe CompliantPCIe G3x8 or dual G3X44x10GbE NVMoE40 bit DDR316 CH SLC/eMLC/MLC/TLC NAND

Page 17: Open-Channel Solid State Drives Matias Bjørling Linux ... · RocksDB on Open-Channel SSDs •Objective: Optimize RocksDB for Open-Channel SSDs - Control data placement: •Align

Westlake ASIC: Read/Write Latency

17

*FPGA Controller - Not final performance numbers

Page 18: Open-Channel Solid State Drives Matias Bjørling Linux ... · RocksDB on Open-Channel SSDs •Objective: Optimize RocksDB for Open-Channel SSDs - Control data placement: •Align

Westlake ASIC: Read/Write Performance

18

0

500

1000

1500

2000

2500

3000

3500

4000

1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 43 45 47 49 51 53 55 57 59 61 63

MB

/s

LUNs active

Read/Write Performance (MB/s)

Write KB/s

Read KB/s

*1T - Not final performance numbers

Page 19: Open-Channel Solid State Drives Matias Bjørling Linux ... · RocksDB on Open-Channel SSDs •Objective: Optimize RocksDB for Open-Channel SSDs - Control data placement: •Align

19

Application-Driven Storage withOpen-Channel SSDs

Page 20: Open-Channel Solid State Drives Matias Bjørling Linux ... · RocksDB on Open-Channel SSDs •Objective: Optimize RocksDB for Open-Channel SSDs - Control data placement: •Align

Application-Driven Storage: Motivation

• I/O applications with log-structured data structures• Writes are sequential • Reads are random• Great for Open-Channel SSDs

• Handle physical data placement and garbage collection and avoid expensive translation layers• Multiple layers of translation

• Application -> Logical data (e.g. record, key-value pair, blob) to physical file locations

• File-systems -> File offset and range to LBAs• SSD -> LBAs to PPAs

• Convenience of the traditional storage stack becomes a bottleneck

• Mismatch between embedded FTLs capabilities and I/O applications expectations

20

Page 21: Open-Channel Solid State Drives Matias Bjørling Linux ... · RocksDB on Open-Channel SSDs •Objective: Optimize RocksDB for Open-Channel SSDs - Control data placement: •Align

Application-Driven Storage: Applications

• Single level of translation – Applications map data to physical storage

• Fast path for both reads and writes

• Use application data structures to make informed decisions regarding latency, resource utilization and data movement

• Reduce SSD write amplification by accurate flash block alignment

• Translation table overheads are stored within application data structures

21

Page 22: Open-Channel Solid State Drives Matias Bjørling Linux ... · RocksDB on Open-Channel SSDs •Objective: Optimize RocksDB for Open-Channel SSDs - Control data placement: •Align

• User-space library to access Open-Channel SSDs

- Generic interface for programmable SSDs

- Open-Channel SSD configuration exposed through library and sysfs

- ioctls for read/write/erase

• Applications integrates with Open-Channel SSDs by using the exposed configuration and ioctls.

22

Application-Driven Storage: liblightnvm

Page 23: Open-Channel Solid State Drives Matias Bjørling Linux ... · RocksDB on Open-Channel SSDs •Objective: Optimize RocksDB for Open-Channel SSDs - Control data placement: •Align

• Each device exposes its configuration- Number of block, channels, lun, pages, planes,

page size, sector size, ...

- Registered media manager

- Device PPA format

• A registered target exposes its- Configuration

- Devices that it is attached to

- And any other variables

• Applications may use liblightnvm or query sysfs for device geometry

23

Sysfs Integration

/sys/devices/virtual/misc/lightnvm├── devices│ └── nvme0n1│ ├── capabilities│ ├── device_mode│ ├── grp0│ │ ├── flash_media_type│ │ ├── num_blocks│ │ ├── num_channels│ │ ├── num_luns│ │ ├── num_pages│ │ ├── num_planes│ │ ├── page_size│ │ ├── sector_oob_size│ │ ├── sector_size│ │ └── etc.│ ├── media_manager│ ├── num_groups│ ├── ppa_format│ └── version└── targets

└── tgt0Available from 4.7 of the kernel

Page 24: Open-Channel Solid State Drives Matias Bjørling Linux ... · RocksDB on Open-Channel SSDs •Objective: Optimize RocksDB for Open-Channel SSDs - Control data placement: •Align

• Physical Page Address Commands- Three buffers

• Data buffer (sector aligned data buffer)

• Metadata buffer (Out-of-band buffer)

• Physical Page Address (PPA) List• If > 1 PPAs, buffer with list of PPAs to access, else PPA

passed directly

- # of pages (i.e. sectors defined in the PPA list)

- Type of command (read, write, erase)

- Flags• Program/Erase Suspend

• Single, dual, quad plane operation

• Polling

• Vendor-specific – Encryption, compression, etc.

24

liblightnvm Data Commands

*Preliminary design

struct ppa_cmd {void *databuf;void *metadatabuf;union {

void *ppa_list;u64 ppa_addr;

}int nr_pages;int rw; /* r/w/e */int flags;

};

Page 25: Open-Channel Solid State Drives Matias Bjørling Linux ... · RocksDB on Open-Channel SSDs •Objective: Optimize RocksDB for Open-Channel SSDs - Control data placement: •Align

25

RocksDB andOpen-Channel SSDs

Page 26: Open-Channel Solid State Drives Matias Bjørling Linux ... · RocksDB on Open-Channel SSDs •Objective: Optimize RocksDB for Open-Channel SSDs - Control data placement: •Align

RocksDB: Overview

26

• Embeddable Key-Value persistent store where keys and values are arbitrary byte streams.

• Fork from LevelDB

• Based on Log-Structured Merge Tree

• Optimized for fast storage, many CPU cores and low latency

• Used by Facebook, LinkedIn, Yahoo, AirBnb, ...

RocksDB

Page 27: Open-Channel Solid State Drives Matias Bjørling Linux ... · RocksDB on Open-Channel SSDs •Objective: Optimize RocksDB for Open-Channel SSDs - Control data placement: •Align

27

Pers

iste

dIn

mem

ory

Sstable Size

User data age

A ZZ A Z A Z A

A Z

sync

compaction

compaction

Append-only ->

A Z

Most Updated

Least Updated

RocksDB: Log-Structured Merge (LSM) Tree

e.g. 64MB

e.g. 128MB

e.g. 1G

e.g. 100G

Grows 10X

Page 28: Open-Channel Solid State Drives Matias Bjørling Linux ... · RocksDB on Open-Channel SSDs •Objective: Optimize RocksDB for Open-Channel SSDs - Control data placement: •Align

28

• Data locality per sstable (> L0)

• All data in sstable is invalidated simultaneously- sstables are immutable- Merges involve the whole sstable

- Storage unit

- Approximate size - configurable (top-down)

- Data Block, Filter Block, Index Block, Metadata

• But after FTL…- sstable data not guaranteed to be sequential on physical media

- Append-only inserts translate into: 1 invalidation + 1 page write + GC

➡Device-added write amplification

➡GC introduces long unpredictable latency tails

RocksDB: Log-Structured Merge (LSM) Tree

Page 29: Open-Channel Solid State Drives Matias Bjørling Linux ... · RocksDB on Open-Channel SSDs •Objective: Optimize RocksDB for Open-Channel SSDs - Control data placement: •Align

RocksDB on Open-Channel SSDs

• Objective: Optimize RocksDB for Open-Channel SSDs- Control data placement:

• Align data in sstables and WAL to physical media (same block, adjacent blocks)

• Other metadata continues to be stored on posix compatible block storage

- Exploit Parallelism:• Define virtual blocks based on I/O patters issued by the storage backend

• What is the sustained write requirement?

- Reduce GC and minimize over-provisioning• Use LSM merging strategies to remove device-side GC and over-provisioning

• Drive SSD in a stable state

- Control I/O scheduling• Prioritize I/Os based on the LSM persistent needs (e.g., WAL, L0 and lower levels move from hot

to cold)

29

Page 30: Open-Channel Solid State Drives Matias Bjørling Linux ... · RocksDB on Open-Channel SSDs •Objective: Optimize RocksDB for Open-Channel SSDs - Control data placement: •Align

RocksDB + liblightnvm: Challenges

• sstables and write ahead logging- Fit sstables and WAL to block sizes in L0 and further

level (merges + compactions)• No need for GC on SSD side - RocksDB merging acts as GC

- Keep block metadata to reconstruct sstable in case of host crash

• Other Metadata- Keep superblock metadata and other data structures

on block storage to easily recover the database

30

Page 31: Open-Channel Solid State Drives Matias Bjørling Linux ... · RocksDB on Open-Channel SSDs •Objective: Optimize RocksDB for Open-Channel SSDs - Control data placement: •Align

Rocksdb: Match flash block size

31

• sstables and WAL are matched to the size of the of the physical blocks.

• Writes are issued in parallel across multiple blocks (and preferably across multiple LUNs)

• In the case a write fails (very rare), a new pblk is added and the data, for the broken pblk, is rewritten to the new block.

• Physical blocks associated is managed by MANIFEST file.

Page 32: Open-Channel Solid State Drives Matias Bjørling Linux ... · RocksDB on Open-Channel SSDs •Objective: Optimize RocksDB for Open-Channel SSDs - Control data placement: •Align

RocksDB: Block Metadata

32

First Page• Filename (RocksDB GID)

OOB• Page State (Fully written, partially written)• # Valid Bytes

Last Page• CRC

Page 33: Open-Channel Solid State Drives Matias Bjørling Linux ... · RocksDB on Open-Channel SSDs •Objective: Optimize RocksDB for Open-Channel SSDs - Control data placement: •Align

RocksDB: LightNVM Architecture

33

• LSM is the FTL- Tree nodes (files) control data placement on

physical flash

- sstables and WAL are on Open-Channel SSD -rest is in posix file-system

- Garbage collection takes place during compaction

• LightNVM manages flash- liblightnvm takes care of flash block

provisioning; RocksDB works directly with PPAs

- Media manager handles wear-levelling (get/put block)

Page 34: Open-Channel Solid State Drives Matias Bjørling Linux ... · RocksDB on Open-Channel SSDs •Objective: Optimize RocksDB for Open-Channel SSDs - Control data placement: •Align

36

LightNVM and Open-Channel SSDsStatus

Page 35: Open-Channel Solid State Drives Matias Bjørling Linux ... · RocksDB on Open-Channel SSDs •Objective: Optimize RocksDB for Open-Channel SSDs - Control data placement: •Align

LightNVM Subsystem: Status

• LightNVM available from Linux kernel 4.4• Pluggable Architecture

- Media Managers- General Media Manager- CNEX FTL Media Manager (Full FTL exposed as a block device)

- General Media Manager targets- pblk – Expose PPA SSDs as a block device (mapping table on host)- rrpc – Expose Hybrid PPA SSDs as a block device (mapping table both on host and device)- direct flash (to be folded into liblightnvm)

• Supported device drivers:- NVMe support for devices that implements the LightNVM specification- null_blk for performance testing

• Hardware available• CNEX Labs Westlake SDK

37

Page 36: Open-Channel Solid State Drives Matias Bjørling Linux ... · RocksDB on Open-Channel SSDs •Objective: Optimize RocksDB for Open-Channel SSDs - Control data placement: •Align

LightNVM Subsystem: Ongoing Work

• Linux kernel 4.7-4.8- Upstream pblk target- Sysfs integration- Persistent block state management

• Add support for LightNVM in nvme-cli

• Application-driven Storage- liblightnvm

• Integrate with sysfs integration• Standardlize data transfer API

- RocksDB• Move RocksDB DFlash storage backend logic to liblightnvm• Implement efficient page caching • Improve I/O submission (ioctls + libaio)• Exploit device parallelism within RockDB’s LSM

38

Page 37: Open-Channel Solid State Drives Matias Bjørling Linux ... · RocksDB on Open-Channel SSDs •Objective: Optimize RocksDB for Open-Channel SSDs - Control data placement: •Align

39

• LightNVM: http://lightnvm.io

• Interface Specification: http://goo.gl/BYTjLI

•Documentation: http://openchannelssd.readthedocs.org

LightNVMOpen-Channel SSDs

Page 38: Open-Channel Solid State Drives Matias Bjørling Linux ... · RocksDB on Open-Channel SSDs •Objective: Optimize RocksDB for Open-Channel SSDs - Control data placement: •Align

Tools and Libraries

l LightNVM Subsystem Development

l https://github.com/OpenChannelSSD/linux

l lnvm – Administrate your Open-Channel SSDsl https://github.com/OpenChannelSSD/lnvm

l liblightnvm – Integrate lightnvm into your applicationl https://github.com/OpenChannelSSD/liblightnvm

l Test toolsl https://github.com/OpenChannelSSD/lightnvm-hw

l QEMU NVMe with Open-Channel SSD Supportl https://github.com/OpenChannelSSD/qemu-nvme

40