Optimizing ForestDB for Flash-Based SSD: Couchbase Connect 2015

Post on 07-Aug-2015


OPTIMIZING FORESTDB FOR FLASH-BASED SSD

Sang-Won Lee, Professor, Sungkyunkwan University

Sundar Sridharan, Senior Software Engineer, Couchbase Inc.

©2015 Couchbase Inc.


Contents

▪ Introduction
▪ SHARE Interface in Flash-Based SSD for ForestDB
▪ ForestDB Optimizations at File System Layer
▪ Evaluation Results
▪ Future Work
▪ Summary


Introduction

▪It is the all-flash storage era!

▪Legacy of the hard-disk era in system software
  ▪ Suboptimal on top of flash storage

▪ForestDB: next-generation KV engine of Couchbase

▪Opportunities
  ▪ Exploit flash storage characteristics (SHARE Interface)
  ▪ Leverage modern CoW-based file systems

SHARE Interface in Flash-Based SSD for ForestDB


Characteristics of Flash Storage (vs. Hard Disk)

▪No-overwrite and the flash translation layer (FTL)
  ▪ Overwrite is not allowed
  ▪ Another layer of address mapping inside flash storage

▪Limited lifetime

▪Write time in flash storage ~ write amount
  ▪ Write time in hard disk ~ mechanical disk head movement


Copy-on-Write in ForestDB

▪Document update
  ▪ Copy-on-Write, instead of in-place update
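The copy-on-write update described on this slide can be illustrated with a toy append-only store. This is a minimal sketch, not ForestDB's on-disk format: the class name `AppendOnlyStore`, the in-memory `bytearray` standing in for the database file, and the simple key-to-offset index are all assumptions made for illustration.

```python
class AppendOnlyStore:
    """Toy copy-on-write document store (illustrative, not ForestDB's format)."""

    def __init__(self):
        self.log = bytearray()   # stands in for the database file
        self.index = {}          # key -> (offset, length) of the newest version

    def set(self, key, value: bytes):
        # Copy-on-write: never overwrite in place, always append a new version.
        offset = len(self.log)
        self.log += value
        self.index[key] = (offset, len(value))
        return offset

    def get(self, key) -> bytes:
        offset, length = self.index[key]
        return bytes(self.log[offset:offset + length])


store = AppendOnlyStore()
first = store.set("doc1", b"v1-data")
second = store.set("doc1", b"v2-data")   # update appends, old bytes remain
```

After the second `set`, the old version still occupies the start of the log; only the index entry moved. That leftover data is what compaction later has to reclaim.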


Copy-On-Write in ForestDB (2)

▪Why CoW?
  ▪ 1) Write atomicity and 2) multi-version concurrency control
  ▪ A reasonable solution on HDD

▪Problems with CoW in flash storage
  ▪ Tree wandering → write amplification → low performance
  ▪ Flash storage lifetime


Opportunities in Flash Storage

▪Address mapping inside flash storage (by FTL)


Opportunities in Flash Storage (2)

▪SHARE interface: explicit address remapping
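To make the idea of explicit address remapping concrete, here is a toy model of an FTL mapping table with a SHARE-style operation. The slides do not give the actual SHARE command format, so the class `ToyFTL`, the method name `share`, and its `(dst_lpn, src_lpn)` parameters are hypothetical; the sketch only shows that remapping changes the table without writing any data pages.

```python
class ToyFTL:
    """Toy flash translation layer: logical page -> physical page (illustrative)."""

    def __init__(self):
        self.mapping = {}        # logical page number -> physical page number
        self.flash = []          # physical pages, written once and never overwritten
        self.pages_written = 0

    def write(self, lpn, data):
        # No overwrite: each write goes to a fresh physical page and the map is updated.
        self.flash.append(data)
        self.mapping[lpn] = len(self.flash) - 1
        self.pages_written += 1

    def read(self, lpn):
        return self.flash[self.mapping[lpn]]

    def share(self, dst_lpn, src_lpn):
        # Hypothetical SHARE-style call: dst points at src's physical page, no data written.
        self.mapping[dst_lpn] = self.mapping[src_lpn]


ftl = ToyFTL()
ftl.write(10, b"valid document")
ftl.share(500, 10)               # "copy" page 10 to page 500 by remapping only
```

In this model the logical "copy" costs one mapping update and zero page writes, which is the property a compaction built on SHARE would rely on.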


Opportunities in Flash Storage (3)

▪ForestDB Compaction with SHARE
  ▪ No write of valid documents to new file


SHARE Implementation

▪Firmware extension for SHARE
  ▪ OpenSSD Board (http://www.openssd-project.org/)
  ▪ Atomic and recoverable


Performance Evaluation

▪Normal time performance: YCSB’s workload-F


Performance Evaluation (2)

▪Compaction performance

                       Elapsed Time (sec)    Written Bytes (MB)
Original ForestDB      227.5                 1126.4
ForestDB with SHARE    88.4                  150.6
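The improvement implied by the compaction table can be worked out directly. The short calculation below uses only the four numbers from the table; the variable names are chosen here for illustration.

```python
# Values copied from the compaction table above.
original = {"elapsed_sec": 227.5, "written_mb": 1126.4}
with_share = {"elapsed_sec": 88.4, "written_mb": 150.6}

speedup = original["elapsed_sec"] / with_share["elapsed_sec"]
write_reduction = original["written_mb"] / with_share["written_mb"]

print(f"compaction speedup: {speedup:.2f}x")          # prints 2.57x
print(f"write reduction:    {write_reduction:.2f}x")  # prints 7.48x
```

The ratios come out to roughly 2.6x for elapsed time and 7.5x for bytes written.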

ForestDB Optimizations at File System Layer


Overview

▪Motivation – the catch-22

▪Why B-Tree file system (Btrfs)

▪How ForestDB solves the catch-22 using Btrfs

▪Optimizing with the Linux asynchronous I/O library (libaio)

▪Performance Results


Append-Only Key-Value Stores are Great!

▪Consistency
  ▪ Stable access to multiple point-in-time snapshots of data

▪Performance with Isolation
  ▪ Multi-Version Concurrency Control (MVCC) means readers and writers do not block each other

▪Recoverability
  ▪ Can easily roll back the entire database to a stable past state

▪SSD Friendly
  ▪ Avoids in-place updates and the extra Flash Translation Layer (FTL) work they cause
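The snapshot, isolation, and rollback properties listed above all fall out of the append-only layout. The sketch below is a toy in-memory model, not ForestDB's API: `VersionedLog` and its methods are hypothetical names, and a snapshot is modeled simply as the log length at the time it was taken.

```python
class VersionedLog:
    """Toy append-only KV log with snapshots and rollback (illustrative only)."""

    def __init__(self):
        self.entries = []                 # append-only list of (key, value)

    def set(self, key, value):
        self.entries.append((key, value))

    def snapshot(self):
        # A snapshot is just the log length at this moment; nothing is copied.
        return len(self.entries)

    def get(self, key, snapshot=None):
        # Scan backwards from the snapshot point for the newest visible version.
        end = len(self.entries) if snapshot is None else snapshot
        for k, v in reversed(self.entries[:end]):
            if k == key:
                return v
        return None

    def rollback(self, snapshot):
        # Discard everything appended after the snapshot.
        del self.entries[snapshot:]


db = VersionedLog()
db.set("a", 1)
snap = db.snapshot()
db.set("a", 2)                            # writer continues; reader's view is unchanged
old_view, new_view = db.get("a", snap), db.get("a")
db.rollback(snap)
```

A reader holding `snap` keeps seeing the old value while the writer appends, and rolling back is a truncation rather than an undo of individual updates.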


Append-Only KV Stores are Great!


MVCC: Readers & Writer Run Unblocked!


But...

▪Disk can fill up with stale data

▪Need to do garbage collection - Compaction
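Conventional compaction, as described here, rewrites every live document into a fresh file and then drops the old one. The following is a minimal sketch under assumed data structures (a byte log plus a key-to-`(offset, length)` index); the function name `compact` is hypothetical and does not reflect ForestDB's internals.

```python
def compact(log, index):
    """Copy only live documents into a new log; return (new_log, new_index).

    log   : bytes-like append-only file contents
    index : dict key -> (offset, length) of the newest version of each key
    """
    new_log = bytearray()
    new_index = {}
    for key, (offset, length) in index.items():
        new_index[key] = (len(new_log), length)
        new_log += log[offset:offset + length]   # valid data is rewritten
    return new_log, new_index


# Three appended versions of "a" and one of "b"; only the newest of each is live.
log = bytearray(b"a1a2a3b1")
index = {"a": (4, 2), "b": (6, 2)}
new_log, new_index = compact(log, index)
stale_bytes = len(log) - len(new_log)
```

Note that every live byte is written a second time, and until the old log is deleted both copies exist on disk at once.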


Compactions Do Garbage Collection...


Compactions for Garbage Collection


A Fundamental Problem with Disk Space

Writer appends too much data

What if the size of active data exceeds the free space available?


A Fundamental Problem: Catch-22

“My disk is getting full... I want to free up space but don’t have enough free space to free up space!”

Size of active data must be strictly less than the free space available on disk!
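The catch-22 condition can be written as a one-line check. This is a small sketch with hypothetical helper names (`can_compact`, `free_bytes_on`); it uses `shutil.disk_usage` from the Python standard library to read the real free space, and the example sizes are made up.

```python
import shutil


def can_compact(active_bytes, free_bytes):
    # Rewriting all live data into a new file needs strictly more free space
    # than the live data occupies.
    return active_bytes < free_bytes


def free_bytes_on(path="."):
    # Free space on the file system holding `path`.
    return shutil.disk_usage(path).free


GIB = 1024 ** 3
print(can_compact(active_bytes=6 * GIB, free_bytes=8 * GIB))   # True
print(can_compact(active_bytes=6 * GIB, free_bytes=4 * GIB))   # False: the catch-22
```

In the second case the database cannot reclaim its stale space, because the copy-based compaction that would free it does not fit.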


B-Tree File System (Btrfs)

▪Btrfs is a copy-on-write filesystem for Linux

▪Development began at Oracle in 2007; marked as stable since August 2014 (http://goo.gl/upukn4)

▪Industry support from Facebook, Fujitsu, Fusion-IO, Intel, Netgear, Novell/SUSE, Oracle, Red Hat, etc.

▪Available as an option in all major Linux distributions


Btrfs Features (Short list)

▪Max file size up to 16 exbibytes (ext4: 16 tebibytes per file, 1 exbibyte per volume)
▪Self healing due to copy-on-write nature
▪Online defragmentation
▪Online volume growth and shrinking
▪Online block device addition and removal
▪Block discards for improved wear levelling on SSDs using TRIM
▪Transparent compression configurable per file or volume
▪Online data scrubbing
▪Send/receive of diffs
▪Snapshots and subvolumes

▪File Cloning!


Btrfs Basics - Representation

File P with reference counted extents


Btrfs Feature - Copy File Range

The copy-file-range API lets a new File “Q” share physical disk extents with File “P”


Btrfs Feature - Blocks shared across files

Copy-on-write lets new updates happen on File Q


Btrfs Basics - Deleting File

Deleting file Q


Btrfs Basics - Freeing up space

Freeing up space
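The sequence walked through on these slides (reference-counted extents, sharing a range, copy-on-write updates, deletion, freeing space) can be modeled in a few lines. This is a toy model only: `ExtentFS` and its methods are hypothetical names, extents are plain integer ids, and real Btrfs metadata is far richer. For variety the example deletes P rather than Q, which is the case compaction cares about.

```python
class ExtentFS:
    """Toy model of files as lists of reference-counted extents (illustrative)."""

    def __init__(self):
        self.refcount = {}       # extent id -> number of files referencing it
        self.files = {}          # file name -> list of extent ids
        self.next_id = 0

    def _new_extent(self):
        eid = self.next_id
        self.next_id += 1
        self.refcount[eid] = 1
        return eid

    def _unref(self, eid):
        self.refcount[eid] -= 1
        if self.refcount[eid] == 0:
            del self.refcount[eid]        # space is freed only when unreferenced

    def create(self, name, n_extents):
        self.files[name] = [self._new_extent() for _ in range(n_extents)]

    def clone_range(self, src, dst, start, count):
        # Share extents src[start:start+count] with dst; no data is copied.
        shared = self.files[src][start:start + count]
        for eid in shared:
            self.refcount[eid] += 1
        self.files.setdefault(dst, []).extend(shared)

    def write(self, name, pos):
        # Copy-on-write: the file gets a fresh extent, the old one loses a reference.
        old = self.files[name][pos]
        self.files[name][pos] = self._new_extent()
        self._unref(old)

    def delete(self, name):
        for eid in self.files.pop(name):
            self._unref(eid)

    def used_extents(self):
        return len(self.refcount)


fs = ExtentFS()
fs.create("P", 4)                 # P owns extents 0..3
fs.clone_range("P", "Q", 1, 2)    # Q shares extents 1 and 2
fs.write("Q", 0)                  # Q's first extent diverges into new extent 4
fs.delete("P")                    # frees only extents no longer referenced
```

After P is deleted, extent 2 survives because Q still references it, while extents 0, 1, and 3 are released.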


ForestDB Compaction Using Btrfs Cloning

Compaction works by using Btrfs to clone (copy-on-write) valid block ranges from the old file into the new file...


ForestDB Compaction Using Btrfs Cloning

Deleting the old file.fdb.0 frees only the space belonging to stale blocks. Valid blocks shared with file.fdb.1 stay intact!
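A clone-based compaction of this shape can be sketched with Python's `os.copy_file_range` (Python 3.8+, Linux). This is a stand-in chosen for illustration: the slides do not show which call ForestDB issues, the helper names `clone_range` and `compact_by_cloning` are hypothetical, and the list of valid ranges is made up. Whether the kernel shares extents or copies bytes depends on the file system; on Btrfs it may share them when ranges are block-aligned, which is an assumption about the environment, not something this code enforces.

```python
import os
import tempfile


def clone_range(src_fd, dst_fd, src_off, dst_off, length):
    """Copy `length` bytes between open files, letting the kernel share extents if it can."""
    copied = 0
    while copied < length:
        try:
            n = os.copy_file_range(src_fd, dst_fd, length - copied,
                                   src_off + copied, dst_off + copied)
        except (AttributeError, OSError):
            # Fallback for platforms or file systems without copy_file_range.
            data = os.pread(src_fd, length - copied, src_off + copied)
            n = os.pwrite(dst_fd, data, dst_off + copied)
        if n == 0:
            break                 # reached end of the source file
        copied += n
    return copied


def compact_by_cloning(old_path, new_path, valid_ranges):
    """Append each valid (offset, length) range of old_path to new_path."""
    src = os.open(old_path, os.O_RDONLY)
    dst = os.open(new_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        dst_off = 0
        for offset, length in valid_ranges:
            dst_off += clone_range(src, dst, offset, dst_off, length)
        return dst_off
    finally:
        os.close(src)
        os.close(dst)


workdir = tempfile.mkdtemp()
old_file = os.path.join(workdir, "file.fdb.0")
new_file = os.path.join(workdir, "file.fdb.1")
with open(old_file, "wb") as f:
    f.write(b"STALE___VALID-AASTALE___VALID-BB")
# Hypothetical valid ranges, as a compactor might derive from its index.
total = compact_by_cloning(old_file, new_file, [(8, 8), (24, 8)])
os.remove(old_file)               # on Btrfs, shared extents stay alive via file.fdb.1
```

The 8-byte ranges here are far smaller than a file system block, so they would simply be copied; the example only demonstrates the control flow of building the new file from valid ranges and then deleting the old one.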

Performance Results

Ubuntu 14.04, Btrfs v3.12, 4 CPU cores, 20GB SSD drive, 8GB DRAM


Performance (1) – ForestDB on Btrfs

~1.25 - 2 X Faster! ½ write amplification!


Performance (2) – ForestDB on Btrfs

~1.5 - 4 X Faster! ½ write amplification!


Performance (3) – ForestDB on Btrfs

~2 X Faster! ½ write amplification!


Speeding up Reads with libaio

▪Modern SSDs have multiple I/O channels

▪Asynchronous I/O maximizes throughput

▪Well suited for ForestDB compaction tasks
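The benefit of keeping several reads in flight can be illustrated without libaio itself. libaio is a C library and the Python standard library has no binding for it, so the sketch below swaps in a thread pool with `os.pread` as a stand-in; the function name `read_blocks_concurrently`, the 4096-byte block size, and the `max_inflight` parameter are assumptions for illustration.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

BLOCK_SIZE = 4096


def read_blocks_concurrently(path, block_numbers, max_inflight=8):
    """Read the given blocks with several requests in flight; results keep input order."""
    fd = os.open(path, os.O_RDONLY)
    try:
        with ThreadPoolExecutor(max_workers=max_inflight) as pool:
            # os.pread takes an explicit offset, so threads do not share a file position.
            return list(pool.map(
                lambda b: os.pread(fd, BLOCK_SIZE, b * BLOCK_SIZE), block_numbers))
    finally:
        os.close(fd)


# Build a small file of 16 distinct blocks and read four of them back.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    for i in range(16):
        f.write(bytes([i]) * BLOCK_SIZE)
blocks = read_blocks_concurrently(path, [3, 0, 15, 7])
os.remove(path)
```

A compactor that knows in advance which blocks it needs can submit them as a batch like this, giving an SSD with multiple channels more than one request to work on at a time.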


Performance (4) – ForestDB on Btrfs with libaio

13X faster!

7X faster!

4X faster!


Advantages of Btrfs with libaio

▪Efficiently uses disk space avoiding the catch-22

▪Reduces Write Amplification by 2 times
  ▪ Longer SSD lifespan due to reduced wear

▪Over 13 X faster compaction speeds

▪Generic file system layer solution that applies to SSD as well as spinning disks

Future Work


▪Optimize Btrfs clone feature for better performance
  ▪ Working with the Linux Btrfs community

▪Optimize ForestDB to skip reading if cloning on compaction

▪Adapt the ext4 file system to support the new system call that lets multiple files share physical blocks

Summary


▪ForestDB with SHARE interface in SSD
  ▪ Speeds up compactions by 3X with 10X lower write amplification

▪ForestDB with Btrfs clone feature in file system layer
  ▪ Speeds up compactions by 2X with 2X lower write amplification

▪ForestDB with Btrfs clone feature with Linux libaio
  ▪ Speeds up compactions by 13X with 2X lower write amplification


Questions?

Sang-Won Lee, swlee@skku.edu

Sundar Sridharan, sundar@couchbase.com


Initial Load Performance

3x ~ 6x less time


Initial Load Performance

4x less write overhead


Read-Only Performance

[Figure: Throughput (operations per second) vs. # reader threads (1, 2, 4, 8) for ForestDB, LevelDB, and RocksDB]

2x ~ 5x


Write-Only Performance

[Figure: Throughput (operations per second) vs. write batch size in # documents (1, 4, 16, 64, 256) for ForestDB, LevelDB, and RocksDB]

- Small batch size (e.g., < 10) is not usually common

3x ~ 5x


Write-Only Performance

[Figure: Write amplification (normalized to a single doc size) vs. write batch size in # documents (1, 4, 16, 64, 256) for ForestDB, LevelDB, and RocksDB]

ForestDB shows 4x ~ 20x less write amplification


Mixed Workload Performance

[Figure: Mixed (Unrestricted) Performance, operations per second vs. # reader threads (1, 2, 4, 8) for ForestDB, LevelDB, and RocksDB]

2x ~ 5x
