03 bluestore rocksdb optimization ceph v2 by...

19
Li, Xiaoyan

Upload: others

Post on 21-May-2020

17 views

Category:

Documents


0 download

TRANSCRIPT

Li, Xiaoyan

2

Agenda

§ BlueStore overview

§ Rocksdb overview

§ BlueStore latency over OSD

§ Rocksdb optimization

3

Agenda

§ BlueStore overview

§ Rocksdb overview

§ BlueStore latency over OSD

§ Rocksdb optimization

4

Ceph: Architecture

DISK DISK DISK DISK DISK

OSD OSD OSD

ObjectStore ObjectStore ObjectStore

MMM

CommodityServers

M

Storage Node

Monitor Node

§ BlueStore is default ObjectStore.

OSD OSD

ObjectStoreObjectStore

BlueStore

• BlueStore = Block + NewStore

Ø Data written directly to block device

Ø Key/value database (Rocksdb) for metadata

Ø Light weight file system BlueFS.

Metadata

S* - “superblock” properties for the entire store

B* - block allocation metadata (blocks, size, blocks_per_key etc)

b* - allocation bitmap

T* - stats (bytes used, compressed, tec)

C* - collection name -> cnode_t

O* - object name -> onode mapping

X* - shared blobs

L* - deferred writes

M* - omap data

Metadata – cons.

16k random write:Total 11113220 key-value pairsM,4758712O,3182396b,3172112size of keys (MB):M,157O,229b,30size of values (MB):M,577O,1580b,48total 2205

4k random write:Total 11120873 key-value pairsL,3177321M,4764738O,3178352size of keys (MB):L,30M,157O,228size of values (MB):L,6264M,578O,2202total 9044

• What kind of metadata?

BlueStore – small IO (rewrite)

STATE_PREPARE_txc_add_transaction/adddeferrediotokvwritebach STATE_AIO_WAITNoaios

STATE_IO_DONE

_txc_finish_io

STATE_KV_QUEUED Putkv_queueKv_cond.notifySTATE_KV_SUBMITTED _kv_sync_thread/

_kv_finalize_thread

STATE_KV_DONE

_txc_committed_kv

STATE_DEFERRED_QUEUED STATE_DEFERRED_CLEANUP_deferred_queue/_kv_finalize_thread

_txc_finish

STATE_DONE

• Key/Value DB acts as WAL (deferred IO).

• Data is written to KV db, and return to upper layer.

• Later data is written into block device.

• Deferred IO entry is deleted from KV db.

BlueStore – big IO

• Key/Value DB acts as WAL (deferred IO).

• Data is written to KV db, and return to upper layer.

• Later data is written into block device.

• Deferred IO entry is deleted from KV db.

STATE_PREPARE_txc_add_transaction STATE_AIO_WAIT_txc_aio_submit/txc_aio_finish

STATE_IO_DONE

_txc_finish_io

STATE_KV_QUEUED Putkv_queueKv_cond.notifySTATE_KV_SUBMITTED _kv_sync_thread/

_kv_finalize_thread

STATE_KV_DONE

_txc_committed_kv

STATE_FINISHING

_txc_finish

STATE_DONE

10

Agenda

§ BlueStore overview

§ Rocksdb overview

§ BlueStore latency over OSD

§ Rocksdb optimization

Rocksdb

MemTableImmutableMemTable

write

Memory

Disk

Flush

.sstLevel 0

.sstLevel 1

Level 2 .sst

SSTable

.log

manifest

current

• A key-value database, originated by Google, improved by Facebook.

• Based on LSM (Log-Structure merge Tree).

§ Flush

§ Compaction

§ Write: write into memTable

§ Read: memTable fist, and then Level 0, Level 1 etc until finding the value.

§ Write amplification

§ Read amplification

§ Space amplification

Compaction

12

Agenda

§ BlueStore overview

§ Rocksdb overview

§ BlueStore latency over OSD

§ Rocksdb optimization

rbd10_sata_dev_nvme_db_4k

§ Fio+librbd on 10 rbd images.

§ 4k random write.

§ Use Intel P3700 as db+wal, Intel S3520 as block device.

0%10%20%30%40%50%60%70%80%90%

100%

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

Perc

enta

ge

Time (3mins interval)

BlueStore IO time span

state_io_percentage state_kv_queued_percentage state_kv_percentage other_percentage

0%10%20%30%40%50%60%70%80%90%

100%

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

Perc

enta

ge

Time (3 mins interval)

OSD time span

bluestore_percentage other_percentage

rbd10_nvme_all_4k

§ Fio+librbd on 10 rbd images.

§ 4k random write.

§ Use Intel P3700 for all.

0%

20%

40%

60%

80%

100%

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

Perc

enta

ge

Time (3mins interval)

BlueStore IO time span

state_io_percentage state_kv_queued_percentage state_kv_percentage other_percentage

0%

20%

40%

60%

80%

100%

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

Perc

enta

ge

Time (3 mins interval)

OSD time span

bluestore_percentage other_percentage

15

Agenda

§ BlueStore overview

§ Rocksdb overview

§ BlueStore latency over OSD

§ Rocksdb optimization

Merge Merge

Rocksdb Optimization

• New merge style.

Ø Merge key/value pairs recursively.

Ø Decrease the data flushed into disks. 020406080

100120140160

1 2 3 4 5 6 7 8 9 101112131415161718192021222324

Wri

te.m

icro

s

Time

db.write.micros

normal_merge2 dup2

0100200300400500600700800900

1 2 3 4 5 6 7 8 9 101112131415161718192021222324

Get

.mic

ros

Time

db.Get.micros

normal_merge2 dup2

memtable1 memtable2 memtable3 memtable4

memtable1 memtable2 memtable3 memtable4

Dedup

Old:

New:

Rocksdb Optimization – cons.

• Enable column families.

Ø Create different column families for different kinds of metadata.

Ø Set different options based on attributes of each type of metadata.

• Omap

• Deferred Ios

• Other

Ø This is first step. Further optimization based on cf is in progress.

Impact of write buffer size

• Write_buffer_size

• Use Intel P3700

• 4k random Ios

• Different write_buffer_size and min_write_buffer_number_to_merge.

100012001400160018002000220024002600

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

Late

ncy

Time (3 mins interval)

Bluestore/commit_lat (us)

64M_merge8 64M_merge4 256M_merge2 256M_merge1

3500000

3700000

3900000

4100000

4300000

4500000

4700000

4900000

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

Cou

nt

Time (3 mins interval)

bluestore/txc count

64M_merge8 64M_merge4 256M_merge2 256M_merge1