Performance tuning in BlueStore & RocksDB

Li Xiaoyan (Lisa), Intel

Post on 18-Dec-2021



Overview

[Figure: cluster layout. Storage nodes run OSDs, one per disk, each with its own ObjectStore; monitor nodes (M) are shown alongside.]

• ObjectStore stores data on the local node.

• BlueStore is the default object store.

BlueStore

• BlueStore = Block + NewStore

• Data is written directly to the raw block device.

• Metadata is written to a key/value database (RocksDB):
  Ø Object data
  Ø Omap
  Ø Deferred logs
  Ø Others

• RocksDB runs on top of BlueFS, a lightweight file system.

RocksDB

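[Figure: RocksDB layout. In memory: writes go to the MemTable, which later becomes an Immutable MemTable. On disk: .log, manifest, and current files, plus SSTables (.sst) at Level 0, Level 1, and Level 2, linked by compaction.]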

• A key-value database, originated at Google and improved by Facebook.

• Based on an LSM tree (log-structured merge tree).

• Keywords:
  Ø Active MemTable
  Ø Immutable MemTable
  Ø SST file
  Ø LOG
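The keywords above can be tied together with a toy model. This is a minimal sketch in Python, not RocksDB code: the class name, the entry-count limit, and the in-memory "SST files" are all illustrative assumptions. It only shows the order in which data moves from the LOG and active MemTable to immutable MemTables and then to Level 0.

```python
class ToyLSM:
    """Toy LSM store: active MemTable -> immutable MemTables -> L0 SST files."""

    def __init__(self, memtable_limit=3):
        self.memtable_limit = memtable_limit  # entries per MemTable (hypothetical knob)
        self.log = []        # LOG: every write is appended here first
        self.active = {}     # active MemTable, accepts new writes
        self.immutable = []  # full MemTables waiting to be flushed
        self.level0 = []     # flushed SST files, each a sorted list of (key, value)

    def put(self, key, value):
        self.log.append((key, value))
        self.active[key] = value
        if len(self.active) >= self.memtable_limit:
            # The full MemTable becomes immutable; a fresh one takes new writes.
            self.immutable.append(self.active)
            self.active = {}

    def flush(self):
        """Write every immutable MemTable out as one sorted L0 SST file."""
        while self.immutable:
            table = self.immutable.pop(0)
            self.level0.append(sorted(table.items()))

    def get(self, key):
        # Newest data wins: active MemTable, then immutable ones, then L0 files.
        if key in self.active:
            return self.active[key]
        for table in reversed(self.immutable):
            if key in table:
                return table[key]
        for sst in reversed(self.level0):
            for k, v in sst:
                if k == key:
                    return v
        return None


db = ToyLSM(memtable_limit=2)
for i in range(5):
    db.put(f"key{i}", f"value{i}")
db.flush()
print(len(db.level0), db.active, db.get("key1"))
```

Compaction from Level 0 into deeper levels is left out to keep the sketch short.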

BlueStore – small write

• RocksDB acts as the journal (deferred IO).

• Customer data is written to RocksDB, and the write returns to the OSD.

• Later, the customer data is written to the block device.

• The deferred IO entry is then deleted from RocksDB.

Transaction state flow (small write):

STATE_PREPARE: _txc_add_transaction / add deferred IO to KV write batch
STATE_AIO_WAIT: no aios
STATE_IO_DONE: _txc_finish_io
STATE_KV_QUEUED: put into kv_queue, kv_cond.notify
STATE_KV_SUBMITTED: _kv_sync_thread / _kv_finalize_thread
STATE_KV_DONE: _txc_committed_kv
STATE_DEFERRED_QUEUED: _deferred_queue / _kv_finalize_thread
STATE_DEFERRED_CLEANUP: _txc_finish
STATE_DONE
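The small-write path can be sketched as follows. This is a toy Python model under stated assumptions, not BlueStore code: the class, the "deferred/N" key layout, and the dictionaries standing in for RocksDB and the block device are all hypothetical. It shows the three steps from the slide: journal in the key/value store and acknowledge, write in place later, then delete the journal entry.

```python
class ToyDeferredStore:
    """Toy model of a deferred (journaled) small write."""

    def __init__(self):
        self.kv = {}       # stands in for RocksDB
        self.block = {}    # stands in for the raw block device: offset -> data
        self.pending = []  # deferred IO keys not yet applied to the block device
        self.next_seq = 0

    def small_write(self, offset, data):
        """Journal the data in the KV store and acknowledge right away."""
        key = f"deferred/{self.next_seq}"  # hypothetical key layout
        self.next_seq += 1
        self.kv[key] = (offset, data)
        self.pending.append(key)
        return "committed"  # the OSD treats the write as durable from here

    def apply_deferred(self):
        """Later: write journaled data in place, then drop the journal entries."""
        applied = 0
        while self.pending:
            key = self.pending.pop(0)
            offset, data = self.kv[key]
            self.block[offset] = data  # data reaches the block device
            del self.kv[key]           # deferred IO entry is deleted
            applied += 1
        return applied


store = ToyDeferredStore()
ack = store.small_write(4096, b"hello")
journaled = dict(store.kv)        # snapshot: data lives only in the KV store
applied = store.apply_deferred()
print(ack, journaled, store.block, store.kv)
```

Every small write therefore costs one put and one delete in the key/value store, which is why deferred entries show up so heavily in the RocksDB traffic discussed later in the deck.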

BlueStore – big write

• No journal is needed.

• Customer data is written into new space.

• The write returns to the OSD once the metadata is written to RocksDB.

• The old space is released.

Transaction state flow (big write):

STATE_PREPARE: _txc_add_transaction
STATE_AIO_WAIT: _txc_aio_submit / txc_aio_finish
STATE_IO_DONE: _txc_finish_io
STATE_KV_QUEUED: put into kv_queue, kv_cond.notify
STATE_KV_SUBMITTED: _kv_sync_thread / _kv_finalize_thread
STATE_KV_DONE: _txc_committed_kv
STATE_FINISHING: _txc_finish
STATE_DONE
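For contrast with the journaled path, the big-write steps can be sketched like this. It is a toy Python model under stated assumptions: the class, the extent ids, and the trivial allocator are hypothetical and do not reflect BlueStore's real allocator or metadata format. The point is the ordering: data first goes to new space, the metadata commit acknowledges the write, and only then is the old space released.

```python
class ToyCowStore:
    """Toy model of a big write: new space, metadata commit, release old space."""

    def __init__(self):
        self.kv = {}     # stands in for RocksDB: object name -> extent id
        self.block = {}  # stands in for the block device: extent id -> data
        self.free = []   # released extent ids available for reuse
        self.next_extent = 0

    def _allocate(self):
        if self.free:
            return self.free.pop()
        extent = self.next_extent
        self.next_extent += 1
        return extent

    def big_write(self, name, data):
        new_extent = self._allocate()
        self.block[new_extent] = data   # 1. data goes to new space, no journal
        old_extent = self.kv.get(name)
        self.kv[name] = new_extent      # 2. metadata commit; write is acknowledged
        if old_extent is not None:
            del self.block[old_extent]  # 3. old space is released
            self.free.append(old_extent)
        return "committed"


cow = ToyCowStore()
cow.big_write("obj", b"version-1")
cow.big_write("obj", b"version-2")
print(cow.kv, cow.block, cow.free)
```

Because the new extent is allocated before the old one is freed, an overwrite never lands on the space that still holds the previous version.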

OSD tuning 1

• 4k random writes.

• With the default config (perf dump data): top chart.

• The BlueStore finisher is single-threaded by default.

• After setting bluestore_shard_finishers: bottom charts.

[Figure: latency breakdown charts. OSD latency is split into bluestore_lat and others; BlueStore time is split into BlueStore and others.]

[Figure: bar chart of IOPS (k), axis range 40 to 50, comparing 4k_randw and 4k_finisher_false.]
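The difference between one finisher and sharded finishers can be illustrated with plain threads. This is a rough Python sketch, not Ceph code: the function name and the `shard % num_finishers` mapping are assumptions made for illustration, and the slides do not say how BlueStore assigns completions to finishers. Each finisher drains its own queue, so completions of one shard stay in order while different shards can complete in parallel.

```python
import queue
import threading


def run_finishers(completions, num_finishers):
    """Run (shard, label) completions on num_finishers worker threads.

    Returns one list per finisher with the labels it completed, in order.
    """
    queues = [queue.Queue() for _ in range(num_finishers)]
    done = [[] for _ in range(num_finishers)]

    def worker(index):
        while True:
            item = queues[index].get()
            if item is None:  # sentinel: no more work for this finisher
                break
            done[index].append(item)

    threads = [threading.Thread(target=worker, args=(i,))
               for i in range(num_finishers)]
    for t in threads:
        t.start()
    for shard, label in completions:
        # One shard always maps to the same finisher, so its order is kept.
        queues[shard % num_finishers].put(label)
    for q in queues:
        q.put(None)
    for t in threads:
        t.join()
    return done


completions = [(0, "a0"), (1, "b0"), (0, "a1"), (2, "c0"), (1, "b1")]
single = run_finishers(completions, 1)   # one finisher handles everything
sharded = run_finishers(completions, 2)  # work is split across two finishers
print(single)
print(sharded)
```

With a single finisher, every completion waits behind all the others; with two, the queue each completion waits in is shorter.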

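Write IO timespan

• Time spans are taken from perf dump.

• OSD total latency: from when the OSD handles an IO in the Messenger to when the IO is committed.

• BlueStore latency: from when BlueStore receives an IO to when it commits the IO back to the OSD.

• Note: left, 4k random writes; right, 16k random writes.

[Figure: latency breakdown charts with legend BlueStore / others.]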
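The split between BlueStore time and "others" can be computed from a perf dump along these lines. This is a hedged Python sketch: the sample JSON is made up, and the counter names `op_w_latency` and `commit_lat` are assumptions about which counters correspond to the two time spans; check the names in your own perf dump output before relying on them.

```python
import json

# Made-up sample shaped like perf dump output; names and numbers are
# illustrative assumptions, not measurements from the talk.
SAMPLE = """
{
  "osd": {"op_w_latency": {"avgcount": 1000, "sum": 2.5}},
  "bluestore": {"commit_lat": {"avgcount": 1000, "sum": 1.5}}
}
"""


def average_ms(counter):
    """Average latency in milliseconds from a {avgcount, sum-in-seconds} counter."""
    if counter["avgcount"] == 0:
        return 0.0
    return 1000.0 * counter["sum"] / counter["avgcount"]


def latency_split(perf):
    """Split OSD write latency into the BlueStore part and everything else."""
    osd_ms = average_ms(perf["osd"]["op_w_latency"])
    bluestore_ms = average_ms(perf["bluestore"]["commit_lat"])
    return {
        "osd_total_ms": osd_ms,
        "bluestore_ms": bluestore_ms,
        "others_ms": osd_ms - bluestore_ms,
    }


split = latency_split(json.loads(SAMPLE))
print(split)
```

Counters accumulate since daemon start, so for a benchmark run it is usually better to take the difference between a dump before and a dump after the run.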

Random write tuning

• Keep total memory usage constant.

• RocksDB options:
  Ø min_write_buffer_number_to_merge (default 1)
  Ø write_buffer_size (default 256 MB), changed to 128 MB and 64 MB.

• 4k random writes.

[Figure: bar chart of IOPS (k), y-axis 0 to 60; x-axis labels 256, 128, 64 and 256, 512, 768.]
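One way to hold memory usage constant while shrinking the write buffer is to scale the number of buffers inversely. The Python sketch below assumes that approach; the slides do not say exactly how the budget was kept constant. The helper name and the 2048 MB budget (256 MB times the default of 8 buffers quoted on the next slide) are illustrative, and the comma-separated key=value output is assumed to match the format of the RocksDB option string passed to BlueStore.

```python
MB = 1024 * 1024


def options_for_budget(total_budget_mb, write_buffer_mb, min_to_merge=1):
    """Build an option string that keeps total MemTable memory at the budget.

    Assumes the budget is write_buffer_size * max_write_buffer_number.
    """
    if total_budget_mb % write_buffer_mb != 0:
        raise ValueError("budget must be a multiple of the write buffer size")
    count = total_budget_mb // write_buffer_mb
    return ",".join([
        f"write_buffer_size={write_buffer_mb * MB}",
        f"max_write_buffer_number={count}",
        f"min_write_buffer_number_to_merge={min_to_merge}",
    ])


for size_mb in (256, 128, 64):
    print(options_for_budget(2048, size_mb))
```

Smaller buffers flush more often but each flush is cheaper; keeping the product fixed isolates that effect from a plain change in memory.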

Random write tuning – cont.

• Increase total memory usage.

• RocksDB options:
  Ø max_write_buffer_number (default 8)

• The improvement is small.

• 4k random writes.

[Figure: chart of IOPS (k), axis range 40 to 50.]

Random write – 4k

• When data is written into RocksDB, it is added to MemTables.

• Once a flush condition is triggered, data in the MemTables is flushed into L0 SST files.

• Deferred logs are the main data in every MemTable, while object data is the main data in L0 files.

[Figure: "Data written into db", series plogs, ologs, dlogs, mlogs, others; y-axis 0 to 300.]

[Figure: "L0 SST", series plogs, ologs, dlogs, others; y-axis 0 to 100.]

Random write – 16k

• Similar data to the 4k random writes.

• Object data is the main data both in every MemTable and in every L0 SST file.

[Figure: "Data written into db", series plogs, ologs, dlogs, mlogs, others; y-axis 0 to 300.]

[Figure: "L0 SST", series plogs, ologs, dlogs, others; y-axis 0 to 200.]

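RocksDB - Flush data recursively

• Random writes, 4k/16k.

• Add a flush style: delete duplicated entries recursively.

• Performance is similar to merge num = 2, but the data written into L0 is decreased (5 GB per 10 minutes).

[Figure: bar chart of IOPS (k), y-axis 0 to 60, comparing Default and dedup_2 for 4k and 16k.]

[Figure: flush styles. Old: memtable1 to memtable4 are combined by "Merge" steps before being written as SST files. New: memtable1 to memtable4 pass through "Dedup" before being written as SST files. SST labels shown: SST1, SST2, SST3, SST4.]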
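One reading of the dedup flush style is sketched below in Python. It is an interpretation, not the patch from the talk: the function name, the key names, and the rule "skip an entry if any newer MemTable holds the same key" are assumptions. It shows how a deferred entry whose delete already sits in a newer MemTable never has to be written to L0.

```python
TOMBSTONE = "<deleted>"  # stands in for a delete marker


def flush_oldest_with_dedup(memtables):
    """Flush memtables[0] (the oldest) and return its SST as sorted pairs.

    Entries whose key appears again in any newer MemTable are skipped,
    because that newer put or delete supersedes them.
    """
    oldest, newer = memtables[0], memtables[1:]
    superseded = set()
    for table in newer:
        superseded.update(table.keys())
    sst = sorted(
        ((k, v) for k, v in oldest.items() if k not in superseded),
        key=lambda pair: pair[0],
    )
    del memtables[0]  # the flushed MemTable is dropped from memory
    return sst


memtables = [
    {"deferred/1": "4k data", "object/a": "meta-v1"},
    {"deferred/1": TOMBSTONE, "deferred/2": "4k data"},
    {"deferred/2": TOMBSTONE, "object/b": "meta-v1"},
]
sst1 = flush_oldest_with_dedup(memtables)
sst2 = flush_oldest_with_dedup(memtables)
sst3 = flush_oldest_with_dedup(memtables)
print(sst1, sst2, sst3, sep="\n")
```

In this run neither 4k deferred payload reaches an SST file; only the small delete markers and the object metadata are written.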

Future work

• RocksDB is still heavy for pg logs and object data.
  Ø pg log data is written once, and read only when the OSD node goes through recovery.
  Ø For object data, a one-time journal may be enough.

Thanks & QA

Backup – Test config

• Hardware
  • Memory: 128G
  • CPU: Intel(R) Xeon(R) [email protected]
  • Disk: Intel P3700 400G

• Software
  • Ceph master branch @f584df78c294b11baa7527d8eab0874ae6a2b809

• Config
  • One OSD, one monitor, and one manager.