02-lixiaoyan performance tuning in bluestore&rocksdb
TRANSCRIPT
Overview
2
DISK DISK DISK DISK DISK
OSD OSD OSD
MMMM
Storage Node
Monitor Node
OSD OSD
ObjectStoreObjectStore ObjectStore ObjectStore ObjectStore
• ObjectStore istostoredatainlocalnodes.
• BlueStore isdefaultobjectstore.
BlueStore
• BlueStore =Block+NewStore
• Datawritedirectlytorawblockdevice.
• MetadatawritetoKey/valuedatabase(Rocksdb).
Ø ObjectdataØ OmapØ DeferredlogsØ others• RocksDB isabovelight
weightfilesystemBlueFS.
3
RocksDB
MemTableImmutableMemTable
write
Memory
Disk
Compaction
.sstLevel0
.sstLevel1
Level2 .sst
SSTable
.log
manifest
current
• Akey-valuedatabase,originatedbyGoogle,improvedbyFacebook.
• BasedonLSM(Log-StructuremergeTree).
• Keywords:Ø ActiveMemTableØ Immutable
MemTableØ SSTfileØ LOG
4
BlueStore – smallwrite
• RocksDB actsasjournal(deferredIO).
• CustomerdataiswrittentoRocksDB,andreturntoOSD.
• Latercustomerdataiswrittenintoblockdevice.
• DeferredIOentryisdeletedfromRocksDB.
5
STATE_PREPARE_txc_add_transaction/adddeferrediotokvwritebach STATE_AIO_WAITNoaios
STATE_IO_DONE
_txc_finish_io
STATE_KV_QUEUED Putkv_queueKv_cond.notifySTATE_KV_SUBMITTED _kv_sync_thread/
_kv_finalize_thread
STATE_KV_DONE
_txc_committed_kv
STATE_DEFERRED_QUEUED_deferred_queue/_kv_finalize_thread STATE_DEFERRED_CLEANUP
_txc_finish
STATE_DONE
BlueStore – bigwrite
• Nojournalisneeded.• Customerdataiswrittenintoanewspace.
• ReturntoOSDwhenmetadataiswrittenintoRocksDB.
• Theoldspaceisreleased.
6
STATE_PREPARE_txc_add_transaction STATE_AIO_WAIT_txc_aio_submit/txc_aio_finish
STATE_IO_DONE
_txc_finish_io
STATE_KV_QUEUED Putkv_queueKv_cond.notifySTATE_KV_SUBMITTED _kv_sync_thread/
_kv_finalize_thread
STATE_KV_DONE
_txc_committed_kv
STATE_FINISHING _txc_finish
STATE_DONE
OSDtuning1
OSD
bluestore_lat others
• 4krandomwrites.• Withdefaultconfig (Perfdump
data):topchart.• BlueStore finisherissingle-
threadbydefault.• Aftersetting
bluestore_shard_finishers:bottomcharts.
40
45
50
IOPS(k)
4k_randw 4k_finsiher_false
7
OSD
bluestore_lat others
BlueStore
others
OSD
bluestore_lat others
WriteIOtimespan
• Gettimespanfromperfdump.
• OSDtotallatency:fromOSDhandlesaIOinMessengerstocommittheIO.
• BlueStore latency:fromBlueStoregetsaIOtocommitte IOtoOSD.
• Note:Left4krandomwrites,right16krandomwrites
BlueStore
others
8
0
20
40
60
256 128 64
IOPS(k)
256 512 768
• Keeptotalmemoryusageconsistent.
• RocksDB options:Ø min_write_buffer_num
ber_to_merge (default1)
Ø write_buffer_size(default256MB),changedto128,64.
• 4krandomwrites.
Randomwritetuning
9
• Increasetotalmemoryusageconsistent.
• RocksDB options:Ø max_write_buffer_num
ber (default8)• Theimprovementislittle.• 4krandomwrites.
Randomwritetuning– cont.
404244464850
IOPS(k)
10
0
100
200
300
1 39 77 115
153
191
229
267
305
343
381
419
457
495
533
571
609
647
685
Data written into db
plogs ologs dlogs mlogs others
0
50
100
1 47 93 139
185
231
277
323
369
415
461
507
553
599
645
691
737
L0 SST
plogs ologs dlogs others
Randomwrite– 4k
• WhendataiswrittenintoRocksDB,itisaddedintomemTables.
• Onceflushconditionistriggered,datainmemTablesareflushedintoL0SSTfiles.
• DeferredlogsaremaindataineverymemTablewhileobjectdataaremaindatainL0files.
11
Randomwrite– 16k
0100200300
1 13 25 37 49 61 73 85 97 109
121
133
145
157
169
Data written into db
plogs ologs dlogs mlogs others
0100200
1 14 27 40 53 66 79 92 105
118
131
144
157
170
183
L0 SST
plogs ologs dlogs others
• Similardataas4krandomwrites.
• ObjectdataaremaindatabothineverymemTableandeveryL0SSTfile.
12
SST2
• Randomwrites4k/16k.• Addaflushstyle:todelete
duplicatedentriesrecursively.• Performanceissimilaras
mergenum =2,butdatawrittenintoL0isdecreased.(5Gper10mins)
RocksDB - Flushdatarecursively.
0
20
40
60
4k 16k
IOPS(k)
Default dedup_2
Merge Merge
memtable1 memtable2 memtable3 memtable4
memtable1 memtable2 memtable3 memtable4
Dedup
Old:
New:
SST1
SST1 SST2 SST3 SST4
14
Futurework
• RocksDB isstillheavyforpg logs,objectdata.Ø Forpg logsdata,theyarewrittenonce,andreadwhen
theOSDnodegetsrecovery.Ø Forobjectdata,one-timejournalmaybeenough.
16
Backup– Testconfig
• Hardware• Memory:128G• CPU:Intel(R)Xeon(R)[email protected]• Disk:IntelP3700400G
• Software• Cephmasterbranch
@f584df78c294b11baa7527d8eab0874ae6a2b809• Config
• AOSD,amonitor,andamanager.