Ceph & RocksDB
Ilsoo Byun (Cloud Storage Team)
Ceph Basics
Placement Group
myobject mypool
PG#1 PG#2 PG#3
hash(myobject) = 4 → 4 % 3 (# of PGs) = 1 ← Target PG
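The mapping above can be sketched in a few lines. Note that real Ceph uses its own rjenkins hash and a "stable mod" scheme over `pg_num`; the `zlib.crc32` hash below is only an illustrative stand-in:

```python
import zlib

def target_pg(object_name: str, num_pgs: int) -> int:
    """Map an object to a placement group: hash the name, then take it
    modulo the PG count. (Illustrative sketch -- Ceph's real mapping
    uses the rjenkins hash and a stable-mod scheme.)"""
    return zlib.crc32(object_name.encode()) % num_pgs

# The mapping is deterministic: the same object always lands in the same PG.
pg = target_pg("myobject", 3)
assert 0 <= pg < 3
```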
CRUSH
mypool
PG#1 PG#2 PG#3
OSD#1 OSD#3 OSD#12
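CRUSH itself is a weighted, hierarchy-aware placement algorithm; the sketch below only illustrates its key property, namely that any client can compute the PG→OSD mapping deterministically from the PG id alone, with no central lookup table. The `crush_like_map` helper and its highest-draw selection are illustrative assumptions, not the real straw2 code:

```python
import hashlib

def crush_like_map(pg_id: int, osds: list, replicas: int) -> list:
    """Pick `replicas` OSDs for a PG by giving every OSD a deterministic
    pseudo-random 'draw' and keeping the highest draws. This mimics the
    spirit of CRUSH's straw selection only -- real CRUSH also handles
    device weights and the failure-domain hierarchy."""
    def draw(osd):
        h = hashlib.sha256(f"{pg_id}:{osd}".encode()).hexdigest()
        return int(h, 16)
    return sorted(osds, key=draw, reverse=True)[:replicas]

# Same inputs always yield the same OSD set, so clients and OSDs can
# compute placement independently -- there is no central directory.
mapping = crush_like_map(1, list(range(12)), 3)
assert mapping == crush_like_map(1, list(range(12)), 3)
```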
Recovery
mypool
PG#1 PG#2 PG#3
OSD#1 OSD#3 OSD#12
OSD
ObjectStore
Replication, ???, …
OSD
Peering, Heartbeat, …
FileStore
BlueStore
https://www.scan.co.uk/products/4tb-toshiba-mg04aca400e-enterprise-hard-drive-35-hdd-sata-iii-6gb-s-7200rpm-128mb-cache-oem
ObjectStore
https://ceph.com/community/new-luminous-bluestore/
OSD Transaction
Consistency is enforced here!
[Figure: sharded OSD write path — a Request (routed via CRUSH) enters op_wq, a Shard writes and flushes data to the SSD, commits via a RocksDB sync transaction (kv_committing), and the finisher queue sends the Ack]
• To maintain ordering within each PG, ordering within each shard should be guaranteed.
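One way to picture that sharding rule: ops for the same PG always land on the same shard's queue, so a single worker drains them in submission order while other shards run in parallel. This is a minimal sketch of the idea, not the OSD's actual sharded op_wq implementation:

```python
import queue, threading

class ShardedExecutor:
    """Route each PG's ops to a fixed shard so per-PG order is kept,
    while different shards proceed in parallel (a sketch of the idea
    behind the OSD's sharded work queue)."""
    def __init__(self, num_shards: int):
        self.queues = [queue.Queue() for _ in range(num_shards)]
        self.results = []
        self.lock = threading.Lock()
        for q in self.queues:
            threading.Thread(target=self._run, args=(q,), daemon=True).start()

    def submit(self, pg_id: int, op):
        # All ops of one PG hash to the same shard -- this is what
        # guarantees per-PG ordering.
        self.queues[pg_id % len(self.queues)].put(op)

    def _run(self, q):
        while True:
            op = q.get()
            with self.lock:
                self.results.append(op())
            q.task_done()

    def drain(self):
        for q in self.queues:
            q.join()

ex = ShardedExecutor(4)
for i in range(8):
    ex.submit(pg_id=1, op=lambda i=i: ("pg1", i))
ex.drain()
# All pg1 ops ran on one shard, so they completed in submission order.
assert [r[1] for r in ex.results if r[0] == "pg1"] == list(range(8))
```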
BlueStore Transaction
• Metadata is stored in RocksDB.
• After the metadata is stored atomically, the data becomes available to users.
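A minimal sketch of that ordering: write the data first, then make the object visible with a single metadata commit. BlueStore performs that commit through a RocksDB transaction; the in-memory dict below is only a stand-in for RocksDB:

```python
import os, tempfile

class MiniObjectStore:
    """Sketch of BlueStore's commit ordering: data blocks are written
    first, then metadata is committed in one atomic step, and the object
    only becomes visible once the metadata commit lands. (Real BlueStore
    commits metadata via a RocksDB transaction; the dict is a stand-in.)"""
    def __init__(self, root):
        self.root = root
        self.meta = {}  # stand-in for the RocksDB metadata keyspace

    def put(self, name: str, data: bytes):
        path = os.path.join(self.root, f"{name}.data")
        with open(path, "wb") as f:
            f.write(data)  # step 1: data is on disk but still invisible
        # step 2: single atomic metadata commit makes the object visible
        self.meta[name] = {"path": path, "len": len(data)}

    def get(self, name: str) -> bytes:
        info = self.meta[name]  # lookup fails until metadata is committed
        with open(info["path"], "rb") as f:
            return f.read()

with tempfile.TemporaryDirectory() as d:
    store = MiniObjectStore(d)
    store.put("obj1", b"hello")
    assert store.get("obj1") == b"hello"
```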
[Figure: RocksDB write path — a Request is appended to the Log file (Transaction Log) and applied to the Memtable; the Memtable is later Flushed into an SST file]
JoinBatchGroup (leader)
PreprocessWrite
WriteToWAL
MarkLogsSynced
ExitAsBatchGroupLeader
Write to memtable
[Figure: group commit across three threads —
Thread #1 (leader): JoinBatchGroup → PreprocessWrite → WriteToWAL → LaunchParallelFollower → MarkLogsSynced → ExitAsBatchGroupLeader
Threads #2/#3 (followers): JoinBatchGroup → AwaitState → CompleteParallelWorker → ExitAsBatchGroupFollower
Followers perform concurrent writes to the memtable; the leader issues a single WAL write for the whole group (group commit)]
RocksDB Group Commit
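The leader/follower protocol can be sketched with a condition variable: the first writer to arrive becomes the leader, absorbs every write queued behind it, and pays one WAL sync for the entire batch. This is a simplified model under assumed names (`GroupCommitLog`, `write`), not RocksDB's actual WriteThread code:

```python
import threading

class GroupCommitLog:
    """Sketch of RocksDB-style group commit: one leader batches the
    pending writes of all concurrent writers and performs a single
    WAL append + sync on behalf of the whole group."""
    def __init__(self):
        self.cond = threading.Condition()
        self.pending = []
        self.leader_active = False
        self.syncs = 0   # how many WAL syncs actually happened
        self.log = []    # stand-in for the WAL

    def write(self, record):
        with self.cond:
            entry = {"rec": record, "done": False}
            self.pending.append(entry)
            # Follower: wait until a leader commits us, or until we can
            # take over leadership ourselves.
            while not entry["done"] and self.leader_active:
                self.cond.wait()
            if entry["done"]:
                return
            self.leader_active = True
            batch, self.pending = self.pending, []
        # One WAL append + sync for the whole batch (outside the lock).
        self.log.extend(e["rec"] for e in batch)
        with self.cond:
            self.syncs += 1
            for e in batch:
                e["done"] = True
            self.leader_active = False
            self.cond.notify_all()

glog = GroupCommitLog()
threads = [threading.Thread(target=glog.write, args=(i,)) for i in range(16)]
for t in threads: t.start()
for t in threads: t.join()
assert sorted(glog.log) == list(range(16))
assert glog.syncs <= 16  # usually far fewer: batched writes share one sync
```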
Thread Scalability
[Chart: IOPS (0–60,000) vs. number of threads, comparing 1 shard and 10 shards]
Shard Scalability
[Chart: IOPS with the WAL enabled vs. disableWAL]
[Figure: many concurrent PUTs are funneled through a single WAL into RocksDB]
RadosGW
• RadosGW is an application of RADOS
RadosGW
[Figure: Ceph architecture — RadosGW and CephFS run on top of RADOS (OSDs, Monitors, Managers)]
• All atomic operations depend on RocksDB
RadosGW Transaction
• Put Object is performed as three steps against RADOS: Prepare Index → Write Data → Complete Index
[Figure: Put Object — the bucket index object holds key-value entries; object contents go to a separate data object]
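The three-step Put Object flow can be modeled in a few lines. `MiniRGW` and its dict-backed index are illustrative stand-ins for the bucket index object and data objects stored in RADOS:

```python
class MiniRGW:
    """Sketch of RadosGW's three-step Put Object: (1) prepare a pending
    entry in the bucket index, (2) write the data object, (3) complete
    the index entry. If step 2 fails, the pending entry can be cleaned
    up later, so readers never observe a half-written object."""
    def __init__(self):
        self.index = {}  # stand-in for the bucket index object (key -> state)
        self.data = {}   # stand-in for data objects

    def put_object(self, key, payload):
        self.index[key] = {"state": "pending"}   # 1. Prepare Index
        self.data[key] = payload                 # 2. Write Data
        self.index[key] = {"state": "complete",  # 3. Complete Index
                           "size": len(payload)}

    def get_object(self, key):
        entry = self.index.get(key)
        if not entry or entry["state"] != "complete":
            raise KeyError(key)  # pending/absent objects stay invisible
        return self.data[key]

rgw = MiniRGW()
rgw.put_object("photo.jpg", b"jpegbytes")
assert rgw.get_object("photo.jpg") == b"jpegbytes"

# A write stuck between steps 1 and 3 is invisible to readers.
rgw.index["half.bin"] = {"state": "pending"}
try:
    rgw.get_object("half.bin")
    assert False, "pending object must not be readable"
except KeyError:
    pass
```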
[Figure (revisited): sharded OSD write path — Request → op_wq → Shard → write/flush to SSD → RocksDB sync transaction (kv_committing) → finisher queue → Ack]
bluestore_shard_finishers = true
• To maintain ordering within each PG, ordering within each shard should be guaranteed.
BlueStore Transaction
Performance Issue
Tail Latency
Performance Metrics
• "SILK: Preventing Latency Spikes in Log-Structured Merge Key-Value Stores" (ATC'19)
RocksDB Compaction Overhead
• Ceph highly depends on RocksDB
• The strong consistency of Ceph is implemented using RocksDB transactions
• The performance of Ceph also depends on RocksDB, especially for small IO
• But RocksDB has some performance issues: WAL flushing and compaction
Conclusions
THANK YOU