Ceph & RocksDB
Ilsoo Byun (Cloud Storage Team)
Ceph Basics
Placement Group
myobject mypool
PG#1 PG#2 PG#3
hash(myobject) = 4 → 4 % 3 (# of PGs) = 1 ← Target PG
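The mapping above can be sketched in a few lines. Note that real Ceph uses its own rjenkins hash and a "stable mod" scheme over `pg_num`; the `zlib.crc32` hash below is only an illustrative stand-in:

```python
import zlib

def target_pg(object_name: str, num_pgs: int) -> int:
    """Map an object to a placement group: hash the name, then take it
    modulo the PG count. (Illustrative sketch -- Ceph's real mapping
    uses the rjenkins hash and a stable-mod scheme.)"""
    return zlib.crc32(object_name.encode()) % num_pgs

# The mapping is deterministic: the same object always lands in the same PG.
pg = target_pg("myobject", 3)
assert 0 <= pg < 3
```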
CRUSH
mypool
PG#1 PG#2 PG#3
OSD#1 OSD#3 OSD#12
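CRUSH itself is a weighted, hierarchy-aware placement algorithm; the sketch below only illustrates its key property, namely that any client can compute the PG→OSD mapping deterministically from the PG id alone, with no central lookup table. The `crush_like_map` helper and its highest-draw selection are illustrative assumptions, not the real straw2 code:

```python
import hashlib

def crush_like_map(pg_id: int, osds: list, replicas: int) -> list:
    """Pick `replicas` OSDs for a PG by giving every OSD a deterministic
    pseudo-random 'draw' and keeping the highest draws. This mimics the
    spirit of CRUSH's straw selection only -- real CRUSH also handles
    device weights and the failure-domain hierarchy."""
    def draw(osd):
        h = hashlib.sha256(f"{pg_id}:{osd}".encode()).hexdigest()
        return int(h, 16)
    return sorted(osds, key=draw, reverse=True)[:replicas]

# Same inputs always yield the same OSD set, so clients and OSDs can
# compute placement independently -- there is no central directory.
mapping = crush_like_map(1, list(range(12)), 3)
assert mapping == crush_like_map(1, list(range(12)), 3)
```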
Recovery
mypool
PG#1 PG#2 PG#3
OSD#1 OSD#3 OSD#12
OSD
ObjectStore
Replication, ???, …
OSD
Peering, Heartbeat, …
FileStore
BlueStore
https://www.scan.co.uk/products/4tb-toshiba-mg04aca400e-enterprise-hard-drive-35-hdd-sata-iii-6gb-s-7200rpm-128mb-cache-oem
ObjectStore
https://ceph.com/community/new-luminous-bluestore/
OSD Transaction
Consistency is enforced here!
[Figure: sharded OSD write path — a Request (routed via CRUSH) enters op_wq, a Shard writes and flushes data to the SSD, commits via a RocksDB sync transaction (kv_committing), and the finisher queue sends the Ack]
• To maintain ordering within each PG, ordering within each shard should be guaranteed.
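One way to picture that sharding rule: ops for the same PG always land on the same shard's queue, so a single worker drains them in submission order while other shards run in parallel. This is a minimal sketch of the idea, not the OSD's actual sharded op_wq implementation:

```python
import queue, threading

class ShardedExecutor:
    """Route each PG's ops to a fixed shard so per-PG order is kept,
    while different shards proceed in parallel (a sketch of the idea
    behind the OSD's sharded work queue)."""
    def __init__(self, num_shards: int):
        self.queues = [queue.Queue() for _ in range(num_shards)]
        self.results = []
        self.lock = threading.Lock()
        for q in self.queues:
            threading.Thread(target=self._run, args=(q,), daemon=True).start()

    def submit(self, pg_id: int, op):
        # All ops of one PG hash to the same shard -- this is what
        # guarantees per-PG ordering.
        self.queues[pg_id % len(self.queues)].put(op)

    def _run(self, q):
        while True:
            op = q.get()
            with self.lock:
                self.results.append(op())
            q.task_done()

    def drain(self):
        for q in self.queues:
            q.join()

ex = ShardedExecutor(4)
for i in range(8):
    ex.submit(pg_id=1, op=lambda i=i: ("pg1", i))
ex.drain()
# All pg1 ops ran on one shard, so they completed in submission order.
assert [r[1] for r in ex.results if r[0] == "pg1"] == list(range(8))
```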
BlueStore Transaction
• Metadata is stored in RocksDB.
• After the metadata is stored atomically, the data becomes available to users.
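A minimal sketch of that ordering: write the data first, then make the object visible with a single metadata commit. BlueStore performs that commit through a RocksDB transaction; the in-memory dict below is only a stand-in for RocksDB:

```python
import os, tempfile

class MiniObjectStore:
    """Sketch of BlueStore's commit ordering: data blocks are written
    first, then metadata is committed in one atomic step, and the object
    only becomes visible once the metadata commit lands. (Real BlueStore
    commits metadata via a RocksDB transaction; the dict is a stand-in.)"""
    def __init__(self, root):
        self.root = root
        self.meta = {}  # stand-in for the RocksDB metadata keyspace

    def put(self, name: str, data: bytes):
        path = os.path.join(self.root, f"{name}.data")
        with open(path, "wb") as f:
            f.write(data)  # step 1: data is on disk but still invisible
        # step 2: single atomic metadata commit makes the object visible
        self.meta[name] = {"path": path, "len": len(data)}

    def get(self, name: str) -> bytes:
        info = self.meta[name]  # lookup fails until metadata is committed
        with open(info["path"], "rb") as f:
            return f.read()

with tempfile.TemporaryDirectory() as d:
    store = MiniObjectStore(d)
    store.put("obj1", b"hello")
    assert store.get("obj1") == b"hello"
```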
[Figure: RocksDB write path — a Request is appended to the Log file (Transaction Log) and applied to the Memtable; the Memtable is later Flushed into an SST file]
JoinBatchGroup (leader)
PreprocessWrite
WriteToWAL
MarkLogsSynced
ExitAsBatchGroupLeader
Write to memtable
[Figure: group commit across three threads —
Thread #1 (leader): JoinBatchGroup → PreprocessWrite → WriteToWAL → LaunchParallelFollower → MarkLogsSynced → ExitAsBatchGroupLeader
Threads #2/#3 (followers): JoinBatchGroup → AwaitState → CompleteParallelWorker → ExitAsBatchGroupFollower
Followers perform concurrent writes to the memtable; the leader issues a single WAL write for the whole group (group commit)]
RocksDB Group Commit
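The leader/follower protocol can be sketched with a condition variable: the first writer to arrive becomes the leader, absorbs every write queued behind it, and pays one WAL sync for the entire batch. This is a simplified model under assumed names (`GroupCommitLog`, `write`), not RocksDB's actual WriteThread code:

```python
import threading

class GroupCommitLog:
    """Sketch of RocksDB-style group commit: one leader batches the
    pending writes of all concurrent writers and performs a single
    WAL append + sync on behalf of the whole group."""
    def __init__(self):
        self.cond = threading.Condition()
        self.pending = []
        self.leader_active = False
        self.syncs = 0   # how many WAL syncs actually happened
        self.log = []    # stand-in for the WAL

    def write(self, record):
        with self.cond:
            entry = {"rec": record, "done": False}
            self.pending.append(entry)
            # Follower: wait until a leader commits us, or until we can
            # take over leadership ourselves.
            while not entry["done"] and self.leader_active:
                self.cond.wait()
            if entry["done"]:
                return
            self.leader_active = True
            batch, self.pending = self.pending, []
        # One WAL append + sync for the whole batch (outside the lock).
        self.log.extend(e["rec"] for e in batch)
        with self.cond:
            self.syncs += 1
            for e in batch:
                e["done"] = True
            self.leader_active = False
            self.cond.notify_all()

glog = GroupCommitLog()
threads = [threading.Thread(target=glog.write, args=(i,)) for i in range(16)]
for t in threads: t.start()
for t in threads: t.join()
assert sorted(glog.log) == list(range(16))
assert glog.syncs <= 16  # usually far fewer: batched writes share one sync
```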
Thread Scalability
[Chart: IOPS (0–60,000) vs. number of threads, comparing 1 shard and 10 shards]
Shard Scalability
[Chart: IOPS with the WAL enabled vs. disableWAL]
[Figure: many concurrent PUTs are funneled through a single WAL into RocksDB]
RadosGW
• RadosGW is an application of RADOS
RadosGW
[Figure: Ceph architecture — RadosGW and CephFS run on top of RADOS (OSDs, Monitors, Managers)]
• All atomic operations depend on RocksDB
RadosGW Transaction
• Put Object is performed as three steps against RADOS: Prepare Index → Write Data → Complete Index
[Figure: Put Object — the bucket index object holds key-value entries; object contents go to a separate data object]
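The three-step Put Object flow can be modeled in a few lines. `MiniRGW` and its dict-backed index are illustrative stand-ins for the bucket index object and data objects stored in RADOS:

```python
class MiniRGW:
    """Sketch of RadosGW's three-step Put Object: (1) prepare a pending
    entry in the bucket index, (2) write the data object, (3) complete
    the index entry. If step 2 fails, the pending entry can be cleaned
    up later, so readers never observe a half-written object."""
    def __init__(self):
        self.index = {}  # stand-in for the bucket index object (key -> state)
        self.data = {}   # stand-in for data objects

    def put_object(self, key, payload):
        self.index[key] = {"state": "pending"}   # 1. Prepare Index
        self.data[key] = payload                 # 2. Write Data
        self.index[key] = {"state": "complete",  # 3. Complete Index
                           "size": len(payload)}

    def get_object(self, key):
        entry = self.index.get(key)
        if not entry or entry["state"] != "complete":
            raise KeyError(key)  # pending/absent objects stay invisible
        return self.data[key]

rgw = MiniRGW()
rgw.put_object("photo.jpg", b"jpegbytes")
assert rgw.get_object("photo.jpg") == b"jpegbytes"

# A write stuck between steps 1 and 3 is invisible to readers.
rgw.index["half.bin"] = {"state": "pending"}
try:
    rgw.get_object("half.bin")
    assert False, "pending object must not be readable"
except KeyError:
    pass
```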
[Figure (revisited): sharded OSD write path — Request → op_wq → Shard → write/flush to SSD → RocksDB sync transaction (kv_committing) → finisher queue → Ack]
bluestore_shard_finishers = true
• To maintain ordering within each PG, ordering within each shard should be guaranteed.
BlueStore Transaction
Performance Issue
Tail Latency
Performance Metrics
• "SILK: Preventing Latency Spikes in Log-Structured Merge Key-Value Stores" (ATC'19)
RocksDB Compaction Overhead
• Ceph highly depends on RocksDB
• The strong consistency of Ceph is implemented using RocksDB transactions
• The performance of Ceph also depends on RocksDB, especially for small IO
• But RocksDB has some performance issues: WAL flushing and compaction
Conclusions
THANK YOU