TUT18972: Unleash the Power of Ceph Across the Data Center
TRANSCRIPT
Unleash the Power of Ceph Across the Data Center
TUT18972: FC/iSCSI for Ceph
Ettore Simone, Senior Architect
Alchemy Solutions Lab
Agenda
• Introduction
• The Bridge
• The Architecture
• Use Cases
• How It Works
• Some Benchmarks
• Some Optimizations
• Q&A
• Bonus Tracks
Introduction
About Ceph
“Ceph is a distributed object store and file system designed to provide excellent performance, reliability and scalability.” (http://ceph.com/)
FUT19336 - SUSE Enterprise Storage Overview and Roadmap
TUT20074 - SUSE Enterprise Storage Design and Performance
Ceph timeline
‒ 2004: Project start at UCSC
‒ 2006: Open source
‒ 2010: Mainline Linux kernel
‒ 2011: OpenStack integration
‒ Q2 2012: Launch of Inktank
‒ 2012: CloudStack integration
‒ Q3 2012: Production ready
‒ 2013: Xen integration
‒ Q1 2015: SUSE Storage 1.0
‒ Q4 2015: SUSE Enterprise Storage 2.0
Some facts
Common data center storage solutions are built mainly on top of Fibre Channel (yes, and NAS too).
Source: Wikibon Server SAN Research Project 2014
Is the storage mindset changing?
New/Cloud (SCALE-OUT):
‒ Micro-services composed applications
‒ NoSQL and distributed databases (lazy commit, replication)
‒ Object and distributed storage
Classic (SCALE-UP):
‒ Traditional application → relational DB → traditional storage
‒ Transactional process → commit on DB → commit on disk
Is the storage mindset changing? No.
New/Cloud:
‒ Micro-services composed applications
‒ NoSQL and distributed databases (lazy commit, replication)
‒ Object and distributed storage
→ The natural playground of Ceph
Classic:
‒ Traditional application → relational DB → traditional storage
‒ Transactional process → commit on DB → commit on disk
→ Where we want to introduce Ceph!
Is the new kid on the block so noisy?
Ceph is cool but I cannot rearchitect my storage!
And what about my shiny big disk arrays?
I have already N protocols, why another one?
<Add your own fear here>
Our goal
How to achieve a non-disruptive introduction of Ceph into a traditional storage infrastructure?
(Diagram: SAN = SCSI over FC; NAS = NFS/SMB/iSCSI over Ethernet; Ceph = RBD over Ethernet)
How to let Ceph coexist happily in your data center with the existing neighborhood
(traditional workloads, legacy servers, FC switches, etc.)
The Bridge
FC/iSCSI gateway
iSCSI:
‒ Out-of-the-box feature of SES 2.0
‒ TUT16512 - Ceph RBD Devices and iSCSI
Fibre Channel:
‒ That is the point we will focus on today
Back to our goal
How to achieve a non-disruptive introduction of Ceph into a traditional storage infrastructure?
(Diagram: SAN and NAS alongside RBD)
Linux-IO Target (LIO™)
LIO is the most common open-source SCSI target in modern GNU/Linux distributions:
Fabric modules: FC, FCoE, FireWire, iSCSI, iSER, SRP, loopback, vHost
Backstores: FILEIO, IBLOCK, RBD, pSCSI, RAMDISK, TCMU
(Both the LIO core and the backstores run in kernel space.)
The Architecture
Technical Reference for Entry Level
Dedicated nodes connect Ceph to Fibre Channel
Hypothesis for High Throughput
All OSD nodes connect Ceph to Fibre Channel
Our LAB Architecture
Pool and OSD geometry
Multi-root CRUSH map
Multipath I/O (MPIO)
devices {
    device {
        vendor                "(LIO-ORG|SUSE)"
        product               "*"
        path_grouping_policy  "multibus"
        path_checker          "tur"
        features              "0"
        hardware_handler      "1 alua"
        prio                  "alua"
        failback              "immediate"
        rr_weight             "uniform"
        no_path_retry         "fail"
        rr_min_io             100
    }
}
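After changing /etc/multipath.conf on an initiator, reload the maps and check that each LUN shows one path per gateway; a minimal check (it assumes multipathd is already running):
# multipathd -k'reconfigure'
# multipath -ll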
Automatically classify the OSD
Classify by NODE; OSD; DEV; SIZE; WEIGHT; SPEED
# ceph-disk-classify
osd01  0  sdb  300G  0.287  15K
osd01  1  sdc  300G  0.287  15K
osd01  2  sdd  200G  0.177  SSD
osd01  3  sde  1.0T  0.971  7.2K
osd01  4  sdf  1.0T  0.971  7.2K
osd02  5  sdb  300G  0.287  15K
osd02  6  sdd  200G  0.177  SSD
osd02  7  sde  1.0T  0.971  7.2K
osd01  8  sdf  1.0T  0.971  7.2K
osd03  9  sdb  300G  0.287  15K
…
Avoid standard CRUSH location
Default:
  osd crush location = root=default host=`hostname -s`
Using a helper script:
  osd crush location hook = /path/to/script
Or entirely manual:
  osd crush update on start = false
  …
# ceph osd crush [add|set] 39 0.971 root=root-7.2K host=osd08-7.2K
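The hook is simply an executable that prints the CRUSH location for the OSD being started. A hypothetical sketch that reuses the ceph-disk-classify output shown above (the script path, the root-<speed> naming and the classifier lookup are illustrative assumptions, not part of the original setup):

#!/bin/bash
# Hypothetical /path/to/script: Ceph calls it as
#   <script> --cluster <name> --id <osd-id> --type osd
# and expects a CRUSH location such as "root=... host=..." on stdout.
while [ $# -ge 1 ]; do
    case "$1" in
        --id) ID="$2"; shift ;;
    esac
    shift
done
# Look up the disk class (15K/7.2K/SSD) of this OSD from the classifier output.
SPEED=$(/usr/local/sbin/ceph-disk-classify 2>/dev/null | awk -v id="$ID" '$2 == id { print $6 }')
HOST=$(hostname -s)
if [ -n "$SPEED" ]; then
    echo "root=root-${SPEED} host=${HOST}-${SPEED}"
else
    echo "root=default host=${HOST}"
fi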
Use Cases
Smooth transition
Native migration of SAN LUNs to RBD volumes helps migration, conversion and coexistence:
(Diagram: traditional workloads on the SAN and new workloads in the private cloud, both served through the Ceph SAN gateway)
Storage replacement
No drama at the end of life/support of traditional storage arrays:
(Diagram: traditional and new workloads both served through the Ceph gateway)
D/R and Business Continuity
(Diagram: a Ceph gateway at Site A and a Ceph gateway at Site B)
How It Works
Ceph and Linux-IO
SCSI commands coming from the fabrics are handled by the LIO core, configured using targetcli or directly via configfs, and proxied to the target block device through the corresponding backstore module.
(Diagram: SCSI clients on the fabric → LIO on the gateway, configured through /sys/kernel/config/target from user space while the LIO core runs in kernel space → Ceph cluster)
Enable QLogic in target mode
# modprobe qla2xxx qlini_mode="disabled"
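To keep target mode across reboots, the module option can be made persistent; a minimal sketch (the file name is an assumption, and rebuilding the initrd is only needed if qla2xxx is loaded from it):
# cat <<EOF >/etc/modprobe.d/99-qla2xxx.conf
options qla2xxx qlini_mode=disabled
EOF
# dracut -f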
Identify and enable HBAs
# cat /sys/class/scsi_host/host*/device/fc_host/host*/port_name | \
  sed -e 's/../:&/g' -e 's/:0x://'
# targetcli qla2xxx/ create ${WWPN}
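If a gateway has several HBA ports, the same two steps can be wrapped in a loop; a small sketch (it assumes every local FC port should become a target, and uses the equivalent shorter /sys/class/fc_host path):
# for WWPN in $(cat /sys/class/fc_host/host*/port_name | sed -e 's/../:&/g' -e 's/:0x://'); do
      targetcli qla2xxx/ create ${WWPN}
  done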
Map RBDs and create backstores
# rbd map -p ${POOL} ${VOL}
# targetcli backstores/rbd create name="${POOL}-${VOL}" dev="${DEV}"
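The rbd map is not persistent across reboots by itself. One common approach (a sketch; it assumes the rbdmap helper and service shipped with the Ceph packages, and client.admin credentials on the gateway) is to register the image in /etc/ceph/rbdmap and enable the service:
# echo "${POOL}/${VOL} id=admin,keyring=/etc/ceph/ceph.client.admin.keyring" >>/etc/ceph/rbdmap
# systemctl enable rbdmap.service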
(Diagram: as above, with the mapped image now visible on the gateway as /dev/rbd0)
Create LUNs connected to RBDs
# targetcli qla2xxx/${WWPN}/luns create /backstores/rbd/${POOL}-${VOL}
(Diagram: as above, with the RBD backstore exported as LUN0)
“Zoning” to filter access with ACLs
# targetcli qla2xxx/${WWPN}/acls create ${INITIATOR} true
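Once LUNs and ACLs are defined, the running LIO configuration can be reviewed and persisted so it survives a reboot; a minimal sketch (on SES the lrbd tool can manage this instead):
# targetcli ls
# targetcli saveconfig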
Some Benchmarks
First of all...
This solution is NOT a drop-in replacement for SAN nor NAS (at the moment, at least!).
The main focus is to identify how to minimize the overhead from native RBD to FC/iSCSI.
Raw performance estimation on 15K disks
Physical disk → Ceph IOPS:
‒ 4K RND Read:  193 x 24 = 4,632
‒ 4K RND Write: 178 x 24 / 3 = 1,424; / 3 ≈ 475
Physical disk → Ceph throughput:
‒ 512K RND Read:  108 MB/s x 24 ≈ 2,600 MB/s
‒ 512K RND Write: 105 MB/s x 24 / 3 = 840; / 2 = 420 MB/s
NOTE:
‒ 24 OSDs and 3 replicas per pool (writes divided by 3 for replication)
‒ No SSD for journal (so roughly 1/3 of the IOPS and 1/2 of the bandwidth for writes)
Compared performance on 15K disks
(Charts: 64K SEQ Read and Write throughput in MB/s, scale 0-3000, and 4K RND Read and Write IOPS, scale 0-6000, comparing Estimated, RBD, MAP/AIO, MAP/LIO and QEMU/LIO)
NOTE:
‒ 64K SEQ on the RBD client → 512K RND on the Ceph OSDs
Work in Progress
What we are working on
Centralized management with GUI/CLI:
‒ Deploy MON/OSD/GW nodes
‒ Manage nodes/disks/pools/maps/LIO
‒ Monitor cluster and node status
Reaction to failures
Using librados/librbd with TCMU for the backstore
Central Management Console
• Intel Virtual Storage Manager
• Ceph Calamari
• inkScope
More integration with existing tools
Extend LRBD to accept multiple fabrics:
‒ iSCSI (native support)
‒ FC
‒ FCoE
Linux-IO:
‒ Use of librados via TCMU
Some Optimizations
I/O schedulers matter!
On OSD nodes:
‒ deadline on physical disks (cfq if the scrub thread priority is lowered with ionice)
‒ noop on RAID-backed disks
‒ read_ahead_kb=2048
On gateway nodes:
‒ noop on mapped RBDs
On client nodes:
‒ noop or deadline on the multipath device
(A minimal sketch of applying these settings at runtime follows this list.)
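A minimal sketch of applying the schedulers at runtime; sdb stands for an OSD data disk, rbd0 for a mapped image on a gateway and dm-0 for a multipath device on a client, and a udev rule or tuned profile would normally make the change persistent:
# echo deadline >/sys/block/sdb/queue/scheduler
# echo 2048 >/sys/block/sdb/queue/read_ahead_kb
# echo noop >/sys/block/rbd0/queue/scheduler
# echo noop >/sys/block/dm-0/queue/scheduler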
Reduce I/O concurrency
• Reduce OSD scrub priority:
‒ I/O scheduler cfq
‒ osd_disk_thread_ioprio_class = idle
‒ osd_disk_thread_ioprio_priority = 7
(A runtime injection example follows this list.)
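These options can be set in ceph.conf or injected into a running cluster without restarting the OSDs; a minimal sketch (note the ioprio settings only take effect when the OSD disks use the cfq scheduler):
# ceph tell 'osd.*' injectargs '--osd_disk_thread_ioprio_class idle --osd_disk_thread_ioprio_priority 7'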
Design optimizations
• SSDs on monitor nodes for LevelDB: lower CPU and memory usage and shorter recovery times
• SSD journals decrease I/O latency: roughly 3x the IOPS and better throughput
Q&A
Corporate Headquarters
Maxfeldstrasse 5
90409 Nuremberg
Germany
+49 911 740 53 0 (Worldwide)
www.suse.com
Join us on: www.opensuse.org
Bonus Tracks
Business Continuity architecture
Low-latency connected sites:
WARNING: To improve availability, a third site hosting a quorum node is highly encouraged.
Disaster Recovery architecture
High-latency or disconnected sites:
As in the OpenStack Ceph plug-in for Cinder Backup:
# rbd export-diff pool/image@end --from-snap start - | \
  ssh -C remote rbd import-diff - pool/image
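The stream above assumes both sides already share the start snapshot. A hedged sketch of the initial seed plus one incremental cycle (pool/image, the snapshot names start/end and the host remote are placeholders):
# rbd snap create pool/image@start
# rbd export pool/image@start - | ssh -C remote rbd import - pool/image
# ssh remote rbd snap create pool/image@start
# rbd snap create pool/image@end
# rbd export-diff --from-snap start pool/image@end - | ssh -C remote rbd import-diff - pool/image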
KVM Gateways
• VT-d physical passthrough of the QLogic HBAs
• RBD volumes as VirtIO devices
• Linux-IO iblock backstore
VT-d PCI passthrough 1/2
Install KVM and tools.
Boot with intel_iommu=on.
# lspci -D | grep -i QLogic | awk '{ print $1 }'
0000:24:00.0
0000:24:00.1
# readlink /sys/bus/pci/devices/0000:24:00.{0,1}/driver
../../../../bus/pci/drivers/qla2xxx
../../../../bus/pci/drivers/qla2xxx
# modprobe -r qla2xxx
VT-d PCI passthrough 2/2
# virsh nodedev-detach pci_0000_24_00_{0,1}
Device pci_0000_24_00_0 detached
Device pci_0000_24_00_1 detached
# virsh edit VM
<hostdev mode='subsystem' type='pci' managed='yes'>
  <source>
    <address domain='0x0000' bus='0x24' slot='0x0' function='0x0'/>
  </source>
</hostdev>
<hostdev mode='subsystem' type='pci' managed='yes'>
  <source>
    <address domain='0x0000' bus='0x24' slot='0x0' function='0x1'/>
  </source>
</hostdev>
# virsh start VM
KVM hot-add RBD 1/2
# ceph auth get-or-create client.libvirt mon 'allow r' osd 'allow rwx'
[client.libvirt]
        key = AQBN3S9W0Z2gKxAAnua2fIlcSVSZ/c7pqHtTwA==
# cat secret.xml
<secret ephemeral='no' private='no'>
  <usage type='ceph'>
    <name>client.libvirt secret</name>
  </usage>
</secret>
# virsh secret-define --file secret.xml
Secret 363aad3c-d13c-440d-bb27-fd58fca6aac2 created
# virsh secret-set-value --secret 363aad3c-d13c-440d-bb27-fd58fca6aac2 --base64 AQBN3S9W0Z2gKxAAnua2fIlcSVSZ/c7pqHtTwA==
KVM hot-add RBD 2/2
# cat disk.xml
<disk type='network' device='disk'>
  <source protocol='rbd' name='pool/vol'>
    <host name='mon01' port='6789'/>
    <host name='mon02' port='6789'/>
    <host name='mon03' port='6789'/>
  </source>
  <auth username='libvirt'>
    <secret type='ceph' uuid='363aad3c-d13c-440d-bb27-fd58fca6aac2'/>
  </auth>
  <target dev='vdb' bus='virtio'/>
</disk>
# virsh attach-device --persistent VM disk.xml
Device attached successfully
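A quick way to confirm the new VirtIO disk is part of the domain definition (VM is a placeholder domain name):
# virsh domblklist VM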
/usr/local/sbin/ceph-disk-classify
# Enumerate OSDs
ceph osd ls | \
while read OSD; do
    # Extract IP/HOST from Cluster Map
    IP=`ceph osd find $OSD | tr -d '"' | grep 'ip:' | awk -F: '{ print $2 }'`
    NODE=`getent hosts $IP | sed -e 's/.* //'`
    test -n "$NODE" || NODE=$IP

    # Evaluate mount point for osd.<N> (so skip Journals and not used ones)
    MOUNT=`ssh -n $NODE ceph-disk list 2>/dev/null | grep "osd\\.$OSD" | awk '{ print $1 }'`
    DEV=`echo $MOUNT | sed -e 's/[0-9]*$//' -e 's|/dev/||'`

    # Calculate Disk size and FS size
    SIZE=`ssh -n $NODE cat /sys/block/$DEV/size`
    SIZE=$[SIZE*512]
    DF=`ssh -n $NODE df $MOUNT | grep $MOUNT | awk '{ print $2 }'`

    # Weight is the size in TByte
    WEIGHT=`printf '%3.3f' $(bc -l <<<$DF/1000000000)`
    SPEED=`ssh -n $NODE sginfo -g /dev/$DEV | sed -n -e 's/^Rotational Rate\s*//p'`
    test "$SPEED" = '1' && SPEED='SSD'

    # Output
    echo $NODE $OSD $DEV `numfmt --to=si $SIZE` $WEIGHT $SPEED
done
A Light Hands-On
A Vagrant LAB for Ceph and iSCSI
• 3 all-in-one nodes (MON + OSD + iSCSI target)
• 1 admin node with Calamari and an iSCSI initiator using MPIO
• 3 disks per OSD node
• 2 replicas
• Placement groups: 3 nodes x 3 disks x 100 / 2 replicas = 450, rounded up to the next power of two → 512
Ceph Initial Configuration
Log into ceph-admin and create the initial ceph.conf:
# ceph-deploy install ceph-{admin,1,2,3}
# ceph-deploy new ceph-{1,2,3}
# cat <<-EOD >>ceph.conf
osd_pool_default_size = 2
osd_pool_default_min_size = 1
osd_pool_default_pg_num = 512
osd_pool_default_pgp_num = 512
EOD
Ceph Deploy
Log into ceph-admin and create the Ceph cluster:
# ceph-deploy mon create-initial
# ceph-deploy osd create ceph-{1,2,3}:sd{b,c,d}
# ceph-deploy admin ceph-{admin,1,2,3}
LRBD “auth”
"auth": [
  {
    "authentication": "none",
    "target": "iqn.2015-09.ceph:sn"
  }
]
LRBD “targets”
"targets": [
  {
    "hosts": [
      { "host": "ceph-1", "portal": "portal1" },
      { "host": "ceph-2", "portal": "portal2" },
      { "host": "ceph-3", "portal": "portal3" }
    ],
    "target": "iqn.2015-09.ceph:sn"
  }
]
LRBD “portals”
"portals": [
  { "name": "portal1", "addresses": [ "10.20.0.101" ] },
  { "name": "portal2", "addresses": [ "10.20.0.102" ] },
  { "name": "portal3", "addresses": [ "10.20.0.103" ] }
]
LRBD “pools”
"pools": [
  {
    "pool": "rbd",
    "gateways": [
      {
        "target": "iqn.2015-09.ceph:sn",
        "tpg": [
          {
            "image": "data",
            "initiator": "iqn.1996-04.suse:cl"
          }
        ]
      }
    ]
  }
]
Unpublished Work of SUSE LLC. All Rights Reserved.
This work is an unpublished work and contains confidential, proprietary and trade secret information of SUSE LLC. Access to this work is restricted to SUSE employees who have a need to know to perform tasks within the scope of their assignments. No part of this work may be practiced, performed, copied, distributed, revised, modified, translated, abridged, condensed, expanded, collected, or adapted without the prior written consent of SUSE. Any use or exploitation of this work without authorization could subject the perpetrator to criminal and civil liability.
General Disclaimer
This document is not to be construed as a promise by any participating company to develop, deliver, or market a product. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. SUSE makes no representations or warranties with respect to the contents of this document, and specifically disclaims any express or implied warranties of merchantability or fitness for any particular purpose. The development, release, and timing of features or functionality described for SUSE products remains at the sole discretion of SUSE. Further, SUSE reserves the right to revise this document and to make changes to its content, at any time, without obligation to notify any person or entity of such revisions or changes. All SUSE marks referenced in this presentation are trademarks or registered trademarks of Novell, Inc. in the United States and other countries. All third-party trademarks are the property of their respective owners.