TUT18972: Unleash the Power of Ceph Across the Data Center
TRANSCRIPT
Unleash the Power of Ceph Across the Data Center
TUT18972: FC/iSCSI for Ceph
Ettore Simone, Senior Architect
Alchemy Solutions Lab
Agenda
• Introduction
• The Bridge
• The Architecture
• Use Cases
• How It Works
• Some Benchmarks
• Some Optimizations
• Q&A
• Bonus Tracks
Introduction
About Ceph
“Ceph is a distributed object store and file system designed to provide excellent performance, reliability and scalability.” (http://ceph.com/)
FUT19336 - SUSE Enterprise Storage Overview and Roadmap
TUT20074 - SUSE Enterprise Storage Design and Performance
Ceph timeline
‒ 2004: Project start at UCSC
‒ 2006: Open source
‒ 2010: Mainline Linux kernel
‒ 2011: OpenStack integration
‒ Q2 2012: Launch of Inktank
‒ 2012: CloudStack integration
‒ Q3 2012: Production ready
‒ 2013: Xen integration
‒ Q1 2015: SUSE Storage 1.0
‒ Q4 2015: SUSE Enterprise Storage 2.0
Some facts
Common data center storage solutions are built mainly on top of Fibre Channel (yes, and NAS too).
Source: Wikibon Server SAN Research Project 2014
Is the storage mindset changing?
New/Cloud (SCALE-OUT):
‒ Micro-services composed applications
‒ NoSQL and distributed databases (lazy commit, replication)
‒ Object and distributed storage
Classic (SCALE-UP):
‒ Traditional application → relational DB → traditional storage
‒ Transactional process → commit on DB → commit on disk
Is the storage mindset changing? No.
New/Cloud:
‒ Micro-services composed applications
‒ NoSQL and distributed databases (lazy commit, replication)
‒ Object and distributed storage
→ The natural playground of Ceph
Classic:
‒ Traditional application → relational DB → traditional storage
‒ Transactional process → commit on DB → commit on disk
→ Where we want to introduce Ceph!
Is the new kid on the block so noisy?
Ceph is cool but I cannot rearchitect my storage!
And what about my shiny big disk arrays?
I have already N protocols, why another one?
<Add your own fear here>
Our goal
How to achieve a non-disruptive introduction of Ceph into a traditional storage infrastructure?
(Diagram: SAN = SCSI over FC; NAS = NFS/SMB/iSCSI over Ethernet; Ceph = RBD over Ethernet)
How to let Ceph coexist happily in your data center with the existing neighborhood
(traditional workloads, legacy servers, FC switches, etc.)
The Bridge
FC/iSCSI gateway
iSCSI:
‒ Out-of-the-box feature of SES 2.0
‒ TUT16512 - Ceph RBD Devices and iSCSI
Fibre Channel:
‒ That is the point we will focus on today
Back to our goal
How to achieve a non-disruptive introduction of Ceph into a traditional storage infrastructure?
(Diagram: SAN and NAS alongside RBD)
Linux-IO Target (LIO™)
LIO is the most common open-source SCSI target in modern GNU/Linux distributions:
Fabric modules: FC, FCoE, FireWire, iSCSI, iSER, SRP, loopback, vHost
Backstores: FILEIO, IBLOCK, RBD, pSCSI, RAMDISK, TCMU
(Both the LIO core and the backstores run in kernel space.)
The Architecture
Technical Reference for Entry Level
Dedicated nodes connect Ceph to Fibre Channel
Hypothesis for High Throughput
All OSD nodes connect Ceph to Fibre Channel
Our LAB Architecture
Pool and OSD geometry
Multi-root CRUSH map
Multipath I/O (MPIO)
devices {
    device {
        vendor                "(LIO-ORG|SUSE)"
        product               "*"
        path_grouping_policy  "multibus"
        path_checker          "tur"
        features              "0"
        hardware_handler      "1 alua"
        prio                  "alua"
        failback              "immediate"
        rr_weight             "uniform"
        no_path_retry         "fail"
        rr_min_io             100
    }
}
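After changing /etc/multipath.conf on an initiator, reload the maps and check that each LUN shows one path per gateway; a minimal check (it assumes multipathd is already running):
# multipathd -k'reconfigure'
# multipath -ll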
Automatically classify the OSD
Classify by NODE; OSD; DEV; SIZE; WEIGHT; SPEED
# ceph-disk-classify
osd01  0  sdb  300G  0.287  15K
osd01  1  sdc  300G  0.287  15K
osd01  2  sdd  200G  0.177  SSD
osd01  3  sde  1.0T  0.971  7.2K
osd01  4  sdf  1.0T  0.971  7.2K
osd02  5  sdb  300G  0.287  15K
osd02  6  sdd  200G  0.177  SSD
osd02  7  sde  1.0T  0.971  7.2K
osd01  8  sdf  1.0T  0.971  7.2K
osd03  9  sdb  300G  0.287  15K
…
Avoid standard CRUSH location
Default:
  osd crush location = root=default host=`hostname -s`
Using a helper script:
  osd crush location hook = /path/to/script
Or entirely manual:
  osd crush update on start = false
  …
# ceph osd crush [add|set] 39 0.971 root=root-7.2K host=osd08-7.2K
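The hook is simply an executable that prints the CRUSH location for the OSD being started. A hypothetical sketch that reuses the ceph-disk-classify output shown above (the script path, the root-<speed> naming and the classifier lookup are illustrative assumptions, not part of the original setup):

#!/bin/bash
# Hypothetical /path/to/script: Ceph calls it as
#   <script> --cluster <name> --id <osd-id> --type osd
# and expects a CRUSH location such as "root=... host=..." on stdout.
while [ $# -ge 1 ]; do
    case "$1" in
        --id) ID="$2"; shift ;;
    esac
    shift
done
# Look up the disk class (15K/7.2K/SSD) of this OSD from the classifier output.
SPEED=$(/usr/local/sbin/ceph-disk-classify 2>/dev/null | awk -v id="$ID" '$2 == id { print $6 }')
HOST=$(hostname -s)
if [ -n "$SPEED" ]; then
    echo "root=root-${SPEED} host=${HOST}-${SPEED}"
else
    echo "root=default host=${HOST}"
fi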
Use Cases
Smooth transition
Native migration of SAN LUNs to RBD volumes helps migration, conversion and coexistence:
(Diagram: traditional workloads on the SAN and new workloads in the private cloud, both served through the Ceph SAN gateway)
Storage replacement
No drama at the end of life/support of traditional storage arrays:
(Diagram: traditional and new workloads both served through the Ceph gateway)
D/R and Business Continuity
(Diagram: a Ceph gateway at Site A and a Ceph gateway at Site B)
How It Works
Ceph and Linux-IO
SCSI commands coming from the fabrics are handled by the LIO core, configured using targetcli or directly via configfs, and proxied to the target block device through the corresponding backstore module.
(Diagram: SCSI clients on the fabric → LIO on the gateway, configured through /sys/kernel/config/target from user space while the LIO core runs in kernel space → Ceph cluster)
Enable QLogic in target mode
# modprobe qla2xxx qlini_mode="disabled"
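To keep target mode across reboots, the module option can be made persistent; a minimal sketch (the file name is an assumption, and rebuilding the initrd is only needed if qla2xxx is loaded from it):
# cat <<EOF >/etc/modprobe.d/99-qla2xxx.conf
options qla2xxx qlini_mode=disabled
EOF
# dracut -f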
Identify and enable HBAs
# cat /sys/class/scsi_host/host*/device/fc_host/host*/port_name | \
  sed -e 's/../:&/g' -e 's/:0x://'
# targetcli qla2xxx/ create ${WWPN}
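If a gateway has several HBA ports, the same two steps can be wrapped in a loop; a small sketch (it assumes every local FC port should become a target, and uses the equivalent shorter /sys/class/fc_host path):
# for WWPN in $(cat /sys/class/fc_host/host*/port_name | sed -e 's/../:&/g' -e 's/:0x://'); do
      targetcli qla2xxx/ create ${WWPN}
  done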
Map RBDs and create backstores
# rbd map -p ${POOL} ${VOL}
# targetcli backstores/rbd create name="${POOL}-${VOL}" dev="${DEV}"
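The rbd map is not persistent across reboots by itself. One common approach (a sketch; it assumes the rbdmap helper and service shipped with the Ceph packages, and client.admin credentials on the gateway) is to register the image in /etc/ceph/rbdmap and enable the service:
# echo "${POOL}/${VOL} id=admin,keyring=/etc/ceph/ceph.client.admin.keyring" >>/etc/ceph/rbdmap
# systemctl enable rbdmap.service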
(Diagram: as above, with the mapped image now visible on the gateway as /dev/rbd0)
Create LUNs connected to RBDs
# targetcli qla2xxx/${WWPN}/luns create /backstores/rbd/${POOL}-${VOL}
(Diagram: as above, with the RBD backstore exported as LUN0)
“Zoning” to filter access with ACLs
# targetcli qla2xxx/${WWPN}/acls create ${INITIATOR} true
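Once LUNs and ACLs are defined, the running LIO configuration can be reviewed and persisted so it survives a reboot; a minimal sketch (on SES the lrbd tool can manage this instead):
# targetcli ls
# targetcli saveconfig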
Some Benchmarks
First of all...
This solution is NOT a drop-in replacement for SAN nor NAS (at the moment, at least!).
The main focus is to identify how to minimize the overhead from native RBD to FC/iSCSI.
Raw performance estimation on 15K disks
Physical disk → Ceph IOPS:
‒ 4K RND Read:  193 x 24 = 4,632
‒ 4K RND Write: 178 x 24 / 3 = 1,424; / 3 ≈ 475
Physical disk → Ceph throughput:
‒ 512K RND Read:  108 MB/s x 24 ≈ 2,600 MB/s
‒ 512K RND Write: 105 MB/s x 24 / 3 = 840; / 2 = 420 MB/s
NOTE:
‒ 24 OSDs and 3 replicas per pool (writes divided by 3 for replication)
‒ No SSD for journal (so roughly 1/3 of the IOPS and 1/2 of the bandwidth for writes)
Compared performance on 15K disks
(Charts: 64K SEQ Read and Write throughput in MB/s, scale 0-3000, and 4K RND Read and Write IOPS, scale 0-6000, comparing Estimated, RBD, MAP/AIO, MAP/LIO and QEMU/LIO)
NOTE:
‒ 64K SEQ on the RBD client → 512K RND on the Ceph OSDs
Work in Progress
What we are working on
Centralized management with GUI/CLI:
‒ Deploy MON/OSD/GW nodes
‒ Manage nodes/disks/pools/maps/LIO
‒ Monitor cluster and node status
Reaction to failures
Using librados/librbd with TCMU for the backstore
Central Management Console
• Intel Virtual Storage Manager
• Ceph Calamari
• inkScope
More integration with existing tools
Extend LRBD to accept multiple fabrics:
‒ iSCSI (native support)
‒ FC
‒ FCoE
Linux-IO:
‒ Use of librados via TCMU
Some Optimizations
I/O schedulers matter!
On OSD nodes:
‒ deadline on physical disks (cfq if the scrub thread priority is lowered with ionice)
‒ noop on RAID-backed disks
‒ read_ahead_kb=2048
On gateway nodes:
‒ noop on mapped RBDs
On client nodes:
‒ noop or deadline on the multipath device
(A minimal sketch of applying these settings at runtime follows this list.)
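A minimal sketch of applying the schedulers at runtime; sdb stands for an OSD data disk, rbd0 for a mapped image on a gateway and dm-0 for a multipath device on a client, and a udev rule or tuned profile would normally make the change persistent:
# echo deadline >/sys/block/sdb/queue/scheduler
# echo 2048 >/sys/block/sdb/queue/read_ahead_kb
# echo noop >/sys/block/rbd0/queue/scheduler
# echo noop >/sys/block/dm-0/queue/scheduler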
Reduce I/O concurrency
• Reduce OSD scrub priority:
‒ I/O scheduler cfq
‒ osd_disk_thread_ioprio_class = idle
‒ osd_disk_thread_ioprio_priority = 7
(A runtime injection example follows this list.)
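These options can be set in ceph.conf or injected into a running cluster without restarting the OSDs; a minimal sketch (note the ioprio settings only take effect when the OSD disks use the cfq scheduler):
# ceph tell 'osd.*' injectargs '--osd_disk_thread_ioprio_class idle --osd_disk_thread_ioprio_priority 7'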
Design optimizations
• SSDs on monitor nodes for LevelDB: lower CPU and memory usage and shorter recovery times
• SSD journals decrease I/O latency: roughly 3x the IOPS and better throughput
Q&A
Corporate Headquarters
Maxfeldstrasse 5
90409 Nuremberg
Germany
+49 911 740 53 0 (Worldwide)
www.suse.com
Join us on: www.opensuse.org
Bonus Tracks
Business Continuity architecture
Low-latency connected sites:
WARNING: To improve availability, a third site hosting a quorum node is highly encouraged.
Disaster Recovery architecture
High-latency or disconnected sites:
As in the OpenStack Ceph plug-in for Cinder Backup:
# rbd export-diff pool/image@end --from-snap start - | \
  ssh -C remote rbd import-diff - pool/image
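The stream above assumes both sides already share the start snapshot. A hedged sketch of the initial seed plus one incremental cycle (pool/image, the snapshot names start/end and the host remote are placeholders):
# rbd snap create pool/image@start
# rbd export pool/image@start - | ssh -C remote rbd import - pool/image
# ssh remote rbd snap create pool/image@start
# rbd snap create pool/image@end
# rbd export-diff --from-snap start pool/image@end - | ssh -C remote rbd import-diff - pool/image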
KVM Gateways
• VT-d physical passthrough of the QLogic HBAs
• RBD volumes as VirtIO devices
• Linux-IO iblock backstore
VT-d PCI passthrough 1/2
Install KVM and tools.
Boot with intel_iommu=on.
# lspci -D | grep -i QLogic | awk '{ print $1 }'
0000:24:00.0
0000:24:00.1
# readlink /sys/bus/pci/devices/0000:24:00.{0,1}/driver
../../../../bus/pci/drivers/qla2xxx
../../../../bus/pci/drivers/qla2xxx
# modprobe -r qla2xxx
VT-d PCI passthrough 2/2
# virsh nodedev-detach pci_0000_24_00_{0,1}
Device pci_0000_24_00_0 detached
Device pci_0000_24_00_1 detached
# virsh edit VM
<hostdev mode='subsystem' type='pci' managed='yes'>
  <source>
    <address domain='0x0000' bus='0x24' slot='0x0' function='0x0'/>
  </source>
</hostdev>
<hostdev mode='subsystem' type='pci' managed='yes'>
  <source>
    <address domain='0x0000' bus='0x24' slot='0x0' function='0x1'/>
  </source>
</hostdev>
# virsh start VM
KVM hot-add RBD 1/2
# ceph auth get-or-create client.libvirt mon 'allow r' osd 'allow rwx'
[client.libvirt]
        key = AQBN3S9W0Z2gKxAAnua2fIlcSVSZ/c7pqHtTwA==
# cat secret.xml
<secret ephemeral='no' private='no'>
  <usage type='ceph'>
    <name>client.libvirt secret</name>
  </usage>
</secret>
# virsh secret-define --file secret.xml
Secret 363aad3c-d13c-440d-bb27-fd58fca6aac2 created
# virsh secret-set-value --secret 363aad3c-d13c-440d-bb27-fd58fca6aac2 --base64 AQBN3S9W0Z2gKxAAnua2fIlcSVSZ/c7pqHtTwA==
KVM hot-add RBD 2/2
# cat disk.xml
<disk type='network' device='disk'>
  <source protocol='rbd' name='pool/vol'>
    <host name='mon01' port='6789'/>
    <host name='mon02' port='6789'/>
    <host name='mon03' port='6789'/>
  </source>
  <auth username='libvirt'>
    <secret type='ceph' uuid='363aad3c-d13c-440d-bb27-fd58fca6aac2'/>
  </auth>
  <target dev='vdb' bus='virtio'/>
</disk>
# virsh attach-device --persistent VM disk.xml
Device attached successfully
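A quick way to confirm the new VirtIO disk is part of the domain definition (VM is a placeholder domain name):
# virsh domblklist VM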
/usr/local/sbin/ceph-disk-classify
# Enumerate OSDs
ceph osd ls | \
while read OSD; do
    # Extract IP/HOST from Cluster Map
    IP=`ceph osd find $OSD | tr -d '"' | grep 'ip:' | awk -F: '{ print $2 }'`
    NODE=`getent hosts $IP | sed -e 's/.* //'`
    test -n "$NODE" || NODE=$IP

    # Evaluate mount point for osd.<N> (so skip Journals and not used ones)
    MOUNT=`ssh -n $NODE ceph-disk list 2>/dev/null | grep "osd\\.$OSD" | awk '{ print $1 }'`
    DEV=`echo $MOUNT | sed -e 's/[0-9]*$//' -e 's|/dev/||'`

    # Calculate Disk size and FS size
    SIZE=`ssh -n $NODE cat /sys/block/$DEV/size`
    SIZE=$[SIZE*512]
    DF=`ssh -n $NODE df $MOUNT | grep $MOUNT | awk '{ print $2 }'`

    # Weight is the size in TByte
    WEIGHT=`printf '%3.3f' $(bc -l <<<$DF/1000000000)`
    SPEED=`ssh -n $NODE sginfo -g /dev/$DEV | sed -n -e 's/^Rotational Rate\s*//p'`
    test "$SPEED" = '1' && SPEED='SSD'

    # Output
    echo $NODE $OSD $DEV `numfmt --to=si $SIZE` $WEIGHT $SPEED
done
A Light Hands-On
A Vagrant LAB for Ceph and iSCSI
• 3 all-in-one nodes (MON + OSD + iSCSI target)
• 1 admin node with Calamari and an iSCSI initiator using MPIO
• 3 disks per OSD node
• 2 replicas
• Placement groups: 3 nodes x 3 disks x 100 / 2 replicas = 450, rounded up to the next power of two → 512
Ceph Initial Configuration
Log into ceph-admin and create the initial ceph.conf:
# ceph-deploy install ceph-{admin,1,2,3}
# ceph-deploy new ceph-{1,2,3}
# cat <<-EOD >>ceph.conf
osd_pool_default_size = 2
osd_pool_default_min_size = 1
osd_pool_default_pg_num = 512
osd_pool_default_pgp_num = 512
EOD
Ceph Deploy
Log into ceph-admin and create the Ceph cluster:
# ceph-deploy mon create-initial
# ceph-deploy osd create ceph-{1,2,3}:sd{b,c,d}
# ceph-deploy admin ceph-{admin,1,2,3}
LRBD “auth”
"auth": [
  {
    "authentication": "none",
    "target": "iqn.2015-09.ceph:sn"
  }
]
LRBD “targets”
"targets": [
  {
    "hosts": [
      { "host": "ceph-1", "portal": "portal1" },
      { "host": "ceph-2", "portal": "portal2" },
      { "host": "ceph-3", "portal": "portal3" }
    ],
    "target": "iqn.2015-09.ceph:sn"
  }
]
LRBD “portals”
"portals": [
  { "name": "portal1", "addresses": [ "10.20.0.101" ] },
  { "name": "portal2", "addresses": [ "10.20.0.102" ] },
  { "name": "portal3", "addresses": [ "10.20.0.103" ] }
]
LRBD “pools”
"pools": [
  {
    "pool": "rbd",
    "gateways": [
      {
        "target": "iqn.2015-09.ceph:sn",
        "tpg": [
          {
            "image": "data",
            "initiator": "iqn.1996-04.suse:cl"
          }
        ]
      }
    ]
  }
]
Unpublished Work of SUSE LLC. All Rights Reserved.
This work is an unpublished work and contains confidential, proprietary and trade secret information of SUSE LLC. Access to this work is restricted to SUSE employees who have a need to know to perform tasks within the scope of their assignments. No part of this work may be practiced, performed, copied, distributed, revised, modified, translated, abridged, condensed, expanded, collected, or adapted without the prior written consent of SUSE. Any use or exploitation of this work without authorization could subject the perpetrator to criminal and civil liability.
General Disclaimer
This document is not to be construed as a promise by any participating company to develop, deliver, or market a product. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. SUSE makes no representations or warranties with respect to the contents of this document, and specifically disclaims any express or implied warranties of merchantability or fitness for any particular purpose. The development, release, and timing of features or functionality described for SUSE products remains at the sole discretion of SUSE. Further, SUSE reserves the right to revise this document and to make changes to its content, at any time, without obligation to notify any person or entity of such revisions or changes. All SUSE marks referenced in this presentation are trademarks or registered trademarks of Novell, Inc. in the United States and other countries. All third-party trademarks are the property of their respective owners.