TUT18972: Unleash the Power of Ceph Across the Data Center
FC/iSCSI for Ceph

Ettore Simone, Senior Architect
Alchemy Solutions Lab
[email protected]


TRANSCRIPT

Page 1: TUT18972: Unleash the power of Ceph across the Data Center

Unleash the Power of Ceph Across the Data Center
TUT18972: FC/iSCSI for Ceph

Ettore Simone, Senior Architect

Alchemy Solutions Lab

[email protected]

Page 2: TUT18972: Unleash the power of Ceph across the Data Center

2

Agenda

• Introduction

• The Bridge

• The Architecture

• Use Cases

• How It Works

• Some Benchmarks

• Some Optimizations

• Q&A

• Bonus Tracks

Page 3: TUT18972: Unleash the power of Ceph across the Data Center

Introduction

Page 4: TUT18972: Unleash the power of Ceph across the Data Center

4

About Ceph

“Ceph is a distributed object store and file system designed to provide excellent performance, reliability and scalability.” (http://ceph.com/)

FUT19336 - SUSE Enterprise Storage Overview and Roadmap

TUT20074 - SUSE Enterprise Storage Design and Performance

Page 5: TUT18972: Unleash the power of Ceph across the Data Center

5

Ceph timeline

• 2004: Project start at UCSC
• 2006: Open source
• 2010: Mainline Linux kernel
• 2011: OpenStack integration
• Q2 2012: Launch of Inktank
• Q3 2012: Production ready
• 2012: CloudStack integration
• 2013: Xen integration
• Q1 2015: SUSE Storage 1.0
• Q4 2015: SUSE Enterprise Storage 2.0

Page 6: TUT18972: Unleash the power of Ceph across the Data Center

6

Some facts

Common data center storage solutions are built mainly on top of Fibre Channel (yes, and NAS too).

Source: Wikibon Server SAN Research Project 2014

Page 7: TUT18972: Unleash the power of Ceph across the Data Center

7

Is the storage mindset changing?

New/Cloud (SCALE-OUT):
‒ Micro-service composed applications
‒ NoSQL and distributed databases (lazy commit, replication)
‒ Object and distributed storage

Classic (SCALE-UP):
‒ Traditional application → relational DB → traditional storage
‒ Transactional process → commit on DB → commit on disk

Page 8: TUT18972: Unleash the power of Ceph across the Data Center

8

Is the storage mindset changing? No.

New/Cloud (the natural playground of Ceph):
‒ Micro-service composed applications
‒ NoSQL and distributed databases (lazy commit, replication)
‒ Object and distributed storage

Classic (where we want to introduce Ceph!):
‒ Traditional application → relational DB → traditional storage
‒ Transactional process → commit on DB → commit on disk

Page 9: TUT18972: Unleash the power of Ceph across the Data Center

9

Is the new kid on the block so noisy?

Ceph is cool but I cannot rearchitect my storage!

And what about my shiny big disk arrays?

I have already N protocols, why another one?

<Add your own fear here>

Page 10: TUT18972: Unleash the power of Ceph across the Data Center

10

Our goal

How to achieve a non-disruptive introduction of Ceph into a traditional storage infrastructure?

[Diagram: SAN (SCSI over FC), NAS (NFS/SMB/iSCSI over Ethernet), Ceph (RBD over Ethernet)]

Page 11: TUT18972: Unleash the power of Ceph across the Data Center

11

How to let Ceph coexist happily in your data center with the existing neighborhood

(traditional workloads, legacy servers, FC switches, etc.)

Page 12: TUT18972: Unleash the power of Ceph across the Data Center

The Bridge

Page 13: TUT18972: Unleash the power of Ceph across the Data Center

13

FC/iSCSI gateway

iSCSI:
‒ Out-of-the-box feature of SES 2.0
‒ TUT16512 - Ceph RBD Devices and iSCSI

Fibre Channel:
‒ That is the point we will focus on today

Page 14: TUT18972: Unleash the power of Ceph across the Data Center

14

Back to our goal

How to achieve a non-disruptive introduction of Ceph into a traditional storage infrastructure?

[Diagram: SAN, NAS and RBD side by side]

Page 15: TUT18972: Unleash the power of Ceph across the Data Center

15

Linux-IO Target (LIO™)

LIO is the most common open-source SCSI target in modern GNU/Linux distros:

Fabrics: FC, FCoE, FireWire, iSCSI, iSER, SRP, loopback, vHost
Backstores: FILEIO, IBLOCK, RBD, pSCSI, RAMDISK, TCMU

(The LIO core runs in kernel space)

Page 16: TUT18972: Unleash the power of Ceph across the Data Center

The Architecture

Page 17: TUT18972: Unleash the power of Ceph across the Data Center

17

Technical Reference for Entry Level

Dedicated gateway nodes connect Ceph to Fibre Channel

Page 18: TUT18972: Unleash the power of Ceph across the Data Center

18

Hypothesis for High Throughput

All OSD nodes connect Ceph to Fibre Channel

Page 19: TUT18972: Unleash the power of Ceph across the Data Center

19

Our LAB Architecture

Page 20: TUT18972: Unleash the power of Ceph across the Data Center

20

Pool and OSD geometry


Page 21: TUT18972: Unleash the power of Ceph across the Data Center

21

Multi-root CRUSH map

Page 22: TUT18972: Unleash the power of Ceph across the Data Center

22

Multipath I/O (MPIO)

devices {
    device {
        vendor                "(LIO-ORG|SUSE)"
        product               "*"
        path_grouping_policy  "multibus"
        path_checker          "tur"
        features              "0"
        hardware_handler      "1 alua"
        prio                  "alua"
        failback              "immediate"
        rr_weight             "uniform"
        no_path_retry         "fail"
        rr_min_io             100
    }
}
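Once this device section is in /etc/multipath.conf on the initiator, the maps can be reloaded and checked with the standard multipath tools (a generic sketch, not part of the original slide):

# multipath -r     # reload the multipath maps with the new device defaults
# multipath -ll    # verify the LIO-ORG/SUSE LUNs and their path groups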

Page 23: TUT18972: Unleash the power of Ceph across the Data Center

23

Automatically classify the OSD

Classify by NODE, OSD, DEV, SIZE, WEIGHT, SPEED:

# ceph-disk-classify
osd01 0 sdb 300G 0.287 15K
osd01 1 sdc 300G 0.287 15K
osd01 2 sdd 200G 0.177 SSD
osd01 3 sde 1.0T 0.971 7.2K
osd01 4 sdf 1.0T 0.971 7.2K
osd02 5 sdb 300G 0.287 15K
osd02 6 sdd 200G 0.177 SSD
osd02 7 sde 1.0T 0.971 7.2K
osd01 8 sdf 1.0T 0.971 7.2K
osd03 9 sdb 300G 0.287 15K
…

Page 24: TUT18972: Unleash the power of Ceph across the Data Center

24

Avoid standard CRUSH location

Default:
osd crush location = root=default host=`hostname -s`

Using a helper script:
osd crush location hook = /path/to/script

Or entirely manual:
osd crush update on start = false
…
# ceph osd crush [add|set] 39 0.971 root=root-7.2K host=osd08-7.2K
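A minimal sketch of what such a location hook could look like, assuming Ceph invokes the script with --cluster/--id/--type arguments and uses its stdout as the CRUSH location; the lookup via ceph-disk-classify (see slide 62) and the root-<speed> naming are illustrative only:

#!/bin/bash
# Hypothetical /path/to/script for "osd crush location hook".
# Assumption: called as <script> --cluster <name> --id <osd-id> --type osd,
# expected to print the location ("root=... host=...") on stdout.
while [ $# -gt 0 ]; do
    case "$1" in
        --id) ID="$2"; shift 2 ;;
        *)    shift ;;
    esac
done
HOST=`hostname -s`
# Pick the disk class (15K/7.2K/SSD) of this OSD from the classifier output
SPEED=`/usr/local/sbin/ceph-disk-classify 2>/dev/null | awk -v id="$ID" '$2 == id { print $6 }'`
echo "root=root-${SPEED:-default} host=${HOST}-${SPEED:-default}"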

Page 25: TUT18972: Unleash the power of Ceph across the Data Center

Use Cases

Page 26: TUT18972: Unleash the power of Ceph across the Data Center

26

Smooth transition

Native migration of SAN LUNs to RBD volumes helps with migration, conversion and coexistence:

[Diagram: traditional workloads and the private cloud, both reaching Ceph through the SAN gateway]


Page 30: TUT18972: Unleash the power of Ceph across the Data Center

30

Storage replacement

No drama at the end of life/support of traditional storage arrays:

[Diagram: traditional and new workloads served directly through the Ceph gateway]

Page 31: TUT18972: Unleash the power of Ceph across the Data Center

31

D/R and Business Continuity

[Diagram: Ceph clusters with gateways at Site A and Site B]

Page 32: TUT18972: Unleash the power of Ceph across the Data Center

How It Works

Page 33: TUT18972: Unleash the power of Ceph across the Data Center

33

Ceph and Linux-IO

SCSI commands coming from the fabrics are handled by the LIO core, configured using targetcli or directly via configfs (/sys/kernel/config/target), and proxied to the corresponding block device through the relevant backstore module.

[Diagram: SCSI clients on the fabric ↔ LIO target configured through /sys/kernel/config/target (kernel space; targetcli in user space) ↔ Ceph cluster]

Page 34: TUT18972: Unleash the power of Ceph across the Data Center

34

Enable QLogic HBAs in target mode

# modprobe qla2xxx qlini_mode="disabled"
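To make target mode survive reboots, the same option can be pinned in a modprobe configuration file; a sketch only (the file name is arbitrary, and the initrd has to be rebuilt if the driver is loaded at early boot):

# cat >/etc/modprobe.d/99-qla2xxx.conf <<EOF
options qla2xxx qlini_mode=disabled
EOF
# dracut -f    # or mkinitrd, so the option is also honored from the initrd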


Page 35: TUT18972: Unleash the power of Ceph across the Data Center

35

Identify and enable HBAs

# cat /sys/class/scsi_host/host*/device/fc_host/host*/port_name | \
    sed -e 's/../:&/g' -e 's/:0x://'

# targetcli qla2xxx/ create ${WWPN}
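Putting the two commands together, a small loop can create an FC target for every local HBA port; a sketch using the same WWPN formatting as above:

# for WWPN in $(cat /sys/class/scsi_host/host*/device/fc_host/host*/port_name | \
                sed -e 's/../:&/g' -e 's/:0x://'); do
      targetcli qla2xxx/ create ${WWPN}
  done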


Page 36: TUT18972: Unleash the power of Ceph across the Data Center

36

Map RBDs and create backstores

# rbd map -p ${POOL} ${VOL}

# targetcli backstores/rbd create name="${POOL}-${VOL}" dev="${DEV}"
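The mapping does not persist across reboots by itself; one common approach (an assumption here, not taken from the slides) is to list the image in /etc/ceph/rbdmap and enable the rbdmap service:

# cat >>/etc/ceph/rbdmap <<EOF
${POOL}/${VOL}    id=admin,keyring=/etc/ceph/ceph.client.admin.keyring
EOF
# systemctl enable rbdmap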


Page 37: TUT18972: Unleash the power of Ceph across the Data Center

37

Create LUNs connected to RBDs

# targetcli qla2xxx/${WWPN}/luns create /backstores/rbd/${POOL}-${VOL}


Page 38: TUT18972: Unleash the power of Ceph across the Data Center

38

“Zoning” to filter access with ACLs

# targetcli qla2xxx/${WWPN}/acls create ${INITIATOR} true
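Chaining slides 34 to 38 together, exporting one RBD image over FC comes down to a handful of commands; a condensed sketch with hypothetical values for POOL, VOL, WWPN and INITIATOR:

# POOL=rbd VOL=data
# WWPN=21:00:00:24:ff:12:34:56         # local HBA port (hypothetical)
# INITIATOR=21:00:00:24:ff:65:43:21    # client HBA port (hypothetical)
# DEV=$(rbd map -p ${POOL} ${VOL})                                    # map the image, e.g. /dev/rbd0
# targetcli backstores/rbd create name="${POOL}-${VOL}" dev="${DEV}"  # expose it as a backstore
# targetcli qla2xxx/ create ${WWPN}                                   # FC target on the HBA port
# targetcli qla2xxx/${WWPN}/luns create /backstores/rbd/${POOL}-${VOL}
# targetcli qla2xxx/${WWPN}/acls create ${INITIATOR} true             # "zoning" via ACL
# targetcli saveconfig                                                # persist the LIO configuration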


Page 39: TUT18972: Unleash the power of Ceph across the Data Center

Some Benchmarks

Page 40: TUT18972: Unleash the power of Ceph across the Data Center

40

First of all...

This solution is NOT a drop-in replacement for a SAN or NAS (at the moment, at least!).

The main focus is to identify how to minimize the overhead from native RBD to FC/iSCSI.

Page 41: TUT18972: Unleash the power of Ceph across the Data Center

41

Raw performance/estimation on 15K

Physical disk IOPS → estimated Ceph IOPS:
‒ 4K RND read: 193 x 24 = 4,632
‒ 4K RND write: 178 x 24 / 3 = 1,424; / 3 (journal) = 475

Physical disk throughput → estimated Ceph throughput:
‒ 512K RND read: 108 MB/s x 24 = ~2,600 MB/s
‒ 512K RND write: 105 MB/s x 24 / 3 = 840; / 2 (journal) = 420 MB/s

NOTE:
‒ 24 OSDs and 3 replicas per pool
‒ No SSD for the journal (so ~1/3 of the IOPS and ~1/2 of the bandwidth for writes)
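The figures above are plain arithmetic; a quick shell transcription of the slide's reasoning (24 OSDs, 3 replicas, no SSD journal, so roughly 1/3 of the IOPS and 1/2 of the bandwidth survive on writes):

# echo "4K RND read IOPS:    $((193 * 24))"           # 4632
# echo "4K RND write IOPS:   $((178 * 24 / 3 / 3))"   # 474, ~475 (replication, then journal)
# echo "512K read MB/s:      $((108 * 24))"           # 2592, ~2600
# echo "512K write MB/s:     $((105 * 24 / 3 / 2))"   # 420 (replication, then journal)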

Page 42: TUT18972: Unleash the power of Ceph across the Data Center

43

Compared performance on 15K

[Charts: 64K SEQ read/write throughput in MB/s (0-3000) and 4K RND read/write IOPS (0-6000), comparing Estimated, RBD, MAP/AIO, MAP/LIO and QEMU/LIO]

NOTE:
‒ 64K SEQ on the RBD client → 512K RND on the Ceph OSDs

Page 43: TUT18972: Unleash the power of Ceph across the Data Center

Work in Progress

Page 44: TUT18972: Unleash the power of Ceph across the Data Center

46

What we are working on

Centralized management with GUI/CLI:
‒ Deploy MON/OSD/GW nodes
‒ Manage nodes/disks/pools/maps/LIO
‒ Monitor cluster and node status

Reaction to failures

Using librados/librbd with TCMU for the backstore

Page 45: TUT18972: Unleash the power of Ceph across the Data Center

47

Central Management Console

• Intel Virtual Storage Manager

• Ceph Calamari

• inkScope

Page 46: TUT18972: Unleash the power of Ceph across the Data Center

48

More integration with existing tools

Extend lrbd to accept multiple fabrics:
‒ iSCSI (native support)
‒ FC
‒ FCoE

Linux-IO:
‒ Use of librados via TCMU

Page 47: TUT18972: Unleash the power of Ceph across the Data Center

Some Optimizations

Page 48: TUT18972: Unleash the power of Ceph across the Data Center

50

The I/O scheduler matters!

On OSD nodes:
‒ deadline on physical disks (cfq if the scrub thread is ionice'd)
‒ noop on RAID disks
‒ read_ahead_kb=2048

On gateway nodes:
‒ noop on mapped RBDs

On client nodes:
‒ noop or deadline on multipath devices
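These schedulers can be applied at runtime through sysfs (and persisted with a udev rule or boot script); a generic sketch, with sdX, rbdX and dm-X as placeholders:

# echo deadline > /sys/block/sdX/queue/scheduler       # physical disk on an OSD node
# echo 2048     > /sys/block/sdX/queue/read_ahead_kb
# echo noop     > /sys/block/rbdX/queue/scheduler      # mapped RBD on a gateway node
# echo noop     > /sys/block/dm-X/queue/scheduler      # multipath device on a client node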

Page 49: TUT18972: Unleash the power of Ceph across the Data Center

51

Reduce I/O concurrency

• Reduce OSD scrub priority:
‒ I/O scheduler: cfq
‒ osd_disk_thread_ioprio_class = idle
‒ osd_disk_thread_ioprio_priority = 7
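Besides ceph.conf, these options can be pushed to the running OSDs with injectargs; a sketch (the ioprio settings only take effect when the disks use the cfq scheduler):

# ceph tell osd.* injectargs '--osd_disk_thread_ioprio_class idle --osd_disk_thread_ioprio_priority 7'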

Page 50: TUT18972: Unleash the power of Ceph across the Data Center

52

Design optimizations

• SSD on monitor nodes for LevelDB: decreases CPU and memory usage and recovery time

• SSD journals decrease I/O latency: 3x IOPS and better throughput

Page 51: TUT18972: Unleash the power of Ceph across the Data Center

Q&A

Page 52: TUT18972: Unleash the power of Ceph across the Data Center

54

[email protected]

Thank you.

Page 53: TUT18972: Unleash the power of Ceph across the Data Center

Corporate Headquarters
Maxfeldstrasse 5
90409 Nuremberg
Germany

+49 911 740 53 0 (Worldwide)
www.suse.com

Join us on: www.opensuse.org

55

Page 54: TUT18972: Unleash the power of Ceph across the Data Center

Bonus Tracks

Page 55: TUT18972: Unleash the power of Ceph across the Data Center

57

Business Continuity architecture

Low-latency connected sites:

WARNING: To improve availability, a third site hosting a quorum node is highly encouraged.

Page 56: TUT18972: Unleash the power of Ceph across the Data Center

58

Disaster Recovery architecture

High latency or disconnected sites:

As in the OpenStack Ceph plug-in for Cinder Backup:

# rbd export-diff --from-snap start pool/image@end - | \
    ssh -C remote rbd import-diff - pool/image
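A complete incremental cycle also has to manage the snapshots; a sketch of one round, assuming the 'start' snapshot already exists on both sides from the previous run:

# rbd snap create pool/image@end                 # freeze the current state
# rbd export-diff --from-snap start pool/image@end - | \
    ssh -C remote rbd import-diff - pool/image   # ship only the delta; 'end' is recreated remotely
# rbd snap rm pool/image@start                   # the next round uses 'end' as its --from-snap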

Page 57: TUT18972: Unleash the power of Ceph across the Data Center

59

KVM Gateways

• VT-d physical passthrough of the QLogic HBAs

• RBD Volumes as VirtIO devices

• Linux-IO iblock backstore

Page 58: TUT18972: Unleash the power of Ceph across the Data Center

60

VT-d PCI passthrough 1/2

Install KVM and tools

Boot with intel_iommu=on

# lspci -D | grep -i QLogic | awk '{ print $1 }'
0000:24:00.0
0000:24:00.1

# readlink /sys/bus/pci/devices/0000:24:00.{0,1}/driver
../../../../bus/pci/drivers/qla2xxx
../../../../bus/pci/drivers/qla2xxx

# modprobe -r qla2xxx

Page 59: TUT18972: Unleash the power of Ceph across the Data Center

61

VT-d PCI passthrough 2/2

# virsh nodedev-detach pci_0000_24_00_{0,1}
Device pci_0000_24_00_0 detached
Device pci_0000_24_00_1 detached

# virsh edit VM
<hostdev mode='subsystem' type='pci' managed='yes'>
  <source>
    <address domain='0x0000' bus='0x24' slot='0x0' function='0x0'/>
  </source>
</hostdev>
<hostdev mode='subsystem' type='pci' managed='yes'>
  <source>
    <address domain='0x0000' bus='0x24' slot='0x0' function='0x1'/>
  </source>
</hostdev>

# virsh start VM

Page 60: TUT18972: Unleash the power of Ceph across the Data Center

62

KVM hot-add RBD 1/2

# ceph auth get-or-create client.libvirt mon 'allow r' osd 'allow rwx'
[client.libvirt]
        key = AQBN3S9W0Z2gKxAAnua2fIlcSVSZ/c7pqHtTwA==

# cat secret.xml
<secret ephemeral='no' private='no'>
  <usage type='ceph'>
    <name>client.libvirt secret</name>
  </usage>
</secret>

# virsh secret-define --file secret.xml
Secret 363aad3c-d13c-440d-bb27-fd58fca6aac2 created

# virsh secret-set-value --secret 363aad3c-d13c-440d-bb27-fd58fca6aac2 \
    --base64 AQBN3S9W0Z2gKxAAnua2fIlcSVSZ/c7pqHtTwA==

Page 61: TUT18972: Unleash the power of Ceph across the Data Center

63

KVM hot-add RBD 2/2

# cat disk.xml
<disk type='network' device='disk'>
  <source protocol='rbd' name='pool/vol'>
    <host name='mon01' port='6789'/>
    <host name='mon02' port='6789'/>
    <host name='mon03' port='6789'/>
  </source>
  <auth username='libvirt'>
    <secret type='ceph' uuid='363aad3c-d13c-440d-bb27-fd58fca6aac2'/>
  </auth>
  <target dev='vdb' bus='virtio'/>
</disk>

# virsh attach-device --persistent VM disk.xml
Device attached successfully

Page 62: TUT18972: Unleash the power of Ceph across the Data Center

64

/usr/local/sbin/ceph-disk-classify

# Enumerate OSDs
ceph osd ls | \
while read OSD; do
    # Extract IP/HOST from the cluster map
    IP=`ceph osd find $OSD | tr -d '"' | grep 'ip:' | awk -F: '{ print $2 }'`
    NODE=`getent hosts $IP | sed -e 's/.* //'`
    test -n "$NODE" || NODE=$IP

    # Evaluate the mount point of osd.<N> (so skip journals and unused disks)
    MOUNT=`ssh -n $NODE ceph-disk list 2>/dev/null | grep "osd\\.$OSD" | awk '{ print $1 }'`
    DEV=`echo $MOUNT | sed -e 's/[0-9]*$//' -e 's|/dev/||'`

    # Calculate disk size and FS size
    SIZE=`ssh -n $NODE cat /sys/block/$DEV/size`
    SIZE=$[SIZE*512]
    DF=`ssh -n $NODE df $MOUNT | grep $MOUNT | awk '{ print $2 }'`

    # Weight is the size in TByte
    WEIGHT=`printf '%3.3f' $(bc -l <<<$DF/1000000000)`
    SPEED=`ssh -n $NODE sginfo -g /dev/$DEV | sed -n -e 's/^Rotational Rate\s*//p'`
    test "$SPEED" = '1' && SPEED='SSD'

    # Output
    echo $NODE $OSD $DEV `numfmt --to=si $SIZE` $WEIGHT $SPEED
done

Page 63: TUT18972: Unleash the power of Ceph across the Data Center

A Light Hands-On

Page 64: TUT18972: Unleash the power of Ceph across the Data Center

66

A Vagrant LAB for Ceph and iSCSI

• 3 all-in-one nodes (MON+OSD+iSCSI Target)

• 1 admin node running Calamari and acting as iSCSI initiator with MPIO

• 3 disks per OSD node

• 2 replicas

• Placement Groups: 3*3*100/2 = 450 → 512
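The placement group figure in the last bullet is the usual rule of thumb, (number of OSDs x 100) / replica count, rounded up to the next power of two; spelled out for this LAB:

# OSDS=$((3 * 3))               # 3 OSD nodes x 3 disks each
# echo $(( OSDS * 100 / 2 ))    # 450 → next power of two: 512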

Page 65: TUT18972: Unleash the power of Ceph across the Data Center

67

Ceph Initial Configuration

Log into ceph-admin and create the initial ceph.conf:

# ceph-deploy install ceph-{admin,1,2,3}
# ceph-deploy new ceph-{1,2,3}
# cat <<-EOD >>ceph.conf
osd_pool_default_size = 2
osd_pool_default_min_size = 1
osd_pool_default_pg_num = 512
osd_pool_default_pgp_num = 512
EOD

Page 66: TUT18972: Unleash the power of Ceph across the Data Center

68

Ceph Deploy

Log into ceph-admin and create the Ceph cluster:

# ceph-deploy mon create-initial
# ceph-deploy osd create ceph-{1,2,3}:sd{b,c,d}
# ceph-deploy admin ceph-{admin,1,2,3}

Page 67: TUT18972: Unleash the power of Ceph across the Data Center

69

LRBD “auth”

"auth": [ { "authentication": "none", "target": "iqn.2015-09.ceph:sn" }]

Page 68: TUT18972: Unleash the power of Ceph across the Data Center

70

LRBD “targets”

"targets": [ { "hosts": [ { "host": "ceph-1", "portal": "portal1" }, { "host": "ceph-2", "portal": "portal2" }, { "host": "ceph-3", "portal": "portal3" } ], "target": "iqn.2015-09.ceph:sn" }]

Page 69: TUT18972: Unleash the power of Ceph across the Data Center

71

LRBD “portals”

"portals": [ { "name": "portal1", "addresses": [ "10.20.0.101" ] }, { "name": "portal2", "addresses": [ "10.20.0.102" ] }, { "name": "portal3", "addresses": [ "10.20.0.103" ] }]

Page 70: TUT18972: Unleash the power of Ceph across the Data Center

72

LRBD “pools”

"pools": [ { "pool": "rbd", "gateways": [ { "target": "iqn.2015-09.ceph:sn", "tpg": [ { "image": "data", "initiator": "iqn.1996-04.suse:cl" } ] } ] }]

Page 71: TUT18972: Unleash the power of Ceph across the Data Center

Unpublished Work of SUSE LLC. All Rights Reserved.
This work is an unpublished work and contains confidential, proprietary and trade secret information of SUSE LLC. Access to this work is restricted to SUSE employees who have a need to know to perform tasks within the scope of their assignments. No part of this work may be practiced, performed, copied, distributed, revised, modified, translated, abridged, condensed, expanded, collected, or adapted without the prior written consent of SUSE. Any use or exploitation of this work without authorization could subject the perpetrator to criminal and civil liability.

General Disclaimer
This document is not to be construed as a promise by any participating company to develop, deliver, or market a product. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. SUSE makes no representations or warranties with respect to the contents of this document, and specifically disclaims any express or implied warranties of merchantability or fitness for any particular purpose. The development, release, and timing of features or functionality described for SUSE products remains at the sole discretion of SUSE. Further, SUSE reserves the right to revise this document and to make changes to its content, at any time, without obligation to notify any person or entity of such revisions or changes. All SUSE marks referenced in this presentation are trademarks or registered trademarks of Novell, Inc. in the United States and other countries. All third-party trademarks are the property of their respective owners.