Ceph Day London 2014 - Deploying Ceph in the Wild

DESCRIPTION

Wido den Hollander, 42on.com

TRANSCRIPT
Deploying Ceph in the wild
Who am I?
● Wido den Hollander (1986)
● Co-owner and CTO of PCextreme B.V., a Dutch hosting company
● Ceph trainer and consultant at 42on B.V.
● Part of the Ceph community since late 2009
  – Wrote the Apache CloudStack integration
  – libvirt RBD storage pool support
  – PHP and Java bindings for librados
What is 42on?
● Consultancy company focused on Ceph and its ecosystem
● Founded in 2012
● Based in the Netherlands
● I'm the only employee
  – My consultancy company
Deploying Ceph
● As a consultant I see many different organizations
  – From small companies to large governments
  – I see Ceph being used in all kinds of deployments
● It starts with gathering information about the use case
  – Deployment application: RBD? Objects?
  – Storage requirements: TBs or PBs?
  – I/O requirements
I/O is EXPENSIVE
● Everybody talks about storage capacity; almost nobody talks about IOps
● Think about IOps first and then about terabytes

| Storage type | € per I/O | Remark |
|---|---|---|
| HDD | €1.60 | Seagate 3TB drive for €150 with 90 IOps |
| SSD | €0.01 | Intel S3500 480GB with 25k IOps for €410 |
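The € per I/O figures can be reproduced with a quick calculation. A minimal sketch; the prices and IOps numbers are the slide's 2014 examples, and the slide rounds the results:

```python
# Cost per I/O for the slide's example drives (2014 prices).
drives = {
    "HDD (Seagate 3TB)": (150.0, 90),            # price in EUR, sustained IOps
    "SSD (Intel S3500 480GB)": (410.0, 25_000),
}

for name, (price, iops) in drives.items():
    # The slide rounds these to roughly EUR 1.60 and EUR 0.01.
    print(f"{name}: EUR {price / iops:.4f} per I/O")
```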
Design for I/O
● Use more, but smaller disks
  – More spindles means more I/O
  – Consumer drives are an option, and cheaper
● Maybe deploy SSD-only
  – Intel S3500 or S3700 SSDs are reliable and fast
● You really want I/O headroom during recovery operations
  – OSDs replay PGLogs and scan directories
  – Recovery operations require a lot of I/O
Deployments
● I've done numerous Ceph deployments
  – From tiny to large
● I want to showcase two of these deployments
  – Use cases
  – Design principles
Ceph with CloudStack
● Location: Belgium
● Organization: Government
● Use case:
  – RBD for CloudStack
  – S3-compatible storage
● Requirements:
  – Storage for ~1000 virtual machines
    ● Including PostgreSQL databases
  – TBs of S3 storage
    ● Actual amount of data is unknown to me
Ceph with CloudStack
● Cluster:
  – 16 nodes with 24 drives
    ● 19 1TB 7200RPM 2.5" HDDs
    ● 2 Intel S3700 200GB SSDs for journaling
    ● 2 Intel S3700 480GB SSDs for SSD-only storage
    ● 64GB of memory
    ● Xeon E5-2609 2.5GHz CPU
  – 3x replication and an 80% fill level provides:
    ● 81TB HDD storage
    ● 8TB SSD storage
  – 3 small nodes as monitors
    ● SSD for the operating system and monitor data
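The 81TB figure follows from the raw HDD capacity, the replication factor, and the 80% fill level. A minimal sketch (the SSD pool's usable figure depends on details not on the slide):

```python
def usable_capacity(raw_tb, replication=3, fill_ratio=0.8):
    """Usable capacity after replication and a safe fill level."""
    return raw_tb / replication * fill_ratio

# 16 nodes x 19 x 1TB HDDs = 304TB raw
print(round(usable_capacity(16 * 19 * 1.0), 1))  # 81.1, matching the slide's 81TB
```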
Ceph with CloudStack
● If we detect that an OSD is running on an SSD, it goes into a different 'host' in the CRUSH map
  – The rack is encoded in the hostname (dc2-rk01)
```shell
ROTATIONAL=$(cat /sys/block/$DEV/queue/rotational)

if [ $ROTATIONAL -eq 1 ]; then
    echo "root=hdd rack=${RACK}-hdd host=${HOST}-hdd"
else
    echo "root=ssd rack=${RACK}-ssd host=${HOST}-ssd"
fi
```
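The snippet above assumes $DEV, $RACK, and $HOST are already set. A fuller, hedged sketch of such a hook — the hostname parsing is my assumption based on the dc2-rk01 naming convention mentioned above, not the actual script:

```shell
#!/bin/sh
# CRUSH location hook sketch: derive the rack from the hostname
# (e.g. dc2-rk01-osd01 -> rk01) and pick an hdd/ssd root based on
# the rotational flag of the OSD's data device.
HOST=$(hostname -s)
RACK=$(echo "$HOST" | cut -d '-' -f 2)    # assumes dcX-rkYY-... naming
DEV=sdb                                    # OSD data device; illustrative
ROTATIONAL=$(cat "/sys/block/$DEV/queue/rotational")

if [ "$ROTATIONAL" -eq 1 ]; then
    echo "root=hdd rack=${RACK}-hdd host=${HOST}-hdd"
else
    echo "root=ssd rack=${RACK}-ssd host=${HOST}-ssd"
fi
```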
```
-48   2.88   rack rk01-ssd
-33   0.72       host dc2-rk01-osd01-ssd
252   0.36           osd.252   up   1
253   0.36           osd.253   up   1

-41  69.16   rack rk01-hdd
-10  17.29       host dc2-rk01-osd01-hdd
 20   0.91           osd.20    up   1
 19   0.91           osd.19    up   1
 17   0.91           osd.17    up   1
```
Ceph with CloudStack

● Download the script from my GitHub page:
  – URL: https://gist.github.com/wido
  – Place it in /usr/local/bin
● Configure it in your ceph.conf
  – Push the config to your nodes using Puppet, Chef, Ansible, ceph-deploy, etc.
```
[osd]
osd_crush_location_hook = /usr/local/bin/crush-location-lookup
```
Ceph with CloudStack

● Highlights:
  – Automatic assignment of OSDs to the right type
  – Designed for IOps: more, smaller drives
    ● SSD for the really high-I/O applications
  – RADOS Gateway for object storage
    ● Trying to push developers towards objects instead of shared filesystems. A challenge!
● Future:
  – Double the cluster size within 6 months
RBD with OCFS2
● Location: Netherlands
● Organization: ISP
● Use case:
  – RBD for OCFS2
● Requirements:
  – Shared filesystem between webservers
    ● Until CephFS is stable
RBD with OCFS2
● Cluster:
  – 9 nodes with 8 drives
    ● 1 SSD for the operating system
    ● 7 Samsung 840 Pro 512GB SSDs
    ● 10Gbit network (20Gbit LACP)
  – At 3x replication and an 80% fill level it provides 8.6TB of storage
  – 3 small nodes as monitors
RBD with OCFS2
● "OCFS2 is a general-purpose shared-disk cluster file system for Linux capable of providing both high performance and high availability."
  – RBD disks are shared
  – ext4 or XFS can't be mounted on multiple hosts at the same time
RBD with OCFS2
● All the challenges were in OCFS2, not in Ceph or RBD
  – Running a 3.14.17 kernel due to OCFS2 issues
  – Limited OCFS2 volumes to 200GB to minimize the impact of volume corruption
  – Performed multiple hardware upgrades without any service interruption
● Runs smoothly while waiting for CephFS to mature
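A hedged sketch of how such a shared volume could be set up; the pool and image names are illustrative (not from the talk), and OCFS2 additionally needs its o2cb cluster stack configured on every node:

```shell
# Create one 200GB RBD image (the deployment's per-volume cap), map it
# on every webserver, and put OCFS2 on it so all nodes mount it at once.
rbd create webdata/shared01 --size 204800                 # size in MB: 200 * 1024
rbd map webdata/shared01                                  # run on every node
mkfs.ocfs2 -L shared01 /dev/rbd/webdata/shared01          # run once, on one node
mount -t ocfs2 /dev/rbd/webdata/shared01 /mnt/shared01    # run on every node
```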
RBD with OCFS2
● 10Gbit network for lower latency:
  – Lower network latency provides more performance
  – Lower latency means more IOps
    ● Design for I/O!
● 16k packet roundtrip times:
  – 1GbE: 0.8 ~ 1.1ms
  – 10GbE: 0.3 ~ 0.4ms
● It's not about the bandwidth, it's about latency!
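Why latency translates into IOps: a synchronous client at queue depth 1 completes at most one I/O per roundtrip. A rough sketch:

```python
def max_sync_iops(roundtrip_ms):
    """Upper bound on IOps for one synchronous client at queue depth 1."""
    return 1000.0 / roundtrip_ms

print(round(max_sync_iops(1.0)))    # 1000  (1GbE, ~1ms roundtrip)
print(round(max_sync_iops(0.35)))   # 2857  (10GbE, ~0.35ms roundtrip)
```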
RBD with OCFS2
● Highlights:
  – Full-SSD cluster
  – 10GbE network for lower latency
  – Replaced all hardware since the cluster was built
    ● From 8-bay to 16-bay machines
● Future:
  – Expand when required; no concrete planning
DO and DON'T

● DO
  – Design for I/O, not raw terabytes
  – Think about network latency
    ● 1GbE vs 10GbE
  – Use small(er) machines
  – Test recovery situations
    ● Pull the plug on those machines!
  – Reboot your machines regularly to verify everything works
    ● And do update those machines!
  – Use dedicated hardware for your monitors
    ● With an SSD for storage
DO and DON'T

● DON'T
  – Create too many placement groups
    ● It might overload your CPUs during recovery situations
  – Fill your cluster over 80%
  – Try to be smarter than Ceph
    ● It's self-healing. Give it some time.
  – Buy the most expensive machines
    ● Better to have two cheap(er) ones
  – Use RAID-1 for journaling SSDs
    ● Spread your OSDs over them
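On the placement-group point: the rule of thumb from the Ceph documentation of that era targeted roughly 100 PGs per OSD, divided by the replication factor and rounded up to a power of two. A sketch; the 384-OSD cluster size is illustrative:

```python
def pg_count(num_osds, replicas=3, pgs_per_osd=100):
    """Rule-of-thumb PG count: ~100 PGs per OSD / replicas, next power of two."""
    target = num_osds * pgs_per_osd / replicas
    power = 1
    while power < target:
        power *= 2
    return power

print(pg_count(384))  # 16384
```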
REMEMBER

● Hardware failure is the rule, not the exception!
● Consistency takes precedence over availability
● Ceph is designed to run on commodity hardware
● There is no more need for RAID
  – Forget it ever existed
Questions?
● Twitter: @widodh
● Skype: @widodh
● E-mail: [email protected]
● GitHub: github.com/wido
● Blog: http://blog.widodh.nl/