Ceph Day London 2014 – The current state of CephFS development
Agenda
● Introduction to distributed filesystems
● Architectural overview
● Recent development
● Test & QA
Distributed filesystems...and why they are hard.
Interfaces to storage
● Object: Ceph RGW, S3, Swift
● Block (aka SAN): Ceph RBD, iSCSI, FC, SAS
● File (aka scale-out NAS): Ceph, GlusterFS, Lustre, proprietary filers
Interfaces to storage
● OBJECT STORAGE – RGW: Keystone, Geo-Replication, Native API, Multi-tenant, S3 & Swift
● BLOCK STORAGE – RBD: OpenStack, Linux Kernel, iSCSI, Clones, Snapshots
● FILE SYSTEM – CephFS: CIFS/NFS, HDFS, Distributed Metadata, Linux Kernel, POSIX
Object stores scale out well
● Last writer wins consistency
● Consistency rules only apply to one object at a time
● Clients are stateless (unless explicitly doing lock ops)
● No relationships exist between objects
● Objects have exactly one name
● Scale-out accomplished by mapping objects to nodes
● Single objects may be lost without affecting others
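For concreteness, a minimal sketch of this flat object model using the rados CLI (pool, object, and file names are illustrative):

rados -p testpool put greeting ./hello.txt   # store an object under exactly one name
rados -p testpool get greeting ./out.txt     # reads see one whole object; last write wins
rados -p testpool ls                         # flat namespace: no directories, no links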
POSIX filesystems are hard to scale out
● Extents written from multiple clients must win or lose on an all-or-nothing basis → locking
● Inodes depend on one another (directory hierarchy)
● Clients are stateful: holding files open
● Users have local-filesystem latency expectations: applications assume FS client will do lots of metadata caching for them.
● Scale-out requires spanning inode/dentry relationships across servers
● Loss of data can damage whole subtrees
Failure cases increase complexity further
● What should we do when...?
  ● Filesystem is full
  ● Client goes dark
  ● An MDS goes dark
  ● Memory is running low
  ● Clients are competing for the same files
  ● Clients misbehave
● These are hard problems in distributed systems generally, and especially hard when we have to uphold POSIX semantics designed for local systems.
Terminology
● inode: a file. Has unique ID, may be referenced by one or more dentries.
● dentry: a link between an inode and a directory
● directory: special type of inode that has 0 or more child dentries
● hard link: many dentries referring to the same inode
● These terms originate from traditional (local disk) filesystems, where they described how a filesystem was represented on disk.
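A quick illustration on any local filesystem (paths are illustrative): a hard link is simply a second dentry pointing at the same inode.

touch /tmp/a
ln /tmp/a /tmp/b        # create a second dentry for the same inode
ls -li /tmp/a /tmp/b    # same inode number, link count 2 on both entries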
Architectural overview
CephFS architecture
● Dynamically balanced scale-out metadata
● Inherit flexibility/scalability of RADOS for data
● POSIX compatibility
● Beyond POSIX: Subtree snapshots, recursive statistics
Weil, Sage A., et al. "Ceph: A Scalable, High-Performance Distributed File System." Proceedings of the 7th Symposium on Operating Systems Design and Implementation (OSDI '06), USENIX Association, 2006. http://ceph.com/papers/weil-ceph-osdi06.pdf
Components
● Client: kernel, fuse, libcephfs
● Server: MDS daemon
● Storage: RADOS cluster (mons & OSDs)
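A quick sketch of the two ready-made client routes (monitor address, mountpoint, and key paths are illustrative); the FUSE client is built on libcephfs, which applications can also link directly.

mount -t ceph 192.168.0.1:6789:/ /mnt/cephfs -o name=admin,secretfile=/etc/ceph/admin.secret   # kernel client
ceph-fuse -m 192.168.0.1:6789 /mnt/cephfs                                                      # FUSE client (libcephfs)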
Components
[Diagram: a Linux host running the ceph.ko kernel client sends metadata operations to the MDS and file data directly to the Ceph server daemons (monitors and OSDs).]
From application to disk
[Diagram: the application uses libcephfs, ceph-fuse, or the kernel client; the client network protocol carries metadata operations to ceph-mds and data to RADOS, which persists it to disk.]
Scaling out FS metadata
● Options for distributing metadata?
– by static subvolume
– by path hash
– by dynamic subtree
● Consider performance, ease of implementation
DYNAMIC SUBTREE PARTITIONING
Dynamic subtree placement
● Locality: get the dentries in a dir from one MDS
● Support read-heavy workloads by replicating non-authoritative copies (cached with capabilities, just like clients do)
● In practice, work at the directory-fragment level in order to handle large directories
Data placement
● Stripe file contents across RADOS objects
  ● get full RADOS cluster bandwidth from clients
  ● delegate all placement/balancing to RADOS
● Control striping with layout vxattrs (see the sketch below)
  ● layouts also select between multiple data pools
● Deletion is a special case: client deletions mark files 'stray', RADOS delete ops sent by MDS
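A hedged sketch of the layout vxattrs from a mounted client (directory and pool names are illustrative, and directing data to another pool assumes that pool has been added to the filesystem as a data pool):

setfattr -n ceph.dir.layout.stripe_count -v 4 /mnt/cephfs/bigfiles     # stripe new files across 4 objects at a time
setfattr -n ceph.dir.layout.pool -v fs_data_ssd /mnt/cephfs/bigfiles   # send new file data to a different data pool
getfattr -n ceph.dir.layout /mnt/cephfs/bigfiles                       # inspect the effective layout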
Clients
● Two implementations:
  ● ceph-fuse/libcephfs
  ● kclient
● Interplay with the VFS page cache; efficiency is harder with FUSE (extraneous stats etc.)
● Client performance matters, especially for single-client workloads
● A slow client can hold up others if it is hogging metadata locks: include clients in troubleshooting
● Future: more per-client performance stats, and maybe per-client metadata QoS. Clients probably group into jobs or workloads.
● Future: may want to tag client I/O with a job ID (e.g. HPC workload, Samba client ID, container/VM ID)
Journaling and caching in MDS
● Metadata ops initially journaled to striped journal "file" in the metadata pool.
● I/O latency on metadata ops is sum of network latency and journal commit latency.
● Metadata remains pinned in in-memory cache until expired from journal.
Journaling and caching in MDS
● In some workloads we expect almost all metadata to stay in cache; in others it's more of a stream.
● Control cache size with mds_cache_size
● Cache eviction relies on client cooperation
● MDS journal replay not only recovers data but also warms up cache. Use standby replay to keep that cache warm.
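A hedged ceph.conf sketch covering both points (daemon names and the cache value are illustrative):

[mds.a]
    mds cache size = 300000        # inodes to keep pinned in the MDS cache
[mds.b]
    mds standby replay = true      # continuously replay the journal to keep a warm cache
    mds standby for rank = 0       # follow the active rank-0 MDS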
Lookup by inode
● Sometimes we need inode → path mapping:
  ● Hard links
  ● NFS handles
● Costly to store this: mitigate by piggybacking paths (backtraces) onto data objects
● Con: storing metadata to data pool
● Con: extra IOs to set backtraces
● Pro: disaster recovery from data pool
● Future: improve backtrace writing latency?
Extra features
● Snapshots (see the sketch below):
  ● Exploit RADOS snapshotting for file data
  ● … plus some clever code in the MDS
  ● Fast petabyte snapshots
● Recursive statistics:
  ● Lazily updated
  ● Access via vxattr
  ● Avoid spurious client I/O for df
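A hedged sketch of both features from a mounted client (paths are illustrative, and snapshot creation assumes snapshots are enabled on the cluster):

mkdir /mnt/cephfs/projects/.snap/before-cleanup     # create a subtree snapshot
ls /mnt/cephfs/projects/.snap/                      # list snapshots of that directory
getfattr -n ceph.dir.rbytes /mnt/cephfs/projects    # recursive bytes under the tree
getfattr -n ceph.dir.rfiles /mnt/cephfs/projects    # recursive file count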
CephFS in practice
ceph-deploy mds create myserver
ceph osd pool create fs_data <pg_num>
ceph osd pool create fs_metadata <pg_num>
ceph fs new myfs fs_metadata fs_data
mount -t ceph x.x.x.x:6789:/ /mnt/ceph
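A slightly expanded sketch of the same flow (auth options and key paths are illustrative additions, not from the slide):

ceph mds stat        # confirm the MDS is up:active before mounting
ceph fs ls           # show the new filesystem and its pools
mount -t ceph x.x.x.x:6789:/ /mnt/ceph -o name=admin,secretfile=/etc/ceph/admin.secret
ceph-fuse -m x.x.x.x:6789 /mnt/ceph     # FUSE alternative to the kernel client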
Managing CephFS clients
● New in giant: see hostnames of connected clients
● Client eviction is sometimes important:
  ● Skip the wait during reconnect phase on MDS restart
  ● Allow others to access files locked by crashed client
● Use OpTracker to inspect ongoing operations
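A hedged sketch using the MDS admin socket (the daemon name mds.a and the client id are illustrative; dump_ops_in_flight is the usual OpTracker command elsewhere in Ceph and is assumed here):

ceph daemon mds.a session ls             # connected clients, with hostname metadata in giant
ceph daemon mds.a session evict 4305     # evict by client id taken from session ls
ceph daemon mds.a dump_ops_in_flight     # OpTracker view of ongoing operations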
CephFS tips
● Choose MDS servers with lots of RAM
● Investigate clients when diagnosing stuck/slow access
● Use recent Ceph and recent kernel
● Use a conservative configuration:
  ● Single active MDS, plus one standby
  ● Dedicated MDS server
  ● Kernel client
  ● No snapshots, no inline data
Development update
● RADOS: A reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes
● LIBRADOS: A library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby, and PHP
● RBD: A reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver
● RADOSGW: A bucket-based REST gateway, compatible with S3 and Swift
● CEPH FS: A POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE
(On the slide, RADOS, LIBRADOS, RBD and RADOSGW are marked AWESOME; CEPH FS is marked NEARLY AWESOME.)
Towards a production-ready CephFS
● Focus on resilience:
1. Don't corrupt things
2. Stay up
3. Handle the corner cases
4. When something is wrong, tell me
5. Provide the tools to diagnose and fix problems
● Achieve this first within a conservative single-MDS configuration
Giant → Hammer timeframe
● Initial online fsck (a.k.a. forward scrub)
● Online diagnostics (`session ls`, MDS health alerts)
● Journal resilience & tools (cephfs-journal-tool)
● flock in the FUSE client
● Initial soft quota support
● General resilience: full OSDs, full metadata cache
FSCK and repair
● Recover from damage:
  ● Loss of data objects (which files are damaged?)
  ● Loss of metadata objects (what subtree is damaged?)
● Continuous verification:
  ● Are recursive stats consistent?
  ● Does metadata on disk match cache?
  ● Does file size metadata match data on disk?
● Repair:
  ● Automatic where possible
  ● Manual tools to enable support
Client management
● Current eviction is not 100% safe against rogue clients
  ● Update to client protocol to wait for OSD blacklist
● Client metadata
  ● Initially domain name, mount point
  ● Extension to other identifiers?
Online diagnostics
● Bugs exposed relate to failures of one client to release resources for another client: “my filesystem is frozen”. Introduce new health messages:
  ● “client xyz is failing to respond to cache pressure”
  ● “client xyz is ignoring capability release messages”
● Add client metadata to allow us to give domain names instead of IP addrs in messages.
● Opaque behavior in the face of dead clients. Introduce `session ls`
  ● Which clients does MDS think are stale?
  ● Identify clients to evict with `session evict`
Journal resilience
● Bad journal prevents MDS recovery: “my MDS crashes on startup”:
  ● Data loss
  ● Software bugs
● Updated on-disk format to make recovery from damage easier
● New tool: cephfs-journal-tool (see the sketch below)
  ● Inspect the journal, search/filter
  ● Chop out unwanted entries/regions
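A hedged sketch of a recovery session with the tool (always export a backup first; the exact subcommands below follow the tool's help of the era and should be double-checked):

cephfs-journal-tool journal inspect                    # report overall journal integrity
cephfs-journal-tool journal export backup.bin          # keep a copy before touching anything
cephfs-journal-tool event get list                     # list journal events
cephfs-journal-tool event recover_dentries summary     # salvage dentries from the journal into the backing store
cephfs-journal-tool journal reset                      # last resort: truncate the damaged journal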
Handling resource limits
● Write a test, see what breaks!
● Full MDS cache:
  ● Require some free memory to make progress
  ● Require client cooperation to unpin cache objects
  ● Anticipate tuning required for cache behaviour: what should we evict?
● Full OSD cluster:
  ● Require explicit handling to abort with -ENOSPC
● MDS → RADOS flow control:
  ● Contention between I/O to flush cache and I/O to journal
Test, QA, bug fixes
● The answer to “Is CephFS production ready?”
● teuthology test framework:
  ● Long running/thrashing tests
  ● Third party FS correctness tests
  ● Python functional tests
● We dogfood CephFS internally
  ● Various kclient fixes discovered
  ● Motivation for new health monitoring metrics
● Third party testing is extremely valuable
What's next?
● You tell us!
● Recent survey highlighted:
  ● FSCK hardening
  ● Multi-MDS hardening
  ● Quota support
● Which use cases will matter to the community?
  ● Backup
  ● Hadoop
  ● NFS/Samba gateway
  ● Other?
Reporting bugs
● Does the most recent development release or kernel fix your issue?
● What is your configuration? MDS config, Ceph version, client version, kclient or fuse
● What is your workload?
● Can you reproduce with debug logging enabled?
http://ceph.com/resources/mailing-list-irc/
http://tracker.ceph.com/projects/ceph/issues
http://ceph.com/docs/master/rados/troubleshooting/log-and-debug/
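A hedged sketch of turning up logging before reproducing an issue (the daemon name and levels are illustrative; see the log-and-debug page above for the authoritative options):

ceph daemon mds.a config set debug_mds 20     # verbose MDS logging while you reproduce
ceph daemon mds.a config set debug_ms 1       # message-level logging
# or persistently, in the [mds] section of ceph.conf:
#   debug mds = 20
#   debug ms = 1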
Future
● Ceph Developer Summit:
  ● When: 8 October
  ● Where: online
● Post-Hammer work:
  ● Recent survey highlighted multi-MDS, quota support
  ● Testing with clustered Samba/NFS?
Questions?
A STORAGE REVOLUTION
[Diagram: a traditional stack of proprietary hardware, proprietary software, and support & maintenance contrasted with a stack of standard hardware, open source software, and enterprise products & services, each built from ordinary computers and disks.]
ARCHITECTURAL COMPONENTS
● RGW: A web services gateway for object storage, compatible with S3 and Swift
● LIBRADOS: A library allowing apps to directly access RADOS (C, C++, Java, Python, Ruby, PHP)
● RADOS: A software-based, reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes and lightweight monitors
● RBD: A reliable, fully-distributed block device with cloud platform integration
● CEPHFS: A distributed file system with POSIX semantics and scale-out metadata management
OBJECT STORAGE DAEMONS
[Diagram: each OSD daemon serves one disk through a local filesystem (btrfs, xfs, or ext4); monitors (M) run alongside the OSDs.]
RADOS CLUSTER
[Diagram: an application talks directly to a RADOS cluster made up of OSDs and monitors (M).]
RADOS COMPONENTS
● OSDs: 10s to 10000s in a cluster; one per disk (or one per SSD, RAID group…); serve stored objects to clients; intelligently peer for replication & recovery
● Monitors: maintain cluster membership and state; provide consensus for distributed decision-making; small, odd number; do not serve stored objects to clients
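A hedged sketch of inspecting both daemon types on a running cluster:

ceph osd tree          # OSDs arranged by the CRUSH hierarchy, with up/down state
ceph mon stat          # monitor membership and current quorum
ceph quorum_status     # detailed view of the monitors' consensus state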
WHERE DO OBJECTS LIVE?
[Diagram: an application holds an object and asks which node in the cluster of monitors and OSDs should store it.]
A METADATA SERVER?
[Diagram: one option — the application (1) asks a central metadata server where the object lives, then (2) contacts that node.]
CALCULATED PLACEMENT
[Diagram: another option — the application computes placement itself, e.g. an object named "F" maps to the node responsible for the A–G range, with no lookup step.]
EVEN BETTER: CRUSH!
[Diagram: an object is placed among the OSDs of the RADOS cluster by the CRUSH algorithm.]
CRUSH IS A QUICK CALCULATION
[Diagram: the client calculates an object's location in the RADOS cluster directly; no central lookup is needed.]
CRUSH: DYNAMIC DATA PLACEMENT
● CRUSH: pseudo-random placement algorithm
  ● Fast calculation, no lookup
  ● Repeatable, deterministic
● Statistically uniform distribution
● Stable mapping
  ● Limited data migration on change
● Rule-based configuration
  ● Infrastructure topology aware
  ● Adjustable replication
  ● Weighting
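A hedged way to see the calculation for yourself (pool and object names are illustrative):

ceph osd map rbd myobject     # shows the PG and the OSDs CRUSH picks for "myobject" in pool "rbd"
ceph osd map rbd myobject     # run it again: the answer is repeatable and deterministic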
ACCESSING A RADOS CLUSTER
[Diagram: an application links LIBRADOS and talks to the RADOS cluster over a socket to read and write objects.]
LIBRADOS: RADOS ACCESS FOR APPS
● LIBRADOS: direct access to RADOS for applications
● C, C++, Python, PHP, Java, Erlang
● Direct access to storage nodes
● No HTTP overhead
THE RADOS GATEWAY
[Diagram: applications speak REST to RADOSGW instances, which use LIBRADOS over a socket to store objects in the RADOS cluster.]
RADOSGW MAKES RADOS WEBBY
● RADOSGW: REST-based object storage proxy
● Uses RADOS to store objects
● API supports buckets, accounts
● Usage accounting for billing
● Compatible with S3 and Swift applications
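A hedged sketch of the account and accounting side using radosgw-admin (the user id and display name are illustrative):

radosgw-admin user create --uid=johndoe --display-name="John Doe"   # creates S3/Swift credentials for an account
radosgw-admin usage show --uid=johndoe                              # per-user usage data for billing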
STORING VIRTUAL DISKS
[Diagram: a VM on a hypervisor uses LIBRBD to store its virtual disk in the RADOS cluster.]
SEPARATE COMPUTE FROM STORAGE
[Diagram: because the disk image lives in the RADOS cluster, the VM can be restarted on another hypervisor that also runs LIBRBD.]
KERNEL MODULE FOR MAX FLEXIBLE!
[Diagram: a Linux host maps an RBD image directly from the RADOS cluster using the KRBD kernel module.]
RBD STORES VIRTUAL DISKS
● RADOS BLOCK DEVICE:
  ● Storage of disk images in RADOS
  ● Decouples VMs from host
  ● Images are striped across the cluster (pool)
  ● Snapshots
  ● Copy-on-write clones
● Support in:
  ● Mainline Linux kernel (2.6.39+)
  ● Qemu/KVM, native Xen coming soon
  ● OpenStack, CloudStack, Nebula, Proxmox
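A hedged sketch of the image, snapshot, and clone workflow (pool and image names are illustrative):

rbd create vms/base-image --size 10240          # 10 GiB image, striped over objects in pool "vms"
rbd snap create vms/base-image@gold             # point-in-time snapshot
rbd snap protect vms/base-image@gold            # protect the snapshot so it can be cloned
rbd clone vms/base-image@gold vms/web01-disk    # copy-on-write clone for a new VM
rbd map vms/web01-disk                          # expose it via krbd as a /dev/rbd* block device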
SEPARATE METADATA SERVER
[Diagram: a Linux host with the CephFS kernel module sends metadata operations to the MDS and file data directly to the RADOS cluster.]
SCALABLE METADATA SERVERS
● METADATA SERVER:
  ● Manages metadata for a POSIX-compliant shared filesystem
  ● Directory hierarchy
  ● File metadata (owner, timestamps, mode, etc.)
  ● Stores metadata in RADOS
  ● Does not serve file data to clients
  ● Only required for shared filesystem
CEPH AND OPENSTACK
[Diagram: OpenStack services backed by the RADOS cluster — Keystone and Swift via RADOSGW/LIBRADOS, Cinder and Glance via LIBRBD, and Nova hypervisors via LIBRBD.]
GETTING STARTED WITH CEPH
● Read about the latest version of Ceph. The latest stuff is always at http://ceph.com/get
● Deploy a test cluster using ceph-deploy. Read the quick-start guide at http://ceph.com/qsg
● Read the rest of the docs! Find docs for the latest release at http://ceph.com/docs
● Ask for help when you get stuck! Community volunteers are waiting for you at http://ceph.com/help