Ceph: Successes and challenges with open source distributed storage
Carlos Maltzahn, standing in for Sage Weil
DOMA Workshop, 11/16/17


Page 1

Ceph: Successes and challenges with open source distributed storage

Carlos Maltzahn, standing in for Sage Weil. DOMA Workshop, 11/16/17

Page 2

Purpose of this talk

• Share important challenges
• Look forward (a little bit)
• Collect feedback

Page 3

Credit: SanDisk Data Center Tech Blog: CPU Bandwidth – The Worrisome 2020 Trend, 3/23/16

Page 4

Credit: SanDisk Data Center Tech Blog: CPU Bandwidth – The Worrisome 2020 Trend, 3/23/16

Page 5

Credit: SanDisk Data Center Tech Blog: CPU Bandwidth – The Worrisome 2020 Trend, 3/23/16

The CPU/DRAM Bottleneck

Page 6

Credit: Allen Samuels at OpenStack Summit Austin, 4/27/16: The Consequences of Infinite Storage Bandwidth

Page 7

ARCHITECTURAL COMPONENTS

• RADOS: A software-based, reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes and lightweight monitors
• LIBRADOS: A library allowing apps to directly access RADOS (C, C++, Java, Python, Ruby, PHP)
• RGW: A web services gateway for object storage, compatible with S3 and Swift
• RBD: A reliable, fully-distributed block device with cloud platform integration
• CEPHFS: A distributed file system with POSIX semantics and scale-out metadata management

(Consumers shown in the original diagram: APP via RGW/LIBRADOS, HOST/VM via RBD, CLIENT via CEPHFS. Slide credit: "Ceph, Programmable Storage, and Data Fabrics," Carlos Maltzahn.)
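For context on LIBRADOS (not part of the original slide): a minimal C++ sketch of an application talking directly to RADOS. The pool name "rbd", the client id "admin", and default ceph.conf/keyring locations are assumptions; adjust for your cluster and link against librados.

```cpp
// Minimal librados sketch: connect, write one object, read it back.
#include <rados/librados.hpp>
#include <iostream>

int main() {
  librados::Rados cluster;
  cluster.init("admin");                 // connect as client.admin (assumption)
  cluster.conf_read_file(nullptr);       // read the default ceph.conf
  if (cluster.connect() < 0) {
    std::cerr << "could not connect to cluster\n";
    return 1;
  }

  librados::IoCtx io;
  cluster.ioctx_create("rbd", io);       // open an I/O context on a pool (assumption)

  librados::bufferlist bl;
  bl.append("hello from librados");
  io.write_full("greeting", bl);         // store one object

  librados::bufferlist out;
  io.read("greeting", out, bl.length(), 0);
  std::cout << out.to_str() << std::endl;

  io.close();
  cluster.shutdown();
  return 0;
}
```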

Page 8

Challenges

• Current architecture works great for multiple interfaces and directly-attached HDDs (aka spinners)
• Challenges:
  • Storage devices are getting too fast
  • OSD peering too expensive for fabric-connected storage devices
  • Control of tail latency
  • Global namespace scalability
    • a frequent reason why people switch from files to objects

Page 9

Storage devices are getting too fast

• NVMe is too fast
  • Migrating the OSD critical path from thread-based to futures-based (see the sketch after this list)
  • Probably going to use the Seastar library (seastar-project.org)
  • Most code doesn't have to change, e.g. peering and consistency code
  • Will greatly reduce the CPU cost for all storage
  • Will take at least a year
• Fabric-based storage devices (NVMe-oF)
  • OSD-to-OSD replication causes too much overhead
  • Planning a new pool type where one OSD manages all replicas
  • Fabrics might be too expensive, especially if already CPU-limited
  • New device interfaces might make OSD-to-OSD replication affordable again
    • For example, the NVMe key/value interface standardization effort
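To illustrate the thread-based vs futures-based distinction (this is a toy sketch, not Ceph's actual OSD code): in the futures style that Seastar encourages, each stage of an I/O is a continuation chained onto a future, so no thread blocks while the device works. Only basic Seastar calls (app_template, future, then, sleep) are assumed here; the function name is made up.

```cpp
// Toy Seastar-style continuation chain for a "write" request.
#include <seastar/core/app-template.hh>
#include <seastar/core/future.hh>
#include <seastar/core/sleep.hh>
#include <chrono>
#include <iostream>

seastar::future<> handle_write() {
  // Stage 1: pretend to submit the I/O (simulated by a short sleep),
  // then continue without tying up a thread while it completes.
  return seastar::sleep(std::chrono::milliseconds(1)).then([] {
    // Stage 2: replication / journaling would be chained here as more .then() steps.
    std::cout << "write completed\n";
  });
}

int main(int argc, char** argv) {
  seastar::app_template app;
  return app.run(argc, argv, [] {
    return handle_write();
  });
}
```

In the thread-based model each in-flight request holds a kernel thread across these stages; in the futures model the reactor multiplexes many requests per core, which is where the CPU savings come from.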

Page 10

Challenges

• Tail latency control
  • Planning a new pool type for quorum-based consistency
  • Different writers to different replicas in parallel with eventual consistency
  • Strong consistency: read all replicas and resolve conflicts (see the sketch after this list)
  • Weak consistency: read one replica only
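A hypothetical sketch of the read side of such a pool (not Ceph code, and the version field is an assumption): a strong read consults every replica and resolves by the newest version, while a weak read returns whichever single replica answers.

```cpp
// Hypothetical quorum-style read resolution.
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

struct ReplicaRead {
  uint64_t version;   // e.g., a per-object version or timestamp (assumed)
  std::string data;
};

// "Strong" read: consult all replicas and keep the highest version.
ReplicaRead resolve_strong(const std::vector<ReplicaRead>& replies) {
  ReplicaRead newest = replies.front();
  for (const auto& r : replies)
    if (r.version > newest.version) newest = r;
  return newest;
}

// "Weak" read: trust whichever single replica answered first.
ReplicaRead resolve_weak(const std::vector<ReplicaRead>& replies) {
  return replies.front();
}

int main() {
  std::vector<ReplicaRead> replies = {{3, "v3"}, {2, "v2"}, {3, "v3"}};
  std::cout << "strong read: " << resolve_strong(replies).data << "\n";
  std::cout << "weak read:   " << resolve_weak(replies).data << "\n";
}
```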

• Files vs objects
  • Objects are considered to be more scalable than files
  • Most file systems have *one* name server: that doesn't scale
  • Many file system workloads with >50% metadata operations
  • The Luminous release introduces multiple active metadata servers

Page 11

Global Name Spaces

• Multiple active name servers
  • first proposed by Sage Weil at SC04

• Challenges
  • Load balancing
  • Consistency
  • Logical names vs physical names (e.g., mapping files to objects; see the sketch after this list)
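As a simplified illustration of the logical-to-physical mapping (the real logic lives in Ceph's file layout code and supports striping): with the default simple layout, CephFS stores a file's data in RADOS objects named from the inode number and the object index within the file.

```cpp
// Illustrative only: map a CephFS file offset to a RADOS object name,
// assuming the default simple layout (stripe_count = 1, 4 MiB objects).
// CephFS data objects are named "<inode hex>.<object index as 8 hex digits>".
#include <cstdint>
#include <cstdio>
#include <string>

std::string object_for_offset(uint64_t inode, uint64_t offset,
                              uint64_t object_size = 4ull << 20) {
  uint64_t index = offset / object_size;   // which object holds this byte
  char name[64];
  std::snprintf(name, sizeof(name), "%llx.%08llx",
                (unsigned long long)inode, (unsigned long long)index);
  return name;
}

int main() {
  // Byte 10 MiB of inode 0x10000000000 lands in the third 4 MiB object.
  std::printf("%s\n", object_for_offset(0x10000000000ull, 10ull << 20).c_str());
  // prints: 10000000000.00000002
}
```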

Credit: http://ceph.com/community/new-luminous-multiple-active-metadata-servers-cephfs/

Page 12

Global Name Spaces

Cudele: An API and Framework for Programmable Consistency and Durability in a Global Namespace

Michael A. Sevilla, Ivo Jimenez, Noah Watkins, Jeff LeFevre, Peter Alvaro, Shel Finkelstein, Patrick Donnelly*, Carlos Maltzahn

University of California, Santa Cruz; *Red Hat
{msevilla, ivo, jayhawk, jlefevre}@soe.ucsc.edu, {palvaro, shel, carlosm}@ucsc.edu, [email protected]

Abstract—HPC and data center scale application developers are abandoning POSIX IO because the file system metadata synchronization and serialization overheads of providing strong consistency and durability are too costly – and often unnecessary – for their applications. Unfortunately, designing file systems with weaker consistency or durability excludes applications that rely on stronger guarantees, forcing developers to re-write their applications or deploy them on a different system. Users can mount multiple systems in the global namespace but this means (1) provisioning separate storage clusters and (2) manually moving data across system boundaries. We present a framework and API that lets clients specify their consistency/durability requirements and dynamically assign them to subtrees in the same namespace, allowing users to optimize subtrees over time and space for different workloads. We confirm the performance benefits of techniques presented in related work but also explore new consistency/durability metadata designs, all integrated over the same storage system. By custom fitting a subtree to a create-heavy application, we show 8× speedup and can scale to 2× as many clients when compared to our baseline system.

I. INTRODUCTION

File system metadata services in HPC and large-scale data centers have scalability problems because common tasks, like checkpointing [1] or scanning the file system [2], contend for the same directories and inodes. Applications perform better with dedicated metadata servers [3], [4] but provisioning a metadata server for every client is unreasonable. This problem is exacerbated by current hardware and software trends; for example, HPC architectures are transitioning from complex storage stacks with burst buffer, file system, object store, and tape tiers to more simplified stacks with just a burst buffer and object store [5]. These types of trends put pressure on data access because more requests end up hitting the same layer and latencies cannot be hidden while data migrates across tiers.

To address this, developers are relaxing the consistency and durability semantics in the file system because weaker guarantees are sufficient for their applications. For example, many batch style jobs do not need the strong consistency that the file system provides, so BatchFS [2] and DeltaFS [6] do more client-side processing and merge updates when the job is done. Developers in these domains are turning to these non-POSIX IO solutions because their applications are well-understood (e.g., well-defined read/write phases, synchronization only needed during certain phases, workflows describing computation, etc.) and because these applications wreak havoc on file systems designed for general-purpose workloads (e.g., checkpoint-restart's N:N and N:1 create patterns [1]).

Fig. 1: Illustration of subtrees with different semantics co-existing in a global namespace. For performance, clients can relax consistency on their subtree (HDFS) or decouple the subtree and move it locally (BatchFS, RAMDisk). Decoupled subtrees can relax durability for even better performance.

One popular approach for relaxing consistency and durability is to "decouple the namespace", where clients lock the subtree they want exclusive access to as a way to tell the file system that the subtree is important or may cause resource contention in the near-future [2], [4], [6]–[8]. Then the file system can change its internal structure to optimize performance. For example, the file system could enter a mode that prevents other clients from interfering with the decoupled directory. This delayed merge (i.e. a form of eventual consistency) and relaxed durability improves performance and scalability by avoiding the costs of remote procedure calls (RPCs), synchronization, false sharing, and serialization. While the performance benefits of decoupling the namespace are obvious, applications that rely on the file system's guarantees must be deployed on an entirely different system or re-written to coordinate strong consistency/durability themselves.

To address this problem, we present an API and framework that lets developers dynamically control the consistency and durability guarantees for subtrees in the file system namespace. Figure 1 shows a potential setup in our proposed system where a single global namespace has subtrees for applications optimized with techniques from different state-of-the-art architec- […]

Mantle: A Programmable Metadata Load Balancer for the Ceph File System (excerpt)

[…] overload the HDFS namenode [23]. The elegance and simplicity of the solutions stem from a thorough understanding of the workloads (e.g., temperature zones at Facebook [14]) and are not applicable for general purpose storage systems.

The most common technique for improving the performance of these metadata services is to balance the load across dedicated metadata server (MDS) nodes [16, 25, 26, 21, 28]. Distributed MDS services focus on parallelizing work and synchronizing access to the metadata. A popular approach is to encourage independent growth and reduce communication, using techniques like lazy client and MDS synchronization [16, 18, 29, 9, 30], inode path/permission caching [4, 11, 28], locality-aware/inter-object transactions [21, 30, 17, 18] and efficient lookup tables [4, 30]. Despite having mechanisms for migrating metadata, like locking [21, 20], zero copying and two-phase commits [21], and directory partitioning [28, 16, 18, 25], these systems fail to exploit locality.

File system workloads have locality because the namespace has semantic meaning; data stored in directories is related and is usually accessed together. Figure 1 shows the metadata locality when compiling the Linux source code. The "heat" of each directory is calculated with per-directory metadata counters, which are tempered with an exponential decay. The hotspots can be correlated with phases of the job: untarring the code has high, sequential metadata load across directories and compiling the code has hotspots in the arch, kernel, fs, and mm directories. Exploiting this locality has positive implications for performance because it reduces the number of requests, lowers the communication across MDS nodes, and eases memory pressure. The Ceph [25] (see also www.ceph.com) file system (CephFS) tries to leverage this spatial, temporal, and request-type locality in metadata intensive workloads using dynamic subtree partitioning, but struggles to find the best degree of locality and balance.

We envision a general purpose metadata balancer that responds to many types of parallel applications. To get to that balancer, we need to understand the trade-offs of resource migration and the processing capacity of the MDS nodes. We present Mantle¹, a system built on CephFS that exposes these factors by separating migration policies from the mechanisms. Mantle accepts injectable metadata migration code and helps us make the following contributions:

• a comparison of balancing for locality and balancing for distribution

• a general framework for succinctly expressing different load balancing techniques

• an MDS service that supports simple balancing scripts using this framework

Using Mantle, we can dynamically select different techniques for distributing metadata. We explore the infrastructures for a better understanding of how to balance diverse metadata workloads and ask the question "is it better to spread load aggressively or to first understand the capacity of MDS nodes before splitting load at the right time under the right conditions?". We show how the second option can lead to better performance but at the cost of increased complexity. We find that the cost of migration can sometimes outweigh the benefits of parallelism (up to 40% performance degradation) and that searching for balance too aggressively increases the standard deviation in runtime.

¹The mantle is the structure behind an octopus's head that protects its organs.

Figure 2: The MDS cluster journals to RADOS and exposes a namespace to clients. Each MDS makes decisions by exchanging heartbeats and partitioning the cluster/namespace. Mantle adds code hooks for custom balancing logic.

2. BACKGROUND: DYNAMIC SUBTREE PARTITIONING

We use Ceph [25] to explore the metadata management problem. Ceph is a distributed storage platform that stripes and replicates data across a reliable object store called RADOS. Clients talk directly to object storage daemons (OSDs) on individual disks by calculating the data placement ("where" should I store my data) and location ("where" did I store my data) using a hash-based algorithm (CRUSH). CephFS is the POSIX-compliant file system that uses RADOS. It decouples metadata and data access, so data IO is done directly with RADOS while all metadata operations go to a separate metadata cluster. The MDS cluster is connected to RADOS so it can periodically flush its state. The hierarchical namespace is kept in the collective memory of the MDS cluster and acts as a large distributed cache. Directories are stored in RADOS, so if the namespace is larger than memory, parts of it can be swapped out.

The MDS nodes use dynamic subtree partitioning [26] to carve up the namespace and to distribute it across the MDS cluster, as shown in Figure 2. MDS nodes maintain the subtree boundaries and "forward" requests to the authority MDS if a client's request falls outside of its jurisdiction or if the request tries to write to replicated metadata. Each MDS has its own metadata balancer that makes independent decisions, using the flow in Figure 2. Every 10 seconds, each MDS packages up its metrics and sends a heartbeat ("send HB") to every MDS in the cluster. Then the MDS receives the heartbeat ("recv HB") and incoming inodes from the other MDS nodes. Finally, the MDS decides whether to balance load ("rebalance") and/or fragment its own directories ("fragment"). If the balancer decides to rebalance load, it partitions the namespace and cluster and sends inodes […]
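As a rough illustration of the per-MDS balancing decision described above (Mantle's real policies are injected as scripts into the MDS; this C++ toy is not CephFS code, and the trivial "shed anything above the cluster average" rule is an assumption): each MDS aggregates the load metrics from everyone's heartbeats and decides how much load, if any, to migrate away.

```cpp
// Toy per-MDS balancing decision: shed load above the cluster average.
#include <iostream>
#include <map>
#include <string>

// How much load this MDS should shed, given everyone's heartbeat metrics.
// Returning 0 means "do not rebalance this tick".
double load_to_shed(const std::string& me,
                    const std::map<std::string, double>& cluster_load) {
  double total = 0;
  for (const auto& kv : cluster_load) total += kv.second;
  double avg = total / cluster_load.size();
  double mine = cluster_load.at(me);
  return mine > avg ? mine - avg : 0.0;
}

int main() {
  // Pretend heartbeat metrics (e.g., decayed per-directory counters, summed).
  std::map<std::string, double> cluster_load = {
      {"mds.a", 90.0}, {"mds.b", 10.0}, {"mds.c", 20.0}};
  std::cout << "mds.a should shed " << load_to_shed("mds.a", cluster_load)
            << " units of load\n";   // 50 units above the average of 40
}
```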

Michael Sevilla, Noah Watkins, Carlos Maltzahn, Ike Nassi, Scott Brandt, Sage Weil, Greg Farnum, and Sam Fineberg, "Mantle: A programmable metadata load balancer for the Ceph file system," SC'15, November 2015.

Michael A. Sevilla, Ivo Jimenez, Noah Watkins, Jeff LeFevre, Peter Alvaro, Shel Finkelstein, Carlos Maltzahn, Patrick Donnelly, "Cudele: An API and Framework for Programmable Consistency and Durability in a Global Namespace," submitted for publication.


Client perspective

Composed from metadata stored in MDS and objects

Page 13

Summary

• Data management CPU overhead will dominate the cost of storage
• The Ceph data path will get a lot shorter; check out seastar-project.org
• Tail latency control via quorum-based consistency (new pool type)
• Global namespace scalability via
  • Subtree-specific load balancing policies
  • Subtree-specific consistency semantics that can be dynamically changed
  • Decoupling logical names from physical metadata management

• Contact: Carlos Maltzahn, [email protected]

Page 14

Ceph Pools

• Grouping of objects into sets that differ by the following:
  • Resilience (number of replicas or erasure code parameters)
  • Placement groups
  • CRUSH rules
  • Snapshots
  • Ownership
  • Future object management alternatives (see below)
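As a hedged sketch only: pools can also be created and listed programmatically through librados, while properties such as replica count, CRUSH rule, or snapshots are normally managed with the `ceph osd pool` family of commands. The pool name below is made up, and the usual cluster/credential assumptions from the earlier librados example apply.

```cpp
// Create a pool with default settings and list the pools in the cluster.
#include <rados/librados.hpp>
#include <iostream>
#include <list>
#include <string>

int main() {
  librados::Rados cluster;
  cluster.init("admin");
  cluster.conf_read_file(nullptr);
  if (cluster.connect() < 0) return 1;

  cluster.pool_create("doma-test");        // hypothetical pool name

  std::list<std::string> pools;
  cluster.pool_list(pools);                // enumerate existing pools
  for (const auto& p : pools) std::cout << p << "\n";

  cluster.shutdown();
  return 0;
}
```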