
HopsFS: Scaling HierarchicalFile System Metadata UsingNewSQL Databases

Salman Niazi, Mahmoud Ismail, Seif Haridi, and Jim Dowling
KTH – Royal Institute of Technology, Stockholm, Sweden

Definition

Modern NewSQL database systems can be used to store fully normalized metadata for distributed hierarchical file systems and provide high throughput and low operational latencies for file system operations.

Introduction

For many years, researchers have investigated the use of database technology to manage file system metadata, with the goal of providing extensible typed metadata and support for fast, rich metadata search. Previous attempts failed mainly due to the performance penalty of adding database operations to the file system’s critical path. However, recent improvements in the performance of distributed in-memory online transaction processing databases (NewSQL databases) led us to reinvestigate the possibility of using a database to manage file system metadata, this time for a distributed, hierarchical file system, the Hadoop Distributed File System (HDFS). The single-host metadata service of HDFS is a well-known bottleneck for both the size of HDFS clusters and their throughput. In this entry, we characterize the performance of different NewSQL database operations and design the metadata architecture of a new drop-in replacement for HDFS using, as far as possible, only the high-performance NewSQL database access patterns. Our approach enabled us to scale the throughput of all HDFS operations by an order of magnitude.

HopsFS vs. HDFS

In both systems, namenodes provide an API for metadata operations to the file system clients; see Fig. 1. In HDFS, an active namenode (ANN) handles client requests, manages the metadata in memory (on the heap of a single JVM), logs changes to the metadata to a quorum of journal servers, and coordinates repair actions in the event of failures or data corruption. A standby namenode asynchronously pulls the metadata changes from the journal nodes and applies them to its in-memory metadata. A ZooKeeper service is used to signal namenode failures, enabling both agreement on which namenode is active and failover from active to standby.

In HopsFS, namenodes are stateless servers that handle client requests and process the metadata that is stored in an external distributed database, NDB in our case.


HopsFS: Scaling Hierarchical File System Metadata Using NewSQL Databases, Fig. 1 In HDFS, a single namenode manages the namespace metadata. For high availability, a log of metadata changes is stored on a set of journal nodes using quorum-based replication. The log is subsequently replicated asynchronously to a standby namenode. In contrast, HopsFS supports multiple stateless namenodes and a single leader namenode that all access metadata stored in transactional shared memory (NDB). In HopsFS, leader election is coordinated using NDB, while ZooKeeper coordinates leader election for HDFS

The internal management (housekeeping) operations must be coordinated among the namenodes. HopsFS solves this problem by electing a leader namenode that is responsible for the housekeeping. We use the database as a shared memory to implement a leader election and membership management service (Salman Niazi et al. 2015; Guerraoui and Raynal 2006).

In both HDFS and HopsFS, datanodes are connected to all the namenodes. Datanodes periodically send a heartbeat message to all the namenodes to notify them that they are still alive. The heartbeat also contains information such as the datanode capacity, its available space, and its number of active connections. Heartbeats update the list of datanode descriptors stored at namenodes. The datanode descriptor list is used by namenodes for future block allocations, and it is not persisted in either HDFS or HopsFS, as it is rebuilt on system restart using heartbeats from datanodes.

HopsFS clients support random, round-robin, and sticky policies to distribute the file system operations among the namenodes. If a file system operation fails, due to namenode failure or overloading, the HopsFS client transparently retries the operation on another namenode (after backing off, if necessary). HopsFS clients refresh the namenode list every few seconds, enabling new namenodes to join an operational cluster. HDFS v2.X clients are fully compatible with HopsFS, although they do not distribute operations over namenodes, as they assume there is a single active namenode.
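The following sketch illustrates one way such a client-side policy could be implemented; the NamenodeClient and FsOperation types are hypothetical placeholders and not actual HopsFS client classes.

import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

// Illustrative sketch of client-side namenode selection with round-robin
// dispatch, retry, and exponential backoff.
public class RetryingInvoker {
  private final List<NamenodeClient> namenodes; // refreshed periodically from the cluster
  private int next = 0;                         // round-robin cursor

  public RetryingInvoker(List<NamenodeClient> namenodes) {
    this.namenodes = namenodes;
  }

  public <T> T invoke(FsOperation<T> op, int maxRetries) throws Exception {
    long backoffMs = 50;
    for (int attempt = 0; attempt <= maxRetries; attempt++) {
      NamenodeClient nn = namenodes.get(Math.floorMod(next++, namenodes.size()));
      try {
        return op.run(nn);                      // execute the metadata operation
      } catch (Exception e) {
        if (attempt == maxRetries) throw e;     // give up after the last attempt
        Thread.sleep(backoffMs + ThreadLocalRandom.current().nextLong(backoffMs));
        backoffMs = Math.min(backoffMs * 2, 5_000); // exponential backoff, capped
      }
    }
    throw new IllegalStateException("unreachable");
  }

  /** Hypothetical functional interface for a metadata operation against one namenode. */
  public interface FsOperation<T> { T run(NamenodeClient nn) throws Exception; }
  /** Hypothetical handle for an RPC connection to a single namenode. */
  public interface NamenodeClient {}
}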

MySQL’s NDB Distributed Relational Database

HopsFS uses MySQL’s Network Database (NDB) to store the file system metadata. NDB is an open-source, real-time, in-memory, shared-nothing, distributed relational database management system.

An NDB cluster consists of three types of nodes: NDB datanodes that store the tables, management nodes, and database clients; see Fig. 3.


The management nodes are only used to disseminate the configuration information and to detect network partitions. The client nodes access the database using the SQL interface via MySQL Server or using the native APIs implemented in C++, Java, JPA, and JavaScript/Node.js.

Figure 3 shows an NDB cluster setup consisting of four NDB datanodes. Each NDB datanode consists of multiple transaction coordinator (TC) threads, which are responsible for performing two-phase commit transactions; local data management (LDM) threads, which are responsible for storing and replicating the data partitions assigned to the NDB datanode; send and receive threads, which are used to exchange data between the NDB datanodes and the clients; and IO threads, which are responsible for performing disk IO operations.

MySQL Cluster horizontally partitions the tables, that is, the rows of the tables are uniformly distributed among the database partitions stored on the NDB datanodes. The NDB datanodes are organized into node replication groups of equal sizes to manage and replicate the data partitions. The size of the node replication group is the replication degree of the database. In the example setup, the NDB replication degree is set to two (the default value); therefore, each node replication group contains exactly two NDB datanodes. The first replication node group consists of NDB datanodes 1 and 2, and the second replication node group consists of NDB datanodes 3 and 4. Each node group is responsible for storing and replicating all the data partitions assigned to the NDB datanodes in the replication node group.

By default, NDB hashes the primary key column of a table to distribute the table’s rows among the different database partitions. Figure 3 shows how the inodes table for the namespace shown in Fig. 2 is partitioned and stored in the NDB cluster database. In production deployments, each NDB datanode may store multiple data partitions, but, for simplicity, the datanodes shown in Fig. 3 store only one data partition. Replication node group 1 is responsible for storing and replicating partitions 1 and 2 of the inodes table. The primary replica of partition 1 is stored on NDB datanode 1, and the secondary replica of partition 1 is stored on the other datanode of the same replication node group, that is, NDB datanode 2. For example, NDB datanode 1 is responsible for storing a data partition that holds the lib, conf, and libB inodes, and the replica of this data partition is stored on NDB datanode 2. NDB also supports user-defined partitioning of the stored tables, that is, it can partition the data based on a user-specified table column. User-defined partitioning provides greater control over how the data is distributed among different database partitions, which helps in implementing very efficient database operations.

NDB only supports the read-committed transaction isolation level, which is not sufficient for implementing practical applications. However, NDB supports row-level locking, which can be used to isolate transactions operating on the same datasets.

HopsFS: Scaling Hierarchical File System Metadata Using NewSQL Databases, Fig. 2 A sample file system namespace. The diamond shapes represent directories and the circles represent files


HopsFS: Scaling Hierarchical File System Metadata Using NewSQL Databases, Fig. 3 System architecture diagram of MySQL’s Network Database (NDB) cluster. The NDB cluster consists of three types of nodes, that is, database clients, NDB management nodes, and NDB database datanodes

NDB supports exclusive (write) locks, shared (read) locks, and read-committed (read the last committed value) locks. Using these locking primitives, HopsFS isolates different file system operations that try to mutate the same subset of inodes.
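As an illustration of these locking primitives, the following sketch uses ClusterJ (MySQL Cluster’s native Java connector) to read a single row under an exclusive lock inside a transaction. The connect string, database name, and the simplified inodes mapping are illustrative assumptions, not HopsFS’s actual configuration or schema.

import com.mysql.clusterj.ClusterJHelper;
import com.mysql.clusterj.LockMode;
import com.mysql.clusterj.Session;
import com.mysql.clusterj.SessionFactory;
import com.mysql.clusterj.annotation.PersistenceCapable;
import com.mysql.clusterj.annotation.PrimaryKey;
import java.util.Properties;

public class ExclusiveLockExample {
  // Hypothetical ClusterJ mapping for a simplified inodes table; the real HopsFS
  // schema (see Fig. 5) has more columns and a composite primary key.
  @PersistenceCapable(table = "inodes")
  public interface Inode {
    @PrimaryKey long getId();
    void setId(long id);
    String getName();
    void setName(String name);
  }

  public static void main(String[] args) {
    Properties props = new Properties();
    props.put("com.mysql.clusterj.connectstring", "ndb-mgmd:1186"); // placeholder address
    props.put("com.mysql.clusterj.database", "hops");               // placeholder schema name
    SessionFactory factory = ClusterJHelper.getSessionFactory(props);
    Session session = factory.getSession();
    session.currentTransaction().begin();
    // EXCLUSIVE serializes concurrent writers on the same row; SHARED and
    // READ_COMMITTED are the weaker alternatives mentioned above.
    session.setLockMode(LockMode.EXCLUSIVE);
    Inode inode = session.find(Inode.class, 42L);     // primary key read of one row
    if (inode != null) {
      inode.setName("renamed");                       // mutate the in-memory copy
      session.updatePersistent(inode);                // write it back in the same transaction
    }
    session.currentTransaction().commit();
    session.close();
  }
}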

Types of Database Operations

A transactional file system operation consists of three distinct phases. In the first phase, all the metadata that is required for the file system operation is read from the database; in the second phase, the operation is performed on the metadata; and in the last phase, all the modified metadata (if any) is stored back in the database. As the latency of a file system operation greatly depends on the time spent reading and writing the data, it is imperative to understand the latency and computational cost of the different types of database operations used to read and update the stored data, in order to understand how HopsFS implements low-latency, scalable file system operations. The data can be read using four different types of database operations: primary key, partition-pruned index scan, distributed index scan, and distributed full table scan operations. NDB also supports user-defined data partitioning and distribution-aware transactions to reduce the latency of distributed transactions. A brief description of these operations is as follows:

Application-Defined Partitioning (ADP)
NDB allows developers to override the default partitioning scheme for the tables, enabling fine-grained control over how the tables’ data is distributed across the data partitions. HopsFS leverages this feature; for example, the inodes table is partitioned by the parent inode ID, that is, all immediate children of a directory reside on the same database partition, enabling an efficient directory listing operation.
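As a concrete illustration, the sketch below shows how a table partitioned by the parent inode ID might be declared with ClusterJ annotations; the underlying NDB table would be created with a matching PARTITION BY KEY (parent_id) clause. The table and column names are simplified assumptions and do not reproduce HopsFS’s actual schema.

import com.mysql.clusterj.annotation.Column;
import com.mysql.clusterj.annotation.PartitionKey;
import com.mysql.clusterj.annotation.PersistenceCapable;
import com.mysql.clusterj.annotation.PrimaryKey;

// Simplified inodes mapping: the composite primary key is (parent_id, name) and the
// partition key is parent_id, so all immediate children of a directory are stored in
// the same database partition and can be listed with a partition-pruned index scan.
@PersistenceCapable(table = "inodes")
public interface Inode {
  @PrimaryKey
  @PartitionKey
  @Column(name = "parent_id")
  long getParentId();
  void setParentId(long parentId);

  @PrimaryKey
  @Column(name = "name")
  String getName();
  void setName(String name);

  @Column(name = "id")
  long getId();
  void setId(long id);
}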

Distribution-Aware Transactions (DAT)
NDB supports distribution-aware transactions by specifying a transaction hint at the start of the transaction. This enlists the transaction with a transaction coordinator on the database node that holds the data required for the transaction, which reduces the latency of the database operation. The choice of the transaction hint is based on the application-defined partitioning scheme. Incorrect hints do not affect the correctness of the operation, since the transaction coordinator (TC) will route the requests to the NDB datanode that holds the data; however, it will incur additional network traffic.
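The sketch below shows how such a hint might be set with ClusterJ before a transaction begins, assuming a table partitioned by the parent inode ID as in the previous sketch; the method and variable names are illustrative.

import com.mysql.clusterj.Session;

public class DistributionAwareTx {
  // Enlist the next transaction on the NDB datanode that owns the partition holding
  // the children of directory dirInodeId. A wrong hint is still correct, only slower.
  public static void runOnHomePartition(Session session, Class<?> inodeTable,
                                        long dirInodeId, Runnable work) {
    session.setPartitionKey(inodeTable, dirInodeId); // hint must be set before begin()
    session.currentTransaction().begin();
    try {
      work.run();                                    // reads/writes against this partition
      session.currentTransaction().commit();
    } catch (RuntimeException e) {
      session.currentTransaction().rollback();
      throw e;
    }
  }
}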

Primary Key (PK) Operation
A PK operation is the most efficient operation in NDB; it reads/writes/updates a single row stored in a database shard.

Batched Primary Key Operations
Batch operations use batching of PK operations to enable higher throughput while efficiently using the network bandwidth.
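The sketch below illustrates batched primary-key reads with ClusterJ, where load() queues each read and flush() executes all of them in a single round trip; the helper is generic, and composite primary keys are passed as Object arrays. It is a minimal sketch, not HopsFS’s actual batching code.

import com.mysql.clusterj.Session;
import java.util.ArrayList;
import java.util.List;

public class BatchedPkRead {
  /** Reads one row per primary key; each key is an Object[] for composite keys. */
  public static <T> List<T> batchRead(Session session, Class<T> table, List<Object[]> keys) {
    List<T> rows = new ArrayList<>();
    session.currentTransaction().begin();
    for (Object[] key : keys) {
      T row = session.newInstance(table, key); // sets only the primary key fields
      session.load(row);                       // queue the read, nothing sent yet
      rows.add(row);
    }
    session.flush();                           // single batched round trip to NDB
    rows.removeIf(row -> !Boolean.TRUE.equals(session.found(row))); // drop keys that did not exist
    session.currentTransaction().commit();
    return rows;
  }
}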

Partition-Pruned Index Scan (PPIS)
PPIS is a distribution-aware operation that exploits the distribution-aware transactions and application-defined partitioning features of NDB to implement a scalable index scan operation, such that the scan is performed on a single database partition.
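A partition-pruned index scan combines the two previous features: the partition key hint pins the transaction to one partition, and the query predicate on the partitioning column restricts the scan to that partition. The sketch below, again using ClusterJ, lists the children of a directory under the illustrative inodes mapping (partitioned by parentId) sketched earlier; it is an assumption-laden example, not HopsFS’s actual code.

import com.mysql.clusterj.Query;
import com.mysql.clusterj.Session;
import com.mysql.clusterj.query.QueryBuilder;
import com.mysql.clusterj.query.QueryDomainType;
import java.util.List;

public class PartitionPrunedScan {
  public static <T> List<T> childrenOf(Session session, Class<T> inodeTable, long dirInodeId) {
    session.setPartitionKey(inodeTable, dirInodeId);        // distribution-aware hint
    session.currentTransaction().begin();
    QueryBuilder qb = session.getQueryBuilder();
    QueryDomainType<T> dom = qb.createQueryDefinition(inodeTable);
    dom.where(dom.get("parentId").equal(dom.param("pid"))); // predicate on the partition column
    Query<T> query = session.createQuery(dom);
    query.setParameter("pid", dirInodeId);
    List<T> children = query.getResultList();               // scan touches a single partition
    session.currentTransaction().commit();
    return children;
  }
}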

Distributed Index Scan (DIS)
A DIS is an index scan operation that executes on all database shards. It is not a distribution-aware operation, so it becomes increasingly slow as the cluster grows.

Distributed Full Table Scan (DFTS)
DFTS operations are not distribution aware and do not use any index; thus, they read all the rows of a table stored in all database shards. DFTS operations incur high CPU and network costs; therefore, they should be avoided when implementing any database application.

Comparing Different Database Operations

Figure 4 shows a comparison of the throughput and latency of the different database operations. These microbenchmarks were performed on a four-node NDB cluster running NDB version 7.5.6. All the experiments were run on premise using Dell PowerEdge R730xd servers (Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40 GHz, 256 GB RAM, 4 TB 7200 RPM HDD) connected using a single 10 GbE network adapter. We ran 400 concurrent database clients, which were uniformly distributed across 20 machines. The experiments were run in two sets: the first set consists of read-only operations, and the second set consists of read-write operations, where all the data that is read is updated and stored back in the database.

Distributed full table scan and distributed index scan operations do not scale, as they have the lowest throughput and the highest end-to-end latency of all the database operations; therefore, these operations must be avoided when implementing file system operations. Primary key, batched primary key, and partition-pruned index scan operations are scalable database operations that deliver very high throughput and have low end-to-end operational latency.

The design of the database schema, that is, how the data is laid out in different tables, the design of the tables’ primary keys, the types and number of indexes on different columns, and the data partitioning scheme for the tables, plays a significant role in choosing an appropriate (efficient) database operation to read/update the data, as discussed in detail in the following sections.

HopsFS Distributed Metadata

In HopsFS the metadata is stored in the database in normalized form, that is, instead of storing the full file path with each inode as in Thomson and Abadi (2015), HopsFS stores individual file path components. Storing the normalized data in the database has many advantages; for example, the rename file system operation can be efficiently implemented by updating the single row in the database that represents the inode.


HopsFS: Scaling Hierarchical File System Metadata Using NewSQL Databases, Fig. 4 The comparison of the throughput and latency of different database operations (PK, Batch, PPIS, DIS, and DFTS; 1, 5, and 10 rows read or updated per operation; latency shown on a log scale). (a) Read-only operations’ throughput. (b) Read-only operations’ latency. (c) Read-write operations’ throughput. (d) Read-write operations’ latency


HopsFS: Scaling Hierarchical File System Metadata Using NewSQL Databases, Fig. 5 Entity relational diagram for the HopsFS schema

If complete file paths were stored with each inode, it would not only consume a significant amount of database storage, but a simple directory rename operation would also require updating the file paths of all the children of that directory subtree.

The key tables in the HopsFS metadata schema are shown in the entity relational (ER) diagram in Fig. 5. The inodes for files and directories are stored in the inodes table. Each inode is represented by a single row in the table, which also stores information such as the inode ID, name, parent inode ID, permission attributes, user ID, and size. As HopsFS does not store complete file paths with each inode, the inode’s parent ID is used to traverse the inodes table to discover the complete path of the inode. The extended metadata for the inodes is stored in the extended metadata entity. The quota table stores information about the disk space consumed by directories with quota restrictions enabled. When a client wants to update a file, it obtains a lease on the file, and the file lease information is stored in the lease table. Small files that are a few kilobytes in size are stored in the database for low-latency access; such files are stored in the small files’ data block tables.

Non-empty files may contain multiple blocks stored in the blocks table. The location of each replica of a block is stored in the replicas table. During its life cycle, a block goes through various phases. Blocks may become under-replicated if a datanode fails, and such blocks are stored in the under-replicated blocks table. HopsFS periodically checks the health of all the data blocks and ensures that the number of replicas of each block does not fall below the threshold required to provide high availability of the data.


The replication monitor sends commands to datanodes to create more replicas of the under-replicated blocks. Blocks undergoing replication are stored in the pending replication blocks table. Similarly, a replica of a block has various states during its life cycle. When a replica gets corrupted due to disk errors, it is moved to the corrupted replicas table. The replication monitor will create additional copies of the corrupted replicas by copying healthy replicas of the corresponding blocks, and the corrupted replicas are moved to the invalidated replicas table for deletion. Similarly, when a file is deleted, the replicas of the blocks belonging to the file are moved to the invalidated replicas table. Periodically, the namenodes read the invalidated replicas table and send commands to the datanodes to permanently remove the replicas from the datanodes’ disks. Whenever a client writes to a new block replica, this replica is moved to the replica under construction table. If too many replicas of a block exist (e.g., due to the recovery of a datanode that contains blocks that were re-replicated), the extra copies are stored in the excess replicas table.

Metadata Partitioning

The choice of partitioning scheme for the hierarchical namespace is a key design decision for distributed metadata architectures. HopsFS uses user-defined partitioning for all the tables stored in NDB. We base our partitioning scheme on the expected relative frequency of HDFS operations in production deployments and on the cost of the different database operations that can be used to implement the file system operations. Read-only metadata operations, such as list, stat, and file read operations, alone account for ~95% of the operations in Spotify’s HDFS cluster (Niazi et al. 2017). Based on the microbenchmarks of the different database operations, we have designed the metadata partitioning scheme such that frequent file system operations are implemented using the scalable database operations, that is, primary key, batched primary key, and partition-pruned index scan operations. Distributed index scans and full table scans are avoided, where possible, as they touch all database partitions and scale poorly.

HopsFS partitions inodes by their parents’ inode IDs, so inodes with the same parent inode are stored in the same database partition. This enables an efficient implementation of the directory listing operation: when listing the files in a directory, we use a hinting mechanism to start the transaction on a transaction coordinator located on the database partition that holds the child inodes for that directory, and we can then use a partition-pruned index scan to retrieve the contents of the directory locally. File inode-related metadata, that is, blocks and replicas and their states, is partitioned by the file’s inode ID. This results in all the metadata for a given file being stored in a single database partition, again enabling efficient file operations. Note that the file inode-related entities also contain the inode’s foreign key (not shown in the entity relational diagram, Fig. 5), which is also the partition key, enabling HopsFS to read the file inode-related metadata using partition-pruned index scans.

It is important that the partitioning scheme distributes the metadata evenly across all the database partitions; an uneven distribution leads to poor database performance. It is not uncommon for big data applications to create tens of millions of files in a single directory (Ren et al. 2013; Patil et al. 2007). The HopsFS metadata partitioning scheme does not lead to hot spots, as the database partition that contains the immediate children (files and directories) of a parent directory stores only their names and access attributes, while the inode-related metadata (such as blocks, replicas, and associated states) that comprises the majority of the metadata is spread across all the database partitions, because it is partitioned by the inode IDs of the files. For hot files/directories, that is, files/directories that are frequently accessed by many concurrent clients, the performance of the file system will be limited by the performance of the database partitions that hold the data for those hot files/directories.


HopsFS Transactional Operations

File system operations in HopsFS are divided into two main categories: non-recursive file system operations that operate on a single file or directory, also called inode operations, and recursive file system operations that operate on directory subtrees which may contain a large number of descendants, also called subtree operations. The file system operations are divided into these two categories because databases, for practical reasons, put a limit on the number of read and write operations that can be performed in a single database transaction, that is, a large file system operation may take so long to complete that the database times out the transaction before the operation finishes. For example, creating a file, mkdir, and deleting a file are inode operations that read and update a limited amount of metadata in each transactional file system operation. However, recursive file system operations such as delete, move, and chmod on large directories may read and update millions of rows. As subtree operations cannot be performed in a single database transaction, HopsFS breaks such large file system operations into smaller transactions. Using application-controlled locking for the subtree operations, HopsFS ensures that the semantics and consistency of the file system operation remain the same as in HDFS.

Inode Hint Cache

In HopsFS all file system operations are path based, that is, they operate on file/directory paths. When a namenode receives a file system operation, it traverses the file path to ensure that the path is valid and that the client is authorized to perform the given operation. To recursively resolve the file path, HopsFS uses primary key operations to read each constituent inode one after the other. HopsFS partitions the inodes table using the parent ID column, that is, the constituent inodes of a file path may reside on different database partitions. In production deployments, such as at Spotify, the average file system path length is 7; that is, on average it would take 7 round trips to the database to resolve a file path, which would increase the end-to-end latency of the file system operation. However, if the primary keys of all the constituent inodes of the path are known to the namenode in advance, then all the inodes can be read from the database in a single batched primary key operation. For reading multiple rows, batched primary key operations scale better than iterative primary key operations and have lower end-to-end latency. Recent studies have shown that in production deployments, such as at Yahoo, file system access patterns follow a heavy-tailed distribution, that is, only 3% of the files account for 80% of the file system operations (Abad 2014); therefore, it is feasible for the namenodes to cache the primary keys of recently accessed inodes. The namenodes store the primary keys of the inodes table in a local least recently used (LRU) cache called the inode hints cache. We use the inode hints cache to resolve frequently used file paths using a single batched query. The inode hints cache does not require a cache coherency mechanism to keep it consistent across the namenodes: if the cache contains invalid primary keys, the batch operation fails, and the namenode falls back to recursively resolving the file path using primary key operations. After recursively reading the path components from the database, the namenode also updates the inode hints cache. Additionally, only the move operation changes the primary key of an inode, and move operations are a tiny fraction of production file system workloads; at Spotify, for example, less than 2% of the file system operations are file move operations.
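The sketch below outlines this resolution strategy: try a single batched primary-key read using cached keys, and fall back to component-by-component resolution on a cache miss or an invalid hint. MetadataStore, Inode, and the cache size are hypothetical placeholders rather than HopsFS’s actual classes.

import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class PathResolver {
  /** LRU cache from the full path of a component to its cached primary key (parentId, name). */
  private final Map<String, Object[]> inodeHints =
      new LinkedHashMap<String, Object[]>(10_000, 0.75f, true) {
        @Override protected boolean removeEldestEntry(Map.Entry<String, Object[]> e) {
          return size() > 10_000;
        }
      };
  private final MetadataStore store;

  public PathResolver(MetadataStore store) { this.store = store; }

  public List<Inode> resolve(String[] components) {
    List<Object[]> hintedKeys = lookupHints(components);
    if (hintedKeys != null) {
      // Fast path: one batched primary-key read for all path components.
      List<Inode> inodes = store.batchReadByPrimaryKey(hintedKeys);
      if (inodes.size() == components.length) return inodes; // all hints were valid
    }
    // Slow path: resolve the path one primary-key read at a time (root downwards)
    // and refresh the hint cache for subsequent operations.
    List<Inode> inodes = new ArrayList<>();
    long parentId = MetadataStore.ROOT_INODE_ID;
    String path = "";
    for (String name : components) {
      Inode inode = store.readByPrimaryKey(parentId, name);
      if (inode == null) break;                 // path does not exist beyond this point
      path = path + "/" + name;
      inodeHints.put(path, new Object[] { parentId, name });
      inodes.add(inode);
      parentId = inode.id();
    }
    return inodes;
  }

  private List<Object[]> lookupHints(String[] components) {
    List<Object[]> keys = new ArrayList<>();
    String path = "";
    for (String name : components) {
      path = path + "/" + name;
      Object[] key = inodeHints.get(path);
      if (key == null) return null;             // cache miss: fall back to recursive resolution
      keys.add(key);
    }
    return keys;
  }

  /** Hypothetical database access layer used by the namenode. */
  public interface MetadataStore {
    long ROOT_INODE_ID = 1L;
    List<Inode> batchReadByPrimaryKey(List<Object[]> keys);
    Inode readByPrimaryKey(long parentId, String name);
  }
  /** Hypothetical inode record. */
  public interface Inode { long id(); }
}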

Inode Operations

Inode operations go through the following phases; a code sketch of the overall flow is given after the lists below.

Pre-transaction Phase
• The namenode receives a file system operation from the client.
• The namenode checks the local inode cache, and if the file path components are found in the cache, it creates a batch operation to read all the path components.


Transactional File System Operation
• The inode cache is also used to set the partition key hint for the transactional file system operation. The inode ID of the last valid path component is used as the partition key hint.
• The transaction is enlisted on the desired NDB datanode according to the partition hint. If the namenode is unable to determine the partition key hint from the inode cache, the transaction is enlisted on a random NDB datanode.
• The batched primary key operation is sent to the database to read all the file path inodes. If the batched primary key operation fails due to invalid primary key(s), the file path is recursively resolved. After resolving the file path components, the namenode also updates the inode cache for future operations.
• The last inode in the file path is read using the appropriate lock, that is, read-only file system operations take a shared lock on the inode, while file system operations that need to update the inode take an exclusive lock.
• Depending on the file system operation, some or all of the secondary metadata for the inode is read using partition-pruned index scan operations. The secondary metadata for an inode includes quota, extended metadata, leases, small files’ data blocks, blocks, replicas, etc., as shown in the entity relational diagram of the file system; see Fig. 5.
• After all the data has been read from the database, the metadata is stored in a per-transaction cache. The file system operation is performed, reading and updating the data stored in the per-transaction cache. All the changes to the metadata are also stored in the per-transaction cache.
• After the file system operation successfully finishes, all the metadata updates are sent to the database to persist the changes for later operations.
• If there are no errors, the transaction is committed; otherwise the transaction is rolled back.

Post Transaction
• After the transactional file system operation finishes, the results are sent back to the client.
• Additionally, the namenode updates its metrics about the file system operations, such as the frequency of the different types of file system operations and the duration of the operations.
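Putting the phases together, the following sketch shows the overall shape of a transactional inode operation as described above; all of the types are hypothetical placeholders rather than HopsFS’s actual classes.

public final class InodeOperationTemplate {
  /** Hypothetical handle for a database transaction that accepts a partition key hint. */
  public interface TxSession {
    void beginWithPartitionHint(Object partitionKeyHint);
    void commit();
    void rollback();
  }
  /** Hypothetical description of one inode operation. */
  public interface Operation<T> {
    Object partitionKeyHint();       // e.g., inode ID of the last cached path component
    void readPhase(TxSession tx);    // batched PK reads + partition-pruned index scans
    T execute();                     // operate on the per-transaction cache in memory
    void writePhase(TxSession tx);   // persist dirty rows back to the database
  }

  public static <T> T run(TxSession tx, Operation<T> op) {
    tx.beginWithPartitionHint(op.partitionKeyHint());
    try {
      op.readPhase(tx);              // phase 1: read all required metadata
      T result = op.execute();       // phase 2: perform the operation on cached metadata
      op.writePhase(tx);             // phase 3: send the modified metadata back
      tx.commit();
      return result;
    } catch (RuntimeException e) {
      tx.rollback();
      throw e;
    }
  }
}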

Subtree Operations

Recursive file system operations that operate on large directories are too large to fit in a single database transaction; for example, it is not possible to delete hundreds of millions of files in a single database transaction. HopsFS uses a subtree operations protocol to perform such large file system operations. The subtree operations protocol uses application-level locking to mark and isolate the inodes. HopsFS subtree operations are designed in such a way that the consistency and operational semantics of the file system operations remain compatible with HDFS. The subtree operations execute in three phases, as described below.

Phase 1 – Locking the Subtree: In the first phase, the directory is isolated using application-level locks. Using an inode operation, an exclusive lock is obtained for the directory, and the subtree lock flag is set for the inode. The subtree lock flag also contains other information, such as the ID of the namenode that holds the lock, the type of the file system operation, and the full path of the directory. The flag is an indication that all the descendants of the subtree are locked with an exclusive (write) lock. It is important to note that, during path resolution, inode and subtree operations that encounter an inode with a subtree lock turned on voluntarily abort the transaction and wait until the subtree lock is removed.

Phase 2 – Quiescing the Subtree: Before the subtree operation updates the descendants of the subtree, it is imperative to ensure that none of the descendants are locked by any other concurrent subtree or inode operation. Setting the subtree lock before checking whether any descendant directory is also locked for a subtree operation could result in multiple active subtree operations on the same inodes, which would compromise the consistency of the namespace. We store all active subtree operations in a table and query it to ensure that no subtree operations are executing at lower levels of the subtree. In a typical workload, this table does not grow too large, as subtree operations are usually only a tiny fraction of all file system operations. Additionally, we wait for all ongoing inode operations to complete by taking and releasing database write locks on all inodes in the subtree in the same total order used to lock inodes. This is repeated down the subtree to the leaves, and a tree data structure containing the inodes of the subtree is built in memory at the namenode. The tree is later used by some subtree operations, such as move and delete, to process the inodes. If the subtree operations protocol fails to quiesce the subtree due to concurrent file system operations on the subtree, it is retried with exponential backoff.

Phase 3 – Executing the FS Operation: Finally, after locking the subtree and ensuring that no other file system operation is operating on the same subtree, it is safe to perform the requested file system operation. In this last phase, the subtree operation is broken into smaller operations, which are run in parallel to reduce the latency of the subtree operation. For example, when deleting a large directory, the contents of the directory are deleted in batches of multiple files, starting from the leaves of the directory subtree.
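The sketch below outlines the three phases for a subtree delete; SubtreeStore, the batch size, and the backoff values are hypothetical placeholders rather than HopsFS’s actual code.

public class SubtreeDelete {
  /** Hypothetical access layer for subtree locks and batched deletes. */
  public interface SubtreeStore {
    boolean trySetSubtreeLock(long dirInodeId, long namenodeId); // phase 1, via an inode operation
    boolean hasActiveSubtreeOpBelow(long dirInodeId);            // query the active-operations table
    void waitForOngoingInodeOps(long dirInodeId);                // take/release write locks in total order
    java.util.List<Long> childrenBottomUp(long dirInodeId);      // leaves first
    void deleteBatch(java.util.List<Long> inodeIds);             // one small transaction per batch
    void clearSubtreeLock(long dirInodeId);
  }

  public static void delete(SubtreeStore store, long dirInodeId, long namenodeId)
      throws InterruptedException {
    long backoffMs = 100;
    while (true) {
      // Phase 1: lock the subtree root with an exclusive lock and set the subtree flag.
      if (store.trySetSubtreeLock(dirInodeId, namenodeId)) {
        // Phase 2: quiesce - no subtree operation below us, and ongoing inode ops drained.
        if (!store.hasActiveSubtreeOpBelow(dirInodeId)) {
          store.waitForOngoingInodeOps(dirInodeId);
          break;
        }
        store.clearSubtreeLock(dirInodeId);
      }
      Thread.sleep(backoffMs);                       // retry with exponential backoff
      backoffMs = Math.min(backoffMs * 2, 10_000);
    }
    try {
      // Phase 3: execute the operation as many small transactions, leaves first.
      java.util.List<Long> inodes = store.childrenBottomUp(dirInodeId);
      for (int i = 0; i < inodes.size(); i += 1000) {
        store.deleteBatch(inodes.subList(i, Math.min(i + 1000, inodes.size())));
      }
    } finally {
      store.clearSubtreeLock(dirInodeId);
    }
  }
}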

Handling Failed Subtree Operation

As the locks for subtree operations are controlled by the application (i.e., the namenodes), it is important that subtree locks are cleared when a namenode holding a subtree lock fails. HopsFS uses a lazy approach for cleaning up the subtree locks left by failed namenodes. The subtree lock flag contains information about the namenode that holds the subtree lock, and, using the group membership service, each namenode maintains a list of all the active namenodes in the system. During the execution of a file system operation, when a namenode encounters an inode with a subtree lock flag set that does not belong to any live namenode, the subtree lock flag is reset.

It is imperative that failed subtree operations do not leave the file system metadata in an inconsistent state. For example, the deletion of a subtree progresses from the leaves to the root of the subtree; if the recursive delete operation fails, the inodes of the partially deleted subtree remain connected to the namespace. When a subtree operation fails due to a namenode failure, the client resubmits the operation to another live namenode to complete the file system operation.

Storing Small Files in the Database

HopsFS supports two file storage layers, in contrast to the single file storage service in HDFS: HopsFS can store data blocks on HopsFS datanodes as well as in the distributed database. Large file blocks are stored on the HopsFS datanodes, while small files (≤64 KB) are stored in the distributed database. The technique of storing the data blocks with the metadata is also called inode stuffing. In HopsFS, an average file requires 1.5 KB of metadata. As a rule of thumb, if the size of a file is less than the size of its metadata (in our case, 1 KB or less), then the data block is stored in memory with the metadata. Other small files are stored in on-disk data tables. The latest solid-state drives (SSDs) are recommended for storing small files’ data blocks, as typical workloads produce a large number of random disk reads/writes of small amounts of data.

Inode stuffing has two main advantages. First, it simplifies the file system operation protocols for reading/writing small files, that is, many network round trips between the client and the datanodes are avoided, significantly reducing the expected latency of operations on small files. Second, it reduces the number of blocks that are stored on the datanodes, which reduces the block reporting traffic at the namenodes. For example, when a client sends a request to the namenode to read a file, the namenode retrieves the file’s metadata from the database. In the case of a small file, the namenode also fetches the data block from the database and returns the file’s metadata along with the data block to the client. Compared to HDFS, this removes the additional step of establishing a validated, secure communication channel with the datanodes (Kerberos, TLS/SSL sockets, and a block token are all required for secure client-datanode communication), resulting in lower latencies for file read operations (Salman Niazi et al. 2017).
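The sketch below illustrates this read path with inode stuffing: the namenode returns the data inline for small files and block locations for large files. All of the types are hypothetical placeholders rather than HopsFS’s actual classes.

public class SmallFileRead {
  public static final int SMALL_FILE_MAX_BYTES = 64 * 1024; // threshold described above

  /** Hypothetical result of an open/read request. */
  public static final class Response {
    public final FileMeta meta;
    public final byte[] inlineData;                     // non-null only for small files
    public final java.util.List<BlockLocation> blocks;  // empty for small files
    Response(FileMeta meta, byte[] inlineData, java.util.List<BlockLocation> blocks) {
      this.meta = meta; this.inlineData = inlineData; this.blocks = blocks;
    }
  }
  public interface FileMeta { long inodeId(); long size(); }
  public interface BlockLocation {}
  public interface MetadataStore {
    FileMeta readFileMeta(String path);
    byte[] readSmallFileData(long inodeId);             // in-memory or on-disk data tables
    java.util.List<BlockLocation> readBlockLocations(long inodeId);
  }

  public static Response open(MetadataStore store, String path) {
    FileMeta meta = store.readFileMeta(path);
    if (meta.size() <= SMALL_FILE_MAX_BYTES) {
      // Small file: ship the data block with the metadata, skipping the datanodes and
      // the secure channel setup (Kerberos, TLS/SSL, block token) a datanode read needs.
      return new Response(meta, store.readSmallFileData(meta.inodeId()),
                          java.util.Collections.emptyList());
    }
    // Large file: the client reads the blocks from the datanodes as in HDFS.
    return new Response(meta, null, store.readBlockLocations(meta.inodeId()));
  }
}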

Extended Metadata

File systems provide basic attributes for files/directories such as name, size, and modification time. However, in many cases, users and administrators require the ability to tag files/directories with extended attributes for later use. Moreover, rich metadata has become an invaluable feature, especially with the continuous growth of the data volumes stored in scale-out file systems such as HopsFS. As discussed earlier, HopsFS stores all the file system metadata in a database, where each file/directory (inode) is identified by a primary key. Therefore, creating an extended metadata table with a foreign key to the inodes table should be sufficient to ensure the consistency and integrity of the metadata and the extended metadata.

HopsFS offers two modes for storing extended metadata: schema-based and schemaless extended metadata. In the schema-based approach, users have to create a predefined schema for their extended metadata, similar to an information schema in database systems, and then structure their data according to that schema. The predefined schema enables validation of the extended metadata before it is attached to the inodes. In the schemaless approach, on the other hand, users do not have to create a predefined schema beforehand; instead, they can attach arbitrary extended metadata that is stored in a self-contained format such as JSON.

Using either approach, schema-based or schemaless, users and administrators can tag their files/directories with extended metadata and later query for the files/directories based on those tags. For example, users can tag files with a description field and later search for the files that match (or partially match) a given free-text query. Administrators could also use the basic attributes, for example, file size to list all the big files in the cluster, or modification time to list all files that have not been modified within the last month. Depending on the query pattern, it might be efficient for some queries to run directly on the metadata database, and for others, especially free-text queries, to run on a specialized free-text search engine. The conventional wisdom in the database community has been that there is no one-size-fits-all solution; this encourages the use of different storage engines according to the query patterns needed, an approach known as polyglot persistence. Therefore, the metadata of the namespace should be replicated into different engines to satisfy different query requirements, such as free-text, point, range, and sophisticated join queries. We have developed a databus system, called ePipe (Ismail et al. 2017), that provides a strongly consistent, replicated metadata service as an extension to HopsFS and that delivers file system metadata and extended metadata to different storage engines in real time. Users and administrators can then query the different storage engines according to their query requirements.

Results

As HopsFS addresses how to scale out the metadata layer of HDFS, all our experiments are designed to comparatively test the performance and scalability of the namenode(s) in HDFS and HopsFS. Most of the available benchmarking utilities for Hadoop are MapReduce based and are primarily designed to determine the performance of a Hadoop cluster. These benchmarks are very sensitive to the MapReduce subsystem and are therefore not suitable for testing the performance and scalability of namenodes in isolation (Noll 2015). Inspired by the NNThroughputBenchmark for HDFS and the benchmark utility designed to test QFS (Ovsiannikov et al. 2013), we developed a distributed benchmark that spawns tens of thousands of file system clients, distributed across many machines, which concurrently execute file system (metadata) operations on the namenode(s).


HopsFS: Scaling Hierarchical File System Metadata Using NewSQL Databases, Fig. 6 Throughput of different file system operations in HopsFS and HDFS as the number of namenodes grows from 1 to 60. As HDFS supports only one active namenode, the throughput of each HDFS file system operation is represented by a horizontal (blue) line. (a) Mkdir (59.1X improvement). (b) Create empty file (56.2X). (c) Append file (49.0X). (d) Read file (getBlockLocations) (16.8X). (e) Stat dir (18.2X). (f) Stat file (17.2X). (g) List dir (10.7X). (h) List file (15.7X). (i) Rename file (39.6X). (j) Delete file (56.5X). (k) Chmod dir (39.3X). (l) Chmod file (67.4X). (m) Chown dir (37.2X). (n) Chown file (21.2X)


The benchmark utility can test the performance of both individual file system operations and file system workloads based on industrial workload traces. The benchmark utility is open source (Hammer-Bench 2016), and the experiments described here can be fully reproduced on AWS using Karamel and the cluster definitions found in Hammer-Bench (2016).

We ran all the experiments on premise using Dell PowerEdge R730xd servers (Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40 GHz, 256 GB RAM, 4 TB 7200 RPM HDDs) connected using a single 10 GbE network adapter. Unless stated otherwise, NDB, version 7.5.3, was deployed on 12 nodes configured to run using 22 threads each, and the data replication degree was 2. All other configuration parameters were set to default values. Moreover, in our experiments Apache HDFS, version 2.7.2, was deployed on 5 servers. We did not colocate the file system clients with the namenodes or with the database nodes. As we are only evaluating metadata performance, all the tests created files of zero length (similar to the NNThroughputBenchmark; Shvachko 2010). Testing with non-empty files requires an order of magnitude more HDFS/HopsFS datanodes and provides no further insight.

Figure 6 shows the throughput of different file system operations as a function of the number of namenodes in the system. HDFS supports only one active namenode; therefore, the throughput of HDFS is represented by a (blue) horizontal line in the graphs. The graphs make it clear that HopsFS outperforms HDFS for all file system operations and has significantly better performance than HDFS for the most common file system operations. For example, HopsFS improves the throughput of create file and read file operations by 56.2X and 16.8X, respectively.

Additional experiments for industrial workloads, write-intensive synthetic workloads, metadata scalability, and performance under failures are available in the HopsFS papers (Niazi et al. 2017; Ismail et al. 2017).

Conclusions

In this entry, we introduced HopsFS, an open-source, highly available, drop-in alternative for HDFS that provides highly available metadata that scales out in both capacity and throughput by adding new namenodes and database nodes. We designed HopsFS file system operations to be deadlock-free by meeting the two preconditions for deadlock freedom: all operations follow the same total order when taking locks, and locks are never upgraded. Our architecture supports a pluggable database storage engine, and other NewSQL databases could be used, provided they have good support for cross-partition transactions and techniques such as partition-pruned index scans. Most importantly, our architecture opens up the metadata in HDFS for tinkering: custom metadata only involves adding new tables and cleaner logic at the namenodes.

Cross-References

• Distributed File Systems
• In-memory Transactions
• Hadoop

References

Abad CL (2014) Big data storage workload characterization, modeling and synthetic generation. PhD thesis, University of Illinois at Urbana-Champaign

Guerraoui R, Raynal M (2006) A leader election protocol for eventually synchronous shared memory systems. In: The fourth IEEE workshop on software technologies for future embedded and ubiquitous systems, 2006 and the 2006 second international workshop on collaborative computing, integration, and assurance, SEUS 2006/WCCIA, pp 6–

Hammer-Bench (2016) Distributed metadata benchmark to HDFS. https://github.com/smkniazi/hammer-bench. [Online; Accessed 1 Jan 2016]

Ismail M, Gebremeskel E, Kakantousis T, Berthou G, Dowling J (2017) Hopsworks: improving user experience and development on hadoop with scalable, strongly consistent metadata. In: 2017 IEEE 37th international conference on distributed computing systems (ICDCS), pp 2525–2528


Ismail M, Niazi S, Ronström M, Haridi S, Dowling J (2017) Scaling HDFS to more than 1 million operations per second with HopsFS. In: Proceedings of the 17th IEEE/ACM international symposium on cluster, cloud and grid computing, CCGrid ’17. IEEE Press, Piscataway, pp 683–688

Niazi S, Haridi S, Dowling J (2017) Size matters: improving the performance of small files in HDFS. https://eurosys2017.github.io/assets/data/posters/poster09-Niazi.pdf. [Online; Accessed 30 June 2017]

Niazi S, Ismail M, Haridi S, Dowling J, Grohsschmiedt S, Ronström M (2017) HopsFS: scaling hierarchical file system metadata using NewSQL databases. In: 15th USENIX conference on file and storage technologies (FAST’17). USENIX Association, Santa Clara, pp 89–104

Noll MG (2015) Benchmarking and stress testing an Hadoop cluster with TeraSort, TestDFSIO & Co. http://www.michael-noll.com/blog/2011/04/09/benchmarking-and-stress-testing-an-hadoop-cluster-with-terasort-testdfsio-nnbench-mrbench/. [Online; Accessed 3 Sept 2015]

Ovsiannikov M, Rus S, Reeves D, Sutter P, Rao S, Kelly J (2013) The quantcast file system. Proc VLDB Endow 6(11):1092–1101

Patil SV, Gibson GA, Lang S, Polte M (2007) GIGA+: scalable directories for shared file systems. In: Proceedings of the 2nd international workshop on petascale data storage: held in conjunction with supercomputing ’07, PDSW ’07. ACM, New York, pp 26–29

Ren K, Kwon Y, Balazinska M, Howe B (2013) Hadoop’s adolescence: an analysis of hadoop usage in scientific workloads. Proc VLDB Endow 6(10):853–864

Niazi S, Berthou G, Ismail M, Dowling J (2015) Leader election using NewSQL systems. In: Proceedings of DAIS 2015. Springer, pp 158–172

Shvachko KV (2010) HDFS scalability: the limits to growth. Login Mag USENIX 35(2):6–16

Thomson A, Abadi DJ (2015) CalvinFS: consistent WAN replication and scalable metadata management for distributed file systems. In: 13th USENIX conference on file and storage technologies (FAST 15). USENIX Association, Santa Clara, pp 1–14