Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes


TRANSCRIPT

Page 1: Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes

© Hortonworks Inc. 2017

Scaling HDFS to Manage Billions of Files

with Distributed Storage Schemes

Jing Zhao

Tsz-Wo Nicholas Sze

April 6, 2017


Page 2: Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes


About Me

• Tsz-Wo Nicholas Sze, Ph.D.

– Software Engineer at Hortonworks

– PMC member/Committer of Apache Hadoop

– Active contributor and committer of Apache Ratis

– Ph.D. from University of Maryland, College Park

– MPhil & BEng from Hong Kong University of Sci & Tech


Page 3: Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes


• Jing Zhao, Ph.D.

– Software Engineer at Hortonworks

– PMC member/Committer of Apache Hadoop

– Active contributor and committer of Apache Ratis

– Ph.D. from University of Southern California

– B.E. from Tsinghua University, Beijing


Page 4: Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes


Agenda

• Current HDFS Architecture

• Namespace Scaling

• Container Architecture

– Containers

– Next Generation HDFS

– Ozone – Hadoop Object Store

– cBlock

• Current Development Status


Page 5: Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes


Current HDFS Architecture


Page 6: Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes


HDFS Architecture


[Diagram: the Namenode keeps the Namespace Tree (file path → block IDs) and the Block Map (block ID → block locations); Datanodes store the blocks (block ID → data) and send heartbeats and block reports to the Namenode. IO and storage scale horizontally across Datanodes.]
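To make the picture concrete, here is a minimal sketch of the two in-memory maps the diagram shows (hypothetical class and field names, not the actual NameNode code): the namespace tree maps a file path to its block IDs, and the block map maps each block ID to the datanodes reporting a replica.

```java
import java.util.*;

// Hypothetical sketch of the Namenode's two in-memory maps (not the real HDFS classes).
class NameNodeMetadata {
  // Namespace tree: file path -> ordered list of block IDs.
  private final Map<String, List<Long>> namespaceTree = new HashMap<>();
  // Block map: block ID -> datanodes currently holding a replica.
  private final Map<Long, Set<String>> blockMap = new HashMap<>();

  void addFile(String path, List<Long> blockIds) {
    namespaceTree.put(path, new ArrayList<>(blockIds));
  }

  // Called when a datanode's block report says it stores blockId.
  void recordReplica(long blockId, String datanode) {
    blockMap.computeIfAbsent(blockId, id -> new HashSet<>()).add(datanode);
  }

  // Read path: resolve a file to the locations of its blocks.
  List<Set<String>> getBlockLocations(String path) {
    List<Set<String>> locations = new ArrayList<>();
    for (long blockId : namespaceTree.getOrDefault(path, Collections.emptyList())) {
      locations.add(blockMap.getOrDefault(blockId, Collections.emptySet()));
    }
    return locations;
  }
}
```

Both maps live entirely in the Namenode heap, which is why the number of files and blocks, rather than raw capacity, becomes the scaling limit.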

Page 7: Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes


HDFS Layering

[Diagram: HDFS Federation layering – multiple Namenodes (NN-1 … NN-k … NN-n), each serving its own namespace (NS 1 … NS k … foreign NS n) and block pool (Pool 1 … Pool k … Pool n), share a common block storage layer of Datanodes (DN 1, DN 2, … DN m).]

Page 8: Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes


Scalability – What Does HDFS Do Well?

• HDFS NN stores all metadata in memory (as per GFS)

– Scales to large clusters (5k nodes), since all metadata is in memory

• 60K-100K tasks (large # of parallel ops) can share Namenode

• Low latency

• Large data if files are large

– Proof points of large data and large clusters

• Single organizations have over 600 PB in HDFS

• Single clusters with over 200 PB using federation


Metadata in memory is the strength of the original GFS and HDFS design, but it is also its weakness when scaling the number of files and blocks.

Page 9: Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes


Scalability – The Challenges

• Large number of files (> 350 million)

– The files may be small in size.

– NN’s strength has become a limitation

• Number of file operations

– Need to improve concurrency – move to multiple name servers

• HDFS Federation is the current solution

– Add NameNodes to scale number of files & operations

– Deployed at Twitter

• A 5000+ node cluster with three NameNodes (plans to grow to 10,000 nodes)

– Backported and used at Facebook to scale HDFS


Page 10: Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes


Scalability – Large Number of Blocks

• Block report processing

– Datanode block reports also become huge

– They take a long time to process

[Diagram: Datanodes send heartbeats and increasingly large block reports (b1, b2, b3, b5, …) to the Namenode.]

Page 11: Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes


Namespace Scaling


Page 12: Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes


Partial Namespace – Briefly

• Has been prototyped

– Benchmarks show that the model works well

– Most file systems keep only a partial namespace in memory, but not at this scale

• Hence cache-replacement policies for the working set are important (a sketch follows this list)

• In Big Data, you actively use only the last 3-6-12 months of your five/ten years of data => the working set is small

• Work in progress to get it into HDFS

• Partial Namespace has other benefits

– Faster NN startup – load in the working set as needed

– Allows n+k failover, since the entire namespace does not need to fit in memory

– Partial Namespace in Memory will allow multiple namespace volumes
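As a rough illustration of the working-set idea (a hypothetical sketch, not the actual prototype), a partial namespace can be modeled as an LRU cache in front of an on-disk metadata store: hot entries stay in memory, cold entries are evicted and faulted back in on demand.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch of a partial namespace: an LRU cache over an on-disk metadata store.
class PartialNamespaceCache<K, V> extends LinkedHashMap<K, V> {
  private final int maxEntries;              // size of the in-memory working set
  private final Function<K, V> diskLookup;   // loads an entry from the persistent store

  PartialNamespaceCache(int maxEntries, Function<K, V> diskLookup) {
    super(16, 0.75f, true);                  // access-order = LRU behaviour
    this.maxEntries = maxEntries;
    this.diskLookup = diskLookup;
  }

  @Override
  protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
    return size() > maxEntries;              // evict cold entries beyond the working set
  }

  V lookup(K key) {
    return computeIfAbsent(key, diskLookup); // cache miss: fault the entry in from disk
  }
}
```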


Page 13: Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes


Previous Talks on Partial Namespace

• Evolving HDFS to a Generalized Storage Subsystem

– Sanjay Radia, Jitendra Pandey (@Hortonworks)

– Hadoop Summit 2016

• Scaling HDFS to Manage Billions of Files with Key Value Stores

– Haohui Mai, Jing Zhao (@Hortonworks)

– Hadoop Summit 2015

• Removing the NameNode's memory limitation

– Lin Xiao (PhD student @ CMU, intern @ Hortonworks)

– Hadoop User Group 2013

Page 14: Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes


Container Architecture


Page 15: Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes


Containers

• Container – a storage unit

• Local block map

– Maps block IDs to local block locations (sketched below)

• Small in size

– 5GB or 32GB (configurable)


[Diagram: two containers, c1 and c2, each with its own block map (c1: b1, b3, b6; c2: b2, b7, b8).]
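A minimal sketch of a container as described above (hypothetical names, not the HDFS-7240 code): a small storage unit that carries its own block-ID-to-local-location map and enforces a configurable size limit.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a container: a small, self-describing storage unit.
class Container {
  private final long containerId;
  private final long maxSizeBytes;                                 // e.g. 5 GB or 32 GB, configurable
  private final Map<Long, String> localBlockMap = new HashMap<>(); // block ID -> local file/offset
  private long usedBytes;

  Container(long containerId, long maxSizeBytes) {
    this.containerId = containerId;
    this.maxSizeBytes = maxSizeBytes;
  }

  boolean hasRoom(long blockSize) {
    return usedBytes + blockSize <= maxSizeBytes;
  }

  void putBlock(long blockId, String localLocation, long blockSize) {
    if (!hasRoom(blockSize)) {
      throw new IllegalStateException("container " + containerId + " is full");
    }
    localBlockMap.put(blockId, localLocation);
    usedBytes += blockSize;
  }

  String locateBlock(long blockId) {
    return localBlockMap.get(blockId);                             // resolved locally, not by the Namenode
  }
}
```

Because the block map travels with its data, the container can be replicated as a unit and the central service only needs to track containers, not individual blocks.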

Page 16: Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes


Distributed Block Map

• The block map is moved from the namenode to datanodes

– The block map becomes distributed

– Entire container is replicated

– A datanode has multiple containers


[Diagram: container c1, together with its block map, is replicated as a unit across three Datanodes; each Datanode hosts multiple containers (e.g. c1, c3, c5 on one node; c1, c2, c4 on another; c2, c3, c6 on a third).]

Page 17: Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes


SCM – Storage Container Manager

[Diagram: the SCM keeps the Container Map (container ID → container locations); Datanodes holding containers (c1, c2, c3, c5, …) send heartbeats and container reports to the SCM.]

Page 18: Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes


Next Generation HDFS

[Diagram: the Namenode/SCM keeps the Namespace Tree (file path → block IDs and container IDs) and the Container Map (container ID → container locations); Datanodes holding the containers send heartbeats and container reports.]
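A rough sketch of how a read resolves in this architecture (hypothetical names): the namespace maps a file path to (container ID, local block ID) pairs, the container map resolves container IDs to datanode locations, and the block itself is then found through the chosen container's local block map.

```java
import java.util.*;

// Hypothetical sketch of next-generation HDFS lookups.
class NextGenLookup {
  // A block is now addressed by (containerId, localBlockId).
  static final class BlockRef {
    final long containerId;
    final long localBlockId;
    BlockRef(long containerId, long localBlockId) {
      this.containerId = containerId;
      this.localBlockId = localBlockId;
    }
  }

  // Namespace tree: file path -> block references.
  final Map<String, List<BlockRef>> namespaceTree = new HashMap<>();
  // SCM container map: container ID -> datanodes holding a replica of the container.
  final Map<Long, List<String>> containerMap = new HashMap<>();

  // Resolve a file to, for each block, the datanodes that can serve it.
  List<List<String>> resolve(String path) {
    List<List<String>> result = new ArrayList<>();
    for (BlockRef ref : namespaceTree.getOrDefault(path, Collections.emptyList())) {
      result.add(containerMap.getOrDefault(ref.containerId, Collections.emptyList()));
    }
    return result;  // the client then asks a datanode's container for localBlockId
  }
}
```

The central map now has one entry per container rather than one per block, which is what lets the block count scale out.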

Page 19: Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes


Billions of Files

• Current HDFS architecture

– Support 1 million blocks per datanode

– A 5k-node cluster can store 5 billion blocks.

• Next generation HDFS architecture

– Support up to 1 million blocks per container

• Provided that the total size of the blocks fits into the container

– A 5k-node cluster could have 1 million containers

– The cluster can store up to 1 trillion (small) blocks.

– HDFS can easily scale to manage billions of files!
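The back-of-the-envelope arithmetic behind these numbers, written out explicitly (the 200-containers-per-node figure is an assumption chosen to reach 1 million containers on 5,000 nodes):

```java
// Back-of-the-envelope capacity arithmetic from the slide.
public class CapacityMath {
  public static void main(String[] args) {
    long nodes = 5_000L;

    // Current architecture: ~1 million blocks tracked per datanode.
    long blocksPerNode = 1_000_000L;
    System.out.println("current: " + nodes * blocksPerNode + " blocks");            // 5 billion

    // Container architecture: ~200 containers per node, ~1 million blocks per container.
    long containersPerNode = 200L;
    long blocksPerContainer = 1_000_000L;
    long containers = nodes * containersPerNode;                                    // 1 million containers
    System.out.println("next-gen: " + containers * blocksPerContainer + " blocks"); // 1 trillion
  }
}
```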


Page 20: Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes


Ozone – Hadoop Object Store

• Store KV (key-value) pairs

– Similar to Amazon S3

• Need a Key Map – a key-to-container-id map

• Containers are partial object stores (partial KV maps)


[Diagram: the Ozone service keeps the Key Map (key → container IDs) and the Container Map (container ID → container locations); Datanodes holding the containers send heartbeats and container reports.]

Page 21: Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes


Challenge – Trillions of Key-Value Pairs

• Values (Objects) are distributed in datanodes

– 5k nodes can handle a trillion objects (no problem)

• Trillions of keys in the Key Map

– The Key Map becomes huge (TB in size)

– Cannot fit in memory – the same old problem

• Avoid storing all keys in the Key Map

– Hash partitioning

– Range partitioning

– Partitions can be split/merged


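A minimal illustration of the two partitioning schemes mentioned above (a hypothetical sketch, not the Ozone code): hash partitioning spreads keys uniformly across a fixed number of partitions, while range partitioning preserves key order and lends itself to splitting and merging.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of hash vs. range partitioning of a key space.
class KeyPartitioning {
  // Hash partitioning: key -> one of N partitions, uniformly.
  static int hashPartition(String key, int numPartitions) {
    return Math.floorMod(key.hashCode(), numPartitions);
  }

  // Range partitioning: split points -> partition id; preserves key order and allows splits/merges.
  static int rangePartition(String key, TreeMap<String, Integer> splitPoints) {
    Map.Entry<String, Integer> e = splitPoints.floorEntry(key);
    return e == null ? 0 : e.getValue();
  }
}
```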

Page 22: Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes


Challenge – Millions of Containers

• Object store

– Users may choose to use any key combinations

• Hash/Range partitioning

– Map a key to a key partition

– Each container stores the partial KV map according to the partitioning

• Need millions of containers (partitions) for trillions of keys

– User clients may read or write to any of these containers

• Users decide the key; the partitioning decides the partition of the key

– All containers need to send key reports to the SCM

• A scalability issue!


[Diagram: container c1 stores partition 1, c2 stores partition 2, c3 stores partition 3, …]

Page 23: Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes


Solution – Closed Containers

• Initially, a container is open for read and write

• Close the container

– once the container has reached a certain size, say 5GB or 32GB.

• Closed containers are immutable

– Cannot add new KV entries

– Cannot overwrite KV entries

– Cannot delete KV entries

– Cannot be reopened

• Open containers

– New KV entries are always written to open containers

– Only need a small number of open containers (hundreds/thousands)
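A minimal sketch of the open/closed life cycle (hypothetical names): writes are accepted only while the container is open, and a container that reaches its size threshold is closed and becomes immutable.

```java
// Hypothetical sketch of the open/closed container life cycle.
class ContainerState {
  enum State { OPEN, CLOSED }

  private State state = State.OPEN;
  private long usedBytes;
  private final long closeThresholdBytes;   // e.g. 5 GB or 32 GB

  ContainerState(long closeThresholdBytes) {
    this.closeThresholdBytes = closeThresholdBytes;
  }

  void write(long numBytes) {
    if (state == State.CLOSED) {
      throw new IllegalStateException("closed containers are immutable");
    }
    usedBytes += numBytes;
    if (usedBytes >= closeThresholdBytes) {
      state = State.CLOSED;                 // once closed, never reopened
    }
  }

  boolean isOpen() {
    return state == State.OPEN;
  }
}
```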


Page 24: Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes


Solution – Keychain Containers

• In order to support Closed containers

– Hash/range partitioning cannot be used directly

– New keys cannot be mapped to a closed container!

• Distribute the Key Map (key-to-container-id)

– Use hash partitioning to partition the Key Map

– Store a key partition in a Keychain container

• Keychain containers (partial key maps)

– Special system containers (with the same container implementation)

– 10GB can store 100 million entries

– Need only thousands of keychain containers

• One more map: a key-to-keychain-container map

– Very small in size (KB)

– The map is mostly constant (keychains do not split/merge often)

– Stored in SCM, cached everywhere
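Putting the maps together, a rough sketch of key resolution (hypothetical names, not the Ozone code): hash the key to pick a keychain container from the small, widely cached keychain map, then ask that keychain container for the ID of the data container holding the value.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of key resolution through keychain containers.
class KeychainLookup {
  // Small, mostly constant map: keychain partition -> keychain container ID (cached everywhere).
  private final Map<Integer, Long> keychainMap = new HashMap<>();
  // Contents of each keychain container: key -> data container ID (stored on datanodes, shown here in memory).
  private final Map<Long, Map<String, Long>> keychainContents = new HashMap<>();
  private final int numKeychainPartitions;

  KeychainLookup(int numKeychainPartitions) {
    this.numKeychainPartitions = numKeychainPartitions;
  }

  // Step 1: hash the key to find the keychain container.
  Long keychainContainerFor(String key) {
    int partition = Math.floorMod(key.hashCode(), numKeychainPartitions);
    return keychainMap.get(partition);       // may be null if the partition is unknown
  }

  // Step 2: the keychain container maps the key to the data container holding the value.
  Long dataContainerFor(String key) {
    Long keychain = keychainContainerFor(key);
    return keychainContents
        .getOrDefault(keychain, Collections.emptyMap())
        .get(key);
  }
}
```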


Page 25: Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes


Summary of the Mappings

• Ozone: a key-to-value map (PB or even EB in size)

– Distributed over containers

– Containers: partial key-to-value map (GB in size each)

• Key Map: key-to-container-id (TB in size)

– Distributed over keychain containers

– Keychain Containers: partial key-to-container-id map (GB in size each)

• Keychain Map: key-to-keychain-container (KB in size)

– Use hash partitioning

• Container Map: container-id-to-locations (GB in size)

– Cached in all the keychain containers


Page 26: Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes


Object Lookup

• Step 1

– Client computes the hash of the key, and

– finds the keychain container from the cache

– (on a cache miss, load the cache from the SCM)

• Step 2

– Client looks up the container locations from the keychain container

• Step 3

– The keychain container replies with the locations to the client

• Step 4

– Client reads the object from the container
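The same four steps from the client's point of view, as a hedged sketch (the interfaces below are illustrative only, not the actual Ozone client API):

```java
import java.util.List;
import java.util.Map;

// Hypothetical client-side view of the four lookup steps (names are illustrative only).
class OzoneLookupClient {
  interface ScmClient       { Map<Integer, Long> fetchKeychainMap(); }                  // step 1 (cache miss)
  interface KeychainClient  { List<String> lookupContainerLocations(long keychainId, String key); } // steps 2-3
  interface ContainerClient { byte[] readObject(List<String> locations, String key); }  // step 4

  private final ScmClient scm;
  private final KeychainClient keychain;
  private final ContainerClient container;
  private Map<Integer, Long> cachedKeychainMap;   // small, rarely changes

  OzoneLookupClient(ScmClient scm, KeychainClient keychain, ContainerClient container) {
    this.scm = scm;
    this.keychain = keychain;
    this.container = container;
  }

  byte[] get(String key, int numKeychainPartitions) {
    // Step 1: hash the key and consult the cached keychain map (load from the SCM on a miss).
    if (cachedKeychainMap == null) {
      cachedKeychainMap = scm.fetchKeychainMap();
    }
    // Assumes the cached map covers every keychain partition.
    long keychainId = cachedKeychainMap.get(Math.floorMod(key.hashCode(), numKeychainPartitions));
    // Steps 2-3: ask the keychain container where the object's container lives.
    List<String> locations = keychain.lookupContainerLocations(keychainId, key);
    // Step 4: read the object from one of the container replicas.
    return container.readObject(locations, key);
  }
}
```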


Page 27: Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes


Object Creation

• Step 1 (same as before)

– Client computes the hash of the key, and

– finds the keychain container from the cache

– (on a cache miss, load the cache from the SCM)

• Step 2

– Client sends a create request to the keychain container.

• Step 3

– The keychain container allocates the key to an open container, and

– replies with the container locations to the client.

• Step 4

– Client writes to the container.


Page 28: Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes


Container Replication

• Closed containers

– Replication or Erasure Coding

– The same way HDFS does for blocks

• Open containers are replicated by Raft

– Raft – a consensus algorithm

– Apache Ratis – an implementation of Raft

• More details in later slides


Page 29: Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes


cBlock

• A block device backed by containers

– highly available, replicated and fault tolerant

• Block device

– allows users to create normal file systems, such as ext4 or XFS, on top of it

• HDFS-11118: Block Storage for HDFS
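A rough sketch of the core idea (hypothetical names, not the HDFS-11118 code): with a fixed logical block size, the cBlock layer can translate a byte offset on the block device into the key of a backing object stored in some container.

```java
// Hypothetical sketch of how a cBlock volume could map device offsets onto containers.
class CBlockVolume {
  private final String volumeName;
  private final long logicalBlockSize;      // e.g. 4 KB device blocks

  CBlockVolume(String volumeName, long logicalBlockSize) {
    this.volumeName = volumeName;
    this.logicalBlockSize = logicalBlockSize;
  }

  // Translate a byte offset on the block device into the key of the backing object.
  String keyForOffset(long byteOffset) {
    long logicalBlockIndex = byteOffset / logicalBlockSize;
    return volumeName + "/block-" + logicalBlockIndex;   // stored as a KV pair in some container
  }
}
```

An ext4 or XFS file system created on top of such a device only sees a linear array of blocks; the containers stay hidden underneath.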


Page 30: Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes


Big Picture


[Diagram: Datanodes host both block containers (used by HDFS) and object store containers; Container Management Services running on the Datanodes provide cluster membership, replication management, and the container location service; applications such as HDFS, the object store, and HBase (metadata) share the same physical storage.]

Page 31: Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes


Current Development Status


Page 32: Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes


HDFS-7240 – Object store in HDFS

• The umbrella JIRA for Ozone, including the container framework

– 130 subtasks

– 103 subtasks resolved (as of April 5)

– Code contributors

• Anu Engineer, Arpit Agarwal, Chen Liang, Chris Nauroth, Kanaka Kumar Avvaru, Mukul Kumar Singh, Tsz Wo Nicholas Sze, Weiwei Yang, Xiaobing Zhou, Xiaoyu Yao, Yuanbo Liu


Page 33: Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes


HDFS-11118: Block Storage for HDFS

• The umbrella JIRA for additional work for cBlock

– 13 subtasks

– 10 subtasks resolved (as of April 5)

– Code contributor

• Chen Liang

• cBlock has already been deployed in Hortonworks’ QE environment for several months!


Page 34: Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes


Raft – A Consensus Algorithm

• “In Search of an Understandable Consensus Algorithm”

– The Raft paper by Diego Ongaro and John Ousterhout

– USENIX ATC’14

• “In Search of a Usable Raft Library”

– A long list of Raft implementations is available

– None of them is a general library ready to be consumed by other projects.

– Most of them are tied to another project or a part of another project.

• We need a Raft library!


Page 35: Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes


Apache Ratis – A Raft Library

• A brand new, incubating Apache project

– Open source, open development

– Apache License 2.0

– Written in Java 8

• Emphasis on pluggability

– Pluggable state machine

– Pluggable Raft log

– Pluggable RPC

• Supported RPC: gRPC, Netty, Hadoop RPC

• Users may provide their own RPC implementation

• Support high throughput data ingest

– for more general data replication use cases.
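To make "pluggable state machine" concrete, here is a hedged sketch of the replicated-state-machine pattern that Ratis implements (the interface below is illustrative, not the actual Ratis API): the library orders entries through Raft, and the application supplies a state machine that applies each committed entry in the same order on every replica.

```java
// Illustrative sketch of the replicated-state-machine pattern behind Ratis
// (the interface is hypothetical, not the actual Ratis API).
import java.util.HashMap;
import java.util.Map;

interface StateMachine {
  // Applied in committed-log order on every replica.
  String applyTransaction(String command);
}

// Example application state machine: a replicated key-value map.
class KeyValueStateMachine implements StateMachine {
  private final Map<String, String> store = new HashMap<>();

  @Override
  public String applyTransaction(String command) {
    String[] parts = command.split(" ", 3);           // "PUT key value" or "GET key"
    if (parts[0].equals("PUT") && parts.length == 3) {
      store.put(parts[1], parts[2]);
      return "OK";
    }
    if (parts[0].equals("GET") && parts.length == 2) {
      return store.getOrDefault(parts[1], "");
    }
    return "UNKNOWN";
  }
}
```

In the Ozone use case the replicated state would be an open container, so applying a committed entry corresponds to applying the write to the local container replica.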


Page 36: Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes


Apache Ratis – Use cases

• General use case:

– You already have a service running on a single server.

• You want to:

– replicate the server log/states to multiple machines

• The replication number/cluster membership can be changed at runtime

– have an HA (highly available) service

• When a server fails, another server will automatically take over.

• Clients automatically failover to the new server.

• Apache Ratis is for you!

• Use cases in Ozone/HDFS

– Replicating open containers (HDFS-11519, committed on 3. April)

– Support HA in SCM

– Replacing the current Namenode HA solution


Page 37: Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes


Apache Ratis – Development Status

• A brief history

– 2016-03: Project started at Hortonworks

– 2016-04: First commit “leader election (without tests)”

– 2017-01: Entered Apache incubation.

– 2017-03: Started preparing the first Alpha release (RATIS-53).

– 2017-04: Hadoop Ozone branch started using Ratis (HDFS-11519)!

• Committers

– Anu Engineer, Arpit Agarwal, Chen Liang, Chris Nauroth, Devaraj Das, Enis Soztutar, Hanisha Koneru, Jakob Homan, Jing Zhao, Jitendra Pandey, Li Lu, Mayank Bansal, Mingliang Liu, Tsz Wo Nicholas Sze, Uma Maheswara Rao G, Xiaobing Zhou, Xiaoyu Yao

• Contributions are welcome!

– http://incubator.apache.org/projects/ratis.html

[email protected]


Page 38: Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes


Thank You!
