
2011 Storage Developer Conference. © Cleversafe. All Rights Reserved.

Changing Requirements for Distributed File Systems in Cloud Storage

Wesley Leggette Cleversafe


Presentation Agenda

- About Cleversafe
- Scalability, our core driver
- Object storage as basis for filesystem technology
  - Namespace-based routing
  - Distributed transactions
  - Optimistic concurrency
- Designing an ultra-scalable filesystem
  - Filesystem operations on the object layer
- Conclusions


About Cleversafe

- We offer scalable storage solutions
  - Target market is massive storage (>10 PiB)
  - Information Dispersal Algorithms (erasure codes)
    - Reduce cost by avoiding replication overhead
    - Maximize reliability by tolerating many failures
- Object storage is our core product offering
- How do we translate this technology to the filesystem space?
  - Evolution from object storage concepts
  - Also influenced by distributed databases and P2P
  - The techniques we investigate are not unique to IDA


How Dispersed Storage Works

[Diagram: digital content is sliced and the slices are spread across Sites 1-4]

1. Digital assets are divided into slices using Information Dispersal Algorithms (IDA).
2. Slices are distributed to separate disks, storage nodes, and geographic locations.
3. A threshold number of slices is retrieved and used to regenerate the original content.

Total slices = "width" = N
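As a toy illustration of the threshold idea only (this is not Cleversafe's IDA, which uses erasure codes over much larger widths), the sketch below uses width 3 and threshold 2 with a single XOR parity slice: any two of the three slices rebuild the data.

```python
# Toy threshold dispersal: width N = 3, threshold T = 2, one XOR parity slice.
def disperse(data: bytes) -> list[bytes]:
    """Split data into two halves plus an XOR parity slice."""
    if len(data) % 2:
        data += b"\x00"                       # pad to even length for this toy example
    half = len(data) // 2
    a, b = data[:half], data[half:]
    parity = bytes(x ^ y for x, y in zip(a, b))
    return [a, b, parity]                     # slices 0, 1, 2

def rebuild(slices: dict[int, bytes]) -> bytes:
    """Recover the original data from any 2 of the 3 slices."""
    if 0 in slices and 1 in slices:
        a, b = slices[0], slices[1]
    elif 0 in slices:                         # slice 1 lost: b = a XOR parity
        a = slices[0]
        b = bytes(x ^ y for x, y in zip(a, slices[2]))
    else:                                     # slice 0 lost: a = b XOR parity
        b = slices[1]
        a = bytes(x ^ y for x, y in zip(b, slices[2]))
    return (a + b).rstrip(b"\x00")            # strip toy padding

slices = disperse(b"digital content")
print(rebuild({0: slices[0], 2: slices[2]}))  # b'digital content', even with slice 1 lost
```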


Access Methods

We sell two deployment models; both are "clients" in the context of this presentation.

- Simple Object HTTP
  - A standalone Accesser exposes an HTTP REST API and speaks the dsNet protocol to the object vault.
  - Multiple Accessers can be load balanced for increased throughput and availability.
  - The Accesser returns a unique 36-character object ID; the application server stores this ID and other metadata in its own database.
- Simple Object Client Library
  - Accesser functionality, including slicing and dispersal, is embedded in the client library (e.g., the Java client library), which speaks the dsNet protocol directly to the object vault.
  - The application server stores the object ID and other metadata in its own database.


Scalability – A Primary Requirement

- Big Data customers are petabyte to exabyte scale
- Scale-out architecture
  - Add storage capacity with commodity machines
  - Reduce costs: commodity hard drives
- Invariants
  - Reliability – keep data even as cheap disks fail
  - Availability – access data during node failures
  - Performance – linear performance growth


Scale Example

- Shutterfly
  - 10 PB Cleversafe dsNet storage system
  - All commodity hard drives
  - Single storage container for all photos
  - Tens of thousands of large photos stored per minute
    - Max capacity is many times this level
  - 14 access nodes for load-balanced read/write
  - No single point of failure
  - Linear performance growth with each new node
- This deployment uses the object storage product


Investigating Filesystem Space

- We have scalable object storage
  - Limitless capacity and performance growth
  - Fully concurrent read/write
- Some customers want the same from a filesystem
- Is this technically possible?
- What tradeoffs would have to be made?


Scale comes from homogeneity

- To scale out, we need to do so at each layer
  - Eliminate the central chokepoint for data operations
    - Central point of failure, central point of …
  - We accomplish this today with object storage
  - Consider the same concept in a filesystem

[Diagram: many clients talking directly to many storage nodes is scalable; the same clients funneled through a central metadata server is not scalable]


What approach can we take?

- Start with scalable transactional object storage
- Add a filesystem implementation on top

The client stack, top to bottom (slice servers sit underneath):

- Filesystem
- Object Layer – transactional object storage; check-and-write transactions
- Reliability Layer – IDA plus distributed-transaction client; ensures committed objects are reliable and consistent
- Namespace Layer – namespace-based storage routing; routes actual data storage with no central I/O manager
- Remote Session – session management (multi-path) to the slice servers
- Slice Servers


Namespace Layer



Traditional Centralized Routing

- A central controller directs traffic
  - Easier to implement, allows simple search
  - Detects conflicts, controls locking
- Does not scale out with the rest of the architecture
  - Today, a 10 PB system needs 90 45-disk nodes*
  - Those nodes can service 57,600 2 MB requests per second**
- Central point of failure = less availability

* 3 TB drives, some IDA overhead
** 10 Gbps NICs; nodes saturate wire speed

[Diagram: every request flows from clients through a single routing master (~10,000 req/s) before reaching the storage nodes (~640 req/s each), so the master saturates at roughly 15 storage servers]
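The slide's figures come from back-of-the-envelope arithmetic along these lines (the per-node request rate depends on assumptions about NIC efficiency; the values below simply reuse the slide's numbers):

```python
# Rough arithmetic behind the slide's figures.
TB, PB = 10**12, 10**15

raw_nodes = 10 * PB / (45 * 3 * TB)     # ~74 nodes of raw capacity; ~90 once IDA overhead is added
nodes = 90

node_req_s = 640                        # one 45-disk node saturating a 10 Gbps NIC with 2 MB requests
cluster_req_s = nodes * node_req_s      # 57,600 req/s the storage nodes could serve in aggregate

master_req_s = 10_000                   # what a single central routing master can handle
bottleneck = master_req_s // node_req_s # 15 -- beyond ~15 servers the master caps the whole system

print(round(raw_nodes), cluster_req_s, bottleneck)
```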


Namespace-based Routing

- The namespace concept comes from P2P systems
  - Chord, CAN, Kademlia
  - MongoDB, CouchDB are production examples
- Physical mapping is determined by a storage map (see the sketch after the diagram below)
  - Small data (<10 KiB) loaded at start-up
  - P2P systems use a dynamic overlay protocol instead
  - We'll have tens of thousands of nodes, not millions

[Diagram: Slicestors A-H arranged as a ring starting at index 0; a 4-wide vault maps slice indexes 0-3 onto four Slicestors, an 8-wide vault maps indexes 0-7 onto eight Slicestors]
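A minimal sketch of the idea, with assumed names (this is not the dsNet protocol): every client loads the same small storage map at start-up and computes slice placement locally, so there is no central lookup on the data path.

```python
# Namespace-based routing sketch: placement is a pure function of the name
# plus a small static storage map, so every client computes the same answer.
import hashlib

SLICESTORS = [f"slicestor-{c}.example.com" for c in "abcdefgh"]   # hypothetical 8-node ring
WIDTH = 8                                                          # 8-wide vault

def namespace_position(source_name: str) -> int:
    """Deterministically place a source name on the ring."""
    digest = hashlib.sha256(source_name.encode()).digest()
    return int.from_bytes(digest[:4], "big") % len(SLICESTORS)

def route_slices(source_name: str) -> list[tuple[str, str]]:
    """Map each of the WIDTH slices to a Slicestor, walking the ring from the start position."""
    start = namespace_position(source_name)
    return [(f"{source_name}:{i}", SLICESTORS[(start + i) % len(SLICESTORS)])
            for i in range(WIDTH)]

for slice_name, host in route_slices("vault-7/object-42"):
    print(slice_name, "->", host)
```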


Storing Data in a Namespace

- No central lookup for data I/O:
  1. Generate an "object id"
  2. Map it to storage
- With object storage, the object id goes into the application's database
- How do we map a file name to an object id?

[Diagram: an object ID becomes a source name, the source name expands into slice names, and the storage map routes each slice name to a Slicestor]
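To make the object-storage usage model concrete, here is a small sketch with hypothetical client methods: the store hands back a unique object id, and the application's own database remembers which id belongs to which application-level name.

```python
# Object-storage usage sketch (hypothetical client, not the real Accesser API):
# the store returns an object id; the application's database maps names to ids.
import uuid

class ObjectClient:
    def __init__(self):
        self.vault = {}

    def store(self, data: bytes) -> str:
        object_id = str(uuid.uuid4())        # a unique 36-character id, like the Accesser returns
        self.vault[object_id] = data         # stands in for slicing + dispersal
        return object_id

    def retrieve(self, object_id: str) -> bytes:
        return self.vault[object_id]

client = ObjectClient()
app_db = {}                                  # application metadata: name -> object id
app_db["vacation/photo-001.jpg"] = client.store(b"...jpeg bytes...")
print(client.retrieve(app_db["vacation/photo-001.jpg"]))
```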


Reliability Layer



Replication and Eventual Consistency

- Eventual consistency is often used with replication
  - Clients write new versions to the available nodes
  - Versions sync to the other replicas lazily
- The application is responsible for consistency
  - This is already true in filesystems
- Allows partition-tolerant systems

[Diagram: Client A writes V2, but one of the three replicas still holds V1 until a later repair copies V2 to it; a read by Client B in the meantime sees the old version]
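A toy illustration of that timeline (plain replication, not dispersal):

```python
# Eventual consistency with replication: a write lands on the replicas that
# are reachable; the rest catch up lazily during repair.
replicas = [{"rev": 1}, {"rev": 1}, {"rev": 1}]

def write(new_rev, reachable):
    for i in reachable:                       # Client A can only reach two replicas
        replicas[i]["rev"] = new_rev

def read(i):
    return replicas[i]["rev"]

def repair():
    latest = max(r["rev"] for r in replicas)  # lazy background sync
    for r in replicas:
        r["rev"] = latest

write(2, reachable=[0, 1])
print(read(2))    # 1 -- Client B still sees the old version
repair()
print(read(2))    # 2 -- consistent only eventually
```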


Dispersal Requires Consistency

- Dispersal doesn't store replicas
  - A threshold of slices is required to recover data
  - A crash during "unsafe" periods can cause loss
- Methods to prevent loss:
  - Three-phase distributed transaction
    - Commit: all revisions remain visible during the unsafe period
    - Finalize: clean up once the new version's commit is safe
  - Quorum-based voting
    - Writes fail if fewer than T slices succeed

[Diagram: width 4, threshold 3. As slices are overwritten one at a time (V1 V1 V1 V1 → V2 V1 V1 V1 → V2 V2 V1 V1 → V2 V2 V2 V1), the middle state is UNSAFE: neither revision has a threshold of slices]
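The unsafe window is easy to see in code; the check below (an illustrative sketch) asks which revision, if any, still has a threshold of slices:

```python
# A revision is "safe" only while at least a threshold of slice servers hold it.
from collections import Counter

WIDTH, THRESHOLD = 4, 3

def safe_revisions(slice_revisions):
    counts = Counter(slice_revisions)
    return [rev for rev, n in counts.items() if n >= THRESHOLD]

print(safe_revisions(["V1", "V1", "V1", "V1"]))  # ['V1']  safe
print(safe_revisions(["V2", "V1", "V1", "V1"]))  # ['V1']  safe
print(safe_revisions(["V2", "V2", "V1", "V1"]))  # []      UNSAFE window
print(safe_revisions(["V2", "V2", "V2", "V1"]))  # ['V2']  safe
```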


Three-Phase Commit Protocol

[Diagram: width 4, threshold 3; two slice servers fail mid-protocol]

- 2-phase commit protocol (write, then commit): the commit replaces V1 with V2 on each slice server. If failures interrupt the commit partway, some servers hold only V1 and others only V2, and neither revision may reach the threshold: commit failure causes loss.
- 3-phase commit protocol (write, commit, finalize/undo): the write stores V2 alongside V1, the commit makes V2 visible while V1 is retained, and finalize (or undo) removes the losing revision only after a threshold of commits has succeeded. Because both revisions still exist during the unsafe window, a failure can always be resolved back to V1 or forward to V2.
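A simplified sketch of the three-phase write with quorum voting (class and method names are assumed; error handling and the undo path are elided):

```python
# Simplified three-phase write with quorum voting (width 4, threshold 3).
WIDTH, THRESHOLD = 4, 3

class SliceServer:
    def __init__(self):
        self.visible = {}                       # revision -> slice data readers may see
        self.pending = {}                       # written but not yet committed

    def write(self, rev, data):                 # phase 1: store new revision alongside old
        self.pending[rev] = data
        return True

    def commit(self, rev):                      # phase 2: make it visible, keep old revisions
        self.visible[rev] = self.pending.pop(rev)
        return True

    def finalize(self, rev):                    # phase 3: clean up older revisions once safe
        self.visible = {r: d for r, d in self.visible.items() if r >= rev}
        return True

def three_phase_write(servers, rev, slices):
    for phase in ("write", "commit", "finalize"):
        acks = 0
        for server, data in zip(servers, slices):
            try:
                ok = server.write(rev, data) if phase == "write" else getattr(server, phase)(rev)
                acks += bool(ok)
            except Exception:
                pass                            # an unreachable server simply does not ack
        if acks < THRESHOLD:
            return False                        # quorum not reached: abort (a real client would undo)
    return True

servers = [SliceServer() for _ in range(WIDTH)]
print(three_phase_write(servers, 2, [b"s0", b"s1", b"s2", b"s3"]))   # True
```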


Consistent Transactional Interface

- The distributed transaction makes dispersal safe
  - It all happens in the client; no server-side coordination
- Write consistency
  - A side effect of distributed transactions
  - Writes either succeed or fail "atomically"
- Limitation: consistency = less partition tolerance
  - CAP theorem (we also choose availability)
  - Either the read or the write fails during a partition
  - Still "shardable": this affects availability, not scalability
- Is consistency useful for filesystem directories?



Object Layer



Write-if-absent for WORM

- Object storage is WORM
  - Enforced by the underlying storage
  - A write-if-absent model is built on transactions (see the sketch below)
    - Distributed transactions emulate atomicity
    - A "checked write" fails if a previous revision exists

[Diagram: Client A's "write V1 if previous = ∅" succeeds; Client B's later "write V1' if previous = ∅" fails because V1 already exists]
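A minimal sketch of the checked write, with assumed names (the real object layer enforces this through the distributed transaction, not a single in-memory map):

```python
# Checked ("write-if") operation: a new revision is accepted only when the
# caller's expected previous revision matches what is stored.
class CheckedObjectStore:
    def __init__(self):
        self.revisions = {}                  # object_id -> latest revision number

    def checked_write(self, object_id, expected_prev, new_rev):
        """Succeeds only when the stored revision equals expected_prev."""
        current = self.revisions.get(object_id)   # None means "absent"
        if current != expected_prev:
            return False                     # someone else wrote first
        self.revisions[object_id] = new_rev
        return True

store = CheckedObjectStore()
print(store.checked_write("photo-1", None, 1))   # Client A: write-if-absent -> True
print(store.checked_write("photo-1", None, 1))   # Client B: write-if-absent -> False
```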


Optimistic Concurrency Control

- It is easy to extend this model to multiple revisions
  - A write succeeds if and only if the last revision matches the given one
  - This is the basis for "optimistic concurrency"
- How do concurrent writers update a directory?

[Diagram: Client A writes V1 (if previous = ∅), then V2 (if previous = 1); both succeed. Client B's write of V2' (if previous = 1) fails because V2 already exists, so it reads the current state, redoes its action, and successfully writes V3 (if previous = 2)]


Filesystem Layer



Ultra-Scalable Filesystem Technology

- A filesystem layer on top of object storage
  - Scalable, no-master storage
  - Inherits reliability, security, and performance
- How do we map a file name to an object id?
- Is consistency useful for filesystem directories?
- How do concurrent writers update a directory?


Object-based directory tree

- How do we map a file name to an object id? Directories are stored as objects.
  - The filesystem structure is as reliable as the data
- Directory content data is a map of file name to object id
  - Each object id points to another object on the system
    - An id for content data
    - An id for metadata (xattr, etc.)
  - Data objects are WORM
    - Zero-copy snapshot support
    - Reference counting
- A well-known object id designates the "root" directory (see the sketch below)
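A sketch under assumed structures (not the actual on-disk format): a directory object is just a serialized name-to-object-id map, and path resolution walks from the well-known root id.

```python
# Directories stored as objects: the content of a directory object is a
# name -> object-id map; a well-known id names the root.
import json, uuid

ROOT_ID = "00000000-0000-0000-0000-000000000000"    # hypothetical well-known root id
objects = {}                                         # object_id -> bytes (stand-in object store)

def write_directory(object_id, entries):
    """A directory's content data is simply a serialized name -> object-id map."""
    objects[object_id] = json.dumps(entries).encode()

def read_directory(object_id):
    return json.loads(objects[object_id])

def resolve(path):
    """Walk the tree from the root, one directory object per path component."""
    current = ROOT_ID
    for name in [p for p in path.split("/") if p]:
        current = read_directory(current)[name]
    return current

photo_id = str(uuid.uuid4())                         # WORM content object for the file
objects[photo_id] = b"...jpeg bytes..."
home_id = str(uuid.uuid4())
write_directory(home_id, {"photo.jpg": photo_id})
write_directory(ROOT_ID, {"home": home_id})
print(resolve("/home/photo.jpg") == photo_id)        # True
```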


Directory Internal Consistency

- Is consistency useful for filesystem directories?
- The object layer allows "atomic" directory updates
  - This mimics the model used by traditional filesystems
- Content data is stored in separate "immutable" storage
  - Safe snapshot support
- With eventual consistency there would only be temporary effects:
  - Writes: orphaned data
  - Deletes: read errors
- So is strong consistency an absolute requirement? No.


Concurrency Requires Serialization

- How do concurrent writers update a directory?
- Updates to directory entries are atomic (by definition)
  - More precisely, filesystem operations are serialized
  - Client A adds a file, Client B adds a file, Client C deletes a file
  - The first caller wins; the application must impose a sane order
- Kernels use mutexes (locks) for serialization
  - A master controller (pNFS, GoogleFS) does this
  - We want a "multiple master / no master" model
- Distributed locking protocols exist (e.g., Paxos)
  - It's hard: the protocols are complex and have drawbacks
  - It's slow: there is overhead for every operation


Optimistic Concurrency

- We want to serialize without locking
- Observation: file writes have two steps
  - Write the data (long, no contention)*
  - Modify the directory (short, serialized)**
- Use checked writes for the directory (see the sketch below)
  - Always read the directory before writing
  - Write the new revision "if-not-modified-since"
  - On a "write conflict": re-read, replay, repeat

* Consider workloads where files are > 1 MiB; content data is written to WORM storage.
** Because directories are themselves stored as objects, modifying a directory means rewriting the directory object.
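A sketch of the lockless update, using a tiny in-memory stand-in for the object layer (method names are assumed): the large content object is written once, and only the small directory object is retried on conflict.

```python
# Lockless directory update with optimistic concurrency.
import uuid

class ToyStore:
    def __init__(self):
        self.objects = {}                        # content objects (WORM)
        self.dirs = {"root": ({}, 0)}            # dir_id -> (entries, revision)

    def put_object(self, object_id, data):
        self.objects[object_id] = data

    def read_directory(self, dir_id):
        entries, revision = self.dirs[dir_id]
        return dict(entries), revision

    def checked_write_directory(self, dir_id, entries, expected_revision):
        """Succeeds only if nobody wrote a newer revision in the meantime."""
        _, current = self.dirs[dir_id]
        if current != expected_revision:
            return False
        self.dirs[dir_id] = (entries, current + 1)
        return True

def add_file(store, dir_id, name, content):
    # Step 1: write the content data once (long, contention-free, WORM).
    content_id = str(uuid.uuid4())
    store.put_object(content_id, content)
    # Step 2: update the small directory object with a checked write; retry on conflict.
    while True:
        entries, revision = store.read_directory(dir_id)   # always read before writing
        entries[name] = content_id                          # replay is trivial: re-add the entry
        if store.checked_write_directory(dir_id, entries, expected_revision=revision):
            return content_id
        # "Write conflict": another client updated the directory first.
        # Re-read, replay, repeat (a back-off would help under heavy contention).

store = ToyStore()
add_file(store, "root", "a.txt", b"hello")
add_file(store, "root", "b.txt", b"world")
print(sorted(store.read_directory("root")[0]))   # ['a.txt', 'b.txt']
```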


Lockless Directory Update

- Optimistic concurrency guarantees serialization
  - The operation is simple ("add file"), so replay is trivial
  - On conflict, the replay semantics are clear
  - Content data (large) is not rewritten on conflict
  - Highly parallelizable
- Potentially unbounded contention latency
  - A back-off protocol can help
  - Not good for use cases with high directory contention


Conclusions

- Advantages
- Limitations
- Final Thoughts


Advantages

- Scalability and performance
  - Content data I/O is quick and contention-free
  - No-master concurrent read and write
  - Linearly scalable performance
- Availability
  - Load balancing without complicated HA setups
- Reliability
  - Information dispersal
  - Data and metadata have the same reliability
  - No separate "backup" required for an index server


Limitations

- Optimistic concurrency is sensitive to high contention
- Cache requirements limit directory size
  - No intrinsic limit, but who wants a 100 MiB directory object?
- Having no central master makes explicit file locking hard
  - SMB and NFS protocols support such locks
- Not suitable for random-write workloads
- Not suitable for workloads dominated by small files
  - Directory write times eclipse file write times
- Requires a separate index service for search


Final Thoughts

- Significant advances come from the P2P and NoSQL space
  - Three key techniques allow for an ultra-scalable FS:
    - Namespace-based routing
    - Distributed transactions using quorum voting and three-phase commit
    - Optimistic concurrency using checked writes
  - The techniques are usable with IDA or replicated systems
- Such a filesystem would not be general purpose
  - The techniques have some trade-offs
  - Excellent for specific big data use cases


Questions?