
Bridging POSIX-like APIs and Cloud Storage

Wesley Leggette, Cleversafe, Inc.

2012 Storage Developer Conference. © 2012 Cleversafe, Inc. All Rights Reserved.


Presentation Agenda

- From POSIX to Cloud Storage / Object Storage
- User-mountable file systems
- Bridging file systems and object storage
  - Mapping semantics
  - Performance considerations
- Addressing write performance
  - Write/Discard caching model
  - Performance results
- Other considerations


What did we do?

- POSIX/WIN32 APIs are still important for customers
- Wanted to develop a better POSIX/Cloud driver
  - Allows high performance
  - Minimizes hardware costs, allows scale-out
- Addressed one side of the performance issue
  - Write improvement using a new caching model
- Next steps are read and metadata improvements


POSIX Historical Origins

- Designed on top of block devices
- General-purpose application I/O to a hard drive
  - Everything from documents to databases
  - Random read and write access, overwrite
- Also encapsulates access over the LAN (NAS)
  - "Medium latency" applications


POSIX and Performance

- API not designed for high-latency storage (WAN)
  - Typically small buffers for all operations
  - Operations expected to return quickly
  - Traditional API methods are synchronous
  - Multiple requests for simple operations
    - Create file: open, setattr, write
- There has been improvement over time for latency
  - Async I/O, atomic open, larger transfer sizes, multiplexing, buffering
- Improvements are "behind the scenes"
  - Must still implement the legacy API


Cloud Storage

- Cloud Storage: a model for storage as a service
  - Separates storage usage from provisioning
- Public cloud
  - Consolidates customers under large providers
  - Storage over the Internet
- Private cloud
  - Works the same technically as a public cloud
- Cloud Storage APIs: CDMI, S3, OpenStack

Cloud Storage is meant for the Internet.


Object Storage

- Cloud Storage provides Object Storage interfaces
- Object Storage is for unstructured data
- Examples:
  - Books, documents, spreadsheets, presentations, audio, video, e-mail, web pages
- What do these have in common?
  - Most use pre-compressed formats
  - Written and read in streaming fashion
  - Almost never random-offset writes


REST and Object Storage

- REST (Representational State Transfer)
  - Roy Fielding, for "quality" scalable API design
- Main principle is a stateless architecture
  - One storage request = one stored object
  - No session between client and server
  - Reduces round trips over the Internet
  - No sessions means easier load balancing and scalability

Used by modern Cloud Storage APIs: it works well on the Internet.
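To make the stateless principle concrete, here is a minimal Java sketch (not from the talk) that stores one object with a single self-contained PUT over HTTP; the endpoint URL is hypothetical, and a real S3/CDMI/OpenStack request would also carry authentication headers.

```java
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class StatelessPut {
    public static void main(String[] args) throws Exception {
        // One request stores one object; no session is established before
        // the call or kept afterwards, so any server replica can handle it.
        byte[] body = "hello object storage".getBytes(StandardCharsets.UTF_8);

        // Hypothetical endpoint; real cloud APIs also require auth headers.
        URL url = new URL("http://storage.example.com/bucket/hello.txt");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("PUT");
        conn.setDoOutput(true);
        conn.setFixedLengthStreamingMode(body.length);

        try (OutputStream out = conn.getOutputStream()) {
            out.write(body);
        }
        System.out.println("HTTP status: " + conn.getResponseCode());
        conn.disconnect();
    }
}
```

Because the request carries everything the server needs, there is no session state to pin a client to a particular node, which is what makes load balancing and scale-out straightforward.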


Broader Definition of Object Storage

- Objects are used in several "non-cloud" systems
  - Random write/update not available or not commonly used
  - Perhaps categorized as "streaming storage"
- WebDAV, FTP
- HDFS
  - File system for Hadoop
  - Typically deployed to local clusters
  - Same semantic limitations
    - No random write, weak append support


User-mountable file storage

- The mountable "file system"
  - From local DAS to sharable NAS and SAN
- Still a very common use case
  - Works with most existing applications
  - Familiarity, ease of migration
  - Well-established APIs
    - WIN32, POSIX

Object Storage is for applications, file systems are for humans!


What do we want to do?

- Allow users and applications to use object storage as if it were a local file system
- Achieve the use-case advantages of NAS
  - Plus the advantages of object storage
- File system APIs: POSIX, WIN32
- Object Storage APIs: S3, Atmos, CDMI, HDFS, OpenStack, Mezeo, Cleversafe, Nirvanix
- We will specifically discuss HDFS


HDFS Architecture


Considering Cloud/POSIX translation

- Read model
  - Semantic model is similar
  - Positional reads usually allowed
  - Performance differences are the only major concern
  - Techniques involve prefetching, buffering, and caching (see the positional-read sketch after this list)
- Write model
  - Semantic model is different, with reduced functionality
  - Performance implications
  - Techniques mostly involve better caching
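On the read side, the standard Hadoop FileSystem API already exposes positional reads, which is what lets a driver answer POSIX pread()-style requests and layer prefetching or caching on top. A minimal sketch (the path is hypothetical):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PositionalRead {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path path = new Path("/tmp/example.dat"); // hypothetical path

        // Positional read: fetch a block at an arbitrary offset without
        // disturbing the stream position, much like POSIX pread().
        byte[] buffer = new byte[64 * 1024];
        try (FSDataInputStream in = fs.open(path)) {
            int n = in.read(1024L * 1024L, buffer, 0, buffer.length);
            System.out.println("read " + n + " bytes at offset 1 MiB");
        }
    }
}
```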


Object Storage Write Semantics

- Limits writes to "streaming"
  - One byte in front of the other
- Appends sometimes allowed
- No random I/O in the middle of a file
- Better WAN performance
  - Per-call transfer sizes much larger than 4K blocks


Mapping POSIX semantics

- POSIX procedure:
  - open(): obtain a file handle, perhaps creating the file
  - write(): single call per I/O buffer chunk
    - Buffers are usually extremely small (4K-16K typical)
    - Depends on write() calls being low latency
    - I/O indicates an offset, which may be completely random
  - close(): closes a reference to the file handle
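For illustration (not from the slides), the sketch below mimics this call pattern with Java's RandomAccessFile: one open handle receiving small writes at arbitrary offsets, which is exactly the access pattern an object store cannot serve directly.

```java
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

public class PosixStyleWrite {
    public static void main(String[] args) throws Exception {
        // POSIX-style access: small buffers written at arbitrary offsets
        // against an open handle, with in-place overwrite allowed.
        byte[] chunk = "small 4K-style buffer".getBytes(StandardCharsets.UTF_8);

        try (RandomAccessFile file = new RandomAccessFile("example.bin", "rw")) {
            file.seek(0);           // write at the start of the file
            file.write(chunk);

            file.seek(1 << 20);     // jump to a "random" offset (1 MiB)
            file.write(chunk);      // and overwrite in place
        }                           // close() releases the handle
    }
}
```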


Mapping POSIX semantics (cont.)

- HDFS procedure:
  - create(): obtain an output stream, perhaps creating the file
    - Same as open() with O_TRUNC on POSIX; will replace file content
    - append() also available
  - Writes append to the stream; the stream is closed when finished
  - On HDFS, content is visible during writes, though the timing is not guaranteed
  - On other systems, the file is invisible until the stream is closed


Traditional Caching: Write/Copy

- All drivers must cache to allow random writes
- Most caches use a two-phase procedure
  - open(): create a cache entry
  - write(): write data to the cache (write phase)
  - close(): copy from the cache to storage (copy phase)
    - Actually on release(), when all fd handles are closed
    - Cache eviction is interesting, but out of scope
- The copy is done either blocking or in the background
  - Blocking allows error reporting
  - Background is attractive when files are large


Performance Issues with Write/Copy

- Hard drives become the performance bottleneck
  - Network performance is increasing with Moore's Law
  - Data access speed is not trending as well
    - Disks aren't keeping up with CPU, network, and memory
    - SSDs also have poor random I/O performance
  - Gigabit Ethernet delivers ~100 MB/s in practice
    - Taxes the sequential performance of many drives
    - 10G Ethernet makes the problem even more pronounced
- Write caches exacerbate the issue
  - Cache I/O is mixed read/write, where performance is even worse


Random Write Issue in Practice

- A cacheless system is ideal, but impractical
  - The driver could fail random write() operations
  - Applications don't degrade gracefully when write() fails
- But most applications don't use random writes
  - Remember that unstructured data is compressed
    - Can't trivially update compressed data at random offsets
  - Small text files are not compressed... but they are quite small


New Technique: Write/Discard

- The driver can detect when writes are streaming
- Updated write procedure (see the sketch after this list):
  - open(): create a cache entry and open a remote stream
    - The handle starts in "discard" mode
  - write(): duplex to the cache and the remote stream (write phase)
    - If a write is out of position, close the stream and go to "copy" mode
  - close(): in discard mode, clear the cache (discard phase)
    - If in copy mode, copy from the cache to a remote stream
- The cache should be write-only for most workloads
  - Used only to support the copy fallback for random writes
  - Copy mode is better for heavy random-write workloads
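A condensed sketch of the mode switch described above, again with a hypothetical Remote stand-in (the real driver implements this inside a FUSE/Java file system, so this only illustrates the logic):

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.RandomAccessFile;
import java.nio.file.Files;
import java.nio.file.Path;

public class WriteDiscardCache {
    // Hypothetical object-store client: one call opens one remote stream.
    interface Remote { OutputStream createStream(String name) throws IOException; }

    private enum Mode { DISCARD, COPY }

    private final Remote remote;
    private final String name;
    private final Path cacheFile;
    private final RandomAccessFile cache;
    private OutputStream stream;     // remote stream, open while in discard mode
    private Mode mode = Mode.DISCARD;
    private long nextOffset = 0;     // next expected sequential offset

    // open(): create the cache entry and open the remote stream
    WriteDiscardCache(Remote remote, String name) throws IOException {
        this.remote = remote;
        this.name = name;
        this.cacheFile = Files.createTempFile("wdiscard", ".cache");
        this.cache = new RandomAccessFile(cacheFile.toFile(), "rw");
        this.stream = remote.createStream(name);
    }

    // write phase: duplex to the cache and, while sequential, to the remote stream
    void write(long offset, byte[] data) throws IOException {
        cache.seek(offset);
        cache.write(data);
        if (mode == Mode.DISCARD) {
            if (offset == nextOffset) {
                stream.write(data);
                nextOffset += data.length;
            } else {                 // out of position: fall back to write/copy
                stream.close();
                stream = null;
                mode = Mode.COPY;
            }
        }
    }

    void close() throws IOException {
        if (mode == Mode.DISCARD) {  // discard phase: data is already remote
            stream.close();
            cache.close();
        } else {                     // copy mode: behave like write/copy
            cache.close();
            try (InputStream in = Files.newInputStream(cacheFile);
                 OutputStream out = remote.createStream(name)) {
                in.transferTo(out);
            }
        }
        Files.delete(cacheFile);     // clear the cache entry either way
    }
}
```

For a streaming workload the cache file is written once and never read back, so the copy-phase disk traffic of the write/copy model disappears, which is what the throughput results below measure.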


Performance Results

- Implemented the write/discard cache for a mountable HDFS driver
  - Driver written with FUSE, in Java
- Tested on a 1-master, 3-slave HDFS configuration
- Used the iozone load tool


- Simulates concurrent reader threads in copy mode
- Write/Copy degradation caused by the mixed workload

[Chart: Simulated Baseline (Local HDD) — Aggregate Throughput (MB/s) vs. Concurrent Writers (0-8); series: Write/Discard, Write/Copy (Active)]


- Apparent performance omits the copy phase (background copy)
- Actual performance includes the copy phase (inline copy)
- Apparent performance trends with the baseline
- Copy-phase degradation appears to reflect the system favoring writes over reads
- Fixing this would require write throttling to avoid cache exhaustion

[Chart: Write/Copy Cache (Active System) — Aggregate Throughput (MB/s) vs. Concurrent Writers (0-8); series: Baseline Write/Copy, HDFS Write/Copy Apparent, HDFS Write/Copy Actual]


- Driver hits the disk bottleneck at 60 MB/s
- Aligns closely with the baseline write-only simulation

[Chart: Write/Discard Cache — Aggregate Throughput (MB/s) vs. Concurrent Writers (0-8); series: Baseline Write/Discard, HDFS Write/Discard]


- Write/Discard consistently performs better for a streaming write workload

[Chart: HDFS Cache Techniques — Aggregate Throughput (MB/s) vs. Concurrent Writers (0-8); series: HDFS Write/Discard, HDFS Write/Copy (Active)]


Conclusions

- Driver performance is a single-system issue
  - Important for performance with fast networks
  - Addresses the limitations of cheap hardware
- Write/Discard is a good write-cache model
  - Better than Write/Copy for streaming workloads


Other considerations?

- Cloud drivers additionally need to consider reads
  - Prefetching and read buffering
  - Mixed read/write in POSIX, and visibility
  - Write cache eviction policies and read performance
- Reducing the impact of retransmission on failure
- Upload techniques for higher throughput


Questions?
