HydraFS Presentation with Notes


TRANSCRIPT

  • Slide 1/30

    April 27, 2010

    HydraFS

    C. Ungureanu, B. Atkin, A. Aranya, et al.

    Slides: Joe Buck, CMPS 229, Spring 2010


  • Slide 2/30

    Introduction

    What is HydraFS?

    Why is it necessary?


    HydraFS is a file system on top of HYDRAstor, a scalable, distributed CAS. Applications don't write to a CAS interface, they write to a FS interface, so an adapter layer is needed: thus HydraFS. The CAS uses a put/get model.
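
    A minimal sketch of the put/get model mentioned in the notes, assuming addresses are derived from a content hash. The class and method names are illustrative, not the HYDRAstor API.

    import hashlib

    class ContentStore:
        """Toy content-addressable store: put() returns an address derived
        from the block contents, get() retrieves the block by that address."""

        def __init__(self):
            self._blocks = {}

        def put(self, data: bytes) -> str:
            addr = hashlib.sha256(data).hexdigest()  # address = content hash
            self._blocks[addr] = data                # identical data maps to one block here
            return addr

        def get(self, addr: str) -> bytes:
            return self._blocks[addr]

    # A file system layered on top must translate open/read/write semantics
    # into these put/get calls; that adapter role is what HydraFS fills.
    store = ContentStore()
    addr = store.put(b"hello world")
    assert store.get(addr) == b"hello world"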

  • Slide 3/30

    HYDRAstor

    What is HYDRAstor?

    Immutable data

    High Latency

    Jitter

    Put / Get API


    Inconsistent use of capitalization in acronyms. Jitter, in this case, means the distance between writes to storage. Mention chunking.

  • Slide 4/30

    Hydra Diagram

    [Diagram: HYDRAstor architecture. An Access Node runs HydraFS (File Server and Commit Server) on top of the HYDRAstor Block Access Library; beneath it, Hydra combines multiple Storage Nodes into a single-system content-addressable store.]


  • Slide 5/30

    CAS

    [Diagram: a client writes a stream of fixed 4 KB blocks to the CAS.]


  • Slide 6/30

    CAS - continued

    [Diagram: the same 4 KB block stream, now passing through a chunker between the client and the CAS.]


    The chunker uses a heuristic based on the content of the data, plus hard-set limits, to produce variable-sized chunks.
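
    A sketch of content-defined chunking in the spirit of these notes: a rolling hash over the data picks chunk boundaries, with hard minimum and maximum sizes as backstops. The hash, mask, and size limits are arbitrary illustrations, not HYDRAstor's actual parameters.

    def chunk(data: bytes, min_size: int = 2048, max_size: int = 8192, mask: int = 0x3FF) -> list:
        """Split data into variable-sized chunks using a toy rolling hash.
        A boundary is declared when the hash's low bits are all ones, but never
        before min_size bytes and always by max_size bytes."""
        chunks, start, h = [], 0, 0
        for i, byte in enumerate(data):
            h = ((h << 1) + byte) & 0xFFFFFFFF      # cheap rolling-style hash
            length = i - start + 1
            at_boundary = (h & mask) == mask
            if (length >= min_size and at_boundary) or length >= max_size:
                chunks.append(data[start:i + 1])
                start, h = i + 1, 0
        if start < len(data):
            chunks.append(data[start:])             # trailing partial chunk
        return chunks

    pieces = chunk(bytes(20000))
    assert all(2048 <= len(p) <= 8192 for p in pieces[:-1])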

  • Slide 7/30

    CAS - continued

    [Diagram: the client now produces variable-sized chunks (2 KB and 4 KB); the CAS holds object cas1 (10 KB).]


    Objects in the CAS have IDs that are pointed to by metadata. cas1 is 10 KB in size.

  • Slide 8/30

    CAS - continued

    [Diagram: further chunks of 1 KB and 4 KB; the CAS now holds objects cas1 (10 KB) and cas2 (9 KB).]


    cas2 is 9 KB in size.

  • Slide 9/30

    A little more on CAS addresses

    Same data doesn't mean the same address

    Impossible to calculate prior to write

    Foreground processing writes shallow trees

    Root cannot be updated until all child nodes are set


    Differing retention levels can produce different CAS addresses. Collisions can be detected but are unlikely. Writes are done asynchronously, blocking on the root node commit.

  • Slide 10/30

    Issues for a CAS FS

    Updates are more expensive

    Metadata cache misses cause significant performance issues

    The combination of high latency and high throughput means lots of buffering


    Updates must touch all metadata that points to affected data. Buffering allows for optimal write ordering, and the read cache is important as well.

  • Slide 11/30

    Design Decisions

    Decouple data and metadata processing

    Fixed size caches with admission control

    Second-order cache for metadata


    From the previous three issues come three design decisions: 1) decoupling is done via a log, which allows batching of metadata updates; 2) fixed-size caches prevent swapping and other resource over-allocations; 3) the second-order cache removes operations from reads via cache hits and improves the metadata cache hit rate.

  • Slide 12/30

    Issues - continued

    Immutable Blocks

    FS can only reference blocks already written

    Forms DAGs

    Height of DAGs needs to be minimized


    The entire tree must be updated if a block contained in it is updated, which makes updates quite expensive.
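
    A sketch of why this is expensive: with immutable, content-addressed blocks, changing one leaf gives it a new address, so every block on the path up to the root must also be rewritten. The flat two-level tree below is a simplified stand-in for HydraFS's block-pointer trees.

    import hashlib, json

    store = {}

    def put(node: dict) -> str:
        data = json.dumps(node, sort_keys=True).encode()
        addr = hashlib.sha256(data).hexdigest()
        store[addr] = node
        return addr

    def write_file(leaves: list) -> str:
        """Write leaves first, then a root pointing at their addresses (shallow tree)."""
        leaf_addrs = [put({"data": leaf}) for leaf in leaves]
        return put({"children": leaf_addrs})        # root can only be written last

    def update_leaf(root_addr: str, index: int, new_data: str) -> str:
        """A new leaf means a new leaf address, hence a new root block as well."""
        children = list(store[root_addr]["children"])
        children[index] = put({"data": new_data})
        return put({"children": children})          # the old root remains immutable

    root_v1 = write_file(["aaa", "bbb", "ccc"])
    root_v2 = update_leaf(root_v1, 1, "BBB")
    assert root_v1 != root_v2                       # the whole path to the root changed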

  • Slide 13/30

  • Slide 14/30

    Issues - continued

    Variable sized blocks

    Avoids the shifting window problem

    Use a balanced tree structure


    This is the chunking referred to in the paper. There is a min/max size for chunks. The balanced tree helps keep the DAGs shallow.

  • Slide 15/30

    FS design

    High Throughput

    Minimize the number of dependent I/O operations

    Availability guarantees no worse than standard Unix FS

    Efficiently support both local and remote access


    Close-to-open consistency (an fsync acknowledgment means the data is persisted). Remote access could be NFS or CIFS.

  • Slide 16/30

    File System Layout

    [Diagram: file system layout. Super blocks lead to an imap handle; an imap B-tree and imap segmented array map inode numbers (321, 365, 442) to inodes. Directory blocks map filenames (Filename1, Filename2, Filename3) to inode numbers and types (R = regular file, D = directory); a regular file inode reaches its file contents through an inode B-tree, and a directory inode reaches its directory blocks through its own inode B-tree.]


    The inode map is similar to that of a log-structured file system. Files dedup across file systems.
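
    A sketch of the lookup chain this layout implies, with the imap resolving inode numbers to inode addresses much as in LFS. The dictionaries stand in for the imap, directory blocks, and inodes stored in Hydra; all names and addresses are made up for illustration.

    # Illustrative stand-ins for structures stored as blocks in Hydra.
    imap = {321: "addr_inode_321", 365: "addr_inode_365", 442: "addr_inode_442"}
    blocks = {
        "addr_inode_442": {"type": "D", "entries": {"Filename1": 321, "Filename2": 365}},
        "addr_inode_321": {"type": "R", "contents": "addr_btree_321"},
        "addr_inode_365": {"type": "R", "contents": "addr_btree_365"},
    }

    def lookup(dir_ino: int, name: str) -> dict:
        """Directory inode -> directory entries -> inode number -> imap -> inode."""
        directory = blocks[imap[dir_ino]]
        ino = directory["entries"][name]
        return blocks[imap[ino]]

    inode = lookup(442, "Filename1")
    assert inode["type"] == "R"                     # Filename1 is a regular file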

  • Slide 17/30

    HydraFS Software Stack

    Uses FUSE

    Split into file server and commit server

    Simplifies metadata locking

    Amortizes the cost of metadata updates via batching

    Each server has its own caching strategy


    The file server manages the interface to the client, records file modifications in a transaction log stored in Hydra, and keeps an in-memory cache of recent file modifications. The commit server reads the transaction log, updates FS metadata, and generates new FS versions.
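
    A sketch of that split: the file server only appends modification records to a log, and the commit server later replays the log in a batch to produce a new FS version. The record format and function names are assumptions for illustration.

    from collections import deque

    log = deque()                   # stands in for the transaction log stored in Hydra
    fs_versions = [{}]              # each FS version maps inode number -> metadata root

    def file_server_write(inode: int, new_root_addr: str) -> None:
        """File server: record the modification instead of updating metadata in place."""
        log.append({"inode": inode, "root": new_root_addr})

    def commit_server_run() -> None:
        """Commit server: drain the log, apply the updates in one batch,
        and publish the result as a new FS version."""
        new_version = dict(fs_versions[-1])
        while log:
            rec = log.popleft()
            new_version[rec["inode"]] = rec["root"]
        fs_versions.append(new_version)

    file_server_write(321, "cas_root_v2")
    file_server_write(365, "cas_root_v7")
    commit_server_run()
    assert fs_versions[-1][321] == "cas_root_v2"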

  • Slide 18/30

    Writing Data

    Data stored in an inode-specific buffer

    Chunked, marked dirty and written to Hydra

    After write confirmation, the block is freed and entered in the uncommitted block table

    Needed until metadata is flushed to storage

    Designed for append writing, in-place updates are expensive


    Chunks have a max size, at which point a chunk is created. Writes are cached in memory until Hydra confirms them (this allows responses to reads in the meantime, and handles failures in Hydra). Data is not visible in Hydra until a new FS version is created.
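
    A sketch of this write path: data is buffered per inode, chunked at a size limit, written to Hydra, and remembered in an uncommitted block table until the metadata flush. The synchronous hydra_put and the fixed chunk size are simplifications of the real, asynchronous path.

    write_buffers = {}      # inode -> bytearray of buffered, not-yet-chunked data
    uncommitted = {}        # inode -> list of (offset, length, content address)
    offsets = {}            # inode -> next file offset to assign
    CHUNK_MAX = 4096        # illustrative chunk size limit

    def hydra_put(data: bytes) -> str:
        return f"cas_{hash(data) & 0xFFFFFFFF:08x}"    # placeholder for the real put

    def append(inode: int, data: bytes) -> None:
        """Buffer appended data; when a full chunk accumulates, write it to Hydra
        and record it in the uncommitted block table."""
        buf = write_buffers.setdefault(inode, bytearray())
        buf.extend(data)
        while len(buf) >= CHUNK_MAX:
            piece = bytes(buf[:CHUNK_MAX])
            del buf[:CHUNK_MAX]
            addr = hydra_put(piece)                    # write confirmed (synchronously here)
            off = offsets.get(inode, 0)
            uncommitted.setdefault(inode, []).append((off, len(piece), addr))
            offsets[inode] = off + len(piece)

    def metadata_flushed(inode: int) -> None:
        """Once the commit server has persisted the metadata, the entries can be dropped."""
        uncommitted.pop(inode, None)

    append(7, bytes(10000))
    assert len(uncommitted[7]) == 2                    # two full chunks written so far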

  • Slide 19/30

    Metadata Cleaning

    Dirty data kept until the commit server applies changes

    New versions of file systems are created periodically

    Metadata in separate structures, tagged by time

    Always clean (in Hydra), can be dropped from cache at any time

    Cleaning allows file servers to drop changes in the new FS version


    A new FS version allows a file server to clean its dirty metadata proactively.

  • Slide 20/30

    Admission Control

    Events assume worst-case memory usage

    If insufficient resources are available, the event blocks

    Limits the number of active events

    Memory usage is tuned to the amount of physical memory


    Not all memory used is freed when an action completes, for example cache. This can be flushed if the system finds it needs to reclaim memory. Not swapping is key for keeping latencies low and performance up.
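
    A sketch of admission control as described: each event reserves its worst-case memory up front and blocks while the budget, tuned to physical memory, would be exceeded. The condition-variable budget below is an assumed implementation, not HydraFS's code.

    import threading

    class AdmissionController:
        """Reserve worst-case memory before an event runs; block while the budget
        would be exceeded, so the process never over-commits and swaps."""

        def __init__(self, budget_bytes: int):
            self.available = budget_bytes
            self.cond = threading.Condition()

        def admit(self, worst_case_bytes: int) -> None:
            with self.cond:
                while self.available < worst_case_bytes:
                    self.cond.wait()                 # event blocks until memory frees up
                self.available -= worst_case_bytes

        def release(self, worst_case_bytes: int) -> None:
            with self.cond:
                self.available += worst_case_bytes
                self.cond.notify_all()

    ctrl = AdmissionController(budget_bytes=64 * 1024 * 1024)
    ctrl.admit(4 * 1024 * 1024)     # a write event reserves its worst-case footprint
    # ... process the event ...
    ctrl.release(4 * 1024 * 1024)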

  • Slide 21/30

    Read Processing

    Aggressive read-ahead

    Multiple fetches to get metadata

    Weighted caching to favor metadata over data

    Fast range map

    Metadata read-ahead

    Primes FRM, cache


    Read-ahead goes into an in-memory LRU cache; the default is 20 MB. HydraFS caches both metadata and data, and uses large leaf nodes and high-fanout parent nodes. The fast range map is a look-aside buffer that translates a file offset to a content address. The FRM and B-tree read-ahead add 36% performance for a small memory/CPU overhead.
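
    A sketch of a fast range map as the notes describe it: a look-aside structure mapping file offset ranges to content addresses, so a hit avoids the B-tree walk. The sorted-list-plus-bisect layout is an assumption for illustration.

    import bisect

    class FastRangeMap:
        """Look-aside buffer mapping [start, start + length) file ranges to CAS addresses."""

        def __init__(self):
            self.starts = []        # sorted range start offsets
            self.entries = []       # parallel list of (start, length, address)

        def insert(self, start: int, length: int, address: str) -> None:
            i = bisect.bisect_left(self.starts, start)
            self.starts.insert(i, start)
            self.entries.insert(i, (start, length, address))

        def lookup(self, offset: int):
            """Return the content address covering offset, or None on a miss
            (the read path then falls back to the inode B-tree)."""
            i = bisect.bisect_right(self.starts, offset) - 1
            if i >= 0:
                start, length, address = self.entries[i]
                if start <= offset < start + length:
                    return address
            return None

    frm = FastRangeMap()
    frm.insert(0, 4096, "cas_a")
    frm.insert(4096, 2048, "cas_b")
    assert frm.lookup(5000) == "cas_b"
    assert frm.lookup(9000) is None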

  • Slide 22/30

    Deletion

    File deletion removes the entry from the current FS

    Data remains until there are no pointers to it


    The data will remain in storage until all FS versions that reference it are garbage collected. A block may be pointed to by other files as well. The FS only marks roots for deletion; Hydra handles reference counting and storage reclamation.
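
    A sketch of that division of labor: the file system only drops a version's root, and the store reclaims any block whose reference count falls to zero. The eager recursive decrement below is a simplification of the background reclamation Hydra actually performs.

    blocks = {                      # block address -> addresses of its children
        "root_v1": ["meta_1"],
        "meta_1": ["data_a", "data_b"],
        "root_v2": ["meta_2"],
        "meta_2": ["data_b", "data_c"],   # data_b is shared between FS versions
        "data_a": [], "data_b": [], "data_c": [],
    }
    refcount = {"root_v1": 1, "root_v2": 1, "meta_1": 1, "meta_2": 1,
                "data_a": 1, "data_b": 2, "data_c": 1}

    def delete_root(addr: str) -> None:
        """FS side: just drop the root; the store handles the rest."""
        decref(addr)

    def decref(addr: str) -> None:
        refcount[addr] -= 1
        if refcount[addr] == 0:
            for child in blocks[addr]:
                decref(child)
            del blocks[addr]        # storage reclaimed

    delete_root("root_v1")
    assert "data_a" not in blocks   # only referenced by v1: reclaimed
    assert "data_b" in blocks       # still referenced by v2: kept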

  • Slide 23/30

    Performance

    [Chart: normalized throughput (0.0 to 1.0) of the raw block device vs. the file system for Read (iSCSI), Read (Hydra), Write (iSCSI), and Write (Hydra).]


    Sequential throughput. iSCSI is 6 disks per node with software RAID 5 (likely the write hit iSCSI takes). Block size is 64 KB. HydraFS achieves 82% of raw throughput on reads and 88% on writes.

  • Slide 24/30

    Metadata Intensive

    Postmark

    Generates files, then issues transactions.

    File size: 512 B - 16 KB

                 Create            Delete           Overall
                 Alone     Tx      Alone     Tx
    ext3         1,851     68      1,787     68       136
    HydraFS         61     28        676     28        57


    This is a worst case for HydraFS. It had to create file systems on the fly due to the limit on outstanding metadata updates, leaving fewer operations to amortize costs over.

  • Slide 25/30

    Write Performance vs Dedup

    [Chart: throughput (MB/s, 0 to 350) vs. duplicate ratio (%, 0 to 80) for Hydra and HydraFS.]


    HydraFS within 12% of Hydra throughout

  • Slide 26/30

    Write Behind

    [Chart: write offset (GB, 6 to 10) vs. time (s, 0 to 20).]


    Helps with buffering. There is no I/O in the write critical path. There is a lot of jitter around 6 seconds; the biggest gap is 1.5 GB.

  • Slide 27/30

    Hydra Latency

    [Chart: CDF of Hydra latency, Pr(t <= x) from 0.4 to 1.0, vs. time (ms, 0 to 70).]


    The 90th percentile is at 10 ms. The point: even though Hydra is jittery and high-latency, HydraFS still works (it smooths things out).

  • Slide 28/30

    Future Work

    Allow multiple nodes to manage the same FS

    Makes failover transparent and automatic

    Exposing snapshots to users

    Incorporating SSD storage to lower latencies, making HydraFS usable as primary storage


  • Slide 29/30

    Thank you

    Questions? Comments?

    email: [email protected]

    Paper: http://www.usenix.org/events/fast10/tech/full_papers/ungureanu.pdf


  • Slide 30/30

    Sample Operations

    Block Write

    Block Read

    Searchable Block Write

    Searchable Block Read


    Writes trade blocks for CAS addresses; reads invert that. Labels can group data for retention or deletion; garbage collection reaps all the data that isn't part of a tree anchored by a retention block.
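
    A sketch of the four operations on this slide, assuming (as the notes suggest) that regular blocks are addressed purely by content while searchable blocks are written and read under a caller-chosen key such as a retention label. The in-memory store is illustrative, not the HYDRAstor API.

    import hashlib

    regular = {}        # content address -> block
    searchable = {}     # caller-chosen key -> block (e.g. retention roots, FS version roots)

    def block_write(data: bytes) -> str:
        """Trade a block for a CAS address."""
        addr = hashlib.sha256(data).hexdigest()
        regular[addr] = data
        return addr

    def block_read(addr: str) -> bytes:
        """Invert block_write: trade an address for the block."""
        return regular[addr]

    def searchable_block_write(key: str, data: bytes) -> None:
        """Write a block retrievable later by a known key rather than by content address."""
        searchable[key] = data

    def searchable_block_read(key: str) -> bytes:
        return searchable[key]

    root = block_write(b"fs-version-42 metadata root")
    searchable_block_write("retention:fs-version-42", root.encode())
    fetched = searchable_block_read("retention:fs-version-42").decode()
    assert block_read(fetched) == b"fs-version-42 metadata root"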