XtreemFS - A Cloud File System


Slide 1/27

XtreemFS - A Cloud File System

    Michael Berlin, Zuse Institute Berlin

    Contrail Summer School, Almere, 24.07.2012

Funded under: FP7 (Seventh Framework Programme)

    Area: Internet of Services, Software & Virtualization (ICT-2009.1.2)

    Project reference: 257438

Slide 2/27

Motivation: Cloud Storage / Cloud File System

    Cloud Storage Requirements

    highly available

    scalable

    elastic: add and remove capacity

    suitable for wide area networks

    Support for legacy applications

    POSIX-compatible file system required

    Google for cloud file system: www.XtreemFS.org

Slide 3/27

    Outline

    XtreemFS Architecture

    Replication in XtreemFS

    Read-Only File Replication

    Read/Write File Replication

    Custom Replica Placement and Selection

    Metadata Replication

    XtreemFS Use Cases

    XtreemFS and OpenNebula

Slide 4/27

    XtreemFS - A Cloud File System

    History

    2006 initial development in XtreemOS project

    2010 further development in Contrail project

    2012 August: Release 1.3.2

    Features

    Distributed File System

    POSIX compatible

    Replication

X.509 Certificates and SSL Support

    Software

    Open source: www.xtreemfs.org

Client software (C++) runs on Linux & OS X (FUSE) and Windows (Dokan)

    Server software (Java)

Slide 5/27

    XtreemFS Architecture

Metadata and Replica Catalog (MRC): stores metadata per volume

    Object Storage Devices (OSDs): directly accessed by clients; file content split into objects (mapping sketched below)

    Separation of metadata and file content: object-based file system
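Because file content is split into objects, a client can compute locally which object (and which OSD of a striped replica) holds a given byte offset. A minimal sketch of that mapping, assuming an illustrative 128 KiB stripe size and simple round-robin striping (not values taken from this talk):

    // Sketch: mapping a file offset to an object and an OSD under simple
    // round-robin striping. Stripe size and OSD count are illustrative assumptions.
    public class ObjectMapping {
        static final int STRIPE_SIZE = 128 * 1024; // bytes per object (assumed)

        /** Object number that contains the given byte offset. */
        static long objectNumber(long fileOffset) {
            return fileOffset / STRIPE_SIZE;
        }

        /** Offset of the byte within its object. */
        static int offsetInObject(long fileOffset) {
            return (int) (fileOffset % STRIPE_SIZE);
        }

        /** Index of the OSD holding that object (round-robin over the replica's OSDs). */
        static int osdIndex(long objectNumber, int numOsds) {
            return (int) (objectNumber % numOsds);
        }

        public static void main(String[] args) {
            long offset = 1_000_000L;
            long obj = objectNumber(offset);
            System.out.printf("offset %d -> object %d, byte %d, OSD %d%n",
                    offset, obj, offsetInObject(offset), osdIndex(obj, 3));
        }
    }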

Slide 6/27

Slide 7/27

    Read-Only Replication (1)

    Only for write-once files

    File must be marked as read-only

    done automatically after close()

    Use Case: CDN

    Replica Types:

    1. Full replicas

    complete copy, fills itself as fast as possible

2. Partial replicas: initially empty

    on-demand fetching of missing objects

    P2P-like efficient transfer between all replicas
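A minimal sketch of how a partial replica could fill itself on demand: serve an object locally if present, otherwise fetch it from another replica and keep it. The `RemoteReplica` interface and class names are hypothetical illustrations, not the XtreemFS implementation:

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Sketch: a partial replica starts empty and fetches missing objects on demand.
    // RemoteReplica / PartialReplica are hypothetical names for illustration only.
    interface RemoteReplica {
        byte[] fetchObject(long objectNumber); // pull one object from another replica
    }

    class PartialReplica {
        private final Map<Long, byte[]> localObjects = new ConcurrentHashMap<>();
        private final RemoteReplica source;

        PartialReplica(RemoteReplica source) {
            this.source = source;
        }

        /** Read one object: serve locally if present, otherwise fetch and keep it. */
        byte[] readObject(long objectNumber) {
            return localObjects.computeIfAbsent(objectNumber, source::fetchObject);
        }
    }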

Slide 8/27

    Read-Only Replication (2)

Slide 9/27

    Read/Write Replication (1)

    Primary/backup scheme

POSIX requires a total order of update operations → primary/backup

    Primary fail-over?

Leases:

    grant access to a resource (here: the primary role) for a predefined period of time

    failover possible after the lease times out

Assumption: loosely synchronized clocks with a bounded maximum drift (see the sketch below)
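Under this assumption, a backup may only treat the primary's lease as expired once the timeout has passed even under the worst-case drift. A small illustrative check (the 500 ms drift bound is an assumed value, not one from this talk):

    // Sketch: deciding whether a lease can be considered expired when clocks may
    // drift by at most MAX_DRIFT_MS. Values are illustrative assumptions.
    public class LeaseCheck {
        static final long MAX_DRIFT_MS = 500; // assumed bound on clock drift

        /** A lease held by some OSD until (approximately) 'leaseEndMs' local time. */
        static boolean safeToTakeOver(long nowMs, long leaseEndMs) {
            // Only after leaseEnd + maxDrift can no correct node still believe
            // it holds the lease, so only then is fail-over safe.
            return nowMs > leaseEndMs + MAX_DRIFT_MS;
        }

        public static void main(String[] args) {
            long now = System.currentTimeMillis();
            System.out.println(safeToTakeOver(now, now - 1000)); // true: expired beyond the drift bound
            System.out.println(safeToTakeOver(now, now - 100));  // false: still within the drift window
        }
    }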

Slide 10/27

    Read/Write Replication (2)


    Replicated write():

Slide 11/27

    Read/Write Replication (3)


    Replicated write():

    1. Lease Acquisition

Slide 12/27

    Read/Write Replication (4)


    Replicated write():

    1. Lease Acquisition

    2. Data Dissemination
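Taken together, a replicated write() first makes sure the local OSD holds the lease (step 1) and then pushes the update to the backups, acknowledging once a majority of all replicas has it (step 2). A rough sketch under assumed `Lease` and `Backup` interfaces; it mirrors the steps on this slide, not the actual XtreemFS code:

    import java.util.List;

    // Sketch of a primary/backup replicated write: 1. lease acquisition,
    // 2. data dissemination to a majority of backups. Interfaces are assumed.
    interface Lease {
        boolean isValidNow();   // does this OSD currently hold the primary lease?
        void acquire();         // run the lease acquisition protocol (e.g., Flease)
    }

    interface Backup {
        boolean apply(long objectNumber, byte[] data); // true once the backup applied the write
    }

    class ReplicatedFile {
        private final Lease lease;
        private final List<Backup> backups;

        ReplicatedFile(Lease lease, List<Backup> backups) {
            this.lease = lease;
            this.backups = backups;
        }

        /** Returns true once a majority of all replicas (primary + backups) stores the write. */
        boolean write(long objectNumber, byte[] data) {
            if (!lease.isValidNow()) {
                lease.acquire();        // step 1: become primary (may also trigger a replica reset)
            }
            int acks = 1;               // the primary itself counts toward the majority
            for (Backup b : backups) {  // step 2: disseminate the update
                if (b.apply(objectNumber, data)) {
                    acks++;
                }
            }
            return acks > (backups.size() + 1) / 2;
        }
    }

Acknowledging after a majority keeps the write durable across a minority of failed OSDs, matching the majority-based fail-over used for the lease itself.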

Slide 13/27

Slide 14/27

    Read/Write Replication: Distributed Lease Acquisition with Flease

    Flease

    Failure tolerant: majority-based

    Scalable: lease per file

    Experiment:

ZooKeeper: 3 servers

    Flease: 3 nodes

    (2 randomly selected)

(Chart: Flease vs. central lock service)

Slide 15/27

Slide 16/27

    Read/Write Replication: Summary

    High up-front costs (for first access to inactive file)

    3+ round-trips

    2 for Flease (lease acquisition)

    1 for Replica Reset

+ further round-trips when fetching missing objects

    Minimal cost for subsequent operations

    Read: identical to non-replicated case

    Write: latency increases by time to update majority of backups

    Works at file-level: scales with # OSDs and # files

Flease: no I/O to stable storage needed for crash recovery

Slide 17/27

    Custom Replica Placement and Selection

    Policies

    filter and sort available OSDs/replicas

evaluate client information (IP address/hostname, estimated latency)

    create file on OSD close to me

    access closest replica

    Available default policies:

    Server ID

    DNS

    Datacenter Map

    Vivaldi

    Own policies possible (Java)
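Conceptually, such a policy is a filter-and-sort step over the candidate OSDs or replicas. The sketch below only illustrates that shape with a hypothetical `OsdInfo` type and latency estimate; the real XtreemFS policy interface and its signatures differ (see the XtreemFS documentation):

    import java.util.Comparator;
    import java.util.List;
    import java.util.stream.Collectors;

    // Sketch of a "prefer close OSDs" policy: filter unusable candidates, then sort
    // by estimated latency to the client. OsdInfo and its fields are hypothetical.
    class OsdInfo {
        String uuid;
        boolean available;
        double estimatedLatencyMs; // e.g., from a datacenter map or Vivaldi coordinates
    }

    class ClosestOsdPolicy {
        List<OsdInfo> select(List<OsdInfo> candidates) {
            return candidates.stream()
                    .filter(osd -> osd.available)
                    .sorted(Comparator.comparingDouble((OsdInfo osd) -> osd.estimatedLatencyMs))
                    .collect(Collectors.toList());
        }
    }

A datacenter map or Vivaldi network coordinates (next slide) can supply the latency estimate used for sorting.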

Slide 18/27

    Replica Placement/Selection: Vivaldi Visualization

Slide 19/27

    Metadata Replication

Replication at the database level: same approach as file R/W replication

    Consistency can be loosened: stale reads allowed

    All services replicated

    No single point of failure

Slide 20/27

    XtreemFS Use Cases

    Storage of VM images for IaaS solutions (OpenNebula, ...)

    Storage-as-a-Service: Volumes per User

    XtreemFS as HDFS replacement in Hadoop

    XtreemFS in ConPaaS: storage on demand for other services

Slide 21/27

    XtreemFS and OpenNebula (1)

    Use Case: VM images in OpenNebula cluster

    no distributed file system: scp VM images to hosts

    distributed file system: shared storage, available on all nodes

    Support for live migration

    Fault-tolerant storage of VM images

Resume VM on another node after crash

    → Use XtreemFS Read/Write file replication

Slide 22/27

    XtreemFS and OpenNebula (2)

VM deployment:

    Create copy (clone) of original VM image

    Run cloned VM image at scheduled host

    (Discard cloned image after VM shutdown)

    Problems

    1. cloning time-consuming

    2. waste of space

3. increasing total boot time when starting multiple VMs (e.g., the ConPaaS image)

Slide 23/27

    XtreemFS and OpenNebula: qcow2 + Replication

    qcow2 VM image format allows snapshots

    1. immutable backing file

    2. mutable, initially empty snapshot file

    instead of cloning, snapshot original VM image (< 1 second)

    Use Read/Write replication for snapshot file

    Problem left: run multiple VMs simultaneously

snapshot file: R/W replication scales with # OSDs and # files

    backing file: bottleneck → use Read-Only Replication
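The snapshot step above amounts to creating a new, initially empty qcow2 file on top of the shared, read-only backing image instead of copying it. A minimal sketch of that step driven from Java; the paths are illustrative, and the qemu-img invocation shown is the common way to create an image with a backing file:

    import java.io.IOException;

    // Sketch: create a copy-on-write qcow2 snapshot on top of a read-only backing
    // image instead of cloning it. Paths are illustrative assumptions.
    public class SnapshotDeploy {
        public static void main(String[] args) throws IOException, InterruptedException {
            String backing = "/xtreemfs/images/conpaas-base.img";  // read-only replicated backing file
            String snapshot = "/xtreemfs/instances/vm-42.qcow2";   // R/W replicated, initially empty

            Process p = new ProcessBuilder(
                    "qemu-img", "create", "-f", "qcow2",
                    "-b", backing, snapshot)
                    .inheritIO()
                    .start();
            System.exit(p.waitFor()); // finishes in well under a second, unlike a full copy
        }
    }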

Slide 24/27

    XtreemFS and OpenNebula: Benchmark (1)

OpenNebula Test Cluster: Frontend + 30 worker nodes

    Gigabit Ethernet (100 MB/s)

    SATA disk (70 MB/s)

Setup:

    Frontend: MRC, OSD (has the ConPaaS VM image)

    Each worker node: OSD, XtreemFS FUSE client, OpenNebula node

    Replica Placement + Replica Selection: prefer the local OSD/replica

Slide 25/27

    XtreemFS and OpenNebula: Benchmark (2)

Setup                                          Total Boot Time
    copy (1.6 GB image file)                       82 seconds (69 seconds for the copy)
    qcow2, 1 VM                                    13.6 seconds
    qcow2, 30 VMs                                  20.8 seconds
    qcow2, 30 VMs, 30 partial replicas             142.8 seconds
      - second run                                 20.1 seconds
      - after second run                           17.5 seconds
      + Read/Write Replication on snapshot file    19.5 seconds


    few read()s on image, no bottleneck yet

    Replication: object granularity vs. small reads/writes

Slide 26/27

    Future Research & Work

    Deduplication

    Improved Elasticity

    Fault Tolerance

    Optimize Storage Cost

    Erasure Codes

    Self-*

Client Cache

    Less POSIX: replace MRC with a scalable service

Slide 27/27

Funded under: FP7 (Seventh Framework Programme)

    Area: Internet of Services, Software & Virtualization (ICT-2009.1.2)

    Project reference: 257438

Total cost: 11.29 million euro

    EU contribution: 8.3 million euro

    Execution: from 2010-10-01 to 2013-09-30

    Duration: 36 months

    Contract type: Collaborative project (generic)