SC09: ZFS End-to-End Integrity


Lustre, ZFS, and Data Integrity

ZFS BOF Supercomputing 2009

Andreas Dilger, Sr. Staff Engineer, Sun Microsystems
Lustre ZFS Team, Sun Microsystems


    Topics

    Overview

    ZFS/DMU features

    ZFS Data Integrity

    Lustre Data Integrity

    ZFS-Lustre Integration

    Current Status



    2012 Lustre Requirements

Filesystem limits:
  240 PB file system size
  1 trillion (10^12) files per file system
  1.5 TB/s aggregate bandwidth
  100k clients

Single file/directory limits:
  10 billion files in a single directory
  0 to 1 PB file size range

Reliability:
  100h filesystem integrity check
  End-to-end data integrity



    ZFS Meets Future Requirements


Capacity:
  Single filesystems 512TB+ (theoretical 2^64 devices * 2^64 bytes)
  Trillions of files in a single file system (theoretical 2^48 files per fileset)
  Dynamic addition of capacity/performance

Reliability and integrity:
  Transaction based, copy-on-write
  Internal data redundancy (up to triple parity/mirror and/or 3 copies)
  End-to-end checksum of all data/metadata
  Online integrity verification and reconstruction

Functionality:
  Snapshots, filesets, compression, encryption
  Online incremental backup/restore
  Hybrid storage pools (HDD + SSD)
  Portable code base (BSD, Solaris, ... Linux, OS X?)


    Lustre/ZFS Project Background

ZFS-FUSE (2006): the first port of ZFS to Linux, running in user space via the FUSE framework

Kernel DMU (2008): an LLNL engineer ported the core parts of ZFS to Red Hat Linux as a kernel module

Lustre-kDMU (2008-present): modify Lustre to work on a ZFS/DMU storage backend in a production environment



    ZFS Overview

ZFS POSIX Layer (ZPL):
  Interface to user applications
  Directories, namespace, ACLs, xattrs

Data Management Unit (DMU):
  Objects, datasets, snapshots
  Copy-on-write atomic transactions
  Name->value lookup tables (ZAP)

Adaptive Replacement Cache (ARC):
  Manages RAM and flash cache space

Storage Pool Allocator (SPA):
  Allocates blocks at IO time
  Manages multiple disks directly
  Mirroring, parity, compression



    Lustre/Kernel DMU Modules



    Lustre Software Stack Updates

[Diagram: MDS and OSS software stacks across releases.
  1.8 MDS: MDS -> fsfilt -> ldiskfs MDT
  2.0 MDS: MDD -> ldiskfs OSD -> ldiskfs MDT
  2.1 MDS: MDD -> DMU OSD -> DMU MDT
  1.8/2.0 OSS: OST -> obdfilter -> fsfilt -> ldiskfs OST
  2.1 OSS: OFD -> DMU OSD -> DMU OST]


    How Safe Is Your Data?

There are many sources of data corruption:
  Software/firmware: client, NIC, router, server, HBA, disk
  RAM, CPU, disk cache, media
  Client network, storage network, cables

[Diagram: a client application's data crosses the transport, network, and link layers, a router, and the server's link, network, and transport layers before reaching storage on the Object Storage Server; corruption can be introduced at any of these hops.]


    ZFS Industry Leading Data Integrity


    End-to-End Data Integrity


Lustre network checksum:
  Detects data corruption over the network
  But ext3/4 does not checksum data on disk

ZFS stores data/metadata checksums:
  Fast (Fletcher-4, the default) or none
  Strong (SHA-256)

Combine for end-to-end integrity:
  Integrate the Lustre and ZFS checksums
  Avoid recomputing the full checksum on data
  Always overlap checksum coverage
  Use a scalable tree hash method
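
The fast checksum named above is ZFS's fletcher_4, which runs four 64-bit accumulators over the data viewed as 32-bit words. A minimal Python sketch of that recurrence (the function name and little-endian word order here are illustrative; ZFS computes this in C over whole blocks and stores the four sums in the block pointer's 256-bit checksum field):

```python
import struct

def fletcher4(data: bytes):
    # Four running sums over the data as 32-bit little-endian words;
    # each sum wraps modulo 2^64, matching C's uint64_t arithmetic.
    a = b = c = d = 0
    for (word,) in struct.iter_unpack("<I", data):
        a = (a + word) % 2**64
        b = (b + a) % 2**64
        c = (c + b) % 2**64
        d = (d + c) % 2**64
    return (a, b, c, d)  # the 256-bit checksum as four 64-bit words

# A single flipped bit changes the sums:
assert fletcher4(b"\x00" * 4096) != fletcher4(b"\x01" + b"\x00" * 4095)
```

Because each accumulator feeds the next, the checksum is position-sensitive: unlike a plain sum, reordered or shifted words change b, c, and d even when a is unchanged.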


    Hash Tree and Multiple Block Sizes

[Figure: one hash tree covers every unit size in the IO path. 4kB data segments (the minimum page size) feed leaf hashes; interior hashes aggregate them through 8kB, 16kB (the maximum page size), 32kB, 64kB, 128kB (the ZFS block size), 256kB, and 512kB levels up to the root hash over the 1024kB Lustre RPC.]
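
A sketch of the idea in the figure: one tree of 4kB leaf hashes serves every granularity, because the 16kB page hash, the 128kB block hash, and the 1MB RPC hash are simply interior nodes at different levels of the same tree. SHA-256 and the leaf/interior prefix bytes are stand-ins, not the deck's prescribed construction, and for brevity the sketch folds several binary levels at once (fan-outs of 4 and 8) rather than doubling through every size as the figure does:

```python
import hashlib

LEAF = 4096  # 4kB data segment: the smallest page size in the figure

def leaf_hash(segment: bytes) -> bytes:
    return hashlib.sha256(b"L" + segment).digest()   # LH()

def interior_hash(children) -> bytes:
    return hashlib.sha256(b"I" + b"".join(children)).digest()   # IH()

def level_up(hashes, fanout):
    # One tree level: combine each run of `fanout` child hashes.
    return [interior_hash(hashes[i:i + fanout])
            for i in range(0, len(hashes), fanout)]

data   = bytes(1024 * 1024)  # one 1MB Lustre RPC worth of data
leaves = [leaf_hash(data[o:o + LEAF]) for o in range(0, len(data), LEAF)]
pages  = level_up(leaves, 4)     # 64 x 16kB page hashes  (4 x 4kB each)
blocks = level_up(pages, 8)      # 8 x 128kB block hashes (8 x 16kB each)
root   = level_up(blocks, 8)[0]  # single root over the 1024kB RPC
```

With this shape, the client, the server, and the disk can each verify at their natural granularity (RPC, block, page) while sharing the same 4kB leaf hashes, which is what lets Lustre and ZFS overlap checksum coverage without rehashing the data.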


    End-to-End Integrity Client Write


[Figure: in user space, write(fd, buffer, size=1MB) supplies data at offsets 0 through 1024kB. The client copies or page-references it into 64 16kB pages (P0..P63), and the kernel computes the 4kB leaf hashes C0..C255 (four per page), the 16kB page hashes, and the 1MB RPC root hash K. K travels with the WRITE RPC to the server while the pages move by bulk RDMA.]


    End-to-End Integrity Server Write


[Figure: the server, with a 4kB page size, receives the WRITE RPC carrying K and pulls the 1024kB of data by bulk RDMA into pages S0..S255. It recomputes the 4kB leaf hashes and the 1MB RPC root hash K', then checks K = K'. The same leaf hashes roll up into 128kB block hashes B0..B7, which the DMU records in the block pointers covering 0-128kB (B0) through 896-1024kB (B7) on the OST.]
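
A hedged sketch of the server-side check in the figure, continuing the construction from the earlier slide (names like server_verify are illustrative, not Lustre symbols): recompute the root K' from the bulk-RDMA data, compare it with the K carried in the WRITE RPC, and on success reuse the 128kB-level interior hashes B0..B7 as the per-block checksums, as the deck proposes, instead of rehashing 1MB of data.

```python
import hashlib

def lh(seg): return hashlib.sha256(b"L" + seg).digest()
def ih(kids): return hashlib.sha256(b"I" + b"".join(kids)).digest()
def fold(hashes, fanout):
    return [ih(hashes[i:i + fanout]) for i in range(0, len(hashes), fanout)]

def server_verify(K: bytes, bulk: bytes):
    # The server's 4kB pages S0..S255 land on the same 4kB leaf
    # boundaries the client hashed, so the same tree is rebuilt.
    leaves = [lh(bulk[o:o + 4096]) for o in range(0, len(bulk), 4096)]
    blocks = fold(fold(leaves, 4), 8)   # B0..B7: 128kB block hashes
    K_prime = fold(blocks, 8)[0]
    if K_prime != K:                    # the "K = K'?" check in the figure
        raise IOError("data corrupted between client and server")
    return blocks                       # reuse as per-block DMU checksums
```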


    Integrity Checking Targets


The scale of the proposed filesystem is enormous:
  1 trillion files checked in 100h
  2-4PB of MDT filesystem metadata
  ~3 million files/sec, 3GB/s+ for one pass (10^12 files / 360,000s ≈ 2.8M files/s)

Need to handle CMD coherency as well:
  Distributed link counts on files and directories
  Directory parent/child relationships
  Filename to FID to inode mapping


    Distributed Coherency Checking


Traverse the MDT filesystem(s) only once:
  Piggyback on ZFS resilver/scrub to avoid extra IO
  Lustre processes blocks after checksum validation

Validate OST/remote MDT data opportunistically (a sketch follows below):
  Track which objects have not been validated (1 bit/object)
  Each object holds a back-reference to the user that validates it
  The back-reference avoids keeping a global reference table

Lazily scan OST/MDT in the background:
  Load-sensitive scanning
  Continuous online verification/fix of distributed state
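
A toy model of the back-reference scheme above (every name and field here is hypothetical; the real mechanism lives in the server on-disk format and scrub code): each OST object records the FID of its owning file, so a lazy background pass can set the per-object validated bit by checking just that one pair, with no global reference table.

```python
from dataclasses import dataclass

@dataclass
class OstObject:
    object_id: int
    back_ref: int            # FID of the MDT file that owns this object
    validated: bool = False  # the "1 bit/object" scrub state

def lazy_scan(objects, mdt_files):
    # Opportunistic pass: validate what we can now; anything still
    # unvalidated is picked up by a later, load-sensitive scan.
    for obj in objects:
        owner = mdt_files.get(obj.back_ref)
        if owner and obj.object_id in owner["objects"]:
            obj.validated = True
    return [o for o in objects if not o.validated]  # needs another pass
```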


Lustre DMU Development Status

Data Alpha: OST basic functionality
MD Alpha: MDS basic functionality, beginning recovery support
Beta: layouts, performance improvements, full recovery
GA: full performance in a production release



    Questions?

    Thank You



    Hash Tree For Non-contiguous Data

Definitions:
  LH(x) = hash of data x (leaf)
  IH(x) = hash of data x (interior)
  x + y = concatenation of x and y

Leaf hashes (only segments 0, 1, 4, 5, and 7 are present):
  C0 = LH(data segment 0)
  C1 = LH(data segment 1)
  C4 = LH(data segment 4)
  C5 = LH(data segment 5)
  C7 = LH(data segment 7)

Interior hashes:
  C01 = IH(C0 + C1)
  C45 = IH(C4 + C5)
  C457 = IH(C45 + C7)

Root hash:
  K = IH(C01 + C457)
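
The equations above transcribe directly into code. A minimal sketch, with SHA-256 standing in for the deck's unspecified LH/IH functions, and single-byte domain prefixes added so leaf and interior hashes cannot collide (an assumption, not something the slide specifies):

```python
import hashlib

def LH(x): return hashlib.sha256(b"L" + x).digest()  # leaf hash
def IH(x): return hashlib.sha256(b"I" + x).digest()  # interior hash

# A sparse write: only data segments 0, 1, 4, 5, and 7 are present.
seg = {i: bytes([i]) * 4096 for i in (0, 1, 4, 5, 7)}

C = {i: LH(seg[i]) for i in seg}   # Ci = LH(data segment i)
C01  = IH(C[0] + C[1])
C45  = IH(C[4] + C[5])
C457 = IH(C45 + C[7])              # the lone C7 pairs with subtree C45
K    = IH(C01 + C457)              # root hash covering the sparse tree
```

Missing segments simply never contribute a leaf, so the tree's shape, and therefore K, is well defined for any subset of segments in the RPC.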


    T10-DIF vs. Hash Tree

T10-DIF:
  CRC-16 guard word:
    All 1-bit errors
    All adjacent 2-bit errors
    Single 16-bit burst error
    10^-5 bit error rate
  32-bit reference tag:
    Misplaced write (≠ 2^n TB)
    Misplaced read (≠ 2^n TB)

Hash tree:
  Fletcher-4 checksum:
    All 1-, 2-, 3-, 4-bit errors
    All errors affecting 4 or fewer 32-bit words
    Single 128-bit burst error
    10^-13 bit error rate
  Hash tree:
    Misplaced read
    Misplaced write
    Phantom write
    Bad RAID reconstruction


    ZFS Hybrid Pool Example

Common platform:
  4 Xeon 7350 processors (16 cores)
  32GB FB DDR2 ECC DRAM
  OpenSolaris(TM) with ZFS

Hybrid configuration:
  Five 400GB 4200 RPM SATA drives
  32GB SSD ZIL device
  80GB SSD cache device

Traditional configuration:
  Seven 146GB 10,000 RPM SAS drives


ZFS Hybrid Pool Example (results)

Hybrid storage (DRAM + SSD + 5x 4200 RPM SATA) relative to traditional storage (DRAM + 7x 10K SAS):

  Raw capacity   2x (2TB vs. ~1TB)
  Read IOPs      3.2x
  Write IOPs     1.1x
  Power          1/5