zfs overview

Upload: muhammed-kunhi-jalali-bovikanam

Post on 05-Oct-2015


  • Sun Microsystems Proprietary / Confidential Need to Know

    ZFS: The Zettabyte Filesystem February 10, 2003 Page 1

    ZFS: The Zettabyte Filesystem

    [email protected]

  • The Perfect Filesystem

    Write my data

    Keep it safe

    Read it back

    Do it fast

    Don't hassle me

  • Existing Filesystems

    Write my data?
      - limited size (16TB for UFS)
      - limited number of files
      - limited directory entries

  • Existing Filesystems

    Keep it safe?
      - bit rot causes silent data corruption
      - no defense against phantom writes, misdirections, other firmware bugs
      - no defense against administrative errors (e.g. swap on active filesystem device)
      - no security: spying, tampering, theft

  • Existing Filesystems

    Read it back?
      - no data integrity checks
      - no data authentication
      - data might be good, might be bad
          - don't know
          - couldn't fix it if we did
      - like running a server without DRAM parity

  • Existing Filesystems

    Do it fast?
      - linear-time directory ops
      - linear-time newfs(1M), fsck(1M)
      - limited read/write concurrency
      - fixed block size
      - fixed stripe width
      - poor random write performance
      - slow mirroring

  • Existing Filesystems

    Don't hassle me?
      - create a partition for every FS
      - grow: manual process
      - shrink: not possible
      - remember a bunch of c0t0d0s0 names
      - edit /etc/vfstab by hand
      - wait around for fsck(1M)
      - take system down to upgrade disks

  • ZFS Objective

    End the suffering

  • The ZFS Filesystem

    Write my data!
      - immense capacity (128-bit)
      - there's no SI prefix for this!
      - zettabyte = 70-bit (a billion TB)
      - ZFS capacity: 256 quadrillion ZB
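
The capacity figures on this slide follow from powers of two. A quick check, assuming the slide's "billion" and "quadrillion" are meant in the binary sense (2^30 and 2^50) rather than as exact powers of ten:

```python
# Sketch: verify the slide's capacity arithmetic with binary units.
ZETTABYTE = 2**70          # "zettabyte = 70-bit"
TERABYTE = 2**40
POOL_BYTES = 2**128        # address space of a 128-bit storage pool

tb_per_zb = ZETTABYTE // TERABYTE       # 2**30, roughly a billion
zb_per_pool = POOL_BYTES // ZETTABYTE   # 2**58

# Assumption: "quadrillion" is read as the binary 2**50 (about 1.13e15).
BINARY_QUADRILLION = 2**50

print(tb_per_zb)                          # 1073741824
print(zb_per_pool // BINARY_QUADRILLION)  # 256
```

In decimal terms 2^58 is closer to 288 quadrillion, so the "256" only comes out exactly under the binary reading.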

  • The ZFS Filesystem

    Keep it safe!
      - self-healing data
      - copes with every class of error
          - bit rot
          - phantom writes
          - misdirected reads and writes
          - administrative errors
      - disk scrubbing
      - real-time remote replication
      - encryption

  • The ZFS Filesystem

    Read it back!

    provable data integrity model

    detects and corrects errors

  • The ZFS Filesystem

    Do it fast!
      - write sequentialization
      - dynamic striping
      - multiple block sizes
      - constant-time snapshots
      - concurrent, constant-time directory ops
      - byte-range locking for concurrent writes
      - sync semantics at async speed (critical for good NFS performance)

  • The ZFS Filesystem

    Don't hassle me!
      - FS creation is as easy as mkdir
      - grow and shrink are automatic
      - no raw device names to remember
      - no volumes at all
      - no more fsck(1M)
      - no more editing /etc/vfstab
      - all administration online

  • Organizing Principles

    Simple administration

    Extensible, modular design

    Provable data integrity model

    Always-available data

    High performance

  • Simple Administration

    Pooled storage

    Immense capacity

    Quotas and Reservations

    User undo

  • Volumes vs. Storage Pools

    Traditional volumes
      - partition per FS
      - FS/volume interface: block-level I/O

    Pooled storage
      - FSes share space
      - ZFS/pool interface: object transactions

    [Diagram, traditional: three FS boxes, each on its own Volume (Virtual Disk). No space sharing. Naming and storage tightly bound.]

    [Diagram, pooled: three ZFS filesystems on one Storage Pool. All space shared. Naming and storage decoupled.]

  • Volumes vs. Storage Pools, cont'd

    Both manage disks and provide mirroring

    Traditional FS/volume model: volume provides space, but FS manages it
      - volume doesn't know which blocks are in use
      - FS can't easily grow or shrink
      - FS creation requires new partition

    ZFS model: SPA provides and manages space
      - many filesystems share space
      - grow and shrink are implicit
      - FS create/delete are just like mkdir/rmdir
      - only one pool to manage (vs. volume per FS)

  • Volumes vs. Storage Pools, cont'd

    Advantages of pooled storage
      - reduces fragmentation
      - simplifies administration
      - decouples logical and physical structure
      - filesystems named by default mount point

    Proof of concept: tmpfs
      - all tmpfs mounts share common swap space
      - administration is trivial: swap -a / swap -d

    FS becomes more powerful administrative point
      - no longer tied to physical configuration
      - more like a directory with heritable attributes

  • Immense Capacity

    128-bit storage pools

    128-bit filesystems

    128-bit files, but limited to 64-bit access until we have 128-bit OS support

    64-bit max files per dataset

    64-bit max files per directory

    statvfs128() will be needed

  • Quotas and Reservations

    Traditional model
      - quotas: per-user UFS bolt-on (cred structures all the way down to bmap)
      - reservations: no (nothing to reserve against)

    ZFS model
      - FS is now the administrative point
      - FS per home directory, project, workspace, ...
      - quotas: per-FS
      - reservations: per-FS
      - group quotas, hierarchical quotas almost free

  • User Undo

    Unlimited snapshots
      - recover previous version of a file

    Undelete
      - recover recently deleted file

    No sysadmin intervention required

  • Organizing Principles

    Simple administration

    Extensible, modular design

    Provable data integrity model

    Always-available data

    High performance

  • ZFS is Object-Based

    An object is a "flat file"

    Everything is stored in objects: user data, znodes, directories, free block lists, etc.

    Arbitrarily complex operations reduce to reads and writes on a set of objects

    Simplifies interfaces, design, and analysis
      - single I/O path
      - single interposition point
      - single object read/write model

  • ZFS Components

    ZPL (ZFS POSIX Layer): standard POSIX semantics (permission, mode, timestamps); translates vnode ops into object read/write

    ZAP (ZFS Attribute Processor): constant-time, concurrent attribute operations (directories, object properties, etc.)

    DMU (Data Management Unit): transactions, caching, object translations

    SPA (Storage Pool Allocator): space allocation, replication, checksums, resource controls, encryption, compression, fault management

  • SPA Components

    Gather non-dependent I/O into I/O groups

    Allocate space from metaslab layer

    Apply pluggable modules
      - compression
      - encryption
      - checksum

    Dispatch parallel, async I/O to vdev stack

    Issue disk I/O

    [Diagram labels: DMU; SPA; IOG; metaslab allocator; compression; encryption; checksum; mirror vdev; disk vdev; disk vdev]

  • Organizing Principles

    Simple administration

    Extensible, modular design

    Provable data integrity model

    Always-available data

    High performance

  • Provable Data Integrity Model

    All operations are copy-on-write
      - never overwrite live data

    All operations are transactional
      - related changes succeed or fail as a whole

    All data is checksummed
      - no silent data corruption

  • Copy-on-Write TX Model

    Problem: modify several objects atomically

    DMU provides transactional interface
      - ZPL groups work into transactions
      - DMU sends whole transactions to SPA
      - SPA commits transaction groups

    SPA never modifies active blocks
      - entire storage pool is a tree of blocks rooted at the "uberblock"
      - transactions COW nodes of the tree
      - transaction group is committed when uberblock is rewritten to point to new tree

  • Copy-on-Write TX Model

    initial block tree

  • Copy-on-Write TX Model

    write: COWs a data block

  • Copy-on-Write TX Model

    COW its level-1 indirect block

  • Copy-on-Write TX Model

    COW its level-2 indirect block

  • Copy-on-Write TX Model

    rewrite the uberblock (atomic)
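
The sequence on these slides (copy the data block, copy each indirect block above it, then switch the uberblock) can be sketched as path copying on an immutable tree. This is a minimal in-memory illustration, not the SPA's implementation; `Block`, `Pool`, and `cow_write` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Block:
    """Immutable block: data (bytes) or an indirect block (tuple of children)."""
    payload: Union[bytes, Tuple["Block", ...]]


def cow_write(node: Block, path: Tuple[int, ...], data: bytes) -> Block:
    """Return a new subtree with the block at `path` replaced; nothing is mutated."""
    if not path:
        return Block(data)                       # fresh copy of the data block
    children = list(node.payload)
    children[path[0]] = cow_write(children[path[0]], path[1:], data)
    return Block(tuple(children))                # fresh copy of the indirect block


class Pool:
    def __init__(self, root: Block):
        self.uberblock = root                    # single pointer to the live tree

    def commit(self, path: Tuple[int, ...], data: bytes) -> None:
        new_root = cow_write(self.uberblock, path, data)
        self.uberblock = new_root                # the one "atomic" switch


left = Block((Block(b"A"), Block(b"B")))
right = Block((Block(b"C"), Block(b"D")))
pool = Pool(Block((left, right)))

old_root = pool.uberblock
pool.commit((0, 1), b"B2")
```

After the commit, `old_root` still describes the complete previous tree, and the untouched `right` subtree is shared between old and new trees rather than copied.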

  • Snapshots

    COW TX model enables constant-time snapshots
      - snapshot storage pool by copying its uberblock
      - snapshot single FS by copying its root block
      - snapshot single file by copying its dnode

    Provides data recovery and fixed target for backup

    Snapshot delta = incremental

    Unlimited number of snapshots
      - cf. 1 with UFS, 32 with WAFL
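
Why a snapshot is constant-time, and why the delta between a snapshot and the live tree gives an incremental, can be shown with a toy tree of nested tuples. This is a sketch under the assumption that unchanged subtrees are shared by reference; `cow_set` and `delta` are hypothetical helpers.

```python
def cow_set(tree, path, value):
    """Copy-on-write update of a nested-tuple tree; returns the new root."""
    if not path:
        return value
    i = path[0]
    return tree[:i] + (cow_set(tree[i], path[1:], value),) + tree[i + 1:]


def delta(old, new, path=()):
    """Yield (path, block) for leaves of `new` that are not shared with `old`."""
    if old is new:
        return                       # shared subtree: nothing changed below here
    if not isinstance(new, tuple):
        yield path, new
        return
    for i, child in enumerate(new):
        old_child = old[i] if isinstance(old, tuple) and i < len(old) else None
        yield from delta(old_child, child, path + (i,))


root = ((b"a", b"b"), (b"c", b"d"))
snapshot = root                      # constant time: keep the old root pointer
root = cow_set(root, (1, 0), b"C")   # live tree moves on; snapshot is untouched

changed = list(delta(snapshot, root))
print(changed)                       # [((1, 0), b'C')]
```

The walk never descends into the left subtree, because both roots point at the same object there. The cost of the incremental therefore tracks the amount of change, not the size of the tree.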

  • Snapshots

    Save old uberblock - describes complete snapshot

    [Diagram labels: snapshot uberblock; current uberblock]

  • Checksums

    Traditional model: checksum stored with block

    Fine for detecting bit rot, but:
      - can't detect phantom writes, misdirections
      - can't validate the checksum itself
      - can't protect against tampering

  • Checksums, cont'd

    SPA model: checksum stored with indirect block

    Self-validating

    Detects bit rot, phantom writes, misdirections, admin error (e.g. swap on active ZFS disk)
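
A small sketch of why keeping the checksum in the parent matters. With a phantom write the disk keeps the old block, which would still match a checksum stored alongside it, but it cannot match the checksum the parent recorded for the new contents. Truncated SHA-256 and the `BlockPointer` class are stand-in assumptions, not the SPA's actual checksum or structures.

```python
import hashlib


def checksum(data: bytes) -> bytes:
    # Assumption: SHA-256 truncated to 64 bits stands in for the real checksum.
    return hashlib.sha256(data).digest()[:8]


class BlockPointer:
    """Lives in the parent (indirect) block: child address plus child checksum."""

    def __init__(self, addr: int, data: bytes):
        self.addr = addr
        self.cksum = checksum(data)


def write(disk: dict, addr: int, data: bytes) -> BlockPointer:
    disk[addr] = data
    return BlockPointer(addr, data)


def read(disk: dict, bp: BlockPointer) -> bytes:
    data = disk.get(bp.addr, b"")
    if checksum(data) != bp.cksum:
        raise IOError(f"checksum mismatch at address {bp.addr}")
    return data


disk = {7: b"old contents"}
bp = write(disk, 7, b"new contents")
ok = read(disk, bp)

# Phantom write: the device acknowledged the write but kept the old data.
disk[7] = b"old contents"
try:
    read(disk, bp)
    detected = False
except IOError:
    detected = True
```

The same check catches a misdirected write, since the block found at the expected address will not hash to the value stored in the parent.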

  • Checksums, cont'd

    Physical separation improves fault isolation, yet doesn't require additional I/O

    64-bit strength ensures data integrity
      - provides 99.99999999999999999% ("nineteen nines") error detection probability

    Checksum vectoring provides flexibility
      - weaker checksums for performance
      - faster checksums in the future
      - secure checksums for data authentication (uberblock checksum provides unforgeable signature for the entire storage pool)
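
The "nineteen nines" figure can be reproduced under one assumption: an ideal 64-bit checksum lets a corrupted block through with probability 2^-64. The slide does not state the derivation, so this is a plausible reading rather than the authors' own calculation.

```python
from decimal import Decimal, getcontext

getcontext().prec = 30

# Assumption: a corrupted block matches an ideal 64-bit checksum with
# probability 2**-64.
miss = Decimal(1) / Decimal(2**64)
detect_percent = (Decimal(1) - miss) * 100

text = f"{detect_percent:.22f}"
digits = text.replace(".", "")
nines = len(digits) - len(digits.lstrip("9"))   # count the leading 9s

print(text)
print(nines)   # 19
```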

  • Organizing Principles

    Simple administration

    Extensible, modular design

    Provable data integrity model

    Always-available data

    High performance

  • Always-Available Data

    Always-consistent on-disk format

    Elimination of fsck(1M)

    Self-Healing Data

    Failure Prediction and Disk Scrubbing

    Hot Space

    Data Migration

    Real-Time Remote Replication

    User Undo

  • Always-Consistent On-Disk Format

    ZFS is always self-consistent
      - follows from COW transaction model

    Doesn't depend on the intent log

    No more fsck(1M)
      - no "clean bit"
      - no off-line maintenance
      - ZFS is always mountable

  • Self-Healing Data

    Media error under traditional FS:
      - bad user data causes silent data corruption
      - bad metadata causes SDC, panic, or both

    Media error under ZFS:
      - checksum detects data corruption
      - SPA gets valid data from another replica and uses it to repair the damaged one
      - SPA returns valid data to application
      - no sysadmin intervention required
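
The ZFS read path listed above (detect by checksum, fetch a good replica, repair the bad one, return good data) can be sketched for a two-way mirror. Dicts stand in for disks, truncated SHA-256 stands in for the checksum, and `self_healing_read` is a hypothetical name.

```python
import hashlib


def checksum(data: bytes) -> bytes:
    # Assumption: truncated SHA-256 stands in for the pool's checksum function.
    return hashlib.sha256(data).digest()[:8]


def self_healing_read(replicas, addr, expected):
    """Return valid data for `addr`, rewriting any replica that fails its checksum.

    `replicas` is a list of dicts (addr -> bytes), one per side of the mirror.
    `expected` is the checksum recorded in the parent block pointer.
    """
    good = None
    damaged = []
    for disk in replicas:
        data = disk.get(addr)
        if data is not None and checksum(data) == expected:
            if good is None:
                good = data
        else:
            damaged.append(disk)
    if good is None:
        raise IOError("no valid replica")
    for disk in damaged:
        disk[addr] = good            # repair the damaged copy in place
    return good


side_a = {0: b"payroll"}
side_b = {0: b"payrolk"}             # bit rot on one side of the mirror
expected = checksum(b"payroll")

data = self_healing_read([side_a, side_b], 0, expected)
```

The caller gets `b"payroll"` back and never sees the error; the bad side has been rewritten by the time the read returns.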

  • Failure Prediction

    SPA automatically migrates data from failing devices to healthy devices

    Detects health by monitoring error rate

    Employs disk scrubbing to detect latent errors while they're still correctable

  • Hot Space

    [Diagram labels: hot spare model; "hot space" model]

    No more dedicated hot spares
      - "hot space" spread across all devices

    Keeps all devices active
      - uses all available I/O bandwidth
      - improves drive utilization
      - improves failure prediction
      - prevents silent atrophy

  • Data Migration

    Allow transparent disk upgrades and data migration from failing devices

    Apply VM principles to storage
      - DMU names blocks by 128-bit DVA (Data Virtual Address)
      - high-order 64 bits specify metaslab
      - SPA translates metaslab to [...]
      - SPA can migrate metaslabs from one vdev to another without affecting any DMU state

    Data remains available during migration
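
The virtual-memory analogy can be sketched as a translation table owned by the SPA. The split of the DVA into a high 64-bit metaslab id comes from the slide. Treating the low 64 bits as an offset, and mapping each metaslab to a named vdev, are assumptions made for illustration; `Spa`, `translate`, and `migrate` are hypothetical names.

```python
MASK64 = (1 << 64) - 1


def split_dva(dva: int):
    """Split a 128-bit DVA into (metaslab id, low 64 bits)."""
    return dva >> 64, dva & MASK64


class Spa:
    def __init__(self):
        # Hypothetical table: metaslab id -> name of the vdev that holds it.
        self.metaslab_to_vdev = {}

    def translate(self, dva: int):
        metaslab, offset = split_dva(dva)
        return self.metaslab_to_vdev[metaslab], offset

    def migrate(self, metaslab: int, new_vdev: str) -> None:
        # Only the SPA's table changes; DVAs held by the DMU stay valid.
        self.metaslab_to_vdev[metaslab] = new_vdev


spa = Spa()
spa.metaslab_to_vdev[3] = "old_disk"
dva = (3 << 64) | 0x2000

before = spa.translate(dva)
spa.migrate(3, "new_disk")
after = spa.translate(dva)
```

The DVA itself is unchanged across the migration, which is the property the slide relies on: upper layers keep their block names while the physical location moves.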

  • Real-Time Remote Replication

    Everything in ZFS is an object

    Every change is just a write to an object

    Writes are always batched into TX groups

    Contents of TX group can be sent async

    Latency insensitive!

    Occasional ACK for remote TX group commit

  • Organizing Principles

    Simple administration

    Extensible, modular design

    Provable data integrity model

    Always-available data

    High performance

  • High Performance

    Write Sequentialization

    Dynamic striping

    Parallel three-phase TX groups

    Intelligent prefetch

    Multiple block sizes

    Sync semantics at async speed

    Concurrent, constant-time directory ops

    POSIX-compliant concurrent writes

    Hot space

  • Write Sequentialization

    Traditional FS: random file writes become random disk writes

    ZFS: random file writes become sequential disk writes
      - follows from COW model
      - modified blocks are newly allocated
      - SPA has complete allocation freedom
      - SPA chooses sequential free blocks

    Cost of writing extra ZFS metadata more than offset by improved locality
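
A toy allocator shows the effect. Because a modified block is always written to a newly allocated location, the allocator can hand out consecutive physical addresses no matter how scattered the logical block numbers are. `CowAllocator` is a hypothetical name, and a real allocator would also need to track and reuse freed space.

```python
class CowAllocator:
    """Toy allocator: every modified block gets the next free physical address."""

    def __init__(self):
        self.next_free = 0
        self.block_map = {}          # logical block number -> physical address

    def write(self, logical_block: int) -> int:
        phys = self.next_free        # never overwrite in place
        self.next_free += 1
        self.block_map[logical_block] = phys
        return phys


alloc = CowAllocator()
random_file_writes = [907, 12, 455, 3, 788]
physical = [alloc.write(b) for b in random_file_writes]
print(physical)                      # [0, 1, 2, 3, 4]
```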

  • Dynamic Striping

    Traditional striping: spread data across multiple devices at fixed stride
      - Inflexible: can't change stripe width, can't add or remove devices

    Dynamic striping: round-robin allocation
      - balances writes across all available devices
      - enabled by COW model

    [Diagram: block numbers laid out across five devices]

        0    1    2    3    4
        5    6    7    8    9
       10   11   12   13   14
      ...  ...  ...  ...  ...
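
Round-robin allocation over a device list that can change at any time might look like the sketch below. `DynamicStripe` is a hypothetical name, and a real allocator would presumably also weigh free space and device load, which this ignores.

```python
class DynamicStripe:
    """Round-robin allocation over whatever devices are currently in the pool."""

    def __init__(self, devices):
        self.devices = list(devices)
        self.cursor = 0

    def add_device(self, name: str) -> None:
        self.devices.append(name)    # takes effect on the next allocation

    def allocate(self) -> str:
        dev = self.devices[self.cursor % len(self.devices)]
        self.cursor += 1
        return dev


pool = DynamicStripe(["d0", "d1"])
first = [pool.allocate() for _ in range(4)]
pool.add_device("d2")
later = [pool.allocate() for _ in range(3)]

print(first)   # ['d0', 'd1', 'd0', 'd1']
print(later)   # ['d1', 'd2', 'd0']
```

Existing blocks never need restriping when the stripe widens, because only new allocations consult the device list.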

  • Three-Phase Transaction Groups

    Open: accepting new transactions

    Quiescing: waiting for transactions to finish

    Syncing: pushing changes to disk

    Up to three transaction groups active
      - one in each state - prevents burstiness
      - uses all available disk bandwidth

    [Diagram: transaction groups moving through Open, Quiescing, Syncing, and Closed over time]
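
The pipeline can be simulated with a fixed clock tick per state. The tick is an assumption made to keep the sketch short; the slide implies transitions are driven by events, not by a timer. `run_pipeline` is a hypothetical name.

```python
from collections import deque

STATES = ["open", "quiescing", "syncing"]


def run_pipeline(ticks: int):
    """Advance transaction groups one state per tick; return a per-tick view."""
    active = deque()                 # (txg number, state index), oldest first
    next_txg = 1
    history = []
    for _ in range(ticks):
        # Each active group moves to its next state; groups past syncing close.
        active = deque((n, s + 1) for n, s in active if s + 1 < len(STATES))
        active.append((next_txg, 0))           # a fresh group opens every tick
        next_txg += 1
        history.append({STATES[s]: n for n, s in active})
    return history


history = run_pipeline(4)
for view in history:
    print(view)
```

From the third tick on there is exactly one group in each state, so new transactions are accepted while an older group is being written out.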

  • Multiple Block Sizes

    No block size is optimal for everything
      - large blocks: less metadata
      - small blocks: more efficient for small objects
      - record-structured files have natural granularity; we want to match it to avoid read/modify/write

    ZFS supports any power of two block size

    Per-object granularity
      - automatic block size selection by default
      - manual override

    Enables transparent block-based compression
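
One simple policy for automatic selection is to pick the smallest power-of-two block that holds the object. The slides do not say how ZFS chooses, so this is a guess at the idea rather than the actual rule. The 512-byte and 128 KiB bounds are illustrative assumptions, and `pick_block_size` is a hypothetical name.

```python
def pick_block_size(object_size: int,
                    min_size: int = 512,
                    max_size: int = 128 * 1024) -> int:
    """Smallest power-of-two block size that holds the object, within bounds.

    `min_size` and `max_size` must be powers of two; both defaults are
    illustrative assumptions, not values taken from the slides.
    """
    size = min_size
    while size < object_size and size < max_size:
        size *= 2
    return size


print(pick_block_size(300))        # 512
print(pick_block_size(5000))       # 8192
print(pick_block_size(10**6))      # 131072
```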

  • Multiple Block Sizes, cont'd

    Why not extents?
      - extents don't COW: writes force extent breaks
      - greater code complexity

    Multiple block sizes combine the simplicity of blocks with the metadata savings of extents

  • Sync Semantics at Async Speed

    Review: ZFS is always self-consistent on disk

    However: after system crash, ZFS won't contain transactions since last sync

    Use intent log to recover recent transactions
      - log metadata only: lose recent writes (UFS)
      - log user + metadata: recover everything (NFS)
      - log to disk: wait for one sequential disk write
      - log to NVRAM on I/O bus: fast (NetApp filers)
      - log to NVRAM on main memory bus: blazing

    Ideal configuration: log all ops to NVRAM
      - need HW/sales/marketing on board
      - big payoff: only a system vendor can do this
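
The role of the intent log can be sketched as follows. A synchronous write returns once the log record is stable, the main pool catches up at the next transaction group sync, and after a crash the log is replayed on top of the last consistent state. `ToyFs` and its fields are hypothetical, and a Python list stands in for stable log storage.

```python
class ToyFs:
    def __init__(self):
        self.on_disk = {}       # state as of the last committed transaction group
        self.in_memory = {}     # on_disk plus transactions not yet synced
        self.intent_log = []    # stands in for stable storage (disk or NVRAM)

    def sync_write(self, name: str, data: bytes) -> None:
        self.intent_log.append((name, data))   # cheap sequential log record
        self.in_memory[name] = data            # caller may return now

    def txg_sync(self) -> None:
        self.on_disk = dict(self.in_memory)
        self.intent_log.clear()                # log records are now redundant

    def crash_and_recover(self) -> None:
        self.in_memory = dict(self.on_disk)    # consistent, but possibly stale
        for name, data in self.intent_log:     # replay recent transactions
            self.in_memory[name] = data


fs = ToyFs()
fs.sync_write("a", b"1")
fs.txg_sync()
fs.sync_write("b", b"2")        # logged, but not yet in a synced group
fs.crash_and_recover()
```

After recovery the write to `"b"` is visible even though it never reached a synced transaction group. The on-disk state was consistent without the log; the log only restores recency.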

  • Fast Directory Operations

    Large directories: need constant-time operations (lookup, create, delete)

    Hot directories: need concurrent operations

    Solution: extendible hashing
      - block-based
      - amortized growth cost
      - short chains for constant-time ops
      - per-block locking for high concurrency
      - readdir: returns entries in hash-value order
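
For readers unfamiliar with the technique, here is a textbook in-memory extendible hash table. It illustrates the general algorithm only and is not the ZAP's on-disk layout: each `Bucket` plays the role of a block, and locking is omitted. All class and method names are hypothetical.

```python
class Bucket:
    def __init__(self, depth: int):
        self.depth = depth           # local depth: hash bits this bucket owns
        self.items = {}


class ExtendibleHash:
    def __init__(self, bucket_capacity: int = 4):
        self.capacity = bucket_capacity
        self.global_depth = 0
        self.directory = [Bucket(0)]             # always 2**global_depth slots

    def _slot(self, key) -> int:
        return hash(key) & ((1 << self.global_depth) - 1)

    def lookup(self, key):
        return self.directory[self._slot(key)].items.get(key)

    def insert(self, key, value) -> None:
        # Note: loops forever if more than `capacity` keys share a full hash.
        while True:
            bucket = self.directory[self._slot(key)]
            if key in bucket.items or len(bucket.items) < self.capacity:
                bucket.items[key] = value
                return
            self._split(bucket)

    def _split(self, bucket: Bucket) -> None:
        if bucket.depth == self.global_depth:
            self.directory = self.directory + self.directory   # double directory
            self.global_depth += 1
        bucket.depth += 1
        sibling = Bucket(bucket.depth)
        high_bit = 1 << (bucket.depth - 1)
        # Redirect directory slots whose newly significant bit is set.
        for i, b in enumerate(self.directory):
            if b is bucket and i & high_bit:
                self.directory[i] = sibling
        # Move the entries that now belong to the sibling.
        for k in [k for k in bucket.items if hash(k) & high_bit]:
            sibling.items[k] = bucket.items.pop(k)


table = ExtendibleHash()
for i in range(100):
    table.insert(i, i * i)
```

A lookup touches one directory slot and one bucket regardless of table size, and only the overflowing bucket is split on growth, which is where the amortized cost and per-block locking granularity come from.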

  • Concurrent Writes

    Existing filesystems force trade-off between POSIX compliance and write concurrency

    ZFS employs byte-range locking to allow maximum concurrency while satisfying POSIX overlapping write semantics

    [Diagram labels: Parallel read/write; Serialized]
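
A minimal byte-range lock shows the idea: writers to disjoint ranges proceed in parallel, while overlapping writers serialize. This sketch handles exclusive (write) locks only, and `RangeLock` is a hypothetical name, not a ZFS interface.

```python
import threading


class RangeLock:
    """Exclusive locks on half-open byte ranges [start, end)."""

    def __init__(self):
        self._cond = threading.Condition()
        self._held = []                      # list of (start, end)

    def _overlaps(self, start: int, end: int) -> bool:
        return any(s < end and start < e for s, e in self._held)

    def acquire(self, start: int, end: int) -> None:
        with self._cond:
            while self._overlaps(start, end):
                self._cond.wait()            # blocked by an overlapping writer
            self._held.append((start, end))

    def try_acquire(self, start: int, end: int) -> bool:
        with self._cond:
            if self._overlaps(start, end):
                return False
            self._held.append((start, end))
            return True

    def release(self, start: int, end: int) -> None:
        with self._cond:
            self._held.remove((start, end))
            self._cond.notify_all()


lock = RangeLock()
lock.acquire(0, 100)
parallel_ok = lock.try_acquire(100, 200)     # disjoint range: granted at once
overlap_ok = lock.try_acquire(50, 150)       # overlaps held ranges: refused
lock.release(0, 100)
lock.release(100, 200)
after_release = lock.try_acquire(50, 150)    # now free
```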

  • Compression

    Block-level compression in SPA
      - transparent to all other layers
      - enabled by multiple block size support

    Per-file, per-filesystem, or per-pool

    Vectoring for different compression functions

    [Diagram: DMU translations are all 8k; SPA block allocations vary with compression (8k, 4k, 2k, 8k)]
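
The diagram's point, that fixed-size logical blocks map to variable-size physical allocations, can be sketched by compressing an 8k block and rounding the result up to a power of two. zlib, the 512-byte floor, and the fall-back to the uncompressed size are illustrative assumptions; `allocated_size` is a hypothetical name.

```python
import os
import zlib

LOGICAL = 8192                       # every logical block is 8k, as in the diagram


def allocated_size(block: bytes, min_size: int = 512) -> int:
    """Physical allocation for one logical block.

    Returns the smallest power of two that holds the compressed bytes, or the
    full logical size when compression does not help.
    """
    compressed = len(zlib.compress(block))
    size = min_size
    while size < compressed:
        size *= 2
    return min(size, len(block))


zeros = bytes(LOGICAL)               # highly compressible
noise = os.urandom(LOGICAL)          # effectively incompressible
sizes = [allocated_size(zeros), allocated_size(noise)]
```

Layers above see two 8k blocks in both cases; only the allocation underneath differs.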

  • Encryption and Data Security

    Block-level encryption in SPA
      - transparent to all other layers
      - supports any symmetric block cipher mode: DES, AES, IDEA, RC6, Blowfish, SEAL, OCB...

    Per-filesystem or per-pool

    Vectoring for different encryption functions

    Data authentication via secure checksums

    Open issues:
      - key management
      - larger data security model

  • Futures

    POSIX isn't the only game in town
      - DMU as native Oracle API

    Object-based appliances
      - agnostic: NFS, database, volume emulation

    DMU as "foundation classes"

    [Diagram labels, layered stack: SPA; DMU; ZPL, NFS, Oracle, zvol; UFS, *FS, raw]

  • Case Study: Jurassic on UFS/SVM

    Upgrading disks
      - major down time
      - significant manual labor

    FS-to-user mapping
      - single FS impossible: exceeds 1TB
      - FS per user impractical: fragments storage

    /var/mail
      - create/delete .lock files: serial and slow

  • Case Study: Jurassic on UFS/SVM

    Quotas and reservations
      - quotas: too expensive and broken to use
      - reservations: no such concept

    User error recovery
      - restore from tape
      - last 24 hours lost

    Reliability / Availability
      - several instances of data loss this year
      - hours of down time for fsck(1M)

  • Case Study: Jurassic on ZFS

    Upgrading disks
      - add new disks to storage pool
      - remove old disks from storage pool (SPA auto-migrates the data)

    FS-to-user mapping
      - single FS possible
      - FS per user better: enables per-user reservations, snapshots, encryption, etc.

    /var/mail
      - create/delete .lock files: parallel and fast

  • Case Study: Jurassic on ZFS

    Quotas and reservations
      - per-filesystem: e.g. per-workspace, per-home directory, per-project

    User error recovery
      - user undo
      - restore from snapshot
      - either way, no sysadmin required

    Reliability / Availability
      - no fsck(1M); ZFS is always mountable
      - provable data integrity model

  • Where Are We Now?

    "Hello world" on Oct 31, 2002
      - complete POSIX-compliant filesystem
      - most key features working: pooled storage, crash resilience, self-healing data

    Full builds of ON10 on ZFS filesystems

    zvol driver used for MTB-UFS test/bringup

    Still plenty to do
      - intent log, snapshots, perf work
      - internal alpha program

    Phase 1 putback in October

  • ZFS: The Zettabyte Filesystem

    Please send questions, comments and ideas to: [email protected]

    Want to follow ZFS developments? Join: [email protected]

    For the latest information, visit: http://zfs.eng