zfs deep dive

Download ZFS Deep Dive

Post on 14-Apr-2015




4 download

Embed Size (px)


ZFS Deep


Solaris 10 Deep Dive ZFSBob NethertonTechnical Specialist, Solaris Adoption Sun Microsystems, Inc. http://blogs.sun.com/bobn

What is ZFS? Why a new file system? What's different about it? What can I do with it? How much does it cost? Where does ZFS go from here?2

What is ZFS?End-to End Data Integrity

A new way to manage dataImmense Data Software Capacity Developer

With checksumming and copy-on-write transactionsEasier Administration

The world's first 128-bit file system

Huge Performance Gains

Pooled storage model No volume manager

Especially architected for speed3

Why a New File System?

Data Management Costs are High

The Value of Data is Becoming Even More Critical

The Amount of Storage is EverIncreasing


Trouble with Existing File Systems?Good for the time they were designed, but...No Defense Against Silent Data Corruption Difficult to AdministerNeed a Volume Manager Older/Slower Data Management Techniques

Any defect in datapath can corrupt data... undetected

Volumes, labels, partitions, provisioning and lots of limits

Fat locks, fixed block size, naive pre-fetch, dirty region logging


ZFS Design Principles Start with a new design around today's requirements Pooled storage> Eliminate the notion of volumes > Do for storage what virtual memory did for RAM

End-to-end data integrity> Historically considered too expensive. > Now, data is too valuable not to protect

Transactional operation> Maintain consistent on-disk format > Reorder transactions for performance gains big

performance win


Evolution of Disks and Volumes Intially, we had simple disks Abstraction of disks into volumes to meet requirements Industry grew around HW / SW volume managementFile System Volume Manager File System Volume Manager File System Volume Manager

Lower 1GB

Upper 1GB

Even 1GB

Odd 1GB

Left 1GB

Right 1GB

Concatenated 2GB

Striped 2GB

Mirrored 1GB7

FS/Volume Model vs. ZFS Traditional Volumes ZFS Pooled Storage> > > >

1:1 FS to Volume Grow / shrink by hand Limited bandwidth Storage fragmentedFS

> > > >ZFS

No partitions / volumes Grow / shrink automatically All bandwidth always available All storage in pool is sharedZFS ZFS

Volume Manager


FS / Volume Model vs. ZFSFS / Volume I/O Stack FS to Volume> Block device interface > Write a block, write a block, ... > Loss of power = loss of

ZFS I/O Stack ZFS to Data Mgmt Unit> Object-based transactions > Make these changes to these

consistency > Workaround: journaling slow & complex

objects > All or nothing> > > >

DMU to Storage PoolTransaction group commit All or nothing Always consistent on disk Journal not needed

Volume to Disk> Block device interface > Write each block to each disk

immediately to sync mirrors > Loss of power = resync > Synchronous & slow

SP to Disk> Schedule, aggregate, and issue I/O

at will runs at platter speed > No resync if power lost





ZFS Data Integrity Model Everything is copy-on-write> Never overwrite live data > On-disk state always valid no fsck

Everything is transactional> Related changes succeed or fail as a whole > No need for journaling

Everything is checksummed> No silent corruptions > No panics from bad metadata

Enhanced data protection> Mirrored pools, RAID-Z, disk scrubbing11

Copy-on-Write and TransactionalUber-block Original Data New Data

Initial block treeOriginal Pointers

Writes a copy of some changes New Uber-block

New Pointers

Copy-on-write of indirect blocks

Rewrites the Uber-block12

End-to-End ChecksumsChecksums are separated from the data

Entire I/O path is self-validating (uber-block)

Prevents: > Silent data corruption > Panics from corrupted metadata > Phantom writes > Misdirected reads and writes > DMA parity errors > Errors from driver bugs > Accidental overwrites


Self-Healing DataZFS can detect bad data using checksums and heal the data using its mirrored copy.Application ZFS Mirror Application ZFS Mirror Application ZFS Mirror

Detects Bad Data

Gets Good Data from Mirror

Heals Bad Copy14

Disk Scrubbing Uses checksums to verify the integrity of all the data Traverses metadata to read every copy of every block Finds latent errors while they're still correctable It's like ECC memory scrubbing but for disks Provides fast and reliable re-silvering of mirrors


RAID-Z ProtectionRAID-5 and More

ZFS provides better than RAID-5 availability> Copy-on-write approach solves historical problems

Striping uses dynamic widths> Each logical block is its own stripe

All writes are full-stripe writes> Eliminates read-modify-write (So it's fast!)

Eliminates RAID-5 write hole> No need for NVRAM


128-bit File System No Practical Limitations on File Size, Directory Entries, etc. All metadata is dynamic Concurrent Everything

Immense Data Capacity17




Easier Administration Pooled Storage Design makes for Easier AdministrationNo need for a Volume Manager!

Straightforward Commands and a GUI > Snapshots & Clones > Quotas & Reservations > Compression > Pool Migration > ACLs for Security19

No More Volume Manager!Application 1 Application 2

Automatically add capacity to shared storage poolApplication 3


Storage Pool20

ZFS File systems are Hierarchical File system properties are inherited Inheritance makes administration a snap File systems become control points Manage logically related file systems as a group


Create ZFS Pools and File Systems Create a ZFS pool consisting of two mirrored drives# zpool create tank mirror c9t42d0 c13t11d0 # df -h -F zfs Filesystem size used avail capacity Mounted on tank 33G 1K 33G 1% /tank

Create home directory file system# zfs create tank/home # zfs set mountpoint=/export/home tank/home # df -h -F zfs Filesystem size used avail capacity Mounted on tank 33G 24K 33G 1% /tank tank/home 33G 27K 33G 1% /export/home


Create ZFS Pools and File Systems Create home directories for users# zfs create tank/home/ahrens # zfs create tank/home/bonwick # zfs create tank/home/billm # df -h -F zfs Filesystem size used avail capacity Mounted on tank 33G 24K 33G 1% /tank tank/home 33G 27K 33G 1% /export/home tank/home/ahrens 33G 24K 33G 1% /export/home/ahrens tank/home/bonwick 33G 24K 33G 1% /export/home/bonwick tank/home/billm 33G 24K 33G 1% /export/home/billm

Add space to the pool

# zpool add tank mirror c9t43d0 c13t12d0 # df -h -F zfs Filesystem size used avail capacity Mounted on tank 66G 24K 66G 1% /tank tank/home 66G 27K 66G 1% /export/home


Quotas and Reservations To control pooled storage usage, administrators can set a quota or reservation on a per file system basis# df -h -F zfs Filesystem size tank/home 66G tank/home/ahrens 66G avail capacity Mounted on 66G 1% /export/home 66G 1% /export/home/ahrens tank/home/bonwick 66G 24K 66G 1% /export/home/bonwick # zfs set quota=10g tank/home/ahrens # zfs set reservation=20g tank/home/bonwick # df -h -F zfs Filesystem size used avail capacity Mounted on tank/home 66G 28K 46G 1% /export/home tank/home/ahrens 10G 24K 10G 1% /export/home/ahrens tank/home/bonwick 66G 24K 66G 1% /export/home/bonwick24

used 28K 24K

File System Attributes Attributes are set for the file system and inherited by child file systems in the tree# zfs set compression=on tank # zfs set sharenfs=rw tank/home # zfs get all tank NAME PROPERTY VALUE tank type filesystem tank creation Fri Sep 1 tank used 20.0G tank available 46.4G tank compressratio 1.00x tank mounted yes tank quota none tank reservation none tank recordsize 128K tank mountpoint /tank tank sharenfs off tank compression on tank atime on ... SOURCE default default default default default local default25

9:38 2006

ZFS Snapshots Provide a read-only point-in-time copy of file system Copy-on-write makes them essentially free Very space efficient only changes are tracked And instantaneous just doesn't delete the copyNew Uber-block

Snapshot Uber-block

Current Data


ZFS Snapshots Simple to create and rollback with snapshots# zfs list -r tank NAME USED tank 20.0G tank/home 20.0G tank/home/ahrens 24.5K tank/home/billm 24.5K tank/home/bonwick 24.5K AVAIL 46.4G 46.4G 10.0G 46.4G 66.4G REFER 24.5K 28.5K 24.5K 24.5K 24.5K MOUNTPOINT /tank /export/home /export/home/ahrens /export/home/billm /export/home/bonwick

# zfs snapshot tank/home/billm@s1 # zfs list -r tank/home/billm NAME USED AVAIL REFER tank/home/billm 24.5K 46.4G 24.5K tank/home/billm@s1 0 - 24.5K

MOUNTPOINT /export/home/billm -

# cat /export/home/billm/.zfs/snapshot/s1/foo.c # zfs rollback tank/home/billm@s1 # zfs destroy tank/home/billm@s1


ZFS Clones A clone is a writable copy of a snapshot> Created instantly, unlimited number

Perfect for read-mostly file systems source directories, application binaries and configuration, etc.# zfs list -r tank/home/billm NAME USED AVAIL tank/home/billm 24.5K 46.4G tank/home/billm@s1 0 REFER 24.5K 24.5K MOUNTPOINT /export/home/billm -

# zfs clone tank/home/billm@s1 tank/newbillm # zfs list -r tank/home/billm tank/newbi