ZFS Bootcamp

Posted on 16-Apr-2015
Solaris 10 Boot Camp: ZFS
Bob Netherton, Solaris Adoption, US Client Solutions, Sun Microsystems

ZFS: Motivation for a New File System

Existing file systems:
> No defense against data corruption
> Lots of limits: size, number of files, etc.
> Difficult to manage: fsck, /etc/fstab, partitions, volumes; too many things to tune

Design principles:
> End-to-end data integrity
> Lots of capacity (128-bit)
> Simple to administer

Next-Generation File System

Many file systems are based on old assumptions. Enter ZFS: scalable, easy to manage, fast, with data integrity.
> Ease of management matters: these tasks are done infrequently, so they are easy to get wrong
> Z = zettabyte (128-bit capacity)
> UFS = 16 TB, VxFS = 32 TB, QFS = 252 TB
> Customers have petabytes today; exabytes are not far off

ZFS Design Principles

Pooled storage:
> Completely eliminates the antique notion of volumes
> Does for storage what VM did for memory

End-to-end data integrity:
> Historically considered too expensive
> Turns out, it isn't
> And the alternative is unacceptable

Everything is transactional:
> Keeps things always consistent on disk
> Removes almost all constraints on I/O order
> Allows huge performance wins

Background: Why Volumes Exist

In the beginning, each filesystem managed a single disk. Customers wanted more space, bandwidth, and reliability:
> Rewriting filesystems to handle many disks: hard
> Inserting a little shim (a volume) to cobble disks together: easy

An industry grew up around the FS/volume model:
> Filesystems and volume managers sold as separate products
> Inherent problems in the FS/volume interface can't be fixed


[Diagram: a filesystem atop a volume built from 1G disks: 2G concat (lower + upper 1G), 2G stripe (even + odd 1G), 1G mirror (left + right 1G)]

FS/Volume Model vs. ZFS

Traditional volumes:
> Abstraction: virtual disk
> Partition/volume for each FS
> Grow/shrink by hand
> Each FS has limited bandwidth
> Storage is fragmented, stranded

ZFS pooled storage:
> Abstraction: malloc/free
> No partitions to manage
> Grow/shrink automatically
> All bandwidth always available
> Pool allows space to be shared
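The malloc/free point can be sketched in a toy model (illustrative Python; the class names are invented for this example, this is not ZFS code): with per-filesystem partitions a write can fail even though free space exists elsewhere, while a pool shares every free block.

```python
# Toy model (not ZFS code): stranded space with fixed partitions
# versus one shared pool.

class Partitioned:
    def __init__(self, sizes):
        self.free = dict(sizes)     # fixed free space per filesystem

    def write(self, fs, n):
        if self.free[fs] < n:       # a neighbor's free space is unreachable
            raise OSError("ENOSPC")
        self.free[fs] -= n

class Pooled:
    def __init__(self, total):
        self.free = total           # one shared pool of blocks

    def write(self, fs, n):
        if self.free < n:
            raise OSError("ENOSPC")
        self.free -= n              # any filesystem may use any free block

part = Partitioned({"ann": 1000, "bob": 1000})
pool = Pooled(2000)

pool.write("ann", 1500)             # succeeds: the pool shares all 2000 units
try:
    part.write("ann", 1500)         # fails although 2000 free units exist
except OSError as err:
    print("partitioned:", err)
```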

[Diagram: several FS/volume pairs, each with its own disks, versus many filesystems sharing one ZFS storage pool]
FS/Volume Model vs. ZFS

FS/volume I/O stack:
> Block device interface: "write this block, then that block, ..."
> Loss of power = loss of on-disk consistency
> Workaround: journaling, which is slow and complex
> Block device interface to the volume: write each block to each disk immediately to keep mirrors in sync
> Loss of power = resync
> Synchronous and slow

ZFS I/O stack:
> Object-based transactions: "make these 7 changes to these 3 objects," all-or-nothing
> Transaction group commit: again all-or-nothing; always consistent on disk; no journal needed
> Transaction group batch I/O: schedule, aggregate, and issue I/O at will; no resync if power lost; runs at platter speed






ZFS Data Integrity Model

All operations are copy-on-write:
> Never overwrite live data
> On-disk state is always valid: no windows of vulnerability
> No need for fsck(1M)

All operations are transactional:
> Related changes succeed or fail as a whole
> No need for journaling

All data is checksummed:
> No silent data corruption
> No panics on bad metadata

Copy-On-Write Transactions

1. Start from the initial block tree
2. COW the modified data blocks
3. COW the indirect blocks that point to them
4. Rewrite the uberblock (atomic)

Bonus: Constant-Time Snapshots

At the end of the transaction group, don't free the COWed blocks:
> Actually cheaper to take a snapshot than not!

[Diagram: snapshot uberblock and current uberblock each rooting a block tree that shares all unchanged blocks]
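A toy sketch of why snapshots are free under copy-on-write (illustrative Python, not ZFS code; the names are invented for this example): keeping the old root alive is all it takes, and unchanged blocks are shared.

```python
# Toy model: a snapshot is just the old root kept alive (not ZFS code).
blocks = {"d1": "hello", "d2": "world", "root": ["d1", "d2"]}

def cow_update(new_value):
    blocks["d1-new"] = new_value              # COW: the old "d1" is untouched
    blocks["root-new"] = ["d1-new", "d2"]     # new root shares "d2"
    return "root-new"

snapshot = "root"        # taking the snapshot: simply don't free the old root
current = cow_update("HELLO")

assert [blocks[b] for b in blocks[snapshot]] == ["hello", "world"]
assert [blocks[b] for b in blocks[current]] == ["HELLO", "world"]
```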

End-to-End Checksums

Disk block checksums:
> Checksum stored with the data block
> Any self-consistent block will pass
> Can't even detect stray writes
> An inherent FS/volume interface limitation
> Only validates the media: catches bit rot, but not phantom writes, misdirected reads and writes, DMA parity errors, driver bugs, or accidental overwrite

ZFS checksum trees:
> Checksum stored in the parent block pointer (each pointer holds address + checksum)
> Fault isolation between data and checksum
> Entire pool (block tree) is self-validating
> Enabling technology: ZFS stack integration
> Validates the entire I/O path: bit rot, phantom writes, misdirected reads and writes, DMA parity errors, driver bugs, accidental overwrite
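The contrast can be shown with a toy model (illustrative Python using SHA-256, not ZFS code): a checksum stored with the block is rewritten along with it, so a stale but well-formed block still passes, while a checksum held in the parent pointer is not fooled.

```python
# Toy contrast (not ZFS code): self-checksummed block vs. checksum
# held in the parent block pointer.
import hashlib

def sha(data):
    return hashlib.sha256(data).hexdigest()

# traditional scheme: the checksum travels with the block itself
disk = {0: (b"new data", sha(b"new data"))}

# phantom write: the update is silently dropped; the old, perfectly
# self-consistent block is still on disk
disk[0] = (b"old data", sha(b"old data"))

data, self_cksum = disk[0]
print(sha(data) == self_cksum)              # True: the stale block passes

# ZFS scheme: the parent pointer remembers the checksum it expects
parent_ptr = {"addr": 0, "cksum": sha(b"new data")}
data, _ = disk[parent_ptr["addr"]]
print(sha(data) == parent_ptr["cksum"])     # False: the stale block is caught
```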

Know Your Data Threat Level: Data Integrity Advisory System

> SEVERE: severe risk of data corruption (Linux ext2, reiser4)
> HIGH: high risk of data corruption (ext3, XFS)
> ELEVATED: elevated risk of data corruption (UFS, VxFS)
> GUARDED: general risk of data corruption
> LOW (10^-77): low risk of data corruption (ZFS)

Traditional Mirroring

1. Application issues a read. The mirror reads the first disk, which has a corrupt block. It can't tell.
2. Volume manager passes the bad block up to the filesystem. If it's a metadata block, the filesystem panics. If not...
3. Filesystem returns bad data to the application.


Self-Healing Data in ZFS

1. Application issues a read. The ZFS mirror tries the first disk. The checksum reveals that the block is corrupt on disk.
2. ZFS tries the second disk. The checksum indicates that the block is good.
3. ZFS returns good data to the application and repairs the damaged block.
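A minimal sketch of that read path (illustrative Python, not ZFS code; the mirror_read helper is invented for this example):

```python
# Toy model of a self-healing mirror read (not ZFS code).
import hashlib

def sha(block):
    return hashlib.sha256(block).hexdigest()

good = b"important data"
mirror = [b"garbage from dd", good]   # disk 0 corrupted, disk 1 intact
expected = sha(good)                  # checksum from the parent block pointer

def mirror_read(copies, cksum):
    for block in copies:
        if sha(block) == cksum:       # the checksum tells good from bad
            for i, other in enumerate(copies):
                if sha(other) != cksum:
                    copies[i] = block # self-heal: rewrite the damaged copy
            return block
    raise OSError("all copies corrupt")

data = mirror_read(mirror, expected)
print(data == good, mirror[0] == good)   # both True: good data, repaired disk
```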





Self-Healing Data in Action

# dd if=/dev/zero of=/dev/dsk/c2t9d0s0 bs=128k ... count=12
# ... read the affected file ... no problem!
# zpool iostat home
                capacity     operations    bandwidth      errors
               used  avail   read  write   read  write   I/O  cksum
  mirror1      305M   136G      0    167      0  21.0M     0      0
    c2t8d0s0      -      -      0     88      0  11.0M     0      0
    c3t8d0s0      -      -      0     79      0   9.9M     0      0
  mirror2      256M   136G      0    168      0  21.0M     0      0
    c2t9d0s0      -      -      0     86      0  10.8M     0     12
    c3t9d0s0      -      -      0     81      0  10.2M     0      0

ZFS Data Security

NFSv4/NT-style ACLs:
> Allow/deny with inheritance

Encryption:
> Protects against spying, SAN snooping, physical device theft

Authentication via cryptographic checksums:
> User-selectable 256-bit checksum algorithms include SHA-256
> Data can't be forged: the checksums detect it
> Uberblock checksum provides a digital signature for the entire pool

Secure deletion:
> Freed blocks thoroughly erased

Real-time remote replication:
> Allows data to survive disasters of geographic scope

ZFS Scalability

Immense capacity (128-bit):
> Moore's Law: we'll need the 65th bit in 10-15 years
> Zettabyte = 70 bits (a billion TB)
> ZFS capacity: 256 quadrillion ZB
> Exceeds the quantum limit of Earth-based storage (Seth Lloyd, "Ultimate physical limits to computation," Nature 406, 1047-1054 (2000))
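A quick sanity check on the arithmetic above (plain Python, nothing ZFS-specific):

```python
# Capacity arithmetic from the slide above.
TB = 2 ** 40
ZB = 2 ** 70                  # a zettabyte needs 70 bits of addressing

print(ZB // TB)               # TB per ZB: 1073741824, i.e. about a billion
print(16 * TB == 2 ** 44)     # UFS's 16 TB limit uses only 44 address bits
```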

Scalable algorithms, 100% dynamic metadata:
> No limits on files, directory entries, etc.
> No wacky knobs (e.g. inodes/cg)

Built-in compression:
> Saves space (2-3x typical) and reduces I/O, sometimes by so much that compressed filesystems are actually faster!
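The space claim is easy to demonstrate with a general-purpose compressor (illustrative Python using zlib; ZFS uses its own compressors, and this highly repetitive input compresses far better than the 2-3x typical of real data):

```python
# Compressing before writing means fewer bytes hit the disk.
import zlib

text = b"ZFS does for storage what the VM did for memory. " * 64
packed = zlib.compress(text)

print(len(text), len(packed))            # the compressed copy is much smaller
assert zlib.decompress(packed) == text   # and the compression is lossless
```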

ZFS Performance

Key architectural wins:
> Copy-on-write design makes most disk writes sequential
> Dynamic striping across all devices maximizes throughput
> Multiple block sizes, automatically chosen to match workload
> Explicit I/O priority with deadline scheduling
> Globally optimal I/O sorting and aggregation
> Multiple independent prefetch streams with automatic length and stride detection
> Unlimited, instantaneous read/write snapshots
> Parallel, constant-time directory operations

Already #1 on several benchmarks:
> Almost as fast as tmpfs on some!

ZFS Administration

Pooled storage: no more volumes!
> All storage is shared; no wasted space
> Filesystems are cheap: like directories with mount options
> Grow and shrink are automatic
> No more fsck(1M), format(1M), /etc/vfstab, /etc/dfs/dfstab...

Unlimited, instantaneous snapshots and clones:
> Snapshot: read-only point-in-time copy of the data
> Clone: writable copy of a snapshot
> Lets users recover lost data without sysadmin intervention

Host-neutral on-disk format:
> Change the server from x86 to SPARC; it just works
> Adaptive endianness ensures neither platform pays a tax

100% online administration

Invidious Comparison Time!

Claim:
> ZFS pooled storage simplifies administration dramatically.
> So... let's consider a simple example.

Our task:
> Given two disks, create mirrored filesystems for Ann, Bob, and Sue.
> Later, add more space.

UFS/SVM Administration: Summary

# format
  ... (long interactive session omitted)
# metadb -a -f disk1:slice0 disk2:slice0
# metainit d10 1 1 disk1:slice1
d10: Concat/Stripe is setup
# metainit d11 1 1 disk2:slice1
d11: Concat/Stripe is setup
# metainit d20 -m d10
d20: Mirror is setup
# metattach d20 d11
d20: submirror d11 is attached
# metainit d12 1 1 disk1:slice2
d12: Concat/Stripe is setup
# metainit d13 1 1 disk2:slice2
d13: Concat/Stripe is setup
# metainit d21 -m d12
d21: Mirror is setup
# metattach d21 d13
d21: submirror d13 is attached
# metainit d14 1 1 disk1:slice3
d14: Concat/Stripe is setup
# metainit d15 1 1 disk2:slice3
d15: Concat/Stripe is setup
# metainit d22 -m d14
d22: Mirror is setup
# metattach d22 d15
d22: submirror d15 is attached

# newfs /dev/md/rdsk/d20
newfs: construct a new file system /dev/md/rdsk/d20: (y/n)? y
  ... (many pages of 'superblock backup' output omitted)
# mount /dev/md/dsk/d20 /export/home/ann
# vi /etc/vfstab
  ... while in 'vi', type this exactly:
/dev/md/dsk/d2