TRANSCRIPT
ZFS: Theory & Architecture
Disruptive Storage Workshop - April 2016
Aaron Gardner
Who am I?
Senior Scientific Consultant
Computer engineer who spent the last 15 years with biologists in situ
Along the way learned bioinformatics, data management, HPC, research cyberinfrastructure
I help enable science!
Setting the Stage
My stolen borrowed ZFS materials
Reference Links
1. ZFS — The Last Word in File Systems (Jeff Bonwick, Bill Moore)
   http://wiki.illumos.org/download/attachments/1146951/zfs_last.pdf
2. ZFS on Linux FAQ
   http://zfsonlinux.org/faq.html
3. zfsonlinux GitHub Releases
   https://github.com/zfsonlinux/zfs/releases
4. ZFS Evil Tuning Guide
   http://www.solaris-cookbook.eu/solaris/solaris-10-zfs-evil-tuning-guide
5. Aaron Toponce
   https://pthree.org/2012/04/17/install-zfs-on-debian-gnulinux
6. ZFS Raidz Performance, Capacity and Integrity
   https://calomel.org/zfs_raid_speed_capacity.html
7. ZFS — Wikipedia
   https://en.wikipedia.org/wiki/ZFS
My stolen borrowed ZFS materials
Reference Links (cont.)
8. Sequoia's 55PB Lustre+ZFS Filesystem
   http://zfsonlinux.org/docs/LUG12_ZFS_Lustre_for_Sequoia.pdf
9. Lustre 2.9 and Beyond
   http://cdn.opensfs.org/wp-content/uploads/2016/04/LUG2016D1_Lustre-2.9-and-Beyond_Dilger_Revised_v2.pdf
10. ZFS as Backend File System for Lustre: Current Status, How to Optimize, Where to Improve
   http://cdn.opensfs.org/wp-content/uploads/2015/04/ZFS-As-Backend-File-System_Paciucci.pdf
11. Lustre I/O Performance on ZFS
   http://cdn.opensfs.org/wp-content/uploads/2016/04/LUG2016D1_Large-Streaming-IO-for-Lustre-on-ZFS_Jinshan-Xiong_Revised_v2.pdf
12. Lustre ZFS Snapshot Overview
   http://cdn.opensfs.org/wp-content/uploads/2016/04/LUG2016D1_Lustre-ZFS-Snapshots_Fan-Yong_Revised_v2.pdf
13. Vectorized ZFS RAIDZ Implementation
   http://cdn.opensfs.org/wp-content/uploads/2016/04/LUG2016D1_Vectorized-ZFS-RAIDZ_Gvozden-Neskovic.pdf
A decade and *still* pretty much the last word…
ZFS History and Timeline
‣ The name originally stood for "Zettabyte File System"
‣ Today the name does not stand for anything
‣ A ZFS file system can store up to 256 zebibytes (ZiB)
2001 - Initial development at Sun
2004 - Announced
2005 - Released in Solaris and OpenSolaris
2006 - FUSE port for Linux started
2007 - Apple begins porting ZFS to Mac OS X
2008 - Port to FreeBSD released
2008 - Native Linux port started
2009 - Apple's ZFS project shut down; MacZFS continues in the community
2010 - OpenSolaris discontinued
2010 - illumos founded as OSS
2013 - OpenZFS project begins
2016 - Ubuntu 16.04 “Goes There”
[1] ZFS — The Last Word in File Systems
Overview
Pooled Storage
‣ Completely eliminates the antique notion of volumes
‣ Does for storage what virtual memory (VM) did for memory

Transactional Object System
‣ Always consistent on disk – no fsck, ever
‣ Universal – file, block, iSCSI, swap ...

Provable End-to-End Data Integrity
‣ Detects and corrects silent data corruption
‣ Historically considered "too expensive" – no longer true

Simple Administration
‣ Concisely express your intent
[1] ZFS — The Last Word in File Systems
Trouble with Existing Filesystems
No defense against silent data corruption
‣ Any defect in disk, controller, cable, driver, laser, or firmware can corrupt data silently; like running a server without ECC memory

Brutal to manage
‣ Labels, partitions, volumes, provisioning, grow/shrink, /etc files ...
‣ Lots of limits: filesystem/volume size, file size, number of files, files per directory, number of snapshots ...
‣ Different tools to manage file, block, iSCSI, NFS, CIFS ...
‣ Not portable between platforms (x86, SPARC, PowerPC, ARM ...)

Dog slow
‣ Linear-time create, fat locks, fixed block size, naïve prefetch, dirty region logging, painful RAID rebuilds, growing backup time
[1] ZFS — The Last Word in File Systems
Objective
‣ Figure out why storage has gotten so complicated
‣ Blow away 20 years of obsolete assumptions
‣ Design an integrated system from scratch
End the Suffering…
[1] ZFS — The Last Word in File Systems
Why Volumes Exist
Customers wanted more space, bandwidth, reliability
‣ Hard: redesign filesystems to solve these problems well
‣ Easy: insert a shim ("volume") to cobble disks together

An industry grew up around the FS/volume model
‣ Filesystem and volume manager sold as separate products
‣ Inherent problems in the FS/volume interface can't be fixed
In the beginning, each filesystem managed a single disk. It wasn't very big.
[1] ZFS — The Last Word in File Systems
FS/Volume vs. Pooled Storage
ZFS Pooled Storage
‣ Abstraction: malloc/free
‣ No partitions to manage
‣ Grow/shrink automatically
‣ All bandwidth always available
‣ All storage in the pool is shared

Traditional Volumes
‣ Abstraction: virtual disk
‣ Partition/volume for each FS
‣ Grow/shrink by hand
‣ Each FS has limited bandwidth
‣ Storage is fragmented, stranded
[1] ZFS — The Last Word in File Systems
FS/Volume Interfaces vs. ZFS
[1] ZFS — The Last Word in File Systems
Universal Storage
‣ DMU is a general-purpose transactional object store
‣ ZFS dataset = up to 2^48 objects, each up to 2^64 bytes
‣ Key features common to all datasets: snapshots, compression, encryption, end-to-end data integrity
‣ Any flavor you want: file, block, object, network
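For example (a minimal sketch; the pool name "tank" and the sizes are assumptions, not from the talk), one pool serves both the file and block flavors:

  $ zfs create tank/home               # POSIX filesystem dataset
  $ zfs create -V 10G tank/vol0        # block dataset (zvol)
  $ ls -l /dev/zvol/tank/vol0          # appears as an ordinary block device (format it, export via iSCSI, etc.)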
[5] Aaron Toponce
Merkle Trees
‣ Merkle Tree = A hash tree
‣ Each block of data is hashed
‣ Parent hashes are generated by concatenating and hashing child hashes
‣ The uberblock hash is the parent hash at the top of the tree
‣ By walking the tree you can quickly determine corruption/changes to data
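As a toy illustration of the hash-tree idea using userland tools (not ZFS's actual on-disk code; a.bin and b.bin are hypothetical data blocks):

  $ h0=$(sha256sum a.bin | awk '{print $1}')    # leaf hash of block 0
  $ h1=$(sha256sum b.bin | awk '{print $1}')    # leaf hash of block 1
  $ printf '%s%s' "$h0" "$h1" | sha256sum       # parent = hash of the concatenated child hashes
  # Flipping one byte in either block changes the parent and every ancestor up to the uberblock.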
[5] Aaron Toponce
Copy-On-Write Transactions
‣ Modifying data does not overwrite the block
‣ Instead the modified block is copied, and pointers are updated to the new block
‣ The old block can be freed, or kept with the old pointers (a free snapshot*)
‣ Fragments the FS, but this is mitigated
‣ COW is the future
[1] ZFS — The Last Word in File Systems
Constant-Time Snapshots
‣ At the end of a TX group, don't free COWed blocks
  ‣ Actually cheaper to take a snapshot than not!
‣ The tricky part: how do you know when a block is free?
[1] ZFS — The Last Word in File Systems
Storage Integrity
Uncorrectable bit error rates have stayed roughly constant
‣ 1 in 10^14 bits (~12TB) for desktop-class drives
‣ 1 in 10^15 bits (~120TB) for enterprise-class drives (allegedly)
‣ Bad sector every 8-20TB in practice (desktop and enterprise)

Drive capacities doubling every 12-18 months
Number of drives per deployment increasing
→ Rapid increase in error rates
Both silent and "noisy" data corruption becoming more common
Cheap flash storage will only accelerate this trend (TLC, V-NAND)
[1] ZFS — The Last Word in File Systems
ZFS Data Integrity
[1] ZFS — The Last Word in File Systems
Self Healing Data in ZFS
[1] ZFS — The Last Word in File Systems
Traditional RAID4/5/6
‣ Several data disks and >= 1 parity disk
‣ Fatal flaw: partial stripe writes
  ‣ Parity update requires read-modify-write (slow)
    ‣ Read old data and old parity (two synchronous disk reads)
    ‣ Compute new parity = new data ^ old data ^ old parity
    ‣ Write new data and new parity
‣ Suffers from the write hole:
  ‣ Loss of power between data and parity writes will corrupt data
  ‣ Workaround: $$$ NVRAM in hardware (i.e., don't lose power!)
‣ Can't detect or correct silent data corruption
[1] ZFS — The Last Word in File Systems
RAID-Z
‣ Dynamic stripe width
‣ Variable block size: 512B – 128KB (now up to 16MB with large blocks)
‣ Each logical block is its own stripe
‣ All writes are full-stripe writes
  ‣ Eliminates read-modify-write (it's fast)
  ‣ Eliminates the RAID write hole (no need for NVRAM)
‣ Single, double, or triple parity (raidz, raidz2, raidz3)
‣ Detects and corrects silent data corruption
  ‣ Checksum-driven combinatorial reconstruction
‣ No special hardware – ZFS loves cheap disks
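A minimal sketch of creating a double-parity RAID-Z pool (pool and device names are assumptions):

  $ zpool create tank raidz2 sdb sdc sdd sde sdf sdg   # double parity: any two of the six can fail
  $ zpool status tank                                  # shows the raidz2-0 vdev and its member disks
  # raidz and raidz3 take the same form, with single or triple parity instead.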
[1] ZFS — The Last Word in File Systems
Traditional Resilvering
‣ Creating a new mirror (or RAID stripe):
  ‣ Copy one disk to the other (or XOR them together) so all copies are self-consistent – even though they're all random garbage!
‣ Replacing a failed device:
  ‣ Whole-disk copy – even if the volume is nearly empty
  ‣ No checksums or validity checks along the way
  ‣ No assurance of progress until 100% complete – your root directory may be the last block copied
‣ Recovering from a transient outage:
  ‣ Dirty region logging – slow, and easily defeated by random writes
[1] ZFS — The Last Word in File Systems
ZFS Resilvering
‣ Top-down resilvering
  ‣ ZFS resilvers the storage pool's block tree from the root down
  ‣ Most important blocks first
  ‣ Every single block copy increases the amount of discoverable data
‣ Only copy live blocks
  ‣ No time wasted copying free space
  ‣ Zero time to initialize a new mirror or RAID-Z group
‣ Dirty time logging (for transient outages)
  ‣ ZFS records the transaction group window that the device missed
  ‣ To resilver, ZFS walks the tree and prunes where birth time < DTL
  ‣ A five-second outage takes five seconds to repair
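For example (device names are assumptions), replacing a failed disk triggers exactly this live-blocks-only, top-down resilver:

  $ zpool replace tank sdd sdx   # swap failed sdd for a new sdx
  $ zpool status tank            # "scan: resilver in progress" reports only live data being copied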
[1] ZFS — The Last Word in File Systems
Disk Scrubbing
‣ Finds latent errors while they're still correctable
  ‣ Like ECC memory scrubbing, but for disks
‣ Verifies the integrity of all data
  ‣ Traverses pool metadata to read every copy of every block
  ‣ All mirror copies, all RAID-Z parity, and all ditto blocks
  ‣ Verifies each copy against its 256-bit checksum
  ‣ Repairs data as it goes
‣ Minimally invasive
  ‣ Low I/O priority ensures that scrubbing doesn't get in the way
  ‣ User-defined scrub rates coming soon
    ‣ Gradually scrub the pool over the course of a month, a quarter, etc.
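Kicking off and monitoring a scrub is a one-liner (pool name assumed):

  $ zpool scrub tank
  $ zpool status tank   # e.g. "scan: scrub repaired 0 in ... with 0 errors"
  # Schedule it from cron (monthly is common); low I/O priority keeps it out of the way.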
[1] ZFS — The Last Word in File Systems
Ditto Blocks
‣ Data replication above and beyond mirror/RAID-Z
  ‣ Each logical block can have up to three physical blocks
    ‣ Different devices whenever possible
    ‣ Different places on the same device otherwise (e.g. laptop drive)
‣ All ZFS metadata kept in 2+ copies
  ‣ Small cost in latency and bandwidth (metadata ≈ 1% of data)
‣ Explicitly settable for precious user data
‣ Detects and corrects silent data corruption
  ‣ In a multi-disk pool, ZFS survives any non-consecutive disk failures
  ‣ In a single-disk pool, ZFS survives loss of up to 1/8 of the platter
‣ ZFS survives failures that send other filesystems to tape
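The per-dataset knob for precious user data is the copies property (dataset name is an assumption):

  $ zfs set copies=2 tank/precious   # keep two physical copies of every user-data block
  $ zfs get copies tank/precious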
[1] ZFS — The Last Word in File Systems
Scalability
‣ Immense capacity (128-bit)
  ‣ Moore's Law: need the 65th bit in 10-15 years
  ‣ ZFS capacity: 256 quadrillion ZB (1 ZB = 1 billion TB)
  ‣ Exceeds the quantum limit of Earth-based storage
    ‣ Seth Lloyd, "Ultimate physical limits to computation," Nature 406, 1047-1054 (2000)
‣ 100% dynamic metadata
  ‣ No limits on files, directory entries, etc.
  ‣ No wacky knobs (e.g. inodes/cg)
‣ Concurrent everything
  ‣ Byte-range locking: parallel read/write without violating POSIX
  ‣ Parallel, constant-time directory operations
[1] ZFS — The Last Word in File Systems
Performance
‣ Copy-on-write design
  ‣ Turns random writes into sequential writes
  ‣ Intrinsically hot-spot-free
‣ Pipelined I/O
  ‣ Fully scoreboarded 24-stage pipeline with I/O dependency graphs
  ‣ Maximum possible I/O parallelism
  ‣ Priority, deadline scheduling, out-of-order issue, sorting, aggregation
‣ Dynamic striping across all devices
‣ Intelligent prefetch
‣ Variable block size
[1] ZFS — The Last Word in File Systems
Dynamic Striping
[1] ZFS — The Last Word in File Systems
Variable Block Size
‣ No single block size is optimal for everything
  ‣ Large blocks: less metadata, higher bandwidth
  ‣ Small blocks: more space-efficient for small objects
  ‣ Record-structured files (e.g. databases) have natural granularity; the filesystem must match it to avoid read-modify-write
‣ Why not arbitrary extents?
  ‣ Extents don't COW or checksum nicely (too big)
  ‣ Large blocks suffice to run disks at platter speed
‣ Per-object granularity
  ‣ A 37k file consumes 37k – no wasted space
‣ Enables transparent block-based compression
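A sketch of matching the record size to a database's natural granularity (the dataset name and 16K page size are assumptions):

  $ zfs set recordsize=16K tank/db   # align records with the DB page size to avoid read-modify-write
  $ zfs get recordsize tank/db
  # Records larger than 128K require the large_blocks pool feature on newer releases.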
[1] ZFS — The Last Word in File Systems
Built-in Compression
‣ Block-level compression in the SPA
  ‣ Transparent to all other layers
  ‣ Each block compressed independently
  ‣ All-zero blocks converted into file holes
‣ LZJB, GZIP, and LZ4 available
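Enabling it is one property set (pool/dataset names assumed), and children inherit it:

  $ zfs set compression=lz4 tank   # inherited by all child datasets unless overridden
  $ zfs get compressratio tank     # reports the achieved ratio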
Miscellaneous Topics
‣ Native ZFS encryption exists in ZFS for Solaris
  ‣ With ZFS on Linux (ZoL), you can use eCryptfs or LUKS below ZFS
‣ Intelligent prefetch
  ‣ Multiple prefetch streams
  ‣ Automatic stride detection (linear access pattern detection)
[1] ZFS — The Last Word in File Systems
Administration
‣ Pooled storage – no more volumes!
  ‣ Up to 2^48 datasets per pool – filesystems, iSCSI targets, swap, etc.
  ‣ Nothing to provision!
‣ Filesystems become administrative control points
  ‣ Hierarchical, with inherited properties (see the sketch after this list)
    ‣ Per-dataset policy: snapshots, compression, backups, quotas, etc.
    ‣ Who's using all the space? du(1) takes forever, but df(1M) is instant
    ‣ Manage logically related filesystems as a group
  ‣ Inheritance makes large-scale administration a snap
  ‣ Policy follows the data (mounts, shares, properties, etc.)
  ‣ Delegated administration lets users manage their own data
‣ ZFS filesystems are cheap – use a ton, it's OK, really!
‣ Online everything
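A minimal sketch of hierarchical, inherited policy (names and values are assumptions):

  $ zfs create tank/home
  $ zfs create tank/home/alice
  $ zfs set quota=50G tank/home/alice   # per-dataset policy
  $ zfs set atime=off tank/home         # children inherit unless overridden
  $ zfs get -r atime tank/home          # SOURCE column shows "inherited from tank/home"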
[1] ZFS — The Last Word in File Systems
Snapshots
‣ Read-only point-in-time copy of a filesystem
‣ Instantaneous creation, unlimited number
‣ No additional space used – blocks copied only when they change
‣ Accessible through .zfs/snapshot in the root of each filesystem
  ‣ Allows users to recover files without sysadmin intervention
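In practice (names assumed):

  $ zfs snapshot tank/home@before-upgrade
  $ ls /tank/home/.zfs/snapshot/before-upgrade/   # read-only view; users recover files themselves
  $ zfs rollback tank/home@before-upgrade         # revert the live filesystem (discards newer changes)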
[1] ZFS — The Last Word in File Systems
Clones
Writable copy of a snapshot
‣ Instantaneous creation, unlimited number

Ideal for storing many private copies of mostly-shared data
‣ Software installations
‣ Source code repositories
‣ Diskless clients
‣ Containers
‣ Virtual machines
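A sketch (names assumed) of stamping out VM images from one golden snapshot:

  $ zfs snapshot tank/golden@v1
  $ zfs clone tank/golden@v1 tank/vm42   # instant writable copy; unchanged blocks stay shared
  # zfs promote tank/vm42 later detaches the clone from its origin snapshot.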
[1] ZFS — The Last Word in File Systems
Send / Receive
Powered by snapshots (and now bookmarks)
‣ Full backup: any snapshot
‣ Incremental backup: any snapshot delta
‣ Very fast delta generation – cost proportional to data changed
So efficient it can drive remote replication
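A sketch of remote replication over ssh (host and dataset names are assumptions):

  $ zfs send tank/home@sun | ssh backup zfs recv pool/home                    # full stream
  $ zfs send -i tank/home@sun tank/home@mon | ssh backup zfs recv pool/home   # only the delta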
[1] ZFS — The Last Word in File Systems
Data Migration
Host-neutral on-disk format
‣ Change the server and it just works
‣ Adaptive endianness: neither platform pays a tax
  ‣ Writes always use native endianness, setting a bit in the block pointer
  ‣ Reads byteswap only if host endianness != block endianness

ZFS takes care of everything
‣ Forget about device paths, config files, /etc/vfstab, etc.
‣ ZFS will share/unshare, mount/unmount, etc. as necessary
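Moving a pool between hosts is just export and import (pool name assumed):

  $ zpool export tank                 # on the old host
  $ zpool import tank                 # on the new host; devices are found by scanning labels
  $ zpool import -d /dev/disk/by-id   # list importable pools under a specific device directory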
[1] ZFS — The Last Word in File Systems
ZFS and Virtualization
‣ Fast – snapshots and clones make zone creation instant
‣ Useful for container and VM environments (OpenStack, Docker, Flocker, etc.)
‣ Pool: physical resource in the global zone
‣ Dataset: logical resource in the local zone
[1] ZFS — The Last Word in File Systems
ZFS Root Filesystem
Brings all the ZFS goodness to /
‣ Checksums, compression, replication, snapshots, clones
‣ Boot from any dataset

Patching becomes safe
‣ Take a snapshot, apply the patch... roll back if there's a problem

Live upgrade becomes fast
‣ Create a clone (instant), upgrade, boot from the clone
‣ No "extra partition"

Worked on Solaris; working but unsupported on Linux (Ubuntu*)
‣ ZFS can easily create multiple boot environments
‣ GRUB can easily manage them
But wait! There’s more!
How is ZFS deployed on Linux today?
ZFS on Linux
‣ A project run by the US DOE's national labs (LLNL, etc.)
  http://zfsonlinux.org
‣ Ported ZFS natively into the Linux kernel
  ‣ SPL (Solaris Portability Layer)
  ‣ ZFS
  ‣ Can be deployed via kmod or DKMS
‣ Actively developed and supported, rapidly evolving; originally driven by the need for large-scale Lustre (Sequoia), now a community focus
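For example, on Ubuntu 16.04 (package name current as of this talk):

  $ sudo apt install zfsutils-linux   # userland tools; the kernel module ships with Ubuntu's kernel
  # On other distributions, the zfsonlinux.org repositories provide kmod or DKMS packages.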
Some new features to be aware of…
Current Release (v0.6.5.6)
‣ v0.6.5
  ‣ .zfs/snapshot access over NFS
  ‣ large_block support
    ‣ Up to 16MB record size (great for Lustre, reducing ZFS slab fragmentation, etc.)
  ‣ 10x lower ZIL latency
  ‣ zvol improvements (>50% higher throughput, >20% lower latency)
‣ v0.6.4
  ‣ Asynchronous I/O (AIO) support
    ‣ Previously Direct I/O only
  ‣ LZ4 compression of metadata
  ‣ Bash completion
ZFS was designed to take advantage of RAM, SSD, etc.
A quick word on ZFS caching
‣ Adaptive Replacement Cache (ARC)
  ‣ Recent and frequent block cache (reads)
‣ ZFS Intent Log (ZIL)
  ‣ Cache where incoming data is written, bundled into transactions, and then flushed at a regular interval (txg_sync)
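On ZFS on Linux, the ARC exposes its counters under /proc (this path is ZoL-specific):

  $ awk '$1 ~ /^(hits|misses|size|c_max)$/ {print $1, $3}' /proc/spl/kstat/zfs/arcstats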
An overview of common RAID/RAIDZ levels
RAID
RAID Level | Data Drives | Parity Drives | Failures Tolerated   | Read   | Write  | Use?
0          | >=1         | 0             | 0                    | Best*  | Best*  | No
1          | 2           | 1             | 1                    | Best   | Better | Yes
1+0        | 2 + x       | 2 + x         | 1 from any RAID1 set | Best   | Better | Yes
5          | 2 + x       | 1             | 1                    | Better | OK     | No
5+0        | (2 + x) * y | y             | 1 from any RAID5 set | Better | Better | No
6          | 2 + x       | 2             | 2                    | Better | Meh    | Yes
6+0        | (2 + x) * y | 2 * y         | 2 from any RAID6 set | Better | OK     | Maybe
7          | 2 + x       | 3             | 3                    | Better | Sigh   | Maybe
What was that about data members and powers of two again?
RAID
RAID Level | Recommended Data Drives | Parity Drives | Total Drives
1          | 1                       | 1             | 2
1+0        | 2 - 16                  | 2 - 16        | 4 - 32
5          | 2, 4                    | 1             | 3, 5
6          | 4, 8                    | 2             | 6, 10
7          | 4, 8, 16                | 3             | 7, 11, 19
*You can be successful with other configurations, but you have to be more careful*
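For example (device names assumed), a power-of-two data width and its striped "6+0"-style analog:

  $ zpool create tank raidz2 sdb sdc sdd sde sdf sdg                       # one 4+2 raidz2 vdev
  # ...or stripe several raidz2 vdevs for the 6+0 analog:
  $ zpool create tank raidz2 sdb sdc sdd sde sdf sdg raidz2 sdh sdi sdj sdk sdl sdm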
ZFS has deduplication, but should you use it?
A quick word on ZFS dedupe
‣ Needs a lot of RAM
  ‣ The "5GB RAM per 1TB disk" estimate assumes 128KB records
‣ Lots of snapshots mixed with dedupe on petascale ZFS pools = pain
‣ Deduplication can be used in specialized cases to good effect (reference DBs and repetitive workflows)
‣ General recommendation: don't use it generally
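Before enabling it, the dedup table can be simulated (pool/dataset names assumed):

  $ zdb -S tank                   # simulates dedup and prints the estimated ratio without committing
  $ zfs set dedup=on tank/refdb   # enable only where the simulated ratio pays for the RAM cost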
Best practices
Other ZFS Comments
‣ Use ECC RAM to keep in-flight and modified data safe while in memory (ZFS is protecting most other things)
‣ ZVOLs
  ‣ Don't forget about them! Use them when you need a block device or an alternate FS
‣ Compression usually helps performance, especially LZ4 (now the default algorithm)
‣ ZFS send and receive are your friends
‣ Do use ZIL/SLOG and L2ARC on SSD, etc. as supported (see the sketch after this list)
  ‣ Easy to build a hybrid-SSD storage system
  ‣ Node-local burst buffer characteristics (variable stripe, I/O re-ordered into transactions, etc.)
‣ Can use fletcher4 over sha256 to speed hashing
  ‣ Not as necessary after the AVX/AVX2 patches
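A sketch of the hybrid layout and the checksum knob mentioned above (device and dataset names are assumptions; fletcher4 is already the non-dedup default):

  $ zpool add tank log mirror nvme0n1 nvme1n1   # SLOG: mirror it; it holds in-flight sync writes
  $ zpool add tank cache nvme2n1                # L2ARC: read cache, safe to lose
  $ zfs set checksum=fletcher4 tank/scratch     # sha256 is only required when dedup is on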
Best practices
ZFS/Lustre Comments
‣ Being brief—this stuff will likely be covered in detail during the rest of the workshop
(See references 8-13 in the reference list at the start of this deck.)
[1] ZFS — The Last Word in File Systems
Summary
End the Suffering & Free Your Mind
Simple
‣ Concisely expresses the user's intent

Powerful
‣ Pooled storage, snapshots, clones, compression, scrubbing, RAID-Z

Safe
‣ Detects and corrects silent data corruption

Fast
‣ Dynamic striping, intelligent prefetch, pipelined I/O

Open
‣ http://www.opensolaris.org/os/community/zfs

Free
Thank You!