ZFS: Theory & Architecture - Disruptive Storage Workshop - April 2016


Page 1: ZFS: Theory & Architecture - Disruptive Storagedisruptivestorage.org/wordpress-dsw/wp-content/uploads/2016/04/ZFS... · ZFS: Theory & Architecture ... ZFS Evil Tuning Guide 5

ZFS: Theory & Architecture

Disruptive Storage Workshop - April 2016

Who am I?

Aaron Gardner

Senior Scientific Consultant

Computer engineer who spent the last 15 years with biologists in situ

Along the way learned bioinformatics, data management, HPC, research cyberinfrastructure

I help enable science!


Setting the Stage

Reference Links

My stolen borrowed ZFS materials

1. ZFS — The Last Word in File Systems (Jeff Bonwick, Bill Moore) http://wiki.illumos.org/download/attachments/1146951/zfs_last.pdf

2. ZFS on Linux FAQ http://zfsonlinux.org/faq.html

3. zfsonlinux GitHub Releases https://github.com/zfsonlinux/zfs/releases

4. ZFS Evil Tuning Guide http://www.solaris-cookbook.eu/solaris/solaris-10-zfs-evil-tuning-guide

5. Aaron Toponce https://pthree.org/2012/04/17/install-zfs-on-debian-gnulinux

6. ZFS Raidz Performance, Capacity and Integrity https://calomel.org/zfs_raid_speed_capacity.html

7. ZFS — Wikipedia https://en.wikipedia.org/wiki/ZFS

Reference Links cont.

My stolen borrowed ZFS materials

8. Sequoia's 55PB Lustre+ZFS Filesystem http://zfsonlinux.org/docs/LUG12_ZFS_Lustre_for_Sequoia.pdf

9. Lustre 2.9 and Beyond http://cdn.opensfs.org/wp-content/uploads/2016/04/LUG2016D1_Lustre-2.9-and-Beyond_Dilger_Revised_v2.pdf

10. ZFS* as backend file system for Lustre*: the current status, how to optimize, where to improve http://cdn.opensfs.org/wp-content/uploads/2015/04/ZFS-As-Backend-File-System_Paciucci.pdf

11. Lustre* I/O Performance on ZFS http://cdn.opensfs.org/wp-content/uploads/2016/04/LUG2016D1_Large-Streaming-IO-for-Lustre-on-ZFS_Jinshan-Xiong_Revised_v2.pdf

12. Lustre ZFS Snapshot Overview http://cdn.opensfs.org/wp-content/uploads/2016/04/LUG2016D1_Lustre-ZFS-Snapshots_Fan-Yong_Revised_v2.pdf

13. Vectorized ZFS RAIDZ Implementation http://cdn.opensfs.org/wp-content/uploads/2016/04/LUG2016D1_Vectorized-ZFS-RAIDZ_Gvozden-Neskovic.pdf

ZFS History and Timeline

A decade and *still* pretty much the last word…

‣ The name originally stood for “Zettabyte File System”

‣ Today it does not stand for anything

‣ A ZFS file system can store up to 256 quadrillion zebibytes (ZiB)

2001 - Initial development at Sun

2004 - Announced

2005 - Released in Solaris and OpenSolaris

2006 - FUSE port for Linux started

2007 - Apple porting ZFS to Mac OS X

2008 - Port to FreeBSD released

2008 - Native Linux port started

2009 - Apple’s ZFS closed, MacZFS

2010 - OpenSolaris discontinued

2010 - illumos founded as OSS

2013 - OpenZFS project begins

2016 - Ubuntu 16.04 “Goes There”

Overview

[1] ZFS — The Last Word in File Systems

Pooled Storage

‣ Completely eliminates the antique notion of volumes

‣ Does for storage what VM did for memory

Transactional Object System

‣ Always consistent on disk – no fsck, ever

‣ Universal – file, block, iSCSI, swap ...

Provable End-to-End Data Integrity

‣ Detects and corrects silent data corruption

‣ Historically considered “too expensive” – no longer true

Simple Administration

‣ Concisely express your intent

Trouble with Existing Filesystems

[1] ZFS — The Last Word in File Systems

No defense against silent data corruption

‣ Any defect in disk, controller, cable, driver, laser, or firmware can corrupt data silently; like running a server without ECC memory

Brutal to manage

‣ Labels, partitions, volumes, provisioning, grow/shrink, /etc files...

‣ Lots of limits: filesystem/volume size, file size, number of files, files per directory, number of snapshots ...

‣ Different tools to manage file, block, iSCSI, NFS, CIFS ...

‣ Not portable between platforms (x86, SPARC, PowerPC, ARM ...)

Dog slow

‣ Linear-time create, fat locks, fixed block size, naïve prefetch, dirty region logging, painful RAID rebuilds, growing backup time

Objective

[1] ZFS — The Last Word in File Systems

End the Suffering…

‣ Figure out why storage has gotten so complicated

‣ Blow away 20 years of obsolete assumptions

‣ Design an integrated system from scratch

Why Volumes Exist

[1] ZFS — The Last Word in File Systems

In the beginning, each filesystem managed a single disk. It wasn't very big.

Customers wanted more space, bandwidth, reliability

‣ Hard: redesign filesystems to solve these problems well

‣ Easy: insert a shim (“volume”) to cobble disks together

An industry grew up around the FS/volume model

‣ Filesystem, volume manager sold as separate products

‣ Inherent problems in FS/volume interface can't be fixed

FS/Volume vs. Pooled Storage

[1] ZFS — The Last Word in File Systems

Traditional Volumes

‣ Abstraction: virtual disk

‣ Partition/volume for each FS

‣ Grow/shrink by hand

‣ Each FS has limited bandwidth

‣ Storage is fragmented, stranded

ZFS Pooled Storage

‣ Abstraction: malloc/free

‣ No partitions to manage

‣ Grow/shrink automatically

‣ All bandwidth always available

‣ All storage in the pool is shared

FS/Volume Interfaces vs. ZFS

[1] ZFS — The Last Word in File Systems

Universal Storage

[1] ZFS — The Last Word in File Systems

‣ DMU is a general-purpose transactional object store

  ‣ ZFS dataset = up to 2^48 objects, each up to 2^64 bytes

‣ Key features common to all datasets

  ‣ Snapshots, compression, encryption, end-to-end data integrity

‣ Any flavor you want: file, block, object, network

Merkle Trees

[5] Aaron Toponce

‣ Merkle Tree = A hash tree

‣ Each block of data is hashed

‣ Parent hashes are generated by concatenating and hashing child hashes

‣ Uber block hash is parent hash at top of tree

‣ By walking the tree you can quickly determine corruption/changes to data
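The hashing scheme in the bullets above can be sketched in a few lines of Python. This is a generic binary hash tree, not the ZFS on-disk layout: the use of SHA-256, the pairing of an odd node with itself, and the name `merkle_root` are all assumptions made for illustration.

```python
import hashlib


def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_root(blocks: list[bytes]) -> bytes:
    """Hash every block, then hash concatenated child hashes until one root remains."""
    if not blocks:
        raise ValueError("need at least one block")
    level = [sha256(b) for b in blocks]
    while len(level) > 1:
        # Assumption of this sketch: an unpaired node is paired with itself.
        if len(level) % 2:
            level.append(level[-1])
        level = [sha256(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]


blocks = [b"block-0", b"block-1", b"block-2", b"block-3"]
root = merkle_root(blocks)

# Changing any single block changes the root hash.
corrupted = list(blocks)
corrupted[2] = b"block-2 with a flipped bit"
print(root != merkle_root(corrupted))
```

Because every parent hash covers its children, comparing roots is enough to know whether anything underneath changed; walking down the mismatching branch locates the affected block.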

Copy-On-Write Transactions

[5] Aaron Toponce

‣ Modifying data does not overwrite the existing block

‣ Instead, the modified data is written to a new block and the pointers are updated to the new block

‣ Old block can be freed, or kept with old pointers (free snapshot*)

‣ Fragments FS, but mitigated

‣ COW is the Future

Constant-Time Snapshots

[1] ZFS — The Last Word in File Systems

‣ At end of TX group, don't free COWed blocks

  ‣ Actually cheaper to take a snapshot than not!

‣ The tricky part: how do you know when a block is free?

Storage Integrity

[1] ZFS — The Last Word in File Systems

Uncorrectable bit error rates have stayed roughly constant

‣ 1 in 10^14 bits (~12TB) for desktop-class drives

‣ 1 in 10^15 bits (~120TB) for enterprise-class drives (allegedly)

‣ Bad sector every 8-20TB in practice (desktop and enterprise)

Drive capacities doubling every 12-18 months

Number of drives per deployment increasing

→ Rapid increase in error rates

Both silent and “noisy” data corruption becoming more common

Cheap flash storage will only accelerate this trend (TLC, V-NAND)
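The terabyte figures quoted for the bit error rates are simple unit conversions, which the following sketch reproduces. The probability function additionally assumes independent bit errors at exactly the specified rate, which is a modeling assumption and not a claim from the slides.

```python
import math

TB = 10**12  # decimal terabyte, in bytes


def tb_read_per_error(bits_per_error: int) -> float:
    """Average amount of data read (in TB) per uncorrectable bit error."""
    return bits_per_error / 8 / TB


def p_at_least_one_error(tb_read: float, bits_per_error: int) -> float:
    """Chance of hitting at least one error, assuming independent bit errors."""
    bits_read = tb_read * TB * 8
    return -math.expm1(bits_read * math.log1p(-1.0 / bits_per_error))


desktop = tb_read_per_error(10**14)      # 12.5
enterprise = tb_read_per_error(10**15)   # 125.0
print(desktop, enterprise)
print(round(p_at_least_one_error(12.5, 10**14), 3))
```

Under that independence assumption, reading 12.5 TB from a desktop-class drive already has roughly a 63% chance of encountering at least one uncorrectable error.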

ZFS Data Integrity

[1] ZFS — The Last Word in File Systems

Self Healing Data in ZFS

[1] ZFS — The Last Word in File Systems

Traditional RAID4/5/6

[1] ZFS — The Last Word in File Systems

‣ Several data disks and >=1 parity disk

‣ Fatal flaw: partial stripe writes

  ‣ Parity update requires read-modify-write (slow)

    ‣ Read old data and old parity (two synchronous disk reads)

    ‣ Compute new parity = new data ^ old data ^ old parity

    ‣ Write new data and new parity

  ‣ Suffers from write hole:

    ‣ Loss of power between data and parity writes will corrupt data

    ‣ Workaround: $$$ NVRAM in hardware (i.e., don't lose power!)

‣ Can't detect or correct silent data corruption
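The parity formula and the write hole described above can be demonstrated with byte strings standing in for disks. This is a minimal single-parity sketch; the helper names are hypothetical and the four-byte "disks" are only for illustration.

```python
def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def full_parity(data_blocks: list[bytes]) -> bytes:
    parity = bytes(len(data_blocks[0]))
    for block in data_blocks:
        parity = xor(parity, block)
    return parity


def rmw_parity(old_data: bytes, old_parity: bytes, new_data: bytes) -> bytes:
    """new parity = new data ^ old data ^ old parity (needs two reads first)."""
    return xor(xor(new_data, old_data), old_parity)


stripe = [b"AAAA", b"BBBB", b"CCCC"]
parity = full_parity(stripe)

# Partial-stripe write: only the second disk changes.
new_block = b"bbbb"
new_parity = rmw_parity(stripe[1], parity, new_block)

# Write hole: the data write landed but the parity write did not.
after_crash = [stripe[0], new_block, stripe[2]]
stale_parity = parity

# If disk 0 later fails, rebuilding it from the stale parity yields garbage.
rebuilt_disk0 = xor(xor(stale_parity, after_crash[1]), after_crash[2])
print(new_parity == full_parity(after_crash), rebuilt_disk0 == stripe[0])
```

The incremental update matches a full recomputation, but once data and parity are out of step the array silently reconstructs wrong data, and nothing in the stripe records that this happened.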

RAID-Z

[1] ZFS — The Last Word in File Systems

‣ Dynamic stripe width

  ‣ Variable block size: 512 bytes – 16MB (previously 128K)

  ‣ Each logical block is its own stripe

‣ All writes are full-stripe writes

  ‣ Eliminates read-modify-write (it's fast)

  ‣ Eliminates the RAID write hole (no need for NVRAM)

‣ Single, double, triple parity (raidz, raidz2, raidz3)

‣ Detects and corrects silent data corruption

  ‣ Checksum-driven combinatorial reconstruction

‣ No special hardware – ZFS loves cheap disks
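"Checksum-driven combinatorial reconstruction" can be illustrated as follows: when a block fails its checksum and no disk has reported an error, assume each column in turn is the bad one, rebuild it from parity, and accept the first candidate whose checksum matches. The sketch below is heavily simplified (single XOR parity, fixed-size columns, SHA-256 as the checksum) and the function names are hypothetical, not the ZFS implementation.

```python
import hashlib


def xor_all(columns: list[bytes]) -> bytes:
    out = bytes(len(columns[0]))
    for col in columns:
        out = bytes(x ^ y for x, y in zip(out, col))
    return out


def checksum(columns: list[bytes]) -> bytes:
    return hashlib.sha256(b"".join(columns)).digest()


def read_block(data_cols: list[bytes], parity: bytes, expected: bytes):
    """Return (good columns, index of the repaired column or None)."""
    if checksum(data_cols) == expected:
        return list(data_cols), None
    for bad in range(len(data_cols)):
        # Hypothesis: column `bad` is corrupt; rebuild it from the rest + parity.
        others = [c for i, c in enumerate(data_cols) if i != bad]
        candidate = list(data_cols)
        candidate[bad] = xor_all(others + [parity])
        if checksum(candidate) == expected:
            return candidate, bad
    raise IOError("block is unrecoverable with single parity")


good = [b"aaaa", b"bbbb", b"cccc"]
parity = xor_all(good)
expected = checksum(good)   # kept with the parent pointer, not with the data

damaged = [good[0], b"bXbb", good[2]]   # silent corruption, no I/O error
print(read_block(damaged, parity, expected))
```

The checksum is what makes this possible: parity alone can say that a stripe is inconsistent, but not which column is wrong.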

Traditional Resilvering

[1] ZFS — The Last Word in File Systems

‣ Creating a new mirror (or RAID stripe):

  ‣ Copy one disk to the other (or XOR them together) so all copies are self-consistent – even though they're all random garbage!

‣ Replacing a failed device:

  ‣ Whole-disk copy – even if the volume is nearly empty

  ‣ No checksums or validity checks along the way

  ‣ No assurance of progress until 100% complete – your root directory may be the last block copied

‣ Recovering from a transient outage:

  ‣ Dirty region logging – slow, and easily defeated by random writes

ZFS Resilvering

[1] ZFS — The Last Word in File Systems

‣ Top-down resilvering

  ‣ ZFS resilvers the storage pool's block tree from the root down

  ‣ Most important blocks first

  ‣ Every single block copy increases the amount of discoverable data

‣ Only copy live blocks

  ‣ No time wasted copying free space

  ‣ Zero time to initialize a new mirror or RAID-Z group

‣ Dirty time logging (for transient outages)

  ‣ ZFS records the transaction group window that the device missed

  ‣ To resilver, ZFS walks the tree and prunes where birth time < DTL

  ‣ A five-second outage takes five seconds to repair
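The pruning rule for transient outages can be sketched as a recursive tree walk. The model below assumes the copy-on-write property that a parent is always at least as new as anything beneath it, and it reduces the dirty time log to a single starting transaction group; `Block` and `blocks_to_resilver` are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class Block:
    name: str
    birth_txg: int                      # transaction group that wrote this block
    children: list = field(default_factory=list)


def blocks_to_resilver(block: Block, missed_from_txg: int) -> list[str]:
    """Walk from the root down, skipping subtrees born before the outage."""
    if block.birth_txg < missed_from_txg:
        # With copy-on-write, nothing below can be newer than its parent.
        return []
    found = [block.name]
    for child in block.children:
        found += blocks_to_resilver(child, missed_from_txg)
    return found


tree = Block("root", 105, [
    Block("dirA", 105, [Block("file1", 90), Block("file2", 105)]),
    Block("dirB", 80, [Block("file3", 80)]),
])

# The device was offline from transaction group 100 onward.
print(blocks_to_resilver(tree, missed_from_txg=100))
```

Only the blocks written during the outage are visited, which is why the repair time tracks the length of the outage instead of the size of the disk.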

Disk Scrubbing

[1] ZFS — The Last Word in File Systems

‣ Finds latent errors while they're still correctable

  ‣ ECC memory scrubbing for disks

‣ Verifies the integrity of all data

  ‣ Traverses pool metadata to read every copy of every block

    ‣ All mirror copies, all RAID-Z parity, and all ditto blocks

  ‣ Verifies each copy against its 256-bit checksum

  ‣ Repairs data as it goes

‣ Minimally invasive

  ‣ Low I/O priority ensures that scrubbing doesn't get in the way

  ‣ User-defined scrub rates coming soon

    ‣ Gradually scrub the pool over the course of a month, a quarter, etc.
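For a mirrored block, "verify each copy against its checksum and repair as it goes" amounts to the loop below. It is a sketch only: SHA-256 stands in for the 256-bit checksum, each `bytearray` stands in for one mirror copy, and `scrub_block` is a hypothetical name.

```python
import hashlib


def scrub_block(copies: list[bytearray], expected: bytes) -> int:
    """Check every copy against the checksum; overwrite bad copies from a good one.

    Returns the number of copies that were repaired.
    """
    good = next(
        (bytes(c) for c in copies if hashlib.sha256(c).digest() == expected),
        None,
    )
    if good is None:
        raise IOError("no copy matches the checksum")
    repaired = 0
    for copy in copies:
        if hashlib.sha256(copy).digest() != expected:
            copy[:] = good      # rewrite the damaged copy in this toy model
            repaired += 1
    return repaired


expected = hashlib.sha256(b"data").digest()
mirror = [bytearray(b"data"), bytearray(b"dbta")]   # second copy has bit rot

first_pass = scrub_block(mirror, expected)
second_pass = scrub_block(mirror, expected)
print(first_pass, second_pass)
```

A scrub applies this to every allocated block, so latent damage on a rarely read copy is found while another good copy still exists.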

Ditto Blocks

[1] ZFS — The Last Word in File Systems

‣ Data replication above and beyond mirror/RAID-Z

  ‣ Each logical block can have up to three physical blocks

    ‣ Different devices whenever possible

    ‣ Different places on the same device otherwise (e.g. laptop drive)

  ‣ All ZFS metadata 2+ copies

    ‣ Small cost in latency and bandwidth (metadata ≈ 1% of data)

  ‣ Explicitly settable for precious user data

‣ Detects and corrects silent data corruption

  ‣ In a multi-disk pool, ZFS survives any non-consecutive disk failures

  ‣ In a single-disk pool, ZFS survives loss of up to 1/8 of the platter

‣ ZFS survives failures that send other filesystems to tape

Scalability

[1] ZFS — The Last Word in File Systems

‣ Immense capacity (128-bit)

  ‣ Moore's Law: need 65th bit in 10-15 years

  ‣ ZFS capacity: 256 quadrillion ZB (1ZB = 1 billion TB)

  ‣ Exceeds quantum limit of Earth-based storage

    ‣ Seth Lloyd, “Ultimate physical limits to computation.” Nature 406, 1047-1054 (2000)

‣ 100% dynamic metadata

  ‣ No limits on files, directory entries, etc.

  ‣ No wacky knobs (e.g. inodes/cg)

‣ Concurrent everything

  ‣ Byte-range locking: parallel read/write without violating POSIX

  ‣ Parallel, constant-time directory operations
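The capacity figure on this slide can be checked with integer arithmetic. The sketch assumes that "128-bit" means 2^128 addressable bytes and that both "ZB" and "quadrillion" are read in their binary senses (2^70 and 2^50); those readings are assumptions, chosen because they make the quoted number come out exactly.

```python
POOL_BYTES = 2**128   # assumed meaning of "128-bit" capacity
ZIB = 2**70           # one zebibyte
QUADRILLION = 2**50   # binary reading; the decimal value would be 10**15

pool_in_zib = POOL_BYTES // ZIB
print(pool_in_zib == 256 * QUADRILLION)

# "1ZB = 1 billion TB" holds for the decimal units.
print(10**21 == 10**9 * 10**12)

# A 64-bit byte address tops out at 16 EiB, hence the "65th bit" remark.
print(2**64 // 2**60)
```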

Performance

[1] ZFS — The Last Word in File Systems

‣ Copy-on-write design

  ‣ Turns random writes into sequential writes

  ‣ Intrinsically hot-spot-free

‣ Pipelined I/O

  ‣ Fully scoreboarded 24-stage pipeline with I/O dependency graphs

  ‣ Maximum possible I/O parallelism

  ‣ Priority, deadline scheduling, out-of-order issue, sorting, aggregation

‣ Dynamic striping across all devices

‣ Intelligent prefetch

‣ Variable block size

Dynamic Striping

[1] ZFS — The Last Word in File Systems

Variable Block Size

[1] ZFS — The Last Word in File Systems

‣ No single block size is optimal for everything

  ‣ Large blocks: less metadata, higher bandwidth

  ‣ Small blocks: more space-efficient for small objects

  ‣ Record-structured files (e.g. databases) have natural granularity; filesystem must match it to avoid read/modify/write

‣ Why not arbitrary extents?

  ‣ Extents don't COW or checksum nicely (too big)

  ‣ Large blocks suffice to run disks at platter speed

‣ Per-object granularity

  ‣ A 37k file consumes 37k – no wasted space

‣ Enables transparent block-based compression

Built-in Compression

[1] ZFS — The Last Word in File Systems

‣ Block-level compression in SPA

  ‣ Transparent to all other layers

  ‣ Each block compressed independently

  ‣ All-zero blocks converted into file holes

‣ LZJB, GZIP, LZ4 available
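Independent per-block compression with holes for all-zero blocks can be sketched with the standard library. `zlib` (DEFLATE, the algorithm behind gzip) stands in for the listed compressors, since LZJB and LZ4 are not in the Python standard library; the 4 KiB block size and the function names are assumptions.

```python
import zlib

BLOCK_SIZE = 4096  # hypothetical block size for this sketch


def compress_blocks(data: bytes) -> list:
    """Compress each block on its own; an all-zero block becomes a hole (None)."""
    out = []
    for offset in range(0, len(data), BLOCK_SIZE):
        block = data[offset:offset + BLOCK_SIZE]
        if block.count(0) == len(block):
            out.append(None)                 # hole: nothing is stored
        else:
            out.append(zlib.compress(block))
    return out


def decompress_blocks(blocks: list, total_len: int) -> bytes:
    parts = []
    for i, block in enumerate(blocks):
        size = min(BLOCK_SIZE, total_len - i * BLOCK_SIZE)
        parts.append(bytes(size) if block is None else zlib.decompress(block))
    return b"".join(parts)


data = b"A" * BLOCK_SIZE + bytes(BLOCK_SIZE) + b"xyz"
stored = compress_blocks(data)
print([None if b is None else len(b) for b in stored])
print(decompress_blocks(stored, len(data)) == data)
```

Because every block is compressed on its own, any block can be read back without touching its neighbors, which is what keeps compression transparent to the layers above.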


Miscellaneous Topics

‣ Native ZFS encryption is available in ZFS for Solaris

‣ With ZFS on Linux (ZoL), you can use eCryptfs on top of ZFS or LUKS below ZFS

‣ Intelligent Prefetch

  ‣ Multiple prefetch streams

  ‣ Automatic stride detection (linear access pattern detection)

Administration

[1] ZFS — The Last Word in File Systems

‣ Pooled storage – no more volumes!

  ‣ Up to 2^48 datasets per pool – filesystems, iSCSI targets, swap, etc.

  ‣ Nothing to provision!

‣ Filesystems become administrative control points

  ‣ Hierarchical, with inherited properties

    ‣ Per-dataset policy: snapshots, compression, backups, quotas, etc.

    ‣ Who's using all the space? du(1) takes forever, but df(1M) is instant

    ‣ Manage logically related filesystems as a group

    ‣ Inheritance makes large-scale administration a snap

  ‣ Policy follows the data (mounts, shares, properties, etc.)

  ‣ Delegated administration lets users manage their own data

  ‣ ZFS filesystems are cheap – use a ton, it's OK, really!

‣ Online everything
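"Hierarchical, with inherited properties" boils down to a lookup that walks toward the root until some ancestor has the property set locally. The sketch below models only that lookup; the `Dataset` class and the dataset names are hypothetical, and real ZFS has properties that are not inheritable.

```python
class Dataset:
    """Toy dataset node: a property is local if set here, otherwise inherited."""

    def __init__(self, name, parent=None, **local_props):
        self.name = name
        self.parent = parent
        self.local = dict(local_props)

    def get(self, prop):
        """Return (value, name of the dataset the value comes from)."""
        node = self
        while node is not None:
            if prop in node.local:
                return node.local[prop], node.name
            node = node.parent
        raise KeyError(prop)


pool = Dataset("tank", compression="lz4")
home = Dataset("tank/home", pool)
alice = Dataset("tank/home/alice", home)
scratch = Dataset("tank/home/alice/scratch", alice, compression="off")

print(alice.get("compression"))    # inherited from the pool
print(scratch.get("compression"))  # overridden locally
```

Setting a policy once near the top of the tree is enough to cover every descendant, and a local override affects only the subtree beneath it.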

Snapshots

[1] ZFS — The Last Word in File Systems

‣ Read-only point-in-time copy of a filesystem

  ‣ Instantaneous creation, unlimited number

  ‣ No additional space used – blocks copied only when they change

  ‣ Accessible through .zfs/snapshot in root of each filesystem

    ‣ Allows users to recover files without sysadmin intervention

Clones

[1] ZFS — The Last Word in File Systems

Writable copy of a snapshot

‣ Instantaneous creation, unlimited number

Ideal for storing many private copies of mostly-shared data

‣ Software installations

‣ Source code repositories

‣ Diskless clients

‣ Containers

‣ Virtual machines

Send / Receive

[1] ZFS — The Last Word in File Systems

Powered by snapshots (and now bookmarks)

‣ Full backup: any snapshot

‣ Incremental backup: any snapshot delta

‣ Very fast delta generation – cost proportional to data changed

So efficient it can drive remote replication

Data Migration

[1] ZFS — The Last Word in File Systems

Host-neutral on-disk format

‣ Change server, it just works

‣ Adaptive endianness: neither platform pays a tax

  ‣ Writes always use native endianness, set bit in block pointer

  ‣ Reads byteswap only if host endianness != block endianness

ZFS takes care of everything

‣ Forget about device paths, config files, /etc/vfstab, etc.

‣ ZFS will share/unshare, mount/unmount, etc. as necessary
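The adaptive endianness rule above (write in native order, record the order, swap on read only when it differs) can be sketched with `struct`. A dictionary field stands in for the bit in the block pointer; `store_u64` and `load_u64` are hypothetical names.

```python
import struct
import sys

FORMATS = {"little": "<Q", "big": ">Q"}


def store_u64(value: int, writer_order: str = sys.byteorder) -> dict:
    """The writer uses its own byte order and records which one it used."""
    return {"order": writer_order, "payload": struct.pack(FORMATS[writer_order], value)}


def load_u64(block: dict, host_order: str = sys.byteorder) -> int:
    """The reader decodes natively and byteswaps only if the orders differ."""
    (raw,) = struct.unpack(FORMATS[host_order], block["payload"])
    if block["order"] != host_order:
        raw = int.from_bytes(raw.to_bytes(8, "little"), "big")  # byteswap
    return raw


value = 0x0102030405060708
for writer in ("little", "big"):
    for reader in ("little", "big"):
        print(writer, reader, load_u64(store_u64(value, writer), reader) == value)
```

When writer and reader share a byte order, which is the common case, no swap happens on either side; only a pool moved between architectures pays for the conversion, and only on reads.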


ZFS and Virtualization

[1] ZFS — The Last Word in File Systems

‣ Fast – snapshots and clones make zone creation instant

‣ Useful for container and VM environments

  ‣ OpenStack, Docker, Flocker, etc.

Pool: Physical resource in global zone

Dataset: Logical resource in local zone

ZFS Root Filesystem

[1] ZFS — The Last Word in File Systems

Brings all the ZFS goodness to /

‣ Checksums, compression, replication, snapshots, clones

‣ Boot from any dataset

Patching becomes safe

‣ Take snapshot, apply patch... rollback if there's a problem

Live upgrade becomes fast

‣ Create clone (instant), upgrade, boot from clone

‣ No “extra partition”

Worked on Solaris, working but unsupported in Linux (Ubuntu*)

‣ ZFS can easily create multiple boot environments

‣ GRUB can easily manage them


But wait! There’s more!

ZFS on Linux

How is ZFS deployed on Linux today?

‣ A project run by the DOE (LLNL, etc.) http://zfsonlinux.org

‣ Ported ZFS natively into the Linux kernel

  • SPL (Solaris Porting Layer)

  • ZFS

  • Can be deployed via kmod or dkms

‣ Actively developed and supported, rapidly evolving; originally driven by the need for large-scale Lustre (Sequoia), now a community focus

Current Release (v0.6.5.6)

Some new features to be aware of…

‣ v0.6.5

  ‣ .zfs/snapshot over NFS

  ‣ large_block IO

    ‣ Up to 16MB record size (great for Lustre, reducing ZFS slab fragmentation, etc.)

  ‣ 10x lower ZIL latency

  ‣ zvol (>50% higher throughput, >20% lower latency)

‣ v0.6.4

  ‣ Asynchronous IO (AIO) support

    ‣ Previously was Direct IO only

  ‣ LZ4 compression of metadata

  ‣ Bash completion

A quick word on ZFS caching

ZFS was designed to take advantage of RAM, SSD, etc.

‣ Adaptive Replacement Cache (ARC)

  ‣ Recent and frequent block cache (read)

‣ ZFS Intent Log (ZIL)

  ‣ Log where synchronous writes are recorded; writes are bundled into transaction groups and then flushed to the pool on a regular interval (txg_sync)

RAID

An overview of common RAID/RAIDZ levels

RAID Level | Data Drives | Parity Drives | Failures | Read | Write | Use?

0 | >=1 | 0 | 0 | Best* | Best* | No

1 | 2 | 1 | 1 | Best | Better | Yes

1+0 | 2 + x | 2 + x | 1 from any RAID1 set | Best | Better | Yes

5 | 2 + x | 1 | 1 | Better | OK | No

5+0 | (2 + x) * y | y | 1 from any RAID5 set | Better | Better | No

6 | 2 + x | 2 | 2 | Better | Meh | Yes

6+0 | (2 + x) * y | 2 * y | 2 from any RAID6 set | Better | OK | Maybe

7 | 2 + x | 3 | 3 | Better | Sigh | Maybe

RAID

What was that about data members and powers of two again?

RAID Level | Recommended Data Drives | Parity Drives | Total Drives

1 | 1 | 1 | 2

1+0 | 2 - 16 | 2 - 16 | 4 - 32

5 | 2, 4 | 1 | 3, 5

6 | 4, 8 | 2 | 6, 10

7 | 4, 8, 16 | 3 | 7, 11, 19

*You can be successful with other configurations, but you have to be more careful*

A quick word on ZFS dedupe

ZFS has deduplication, but should you use it?

‣ Need a lot of RAM

  ‣ “5GB RAM per 1TB disk” estimate w/ 128KB

‣ Lots of snapshots mixed with dedupe on petascale ZFS pools = pain

‣ Can use deduplication in specialized cases to good effect (reference DB and repetitive workflows)

‣ General recommendation: don’t use it generally
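The RAM rule of thumb can be unpacked with a little arithmetic. The sketch assumes that "128KB" refers to the average block size, that every block is unique and needs one dedup-table entry, and that the quoted units are binary; none of that is stated on the slide, so treat the result as a reading of the estimate and not as a measured figure.

```python
KIB, GIB, TIB = 2**10, 2**30, 2**40


def blocks_per_tib(avg_block_bytes: int) -> int:
    return TIB // avg_block_bytes


def implied_bytes_per_entry(ram_bytes: int, avg_block_bytes: int) -> float:
    """RAM per unique block implied by a 'RAM per TB' rule of thumb."""
    return ram_bytes / blocks_per_tib(avg_block_bytes)


entries = blocks_per_tib(128 * KIB)
per_entry = implied_bytes_per_entry(5 * GIB, 128 * KIB)
print(entries, per_entry)

# Under the same per-entry cost, smaller blocks need proportionally more RAM.
ram_at_32k_gib = per_entry * blocks_per_tib(32 * KIB) / GIB
print(ram_at_32k_gib)
```

The scaling is the practical point: the estimate grows linearly as the average block size shrinks, so pools holding many small blocks need far more memory than the headline number suggests.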


Other ZFS Comments

Best practices

‣ Use ECC RAM to keep in-flight and modified data safe while in memory (ZFS is protecting most other things)

‣ ZVOLs

  ‣ Don’t forget about them! Use them when you need block/alternate FS

‣ Compression usually helps performance, especially using LZ4 (now default algo.)

‣ ZFS send and receive are your friends

‣ Do use ZIL/SLOG and L2ARC on SSD, etc. as supported

  ‣ Easy to build hybrid-SSD storage system

  ‣ Node local burst buffer characteristic (variable stripe, re-ordered IO into transactions, etc.)

‣ Can use fletcher4 over sha256 to speed hashing

  ‣ Not as necessary after AVX/AVX2 patches
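To show why fletcher4 is cheaper than sha256, here is a sketch of a Fletcher-4 style checksum: four running 64-bit sums over 32-bit words, using only additions. Little-endian word order and the requirement that the input length be a multiple of four are assumptions of this sketch; it is meant to show the shape of the computation, not to reproduce ZFS output byte for byte.

```python
import hashlib
import struct

MASK64 = (1 << 64) - 1


def fletcher4(data: bytes) -> tuple[int, int, int, int]:
    """Four cascaded 64-bit sums over the input read as 32-bit words."""
    if len(data) % 4:
        raise ValueError("length must be a multiple of 4 bytes")
    a = b = c = d = 0
    for (word,) in struct.iter_unpack("<I", data):
        a = (a + word) & MASK64
        b = (b + a) & MASK64
        c = (c + b) & MASK64
        d = (d + c) & MASK64
    return a, b, c, d


sample = struct.pack("<2I", 1, 2)
print(fletcher4(sample))
print(hashlib.sha256(sample).hexdigest())
```

The checksum is a handful of integer additions per word, which is why it is fast; sha256 is a cryptographic hash and does far more work per byte in exchange for collision resistance.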

ZFS/Lustre Comments

Best practices

‣ Being brief: this stuff will likely be covered in detail during the rest of the workshop

8. Sequoia's 55PB Lustre+ZFS Filesystem http://zfsonlinux.org/docs/LUG12_ZFS_Lustre_for_Sequoia.pdf

9. Lustre 2.9 and Beyond http://cdn.opensfs.org/wp-content/uploads/2016/04/LUG2016D1_Lustre-2.9-and-Beyond_Dilger_Revised_v2.pdf

10. ZFS* as backend file system for Lustre*: the current status, how to optimize, where to improve http://cdn.opensfs.org/wp-content/uploads/2015/04/ZFS-As-Backend-File-System_Paciucci.pdf

11. Lustre* I/O Performance on ZFS http://cdn.opensfs.org/wp-content/uploads/2016/04/LUG2016D1_Large-Streaming-IO-for-Lustre-on-ZFS_Jinshan-Xiong_Revised_v2.pdf

12. Lustre ZFS Snapshot Overview http://cdn.opensfs.org/wp-content/uploads/2016/04/LUG2016D1_Lustre-ZFS-Snapshots_Fan-Yong_Revised_v2.pdf

13. Vectorized ZFS RAIDZ Implementation http://cdn.opensfs.org/wp-content/uploads/2016/04/LUG2016D1_Vectorized-ZFS-RAIDZ_Gvozden-Neskovic.pdf

Summary

[1] ZFS — The Last Word in File Systems

End the Suffering & Free Your Mind

Simple

‣ Concisely expresses the user's intent

Powerful

‣ Pooled storage, snapshots, clones, compression, scrubbing, RAID-Z

Safe

‣ Detects and corrects silent data corruption

Fast

‣ Dynamic striping, intelligent prefetch, pipelined I/O

Open

‣ http://www.opensolaris.org/os/community/zfs

Free

Thank You