Delphix Agile Data Platform: ZFS for Databases
Adam Leventhal, CTO, Delphix (@ahl)


Post on 11-May-2015


TRANSCRIPT

Page 1: ZFS for Databases

Delphix Agile Data PlatformZFS for Databases

Adam LeventhalCTO, Delphix@ahl

Page 2: ZFS for Databases

Definition 1: ZFS Storage Appliance (ZSA)

• Shipped by Sun in 2008
• Originally the Sun Storage 7000

Page 3: ZFS for Databases

Definition 2: Filesystem for Solaris

• Filesystem developed in the Solaris Kernel Group
• First shipped in 2006 as part of Solaris 10 u2
• The engine for the ZSA

• Always consistent on disk (no fsck)
• End-to-end (strong) checksumming
• Snapshots are cheap to create; no practical limit
• Built-in replication
• Custom RAID (RAID-Z)

Page 4: ZFS for Databases

Definition 3: OpenZFS

• Sun open sourced ZFS in 2006
• Oracle closed it in 2010
• OpenZFS has continued
• Many of the same developers
  – Many left Oracle for companies innovating around OpenZFS
• Expanded beyond Solaris
  – Active OpenZFS ports on Linux, FreeBSD, Mac OS X
• Significant evolution
  – Many critical bugs fixed
  – Test framework, CLI improvements, progress reporting and resumability for replication, lz4, simpler API, etc.
  – Big emphasis on data-driven performance enhancements

Page 5: ZFS for Databases

This Talk

• First, which ZFS? The filesystem one.
  – Most will apply to both Oracle Solaris ZFS and OpenZFS
• Benefits of ZFS
• Practical considerations: storage pool and dataset layout
• One highly relevant area of performance analysis

Page 6: ZFS for Databases

Who am I?

• Joined the Solaris Kernel Group in 2001
• One of the three developers of DTrace
• Added double- and triple-parity RAID-Z to ZFS
• Founding member of the ZSA team (Fishworks) in 2006
• Joined Delphix in 2010
  – Founded in 2008 using ZFS as a component
  – Virtualize the database
  – Database copies become as cheap and flexible as VMs
  – Agile data for faster projects, more efficient devs, and happier DBAs
  – Now the leader in ZFS expertise
  – Founded the OpenZFS project
  – Also: UKOUG TECH13 sponsor; check out our booth; drinks tonight!

Page 7: ZFS for Databases

Why ZFS for Databases?

• Modern – in development for over 12 years
• Stable – in production for over 7 years
• Strong data integrity
• No practical limit on snapshots or clones

• Not all good news:
  – Random writes turn into sequential writes
  – Sequential reads turn into random reads
  – (Like NetApp/WAFL)

Page 8: ZFS for Databases

RAID-Z

• Traditional RAID-5/6/7 requires NV-RAM to perform
• RAID-Z always writes full, variable-width stripes
• Particularly good for cheap disks

• Not strictly better
  – Individual records are split between disks
  – RAID-5/6/7: a random read translates to a single disk read
  – RAID-Z: a random read becomes many disk ops (like RAID-3)

"Oracle Solaris ZFS implements an improvement on RAID-5, RAID-Z3, which uses parity, striping, and atomic operations to ensure reconstruction of corrupted data even in the face of three concurrent drive failures. It is ideally suited for managing industry standard storage servers."*

*www.oracle.com/us/products/servers-storage/solaris/solaris-zfs-ds-067320.pdf

Page 9: ZFS for Databases

Datasets for Oracle

• Filesystems (datasets) cheap/easy to create in ZFS
• Key settings
  – recordsize – atomic unit in ZFS; match Oracle block size (8K)
  – logbias={latency,throughput} – QoS hint
  – primarycache={none,metadata,all} – caching hint

# zfs create -o recordsize=8k -o logbias=throughput pool/datafiles
# zfs create -o recordsize=8k -o logbias=throughput pool/temp
# zfs create -o primarycache=metadata pool/archive
# zfs create pool/redo
# zfs list -o name,recordsize,logbias,primarycache
NAME            RECSIZE  LOGBIAS     PRIMARYCACHE
...
pool/archive       128K  latency     metadata
pool/datafiles       8K  throughput  all
pool/redo          128K  latency     all
pool/temp            8K  throughput  all

Page 10: ZFS for Databases

Inconsistent Write Latency

  microseconds  ------------- Distribution ------------- count
             8 |                                         0
            16 |                                         149
            32 |@@@@@@@@@@@@@@@@@@@@@                    8682
            64 |@@@@@                                    2226
           128 |@@@@                                     1743
           256 |@@                                       658
           512 |                                         95
          1024 |                                         20
          2048 |                                         19
          4096 |                                         122
          8192 |@@                                       744
         16384 |@@                                       865
         32768 |@@                                       625
         65536 |@                                        316
        131072 |                                         113
        262144 |                                         22
        524288 |                                         70
       1048576 |                                         94
       2097152 |                                         16
       4194304 |                                         0

Page 11: ZFS for Databases

Oracle Solaris ZFS Write Throttle

• Basic problem: limit rate of input to rate of output
• Originally no write throttle: consume all memory, then wait

• ZFS composes transactions into transaction groups
• Idea: limit the size of a transaction group
• Figure out the backend throughput; target a few seconds
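The sizing idea amounts to one line of arithmetic. A sketch in shell (the helper name is mine, not a ZFS interface): cap the transaction group at roughly a few seconds' worth of writes at the measured backend throughput.

```shell
# Hypothetical sketch: the transaction-group write limit is roughly the
# measured backend throughput multiplied by a target of a few seconds.
txg_write_limit_mb() {
  throughput_mb_s=$1   # measured backend throughput, MB/s
  target_s=$2          # target seconds of buffered writes
  echo $((throughput_mb_s * target_s))
}

txg_write_limit_mb 200 5   # 200 MB/s backend, 5 s target → 1000 MB
```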

Page 12: ZFS for Databases

ZFS Write Throttle Problems

• Transaction group full? Start writing it out
• One already being written out? Wait
• And it can be a looooong wait

• Solution?
  – When the transaction group is 7/8ths full, delay for 10ms
  – Didn't guess that, did you?

Page 13: ZFS for Databases

Let's Look Again

  microseconds  ------------- Distribution ------------- count
             8 |                                         0
            16 |                                         149
            32 |@@@@@@@@@@@@@@@@@@@@@                    8682
            64 |@@@@@                                    2226
           128 |@@@@                                     1743
           256 |@@                                       658
           512 |                                         95
          1024 |                                         20
          2048 |                                         19
          4096 |                                         122
          8192 |@@                                       744
         16384 |@@                                       865
         32768 |@@                                       625
         65536 |@                                        316
        131072 |                                         113
        262144 |                                         22
        524288 |                                         70
       1048576 |                                         94
       2097152 |                                         16
       4194304 |                                         0

Page 14: ZFS for Databases

Write Amplification

  microseconds   NFS writes (count)   IO writes (count)
            16                    0                   0
            32                   56                 259
            64                  118                 631
           128                   47                1024
           256                   13                5747
           512                   16                5421
          1024                 4172                4113
          2048                 9835                4890
          4096                  425                4528
          8192                  121                4311
         16384                  198                3334
         32768                 1158                1885
         65536                  957                 528
        131072                  110                  28
        262144                   31                   0
        524288                   25
       1048576                    0

              avg latency   iops
  NFS write   13231us       292/s
  IO write    8559us        622/s

Page 15: ZFS for Databases

Oracle Solaris ZFS Tuning

• IO queue depth: zfs_vdev_max_pending
  – Default of 10 – may be reasonable for spinning disks
  – ZFS on a SAN? 24 - 100
  – Higher for additional throughput
  – Lower for reduced latency

• Transaction group duration: zfs_txg_synctime
  – Default of 5 seconds
  – Higher for more metadata amortization
  – Lower for a smaller window for data loss with non-synced writes
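On Solaris these kernel tunables are typically set in /etc/system and take effect at the next boot. A fragment with purely illustrative values (the tunable names are from the slide; the values are not recommendations):

```
* /etc/system fragment -- illustrative values only
* Deeper IO queue for a SAN-backed pool
set zfs:zfs_vdev_max_pending = 32
* Shorter transaction groups to shrink the unsynced-data window
set zfs:zfs_txg_synctime = 2
```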

Page 16: ZFS for Databases

Back to the ZFS Write Throttle

• The measured IO throughput swings wildly:
• Many factors impact the measured IO throughput
• The wrong guess can lead to massive delays

# dtrace -n 'BEGIN { start = timestamp; }
    fbt::dsl_pool_sync:entry
    /stringof(args[0]->dp_spa->spa_name) == "domain0"/
    {
        @[(timestamp - start) / 1000000000] =
            min(args[0]->dp_write_limit / 1000000);
    }' -xaggsortkey
dtrace: description 'BEGIN' matched 2 probes
…
  14  487
  15  515
  16  515
  17  557
  18  581
  19  581
  20  617
  21  617
  22  635
  23  663
  24  663
…

Page 17: ZFS for Databases

OpenZFS I/O Scheduler

• Throw out the ZFS write throttle and IO queue
• Queue depth and throttle based on quantity of modified data

• Result: smooth, single-moded write latency

[Chart omitted: queue depth over time, x-axis 0 - 100 seconds, y-axis "Queue Depth" from 0 to 20.]
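A rough sketch of the dirty-data-based throttle in shell (my own simplification, not the actual OpenZFS code): each write is delayed by an amount that stays tiny while little data is dirty and grows steeply as dirty data approaches its limit, with an upper bound on any single delay.

```shell
# Simplified sketch of a dirty-data-based write delay, in microseconds.
# Function name and exact curve are illustrative, not the OpenZFS code;
# the parameters echo zfs_dirty_data_max / zfs_delay_scale / zfs_delay_max.
write_delay_us() {
  dirty=$1      # bytes of dirty (modified, unsynced) data
  max=$2        # dirty-data limit (cf. zfs_dirty_data_max)
  scale=$3      # curve steepness (cf. zfs_delay_scale)
  cap=$4        # bound on any single delay (cf. zfs_delay_max)
  if [ "$dirty" -ge "$max" ]; then
    echo "$cap"
    return
  fi
  # Hyperbolic curve: near zero when lightly dirty, steep near the limit.
  delay=$((scale * dirty / (max - dirty)))
  if [ "$delay" -gt "$cap" ]; then delay=$cap; fi
  echo "$delay"
}

write_delay_us 100 1000 500 100000   # lightly dirty → 55 us
write_delay_us 990 1000 500 100000   # nearly full  → 49500 us
```

The point of the smooth curve is that latency degrades gradually instead of flipping between "no delay" and "long stall", which is what produced the two-moded histograms earlier in the talk.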

Page 18: ZFS for Databases

OpenZFS I/O Scheduler Tuning

• Tunables that are easier to reason about
  – zfs_vdev_async_write_max_active (default: 10)
  – zfs_dirty_data_max (default: min(memory/10, 4GB))
  – zfs_delay_max_ns (default: 100µs)
  – zfs_delay_scale (delay curve; default: 500µs/op)
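As a sanity check on the default listed above, the zfs_dirty_data_max formula is easy to evaluate; a quick shell sketch (the helper name is mine, not a ZFS interface):

```shell
# Hypothetical helper: the zfs_dirty_data_max default is
# min(physical memory / 10, 4 GiB), here computed for a memory size in bytes.
dirty_data_max_default() {
  mem_bytes=$1
  cap=$((4 * 1024 * 1024 * 1024))   # 4 GiB ceiling
  tenth=$((mem_bytes / 10))
  if [ "$tenth" -lt "$cap" ]; then
    echo "$tenth"
  else
    echo "$cap"
  fi
}

dirty_data_max_default $((16 * 1024 * 1024 * 1024))   # 16 GiB RAM → 1717986918
dirty_data_max_default $((64 * 1024 * 1024 * 1024))   # 64 GiB RAM → 4294967296 (capped)
```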

Page 19: ZFS for Databases

Summing Up

• ZFS is great for databases
  – Storage Appliance, Oracle Solaris, OpenZFS
• Important best practices
• Beware the false RAID-Z idol
• Measure, measure, measure
  – DTrace is your friend (Wednesday 11:00am, Exchange 1)

Page 20: ZFS for Databases

Further Reading

• Oracle Solaris ZFS "Evil" Tuning Guide
  – www.solaris-cookbook.com/solaris/solaris-10-zfs-evil-tuning-guide/
• OpenZFS
  – www.open-zfs.org
• Oracle's tuning guide
  – docs.oracle.com/cd/E26505_01/html/E37386/chapterzfs-db1.html