Delphix Agile Data Platform: ZFS for Databases
Adam Leventhal, CTO, Delphix (@ahl)


Post on 11-May-2015


TRANSCRIPT

Page 1: ZFS for Databases

Delphix Agile Data PlatformZFS for Databases

Adam LeventhalCTO, Delphix@ahl

Page 2: ZFS for Databases

Definition 1: ZFS Storage Appliance (ZSA)

• Shipped by Sun in 2008
• Originally the Sun Storage 7000

Page 3: ZFS for Databases

Definition 2: Filesystem for Solaris

• Filesystem developed in the Solaris Kernel Group
• First shipped in 2006 as part of Solaris 10 u2
• The engine for the ZSA

• Always consistent on disk (no fsck)
• End-to-end (strong) checksumming
• Snapshots are cheap to create; no practical limit
• Built-in replication
• Custom RAID (RAID-Z)

Page 4: ZFS for Databases

Definition 3: OpenZFS

• Sun open sourced ZFS in 2006
• Oracle closed it in 2010
• OpenZFS has continued
• Many of the same developers
  – Many left Oracle for companies innovating around OpenZFS
• Expanded beyond Solaris
  – Active OpenZFS ports on Linux, FreeBSD, Mac OS X
• Significant evolution
  – Many critical bugs fixed
  – Test framework, CLI improvements, progress reporting and resumability for replication, lz4, simpler API, etc.
  – Big emphasis on data-driven performance enhancements

Page 5: ZFS for Databases

This Talk

• First, which ZFS? The filesystem one.
  – Most will apply to both Oracle Solaris ZFS and OpenZFS
• Benefits of ZFS
• Practical considerations: storage pool and dataset layout
• One highly relevant area of performance analysis

Page 6: ZFS for Databases

Who am I?

• Joined the Solaris Kernel Group in 2001
• One of the three developers of DTrace
• Added double- and triple-parity RAID-Z to ZFS
• Founding member of the ZSA team (Fishworks) in 2006
• Joined Delphix in 2010
  – Founded in 2008 using ZFS as a component
  – Virtualize the database
  – Database copies become as cheap and flexible as VMs
  – Agile data for faster projects, more efficient devs, and happier DBAs
  – Now the leader in ZFS expertise
  – Founded the OpenZFS project
  – Also: UKOUG TECH13 sponsor; check out our booth; drinks tonight!

Page 7: ZFS for Databases

Why ZFS for Databases?

• Modern – in development for over 12 years
• Stable – in production for over 7 years
• Strong data integrity
• No practical limit on snapshots or clones

• Not all good news:
  – Random writes turn into sequential writes
  – Sequential reads turn into random reads
  – (Like NetApp/WAFL)

Page 8: ZFS for Databases

RAID-Z

• Traditional RAID-5/6/7 requires NV-RAM to perform
• RAID-Z always writes full, variable-width stripes
• Particularly good for cheap disks

• Not strictly better
  – Individual records are split between disks
  – RAID-5/6/7: a random read translates to a single disk read
  – RAID-Z: a random read becomes many disk ops (like RAID-3)

"Oracle Solaris ZFS implements an improvement on RAID-5, RAID-Z3, which uses parity, striping, and atomic operations to ensure reconstruction of corrupted data even in the face of three concurrent drive failures. It is ideally suited for managing industry standard storage servers."*

*www.oracle.com/us/products/servers-storage/solaris/solaris-zfs-ds-067320.pdf

Page 9: ZFS for Databases

Datasets for Oracle

• Filesystems (datasets) cheap/easy to create in ZFS
• Key settings
  – recordsize – atomic unit in ZFS; match Oracle block size (8K)
  – logbias={latency,throughput} – QoS hint
  – primarycache={none,metadata,all} – caching hint

# zfs create -o recordsize=8k -o logbias=throughput pool/datafiles
# zfs create -o recordsize=8k -o logbias=throughput pool/temp
# zfs create -o primarycache=metadata pool/archive
# zfs create pool/redo
# zfs list -o name,recordsize,logbias,primarycache
NAME            RECSIZE  LOGBIAS     PRIMARYCACHE
...
pool/archive       128K  latency     metadata
pool/datafiles       8K  throughput  all
pool/redo          128K  latency     all
pool/temp            8K  throughput  all

Page 10: ZFS for Databases

Inconsistent Write Latency

  microseconds  ------------- Distribution ------------- count
             8 |                                         0
            16 |                                         149
            32 |@@@@@@@@@@@@@@@@@@@@@                    8682
            64 |@@@@@                                    2226
           128 |@@@@                                     1743
           256 |@@                                       658
           512 |                                         95
          1024 |                                         20
          2048 |                                         19
          4096 |                                         122
          8192 |@@                                       744
         16384 |@@                                       865
         32768 |@@                                       625
         65536 |@                                        316
        131072 |                                         113
        262144 |                                         22
        524288 |                                         70
       1048576 |                                         94
       2097152 |                                         16
       4194304 |                                         0

Page 11: ZFS for Databases

Oracle Solaris ZFS Write Throttle

• Basic problem: limit rate of input to rate of output
• Originally no write throttle: consume all memory, then wait

• ZFS composes transactions into transaction groups
• Idea: limit the size of a transaction group
• Figure out the backend throughput; target a few seconds
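The sizing idea amounts to one line of arithmetic. A sketch in shell (the helper name is mine, not a ZFS interface): cap the transaction group at roughly a few seconds' worth of writes at the measured backend throughput.

```shell
# Hypothetical sketch: the transaction-group write limit is roughly the
# measured backend throughput multiplied by a target of a few seconds.
txg_write_limit_mb() {
  throughput_mb_s=$1   # measured backend throughput, MB/s
  target_s=$2          # target seconds of buffered writes
  echo $((throughput_mb_s * target_s))
}

txg_write_limit_mb 200 5   # 200 MB/s backend, 5 s target → 1000 MB
```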

Page 12: ZFS for Databases

ZFS Write Throttle Problems

• Transaction group full? Start writing it out
• One already being written out? Wait
• And it can be a looooong wait

• Solution?
  – When the transaction group is 7/8ths full, delay for 10ms
  – Didn't guess that, did you?

Page 13: ZFS for Databases

Let's Look Again

  microseconds  ------------- Distribution ------------- count
             8 |                                         0
            16 |                                         149
            32 |@@@@@@@@@@@@@@@@@@@@@                    8682
            64 |@@@@@                                    2226
           128 |@@@@                                     1743
           256 |@@                                       658
           512 |                                         95
          1024 |                                         20
          2048 |                                         19
          4096 |                                         122
          8192 |@@                                       744
         16384 |@@                                       865
         32768 |@@                                       625
         65536 |@                                        316
        131072 |                                         113
        262144 |                                         22
        524288 |                                         70
       1048576 |                                         94
       2097152 |                                         16
       4194304 |                                         0

Page 14: ZFS for Databases

Write Amplification

  microseconds   NFS writes (count)   IO writes (count)
            16                    0                   0
            32                   56                 259
            64                  118                 631
           128                   47                1024
           256                   13                5747
           512                   16                5421
          1024                 4172                4113
          2048                 9835                4890
          4096                  425                4528
          8192                  121                4311
         16384                  198                3334
         32768                 1158                1885
         65536                  957                 528
        131072                  110                  28
        262144                   31                   0
        524288                   25
       1048576                    0

              avg latency   iops
  NFS write   13231us       292/s
  IO write    8559us        622/s

Page 15: ZFS for Databases

Oracle Solaris ZFS Tuning

• IO queue depth: zfs_vdev_max_pending
  – Default of 10 – may be reasonable for spinning disks
  – ZFS on a SAN? 24 - 100
  – Higher for additional throughput
  – Lower for reduced latency

• Transaction group duration: zfs_txg_synctime
  – Default of 5 seconds
  – Higher for more metadata amortization
  – Lower for a smaller window for data loss with non-synced writes
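On Solaris these kernel tunables are typically set in /etc/system and take effect at the next boot. A fragment with purely illustrative values (the tunable names are from the slide; the values are not recommendations):

```
* /etc/system fragment -- illustrative values only
* Deeper IO queue for a SAN-backed pool
set zfs:zfs_vdev_max_pending = 32
* Shorter transaction groups to shrink the unsynced-data window
set zfs:zfs_txg_synctime = 2
```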

Page 16: ZFS for Databases

Back to the ZFS Write Throttle

• The measured IO throughput swings wildly:
• Many factors impact the measured IO throughput
• The wrong guess can lead to massive delays

# dtrace -n 'BEGIN { start = timestamp; }
    fbt::dsl_pool_sync:entry
    /stringof(args[0]->dp_spa->spa_name) == "domain0"/
    {
        @[(timestamp - start) / 1000000000] =
            min(args[0]->dp_write_limit / 1000000);
    }' -xaggsortkey
dtrace: description 'BEGIN' matched 2 probes
…
  14  487
  15  515
  16  515
  17  557
  18  581
  19  581
  20  617
  21  617
  22  635
  23  663
  24  663
…

Page 17: ZFS for Databases

OpenZFS I/O Scheduler

• Throw out the ZFS write throttle and IO queue
• Queue depth and throttle based on quantity of modified data

• Result: smooth, single-moded write latency

[Chart omitted: queue depth over time, x-axis 0 - 100 seconds, y-axis "Queue Depth" from 0 to 20.]
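A rough sketch of the dirty-data-based throttle in shell (my own simplification, not the actual OpenZFS code): each write is delayed by an amount that stays tiny while little data is dirty and grows steeply as dirty data approaches its limit, with an upper bound on any single delay.

```shell
# Simplified sketch of a dirty-data-based write delay, in microseconds.
# Function name and exact curve are illustrative, not the OpenZFS code;
# the parameters echo zfs_dirty_data_max / zfs_delay_scale / zfs_delay_max.
write_delay_us() {
  dirty=$1      # bytes of dirty (modified, unsynced) data
  max=$2        # dirty-data limit (cf. zfs_dirty_data_max)
  scale=$3      # curve steepness (cf. zfs_delay_scale)
  cap=$4        # bound on any single delay (cf. zfs_delay_max)
  if [ "$dirty" -ge "$max" ]; then
    echo "$cap"
    return
  fi
  # Hyperbolic curve: near zero when lightly dirty, steep near the limit.
  delay=$((scale * dirty / (max - dirty)))
  if [ "$delay" -gt "$cap" ]; then delay=$cap; fi
  echo "$delay"
}

write_delay_us 100 1000 500 100000   # lightly dirty → 55 us
write_delay_us 990 1000 500 100000   # nearly full  → 49500 us
```

The point of the smooth curve is that latency degrades gradually instead of flipping between "no delay" and "long stall", which is what produced the two-moded histograms earlier in the talk.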

Page 18: ZFS for Databases

OpenZFS I/O Scheduler Tuning

• Tunables that are easier to reason about
  – zfs_vdev_async_write_max_active (default: 10)
  – zfs_dirty_data_max (default: min(memory/10, 4GB))
  – zfs_delay_max_ns (default: 100µs)
  – zfs_delay_scale (delay curve; default: 500µs/op)
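As a sanity check on the default listed above, the zfs_dirty_data_max formula is easy to evaluate; a quick shell sketch (the helper name is mine, not a ZFS interface):

```shell
# Hypothetical helper: the zfs_dirty_data_max default is
# min(physical memory / 10, 4 GiB), here computed for a memory size in bytes.
dirty_data_max_default() {
  mem_bytes=$1
  cap=$((4 * 1024 * 1024 * 1024))   # 4 GiB ceiling
  tenth=$((mem_bytes / 10))
  if [ "$tenth" -lt "$cap" ]; then
    echo "$tenth"
  else
    echo "$cap"
  fi
}

dirty_data_max_default $((16 * 1024 * 1024 * 1024))   # 16 GiB RAM → 1717986918
dirty_data_max_default $((64 * 1024 * 1024 * 1024))   # 64 GiB RAM → 4294967296 (capped)
```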

Page 19: ZFS for Databases

Summing Up

• ZFS is great for databases
  – Storage Appliance, Oracle Solaris, OpenZFS
• Important best practices
• Beware the false RAID-Z idol
• Measure, measure, measure
  – DTrace is your friend (Wednesday 11:00am, Exchange 1)

Page 20: ZFS for Databases

Further Reading

• Oracle Solaris ZFS "Evil" Tuning Guide
  – www.solaris-cookbook.com/solaris/solaris-10-zfs-evil-tuning-guide/
• OpenZFS
  – www.open-zfs.org
• Oracle's tuning guide
  – docs.oracle.com/cd/E26505_01/html/E37386/chapterzfs-db1.html