zfs talk part 1

zfs to developers

Uploaded by steven-burgess on 22-Jun-2015


DESCRIPTION

A presentation about ZFS aimed at developers. Given at Datto; watch the talk here: https://www.youtube.com/watch?v=Wd6eacYeeJI

TRANSCRIPT

Page 1: ZFS Talk Part 1

zfs to developers

Page 2: ZFS Talk Part 1

zfs, a modern file system built for large-scale data integrity

Page 3: ZFS Talk Part 1

wikipedia to the rescue!

Page 4: ZFS Talk Part 1

NFS
Lustre
Open Office

Sun Microsystems

Page 5: ZFS Talk Part 1

They also made a large amount of hardware

Page 6: ZFS Talk Part 1
Page 7: ZFS Talk Part 1
Page 8: ZFS Talk Part 1
Page 9: ZFS Talk Part 1
Page 10: ZFS Talk Part 1
Page 11: ZFS Talk Part 1

http://zfsonlinux.org/docs/LUG12_ZFS_Lustre_for_Sequoia.pdf

Page 12: ZFS Talk Part 1

2005: integrated into the Solaris kernel

2008: first commit to ZoL

2010: illumos founded (45 commits to ZoL)

2013: OpenZFS (1174 commits to ZoL)

Page 13: ZFS Talk Part 1

-File systems should be large
-Storage media is not to be trusted
-Storage maintenance should be easy
-Disk storage should be more like ram

Page 14: ZFS Talk Part 1

File systems should be large

Our largest system was 144 TB of storage.

disks * capacity: 36 * 4 TB = 144 TB

ZFS can address hard drives so large they could not be stored on this planet.

Page 15: ZFS Talk Part 1

File systems should be large

ext4: 1 EXB
HFS+: 1 EXB
BTRFS: 16 EXB
zfs: 256 * 1024 EXB

Page 16: ZFS Talk Part 1

File systems should be large

who cares?

Page 17: ZFS Talk Part 1

Storage media is not to be trusted

-Spinning disks have a bit error rate
-Sometimes the head writes to the wrong place
-“Modern hard disks write so fast and so faint that they are only guessing that what they read is what you wrote”
-Cables go bad
-Cosmic rays (!!!)

Page 18: ZFS Talk Part 1

Storage media is not to be trusted

zfs overcomes these problems with checksumming. Every block is run through fletcher4 before it is written, and that checksum is combined with other metadata and written “far away” from the data when it is written out.

sha256 (future: Edon-R, Skein)
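
The checksum algorithm is a per-dataset property. A minimal sketch, assuming a pool named tank:

# fletcher4 is the default; sha256 can be set per dataset
zfs set checksum=sha256 tank
zfs get checksum tank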

Page 19: ZFS Talk Part 1

Storage media is not to be trusted

Checksum errors do not happen too often; they are usually just a great early warning that a drive is failing.

Page 20: ZFS Talk Part 1

Storage maintenance should be easy

zpool create name disks
zfs create filesystem
zfs set compression=off filesystem
zfs set sync=disabled filesystem
zpool status
zfs destroy
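
Strung together, the whole lifecycle is only a handful of commands. A minimal sketch, assuming two spare disks with the hypothetical names sdb and sdc:

# create a mirrored pool, carve out a filesystem, tune it, inspect it, tear it down
zpool create demo mirror sdb sdc
zfs create demo/data
zfs set compression=off demo/data
zfs set sync=disabled demo/data
zpool status demo
zfs destroy demo/data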

Page 21: ZFS Talk Part 1

Storage maintenance should be easy

Is it intuitive?

zfs snapshot
zfs send/receive
zfs create/destroy

Page 22: ZFS Talk Part 1

Storage maintenance should be easy

Is it intuitive?

zpool add vs. zpool attach
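
The naming is the trap: attach mirrors an existing device, while add grows the pool with a new top-level vdev and is effectively permanent. A hedged sketch with hypothetical disk names:

# attach: diskB becomes a mirror of diskA (adds redundancy)
zpool attach tank diskA diskB
# add: diskC becomes a new top-level vdev (adds capacity, no redundancy)
zpool add tank diskC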

Page 23: ZFS Talk Part 1

Storage maintenance should be easy

Is it easy?

I think so

Page 24: ZFS Talk Part 1

Disk storage should be more like ram

You should be able to open a computer up, throw some disks in there, and be running. Never need to mess with it, never need to tune it.

Page 25: ZFS Talk Part 1

Disk storage should be more like ram

FAIL

tuning is not recommended

Page 26: ZFS Talk Part 1

Disk storage should be more like ram

“Tuning is evil, yes, in the way that doing something against the will of the creator is evil”

Page 27: ZFS Talk Part 1

zfs sits above your hard drives and below your directories; it adds features you might like.

Page 28: ZFS Talk Part 1

zfs sits above your hard drives and below your directories; it adds features you might like.

data integrity

transparent compression (LZ4)

improved throughput

snapshotting

replication via snapshotting

speed via ARC

easy maintenance

choice in raid setup

Page 29: ZFS Talk Part 1

Command overview

zfs
zpool

zdb

Page 30: ZFS Talk Part 1

Command overview

zfs: every week
zpool: every month

zdb depends on the day

Page 31: ZFS Talk Part 1

Command overview

zfs: awesome man page
zpool: awesome man page

zdb meh...

Page 32: ZFS Talk Part 1

zpool create

Page 33: ZFS Talk Part 1

zpool create

zpool create tank -o ashift=12 -O compression=lz4 mirror ata-WDC_WD1002FAEX-00Y9A0_WD-WCAW32714185 ata-WDC_WD1002FAEX-00Z3A0_WD-WMATR0443468

Page 34: ZFS Talk Part 1
Page 35: ZFS Talk Part 1
Page 36: ZFS Talk Part 1
Page 37: ZFS Talk Part 1
Page 38: ZFS Talk Part 1
Page 39: ZFS Talk Part 1

Page 40: ZFS Talk Part 1

zpool create tank -o ashift=12 -O compression=lz4 mirror ata-WDC_WD1002FAEX-00Y9A0_WD-WCAW32714185 ata-WDC_WD1002FAEX-00Z3A0_WD-WMATR0443468

/dev/disk/by-id/ata-*

Page 41: ZFS Talk Part 1

/dev/disk/by-id/ata-*

http://zfsonlinux.org/faq.html#WhatDevNamesShouldIUseWhenCreatingMyPool

Page 42: ZFS Talk Part 1

zpool status

/home/sburgess > zpool status
  pool: tank
 state: ONLINE
  scan: scrub repaired 0 in 19h39m with 0 errors on Tue Jul 15 10:23:16 2014
config:

	NAME                                           STATE     READ WRITE CKSUM
	tank                                           ONLINE       0     0     0
	  mirror-0                                     ONLINE       0     0     0
	    ata-WDC_WD1002FAEX-00Y9A0_WD-WCAW32714185  ONLINE       0     0     0
	    ata-WDC_WD1002FAEX-00Z3A0_WD-WMATR0443468  ONLINE       0     0     0

Page 43: ZFS Talk Part 1

so far

/home/sburgess > zpool get all tank

NAME  PROPERTY  VALUE   SOURCE
tank  size      928G    -
tank  capacity  34%     -
tank  health    ONLINE  -

Page 44: ZFS Talk Part 1

so far

/home/sburgess > zfs get all tank

NAME  PROPERTY          VALUE                  SOURCE
tank  type              filesystem             -
tank  creation          Thu Jan  3 15:55 2013  -
tank  used              325G                   -
tank  available         589G                   -
tank  referenced        184K                   -
tank  compressratio     1.54x                  -
tank  mounted           yes                    -
tank  recordsize        128K                   default
tank  mountpoint        /tank                  default
tank  compression       lz4                    local
tank  sync              standard               default
tank  refcompressratio  1.00x                  -

Page 45: ZFS Talk Part 1

so far

/home/sburgess > ls /tank/

Page 46: ZFS Talk Part 1

zfs create

zfs create tank/home

Page 47: ZFS Talk Part 1

zfs create

zfs create -o mountpoint=/home/sburgess tank/home/sburgess

Page 48: ZFS Talk Part 1

zfs create

zfs create tank/home/sburgess/downloads
zfs create tank/home/sburgess/projects
zfs create tank/home/sburgess/tools

Page 49: ZFS Talk Part 1

zfs create

zfs create tank/home/sburgess/downloads
zfs create tank/home/sburgess/projects
zfs create tank/home/sburgess/tools

chown -R sburgess: /home/sburgess

Page 50: ZFS Talk Part 1

zfs create

zfs list -o name,refer,used,compressratio -r tank/home/sburgess

NAME                          REFER  USED   RATIO
tank/home/sburgess            4.37G  114G   1.73x
tank/home/sburgess/downloads  34.8G  36.0G  1.66x
tank/home/sburgess/projects   2.08G  11.7G  1.30x
tank/home/sburgess/tools      583M   635M   1.54x

Page 51: ZFS Talk Part 1

zfs create

mv Pictures pic
zfs create tank/home/sburgess/Pictures
chown -R sburgess: Pictures
mv pic/* Pictures

Page 52: ZFS Talk Part 1

zfs create

/home/sburgess > zfs list -o name,refer,used,compressratio -r tank/home/sburgess

NAME                          REFER  USED   RATIO
tank/home/sburgess            4.36G  114G   1.73x
tank/home/sburgess/Pictures   11.3M  11.3M  1.16x
tank/home/sburgess/downloads  34.8G  36.0G  1.66x
tank/home/sburgess/projects   2.08G  11.7G  1.30x
tank/home/sburgess/tools      583M   635M   1.54x

Page 53: ZFS Talk Part 1

zfs create

shopt -s dotglob
du -hs *

2.9G  .kde
1.3G  .cache

Page 54: ZFS Talk Part 1

uberblock

Page 55: ZFS Talk Part 1

uberblock

The root of the zfs hash tree

“A Merkle tree is a tree in which every non-leaf node is labelled with the hash of the labels of its children nodes.”
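
A toy shell illustration of that definition (not ZFS's actual on-disk format): a parent's label is the hash of its children's labels.

# hash two leaf files, then hash the concatenation to label the parent
h1=$(sha256sum child1.dat | cut -d' ' -f1)
h2=$(sha256sum child2.dat | cut -d' ' -f1)
printf '%s%s' "$h1" "$h2" | sha256sum | cut -d' ' -f1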

Page 56: ZFS Talk Part 1
Page 57: ZFS Talk Part 1

uberblock

zdb -u poolName

Page 58: ZFS Talk Part 1

uberblock

zdb -u test

Uberblock:
	magic = 0000000000bab10c
	version = 5000
	txg = 5
	guid_sum = 16411893724316372364
	timestamp = 1392754246 UTC = Tue Feb 18 15:10:46 2014

Page 59: ZFS Talk Part 1

uberblock

Uberblock:
	magic = 0000000000bab10c
	version = 5000
	txg = 5
	guid_sum = 16411893724316372364
	timestamp = 1392754246 UTC = Tue Feb 18 15:10:46 2014

… cat /dev/urandom > file …

Uberblock:
	magic = 0000000000bab10c
	version = 5000
	txg = 163
	guid_sum = 16411893724316372364
	timestamp = 1392755035 UTC = Tue Feb 18 15:23:55 2014

Page 60: ZFS Talk Part 1

uberblock

Uberblock:
	magic = 0000000000bab10c
	version = 5000
	txg = 163
	guid_sum = 16411893724316372364
	timestamp = 1392755035 UTC = Tue Feb 18 15:23:55 2014

… zpool attach pool disk1 disk2 …

Uberblock:
	magic = 0000000000bab10c
	version = 5000
	txg = 197
	guid_sum = 16865875370843337150
	timestamp = 1392755190 UTC = Tue Feb 18 15:26:30 2014

Page 61: ZFS Talk Part 1

uberblock

Go back in time via

zpool import -F
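
A hedged sketch of the recovery flow; when combined with -F, -n is a dry run that reports whether a rewind would succeed:

# rewind the pool to the most recent usable uberblock
zpool import -F -n tank
zpool import -F tank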

Page 62: ZFS Talk Part 1

snapshotting

Page 63: ZFS Talk Part 1

snapshotting

zfs snapshot tank/home/sburgess@now

Page 64: ZFS Talk Part 1

snapshotting

zfs list -o name,creation,used -t all -r tank/home/sburgess

Page 65: ZFS Talk Part 1
Page 66: ZFS Talk Part 1

What to do with snapshots

Page 67: ZFS Talk Part 1

.zfs directory

It is always there; whether or not it shows up in ls -a is controlled by

zfs set snapdir=hidden|visible filesystem

Page 68: ZFS Talk Part 1

.zfs directory

Contains .zfs/snapshot, which has a directory for each snapshot. When you access one of those directories, the snapshot is temporarily mounted read-only there.

Page 69: ZFS Talk Part 1

.zfs directory

Use case:

-Test if/when a file was created

-Easily restore a file or two (see the sketch below); for large, complicated restores, use clone.
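
A minimal restore sketch, assuming a snapshot named now and a hypothetical file notes.txt:

# snapshots are browsable like ordinary read-only directories
ls /home/sburgess/.zfs/snapshot/now/
cp /home/sburgess/.zfs/snapshot/now/notes.txt /home/sburgess/notes.txt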

Page 70: ZFS Talk Part 1

zfs rollback

zfs rollback tank/home/sburgess@then

The target should be the most recent snapshot, but you can use -r to roll back further.

Page 71: ZFS Talk Part 1

zfs rollback

Use case:

Being too bold with tar -x

Page 72: ZFS Talk Part 1

zfs clone

zfs clone tank/home/sburgess@now tank/other

tank/other is a read/write, snapshottable, cloneable file system

Initially it shares all blocks with the parent, takes 0 space, and amplifies ARC hits

Page 73: ZFS Talk Part 1

zfs clone

Use case:

Virtual Machine base images

All configs, modules, programs and OS data shared
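
A hedged sketch with hypothetical dataset names: one golden base image backs any number of cheap clones.

# every clone shares the base's blocks until it diverges
zfs snapshot tank/images/base@golden
zfs clone tank/images/base@golden tank/images/vm1
zfs clone tank/images/base@golden tank/images/vm2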

Page 74: ZFS Talk Part 1

zfs clone

zfs clone -o readonly=on -o mountpoint=/tmp/ro tank/home/sburgess@now tank/other

Page 75: ZFS Talk Part 1

zfs clone

-safe (readonly)
-0 time
-0 space

Page 76: ZFS Talk Part 1

zfs clone

Use case:

-large file restore
-diffing files across both

Page 77: ZFS Talk Part 1

zfs clone

What clones of this snapshot exist?
zfs get clones filesystem@snapshot

What snapshot was this filesystem cloned from?
zfs get origin filesystem

Page 78: ZFS Talk Part 1

a note on -

“-” is zfs's none/null/not applicable

zfs get clones tank
NAME  PROPERTY  VALUE  SOURCE
tank  clones    -      -

zfs get origin tank@now
NAME      PROPERTY  VALUE  SOURCE
tank@now  origin    -      -

Page 79: ZFS Talk Part 1

a note on -

“-” is zfs's none/null/not applicable

zpool get version

NAME  PROPERTY  VALUE  SOURCE
tank  version   -      default

Page 80: ZFS Talk Part 1

a note on 5000

zpool version numbers no longer increase as features are added; version 5000 is a sentinel value meaning the pool uses feature flags instead
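
With feature flags, capabilities show up as individual feature@ properties rather than a version bump. One way to list them, assuming the tank pool from earlier:

zpool get all tank | grep feature@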

Page 81: ZFS Talk Part 1

zfs send

Page 82: ZFS Talk Part 1

zfs send

Original idea:

Send the changes I made today across the ocean

Page 83: ZFS Talk Part 1

zfs send

Create a file detailing the changes that need to be made to transition a filesystem from one snapshot to another.

Page 84: ZFS Talk Part 1

zfs send

zfs send is a dictation, not a conversation

Page 85: ZFS Talk Part 1

zfs send

zpool create -O compression=off -O copies=2 -o ashift=12
zpool create -O compression=lz4 -O checksum=sha256 -o ashift=9

Page 86: ZFS Talk Part 1

zfs send

zfs send tank/currr@1387825261
Error: Stream can not be written to a terminal.
You must redirect standard output.

Page 87: ZFS Talk Part 1

zfs send

-n

-v

Page 88: ZFS Talk Part 1

zfs send

zfs send -n -v tank/home/sburgess@now

Page 89: ZFS Talk Part 1

zfs send

zfs send -n -v tank/home/sburgess@now
send from @ to tank/home/sburgess@now
total estimated size is 9.22G

Page 90: ZFS Talk Part 1

zfs send

zfs send tank/home/sburgess@now

What does this send? What does it create when it's received?

Page 91: ZFS Talk Part 1

zfs send

zfs send tank/home/sburgess@now

It sends a “full” filesystem: everything that is needed to create tank/home/sburgess@now

The receiving side gets a new FS with a single snapshot named now
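
In practice the stream is piped straight into zfs receive. A minimal sketch, assuming a destination pool named backup and, for the remote case, a hypothetical host backuphost:

# local receive
zfs send tank/home/sburgess@now | zfs receive backup/sburgess
# the original idea: the same dictation works across an ocean
zfs send tank/home/sburgess@now | ssh backuphost zfs receive backup/sburgess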

Page 92: ZFS Talk Part 1

zfs send

zfs send can be used with the -i and -I options to send incremental changes: only the blocks that changed between the first and second snapshots are sent.

Page 93: ZFS Talk Part 1

zfs send

-i do not send intermediate snapshots

-I send intermediate snapshots

Page 94: ZFS Talk Part 1

zfs send

-i do not send intermediate snapshots

-I send intermediate snapshots

zfs send -I early file/system/path@late
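
The same pipe works for incrementals. A hedged sketch, assuming @early and @late exist on the source and the backup dataset from before already has @early:

zfs send -I @early tank/home/sburgess@late | zfs receive backup/sburgess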

Page 95: ZFS Talk Part 1

zfs get vs zfs list

Page 96: ZFS Talk Part 1

zfs get vs zfs list

When working interactively use zfs list

zfs list -t all -o name,written,used,mounted

NAME                                 WRITTEN  USED   MOUNTED
tank/home/sburgess/tools@1387825261  0        0      -
tank/images                          590M     8.82G  no
tank/images@base                     8.25G    369M   -
tank/other                           8K       8K     yes
tank/trick                           0        136K   yes

Page 97: ZFS Talk Part 1

zfs get vs zfs list

zfs list is the same as

zfs list -o name,used,avail,refer,mountpoint

Page 98: ZFS Talk Part 1

zfs get vs zfs list

zfs list is the same as

zfs list -o name,used,avail,refer,mountpoint ^^^^

Page 99: ZFS Talk Part 1

zfs get vs zfs list

zfs list | grep/awk/??

Page 100: ZFS Talk Part 1

zfs get vs zfs list

When looking at an FS or snapshot, I call

zfs get all item | less

Page 101: ZFS Talk Part 1

zfs get vs zfs list

For programmatic use, use zfs get -H -p

zfs get used tank

NAME  PROPERTY  VALUE  SOURCE
tank  used      484G   -

zfs get used -o value -H -p tank

519265562624
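
That parsable form drops headers and rounds nothing, so it is safe to consume from a script. A minimal sketch:

# -H: no headers, tab-separated; -p: exact machine-parsable numbers
used=$(zfs get -H -p -o value used tank)
echo "tank uses $used bytes"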

Page 102: ZFS Talk Part 1

Learn more

read the zpool man page

read the zfs man page

subscribe to the ZoL mailing list, and just read new messages as they come in