ZFS Deep Dive

Transcript
Page 1: ZFS Deep Dive

Solaris 10 Deep Dive: ZFS

Bob Netherton
Technical Specialist, Solaris Adoption
Sun Microsystems, Inc.
http://blogs.sun.com/bobn

Page 2: ZFS Deep Dive

• What is ZFS?
• Why a new file system?
• What's different about it?
• What can I do with it?
• How much does it cost?
• Where does ZFS go from here?

Page 3: ZFS Deep Dive

What is ZFS?
A new way to manage data

• The world's first 128-bit file system
• With checksumming and copy-on-write transactions
• Pooled storage model – no volume manager
• Especially architected for speed

What it delivers:
• End-to-end data integrity
• Huge performance gains
• Easier administration
• Immense data capacity
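To put "128-bit" in perspective, a quick back-of-the-envelope calculation (mine, not from the slides) compares the address space against a 64-bit file system:

```python
# Bytes addressable by a 64-bit vs a 128-bit file system.
max_64  = 2 ** 64    # ~1.8e19 bytes (about 18.4 exabytes)
max_128 = 2 ** 128   # ~3.4e38 bytes

print(f"64-bit  limit: {max_64:.3e} bytes")
print(f"128-bit limit: {max_128:.3e} bytes")

# The 128-bit space is 2**64 times larger than the 64-bit space.
ratio = max_128 // max_64
print(f"ratio: 2**{ratio.bit_length() - 1}")
```

The slide's claim of "no practical limitations" follows directly: the 128-bit limit exceeds any physically buildable storage system.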

Page 4: ZFS Deep Dive

Why a New File System?

• Data management costs are high
• The value of data is becoming even more critical
• The amount of storage is ever-increasing

Page 5: ZFS Deep Dive

Trouble with Existing File Systems?
Good for the time they were designed, but...

• No defense against silent data corruption – any defect in the data path can corrupt data, undetected
• Difficult to administer – need a volume manager; volumes, labels, partitions, provisioning, and lots of limits
• Older/slower data management techniques – fat locks, fixed block size, naive pre-fetch, dirty region logging

Page 6: ZFS Deep Dive

ZFS Design Principles

• Start with a new design around today's requirements
• Pooled storage
  > Eliminate the notion of volumes
  > Do for storage what virtual memory did for RAM
• End-to-end data integrity
  > Historically considered too expensive
  > Now, data is too valuable not to protect
• Transactional operation
  > Maintain consistent on-disk format
  > Reorder transactions for performance – big performance win

Page 7: ZFS Deep Dive

Evolution of Disks and Volumes

• Initially, we had simple disks
• Abstraction of disks into volumes to meet requirements
• Industry grew around HW / SW volume management

[Diagram: a file system atop a volume manager in three configurations – Upper 1GB + Lower 1GB concatenated into 2GB; Even 1GB + Odd 1GB striped into 2GB; Left 1GB + Right 1GB mirrored into 1GB]

Page 8: ZFS Deep Dive

FS/Volume Model vs. ZFS

• Traditional Volumes
  > 1:1 FS to volume
  > Grow / shrink by hand
  > Limited bandwidth
  > Storage fragmented

• ZFS Pooled Storage
  > No partitions / volumes
  > Grow / shrink automatically
  > All bandwidth always available
  > All storage in pool is shared

Page 9: ZFS Deep Dive

FS / Volume Model vs. ZFS

FS / Volume I/O Stack:
• FS to Volume
  > Block device interface
  > Write a block, write a block, ...
  > Loss of power = loss of consistency
  > Workaround: journaling – slow & complex
• Volume to Disk
  > Block device interface
  > Write each block to each disk immediately to sync mirrors
  > Loss of power = resync
  > Synchronous & slow

ZFS I/O Stack:
• ZFS to Data Management Unit (DMU)
  > Object-based transactions
  > "Make these changes to these objects"
  > All or nothing
• DMU to Storage Pool
  > Transaction group commit
  > All or nothing
  > Always consistent on disk
  > Journal not needed
• Storage Pool to Disk
  > Schedule, aggregate, and issue I/O at will – runs at platter speed
  > No resync if power lost

Page 10: ZFS Deep Dive

DATA INTEGRITY

Page 11: ZFS Deep Dive

ZFS Data Integrity Model

• Everything is copy-on-write
  > Never overwrite live data
  > On-disk state always valid – no fsck
• Everything is transactional
  > Related changes succeed or fail as a whole
  > No need for journaling
• Everything is checksummed
  > No silent corruption
  > No panics from bad metadata
• Enhanced data protection
  > Mirrored pools, RAID-Z, disk scrubbing

Page 12: ZFS Deep Dive

Copy-on-Write and Transactional

[Diagram, four stages: (1) initial block tree rooted at the uber-block; (2) writes create copies of the changed data blocks; (3) copy-on-write of the indirect blocks produces new pointers alongside the original pointers; (4) the uber-block is rewritten to point at the new tree]
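The four stages above can be sketched in a few lines. This is a deliberately simplified model (immutable tuples standing in for on-disk blocks, not real ZFS data structures): updating one leaf copies the leaf and every block on the path to the root, while the old tree stays intact until the new uber-block is published.

```python
# Simplified copy-on-write: blocks are immutable tuples. Writing a leaf
# allocates a new leaf, a new indirect block pointing at it, and a new
# uber-block -- nothing is ever modified in place.

def cow_write(block, path, value):
    """Return a new root with the leaf at `path` replaced by `value`."""
    if not path:
        return value                      # leaf level: brand-new data block
    i = path[0]
    # Copy this indirect block, replacing only the child on the write path.
    return block[:i] + (cow_write(block[i], path[1:], value),) + block[i + 1:]

old_uber = (("A", "B"), ("C", "D"))       # initial block tree
new_uber = cow_write(old_uber, (1, 0), "C'")

print("old:", old_uber)                   # still fully valid -- a free snapshot
print("new:", new_uber)
```

Note that the untouched left subtree is shared, not copied (`new_uber[0] is old_uber[0]`) – which is also why snapshots on a later slide cost almost nothing.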

Page 13: ZFS Deep Dive

End-to-End Checksums

• Checksums are separated from the data
• Entire I/O path is self-validating (uber-block)
• Prevents:
  > Silent data corruption
  > Panics from corrupted metadata
  > Phantom writes
  > Misdirected reads and writes
  > DMA parity errors
  > Errors from driver bugs
  > Accidental overwrites

Page 14: ZFS Deep Dive

Self-Healing Data

ZFS can detect bad data using checksums and "heal" the data using its mirrored copy:

[Diagram, three panels: the application's read detects bad data via checksum; ZFS gets the good data from the mirror; ZFS "heals" the bad copy and returns good data to the application]
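The self-healing sequence is easy to model. This sketch (mine, using SHA-256 in place of ZFS's configurable checksum) stores the checksum apart from the data, as the previous slide describes, then detects and repairs a silently corrupted mirror copy on read:

```python
import hashlib

def sha(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

good = b"important data"
checksum = sha(good)                  # kept in the parent block pointer,
                                      # separate from the data itself
mirror = [good, bytearray(good)]      # two mirrored copies on "disk"
mirror[1][0] ^= 0xFF                  # silently corrupt one copy

def self_healing_read(mirror, checksum):
    """Return verified data, rewriting any copy that fails its checksum."""
    for copy in mirror:
        if sha(bytes(copy)) == checksum:          # found a good copy
            for j, other in enumerate(mirror):
                if sha(bytes(other)) != checksum:
                    mirror[j] = bytes(copy)       # heal the bad copy
            return bytes(copy)
    raise IOError("all copies corrupt")

data = self_healing_read(mirror, checksum)
```

After the read, both mirror halves hold good data again – the corruption never reaches the application.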

Page 15: ZFS Deep Dive


Disk Scrubbing

• Uses checksums to verify the integrity of all the data

• Traverses metadata to read every copy of every block

• Finds latent errors while they're still correctable

• It's like ECC memory scrubbing – but for disks

• Provides fast and reliable re-silvering of mirrors

Page 16: ZFS Deep Dive

RAID-Z Protection: RAID-5 and More

• ZFS provides better than RAID-5 availability
  > Copy-on-write approach solves historical problems
• Striping uses dynamic widths
  > Each logical block is its own stripe
• All writes are full-stripe writes
  > Eliminates read-modify-write (so it's fast!)
• Eliminates the RAID-5 "write hole"
  > No need for NVRAM
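The core single-parity idea behind RAID-5 and single-parity RAID-Z is plain XOR. A toy sketch (mine – real RAID-Z adds dynamic stripe widths, variable parity, and checksum-driven reconstruction on top of this):

```python
from functools import reduce

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def full_stripe_write(chunks):
    """Compute parity and emit data + parity in one full-stripe write --
    no read-modify-write of existing parity is needed."""
    parity = reduce(xor, chunks)
    return list(chunks) + [parity]

def reconstruct(stripe, lost):
    """Rebuild the chunk at index `lost` by XORing all the survivors."""
    return reduce(xor, (c for i, c in enumerate(stripe) if i != lost))

stripe = full_stripe_write([b"AAAA", b"BBBB", b"CCCC"])
print(reconstruct(stripe, 1))   # recovers b"BBBB" from the other chunks
```

Because ZFS always writes the whole stripe (data plus parity) transactionally, the parity can never be out of sync with the data – which is exactly the RAID-5 "write hole" the slide says is eliminated.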

Page 17: ZFS Deep Dive


Immense Data Capacity

• 128-bit File System

• No Practical Limitations on File Size, Directory Entries, etc.

• All metadata is dynamic

• Concurrent Everything

Page 18: ZFS Deep Dive

EASIER ADMINISTRATION

Page 19: ZFS Deep Dive

Easier Administration

• Pooled storage design makes for easier administration – no need for a volume manager!
• Straightforward commands and a GUI
  > Snapshots & clones
  > Quotas & reservations
  > Compression
  > Pool migration
  > ACLs for security

Page 20: ZFS Deep Dive

No More Volume Manager!
Automatically add capacity to the shared storage pool

[Diagram: Applications 1–3, each on its own ZFS file system, all drawing from a single storage pool]

Page 21: ZFS Deep Dive

ZFS File Systems Are Hierarchical

• File system properties are inherited

• Inheritance makes administration a snap

• File systems become control points

• Manage logically related file systems as a group

Page 22: ZFS Deep Dive

Create ZFS Pools and File Systems

• Create a ZFS pool consisting of two mirrored drives
  # zpool create tank mirror c9t42d0 c13t11d0
  # df -h -F zfs
  Filesystem   size  used  avail  capacity  Mounted on
  tank          33G    1K    33G        1%  /tank

• Create home directory file system
  # zfs create tank/home
  # zfs set mountpoint=/export/home tank/home
  # df -h -F zfs
  Filesystem   size  used  avail  capacity  Mounted on
  tank          33G   24K    33G        1%  /tank
  tank/home     33G   27K    33G        1%  /export/home

Page 23: ZFS Deep Dive

Create ZFS Pools and File Systems

• Create home directories for users
  # zfs create tank/home/ahrens
  # zfs create tank/home/bonwick
  # zfs create tank/home/billm
  # df -h -F zfs
  Filesystem          size  used  avail  capacity  Mounted on
  tank                 33G   24K    33G        1%  /tank
  tank/home            33G   27K    33G        1%  /export/home
  tank/home/ahrens     33G   24K    33G        1%  /export/home/ahrens
  tank/home/bonwick    33G   24K    33G        1%  /export/home/bonwick
  tank/home/billm      33G   24K    33G        1%  /export/home/billm

• Add space to the pool
  # zpool add tank mirror c9t43d0 c13t12d0
  # df -h -F zfs
  Filesystem   size  used  avail  capacity  Mounted on
  tank          66G   24K    66G        1%  /tank
  tank/home     66G   27K    66G        1%  /export/home

Page 24: ZFS Deep Dive

Quotas and Reservations

• To control pooled storage usage, administrators can set a quota or reservation on a per-file-system basis
  # df -h -F zfs
  Filesystem          size  used  avail  capacity  Mounted on
  tank/home            66G   28K    66G        1%  /export/home
  tank/home/ahrens     66G   24K    66G        1%  /export/home/ahrens
  tank/home/bonwick    66G   24K    66G        1%  /export/home/bonwick
  # zfs set quota=10g tank/home/ahrens
  # zfs set reservation=20g tank/home/bonwick
  # df -h -F zfs
  Filesystem          size  used  avail  capacity  Mounted on
  tank/home            66G   28K    46G        1%  /export/home
  tank/home/ahrens     10G   24K    10G        1%  /export/home/ahrens
  tank/home/bonwick    66G   24K    66G        1%  /export/home/bonwick
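The difference between the two controls shows up in the avail column above: a quota caps how much a file system may use, while a reservation guarantees it space by shrinking what everyone else sees. A toy model of that arithmetic (my own simplification, ignoring used space and overlapping reservations):

```python
POOL_FREE = 66            # GB free in the pool, as in the df output
BONWICK_RESERVATION = 20  # GB reserved for tank/home/bonwick

def avail(quota, pool_free, reserved_elsewhere):
    """GB a file system reports as available."""
    outside = pool_free - reserved_elsewhere   # space not promised to others
    return min(quota, outside) if quota else outside

print(avail(10,   POOL_FREE, BONWICK_RESERVATION))  # ahrens: capped at its 10G quota
print(avail(None, POOL_FREE, BONWICK_RESERVATION))  # home: 66 - 20 = 46G
print(avail(None, POOL_FREE, 0))                    # bonwick: its own reservation
                                                    # still counts for itself -> 66G
```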

Page 25: ZFS Deep Dive

File System Attributes

• Attributes are set for the file system and inherited by child file systems in the tree
  # zfs set compression=on tank
  # zfs set sharenfs=rw tank/home
  # zfs get all tank
  NAME  PROPERTY       VALUE                  SOURCE
  tank  type           filesystem             -
  tank  creation       Fri Sep  1  9:38 2006  -
  tank  used           20.0G                  -
  tank  available      46.4G                  -
  tank  compressratio  1.00x                  -
  tank  mounted        yes                    -
  tank  quota          none                   default
  tank  reservation    none                   default
  tank  recordsize     128K                   default
  tank  mountpoint     /tank                  default
  tank  sharenfs       off                    default
  tank  compression    on                     local
  tank  atime          on                     default
  ...
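The SOURCE column in the `zfs get` output hints at how inheritance is resolved: a locally set property wins, otherwise the value comes from the nearest ancestor that set it, otherwise the default applies. A sketch of that lookup (my own model, mirroring but not reproducing the real implementation):

```python
DEFAULTS = {"compression": "off", "sharenfs": "off"}

# Locally set properties per dataset, as in the slide's commands.
local = {
    "tank": {"compression": "on"},
    "tank/home": {"sharenfs": "rw"},
    "tank/home/ahrens": {},
}

def get(dataset, prop):
    """Return (value, source) for a property, walking up the tree."""
    if prop in local.get(dataset, {}):
        return local[dataset][prop], "local"
    if "/" in dataset:                               # ask the parent
        value, source = get(dataset.rsplit("/", 1)[0], prop)
        if source != "default":
            return value, "inherited"
        return value, "default"
    return DEFAULTS[prop], "default"

print(get("tank/home/ahrens", "compression"))   # ('on', 'inherited')
print(get("tank", "compression"))               # ('on', 'local')
```

This is why the slide calls file systems "control points": setting one property on `tank` changes behavior for the whole subtree below it.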

Page 26: ZFS Deep Dive

ZFS Snapshots

• Provide a read-only point-in-time copy of a file system
• Copy-on-write makes them essentially "free"
• Very space efficient – only changes are tracked
• And instantaneous – a snapshot simply keeps the old uber-block instead of freeing it

[Diagram: the snapshot uber-block retains the original block tree while the new uber-block points at the current data]

Page 27: ZFS Deep Dive

ZFS Snapshots

• Simple to create and roll back
  # zfs list -r tank
  NAME               USED  AVAIL  REFER  MOUNTPOINT
  tank              20.0G  46.4G  24.5K  /tank
  tank/home         20.0G  46.4G  28.5K  /export/home
  tank/home/ahrens  24.5K  10.0G  24.5K  /export/home/ahrens
  tank/home/billm   24.5K  46.4G  24.5K  /export/home/billm
  tank/home/bonwick 24.5K  66.4G  24.5K  /export/home/bonwick

  # zfs snapshot tank/home/billm@s1
  # zfs list -r tank/home/billm
  NAME                 USED  AVAIL  REFER  MOUNTPOINT
  tank/home/billm     24.5K  46.4G  24.5K  /export/home/billm
  tank/home/billm@s1      0      -  24.5K  -

  # cat /export/home/billm/.zfs/snapshot/s1/foo.c
  # zfs rollback tank/home/billm@s1
  # zfs destroy tank/home/billm@s1

Page 28: ZFS Deep Dive

ZFS Clones

• A clone is a writable copy of a snapshot
  > Created instantly, unlimited number
• Perfect for "read-mostly" file systems – source directories, application binaries and configuration, etc.

  # zfs list -r tank/home/billm
  NAME                 USED  AVAIL  REFER  MOUNTPOINT
  tank/home/billm     24.5K  46.4G  24.5K  /export/home/billm
  tank/home/billm@s1      0      -  24.5K  -

  # zfs clone tank/home/billm@s1 tank/newbillm

  # zfs list -r tank/home/billm tank/newbillm
  NAME                 USED  AVAIL  REFER  MOUNTPOINT
  tank/home/billm     24.5K  46.4G  24.5K  /export/home/billm
  tank/home/billm@s1      0      -  24.5K  -
  tank/newbillm           0  46.4G  24.5K  /tank/newbillm

Page 29: ZFS Deep Dive

ZFS Send / Receive (Backup / Restore)

• Backup and restore ZFS snapshots
  > Full backup of any snapshot
  > Incremental backup of differences between snapshots

• Create full backup of a snapshot
  # zfs send tank/fs@snap1 > /backup/fs-snap1.zfs

• Create incremental backup
  # zfs send -i tank/fs@snap1 tank/fs@snap2 > /backup/fs-diff1.zfs

• Replicate a ZFS file system remotely
  # zfs send -i tank/fs@11:31 tank/fs@11:32 | ssh host zfs receive -d /tank/fs
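Why are incremental streams so small? A full stream carries the whole snapshot; an incremental stream carries only the blocks that changed between the two snapshots. A toy model of that idea (mine – the real stream format works at the block level, not on file dictionaries):

```python
# Two point-in-time "snapshots" of a tiny file system.
snap1 = {"a.txt": b"hello", "b.txt": b"world"}
snap2 = {"a.txt": b"hello", "b.txt": b"world!", "c.txt": b"new"}

def incremental_stream(old, new):
    """Only what changed between the snapshots: modified/added entries
    plus a list of deletions."""
    changed = {k: v for k, v in new.items() if old.get(k) != v}
    deleted = [k for k in old if k not in new]
    return changed, deleted

def receive(base, stream):
    """Apply an incremental stream on top of the base snapshot."""
    changed, deleted = stream
    out = {**base, **changed}
    for k in deleted:
        out.pop(k, None)
    return out

restored = receive(snap1, incremental_stream(snap1, snap2))
print(restored == snap2)   # the receiver reconstructs snap2 exactly
```

Piping such a stream through `ssh`, as on the slide, gives cheap remote replication: after the first full send, each update ships only the deltas.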

Page 30: ZFS Deep Dive

Storage Pool Migration

• "Adaptive endian-ness" – hosts always write in their native byte order
• On opposite-endian systems, write and copy operations will eventually byte-swap all data
• Config data is stored within the data – when the data moves, so does its config info
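The mechanism is simple to sketch: each block carries a known magic number in the writer's native byte order, so a reader that sees the magic byte-swapped knows the block came from an opposite-endian host and swaps on the fly. A minimal illustration (using ZFS's real uber-block magic, 0x00bab10c – "oo-ba-bloc" – but an invented two-field block layout):

```python
import struct

MAGIC = 0x00BAB10C  # ZFS uber-block magic number

def write_block(value, byteorder):
    """Stamp a block in the writing host's native order ('<' or '>')."""
    return struct.pack(byteorder + "II", MAGIC, value)

def read_block(raw):
    """Detect the writer's byte order via the magic, swapping if needed."""
    magic, value = struct.unpack("<II", raw)
    if magic != MAGIC:                     # written on an opposite-endian host
        magic, value = struct.unpack(">II", raw)
    assert magic == MAGIC, "corrupt block"
    return value

big = write_block(42, ">")   # written on a big-endian host (e.g. SPARC)
print(read_block(big))       # an x86 reader still recovers 42
```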

Page 31: ZFS Deep Dive

ZFS Data Migration

• Host-neutral on-disk format
  > Move data from SPARC to x86 transparently
  > Data always written in native format; reads reformat data if needed
• ZFS pools may be moved from host to host
  > ZFS handles device ids & paths, mount points, etc.

• Export pool from original host
  source# zpool export tank
• Import pool on new host
  destination# zpool import tank

Page 32: ZFS Deep Dive

Data Compression

• Reduces the amount of disk space used
• Reduces the amount of data transferred to disk – increasing data throughput
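The throughput point is worth dwelling on: compressed blocks mean fewer bytes cross the I/O path. A quick illustration with zlib (illustration only – ZFS's built-in algorithm of this era, lzjb, is a different and much faster scheme):

```python
import zlib

data = b"highly compressible log line\n" * 1000
compressed = zlib.compress(data)

print(f"raw: {len(data)} B, compressed: {len(compressed)} B, "
      f"ratio: {len(data) / len(compressed):.1f}x")
print(zlib.decompress(compressed) == data)   # lossless round trip
```

For compressible workloads like the one above, the bytes actually written to disk shrink dramatically, which is why compression can increase effective throughput rather than cost it.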

Page 33: ZFS Deep Dive

Data Security: ACLs and Checksums

• ACLs based on NFSv4 – NT style
  > Full allow / deny semantics with inheritance
  > Fine-grained privilege control model (17 attributes)
• The uber-block checksum can serve as a digital signature for the entire file system
  > 256-bit, military-grade checksum (SHA-256) available
• Encrypted file system support coming soon
• Secure deletion (scrubbing) coming soon

Page 34: ZFS Deep Dive

ZFS and Zones

• Two great tastes that go great together
  > You've got ZFS data in my zone!
  > Hey, you've got your zone on my ZFS!
• ZFS datasets (pools or file systems) can be delegated to zones
  > Zone administrator controls contents of the dataset
• Zoneroot may (soon) be placed on ZFS
  > Separate ZFS file system per zone
  > Snapshots and clones make zone creation fast

Page 35: ZFS Deep Dive

ZFS Pools and Zones

[Diagram: pool "tank" in the global zone, with datasets tank/a, tank/b, and tank/c delegated to Zone A, Zone B, and Zone C respectively]

Page 36: ZFS Deep Dive

Framework for Examples

• Zones
  > z1 – sparse root, zoneroot on ZFS
  > z2 – full root, zoneroot on ZFS
  > z4 – sparse root, zoneroot on UFS
• ZFS Pools & File Systems
  > p1 – mirrored ZFS pool, mounted as /zones
  > p2 – mirrored ZFS pool, mounted as /p2
  > p3 – unmirrored ZFS pool, mounted as /p3

Page 37: ZFS Deep Dive

Adding ZFS as a Mounted File System

• Mount a ZFS file system into a zone like any other loopback file system
• Must set mountpoint to legacy so that the zone manages the mount

  # zfs create p2/z1a
  # zfs set mountpoint=legacy p2/z1a
  # zonecfg -z z1
  zonecfg:z1> add fs
  zonecfg:z1:fs> set type=zfs
  zonecfg:z1:fs> set dir=/z1a
  zonecfg:z1:fs> set special=p2/z1a
  zonecfg:z1:fs> end
  zonecfg:z1> verify
  zonecfg:z1> commit
  zonecfg:z1> exit

Page 38: ZFS Deep Dive

Adding ZFS as a Delegated File System

• Delegate a ZFS dataset to a zone
  > Zone administrator manages file systems within the zone

  # zfs create p2/z1b
  # mkdir /zones/z1/root/z1b
  # zonecfg -z z1
  zonecfg:z1> add dataset
  zonecfg:z1:dataset> set name=p2/z1b
  zonecfg:z1:dataset> end
  zonecfg:z1> commit
  zonecfg:z1> exit
  # zoneadm -z z1 boot
  # zlogin z1 df -h
  Filesystem  size  used  avail  capacity  Mounted on
  p2/z1b       12G   24K    12G        1%  /p2/z1b
  # zlogin z1 zfs list
  NAME     USED  AVAIL  REFER  MOUNTPOINT
  p2       136K  11.5G  25.5K  /p2
  p2/z1b  24.5K  11.5G  24.5K  /p2/z1b

Page 39: ZFS Deep Dive

zoned Property for a ZFS File System

• Once a file system is delegated to a zone, the zoned property is set
• If set, the file system can no longer be managed from the global zone
  > The zone admin might have changed things in incompatible ways (mountpoint, for example)

Page 40: ZFS Deep Dive

Zoneroot on ZFS (Soon)

  # cat z5.conf
  create
  set zonepath=/zones/z5
  set autoboot=false
  add net
  set address=192.168.100.1/25
  set physical=nge0
  end
  commit
  # zonecfg -z z5 -f z5.conf
  # zoneadm -z z5 install
  A ZFS file system has been created for this zone.
  Preparing to install zone <z5>.
  Creating list of files to copy from the global zone.
  Copying <2587> files to the zone.
  Initializing zone product registry.
  Determining zone package initialization order.
  Preparing to initialize <957> packages on the zone.
  Initialized <957> packages on zone.
  Zone <z5> is initialized.

Page 41: ZFS Deep Dive

Zoneroot on ZFS (Soon)

  # zfs list
  NAME    USED  AVAIL  REFER  MOUNTPOINT
  p1     3.44G  8.06G    38K  /zones
  p1/z5  81.1M  8.06G  81.1M  /zones/z5
  # zlogin z5 zfs list
  no datasets available
  # zfs set quota=500m p1/z5
  # zfs list
  NAME    USED  AVAIL  REFER  MOUNTPOINT
  p1     3.45G  8.06G    38K  /zones
  p1/z5  81.1M   419M  81.1M  /zones/z5
  # zfs set reservation=500m p1/z5
  # zfs list
  NAME    USED  AVAIL  REFER  MOUNTPOINT
  p1     3.45G  7.65G    38K  /zones
  p1/z5  81.1M   419M  81.1M  /zones/z5

Page 42: ZFS Deep Dive

Cloning Zones with ZFS

  # zfs list
  NAME    USED  AVAIL  REFER  MOUNTPOINT
  p1     3.37G  8.14G    36K  /zones
  p1/z1   127M  8.14G   127M  /zones/z1
  p1/z2  3.24G  8.14G  3.24G  /zones/z2
  # cp z2.conf z3.conf
  <make changes necessary for z3 identity>
  # zonecfg -z z3 -f z3.conf
  # zoneadm -z z3 clone z2
  Cloning snapshot p1/z2@SUNWzone1
  Instead of copying, a ZFS clone has been created for this zone.
  # zfs list
  NAME              USED  AVAIL  REFER  MOUNTPOINT
  p1               3.37G  8.14G    37K  /zones
  p1/z1             127M  8.14G   127M  /zones/z1
  p1/z2            3.24G  8.14G  3.24G  /zones/z2
  p1/z2@SUNWzone1  94.5K      -  3.24G  -
  p1/z3             116K  8.14G  3.24G  /zones/z3

Page 43: ZFS Deep Dive

ZFS Object-Based Storage

[Diagram: the ZFS POSIX interface and the ZFS volume emulator (zvol – iSCSI, swap, raw) layered on the Data Management Unit (DMU), which sits on the Storage Pool Allocator (SPA)]

• DMU provides a general-purpose object store
• zvol interface allows creation of raw devices
  > Use them for databases, create UFS in them, etc.

Page 44: ZFS Deep Dive

ZFS ZVOL Interface

• Create zvol interfaces just as any other ZFS file system
• Devices are located in /dev/zvol/
  > /dev/zvol/rdsk/<poolname>/<volname>

  # zfs create -V 4g tank/v1
  # newfs /dev/zvol/rdsk/tank/v1
  <newfs output>
  # mount /dev/zvol/dsk/tank/v1 /mnt
  # df -h /mnt
  Filesystem             size  used  avail  capacity  Mounted on
  /dev/zvol/dsk/tank/v1  3.9G  4.0M   3.9G        1%  /mnt

Page 45: ZFS Deep Dive

BREATHTAKING PERFORMANCE

Page 46: ZFS Deep Dive

Architected for Speed

• Copy-on-write design
• Multiple block sizes
• Pipelined I/O
• Dynamic striping
• Intelligent pre-fetch

Page 47: ZFS Deep Dive

Cost and Source Code

• ZFS source code is included in OpenSolaris
  > 47 ZFS patents added to the CDDL patent commons

• ZFS is FREE*
  *Free: USD 0 · EUR 0 · GBP 0 · SEK 0 · YEN 0 · YUAN 0

Page 48: ZFS Deep Dive

And for the Future

• More reliable
  > Fault Management Architecture integration
  > Hot spares
  > DTrace providers
• More flexible
  > Pool resize and device removal
  > Booting / root file system
  > Integration with Solaris Containers
• More secure
  > Encryption
  > Secure delete – overwriting for "absolute" deletion

Page 49: ZFS Deep Dive

Solaris 10 Deep Dive: ZFS

Bob Netherton
[email protected]

