ZFS Deep Dive
Solaris 10 Deep Dive: ZFS
Bob Netherton
Technical Specialist, Solaris Adoption
Sun Microsystems, Inc.
http://blogs.sun.com/bobn
Agenda

• What is ZFS?
• Why a new file system?
• What's different about it?
• What can I do with it?
• How much does it cost?
• Where does ZFS go from here?
What is ZFS? A new way to manage data

• The world's first 128-bit file system
• With checksumming and copy-on-write transactions
• Pooled storage model – no volume manager
• Specifically architected for speed

End-to-End Data Integrity · Huge Performance Gains · Easier Administration · Immense Data Capacity
Why a New File System?

• Data management costs are high
• The value of data is becoming even more critical
• The amount of storage is ever-increasing
Trouble with Existing File Systems? Good for the time they were designed, but...

• No defense against silent data corruption: any defect in the data path can corrupt data... undetected
• Difficult to administer, and a volume manager is needed: volumes, labels, partitions, provisioning, and lots of limits
• Older/slower data management techniques: fat locks, fixed block size, naive pre-fetch, dirty region logging
ZFS Design Principles

• Start with a new design around today's requirements
• Pooled storage
> Eliminate the notion of volumes
> Do for storage what virtual memory did for RAM
• End-to-end data integrity
> Historically considered too expensive
> Now, data is too valuable not to protect
• Transactional operation
> Maintain a consistent on-disk format
> Reorder transactions for a big performance win
Evolution of Disks and Volumes

• Initially, we had simple disks
• Disks were abstracted into volumes to meet requirements
• An industry grew up around HW / SW volume management

[Diagram: in each case a file system sits on a volume manager, which presents two 1GB disks as a concatenated 2GB, striped 2GB, or mirrored 1GB volume]
FS/Volume Model vs. ZFS

• Traditional Volumes
> 1:1 FS-to-volume relationship
> Grow / shrink by hand
> Limited bandwidth
> Storage fragmented

• ZFS Pooled Storage
> No partitions / volumes
> Grow / shrink automatically
> All bandwidth always available
> All storage in the pool is shared
FS / Volume Model vs. ZFS: The I/O Stacks

FS / Volume I/O Stack:
• FS to Volume
> Block device interface
> Write a block, write a block, ...
> Loss of power = loss of consistency
> Workaround: journaling – slow & complex
• Volume to Disk
> Block device interface
> Write each block to each disk immediately to keep mirrors in sync
> Loss of power = resync
> Synchronous & slow

ZFS I/O Stack:
• ZFS to Data Management Unit (DMU)
> Object-based transactions
> “Make these changes to these objects”
> All or nothing
• DMU to Storage Pool
> Transaction group commit
> All or nothing
> Always consistent on disk
> Journal not needed
• Storage Pool to Disk
> Schedule, aggregate, and issue I/O at will – runs at platter speed
> No resync if power is lost
DATA INTEGRITY
ZFS Data Integrity Model

• Everything is copy-on-write
> Never overwrite live data
> On-disk state always valid – no fsck
• Everything is transactional
> Related changes succeed or fail as a whole
> No need for journaling
• Everything is checksummed
> No silent corruption
> No panics from bad metadata
• Enhanced data protection
> Mirrored pools, RAID-Z, disk scrubbing
Copy-on-Write and Transactional

[Diagram, four stages: 1. Initial block tree. 2. Writes go to a copy of the changed data blocks (original data untouched). 3. Copy-on-write of the indirect blocks creates new pointers alongside the original pointers. 4. The uber-block is rewritten to point at the new tree.]
End-to-End Checksums

• Checksums are stored separately from the data
• The entire I/O path is self-validating, all the way up to the uber-block
• Prevents:
> Silent data corruption
> Panics from corrupted metadata
> Phantom writes
> Misdirected reads and writes
> DMA parity errors
> Errors from driver bugs
> Accidental overwrites
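
A minimal sketch of how checksumming surfaces in the CLI (both commands are real; "tank" is the example pool used throughout this deck, fletcher is the default algorithm, and SHA-256 is an optional per-file-system property mentioned later in the deck):

# zfs set checksum=sha256 tank
# zfs get checksum tank
NAME  PROPERTY  VALUE   SOURCE
tank  checksum  sha256  local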
Self-Healing Data

ZFS can detect bad data using checksums and “heal” it using a mirrored copy.

[Diagram, three stages: 1. The application reads from a ZFS mirror and ZFS detects the bad data. 2. ZFS gets good data from the other side of the mirror and returns it to the application. 3. ZFS “heals” the bad copy.]
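
A hypothetical status check after such a repair (zpool status is the real command; the device names and the checksum-error count below are illustrative):

# zpool status tank
  pool: tank
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.
        Applications are unaffected.
config:
        NAME          STATE   READ WRITE CKSUM
        tank          ONLINE     0     0     0
          mirror      ONLINE     0     0     0
            c9t42d0   ONLINE     0     0     1
            c13t11d0  ONLINE     0     0     0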
Disk Scrubbing
• Uses checksums to verify the integrity of all the data
• Traverses metadata to read every copy of every block
• Finds latent errors while they're still correctable
• It's like ECC memory scrubbing – but for disks
• Provides fast and reliable re-silvering of mirrors
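
Scrubbing is kicked off and monitored with zpool (real subcommands on the example pool "tank"; the progress line is abbreviated and illustrative):

# zpool scrub tank
# zpool status tank
 ...
 scrub: scrub in progress, 42.5% done, 0h12m to go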
RAID-Z Protection: RAID-5 and More

• ZFS provides better than RAID-5 availability
> The copy-on-write approach solves historical problems
• Striping uses dynamic widths
> Each logical block is its own stripe
• All writes are full-stripe writes
> Eliminates read-modify-write (so it's fast!)
• Eliminates the RAID-5 “write hole”
> No need for NVRAM
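
Creating a single-parity RAID-Z pool is a one-liner (zpool create raidz is the real syntax; the four device names are illustrative):

# zpool create tank raidz c1t0d0 c1t1d0 c1t2d0 c1t3d0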
Immense Data Capacity
• 128-bit File System
• No Practical Limitations on File Size, Directory Entries, etc.
• All metadata is dynamic
• Concurrent Everything
EASIER ADMINISTRATION
Easier Administration

• Pooled storage design makes for easier administration
> No need for a volume manager!
• Straightforward commands and a GUI
> Snapshots & clones
> Quotas & reservations
> Compression
> Pool migration
> ACLs for security
No More Volume Manager!

Capacity is added automatically to the shared storage pool.

[Diagram: three applications, each on its own ZFS file system, all drawing from a single shared storage pool]
ZFS File Systems are Hierarchical
• File system properties are inherited
• Inheritance makes administration a snap
• File systems become control points
• Manage logically related file systems as a group
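
A minimal sketch of inheritance in action (both commands are real; the datasets match the examples on the following slides):

# zfs set compression=on tank/home
# zfs get -r compression tank/home
NAME              PROPERTY     VALUE  SOURCE
tank/home         compression  on     local
tank/home/ahrens  compression  on     inherited from tank/home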
Create ZFS Pools and File Systems

• Create a ZFS pool consisting of two mirrored drives
# zpool create tank mirror c9t42d0 c13t11d0
# df -h -F zfs
Filesystem  size  used  avail  capacity  Mounted on
tank         33G    1K    33G        1%  /tank

• Create the home directory file system
# zfs create tank/home
# zfs set mountpoint=/export/home tank/home
# df -h -F zfs
Filesystem  size  used  avail  capacity  Mounted on
tank         33G   24K    33G        1%  /tank
tank/home    33G   27K    33G        1%  /export/home
Create ZFS Pools and File Systems (continued)

• Create home directories for users
# zfs create tank/home/ahrens
# zfs create tank/home/bonwick
# zfs create tank/home/billm
# df -h -F zfs
Filesystem         size  used  avail  capacity  Mounted on
tank                33G   24K    33G        1%  /tank
tank/home           33G   27K    33G        1%  /export/home
tank/home/ahrens    33G   24K    33G        1%  /export/home/ahrens
tank/home/bonwick   33G   24K    33G        1%  /export/home/bonwick
tank/home/billm     33G   24K    33G        1%  /export/home/billm

• Add space to the pool
# zpool add tank mirror c9t43d0 c13t12d0
# df -h -F zfs
Filesystem  size  used  avail  capacity  Mounted on
tank         66G   24K    66G        1%  /tank
tank/home    66G   27K    66G        1%  /export/home
Quotas and Reservations

• To control pooled storage usage, administrators can set a quota or reservation on a per-file-system basis
# df -h -F zfs
Filesystem         size  used  avail  capacity  Mounted on
tank/home           66G   28K    66G        1%  /export/home
tank/home/ahrens    66G   24K    66G        1%  /export/home/ahrens
tank/home/bonwick   66G   24K    66G        1%  /export/home/bonwick
# zfs set quota=10g tank/home/ahrens
# zfs set reservation=20g tank/home/bonwick
# df -h -F zfs
Filesystem         size  used  avail  capacity  Mounted on
tank/home           66G   28K    46G        1%  /export/home
tank/home/ahrens    10G   24K    10G        1%  /export/home/ahrens
tank/home/bonwick   66G   24K    66G        1%  /export/home/bonwick
File System Attributes

• Attributes are set on a file system and inherited by the child file systems in the tree
# zfs set compression=on tank
# zfs set sharenfs=rw tank/home
# zfs get all tank
NAME  PROPERTY       VALUE                SOURCE
tank  type           filesystem           -
tank  creation       Fri Sep 1 9:38 2006  -
tank  used           20.0G                -
tank  available      46.4G                -
tank  compressratio  1.00x                -
tank  mounted        yes                  -
tank  quota          none                 default
tank  reservation    none                 default
tank  recordsize     128K                 default
tank  mountpoint     /tank                default
tank  sharenfs       off                  default
tank  compression    on                   local
tank  atime          on                   default
...
ZFS Snapshots

• Provide a read-only point-in-time copy of a file system
• Copy-on-write makes them essentially “free”
• Very space efficient – only changes are tracked
• And instantaneous – taking a snapshot simply keeps the old copy instead of freeing it

[Diagram: the snapshot uber-block retains the original block tree while the new uber-block points at the current data]
ZFS Snapshots (continued)

• Snapshots are simple to create and roll back to
# zfs list -r tank
NAME               USED  AVAIL  REFER  MOUNTPOINT
tank              20.0G  46.4G  24.5K  /tank
tank/home         20.0G  46.4G  28.5K  /export/home
tank/home/ahrens  24.5K  10.0G  24.5K  /export/home/ahrens
tank/home/billm   24.5K  46.4G  24.5K  /export/home/billm
tank/home/bonwick 24.5K  66.4G  24.5K  /export/home/bonwick

# zfs snapshot tank/home/billm@s1
# zfs list -r tank/home/billm
NAME                 USED  AVAIL  REFER  MOUNTPOINT
tank/home/billm     24.5K  46.4G  24.5K  /export/home/billm
tank/home/billm@s1      0      -  24.5K  -

# cat /export/home/billm/.zfs/snapshot/s1/foo.c
# zfs rollback tank/home/billm@s1
# zfs destroy tank/home/billm@s1
ZFS Clones

• A clone is a writable copy of a snapshot
> Created instantly, unlimited number
• Perfect for “read-mostly” file systems – source directories, application binaries and configuration, etc.
# zfs list -r tank/home/billm
NAME                 USED  AVAIL  REFER  MOUNTPOINT
tank/home/billm     24.5K  46.4G  24.5K  /export/home/billm
tank/home/billm@s1      0      -  24.5K  -

# zfs clone tank/home/billm@s1 tank/newbillm

# zfs list -r tank/home/billm tank/newbillm
NAME                 USED  AVAIL  REFER  MOUNTPOINT
tank/home/billm     24.5K  46.4G  24.5K  /export/home/billm
tank/home/billm@s1      0      -  24.5K  -
tank/newbillm           0  46.4G  24.5K  /tank/newbillm
ZFS Send / Receive (Backup / Restore)

• Back up and restore ZFS snapshots
> Full backup of any snapshot
> Incremental backup of the differences between snapshots

• Create a full backup of a snapshot
# zfs send tank/fs@snap1 > /backup/fs-snap1.zfs

• Create an incremental backup
# zfs send -i tank/fs@snap1 tank/fs@snap2 > /backup/fs-diff1.zfs

• Replicate a ZFS file system remotely
# zfs send -i tank/fs@11:31 tank/fs@11:32 | ssh host zfs receive -d /tank/fs
Storage Pool Migration

• “Adaptive endian-ness”: hosts always write in their native “endian-ness”
• On opposite-“endian” systems, write and copy operations will eventually byte-swap all data
• Configuration data is stored within the data itself: when the data moves, so does its config info
ZFS Data Migration

• Host-neutral on-disk format
> Move data from SPARC to x86 transparently
> Data is always written in the native format; reads reformat the data if needed
• ZFS pools may be moved from host to host
> ZFS handles device IDs & paths, mount points, etc.

• Export the pool on the original (source) host
# zpool export tank

• Import the pool on the new (destination) host
# zpool import tank
Data Compression

• Reduces the amount of disk space used
• Reduces the amount of data transferred to disk, increasing effective throughput
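
A sketch of enabling and checking compression (real commands and properties on the example pool; the ratio shown is illustrative):

# zfs set compression=on tank
# zfs get compression,compressratio tank
NAME  PROPERTY       VALUE  SOURCE
tank  compression    on     local
tank  compressratio  1.85x  -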
Data Security: ACLs and Checksums

• ACLs based on NFSv4 – NT style
> Full allow / deny semantics with inheritance
> Fine-grained privilege control model (17 attributes)
• The uber-block checksum can serve as a digital signature for the entire file system
> A 256-bit, military-grade checksum (SHA-256) is available
• Encrypted file system support is coming soon
• Secure deletion (scrubbing) is coming soon
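
A small sketch of the NFSv4-style ACL interface on Solaris (chmod A+ and ls -v are the real mechanisms; the user, permissions, and file name are illustrative):

# chmod A+user:ahrens:read_data/write_data:allow /tank/home/shared.txt
# ls -v /tank/home/shared.txt
-rw-r--r--   1 root  root  0 Sep  1 09:38 /tank/home/shared.txt
     0:user:ahrens:read_data/write_data:allow
     ...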
ZFS and Zones

• Two great tastes that go great together
> You've got ZFS data in my zone!
> Hey, you've got your zone on my ZFS!
• ZFS datasets (pools or file systems) can be delegated to zones
> The zone administrator controls the contents of the dataset
• The zoneroot may (soon) be placed on ZFS
> Separate ZFS file system per zone
> Snapshots and clones make zone creation fast
ZFS Pools and Zones

[Diagram: the pool "tank" lives in the global zone; datasets tank/a, tank/b, and tank/c are delegated to Zone A, Zone B, and Zone C respectively]
Framework for Examples

• Zones
> z1 – sparse root, zoneroot on ZFS
> z2 – full root, zoneroot on ZFS
> z4 – sparse root, zoneroot on UFS
• ZFS Pools & File Systems
> p1 – mirrored ZFS pool, mounted as /zones
> p2 – mirrored ZFS pool, mounted as /p2
> p3 – unmirrored ZFS pool, mounted as /p3
Adding ZFS as a Mounted File System

• Mount a ZFS file system into a zone like any other loopback file system
• Must set the mountpoint to legacy so that the zone manages the mount

# zfs create p2/z1a
# zfs set mountpoint=legacy p2/z1a
# zonecfg -z z1
zonecfg:z1> add fs
zonecfg:z1:fs> set type=zfs
zonecfg:z1:fs> set dir=/z1a
zonecfg:z1:fs> set special=p2/z1a
zonecfg:z1:fs> end
zonecfg:z1> verify
zonecfg:z1> commit
zonecfg:z1> exit
Adding ZFS as a Delegated File System

• Delegate a ZFS dataset to a zone
> The zone administrator manages file systems within the zone

# zfs create p2/z1b
# mkdir /zones/z1/root/z1b
# zonecfg -z z1
zonecfg:z1> add dataset
zonecfg:z1:dataset> set name=p2/z1b
zonecfg:z1:dataset> end
zonecfg:z1> commit
zonecfg:z1> exit
# zoneadm -z z1 boot
# zlogin z1 df -h
Filesystem  size  used  avail  capacity  Mounted on
p2/z1b       12G   24K    12G        1%  /p2/z1b
# zlogin z1 zfs list
NAME     USED  AVAIL  REFER  MOUNTPOINT
p2       136K  11.5G  25.5K  /p2
p2/z1b  24.5K  11.5G  24.5K  /p2/z1b
The zoned Property for a ZFS File System

• Once a file system is delegated to a zone, the zoned property is set
• If set, the file system can no longer be managed in the global zone
> The zone admin might have changed things in incompatible ways (the mountpoint, for example)
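
From the global zone, the property can be inspected (zfs get zoned is real; p2/z1b is the dataset delegated on the previous slide):

# zfs get zoned p2/z1b
NAME    PROPERTY  VALUE  SOURCE
p2/z1b  zoned     on     local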
Zoneroot on ZFS (Soon)

# cat z5.conf
create
set zonepath=/zones/z5
set autoboot=false
add net
set address=192.168.100.1/25
set physical=nge0
end
commit
# zonecfg -z z5 -f z5.conf
# zoneadm -z z5 install
A ZFS file system has been created for this zone.
Preparing to install zone <z5>.
Creating list of files to copy from the global zone.
Copying <2587> files to the zone.
Initializing zone product registry.
Determining zone package initialization order.
Preparing to initialize <957> packages on the zone.
Initialized <957> packages on zone.
Zone <z5> is initialized.
Zoneroot on ZFS (Soon, continued)

# zfs list
NAME   USED  AVAIL  REFER  MOUNTPOINT
p1    3.44G  8.06G    38K  /zones
p1/z5 81.1M  8.06G  81.1M  /zones/z5
# zlogin z5 zfs list
no datasets available
# zfs set quota=500m p1/z5
# zfs list
NAME   USED  AVAIL  REFER  MOUNTPOINT
p1    3.45G  8.06G    38K  /zones
p1/z5 81.1M   419M  81.1M  /zones/z5
# zfs set reservation=500m p1/z5
# zfs list
NAME   USED  AVAIL  REFER  MOUNTPOINT
p1    3.45G  7.65G    38K  /zones
p1/z5 81.1M   419M  81.1M  /zones/z5
Cloning Zones with ZFS

# zfs list
NAME   USED  AVAIL  REFER  MOUNTPOINT
p1    3.37G  8.14G    36K  /zones
p1/z1  127M  8.14G   127M  /zones/z1
p1/z2 3.24G  8.14G  3.24G  /zones/z2
# cp z2.conf z3.conf
<make changes necessary for z3 identity>
# zonecfg -z z3 -f z3.conf
# zoneadm -z z3 clone z2
Cloning snapshot p1/z2@SUNWzone1
Instead of copying, a ZFS clone has been created for this zone.
# zfs list
NAME             USED  AVAIL  REFER  MOUNTPOINT
p1              3.37G  8.14G    37K  /zones
p1/z1            127M  8.14G   127M  /zones/z1
p1/z2           3.24G  8.14G  3.24G  /zones/z2
p1/z2@SUNWzone1 94.5K      -  3.24G  -
p1/z3            116K  8.14G  3.24G  /zones/z3
ZFS Object-Based Storage

[Diagram of the ZFS stack: the ZFS POSIX interface and the ZFS volume emulator (zvol, iSCSI, swap, raw) sit on the Data Management Unit (DMU), which sits on the Storage Pool Allocator (SPA)]

• The DMU provides a general-purpose object store
• The zvol interface allows creation of raw devices
> Use them for databases, create UFS file systems on them, etc.
ZFS ZVOL Interface

• Create zvols just like any other ZFS file system
• Devices are located under /dev/zvol/
> /dev/zvol/rdsk/<poolname>/<volname>

# zfs create -V 4g tank/v1
# newfs /dev/zvol/rdsk/tank/v1
<newfs output>
# mount /dev/zvol/dsk/tank/v1 /mnt
# df -h /mnt
Filesystem             size  used  avail  capacity  Mounted on
/dev/zvol/dsk/tank/v1  3.9G  4.0M   3.9G        1%  /mnt
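
The object-storage slide above also lists swap as a zvol consumer; a minimal sketch of that use (zfs create -V and swap -a are real commands; the volume name is illustrative):

# zfs create -V 2g tank/swapvol
# swap -a /dev/zvol/dsk/tank/swapvol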
BREATHTAKING PERFORMANCE
Architected for Speed
Copy-on-Write Design
Multiple Block Sizes
Pipelined I/O
Dynamic Striping
Intelligent Pre-Fetch
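
Of these, the multiple block sizes are directly tunable through the recordsize property (a real ZFS property; the 8K value and dataset name below are illustrative, chosen to match a database block size):

# zfs set recordsize=8k tank/db
# zfs get recordsize tank/db
NAME     PROPERTY    VALUE  SOURCE
tank/db  recordsize  8K     local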
Cost and Source Code

• ZFS source code is included in OpenSolaris
> 47 ZFS patents added to the CDDL patent commons
• ZFS is FREE: USD 0, EUR 0, GBP 0, SEK 0, Yen 0, Yuan 0
And for the Future

More Reliable
• Fault Management Architecture integration
• Hot spares
• DTrace providers

More Flexible
• Pool resize and device removal
• Booting / root file system
• Integration with Solaris Containers

More Secure
• Encryption
• Secure delete – overwriting for “absolute” deletion
Solaris 10 Deep Dive: ZFS

Bob Netherton
bob.netherton@sun.com