Teacher's Notes - Lab Chapter 7 - ZFS File System

    1 ZFS

1.1 Objectives

This chapter gives an overview of the Solaris ZFS file system. It discusses the features of ZFS as well as the basic ZFS commands.

1.2 Chapter Outline

This chapter is divided into the following sections:

Overview
Capacity
Pooled Storage
Data Integrity
Mirroring and RAID-Z
Administration

1.3 Overview

ZFS was developed by the team of Jeff Bonwick as a new file system for Solaris. Announced for development in September 2004, it was included as part of the official build of OpenSolaris in November 2005 and distributed with the 6/06 update of Solaris 10 in June 2006.

ZFS aims to be the "last word in file systems". Developers of ZFS describe it on the Solaris community webpage as "a new kind of file system that provides simple administration, transactional semantics, end-to-end data integrity, and immense scalability."1

This chapter will discuss in detail what these features are and show their improvements over file systems currently being used today.

1.4 Capacity

A file system's capacity can be described by the number of bits the system uses to store information about files. Current file systems are 64-bit systems, meaning that 64 bits are used to store each file's information, such as its location on a device, file permissions, directory contents, etc. Being a 64-bit file system implies a theoretical maximum file system size of 2^64 bytes. This is approximately 1.8 x 10^10 GB, or nearly 200 million hard disks (of 100 GB each).

This is only a theoretical maximum. The following are the file system limitations of some well-known file systems.

File system    Operating System    Capacity
FAT16          DOS-Win3.11         2 GB
FAT32          Win98-Linux         32 TB (32000 GB)
NTFS           WinXP-Vista         2 TB (2000 GB)

1 http://www.opensolaris.org/os/community/zfs/, taken Aug 21 2007.

This may seem large enough, but some computer companies already have data in the petabyte range, approximately 10^6 GB worth of storage space. If the trend predicted by Moore's law continues, then by the middle of the next decade we can see the 64-bit limit being pushed.2

Already, worldwide storage has now exceeded 161 exabytes (10^18 bytes)3 and is forecast to keep growing rapidly. ZFS answers this by being a 128-bit file system, raising the theoretical maximum capacity to 2^128 bytes.4

1.5 Pooled Storage

In traditional Unix/Linux systems, a single root directory spans all the devices, and devices (or directories on those devices) can be mounted as directories in the root directory. When a device runs out of space, you must add a new device, mount it somewhere in the directory tree, and manually transfer or reorganize files to take advantage of the new capacity.

ZFS, on the other hand, uses a pooled storage system. When you set up a ZFS pool, you can assign one, two, or more devices to it. All the capacity of all the devices is accessible by the pool.

2 http://blogs.sun.com/bonwick/entry/128_bit_storage_are_you, taken Aug 22 2007.

3 EMC Corporation. The Expanding Digital Universe: A Forecast of Worldwide Information Growth Through 2010.
4 Jeff Bonwick. 128-bit storage: are you high? http://blogs.sun.com/bonwick/entry/128_bit_storage_are_you


This pool can then be mounted as a directory on the regular Solaris file system. From the user's standpoint, you are simply saving to another directory in the Solaris system. Internally, however, the data you save may be written to one device and might be extended to the next device if space is lacking. It could even be mirrored on another disk or distributed across several hard disks for greater redundancy. The user is abstracted from having to worry about where and how the data is stored.

To put it in simple terms, ZFS "does to storage what virtual memory did to RAM". As was discussed in a previous chapter, virtual memory hides from applications the details of how they are stored in memory. An application does not know where it resides in memory, whether it is actually allocated contiguous memory, or whether it is not in memory at all but temporarily stored on the hard disk. Files in ZFS are the same: the user is not made aware of how a file is actually stored, only that it can be accessed in a given directory.

As for the problem we discussed earlier, adding a new device in ZFS means just adding a new device to the storage pool. Your computer automatically gets the additional capacity without you having to transfer or reorganize your file system. You can have a maximum of 2^64 devices in a single storage pool.

    1." #ata $nte!ritySecondary storage is far from a reliable means of (eeping your data intact. "roblems can occur

    at anytime destroying information.

Bit rot occurs when parts of the magnetic medium of your hard disk fail due to simple wear and tear. A phantom write occurs when the hard disk claims to have written the data but actually hasn't. The hard disk may also accidentally read or write data from the wrong portion of the disk. These and more may cause your data to suddenly become unreadable.

    ZFS has many features that aim to maintain the integrity of data.

1.6.1 End-to-end Checksums

File systems store information in blocks. Traditional file systems have checksums appended to these blocks in order to provide error detection. This can detect bit rot, which is what happens when the data and the checksum no longer match.

However, what happens when the correct data and the correct checksum are placed on the wrong portion of the hard disk? In fact, this scheme can only detect bit rot and nothing more.

5 Jeff Bonwick. ZFS: The Last Word in File Systems. Sun Microsystems.


ZFS goes a step further by placing a checksum at each level of the block tree. Placing checksums all the way up to the parent block ensures that each data block is consistent. It also ensures that data block placement and the entire pool are consistent. Any operation that results in a bad checksum anywhere along the tree means that the entire pool is in an inconsistent state and requires correction.


Also, ZFS separates the data from the checksum. In traditional file systems, where the data and the checksum are placed in the same block, there is a chance that a hard disk error could modify both data and checksum in a way that the system can no longer determine whether there was an error. By physically separating data and checksum, hard disk errors would affect only the data or only the checksum, making the error easier to detect.
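
As a quick way to see these checks in action, the zpool status command reports, for every device in a pool, how many blocks have failed checksum verification (the pool name mypool below is only a placeholder):

zpool status -v mypool

The CKSUM column in the output counts blocks whose contents did not match their checksum; a healthy pool shows zeroes there.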

1.6.2 Disk scrubbing

To make sure that data remains consistent, ZFS regularly checks all data blocks in a process known as disk scrubbing. ZFS goes through the entire disk, making sure that data matches its checksums. In case of errors, ZFS is able to correct the information automatically from a redundant copy, such as a mirror, which we will discuss in a later section.
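
You can also start a scrub by hand and watch its progress (mypool is again a placeholder pool name):

zpool scrub mypool
zpool status mypool

zpool scrub walks every block in the pool and verifies it against its checksum, repairing what it can from redundant copies, while zpool status shows how far the scrub has progressed and whether any errors were found.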

    1..& #o("on"write Transa$tional )*+

Have you ever experienced a power failure as you were saving a very important document? To your horror, you find out that your document's file was corrupted and you would have to start from scratch.

The file was corrupted because it was left in an inconsistent state. The file consists of data blocks from the new version as well as data blocks from the old version, which should have been overwritten if not for the power failure.


Some traditional file systems make use of a journaling system. All intended I/O operations are first written into the file system journal before being run. This ensures that in the event of a power failure or other error, the system can simply replay the operations written in the journal until the file system is once more in a consistent state. Operations are said to be atomic: either they are all successfully executed (originally, or replayed from the journal after a crash), or they did not happen at all.

Journaling, however, slows down I/O execution because of the extra step of having to take note of all instructions to be executed. Over a long period of time, the journal itself takes up significant space in the file system.

ZFS goes the extra step by implementing copy-on-write transactional I/O. No data is ever overwritten in place by ZFS. Any change to the system takes place on a copy of the data, so a system failure will affect only the copy. In the event of failure, the file system still has the original, consistent data blocks from before the operations started. Only when the changes reach the root block are the changes committed to the system. Either all grouped I/O instructions are executed, or the transaction did not happen at all.


1.6.4 Linear time snapshots

    %s a side&effect of the !opy&n&:rite system, file system snapshots are automatically doneafter any file system operation. Snapshots are a copy of the file system at some point in thepast which are used for bac(up purposes. Every operation in ZFS automatically creates asnapshot of the old system. It is actually faster to create the snapshot rather than overwritingthe old data, which reCuires an e4tra step.


    1.% &irrorin! an '($#Z

1.7.1 Mirroring

ZFS allows for nearly effortless setup of a mirrored file system. A mirrored file system uses a second hard disk to completely replicate the data of the first hard disk. Mirrors are often used for redundancy: if the first hard disk fails, the system still operates with data from the second hard disk. Mirrors also make reading up to twice as fast, since data can be retrieved from the second hard disk while the first is busy.

Traditional mirroring implementations cannot identify bad blocks. Even though a backup copy exists in the mirror, they are unable to tell whether a block has been corrupted in any way. Because ZFS checksums all blocks, it silently determines whether a block has gone bad. If it has, it automatically retrieves the correct block from the second disk and repairs the bad block on the first disk. This is done silently, without the user having to deal with the problem.
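
As a sketch of how this looks in practice, an existing single-disk pool can be turned into a mirror by attaching a second disk to it (pool and disk names here are placeholders; creating mirrored pools from scratch is covered in the administration section):

zpool attach mypool c1d0 c1d1

After the attach, ZFS resilvers the new disk, that is, it copies the existing data onto it; from then on, reads and repairs can use either disk.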

1.7.2 RAID-Z

RAID stands for Redundant Array of Inexpensive Disks. Setting up a RAID system means adding more hard disks, following a particular scheme (called a RAID level) of how information is stored among the hard disks.

There are several traditional RAID levels.

In RAID 0, data is simply striped over multiple disks. A stripe consists of several data blocks joined together. This has a performance advantage, as reads can be done in parallel. However, it does not provide any kind of data security, as the failure of one disk means the loss of all data.

In RAID 1, data is mirrored, or completely duplicated, on a second (or multiple) disk. Reads can also be done in parallel, and data security is provided, since if one disk fails the data can still be read from the second disk. However, this RAID level is expensive, as you need twice the amount of hard disk space to store a given amount of data.

In RAID 2, data is written one bit per disk, with a special disk dedicated to storing parity information for data recovery. This level is not used in practice, as storing data bit by bit is very impractical.

    >i(e $%ID ), $%ID ; divides the data into dis(s with a special parity dis(. owever, data thistime is divided into stripes.

RAID 4 divides data into file system data blocks instead of striping a block among multiple disks. A dedicated parity disk contains information that can be used to detect and correct errors from the data disks.

The problem with RAID 4 is that any write to a data block automatically means recomputing the parity. This means that any write involves two writes: the data write and the new parity block. This causes a bottleneck at the parity disk. RAID 5 distributes the parity across the disks, allowing faster write times.

The problem with RAID 4 and RAID 5 is that whenever any data block is written, the other data blocks in the stripe have to be read to recompute the parity block. This causes writes to slow down. Also, data becomes corrupted if a power failure occurs while the parity block is being computed: as the new parity block was not written correctly, the data blocks do not match the parity block. When this happens, RAID incorrectly assumes that the data was corrupted. This problem is known as the parity hole (or write hole).

ZFS includes RAID-Z, a modified RAID implementation. RAID-Z makes each file system data block its own stripe. This way, in order to recompute the parity, it only needs to load a single data block; there is no need to read any other block. As file system operations are now transaction based, the parity hole is avoided. Either the entire block (including the parity) gets written, or it does not get written at all.


Dynamic striping is an additional feature of RAID-Z. In old RAID implementations, the number of hard disks is fixed once the system is set up, because stripe sizes are fixed. In ZFS, when a new disk is added, all new data is striped to make use of the new disk. There is no need to migrate the old data immediately; over time, ZFS migrates the old data into the new stripe format. This migration is done automatically and behind the scenes.

1.8 ZFS Administration

1.8.1 Disk naming convention

Before we can discuss ZFS administration, we must first be familiar with the disk notation used in Solaris.

All devices in Solaris are represented as files. These files are stored in the directory /devices.

# ls /devices
iscsi                pci@1f,2000:devctl   pci@1f,4000:devctl   pseudo:devctl
iscsi:devctl         pci@1f,2000:intr     pci@1f,4000:intr     scsi_vhci
options              pci@1f,2000:reg      pci@1f,4000:reg      scsi_vhci:devctl
pci@1f,2000          pci@1f,4000          pseudo

These are all the devices attached to the system, including the keyboard, monitor, USB devices and the like. To differentiate hard disks (including CD drives) from other devices, Solaris provides a separate directory, /dev/dsk (with a raw-device counterpart in /dev/rdsk), for disks.

If you list the contents of /dev/dsk, you will see files following a particular format:

cNdNsN, cNtNdNsN, or cNdNpN (where each N is a number). These strings describe the complete address of a disk slice.

# ls /dev/dsk
c0d0p0     c0d0s7     c1t0d0s4   c1t1d0s15  c1t2d0s12  c1t3d0s1   c1t4d0p3
c0d0p1     c0d0s8     c1t0d0s5   c1t1d0s2   c1t2d0s13  c1t3d0s10  c1t4d0p4
c0d0p2     c0d0s9     c1t0d0s6   c1t1d0s3   c1t2d0s14  c1t3d0s11  c1t4d0s0
c0d0p3     c1t0d0p0   c1t0d0s7   c1t1d0s4   c1t2d0s15  c1t3d0s12  c1t4d0s1
c0d0p4     c1t0d0p1   c1t0d0s8   c1t1d0s5   c1t2d0s2   c1t3d0s13  c1t4d0s10
c0d0s0     c1t0d0p2   c1t0d0s9   c1t1d0s6   c1t2d0s3   c1t3d0s14  c1t4d0s11
c0d0s1     c1t0d0p3   c1t1d0p0   c1t1d0s7   c1t2d0s4   c1t3d0s15  c1t4d0s12
c0d0s10    c1t0d0p4   c1t1d0p1   c1t1d0s8   c1t2d0s5   c1t3d0s2   c1t4d0s13
c0d0s11    c1t0d0s0   c1t1d0p2   c1t1d0s9   c1t2d0s6   c1t3d0s3   c1t4d0s14
c0d0s12    c1t0d0s1   c1t1d0p3   c1t2d0p0   c1t2d0s7   c1t3d0s4   c1t4d0s15
c0d0s13    c1t0d0s10  c1t1d0p4   c1t2d0p1   c1t2d0s8   c1t3d0s5   c1t4d0s2
c0d0s14    c1t0d0s11  c1t1d0s0   c1t2d0p2   c1t2d0s9   c1t3d0s6   c1t4d0s3
c0d0s15    c1t0d0s12  c1t1d0s1   c1t2d0p3   c1t3d0p0   c1t3d0s7   c1t4d0s4
c0d0s2     c1t0d0s13  c1t1d0s10  c1t2d0p4   c1t3d0p1   c1t3d0s8   c1t4d0s5
c0d0s3     c1t0d0s14  c1t1d0s11  c1t2d0s0   c1t3d0p2   c1t3d0s9   c1t4d0s6
c0d0s4     c1t0d0s15  c1t1d0s12  c1t2d0s1   c1t3d0p3   c1t4d0p0   c1t4d0s7
c0d0s5     c1t0d0s2   c1t1d0s13  c1t2d0s10  c1t3d0p4   c1t4d0p1   c1t4d0s8
c0d0s6     c1t0d0s3   c1t1d0s14  c1t2d0s11  c1t3d0s0   c1t4d0p2   c1t4d0s9

c represents the controller number. Controllers are numbered c0, c1, c2, etc. Hard disks are connected to a controller.

d represents the disk number on that controller.

s represents the slice number. Slice numbers can go from s0 to s15.

p, or partition number, is sometimes used instead of a slice number. Partition numbers go from p0 to p4.

A regular computer upon boot-up shows 4 devices: the primary master device, primary slave device, secondary master device, and secondary slave device. Your primary hard disk is more often than not the primary master device. A CD or DVD drive is often placed as the primary slave device.


If you have additional hard disks, they would be the secondary master and secondary slave.

For Solaris, the disk notation for these would be:

primary master: c0d0s0 (controller 0, disk 0, slice 0)
primary slave: c0d1s0 (controller 0, disk 1, slice 0)
secondary master: c1d0s0 (controller 1, disk 0, slice 0)
secondary slave: c1d1s0 (controller 1, disk 1, slice 0)

Some computers may have a SCSI interface. SCSI allows for up to 16 devices attached to a single controller. Often, the SCSI controller number starts from c2 (as c0 and c1 are the primary and secondary controllers, respectively).

SCSI computers use the t value to indicate a disk. t can range from t0 to t15. SCSI addresses also use the d notation, but it is usually set to zero (d0). For example, the files c2t2d0s0 to c2t2d0s15 represent all the slices of the device assigned to target 2 on controller 2.
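
Putting the notation together, a name such as c1t3d0s5 from the listing above reads as: controller 1 (c1), SCSI target 3 (t3), disk 0 on that target (d0), slice 5 (s5). Likewise, c0d0p1 is partition 1 of disk 0 on controller 0.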

1.8.2 Pool administration

    "ools are maintained by the 5pool command. Subcommands of 5pool allow for creation,deletion, adding of new devices, listing and modification of pools.

To create a basic ZFS pool, you can run the command zpool create. The basic syntax of zpool create is as follows:

zpool create <poolname> <vdev> <devices>

    "oolname is the name of the pool you are going to create. Bdev describes what storage featurethe pool should use. Devices indicate what hard dis(s you are going to be using for the pool.

You can create pools from disk slices (or even files), but usually pools are made from whole disks.

Note that setting a disk to be part of a pool formats that disk.

For example, the following command creates a basic zpool named myfirstpool1 using the secondary master hard disk:

zpool create myfirstpool1 disk c1d0

To create a mirrored pool, simply replace the disk keyword with mirror. Note that you have to provide more than one disk to create a mirror:

zpool create mymirroredpool mirror c1d0 c1d1

A collection of disks can be set up in a RAID-Z configuration by placing raidz in the <vdev> entry. Note that the recommended number of disks for RAID-Z is between 3 and 9.
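
For example, a minimal sketch (the pool name and the three disks are placeholders):

zpool create myraidzpool raidz c1d0 c1d1 c2t0d0

Here raidz takes the place of disk or mirror in the <vdev> position, and the listed disks form a single RAID-Z group.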


ZFS can also use ordinary files as pool devices. This feature allows you to test ZFS features without needing to have additional disks. You will use this for our exercises.
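
A minimal sketch of such a file-backed pool (the file paths, sizes, and pool name are arbitrary; ZFS requires the backing files to be at least 64 MB):

mkfile 100m /export/zfsfile1 /export/zfsfile2
zpool create mytestpool mirror /export/zfsfile1 /export/zfsfile2

mkfile pre-allocates the two files, and zpool create then accepts their absolute paths in place of disk names.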

To add a device to an existing pool:

zpool add myfirstpool disk c1d0

You can add a mirror or an additional RAID-Z vdev to an existing pool simply by replacing disk with mirror or raidz, respectively.

    The command

zpool list

lists all the ZFS pools available along with their space usage information and status. A pool can have 3 status values: online, degraded, or faulted. An online zpool has all its devices in working order. A degraded pool has a failed device, but data can still be recovered through redundancy. A faulted pool means that a device has failed and data cannot be recovered.

Exporting a pool means setting up its devices for transfer. To export a pool, simply run the command:

zpool export mypool

And after migrating the devices to a new computer, run the command:

zpool import mypool

And finally, to destroy a pool, run the command:

zpool destroy mypool

This destroys the pool, making the devices that used to be part of it available for other uses.

    1..& 4asi$ ZFS ool %sage

Once you have created a pool, you can mount it to be part of the regular Solaris file system. Users would simply save as usual to the mounted directory, not knowing that the directory is now using ZFS. Users are also abstracted from the knowledge that their directories are mirrored or stored using RAID-Z; it is all business as usual.

To mount the pool to be part of the regular Solaris file system, use the command:

zfs set mountpoint=/target/directory/in/regular/filesystem poolname

For example, we will mount mypool to store user files:

zfs set mountpoint=/export/home mypool

You can create additional directories (child file systems) in mypool. For example, we create directories for user1 and user2:

zfs create mypool/user1

zfs create mypool/user2

As we have already mounted mypool at /export/home, user1 is automatically mounted as /export/home/user1 and user2 as /export/home/user2.
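
To check the result, one quick sketch (dataset names as created above):

zfs list -r mypool

This lists mypool, mypool/user1, and mypool/user2 together with their space usage and their mountpoints under /export/home.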

There are additional options, such as compression, the enforcement of disk quotas, and disk space guarantees (reservations), which can easily be set with the following commands:

zfs set compression=on mypool


zfs set quota=5g mypool/user1

zfs set reservation=10g mypool/user2
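
To verify what has been set, one possible check (property and dataset names as above):

zfs get compression,quota,reservation mypool/user1 mypool/user2

zfs get prints the current value of each listed property and whether it was set locally, inherited, or left at its default.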

1.8.4 Snapshots and clones

As was discussed earlier, ZFS allows for the creation of snapshots. Snapshots are a read-only copy of the file system at a given point in time, which can be used for backup purposes.

To create a snapshot, simply run the command zfs snapshot, indicating the ZFS directory you wish to take a snapshot of together with a name for that snapshot. For example, the following command creates a snapshot of the projects directory of user1. We will name the snapshot ver3backup:

zfs snapshot mypool/user1/projects@ver3backup

Due to the way ZFS stores data, snapshots are created instantly and initially require no additional space. No additional processing is necessary: ZFS simply preserves the original data blocks whenever changes are made to the target directory.

All snapshots are stored in the .zfs/snapshot directory located in the root of each file system. This allows users to check their snapshots without having to be the system administrator. For example, the snapshot we just created a while ago is now stored in

/export/home/user1/.zfs/snapshot/ver3backup

    which user1 can access without having to be a system administrator.
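
You can also list every snapshot known to ZFS with:

zfs list -t snapshot

which shows each snapshot's name and how much space it currently holds on to.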

In addition, you can roll back your directory to a snapshot. Rollback means restoring the snapshot, discarding all changes made to your directory since the snapshot was taken. For example, user1 made a lot of errors in project version 4, so there is a need to go back to the version 3 backup. To revert to the ver3backup snapshot, simply run the command:

zfs rollback -r mypool/user1/projects@ver3backup

ZFS clones are a writable copy of a snapshot. To create a clone, indicate the snapshot name and the target directory where the snapshot is to be cloned:

zfs clone mypool/user1/projects@ver3backup mypool/user1/projects/ver3copy

There are many more zfs commands. To find out additional zfs command options and how to use them, simply run the man command on zfs:

man zfs
