

Nominet Technology Group

Comparing ASM with ZFS

Contents

1. Introduction
2. History of ASM
3. Traditional Filesystems
4. The modern approach
5. ASM Components
6. EXTENTS
7. Rebalancing
8. Metadata
9. Some Myths of ASM
10. History of ZFS
11. Building Blocks of ZFS
12. Layers of ZFS
13. It's a TREE!
14. Creating a pool
15. Snapshots & Clones
16. Comparing the two
17. References

1. Introduction

This document is meant to accompany my presentation, Comparing ASM with ZFS.

This presentation describes Oracle's ASM and Sun's ZFS file systems. I will cover a little of their history and explain how they actually work.

I will also compare and contrast the file systems, giving an understanding of the benefits of each. The idea for the presentation came about while I was watching one of the chief designers of ZFS, Sun's Bill Moore, give a talk on ZFS. I was of course impressed with the functionality of the file system, though I had heard quite a lot about it prior to this. What unexpectedly intrigued me was that the language Bill was using, and some of the concepts expounded in the presentation, would be familiar to a DBA audience.

I was also struck by some of the similarities between ASM and ZFS; they have some distinctive features in common. What I mean by that is that there are certain advantages a software RAID solution (which both of them are) has over hardware RAID.

I had been running ASM in production for around two years by this time (December 2007) and, I suspect like a lot of DBAs, had in some ways treated ASM like a black box. I knew enough to install it and operate it, but knew very little about how it actually worked. In some ways I think Oracle are greatly responsible for this state of affairs, as the stunning lack of documentation available regarding ASM has only bred a lack of understanding.

To be fair, I think Oracle have partly addressed this issue with the 11g documentation set, which now includes a Storage Administrator's Guide. However, they really have only partly addressed it, in that this guide still does not give many details on how ASM actually works. There is, though, an ASM book: Oracle Automatic Storage Management by Nitin Vengurlekar, Murali Vallath, and Rich Long, which covers the gap in explaining how ASM actually works.

The boundary of responsibility for storage administration has become increasingly blurred within organisations with the adoption of ASM. I think this means DBAs, more than ever (though you could argue it should always have been the case), need to understand storage concepts to be in a position to extract the maximum benefit from their storage.

Here I will present some of the ideas behind both ASM and ZFS, giving some insight into the benefits of both storage solutions, some of the features they have in common, and where they differ.

2. History of ASM

ASM has an interesting history that really gives an insight into the development timeframe of a large corporation like Oracle. The idea for ASM came from Bill Bridge, a long-time Oracle employee. The original idea goes way back to 1996, and it took three years for the project to be given management approval.

ASM was released with 10gR1. This occurred a full seven years after the original idea, which I think is a long time in technology terms, but a large corporation probably has difficulty meeting quicker turnaround times (cf. Microsoft Vista).

Right from the start, one of the initial design goals was not to have a directory tree, and that files would not have names or be accessible via the standard OS utilities. It now makes sense why ASMCMD feels like a bolted-on afterthought that is somewhat lacking in functionality.

I do wonder if this has hurt the take-up rate of ASM, though I believe a large proportion of new RAC installs use ASM. Support for clustering was built in from the beginning. Indeed, Oracle's big push on RAC may have been the killer application that ASM needed to go from a proposal to a fully realised product.

3. Traditional Filesystems

File systems have been around for almost 40 years. UFS, for example, was introduced in the early 80s and has thus evolved over two decades.

ASM and ZFS are both volume managers and file systems. ASM has been designed with the specific aim of storing Oracle database files, while ZFS is a general-purpose file system.

Historically each file system managed a single disk. This has some clear drawbacks, in terms of size, reliability, and speed. This is the niche that volume managers filled.

Volume managers are software that sits between the disk and the file system, enabling things like mirroring and striping of disks in a way that is completely transparent to the file system itself.

4. The modern approach

Both ASM and ZFS combine the roles of volume manager and file system into one. This can provide an administrative benefit as well as other advantages.

5. ASM Components

When managing your storage via ASM, you are required to run an ASM instance on your database server in addition to the normal RDBMS instance. The ASM instance allows the user to allocate disks to disk groups and perform the required storage-related tasks.

The ASM instance is managed in a similar way to a normal RDBMS instance: it has an SGA and uses an spfile to configure the various parameters associated with it. When it starts up, a set of Oracle processes run that manage the instance. As the ASM instance performs less work than an RDBMS instance, it requires far fewer resources. ASM instances mount disk groups to make files stored in ASM available to RDBMS instances; ASM instances do not mount databases.

ASM instances are started and stopped in a similar way to RDBMS instances, using SQL*Plus, srvctl or even Enterprise Manager.
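As a minimal sketch of what this looks like (assuming 11g, where the SYSASM role exists; the environment and disk group details are illustrative), the ASM instance could be started and inspected from SQL*Plus:

-- A sketch, assuming ORACLE_SID points at the ASM instance (e.g. +ASM)
-- and that at least one disk group has already been created.
CONNECT / AS SYSASM
STARTUP
-- STARTUP on an ASM instance mounts the disk groups named in ASM_DISKGROUPS;
-- there is no database to open.
SELECT name, state, type, total_mb, free_mb FROM v$asm_diskgroup;
-- Shutting ASM down fails while client databases are still connected,
-- so stop those first.
SHUTDOWN IMMEDIATE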

ASM instances can be clustered, and in a RAC environment there is one ASM instance per node of the cluster. There is only ever one ASM instance per node.

On failure of an ASM instance on a node, all databases that are using ASM on that node will also fail.

A disk group is the fundamental object that ASM manages and exposes to the user. A disk group can consist of one or more disks. The datafiles belonging to an RDBMS instance are stored in disk groups. Each individual database file is completely contained within one disk group, but a disk group can contain files for one or more databases, and a single database may store files in multiple diskgroups.

It is at the disk group level that the mirroring and striping capabilities of ASM can be utilised. ASM will automatically stripe data files across all disks in a disk group. The idea is that by doing this, the I/O will be evenly distributed across all the disks. The size of the disks should be taken into account when striping the data as all disks in the disk group should be filled to the same capacity, i.e. a larger disk should receive more data than a smaller one.

The various levels of redundancy can be specified at the disk group level:

External redundancy: let the storage array take care of it

Normal: mirrored pair

High: triple mirroring
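To make this concrete, here is a rough sketch (device paths, disk group and failgroup names are all invented) of creating a normal redundancy disk group with two failure groups from the ASM instance:

-- Each extent is mirrored across the two failure groups, and files are
-- striped across all four disks.
CREATE DISKGROUP data NORMAL REDUNDANCY
  FAILGROUP fg1 DISK '/dev/rdsk/c3t1d0s4', '/dev/rdsk/c3t2d0s4'
  FAILGROUP fg2 DISK '/dev/rdsk/c4t1d0s4', '/dev/rdsk/c4t2d0s4';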

6. EXTENTS

Every ASM disk is divided into allocation units (AUs). Data files are stored as extents, and an extent consists of one or more allocation units. When you create a disk group in 11g you can specify the size of the allocation unit to be from 1MB to 64MB, doubling between these limits. That is, you can set the AU size for a disk group to one of 1, 2, 4, 8, 16, 32 or 64MB.

Clearly, the larger the AU size chosen, the fewer extents it will take to map a file of a given size, so larger AUs are beneficial for large data files. Each individual extent resides on a single disk. Each extent consists of one or more AUs, with the concept of variable extent sizes being introduced to better accommodate larger data files.

Extents can vary in size from 1 AU to 8 AUs to 64 AUs. The number of AUs a given extent uses depends on how many extents have already been allocated to the file: the extent size increases at a threshold of 20,000 extents (to 8 AUs) and then again at 40,000 extents (to 64 AUs).
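As an illustration (a sketch with made-up names; the ATTRIBUTE clause shown is 11g syntax), a disk group intended for very large data files could be created with a larger allocation unit:

-- 4MB allocation units instead of the default 1MB, so fewer extents are
-- needed to map a large data file.
CREATE DISKGROUP bigdata EXTERNAL REDUNDANCY
  DISK '/dev/rdsk/c6t1d0s4'
  ATTRIBUTE 'au_size' = '4M', 'compatible.asm' = '11.1';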

7. Rebalancing

One of the major advantages of ASM is the ease with which the storage configuration can be changed while the databases relying on ASM remain online. This is thanks to the ability of ASM to automatically rebalance the distribution of data among the disks in a disk group whenever the disk group is reconfigured.

It is the RBAL background process that manages the rebalancing, with the actual work of moving the data extents being performed by the ARBn processes.

A rebalance operation essentially shifts extents around a disk group with the goal of ensuring each disk in the disk group is filled to the same capacity. This is beneficial when a new drive has been added to increase capacity, as a rebalance ensures that all drives fully participate in servicing I/O requests.
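A minimal sketch of this in practice (disk group and device names are invented): adding a disk triggers a rebalance, whose speed is controlled by the POWER clause and whose progress can be followed in V$ASM_OPERATION:

-- Add capacity and rebalance with a moderate power limit
-- (0 disables rebalancing, 11 is the traditional maximum).
ALTER DISKGROUP data ADD DISK '/dev/rdsk/c5t1d0s4' REBALANCE POWER 4;

-- Watch the RBAL/ARBn work in flight from the ASM instance.
SELECT group_number, operation, state, power, sofar, est_work, est_minutes
  FROM v$asm_operation;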

8. Metadata

ASM uses metadata to control disk groups and the allocation of space on the disks within the disk group to ASM files (i.e. datafiles and the like that are under the control of ASM). All of the metadata associated with a disk group resides within that disk group; that is to say, a disk group is self-describing.

ASM does not use a database to store the metadata: the ASM instance is never opened and it does not mount a database.

The ASM metadata is either stored at fixed physical locations on the disk or in special ASM files that are not exposed to the end user, e.g. you can't see them with ASMCMD. User-created files under ASM have file numbers that count upwards from 256, while the metadata files count down from 255, though not all numbers are utilised yet.

You can see the metadata files via the X$KFFXP fixed view:

SQL> select NUMBER_KFFXP file#, GROUP_KFFXP DG#, count(XNUM_KFFXP) AU_count
     from x$kffxp
     where NUMBER_KFFXP < 256
     group by NUMBER_KFFXP, GROUP_KFFXP;

     FILE#        DG#   AU_COUNT
---------- ---------- ----------
         1          1          2
         1          2          2
         1          3          2
         1          4          2
         2          1          1
         2          2          1
         2          3          1
         2          4          1
         3          1         42
         3          2         42
         3          3         42
         3          4         42
         4          1          2
         4          2          2
         4          3          2
         4          4          2
         5          1          1
         5          2          1
         5          3          1
         5          4          1
         6          1          1
         6          2          1
         6          3          1
         6          4          1

9. Some Myths of ASM

With the lack of clarity and comprehensiveness in the Oracle documentation, several myths surrounding ASM have gained currency within the wider community.

I think the most popular myth I have encountered is the claim that ASM is somehow able to move extents (and hence RDBMS data files) around based on I/O levels or even hot spots on disks.

This never happens; the only goal of ASM rebalancing is to ensure each file is evenly distributed amongst all the disks in a disk group. If a file is evenly distributed then the chances are that I/O to that file will also be evenly distributed, but ASM makes no use of any I/O metrics.

Another pervasive myth is that an RDBMS instance sends its I/O via the ASM instance. This is completely wrong: each RDBMS instance uses the extent maps it has received from the ASM instance to read and write directly to the ASM disks.
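One way to see this relationship for yourself (a small sketch, run from the ASM instance) is to list the databases registered as ASM clients; they connect to ASM for extent maps and metadata, while performing their I/O directly against the disks:

-- Databases currently served by this ASM instance.
SELECT db_name, instance_name, status
  FROM v$asm_client;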

10. History of ZFS

1996 was obviously a popular year for the invention of new file systems, as the first ideas for ZFS occurred to Sun engineer Jeff Bonwick way back then. However, like ASM, it would be several years before development proper took place. Development really started in 2001 and ZFS was announced in 2004, but it took a further two years to be released, with Solaris 10 6/06.

It is the world's first 128-bit file system and as such has a huge capacity. Unlike ASM, ZFS is a general-purpose file system; it has not been explicitly designed for storing database files.

There were several goals kept in mind when designing ZFS:

Ease of administration: it takes two commands to create your storage pool and mount a file system

Data integrity: all data is protected with a 256-bit checksum

Scalability: the maximum size of each file is 2^64 bytes

11. Building Blocks of ZFS

There are four basic building blocks to ZFS:

Pooled storage

Copy on write

Transactions

Checksums

Pooled storage takes the concept of virtual memory and applies it to disks. Just as adding more memory to a system simply makes it available, adding more disks should just make the additional storage available straight away.

ZFS maintains its records as a tree of blocks. Every block is reachable from a single block called the uberblock. When you change an existing block, instead of it being overwritten, a copy of the data is made and modified before being written to disk. This is Copy on Write, and it ensures ZFS never overwrites live data. This guarantees the integrity of the file system, as a system crash still leaves the on-disk data in a completely consistent state. There is no need for fsck. Ever.

Every block in the ZFS tree of blocks has a checksum. The checksum is stored in the parent block.

Operations that modify the file system are bunched together into transactions before being committed to disk asynchronously. Related changes are grouped into a transaction, and either the whole transaction completes or it fails. Individual operations within a transaction can be reordered to optimise performance as long as data integrity is not affected.

12. Layers of ZFS

ZFS is composed of several layers. It is implemented in around 1/7th the lines of code of UFS plus Solaris Volume Manager, yet provides more functionality than that combination.

Starting from the bottom up:

VDEV: method of accessing and arranging devices. Each vdev is responsible for representing the available space, as well as laying out blocks on the physical disk

ZIO: ZFS I/O pipeline - all data to or from the disk goes through this.

SPA: Storage Pool Allocator - includes routines to create and destroy pools as well as sync the data out to the vdevs.

ARC: Adaptive Replacement Cache - file system cache

DMU: Data Management Unit - presents transactional objects built upon the address space provided by the vdevs. The DMU is responsible for maintaining data consistency.

ZIL: ZFS Intent Log - not all writes go to their final location on disk straight away; synchronous writes are recorded in the intent log.

ZAP: ZFS Attribute Processor - most commonly used to implement directories in the ZPL.

DSL: Dataset and Snapshot Layer - responsible for implementing snapshots and clones.

ZVOL: ZFS Emulated Volume - the ability to present raw devices backed by a ZFS pool.

ZPL: ZFS POSIX Layer - the primary interface for interacting with ZFS as a file system.

13. It's a TREE!

The block structure of ZFS can be thought of as a tree. The leaf nodes are effectively the data blocks on disk, while the higher-level blocks are called indirect blocks. The top block is called the uberblock. You can think of all but the leaf blocks as metadata. The metadata is allocated dynamically.

14. Creating a pool

The real ease of administration in ZFS shows when you create a file system. When you are using whole disks there is no need for the device to be specially formatted, as ZFS formats the disk itself. There is also no need to issue the mkfs or newfs command. The following:

zpool create db c1t0d0

will create a file system mounted automatically on /db, using as much space on the c1t0d0 device as it requires. There is no need to edit /etc/vfstab entries.

It is with the zpool command that you can also define a redundant pool, for example:

zpool create db mirror c1t0d0 c2t0d0

will create a mirrored pool between these two devices.

Within a pool you can create multiple file systems using the zfs command:

zfs create db/oradata

You can dynamically add space to a storage pool with the following:

zpool add db mirror c3t0d0 c4t0d0

15. Snapshots & Clones

A snapshot is a read-only copy of a file system at a particular point in time. Snapshots can be created quickly and, due to the copy-on-write nature of ZFS, they are very cheap. A snapshot consumes very little space to begin with, but as the active dataset changes the snapshot will begin to consume more and more space by keeping references to the old data.

Snapshots are very straightforward to create and initially occupy no storage:

zfs snapshot db@now

This creates a snapshot of the db file system with a label called now. Snapshots consume storage from the same pool as the file system from which they were created.

As the file system that you have taken the snapshot of undergoes changes, the snapshot increases in size, as it effectively records and stores the original entries. The copy-on-write process makes taking snapshots a lot easier, as it is just a case of keeping the pointers to the old structure.

The snapshot data is accessible via the .zfs directory within the root of the file system that has undergone a snapshot. This makes it possible to recover individual files.

Rolling back to a snapshot is also a fairly trivial command:

zfs rollback db@now

In contrast to snapshots, a clone is a writable volume whose initial contents are the same as the dataset from which it was created.

Clones are created from snapshots.

zfs clone db@now db/test

This creates a new clone db/test from the snapshot db@now.

This seems to me like a great way of providing developers with a full copy of a database to work on without having to consume vast quantities of storage space.

16. Comparing the two

ASM and ZFS are both modern file systems that take the similar approach of combining a volume manager and a file system together. There are advantages to this approach.

ASM has obviously been written with the sole aim of storing Oracle data files, and is thus optimised for this. ZFS, meanwhile, is a general-purpose file system and has not been optimised for database usage at all.

I think that, with ASM able to be used in a clustered environment whereas ZFS is not cluster-aware, ZFS is clearly unable to participate in that segment of the market.

There may also be questions over whether copy on write has serious performance drawbacks when used in conjunction with an OLTP database. Copy on write may well have an impact on sequential I/O against a table that undergoes many updates.

On the other hand, ZFS provides a far richer set of features than ASM and has a far friendlier interface. I also believe it is easier to manage than ASM.

Another advantage of ASM, though, is the ability to rebalance where extents are held while remaining online, so that I/O can be optimally distributed.

I think both file systems have their own advantages, and there may be cases where ZFS, though perhaps less performant, offers more functionality, and that may be a trade-off worth making.

In a RAC situation there is no choice; it would be ASM all the way.

17. References

Some sources I used when compiling this document:

Oracle Storage Administrator's Guide: http://download.oracle.com/docs/cd/B28359_01/server.111/b31107/toc.htm

Oracle Automatic Storage Management (book): http://www.amazon.co.uk/Oracle-Automatic-Storage-Management-Under/dp/0071496076

Description of ZFS layers: http://opensolaris.org/os/community/zfs/source/

Sun information on ZFS: http://www.sun.com/software/solaris/zfs_learning_center.jsp
