ZFS and OLTP



    Platform & Configuration

    Our platform was a Niagara T2000 (8 cores @ 1.2GHz, 4 HW threads or strands per core) with 130 x 36GB disks attached in JBOD fashion. Each disk was partitioned into 2 equal slices, with half of the surface given to a Solaris Volume Manager (SVM) volume onto which UFS would be built, and the other half given to a ZFS pool.

    The benchmark was designed to not fully saturate either the CPU or the disks. While we know that performance varies between the inner and outer disk surface, we don't expect the effect to be large enough to require attention here.

    Write Cache Enabled (WCE)

    ZFS is designed to work safely whether or not a disk write cache is enabled (WCE). This stays true if ZFS is operating on a disk slice. However, when given a full disk, ZFS will turn _ON_ the write cache as part of the import sequence; it won't enable the write cache when given only a slice. So, to be fair to ZFS' capabilities, we manually turned ON WCE when running our test over ZFS.

    UFS is not designed to work with WCE and will put data at risk if WCE is set, so we needed to turn it off for the UFS runs. We only had to juggle WCE manually because we did not have enough disks to give each filesystem its own full disks. The performance we measured is therefore what would be expected when giving full disks to either filesystem. We note that, for the FC devices we used, WCE does not provide ZFS a significant performance boost on this setup.
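    For reference, here is roughly how one would toggle WCE by hand on a Solaris FC/SCSI drive using format's expert mode. This is a minimal sketch only: the exact menu entries vary by drive model and release, and c1t0d0 is a placeholder device.

        # Open format in expert mode on one placeholder disk; the cache ->
        # write_cache -> enable menu items shown in the comments are typical
        # for SCSI/FC drives but not guaranteed on every model.
        format -e -d c1t0d0
        #   format> cache
        #   cache> write_cache
        #   write_cache> enable
        #   write_cache> display     # verify the new setting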

    No Redundancy

    For this initial effort we also did not configure any form of redundancy for either filesystem. ZFS RAID-Z does not really have an equivalent feature in UFS, so we settled on a simple stripe. We could eventually configure software mirroring on both filesystems, but we don't expect that would change our conclusions; still, it will be interesting follow-up work.
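    As a minimal sketch (placeholder slice names, and far fewer devices than the 65 slices per filesystem we actually used), a non-redundant ZFS stripe is just a pool created with the devices listed and no mirror or raidz keyword:

        # A plain dynamic stripe: listing slices with no mirror/raidz
        # keyword gives no redundancy. Slice names are placeholders.
        zpool create datapool c1t0d0s0 c1t1d0s0 c1t2d0s0 c1t3d0s0
        zpool status datapool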

    DBMS logging

    Another thing we know already is that a DBMS's log writer latency is critical to OLTP performance. So in order to improve on that metric, it's good practice to set aside a number of disks for the DBMS' logs. With this in hand, we managed to run our benchmark and get our target performance number (in relative terms, higher is better):

    UFS/DIO/SVM : 42.5
    Separate data/log volumes
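    For the UFS side this layout might look roughly like the sketch below, assuming SVM stripes for data and log with UFS mounted forcedirectio; the metadevice names, slices, interlace and mount points are placeholders, not our exact benchmark configuration:

        # Separate SVM stripes for DB data and DB logs (placeholder slices,
        # 64K interlace chosen arbitrarily).
        metainit d10 1 4 c1t0d0s1 c1t1d0s1 c1t2d0s1 c1t3d0s1 -i 64k
        metainit d20 1 2 c2t0d0s1 c2t1d0s1 -i 64k

        # UFS on top of each metadevice, mounted with directio for the DBMS.
        newfs /dev/md/rdsk/d10
        newfs /dev/md/rdsk/d20
        mkdir -p /db/data /db/log
        mount -F ufs -o forcedirectio /dev/md/dsk/d10 /db/data
        mount -F ufs -o forcedirectio /dev/md/dsk/d20 /db/log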

    Recordsize


    OK, so now we're ready. We load up Solaris 10 Update 2 (S10U2/ZFS), build a log pool and a data pool, and get going. Note that log writers actually generate a pattern of sequential I/O of varying sizes. That should map quite well to ZFS out of the box. But for the DBMS' data pool, we expect a very random pattern of reads and writes to DB records. A commonly known ZFS best practice when servicing fixed-record access is to match the ZFS recordsize property to that of the application. We note that UFS, by chance or by design, also works (at least on SPARC) using 8K records.
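    Continuing the placeholder sketch from above, that translates into a separate log pool plus a recordsize setting on the data pool; only the 8K recordsize mirrors what we actually did, the names are made up:

        # A separate pool for the DBMS logs (placeholder slices).
        zpool create logpool c2t0d0s0 c2t1d0s0

        # Match the data pool's recordsize to the 8K DB block size; the
        # property only applies to files written after it is set, so set
        # it before loading the database.
        zfs set recordsize=8k datapool
        zfs get recordsize datapool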

    2nd run ZFS/S10U2

    So for a fair comparison, we set the recordsize to 8K for the data pool, run our OLTP/Net benchmark and... gasp!:

    ZFS/S10U2 : 11.0
    Data pool (8K recordsize on FS)
    Log pool (no tuning)

    So that's no good and we have our work cut out for us.

    The role of Prefetch in this result

    To some extent we already knew of a subsystem that commonly misbehaves (and which is being fixed as we speak): the vdev-level prefetch code (which I also refer to as the software track buffer). In this code, whenever ZFS issues a small read I/O to a device, it will, by default, go and fetch quite a sizable chunk of data (64K) located at the physical location being read. In itself, this should not increase the I/O latency, which is dominated by the head seek, and since the data is stored in a small fixed-size buffer we don't expect this to eat up too much memory either. However, in a heavy-duty environment like we have here, every extra byte that moves up or down the data channel occupies valuable space. Moreover, for a large DB, we really don't expect the speculatively read data to be used very much. So for our next attempt we'll tune the prefetch buffer down to 8K.

    And the role of the vq_max_pending parameter

    But we don't expect this to be quite sufficient here. My DBMS-savvy friends tell me that the I/O latency of reads was quite large in our runs. Now, ZFS prioritizes reads over writes, so we thought we should be OK. However, during a pool transaction group sync, ZFS will issue quite a number of concurrent writes to each device; that number is governed by the vq_max_pending parameter, which defaults to 35. Clearly, during this phase the read latency, even if prioritized, will take somewhat longer to complete.
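    As a rough sketch of tuning both knobs: my script poked the per-vdev fields of the live kernel, but later ZFS bits expose global tunables that can go in /etc/system. The names below (zfs_vdev_max_pending for the per-device queue depth, zfs_vdev_cache_bshift for the vdev prefetch size) are the ones documented for those later releases and may not exist on every build; a reboot is required for /etc/system changes.

        # Hypothetical /etc/system entries (later-release tunable names;
        # verify they exist on your build before relying on them).
        cat >> /etc/system <<'EOF'
        * Limit concurrent I/Os queued per vdev (default 35)
        set zfs:zfs_vdev_max_pending = 10
        * Shrink the vdev prefetch ("software track buffer") from 64K (2^16)
        * down to 8K (2^13)
        set zfs:zfs_vdev_cache_bshift = 13
        EOF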

    3rd run, ZFS/S10U2 - tuned

    So I wrote up a script to tune those 2 ZFS knobs. We could then run with a vdev prefetch buffer of 8K and a vq_max_pending of 10. This boosted our performance almost 2X:

    ZFS/S10U2 : 22.0
    Data pool (8K recordsize on FS)
    Log pool (no tuning)
    vq_max_pending : 10
    vdev prefetch : 8K

    But not quite satisfying yet.

    ZFS/S10U2 known bug

    We know of something else about ZFS. In the last few builds before S10U2, a little bug made its way into the code base. The effect of this bug was that for a full record rewrite, ZFS would actually read in the old block even though that data is not needed at all. Shouldn't be too bad; perfectly aligned block rewrites of uncached data are not that common... except for databases. Bummer.

    So S10U2 is plagued with this issue affecting DB performance, with no workaround. Our next step was therefore to move on to the latest ZFS bits.

    4th run ZFS/Build 44

    Build 44 of our next Solaris version has long had this particular issue fixed. There we topped our past performance with:

    ZFS/B44 : 33.0
    Data pool (8K recordsize on FS)
    Log pool (no tuning)
    vq_max_pending : 10
    vdev prefetch : 8K

    Compare this to umpteen years of super-tuned UFS:

    UFS/DIO/SVM : 42.5
    Separate data/log volumes

    Summary

    I think at this stage of ZFS, the results are neither great nor bad. We have achieved:

    UFS/DIO   : 100%
    UFS       : xx (no directio, to be updated)
    ZFS best  : 75% (best tuned config with latest bits)
    ZFS S10U2 : 50% (best tuned config)
    ZFS S10U2 : 25% (simple tuning)

    To achieve acceptable performance levels, we needed:

    The latest ZFS code base: ZFS improves fast these days, and we will need to keep tracking releases for a little while. The current OpenSolaris release, as well as the upcoming Solaris 10 Update 3 (this fall), should perform on these tests as well as the Build 44 results shown here.

    One data pool and one log pool: it is common practice to partition HW resources when we want proper isolation. Going forward, I think we will eventually get to the point where this will not be necessary, but it seems an acceptable constraint for now.

    A tuned vdev prefetch: the code is being worked on, and I expect that in the near future this will not be necessary.

    A tuned vq_max_pending: that may take a little longer. In a DB workload, latency is key and throughput secondary. There are a number of ideas that need to be tested which will help ZFS improve on both average latency and latency fluctuations. This will help both the intent log (O_DSYNC write) latency and reads.

    Parting Words

    As those improvements come out, they may well allow ZFS to catch or surpass our best UFS numbers. When you match that kind of performance with all the usability and data integrity features of ZFS, that's a proposition that becomes hard to pass up.

    Posted by Roch on September 22, 2006 at 11:00:14 AM MEST | Comments [6]

    Comments:

    How are the disks partitioned? I think that could be of impact. Is UFS or ZFS first? A hard drive performs best at the beginning of the disk and worse further on. Example: http://www.simplisoftware.com/Public/Content/Pages/Products/Benchmarks/HdTachUsage4.jpg
    Posted by J. Resoort on September 22, 2006 at 12:08 PM MEST #

    We don't think it matters enough to this kind of benchmark, whose performance depends a lot more on random IOPS (head movements) than pure throughput. Q: do we see inner/outer cylinders affecting IOPS capability? Our judgement on this is that it should not matter.
    Posted by Roch on September 22, 2006 at 02:22 PM MEST #

    How do these numbers compare to raw partitions? Rumour has it that UFS/DIO/SVM is 90-95% of raw partitions. I understand that managing raw partitions is a PITA, but if I don't have too many spindles/disks maybe raw partitions are still a viable option. -- prasad
    Posted by Prasad on September 22, 2006 at 06:06 PM MEST #

    J. Resoort, we assigned each slice on a disk on a round-robin basis. So both UFS and ZFS got the same number of inner and outer slices.
    Posted by Neel on September 22, 2006 at 06:46 PM MEST #

    Explain again why it is safe to turn on write caching for a database running on ZFS? What happens if a transaction is committed (but not written to disk because of write caching), and then there is a power cut?
    Posted by AM on September 23, 2006 at 02:02 AM MEST #

    Synchronous writes are handled by the ZIL. The ZIL may well issue a number of concurrent writes to some of the vdevs (which go to the write cache), _but_ it will not return execution control to the application before it has flushed the caches of all the devices in question. When a synchronous write completes, the data is always on stable storage.
    Posted by Roch on September 25, 2006 at 05:24 PM MEST #