    Brendan Gregg's professional blog

    Activity of the ZFS ARC

    Disk I/O is still a common source of performance issues, despite modern cloud environments, modern file systems and huge amounts of main memory serving as file system cache. Understanding how well that cache is working is a key task while investigating disk I/O issues. In this post, I'll show the activity of the ZFS file system Adaptive Replacement Cache (ARC).

    There are often more statistics available than you realize (or have been documented), which may certainly be true with the ARC. Apart from showing these statistics, I'll also show how to extend observability using dynamic tracing (DTrace). These tracing techniques are also applicable to any kernel subsystem. This is an advanced topic, where I'll sometimes dip into kernel code.

    Architecture

    For background on the ZFS ARC, see the paper ARC: A Self-Tuning, Low Overhead Replacement Cache, by Nimrod Megiddo and Dharmendra S. Modha. In a nutshell, the ARC achieves a high cache hit rate by using multiple cache algorithms at the same time: most recently used (MRU) and most frequently used (MFU). Main memory is balanced between these algorithms based on their performance, which is known by maintaining extra metadata (in main memory) to see how each algorithm would perform if it ruled all of memory. Such extra metadata is held on "ghost lists".

    The ZFS ARC has some changes beyond this design, as described in the block comment at the top of uts/common/fs/zfs/arc.c. These changes include the ability to lock pages, vary the size of the cache, and to cache buffers of different sizes.

    Lists

    +------------------------------------------------------------+
    | ZFS ARC                                                     |
    |    +---------------+----------------------------------+     |
    |    | MRU           | MRU ghost                        |     |
    |    +---------------+---------------+------------------+     |
    |    | MFU                           | MFU ghost        |     |
    |    +-------------------------------+------------------+     |
    |    <--- available main memory --->                          |
    |                                                             |
    +------------------------------------------------------------+

    The MRU + MFU lists refer to the data cached in main memory; the MRU ghost + MFU ghost lists consist of themselves only (the metadata) to track algorithm performance.

    This is a simplification to convey the basic principle. The current version of the ZFS ARC splits the lists above into separate data and metadata lists, and also has a list for anonymous buffers and one for L2ARC-only buffers (which I added when I developed the L2ARC).

    The actual lists are these, from arc.c:

    typedef struct arc_state {
            list_t   arcs_list[ARC_BUFC_NUMTYPES];  /* list of evictable buffers */
            uint64_t arcs_lsize[ARC_BUFC_NUMTYPES]; /* amount of evictable data */
            uint64_t arcs_size;                     /* total amount of data in this state */
            kmutex_t arcs_mtx;
    } arc_state_t;

    /* The 6 states: */
    static arc_state_t ARC_anon;
    static arc_state_t ARC_mru;
    static arc_state_t ARC_mru_ghost;
    static arc_state_t ARC_mfu;
    static arc_state_t ARC_mfu_ghost;
    static arc_state_t ARC_l2c_only;


    These lists exhibit MRU- and MFU-like behavior, but aren't strictly MRU/MFU. This can be understood from the lifecycle of an ARC buffer: on the first access, it is created and moved to the head of the MRU list. On the second access, it is moved to the head of the MFU list. On the third access, it moves back to the start of the MFU list. (Other lifecycles are possible; this is just one example.) So, the most recently accessed buffer may be at the start of the MFU list, not the MRU list. And, the most frequently accessed buffer may not be at the very start of the MFU list.

    Locks

    Data exists in the cache as buffers, where the primary structures are the arc_buf_hdr_t (header struct, defined in arc.c) and arc_buf_t (buffer struct, defined in arc.h). Access to these is protected by a hash table based on the 128-bit ZFS data virtual address (DVA). The hash table has 256 buffer chains (BUF_LOCKS, which may vary based on your ZFS version), each protected by a padded lock (to avoid false sharing). From arc.c:

    #define HT_LOCK_PAD     64

    struct ht_lock {
            kmutex_t        ht_lock;
    #ifdef _KERNEL
            unsigned char   pad[(HT_LOCK_PAD - sizeof (kmutex_t))];
    #endif
    };

    #define BUF_LOCKS 256
    typedef struct buf_hash_table {
            uint64_t ht_mask;
            arc_buf_hdr_t **ht_table;
            struct ht_lock ht_locks[BUF_LOCKS];
    } buf_hash_table_t;

    These are optimized for performance since ARC buffers can be accessed, modified and moved between lists frequently.

    For more details on ARC lists and locks, see the block comments in arc.c, and the overview by Joerg Moellenkamp.
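    A cheap, always-on hint of hash lock contention is the mutex_miss counter in the arcstats kstat group (it appears in the full listing later); as I read arc.c, it is bumped when the eviction code fails a mutex_tryenter() on a buffer's hash lock. A minimal sketch, assuming Solaris kstat(1M):

    $ kstat -p zfs:0:arcstats:mutex_miss 1    # cumulative count, sampled every second

    A steadily climbing rate here would be a reason to dig further with the tracing shown later in this post.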

    Sizing

    The ARC grows to fill available memory on the system, on the principle that if there is free memory, use it. It shouldn't do this at the expense of applications, ie, it shouldn't push out application memory (at least, in any large and sustained way). It keeps its size in check via:

    • allocation: once the ARC size has grown to its expected maximum, it will begin evicting buffers during new allocations. There is also some logic in arc_evict() to recycle a buffer of equal size, an optimization to avoid an evict-free-alloc path of the same size.

    • reclaim thread: this is arc_reclaim_thread(), which wakes up every second (or sooner, if signaled via the arc_reclaim_thr_cv condition variable) and attempts to reduce the size of the ARC to the target size. It calls arc_kmem_reap_now() to clean up the kmem caches, and arc_adjust() to resize the ARC lists. If arc_shrink() is called by arc_kmem_reap_now(), the target ARC size is reduced by arc_shrink_shift (or needfree), which means shrinking the ARC by about 3% (see the sketch after this list). If you plot the ARC size, you sometimes see these arc_shrink() steps appearing as teeth on a saw: a sharp drop followed by a gradual increase.
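    To make the 3% concrete, here is a minimal sketch of the arc_shrink() step size, assuming the default arc_shrink_shift of 5 at the time (check your ZFS version) and using the example target size shown later in this post:

    $ c=15256                   # target size in MB (the "c" kstat, shown later)
    $ echo "$c / 2^5" | bc      # one arc_shrink step is roughly c >> arc_shrink_shift
    476

    So on a system with a ~15 Gbyte target, each arc_shrink() step is a drop of roughly 476 Mbytes, which is the saw-tooth step size you would see when plotting the ARC size.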

    This is a brief summary, and includes keywords so you can find the right places in the source to start reading. I should note that the ARC did have sizing issues in the past, where it did seem to push out application memory; those have since been fixed. (One issue was that it didn't account for its own footprint accurately, missing a source of metadata in its size calculation, which meant the ARC was reaping later than it should have.)

    Statistics

    On Solaris-based systems, ARC statistics are available from kstat (kernel statistics), the same resource used by tools such as vmstat(1M) and iostat(1M). kstats are global (entire system, not individual zones) and accessible from non-root users. On the down side, they usually are not documented and are not considered a stable interface.

    On FreeBSD, the same kstats for the ARC are available via sysctl (kstat.zfs.misc.arcstats).
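    For example (a sketch assuming a FreeBSD system with ZFS loaded; names can vary by release), the overall counters can be read with:

    $ sysctl kstat.zfs.misc.arcstats.hits kstat.zfs.misc.arcstats.misses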

    ARC Hit/Miss Rate

    ARC hit or miss rate can be determined from the kstats zfs::arcstats:hits and zfs::arcstats:misses. To watch a rate over time, they can be processed using a little awk (example for Solaris-based systems):

    # cat -n archits.sh
         1  #!/usr/bin/sh
         2
         3  interval=${1:-5}        # 5 secs by default
         4
         5  kstat -p zfs:0:arcstats:hits zfs:0:arcstats:misses $interval | awk '
         6  BEGIN {
         7          printf "%12s %12s %9s\n", "HITS", "MISSES", "HITRATE"
         8  }
         9  /hits/ {
        10          hits = $2 - hitslast
        11          hitslast = $2
        12  }
        13  /misses/ {
        14          misses = $2 - misslast
        15          misslast = $2
        16          rate = 0
        17          total = hits + misses
        18          if (total)
        19                  rate = (hits * 100) / total
        20          printf "%12d %12d %8.2f%%\n", hits, misses, rate
        21  }
        22  '

    This program could be shorter; I've spent some extra lines to write it more clearly. You could also write this in Perl (see my Sun::Solaris::Kstat examples), or C via libkstat.

    $ ./archits.sh 1
            HITS       MISSES   HITRATE
    651329528960    370490565    99.94%
           22600           11    99.95%
           17984            6    99.97%
            8978            8    99.91%
           87041           28    99.97%
           89861           10    99.99%
    [...]

    The first line is the summary since boot, then interval summaries. These counters are system-wide. The hit rate on this system is impressive (99.94% since boot), although hit rates can be misleading. I'm usually studying the MISSES column, as a linear measure of pain.

    arcstat.pl

    Neelakanth Nadgir wrote arcstat.pl (Solaris), which prints various statistics including reads, misses and the size of the ARC. Mike Harsch developed arcstat.pl further, including L2ARC statistics.

    $ ./arcstat.pl 1
        time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  arcsz     c
    04:45:47     0     0      0     0    0     0    0     0    0    14G   14G
    04:45:49   15K    10      0    10    0     0    0     1    0    14G   14G
    04:45:50   23K    81      0    81    0     0    0     1    0    14G   14G
    04:45:51   65K    25      0    25    0     0    0     4    0    14G   14G
    04:45:52   30K    11      0    11    0     0    0     3    0    14G   14G
    [...]

    Instead of hit rates, this tool uses miss rates.

    In Neel's version the first line is the summary since boot; this isn't the case in Mike's current L2ARC version: an extra snap_stats() for an early L2ARC check means that by the time the statistics loop is reached, the first iteration is comparing now with now instead of now with boot.

    Jason Hellenthal has created a FreeBSD version.

    All statistics

    All the kstats from the arcstat group (which feed the tools seen above) can be listed using:

    $ kstat -pn arcstats
    zfs:0:arcstats:c                        15730138449
    zfs:0:arcstats:c_max                    50447089664
    zfs:0:arcstats:c_min                    6305886208
    zfs:0:arcstats:class                    misc
    zfs:0:arcstats:crtime                   95.921230719
    zfs:0:arcstats:data_size                13565817856
    zfs:0:arcstats:deleted                  388469245
    zfs:0:arcstats:demand_data_hits         611277816567
    zfs:0:arcstats:demand_data_misses       258220641
    zfs:0:arcstats:demand_metadata_hits     40050025212
    zfs:0:arcstats:demand_metadata_misses   88523590
    zfs:0:arcstats:evict_skip               5669994
    zfs:0:arcstats:hash_chain_max           20
    zfs:0:arcstats:hash_chains              248783
    zfs:0:arcstats:hash_collisions          2106095400
    zfs:0:arcstats:hash_elements            971654
    zfs:0:arcstats:hash_elements_max        5677254
    zfs:0:arcstats:hdr_size                 188240232
    zfs:0:arcstats:hits                     651328694708
    [...l2arc statistics truncated...]
    zfs:0:arcstats:memory_throttle_count    0
    zfs:0:arcstats:mfu_ghost_hits           55377634
    zfs:0:arcstats:mfu_hits                 649347616033
    zfs:0:arcstats:misses                   370489546
    zfs:0:arcstats:mru_ghost_hits           127477329
    zfs:0:arcstats:mru_hits                 1980639328
    zfs:0:arcstats:mutex_miss               11530337
    zfs:0:arcstats:other_size               1967741376
    zfs:0:arcstats:p                        14713329404
    zfs:0:arcstats:prefetch_data_hits       21342
    zfs:0:arcstats:prefetch_data_misses     20782630
    zfs:0:arcstats:prefetch_metadata_hits   831587
    zfs:0:arcstats:prefetch_metadata_misses 2962685
    zfs:0:arcstats:recycle_miss             27036925
    zfs:0:arcstats:size                     15721799464
    zfs:0:arcstats:snaptime                 29379870.8764106

    More of the activity related statistics will be discussed in the next sections.

    Demand/Prefetch

    Hits and misses can be broken down into four components, such that:

    hits = demand_data_hits + demand_metadata_hits + prefetch_data_hits + prefetch_metadata_hits

    And similar for misses. Prefetch and demand refer to how the ARC request was initiated; data and metadata refer to the type of data requested.

    Prefetch is the ZFS read-ahead feature, to predict and pre-cache blocks for streaming (sequential) workloads. All the prefetch statistics refer to ARC requests that have originated from the ZFS prefetch algorithm, which happens before the ARC and without knowing whether the data is already cached in the ARC. So, a prefetch hit means that ZFS initiated a prefetch, which was then found in the ARC. A prefetch miss happens when that prefetch request was not in the ARC, and so initiated a disk I/O request (normal behavior). Demand is the opposite of prefetch: direct requests to the ARC, not predicted requests.

    Another way to understand prefetch statistics is to follow the code. In dbuf.c, see the ARC_PREFETCH flag set in dbuf_prefetch(), which is then checked in arc.c via the ARCSTAT_CONDSTAT macro to determine which kstat to increment.

    You can also add these up in other ways; eg:

    streaming ratio = prefetch_* / (hits + misses)

    At least, that identifies the ratio of the workload that ZFS has identified as streaming. This can be turned into a kstat tool (awk/Perl/C), as with hits/misses earlier, to show both the summary since boot and interval summaries (current activity).
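    Here is a minimal sketch of the since-boot version (the script name and awk logic are mine, not from an existing tool; it assumes Solaris kstat(1M) and the arcstats statistic names shown above):

    #!/usr/bin/sh
    # streamratio.sh: since-boot streaming ratio, prefetch_* / (hits + misses)
    kstat -p zfs:0:arcstats | awk '
    /arcstats:prefetch_/                    { prefetch += $2 }
    /arcstats:hits|arcstats:misses/         { total += $2 }
    END {
            if (total)
                    printf "streaming ratio: %.2f%%\n", prefetch * 100 / total
    }'

    An interval version would keep the previous values and print deltas, as archits.sh did earlier.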

    Data/Metadata

    Metadata describes the ZFS dataset (file system or volume) and the objects within it. The data is the contents of those objects, including file, directory and volume blocks.

    metadata ratio = *_metadata_* / (hits + misses)

    This may be useful when considering the effect of picking a small recsize setting (thus increasing metadata), or the effect of setting primarycache to metadata only.
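    The same since-boot one-liner approach works here (again a sketch, with the same assumptions as the streaming ratio example):

    $ kstat -p zfs:0:arcstats | awk '
    /metadata_hits|metadata_misses/         { meta += $2 }
    /arcstats:hits|arcstats:misses/         { total += $2 }
    END { if (total) printf "metadata ratio: %.2f%%\n", meta * 100 / total }'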

    Others

    Some other activity related kstats worth mentioning for the ARC:

    • mru_hits, mru_ghost_hits, mfu_hits, mfu_ghost_hits, p: Comparing the mru_hits and mfu_hits statistics with misses can determine the performance of each ARC list type (it's not comparing the performance of the MRU/MFU algorithms alone, since these aren't strictly MRU/MFU, as mentioned in Architecture). By adding _hits + _ghost_hits for each type, and then comparing the ratio of each type over time, you can also identify if the workload changes in terms of ARC MRU/MFU. And you can see how quickly the ARC adapts to the workload, by watching the p statistic (ARC parameter) change; see the sketch after this list.

    • hash_chain_max, hash_collisions: These show how well the DVA hash table is hashing. hash_chain_max is the longest chain length seen, when DVAs hash to the same table entry, and is usually less than 10. If it were much higher, performance may degrade as the hash locks are held longer while the chains are walked, assuming the max is representative and not an anomaly caused by some short event. This could be double-checked by studying the hash_collisions rate. If an issue is found, the number of hash table entries (BUF_LOCKS) could be increased in arc.c and ZFS recompiled (this isn't a regular tunable), although I wouldn't expect to need to tune this for a while.
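    A simple way to watch the adaptation (a sketch using kstat(1M) directly; computing per-interval deltas would need the awk treatment shown earlier):

    $ kstat -p zfs:0:arcstats:p zfs:0:arcstats:c \
            zfs:0:arcstats:mru_hits zfs:0:arcstats:mru_ghost_hits \
            zfs:0:arcstats:mfu_hits zfs:0:arcstats:mfu_ghost_hits 1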

    Other kstats in the arcstats group describe sizes of the ARC, and the L2ARC.

    arc_summary.pl

    Another Perl Sun::Solaris::Kstat-based ARC tool worth mentioning is Ben Rockwood's arc_summary.pl, which prints a neat summary of the hit/miss rate and many of the other counters. Jason Hellenthal also ported the tool to FreeBSD.

    $ ./arc_summary.pl
    System Memory:
             Physical RAM:  49134 MB
             Free Memory :  1925 MB
             LotsFree:      767 MB

    ZFS Tunables (/etc/system):
             set zfs:zil_disable=1
             set zfs:zfs_prefetch_disable=1
             set zfs:zfs_nocacheflush=1

    ARC Size:
             Current Size:             15172 MB (arcsize)
             Target Size (Adaptive):   15256 MB (c)
             Min Size (Hard Limit):    6013 MB (zfs_arc_min)
             Max Size (Hard Limit):    48110 MB (zfs_arc_max)

    ARC Size Breakdown:
             Most Recently Used Cache Size:   77%    11865 MB (p)
             Most Frequently Used Cache Size: 22%    3391 MB (c-p)

    ARC Efficency:
             Cache Access Total:        654018720316
             Cache Hit Ratio:     99%   653646329407  [Defined State for buffer]
             Cache Miss Ratio:     0%   372390909     [Undefined State for Buffer]
             REAL Hit Ratio:      99%   653645890054  [MRU/MFU Hits Only]

             Data Demand   Efficiency:  99%
             Data Prefetch Efficiency:   0%

            CACHE HITS BY CACHE LIST:
              Anon:                       --%   Counter Rolled.
              Most Recently Used:          0%   1989696958 (mru)      [ Return Customer ]
              Most Frequently Used:       99%   651656193096 (mfu)    [ Frequent Customer ]
              Most Recently Used Ghost:    0%   128471495 (mru_ghost) [ Return Customer Evicted, Now Back ]
              Most Frequently Used Ghost:  0%   55618357 (mfu_ghost)  [ Frequent Customer Evicted, Now Back ]
            CACHE HITS BY DATA TYPE:
              Demand Data:                93%   613371468593
              Prefetch Data:               0%   21342
              Demand Metadata:             6%   40274007879
              Prefetch Metadata:           0%   831593
            CACHE MISSES BY DATA TYPE:
              Demand Data:                69%   259735783
              Prefetch Data:               5%   20782630
              Demand Metadata:            23%   88909678
              Prefetch Metadata:           0%   2962818
    ---------------------------------------------

    Percentages and raw counters are provided, and the four breakdowns of hit/miss statistics (which I documented above; Ben's been bugging me to document the arcstats for a while).

    Tracing

    Apart from statistics, the activity of the ARC can also be observed by tracing function points and probes in the kernel. While statistics are always enabled and collected, tracing is enabled when needed, and costs much higher overhead. This overhead is relative to the frequency of the traced events, which for the ARC can be very frequent (hundreds of thousands of events per second). I usually only trace the ARC for short periods (seconds or minutes) to gather debug data.

    There isn't a stable DTrace provider for the ARC (and there probably never will be; other areas make much more sense), but there are sdt-provider probes in the ARC code:

    # dtrace -ln 'sdt:zfs::arc-*'
       ID   PROVIDER            MODULE                          FUNCTION NAME
    19307        sdt               zfs                   arc_read_nolock arc-miss
    19310        sdt               zfs                   arc_evict_ghost arc-delete
    19311        sdt               zfs                         arc_evict arc-evict
    19312        sdt               zfs                   arc_read_nolock arc-hit
    19313        sdt               zfs                   arc_buf_add_ref arc-hit

    If these didn't exist, you could use the fbt provider. I'd begin by inspecting the functions listed in the FUNCTION column.
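    For example, the dynamic fbt probes for the ARC functions can be listed with (the exact probe list depends on your kernel build):

    # dtrace -ln 'fbt:zfs:arc_*:entry'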

    Note that neither of these providers (sdt or fbt) is available from within Solaris zones; these must be traced from the global zone. They are also both considered unstable interfaces, meaning the one-liners and scripts that follow may not work on future versions of the ARC without maintenance to match the code changes.

    ARC accesses by application

    Checking which applications are (directly) using the ARC:

    # dtrace -n 'sdt:zfs::arc-hit,sdt:zfs::arc-miss { @[execname] = count() }'
    dtrace: description 'sdt:zfs::arc-hit,sdt:zfs::arc-miss ' matched 3 probes
    ^C
      sendmail                                                          1
      qmgr                                                              3
    [...]
      nscd                                                             81
      httpd                                                           243
      imapd                                                          1417
      python2.6                                                      2572
      awstats.pl                                                     4285
      php                                                            6934
      mysqld                                                       105901

    This frequency counts the execname during ARC access. mysqld was the heaviest user, with 105,901 accesses while tracing.

    The kernel will show up as sched, for activities including ZFS transaction group flushes (TXG flush).

    ARC accesses by kernel call path

    For more details on why the ARC is being accessed, the kernel calling stack can be frequency counted:

    # dtrace -n 'sdt:zfs::arc-hit,sdt:zfs::arc-miss { @[execname, probefunc, stack()] = count(); }'
    dtrace: description 'sdt:zfs::arc-hit,sdt:zfs::arc-miss ' matched 3 probes
    ^C
    [...]
      sched            arc_buf_add_ref
                  zfs`dbuf_hold_impl+0xea
                  zfs`dbuf_hold+0x2e
                  zfs`dmu_buf_hold+0x75
                  zfs`zap_lockdir+0x67
                  zfs`zap_update+0x5b
                  zfs`uidacct+0xc4
                  zfs`zfs_space_delta_cb+0x112
                  zfs`dmu_objset_do_userquota_callbacks+0x151
                  zfs`dsl_pool_sync+0xfe
                  zfs`spa_sync+0x32b                        <- spa sync
                  zfs`txg_sync_thread+0x265
                  unix`thread_start+0x8
                   26
    [...]
      python2.6        arc_buf_add_ref
                  zfs`dbuf_hold_impl+0xea
                  zfs`dbuf_hold+0x2e
                  zfs`dmu_buf_hold+0x75
                  zfs`zap_get_leaf_byblk+0x56
                  zfs`zap_deref_leaf+0x78
                  zfs`fzap_cursor_retrieve+0xa7
                  zfs`zap_cursor_retrieve+0x152
                  zfs`zfs_readdir+0x2b8
                  genunix`fop_readdir+0xab                  <- read directory
                  genunix`getdents64+0xbc
                  unix`_sys_sysenter_post_swapgs+0x149
                 2130
    [...]
      mysqld           arc_buf_add_ref
                  zfs`dbuf_hold_impl+0xea
                  zfs`dbuf_hold+0x2e
                  zfs`dmu_buf_hold_array_by_dnode+0x1a7
                  zfs`dmu_buf_hold_array+0x71
                  zfs`dmu_read_uio+0x4d
                  zfs`zfs_read+0x19a
                  genunix`fop_read+0x6b                     <- read
                  genunix`read+0x2b8
                  genunix`read32+0x22
                  unix`_sys_sysenter_post_swapgs+0x149
               101955

    The output was many pages long; I've truncated it to include a few different stacks, and added annotations.

    ARC misses by user-land call path

    Here's another view of ARC access call paths, this time for misses only, and the user-land stack trace that led to the miss. I've filtered on mysqld processes only:

    # dtrace -n 'sdt:zfs::arc-miss /execname == "mysqld"/ { @[execname, probefunc, ustack()] = count(); }'
    dtrace: description 'sdt:zfs::arc-miss ' matched 1 probe
    ^C
    [...]
      mysqld           arc_read_nolock
                  libc.so.1`__read+0x15
                  mysqld`my_read+0x43
                  mysqld`_Z7openfrmP3THDPKcS2_jjjP8st_table+0x95
                  mysqld`_ZL17open_unireg_entryP3THDP8st_tablePKcS4_S4_P10TABLE_LISTP1...
                  mysqld`_Z10open_tableP3THDP10TABLE_LISTP11st_mem_rootPbj+0x6d7
                  mysqld`_Z11open_tablesP3THDPP10TABLE_LISTPjj+0x1b0
                  mysqld`_Z30open_normal_and_derived_tablesP3THDP10TABLE_LISTj+0x1b
                  mysqld`_Z14get_all_tablesP3THDP10TABLE_LISTP4Item+0x73b
                  mysqld`_Z24get_schema_tables_resultP4JOIN23enum_schema_table_state+0x18
                  mysqld`_ZN4JOIN4execEv+0x59e
                  mysqld`_Z12mysql_selectP3THDPPP4ItemP10TABLE_LISTjR4ListIS1_ES2_jP8s...
                  mysqld`_Z13handle_selectP3THDP6st_lexP13select_resultm+0x102
                  mysqld`_Z21mysql_execute_commandP3THD+0x51c6
                  mysqld`_Z11mysql_parseP3THDPKcjPS2_+0x1be
                  mysqld`_Z16dispatch_command19enum_server_commandP3THDPcj+0x895
                  mysqld`handle_one_connection+0x318
                  libc.so.1`_thrp_setup+0x7e
                  libc.so.1`_lwp_start
                  124

    The kernel stack trace could be included as well, showing the complete call path from user-land to a kernel event.

    ARC access sizes

    Digging a bit deeper; the sdt probes used previously were declared as:

    DTRACE_PROBE1(arc__hit, arc_buf_hdr_t *, hdr);

    Which means arg0 is an arc_buf_hdr_t. It's declared in arc.c, and contains various members including:

    struct arc_buf_hdr {
    [...]
            arc_buf_t               *b_buf;
            uint32_t                b_flags;
    [...]
            arc_buf_contents_t      b_type;
            uint64_t                b_size;
            uint64_t                b_spa;
    [...]
            clock_t                 b_arc_access;
    [...]

    Let's pick out the size, and trace ARC accesses by buffer size:

    # dtrace -n 'sdt:zfs::arc-hit,sdt:zfs::arc-miss { @["bytes"] = quantize(((arc_buf_hdr_t *)arg0)->b_size); }'
    dtrace: description 'sdt:zfs::arc-hit,sdt:zfs::arc-miss ' matched 3 probes
    ^C

      bytes
               value  ------------- Distribution ------------- count
                   0 |                                         0
                   1 |                                         82
                   2 |                                         1
                   4 |                                         0
                   8 |                                         1
                  16 |                                         1
                  32 |                                         1
                  64 |                                         1
                 128 |                                         0
                 256 |                                         0
                 512 |@                                        1526
                1024 |                                         605
                2048 |                                         780
                4096 |                                         913
                8192 |@                                        1094
               16384 |@@                                       4386
               32768 |                                         618
               65536 |@@                                       4196
              131072 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@        60811
              262144 |                                         0

    Here I used a power-of-2 quantization, which showed that most of the buffers were in the 128 Kbyte range (which is also the default recsize for the datasets on this system). Smaller buffers will exist due to cases including files and directories that are smaller than 128k.

    Other members of arc_buf_hdr_t can be retrieved and inspected in similar ways.
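    For example, accesses can be broken down by buffer type using the b_type member (a sketch: I'm assuming ARC_BUFC_DATA is enum value 0, as it was in arc.h at the time):

    # dtrace -n 'sdt:zfs::arc-hit,sdt:zfs::arc-miss {
        @[((arc_buf_hdr_t *)arg0)->b_type == 0 ? "data" : "metadata"] = count(); }'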

    ARC buffer age

    Here's a neat use of the b_arc_access member, which tracks the time that the buffer was last accessed in terms of clock ticks. This time the fbt provider is used, to trace arc_access() before and after it updates b_arc_access:

    # cat -n arcaccess.d
         1  #!/usr/sbin/dtrace -s
         2
         3  #pragma D option quiet
         4
         5  dtrace:::BEGIN
         6  {
         7          printf("lbolt rate is %d Hertz.\n", `hz);
         8          printf("Tracing lbolts between ARC accesses...");
         9  }
        10
        11  fbt::arc_access:entry
        12  {
        13          self->ab = args[0];
        14          self->lbolt = args[0]->b_arc_access;
        15  }
        16
        17  fbt::arc_access:return
        18  /self->lbolt/
        19  {
        20          @ = quantize(self->ab->b_arc_access - self->lbolt);
        21          self->ab = 0;
        22          self->lbolt = 0;
        23  }

    Running for 10 seconds:

    # ./arcaccess.d -n 'tick-10s { exit(0); }'
    lbolt rate is 100 Hertz.
    Tracing lbolts between ARC accesses...

               value  ------------- Distribution ------------- count
                  -1 |                                         0
                   0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 729988
                   1 |                                         3805       <- 10 ms
                   2 |                                         3038
                   4 |                                         2028
                   8 |                                         1428
                  16 |                                         1398
                  32 |                                         1618
                  64 |                                         2883       <- 1 second
                 128 |                                         738
                 256 |                                         681
                 512 |                                         338
                1024 |                                         569
                2048 |                                         166
                4096 |                                         607        <- 1 minute
                8192 |                                         632
               16384 |                                         808
               32768 |                                         373
               65536 |                                         110
              131072 |                                         142
              262144 |                                         39         <- 1 hour
              524288 |                                         5
             1048576 |                                         97
             2097152 |                                         10
             4194304 |                                         44
             8388608 |                                         617        <- 1 day
            16777216 |                                         1
            33554432 |                                         0

    This is interesting data. It shows that most buffers were accessed less than one clock tick apart (10 ms), with 729,988 accesses in the 0 to 1 tick range. The oldest buffer accessed was in the 16777216+ range, which (converting lbolts at 100 Hertz into time) means it is at least 46 hours since its last access. The above output has been annotated to show where times fall in the lbolt ranges (eg, 1 second falls in the 64-127 lbolt range).

    This gives us an insight into the age of the oldest buffers in the ARC (at least, in terms of access time, not birth), and into its churn rate. This particular ARC is 25 Gbytes, and has been running with a 99.94% hit rate as shown earlier, which may be less surprising now that we know it is so large that it can contain buffers accessed 40+ hours apart.
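    The conversion is simple arithmetic, if you want to check it:

    $ echo "16777216 / 100 / 3600" | bc -l     # lbolts / Hz / seconds-per-hour
    46.60337777777777777777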

    ARC hash lock

    To get a handle on ARC hash lock contention (instead of using more heavyweight tools like lockstat(1M)), you can try tracing the time for arc_buf_add_ref(), since it grabs the buffer hash lock:

    # dtrace -n 'arc_buf_add_ref:entry { self->s = timestamp; } arc_buf_add_ref:return /self->s/ { @["ns"] = quantize(timestamp - self->s); self->s = 0; }'
    dtrace: description 'arc_buf_add_ref:entry ' matched 2 probes
    ^C

      ns
               value  ------------- Distribution ------------- count
                 256 |                                         0
                 512 |@                                        2123
                1024 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@         44784
                2048 |@@@@@                                    7556
                4096 |@@                                       2267
                8192 |                                         385
               16384 |                                         4
               32768 |                                         0
               65536 |                                         1
              131072 |                                         0

    Most of the times were in the 1-2 us range, with only a single occurrence passing 65 us.


    ARC reap

    Here's a simple script to provide insight into the ARC reclaim thread: an asynchronous task that keeps the size of the ARC in check.

    # cat -n arcreap.d
         1  #!/usr/sbin/dtrace -s
         2
         3  fbt::arc_kmem_reap_now:entry,
         4  fbt::arc_adjust:entry
         5  {
         6          self->start[probefunc] = timestamp;
         7  }
         8
         9  fbt::arc_shrink:entry
        10  {
        11          trace("called");
        12  }
        13
        14  fbt::arc_kmem_reap_now:return,
        15  fbt::arc_adjust:return
        16  /self->start[probefunc]/
        17  {
        18          printf("%Y %d ms", walltimestamp,
        19              (timestamp - self->start[probefunc]) / 1000000);
        20          self->start[probefunc] = 0;
        21  }

    Different functions are traced: arc_kmem_reap_now(), to see the time taken to reap the ARC kmem caches; arc_adjust(), for resizing the ARC lists; and arc_shrink(), to know when the ARC size has been stepped down (this isn't timed, since any real work will be done by arc_adjust()).

    # ./arcreap.d
    dtrace: script './arcreap.d' matched 5 probes
    CPU     ID                    FUNCTION:NAME
      0  64929                 arc_shrink:entry   called
      0  62414                 arc_adjust:return  2012 Jan  9 23:10:01 18 ms
      9  62420          arc_kmem_reap_now:return  2012 Jan  9 23:10:03 1511 ms
      0  62414                 arc_adjust:return  2012 Jan  9 23:10:24 0 ms
      6  62414                 arc_adjust:return  2012 Jan  9 23:10:49 0 ms

    This isn't the only way the ARC keeps its size sane; it will also evict/recycle buffers during allocation, as mentioned in the Sizing section. This reclaim thread is the more aggressive method, so if you have occasional odd ARC behavior it may be handy to check whether it is related to reclaims.

    Evicts by list and type

    Tracing the function that does eviction, with details:

    # cat -n arcevict.d
         1  #!/usr/sbin/dtrace -s
         2
         3  #pragma D option quiet
         4
         5  dtrace:::BEGIN
         6  {
         7          trace("Tracing ARC evicts...\n");
         8  }
         9
        10  fbt::arc_evict:entry
        11  {
        12          printf("%Y %-10a %-10s %-10s %d bytes\n", walltimestamp, args[0],
        13              arg4 == 0 ? "data" : "metadata",
        14              arg3 == 0 ? "evict" : "recycle", arg2);
        15  }

    Sample output:

    # ./arcevict.d
    Tracing ARC evicts...
    2012 Jan  8 08:13:03 zfs`ARC_mru   data       evict      812181411 bytes
    2012 Jan  8 08:13:03 zfs`ARC_mfu   data       evict      5961212 bytes
    2012 Jan  8 08:13:03 zfs`ARC_mfu   data       recycle    131072 bytes
    2012 Jan  8 08:13:04 zfs`ARC_mfu   data       recycle    131072 bytes
    2012 Jan  8 08:13:07 zfs`ARC_mfu   data       recycle    131072 bytes
    2012 Jan  8 08:13:07 zfs`ARC_mfu   data       recycle    131072 bytes
    2012 Jan  8 08:13:08 zfs`ARC_mfu   metadata   recycle    16384 bytes
    2012 Jan  8 08:13:08 zfs`ARC_mfu   data       recycle    131072 bytes
    [...]

    The output begins by catching an 800 Mbyte evict from the ARC MRU data list, followed by a 6 Mbyte evict from the MFU data list. After that, buffers were evicted via the recycle code path, which recycles buffers when the ARC is getting full instead of allocating new ones.


    To understand (and maintain) the arg mappings above, see the invocations of arc_evict() in arc.c. Eg, from arc_adjust():

    if (adjustment > 0 && arc_mru->arcs_lsize[ARC_BUFC_DATA] > 0) {
            delta = MIN(arc_mru->arcs_lsize[ARC_BUFC_DATA], adjustment);
            (void) arc_evict(arc_mru, NULL, delta, FALSE, ARC_BUFC_DATA);
            adjustment -= delta;
    }

    This is the first arc_evict() in arc_adjust(), which is why the ARC MRU data list is hit up first.

    And more

    The previous tracing examples show the sort of additional information that can be obtained using static tracing (the sdt provider) and dynamic tracing (the fbt provider). With dynamic tracing, a lot more can be seen as needed. Every function that makes up the ARC can be traced, along with its arguments.

    One detail that is actually difficult to trace is the file names during ARC accesses, since vnode pointers are not passed down to the ARC layer. It is possible, and has been done before (I don't have an example on hand, though). You could more easily cache them from upper layers (eg, VFS; see the sample chapter from the DTrace book).
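    As a rough sketch of the upper-layer approach (this shows file names for ZFS reads at the VFS interface, not ARC accesses, and assumes a Solaris-style zfs_read() whose first argument is a vnode_t with a populated v_path):

    # dtrace -n 'fbt::zfs_read:entry { @[execname, args[0]->v_path == NULL ?
        "<unknown>" : stringof(args[0]->v_path)] = count(); }'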

    Conclusion

    In this post, I examined ZFS ARC activity in detail, starting with statistics provided by kstat and then tracing provided by DTrace. Apart from calculating hit and miss rates, I discussed other statistics including prefetch and metadata ratios. I then used tracing to observe information from the ARC including who is using the ARC and why, ARC buffer sizes, the age of the ARC buffers, lock contention timings and eviction details. More can be traced as needed: ZFS with DTrace provides great performance and observability.

    I've spent much time on kernel internals, but I haven't really blogged about the deeper areas. I'm trying to change that, at least occasionally, starting with this post on ARC activity. I hope it is useful.

    Thanks to the original ZFS team, especially Mark Maybee, for writing the ARC and explaining details to me, and to Bryan Cantrill for kstat-ifying the ARC statistics and creating DTrace.

    Posted on January 9, 2012 at 4:50 pm by Brendan Gregg

    In: Kernel. Tagged with: ARC, dtrace, performance, ZFS

    3 Responses


    1. Written by Richard Elling on January 10, 2012 at 10:34 am

    Spent a lot of time last summer looking at evictions and their impact on the system. These do not scale well with the size of memory in the system. For example, in the arcevict.d data above, evicting 800 MBytes is not a big deal, but evicting 8 GBytes is a big deal. Look for some illumos putbacks in this area RSN :-)

    2. Written by mic on January 11, 2012 at 2:46 am

    Thanks Brendan, great blog.

    3. Written by Kyle Hailey on January 19, 2012 at 11:01 pm

    Awesome to see all this information on ARC analysis. Till now I've found information on these stats pretty sparse. Thanks!
    - Kyle
