    Brendan Gregg's professional blog

    Activity of the ZFS ARC

    Disk I/O is still a common source of performance issues, despite modern cloud environments, modern file systems and huge amounts of main memory serving as file system cache. Understanding how well that cache is working is a key task while investigating disk I/O issues. In this post, I'll show the activity of the ZFS file system Adaptive Replacement Cache (ARC).

    There are often more statistics available than you realize (or have been documented), which may certainly be true with the ARC. Apart from showing these statistics, I'll also show how to extend observability using dynamic tracing (DTrace). These tracing techniques are also applicable to any kernel subsystem. This is an advanced topic, where I'll sometimes dip into kernel code.

    Architecture

    For background on the ZFS ARC, see the paper ARC: A Self-Tuning, Low Overhead Replacement Cache, by Nimrod Megiddo and Dharmendra S. Modha. In a nutshell, the ARC achieves a high cache hit rate by using multiple cache algorithms at the same time: most recently used (MRU) and most frequently used (MFU). Main memory is balanced between these algorithms based on their performance, which is known by maintaining extra metadata (in main memory) to see how each algorithm would perform if it ruled all of memory. Such extra metadata is held on "ghost lists".

    The ZFS ARC has some changes beyond this design, as described in the block comment at the top of uts/common/fs/zfs/arc.c. These changes include the ability to lock pages, vary the size of the cache, and to cache buffers of different sizes.

    Lists

    +------------------------------------------------------------+
    | ZFS ARC                                                     |
    |    +---------------+----------------------------------+     |
    |    | MRU           | MRU ghost                        |     |
    |    +---------------+---------------+------------------+     |
    |    | MFU                           | MFU ghost        |     |
    |    +-------------------------------+------------------+     |
    |    <--- available main memory --->                          |
    |                                                             |
    +------------------------------------------------------------+

    The MRU + MFU lists refer to the data cached in main memory; the MRU ghost + MFU ghost lists consist of themselves only (the metadata) to track algorithm performance.

    This is a simplification to convey the basic principle. The current version of the ZFS ARC splits the lists above into separate data and metadata lists, and also has a list for anonymous buffers and one for L2ARC-only buffers (which I added when I developed the L2ARC).

    The actual lists are these, from arc.c:

    typedef struct arc_state {
            list_t   arcs_list[ARC_BUFC_NUMTYPES];  /* list of evictable buffers */
            uint64_t arcs_lsize[ARC_BUFC_NUMTYPES]; /* amount of evictable data */
            uint64_t arcs_size;                     /* total amount of data in this state */
            kmutex_t arcs_mtx;
    } arc_state_t;

    /* The 6 states: */
    static arc_state_t ARC_anon;
    static arc_state_t ARC_mru;
    static arc_state_t ARC_mru_ghost;
    static arc_state_t ARC_mfu;
    static arc_state_t ARC_mfu_ghost;
    static arc_state_t ARC_l2c_only;


    These lists exhibit MRU- and MFU-like behavior, but aren't strictly MRU/MFU. This can be understood from the lifecycle of an ARC buffer: on the first access, it is created and moved to the head of the MRU list. On the second access, it is moved to the head of the MFU list. On the third access, it moves back to the start of the MFU list. (Other lifecycles are possible; this is just one example.) So, the most recently accessed buffer may be at the start of the MFU list, not the MRU list. And, the most frequently accessed buffer may not be at the very start of the MFU list.

    Locks

    Data exists in the cache as buffers, where the primary structures are the arc_buf_hdr_t (header struct, defined in arc.c) and arc_buf_t (buffer struct, defined in arc.h). Access to these is protected by a hash table based on the 128-bit ZFS data virtual address (DVA). The hash table has 256 buffer chains (BUF_LOCKS, which may vary based on your ZFS version), each protected by a padded lock (to avoid false sharing). From arc.c:

    #define HT_LOCK_PAD     64

    struct ht_lock {
            kmutex_t        ht_lock;
    #ifdef _KERNEL
            unsigned char   pad[(HT_LOCK_PAD - sizeof (kmutex_t))];
    #endif
    };

    #define BUF_LOCKS 256
    typedef struct buf_hash_table {
            uint64_t ht_mask;
            arc_buf_hdr_t **ht_table;
            struct ht_lock ht_locks[BUF_LOCKS];
    } buf_hash_table_t;

    These are optimized for performance since ARC buffers can be accessed, modified and moved between lists frequently.

    For more details on ARC lists and locks, see the block comments in arc.c, and the overview by Joerg Moellenkamp.
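    A cheap, always-on hint of hash lock contention is the mutex_miss counter in the arcstats kstat group (it appears in the full listing later); as I read arc.c, it is bumped when the eviction code fails a mutex_tryenter() on a buffer's hash lock. A minimal sketch, assuming Solaris kstat(1M):

    $ kstat -p zfs:0:arcstats:mutex_miss 1    # cumulative count, sampled every second

    A steadily climbing rate here would be a reason to dig further with the tracing shown later in this post.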

    Sizing

    The ARC grows to fill available memory on the system, on the principle that if there is free memory, use it. It shouldn't do this at the expense of applications, ie, it shouldn't push out application memory (at least, in any large and sustained way). It keeps its size in check via:

    • allocation: once the ARC size has grown to its expected maximum, it will begin evicting buffers during new allocations. There is also some logic in arc_evict() to recycle a buffer of equal size, an optimization to avoid an evict-free-alloc path of the same size.

    • reclaim thread: this is arc_reclaim_thread(), which wakes up every second (or sooner, if signaled via the arc_reclaim_thr_cv condition variable) and attempts to reduce the size of the ARC to the target size. It calls arc_kmem_reap_now() to clean up the kmem caches, and arc_adjust() to resize the ARC lists. If arc_shrink() is called by arc_kmem_reap_now(), the target ARC size is reduced by arc_shrink_shift (or needfree), which means shrinking the ARC by about 3% (see the sketch after this list). If you plot the ARC size, you sometimes see these arc_shrink() steps appearing as teeth on a saw: a sharp drop followed by a gradual increase.
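    To make the 3% concrete, here is a minimal sketch of the arc_shrink() step size, assuming the default arc_shrink_shift of 5 at the time (check your ZFS version) and using the example target size shown later in this post:

    $ c=15256                   # target size in MB (the "c" kstat, shown later)
    $ echo "$c / 2^5" | bc      # one arc_shrink step is roughly c >> arc_shrink_shift
    476

    So on a system with a ~15 Gbyte target, each arc_shrink() step is a drop of roughly 476 Mbytes, which is the saw-tooth step size you would see when plotting the ARC size.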

    This is a brief summary, and includes keywords so you can find the right places in the source to start reading. I should note that the ARC did have sizing issues in the past, where it did seem to push out application memory; those have since been fixed. (One issue was that it didn't account for its own footprint accurately, missing a source of metadata in its size calculation, which meant the ARC was reaping later than it should have.)

    Statistics

    On Solaris-based systems, ARC statistics are available from kstat (kernel statistics), the same resource used by tools such as vmstat(1M) and iostat(1M). kstats are global (entire system, not individual zones) and accessible from non-root users. On the down side, they usually are not documented and are not considered a stable interface.

    On FreeBSD, the same kstats for the ARC are available via sysctl (kstat.zfs.misc.arcstats).
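    For example (a sketch assuming a FreeBSD system with ZFS loaded; names can vary by release), the overall counters can be read with:

    $ sysctl kstat.zfs.misc.arcstats.hits kstat.zfs.misc.arcstats.misses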

    ARC Hit/Miss Rate

    ARC hit or miss rate can be determined from the kstats zfs::arcstats:hits and zfs::arcstats:misses. To watch a rate over time, they can be processed using a little awk (example for Solaris-based systems):

    # cat -n archits.sh
         1  #!/usr/bin/sh
         2
         3  interval=${1:-5}        # 5 secs by default
         4
         5  kstat -p zfs:0:arcstats:hits zfs:0:arcstats:misses $interval | awk '
         6  BEGIN {
         7          printf "%12s %12s %9s\n", "HITS", "MISSES", "HITRATE"
         8  }
         9  /hits/ {
        10          hits = $2 - hitslast
        11          hitslast = $2
        12  }
        13  /misses/ {
        14          misses = $2 - misslast
        15          misslast = $2
        16          rate = 0
        17          total = hits + misses
        18          if (total)
        19                  rate = (hits * 100) / total
        20          printf "%12d %12d %8.2f%%\n", hits, misses, rate
        21  }
        22  '

    This program could be shorter; I've spent some extra lines to write it more clearly. You could also write this in Perl (see my Sun::Solaris::Kstat examples), or C via libkstat.

    $ ./archits.sh 1
            HITS       MISSES   HITRATE
    651329528960    370490565    99.94%
           22600           11    99.95%
           17984            6    99.97%
            8978            8    99.91%
           87041           28    99.97%
           89861           10    99.99%
    [...]

    The first line is the summary since boot, then interval summaries. These counters are system-wide. The hit rate on this system is impressive (99.94% since boot), although hit rates can be misleading. I'm usually studying the MISSES column, as a linear measure of pain.

    arcstat.pl

    Neelakanth Nadgir wrote arcstat.pl (Solaris), which prints various statistics including reads, misses and the size of the ARC. Mike Harsch developed arcstat.pl further, including L2ARC statistics.

    $ ./arcstat.pl 1
        time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  arcsz     c
    04:45:47     0     0      0     0    0     0    0     0    0    14G   14G
    04:45:49   15K    10      0    10    0     0    0     1    0    14G   14G
    04:45:50   23K    81      0    81    0     0    0     1    0    14G   14G
    04:45:51   65K    25      0    25    0     0    0     4    0    14G   14G
    04:45:52   30K    11      0    11    0     0    0     3    0    14G   14G
    [...]

    Instead of hit rates, this tool uses miss rates.

    In Neel's version the first line is the summary since boot; this isn't the case in Mike's current L2ARC version: an extra snap_stats() for an early L2ARC check means that by the time the statistics loop is reached, the first iteration is comparing now with now instead of now with boot.

    Jason Hellenthal has created a FreeBSD version.

    All statistics

    All the kstats from the arcstat group (which feed the tools seen above) can be listed using:

    $ kstat -pn arcstats
    zfs:0:arcstats:c                        15730138449
    zfs:0:arcstats:c_max                    50447089664
    zfs:0:arcstats:c_min                    6305886208
    zfs:0:arcstats:class                    misc
    zfs:0:arcstats:crtime                   95.921230719
    zfs:0:arcstats:data_size                13565817856
    zfs:0:arcstats:deleted                  388469245
    zfs:0:arcstats:demand_data_hits         611277816567
    zfs:0:arcstats:demand_data_misses       258220641
    zfs:0:arcstats:demand_metadata_hits     40050025212
    zfs:0:arcstats:demand_metadata_misses   88523590
    zfs:0:arcstats:evict_skip               5669994
    zfs:0:arcstats:hash_chain_max           20
    zfs:0:arcstats:hash_chains              248783
    zfs:0:arcstats:hash_collisions          2106095400
    zfs:0:arcstats:hash_elements            971654
    zfs:0:arcstats:hash_elements_max        5677254
    zfs:0:arcstats:hdr_size                 188240232
    zfs:0:arcstats:hits                     651328694708
    [...l2arc statistics truncated...]
    zfs:0:arcstats:memory_throttle_count    0
    zfs:0:arcstats:mfu_ghost_hits           55377634
    zfs:0:arcstats:mfu_hits                 649347616033
    zfs:0:arcstats:misses                   370489546
    zfs:0:arcstats:mru_ghost_hits           127477329
    zfs:0:arcstats:mru_hits                 1980639328
    zfs:0:arcstats:mutex_miss               11530337
    zfs:0:arcstats:other_size               1967741376
    zfs:0:arcstats:p                        14713329404
    zfs:0:arcstats:prefetch_data_hits       21342
    zfs:0:arcstats:prefetch_data_misses     20782630
    zfs:0:arcstats:prefetch_metadata_hits   831587
    zfs:0:arcstats:prefetch_metadata_misses 2962685
    zfs:0:arcstats:recycle_miss             27036925
    zfs:0:arcstats:size                     15721799464
    zfs:0:arcstats:snaptime                 29379870.8764106

    More of the activity related statistics will be discussed in the next sections.

    Demand/Prefetch

    Hits and misses can be broken down into four components, such that:

    hits = demand_data_hits + demand_metadata_hits + prefetch_data_hits + prefetch_metadata_hits

    And similar for misses. Prefetch and demand refer to how the ARC request was initiated; data and metadata refer to the type of data requested.

    Prefetch is the ZFS read-ahead feature, to predict and pre-cache blocks for streaming (sequential) workloads. All the prefetch statistics refer to ARC requests that have originated from the ZFS prefetch algorithm, which happens before the ARC and without knowing whether the data is already cached in the ARC. So, a prefetch hit means that ZFS initiated a prefetch, which was then found in the ARC. A prefetch miss happens when that prefetch request was not in the ARC, and so initiated a disk I/O request (normal behavior). Demand is the opposite of prefetch: direct requests to the ARC, not predicted requests.

    Another way to understand prefetch statistics is to follow the code. In dbuf.c, see the ARC_PREFETCH flag set in dbuf_prefetch(), which is then checked in arc.c via the ARCSTAT_CONDSTAT macro to determine which kstat to increment.

    You can also add these up in other ways; eg:

    streaming ratio = prefetch_* / (hits + misses)

    At least, that identifies the ratio of the workload that ZFS has identified as streaming. This can be turned into a kstat tool (awk/Perl/C), as with hits/misses earlier, to show both the summary since boot and interval summaries (current activity).
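    Here is a minimal sketch of the since-boot version (the script name and awk logic are mine, not from an existing tool; it assumes Solaris kstat(1M) and the arcstats statistic names shown above):

    #!/usr/bin/sh
    # streamratio.sh: since-boot streaming ratio, prefetch_* / (hits + misses)
    kstat -p zfs:0:arcstats | awk '
    /arcstats:prefetch_/                    { prefetch += $2 }
    /arcstats:hits|arcstats:misses/         { total += $2 }
    END {
            if (total)
                    printf "streaming ratio: %.2f%%\n", prefetch * 100 / total
    }'

    An interval version would keep the previous values and print deltas, as archits.sh did earlier.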

    Data/Metadata

    Metadata describes the ZFS dataset (file system or volume) and the objects within it. The data is the contents of those objects, including file, directory and volume blocks.

    metadata ratio = *_metadata_* / (hits + misses)

    This may be useful when considering the effect of picking a small recsize setting (thus increasing metadata), or the effect of setting primarycache to metadata only.
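    The same since-boot one-liner approach works here (again a sketch, with the same assumptions as the streaming ratio example):

    $ kstat -p zfs:0:arcstats | awk '
    /metadata_hits|metadata_misses/         { meta += $2 }
    /arcstats:hits|arcstats:misses/         { total += $2 }
    END { if (total) printf "metadata ratio: %.2f%%\n", meta * 100 / total }'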

    Others

    Some other activity related kstats worth mentioning for the ARC:

    • mru_hits, mru_ghost_hits, mfu_hits, mfu_ghost_hits, p: Comparing the mru_hits and mfu_hits statistics with misses can determine the performance of each ARC list type (it's not comparing the performance of the MRU/MFU algorithms alone, since these aren't strictly MRU/MFU, as mentioned in Architecture). By adding _hits + _ghost_hits for each type, and then comparing the ratio of each type over time, you can also identify if the workload changes in terms of ARC MRU/MFU. And you can see how quickly the ARC adapts to the workload, by watching the p statistic (ARC parameter) change; see the sketch after this list.

    • hash_chain_max, hash_collisions: These show how well the DVA hash table is hashing. hash_chain_max is the longest chain length seen, when DVAs hash to the same table entry, and is usually less than 10. If it were much higher, performance may degrade as the hash locks are held longer while the chains are walked, assuming the max is representative and not an anomaly caused by some short event. This could be double-checked by studying the hash_collisions rate. If an issue is found, the number of hash table entries (BUF_LOCKS) could be increased in arc.c and ZFS recompiled (this isn't a regular tunable), although I wouldn't expect to need to tune this for a while.
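    A simple way to watch the adaptation (a sketch using kstat(1M) directly; computing per-interval deltas would need the awk treatment shown earlier):

    $ kstat -p zfs:0:arcstats:p zfs:0:arcstats:c \
            zfs:0:arcstats:mru_hits zfs:0:arcstats:mru_ghost_hits \
            zfs:0:arcstats:mfu_hits zfs:0:arcstats:mfu_ghost_hits 1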

    Other kstats in the arcstats group describe sizes of the ARC, and the L2ARC.

    arc_summary.pl

    Another Perl Sun::Solaris::Kstat-based ARC tool worth mentioning is Ben Rockwood's arc_summary.pl, which prints a neat summary of the hit/miss rate and many of the other counters. Jason Hellenthal also ported the tool to FreeBSD.

    $ ./arc_summary.pl
    System Memory:
             Physical RAM:  49134 MB
             Free Memory :  1925 MB
             LotsFree:      767 MB

    ZFS Tunables (/etc/system):
             set zfs:zil_disable=1
             set zfs:zfs_prefetch_disable=1
             set zfs:zfs_nocacheflush=1

    ARC Size:
             Current Size:             15172 MB (arcsize)
             Target Size (Adaptive):   15256 MB (c)
             Min Size (Hard Limit):    6013 MB (zfs_arc_min)
             Max Size (Hard Limit):    48110 MB (zfs_arc_max)

    ARC Size Breakdown:
             Most Recently Used Cache Size:   77%    11865 MB (p)
             Most Frequently Used Cache Size: 22%    3391 MB (c-p)

    ARC Efficency:
             Cache Access Total:        654018720316
             Cache Hit Ratio:     99%   653646329407  [Defined State for buffer]
             Cache Miss Ratio:     0%   372390909     [Undefined State for Buffer]
             REAL Hit Ratio:      99%   653645890054  [MRU/MFU Hits Only]

             Data Demand   Efficiency:  99%
             Data Prefetch Efficiency:   0%

            CACHE HITS BY CACHE LIST:
              Anon:                       --%   Counter Rolled.
              Most Recently Used:          0%   1989696958 (mru)      [ Return Customer ]
              Most Frequently Used:       99%   651656193096 (mfu)    [ Frequent Customer ]
              Most Recently Used Ghost:    0%   128471495 (mru_ghost) [ Return Customer Evicted, Now Back ]
              Most Frequently Used Ghost:  0%   55618357 (mfu_ghost)  [ Frequent Customer Evicted, Now Back ]
            CACHE HITS BY DATA TYPE:
              Demand Data:                93%   613371468593
              Prefetch Data:               0%   21342
              Demand Metadata:             6%   40274007879
              Prefetch Metadata:           0%   831593
            CACHE MISSES BY DATA TYPE:
              Demand Data:                69%   259735783
              Prefetch Data:               5%   20782630
              Demand Metadata:            23%   88909678
              Prefetch Metadata:           0%   2962818
    ---------------------------------------------

    Percentages and raw counters are provided, and the four breakdowns of hit/miss statistics (which I documented above; Ben's been bugging me to document the arcstats for a while).

    Tracing

    Apart from statistics, the activity of the ARC can also be observed by tracing function points and probes in the kernel. While statistics are always enabled and collected, tracing is enabled when needed, and costs much higher overhead. This overhead is relative to the frequency of the traced events, which for the ARC can be very frequent (hundreds of thousands of events per second). I usually only trace the ARC for short periods (seconds or minutes) to gather debug data.

    There isn't a stable DTrace provider for the ARC (and there probably never will be; other areas make much more sense), but there are sdt-provider probes in the ARC code:

    # dtrace -ln 'sdt:zfs::arc-*'
       ID   PROVIDER            MODULE                          FUNCTION NAME
    19307        sdt               zfs                   arc_read_nolock arc-miss
    19310        sdt               zfs                   arc_evict_ghost arc-delete
    19311        sdt               zfs                         arc_evict arc-evict
    19312        sdt               zfs                   arc_read_nolock arc-hit
    19313        sdt               zfs                   arc_buf_add_ref arc-hit

    If these didn't exist, you could use the fbt provider. I'd begin by inspecting the functions listed in the FUNCTION column.
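    For example, the dynamic fbt probes for the ARC functions can be listed with (the exact probe list depends on your kernel build):

    # dtrace -ln 'fbt:zfs:arc_*:entry'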

    Note that neither of these providers (sdt or fbt) is available from within Solaris zones; these must be traced from the global zone. They are also both considered unstable interfaces, meaning the one-liners and scripts that follow may not work on future versions of the ARC without maintenance to match the code changes.

    ARC accesses by application

    Checking which applications are (directly) using the ARC:

    # dtrace -n 'sdt:zfs::arc-hit,sdt:zfs::arc-miss { @[execname] = count() }'
    dtrace: description 'sdt:zfs::arc-hit,sdt:zfs::arc-miss ' matched 3 probes
    ^C
      sendmail                                                          1
      qmgr                                                              3
    [...]
      nscd                                                             81
      httpd                                                           243
      imapd                                                          1417
      python2.6                                                      2572
      awstats.pl                                                     4285
      php                                                            6934
      mysqld                                                       105901

    This frequency counts the execname during ARC access. mysqld was the heaviest user, with 105,901 accesses while tracing.

    The kernel will show up as sched, for activities including ZFS transaction group flushes (TXG flush).

    ARC accesses by kernel call path

    For more details on why the ARC is being accessed, the kernel calling stack can be frequency counted:

    # dtrace -n 'sdt:zfs::arc-hit,sdt:zfs::arc-miss { @[execname, probefunc, stack()] = count(); }'
    dtrace: description 'sdt:zfs::arc-hit,sdt:zfs::arc-miss ' matched 3 probes
    ^C
    [...]
      sched            arc_buf_add_ref
                  zfs`dbuf_hold_impl+0xea
                  zfs`dbuf_hold+0x2e
                  zfs`dmu_buf_hold+0x75
                  zfs`zap_lockdir+0x67
                  zfs`zap_update+0x5b
                  zfs`uidacct+0xc4
                  zfs`zfs_space_delta_cb+0x112
                  zfs`dmu_objset_do_userquota_callbacks+0x151
                  zfs`dsl_pool_sync+0xfe
                  zfs`spa_sync+0x32b                        <- spa sync
                  zfs`txg_sync_thread+0x265
                  unix`thread_start+0x8
                   26
    [...]
      python2.6        arc_buf_add_ref
                  zfs`dbuf_hold_impl+0xea
                  zfs`dbuf_hold+0x2e
                  zfs`dmu_buf_hold+0x75
                  zfs`zap_get_leaf_byblk+0x56
                  zfs`zap_deref_leaf+0x78
                  zfs`fzap_cursor_retrieve+0xa7
                  zfs`zap_cursor_retrieve+0x152
                  zfs`zfs_readdir+0x2b8
                  genunix`fop_readdir+0xab                  <- read directory
                  genunix`getdents64+0xbc
                  unix`_sys_sysenter_post_swapgs+0x149
                 2130
    [...]
      mysqld           arc_buf_add_ref
                  zfs`dbuf_hold_impl+0xea
                  zfs`dbuf_hold+0x2e
                  zfs`dmu_buf_hold_array_by_dnode+0x1a7
                  zfs`dmu_buf_hold_array+0x71
                  zfs`dmu_read_uio+0x4d
                  zfs`zfs_read+0x19a
                  genunix`fop_read+0x6b                     <- read
                  genunix`read+0x2b8
                  genunix`read32+0x22
                  unix`_sys_sysenter_post_swapgs+0x149
               101955

    The output was many pages long; I've truncated it to include a few different stacks, and added annotations.

    ARC misses by user-land call path

    Here's another view of ARC access call paths, this time for misses only, and the user-land stack trace that led to the miss. I've filtered on mysqld processes only:

    # dtrace -n 'sdt:zfs::arc-miss /execname == "mysqld"/ { @[execname, probefunc, ustack()] = count(); }'
    dtrace: description 'sdt:zfs::arc-miss ' matched 1 probe
    ^C
    [...]
      mysqld           arc_read_nolock
                  libc.so.1`__read+0x15
                  mysqld`my_read+0x43
                  mysqld`_Z7openfrmP3THDPKcS2_jjjP8st_table+0x95
                  mysqld`_ZL17open_unireg_entryP3THDP8st_tablePKcS4_S4_P10TABLE_LISTP1...
                  mysqld`_Z10open_tableP3THDP10TABLE_LISTP11st_mem_rootPbj+0x6d7
                  mysqld`_Z11open_tablesP3THDPP10TABLE_LISTPjj+0x1b0
                  mysqld`_Z30open_normal_and_derived_tablesP3THDP10TABLE_LISTj+0x1b
                  mysqld`_Z14get_all_tablesP3THDP10TABLE_LISTP4Item+0x73b
                  mysqld`_Z24get_schema_tables_resultP4JOIN23enum_schema_table_state+0x18
                  mysqld`_ZN4JOIN4execEv+0x59e
                  mysqld`_Z12mysql_selectP3THDPPP4ItemP10TABLE_LISTjR4ListIS1_ES2_jP8s...
                  mysqld`_Z13handle_selectP3THDP6st_lexP13select_resultm+0x102
                  mysqld`_Z21mysql_execute_commandP3THD+0x51c6
                  mysqld`_Z11mysql_parseP3THDPKcjPS2_+0x1be
                  mysqld`_Z16dispatch_command19enum_server_commandP3THDPcj+0x895
                  mysqld`handle_one_connection+0x318
                  libc.so.1`_thrp_setup+0x7e
                  libc.so.1`_lwp_start
                  124

    The kernel stack trace could be included as well, showing the complete call path from user-land to a kernel event.

    ARC access sizes

    Digging a bit deeper; the sdt probes used previously were declared as:

    DTRACE_PROBE1(arc__hit, arc_buf_hdr_t *, hdr);

    Which means arg0 is an arc_buf_hdr_t. It's declared in arc.c, and contains various members including:

    struct arc_buf_hdr {
    [...]
            arc_buf_t               *b_buf;
            uint32_t                b_flags;
    [...]
            arc_buf_contents_t      b_type;
            uint64_t                b_size;
            uint64_t                b_spa;
    [...]
            clock_t                 b_arc_access;
    [...]

    Let's pick out the size, and trace ARC accesses by buffer size:

    # dtrace -n 'sdt:zfs::arc-hit,sdt:zfs::arc-miss { @["bytes"] = quantize(((arc_buf_hdr_t *)arg0)->b_size); }'
    dtrace: description 'sdt:zfs::arc-hit,sdt:zfs::arc-miss ' matched 3 probes
    ^C

      bytes
               value  ------------- Distribution ------------- count
                   0 |                                         0
                   1 |                                         82
                   2 |                                         1
                   4 |                                         0
                   8 |                                         1
                  16 |                                         1
                  32 |                                         1
                  64 |                                         1
                 128 |                                         0
                 256 |                                         0
                 512 |@                                        1526
                1024 |                                         605
                2048 |                                         780
                4096 |                                         913
                8192 |@                                        1094
               16384 |@@                                       4386
               32768 |                                         618
               65536 |@@                                       4196
              131072 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@        60811
              262144 |                                         0

    Here I used a power-of-2 quantization, which showed that most of the buffers were in the 128 Kbyte range (which is also the default recsize for the datasets on this system). Smaller buffers will exist due to cases including files and directories that are smaller than 128k.

    Other members of arc_buf_hdr_t can be retrieved and inspected in similar ways.
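    For example, accesses can be broken down by buffer type using the b_type member (a sketch: I'm assuming ARC_BUFC_DATA is enum value 0, as it was in arc.h at the time):

    # dtrace -n 'sdt:zfs::arc-hit,sdt:zfs::arc-miss {
        @[((arc_buf_hdr_t *)arg0)->b_type == 0 ? "data" : "metadata"] = count(); }'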

    ARC buffer age

    Here's a neat use of the b_arc_access member, which tracks the time that the buffer was last accessed in terms of clock ticks. This time the fbt provider is used, to trace arc_access() before and after it updates b_arc_access:

    # cat -n arcaccess.d
         1  #!/usr/sbin/dtrace -s
         2
         3  #pragma D option quiet
         4
         5  dtrace:::BEGIN
         6  {
         7          printf("lbolt rate is %d Hertz.\n", `hz);
         8          printf("Tracing lbolts between ARC accesses...");
         9  }
        10
        11  fbt::arc_access:entry
        12  {
        13          self->ab = args[0];
        14          self->lbolt = args[0]->b_arc_access;
        15  }
        16
        17  fbt::arc_access:return
        18  /self->lbolt/
        19  {
        20          @ = quantize(self->ab->b_arc_access - self->lbolt);
        21          self->ab = 0;
        22          self->lbolt = 0;
        23  }

    Running for 10 seconds:

    # ./arcaccess.d -n 'tick-10s { exit(0); }'
    lbolt rate is 100 Hertz.
    Tracing lbolts between ARC accesses...

               value  ------------- Distribution ------------- count
                  -1 |                                         0
                   0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 729988
                   1 |                                         3805       <- 10 ms
                   2 |                                         3038
                   4 |                                         2028
                   8 |                                         1428
                  16 |                                         1398
                  32 |                                         1618
                  64 |                                         2883       <- 1 second
                 128 |                                         738
                 256 |                                         681
                 512 |                                         338
                1024 |                                         569
                2048 |                                         166
                4096 |                                         607        <- 1 minute
                8192 |                                         632
               16384 |                                         808
               32768 |                                         373
               65536 |                                         110
              131072 |                                         142
              262144 |                                         39         <- 1 hour
              524288 |                                         5
             1048576 |                                         97
             2097152 |                                         10
             4194304 |                                         44
             8388608 |                                         617        <- 1 day
            16777216 |                                         1
            33554432 |                                         0

    This is interesting data. It shows that most buffers were accessed less than one clock tick apart (10 ms), with 729,988 accesses in the 0 to 1 tick range. The oldest buffer accessed was in the 16777216+ range, which (converting lbolts at 100 Hertz into time) means it is at least 46 hours since its last access. The above output has been annotated to show where times fall in the lbolt ranges (eg, 1 second falls in the 64-127 lbolt range).

    This gives us an insight into the age of the oldest buffers in the ARC (at least, in terms of access time, not birth), and into its churn rate. This particular ARC is 25 Gbytes, and has been running with a 99.94% hit rate as shown earlier, which may be less surprising now that we know it is so large that it can contain buffers accessed 40+ hours apart.
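    The conversion is simple arithmetic, if you want to check it:

    $ echo "16777216 / 100 / 3600" | bc -l     # lbolts / Hz / seconds-per-hour
    46.60337777777777777777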

    ARC hash lock

    To get a handle on ARC hash lock contention (instead of using more heavyweight tools like lockstat(1M)), you can try tracing the time for arc_buf_add_ref(), since it grabs the buffer hash lock:

    # dtrace -n 'arc_buf_add_ref:entry { self->s = timestamp; } arc_buf_add_ref:return /self->s/ { @["ns"] = quantize(timestamp - self->s); self->s = 0; }'
    dtrace: description 'arc_buf_add_ref:entry ' matched 2 probes
    ^C

      ns
               value  ------------- Distribution ------------- count
                 256 |                                         0
                 512 |@                                        2123
                1024 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@         44784
                2048 |@@@@@                                    7556
                4096 |@@                                       2267
                8192 |                                         385
               16384 |                                         4
               32768 |                                         0
               65536 |                                         1
              131072 |                                         0

    Most of the times were in the 1-2 us range, with only a single occurrence passing 65 us.


    ARC reap

    Here's a simple script to provide insight into the ARC reclaim thread: an asynchronous task that keeps the size of the ARC in check.

    # cat -n arcreap.d
         1  #!/usr/sbin/dtrace -s
         2
         3  fbt::arc_kmem_reap_now:entry,
         4  fbt::arc_adjust:entry
         5  {
         6          self->start[probefunc] = timestamp;
         7  }
         8
         9  fbt::arc_shrink:entry
        10  {
        11          trace("called");
        12  }
        13
        14  fbt::arc_kmem_reap_now:return,
        15  fbt::arc_adjust:return
        16  /self->start[probefunc]/
        17  {
        18          printf("%Y %d ms", walltimestamp,
        19              (timestamp - self->start[probefunc]) / 1000000);
        20          self->start[probefunc] = 0;
        21  }

    Different functions are traced: arc_kmem_reap_now(), to see the time taken to reap the ARC kmem caches; arc_adjust(), for resizing the ARC lists; and arc_shrink(), to know when the ARC size has been stepped down (this isn't timed, since any real work will be done by arc_adjust()).

    # ./arcreap.d
    dtrace: script './arcreap.d' matched 5 probes
    CPU     ID                    FUNCTION:NAME
      0  64929                 arc_shrink:entry   called
      0  62414                 arc_adjust:return  2012 Jan  9 23:10:01 18 ms
      9  62420          arc_kmem_reap_now:return  2012 Jan  9 23:10:03 1511 ms
      0  62414                 arc_adjust:return  2012 Jan  9 23:10:24 0 ms
      6  62414                 arc_adjust:return  2012 Jan  9 23:10:49 0 ms

    This isn't the only way the ARC keeps its size sane; it will also evict/recycle buffers during allocation, as mentioned in the Sizing section. This reclaim thread is the more aggressive method, so if you have occasional odd ARC behavior it may be handy to check whether it is related to reclaims.

    Evicts by list and type

    Tracing the function that does eviction, with details:

    # cat -n arcevict.d
         1  #!/usr/sbin/dtrace -s
         2
         3  #pragma D option quiet
         4
         5  dtrace:::BEGIN
         6  {
         7          trace("Tracing ARC evicts...\n");
         8  }
         9
        10  fbt::arc_evict:entry
        11  {
        12          printf("%Y %-10a %-10s %-10s %d bytes\n", walltimestamp, args[0],
        13              arg4 == 0 ? "data" : "metadata",
        14              arg3 == 0 ? "evict" : "recycle", arg2);
        15  }

    Sample output:

    # ./arcevict.d
    Tracing ARC evicts...
    2012 Jan  8 08:13:03 zfs`ARC_mru   data       evict      812181411 bytes
    2012 Jan  8 08:13:03 zfs`ARC_mfu   data       evict      5961212 bytes
    2012 Jan  8 08:13:03 zfs`ARC_mfu   data       recycle    131072 bytes
    2012 Jan  8 08:13:04 zfs`ARC_mfu   data       recycle    131072 bytes
    2012 Jan  8 08:13:07 zfs`ARC_mfu   data       recycle    131072 bytes
    2012 Jan  8 08:13:07 zfs`ARC_mfu   data       recycle    131072 bytes
    2012 Jan  8 08:13:08 zfs`ARC_mfu   metadata   recycle    16384 bytes
    2012 Jan  8 08:13:08 zfs`ARC_mfu   data       recycle    131072 bytes
    [...]

    The output begins by catching an 800 Mbyte evict from the ARC MRU data list, followed by a 6 Mbyte evict from the MFU data list. After that, buffers were evicted via the recycle code path, which recycles buffers when the ARC is getting full instead of allocating new ones.


    To understand (and maintain) the arg mappings above, see the invocations of arc_evict() in arc.c. Eg, from arc_adjust():

    if (adjustment > 0 && arc_mru->arcs_lsize[ARC_BUFC_DATA] > 0) {
            delta = MIN(arc_mru->arcs_lsize[ARC_BUFC_DATA], adjustment);
            (void) arc_evict(arc_mru, NULL, delta, FALSE, ARC_BUFC_DATA);
            adjustment -= delta;
    }

    This is the first arc_evict() in arc_adjust(), which is why the ARC MRU data list is hit up first.

    And more

    The previous tracing examples show the sort of additional information that can be obtained using static tracing (the sdt provider) and dynamic tracing (the fbt provider). With dynamic tracing, a lot more can be seen as needed. Every function that makes up the ARC can be traced, along with its arguments.

    One detail that is actually difficult to trace is the file names during ARC accesses, since vnode pointers are not passed down to the ARC layer. It is possible, and has been done before (I don't have an example on hand, though). You could more easily cache them from upper layers (eg, VFS; see the sample chapter from the DTrace book).
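    As a rough sketch of the upper-layer approach (this shows file names for ZFS reads at the VFS interface, not ARC accesses, and assumes a Solaris-style zfs_read() whose first argument is a vnode_t with a populated v_path):

    # dtrace -n 'fbt::zfs_read:entry { @[execname, args[0]->v_path == NULL ?
        "<unknown>" : stringof(args[0]->v_path)] = count(); }'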

    Conclusion

    In this post, I examined ZFS ARC activity in detail, starting with statistics provided by kstat and then tracing provided by DTrace. Apart from calculating hit and miss rates, I discussed other statistics including prefetch and metadata ratios. I then used tracing to observe information from the ARC including who is using the ARC and why, ARC buffer sizes, the age of the ARC buffers, lock contention timings and eviction details. More can be traced as needed: ZFS with DTrace provides great performance and observability.

    I've spent much time on kernel internals, but I haven't really blogged about the deeper areas. I'm trying to change that, at least occasionally, starting with this post on ARC activity. I hope it is useful.

    Thanks to the original ZFS team, especially Mark Maybee, for writing the ARC and explaining details to me, and to Bryan Cantrill for kstat-ifying the ARC statistics and creating DTrace.

    Posted on January 9, 2012 at 4:50 pm by Brendan Gregg

    In: Kernel. Tagged with: ARC, dtrace, performance, ZFS

    3 Responses


    1. Written by Richard Elling on January 10, 2012 at 10:34 am

    Spent a lot of time last summer looking at evictions and their impact on the system. These do not scale well with the size of memory in the system. For example, in the arcevict.d data above, evicting 800 MBytes is not a big deal, but evicting 8 GBytes is a big deal. Look for some illumos putbacks in this area RSN :-)

    2. Written by mic on January 11, 2012 at 2:46 am

    Thanks Brendan, great blog.

    3. Written by Kyle Hailey on January 19, 2012 at 11:01 pm

    Awesome to see all this information on ARC analysis. Till now I've found information on these stats pretty sparse. Thanks!
    - Kyle
