Real-time in the real world: DIRT in production

Bryan Cantrill, SVP, Engineering ([email protected], @bcantrill)
Brendan Gregg, Lead Performance Engineer ([email protected], @brendangregg)


DESCRIPTION

My talk with @brendangregg at Surge 2012

TRANSCRIPT

Page 1: Real-time in the real world: DIRT in production

Real-time in the real world: DIRT in production

Bryan Cantrill
SVP, Engineering
[email protected]
@bcantrill

Brendan Gregg
Lead Performance Engineer
[email protected]
@brendangregg

Page 2: Real-time in the real world: DIRT in production

Previously, on #surgecon...

• Two years ago at Surge, we described the emergence of real-time data semantics in web-facing applications

• We dubbed this data-intensive real-time (DIRT)

• Last year at Surge 2011, we presented experiences building a DIRTy system of our own — a facility for real-time analytics of latency in the cloud

• While this system is interesting, it is somewhat synthetic: it does not need to scale (much) with respect to users...

Page 3: Real-time in the real world: DIRT in production

#surgecon 2012

• Accelerated by the rise of mobile applications, DIRTy systems are becoming increasingly common

• In the past year, we’ve seen apps in production at scale

• There are many examples of this, but for us, a paragon of the emerging DIRTy apps has been Voxer

• Voxer is a push-to-talk mobile app that can be thought of as the confluence of voice mail and SMS

• A canonical DIRTy app: latency and scale both matter!

• Our experiences debugging latency bubbles with Voxer over the past year have taught us quite a bit about the new challenges that DIRTy apps pose...

Page 4: Real-time in the real world: DIRT in production

The challenge of DIRTy apps

• DIRTy applications tend to have the human in the loop

• Good news: deadlines are soft — microseconds only matter when they add up to tens of milliseconds

• Bad news: because humans are in the loop, demand for the system can be non-linear

• One must deal not only with the traditional challenge of scalability, but also the challenge of a real-time system

• Worse, emerging DIRTy apps have mobile devices at their edge — network transience makes clients seem ill-behaved with respect to connection state!

Page 5: Real-time in the real world: DIRT in production

The lessons of DIRTy apps

• Many latency bubbles originate deep in the stack; OS understanding and instrumentation have been essential even when the OS is not at fault

• For up-stack problems, tooling has been essential

• Latency outliers can come from many sources: application restarts, dropped connections, slow disks, boundless memory growth

• We have also seen some traditional real-time problems with respect to CPU scheduling, e.g. priority inversions

• Enough foreplay; on with the DIRTy disaster pr0n!

Page 6: Real-time in the real world: DIRT in production

Application restarts

• Modern internet-facing architectures are designed to be resilient with respect to many failure modes…

• ...but application restarts can induce pathological, cascading latency bubbles, as clients reconnect, clusters reconverge, etc.

• For example, Voxer ran into a node.js bug where it would terminate on ECONNABORTED from accept(2)

• Classic difference in OS semantics: BSD and illumos variants (including SmartOS) do this; Linux doesn’t

• Much more likely over a transient network!
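To spot this in production, a minimal DTrace sketch (an illustrative one-liner, not from the talk) counts failed accept(2) calls by process and errno; on illumos, ECONNABORTED is errno 130:

# count accept(2) failures by process and errno (ECONNABORTED == 130 on illumos)
dtrace -n 'syscall::accept*:return /(int)arg0 == -1/ { @[execname, errno] = count(); }'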

Page 7: Real-time in the real world: DIRT in production

Dropped connections

• If an application can’t accept connections fast enough to keep up, the TCP backlog fills and incoming packets (SYNs) are dropped:

• Client waits, then retransmits (after 1 or 3 seconds), inducing tremendous latency outliers; terrible for DIRTy apps!

$ netstat -s | grep Drop
        tcpTimRetransDrop   =     56    tcpTimKeepalive     =   2582
        tcpTimKeepaliveProbe=   1594    tcpTimKeepaliveDrop =     41
        tcpListenDrop       =3089298    tcpListenDropQ0     =      0
        tcpHalfOpenDrop     =      0    tcpOutSackRetrans   =1400832
        icmpOutDrops        =      0    icmpOutErrors       =      0
        sctpTimRetrans      =      0    sctpTimRetransDrop  =      0
        sctpTimHearBeatProbe=      0    sctpTimHearBeatDrop =      0
        sctpListenDrop      =      0    sctpInClosed        =      0
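To watch that counter continuously rather than via one-shot netstat, kstat(1M) accepts an interval (the kstat path is an assumption based on the MIB names above):

# print tcpListenDrop every second; a rising value means SYNs are being dropped
kstat -p tcp:0:tcp:tcpListenDrop 1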

Page 8: Real-time in the real world: DIRT in production

Dropped connections, cont.

• The fix for dropped connections:

• If due to a surge, increase the TCP backlog

• If due to sustained load, increase CPU resources, decrease CPU consumption, or scale the app

• If fixed by increasing the TCP backlog, check that the system backlog tunable took effect (see the sketch after this list)!

• If not, does the app need to be restarted?

• If not, is the application providing its own backlog that is taking precedence?

• How close are we to dropping?
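As a sketch of that first check on illumos/SmartOS: the system-wide caps can be inspected and raised with ndd(1M). tcp_conn_req_max_q bounds the completed-connection queue and tcp_conn_req_max_q0 the incomplete (SYN) queue; listen(3SOCKET)’s backlog argument is still clamped by them:

# current system-wide backlog caps
ndd -get /dev/tcp tcp_conn_req_max_q
ndd -get /dev/tcp tcp_conn_req_max_q0

# raise them; this only affects subsequent listen(3SOCKET) calls,
# which is why the application may need a restart
ndd -set /dev/tcp tcp_conn_req_max_q 2048
ndd -set /dev/tcp tcp_conn_req_max_q0 4096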

Page 9: Real-time in the real world: DIRT in production

Dropped connections, cont.

• Networking 101

[Diagram: an inbound SYN joins the TCP backlog queue (bounded by max) until the App calls accept(); when the queue is full, the SYN is dropped (a listen drop)]

Page 10: Real-time in the real world: DIRT in production

Dropped connections, cont.

The kernel code (usr/src/uts/common/inet/tcp/tcp_input.c):

/*
 * THIS FUNCTION IS DIRECTLY CALLED BY IP VIA SQUEUE FOR SYN.
 * tcp_input_data will not see any packets for listeners since the listener
 * has conn_recv set to tcp_input_listener.
 */
/* ARGSUSED */
static void
tcp_input_listener(void *arg, mblk_t *mp, void *arg2, ip_recv_attr_t *ira)
{
[...]
	if (listener->tcp_conn_req_cnt_q >= listener->tcp_conn_req_max) {
		mutex_exit(&listener->tcp_eager_lock);
		TCP_STAT(tcps, tcp_listendrop);
		TCPS_BUMP_MIB(tcps, tcpListenDrop);
		if (lconnp->conn_debug) {
			(void) strlog(TCP_MOD_ID, 0, 1, SL_TRACE|SL_ERROR,
			    "tcp_input_listener: listen backlog (max=%d) "
			    "overflow (%d pending) on %s",
			    listener->tcp_conn_req_max,
			    listener->tcp_conn_req_cnt_q,
			    tcp_display(listener, NULL, DISP_PORT_ONLY));
		}
		goto error2;
	}
[...]

Page 11: Real-time in the real world: DIRT in production

Dropped connections, cont.

SEE ALL THE THINGS! tcp_conn_req_cnt_q distributions:

cpid:3063  max_q:8
         value  ------------- Distribution ------------- count
            -1 |                                         0
             0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1
             1 |                                         0

cpid:11504  max_q:128
         value  ------------- Distribution ------------- count
            -1 |                                         0
             0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@     7279
             1 |@@                                       405
             2 |@                                        255
             4 |@                                        138
             8 |                                         81
            16 |                                         83
            32 |                                         62
            64 |                                         67
           128 |                                         34
           256 |                                         0

tcpListenDrops:
  cpid:11504  max_q:128  34


Page 12: Real-time in the real world: DIRT in production

Dropped connections, cont.

• Uses DTrace to get a distribution of the TCP backlog queue length (tcp_conn_req_cnt_q) on SYN, per process; max_q is the configured backlog limit (tcp_conn_req_max):

• Script is on http://github.com/brendangregg/dtrace-cloud-tools as net/tcpconnreqmaxq-pid*.d

fbt::tcp_input_listener:entry
{
	this->connp = (conn_t *)arg0;
	this->tcp = (tcp_t *)this->connp->conn_proto_priv.cp_tcp;
	self->max = strjoin("max_q:", lltostr(this->tcp->tcp_conn_req_max));
	self->pid = strjoin("cpid:", lltostr(this->connp->conn_cpid));
	@[self->pid, self->max] = quantize(this->tcp->tcp_conn_req_cnt_q);
}

mib:::tcpListenDrop
{
	/* guard against unset thread-locals if the fbt probe did not fire */
	this->max = self->max != NULL ? self->max : "<null>";
	this->pid = self->pid != NULL ? self->pid : "<null>";
	@drops[this->pid, this->max] = count();
	printf("%Y %s:%s %s\n", walltimestamp, probefunc, probename,
	    this->pid);
}

Page 13: Real-time in the real world: DIRT in production

Dropped connections, cont.

Or, snoop each drop:

# ./tcplistendrop.d
TIME                  SRC-IP         PORT        DST-IP           PORT
2012 Jan 19 01:22:49  10.17.210.103  25691  ->   192.192.240.212  80
2012 Jan 19 01:22:49  10.17.210.108  18423  ->   192.192.240.212  80
2012 Jan 19 01:22:49  10.17.210.116  38883  ->   192.192.240.212  80
2012 Jan 19 01:22:49  10.17.210.117  10739  ->   192.192.240.212  80
2012 Jan 19 01:22:49  10.17.210.112  27988  ->   192.192.240.212  80
2012 Jan 19 01:22:49  10.17.210.106  28824  ->   192.192.240.212  80
2012 Jan 19 01:22:49  10.12.143.16   65070  ->   192.192.240.212  80
2012 Jan 19 01:22:49  10.17.210.100  56392  ->   192.192.240.212  80
2012 Jan 19 01:22:49  10.17.210.99   24628  ->   192.192.240.212  80
2012 Jan 19 01:22:49  10.17.210.98   11686  ->   192.192.240.212  80
2012 Jan 19 01:22:49  10.17.210.101  34629  ->   192.192.240.212  80
[...]

Page 14: Real-time in the real world: DIRT in production

Dropped connections, cont.

• That code parsed IP and TCP headers from the in-kernel packet buffer:

• Script is tcplistendrop*.d, also on github

fbt::tcp_input_listener:entry  { self->mp = args[1]; }
fbt::tcp_input_listener:return { self->mp = 0; }

mib:::tcpListenDrop
/self->mp/
{
	this->iph = (ipha_t *)self->mp->b_rptr;
	this->tcph = (tcph_t *)(self->mp->b_rptr + 20);
	printf("%-20Y %-18s %-5d -> %-18s %-5d\n", walltimestamp,
	    inet_ntoa(&this->iph->ipha_src),
	    ntohs(*(uint16_t *)this->tcph->th_lport),
	    inet_ntoa(&this->iph->ipha_dst),
	    ntohs(*(uint16_t *)this->tcph->th_fport));
}

Page 15: Real-time in the real world: DIRT in production

Dropped connections, cont.

• To summarize, dropped connections induce acute latency bubbles

• With Voxer, we found that failures often cascaded: high CPU utilization due to unrelated issues would induce TCP listen drops

• Tunables don’t always take effect: need confirmation

• Having a quick tool to check scalability issues (DTrace) has been invaluable

Page 16: Real-time in the real world: DIRT in production

Slow disks

• Slow I/O in a cloud computing environment can be caused by multi-tenancy — which is to say, neighbors:

• Neighbor running a backup

• Neighbor running a benchmark

• Neighbors can’t be seen by tenants...

• ...but is it really a neighbor?

Page 17: Real-time in the real world: DIRT in production

Slow disks, cont.

• Unix 101

[Diagram: Process → Syscall Interface → VFS → ZFS (among other filesystems) → Block Device Interface → Disks]

Page 18: Real-time in the real world: DIRT in production

Slow disks, cont.

• Unix 101

[Diagram: the same I/O stack; iostat(1) observes at the Block Device Interface, which is often asynchronous (write buffering, read ahead), while the syscall/VFS level is synchronous]

Page 19: Real-time in the real world: DIRT in production

Slow disks, cont.

• VFS-level iostat: vfsstat

• Kernel changes, new kstats (thanks Bill Pijewski)

# vfsstat -Z 1
  r/s   w/s  kr/s  kw/s ractv wactv read_t writ_t  %r  %w   d/s  del_t zone
  1.2   2.8   0.6   0.2   0.0   0.0    0.0    0.0   0   0   0.0    0.0 global (0)
  0.1   0.0   0.1   0.0   0.0   0.0    0.0    0.0   0   0   0.0   34.9 9cc2d0d3 (2)
  0.1   0.0   0.1   0.0   0.0   0.0    0.0    0.0   0   0   0.0   46.5 72188ca0 (3)
  0.0   0.0   0.0   0.0   0.0   0.0    0.0    0.0   0   0   0.0   16.5 4d2a62bb (4)
  0.3   0.1   0.1   0.3   0.0   0.0    0.0    0.0   0   0   0.0   27.6 8bbc4000 (5)
  5.9   0.2   0.5   0.1   0.0   0.0    0.0    0.0   0   0   5.0   11.3 d305ee44 (6)
  0.1   0.0   0.1   0.0   0.0   0.0    0.0    0.0   0   0   0.0  132.0 9897c8f5 (7)
  0.1   0.0   0.1   0.0   0.0   0.0    0.0    0.1   0   0   0.0   40.7 5f3c7d9e (9)
  0.2   0.8   0.5   0.6   0.0   0.0    0.0    0.0   0   0   0.0   31.9 22ef87fc (10)

Page 20: Real-time in the real world: DIRT in production

Slow disks, cont.

• zfsslower.d:

• Go-to tool. Are there VFS-level I/Os slower than a threshold, given in ms as an argument (here, 10 ms)?

• Stupidly easy to do

# ./zfsslower.d 10
TIME                 PROCESS  D       B  ms FILE
2012 Sep 27 13:45:33 zlogin   W     372  11 /zones/b8b2464c/var/adm/wtmpx
2012 Sep 27 13:45:36 bash     R       8  14 /zones/b8b2464c/opt/local/bin/zsh
2012 Sep 27 13:45:58 mysqld   R 1048576  19 /zones/b8b2464c/var/mysql/ibdata1
2012 Sep 27 13:45:58 mysqld   R 1048576  22 /zones/b8b2464c/var/mysql/ibdata1
2012 Sep 27 13:46:14 master   R       8   6 /zones/b8b2464c/root/opt/local/libexec/postfix/qmgr
2012 Sep 27 13:46:14 master   R    4096   5 /zones/b8b2464c/root/opt/local/etc/postfix/master.cf
[...]

Page 21: Real-time in the real world: DIRT in production

Slow disks, cont.

• Written in DTrace

• zfsslower.d, also on github, originated from the DTrace book

[...]
fbt::zfs_read:entry,
fbt::zfs_write:entry
{
	self->path = args[0]->v_path;
	self->kb = args[1]->uio_resid / 1024;
	self->start = timestamp;
}

fbt::zfs_read:return,
fbt::zfs_write:return
/self->start && (timestamp - self->start) >= min_ns/
{
	this->iotime = (timestamp - self->start) / 1000000;
	this->dir = probefunc == "zfs_read" ? "R" : "W";
	printf("%-20Y %-16s %1s %4d %6d %s\n", walltimestamp,
	    execname, this->dir, self->kb, this->iotime,
	    self->path != NULL ? stringof(self->path) : "<null>");
}
[...]

Page 22: Real-time in the real world: DIRT in production

Slow disks, cont.

• Traces the VFS/ZFS interface (kernel), from usr/src/uts/common/fs/zfs/zfs_vnops.c:

/*
 * Regular file vnode operations template
 */
vnodeops_t *zfs_fvnodeops;
const fs_operation_def_t zfs_fvnodeops_template[] = {
	VOPNAME_OPEN,		{ .vop_open = zfs_open },
	VOPNAME_CLOSE,		{ .vop_close = zfs_close },
	VOPNAME_READ,		{ .vop_read = zfs_read },
	VOPNAME_WRITE,		{ .vop_write = zfs_write },
	VOPNAME_IOCTL,		{ .vop_ioctl = zfs_ioctl },
	VOPNAME_GETATTR,	{ .vop_getattr = zfs_getattr },
[...]

Page 23: Real-time in the real world: DIRT in production

Slow disks, cont.

• Unix 101

[Diagram: the same I/O stack; zfsslower.d traces at the VFS/ZFS level, iosnoop at the Block Device Interface; correlate the two to locate the latency]

Page 24: Real-time in the real world: DIRT in production

Slow disks, cont.

• Correlating the layers narrows the latency location (see the block-level sketch after this list)

• Or you can associate in the same D script

• Via text, filtering on slow I/O, works fine

• For high frequency I/O, heat maps
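A minimal sketch of the block-device half of that correlation, using the stable io provider (a standard idiom rather than the production script):

# quantize block I/O latency, keyed by the buffer pointer
dtrace -n '
    io:::start { ts[arg0] = timestamp; }
    io:::done /ts[arg0]/ {
        @["block I/O latency (ns)"] = quantize(timestamp - ts[arg0]);
        ts[arg0] = 0;
    }'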

Page 25: Real-time in the real world: DIRT in production

Slow disks, cont.

• WHAT DOES IT MEAN?

Page 26: Real-time in the real world: DIRT in production

Slow disks, cont.

• Latency outliers:

Page 27: Real-time in the real world: DIRT in production

Slow disks, cont.

• Latency outliers:

[Heat map annotations: Good, Bad, Very Bad, Inconceivable]

Page 28: Real-time in the real world: DIRT in production

Slow disks, cont.

• Inconceivably bad, 1000+ms VFS-level latency:

• Queueing behind large ZFS SPA syncs (tunable; see the sketch below)

• Other tenants benchmarking (before we added I/O throttling to SmartOS)

• Reads queueing behind writes. Needed to tune ZFS and the LSI PERC (shakes fist!)

[Scatter plot: I/O latency vs. time (s), read = red, write = blue; outliers reaching 60 ms]
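The SPA sync queueing above is governed by ZFS transaction-group tunables; on illumos of that era the usual knob was zfs_txg_timeout (treat the name and the value below as assumptions to verify for your platform). Live kernel tunables can be read and written with mdb -kw:

# read the current txg timeout (seconds), then set it to 5 (0t = decimal)
echo "zfs_txg_timeout/D" | mdb -k
echo "zfs_txg_timeout/W 0t5" | mdb -kw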

Page 29: Real-time in the real world: DIRT in production

Slow disks, cont.

• Deeper tools rolled as needed, anywhere in ZFS:

# dtrace -n 'io:::start { @[stack()] = count(); }'
dtrace: description 'io:::start ' matched 6 probes
^C
              genunix`ldi_strategy+0x53
              zfs`vdev_disk_io_start+0xcc
              zfs`zio_vdev_io_start+0xab
              zfs`zio_execute+0x88
              zfs`zio_nowait+0x21
              zfs`vdev_mirror_io_start+0xcd
              zfs`zio_vdev_io_start+0x250
              zfs`zio_execute+0x88
              zfs`zio_nowait+0x21
              zfs`arc_read_nolock+0x4f9
              zfs`arc_read+0x96
              zfs`dsl_read+0x44
              zfs`dbuf_read_impl+0x166
              zfs`dbuf_read+0xab
              zfs`dmu_buf_hold_array_by_dnode+0x189
              zfs`dmu_buf_hold_array+0x78
              zfs`dmu_read_uio+0x5c
              zfs`zfs_read+0x1a3
              genunix`fop_read+0x8b
              genunix`read+0x2a7
              143

Page 30: Real-time in the real world: DIRT in production

Slow disks, cont.

• On Joyent’s IaaS architecture, it’s usually not the disks or filesystem; useful to rule that out quickly

• Some of the time it is, due to bad disks (1000+ms I/O); heat map or iosnoop correlation matches

• Some of the time it’s due to big I/O (how quick is a 40 Mbyte read from cache?)

• Some of the time it is other tenants (benchmarking!); much less for us now with ZFS I/O throttling

• With ZFS and an SSD-based intent log, HW RAID is not just unobservable, but entirely unnecessary — adios PERC!

Page 31: Real-time in the real world: DIRT in production

Memory growth

• Riak had endless memory growth

• Expected 9 GB; after two days:

• Eventually hits paging and terrible performance, needing a restart

• Remember, application restarts are a latency disaster!

$ prstat -c 1
Please wait...
   PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP
 21722 103        43G   40G cpu0    59    0  72:23:41 2.6% beam.smp/594
 15770 root     7760K  540K sleep   57    0  23:28:57 0.9% zoneadmd/5
    95 root        0K    0K sleep   99  -20   7:37:47 0.2% zpool-zones/166
 12827 root      128M   73M sleep  100    -   0:49:36 0.1% node/5
 10319 bgregg     10M 6788K sleep   59    0   0:00:00 0.0% sshd/1
 10402 root       22M  288K sleep   59    0   0:18:45 0.0% dtrace/1
[...]

Page 32: Real-time in the real world: DIRT in production

Memory growth, cont.

• What is in the heap?

• ... and why does it keep growing?

$ pmap 14719
14719:  beam.smp
0000000000400000      2168K r-x--  /opt/riak/erts-5.8.5/bin/beam.smp
000000000062D000       328K rw---  /opt/riak/erts-5.8.5/bin/beam.smp
000000000067F000   4193540K rw---  /opt/riak/erts-5.8.5/bin/beam.smp
00000001005C0000   4194296K rw---    [ anon ]
00000002005BE000   4192016K rw---    [ anon ]
0000000300382000   4193664K rw---    [ anon ]
00000004002E2000   4191172K rw---    [ anon ]
00000004FFFD3000   4194040K rw---    [ anon ]
00000005FFF91000   4194028K rw---    [ anon ]
00000006FFF4C000   4188812K rw---    [ anon ]
00000007FF9EF000    588224K rw---    [ heap ]
[...]

Page 33: Real-time in the real world: DIRT in production

Memory growth, cont.

• Is this a memory leak?

• In the app logic: Voxer?

• In the DB logic: Riak?

• In the DB’s Erlang VM?

• In the OS libraries?

• In the OS kernel?

• Or application growth?

• Where would you guess?

[Diagram of the suspects, top to bottom: Voxer (app), Riak, Erlang VM, libc and other libraries, kernel]

Page 34: Real-time in the real world: DIRT in production

Memory growth, cont.

• Voxer (App): don’t think it’s us

• Basho (Riak): don’t think it’s us

• Joyent (OS): don’t think it’s us

• This sort of issue is usually app growth...

• ...but we can check libs & kernel to be sure

Page 35: Real-time in the real world: DIRT in production

Memory growth, cont.

• libumem was in use for allocations

• fast, scalable, object-caching, multi-threaded support

• user-land version of kmem (slab allocator, Bonwick)

Page 36: Real-time in the real world: DIRT in production

Memory growth, cont.

• Fix by experimentation (backend=mmap, other allocators) wasn’t working.

• Detailed observability can be enabled in libumem, allowing heap profiling and leak detection

• While designed with speed and production use in mind, it still comes with some cost (time and space), and isn’t on by default: restart required.

• UMEM_DEBUG=audit
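A sketch of enabling it (the riak launcher invocation is hypothetical; UMEM_DEBUG, UMEM_LOGGING, and mdb’s ::findleaks are standard libumem facilities):

# relaunch with umem auditing: each buffer records its allocation stack
UMEM_DEBUG=audit UMEM_LOGGING=transaction LD_PRELOAD=libumem.so riak start

# later, inspect the live process (or a core dump) with mdb
mdb -p `pgrep beam.smp`
> ::findleaks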

Page 37: Real-time in the real world: DIRT in production

Memory growth, cont.

• libumem provides some default observability

• E.g., slabs:

> ::umem_malloc_info
            CACHE  BUFSZ MAXMAL BUFMALLC AVG_MAL  MALLOCED   OVERHEAD   %OVER
0000000000707028      8      0        0       0         0          0    0.0%
000000000070b028     16      8     8730       8     69836    1054998 1510.6%
000000000070c028     32     16     8772      16    140352    1130491  805.4%
000000000070f028     48     32  1148038      25  29127788  156179051  536.1%
0000000000710028     64     48   344138      40  13765658   58417287  424.3%
0000000000711028     80     64       36      62      2226       4806  215.9%
0000000000714028     96     80     8934      79    705348    1168558  165.6%
0000000000715028    112     96  1347040      87 117120208  190389780  162.5%
0000000000718028    128    112   253107     111  28011923   42279506  150.9%
000000000071a028    160    144    40529     118   4788681    6466801  135.0%
000000000071b028    192    176      140     155     21712      25818  118.9%
000000000071e028    224    208       43     188      8101       6497   80.1%
000000000071f028    256    240      133     229     30447      26211   86.0%
0000000000720028    320    304       56     276     15455      12276   79.4%
0000000000723028    384    368       35     335     11726       7220   61.5%
[...]

Page 38: Real-time in the real world: DIRT in production

Memory growth, cont.

• ... and heap (captured @14GB RSS):

• The heap is 9 GB (as expected), but sbrk_top total is 14 GB (equal to RSS). And growing.

• Are there Gbyte-sized malloc()/free()s?

> ::vmem
            ADDR             NAME      INUSE       TOTAL SUCCEED  FAIL
fffffd7ffebed4a0         sbrk_top 9090404352 14240165888 4298117 84403
fffffd7ffebee0a8        sbrk_heap 9090404352  9090404352 4298117     0
fffffd7ffebeecb0    vmem_internal  664616960   664616960   79621     0
fffffd7ffebef8b8         vmem_seg  651993088   651993088   79589     0
fffffd7ffebf04c0        vmem_hash   12583424    12587008      27     0
fffffd7ffebf10c8        vmem_vmem      46200       55344      15     0
00000000006e7000    umem_internal  352862464   352866304   88746     0
00000000006e8000       umem_cache     113696      180224      44     0
00000000006e9000        umem_hash   13091328    13099008      86     0
00000000006ea000         umem_log          0           0       0     0
00000000006eb000 umem_firewall_va          0           0       0     0
00000000006ec000    umem_firewall          0           0       0     0
00000000006ed000    umem_oversize 5218777974  5520789504 3822051     0
00000000006f0000    umem_memalign          0           0       0     0
0000000000706000     umem_default 2552131584  2552131584  307699     0

Page 39: Real-time in the real world: DIRT in production

Memory growth, cont.

• No huge malloc()s, but RSS continues to climb.

# dtrace -n 'pid$target::malloc:entry { @ = quantize(arg0); }' -p 17472
dtrace: description 'pid$target::malloc:entry ' matched 3 probes
^C
           value  ------------- Distribution ------------- count
               2 |                                         0
               4 |                                         3
               8 |@                                        5927
              16 |@@@@                                     41818
              32 |@@@@@@@@@                                81991
              64 |@@@@@@@@@@@@@@@@@@                       169888
             128 |@@@@@@@                                  69891
             256 |                                         2257
             512 |                                         406
            1024 |                                         893
            2048 |                                         146
            4096 |                                         1467
            8192 |                                         755
           16384 |                                         950
           32768 |                                         83
           65536 |                                         31
          131072 |                                         11
          262144 |                                         15
          524288 |                                         0
         1048576 |                                         1
         2097152 |                                         0

Page 40: Real-time in the real world: DIRT in production

Memory growth, cont.

• Tracing why the heap grows via brk():

# dtrace -n 'syscall::brk:entry /execname == "beam.smp"/ { ustack(); }'
dtrace: description 'syscall::brk:entry ' matched 1 probe
CPU     ID                    FUNCTION:NAME
 10     18                       brk:entry
              libc.so.1`_brk_unlocked+0xa
              libumem.so.1`vmem_sbrk_alloc+0x84
              libumem.so.1`vmem_xalloc+0x669
              libumem.so.1`vmem_alloc+0x14f
              libumem.so.1`vmem_xalloc+0x669
              libumem.so.1`vmem_alloc+0x14f
              libumem.so.1`umem_alloc+0x72
              libumem.so.1`malloc+0x59
              libstdc++.so.6.0.14`_Znwm+0x20
              libstdc++.so.6.0.14`_Znam+0x9
              eleveldb.so`_ZN7leveldb9ReadBlockEPNS_16RandomAccessFileERKNS_11Rea...
              eleveldb.so`_ZN7leveldb5Table11BlockReaderEPvRKNS_11ReadOptionsERKN...
              eleveldb.so`_ZN7leveldb12_GLOBAL__N_116TwoLevelIterator13InitDataBl...
              eleveldb.so`_ZN7leveldb12_GLOBAL__N_116TwoLevelIterator4SeekERKNS_5...
              eleveldb.so`_ZN7leveldb12_GLOBAL__N_116TwoLevelIterator4SeekERKNS_5...
              eleveldb.so`_ZN7leveldb12_GLOBAL__N_115MergingIterator4SeekERKNS_5S...
              eleveldb.so`_ZN7leveldb12_GLOBAL__N_16DBIter4SeekERKNS_5SliceE+0xcc
              eleveldb.so`eleveldb_get+0xd3
              beam.smp`process_main+0x6939
              beam.smp`sched_thread_func+0x1cf
              beam.smp`thr_wrapper+0xbe

Page 41: Real-time in the real world: DIRT in production

Memory growth, cont.

• More DTrace showed the size of the malloc()s causing the brk()s:

• These 8 Mbyte malloc()s grew the heap

• Even though the heap has Gbytes not in use

• This is starting to look like an OS issue

# dtrace -x dynvarsize=4m -n '
    pid$target::malloc:entry { self->size = arg0; }
    syscall::brk:entry /self->size/ { printf("%d bytes", self->size); }
    pid$target::malloc:return { self->size = 0; }' -p 17472

dtrace: description 'pid$target::malloc:entry ' matched 7 probes
CPU     ID                    FUNCTION:NAME
  0     44                       brk:entry 8343520 bytes
  0     44                       brk:entry 8343520 bytes
[...]

Page 42: Real-time in the real world: DIRT in production

Memory growth, cont.

• More tools were created:

• Show memory entropy (+ malloc, - free) along with heap growth, over time

• Show the codepath taken for allocations; compare successful with unsuccessful (heap growth)

• Show allocator internals: sizes, options, flags

• And run in the production environment

• Briefly. Tracing frequent allocs does cost overhead

• Casting light into what was a black box

Page 43: Real-time in the real world: DIRT in production

Memory growth, cont.

  4  <- vmem_xalloc                        0
  4  -> _sbrk_grow_aligned                 4096
  4  <- _sbrk_grow_aligned                 17155911680
  4  -> vmem_xalloc                        7356400
  4   | vmem_xalloc:entry                  umem_oversize
  4   -> vmem_alloc                        7356416
  4    -> vmem_xalloc                      7356416
  4     | vmem_xalloc:entry                sbrk_heap
  4     -> vmem_sbrk_alloc                 7356416
  4      -> vmem_alloc                     7356416
  4       -> vmem_xalloc                   7356416
  4        | vmem_xalloc:entry             sbrk_top
  4        -> vmem_reap                    16777216
  4        <- vmem_reap                    3178535181209758
  4        | vmem_xalloc:return            vmem_xalloc() == NULL, vm: sbrk_top,
             size: 7356416, align: 4096, phase: 0, nocross: 0, min: 0, max: 0,
             vmflag: 1
          libumem.so.1`vmem_xalloc+0x80f
          libumem.so.1`vmem_sbrk_alloc+0x33
          libumem.so.1`vmem_xalloc+0x669
          libumem.so.1`vmem_alloc+0x14f
          libumem.so.1`vmem_xalloc+0x669
          libumem.so.1`vmem_alloc+0x14f
          libumem.so.1`umem_alloc+0x72
          libumem.so.1`malloc+0x59
          libstdc++.so.6.0.3`_Znwm+0x2b
          libstdc++.so.6.0.3`_ZNSs4_Rep9_S_createEmmRKSaIcE+0x7e

Page 44: Real-time in the real world: DIRT in production

Memory growth, cont.

• These new tools and metrics pointed to the allocation algorithm “instant fit”

• This had been hypothesized earlier; the tools provided solid evidence that this really was the case here

• A new version of libumem was built to force use of VM_BESTFIT

• ...and added by Robert Mustacchi as a tunable: UMEM_OPTIONS=allocator=best

• Riak was restarted with new libumem version, solving the problem
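For reference, the resulting tunable is just an environment variable (launcher invocation hypothetical):

# force best-fit instead of instant-fit in the patched libumem
UMEM_OPTIONS=allocator=best riak start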

Page 45: Real-time in the real world: DIRT in production

Memory growth, cont.

• Not the first issue with the system memory allocator; depending on configuration, Riak may use libc’s malloc(), which isn’t designed to be scalable

• The man page does say it isn’t multi-thread scalable

• libumem was the answer (with the fix)

Page 46: Real-time in the real world: DIRT in production

Memory growth, cont.

• The fragmentation problem was interesting because it was unusual; it is not the most common source of memory growth!

• DIRTy systems are often event-oriented…

• ...in event-oriented systems, memory growth can be a consequence of either surging or drowning

• In an interpreted environment, memory growth can also come from memory that is semantically leaked

• Voxer — like many emerging DIRTy apps — has a substantial node.js component; how to debug node.js memory growth?

Page 47: Real-time in the real world: DIRT in production

Memory growth, cont.

• We have developed a postmortem technique for making sense of a node.js heap:

  OBJECT #OBJECTS #PROPS CONSTRUCTOR: PROP
fe806139        1      1 Object: Queue
fc424131        1      1 Object: Credentials
fc424091        1      1 Object: version
fc4e3281        1      1 Object: message
fc404f6d        1      1 Object: uncaughtException
...
fafcb229     1007     23 ClientRequest: outputEncodings, _headerSent, ...
fafc5e75     1034      5 Timing: req_start, res_end, res_bytes, req_end, ...
fafcbecd     1037      3 Object: aborted, data, end
 8045475     1060      1 Object:
fb0cee9d     1220      9 HTTPParser: socket, incoming, onHeadersComplete, ...
fafc58d5     1271     25 Socket: _connectQueue, bytesRead, _httpMessage, ...
fafc4335     1311     16 ServerResponse: outputEncodings, statusCode, ...

• Used by @izs to debug a nasty node.js leak

• Search for “findjsobjects” (one word) for details
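A sketch of the workflow against a node core dump, using the mdb_v8 debugger module (load syntax per mdb_v8 documentation; the core file name is whatever gcore(1) produces):

# gcore `pgrep -f node`    # dump core of the running node process
mdb core.<pid>
> ::load v8
> ::findjsobjects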

Page 48: Real-time in the real world: DIRT in production

CPU scheduling

• Problem: occasional latency outliers

• Analysis: no smoking gun. No slow I/O or locks. Some random dispatcher queue latency, but with CPU headroom.

$ prstat -mLc 1
   PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID
 17930 103       21 7.6 0.0 0.0 0.0  53  16 9.1 57K   1 73K   0 beam.smp/265
 17930 103       20 7.0 0.0 0.0 0.0  57  16 0.4 57K   2 70K   0 beam.smp/264
 17930 103       20 7.4 0.0 0.0 0.0  53  18 1.7 63K   0 78K   0 beam.smp/263
 17930 103       19 6.7 0.0 0.0 0.0  60  14 0.4 52K   0 65K   0 beam.smp/266
 17930 103      2.0 0.7 0.0 0.0 0.0  96 1.6 0.0  6K   0  8K   0 beam.smp/267
 17930 103      1.0 0.9 0.0 0.0 0.0  97 0.9 0.0   4   0  47   0 beam.smp/280
[...]
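Dispatcher queue latency itself can be quantized directly; a minimal sketch using the stable sched provider (a common idiom, not the exact analysis used here):

# time from runnable (enqueue) to running (dequeue), per process
dtrace -n '
    sched:::enqueue { ts[args[0]->pr_lwpid, args[1]->pr_pid] = timestamp; }
    sched:::dequeue /this->s = ts[args[0]->pr_lwpid, args[1]->pr_pid]/ {
        @[args[1]->pr_fname, "dispq latency (ns)"] = quantize(timestamp - this->s);
        ts[args[0]->pr_lwpid, args[1]->pr_pid] = 0;
    }'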

Page 49: Real-time in the real world: DIRT in production

CPU scheduling, cont.

• Unix 101

[Diagram: threads on per-CPU run queues; R = ready to run, O = on-CPU; the scheduler assigns threads to CPUs, and a higher-priority thread can preempt a running one]

Page 50: Real-time in the real world: DIRT in production

CPU scheduling, cont.

• Unix 102

• TS (and FSS) check for CPU starvation

[Diagram: a long run queue leaves threads CPU-starved; TS (and FSS) respond with priority promotion of the waiting threads]

Page 51: Real-time in the real world: DIRT in production

CPU scheduling, cont.

• Experimentation: run 2 CPU-bound threads, 1 CPU

• Subsecond offset heat maps:

Page 52: Real-time in the real world: DIRT in production

CPU scheduling, cont.

• Experimentation: run 2 CPU-bound threads, 1 CPU

• Subsecond offset heat maps:

[Heat map annotation: "THIS SHOULDN'T HAPPEN"]

Page 53: Real-time in the real world: DIRT in production

CPU scheduling, cont.

• Worst case (4 CPU-bound threads, 1 CPU): 44 seconds of dispatcher queue latency

# dtrace -n '
    sched:::off-cpu /execname == "burn1"/ { self->s = timestamp; }
    sched:::on-cpu /self->s/ {
        @["off-cpu (ms)"] = lquantize((timestamp - self->s) / 1000000,
            0, 100000, 1000);
        self->s = 0;
    }'

  off-cpu (ms)
           value  ------------- Distribution ------------- count
             < 0 |                                         0
               0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 387184
            1000 |                                         2256
            2000 |                                         1078
            3000 |                                         862
            4000 |                                         1070
            5000 |                                         637
            6000 |                                         535
[...]
           41000 |                                         3
           42000 |                                         2
           43000 |                                         2
           44000 |                                         1
           45000 |                                         0

[Distribution annotations: Expected, Bad, Inconceivable]

ts_maxwait at priority 59 is 32 s; what does FSS use? (See the dispadmin sketch below.)
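The TS dispatch table, including ts_maxwait for each priority, can be printed with dispadmin(1M):

# display the time-sharing class dispatcher parameter table
dispadmin -c TS -g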

Page 54: Real-time in the real world: DIRT in production

CPU scheduling, cont.

• Findings:

• FSS scheduler class bug:

• FSS uses a more complex technique to avoid CPU starvation: a thread’s priority could stay high, keeping it on-CPU for many seconds, before the priority decayed enough to let another thread run.

• Analyzed (more DTrace) and fixed (thanks Jerry Jelinek)

• DTrace analysis of the scheduler was invaluable

• Under (too) high CPU load, your runtime can be bound by how well you schedule, not by how well you do work

• Not the only scheduler issue we’ve encountered

Page 55: Real-time in the real world: DIRT in production

CPU scheduling, cont.

• CPU caps to throttle tenants in our cloud

• Experiment: add hot-CPU threads (saturation):

Page 56: Real-time in the real world: DIRT in production

CPU scheduling, cont.

• CPU caps to throttle tenants in our cloud

• Experiment: add hot-CPU threads:

[Heat map annotation: :-(]

Page 57: Real-time in the real world: DIRT in production

Visualizing CPU latency

• Using a node.js ustack helper and the DTrace profile provider, we can determine the relative frequency of stack backtraces in terms of CPU consumption

• Stacks can be visualized with flame graphs, a visualization we developed (a sketch of the workflow follows):
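A sketch of that workflow (profile rate, jstack() sizing, and script names follow Brendan’s FlameGraph repository; paths are assumptions):

# sample user-land node stacks at 97 Hz for 60 seconds, via the ustack helper
dtrace -x ustackframes=100 -n '
    profile-97 /execname == "node" && arg1/ { @[jstack(100, 8000)] = count(); }
    tick-60s { exit(0); }' -o out.stacks

# fold and render (https://github.com/brendangregg/FlameGraph)
./stackcollapse.pl out.stacks > out.folded
./flamegraph.pl out.folded > out.svg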

Page 58: Real-time in the real world: DIRT in production

DIRT in production

• node.js is particularly amenable to the DIRTy apps that typify the real-time web

• The ability to understand latency must be considered when deploying node.js-based systems into production!

• Understanding latency requires dynamic instrumentation and novel visualization

• At Joyent, we have added DTrace-based dynamic instrumentation for node.js to SmartOS, and novel visualization into our cloud and software offerings

• Better production support — better observability, better debuggability — remains an important area of node.js development!

Page 59: Real-time in the real world: DIRT in production

Beyond node.js

• node.js is adept at connecting components in the system; it is unlikely to be the only component!

• As such, when using node.js to develop a DIRTy app, you can expect to spend as much time (if not more!) understanding the components as the app

• When selecting components — operating system, in-memory data store, database, distributed data store — observability must be a primary consideration!

• When building a team, look for full-stack engineers — DIRTy apps pose a full-stack challenge!

Page 60: Real-time in the real world: DIRT in production

Thank you!

• @mranney for being an excellent guinea pig customer

• @dapsays for the V8 DTrace ustack helper and V8 debugging support

• More information: http://dtrace.org/blogs/brendan, http://dtrace.org/blogs/dap, and http://smartos.org