advanced os performance tools and their more basic alternatives … · 2019. 12. 6. · systemtap...

38
Using the big guns Advanced OS performance tools and their more basic alternatives for tackling production issues Nikolai Savvinov, Deutsche Bank

Upload: others

Post on 24-Jan-2021

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Advanced OS performance tools and their more basic alternatives … · 2019. 12. 6. · Systemtap safety “In practice, there are several weak points in systemtap and the underlying

Using the big gunsAdvanced OS performance tools and their more basic alternatives for

tackling production issues

Nikolai Savvinov, Deutsche Bank

Page 2: Advanced OS performance tools and their more basic alternatives … · 2019. 12. 6. · Systemtap safety “In practice, there are several weak points in systemtap and the underlying

About me

• worked with databases for 19 years (started with MySQL)

• working with Oracle database since approx. 2005

• Oracle database engineer, Deutsche Bank Moscow/London 2007-now

• Last 10 years, specialize in database performance/internals

• blog: savvinov.com, twitter: @oradiag

Page 3: Advanced OS performance tools and their more basic alternatives … · 2019. 12. 6. · Systemtap safety “In practice, there are several weak points in systemtap and the underlying

About this presentation

• 50/50 overview of tools / practical use cases

• All examples based on Oracle Linux 6 (kernel 4.1.12), Exadata X6-2 EF 3-node cluster

• All opinions are my own and not of my employer (or Oracle corp)

• Caveat emptor

Page 4: Advanced OS performance tools and their more basic alternatives … · 2019. 12. 6. · Systemtap safety “In practice, there are several weak points in systemtap and the underlying

Why use OS performance tools?

• Database instruments things that are expected to be slow

• Also, bugs and blind spots

• Database doesn’t sit in a vacuum, there isn’t always clear-cut separation between application/database/OS/hw layers

• Better understand Unix/Storage/Network when jointly troubleshooting issues

Page 5: Advanced OS performance tools and their more basic alternatives … · 2019. 12. 6. · Systemtap safety “In practice, there are several weak points in systemtap and the underlying

What kind of database prod issues are solved with OS performance tools

• Memory issues• low-level memory leaks• memory fragmentation• NUMA• swapping

• Kernel-level locking• I/O issues

• e.g. suboptimal low-level I/O settings

• Filesystem issues• e.g. too many files in a directory because a bug leading to excessive tracing

• Network issues• e.g. poor TCP throughput due to congestion events

Page 6: Advanced OS performance tools and their more basic alternatives … · 2019. 12. 6. · Systemtap safety “In practice, there are several weak points in systemtap and the underlying

Overview of Linux performance tools

Page 7: Advanced OS performance tools and their more basic alternatives … · 2019. 12. 6. · Systemtap safety “In practice, there are several weak points in systemtap and the underlying

Basic process level tools

• pidstat• -u for CPU (default)

• -d for disk I/O

• -r for memory

• ps• -e to do output for all processes

• -o to pick fields you like

• e.g. wchan, state, rss

• most useful when used “ASH-style” in ExaWatcher / OSWatcher

• especially when combined with “OEM-style” visualization

Page 8: Advanced OS performance tools and their more basic alternatives … · 2019. 12. 6. · Systemtap safety “In practice, there are several weak points in systemtap and the underlying

WCHAN

• WCHAN: the outermost system call where the process is waiting

• Zero-overhead stack profiling tool that is always on (w/ OSWatcher)

• Low frequency (every few seconds)

• Sometimes doesn’t go all the way for some reason

• Off CPU only

• Kernel-space only

Page 9: Advanced OS performance tools and their more basic alternatives … · 2019. 12. 6. · Systemtap safety “In practice, there are several weak points in systemtap and the underlying

WCHAN example

• Removed state=‘S’ (interruptible sleep)

• cma_acquire_dev is biggest

• CMA = contiguos memory allocation

• called inside rdma_bind_addr

• in drivers/infiniband/core/cma.c

• RDMA is core IB technology

• Exafusion relies on RDMA

• RDMA is not NUMA-friendly

Page 10: Advanced OS performance tools and their more basic alternatives … · 2019. 12. 6. · Systemtap safety “In practice, there are several weak points in systemtap and the underlying

DIY-profilers

• debugger command dumping a stack with a loop around it

• not always stable/safe/predictable

• pstack (Tanel Poder’s “poor man’s profiler”)

• /proc/<pid>/stack (Luca Canali’s kstacksampler)

• oradebug short_stack

Page 11: Advanced OS performance tools and their more basic alternatives … · 2019. 12. 6. · Systemtap safety “In practice, there are several weak points in systemtap and the underlying

Profilers/tracers in Linux

• perf

• systemtap

• dtrace for Oracle Linux

• ftrace

• lttng

https://sourceware.org/systemtap/wiki/SystemtapDtraceComparison

http://www.brendangregg.com/blog/2015-07-08/choosing-a-linux-tracer.html

Page 12: Advanced OS performance tools and their more basic alternatives … · 2019. 12. 6. · Systemtap safety “In practice, there are several weak points in systemtap and the underlying

Scope of profiling tools

• The output from a profiler does not always represent wall clock time

• The standard (on-CPU) profiling is only valid for CPU-intensive loads

• Off-CPU profiling can be tricky (e.g. perf requires CONFIG_SCHEDSTATS=y which is not the case for some Linux distributions), DIY-profilers or WCHAN can help

• User-space vs kernel-space: a profiler can only be giving you one half of the picture, and not necessarily the one you need

• Be sure to pick the right profiler for the task

Page 13: Advanced OS performance tools and their more basic alternatives … · 2019. 12. 6. · Systemtap safety “In practice, there are several weak points in systemtap and the underlying

Safety considerations

• ptrace-based tools (strace, gdb, pstack etc.) are less safe

• systemtap had some teething problems in early days,considered relatively safe now, but still has issues

• Oracle Linux dtrace, perf and bcc(bpf) are probably the safest

• It’s a good idea to build a large arsenal of tools

• Use UAT as much as possible: even if the issue doesn’t reproduce itself in all its entirety, doesn’t mean some aspects of it can’t be reproduced

• If the problem can’t be reproduced on UAT, one can try to reproduce it as isolated mockup activity on production – so if tracing crashes it, no biggie

• Balance of risk: side effects of diagnostics vs issue going undiagnosed

Page 14: Advanced OS performance tools and their more basic alternatives … · 2019. 12. 6. · Systemtap safety “In practice, there are several weak points in systemtap and the underlying

Case study: memory fragmentation

Page 15: Advanced OS performance tools and their more basic alternatives … · 2019. 12. 6. · Systemtap safety “In practice, there are several weak points in systemtap and the underlying

Step 1: high-level picture

• Problems started shortly after 18c upgrade

• The main symptom experienced by users was connection delays

• The obvious things to check were listener, cluster database alert and kernel logs

• On the system level, somewhat elevated sys CPU was noticed

• WCHAN analysis didn’t reveal any interesting waits, but showed that extended periods of busy CPU for some of the listeners

• pidstat confirmed there were periods of 100% sys CPU for listener processes

Page 16: Advanced OS performance tools and their more basic alternatives … · 2019. 12. 6. · Systemtap safety “In practice, there are several weak points in systemtap and the underlying

Step 2: getting stacks

• We know what processes we want

• We are interested in on-CPU samples

• We are interested in kernel-space stacks

• Two obvious choices: /proc/<pid>/stack based sampling or“proper” stack profiling using perf

• Both are sufficiently safe, but perf can cause noticeable overhead, so we started with kstacksampler, but then also used perf

Page 17: Advanced OS performance tools and their more basic alternatives … · 2019. 12. 6. · Systemtap safety “In practice, there are several weak points in systemtap and the underlying

Step 3 (optional here): visualization

Page 18: Advanced OS performance tools and their more basic alternatives … · 2019. 12. 6. · Systemtap safety “In practice, there are several weak points in systemtap and the underlying

Step 4: make sense out of results

• Identify biggest branch(es)

• Identify stack structure, e.g. in this example• Oracle Network Session layer calls• Oracle Network Transport calls• VFS syscalls• TCP syscalls• page allocation• direct compaction• page migration

• Identify key elements

• Read relevant parts of documentation

• Look at the source code

• Read source comments + git blame/history

Page 19: Advanced OS performance tools and their more basic alternatives … · 2019. 12. 6. · Systemtap safety “In practice, there are several weak points in systemtap and the underlying

Sidenote: compaction and fragmentation

• Compaction is an algorithm for memory defragmentation

• Normally kernel shouldn’t care if memory is fragmented

• Some pieces do, however (like device drivers)

• Apparently, TCP implementation also relies on contiguous allocations

• A chunk is 2^N pages (4kB), N = order

• When initial allocation attempt fails, there are a number of possible fallback strategies (depending on GFP flags)

• One common scenario is direct compaction

• While doing compaction, the process will be unresponsive

Page 20: Advanced OS performance tools and their more basic alternatives … · 2019. 12. 6. · Systemtap safety “In practice, there are several weak points in systemtap and the underlying

Step 5: finding solution

Page 21: Advanced OS performance tools and their more basic alternatives … · 2019. 12. 6. · Systemtap safety “In practice, there are several weak points in systemtap and the underlying

Step 6: getting additional detail

sudo perf probe --add 'sk_stream_alloc_skb sk=%di size=%si gfp=%dx force=%cx'

sudo perf probe --add '__alloc_pages_nodemask gfp_mask=%di order=%sizonelist=%dx zonemask=%cx'

sudo perf record -e probe:sk_stream_alloc_skb --filter 'size>0x1000’ –e probe:__alloc_pages_nodemask --filter 'order>4' -agR sleep 100

============================================================

tgtd 5216 [004] 5706280.274976: probe:sk_stream_alloc_skb: (ffffffff816232d0) sk=0xffff8804959b9800 size=0x1b30 gfp=0xd0 force=0x40

ibportstate 17206 [003] 5706280.311569: probe:__alloc_pages_nodemask: (ffffffff81191fa0) gfp_mask=0x2c0 order=0x6 zonelist=0xffff88407ffd8e00 zonemask=0x0

7fff81193fa1 __alloc_pages_nodemask ([kernel.kallsyms])

7fff81069641 x86_swiotlb_alloc_coherent ([kernel.kallsyms])

7fffa01bd88f mlx4_alloc_icm ([kernel.kallsyms])

Page 22: Advanced OS performance tools and their more basic alternatives … · 2019. 12. 6. · Systemtap safety “In practice, there are several weak points in systemtap and the underlying

Step 7: digging a little bit deeper…

Page 23: Advanced OS performance tools and their more basic alternatives … · 2019. 12. 6. · Systemtap safety “In practice, there are several weak points in systemtap and the underlying

Summary

• OS tools can be very useful or even necessary for troubleshooting complex cluster or database issues

• Much can be done with basic risk-free OS tools like ps

• Some tracing/profiling tools are safer than others

• Low-level OS tools become safer over time, but can still carry risk

• There are ways to minimize the risk

• Weigh the risk of side effects against the risk of not solving the issue

Page 24: Advanced OS performance tools and their more basic alternatives … · 2019. 12. 6. · Systemtap safety “In practice, there are several weak points in systemtap and the underlying

Credits

• Brendan Gregg – Linux performance expert

• Tanel Poder, Luca Canali, Frits Hoogland, Andrey Nikolaev, Alexander Anokhin – pioneered use of OS low-level tools in Oracle troubleshooting

• Thanks UKOUG organizers for the opportunity!

Page 25: Advanced OS performance tools and their more basic alternatives … · 2019. 12. 6. · Systemtap safety “In practice, there are several weak points in systemtap and the underlying

Bonus slides

Page 26: Advanced OS performance tools and their more basic alternatives … · 2019. 12. 6. · Systemtap safety “In practice, there are several weak points in systemtap and the underlying

WCHAN example 2: NUMA balancing

Page 27: Advanced OS performance tools and their more basic alternatives … · 2019. 12. 6. · Systemtap safety “In practice, there are several weak points in systemtap and the underlying

WCHAN example 3: inode cache depletion

Page 28: Advanced OS performance tools and their more basic alternatives … · 2019. 12. 6. · Systemtap safety “In practice, there are several weak points in systemtap and the underlying

Systemtap safety

“In practice, there are several weak points in systemtap and the underlying kprobes system at the time of writing. Putting probes indiscriminately into unusually sensitive parts of the kernel (low level context switching, interrupt dispatching) has reportedly caused crashes in the past. We are fixing these bugs as they are found, and constructing a probe point “blacklist”, but it is not complete”

Frank Ch. Eigler, Systemtap tutorial, November 2019

https://sourceware.org/systemtap/tutorial.pdf

Page 29: Advanced OS performance tools and their more basic alternatives … · 2019. 12. 6. · Systemtap safety “In practice, there are several weak points in systemtap and the underlying

Bcc (bpf) safety

It is unlikely that the programs should cause the kernel to crash, loop or become unresponsive because they run in a safe virtual machine in the kernel.

Page 30: Advanced OS performance tools and their more basic alternatives … · 2019. 12. 6. · Systemtap safety “In practice, there are several weak points in systemtap and the underlying

Hanging the system with a simple “cat”

Page 31: Advanced OS performance tools and their more basic alternatives … · 2019. 12. 6. · Systemtap safety “In practice, there are several weak points in systemtap and the underlying

Other low-level OS tools• tcpdump

• iosnoop

Page 32: Advanced OS performance tools and their more basic alternatives … · 2019. 12. 6. · Systemtap safety “In practice, there are several weak points in systemtap and the underlying

Tcpdump for network performance

• SQL*Net tracing on client/server side doesn’t always reveal problem

• Network-side metrics don’t always reveal problem

• Various pings almost never reveal the problem

• Many TCP performance problems have to do with congestion control

• A variety of tools for analyzing the dumps, e.g. Wireshark

• Can dump to ASCII and use own tools

Page 33: Advanced OS performance tools and their more basic alternatives … · 2019. 12. 6. · Systemtap safety “In practice, there are several weak points in systemtap and the underlying

Our case

• Log file sync delays, sometimes spiking to tens of seconds

• Production synchronously replicated to standby via DataGuard

• Synchronicity was essential (Max Availability)

• Pings show nothing

• netops say the network is fine

Page 34: Advanced OS performance tools and their more basic alternatives … · 2019. 12. 6. · Systemtap safety “In practice, there are several weak points in systemtap and the underlying

Analysis

• We did a tcpdumpcapture at both ends

• Tcpdump shows congestion window (bytes in flight)

• It was shrinking in response to congestion events

Page 35: Advanced OS performance tools and their more basic alternatives … · 2019. 12. 6. · Systemtap safety “In practice, there are several weak points in systemtap and the underlying

Remediation

• Remediation: netops removed bottlenecks, optimized QoS, top users working on improving colocality of their estate

• Monitoring/alerting: how do we define thresholds for packet loss?

• Zero-loss networks are expensive

• In a non zero-loss network, what level of packet loss is acceptable?

• Relationship between throughput and packet loss in TCP was approximated by Mathis in 1997

• 𝑡ℎ𝑟𝑜𝑢𝑔ℎ𝑝𝑢𝑡 ≤𝑀𝑆𝑆

𝑙𝑎𝑡𝑒𝑛𝑐𝑦 𝑝𝑎𝑐𝑘𝑒𝑡 𝑙𝑜𝑠𝑠

• Assuming 1 ms latency, this meant 1.5

𝑝𝑎𝑐𝑘𝑒𝑡 𝑙𝑜𝑠𝑠MB/s, or 0.25% for 30 MB/s

Page 36: Advanced OS performance tools and their more basic alternatives … · 2019. 12. 6. · Systemtap safety “In practice, there are several weak points in systemtap and the underlying

Iosnoop

• gives a high resolution picture of I/O usage (low-res can be obtained from iotop)

• ftrace based

• reported safe

• observed high performance overhead

Page 37: Advanced OS performance tools and their more basic alternatives … · 2019. 12. 6. · Systemtap safety “In practice, there are several weak points in systemtap and the underlying

Our case

• General slowness on one of the nodes during certain periods

• Nothing helpful in AWR/ASH

• ExaWatcher iostat showed I/O spikes on another node

• High “reliable message” waits on that other node

• iotop didn’t reveal the culprit

Page 38: Advanced OS performance tools and their more basic alternatives … · 2019. 12. 6. · Systemtap safety “In practice, there are several weak points in systemtap and the underlying

Analysis & remediation

• iosnoop told us the high I/O was from admin f/s housekeeping

• housekeeping had too much work due to excessive tracing

• processes were slow due to slow writes to trace files

• slowness propagated to another node via inter-node communication (“reliable message”)

• Remediation: excessive tracing patched, housekeeping job optimized, old files moved manually, scheduling clash resolved