
© Copyright IBM Corporation 2016

Getting Started with Linux Performance on IBM POWER
Steve Nasypany, [email protected]

2

POWER Performance

▪ Our goal is to cover the most common performance issues encountered in Linux Proof-of-Concepts

▪ From this lesson, you should learn:
 – References, links, packages
 – Answers to common performance questions
 – Best tools, metrics
 – Configuring CPUs and SMT
 – What is my CPU consumption?
 – What is my working memory consumption (out of memory?)
 – Am I paging, and if so, what process is paging?
 – What about NUMA effects on POWER?
 – How do I check storage IO performance?
 – VIOS, Shared Ethernet monitoring

▪ What we won’t cover:
 – Application porting tools/debug
 – Standard network analysis, kernel tracing

3

▪ POWER7 & POWER8
▪ PowerVM Hypervisor
▪ AIX, i & Linux
▪ Java, WAS, DB2…
▪ Compilers & optimization
▪ Performance tools & tuning

POWER Optimization & Tuning

http://www.redbooks.ibm.com/abstracts/sg248171.html

4

Linux on Power Community Wiki (biased towards developers, not admins): https://ibm.biz/BdDKbG
Performance Best Practices: https://ibm.biz/BdDGEa
Linux on Power Performance FAQs
Using Advanced Toolchain
PowerKVM Guest Tunings (VirtIO, disk, Hugepages, etc.): https://ibm.biz/Bd4ZB5

▪ IBM developerWorks PowerLinux Community is the go-to spot for the latest on Linux on PowerVM and PowerKVM

https://ibm.biz/Bd4ZBJ

▪Public forum is monitored by development teams and they are aggressive about addressing install, distro and performance questions

–Not a porting resource or a replacement for normal support, but documents, tools, best practices for developers, installers and performance specialists are provided

–Be sure to search the forum with your issue before posting

POWER Performance: Linux

5

Red Hat Performance Guide https://access.redhat.com/site/documentation/en-US/Red_Hat_Enterprise_Linux/7/html/Performance_Tuning_Guide/

SUSE Linux Analysis & Tuning Guide https://www.suse.com/documentation/sles11/book_sle_tuning/data/book_sle_tuning.html

Ubuntu LE Wiki & Support https://wiki.ubuntu.com/ppc64el

Brendan Gregg’s performance series http://www.brendangregg.com/linuxperf.html

POWER Performance: Linux Distros

6

▪ What are some of the tooling packages for Linux on Power? Start with the IBM POWER Linux Tools Repository install:

https://ibm.biz/Bd4Zdx
 – system-config-*           POWER hardware diag, misc tools
 – ibm-power-managed-rhel6   HMC or IVM managed
 – nmon                      nmon tool, http://nmon.sourceforge.net (see pre-compiled download link)
 – pseries-energy            POWER Energy Mgmt
 – sysstat                   iostat, sar, mpstat, pidstat
 – numa*                     NUMA policies
 – ppc64-utils*              lparstat, others
 – sg3_utils                 SCSI & FC tools
 – sysfsutils                HBA related tools
 – tuned                     RH tuning daemon (tuned-adm utility)

▪ libvirt - package to support management of virtual devices. Supported on PowerKVM & PowerVM (though PowerVM use is uncommon in practice)

https://libvirt.org/drvphyp.html

Packages?
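A minimal sketch of pulling the common tools onto an RPM-based distro, assuming the IBM Power Tools repository linked above is already configured; package availability varies by distro and release, and nmon may instead come from the sourceforge link on this slide:

  # Install the monitoring/diagnostic packages named above (RHEL example)
  yum install sysstat numactl sg3_utils sysfsutils tuned nmon
  # Quick sanity check that the key commands are present
  iostat -V; mpstat -V; numastat; ppc64_cpu --info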

7

▪ Linux Performance Customer Profiler Utility (lpcpu)
 – Utility that integrates system configuration data collection, performance profiling data collection, profiler data post-processing, and graphing into a single package
 – Sort of the AIX “perfpmr” for Linux, with plug-ins
 – Runs iostat, mpstat, vmstat, perf, meminfo, top, sar and optionally oprofile

▪ Deep profiling for Linux (for developers) - Evaluate performance for Linux on Power:

http://www.ibm.com/developerworks/linux/library/l-evaluatelinuxonpower/

▪ PowerPC utilities (code and manpages) http://sourceforge.net/projects/powerpc-utils/?source=directory

Special Linux Tooling/References

8

Other commands you might need

                          AIX                  Linux
 Release, HW levels       oslevel, uname       cat /etc/issue, uname
 SMT control              smtctl               ppc64_cpu --smt
 CPU details              cat /proc/cpuinfo    cat /sys/devices/system/cpu/cpufreq
 Memory affinity          lssrad -va, rset     numactl --hardware
 Monitoring affinity      topas 'M'            numastat (RHEL 7, SUSE 11), mpstat -d
 Network                  entstat, netstat     ethtool, netstat, sar
 System/Lib/App profiler  tprof                perf, oprofile

9

POWER: Knowing your Linux Environment

# cat /proc/cpuinfo

Linux on PowerVM:
  platform : pSeries
  model    : IBM,8231-E2B
  machine  : CHRP IBM,8231-E2B

Linux client on PowerKVM:
  platform : pSeries
  model    : IBM pSeries (emulated by qemu)
  machine  : CHRP IBM pSeries (emulated by qemu)

PowerKVM or non-virtualized Linux install:
  platform : PowerNV
  model    : 8286-42A
  machine  : PowerNV 8286-42A
  firmware : OPAL v3

Check your CPU speed in any Proof-of-Concept test – previous users may have enabled power management features:
  ppc64_cpu --frequency

10

Stress Tools

▪ The Linux hdparm utility supports testing read timings on disk devices: -t for device reads, -T for cached reads (example at the end of this list)

▪ Stress (also stress-ng source) package (variety of tools): http://www.rpmfind.net/linux/rpm2html/search.php?query=stress(ppc-64)

▪ FIO I/O stress tools: http://rpmfind.net/linux/rpm2html/search.php?query=fio%28ppc-64%29

▪ sg3_utils package sg_dd command

▪ Nigel Griffiths’ nstress package is now available on AIX & Linux
 – Supports CPU, memory and storage I/O testing
 – AIX, RHEL 6.5 & 7, SLES 11 SP3, SLES 12, Fedora 20
 – Linux function may not be fully equivalent to AIX

https://ibm.biz/Bd4ZdX
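A quick hdparm read-timing check, as mentioned in the first bullet above; /dev/sda is a placeholder device name and the test needs root:

  # Sequential device reads (-t) and cached reads (-T); run a few passes and average
  hdparm -tT /dev/sda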

11

▪ Where do I start?
 – For Linux, major issues/warnings are typically found in:
   • /var/log/messages   (RHEL)
   • /var/log/warn       (SUSE)
   • /var/log/syslog     (Ubuntu; older levels use ~messages)
 – For those of you familiar with AIX:
   • Linux shares enough commonalities with traditional UNIX concepts that CPU, I/O and Network debug use tooling with common names
   • Memory management is somewhat different, and you will need additional tools on Linux to dig deeper

▪ How do we monitor for shared CPU pool constraints like in AIX?
 – In PowerVM:
   • While Linux instances technically have access to pool information, tools like lparstat on Linux do not reliably collect it
   • Some Linux tools can provide metrics to show local physical and entitlement utilization on PowerVM
 – On PowerKVM, we use the same tools to view overall utilization

Common Questions

12

▪ Best tool for interactive and short-term analysis of AIX and LoP resources?
 – nmon provides access to most of the important metrics required for benchmarks, proof-of-concepts and regular monitoring
   • CPU               vmstat, sar, lparstat, mpstat
   • Memory            vmstat
   • Paging            vmstat
   • Disk              iostat, sar
   • Adapter           iostat
   • Network           netstat, ifconfig
   • Process/Threads   ps
   • Threads           ps, trace tools
 – Latest nmon is supported on AIX, Linux on PowerVM and PowerKVM
   • RHEL 6.5/7.1, SLES 11.3/12, Ubuntu 14.04/14.10
   • Also supported on x86!

▪ Recordings can be post-processed with other tools (see the example below):
 – nmon Analyser: https://ibm.biz/BdDGJZ
 – nmon Consolidator
 – nmonchart: https://ibm.biz/Bd4ZMm

Common Questions: Best Tool?
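A minimal recording sketch for feeding the Analyser/nmonchart tools above; the interval and count are illustrative (30-second samples for one hour):

  # Record to a .nmon file in /tmp: -f file output, -s interval (sec), -c sample count, -t top processes
  nmon -f -s 30 -c 120 -t -m /tmp
  # Load the resulting hostname_date_time.nmon file into nmon Analyser or nmonchart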

13

What is nmon/Analyser/Chart good at?

▪ nmon is very powerful interactively, but many prefer working from recordings and using the nmon Analyser spreadsheet tool

▪ CPU: Physical consumption, shared pool utilization, entitlement consumption, user & system percentages

▪ Global Memory: Computational, Paging

▪ Storage IO: hdisk balance, adapter rates, IO per second (IOPS or tps)

▪ Network: Rates, packet sizes and SEA (VIOS)

14

What is nmon Analyser good at that I may not know?

▪ BBBP panel dumps a variety of useful configuration information

▪ BBBP contents:
 – Firmware level:          AIX lsconf; Linux lsmcode, ppc64_utils
 – Configuration:           AIX lscfg; Linux lscfg, ppc64_utils
 – LPAR configuration:      AIX lparstat -i; Linux cat /proc/ppc64/lparcfg
 – Memory details:          AIX vmstat -v/-s; Linux cat /proc/meminfo
 – Memory affinity layout:  lssrad -va
 – CPU/Affinity snapshot:   mpstat -d
 – IO memory buffers:       vmstat -v
 – Network settings:        AIX ifconfig; Linux ifconfig, netstat

15

nmon Analyser BBBP

AIX: lsconf

and much more… collected at start/end of recording

Linux: /proc/ppc64/lparcfg, lparstat –i contains most of these

16

Important Metrics: CPU & Partition Memory/Paging

CPU – Dedicated
 – User/System/Idle/Wait:        vmstat, sar, lparstat
 – Linux %Steal:                 vmstat, sar, nmon
 – SMT:                          Linux ppc64_cpu; AIX smtctl
 – Run Queue, Context Switches:  vmstat, sar -q/-w, mpstat

CPU – Shared
 – Physical consumed, Entitlement consumed:  AIX lparstat, vmstat; Linux nmon
 – Available Pool:                           AIX lparstat; Linux n/a

Memory – RAM
 – Total Size:      AIX vmstat -v "memory", svmon "size"; Linux free
 – Working / Free:  AIX vmstat -v, svmon virt; Linux vmstat, smem
 – Cache:           AIX vmstat -v "client"; Linux vmstat, smem

Memory – Paging
 – Total Size:             AIX lsps -a; Linux free
 – In Use:                 AIX lsps/svmon; Linux free
 – Scan Rate & Free Rate:  AIX vmstat "sr" & "fr"
 – Pages In/Out:           vmstat, vmstat -s

Affinity/NUMA
 – PowerVP provides AIX, i & Linux partition placement info
 – AIX: lssrad -va, topas 'M'; Linux: numactl, numastat (RHEL 7)

nmon and nmon Analyser provide all of these metrics where relevant

17

Important Metrics: IO

Hdisk
 – %busy, IO/sec, KB/sec:       iostat, sar -d
 – Read/Write IOPS & KB:        iostat
 – Avg Service Time(s):         AIX iostat -D, sar -d
 – Service Queue Full or Wait:  AIX iostat -D, sar -d; Linux iostat -x, sar -d

Storage Adapter
 – IO/sec:                  AIX iostat -as; Linux see vendor adapter tools
 – Read/Write Bytes:        iostat -as
 – %IO relative to bandwidth (est. from adapter rates)
 – Service Queue Counters:  AIX fcstat, MPIO pkg commands; Linux 'options' for Emulex/Qlogic drivers

Network Adapter
 – Send/Receive Packets:            AIX entstat, netstat; Linux netstat
 – Send/Receive MB:                 AIX entstat, netstat; Linux netstat
 – %IO relative to bandwidth (est. from adapter rates)
 – Packet errors, drops, timeouts:  netstat, entstat; Linux ethtool

nmon and nmon Analyser provide these metrics; see also VIOS Performance Advisor 2.2.3

18

CPU: PowerKVM capabilities

▪ KVM host CPU configuration is viewed and edited with the virsh command:

# virsh capabilities
…
<topology>
  <cells num='2'>
    <cell id='0'>
      <memory unit='KiB'>133019072</memory>
      <pages unit='KiB' size='64'>2078423</pages>
      <pages unit='KiB' size='16384'>0</pages>
      <pages unit='KiB' size='16777216'>0</pages>
      <distances>
        <sibling id='0' value='10'/>
        <sibling id='1' value='20'/>
      </distances>
      <cpus num='5'>
        <cpu id='0' socket_id='0' core_id='32' siblings='0'/>
        <cpu id='8' socket_id='0' core_id='48' siblings='8'/>
        …
        <cpu id='32' socket_id='0' core_id='112' …
      </cpus>
    </cell>
    <cell id='1'>
      …
      <cpu id='40' socket_id='1' core_id='160' siblings…
      <cpu id='48' socket_id='1' core_id='168' siblings…
      …

Notes:
 – Cells are NUMA nodes; <cell id='1'> is the second node
 – <memory> shows memory node assignments; <cpus> lists the CPUs per node
 – <distances> shows node distance assignments (used by the numactl & numastat utilities)

19

CPU: SMT configuration for Linux guests

▪ On PowerKVM, use ppc64_cpu --smt=X [1,2,4,8] to adjust SMT

[ats@fire1 ~]$ ppc64_cpu
Usage: ppc64_cpu [command] [options]
ppc64_cpu --smt                     # Get current SMT state
ppc64_cpu --smt={on|off}            # Turn SMT on/off
ppc64_cpu --smt=X                   # Set SMT state to X
ppc64_cpu --cores-present           # Get the number of cores present
ppc64_cpu --cores-on                # Get the number of cores online
ppc64_cpu --cores-on=X              # Put exactly X cores online
ppc64_cpu --dscr                    # Get current DSCR system setting
ppc64_cpu --dscr=<val>              # Change DSCR system setting
ppc64_cpu --dscr [-p <pid>]         # Get DSCR setting for process <pid>
ppc64_cpu --dscr=<val> [-p <pid>]   # Change DSCR for process <pid>
ppc64_cpu --smt-snooze-delay        # Get current smt-snooze-delay setting
ppc64_cpu --smt-snooze-delay=<val>  # Change smt-snooze-delay setting
ppc64_cpu --run-mode                # Get current diagnostics run mode
ppc64_cpu --run-mode=<val>          # Set current diagnostics run mode
ppc64_cpu --frequency [-t <time>]   # Determine cpu frequency for <time>
                                    # seconds, default is 1 second.
ppc64_cpu --subcores-per-core       # Get number of subcores per core
ppc64_cpu --subcores-per-core=X     # Set subcores per core to X (1 or 4)
ppc64_cpu --threads-per-core        # Get threads per core
ppc64_cpu --info                    # Display system state information
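A minimal sketch of dropping a host or guest to SMT4 and confirming the result; the value is illustrative:

  # Set SMT to 4 threads per core, then verify
  ppc64_cpu --smt=4
  ppc64_cpu --smt       # reports the current SMT state
  ppc64_cpu --info      # active threads are marked with '*'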

20

CPU: SMT on PowerKVM, PowerVM & guests

▪ PowerKVM will report SMT as off on the host, but in actuality it has reserved the first thread on each core; guests use the SMT capability of those cores based on their configuration definitions (as virtual CPUs)

▪ On the PowerKVM host or guests, the ppc64_cpu command lists settings:

  # ppc64_cpu --smt
  SMT is off
  # ppc64_cpu --info
  Core 0:  0*  1  2  3  4  5  6  7
  Core 1:  8*  9 10 11 12 13 14 15

▪ PowerVM Linux guest with 2 Virtual Processors and SMT8 enabled:

  # ppc64_cpu --info
  Core 0:  0*  1*  2*  3*  4*  5*  6*  7*
  Core 1:  8*  9* 10* 11* 12* 13* 14* 15*

21

CPU: SMT configuration for Linux guests

▪ On PowerKVM, use the virsh command or edit the client's XML configuration
 – Guests will then be able to use ppc64_cpu to adjust local SMT
 – On RHEL 6.5 and SLES 11 SP3, the maximum threads supported per core is SMT4

# virsh list                          (list guests)

# virsh edit [guest name]
…
<vcpu>16</vcpu>                       (VCPU count = sockets * cores * threads)
<cpu>
  <topology sockets='1' cores='2' threads='8'/>
</cpu>

# lscpu | grep per                    (verify guest config)
Thread(s) per core:    8
Core(s) per socket:    2

22

▪ RHEL 7.1 Dispatcher/SMT fixes
 – SMT8 performance degraded over repeated runs; SMT4 not impacted
 – Updating to kernel 3.10.0-229.11.1.el7 fixed the issue
 – Should impact BE and LE versions, and could be present in other distros
 – Exact defect fix is not known; possibly https://ibm.biz/BdHguF

▪ Linux and PowerKVM are rapidly evolving
 – New dispatching mechanisms, more like PowerVM micro-threading (v3.1): https://patchwork.ozlabs.org/patch/490575/
 – Most Scale-Out/HPC environments may not care about this capability, but we need bleeding-edge POCs to experiment and provide feedback
 – Set SMT based on application-space recommendations; ask on the developerWorks Community if not sure
 – MariaDB benchmark example with various distros and SMT settings: https://github.com/bwgartner/ovh-power8-mariadb-benchmarking

SMT Fixes/Advances

23

CPU Configuration: lparstat -i (not supported on PowerKVM)

# lparstat -i
Node Name                        : *.dfw.ibm.com
Partition Name                   : pvcmgr1
Partition Number                 : 19
Type                             : Shared
Mode                             : Capped
Entitled Capacity                : 0.50
Partition Group-ID               : 32787
Shared Pool ID                   : 0
Online Virtual CPUs              : 2
Maximum Virtual CPUs             : 4
Minimum Virtual CPUs             : 1
Online Memory                    : 8073792 kB
Minimum Memory                   : 1024
Minimum Capacity                 : 0.20
Maximum Capacity                 : 1.00
Capacity Increment               : 0.01
Active Physical CPUs in system   : 32
Active CPUs in Pool              : 32
Maximum Capacity of Pool         : 32.00
Entitled Capacity of Pool        : 1330
…
Memory Mode                      : Shared
Total I/O Memory Entitlement     : 8589934592
Variable Memory Capacity Weight  : 0

24

CPU: Metrics & Steal

▪ In AIX, idle time is ceded to the hypervisor; much of this is reusable by other VMs
▪ In Linux, user/system/wait/idle times are reported in various CPU tools, and in some, the additional steal metric is reported
 – Steal time is the time a virtual CPU waits for a real CPU while the hypervisor is servicing another virtual processor (same or different virtual machine)
 – Steal is accounted for separately from idle, not as a percentage of it
 – Generally, if:
   • %idle is low, the CPU is very busy and does not have idle capacity
   • %wait is high, the CPU is ready to run but waiting on I/O completions
   • %steal is persistently > 20-25%, your partition may need more entitlement
   • %system is equal to or persistently higher than %user on a busy system, you should research why the kernel demands such high resources

# sar 1
Linux 2.6.32-504.1.3.el6.ppc64 (*.dfw.ibm.com)   03/12/2015   _ppc64_   (16 CPU)

10:31:14 AM  CPU  %user  %nice  %system  %iowait  %steal  %idle
10:31:15 AM  all  10.89   3.00     0.63     0.00   36.55  48.94
10:31:16 AM  all  11.49   3.12     1.37     0.00   40.45  43.57
10:31:17 AM  all  10.40   3.13     0.63     0.00   35.21  50.63
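A small hedged sketch for scanning sar output against the steal guideline above; the 25% threshold is the one quoted earlier, and the column position of %steal follows this sample output, so it may differ by sysstat version and locale:

  # Flag one-second intervals where %steal exceeds 25%
  sar 1 60 | awk '$3=="all" && $8+0 > 25 {print $1, $2, "steal=" $8 "%"}'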

25

CPU: nmon

▪ System with 2 cores and SMT4 results in 8 virtual CPUs
▪ nmon reports entitlement and computes physical consumption
 – Traditional UNIX Dedicated SMP:  #cores x (user+system)%
 – Linux Shared Partition:          #cores active x (user+system+steal)%

26

▪ Run queue length is another well-known metric of CPU usage
 – It refers to the number of software threads that are ready to run but have to wait because the CPU(s) is/are busy or waiting on interrupts
 – The length is sometimes used as a measure of health, and long run queues usually mean worse performance, but many workloads can vary dramatically
 – It is quite possible for a pair of single-threaded workloads to contend for a single physical resource (batch, low run queue, bad performance) while dozens of multi-threaded workloads share it (OLTP, high run queue, good performance)
▪ AIX/Linux vmstat reports the global run queue as “r”, sar -q as “runq-sz” (see the example below)

Run Queue
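A minimal check of run-queue pressure, comparing "r" against the number of online logical CPUs; intervals and counts are illustrative:

  # Number of online logical CPUs to compare against
  nproc
  # "r" column = global run queue
  vmstat 5 12
  # "runq-sz" column, plus load averages
  sar -q 5 12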

27

▪ Context switches
 – The number of times a running entity was stopped and replaced by another
 – Collected for threads (operating system) and Virtual Processors (PowerVM)
 – There are voluntary and involuntary context switches
▪ How many context switches are too many?
 – No rules of thumb exist
 – Voluntary: not an issue, because it means no work for the CPU
 – Involuntary: could be an issue, but generally the bottleneck will materialize in an easier-to-diagnose metric, such as CPU utilization, physical consumption, entitlement consumed, or run queue
 – Establish a baseline and compare when the system encounters performance problems
 – When an application/workload “blows up” in context switches as more cores or virtual processors are added, you need development help (locking/latch issues)
▪ Tool outputs (see the example below)
 – vmstat reports total context switches as “cs”
 – sar -w as “cswch/s”

Context Switches
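To see which processes are switching, pidstat from the sysstat package splits voluntary and involuntary context switches per task; a short sketch:

  # Per-process voluntary (cswch/s) and involuntary (nvcswch/s) context switches
  pidstat -w 5 3
  # Add -t to break the counts down by thread
  pidstat -wt 5 3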

28

# vmstat 1
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa st
 0  0 491840  44736 140224 717824    0    0     4     4   10     4  2  0 95  0  2
 0  1 491840  43648 140224 717824    0    0     0   164  133  1000  2  0 98  0  0
 1  0 491840  44096 140224 717824    0    0     0     4   96   779  1  0 98  0  0
 0  0 491840  44288 140224 717824    0    0     8     0   35   752  1  0 99  0  0

Memory: What’s Free in Linux

Modern Linux distros cache file I/O; “free” + “buffers” + “cached” is approximately the memory free for use by applications

# free -t
             total       used       free     shared    buffers     cached
Mem:       4140288    3890880     249408          0     180096    1425984
-/+ buffers/cache:    2284800    1855488
Swap:      4128640          0    4128640
Total:     8268928    3890880    4378048

Current page totals listed with free command (-k KB, -m MB, -g GB)


Filesystem cache may be released dynamically depending on I/O demands. To estimate how much of the cache is application shared memory vs. filesystem data, review the “Mapped” and “Shmem” totals in cat /proc/meminfo (see the sketch below).
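A hedged sketch of turning the free/meminfo fields above into an "available to applications" estimate; field names follow /proc/meminfo, and newer kernels report MemAvailable directly, which is preferred when present:

  # Rough estimate: MemFree + Buffers + Cached (in kB)
  awk '/^(MemFree|Buffers|Cached):/ {sum += $2} END {print sum " kB approx. available"}' /proc/meminfo
  # On newer kernels, just read the kernel's own estimate
  grep MemAvailable /proc/meminfo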

29

# cat /proc/meminfo
MemTotal:        15693888 kB
MemFree:         14800256 kB
MemAvailable:    15119936 kB
Buffers:             3968 kB
Cached:            350144 kB
SwapCached:             0 kB
Active:            342144 kB
Inactive:          208768 kB
Active(anon):      200000 kB
Inactive(anon):     30272 kB
Active(file):      142144 kB
Inactive(file):    178496 kB
SwapTotal:        6291392 kB
SwapFree:         6291392 kB
Dirty:                256 kB
Writeback:              0 kB
AnonPages:         196800 kB
Mapped:             73856 kB
Shmem:              33472 kB
Slab:              208512 kB
SReclaimable:       58176 kB
SUnreclaim:        150336 kB
KernelStack:         3184 kB
PageTables:          2880 kB
...
CommitLimit:     14138304 kB
Committed_AS:      491456 kB
VmallocTotal:  8589934592 kB
VmallocUsed:        73856 kB
VmallocChunk:  8589766720 kB

Linux Gritty Details:

30

# vmstat 1
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa st
 0  0 491840  44736 140224 717824    0    0     4     4   10     4  2  0 95  0  2
 0  1 491840  43648 140224 717824    0    0     0   164  133  1000  2  0 98  0  0
 1  0 491840  44096 140224 717824    0    0     0     4   96   779  1  0 98  0  0
 0  0 491840  44288 140224 717824    0    0     8     0   35   752  1  0 99  0  0
 0  0 491840  44352 140224 717824    0    0     4     0   38   766  1  0 99  0  0

Where is paging info in Linux?

▪ Iterative vmstat shows swap-in “si” and swap-out “so” activity
▪ Historical swap counters are shown using vmstat -s
▪ nmon: under the MEM tab, free memory is graphed, and swap* metric samplings are listed but not graphed

31

Who’s Paging?

▪ This turns out to be a bit complicated in Linux, as opposed to AIX, without additional tools
 – There is no Linux equivalent to AIX svmon
 – Combine /proc/meminfo with /proc/${PID}/[smaps | status | stat]; hints here (and a sketch below): http://www.cyberciti.biz/faq/linux-which-process-is-using-swap/
 – In AIX, we would use svmon. A Linux tool that shares some report similarities is smem

▪For those of you familiar with AIX memory management, Linux is a bit different. A good overview of Linux memory mgmt:

http://linuxaria.com/howto/linux-memory-management
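A hedged sketch, in the spirit of the /proc hints referenced above, that sums per-process swap usage; VmSwap is only present on reasonably recent kernels, and the output is process name with swapped-out kB, largest users last:

  # List each process with its swapped-out kB, sorted
  for f in /proc/[0-9]*/status; do
      awk '/^Name:/ {n=$2} /^VmSwap:/ {print $2, n}' "$f"
  done | sort -n | tail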

32

smem

▪ Traditional tools report Resident Set Size (RSS) per process, which includes memory common to each (shared memory, libraries, kernel tables)
▪ Adding up each process's usage will not equal the actual real memory used on the system
▪ smem reports RSS and additional metrics:
 – Unshared Set Size (USS), the memory unique to each process
 – Proportional Set Size (PSS), USS + common shared memory divided evenly among the processes using it
 – Swap, for physically paged-out memory

33

# ./smem -p -r
  PID User      Command           Swap    USS    PSS    RSS
 1998 pwrvcdb   db2sysc          0.00% 23.68% 24.01% 24.92%
 2812 nova      /usr/bin/python  0.00%  2.84%  2.90%  3.64%
 3919 nova      /usr/bin/python  0.00%  1.70%  1.76%  2.44%
27022 keystone  /usr/sbin/httpd  0.00%  1.57%  1.64%  2.48%

Linux smem

-p percentages, -r reverse order

# ./smem -u -p -r
User      Count   Swap     USS     PSS     RSS
pwrvcdb       5  0.00%  26.87%  27.66%  31.04%
nova         10  0.00%  13.23%  14.16%  20.09%
cinder        5  0.00%   5.01%   5.21%   8.08%
root         55  0.00%   2.59%   3.82%   9.21%

-u User Report

# ./smem -w -R 8G -p
Area                     Used   Cache  Noncache
firmware/hardware       3.75%   0.00%     3.75%
kernel image            0.00%   0.00%     0.00%
kernel dynamic memory  17.81%  14.91%     2.90%
userspace memory       57.76%  23.46%    34.30%
free memory            20.67%  20.67%     0.00%

-w System Report; -R [REALMEM] for accurate firmware/hardware/kernel accounting

34

Linux NUMA

▪ POWER architectures implement Non-Uniform Memory Access
 – The architecture maps memory by locality (core, chip, dual chip module/socket, node, etc.)
 – Memory affinity and process locality can impact performance
 – The POWER8 architecture bandwidth is extremely high, and most scale-out workloads will not incur significant cross-socket/node latencies

▪ This topic is beyond the scope of basic Linux performance, but developers will want to track this information, and can do so with the following tools (see the sketch after this list):
 – cat /sys/devices/system/node/node*
 – cat /sys/devices/system/node/node0/cpulist
 – numactl
 – numastat (implemented in the latest distro levels)
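A quick hedged sketch of inspecting NUMA topology and pinning a test workload to one node; ./app is a placeholder for whatever is being measured:

  # Show nodes, their CPUs and memory, and per-node allocation counters
  numactl --hardware
  numastat -c
  # Run a workload with CPUs and memory constrained to node 0 for comparison
  numactl --cpunodebind=0 --membind=0 ./app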

Topas Monitor for host: claret            Interval: 2
===================================================================
REF1  SRAD  TOTALMEM  INUSE  FREE   FILECACHE  HOMETHRDS  CPUS
-------------------------------------------------------------------
0     2     4.48G     515M   3.98G  52.9M      134.0      12-15
      0     12.1G     1.20G  10.9G  141M       236.0      0-7
1     1     4.98G     537M   4.46G  59.0M      129.0      8-11
      3     3.40G     402M   3.01G  39.7M      116.0      16-19
===================================================================
CPU  SRAD  TOTALDISP  LOCALDISP%  NEARDISP%  FARDISP%
----------------------------------------------------------
0    0     303.0      43.6        15.5       40.9
2    0     1.00       100.0       0.0        0.0
3    0     1.00       100.0       0.0        0.0

Affinity: Monitoring

In PowerVM, Dynamic Platform Optimizer will help consolidate heavily populated systems and optimize VM placement

Use these tools to monitor changes

Note: Only more recent AIX/Linux distros may interactively capture VM movement within a frame

AIX Topas ‘M’

Linux numastat:

# numastat -c
Per-node numastat info (in MBs):
                Node 0  Node 1  Node 3  Total
                ------  ------  ------  -----
Numa_Hit         21341     248     219  21808
Numa_Miss            0       0       0      0
Numa_Foreign         0       0       0      0
Interleave_Hit     120     120     120    360
Local_Node       21341       0       0  21341
Other_Node           0     248     219    468

PowerVM NUMA terms:
 AIX:
  – REF    Node drawer or socket (when DCM/MCM)
  – SRAD   Scheduler Resource Allocation Domain (typically a chip)
  – CPU    Logical CPUs (SMT threads)
 Linux:
  – Node   Chip

AIX vs Linux Affinity

# lssrad -av
REF1  SRAD       MEM    CPU
0     0      12363.94   0-7
      2       4589.00   12-15
1     1       5104.50   8-11
      3       3486.00   16-19

• AIX intra-chip references are local, intra-DCM are near, and inter-DCM are far
• Chip designers use local, remote and distant for these domains
• Client OS tools show logical mappings, but closely follow physical (FW780+)
• Far dispatches are most expensive in POWER7, much less so in POWER8

[Diagram: AIX SRAD vs Linux node numbering on an S824 16-way, two-socket POWER8 with Dual Chip Modules (DCM), 4 cores per chip]

Linux:

# numastat -c -z -m -n
Per-node system memory usage (in MBs):
          Node 0  Node 1  Node 3  Total
          ------  ------  ------  -----
MemTotal   23808   32768   25344  81920
MemFree    20736   32531   25149  78417
MemUsed     3072     237     195   3503
Active       213       0       0    213
...

Affinity: PowerVP view of Linux VM (PowerVM only)

[Screenshot: PowerVP showing hypervisor Virtual Processor placements onto physical cores and Virtual Machine DIMM usage – S824 16-way, two-socket POWER8 with Dual Chip Modules (DCM), 4 cores per chip]

POWER8 Affinity: Do I Care?

PowerVM & POWER8 provide a variety of improvements

– Firmware, pHyp, OS, Dynamic Platform Optimizer & PowerVP

– Cache size, L4 cache, access logic, DIMM bandwidth

– Inter-socket latencies and bus bandwidth improvements

– Single to Two-Hop memory and cache architecture vs POWER7 Three-Hop

▪ PowerKVM or Non-Virtualized scale-out system are less complex

▪ But new In-Memory workloads depend on optimal memory latency, so it’s still good to have this function available in POWER8

For the E870/E880, every socket is:
 ▪ Single hop to sockets on the same node
 ▪ Single hop to 2 sockets on other nodes
 ▪ Two hops to 2 sockets on other nodes

39

▪ If IO service times are reasonably good, but queues are getting filled up, then:
 – Increase queue depths until:
   • You aren't filling the queues, or
   • IO service times start degrading (bottleneck at the disk)
 – Disks and adapters have service and wait queues. When the queue is full and an IO completes, another is issued

▪ For Linux (see the sketch below):
 – parted -l, fdisk -l to list disks
 – cat /sys/block/<disk>/device/queue_depth
 – iostat -x <interval> <count>
 – hdparm -I /dev/sd* lists detailed disk config & queue depth setting
 – For adapters, you'll need to research/obtain special tools, typically shipped by the FC vendor for that distro (QLogic, Emulex, etc.)

▪ Modern 4/8/16 Gb FC adapter ports should be able to sustain >100K IOPS

IO Queues: Review
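A minimal sketch of checking and raising a single disk's queue depth via sysfs, per the list above; sdc is a placeholder, the value is illustrative, and a sysfs change does not persist across reboots (use the driver/udev mechanism for that):

  # Current queue depth for one disk
  cat /sys/block/sdc/device/queue_depth
  # Raise it, then watch iostat -x to confirm service times do not degrade
  echo 32 > /sys/block/sdc/device/queue_depth
  iostat -x 5 6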

40

Linux I/O

▪ In SLES 11 / RHEL 6.5 and lower, the modprobe facility manages HBA settings:
 – lsscsi -l (or cat /proc/scsi/scsi)     List SCSI devices
 – lspci [-v] | more                      Manufacturer/driver
 – /etc/modprobe.d/[driver].conf          Editable options

▪ In RHEL 7, modprobe is deprecated:
 – lsmod | grep scsi
 – cat /sys/block/<sdx>/device/queue_depth

▪ On PowerKVM, queue_depth will be located in a /sys/devices/vio path like this:
 – /sys/devices/vio/30000003/host0/target0:0:1/0:0:1:0
 – To configure/adjust queues on RAID devices, use iprconfig: https://ibm.biz/Bd4zHy
 – For LC systems with Adaptec RAID, use arcconf: https://ibm.biz/Bd4zH2

▪ HBA adjustments to queue depth may be configurable via BIOS & OS. Set BIOS settings to maximum and adjust OS settings within those ranges as required (see the sketch below).
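A hedged example of a persistent HBA queue-depth setting via modprobe options on the older distros mentioned above; the module and parameter names (Emulex lpfc here) are illustrative, so confirm the exact names for your driver with modinfo:

  # /etc/modprobe.d/lpfc.conf  (example only; verify parameter names with: modinfo lpfc)
  options lpfc lpfc_lun_queue_depth=64
  # Rebuild the initramfs and reboot (or reload the module) for the option to take effect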

41

Linux I/O: virsh & sg

▪ libvirt (PowerKVM): http://libvirt.org/virshcmdref.html
 – virsh nodedev-list | grep pci            List PCI devices
 – virsh nodedev-dumpxml [PCI device]       Function, bus, domain info
 – virsh nodedev-dumpxml [scsi_host*]
 – virsh nodedev-list --cap vports          List vHBA support
 – virsh nodedev-list --cap scsi_host       List HBAs

▪ sg utilities (sg3_utils) provide access to SCSI commands and devices along with transports like FC and SAS

See http://sg.danny.cz/sg/sg3_utils.html for a complete list

42

iostat: Detailed hdisk stats

▪ %UTIL (-x option) is not a single reliable indicator of a constraint, but it helps with sorting
▪ I/Os PER SECOND (reads/sec, writes/sec, tps)
 – Hdisk drivers are single-threaded; we target less than 3K IOPS per logical disk
▪ BALANCE
 – Balanced IO activity, well spread – it should look like someone designed it, not chaotic
▪ SERVICE TIMES
 – Reads < 10 msec are generally good for non-SSD/Flash; SSD: 10K+ IOPS, sub-millisecond reads; Flash: high IOPS, 0.2-0.5 msec
 – Writes < 2 msec are good for non-SSD/Flash; SSD: writes 0.5 to 2 msec; Flash: high IOPS, 0.2-0.5 msec writes
▪ QUEUE
 – Discussed on the next slide
▪ Linux iostat -Nmtx [interval] gives us:
 – Throughput (-m MB/sec), transfers/sec
 – Service times (-x) & queuing
 – -N [disk/fs names], -t [timestamp]

43

iostat -x: Linux detailed disk stats

# avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
             4.46   0.00     1.18    10.19    6.76  77.41

Device:         rrqm/s   wrqm/s    r/s      w/s   rMB/s   wMB/s  avgrq-sz  avgqu-sz   await  svctm  %util
sda               0.00     0.00  10.70    51.90    0.69    9.79    342.81      3.38   53.13  12.59  78.80
sdb               0.00     0.00   0.00     0.00    0.00    0.00      0.00      0.00    0.00   0.00   0.00
...
sdg               0.00     0.00  11.40    51.60    0.74   10.23    356.72      6.11   96.79  13.35  84.10
sdh               0.00     0.00   0.10     0.00    0.00    0.00      8.00      0.00    0.00   0.00   0.00
mpathd            0.00     0.00   0.00     0.00    0.00    0.00      0.00      0.00    0.00   0.00   0.00
mpathc            0.00     0.00   0.00     0.00    0.00    0.00      0.00      0.00    0.00   0.00   0.00
mpathb            0.00     0.00   0.00     0.00    0.00    0.00      0.00      0.00    0.00   0.00   0.00
mpatha            3.50  4934.60  21.80   103.50    1.43   20.02    350.60     63.65  486.86   7.33  91.80
mpathap1          0.00     0.00   0.00     0.00    0.00    0.00      0.00      0.00    0.00   0.00   0.00
mpathap2          0.00     0.00   0.00     0.00    0.00    0.00      0.00      0.00    0.00   0.00   0.00
mpathap3          0.00     0.00  25.30  5043.90    1.43   21.41      9.23   2951.74  550.05   0.18  90.80
rootvg-lv_root    0.00     0.00  21.30  5014.70    1.18   19.59      8.45   2932.21  549.79   0.18  90.10
rootvg-lv_swap    0.00     0.00   4.00    29.20    0.25    1.82    128.00     19.57  589.43  10.18  33.80

Standard transfer, throughput and %util metrics plus:
 – avgqu-sz  Average queue length of requests issued to the device
 – await     Average time in msec for requests issued to the device to be served (includes queuing and device service time)
 – svctm     Average service time in msec for requests issued to the device; not considered a reliable metric per the manpage
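A hedged one-liner for pulling out the devices with the worst service times from output like the above; the await column number (10 here) and the 20 ms threshold are illustrative and depend on the sysstat version, so check the header first:

  # Print device name and await for any device averaging more than 20 ms
  iostat -x 5 3 | awk '/^sd|^mpath|^rootvg/ && $10+0 > 20 {print $1, "await=" $10}'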

44

nmon monitoring: filesystem ('j'), disk busy map ('o') and disk detail ('D' for graph, 'd' for values)

45

VIOS Performance Advisor: Physical FC Details

New in 2.2.3: FC utilization based on peak IOPS rates

46

VIOS Performance Advisor: IO Total, Disks & NPIV

47

Network Tuning (Linux)

▪ All Linux distros support netstat, ethtool and nmon for basic network monitoring, packet errors, etc. (see the sketch below)

▪ Linux did not support the VIOS Shared Ethernet Large Receive function on the physical adapter
 – This caused performance problems between AIX and Linux clients hosted on the same adapter, penalizing Linux performance
 – Customers with high network demands for AIX/i clients who also host Linux should not host the Linux partitions on the same physical adapter
 – VIOS 2.2.4 provides support; there is some debate about whether it is completely working. Follow the thread here: https://ibm.biz/Bd4z46

▪ See the RHEL 7 Guide, Section 6.3, for network tuning information. Much of this is applicable across distros: https://ibm.biz/Bd4Zdq
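A few hedged basic checks with the tools named above; eth0 is a placeholder interface name:

  # Per-interface error/drop counts
  netstat -i
  # Driver statistics, filtered for drops and errors
  ethtool -S eth0 | grep -iE 'drop|err'
  # Offload settings, including large-receive-offload
  ethtool -k eth0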

48

Shared Ethernet (VIOS)

▪ A common complaint from customers is that they can't easily get Shared Ethernet information on PowerVM clients
 – It turns out there are a variety of AIX tools that show this:
   • entstat/seastat
   • topas & nmon
 – Monitor on the VIOS, not the client
 – Many customers do not know how to enable SEA monitoring in nmon recordings. This has to be done on the VIOS.

▪ Accounting must first be enabled per device: chdev -dev ent* -attr accounting=enabled

49

nmon Analyser (V34a) SEA & SEAPACKET tabs show packet counts and throughput in KB/sec

SEA recording: use the nmon -O option on the VIOS

Unfortunately, Analyser currently does not provide stacked graphs for SEA aggregation views

50

SEA Monitoring: VIOS Performance Advisor (v2.2.3+)

Accounting feature must be enabled on the VIOS: chdev -dev ent* -attr accounting=enabled

51

SEA Monitoring: seastat on VIOS

Accounting must first be enabled per device:
  chdev -dev ent* -attr accounting=enabled

seastat -d <device_name> -c [-n | -s search_criterion=value]
  <device_name>  shared adapter device
  -c             clears per-client SEA statistics
  -n             displays name resolution on the IP addresses
  -s             search values

Search criteria:
  MAC address (mac)                  VLAN id (vlan)
  IP address (ip)                    Hostname (host)
  Greater than bytes sent (gbs)      Greater than bytes recv (gbr)
  Greater than packets sent (gps)    Greater than packets recv (gpr)
  Smaller than bytes sent (sbs)      Smaller than bytes recv (sbr)
  Smaller than packets sent (sps)    Smaller than packets recv (spr)

All sorts of advanced filtering (debugging) options are now in seastat (see the example below)
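A hedged usage example of the filtering described above; ent5 and the VLAN id are placeholders for a real SEA device and network:

  # On the VIOS: show per-client SEA statistics for one VLAN
  seastat -d ent5 -s vlan=100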