
Page 1: ESX Performance Troubleshooting

© 2009 VMware Inc. All rights reserved

ESX Performance Troubleshooting VMware Technical Support, Broomfield, Colorado

Confidential

Page 2: ESX Performance Troubleshooting

What is slow performance?

• What does slow performance mean?
  • Application responds slowly – latency
  • Application takes a longer time to do a job – throughput

• Interpretation varies wildly
  • Slower than expectation
  • Throughput is low
  • Latency is high
  • Throughput and latency are fine, but the application uses excessive resources (efficiency)

• What are high latency, low throughput, and excessive resource usage?
  • These are subjective and relative
  • Latency and throughput are both related to time

Page 3: ESX Performance Troubleshooting

Bandwidth, Throughput, Goodput, Latency

Bandwidth vs. Throughput
• Higher bandwidth does not guarantee higher throughput
• Low bandwidth is a bottleneck for higher throughput

Throughput vs. Goodput
• Higher throughput does not mean higher goodput
• Low throughput is indicative of lower goodput
• Efficiency = Goodput / Bandwidth (see the sketch below)

Throughput vs. Latency
• Low latency does not guarantee higher throughput, and vice versa
• Throughput or latency alone can dominate performance
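
A toy calculation to make the efficiency relation above concrete; this is my illustration, not from the slides, and all numbers are invented.

    # Hypothetical illustration of the slide's Efficiency = Goodput / Bandwidth relation.
    # All numbers are invented for the example.
    link_bandwidth_mbps = 1000.0   # raw capacity of the link (megabits/s)
    throughput_mbps = 720.0        # bits actually moved, incl. protocol overhead and retransmits
    goodput_mbps = 610.0           # bits of useful application payload delivered

    efficiency = goodput_mbps / link_bandwidth_mbps
    print(f"Throughput utilisation: {throughput_mbps / link_bandwidth_mbps:.0%}")
    print(f"Efficiency (goodput/bandwidth): {efficiency:.0%}")
    # High throughput with low efficiency usually means overhead or retransmissions,
    # not useful work.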

Page 4: ESX Performance Troubleshooting

Bandwidth, Throughput, Goodput, Latency

[Diagram: relationship between Bandwidth, Throughput, Goodput, and Latency]

Page 5: ESX Performance Troubleshooting

How to measure performance?

• Higher throughput does not necessarily mean higher performance – goodput could be low
• Throughput is easy to measure, but goodput is not

How do we measure performance?
• Performance itself is never measured directly
• We can only quantify the different metrics that affect performance. These metrics describe the state of the CPU, memory, disk, and network

Page 6: ESX Performance Troubleshooting

Performance Metrics

CPU
• Throughput: MIPS (%used); Goodput: useful instructions
• Latency: instruction latency (cache latency, cache misses)

Memory
• Throughput: MB/s; Goodput: useful data
• Latency: nanoseconds

Storage
• Throughput: MB/s, IOPS; Goodput: useful data
• Latency: seek time

Networking
• Throughput: MB/s, I/Os per second; Goodput: useful traffic
• Latency: microseconds

Page 7: ESX Performance Troubleshooting

Hardware and Performance

CPU
• Processor architecture: Intel Xeon, AMD Opteron
• Processor cache – L1, L2, L3, TLB
• Hyperthreading
• NUMA

Page 8: ESX Performance Troubleshooting

Hardware and Performance

Processor Architecture
• Clock speeds from one architecture are not comparable with another
  • The P-III outperforms the P4 on a clock-by-clock basis
  • The Opteron outperforms the P4 on a clock-by-clock basis
• A higher clock speed is not always beneficial
  • A bigger cache or a better architecture may outperform a higher clock speed
• Processor-memory communication is often the performance bottleneck
  • The processor wastes hundreds of instruction cycles while waiting on memory access
  • Caching alleviates this issue

Page 9: ESX Performance Troubleshooting

Hardware and Performance

Processor Cache
• Cache reduces memory access latency
• A bigger cache increases the cache hit probability
• Why not build a bigger cache?
  • Expensive
  • Cache access latency increases with cache size
• Cache is built in stages – L1, L2, L3 – with varying access latency
• ESX benefits from larger cache sizes
• L3 cache seems to boost the performance of networking workloads

Page 10: ESX Performance Troubleshooting

Hardware and Performance

TLB – Translation Lookaside Buffer
• Every running process needs virtual-address (VA) to physical-address (PA) translation
• Historically this translation was done entirely from page tables in memory
• Since memory access is significantly slower and a translation is needed on every memory access, the TLB was introduced
• The TLB is hardware circuitry that caches VA-to-PA mappings
• When a VA is not present in the TLB, a TLB miss occurs and the mapping must be fetched from the page tables (load latency); if no mapping exists, a page fault is raised
• Application performance depends on effective use of the TLB (see the sketch below)
• The TLB is flushed during a context switch
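
A minimal sketch of why TLB hit rate matters: it models a tiny direct-mapped TLB and counts hits for two access patterns. The sizes, latencies, and traces are invented for illustration, not real hardware values.

    # Toy TLB model: direct-mapped, indexed by virtual page number.
    TLB_ENTRIES = 64
    PAGE_SIZE = 4096
    HIT_CYCLES, MISS_CYCLES = 1, 100   # assumed cost of a hit vs. a page-table walk

    def access_cost(addresses):
        tlb = [None] * TLB_ENTRIES     # slot -> cached virtual page number
        cycles = hits = 0
        for addr in addresses:
            vpn = addr // PAGE_SIZE
            slot = vpn % TLB_ENTRIES
            if tlb[slot] == vpn:
                hits += 1
                cycles += HIT_CYCLES
            else:                      # TLB miss: walk the page tables, then cache the mapping
                tlb[slot] = vpn
                cycles += MISS_CYCLES
        return hits / len(addresses), cycles

    # Sequential scan (good locality) vs. large-stride scan (poor locality)
    seq = [i * 8 for i in range(100_000)]
    stride = [i * PAGE_SIZE * 7 for i in range(100_000)]
    for name, trace in [("sequential", seq), ("strided", stride)]:
        rate, cost = access_cost(trace)
        print(f"{name}: hit rate {rate:.1%}, total cycles {cost}")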

Page 11: ESX Performance Troubleshooting

Hardware and performance

Hyperthreading
• Introduced with the Pentium 4 and Xeon processors
• Allows simultaneous execution of two threads on a single processor
• HT maintains separate architectural state on the same processor but shares underlying processor resources such as the execution units and caches
• HT strives to improve throughput by taking advantage of processor stalls on the other logical processor
• HT performance can be worse than uniprocessor (non-HT) performance if the threads have high cache hit rates (more than 50%)

Page 12: ESX Performance Troubleshooting

Hardware and Performance

Multicore
• Each core has its own L1 cache
• The L2 cache is shared between the cores
• Cache coherency is relatively faster compared to SMP systems
• Performance scaling is the same as for SMP systems

Page 13: ESX Performance Troubleshooting

Hardware and performance

NUMA
• Memory contention increases as the number of processors increases
• NUMA alleviates memory contention by localizing memory per processor

Page 14: ESX Performance Troubleshooting

Hardware and Performance - Memory

Node Interleaving
• Opteron processors support two types of memory access – NUMA mode and node-interleaving mode
• Node-interleaving mode alternates memory pages between processor nodes so that memory latencies are made uniform. This can improve performance for systems that are not NUMA-aware
• On single-core Opteron systems there is only one core per NUMA node
• An SMP VM on ESX running on a single-core Opteron system will have to access memory across the NUMA boundary, so SMP VMs may benefit from node interleaving
• On dual-core Opteron systems a single NUMA node has two cores, so NUMA mode can be turned on

Page 15: ESX Performance Troubleshooting

Hardware and Performance – I/O devices

I/O Devices
• PCI-E, PCI-X, PCI
  • PCI at 66 MHz – 533 MB/s
  • PCI-X at 133 MHz – 1066 MB/s
  • PCI-X at 266 MHz – 2133 MB/s
  • PCI-E bandwidth depends on the number of lanes: each lane adds 250 MB/s, so x16 lanes ≈ 4 GB/s (see the sketch below)
• PCI bus saturation – dual-port and quad-port devices
  • In the PCI protocol the bus bandwidth is shared by all devices on the bus; only one device can communicate at a time
  • PCI-E allows parallel, full-duplex transmission through the use of lanes
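
A quick back-of-the-envelope helper for the lane arithmetic above, using the 250 MB/s-per-lane figure from the slide; purely illustrative.

    # PCI-E usable bandwidth per the slide's rule of thumb: 250 MB/s per lane.
    MB_PER_LANE = 250

    def pcie_bandwidth_mb(lanes: int) -> int:
        """Approximate one-direction bandwidth in MB/s for a PCI-E link."""
        return lanes * MB_PER_LANE

    for lanes in (1, 4, 8, 16):
        bw = pcie_bandwidth_mb(lanes)
        print(f"x{lanes:<2} -> {bw:>5} MB/s (~{bw / 1024:.1f} GB/s)")
    # x16 -> 4000 MB/s, which matches the ~4 GB/s quoted for a x16 slot.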

Page 16: ESX Performance Troubleshooting

Hardware and Performance – I/O Devices

SCSI
• Ultra3/Ultra160 SCSI – 160 MB/s
• Ultra320 SCSI – 320 MB/s
• SAS 3 Gbps – 300 MB/s, full duplex

FC
• Speed is constrained by the medium and laser wavelength
• Link speeds (full duplex): 1G FC – 200 MB/s, 2G – 400 MB/s, 4G – 800 MB/s, 8G – 1600 MB/s

Page 17: ESX Performance Troubleshooting

ESX Architecture – Performance Perspective

Page 18: ESX Performance Troubleshooting

ESX Architecture – Performance Perspective

CPU Virtualization – Virtual Machine Monitor
• ESX doesn't trap and emulate every instruction; the x86 architecture does not allow this
• System calls and faults are trapped by the monitor
• Guest code runs in one of three contexts:
  • Direct execution
  • Monitor code (fault handling)
  • Binary translation (BT – non-virtualizable instructions)
• BT behaves much like a JIT
• Previously translated code fragments are stored in a translation cache and reused – saves translation overhead

Page 19: ESX Performance Troubleshooting

ESX Architecture – Performance Implications

Virtual Machine Monitor – Performance Implications
• Programs that don't fault or invoke system calls run at near-native speed – e.g. gzip
• Micro-benchmarks that do nothing but invoke system calls will incur nothing but monitor overhead
• Translation overhead varies with different privileged instructions. The translation cache tries to offset some of the overhead
• Applications have varying amounts of monitor overhead depending on their call-stack profile
• The call-stack profile of an application can vary depending on its workload, errors, and other factors
• It is hard to generalize monitor overheads for any workload. Monitor overheads measured for an application are strictly applicable only to "identical" test conditions

Page 20: ESX Performance Troubleshooting

ESX Architecture – Performance Perspective

Memory Virtualization
• Modern OSes set up page tables for each running process; the x86 paging hardware (TLB) caches VA–PA mappings
• Page table shadowing – an additional level of indirection
  • The VMM maintains PA–MA mappings and keeps shadow page tables
  • Allows the guest to use the x86 paging hardware via the shadow tables
• MMU updates
  • The VMM write-protects the page tables (traces)
  • When the guest updates a page table, the monitor kicks in (page fault) and keeps the shadow page table consistent with the guest page table
• Hidden page faults
  • Trace faults are hidden from the guest OS – monitor overhead
  • Hidden page faults are similar to TLB misses in native environments

Page 21: ESX Performance Troubleshooting

ESX Architecture – Performance Perspective

Page table shadowing

Page 22: ESX Performance Troubleshooting

ESX Architecture – Performance Implications

Context Switches
• On native hardware the TLB is flushed during a context switch; the newly switched-in process incurs a TLB miss on its first memory access
• The VMM caches page table entries (PTEs) across context switches (caching MMU). We try to keep the shadow PTEs consistent with the guest PTEs
• If there are lots of processes running in the guest and they context-switch frequently, the VMM may run out of page-table cache. Setting workload=terminalservices in the vmx increases this cache size

Process Creation
• Every newly created process requires new page-table mappings, so MMU updates are frequent
• Shell scripts that spawn many commands can cause MMU overhead

Page 23: ESX Performance Troubleshooting

ESX Architecture – Performance Perspective

I/O Path

Page 24: ESX Performance Troubleshooting

ESX Architecture – Performance Perspective

I/O Virtualization
• I/O devices are not virtualizable and are therefore emulated for the guest OS
• The VMkernel handles storage and networking devices directly, as they are performance-critical in server environments. CD-ROM and floppy devices are handled by the service console
• I/O is interrupt-driven and therefore incurs monitor overhead. All I/O goes through the VMkernel and involves a context switch from the VMM to the VMkernel
• Networking devices have low latency, so delays due to context switches can hamper throughput
• The VMkernel fields I/O interrupts and delivers them to the correct VM. From ESX 2.1 on, the VMkernel delivers interrupts to an idle processor

Page 25: ESX Performance Troubleshooting

ESX Architecture – Performance Perspective

Virtual Networking
• Virtual NICs
  • The queue buffer can overflow if the packet tx/rx rate is high or the VM is not scheduled frequently
  • VMs are scheduled when they have packets for delivery
  • Idle VMs still receive broadcast frames, which wastes CPU resources
  • Guest speed/duplex settings are irrelevant
• Virtual switches don't learn MAC addresses
  • VMs register their MAC addresses, so the virtual switch knows the location of each MAC
• VMnics
  • Listen for the MAC addresses that are registered by the VMs
  • Layer 2 broadcast frames are passed up

Page 26: ESX Performance Troubleshooting

ESX Architecture – Performance Perspective

NIC Teaming
• Teaming only provides outbound load balancing
• NICs with different capabilities can be teamed; the lowest common capability in the bond is used
• Out-MAC mode scales with the number of VMs/virtual NICs. Traffic from a single virtual NIC is never load balanced
• Out-IP mode scales with the number of unique TCP/IP sessions (see the sketch below)
• Incoming traffic can arrive on the same NIC. Link aggregation on the physical switches provides inbound load balancing
• Packet reflections can cause performance hits in the guest OS. No empirical data is available
• We fail back when the link comes alive again. Performance can be affected if the link flip-flops
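
A rough sketch of how IP-hash style outbound uplink selection distributes load only across distinct source/destination pairs; this is my illustration of the scaling behaviour, not VMware's actual algorithm.

    # Illustrative IP-hash uplink selection: the real ESX algorithm may differ,
    # but the scaling behaviour is the same - one flow never spans two NICs.
    import zlib

    UPLINKS = ["vmnic0", "vmnic1"]

    def pick_uplink(src_ip: str, dst_ip: str) -> str:
        # Hash the source/destination pair onto an uplink index.
        key = f"{src_ip}->{dst_ip}".encode()
        return UPLINKS[zlib.crc32(key) % len(UPLINKS)]

    flows = [("10.0.0.5", "10.0.1.20"), ("10.0.0.5", "10.0.1.21"),
             ("10.0.0.6", "10.0.1.20"), ("10.0.0.5", "10.0.1.20")]
    for src, dst in flows:
        print(src, "->", dst, "uses", pick_uplink(src, dst))
    # The first and last flows are the same pair, so they always land on the same vmnic:
    # a single session cannot exceed one physical NIC's bandwidth.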

Page 27: ESX Performance Troubleshooting

ESX Architecture – Performance Perspective

vmxnet Optimizations
• vmxnet handles clusters of packets at once – reduces context switches and interrupts
• Clustering kicks in only when the packet receive/transmit rate is high
• vmxnet shares a memory area with the VMkernel – reduces copying overhead
• vmxnet can take advantage of TCP checksum and segmentation offloading (TSO)
• NIC morphing – allows loading the vmxnet driver for a vlance virtual device; a new register is probed on the vlance device
• The performance of a NIC-morphed vlance device is the same as that of a vmxnet virtual device

Page 28: ESX Performance Troubleshooting

ESX Architecture – Performance Perspective

SCSI Performance
• Queue depth determines SCSI throughput. When the queue is full, SCSI I/Os are blocked, limiting effective throughput (see the sketch below)
• Stages of queuing: BusLogic/LSILogic -> VMkernel queue -> VMkernel driver queue depth -> device firmware queue -> queue depth of the LUN
• Sched.numrequestOutstanding – number of outstanding I/O commands per VM – see KB 1269
• The BusLogic driver in Windows limits the queue depth to 1 – see KB 1890
• Registry settings are available for maximizing the queue depth for the LSILogic adapter (Maximum Number of Concurrent I/Os)
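
A back-of-the-envelope sketch, not from the slides, of why queue depth caps throughput: by Little's law, achievable IOPS is roughly the number of outstanding I/Os divided by the per-I/O latency. The latency value is an invented example.

    # Little's law applied to a SCSI queue: IOPS ~= queue_depth / service_time.
    def max_iops(queue_depth: int, io_latency_ms: float) -> float:
        """Upper bound on IOPS with a fixed number of outstanding I/Os."""
        return queue_depth / (io_latency_ms / 1000.0)

    io_latency_ms = 5.0          # assumed average device service time per I/O
    for depth in (1, 16, 32, 64):
        print(f"queue depth {depth:>2}: ~{max_iops(depth, io_latency_ms):6.0f} IOPS")
    # A queue depth of 1 (e.g. the Windows BusLogic default mentioned above)
    # caps the VM at ~200 IOPS regardless of how fast the array is.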

Page 29: ESX Performance Troubleshooting

ESX Architecture – Performance Perspective

VMFS
• Uses large block sizes (1 MB default)
  • A larger block size reduces metadata size – metadata is completely cached in memory
  • Near-native speed is possible because metadata overhead is removed
  • Fewer I/O operations; improves read-ahead cache hits for sequential reads
• Spanning
  • Data fills over to the next LUN sequentially after overflow. There is no striping, so spanning does not offer performance improvements
• Distributed access
  • Multiple ESX hosts can access the VMFS volume; only one ESX host updates the metadata at a time

Page 30: ESX Performance Troubleshooting

ESX Architecture – Performance Perspective

VMFS
• Volume locking
  • Metadata updates are performed through a locking mechanism
  • A SCSI reservation is used to lock the volume
  • Do not confuse this locking with the file-level locks implemented in the VMFS volume for different access modes
• SCSI reservation
  • A SCSI reservation blocks all I/O operations until the lock is released by the owner
  • The reservation is usually held for a very short time and released as soon as the update is performed (see the sketch below)
  • A SCSI reservation conflict happens when a reservation is attempted on a volume that is already locked. This usually happens when multiple ESX hosts contend for metadata updates
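
A simplified sketch of the reserve/update/release pattern and why contention from several hosts shows up as reservation conflicts that must be retried; this is my own model, not the VMkernel code, and the hosts, timings, and retry policy are invented.

    # Toy model of VMFS-style metadata locking with a whole-volume reservation.
    import threading, time, random

    volume_lock = threading.Lock()        # stands in for the SCSI reservation on the LUN
    conflicts = 0

    def update_metadata(host: str) -> None:
        global conflicts
        while True:
            if volume_lock.acquire(blocking=False):   # try to reserve the volume
                try:
                    time.sleep(0.001)                 # short metadata update while reserved
                finally:
                    volume_lock.release()             # release the reservation promptly
                return
            conflicts += 1                            # reservation conflict: back off and retry
            time.sleep(random.uniform(0.0005, 0.002))

    threads = [threading.Thread(target=update_metadata, args=(f"esx{i}",)) for i in range(8)]
    for t in threads: t.start()
    for t in threads: t.join()
    print("reservation conflicts observed:", conflicts)
    # More hosts doing metadata-heavy work (redo logs, template deploys) means more
    # conflicts, and every conflict delays all I/O to that volume.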

Page 31: ESX Performance Troubleshooting

ESX Architecture – Performance Perspective

VMFS
• Contention for metadata updates
  • Redo-log updates from multiple ESX hosts
  • Template deployment with redo-log activity
  • Anything that changes/modifies file permissions on every ESX host
• VMFS 3.0 uses a new volume-locking mechanism that significantly reduces the number of SCSI reservations used

Page 32: ESX Performance Troubleshooting

ESX Architecture – Performance Perspective

Service Console
• The service console can share interrupt resources with the VMkernel. Shared interrupt lines reduce the performance of I/O devices – KB 1290
• MKS is handled in the service console in ESX 2.x, and its performance is determined by the resources available in the COS
• The default min CPU allocation is 8% and may not be sufficient if there are lots of VMs running
• Memory recommendations for the service console do not account for memory that will be used by agents
• VM scalability is limited by the COS in ESX 2.x. ESX 3.x avoids this problem with VMkernel userworlds

Page 33: ESX Performance Troubleshooting

Understanding ESX Resource Management & Over-Commitment

Page 34: ESX Performance Troubleshooting

ESX Resource Management

Scheduling
• Only one VCPU runs on a CPU at any time
• The scheduler tries to run the VM on the same CPU as much as possible
• The scheduler can move VMs to other processors when it has to meet the CPU demands of a VM

Co-scheduling
• SMP VMs are co-scheduled, i.e. all the VCPUs run on their own PCPUs/LCPUs simultaneously
• Co-scheduling facilitates synchronization/communication between processors, as in the case of spinlock waits between CPUs
• The scheduler can run one VCPU without the other for a short period of time (1.5 ms)
• The guest could halt a co-scheduled CPU it is not using, but Windows doesn't seem to halt the CPU – this wastes CPU cycles

Page 35: ESX Performance Troubleshooting

ESX Resource Management

NUMA Scheduling
• The scheduler tries to schedule worlds within the same NUMA node so that cross-NUMA migrations are fewer
• If a VM's memory pages are split between NUMA nodes, the memory scheduler slowly migrates all the VM's pages to the local node. Over time the system becomes completely NUMA-balanced
• On a NUMA architecture, CPU utilization per NUMA node gives a better idea of CPU contention
• When interpreting %ready, factor in the CPU contention within the same NUMA node

Page 36: ESX Performance Troubleshooting

ESX Resource Management

Hyperthreading
• Hyperthreading support was added in ESX 2.1 and is recommended
• Hyperthreading increases the scheduler's flexibility, especially when running SMP VMs together with UP VMs
• A VM scheduled on an LCPU is charged only half the "package seconds" (see the sketch below)
• The scheduler tries to avoid scheduling an SMP VM onto the logical CPUs of the same package
• A high-priority VM may be scheduled on a package with one of its LCPUs halted – this prevents other running worlds from using the same package
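
A small accounting sketch of the "half package seconds" idea: a world sharing a package with another running world is charged at half rate, while a world with the package to itself is charged in full. This is my simplification; the real charging model is more involved.

    # Illustrative CPU accounting under hyperthreading.
    # Charging rule as described on the slide: half rate when the sibling LCPU is busy.
    def charge(seconds: float, sibling_busy: bool) -> float:
        """Package-seconds charged to a world for `seconds` of wall-clock run time."""
        return seconds * (0.5 if sibling_busy else 1.0)

    # World A runs 10 s alone on the package, then 10 s alongside world B on the sibling LCPU.
    alone = charge(10.0, sibling_busy=False)
    shared = charge(10.0, sibling_busy=True)
    print(f"world A charged {alone + shared:.1f} package-seconds for 20 s of run time")
    # The discount reflects that a logical CPU delivers only part of a package's throughput.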

Page 37: ESX Performance Troubleshooting

ESX Resource Management

HTSharing
• Controls hyperthreading behavior for individual VMs
• htsharing=any
  • Virtual CPUs can be scheduled on any LCPUs. The most flexible option for the scheduler
• htsharing=none
  • Excludes sharing of LCPUs with other VMs. A VM with this option gets a full package or never gets scheduled
  • Essentially this excludes the VM from using logical CPUs (useful for the security-paranoid). Use this option if an application in the VM is known to perform poorly with HT
• htsharing=internal
  • Applies to SMP VMs only. Same as none, but allows the VCPUs of the same VM to share a package. The best of both worlds for SMP VMs
  • For UP VMs this translates to none

Page 38: ESX Performance Troubleshooting

ESX Resource Management

HT Quarantining
• ESX uses P4 performance counters to constantly evaluate the HT performance of running worlds
• If a VM appears to interact badly with HT, the VM is automatically placed into quarantine mode (i.e. htsharing is set to none)
• If the bad events disappear, the VM is automatically taken back out of quarantine mode
• Quarantining is completely transparent

Page 39: ESX Performance Troubleshooting

ESX Resource Management

CPU Affinity
• Defines a subset of LCPUs/PCPUs that a world can run on
• Useful for:
  • Partitioning a server between departments
  • Troubleshooting system reliability issues
  • Manually setting NUMA affinity in ESX 1.5.x
  • Applications that benefit from cache affinity
• Caveats
  • Worlds that don't have affinity can run on any CPU, so they have a better chance of getting scheduled
  • Affinity reduces the scheduler's ability to maintain fairness – min CPU guarantees may not be possible under some circumstances
  • NUMA optimizations (page migrations) are excluded for VMs that have CPU affinity (manual memory affinity can be enforced)
  • SMP VMs should not be pinned to LCPUs
  • Disallows VMotion operations

Page 40: ESX Performance Troubleshooting

ESX Resource Management

Proportional Shares
• Shares come into play only when there is resource contention
• Unused shares (shares of a halted/idle VM) are partitioned across the active VMs
• In ESX 2.x shares operate on a flat namespace
• Changing the shares of one world affects the effective CPU cycles received by the other running worlds
• If a VM uses a different share scale, then the shares for the other worlds should be changed to the same scale

Page 41: ESX Performance Troubleshooting

ESX Resource Management

Minimum CPU
• Guarantees CPU resources when the VM asks for them
• Unused resources are not wasted; they are given to other worlds that require them
• Setting min CPU to 100% (200% in the case of SMP) ensures that the VM is not bound by CPU resource limits
• Using min CPU is favored over using CPU affinity or proportional shares
• Admission control verifies whether the min CPU can be guaranteed when the VM is powered on or VMotioned (see the sketch below)
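
A simplified admission-control check, my sketch rather than the actual VMkernel logic: a power-on or VMotion is admitted only if the host can still honor every guaranteed minimum. The capacity and reservation values are invented.

    # Toy CPU admission control: admit a VM only if the sum of min-CPU guarantees
    # still fits within the host's capacity.
    HOST_CAPACITY_PCT = 400            # e.g. 4 physical CPUs = 400% of one CPU

    def can_admit(existing_mins_pct, new_min_pct):
        """True if the new VM's min CPU guarantee can still be honored."""
        return sum(existing_mins_pct) + new_min_pct <= HOST_CAPACITY_PCT

    running = [100, 100, 50]           # min CPU already guaranteed to powered-on VMs
    print(can_admit(running, 100))     # True  -> power-on / VMotion allowed
    print(can_admit(running, 200))     # False -> admission control rejects the request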

Page 42: ESX Performance Troubleshooting

ESX Resource Management

Demystifying "Ready" Time
• A powered-on VM can be running, halted, or in the ready state
• Ready time is the time a VM spends on the run queue waiting to be scheduled
• Ready time accrues when more than one world wants to run at the same time on the same CPU
  • PCPU/VCPU over-commitment with CPU-intensive workloads
  • Scheduler constraints – CPU affinity settings
• Higher ready time increases response times and job completion times
• Total accrued ready time is not useful
  • A VM could have accrued ready time during its runtime without incurring a performance loss (for example during boot)
• %ready = the rate at which ready time accrues (see the sketch below)
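
A small helper, illustrative only, showing how a %ready figure is derived from two samples of accrued ready time over a measurement interval; the sample values are invented.

    # %ready as an accrual rate: delta(ready time) / delta(wall-clock) over the sample window.
    def percent_ready(ready_ms_start: float, ready_ms_end: float, interval_ms: float) -> float:
        return (ready_ms_end - ready_ms_start) / interval_ms * 100.0

    # Two samples of a VCPU's accumulated ready time, taken 5 seconds apart:
    print(f"{percent_ready(12_000, 12_900, 5_000):.1f}% ready")   # 18.0% of the interval spent waiting
    # A large lifetime total means little; what matters is how fast it is growing right now.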

Page 43: ESX Performance Troubleshooting

ESX Resource Management

Demystifying "Ready" Time
• There are no universally good/bad values for %ready
  • It depends on the priority of the VMs – latency-sensitive applications may tolerate little or no ready time
• Ready time can be reduced by increasing the priority of the VM
  • Allocate more shares, set a min CPU, remove CPU affinity

Page 44: ESX Performance Troubleshooting

ESX Resource Management

Unexplained "Ready" Time
• If a VM accrues ready time while there are enough CPU resources, it is called "unexplained ready time"
• There is some belief in the field that such a thing actually exists – hard to prove or disprove
• It is very hard to determine whether CPU resources were really available when the ready time accrued
  • CPU utilization is not a good indicator of CPU contention
  • Burstiness is very hard to determine
  • NUMA boundaries – all the VMs may be contending within the same NUMA node
  • Misunderstanding of how the scheduler works

Page 45: ESX Performance Troubleshooting

ESX Resource Management

Resource Management in ESX 3.0
• Resource pools
  • Extend the hierarchy. Shares operate within the resource pool's domain
• MHz
  • Resource allocations are absolute, based on clock cycles. Percentage-based allocations would vary with processor speed (see the sketch below)
• Clusters
  • Aggregate resources from multiple ESX hosts
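
A quick illustration, with invented clock speeds, of why absolute MHz allocations are more portable than percentages when hosts have different processor speeds.

    # The same "25% of a CPU" guarantee means different things on different hosts,
    # while a 700 MHz reservation does not. Clock speeds are example values.
    hosts_mhz = {"esx-old": 2800, "esx-new": 3600}

    percent_reservation = 25          # percent of one physical CPU
    mhz_reservation = 700             # absolute reservation in MHz

    for host, clock in hosts_mhz.items():
        from_percent = clock * percent_reservation / 100
        print(f"{host}: 25% -> {from_percent:.0f} MHz, absolute -> {mhz_reservation} MHz")
    # After a VMotion between the two hosts, the percentage-based guarantee silently
    # changes in real terms; the MHz-based one stays the same.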

Page 46: ESX Performance Troubleshooting

Resource Over-Commitment

CPU Over-Commitment
• Scheduling – too many things to do
  • Symptoms: high %ready
  • Use SMP judiciously
• CPU utilization – too much to do
  • Symptoms: 100% CPU
  • Things to watch:
    • Misbehaving applications inside the guest
    • Do not rely on guest CPU utilization – halting issues, timer interrupts
    • Some applications/services seem to impact guest halting behavior; this is no longer tied to SMP HALs

Page 47: ESX Performance Troubleshooting

Resource Over-Commitment

CPU Over-Commitment
• Higher CPU utilization does not necessarily mean lower performance
  • An application's progress is not affected by higher CPU utilization alone
  • However, if the higher CPU utilization is due to monitor overheads, it may impact performance by increasing latency
  • When there is no headroom (100% CPU), performance degrades
• 100% CPU utilization and %ready are almost identical in effect – both delay application progress
• CPU over-commitment can lead to other performance problems
  • Dropped network packets
  • Poor I/O throughput
  • Higher latency, poor response times

Page 48: ESX Performance Troubleshooting

Resource Over-Commitment

Memory Over-Commitment
• Guest swapping – warning
  • The guest page-faults while swapping
  • Performance is affected both by guest swapping and by the monitor overhead of handling page faults
  • Additional disk I/O
• Ballooning – serious
• VMkernel swapping – critical
• COS swapping – critical
  • The VMX process could stall and affect the progress of the VM
  • The VMX could be a victim of the kernel killing a random process
  • The COS requires additional CPU cycles for handling frequent page faults and disk I/O
• Memory shares determine the rate of ballooning/swapping

Page 49: ESX Performance Troubleshooting

Resource Over-Commitment

Memory Over-Commitment
• Ballooning
  • Ballooning/swapping stalls the processor and increases delay
  • Windows VMs touch all allocated memory pages during boot. Memory pages touched by the guest can be reclaimed only by ballooning
  • Linux guests touch memory pages on demand. Ballooning kicks in only when the guest is under real memory pressure
  • Ballooning can be avoided by setting min = max
  • /proc/vmware/sched/mem (see the sketch below):
    • size <> sizetgt indicates memory pressure
    • mctl > mctltgt – ballooning out (giving away pages)
    • mctl < mctltgt – ballooning in (taking in pages)
  • Memory shares affect the ballooning rate
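
A tiny interpretation sketch for the current-versus-target values described above; the field names follow the slide and the sample row is invented, so treat it as illustration rather than a parser for real /proc/vmware/sched/mem output.

    # Interpret balloon state from current vs. target values, as described above.
    def balloon_state(mctl_mb: int, mctltgt_mb: int) -> str:
        if mctl_mb < mctltgt_mb:
            return "ballooning in (taking in guest pages) - balloon growing toward its target"
        if mctl_mb > mctltgt_mb:
            return "ballooning out (giving pages back to the guest)"
        return "balloon at target - no change in progress"

    sample = {"size": 1024, "sizetgt": 896, "mctl": 64, "mctltgt": 128}   # MB, invented values
    if sample["size"] != sample["sizetgt"]:
        print("size != sizetgt: the VM is under memory pressure")
    print(balloon_state(sample["mctl"], sample["mctltgt"]))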

Page 50: ESX Performance Troubleshooting

Resource Over-Commitment

Memory Over-Commitment
• VMkernel swapping
  • Processor stalls due to VMkernel swapping are more expensive than ballooning (because of the disk I/O)
  • Do not confuse this with:
    • Swap reservation: swap is always reserved for the worst-case scenario; if min <> max, reservation = max – min (see the sketch below)
    • Total swapped pages: only current swap I/O affects performance
  • /proc/vmware/sched/mem-verbose: swpd – total pages swapped; swapin, swapout – swap I/O activity
  • SCSI I/O delays during VMkernel swapping could result in system reliability issues
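
A tiny helper, using the slide's reservation = max – min rule, to separate harmless swap reservation from swapping that is actually happening; the VM sizes and counters are invented example values.

    # Swap reservation vs. active swapping, per the rule quoted above.
    def swap_reservation_mb(min_mb: int, max_mb: int) -> int:
        """Space reserved up front for the worst case; not performance-relevant by itself."""
        return max_mb - min_mb if min_mb != max_mb else 0

    print("reserved:", swap_reservation_mb(min_mb=512, max_mb=2048), "MB")   # 1536 MB, harmless

    # What hurts is current swap traffic, not the lifetime total of swapped pages:
    swapin_prev, swapin_now = 120_000, 120_000     # pages, two samples
    swapout_prev, swapout_now = 95_000, 97_500
    if swapin_now > swapin_prev or swapout_now > swapout_prev:
        print("active VMkernel swapping in this interval - expect stalls")
    else:
        print("no swap I/O in this interval - old totals are history")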

Page 51: ESX Performance Troubleshooting

Resource Over-Commitment

I/O Bottlenecks
• PCI bus saturation
• Target device saturation
  • It is easy to saturate storage arrays if the topology is not designed correctly for load distribution
• Packet drops
  • Effective throughput drops
  • Retransmissions can cause congestion
  • The window size scales down in the case of TCP
• Latency affects throughput
  • TCP is very sensitive to latency and packet drops
• Broadcast traffic
  • Multicast and broadcast traffic is sent to all VMs
• Keep an eye on packets/sec and IOPS, not just bandwidth consumption

Page 52: ESX Performance Troubleshooting

ESX Performance – Application Performance Issues

Page 53: ESX Performance Troubleshooting

ESX Performance – Application Issues

Before we begin
• From the VM's perspective, a running application is just an x86 workload
• Any application performance tuning that makes the application run more efficiently will help
• Application performance can vary between versions
  • A new version could be more or less efficient
  • Tuning recommendations could change
• Application behavior can change based on its configuration
• Application performance tuning requires intimate knowledge of how the application behaves
• Nobody at VMware specializes in application performance tuning
  • Vendors should optimize their software with the understanding that the hardware resources could be shared with other operating systems
  • TAP program
  • SpringSource (a unit of VMware) – provides developer support for API scripting

Page 54: ESX Performance Troubleshooting

ESX Performance – Application issues

Citrix
• Roughly 50-60% monitor overhead – takes 50-60% more CPU cycles than on the native machine
• The maximum-users limit is hit when the CPU is maxed out – roughly 50% of the users that would be seen in a native environment, in an apples-to-apples comparison
• Citrix logon delays
  • These can happen even on native machines when roaming profiles are configured. Refer to the Citrix and MS KB articles
  • Monitor overhead can introduce logon delays
• Workarounds
  • Disable COM ports, set workload=terminalservices, disable unused apps, scale out horizontally
• ESX 3.0 improves Citrix performance – roughly 70-80% of native performance

Page 55: ESX Performance Troubleshooting

ESX Performance – Application issues

Database Performance
• Scales well with vSMP – recommended
  • Exception: Pervasive SQL – not optimized for SMP
• Key parameters for database workloads: response time, transaction logs, CPU utilization
• Understanding SQL performance is complex. Most enterprise databases run some sort of query optimizer that changes the SQL engine parameters dynamically
  • Performance will vary from run to run. Typically, benchmarking is done after priming the database
• Memory is a key resource. SQL performance can vary a lot depending on the available memory

Page 56: ESX Performance Troubleshooting

ESX Performance – Application Issues

Lotus Domino Server
• One of the better-performing workloads: 80-90% of direct_exec
• CPU and I/O intensive
• Scalability issues – it is not a good idea to run all Domino servers on the same ESX server

Page 57: ESX Performance Troubleshooting

ESX Performance – Application Issues

16-bit Applications
• 16-bit applications on Windows NT/2000 and above run in a sandboxed virtual machine
• 16-bit apps depend on segmentation – possible monitor overhead
• Some 16-bit apps seem to spin in an idle loop instead of halting the CPU
  • Consumes excessive CPU cycles
• No performance studies have been done yet – no compelling application

Page 58: ESX Performance Troubleshooting

ESX Performance – Application Issues

Netperf – Throughput
• Max throughput is bounded by a variety of parameters
  • Available bandwidth, TCP window size, available CPU cycles
• A VM incurs additional CPU overhead for I/O
• CPU utilization for networking varies with:
  • Socket buffer size and MTU – affect the number of I/O operations performed
  • Driver – vmxnet consumes fewer CPU cycles
  • Offloading features – depending on the driver settings and NIC capabilities
• For most applications, throughput is not the bottleneck
  • Measuring throughput and improving it may not resolve the underlying performance issue

Page 59: ESX Performance Troubleshooting

ESX Performance – Application Issues

Netperf – Latency
• Latency plays an important role for many applications
• Latency can increase when:
  • There are too many VMs to schedule
  • The VM is CPU bound
  • Packets are dropped and then re-transmitted

Page 60: ESX Performance Troubleshooting

ESX Performance – Application Issues

Compiler Workloads
• MMU intensive: lots of new processes are created, context-switched, and destroyed
• An SMP VM may hurt performance
  • Many compiler workloads are not optimized for SMP
  • Process threads could ping-pong between the VCPUs
• Workarounds: disable NPTL, try UP (don't forget to change the HAL), workload=terminalservices might help

Page 61: ESX Performance Troubleshooting

ESX Performance Forensics


Page 62: ESX Performance Troubleshooting

ESX Performance Forensics

Troubleshooting Methodology
• Understand the problem
  • Pay attention to all the symptoms
  • Pay less attention to subjective metrics
• Know the mechanics of the application
  • Find out how the application works, what resources it uses, and how it interacts with the rest of the system
• Identify the key bottleneck
  • Look for clues in the data and see whether they could be related to the symptoms
  • Eliminate CPU, disk I/O, networking I/O, and memory bottlenecks by running tests
• Running the right test is critical

Page 63: ESX Performance Troubleshooting

ESX Performance Forensics

Isolating Memory Bottlenecks
• Ballooning

• Swapping

• Guest MMU overheads

Page 64: ESX Performance Troubleshooting

ESX Performance Forensics

Isolating Networking Bottlenecks
• Speed/duplex settings

• Link state flapping

• NIC Saturation /Load balancing

• Packet drops

• Rx/Tx Queue Overflow

Page 65: ESX Performance Troubleshooting

ESX Performance Forensics

Isolating Disk I/O Bottlenecks
• Queue depth

• Path thrashing

• LUN thrashing

Page 66: ESX Performance Troubleshooting

ESX Performance Forensics

Isolating CPU Bottlenecks
• CPU utilization

• CPU scheduling contention

• Guest CPU usage

• Monitor Overhead

Page 67: ESX Performance Troubleshooting

ESX Performance Forensics

Isolating Monitor Overhead
• Procedures for release builds
  • Collect performance snapshots
• Monitor components

Page 68: ESX Performance Troubleshooting

ESX Performance Forensics

Collecting Performance Snapshots
• Duration

• Delay

• Proc nodes

• Running esxtop on performance snapshots

Page 69: ESX Performance Troubleshooting

ESX Performance Forensics

Collecting Benchmarking Numbers
• Client-side benchmarks

• Running benchmarks inside the guest

Page 70: ESX Performance Troubleshooting

ESX Performance Troubleshooting – Summary

Page 71: ESX Performance Troubleshooting

ESX Performance Troubleshooting - Summary

Key points
• Address real performance issues. A lot of time can be spent spinning wheels on theoretical benchmarking studies
• Real performance issues can easily be described by the end user who uses the application
• There is no magical configuration parameter that will solve all performance problems
• ESX performance problems are resolved by:
  • Re-architecting the deployment
  • Tuning the application
  • Applying workarounds to circumvent bad workloads
  • Moving to a newer version that addresses a known problem
• Understanding architecture is the key
  • Understanding both the ESX and the application architecture is essential to resolving performance problems

Page 72: ESX Performance Troubleshooting

Questions?

Page 73: ESX Performance Troubleshooting

Reference links

http://www.vmware.com/files/pdf/perf-vsphere-memory_management.pdf

http://www.vmware.com/resources/techresources/10041
http://www.vmware.com/resources/techresources/10054
http://www.vmware.com/resources/techresources/10066
http://www.vmware.com/files/pdf/perf-vsphere-cpu_scheduler.pdf
http://www.vmware.com/pdf/RVI_performance.pdf
http://www.vmware.com/pdf/Perf_ESX_Intel-EPT-eval.pdf
http://www.vmware.com/files/pdf/perf-vsphere-fault_tolerance.pdf