Performance Testing in Virtualised Environments


Stress and Performance Testing in Virtual Environments

Hi, I'm Rodger, and I like to talk, and manage Linux systems.

This is Aneel, and he likes to race cars, and beat the snot out of the systems I manage.

As the slide says, we're here to talk about performance testing in virtual environments, and some of the ways it varies from performance testing on bare metal.

We've broken this talk into three parts: First, we want to remind you why, abuses like benchmarketing aside, stress and performance testing is a good idea.

Second, we're going to discuss some of the principles of performance testing that are applicable to every environment, including virtualised ones.

Finally, in the third part, we'll get on to some of the lessons we've learned about stress and performance testing in virtualised environments.

Charlatans Abound

Stress and performance testing has a bit of a bad name; it's easy to be cynical about it because of the amount of benchmarketing and axe-grinding it's misused to support. Cheats are everywhere.

That's because pride and money are often at stake.

At best, this involves companies using well-designed test suites like the TPC benchmarks in novel and less than entirely truthful ways.

At worst... well, how many times have you seen people arguing you should adopt their new web framework because it gives a 1% benefit on serving Hello, World and static files? Three in the last year is my answer. If there's anything more depressing than the cheats, it's the people who aren't even smart enough to know what they're doing wrong.

But done properly, stress and performance testing does have value...

Why Do We Test?

To drive an application stack to emulate production-like workloads

To be able to repeat those tests to confirm results and/or the changes introduced

Performance testing is, when all pretence and artifice is put aside, the qualification of all the decisions made in a release - from the hardware design, to the software architecture, to the workarounds put in place by the developers - all from the point of view of the end user.

To achieve this in a non-production environment, resource-intensive, high-utilisation, and high-risk transactions need to be combined to provide a representation of a production workload. (The creation and maintenance overhead in a complex system wipes out the cost-to-benefit ratio of a complete coverage test.)
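To make "combining transactions into a representative workload" concrete, here's a minimal sketch of a weighted transaction mix; the transaction names and weights are invented for the example rather than taken from any real system or tool.

```python
import random

# Hypothetical transaction mix: weights roughly follow production frequency,
# with the rare but resource-intensive and high-risk transactions kept in.
WORKLOAD_MIX = {
    "login": 30,            # high volume, cheap
    "account_summary": 45,  # high volume, read heavy
    "funds_transfer": 15,   # high risk, must be covered
    "statement_export": 7,  # resource intensive (large reads, rendering)
    "batch_report": 3,      # rare, but brutal on the database
}

def next_transaction(rng=random.Random(42)):
    """Pick the next transaction to fire, proportional to its weight."""
    names = list(WORKLOAD_MIX)
    weights = [WORKLOAD_MIX[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]

if __name__ == "__main__":
    sample = [next_transaction() for _ in range(1000)]
    for name in WORKLOAD_MIX:
        print(f"{name:18s} {sample.count(name) / 10:.1f}%")
```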

We aren't going to try to go into detail around particular tools in this talk; they vary too much for that to be terribly useful.

In order to complete a successful test cycle, an environment where repeatable results can be obtained is key. As such, we need to be able to isolate the stress workload from other environments. Otherwise you will be chasing your tail for days on end, only to find that a database was being restored, or that another guest was being built for another project. The management of change in such a readily changeable environment becomes critical.

How Do We Test?

Define what you want to test

Have an environment that allows you to repeat those tests

Stress and performance testing is - or ought to be - a science. Most of the people in this room, I expect, have at least a senior high-school understanding of scientific experiments, if not university level. Your performance testing should be like a lab: an environment where empirical thinking drives the process. You state your goals. You explain how your tests are going to meet your goals. You document your outcomes. If you need to make changes to meet those goals, you document the changes, making them one at a time.

Understand what you're measuring. People get hung up on numbers like utilisation, steal, and whatnot. System metrics. But what your users and customers care about are user experience metrics. Did their analytic batch job come back in minutes, hours, or days? Did their interactive app respond in tenths of seconds or tens of seconds? Once you know whether you're hitting your goals, then the system metrics become interesting.
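As a rough illustration of putting user-experience metrics first, here's a minimal sketch that times requests against an endpoint and checks them against a latency goal; the URL, sample count, and threshold are placeholders, not figures from our environment.

```python
import statistics
import time
import urllib.request

URL = "http://stress-env.example.com/health"  # placeholder endpoint
SAMPLES = 50
GOAL_P95_SECONDS = 0.5                        # invented user-experience goal

def measure(url, samples):
    """Time a series of sequential requests and return the durations."""
    timings = []
    for _ in range(samples):
        start = time.monotonic()
        with urllib.request.urlopen(url, timeout=10) as response:
            response.read()
        timings.append(time.monotonic() - start)
    return timings

if __name__ == "__main__":
    timings = measure(URL, SAMPLES)
    p50 = statistics.median(timings)
    p95 = statistics.quantiles(timings, n=20)[-1]
    print(f"p50={p50:.3f}s p95={p95:.3f}s")
    # Only once the user-facing goal is missed do utilisation, steal,
    # and the other system metrics become interesting.
    print("PASS" if p95 <= GOAL_P95_SECONDS else "FAIL: now look at system metrics")
```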

Above all, production is your control environment. We've had a running argument with a colleague for a few months because he keeps insisting that when we see unexpectedly bad behaviour in production but can't reproduce it in our lab, "production is wrong." Nope. The *lab* is wrong. Because we're modelling the real world. If we start asserting the real world is wrong because it doesn't match our models, we're not doing science any more, we're doing economics.

So let's have a look at some of the types of tests that can be completed using performance test artifacts.

Types Of Testing

Proof Of Concept

Peak Load Simulation

Stress Test

Duration/Soak

Customised Tests for specific uses

In the world where I work, our tests are developed from use cases or stories, or built to drive a specific part of the stack - say, a web service operation. A central design tenet is that any of these tests can be used together to emulate production.

Proof of Concept - get in early to prove or disprove whatever has been bought from a nice glossy brochure. Getting in early can save money, both in wasted effort and in disproving the figures offered by our previously mentioned charlatans.

Peak Load Simulation - what is your application going to look like on Monday morning when everyone logs in at once, full of vim and vigour at the start of their working week? Or on the last payday before Xmas? Or when online exam results come out?

Stress Test - when is it going to break? Because everything has a breakpoint. Can you confirm the threshold of your environment?

Duration/Soak - do you have a memory leak? You know how fast the app can run and how many users it can support, but how long will it stay up?

Customised tests for specific uses - having the confidence to prove your high availability model under a realistic load.
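Most of these test types come down to the load shape you drive. As a hedged sketch of what we mean - the step profiles and the dummy transaction below are invented for illustration, not lifted from any particular tool:

```python
import threading
import time

def transaction():
    """Stand-in for one scripted user action; replace with real client code."""
    time.sleep(0.05)

# Each profile is a list of (concurrent_users, hold_seconds) steps.
PROFILES = {
    # Monday-morning spike: ramp hard and hold.
    "peak":   [(10, 30), (50, 30), (200, 120)],
    # Keep stepping up until something breaks.
    "stress": [(50, 60), (100, 60), (200, 60), (400, 60), (800, 60)],
    # Modest load held for hours, to flush out leaks.
    "soak":   [(50, 4 * 60 * 60)],
}

def run_step(users, hold_seconds):
    """Hold a fixed number of concurrent workers for the given duration."""
    stop = time.monotonic() + hold_seconds
    def worker():
        while time.monotonic() < stop:
            transaction()
    threads = [threading.Thread(target=worker) for _ in range(users)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

def run_profile(name):
    for users, hold in PROFILES[name]:
        print(f"{name}: {users} users for {hold}s")
        run_step(users, hold)

if __name__ == "__main__":
    run_profile("peak")
```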

But does virtualisation change everything? (Segue to Rodger.)

Virtual Changes Everything?

Virtualisation accelerates your ability to make changes

Stress environments are all about change

Good news first: all the benefits of virtualisation apply to stress environments; if you've spent any time doing this sort of work on bare metal, you'll know that one of the biggest pain points is making significant changes and running projects built on common components in parallel.

Being able to keep multiple VMs lying around, and cloning them for fast deployment and rollbacks, saves significant time and money compared to stress testing on bare metal.

The biggest problem in understanding and debugging performance issues in virtualised environments is really simple: people look for the thing that's different from their accustomed environment, and blame that.

There's no shortage of people out there who have heard that virtualisation causes problems, and therefore this must be the problem - just like there are people who ran a Java app in 1998 and continue to insist garbage collection cripples performance, or argue you can't build large applications in PHP while using Facebook. And then there's...

It's A Technology, Not An Excuse

...people who just look for any excuse not to do their job when they can blame something else.

Some proportion of people, who are used to dealing with bare metal, will reflexively blame the virtualised environment. "Oh," they declaim, "if only we were running on the server we wouldn't have this problem."

There are a number of answers to this. I prefer this one...

We Fear Change

Educate

Train

Involve

...but my managers are strangely reluctant to endorse actually hitting people with it until they talk sense.

One answer they will accept: train your people. One way we eliminated a huge amount of pain for ourselves was by going around teams in our bank and spending time and energy on seminars and training sessions about the differences between our virtual and bare metal environments, and how to think about and diagnose problems. Another was to get more people involved in the stress test sessions. It's not something you can rely on people knowing or understanding, especially your developers, and it was interesting coming to grips with the gaps in even basic troubleshooting knowledge around the place. It's resulted in a huge reduction in the number of stupid and pointless conversations in my life.

Wheels Within Wheels

More moving parts

More tuning

More knowledge

More instrumentation

The next most fundamental difference is the obvious one: you've added another moving part to your stack. Since the whole purpose of stress and performance testing is optimising the moving parts, that means you need to understand the hypervisor, and that means you need to instrument it. Trying to diagnose performance questions and understand how your application is going to scale without the proper metrics out of the hypervisor is stabbing around in the dark.

Are you seeing poor IO because the guest is running out of resources, because the hypervisor is running out of CPU to do the emulation, or because the physical NICs your virtual networks are running through are fully utilised? We've seen all three problems at various times.

One of the nice things, by the way, about Linux-based hypervisors like KVM is that we can use our normal tools on the hypervisor itself, something I really like; it minimises the amount of new knowledge required. Our zVM environments, on the other hand, required a whole new range of expertise and terminology.
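As a minimal sketch of the "normal tools on the hypervisor" point - this assumes a Linux/KVM host where /proc/stat and /proc/net/dev are available, and the five-second interval and interface name prefixes are arbitrary choices:

```python
import time

def cpu_snapshot():
    """Return (busy, total) jiffies from the hypervisor's /proc/stat."""
    with open("/proc/stat") as f:
        fields = [int(v) for v in f.readline().split()[1:]]
    idle = fields[3] + fields[4]              # idle + iowait
    return sum(fields) - idle, sum(fields)

def nic_bytes():
    """Return total rx+tx bytes across the physical NICs behind the bridges."""
    total = 0
    with open("/proc/net/dev") as f:
        for line in f.readlines()[2:]:
            name, data = line.split(":", 1)
            if name.strip().startswith(("eth", "en", "bond")):
                cols = data.split()
                total += int(cols[0]) + int(cols[8])   # rx_bytes + tx_bytes
    return total

if __name__ == "__main__":
    busy1, total1 = cpu_snapshot()
    net1 = nic_bytes()
    time.sleep(5)
    busy2, total2 = cpu_snapshot()
    net2 = nic_bytes()
    print(f"hypervisor CPU busy: {100 * (busy2 - busy1) / (total2 - total1):.1f}%")
    print(f"physical NIC traffic: {(net2 - net1) / 5 / 1e6:.1f} MB/s")
```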

Real Life

The Thundering Herd

Reporting Skew

Steal

Hypervisor Overhead

When to Give Up

So far we've been speaking pretty generally; I'd like to grab some real-world examples.

The first goes back to my comments around training: the thundering herd of admins and developers. A large proportion of the people I work with were in the habit of trying to understand systems by logging into them and running their favourite tools en masse, leaving a dozen copies of top lying around. That's not too bad on bare metal systems with heaps of spare grunt. When they do it to a dozen VMs all running on the same physical servers... it makes a bad situation worse. People need to learn to look at centrally-gathered information. They should have been doing it that way before, but now they're crippling the systems in the process of doing it the wrong way.

Lies, etc

Your in-guest view is wrong

2.4 Kernels were terrible

2.6 Kernels are much better

Inaccuracy still creeps in

Reporting skew has been another one people have a hard time coming to grips with; when running Linux as a VM, the 2.4 series of kernels were notorious for reporting errors in CPU time and similar stats, often by orders of magnitude on some platforms. The 2.6 line of kernels gave us some huge improvements, but even today in our environments we routinely see the kernels of virtualised Linux guests reporting numbers that are 5-10% out of whack with what the hypervisor tells us. People need to be educated to look at the hypervisor instrumentation first and the guest second, and to accept the hypervisor as authoritative. If your instrumentation is smart enough to correlate the two, so much the better.
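One way to sanity-check that skew, sketched here on the assumption of a KVM host where each guest is a single qemu process (the PID is a placeholder you'd look up yourself): measure the CPU time the hypervisor charges to the qemu process over an interval, then compare it with what the guest's own /proc/stat claims for the same window.

```python
import os
import time

CLOCK_TICKS = os.sysconf("SC_CLK_TCK")
QEMU_PID = 12345  # placeholder: PID of the qemu process backing the guest

def qemu_cpu_seconds(pid):
    """CPU seconds (user + system) the hypervisor has charged to this process."""
    with open(f"/proc/{pid}/stat") as f:
        # Split after the closing ')' of the comm field, which may contain spaces.
        fields = f.read().rsplit(")", 1)[1].split()
    utime, stime = int(fields[11]), int(fields[12])   # fields 14 and 15 overall
    return (utime + stime) / CLOCK_TICKS

if __name__ == "__main__":
    before = qemu_cpu_seconds(QEMU_PID)
    time.sleep(10)
    after = qemu_cpu_seconds(QEMU_PID)
    # Treat this as the authoritative figure; the guest's own accounting for
    # the same ten seconds is the number to check for skew.
    print(f"hypervisor view: guest used {after - before:.2f} CPU-seconds in 10s")
```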

Those are pretty minor education and tooling issues compared to our biggest bugbear, steal. Steal has been a huge pain in the arse for us. Why's that? Well, I have a picture of the person who thought up the name for this value, by the way:

Steal: Threat or Menace?

"Steal" is a term that causes disharmony amongst ponies. "You stole my cycles!" I'd be pissed off, too, if I thought some other bugger was nicking all my CPU time. But it sends people off in the wrong direction. They start looking for the thieves, often before they actually have a problem - or, worse yet, instead of understanding the problem they do have. Because if you're seeing steal, it may be because your hypervisor is overcommitted, or it may be because it's doing a lot of work on your behalf, or it may simply be that everything is running exactly as it should. "Steal time" probably seemed like as good a name as any, but the problem is that it's highly misleading. It represents a bucket of things. Yes, it can represent time taken to run other VMs. But that's kind of the point of virtualisation: getting as close to 100% utilisation of your iron as you can. If you aren't using those idle cycles, I'll put them to good use elsewhere and you'll never know the difference.

Neither!

Steal is the hypervisor doing its job!

Steal is not always a problem!

Worry about user experience, not steal!

More amusing still, steal time can represent time the hypervisor spends doing work on behalf of the guest. The IBM team at Boeblingen, who work on Linux on mainframes, tell the amusing story of a customer who logged a support call because they could still see a percent or so of steal time, and wanted to know how to eliminate this troublesome amount. The Germans were forced to explain that the only way to do so would be to stop the hypervisor performing any disk, network, or console IO on behalf of the virtual machine. Which would, you know, be a bit limiting.

But people get hung up on that. They will explain, with absolute certainty, that they need more vCPU because they can see steal time. Even though they may have a whole vCPU of idle time, anyway!
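If you do want to look at steal from inside the guest, here's a minimal sketch that reports steal and idle side by side over a short interval, which tends to settle the "we need more vCPU" conversation quickly; the field positions assume a reasonably modern Linux /proc/stat.

```python
import time

def cpu_counters():
    """Return the aggregate cpu counters from the guest's /proc/stat."""
    with open("/proc/stat") as f:
        return [int(v) for v in f.readline().split()[1:]]

def steal_and_idle(interval=5):
    before = cpu_counters()
    time.sleep(interval)
    after = cpu_counters()
    delta = [b - a for a, b in zip(before, after)]
    total = sum(delta)
    return 100 * delta[7] / total, 100 * delta[3] / total  # steal, idle

if __name__ == "__main__":
    steal, idle = steal_and_idle()
    # Steal alongside a whole vCPU's worth of idle is a strong hint that
    # more vCPUs are not the answer.
    print(f"steal {steal:.1f}%  idle {idle:.1f}%")
```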

The final word on steal is user experience. If your stress tests show you're hitting all your goals, then steal is meaningless.

Hitting the Redline

Not all hypervisors are equal

Degradation can be dramatic

Symptoms can be non-obvious

A final technical word about maxing out your hardware: you need to understand how hard you can push your environment before it falls over. We run a variety of hypervisors - zVM, KVM-based, and VMware. We find they all degrade at different points and with different characteristics, depending on the hypervisor and the workload on it.

zVM is our champ, running at 85-90% all day long. Some people get as high as 95% utilisation. Fantastic. But it should be, for what it costs. Our experience with VMware is that perhaps 70% is the practical limit. KVM has been around 80%, which isn't too shabby, but it's no zVM.

At those points performance starts to degrade significantly, and often dramatically. We typically see massive drop-offs in virtual network and disk performance as the hypervisor runs out of headroom to emulate those resources. In the early days, not understanding that caused us some pain: we saw plummeting IO and failed to understand it was a CPU problem - the hypervisor itself had become CPU bound - partly because at that point we didn't have great hypervisor instrumentation. In the worst case, an earlier version of our KVM-based hypervisor would reboot when 30 or 40 guests ran a particular IO- and CPU-intensive workload all at once, but the same hypervisor could run at 80% VM utilisation all day for a more purely CPU-based workload.
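A sketch of the kind of redline watch we keep during a test run; the thresholds are the rough figures above, not anything official from the vendors, and the busy calculation is the same /proc/stat delta shown earlier, so treat the whole thing as an assumption to be tuned per hypervisor.

```python
import time

# Rough practical ceilings from our own experience; yours will differ.
REDLINES = {"zvm": 0.88, "vmware": 0.70, "kvm": 0.80}

def busy_fraction(interval=5):
    """Fraction of CPU time the hypervisor spent non-idle over the interval."""
    def snap():
        with open("/proc/stat") as f:
            fields = [int(v) for v in f.readline().split()[1:]]
        return sum(fields) - fields[3] - fields[4], sum(fields)
    busy1, total1 = snap()
    time.sleep(interval)
    busy2, total2 = snap()
    return (busy2 - busy1) / (total2 - total1)

if __name__ == "__main__":
    redline = REDLINES["kvm"]
    for _ in range(12):   # watch for roughly a minute during the test run
        busy = busy_fraction()
        flag = "OVER REDLINE" if busy > redline else "ok"
        print(f"hypervisor busy {busy:.0%} (redline {redline:.0%}) {flag}")
```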

When To Give Up

What are your gains worth?

What are you spending to get them?

5% on 1 environment is nothing

5% on 300 is huge

Finally, a practical, but not technical issue: you can spend an infinite amount of time optimising.

If you're working on something for your own interest, that's cool. If you aren't, you need to be aware of how your costs and benefits stack up - and in a heavily virtualised shop the equation can be quite different.

A 5% performance improvement on a stand-alone server may be meaningless to your end-users if the server is barely utilised, but when that 5% reduction in CPU is a tuning that can be made to one, two, three hundred or more VMs, you can be talking about serious savings.
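The arithmetic is worth spelling out. With invented numbers - two vCPUs per VM, sixteen-core hosts, a notional cost per host - the back-of-the-envelope looks like this:

```python
VMS = 300
VCPUS_PER_VM = 2
SAVING = 0.05                  # the 5% CPU reduction per VM
CORES_PER_HYPERVISOR = 16
COST_PER_HYPERVISOR = 10_000   # purely illustrative

freed_vcpus = VMS * VCPUS_PER_VM * SAVING          # 30 vCPUs' worth of work
freed_hosts = freed_vcpus / CORES_PER_HYPERVISOR   # roughly 1.9 hosts' capacity
print(f"~{freed_vcpus:.0f} vCPUs freed, ~{freed_hosts:.1f} hypervisors, "
      f"~${freed_hosts * COST_PER_HYPERVISOR:,.0f} of capacity reclaimed")
```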

Any Questions?

So, today we've looked at some of the basic tenets of performance testing, its interaction with both physical and virtualised environments, and some of the lessons we've learned along the way.

We've really only scratched the surface of the possibilities and the lessons, so before we open the floor to questions we'd like to take this opportunity to say that we're more than happy to discuss in greater detail any of the things we've touched on today.

But, for now, the floor is open to questions.