“Virtualizing I/O Devices on VMware Workstation’s Hosted Virtual Machine Monitor,” a paper by Jeremy Sugerman, Ganesh Venkitachalam, and Beng-Hong Lim; presented by Kit Cischke
TRANSCRIPT
Presentation Overview
Section the First: Basic Background and the Problems Virtualizing the IA-32
Section the Second: A Hosted Virtual Machine Architecture, Especially Virtualizing Network I/O
Section the Third: Performance Metrics and Analysis, Including Optimizations
Section the Fourth: Future Performance Enhancements
A Little Story The year is 1997-ish. A young, idealistic
college student wants to try this “Linux” thing. He gets a hold of some Red Hat installation
discs and goes to it (with much help). Success!
Well, mostly. His network card isn’t supported by the distro
on the disk. He can get an updated driver online, but he can’t get
online because his network card doesn’t have a driver.
This poor student goes back to playing Tomb Raider.
Alternatively…
What, according to popular “informed” opinion, makes the Wintel platform so popular, yet so unstable (relatively speaking)? Support for lots and lots of hardware!
The Point Both of those stories illustrate the same
point, and the driving force behind this paper…
Namely, if you’re writing VM software for the PC, you either have to:
A.) Write device drivers for a vast array of devices that work with your VMM, or
B.) Come up with a way to use the existing drivers.
Enter: The “Hosted” VM Model.
Other PC Virtualization Problems Technically, the IA-32 is not naturally
virtualizable. So sayeth Popek and Goldberg: “An
architecture can support virtual machines only if all instructions that can inspect or modify privileged machine state will trap when executed from any but the most privileged mode.”
Lots of pre-existing PC software people won’t get rid of.
A Hosted VM Architecture VMApp is what the user
sees, and runs in normal user space.
VMDriver is essentially the VMM, and residing as a driver, it gets direct access to the hardware.
The physical processor is either executing in the host world or the VMM world.
“World switches” mean restoring all of the user- and system-visible state, making them more heavyweight than “normal” process switches.
Consequences Strictly computational applications run just like any
other VM. When I/O occurs, a world switch occurs, VMApp
performs the I/O request on behalf of the guest OS, captures the result, then hands it to VMDriver. We’ll look at this in more detail later.
Obviously, there is opportunity for significant performance degradation from this, as world switches are expensive.
Additionally, since the host OS is scheduling everything, performance of the VM can’t be guaranteed.
The question is, “Is performance good enough, given device support?”
Basic I/O Virtualization On PCs, I/O access is usually done using
privileged IN and OUT instructions. A decision is made: Can the VMM handle
it? If it is I/O stuff that doesn’t need to actually talk
to the hardware, then yes. Otherwise, cause a world switch and ask the
host to do it. This is totally adequate and appropriate for
devices with low sustained throughput or high latency. For example, a keyboard.
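That dispatch decision can be sketched in a few lines of Python. This is a minimal illustration, not VMware’s actual interface; `vmm_emulable_ports` and `world_switch` are hypothetical names.

```python
# Hypothetical sketch of the VMM's choice on a trapped OUT instruction:
# emulate locally when no real hardware is involved, otherwise pay for
# a world switch and let the host OS perform the access.

def handle_guest_out(port, value, vmm_emulable_ports, world_switch):
    """Dispatch a trapped OUT from the guest to the cheap or expensive path."""
    if port in vmm_emulable_ports:
        # Pure device-model state (e.g. a latched address register):
        # the VMM updates it in place, with no world switch.
        vmm_emulable_ports[port] = value
        return "handled-in-vmm"
    # Real hardware must be touched: switch worlds and have VMApp
    # issue the access through the host OS.
    world_switch(port, value)
    return "world-switch"
```

The point of the split is that the cheap branch costs roughly a memory write, while the expensive branch costs two world switches plus a trip through the host.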
Virtualizing a Network Card A NIC is everything a keyboard isn’t: high
sustained throughput and low latency. Therefore, its performance can be
indicative of the paradigm as a whole. In VMware Workstation, there is a virtual
NIC presented to the guest, which is largely indistinguishable from a “real” PCI NIC.
There are two different ways we can make the connection from the physical NIC to the virtual NIC.
Virtual NIC Details
Implemented partially in the VMM and partially in VMApp.
VMM exports virtual I/O ports and a virtual IRQ for the device in the VM.
VMApp catches and honors the requests made to the virtual NIC.
The modeled NIC is an AMD Lance.
Sending Packets from a Guest The guest OS issues OUT
instructions to the Lance. The VMM sees these, and
hands control to VMDriver. This requires a world switch.
VMDriver pushes the requests to VMApp. This happens a number of
times. VMApp makes a system call
to the physical NIC, which actually sends the packet.
When finished, we switch back to the VMM and raise an IRQ indicating the packet has been launched.
This stuff only needs to be done because of the hosted VM.
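The transmit steps above can be written down as a trace, which makes the cost structure visible. All names here are hypothetical; `host_send` stands in for VMApp’s syscall on the physical NIC.

```python
# Hypothetical trace of the hosted-VM transmit path described above.
# The two world switches are the dominant cost on this path.

def transmit_packet(packet, host_send, trace):
    trace.append("guest OS: OUT instructions to the virtual Lance")
    trace.append("VMM: trap the OUTs, hand control to VMDriver")
    trace.append("world switch: VMM world -> host world")  # expensive
    host_send(packet)  # VMApp asks the host OS to send on the real NIC
    trace.append("world switch: host world -> VMM world")
    trace.append("VMM: raise virtual IRQ (packet launched)")
```

A native driver would do only the `host_send` step; everything else in the trace is hosted-VM bookkeeping.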
Receiving Packets in a Guest
The physical NIC gets the data and raises a real IRQ.
The NIC delivers the packets to our virtual bridge, and VMApp sees them when it issues a select syscall.
VMApp moves the packets to a shared memory location and tells the VMM it’s ready.
World switch to the VMM, and the VMM raises a virtual IRQ.
The Guest OS recognizes this and issues appropriate instructions to read the incoming data.
The VMM switches back to the host world to let the physical NIC receive some packet acknowledgements.
More extra work!
Section the Third – Performance Where will performance be affected?
1. A world switch from the VMM to the host for every real hardware access. Recall there are some we can successfully fake.
2. I/O interrupt handling might mean traveling through ISRs in the VMM, the host OS, and the guest OS.
3. Packet transmission by the guest involves the drivers for both the virtual and physical NICs.
4. Data that is in kernel buffers needs to go to the guest OS, and that’s just one more copy.
Consequence: An app that would otherwise saturate the network interface might become CPU bound by all this extra work.
So does it?
Experimental Setup Two Intel-based machines, physically connected via Ethernet and a
crossover cable: PC-350: A 350 MHz Pentium II with 128 MB RAM running Linux
2.2.13. PC-733: A 733 MHz Pentium III with 256 MB RAM running Linux
2.2.17. Virtual machines were configured with that Lance NIC
bridged to the physical Intel EtherExpress NICs. The VM runs Red Hat 6.2 plus the 2.2.17-14 kernel update.
Each physical machine ran one instance of the VM with half the physical RAM of the host.
An in-house test program called nettest was written to stress the network interface. nettest attempts to eliminate or mitigate any machine-
dependent influences aside from network performance.
Packet Transmit Overheads Experiment 1:
VM/PC-733 sends 100 MB to PC-350 in 4096-byte chunks.
Result 1: The workload is CPU-bound, with an
average throughput of only 64 Mb/s.
Analysis 1: On the next slide!
Experiment Analysis
From the start of the OUT instruction to when the packet hits the wire, it’s 23.8 μs. It’s 31.63 μs to when control is returned to the VMM for the next instruction.
Of that time, 30.65 μs are spent in world switches and in the host. (That’s 96.9%.) If we assume the 17.55 μs in the driver are spent physically transmitting the packet, that leaves 13.10 μs of overhead.
That’s considerable, but doesn’t explain how we became CPU-bound.
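The accounting above checks out numerically (the figures are the paper’s measurements; the variable names are mine):

```python
# Per-packet transmit accounting from the measurements above.
total_us     = 31.63  # OUT issued until control returns to the VMM
host_side_us = 30.65  # time spent in world switches and in the host world
driver_us    = 17.55  # assumed physical transmit time in the host driver

host_fraction = host_side_us / total_us   # -> about 0.969, i.e. 96.9%
overhead_us   = host_side_us - driver_us  # -> 13.10 us of pure overhead
```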
Further Analysis More than 25% of the
time in the VMM is spent just preparing to transfer control to VMApp.
Then, each of those transfers requires an 8.90 μs world switch.
This is almost two orders of magnitude slower than native I/O calls.
Further Further Analysis The other significant cost isn’t in just one row of the
chart, but spread around: IRQ processing. Every packet sent raises an IRQ on both the virtual
and physical NIC. For network-intensive ops, the interrupt rate is high.
So what? Each IRQ in the VMM world runs the VMM ISR, then
there’s a world switch to the host world. Then the host ISR runs. If the packet is destined for the guest, VMApp needs to send an IRQ to the guest OS.
That virtual IRQ requires another world switch, delivering the IRQ and then running the guest OS’s ISR.
That’s a lot of code for a simple packet IRQ. Furthermore, the guest ISRs run expensive privileged
instructions to handle the interrupt. Even more cost.
One More Overhead
VMApp and the VMM can’t tell whether a packet is destined for the host or for the guest. Only the host OS can do that.
This means there’s a world switch in store, which takes time.
Running the select syscall too frequently is wasteful and running it too infrequently could miss important, time-critical deadlines.
Optimizations – I/O Handling If it’s not a real packet transmission, don’t
switch into the host world. The VMM can do it.
Additionally, we can take advantage of the memory semantics of the Lance address register and avoid privileged instructions for those accesses. Just use a simple MOV instruction instead.
Send Combining If the world switch rate is too high (as
monitored by the VMM), a transmitted packet won’t actually cause a world switch.
Instead, it goes into a ring buffer until an interrupt-induced world switch occurs. When the world switch occurs, off go the
packets. If the ring buffer gets too full (3 buffered
packets), we force the world switch. Bonus: other interrupts might (and do) get
processed when we’re already in the host world, saving more world switching.
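A sketch of the send-combining policy, under stated assumptions: the class and names are hypothetical, `world_switch` stands in for flushing queued packets in the host world, and the three-packet limit is the one on the slide.

```python
from collections import deque

FORCE_SWITCH_THRESHOLD = 3  # buffered-packet limit before forcing a switch

class SendCombiner:
    """Hypothetical sketch: defer world switches when their rate is high."""

    def __init__(self, world_switch):
        self.ring = deque()
        self.world_switch = world_switch  # sends a batch of packets

    def transmit(self, packet, switch_rate_high):
        if not switch_rate_high:
            self.world_switch([packet])   # normal path: switch per packet
            return
        self.ring.append(packet)          # defer: queue in the ring buffer
        if len(self.ring) >= FORCE_SWITCH_THRESHOLD:
            self.flush()                  # ring too full: force the switch

    def flush(self):
        # Called on a forced switch, or piggybacked on an interrupt-induced
        # world switch that was going to happen anyway.
        if self.ring:
            self.world_switch(list(self.ring))
            self.ring.clear()
```

The win is that several packets share the cost of one world switch instead of paying it each.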
Cheating on IRQ Notification Rather than using the select syscall, let
VMNet and VMApp use some shared memory to communicate when packets are available.
Now we just check a bit vector and automatically return to the VMM, saving at least one run through an ISR.
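The bit-vector check might look like this sketch (hypothetical names; the real mechanism is shared memory between the VMNet driver and VMApp, not a Python object):

```python
class SharedNotify:
    """Hypothetical sketch of select()-free packet notification."""

    def __init__(self):
        self.bits = 0  # shared bit vector, one bit per virtual NIC

    def post(self, vnic):
        # VMNet (driver) side: packets have arrived for this virtual NIC.
        self.bits |= 1 << vnic

    def pending(self, vnic):
        # VMApp side: a plain memory read replaces the select() syscall.
        return bool(self.bits & (1 << vnic))

    def clear(self, vnic):
        # After VMApp drains the queued packets for this virtual NIC.
        self.bits &= ~(1 << vnic)
```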
Future Performance Enhancements Reducing CPU Virtualization
The authors cop out and say, “However, a discussion of [delivering virtual IRQs, handling IRET instructions and the MMU overheads associated with context switches] requires an understanding of VMware Workstation’s core virtualization technology and is beyond the scope of this paper.”
They do mention an “easy” optimization regarding the interrupt controller.
Every TCP packet requires an ACK, and each ACK means five accesses to the interrupt controller. One of those accesses can be treated like the Lance address register and replaced by a MOV instruction, saving costly virtualization of privileged execution.
Modifying the Guest OS First Idea: Make the Guest OS stop using
privileged instructions! Second Idea: Make the Guest share
intelligence with the VMM. For example: Don’t do page table switching
when starting the idle task. This is meaningful: VM/PC-733 spends 8.5% of
its time virtualizing page table switches! A quick and dirty prototype of this halved
the MMU-derived virtualization overhead, and all that saved time becomes CPU idle time.
Optimizing the Guest Driver Create an idealized version of the device
you’re trying to support. E.g., a virtual NIC that only uses a single OUT
instruction and skips the transmit IRQ, instead of 12 I/O instructions and a transmit IRQ.
That’s all well and good (and VMware actually does this in their server products), but it means supporting a bunch of drivers, which is what we were trying to avoid in the first place.
Changing the Host If we can change the host’s behavior, we can save
time or memory (or both). For example, network operations in Linux make
heavy use of sk_buffs, the buffers in which the Linux kernel
handles network packets. A packet is received by the network card, put into an sk_buff, and then passed to the network stack, which uses that sk_buff.
What would be great is to allocate a big chunk of memory to fill with sk_buffs instead of constantly mallocing fresh memory.
Two potential problems: memory leaks, and inaccessible OS code.
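The pooling idea can be sketched in a few lines (hypothetical names; real sk_buffs are kernel C structures, not Python bytearrays). Note how the first problem above shows up: a buffer that isn’t returned via `put` is lost to the pool.

```python
class BufferPool:
    """Hypothetical sketch: preallocate packet buffers instead of
    allocating a fresh one per packet."""

    def __init__(self, count, size):
        self.size = size
        self.free = [bytearray(size) for _ in range(count)]

    def get(self):
        # Reuse a preallocated buffer; only allocate when the pool is empty.
        return self.free.pop() if self.free else bytearray(self.size)

    def put(self, buf):
        # Buffers must come back, or the pool effectively leaks memory.
        self.free.append(buf)
```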
Bypass the Host
In other words, ditch the whole “hosted VM” thing and write real drivers for the VMM.
Of course, this takes us back to the root of the problem – too many devices to support.
Summing Up We can make our VMMs simpler and more widely
available using a hosted VM. I/O performance takes a hit though. There are various optimizations that can be made:
Reducing world switches by paying attention to what we’re trying to accomplish.
Reduce overhead during world switches by send combining (and other similar things).
Cheat on checking for available data from the driver. With these optimizations, we can nearly match
native performance, even on sub-standard machines. The 733 MHz machines the tests were run on were below
the corporate standard, even for 2001.