vmware - instituto de computação
VMware
Fernando Granha Jeronimo
November 29, 2012
Fernando Granha Jeronimo VMware
VMware
Plan
1 Introduction: VMware, VMM
2 Workstation: I/O
3 ESX Server: Ballooning
4 Hardware Assist: Intel VT-x, Memory Management Virtualization
Introduction
VMware
VMware
The importance of the hypervisor:
Under the trap-and-emulate mindset, x86 virtualization was considered impossible
VMware was founded in 1998 by a group of highly skilled professionals
The company succeeded thanks to a favorable combination of circumstances: x86 processing power had grown and most servers were underutilized
VMware has a market share of 80%
Data center virtualization (vMotion is a key component)
Virtualization is the foundation of cloud computing
VMMs/hypervisors are becoming commodities, and the focus is now on the management stack
VMM
Virtual Machine Monitor
Category
System virtual machine from x86 to x86.
Groundbreaking
There was a general misunderstanding about whether x86 could be virtualized. The mindset was that a virtualizable architecture must be able to run the guest operating system at a privilege level lower than the VMM's, so that behaviour- and control-sensitive instructions generate a trap and are then emulated. In fact, that is only one way of meeting the Popek and Goldberg virtualization criteria.
Question
VMware runs all guest software in a deprivileged mode, so how can it ensure that instructions such as popf, which do not trap in user mode, keep their semantics?
Answer
VMware achieves this through dynamic binary translation (DBT). When the translator encounters an instruction such as popf, the translated code calls or inlines an emulation routine. The technique is useful not only for non-virtualizable instructions but also for instructions that would trap, since trap handling carries a major performance penalty on out-of-order architectures.
Question
One important requirement of Popek and Goldberg is that most instructions run natively without modification. Is it necessary to translate all guest code?
Answer
No. Only the guest operating system (code meant to run with CPL=0) needs to be translated, and it represents a small part of the executed code.
How is the translation done?
As is usual for a DBT, translation is done on demand, which avoids the problem of telling code and data apart.
1 The translator starts at the current source PC and reads up to 12 instructions, stopping earlier if it finds a control-flow instruction such as a call, jump or branch
2 These instructions form the translation unit (TU), which is later translated to an intermediate representation
3 Finally, compiled code fragments (CCFs) are generated
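The first step above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Insn` record, the `CONTROL_FLOW` mnemonic set and the choice to include the terminating control-flow instruction in the TU are hypothetical, not taken from VMware's translator.

```python
from dataclasses import dataclass

MAX_TU_INSTRUCTIONS = 12                      # limit stated on the slide
CONTROL_FLOW = {"call", "jmp", "jcc", "ret"}  # hypothetical mnemonic set


@dataclass(frozen=True)
class Insn:
    pc: int        # address of the instruction
    mnemonic: str  # decoded operation name
    length: int    # size in bytes


def build_translation_unit(code, start_pc):
    """Collect instructions from start_pc until the limit or a control-flow instruction.

    `code` maps an address to the Insn decoded at that address (an assumption:
    a real translator decodes bytes on demand).
    """
    tu = []
    pc = start_pc
    while len(tu) < MAX_TU_INSTRUCTIONS and pc in code:
        insn = code[pc]
        tu.append(insn)
        if insn.mnemonic in CONTROL_FLOW:
            break  # assumption: the terminating instruction ends the TU
        pc += insn.length
    return tu
```

Because decoding only follows the path actually reached by execution, bytes that are data are never fed to the decoder.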
Even in system code, most instructions get what is called an IDENT translation: no modification is needed. The following modifications are desirable or mandatory:
PC-relative instructions: as in other DBTs, translated code goes to a translation cache (TC), which changes the original code layout
Direct control flow: same reason as PC-relative instructions
Indirect control flow: needs a hash lookup
Non-virtualizable instructions: replacing these instructions with emulation routines is mandatory for correct execution
Privileged instructions: since the guest OS is deprivileged, such instructions would trap, causing a performance hit. It is therefore desirable to replace them proactively with emulation routines instead of waiting for a trap
Adaptive
Sometimes a non-privileged instruction accesses privileged data, for example a load or store to the page table. Since the page table is protected by the VMM, a trap is generated and the VMM has to emulate the access. As stated, traps are a major source of performance penalty, so it may be better to replace the instruction with a call to an emulation routine. The DBT starts from the premise that every instruction is innocent, but after a few traps it aggressively retranslates the instruction as guilty, and it moves back from guilty to innocent much less eagerly.
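The innocent-until-proven-guilty policy can be illustrated with a small Python model. It is a sketch under assumptions: the threshold of three traps stands in for "a few traps", the class and method names are hypothetical, and the slower transition from guilty back to innocent is left out.

```python
from collections import Counter

TRAP_THRESHOLD = 3  # hypothetical value for "a few traps"


class AdaptiveTranslator:
    def __init__(self, threshold=TRAP_THRESHOLD):
        self.threshold = threshold
        self.trap_counts = Counter()  # traps observed per instruction address
        self.guilty = set()           # addresses retranslated to call emulation

    def execute(self, pc, touches_protected_page):
        """Return how the access at pc was handled on this execution."""
        if pc in self.guilty:
            return "call-out"         # emulation routine runs, no trap taken
        if not touches_protected_page:
            return "ident"            # innocent code runs unmodified
        self.trap_counts[pc] += 1
        if self.trap_counts[pc] >= self.threshold:
            self.guilty.add(pc)       # retranslate: future runs avoid the trap
        return "trap"
```

Each instruction pays for at most `threshold` traps; later executions go straight to the emulation routine.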
The VMware DBT approach satisfies the Popek and Goldberg virtualization criteria:
Efficiency: user-mode code, which makes up the majority of guest code, runs directly without intervention
Resource Control: the guest runs in a deprivileged state (CPL=3), so it has no power to change system resources
Equivalence: the emulation routines preserve all the semantics, and self-modifying code is supported. VMware argues that trap-and-emulate is just one implementation that satisfies the virtualizability condition.
VMM uses segmentation to protect itself.
Most operating systems use paging (segmentation is rarely used)
The VMM is placed in the upper 4 MB of the address space and needs to be protected.
The code in the TC must be accessible; this is achieved by letting cs span the whole address space. However, writes from the guest to VMM space must be prevented, so all segments are truncated except gs, which the VMM uses to access its own data. This forces non-IDENT translations for instructions that use gs.
Memory Management - Shadow Page Tables
Guest OS: gVA ⇒ gPA
VMM: gPA ⇒ hPA
Shadow: gVA ⇒ hPA
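The shadow page table is the composition of the two mappings above. A minimal Python sketch, assuming flat dictionaries keyed by page number instead of multi-level hardware tables (the names `build_shadow` and `translate` are hypothetical):

```python
PAGE_SIZE = 4096


def build_shadow(guest_pt, vmm_map):
    """Compose gVA -> gPA (guest_pt) with gPA -> hPA (vmm_map), page by page."""
    shadow = {}
    for gvpn, gppn in guest_pt.items():
        if gppn in vmm_map:  # pages not backed by the host stay unmapped
            shadow[gvpn] = vmm_map[gppn]
    return shadow


def translate(shadow, gva):
    """Translate a guest virtual address to a host physical address."""
    vpn, offset = divmod(gva, PAGE_SIZE)
    return shadow[vpn] * PAGE_SIZE + offset
```

The hardware MMU walks only the shadow table, so the VMM has to rebuild the affected entries whenever the guest edits its own page table.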
Feature Summary
Binary
Dynamic
On demand
Subsetting
Do not optimize
Chaining
Adaptive
Two modes: BT (kernel mode) and direct execution (user mode)
Translation Example
Workstation
VMware Workstation
In the beginning, VMware was a start-up trying to launch a virtualization technology in a new market: commodity hardware. In those early days, two requirements were very important:
As a new technology, it could not force users to replace their OS
Implementing and maintaining the myriad of PC device drivers would not be feasible
In the first version, the target audience was mostly programmers
Given these requirements, the hosted architecture was the best alternative
Hosted Architecture
vmApp: runs in ring 3 (user space) and is responsible for issuing syscalls on behalf of the VMM to access host devices
VMM: runs in ring 0 and is responsible for exposing a uniform virtual hardware layer to the VM
vmDriver: runs in ring 0 and is responsible for the communication between the vmApp and the VMM
I/O
How is the I/O done?
The VMM exposes well-supported standard devices to the VM. For instance, the virtual NIC is an AMD Lance.
The VMM is aware of the semantics of each I/O port
The VM uses the IN/OUT instructions, and the VMM translates them into requests to the vmApp, which turns them into system calls in the host operating system (e.g. a request can become a read)
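The path from a guest IN/OUT to a host system call can be modelled as follows. This is a toy sketch, not VMware's code: the class names, the port-to-descriptor table and the `world_switches` counter are hypothetical, and an ordinary file descriptor stands in for a host device.

```python
import os


class VmApp:
    """User-level process (ring 3) that issues host system calls for the VMM."""

    def __init__(self, host_fd_by_port):
        self.host_fd_by_port = host_fd_by_port  # hypothetical port -> host fd table

    def handle_in(self, port, width):
        # A guest IN becomes a read() on the host descriptor backing the port.
        return os.read(self.host_fd_by_port[port], width)

    def handle_out(self, port, data):
        # A guest OUT becomes a write() on the host descriptor backing the port.
        return os.write(self.host_fd_by_port[port], data)


class Vmm:
    """Intercepts guest port I/O and forwards it to the vmApp."""

    def __init__(self, vmapp):
        self.vmapp = vmapp
        self.world_switches = 0  # each forwarded request leaves the VMM world

    def guest_in(self, port, width):
        self.world_switches += 1
        return self.vmapp.handle_in(port, width)

    def guest_out(self, port, data):
        self.world_switches += 1
        return self.vmapp.handle_out(port, data)
```

Every forwarded access costs a switch to the host world, which is why the number of such requests matters for performance.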
Network I/O path
Sources of overhead
Since the VMM runs in ring 0 and is not part of the host OS, switching between them is more expensive than a regular context switch because privileged state must also be saved; it is called a world switch
The VMM has delegated device handling to the host OS, so if the VMM receives an interrupt it cannot process it. It must make a world switch to the host OS, which processes the interrupt and later hands the result to the vmApp. For instance, if the vmApp has read a new packet, another world switch must take place so that the VMM can deliver the packet to a VM
Workloads that are I/O bound when run natively can become CPU bound in a virtualized environment due to the extra processing
Improvements
Handle all possible IN/OUTs in the VMM: a significant part of IN and OUT instructions do not require contact with the external world, because some ports act merely as latches
Send Combine: when the system is experiencing a high rate of world switches, packets are not sent as soon as possible; instead, up to three are queued in a ring buffer. Because the system is frequently switching to the host world anyway, the queued packets do not wait long before being sent
Remove select: the vmApp uses select to listen for changes, which unfortunately is expensive. Instead, a shared-memory bitmap is used to communicate IRQs between the actual driver and the vmApp
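The send-combining idea can be sketched as a small queue that batches packets while world switches are frequent. The limit of three comes from the slide; the rate threshold, the `transmit` callback and the class name are assumptions made for illustration.

```python
from collections import deque

MAX_QUEUED = 3  # limit stated on the slide


class SendCombiner:
    def __init__(self, transmit, high_rate_threshold):
        self.transmit = transmit                        # callable that sends a batch via the host
        self.high_rate_threshold = high_rate_threshold  # hypothetical tuning knob
        self.queue = deque()

    def send(self, packet, world_switch_rate):
        """Queue the packet when world switches are frequent, else send at once."""
        if world_switch_rate < self.high_rate_threshold:
            self.flush()                # keep packet ordering
            self.transmit([packet])
            return
        self.queue.append(packet)
        if len(self.queue) >= MAX_QUEUED:
            self.flush()

    def flush(self):
        """Send everything queued; also meant to run on the next switch to the host."""
        if self.queue:
            self.transmit(list(self.queue))
            self.queue.clear()
```

One world switch then carries up to three packets instead of one, at the price of a short delay for the queued ones.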
Performance measurements after improvements
ESX Server
ESX Server
Native Virtual Machine
Targeting a new and more important market, commodity servers, VMware created a native virtualization system. With this solution it had to write its own drivers; luckily, in the server environment only a limited number of devices are certified to run.
Hypervisor vs. VMM
Hypervisor: responsible for multiplexing host system resources and providing policies, such as scheduling. It is built around a vmkernel, which resembles an operating system specially tailored to virtualization
VMM: responsible for presenting the virtual hardware layer to the VM. Its goal is to provide mechanism, and each VM requires a separate VMM
Ballooning
Motivation
The hypervisor is capable of overcommitting its memory, allowing more guests to run on a single host
The hypervisor must be able to reclaim memory to give it to higher-priority VMs
It could invalidate shadow page table entries and reuse the freed host physical pages (hPP); however, it does not have the information needed to choose which pages the guest can best do without
Balloon Device
A pseudo-device that is installed in the guest OS
It communicates with the VMM through a channel
When memory is needed, the balloon is inflated: the device requests pages from the guest OS, forcing its allocation algorithm to run, and these pages are pinned
The balloon informs the hypervisor which pages it managed to allocate. The hypervisor can then use these pages as free pages
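The inflate step can be modelled with a toy guest and driver. This is a sketch under assumptions: `GuestOS` is reduced to a free list (a real guest would run its own replacement policy to satisfy the request), and all names are hypothetical.

```python
class GuestOS:
    """Toy guest that only tracks its free guest physical pages."""

    def __init__(self, total_pages):
        self.free_pages = list(range(total_pages))

    def alloc_pinned_page(self):
        # Returns a page number, or None when the guest has nothing left to give.
        return self.free_pages.pop() if self.free_pages else None


class BalloonDriver:
    """Pseudo-device driver running inside the guest."""

    def __init__(self, guest):
        self.guest = guest
        self.pinned = []  # pages currently held by the balloon

    def inflate(self, n_pages):
        """Allocate up to n_pages in the guest and return the ones obtained.

        The returned list is what the driver would report to the hypervisor,
        which can then reuse the backing host pages.
        """
        got = []
        for _ in range(n_pages):
            page = self.guest.alloc_pinned_page()
            if page is None:
                break
            got.append(page)
        self.pinned.extend(got)
        return got
```

The key design point is that the guest, not the hypervisor, decides which pages to give up, so its own memory-management knowledge is reused.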
Hardware Assist
Intel VT-x
Intel VT-x
Vanderpool Technology
In 2005, Intel launched Vanderpool Technology, VT-x, to address the classical problem of non-virtualizable x86 instructions
With the new virtual machine extensions (VMX), the processor can be virtualized without resorting to dynamic translation or any guest code modification, and the classical trap-and-emulate architecture becomes possible
AMD also created a technology, Pacifica, which addresses the same problem and resembles VT-x.
Intel VT-x - General Architecture
The processor is multiplexed between two modes, VMX root and VMX non-root, and all the rings are available in both. As a result, the guest operating system can run in ring 0 of non-root mode.
Basically, the VMM configures under which circumstances execution must leave non-root mode (a VM exit) and which automated actions to perform when processing certain sensitive operations. When an exit condition is met, the processor stores meaningful information about the exit so that the VMM can decode it and take the appropriate action. This is exactly the trap-and-emulate architecture.
It is important to note that this first generation of hardware assist did not address the problem of memory management.
Intel VT-x - VMCS
The workflow is much simpler than in a sophisticated DBT. The VMM creates a structure called the VMCS (Virtual Machine Control Structure), which is responsible for:
storing guest state
configuring automated actions to be performed without leaving non-root mode (e.g. applying an offset to the TSC)
configuring the conditions for entering and exiting non-root mode
storing meaningful exit information. In the case of an I/O instruction, it holds the port number, the width and the direction of the access.
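How a VMM might consume the exit information for an I/O instruction can be modelled in plain Python. This is a simulation under assumptions, not the VMCS layout or the VMREAD field encodings: `IoExitInfo`, `ToyVmm` and `run_guest` are hypothetical names, and the list of exits stands in for the hardware enter/exit cycle.

```python
from dataclasses import dataclass


@dataclass
class IoExitInfo:
    port: int        # I/O port number
    width: int       # access width in bytes
    is_write: bool   # True for OUT, False for IN
    value: int = 0   # data written by the guest (OUT only)


class ToyVmm:
    def __init__(self):
        self.port_state = {}  # emulated device registers, keyed by port

    def handle_io_exit(self, info):
        """Emulate the I/O access described by the exit information."""
        mask = (1 << (8 * info.width)) - 1
        if info.is_write:
            self.port_state[info.port] = info.value & mask
            return None
        return self.port_state.get(info.port, 0) & mask


def run_guest(vmm, exits):
    """Stand-in for the enter/exit loop: each item models one VM exit."""
    return [vmm.handle_io_exit(info) for info in exits]
```

The handler needs no instruction decoding: everything required to emulate the access arrives pre-decoded in the exit record, which is what makes this workflow simpler than a DBT.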
A 4 KB region must be set aside for the VMCS, and it must be explicitly activated with a VMPTRLD instruction. Each logical processor must have its own VMCS.
Most of the VMCS layout is implementation dependent and is not part of the architecture. For this reason it must not be accessed with regular loads and stores, but with the VMREAD and VMWRITE instructions.
Figure: VMCS snippet
Intel VT-x - VM Entry and VM Exit
The first time the VMM runs a VM, it must use the VMLAUNCH instruction; afterwards it can simply use VMRESUME.
Performance Analysis
Figure: SPECint 2000 and SPECjbb 2005
Figure: Virtualization nanobenchmarks
Memory Management Virtualization
Intel VT - RVI and EPT
Second Generation
The second generation of hardware assist addressed the memory management problem, the major remaining source of virtualization overhead. Around 2007/08, Intel introduced Extended Page Tables (EPT) and AMD introduced Rapid Virtualization Indexing (RVI).
Figure: Background on 32-bit paging
Figure: Background on 64-bit paging
Figure: Nested paging
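Quadratic worst case: a page walk costs O(l1 · l2) memory references, where l1 and l2 are the depths of the guest and nested page tables
Needs CPU hardware assist: cannot be used with DBT (“a major VMware sorrow source”)
Benefits: Intel claims that EPT can make virtualization 20% faster on average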
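The quadratic worst case can be made concrete by counting memory references for a single TLB miss. The function below is a back-of-the-envelope model that assumes no page-walk caches and no TLB hits during the walk; its name is hypothetical.

```python
def nested_walk_references(guest_levels, nested_levels):
    """Worst-case memory references for one TLB miss under nested paging.

    Each guest page-table entry is located by a guest physical address, so
    reading it costs a full nested walk (nested_levels references) plus the
    read of the entry itself (1 reference). The final guest physical address
    of the data needs one more nested walk.
    """
    per_guest_level = nested_levels + 1
    return guest_levels * per_guest_level + nested_levels
```

With four-level tables on both sides this gives 4 * 5 + 4 = 24 references, against 4 for a native walk, which is why the TLB and page-walk caches matter so much under nested paging.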
Figure: Kernel Microbenchmarks
Figure: Apache compilation (MMU-intensive)
Figure: SPECjbb2005 (stress TLB)
That’s all!
Thanks!
Questions?