virtualization - pdos.csail.mit.edu · •want to maintain illusion that each vm has dedicated...

Virtualization Adam Belay <[email protected]>

Post on 18-Jul-2018


What is a virtual machine?

• Simulation of a computer
• Running as an application on a host computer
• Accurate
• Isolated
• Fast

Why use a virtual machine?

• To run multiple operating systems (e.g. Windows and Linux)
• To manage big machines (allocate cores and memory at O/S granularity)
• Kernel development (e.g. like QEMU + JOS)
• Better fault isolation (defense in depth)
• To package applications with a specific kernel version and environment
• To improve resource utilization

How accurate do we have to be?

• Must handle weird quirks in existing OSes
  • Even bug-for-bug compatibility
• Must maintain isolation with malicious software
  • Guest cannot break out of VM!
• Must be impossible for guest to distinguish VM from real machine
  • Some VMs compromise, modifying the guest kernel to reduce accuracy requirement

VMs are an old idea

• 1960s: IBM used VMs to share big machines
• 1970s: IBM specialized CPUs for virtualization
• 1990s: VMware repopularized VMs for x86 HW
• 2000s: AMD & Intel specialized CPUs for virtualization

Process Architecture

[Diagram: Hardware at the bottom, the OS above it, and the processes vi, gcc, and firefox on top.]

VM Architecture

• What if the process abstraction looked just like HW?

[Diagram: Hardware at the bottom, the OS (VMM) above it; on top, vi, gcc, firefox, and two guest OSes, each on its own virtual HW.]

Comparing a process and HW

Process
• Nonprivileged registers and instructions
• Virtual memory
• Signals
• File system and sockets

Hardware
• All registers and instructions
• Virt. mem. and MMU
• Traps and interrupts
• I/O devices and DMA

Can a CPU be virtualized?

Requirements to be “classically virtualizable” defined by Popek and Goldberg in 1974:
1. Fidelity: Software on the VMM executes identically to its execution on hardware, barring timing effects.
2. Performance: An overwhelming majority of guest instructions are executed by the hardware without the intervention of the VMM.
3. Safety: The VMM manages all hardware resources.

Why not simulation?

• VMM interprets each instruction (e.g. BOCHS)
• Maintain machine state for each register
• Emulate I/O ports and memory
• Violates performance requirement
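To make the cost of pure simulation concrete, here is a minimal sketch of an interpreter loop. The three-instruction ISA, the struct layouts, and the name interpret are all invented for illustration; this is not how BOCHS is written. The point is only that every guest instruction costs a fetch, a decode (the switch), and a memory update of the simulated register file.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical toy ISA, invented for illustration (not x86). */
enum opcode { OP_LOADI, OP_ADD, OP_HALT };

struct insn {
    enum opcode op;
    uint8_t dst;   /* destination register index */
    uint8_t src;   /* source register index (used by OP_ADD) */
    uint32_t imm;  /* immediate value (used by OP_LOADI) */
};

struct vcpu {
    uint32_t regs[4]; /* simulated register file, kept in host memory */
    size_t pc;        /* index of the next instruction to interpret */
};

/* Interpret until OP_HALT or the end of the program.
 * Returns the number of guest instructions executed. */
size_t interpret(struct vcpu *cpu, const struct insn *prog, size_t len)
{
    size_t executed = 0;
    while (cpu->pc < len) {
        const struct insn *i = &prog[cpu->pc++];
        executed++;
        switch (i->op) {
        case OP_LOADI:
            cpu->regs[i->dst & 3] = i->imm;
            break;
        case OP_ADD:
            cpu->regs[i->dst & 3] += cpu->regs[i->src & 3];
            break;
        case OP_HALT:
            return executed;
        }
    }
    return executed;
}
```

Each simulated add takes many host instructions, which is why this approach fails the performance requirement even though it is easy to make safe.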

Idea: Execute guest instructions on real CPU whenever possible

• Works fine for most instructions
  • E.g. add %eax, %ebx
• But privileged instructions could be harmful
  • Would violate safety property

Idea: Run guest kernels at CPL 3

• Ordinary instructions work fine
• Privileged instructions should trap to VMM (general protection fault)
• VMM can apply privileged operations on “virtual” state, not to real hardware
• This is called “trap-and-emulate”

Trap and emulate example

• CLI/STI – disable and enable interrupts
• EFLAGS IF bit tracks current status
• VMM maintains virtual copy of EFLAGS register
• VMM controls hardware EFLAGS
  • Probably leave interrupts enabled even if VM runs CLI
• VMM looks at virtual EFLAGS register to decide when to interrupt guest
• VMM must make sure guest only sees virtual EFLAGS
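The CLI/STI handling above can be sketched as follows. The struct vcpu fields and the function names are hypothetical, and the one-instruction interrupt shadow after STI is ignored for simplicity. What the sketch shows is that the trap handlers touch only the virtual EFLAGS, and that an interrupt arriving while the guest's virtual IF is clear is held until the guest runs STI.

```c
#include <stdbool.h>
#include <stdint.h>

#define EFLAGS_IF (1u << 9) /* interrupt-enable flag is bit 9 of EFLAGS */

/* Hypothetical per-guest state kept by the VMM. */
struct vcpu {
    uint32_t veflags;   /* virtual EFLAGS, the only copy the guest sees */
    bool irq_pending;   /* a virtual interrupt is waiting for delivery */
    unsigned delivered; /* number of interrupts injected into the guest */
};

/* Inject a pending interrupt only if the guest's virtual IF allows it. */
static void maybe_deliver(struct vcpu *v)
{
    if (v->irq_pending && (v->veflags & EFLAGS_IF)) {
        v->irq_pending = false;
        v->delivered++;
    }
}

/* Called from the VMM's fault handler when the guest executed CLI. */
void emulate_cli(struct vcpu *v)
{
    v->veflags &= ~EFLAGS_IF;
}

/* Called from the VMM's fault handler when the guest executed STI. */
void emulate_sti(struct vcpu *v)
{
    v->veflags |= EFLAGS_IF;
    maybe_deliver(v);
}

/* A device interrupt reached the VMM (hardware IF was left enabled). */
void host_interrupt(struct vcpu *v)
{
    v->irq_pending = true;
    maybe_deliver(v);
}
```

The hardware IF bit never appears here: the VMM keeps real interrupts enabled and uses the virtual bit only to decide when to forward them.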

What about virtual memory?

• Want to maintain illusion that each VM has dedicated physical memory
• Guest wants to start at PA 0, use all of RAM
• VMM needs to support many guests, they can’t all really use the same physical addresses
• Idea:
  • Claim RAM is smaller than real RAM
  • Keep paging enabled
  • Maintain a “shadow” copy of guest page table
  • Shadow maps VAs to different PA than guest requests
  • Real %CR3 points to shadow table
  • Virtual %CR3 points to guest page table

Virtualization memory diagram

[Diagram, built up over two slides:
 Host page table: host virtual address -> host physical address
 Guest PT: guest virtual address -> guest physical address
 VMM map: guest physical address -> host physical address
 Shadow page table: guest virtual address -> host physical address]

Example:

• Guest wants guest-physical page @ 0x1000000
• VMM map redirects guest-physical 0x1000000 to host-physical 0x2000000
• VMM traps if guest changes %cr3 or writes to guest page table
  • Transfers each guest PTE to shadow page table
  • Uses VMM map to translate guest-physical page addresses in page table to host-physical addresses
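The transfer step in this example can be sketched as below, assuming 32-bit PTEs with a 12-bit flag field and a VMM map stored as a small array. The names vmm_map_entry, shadow_pte, and sync_shadow are hypothetical, and a real shadow MMU also has to handle multi-level tables, accessed/dirty bits, and write-protecting the guest's page table pages.

```c
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT     12
#define PTE_PRESENT    0x1u
#define PTE_FLAGS_MASK 0xFFFu

/* Hypothetical VMM map entry: guest-physical page -> host-physical page. */
struct vmm_map_entry {
    uint32_t gppn;
    uint32_t hppn;
};

/* Returns 0 and stores the host page number, or -1 if gppn is unmapped. */
static int vmm_map_lookup(const struct vmm_map_entry *map, size_t n,
                          uint32_t gppn, uint32_t *hppn)
{
    for (size_t i = 0; i < n; i++) {
        if (map[i].gppn == gppn) {
            *hppn = map[i].hppn;
            return 0;
        }
    }
    return -1;
}

/* Build one shadow PTE: same flags as the guest PTE, host-physical frame.
 * Anything the VMM cannot back becomes not-present, so an access faults
 * into the VMM instead of reaching memory the guest does not own. */
uint32_t shadow_pte(const struct vmm_map_entry *map, size_t n,
                    uint32_t guest_pte)
{
    uint32_t hppn;
    if (!(guest_pte & PTE_PRESENT))
        return 0;
    if (vmm_map_lookup(map, n, guest_pte >> PAGE_SHIFT, &hppn) != 0)
        return 0;
    return (hppn << PAGE_SHIFT) | (guest_pte & PTE_FLAGS_MASK);
}

/* Re-derive the whole shadow table, e.g. after the guest reloads %cr3. */
void sync_shadow(const struct vmm_map_entry *map, size_t n,
                 const uint32_t *guest_pt, uint32_t *shadow_pt,
                 size_t entries)
{
    for (size_t i = 0; i < entries; i++)
        shadow_pt[i] = shadow_pte(map, n, guest_pt[i]);
}
```

With the slide's numbers, a guest PTE pointing at guest-physical 0x1000000 yields a shadow PTE pointing at host-physical 0x2000000 with the same flag bits.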

Why can’t the VMM modify the guest page table in-place?

Need shadow copy of all privileged state

• So far discussed EFLAGS and page tables
• Also need GDT, IDT, LDTR, %CR*, etc.

Unfortunately trap-and-emulate is not possible on x86

Two problems:
1. Some instructions behave differently in CPL 3 instead of trapping
2. Some registers leak state that reveals if the CPU is running in CPL 3
  • Violates fidelity property

x86 isn’t classically virtualizable

Problems -> CPL 3 versus CPL 0:
• mov %cs, %ax
  • %cs contains the CPL in its lower two bits
• popfl/pushfl
  • Privileged bits, including EFLAGS.IF, are masked out
• iretq
  • No ring change, so doesn’t restore SS/ESP
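Two of these problems are easy to model in a few lines. The sketch below is a simplified model, not a description of the full instruction semantics: it assumes IOPL is 0 and treats IF as the only privileged EFLAGS bit. The function names are hypothetical.

```c
#include <stdint.h>

#define EFLAGS_IF (1u << 9) /* interrupt-enable flag is bit 9 of EFLAGS */

/* The low two bits of the %cs selector hold the current privilege level,
 * so a deprivileged guest kernel can read %cs and see 3 instead of 0. */
unsigned cpl_from_cs(uint16_t cs)
{
    return cs & 3u;
}

/* Simplified model of popf. At CPL 0 the popped value replaces EFLAGS.
 * At CPL 3 (IOPL assumed 0) the IF bit silently keeps its old value:
 * no trap occurs, so the VMM never learns the guest tried to change it. */
uint32_t popf_model(uint32_t old_eflags, uint32_t popped, unsigned cpl)
{
    if (cpl == 0)
        return popped;
    return (popped & ~EFLAGS_IF) | (old_eflags & EFLAGS_IF);
}
```

For instance, a selector value such as 0x08 decodes to CPL 0 and 0x23 to CPL 3 (both values are only examples), and a guest kernel running at CPL 3 that pops a flags word with IF clear still ends up with IF set.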

Two possible solutions

1. Binary translation
  • Rewrite offending instructions to behave correctly
2. Hardware virtualization
  • CPU maintains shadow state internally and directly executes privileged guest instructions

Strawman binary translation

• Replace all instructions that cause violations with INT $3, which traps
• INT $3 is one byte, so can fit inside any x86 instruction without changing size/layout
• But unrealistic
  • Don’t know the difference between code and data or where instruction boundaries lie
• VMware’s solution is much more sophisticated
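A small sketch shows why the strawman breaks. It patches every byte equal to the CLI opcode (0xFA) with the one-byte INT $3 opcode (0xCC), without knowing where instructions start. The function name naive_patch is hypothetical, and CLI stands in for the whole set of offending instructions.

```c
#include <stddef.h>
#include <stdint.h>

#define OP_CLI  0xFA /* x86 opcode for CLI */
#define OP_INT3 0xCC /* x86 one-byte opcode for INT $3 */

/* Strawman patcher: overwrite every byte that looks like CLI with INT3.
 * Returns the number of bytes patched. It has no notion of instruction
 * boundaries, so it also rewrites operands and data that happen to
 * contain the same byte value. */
size_t naive_patch(uint8_t *code, size_t len)
{
    size_t patched = 0;
    for (size_t i = 0; i < len; i++) {
        if (code[i] == OP_CLI) {
            code[i] = OP_INT3;
            patched++;
        }
    }
    return patched;
}
```

Run on the bytes B0 FA FA (mov $0xfa, %al followed by cli), the patcher reports two patches: the real CLI, and the immediate operand of the mov, which now loads 0xCC instead of 0xFA.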

VMware’s binary translator

• Kernel translated dynamically like a JIT
  • Idea: scan only as executed, since execution reveals instruction boundaries
  • When VMM first loads guest kernel, rewrite from entry to first jump
  • Most instructions translate identically
• Need to translate instructions in chunks
  • Called a basic block
  • Either 12 instructions or the control flow instruction, whichever occurs first
• Only guest kernel code is translated
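The block-termination rule can be sketched as follows, assuming the guest instructions have already been decoded into a hypothetical struct insn that records only whether the instruction is a control-flow instruction. Real x86 decoding is far more involved.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_BLOCK_INSNS 12

/* Hypothetical pre-decoded guest instruction. */
struct insn {
    bool is_control_flow; /* jmp, jcc, call, ret, ... */
};

/* Length of the translation unit that starts at index 'start': it ends
 * after the first control-flow instruction or after 12 instructions,
 * whichever comes first (or at the end of the decoded stream). */
size_t basic_block_len(const struct insn *code, size_t n, size_t start)
{
    size_t len = 0;
    while (start + len < n && len < MAX_BLOCK_INSNS) {
        len++;
        if (code[start + len - 1].is_control_flow)
            break;
    }
    return len;
}
```

A sequence of three ordinary instructions followed by a conditional jump gives a block of four, while a long straight-line run is cut into blocks of twelve.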

Guest kernel shares address space with VMM

• Uses segmentation to protect VMM memory
• VMM loaded at high virtual addresses, translated guest kernel at low addresses
• Program segment limits to “truncate” address space, preventing all segments from accessing VMM except %GS
  • What if guest kernel instruction uses %GS selector?
  • %GS provides fast access to data shared between guest kernel and VMM
• Assumption: Translated code can’t violate isolation
  • Can never directly access %GS, %CR3, GDT, etc.

Why put guest kernel and VMM in same address space?

• Shared state becomes inexpensive to access, e.g. cli -> “vcpu.flags.IF := 0”
• Translated code is safe, can’t violate isolation after translation

Translation example

• All control flow requires indirection

C source:

    int isPrime(int a) {
        for (int i = 2; i < a; i++) {
            if (a % i == 0) return 0;
        }
        return 1;
    }

Original: isPrime()

    mov %ecx, %edi    # %ecx = %edi (a)
    mov %esi, $2      # %esi = 2
    cmp %esi, %ecx    # is i >= a?
    jge prime         # if yes jump (end of basic block)
    …

Translated: isPrime()’

    mov %ecx, %edi    # IDENT
    mov %esi, $2
    cmp %esi, %ecx
    jge [takenAddr]   # JCC
    jmp [fallthrAddr]

Translation example

• Brackets represent continuations
  • First time they are executed, jump into BT and generate the next basic block
  • Can elide “jmp [fallthrAddr]” if it’s the next address translated
• Indirect control flow is harder
  • “(jmp, call, ret) does not go to a fixed target, preventing translation-time binding. Instead, the translated target must be computed dynamically, e.g., with a hash table lookup. The resulting overhead varies by workload but is typically a single-digit percentage.” – from paper
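The hash table lookup mentioned in the quote could look roughly like this. The table layout, its size, the hash function, and the names tc_insert and tc_lookup are assumptions made for illustration; the paper does not specify them here. On a miss the caller would invoke the translator for the missing target and insert the result.

```c
#include <stddef.h>
#include <stdint.h>

#define TC_SLOTS 64 /* toy size; must stay 64 to match the hash below */

/* Hypothetical mapping from a guest code address to its translation. */
struct tc_entry {
    uint32_t guest_pc;
    uint32_t translated_pc;
    int used;
};

struct tcache {
    struct tc_entry slot[TC_SLOTS];
};

/* Multiplicative hash; the top 6 bits select one of the 64 slots. */
static size_t tc_hash(uint32_t pc)
{
    return (size_t)((pc * 2654435761u) >> 26);
}

/* Record guest_pc -> translated_pc. Returns 0, or -1 if the table is full. */
int tc_insert(struct tcache *tc, uint32_t guest_pc, uint32_t translated_pc)
{
    size_t h = tc_hash(guest_pc);
    for (size_t i = 0; i < TC_SLOTS; i++) {
        struct tc_entry *e = &tc->slot[(h + i) % TC_SLOTS];
        if (!e->used || e->guest_pc == guest_pc) {
            e->guest_pc = guest_pc;
            e->translated_pc = translated_pc;
            e->used = 1;
            return 0;
        }
    }
    return -1;
}

/* Resolve the target of an indirect jmp/call/ret at run time.
 * Returns 1 and stores the translated address on a hit, 0 on a miss. */
int tc_lookup(const struct tcache *tc, uint32_t guest_pc,
              uint32_t *translated_pc)
{
    size_t h = tc_hash(guest_pc);
    for (size_t i = 0; i < TC_SLOTS; i++) {
        const struct tc_entry *e = &tc->slot[(h + i) % TC_SLOTS];
        if (!e->used)
            return 0;
        if (e->guest_pc == guest_pc) {
            *translated_pc = e->translated_pc;
            return 1;
        }
    }
    return 0;
}
```

Direct jumps avoid this cost because their targets are bound once, when the continuation is first executed; only indirect transfers pay for a lookup on every execution.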

Hardware virtualization

• CPU maintains guest-copy of privileged state in special region called the virtual machine control structure (VMCS)
• CPU operates in two modes
  • VMX non-root mode: runs guest kernel
  • VMX root mode: runs VMM
  • Hardware saves and restores privileged register state to and from the VMCS as it switches modes
  • Each mode has its own separate privilege rings
• Net effect: Hardware can run most privileged guest instructions directly without emulation

What about MMU?

• Hardware effectively maintains two page tables
• Normal page table controlled by guest kernel
• Extended page table (EPT) controlled by VMM
• EPT didn’t exist when VMware published paper

[Diagram: guest virtual address -> guest PT -> guest physical address -> EPT -> host physical address]
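The two-stage translation in the diagram can be sketched with flat, single-level tables. Real page tables and EPTs are multi-level trees walked by hardware; the struct tables layout, the toy size, and the function names here are hypothetical.

```c
#include <stdint.h>

#define PAGE_SHIFT 12
#define PAGE_MASK  0xFFFu
#define NPAGES     16         /* toy address-space size, in pages */
#define UNMAPPED   UINT32_MAX /* marker for "no translation" */

/* Hypothetical single-level tables indexed by page number. */
struct tables {
    uint32_t guest_pt[NPAGES]; /* guest-virtual -> guest-physical (guest kernel) */
    uint32_t ept[NPAGES];      /* guest-physical -> host-physical (VMM) */
};

/* Start with every page unmapped in both tables. */
void tables_init(struct tables *t)
{
    for (int i = 0; i < NPAGES; i++) {
        t->guest_pt[i] = UNMAPPED;
        t->ept[i] = UNMAPPED;
    }
}

/* Translate a guest-virtual address to a host-physical address.
 * Returns UNMAPPED if either stage has no mapping. */
uint32_t translate(const struct tables *t, uint32_t gva)
{
    uint32_t gvpn = gva >> PAGE_SHIFT;
    if (gvpn >= NPAGES || t->guest_pt[gvpn] == UNMAPPED)
        return UNMAPPED; /* stage 1 miss: page fault for the guest kernel */
    uint32_t gppn = t->guest_pt[gvpn];
    if (gppn >= NPAGES || t->ept[gppn] == UNMAPPED)
        return UNMAPPED; /* stage 2 miss: handled by the VMM */
    return (t->ept[gppn] << PAGE_SHIFT) | (gva & PAGE_MASK);
}
```

The guest kernel can edit guest_pt freely without trapping, because the VMM's isolation guarantee lives entirely in ept.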

What’s better: HW or SW virt?

• Software virtualization advantages
  • Trap elimination: Most traps can be replaced with callouts
  • Emulation speed: BT can generate purpose-built emulation code, hardware traps must decode the instruction, etc.
  • Callout avoidance: Sometimes BT can even inline callouts
• Hardware virtualization advantages
  • Code density: Translated code requires more instructions and larger opcodes
  • Precise exceptions: BT must perform extra work to recover guest state
  • System calls: Don’t require VMM intervention

What’s better: HW or SW virt?

[Figure 4. Virtualization nanobenchmarks. Bar chart, logarithmic y-axis from 0.1 to 100000: CPU cycles (smaller is better). Operations: syscall, in, cr8wr, call/ret, pgfault, divzero, ptemod. Series: Native, Software VMM, Hardware VMM.]

[Figure 5. Sources of virtualization overhead in an XP boot/halt. Bar chart, y-axis from 0 to 10: Overhead (seconds). Operations: syscall, in/out, cr8wr, call/ret, pgfault, ptemod, translate. Series: Software VMM, Hardware VMM.]

… between the two VMMs, the hardware VMM inducing approximately 4.4 times greater overhead than the software VMM. Still, this program stresses many divergent paths through both VMMs, such as system calls, context switching, creation of address spaces, modification of traced page table entries, and injection of page faults.

6.3 Virtualization nanobenchmarks

To better understand the performance differences between the two VMMs, we wrote a series of “nanobenchmarks” that each exercise a single virtualization-sensitive operation. Often, the measured operation is a single instruction long. For precise control over the executed code, we repurposed a custom OS, FrobOS, that VMware developed for VMM testing.

Our modified FrobOS boots, establishes a minimal runtime environment for C code, calibrates its measurement loops, and then executes a series of virtualization-sensitive operations. The test repeats each operation many times, amortizing the cost of the binary translator’s adaptations over multiple iterations. In our experience, this is representative of guest behavior, in which adaptation converges on a small fraction of poorly behaving guest instructions. The results of these nanobenchmarks are presented in Figure 4. The large spread of cycle counts requires the use of a logarithmic scale.

syscall. This test measures round-trip transitions from user-level to supervisor-level via the syscall and sysret instructions. The software VMM introduces a layer of code and an extra privilege transition, requiring approximately 2000 more cycles than a native system call. In the hardware VMM, system calls execute without VMM intervention, so as we expect, the hardware VMM executes system calls at native speed.

in. We execute an in instruction from port 0x80, the BIOS POST port. Native execution accesses an off-CPU register in the chipset, requiring 3209 cycles. The software VMM, on the other hand, translates the in into a short sequence of instructions that interacts with the virtual chipset model. Thus, the software VMM executes this instruction fifteen times faster than native. The hardware VMM must perform a vmm/guest round trip to complete the I/O operation. This transition causes in to consume 15826 cycles in the tested system.

cr8wr. %cr8 is a privileged register that determines which pending interrupts can be delivered. Only %cr8 writes that reduce %cr8 below the priority of the highest pending virtual interrupt cause an exit [24]. Our FrobOS test never takes interrupts so no %cr8 write in the test ever causes an exit. As with syscall, the hardware VMM’s performance is similar to native. The software VMM translates %cr8 writes into a short sequence of simple instructions, completing the %cr8 write in 35 cycles, about four times faster than native.

call/ret. BT slows down indirect control flow. We target this overhead by repeatedly calling a subroutine. Since the hardware VMM executes calls and returns without modification, the hardware VMM and native both execute the call/return pair in 11 cycles. The software VMM introduces an average penalty of 40 cycles, requiring 51 cycles.

pgfault. In both VMMs, the software MMU interposes on both true and hidden page faults. This test targets the overheads for true page faults. While both VMM paths are logically similar, the software VMM (3927 cycles) performs much better than the hardware VMM (11242 cycles). This is due mostly to the shorter path whereby the software VMM receives control; page faults, while by no means cheap natively (1093 cycles on this hardware), are faster than a vmrun/exit round-trip.

divzero. Division by zero has fault semantics similar to those of page faults, but does not invoke the software MMU. While division by zero is uncommon in guest workloads, we include this nanobenchmark to clarify the pgfault results. It allows us to separate out the virtualization overheads caused by faults from the overheads introduced by the virtual MMU. As expected, the hardware VMM (1014 cycles) delivers near native performance (889 cycles), decisively beating the software VMM (3223 cycles).

ptemod. Both VMMs use the shadowing technique described in Section 2.4 to implement guest paging with trace-based coherency. The traces induce significant overheads for PTE writes, causing very high penalties relative to the native single cycle store. The software VMM adaptively discovers the PTE write and translates it into a small program that is cheaper than a trap but still quite costly. This small program consumes 391 cycles on each iteration. The hardware VMM enters and exits guest mode repeatedly, causing it to perform approximately thirty times worse than the software VMM, requiring 12733 cycles.

To place this data in context, Figure 5 shows the total overheads incurred by each nano-operation during a 64-bit Windows XP Professional boot/halt. Although the pgfault nanobenchmark has much higher cost on the hardware VMM than the software VMM, the boot/halt workload took so few true page faults that the difference does not affect the bottom line materially. In contrast, the guest performed over 1 million PTE modifications, causing high overheads for the hardware VMM. While the figure may suggest that in/out dominates the execution profile of the hardware VMM, the vast majority of these instructions originate in atypical BIOS code that is unused after initial boot.

What’s better: shadow page table or EPT?

• EPT is faster when page table contents change frequently
  • Fewer traps
• Shadow page table is faster when page table is stable
  • Less TLB miss overhead
  • One page table to walk through instead of two
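The TLB miss cost can be estimated by counting memory references per walk. This is a back-of-the-envelope model that ignores paging-structure caches and other hardware optimizations; the function names are hypothetical. With shadow paging the hardware walks one table. With EPT, every guest page table pointer is a guest-physical address that itself needs an EPT walk, and so does the final guest-physical address.

```c
/* Memory references for one TLB miss with a shadow page table:
 * one reference per level of the single table the hardware walks. */
unsigned shadow_walk_refs(unsigned levels)
{
    return levels;
}

/* Memory references for one TLB miss with nested paging.
 * Each of the g guest levels costs an EPT walk (e references) plus the
 * read of the guest entry itself, and the final guest-physical address
 * needs one more EPT walk: g * (e + 1) + e = (g + 1) * (e + 1) - 1. */
unsigned nested_walk_refs(unsigned guest_levels, unsigned ept_levels)
{
    return (guest_levels + 1) * (ept_levels + 1) - 1;
}
```

For four-level guest tables and a four-level EPT this model gives 24 references per miss, against 4 for a four-level shadow table.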

Conclusion

• Virtualization transformed cloud computing, had a tremendous impact
  • Virtualization on PCs was also big, but less significant
• VMware made virtualization possible on an architecture that couldn’t be virtualized (x86) through BT
• Prompted Intel and AMD to change hardware, sometimes faster, sometimes slower than BT

A decade later, what’s changed?

• HW virtualization became much faster
  • Fewer traps, better microcode, more dedicated logic
  • Almost all CPU architectures support HW virt.
  • EPT widely available
• VMMs became commoditized
  • BT technology was hard to build
  • VMMs based on HW virt. are much easier to implement
  • Xen, KVM, Hyper-V, etc.
• I/O devices aren’t just emulated, they can be exposed directly
  • IOMMU provides paging protection for DMA