
Colin Perkins | https://csperkins.org/ | Copyright © 2018 | This work is licensed under the Creative Commons Attribution-NoDerivatives 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nd/4.0/ or send a letter to Creative Commons, PO Box 1866, Mountain View, CA 94042, USA.

Virtualisation

Advanced Operating Systems (M) Lecture 8


Lecture Outline

• What is virtualisation?

• Full system virtualisation

• Hypervisors – type 1 and type 2

• Virtualising CPUs, memory, device drivers

• Live migration

• Example: Xen


Virtualisation Concepts

• Enable resource sharing by decoupling execution environments from physical hardware

• Full system virtualisation – run multiple operating systems on a single physical host; virtual machines (VMs)

• Desirable for data centres and cloud hosting environments – computing as a service

• Benefits for flexible management and security; use isolation to protect services

• Containers – virtualise the resources of interest

• Lightweight, but imperfect, virtualisation; constrains the ability of a group of processes to access system resources

• Enhances security within an operating system


Full System Virtualisation

• Allow more than one operating system to run on the same physical hardware

• First implemented on the IBM System/360 mainframe in the mid-1960s

• Popular current implementations: Xen, VMware, QEMU, VirtualBox

• Introduces the hypervisor abstraction: the hypervisor is a process that manages the virtualisation and emulates the hardware:

• Processors

• Memory

• Device drivers

• The guest operating systems run as processes on the hypervisor

• The hypervisor API presents a virtual machine abstraction – each operating system thinks it’s running on real hardware

(Image: IBM System/360 – source: Wikipedia)

Hardware Virtualisation: CPU

• Processors distinguish privileged and unprivileged operations – known as protection rings or privilege levels

• Modern ARM CPUs have “application”, “operating system”, and “hypervisor” privilege levels

• x86 CPUs have four numbered protection rings

• Instructions that control I/O, interrupt handlers, virtual memory, memory protection, etc., tend to be privileged

• Attempts to execute privileged instructions from unprivileged code trap and invoke a handler at the next higher privilege level; the handler can check permissions, and either allow the operation, terminate the lower-privilege process, or otherwise arbitrate access (see the trap-and-emulate sketch below)

• Full virtualisation requires either:

• A hypervisor running at higher privilege than the guest operating systems – allowing it to arbitrate access between those guests, without support from the guests

• Or, rewriting operating system code to be aware of virtualisation, and to call into the hypervisor to perform privileged operations – known as paravirtualisation, where the operating systems agree to cooperate with the hypervisor
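The trap-and-emulate flow described above can be sketched as follows. This is an illustrative sketch only, not code from any real hypervisor: the trap-frame layout, the decoded operation set, and every helper (decode_instruction, emulate_set_page_table, inject_fault, and so on) are hypothetical placeholders.

```c
/* Illustrative sketch of a hypervisor's trap-and-emulate handler for
 * privileged instructions executed by an unprivileged guest.  Every type
 * and helper here is hypothetical; a real hypervisor decodes and emulates
 * far more cases than this. */

struct vm;                              /* opaque per-guest state */

enum privop { OP_HLT, OP_MOV_TO_CR3, OP_IO_OUT, OP_UNKNOWN };

struct trap_frame {
    unsigned long ip;                   /* guest instruction pointer at the trap */
    unsigned long operand;              /* decoded operand of the instruction    */
};

extern enum privop decode_instruction(const struct trap_frame *tf);
extern unsigned long instruction_length(const struct trap_frame *tf);
extern void emulate_set_page_table(struct vm *guest, unsigned long base);
extern void emulate_io_write(struct vm *guest, unsigned long operand);
extern void yield_guest(struct vm *guest);
extern void inject_fault(struct vm *guest);

/* Invoked when the CPU traps a privileged instruction issued in guest mode. */
void handle_privileged_trap(struct vm *guest, struct trap_frame *tf)
{
    switch (decode_instruction(tf)) {
    case OP_HLT:
        yield_guest(guest);                         /* guest is idle: schedule another VM     */
        break;
    case OP_MOV_TO_CR3:
        emulate_set_page_table(guest, tf->operand); /* validate, then apply on guest's behalf */
        break;
    case OP_IO_OUT:
        emulate_io_write(guest, tf->operand);       /* arbitrate access to the real device    */
        break;
    default:
        inject_fault(guest);                        /* reflect the fault back to the guest    */
        return;
    }
    tf->ip += instruction_length(tf);               /* skip the emulated instruction          */
}
```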


Hardware Virtualisation: Hypervisor Mode

• Unmodified guest operating systems running on a hypervisor

• The hypervisor is aware of the guest operating systems it’s running

• The guest operating systems don’t know that they’re being virtualised

• Normally privileged operations trap to the hypervisor:

• Cache control

• Page tables and virtual memory

• Interrupt handlers

• …

• Performance is reduced – the cost depends on the degree of hardware support

(Figure: without virtualisation, applications run on an operating system that runs directly on the hardware; with a hypervisor, several operating systems, each with its own applications, run side by side on the hypervisor, which runs on the hardware.)

Hardware Virtualisation: Paravirtualisation

• Paravirtualisation – provide a virtual machine that is similar, but not identical, to the underlying hardware

• Guest operating systems are aware of virtualisation

• Guests are “ported” to run on the virtualised environment – modified to never execute privileged instructions

• The hypervisor provides an API:

• Cache control

• Page tables and virtual memory

• Interrupt handlers

• …

• Guest operating systems call that API – much as a user process calls into an operating system kernel (see the sketch below)

• Relies on cooperation between the hypervisor and the guest operating systems

• Needed if the hardware doesn’t provide a hypervisor privilege level – but requires trust in the guests (trust in the guest kernels, not in the applications that run on them)
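As an illustration of how a guest calls that API, a paravirtualised kernel replaces a direct privileged operation with a call into the hypervisor. The sketch below is hypothetical – hypercall(), HYPERCALL_SET_PT_BASE, and the call number are placeholders, not any particular hypervisor’s interface – but it shows the shape of the change a port involves.

```c
/* Hypothetical paravirtualisation example: updating the page-table base.
 * hypercall() and HYPERCALL_SET_PT_BASE are placeholder names, not a real API. */

extern long hypercall(int number, unsigned long arg);   /* traps into the hypervisor */
#define HYPERCALL_SET_PT_BASE  1                         /* placeholder call number   */

/* Native kernel: executes a privileged instruction directly (x86 sketch). */
static inline void native_set_page_table(unsigned long base)
{
    __asm__ volatile("mov %0, %%cr3" : : "r"(base) : "memory");
}

/* Paravirtualised kernel: asks the hypervisor to validate and apply the
 * update on its behalf, since the guest no longer runs at full privilege. */
static inline void pv_set_page_table(unsigned long base)
{
    hypercall(HYPERCALL_SET_PT_BASE, base);
}
```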


Hardware Virtualisation: Memory

• Operating systems provide virtual addressing

• Each process believes it has access to the entire address space

• The kernel configures the hardware address translation tables – mapping virtual to physical addresses, and isolating the different processes

• A hypervisor requires an additional stage of translation

• Guest operating systems believe they own the entire address space

• The hypervisor maps this onto physical addresses, isolating the guest operating systems – needs hardware support

• “Page table virtualisation”

• Must support all devices that access memory:

• Direct memory access by storage/networking devices

• Access to memory-mapped hardware registers

(Figure: two-stage address translation – each guest operating system maps its processes’ virtual addresses onto the guest’s “physical” addresses, and the hypervisor maps those onto the real physical addresses.)
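A software view of that two-stage walk is sketched below. It is a simplification under stated assumptions: each stage is flattened to a single-level lookup, and the structure and field names (guest_page_table, phys_to_machine) are hypothetical; real hardware walks multi-level page tables at both stages.

```c
/* Simplified sketch of two-stage address translation.  Each stage is
 * flattened to one table lookup; all structures and names are hypothetical. */
#include <stdint.h>

#define PAGE_SHIFT 12
#define PAGE_MASK  ((1ul << PAGE_SHIFT) - 1)

struct guest_mmu {
    /* Stage 1: guest virtual page number -> guest "physical" page number,
     * maintained by the guest operating system. */
    uint64_t *guest_page_table;
};

struct hypervisor_mmu {
    /* Stage 2: guest "physical" page number -> machine page number,
     * maintained by the hypervisor (often called a p2m mapping). */
    uint64_t *phys_to_machine;
};

uint64_t translate(const struct guest_mmu *g, const struct hypervisor_mmu *h,
                   uint64_t guest_virtual)
{
    uint64_t gvpn = guest_virtual >> PAGE_SHIFT;       /* guest virtual page number */
    uint64_t gppn = g->guest_page_table[gvpn];         /* stage 1: guest's tables   */
    uint64_t mpn  = h->phys_to_machine[gppn];          /* stage 2: hypervisor's map */
    return (mpn << PAGE_SHIFT) | (guest_virtual & PAGE_MASK);
}
```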


Hardware Virtualisation: Device Drivers

• Similar to two-stage address translation, interrupts and other hardware accesses are virtualised

• Hardware support from the CPU, PCI bus controllers, BIOS, etc.

• Example: PCI single root I/O virtualisation (SR-IOV)

• Software support in the hypervisor – arbitrate access to the real hardware (e.g., only one guest operating system can actually read the keyboard at once)

• Some devices can be shared between guest operating systems

• Storage devices might appear as an entire device, but give access to only a single partition – by translating block addresses (see the sketch below)

• Network devices can have multiple MAC addresses that are used to deliver packets to particular guests
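The block-address translation mentioned above amounts to an offset plus a bounds check. The sketch below is hypothetical – the structure, field names, and physical_disk_read() are placeholders – but it captures the mapping a hypervisor’s virtual block device performs to confine a guest to one partition.

```c
/* Hypothetical virtual block device backed by one partition of a real disk:
 * the guest sees block numbers starting at zero; the hypervisor adds the
 * partition's offset and checks the bound before issuing the real I/O. */
#include <stdint.h>

struct virtual_disk {
    uint64_t partition_start;   /* first physical block of the partition */
    uint64_t partition_blocks;  /* number of blocks the guest may use    */
};

extern int physical_disk_read(uint64_t physical_block, void *buf);

int virtual_disk_read(const struct virtual_disk *vd, uint64_t guest_block, void *buf)
{
    if (guest_block >= vd->partition_blocks)
        return -1;                                   /* out of range: reject, preserve isolation */
    return physical_disk_read(vd->partition_start + guest_block, buf);
}
```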

(Excerpt shown on the slide, from the source below: I/O virtualisation decouples a VM’s logical devices from the physical devices, enabling live migration and transparent remapping of storage, device aggregation, interposition features such as replication, snapshots, and encryption, and security functions such as firewalls and intrusion detection – at the cost of extra indirection, since each request traverses two I/O stacks. Its Figure 1 shows an I/O request processed first by the guest operating system’s I/O stack inside the VM, then intercepted by the hypervisor, which performs virtual-to-physical translation, scheduling, and any interposition (e.g. logging or encryption) before issuing it to the physical device through its own I/O stack; completions traverse the two stacks in reverse.)

Source: C. Waldspurger and M. Rosenblum, “I/O Virtualisation”, CACM, Vol 55, Num 1, pp 66-73, Jan. 2012. DOI: 10.1145/2063176.2063194


Types of Hypervisor

• Type 1 (“native”) hypervisor:

• The hypervisor is the operating system for the underlying hardware

• Requires its own device drivers, and has to be ported to new hardware platforms

• Example: Xen

• Type 2 (“hosted”) hypervisor:

• The hypervisor is an application running on an existing operating system

• Examples: VMware, VirtualBox


Type 1 Hypervisors

• The hypervisor is an operating system:

• Controls/arbitrates access to the hardware

• Schedules the execution of guest operating systems

• Needs device drivers for the underlying hardware

• Some devices are driven by the hypervisor – the hypervisor executes operations on behalf of the guests

• For other devices, resources are configured to give an individual guest direct access

• Exposes a control API

• Allows one of the guests to act as a controller for the hypervisor

• Example: “domain 0” in Xen

• Extremely efficient – the basis for most cloud hosting platforms


Type 2 Hypervisors

• The hypervisor is a process that runs within an existing operating system

• Doesn’t have privileged access to the underlying hardware, CPU, etc.

• Cannot run unmodified guest operating systems – since it doesn’t have the privilege to virtualise their operation

• Requires paravirtualisation

• Guest operating systems must be modified to call the hypervisor for privileged operations

• The hypervisor itself depends on the access given to it by the underlying operating system

• Often requires software emulation of the hardware – e.g., interrupt handlers, I/O, page tables – which can be very slow

• Good for development – not useful for high-performance use or as a cloud hosting platform


Management of Virtual Machines

• Type 1 hypervisors need a management interface. Features:

• Configuring, starting, stopping, and migrating VMs

• Managing the underlying hardware

• Managing network configuration

• Hyper-call API – works like the system call API (e.g., INT 0x82 vs. INT 0x80) – that can be called by management software (see the sketch below)

• Designed for large-scale, automated administration – virtual machines as a service

• Type 2 hypervisors are configured like any other application on the host operating system

• Designed for small-scale personal use
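The INT 0x82 vs. INT 0x80 comparison can be made concrete with a pair of 32-bit x86 stubs. Treat this as a sketch under assumptions: classic 32-bit Linux used int 0x80 for system calls and early 32-bit Xen used a software-interrupt convention for hypercalls, but the exact vector, register usage, and call numbers below are illustrative rather than a definitive ABI (modern Xen guests use a hypercall page instead).

```c
/* Illustrative 32-bit x86 stubs: a system call trap into the kernel versus a
 * hypercall trap into the hypervisor.  Register conventions are sketched,
 * not guaranteed to match any particular kernel or hypervisor version. */

static inline long syscall1(long number, long arg1)
{
    long ret;
    __asm__ volatile("int $0x80"              /* trap to the operating system */
                     : "=a"(ret)
                     : "a"(number), "b"(arg1)
                     : "memory");
    return ret;
}

static inline long hypercall1(long number, long arg1)
{
    long ret;
    __asm__ volatile("int $0x82"              /* trap to the hypervisor */
                     : "=a"(ret)
                     : "a"(number), "b"(arg1)
                     : "memory");
    return ret;
}
```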


Management of Virtual Machines

• Hypervisors allow creation of virtual machines: hypervisor + guest operating system + local state → virtual machine

• Allows on-demand instantiation of servers within a virtualised system

• Can be (largely) independent of the underlying hardware

• Configure the VM with generic device drivers, a reasonable amount of memory, etc. – a subset of the real hardware available in the hosting environment

• The VM can be instantiated on any physical system that meets its requirements

• Many VMs can be instantiated on a single machine – performance constraints?

• The user of the VM is typically unaware of which physical hardware is used; the physical hardware can change over time

• Virtual machines can be migrated between physical servers when stopped, but also while in use if care is taken

• Enables cloud computing platforms and computing as a service


Live Migration

• Possible to migrate a running guest operating system to a new system – move the VM between physical servers, while running, without users noticing

• Live migration procedure (see the sketch below):

• Activate a new VM, with matching hardware resources

• Copy the contents of memory and storage – keeping track of writes made after the copy

• Stop the VM, then copy any outstanding memory, storage, and other state

• Reassign network addresses to the new host

• Restart the VM on the new host

• Performance heavily depends on the amount of active I/O on the VM and the speed of the network

• Sub-second downtime, with a performance impact for several seconds beforehand, is possible if care is taken
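A minimal sketch of the iterative pre-copy loop at the heart of this procedure is given below. All helper functions, the single stop threshold, and the structures are hypothetical simplifications of the algorithm in the excerpt that follows; a real implementation also bounds the number of rounds, handles failure transactionally, and deals with storage.

```c
/* Hypothetical sketch of iterative pre-copy live migration.  The helpers
 * stand in for hypervisor facilities such as shadow-paging-based dirty
 * tracking and a transport to the destination host. */
#include <stddef.h>

struct vm;
struct page_set;

extern void   enable_dirty_tracking(struct vm *vm);
extern struct page_set *all_pages(struct vm *vm);
extern struct page_set *collect_dirty_pages(struct vm *vm);   /* also resets tracking */
extern size_t page_count(const struct page_set *set);
extern void   send_pages(const struct page_set *set, const char *dest);
extern void   pause_vm(struct vm *vm);
extern void   send_cpu_and_device_state(struct vm *vm, const char *dest);
extern void   redirect_network(struct vm *vm, const char *dest);  /* e.g. unsolicited ARP */
extern void   resume_on_destination(const char *dest);

void live_migrate(struct vm *vm, const char *dest, size_t stop_threshold)
{
    enable_dirty_tracking(vm);
    send_pages(all_pages(vm), dest);              /* round 1: copy everything        */

    struct page_set *dirty = collect_dirty_pages(vm);
    while (page_count(dirty) > stop_threshold) {  /* rounds 2..n: copy only the delta */
        send_pages(dirty, dest);
        dirty = collect_dirty_pages(vm);
    }

    pause_vm(vm);                                 /* short stop-and-copy phase        */
    send_pages(collect_dirty_pages(vm), dest);    /* remaining dirty pages            */
    send_cpu_and_device_state(vm, dest);
    redirect_network(vm, dest);
    resume_on_destination(dest);                  /* downtime ends here               */
}
```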

(Excerpt shown on the slide, from the source below: in a cluster, a migrating VM keeps its IP address and protocol state, and an unsolicited ARP reply from the migrated host redirects peers to its new location; network-attached storage means local disks need not be migrated. Figure 1 of the paper gives the migration timeline: Stage 0, pre-migration on host A; Stage 1, reservation of a VM container on host B; Stage 2, iterative pre-copy, with shadow paging enabled and dirty pages copied in successive rounds; Stage 3, stop-and-copy, suspending the VM, redirecting traffic, and synchronising the remaining state; Stage 4, commitment, releasing the VM state on host A; Stage 5, activation on host B, reattaching device drivers and advertising the moved IP addresses.)

Source: Clark et al., “Live Migration of Virtual Machines”, Proc. USENIX NSDI, 2005.


Example: Xen

• A modern type 1 hypervisor for x86

• Supports two types of virtualisation:

• Full virtualisation – needs hardware support

• Older x86 CPUs had no way of trapping supervisor-mode instructions to allow efficient emulation by a hypervisor; modern x86 CPUs do provide this

• Paravirtualisation

• The guest operating system is ported to a new system architecture, which looks like x86 with the problematic features adapted to ease virtualisation

• The operating system knows it is running in a VM

• Processes running within the OS cannot tell that virtualisation is in use

• Control operations are delegated to a privileged guest operating system

• Known as Domain0

• Provides policy and controls the hypervisor

(Excerpt shown on the slide: title page of P. Barham et al., “Xen and the Art of Virtualization”, Proc. ACM SOSP, October 2003 – abstract and introduction, presenting Xen as an x86 virtual machine monitor that hosts commodity operating systems, ported with minimal effort, at a performance overhead of at most a few percent, and targeting up to 100 virtual machine instances per server.)


Xen Paravirtualisation

• Guest operating systems are ported to Xen

• The Xen hypervisor provides an API to mediate hardware access by guest operating systems:

• Privileged CPU instructions

• Memory management

• Interrupt handling

• Access to network, disk, etc., hardware

• The required changes to guest systems are small

• Porting guests to an API that allows efficient virtualisation gives significant performance benefits

• Simplifies hypervisor design for x86, where the processor architecture makes full virtualisation expensive

• The user-space API of guest operating systems is unchanged

(Excerpt shown on the slide, from Barham et al., “Xen and the Art of Virtualization”, SOSP 2003, sections 2.1–2.2. Its Table 1 summarises the paravirtualised x86 interface: segmentation – guests cannot install fully-privileged segment descriptors or overlap the Xen-reserved top of the linear address space; paging – guests have direct read access to the hardware page tables, but updates are batched and validated by the hypervisor; CPU protection – the guest OS runs at a lower privilege level than Xen; exceptions – guests register a descriptor table of handlers with Xen; system calls – guests may install a ‘fast’ handler called directly from applications, avoiding indirection through Xen; interrupts – replaced by a lightweight event system; time – guests see both ‘real’ and ‘virtual’ time; device I/O – virtual devices are accessed via asynchronous I/O rings, with events replacing interrupts for notification. Its Table 2 gives the porting cost: about 3000 lines modified or added for Linux (roughly 1.36% of the x86 code base) and about 4600 for the partial Windows XP port (roughly 0.04%).)


Xen

• Xen provides mechanisms, but does not define policy

• Virtualisation primitives are well understood and stable

• Control mechanisms and policies evolve to match business models

• Separation of mechanism and policy allows these to develop at separate rates

• Interactions are via a mix of hyper-calls and message passing (see the I/O ring sketch below)

• Hyper-calls allow a guest operating system to call into the hypervisor

• Implement the Domain0 control interface

• Implement the paravirtualisation interface

• Message passing allows the hypervisor to queue events to be processed by a guest

• Callbacks to a guest, to indicate completion of a request or receipt of asynchronous data
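Data transfer between a guest and Xen uses shared-memory, asynchronous buffer-descriptor rings, described in the excerpt below. The sketch here is a hypothetical single-producer/single-consumer ring, much simpler than Xen’s real ring layout and event channels; the structure, request format, and notify_backend() are placeholders, intended only to illustrate queuing requests in shared memory and signalling with an event rather than trapping per request.

```c
/* Hypothetical shared-memory descriptor ring between a guest (producer of
 * I/O requests) and the backend domain or hypervisor (consumer).  Xen's real
 * rings carry both requests and responses and use more careful barriers. */
#include <stdbool.h>
#include <stdint.h>

#define RING_SIZE 64                 /* power of two, so free-running indices wrap cheaply */

struct io_request {
    uint64_t sector;                 /* block number to read or write      */
    uint64_t buffer_frame;           /* guest page holding the data buffer */
    uint8_t  write;                  /* 0 = read, 1 = write                */
};

struct io_ring {                     /* lives in a page shared between the domains */
    volatile uint32_t prod;          /* next slot the guest will fill      */
    volatile uint32_t cons;          /* next slot the backend will consume */
    struct io_request req[RING_SIZE];
};

extern void notify_backend(void);    /* placeholder for an event-channel notification */

/* Guest side: enqueue a request and send an asynchronous notification. */
bool ring_submit(struct io_ring *r, struct io_request rq)
{
    if (r->prod - r->cons == RING_SIZE)
        return false;                             /* ring full: try again later      */
    r->req[r->prod % RING_SIZE] = rq;
    __sync_synchronize();                         /* publish the request before the index */
    r->prod++;
    notify_backend();                             /* event, not a trap per request   */
    return true;
}
```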

(Excerpt shown on the slide, from Barham et al., “Xen and the Art of Virtualization”, SOSP 2003, sections 2.3–3.2. Its Figure 1 shows the structure of a machine running Xen: guest operating systems – XenoLinux, XenoBSD, XenoXP – with Xeno-aware device drivers run over the hypervisor’s virtual x86 CPU, virtual physical memory, virtual network, and virtual block devices, while Domain0 hosts the control-plane software and uses the control interface to create and manage domains, virtual network interfaces, and virtual block devices. The excerpt also describes the two control-transfer mechanisms – synchronous hypercalls from a domain into Xen, and an asynchronous event mechanism, akin to Unix signals, for notifications from Xen to a domain – and the shared-memory, asynchronous buffer-descriptor rings used for I/O data transfer.)


Further Reading

• P. Barham et al., “Xen and the Art of Virtualization”, Proc. ACM Symposium on Operating Systems Principles (SOSP), October 2003. DOI: 10.1145/945445.945462

• Trade-offs of paravirtualisation vs. full virtualisation?

• What needs to be done to port an OS to Xen?

• Is paravirtualisation worthwhile, when compared to full system virtualisation?

• How do Dom0 and device drivers work?


Virtualisation: Jails and Unikernels

21

Page 22: Virtualisation - csperkins.org · Virtualisation Concepts • Enable resource sharing by decoupling execution environments from physical hardware • Full system virtualisation –

Colin Perkins | https://csperkins.org/ | Copyright © 2018

Jails and Unikernels

• Alternatives to full virtualisation

• Sandboxing processes: • FreeBSD jails and Linux containers

• Container management

• Unikernels and library operating systems

22

Page 23: Virtualisation - csperkins.org · Virtualisation Concepts • Enable resource sharing by decoupling execution environments from physical hardware • Full system virtualisation –

Colin Perkins | https://csperkins.org/ | Copyright © 2018

Alternatives to Full Virtualisation

• Full operating system virtualisation is expensive • Need to run a hypervisor

• Need to run a complete guest operating system instance for each VM

• High memory and storage overhead, high CPU load from running multiple OS instances on a single machine – but excellent isolation between VMs

• Running multiple services on a single OS instance can offer insufficient isolation • All must share the same OS

• A bug in a privileged service can easily compromise the entire system – all services running on the host, and any unrelated data

• A sandbox is desirable – to isolate multiple services running on a single operating system

23

Page 24: Virtualisation - csperkins.org · Virtualisation Concepts • Enable resource sharing by decoupling execution environments from physical hardware • Full system virtualisation –

Colin Perkins | https://csperkins.org/ | Copyright © 2018

Sandboxing: Jails and Containers

• Concept – lightweight virtualisation • A single operating system kernel

• Multiple services

• Each service runs within a jail, a service-specific container, running just what’s needed for that application

• Jail restricted to see only a subset of the filesystem and processes of the host • A sandbox within the operating system

• Partially virtualised

24

[Figure: host filesystem tree – /bin /dev /etc /home /usr (/lib /local /sbin /share) /lib /sbin /tmp /var – with a /jail subtree holding per-service jails (/jail/www, /jail/mail), each containing its own /bin /dev /etc /home /usr /lib /sbin /tmp /var; a jailed service sees only its own subtree as “/”]

Page 25: Virtualisation - csperkins.org · Virtualisation Concepts • Enable resource sharing by decoupling execution environments from physical hardware • Full system virtualisation –

Colin Perkins | https://csperkins.org/ | Copyright © 2018

Example: FreeBSD Jail subsystem

• Unix has long had chroot() • Limits process to see a subset of the filesystem

• Used in, e.g., web servers, to prevent buggy code from serving files outside a chosen directory (a minimal chroot() sketch follows below)

• Jails extend this to also restrict access to other resources for jailed processes: • Limits process to see subset of the processes

(the children of the initially jailed process)

• Limits process to see specific network interface

• Prevents a process from mounting/un-mounting file systems, creating device nodes, loading kernel modules, changing kernel parameters, configuring network interfaces, etc.

• Even if running with root privilege

• Processes outside the jail can see into the jail – from outside, jailed processes look like regular processes

25
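To make the mechanism concrete, here is a minimal C sketch of the chroot()-style confinement that jails build on. It is only a sketch under assumed names: the /jail/www path, the uid/gid 10001 and the /bin/sh inside the jail are illustrative, and a real FreeBSD jail would instead use jail(2)/jail_attach(2), which add the process, network, mount and device restrictions listed above.

/* confine.c – minimal chroot()-based confinement sketch.
 * Must be started with root privilege; compile with: cc -o confine confine.c */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    /* Restrict this process (and its children) to the subtree rooted
     * at /jail/www: from now on, "/" refers to that directory. */
    if (chroot("/jail/www") != 0 || chdir("/") != 0) {
        perror("chroot");
        return EXIT_FAILURE;
    }

    /* Drop root privilege so the confined process cannot simply call
     * chroot() again to escape; the uid/gid are illustrative. */
    if (setgid(10001) != 0 || setuid(10001) != 0) {
        perror("drop privilege");
        return EXIT_FAILURE;
    }

    /* Run the confined service; /bin/sh must exist inside the subtree. */
    execl("/bin/sh", "sh", (char *)NULL);
    perror("execl");
    return EXIT_FAILURE;
}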

Jails: Confining the omnipotent root.

Poul-Henning Kamp <[email protected]>

Robert N. M. Watson <[email protected]>

The FreeBSD Project

ABSTRACT

The traditional UNIX security model is simple but inexpressive. Adding fine-grained access control improves the expressiveness, but often dramatically increases both the cost of system management and implementation complexity. In environments with a more complex management model, with delegation of some management functions to parties under varying degrees of trust, the base UNIX model and most natural extensions are inappropriate at best. Where multiple mutually untrusting parties are introduced, ‘‘inappropriate’’ rapidly transitions to ‘‘nightmarish’’, especially with regards to data integrity and privacy protection.

The FreeBSD ‘‘Jail’’ facility provides the ability to partition the operating system environment, while maintaining the simplicity of the UNIX ‘‘root’’ model. In Jail, users with privilege find that the scope of their requests is limited to the jail, allowing system administrators to delegate management capabilities for each virtual machine environment. Creating virtual machines in this manner has many potential uses; the most popular thus far has been for providing virtual machine services in Internet Service Provider environments.

1. Introduction
The UNIX access control mechanism is designed for an environment with two types of users: those with, and without administrative privilege. Within this framework, every attempt is made to provide an open system, allowing easy sharing of files and inter-process communication. As a member of the UNIX family, FreeBSD inherits these security properties. Users of FreeBSD in non-traditional UNIX environments must balance their need for strong application support, high network performance and functionality,

This work was sponsored by http://www.servetheweb.com/ and donated to the FreeBSD Project for inclusion in the FreeBSD OS. FreeBSD 4.0-RELEASE was the first release including this code. Follow-on work was sponsored by Safeport Network Services, http://www.safeport.com/

Page 26: Virtualisation - csperkins.org · Virtualisation Concepts • Enable resource sharing by decoupling execution environments from physical hardware • Full system virtualisation –

Colin Perkins | https://csperkins.org/ | Copyright © 2018

Benefits of Jails

• Extremely lightweight implementation • Processes in the jail are regular Unix processes, with some additional

privilege checks

• FreeBSD implementation added/modified ~1,000 lines of code in total

• Contents of the jail are limited: it’s not a full operating system – just the set of binaries/libraries needed for the application

• Straightforward system administration • Few new concepts: no need to learn how to operate the hypervisor

• Visibility into the jail eases debugging and administration; no new tools

• Portable • No separate hypervisor, with its own device drivers, needed

• If the base system runs, so do the jails

26

Page 27: Virtualisation - csperkins.org · Virtualisation Concepts • Enable resource sharing by decoupling execution environments from physical hardware • Full system virtualisation –

Colin Perkins | https://csperkins.org/ | Copyright © 2018

Disadvantages of Jails

• Virtualisation is imperfect • Full virtual machines can provide stronger isolation and performance

guarantees – processes in a jail can compete for resources with other processes in the system, and observe the effects of that competition

• A process can discover that it’s running in a jail – a process in a VM generally cannot easily tell that it’s in a VM

• Jails are tied to the underlying operating system • A virtual machine can be paused, migrated to a different physical system,

and resumed, without processes in the VM being aware – a jail cannot

27

Page 28: Virtualisation - csperkins.org · Virtualisation Concepts • Enable resource sharing by decoupling execution environments from physical hardware • Full system virtualisation –

Colin Perkins | https://csperkins.org/ | Copyright © 2018

Example: Linux Containers

• Two inter-related kernel features: • Namespace isolation – separation of processes

• Control groups “cgroups” – accounting for process CPU, memory, disk and network I/O usage

• Both part of the mainline Linux kernel

• Combined to provide container-based virtualisation • Implemented via tools such as “lxc”, “OpenVZ”, “Linux-VServer”

• Very similar functionality to FreeBSD Jails, but using the Linux kernel as the underlying OS

• Run a subset of processes within a container – can be a full OS-like environment, or something much more limited (a minimal namespace sketch follows this slide)

28
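As a rough illustration of the namespace half of this (cgroup limits are configured separately, through the cgroup filesystem rather than a system call), the C sketch below uses unshare(2) to detach a process from the host’s hostname and mount namespaces. The hostname, cgroup paths and shell are illustrative assumptions, not taken from the slides.

/* ns.c – minimal Linux namespace isolation sketch (needs root / CAP_SYS_ADMIN).
 * Compile with: cc -o ns ns.c */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    /* Detach from the host's UTS (hostname) and mount namespaces; other
     * flags such as CLONE_NEWPID and CLONE_NEWNET isolate process IDs
     * and networking in the same way. */
    if (unshare(CLONE_NEWUTS | CLONE_NEWNS) != 0) {
        perror("unshare");
        return EXIT_FAILURE;
    }

    /* This change is visible only inside the new UTS namespace –
     * the host keeps its own hostname. */
    if (sethostname("container0", 10) != 0) {
        perror("sethostname");
        return EXIT_FAILURE;
    }

    /* Accounting and limits come from cgroups, configured via the cgroup
     * filesystem, e.g. (cgroup v2, paths illustrative):
     *   echo "200000 100000" > /sys/fs/cgroup/demo/cpu.max
     *   echo $$ > /sys/fs/cgroup/demo/cgroup.procs */

    /* Start a shell that sees the isolated view. */
    execl("/bin/sh", "sh", (char *)NULL);
    perror("execl");
    return EXIT_FAILURE;
}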

Page 29: Virtualisation - csperkins.org · Virtualisation Concepts • Enable resource sharing by decoupling execution environments from physical hardware • Full system virtualisation –

Colin Perkins | https://csperkins.org/ | Copyright © 2018

Example: iOS/macOS Sandbox

• All apps on iOS, and all apps from the Mac App Store on macOS, are sandboxed • Sandbox functionality varies with macOS/iOS version – gradually

getting more sophisticated and more secure

• Kernel heavily based on FreeBSD, but sandboxing doesn’t use Jails • Mandatory access control – http://www.trustedbsd.org/mac.html

• Apps are constrained to see a particular directory hierarchy – but given flexibility to see other directories outside this based on user input

• Policy language allows flexible control over what OS functions are restricted – restricts access to files, processes, and other namespaces like Jails, but more flexibility in what parts of the system are exposed

• At most restrictive, looks like a FreeBSD Jail containing a single process and its data directories

• Can allow access to other OS services/directories as needed – more flexible, but more complex and hence harder to sandbox correctly

• Heavily based on Robert Watson’s PhD and work on FreeBSD MAC and Capsicum security framework • http://www.cl.cam.ac.uk/techreports/UCAM-CL-TR-818.pdf

• Trade-off: flexibility, complexity, security

29

Technical Report

Number 818

Computer Laboratory

UCAM-CL-TR-818, ISSN 1476-2986

New approaches to operating system security extensibility

Robert N. M. Watson

April 2012

15 JJ Thomson Avenue, Cambridge CB3 0FD, United Kingdom, phone +44 1223 763500

http://www.cl.cam.ac.uk/

Page 30: Virtualisation - csperkins.org · Virtualisation Concepts • Enable resource sharing by decoupling execution environments from physical hardware • Full system virtualisation –

Colin Perkins | https://csperkins.org/ | Copyright © 2018

Container Management

• Full virtualisation – VMs run complete OS, and can use standard OS update/management procedures

• Jails run a container with a subset of the full OS • Want minimal stack needed to support services running in the container

• Less software installed in container → security vulnerability less likely

• How to ensure all images are up-to-date? • No longer sufficient to patch the OS, must also patch each container

• Need to minimally subset the OS for the services in the container – can this be automated?

• How to share service/container configuration?

30

Page 31: Virtualisation - csperkins.org · Virtualisation Concepts • Enable resource sharing by decoupling execution environments from physical hardware • Full system virtualisation –

Colin Perkins | https://csperkins.org/ | Copyright © 2018

Example: Docker

• DockerFile and context specify an image • Base system image

• Docker maintains official, managed and security updated, images for popular operating systems

• Images can build on/extend other images

• Install additional binaries and libraries

• Install additional data

• Commands to execute when image is instantiated

• Metadata about resulting image

• Images are immutable → instantiated in container • Virtualisation uses Linux containers, the macOS sandbox, or FreeBSD jails to instantiate

images, depending on underlying operating system running the container • macOS and Windows run Linux in a hypervisor, and use that to execute containers

• FreeBSD relies on native Linux system call emulation, and jails

• Image can specify external filesystems to mount, to hold persistent data accessible by running image

• A standardised way of packaging a software image to run in a container – plus easy-to-manage tools for instantiating a container

31

[Figure: DockerFile + context → docker build → image → docker run → container, executing on the virtualisation engine]
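A hedged sketch of what such an image specification might look like (the base image, package, paths and tag below are illustrative examples, not taken from the lecture):

# Dockerfile – illustrative example
# Official, security-updated base image to build on.
FROM debian:bookworm-slim

# Install only the extra binaries/libraries the service needs.
RUN apt-get update && \
    apt-get install -y --no-install-recommends nginx && \
    rm -rf /var/lib/apt/lists/*

# Copy application data from the build context into the image.
COPY site/ /var/www/html/

# Metadata about the resulting image, plus a mount point for
# persistent data accessible to the running container.
EXPOSE 80
VOLUME /var/log/nginx

# Command executed when the (immutable) image is instantiated.
CMD ["nginx", "-g", "daemon off;"]

A typical workflow is then “docker build -t my-site .” to produce the immutable image and “docker run -p 8080:80 my-site” to instantiate it in a container.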

Page 32: Virtualisation - csperkins.org · Virtualisation Concepts • Enable resource sharing by decoupling execution environments from physical hardware • Full system virtualisation –

Colin Perkins | https://csperkins.org/ | Copyright © 2018

Unikernels: Library Operating Systems

• Taking the idea of service-specific containers to its logical extreme – a library operating system

• Well-implemented kernels have clear interfaces between components • Device drivers, filesystems, network protocols, processes, …

• Implement as a set of libraries that can link with an application • Only compile in the subset of functions needed to run the

desired application (e.g., if you don’t use UDP, it isn’t linked into your kernel)

• Link entire operating system kernel and application as a single binary

• Run on bare hardware, or within a hypervisor

• Typically written in high-level, type-safe, languages, with expressive module systems – not backwards compatible with Unix • E.g., MirageOS largely written in OCaml

• Type-safe languages give clear interface boundaries → automatic minimisation of images

• Single-purpose virtual machines for the cloud

32

Unikernels: Library Operating Systems for the Cloud

Anil Madhavapeddy, Richard Mortier1, Charalampos Rotsos, David Scott2, Balraj Singh, Thomas Gazagnaire3, Steven Smith, Steven Hand and Jon Crowcroft

University of Cambridge, University of Nottingham1, Citrix Systems Ltd2, OCamlPro SAS3

[email protected], [email protected], [email protected], [email protected]

Abstract

We present unikernels, a new approach to deploying cloud services via applications written in high-level source code. Unikernels are single-purpose appliances that are compile-time specialised into standalone kernels, and sealed against modification when deployed to a cloud platform. In return they offer significant reduction in image sizes, improved efficiency and security, and should reduce operational costs. Our Mirage prototype compiles OCaml code into unikernels that run on commodity clouds and offer an order of magnitude reduction in code size without significant performance penalty. The architecture combines static type-safety with a single address-space layout that can be made immutable via a hypervisor extension. Mirage contributes a suite of type-safe protocol libraries, and our results demonstrate that the hypervisor is a platform that overcomes the hardware compatibility issues that have made past library operating systems impractical to deploy in the real-world.

Categories and Subject Descriptors D.4 [Operating Systems]: Organization and Design; D.1 [Programming Techniques]: Applicative (Functional) Programming

General Terms Experimentation, Performance

1. Introduction

Operating system virtualization has revolutionised the economics of large-scale computing by providing a platform on which customers rent resources to host virtual machines (VMs). Each VM presents as a self-contained computer, booting a standard OS kernel and running unmodified application processes. Each VM is usually specialised to a particular role, e.g., a database, a webserver, and scaling out involves cloning VMs from a template image.

Despite this shift from applications running on multi-user operating systems to provisioning many instances of single-purpose VMs, there is little actual specialisation that occurs in the image that is deployed to the cloud. We take an extreme position on specialisation, treating the final VM image as a single-purpose appliance rather than a general-purpose system by stripping away functionality at compile-time. Specifically, our contributions are: (i) the unikernel approach to providing sealed single-purpose appliances, particularly suitable for providing cloud services; (ii) evaluation of a complete implementation of these techniques using a functional programming language (OCaml), showing that the benefits of type-


[Figure 1: Contrasting software layers in existing VM appliances vs. unikernel’s standalone kernel compilation approach. Left stack: Hardware / Hypervisor / OS Kernel / User Processes / Language Runtime / Parallel Threads / Application Binary. Right stack: Hardware / Hypervisor / Mirage Runtime / Application Code. The Mirage Compiler takes application source code, configuration files and the hardware architecture, and performs whole-system optimisation to produce the specialised unikernel.]

safety need not damage performance; and (iii) libraries and language extensions supporting systems programming in OCaml.

The unikernel approach builds on past work in library OSs [1–3]. The entire software stack of system libraries, language runtime, and applications is compiled into a single bootable VM image that runs directly on a standard hypervisor (Figure 1). By targeting a standard hypervisor, unikernels avoid the hardware compatibility problems encountered by traditional library OSs such as Exokernel [1] and Nemesis [2]. By eschewing backward compatibility, in contrast to Drawbridge [3], unikernels address cloud services rather than desktop applications. By targeting the commodity cloud with a library OS, unikernels can provide greater performance and improved security compared to Singularity [4]. Finally, in contrast to Libra [5] which provides a libOS abstraction for the JVM over Xen but relies on a separate Linux VM instance to provide networking and storage, unikernels are more highly-specialised single-purpose appliance VMs that directly integrate communication protocols.

We describe a complete unikernel prototype in the form of our OCaml-based Mirage implementation (§3). We evaluate it via micro-benchmarks and appliances providing DNS, OpenFlow, and HTTP (§4). We find sacrificing source-level backward compatibility allows us to increase performance while significantly improving the security of external-facing cloud services. We retain compatibility with external systems via standard network protocols such as TCP/IP, rather than attempting to support POSIX or other conventional standards for application construction. For example, the Mirage DNS server outperforms both BIND 9 (by 45%) and the high-performance NSD server (§4.2), while using very much smaller VM images: our unikernel appliance image was just 200 kB while the BIND appliance was over 400 MB. We conclude by discussing our experiences building Mirage and its position within the state of the art (§5), and concluding (§6).

Source: A. Madhavapeddy et al., “Unikernels: Library Operating Systems for the Cloud”, Proc. ACM ASPLOS, Houston, TX, USA, March 2013. DOI:10.1145/2451116.2451167

Page 33: Virtualisation - csperkins.org · Virtualisation Concepts • Enable resource sharing by decoupling execution environments from physical hardware • Full system virtualisation –

Colin Perkins | https://csperkins.org/ | Copyright © 2018

Further Reading for Tutorial

• P.-H. Kamp and R. Watson, “Jails: Confining the omnipotent root”, System Administration and Network Engineering Conference, May 2000. http://www.sane.nl/events/sane2000/papers/kamp.pdf • Trade-offs vs. complete system virtualisation?

• Overheads vs. flexibility vs. ease of management?

• Benefits of Docker-style configuration of images

• A. Madhavapeddy et al., “Unikernels: Library Operating Systems for the Cloud”, ACM ASPLOS, Houston, TX, USA, March 2013. DOI:10.1145/2451116.2451167 • Is optimising operating system for single application going too far?

• Are unikernels maintainable?

• Relation to containers and hypervisors?

33
