
A Comparison of Software and Hardware Techniques for x86 Virtualization

Keith Adams

Ole Agesen

Oct. 23, 2006

VMs are everywhere

Security

Test and Dev

Server Consolidation

Mobile Desktops...


x86 virtualization

1998-2005: Software-only VMMs

x86 does not support traditional trap-and-emulate virtualization

Binary translation!

VMMs from VMware, Microsoft, Parallels, QEMU

2005- : Hardware support emerges

AMD, Intel extend x86 to directly support virtualization

Direct comparisons now possible!

Software vs. Hardware: Performance

Intuition: hardware is fast!

A mixed bag; why?

[Bar chart: Percent of Native (bigger is better), 0-80, for LinCompile, WinCompile, Swapping, ApacheLin, Passmark2D, ApacheWin; Software vs. Hardware]

Software VMM

[Diagram: boxes for Direct Exec (user), Translated Code (guest kernel), and the VMM; faults, syscalls, and interrupts move control from direct execution into translated guest-kernel execution, IRET/sysret return to direct execution, and traces, faults, interrupts, and I/O drop from translated code into the VMM]

Binary Translation in Action
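Roughly, the translator copies innocuous guest instructions through unchanged and splices in calls to VMM emulation routines for privileged or virtualization-sensitive ones. The toy C model below uses a made-up instruction set and invented function names (a real translator works on raw x86 bytes); it is a sketch of the control flow, not VMware's implementation.

    /* Toy model of binary translation (hypothetical names; real translators
     * operate on raw x86 bytes).  At translation time, innocuous guest
     * instructions become "identical" translations, while privileged or
     * virtualization-sensitive ones are replaced by calls into VMM emulation
     * routines.  The translated block lives in a translation cache. */
    #include <stdio.h>

    typedef enum { OP_ADD, OP_CLI, OP_STORE_PTE, OP_END } op_t;
    typedef struct { op_t op; int arg; } guest_insn_t;
    typedef struct { void (*fn)(int); int arg; } xlated_insn_t;  /* translated form */

    /* Translation of an innocuous instruction: runs "as is". */
    static void do_add(int v)            { printf("direct: add %d\n", v); }

    /* Emulation routines spliced in for sensitive instructions. */
    static void emulate_cli(int v)       { (void)v; printf("vmm: clear virtual IF\n"); }
    static void emulate_pte_store(int v) { printf("vmm: emulate PTE store of %d\n", v); }

    /* Translate one basic block into the translation cache. */
    static int translate(const guest_insn_t *g, xlated_insn_t *tc)
    {
        int n = 0;
        for (; g->op != OP_END; g++, n++) {
            switch (g->op) {
            case OP_ADD:       tc[n].fn = do_add;            break;
            case OP_CLI:       tc[n].fn = emulate_cli;       break;  /* sensitive */
            case OP_STORE_PTE: tc[n].fn = emulate_pte_store; break;  /* sensitive */
            default:           tc[n].fn = do_add;            break;
            }
            tc[n].arg = g->arg;
        }
        return n;
    }

    int main(void)
    {
        const guest_insn_t block[] = { {OP_ADD, 1}, {OP_CLI, 0}, {OP_STORE_PTE, 7}, {OP_END, 0} };
        xlated_insn_t tc[8];
        int n = translate(block, tc);
        for (int i = 0; i < n; i++)       /* execute the translated block */
            tc[i].fn(tc[i].arg);
        return 0;
    }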

BT offers many advantages

Correctness: Non-virtualizable x86 instructions

Flexibility

Guest idle loops, spin locks, etc.

Work around guest bugs

Transparently instrument guest

Adaptation

Traces: VMM write protects guest privileged data, e.g., page tables

Trace faults: guest writes to page tables -> a major source of overhead (a user-space analogy is sketched below)
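The following Linux-only C sketch is a user-space analogy to a trace, not the VMM mechanism itself: it write-protects a page with mprotect() the way the VMM write-protects guest page tables, takes the resulting write fault in a SIGSEGV handler the way the VMM takes a trace fault, and then lets the write complete.

    /* User-space analogy to a VMM "trace": write-protect a page and treat the
     * write fault as a trace fault.  Linux-specific; calling mprotect() from a
     * signal handler is a common but not strictly portable trick. */
    #include <signal.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    static long page_size;
    static void *traced_page;
    static volatile sig_atomic_t trace_faults;

    /* "Trace fault" handler: note the write, then open the page so the faulting
     * store can re-execute.  A real VMM would emulate and track the write. */
    static void on_fault(int sig, siginfo_t *si, void *ctx)
    {
        (void)sig; (void)si; (void)ctx;
        trace_faults++;
        mprotect(traced_page, page_size, PROT_READ | PROT_WRITE);
    }

    int main(void)
    {
        page_size = sysconf(_SC_PAGESIZE);
        traced_page = mmap(NULL, page_size, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        struct sigaction sa;
        memset(&sa, 0, sizeof sa);
        sa.sa_sigaction = on_fault;
        sa.sa_flags = SA_SIGINFO;
        sigemptyset(&sa.sa_mask);
        sigaction(SIGSEGV, &sa, NULL);

        /* "Write-protect the guest page table": establish the trace. */
        mprotect(traced_page, page_size, PROT_READ);

        ((int *)traced_page)[0] = 42;      /* guest writes a PTE -> trace fault */
        printf("value=%d, trace faults=%d\n",
               ((int *)traced_page)[0], (int)trace_faults);
        return 0;
    }

A real VMM keeps the page protected (re-arming the trace) and derives the shadow page-table update from the faulting write; this toy simply drops the protection after the first fault.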

Adaptation example

Translation Cache

Captures working set of guest

Amortizes translation overheads

CFG of a simple guest

High rate of trace faults at instruction '!*!'

“Trap-and-emulate” approach => 1000's of CPU cycles

[Diagram: control-flow graph of the simple guest containing '!*!'; untranslated branch targets invoke the translator, which fills the Translation Cache]

Adaptation example (2)

BT Engine splices in special 'TRACE' translation

Executes memory access “in software”

10x improvement in trace performance

[Diagram: the same control-flow graph with a 'TRACE' translation spliced in for '!*!', reached by a JMP from within the Translation Cache rather than by invoking the translator]
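A hedged sketch of the adaptation logic: count trace faults per translated instruction and, past a threshold, switch that instruction to a 'TRACE' translation that emulates the page-table write in software so no fault is taken. The names and the threshold below are illustrative, not VMware's actual heuristics.

    /* Toy model of BT adaptation.  The first few writes from a given guest
     * instruction go through the expensive trace-fault path; once the fault
     * count passes a threshold, the translator splices in a 'TRACE'
     * translation that emulates the PTE write directly. */
    #include <stdio.h>

    #define ADAPT_THRESHOLD 3          /* hypothetical; chosen empirically in practice */

    struct site { unsigned faults; int adapted; };  /* per translated instruction */

    /* Slow path: the store hits a write-protected page and faults into the VMM. */
    static void pte_write_via_trace_fault(struct site *s, int v)
    {
        s->faults++;
        printf("trace fault: emulate PTE write %d (faults so far: %u)\n", v, s->faults);
        if (s->faults >= ADAPT_THRESHOLD)
            s->adapted = 1;            /* splice in the 'TRACE' translation */
    }

    /* Fast path: the 'TRACE' translation emulates the access in software and
     * never touches the write-protected mapping, so no fault is taken. */
    static void pte_write_via_trace_translation(int v)
    {
        printf("TRACE translation: emulate PTE write %d, no fault\n", v);
    }

    static void guest_pte_write(struct site *s, int v)
    {
        if (s->adapted)
            pte_write_via_trace_translation(v);
        else
            pte_write_via_trace_fault(s, v);
    }

    int main(void)
    {
        struct site s = {0, 0};
        for (int i = 0; i < 6; i++)    /* early writes fault, later ones do not */
            guest_pte_write(&s, i);
        return 0;
    }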

Hardware-assisted VMM

[Diagram: guest mode runs hardware-assisted direct execution at CPL 0-3; I/O, faults, interrupts, etc. exit to the VMM in host mode (also CPL 0-3); the VMM handles the exit and resumes the guest]
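Schematically, the hardware-assisted VMM is a loop: resume the guest, run until the hardware forces an exit, decode the exit reason, handle it, resume. The C sketch below models that loop with invented exit reasons and a scripted run_guest() stand-in; it is not the VT-x/SVM instruction set or VMware's interface.

    /* Schematic of a hardware-assisted VMM's main loop (illustrative names).
     * The CPU runs the guest at CPL 0-3 in guest mode until an exit condition
     * drops it back into the VMM in host mode; the VMM handles the exit and
     * resumes the guest. */
    #include <stdio.h>

    typedef enum { EXIT_IO, EXIT_PAGE_FAULT, EXIT_INTERRUPT, EXIT_HALT } exit_reason_t;

    /* Stand-in for "resume guest": replays a canned sequence of exits. */
    static exit_reason_t run_guest(void)
    {
        static const exit_reason_t script[] = { EXIT_IO, EXIT_PAGE_FAULT,
                                                EXIT_INTERRUPT, EXIT_HALT };
        static unsigned i;
        return script[i++];
    }

    int main(void)
    {
        for (;;) {
            exit_reason_t why = run_guest();        /* guest mode until an exit */
            switch (why) {                          /* host mode: handle it     */
            case EXIT_IO:         printf("emulate device I/O\n");        break;
            case EXIT_PAGE_FAULT: printf("handle shadow page fault\n");  break;
            case EXIT_INTERRUPT:  printf("inject/forward interrupt\n");  break;
            case EXIT_HALT:       printf("guest halted\n");              return 0;
            }
        }
    }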

Hardware: System Calls Are Fast

CPL transitions don't require VMM intervention

Native speed system calls!

[Bar chart: Syscall cost in cycles, log scale 100-10000 (smaller is better); SW VMM vs. HW VMM vs. Native]

Hardware VMM Trace Faults

[Diagram: the guest executes '!*!' and takes a trace fault; control exits guest mode to the VMM, which emulates '!*!' and resumes the guest]

Trace fault from '!*!'

Exit from guest mode

Emulate faulting instruction

Resume

Many 1000's of cycles round-trip

VMM notices high rate of faults at !*!, and ...

does what?

Pagetable modification

Native

Simple store

1 cycle (!)

Software VMM

Converges on 'TRACE' translation

~400 cycles

Hardware VMM

No translation -> no adaptation

~11000 cycles

[Bar chart: CPU cycles per page-table modification, log scale 0.1-100000 (smaller is better); Native vs. Software vs. Hardware]

Benchmarks

System under test

Pentium 4 672, 3.8 GHz, VT-x

Software VMM: VMware Player 1.0.1

Hardware VMM: VMware Player 1.0.1 (same!)

http://www.vmware.com/products/player/

Computation Is a Toss-Up

[Bar chart: Percent of Native (bigger is better), 0-120, for gzip, vpr, mcf, crafty, parser, eon, perlbmk, gap, vortex, bzip2, twolf; Software vs. Hardware]

Direct execution

Both VMMs are close to each other and to native

Kernel-Intensive Workloads

Some workloads favor hardware, others software

Why?

Which one should you use?

[Bar chart: Percent of Native (bigger is better), 0-80, for LinCompile, WinCompile, Swapping, ApacheLin, Passmark2D, ApacheWin; Software vs. Hardware]

Nano-benchmarks

More “micro” than micro-benchmarks

Measure a single virtualization-sensitive op

Often a single instruction (a sample syscall nano-benchmark is sketched below)

Nano-bench results + workload’s mix of virtualization ops => crude performance prediction
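For example, a syscall nano-benchmark can be approximated in user space by timing a minimal system call with the TSC. The sketch below assumes x86-64 Linux with GCC or Clang; the iteration counts are arbitrary, and no attempt is made to serialize the pipeline around rdtsc.

    /* Sketch of a "syscall" nano-benchmark: time a round trip into the kernel
     * with the TSC, averaged over many iterations. */
    #define _GNU_SOURCE                 /* for syscall() */
    #include <stdio.h>
    #include <stdint.h>
    #include <unistd.h>
    #include <sys/syscall.h>
    #include <x86intrin.h>              /* __rdtsc() */

    #define ITERS 100000

    int main(void)
    {
        /* Warm up so a few expensive first iterations don't skew the average. */
        for (int i = 0; i < 1000; i++)
            syscall(SYS_getpid);

        uint64_t start = __rdtsc();
        for (int i = 0; i < ITERS; i++)
            syscall(SYS_getpid);        /* always enters the kernel */
        uint64_t cycles = __rdtsc() - start;

        printf("syscall: ~%llu cycles each\n",
               (unsigned long long)(cycles / ITERS));
        return 0;
    }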

Nano-Benchmark Results

[Bar chart: CPU cycles, log scale 0.1-100000 (smaller is better), for syscall, cr8write, callret, divzero, in, pgfault, ptemod; Native vs. Software vs. Hardware]

Software: wins some (even vs. native), loses some

Hardware: bimodal, either native speed or ~11000 cycles (!)

Decomposing a Macro-Benchmark: XP64 Boot/Halt

[Bar chart: Overhead in seconds, 0-12, for syscall, cr8write, callret, in, pgfault, divzero, ptemod; SW cost vs. HW cost]

Estimated overhead = frequency * nano-benchmark score (worked sketch below)

The “in” overhead is anomalous (boot-time BIOS initialization code)
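A minimal sketch of that estimate, summing frequency times nano-benchmark cycles over the virtualization ops and converting to seconds at the system's clock rate. All counts and cycle costs below are hypothetical placeholders, not the measured XP64 data.

    /* Crude overhead prediction: estimated overhead =
     * sum over ops of (frequency in the workload) x (nano-benchmark cycles). */
    #include <stdio.h>

    struct op { const char *name; double count; double cycles; };

    int main(void)
    {
        const double hz = 3.8e9;                 /* 3.8 GHz system under test  */
        const struct op ops[] = {                /* hypothetical workload mix  */
            { "syscall", 2.0e6,  2000 },
            { "pgfault", 1.0e6,  3000 },
            { "ptemod",  5.0e5, 11000 },
            { "in",      1.0e5, 15000 },
        };
        double total_cycles = 0;
        for (unsigned i = 0; i < sizeof ops / sizeof ops[0]; i++) {
            double c = ops[i].count * ops[i].cycles;
            printf("%-8s %10.0f events -> %6.3f s\n",
                   ops[i].name, ops[i].count, c / hz);
            total_cycles += c;
        }
        printf("estimated total overhead: %.3f s\n", total_cycles / hz);
        return 0;
    }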

Two Workloads That Favor Hardware

Passmark/2D

I/O to weird device: no VMM intervention

Apache/Windows

Performing many thread switches

No exits on hardware VMM

…but “purpose” of Apache is I/O, not thread switches

They are system call micro-benchmarks in disguise

Claim: These two workloads are anomalous.

Which VMM Should I Use?

“It depends.”

Computation: flip a coin

“Trivial” kernel-intensive

Single address space, little I/O

=> Hardware!

“Non-trivial” kernel-intensive

Process switches, I/O, address space modifications

=> Software!!!

Claim: Hardware Will Improve

Micro-architecture: faster exits, VMCB accesses, ...

Architecture: assists for MMU, more decoding, fewer exits…

Software: tuning, maturity, …

[Chart: expected performance trend from 1998 to 2008; Software vs. Hardware]

Conclusions

Current hardware does not achieve performance parity with previous software techniques.

Major problem for accelerating virtualization

Not executing the virtual instruction stream…

But efficient virtualization of MMU and I/O

Hardware should enhance, not replace, software techniques

Backup slides

Improving Virtual MMU Performance

Tune existing software MMU

Inherited from SW VMM

Can use traces more lightly, but…

Trade performance in other parts of the system

Current hardware introduces new constraints

Fundamentally harder for software MMU

Hardware approach

Intel’s “EPT”, AMD’s “NPT”

Hardware walks the second level of page tables on TLB misses (toy model below)
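A toy model of the resulting two-dimensional translation: a guest-virtual address is first translated through the guest's page table to a guest-physical address, which is then translated through the nested page table to a host-physical address. The single-level tables and layout below are made up; in real hardware every guest page-table access during the walk also goes through the nested tables, which is what makes a 2D walk expensive on a TLB miss.

    /* Toy model of a nested page walk (single-level tables, 4 KB pages,
     * invented layout) to illustrate the EPT/NPT idea. */
    #include <stdio.h>
    #include <stdint.h>

    #define PAGE_SHIFT 12
    #define NENTRIES   16

    static uint64_t guest_pt[NENTRIES];   /* gVA page -> gPA page (guest-managed) */
    static uint64_t nested_pt[NENTRIES];  /* gPA page -> hPA page (VMM-managed)   */

    static uint64_t gpa_to_hpa(uint64_t gpa)
    {
        return (nested_pt[gpa >> PAGE_SHIFT] << PAGE_SHIFT) | (gpa & 0xfff);
    }

    static uint64_t gva_to_hpa(uint64_t gva)
    {
        /* Step 1: walk the guest page table; the walk yields a guest-physical address. */
        uint64_t gpa = (guest_pt[gva >> PAGE_SHIFT] << PAGE_SHIFT) | (gva & 0xfff);
        /* Step 2: translate that gPA through the nested page table. */
        return gpa_to_hpa(gpa);
    }

    int main(void)
    {
        guest_pt[1]  = 5;    /* guest maps its virtual page 1 at guest-physical page 5 */
        nested_pt[5] = 9;    /* VMM maps guest-physical page 5 at host-physical page 9 */
        uint64_t gva = (1 << PAGE_SHIFT) | 0x2a;
        printf("gVA 0x%llx -> hPA 0x%llx\n",
               (unsigned long long)gva, (unsigned long long)gva_to_hpa(gva));
        return 0;
    }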

WinXP64 Boot/Halt Translation Stats

t (in 10s)  units   size    instr   cycles  size    cyc/ins  ins/unit
0           38690   336k    120k    252M    924k    2097     3.11
1           48839   500k    169k    318M    1164k   1871     3.48
2           108k    1187k   392k    754M    2589k   1920     3.61
3           29362   264k    89749   287M    951k    3197     3.06
4           96876   1000k   337k    708M    2418k   2100     3.48
5           58553   577k    193k    403M    1572k   2078     3.31
6           19430   148k    50951   148M    633k    2904     2.62
7           13081   87811   30455   124M    494k    4071     2.33
Total       413k    4101k   1384k   2994M   10748k  2161     3.35