TRANSCRIPT
A Comparison of Software and Hardware Techniques for x86 Virtualization
Keith Adams
Ole Agesen
Oct. 23, 2006
x86 Virtualization
1998-2005: Software-only VMMs
x86 does not support traditional virtualization
Binary translation!
VMMs from VMware, Microsoft, Parallels, QEMU
2005-present: Hardware support emerges
AMD and Intel extend x86 to directly support virtualization
Direct comparisons now possible!
Software vs. Hardware: Performance
Intuition: hardware is fast!
A mixed bag; why?
[Chart: Percent of Native (bigger is better) for LinCompile, WinCompile, Swapping, ApacheLin, Passmark2D, ApacheWin: Software vs. Hardware]
Software VMM
[Diagram: Direct Exec (user) and Translated Code (guest kernel) run above the VMM. Faults, syscalls, and interrupts drop from direct execution into the VMM, which returns via IRET/sysret; guest kernel execution enters translated code, and traces, faults, interrupts, and I/O drop from there into the VMM.]
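To make that control flow concrete, here is a minimal sketch of such a VMM's outer loop in C. Every name (vmm_loop, run_direct, run_translated, handle_in_vmm) is hypothetical, not VMware's actual code:

```c
/* Minimal sketch of a software VMM's outer loop. Illustrative only:
 * all names here are hypothetical, not VMware's implementation. */
enum exit_reason { EXIT_FAULT, EXIT_SYSCALL, EXIT_INTERRUPT, EXIT_IO, EXIT_TRACE };

extern int guest_cpl(void);                   /* current guest privilege level */
extern enum exit_reason run_direct(void);     /* direct execution of user code */
extern enum exit_reason run_translated(void); /* run guest kernel code from the
                                                 translation cache */
extern void handle_in_vmm(enum exit_reason);  /* emulate, deliver, or adapt */

void vmm_loop(void)
{
    for (;;) {
        /* Unprivileged guest code runs directly on the CPU; guest
         * kernel code goes through binary translation. */
        enum exit_reason why = (guest_cpl() == 3) ? run_direct()
                                                  : run_translated();
        /* Faults, syscalls, interrupts, traces, and I/O all land here. */
        handle_in_vmm(why);
    }
}
```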
Binary Translation in Action
BT offers many advantages
Correctness: Non-virtualizable x86 instructions
Flexibility
Guest idle loops, spin locks, etc.
Work around guest bugs
Transparently instrument guest
Adaptation
Traces: VMM write-protects guest privileged data, e.g., page tables
Trace faults: guest writes to page tables -> a major source of overhead
Adaptation example
Translation Cache
Captures working set of guest
Amortizes translation overheads
CFG of a simple guest
High rate of trace faults at instruction '!*!'
“Trap-and-emulate” approach => 1000's of CPU cycles
[Diagram: the guest CFG lives in the Translation Cache; the block containing '!*!' invokes the translator]
Adaptation example (2)
BT Engine splices in special 'TRACE' translation
Executes memory access “in software”
10x improvement in trace performance
[Diagram: the faulting translation is replaced by a JMP to the 'TRACE' translation; untranslated paths still invoke the translator]
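A rough sketch of how this adaptation might be structured, assuming a hypothetical per-translation fault counter and threshold (none of these names are from VMware's code):

```c
/* Hypothetical sketch of BT adaptation: once a translated instruction
 * has taken enough trace faults, retranslate it so the memory access
 * is performed "in software" instead of faulting every time. */
#define ADAPT_THRESHOLD 16   /* assumed tuning knob */

struct xlation {
    unsigned long guest_pc;     /* guest instruction this translation covers */
    unsigned      trace_faults; /* faults observed so far */
    int           adapted;      /* nonzero once the TRACE version is in place */
};

extern void emulate_access_in_software(struct xlation *x);
extern void splice_trace_translation(struct xlation *x); /* emit JMP to TRACE */

void on_trace_fault(struct xlation *x)
{
    emulate_access_in_software(x);  /* handle this occurrence the slow way */
    if (!x->adapted && ++x->trace_faults >= ADAPT_THRESHOLD) {
        /* High fault rate detected: splice in the special TRACE
         * translation so future executions skip the fault entirely
         * (~10x faster, per the slide). */
        splice_trace_translation(x);
        x->adapted = 1;
    }
}
```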
Hardware-assisted VMM
[Diagram: Hardware-Assisted Direct Exec (CPL 0-3) runs in guest mode; the VMM (CPL 0-3) runs in host mode. I/O, faults, interrupts, etc. exit to the VMM, which then resumes the guest.]
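For contrast with the software VMM loop above, a hardware-assisted VMM's run loop reduces to enter-guest / handle-exit. Again a sketch with hypothetical names; real VT-x code manages a VMCS and enters the guest via VMLAUNCH/VMRESUME:

```c
/* Sketch of a hardware-assisted VMM's run loop (hypothetical names). */
enum vmexit { VMEXIT_IO, VMEXIT_FAULT, VMEXIT_INTERRUPT, VMEXIT_OTHER };

extern enum vmexit enter_guest_mode(void); /* e.g., a VMLAUNCH/VMRESUME wrapper */
extern void        emulate_exit(enum vmexit);

void hw_vmm_loop(void)
{
    for (;;) {
        /* The guest runs at full speed, at any CPL, until the hardware
         * forces an exit back to host mode. */
        enum vmexit why = enter_guest_mode();
        emulate_exit(why);  /* I/O, faults, interrupts, ... */
        /* Note: there is no translated code here, so (unlike BT) there
         * is nothing to rewrite when the same instruction keeps exiting. */
    }
}
```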
Hardware: System Calls Are Fast
CPL transitions don't require VMM intervention
Native speed system calls!
[Chart: syscall cost in cycles, log scale 100 to 10000 (smaller is better): SW VMM, HW VMM, Native]
Hardware VMM Trace Faults
[Diagram: guest hits '!*!', takes a trace fault into the VMM, which emulates '!*!' and resumes the guest]
Trace fault from '!*!'
Exit from guest mode
Emulate faulting instruction
Resume
Many 1000's of cycles round-trip
VMM notices high rate of faults at !*!, and ...
does what?
Page-table modification
Native
Simple store
1 cycle (!)
Software VMM
Converges on 'TRACE' translation
~400 cycles
Hardware VMM
No translation -> no adaptation
~11000 cycles
[Chart: CPU cycles per page-table modification, log scale 0.1 to 100000 (smaller is better): Native, Software, Hardware]
Benchmarks
System under test
Pentium 4 672, 3.8 GHz, VT-x
Software VMM: VMware Player 1.0.1
Hardware VMM: VMware Player 1.0.1 (same!)
http://www.vmware.com/products/player/
Computation Is a Toss-Up
[Chart: Percent of Native (bigger is better) for gzip, vpr, mcf, crafty, parser, eon, perlbmk, gap, vortex, bzip2, twolf: Software vs. Hardware]
Direct execution
Both VMMs close to each other, native
Kernel-Intensive Workloads
Some workloads favor hardware, others software
Why?
Which one should you use?
[Chart: Percent of Native (bigger is better) for LinCompile, WinCompile, Swapping, ApacheLin, Passmark2D, ApacheWin: Software vs. Hardware]
Nano-benchmarks
More “micro” than micro-benchmarks
Measure a single virtualization-sensitive op
Often a single instruction
Nano-bench results + workload’s mix of virtualization ops => crude performance prediction
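As a sketch of that prediction, the arithmetic is just a weighted sum: extra cycles = sum over ops of (op count * (VMM cost - native cost)). The code below is illustrative; the op mix in main() is hypothetical, while the 11000-cycle and 1-cycle ptemod costs echo the earlier slide:

```c
#include <stdio.h>

/* One virtualization-sensitive op and its measured per-op costs. */
struct op_profile {
    const char *name;
    double count;         /* occurrences in the workload (hypothetical) */
    double vmm_cycles;    /* nano-benchmark score under the VMM */
    double native_cycles; /* nano-benchmark score running natively */
};

/* Crude prediction: sum frequency-weighted extra cycles, convert to seconds. */
static double predicted_overhead_seconds(const struct op_profile *ops,
                                         int n, double cpu_hz)
{
    double extra = 0.0;
    for (int i = 0; i < n; i++)
        extra += ops[i].count * (ops[i].vmm_cycles - ops[i].native_cycles);
    return extra / cpu_hz;
}

int main(void)
{
    /* Hypothetical workload: one million page-table writes on the
     * 3.8 GHz machine from the Benchmarks slide, under the HW VMM. */
    struct op_profile hw[] = { { "ptemod", 1e6, 11000.0, 1.0 } };
    printf("predicted overhead: %.2f s\n",
           predicted_overhead_seconds(hw, 1, 3.8e9));
    return 0;
}
```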
Nano-Benchmark Results
[Chart: CPU cycles, log scale 0.1 to 100000 (smaller is better), for syscall, cr8write, callret, divzero, in, pgfault, ptemod: Native, Software, Hardware]
Software: wins some (even vs. native), loses some
Hardware: bimodal, either native speed or ~11000 cycles(!)
Decomposing a Macro-Benchmark: XP64 Boot/Halt
[Chart: overhead in seconds for syscall, cr8write, callret, in, pgfault, divzero, ptemod: SW cost vs. HW cost]
Estimated overhead = frequency * nano-benchmark score
“In” overhead is anomalous (boot-time BIOS initialization code)
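To make the frequency * score arithmetic concrete with a hypothetical count: a workload issuing 10^6 page-table writes would pay about 10^6 * (400 - 1) ≈ 4 * 10^8 extra cycles (~0.1 s at 3.8 GHz) under the software VMM, but about 10^6 * (11000 - 1) ≈ 1.1 * 10^10 cycles (~2.9 s) under the hardware VMM, using the ptemod scores from the earlier slide.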
Two Workloads That Favor Hardware
Passmark/2D
I/O to weird device: no VMM intervention
Apache/Windows
Performing many thread switches
No exits on hardware VMM
…but the “purpose” of Apache is I/O, not thread switches
They are system call micro-benchmarks in disguise
Claim: These two workloads are anomalous.
Which VMM Should I Use?
“It depends.”
Computation: flip a coin
“Trivial” kernel-intensive
Single address space, little I/O
=> Hardware!
“Non-trivial” kernel-intensive
Process switches, I/O, address space modifications
=> Software!!!
Claim: Hardware Will Improve
Micro-architecture: faster exits, VMCB accesses, ...
Architecture: assists for MMU, more decoding, fewer exits…
Software: tuning, maturity, …
[Timeline: 1998 through 2008, Software and Hardware VMM trajectories]
Conclusions
Current hardware does not achieve performance parity with previous software techniques.
Major problem for accelerating virtualization
Not executing the virtual instruction stream…
But efficient virtualization of MMU and I/O
Hardware should enhance, not replace, software techniques
Improving Virtual MMU Performance
Tune existing software MMU
Inherited from SW VMM
Can use traces more lightly, but…
Trade performance in other parts of the system
Current hardware introduces new constraints
Fundamentally harder for software MMU
Hardware approach
Intel’s “EPT”, AMD’s “NPT”
Hardware walks 2nd level of page tables on TLB misses
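For scale (using the standard radix-4 x86-64 figures, not numbers from this talk): with 4-level guest page tables and 4-level nested tables, a worst-case 2D walk touches up to 4*4 + 4 + 4 = 24 memory references per TLB miss, versus 4 natively; the bet is that avoiding exits and trace faults more than pays for the longer walk.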
WinXP64 Boot/Halt Translation Stats
t (10s)   units    size     instr    cycles   size      cyc/ins   ins/unit
0         38690    336k     120k     252M     924k      2097      3.11
1         48839    500k     169k     318M     1164k     1871      3.48
2         108k     1187k    392k     754M     2589k     1920      3.61
3         29362    264k     89749    287M     951k      3197      3.06
4         96876    1000k    337k     708M     2418k     2100      3.48
5         58553    577k     193k     403M     1572k     2078      3.31
6         19430    148k     50951    148M     633k      2904      2.62
7         13081    87811    30455    124M     494k      4071      2.33
Total     413k     4101k    1384k    2994M    10748k    2161      3.35