CS530 Operating System
Nesting Paging in VM
Replay for MPs
Jaehyuk Huh
Computer Science, KAIST
Address Translation in VM
• Need to translate guest VA (gVA) to machine address
  – gVA (guest VA) → gPA (guest PA) → sPA (system PA)
• Paravirtualization
  – Guest page table (managed by guest OS) directly maps gVA to sPA
  – Hypervisor validates guest page table
• Full virtualization
  – SW technique: shadow paging
  – HW-assisted technique: nested paging
[Figure: x86 4 KB page tables in long mode]
Shadow Page Table
• Shadow page table (sPT)
  – Translates from gVA to sPA
  – Maintained by VMM (hypervisor)
• VMM intercepts updates of the page table base address
  – CR3 updates in x86
  – Sets CR3 with the sPT base address instead of the gPT base address
• sPT must be consistent with the guest page table (gPT): gPT updates must be reflected in sPT
• Any page fault must be intercepted by VMM
  – VMM must tell guest-induced page faults from VMM-induced ones
  – Vectors guest-induced page faults to guest OS
  – High overheads for page fault handling
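The relationship between gPT, the VMM's own mapping, and sPT can be sketched as a composition of two lookups. This is only an illustration: the single-level dictionaries and the names `build_shadow`, `translate`, and `p2m` are assumptions made for the example, not part of any real hypervisor.

```python
PAGE = 4096  # 4 KB pages

def build_shadow(gpt, p2m):
    """Compose the guest table (gVA page -> gPA frame) with the
    VMM's map (gPA frame -> sPA frame) into a shadow table."""
    return {gvpn: p2m[gpfn] for gvpn, gpfn in gpt.items() if gpfn in p2m}

def translate(spt, gva):
    """Translate a guest virtual address using only the shadow table."""
    vpn, offset = divmod(gva, PAGE)
    if vpn not in spt:
        raise KeyError("page fault")  # a real VMM would intercept this
    return spt[vpn] * PAGE + offset

gpt = {0x10: 0x2, 0x11: 0x3}   # guest OS view (hypothetical values)
p2m = {0x2: 0x80, 0x3: 0x95}   # VMM view (hypothetical values)
spt = build_shadow(gpt, p2m)
print(hex(translate(spt, 0x10 * PAGE + 0x123)))  # prints 0x80123
```

The hardware only ever sees `spt`, which is why every change to `gpt` has to be propagated by the VMM.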
How to make gPT and sPT consistent?
• Write-protecting gPT
  – Any modification of gPT (adding or removing a translation) causes a fault
  – VMM updates sPT accordingly
• Exploiting page-fault behavior and TLB consistency rules
  – Adding a page translation
    • Guest OS can add a new translation to gPT without interception by VMM
    • A later access by the guest VM causes a page fault on the new translation
    • VMM updates sPT on the page fault: must inspect gPT to find the new page
  – Deleting a page translation
    • Guest OS executes INVLPG to invalidate the TLB entry
    • VMM intercepts the execution and removes the entry from sPT
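The second scheme (fill sPT lazily on a fault, drop entries on INVLPG) can be modeled in a few lines. This is a toy sketch under simplifying assumptions: single-level tables, and the class name `LazyShadow` and its fields are invented for the example.

```python
class LazyShadow:
    def __init__(self, p2m):
        self.gpt = {}      # guest page table: gVA page -> gPA frame
        self.p2m = p2m     # VMM map: gPA frame -> sPA frame
        self.spt = {}      # shadow table, filled only on faults
        self.trace = []    # which kind of fault each miss turned into

    def access(self, gvpn):
        if gvpn in self.spt:
            return self.spt[gvpn]            # no VMM involvement
        if gvpn in self.gpt:                 # VMM inspects gPT on the fault
            self.spt[gvpn] = self.p2m[self.gpt[gvpn]]
            self.trace.append("vmm fill")
            return self.spt[gvpn]
        self.trace.append("guest fault")     # vectored to the guest OS
        return None

    def invlpg(self, gvpn):
        self.spt.pop(gvpn, None)             # intercepted INVLPG

mmu = LazyShadow(p2m={0x2: 0x80})
mmu.gpt[0x10] = 0x2          # guest adds a translation without a trap
first = mmu.access(0x10)     # faults once; VMM fills sPT
second = mmu.access(0x10)    # hits sPT
del mmu.gpt[0x10]            # guest removes the translation...
mmu.invlpg(0x10)             # ...and executes INVLPG, which the VMM intercepts
third = mmu.access(0x10)     # now a genuine guest page fault
```

The same miss path distinguishes a VMM-induced fault (entry present in gPT) from a guest-induced one (entry absent).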
Overheads of Shadow Paging
• Any page fault requires expensive VMM intervention
  – Guest-induced page faults
  – Hypervisor-induced page faults
• Accessed and dirty bit updates
  – HW page walker sets bits in sPT (not gPT)
  – Guest OS needs the information to make paging decisions
  – Dirty bit example: map pages pointed to by sPT read-only, so the first write faults and the VMM can record the dirty bit
• Problems in MPs
  – What if a VM uses multiple processors?
  – Replicating sPT for each processor? Memory overheads
  – Sharing sPT? Must synchronize sPT on any change
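The dirty-bit example from this slide can be sketched as follows: shadow entries start read-only, and the first write to each page traps so the VMM can set the dirty bit the guest expects to see. The class `DirtyTracker` and its fields are hypothetical names for the illustration.

```python
class DirtyTracker:
    def __init__(self, pages):
        self.spt_writable = {p: False for p in pages}  # sPT starts read-only
        self.gpt_dirty = {p: False for p in pages}     # what the guest OS reads
        self.faults = 0

    def write(self, page):
        if not self.spt_writable[page]:
            self.faults += 1                 # write-protection fault into the VMM
            self.gpt_dirty[page] = True      # VMM propagates the dirty bit to gPT
            self.spt_writable[page] = True   # later writes proceed without a trap

tracker = DirtyTracker(pages=[1, 2])
tracker.write(1)
tracker.write(1)
```

Only the first write to a page pays the fault, which is the overhead this slide refers to.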
[Figure: Shadow Paging Overheads]
Nesting Page Table
• A source of address translation overheads in traditional x86 VMMs
  – A fixed hardware page walker handles a TLB miss
  – Can walk only one page table (pointed to by CR3)
• Nested paging
  – Separate HW state affecting paging (two copies of CR3, etc.) for guest OS and VMM
  – HW page walker can walk both the gPT and the VMM's nested page table (gPA to sPA)
  – TLB can hold a translation from gVA to sPA directly
• Benefit: no more traps on guest page table accesses
• Drawback: extra page table steps add latency to a TLB miss
• May add extra caching for page translation
  – Nested TLB
  – 2D page walk cache
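To see why the extra steps matter, one can count memory references in a two-dimensional walk. The sketch below assumes no walk caching at all and counts one reference per page-table entry read; the function name and parameters are illustrative.

```python
def nested_walk_refs(guest_levels=4, nested_levels=4):
    """Memory references for one TLB miss under nested paging."""
    refs = 0
    for _ in range(guest_levels):
        refs += nested_levels  # translate the gPA of this guest table to sPA
        refs += 1              # read the guest page-table entry itself
    refs += nested_levels      # translate the final data gPA to sPA
    return refs

native = nested_walk_refs(guest_levels=4, nested_levels=0)
nested = nested_walk_refs(guest_levels=4, nested_levels=4)
print(native, nested)  # prints 4 24
```

With g guest levels and n nested levels the count is (g + 1)(n + 1) - 1, which is what the nested TLB and 2D page walk cache are meant to reduce.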
Nested Paging
[Figure: nested paging, shown across two slides]
Address Space IDs
• Old x86 did not support address space IDs (ASIDs) in TLBs
  – Must flush TLBs on a VM switch
  – Assign an ASID to each VM
  – Still need to flush TLBs on a context switch within a VM
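A minimal sketch of an ASID-tagged TLB, assuming one ASID per VM as on the slide. The class `TaggedTLB` and its methods are hypothetical, not a model of any specific processor.

```python
class TaggedTLB:
    def __init__(self):
        self.entries = {}  # (asid, virtual page) -> physical frame

    def insert(self, asid, vpn, pfn):
        self.entries[(asid, vpn)] = pfn

    def lookup(self, asid, vpn):
        return self.entries.get((asid, vpn))

    def flush_asid(self, asid):
        """Drop one VM's entries, e.g. on a context switch inside that VM."""
        self.entries = {k: v for k, v in self.entries.items() if k[0] != asid}

tlb = TaggedTLB()
tlb.insert(asid=1, vpn=0x10, pfn=0x80)   # VM 1
tlb.insert(asid=2, vpn=0x10, pfn=0x95)   # VM 2, same virtual page
hit_vm2 = tlb.lookup(2, 0x10)            # VM switch needs no flush
tlb.flush_asid(1)                        # process switch inside VM 1
```

Because the tag identifies only the VM, two processes inside the same VM would still collide, hence the remaining flush.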
Replay Papers
• VM-based replay
  – Execution Replay for Multiprocessor Virtual Machines
  – Dunlap et al.
• HW-based replay
  – Rerun: Exploiting Episodes for Lightweight Memory Race Recording
  – Hower and Hill
• ODR: Output-Deterministic Replay for Multicore Debugging
  – Altekar and Stoica
• Slides adapted from the presentation slides by the paper authors
Big ideas
• Detection and replay of memory races is possible on commodity hardware
• Overhead high for some workloads
• …but surprisingly low for other workloads
Execution Replay
[Figure: machine components: CPU, Memory, Disk, Network, Keyboard/mouse, Interrupts]
Deterministic Replay
• Deterministic Replay
  – Faithfully replay an execution such that all instructions appear to complete in the same order and produce the same result
• Valuable
  – Debugging [LeBlanc et al., COMP '87]
    • e.g., time travel debugging, rare bug replication
  – Fault tolerance [Bressoud et al., SIGOPS '95]
    • e.g., hot backup virtual machines
  – Security [Dunlap et al., OSDI '02]
    • e.g., attack analysis
  – Tracing [Xu et al., WDDD '07]
    • e.g., unobtrusive replay tracing
Single-processor Replay
• Basic principles well understood
  – Log all non-deterministic inputs
  – Timing of asynchronous events
• Minimal overhead (Dunlap02)
  – 13% worst case
  – Log for months or years
• Available commercially
  – VMware: Record/Replay
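The basic principle (log each non-deterministic input together with the point at which it arrived) can be illustrated with a toy guest. Everything here is an assumption for the sketch: the guest is a simple state update, the instruction count stands in for event timing, and `random.Random` stands in for device behavior.

```python
import random

def step(state, event):
    """One deterministic guest step; `event` is an asynchronous input or None."""
    return (state * 31 + event) % 2**32 if event is not None else state + 1

def record(steps, seed):
    rng = random.Random(seed)             # stands in for unpredictable devices
    state, log = 0, []
    for icount in range(steps):
        event = rng.randrange(256) if rng.random() < 0.1 else None
        if event is not None:
            log.append((icount, event))   # log the input and its timing
        state = step(state, event)
    return state, log

def replay(steps, log):
    pending = dict(log)                   # instruction count -> logged input
    state = 0
    for icount in range(steps):
        state = step(state, pending.get(icount))
    return state

final, log = record(1000, seed=1)
replayed = replay(1000, log)              # reaches the same final state
```

Replay never consults the random source; the log alone is enough because everything else is deterministic on a single processor.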
The Multiprocessor Challenge
• Interleaved reads and writes
  – Fine-grained non-determinism
  – Much more difficult
• Existing solutions
  – Hardware modification
  – Software instrumentation
• SMP-ReVirt
  – Hardware MMU to detect sharing
Multiprocessor Replay
[Figure: P1 and P2 sharing Memory, with accesses n=3, n=5 and if (n<4)]
Ordering Memory Accesses
• Preserving order will reproduce execution
  – a→b: "a happens-before b"
  – Ordering is transitive: a→b, b→c means a→c
• Two instructions must be ordered if:
  – they both access the same memory, and
  – one of them is a write
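The ordering rule above is a short predicate. In this sketch the `Access` record and the extra check that the two accesses come from different processors are assumptions added for the example (accesses on the same processor are already ordered by program order).

```python
from collections import namedtuple

Access = namedtuple("Access", "cpu addr is_write")

def must_order(x, y):
    """True if the relative order of x and y has to be preserved for replay."""
    same_memory = x.addr == y.addr
    one_is_write = x.is_write or y.is_write
    return x.cpu != y.cpu and same_memory and one_is_write

store = Access(cpu=1, addr=0x100, is_write=True)
load = Access(cpu=2, addr=0x100, is_write=False)
other_load = Access(cpu=1, addr=0x100, is_write=False)
```

Here `must_order(store, load)` holds, while two loads of the same address do not need an ordering constraint.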
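Constraints: Enforcing order
• To guarantee a→d (with program order a→b on P1 and c→d on P2), any one of these suffices:
  – a→d
  – b→d
  – a→c
  – b→c
• Suppose we need b→c
  – b→c is necessary
  – a→d is redundant
[Figure: P1 executes a, b; P2 executes c, d; logging both constraints is overconstrained]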
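Whether a constraint is redundant can be checked by reachability over program order plus the constraints already logged. This is a sketch with hypothetical names; the edge lists encode the a, b, c, d example from this slide.

```python
def reachable(edges, src, dst):
    """True if dst follows src through a chain of happens-before edges."""
    stack, seen = [src], set()
    while stack:
        node = stack.pop()
        if node == dst:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(v for u, v in edges if u == node)
    return False

program_order = [("a", "b"), ("c", "d")]   # a, b on P1; c, d on P2
logged = [("b", "c")]                      # the one constraint we record

# a→d follows from a→b, b→c, c→d, so it does not need to be logged.
a_before_d = reachable(program_order + logged, "a", "d")
```

The reverse does not hold: logging only a→d would not enforce b→c.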
CREW Protocol
• Each shared object is in one of two states:
  – Concurrent-Read: all processors can read, none can write
  – Exclusive-Write: one processor (the owner) can read and write; others have no access
• Enforced with hardware MMU
  – Read/write
  – Read-only
  – None
• Change CREW states on demand
  – Fault, fixup, re-execute
• CREW event
  – Increasing or reducing permission due to CREW state changes
CREW Property
• If two instructions on different processors:
  – access the same page,
  – and one of them is a write,
  – there will be a CREW event on each processor between them.
Generating Constraints
• State: Concurrent Read
  – All processors read-only
• d*: CREW fault
• New state: P2 Exclusive
• r: privilege reduction
  – Read to None
• i: privilege increase
  – Read to Read/write
• Log timing of r and i
• Constraint:
  – r → i
[Figure: timelines for P1 and P2 with events a, d, r, i and the faulting access d*]
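The following toy model tracks per-processor permissions for one page and emits an r → i constraint whenever a CREW fault changes the state. It is a sketch under assumptions: instruction counts stand in for the logged timing, and the class `CrewPage` and its interface are invented for the example rather than taken from SMP-ReVirt.

```python
NONE, READ, RW = 0, 1, 2

class CrewPage:
    def __init__(self, cpus):
        self.perm = {c: READ for c in cpus}   # start in concurrent-read
        self.constraints = []                 # ((cpu_r, when), (cpu_i, when))

    def access(self, cpu, is_write, icount):
        """icount maps each cpu to its current instruction count.
        Returns True if the access caused a CREW fault."""
        needed = RW if is_write else READ
        if self.perm[cpu] >= needed:
            return False
        if is_write:                          # new state: exclusive-write
            new = {c: NONE for c in self.perm}
            new[cpu] = RW
        else:                                 # new state: concurrent-read
            new = {c: READ for c in self.perm}
        reduced = [c for c in self.perm if new[c] < self.perm[c]]
        increased = [c for c in self.perm if new[c] > self.perm[c]]
        for r in reduced:
            for i in increased:
                self.constraints.append(((r, icount[r]), (i, icount[i])))
        self.perm = new
        return True

page = CrewPage(["P1", "P2"])
read_fault = page.access("P1", False, {"P1": 5, "P2": 7})     # allowed, no event
write_fault = page.access("P2", True, {"P1": 10, "P2": 12})   # the d* fault
```

After the write fault, P1 has dropped from read to none (r) and P2 has risen from read to read/write (i), giving the single constraint (P1, 10) → (P2, 12).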
Predicting results
• Key changes in sharing attributes
  – 4096-byte sharing granularity
  – "Miss" is very expensive
• SPLASH2
  – Good: high spatial locality / low false sharing
  – Bad: random access patterns / high false sharing
• The Linux kernel
  – Tuned to 16-byte cacheline
  – Involving the kernel may be expensive
Single-processor Xen guests
[Figure: bar chart of normalized runtime for FMM, LU, ocean, radix, water-spatial, kernel-build, radiosity and dbench. Series: Unmodified 1-cpu guest, Logging 1-cpu guest. Data labels as extracted: 1.00, 1.04, 1.01, 1.00, 1.03, 1.13, 1.00, 1.05]
2-processor Xen guests
[Figure: bar chart of normalized runtime for FMM, LU, ocean, radix, water-spatial and kernel-build. Series: Unmodified 2-cpu guest, Logging 2-cpu guest, Logging 1-cpu guest. Data labels as extracted: 1.51, 1.00, 1.08, 1.60, 1.48, 2.10, 1.90, 1.76, 1.96, 1.74, 1.83, 1.99]
2-processor, cont'd
[Figure: bar chart of normalized runtime for radiosity and dbench. Series: Unmodified 2-cpu guest, Logging 2-cpu guest, Logging 1-cpu guest. Data labels as extracted: 8.70, 7.21, 1.85, 1.88]
4-processor Xen guests
[Figure: bar chart of normalized runtime for FMM, LU, ocean, radix, water-spatial and kernel-build. Series: Unmodified domain, 4 cpus; CREW logging, 4 cpus; CREW logging, 2 cpus*; CREW logging, 1 cpu. Data labels as extracted: 7.36, 1.12, 1.28, 4.20, 1.72, 9.03]
HW Memory Race Recording
• SW-only approach
  – Too slow to be always on
  – SW instrumentation alters the execution path
• Want
  – Small log: record longer for same state
  – Small hardware: reduce cost, especially when not used
  – Unobtrusive: should not alter execution
• Rerun: Exploiting Episodes for Lightweight Memory Race Recording
Episodic Recording
• Most code executes without races
  – Use race-free regions as unit of ordering
• Episodes: independent execution regions
  – Defined per thread
  – Identified passively: does not affect execution
  – Encompass every instruction
[Figure: instruction streams of threads T0, T1 and T2 (loads and stores such as LD A, ST B, ST C, LD F) divided into episodes]
Capturing Causality
• Via scalar Lamport Clocks [Lamport '78]
  – Assigns timestamps to events
  – Timestamp order implies causality
• Replay in timestamp order
  – Episodes with same timestamp can be replayed in parallel
[Figure: episodes of T0, T1 and T2 labeled with scalar timestamps]
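Replay in timestamp order amounts to sorting episodes and grouping equal timestamps into batches that may run in parallel. The sketch below uses made-up timestamps and thread assignments; `replay_batches` is a hypothetical helper, not part of Rerun.

```python
from itertools import groupby

def replay_batches(episodes):
    """episodes: iterable of (timestamp, thread).
    Returns lists of threads; each list can be replayed in parallel,
    and the lists are replayed one after another."""
    ordered = sorted(episodes)
    return [[thread for _, thread in group]
            for _, group in groupby(ordered, key=lambda e: e[0])]

# Hypothetical episode log.
episodes = [(44, "T2"), (43, "T0"), (45, "T0"), (44, "T1")]
batches = replay_batches(episodes)
print(batches)  # prints [['T0'], ['T1', 'T2'], ['T0']]
```

Episodes that share a timestamp are causally unrelated under the scalar clock, so their relative order does not matter.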
Episode Benefits
• Multiple races can be captured by a single episode
  – Reduces amount of information to be logged
• Episodes are created passively
  – No speculation, no rollback
• Episodes can end early
  – Eases implementation
• Episode information is thread-local
  – Promotes scalability, avoids synchronization overheads
Hardware
• Rerun requirements:
  – Detect races: track r/w sets
  – Mark episode boundaries
  – Maintain logical time
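The three requirements can be put together in a small software model. This is only a sketch: exact Python sets stand in for the hardware read/write filters, the `observe` rule is the standard Lamport clock update, and the class `EpisodeRecorder` and its method names are assumptions for illustration.

```python
class EpisodeRecorder:
    def __init__(self):
        self.reads, self.writes = set(), set()   # r/w sets of the current episode
        self.refs = 0                            # memory references in the episode
        self.ts = 0                              # logical (Lamport) time
        self.log = []                            # (timestamp, refs) per episode

    def local_access(self, addr, is_write):
        (self.writes if is_write else self.reads).add(addr)
        self.refs += 1

    def remote_access(self, addr, is_write):
        """Another core touches addr. On a race, end the episode and
        return its timestamp; otherwise return None."""
        race = addr in self.writes or (is_write and addr in self.reads)
        if not race:
            return None
        ended = self.ts
        self.end_episode()
        return ended

    def end_episode(self):
        self.log.append((self.ts, self.refs))    # mark the episode boundary
        self.reads.clear()
        self.writes.clear()
        self.refs = 0
        self.ts += 1

    def observe(self, remote_ts):
        self.ts = max(self.ts, remote_ts + 1)    # order after the remote episode

t0, t1 = EpisodeRecorder(), EpisodeRecorder()
t0.local_access("A", is_write=False)
t0.local_access("B", is_write=True)
no_race = t0.remote_access("A", is_write=False)   # read-read: no race
ended = t0.remote_access("B", is_write=False)     # T1 reads what T0 wrote
t1.observe(ended)
t1.local_access("B", is_write=False)
```

Only the race on B ends T0's episode, so the log entry (0, 2) covers both of its earlier references.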
[Figure: Base System with Core 0 to Core 15, L1 I, L2 banks 0 to 15, Coherence Controller, Interconnect and DRAM. Rerun state added: Write Filter (WF), Read Filter (RF), Timestamp (TS), References (REFS), Memory Timestamp (MTS). Sizes as extracted: 32 bytes, 128 bytes, 2 bytes, 4 bytes, 4 bytes. Total State: 166 bytes/core]
HW Replay Summary
• Requires some modification to existing HW
  – Will CPU manufacturers add the support any time soon? Not likely
• Other low-overhead approaches with SW-based replay
  – ODR: Output-Deterministic Replay for Multicore Debugging, Altekar and Stoica, SOSP 09