nesting paging in vm replay for mps

CS530 Operating System

Nesting Paging in VM; Replay for MPs

Jaehyuk Huh

Computer Science, KAIST

Address Translation in VM

• Need to translate guest VA (gVA) to machine address
  – gVA (guest VA) → gPA (guest PA) → sPA (system PA)
• Paravirtualization
  – Guest page table (managed by guest OS) directly maps gVA to sPA
  – Hypervisor validates guest page table
• Full virtualization
  – SW technique: shadow paging
  – HW-assisted technique: nested paging
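The two translation stages can be sketched as a pair of lookups. This is a minimal illustration only: the single-level dictionaries, the page numbers, and the names `translate` and `gva_to_spa` are assumptions made for the example, not part of any real hypervisor.

```python
PAGE_SIZE = 4096  # 4 KB pages


def translate(page_table, addr):
    """Map an address through a flat {page number: frame number} table."""
    page, offset = divmod(addr, PAGE_SIZE)
    if page not in page_table:
        raise KeyError(f"page fault at {addr:#x}")
    return page_table[page] * PAGE_SIZE + offset


# Hypothetical mappings, chosen only for illustration.
guest_page_table = {0x10: 0x2}   # gVA page -> gPA page (managed by the guest OS)
host_page_table = {0x2: 0x7A}    # gPA page -> sPA page (managed by the hypervisor)


def gva_to_spa(gva):
    """Translate gVA -> gPA -> sPA; the page offset is preserved at each stage."""
    gpa = translate(guest_page_table, gva)
    return translate(host_page_table, gpa)
```

Shadow paging precomputes the composition of the two tables in software, while nested paging lets the hardware walk both.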

X86 4KB page tables in long mode
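As a small companion to the figure, the sketch below splits a 48-bit virtual address into the four 9-bit table indices and the 12-bit page offset used by 4 KB pages in long mode. The function name `split_va` is an assumption for this example.

```python
def split_va(va):
    """Split a 48-bit x86-64 virtual address for 4-level paging with 4 KB pages."""
    offset = va & 0xFFF           # bits 11:0, byte offset within the page
    pt = (va >> 12) & 0x1FF       # bits 20:12, page table index
    pd = (va >> 21) & 0x1FF       # bits 29:21, page directory index
    pdpt = (va >> 30) & 0x1FF     # bits 38:30, page directory pointer index
    pml4 = (va >> 39) & 0x1FF     # bits 47:39, PML4 index
    return pml4, pdpt, pd, pt, offset
```

Each index selects one of 512 eight-byte entries in a 4 KB table, so a TLB miss costs up to four memory references on a native walk.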

Shadow Page Table

• Shadow page table (sPT)
  – Translates from gVA to sPA
  – Maintained by VMM (hypervisor)
• VMM intercepts updates of the page table base address
  – CR3 updates in x86
  – Sets CR3 with the sPT base address instead of the gPT base address
• sPT must be consistent with the guest page table (gPT): gPT updates must be reflected in sPT
• Any page fault must be intercepted by VMM
  – VMM must tell guest-induced page faults from VMM-induced ones
  – Vectors guest-induced page faults to the guest OS
  – High overheads for page fault handling

How to make gPT and sPT consistent?

• Write-protecting gPT
  – Any modification of gPT (adding or removing a translation) causes a fault
  – VMM updates sPT accordingly
• Exploiting page-fault behavior and TLB consistency rules
  – Adding a page translation
    • Guest OS can add a new translation to gPT without interception by VMM
    • A later access by the guest VM causes a page fault on the new translation
    • VMM updates sPT on the page fault: it must inspect gPT to find the new page
  – Deleting a page translation
    • Guest OS executes INVLPG to invalidate the TLB entry
    • VMM intercepts the execution and removes the entry from sPT
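The lazy synchronization scheme (fill sPT on a page fault, drop entries on INVLPG) can be sketched as a toy model. The class `ShadowPaging`, its method names, and the flat dictionaries are hypothetical and exist only for this illustration; a real VMM works on multi-level tables and hardware fault codes.

```python
class ShadowPaging:
    """Toy model of lazily synchronized shadow paging (page numbers only)."""

    def __init__(self, gpa_to_spa):
        self.gpa_to_spa = gpa_to_spa  # VMM's map: gPA page -> sPA page
        self.gpt = {}                 # guest page table: gVA page -> gPA page
        self.spt = {}                 # shadow page table: gVA page -> sPA page
        self.faults_to_guest = 0      # faults forwarded to the guest OS

    def guest_map(self, gva_page, gpa_page):
        # The guest adds a translation; the VMM does not intercept this write.
        self.gpt[gva_page] = gpa_page

    def guest_unmap(self, gva_page):
        # The guest removes the translation, then executes INVLPG.
        del self.gpt[gva_page]
        self.invlpg(gva_page)

    def invlpg(self, gva_page):
        # The VMM intercepts INVLPG and removes the entry from sPT.
        self.spt.pop(gva_page, None)

    def access(self, gva_page):
        if gva_page in self.spt:
            return self.spt[gva_page]          # hit in sPT, no VMM involvement
        # Page fault, intercepted by the VMM.
        if gva_page in self.gpt:
            # VMM-induced fault: gPT has the mapping, sPT is merely stale.
            spa_page = self.gpa_to_spa[self.gpt[gva_page]]
            self.spt[gva_page] = spa_page
            return spa_page
        # Guest-induced fault: no mapping in gPT, so forward to the guest OS.
        self.faults_to_guest += 1
        return None
```

In this model a VMM-induced fault is invisible to the guest, while a guest-induced one is counted and forwarded, which mirrors the distinction the VMM has to make on every fault.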

Overheads of Shadow Paging

• Any page fault requires expensive VMM intervention
  – Guest-induced page faults
  – Hypervisor-induced page faults
• Accessed and dirty bit updates
  – HW page walker sets bits in sPT (not gPT)
  – Guest OS needs the information to make paging decisions
  – Dirty bit example: set pages pointed to by sPT read-only (the first write then faults, letting the VMM update the dirty bit in gPT)
• Problems in MPs
  – What if a VM uses multiple processors?
  – Replicating sPT for each processor? Memory overheads
  – Sharing sPT? Synchronizing sPT for any change

Shadow Paging Overheads

Nesting Page Table

• A source of address translation overheads in traditional x86 VMM
  – A fixed hardware page walker handles a TLB miss
  – Can walk only one page table (pointed to by CR3)
• Nested paging
  – Separate HW states affecting paging (two copies of CR3, etc.) for guest OS and VMM
  – HW page walker can walk both gPT and the nested page table (gPA to sPA)
  – TLB can hold a translation from gVA to sPA directly
• Benefit: no more traps on guest page table accesses
• Drawback: extra page table steps add latency to a TLB miss
• May add extra caching for page translation
  – Nested TLB
  – 2D page walk cache
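To make the drawback concrete, the sketch below counts memory references for one TLB miss under nested paging. It assumes no nested TLB or page walk cache hits, and the function name `nested_walk_refs` is hypothetical.

```python
def nested_walk_refs(guest_levels, nested_levels):
    """Count memory references for one 2D page walk with no caching."""
    refs = 0
    for _ in range(guest_levels):
        # The pointer to each guest table is a gPA, so it must first be
        # translated by a full walk of the nested table...
        refs += nested_levels
        # ...and then the guest page table entry itself is read.
        refs += 1
    # The final gPA of the data page also needs a nested walk.
    refs += nested_levels
    return refs


native = 4                       # a native 4-level walk reads 4 entries
nested = nested_walk_refs(4, 4)  # 4-level guest table over 4-level nested table
```

With four levels in each dimension this gives 24 references instead of 4, i.e. n*m + n + m for n guest levels and m nested levels, which is why the nested TLB and 2D page walk cache matter.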

Nested Paging

Address Space IDs

• Old x86 did not support address space IDs (ASIDs) in TLBs
  – Must flush TLBs on a VM switch
  – Assign an ASID to each VM
  – Still need to flush TLBs for a context switch within a VM
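A toy model of an ASID-tagged TLB shows why a VM switch no longer needs a flush while a context switch inside a VM still does. The class `TaggedTLB` and its methods are assumptions for illustration and do not model any real TLB organization.

```python
class TaggedTLB:
    """Toy TLB whose entries are tagged with an address space ID (one per VM)."""

    def __init__(self):
        self.entries = {}  # (asid, virtual page) -> physical page

    def insert(self, asid, vpage, ppage):
        self.entries[(asid, vpage)] = ppage

    def lookup(self, asid, vpage):
        # Entries of other ASIDs never match, so they can stay resident.
        return self.entries.get((asid, vpage))

    def flush_asid(self, asid):
        # Needed on a context switch within one VM: the guest's processes
        # share the VM's ASID, so their stale entries must be dropped.
        self.entries = {k: v for k, v in self.entries.items() if k[0] != asid}
```

Two VMs can then cache the same virtual page under different tags, and flushing one VM's entries leaves the other's intact.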

Replay Papers

• VM-based replay
  – Execution Replay for Multiprocessor Virtual Machines
  – Dunlap et al.
• HW-based replay
  – Rerun: Exploiting Episodes for Lightweight Memory Race Recording
  – Hower and Hill
• ODR: Output-Deterministic Replay for Multicore Debugging
  – Altekar and Stoica
• Slides adapted from the presentation slides by the paper authors

Big ideas

• Detection and replay of memory races is possible on commodity hardware
• Overhead high for some workloads
• …but surprisingly low for other workloads

Execution Replay

[Figure: system components involved in replay: CPU, Memory, Disk, Network, Keyboard/mouse, Interrupts]

Deterministic Replay

• Deterministic replay
  – Faithfully replay an execution such that all instructions appear to complete in the same order and produce the same result
• Valuable
  – Debugging [LeBlanc et al., COMP ’87]
    • e.g., time travel debugging, rare bug replication
  – Fault tolerance [Bressoud et al., SIGOPS ’95]
    • e.g., hot backup virtual machines
  – Security [Dunlap et al., OSDI ’02]
    • e.g., attack analysis
  – Tracing [Xu et al., WDDD ’07]
    • e.g., unobtrusive replay tracing


Single-processor Replay

• Basic principles well understood
  – Log all non-deterministic inputs
  – Timing of asynchronous events
• Minimal overhead (Dunlap02)
  – 13% worst case
  – Log for months or years
• Available commercially
  – VMware: Record/Replay
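The logging principle can be sketched in a few lines. The sketch assumes a guest whose only non-determinism is a single input call; `Recorder`, `Replayer`, and `guest` are hypothetical names, and the timing of asynchronous events (which a real system must also log, for example as a position in the instruction stream) is left out.

```python
import random


class Recorder:
    """Supplies non-deterministic inputs and logs every value it hands out."""

    def __init__(self, rng):
        self.rng = rng
        self.log = []

    def read_input(self):
        value = self.rng.randint(0, 255)  # stand-in for a device read
        self.log.append(value)
        return value


class Replayer:
    """Feeds back the logged inputs in their original order."""

    def __init__(self, log):
        self.pending = iter(log)

    def read_input(self):
        return next(self.pending)


def guest(io, steps=5):
    """Deterministic computation except for the values from io.read_input()."""
    state = 0
    for _ in range(steps):
        state = (state * 31 + io.read_input()) % 1_000_003
    return state


recorder = Recorder(random.Random())
original = guest(recorder)
replayed = guest(Replayer(recorder.log))
```

Because everything else in `guest` is deterministic, replaying the log reproduces the same final state; this is the property that breaks once a second processor can interleave memory accesses.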

The Multiprocessor Challenge

• Interleaved reads and writes
  – Fine-grained non-determinism
  – Much more difficult
• Existing solutions
  – Hardware modification
  – Software instrumentation
• SMP-ReVirt
  – Hardware MMU to detect sharing

Multiprocessor Replay

[Figure: processors P1 and P2 sharing Memory; example accesses n=3, n=5, and if (n<4)]

Ordering Memory Accesses

• Preserving order will reproduce execution
  – a→b: “a happens-before b”
  – Ordering is transitive: a→b, b→c means a→c
• Two instructions must be ordered if:
  – they both access the same memory, and
  – one of them is a write
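The ordering rule and transitivity can be written down directly. In this sketch an access is a hypothetical `(processor, address, is_write)` tuple, and `must_order` and `transitive_closure` are names invented for the example; accesses on the same processor are assumed to be ordered by program order already.

```python
from itertools import product


def must_order(x, y):
    """True if two accesses from different processors conflict.

    Each access is a (processor, address, is_write) tuple.
    """
    different_cpu = x[0] != y[0]
    same_memory = x[1] == y[1]
    one_is_write = x[2] or y[2]
    return different_cpu and same_memory and one_is_write


def transitive_closure(edges):
    """Return every ordering implied by the happens-before edges (a, b)."""
    closure = set(edges)
    changed = True
    while changed:
        changed = False
        # Iterate over a snapshot so the set can grow inside the loop.
        for (a, b), (c, d) in product(list(closure), repeat=2):
            if b == c and (a, d) not in closure:
                closure.add((a, d))
                changed = True
    return closure
```

An edge that already appears in the closure of the other logged edges does not need to be logged, which is what makes some constraints redundant.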

Constraints: Enforcing order

• To guarantee a→d:
  – a→d
  – b→d
  – a→c
  – b→c
• Suppose we need b→c
  – b→c is necessary
  – a→d is redundant

[Figure: timelines of P1 and P2 with instructions a, b, c, and d; label: overconstrained]

CREW Protocol

• Each shared object is in one of two states:
  – Concurrent-Read: all processors can read, none can write
  – Exclusive-Write: one processor (the owner) can read and write; others have no access
• Enforced with hardware MMU
  – Read/write
  – Read-only
  – None
• Change CREW states on demand
  – Fault, fixup, re-execute
• CREW event
  – Increasing or reducing permission due to CREW state changes

CREW Property

• If two instructions on different processors
  – access the same page, and
  – one of them is a write,
• then there will be a CREW event on each processor between them.

Generating Constraints

• State: Concurrent-Read
  – All processors read-only
• d*: CREW fault
• New state: P2 Exclusive
• r: privilege reduction
  – Read to None
• i: privilege increase
  – Read to Read/write
• Log timing of r and i
• Constraint:
  – r → i

[Figure: timelines of P1 and P2 showing a, d*, r, i, and d]
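The fault, fixup, and logging steps can be sketched for a single page. This is a simplified model, not SMP-ReVirt's implementation: the class `CrewPage`, the permission strings, and the logical clock used as the "timing" of each event are all assumptions for the example.

```python
class CrewPage:
    """Simplified CREW state for one page shared by several processors."""

    def __init__(self, cpus):
        self.perm = {c: "r" for c in cpus}  # Concurrent-Read: all read-only
        self.clock = 0                      # logical time of CREW events
        self.log = []                       # (time, cpu, kind) per CREW event
        self.constraints = []               # (reduction, increase): r -> i

    def _event(self, cpu, kind, new_perm):
        self.clock += 1
        self.perm[cpu] = new_perm
        event = (self.clock, cpu, kind)
        self.log.append(event)
        return event

    def write(self, cpu):
        if self.perm[cpu] == "rw":
            return  # already the exclusive owner, no fault
        # CREW fault: switch to Exclusive-Write owned by `cpu`.
        reductions = [self._event(o, "reduce", "none")
                      for o in self.perm if o != cpu and self.perm[o] != "none"]
        increase = self._event(cpu, "increase", "rw")
        self.constraints += [(r, increase) for r in reductions]

    def read(self, cpu):
        if self.perm[cpu] != "none":
            return  # read-only or read/write both allow reads
        # CREW fault: switch back to Concurrent-Read for every processor.
        reductions = [self._event(o, "reduce", "r")
                      for o in self.perm if self.perm[o] == "rw"]
        increases = [self._event(o, "increase", "r")
                     for o in self.perm if self.perm[o] == "none"]
        self.constraints += [(r, i) for r in reductions for i in increases]
```

Replaying the logged events so that each reduction happens before its paired increase is enough to order every conflicting pair of accesses to the page, at the cost of also ordering accesses that merely share the page.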

Predicting results

• Key changes in sharing attributes
  – 4096-byte sharing granularity
  – “Miss” is very expensive
• SPLASH2
  – Good: high spatial locality / low false sharing
  – Bad: random access patterns / high false sharing
• The Linux kernel
  – Tuned to 16-byte cacheline
  – Involving the kernel may be expensive

Single-processor Xen guests

[Figure: bar chart of normalized runtime (y-axis 0 to 1.2) for FMM, LU, ocean, radix, water-spatial, kernel-build, radiosity, and dbench; series: Unmodified 1-cpu guest, Logging 1-cpu guest; data labels: 1.00, 1.04, 1.01, 1.00, 1.03, 1.13, 1.00, 1.05]

2-processor Xen guests

[Figure: bar chart of normalized runtime (y-axis 0 to 2.5) for FMM, LU, ocean, radix, water-spatial, and kernel-build; series: Unmodified 2-cpu guest, Logging 2-cpu guest, Logging 1-cpu guest; data labels: 1.51, 1.00, 1.08, 1.60, 1.48, 2.10, 1.90, 1.76, 1.96, 1.74, 1.83, 1.99]

2-processor, cont'd

[Figure: bar chart of normalized runtime (y-axis 0 to 10) for radiosity and dbench; series: Unmodified 2-cpu guest, Logging 2-cpu guest, Logging 1-cpu guest; data labels: 8.70, 7.21, 1.85, 1.88]

4-processor Xen guests

[Figure: bar chart of normalized runtime (y-axis 0 to 10) for FMM, LU, ocean, radix, water-spatial, and kernel-build; series: Unmodified domain, 4 cpus; CREW logging, 4 cpus; CREW logging, 2 cpus*; CREW logging, 1 cpu; data labels: 7.36, 1.12, 1.28, 4.20, 1.72, 9.03]

HW Memory Race Recording

• SW-only approach
  – Too slow to be turned on always
  – SW alters the execution path
• Want
  – Small log: record longer for the same state
  – Small hardware: reduce cost, especially when not used
  – Unobtrusive: should not alter execution
• Rerun: Exploiting Episodes for Lightweight Memory Race Recording


Episodic Recording

• Most code executes without races
  – Use race-free regions as unit of ordering
• Episodes: independent execution regions
  – Defined per thread
  – Identified passively: does not affect execution
  – Encompass every instruction


[Figure: load (LD) and store (ST) streams of threads T0, T1, and T2 partitioned into episodes]

Capturing Causality

• Via scalar Lamport Clocks [Lamport ’78]
  – Assigns timestamps to events
  – Timestamp order is consistent with causality: if a→b, then a's timestamp is smaller than b's
• Replay in timestamp order
  – Episodes with the same timestamp can be replayed in parallel


[Figure: episodes of threads T0, T1, and T2 labeled with timestamps (22, 23, 43, 44, 45, 60, 61, 62)]
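Episode formation with scalar timestamps can be sketched in software. This is a simplified model under stated assumptions, not the hardware protocol from the Rerun paper: exact sets stand in for the read/write filters, per-address tables stand in for timestamps carried with memory, and the class `EpisodeRecorder` and its fields are hypothetical.

```python
class EpisodeRecorder:
    """Simplified episode recorder with scalar Lamport timestamps."""

    def __init__(self, n_threads):
        self.ts = [0] * n_threads                    # timestamp of current episode
        self.refs = [0] * n_threads                  # references in current episode
        self.reads = [set() for _ in range(n_threads)]
        self.writes = [set() for _ in range(n_threads)]
        self.logs = [[] for _ in range(n_threads)]   # thread-local (timestamp, refs)
        self.last_read = {}   # addr -> highest timestamp of a finished reader episode
        self.last_write = {}  # addr -> highest timestamp of a finished writer episode

    def end_episode(self, t):
        if self.refs[t] == 0:
            return  # nothing executed in this episode yet
        ts = self.ts[t]
        self.logs[t].append((ts, self.refs[t]))
        for a in self.reads[t]:
            self.last_read[a] = max(self.last_read.get(a, -1), ts)
        for a in self.writes[t]:
            self.last_write[a] = max(self.last_write.get(a, -1), ts)
        self.reads[t].clear()
        self.writes[t].clear()
        self.refs[t] = 0
        self.ts[t] = ts + 1

    def access(self, t, addr, is_write):
        # A race with another thread's current episode ends that episode.
        for o in range(len(self.ts)):
            if o != t and (addr in self.writes[o]
                           or (is_write and addr in self.reads[o])):
                self.end_episode(o)
        # Order this episode after every finished episode it conflicts with.
        prior = self.last_write.get(addr, -1)
        if is_write:
            prior = max(prior, self.last_read.get(addr, -1))
        self.ts[t] = max(self.ts[t], prior + 1)
        (self.writes[t] if is_write else self.reads[t]).add(addr)
        self.refs[t] += 1
```

Each log entry records only a timestamp and a reference count, so one entry can cover many races, and replaying episodes in timestamp order respects every conflict this model detects.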

Episode Benefits

• Multiple races can be captured by a single episode
  – Reduces amount of information to be logged
• Episodes are created passively
  – No speculation, no rollback
• Episodes can end early
  – Eases implementation
• Episode information is thread-local
  – Promotes scalability, avoids synchronization overheads


Note (Derek Hower): A result of independence recording. Unlike some other race recorders, Rerun can reduce dependencies without any additional processing.

Hardware

• Rerun requirements:
  – Detect races: track r/w sets
  – Mark episode boundaries
  – Maintain logical time


[Figure: Base System with 16 cores (Core 0 to Core 15), L1 I caches, L2 banks 0 to 15, coherence controllers, interconnect, and DRAM]

[Figure: Rerun state added to the base system: Write Filter (WF), Read Filter (RF), Timestamp (TS), References (REFS), and Memory Timestamp (MTS); sizes shown: 32 bytes, 128 bytes, 2 bytes, 4 bytes, 4 bytes]

Total State: 166 bytes/core

HW Replay Summary

• Requires some modification to existing HW
  – Will CPU manufacturers add the support any time soon? Not likely
• Other low-overhead approaches with SW-based replay
  – ODR: Output-Deterministic Replay for Multicore Debugging, Altekar and Stoica, SOSP ’09
