Page 1: Nesting Paging in VM Replay for MPs

CS530 Operating System

Nesting Paging in VM Replay for MPs

Jaehyuk Huh
Computer Science, KAIST

Address Translation in VM
• Need to translate guest VA (gVA) to machine address
– gVA (guest VA) → gPA (guest PA) → sPA (system PA)
• Paravirtualization
– Guest page table (managed by guest OS) directly maps gVA to sPA
– Hypervisor validates guest page table
• Full virtualization
– SW technique: shadow paging
– HW-assisted technique: nested paging
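
The two translation stages listed above can be sketched in a few lines. This is a toy model only: it assumes single-level tables keyed by page number, and the names `translate`, `guest_page_table`, `host_page_table`, and `gva_to_spa`, as well as the mapped page numbers, are hypothetical.

```python
PAGE_SIZE = 4096  # 4KB pages

def translate(page_table, addr):
    """Translate one address through a single-level table keyed by page number."""
    page, offset = divmod(addr, PAGE_SIZE)
    if page not in page_table:
        raise KeyError(f"page fault at {addr:#x}")
    return page_table[page] * PAGE_SIZE + offset

# Hypothetical mappings: page number -> frame number.
guest_page_table = {0x10: 0x2}   # gVA page 0x10 -> gPA frame 0x2 (guest OS)
host_page_table = {0x2: 0x7F}    # gPA frame 0x2 -> sPA frame 0x7F (hypervisor)

def gva_to_spa(gva):
    gpa = translate(guest_page_table, gva)   # stage 1: gVA -> gPA
    return translate(host_page_table, gpa)   # stage 2: gPA -> sPA

print(hex(gva_to_spa(0x10ABC)))  # 0x7fabc
```

Shadow paging collapses the two stages into one software-maintained table, while nested paging lets the hardware walker perform both.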

X86 4KB page tables in long mode

Shadow Page Table
• Shadow page table (sPT)
– Translates from gVA to sPA
– Maintained by VMM (hypervisor)
• VMM intercepts updates of the page table base address
– CR3 updates in x86
– Sets CR3 to the sPT base address instead of the gPT base address
• sPT must be consistent with the guest page table (gPT): gPT updates must be reflected in sPT
• Any page fault must be intercepted by VMM
– VMM must tell guest-induced page faults from VMM-induced ones
– Vectors guest-induced page faults to guest OS
– High overheads for page fault handling

How to make gPT and sPT consistent?
• Write-protecting gPT
– Any modification of gPT (add or remove a translation) causes a fault
– VMM updates sPT accordingly
• Exploiting page-fault behavior and TLB consistency rules
– Adding a page translation
    • Guest OS can add a new translation to gPT without interception by VMM
    • A later access by the guest VM causes a page fault on the new translation
    • VMM updates sPT on the page fault: it must inspect gPT to find the new page
– Deleting a page translation
    • Guest OS executes INVLPG to invalidate the TLB entry
    • VMM intercepts the execution and removes the entry from sPT
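
The second approach (lazy fill on page fault, removal on INVLPG) can be sketched as below. It is a minimal model under stated assumptions: page tables are flat dictionaries, and the class `ShadowPager` with its methods `guest_map`, `access`, and `guest_invlpg` is hypothetical, not an actual VMM interface.

```python
class ShadowPager:
    """Toy VMM keeping a shadow page table (sPT) in sync with a guest page table (gPT)."""

    def __init__(self, gpa_to_spa):
        self.gpa_to_spa = gpa_to_spa  # hypervisor's gPA page -> sPA page map
        self.gpt = {}                 # guest-managed: gVA page -> gPA page
        self.spt = {}                 # VMM-managed:   gVA page -> sPA page
        self.vmm_faults = 0           # page faults that trapped into the VMM

    def guest_map(self, gva_page, gpa_page):
        # Guest adds a translation; the VMM does not intercept this write.
        self.gpt[gva_page] = gpa_page

    def access(self, gva_page):
        if gva_page not in self.spt:       # hardware walks the sPT and faults
            self.vmm_faults += 1
            if gva_page not in self.gpt:   # guest-induced: vector to guest OS
                raise LookupError("guest page fault")
            # VMM-induced: inspect gPT and fill the missing sPT entry.
            self.spt[gva_page] = self.gpa_to_spa[self.gpt[gva_page]]
        return self.spt[gva_page]

    def guest_invlpg(self, gva_page):
        # Guest removes the translation and executes INVLPG; the VMM
        # intercepts INVLPG and drops the stale shadow entry.
        self.gpt.pop(gva_page, None)
        self.spt.pop(gva_page, None)


vmm = ShadowPager({5: 42})
vmm.guest_map(1, 5)
print(vmm.access(1), vmm.vmm_faults)  # first access traps into the VMM
print(vmm.access(1), vmm.vmm_faults)  # second access hits the sPT
```

Every fault goes through the VMM first, which is the source of the overhead that shadow paging is known for.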

Overheads of Shadow Paging
• Any page fault requires expensive VMM intervention
– Guest-induced page faults
– Hypervisor-induced page faults
• Accessed and dirty bit updates
– HW page walker sets bits in sPT (not gPT)
– Guest OS needs the information to make paging decisions
– Dirty bit example: set pages pointed to by sPT read-only
• Problems in MPs
– What if a VM uses multiple processors?
– Replicating sPT for each processor → memory overheads
– Sharing sPT → synchronizing sPT for any change

Shadow Paging Overheads

Nesting Page Table
• A source of address translation overheads in traditional x86 VMM
– A fixed hardware page walker handles a TLB miss
– Can walk only one page table (pointed to by CR3)
• Nested paging
– Separate HW states affecting paging (two copies of CR3, etc.) for guest OS and VMM
– HW page walker can walk both gPT and the nested page table (gPA to sPA)
– TLB can hold a translation from gVA to sPA directly
• Benefit: no more traps on guest page table accesses
• Drawback: extra page table steps add latency to a TLB miss
• May add extra caching for page translation
– Nested TLB
– 2D page walk cache
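
The drawback above can be made concrete by counting memory references in a two-dimensional walk. The sketch assumes no hits in the TLB, the nested TLB, or the page walk cache; the function name `nested_walk_refs` is hypothetical.

```python
def nested_walk_refs(guest_levels, nested_levels):
    """Memory references for one TLB miss under nested paging (no caching)."""
    refs = 0
    for _ in range(guest_levels):
        refs += nested_levels  # translate the gPA of this gPT level
        refs += 1              # read the gPT entry itself
    refs += nested_levels      # translate the final data gPA to sPA
    return refs

print(nested_walk_refs(4, 0))  # native 4-level walk: 4
print(nested_walk_refs(4, 4))  # 4-level guest on 4-level nested table: 24
```

In closed form the count is (n + 1)(m + 1) - 1 for n guest levels and m nested levels, which is why extra caching such as a nested TLB or a 2D page walk cache is attractive.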

Nested Paging

Address Space IDs
• Old x86 did not support address space IDs (ASID) in TLBs
– Must flush TLBs on a VM switch
– Assign an ASID to each VM
– Still need to flush TLBs for a context switch within a VM
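
A toy model illustrates why tagging TLB entries with an ASID avoids the flush on a VM switch. The class `TaggedTLB` and its methods are hypothetical and model only the per-VM tag; context switches inside a VM are not modeled.

```python
class TaggedTLB:
    """Toy TLB that is either ASID-tagged or flushed on every VM switch."""

    def __init__(self, use_asid):
        self.use_asid = use_asid
        self.entries = {}   # (asid, va_page) -> pa_page
        self.current = 0    # ASID of the running VM
        self.flushes = 0

    def switch_vm(self, asid):
        if not self.use_asid:
            self.entries.clear()   # untagged entries would alias across VMs
            self.flushes += 1
        self.current = asid

    def _key(self, va_page):
        return (self.current if self.use_asid else 0, va_page)

    def fill(self, va_page, pa_page):
        self.entries[self._key(va_page)] = pa_page

    def lookup(self, va_page):
        return self.entries.get(self._key(va_page))  # None means TLB miss


tlb = TaggedTLB(use_asid=True)
tlb.switch_vm(1)
tlb.fill(0x10, 0xA)
tlb.switch_vm(2)
print(tlb.lookup(0x10))   # None: VM 2 cannot see VM 1's entry
tlb.switch_vm(1)
print(tlb.lookup(0x10))   # 10: VM 1's entry survived the switches
```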

Replay Papers
• VM-based replay
– Execution Replay for Multiprocessor Virtual Machines
– Dunlap et al.
• HW-based replay
– Rerun: Exploiting Episodes for Lightweight Memory Race Recording
– Hower and Hill
• ODR: Output-Deterministic Replay for Multicore Debugging
– Altekar and Stoica
• Slides adapted from the presentation slides by the paper authors

Big ideas
• Detection and replay of memory races is possible on commodity hardware
• Overhead high for some workloads
• …but surprisingly low for other workloads

Execution Replay

[Figure: a machine with CPU, Memory, Disk, Network, Keyboard/mouse, and Interrupts]

Deterministic Replay
• Deterministic Replay
– Faithfully replay an execution such that all instructions appear to complete in the same order and produce the same result
• Valuable
– Debugging [LeBlanc, et al. - COMP ’87]
    • e.g., time travel debugging, rare bug replication
– Fault tolerance [Bressoud, et al. - SIGOPS ’95]
    • e.g., hot backup virtual machines
– Security [Dunlap et al. – OSDI ’02]
    • e.g., attack analysis
– Tracing [Xu et al. – WDDD ’07]
    • e.g., unobtrusive replay tracing

Single-processor Replay
• Basic principles well understood
– Log all non-deterministic inputs
– Timing of asynchronous events
• Minimal overhead (Dunlap02)
– 13% worst case
– Log for months or years
• Available commercially
– VMware: Record/Replay
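
The basic principle (log each non-deterministic input together with the point in the execution where it arrived, then re-inject it at the same point) can be sketched as follows. The toy "machine" and the names `run`, `live_source`, and `replay_source` are hypothetical; a real system would use an instruction or branch count rather than a loop index as the timing reference.

```python
import random

MOD = 1_000_003

def run(steps, source):
    """Run a toy deterministic computation. `source(step)` returns an
    asynchronous input value, or None when nothing arrives at that step."""
    state, log = 0, []
    for step in range(steps):
        event = source(step)
        if event is not None:
            log.append((step, event))       # record both value and timing
            state = (state * 31 + event) % MOD
        state = (state + step) % MOD        # deterministic work needs no log
    return state, log

def live_source(step):
    # Non-deterministic input: random data at unpredictable steps.
    return random.randrange(256) if random.random() < 0.1 else None

def replay_source(log):
    events = dict(log)
    return lambda step: events.get(step)    # same value at the same step

recorded_state, log = run(1000, live_source)
replayed_state, _ = run(1000, replay_source(log))
print(recorded_state == replayed_state)     # True
```

Only the inputs are logged, not the computation itself, which is why the log stays small on a single processor.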

The Multiprocessor Challenge
• Interleaved reads and writes
– Fine-grained non-determinism
– Much more difficult
• Existing solutions
– Hardware modification
– Software instrumentation
• SMP-ReVirt
– Hardware MMU to detect sharing

Multiprocessor Replay

[Figure: processors P1 and P2 sharing Memory, with racing accesses to n (n=3, n=5, if (n<4))]

Ordering Memory Accesses
• Preserving order will reproduce execution
– a→b: “a happens-before b”
– Ordering is transitive: a→b, b→c means a→c
• Two instructions must be ordered if:
– they both access the same memory, and
– one of them is a write

Constraints: Enforcing order
• To guarantee a→d:
– a→d
– b→d
– a→c
– b→c
• Suppose we need b→c
– b→c is necessary
– a→d is redundant

[Figure: P1 executes a then b; P2 executes c then d; one constraint is labeled "overconstrained"]
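
The redundancy argument can be checked mechanically with a transitive closure over happens-before edges. The sketch assumes, as the figure suggests, that P1 runs a then b and P2 runs c then d; the helper names `closure` and `implied` are hypothetical.

```python
from itertools import product

def closure(edges, nodes):
    """Transitive closure of a happens-before relation (Floyd-Warshall style)."""
    reach = set(edges)
    # product() varies its first element slowest, so k is the outer loop.
    for k, i, j in product(nodes, repeat=3):
        if (i, k) in reach and (k, j) in reach:
            reach.add((i, j))
    return reach

nodes = ["a", "b", "c", "d"]
program_order = {("a", "b"), ("c", "d")}   # assumed: P1 = a, b and P2 = c, d

def implied(constraint, logged):
    """True if `constraint` follows from program order plus logged constraints."""
    return constraint in closure(program_order | logged, nodes)

# Any one of the four candidate constraints guarantees a -> d.
for candidate in [("a", "d"), ("b", "d"), ("a", "c"), ("b", "c")]:
    print(candidate, implied(("a", "d"), {candidate}))

# Once b -> c is logged, a -> d adds nothing; the reverse does not hold.
print(implied(("a", "d"), {("b", "c")}))   # True
print(implied(("b", "c"), {("a", "d")}))   # False
```

A recorder that logs only constraints not already implied keeps the log small without losing any ordering.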

CREW Protocol
• Each shared object is in one of two states:
– Concurrent-Read: all processors can read, none can write
– Exclusive-Write: one processor (the owner) can read and write; others have no access
• Enforced with hardware MMU
– Read/write
– Read-only
– None
• Change CREW states on demand
– Fault, fixup, re-execute
• CREW event
– Increasing or reducing permission due to CREW state changes

CREW Property
• If two instructions on different processors:
– access the same page,
– and one of them is a write,
– there will be a CREW event on each processor between them.

Generating Constraints
• State: Concurrent Read
– All processors read-only
• d*: CREW fault
• New state: P2 Exclusive
• r: privilege reduction
– Read to None
• i: privilege increase
– Read to Read/write
• Log timing of r and i
• Constraint:
– r → i

[Figure: timelines for P1 and P2 showing a, r, i, d* and d]
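
The fault, fixup, and logging steps can be modeled for a single page as below. This is only a sketch under stated assumptions: the class `CrewPage`, the `RANK` table, and the use of per-CPU access counts as the "timing" of r and i are hypothetical simplifications, not SMP-ReVirt's implementation.

```python
RANK = {"none": 0, "r": 1, "rw": 2}

class CrewPage:
    """One shared page under CREW, logging r -> i constraints."""

    def __init__(self, cpus):
        self.perm = {cpu: "r" for cpu in cpus}   # start in Concurrent-Read
        self.count = {cpu: 0 for cpu in cpus}    # accesses completed per CPU
        self.constraints = []                    # ((cpu, count), (cpu, count))

    def _change(self, new_perm):
        reduced, increased = [], []
        for cpu, new in new_perm.items():
            old = self.perm[cpu]
            if RANK[new] < RANK[old]:
                reduced.append((cpu, self.count[cpu]))     # event r
            elif RANK[new] > RANK[old]:
                increased.append((cpu, self.count[cpu]))   # event i
            self.perm[cpu] = new
        # Every reduction must happen before every increase: r -> i.
        self.constraints += [(r, i) for r in reduced for i in increased]

    def access(self, cpu, write):
        needed = "rw" if write else "r"
        if RANK[self.perm[cpu]] < RANK[needed]:            # CREW fault
            if write:   # fixup: Exclusive-Write owned by the faulting CPU
                self._change({c: "rw" if c == cpu else "none" for c in self.perm})
            else:       # fixup: back to Concurrent-Read
                self._change({c: "r" for c in self.perm})
        self.count[cpu] += 1                               # access (re-)executes


page = CrewPage(["P1", "P2"])
page.access("P1", write=False)   # a: read, no fault
page.access("P2", write=True)    # d*: fault, then d re-executes
print(page.constraints)          # [(('P1', 1), ('P2', 0))]
```

The logged pair says that P1 must have completed one access before P2 performs its first, which is enough to reproduce the order of a and d during replay.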

Predicting results
• Key changes in sharing attributes
– 4096-byte sharing granularity
– “Miss” is very expensive
• SPLASH2
– Good: high spatial locality / low false sharing
– Bad: random access patterns / high false sharing
• The Linux kernel
– Tuned to 16-byte cacheline
– Involving the kernel may be expensive

Single-processor Xen guests

[Figure: bar chart of normalized runtime (axis 0 to 1.2) for FMM, LU, ocean, radix, water-spatial, kernel-build, radiosity, and dbench. Series: Unmodified 1-cpu guest, Logging 1-cpu guest. Bar labels in extraction order: 1.00, 1.04, 1.01, 1.00, 1.03, 1.13, 1.00, 1.05]

2-processor Xen guests

[Figure: bar chart of normalized runtime (axis 0 to 2.5) for FMM, LU, ocean, radix, water-spatial, and kernel-build. Series: Unmodified 2-cpu guest, Logging 2-cpu guest, Logging 1-cpu guest. Bar labels in extraction order: 1.51, 1.00, 1.08, 1.60, 1.48, 2.10, 1.90, 1.76, 1.96, 1.74, 1.83, 1.99]

2-processor, cont’d

[Figure: bar chart of normalized runtime (axis 0 to 10) for radiosity and dbench. Series: Unmodified 2-cpu guest, Logging 2-cpu guest, Logging 1-cpu guest. Bar labels in extraction order: 8.70, 7.21, 1.85, 1.88]

4-processor Xen guests

[Figure: bar chart of normalized runtime (axis 0 to 10) for FMM, LU, ocean, radix, water-spatial, and kernel-build. Series: Unmodified domain, 4 cpus; CREW logging, 4 cpus; CREW logging, 2 cpus*; CREW logging, 1 cpu. Bar labels in extraction order: 7.36, 1.12, 1.28, 4.20, 1.72, 9.03]

HW Memory Race Recording
• SW-only approach
– Too slow to be turned on always
– SW alters the execution path
• Want
– Small log: record longer for same state
– Small hardware: reduce cost, especially when not used
– Unobtrusive: should not alter execution
• Rerun: Exploiting Episodes for Lightweight Memory Race Recording

Episodic Recording
• Most code executes without races
– Use race-free regions as unit of ordering
• Episodes: independent execution regions
– Defined per thread
– Identified passively: does not affect execution
– Encompass every instruction

[Figure: load/store streams of threads T0, T1, and T2 (LD A, ST B, ST C, LD F, and so on) partitioned into episodes]

Capturing Causality
• Via scalar Lamport Clocks [Lamport ’78]
– Assigns timestamps to events
– Timestamp order respects causality: if a→b, then a's timestamp is smaller than b's
• Replay in timestamp order
– Episodes with the same timestamp can be replayed in parallel

[Figure: episodes of T0, T1, and T2 labeled with timestamps such as 22, 23, 43, 44, 45, 60, 61, and 62]
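
Scalar Lamport clocks over episodes can be sketched as below. This is a simplified software model, not Rerun's hardware protocol: the class `Thread` and the functions `conflict` and `replay_schedule` are hypothetical, and a "conflict" stands for one thread touching data last used by another.

```python
from collections import defaultdict

class Thread:
    def __init__(self, name):
        self.name = name
        self.ts = 0      # scalar Lamport clock of the current episode
        self.log = []    # (timestamp, name) of finished episodes

    def end_episode(self):
        self.log.append((self.ts, self.name))
        self.ts += 1
        return self.log[-1][0]

def conflict(src, dst):
    """dst touches data last used by src: src ends its episode, and dst's
    clock moves past it so that timestamp order respects the dependence."""
    sent = src.end_episode()
    dst.ts = max(dst.ts, sent + 1)

def replay_schedule(threads):
    """Group episodes by timestamp; each group may be replayed in parallel."""
    waves = defaultdict(list)
    for t in threads:
        for ts, name in t.log:
            waves[ts].append(name)
    return [sorted(waves[ts]) for ts in sorted(waves)]


t0, t1, t2 = Thread("T0"), Thread("T1"), Thread("T2")
conflict(t0, t1)     # T1 depends on T0's first episode
conflict(t1, t2)     # T2 depends on T1's first episode
t2.end_episode()
t0.end_episode()
print(replay_schedule([t0, t1, t2]))   # [['T0'], ['T0', 'T1'], ['T2']]
```

The middle group shows two episodes with equal timestamps: neither depends on the other, so a replayer is free to run them concurrently.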

Episode Benefits
• Multiple races can be captured by a single episode
– Reduces amount of information to be logged
• Episodes are created passively
– No speculation, no rollback
• Episodes can end early
– Eases implementation
• Episode information is thread-local
– Promotes scalability, avoids synchronization overheads

Note (Derek Hower): A result of independence recording. Unlike some other race recorders, Rerun can reduce dependencies without any additional processing.

Hardware
• Rerun requirements:
– Detect races: track r/w sets
– Mark episode boundaries
– Maintain logical time

[Figure: Base System with Core 0 … Core 15, L1 I, Coherence Controller, L2 0 … L2 15, Interconnect, and DRAM]

Rerun state
• Per core: Write Filter (WF), Read Filter (RF), Timestamp (TS), References (REFS); sizes as listed on the slide: 32 bytes, 128 bytes, 2 bytes, 4 bytes
• Memory Timestamp (MTS): 4 bytes
• Total State: 166 bytes/core

HW Replay Summary
• Requires some modification to existing HW
– Will CPU manufacturers add the support any time soon? Not likely
• Other low-overhead approaches with SW-based replay
– ODR: Output-Deterministic Replay for Multicore Debugging, Altekar and Stoica, SOSP ’09