
Page 1:

Copyright © 2017 Red Hat Inc.


20 years of Linux Virtual Memory:
from simple server workloads to cloud virtualization

Red Hat, Inc.

Andrea Arcangeli <aarcange at redhat.com>

FOSDEM, Brussels

4 Feb 2017

Page 2:

Topics

● Milestones in the evolution of the Virtual Memory subsystem
● Kernel Virtual Machine design
● Virtual Memory latest innovations
  – Automatic NUMA balancing
  – THP developments
  – userfaultfd
    ● Postcopy live migration, etc.

Page 3:

Virtual Memory (simplified)

[Figure: virtual pages mapped to physical pages; arrows = pagetables, the virtual to physical mapping]

● Virtual pages cost “nothing”: practically unlimited on 64bit archs
● Physical pages cost money! This is the RAM
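That asymmetry is easy to demonstrate. A minimal sketch, assuming a 64bit Linux host: reserving a terabyte of virtual address space succeeds with far less RAM, because physical pages are only allocated when touched.

    /* Sketch: virtual pages cost "nothing" until touched (64bit Linux assumed). */
    #include <stdio.h>
    #include <sys/mman.h>

    int main(void)
    {
        size_t len = 1UL << 40;  /* 1 TiB of virtual address space */
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }
        ((char *)p)[0] = 1;  /* only now is one 4KiB physical page faulted in */
        printf("reserved %zu bytes at %p\n", len, p);
        munmap(p, len);
        return 0;
    }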

Page 4:

PageTables

● Common code and the x86 pagetable format is a tree: pgd → pud → pmd → pte
● All pagetables are 4KB in size
● Total: grep PageTables /proc/meminfo
● (((2**9)**4)*4096)>>48 = 1 → 48bits (see the index sketch below)

[Figure: four-level pagetable tree, one pgd fanning out through pud and pmd nodes to pte leaves, with 2^9 = 512 arrows out of each level, not just 2]
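For illustration, a small sketch of the slide's arithmetic: each of the 4 levels indexes with 9 bits and the page offset takes the final 12, so 9*4+12 = 48 bits. These helpers are simplified versions of what the kernel's pgd_index()/pud_index()/pmd_index()/pte_index() macros roughly compute.

    /* Simplified helpers: decompose a 48bit virtual address into the
     * four 9-bit table indexes (512 slots each) plus the 12-bit offset. */
    static unsigned pgd_index(unsigned long va) { return (va >> 39) & 0x1ff; }
    static unsigned pud_index(unsigned long va) { return (va >> 30) & 0x1ff; }
    static unsigned pmd_index(unsigned long va) { return (va >> 21) & 0x1ff; }
    static unsigned pte_index(unsigned long va) { return (va >> 12) & 0x1ff; }
    static unsigned long page_offset(unsigned long va) { return va & 0xfff; }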

Page 5:

The Fabric of the Virtual Memory

● The fabric is all those data structures that connect to the hardware-constrained structures like pagetables, and that collectively create all the software abstractions we're accustomed to
  – tasks, processes, virtual memory areas, mmap (glibc malloc) ...
● The fabric is the most black and white part of the Virtual Memory
● The algorithms doing the computations on those data structures are the Virtual Memory heuristics
  – They need to solve hard problems with no guaranteed perfect solution
  – i.e. when is the right time to start unmapping pages (swappiness)
● Some of the design didn't change: we still measure how hard it is to free memory while we're trying to free it
● All free memory is used as cache and we overcommit by default (though not excessively by default)
  – Android uses: echo 1 >/proc/sys/vm/overcommit_memory

Page 6:

Physical page and struct page

[Figure: mem_map is an array of 64-byte struct page entries, one per PAGE_SIZE (4KiB) physical page of RAM]

● 64/4096 = 1.56% of the RAM goes into mem_map

Page 7:

MM & VMA

● mm_struct aka MM
  – Memory of the process
    ● Shared by threads
● vm_area_struct aka VMA
  – Virtual Memory Area
    ● Created and torn down by mmap and munmap (see the sketch below)
    ● Defines the virtual address space of an “MM”
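A minimal userland sketch of that lifecycle: the mmap() below typically shows up as one more VMA line in /proc/self/maps, and munmap() tears it down again.

    /* Sketch: watch a VMA appear in this MM (check /proc/self/maps). */
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        void *p = mmap(NULL, 1UL << 20, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }
        printf("new VMA around %p:\n", p);
        char cmd[64];
        snprintf(cmd, sizeof(cmd), "grep -m1 %lx /proc/%d/maps",
                 (unsigned long)p, getpid());
        system(cmd);                  /* prints the vm_area_struct's range */
        munmap(p, 1UL << 20);         /* the VMA is torn down again */
        return 0;
    }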

Page 8:

Page reclaim clock algorithm

[Figure: the mem_map array from page[0] to page[N], with the page scan clock algorithm sweeping circularly over it]

Page 9:

Pgtable scan clock algorithm

[Figure: the page reclaim clock sweeps the mem_map array from page[0] to page[N], while a second pgtable scan clock sweeps the pgtables of each process virtual memory that maps the pages]

Page 10:

Least Recently Used (LRU) list

[Figure: pages of the mem_map array linked into a page_lru list, with the pgtables of each process virtual memory mapping into them]

Page 11:

Active and Inactive list LRU

● The active page LRU preserves the active memory working set
  – only the inactive LRU loses information as fast as use-once I/O goes
  – Introduced in 2001, it works well enough even with an arbitrary balance
  – The optimum active/inactive balancing algorithm was solved in 2012-2014
    ● shadow radix tree nodes that detect refaults (more patches last month)

Page 12:

Active & inactive LRU lists

[Figure: mem_map pages split between the inactive_lru, holding use-once pages to trash, and the active_lru, holding use-many pages, with the pgtables of each process virtual memory mapping into both]

Page 13:

Active and Inactive list LRU

$ grep -i active /proc/meminfo
Active:          3555744 kB
Inactive:        2511156 kB
Active(anon):    2286400 kB
Inactive(anon):  1472540 kB
Active(file):    1269344 kB
Inactive(file):  1038616 kB

Page 14:

rmap obsoleted the pgtable scan clock algorithm

To free the candidate page we must first drop the two arrows (mark the pte non-present)

Page 15:

rmap as in reverse mapping

rmap allows reaching the pagetables of any given physical page without having to scan them all

Page 16:

rmap unmap event

[Figure: the rmap walk drops the mapping arrows and frees the page; if the userland program accesses the page it will trigger a pagein/swapin page fault]

Page 17:

objrmap/anon-vma

[Figure: anonymous physical pages reach their vmas through the anon_vma, filesystem physical pages reach theirs through the inode prio_tree (objrmap), and KSM physical pages use rmap items]

Page 18:

Active & inactive + rmap

[Figure: the active_lru (use-many pages) and inactive_lru (use-once pages to trash) over mem_map, with rmap linking each page back to the pgtables of the process virtual memory that maps it]

Page 19:

Active LRU workingset detection

 fault ------------------------+
                               |
            +--------------+   |            +-------------+
 reclaim <- |   inactive   | <-+-- demotion |    active   | <--+
            +--------------+                +-------------+    |
                   |                                           |
                   +-------------- promotion ------------------+

      +-memory available to cache-+
      |                           |
      +-inactive------+-active----+
  a b | c d e f g h i | J K L M N |
      +---------------+-----------+
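In C-flavored pseudocode, the refault test behind the diagram works roughly like this (a simplified sketch of the mm/workingset.c idea; names, locking and the shadow-entry packing are elided):

    /* Simplified sketch of workingset refault detection: the eviction
     * "age" is packed into a shadow entry left in the radix tree; a small
     * refault distance means the page was evicted only for lack of
     * active-list space, so it is promoted straight to the active LRU. */
    static unsigned long inactive_age;    /* bumped on eviction/activation */
    static unsigned long active_list_size;

    unsigned long shadow_on_evict(void)
    {
        return inactive_age++;            /* stored in the page's tree slot */
    }

    int activate_on_refault(unsigned long shadow)
    {
        unsigned long refault_distance = inactive_age - shadow;
        return refault_distance <= active_list_size;  /* 1 => activate */
    }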

Page 20:

lru→inactive_age and radix tree shadow entries

[Figure: per-inode/file radix trees: the root fans out through radix tree nodes down to the page slots]

Page 21:

Reclaim saving inactive_age

[Figure: on LRU reclaim, inactive_age++ and the page's slot in the inode/file radix tree is replaced by an exceptional/shadow entry recording the memcg LRU inactive_age]

Page 22:

Many more LRUs

● Separate LRUs for anon and file backed mappings
● Memcg (memory cgroups) introduced per-memcg LRUs
● Removal of unfreeable pages from the LRUs
  – anonymous memory with no swap
  – mlocked memory
● Transparent Hugepages in the LRU increase scalability further (lru size decreased 512 times)

Page 23:

Recent Virtual Memory trends

● Optimizing the workloads for you, without manual tuning
  – NUMA hard bindings (numactl) → Automatic NUMA Balancing
  – Hugetlbfs → Transparent Hugepage
    ● THP in tmpfs was merged in Kernel v4.8
  – Programs or Virtual Machines duplicating memory → KSM
  – Page pinning (RDMA/KVM shadow MMU) → MMU notifier
  – Private device memory managed by hand and pinned → HMM/UVM (unified virtual memory), for GPUs seamlessly computing in GPU memory
● The optimizations can be optionally disabled

Page 25:

Virtual Memory in hypervisors

● Can we use all these nice features to manage the memory of Virtual Machines?
  – i.e. in hypervisors?
● Why should we reinvent anything?
  – We don't… with KVM

Page 26:

KVM philosophy

➢ Reuse Linux code as much as possible
➢ Focus on virtualization only, leave other things to respective developers
  ➢ VM
  ➢ cpu scheduler
  ➢ Drivers
  ➢ Numa
  ➢ Power management
➢ Integrate well into existing infrastructure
  ➢ just a kernel module + mmu/sched notifier

Page 27:

KVM design... way to go!!

[Figure: Linux with its drivers and modules running on the hardware; KVM is just another module, and User VMs sit alongside ordinary Linux processes]

Page 28:

KVM task model

[Figure: in the kernel's task model, each guest runs as a task scheduled next to the other ordinary tasks]

Page 29:

KVM userland <-> KVM kernel

[Figure: userspace issues an ioctl(), the kernel switches to guest mode and the guest runs natively; on exit the kernel exit handler takes over and, when needed, returns to the userspace exit handler, which re-enters the guest]
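The loop in the diagram maps onto the /dev/kvm ioctl API. A minimal sketch (guest memory and register setup omitted, error handling elided):

    /* Sketch of the userland side of the diagram: ioctl(KVM_RUN) switches
     * to guest mode; each exit lands back here in the userspace handler. */
    #include <fcntl.h>
    #include <linux/kvm.h>
    #include <sys/ioctl.h>
    #include <sys/mman.h>

    void vcpu_loop(void)
    {
        int kvm  = open("/dev/kvm", O_RDWR);
        int vm   = ioctl(kvm, KVM_CREATE_VM, 0);
        int vcpu = ioctl(vm, KVM_CREATE_VCPU, 0);
        long sz  = ioctl(kvm, KVM_GET_VCPU_MMAP_SIZE, 0);
        struct kvm_run *run = mmap(NULL, sz, PROT_READ | PROT_WRITE,
                                   MAP_SHARED, vcpu, 0);
        for (;;) {
            ioctl(vcpu, KVM_RUN, 0);        /* native guest execution */
            switch (run->exit_reason) {     /* userspace exit handler */
            case KVM_EXIT_IO:               /* emulate the I/O, re-enter */
                break;
            case KVM_EXIT_HLT:
                return;
            }
        }
    }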

Page 31:

Automatic NUMA Balancing benchmark

Intel SandyBridge (Intel(R) Xeon(R) CPU E5-2690 0 @ 2.90GHz)

2 Sockets – 32 Cores with Hyperthreads

256G Memory

RHEV 3.6

Host bare metal – 3.10.0-327.el7 (RHEL7.2)

VM guest – 3.10.0-324.el7 (RHEL7.2)

VM – 32P, 160G (Optimized for Server)

Storage – Violin 6616 – 16G Fibre Channel

Oracle – 12c, 128G SGA

Test – Running Oracle OLTP workload with increasing user count, measuring Trans/min for each run as the comparison metric

Page 32:

4 VMs with different NUMA options: OLTP workload

[Chart: Trans/min vs. user count (10U to 100U) for three configurations: Auto NUMA off, Auto NUMA on with no pinning, Auto NUMA on with NUMA pinning]

Page 33:

Automatic NUMA balancing configuration

● https://tinyurl.com/zupp9v3 https://access.redhat.com/
● In RHEL7 Automatic NUMA balancing is enabled when:
  – # numactl --hardware shows multiple nodes
● To disable automatic NUMA balancing:
  – # echo 0 > /proc/sys/kernel/numa_balancing
● To enable automatic NUMA balancing:
  – # echo 1 > /proc/sys/kernel/numa_balancing
● At boot:
  – numa_balancing=enable|disable

Page 35:

Hugepages

● Traditionally x86 hardware gave us 4KiB pages
● The more memory, the bigger the overhead in managing 4KiB pages
● What if you had bigger pages?
  – 512 times bigger → 2MiB

Page 36:

PageTables

[Figure: the same four-level pgd → pud → pmd → pte tree as before, with 2^9 = 512 arrows out of each level, not just 2]

Page 37:

Benefit of hugepages

● Improve CPU performance
  – Enlarge TLB size (essential for KVM)
  – Speed up TLB miss (essential for KVM)
    ● Need 3 accesses to memory instead of 4 to refill the TLB
  – Faster to allocate memory initially (minor)
  – Page colouring inside the hugepage (minor)
  – Higher scalability of the page LRUs
● Cons
  – clear_page/copy_page less cache friendly
  – higher memory footprint sometimes
  – Direct compaction takes time

Page 38:

TLB miss cost: number of accesses to memory

[Chart: memory accesses needed per TLB miss (0 to 25), bare metal vs. virtualization: novirt THP off/on, and host THP off/on combined with guest THP off/on]

Page 39:

Transparent Hugepage design

● How do we get the benefits of hugetlbfs without having to configure anything?
  – Transparent Hugepage
● Any Linux process will receive 2M pages
  – if the mmap region is 2M naturally aligned
  – if compaction succeeds in producing hugepages
● Entirely transparent to userland

Page 40:

THP sysfs enabled

● /sys/kernel/mm/transparent_hugepage/enabled
  – [always] madvise never
    ● Always use THP if vma start/end permits
  – always [madvise] never
    ● Use THP only inside MADV_HUGEPAGE regions (usage sketch below)
      – Applies to khugepaged too
  – always madvise [never]
    ● Never use THP
      – khugepaged quits
● Default selected at build time
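Under the madvise policy an application opts in per region. A minimal sketch (the 2MiB alignment mirrors the natural-alignment requirement of the THP design; alloc_thp is a hypothetical helper):

    /* Sketch: ask for THP on one region, as KVM does for guest memory. */
    #include <stdlib.h>
    #include <sys/mman.h>

    void *alloc_thp(size_t len)
    {
        void *p = NULL;
        /* 2MiB-aligned so the region can be backed by 2M pages */
        if (posix_memalign(&p, 2UL << 20, len))
            return NULL;
        madvise(p, len, MADV_HUGEPAGE);   /* a hint: the fault path and
                                             khugepaged may now use THP here */
        return p;
    }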

Page 41:

THP defrag - compaction control

● /sys/kernel/mm/transparent_hugepage/defrag
  – [always] defer madvise never
    ● Use direct compaction (ideal for long lived allocations)
  – always [defer] madvise never
    ● Defer compaction asynchronously (kswapd/kcompactd)
  – always defer [madvise] never
    ● Use direct compaction only inside MADV_HUGEPAGE regions
  – always defer madvise [never]
    ● Never use compaction
● Disabling THP is excessive if direct compaction is too expensive
● Default will change to defer to reduce allocation latency
● KVM uses MADV_HUGEPAGE
  – MADV_HUGEPAGE will still use direct compaction

Page 43:

Why: Memory Externalization

● Memory externalization is about running a program with part (or all) of its memory residing on a remote node
● Memory is transferred from the memory node to the compute node on access
● Memory can be transferred from the compute node back to the memory node, if it's not frequently used, during memory pressure

[Figure: a compute node with local memory next to a memory node with remote memory; a userfault pulls a page to the compute node on access, and memory pressure pushes cold pages back to the memory node]

Page 44:

Postcopy live migration

● Postcopy live migration is a form of memory externalization
● When the QEMU compute node (destination) faults on a missing page that resides in the memory node (source), the kernel has no way to fetch the page
  – Solution: let QEMU in userland handle the pagefault (see the userfaultfd sketch below)

Partially funded by the Orbit European Union project

[Figure: postcopy live migration as memory externalization: the QEMU source is the memory node holding the remote memory, the QEMU destination is the compute node with local memory, and userfaults drive the transfers]
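The kernel primitive behind this is the userfaultfd(2) syscall. A hedged sketch of the destination-side flow: the network fetch is stubbed out as page_from_source, one 4KiB page per UFFDIO_COPY, error handling elided:

    /* Sketch: register a range, then resolve guest faults from userland. */
    #include <fcntl.h>
    #include <linux/userfaultfd.h>
    #include <sys/ioctl.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    void serve_userfaults(void *area, unsigned long len, void *page_from_source)
    {
        int uffd = syscall(__NR_userfaultfd, O_CLOEXEC);
        struct uffdio_api api = { .api = UFFD_API };
        ioctl(uffd, UFFDIO_API, &api);

        struct uffdio_register reg = {
            .range = { .start = (unsigned long)area, .len = len },
            .mode  = UFFDIO_REGISTER_MODE_MISSING,
        };
        ioctl(uffd, UFFDIO_REGISTER, &reg);

        struct uffd_msg msg;
        while (read(uffd, &msg, sizeof(msg)) == sizeof(msg)) {
            if (msg.event != UFFD_EVENT_PAGEFAULT)
                continue;
            struct uffdio_copy copy = {
                .dst = msg.arg.pagefault.address & ~0xfffUL,
                .src = (unsigned long)page_from_source, /* fetched from source */
                .len = 4096,
            };
            ioctl(uffd, UFFDIO_COPY, &copy);  /* fills page, wakes the faulter */
        }
    }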

Page 45:

userfaultfd latency

[Histogram: number of userfaults vs. latency in milliseconds (1 to 27) during postcopy live migration over 10Gbit; qemu 2.5+ - RHEL7.2+ - stressapptest running in guest]

Userfaults triggered on pages that were already in network-flight are instantaneous. The background transfer seeks at the last userfault address.

Page 46:

KVM precopy live migration

[Chart: Database TPM before, during and after precopy; 10Gbit NIC, 120GiB guest; ~120sec is the time to transfer the RAM over the network, yet precopy never completes until the database benchmark completes]

Page 47:

KVM postcopy live migration

[Chart: Database TPM over time; precopy runs from 5m to 7m, postcopy runs from 7m to about ~9m (deterministic), and TPM recovers fully after khugepaged collapses THPs]

# virsh migrate .. --postcopy --timeout <sec> --timeout-postcopy
# virsh migrate .. --postcopy --postcopy-after-precopy

Page 48:

All available upstream

● userfaultfd() syscall in Linux Kernel >= v4.3
● Postcopy live migration in:
  – QEMU >= v2.5.0
    ● Author: David Gilbert @ Red Hat Inc.
  – Libvirt >= 1.3.4
  – OpenStack Nova >= Newton
● … and coming soon in production starting with:
  – RHEL 7.3

Page 49:

Live migration total time

[Chart: total migration time in seconds (0 to 500) for autoconverge vs. postcopy]

Page 50:

Live migration max perceived downtime latency

[Chart: max UDP latency in seconds (0 to 25) for precopy timeout, autoconverge and postcopy]

Page 52:

Virtual Memory evolution since '99

● Amazing to see how much room for further innovation there was back then
  – Things constantly look pretty mature
    ● They may actually have been, considering that my hardware back then was much less powerful and no more complex than my cellphone
● Unthinkable to maintain the current level of mission critical complexity by reinventing the wheel in a non Open Source way
  – It can perhaps still be done for a limited set of laptop and cellphone models, but for how long?
● Innovation in the Virtual Memory space is probably one among the many factors that contributed to Linux success, and to the KVM lead in the OpenStack user base too
  – KVM (unlike the preceding hypervisor designs) leverages the power of the Linux Virtual Memory in its entirety