red hat, inc. · – tasks, processes, virtual memory areas, mmap (glibc malloc) ... the fabric is...
TRANSCRIPT
Copyright © 2017 Red Hat Inc.
1
20 years of Linux Virtual Memory: 20 years of Linux Virtual Memory: from simple server workloads to from simple server workloads to
cloud virtualizationcloud virtualization
Red Hat, Inc.
Andrea Arcangeli <aarcange at redhat.com>
FOSDEM, Brussels
4 Feb 2017
Copyright © 2017 Red Hat Inc.
2
Topics● Milestones in the evolution of the Virtual
Memory subsystem● Kernel Virtual Machine design● Virtual Memory latest innovations
– Automatic NUMA balancing– THP developments– userfaultfd
● Postcopy live Migration, etc..
Copyright © 2017 Red Hat Inc.
3
Virtual Memory (simplified)Virtual pagesThey cost “nothing”Practically unlimitedon 64bit archs
Physical pagesThey cost money!This is the RAM
arrows = pagetablesvirtual to physical mapping
Copyright © 2017 Red Hat Inc.
4
PageTablesPageTables● Common code and x86 pagetable format is a
tree
● All pagetables are 4KB in size● Total: grep PageTables /proc/meminfo● (((2**9)**4)*4096)>>48 = 1 → 48bits
pgd
pmd
pte
pud
pte
pmd
pte pte
pmd
pte
pud
pte
pmd
pte pte
2^9 = 512 arrows, not just 2
Copyright © 2017 Red Hat Inc.
5
The Fabric of the Virtual Memory● The fabric are all those data structures that connects to the hardware
constrainted structures like pagetables and that collectively create all the software abstractions we're accustomed to– tasks, processes, virtual memory areas, mmap (glibc malloc) ...
● The fabric is the most black and white part of the Virtual Memory● The algorithms doing the computations on those data structures are the
Virtual Memory heuristics– They need to solve hard problems with no guaranteed perfect solution– i.e. when it's the right time to start to unmap pages (swappiness)
● Some of the design didn't change: we still measure how hard it is to free memory while we're trying to free it
● All free memory is used as cache and we overcommit by default (not excessively by default)– Android uses: echo 1 >/proc/sys/vm/overcommit_memory
Copyright © 2017 Red Hat Inc.
6
Physical page and struct page
PAGE_SIZE (4KiB) large physical pagestruct page64 bytes
PAGESIZE4KiB
...Physical RAMmem_map
64/4096
1.56%of the RAM
Copyright © 2017 Red Hat Inc.
7
MM & VMA● mm_struct aka MM
– Memory of the process● Shared by threads
● vm_area_struct aka VMA– Virtual Memory Area
● Created and teardown by mmap and munmap
● Defines the virtual address space of an “MM”
Copyright © 2017 Red Hat Inc.
8
Page reclaim clock algorithm
page[0]
page[1]
page[N]
mem
_map
page
sca
n cl
ock
algo
rith
m
Copyright © 2017 Red Hat Inc.
9
Pgtable scan clock algorithm
page[0]
page[1]
page[N]
mem
_map
page
rec
laim
clo
ck a
lgor
ithm
proc
ess
virt
ual m
emor
ypr
oces
svi
rtua
l mem
ory
pgta
ble
scan
clo
ck a
lgor
ithm
proc
ess
virt
ual m
emor
ypr
oces
svi
rtua
l mem
ory
pgtables
pgtablespgtables
pgtables
Copyright © 2017 Red Hat Inc.
10
Last Recently Used list
page[0]
page[1]
page[N]
mem
_map
page_lru
pgtables
pgtables
pgtables
pgtables
proc
ess
virt
ual m
emor
ypr
oces
svi
rtua
l mem
ory
pgta
ble
scan
clo
ck a
lgor
ithm
proc
ess
virt
ual m
emor
ypr
oces
svi
rtua
l mem
ory
Copyright © 2017 Red Hat Inc.
11
Original bitmap...
Active and Inactive list LRU● The active page LRU preserves the the active memory working set
– only the inactive LRU loses information as fast as use-once I/O goes– Introduced in 2001, it works good enough also with an arbitrary balance– Active/inactive list optimum balancing algorithm was solved in 2012-2014
● shadow radix tree nodes that detect refaults (more patches last month)
Copyright © 2017 Red Hat Inc.
12
Active & inactive LRU lists
Use-once pages to trash Use-many pages
proc
ess
virt
ual m
emor
ypr
oces
svi
rtua
l mem
ory
pgta
ble
scan
clo
ck a
lgor
ithm
proc
ess
virt
ual m
emor
ypr
oces
svi
rtua
l mem
ory
page[0]
page[1]
page[N]
mem
_map
active_lruinactive_lru
Copyright © 2017 Red Hat Inc.
13
Active and Inactive list LRU$ grep -i active /proc/meminfo Active: 3555744 kBInactive: 2511156 kBActive(anon): 2286400 kBInactive(anon): 1472540 kBActive(file): 1269344 kBInactive(file): 1038616 kB
Copyright © 2017 Red Hat Inc.
14
rmap obsoleted thepgtable scan clock algorithm
To free the candidate pagewe must first first drop thetwo arrows(mark the pte non-present)
Copyright © 2017 Red Hat Inc.
15
rmap as in reverse mapping
rmap allows to reach thepagetables of any givenphysical page withouthaving to scan them all
Copyright © 2017 Red Hat Inc.
16
rmap unmap event
Page faultswapin
Page faultswapin
Free
pag
e
If the userland programaccesses the page it willtrigger a pagein/swapin
Copyright © 2017 Red Hat Inc.
17
objrmap/anon-vmaPHYSICAL PAGE
Anonymous anon_vma
inode(objrmap)
vma
vma
vma vma
prio_tree
rmapitem
rmapitem
vma
vma
PHYSICAL PAGEFilesystem
PHYSICAL PAGEAnonymous
PHYSICAL PAGEFilesystem
PHYSICAL PAGEKSM
Copyright © 2017 Red Hat Inc.
18
Active & inactive + rmap
page[0]
page[1]
page[N]
mem
_map
proc
ess
virt
ual m
emor
ypr
oces
svi
rtua
l mem
ory
proc
ess
virt
ual m
emor
ypr
oces
svi
rtua
l mem
ory
active_lruinactive_lru
rmap
rmap
rmap
rmap
Use-once pages to trash Use-many pages
Copyright © 2017 Red Hat Inc.
19
Active LRU workingset detection
fault ------------------------+ | +--------------+ | +-------------+ reclaim <- | inactive | <-+-- demotion | active | <--+ +--------------+ +-------------+ | | | +-------------- promotion ------------------+
+-memory available to cache-+ | | +-inactive------+-active----+ a b | c d e f g h i | J K L M N | +---------------+-----------+
Copyright © 2017 Red Hat Inc.
20
lru→inactive_age and radix tree shadow entries
inode/fileradix_tree root
inode/fileradix_tree root
inode/fileradix_tree node
inode/fileradix_tree node
inode/fileradix_tree node
inode/fileradix_tree node
pagepage
Copyright © 2017 Red Hat Inc.
21
Reclaim saving inactive_age
inode/fileradix_tree root
inode/fileradix_tree root
inode/fileradix_tree node
inode/fileradix_tree node
inode/fileradix_tree node
inode/fileradix_tree node
pagepage Exceptional/shadow entrymemcg LRU
LRU reclaim inactive_age++
Exceptional/shadow entrymemcg LRU
LRU reclaim inactive_age++
Copyright © 2017 Red Hat Inc.
22
Many more LRUs● Separated LRU for anon and file backed
mappings● Memcg (memory cgroups) introduced per-
memcg LRUs● Removal of unfreeable pages from LRUs
– anonymous memory with no swap– mlocked memory
● Transparent Hugepages in the LRU increase scalability further (lru size decreased 512 times)
Copyright © 2017 Red Hat Inc.
23
Recent Virtual Memory trends● Optimzing the workloads for you, without manual tuning
– NUMA hard bindings (numactl) → Automatic NUMA Balancing
– Hugetlbfs → Transparent Hugepage● THP in tmpfs was merged in Kernel v4.8
– Programs or Virtual Machines duplicating memory → KSM– Page pinning (RDMA/KVM shadow MMU) -> MMU notifier– Private device memory managed by hand and pinned →
HMM/UVM (unified virtual memory) for GPU seamlessly computing in GPU memory
● The optimizations can be optionally disabled
Copyright © 2017 Red Hat Inc.
24
Copyright © 2017 Red Hat Inc.
25
Virtual Memory in hypervisors● Can we use all these nice features to manage
the memory of Virtual Machines?– i.e. in hypervisors?
● Why should we reinvent anything?– We don't… with KVM
Copyright © 2017 Red Hat Inc.
26
KVM philosophyKVM philosophy➢ Reuse Linux code as much as possible➢ Focus on virtualization only, leave other things to
respective developers➢ VM➢ cpu scheduler➢ Drivers➢ Numa➢ Powermanagement
➢ Integrate well into existing infrastructure➢ just a kernel module + mmu/sched notifier
Copyright © 2017 Red Hat Inc.
27
KVM design... way to go!!KVM design... way to go!!
Linux
Driver Driver Driver
Hardware
UserVM
UserVM
UserVM
KVM
OrdinaryLinux
Process
OrdinaryLinux
Process
OrdinaryLinux
Process
Modules
Copyright © 2017 Red Hat Inc.
28
KVM task modelKVM task model
kernel
task task guest task task guest
Copyright © 2017 Red Hat Inc.
29
KVM userland <-> KVM kernelKVM userland <-> KVM kernel
Native GuestExecution
Kernelexit handler
Userspaceexit handler
Switch toGuest Mode
ioctl()
Userspace Kernel Guest
Copyright © 2017 Red Hat Inc.
30
Copyright © 2017 Red Hat Inc.
31
Automatic NUMA Balancing benchmark
Intel SandyBridge (Intel(R) Xeon(R) CPU E5-2690 0 @ 2.90GHz)
2 Sockets – 32 Cores with Hyperthreads
256G Memory
RHEV 3.6
Host bare metal – 3.10.0-327.el7 (RHEL7.2)
VM guest – 3.10.0-324.el7 (RHEL7.2)
VM – 32P , 160G (Optimized for Server)
Storage – Violin 6616 – 16G Fibre Channel
Oracle – 12C , 128G SGA
Test – Running Oracle OLTP workload with increasing user count and measuring Trans / min for each run as a metric for comparison
Copyright © 2017 Red Hat Inc.
32
10U 20U 40U 80U 100U0
200000
400000
600000
800000
1000000
1200000
1400000
4 VMs with different NUMA options
OLTP workload
Auto numa OFFAuto Numa on - No pinningAuto Numa on - Numa pinning
User count
Tran
s / m
in
Copyright © 2017 Red Hat Inc.
33
Automatic NUMA balancing configuration
● https://tinyurl.com/zupp9v3 https://access.redhat.com/
● In RHEL7 Automatic NUMA balancing is enabled when:– # numactl --hardware shows multiple nodes
● To disable automatic NUMA balancing:– # echo 0 > /proc/sys/kernel/numa_balancing
● To enable automatic NUMA balancing:– # echo 1 > /proc/sys/kernel/numa_balancing
● At boot:– numa_balancing=enable|disable
Copyright © 2017 Red Hat Inc.
34
Copyright © 2017 Red Hat Inc.
35
Hugepages● Traditionally x86 hardware gave us 4KiB
pages● The more memory the bigger the overhead
in managing 4KiB pages● What if you had bigger pages?
– 512 times bigger → 2MiB
Copyright © 2017 Red Hat Inc.
36
PageTablesPageTables
pgd
pmd
pte
pud
pte
pmd
pte pte
pmd
pte
pud
pte
pmd
pte pte
2^9 = 512 arrows, not just 2
Copyright © 2017 Red Hat Inc.
37
● Improve CPU performance– Enlarge TLB size (essential for KVM)– Speed up TLB miss (essential for KVM)
● Need 3 accesses to memory instead of 4 to refill the TLB– Faster to allocate memory initially (minor)– Page colouring inside the hugepage (minor)– Higher scalability of the page LRUs
● Cons– clear_page/copy_page less cache friendly– higher memory footprint sometime– Direct compaction takes time
Benefit of hugepages
Copyright © 2017 Red Hat Inc.
38
TLB miss cost:TLB miss cost:number of accesses to memorynumber of accesses to memory
host THP off guest THP off
host THP on guest THP off
host THP off guest THP on
host THP on guest THP on
novirt THP off
novirt THP on
0 5 10 15 20 25
Bare
met
alVi
rtua
lizat
ion
Copyright © 2017 Red Hat Inc.
39
● How do we get the benefits of hugetlbfs without having to configure anything?– Transparent Hugepage
● Any Linux process will receive 2M pages– if the mmap region is 2M naturally
aligned– If compaction succeeeds in producing
hugepages● Entirely transparent to userland
Transparent Hugepage design
Copyright © 2017 Red Hat Inc.
40
● /sys/kernel/mm/transparent_hugepage/enabled
– [always] madvise never● Always use THP if vma start/end permits
– always [madvise] never● Use THP only inside MAD_HUGEPAGE
–Applies to khugepaged too– always madvise [never]
● Never use THP–khugepaged quits
● Default selected at build time
THP sysfs enabled
Copyright © 2017 Red Hat Inc.
41
● /sys/kernel/mm/transparent_hugepage/defrag
– [always] defer madvise never● Use direct compaction (ideal for long lived allocations)
– always [defer] madvise never● Defer compaction asynchronously (kswapd/kcompact)
– always defer [madvise] never● Use direct compaction only inside MAD_HUGEPAGE
– always defer madvise [never]● Never use compaction
● Disabling THP is excessive if direct compaction is too expensive● Default will change to defer to reduce allocation latency● KVM uses MADV_HUGEPAGE
– MADV_HUGEPAGE will still use direct compaction
THP defrag - compaction control
Copyright © 2017 Red Hat Inc.
42
Copyright © 2017 Red Hat Inc.
43
Why: Memory Externalization● Memory externalization is about running a program with
part (or all) of its memory residing on a remote node
● Memory is transferred from the memory node to the compute node on access
● Memory can be transferred from the compute node to the memory node if it's not frequently used during memory pressure
Node 0Compute nodeLocal Memory
Node 0Memory nodeRemote Memory
userfault
Node 0Compute nodeLocal Memory
Node 0Memory nodeRemote Memory
memorypressure
Copyright © 2017 Red Hat Inc.
44
Postcopy live migration● Postcopy live migration is a form of memory
externalization
● When the QEMU compute node (destination) faults on a missing page that resides in the memory node (source) the kernel has no way to fetch the page
– Solution: let QEMU in userland handle the pagefault
Partially funded by the Orbit European Union project
Node 0Compute nodeLocal Memory
Node 0Memory nodeRemote Memory
userfault
QEMUsource
QEMUdestination
Postcopy live migration
Copyright © 2017 Red Hat Inc.
45
userfaultfd latency
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 270
200
400
600
800
1000
1200
1400
1600
userfault latency during postcopy live migration - 10Gbitqemu 2.5+ - RHEL7.2+ - stressapptest running in guest
<= latency in milliseconds
num
ber
of u
serf
aults
Userfaults triggered on pages that were already in network-flight are instantaneous. Background transfer seeks at the last userfault address.
Copyright © 2017 Red Hat Inc.
46
KVM precopy live migration
0
100000
200000
300000
400000
500000
600000
700000
800000
900000
Before precopy
After precopy
During precopy
Precopy never completes until the database benchmark completes
10Gbit NIC120GiB guest
Database TPM
~120secTime to
Transfer RAMover network
precopy
Copyright © 2017 Red Hat Inc.
47
KVM postcopy live migration
00:1000:50
01:3002:10
02:5003:30
04:1004:50
05:3006:10
06:5007:30
08:1008:50
09:3010:10
10:5011:30
12:1012:50
13:3014:10
14:5015:30
16:1016:50
0
100000
200000
300000
400000
500000
600000
700000
800000
900000
Before postcopy
During postcopy
After postcopy
# virsh migrate .. --postcopy --timeout <sec> --timeout-postcopy# virsh migrate .. --postcopy --postcopy-after-precopy
precopy runsFrom 5m
To 7m
postcopy runsFrom 7m
To about ~9mdeterministic
precopy
postcopy
khugepagedcollapses THPs
Copyright © 2017 Red Hat Inc.
48
All available upstream● Userfaultfd() syscall in Linux Kernel >= v4.3● Postcopy live migration in:
– QEMU >= v2.5.0● Author: David Gilbert @ Red Hat Inc.
– Postcopy in Libvirt >= 1.3.4– OpenStack Nova >= Newton
● … and coming soon in production starting with:– RHEL 7.3
Copyright © 2017 Red Hat Inc.
49
Live migration total time
Total time0
50
100
150
200
250
300
350
400
450
500
autoconvergepostcopyse
cond
s
Copyright © 2017 Red Hat Inc.
50
Max UDP latency0
5
10
15
20
25
precopy timeoutautoconvergepostcopyse
cond
s
Live migration max perceived downtime latency
Copyright © 2017 Red Hat Inc.
51
Copyright © 2017 Red Hat Inc.
52
Virtual Memory evolution since '99● Amazing to see the room for further innovation there was back then
– Things constantly looks pretty mature● They may actually have been considering my hardware back then
was much less powerful and not more complex than my cellphone
● Unthinkable to maintain the current level of mission critical complexity by reinventing the wheel in a not Open Source way–Can perhaps be still done in a limited set of laptops and
cellphones models, but for how long?● Innovation in the Virtual Memory space is probably one among the
plenty of factors that contributed to Linux success and the KVM lead in OpenStack user base too– KVM (unlike the preceding Hypervisor designs) leverages the power
of the Linux Virtual Memory in its entirety