Transcendent memory: not just for virtualization anymore
TRANSCRIPT
Copyright © 2012, Oracle and/or its affiliates. All rights reserved.
Transcendent Memory
Avi Miller
Principal Program Manager
Further reading:
https://lwn.net/Articles/454795/
Objectives
Utilise RAM more effectively
– Lower capital costs
– Lower power utilisation
– Less I/O
Better performance on many workloads
– Negligible loss on others
Motivation: Memory-inefficient workloads
More motivation: memory capacity wall
Memory capacity per core drops ~30% every 2 years
Source: Disaggregated Memory for Expansion and Sharing in Blade Server
http://isca09.cs.columbia.edu/pres/24.pptx
[Chart: # cores vs. GB DRAM, 2003-2017, log scale 1-1000; DRAM per core falls over time]
Slide from: Linux kernel support to exploit phase change memory, Linux Symposium 2010, Youngwoo Park, EE KAIST
Disaggregated memory
Source: Disaggregated Memory for Expansion and Sharing in Blade Server
http://isca09.cs.columbia.edu/pres/24.pptx
Leverage fast, shared communication fabrics
Memory blade
[Diagram: four CPU+DIMM blades connected over an exofabric to a shared memory blade of DIMMs]
OS memory “demand”
Operating systems are memory hogs!
[Diagram: OS memory usage growing up to a memory constraint]
OS Physical Memory Management
If you give an operating system more memory…
[Diagram: OS expanding to fill a new, larger memory constraint]
OS Physical Memory Management
… it uses up any memory you give it! (“My name is Linux and I am a memory hog.”)
[Diagram: OS filling memory to the constraint]
OS Memory “Asceticism”
ASSUME
– We should use as little RAM as possible
SUPPOSE
– Mechanism to allow the OS to surrender RAM
– Mechanism to allow the OS to obtain more RAM
THEN
– How does an OS decide how much RAM it actually needs?
as-cet-i-cism, n. 1. extreme self-denial and austerity; rigorous self-discipline and active restraint; renunciation of material comforts so as to achieve a higher state
Impact on Linux Memory Subsystem
CAPACITY KNOWN
Can read or write to
any byte.
CAPACITY KNOWN
Can read or write to
any byte.
CAPACITY UNKNOWN
and may change
dynamically!
• CAPACITY: known
• USES:
• kernel memory
• user memory
• DMA
• ADDRESSABILITY:
• Read/write any byte
• CAPACITY: “unknowable”, dynamic
SO… the kernel/CPU can’t address it directly!
SO… Need “permission” to access, and need to “follow rules” (even the kernel!)
• CAPACITY: known
• USES:
• kernel memory
• user memory
• DMA
• ADDRESSABILITY:
• Read/write any byte
• CAPACITY: known
• USES:
• kernel memory
• user memory
• DMA
• ADDRESSABILITY:
• Read/write any byte
• THE RULES:
1. “page”-at-a-time
2. to put data here, the kernel MUST use a “put page” call
3. (more rules later)
We have a page that contains: [Tux]
And the kernel wants to “preserve” Tux in Type B memory.
We have a page that contains: [Tux]
And the kernel wants to “preserve” Tux into Type B memory… but…
The kernel MUST ask permission, and Type B memory may say NO!
We have a page that contains: [Tux]
And the kernel wants to “preserve” Tux into Type B memory. Two choices…
1. DEFINITELY want Tux back (e.g. a “dirty” page)
Type B memory may say NO to the kernel, or may commit to keeping the page around…
We have a page that contains: [Tux]
And the kernel wants to “preserve” Tux into Type B memory. Two choices…
1. DEFINITELY want Tux back
2. PROBABLY want Tux back (but OK if it disappears, e.g. “clean” pages)
Type B memory may say NO to the kernel, or may commit to keeping the page around… or may not!
We have a page that contains: [Tux]
Two choices…
1. DEFINITELY want Tux back
2. PROBABLY want Tux back
tran-scend-ent, adj., … beyond the range of normal perception
We have a page that contains: [Tux]
Two choices…
1. DEFINITELY want Tux back: “PERSISTENT PUT”
2. PROBABLY want Tux back: “EPHEMERAL PUT”
eph-em-er-al, adj., … transitory, existing only briefly, short-lived (i.e. NOT persistent)
tran-scend-ent, adj., … beyond the range of normal perception
“PUT”
“GET”
“FLUSH”
Core Transcendent Memory Operations
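These three operations are the whole interface. A toy C sketch may help fix the semantics; this is illustrative only, not the Xen or Linux tmem ABI, and the names (`tm_put`, `tm_get`, `tm_flush`), the flat handle, and the fixed capacity are all invented for the example:

```c
/* Illustrative sketch: a page store the "kernel" may only use via
 * put/get/flush, and which is allowed to refuse a put. */
#include <assert.h>
#include <string.h>

#define PAGE_SIZE 4096
#define STORE_CAP 4              /* tiny capacity so refusals are visible */

struct tm_page {
    long handle;                 /* one opaque handle per stored page     */
    int  ephemeral;              /* 1 = may vanish, 0 = persistent        */
    int  used;
    unsigned char data[PAGE_SIZE];
};

static struct tm_page store[STORE_CAP];

/* "put": copy a page in; returns 0 on success, -1 if the store says NO */
int tm_put(long handle, const void *page, int ephemeral)
{
    for (int i = 0; i < STORE_CAP; i++) {
        if (!store[i].used) {
            store[i].used = 1;
            store[i].handle = handle;
            store[i].ephemeral = ephemeral;
            memcpy(store[i].data, page, PAGE_SIZE);
            return 0;
        }
    }
    return -1;                   /* full: the kernel must cope with NO */
}

/* "get": copy the page back out; returns -1 if not present.  A real
 * backend may have dropped an ephemeral page (this toy never evicts). */
int tm_get(long handle, void *page)
{
    for (int i = 0; i < STORE_CAP; i++) {
        if (store[i].used && store[i].handle == handle) {
            memcpy(page, store[i].data, PAGE_SIZE);
            return 0;
        }
    }
    return -1;
}

/* "flush": drop a page so its handle can be coherently reused */
void tm_flush(long handle)
{
    for (int i = 0; i < STORE_CAP; i++)
        if (store[i].used && store[i].handle == handle)
            store[i].used = 0;
}
```

The behaviors to notice: a put may be refused outright, an ephemeral page may legitimately be gone by the time of the get, and flush is how the kernel keeps the store coherent before reusing a handle.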
“Normal” RAM addressing
• byte-addressable
• virtual address: @fffff80001024580
“Normal” RAM addressing
• byte-addressable
• virtual address: @fffff80001024580
Transcendent Memory
• object-oriented addressing
• object is a page
• handle addresses a page
• kernel can (mostly) choose handle when a page is put
• uses same handle to get
• must ensure handle is and remains unique
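In other words, a tmem page is named rather than addressed. A minimal sketch of such a handle; the field names and types here are illustrative, not the actual kernel structures:

```c
/* Sketch of tmem-style naming: instead of an address, a page is named
 * by a (pool, object, index) triple that the kernel chooses. */
#include <assert.h>
#include <stdint.h>

struct tmem_handle {
    uint32_t pool_id;    /* e.g. one pool per filesystem or swap area */
    uint64_t object_id;  /* e.g. an inode number                      */
    uint32_t index;      /* e.g. page offset within the file          */
};

/* The same triple used at put time must be presented to get the page
 * back, so the kernel must keep each triple unique while it is in use. */
static inline int tmem_handle_eq(struct tmem_handle a, struct tmem_handle b)
{
    return a.pool_id == b.pool_id &&
           a.object_id == b.object_id &&
           a.index == b.index;
}
```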
Why bother?
Once we’re behind the curtain, we can do interesting things…
Interesting thing #1
virtual machines (aka “guests”)
future?
Tmem support:
• multiple guests
• compression
• deduplication
Tmem supported in
Xen since 4.0 (2009)
hypervisor (aka “host”)
hypervisor
RAM
Interesting thing #2
Zcache (2.6.39 staging driver)
compress on put, decompress on get
Interesting thing #3
Transparently move pre-compressed pages across a high-speed coherent interconnect
Interesting thing #3
RAMster
Peer-to-peer transcendent memory
Interesting thing #4
SSmem: Transcendent Memory as a “safe” access layer for SSD or NVRAM
e.g. as a “RAM extension”, not an I/O device
Interesting thing #5
…maybe only one large memory server shared by many machines?
Cleancache
A third-level victim cache for otherwise-reclaimed clean page cache pages
– Optionally load-balanced across multiple clients
Cleancache patchset:
– VFS hooks to put clean page cache pages, get them back, and maintain coherency
– Per-filesystem opt-in hooks
– Shim to zcache in 2.6.39
– Shim to Xen tmem in 3.0
Merged in Linux 3.0
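The put/get/invalidate flow behind those hooks can be sketched as follows. `cc_put`, `cc_get`, and `cc_invalidate` are invented stand-ins for the real VFS hooks, and the 64-byte "pages" are a toy size:

```c
/* Sketch of the cleancache idea: before a clean page-cache page is
 * reclaimed, offer it to a backend; on a later read miss, try the
 * backend before going to disk. */
#include <assert.h>
#include <string.h>

#define NPAGES 8
static char backend[NPAGES][64];   /* toy backend store */
static int  present[NPAGES];

static void cc_put(int index, const char *page)   /* on reclaim */
{
    memcpy(backend[index], page, 64);
    present[index] = 1;            /* a real backend may refuse instead */
}

static int cc_get(int index, char *page)          /* on read miss */
{
    if (!present[index])
        return -1;                 /* ephemeral: may have vanished */
    memcpy(page, backend[index], 64);
    return 0;                      /* one disk read saved */
}

static void cc_invalidate(int index)              /* on overwrite/truncate */
{
    present[index] = 0;            /* keep backend coherent with disk */
}
```

The invalidate path is what the "maintain coherency" bullet refers to: if the file page changes on disk, any stale copy in the backend must be dropped.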
Frontswap
Temporary emergency FAST swap page store
– Optionally load-balanced across multiple clients
Frontswap patchset:
– Swap subsystem hooks to put and get swap cache pages
– Maintain coherency
– Manages tracking data structures (1 bit/page)
– Partial swapoff
– Shim to zcache in 2.6.39
– Shim to Xen tmem merged in 3.1
Merged in Linux 3.5
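The "1 bit/page" tracking is what lets swap-in route cheaply between backend and disk. A sketch of that bookkeeping; the names are invented (the real code lives in mm/frontswap.c, per the diffstat later in this deck):

```c
/* Sketch of frontswap's bookkeeping: one bit per swap slot records
 * whether the page went to the backend or to the swap device, so the
 * swap-in path knows where to look. */
#include <assert.h>
#include <stdint.h>

#define SWAP_SLOTS 1024
static uint8_t fs_bitmap[SWAP_SLOTS / 8];  /* 1 bit per swap page */

static void fs_set(int slot)   { fs_bitmap[slot / 8] |= (uint8_t)(1u << (slot % 8)); }
static void fs_clear(int slot) { fs_bitmap[slot / 8] &= (uint8_t)~(1u << (slot % 8)); }
static int  fs_test(int slot)  { return (fs_bitmap[slot / 8] >> (slot % 8)) & 1; }

/* swap-out: try the backend first; fall back to the disk on refusal */
static void swap_out(int slot, int backend_accepted)
{
    if (backend_accepted)
        fs_set(slot);          /* page lives in the backend (e.g. zcache) */
    else
        fs_clear(slot);        /* page went to the swap device as usual   */
}
```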
Kernel changes
Frontends require core kernel changes
– Cleancache
– Frontswap
Backends do NOT require core kernel changes
– Zcache, RAMster, Xen tmem all implemented as drivers
Transcendent Memory in Linux: Multi-year merge effort

Xen  non-Xen  name of patchset                  Linux version  notes
N    Y        zcache/zcache2                    2.6.39/3.7     staging driver
Y    Y        cleancache                        3.0            Linus decided!
Y    N        Xen-tmem, selfballooning          3.1
Y    ?        frontswap-selfshrinking           3.1
Y    Y        frontswap                         3.5            Linus decided!
?    Y        RAMster (merged w/zcache2)        3.4/3.7        staging driver
Y    Y        module support, frontswap unuse,  3.8?           under development
              frontswap admission improvements
Transcendent Memory: Oracle Product Plans
Transcendent Memory now in the upstream Linux kernel
– cleancache, frontswap
– guest kernel support (aka Xen tmem)
– zcache
– RAMster
Transcendent Memory support has been in the Xen hypervisor for over 2 years.
– Available in Oracle VM 2.2 and 3.x
Transcendent Memory in UEK2 for a year
– cleancache, frontswap
– guest kernel support (aka Xen tmem)
– zcache2 coming soon
Pretty Graphs! Facts! Figures!
frontswap patchset diffstat
Low core maintenance impact
– ~100 lines
No impact if CONFIG_FRONTSWAP=n
Negligible impact if CONFIG_FRONTSWAP=y and no backend
How much benefit per backend?
Documentation/vm/frontswap.txt | 210 +++++++++++++++++++
include/linux/frontswap.h | 126 ++++++++++++
include/linux/swap.h | 4
include/linux/swapfile.h | 13 +
mm/Kconfig | 17 ++
mm/Makefile | 1
mm/frontswap.c | 273 +++++++++++++++++++
mm/page_io.c | 12 +
mm/swapfile.c | 64 +++++--
9 files changed, 707 insertions(+), 13 deletions(-)
A benchmark
Workload:
– make -jN on linux-3.1 source (after make clean)
– Fresh reboot before each run
– All tests run as root in multi-user mode
Software:
– Linux 3.2
Hardware:
– Dell Optiplex 790 (~$500)
– Intel Core i5-2400 @ 3.1GHz (Quad Core/Hyperthreaded, 6M cache)
– 1GB DDR3 RAM @ 1333MHz (limited by memmap)
– One 7200rpm SATA 6Gbps drive with 8MB cache
– 10GB swap partition
– 1Gb ethernet
Workload objective
Small N (4-12)
– No memory pressure
Page cache never fills to exceed RAM, no swapping
Medium N (16-24)
– Moderate memory pressure
Page cache fills so lots of reclaiming, but little to no swapping
Large N (28-36)
– High memory pressure
Much page cache churn, lots of swapping
Largest N (40)
– Extreme memory pressure
Little space for page cache churn, swap storm occurs
Changing N varies memory pressure
Native/Baseline (no zcache registered)
Kernel compile “make -jN”, elapsed time in seconds (smaller is better):

N         4    8    12   16    20    24    28    32    36    40
nozcache  879  858  858  1009  1316  2164  3293  4286  6516  DNC

DNC = did not complete (18000+ seconds)
Review: what is zcache?
when clean pages are reclaimed (cleancache “put”)
– zcache compresses/stores contents of evicted pages in RAM
– zcache has a “shrinker hook” for when the kernel runs low on memory
when filesystem reads file pages (cleancache “get”)
– zcache checks if it has a copy, if so decompresses/returns
– else reads from filesystem/disk as normal
One disk access saved for every successful “get”
Captures and compresses evicted clean page cache pages
Review: what is zcache?
when a page needs to be swapped out (frontswap “put”)
– zcache compresses/stores contents of swap page in RAM
– zcache enforces policies, may reject some (or all) pages
– frontswap maintains a bit map for saved/rejected swap pages
when a page needs to be swapped in (frontswap “get”)
– if frontswap bit is set, zcache decompresses/returns
– else read from swap disk as normal
One disk write+read saved for every successful “get”
Captures and compresses swap pages (in RAM)
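On both paths, the backend's job at put time is compress-or-reject. In the sketch below, a toy run-length encoder stands in for the kernel's real compressor; the compressor is invented for illustration, but the shape of the idea (store compressed, reject pages that do not shrink) mirrors the reject policy described above:

```c
/* Sketch of compress-on-put with rejection: a toy run-length encoder
 * stands in for a real compressor.  Returns the compressed size, or 0
 * if the output would not fit in cap (i.e. the page is rejected). */
#include <assert.h>
#include <stddef.h>

static size_t rle_compress(const unsigned char *in, size_t n,
                           unsigned char *out, size_t cap)
{
    size_t o = 0;
    for (size_t i = 0; i < n; ) {
        size_t run = 1;
        while (i + run < n && in[i + run] == in[i] && run < 255)
            run++;                 /* count a run of identical bytes */
        if (o + 2 > cap)
            return 0;              /* incompressible: backend rejects */
        out[o++] = (unsigned char)run;
        out[o++] = in[i];
        i += run;
    }
    return o;
}
```

A rejected put simply falls back to the normal path (disk read for page cache, swap device for swap), which is why the frontend must always be prepared for NO.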
Zcache vs native/baseline
Kernel compile “make -jN”, elapsed time in seconds (smaller is better):

N         4    8    12   16    20    24    28    32    36    40
nozcache  879  858  858  1009  1316  2164  3293  4286  6516  DNC
zcache    877  856  856  922   1154  1714  2500  4282  6602  13755

Up to 26-31% faster
Benchmark analysis - zcache
small N (4-12):
– no memory pressure
zcache has no effect, but apparently no measurable cost either
medium N (16-20):
– moderate memory pressure
zcache increases total pages cached due to compression
performance improves 9%-14%
large N (24-28)
– high memory pressure
zcache increases total pages cached due to compression
AND zcache uses RAM for compressed swap to avoid swap-to-disk
performance improves 26%-31%
large N (32-36)
– very high memory pressure
compressed page cache gets reclaimed before use, no advantage
compressed in-RAM swap counteracted by smaller kernel page cache?
performance is flat to slightly worse: 0% to (1%)
largest N (40):
– extreme memory pressure
– in-RAM swap compression reduces worst case swapstorm
Review: what is RAMster?
Leverages zcache, adds cluster code using kernel sockets
same as zcache, but also “remotifies” compressed swap pages to another system’s RAM
– One disk write+read saved for every successful swap “get” (at cost of some network traffic)
– One disk access saved for every successful page cache “get” (at cost of some network traffic)
Peer-to-peer or client-server (currently up to 8 nodes)
RAM management is entirely dynamic
Locally compresses swap and clean page cache pages, but stores them in remote RAM
Zcache and RAMster
Kernel compile “make -jN”, elapsed time in seconds (smaller is better):

N         4    8    12   16    20    24    28    32    36    40
nozcache  879  858  858  1009  1316  2164  3293  4286  6516  DNC
zcache    877  856  856  922   1154  1714  2500  4282  6602  13755
ramster   887  866  875  949   1162  1788  2177  3599  5394  8172
Workload Analysis - RAMster
small N (4-12):
– no memory pressure
RAMster has no effect, but small cost
medium N (16-20):
– moderate memory pressure
RAMster increases total pages cached due to compression
performance improves 6%-13%
somewhat slower than zcache
large N (24-28)
– high memory pressure
RAMster increases total pages cached (local) due to compression
and RAMster uses remote RAM to avoid swap-to-disk
performance improves 21%-51%
large N (32-36)
– very high memory pressure
compressed page cache gets reclaimed before use, no advantage
but RAMster still uses remote (compressed) RAM to avoid swap-to-disk
performance improves 19%-22% (vs zcache and native)
largest N (40):
– extreme memory pressure
use of remote RAM significantly reduces worst case swapstorm
Questions?