a study of garbage collector scalability on multicores
DESCRIPTION
A Study of Garbage Collector Scalability on Multicores. LokeshGidra , Gaël Thomas, JulienSopena and Marc Shapiro INRIA/University of Paris 6. Garbage collection on multicore hardware. 14/20 most popular languages have GC but GC doesn’t scale on multicore hardware. - PowerPoint PPT PresentationTRANSCRIPT
![Page 1: A Study of Garbage Collector Scalability on Multicores](https://reader034.vdocuments.site/reader034/viewer/2022051020/5681665d550346895dd9e274/html5/thumbnails/1.jpg)
A Study of Garbage Collector Scalability on Multicores
LokeshGidra, Gaël Thomas, JulienSopena and Marc Shapiro
INRIA/University of Paris 6
![Page 2: A Study of Garbage Collector Scalability on Multicores](https://reader034.vdocuments.site/reader034/viewer/2022051020/5681665d550346895dd9e274/html5/thumbnails/2.jpg)
2
14/20 most popular languages have GCbut GC doesn’t scale on multicore hardware
Garbage collection on multicore hardware
Lokesh Gidra
Parallel Scavenge/HotSpot scalability on a 48-core machine
GC threads
GC
Thr
ough
put (
GB
/s)
SPECjbb2005 with48 application threads/3.5GB
A Study of the Scalability of Garbage Collectors on Multicores
Degrades after 24 GC threads
Better ↑
![Page 3: A Study of Garbage Collector Scalability on Multicores](https://reader034.vdocuments.site/reader034/viewer/2022051020/5681665d550346895dd9e274/html5/thumbnails/3.jpg)
3
Scalability of GC is a bottleneckBy adding new cores, application creates more garbage per time unit
And without GC scalability, the time spent in GC increases
Lokesh Gidra A Study of the Scalability of Garbage Collectors on Multicores
Lusearch[PLOS’11]
~50% of the time spent in the GCat 48 cores
![Page 4: A Study of Garbage Collector Scalability on Multicores](https://reader034.vdocuments.site/reader034/viewer/2022051020/5681665d550346895dd9e274/html5/thumbnails/4.jpg)
4
Where is the problem?
Probably not related to GC design: the problem exists in ALL the GCs of HotSpot 7(both, stop-the-world and concurrent GCs)
What has really changed:Multicores are distributed architectures, not centralized anymore
Lokesh Gidra A Study of the Scalability of Garbage Collectors on Multicores
![Page 5: A Study of Garbage Collector Scalability on Multicores](https://reader034.vdocuments.site/reader034/viewer/2022051020/5681665d550346895dd9e274/html5/thumbnails/5.jpg)
A Study of the Scalability of Garbage Collectors on Multicores 5
From centralized architectures to distributed ones
Lokesh Gidra
A few years ago…
Uniform memory access machines
Now…
Inter-connectNode 0 Node 1
Node 2 Node 3Cores
Non-uniform memory access machines
Cores
Memory
System Bus
Mem
ory
Mem
ory
Mem
ory
Mem
ory
![Page 6: A Study of Garbage Collector Scalability on Multicores](https://reader034.vdocuments.site/reader034/viewer/2022051020/5681665d550346895dd9e274/html5/thumbnails/6.jpg)
A Study of the Scalability of Garbage Collectors on Multicores 6
From centralized architectures to distributed onesOur machine: AMD Magny-Cours with 8 nodes and 48 cores
12 GB per node 6 cores per node
Lokesh Gidra
Node 0 Node 1
Node 2 Node 3M
emor
y
Mem
ory
Mem
ory
Mem
ory
Local access: ~ 130 cyclesRemote access: ~350 cycles
![Page 7: A Study of Garbage Collector Scalability on Multicores](https://reader034.vdocuments.site/reader034/viewer/2022051020/5681665d550346895dd9e274/html5/thumbnails/7.jpg)
A Study of the Scalability of Garbage Collectors on Multicores 7
Our machine: AMD Magny-Cours with 8 nodes and 48 cores 12 GB per node 6 cores per node
From centralized architectures to distributed ones
Lokesh Gidra
Node 0 Node 1
Node 2 Node 3M
emor
y
Mem
ory
Mem
ory
Mem
ory
Local access: ~ 130 cyclesRemote access: ~350 cycles
#cores = #threadsBetter↓
Com
plet
ion
time
(ms)
Time to perform a fixednumber of reads in //
![Page 8: A Study of Garbage Collector Scalability on Multicores](https://reader034.vdocuments.site/reader034/viewer/2022051020/5681665d550346895dd9e274/html5/thumbnails/8.jpg)
A Study of the Scalability of Garbage Collectors on Multicores 8
Our machine: AMD Magny-Cours with 8 nodes and 48 cores 12 GB per node 6 cores per node
From centralized architectures to distributed ones
Lokesh Gidra
Node 0 Node 1
Node 2 Node 3M
emor
y
Mem
ory
Mem
ory
Mem
ory
Local access: ~ 130 cyclesRemote access: ~350 cyclesBetter↓
Com
plet
ion
time
(ms)
Local Access
Time to perform a fixednumber of reads in //
#cores = #threads
![Page 9: A Study of Garbage Collector Scalability on Multicores](https://reader034.vdocuments.site/reader034/viewer/2022051020/5681665d550346895dd9e274/html5/thumbnails/9.jpg)
A Study of the Scalability of Garbage Collectors on Multicores 9
Our machine: AMD Magny-Cours with 8 nodes and 48 cores 12 GB per node 6 cores per node
From centralized architectures to distributed ones
Lokesh Gidra
Node 0 Node 1
Node 2 Node 3M
emor
y
Mem
ory
Mem
ory
Mem
ory
Local access: ~ 130 cyclesRemote access: ~350 cyclesBetter↓
Com
plet
ion
time
(ms)
Random access
Time to perform a fixednumber of reads in //
#cores = #threads
Local Access
![Page 10: A Study of Garbage Collector Scalability on Multicores](https://reader034.vdocuments.site/reader034/viewer/2022051020/5681665d550346895dd9e274/html5/thumbnails/10.jpg)
A Study of the Scalability of Garbage Collectors on Multicores 10
Our machine: AMD Magny-Cours with 8 nodes and 48 cores 12 GB per node 6 cores per node
From centralized architectures to distributed ones
Lokesh Gidra
Node 0 Node 1
Node 2 Node 3M
emor
y
Mem
ory
Mem
ory
Mem
ory
Local access: ~ 130 cyclesRemote access: ~350 cyclesBetter↓
Com
plet
ion
time
(ms)
Time to perform a fixednumber of reads in //
Single node access
Local Access
#cores = #threads
Random access
![Page 11: A Study of Garbage Collector Scalability on Multicores](https://reader034.vdocuments.site/reader034/viewer/2022051020/5681665d550346895dd9e274/html5/thumbnails/11.jpg)
11
Parallel Scavenge Heap Space
Lokesh Gidra A Study of the Scalability of Garbage Collectors on Multicores
Kernel’s lazy first-touch page allocation policyFirst-touch allocation
policy
Virtual address space
Parallel Scavenge
![Page 12: A Study of Garbage Collector Scalability on Multicores](https://reader034.vdocuments.site/reader034/viewer/2022051020/5681665d550346895dd9e274/html5/thumbnails/12.jpg)
12
Parallel Scavenge Heap Space
Lokesh Gidra A Study of the Scalability of Garbage Collectors on Multicores
Kernel’s lazy first-touch page allocation policy ⇒ initial sequential phase maps most pages on first node
Initial application
thread
First-touch allocation policy
Parallel Scavenge
![Page 13: A Study of Garbage Collector Scalability on Multicores](https://reader034.vdocuments.site/reader034/viewer/2022051020/5681665d550346895dd9e274/html5/thumbnails/13.jpg)
13
Parallel Scavenge Heap Space
Lokesh Gidra A Study of the Scalability of Garbage Collectors on Multicores
Kernel’s lazy first-touch page allocation policy ⇒ initial sequential phase maps most pages on its node
Initial application
thread
First-touch allocation policy
But during the whole execution, the mapping remains as it is
(virtual space reused by the GC)
Parallel Scavenge
A severe problem for generational GCs
![Page 14: A Study of Garbage Collector Scalability on Multicores](https://reader034.vdocuments.site/reader034/viewer/2022051020/5681665d550346895dd9e274/html5/thumbnails/14.jpg)
14
Parallel Scavenge Heap Space
Lokesh Gidra
Bad balanceBad locality
First-touch allocation policy
95% on a single node
PS
SpecJBB
GC threads
GC
Thr
ough
put (
GB
/s)
Better ↑
Parallel Scavenge
A Study of the Scalability of Garbage Collectors on Multicores
![Page 15: A Study of Garbage Collector Scalability on Multicores](https://reader034.vdocuments.site/reader034/viewer/2022051020/5681665d550346895dd9e274/html5/thumbnails/15.jpg)
15
NUMA-aware heap layouts
Lokesh Gidra
Bad balanceBad locality
First-touch allocation policy
Round-robin allocation policy
Node local objectallocation and copy
95% on a single node
Targets balance Targets locality
Parallel Scavenge Interleaved Fragmented
A Study of the Scalability of Garbage Collectors on Multicores
PS
SpecJBB
GC threads
GC
Thr
ough
put (
GB
/s)
Better ↑
![Page 16: A Study of Garbage Collector Scalability on Multicores](https://reader034.vdocuments.site/reader034/viewer/2022051020/5681665d550346895dd9e274/html5/thumbnails/16.jpg)
16
Interleaved heap layout analysis
Lokesh Gidra
Bad balance Perfect balanceBad locality Bad locality
First-touch allocation policy
Round-robin allocation policy
Node local objectallocation and copy
95% on a single node 7/8 remote accesses
PS
Interleaved
SpecJBB
GC threads
GC
Thr
ough
put (
GB
/s)
Better ↑
Parallel Scavenge Interleaved Fragmented
A Study of the Scalability of Garbage Collectors on Multicores
![Page 17: A Study of Garbage Collector Scalability on Multicores](https://reader034.vdocuments.site/reader034/viewer/2022051020/5681665d550346895dd9e274/html5/thumbnails/17.jpg)
17
Fragmented heap layout analysis
Lokesh Gidra
Bad balance Perfect balance Good balanceBad locality Bad locality Average locality
Parallel Scavenge Interleaved FragmentedFirst-touch allocation
policyRound-robin
allocation policyNode local object
allocation and copy
95% on a single node 7/8 remote accesses Bad balance if a singlethread allocates for the others
PS
Interleaved
FragmentedSpecJBB7/8 remote scans
100% local copies
GC threads
GC
Thr
ough
put (
GB
/s)
Better ↑
A Study of the Scalability of Garbage Collectors on Multicores
![Page 18: A Study of Garbage Collector Scalability on Multicores](https://reader034.vdocuments.site/reader034/viewer/2022051020/5681665d550346895dd9e274/html5/thumbnails/18.jpg)
18
Synchronization optimizationsRemoved a barrier between the GC phasesReplaced the GC task-queue with a lock-free one
Lokesh Gidra A Study of the Scalability of Garbage Collectors on Multicores
GC
Thr
ough
put (
GB
/s)
PS
Interleaved
FragmentedSpecJBB
Fragmented + synchro
Synchro optimization
has effect with high contention
GC threads
Better ↑
![Page 19: A Study of Garbage Collector Scalability on Multicores](https://reader034.vdocuments.site/reader034/viewer/2022051020/5681665d550346895dd9e274/html5/thumbnails/19.jpg)
19
Effect of Optimizations on the App (GC excluded)
A good balance improves a lot application timeLocality has only a marginal effect on application
While fragmented space increases locality for application over interleaved space
(recently allocated objects are the most accessed)Lokesh Gidra A Study of the Scalability of Garbage Collectors on Multicores
App
licat
ion
time PS
Other heap layouts
XML Transform from SPECjvm
GC threads
Better↓
![Page 20: A Study of Garbage Collector Scalability on Multicores](https://reader034.vdocuments.site/reader034/viewer/2022051020/5681665d550346895dd9e274/html5/thumbnails/20.jpg)
20
Overall effect (both GC and application)
Optimizations double the app throughput of SPECjbbPause time divided in half (105ms to 49ms)
Lokesh Gidra A Study of the Scalability of Garbage Collectors on Multicores
App
licat
ion
thro
ughp
ut (o
ps/m
s)
PS
Fragmented
SpecJBB
Interleaved
Fragmented + synchro
GC threads
Better ↑
![Page 21: A Study of Garbage Collector Scalability on Multicores](https://reader034.vdocuments.site/reader034/viewer/2022051020/5681665d550346895dd9e274/html5/thumbnails/21.jpg)
21
GC scales well with memory-intensive applications
Lokesh Gidra A Study of the Scalability of Garbage Collectors on Multicores
3.5GB 1GB 2GB
512MB1GB2GB
PS Fragmented + synchro
![Page 22: A Study of Garbage Collector Scalability on Multicores](https://reader034.vdocuments.site/reader034/viewer/2022051020/5681665d550346895dd9e274/html5/thumbnails/22.jpg)
22
Take Away
Previous GCs do not scale because they are not NUMA-aware Existing mature GCs can scale with standard // programming techniques Using NUMA-aware memory layouts should be useful for all GCs
(concurrent GCs included)
Most important NUMA effects1. Balancing memory access2. Memory locality only helps at high core count
Lokesh Gidra A Study of the Scalability of Garbage Collectors on Multicores
![Page 23: A Study of Garbage Collector Scalability on Multicores](https://reader034.vdocuments.site/reader034/viewer/2022051020/5681665d550346895dd9e274/html5/thumbnails/23.jpg)
23
Take Away
Previous GCs do not scale due to NUMA obliviousness Existing mature GCs can scale with standard // programming techniques Using NUMA-aware memory layouts should be useful for all GCs
(concurrent GCs included)
Most important NUMA effects1. Balancing memory access2. Memory locality at high core count
Lokesh Gidra A Study of the Scalability of Garbage Collectors on Multicores
Thank You