Download - Cooperative Cache Scrubbing
Cooperative Cache Scrubbing
Jennifer B. Sartor, Wim Heirman, Steve Blackburn*, Lieven Eeckhout, Kathryn S. McKinley^
PACT 2014
* ^
Multicore Challenge
Chip
memory (DRAM)p. 2
P
$
P
$
P
$
P
$
Managed language runtime environment
Application
Operating System
Objects rapidly allocated and
short-lived
LLC
Problem: Allocation Wall
Chip
memory (DRAM)p. 3
P
$
P
$
P
$
P
$
Managed language runtime environment
Application
Operating System
DEADDEAD
DEADDEAD
DEAD
DEAD
Objects rapidly allocated and
short-lived
LLC
Problem: Bandwidth & Power Wall
Chip
memory (DRAM)p. 4
P
$
P
$
P
$
P
$
Managed language runtime environment
Application
Operating System
DEADDEAD
DEADDEAD
DEAD
DEAD 00000000000000
Objects rapidly allocated and
short-lived
Zero initialization
LLC
Cooperative Cache Scrubbing
Chip
LLC
memory (DRAM)p. 5
P
$
P
$
P
$
P
$
Managed language runtime environment
Application
Operating System
00000000000000
Objects rapidly allocated and
short-lived
Zero initialization
DEADDEAD
DEADDEAD DEAD
DEADwrite read
LLC
Generational Garbage Collection
Young objects die quickly Nursery
Traced for live objects Copy to mature space Reclaimed ‘en masse’
NurseryMature
LLC
8MBp. 6
DEADDEADDEADDEAD DEAD
DEAD
Dead Lines in LLC (8MB)
p. 7
Dead Data Written Back?
Chip
LLC
memory (DRAM)p. 8
P
$
P
$
P
$
P
$
Managed language runtime environment
Application
Operating System
DEADDEADDEAD
DEAD
DEAD
DEAD
Useless Write Backs (8MB LLC)
p. 9
Cooperative Cache Scrubbing
Communicate managed language’s semantic information to hardware
Caches ‘Scrub’ dead lines
Invalidate Unset dirty bit
Zero lines without fetch Result
Better cache management Avoid traffic to DRAM Save DRAM energy
p. 10
writes
reads
Dead Data Written in Cache?
Young objects die quickly Nursery
Traced for live objects Copy to mature space Reclaimed ‘en masse’
NurseryMature
LLCDEADDEAD
DEAD DEAD
DEADDEAD
DEAD
DEAD
p. 11
0000000
Dead Lines Written in LLC (8MB)
p. 12
SW-HW Cooperative Scrubbing
Software Identify cache line-aligned dead/zero region Generational Immix collector (stop-the-world)
After nursery collection, call scrub instruction on each line in entire range
Call zero instructions to zero region (32KB)
Hardware
p. 13
SW-HW Cooperative Scrubbing
Software Hardware
Scrubbing (LLC) clinvalidate: invalidates cache line clundirty: clears dirty bit clclean: clears dirty bit, moves line to LRU
Zeroing (L2) clzero: zero cache line without fetch
Modifications to MESI cache coherence protocol Back-propagation from LLC to L1/L2 cache levels Local coherence transitions (no off-chip)
p. 14
PowerPC’s dcbi, ARM
PowerPC’s dcbz
MESI Coherence Transitions
p. 15
M E
I S
clclean/-
clinvalidate/- clin
valid
ate/
-
clclean/-
clclean/-
clinv
alida
te/-
clinvalidate/-clclean/-
MESI Coherence Transitions
p. 16
M E
I S
clzero/-clzero/-
clze
ro/B
usIn
valid
ate
clzero/BusInvalidateB
usIn
valid
ate
BusIn
valid
ate
BusInvalidate
external: from another LLC
Methodology
Sniper simulator 4 cores, 8MB shared L3 (LLC), McPAT Extensions for JVM
Works with JIT compiler Emulate system calls (futex & nanosleep)
JVM-simulator communication with new instruction
Jikes RVM 3.1.2 and DaCapo benchmarks Generational Immix garbage collector 4 application, 4 GC threads 2x minimum heap Replay compilation, 2nd invocation
p. 17
DRAM Writes (8MB nursery)
p. 18
antlr
avro
rabl
oat
fop
jytho
n
luin
dex
luse
arch
luse
arch
.fix
pmd
sunf
low
xala
nM
ean
0
20
40
60
80
100
120
clinvalidateclundirtyclcleanclzeroclclean+clzero
Wri
tes
/Ba
se
lin
e (
%)
DRAM Writes (8MB nursery)
p. 19
antlr
avro
rabl
oat
fop
jytho
n
luin
dex
luse
arch
luse
arch
.fix
pmd
sunf
low
xala
nM
ean
0
20
40
60
80
100
120
clinvalidateclundirtyclcleanclzeroclclean+clzero
Wri
tes
/Ba
se
lin
e (
%)
DRAM Writes (8MB nursery)
p. 20
antlr
avro
rabl
oat
fop
jytho
n
luin
dex
luse
arch
luse
arch
.fix
pmd
sunf
low
xala
nM
ean
0
20
40
60
80
100
120
clinvalidateclundirtyclcleanclzeroclclean+clzero
Wri
tes
/Ba
se
lin
e (
%)
DRAM Reads (8MB nursery)
p. 21
antlr
avro
rabl
oat
fop
jytho
n
luin
dex
luse
arch
luse
arch
.fix
pmd
sunf
low
xala
nM
ean
0
25
50
75
100
125
150
175
200
225
clinvalidateclundirtyclcleanclzeroclclean+clzero
Re
ad
s/B
as
eli
ne
(%
)
DRAM Reads (8MB nursery)
p. 22
antlr
avro
rabl
oat
fop
jytho
n
luin
dex
luse
arch
luse
arch
.fix
pmd
sunf
low
xala
nM
ean
0
25
50
75
100
125
150
175
200
225
clinvalidateclundirtyclcleanclzeroclclean+clzero
Re
ad
s/B
as
eli
ne
(%
)
DRAM Reads (8MB nursery)
p. 23
antlr
avro
rabl
oat
fop
jytho
n
luin
dex
luse
arch
luse
arch
.fix
pmd
sunf
low
xala
nM
ean
0
25
50
75
100
125
150
175
200
225
clinvalidateclundirtyclcleanclzeroclclean+clzero
Re
ad
s/B
as
eli
ne
(%
)
DRAM Reads (8MB nursery)
p. 24
antlr
avro
rabl
oat
fop
jytho
n
luin
dex
luse
arch
luse
arch
.fix
pmd
sunf
low
xala
nM
ean
0
25
50
75
100
125
150
175
200
225
clinvalidateclundirtyclcleanclzeroclclean+clzero
Re
ad
s/B
as
eli
ne
(%
)
DRAM Reads (8MB nursery)
p. 25
antlr
avro
rabl
oat
fop
jytho
n
luin
dex
luse
arch
luse
arch
.fix
pmd
sunf
low
xala
nM
ean
0
25
50
75
100
125
150
175
200
225
clinvalidateclundirtyclcleanclzeroclclean+clzero
Re
ad
s/B
as
eli
ne
(%
)
Dynamic DRAM Energy (8MB nursery)
p. 26
Mean0
10
20
30
40
50
60
70
80
clinvalidateclundirtyclcleanclzeroclclean+clzero
En
erg
y R
ed
uc
tio
n (
%)
Dynamic DRAM Energy (8MB nursery)
p. 27
Mean0
10
20
30
40
50
60
70
80
clinvalidateclundirtyclcleanclzeroclclean+clzero
En
erg
y R
ed
uc
tio
n (
%)
Total DRAM Energy
p. 28
4M 8M 16M
-5
0
5
10
15
20
25
clinvalidateclundirtyclcleanclzeroclclean+clzero
En
erg
y R
ed
uc
tio
n (
%)
-22%
Total DRAM Energy
p. 29
4M 8M 16M
-5
0
5
10
15
20
25
clinvalidateclundirtyclcleanclzeroclclean+clzero
En
erg
y R
ed
uc
tio
n (
%)
-22%
Total DRAM Traffic
p. 30
4M 8M 16M
-50
-25
0
25
50
75
100
clinvalidateclundirtyclcleanclzeroclclean+clzero
Tra
ffic
Re
du
cti
on
(%
)
-14x
clclean+clzero Improvements
p. 31
DRAM R
eads
DRAM W
rites
Total
DRAM
Tra
ffic
LLC m
isses
Execu
tion
time
Dynam
ic DRAM
Ene
rgy
Total
DRAM
Ene
rgy
0%
10%
20%
30%
40%
50%
60%
70%
80%
90%
100%
4MB 8MB 16MB
Related Work
Cooperative cache management ESKIMO by Isen & John, Micro 09
Useless reads and writes to DRAM by sequential C programs
Reduce energy Require large map in hardware, extra cache bits
Wang et al., PACT 02/ ISCA 03; Sartor et al., 05 C & Fortran static analysis to give cache hints to evict or
keep data
Zero initialization [Yang et al., OOPSLA 11] Studied costs in time, cache and traffic Use non-temporal writes to DRAM, increase bandwidth
p. 32
Conclusions
Software-hardware cooperative cache scrubbing
Leverages region allocation semantics Changes to MESI coherence protocol New multicore architectural simulation
methodology Reductions 59% traffic 14% DRAM energy 4.6% execution time
p. 33
http://users.elis.ugent.be/~jsartor/
0000000DEAD
p. 34
Execution Time (8MB nursery)
p. 35
Mean0
1
2
3
4
5
6
7
clinvalidateclundirtyclcleanclzeroclclean+clzero
Ex
ec
uti
on
Tim
e R
ed
uc
tio
n (
%)
Changes to MESI coherence protocol
State clinvalidate clundirty/clclean
clzero BusInvalidate
M invalidate L1/L2 (no WB) I
invalidate L1/L2 (no WB) E(clclean LRU)
⁄ invalidate L1/L2 (no WB) I
E invalidate L1/L2 I
invalidate L1/L2 (clclean LRU)
M invalidate L1/L2 I
S invalidate L1/L2 I
invalidate L1/L2 (clclean LRU)
BusInvalidate M
invalidate L1/L2 I
I ⁄ ⁄ BusInvalidate M
⁄
p. 36
Total DRAM Energy (8MB nursery)
p. 37
antlr
avro
rabl
oat
fop
jytho
n
luin
dex
luse
arch
luse
arch
.fix
pmd
sunf
low
xala
nM
ean
-10
0
10
20
30
40
50
60
clinvalidateclundirtyclcleanclzeroclclean+clzero
En
erg
y R
ed
uc
tio
n (
%)
Execution Time Across Nurseries
p. 38
Execution Time
p. 39
Dynamic DRAM Energy 8MB Nursery
p. 40