WALL: A Writeback-Aware LLC Management for PCM-based Main Memory Systems
Bahareh Pourshirazi*, Majed Valad Beigi†, Zhichun Zhu*, and Gokhan Memik†
* University of Illinois at Chicago
† Northwestern University
DATE-2018
IEEE/ACM Design, Automation, and Test in Europe
March 21
Dresden, Germany
Introduction
• Increasing demand for memory capacity
  – Increasing number of cores on multicore processors
    • Intel Sandy Bridge: 8 cores (16 threads)
    • IBM POWER7: 8 cores (32 threads)
  – Increasing data set sizes
    • Graph, database, scientific workloads
• Problems with DRAM
  – Scalability limitations
    • Slowed down
    • Below 16nm seems difficult
  – Periodic refresh operations
DATE 2018 – Pourshirazi et al. 3/21/2018
Phase Change Memory
• Promising technology
  – Denser than DRAM (3−12×)
  – Non-volatile storage
• Shortcomings
  – Higher access latency (4−12× DRAM)
  – Higher dynamic energy (2−40× DRAM)
  – Limited write endurance
  – Callout: especially WRITE operations
[Figure: PCM cell with wordline, bitline, and storage element]
Existing Solutions
• DRAM + PCM hybrid main memories
  – DRAM as a cache to PCM
• Modifications to PCM-based main memories
  – Optimization on PCM architecture
  – Reducing the number of writebacks from LLC to PCM
[Figure: two memory organizations. (1) CPU cores, shared LLC, PCM main memory. (2) CPU cores, shared LLC, and a main memory (MM) consisting of a DRAM cache and large PCM storage]
Summary of Contributions
• In this work
  – We propose WALL, a novel Writeback-Aware LLC management scheme
    • WALL reduces the number of writebacks from the Last Level Cache (LLC) to a PCM-based main memory
  – WALL consists of
    • Writeback-aware set-balancing mechanism
    • Writeback-aware replacement policy
Outline
• Introduction
• Background
• Motivation
• WALL
• Evaluation Results
• Conclusion
Background
• Impact of reducing write traffic on PCM
  – Lifetime enhancement
    • PCM lifetime ∝ 1 / write traffic
  – Performance improvement
    • Writes increase latency of reads by 1.2 to 1.8 times [Arjomand_ISCA2016]
  – Reduction in energy consumption
    • Writes consume ~10× the energy of reads
[Figure: normalized lifetime (×), 0 to 10, versus write reduction (%), 0 to 90]
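The relation on this slide (PCM lifetime ∝ 1 / write traffic) can be turned into a small calculation. The sketch below assumes only this idealized model: lifetime scales exactly inversely with the remaining write traffic, and wear-leveling or write-distribution effects are ignored. The function name `normalized_lifetime` is illustrative and does not come from the talk.

```python
def normalized_lifetime(write_reduction_pct: float) -> float:
    """Lifetime relative to the baseline after removing a share of the writes.

    Assumes the idealized relation: lifetime is proportional to 1 / write traffic.
    """
    if not 0 <= write_reduction_pct < 100:
        raise ValueError("write reduction must be in [0, 100)")
    remaining_traffic = 1.0 - write_reduction_pct / 100.0
    return 1.0 / remaining_traffic


if __name__ == "__main__":
    # Same x-axis points as the plot: 0% to 90% write reduction.
    for pct in range(0, 100, 10):
        print(f"{pct:2d}% fewer writes -> {normalized_lifetime(pct):.2f}x lifetime")
```

Under this model, halving the write traffic doubles the lifetime, and a 90% reduction gives about 10×, which is consistent with the axis ranges of the plot.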
Related Work
• WADE [Wang_TACO2013]
  – Reduces the number of writebacks to PCM
    • Partitions a set’s blocks into “frequent writeback” and “non-frequent writeback”
    • Tries to keep the frequent writeback blocks in the set
  – Considers the set’s blocks as the only replacement candidates
  – Complex implementation
• Set-Balancing Cache (SBC) [Rolán_Micro2009]
  – Balances the pressure on cache sets to reduce miss rate
  – It does not reduce writebacks
Motivation
• Writebacks are not uniformly distributed among LLC sets
[Figure: percentage of writebacks versus percentage of sets for three workloads: sp from NAS (annotations: 22.9%, 94.7, 5.3%), gcc from SPEC2006 (annotations: 29.6%, 94.9, 5.1%), streamcluster from PARSEC (annotations: 27.4%, 91.3, 8.7%)]
• A set with few writebacks can be used to store the dirty eviction victims of a set with many writebacks
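One way to quantify how unevenly writebacks are spread is to sort the sets by writeback count and ask what share of all writebacks the busiest sets account for. The sketch below does this for a list of per-set counts. The function name and the toy counts are made up for illustration; they are not measurements from the benchmarks above.

```python
def writeback_share_of_top_sets(wb_counts, top_fraction):
    """Share of all writebacks produced by the busiest `top_fraction` of sets."""
    if not wb_counts:
        raise ValueError("need at least one set")
    if not 0 < top_fraction <= 1:
        raise ValueError("top_fraction must be in (0, 1]")
    total = sum(wb_counts)
    if total == 0:
        return 0.0
    ranked = sorted(wb_counts, reverse=True)
    # Number of sets in the top group, at least one.
    n_top = max(1, round(top_fraction * len(ranked)))
    return sum(ranked[:n_top]) / total


if __name__ == "__main__":
    # Toy data: 2 of 8 sets produce most of the writebacks.
    counts = [40, 50, 2, 1, 3, 0, 2, 2]
    share = writeback_share_of_top_sets(counts, 0.25)
    print(f"top 25% of sets -> {share:.1%} of writebacks")
```

A highly skewed result, as in the toy data, is the situation the slide describes: a few sets generate most writebacks while many others have room to absorb dirty victims.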
Set Balancing
• LLC sets are classified into three categories
  – Writer: frequent writebacks
  – Non-writer: infrequent writebacks, underutilized
  – Neutral: neither writer, nor non-writer
• Each writer set is partnered with a non-writer set
[Figure: partner WRITER and NON-WRITER sets. The dirty LRU block evicted from the writer set is inserted into the non-writer set; writes go to PCM main memory]
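The partnering idea can be sketched in a few lines. This is a minimal model under stated assumptions, not the authors' implementation: each set is an LRU-ordered dictionary, a dirty victim from the writer set is inserted at the MRU position of its partner, and a dirty block displaced from the partner is written back to PCM. The names `CacheSet`, `insert_mru`, and `evict_lru_from_writer` are hypothetical, and lookup in the partner set and address bookkeeping are left out.

```python
from collections import OrderedDict


class CacheSet:
    """One LLC set kept in LRU order (first key = LRU, last key = MRU)."""

    def __init__(self, ways):
        self.ways = ways
        self.blocks = OrderedDict()  # tag -> dirty flag

    def insert_mru(self, tag, dirty):
        """Insert a block at MRU; return the evicted (tag, dirty) or None."""
        victim = None
        if tag not in self.blocks and len(self.blocks) >= self.ways:
            victim = self.blocks.popitem(last=False)  # drop the LRU block
        self.blocks[tag] = dirty
        self.blocks.move_to_end(tag)
        return victim


def evict_lru_from_writer(writer, partner, pcm_writes):
    """Evict the LRU block of a non-empty writer set.

    A dirty victim is parked in the partner set instead of going to PCM.
    """
    tag, dirty = writer.blocks.popitem(last=False)
    if not dirty:
        return  # clean victims are simply dropped
    displaced = partner.insert_mru(tag, dirty=True)
    # Assumption: a dirty block displaced from the partner is written back.
    if displaced is not None and displaced[1]:
        pcm_writes.append(displaced[0])
```

With a partner that still has a free way, the dirty victim causes no PCM write at all; the write is only paid if the partner itself has to give up a dirty block.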
Set Balancing (cont.)
• To determine set types, two counters are used
  – Writeback counter
  – Saturation counter [Rolán_Micro2009]
    • Measures the degree to which a set can hold its working set
    • Updated on every access miss and access hit
• Counter thresholds
  – Writeback
    • Overall average = arithmetic mean of the writeback counters wb_1, wb_2, …, wb_n
    • τlow_wb = arithmetic mean of wb_1, wb_2, …, wb_{m−1}
    • τhigh_wb = arithmetic mean of wb_m, wb_{m+1}, …, wb_n
    • wb_m marks the position of the overall average among the counters
  – Saturation
    • τsat = K/4, K is the set associativity
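The threshold computation can be sketched as follows. This is a minimal sketch that assumes the split point is the overall average itself: counters strictly below the average form the low group, and counters at or above it form the high group. How ties and the all-equal case are handled is an assumption made here, and the function names are illustrative.

```python
from statistics import mean


def writeback_thresholds(wb_counts):
    """Return (tau_low_wb, tau_high_wb) from per-set writeback counters."""
    if not wb_counts:
        raise ValueError("need at least one counter")
    overall = mean(wb_counts)
    below = [wb for wb in wb_counts if wb < overall]
    at_or_above = [wb for wb in wb_counts if wb >= overall]
    # at_or_above is never empty: the maximum is always >= the mean.
    tau_high_wb = mean(at_or_above)
    # below is empty only when all counters are equal; fall back to the mean.
    tau_low_wb = mean(below) if below else overall
    return tau_low_wb, tau_high_wb


def saturation_threshold(associativity):
    """tau_sat = K / 4, where K is the set associativity."""
    return associativity / 4
```

For example, counters `[0, 1, 2, 3, 10, 20]` have an overall average of 6, so the low group is `[0, 1, 2, 3]` and the high group is `[10, 20]`.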
Set Balancing (cont.)
• For a set with writeback count of wb and saturation counter of sat
  – Set is writer if wb ≥ τhigh_wb
  – Set is non-writer if sat ≤ τsat and wb ≤ τlow_wb
[Figure: percentage of sets (0 to 100) per workload (sp, ua, stream, dedup, gcc, mcf, mix1, mix2). Each workload has a wb bar split into < τlow_wb, [τlow_wb, τhigh_wb], > τhigh_wb and a sat bar split into < τsat, > τsat; the writer and non-writer portions are marked]
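The two rules above translate directly into a small classifier. In this sketch the thresholds are passed in as plain numbers; checking the writer rule first is an assumption that only matters when a set satisfies both conditions (for instance when τlow_wb equals τhigh_wb). The function name `classify_set` is illustrative.

```python
def classify_set(wb, sat, tau_low_wb, tau_high_wb, tau_sat):
    """Classify one LLC set as 'writer', 'non-writer', or 'neutral'."""
    if wb >= tau_high_wb:
        return "writer"
    if sat <= tau_sat and wb <= tau_low_wb:
        return "non-writer"
    return "neutral"


if __name__ == "__main__":
    # Example thresholds for a 16-way set: tau_sat = 16 / 4 = 4.
    for wb, sat in [(20, 9), (1, 2), (1, 9), (6, 2)]:
        print(wb, sat, classify_set(wb, sat, tau_low_wb=1.5, tau_high_wb=15, tau_sat=4))
```

Note that a set with few writebacks but a high saturation counter stays neutral: it is not underutilized, so it is not offered as a partner.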
Replacement Policy
• Frequent writeback block: frequently reused after being evicted from the cache
• Frequent writeback blocks are given a second chance upon eviction to stay in cache and be accessed
• To avoid performance penalty, the replacement policy is applied to the neutral or non-writer sets
[Figure: non-writer or neutral set. On EVICT of the dirty LRU block, FV = 0 becomes FV = 1 and the block stays in the set; INSERT and ACCESS operate on the set; writes go to PCM main memory]
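The second-chance idea can be sketched as a victim-selection loop. Several things here are assumptions rather than details confirmed by the slide: the FV bit is treated as "second chance already used", every dirty block with FV = 0 is eligible for the second chance, the spared block is moved to the MRU position, and the effect of a later access on FV is not modeled. The names `Block` and `choose_victim` are hypothetical.

```python
from collections import OrderedDict
from dataclasses import dataclass


@dataclass
class Block:
    dirty: bool = False
    fv: int = 0  # assumed meaning: 1 once the second chance has been used


def choose_victim(blocks):
    """Pick a victim from an LRU-ordered set (first key = LRU).

    Returns (tag, needs_writeback).
    """
    if not blocks:
        raise ValueError("cannot evict from an empty set")
    while True:
        tag = next(iter(blocks))  # current LRU block
        block = blocks[tag]
        if block.dirty and block.fv == 0:
            # Second chance: mark the block, move it to MRU,
            # then look at the next LRU candidate.
            block.fv = 1
            blocks.move_to_end(tag)
            continue
        del blocks[tag]
        return tag, block.dirty
```

The loop always terminates, because each dirty block can be spared at most once before its FV bit makes it a regular victim that is written back to PCM.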
Design
• WALL storage overhead
  – Less than 0.6% of the LLC capacity
[Figure: victim-handling flow for Set(n), driven by the set type bits ST[0] and ST[1] (writer, neutral, non-writer; ST: Set Type), the victim's dirty bit, and its FV bit. Outcomes shown: to PCM, insert into partner of Set(n) at MRU, and move to MRU while finding another victim]
Experimental Setup
• Simulator
  – GEM5 integrated with NVMAIN
• Cores
  – 8 cores, out-of-order, 2.0GHz
• Caches
  – 32KB L1 (2 cycles), 256KB L2 (12 cycles), Shared LLC 8MB (35 cycles)
  – MOESI directory
• PCM Main Memory
  – 4GB, 4 channels, 1 rank/channel, 4 banks/rank
  – t_SET = 150ns, t_RESET = 100ns, t_RCD = 120ns
  – Cell endurance = 32×10^6 writes
  – Four memory controllers
  – One read and write queue, Write drain threshold: high = 80%, low = 50%
Experimental Setup
• Workloads
  – Multi-threaded applications
    • NAS and PARSEC benchmarks
  – Multi-programmed workloads
    • SPEC CPU2006
• We run the workloads for 2 billion instructions, after a cache warm-up phase of 2 billion instructions
• We compare WALL with
  – Baseline: uses the LRU replacement policy
  – Baseline double-way: a baseline with double the associativity
  – WADE: the scheme proposed by Wang et al. [Wang_TACO2013]
LLC Writeback
• Compared to baseline, reduced by 26.6% on average
  – For writer sets, reduced by 39.5% on average
  – For non-writer sets, increased from 10.4% to 13.1%
  – For neutral sets, reduced by 28.6% on average
[Figure: normalized writebacks (0 to 1.2) for Baseline and WALL per workload (sp, ua, stream, dedup, gcc, mcf, mix1, mix2, Average), each bar broken down into non-writer, neutral, and writer sets; average reduction annotated as 26.6%]
WALL Writeback Reduction
• Compared to baseline double-way, reduced by 23.3% on average
• Compared to WADE, reduced by 16.4% on average
[Figure: LLC writeback reduction, normalized (0 to 1.2), for Baseline, Baseline double-way, WADE, and WALL per workload (sp, ua, stream, dedup, gcc, mcf, mix1, mix2, Average); average reductions annotated as 23.3% and 16.4%]
MPKI
• Compared to baseline, reduced by 2.4% on average
  – For writer sets, reduced by 27.8% on average
  – For non-writer sets, increased from 12.0% to 16.2%
  – For neutral sets, increased from 57.3% to 59.1%
[Figure: normalized MPKI (0 to 1.2) for Baseline and WALL per workload (sp, ua, stream, dedup, gcc, mcf, mix1, mix2, Average), each bar broken down into non-writer, neutral, and writer sets; average reduction annotated as 2.4%]
WALL MPKI
• Compared to baseline double-way, increased by 1.0% on average
• Compared to WADE, reduced by 0.3% on average
[Figure: normalized MPKI (0.5 to 1.2) for Baseline, Baseline double-way, WADE, and WALL per workload (sp, ua, stream, dedup, gcc, mcf, mix1, mix2, Average)]
Main Memory Energy
• Compared to baseline, reduced by 19.2% on average
• Compared to baseline double-way, reduced by 16.5% on average
• Compared to WADE, reduced by 11.3% on average
[Figure: main memory energy, normalized (0.5 to 1.1), for Baseline, Baseline double-way, WADE, and WALL per workload (sp, ua, stream, dedup, gcc, mcf, mix1, mix2, GMEAN); reductions annotated as 19.2%, 16.5%, and 11.3%]
IPC
• Compared to baseline, improved by 6.7% on average
• Compared to baseline double-way, improved by 4.9% on average
• Compared to WADE, improved by 3.2% on average
[Figure: normalized IPC (0.9 to 1.3) for Baseline, Baseline double-way, WADE, and WALL per workload (sp, ua, stream, dedup, gcc, mcf, mix1, mix2, GMEAN); improvements annotated as 6.7%, 4.9%, and 3.2%]
PCM Lifetime
• Compared to baseline scheme, increased by 1.25× on average
• Compared to baseline double-way, increased by 1.21× on average
• Compared to WADE, increased by 1.17× on average
[Figure: lifetime (years), axis from 1 to 256 in powers of two, for Baseline, Baseline double-way, WADE, and WALL per workload (sp, ua, stream, dedup, gcc, mcf, mix1, mix2, GMEAN)]
Conclusion
• We proposed WALL to reduce the number of writebacks from the LLC to the PCM main memory
• WALL includes:
  – A set-balancing mechanism
    • Uses the non-writer sets as storage for the writebacks of writer sets
  – A writeback-aware replacement policy
    • Keeps the frequently reused dirty lines of the sets
• Results show that WALL can achieve
  – Writeback reduction, by 26.6% on average
  – PCM lifetime enhancement, by 1.25× on average
  – Main memory energy efficiency, by 19.2% on average
Thank You!
Questions?