Writeback-Aware Bandwidth Partitioning for Multi-core Systems with PCM
Miao Zhou, Yu Du, Bruce Childers, Rami Melhem, Daniel Mossé
University of Pittsburgh
Writeback-Aware Bandwidth Partitioning for Multi-core Systems with PCM
http://www.cs.pitt.edu/PCM
Introduction
DRAM memory is not energy efficient: data centers are energy hungry, and DRAM consumes 20-40% of their energy.
Apply PCM as main memory: energy efficient, but slower reads, much slower writes, and a shorter lifetime.
Hybrid memory: add a DRAM cache as the LLC to improve performance (lower LLC miss rate) and extend lifetime (lower LLC writeback rate).
How to manage the shared resources?
[Figure: four cores (C0-C3), each with private L1 and L2 caches; a conventional DRAM main memory vs. a hybrid organization with a DRAM LLC in front of PCM]
Shared Resource Management
CMP systems have shared resources: the last-level cache and the memory bandwidth. Unmanaged sharing leads to interference and poor performance; partitioning the resources reduces interference and improves performance.

                           DRAM main memory                  Hybrid main memory
Cache partitioning         UCP [Qureshi et al., MICRO 39]    WCP [Zhou et al., HiPEAC'12]
Bandwidth partitioning     RBP [Liu et al., HPCA'10]         This work
Utility-based Cache Partitioning (UCP): tracks utility (LLC hits/misses) and minimizes overall LLC misses.
Read-only Bandwidth Partitioning (RBP): partitions the bus bandwidth based on LLC miss information.
Writeback-aware Cache Partitioning (WCP): tracks and minimizes LLC misses and writebacks.
Questions: 1. Is read-only (LLC miss) information enough? 2. Is bus bandwidth still the bottleneck?
[Figure: four cores (C0-C3) with private L1/L2 caches, a shared LLC, and main memory]
Bandwidth Partitioning
An analytic model guides the run-time partitioning:
- Use queuing theory to model delay
- Monitor performance to estimate the parameters of the model
- Find the partition that maximizes the system's performance
- Enforce the partition at run time
DRAM vs. hybrid main memory: PCM writes are extremely slow and power hungry.
Issues specific to hybrid main memory: Is the bottleneck the bus bandwidth or the device bandwidth? Can we ignore the bandwidth consumed by LLC writebacks?
[Figure: per-benchmark device bandwidth utilization (%), split into read and write, for SPEC CPU2006 benchmarks on DRAM memory and on hybrid DRAM+PCM memory]
DRAM memory: 1. low device bandwidth utilization; 2. memory reads (LLC misses) dominate.
Hybrid memory: 1. high device bandwidth utilization; 2. memory writes (LLC writebacks) often dominate.
RBP on Hybrid Main Memory
[Figure: throughput of RBP normalized to SHARE, plotted against the percentage of device bandwidth consumed by PCM writes (LLC writebacks), from 10% to 90%]
RBP vs. SHARE: 1. RBP outperforms SHARE for workloads dominated by PCM reads (LLC misses). 2. RBP loses to SHARE for workloads dominated by PCM writes (LLC writebacks). A new bandwidth partitioning scheme is necessary for hybrid memory.
Writeback-Aware Bandwidth Partitioning
Focus on the collective bandwidth of the PCM devices, and take LLC writeback information into account.
Token bucket algorithm: device service units = tokens; allocate tokens among the applications every epoch (5 million cycles). A minimal sketch of this idea follows below.
Analytic model: maximize weighted speedup, modeling contention for bandwidth as queuing delay. Difficulty: a write is blocking only when the write queue is full.
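A minimal sketch of the token-bucket idea in Python: the per-epoch refill and the notion that one token corresponds to one device service unit follow the slide, while the TokenBucket class and its try_issue interface are hypothetical illustrations rather than the authors' hardware mechanism.

class TokenBucket:
    def __init__(self, allocations):
        # allocations[i] = tokens (device service units) granted to application i
        # at the start of every epoch (5 million cycles on the slide).
        self.allocations = list(allocations)
        self.tokens = list(allocations)

    def new_epoch(self, allocations=None):
        # Refill the buckets, optionally with a new partition for the next epoch.
        if allocations is not None:
            self.allocations = list(allocations)
        self.tokens = list(self.allocations)

    def try_issue(self, app, cost=1):
        # A memory request from `app` is served only if it still has tokens.
        if self.tokens[app] >= cost:
            self.tokens[app] -= cost
            return True
        return False  # held back until the next epoch's refill

# Example: two applications with a 60/40 split of 1,000 service units per epoch.
bucket = TokenBucket([600, 400])
served = 0
for _ in range(1500):            # application 0 generates 1500 requests in one epoch
    if bucket.try_issue(0):
        served += 1
print(served)                    # 600: application 0 cannot exceed its allocation

The point of the mechanism is that an application whose bucket is empty is held back until the next refill, so the partition computed for an epoch is actually respected.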
Analytic Model for Bandwidth Partitioning
For a single core, the additive CPI formula is:
CPI = CPI_LLC∞ + (LLC miss frequency) × (LLC miss penalty)
where CPI_LLC∞ is the CPI with an infinite LLC and the second term is the CPI due to LLC misses.
The memory is modeled as a queue: the LLC miss rate λm is the request arrival rate, the memory bandwidth α is the request service rate, and the LLC miss penalty is approximated by the queuing delay (the time to serve requests).
For a CMP, each core i has its own LLC miss rate λm,i and is allocated a share αi of the shared memory bandwidth.
[Figure: queuing model of the shared memory with per-core arrival rates λm,1 … λm,N and allocated bandwidths α1 … αN]
Maximize Weighted Speedup
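As a concrete illustration only, the sketch below shows how such a queuing model can drive a search for the bandwidth split that maximizes weighted speedup. It assumes an M/M/1-style queuing delay 1/(α − λ) for the miss penalty, uses made-up per-core rates and base CPIs, and performs a brute-force search over discretized shares; it is not the paper's actual optimization algorithm or its numbers.

from itertools import product

def cpi(cpi_llc_inf, miss_rate, alpha):
    # Additive CPI: base CPI plus the LLC-miss term, with the miss penalty
    # modeled as an M/M/1 queuing delay 1 / (alpha - lambda).
    if alpha <= miss_rate:
        return float("inf")        # the allocated bandwidth cannot keep up
    return cpi_llc_inf + miss_rate * (1.0 / (alpha - miss_rate))

def weighted_speedup(base, miss, shares, total_bw):
    # Weighted speedup = sum over cores of CPI_alone / CPI_shared,
    # where CPI_alone gives the core the whole memory bandwidth.
    return sum(cpi(b, m, total_bw) / cpi(b, m, s * total_bw)
               for b, m, s in zip(base, miss, shares))

def best_partition(base, miss, total_bw, steps=20):
    # Brute-force search over discretized bandwidth shares that sum to 1.
    n, best_ws, best_shares = len(base), -1.0, None
    for grid in product(range(1, steps), repeat=n - 1):
        if sum(grid) >= steps:
            continue
        shares = [g / steps for g in grid] + [(steps - sum(grid)) / steps]
        ws = weighted_speedup(base, miss, shares, total_bw)
        if ws > best_ws:
            best_ws, best_shares = ws, shares
    return best_shares, best_ws

# Made-up example: two cores, one miss-heavy and one light, sharing
# 0.2 requests/cycle of memory bandwidth.
print(best_partition(base=[1.0, 1.0], miss=[0.08, 0.02], total_bw=0.2))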
Analytic Model for WBP
Taking LLC writebacks into account:
CPI = CPI_LLC∞ + (LLC miss frequency) × (LLC miss penalty) + (LLC writeback frequency) × (LLC writeback penalty)
Each core i has an LLC miss rate λm,i and an LLC writeback rate λw,i, and is allocated a read memory bandwidth αi and a write memory bandwidth βi; reads and writes are queued separately (read queue RQ and write queue WQ) at the shared memory.
The writeback contribution to CPI is weighted by P, the probability that writebacks are on the critical path (a write is blocking only when the write queue is full). How is P determined?
[Figure: queuing model with per-core miss rates λm,1 … λm,N, writeback rates λw,1 … λw,N, read bandwidths α1 … αN, and write bandwidths β1 … βN]
Maximize Weighted Speedup
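Extending the same kind of sketch with writebacks: the write term gets its own bandwidth share β and is charged to CPI only with weight P, reflecting the point above that a write is blocking only when the write queue is full. The M/M/1-style form of the write delay and all numbers are illustrative assumptions, not the paper's exact model.

def wbp_cpi(cpi_llc_inf, miss_rate, wb_rate, alpha, beta, p):
    # alpha: allocated read bandwidth (service rate for LLC misses)
    # beta:  allocated write bandwidth (service rate for LLC writebacks)
    # p:     probability that writebacks are on the critical path, so the
    #        write queuing delay is charged to CPI with weight p.
    if alpha <= miss_rate or beta <= wb_rate:
        return float("inf")
    read_penalty = 1.0 / (alpha - miss_rate)      # M/M/1-style queuing delay
    write_penalty = 1.0 / (beta - wb_rate)
    return (cpi_llc_inf
            + miss_rate * read_penalty
            + p * wb_rate * write_penalty)

# Made-up example: the same core under two read/write bandwidth splits.
print(wbp_cpi(1.0, miss_rate=0.05, wb_rate=0.04, alpha=0.10, beta=0.06, p=0.3))
print(wbp_cpi(1.0, miss_rate=0.05, wb_rate=0.04, alpha=0.08, beta=0.08, p=0.3))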
Dynamic Weight Adjustment
Choose P based on the expected number of executed instructions (EEI).
[Figure: WBP is evaluated with m candidate weights p1 … pm; each candidate produces a bandwidth partition (α, β per core) and a predicted EEI1 … EEIm, which are compared with the actual EEI measured during the epoch to select the P used next]
Bandwidth Utilization ratio (BU): utilized bandwidth : allocated bandwidth.
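A minimal sketch of the weight-selection step, assuming the model above supplies a predicted EEI for each candidate weight and that per-epoch hardware counters supply the actual instruction count and bandwidth utilizations; the function names and numbers are hypothetical, not the authors' hardware interface.

def choose_weight(candidates, predicted_eei, actual_instructions):
    # candidates[k] is a writeback weight p; predicted_eei[k] is the model's
    # expected number of executed instructions under that weight. Keep the
    # weight whose prediction was closest to what was actually measured.
    best_k = min(range(len(candidates)),
                 key=lambda k: abs(predicted_eei[k] - actual_instructions))
    return candidates[best_k]

def utilization_ratio(used_bandwidth, allocated_bandwidth):
    # BU = utilized bandwidth : allocated bandwidth, used to discount
    # bandwidth that an application was given but did not consume.
    return used_bandwidth / allocated_bandwidth if allocated_bandwidth else 0.0

# Made-up example: three candidate weights evaluated over one epoch.
p_next = choose_weight([0.25, 0.5, 1.0],
                       predicted_eei=[9.2e6, 9.8e6, 10.5e6],
                       actual_instructions=9.9e6)
print(p_next)    # 0.5: its prediction was closest to the measured count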
Architecture Overview
BUMon tracks bandwidth utilization information during an epoch.
DWA and WBP compute the bandwidth partition for the next epoch.
The Bandwidth Regulator enforces the chosen partition.
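Purely as illustrative glue for how these three components could interact each epoch, the loop below uses randomly generated monitor readings and a simple proportional heuristic standing in for the analytic model; none of this is the authors' implementation.

import random

def epoch_loop(num_epochs=3, num_apps=2, total_bw=0.2):
    partition = [total_bw / num_apps] * num_apps      # start with an even split
    for epoch in range(num_epochs):
        # BUMon stand-in: pretend to measure per-application miss rates.
        miss = [random.uniform(0.01, 0.08) for _ in range(num_apps)]
        # WBP/DWA stand-in: a proportional heuristic instead of the real model.
        total = sum(miss)
        partition = [total_bw * m / total for m in miss]
        # Regulator stand-in: the partition becomes next epoch's token allocation.
        print(f"epoch {epoch}: partition {[round(a, 3) for a in partition]}")

epoch_loop()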
Enforcing Bandwidth Partitioning
Simulation Setup
Configuration: 8-core CMP with a 168-entry instruction window; private 4-way 64KB L1 and private 8-way 2MB L2 per core; partitioned 32MB LLC with 12.5 ns latency; 64GB PCM with 4 channels of 2 ranks each, 50 ns read latency and 1000 ns write latency.
Benchmarks: SPEC CPU2006, classified into 3 types (W, R, RW) based on whether PCM reads and/or writes dominate bandwidth consumption; combined into 15 workloads (Light, High).
Sensitivity studies on write latency, number of channels, and number of cores.
Effective Read Latency
[Figure: normalized effective read latency of RBP, WBP_0.5, WBP_1.0, and WBP+DWA for workloads Light1-Light7, High1-High8, and the average]
1. Different workloads favor different static policies (partitioning weights). 2. WBP+DWA matches the best static policy (partitioning weight). 3. WBP+DWA reduces the effective read latency by 31.9% over RBP.
Throughput
[Figure: normalized throughput of RBP, WBP_0.5, WBP_1.0, and WBP+DWA for workloads Light1-Light7, High1-High8, and the average]
1. The best writeback weight varies across workloads. 2. WBP+DWA achieves performance comparable to the best static weight. 3. WBP+DWA improves throughput by 24.2% over RBP.
Fairness (Harmonic IPC)
WBP+DWA improves fairness by an average of 16.7% over RBP
[Figure: fairness (harmonic IPC) of RBP, WBP_0.5, WBP_1.0, and WBP+DWA, normalized, for workloads Light1-Light7, High1-High8, and the average]
Conclusions
PCM device bandwidth is the bottleneck in hybrid memory
Writeback information is important (LLC writebacks consume a substantial portion of memory bandwidth)
WBP can better partition the PCM bandwidth
WBP outperforms RBP by an average of 24.9% in terms of weighted speedup
Thank you
Questions?