Writeback-Aware Bandwidth Partitioning for Multi-core Systems with PCM
Miao Zhou, Yu Du, Bruce Childers, Rami Melhem, Daniel Mossé
University of Pittsburgh
Writeback-Aware Bandwidth Partitioning for Multi-core Systems with PCM
http://www.cs.pitt.edu/PCM
Introduction
DRAM memory is not energy efficient: data centers are energy hungry, and DRAM consumes 20-40% of their energy.
Apply PCM as main memory: energy efficient, but slower reads, much slower writes, and a shorter lifetime.
Hybrid memory: add a DRAM cache as the LLC to improve performance (lower LLC miss rate) and extend lifetime (lower LLC writeback rate).
How to manage the shared resources?
[Figure: four cores (C0-C3), each with private L1 and L2 caches; a conventional DRAM main memory vs. a hybrid organization with a DRAM LLC in front of PCM]
Shared Resource Management
CMP systems have shared resources: the last-level cache and the memory bandwidth. Unmanaged sharing leads to interference and poor performance; partitioning the resources reduces interference and improves performance.

                           DRAM main memory                  Hybrid main memory
Cache partitioning         UCP [Qureshi et al., MICRO 39]    WCP [Zhou et al., HiPEAC'12]
Bandwidth partitioning     RBP [Liu et al., HPCA'10]         This work
Utility-based Cache Partitioning (UCP): tracks utility (LLC hits/misses) and minimizes overall LLC misses.
Read-only Bandwidth Partitioning (RBP): partitions the bus bandwidth based on LLC miss information.
Writeback-aware Cache Partitioning (WCP): tracks and minimizes LLC misses and writebacks.
Questions: 1. Is read-only (LLC miss) information enough? 2. Is bus bandwidth still the bottleneck?
[Figure: four cores (C0-C3) with private L1/L2 caches, a shared LLC, and main memory]
Bandwidth Partitioning
An analytic model guides the run-time partitioning:
- Use queuing theory to model delay
- Monitor performance to estimate the parameters of the model
- Find the partition that maximizes the system's performance
- Enforce the partition at run time
DRAM vs. hybrid main memory: PCM writes are extremely slow and power hungry.
Issues specific to hybrid main memory: Is the bottleneck the bus bandwidth or the device bandwidth? Can we ignore the bandwidth consumed by LLC writebacks?
[Figure: per-benchmark device bandwidth utilization (%), split into read and write, for SPEC CPU2006 benchmarks on DRAM memory and on hybrid DRAM+PCM memory]
DRAM memory: 1. low device bandwidth utilization; 2. memory reads (LLC misses) dominate.
Hybrid memory: 1. high device bandwidth utilization; 2. memory writes (LLC writebacks) often dominate.
RBP on Hybrid Main Memory
[Figure: throughput of RBP normalized to SHARE, plotted against the percentage of device bandwidth consumed by PCM writes (LLC writebacks), from 10% to 90%]
RBP vs. SHARE: 1. RBP outperforms SHARE for workloads dominated by PCM reads (LLC misses). 2. RBP loses to SHARE for workloads dominated by PCM writes (LLC writebacks). A new bandwidth partitioning scheme is necessary for hybrid memory.
Writeback-Aware Bandwidth Partitioning
Focus on the collective bandwidth of the PCM devices, and take LLC writeback information into account.
Token bucket algorithm: device service units = tokens; allocate tokens among the applications every epoch (5 million cycles). A minimal sketch of this idea follows below.
Analytic model: maximize weighted speedup, modeling contention for bandwidth as queuing delay. Difficulty: a write is blocking only when the write queue is full.
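A minimal sketch of the token-bucket idea in Python: the per-epoch refill and the notion that one token corresponds to one device service unit follow the slide, while the TokenBucket class and its try_issue interface are hypothetical illustrations rather than the authors' hardware mechanism.

class TokenBucket:
    def __init__(self, allocations):
        # allocations[i] = tokens (device service units) granted to application i
        # at the start of every epoch (5 million cycles on the slide).
        self.allocations = list(allocations)
        self.tokens = list(allocations)

    def new_epoch(self, allocations=None):
        # Refill the buckets, optionally with a new partition for the next epoch.
        if allocations is not None:
            self.allocations = list(allocations)
        self.tokens = list(self.allocations)

    def try_issue(self, app, cost=1):
        # A memory request from `app` is served only if it still has tokens.
        if self.tokens[app] >= cost:
            self.tokens[app] -= cost
            return True
        return False  # held back until the next epoch's refill

# Example: two applications with a 60/40 split of 1,000 service units per epoch.
bucket = TokenBucket([600, 400])
served = 0
for _ in range(1500):            # application 0 generates 1500 requests in one epoch
    if bucket.try_issue(0):
        served += 1
print(served)                    # 600: application 0 cannot exceed its allocation

The point of the mechanism is that an application whose bucket is empty is held back until the next refill, so the partition computed for an epoch is actually respected.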
Analytic Model for Bandwidth Partitioning
For a single core, the additive CPI formula is:
CPI = CPI_LLC∞ + (LLC miss frequency) × (LLC miss penalty)
where CPI_LLC∞ is the CPI with an infinite LLC and the second term is the CPI due to LLC misses.
The memory is modeled as a queue: the LLC miss rate λm is the request arrival rate, the memory bandwidth α is the request service rate, and the LLC miss penalty is approximated by the queuing delay (the time to serve requests).
For a CMP, each core i has its own LLC miss rate λm,i and is allocated a share αi of the shared memory bandwidth.
[Figure: queuing model of the shared memory with per-core arrival rates λm,1 … λm,N and allocated bandwidths α1 … αN]
Maximize Weighted Speedup
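As a concrete illustration only, the sketch below shows how such a queuing model can drive a search for the bandwidth split that maximizes weighted speedup. It assumes an M/M/1-style queuing delay 1/(α − λ) for the miss penalty, uses made-up per-core rates and base CPIs, and performs a brute-force search over discretized shares; it is not the paper's actual optimization algorithm or its numbers.

from itertools import product

def cpi(cpi_llc_inf, miss_rate, alpha):
    # Additive CPI: base CPI plus the LLC-miss term, with the miss penalty
    # modeled as an M/M/1 queuing delay 1 / (alpha - lambda).
    if alpha <= miss_rate:
        return float("inf")        # the allocated bandwidth cannot keep up
    return cpi_llc_inf + miss_rate * (1.0 / (alpha - miss_rate))

def weighted_speedup(base, miss, shares, total_bw):
    # Weighted speedup = sum over cores of CPI_alone / CPI_shared,
    # where CPI_alone gives the core the whole memory bandwidth.
    return sum(cpi(b, m, total_bw) / cpi(b, m, s * total_bw)
               for b, m, s in zip(base, miss, shares))

def best_partition(base, miss, total_bw, steps=20):
    # Brute-force search over discretized bandwidth shares that sum to 1.
    n, best_ws, best_shares = len(base), -1.0, None
    for grid in product(range(1, steps), repeat=n - 1):
        if sum(grid) >= steps:
            continue
        shares = [g / steps for g in grid] + [(steps - sum(grid)) / steps]
        ws = weighted_speedup(base, miss, shares, total_bw)
        if ws > best_ws:
            best_ws, best_shares = ws, shares
    return best_shares, best_ws

# Made-up example: two cores, one miss-heavy and one light, sharing
# 0.2 requests/cycle of memory bandwidth.
print(best_partition(base=[1.0, 1.0], miss=[0.08, 0.02], total_bw=0.2))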
Analytic Model for WBP
Taking LLC writebacks into account:
CPI = CPI_LLC∞ + (LLC miss frequency) × (LLC miss penalty) + (LLC writeback frequency) × (LLC writeback penalty)
Each core i has an LLC miss rate λm,i and an LLC writeback rate λw,i, and is allocated a read memory bandwidth αi and a write memory bandwidth βi; reads and writes are queued separately (read queue RQ and write queue WQ) at the shared memory.
The writeback contribution to CPI is weighted by P, the probability that writebacks are on the critical path (a write is blocking only when the write queue is full). How is P determined?
[Figure: queuing model with per-core miss rates λm,1 … λm,N, writeback rates λw,1 … λw,N, read bandwidths α1 … αN, and write bandwidths β1 … βN]
Maximize Weighted Speedup
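Extending the same kind of sketch with writebacks: the write term gets its own bandwidth share β and is charged to CPI only with weight P, reflecting the point above that a write is blocking only when the write queue is full. The M/M/1-style form of the write delay and all numbers are illustrative assumptions, not the paper's exact model.

def wbp_cpi(cpi_llc_inf, miss_rate, wb_rate, alpha, beta, p):
    # alpha: allocated read bandwidth (service rate for LLC misses)
    # beta:  allocated write bandwidth (service rate for LLC writebacks)
    # p:     probability that writebacks are on the critical path, so the
    #        write queuing delay is charged to CPI with weight p.
    if alpha <= miss_rate or beta <= wb_rate:
        return float("inf")
    read_penalty = 1.0 / (alpha - miss_rate)      # M/M/1-style queuing delay
    write_penalty = 1.0 / (beta - wb_rate)
    return (cpi_llc_inf
            + miss_rate * read_penalty
            + p * wb_rate * write_penalty)

# Made-up example: the same core under two read/write bandwidth splits.
print(wbp_cpi(1.0, miss_rate=0.05, wb_rate=0.04, alpha=0.10, beta=0.06, p=0.3))
print(wbp_cpi(1.0, miss_rate=0.05, wb_rate=0.04, alpha=0.08, beta=0.08, p=0.3))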
Dynamic Weight Adjustment
Choose P based on the expected number of executed instructions (EEI).
[Figure: WBP is evaluated with m candidate weights p1 … pm; each candidate produces a bandwidth partition (α, β per core) and a predicted EEI1 … EEIm, which are compared with the actual EEI measured during the epoch to select the P used next]
Bandwidth Utilization ratio (BU): utilized bandwidth : allocated bandwidth.
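A minimal sketch of the weight-selection step, assuming the model above supplies a predicted EEI for each candidate weight and that per-epoch hardware counters supply the actual instruction count and bandwidth utilizations; the function names and numbers are hypothetical, not the authors' hardware interface.

def choose_weight(candidates, predicted_eei, actual_instructions):
    # candidates[k] is a writeback weight p; predicted_eei[k] is the model's
    # expected number of executed instructions under that weight. Keep the
    # weight whose prediction was closest to what was actually measured.
    best_k = min(range(len(candidates)),
                 key=lambda k: abs(predicted_eei[k] - actual_instructions))
    return candidates[best_k]

def utilization_ratio(used_bandwidth, allocated_bandwidth):
    # BU = utilized bandwidth : allocated bandwidth, used to discount
    # bandwidth that an application was given but did not consume.
    return used_bandwidth / allocated_bandwidth if allocated_bandwidth else 0.0

# Made-up example: three candidate weights evaluated over one epoch.
p_next = choose_weight([0.25, 0.5, 1.0],
                       predicted_eei=[9.2e6, 9.8e6, 10.5e6],
                       actual_instructions=9.9e6)
print(p_next)    # 0.5: its prediction was closest to the measured count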
Architecture Overview
BUMon tracks bandwidth utilization information during an epoch.
DWA and WBP compute the bandwidth partition for the next epoch.
The Bandwidth Regulator enforces the chosen partition.
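Purely as illustrative glue for how these three components could interact each epoch, the loop below uses randomly generated monitor readings and a simple proportional heuristic standing in for the analytic model; none of this is the authors' implementation.

import random

def epoch_loop(num_epochs=3, num_apps=2, total_bw=0.2):
    partition = [total_bw / num_apps] * num_apps      # start with an even split
    for epoch in range(num_epochs):
        # BUMon stand-in: pretend to measure per-application miss rates.
        miss = [random.uniform(0.01, 0.08) for _ in range(num_apps)]
        # WBP/DWA stand-in: a proportional heuristic instead of the real model.
        total = sum(miss)
        partition = [total_bw * m / total for m in miss]
        # Regulator stand-in: the partition becomes next epoch's token allocation.
        print(f"epoch {epoch}: partition {[round(a, 3) for a in partition]}")

epoch_loop()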
Enforcing Bandwidth Partitioning
Simulation Setup
Configuration: 8-core CMP with a 168-entry instruction window; private 4-way 64KB L1 and private 8-way 2MB L2 per core; partitioned 32MB LLC with 12.5 ns latency; 64GB PCM with 4 channels of 2 ranks each, 50 ns read latency and 1000 ns write latency.
Benchmarks: SPEC CPU2006, classified into 3 types (W, R, RW) based on whether PCM reads and/or writes dominate bandwidth consumption; combined into 15 workloads (Light, High).
Sensitivity studies on write latency, number of channels, and number of cores.
Effective Read Latency
[Figure: normalized effective read latency of RBP, WBP_0.5, WBP_1.0, and WBP+DWA for workloads Light1-Light7, High1-High8, and the average]
1. Different workloads favor different static policies (partitioning weights). 2. WBP+DWA matches the best static policy (partitioning weight). 3. WBP+DWA reduces the effective read latency by 31.9% over RBP.
Throughput
[Figure: normalized throughput of RBP, WBP_0.5, WBP_1.0, and WBP+DWA for workloads Light1-Light7, High1-High8, and the average]
1. The best writeback weight varies across workloads. 2. WBP+DWA achieves performance comparable to the best static weight. 3. WBP+DWA improves throughput by 24.2% over RBP.
Fairness (Harmonic IPC)
WBP+DWA improves fairness by an average of 16.7% over RBP
[Figure: fairness (harmonic IPC) of RBP, WBP_0.5, WBP_1.0, and WBP+DWA, normalized, for workloads Light1-Light7, High1-High8, and the average]
Conclusions
PCM device bandwidth is the bottleneck in hybrid memory
Writeback information is important (LLC writebacks consume a substantial portion of memory bandwidth)
WBP can better partition the PCM bandwidth
WBP outperforms RBP by an average of 24.9% in terms of weighted speedup
Thank you
Questions?