accord: associativity for dram caches by coordinating way … · 2018-06-20 · accord:...
TRANSCRIPT
ACCORD: Associativity for DRAM Caches by Coordinating Way-Install and Way-Prediction
ISCA 2018
1
Authors: Vinson Young (GT), Chiachen Chou (GT), Aamer Jaleel (NVIDIA), Moinuddin K. Qureshi (GT)
3D-DRAM MITIGATES BANDWIDTH WALL
2
Modern systems packing many cores → Bandwidth Wall
3D-Stacked DRAM: Hybrid Memory Cube (HMC) from Micron, High Bandwidth Memory (HBM) from Samsung
✔ 4-8x bandwidth (of traditional memory)
✘ Limited capacity
3D-DRAM + High-Capacity Memory = Hybrid Memory
USE 3D-DRAM AS A CACHE
3
[Figure: memory hierarchy, from fast to slow: CPUs with L1$ and L2$, L3$, DRAM-Cache (3D-DRAM), then System Memory (NVM / DRAM) as the OS-visible space]
Using 3D-DRAM as a DRAM cache can improve memory bandwidth (and avoid OS/software changes)
MCDRAM from Intel
ARCHITECTING LARGE DRAM CACHES
4
Organize at line granularity (64B) for capacity/BW utilization
Gigascale cache needs large tag-store (tens of MBs)
[Figure: 3D-DRAM holding 4GB of data and 128MB of tags. Tags? Too large for SRAM]
ARCHITECTING LARGE DRAM CACHES
5
Organize at line granularity (64B) for high cache utilization
Gigascale cache needs large tag-store (tens of MBs)
Practical designs must store tags in DRAM
[Figure: 3D-DRAM holding 4GB of data and 128MB of tags]
How to architect tag-store for low-latency tag access?
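The figures on this slide (4GB of data, 64B lines, 128MB of tags) can be cross-checked with a little arithmetic. The sketch below assumes binary units (1GB = 2^30 bytes); the function name is illustrative and not from the talk, and the per-line tag budget is only what the slide's two numbers imply.

```python
GIB = 1 << 30
MIB = 1 << 20

def tag_store_bytes(cache_bytes: int, line_bytes: int, tag_bytes_per_line: int) -> int:
    """Total tag storage for a cache organized at line granularity."""
    num_lines = cache_bytes // line_bytes
    return num_lines * tag_bytes_per_line

# Number of 64B lines in a 4GB cache.
num_lines = (4 * GIB) // 64
# Per-line tag budget implied by a 128MB tag store.
bytes_per_tag = (128 * MIB) // num_lines
print(num_lines, bytes_per_tag)
```

Under these assumptions the cache holds 2^26 lines, so a 128MB tag store corresponds to 2 bytes of tag state per line.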
EFFICIENT TAG ORGANIZATION (KNL CACHE)
6
Practical designs use a 64B line size, store the tag with the data, and are direct-mapped, to optimize for hit latency.
[Figure: DRAM row laid out as Tag Data | Tag Data | Tag Data | Tag Data]
Tag-With-Data [Alloy Cache, Intel Knights Landing]
Single Tag+Data lookup (1x hit latency), but direct-mapped
The Intel Knights Landing product (MCDRAM) uses this DRAM-cache organization.
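To make the Tag-With-Data idea concrete, here is a minimal software sketch of a direct-mapped cache in which one access returns the tag and the data together. It is an illustration only: the class name, the modulo indexing, and the access counter are assumptions, not the Alloy Cache or Knights Landing implementation.

```python
from typing import List, Optional, Tuple

class TagWithDataCache:
    """Direct-mapped cache where each entry stores the tag next to the data."""

    def __init__(self, num_sets: int) -> None:
        self.num_sets = num_sets
        self.entries: List[Optional[Tuple[int, bytes]]] = [None] * num_sets
        self.dram_accesses = 0

    def _split(self, line_addr: int) -> Tuple[int, int]:
        # Assumed mapping: low bits pick the set, remaining bits form the tag.
        return line_addr % self.num_sets, line_addr // self.num_sets

    def lookup(self, line_addr: int) -> Optional[bytes]:
        index, tag = self._split(line_addr)
        self.dram_accesses += 1            # one access fetches tag and data together
        entry = self.entries[index]
        if entry is not None and entry[0] == tag:
            return entry[1]
        return None

    def install(self, line_addr: int, data: bytes) -> None:
        index, tag = self._split(line_addr)
        self.entries[index] = (tag, data)  # direct-mapped: overwrite the only candidate
```

Every lookup, hit or miss, costs exactly one access in this model, which is the property the later slides try to preserve while adding associativity.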
POTENTIAL OF ASSOCIATIVITY
7
[Figure: (a) Hit Rate (%) for 1-way, 2-way, 4-way, and 8-way caches; (b) Speedup (Parallel) and (c) Speedup (Idealized) for 2-way, 4-way, and 8-way caches. Annotation: Reduce 25% of misses.]
Assumes a 16-core system with a 4GB DRAM cache in front of PCM memory.
How can we make DRAM caches associative?
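ASSOCIATIVITY OPTION 1: SERIAL TAG LOOKUP
8
[Figure: 2-way set (Way0, Way1) holding lines A and B; the address probes one way first and probes the other only if the first probe misses]
Serial Tag Lookup enables associativity, but it incurs a serialization delay.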
ASSOCIATIVITY OPTION 2: PARALLEL TAG LOOKUP
9
[Figure: 2-way set (Way0, Way1) holding lines A and B; the address probes both ways at the same time]
Parallel Tag Lookup avoids the serialization latency, but it introduces a 2x bandwidth cost.
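The trade-off between the two options can be written down as a small cost model. This is a sketch under simple assumptions (each way probe is one DRAM access, and a serial lookup probes ways in index order); the function names are illustrative.

```python
from typing import Optional, Tuple

def serial_lookup_cost(num_ways: int, hit_way: Optional[int]) -> Tuple[int, int]:
    """Return (dram_accesses, serialized_steps) when ways are probed one by one.

    hit_way is the way holding the line, or None if the line is not resident.
    """
    probes = num_ways if hit_way is None else hit_way + 1
    return probes, probes

def parallel_lookup_cost(num_ways: int, hit_way: Optional[int]) -> Tuple[int, int]:
    """Return (dram_accesses, serialized_steps) when all ways are probed at once."""
    return num_ways, 1

print(serial_lookup_cost(2, 1), parallel_lookup_cost(2, 1))
```

For a 2-way cache, the serial scheme pays an extra step only when the line is in the second way or absent, while the parallel scheme always pays two accesses of bandwidth.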
ASSOCIATIVITY FOR DRAM CACHE (PARALLEL)
10
[Figure: (a) Hit Rate (%) for 1-way, 2-way, 4-way, and 8-way caches; (b) Speedup (Parallel) and (c) Speedup (Idealized) for 2-way, 4-way, and 8-way caches. Annotations: Reduce 25% of misses; -46%.]
Naively increasing associativity actually degrades performance because of the increased bandwidth cost.
ASSOCIATIVITY FOR DRAM CACHE (IDEAL)
11
[Figure: (a) Hit Rate (%) for 1-way, 2-way, 4-way, and 8-way caches; (b) Speedup (Parallel) and (c) Speedup (Idealized) for 2-way, 4-way, and 8-way caches. Annotations: Reduce 25% of misses; -46%; 21% (with latency / BW of direct-mapped).]
Associativity must still maintain the latency/BW of direct-mapped caches. How?
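OPTION 3: WAY-PREDICTED TAG LOOKUP
12
[Figure: Way-Predicted Tag Lookup. A way prediction selects which of Way0 / Way1 (holding lines A and B) the address probes first; the other way is probed only if that probe misses]
Way-Predicted Tag Lookup can obtain an improved hit rate with the bandwidth and latency of a direct-mapped cache.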
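How much a way predictor saves depends on how often it is right. The sketch below counts accesses for a single way-predicted lookup and gives the average cost per hit for a 2-way cache; it assumes one DRAM access per probed way and a fixed fallback order, and the function names are illustrative.

```python
from typing import Optional

def way_predicted_accesses(predicted_way: int, resident_way: Optional[int],
                           num_ways: int = 2) -> int:
    """DRAM accesses for one lookup: probe the predicted way first, then the rest in order."""
    if resident_way == predicted_way:
        return 1
    if resident_way is None:
        return num_ways                    # every way must be checked to confirm a miss
    order = [predicted_way] + [w for w in range(num_ways) if w != predicted_way]
    return order.index(resident_way) + 1

def expected_hit_accesses_2way(accuracy: float) -> float:
    """Average accesses per hit in a 2-way cache for a given way-prediction accuracy."""
    return accuracy * 1 + (1.0 - accuracy) * 2
```

With a perfect predictor every hit costs one access, as in a direct-mapped cache; each percentage point of misprediction adds 0.01 accesses per hit in the 2-way case.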
WAY-PREDICTION ACCURACY & COST
13
                            MRU Pred (1 bit/set)   Partial-Tag (4 bit/line)
Way-Pred Accuracy (2-way)   85.7%                  97.3%
Accuracy (4-way)            74.3%                  91.6%
Accuracy (8-way)            63.2%                  81.2%
SRAM Storage                4MB                    32MB
Prior methods for way-prediction have low accuracy and/or high storage overhead.
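The SRAM storage figures for the two prior predictors follow from the cache geometry. The sketch below assumes the 4GB, 64B-line cache used elsewhere in the talk and a 2-way organization for the per-set MRU bit; the function name and parameters are illustrative.

```python
GIB = 1 << 30
MIB = 1 << 20

def predictor_sram_bytes(cache_bytes: int, line_bytes: int, ways: int,
                         bits_per_set: int = 0, bits_per_line: int = 0) -> int:
    """SRAM needed for predictor state kept per set and/or per line."""
    num_lines = cache_bytes // line_bytes
    num_sets = num_lines // ways
    total_bits = num_sets * bits_per_set + num_lines * bits_per_line
    return total_bits // 8

# MRU predictor: 1 bit per set (assuming a 2-way cache).
mru_bytes = predictor_sram_bytes(4 * GIB, 64, ways=2, bits_per_set=1)
# Partial-tag predictor: 4 bits per line.
partial_tag_bytes = predictor_sram_bytes(4 * GIB, 64, ways=2, bits_per_line=4)
print(mru_bytes // MIB, partial_tag_bytes // MIB)
```

Under these assumptions the two results match the 4MB and 32MB entries in the table.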
TOWARDS ASSOCIATIVITY W/ WAY-PREDICTION
14
[Figure: Way-Predicted Tag Lookup. A way prediction selects which of Way0 / Way1 the address probes first; the other way is probed only if that probe misses]
Goal: low-storage-overhead, high-accuracy way-prediction, to enable an associative DRAM cache
ACCORD OVERVIEW
15
• Background
• ACCORD
  – Probabilistic Way-Steering (PWS)
  – Ganged Way-Steering (GWS)
  – Skewed Way-Steering (SWS)
• Summary
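INSIGHT: WAY-PREDICTABILITY AT LOW STORAGE?
16
[Figure: two 2-way caches (Way0, Way1) holding lines with even and odd tags. Base Install Policy (Rand): even and odd tags are mixed across both ways, so the way is hard to predict (~50%). Tag-based Install Policy: even and odd tags are each steered to their own way, so the way is predicted 100% of the time, but the cache is effectively direct-mapped.]
Insight: Modifying the install policy can make way-prediction much simpler!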
PROPOSAL: ACCORD
17
AssoCiativity by CoORDinating way-install and prediction.
[Figure: a 2-way cache (Way0, Way1) holding lines A2, A3, B3, B5, B7, with the Way Install Policy and the Way Predictor coordinating]
ACCORD achieves a way-predictable cache at low cost.
PROBABILISTIC WAY-STEERING
19
[Figure: Install using PWS. Lines A0-A7 of page A and B0-B7 of page B are installed into a 2-way cache (Way0, Way1). Bias = 90% toward the preferred way and 10% toward the other way, so static prediction of the preferred way is ~90% accurate. The cache will use both ways and improve hit-rate.]
PWS enables way-predictability at the cost of learning more slowly to use both ways (hit-rate).
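The install rule can be sketched in a few lines. This is a simplified model, not the authors' implementation: it assumes a 2-way cache, picks the preferred way from the low bit of the tag (in the spirit of the even/odd example earlier in the talk), ignores evictions, and measures only how often a freshly installed line sits in its preferred way. All names are illustrative.

```python
import random

def preferred_way(tag: int) -> int:
    """Assumed tag-based preference for a 2-way cache: even tags -> way 0, odd -> way 1."""
    return tag & 1

def pws_install_way(tag: int, bias: float, rng: random.Random) -> int:
    """Install in the preferred way with probability `bias`, else in the other way."""
    pref = preferred_way(tag)
    return pref if rng.random() < bias else 1 - pref

def static_prediction_accuracy(bias: float, num_lines: int = 100_000, seed: int = 0) -> float:
    """Fraction of installed lines that a 'always guess the preferred way' predictor finds."""
    rng = random.Random(seed)
    correct = 0
    for tag in range(num_lines):
        installed = pws_install_way(tag, bias, rng)
        correct += installed == preferred_way(tag)
    return correct / num_lines

print(static_prediction_accuracy(0.85))
```

In this model the static predictor's accuracy tracks the install bias: a bias of 100% degenerates to the fully predictable tag-based policy, while a bias of 50% is the unbiased random policy.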
SENSITIVITY TO PWS PROBABILITY
20
[Figure: Way-Pred Accuracy (%) versus the bias for selecting the "preferred way" (50%, 60%, 70%, 80%, 85%, 90%, 100%); the ends of the range are labeled 2-way design and Direct-mapped. A second axis, Miss Reduction (%), is not yet populated.]
Preferred-way Install Probability = x% bias to install in the preferred way
SENSITIVITY TO PWS PROBABILITY
21
[Figure: Miss Reduction (%) and Way-Pred Accuracy (%) versus Preferred-way Install Probability (50%, 60%, 70%, 80%, 85%, 90%, 100%); the ends of the range are labeled 2-way design and Direct-mapped.]
SENSITIVITY TO PWS PROBABILITY
22
[Figure: Speedup (%), Miss Reduction (%), and Way-Pred Accuracy (%) versus Preferred-way Install Probability.]
Preferred-way Install Probability   50%    60%    70%    80%    85%    90%    100%
Speedup                             2.6%   3.7%   4.7%   5.5%   5.6%   5.3%   0.0%
A Preferred-way Install Probability of 85% provides the best trade-off between hit-rate and way-prediction accuracy, for a 5.6% speedup.
GANGED WAY-STEERING
24
[Figure: lines A0-A7 and B0-B7 installed into a 2-way cache (Way0, Way1) under two policies, each with a marked preferred way. Probabilistic Way-Steering: per-line randomized decision; Pred ~50%. Ganged Way-Steering: per-page randomized decision; Pred >90%.]
Ganged Way-Steering makes the install decision at a large granularity, to improve predictability for workloads with high spatial locality.
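GANGED WAY-STEERING IMPLEMENTATION
25
[Figure: a 2-way cache (Way0, Way1) holding lines A2, A3, B3, B5, B7. On an install, the Recent Install Table (RIT) maps a RegionID to a Way that guides the install. On an access, the Recent Lookup Table (RLT) maps a RegionID to a Way that is used as the prediction. Example values shown: RegionIDs 0x001 and 0x101, ways 0 and 1.]
GWS: per-region last-way install + last-way prediction. A 64-entry RIT and a 64-entry RLT need only 320 bytes.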
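A software sketch of the two tables may help. It is a simplified model rather than the hardware design: it assumes a region is a 4KB page, uses LRU replacement inside each 64-entry table, falls back to a PWS-style biased choice for regions not in the RIT, and derives the preferred way from the region number. All of those choices and all names are assumptions. As a side note, 320 bytes over 128 entries works out to 20 bits per entry.

```python
import random
from collections import OrderedDict
from typing import Optional

REGION_BYTES = 4096   # assumption: one region = one 4KB page
TABLE_ENTRIES = 64    # from the slide: 64-entry RIT and 64-entry RLT

class RecentTable:
    """Small LRU table mapping region ID -> way."""

    def __init__(self, capacity: int = TABLE_ENTRIES) -> None:
        self.capacity = capacity
        self.entries: "OrderedDict[int, int]" = OrderedDict()

    def get(self, region: int) -> Optional[int]:
        if region in self.entries:
            self.entries.move_to_end(region)   # mark as most recently used
            return self.entries[region]
        return None

    def put(self, region: int, way: int) -> None:
        self.entries[region] = way
        self.entries.move_to_end(region)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)   # evict the least recently used region

class GangedWaySteering:
    def __init__(self, bias: float = 0.85, seed: int = 0) -> None:
        self.rit = RecentTable()   # Recent Install Table: guides installs
        self.rlt = RecentTable()   # Recent Lookup Table: guides predictions
        self.bias = bias
        self.rng = random.Random(seed)

    @staticmethod
    def region_of(addr: int) -> int:
        return addr // REGION_BYTES

    def preferred_way(self, addr: int) -> int:
        return self.region_of(addr) & 1        # assumed stand-in for a tag-based preference

    def install_way(self, addr: int) -> int:
        region = self.region_of(addr)
        way = self.rit.get(region)
        if way is None:                        # region not seen recently: PWS-style choice
            pref = self.preferred_way(addr)
            way = pref if self.rng.random() < self.bias else 1 - pref
        self.rit.put(region, way)
        return way

    def predict_way(self, addr: int) -> int:
        way = self.rlt.get(self.region_of(addr))
        return self.preferred_way(addr) if way is None else way

    def record_lookup(self, addr: int, way: int) -> None:
        self.rlt.put(self.region_of(addr), way)
```

In this model, all lines of a region installed while the region stays in the RIT land in the same way, so one resolved lookup in that region is enough for the RLT to predict the rest correctly.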
PWS+GWS WAY-PREDICTION ACCURACY
26
[Figure: Way-Pred Acc (%), y-axis from 70% to 100%, for PWS and PWS+GWS, shown for Libquantum and for the average of 21 workloads.]
PWS has ~85% base accuracy. GWS enables spatial workloads to have near-100% accuracy.
The combination of PWS+GWS achieves 90% accuracy, at the cost of 320B of storage.
PWS+GWS (ACCORD 2-WAY) RESULTS
27
[Figure: Speedup, y-axis from 0% to 12%, for PWS, PWS+GWS, and Perfect way-prediction; annotated with 7.3% speedup.]
PWS + GWS obtains a 7.3% speedup, out of the 10% speedup of a perfectly-predicted 2-way cache.
System assumes a 4GB DRAM cache and PCM-based main memory.
DIFFICULTY IN SCALING TO N-WAYS
29
• Scaling ACCORD to N-ways
  – ACCORD 4-way has 3% speedup
  – ACCORD 8-way has 6% slowdown…
• Miss confirmation: an N-way cache needs N accesses to confirm that a line is not resident
[Figure: 4-way set (Way0, Way1, Way2, Way3) holding lines A, B, C, D; an access to E checks every way before it is declared a miss]
We need solutions to reduce miss-confirmation
SOLUTION: SKEWED WAY-STEERING
30
[Figure: 4-way cache (Way0, Way1, Way2, Way3) with 2-skew: each line has one preferred way plus one alternate way. Accesses to A, B, and C are shown; an access to E needs only 2 lookups to determine a miss.]
Restricting placement reduces miss-confirmation → hit-rate benefits without any storage overhead
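One way to picture the 2-skew restriction is as a function from a line's tag to its two candidate ways. The hash below is a made-up example, not the mapping used in ACCORD: it only guarantees that the alternate way differs from the preferred way and that different tags spread over different pairs. It requires at least 2 ways.

```python
from typing import Tuple

def candidate_ways(tag: int, num_ways: int) -> Tuple[int, int]:
    """Return (preferred, alternate) ways for a line; hypothetical hash, num_ways >= 2."""
    preferred = tag % num_ways
    # Offset in 1..num_ways-1, so the alternate way never equals the preferred way.
    offset = 1 + (tag // num_ways) % (num_ways - 1)
    alternate = (preferred + offset) % num_ways
    return preferred, alternate

def miss_confirmation_lookups(num_ways: int, skewed: bool) -> int:
    """Accesses needed to confirm that a line is not resident."""
    return 2 if skewed else num_ways

for tag in range(4):
    print(tag, candidate_ways(tag, 4))
```

Because a line can only live in its two candidate ways, a miss is confirmed after two lookups regardless of the total number of ways, while different lines in the same set can still spread across all of them.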
SPEEDUP FROM ACCORD (WITH SWS)
31
[Figure: Speedup, y-axis from 0% to 12%, of ACCORD with SWS for 2-Way, 4-Way, and 8-Way configurations.]
SWS 8-way achieves 11% speedup
SUMMARY OF ACCORD
• ACCORD: associative DRAM caches by coordinating way-install and way-prediction.
• Probabilistic Way-Steering
  – Biased install enables accurate static way-prediction
• Ganged Way-Steering
  – Region-based install enables accurate region-based way-prediction
• Skewed Way-Steering
  – Skew enables flexibility in line placement, while maintaining miss cost
• ACCORD enables associativity at negligible storage cost (320B), to achieve 11% speedup.
33
ACCORD BACKUP SLIDES
34
REPLACEMENT POLICY?
• LRU
  – State in SRAM
    • 1 bit per line needs 8MB, the size of a last-level cache
  – State in DRAM
    • 9% slowdown due to state-update cost (hit to alternate way)
35
COMPARISON TO OTHER WAY PREDICTORS
36
[Figure: Speedup, y-axis from -6% to 12%, of ACCORD and other way predictors.]
ACCORD outperforms other predictors while needing negligible storage overhead (320 B)
COLUMN-ASSOCIATIVE CACHE
• Column-associative / Hash-Rehash cache
  – Install lines in preferred way (way-0)
  – On eviction, move line to alternate way (way-1)
  – On hit to alternate way, move to preferred way
• Effectiveness
  – In general, way-prediction accuracy is similar to MRU
  – But it requires significant bandwidth to swap lines on a hit to the alternate way. The CA-cache thus causes a 4% slowdown.
37
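The three column-associative rules above can be modeled for a single 2-way set as follows. This is a toy sketch that tracks only tags and counts swaps as a proxy for the extra bandwidth; the class name and counter are illustrative.

```python
from typing import List, Optional

class ColumnAssociativeSet:
    """One 2-way set: way 0 is the preferred way, way 1 is the alternate way."""

    def __init__(self) -> None:
        self.ways: List[Optional[int]] = [None, None]
        self.swaps = 0

    def access(self, tag: int) -> bool:
        """Return True on a hit, False on a miss (the line is then installed)."""
        if self.ways[0] == tag:
            return True
        if self.ways[1] == tag:
            # Hit in the alternate way: swap so the line moves to the preferred way.
            self.ways[0], self.ways[1] = self.ways[1], self.ways[0]
            self.swaps += 1
            return True
        # Miss: install in way 0 and move the line evicted from way 0 to way 1.
        self.ways[1] = self.ways[0]
        self.ways[0] = tag
        return False
```

Every hit to the alternate way triggers a swap, and each swap rewrites both lines, which is the bandwidth cost the slide refers to.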