nathan beckmann o-an tsai and aniel anchez...
TRANSCRIPT
![Page 1: NATHAN BECKMANN O-AN TSAI AND ANIEL ANCHEZ ...people.csail.mit.edu/sanchez/papers/2015.cdcs.hpca...CDCS reduces the distance to data through joint thread and data placement CDCS outperforms](https://reader035.vdocuments.site/reader035/viewer/2022071404/60f93a93c9b0536bdb49d749/html5/thumbnails/1.jpg)
SCALING DISTRIBUTED CACHE
HIERARCHIES THROUGH COMPUTATION
AND DATA CO-SCHEDULING
NATHAN BECKMANN, PO-AN TSAI AND DANIEL SANCHEZ
MIT CSAIL
Near
![Page 2: NATHAN BECKMANN O-AN TSAI AND ANIEL ANCHEZ ...people.csail.mit.edu/sanchez/papers/2015.cdcs.hpca...CDCS reduces the distance to data through joint thread and data placement CDCS outperforms](https://reader035.vdocuments.site/reader035/viewer/2022071404/60f93a93c9b0536bdb49d749/html5/thumbnails/2.jpg)
Executive Summary
2
Thread Private L2
Core
L1I L1D
LLC Bank
LLC Data
Private L2
Core
L1I L1D
LLC Bank
Private L2
Core
L1I L1D
LLC BankNear
FarFar
FarFar
![Page 3: NATHAN BECKMANN O-AN TSAI AND ANIEL ANCHEZ ...people.csail.mit.edu/sanchez/papers/2015.cdcs.hpca...CDCS reduces the distance to data through joint thread and data placement CDCS outperforms](https://reader035.vdocuments.site/reader035/viewer/2022071404/60f93a93c9b0536bdb49d749/html5/thumbnails/3.jpg)
Executive Summary
3
Thread
LLC Data
Many missesLow hit latency
Moderate missesMedium hit latency
Few missesHigh hit latency
More capacity does not always mean better performance
![Page 4: NATHAN BECKMANN O-AN TSAI AND ANIEL ANCHEZ ...people.csail.mit.edu/sanchez/papers/2015.cdcs.hpca...CDCS reduces the distance to data through joint thread and data placement CDCS outperforms](https://reader035.vdocuments.site/reader035/viewer/2022071404/60f93a93c9b0536bdb49d749/html5/thumbnails/4.jpg)
Executive Summary
4
Threads access different data
Threads access same data
![Page 5: NATHAN BECKMANN O-AN TSAI AND ANIEL ANCHEZ ...people.csail.mit.edu/sanchez/papers/2015.cdcs.hpca...CDCS reduces the distance to data through joint thread and data placement CDCS outperforms](https://reader035.vdocuments.site/reader035/viewer/2022071404/60f93a93c9b0536bdb49d749/html5/thumbnails/5.jpg)
Executive Summary
5
CDCS jointly places threads and data to reduce data movement
Improves performance by 46% on average and by up to 76% Saves 36% of system energy Uses low-overhead algorithms that perform within 1% of
impractical, idealized solutions
![Page 6: NATHAN BECKMANN O-AN TSAI AND ANIEL ANCHEZ ...people.csail.mit.edu/sanchez/papers/2015.cdcs.hpca...CDCS reduces the distance to data through joint thread and data placement CDCS outperforms](https://reader035.vdocuments.site/reader035/viewer/2022071404/60f93a93c9b0536bdb49d749/html5/thumbnails/6.jpg)
Agenda
6
Background
CDCS Design
Evaluation
![Page 7: NATHAN BECKMANN O-AN TSAI AND ANIEL ANCHEZ ...people.csail.mit.edu/sanchez/papers/2015.cdcs.hpca...CDCS reduces the distance to data through joint thread and data placement CDCS outperforms](https://reader035.vdocuments.site/reader035/viewer/2022071404/60f93a93c9b0536bdb49d749/html5/thumbnails/7.jpg)
Capacity vs latency
7
App: 471.omnetppfrom SPECCPU2006
MPKI
Cache size
85
0
6x6 mesh, 18MB NUCA
Thread running on this tile
Acc
. La
tenc
y
2.5MB
![Page 8: NATHAN BECKMANN O-AN TSAI AND ANIEL ANCHEZ ...people.csail.mit.edu/sanchez/papers/2015.cdcs.hpca...CDCS reduces the distance to data through joint thread and data placement CDCS outperforms](https://reader035.vdocuments.site/reader035/viewer/2022071404/60f93a93c9b0536bdb49d749/html5/thumbnails/8.jpg)
Capacity vs latency
8
App: 471.omnetppfrom SPECCPU2006
MPKI
Cache size
85
0 2.5MB
Acc
. La
tenc
y
Place data in local bank
![Page 9: NATHAN BECKMANN O-AN TSAI AND ANIEL ANCHEZ ...people.csail.mit.edu/sanchez/papers/2015.cdcs.hpca...CDCS reduces the distance to data through joint thread and data placement CDCS outperforms](https://reader035.vdocuments.site/reader035/viewer/2022071404/60f93a93c9b0536bdb49d749/html5/thumbnails/9.jpg)
Capacity vs latency
9
App: 471.omnetppfrom SPECCPU2006
MPKI
Cache size
85
0 2.5MB
Acc
. La
tenc
y
Use closest banks that just fit working set
3.7x speedup
Banks Perf
1 1x
![Page 10: NATHAN BECKMANN O-AN TSAI AND ANIEL ANCHEZ ...people.csail.mit.edu/sanchez/papers/2015.cdcs.hpca...CDCS reduces the distance to data through joint thread and data placement CDCS outperforms](https://reader035.vdocuments.site/reader035/viewer/2022071404/60f93a93c9b0536bdb49d749/html5/thumbnails/10.jpg)
Capacity vs latency
10
App: 471.omnetppfrom SPECCPU2006
MPKI
Cache size
85
0 2.5MB
Acc
. La
tenc
y
Place data across the chip
2.4x speedup
Banks Perf
1 1x
5 3.7x
In NUCA, using more capacity than needed is detrimental!
![Page 11: NATHAN BECKMANN O-AN TSAI AND ANIEL ANCHEZ ...people.csail.mit.edu/sanchez/papers/2015.cdcs.hpca...CDCS reduces the distance to data through joint thread and data placement CDCS outperforms](https://reader035.vdocuments.site/reader035/viewer/2022071404/60f93a93c9b0536bdb49d749/html5/thumbnails/11.jpg)
Thread placement matters
11
App: 471.omnetpp471.omnetpp471.omnetpp471.omnetpp
MPKI
Cache size
85
0 2.5MB
Acc
. La
tenc
y
3.3x speedup each
Capacity contention changes achievable access latency!
Not contended
Contended
Banks Perf
1 1x
5 3.7x
36 2.4x
![Page 12: NATHAN BECKMANN O-AN TSAI AND ANIEL ANCHEZ ...people.csail.mit.edu/sanchez/papers/2015.cdcs.hpca...CDCS reduces the distance to data through joint thread and data placement CDCS outperforms](https://reader035.vdocuments.site/reader035/viewer/2022071404/60f93a93c9b0536bdb49d749/html5/thumbnails/12.jpg)
Thread placement matters
12
App: 471.omnetpp471.omnetpp471.omnetpp471.omnetpp
MPKI
Cache size
85
0 2.5MB
Acc
. La
tenc
y
3.7x speedup
Spread out threads
Banks Perf
1 1x
5 3.7x
36 2.4x
![Page 13: NATHAN BECKMANN O-AN TSAI AND ANIEL ANCHEZ ...people.csail.mit.edu/sanchez/papers/2015.cdcs.hpca...CDCS reduces the distance to data through joint thread and data placement CDCS outperforms](https://reader035.vdocuments.site/reader035/viewer/2022071404/60f93a93c9b0536bdb49d749/html5/thumbnails/13.jpg)
Thread placement matters
13
App: 4-thread from SPECOMP2012
MPKI
Cache size
85
0 0.5MB
Acc
. La
tenc
y
Spread out threads
Threads are far-away from data
![Page 14: NATHAN BECKMANN O-AN TSAI AND ANIEL ANCHEZ ...people.csail.mit.edu/sanchez/papers/2015.cdcs.hpca...CDCS reduces the distance to data through joint thread and data placement CDCS outperforms](https://reader035.vdocuments.site/reader035/viewer/2022071404/60f93a93c9b0536bdb49d749/html5/thumbnails/14.jpg)
Thread placement matters
14
App: 4-thread from SPECOMP2012
MPKI
Cache size
85
0 0.5MB
Acc
. La
tenc
y
1.4x speedup
Cluster threads
Clustered
Spread
One scheduler does not fit all applications!
![Page 15: NATHAN BECKMANN O-AN TSAI AND ANIEL ANCHEZ ...people.csail.mit.edu/sanchez/papers/2015.cdcs.hpca...CDCS reduces the distance to data through joint thread and data placement CDCS outperforms](https://reader035.vdocuments.site/reader035/viewer/2022071404/60f93a93c9b0536bdb49d749/html5/thumbnails/15.jpg)
Dynamic NUCA
15
Mix: 4 x 471.omnetpp4-thread
R-NUCA [Hardavellas’09]
Place data
Control capacity
Place threads
![Page 16: NATHAN BECKMANN O-AN TSAI AND ANIEL ANCHEZ ...people.csail.mit.edu/sanchez/papers/2015.cdcs.hpca...CDCS reduces the distance to data through joint thread and data placement CDCS outperforms](https://reader035.vdocuments.site/reader035/viewer/2022071404/60f93a93c9b0536bdb49d749/html5/thumbnails/16.jpg)
Partitioned NUCA
16
Mix: 4 x 471.omnetpp4-thread
Jigsaw [Beckmann’13]
Place data
Control capacity
Place threads When capacity is managed well, thread placement becomes important!
![Page 17: NATHAN BECKMANN O-AN TSAI AND ANIEL ANCHEZ ...people.csail.mit.edu/sanchez/papers/2015.cdcs.hpca...CDCS reduces the distance to data through joint thread and data placement CDCS outperforms](https://reader035.vdocuments.site/reader035/viewer/2022071404/60f93a93c9b0536bdb49d749/html5/thumbnails/17.jpg)
CDCS
17
Mix: 4 x 471.omnetpp4-thread
CDCS
Place data
Control capacity
Place threads
![Page 18: NATHAN BECKMANN O-AN TSAI AND ANIEL ANCHEZ ...people.csail.mit.edu/sanchez/papers/2015.cdcs.hpca...CDCS reduces the distance to data through joint thread and data placement CDCS outperforms](https://reader035.vdocuments.site/reader035/viewer/2022071404/60f93a93c9b0536bdb49d749/html5/thumbnails/18.jpg)
Agenda
18
Background
CDCS Design
Operation
Optimization
Evaluation
![Page 19: NATHAN BECKMANN O-AN TSAI AND ANIEL ANCHEZ ...people.csail.mit.edu/sanchez/papers/2015.cdcs.hpca...CDCS reduces the distance to data through joint thread and data placement CDCS outperforms](https://reader035.vdocuments.site/reader035/viewer/2022071404/60f93a93c9b0536bdb49d749/html5/thumbnails/19.jpg)
CDCS Overview
19
Optimization
Steady-stateOperation
Monitoring
Place threads & data
Sample accesses
Miss curves
Hardware
Software
![Page 20: NATHAN BECKMANN O-AN TSAI AND ANIEL ANCHEZ ...people.csail.mit.edu/sanchez/papers/2015.cdcs.hpca...CDCS reduces the distance to data through joint thread and data placement CDCS outperforms](https://reader035.vdocuments.site/reader035/viewer/2022071404/60f93a93c9b0536bdb49d749/html5/thumbnails/20.jpg)
Operation
20
Group partitions from different banks to create virtual
caches (VCs)
Similar to Jigsaw [Beckmann’13]
Private L2 VC1
VC3
VC2
……
4x4 mesh NUCA LLC
Core
L1I L1D
LLC Bank NoC
![Page 21: NATHAN BECKMANN O-AN TSAI AND ANIEL ANCHEZ ...people.csail.mit.edu/sanchez/papers/2015.cdcs.hpca...CDCS reduces the distance to data through joint thread and data placement CDCS outperforms](https://reader035.vdocuments.site/reader035/viewer/2022071404/60f93a93c9b0536bdb49d749/html5/thumbnails/21.jpg)
Optimization
21
Minimize sum of on-chip latency and off-chip latency by deciding:
Thread placement
Virtual cache capacity
Virtual cache data placement
It’s an NP-hard problem
Thread and data placement are interrelated
Similar to VLSI place & route, HPC cluster scheduling
ThreadPlacement
DataPlacement
![Page 22: NATHAN BECKMANN O-AN TSAI AND ANIEL ANCHEZ ...people.csail.mit.edu/sanchez/papers/2015.cdcs.hpca...CDCS reduces the distance to data through joint thread and data placement CDCS outperforms](https://reader035.vdocuments.site/reader035/viewer/2022071404/60f93a93c9b0536bdb49d749/html5/thumbnails/22.jpg)
Insight: Decouple the dependency
22
ThreadPlacement
By placing data twice, CDCS disentangles the dependencies
DataPlacement
DataPlacement
OptimisticData
Placement
Refined Data
Placement
OptimisticAssumption
Done
inform
![Page 23: NATHAN BECKMANN O-AN TSAI AND ANIEL ANCHEZ ...people.csail.mit.edu/sanchez/papers/2015.cdcs.hpca...CDCS reduces the distance to data through joint thread and data placement CDCS outperforms](https://reader035.vdocuments.site/reader035/viewer/2022071404/60f93a93c9b0536bdb49d749/html5/thumbnails/23.jpg)
Latency-aware allocation
23
Assume no contention
Best size
Virtual Cache sizeLa
tenc
y
On-chip latency
Off-chip latency
Total latency
Miss curve x MemLatency
![Page 24: NATHAN BECKMANN O-AN TSAI AND ANIEL ANCHEZ ...people.csail.mit.edu/sanchez/papers/2015.cdcs.hpca...CDCS reduces the distance to data through joint thread and data placement CDCS outperforms](https://reader035.vdocuments.site/reader035/viewer/2022071404/60f93a93c9b0536bdb49d749/html5/thumbnails/24.jpg)
Latency-aware allocation
24
Use total latency curve to partition cache among VCs
Tota
l acc
ess
late
ncy
Cache size0
VC2
VC3
VC1
Capacity
![Page 25: NATHAN BECKMANN O-AN TSAI AND ANIEL ANCHEZ ...people.csail.mit.edu/sanchez/papers/2015.cdcs.hpca...CDCS reduces the distance to data through joint thread and data placement CDCS outperforms](https://reader035.vdocuments.site/reader035/viewer/2022071404/60f93a93c9b0536bdb49d749/html5/thumbnails/25.jpg)
Optimistic VC placement
25
Place VC as compactly as possible
Estimating contention of every bank for VC
VC placed around least-contended tile
![Page 26: NATHAN BECKMANN O-AN TSAI AND ANIEL ANCHEZ ...people.csail.mit.edu/sanchez/papers/2015.cdcs.hpca...CDCS reduces the distance to data through joint thread and data placement CDCS outperforms](https://reader035.vdocuments.site/reader035/viewer/2022071404/60f93a93c9b0536bdb49d749/html5/thumbnails/26.jpg)
Thread placement
26
Place threads at center of mass of their accesses
![Page 27: NATHAN BECKMANN O-AN TSAI AND ANIEL ANCHEZ ...people.csail.mit.edu/sanchez/papers/2015.cdcs.hpca...CDCS reduces the distance to data through joint thread and data placement CDCS outperforms](https://reader035.vdocuments.site/reader035/viewer/2022071404/60f93a93c9b0536bdb49d749/html5/thumbnails/27.jpg)
Refined VC placement
27
Move/trade cache lines between VCs
Greedily place VC close to thread first
![Page 28: NATHAN BECKMANN O-AN TSAI AND ANIEL ANCHEZ ...people.csail.mit.edu/sanchez/papers/2015.cdcs.hpca...CDCS reduces the distance to data through joint thread and data placement CDCS outperforms](https://reader035.vdocuments.site/reader035/viewer/2022071404/60f93a93c9b0536bdb49d749/html5/thumbnails/28.jpg)
Scalable reconfiguration & monitoring
28
Incremental reconfiguration Allows chip to reconfigure smoothly, without pausing cores
Geometric monitor Monitors large LLC with low overhead
See paper for details
![Page 29: NATHAN BECKMANN O-AN TSAI AND ANIEL ANCHEZ ...people.csail.mit.edu/sanchez/papers/2015.cdcs.hpca...CDCS reduces the distance to data through joint thread and data placement CDCS outperforms](https://reader035.vdocuments.site/reader035/viewer/2022071404/60f93a93c9b0536bdb49d749/html5/thumbnails/29.jpg)
Agenda
29
Background
CDCS design
Evaluation
Methodology
Performance
Sensitivity
![Page 30: NATHAN BECKMANN O-AN TSAI AND ANIEL ANCHEZ ...people.csail.mit.edu/sanchez/papers/2015.cdcs.hpca...CDCS reduces the distance to data through joint thread and data placement CDCS outperforms](https://reader035.vdocuments.site/reader035/viewer/2022071404/60f93a93c9b0536bdb49d749/html5/thumbnails/30.jpg)
Methodology
30
Systems:
64-core, 512KB/L3 bank
OOO cores (Silvermont-like)
8x8 Mesh network
Similar to Knights Landing
Zsim [Sanchez’13]: Pin-based simulator
Workloads: SPEC CPU2006, SPEC OMP2012
Mem
/ IO
L3 Bank
OOO Core
L1I L1D
L2
Mem / IO
Mem
/ IO
Mem / IO64-core; 8x8 mesh network
![Page 31: NATHAN BECKMANN O-AN TSAI AND ANIEL ANCHEZ ...people.csail.mit.edu/sanchez/papers/2015.cdcs.hpca...CDCS reduces the distance to data through joint thread and data placement CDCS outperforms](https://reader035.vdocuments.site/reader035/viewer/2022071404/60f93a93c9b0536bdb49d749/html5/thumbnails/31.jpg)
Methodology
31
Schemes
S-NUCA (baseline) with clustering thread scheduler
R-NUCA with clustering thread scheduler
Jigsaw
Jigsaw+C: Jigsaw with clustering thread scheduler
Jigsaw+R: Jigsaw with random thread scheduler
CDCSD-NUCA
Partitioned NUCA CDCS
Place data
Control capacity
Place threads
![Page 32: NATHAN BECKMANN O-AN TSAI AND ANIEL ANCHEZ ...people.csail.mit.edu/sanchez/papers/2015.cdcs.hpca...CDCS reduces the distance to data through joint thread and data placement CDCS outperforms](https://reader035.vdocuments.site/reader035/viewer/2022071404/60f93a93c9b0536bdb49d749/html5/thumbnails/32.jpg)
Multi-programmed mixes
32
Workloads that do not share data
18% GMEAN
34% GMEAN
38% GMEAN
46%
GMEAN
CDCS avoids capacity contention more effectively than random scheduler
![Page 33: NATHAN BECKMANN O-AN TSAI AND ANIEL ANCHEZ ...people.csail.mit.edu/sanchez/papers/2015.cdcs.hpca...CDCS reduces the distance to data through joint thread and data placement CDCS outperforms](https://reader035.vdocuments.site/reader035/viewer/2022071404/60f93a93c9b0536bdb49d749/html5/thumbnails/33.jpg)
Multi-threaded mixes
33
Workloads that share dataCDCS guards against pathological behavior
incurred by fixed thread scheduling policies
9% GMEAN
14% GMEAN
19% GMEAN
21%
GMEAN
Clustering is better now
![Page 34: NATHAN BECKMANN O-AN TSAI AND ANIEL ANCHEZ ...people.csail.mit.edu/sanchez/papers/2015.cdcs.hpca...CDCS reduces the distance to data through joint thread and data placement CDCS outperforms](https://reader035.vdocuments.site/reader035/viewer/2022071404/60f93a93c9b0536bdb49d749/html5/thumbnails/34.jpg)
Undercommitted multi-threaded mixes
34
SPECOMP mixes using half of the cores
11% GMEAN
17% GMEAN
21% GMEAN
26%
GMEAN
With more flexibility, CDCS dynamically clusters or spreads out threads
![Page 35: NATHAN BECKMANN O-AN TSAI AND ANIEL ANCHEZ ...people.csail.mit.edu/sanchez/papers/2015.cdcs.hpca...CDCS reduces the distance to data through joint thread and data placement CDCS outperforms](https://reader035.vdocuments.site/reader035/viewer/2022071404/60f93a93c9b0536bdb49d749/html5/thumbnails/35.jpg)
CDCS vs idealized algorithms
35
Integer Linear Programming (ILP)
Simulated annealing
Within 1%Multi-programmed mixes
Same for multi-threaded mixes
Algorithm runtime overhead (%)
0 0.08 0.08 0.2 6.8 196
![Page 36: NATHAN BECKMANN O-AN TSAI AND ANIEL ANCHEZ ...people.csail.mit.edu/sanchez/papers/2015.cdcs.hpca...CDCS reduces the distance to data through joint thread and data placement CDCS outperforms](https://reader035.vdocuments.site/reader035/viewer/2022071404/60f93a93c9b0536bdb49d749/html5/thumbnails/36.jpg)
See paper for additional results
36
Under-committed system
Traffic breakdown
Energy breakdown
Factor analysis
Other sensitivity studies
Reconfiguration interval sweep
Incremental reconfiguration IPC trace
![Page 37: NATHAN BECKMANN O-AN TSAI AND ANIEL ANCHEZ ...people.csail.mit.edu/sanchez/papers/2015.cdcs.hpca...CDCS reduces the distance to data through joint thread and data placement CDCS outperforms](https://reader035.vdocuments.site/reader035/viewer/2022071404/60f93a93c9b0536bdb49d749/html5/thumbnails/37.jpg)
Conclusions
37
Thread placement has a large impact on NUCA performance when capacity is well managed
CDCS reduces the distance to data through joint thread and data placement
CDCS outperforms state-of-the-art NUCA techniques with different thread scheduling policies and prevents pathological behavior of fixed policies
![Page 38: NATHAN BECKMANN O-AN TSAI AND ANIEL ANCHEZ ...people.csail.mit.edu/sanchez/papers/2015.cdcs.hpca...CDCS reduces the distance to data through joint thread and data placement CDCS outperforms](https://reader035.vdocuments.site/reader035/viewer/2022071404/60f93a93c9b0536bdb49d749/html5/thumbnails/38.jpg)
QUESTIONS
38