nathan beckmann o-an tsai and aniel anchez...

38
SCALING DISTRIBUTED CACHE HIERARCHIES THROUGH COMPUTATION AND D ATA CO-SCHEDULING NATHAN BECKMANN, PO-AN TSAI AND D ANIEL SANCHEZ MIT CSAIL Near

Upload: others

Post on 25-Feb-2021

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: NATHAN BECKMANN O-AN TSAI AND ANIEL ANCHEZ ...people.csail.mit.edu/sanchez/papers/2015.cdcs.hpca...CDCS reduces the distance to data through joint thread and data placement CDCS outperforms

SCALING DISTRIBUTED CACHE

HIERARCHIES THROUGH COMPUTATION

AND DATA CO-SCHEDULING

NATHAN BECKMANN, PO-AN TSAI AND DANIEL SANCHEZ

MIT CSAIL

Near

Page 2: NATHAN BECKMANN O-AN TSAI AND ANIEL ANCHEZ ...people.csail.mit.edu/sanchez/papers/2015.cdcs.hpca...CDCS reduces the distance to data through joint thread and data placement CDCS outperforms

Executive Summary

2

Thread Private L2

Core

L1I L1D

LLC Bank

LLC Data

Private L2

Core

L1I L1D

LLC Bank

Private L2

Core

L1I L1D

LLC BankNear

FarFar

FarFar

Page 3: NATHAN BECKMANN O-AN TSAI AND ANIEL ANCHEZ ...people.csail.mit.edu/sanchez/papers/2015.cdcs.hpca...CDCS reduces the distance to data through joint thread and data placement CDCS outperforms

Executive Summary

3

Thread

LLC Data

Many missesLow hit latency

Moderate missesMedium hit latency

Few missesHigh hit latency

More capacity does not always mean better performance

Page 4: NATHAN BECKMANN O-AN TSAI AND ANIEL ANCHEZ ...people.csail.mit.edu/sanchez/papers/2015.cdcs.hpca...CDCS reduces the distance to data through joint thread and data placement CDCS outperforms

Executive Summary

4

Threads access different data

Threads access same data

Page 5: NATHAN BECKMANN O-AN TSAI AND ANIEL ANCHEZ ...people.csail.mit.edu/sanchez/papers/2015.cdcs.hpca...CDCS reduces the distance to data through joint thread and data placement CDCS outperforms

Executive Summary

5

CDCS jointly places threads and data to reduce data movement

Improves performance by 46% on average and by up to 76% Saves 36% of system energy Uses low-overhead algorithms that perform within 1% of

impractical, idealized solutions

Page 6: NATHAN BECKMANN O-AN TSAI AND ANIEL ANCHEZ ...people.csail.mit.edu/sanchez/papers/2015.cdcs.hpca...CDCS reduces the distance to data through joint thread and data placement CDCS outperforms

Agenda

6

Background

CDCS Design

Evaluation

Page 7: NATHAN BECKMANN O-AN TSAI AND ANIEL ANCHEZ ...people.csail.mit.edu/sanchez/papers/2015.cdcs.hpca...CDCS reduces the distance to data through joint thread and data placement CDCS outperforms

Capacity vs latency

7

App: 471.omnetppfrom SPECCPU2006

MPKI

Cache size

85

0

6x6 mesh, 18MB NUCA

Thread running on this tile

Acc

. La

tenc

y

2.5MB

Page 8: NATHAN BECKMANN O-AN TSAI AND ANIEL ANCHEZ ...people.csail.mit.edu/sanchez/papers/2015.cdcs.hpca...CDCS reduces the distance to data through joint thread and data placement CDCS outperforms

Capacity vs latency

8

App: 471.omnetppfrom SPECCPU2006

MPKI

Cache size

85

0 2.5MB

Acc

. La

tenc

y

Place data in local bank

Page 9: NATHAN BECKMANN O-AN TSAI AND ANIEL ANCHEZ ...people.csail.mit.edu/sanchez/papers/2015.cdcs.hpca...CDCS reduces the distance to data through joint thread and data placement CDCS outperforms

Capacity vs latency

9

App: 471.omnetppfrom SPECCPU2006

MPKI

Cache size

85

0 2.5MB

Acc

. La

tenc

y

Use closest banks that just fit working set

3.7x speedup

Banks Perf

1 1x

Page 10: NATHAN BECKMANN O-AN TSAI AND ANIEL ANCHEZ ...people.csail.mit.edu/sanchez/papers/2015.cdcs.hpca...CDCS reduces the distance to data through joint thread and data placement CDCS outperforms

Capacity vs latency

10

App: 471.omnetppfrom SPECCPU2006

MPKI

Cache size

85

0 2.5MB

Acc

. La

tenc

y

Place data across the chip

2.4x speedup

Banks Perf

1 1x

5 3.7x

In NUCA, using more capacity than needed is detrimental!

Page 11: NATHAN BECKMANN O-AN TSAI AND ANIEL ANCHEZ ...people.csail.mit.edu/sanchez/papers/2015.cdcs.hpca...CDCS reduces the distance to data through joint thread and data placement CDCS outperforms

Thread placement matters

11

App: 471.omnetpp471.omnetpp471.omnetpp471.omnetpp

MPKI

Cache size

85

0 2.5MB

Acc

. La

tenc

y

3.3x speedup each

Capacity contention changes achievable access latency!

Not contended

Contended

Banks Perf

1 1x

5 3.7x

36 2.4x

Page 12: NATHAN BECKMANN O-AN TSAI AND ANIEL ANCHEZ ...people.csail.mit.edu/sanchez/papers/2015.cdcs.hpca...CDCS reduces the distance to data through joint thread and data placement CDCS outperforms

Thread placement matters

12

App: 471.omnetpp471.omnetpp471.omnetpp471.omnetpp

MPKI

Cache size

85

0 2.5MB

Acc

. La

tenc

y

3.7x speedup

Spread out threads

Banks Perf

1 1x

5 3.7x

36 2.4x

Page 13: NATHAN BECKMANN O-AN TSAI AND ANIEL ANCHEZ ...people.csail.mit.edu/sanchez/papers/2015.cdcs.hpca...CDCS reduces the distance to data through joint thread and data placement CDCS outperforms

Thread placement matters

13

App: 4-thread from SPECOMP2012

MPKI

Cache size

85

0 0.5MB

Acc

. La

tenc

y

Spread out threads

Threads are far-away from data

Page 14: NATHAN BECKMANN O-AN TSAI AND ANIEL ANCHEZ ...people.csail.mit.edu/sanchez/papers/2015.cdcs.hpca...CDCS reduces the distance to data through joint thread and data placement CDCS outperforms

Thread placement matters

14

App: 4-thread from SPECOMP2012

MPKI

Cache size

85

0 0.5MB

Acc

. La

tenc

y

1.4x speedup

Cluster threads

Clustered

Spread

One scheduler does not fit all applications!

Page 15: NATHAN BECKMANN O-AN TSAI AND ANIEL ANCHEZ ...people.csail.mit.edu/sanchez/papers/2015.cdcs.hpca...CDCS reduces the distance to data through joint thread and data placement CDCS outperforms

Dynamic NUCA

15

Mix: 4 x 471.omnetpp4-thread

R-NUCA [Hardavellas’09]

Place data

Control capacity

Place threads

Page 16: NATHAN BECKMANN O-AN TSAI AND ANIEL ANCHEZ ...people.csail.mit.edu/sanchez/papers/2015.cdcs.hpca...CDCS reduces the distance to data through joint thread and data placement CDCS outperforms

Partitioned NUCA

16

Mix: 4 x 471.omnetpp4-thread

Jigsaw [Beckmann’13]

Place data

Control capacity

Place threads When capacity is managed well, thread placement becomes important!

Page 17: NATHAN BECKMANN O-AN TSAI AND ANIEL ANCHEZ ...people.csail.mit.edu/sanchez/papers/2015.cdcs.hpca...CDCS reduces the distance to data through joint thread and data placement CDCS outperforms

CDCS

17

Mix: 4 x 471.omnetpp4-thread

CDCS

Place data

Control capacity

Place threads

Page 18: NATHAN BECKMANN O-AN TSAI AND ANIEL ANCHEZ ...people.csail.mit.edu/sanchez/papers/2015.cdcs.hpca...CDCS reduces the distance to data through joint thread and data placement CDCS outperforms

Agenda

18

Background

CDCS Design

Operation

Optimization

Evaluation

Page 19: NATHAN BECKMANN O-AN TSAI AND ANIEL ANCHEZ ...people.csail.mit.edu/sanchez/papers/2015.cdcs.hpca...CDCS reduces the distance to data through joint thread and data placement CDCS outperforms

CDCS Overview

19

Optimization

Steady-stateOperation

Monitoring

Place threads & data

Sample accesses

Miss curves

Hardware

Software

Page 20: NATHAN BECKMANN O-AN TSAI AND ANIEL ANCHEZ ...people.csail.mit.edu/sanchez/papers/2015.cdcs.hpca...CDCS reduces the distance to data through joint thread and data placement CDCS outperforms

Operation

20

Group partitions from different banks to create virtual

caches (VCs)

Similar to Jigsaw [Beckmann’13]

Private L2 VC1

VC3

VC2

……

4x4 mesh NUCA LLC

Core

L1I L1D

LLC Bank NoC

Page 21: NATHAN BECKMANN O-AN TSAI AND ANIEL ANCHEZ ...people.csail.mit.edu/sanchez/papers/2015.cdcs.hpca...CDCS reduces the distance to data through joint thread and data placement CDCS outperforms

Optimization

21

Minimize sum of on-chip latency and off-chip latency by deciding:

Thread placement

Virtual cache capacity

Virtual cache data placement

It’s an NP-hard problem

Thread and data placement are interrelated

Similar to VLSI place & route, HPC cluster scheduling

ThreadPlacement

DataPlacement

Page 22: NATHAN BECKMANN O-AN TSAI AND ANIEL ANCHEZ ...people.csail.mit.edu/sanchez/papers/2015.cdcs.hpca...CDCS reduces the distance to data through joint thread and data placement CDCS outperforms

Insight: Decouple the dependency

22

ThreadPlacement

By placing data twice, CDCS disentangles the dependencies

DataPlacement

DataPlacement

OptimisticData

Placement

Refined Data

Placement

OptimisticAssumption

Done

inform

Page 23: NATHAN BECKMANN O-AN TSAI AND ANIEL ANCHEZ ...people.csail.mit.edu/sanchez/papers/2015.cdcs.hpca...CDCS reduces the distance to data through joint thread and data placement CDCS outperforms

Latency-aware allocation

23

Assume no contention

Best size

Virtual Cache sizeLa

tenc

y

On-chip latency

Off-chip latency

Total latency

Miss curve x MemLatency

Page 24: NATHAN BECKMANN O-AN TSAI AND ANIEL ANCHEZ ...people.csail.mit.edu/sanchez/papers/2015.cdcs.hpca...CDCS reduces the distance to data through joint thread and data placement CDCS outperforms

Latency-aware allocation

24

Use total latency curve to partition cache among VCs

Tota

l acc

ess

late

ncy

Cache size0

VC2

VC3

VC1

Capacity

Page 25: NATHAN BECKMANN O-AN TSAI AND ANIEL ANCHEZ ...people.csail.mit.edu/sanchez/papers/2015.cdcs.hpca...CDCS reduces the distance to data through joint thread and data placement CDCS outperforms

Optimistic VC placement

25

Place VC as compactly as possible

Estimating contention of every bank for VC

VC placed around least-contended tile

Page 26: NATHAN BECKMANN O-AN TSAI AND ANIEL ANCHEZ ...people.csail.mit.edu/sanchez/papers/2015.cdcs.hpca...CDCS reduces the distance to data through joint thread and data placement CDCS outperforms

Thread placement

26

Place threads at center of mass of their accesses

Page 27: NATHAN BECKMANN O-AN TSAI AND ANIEL ANCHEZ ...people.csail.mit.edu/sanchez/papers/2015.cdcs.hpca...CDCS reduces the distance to data through joint thread and data placement CDCS outperforms

Refined VC placement

27

Move/trade cache lines between VCs

Greedily place VC close to thread first

Page 28: NATHAN BECKMANN O-AN TSAI AND ANIEL ANCHEZ ...people.csail.mit.edu/sanchez/papers/2015.cdcs.hpca...CDCS reduces the distance to data through joint thread and data placement CDCS outperforms

Scalable reconfiguration & monitoring

28

Incremental reconfiguration Allows chip to reconfigure smoothly, without pausing cores

Geometric monitor Monitors large LLC with low overhead

See paper for details

Page 29: NATHAN BECKMANN O-AN TSAI AND ANIEL ANCHEZ ...people.csail.mit.edu/sanchez/papers/2015.cdcs.hpca...CDCS reduces the distance to data through joint thread and data placement CDCS outperforms

Agenda

29

Background

CDCS design

Evaluation

Methodology

Performance

Sensitivity

Page 30: NATHAN BECKMANN O-AN TSAI AND ANIEL ANCHEZ ...people.csail.mit.edu/sanchez/papers/2015.cdcs.hpca...CDCS reduces the distance to data through joint thread and data placement CDCS outperforms

Methodology

30

Systems:

64-core, 512KB/L3 bank

OOO cores (Silvermont-like)

8x8 Mesh network

Similar to Knights Landing

Zsim [Sanchez’13]: Pin-based simulator

Workloads: SPEC CPU2006, SPEC OMP2012

Mem

/ IO

L3 Bank

OOO Core

L1I L1D

L2

Mem / IO

Mem

/ IO

Mem / IO64-core; 8x8 mesh network

Page 31: NATHAN BECKMANN O-AN TSAI AND ANIEL ANCHEZ ...people.csail.mit.edu/sanchez/papers/2015.cdcs.hpca...CDCS reduces the distance to data through joint thread and data placement CDCS outperforms

Methodology

31

Schemes

S-NUCA (baseline) with clustering thread scheduler

R-NUCA with clustering thread scheduler

Jigsaw

Jigsaw+C: Jigsaw with clustering thread scheduler

Jigsaw+R: Jigsaw with random thread scheduler

CDCSD-NUCA

Partitioned NUCA CDCS

Place data

Control capacity

Place threads

Page 32: NATHAN BECKMANN O-AN TSAI AND ANIEL ANCHEZ ...people.csail.mit.edu/sanchez/papers/2015.cdcs.hpca...CDCS reduces the distance to data through joint thread and data placement CDCS outperforms

Multi-programmed mixes

32

Workloads that do not share data

18% GMEAN

34% GMEAN

38% GMEAN

46%

GMEAN

CDCS avoids capacity contention more effectively than random scheduler

Page 33: NATHAN BECKMANN O-AN TSAI AND ANIEL ANCHEZ ...people.csail.mit.edu/sanchez/papers/2015.cdcs.hpca...CDCS reduces the distance to data through joint thread and data placement CDCS outperforms

Multi-threaded mixes

33

Workloads that share dataCDCS guards against pathological behavior

incurred by fixed thread scheduling policies

9% GMEAN

14% GMEAN

19% GMEAN

21%

GMEAN

Clustering is better now

Page 34: NATHAN BECKMANN O-AN TSAI AND ANIEL ANCHEZ ...people.csail.mit.edu/sanchez/papers/2015.cdcs.hpca...CDCS reduces the distance to data through joint thread and data placement CDCS outperforms

Undercommitted multi-threaded mixes

34

SPECOMP mixes using half of the cores

11% GMEAN

17% GMEAN

21% GMEAN

26%

GMEAN

With more flexibility, CDCS dynamically clusters or spreads out threads

Page 35: NATHAN BECKMANN O-AN TSAI AND ANIEL ANCHEZ ...people.csail.mit.edu/sanchez/papers/2015.cdcs.hpca...CDCS reduces the distance to data through joint thread and data placement CDCS outperforms

CDCS vs idealized algorithms

35

Integer Linear Programming (ILP)

Simulated annealing

Within 1%Multi-programmed mixes

Same for multi-threaded mixes

Algorithm runtime overhead (%)

0 0.08 0.08 0.2 6.8 196

Page 36: NATHAN BECKMANN O-AN TSAI AND ANIEL ANCHEZ ...people.csail.mit.edu/sanchez/papers/2015.cdcs.hpca...CDCS reduces the distance to data through joint thread and data placement CDCS outperforms

See paper for additional results

36

Under-committed system

Traffic breakdown

Energy breakdown

Factor analysis

Other sensitivity studies

Reconfiguration interval sweep

Incremental reconfiguration IPC trace

Page 37: NATHAN BECKMANN O-AN TSAI AND ANIEL ANCHEZ ...people.csail.mit.edu/sanchez/papers/2015.cdcs.hpca...CDCS reduces the distance to data through joint thread and data placement CDCS outperforms

Conclusions

37

Thread placement has a large impact on NUCA performance when capacity is well managed

CDCS reduces the distance to data through joint thread and data placement

CDCS outperforms state-of-the-art NUCA techniques with different thread scheduling policies and prevents pathological behavior of fixed policies

Page 38: NATHAN BECKMANN O-AN TSAI AND ANIEL ANCHEZ ...people.csail.mit.edu/sanchez/papers/2015.cdcs.hpca...CDCS reduces the distance to data through joint thread and data placement CDCS outperforms

QUESTIONS

38