CACM July 2012 Talk: Mark D. Hill, Wisconsin @ Cornell University, 10/2012
TRANSCRIPT

Page 1:

CACM July 2012
Talk: Mark D. Hill, Wisconsin @ Cornell University, 10/2012

Page 2:

Executive Summary

• Today chips provide shared memory w/ HW coherence as low-level support for OS & application SW

• As #cores per chip scales?
  o Some argue HW coherence will be gone due to growing overheads
  o We argue it stays by managing overheads

• Develop scalable on-chip coherence proof-of-concept
  o Inclusive caches first
  o Exact tracking of sharers & replacements (key to analysis)
  o Larger systems need to use hierarchy (clusters)
  o Overheads similar to today’s

Compatibility of on-chip HW coherence is here to stay

Let’s spend programmer sanity on parallelism, not lost compatibility!

Page 3:

Outline

Motivation & Coherence Background

Scalability Challenges
1. Communication
2. Storage
3. Enforcing Inclusion
4. Latency
5. Energy

Extension to Non-Inclusive Shared Caches

Criticisms & Summary

Page 4:

Academics Criticize HW Coherence

• Choi et al. [DeNovo]:
  o "Directory … coherence … extremely complex & inefficient …. Directory … incurring significant storage and invalidation traffic overhead."

• Kelm et al. [Cohesion]:
  o "A software-managed coherence protocol … avoids … directories and duplicate tags, & implementing & verifying … less traffic …"

Page 5:

Industry Eschews HW Coherence

• Intel 48-Core IA-32 Message-Passing Processor … SW protocols … to eliminate the communication & HW overhead

• IBM Cell processor … the greatest opportunities for increased application performance, is the existence of the local store memory and the fact that software must manage this memory

BUT…

Page 6:

Source: Avinash Sodani, "Race to Exascale: Challenges and Opportunities," Micro 2011.

Page 7:

Define “Coherence as Scalable”

• Define a coherent system as scalable when the cost of providing coherence grows (at most) slowly as core count increases

• Our Focus
  o YES: coherence
  o NO: Any scalable system also requires scalable HW (interconnects, memories) and SW (OS, middleware, apps)

• Method
  o Identify each overhead & show it can grow slowly

• Expect more cores
  o Moore’s Law provides more transistors
  o Power-efficiency improvements (w/o Dennard Scaling)
  o Experts disagree on how many cores are possible

Page 8:

Caches & Coherence

• Cache: fast, hidden memory, used to reduce
  o Latency: average memory access time
  o Bandwidth: interconnect traffic
  o Energy: cache misses cost more energy

• Caches hidden (from software)
  o Naturally for a single-core system
  o Via a coherence protocol for multicore

• Maintain the coherence invariant (sketched in code below)
  o For a given (memory) block at a given time, either:
  o Modified (M): A single core can read & write
  o Shared (S): Zero or more cores can read, but not write
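To make the invariant concrete, here is a minimal sketch (our code, not the talk's; names are illustrative) of the single-writer/multiple-reader check a protocol must preserve on every transition:

```c
/* Minimal sketch (ours) of the per-block coherence invariant across
 * C = 16 cores: one exclusive writer, or any number of readers. */
#include <stdbool.h>
#include <stdio.h>

enum state { I, S, M };                      /* Invalid, Shared, Modified */

struct block_states { enum state st[16]; };  /* per-core copy state */

static bool invariant_holds(const struct block_states *b) {
    int writers = 0, readers = 0;
    for (int c = 0; c < 16; c++) {
        if (b->st[c] == M) writers++;
        else if (b->st[c] == S) readers++;
    }
    /* Either a single writer and no readers, or no writer at all. */
    return writers == 0 || (writers == 1 && readers == 0);
}

int main(void) {
    struct block_states b = { { M } };       /* Core 0 owns; rest default to I */
    printf("invariant holds: %d\n", invariant_holds(&b));  /* prints 1 */
    return 0;
}
```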

Page 9:

Baseline Multicore Chip

[Figure: C cores, each with a private cache (per block: ~2-bit state, ~64-bit tag, ~512-bit data), connected by an interconnection network to a shared cache whose blocks add ~C tracking bits.]

• Intel Core i7-like
• C = 16 Cores (not 8)
• Private L1/L2 Caches
• Shared Last-Level Cache (LLC)
• 64B blocks w/ ~8B tag
• HW coherence pervasive in general-purpose multicore chips: AMD, ARM, IBM, Intel, Sun (Oracle)

Page 10:

Baseline Chip Coherence

[Figure: same baseline chip diagram as Page 9.]

• 2B per 64+8B L2 block to track L1 copies (entry layout sketched below)
• Inclusive L2 (w/ recall messages on LLC evictions)
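A rough sketch of that entry layout (our field names; sizes from the slide: 64B data, ~8B tag/state, 2B of tracking bits for the 16 L1s):

```c
/* Sketch of an inclusive LLC entry per the slide's sizes (our names):
 * 2 tracking bytes per 64+8-byte block, i.e., ~3% storage overhead. */
#include <stdint.h>
#include <stdio.h>

struct llc_entry {
    uint64_t tag_state;   /* ~64-bit tag plus ~2-bit M/S/I state */
    uint16_t sharers;     /* one presence bit per core, C = 16   */
    uint8_t  data[64];    /* 64-byte cache block                 */
};

int main(void) {
    /* 2 tracking bytes per 72-byte (64B data + 8B tag) block: ~3% */
    printf("tracking overhead: %.1f%%\n", 100.0 * 2.0 / 72.0);
    return 0;
}
```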

Page 11:

Coherence Example Setup

• Block A in no private caches: state Invalid (I)
• Block B in no private caches: state Invalid (I)

[Figure: four cores (0-3) with private caches, connected by an interconnection network to a four-bank shared cache. Banks 0/1 hold A: {0000} I; Banks 2/3 hold B: {0000} I.]

Page 12:

Coherence Example 1/4

• Block A at Core 0, exclusive read-write: Modified (M)

[Figure: Core 0 issues "Write A"; its private cache now holds A: M. Shared cache: A: {1000} M; B: {0000} I.]

Page 13:

Coherence Example 2/4

• Block B at Cores 1+2 shared read-only: Shared (S)

[Figure: Cores 1 and 2 issue "Read B"; each private cache now holds B: S. Shared cache: A: {1000} M; B: {0100} S after Core 1's read, then {0110} S after Core 2's.]

Page 14:

Coherence Example 3/4

• Block A moved from Core 0 to 3 (still M)

[Figure: Core 3 issues "Write A"; Core 0's copy is invalidated and Core 3's private cache holds A: M. Shared cache: A: {0001} M; B: {0110} S, still at Cores 1 and 2.]

Page 15:

Coherence Example 4/4

• Block B moved from Cores 1+2 (S) to Core 1 (M)

[Figure: Core 1 issues "Write B"; Core 2's copy is invalidated and Core 1's private cache holds B: M. Shared cache: A: {0001} M; B: {0100} M. The directory updates are replayed in the sketch below.]
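The four examples reduce to two directory actions. Here is a sketch (our code; the real protocol also forwards data and collects acks) that replays the state and tracking-bit updates above:

```c
/* Sketch (ours) of the directory updates behind Examples 1-4.
 * Bit i of `sharers` stands for Core i; the slide's {1000} with
 * four cores is Core 0's bit set, {0110} is Cores 1 and 2, etc. */
#include <stdint.h>
#include <stdio.h>

enum state { I, S, M };
struct dir_entry { enum state st; uint8_t sharers; };

/* Write (GetM): invalidate all other copies, install a single owner. */
static void write_req(struct dir_entry *e, int core) {
    e->sharers = (uint8_t)(1u << core);   /* inv + ack per old sharer */
    e->st = M;
}

/* Read (GetS): downgrade any owner, add the requester as a sharer. */
static void read_req(struct dir_entry *e, int core) {
    e->sharers |= (uint8_t)(1u << core);
    e->st = S;
}

int main(void) {
    struct dir_entry A = {I, 0}, B = {I, 0};
    write_req(&A, 0);   /* 1/4: A -> {1000} M at Core 0       */
    read_req(&B, 1);    /* 2/4: B -> {0100} S                 */
    read_req(&B, 2);    /*      B -> {0110} S at Cores 1+2    */
    write_req(&A, 3);   /* 3/4: A -> {0001} M, Core 0 -> 3    */
    write_req(&B, 1);   /* 4/4: B -> {0100} M at Core 1       */
    printf("A: st=%d sharers=%#x\n", A.st, A.sharers);
    printf("B: st=%d sharers=%#x\n", B.st, B.sharers);
    return 0;
}
```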

Page 16:

Caches & Coherence

Page 17:

Outline

Motivation & Coherence Background

Scalability Challenges
1. Communication: Extra bookkeeping messages (longer section)
2. Storage: Extra bookkeeping storage
3. Enforcing Inclusion: Extra recall messages (subtle)
4. Latency: Indirection on some requests
5. Energy: Dynamic & static overhead

Extension to Non-Inclusive Shared Caches (subtle)

Criticisms & Summary

Page 18:

1. Communication: (a) No Sharing, Dirty


o W/o coherence: Request → Data → Data (writeback)
o W/ coherence: Request → Data → Data (writeback) → Ack
o Overhead = 8/(8+72+72) = 5% (independent of #cores!)

Key: Green for Required, Red for Overhead; Thin is 8-byte control, Thick is 72-byte data

Page 19:

1. Communication: (b) No Sharing, Clean


o W/o coherence: Request → Data → 0 (no eviction message)
o W/ coherence: Request → Data → (Evict) → Ack
o Overhead = 16/(8+72) = 10-20% (independent of #cores!)

Key: Green for Required, Red for Overhead; Thin is 8-byte control, Thick is 72-byte data

Page 20:

1. Communication: (c) Sharing, Read


o To memory: Request → Data
o To one other core: Request → Forward → Data → (Cleanup)
o Charge 1-2 control messages (independent of #cores!)

Key: Green for Required, Red for Overhead; Thin is 8-byte control, Thick is 72-byte data

Page 21:

1. Communication: (d) Sharing, Write


o If Shared at C other cores:
o Request → {Data, C Invalidations + C Acks} → (Cleanup)
o Needed since most directory protocols send invalidations to caches that have, & sometimes do not have, copies
o Not Scalable

Key: Green for Required, Red for Overhead; Thin is 8-byte control, Thick is 72-byte data

Page 22:

1. Communication: Extra Invalidations


o Core 1 Read: Request → Data
o Core C Write: Request → {Data, 2 Inv + 2 Acks} → (Cleanup)
o Charge Write for all necessary & unnecessary invalidations
o What if all invalidations are necessary? Charge reads that get data!

Key: Green for Required, Red for Overhead; Thin is 8-byte control, Thick is 72-byte data

[Figure: coarse sharer vector, one bit per pair of cores {1|2 3|4 .. C-1|C}; example states {0 0 .. 0}, {1 0 .. 0}, {0 0 .. 1}.]

Page 23:

1. Communication: No Extra Invalidations


o Core 1 Read: Request → Data + {Inv + Ack} (in future)
o Core C Write: Request → Data → (Cleanup)
o If all invalidations are necessary, coherence adds bounded overhead to each miss -- independent of #cores!

Key: Green for Required, Red for Overhead; Thin is 8-byte control, Thick is 72-byte data

[Figure: exact sharer vector, one bit per core {1 2 3 4 .. C-1 C}; example states {0 0 0 0 .. 0 0}, {1 0 0 0 .. 0 0}, {0 0 0 0 .. 0 1}.]

Page 24:

1. Communication Overhead

(1) Communication overhead bounded & scalable

(a) Without Sharing & Dirty
(b) Without Sharing & Clean
(c) Shared Read Miss (charge future inv + ack)
(d) Shared Write Miss (not charged for inv + acks)

• But depends on tracking exact sharers (next); the per-case arithmetic is sketched below
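As a worked version of cases (a)-(d), assuming the slides' 8-byte control and 72-byte data messages (the arithmetic is ours; it matches the slides' percentages where given):

```c
/* Sketch (our arithmetic) of per-miss traffic overhead for the four
 * miss classes, with 8-byte control and 72-byte data messages. */
#include <stdio.h>

#define CTRL 8.0
#define DATA 72.0    /* 64B block + 8B header */

int main(void) {
    /* (a) no sharing, dirty: Request+Data+Writeback, plus one Ack */
    printf("(a) %.0f%%\n", 100.0 * CTRL / (CTRL + 2.0 * DATA));
    /* (b) no sharing, clean: Request+Data, plus Evict+Ack */
    printf("(b) %.0f%%\n", 100.0 * 2.0 * CTRL / (CTRL + DATA));
    /* (c) shared read: charged 1-2 extra control messages */
    printf("(c) %.0f-%.0f%%\n", 100.0 * CTRL / (CTRL + DATA),
                                100.0 * 2.0 * CTRL / (CTRL + DATA));
    /* (d) shared write: with exact tracking, each inv+ack pair is
       charged to the read miss that created that sharer, so every
       class stays bounded, independent of core count. */
    return 0;
}
```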

Page 25:

Total Communication: C Read Misses per Write Miss

[Charts: bytes per miss vs. read misses per write miss, one curve per core count (2 to 512 cores), comparing "Exact (unbounded storage)" sharer tracking against "Inexact (32b coarse vector)" tracking; with exact tracking, bytes per miss stays bounded as core count grows.]

How to get the performance of "exact" w/ reasonable storage?

Page 26:

Outline

Motivation & Coherence Background

Scalability Challenges
1. Communication: Extra bookkeeping messages (longer section)
2. Storage: Extra bookkeeping storage
3. Enforcing Inclusion: Extra recall messages
4. Latency: Indirection on some requests
5. Energy: Dynamic & static overhead

Extension to Non-Inclusive Shared Caches

Criticisms & Summary

Page 27:

2. Storage Overhead (Small Chip)

• Track up to C = #readers (cores) per LLC block
• Small #Cores: C-bit vector acceptable
  o e.g., 16 bits for 16 cores: 2 bytes / 72 bytes = 3%


Page 28:

2. Storage Overhead (Larger Chip)

• Use Hierarchy!

[Figure: two-level hierarchy. Each cluster of K cores has per-core private caches over an intra-cluster interconnection network, plus a cluster cache whose entries (e.g., {11..1} S, {10..1} S) track per-core sharers; clusters 1..K connect over an inter-cluster interconnection network to a shared last-level cache whose entries (e.g., {1 … 1} S) track per-cluster sharers.]

Page 29:

2. Storage Overhead (Larger Chip)

• Medium-Large #Cores: Use Hierarchy!
  o Cluster: K1 cores with an L2 cluster cache
  o Chip: K2 clusters with an L3 global cache
  o Enables K1*K2 cores

• E.g., 16 16-core clusters
  o 256 cores (16*16)
  o 3% storage overhead!!

• More generally? (arithmetic sketched below)
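A sketch of that storage arithmetic (ours), comparing a flat per-core bit vector against a 16x16 hierarchy, per 72-byte (64B data + 8B tag) block:

```c
/* Sketch (our arithmetic): sharer-tracking bits per 72-byte block,
 * flat bit vector vs. two-level hierarchy of 16-core clusters. */
#include <stdio.h>

static double overhead(int tracking_bits) {
    return (double)tracking_bits / (72.0 * 8.0);
}

int main(void) {
    printf("flat, 16 cores:   %.1f%%\n", 100.0 * overhead(16));  /* ~3%  */
    printf("flat, 256 cores:  %.1f%%\n", 100.0 * overhead(256)); /* ~44% */
    /* 16 clusters of 16 cores: the LLC tracks 16 clusters and each
       cluster cache tracks its 16 cores, so no level pays more than
       a 16-bit vector per block. */
    printf("16x16 hierarchy:  %.1f%% per level\n", 100.0 * overhead(16));
    return 0;
}
```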

Page 30:

Storage Overhead for Scaling

(2) Hierarchy enables scalable storage

16 clusters of 16 cores each

Page 31:

Outline

Motivation & Coherence Background

Scalability Challenges
1. Communication: Extra bookkeeping messages (longer section)
2. Storage: Extra bookkeeping storage
3. Enforcing Inclusion: Extra recall messages (subtle)
4. Latency: Indirection on some requests
5. Energy: Dynamic & static overhead

Extension to Non-Inclusive Shared Caches (subtle)

Criticisms & Summary

Page 32:

3. Enforcing Inclusion (Subtle)

• Inclusion: Block in a private cache → in shared cache
  + Augment shared cache to track private cache sharers (as assumed)
  - Replace in shared cache → replace in private caches
  - Make such replacements impossible? Requires too much shared cache associativity
    - E.g., 16 cores w/ 4-way private caches → 64-way shared associativity
  - Use recall messages instead

• Make recall messages necessary & rare

• Make recall messages necessary & rare

Page 33:

Inclusion Recall Example

• Shared cache miss to new block C
• Needs to replace (victimize) block B in shared cache
• Inclusion forces replacement of B in private caches

[Figure: Core 0 holds A: M; Cores 1 and 2 hold B: S. Shared cache: A: {1000} M; B: {0110} S. "Write C" misses, victimizes B, and recalls B from Cores 1 and 2.]

Page 34:

Make All Recalls Necessary

Exact state tracking (covered earlier)
  + L1/L2 replacement messages (even clean)
  = Every recall message finds a cached block
  → Every recall message is necessary & occurs after a cache miss (bounded overhead)

(This mechanism is sketched below.)
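A sketch (ours) of the mechanism: clean and dirty evictions both notify the directory, so its sharer vector never over-approximates and every recall hits a real copy:

```c
/* Sketch (our code): explicit replacement messages keep the sharer
 * vector exact, so LLC victimizations recall only true holders. */
#include <stdint.h>
#include <stdio.h>

struct dir_entry { uint16_t sharers; };

/* L1 sends a replacement message even for clean blocks. */
static void on_l1_evict(struct dir_entry *e, int core) {
    e->sharers &= (uint16_t)~(1u << core);
}

/* On an LLC victimization, count recall targets: each one is a real
   cached copy, so recalls are bounded by the misses that installed
   those copies. */
static int recall_targets(const struct dir_entry *e) {
    int n = 0;
    for (uint16_t s = e->sharers; s != 0; s >>= 1)
        n += (int)(s & 1u);
    return n;
}

int main(void) {
    struct dir_entry b = { 0x0006 };  /* B shared at Cores 1 and 2  */
    on_l1_evict(&b, 2);               /* Core 2 evicts B & notifies */
    printf("recall targets for B: %d\n", recall_targets(&b));  /* 1 */
    return 0;
}
```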

Page 35:

Make Necessary Recalls Rare

• Recalls naturally rare when Shared Cache Size / Σ Private Cache Sizes > 2

(3) Recalls made rare

Assume misses to random sets [Hill & Smith 1989]

[Chart: percentage of misses causing recalls vs. ratio of aggregate private cache capacity to shared cache capacity (0 to 8), one curve per shared-cache associativity (1-, 2-, 4-, 8-way); the expected design space, including a Core i7-like ratio, sits where recalls are rare.]

Page 36:

Outline

Motivation & Coherence Background

Scalability Challenges
1. Communication: Extra bookkeeping messages (longer section)
2. Storage: Extra bookkeeping storage
3. Enforcing Inclusion: Extra recall messages
4. Latency: Indirection on some requests
5. Energy: Dynamic & static overhead

Extension to Non-Inclusive Shared Caches

Criticisms & Summary

Page 37:

4. Latency Overhead – Often None


1. None: private hit
2. "None": private miss + "direct" shared cache hit
3. "None": private miss + shared cache miss
4. BUT …

Key: Green for Required, Red for Overhead; Thin is 8-byte control, Thick is 72-byte data

Page 38:

4. Latency Overhead -- Some


4. 1.5-2X: private miss + shared cache hit with indirection(s)

• How bad?

Key: Green for Required, Red for Overhead; Thin is 8-byte control, Thick is 72-byte data

Page 39:

4. Latency Overhead -- Indirection

4. 1.5-2X: private miss + shared cache hit with indirection(s)

interconnect + cache + interconnect + cache + interconnect
-----------------------------------------------------------
interconnect + cache + interconnect

• Acceptable today
• Relative latency similar w/ more cores/hierarchy
• Vs. magically having data at shared cache

(4) Latency overhead bounded & scalable (a worked ratio is sketched below)
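To see where 1.5-2X comes from, a sketch with assumed cycle counts (ours, not the talk's):

```c
/* Sketch (our illustrative latencies) of the indirection ratio:
 * a miss that must visit the owner vs. a direct shared-cache hit. */
#include <stdio.h>

int main(void) {
    double net = 20.0, cache = 30.0;  /* assumed per-hop cycle counts */
    double direct   = net + cache + net;                /* 3 legs */
    double indirect = net + cache + net + cache + net;  /* 5 legs */
    printf("indirection ratio: %.2fx\n", indirect / direct); /* ~1.7x */
    return 0;
}
```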

Page 40:

5. Energy Overhead

• Dynamic -- Small
  o Extra message energy – traffic increase small/bounded
  o Extra state lookup – small relative to cache block lookup
  o …

• Static -- Also Small
  o Extra state – state increase small/bounded
  o …

• Little effect on energy-intensive cores, cache data arrays, off-chip DRAM, secondary storage, …

(5) Energy overhead bounded & scalable

Page 41:

Outline

Motivation & Coherence Background

Scalability Challenges
1. Communication: Extra bookkeeping messages (longer section)
2. Storage: Extra bookkeeping storage
3. Enforcing Inclusion: Extra recall messages (subtle)
4. Latency: Indirection on some requests
5. Energy: Dynamic & static overhead

Extension to Non-Inclusive Shared Caches (subtle): apply analysis to caches used by AMD

Criticisms & Summary

Page 42:

Review Inclusive Shared Cache


• Inclusive Shared Cache:
  • Block in a private cache → in shared cache
  • Blocks must be cached redundantly

[Shared-cache entry: tracking bits (~1 bit per core), state (~2 bits), tag (~64 bits), block data (~512 bits).]

Page 43:

Non-Inclusive Shared Cache


1. Non-Inclusive Shared Cache
   [Entry: state (~2 bits), tag (~64 bits), block data (~512 bits)]
   Any size or associativity; avoids redundant caching; allows victim caching

2. Inclusive Directory (probe filter)
   [Entry: tracking bits (~1 bit per core), state (~2 bits), tag (~64 bits)]
   Dataless; ensures coherence; but duplicates tags

Page 44:

Non-Inclusive Shared Cache

• Non-Inclusive Shared Cache: Data Block + Tag (any configuration)
• Inclusive Directory: Tag (again) + State
• Inclusive Directory == Coherence State Overhead

• WITH TWO LEVELS
  o Directory size proportional to sum of private cache sizes
  o 64b/(48b+512b) * 2 (for rare recalls) = 22% * Σ L1 size

• Coherence overhead higher than w/ inclusion (table below; arithmetic sketched after it)

L2 / ΣL1s:   1     2     4     8
Overhead:    11%   7.6%  4.6%  2.5%
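A sketch (ours) reproducing the slide's arithmetic: 64-bit directory entries shadowing 560-bit (48b tag + 512b data) L1 blocks, provisioned at 2x for rare recalls, then expressed relative to total on-chip cache for several L2/ΣL1 ratios:

```c
/* Sketch (our arithmetic) of the two-level inclusive-directory cost. */
#include <stdio.h>

int main(void) {
    double entry = 64.0, l1_block = 48.0 + 512.0;
    double dir = 2.0 * entry / l1_block;   /* ~22% of total L1 size */
    printf("directory = %.0f%% of sum-of-L1s\n", 100.0 * dir);
    /* Overhead relative to all cache on chip, with L2 = R * sum(L1s):
       approximates the table's 11%, 7.6%, 4.6%, 2.5%. */
    for (int R = 1; R <= 8; R *= 2)
        printf("L2/L1s = %d -> %.1f%%\n", R, 100.0 * dir / (1.0 + R));
    return 0;
}
```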

Page 45:

Non-Inclusive Shared Caches

WITH THREE LEVELS

• Cluster has L2 cache & cluster directory
  o Cluster directory points to cores w/ L1 block (as before)
  o (1) Size = 22% * ΣL1 sizes

• Chip has L3 cache & global directory
  o Global directory points to the cluster w/ the block in:
  o (2) a cluster directory, for size 22% * ΣL1s, +
  o (3) a cluster L2 cache, for size 22% * ΣL2s

• Hierarchical overhead higher than w/ inclusion

L3 / ΣL2s = L2 / ΣL1s:   1     2     4     8
Overhead (1)+(2)+(3):    23%   13%   6.5%  3.1%

Page 46:

Outline

Motivation & Coherence Background

Scalability Challenges
1. Communication: Extra bookkeeping messages (longer section)
2. Storage: Extra bookkeeping storage
3. Enforcing Inclusion: Extra recall messages (subtle)
4. Latency: Indirection on some requests
5. Energy: Dynamic & static overhead

Extension to Non-Inclusive Shared Caches (subtle)

Criticisms & Summary

Page 47:

Some Criticisms

(1) Where are workload-driven evaluations?
  o Focused on robust analysis of first-order effects

(2) What about non-coherent approaches?
  o Showed that compatible coherence scales

(3) What about protocol complexity?
  o We have such protocols today (& ideas for better ones)

(4) What about multi-socket systems?
  o Apply non-inclusive approaches

(5) What about software scalability?
  o Hard SW work need not re-implement coherence

Page 48:

Executive Summary

• Today chips provide shared memory w/ HW coherence as low-level support for OS & application SW

• As #cores per chip scales?
  o Some argue HW coherence will be gone due to growing overheads
  o We argue it stays by managing overheads

• Develop scalable on-chip coherence proof-of-concept
  o Inclusive caches first
  o Exact tracking of sharers & replacements (key to analysis)
  o Larger systems need to use hierarchy (clusters)
  o Overheads similar to today’s

Compatibility of on-chip HW coherence is here to stay

Let’s spend programmer sanity on parallelism, not lost compatibility!

Page 49:

Coherence NOT this Awkward