
- 1 -

Shared-Memory Systems and Cache Coherence

6.173 Fall 2010 Agarwal

- 2 -

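Foundations

What is the meaning of shared memory when you have multiple access ports into global memory?

What if you have caches?

[Figure: processors Pa, Pb, and Pc each issue a stream of reads (r) and writes (w) to Memory. Pa: ra1, ra2, wa3. Pb: rb1, wb2, wb3. Pc: wc1, wc2, rc3, rc4.]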

Sequential consistency: the final state (of memory) is as if all reads and writes were executed in some fixed serial order, with each processor's own program order also maintained (Lamport)

[This notion borrows from similar notions of sequential consistency in transaction processing systems.]
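The definition above can be explored with a small sketch. It enumerates every serial order that keeps each processor's own order and collects the values the reads could return. The two-processor program and the names `interleavings`, `execute`, `pa`, and `pb` are illustrative assumptions, not taken from the slides.

```python
def interleavings(streams):
    """Yield every merge of the streams that keeps each stream's own order."""
    if all(len(s) == 0 for s in streams):
        yield []
        return
    for i, s in enumerate(streams):
        if s:
            # Take the next operation of processor i, then merge the rest.
            rest = streams[:i] + [s[1:]] + streams[i + 1:]
            for tail in interleavings(rest):
                yield [s[0]] + tail


def execute(order):
    """Run one serial order against a single memory; return what each read saw."""
    memory = {"x": 0, "y": 0}
    reads = {}
    for proc, kind, var, value in order:
        if kind == "w":
            memory[var] = value
        else:
            reads[(proc, var)] = memory[var]
    return reads


# Hypothetical test program: each processor writes one variable, then reads the other.
pa = [("Pa", "w", "x", 1), ("Pa", "r", "y", None)]
pb = [("Pb", "w", "y", 1), ("Pb", "r", "x", None)]

outcomes = set()
for order in interleavings([pa, pb]):
    result = execute(order)
    outcomes.add((result[("Pa", "y")], result[("Pb", "x")]))
```

In this sketch no serial order lets both reads return 0, so a machine that produced that result for this program would not be sequentially consistent.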

- 3 -

Foundations

A hardware designer's physical perspective of sequential consistency

[Figure: the operation streams of Pa (ra1, ra2, wa3), Pb (rb1, wb2, wb3), and Pc (wc1, wc2, rc3, rc4) all feed a single Memory.]

We will revisit this in more detail shortly

Key: using a fence to wait until the flush is done is the mechanism that guarantees sequential consistency

- 4 -

One other cache nasty to watch out for

[Figure: two processors (P), each with its own cache, share MEM. foo1, foo2, foo3, foo4 appear both in a cache and at their home location in MEM (foo home).]

Flush foo* from cache, wait till done

Does it always work?

- 5 -

One other cache nasty to watch out for

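[Figure: two processors (P), each with its own cache, share MEM, which holds the home copy of foo1, foo2, foo3, foo4 (foo home). The cache line holding foo1 also holds an unrelated word.]

One processor: Flush foo* from cache, wait till done

Other processor: Flush yyy from cache, wait till done

Cache line (two words):

Memory, initially: xxx xxx

Cached copy after writing foo1: xxx foo1

Cached copy after writing yyy: yyy xxx

Correct final value: yyy foo1

Wrong final value: xxx foo1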

Problem called “False Sharing”

Leads to bugs with sw coherence

Leads to poor perf. with hw coherence

Solutions?

Pad shared data structures so multiple shared items do not fall into same cache line
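The false-sharing bug under software coherence, and the padding fix, can be sketched with a toy model. It assumes a two-word cache line and a flush that writes back the whole line; `LINE_WORDS`, `flush_both`, and the word addresses are hypothetical choices for illustration.

```python
LINE_WORDS = 2  # assumed line size: two words per cache line


def flush_both(addr_foo, addr_yyy):
    """Two processors each cache a line, write one word, then flush the whole line.

    Returns the final memory values at (addr_foo, addr_yyy).
    """
    memory = ["xxx"] * 4

    def load_line(addr):
        base = (addr // LINE_WORDS) * LINE_WORDS
        return base, memory[base:base + LINE_WORDS]  # private copy of the line

    base_a, copy_a = load_line(addr_foo)
    base_b, copy_b = load_line(addr_yyy)

    copy_a[addr_foo - base_a] = "foo1"  # first processor writes foo1
    copy_b[addr_yyy - base_b] = "yyy"   # second processor writes yyy

    # Flushes write back entire lines; here the yyy flush lands first.
    memory[base_b:base_b + LINE_WORDS] = copy_b
    memory[base_a:base_a + LINE_WORDS] = copy_a
    return memory[addr_foo], memory[addr_yyy]


same_line = flush_both(addr_foo=1, addr_yyy=0)  # both words share line 0
padded = flush_both(addr_foo=2, addr_yyy=0)     # padding moves foo1 to line 1
```

With both words in one line, the later flush overwrites yyy with the stale xxx. With padding, each flush touches only its own line and both updates survive.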

- 6 -

Summary of New Multicore Instructions

• Send message

• Receive message

• Synchronization
– Barrier
– Test and set
– F&A and relatives (e.g., F&Op, CmpXch)

• Flush cache line

• Memory fence

- 7 -

Outline

Memory architecture

Cache coherence in small multicores

Cache coherence in manycores

- 8 -

Recall: Shared-Memory Algorithmic Model

[Figure: processors P, P, ..., P all write (wrt) and read a single Shared Memory.]

- 9 -

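Shared Memory Structures in Parallel Computers

[Figure: three organizations built from processors (P), caches (C), memories (M), and a Network.]

Monolithic: the Ps reach a single Memory through the Network

Distributed: each P has its own C; the Ps reach several Ms across the Network

Distributed - local: each P has its own C and a local M; the nodes are joined by the Network

But what about multicore chips?

Like Legos, can move Ps, Cs and Ms around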

- 10 -

Shared-Memory Structure in Cutting Edge Multicores

[Figure: a multicore chip in which P/C pairs are connected by a Ring to a memory controller and to Memory. It is shown next to the Distributed structure: Ps with Cs, a Network, and Ms.]

- 11 -

Shared-Memory Structure in Cutting Edge Multicores

Tile processor, 64 cores

[Figure: a multicore chip in which a grid of P/C tiles is connected by a Mesh, with memory controllers and Memory around the chip. It is shown next to the Distributed structure: Ps with Cs, a Network, and Ms.]

- 12 -

Caches and Cache Coherence

[Figure: processors (P), caches (C), and memories (M) connected by a Network.]

- 13 -

A World Without Caches

[Figure: processors (P), caches (C), and memories (M) connected by a Network; two reads (rd) are shown.]

- 14 -

With Caches

[Figure: processors (P), caches (C), and memories (M) connected by a Network.]

- 15 -

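How are Caches Different from Fast Local Memory (SRAM)?

[Figure: two versions of the same machine, with Ps and Ms connected by a Network. In one, each P has a cache (C); in the other, each P has a fast local memory (m) instead.]

Discuss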

- 16 -

Key insight: why use a cache when local mem exists?

Anatomy of a common case LD operation

LD A

If A replicated in local store then fetch from local store (HW: 1 cycle, SW: 10 cycles)

Else send message to get A from DRAM (HW: 100 cycles, SW: 110 cycles)

Can do all of this in hardware too. This is what typical caches do

When done in HW, we call the store a cache!
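A short calculation shows what the cycle counts above imply on average. The 95% local-hit rate is an assumption chosen for illustration; only the 1/10 and 100/110 cycle figures come from the slide.

```python
def average_load_cycles(hit_rate, hit_cycles, miss_cycles):
    """Expected cycles per LD, given the fraction of loads found in the local store."""
    return hit_rate * hit_cycles + (1.0 - hit_rate) * miss_cycles


HIT_RATE = 0.95  # assumed fraction of loads that hit in the local store

hw_cycles = average_load_cycles(HIT_RATE, hit_cycles=1, miss_cycles=100)   # about 5.95
sw_cycles = average_load_cycles(HIT_RATE, hit_cycles=10, miss_cycles=110)  # about 15.0
```

Under this assumed hit rate the software check costs roughly 2.5 times more per load, almost all of it from the 10-cycle common case.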

- 17 -

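Cache Coherence Problem

[Figure: processors (P), caches (C), and memories (M) connected by a Network. One processor writes (wrt) a location; what happens to a copy of it held in another cache is marked "?".]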

- 18 -

Solving the Coherence Problem

– Small multicores
> Software coherence
> Snooping caches

– Manycores
> Software coherence
> full map directories
> limited pointers
> chained pointers
· singly linked
· doubly linked
> limitless schemes
> Hierarchical methods

We will study
Coherence structures
Coherence protocols
Cache side state diagrams
Directory side state diagrams

- 19 -

Software Coherence

Saw this before

[Figure: two processors (P), each with its own cache, share MEM. foo1, foo2, foo3, foo4 appear both in a cache and at their home location in MEM (foo home).]

GET_foo_LOCK

MUNGE....

flush

fence

RELEASE_foo_LOCK

Flush: flush foo* from cache

Fence: wait till changes that result from flush are visible to everyone

Can stick the locking, flushes and fences in library code to provide clean abstractions
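One way the lock, flush, and fence sequence might be wrapped in library code is sketched below. `SimCache` is a hypothetical toy write-back cache, and `coherent_region` is an assumed helper name. A real implementation would issue the machine's flush and fence instructions instead of these Python stand-ins.

```python
import threading
from contextlib import contextmanager


class SimCache:
    """Toy private write-back cache in front of a shared dict (illustrative only)."""

    def __init__(self, memory):
        self.memory = memory
        self.lines = {}

    def read(self, name):
        if name not in self.lines:           # miss: fetch from home location
            self.lines[name] = self.memory[name]
        return self.lines[name]

    def write(self, name, value):
        self.lines[name] = value             # stays in the cache until flushed

    def flush(self, prefix):
        # Write back and drop every cached item whose name starts with prefix.
        for name in [n for n in self.lines if n.startswith(prefix)]:
            self.memory[name] = self.lines.pop(name)

    def fence(self):
        # The toy flush is synchronous, so there is nothing left to wait for.
        pass


@contextmanager
def coherent_region(lock, cache, prefix):
    """GET lock, run the body, then flush, fence, and RELEASE lock."""
    lock.acquire()
    try:
        yield cache
    finally:
        cache.flush(prefix)
        cache.fence()
        lock.release()


memory = {"foo1": 0, "foo2": 0}
foo_lock = threading.Lock()
cache_a, cache_b = SimCache(memory), SimCache(memory)

with coherent_region(foo_lock, cache_a, "foo") as c:
    c.write("foo1", c.read("foo1") + 1)
with coherent_region(foo_lock, cache_b, "foo") as c:
    c.write("foo1", c.read("foo1") + 1)
```

Because the flush also drops the cached copies, the second region misses and reads the first region's value from memory rather than a stale copy.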

- 20 -

Hardware Cache Coherence

Snooping Caches

• Works for small multicores (mem off chip)
• Broadcast address on shared write
• Everyone listens (snoops) on bus/ring to see if any of their own addresses match
• Invalidate copy on match
• How do you know when to broadcast, invalidate
– State associated with each cache line
– Key benefit: no global state in main mem

[Figure: two processors, each with a cache and dual-ported cache tags, connected by a bus or ring to shared memory. Steps: 1 write a; 2 broadcast a; 3 snoop a; 4 match a; 5 invalidate.]

Let’s look at this in more detail next…

- 21 -

Hardware Cache Coherence

Invalidate versus Update Snooping Caches

• Broadcast address on shared write

• Everyone listens (snoops) on bus/ring to see if any of their own addresses match

• If address matches
– Invalidate local copy (called invalidate or ownership protocol)
OR
– Update local copy with new data from bus (writer must broadcast value along with address)

[Figure: two processors, each with a cache and dual-ported cache tags, connected by a bus or ring to shared memory. Steps: 1 write a; 2 broadcast a; 3 snoop a; 4 match a; 5 invalidate or update.]

Only a cache side state machine needed

Discuss paper

- 22 -

Update versus Invalidate Protocols

Tradeoffs between

• Update protocols

• Ownership protocols

Update better when poor write locality

Invalidate better otherwise

Competitive snooping idea

– Do write updates

– If more than a “few” updates, then use ownership

“Few”: switch mode when cost of all updates so far = cost of invalidation

The cost of this approach is no worse than twice the optimal (try to prove this)

“Competitive algorithms are cool”
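The competitive snooping rule on this slide can be checked numerically with a simplified cost model. The model is an assumption for illustration: one writer performs a run of writes to a block that another cache holds, each update costs `update_cost`, and one invalidation costs `invalidate_cost`, after which the writer owns the block and further writes are free. The sketch switches as soon as one more update would push the update total past the invalidation cost, which matches the slide's rule when the costs divide evenly.

```python
def competitive_cost(num_writes, update_cost, invalidate_cost):
    """Cost of updating until the update total reaches the invalidation cost, then invalidating."""
    spent_on_updates = 0
    for _ in range(num_writes):
        if spent_on_updates + update_cost > invalidate_cost:
            # Switch to ownership: pay for one invalidation; later writes are free.
            return spent_on_updates + invalidate_cost
        spent_on_updates += update_cost
    return spent_on_updates


def offline_optimal_cost(num_writes, update_cost, invalidate_cost):
    """Best choice with hindsight: update on every write, or invalidate on the first one."""
    if num_writes == 0:
        return 0
    return min(num_writes * update_cost, invalidate_cost)


worst_ratio = 0.0
for update_cost, invalidate_cost in [(1, 8), (2, 10), (3, 10)]:
    for num_writes in range(1, 40):
        online = competitive_cost(num_writes, update_cost, invalidate_cost)
        best = offline_optimal_cost(num_writes, update_cost, invalidate_cost)
        worst_ratio = max(worst_ratio, online / best)
```

In this model the ratio never exceeds 2: before the switch the online cost equals the optimal cost, and after it the online cost is at most twice the invalidation cost.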

- 23 -

State diagram for ownership protocols

• For each address a

• Assume cache blocksize is one word for now; let’s deal with the cache block complexity later

Cache side state machine

Store state with cache tags

States:

invalid: “Invalid”

read-clean: “Shared”

write-dirty: “Modified”

“MSI”

Variants such as MESI, MOESI

- 24 -

Snooping Caches

Definitions

[Figure: two processors, each with a cache and dual-ported cache tags, connected by a bus or ring to shared memory. Steps: 1 write a; 2 broadcast a; 3 snoop a; 4 match a; 5 update.]

Labels used in the state diagrams:

My local request

Ext. bus request

My bus response

My local response

- 25 -

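State diagram for cache block in ownership protocols

a: address

In ownership protocol: writer owns exclusive copy

Labels: My local request, Ext. bus request, My bus response

[State diagram; transitions listed as from -> to: request / response]

invalid -> read-clean: Local Read / Fetch block

invalid -> write-dirty: Local Write / Broadcast a; Fetch block

read-clean -> write-dirty: Local Write / Broadcast a

read-clean -> invalid: Remote Write

write-dirty -> read-clean: Remote Read / Update memory

write-dirty -> invalid: Remote Write or local replace / Update memory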
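The cache-side state machine for the ownership protocol can be written as a small transition table. This sketch assumes the usual MSI reading of the diagram, with states invalid, read-clean, and write-dirty. The event names and the `step` and `run` helpers are hypothetical.

```python
from enum import Enum


class State(Enum):
    INVALID = "invalid"
    READ_CLEAN = "read-clean"    # "Shared"
    WRITE_DIRTY = "write-dirty"  # "Modified"


# (current state, event) -> (next state, bus or memory action)
TRANSITIONS = {
    (State.INVALID, "local_read"): (State.READ_CLEAN, "fetch block"),
    (State.INVALID, "local_write"): (State.WRITE_DIRTY, "broadcast a; fetch block"),
    (State.READ_CLEAN, "local_write"): (State.WRITE_DIRTY, "broadcast a"),
    (State.READ_CLEAN, "remote_write"): (State.INVALID, None),
    (State.WRITE_DIRTY, "remote_read"): (State.READ_CLEAN, "update memory"),
    (State.WRITE_DIRTY, "remote_write"): (State.INVALID, "update memory"),
    (State.WRITE_DIRTY, "local_replace"): (State.INVALID, "update memory"),
}


def step(state, event):
    """Return (next_state, action); unlisted events leave the state unchanged."""
    return TRANSITIONS.get((state, event), (state, None))


def run(events, state=State.INVALID):
    """Apply a sequence of events to one cache block and collect the actions taken."""
    actions = []
    for event in events:
        state, action = step(state, event)
        if action is not None:
            actions.append(action)
    return state, actions


final_state, actions = run(["local_read", "local_write", "remote_read", "remote_write"])
```

Keeping the table as data makes it easy to compare against the diagram one arc at a time.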

- 26 -

State diagram for update protocols

a: address

<a>: value

Labels: My local request, Ext. bus request, My bus response, My local response

[State diagram starting from invalid; arcs listed as request / response]

Local Read / Fetch block

Local Write / Broadcast a, <a>; Fetch block

Local Write / Broadcast a, <a> (on two arcs)

Remote Write / Update local copy (on two arcs)

Local replace / Update memory
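For contrast with invalidation, here is a toy model of update snooping, in which a write broadcasts both the address and the value and every other cache holding that address refreshes its copy. The `Bus` and `UpdateCache` classes are hypothetical. The sketch also assumes memory is updated on every broadcast, which is a simplification and not something the slide states.

```python
class Bus:
    """Shared bus: delivers each broadcast to every cache except the writer."""

    def __init__(self, memory):
        self.memory = memory
        self.caches = []

    def broadcast(self, writer, address, value):
        self.memory[address] = value  # assumption: memory is written through
        for cache in self.caches:
            if cache is not writer:
                cache.snoop(address, value)


class UpdateCache:
    """One-word-per-block cache that updates its copy on a matching remote write."""

    def __init__(self, bus):
        self.bus = bus
        self.lines = {}
        bus.caches.append(self)

    def read(self, address):
        if address not in self.lines:  # local read miss: fetch block
            self.lines[address] = self.bus.memory[address]
        return self.lines[address]

    def write(self, address, value):
        self.lines[address] = value
        self.bus.broadcast(self, address, value)  # broadcast a, <a>

    def snoop(self, address, value):
        if address in self.lines:  # address match: update local copy
            self.lines[address] = value


memory = {"a": 0}
bus = Bus(memory)
cache_1, cache_2 = UpdateCache(bus), UpdateCache(bus)

cache_1.read("a")
cache_2.read("a")
cache_1.write("a", 7)
seen_by_cache_2 = cache_2.read("a")
```

The second cache sees the new value without a miss, but every write goes onto the bus.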

- 27 -

Maintaining coherence in manycores

• Software coherence – saw this before

• Hardware coherence
> full map directories
> limited pointers
> chained pointers
· singly linked
· doubly linked
> limitless schemes
> Hierarchical methods
