- 1 -
Shared-Memory Systems and Cache Coherence
6.173, Fall 2010, Agarwal
- 2 -
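Foundations
What is the meaning of shared memory when you have multiple access ports into global memory?
What if you have caches?
[Figure: processors Pa, Pb, and Pc each issue a stream of reads (r) and writes (w) to Memory. Pa: ra1, ra2, wa3. Pb: rb1, wb2, wb3. Pc: wc1, wc2, rc3, rc4.]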
Sequential consistency: the final state (of memory) is as if all RDs and WRTs were executed in some fixed serial order, with per-processor order also maintained. (Lamport)
[This notion borrows from similar notions of sequential consistency in transaction processing systems.]
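The definition above can be explored with a small enumeration. The following is a minimal sketch, not part of the lecture: it assumes a hypothetical two-processor program (Pa writes x then reads y, Pb writes y then reads x) and the helper names `interleavings`, `run`, and `sc_read_results` are made up for illustration. It lists every serial order that keeps each processor's own order and collects the read values each order produces.

```python
def interleavings(programs):
    """Yield every merge of the per-processor lists that keeps each list's order."""
    if all(len(p) == 0 for p in programs):
        yield []
        return
    for i, prog in enumerate(programs):
        if prog:
            rest = programs[:i] + [prog[1:]] + programs[i + 1:]
            for tail in interleavings(rest):
                yield [prog[0]] + tail


def run(order, memory):
    """Execute operations one at a time; return the value seen by each read."""
    mem = dict(memory)
    reads = {}
    for op in order:
        kind, name, addr = op[0], op[1], op[2]
        if kind == "w":
            mem[addr] = op[3]
        else:
            reads[name] = mem[addr]
    return reads


def sc_read_results(programs, memory):
    """Set of read outcomes reachable under sequential consistency."""
    results = set()
    for order in interleavings(programs):
        results.add(tuple(sorted(run(order, memory).items())))
    return results


# Hypothetical example program (an assumption, not taken from the slides).
pa = [("w", "wa1", "x", 1), ("r", "ra2", "y")]
pb = [("w", "wb1", "y", 1), ("r", "rb2", "x")]
results = sc_read_results([pa, pb], {"x": 0, "y": 0})
both_zero = (("ra2", 0), ("rb2", 0))
print(both_zero in results)
```

In this example no legal serial order lets both reads return the initial value, so a machine that produced that outcome would not be sequentially consistent.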
- 3 -
Foundations
A hardware designer's physical perspective of sequential consistency
[Figure: Pa, Pb, and Pc each feed a queue of operations into Memory. Pa: ra1, ra2, wa3. Pb: rb1, wb2, wb3. Pc: wc1, wc2, rc3, rc4.]
We will revisit this in more detail shortly.
Key: using a fence to wait until a flush is done is the mechanism that guarantees sequential consistency.
- 4 -
One other cache nasty to watch out for
[Figure: two processors (P), each with a cache, above MEM. foo1, foo2, foo3, foo4 appear in a cache and at foo's home location in MEM.]
Flush foo* from cache, wait till done
Does it always work?
- 5 -
One other cache nasty to watch out for
[Figure: same setup as the previous slide. foo1 shares a cache line with an unrelated word. The line starts as "xxx xxx" in MEM; one cache holds it as "xxx foo1", the other holds it as "yyy xxx".]
Flush foo* from cache, wait till done
Flush yyy from cache, wait till done
Correct final value: yyy foo1
Wrong final value: xxx foo1
Problem called “False Sharing”
Leads to bugs with SW coherence
Leads to poor performance with HW coherence
Solutions?
Pad shared data structures so that multiple shared items do not fall into the same cache line
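The false-sharing bug under software coherence can be reproduced in a toy model. This is a sketch under stated assumptions: the two-word line size, the `Memory` and `WriteBackCache` classes, and the `run` helper are all hypothetical, and each cache flushes whole lines back to memory.

```python
LINE_WORDS = 2  # hypothetical cache line size, in words


class Memory:
    def __init__(self, n_words, fill="xxx"):
        self.words = [fill] * n_words


class WriteBackCache:
    """Software-coherent cache: fetches whole lines and flushes whole lines."""

    def __init__(self, memory):
        self.memory = memory
        self.lines = {}  # line number -> private copy of that line's words

    def write(self, addr, value):
        line_no = addr // LINE_WORDS
        if line_no not in self.lines:
            base = line_no * LINE_WORDS
            self.lines[line_no] = self.memory.words[base:base + LINE_WORDS]
        self.lines[line_no][addr % LINE_WORDS] = value

    def flush(self):
        # Write every cached line back in full, stale words included.
        for line_no, words in self.lines.items():
            base = line_no * LINE_WORDS
            self.memory.words[base:base + LINE_WORDS] = words
        self.lines.clear()


def run(addr_y, addr_foo):
    mem = Memory(4)
    cache_a, cache_b = WriteBackCache(mem), WriteBackCache(mem)
    cache_a.write(addr_foo, "foo1")
    cache_b.write(addr_y, "yyy")
    cache_b.flush()
    cache_a.flush()  # overwrites the whole line it holds
    return mem.words[addr_y], mem.words[addr_foo]


same_line = run(0, 1)  # both words share line 0
padded = run(0, 2)     # padding pushes foo into line 1
print(same_line, padded)
```

With both words in one line, the second flush writes back a stale copy of the neighbouring word; with padding, each flush touches only its own line.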
- 6 -
Summary of New Multicore Instructions
• Send message
• Receive message
• Synchronization
  – Barrier
  – Test and set
  – F&A and relatives (e.g., F&Op, CmpXch)
• Flush cache line
• Memory fence
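The semantics of the synchronization primitives in this list can be sketched in software. This is only a model: the `AtomicWord` class is hypothetical, and a `threading.Lock` stands in for the atomicity that real hardware instructions provide.

```python
import threading


class AtomicWord:
    """One memory word with atomic read-modify-write operations (a model)."""

    def __init__(self, value=0):
        self._value = value
        self._guard = threading.Lock()  # stands in for hardware atomicity

    def load(self):
        with self._guard:
            return self._value

    def test_and_set(self):
        """Set the word to 1 and return its previous value."""
        with self._guard:
            old = self._value
            self._value = 1
            return old

    def fetch_and_add(self, delta):
        """Add delta and return the value from before the add."""
        with self._guard:
            old = self._value
            self._value += delta
            return old

    def compare_and_exchange(self, expected, new):
        """Store new only if the word equals expected; return the old value."""
        with self._guard:
            old = self._value
            if old == expected:
                self._value = new
            return old


counter = AtomicWord(0)
tickets = []


def worker():
    for _ in range(1000):
        tickets.append(counter.fetch_and_add(1))


threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

flag = AtomicWord(0)
first, second = flag.test_and_set(), flag.test_and_set()
```

Because each fetch-and-add returns a distinct old value, the four workers collect 4000 unique tickets between them. For test and set, only the first caller sees 0, which is what lets it serve as a lock acquire.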
- 7 -
Outline
Memory architecture
Cache coherence in small multicores
Cache coherence in manycores
- 8 -
Recall: Shared Memory Algorithmic Model
[Figure: processors P, P, P, ..., P each issue reads and writes (wrt, read) to a single Shared Memory.]
- 9 -
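Shared Memory Structures in Parallel Computers
[Figure: three organizations built from processors (P), caches (C), memories (M), and a Network.
Monolithic: processors with caches reach a single Memory through the Network.
Distributed: processors with caches reach several memory modules (M) through the Network.
Distributed - local: each processor has its own cache and memory module, and the Network connects the nodes.]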
But what about multicore chips?
Like Legos, can move Ps, Cs, and Ms around
- 10 -
Shared-Memory Structure in Cutting Edge Multicores
[Figure: a multicore chip in which processor-cache pairs (P C) sit on a Ring that connects to a memory controller and to off-chip Memory. Shown next to the Distributed organization (P, C, Network, M).]
- 11 -
Shared-Memory Structure in Cutting Edge Multicores
Tile processor, 64 cores
[Figure: a multicore chip drawn as a grid of P C tiles connected by a Mesh, with memory controllers linking the chip to several off-chip Memory blocks. Shown next to the Distributed organization (P, C, Network, M).]
- 12 -
Caches and Cache Coherence
[Figure: three nodes, each with a processor (P), a cache (C), and a memory (M), connected by a Network.]
- 13 -
A World Without Caches
[Figure: the same P, C, M nodes on a Network, with reads (rd) shown travelling to memory.]
- 14 -
With Caches
[Figure: the same P, C, M nodes on a Network.]
- 15 -
How are Caches Different from Fast Local Memory (SRAM)?
[Figure: nodes of P, C, and M on a Network, versus nodes of P, fast local memory (m), and M on a Network.]
Discuss
- 16 -
Key insight: why use a cache when local memory exists?
Anatomy of a common-case LD operation
LD A
If A replicated in local store then fetch from local store (HW: 1 cycle, SW: 10 cycles)
Else send message to get A from DRAM (HW: 100 cycles, SW: 110 cycles)
Can do all of this in hardware too. This is what typical caches do.
When done in HW, we call the store a cache!
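The LD anatomy and its cycle counts can be turned into a small cost model. This is a sketch: the function names are hypothetical, the 90% hit rate is an assumed figure, and copying the value into the local store on a miss is an assumption the slide does not spell out.

```python
# Cycle counts from the slide: local-store hit versus fetch from DRAM.
HIT_CYCLES = {"HW": 1, "SW": 10}
MISS_CYCLES = {"HW": 100, "SW": 110}


def load(addr, local_store, dram, mode):
    """Perform LD addr; return (value, cycles spent)."""
    if addr in local_store:
        return local_store[addr], HIT_CYCLES[mode]
    value = dram[addr]          # "send message to get A from DRAM"
    local_store[addr] = value   # assumption: replicate locally after a miss
    return value, MISS_CYCLES[mode]


def average_cycles(hit_rate, mode):
    """Expected cycles per LD for a given local-store hit rate."""
    return hit_rate * HIT_CYCLES[mode] + (1 - hit_rate) * MISS_CYCLES[mode]


dram = {"A": 42}
store = {}
first = load("A", store, dram, "SW")   # miss
second = load("A", store, dram, "SW")  # hit
hw_avg = average_cycles(0.9, "HW")
sw_avg = average_cycles(0.9, "SW")
```

At the assumed 90% hit rate the hardware path averages roughly 10.9 cycles per load against roughly 20 for software, so the check on the common-case hit dominates the difference.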
- 17 -
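Cache Coherence Problem
[Figure: P, C, M nodes on a Network. One processor writes (wrt) a location while another cache's copy is in question (?). Label: Coherence problem.]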
- 18 -
Solving the Coherence Problem
– Small multicores
  > Software coherence
  > Snooping caches
– Manycores
  > Software coherence
  > full map directories
  > limited pointers
  > chained pointers
    · singly linked
    · doubly linked
  > limitless schemes
  > Hierarchical methods
We will study
Coherence structures
Coherence protocols
Cache side state diagrams
Directory side state diagrams
- 19 -
Software Coherence
Saw this before
[Figure: two processors (P), each with a cache, above MEM. foo1, foo2, foo3, foo4 appear in a cache and at foo's home location in MEM.]
GET_foo_LOCK
MUNGE....
flush
fence
RELEASE_foo_LOCK
Flush foo* from cache
Fence: wait till changes that result from flush are visible to everyone
Can stick the locking, flushes, and fences in library code to provide clean abstractions
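One way to package the lock, flush, and fence sequence in library code is a context manager. The sketch below is illustrative only: `Node`, `shared`, and the dictionary used as foo's home are hypothetical, and in this sequential model the fence has nothing to wait for.

```python
import threading
from contextlib import contextmanager


class Node:
    """One processor with a private, software-managed cache (a model)."""

    def __init__(self, home):
        self.home = home   # shared memory: name -> value
        self.cache = {}    # private cached copies
        self.dirty = set()

    def read(self, name):
        if name not in self.cache:
            self.cache[name] = self.home[name]
        return self.cache[name]

    def write(self, name, value):
        self.cache[name] = value
        self.dirty.add(name)

    def flush(self, prefix):
        """Write back and drop every cached item whose name starts with prefix."""
        for name in [n for n in self.cache if n.startswith(prefix)]:
            if name in self.dirty:
                self.home[name] = self.cache[name]
                self.dirty.discard(name)
            del self.cache[name]

    def fence(self):
        """Writes land immediately in this model, so there is nothing to wait for."""


@contextmanager
def shared(node, lock, prefix):
    lock.acquire()              # GET_foo_LOCK
    try:
        yield node              # MUNGE....
    finally:
        node.flush(prefix)      # flush foo* from cache
        node.fence()            # wait until the flush is visible
        lock.release()          # RELEASE_foo_LOCK


home = {"foo1": 0, "foo2": 0}
foo_lock = threading.Lock()
pa, pb = Node(home), Node(home)

with shared(pa, foo_lock, "foo") as p:
    p.write("foo1", 41)
with shared(pb, foo_lock, "foo") as p:
    seen = p.read("foo1")
```

The abstraction only holds if every access to foo* goes through `shared`; a read made outside it could leave a stale copy in the cache.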
- 20 -
Hardware Cache Coherence: Snooping Caches
• Works for small multicores (mem off chip)
• Broadcast address on shared write
• Everyone listens (snoops) on bus/ring to see if any of their own addresses match
• Invalidate copy on match
• How do you know when to broadcast, invalidate
  – State associated with each cache line
  – Key benefit: no global state in main mem
[Figure: two processors, each with a cache and dual-ported cache tags, attached to a Bus or Ring with Shared Memory. Steps: 1 write a, 2 broadcast a, 3 snoop, 4 match a, 5 invalidate.]
Let’s look at this in more detail next…
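The five steps of the snooping figure (write, broadcast, snoop, match, invalidate) can be sketched as a toy simulation. The `Bus` and `SnoopingCache` classes are hypothetical, and two simplifications are assumed: every write is broadcast, and writes go straight through to memory.

```python
class Bus:
    """Shared bus or ring: carries write addresses to every other cache."""

    def __init__(self):
        self.caches = []
        self.memory = {}

    def broadcast_write(self, writer, addr):
        for cache in self.caches:
            if cache is not writer:
                cache.snoop(addr)          # step 3: everyone else snoops


class SnoopingCache:
    def __init__(self, bus):
        self.bus = bus
        self.data = {}                     # addr -> value; present means valid
        bus.caches.append(self)

    def read(self, addr):
        if addr not in self.data:
            self.data[addr] = self.bus.memory.get(addr, 0)
        return self.data[addr]

    def write(self, addr, value):          # step 1: local write
        self.bus.broadcast_write(self, addr)   # step 2: broadcast address
        self.data[addr] = value
        self.bus.memory[addr] = value      # assumption: write-through

    def snoop(self, addr):
        if addr in self.data:              # step 4: match
            del self.data[addr]            # step 5: invalidate


bus = Bus()
c0, c1 = SnoopingCache(bus), SnoopingCache(bus)
c0.read("a")
c1.read("a")
c0.write("a", 7)
invalidated = "a" not in c1.data
reread = c1.read("a")
```

After the write, the other cache no longer holds the address, so its next read misses and picks up the new value.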
- 21 -
Hardware Cache Coherence: Invalidate versus Update Snooping Caches
• Broadcast address on shared write
• Everyone listens (snoops) on bus/ring to see if any of their own addresses match
• If address matches
  – Invalidate local copy (called invalidate or ownership protocol)
  OR
  – Update local copy with new data from bus (writer must broadcast value along with address)
[Figure: two processors, each with a cache and dual-ported cache tags, attached to a Bus or Ring with Shared Memory. Steps: 1 write a, 2 broadcast a, 3 snoop, 4 match a, 5 update.]
Only a cache side state machine needed
Discuss paper
- 22 -
Update versus Invalidate Protocols
Tradeoffs between
• Update protocols
• Ownership protocols
Update better when poor write locality
Invalidate better otherwise
Competitive snooping idea
– Do write updates
– If more than a “few” updates, then use ownership
“Few”: switch mode when cost of all updates so far = cost of invalidation
The cost of this approach is no worse than twice the optimal (try to prove this)
“Competitive algorithms are cool”
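The factor-of-two bound for competitive snooping can be checked numerically under a simple cost model. The model is an assumption made for illustration: a run of writes to one remotely cached word costs `update_cost` per write under the update policy or a single `invalidate_cost` under ownership, and the function names are hypothetical.

```python
def competitive_cost(num_writes, update_cost, invalidate_cost):
    """Send updates until their total would exceed one invalidation, then invalidate."""
    budget = invalidate_cost // update_cost   # updates sent before switching mode
    if num_writes <= budget:
        return num_writes * update_cost
    return budget * update_cost + invalidate_cost


def optimal_cost(num_writes, update_cost, invalidate_cost):
    """Best choice with hindsight: update every time, or invalidate at once."""
    return min(num_writes * update_cost, invalidate_cost)


worst_ratio = 0.0
for u in range(1, 6):
    for inv in range(1, 51):
        for k in range(1, 200):
            ratio = competitive_cost(k, u, inv) / optimal_cost(k, u, inv)
            worst_ratio = max(worst_ratio, ratio)
print(worst_ratio)
```

The intuition matches the slide: by the time the scheme gives up on updates it has spent at most one invalidation's worth, so it pays at most twice what the better fixed policy would have paid.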
- 23 -
State diagram for ownership protocols
• For each address a
• Assume cache blocksize is one word for now; let’s deal with the cache block complexity later
[Figure: three states per address: invalid (“Invalid”), read-clean (“Shared”, shared data), and write-dirty (“Modified”).]
“MSI”; variants such as MESI, MOESI
Cache side state machine
Store state with cache tags
- 24 -
Snooping Caches: Definitions
[Figure: two processors, each with a cache and dual-ported cache tags, attached to a Bus or Ring with Shared Memory. Steps: 1 write a, 2 broadcast a, 3 snoop, 4 match a, 5 update.]
Legend: My local request; Ext. bus request; My bus response; My local response
- 25 -
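State diagram for cache block in ownership protocols
a: address
States: invalid, read-clean, write-dirty
invalid -> read-clean: Local Read / Fetch block
invalid -> write-dirty: Local Write / Broadcast a; Fetch block
read-clean -> write-dirty: Local Write / Broadcast a
read-clean -> invalid: Remote Write
write-dirty -> read-clean: Remote Read / Update memory
write-dirty -> invalid: Remote Write or local replace / Update memory
In ownership protocol: writer owns exclusive copy
Legend: My local request; Ext. bus request; My bus response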
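The ownership-protocol state machine can be written as a lookup table keyed by state and event. This is a sketch: the event and action labels follow the slide, but the source and destination of each arc are filled in from the standard MSI protocol, which is an assumption, and the `step` and `simulate` helpers are hypothetical.

```python
INVALID, READ_CLEAN, WRITE_DIRTY = "invalid", "read-clean", "write-dirty"

# (state, event) -> (next state, bus or memory action)
TRANSITIONS = {
    (INVALID, "local_read"): (READ_CLEAN, "fetch block"),
    (INVALID, "local_write"): (WRITE_DIRTY, "broadcast a; fetch block"),
    (READ_CLEAN, "local_write"): (WRITE_DIRTY, "broadcast a"),
    (READ_CLEAN, "remote_write"): (INVALID, None),
    (WRITE_DIRTY, "remote_read"): (READ_CLEAN, "update memory"),
    (WRITE_DIRTY, "remote_write"): (INVALID, "update memory"),
    (WRITE_DIRTY, "local_replace"): (INVALID, "update memory"),
}


def step(state, event):
    """Return (next_state, action); unlisted pairs leave the state unchanged."""
    return TRANSITIONS.get((state, event), (state, None))


def simulate(accesses, n_caches=2):
    """Apply (cache index, 'read' or 'write') accesses to one address a."""
    states = [INVALID] * n_caches
    for who, op in accesses:
        for i in range(n_caches):
            event = ("local_" if i == who else "remote_") + op
            states[i], _ = step(states[i], event)
    return states


after_write = simulate([(0, "read"), (1, "read"), (0, "write")])
after_reread = simulate([(0, "read"), (1, "read"), (0, "write"), (1, "read")])
print(after_write, after_reread)
```

After the write, the writer holds the only valid copy; the later remote read forces the dirty copy back to memory and leaves both caches read-clean.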
- 26 -
State diagram for update protocols
a: address; <a>: value
States: invalid, read-clean, write-dirty
Transition labels (event / action):
Local Read / Fetch block
Local Write / Broadcast a, <a>; Fetch block
Local Write / Broadcast a, <a> (two arcs)
Remote Write / Update local copy (two arcs)
Local replace / Update memory
Legend: My local request; Ext. bus request; My bus response; My local response
- 27 -
Maintaining coherence in manycores
• Software coherence – saw this before
• Hardware coherence
  > full map directories
  > limited pointers
  > chained pointers
    · singly linked
    · doubly linked
  > limitless schemes
  > Hierarchical methods