Chapter 8 page 1 CS 5515
Flynn’s Classification
- SISD (Single Instruction, Single Data): uniprocessors
- MISD (Multiple Instruction, Single Data): no machine of this type has been built yet
- SIMD (Single Instruction, Multiple Data): examples: Illiac-IV, CM-2
  – Each processor has its own data memory
  – Only a single instruction memory and control processor
  – Less flexible
  – Uses special-purpose microprocessors
- MIMD (Multiple Instruction, Multiple Data): examples: SPARCCenter, T3D
  – Flexible
  – Uses off-the-shelf microprocessors
Chapter 8 page 2 CS 5515
Small-Scale, Centralized Shared-Memory Multiprocessors
- Memory: centralized with uniform memory access time ("UMA") and bus interconnect
- Examples: Sun SPARCCenter, SGI Challenge
Chapter 8 page 3 CS 5515
Large-Scale, Distributed-Memory Multiprocessors
- Memory: distributed with nonuniform memory access time ("NUMA") and scalable interconnect (distributed memory)
- Examples: T3D, Exemplar, Paragon, CM-5

[Figure: memory hierarchy with low latency / high reliability annotations -- roughly 1 cycle to cache, 40 cycles to local memory, 100 cycles to remote memory.]
Chapter 8 page 4 CS 5515
Communication Models
- Shared Memory
  - Processors communicate through a shared address space
  - Applicable to centralized or distributed shared-memory MPs
  - Advantages:
    – Ease of programming -- the programmer deals with a single address space
    – Lower overhead for communication (e.g., via hardware)
    – Better use of bandwidth when communicating small items
    – Easier to use hardware-controlled caching
- Message Passing
  - Processors have private memories and communicate via messages; such machines are normally called "multicomputers"
  - MPI (Message Passing Interface) standards are available
  - Advantages:
    – Simpler hardware, e.g., no cache-coherence issues
    – Communication is explicit, forcing the programmer to pay attention to costly non-local operations (a drawback as well)
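The contrast between the two models can be sketched in a few lines of code. This is only an illustrative sketch, not MPI itself (a real message-passing program would use an MPI library): one pair of workers communicates through explicit sends and receives on a queue, while the shared-memory style communicates implicitly by writing a common array.

```python
import threading
import queue

# --- Message-passing style: private data, explicit send/receive ---
def producer(out_q):
    for x in [1, 2, 3]:
        out_q.put(x)          # explicit communication: send a message
    out_q.put(None)           # end-of-stream marker

def consumer(in_q, result):
    while (x := in_q.get()) is not None:   # explicit communication: receive
        result.append(x * 10)

q = queue.Queue()
result = []
t1 = threading.Thread(target=producer, args=(q,))
t2 = threading.Thread(target=consumer, args=(q, result))
t1.start(); t2.start(); t1.join(); t2.join()
print(result)                 # [10, 20, 30]

# --- Shared-memory style: one address space, implicit communication ---
shared = [0] * 3
lock = threading.Lock()

def writer(i, val):
    with lock:
        shared[i] = val       # communicate simply by writing shared locations

threads = [threading.Thread(target=writer, args=(i, (i + 1) * 10))
           for i in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(shared)                 # [10, 20, 30]
```

Note how the message-passing version makes every communication visible in the code, while the shared-memory version hides it inside ordinary loads and stores.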
Chapter 8 page 5 CS 5515
Small-Scale MPs -- Centralized Shared-Memory
- Caches serve to:
  - Reduce bandwidth requirements on the bus/memory
  - Reduce access latency
  - Valuable for both private data (used by a single processor) and shared data
- Problem: cache coherence
Chapter 8 page 6 CS 5515
The Problem of Cache Coherency
Chapter 8 page 7 CS 5515
What Does Coherency Mean?
- Informally: any read of a data item returns the most recently written value of that data item
- Two aspects to be addressed:
  - Coherence: what values can be returned by a read
  - Consistency: when a written value will be returned by a read
- A system is coherent if
  - Any write is eventually seen by a read
  - All writes are seen in the same order ("serialization")
- Two rules to ensure this:
  - If P writes x and P1 reads it, P's write will be seen if the read and write are sufficiently far apart
  - Writes to the same location are serialized, i.e., two writes to the same location by any two processors are seen in the same order by all processors -- a property called write serialization
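The write-serialization rule can be checked mechanically: given one agreed total order of writes to a location, every processor's observed sequence of values must be a subsequence of that order. A small sketch (the function name and trace format are hypothetical, chosen just for this illustration):

```python
def writes_serialized(observations, total_order):
    """True iff every processor's observed sequence of values written to one
    location is consistent with the single total order of writes."""
    def is_subsequence(sub, full):
        it = iter(full)
        # `v in it` consumes the iterator, so later values must appear later
        return all(v in it for v in sub)
    return all(is_subsequence(obs, total_order) for obs in observations)

# Writes to x: value 10, then value 20.
# P1 sees both in order, P2 only sees the later write -> serialized.
print(writes_serialized([[10, 20], [20]], [10, 20]))      # True
# P2 sees 20 before 10 -> violates write serialization.
print(writes_serialized([[10, 20], [20, 10]], [10, 20]))  # False
```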
Chapter 8 page 8 CS 5515
Potential Solutions to Coherency
- Snooping Protocols (Snoopy Bus):
  - Send all requests for data to all processors
  - Processors snoop to see if they have a copy and respond accordingly
  - Requires broadcast, since caching information is kept at the processors
  - Works well with a bus (a natural broadcast medium)
  - Dominates for small-scale machines (most of the market)
- Directory-Based Protocols:
  - Keep track of what is being shared in one centralized place
  - Distributed memory => distributed directory (avoids bottlenecks)
  - Send point-to-point requests to processors
  - Used for distributed shared-memory MPs (discussed in 8.4)
  - Scales better than snooping
Chapter 8 page 9 CS 5515
Basic Snooping Protocols
- Write Invalidate Protocol:
  - Multiple readers, single writer
  - Write to shared data: an invalidate is sent to all caches, which snoop and invalidate any copies
  - Read miss:
    – Write-through: memory is always up-to-date
    – Write-back: snoop in caches to find the most recent copy
- Write Update Protocol:
  - Write to shared data: broadcast on the bus; processors snoop and update their copies
  - Read miss: memory is always up-to-date
- In either case, write serialization is enforced because the bus is a single point of arbitration and serializes all write requests to the same data item
Chapter 8 page 10 CS 5515
Basic Snoopy Protocols
- Write Invalidate versus Write Update (Broadcast):
  - Invalidate requires only one invalidation broadcast for multiple writes, while Update requires multiple write broadcasts
  - For a multi-word cache block, Invalidate invalidates it only once, on the first write, while Update updates it for each word written
  - Update has lower latency between write and read
  - Update trades increased bandwidth requirement for decreased latency

Name         Protocol type      Memory-write policy                        Machines using
Write Once   Write invalidate   Write back after first write               First snoopy protocol
Synapse N+1  Write invalidate   Write back                                 First cache-coherent MPs
Berkeley     Write invalidate   Write back                                 Berkeley SPUR
Illinois     Write invalidate   Write back                                 SGI Power and Challenge
"Firefly"    Write update       Write back private, write through shared   SPARCCenter 2000

Write invalidate is the protocol of choice nowadays.
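The bandwidth trade-off above can be made concrete by counting bus broadcasts for a write pattern. A simplified counting sketch (it assumes, per the slide, one invalidation for the first write to a shared block versus one update broadcast per written word):

```python
def bus_broadcasts(writes, protocol):
    """Count coherence broadcasts for a sequence of (block, word) writes by one
    processor to shared data. Write update broadcasts every write; write
    invalidate broadcasts only on the first write to each block (after which
    the writer holds the block exclusively and writes locally)."""
    if protocol == "update":
        return len(writes)            # every written word goes on the bus
    invalidated = set()               # blocks already invalidated in other caches
    count = 0
    for block, _word in writes:
        if block not in invalidated:
            invalidated.add(block)
            count += 1                # one invalidation, then writes are local
    return count

# Eight writes filling the eight words of one shared block:
writes = [(0, w) for w in range(8)]
print(bus_broadcasts(writes, "update"))      # 8 broadcasts
print(bus_broadcasts(writes, "invalidate"))  # 1 broadcast
```

This is exactly the multi-word-block point above: Update pays per word written, Invalidate pays once per block.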
Chapter 8 page 11 CS 5515
An Example Snooping Protocol
- Invalidation protocol, write-back cache
- Each block of memory is in one state:
  - Clean in all caches and up-to-date in memory
  - OR Dirty in exactly one cache
  - OR Not in any cache
- Each cache block is in one state:
  - Shared: the cache has one copy, the same as in memory, and there may be other copies in other caches
  - OR Exclusive: the cache has the only copy; it is writeable and dirty
  - OR Invalid: the block contains no data
- Read misses: cause all caches to snoop
- A write to a cache block in the Shared state is treated as a write miss -- an assumption that can be improved upon
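The three block states above define a small state machine. The following is a simplified sketch of the CPU-side and bus-side transitions for one cache block (write-backs are omitted, and a write in Shared is treated as a write miss, as the slide assumes; function names are invented for illustration):

```python
INVALID, SHARED, EXCLUSIVE = "invalid", "shared", "exclusive"

def cpu_event(state, op):
    """Next (state, bus_action) for a CPU read or write on this block."""
    if op == "read":
        # Read hits cause no bus action; a read miss goes on the bus.
        return (state, None) if state != INVALID else (SHARED, "read miss")
    # op == "write": from Invalid or Shared, a write miss goes on the bus.
    return (EXCLUSIVE, None) if state == EXCLUSIVE else (EXCLUSIVE, "write miss")

def bus_event(state, op):
    """Next state when another processor's miss for this block is snooped."""
    if op == "write miss":
        return INVALID        # another processor wants to write: drop the copy
    if op == "read miss" and state == EXCLUSIVE:
        return SHARED         # supply the dirty data, keep a read-only copy
    return state

s = INVALID
s, action = cpu_event(s, "write")   # -> exclusive, "write miss" placed on bus
s = bus_event(s, "read miss")       # another CPU reads -> shared
s = bus_event(s, "write miss")      # another CPU writes -> invalid
print(s)                            # invalid
```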
Chapter 8 page 12 CS 5515
Snoopy-Cache State Machine-I
- State machine for a cache block based on requests from the CPU
- Boldface is used to specify bus actions

[Figure: CPU-side state diagram over Invalid, Shared (read only), and Exclusive (read/write). A CPU read miss places a read miss on the bus; a CPU write places a write miss on the bus; in Exclusive, a CPU read or write miss for another block writes the block back before placing the miss on the bus; CPU read hits and write hits in Exclusive cause no bus action.]
Chapter 8 page 13 CS 5515
Snoopy-Cache State Machine-II
- Cache state machine based on requests from the bus

[Figure: bus-side state diagram over Invalid, Shared (read only), and Exclusive (read/write). A write miss for this block invalidates the copy; in Exclusive, a read miss for the block writes the block back and aborts the memory access (moving to Shared), and a write miss for the block writes the block back, aborts the memory access, and invalidates the copy.]
Chapter 8 page 14 CS 5515
Snoop Cache: State Machine (Fig. 8.12)
- A possible extension (e.g., Problem 8.4): add a 4th state called Clean Private, in which the cache contains the only copy and it is the same as the memory copy
  - Clean Private -> Exclusive on a CPU write hit
  - Clean Private -> Shared on a read miss received from the bus

[Figure: combined state diagram over Invalid, Shared (read only), and Exclusive (read/write), merging the CPU-induced transitions of Machine-I with the bus-induced transitions of Machine-II.]
FIGURE 8.12 Cache-coherence state diagram with the state transitions induced by the local processor shown in black and by the bus activities shown in gray.
Chapter 8 page 15 CS 5515
An Example of Basic Snooping Protocol
- Events over time t1..t5: p1: write(A1,10); p2: read(A1); p2: write(A2,20); p3: read(A2)
- p2 replaces the cache block on a read miss. Assume that A1 and A2 map to the same cache block.

time  processor p1  processor p2  processor p3
t1    exclusive     invalid       invalid
t2    shared        invalid       shared
t3    shared        shared        shared
t4    invalid       exclusive     invalid
t5    invalid       *shared       invalid

* Note that the state of this cache block is no longer for the block containing A1 and A2; it is for another block.
Chapter 8 page 16 CS 5515
Snooping Protocol Variations
Basic Protocol states: Exclusive, Shared, Invalid

Berkeley Protocol states: Owned Exclusive, Owned Shared, Shared, Invalid
- A new bus operation called "invalidation" is introduced, so that the Shared -> Owned Exclusive transition can be achieved by broadcasting invalidate operations; there is no need to generate a write miss on a write hit.

Illinois Protocol states: Private Dirty, Private Clean, Shared, Invalid
- If a read is sourced from memory, the block becomes Private Clean; if it is sourced from another cache, it becomes Shared. A write can proceed in the cache if the block is held Private Clean or Private Dirty. However, a write hit on a Shared block still generates a write-miss broadcast on the bus.
Chapter 8 page 17 CS 5515
Scalable MPs
- Separate memory per processor
  - Bus bandwidth cannot support a large number of processors
  - Replacing the bus with a general interconnection network makes snooping protocols expensive, because the cost of broadcasting is high
- A simple software solution: make shared data uncacheable
  - Simple hardware but poor performance
- Alternative: use a directory to track the state of every memory block in caches
  - Which caches have copies of the block, dirty vs. clean, ...
- To prevent the directory from becoming a bottleneck: distribute directory entries with the memory, each keeping track of which caches have copies of its blocks
  - This places the directory associated with a memory block at a known location, avoiding the cost of broadcasting
[Figure: several nodes, each consisting of a processor + cache, memory, and I/O, connected by an interconnection network; a directory is attached to the memory of each node.]
FIGURE 8.22 A directory is added to each node to implement cache coherence in a distributed-memory machine.
Chapter 8 page 18 CS 5515
Directory Protocol
- Similar to the snoopy protocol: 3 states
  - Shared: some processors have the data; memory is up to date
  - Uncached: no processor has a copy
  - Exclusive: one processor (the owner) has the data; memory is out of date
- In addition to the cache state, must track which processors have a copy (Shared or Exclusive): use a bit vector to maintain a Sharers set (one such set per memory block)
- Terms:
  - Local node: the node where a request originates
  - Home node: the node where the memory location of an address resides
  - Remote node: a node that has a copy of a cache block, whether exclusive or shared
Message type      Source          Destination      Contents  Function of this message
Read miss         Local cache     Home directory   P, A      Processor P has a read miss at address A; request the data and make P a read sharer.
Write miss        Local cache     Home directory   P, A      Processor P has a write miss at address A; request the data and make P the exclusive owner.
Invalidate        Home directory  Remote caches    A         Invalidate a shared copy of the data at address A.
Fetch             Home directory  Remote cache     A         Fetch the block at address A and send it to its home directory; change the state of A in the remote cache to Shared.
Fetch/invalidate  Home directory  Remote cache     A         Fetch the block at address A and send it to its home directory; invalidate the block in the cache.
Data value reply  Home directory  Local cache      Data      Return a data value from the home memory.
Data write back   Remote cache    Home directory   A, Data   Write back a data value for address A.

FIGURE 8.23 The possible messages sent among nodes to maintain coherence.
Chapter 8 page 19 CS 5515
Directory Protocol
- A message sent to the directory causes two actions:
  - Update the directory, and send messages to caches to satisfy the request
- Block is in the Uncached state: the copy in memory is the only copy, so the possible requests for that block are:
  - Read miss: the requesting processor is sent the data from memory, and the requestor becomes the only sharing node. The state of the block is made Shared.
  - Write miss: the requesting processor is sent the value and becomes the sharing node. The block is made Exclusive to indicate that the only valid copy is cached. Sharers indicates the identity of the owner.
- Block is Shared; the memory value is up-to-date:
  - Read miss: the requesting processor is sent the data from memory and is added to the sharing set.
  - Write miss: the requesting processor is sent the value. All processors in the Sharers set are sent invalidate messages, and Sharers is set to the identity of the requesting processor. The state of the block is made Exclusive.
Chapter 8 page 20 CS 5515
Directory Protocol
- Block is Exclusive: the current value of the block is held in the cache of the processor identified by the Sharers set (the owner), and there are 3 possible directory requests:
  - Read miss: the owner processor is sent a data-fetch message, which causes the state of the block in the owner's cache to transition to Shared and causes the owner to send the data to the directory, where it is written to memory and sent back to the requesting processor. The identity of the requesting processor is added to the Sharers set, which still contains the identity of the processor that was the owner (since it still has a readable copy).
  - Data write-back: the owner processor is replacing the block and hence must write it back. This makes the memory copy up-to-date (the home directory essentially becomes the owner), the block becomes Uncached, and the Sharers set is empty.
  - Write miss: the block has a new owner. A message is sent to the old owner, causing its cache to send the value of the block to the directory, from which it is sent to the requesting processor, which becomes the new owner. Sharers is set to the identity of the new owner, and the state of the block is made Exclusive.
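The three cases above (Uncached, Shared, Exclusive) can be sketched as a home-directory handler for a single memory block. This is a simplified model, not a hardware description: message sends are represented as returned strings, and the class and method names are invented for illustration.

```python
class Directory:
    """Home-directory state for one memory block: a state plus a Sharers set."""
    def __init__(self):
        self.state = "uncached"
        self.sharers = set()

    def read_miss(self, p):
        msgs = ["data value reply -> P%d" % p]
        if self.state == "exclusive":
            owner = next(iter(self.sharers))
            # Fetch the dirty copy; the owner keeps a readable (shared) copy.
            msgs.insert(0, "fetch -> P%d" % owner)
        self.sharers.add(p)
        self.state = "shared"
        return msgs

    def write_miss(self, p):
        msgs = []
        if self.state == "shared":
            # Invalidate every current sharer before granting ownership.
            msgs += ["invalidate -> P%d" % q for q in sorted(self.sharers)]
        elif self.state == "exclusive":
            owner = next(iter(self.sharers))
            msgs.append("fetch/invalidate -> P%d" % owner)
        msgs.append("data value reply -> P%d" % p)
        self.state, self.sharers = "exclusive", {p}
        return msgs

    def write_back(self, p):
        # Owner replaces the block: memory becomes up-to-date, block uncached.
        self.state, self.sharers = "uncached", set()

d = Directory()
d.write_miss(1)            # P1 becomes owner: exclusive, Sharers = {1}
print(d.read_miss(2))      # ['fetch -> P1', 'data value reply -> P2']
print(d.state, sorted(d.sharers))   # shared [1, 2]
```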
Chapter 8 page 21 CS 5515
State Transition Diagram for the Home Directory that Tracks the Status of a Memory Block
- Three states: Uncached, Shared, and Exclusive
- All actions are shown in gray since they are all externally caused. Italics indicates the action taken by the directory in response to the request. Bold italics indicates an action that updates the sharing set, Sharers, as opposed to sending a message.

[Figure: from Uncached, a read miss replies with the data value and sets Sharers={P} (to Shared), and a write miss replies with the data value and sets Sharers={P} (to Exclusive). From Shared, a read miss replies with the data value and sets Sharers=Sharers+{P}; a write miss sends invalidates, sets Sharers={P}, and replies with the data value (to Exclusive). From Exclusive, a read miss fetches the block, replies with the data value, and sets Sharers=Sharers+{P} (to Shared); a data write-back sets Sharers={} (to Uncached); a write miss fetch/invalidates, replies with the data value, and sets Sharers={P}.]
FIGURE 8.25 The state transition diagram for the directory has the same states and structure as the transition diagram for an individual cache.
Chapter 8 page 22 CS 5515
State Transition Diagram for an Individual Cache Block in a Directory-Based System
- The states are identical to those in the snooping case, and the transitions are very similar, with explicit invalidate and write-back requests replacing the write misses that were formerly broadcast on the bus.
- Colors in the figure:
  - black: requests from the CPU
  - gray: requests from the home directory

[Figure: transitions over Invalid, Shared (read only), and Exclusive (read/write). A CPU read miss sends a read-miss message; a CPU write sends a write-miss message; an Invalidate from the directory moves Shared to Invalid; in Exclusive, a Fetch causes a data write-back (to Shared), a Fetch/invalidate causes a data write-back (to Invalid), and CPU read and write hits cause no messages.]
FIGURE 8.24 State transition diagram for an individual cache block in a directory-based system.
Chapter 8 page 23 CS 5515
An Example of Directory-Based Protocol
- Events over time t1..t5: p1: write(A1,10); p2: read(A1); p2: write(A2,20); p3: read(A2)
- p2 replaces the cache block on a read miss. Assume that A1 and A2 map to the same cache block.

time  processor p1  processor p2  processor p3  home directory
t1    exclusive     invalid       invalid       exclusive
t2    shared        invalid       shared        shared
t3    shared        shared        shared        shared
t4    invalid       exclusive     invalid       exclusive
t5    invalid       *shared       invalid       uncached

* Note that the state of this cache block is no longer for the block containing A1 and A2; it is for another block.
Chapter 8 page 24 CS 5515
- 4th C: Coherency misses, in addition to Conflict, Capacity, and Compulsory misses
- More processors: increase coherency misses while decreasing capacity misses (cache size increases for a fixed problem size)
- Cache behavior of five parallel programs:
  - FFT (Fast Fourier Transform): matrix transposition + computation
  - LU: factorization of a dense 2D matrix (linear algebra)
  - Barnes-Hut: n-body algorithm solving a galaxy-evolution problem
  - Ocean: simulates the influence of eddy and boundary currents on large-scale ocean flow; dynamic arrays per grid
  - VolRend: parallel volume rendering (scientific visualization)
Chapter 8 page 25 CS 5515
Miss Rates for Snooping Protocols
[Figure: miss rate vs. processor count (1, 2, 4, 8, 16), 0-20%. FFT stays near 8%, LU near 2%, Barnes and Volrend near 1%; Ocean goes 14%, 18%, 15%, 13%, 9% -- big differences in miss rates among the programs, with Ocean dominated by capacity misses.]
- Cache size is 64KB, 2-way set associative, with 32B blocks.
- With the exception of Volrend, the misses in these applications are generated by accesses to data that is potentially shared.
- Except for Ocean, data is heavily shared; in Ocean only the boundaries of the subgrids are shared, though the entire grid is treated as a shared data object. Since the boundaries change as we increase the processor count (for a fixed-size problem), different amounts of the grid become shared. The anomalous increase in miss rate for Ocean in moving from 1 to 2 processors arises because of conflict misses in accessing the subgrids.
Chapter 8 page 26 CS 5515
% Misses Caused by Coherency Traffic vs. # of Processors
[Figure: fraction of misses caused by coherency vs. processor count (1, 2, 4, 8, 16), 0-80%, for FFT, LU, Barnes, Ocean, and Volrend; in the worst case, about 80% of misses are due to coherency misses.]
- The percentage of cache misses caused by coherency transactions typically rises when a fixed-size problem is run on more processors.
- The absolute number of coherency misses is increasing in all these benchmarks, including Ocean. In Ocean, however, it is difficult to separate these misses from others, since the amount of sharing of the grid varies with processor count.
- Invalidation increases significantly; in FFT, the miss rate arising from coherency misses increases from nothing to almost 7%.
Chapter 8 page 27 CS 5515
Miss Rates vs. Cache Size Per Processor
[Figure: miss rate vs. per-processor cache size (16, 32, 64, 128, 256 KB), 0-20%, for FFT, LU, Barnes, Ocean, and Volrend; Ocean and FFT are strongly influenced by capacity misses.]
- Miss rate drops as the cache size is increased, unless the miss rate is dominated by coherency misses.
- The block size is 32B and the cache is 2-way set-associative. The processor count is fixed at 16 processors.
Chapter 8 page 28 CS 5515
% Misses Caused by Coherency Traffic vs. Cache Size
[Figure: fraction of misses caused by coherency vs. cache size (16, 32, 64, 128, 256 KB), 0-80%, for FFT, LU, Barnes, Ocean, and Volrend. The fraction grows as increasing cache size reduces capacity misses; some programs have small absolute miss rates (< 2%) while others have large absolute miss rates (> 8%).]
Chapter 8 page 29 CS 5515
Miss Rate vs. Block Size: Miss Rate Mostly Decreases with Increasing Block Size
[Figure: miss rate vs. block size (16, 32, 64, 128 B), 0-14%. FFT drops from 13% to 4%, LU from 4% to 0%, Ocean from 13% to 5%; Barnes and Volrend stay near 1%.]
- Overall, the miss rate drops as the block size increases, due to the decrease in capacity misses.
- Since a cache block holds multiple words, coherence misses can increase with a larger block because of the higher probability of the block being invalidated.
- Note: False sharing arises from the use of an invalidation-based coherency algorithm. It occurs when a block is invalidated (and a subsequent reference causes a miss) because some word in the block, other than the one being read, is written into. False sharing would not arise if each cache block contained only one word.
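False sharing can be demonstrated by counting coherence misses for two processors that write disjoint words: with a multi-word block they invalidate each other's copies, while with one-word blocks they do not. A simplified counting model (the function and trace format are invented for this sketch; compulsory misses are not counted):

```python
def coherence_misses(accesses, block_size):
    """Count misses caused by invalidations under a write-invalidate protocol.
    `accesses` is a list of (processor, word_address) writes; each write makes
    the writer the exclusive holder of the block, invalidating any other copy."""
    holder = {}          # block -> processor currently holding it exclusively
    misses = 0
    for p, addr in accesses:
        block = addr // block_size
        if holder.get(block) not in (None, p):
            misses += 1  # block was invalidated by the other processor's write
        holder[block] = p
    return misses

# P0 repeatedly writes word 0; P1 repeatedly writes word 1 (disjoint data!).
trace = [(0, 0), (1, 1)] * 4
print(coherence_misses(trace, block_size=4))  # 7: both words share a block -> ping-pong
print(coherence_misses(trace, block_size=1))  # 0: one word per block -> no false sharing
```

With 4-word blocks every write after the first misses, even though the processors never touch each other's data; with one-word blocks the misses vanish, exactly as the note above states.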
Chapter 8 page 30 CS 5515
Bus Traffic vs. Block Size
[Figure: bus traffic in bytes per data reference (up to about 7.0) vs. block size (16, 32, 64, 128 B) for FFT, LU, Barnes, Ocean, and Volrend -- huge increases in bus traffic due to coherency.]
- Bus traffic climbs steadily as the block size is increased.
- Volrend: the increase is more than a factor of 10, although the low miss rate keeps the absolute traffic small.
- The factor-of-3 increase in traffic for Ocean is the best argument against larger block sizes.
- Remember that our protocol treats ownership misses the same as other misses, slightly increasing the penalty for large cache blocks: in both Ocean and FFT this effect accounts for less than 10% of the traffic.
Chapter 8 page 31 CS 5515
Miss Rates for Directory-Based Protocols
[Figure: miss rate vs. processor count (8, 16, 32, 64), 0-7%. FFT stays near 5%, LU near 1%, Barnes near 0%, Volrend near 1%; Ocean goes 6%, 4%, 3%, then back up to 7% at 64 processors.]
- Cache size is 128 KB, 2-way set associative, with 64B blocks.
- In Ocean only the boundaries of the subgrids are shared. Since the boundaries change as we increase the processor count (for a fixed-size problem), different amounts of the grid become shared; the increase in miss rate in moving to 64 processors arises because of conflict misses in accessing small subgrids.
Chapter 8 page 32 CS 5515
Miss Rates vs. Cache Size per Processor for Directory-Based Protocols
[Figure: miss rate vs. per-processor cache size (32, 64, 128, 256, 512 KB), 0-18%. FFT drops from 9% to 4%, LU from 2% to 1%, Ocean from 18% to 5%; Barnes and Volrend stay at or below 1%.]
- Miss rate drops as the cache size is increased, unless the miss rate is dominated by coherency misses.
- The block size is 64B and the cache is 2-way set-associative. The processor count is fixed at 16 processors.
Chapter 8 page 33 CS 5515
Block Size Effect for Directory Protocols
[Figure: miss rate vs. block size (16, 32, 64, 128 B), 0-14%. FFT drops from 12% to 3%, LU from 3% to 0%, Ocean from 13% to 5%; Barnes stays near 0% and Volrend near 1%.]
- Assumes a 128 KB cache and 64 processors
- Use larger cache sizes to deal with the higher memory latencies of directory machines compared with snooping caches