Chapter 8 page 1 CS 5515
Flynn’s Classification
- SISD (Single Instruction, Single Data): uniprocessors
- MISD (Multiple Instruction, Single Data): no machine of this type has been built yet
- SIMD (Single Instruction, Multiple Data): examples: Illiac-IV, CM-2
  – Each processor has its own data memory
  – Only a single instruction memory and control processor
  – Less flexible
  – Uses special-purpose microprocessors
- MIMD (Multiple Instruction, Multiple Data): examples: SPARCCenter, T3D
  – Flexible
  – Uses off-the-shelf microprocessors
Chapter 8 page 2 CS 5515
Small-Scale, Centralized Shared-Memory Multiprocessors
- Memory: centralized with uniform memory access time ("UMA") and bus interconnect
- Examples: Sun SPARCCenter, SGI Challenge
Chapter 8 page 3 CS 5515
Large-Scale, Distributed-Memory Multiprocessors
- Memory: distributed with nonuniform memory access time ("NUMA") and scalable interconnect (distributed memory)
- Examples: T3D, Exemplar, Paragon, CM-5

[Figure: memory hierarchy with low latency / high reliability annotations -- roughly 1 cycle to cache, 40 cycles to local memory, 100 cycles to remote memory.]
Chapter 8 page 4 CS 5515
Communication Models
- Shared Memory
  - Processors communicate through a shared address space
  - Applicable to centralized or distributed shared-memory MPs
  - Advantages:
    – Ease of programming -- the programmer deals with a single address space
    – Lower overhead for communication (e.g., via hardware)
    – Better use of bandwidth when communicating small items
    – Easier to use hardware-controlled caching
- Message Passing
  - Processors have private memories and communicate via messages; such machines are normally called "multicomputers"
  - MPI (Message Passing Interface) standards are available
  - Advantages:
    – Simpler hardware, e.g., no cache-coherence issues
    – Communication is explicit, forcing the programmer to pay attention to costly non-local operations (a drawback as well)
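The contrast between the two models can be sketched in a few lines of code. This is only an illustrative sketch, not MPI itself (a real message-passing program would use an MPI library): one pair of workers communicates through explicit sends and receives on a queue, while the shared-memory style communicates implicitly by writing a common array.

```python
import threading
import queue

# --- Message-passing style: private data, explicit send/receive ---
def producer(out_q):
    for x in [1, 2, 3]:
        out_q.put(x)          # explicit communication: send a message
    out_q.put(None)           # end-of-stream marker

def consumer(in_q, result):
    while (x := in_q.get()) is not None:   # explicit communication: receive
        result.append(x * 10)

q = queue.Queue()
result = []
t1 = threading.Thread(target=producer, args=(q,))
t2 = threading.Thread(target=consumer, args=(q, result))
t1.start(); t2.start(); t1.join(); t2.join()
print(result)                 # [10, 20, 30]

# --- Shared-memory style: one address space, implicit communication ---
shared = [0] * 3
lock = threading.Lock()

def writer(i, val):
    with lock:
        shared[i] = val       # communicate simply by writing shared locations

threads = [threading.Thread(target=writer, args=(i, (i + 1) * 10))
           for i in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(shared)                 # [10, 20, 30]
```

Note how the message-passing version makes every communication visible in the code, while the shared-memory version hides it inside ordinary loads and stores.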
Chapter 8 page 5 CS 5515
Small-Scale MPs -- Centralized Shared-Memory
- Caches serve to:
  - Reduce bandwidth requirements on the bus/memory
  - Reduce access latency
  - Valuable for both private data (used by a single processor) and shared data
- Problem: cache coherence
Chapter 8 page 6 CS 5515
The Problem of Cache Coherency
Chapter 8 page 7 CS 5515
What Does Coherency Mean?
- Informally: any read of a data item returns the most recently written value of that data item
- Two aspects to be addressed:
  - Coherence: what values can be returned by a read
  - Consistency: when a written value will be returned by a read
- A system is coherent if
  - Any write is eventually seen by a read
  - All writes are seen in the same order ("serialization")
- Two rules to ensure this:
  - If P writes x and P1 reads it, P's write will be seen if the read and write are sufficiently far apart
  - Writes to the same location are serialized, i.e., two writes to the same location by any two processors are seen in the same order by all processors -- a property called write serialization
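The write-serialization rule can be checked mechanically: given one agreed total order of writes to a location, every processor's observed sequence of values must be a subsequence of that order. A small sketch (the function name and trace format are hypothetical, chosen just for this illustration):

```python
def writes_serialized(observations, total_order):
    """True iff every processor's observed sequence of values written to one
    location is consistent with the single total order of writes."""
    def is_subsequence(sub, full):
        it = iter(full)
        # `v in it` consumes the iterator, so later values must appear later
        return all(v in it for v in sub)
    return all(is_subsequence(obs, total_order) for obs in observations)

# Writes to x: value 10, then value 20.
# P1 sees both in order, P2 only sees the later write -> serialized.
print(writes_serialized([[10, 20], [20]], [10, 20]))      # True
# P2 sees 20 before 10 -> violates write serialization.
print(writes_serialized([[10, 20], [20, 10]], [10, 20]))  # False
```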
Chapter 8 page 8 CS 5515
Potential Solutions to Coherency
- Snooping Protocols (Snoopy Bus):
  - Send all requests for data to all processors
  - Processors snoop to see if they have a copy and respond accordingly
  - Requires broadcast, since caching information is kept at the processors
  - Works well with a bus (a natural broadcast medium)
  - Dominates for small-scale machines (most of the market)
- Directory-Based Protocols:
  - Keep track of what is being shared in one centralized place
  - Distributed memory => distributed directory (avoids bottlenecks)
  - Send point-to-point requests to processors
  - Used for distributed shared-memory MPs (discussed in 8.4)
  - Scales better than snooping
Chapter 8 page 9 CS 5515
Basic Snooping Protocols
- Write Invalidate Protocol:
  - Multiple readers, single writer
  - Write to shared data: an invalidate is sent to all caches, which snoop and invalidate any copies
  - Read miss:
    – Write-through: memory is always up-to-date
    – Write-back: snoop in caches to find the most recent copy
- Write Update Protocol:
  - Write to shared data: broadcast on the bus; processors snoop and update their copies
  - Read miss: memory is always up-to-date
- In either case, write serialization is enforced because the bus is a single point of arbitration and serializes all write requests to the same data item
Chapter 8 page 10 CS 5515
Basic Snoopy Protocols
- Write Invalidate versus Write Update (Broadcast):
  - Invalidate requires only one invalidation broadcast for multiple writes, while Update requires multiple write broadcasts
  - For a multi-word cache block, Invalidate invalidates it only once, on the first write, while Update updates it for each word written
  - Update has lower latency between write and read
  - Update trades increased bandwidth requirement for decreased latency

Name         Protocol type      Memory-write policy                        Machines using
Write Once   Write invalidate   Write back after first write               First snoopy protocol
Synapse N+1  Write invalidate   Write back                                 First cache-coherent MPs
Berkeley     Write invalidate   Write back                                 Berkeley SPUR
Illinois     Write invalidate   Write back                                 SGI Power and Challenge
"Firefly"    Write update       Write back private, write through shared   SPARCCenter 2000

Write invalidate is the protocol of choice nowadays.
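The bandwidth trade-off above can be made concrete by counting bus broadcasts for a write pattern. A simplified counting sketch (it assumes, per the slide, one invalidation for the first write to a shared block versus one update broadcast per written word):

```python
def bus_broadcasts(writes, protocol):
    """Count coherence broadcasts for a sequence of (block, word) writes by one
    processor to shared data. Write update broadcasts every write; write
    invalidate broadcasts only on the first write to each block (after which
    the writer holds the block exclusively and writes locally)."""
    if protocol == "update":
        return len(writes)            # every written word goes on the bus
    invalidated = set()               # blocks already invalidated in other caches
    count = 0
    for block, _word in writes:
        if block not in invalidated:
            invalidated.add(block)
            count += 1                # one invalidation, then writes are local
    return count

# Eight writes filling the eight words of one shared block:
writes = [(0, w) for w in range(8)]
print(bus_broadcasts(writes, "update"))      # 8 broadcasts
print(bus_broadcasts(writes, "invalidate"))  # 1 broadcast
```

This is exactly the multi-word-block point above: Update pays per word written, Invalidate pays once per block.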
Chapter 8 page 11 CS 5515
An Example Snooping Protocol
- Invalidation protocol, write-back cache
- Each block of memory is in one state:
  - Clean in all caches and up-to-date in memory
  - OR Dirty in exactly one cache
  - OR Not in any cache
- Each cache block is in one state:
  - Shared: the cache has one copy, the same as in memory, and there may be other copies in other caches
  - OR Exclusive: the cache has the only copy; it is writeable and dirty
  - OR Invalid: the block contains no data
- Read misses: cause all caches to snoop
- A write to a cache block in the Shared state is treated as a write miss -- an assumption that can be improved upon
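The three block states above define a small state machine. The following is a simplified sketch of the CPU-side and bus-side transitions for one cache block (write-backs are omitted, and a write in Shared is treated as a write miss, as the slide assumes; function names are invented for illustration):

```python
INVALID, SHARED, EXCLUSIVE = "invalid", "shared", "exclusive"

def cpu_event(state, op):
    """Next (state, bus_action) for a CPU read or write on this block."""
    if op == "read":
        # Read hits cause no bus action; a read miss goes on the bus.
        return (state, None) if state != INVALID else (SHARED, "read miss")
    # op == "write": from Invalid or Shared, a write miss goes on the bus.
    return (EXCLUSIVE, None) if state == EXCLUSIVE else (EXCLUSIVE, "write miss")

def bus_event(state, op):
    """Next state when another processor's miss for this block is snooped."""
    if op == "write miss":
        return INVALID        # another processor wants to write: drop the copy
    if op == "read miss" and state == EXCLUSIVE:
        return SHARED         # supply the dirty data, keep a read-only copy
    return state

s = INVALID
s, action = cpu_event(s, "write")   # -> exclusive, "write miss" placed on bus
s = bus_event(s, "read miss")       # another CPU reads -> shared
s = bus_event(s, "write miss")      # another CPU writes -> invalid
print(s)                            # invalid
```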
Chapter 8 page 12 CS 5515
Snoopy-Cache State Machine-I
- State machine for a cache block based on requests from the CPU
- Boldface is used to specify bus actions

[Figure: CPU-side state diagram over Invalid, Shared (read only), and Exclusive (read/write). A CPU read miss places a read miss on the bus; a CPU write places a write miss on the bus; in Exclusive, a CPU read or write miss for another block writes the block back before placing the miss on the bus; CPU read hits and write hits in Exclusive cause no bus action.]
Chapter 8 page 13 CS 5515
Snoopy-Cache State Machine-II
- Cache state machine based on requests from the bus

[Figure: bus-side state diagram over Invalid, Shared (read only), and Exclusive (read/write). A write miss for this block invalidates the copy; in Exclusive, a read miss for the block writes the block back and aborts the memory access (moving to Shared), and a write miss for the block writes the block back, aborts the memory access, and invalidates the copy.]
Chapter 8 page 14 CS 5515
Snoop Cache: State Machine (Fig. 8.12)
- A possible extension (e.g., Problem 8.4): add a 4th state called Clean Private, in which the cache contains the only copy and it is the same as the memory copy
  - Clean Private -> Exclusive on a CPU write hit
  - Clean Private -> Shared on a read miss received from the bus

[Figure: combined state diagram over Invalid, Shared (read only), and Exclusive (read/write), merging the CPU-induced transitions of Machine-I with the bus-induced transitions of Machine-II.]
FIGURE 8.12 Cache-coherence state diagram with the state transitions induced by the local processor shown in black and by the bus activities shown in gray.
Chapter 8 page 15 CS 5515
An Example of Basic Snooping Protocol
- Events over time t1..t5: p1: write(A1,10); p2: read(A1); p2: write(A2,20); p3: read(A2)
- p2 replaces the cache block on a read miss. Assume that A1 and A2 map to the same cache block.

time  processor p1  processor p2  processor p3
t1    exclusive     invalid       invalid
t2    shared        invalid       shared
t3    shared        shared        shared
t4    invalid       exclusive     invalid
t5    invalid       *shared       invalid

* Note that the state of this cache block is no longer for the block containing A1 and A2; it is for another block.
Chapter 8 page 16 CS 5515
Snooping Protocol Variations
Basic Protocol states: Exclusive, Shared, Invalid

Berkeley Protocol states: Owned Exclusive, Owned Shared, Shared, Invalid
- A new bus operation called "invalidation" is introduced, so that the Shared -> Owned Exclusive transition can be achieved by broadcasting invalidate operations; there is no need to generate a write miss on a write hit.

Illinois Protocol states: Private Dirty, Private Clean, Shared, Invalid
- If a read is sourced from memory, the block becomes Private Clean; if it is sourced from another cache, it becomes Shared. A write can proceed in the cache if the block is held Private Clean or Private Dirty. However, a write hit on a Shared block still generates a write-miss broadcast on the bus.
Chapter 8 page 17 CS 5515
Scalable MPs
- Separate memory per processor
  - Bus bandwidth cannot support a large number of processors
  - Replacing the bus with a general interconnection network makes snooping protocols expensive, because the cost of broadcasting is high
- A simple software solution: make shared data uncacheable
  - Simple hardware but poor performance
- Alternative: use a directory to track the state of every memory block in caches
  - Which caches have copies of the block, dirty vs. clean, ...
- To prevent the directory from becoming a bottleneck: distribute directory entries with the memory, each keeping track of which caches have copies of its blocks
  - This places the directory associated with a memory block at a known location, avoiding the cost of broadcasting
[Figure: several nodes, each consisting of a processor + cache, memory, and I/O, connected by an interconnection network; a directory is attached to the memory of each node.]
FIGURE 8.22 A directory is added to each node to implement cache coherence in a distributed-memory machine.
Chapter 8 page 18 CS 5515
Directory Protocol
- Similar to the snoopy protocol: 3 states
  - Shared: some processors have the data; memory is up to date
  - Uncached: no processor has a copy
  - Exclusive: one processor (the owner) has the data; memory is out of date
- In addition to the cache state, must track which processors have a copy (Shared or Exclusive): use a bit vector to maintain a Sharers set (one such set per memory block)
- Terms:
  - Local node: the node where a request originates
  - Home node: the node where the memory location of an address resides
  - Remote node: a node that has a copy of a cache block, whether exclusive or shared
Message type      Source          Destination      Contents  Function of this message
Read miss         Local cache     Home directory   P, A      Processor P has a read miss at address A; request the data and make P a read sharer.
Write miss        Local cache     Home directory   P, A      Processor P has a write miss at address A; request the data and make P the exclusive owner.
Invalidate        Home directory  Remote caches    A         Invalidate a shared copy of the data at address A.
Fetch             Home directory  Remote cache     A         Fetch the block at address A and send it to its home directory; change the state of A in the remote cache to Shared.
Fetch/invalidate  Home directory  Remote cache     A         Fetch the block at address A and send it to its home directory; invalidate the block in the cache.
Data value reply  Home directory  Local cache      Data      Return a data value from the home memory.
Data write back   Remote cache    Home directory   A, Data   Write back a data value for address A.

FIGURE 8.23 The possible messages sent among nodes to maintain coherence.
Chapter 8 page 19 CS 5515
Directory Protocol
- A message sent to the directory causes two actions:
  - Update the directory, and send messages to caches to satisfy the request
- Block is in the Uncached state: the copy in memory is the only copy, so the possible requests for that block are:
  - Read miss: the requesting processor is sent the data from memory, and the requestor becomes the only sharing node. The state of the block is made Shared.
  - Write miss: the requesting processor is sent the value and becomes the sharing node. The block is made Exclusive to indicate that the only valid copy is cached. Sharers indicates the identity of the owner.
- Block is Shared; the memory value is up-to-date:
  - Read miss: the requesting processor is sent the data from memory and is added to the sharing set.
  - Write miss: the requesting processor is sent the value. All processors in the Sharers set are sent invalidate messages, and Sharers is set to the identity of the requesting processor. The state of the block is made Exclusive.
Chapter 8 page 20 CS 5515
Directory Protocol
- Block is Exclusive: the current value of the block is held in the cache of the processor identified by the Sharers set (the owner), and there are 3 possible directory requests:
  - Read miss: the owner processor is sent a data-fetch message, which causes the state of the block in the owner's cache to transition to Shared and causes the owner to send the data to the directory, where it is written to memory and sent back to the requesting processor. The identity of the requesting processor is added to the Sharers set, which still contains the identity of the processor that was the owner (since it still has a readable copy).
  - Data write-back: the owner processor is replacing the block and hence must write it back. This makes the memory copy up-to-date (the home directory essentially becomes the owner), the block becomes Uncached, and the Sharers set is empty.
  - Write miss: the block has a new owner. A message is sent to the old owner, causing its cache to send the value of the block to the directory, from which it is sent to the requesting processor, which becomes the new owner. Sharers is set to the identity of the new owner, and the state of the block is made Exclusive.
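The three cases above (Uncached, Shared, Exclusive) can be sketched as a home-directory handler for a single memory block. This is a simplified model, not a hardware description: message sends are represented as returned strings, and the class and method names are invented for illustration.

```python
class Directory:
    """Home-directory state for one memory block: a state plus a Sharers set."""
    def __init__(self):
        self.state = "uncached"
        self.sharers = set()

    def read_miss(self, p):
        msgs = ["data value reply -> P%d" % p]
        if self.state == "exclusive":
            owner = next(iter(self.sharers))
            # Fetch the dirty copy; the owner keeps a readable (shared) copy.
            msgs.insert(0, "fetch -> P%d" % owner)
        self.sharers.add(p)
        self.state = "shared"
        return msgs

    def write_miss(self, p):
        msgs = []
        if self.state == "shared":
            # Invalidate every current sharer before granting ownership.
            msgs += ["invalidate -> P%d" % q for q in sorted(self.sharers)]
        elif self.state == "exclusive":
            owner = next(iter(self.sharers))
            msgs.append("fetch/invalidate -> P%d" % owner)
        msgs.append("data value reply -> P%d" % p)
        self.state, self.sharers = "exclusive", {p}
        return msgs

    def write_back(self, p):
        # Owner replaces the block: memory becomes up-to-date, block uncached.
        self.state, self.sharers = "uncached", set()

d = Directory()
d.write_miss(1)            # P1 becomes owner: exclusive, Sharers = {1}
print(d.read_miss(2))      # ['fetch -> P1', 'data value reply -> P2']
print(d.state, sorted(d.sharers))   # shared [1, 2]
```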
Chapter 8 page 21 CS 5515
State Transition Diagram for the Home Directory that Tracks the Status of a Memory Block
- Three states: Uncached, Shared, and Exclusive
- All actions are shown in gray since they are all externally caused. Italics indicates the action taken by the directory in response to the request. Bold italics indicates an action that updates the sharing set, Sharers, as opposed to sending a message.

[Figure: from Uncached, a read miss replies with the data value and sets Sharers={P} (to Shared), and a write miss replies with the data value and sets Sharers={P} (to Exclusive). From Shared, a read miss replies with the data value and sets Sharers=Sharers+{P}; a write miss sends invalidates, sets Sharers={P}, and replies with the data value (to Exclusive). From Exclusive, a read miss fetches the block, replies with the data value, and sets Sharers=Sharers+{P} (to Shared); a data write-back sets Sharers={} (to Uncached); a write miss fetch/invalidates, replies with the data value, and sets Sharers={P}.]
FIGURE 8.25 The state transition diagram for the directory has the same states and structure as the transition diagram for an individual cache.
Chapter 8 page 22 CS 5515
State Transition Diagram for an Individual Cache Block in a Directory-Based System
- The states are identical to those in the snooping case, and the transitions are very similar, with explicit invalidate and write-back requests replacing the write misses that were formerly broadcast on the bus.
- Colors in the figure:
  - black: requests from the CPU
  - gray: requests from the home directory

[Figure: transitions over Invalid, Shared (read only), and Exclusive (read/write). A CPU read miss sends a read-miss message; a CPU write sends a write-miss message; an Invalidate from the directory moves Shared to Invalid; in Exclusive, a Fetch causes a data write-back (to Shared), a Fetch/invalidate causes a data write-back (to Invalid), and CPU read and write hits cause no messages.]
FIGURE 8.24 State transition diagram for an individual cache block in a directory-based system.
Chapter 8 page 23 CS 5515
An Example of Directory-Based Protocol
- Events over time t1..t5: p1: write(A1,10); p2: read(A1); p2: write(A2,20); p3: read(A2)
- p2 replaces the cache block on a read miss. Assume that A1 and A2 map to the same cache block.

time  processor p1  processor p2  processor p3  home directory
t1    exclusive     invalid       invalid       exclusive
t2    shared        invalid       shared        shared
t3    shared        shared        shared        shared
t4    invalid       exclusive     invalid       exclusive
t5    invalid       *shared       invalid       uncached

* Note that the state of this cache block is no longer for the block containing A1 and A2; it is for another block.
Chapter 8 page 24 CS 5515
- 4th C: Coherency misses, in addition to Conflict, Capacity, and Compulsory misses
- More processors: increase coherency misses while decreasing capacity misses (cache size increases for a fixed problem size)
- Cache behavior of five parallel programs:
  - FFT (Fast Fourier Transform): matrix transposition + computation
  - LU: factorization of a dense 2D matrix (linear algebra)
  - Barnes-Hut: n-body algorithm solving a galaxy-evolution problem
  - Ocean: simulates the influence of eddy and boundary currents on large-scale ocean flow; dynamic arrays per grid
  - VolRend: parallel volume rendering (scientific visualization)
Chapter 8 page 25 CS 5515
Miss Rates for Snooping Protocols
[Figure: miss rate vs. processor count (1, 2, 4, 8, 16), 0-20%. FFT stays near 8%, LU near 2%, Barnes and Volrend near 1%; Ocean goes 14%, 18%, 15%, 13%, 9% -- big differences in miss rates among the programs, with Ocean dominated by capacity misses.]
- Cache size is 64KB, 2-way set associative, with 32B blocks.
- With the exception of Volrend, the misses in these applications are generated by accesses to data that is potentially shared.
- Except for Ocean, data is heavily shared; in Ocean only the boundaries of the subgrids are shared, though the entire grid is treated as a shared data object. Since the boundaries change as we increase the processor count (for a fixed-size problem), different amounts of the grid become shared. The anomalous increase in miss rate for Ocean in moving from 1 to 2 processors arises because of conflict misses in accessing the subgrids.
Chapter 8 page 26 CS 5515
% Misses Caused by Coherency Traffic vs. # of Processors
[Figure: fraction of misses caused by coherency vs. processor count (1, 2, 4, 8, 16), 0-80%, for FFT, LU, Barnes, Ocean, and Volrend; in the worst case, about 80% of misses are due to coherency misses.]
- The percentage of cache misses caused by coherency transactions typically rises when a fixed-size problem is run on more processors.
- The absolute number of coherency misses is increasing in all these benchmarks, including Ocean. In Ocean, however, it is difficult to separate these misses from others, since the amount of sharing of the grid varies with processor count.
- Invalidation increases significantly; in FFT, the miss rate arising from coherency misses increases from nothing to almost 7%.
Chapter 8 page 27 CS 5515
Miss Rates vs. Cache Size Per Processor
[Figure: miss rate vs. per-processor cache size (16, 32, 64, 128, 256 KB), 0-20%, for FFT, LU, Barnes, Ocean, and Volrend; Ocean and FFT are strongly influenced by capacity misses.]
- Miss rate drops as the cache size is increased, unless the miss rate is dominated by coherency misses.
- The block size is 32B and the cache is 2-way set-associative. The processor count is fixed at 16 processors.
Chapter 8 page 28 CS 5515
% Misses Caused by Coherency Traffic vs. Cache Size
[Figure: fraction of misses caused by coherency vs. cache size (16, 32, 64, 128, 256 KB), 0-80%, for FFT, LU, Barnes, Ocean, and Volrend. The fraction grows as increasing cache size reduces capacity misses; some programs have small absolute miss rates (< 2%) while others have large absolute miss rates (> 8%).]
Chapter 8 page 29 CS 5515
Miss Rate vs. Block Size: Miss Rate Mostly Decreases with Increasing Block Size
[Figure: miss rate vs. block size (16, 32, 64, 128 B), 0-14%. FFT drops from 13% to 4%, LU from 4% to 0%, Ocean from 13% to 5%; Barnes and Volrend stay near 1%.]
- Overall, the miss rate drops as the block size increases, due to the decrease in capacity misses.
- Since a cache block holds multiple words, coherence misses can increase with a larger block because of the higher probability of the block being invalidated.
- Note: False sharing arises from the use of an invalidation-based coherency algorithm. It occurs when a block is invalidated (and a subsequent reference causes a miss) because some word in the block, other than the one being read, is written into. False sharing would not arise if each cache block contained only one word.
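False sharing can be demonstrated by counting coherence misses for two processors that write disjoint words: with a multi-word block they invalidate each other's copies, while with one-word blocks they do not. A simplified counting model (the function and trace format are invented for this sketch; compulsory misses are not counted):

```python
def coherence_misses(accesses, block_size):
    """Count misses caused by invalidations under a write-invalidate protocol.
    `accesses` is a list of (processor, word_address) writes; each write makes
    the writer the exclusive holder of the block, invalidating any other copy."""
    holder = {}          # block -> processor currently holding it exclusively
    misses = 0
    for p, addr in accesses:
        block = addr // block_size
        if holder.get(block) not in (None, p):
            misses += 1  # block was invalidated by the other processor's write
        holder[block] = p
    return misses

# P0 repeatedly writes word 0; P1 repeatedly writes word 1 (disjoint data!).
trace = [(0, 0), (1, 1)] * 4
print(coherence_misses(trace, block_size=4))  # 7: both words share a block -> ping-pong
print(coherence_misses(trace, block_size=1))  # 0: one word per block -> no false sharing
```

With 4-word blocks every write after the first misses, even though the processors never touch each other's data; with one-word blocks the misses vanish, exactly as the note above states.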
Chapter 8 page 30 CS 5515
Bus Traffic vs. Block Size
[Figure: bus traffic in bytes per data reference (up to about 7.0) vs. block size (16, 32, 64, 128 B) for FFT, LU, Barnes, Ocean, and Volrend -- huge increases in bus traffic due to coherency.]
- Bus traffic climbs steadily as the block size is increased.
- Volrend: the increase is more than a factor of 10, although the low miss rate keeps the absolute traffic small.
- The factor-of-3 increase in traffic for Ocean is the best argument against larger block sizes.
- Remember that our protocol treats ownership misses the same as other misses, slightly increasing the penalty for large cache blocks: in both Ocean and FFT this effect accounts for less than 10% of the traffic.
Chapter 8 page 31 CS 5515
Miss Rates for Directory-Based Protocols
[Figure: miss rate vs. processor count (8, 16, 32, 64), 0-7%. FFT stays near 5%, LU near 1%, Barnes near 0%, Volrend near 1%; Ocean goes 6%, 4%, 3%, then back up to 7% at 64 processors.]
- Cache size is 128 KB, 2-way set associative, with 64B blocks.
- In Ocean only the boundaries of the subgrids are shared. Since the boundaries change as we increase the processor count (for a fixed-size problem), different amounts of the grid become shared; the increase in miss rate in moving to 64 processors arises because of conflict misses in accessing small subgrids.
Chapter 8 page 32 CS 5515
Miss Rates vs. Cache Size per Processor for Directory-Based Protocols
[Figure: miss rate vs. per-processor cache size (32, 64, 128, 256, 512 KB), 0-18%. FFT drops from 9% to 4%, LU from 2% to 1%, Ocean from 18% to 5%; Barnes and Volrend stay at or below 1%.]
- Miss rate drops as the cache size is increased, unless the miss rate is dominated by coherency misses.
- The block size is 64B and the cache is 2-way set-associative. The processor count is fixed at 16 processors.
Chapter 8 page 33 CS 5515
Block Size Effect for Directory Protocols
[Figure: miss rate vs. block size (16, 32, 64, 128 B), 0-14%. FFT drops from 12% to 3%, LU from 3% to 0%, Ocean from 13% to 5%; Barnes stays near 0% and Volrend near 1%.]
- Assumes a 128 KB cache and 64 processors
- Use larger cache sizes to deal with the higher memory latencies of directory machines compared with snooping caches