
1

Multiprocessors

Adapted from UCB cs252 Sp 2007 lectures

CprE 581 Computer Systems Architecture

(Chapter 5 of Hennessy-Patterson, 5th edition text)

2

Uniprocessor Performance (SPECint)

• VAX: 25%/year, 1978 to 1986
• RISC + x86: 52%/year, 1986 to 2002
• RISC + x86: 22%/year, 2002 to present

3

Déjà vu all over again?“… today’s processors … are nearing an impasse as technologies approach the

speed of light..”David Mitchell, The Transputer: The Time Is Now (1989)

Transputer had bad timing (Uniprocessor performance) Procrastination rewarded: 2X seq. perf. / 1.5 years

“We are dedicating all of our future product development to multicore designs. … This is a sea change in computing”

Paul Otellini, President, Intel (2005) All microprocessor companies switch to MP (2X CPUs / 2 yrs)

Procrastination penalized: 2X sequential perf. / 5 yrs

Manufacturer/Year    AMD/'11   Intel/'11   IBM/'10   Sun/'10
Processors/chip         8          4          8         8
Threads/Processor       1          2          4         8
Threads/chip            8          8         32        64

4

Other Factors ⇒ Multiprocessors

Growth in data-intensive applications: databases, file servers, ...
Growing interest in servers and server performance
Increasing desktop performance less important (outside of graphics)
Improved understanding of how to use multiprocessors effectively, especially servers, where there is significant natural TLP
Advantage of leveraging design investment by replication rather than unique design

5

Flynn's Taxonomy
Flynn classified by data and control streams in 1966

SIMD ⇒ Data-Level Parallelism
MIMD ⇒ Thread-Level Parallelism
MIMD popular because
  Flexible: N programs or 1 multithreaded program
  Cost-effective: same MPU in desktop & MIMD machine

Single Instruction, Single Data (SISD)

(Uniprocessor)

Single Instruction, Multiple Data SIMD

(single PC: Vector, CM-2)

Multiple Instruction, Single Data (MISD)

Multiple Instruction, Multiple Data MIMD

(Clusters, SMP servers)

M.J. Flynn, "Very High-Speed Computing Systems", Proc. of the IEEE, Vol. 54, pp. 1901-1909, Dec. 1966.

6

Back to Basics
"A parallel computer is a collection of processing elements that cooperate and communicate to solve large problems fast."
Parallel Architecture = Computer Architecture + Communication Architecture


7

Two Models for Communication and Memory Architecture

1. Communication occurs by explicitly passing messages among the processors: message-passing multiprocessors (aka multicomputers)
   • Modern cluster systems contain multiple stand-alone computers communicating via messages
2. Communication occurs through a shared address space (via loads and stores): shared-memory multiprocessors, either
   • UMA (Uniform Memory Access time) for shared address, centralized memory MP
   • NUMA (Non-Uniform Memory Access time) multiprocessor for shared address, distributed memory MP

In the past there was confusion over whether "sharing" means sharing physical memory (Symmetric MP) or sharing address space

8

Centralized vs. Distributed Memory

(Figure: two organizations. Centralized memory: processors P1..Pn, each with a cache, share memory modules over an interconnection network. Distributed memory: each processor/cache pair has its own local memory module, all connected by an interconnection network; the distributed organization scales further.)

9

Centralized Memory Multiprocessor
Also called symmetric multiprocessors (SMPs) because the single main memory has a symmetric relationship to all processors

Large caches ⇒ a single memory can satisfy the memory demands of a small number of processors

Can scale to a few dozen processors by using a switch and by using many memory banks

Although scaling beyond that is technically conceivable, it becomes less attractive as the number of processors sharing centralized memory increases

10

Distributed Memory Multiprocessor

Pro: Cost-effective way to scale memory bandwidth If most accesses are to local memory

Pro: Reduces latency of local memory accesses

Con: Communicating data between processors more complex

Con: Software must be aware of data placement to take advantage of increased memory BW

Programming model (pro or con?)

11

Challenges of Parallel Processing
Big challenge is the % of a program that is inherently sequential. What does it mean to be inherently sequential?

Suppose 80X speedup from 100 processors. What fraction of the original program can be sequential?
a. 10%
b. 5%
c. 1%
d. <1%

12

Amdahl’s Law Answers

Speedup_overall = 1 / ((1 - Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced)

80 = 1 / ((1 - Fraction_parallel) + Fraction_parallel / 100)

80 x ((1 - Fraction_parallel) + Fraction_parallel / 100) = 1
80 - 79.2 x Fraction_parallel = 1
Fraction_parallel = 79 / 79.2 = 99.75%

So less than 1% of the original program can be sequential (answer d).
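As a cross-check (not from the slides), a minimal C sketch that solves Amdahl's Law for the parallel fraction needed to hit a target speedup on a given processor count:

  #include <stdio.h>

  /* speedup = 1 / ((1 - F) + F / n)  =>  F = (1 - 1/speedup) / (1 - 1/n) */
  static double required_parallel_fraction(double speedup, double n) {
      return (1.0 - 1.0 / speedup) / (1.0 - 1.0 / n);
  }

  int main(void) {
      double f = required_parallel_fraction(80.0, 100.0);
      printf("parallel fraction = %.4f, sequential = %.2f%%\n",
             f, (1.0 - f) * 100.0);     /* ~0.9975 and ~0.25% */
      return 0;
  }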


13

Challenges of Parallel Processing

Second challenge is long latency to remote memory.
Suppose a 32-CPU MP, 2 GHz clock, 200 ns remote memory access, all local accesses hit in the memory hierarchy, and base CPI is 0.5. (Remote access = 200 ns / 0.5 ns per cycle = 400 clock cycles.) What is the performance impact if 0.2% of instructions involve a remote access?
a. 1.5X
b. 2.0X
c. 2.5X

14

CPI Equation
CPI = Base CPI + Remote request rate x Remote request cost
CPI = 0.5 + 0.2% x 400 = 0.5 + 0.8 = 1.3
The machine with no remote communication is 1.3/0.5 = 2.6 times faster than when 0.2% of instructions involve a remote access.
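A minimal C sketch of the same arithmetic (the numbers are the slide's; the helper is illustrative only):

  #include <stdio.h>

  int main(void) {
      double base_cpi    = 0.5;
      double remote_rate = 0.002;    /* 0.2% of instructions access remote memory */
      double remote_cost = 400.0;    /* 200 ns at 2 GHz = 400 clock cycles */

      double cpi = base_cpi + remote_rate * remote_cost;
      printf("CPI = %.2f, slowdown vs. all-local = %.2fx\n", cpi, cpi / base_cpi);
      /* prints: CPI = 1.30, slowdown vs. all-local = 2.60x */
      return 0;
  }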

15

Communication and SynchronizationParallel processes must co-operate to complete a single task fasterRequires distributed communication and synchronization Communication is for data values, or “what” Synchronization is for control, or “when” Communication and synchronization are often inter-related

i.e., “what” depends on “when”Message-passing bundles data and control Message arrival encodes “what” and “when”

In shared-memory machines, communication usually via coherent caches & synchronization via atomic memory operations Due to advent of single-chip multiprocessors, it is likely

cache-coherent shared memory systems will be the dominant form of multiprocessor

16

Symmetric Multiprocessors

• All memory is equally far away from all processors
• Any processor can do any I/O (set up a DMA transfer)

(Figure: two processors, memory, and graphics output on the CPU-Memory bus; a bridge connects to an I/O bus with I/O controllers for disks and networks.)

17

Synchronization
The need for synchronization arises whenever there are concurrent processes in a system (even in a uniprocessor system)

Forks and Joins: In parallel programming, a parallel process may want to wait until several events have occurred
Producer-Consumer: A consumer process must wait until the producer process has produced data
Exclusive use of a resource: Operating system has to ensure that only one process uses a resource at a given time

(Figure: a fork/join between P1 and P2, and a producer feeding a consumer.)

18

A Producer-Consumer Example

The program is written assuming instructions are executed in order.

Producer posting item x:
       Load Rtail, (tail)
       Store (Rtail), x
       Rtail = Rtail + 1
       Store (tail), Rtail

Consumer:
       Load Rhead, (head)
spin:  Load Rtail, (tail)
       if Rhead == Rtail goto spin
       Load R, (Rhead)
       Rhead = Rhead + 1
       Store (head), Rhead
       process(R)

(Figure: producer and consumer caches holding Rtail, Rhead, R; the tail and head pointers into the shared buffer.)

Problems?


19

A Producer-Consumer Example, continued

Producer posting item x:
       Load Rtail, (tail)
  (1)  Store (Rtail), x
       Rtail = Rtail + 1
  (2)  Store (tail), Rtail

Consumer:
       Load Rhead, (head)
spin:  (3) Load Rtail, (tail)
       if Rhead == Rtail goto spin
  (4)  Load R, (Rhead)
       Rhead = Rhead + 1
       Store (head), Rhead
       process(R)

Can the tail pointer get updated before the item x is stored?

The programmer assumes that if 3 happens after 2, then 4 happens after 1.

Problem sequences are:
  2, 3, 4, 1
  4, 1, 2, 3

20

Sequential Consistency: A Memory Model

"A system is sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in the order specified by the program"
Leslie Lamport

Sequential Consistency = arbitrary order-preserving interleaving of memory references of sequential programs

(Figure: processors P sharing a single memory M.)

21

Sequential Consistency
Sequential concurrent tasks: T1, T2
Shared variables: X, Y (initially X = 0, Y = 10)

T1:                       T2:
Store (X), 1   (X = 1)    Load R1, (Y)
Store (Y), 11  (Y = 11)   Store (Y'), R1   (Y' = Y)
                          Load R2, (X)
                          Store (X'), R2   (X' = X)

What are the legitimate answers for X' and Y'?

(X', Y') ∈ {(1,11), (0,10), (1,10), (0,11)} ?

If Y' is 11 then X' cannot be 0.

22

Sequential Consistency
Sequential consistency imposes more memory ordering constraints than those imposed by uniprocessor program dependencies (→)

What are these in our example?

T1:                       T2:
Store (X), 1   (X = 1)    Load R1, (Y)
Store (Y), 11  (Y = 11)   Store (Y'), R1   (Y' = Y)
                          Load R2, (X)
                          Store (X'), R2   (X' = X)
(additional SC requirements)

Does (can) a system with caches or out-of-order execution capability provide a sequentially consistent view of the memory?

more on this later

23

Multiple Consumer Example

Producer posting item x:
       Load Rtail, (tail)
       Store (Rtail), x
       Rtail = Rtail + 1
       Store (tail), Rtail

Consumer:
       Load Rhead, (head)
spin:  Load Rtail, (tail)
       if Rhead == Rtail goto spin
       Load R, (Rhead)
       Rhead = Rhead + 1
       Store (head), Rhead
       process(R)

What is wrong with this code?

Critical section: needs to be executed atomically by one consumer ⇒ locks

(Figure: one producer and two consumers, each with its own Rtail, Rhead, R, sharing the tail and head pointers.)

24

Locks or Semaphores
E. W. Dijkstra, 1965

A semaphore is a non-negative integer, with the following operations:

P(s): if s > 0, decrement s by 1, otherwise wait
V(s): increment s by 1 and wake up one of the waiting processes

P's and V's must be executed atomically, i.e., without
• interruptions or
• interleaved accesses to s by other processors

The initial value of s determines the maximum number of processes in the critical section

Process i:
  P(s)
  <critical section>
  V(s)
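A minimal busy-waiting sketch of P and V, assuming C11 <stdatomic.h> (Dijkstra's original semaphores block the waiting process instead of spinning):

  #include <stdatomic.h>

  typedef struct { atomic_int count; } semaphore;   /* non-negative integer s */

  static void P(semaphore *s) {
      for (;;) {
          int c = atomic_load(&s->count);
          /* if s > 0, decrement s by 1, otherwise wait (here: spin and retry) */
          if (c > 0 && atomic_compare_exchange_weak(&s->count, &c, c - 1))
              return;
      }
  }

  static void V(semaphore *s) {
      /* increment s by 1; a blocking version would also wake one waiter */
      atomic_fetch_add(&s->count, 1);
  }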


25

Implementation of Semaphores
Semaphores (mutual exclusion) can be implemented using ordinary Load and Store instructions in the Sequential Consistency memory model. However, protocols for mutual exclusion are difficult to design...

Simpler solution: atomic read-modify-write instructions

Test&Set (m), R:
  R ← M[m];
  if R == 0 then M[m] ← 1;

Swap (m), R:
  Rt ← M[m];
  M[m] ← R;
  R ← Rt;

Fetch&Add (m), RV, R:
  R ← M[m];
  M[m] ← R + RV;

Examples: m is a memory location, R is a register
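A sketch of a lock built from a Test&Set-style primitive, assuming C11 <stdatomic.h>; atomic_flag_test_and_set plays the role of Test&Set (m), R:

  #include <stdatomic.h>

  static atomic_flag mutex = ATOMIC_FLAG_INIT;   /* clear = free */

  static void lock(void) {
      /* returns the old value; keep retrying while the flag was already set */
      while (atomic_flag_test_and_set(&mutex))
          ;                                       /* spin */
  }

  static void unlock(void) {
      atomic_flag_clear(&mutex);                  /* Store (mutex), 0 */
  }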

26

Multiple Consumers Example using the Test&Set Instruction

P:    Test&Set (mutex), Rtemp
      if (Rtemp != 0) goto P
      Load Rhead, (head)
spin: Load Rtail, (tail)
      if Rhead == Rtail goto spin
      Load R, (Rhead)
      Rhead = Rhead + 1
      Store (head), Rhead
V:    Store (mutex), 0
      process(R)

Critical section: the code between the Test&Set at P and the Store at V.

Other atomic read-modify-write instructions (Swap, Fetch&Add, etc.) can also implement P's and V's

What if the process stops or is swapped out while in the critical section?

27

Nonblocking Synchronization

Compare&Swap (m), Rt, Rs:
  if (Rt == M[m])
    then M[m] ← Rs;
         Rs ← Rt;
         status ← success;
    else status ← fail;

status is an implicit argument

try:  Load Rhead, (head)
spin: Load Rtail, (tail)
      if Rhead == Rtail goto spin
      Load R, (Rhead)
      Rnewhead = Rhead + 1
      Compare&Swap (head), Rhead, Rnewhead
      if (status == fail) goto try
      process(R)
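The same non-blocking consumer as a C11 sketch (the buffer and index variables are assumptions, not from the slide); atomic_compare_exchange_strong plays the role of Compare&Swap and its return value is the implicit status:

  #include <stdatomic.h>

  extern int        buffer[];          /* assumed shared queue storage      */
  extern atomic_int head, tail;        /* indices published by the producer */

  int consume(void) {
      for (;;) {
          int h = atomic_load(&head);
          if (atomic_load(&tail) == h)
              continue;                          /* spin: queue empty        */
          int item = buffer[h];
          /* Compare&Swap (head), Rhead, Rnewhead: only one consumer wins    */
          if (atomic_compare_exchange_strong(&head, &h, h + 1))
              return item;                       /* status == success        */
          /* status == fail: another consumer advanced head; goto try        */
      }
  }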

28

Load-reserve & Store-conditional
Special register(s) hold a reservation flag and address, and the outcome of store-conditional

Load-reserve R, (m):
  <flag, adr> ← <1, m>;
  R ← M[m];

Store-conditional (m), R:
  if <flag, adr> == <1, m>
    then cancel other procs' reservation on m;
         M[m] ← R;
         status ← succeed;
    else status ← fail;

try:  Load-reserve Rhead, (head)
spin: Load Rtail, (tail)
      if Rhead == Rtail goto spin
      Load R, (Rhead)
      Rhead = Rhead + 1
      Store-conditional (head), Rhead
      if (status == fail) goto try
      process(R)

29

Performance of Locks
Blocking atomic read-modify-write instructions
  e.g., Test&Set, Fetch&Add, Swap
vs.
Non-blocking atomic read-modify-write instructions
  e.g., Compare&Swap, Load-reserve/Store-conditional
vs.
Protocols based on ordinary Loads and Stores

Performance depends on several interacting factors: degree of contention, caches, out-of-order execution of Loads and Stores

30

Issues in Implementing Sequential Consistency

Implementation of SC is complicated by two issues

• Out-of-order execution capability
  Load(a); Load(b)    yes
  Load(a); Store(b)   yes if a ≠ b
  Store(a); Load(b)   yes if a ≠ b
  Store(a); Store(b)  yes if a ≠ b

• Caches
  Caches can prevent the effect of a store from being seen by other processors

(Figure: processors P sharing a single memory M.)


31

Memory Fences
Instructions to sequentialize memory accesses

Processors with relaxed or weak memory models (i.e., that permit Loads and Stores to different addresses to be reordered) need to provide memory fence instructions to force the serialization of memory accesses

Examples of processors with relaxed memory models:
  Sparc V8 (TSO, PSO): Membar
  Sparc V9 (RMO): Membar #LoadLoad, Membar #LoadStore, Membar #StoreLoad, Membar #StoreStore
  PowerPC (WO): Sync, EIEIO

Memory fences are expensive operations; however, one pays the cost of serialization only when it is required

32

Using Memory Fences

Producer posting item x:
       Load Rtail, (tail)
       Store (Rtail), x
       MembarSS            (ensures that the tail pointer is not updated before x has been stored)
       Rtail = Rtail + 1
       Store (tail), Rtail

Consumer:
       Load Rhead, (head)
spin:  Load Rtail, (tail)
       if Rhead == Rtail goto spin
       MembarLL            (ensures that R is not loaded before x has been stored)
       Load R, (Rhead)
       Rhead = Rhead + 1
       Store (head), Rhead
       process(R)
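A C11 sketch of the same fenced producer/consumer (a single-producer, single-consumer queue is assumed): release/acquire ordering plays the role of MembarSS and MembarLL on a machine with a relaxed memory model:

  #include <stdatomic.h>

  #define N 1024
  static int        buf[N];
  static atomic_int tail = 0, head = 0;

  void produce(int x) {
      int t = atomic_load_explicit(&tail, memory_order_relaxed);
      buf[t % N] = x;                                   /* Store (Rtail), x  */
      /* release = MembarSS: x is visible before the new tail is published */
      atomic_store_explicit(&tail, t + 1, memory_order_release);
  }

  int consume(void) {
      int h = atomic_load_explicit(&head, memory_order_relaxed);
      /* acquire = MembarLL: do not read the slot before seeing the new tail */
      while (atomic_load_explicit(&tail, memory_order_acquire) == h)
          ;                                             /* spin: queue empty */
      int x = buf[h % N];                               /* Load R, (Rhead)   */
      atomic_store_explicit(&head, h + 1, memory_order_relaxed);
      return x;
  }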

33

Data-Race Free Programs a.k.a. Properly Synchronized Programs

Process 1                 Process 2
...                       ...
Acquire(mutex);           Acquire(mutex);
<critical section>        <critical section>
Release(mutex);           Release(mutex);

Synchronization variables (e.g. mutex) are disjoint from data variables
Accesses to writable shared data variables are protected in critical regions
⇒ no data races except for locks
(Formal definition is elusive)

In general, it cannot be proven if a program is data-race free.

34

Fences in Data-Race Free Programs

Process 1                 Process 2
...                       ...
Acquire(mutex);           Acquire(mutex);
membar;                   membar;
<critical section>        <critical section>
membar;                   membar;
Release(mutex);           Release(mutex);

• Relaxed memory model allows reordering of instructions by the compiler or the processor as long as the reordering is not done across a fence
• The processor also should not speculate or prefetch across fences

35

Mutual Exclusion Using Load/Store
A protocol based on two shared variables c1 and c2. Initially, both c1 and c2 are 0 (not busy)

Process 1                          Process 2
...                                ...
   c1 = 1;                            c2 = 1;
L: if c2 = 1 then go to L          L: if c1 = 1 then go to L
   <critical section>                 <critical section>
   c1 = 0;                            c2 = 0;

What is wrong?  Deadlock!

36

Mutual Exclusion: second attempt

To avoid deadlock, let a process give up the reservation (i.e. Process 1 sets c1 to 0) while waiting.

Process 1                          Process 2
...                                ...
L: c1 = 1;                         L: c2 = 1;
   if c2 = 1 then                     if c1 = 1 then
     { c1 = 0; go to L }                { c2 = 0; go to L }
   <critical section>                 <critical section>
   c1 = 0                             c2 = 0

• Deadlock is not possible, but with a low probability a livelock may occur.
• An unlucky process may never get to enter the critical section ⇒ starvation


37

A Protocol for Mutual Exclusion
T. Dekker, 1966

A protocol based on 3 shared variables c1, c2 and turn. Initially, both c1 and c2 are 0 (not busy)

Process 1                              Process 2
...                                    ...
   c1 = 1;                                c2 = 1;
   turn = 1;                              turn = 2;
L: if c2 = 1 & turn = 1 then go to L   L: if c1 = 1 & turn = 2 then go to L
   <critical section>                     <critical section>
   c1 = 0;                                c2 = 0;

• turn = i ensures that only process i can wait
• variables c1 and c2 ensure mutual exclusion

Solution for n processes was given by Dijkstra and is quite tricky!
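A direct C11 transcription of the protocol above for Process 1 (Process 2 is symmetric with c2/c1 and turn = 2); sequentially consistent atomics are assumed, since the protocol relies on SC:

  #include <stdatomic.h>

  static atomic_int c1 = 0, c2 = 0, turn = 1;

  void process1_enter(void) {
      atomic_store(&c1, 1);
      atomic_store(&turn, 1);
      /* L: if c2 = 1 & turn = 1 then go to L */
      while (atomic_load(&c2) == 1 && atomic_load(&turn) == 1)
          ;                                  /* only process 1 can wait here */
  }

  void process1_exit(void) {
      atomic_store(&c1, 0);                  /* leave the critical section */
  }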

38

Analysis of Dekker's Algorithm

(The slide traces two interleavings, Scenario 1 and Scenario 2, over the same code:)

Process 1                              Process 2
...                                    ...
   c1 = 1;                                c2 = 1;
   turn = 1;                              turn = 2;
L: if c2 = 1 & turn = 1 then go to L   L: if c1 = 1 & turn = 2 then go to L
   <critical section>                     <critical section>
   c1 = 0;                                c2 = 0;

39

N-process Mutual Exclusion: Lamport's Bakery Algorithm

Initially num[j] = 0, for all j

Process i

Entry Code:
  choosing[i] = 1;
  num[i] = max(num[0], ..., num[N-1]) + 1;
  choosing[i] = 0;
  for (j = 0; j < N; j++) {
    while (choosing[j]);
    while (num[j] &&
           ((num[j] < num[i]) ||
            (num[j] == num[i] && j < i)));
  }

Exit Code:
  num[i] = 0;
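The same entry/exit code as a C sketch (N and the volatile arrays are assumptions; as the slides note, plain loads and stores like these are only correct under the sequentially consistent memory model):

  #define N 8                               /* number of processes (assumed) */

  volatile int choosing[N];                 /* initially all 0 */
  volatile int num[N];                      /* initially all 0 */

  void bakery_enter(int i) {                /* entry code for process i */
      choosing[i] = 1;
      int m = 0;
      for (int j = 0; j < N; j++)
          if (num[j] > m) m = num[j];
      num[i] = m + 1;                       /* take the next ticket */
      choosing[i] = 0;
      for (int j = 0; j < N; j++) {
          while (choosing[j]) ;             /* wait until j has picked its ticket */
          while (num[j] &&                  /* wait while j has priority          */
                 (num[j] < num[i] || (num[j] == num[i] && j < i))) ;
      }
  }

  void bakery_exit(int i) {                 /* exit code */
      num[i] = 0;
  }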

40

Memory Consistency in SMPs

(Figure: CPU-1 with cache-1 holding A = 100, CPU-2 with cache-2 holding A = 100, and memory holding A = 100, all on the CPU-Memory bus.)

Suppose CPU-1 updates A to 200.
  write-back: memory and cache-2 have stale values
  write-through: cache-2 has a stale value

Do these stale values matter?
What is the view of shared memory for programming?

41

Write-back Caches & SC

prog T1:         prog T2:
ST X, 1          LD Y, R1
ST Y, 11         ST Y', R1
                 LD X, R2
                 ST X', R2

                                cache-1          cache-2                           memory
• T1 is executed                X = 1, Y = 11    -                                 X = 0, Y = 10
• cache-1 writes back Y         X = 1, Y = 11    -                                 X = 0, Y = 11
• T2 executed                   X = 1, Y = 11    Y = 11, Y' = 11, X = 0, X' = 0    X = 0, Y = 11
• cache-1 writes back X         X = 1, Y = 11    Y = 11, Y' = 11, X = 0, X' = 0    X = 1, Y = 11
• cache-2 writes back X' & Y'   X = 1, Y = 11    Y = 11, Y' = 11, X = 0, X' = 0    X = 1, Y = 11, X' = 0, Y' = 11

42

Write-through Caches & SC

prog T1:         prog T2:
ST X, 1          LD Y, R1
ST Y, 11         ST Y', R1
                 LD X, R2
                 ST X', R2

                 cache-1          cache-2                           memory
• initially      X = 0, Y = 10    X = 0                             X = 0, Y = 10
• T1 executed    X = 1, Y = 11    X = 0 (stale)                     X = 1, Y = 11
• T2 executed    X = 1, Y = 11    Y = 11, Y' = 11, X = 0, X' = 0    X = 1, Y = 11, X' = 0, Y' = 11

Write-through caches don't preserve sequential consistency either


43

Maintaining Sequential Consistency
SC is sufficient for correct producer-consumer and mutual exclusion code (e.g., Dekker)

Multiple copies of a location in various caches can cause SC to break down.

Hardware support is required such that
• only one processor at a time has write permission for a location
• no processor can load a stale copy of the location after a write

⇒ cache coherence protocols

44

Cache Coherence Protocols for SC

write request:
  the address is invalidated (updated) in all other caches before (after) the write is performed

read request:
  if a dirty copy is found in some cache, a write-back is performed before the memory is read

We will focus on Invalidation protocols as opposed to Update protocols

45

Warmup: Parallel I/O

(DMA stands for Direct Memory Access)

(Figure: a processor with its cache and a disk's DMA engine both sit on the memory bus to physical memory; either the cache or the DMA engine can be the bus master and effect transfers. Page transfers occur while the processor is running.)

46

Problems with Parallel I/O

Memory → Disk: physical memory may be stale if the cache copy is dirty
Disk → Memory: cache may hold stale data and not see memory writes

(Figure: DMA transfers between disk and physical memory bypass the processor's cache, which holds cached portions of the page.)

47

Snoopy Cache (Goodman 1983)
Idea: Have cache watch (or snoop upon) DMA transfers, and then "do the right thing"
Snoopy cache tags are dual-ported

(Figure: the cache's tags and state have two ports: one used by the processor, and a snoopy read port attached to the memory bus; the address/data/R-W lines are used to drive the memory bus when the cache is bus master.)

48

Snoopy Cache Actions for DMA

Observed Bus Cycle            Cache State          Cache Action
DMA Read (Memory → Disk)      Address not cached   No action
                              Cached, unmodified   No action
                              Cached, modified     Cache intervenes
DMA Write (Disk → Memory)     Address not cached   No action
                              Cached, unmodified   Cache purges its copy
                              Cached, modified     ???


49

Shared Memory Multiprocessor

Use snoopy mechanism to keep all processors' view of memory coherent

(Figure: processors M1, M2, M3, each with a snoopy cache, share the memory bus with physical memory and a DMA engine connected to disks.)


Cache Coherence
Processors may see different values through their caches:


51

Example Cache Coherence Problem

(Figure: P1, P2, P3 with caches on a bus to memory and I/O devices. Events: (1) P1 reads u and gets 5; (2) P3 reads u and gets 5; (3) P3 writes u = 7; (4) P1 reads u; (5) P2 reads u.)

Processors see different values for u after event 3
With write-back caches, the value written back to memory depends on the happenstance of which cache flushes or writes back its value and when
Processes accessing main memory may see a very stale value
Unacceptable for programming, and it's frequent!

52

Example

P1                 P2
/* Assume initial value of A and flag is 0 */
A = 1;             while (flag == 0); /* spin idly */
flag = 1;          print A;

Intuition not guaranteed by coherence:
  expect memory to respect order between accesses to different locations issued by a given process, and
  to preserve orders among accesses to the same location by different processes
Coherence is not enough! It pertains only to a single location

(Figure: conceptual picture of processors P1..Pn sharing a single memory.)

53

Intuitive Memory Model

(Figure: a processor with L1 and L2 caches, memory, and disk, holding three different values for location 100: 100:67, 100:35, 100:34.)

• Reading an address should return the last value written to that address
  - Easy in uniprocessors, except for I/O

Too vague and simplistic; 2 issues
1. Coherence defines values returned by a read
2. Consistency determines when a written value will be returned by a read
Coherence defines behavior to the same location; Consistency defines behavior w.r.t. other locations


Cache Coherence

Coherence
  All reads by any processor must return the most recently written value
  Writes to the same location by any two processors are seen in the same order by all processors

Consistency
  When a written value will be returned by a read
  If a processor writes location A followed by location B, any processor that sees the new value of B must also see the new value of A



55

Defining Coherent Memory System
1. Preserve Program Order: A read by processor P to location X that follows a write by P to X, with no writes of X by another processor occurring between the write and the read by P, always returns the value written by P
2. Coherent view of memory: A read by a processor to location X that follows a write by another processor to X returns the written value if the read and write are sufficiently separated in time and no other writes to X occur between the two accesses
3. Write serialization: 2 writes to the same location by any 2 processors are seen in the same order by all processors

56

Write Consistency

For now assume1. A write does not complete (and allow the next

write to occur) until all processors have seen the effect of that write

2. The processor does not change the order of any write with respect to any other memory access

if a processor writes location A followed by location B, any processor that sees the new value of B must also see the new value of A

These restrictions allow the processor to reorder reads, but forces the processor to finish writes in program order

57

Basic Schemes for Enforcing Coherence
A program running on multiple processors will normally have copies of the same data in several caches (unlike I/O, where it's rare)
Rather than trying to avoid sharing in SW, SMPs use a HW protocol to maintain coherent caches
Migration and Replication are key to performance of shared data
  Migration - data can be moved to a local cache and used there in a transparent fashion; reduces both latency to access shared data that is allocated remotely and bandwidth demand on the shared memory
  Replication - for shared data being simultaneously read, caches make a copy of the data in the local cache; reduces both latency of access and contention for read-shared data

58

2 Classes of Cache Coherence Protocols

1. Directory based — Sharing status of a block of physical memory is kept in just one location, the directory
2. Snooping — Every cache with a copy of data also has a copy of the sharing status of the block, but no centralized state is kept
   • All caches are accessible via some broadcast medium (a bus or switch)
   • All cache controllers monitor or snoop on the medium to determine whether or not they have a copy of a block that is requested on a bus or switch access


Snoopy Coherence Protocols

Write invalidate
  On write, invalidate all other copies
  Use the bus itself to serialize: a write cannot complete until bus access is obtained

Write update
  On write, update all copies


60

Snoopy Cache-Coherence Protocols

The cache controller "snoops" all transactions on the shared medium (bus or switch)
  A transaction is relevant if it is for a block the cache contains
  Take action to ensure coherence: invalidate, update, or supply value, depending on the state of the block and the protocol
Either get exclusive access before a write (write invalidate) or update all copies on a write

(Figure: processors P1..Pn with caches on a bus with memory and I/O devices; each cache line carries state, address, and data; the controller handles both cache-memory transactions and bus snoops.)


61

Example: Write-through Invalidate

(Figure: same sequence as before — (1) P1 reads u = 5; (2) P3 reads u = 5; (3) P3 writes u = 7, updating memory and invalidating the other copies; (4) P1 reads u; (5) P2 reads u — both now obtain 7.)

Must invalidate before step 3
Write update uses more broadcast medium BW ⇒ all recent MPUs use write invalidate

62

Architectural Building Blocks
Cache block state transition diagram
  FSM specifying how the disposition of a block changes: invalid, valid, dirty
Broadcast Medium Transactions (e.g., bus)
  Fundamental system design abstraction
  Logically a single set of wires connects several devices
  Protocol: arbitration, command/addr, data
  Every core observes every transaction
Broadcast medium enforces serialization of read or write accesses ⇒ Write serialization
  1st processor to get the medium invalidates others' copies
  Implies a write cannot complete until it obtains bus access
  All coherence schemes require serializing accesses to the same cache block
Also need to find the up-to-date copy of a cache block

63

Locate up-to-date copy of data
Write-through: get the up-to-date copy from memory
  Write-through is simpler if there is enough memory BW
Write-back is harder
  The most recent copy can be in a cache
  Can use the same snooping mechanism:
  1. Snoop every address placed on the bus
  2. If a processor has a dirty copy of the requested cache block, it provides it in response to a read request and aborts the memory access
  Complexity comes from retrieving the cache block from a processor cache, which can take longer than retrieving it from memory
Write-back needs lower memory bandwidth
  ⇒ supports larger numbers of faster processors
  ⇒ most multiprocessors use write-back

64

Cache Resources for WB Snooping
Normal cache tags can be used for snooping
A valid bit per block makes invalidation easy
Read misses are easy since they rely on snooping
Writes ⇒ need to know if any other copies of the block are cached
  No other copies ⇒ no need to place the write on the bus for WB
  Other copies ⇒ need to place an invalidate on the bus

65

Cache Resources for WB Snooping

To track whether a cache block is shared, add an extra state bit associated with each cache block, like the valid bit and the dirty bit
Write to a Shared block ⇒ need to place an invalidate on the bus and mark the cache block as private (if that is an option)
  No further invalidations will be sent for that block
  This processor is called the owner of the cache block
  The owner then changes the state from shared to unshared (or exclusive)

66

Cache behavior in response to bus
Every bus transaction must check the cache-address tags
  could potentially interfere with processor cache accesses
A way to reduce interference is to duplicate the tags
  One set for processor cache accesses, one set for bus accesses
Another way to reduce interference is to use the L2 tags
  Since L2 is less heavily used than L1
  Every entry in the L1 cache must be present in the L2 cache, called the inclusion property
  If the snoop gets a hit in the L2 cache, then it must arbitrate for the L1 cache to update the state and possibly retrieve the data, which usually requires a stall of the processor


67

Example Protocol

A snooping coherence protocol is usually implemented by incorporating a finite-state controller in each node
Logically, think of a separate controller associated with each cache block
  That is, snooping operations or cache requests for different blocks can proceed independently
In implementations, a single controller allows multiple operations to distinct blocks to proceed in interleaved fashion
  that is, one operation may be initiated before another is completed, even though only one cache access or one bus access is allowed at a time

68

Ordering

Writes establish a partial order
Doesn't constrain ordering of reads, though the shared medium (bus) will order read misses too
  any order among reads between writes is fine, as long as it is in program order

(Figure: reads and writes from P0, P1, P2 as ordered by the bus; the writes form a total order, with reads interleaved between them.)

69

Example Write-Back Snoopy Protocol

Invalidation protocol, write-back cache
  Snoops every address on the bus
  If it has a dirty copy of the requested block, provides that block in response to the read request and aborts the memory access
Each memory block is in one state:
  Clean in all caches and up-to-date in memory (Shared), OR
  Dirty in exactly one cache (Exclusive), OR
  Not in any caches
Each cache block is in one state (track these):
  Shared: block can be read, OR
  Exclusive: cache has the only copy, it is writeable and dirty, OR
  Invalid: block contains no data (as in a uniprocessor cache too)
Read misses: cause all caches to snoop the bus
Writes to clean blocks are treated as misses
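A compact C sketch of the processor-request side of this protocol (the bus-side helper functions are assumptions; a real controller also handles the snooped bus transactions):

  typedef enum { INVALID, SHARED, EXCLUSIVE } line_state;   /* the three cache-block states */
  typedef enum { PROC_READ, PROC_WRITE } proc_op;

  void bus_read_miss(long block);        /* assumed: fetch the block over the bus */
  void bus_intent_to_write(long block);  /* assumed: invalidate other copies      */

  line_state cpu_request(line_state s, proc_op op, long block) {
      switch (s) {
      case INVALID:
          bus_read_miss(block);
          if (op == PROC_READ) return SHARED;
          bus_intent_to_write(block);
          return EXCLUSIVE;
      case SHARED:
          if (op == PROC_READ) return SHARED;     /* read hit                    */
          bus_intent_to_write(block);             /* write to clean block = miss */
          return EXCLUSIVE;
      case EXCLUSIVE:
          return EXCLUSIVE;                       /* read or write hit           */
      }
      return s;
  }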


Snoopy Coherence Protocols
Locating an item when a read miss occurs
  In a write-back cache, the updated value must be sent to the requesting processor
Cache lines are marked as shared or exclusive/modified
  Only writes to shared lines need an invalidate broadcast
  After this, the line is marked as exclusive


Snoopy Coherence Protocols

(Two textbook figures, not captured in the transcript.)


73

Cache State Transition Diagram: The MSI protocol

M: Modified, S: Shared, I: Invalid
Each cache line has an address tag and state bits

Cache state transitions in processor P1:
  I → S on a read miss
  I → M on a write miss
  S → S on a read by any processor
  S → M on a P1 write (treated as a write miss)
  S → I on another processor's intent to write
  M → M on P1 reads or writes
  M → S when another processor reads (P1 writes back)
  M → I on another processor's intent to write

Permitted states for a pair of caches holding the same line (X = not allowed):
        M   S   I
   M    X   X
   S    X
   I

74


75

Two Processor Example
(Reading and writing the same cache line)

(Figure: the MSI transition diagram instantiated for P1 and for P2. For P1: I → S on read miss, I → M on write miss, S → M on P1 intent to write, M → S when P2 reads and P1 writes back, M/S → I on P2 intent to write, M → M on P1 reads or writes; P2's diagram is symmetric. A sample access sequence is traced: P1 reads, P1 writes, P2 reads, P2 writes, P1 writes, P2 writes, P1 reads, P1 writes.)

76

Observation

If a line is in the M state then no other cache can have a copy of the line!
  Memory stays coherent; multiple differing copies cannot exist

(The MSI state transition diagram from slide 73 is repeated here.)


Snoopy Coherence Protocols
Complications for the basic MSI protocol: operations are not atomic
  E.g. detect miss, acquire bus, receive a response
  Creates possibility of deadlock and races
  One solution: the processor that sends an invalidate can hold the bus until the other processors receive the invalidate

Extensions:
  Add an exclusive state to indicate a clean block in only one cache (MESI protocol)
    Prevents needing to send an invalidate on a write
  Owned state


78

MESI: An Enhanced MSI protocol
increased performance for private data

M: Modified Exclusive, E: Exclusive unmodified, S: Shared, I: Invalid
Each cache line has an address tag and state bits

Cache state transitions in processor P1:
  I → S on a read miss that is shared
  I → E on a read miss that is not shared
  I → M on a write miss
  E → E on a P1 read
  E → M on a P1 write (no bus transaction needed)
  E → S when another processor reads
  S → S on a read by any processor
  S → M on P1 intent to write
  E, S → I on another processor's intent to write
  M → M on P1 reads or writes
  M → S when another processor reads (P1 writes back)
  M → I on another processor's intent to write


79

Optimized Snoop with Level-2 Caches

(Figure: four CPUs, each with an L1 and an L2 cache; the snoopers attach to the shared bus below the L2 caches.)

• Processors often have two-level caches
  • small L1, large L2 (usually both on chip now)
• Inclusion property: entries in L1 must be in L2
  invalidation in L2 ⇒ invalidation in L1
• Snooping on L2 does not affect CPU-L1 bandwidth

What problem could occur?

80

Intervention

(Figure: CPU-1's cache holds A = 200 (dirty); CPU-2's cache does not hold A; memory holds the stale value A = 100.)

When a read miss for A occurs in cache-2, a read request for A is placed on the bus
• Cache-1 needs to supply the data and change its state to shared
• The memory may respond to the request also!
  Does memory know it has stale data?
Cache-1 needs to intervene through the memory controller to supply the correct data to cache-2

81

False Sharing

state | blk addr | data0 | data1 | ... | dataN

A cache block contains more than one word
Cache coherence is done at the block level and not the word level
Suppose M1 writes word_i and M2 writes word_k and both words have the same block address.
What can happen?
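A small C illustration of the hazard (the 64-byte block size is an assumption): in the first struct, word_i and word_k fall in the same block and ping-pong between M1's and M2's caches; padding each word to its own block avoids the false sharing:

  #define BLOCK_SIZE 64

  struct shares_a_block {        /* word_i written by M1, word_k written by M2 */
      long word_i;
      long word_k;               /* same block address => false sharing        */
  };

  struct padded {                /* each word gets its own cache block         */
      long word_i;
      char pad[BLOCK_SIZE - sizeof(long)];
      long word_k;
  };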

82

Synchronization and Caches: Performance Issues

Processors 1, 2, 3 each execute:
     R ← 1
  L: swap (mutex), R;
     if <R> then goto L;
     <critical section>
     M[mutex] ← 0;

(Figure: the three processors' caches and memory, with mutex = 1, on the CPU-Memory bus.)

Cache-coherence protocols will cause mutex to ping-pong between P1's and P2's caches.

Ping-ponging can be reduced by first reading the mutex location (non-atomically) and executing a swap only if it is found to be zero.
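A C11 sketch of that optimization (test-and-test-and-set): spin on an ordinary load of the mutex, which hits in the local cache, and attempt the atomic swap only when the lock looks free:

  #include <stdatomic.h>

  static atomic_int mutex = 0;              /* 0 = free, 1 = held */

  void acquire(void) {
      for (;;) {
          while (atomic_load(&mutex) != 0)
              ;                              /* spin on the cached copy, no swaps */
          /* swap (mutex), R with R = 1; enter only if it was still free */
          if (atomic_exchange(&mutex, 1) == 0)
              return;
      }
  }

  void release(void) {
      atomic_store(&mutex, 0);               /* M[mutex] <- 0 */
  }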

83

Performance Related to Bus Occupancy
In general, a read-modify-write instruction requires two memory (bus) operations without intervening memory operations by other processors
In a multiprocessor setting, the bus needs to be locked for the entire duration of the atomic read and write operation
  expensive for simple buses
  very expensive for split-transaction buses
⇒ modern ISAs use load-reserve / store-conditional

84

Load-reserve & Store-conditional
Special register(s) hold a reservation flag and address, and the outcome of store-conditional

Load-reserve R, (a):
  <flag, adr> ← <1, a>;
  R ← M[a];

Store-conditional (a), R:
  if <flag, adr> == <1, a>
    then cancel other procs' reservation on a;
         M[a] ← <R>;
         status ← succeed;
    else status ← fail;

If the snooper sees a store transaction to the address in the reserve register, the reserve bit is set to 0
• Several processors may reserve 'a' simultaneously
• These instructions are like ordinary loads and stores with respect to the bus traffic

Can implement the reservation by using cache hit/miss, no additional hardware required (problems?)


85

Performance: Load-reserve & Store-conditional

The total number of memory (bus) transactions is not necessarily reduced, but splitting an atomic instruction into load-reserve & store-conditional:
• increases bus utilization (and reduces processor stall time), especially in split-transaction buses
• reduces the cache ping-pong effect, because processors trying to acquire a semaphore do not have to perform a store each time

86

Out-of-Order Loads/Stores & CC

Blocking caches: one request at a time + CC ⇒ SC
Non-blocking caches: multiple requests (to different addresses) concurrently + CC ⇒ relaxed memory models
CC ensures that all processors observe the same order of loads and stores to an address

(Figure: CPU with load/store buffers, a cache with snooper and states I/S/E, and memory; the CPU/memory interface carries S-req/E-req, S-rep/E-rep, Wb-req, Inv-req, Inv-rep, and pushout (Wb-rep) traffic.)

87

Performance of Symmetric Shared-Memory Multiprocessors

Cache performance is a combination of
1. Uniprocessor cache miss traffic
2. Traffic caused by communication
   Results in invalidations and subsequent cache misses

4th C: coherence miss
  Joins Compulsory, Capacity, Conflict

88

Coherency Misses
1. True sharing misses arise from the communication of data through the cache coherence mechanism
   • Invalidates due to 1st write to a shared block
   • Reads by another CPU of a modified block in a different cache
   • The miss would still occur if the block size were 1 word
2. False sharing misses arise when a block is invalidated because some word in the block, other than the one being read, is written into
   • The invalidation does not cause a new value to be communicated, but only causes an extra cache miss
   • The block is shared, but no word in the block is actually shared
   ⇒ the miss would not occur if the block size were 1 word

89

Example: True v. False Sharing v. Hit?

• Assume x1 and x2 are in the same cache block. P1 and P2 have both read x1 and x2 before.

Time  P1          P2          True, False, Hit? Why?
1     Write x1                True miss; invalidate x1 in P2
2                 Read x2     False miss; x1 irrelevant to P2
3     Write x1                False miss; x1 irrelevant to P2
4                 Write x2    False miss; x1 irrelevant to P2
5     Read x2                 True miss; invalidate x2 in P1

90

MP Performance, 4-Processor Commercial Workload: OLTP, Decision Support (Database), Search Engine

(Bar chart over L3 cache sizes of 1 MB, 2 MB, 4 MB, 8 MB, with each bar broken into Instruction, Capacity/Conflict, Cold, False Sharing, and True Sharing components; y-axis from 0 to 3.25.)

• True sharing and false sharing are unchanged going from 1 MB to 8 MB (L3 cache)
• Uniprocessor cache misses improve with cache size increase (Instruction, Capacity/Conflict, Compulsory)


91

MP Performance, 2 MB Cache, Commercial Workload: OLTP, Decision Support (Database), Search Engine

(Bar chart over processor counts of 1, 2, 4, 6, 8, with each bar broken into Instruction, Conflict/Capacity, Cold, False Sharing, and True Sharing components; y-axis from 0 to 3.)

• True sharing and false sharing increase going from 1 to 8 CPUs


96

A Cache Coherent System Must:
Provide a set of states, a state transition diagram, and actions
Manage the coherence protocol
  (0) Determine when to invoke the coherence protocol
  (a) Find info about the state of the block in other caches to determine the action
      whether we need to communicate with other cached copies
  (b) Locate the other copies
  (c) Communicate with those copies (invalidate/update)
(0) is done the same way on all systems
  the state of the line is maintained in the cache
  the protocol is invoked if an "access fault" occurs on the line
Different approaches are distinguished by (a) to (c)


97

Bus-based Coherence
All of (a), (b), (c) done through broadcast on the bus
  the faulting processor sends out a "search"
  others respond to the search probe and take the necessary action
Could do it in a scalable network too
  broadcast to all processors, and let them respond
Conceptually simple, but broadcast doesn't scale with p
  on a bus, bus bandwidth doesn't scale
  on a scalable network, every fault leads to at least p network transactions
Scalable coherence:
  can have the same cache states and state transition diagram
  different mechanisms to manage the protocol

98

Scalable Approach: Directories

Every memory block has associated directory information
  keeps track of copies of cached blocks and their states
  on a miss, find the directory entry, look it up, and communicate only with the nodes that have copies, if necessary
  in scalable networks, communication with the directory and the copies is through network transactions

Many alternatives for organizing directory information


Directory Protocols
The directory keeps track of every block
  Which caches have each block
  Dirty status of each block
Implement in a shared L3 cache
  Keep a bit vector of size = # cores for each block in L3
  Not scalable beyond a shared L3
Or implement in a distributed fashion


100

Basic Operation of Directory

• k processors.
• With each cache block in memory: k presence bits, 1 dirty bit
• With each cache block in a cache: 1 valid bit, and 1 dirty (owner) bit

(Figure: processors with caches connected over an interconnection network to memory and its directory, which holds presence bits and a dirty bit per block.)

• Read from main memory by processor i:
  • If dirty-bit OFF then { read from main memory; turn p[i] ON; }
  • If dirty-bit ON then { recall the line from the dirty proc (its cache state goes to shared); update memory; turn dirty-bit OFF; turn p[i] ON; supply the recalled data to i; }
• Write to main memory by processor i:
  • If dirty-bit OFF then { send invalidations to all caches that have the block; turn dirty-bit ON; supply data to i; turn p[i] ON; ... }
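A C sketch of the per-block directory state and the read path described above (the struct layout and helper functions are assumptions):

  #include <stdint.h>
  #include <stdbool.h>

  #define K 32                              /* k processors (assumed) */

  struct dir_entry {
      uint32_t presence;                    /* bit p set => cache p has a copy */
      bool     dirty;                       /* set => exactly one dirty owner  */
  };

  void recall_from_owner(struct dir_entry *e, long block);  /* assumed helper */
  void supply_data(long block, int i);                      /* assumed helper */

  void directory_read(struct dir_entry *e, long block, int i) {
      if (e->dirty) {
          /* recall the line from the dirty proc (its state -> shared),
             update memory, turn dirty-bit OFF */
          recall_from_owner(e, block);
          e->dirty = false;
      }
      e->presence |= 1u << i;               /* turn p[i] ON */
      supply_data(block, i);                /* read from (now up-to-date) memory */
  }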

101

Directory Protocol
Similar to Snoopy Protocol: Three states
  Shared: ≥ 1 processors have data, memory up-to-date
  Uncached: no processor has it; not valid in any cache
  Exclusive: 1 processor (owner) has data; memory out-of-date
In addition to cache state, must track which processors have data when in the shared state (usually a bit vector, 1 if the processor has a copy)
Keep it simple(r):
  Writes to non-exclusive data ⇒ write miss
  Processor blocks until the access completes
  Assume messages are received and acted upon in the order sent

102

Directory Protocol
No bus and don't want to broadcast:
  interconnect no longer a single arbitration point
  all messages have explicit responses
Terms: typically 3 agents involved
  Local node: (processor + cache) where a request originates
  Home node: (directory controller) where the memory location of an address resides
  Remote node: (processor + cache) that has a copy of a cache block, whether exclusive or shared
Example messages on next slide: P = processor number, A = address


103

Directory Protocol Messages (Fig 4.22)

Message type      Source            Destination       Msg Content
Read miss         Local cache       Home directory    P, A
  Processor P reads data at address A; make P a read sharer and request data
Write miss        Local cache       Home directory    P, A
  Processor P has a write miss at address A; make P the exclusive owner and request data
Invalidate        Home directory    Remote caches     A
  Invalidate a shared copy at address A
Fetch             Home directory    Remote cache      A
  Fetch the block at address A and send it to its home directory; change the state of A in the remote cache to shared
Fetch/Invalidate  Home directory    Remote cache      A
  Fetch the block at address A and send it to its home directory; invalidate the block in the cache
Data value reply  Home directory    Local cache       Data
  Return a data value from the home memory (read miss response)
Data write back   Remote cache      Home directory    A, Data
  Write back a data value for address A (invalidate response)
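The same message set written out as C types (field and enumerator names are assumptions; the protocol itself is the slide's):

  typedef enum {
      READ_MISS,          /* local cache    -> home directory: P, A      */
      WRITE_MISS,         /* local cache    -> home directory: P, A      */
      INVALIDATE,         /* home directory -> remote caches:  A         */
      FETCH,              /* home directory -> remote cache:   A         */
      FETCH_INVALIDATE,   /* home directory -> remote cache:   A         */
      DATA_VALUE_REPLY,   /* home directory -> local cache:    data      */
      DATA_WRITE_BACK     /* remote cache   -> home directory: A, data   */
  } msg_type;

  typedef struct {
      msg_type type;
      int      proc;      /* P: requesting processor, when applicable    */
      long     addr;      /* A: block address, when applicable           */
      long     data;      /* payload for replies and write-backs         */
  } dir_msg;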

104

State Transition Diagram for One Cache Block in a Directory-Based System

States identical to the snoopy case; transactions very similar
Transitions caused by read misses, write misses, invalidates, and data fetch requests
Generates read miss & write miss messages to the home directory
Write misses that were broadcast on the bus for snooping ⇒ explicit invalidate & data fetch requests
Note: on a write, a cache block is bigger than the data written, so we need to read the full cache block

105

CPU-Cache State Machine
State machine for CPU requests for each memory block; Invalid state if the block is only in memory

From Invalid:
  CPU Read: send Read Miss message → Shared
  CPU Write: send Write Miss message to home directory → Exclusive
From Shared (read only):
  CPU read hit: stay Shared
  CPU read miss: send Read Miss → Shared
  CPU Write: send Write Miss message to home directory → Exclusive
  Invalidate (from home directory): → Invalid
From Exclusive (read/write):
  CPU read hit / CPU write hit: stay Exclusive
  CPU read miss: send Data Write Back message and read miss to home directory → Shared
  CPU write miss: send Data Write Back message and Write Miss to home directory → Exclusive
  Fetch (from home directory): send Data Write Back message to home directory → Shared
  Fetch/Invalidate: send Data Write Back message to home directory → Invalid

106

State Transition Diagram for Directory

Same states & structure as the transition diagram for an individual cache2 actions: update of directory state & send messages to satisfy requests Tracks all copies of memory blockAlso indicates an action that updates the sharing set, Sharers, as well as sending a message

107

Directory State Machine
State machine for directory requests for each memory block; Uncached state if the block is in memory

From Uncached:
  Read miss: Sharers = {P}; send Data Value Reply → Shared
  Write miss: Sharers = {P}; send Data Value Reply msg → Exclusive
From Shared (read only):
  Read miss: Sharers += {P}; send Data Value Reply → Shared
  Write miss: send Invalidate to Sharers; then Sharers = {P}; send Data Value Reply msg → Exclusive
From Exclusive (read/write):
  Read miss: Sharers += {P}; send Fetch; send Data Value Reply msg to remote cache (write back block) → Shared
  Write miss: Sharers = {P}; send Fetch/Invalidate; send Data Value Reply msg to remote cache → Exclusive
  Data Write Back: Sharers = {}; (write back block) → Uncached

108

Example Directory Protocol
A message sent to the directory causes two actions:
  Update the directory
  More messages to satisfy the request
Block is in Uncached state: the copy in memory is the current value; the only possible requests for that block are:
  Read miss: requesting processor is sent the data from memory & the requestor is made the only sharing node; the state of the block is made Shared.
  Write miss: requesting processor is sent the value & becomes the sharing node. The block is made Exclusive to indicate that the only valid copy is cached. Sharers indicates the identity of the owner.
Block is Shared, the memory value is up-to-date:
  Read miss: requesting processor is sent back the data from memory & the requesting processor is added to the sharing set.
  Write miss: requesting processor is sent the value. All processors in the set Sharers are sent invalidate messages, & Sharers is set to the identity of the requesting processor. The state of the block is made Exclusive.


109

Example Directory Protocol
Block is Exclusive: the current value of the block is held in the cache of the processor identified by the set Sharers (the owner); three possible directory requests:
  Read miss: the owner processor is sent a data fetch message, causing the state of the block in the owner's cache to transition to Shared and causing the owner to send the data to the directory, where it is written to memory & sent back to the requesting processor. The identity of the requesting processor is added to the set Sharers, which still contains the identity of the processor that was the owner (since it still has a readable copy). State is Shared.
  Data write-back: the owner processor is replacing the block and hence must write it back, making the memory copy up-to-date (the home directory essentially becomes the owner); the block is now Uncached, and the Sharers set is empty.
  Write miss: the block has a new owner. A message is sent to the old owner causing the cache to send the value of the block to the directory, from which it is sent to the requesting processor, which becomes the new owner. Sharers is set to the identity of the new owner, and the state of the block is made Exclusive.

110

Example

Step sequence (A1 and A2 map to the same cache block):
  P1: Write 10 to A1
  P1: Read A1
  P2: Read A1
  P2: Write 20 to A1
  P2: Write 40 to A2

(The following slides fill in, step by step, a table with columns: P1 state/addr/value, P2 state/addr/value, bus action/proc/addr/value, directory addr/state/{procs}, and memory value.)

111

Example

P1: Write 10 to A1 — Bus: WrMs P1 A1; Directory: A1 Excl. {P1}; P1 cache: Excl. A1 10; Bus: DaRp P1 A1 0.
(A1 and A2 map to the same cache block.)

112

Example

P1: Read A1 — hit in P1's cache (Excl. A1 10).
(A1 and A2 map to the same cache block.)

113

Example

P2: Read A1 — Bus: RdMs P2 A1; Fetch from P1 (Ftch P1 A1 10), P1 writes the block back and its copy becomes Shar. A1 10, memory becomes 10; Bus: DaRp P2 A1 10; P2 cache: Shar. A1 10; Directory: A1 Shar. {P1,P2}.
(A1 and A2 map to the same cache block.)

114

Example

P2: Write 20 to A1 — Bus: WrMs P2 A1; Bus: Inval. P1 A1 (P1's copy is invalidated); P2 cache: Excl. A1 20; Directory: A1 Excl. {P2}; memory still holds 10.
(A1 and A2 map to the same cache block.)


115

Example

(Complete trace; A1 and A2 map to the same cache block, but are different memory block addresses, A1 ≠ A2.)

P1: Write 10 to A1
  Bus: WrMs P1 A1; Directory: A1 Excl. {P1}; P1 cache: Excl. A1 10; Bus: DaRp P1 A1 0
P1: Read A1
  P1 cache: Excl. A1 10 (hit)
P2: Read A1
  P2 cache: Shar. A1; Bus: RdMs P2 A1
  P1 cache: Shar. A1 10; Bus: Ftch P1 A1 10; memory: A1 = 10
  P2 cache: Shar. A1 10; Bus: DaRp P2 A1 10; Directory: A1 Shar. {P1,P2}
P2: Write 20 to A1
  P2 cache: Excl. A1 20; Bus: WrMs P2 A1
  P1 cache: Inv.; Bus: Inval. P1 A1; Directory: A1 Excl. {P2}; memory still 10
P2: Write 40 to A2
  Bus: WrMs P2 A2; Directory: A2 Excl. {P2}
  Bus: WrBk P2 A1 20; Directory: A1 Unca. {}; memory: A1 = 20
  P2 cache: Excl. A2 40; Bus: DaRp P2 A2 0; Directory: A2 Excl. {P2}

116

Implementing a Directory

We assume operations are atomic, but they are not; reality is much harder; must avoid deadlock when we run out of buffers in the network (see Appendix E)
Optimization: on a read miss or write miss in Exclusive, send the data directly to the requestor from the owner vs. first to memory and then from memory to the requestor

117

Basic Directory Transactions

(a) Read miss to a block in dirty state:
  1. Requestor sends a read request to the directory node for the block
  2. Directory replies with the owner's identity
  3. Requestor sends a read request to the owner (the node with the dirty copy)
  4a. Owner sends a data reply to the requestor
  4b. Owner sends a revision message to the directory

(b) Write miss to a block with two sharers:
  1. Requestor sends a RdEx request to the directory node
  2. Directory replies with the sharers' identity
  3a/3b. Requestor sends invalidation requests to the sharers
  4a/4b. Sharers send invalidation acks

118

Example Directory Protocol (1st Read)

(Figure: P1 executes ld vA -> rd pA; a Read pA request (R/req) goes to the directory/memory controller, which replies with the data (R/reply); P1's cache line and the directory entry for pA end up in the Shared state, with P1 recorded as a sharer.)

119

Example Directory Protocol (Read Share)

(Figure: the second processor also executes ld vA -> rd pA; the directory replies with the data and records both sharers (P1: pA, P2: pA); both cache lines for pA are in the Shared state.)

120

Example Directory Protocol (Wr to shared)

(Figure: one of the sharers executes st vA -> wr pA; a Read_to_update pA request goes to the directory, which sends Invalidate pA to the other sharer, collects the Inv ACK, and replies with the data (reply xD(pA)); the writer's line becomes Exclusive/dirty, the other copy becomes Invalid, and the directory records the writer as the exclusive owner of pA.)


121

Example Directory Protocol (Wr to Ex)

(Figure: the other processor executes st vA -> wr pA while the block is held exclusive/dirty elsewhere; the directory sends Inv pA / Read_toUpdate pA to the current owner, which writes the block back (Write_back pA) and invalidates its copy; the directory then replies with the data (Reply xD(pA)) and the writer's line becomes Exclusive/dirty.)

122

A Popular Middle Ground

Two-level "hierarchy"
Individual nodes are multiprocessors, connected non-hierarchically
  e.g. a mesh of SMPs
Coherence across nodes is directory-based
  directory keeps track of nodes, not individual processors
Coherence within nodes is snooping or directory
  orthogonal, but needs a good interface of functionality
SMP on a chip ⇒ directory + snoop?