Distributed Shared Memory


  • Slide 1/42

    Distributed Shared Memory

  • Slide 2/42

    Directory-Based Cache Coherence

    Why is snooping a bad idea? Broadcasting is expensive.

    Directory: maintain the cache state explicitly

    Keep a list of the caches that have a copy: many read-only copies are allowed, but only one writable copy

    [Figure: two nodes, each with a processor (P), cache ($), and communication assist (CA), connected by a scalable interconnection network to memory with its directory]

  • Slide 3/42

    Directory Protocol

    [Figure: block X is allocated in the directory memory of its home node; caches at other nodes hold copies of X across the interconnection network]

  • Slide 4/42

    Terminology

    Home node: the node in whose main memory the block is allocated

    Dirty node: the node that has a copy of the block in its cache in modified (dirty) state

    Owner node: the node that currently holds the valid copy of the block (the home node or the dirty node)

    Exclusive node: the node that has a copy of the block in exclusive state

    Local node, or requesting node: the node containing the processor that issues a request for the block

    Local block: a block whose home is local to the issuing processor

  • Slide 5/42

    Basic Operations

    Read miss to a block in modified state

    [Figure: requestor, home, and owner nodes, each with processor, cache, communication assist, and memory/directory]

    1. Read request to home

    2. Response with owner identifier

    3. Read request to owner

    4a. Data reply to requestor; 4b. Revision message to home
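
    To make this four-step sequence concrete, here is a minimal C sketch, not DASH's or FLASH's actual code: dir_entry_t, handle_read_miss, and the fixed node count are invented for illustration. It walks a read miss to a dirty block through requestor, home, and owner.

      #include <stdio.h>
      #include <stdbool.h>

      #define NODES 4

      typedef enum { UNCACHED, SHARED, MODIFIED } dir_state_t;

      typedef struct {            /* one directory entry at the home node      */
          dir_state_t state;
          int         owner;      /* valid when state == MODIFIED              */
          bool        sharers[NODES];
      } dir_entry_t;

      /* 3. + 4a./4b.: the owner supplies the data, downgrades to SHARED,
       * and sends a "revision" (sharing writeback) to the home.               */
      static int owner_read(dir_entry_t *dir, int owner, int requestor) {
          printf("3. read request: node %d -> owner %d\n", requestor, owner);
          printf("4a. data reply:  owner %d -> node %d\n", owner, requestor);
          printf("4b. revision:    owner %d -> home\n", owner);
          dir->state = SHARED;                 /* memory is clean again        */
          dir->sharers[owner] = true;
          dir->sharers[requestor] = true;
          return 42;                           /* the (dummy) block data       */
      }

      /* 1. + 2.: the home consults the directory; if the block is dirty it
       * answers with the owner's identity (strict request-response).          */
      static int handle_read_miss(dir_entry_t *dir, int requestor) {
          printf("1. read request: node %d -> home\n", requestor);
          if (dir->state == MODIFIED) {
              printf("2. response: home -> node %d (owner is %d)\n",
                     requestor, dir->owner);
              return owner_read(dir, dir->owner, requestor);
          }
          dir->state = SHARED;                 /* home had a clean copy        */
          dir->sharers[requestor] = true;
          return 42;
      }

      int main(void) {
          dir_entry_t x = { .state = MODIFIED, .owner = 2 };
          handle_read_miss(&x, 0);             /* node 0 misses on dirty block */
          return 0;
      }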

  • Slide 6/42

    Basic Operations (Contd)

    Write miss to a block with two sharers

    [Figure: requestor, home, and two sharer nodes, each with processor, cache, communication assist, and memory/directory]

    1. RdEx (read-exclusive) request to home

    2. Response with sharers' identifiers

    3. Invalidation requests to the sharers

    4. Invalidation acknowledgements to the requestor
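
    The write-miss sequence can be sketched the same way. The fragment below uses illustrative names again, and acknowledgements are modeled as a simple counter rather than real messages: the home returns the sharer list, and the requestor collects one ack per invalidation before the write is allowed to complete.

      #include <stdio.h>
      #include <stdbool.h>

      #define NODES 4

      typedef struct {
          bool sharers[NODES];   /* presence bit per node (full-map style)   */
          int  owner;            /* new exclusive owner after the write miss */
      } dir_entry_t;

      /* 3./4.: the requestor sends invalidations to every sharer returned by
       * the home and waits until the acknowledgement count matches.          */
      static void invalidate_sharers(dir_entry_t *dir, int requestor) {
          int expected = 0, acks = 0;
          for (int n = 0; n < NODES; n++) {
              if (dir->sharers[n] && n != requestor) {
                  printf("3. invalidation: node %d -> sharer %d\n", requestor, n);
                  expected++;
              }
          }
          for (int n = 0; n < NODES; n++) {
              if (dir->sharers[n] && n != requestor) {
                  printf("4. ack: sharer %d -> node %d\n", n, requestor);
                  dir->sharers[n] = false;
                  acks++;
              }
          }
          if (acks == expected)               /* write completes only now     */
              printf("write miss complete: node %d owns the block\n", requestor);
      }

      /* 1./2.: the home hands back the sharer list in response to RdEx. */
      static void handle_write_miss(dir_entry_t *dir, int requestor) {
          printf("1. RdEx request: node %d -> home\n", requestor);
          printf("2. response: home -> node %d (sharer list)\n", requestor);
          invalidate_sharers(dir, requestor);
          dir->owner = requestor;
          dir->sharers[requestor] = true;
      }

      int main(void) {
          dir_entry_t x = { .sharers = { false, true, false, true } };
          handle_write_miss(&x, 0);           /* nodes 1 and 3 get invalidated */
          return 0;
      }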

  • Slide 7/42

    Alternatives for Organizing Directories

    Directory storage schemes: how to find the source of directory information, and how to locate the copies?

    Flat: directory information is in a fixed place, the home

      Memory-based: directory information is co-located with the memory module that is the home for the block (Stanford DASH/FLASH, SGI Origin, etc.)

      Cache-based: the caches holding a copy of the memory block form a linked list (IEEE SCI, Sequent NUMA-Q)

    Centralized

    Hierarchical: a hierarchy of caches that guarantees the inclusion property

  • Slide 8/42

    Flat Directory Schemes

    Full-Map Directory

    Limited Directory

    Chained Directory

  • Slide 9/42

    Memory-Based Directory Schemes

    Full bit vector (full-map) directory

    Most straightforward scheme

    Low latency: invalidations can be sent in parallel

    Main disadvantage: storage overhead grows as P*B (one presence bit per processor, P, for each of the B memory blocks)

    Ways to reduce the overhead:

    Increase the cache block size; but access time and network traffic increase due to false sharing

    Use a hierarchical protocol; in Stanford DASH each node is itself a bus-based 4-processor multiprocessor, so the bit vector needs one bit per node rather than per processor
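
    A minimal sketch of what a full-map entry looks like and how its overhead is computed; the field names and the 64-processor size are illustrative, not taken from any real machine.

      #include <stdint.h>
      #include <stdio.h>

      #define P 64                     /* processors (or nodes) in the system  */

      typedef struct {
          uint64_t presence;           /* bit p set => processor p has a copy  */
          uint8_t  dirty;              /* set when exactly one modified copy   */
      } fullmap_entry_t;               /* one entry per memory block           */

      int main(void) {
          const int block_bytes = 64;  /* cache block size                     */
          /* Overhead per block: P presence bits + 1 dirty bit, versus
           * 8 * block_bytes data bits => roughly P / (8 * block size).        */
          double overhead = (double)(P + 1) / (8.0 * block_bytes);
          printf("directory overhead per block: %.1f%%\n", overhead * 100.0);

          fullmap_entry_t e = { 0 };
          e.presence |= 1ULL << 5;     /* processor 5 obtains a shared copy    */
          e.presence |= 1ULL << 9;     /* processor 9 obtains a shared copy    */
          printf("sharers mask: 0x%llx\n", (unsigned long long)e.presence);
          return 0;
      }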

  • Slide 10/42

    Storage Reducing Optimization: Directory Width

    Directory width: the number of bits per directory entry

    Motivation: most of the time only a few caches have a copy of a block

    Limited (pointer) directory: keep k pointers instead of a full bit vector; storage overhead is k * log P, where k is the number of copies tracked

    Overflow methods are needed when more than k caches share the block

    Dir_i X notation:

    i indicates the number of pointers (i < P)

    X indicates the invalidation method on overflow: broadcast (B) or non-broadcast (NB)
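
    A sketch of a Dir_i entry with i = 4 pointers and an overflow flag; the names are illustrative, and the overflow policy itself is left to the methods on the following slides.

      #include <stdbool.h>
      #include <stdint.h>
      #include <stdio.h>

      #define I_PTRS 4                     /* i: pointers kept per entry       */

      typedef struct {
          uint16_t ptr[I_PTRS];            /* node ids of sharers (log P bits) */
          uint8_t  count;                  /* how many pointers are in use     */
          bool     overflow;               /* set when a sharer could not fit  */
      } limited_entry_t;

      /* Record a new sharer; on overflow the chosen policy (Dir_i B, Dir_i NB,
       * coarse vector, software, ...) decides what happens next.              */
      static void add_sharer(limited_entry_t *e, uint16_t node) {
          for (uint8_t k = 0; k < e->count; k++)
              if (e->ptr[k] == node) return;      /* already recorded          */
          if (e->count < I_PTRS)
              e->ptr[e->count++] = node;
          else
              e->overflow = true;                 /* hand off to overflow method */
      }

      int main(void) {
          limited_entry_t e = { 0 };
          for (uint16_t n = 1; n <= 5; n++)        /* the 5th sharer overflows  */
              add_sharer(&e, n);
          printf("sharers tracked: %u, overflow: %s\n",
                 (unsigned)e.count, e.overflow ? "yes" : "no");
          return 0;
      }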

  • Slide 11/42

    Overflow Methods for Limited Protocol

    Dir_i B (Broadcast)

    Set the broadcast bit in case of overflow

    Broadcast invalidation messages to all nodes on a write

    Simple, but increases write latency and wastes communication bandwidth

    Dir_i NB (Not Broadcast)

    Invalidate the copy of one existing sharer to free a pointer

    Bad for widely shared, read-mostly data

    Performance degrades for intensive sharing of read-only and read-mostly data because the miss ratio increases

  • Slide 12/42

    Overflow Methods for Limited Protocol (Cont'd)

    Dir_i CV_r (Coarse Vector)

    On overflow the representation changes to a coarse bit vector

    When the number of sharers exceeds i, each bit stands for a region of r processors (coarse vector)

    Invalidations are sent to whole regions of caches

    Used in the SGI Origin

    Robust to different sharing patterns

    About 70% less memory message traffic than broadcast and at least 8% less than the other schemes

  • Slide 13/42

    Coarse Bit Vector Scheme

    [Figure: 16-processor example. With the overflow bit clear, the directory entry holds two 4-bit pointers naming individual sharers. After overflow, the overflow bit is set and the same bits are reinterpreted as a coarse vector in which each bit covers a region of processors, so an invalidation goes to every processor in a marked region.]
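
    The switch from exact pointers to a coarse vector can be sketched in a few lines of C; this is illustrative, not the SGI Origin implementation, and assumes 16 processors with region size r = 4.

      #include <stdint.h>
      #include <stdio.h>

      #define P 16                         /* processors                        */
      #define R 4                          /* region size r in Dir_i CV_r       */

      /* Reinterpret a list of exact sharer ids as a coarse vector:
       * bit g is set if any processor in region g holds a copy.                */
      static uint8_t to_coarse_vector(const int *sharers, int n) {
          uint8_t cv = 0;
          for (int k = 0; k < n; k++)
              cv |= (uint8_t)(1u << (sharers[k] / R));
          return cv;
      }

      /* On a write, invalidations go to every processor of every marked
       * region -- less precise, but the entry never overflows.                 */
      static void invalidate_regions(uint8_t cv) {
          for (int g = 0; g < P / R; g++)
              if (cv & (1u << g))
                  for (int p = g * R; p < (g + 1) * R; p++)
                      printf("invalidate processor P%d\n", p);
      }

      int main(void) {
          int sharers[] = { 1, 6, 7, 13 };          /* more sharers than pointers */
          uint8_t cv = to_coarse_vector(sharers, 4);
          printf("coarse vector: 0x%x\n", (unsigned)cv); /* regions 0, 1, 3 marked */
          invalidate_regions(cv);
          return 0;
      }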

  • Slide 14/42

    Overflow Methods for Limited Protocol (Cont'd)

    Dir_i SW (Software)

    On overflow, the current i pointers and a pointer to the new sharer are saved into a special portion of the local main memory by software

    MIT Alewife: LimitLESS; the cost of interrupts and software handling is high

    Dir_i DP (Dynamic Pointers)

    The directory entry contains a hardware pointer into the local memory

    Similar to the software mechanism but without the software overhead

    The difference is that list manipulation is done by a special-purpose protocol processor rather than by the general-purpose processor

    Stanford FLASH

    Directory overhead: 7-9% of main memory

  • Slide 15/42

    Dynamic Pointers

    Circular sharing list

    Directory: a contiguous region in main memory

  • Slide 16/42

    Stanford DASH Architecture

    DASH => Directory Architecture for SHared memory

    Nodes connected by a scalable interconnect

    Partitioned (physically distributed) shared memory

    Processing nodes are themselves multiprocessors

    Distributed directory-based cache coherence

    [Figure: bus-based nodes, each with several processors and caches, connected by the interconnection network; each node's memory has a directory with presence bits and a dirty bit per block]

  • Slide 17/42

    Conclusions

    Full-map is most appropriate up to a modest number of processors

    Dir_i CV_r and Dir_i DP are the most likely candidates beyond that

    Coarse vector: loses accuracy on overflow

    Dynamic pointers: processing cost due to hardware list manipulation

  • Slide 18/42

    Storage Reducing Optimization: Directory Height

    Directory height: the total number of directory entries

    Motivation: the total amount of cache memory is much less than the total main memory

    Sparse directory: organize the directory as a cache

    This cache needs no backing store

    When an entry is replaced, send invalidations to the nodes holding copies

    Spatial locality is not an issue: one entry per block

    The reference stream is heavily filtered, consisting only of references that were not satisfied in the processor caches

    With a directory size factor of 8, associativity of 4, and LRU replacement, performance is very close to that of a full-map directory
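
    A sparse directory is essentially a set-associative cache of directory entries with no backing store, so a replacement must invalidate the victim's sharers. A minimal sketch under invented names, with 256 sets of 4 ways:

      #include <stdbool.h>
      #include <stdint.h>
      #include <stdio.h>

      #define SETS 256
      #define WAYS 4                      /* associativity of 4                 */

      typedef struct {
          bool     valid;
          uint64_t block;                 /* block address (the tag)            */
          uint64_t sharers;               /* full-map bit vector for this block */
          uint8_t  lru;                   /* higher = older                     */
      } sparse_entry_t;

      static sparse_entry_t dir[SETS][WAYS];

      /* Age every way in the set and mark the touched entry as most recent. */
      static void touch(sparse_entry_t *set, sparse_entry_t *hit) {
          for (int w = 0; w < WAYS; w++) set[w].lru++;
          hit->lru = 0;
      }

      /* Find or allocate the entry for a block; on a conflict the oldest way
       * is evicted and its sharers must be invalidated (no backing store).    */
      static sparse_entry_t *dir_lookup(uint64_t block) {
          sparse_entry_t *set = dir[block % SETS], *victim = &set[0];
          for (int w = 0; w < WAYS; w++) {
              if (set[w].valid && set[w].block == block) {
                  touch(set, &set[w]);
                  return &set[w];
              }
              if (set[w].lru > victim->lru) victim = &set[w];
          }
          if (victim->valid && victim->sharers)
              printf("evict block %llu: invalidate sharers 0x%llx\n",
                     (unsigned long long)victim->block,
                     (unsigned long long)victim->sharers);
          victim->valid = true;
          victim->block = block;
          victim->sharers = 0;
          touch(set, victim);
          return victim;
      }

      int main(void) {
          dir_lookup(10)->sharers = 0x3;        /* block 10 shared by nodes 0,1 */
          for (uint64_t b = 0; b < 5; b++)      /* conflicting blocks fill the set */
              dir_lookup(10 + b * SETS);        /* last lookup evicts block 10  */
          return 0;
      }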

  • Slide 19/42

    Protocol Optimization

    Two major goals, plus one

    Reduce the number of network transactions per memory operation: reduces the bandwidth demand

    Reduce the number of actions on the critical path: reduces the uncontended latency

    Reduce the endpoint assist occupancy per transaction: reduces the uncontended latency as well as endpoint contention

  • Slide 20/42

    Latency Reduction

    A read request to a block in exclusive (dirty) state, with L = local requestor, H = home, R = remote owner:

    (a) Strict request-response: 1. request L->H, 2. response H->L naming the owner, 3. intervention L->R, 4a. revise R->H, 4b. response R->L

    (b) Intervention forwarding: 1. request L->H, 2. intervention H->R, 3. response R->H, 4. response H->L

    (c) Reply forwarding: 1. request L->H, 2. intervention H->R, 3a. revise R->H, 3b. response R->L

    (a) uses five network transactions with four on the critical path; (b) uses four with four on the critical path; (c) uses four with only three on the critical path.

  • Slide 21/42

    Cache-Based Directory Schemes

    Directory is a doubly linked list of entries

    Read miss

    Insert the requestor at the head of the list

    Write miss

    Insert the requestor at the head of the list

    Invalidate the sharers by traversing the list: long latency!

    Write back

    Delete itself from the list

    [Figure: the home memory holds a head pointer for the block; the caches of node 0, node 1, and node 2 are linked into a doubly linked sharing list]
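
    The list operations above map onto ordinary doubly linked list code. The sketch below uses illustrative names and omits the real protocol's messages, pending states, and locking; it shows insertion at the head on a miss, the serial purge on a write, and self-removal on write back.

      #include <stdio.h>
      #include <stdlib.h>

      /* One cache's entry for a block, linked into the per-block sharing list. */
      typedef struct sharer {
          int            node;
          struct sharer *prev, *next;   /* doubly linked: toward head / tail    */
      } sharer_t;

      typedef struct { sharer_t *head; } home_entry_t;  /* head pointer at home */

      /* Read or write miss: the requestor inserts itself at the head.          */
      static sharer_t *insert_head(home_entry_t *home, int node) {
          sharer_t *s = calloc(1, sizeof *s);
          s->node = node;
          s->next = home->head;
          if (home->head) home->head->prev = s;
          home->head = s;
          return s;
      }

      /* Write back / replacement: the node deletes itself from the list.       */
      static void rollout(home_entry_t *home, sharer_t *s) {
          if (s->prev) s->prev->next = s->next; else home->head = s->next;
          if (s->next) s->next->prev = s->prev;
          free(s);
      }

      /* Write by the head: invalidate the rest of the list one hop at a time
       * (this serial traversal is the "long latency" noted on the slide).      */
      static void purge(home_entry_t *home) {
          sharer_t *s = home->head ? home->head->next : NULL;
          while (s) {
              sharer_t *next = s->next;
              printf("invalidate node %d\n", s->node);
              free(s);
              s = next;
          }
          if (home->head) home->head->next = NULL;
      }

      int main(void) {
          home_entry_t x = { 0 };
          insert_head(&x, 0);
          insert_head(&x, 1);
          sharer_t *writer = insert_head(&x, 2);   /* node 2 now heads the list */
          purge(&x);                               /* nodes 1 and 0 invalidated */
          rollout(&x, writer);                     /* node 2 writes back        */
          return 0;
      }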

  • Slide 22/42

    Tradeoffs with Cache-Based Schemes

    Advantages

    Small overhead: a single pointer per memory block

    Easier to provide fairness and to avoid livelock

    Sending invalidations is not centralized at the home, but distributed among the sharers

    Disadvantages

    Long latency and long occupancy of the communication assist on invalidations

    Modifying the sharing list requires careful coordination and mutual exclusion

  • Slide 23/42

    Latency Reduction

    Invalidating the sharing list, with home H and sharers S1, S2, S3:

    (a) Strict request-response: the home invalidates one sharer at a time and waits for each acknowledgement (1. inv H->S1, 2. ack; 3. inv H->S2, 4. ack; 5. inv H->S3, 6. ack)

    (b) Intervention forwarding: each sharer forwards the invalidation to the next and acknowledges the home (1. inv H->S1; 2a. inv S1->S2, 2b. ack S1->H; 3a. inv S2->S3, 3b. ack S2->H; 4b. ack S3->H)

    (c) Reply forwarding: the invalidation travels down the list and only the tail acknowledges the home (1. inv H->S1, 2. inv S1->S2, 3. inv S2->S3, 4. ack S3->H)

  • Slide 24/42

    Sequent NUMA-Q

    IEEE standard Scalable Coherent Interface (SCI) protocol

    Targeted toward commercial workloads: databases and transaction processing

    Commodity hardware

    Intel SMP "quads" as the processing nodes

    DataPump network interface from Vitesse Semiconductor

    Custom IQ-Link board: directory logic and a remote cache (32 MB, 4-way)

    [Figure: four quads, each a 4-processor Intel SMP with memory, PCI peripheral/I/O interface, and an IQ-Link, connected in a ring; coherence within a quad is by snooping]

  • Slide 25/42

    IQ-Link Implementation

    Local directory: directory state for locally allocated data

    Remote tags: tags for remotely allocated but locally cached data

    Orion bus controller (OBIC): manages the snooping and requesting logic on the quad bus

    DataPump: GaAs chip implementing the transport protocol of the SCI standard

    SCI link interface controller (SCLIC): manages the SCI coherence protocol; programmable

    [Figure: quad bus -> OBIC (with remote tag and local directory) -> SCLIC (directory controller) -> DataPump -> SCI ring (1 GB/s); remote data and tags sit alongside the SCLIC]

    Bus-side tags: SRAM

    Network-side tags: SDRAM

  • Slide 26/42

    Directory States

    HOME

    No remote cache (quad) in the system contains a copy of the block

    A processor cache in the home quad itself may still have a copy; this is not visible to the SCI protocol but is managed by the bus protocol within the quad

    FRESH

    One or more remote caches may have a read-only copy, and the copy in memory is valid

    GONE

    A remote cache contains a writable (exclusive or dirty) copy; no valid copy exists in the home node's memory

  • Slide 27/42

    Remote Cache States

    7 bits encode 29 stable states plus many pending (transient) states

    Each stable state has two parts

    First part: where the cache entry is located in the sharing list: ONLY, HEAD, TAIL, or MID

    Second part: the actual state: dirty, clean (like the exclusive state in MESI), fresh (the data may not be written until memory is informed), copy (unmodified and readable), and so on
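
    One way to picture the two-part encoding is a pair of small enums packed into one field; the bit layout below is invented, not SCI's actual 7-bit encoding.

      #include <stdint.h>
      #include <stdio.h>

      /* Where this cache sits in the per-block sharing list. */
      typedef enum { POS_ONLY, POS_HEAD, POS_TAIL, POS_MID } list_pos_t;

      /* What the cache is allowed to do with the data. */
      typedef enum { ST_DIRTY, ST_CLEAN, ST_FRESH, ST_COPY } line_state_t;

      /* Pack position and state into one small field, as SCI's stable states
       * combine the two parts (the bit layout here is made up).               */
      static inline uint8_t stable_state(list_pos_t p, line_state_t s) {
          return (uint8_t)((p << 4) | s);
      }

      int main(void) {
          uint8_t head_fresh = stable_state(POS_HEAD, ST_FRESH);
          uint8_t only_dirty = stable_state(POS_ONLY, ST_DIRTY);
          printf("HEAD_FRESH = 0x%02x, ONLY_DIRTY = 0x%02x\n",
                 head_fresh, only_dirty);
          return 0;
      }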

  • Slide 28/42

    SCI Standard

    Three primitive operations

    List construction: adding a new node at the head of the list

    Rollout: removing a node from the list

    Purge (invalidation): the head node invalidates all other nodes

    Levels of protocol

    Minimal protocol: does not permit read sharing; only one copy at a time

    Typical protocol: what NUMA-Q implements

    Full protocol: implements the full SCI standard

  • Slide 29/42

    Handling Read Requests

    If the directory state is HOME

    The home updates the block's state to FRESH and sends the data

    The requestor goes from PENDING to ONLY_FRESH

    If the directory state is FRESH

    Insert the requestor at the head of the list

    The previous head changes its state from HEAD_FRESH to MID_VALID, or from ONLY_FRESH to TAIL_VALID

    The requestor changes its state from PENDING to HEAD_FRESH

    If the directory state is GONE

    The home stays in the GONE state and sends a pointer to the previous head

    The previous head changes its state from HEAD_DIRTY to MID_VALID, or from ONLY_DIRTY to TAIL_VALID

    The requestor sets its state to HEAD_DIRTY
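
    The three cases above can be compressed into a small state machine at the home. The sketch below uses the slide's state names but reduces the list updates and messages to prints; handle_read and home_dir_t are invented for illustration.

      #include <stdio.h>

      typedef enum { HOME, FRESH, GONE } dir_state_t;

      typedef struct {
          dir_state_t state;
          int         head;        /* id of the current head sharer, -1 if none */
      } home_dir_t;

      /* Handle a read request from `node`; returns the node that supplies the
       * data (-1 means the home memory itself, otherwise the previous head).   */
      static int handle_read(home_dir_t *d, int node) {
          int supplier = -1;
          switch (d->state) {
          case HOME:                       /* no remote copies yet              */
              d->state = FRESH;
              printf("node %d: PENDING -> ONLY_FRESH\n", node);
              break;
          case FRESH:                      /* prepend to a fresh list           */
              printf("old head %d: HEAD_FRESH -> MID_VALID (or ONLY_FRESH -> TAIL_VALID)\n",
                     d->head);
              printf("node %d: PENDING -> HEAD_FRESH\n", node);
              break;                       /* memory copy is still valid        */
          case GONE:                       /* dirty copy lives at the old head  */
              printf("old head %d: HEAD_DIRTY -> MID_VALID (or ONLY_DIRTY -> TAIL_VALID)\n",
                     d->head);
              printf("node %d: PENDING -> HEAD_DIRTY\n", node);
              supplier = d->head;          /* requestor fetches from old head   */
              break;
          }
          d->head = node;                  /* requestor becomes the new head    */
          return supplier;
      }

      int main(void) {
          home_dir_t x = { HOME, -1 };
          handle_read(&x, 1);              /* first reader                      */
          handle_read(&x, 2);              /* second reader, list grows         */
          return 0;
      }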

  • Slide 30/42

    An Example of Read Miss

    [Figure: requestor, old head, and home memory. Initially the requestor is INVALID, the old head is ONLY_FRESH, and the home is FRESH. First round (request-response with the home): the requestor goes to PENDING and receives a pointer to the old head; the home stays FRESH. Second round (request-response with the old head): the requestor attaches itself at the head and becomes HEAD_FRESH while the old head becomes TAIL_VALID.]

  • Slide 31/42

    Handling Write Requests

    Only the head node is allowed to write a block and issue invalidations

    When the writer is in the middle of the list, it removes itself from the list (rollout) and adds itself again at the head (list construction)

    If the writer's cache block is in HEAD_DIRTY state

    Purge the sharing list in a strict request-response manner

    The writer stays in a pending state while the purging is in progress

    If it is in HEAD_FRESH state

    The writer goes into a pending state and sends a request to the home, which changes from FRESH to GONE and replies

    The writer then goes into a different pending state and purges the list

    Eventually the writer's state becomes ONLY_DIRTY

  • Slide 32/42

    Handling Writeback and Replacements

    Rollouts are needed for replacement, invalidation, and writes

    Middle-node rollout: the node first sets itself to a pending state to prevent a race; if two adjacent nodes roll out at the same time, the node closer to the tail has priority and is rolled out first; it then sets its state to invalid

    Head-node rollout: the head goes into a pending state and the downstream node changes its state (e.g., MID_VALID -> HEAD_DIRTY); it then updates the pointer in the home node; if it was the only node, the home state changes to HOME

    Writeback upon a miss: serve the writeback first, since buffering it is complicated and misses in the remote cache are infrequent (unlike the write buffer in a bus-based system)

  • Slide 33/42

    Hierarchical Coherence

  • Slide 34/42

    Snoop-Snoop System

    Simplest way to build large-scale cache-coherent MPs

    Coherence monitor per node, consisting of:

    Remote (access) cache

    Local state monitor: keeps state information on data that is locally allocated but remotely cached

    The remote cache should be larger than the sum of the processor caches and quite associative

    It should be lockup-free

    It issues an invalidation request on the local bus when a block is replaced

  • Slide 35/42

    Snoop-Snoop with Global Memory

    First-level caches

    Highest-performance SRAM caches

    B1 follows a standard snooping protocol

    Second-level cache

    Much larger than the L1 caches (set associative)

    Must maintain inclusion

    The L2 cache acts as a filter for the B1 bus and the L1 caches

    The L2 cache can be DRAM based since fewer references get to it

    [Figure: two clusters, each with processors and L1 caches on a B1 bus behind a coherence monitor; the coherence monitors and the global memory M sit on the B2 bus]

  • Slide 36/42

    Snoop-Snoop with Global Memory (Cont'd)

    Advantages

    Misses to main memory require just a single traversal to the root of the hierarchy

    Placement of shared data is not an issue

    Disadvantages

    Misses to local data structures (e.g., the stack) also have to traverse the hierarchy, resulting in higher traffic and latency

    Memory at the global bus must be highly interleaved; otherwise bandwidth to it will not scale

  • Slide 37/42

    Cluster-Based Hierarchies

    Key idea: main memory is distributed among the clusters

    Reduces global bus traffic (local data and suitably placed shared data)

    Reduces latency (less contention, and local accesses are faster)

    Example machine: Encore Gigamax

    The L2 cache can be replaced by a tag-only router/coherence switch

    [Figure: as before, two clusters of processors and caches on B1 buses behind coherence monitors on the B2 bus, but now each cluster also has its own memory M]

  • Slide 38/42

    Summary

    Advantages

    Conceptually simple to build (apply snooping recursively)

    Requests can be merged and combined in hardware

    Disadvantages

    Physical hierarchies do not provide enough bisection bandwidth (the root becomes a bottleneck, e.g., for 2-D and 3-D grid problems)

    Latencies are often larger than in direct networks

  • Slide 39/42

    Hierarchical Directory Scheme

    The internal nodes contain only directory information

    The L1 directory tracks which of its children (processing nodes) have a copy of the memory block

    The L2 directory tracks which of its children (L1 directories) have a copy of the memory block

    It also tracks which local memory blocks are cached outside the subtree

    Inclusion is maintained between the processor caches and the L1 directory

    Logical trees may be embedded in any physical hierarchy

    [Figure: processing nodes at the leaves, L1 directories above them, an L2 directory at the root]

  • Slide 40/42

    A Multirooted Hierarchical Directory

    [Figure: processing nodes p0 through p7 at the leaves; each node's memory has its own directory tree rooted at that node (the tree for p0's memory is highlighted), with internal nodes holding only directory information; the trees share the leaves, so the same processing node appears in several trees]

  • Slide 41/42

    Organization and Overhead

    Organization

    A separate directory tree for every block's home memory

    Storage overhead

    Each level holds about the same amount of directory memory

    C: cache size, b: branch factor, M: memory size, B: block size, P: number of processing nodes

    With inclusion, each of the log_b P directory levels tracks roughly the P*C/B blocks cached in the machine, so total directory storage grows as (P*C/B) * log_b P entries

    Performance overhead

    Reduces the number of network hops when sharing stays within a subtree

    But increases the number of end-to-end transactions, which increases latency

    The root becomes the bottleneck
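
    A rough worked example of the storage estimate above; the per-level entry count follows from inclusion, and the specific numbers are only illustrative.

      \[
        \text{entries per level} \approx \frac{P\,C}{B}, \qquad
        \text{levels} = \log_b P, \qquad
        \text{total entries} \approx \frac{P\,C}{B}\,\log_b P
      \]

      \[
        \text{e.g. } P = 64,\; C = 1\,\mathrm{MB},\; B = 64\,\mathrm{B},\; b = 4:\quad
        \frac{64 \cdot 2^{20}}{64}\,\log_4 64 = 2^{20} \cdot 3 \approx 3\,\mathrm{M\ entries}
      \]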

  • Slide 42/42

    Performance Implications of Hierarchical Coherence

    Advantages

    Combining of requests for a block

    Reduces traffic and contention

    Locality effect

    Reduces transit latency and contention

    Disadvantages

    Long uncontended latency

    High bandwidth requirements near the root of the hierarchy