Distributed Shared Memory


  • Slide 1/42

    Distributed Shared Memory

  • Slide 2/42

    Directory-Based Cache Coherence

    Why is snooping a bad idea? Broadcasting is expensive.

    Directory: maintain the cache state explicitly

    Keep a list of the caches that have a copy: many read-only copies are allowed, but only one writable copy

    [Figure: two nodes, each with a processor (P), cache ($), and communication assist (CA), connected by a scalable interconnection network to memory with its directory]

  • Slide 3/42

    Directory Protocol

    [Figure: block X is allocated in the directory memory of its home node; caches at other nodes hold copies of X across the interconnection network]

  • Slide 4/42

    Terminology

    Home node: the node in whose main memory the block is allocated

    Dirty node: the node that has a copy of the block in its cache in modified (dirty) state

    Owner node: the node that currently holds the valid copy of the block (the home node or the dirty node)

    Exclusive node: the node that has a copy of the block in exclusive state

    Local node, or requesting node: the node containing the processor that issues a request for the block

    Local block: a block whose home is local to the issuing processor

  • Slide 5/42

    Basic Operations

    Read miss to a block in modified state

    [Figure: requestor, home, and owner nodes, each with processor, cache, communication assist, and memory/directory]

    1. Read request to home

    2. Response with owner identifier

    3. Read request to owner

    4a. Data reply to requestor; 4b. Revision message to home
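
    To make this four-step sequence concrete, here is a minimal C sketch, not DASH's or FLASH's actual code: dir_entry_t, handle_read_miss, and the fixed node count are invented for illustration. It walks a read miss to a dirty block through requestor, home, and owner.

      #include <stdio.h>
      #include <stdbool.h>

      #define NODES 4

      typedef enum { UNCACHED, SHARED, MODIFIED } dir_state_t;

      typedef struct {            /* one directory entry at the home node      */
          dir_state_t state;
          int         owner;      /* valid when state == MODIFIED              */
          bool        sharers[NODES];
      } dir_entry_t;

      /* 3. + 4a./4b.: the owner supplies the data, downgrades to SHARED,
       * and sends a "revision" (sharing writeback) to the home.               */
      static int owner_read(dir_entry_t *dir, int owner, int requestor) {
          printf("3. read request: node %d -> owner %d\n", requestor, owner);
          printf("4a. data reply:  owner %d -> node %d\n", owner, requestor);
          printf("4b. revision:    owner %d -> home\n", owner);
          dir->state = SHARED;                 /* memory is clean again        */
          dir->sharers[owner] = true;
          dir->sharers[requestor] = true;
          return 42;                           /* the (dummy) block data       */
      }

      /* 1. + 2.: the home consults the directory; if the block is dirty it
       * answers with the owner's identity (strict request-response).          */
      static int handle_read_miss(dir_entry_t *dir, int requestor) {
          printf("1. read request: node %d -> home\n", requestor);
          if (dir->state == MODIFIED) {
              printf("2. response: home -> node %d (owner is %d)\n",
                     requestor, dir->owner);
              return owner_read(dir, dir->owner, requestor);
          }
          dir->state = SHARED;                 /* home had a clean copy        */
          dir->sharers[requestor] = true;
          return 42;
      }

      int main(void) {
          dir_entry_t x = { .state = MODIFIED, .owner = 2 };
          handle_read_miss(&x, 0);             /* node 0 misses on dirty block */
          return 0;
      }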

  • Slide 6/42

    Basic Operations (Contd)

    Write miss to a block with two sharers

    [Figure: requestor, home, and two sharer nodes, each with processor, cache, communication assist, and memory/directory]

    1. RdEx (read-exclusive) request to home

    2. Response with sharers' identifiers

    3. Invalidation requests to the sharers

    4. Invalidation acknowledgements to the requestor
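
    The write-miss sequence can be sketched the same way. The fragment below uses illustrative names again, and acknowledgements are modeled as a simple counter rather than real messages: the home returns the sharer list, and the requestor collects one ack per invalidation before the write is allowed to complete.

      #include <stdio.h>
      #include <stdbool.h>

      #define NODES 4

      typedef struct {
          bool sharers[NODES];   /* presence bit per node (full-map style)   */
          int  owner;            /* new exclusive owner after the write miss */
      } dir_entry_t;

      /* 3./4.: the requestor sends invalidations to every sharer returned by
       * the home and waits until the acknowledgement count matches.          */
      static void invalidate_sharers(dir_entry_t *dir, int requestor) {
          int expected = 0, acks = 0;
          for (int n = 0; n < NODES; n++) {
              if (dir->sharers[n] && n != requestor) {
                  printf("3. invalidation: node %d -> sharer %d\n", requestor, n);
                  expected++;
              }
          }
          for (int n = 0; n < NODES; n++) {
              if (dir->sharers[n] && n != requestor) {
                  printf("4. ack: sharer %d -> node %d\n", n, requestor);
                  dir->sharers[n] = false;
                  acks++;
              }
          }
          if (acks == expected)               /* write completes only now     */
              printf("write miss complete: node %d owns the block\n", requestor);
      }

      /* 1./2.: the home hands back the sharer list in response to RdEx. */
      static void handle_write_miss(dir_entry_t *dir, int requestor) {
          printf("1. RdEx request: node %d -> home\n", requestor);
          printf("2. response: home -> node %d (sharer list)\n", requestor);
          invalidate_sharers(dir, requestor);
          dir->owner = requestor;
          dir->sharers[requestor] = true;
      }

      int main(void) {
          dir_entry_t x = { .sharers = { false, true, false, true } };
          handle_write_miss(&x, 0);           /* nodes 1 and 3 get invalidated */
          return 0;
      }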

  • Slide 7/42

    Alternatives for Organizing Directories

    Directory storage schemes: how to find the source of directory information, and how to locate the copies?

    Flat: directory information is in a fixed place, the home

      Memory-based: directory information is co-located with the memory module that is the home for the block (Stanford DASH/FLASH, SGI Origin, etc.)

      Cache-based: the caches holding a copy of the memory block form a linked list (IEEE SCI, Sequent NUMA-Q)

    Centralized

    Hierarchical: a hierarchy of caches that guarantees the inclusion property

  • Slide 8/42

    Flat Directory Schemes

    Full-Map Directory

    Limited Directory

    Chained Directory

  • Slide 9/42

    Memory-Based Directory Schemes

    Full bit vector (full-map) directory

    Most straightforward scheme

    Low latency: invalidations can be sent in parallel

    Main disadvantage: storage overhead grows as P*B (one presence bit per processor, P, for each of the B memory blocks)

    Ways to reduce the overhead:

    Increase the cache block size; but access time and network traffic increase due to false sharing

    Use a hierarchical protocol; in Stanford DASH each node is itself a bus-based 4-processor multiprocessor, so the bit vector needs one bit per node rather than per processor
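
    A minimal sketch of what a full-map entry looks like and how its overhead is computed; the field names and the 64-processor size are illustrative, not taken from any real machine.

      #include <stdint.h>
      #include <stdio.h>

      #define P 64                     /* processors (or nodes) in the system  */

      typedef struct {
          uint64_t presence;           /* bit p set => processor p has a copy  */
          uint8_t  dirty;              /* set when exactly one modified copy   */
      } fullmap_entry_t;               /* one entry per memory block           */

      int main(void) {
          const int block_bytes = 64;  /* cache block size                     */
          /* Overhead per block: P presence bits + 1 dirty bit, versus
           * 8 * block_bytes data bits => roughly P / (8 * block size).        */
          double overhead = (double)(P + 1) / (8.0 * block_bytes);
          printf("directory overhead per block: %.1f%%\n", overhead * 100.0);

          fullmap_entry_t e = { 0 };
          e.presence |= 1ULL << 5;     /* processor 5 obtains a shared copy    */
          e.presence |= 1ULL << 9;     /* processor 9 obtains a shared copy    */
          printf("sharers mask: 0x%llx\n", (unsigned long long)e.presence);
          return 0;
      }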

  • Slide 10/42

    Storage Reducing Optimization: Directory Width

    Directory width: the number of bits per directory entry

    Motivation: most of the time only a few caches have a copy of a block

    Limited (pointer) directory: keep k pointers instead of a full bit vector; storage overhead is k * log P, where k is the number of copies tracked

    Overflow methods are needed when more than k caches share the block

    Dir_i X notation:

    i indicates the number of pointers (i < P)

    X indicates the invalidation method on overflow: broadcast (B) or non-broadcast (NB)
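
    A sketch of a Dir_i entry with i = 4 pointers and an overflow flag; the names are illustrative, and the overflow policy itself is left to the methods on the following slides.

      #include <stdbool.h>
      #include <stdint.h>
      #include <stdio.h>

      #define I_PTRS 4                     /* i: pointers kept per entry       */

      typedef struct {
          uint16_t ptr[I_PTRS];            /* node ids of sharers (log P bits) */
          uint8_t  count;                  /* how many pointers are in use     */
          bool     overflow;               /* set when a sharer could not fit  */
      } limited_entry_t;

      /* Record a new sharer; on overflow the chosen policy (Dir_i B, Dir_i NB,
       * coarse vector, software, ...) decides what happens next.              */
      static void add_sharer(limited_entry_t *e, uint16_t node) {
          for (uint8_t k = 0; k < e->count; k++)
              if (e->ptr[k] == node) return;      /* already recorded          */
          if (e->count < I_PTRS)
              e->ptr[e->count++] = node;
          else
              e->overflow = true;                 /* hand off to overflow method */
      }

      int main(void) {
          limited_entry_t e = { 0 };
          for (uint16_t n = 1; n <= 5; n++)        /* the 5th sharer overflows  */
              add_sharer(&e, n);
          printf("sharers tracked: %u, overflow: %s\n",
                 (unsigned)e.count, e.overflow ? "yes" : "no");
          return 0;
      }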

  • Slide 11/42

    Overflow Methods for Limited Protocol

    Dir_i B (Broadcast)

    Set the broadcast bit in case of overflow

    Broadcast invalidation messages to all nodes on a write

    Simple, but increases write latency and wastes communication bandwidth

    Dir_i NB (Not Broadcast)

    Invalidate the copy of one existing sharer to free a pointer

    Bad for widely shared, read-mostly data

    Performance degrades for intensive sharing of read-only and read-mostly data because the miss ratio increases

  • Slide 12/42

    Overflow Methods for Limited Protocol (Cont'd)

    Dir_i CV_r (Coarse Vector)

    On overflow the representation changes to a coarse bit vector

    When the number of sharers exceeds i, each bit stands for a region of r processors (coarse vector)

    Invalidations are sent to whole regions of caches

    Used in the SGI Origin

    Robust to different sharing patterns

    About 70% less memory message traffic than broadcast and at least 8% less than the other schemes

  • Slide 13/42

    Coarse Bit Vector Scheme

    [Figure: 16-processor example. With the overflow bit clear, the directory entry holds two 4-bit pointers naming individual sharers. After overflow, the overflow bit is set and the same bits are reinterpreted as a coarse vector in which each bit covers a region of processors, so an invalidation goes to every processor in a marked region.]
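
    The switch from exact pointers to a coarse vector can be sketched in a few lines of C; this is illustrative, not the SGI Origin implementation, and assumes 16 processors with region size r = 4.

      #include <stdint.h>
      #include <stdio.h>

      #define P 16                         /* processors                        */
      #define R 4                          /* region size r in Dir_i CV_r       */

      /* Reinterpret a list of exact sharer ids as a coarse vector:
       * bit g is set if any processor in region g holds a copy.                */
      static uint8_t to_coarse_vector(const int *sharers, int n) {
          uint8_t cv = 0;
          for (int k = 0; k < n; k++)
              cv |= (uint8_t)(1u << (sharers[k] / R));
          return cv;
      }

      /* On a write, invalidations go to every processor of every marked
       * region -- less precise, but the entry never overflows.                 */
      static void invalidate_regions(uint8_t cv) {
          for (int g = 0; g < P / R; g++)
              if (cv & (1u << g))
                  for (int p = g * R; p < (g + 1) * R; p++)
                      printf("invalidate processor P%d\n", p);
      }

      int main(void) {
          int sharers[] = { 1, 6, 7, 13 };          /* more sharers than pointers */
          uint8_t cv = to_coarse_vector(sharers, 4);
          printf("coarse vector: 0x%x\n", (unsigned)cv); /* regions 0, 1, 3 marked */
          invalidate_regions(cv);
          return 0;
      }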

  • Slide 14/42

    Overflow Methods for Limited Protocol (Cont'd)

    Dir_i SW (Software)

    On overflow, the current i pointers and a pointer to the new sharer are saved into a special portion of the local main memory by software

    MIT Alewife: LimitLESS; the cost of interrupts and software handling is high

    Dir_i DP (Dynamic Pointers)

    The directory entry contains a hardware pointer into the local memory

    Similar to the software mechanism but without the software overhead

    The difference is that list manipulation is done by a special-purpose protocol processor rather than by the general-purpose processor

    Stanford FLASH

    Directory overhead: 7-9% of main memory

  • Slide 15/42

    Dynamic Pointers

    Circular sharing list

    Directory: a contiguous region in main memory

  • Slide 16/42

    Stanford DASH Architecture

    DASH => Directory Architecture for SHared memory

    Nodes connected by a scalable interconnect

    Partitioned (physically distributed) shared memory

    Processing nodes are themselves multiprocessors

    Distributed directory-based cache coherence

    [Figure: bus-based nodes, each with several processors and caches, connected by the interconnection network; each node's memory has a directory with presence bits and a dirty bit per block]

  • Slide 17/42

    Conclusions

    Full-map is most appropriate up to a modest number of processors

    Dir_i CV_r and Dir_i DP are the most likely candidates beyond that

    Coarse vector: loses accuracy on overflow

    Dynamic pointers: processing cost due to hardware list manipulation

  • Slide 18/42

    Storage Reducing Optimization: Directory Height

    Directory height: the total number of directory entries

    Motivation: the total amount of cache memory is much less than the total main memory

    Sparse directory: organize the directory as a cache

    This cache needs no backing store

    When an entry is replaced, send invalidations to the nodes holding copies

    Spatial locality is not an issue: one entry per block

    The reference stream is heavily filtered, consisting only of references that were not satisfied in the processor caches

    With a directory size factor of 8, associativity of 4, and LRU replacement, performance is very close to that of a full-map directory
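
    A sparse directory is essentially a set-associative cache of directory entries with no backing store, so a replacement must invalidate the victim's sharers. A minimal sketch under invented names, with 256 sets of 4 ways:

      #include <stdbool.h>
      #include <stdint.h>
      #include <stdio.h>

      #define SETS 256
      #define WAYS 4                      /* associativity of 4                 */

      typedef struct {
          bool     valid;
          uint64_t block;                 /* block address (the tag)            */
          uint64_t sharers;               /* full-map bit vector for this block */
          uint8_t  lru;                   /* higher = older                     */
      } sparse_entry_t;

      static sparse_entry_t dir[SETS][WAYS];

      /* Age every way in the set and mark the touched entry as most recent. */
      static void touch(sparse_entry_t *set, sparse_entry_t *hit) {
          for (int w = 0; w < WAYS; w++) set[w].lru++;
          hit->lru = 0;
      }

      /* Find or allocate the entry for a block; on a conflict the oldest way
       * is evicted and its sharers must be invalidated (no backing store).    */
      static sparse_entry_t *dir_lookup(uint64_t block) {
          sparse_entry_t *set = dir[block % SETS], *victim = &set[0];
          for (int w = 0; w < WAYS; w++) {
              if (set[w].valid && set[w].block == block) {
                  touch(set, &set[w]);
                  return &set[w];
              }
              if (set[w].lru > victim->lru) victim = &set[w];
          }
          if (victim->valid && victim->sharers)
              printf("evict block %llu: invalidate sharers 0x%llx\n",
                     (unsigned long long)victim->block,
                     (unsigned long long)victim->sharers);
          victim->valid = true;
          victim->block = block;
          victim->sharers = 0;
          touch(set, victim);
          return victim;
      }

      int main(void) {
          dir_lookup(10)->sharers = 0x3;        /* block 10 shared by nodes 0,1 */
          for (uint64_t b = 0; b < 5; b++)      /* conflicting blocks fill the set */
              dir_lookup(10 + b * SETS);        /* last lookup evicts block 10  */
          return 0;
      }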

  • Slide 19/42

    Protocol Optimization

    Two major goals, plus one

    Reduce the number of network transactions per memory operation: reduces the bandwidth demand

    Reduce the number of actions on the critical path: reduces the uncontended latency

    Reduce the endpoint assist occupancy per transaction: reduces the uncontended latency as well as endpoint contention

  • Slide 20/42

    Latency Reduction

    A read request to a block in exclusive (dirty) state, with L = local requestor, H = home, R = remote owner:

    (a) Strict request-response: 1. request L->H, 2. response H->L naming the owner, 3. intervention L->R, 4a. revise R->H, 4b. response R->L

    (b) Intervention forwarding: 1. request L->H, 2. intervention H->R, 3. response R->H, 4. response H->L

    (c) Reply forwarding: 1. request L->H, 2. intervention H->R, 3a. revise R->H, 3b. response R->L

    (a) uses five network transactions with four on the critical path; (b) uses four with four on the critical path; (c) uses four with only three on the critical path.

  • Slide 21/42

    Cache-Based Directory Schemes

    Directory is a doubly linked list of entries

    Read miss

    Insert the requestor at the head of the list

    Write miss

    Insert the requestor at the head of the list

    Invalidate the sharers by traversing the list: long latency!

    Write back

    Delete itself from the list

    [Figure: the home memory holds a head pointer for the block; the caches of node 0, node 1, and node 2 are linked into a doubly linked sharing list]
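
    The list operations above map onto ordinary doubly linked list code. The sketch below uses illustrative names and omits the real protocol's messages, pending states, and locking; it shows insertion at the head on a miss, the serial purge on a write, and self-removal on write back.

      #include <stdio.h>
      #include <stdlib.h>

      /* One cache's entry for a block, linked into the per-block sharing list. */
      typedef struct sharer {
          int            node;
          struct sharer *prev, *next;   /* doubly linked: toward head / tail    */
      } sharer_t;

      typedef struct { sharer_t *head; } home_entry_t;  /* head pointer at home */

      /* Read or write miss: the requestor inserts itself at the head.          */
      static sharer_t *insert_head(home_entry_t *home, int node) {
          sharer_t *s = calloc(1, sizeof *s);
          s->node = node;
          s->next = home->head;
          if (home->head) home->head->prev = s;
          home->head = s;
          return s;
      }

      /* Write back / replacement: the node deletes itself from the list.       */
      static void rollout(home_entry_t *home, sharer_t *s) {
          if (s->prev) s->prev->next = s->next; else home->head = s->next;
          if (s->next) s->next->prev = s->prev;
          free(s);
      }

      /* Write by the head: invalidate the rest of the list one hop at a time
       * (this serial traversal is the "long latency" noted on the slide).      */
      static void purge(home_entry_t *home) {
          sharer_t *s = home->head ? home->head->next : NULL;
          while (s) {
              sharer_t *next = s->next;
              printf("invalidate node %d\n", s->node);
              free(s);
              s = next;
          }
          if (home->head) home->head->next = NULL;
      }

      int main(void) {
          home_entry_t x = { 0 };
          insert_head(&x, 0);
          insert_head(&x, 1);
          sharer_t *writer = insert_head(&x, 2);   /* node 2 now heads the list */
          purge(&x);                               /* nodes 1 and 0 invalidated */
          rollout(&x, writer);                     /* node 2 writes back        */
          return 0;
      }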

  • Slide 22/42

    Tradeoffs with Cache-Based Schemes

    Advantages

    Small overhead: a single pointer per memory block

    Easier to provide fairness and to avoid livelock

    Sending invalidations is not centralized at the home, but distributed among the sharers

    Disadvantages

    Long latency and long occupancy of the communication assist on invalidations

    Modifying the sharing list requires careful coordination and mutual exclusion

  • Slide 23/42

    Latency Reduction

    Invalidating the sharing list, with home H and sharers S1, S2, S3:

    (a) Strict request-response: the home invalidates one sharer at a time and waits for each acknowledgement (1. inv H->S1, 2. ack; 3. inv H->S2, 4. ack; 5. inv H->S3, 6. ack)

    (b) Intervention forwarding: each sharer forwards the invalidation to the next and acknowledges the home (1. inv H->S1; 2a. inv S1->S2, 2b. ack S1->H; 3a. inv S2->S3, 3b. ack S2->H; 4b. ack S3->H)

    (c) Reply forwarding: the invalidation travels down the list and only the tail acknowledges the home (1. inv H->S1, 2. inv S1->S2, 3. inv S2->S3, 4. ack S3->H)

  • Slide 24/42

    Sequent NUMA-Q

    IEEE standard Scalable Coherent Interface (SCI) protocol

    Targeted toward commercial workloads: databases and transaction processing

    Commodity hardware

    Intel SMP "quads" as the processing nodes

    DataPump network interface from Vitesse Semiconductor

    Custom IQ-Link board: directory logic and a remote cache (32 MB, 4-way)

    [Figure: four quads, each a 4-processor Intel SMP with memory, PCI peripheral/I/O interface, and an IQ-Link, connected in a ring; coherence within a quad is by snooping]

  • Slide 25/42

    IQ-Link Implementation

    Local directory: directory state for locally allocated data

    Remote tags: tags for remotely allocated but locally cached data

    Orion bus controller (OBIC): manages the snooping and requesting logic on the quad bus

    DataPump: GaAs chip implementing the transport protocol of the SCI standard

    SCI link interface controller (SCLIC): manages the SCI coherence protocol; programmable

    [Figure: quad bus -> OBIC (with remote tag and local directory) -> SCLIC (directory controller) -> DataPump -> SCI ring (1 GB/s); remote data and tags sit alongside the SCLIC]

    Bus-side tags: SRAM

    Network-side tags: SDRAM

  • Slide 26/42

    Directory States

    HOME

    No remote cache (quad) in the system contains a copy of the block

    A processor cache in the home quad itself may still have a copy; this is not visible to the SCI protocol but is managed by the bus protocol within the quad

    FRESH

    One or more remote caches may have a read-only copy, and the copy in memory is valid

    GONE

    A remote cache contains a writable (exclusive or dirty) copy; no valid copy exists in the home node's memory

  • Slide 27/42

    Remote Cache States

    7 bits encode 29 stable states plus many pending (transient) states

    Each stable state has two parts

    First part: where the cache entry is located in the sharing list: ONLY, HEAD, TAIL, or MID

    Second part: the actual state: dirty, clean (like the exclusive state in MESI), fresh (the data may not be written until memory is informed), copy (unmodified and readable), and so on
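
    One way to picture the two-part encoding is a pair of small enums packed into one field; the bit layout below is invented, not SCI's actual 7-bit encoding.

      #include <stdint.h>
      #include <stdio.h>

      /* Where this cache sits in the per-block sharing list. */
      typedef enum { POS_ONLY, POS_HEAD, POS_TAIL, POS_MID } list_pos_t;

      /* What the cache is allowed to do with the data. */
      typedef enum { ST_DIRTY, ST_CLEAN, ST_FRESH, ST_COPY } line_state_t;

      /* Pack position and state into one small field, as SCI's stable states
       * combine the two parts (the bit layout here is made up).               */
      static inline uint8_t stable_state(list_pos_t p, line_state_t s) {
          return (uint8_t)((p << 4) | s);
      }

      int main(void) {
          uint8_t head_fresh = stable_state(POS_HEAD, ST_FRESH);
          uint8_t only_dirty = stable_state(POS_ONLY, ST_DIRTY);
          printf("HEAD_FRESH = 0x%02x, ONLY_DIRTY = 0x%02x\n",
                 head_fresh, only_dirty);
          return 0;
      }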

  • Slide 28/42

    SCI Standard

    Three primitive operations

    List construction: adding a new node at the head of the list

    Rollout: removing a node from the list

    Purge (invalidation): the head node invalidates all other nodes

    Levels of protocol

    Minimal protocol: does not permit read sharing; only one copy at a time

    Typical protocol: what NUMA-Q implements

    Full protocol: implements the full SCI standard

  • Slide 29/42

    Handling Read Requests

    If the directory state is HOME

    The home updates the block's state to FRESH and sends the data

    The requestor goes from PENDING to ONLY_FRESH

    If the directory state is FRESH

    Insert the requestor at the head of the list

    The previous head changes its state from HEAD_FRESH to MID_VALID, or from ONLY_FRESH to TAIL_VALID

    The requestor changes its state from PENDING to HEAD_FRESH

    If the directory state is GONE

    The home stays in the GONE state and sends a pointer to the previous head

    The previous head changes its state from HEAD_DIRTY to MID_VALID, or from ONLY_DIRTY to TAIL_VALID

    The requestor sets its state to HEAD_DIRTY
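
    The three cases above can be compressed into a small state machine at the home. The sketch below uses the slide's state names but reduces the list updates and messages to prints; handle_read and home_dir_t are invented for illustration.

      #include <stdio.h>

      typedef enum { HOME, FRESH, GONE } dir_state_t;

      typedef struct {
          dir_state_t state;
          int         head;        /* id of the current head sharer, -1 if none */
      } home_dir_t;

      /* Handle a read request from `node`; returns the node that supplies the
       * data (-1 means the home memory itself, otherwise the previous head).   */
      static int handle_read(home_dir_t *d, int node) {
          int supplier = -1;
          switch (d->state) {
          case HOME:                       /* no remote copies yet              */
              d->state = FRESH;
              printf("node %d: PENDING -> ONLY_FRESH\n", node);
              break;
          case FRESH:                      /* prepend to a fresh list           */
              printf("old head %d: HEAD_FRESH -> MID_VALID (or ONLY_FRESH -> TAIL_VALID)\n",
                     d->head);
              printf("node %d: PENDING -> HEAD_FRESH\n", node);
              break;                       /* memory copy is still valid        */
          case GONE:                       /* dirty copy lives at the old head  */
              printf("old head %d: HEAD_DIRTY -> MID_VALID (or ONLY_DIRTY -> TAIL_VALID)\n",
                     d->head);
              printf("node %d: PENDING -> HEAD_DIRTY\n", node);
              supplier = d->head;          /* requestor fetches from old head   */
              break;
          }
          d->head = node;                  /* requestor becomes the new head    */
          return supplier;
      }

      int main(void) {
          home_dir_t x = { HOME, -1 };
          handle_read(&x, 1);              /* first reader                      */
          handle_read(&x, 2);              /* second reader, list grows         */
          return 0;
      }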

  • Slide 30/42

    An Example of Read Miss

    [Figure: requestor, old head, and home memory. Initially the requestor is INVALID, the old head is ONLY_FRESH, and the home is FRESH. First round (request-response with the home): the requestor goes to PENDING and receives a pointer to the old head; the home stays FRESH. Second round (request-response with the old head): the requestor attaches itself at the head and becomes HEAD_FRESH while the old head becomes TAIL_VALID.]

  • Slide 31/42

    Handling Write Requests

    Only the head node is allowed to write a block and issue invalidations

    When the writer is in the middle of the list, it removes itself from the list (rollout) and adds itself again at the head (list construction)

    If the writer's cache block is in HEAD_DIRTY state

    Purge the sharing list in a strict request-response manner

    The writer stays in a pending state while the purging is in progress

    If it is in HEAD_FRESH state

    The writer goes into a pending state and sends a request to the home, which changes from FRESH to GONE and replies

    The writer then goes into a different pending state and purges the list

    Eventually the writer's state becomes ONLY_DIRTY

  • Slide 32/42

    Handling Writeback and Replacements

    Rollouts are needed for replacement, invalidation, and writes

    Middle-node rollout: the node first sets itself to a pending state to prevent a race; if two adjacent nodes roll out at the same time, the node closer to the tail has priority and is rolled out first; it then sets its state to invalid

    Head-node rollout: the head goes into a pending state and the downstream node changes its state (e.g., MID_VALID -> HEAD_DIRTY); it then updates the pointer in the home node; if it was the only node, the home state changes to HOME

    Writeback upon a miss: serve the writeback first, since buffering it is complicated and misses in the remote cache are infrequent (unlike the write buffer in a bus-based system)

  • Slide 33/42

    Hierarchical Coherence

  • Slide 34/42

    Snoop-Snoop System

    Simplest way to build large-scale cache-coherent MPs

    Coherence monitor per node, consisting of:

    Remote (access) cache

    Local state monitor: keeps state information on data that is locally allocated but remotely cached

    The remote cache should be larger than the sum of the processor caches and quite associative

    It should be lockup-free

    It issues an invalidation request on the local bus when a block is replaced

  • Slide 35/42

    Snoop-Snoop with Global Memory

    First-level caches

    Highest-performance SRAM caches

    B1 follows a standard snooping protocol

    Second-level cache

    Much larger than the L1 caches (set associative)

    Must maintain inclusion

    The L2 cache acts as a filter for the B1 bus and the L1 caches

    The L2 cache can be DRAM based since fewer references get to it

    [Figure: two clusters, each with processors and L1 caches on a B1 bus behind a coherence monitor; the coherence monitors and the global memory M sit on the B2 bus]

  • Slide 36/42

    Snoop-Snoop with Global Memory (Cont'd)

    Advantages

    Misses to main memory require just a single traversal to the root of the hierarchy

    Placement of shared data is not an issue

    Disadvantages

    Misses to local data structures (e.g., the stack) also have to traverse the hierarchy, resulting in higher traffic and latency

    Memory at the global bus must be highly interleaved; otherwise bandwidth to it will not scale

  • Slide 37/42

    Cluster-Based Hierarchies

    Key idea: main memory is distributed among the clusters

    Reduces global bus traffic (local data and suitably placed shared data)

    Reduces latency (less contention, and local accesses are faster)

    Example machine: Encore Gigamax

    The L2 cache can be replaced by a tag-only router/coherence switch

    [Figure: as before, two clusters of processors and caches on B1 buses behind coherence monitors on the B2 bus, but now each cluster also has its own memory M]

  • Slide 38/42

    Summary

    Advantages

    Conceptually simple to build (apply snooping recursively)

    Requests can be merged and combined in hardware

    Disadvantages

    Physical hierarchies do not provide enough bisection bandwidth (the root becomes a bottleneck, e.g., for 2-D and 3-D grid problems)

    Latencies are often larger than in direct networks

  • Slide 39/42

    Hierarchical Directory Scheme

    The internal nodes contain only directory information

    The L1 directory tracks which of its children (processing nodes) have a copy of the memory block

    The L2 directory tracks which of its children (L1 directories) have a copy of the memory block

    It also tracks which local memory blocks are cached outside the subtree

    Inclusion is maintained between the processor caches and the L1 directory

    Logical trees may be embedded in any physical hierarchy

    [Figure: processing nodes at the leaves, L1 directories above them, an L2 directory at the root]

  • Slide 40/42

    A Multirooted Hierarchical Directory

    [Figure: processing nodes p0 through p7 at the leaves; each node's memory has its own directory tree rooted at that node (the tree for p0's memory is highlighted), with internal nodes holding only directory information; the trees share the leaves, so the same processing node appears in several trees]

  • Slide 41/42

    Organization and Overhead

    Organization

    A separate directory tree for every block's home memory

    Storage overhead

    Each level holds about the same amount of directory memory

    C: cache size, b: branch factor, M: memory size, B: block size, P: number of processing nodes

    With inclusion, each of the log_b P directory levels tracks roughly the P*C/B blocks cached in the machine, so total directory storage grows as (P*C/B) * log_b P entries

    Performance overhead

    Reduces the number of network hops when sharing stays within a subtree

    But increases the number of end-to-end transactions, which increases latency

    The root becomes the bottleneck
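
    A rough worked example of the storage estimate above; the per-level entry count follows from inclusion, and the specific numbers are only illustrative.

      \[
        \text{entries per level} \approx \frac{P\,C}{B}, \qquad
        \text{levels} = \log_b P, \qquad
        \text{total entries} \approx \frac{P\,C}{B}\,\log_b P
      \]

      \[
        \text{e.g. } P = 64,\; C = 1\,\mathrm{MB},\; B = 64\,\mathrm{B},\; b = 4:\quad
        \frac{64 \cdot 2^{20}}{64}\,\log_4 64 = 2^{20} \cdot 3 \approx 3\,\mathrm{M\ entries}
      \]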

  • Slide 42/42

    Performance Implications of Hierarchical Coherence

    Advantages

    Combining of requests for a block

    Reduces traffic and contention

    Locality effect

    Reduces transit latency and contention

    Disadvantages

    Long uncontended latency

    High bandwidth requirements near the root of the hierarchy