Cache Coherent Distributed Shared Memory
Motivations
• Small processor count
  – SMP machines
  – Single shared memory with multiple processors interconnected with a bus
• Large processor count
  – Distributed Shared Memory machines
  – Largely message-passing architectures
Programming Concerns
• Message passing
  – Access to memory involves send/request packets
  – Communication costs
• Shared memory model
  – Ease of programming
  – But not very scalable
• Scalable and easy to program?
Distributed Shared Memory
• Physically distributed memory
• Implemented with a single shared address space
• Also known as NUMA machines since memory access times are non-uniform
  – Local access times < remote access times
DSM and Memory access
• Big difference in accessing local versus remote data
• Large differences make it difficult to hide latency
• How about caching?
  – In short, it's difficult
  – Cache coherence is the obstacle
Cache coherence
• Different processors may access values at the same memory location
• How to ensure data integrity at all times?
  – An update by a processor at time t must be available to other processors at time t+1
• Two families of protocols
  – Snoopy protocol
  – Directory-based protocol
Snoopy Coherence Protocols
• Transparent to the user
• Easy to implement
• For a read
  – Data is fetched from another cache or from memory
• For a write
  – All copies in other caches are invalidated
  – Write-back may be delayed or immediate
• The bus plays an important role: every cache snoops bus traffic for writes (see the sketch below)
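To make the write-invalidate idea concrete, here is a minimal C sketch, assuming a simplified one-block cache per CPU and immediate write-back; the names (bus_read, bus_write) are illustrative, not from the lecture.

```c
/* A minimal write-invalidate snoopy sketch: one block per CPU cache,
 * immediate write-back. Names are illustrative, not from the lecture. */
#include <stdio.h>
#include <stdbool.h>

#define NUM_CPUS 4

typedef struct {
    bool valid;   /* do we hold a copy of the block? */
    int  data;    /* cached value */
} CacheLine;

static CacheLine cache[NUM_CPUS];
static int memory = 0;   /* the backing shared memory location */

/* A write is broadcast on the bus: every other cache snoops it and
 * invalidates its copy (write-invalidate). */
static void bus_write(int cpu, int value) {
    for (int i = 0; i < NUM_CPUS; i++)
        if (i != cpu)
            cache[i].valid = false;   /* snoop hit: invalidate */
    cache[cpu].valid = true;
    cache[cpu].data  = value;
    memory = value;                   /* immediate write-back, for simplicity */
}

/* A read misses if our copy was invalidated; fetch from memory
 * (a real protocol may also fetch from a peer cache). */
static int bus_read(int cpu) {
    if (!cache[cpu].valid) {
        cache[cpu].data  = memory;
        cache[cpu].valid = true;
    }
    return cache[cpu].data;
}

int main(void) {
    printf("CPU0 reads %d\n", bus_read(0)); /* 0, from memory */
    bus_write(1, 42);                       /* invalidates CPU0's copy */
    printf("CPU0 reads %d\n", bus_read(0)); /* 42: coherent again */
    return 0;
}
```

Running it prints 0 and then 42: CPU0's stale copy is invalidated by CPU1's write and re-fetched on the next read.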
Example (diagram from the original slides, not reproduced in the transcript)
But it does not scale!
• Not feasible for machines with memory distributed across a large number of systems
• The broadcast-on-bus approach scales poorly
  – Leads to bus saturation
  – Wastes processor cycles snooping all caches in the system
Directory-Based Cache Coherence
• A directory tracks which processors have cached a block of memory
• The directory contains information for all cache blocks in the system
• Each cache block can be in 1 of 3 states
  – Invalid
  – Shared
  – Exclusive
• To enter the exclusive state, all other cached copies of the same memory location are invalidated (see the sketch below)
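As a sketch of what a directory entry might look like, assuming at most 64 processors tracked with a sharer bit vector; the names (DirEntry, dir_read, dir_write) are illustrative:

```c
/* Illustrative directory entry: per block, a state plus a bit vector
 * of sharers (assumes at most 64 processors). */
#include <stdio.h>
#include <stdint.h>

typedef enum { INVALID, SHARED, EXCLUSIVE } BlockState;

typedef struct {
    BlockState state;
    uint64_t   sharers;   /* bit i set => processor i has a cached copy */
} DirEntry;

/* A read adds the reader to the sharer set. */
static void dir_read(DirEntry *e, int cpu) {
    e->state    = SHARED;
    e->sharers |= (uint64_t)1 << cpu;
}

/* A write invalidates every other sharer before granting exclusivity,
 * enforcing the invariant from the slide above. */
static void dir_write(DirEntry *e, int cpu) {
    uint64_t writer = (uint64_t)1 << cpu;
    uint64_t others = e->sharers & ~writer;
    for (int i = 0; i < 64; i++)
        if (others & ((uint64_t)1 << i))
            printf("invalidate processor %d\n", i); /* stand-in for a message */
    e->sharers = writer;
    e->state   = EXCLUSIVE;
}

int main(void) {
    DirEntry e = { INVALID, 0 };
    dir_read(&e, 0);   /* P0 reads:  SHARED, sharers = {0}    */
    dir_read(&e, 2);   /* P2 reads:  SHARED, sharers = {0, 2} */
    dir_write(&e, 2);  /* P2 writes: P0 invalidated, EXCLUSIVE */
    return 0;
}
```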
Original form not popular
• Compared to snoopy protocols
  – Directory systems avoid broadcasting on the bus
• But all requests are served by 1 directory server
  – May saturate the directory server
• Still not scalable
• How about distributing the directory?
  – Load balancing
  – Hierarchical model?
Distributed Directory Protocol
• Involves sending messages among 3 node types
  – Local node: the node of the requesting processor
  – Home node: the node containing the memory location
  – Remote node: a node holding the cache block in exclusive state
3 Scenarios
• Scenario 1
  – Local node sends request to home node
  – Home node sends data back to local node
• Scenario 2
  – Local node sends request to home node
  – Home node redirects request to remote node
  – Remote node sends data back to local node
• Scenario 3
  – Local node sends request for exclusive state
  – Home node redirects request to other remote nodes for invalidation (all three flows are sketched below)
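A minimal C sketch of how a home node might serve requests in the three scenarios; the message helpers (send_data, forward_request, send_invalidate) are illustrative placeholders for the interconnect, and real protocols involve more states and acknowledgments.

```c
/* Illustrative home-node logic for the three scenarios; the message
 * helpers merely print what a real interconnect would transmit. */
#include <stdio.h>
#include <stdint.h>

typedef enum { INVALID, SHARED, EXCLUSIVE } BlockState;
typedef enum { READ, WRITE_EXCLUSIVE } ReqType;

typedef struct {
    BlockState state;
    int        owner;     /* remote node holding the block exclusively */
    uint64_t   sharers;   /* bit vector of nodes with shared copies */
} DirEntry;

static void send_data(int to)       { printf("data -> node %d\n", to); }
static void forward_request(int to) { printf("forward -> node %d\n", to); }
static void send_invalidate(int to) { printf("invalidate -> node %d\n", to); }

static void home_node_serve(DirEntry *e, ReqType req, int local) {
    if (req == READ && e->state != EXCLUSIVE) {
        /* Scenario 1: home holds an up-to-date copy and replies directly. */
        send_data(local);
        e->state    = SHARED;
        e->sharers |= (uint64_t)1 << local;
    } else if (req == READ) {
        /* Scenario 2: a remote node owns the block; redirect so it can
         * send the data back to the local node itself. */
        forward_request(e->owner);
    } else {
        /* Scenario 3: exclusive request; all other copies are invalidated. */
        for (int i = 0; i < 64; i++)
            if (((e->sharers >> i) & 1) && i != local)
                send_invalidate(i);
        send_data(local);
        e->state   = EXCLUSIVE;
        e->owner   = local;
        e->sharers = (uint64_t)1 << local;
    }
}

int main(void) {
    DirEntry e = { INVALID, -1, 0 };
    home_node_serve(&e, READ, 1);            /* scenario 1 */
    home_node_serve(&e, WRITE_EXCLUSIVE, 2); /* scenario 3 */
    home_node_serve(&e, READ, 3);            /* scenario 2 */
    return 0;
}
```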
Example (diagram from the original slides, not reproduced in the transcript)
Stanford DASH Multiprocessor
• 1st operational multiprocessor to support a scalable coherence protocol
• Demonstrates that scalability and cache coherence are not incompatible
• Built on 2 hypotheses
  – Shared memory machines are easier to program
  – Cache coherence is vital
Past experience
• From experience
  – Memory access times differ widely between physical locations
  – Latency and bandwidth are important for shared memory systems
  – Caching helps amortize the cost of memory access in a memory hierarchy
DASH Multiprocessor
• Relaxed memory consistency model
• Observation
  – Most programs use explicit synchronization
  – Sequential consistency is not necessary
  – Allows the system to perform writes without waiting until all invalidations are performed
• Offers advantages in hiding memory latency (see the sketch below)
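A minimal C11 sketch of the observation above: with explicit release/acquire synchronization, the program remains correct without sequential consistency, which is what frees the hardware to overlap writes with outstanding invalidations. (C11 atomics are a modern stand-in; DASH predates them.)

```c
/* Explicit synchronization makes the program correct under a relaxed
 * memory model: the release/acquire pair on `ready` orders the plain
 * write to `payload`, so no per-write sequential consistency is needed. */
#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>

static int        payload = 0;   /* ordinary shared data */
static atomic_int ready   = 0;   /* explicit synchronization flag */

static void *producer(void *arg) {
    (void)arg;
    payload = 42;   /* plain write: may be buffered or reordered */
    /* Release: all prior writes become visible before the flag is set. */
    atomic_store_explicit(&ready, 1, memory_order_release);
    return NULL;
}

static void *consumer(void *arg) {
    (void)arg;
    /* Acquire: once the flag is seen, the payload is seen too. */
    while (!atomic_load_explicit(&ready, memory_order_acquire))
        ;   /* spin */
    printf("payload = %d\n", payload);   /* guaranteed to print 42 */
    return NULL;
}

int main(void) {
    pthread_t p, c;
    pthread_create(&p, NULL, producer, NULL);
    pthread_create(&c, NULL, consumer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}
```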
DASH Multiprocessor
• Non-binding software prefetch
  – Prefetches data into the cache
  – Maintains coherence
  – Transparent to the user
  – If prefetched data is invalidated, it is simply re-fetched when actually accessed
• Compiler can issue such instructions to help runtime performance
• Helps to hide latency as well (see the sketch below)
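A sketch of non-binding prefetching using the GCC/Clang builtin __builtin_prefetch as a stand-in for DASH's prefetch instruction; like DASH's, it is only a hint and never affects correctness. PREFETCH_DISTANCE is an illustrative tuning parameter, not a value from the lecture.

```c
/* Software prefetch sketch: hint the cache about data needed a few
 * iterations ahead. Non-binding: a dropped or stale prefetch only
 * costs a later miss, never correctness. */
#include <stddef.h>

#define PREFETCH_DISTANCE 16   /* illustrative tuning parameter */

long sum(const long *a, size_t n) {
    long total = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_DISTANCE < n)
            /* Hint: bring a[i+16] toward the cache before it is needed
             * (rw=0 means a read, locality=3 means keep it cached). */
            __builtin_prefetch(&a[i + PREFETCH_DISTANCE], 0, 3);
        total += a[i];
    }
    return total;
}
```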
DASH Multiprocessor
• Remote Access Cache
  – Remote accesses are combined and buffered within individual nodes
  – Can be likened to having a 2-level cache hierarchy (sketched below)
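A toy C sketch of the 2-level lookup idea, assuming small direct-mapped caches; the structures and names are illustrative, not DASH's actual remote access cache design.

```c
/* Two-level lookup: processor cache first, then the node-wide remote
 * access cache, and only on a double miss pay for a remote fetch. */
#include <stdbool.h>
#include <stdio.h>

#define SLOTS 64

typedef struct { bool valid; int addr; int data; } Line;
typedef struct { Line line[SLOTS]; } Cache;

static Cache l1;    /* per-processor cache */
static Cache rac;   /* node-wide remote access cache */

/* Direct-mapped probe: hit only if the slot holds this address. */
static bool probe(Cache *c, int addr, int *data) {
    Line *ln = &c->line[addr % SLOTS];
    if (ln->valid && ln->addr == addr) { *data = ln->data; return true; }
    return false;
}

static void fill(Cache *c, int addr, int data) {
    Line *ln = &c->line[addr % SLOTS];
    ln->valid = true; ln->addr = addr; ln->data = data;
}

/* Stand-in for the expensive trip to the home node. */
static int remote_fetch(int addr) { return addr * 10; }

int load(int addr) {
    int data;
    if (probe(&l1, addr, &data))  return data;   /* level 1 hit */
    if (probe(&rac, addr, &data)) return data;   /* level 2 hit */
    data = remote_fetch(addr);                   /* miss both levels */
    fill(&rac, addr, data);                      /* buffer for the whole node */
    fill(&l1, addr, data);
    return data;
}

int main(void) {
    printf("%d\n", load(7));   /* remote fetch */
    printf("%d\n", load(7));   /* now served locally */
    return 0;
}
```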
Lessons
• High performance requires careful planning of remote data accesses
• Scaling applications depends on other factors
  – Load balancing
  – Limited parallelism
  – Difficulty of scaling an application to use more processors
Challenges
• Programming model?
  – A model that helps programmers reason about code rather than fine-tune for a specific machine
• Fault tolerance and recovery?
  – More computers = higher chance of failure
• Increasing latency?
  – Deeper hierarchies = a larger variety of latencies
Callisto
• Previously, networking gateways
  – Handled a diverse set of services
  – Handled 1000s of channels
  – Involved complex designs with many chips
  – Had high power requirements
• Callisto is a gateway on a chip
  – Used to implement communication gateways for different networks
In a nutshell
• Integrates DSPs, CPUs, RAM, and I/O channels on a chip
• Programmable multi-service platform
• Handles 60 to 240 channels per chip
• An array of Callisto chips can fit in a small space
  – Power efficient
  – Handles a large number of channels