Large Computer Systems CE 140 A1/A2 27 August 2003


Page 1: Large Computer Systems

Large Computer Systems

CE 140 A1/A2, 27 August 2003

Page 2: Large Computer Systems

Rationale

- Although computers are getting faster, the demands are also increasing at least as fast
- High-performance applications: simulations and modeling
- Circuit speed cannot be increased indefinitely; eventually, physical limits will be reached, and quantum mechanical effects will become a problem

Page 3: Large Computer Systems

Rationale

- To handle larger problems, parallel computers are used
- Machine-level parallelism: replicates entire CPUs or portions of them

Page 4: Large Computer Systems

Design Issues

- What are the nature, size, and number of the processing elements?
- What are the nature, size, and number of the memory modules?
- How are the processing and memory elements interconnected?
- What applications are to be run in parallel?

Page 5: Large Computer Systems

Grain Size

- Coarse-grained parallelism: the unit of parallelism is larger; large pieces of software run in parallel with little or no communication between the pieces. Example: large time-sharing systems
- Fine-grained parallelism: parallel programs with a high degree of communication with each other

Page 6: Large Computer Systems

Tightly Coupled versus Loosely Coupled

- Loosely coupled: a small number of large, independent CPUs that have relatively low-speed connections to each other
- Tightly coupled: smaller processing units that work closely together over high-bandwidth connections

Page 7: Large Computer Systems

Design Issues

In most cases:
- Coarse-grained parallelism is well suited for loosely coupled systems
- Fine-grained parallelism is well suited for tightly coupled systems

Page 8: Large Computer Systems

Communication Models

In a parallel computer system, CPUs communicate with each other to exchange information

Two general types:
- Multiprocessors
- Multicomputers

Page 9: Large Computer Systems

Multiprocessors

Shared Memory System
- All processors may share a single virtual address space
- Easy model for programmers
- Global memory: any processor can access any memory module without intervention by another processor

Page 10: Large Computer Systems

Uniform Memory Access (UMA) Multiprocessor

[Diagram: processors P1, P2, ..., Pn and memory modules M1, M2, ..., Mk all attached to a common interconnection network]

Page 11: Large Computer Systems

Non-Uniform Memory Access (NUMA) Multiprocessor

[Diagram: each processor Pi paired with a local memory module Mi; processors reach remote memory modules through the interconnection network]

Page 12: Large Computer Systems

Multiprocessor

Page 13: Large Computer Systems

Multicomputers

Distributed Memory System
- Each CPU has its own private memory
- Local/private memory: a processor cannot access a remote memory without the cooperation of the remote processor
- Cooperation takes place in the form of a message passing protocol
- Programming a multicomputer is much more difficult than programming a multiprocessor

Page 14: Large Computer Systems

Distributed Memory System

[Diagram: processors P1, P2, ..., Pn, each with its own private memory module M1, M2, ..., Mn, communicating over the interconnection network]

Page 15: Large Computer Systems

Distributed Memory System

Page 16: Large Computer Systems

Multiprocessors versus Multicomputers

- Multiprocessors are easier to program
- But multicomputers are much simpler and cheaper to build
- Goal: large computer systems that combine the best of both worlds

Page 17: Large Computer Systems

Taxonomy of Large Computer Systems

Instruction Streams | Data Streams | Name | Examples
------------------- | ------------ | ---- | --------
1                   | 1            | SISD | Classical von Neumann machine
1                   | Multiple     | SIMD | Vector supercomputer, array processor
Multiple            | 1            | MISD | None
Multiple            | Multiple     | MIMD | Multiprocessor, multicomputer
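The SISD/SIMD distinction in the table can be given a rough software analogy (real SIMD is done in hardware vector registers; this Python fragment only conveys the idea of one instruction stream applied to many data elements):

```python
# Rough illustration of the SIMD idea: a single operation ("add 1")
# is applied uniformly across a whole vector of data elements, rather
# than being re-decided per element as in an SISD instruction stream.
data = [1, 2, 3, 4, 5, 6, 7, 8]      # multiple data streams
result = [x + 1 for x in data]       # one "instruction" applied to all elements
print(result)                        # [2, 3, 4, 5, 6, 7, 8, 9]
```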

Page 18: Large Computer Systems

Taxonomy of Large Computer Systems

Page 19: Large Computer Systems

Symmetric MultiProcessors (SMP)

- Multiprocessor architecture where all processors can access all memory locations uniformly
- Processors also share I/O
- SMP is classified as a UMA architecture
- SMP is the simplest multiprocessor system
- Any processor can execute either the OS kernel or user programs

Page 20: Large Computer Systems

SMP

- Performance improves if programs can be run in parallel
- Increased availability: if one processor breaks down, the system does not stop running
- Performance is also improved incrementally by adding processors
- Does not scale well beyond 16 processors

Page 21: Large Computer Systems

SMP

Page 22: Large Computer Systems

SMP

Page 23: Large Computer Systems

Clusters

A group of whole computers connected together to function as a parallel computer

Popular implementation: Linux computers using Beowulf clustering software

Page 24: Large Computer Systems

Clusters

- High availability: redundant resources
- Scalability
- Affordable: off-the-shelf parts

Page 25: Large Computer Systems

Clusters

Cyborg Cluster, Drexel University
32 nodes, dual P3 per node

Page 26: Large Computer Systems

Clusters

Page 27: Large Computer Systems

Memory Organization

Shared Memory System (Multiprocessors)
- Each processor may also have a cache
- Convenient to have a global address space
- For NUMA, accesses to remote memory may be slower than accesses to local memory

Distributed Memory System (Multicomputers)
- Private address space for each processor
- Easiest way to connect computers into a large system
- Data sharing is implemented through message passing

Page 28: Large Computer Systems

Issues

- When processors share data, all processors must see the same value for a given data item
- When a processor updates its cache, it must also update the caches of the other processors, or invalidate the other processors' copies
- Shared data must be coherent

Page 29: Large Computer Systems

Cache Coherence

All cached copies of shared data must have the same value at all times

Page 30: Large Computer Systems

Snooping Caches

So-called because individual caches “snoop” on the bus

Page 31: Large Computer Systems

Write-Through Protocol

- Write-Through with Update (Write Update): update the cache and memory, and update the caches of the other processors
- Write-Through without Update (Write Invalidate): update the cache and memory, and invalidate the copies in the caches of the other processors
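The write-invalidate variant can be sketched with toy classes (the `Bus` and `Cache` names here are hypothetical, not any real simulator's API): every cache watches writes on a shared bus, and on seeing another cache's write it drops its own copy of that address while memory is updated write-through.

```python
# Toy sketch of write-through with invalidate ("snooping" on a bus):
# a write goes through to memory and invalidates every other cache's
# copy of that address.
class Bus:
    def __init__(self):
        self.caches = []
        self.memory = {}

    def write(self, writer, addr, value):
        self.memory[addr] = value               # write-through to memory
        for cache in self.caches:
            if cache is not writer:
                cache.lines.pop(addr, None)     # snooped write: invalidate copy

class Cache:
    def __init__(self, bus):
        self.lines = {}
        self.bus = bus
        bus.caches.append(self)

    def read(self, addr):
        if addr not in self.lines:              # miss: fetch from memory
            self.lines[addr] = self.bus.memory[addr]
        return self.lines[addr]

    def write(self, addr, value):
        self.lines[addr] = value
        self.bus.write(self, addr, value)       # broadcast the write on the bus

bus = Bus()
c1, c2 = Cache(bus), Cache(bus)
bus.memory[0x10] = 1
c1.read(0x10)
c2.read(0x10)                 # both caches now hold the line
c1.write(0x10, 2)             # c2's stale copy is invalidated
print(c2.read(0x10))          # 2: c2 misses and re-fetches the new value
```

The invalidate on every write is exactly the broadcast traffic that the later slide on snoopy cache issues complains about.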

Page 32: Large Computer Systems

Write-Back Protocol

- When a processor wants to write to a block, it must acquire exclusive control/ownership of the block
- All other copies are invalidated
- The block's contents may then be changed at any time
- When another processor requests to read the block, the owner sends the block to the requesting processor and returns control of the block to the memory module, which updates the block to contain the latest value

Page 33: Large Computer Systems

MESI Protocol

Popular write-back cache coherence protocol, named after the initials of the four possible states of each cache line:
- Modified: entry is valid; memory is invalid; no other copies exist
- Exclusive: no other cache holds the line; memory is up to date
- Shared: multiple caches may hold the line; memory is up to date
- Invalid: cache entry does not contain valid data
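The four states can be tied together with a small transition table for a single cache line. This is a simplified sketch, not the full protocol: bus transactions and data transfers are omitted, and a read miss is assumed to fill the line in the Shared state (a real MESI cache fills in Exclusive when no other cache holds the line).

```python
# Compact sketch of MESI state transitions for one cache line,
# covering the common local and snooped events. Events not listed
# leave the state unchanged.
TRANSITIONS = {
    # (current state, event) -> next state
    ("I", "local_read"):  "S",   # read miss; simplified: assume a shared fill
    ("I", "local_write"): "M",   # write miss: take ownership, others invalidate
    ("S", "local_write"): "M",   # must invalidate the other copies first
    ("S", "snoop_write"): "I",   # another cache is writing: drop our copy
    ("E", "local_write"): "M",   # already exclusive: silent upgrade, no bus traffic
    ("E", "snoop_read"):  "S",   # another cache reads the line: now shared
    ("M", "snoop_read"):  "S",   # supply the data, write back, become shared
    ("M", "snoop_write"): "I",   # another writer takes ownership
}

def next_state(state, event):
    return TRANSITIONS.get((state, event), state)

line = "I"
for event in ["local_read", "snoop_write", "local_write"]:
    line = next_state(line, event)
print(line)   # "M": invalid -> shared -> invalidated by a peer -> modified locally
```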

Page 34: Large Computer Systems
Page 35: Large Computer Systems

Snoopy Cache Issues

Snoopy caches require broadcasting information over the bus, which leads to increased bus traffic as the system grows in size

Page 36: Large Computer Systems

Directory Protocols

- Uses a directory that keeps track of the locations where copies of a given data item are present
- Eliminates the need for broadcasts
- If the directory is centralized, the directory becomes a bottleneck
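The broadcast-elimination point can be made concrete with a toy directory (a hypothetical structure, not any real machine's): the directory records which caches hold each block, so a write only notifies the recorded sharers point-to-point instead of broadcasting to every cache on a bus.

```python
# Toy sketch of a directory protocol: the directory maps each block
# address to the set of caches holding a copy, so invalidations on a
# write go only to actual sharers.
class Directory:
    def __init__(self):
        self.sharers = {}                       # block address -> set of cache ids

    def record_read(self, addr, cache_id):
        self.sharers.setdefault(addr, set()).add(cache_id)

    def invalidate_for_write(self, addr, writer_id):
        # Return only the caches that actually hold the block
        # (no broadcast), then leave the writer as the sole holder.
        targets = self.sharers.get(addr, set()) - {writer_id}
        self.sharers[addr] = {writer_id}
        return targets

directory = Directory()
directory.record_read(0x40, "cache0")
directory.record_read(0x40, "cache1")
directory.record_read(0x40, "cache2")
to_invalidate = directory.invalidate_for_write(0x40, "cache0")
print(sorted(to_invalidate))   # ['cache1', 'cache2']: point-to-point, not broadcast
```

A single shared dictionary like this also makes the centralized-directory bottleneck visible: every read and write must consult the same structure.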

Page 37: Large Computer Systems

Performance

According to Amdahl’s law, introducing machine parallelism will not have a significant effect on performance if the program cannot take advantage of the parallel architecture

Not all programs parallelize well
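Amdahl's law can be stated as a one-line formula: if a fraction f of a program is inherently serial, the speedup on n processors is at most 1 / (f + (1 - f) / n). A quick computation shows why a small serial fraction dominates at scale.

```python
# Amdahl's law: the speedup on n processors of a program whose
# serial fraction is f is bounded by 1 / (f + (1 - f) / n).
def amdahl_speedup(serial_fraction, n_processors):
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_processors)

# Even with 95% of the work parallelizable, 1024 processors
# deliver less than a 20x speedup:
print(round(amdahl_speedup(0.05, 1024), 1))   # 19.6
```

As n grows, the speedup approaches 1/f (here 1/0.05 = 20) no matter how many processors are added, which is the slide's point that machine parallelism only helps programs that parallelize well.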

Page 38: Large Computer Systems

Performance

Page 39: Large Computer Systems

Scalability Issues

Page 40: Large Computer Systems

Scalability Issues

- Bandwidth
- Latency
- Both depend on the interconnection topology