architecture and compilers part ii - cs.fsu.eduengelen/courses/hpc-2008/architecture2.pdf
Architecture and Compilers PART II
HPC Fall 2008 Prof. Robert van Engelen
HPC Fall 2008 2 10/14/08
Overview
- PMS models
- Shared memory multiprocessors: basic shared memory systems; SMP, multicore, and COMA
- Distributed memory multicomputers: MPP systems; network topologies for message-passing multicomputers; distributed shared memory
- Pipeline and vector processors
- Comparison
- Taxonomies
PMS Architecture Models
Processor (P): a device that performs operations on data.
Memory (M): a device that stores data.
Switch (S): a device that facilitates transfer of data between devices.
Arcs denote connectivity.
Example: a computer system with CPU and peripherals as a simple PMS model.
Shared Memory Multiprocessor
Processors access shared memory via a common switch, e.g. a bus. Problem: a single bus becomes a bottleneck.
Shared memory has a single address space. This architecture is sometimes referred to as a "dance hall".
Bus Contention
Each processor competes for access to shared memory: fetching instructions, loading and storing data.
Bus contention: access to memory is restricted to one processor at a time, which limits the speedup and scalability with respect to the number of processors.
Assume that each instruction requires m memory operations (load/store), that F instructions are performed per unit of time, and that a maximum of W words can be moved over the bus per unit of time. Then

  SP < W / (m F)

regardless of the number of processors P. In other words, efficiency is limited unless

  P < W / (m F)
Work-Memory Ratio
The work-memory ratio (also known as the FP:M ratio) is the ratio of the number of floating-point operations to the number of distinct memory locations referenced.
Note that the definition counts distinct memory locations: the same location is counted just once. This implicitly accounts for the use of registers and cache.
Efficient utilization of shared memory multiprocessors requires

  P < FP:M × W / F

Examples:

  for (i=0; i<1000; i++) x = x + 2*i;    // 2 distinct locations: FP:M = 500
  for (i=0; i<1000; i++) x = x + a[i];   // 1001 distinct locations: FP:M = 1000:1001 ≈ 1
Shared Memory Multiprocessor with Local Cache
Add a local cache to improve performance when W / F is small. With today's systems we have W / F << 1.
Problem: how to ensure cache coherence?
Cache Coherence
A cache coherence protocol ensures that processors obtain newly altered data when shared data is modified by another processor.
Because caches operate on cache lines, more data than the shared object alone can be affected, which may lead to false sharing.
Example: thread 1 modifies shared data; thread 0 then reads the modified shared data; the cache coherence protocol ensures that thread 0 obtains the newly altered data.
COMA
Cache-only memory architecture (COMA): a large cache per processor replaces shared memory.
A data item is either in one cache (non-shared) or in multiple caches (shared).
The switch includes an engine that provides a single global address space and ensures cache coherence.
Distributed Memory Multicomputer
Massively parallel processor (MPP) systems with P > 1000.
Communication via message passing; nonuniform memory access (NUMA).
Network topologies: mesh, hypercube, cross-bar switch.
Computation-Communication Ratio
The computation-communication ratio: tcomp / tcomm
Usually assessed analytically and/or measured empirically
High communication overhead decreases speedup, so the ratio should be as high as possible.
For example, with data size n and P processors, suppose tcomp / tcomm = 1000n / 10n².
(Figure: computation/communication ratio tcomp / tcomm = 1000n / 10n² and the resulting speedup versus data size n, for P = 1, 2, 4, 8.)
SP = ts / tP = 1000n / (1000n/P + 10n²)
Mesh
A network of P nodes has mesh size √P × √P.
Diameter: 2(√P − 1).
A torus network wraps the ends, reducing the diameter to 2⌊√P/2⌋ ≈ √P.
Hypercube
A d-dimensional hypercube has P = 2^d nodes.
Diameter is d = log2 P.
Node addressing is simple: the node numbers of nearest-neighbor nodes differ in exactly one bit.
A routing algorithm flips bits to determine possible paths; e.g. from node 001 to 111 there are two shortest paths:
  001 → 011 → 111
  001 → 101 → 111
(Figures: hypercubes of dimension d = 2, 3, 4.)
Cross-bar Switches
Processors and memories are connected by a set of switches.
This enables simultaneous (contention-free) communication between processor i and memory σ(i), where σ is an arbitrary permutation of 1…P.
Example: σ(1)=2, σ(2)=1, σ(3)=3.
Multistage Interconnect Network
Each switch has an upper output (0) and a lower output (1).
A message travels through a switch based on the destination address: each bit of the destination address controls one switch along the path from source to destination.
For example, from 001 to 100: the first switch selects the lower output (1), the second switch selects the upper output (0), and the third switch selects the upper output (0).
Contention can occur when two messages are routed through the same switch.
(Figures: an 8×8 three-stage interconnect and a 4×4 two-stage interconnect.)
Distributed Shared Memory
Distributed shared memory (DSM) systems use physically distributed memory modules and a global address space that gives the illusion of shared virtual memory
Hardware is used to automatically translate a memory address into a local address or a remote memory address (via message passing)
Software approaches add a programming layer to simplify access to shared objects (hiding the communications)
Pipeline and Vector Processors
Vector processors run operations on multiple data elements simultaneously
Vector processor has a maximum vector length, e.g. 512
Strip mining the loop results in an outer loop with stride 512 to enable vectorization of longer vector operations
Pipelined vector architectures dispatch multiple vector operations per clock cycle
Vector chaining allows the result of a previous vector operation to be directly fed into the next operation in the pipeline
DO i = 0,9999
  z(i) = x(i) + y(i)
ENDDO

Strip-mined with maximum vector length 512:

DO j = 0,9216,512
  DO i = 0,511
    z(j+i) = x(j+i) + y(j+i)
  ENDDO
ENDDO
DO i = 0,271
  z(9728+i) = x(9728+i) + y(9728+i)
ENDDO

The same in Fortran array syntax:

DO j = 0,9216,512
  z(j:j+511) = x(j:j+511) + y(j:j+511)
ENDDO
z(9728:9999) = x(9728:9999) + y(9728:9999)
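For comparison, a C rendering of the same strip-mined loop; VLEN = 512 mirrors the slide's assumed maximum vector length, and computing the chunk length handles the 272-element remainder without a separate cleanup loop.

```c
#include <stddef.h>

#define N    10000
#define VLEN 512   /* assumed maximum vector length, as on the slide */

float x[N], y[N], z[N];

/* Strip-mined z = x + y: the outer loop steps in chunks of VLEN so
   each inner loop fits the vector length; the final partial chunk
   (272 elements, indices 9728..9999) gets len < VLEN. */
void add_strips(void) {
    for (size_t j = 0; j < N; j += VLEN) {
        size_t len = (N - j < VLEN) ? N - j : VLEN;
        for (size_t i = 0; i < len; i++)   /* vectorizable inner loop */
            z[j + i] = x[j + i] + y[j + i];
    }
}
```

A vectorizing compiler performs this transformation automatically; writing it out only makes the outer/inner loop structure explicit.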
Comparison: Bandwidth, Latency and Capacity
Flynn’s Taxonomy
Single instruction stream, single data stream (SISD): a traditional PC system.
Single instruction stream, multiple data stream (SIMD): similar to the MMX/SSE/AltiVec multimedia instruction sets; MASPAR.
Multiple instruction stream, multiple data stream (MIMD). Single program, multiple data (SPMD) programming: each processor executes a copy of the program.
                           Data stream
                       single     multiple
  Instruction  single   SISD       SIMD
  stream       multiple            MIMD
Task Parallelism versus Data Parallelism
Task parallelism: thread-level parallelism; MIMD.
Data parallelism: multiple processors (or units) operate on a segmented data set; SIMD, vector, and pipeline machines; SIMD-like multimedia extensions, e.g. MMX/SSE/AltiVec.
Vector operation X[0:3] ⊕ Y[0:3] with an SSE instruction on a Pentium 4:

  src1:  X3      X2      X1      X0
  src2:  Y3      Y2      Y1      Y0
  dest:  X3⊕Y3   X2⊕Y2   X1⊕Y1   X0⊕Y0
Further Reading
[PP2] pages 13-26; [SPC] pages 71-95; [HPC] pages 25-28.
Optional: [SRC] pages 15-42.