lectures11-12
TRANSCRIPT
Datorarkitektur F 11/12 - 1
Petru Eles, IDA, LiTH
ARCHITECTURES FOR PARALLEL COMPUTATION
1. Why Parallel Computation
2. Parallel Programs
3. A Classification of Computer Architectures
4. Performance of Parallel Architectures
5. The Interconnection Network
6. Array Processors
7. Multiprocessors
8. Multicomputers
9. Vector Processors
10. Multimedia Extensions to Microprocessors
Datorarkitektur F 11/12 - 2
Petru Eles, IDA, LiTH
Why Parallel Computation?
The need for high performance!
Two main factors contribute to the high performance of modern processors:
1. Fast circuit technology
2. Architectural features:
- large caches
- multiple fast buses
- pipelining
- superscalar architectures (multiple functional units)
However
Computers running with a single CPU are often not able to meet performance needs in certain areas:
- Fluid flow analysis and aerodynamics;
- Simulation of large complex systems, for example in physics, economics, biology, engineering;
- Computer aided design;
- Multimedia.
Applications in the above domains are characterized by a very high amount of numerical computation and/or a high quantity of input data.
Datorarkitektur F 11/12 - 3
Petru Eles, IDA, LiTH
A Solution: Parallel Computers
One solution to the need for high performance: architectures in which several CPUs are running in order to solve a certain application.
Such computers have been organized in very different ways. Some key features:
- number and complexity of individual CPUs
- availability of common (shared) memory
- interconnection topology
- performance of interconnection network
- I/O devices
- - - - - - - - - - - - - -
Datorarkitektur F 11/12 - 4
Petru Eles, IDA, LiTH
Parallel Programs
1. Parallel sorting
[Figure: the unsorted array is split into four parts (Unsorted-1 to Unsorted-4); four tasks Sort-1 to Sort-4 produce Sorted-1 to Sorted-4, which are then combined by Merge into the final SORTED array.]
Datorarkitektur F 11/12 - 5
Petru Eles, IDA, LiTH
Parallel Programs (contd)
A possible program for parallel sorting:
var t: array[1..1000] of integer;
- - - - - - - - - - -
procedure sort(i,j:integer);
-sort elements between t[i] and t[j]-
end sort;
procedure merge;
- - merge the four sub-arrays - -
end merge;
- - - - - - - - - - -
begin
- - - - - - - -
cobegin
sort(1,250)|
sort(251,500)|
sort(501,750)|
sort(751,1000)
coend;
merge;
- - - - - - - -
end;
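A minimal C sketch of the same cobegin/coend structure, using POSIX threads; the fixed four-way split, qsort as the per-part sort, and the lazy final merge (a full re-sort) are illustrative assumptions only:

/* Four concurrent sorts followed by a merge, mirroring the program above. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define N 1000
static int t[N];

struct range { int lo, hi; };                 /* 0-based, inclusive */

static int cmp_int(const void *a, const void *b)
{
    return *(const int *)a - *(const int *)b;
}

static void *sort_part(void *arg)             /* plays the role of sort(i,j) */
{
    struct range *r = arg;
    qsort(t + r->lo, r->hi - r->lo + 1, sizeof(int), cmp_int);
    return NULL;
}

int main(void)
{
    pthread_t th[4];
    struct range parts[4] = { {0,249}, {250,499}, {500,749}, {750,999} };

    for (int i = 0; i < N; i++)               /* some unsorted test data */
        t[i] = rand() % 10000;

    for (int i = 0; i < 4; i++)               /* cobegin: start the four sorts */
        pthread_create(&th[i], NULL, sort_part, &parts[i]);
    for (int i = 0; i < 4; i++)               /* coend: wait for all of them */
        pthread_join(th[i], NULL);

    qsort(t, N, sizeof(int), cmp_int);        /* "merge": re-sort for brevity */

    printf("t[0]=%d t[%d]=%d\n", t[0], N - 1, t[N - 1]);
    return 0;
}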
Datorarkitektur F 11/12 - 6
Petru Eles, IDA, LiTH
Parallel Programs (contd)
2. Matrix addition:
var a: array[1..n,1..m] of integer;
b: array[1..n,1..m] of integer;
c: array[1..n,1..m] of integer;
i,j:integer
- - - - - - - - - - -
begin
- - - - - - - -
for i:=1 to n do
for j:=1 to m do
c[i,j]:=a[i,j]+b[i,j];
end for
end for
- - - - - - - -
end;
[Figure: element-wise matrix addition; the matrices a and b are added to give c, with c[i,j] = a[i,j] + b[i,j] for every element.]
Datorarkitektur F 11/12 - 7
Petru Eles, IDA, LiTH
Parallel Programs (contd)
Matrix addition - parallel version:
var a: array[1..n,1..m] of integer;
b: array[1..n,1..m] of integer;
c: array[1..n,1..m] of integer;
i:integer
- - - - - - - - - - -
procedure add_vector(n_ln:integer);
var j:integer
begin
for j:=1 to m do
c[n_ln,j]:=a[n_ln,j]+b[n_ln,j];
end for
end add_vector;
begin
- - - - - - - -
cobegin
for i:=1 to n do
add_vector(i);
coend;
- - - - - - - -
end;
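A compact C version of the row-parallel idea above, where an OpenMP pragma plays the role of cobegin/coend by distributing the rows (the calls of add_vector) over threads; the matrix sizes and test data are assumptions for illustration:

/* Row-parallel matrix addition; compiles and runs serially without OpenMP. */
#include <stdio.h>

#define N 4
#define M 5

int main(void)
{
    int a[N][M], b[N][M], c[N][M];

    for (int i = 0; i < N; i++)               /* fill a and b with test data */
        for (int j = 0; j < M; j++) {
            a[i][j] = i + j;
            b[i][j] = i * j;
        }

    #pragma omp parallel for                  /* one row per parallel task */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < M; j++)
            c[i][j] = a[i][j] + b[i][j];

    printf("c[%d][%d] = %d\n", N - 1, M - 1, c[N - 1][M - 1]);
    return 0;
}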
Datorarkitektur F 11/12 - 8
Petru Eles, IDA, LiTH
Parallel Programs (contd)
Matrix addition - vector computation version:
var a: array[1..n,1..m] of integer;
b: array[1..n,1..m] of integer;
c: array[1..n,1..m] of integer;
i,j:integer
- - - - - - - - - - -
begin
- - - - - - - -
for i:=1 to n do
c[i,1:m]:=a[i,1:m]+b[i,1:m];
end for;
- - - - - - - -
end;
Or even so:
begin
- - - - - - - -
c[1:n,1:m]:=a[1:n,1:m]+b[1:n,1:m];
- - - - - - - -
end;
Datorarkitektur F 11/12 - 9
Petru Eles, IDA, LiTH
Parallel Programs (contd)
Pipeline model of computation:

y = 5 * sqrt(45 + log(x))

computed as a two-stage pipeline:

a = 45 + log(x)
y = 5 * sqrt(a)
Datorarkitektur F 11/12 - 10
Petru Eles, IDA, LiTH
Parallel Programs (contd)
A program for the previous computation:
channel ch:real;
- - - - - - - - -
cobegin
var x:real;
while true do
read(x);
send(ch,45+log(x));
end while |
var v:real;
while true do
receive(ch,v);
write(5*sqrt(v));
end while
coend;
- - - - - - - - -
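The same two-stage pipeline can be sketched in C with two processes and a pipe standing in for the channel; the input values fed to the first stage are arbitrary illustration data:

/* Stage 1 (parent) computes 45 + log(x) and sends it over the "channel";
 * stage 2 (child) receives each value v and prints 5 * sqrt(v). */
#include <math.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    int ch[2];                                /* ch[0]: receive, ch[1]: send */
    double x, v;

    if (pipe(ch) != 0) return 1;

    if (fork() == 0) {                        /* second stage: y = 5*sqrt(v) */
        close(ch[1]);
        while (read(ch[0], &v, sizeof v) > 0)
            printf("%f\n", 5.0 * sqrt(v));
        return 0;
    }

    close(ch[0]);                             /* first stage: a = 45+log(x) */
    for (x = 1.0; x <= 5.0; x += 1.0) {
        double a = 45.0 + log(x);
        write(ch[1], &a, sizeof a);           /* send(ch, 45+log(x)) */
    }
    close(ch[1]);                             /* closing the channel stops stage 2 */
    wait(NULL);
    return 0;
}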
Datorarkitektur F 11/12 - 11
Petru Eles, IDA, LiTH
Flynn's Classification of Computer Architectures
Flynn's classification is based on the nature of the instruction flow executed by the computer and that of the data flow on which the instructions operate.
1. Single Instruction stream, Single Data stream (SISD)
[Figure: SISD - a single CPU consisting of a control unit and a processing unit; the control unit receives the instruction stream from memory, and the processing unit exchanges a single data stream with memory.]
Datorarkitektur F 11/12 - 12
Petru Eles, IDA, LiTH
Flynn's Classification (contd)
2. Single Instruction stream, Multiple Data stream (SIMD)
SIMD with shared memory
[Figure: SIMD with shared memory - one control unit broadcasts a single instruction stream (IS) to processing units 1..n; each processing unit handles its own data stream (DS1..DSn) through an interconnection network to the shared memory.]
Datorarkitektur F 11/12 - 13
Petru Eles, IDA, LiTH
Flynn's Classification (contd)
SIMD with no shared memory
[Figure: SIMD with no shared memory - one control unit broadcasts the instruction stream (IS) to processing units 1..n; each processing unit has its own local memory (LM1..LMn) supplying its data stream (DS1..DSn), and the units are connected by an interconnection network.]
Datorarkitektur F 11/12 - 14
Petru Eles, IDA, LiTH
Flynn's Classification (contd)
3. Multiple Instruction stream, Multiple Data stream (MIMD)
MIMD with shared memory
[Figure: MIMD with shared memory - CPUs 1..n, each consisting of a control unit, a processing unit, and a local memory; each control unit issues its own instruction stream (IS1..ISn) and each processing unit works on its own data stream (DS1..DSn); the CPUs are connected through an interconnection network to a shared memory.]
Datorarkitektur F 11/12 - 15
Petru Eles, IDA, LiTH
Flynn's Classification (contd)
MIMD with no shared memory
[Figure: MIMD with no shared memory - CPUs 1..n, each with its own control unit, processing unit, and local memory (LM1..LMn), connected only by the interconnection network; there is no shared memory.]
Datorarkitektur F 11/12 - 16
Petru Eles, IDA, LiTH
Performance of Parallel Architectures
Important questions:
How fast does a parallel computer run at its maximal potential?
How fast an execution can we expect from a parallel computer for a concrete application?
How do we measure the performance of a parallel computer and the performance improvement we get by using such a computer?
Datorarkitektur F 11/12 - 17
Petru Eles, IDA, LiTH
Performance Metrics
Peak rate: the maximal computation rate that can be theoretically achieved when all modules are fully utilized.
The peak rate is of no practical significance for the user. It is mostly used by vendor companies for marketing of their computers.
Speedup: measures the gain we get by using a certain parallel computer to run a given parallel program in order to solve a specific problem.
TS: execution time needed with the best sequential algorithm;
TP: execution time needed with the parallel algorithm.

S = TS / TP
Datorarkitektur F 11/12 - 18
Petru Eles, IDA, LiTH
Performance Metrics (contd)
Efficiency: this metric relates the speedup to the number of processors used; by this it provides a measure of the efficiency with which the processors are used.

E = S / p

S: speedup;
p: number of processors.
For the ideal situation, in theory:

S = TS / (TS / p) = p ; which means E = 1

Practically the ideal efficiency of 1 cannot be achieved!
Datorarkitektur F 11/12 - 19
Petru Eles, IDA, LiTH
Amdahl's Law
Consider f to be the ratio of computations that, according to the algorithm, have to be executed sequentially (0 ≤ f ≤ 1); p is the number of processors.
TP = f * TS + (1 - f) * TS / p

S = TS / (f * TS + (1 - f) * TS / p) = 1 / (f + (1 - f) / p)
[Figure: speedup S (vertical axis, 1 to 10) plotted against f (horizontal axis, 0.2 to 1.0); S drops rapidly toward 1 as f grows.]
Datorarkitektur F 11/12 - 20
Petru Eles, IDA, LiTH
Amdahl's Law (contd)
Amdahl's law: even a small ratio of sequential computation imposes a certain limit on speedup; a speedup higher than 1/f cannot be achieved, regardless of the number of processors.
To efficiently exploit a high number of processors, f must be small (the algorithm has to be highly parallel).
E = S / p = 1 / (f * (p - 1) + 1)
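A small C illustration of these formulas; the values of f and p below are arbitrary examples, chosen only to show how quickly the sequential fraction dominates:

/* Amdahl's law: S = 1/(f + (1-f)/p), E = S/p = 1/(f*(p-1) + 1). */
#include <stdio.h>

static double speedup(double f, int p)    { return 1.0 / (f + (1.0 - f) / p); }
static double efficiency(double f, int p) { return 1.0 / (f * (p - 1) + 1.0); }

int main(void)
{
    double f = 0.1;                           /* 10% sequential computation */
    int procs[] = { 2, 8, 64, 1024 };

    for (int i = 0; i < 4; i++)
        printf("p=%4d  S=%6.2f  E=%5.3f\n",
               procs[i], speedup(f, procs[i]), efficiency(f, procs[i]));

    printf("limit for p -> infinity: S <= 1/f = %.1f\n", 1.0 / f);
    return 0;
}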
Datorarkitektur F 11/12 - 21
Petru Eles, IDA, LiTH
Other Aspects which Limit the Speedup
Besides the intrinsic sequentiality of some parts of an algorithm, there are also other factors that limit the achievable speedup:
- communication cost
- load balancing of processors
- costs of creating and scheduling processes
- I/O operations
There are many algorithms with a high degree of parallelism; for such algorithms the value of f is very small and can be ignored. These algorithms are suited for massively parallel systems; in such cases the other limiting factors, like the cost of communications, become critical.
Datorarkitektur F 11/12 - 22
Petru Eles, IDA, LiTH
Efficiency and Communication Cost
Consider a highly parallel computation, so that f (the ratio of sequential computations) can be neglected.
We define fc, the fractional communication overhead of a processor:

fc = Tcomm / Tcalc

Tcalc: time that a processor executes computations;
Tcomm: time that a processor is idle because of communication.
With algorithms that have a high degree of parallelism, massively parallel computers, consisting of a large number of processors, can be efficiently used if fc is small; this means that the time spent by a processor for communication has to be small compared to its effective time of computation.
In order to keep fc reasonably small, the size of processes cannot go below a certain limit.
TP = (TS / p) * (1 + fc)

S = TS / TP = p / (1 + fc)

E = 1 / (1 + fc) ≈ 1 - fc
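As a quick numerical sketch of these relations (the timing values and processor count below are invented for illustration):

/* fc = Tcomm/Tcalc, S = p/(1+fc), E = 1/(1+fc), close to 1 - fc for small fc. */
#include <stdio.h>

int main(void)
{
    double t_calc = 95.0, t_comm = 5.0;       /* per-processor time units */
    int p = 128;

    double fc = t_comm / t_calc;
    double S  = p / (1.0 + fc);
    double E  = 1.0 / (1.0 + fc);

    printf("fc=%.3f  S=%.1f (of %d processors)  E=%.3f\n", fc, S, p, E);
    return 0;
}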
Datorarkitektur F 11/12 - 23
Petru Eles, IDA, LiTH
The Interconnection Network
The interconnection network (IN) is a key component of the architecture. It has a decisive influence on the overall performance and cost.
The traffic in the IN consists of data transfer and transfer of commands and requests.
The key parameters of the IN are:
- total bandwidth: transferred bits/second
- cost
Datorarkitektur F 11/12 - 24
Petru Eles, IDA, LiTH
The Interconnection Network (contd)
Single Bus
Single bus networks are simple and cheap.
One single communication is allowed at a time; the bandwidth is shared by all nodes.
Performance is relatively poor.
In order to keep a certain performance, the number of nodes is limited (16 - 20).
[Figure: nodes 1..n attached to a single shared bus.]
Datorarkitektur F 11/12 - 25
Petru Eles, IDA, LiTH
The Interconnection Network (contd)
Completely connected network
Each node is connected to every other one.
Communications can be performed in parallel between any pair of nodes.
Both performance and cost are high.
Cost increases rapidly with number of nodes.
[Figure: five nodes, each connected directly to every other node.]
Datorarkitektur F 11/12 - 26
Petru Eles, IDA, LiTH
The Interconnection Network (contd)
Crossbar network
The crossbar is a dynamic network: the interconnection topology can be modified by positioning of switches.
The crossbar switch is completely connected: any node can be directly connected to any other.
Fewer interconnections are needed than for the static completely connected network; however, a large number of switches is needed.
A large number of communications can be performed in parallel (a certain node can receive or send only one data item at a time).
[Figure: n nodes connected through a rectangular grid of switches (the crossbar).]
Datorarkitektur F 11/12 - 27
Petru Eles, IDA, LiTH
The Interconnection Network (contd)
Mesh network
Mesh networks are cheaper than completely connected ones and provide relatively good performance.
In order to transmit data between certain nodes, routing through intermediate nodes is needed (maximum 2*(n-1) intermediate steps for an n*n mesh); see the sketch after the figure.
It is possible to provide wraparound connections: between nodes 1 and 13, 2 and 14, etc.
Three-dimensional meshes have also been implemented.
[Figure: a 4 x 4 mesh of nodes 1..16; each node is connected to its horizontal and vertical neighbours.]
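A small C sketch of the routing distance in such a mesh (without wraparound), assuming the row-by-row numbering of the figure with nodes 0..n*n-1 internally; the hop count is the sum of the row and column distances:

/* Manhattan routing distance in an n x n mesh; at most 2*(n-1) steps. */
#include <stdio.h>
#include <stdlib.h>

static int mesh_hops(int s, int d, int n)
{
    int dr = abs(s / n - d / n);              /* row distance    */
    int dc = abs(s % n - d % n);              /* column distance */
    return dr + dc;
}

int main(void)
{
    /* Node1..Node16 of the figure correspond to 0..15 here */
    printf("hops from Node1 to Node16: %d\n", mesh_hops(0, 15, 4));   /* 6 */
    return 0;
}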
Datorarkitektur F 11/12 - 28
Petru Eles, IDA, LiTH
The Interconnection Network (contd)
Hypercube network
2^n nodes are arranged in an n-dimensional cube. Each node is connected to n neighbours.
In order to transmit data between certain nodes, routing through intermediate nodes is needed (maximum n intermediate steps); see the sketch after the figure.
[Figure: a four-dimensional hypercube with nodes N0..N15.]
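A C sketch of the routing distance in a hypercube, assuming the usual labelling in which neighbouring nodes differ in exactly one bit of their binary labels; the number of links to cross is the number of differing bits, at most n:

/* Hypercube routing distance = Hamming distance between node labels. */
#include <stdio.h>

static int hypercube_hops(unsigned s, unsigned d)
{
    unsigned diff = s ^ d;                    /* bits where s and d differ */
    int hops = 0;
    while (diff) {                            /* count the set bits */
        hops += diff & 1u;
        diff >>= 1;
    }
    return hops;
}

int main(void)
{
    /* in the 4-dimensional cube of the figure, N0 to N15 needs 4 hops */
    printf("hops from N0 to N15: %d\n", hypercube_hops(0, 15));
    return 0;
}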
Datorarkitektur F 11/12 - 29
Petru Eles, IDA, LiTH
SIMD Computers
SIMD computers are usually called array processors.
PUs are usually very simple: an ALU which executes the instruction broadcast by the CU, a few registers, and some local memory.
The first SIMD computer:
- ILLIAC IV (1970s): 64 relatively powerful processors (mesh connection, see above).
Contemporary commercial computer:
- CM-2 (Connection Machine) by Thinking Machines Corporation: 65 536 very simple processors (connected as a hypercube).
Array processors are highly specialized for numerical problems that can be expressed in matrix or vector format (see program on slide 8). Each PU computes one element of the result.
[Figure: one control unit driving an array of processing units (PUs).]
Datorarkitektur F 11/12 - 30
Petru Eles, IDA, LiTH
MULTIPROCESSORS
Shared memory MIMD computers are called multiprocessors:
Some multiprocessors have no shared memory which is central to the system and equally accessible to all processors. All the memory is distributed as local memory to the processors. However, each processor has access to the local memory of any other processor ⇒ a global physical address space is available. This memory organization is called distributed shared memory.
[Figure: top, processors 1..n connected to a shared memory; bottom, the distributed shared memory variant, in which the memory is split into local memories attached to the individual processors.]
Datorarkitektur F 11/12 - 31
Petru Eles, IDA, LiTH
Multiprocessors (contd)
Communication between processors is through the shared memory. One processor can change the value in a location and the other processors can read the new value.
From the programmer's point of view, communication is realised by shared variables; these are variables which can be accessed by each of the parallel activities (processes):
- table t in slide 5;
- matrices a, b, and c in slide 7.
With many fast processors memory contention can seriously degrade performance ⇒ multiprocessor architectures don't support a high number of processors.
Datorarkitektur F 11/12 - 32
Petru Eles, IDA, LiTH
Multiprocessors (contd)
IBM System/370 (1970s): two IBM CPUs connected to shared memory.
IBM System/370-XA (1981): multiple CPUs can be connected to shared memory.
IBM System/390 (1990s): similar features to S/370-XA, with improved performance. Possibility to connect several multiprocessor systems together through fast fibre-optic connection.
CRAY X-MP (mid 1980s): from one to four vector processors connected to shared memory (cycle time: 8.5 ns).
CRAY Y-MP (1988): from one to 8 vector processors connected to shared memory; 3 times more powerful than CRAY X-MP (cycle time: 4 ns).
C90 (early 1990s): further development of CRAY Y-MP; 16 vector processors.
CRAY 3 (1993): maximum 16 vector processors (cycle time: 2 ns).
Butterfly multiprocessor system, by BBN Advanced Computers (1985/87): maximum 256 Motorola 68020 processors, interconnected by a sophisticated dynamic switching network; distributed shared memory organization.
BBN TC2000 (1990): improved version of the Butterfly using the Motorola 88100 RISC processor.
Datorarkitektur F 11/12 - 33
Petru Eles, IDA, LiTH
Multicomputers
MIMD computers with a distributed address space, so that each processor has its own private memory which is not visible to other processors, are called multicomputers:
[Figure: processors 1..n, each with its own private memory.]
Datorarkitektur F 11/12 - 34
Petru Eles, IDA, LiTH
Multicomputers (contd)
Communication between processors is only by passing messages over the interconnection network.
From the programmer's point of view this means that no shared variables are available (a variable can be accessed only by one single process). For communication between parallel activities (processes) the programmer uses channels and send/receive operations (see program in slide 10).
There is no competition of the processors for the shared memory ⇒ the number of processors is not limited by memory contention.
The speed of the interconnection network is an important parameter for the overall performance.
Datorarkitektur F 11/12 - 35
Petru Eles, IDA, LiTH
Multicomputers (contd)
Intel iPSC/2 (1989): 128 CPUs of type 80386 interconnected by a 7-dimensional hypercube (2^7 = 128).
Intel Paragon (1991): over 2000 processors of type i860 (high performance RISC) interconnected by a two-dimensional mesh network.
KSR-1 by Kendall Square Research (1992): 1088 processors interconnected by a ring network.
nCUBE/2S by nCUBE (1992): 8192 processors interconnected by a 10-dimensional hypercube.
Cray T3E MC512 (1995): 512 CPUs interconnected by a three-dimensional mesh; each CPU is a DEC Alpha RISC.
Network of workstations:
A group of workstations connected through a Local Area Network (LAN) can be used together as a multicomputer for parallel computation. Performance will usually be lower than with specialized multicomputers, because of the communication speed over the LAN. However, this is a cheap solution.
Datorarkitektur F 11/12 - 36
Petru Eles, IDA, LiTH
Vector Processors
Vector processors include in their instruction set, besides scalar instructions, also instructions operating on vectors.
Array processors (SIMD computers, see slide 29) can operate on vectors by executing the same instruction simultaneously on pairs of vector elements; each instruction is executed by a separate processing element.
Several computer architectures have implemented vector operations using the parallelism provided by pipelined functional units. Such architectures are called vector processors.
Datorarkitektur F 11/12 - 37
Petru Eles, IDA, LiTH
Vector Processors (contd)
Vector processors are not parallel processors; there are not several CPUs running in parallel. They are SISD processors which have implemented vector instructions executed on pipelined functional units.
Vector computers usually have vector registers, each of which can store 64 up to 128 words.
Vector instructions (see slide 40):
- load vector from memory into vector register
- store vector into memory
- arithmetic and logic operations between vectors
- operations between vectors and scalars
- etc.
From the programmer's point of view this means that he is allowed to use operations on vectors in his programmes (see program in slide 8), and the compiler translates these instructions into vector instructions at machine level.
Datorarkitektur F 11/12 - 38
Petru Eles, IDA, LiTH
Vector Processors (contd)
Vector computers:
- CDC Cyber 205
- CRAY
- IBM 3090 (an extension to the IBM System/370)
- NEC SX
- Fujitsu VP
- HITACHI S8000
[Figure: block diagram of a vector processor - an instruction decoder dispatches scalar instructions to the scalar unit (scalar registers and scalar functional units) and vector instructions to the vector unit (vector registers and vector functional units); both units are connected to memory.]
Datorarkitektur F 11/12 - 39
Petru Eles, IDA, LiTH
The Vector Unit
A vector unit typically consists of
- pipelined functional units
- vector registers
Vector registers:
- n general purpose vector registers Ri, 0 ≤ i ≤ n-1;
- vector length register VL; stores the length l (0 ≤ l ≤ s) of the currently processed vector(s); s is the length of the vector registers Ri;
- mask register M; stores a set of l bits, one for each element in a vector register, interpreted as boolean values; vector instructions can be executed in masked mode, so that vector register elements corresponding to a false value in M are ignored (a toy model is sketched below).
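A toy C model of these registers, only to make the roles of VL and M concrete; the register length, element type, and the masked add routine are assumptions made for illustration:

/* A vector register of S elements, a vector length register VL, and a mask
 * register M; masked_add updates only the elements whose mask bit is true. */
#include <stdbool.h>
#include <stdio.h>

#define S 64                                  /* length of a vector register */

typedef struct { double elem[S]; } vreg;

static int  VL;                               /* vector length register */
static bool M[S];                             /* mask register          */

static void masked_add(vreg *dst, const vreg *a, const vreg *b)
{
    for (int i = 0; i < VL; i++)              /* only the first VL elements */
        if (M[i])                             /* false mask bits are ignored */
            dst->elem[i] = a->elem[i] + b->elem[i];
}

int main(void)
{
    vreg r0 = {{0}}, r1 = {{0}}, r2 = {{0}};
    VL = 4;
    for (int i = 0; i < VL; i++) {
        r0.elem[i] = i + 1;  r1.elem[i] = 10 * (i + 1);  M[i] = (i % 2 == 0);
    }
    masked_add(&r2, &r0, &r1);
    printf("r2[0]=%.0f r2[1]=%.0f\n", r2.elem[0], r2.elem[1]);   /* 11 and 0 */
    return 0;
}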
Datorarkitektur F 11/12 - 40
Petru Eles, IDA, LiTH
Vector Instructions
LOAD-STORE instructions:
R ← A(x1:x2:incr)    load
A(x1:x2:incr) ← R    store
R ← MASKED(A)        masked load
A ← MASKED(R)        masked store
R ← INDIRECT(A(X))   indirect load
A(X) ← INDIRECT(R)   indirect store
Arithmetic - logic instructions:
R ← R' b_op R''
R ← S b_op R'
R ← u_op R'
M ← R rel_op R'
WHERE(M) R ← R' b_op R''
Chaining:
R2 ← R0 + R1
R3 ← R2 * R4
Execution of the vector multiplication does not have to wait until the vector addition has terminated; as elements of the sum are generated by the addition pipeline they enter the multiplication pipeline; thus, addition and multiplication are performed (partially) in parallel.
Datorarkitektur F 11/12 - 41
Petru Eles, IDA, LiTH
Vector Instructions (contd)
In a Pascal-like language with vector computation:
if T[1..50]>0 then
T[1..50]:=T[1..50]+1;
A compiler for a vector computer generates something like:
R0 ← T(0:49:1)
VL ← 50
M ← R0 > 0
WHERE(M) R0 ← R0 + 1
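For reference, the same computation written as a plain scalar C loop; the test data are arbitrary:

/* Every element of T greater than 0 is incremented, the rest (the
 * masked-out elements) are left untouched. */
#include <stdio.h>

int main(void)
{
    int T[50];

    for (int i = 0; i < 50; i++)              /* mix of negative and positive */
        T[i] = i - 25;

    for (int i = 0; i < 50; i++)              /* M <- T > 0; WHERE(M) T <- T+1 */
        if (T[i] > 0)
            T[i] = T[i] + 1;

    printf("T[0]=%d T[49]=%d\n", T[0], T[49]);   /* -25 and 25 */
    return 0;
}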
Datorarkitektur F 11/12 - 42
Petru Eles, IDA, LiTH
Multimedia Extensions to General Purpose Microprocessors
Video and audio applications very often deal with large arrays of small data types (8 or 16 bits).
Such applications exhibit a large potential of SIMD (vector) parallelism.
New generations of general purpose microprocessors have been equipped with special instructions to exploit this potential of parallelism.
The specialised multimedia instructions perform vector computations on bytes, half-words, or words.
Datorarkitektur F 11/12 - 43
Petru Eles, IDA, LiTH
Multimedia Extensions to General Purpose Microprocessors (contd)
Several vendors have extended the instruction set of their processors in order to improve performance with multimedia applications:
MMX for Intel x86 family
VIS for UltraSparc
MDMX for MIPS
MAX-2 for Hewlett-Packard PA-RISC
The Pentium line provides 57 MMX instructions. They treat data in a SIMD fashion (see textbook pg. 353).
Datorarkitektur F 11/12 - 44
Petru Eles, IDA, LiTH
Multimedia Extensions to General Purpose Microprocessors (contd)
The basic idea: subword execution
Use the entire width of a processor data path (32 or 64 bits), even when processing the small data types used in signal processing (8, 12, or 16 bits).
With a word size of 64 bits, the adders will be used to implement eight 8-bit additions in parallel.
This is practically a kind of SIMD parallelism, at a reduced scale.
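The same effect can be illustrated in software ("SIMD within a register"): the C sketch below adds eight packed bytes in one pass over a 64-bit word by suppressing the carries between byte positions; it only mimics what a packed-byte ADD instruction does directly in hardware:

/* Eight independent 8-bit additions carried in one 64-bit word. */
#include <stdint.h>
#include <stdio.h>

static uint64_t add_packed_bytes(uint64_t a, uint64_t b)
{
    const uint64_t H = 0x8080808080808080ULL; /* top bit of every byte       */
    uint64_t low = (a & ~H) + (b & ~H);       /* add the low 7 bits per byte */
    return low ^ ((a ^ b) & H);               /* fix up the top bit per byte */
}

int main(void)
{
    uint64_t a = 0x0102030405060708ULL;       /* bytes 1,2,...,8      */
    uint64_t b = 0x1010101010101010ULL;       /* add 16 to every byte */

    printf("%016llx\n", (unsigned long long)add_packed_bytes(a, b));
    /* prints 1112131415161718: every byte was summed independently */
    return 0;
}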
Datorarkitektur F 11/12 - 45
Petru Eles, IDA, LiTH
Multimedia Extensions to General Purpose Microprocessors (contd)
Three packed data types are defined for parallel operations: packed byte, packed half word, packed word.
[Figure: a 64-bit long word interpreted as a packed byte (eight elements q0..q7), a packed half word (four elements q0..q3), a packed word (two elements q0..q1), or a single long word.]
Datorarkitektur F 11/12 - 46
Petru Eles, IDA, LiTH
Multimedia Extensions to General Purpose Microprocessors (contd)
Examples of SIMD arithmetic with the MMX instruction set:
[Figure: ADD R3 ← R1,R2 performs eight byte-wise additions in parallel, producing a0+b0, a1+b1, ..., a7+b7; MPYADD R3 ← R1,R2 multiplies neighbouring byte pairs and adds the products, producing (a0×b0)+(a1×b1), (a2×b2)+(a3×b3), (a4×b4)+(a5×b5), (a6×b6)+(a7×b7).]
Datorarkitektur F 11/12 - 47
Petru Eles, IDA, LiTH
Multimedia Extensions to General Purpose Microprocessors (contd)
How to get the data ready for computation?
How to get the results back in the right format?
Packing and Unpacking
[Figure: PACK.W R3 ← R1,R2 truncates the words a0, a1 of R1 and b0, b1 of R2 and packs the truncated values into R3; UNPACK R3 ← R1 expands packed elements of R1 (a0, a1) into full-size elements in R3.]
Datorarkitektur F 11/12 - 48
Petru Eles, IDA, LiTH
Summary
The growing need for high performance cannot always be satisfied by computers running a single CPU.
With parallel computers, several CPUs are running in order to solve a given application.
Parallel programs have to be available in order to use parallel computers.
Computers can be classified based on the nature of the instruction flow executed and that of the data flow on which the instructions operate: SISD, SIMD, and MIMD architectures.
The performance we effectively can get by using a parallel computer depends not only on the number of available processors but is limited by characteristics of the executed programs.
The efficiency of using a parallel computer is influenced by features of the parallel program, like: degree of parallelism, intensity of inter-processor communication, etc.
Datorarkitektur F 11/12 - 49
Petru Eles, IDA, LiTH
Summary (contd)
A key component of a parallel architecture is the interconnection network.
Array processors execute the same operation on a set of interconnected processing units. They are specialized for numerical problems expressed in matrix or vector formats.
Multiprocessors are MIMD computers in which all CPUs have access to a common shared address space. The number of CPUs is limited.
Multicomputers have a distributed address space. Communication between CPUs is only by message passing over the interconnection network. The number of interconnected CPUs can be high.
Vector processors are SISD processors which include in their instruction set instructions operating on vectors. They are implemented using pipelined functional units.
Multimedia applications exhibit a large potential of SIMD parallelism. The instruction set of modern general purpose microprocessors (Pentium, UltraSparc) has been extended to support SIMD-style parallelism with operations on short vectors.