lectures11-12



    Datorarkitektur F 11/12 - 1

    Petru Eles, IDA, LiTH

ARCHITECTURES FOR PARALLEL COMPUTATION

    1. Why Parallel Computation

    2. Parallel Programs

    3. A Classification of Computer Architectures

    4. Performance of Parallel Architectures

    5. The Interconnection Network

    6. Array Processors

    7. Multiprocessors

    8. Multicomputers

    9. Vector Processors

    10. Multimedia Extensions to Microprocessors

    Datorarkitektur F 11/12 - 2

    Petru Eles, IDA, LiTH

    Why Parallel Computation?

    The need for high performance!

Two main factors contribute to high performance of modern processors:

1. Fast circuit technology

    2. Architectural features:

    - large caches

    - multiple fast buses

    - pipelining

    - superscalar architectures (multiple funct. units)

    However

Computers running with a single CPU are often not able to meet performance needs in certain areas:

    - Fluid flow analysis and aerodynamics;

- Simulation of large complex systems, for example in physics, economics, biology, engineering;

- Computer aided design;

- Multimedia.

Applications in the above domains are characterized by a very high amount of numerical computation and/or a high quantity of input data.

    Datorarkitektur F 11/12 - 3

    Petru Eles, IDA, LiTH

    A Solution: Parallel Computers

One solution to the need for high performance: architectures in which several CPUs are running in order to solve a certain application.

Such computers have been organized in very different ways. Some key features:

    - number and complexity of individual CPUs

- availability of common (shared) memory

    - interconnection topology

    - performance of interconnection network

    - I/O devices

    - - - - - - - - - - - - - -

    Datorarkitektur F 11/12 - 4

    Petru Eles, IDA, LiTH

    Parallel Programs

    1. Parallel sorting

[Figure: the array is split into four parts Unsorted-1..Unsorted-4; tasks Sort-1..Sort-4 sort them in parallel into Sorted-1..Sorted-4, which Merge combines into the final SORTED array.]


    Datorarkitektur F 11/12 - 5

    Petru Eles, IDA, LiTH

Parallel Programs (cont'd)

    A possible program for parallel sorting:

    var t: array[1..1000] of integer;

    - - - - - - - - - - -

    procedure sort(i,j:integer);

    -sort elements between t[i] and t[j]-

    end sort;

    procedure merge;

    - - merge the four sub-arrays - -

    end merge;

    - - - - - - - - - - -

    begin

    - - - - - - - -

    cobegin

sort(1,250)|

sort(251,500)|

sort(501,750)|

sort(751,1000)

    coend;

    merge;

    - - - - - - - -

    end;
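A minimal C sketch of the same structure, using POSIX threads in place of cobegin/coend (the four-way split, the qsort calls, and the final full re-sort standing in for the merge procedure are illustrative assumptions, not part of the slides):

#include <pthread.h>
#include <stdlib.h>

#define N 1000
static int t[N];

static int cmp(const void *a, const void *b) {
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

/* One worker plays the role of one sort(i,j) call. */
static void *sort_part(void *arg) {
    int part = *(int *)arg;                      /* 0..3 */
    qsort(t + part * (N / 4), N / 4, sizeof(int), cmp);
    return NULL;
}

int main(void) {
    pthread_t th[4];
    int ids[4] = {0, 1, 2, 3};
    for (int i = 0; i < 4; i++)                  /* cobegin */
        pthread_create(&th[i], NULL, sort_part, &ids[i]);
    for (int i = 0; i < 4; i++)                  /* coend */
        pthread_join(th[i], NULL);
    qsort(t, N, sizeof(int), cmp);               /* stand-in for merge */
    return 0;
}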

    Datorarkitektur F 11/12 - 6

    Petru Eles, IDA, LiTH

Parallel Programs (cont'd)

    2. Matrix addition:

    var a: array[1..n,1..m] of integer;

    b: array[1..n,1..m] of integer;

    c: array[1..n,1..m] of integer;

i,j:integer

    - - - - - - - - - - -

    begin

    - - - - - - - -

for i:=1 to n do

for j:=1 to m do

    c[i,j]:=a[i,j]+b[i,j];

    end for

    end for

    - - - - - - - -

    end;

[Figure: element-wise matrix addition — the elements a11..amn of a and b11..bmn of b are added to give c11..cmn of c: c_ij = a_ij + b_ij.]

    Datorarkitektur F 11/12 - 7

    Petru Eles, IDA, LiTH

Parallel Programs (cont'd)

    Matrix addition - parallel version:

    var a: array[1..n,1..m] of integer;

    b: array[1..n,1..m] of integer;

    c: array[1..n,1..m] of integer;

    i:integer

    - - - - - - - - - - -

    procedure add_vector(n_ln:integer);

    var j:integer

    begin

    for j:=1 to m do

    c[n_ln,j]:=a[n_ln,j]+b[n_ln,j];

    end for

    end add_vector;

    begin

    - - - - - - - -

cobegin

for i:=1 to n do

    add_vector(i);

    coend;

    - - - - - - - -

    end;
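The same row-wise decomposition can be sketched in C with OpenMP (the pragma-based version is an illustrative analogue; the slides use the abstract cobegin/coend notation):

#include <omp.h>

#define N 100
#define M 100
int a[N][M], b[N][M], c[N][M];

void add_matrices(void) {
    /* Each row i may be processed by a different processor,
       like one parallel call of add_vector(i) on the slide. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        for (int j = 0; j < M; j++)
            c[i][j] = a[i][j] + b[i][j];
}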

    Datorarkitektur F 11/12 - 8

    Petru Eles, IDA, LiTH

Parallel Programs (cont'd)

Matrix addition - vector computation version:

    var a: array[1..n,1..m] of integer;

    b: array[1..n,1..m] of integer;

    c: array[1..n,1..m] of integer;

    i,j:integer

    - - - - - - - - - - -

    begin

    - - - - - - - -

    for i:=1 to n do

    c[i,1:m]:=a[i,1:m]+b[i,1:m];

    end for;

    - - - - - - - -

    end;

    Or even so:

    begin

    - - - - - - - -

    c[1:n,1:m]:=a[1:n,1:m]+b[1:n,1:m];

- - - - - - - -

end;


    Datorarkitektur F 11/12 - 9

    Petru Eles, IDA, LiTH

Parallel Programs (cont'd)

    Pipeline model computation:

$y = 5 \cdot \sqrt{45 + \log x}$

[Figure: a two-stage pipeline — stage 1 reads x and computes $a = 45 + \log x$; stage 2 receives a and computes $y = 5 \cdot \sqrt{a}$.]

    Datorarkitektur F 11/12 - 10

    Petru Eles, IDA, LiTH

Parallel Programs (cont'd)

    A program for the previous computation:

    channel ch:real;

    - - - - - - - - -

    cobegin

    var x:real;

    while true do

    read(x);

    send(ch,45+log(x));

    end while |

    var v:real;

    while true do

    receive(ch,v);

    write(5*sqrt(v));

end while

coend;

    - - - - - - - - -
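A C sketch of the same two-stage pipeline, using a Unix pipe as the channel between two processes (an illustrative assumption — the slides stay at the abstract channel/send/receive level; compile with -lm; unlike the endless loop on the slide, this version stops at end of input):

#include <math.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    int ch[2];
    pipe(ch);                               /* the channel ch */
    if (fork() == 0) {                      /* stage 1 */
        double x, a;
        close(ch[0]);
        while (scanf("%lf", &x) == 1) {     /* read(x) */
            a = 45 + log(x);
            write(ch[1], &a, sizeof a);     /* send(ch, 45+log(x)) */
        }
        _exit(0);
    }
    close(ch[1]);                           /* stage 2 */
    double v;
    while (read(ch[0], &v, sizeof v) == (ssize_t)sizeof v)  /* receive(ch, v) */
        printf("%f\n", 5 * sqrt(v));        /* write(5*sqrt(v)) */
    return 0;
}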

    Datorarkitektur F 11/12 - 11

    Petru Eles, IDA, LiTH

Flynn's Classification of Computer Architectures

Flynn's classification is based on the nature of the instruction flow executed by the computer and that of the data flow on which the instructions operate.

    1. Single Instruction stream, Single Data stream (SISD)

[Figure: SISD — one CPU containing a control unit and a processing unit; the control unit feeds an instruction stream to the processing unit, which exchanges a data stream with memory.]

    Datorarkitektur F 11/12 - 12

    Petru Eles, IDA, LiTH

Flynn's Classification (cont'd)

2. Single Instruction stream, Multiple Data stream (SIMD)

    SIMD with shared memory

[Figure: SIMD with shared memory — one control unit broadcasts the instruction stream IS to processing units 1..n; each unit exchanges its data stream DS1..DSn with the shared memory through an interconnection network.]


    Datorarkitektur F 11/12 - 13

    Petru Eles, IDA, LiTH

Flynn's Classification (cont'd)

    SIMD with no shared memory

[Figure: SIMD with no shared memory — the control unit broadcasts IS to processing units 1..n, each paired with its own local memory LM1..LMn; the data streams DS1..DSn travel over an interconnection network.]

    Datorarkitektur F 11/12 - 14

    Petru Eles, IDA, LiTH

Flynn's Classification (cont'd)

3. Multiple Instruction stream, Multiple Data stream (MIMD)

    MIMD with shared memory

[Figure: MIMD with shared memory — CPUs 1..n, each with its own control unit (issuing IS1..ISn), processing unit, and local memory LM1..LMn; the data streams DS1..DSn pass through an interconnection network to the shared memory.]

    Datorarkitektur F 11/12 - 15

    Petru Eles, IDA, LiTH

Flynn's Classification (cont'd)

    MIMD with no shared memory

[Figure: MIMD with no shared memory — CPUs 1..n, each consisting of a control unit (IS1..ISn), a processing unit, and a local memory LM1..LMn, communicating only through an interconnection network (DS1..DSn).]

    Datorarkitektur F 11/12 - 16

    Petru Eles, IDA, LiTH

    Performance of Parallel Architectures

    Important questions:

How fast does a parallel computer run at its maximal potential?

How fast an execution can we expect from a parallel computer for a concrete application?

How do we measure the performance of a parallel computer and the performance improvement we get by using such a computer?


    Datorarkitektur F 11/12 - 17

    Petru Eles, IDA, LiTH

    Performance Metrics

Peak rate: the maximal computation rate that can be theoretically achieved when all modules are fully utilized.

The peak rate is of no practical significance for the user. It is mostly used by vendor companies for marketing of their computers.

Speedup: measures the gain we get by using a certain parallel computer to run a given parallel program in order to solve a specific problem.

$T_S$: execution time needed with the best sequential algorithm;

$T_P$: execution time needed with the parallel algorithm.

$S = \frac{T_S}{T_P}$

    Datorarkitektur F 11/12 - 18

    Petru Eles, IDA, LiTH

Performance Metrics (cont'd)

Efficiency: this metric relates the speedup to the number of processors used; by this it provides a measure of the efficiency with which the processors are used.

$E = \frac{S}{p}$

S: speedup;

p: number of processors.

For the ideal situation, in theory:

$S = \frac{T_S}{T_S / p} = p$; which means $E = 1$

Practically the ideal efficiency of 1 cannot be achieved!

    Datorarkitektur F 11/12 - 19

    Petru Eles, IDA, LiTH

Amdahl's Law

Consider f to be the ratio of computations that, according to the algorithm, have to be executed sequentially ($0 \leq f \leq 1$); p is the number of processors.

$T_P = f \cdot T_S + (1 - f) \cdot \frac{T_S}{p}$

$S = \frac{T_S}{f \cdot T_S + (1 - f) \cdot \frac{T_S}{p}} = \frac{1}{f + \frac{1 - f}{p}}$

[Plot: speedup S (1 to 10) as a function of f (0.2 to 1.0); the achievable speedup falls off sharply as the sequential fraction f grows.]
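As a quick numeric check of the formula (numbers chosen here purely for illustration): with $f = 0.1$ and $p = 10$, $S = \frac{1}{0.1 + 0.9/10} \approx 5.3$; letting $p \rightarrow \infty$ gives $S \rightarrow 1/f = 10$, which is the bound stated on the next slide.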

    Datorarkitektur F 11/12 - 20

    Petru Eles, IDA, LiTH

Amdahl's Law (cont'd)

Amdahl's law: even a small ratio of sequential computation imposes a certain limit on speedup; a speedup higher than 1/f cannot be achieved, regardless of the number of processors.

To efficiently exploit a high number of processors, f must be small (the algorithm has to be highly parallel).

$E = \frac{S}{p} = \frac{1}{f \cdot (p - 1) + 1}$


    Datorarkitektur F 11/12 - 21

    Petru Eles, IDA, LiTH

    Other Aspects which Limit the Speedup

Besides the intrinsic sequentiality of some parts of an algorithm, there are also other factors that limit the achievable speedup:

    - communication cost

    - load balancing of processors

    - costs of creating and scheduling processes

    - I/O operations

There are many algorithms with a high degree of parallelism; for such algorithms the value of f is very small and can be ignored. These algorithms are suited for massively parallel systems; in such cases the other limiting factors, like the cost of communications, become critical.

    Datorarkitektur F 11/12 - 22

    Petru Eles, IDA, LiTH

    Efficiency and Communication Cost

Consider a highly parallel computation, so that f (the ratio of sequential computations) can be neglected.

We define $f_c$, the fractional communication overhead of a processor:

$f_c = \frac{T_{comm}}{T_{calc}}$

$T_{calc}$: time that a processor executes computations;

$T_{comm}$: time that a processor is idle because of communication.

$T_P = \frac{T_S}{p} \cdot (1 + f_c)$

$S = \frac{T_S}{T_P} = \frac{p}{1 + f_c}$

$E = \frac{1}{1 + f_c} \approx 1 - f_c$

With algorithms that have a high degree of parallelism, massively parallel computers, consisting of a large number of processors, can be efficiently used if $f_c$ is small; this means that the time spent by a processor for communication has to be small compared to its effective time of computation.

In order to keep $f_c$ reasonably small, the size of processes cannot go below a certain limit.
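For illustration (numbers invented for this example): a processor that computes for 95 time units and then waits 5 units for communication has $f_c = 5/95 \approx 0.053$, giving $E = \frac{1}{1.053} \approx 0.95$ — about 95% of the processors' capacity is effectively used.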

    Datorarkitektur F 11/12 - 23

    Petru Eles, IDA, LiTH

    The Interconnection Network

The interconnection network (IN) is a key component of the architecture. It has a decisive influence on the overall performance and cost.

The traffic in the IN consists of data transfer and transfer of commands and requests.

The key parameters of the IN are

- total bandwidth: transferred bits/second

    - cost

    Datorarkitektur F 11/12 - 24

    Petru Eles, IDA, LiTH

The Interconnection Network (cont'd)

    Single Bus

    Single bus networks are simple and cheap.

One single communication is allowed at a time; the bandwidth is shared by all nodes.

Performance is relatively poor.

In order to keep a certain performance, the number of nodes is limited (16 - 20).

[Figure: nodes Node1..Noden attached to one shared bus.]


    Datorarkitektur F 11/12 - 25

    Petru Eles, IDA, LiTH

The Interconnection Network (cont'd)

    Completely connected network

    Each node is connected to every other one.

Communications can be performed in parallel between any pair of nodes.

    Both performance and cost are high.

Cost increases rapidly with the number of nodes (a completely connected network of n nodes needs n(n-1)/2 links).

[Figure: five nodes, Node1..Node5, each directly connected to every other.]

    Datorarkitektur F 11/12 - 26

    Petru Eles, IDA, LiTH

The Interconnection Network (cont'd)

    Crossbar network

The crossbar is a dynamic network: the interconnection topology can be modified by positioning of switches.

The crossbar switch is completely connected: any node can be directly connected to any other.

Fewer interconnections are needed than for the static completely connected network; however, a large number of switches is needed.

A large number of communications can be performed in parallel (a certain node can receive or send only one data item at a time).

[Figure: a crossbar network connecting Node1..Noden through a grid of switches, one at each crosspoint.]

    Datorarkitektur F 11/12 - 27

    Petru Eles, IDA, LiTH

The Interconnection Network (cont'd)

    Mesh network

Mesh networks are cheaper than completely connected ones and provide relatively good performance.

In order to transmit information between certain nodes, routing through intermediate nodes is needed (maximum 2*(n-1) intermediates for an n*n mesh).

It is possible to provide wraparound connections: between nodes 1 and 13, 2 and 14, etc.

Three-dimensional meshes have also been implemented.

[Figure: a 4×4 mesh of Node1..Node16; each node is linked to its horizontal and vertical neighbours.]

    Datorarkitektur F 11/12 - 28

    Petru Eles, IDA, LiTH

The Interconnection Network (cont'd)

    Hypercube network

2^n nodes are arranged in an n-dimensional cube. Each node is connected to n neighbours.

In order to transmit information between certain nodes, routing through intermediate nodes is needed (maximum n intermediates).
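A small C sketch of the observation behind that bound (illustrative, not from the slides): hypercube neighbours differ in exactly one address bit, so a message can be routed by fixing the differing bits one dimension at a time, and the number of hops equals the Hamming distance of the two node numbers.

#include <stdio.h>

/* Hops between two hypercube nodes = number of address bits
   in which the node numbers differ (Hamming distance). */
static int hops(unsigned src, unsigned dst) {
    unsigned diff = src ^ dst;
    int h = 0;
    while (diff) {
        h += diff & 1u;
        diff >>= 1;
    }
    return h;
}

int main(void) {
    /* In the 4-dimensional cube of the figure below (16 nodes),
       N0 -> N15 crosses all four dimensions. */
    printf("%d\n", hops(0, 15));   /* prints 4 */
    return 0;
}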

[Figure: a four-dimensional hypercube with nodes N0..N15; each node has four neighbours.]


    Datorarkitektur F 11/12 - 29

    Petru Eles, IDA, LiTH

    SIMD Computers

    SIMD computers are usually called array processors.

PUs are usually very simple: an ALU which executes the instruction broadcast by the CU, a few registers, and some local memory.

    The first SIMD computer:

- ILLIAC IV (1970s): 64 relatively powerful processors (mesh connection, see above).

Contemporary commercial computer:

- CM-2 (Connection Machine) by Thinking Machines Corporation: 65 536 very simple processors (connected as a hypercube).

Array processors are highly specialized for numerical problems that can be expressed in matrix or vector format (see program on slide 8). Each PU computes one element of the result.

[Figure: a control unit driving a 3×3 grid of PUs, all executing the broadcast instruction.]

    Datorarkitektur F 11/12 - 30

    Petru Eles, IDA, LiTH

    MULTIPROCESSORS

Shared memory MIMD computers are called multiprocessors:

Some multiprocessors have no shared memory which is central to the system and equally accessible to all processors. All the memory is distributed as local memory to the processors. However, each processor has access to the local memory of any other processor ⇒ a global physical address space is available. This memory organization is called distributed shared memory.

[Figures: top — Processor1..Processorn connected to one central Shared Memory; bottom — distributed shared memory, a Local Memory attached to each processor but accessible to all.]

    Datorarkitektur F 11/12 - 31

    Petru Eles, IDA, LiTH

Multiprocessors (cont'd)

Communication between processors is through the shared memory. One processor can change the value in a location and the other processors can read the new value.

From the programmer's point of view communication is realised by shared variables; these are variables which can be accessed by each of the parallel activities (processes):

- table t in slide 5;

- matrices a, b, and c in slide 7.

With many fast processors memory contention can seriously degrade performance ⇒ multiprocessor architectures don't support a high number of processors.

    Datorarkitektur F 11/12 - 32

    Petru Eles, IDA, LiTH

Multiprocessors (cont'd)

IBM System/370 (1970s): two IBM CPUs connected to shared memory.
IBM System/370-XA (1981): multiple CPUs can be connected to shared memory.
IBM System/390 (1990s): similar features to S/370-XA, with improved performance. Possibility to connect several multiprocessor systems together through fast fibre-optic connections.

CRAY X-MP (mid 1980s): from one to four vector processors connected to shared memory (cycle time: 8.5 ns).
CRAY Y-MP (1988): from one to eight vector processors connected to shared memory; 3 times more powerful than CRAY X-MP (cycle time: 4 ns).
C90 (early 1990s): further development of CRAY Y-MP; 16 vector processors.
CRAY 3 (1993): maximum 16 vector processors (cycle time: 2 ns).

Butterfly multiprocessor system, by BBN Advanced Computers (1985/87): maximum 256 Motorola 68020 processors, interconnected by a sophisticated dynamic switching network; distributed shared memory organization.

BBN TC2000 (1990): improved version of the Butterfly using the Motorola 88100 RISC processor.


    Datorarkitektur F 11/12 - 33

    Petru Eles, IDA, LiTH

    Multicomputers

MIMD computers with a distributed address space, so that each processor has its own private memory which is not visible to other processors, are called multicomputers:

[Figure: Processor1..Processorn, each with its own Private Memory, connected by an interconnection network.]

    Datorarkitektur F 11/12 - 34

    Petru Eles, IDA, LiTH

Multicomputers (cont'd)

Communication between processors is only by passing messages over the interconnection network.

From the programmer's point of view this means that no shared variables are available (a variable can be accessed only by one single process). For communication between parallel activities (processes) the programmer uses channels and send/receive operations (see program in slide 10).

There is no competition of the processors for the shared memory ⇒ the number of processors is not limited by memory contention.

The speed of the interconnection network is an important parameter for the overall performance.

    Datorarkitektur F 11/12 - 35

    Petru Eles, IDA, LiTH

Multicomputers (cont'd)

Intel iPSC/2 (1989): 128 CPUs of type 80386 interconnected by a 7-dimensional hypercube (2^7 = 128).

Intel Paragon (1991): over 2000 processors of type i860 (high performance RISC) interconnected by a two-dimensional mesh network.

KSR-1 by Kendall Square Research (1992): 1088 processors interconnected by a ring network.

nCUBE/2S by nCUBE (1992): 8192 processors interconnected by a 10-dimensional hypercube.

Cray T3E MC512 (1995): 512 CPUs interconnected by a three-dimensional mesh; each CPU is a DEC Alpha RISC.

Network of workstations:

A group of workstations connected through a Local Area Network (LAN) can be used together as a multicomputer for parallel computation. Performance will usually be lower than with specialized multicomputers, because of the communication speed over the LAN. However, this is a cheap solution.

    Datorarkitektur F 11/12 - 36

    Petru Eles, IDA, LiTH

    Vector Processors

Vector processors include in their instruction set, beside scalar instructions, also instructions operating on vectors.

Array processor (SIMD) computers (see slide 29) can operate on vectors by executing the same instruction simultaneously on pairs of vector elements; each pair is processed by a separate processing element.

Several computer architectures have implemented vector operations using the parallelism provided by pipelined functional units. Such architectures are called vector processors.


    Datorarkitektur F 11/12 - 37

    Petru Eles, IDA, LiTH

Vector Processors (cont'd)

Vector processors are not parallel processors; there are not several CPUs running in parallel. They are SISD processors which have implemented vector instructions executed on pipelined functional units.

Vector computers usually have vector registers, each of which can store 64 up to 128 words.

    Vector instructions (see slide 40):

    - load vector from memory into vector register

    - store vector into memory

    - arithmetic and logic operations between vectors

    - operations between vectors and scalars

    - etc.

From the programmer's point of view this means that he is allowed to use operations on vectors in his programs (see program in slide 8), and the compiler translates these instructions into vector instructions at machine level.

    Datorarkitektur F 11/12 - 38

    Petru Eles, IDA, LiTH

Vector Processors (cont'd)

    Vector computers:

- CDC Cyber 205

- CRAY

    - IBM 3090 (an extension to the IBM System/370)

    - NEC SX

    - Fujitsu VP

    - HITACHI S8000

[Figure: vector processor organization — an instruction decoder dispatches scalar instructions to the scalar unit (scalar registers and scalar functional units) and vector instructions to the vector unit (vector registers and vector functional units); both units are connected to memory.]

    Datorarkitektur F 11/12 - 39

    Petru Eles, IDA, LiTH

    The Vector Unit

    A vector unit typically consists of

    - pipelined functional units

    - vector registers

    Vector registers:

- n general purpose vector registers R_i, 0 ≤ i ≤ n-1;

- vector length register VL; stores the length l (0 ≤ l ≤ s) of the currently processed vector(s); s is the length of the vector registers R_i.

- mask register M; stores a set of l bits, one for each element in a vector register, interpreted as boolean values; vector instructions can be executed in masked mode, so that vector register elements corresponding to a false value in M are ignored.

    Datorarkitektur F 11/12 - 40

    Petru Eles, IDA, LiTH

    Vector Instructions

LOAD-STORE instructions:

R ← A(x1:x2:incr) load

A(x1:x2:incr) ← R store

R ← MASKED(A) masked load

A ← MASKED(R) masked store

R ← INDIRECT(A(X)) indirect load

A(X) ← INDIRECT(R) indirect store

Arithmetic - logic:

R ← R' b_op R''

R ← S b_op R'

R ← u_op R'

M ← R rel_op R'

WHERE(M) R ← R' b_op R''

Chaining:

R2 ← R0 + R1

R3 ← R2 * R4

Execution of the vector multiplication does not have to wait until the vector addition has terminated; as elements of the sum are generated by the addition pipeline they enter the multiplication pipeline; thus, addition and multiplication are performed (partially) in parallel.


    Datorarkitektur F 11/12 - 41

    Petru Eles, IDA, LiTH

Vector Instructions (cont'd)

    In a Pascal-like language with vector computation:

    if T[1..50]>0 then

    T[1..50]:=T[1..50]+1;

    A compiler for a vector computer generates something like:

R0 ← T(0:49:1)

VL ← 50

M ← R0 > 0

WHERE(M) R0 ← R0 + 1
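For reference, a plain C rendering of the masked semantics (illustrative only; a real vector machine processes the register lanes in a pipeline rather than in a sequential loop):

void incr_positive(int t[50]) {
    for (int i = 0; i < 50; i++) {
        int mask = t[i] > 0;   /* M <- R0 > 0, one mask bit per element */
        t[i] += mask;          /* WHERE(M) R0 <- R0 + 1 */
    }
}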

    Datorarkitektur F 11/12 - 42

    Petru Eles, IDA, LiTH

Multimedia Extensions to General Purpose Microprocessors

Video and audio applications very often deal with large arrays of small data types (8 or 16 bits).

Such applications exhibit a large potential of SIMD (vector) parallelism.

New generations of general purpose microprocessors have been equipped with special instructions to exploit this potential of parallelism.

The specialised multimedia instructions perform vector computations on bytes, half-words, or words.

    Datorarkitektur F 11/12 - 43

    Petru Eles, IDA, LiTH

Multimedia Extensions to General Purpose Microprocessors (cont'd)

Several vendors have extended the instruction set of their processors in order to improve performance with multimedia applications:

    MMX for Intel x86 family

    VIS for UltraSparc

    MDMX for MIPS

    MAX-2 for Hewlett-Packard PA-RISC

The Pentium line provides 57 MMX instructions. They treat data in a SIMD fashion (see textbook pg. 353).

    Datorarkitektur F 11/12 - 44

    Petru Eles, IDA, LiTH

Multimedia Extensions to General Purpose Microprocessors (cont'd)

    The basic idea: subword execution

Use the entire width of a processor data path (32 or 64 bits), even when processing the small data types used in signal processing (8, 12, or 16 bits).

With a word size of 64 bits, the adders will be used to implement eight 8-bit additions in parallel.

This is practically a kind of SIMD parallelism, at a reduced scale.
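A C sketch of this "SIMD within a register" idea (illustrative; real MMX hardware does this with a dedicated instruction and also offers saturating variants):

#include <stdint.h>

/* Add eight packed 8-bit lanes held in one 64-bit word.
   The masking keeps a carry in one lane from spilling
   into its neighbour. */
static uint64_t paddb(uint64_t a, uint64_t b) {
    uint64_t low = (a & 0x7F7F7F7F7F7F7F7FULL)
                 + (b & 0x7F7F7F7F7F7F7F7FULL);     /* low 7 bits per lane */
    return low ^ ((a ^ b) & 0x8080808080808080ULL); /* restore bit 7 of each lane */
}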


    Datorarkitektur F 11/12 - 45

    Petru Eles, IDA, LiTH

Multimedia Extensions to General Purpose Microprocessors (cont'd)

Three packed data types are defined for parallel operations: packed byte, packed half word, packed word.

[Figure: a 64-bit register interpreted as — packed byte: eight 8-bit elements q7..q0; packed half word: four 16-bit elements q3..q0; packed word: two 32-bit elements q1..q0; long word: one 64-bit element q0.]

    Datorarkitektur F 11/12 - 46

    Petru Eles, IDA, LiTH

Multimedia Extensions to General Purpose Microprocessors (cont'd)

Examples of SIMD arithmetic with the MMX instruction set:

[Figure: ADD R3 ← R1,R2 adds the packed bytes a7..a0 and b7..b0 lane by lane, producing a7+b7 ... a0+b0 in R3. MPYADD R3 ← R1,R2 multiplies neighbouring element pairs and adds the products, producing (a6×b6)+(a7×b7), (a4×b4)+(a5×b5), (a2×b2)+(a3×b3), (a0×b0)+(a1×b1).]

    Datorarkitektur F 11/12 - 47

    Petru Eles, IDA, LiTH

Multimedia Extensions to General Purpose Microprocessors (cont'd)

    How to get the data ready for computation?

    How to get the results back in the right format?

    Packing and Unpacking

[Figure: PACK.W R3 ← R1,R2 truncates the words a1, a0 of R1 and b1, b0 of R2 and packs the four results into R3. UNPACK R3 ← R1 expands the lower elements a1, a0 of R1 to full width in R3.]

    Datorarkitektur F 11/12 - 48

    Petru Eles, IDA, LiTH

    Summary

The growing need for high performance cannot always be satisfied by computers running a single CPU.

With parallel computers, several CPUs are running in order to solve a given application.

Parallel programs have to be available in order to use parallel computers.

Computers can be classified based on the nature of the instruction flow executed and that of the data flow on which the instructions operate: SISD, SIMD, and MIMD architectures.

The performance we effectively can get by using a parallel computer depends not only on the number of available processors but is limited by characteristics of the executed programs.

The efficiency of using a parallel computer is influenced by features of the parallel program, like: degree of parallelism, intensity of inter-processor communication, etc.


    Datorarkitektur F 11/12 - 49

    Petru Eles, IDA, LiTH

Summary (cont'd)

A key component of a parallel architecture is the interconnection network.

Array processors execute the same operation on a set of interconnected processing units. They are specialized for numerical problems expressed in matrix or vector formats.

Multiprocessors are MIMD computers in which all CPUs have access to a common shared address space. The number of CPUs is limited.

Multicomputers have a distributed address space. Communication between CPUs is only by message passing over the interconnection network. The number of interconnected CPUs can be high.

Vector processors are SISD processors which include in their instruction set instructions operating on vectors. They are implemented using pipelined functional units.

Multimedia applications exhibit a large potential of SIMD parallelism. The instruction set of modern general purpose microprocessors (Pentium, UltraSparc) has been extended to support SIMD-style parallelism with operations on short vectors.