lectures11-12



    Datorarkitektur F 11/12 - 1

    Petru Eles, IDA, LiTH

ARCHITECTURES FOR PARALLEL COMPUTATION

    1. Why Parallel Computation

    2. Parallel Programs

    3. A Classification of Computer Architectures

    4. Performance of Parallel Architectures

    5. The Interconnection Network

    6. Array Processors

    7. Multiprocessors

    8. Multicomputers

    9. Vector Processors

    10. Multimedia Extensions to Microprocessors

    Datorarkitektur F 11/12 - 2

    Petru Eles, IDA, LiTH

    Why Parallel Computation?

    The need for high performance!

Two main factors contribute to high performance of modern processors:

1. Fast circuit technology

    2. Architectural features:

    - large caches

    - multiple fast buses

    - pipelining

    - superscalar architectures (multiple funct. units)

    However

Computers running with a single CPU are often not able to meet performance needs in certain areas:

    - Fluid flow analysis and aerodynamics;

- Simulation of large complex systems, for example in physics, economics, biology, engineering;

- Computer aided design;

- Multimedia.

Applications in the above domains are characterized by a very high amount of numerical computation and/or a high quantity of input data.

    Datorarkitektur F 11/12 - 3

    Petru Eles, IDA, LiTH

    A Solution: Parallel Computers

One solution to the need for high performance: architectures in which several CPUs are running in order to solve a certain application.

Such computers have been organized in very different ways. Some key features:

    - number and complexity of individual CPUs

- availability of common (shared) memory

    - interconnection topology

    - performance of interconnection network

    - I/O devices

    - - - - - - - - - - - - - -

    Datorarkitektur F 11/12 - 4

    Petru Eles, IDA, LiTH

    Parallel Programs

    1. Parallel sorting

[Figure: the array is split into four parts Unsorted-1..Unsorted-4; tasks Sort-1..Sort-4 sort them in parallel into Sorted-1..Sorted-4, which Merge combines into the final SORTED array.]


    Datorarkitektur F 11/12 - 5

    Petru Eles, IDA, LiTH

Parallel Programs (cont'd)

    A possible program for parallel sorting:

    var t: array[1..1000] of integer;

    - - - - - - - - - - -

    procedure sort(i,j:integer);

    -sort elements between t[i] and t[j]-

    end sort;

    procedure merge;

    - - merge the four sub-arrays - -

    end merge;

    - - - - - - - - - - -

    begin

    - - - - - - - -

    cobegin

sort(1,250)|

sort(251,500)|

sort(501,750)|

sort(751,1000)

    coend;

    merge;

    - - - - - - - -

    end;
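A minimal C sketch of the same structure, using POSIX threads in place of cobegin/coend (the four-way split, the qsort calls, and the final full re-sort standing in for the merge procedure are illustrative assumptions, not part of the slides):

#include <pthread.h>
#include <stdlib.h>

#define N 1000
static int t[N];

static int cmp(const void *a, const void *b) {
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

/* One worker plays the role of one sort(i,j) call. */
static void *sort_part(void *arg) {
    int part = *(int *)arg;                      /* 0..3 */
    qsort(t + part * (N / 4), N / 4, sizeof(int), cmp);
    return NULL;
}

int main(void) {
    pthread_t th[4];
    int ids[4] = {0, 1, 2, 3};
    for (int i = 0; i < 4; i++)                  /* cobegin */
        pthread_create(&th[i], NULL, sort_part, &ids[i]);
    for (int i = 0; i < 4; i++)                  /* coend */
        pthread_join(th[i], NULL);
    qsort(t, N, sizeof(int), cmp);               /* stand-in for merge */
    return 0;
}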

    Datorarkitektur F 11/12 - 6

    Petru Eles, IDA, LiTH

Parallel Programs (cont'd)

    2. Matrix addition:

    var a: array[1..n,1..m] of integer;

    b: array[1..n,1..m] of integer;

    c: array[1..n,1..m] of integer;

i,j:integer

    - - - - - - - - - - -

    begin

    - - - - - - - -

for i:=1 to n do

for j:=1 to m do

    c[i,j]:=a[i,j]+b[i,j];

    end for

    end for

    - - - - - - - -

    end;

[Figure: element-wise matrix addition — the elements a11..amn of a and b11..bmn of b are added to give c11..cmn of c: c_ij = a_ij + b_ij.]

    Datorarkitektur F 11/12 - 7

    Petru Eles, IDA, LiTH

Parallel Programs (cont'd)

    Matrix addition - parallel version:

    var a: array[1..n,1..m] of integer;

    b: array[1..n,1..m] of integer;

    c: array[1..n,1..m] of integer;

    i:integer

    - - - - - - - - - - -

    procedure add_vector(n_ln:integer);

    var j:integer

    begin

    for j:=1 to m do

    c[n_ln,j]:=a[n_ln,j]+b[n_ln,j];

    end for

    end add_vector;

    begin

    - - - - - - - -

cobegin

for i:=1 to n do

    add_vector(i);

    coend;

    - - - - - - - -

    end;
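The same row-wise decomposition can be sketched in C with OpenMP (the pragma-based version is an illustrative analogue; the slides use the abstract cobegin/coend notation):

#include <omp.h>

#define N 100
#define M 100
int a[N][M], b[N][M], c[N][M];

void add_matrices(void) {
    /* Each row i may be processed by a different processor,
       like one parallel call of add_vector(i) on the slide. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        for (int j = 0; j < M; j++)
            c[i][j] = a[i][j] + b[i][j];
}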

    Datorarkitektur F 11/12 - 8

    Petru Eles, IDA, LiTH

Parallel Programs (cont'd)

Matrix addition - vector computation version:

    var a: array[1..n,1..m] of integer;

    b: array[1..n,1..m] of integer;

    c: array[1..n,1..m] of integer;

    i,j:integer

    - - - - - - - - - - -

    begin

    - - - - - - - -

    for i:=1 to n do

    c[i,1:m]:=a[i,1:m]+b[i,1:m];

    end for;

    - - - - - - - -

    end;

    Or even so:

    begin

    - - - - - - - -

    c[1:n,1:m]:=a[1:n,1:m]+b[1:n,1:m];

- - - - - - - -

end;


    Datorarkitektur F 11/12 - 9

    Petru Eles, IDA, LiTH

Parallel Programs (cont'd)

    Pipeline model computation:

$y = 5 \cdot \sqrt{45 + \log x}$

[Figure: a two-stage pipeline — stage 1 reads x and computes $a = 45 + \log x$; stage 2 receives a and computes $y = 5 \cdot \sqrt{a}$.]

    Datorarkitektur F 11/12 - 10

    Petru Eles, IDA, LiTH

Parallel Programs (cont'd)

    A program for the previous computation:

    channel ch:real;

    - - - - - - - - -

    cobegin

    var x:real;

    while true do

    read(x);

    send(ch,45+log(x));

    end while |

    var v:real;

    while true do

    receive(ch,v);

    write(5*sqrt(v));

end while

coend;

    - - - - - - - - -
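A C sketch of the same two-stage pipeline, using a Unix pipe as the channel between two processes (an illustrative assumption — the slides stay at the abstract channel/send/receive level; compile with -lm; unlike the endless loop on the slide, this version stops at end of input):

#include <math.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    int ch[2];
    pipe(ch);                               /* the channel ch */
    if (fork() == 0) {                      /* stage 1 */
        double x, a;
        close(ch[0]);
        while (scanf("%lf", &x) == 1) {     /* read(x) */
            a = 45 + log(x);
            write(ch[1], &a, sizeof a);     /* send(ch, 45+log(x)) */
        }
        _exit(0);
    }
    close(ch[1]);                           /* stage 2 */
    double v;
    while (read(ch[0], &v, sizeof v) == (ssize_t)sizeof v)  /* receive(ch, v) */
        printf("%f\n", 5 * sqrt(v));        /* write(5*sqrt(v)) */
    return 0;
}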

    Datorarkitektur F 11/12 - 11

    Petru Eles, IDA, LiTH

Flynn's Classification of Computer Architectures

Flynn's classification is based on the nature of the instruction flow executed by the computer and that of the data flow on which the instructions operate.

    1. Single Instruction stream, Single Data stream (SISD)

[Figure: SISD — one CPU containing a control unit and a processing unit; the control unit feeds an instruction stream to the processing unit, which exchanges a data stream with memory.]

    Datorarkitektur F 11/12 - 12

    Petru Eles, IDA, LiTH

Flynn's Classification (cont'd)

2. Single Instruction stream, Multiple Data stream (SIMD)

    SIMD with shared memory

[Figure: SIMD with shared memory — one control unit broadcasts the instruction stream IS to processing units 1..n; each unit exchanges its data stream DS1..DSn with the shared memory through an interconnection network.]


    Datorarkitektur F 11/12 - 13

    Petru Eles, IDA, LiTH

Flynn's Classification (cont'd)

    SIMD with no shared memory

[Figure: SIMD with no shared memory — the control unit broadcasts IS to processing units 1..n, each paired with its own local memory LM1..LMn; the data streams DS1..DSn travel over an interconnection network.]

    Datorarkitektur F 11/12 - 14

    Petru Eles, IDA, LiTH

Flynn's Classification (cont'd)

3. Multiple Instruction stream, Multiple Data stream (MIMD)

    MIMD with shared memory

[Figure: MIMD with shared memory — CPUs 1..n, each with its own control unit (issuing IS1..ISn), processing unit, and local memory LM1..LMn; the data streams DS1..DSn pass through an interconnection network to the shared memory.]

    Datorarkitektur F 11/12 - 15

    Petru Eles, IDA, LiTH

Flynn's Classification (cont'd)

    MIMD with no shared memory

[Figure: MIMD with no shared memory — CPUs 1..n, each consisting of a control unit (IS1..ISn), a processing unit, and a local memory LM1..LMn, communicating only through an interconnection network (DS1..DSn).]

    Datorarkitektur F 11/12 - 16

    Petru Eles, IDA, LiTH

    Performance of Parallel Architectures

    Important questions:

How fast does a parallel computer run at its maximal potential?

How fast an execution can we expect from a parallel computer for a concrete application?

How do we measure the performance of a parallel computer and the performance improvement we get by using such a computer?


    Datorarkitektur F 11/12 - 17

    Petru Eles, IDA, LiTH

    Performance Metrics

Peak rate: the maximal computation rate that can be theoretically achieved when all modules are fully utilized.

The peak rate is of no practical significance for the user. It is mostly used by vendor companies for marketing of their computers.

Speedup: measures the gain we get by using a certain parallel computer to run a given parallel program in order to solve a specific problem.

$T_S$: execution time needed with the best sequential algorithm;

$T_P$: execution time needed with the parallel algorithm.

$S = \frac{T_S}{T_P}$

    Datorarkitektur F 11/12 - 18

    Petru Eles, IDA, LiTH

Performance Metrics (cont'd)

Efficiency: this metric relates the speedup to the number of processors used; by this it provides a measure of the efficiency with which the processors are used.

$E = \frac{S}{p}$

S: speedup;

p: number of processors.

For the ideal situation, in theory:

$S = \frac{T_S}{T_S / p} = p$; which means $E = 1$

Practically the ideal efficiency of 1 cannot be achieved!

    Datorarkitektur F 11/12 - 19

    Petru Eles, IDA, LiTH

Amdahl's Law

Consider f to be the ratio of computations that, according to the algorithm, have to be executed sequentially ($0 \leq f \leq 1$); p is the number of processors.

$T_P = f \cdot T_S + (1 - f) \cdot \frac{T_S}{p}$

$S = \frac{T_S}{f \cdot T_S + (1 - f) \cdot \frac{T_S}{p}} = \frac{1}{f + \frac{1 - f}{p}}$

[Plot: speedup S (1 to 10) as a function of f (0.2 to 1.0); the achievable speedup falls off sharply as the sequential fraction f grows.]
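As a quick numeric check of the formula (numbers chosen here purely for illustration): with $f = 0.1$ and $p = 10$, $S = \frac{1}{0.1 + 0.9/10} \approx 5.3$; letting $p \rightarrow \infty$ gives $S \rightarrow 1/f = 10$, which is the bound stated on the next slide.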

    Datorarkitektur F 11/12 - 20

    Petru Eles, IDA, LiTH

Amdahl's Law (cont'd)

Amdahl's law: even a small ratio of sequential computation imposes a certain limit on speedup; a speedup higher than 1/f cannot be achieved, regardless of the number of processors.

To efficiently exploit a high number of processors, f must be small (the algorithm has to be highly parallel).

$E = \frac{S}{p} = \frac{1}{f \cdot (p - 1) + 1}$


    Datorarkitektur F 11/12 - 21

    Petru Eles, IDA, LiTH

    Other Aspects which Limit the Speedup

Besides the intrinsic sequentiality of some parts of an algorithm, there are also other factors that limit the achievable speedup:

    - communication cost

    - load balancing of processors

    - costs of creating and scheduling processes

    - I/O operations

There are many algorithms with a high degree of parallelism; for such algorithms the value of f is very small and can be ignored. These algorithms are suited for massively parallel systems; in such cases the other limiting factors, like the cost of communications, become critical.

    Datorarkitektur F 11/12 - 22

    Petru Eles, IDA, LiTH

    Efficiency and Communication Cost

Consider a highly parallel computation, so that f (the ratio of sequential computations) can be neglected.

We define $f_c$, the fractional communication overhead of a processor:

$f_c = \frac{T_{comm}}{T_{calc}}$

$T_{calc}$: time that a processor executes computations;

$T_{comm}$: time that a processor is idle because of communication.

$T_P = \frac{T_S}{p} \cdot (1 + f_c)$

$S = \frac{T_S}{T_P} = \frac{p}{1 + f_c}$

$E = \frac{1}{1 + f_c} \approx 1 - f_c$

With algorithms that have a high degree of parallelism, massively parallel computers, consisting of a large number of processors, can be efficiently used if $f_c$ is small; this means that the time spent by a processor for communication has to be small compared to its effective time of computation.

In order to keep $f_c$ reasonably small, the size of processes cannot go below a certain limit.
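For illustration (numbers invented for this example): a processor that computes for 95 time units and then waits 5 units for communication has $f_c = 5/95 \approx 0.053$, giving $E = \frac{1}{1.053} \approx 0.95$ — about 95% of the processors' capacity is effectively used.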

    Datorarkitektur F 11/12 - 23

    Petru Eles, IDA, LiTH

    The Interconnection Network

The interconnection network (IN) is a key component of the architecture. It has a decisive influence on the overall performance and cost.

The traffic in the IN consists of data transfer and transfer of commands and requests.

The key parameters of the IN are

- total bandwidth: transferred bits/second

    - cost

    Datorarkitektur F 11/12 - 24

    Petru Eles, IDA, LiTH

The Interconnection Network (cont'd)

    Single Bus

    Single bus networks are simple and cheap.

One single communication is allowed at a time; the bandwidth is shared by all nodes.

Performance is relatively poor.

In order to keep a certain performance, the number of nodes is limited (16 - 20).

[Figure: nodes Node1..Noden attached to one shared bus.]


    Datorarkitektur F 11/12 - 25

    Petru Eles, IDA, LiTH

The Interconnection Network (cont'd)

    Completely connected network

    Each node is connected to every other one.

Communications can be performed in parallel between any pair of nodes.

    Both performance and cost are high.

Cost increases rapidly with the number of nodes (a completely connected network of n nodes needs n(n-1)/2 links).

[Figure: five nodes, Node1..Node5, each directly connected to every other.]

    Datorarkitektur F 11/12 - 26

    Petru Eles, IDA, LiTH

The Interconnection Network (cont'd)

    Crossbar network

The crossbar is a dynamic network: the interconnection topology can be modified by positioning of switches.

The crossbar switch is completely connected: any node can be directly connected to any other.

Fewer interconnections are needed than for the static completely connected network; however, a large number of switches is needed.

A large number of communications can be performed in parallel (a certain node can receive or send only one data item at a time).

[Figure: a crossbar network connecting Node1..Noden through a grid of switches, one at each crosspoint.]

    Datorarkitektur F 11/12 - 27

    Petru Eles, IDA, LiTH

The Interconnection Network (cont'd)

    Mesh network

Mesh networks are cheaper than completely connected ones and provide relatively good performance.

In order to transmit information between certain nodes, routing through intermediate nodes is needed (maximum 2*(n-1) intermediates for an n*n mesh).

It is possible to provide wraparound connections: between nodes 1 and 13, 2 and 14, etc.

Three-dimensional meshes have also been implemented.

[Figure: a 4×4 mesh of Node1..Node16; each node is linked to its horizontal and vertical neighbours.]

    Datorarkitektur F 11/12 - 28

    Petru Eles, IDA, LiTH

The Interconnection Network (cont'd)

    Hypercube network

2^n nodes are arranged in an n-dimensional cube. Each node is connected to n neighbours.

In order to transmit information between certain nodes, routing through intermediate nodes is needed (maximum n intermediates).
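A small C sketch of the observation behind that bound (illustrative, not from the slides): hypercube neighbours differ in exactly one address bit, so a message can be routed by fixing the differing bits one dimension at a time, and the number of hops equals the Hamming distance of the two node numbers.

#include <stdio.h>

/* Hops between two hypercube nodes = number of address bits
   in which the node numbers differ (Hamming distance). */
static int hops(unsigned src, unsigned dst) {
    unsigned diff = src ^ dst;
    int h = 0;
    while (diff) {
        h += diff & 1u;
        diff >>= 1;
    }
    return h;
}

int main(void) {
    /* In the 4-dimensional cube of the figure below (16 nodes),
       N0 -> N15 crosses all four dimensions. */
    printf("%d\n", hops(0, 15));   /* prints 4 */
    return 0;
}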

[Figure: a four-dimensional hypercube with nodes N0..N15; each node has four neighbours.]


    Datorarkitektur F 11/12 - 29

    Petru Eles, IDA, LiTH

    SIMD Computers

    SIMD computers are usually called array processors.

PUs are usually very simple: an ALU which executes the instruction broadcast by the CU, a few registers, and some local memory.

    The first SIMD computer:

- ILLIAC IV (1970s): 64 relatively powerful processors (mesh connection, see above).

Contemporary commercial computer:

- CM-2 (Connection Machine) by Thinking Machines Corporation: 65 536 very simple processors (connected as a hypercube).

Array processors are highly specialized for numerical problems that can be expressed in matrix or vector format (see program on slide 8). Each PU computes one element of the result.

[Figure: a control unit driving a 3×3 grid of PUs, all executing the broadcast instruction.]

    Datorarkitektur F 11/12 - 30

    Petru Eles, IDA, LiTH

    MULTIPROCESSORS

Shared memory MIMD computers are called multiprocessors:

Some multiprocessors have no shared memory which is central to the system and equally accessible to all processors. All the memory is distributed as local memory to the processors. However, each processor has access to the local memory of any other processor ⇒ a global physical address space is available. This memory organization is called distributed shared memory.

[Figures: top — Processor1..Processorn connected to one central Shared Memory; bottom — distributed shared memory, a Local Memory attached to each processor but accessible to all.]

    Datorarkitektur F 11/12 - 31

    Petru Eles, IDA, LiTH

Multiprocessors (cont'd)

Communication between processors is through the shared memory. One processor can change the value in a location and the other processors can read the new value.

From the programmer's point of view communication is realised by shared variables; these are variables which can be accessed by each of the parallel activities (processes):

- table t in slide 5;

- matrices a, b, and c in slide 7.

With many fast processors memory contention can seriously degrade performance ⇒ multiprocessor architectures don't support a high number of processors.

    Datorarkitektur F 11/12 - 32

    Petru Eles, IDA, LiTH

Multiprocessors (cont'd)

IBM System/370 (1970s): two IBM CPUs connected to shared memory.
IBM System/370-XA (1981): multiple CPUs can be connected to shared memory.
IBM System/390 (1990s): similar features to S/370-XA, with improved performance. Possibility to connect several multiprocessor systems together through fast fibre-optic connections.

CRAY X-MP (mid 1980s): from one to four vector processors connected to shared memory (cycle time: 8.5 ns).
CRAY Y-MP (1988): from one to eight vector processors connected to shared memory; 3 times more powerful than CRAY X-MP (cycle time: 4 ns).
C90 (early 1990s): further development of CRAY Y-MP; 16 vector processors.
CRAY 3 (1993): maximum 16 vector processors (cycle time: 2 ns).

Butterfly multiprocessor system, by BBN Advanced Computers (1985/87): maximum 256 Motorola 68020 processors, interconnected by a sophisticated dynamic switching network; distributed shared memory organization.

BBN TC2000 (1990): improved version of the Butterfly using the Motorola 88100 RISC processor.


    Datorarkitektur F 11/12 - 33

    Petru Eles, IDA, LiTH

    Multicomputers

MIMD computers with a distributed address space, so that each processor has its own private memory which is not visible to other processors, are called multicomputers:

[Figure: Processor1..Processorn, each with its own Private Memory, connected by an interconnection network.]

    Datorarkitektur F 11/12 - 34

    Petru Eles, IDA, LiTH

Multicomputers (cont'd)

Communication between processors is only by passing messages over the interconnection network.

From the programmer's point of view this means that no shared variables are available (a variable can be accessed only by one single process). For communication between parallel activities (processes) the programmer uses channels and send/receive operations (see program in slide 10).

There is no competition of the processors for the shared memory ⇒ the number of processors is not limited by memory contention.

The speed of the interconnection network is an important parameter for the overall performance.

    Datorarkitektur F 11/12 - 35

    Petru Eles, IDA, LiTH

Multicomputers (cont'd)

Intel iPSC/2 (1989): 128 CPUs of type 80386 interconnected by a 7-dimensional hypercube (2^7 = 128).

Intel Paragon (1991): over 2000 processors of type i860 (high performance RISC) interconnected by a two-dimensional mesh network.

KSR-1 by Kendall Square Research (1992): 1088 processors interconnected by a ring network.

nCUBE/2S by nCUBE (1992): 8192 processors interconnected by a 10-dimensional hypercube.

Cray T3E MC512 (1995): 512 CPUs interconnected by a three-dimensional mesh; each CPU is a DEC Alpha RISC.

Network of workstations:

A group of workstations connected through a Local Area Network (LAN) can be used together as a multicomputer for parallel computation. Performance will usually be lower than with specialized multicomputers, because of the communication speed over the LAN. However, this is a cheap solution.

    Datorarkitektur F 11/12 - 36

    Petru Eles, IDA, LiTH

    Vector Processors

Vector processors include in their instruction set, beside scalar instructions, also instructions operating on vectors.

Array processor (SIMD) computers (see slide 29) can operate on vectors by executing the same instruction simultaneously on pairs of vector elements; each pair is processed by a separate processing element.

Several computer architectures have implemented vector operations using the parallelism provided by pipelined functional units. Such architectures are called vector processors.


    Datorarkitektur F 11/12 - 37

    Petru Eles, IDA, LiTH

Vector Processors (cont'd)

Vector processors are not parallel processors; there are not several CPUs running in parallel. They are SISD processors which have implemented vector instructions executed on pipelined functional units.

Vector computers usually have vector registers, each of which can store 64 up to 128 words.

    Vector instructions (see slide 40):

    - load vector from memory into vector register

    - store vector into memory

    - arithmetic and logic operations between vectors

    - operations between vectors and scalars

    - etc.

From the programmer's point of view this means that he is allowed to use operations on vectors in his programs (see program in slide 8), and the compiler translates these instructions into vector instructions at machine level.

    Datorarkitektur F 11/12 - 38

    Petru Eles, IDA, LiTH

Vector Processors (cont'd)

    Vector computers:

- CDC Cyber 205

- CRAY

    - IBM 3090 (an extension to the IBM System/370)

    - NEC SX

    - Fujitsu VP

    - HITACHI S8000

[Figure: vector processor organization — an instruction decoder dispatches scalar instructions to the scalar unit (scalar registers and scalar functional units) and vector instructions to the vector unit (vector registers and vector functional units); both units are connected to memory.]

    Datorarkitektur F 11/12 - 39

    Petru Eles, IDA, LiTH

    The Vector Unit

    A vector unit typically consists of

    - pipelined functional units

    - vector registers

    Vector registers:

- n general purpose vector registers R_i, 0 ≤ i ≤ n-1;

- vector length register VL; stores the length l (0 ≤ l ≤ s) of the currently processed vector(s); s is the length of the vector registers R_i.

- mask register M; stores a set of l bits, one for each element in a vector register, interpreted as boolean values; vector instructions can be executed in masked mode, so that vector register elements corresponding to a false value in M are ignored.

    Datorarkitektur F 11/12 - 40

    Petru Eles, IDA, LiTH

    Vector Instructions

LOAD-STORE instructions:

R ← A(x1:x2:incr) load

A(x1:x2:incr) ← R store

R ← MASKED(A) masked load

A ← MASKED(R) masked store

R ← INDIRECT(A(X)) indirect load

A(X) ← INDIRECT(R) indirect store

Arithmetic - logic:

R ← R' b_op R''

R ← S b_op R'

R ← u_op R'

M ← R rel_op R'

WHERE(M) R ← R' b_op R''

Chaining:

R2 ← R0 + R1

R3 ← R2 * R4

Execution of the vector multiplication does not have to wait until the vector addition has terminated; as elements of the sum are generated by the addition pipeline they enter the multiplication pipeline; thus, addition and multiplication are performed (partially) in parallel.


    Datorarkitektur F 11/12 - 41

    Petru Eles, IDA, LiTH

Vector Instructions (cont'd)

    In a Pascal-like language with vector computation:

    if T[1..50]>0 then

    T[1..50]:=T[1..50]+1;

    A compiler for a vector computer generates something like:

R0 ← T(0:49:1)

VL ← 50

M ← R0 > 0

WHERE(M) R0 ← R0 + 1
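For reference, a plain C rendering of the masked semantics (illustrative only; a real vector machine processes the register lanes in a pipeline rather than in a sequential loop):

void incr_positive(int t[50]) {
    for (int i = 0; i < 50; i++) {
        int mask = t[i] > 0;   /* M <- R0 > 0, one mask bit per element */
        t[i] += mask;          /* WHERE(M) R0 <- R0 + 1 */
    }
}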

    Datorarkitektur F 11/12 - 42

    Petru Eles, IDA, LiTH

Multimedia Extensions to General Purpose Microprocessors

Video and audio applications very often deal with large arrays of small data types (8 or 16 bits).

Such applications exhibit a large potential of SIMD (vector) parallelism.

New generations of general purpose microprocessors have been equipped with special instructions to exploit this potential of parallelism.

The specialised multimedia instructions perform vector computations on bytes, half-words, or words.

    Datorarkitektur F 11/12 - 43

    Petru Eles, IDA, LiTH

Multimedia Extensions to General Purpose Microprocessors (cont'd)

Several vendors have extended the instruction set of their processors in order to improve performance with multimedia applications:

    MMX for Intel x86 family

    VIS for UltraSparc

    MDMX for MIPS

    MAX-2 for Hewlett-Packard PA-RISC

The Pentium line provides 57 MMX instructions. They treat data in a SIMD fashion (see textbook pg. 353).

    Datorarkitektur F 11/12 - 44

    Petru Eles, IDA, LiTH

Multimedia Extensions to General Purpose Microprocessors (cont'd)

    The basic idea: subword execution

Use the entire width of a processor data path (32 or 64 bits), even when processing the small data types used in signal processing (8, 12, or 16 bits).

With a word size of 64 bits, the adders will be used to implement eight 8-bit additions in parallel.

This is practically a kind of SIMD parallelism, at a reduced scale.
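A C sketch of this "SIMD within a register" idea (illustrative; real MMX hardware does this with a dedicated instruction and also offers saturating variants):

#include <stdint.h>

/* Add eight packed 8-bit lanes held in one 64-bit word.
   The masking keeps a carry in one lane from spilling
   into its neighbour. */
static uint64_t paddb(uint64_t a, uint64_t b) {
    uint64_t low = (a & 0x7F7F7F7F7F7F7F7FULL)
                 + (b & 0x7F7F7F7F7F7F7F7FULL);     /* low 7 bits per lane */
    return low ^ ((a ^ b) & 0x8080808080808080ULL); /* restore bit 7 of each lane */
}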


    Datorarkitektur F 11/12 - 45

    Petru Eles, IDA, LiTH

Multimedia Extensions to General Purpose Microprocessors (cont'd)

Three packed data types are defined for parallel operations: packed byte, packed half word, packed word.

[Figure: a 64-bit register interpreted as — packed byte: eight 8-bit elements q7..q0; packed half word: four 16-bit elements q3..q0; packed word: two 32-bit elements q1..q0; long word: one 64-bit element q0.]

    Datorarkitektur F 11/12 - 46

    Petru Eles, IDA, LiTH

Multimedia Extensions to General Purpose Microprocessors (cont'd)

Examples of SIMD arithmetic with the MMX instruction set:

[Figure: ADD R3 ← R1,R2 adds the packed bytes a7..a0 and b7..b0 lane by lane, producing a7+b7 ... a0+b0 in R3. MPYADD R3 ← R1,R2 multiplies neighbouring element pairs and adds the products, producing (a6×b6)+(a7×b7), (a4×b4)+(a5×b5), (a2×b2)+(a3×b3), (a0×b0)+(a1×b1).]

    Datorarkitektur F 11/12 - 47

    Petru Eles, IDA, LiTH

Multimedia Extensions to General Purpose Microprocessors (cont'd)

    How to get the data ready for computation?

    How to get the results back in the right format?

    Packing and Unpacking

[Figure: PACK.W R3 ← R1,R2 truncates the words a1, a0 of R1 and b1, b0 of R2 and packs the four results into R3. UNPACK R3 ← R1 expands the lower elements a1, a0 of R1 to full width in R3.]

    Datorarkitektur F 11/12 - 48

    Petru Eles, IDA, LiTH

    Summary

The growing need for high performance cannot always be satisfied by computers running a single CPU.

With parallel computers, several CPUs are running in order to solve a given application.

Parallel programs have to be available in order to use parallel computers.

Computers can be classified based on the nature of the instruction flow executed and that of the data flow on which the instructions operate: SISD, SIMD, and MIMD architectures.

The performance we effectively can get by using a parallel computer depends not only on the number of available processors but is limited by characteristics of the executed programs.

The efficiency of using a parallel computer is influenced by features of the parallel program, like: degree of parallelism, intensity of inter-processor communication, etc.


    Datorarkitektur F 11/12 - 49

    Petru Eles, IDA, LiTH

Summary (cont'd)

A key component of a parallel architecture is the interconnection network.

Array processors execute the same operation on a set of interconnected processing units. They are specialized for numerical problems expressed in matrix or vector formats.

Multiprocessors are MIMD computers in which all CPUs have access to a common shared address space. The number of CPUs is limited.

Multicomputers have a distributed address space. Communication between CPUs is only by message passing over the interconnection network. The number of interconnected CPUs can be high.

Vector processors are SISD processors which include in their instruction set instructions operating on vectors. They are implemented using pipelined functional units.

Multimedia applications exhibit a large potential of SIMD parallelism. The instruction set of modern general purpose microprocessors (Pentium, UltraSparc) has been extended to support SIMD-style parallelism with operations on short vectors.