
  • Slide 1/36

    Rehan Azmat

    Lecture 24 and 25: Parallel Processing

  • Slide 2/36

    Introduction and Motivation

    Loop Unrolling

    The IA-64 Architecture

    Features of Itanium

  • Slide 3/36

    Introduction and Motivation

    Architecture Classification

    Performance of Parallel Architectures

    Interconnection Network

  • Slide 4/36

    Reduction of instruction execution time
    ◦ Increased clock frequency by fast circuit technology
    ◦ Simplified instructions (RISC)

    Parallelism within the processor
    ◦ Pipelining
    ◦ Parallel execution of instructions (ILP)
      Superscalar
      VLIW architectures

    Parallel processing

  • Slide 5/36

    Sequential computers often are not able to meet the performance needs of certain applications:
    ◦ Fluid flow analysis and aerodynamics;
    ◦ Simulation of large complex systems in physics, economics, biology, etc.;
    ◦ Computer-aided design;
    ◦ Multimedia.

    Applications in the above domains are characterized by a very large amount of numerical computation and/or a high quantity of input data.

    In order to deliver sufficient performance for such applications, we can have many processors in a single computer.

  • Slide 6/36

    Parallel computers refer to architectures in which many CPUs are running in parallel to implement a certain application.

    Such computers can be organized in very different ways, depending on several key features:
    ◦ number and complexity of the individual CPUs;
    ◦ availability of common (shared) memory;
    ◦ interconnection topology;
    ◦ performance of the interconnection network;
    ◦ I/O devices;
    ◦ etc.

  • Slide 7/36

    In order to solve a problem using a parallel computer, one must decompose the problem into small sub-problems, which can be solved in parallel.
    ◦ The results of the sub-problems may have to be combined to get the final result of the main problem.

    Because of data dependencies among the sub-problems, it is not easy to decompose a complex problem so as to generate a large degree of parallelism.

    Because of data dependencies, the processors may have to communicate with each other.

    The time taken for communication is usually very high compared with the processing time.

    The communication scheme must therefore be very well planned in order to get a good parallel algorithm.
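    Below is a minimal sketch (not from the slides) of the decompose / solve-in-parallel / combine pattern described above, using Python's multiprocessing module; the array-sum problem, the worker count and the chunking are illustrative choices only. Note that shipping the chunks to the workers is exactly the communication cost the slides warn about.

    from multiprocessing import Pool

    def partial_sum(chunk):
        # Sub-problem: each worker sums its own chunk independently.
        return sum(chunk)

    if __name__ == "__main__":
        data = list(range(1_000_000))
        n_workers = 4
        # Decompose the main problem into sub-problems (one chunk per worker).
        size = len(data) // n_workers
        chunks = [data[i * size:(i + 1) * size] for i in range(n_workers)]
        chunks[-1].extend(data[n_workers * size:])   # remainder goes to the last worker
        # Communication: send the chunks to the workers, receive the partial results.
        with Pool(n_workers) as pool:
            partials = pool.map(partial_sum, chunks)
        # Combine the partial results into the final result of the main problem.
        print(sum(partials) == sum(data))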


  • Slide 10/36

    Architecture Classification

  • Slide 11/36

    Flynn's classification (1966) is based on the nature of the instruction flow executed by the computer and that of the data flow on which the instructions operate.

    The multiplicity of instruction streams and data streams gives us four different classes:
    ◦ Single instruction, single data stream - SISD
    ◦ Single instruction, multiple data stream - SIMD
    ◦ Multiple instruction, single data stream - MISD
    ◦ Multiple instruction, multiple data stream - MIMD


  • Slide 13/36

    Single processor

    Single instruction stream

    Data stored in a single memory

  • Slide 14/36

    Single machine instruction stream

    Simultaneous execution on different sets of data

    Large number of processing elements

    Lockstep synchronization among the processing elements.

    The processing elements can
    ◦ Have their own private data memory; or
    ◦ Share a common memory via an interconnection network.

    Array and vector processors are the most common SIMD machines.

    Processor lockstep is a technique used to achieve high reliability in a microprocessor system. This is done by adding a second identical processor that monitors and verifies the operation of the system processor. The two processors are initialized to the same state during system start-up and they receive the same inputs, so during normal operation the state of the two processors is identical from clock to clock; they are said to be operating in lockstep.

    Lockstep systems are fault-tolerant computer systems that run the same set of operations at the same time in parallel.
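    As a rough software analogy of the SIMD idea above (one instruction applied simultaneously to many data elements), the following sketch uses NumPy's vectorized operations; NumPy and the chosen array values are assumptions for illustration and are not part of the slides.

    import numpy as np

    a = np.arange(8)        # data elements 0..7
    b = np.full(8, 10)      # data elements, all equal to 10
    c = a + b               # one operation, many data elements: [10, 11, ..., 17]
    print(c)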

  • Slide 15/36

    • Global memory which can be accessed by all processors of a parallel computer.
    • Data in the global memory can be read or written by any of the processors.

  • Slide 16/36

    Single sequence of data, transmitted to a set of processors

    Each processor executes a different instruction sequence

    Never been implemented

  • Slide 17/36

    Set of processors

    Simultaneously execute different instruction sequences

    Different sets of data

    The MIMD class can be further divided:
    ◦ Shared memory (tightly coupled):
      Symmetric multiprocessor (SMP)
      Non-uniform memory access (NUMA)
    ◦ Distributed memory (loosely coupled) = clusters
      Each processor in a parallel computer has its own memory (local memory); no other processor can access this memory.

  • Slide 18/36

    IS = instruction stream

    DS = data stream

    LM = local memory

  • Slide 19/36

    Figures: distributed-memory and shared-memory MIMD organizations.

  • Slide 20/36

    Performance of Parallel Architectures

  • Slide 21/36

    Important questions:

    How fast does a parallel computer run at its maximal potential?

    What execution speed can we expect from a parallel computer for a given application?

    How do we measure the performance of a parallel computer and the performance improvement we get by using such a computer?

  • Slide 22/36

    Peak rate: the maximal computation rate that can theoretically be achieved when all modules are fully utilized.
    ◦ The peak rate is of no practical significance for the user.
    ◦ It is mostly used by vendor companies for marketing their computers.

    Speedup: measures the gain we get by using a certain parallel computer to run a given parallel program in order to solve a specific problem: S = Ts / Tp, where Ts is the execution time of the sequential program and Tp is the execution time of the parallel program on the parallel computer.

  • Slide 23/36

    Efficiency: this metric relates the speedup to the number of processors used; by this it provides a measure of how efficiently the processors are used: E = S / P.

    S: speedup; P: number of processors.

    For the ideal situation, in theory:
    S = P, which means E = 1.

    In practice, the ideal efficiency of 1 cannot be achieved.
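    A small worked example of the two metrics, assuming the usual definitions S = Ts / Tp and E = S / P; the timing values below are made up for illustration.

    def speedup(t_seq, t_par):
        # S = Ts / Tp: sequential execution time over parallel execution time.
        return t_seq / t_par

    def efficiency(s, p):
        # E = S / P: speedup per processor.
        return s / p

    Ts, Tp, P = 100.0, 14.0, 8     # hypothetical measurements (seconds), 8 processors
    S = speedup(Ts, Tp)            # 100 / 14, about 7.14
    E = efficiency(S, P)           # 7.14 / 8, about 0.89 (ideal would be S = 8, E = 1)
    print(S, E)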

  • Slide 24/36

    Consider f to be the ratio of computations that, according to the algorithm, have to be executed sequentially (0 ≤ f ≤ 1), and P the number of processors.

    S = speedup. Amdahl's law gives the speedup as

    S = 1 / (f + (1 - f) / P)

    Amdahl's law is used to find the maximum expected improvement to an overall system when only part of the system is improved. It is often used in parallel computing to predict the theoretical maximum speedup using multiple processors.

  • Slide 25/36

    Amdahl's law says that even a small ratio of sequential computation imposes a limit on the speedup.
    ◦ A speedup higher than 1/f cannot be achieved, regardless of the number of processors, since S = 1 / (f + (1 - f) / P) < 1/f whenever f > 0.

    To efficiently exploit a high number of processors, f must be small (the algorithm has to be highly parallel), since the efficiency E = S / P = 1 / (f × (P - 1) + 1) drops quickly with growing P unless f is very small.
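    The sketch below evaluates Amdahl's law, S = 1 / (f + (1 - f) / P), for a growing number of processors and shows how the speedup saturates at 1/f; the value f = 0.05 is an illustrative assumption.

    def amdahl_speedup(f, p):
        # S = 1 / (f + (1 - f) / p): f is the sequential fraction, p the processor count.
        return 1.0 / (f + (1.0 - f) / p)

    f = 0.05                                   # 5% of the computation is sequential
    for p in (1, 4, 16, 64, 256, 1024):
        print(p, round(amdahl_speedup(f, p), 2))
    # The printed speedups approach, but never exceed, 1 / f = 20.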

  • Slide 26/36

    Besides the intrinsic sequentiality of parts of an algorithm, there are also other factors that limit the achievable speedup:
    ◦ communication cost
    ◦ load balancing of the processors
    ◦ costs of creating and scheduling processes
    ◦ I/O operations (mostly sequential in nature)

    There are many algorithms with a high degree of parallelism.
    ◦ The value of f is very small and can be ignored;
    ◦ They are suited for massively parallel systems; and
    ◦ The other limiting factors, like the cost of communication, become critical.

  • Slide 27/36

    Consider a highly parallel computation, so that f can be neglected. We define fc, the fractional communication overhead of a processor, as fc = Tcomm / Tcalc:
    ◦ Tcalc: the time that a processor spends executing computations;
    ◦ Tcomm: the time that a processor is idle because of communication.

    With algorithms having a high degree of parallelism, massively parallel computers, consisting of a large number of processors, can be used efficiently if fc is small.
    ◦ The time spent by a processor on communication has to be small compared to its time spent on computation.

    In order to keep fc reasonably small, the size of the processes cannot go below a certain limit.
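    A small worked example, assuming fc = Tcomm / Tcalc as defined above and that communication is not overlapped with computation; in that case each processor does useful work for Tcalc out of every Tcalc + Tcomm, so the efficiency is about 1 / (1 + fc). The timing values are made up for illustration.

    def comm_overhead(t_calc, t_comm):
        # fc = Tcomm / Tcalc: fractional communication overhead of a processor.
        return t_comm / t_calc

    def efficiency_from_fc(fc):
        # With f negligible and no overlap, a processor is busy Tcalc out of Tcalc + Tcomm.
        return 1.0 / (1.0 + fc)

    Tcalc, Tcomm = 10.0, 1.0                       # hypothetical per-processor times
    fc = comm_overhead(Tcalc, Tcomm)               # 0.1
    print(fc, round(efficiency_from_fc(fc), 3))    # 0.1  0.909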

  • Slide 28/36

    Interconnection Network

  • Slide 29/36

    The interconnection network (IN) is a key component of the architecture. It has a decisive influence on:
    ◦ the overall performance; and
    ◦ the total cost of the architecture.

    The traffic in the IN consists of data transfers and transfers of commands and requests (control information).

    The key parameters of the IN are
    ◦ total bandwidth: transferred bits/second; and
    ◦ implementation cost.

  • Slide 30/36

    Single-bus networks are simple and cheap.

    Only one communication is allowed at a time; the bandwidth is shared by all nodes.

    Performance is relatively poor.

    In order to maintain a certain level of performance, the number of nodes is limited (around 16 - 20).
    ◦ Multiple buses can be used instead, if needed.

  • Slide 31/36

    Each node is connected to every other one.

    Communications can be performed in parallel between any pair of nodes.

    Both performance and cost are high.

    Cost increases rapidly with the number of nodes.

  • Slide 32/36

    A dynamic network: the interconnection topology can be modified by configuring the switches.

    It is completely connected: any node can be directly connected to any other.

    Fewer interconnections are needed than for the static completely connected network; however, a large number of switches is needed.

    A large number of communications can be performed in parallel (even though one node can send or receive only one data item at a time).

  • Slide 33/36

    Cheaper than completely connected networks, while giving relatively good performance.

    In order to transmit data between two nodes, routing through intermediate nodes is needed (maximum 2×(n-1) intermediate hops for an n×n mesh).

    It is possible to provide wrap-around connections:
    ◦ Torus.

    Three-dimensional meshes have also been implemented.
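    A minimal sketch of dimension-order (XY) routing on an n×n mesh, illustrating the 2×(n-1) bound mentioned above; the mesh size, the chosen source/destination and the helper name xy_route are illustrative assumptions.

    def xy_route(src, dst):
        # Route first along the X dimension, then along the Y dimension;
        # returns the list of nodes visited, including src and dst.
        (x, y), (dx, dy) = src, dst
        path = [(x, y)]
        while x != dx:
            x += 1 if dx > x else -1
            path.append((x, y))
        while y != dy:
            y += 1 if dy > y else -1
            path.append((x, y))
        return path

    n = 4
    path = xy_route((0, 0), (n - 1, n - 1))   # worst case: opposite corners
    print(len(path) - 1)                      # 6 hops = 2 * (n - 1) for n = 4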

  • Slide 34/36

    2^n nodes are arranged in an n-dimensional cube. Each node is connected to n neighbors.

    In order to transmit data between two nodes, routing through intermediate nodes is needed (maximum n hops).
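    A minimal sketch of hypercube routing: node addresses are n-bit labels, neighbors differ in exactly one bit, and a message reaches its destination by fixing the differing bits one dimension at a time, so at most n hops are needed. The dimension and the node labels are illustrative assumptions.

    def hypercube_route(src, dst, n):
        # Return the sequence of node labels visited from src to dst.
        path, node = [src], src
        for bit in range(n):
            if (node ^ dst) & (1 << bit):     # this dimension still differs
                node ^= 1 << bit              # hop to the neighbor along that dimension
                path.append(node)
        return path

    n = 4                                      # 2**4 = 16 nodes
    path = hypercube_route(0b0000, 0b1011, n)
    print([format(p, "04b") for p in path])    # three differing bits, so three hops here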

  • Slide 35/36

    The growing need for high performance cannot always be satisfied by computers with a single CPU.

    With parallel computers, several CPUs run concurrently in order to solve a given application.

    Parallel programs have to be available in order to make use of parallel computers.

    Computers can be classified based on the nature of the instruction flow and that of the data flow on which the instructions operate.

    A key component of a parallel architecture is also the interconnection network.

    The performance we can get with a parallel computer depends not only on the number of available processors but is also limited by characteristics of the executed programs.

  • Slide 36/36

    End of the Lecture