24-25_parallel processing.pdf
TRANSCRIPT
-
8/16/2019 24-25_Parallel Processing.pdf
1/36
Rehan Azmat
Lecture 24 and 25: Parallel Processing
Introduction and Motivation
Loop Unrolling
The IA-64 Architecture
Features of Itanium
Introduction and Motivation
Architecture Classification
Performance of Parallel Architectures
Interconnection Network
Reduction of instruction execution time:
◦ Increased clock frequency by fast circuit technology
◦ Simplified instructions (RISC)
Parallelism within the processor:
◦ Pipelining
◦ Parallel execution of instructions (ILP)
    Superscalar
    VLIW architectures
Parallel processing
Sequential computers often are not able to meet the performance needs of certain applications:
◦ Fluid flow analysis and aerodynamics;
◦ Simulation of large complex systems in physics, economy, biology, etc.;
◦ Computer-aided design;
◦ Multimedia.
Applications in the above domains are characterized by a very large amount of numerical computation and/or a high quantity of input data.
In order to deliver sufficient performance for such applications, we can have many processors in a single computer.
Parallel computers refer to architectures in which many CPUs are running in parallel to implement a certain application.
Such computers can be organized in very different ways, depending on several key features:
◦ number and complexity of individual CPUs;
◦ availability of common (shared) memory;
◦ interconnection topology;
◦ performance of the interconnection network;
◦ I/O devices;
◦ etc.
In order to solve a problem using a parallel computer, one must decompose the problem into small sub-problems that can be solved in parallel.
◦ The results of the sub-problems may have to be combined to get the final result of the main problem.
Because of data dependencies among the sub-problems, it is not easy to decompose a complex problem so as to generate a large degree of parallelism.
Because of these data dependencies, the processors may have to communicate with each other.
The time taken for communication is usually very high compared with the processing time.
The communication scheme must therefore be very well planned in order to get a good parallel algorithm.
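As a minimal sketch of this decompose–solve–combine pattern (a hypothetical example, not from the lecture), the sum of a large list can be split into chunks whose partial sums are computed by workers and then combined:

```python
from concurrent.futures import ThreadPoolExecutor

def partial_sum(chunk):
    # Sub-problem: each worker sums its own chunk independently.
    return sum(chunk)

def parallel_sum(data, workers=4):
    # Decompose the problem into roughly equal sub-problems.
    size = (len(data) + workers - 1) // workers
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    # Solve the sub-problems in parallel, then combine the results.
    # (A real parallel machine would run these on separate CPUs;
    # threads are used here only to keep the sketch self-contained.)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(partial_sum, chunks))

print(parallel_sum(range(1000)))  # 499500
```

Summation decomposes this easily only because the chunks have no data dependencies; as the slide notes, most problems do have them and then require planned communication between workers.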
Architectures Classification
Flynn’s classification (1966) is based on the nature of the instruction flow executed by the computer and that of the data flow on which the instructions operate.
The multiplicity of instruction streams and data streams gives us four different classes:
◦ Single instruction, single data stream (SISD)
◦ Single instruction, multiple data stream (SIMD)
◦ Multiple instruction, single data stream (MISD)
◦ Multiple instruction, multiple data stream (MIMD)
Single processor
Single instruction stream
Data stored in a single memory
Single machine instruction stream
Simultaneous execution on different sets of data
Large number of processing elements
Lockstep synchronization among the processing elements
The processing elements can:
◦ have their own private data memory; or
◦ share a common memory via an interconnection network.
Array and vector processors are the most common SIMD machines.

Processor lockstep is a technique used to achieve high reliability in a microprocessor system. A second identical processor is added to the system to monitor and verify the operation of the system processor. The two processors are initialized to the same state during system start-up and receive the same inputs, so during normal operation the state of the two processors is identical from clock to clock; they are said to be operating in lockstep.
Lockstep systems are fault-tolerant computer systems that run the same set of operations at the same time in parallel.
• Global memory can be accessed by all processors of a parallel computer.
• Data in the global memory can be read and written by any of the processors.
A single sequence of data is transmitted to a set of processors.
Each processor executes a different instruction sequence.
This class has never been implemented.
A set of processors
Simultaneously execute different instruction sequences
On different sets of data
The MIMD class can be further divided:
◦ Shared memory (tightly coupled):
    Symmetric multiprocessor (SMP)
    Non-uniform memory access (NUMA)
◦ Distributed memory (loosely coupled) = clusters
    Each processor in a parallel computer has its own memory (local memory); no other processor can access this memory.
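To make the tightly/loosely coupled distinction concrete, here is a small illustrative sketch (not from the lecture; Python threads stand in for processors). Shared-memory workers update one common variable under a lock, while distributed-memory workers keep local state and exchange results only via messages:

```python
import threading
import queue

# Shared memory (tightly coupled): all workers read and write one
# common variable; a lock prevents races on the shared state.
counter = 0
lock = threading.Lock()

def shared_worker(n):
    global counter
    for _ in range(n):
        with lock:
            counter += 1

# Distributed memory (loosely coupled): each worker keeps local state,
# invisible to the others, and communicates only by sending messages.
def message_worker(n, channel):
    local = 0               # local memory of this "processor"
    for _ in range(n):
        local += 1
    channel.put(local)      # explicit communication of the result

threads = [threading.Thread(target=shared_worker, args=(1000,)) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()

chan = queue.Queue()
workers = [threading.Thread(target=message_worker, args=(1000, chan)) for _ in range(4)]
for t in workers: t.start()
for t in workers: t.join()
total = sum(chan.get() for _ in range(4))
```

Both variants compute the same total (4000), but the shared version needs synchronization on every update, while the message-passing version communicates once per worker, mirroring the SMP-versus-cluster trade-off.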
IS = Instruction Stream
DS = Data Stream
LM = Local Memory
Figure: Distributed Memory and Shared Memory organizations
Performance of Parallel Architectures
Important questions:
How fast does a parallel computer run at its maximal potential?
What execution speed can we expect from a parallel computer for a given application?
How do we measure the performance of a parallel computer and the performance improvement we get by using such a computer?
Peak rate: the maximal computation rate that can be theoretically achieved when all modules are fully utilized.
◦ The peak rate is of no practical significance for the user.
◦ It is mostly used by vendor companies for marketing their computers.
Speedup: measures the gain we get by using a certain parallel computer to run a given parallel program in order to solve a specific problem:
S = Ts / Tp
(Ts: execution time of the sequential computation; Tp: execution time on the parallel computer.)
Efficiency: this metric relates the speedup to the number of processors used; by this it provides a measure of how efficiently the processors are used:
E = S / P
S: speedup; P: number of processors.
For the ideal situation, in theory, S = P, which means E = 1.
Practically, the ideal efficiency of 1 cannot be achieved.
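Assuming the usual definitions (S = Ts/Tp and E = S/P), both metrics are one-line computations; the timings below are invented for illustration:

```python
def speedup(t_seq, t_par):
    # S = Ts / Tp: gain of the parallel run over the sequential one.
    return t_seq / t_par

def efficiency(s, p):
    # E = S / P: relates the speedup to the number of processors used.
    return s / p

s = speedup(100.0, 20.0)  # hypothetical: 100 s sequential, 20 s parallel -> S = 5
e = efficiency(s, 8)      # on 8 processors: 5 / 8 = 0.625, below the ideal 1.0
```

An efficiency well below 1, as here, means most of the machine's processors are idle or doing overhead work for much of the run.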
Consider f to be the ratio of computations that, according to the algorithm, have to be executed sequentially (0 ≤ f ≤ 1), and P the number of processors.
The speedup S is then bounded by Amdahl's law:
S = 1 / (f + (1 − f) / P)
Amdahl's law is used to find the maximum expected improvement to an overall system when only part of the system is improved. It is often used in parallel computing to predict the theoretical maximum speedup using multiple processors.
-
Amdahl's law says that even a small ratio of sequential computation imposes a limit on speedup:
◦ A speedup higher than 1/f cannot be achieved, regardless of the number of processors, since S → 1/f as P → ∞.
◦ To efficiently exploit a high number of processors, f must be small (the algorithm has to be highly parallel).
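Amdahl's bound can be checked numerically with a one-line helper (a sketch; the function and parameter names are mine):

```python
def amdahl_speedup(f, p):
    # S = 1 / (f + (1 - f) / P): f is the sequential fraction of the
    # algorithm, P the number of processors.
    return 1.0 / (f + (1.0 - f) / p)

# With f = 5% sequential work the speedup saturates near 1/f = 20,
# no matter how many processors are added.
few = amdahl_speedup(0.05, 10)       # about 6.9
many = amdahl_speedup(0.05, 100000)  # about 20.0, but never reaches 20
```

Going from 10 to 100,000 processors buys less than a 3x improvement here, which is exactly the limit the slide describes.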
Besides the intrinsic sequentiality of parts of an algorithm, there are also other factors that limit the achievable speedup:
◦ communication cost
◦ load balancing of processors
◦ costs of creating and scheduling processes
◦ I/O operations (mostly sequential in nature)
There are many algorithms with a high degree of parallelism:
◦ the value of f is very small and can be ignored;
◦ they are suited for massively parallel systems; and
◦ the other limiting factors, like the cost of communications, become critical.
Consider a highly parallel computation, so that f can be neglected. We define fc, the fractional communication overhead of a processor:
fc = Tcomm / Tcalc
◦ Tcalc: the time that a processor executes computations;
◦ Tcomm: the time that a processor is idle because of communication.
With algorithms having a high degree of parallelism, massively parallel computers, consisting of a large number of processors, can be used efficiently if fc is small.
◦ The time spent by a processor on communication has to be small compared to its time for computation.
In order to keep fc reasonably small, the size of processes cannot go below a certain limit.
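Taking fc = Tcomm/Tcalc, and assuming f is negligible so that each processor is busy for Tcalc out of every Tcalc + Tcomm time units, efficiency becomes E = 1/(1 + fc). A quick sketch (illustrative names and numbers, not from the lecture):

```python
def comm_overhead(t_calc, t_comm):
    # fc = Tcomm / Tcalc: fractional communication overhead.
    return t_comm / t_calc

def efficiency_from_overhead(fc):
    # Each processor computes for Tcalc out of Tcalc + Tcomm, so
    # E = Tcalc / (Tcalc + Tcomm) = 1 / (1 + fc).
    return 1.0 / (1.0 + fc)

fc = comm_overhead(t_calc=10.0, t_comm=1.0)  # fc = 0.1
e = efficiency_from_overhead(fc)             # about 0.91
```

This makes the slide's point quantitative: shrinking processes below a certain size drives Tcomm up relative to Tcalc, inflating fc and dragging efficiency down.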
Interconnection Network
The interconnection network (IN) is a key component of the architecture. It has a decisive influence on:
◦ the overall performance; and
◦ the total cost of the architecture.
The traffic in the IN consists of data transfers and transfers of commands and requests (control information).
The key parameters of the IN are:
◦ total bandwidth: transferred bits/second; and
◦ implementation cost.
Single bus networks are simple and cheap.
Only one communication is allowed at a time; the bandwidth is shared by all nodes.
Performance is relatively poor.
In order to maintain a certain level of performance, the number of nodes is limited (around 16 - 20).
◦ Multiple buses can be used instead, if needed.
Each node is connected to every other one.
Communications can be performed in parallel between any pair of nodes.
Both performance and cost are high.
Cost increases rapidly with the number of nodes.
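The rapid cost growth is easy to see by counting links (a small illustrative helper, not from the lecture):

```python
def links_fully_connected(n):
    # Every node is wired to every other node: n * (n - 1) / 2 links,
    # so link count (and cost) grows quadratically with node count.
    return n * (n - 1) // 2

small = links_fully_connected(16)  # 120 links
large = links_fully_connected(64)  # 2016 links: 4x the nodes, ~17x the links
```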
A dynamic network: the interconnection topology can be modified by configuring the switches.
It is completely connected: any node can be directly connected to any other.
Fewer interconnections are needed than for the static completely connected network; however, a large number of switches is needed.
A large number of communications can be performed in parallel (even though one node can receive or send only one data item at a time).
Cheaper than completely connected networks, while giving relatively good performance.
In order to transmit data between two nodes, routing through intermediate nodes is needed (maximum 2×(n−1) intermediates for an n×n mesh).
It is possible to provide wrap-around connections:
◦ Torus.
Three-dimensional meshes have also been implemented.
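Route lengths on a mesh follow directly from node coordinates. This sketch (dimension-order routing; names invented for illustration) also shows how the torus wrap-around shortens worst-case routes:

```python
def mesh_hops(src, dst):
    # On a 2-D mesh, the route length is the Manhattan distance
    # between the two nodes' (row, column) coordinates.
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])

def torus_hops(src, dst, n):
    # Wrap-around links let a message take the "short way" around
    # each of the two rings of an n x n torus.
    dr = abs(src[0] - dst[0])
    dc = abs(src[1] - dst[1])
    return min(dr, n - dr) + min(dc, n - dc)

# In an n x n mesh the worst case is corner to corner: 2 * (n - 1) hops,
# e.g. 6 hops from (0, 0) to (3, 3) in a 4 x 4 mesh, but only 2 on a torus.
```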
2^n nodes are arranged in an n-dimensional cube. Each node is connected to n neighbours.
In order to transmit data between two nodes, routing through intermediate nodes is needed (maximum n intermediates).
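Hypercube routing has a neat closed form (an illustrative sketch, not from the lecture): number the nodes 0 to 2^n − 1 so that neighbours differ in exactly one address bit, and the route length is the Hamming distance between the two addresses:

```python
def hypercube_hops(a, b):
    # Neighbours differ in exactly one bit of their binary address,
    # so the route length equals the Hamming distance of a and b.
    return bin(a ^ b).count("1")

# In a 4-dimensional hypercube (16 nodes) the worst case is 4 hops,
# e.g. from node 0 (0b0000) to node 15 (0b1111).
```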
The growing need for high performance cannot always be satisfied by computers with a single CPU.
With parallel computers, several CPUs are running concurrently in order to solve a given application.
Parallel programs have to be available in order to make use of parallel computers.
Computers can be classified based on the nature of the instruction flow and that of the data flow on which the instructions operate.
A key component of a parallel architecture is also the interconnection network.
The performance we can get with a parallel computer depends not only on the number of available processors but is limited by the characteristics of the executed programs.
End of the Lecture