24-25_parallel processing.pdf
TRANSCRIPT
-
8/16/2019 24-25_Parallel Processing.pdf
1/36
Rehan Azmat
Lecture 24 and 25: Parallel Processing
Introduction and Motivation
Loop Unrolling
The IA-64 Architecture
Features of Itanium
Introduction and Motivation
Architecture Classification
Performance of Parallel Architectures
Interconnection Network
Reduction of instruction execution time:
◦ Increased clock frequency by fast circuit technology
◦ Simplified instructions (RISC)
Parallelism within the processor:
◦ Pipelining
◦ Parallel execution of instructions (ILP)
    Superscalar
    VLIW architectures
Parallel processing
Sequential computers often are not able to meet the performance needs of certain applications:
◦ Fluid flow analysis and aerodynamics;
◦ Simulation of large complex systems in physics, economy, biology, etc.;
◦ Computer-aided design;
◦ Multimedia.
Applications in the above domains are characterized by a very large amount of numerical computation and/or a high quantity of input data.
In order to deliver sufficient performance for such applications, we can have many processors in a single computer.
Parallel computers refer to architectures in which many CPUs are running in parallel to implement a certain application.
Such computers can be organized in very different ways, depending on several key features:
◦ number and complexity of individual CPUs;
◦ availability of common (shared) memory;
◦ interconnection topology;
◦ performance of the interconnection network;
◦ I/O devices;
◦ etc.
In order to solve a problem using a parallel computer, one must decompose the problem into small sub-problems that can be solved in parallel.
◦ The results of the sub-problems may have to be combined to get the final result of the main problem.
Because of data dependencies among the sub-problems, it is not easy to decompose a complex problem so as to generate a large degree of parallelism.
Because of these data dependencies, the processors may have to communicate with each other.
The time taken for communication is usually very high compared with the processing time.
The communication scheme must therefore be very well planned in order to get a good parallel algorithm.
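As a minimal sketch of this decompose–solve–combine pattern (a hypothetical example, not from the lecture), the sum of a large list can be split into chunks whose partial sums are computed by workers and then combined:

```python
from concurrent.futures import ThreadPoolExecutor

def partial_sum(chunk):
    # Sub-problem: each worker sums its own chunk independently.
    return sum(chunk)

def parallel_sum(data, workers=4):
    # Decompose the problem into roughly equal sub-problems.
    size = (len(data) + workers - 1) // workers
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    # Solve the sub-problems in parallel, then combine the results.
    # (A real parallel machine would run these on separate CPUs;
    # threads are used here only to keep the sketch self-contained.)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(partial_sum, chunks))

print(parallel_sum(range(1000)))  # 499500
```

Summation decomposes this easily only because the chunks have no data dependencies; as the slide notes, most problems do have them and then require planned communication between workers.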
Architectures Classification
Flynn’s classification (1966) is based on the nature of the instruction flow executed by the computer and that of the data flow on which the instructions operate.
The multiplicity of instruction streams and data streams gives us four different classes:
◦ Single instruction, single data stream (SISD)
◦ Single instruction, multiple data stream (SIMD)
◦ Multiple instruction, single data stream (MISD)
◦ Multiple instruction, multiple data stream (MIMD)
Single processor
Single instruction stream
Data stored in a single memory
Single machine instruction stream
Simultaneous execution on different sets of data
Large number of processing elements
Lockstep synchronization among the processing elements
The processing elements can:
◦ have their own private data memory; or
◦ share a common memory via an interconnection network.
Array and vector processors are the most common SIMD machines.

Processor lockstep is a technique used to achieve high reliability in a microprocessor system. A second identical processor is added to the system to monitor and verify the operation of the system processor. The two processors are initialized to the same state during system start-up and receive the same inputs, so during normal operation the state of the two processors is identical from clock to clock; they are said to be operating in lockstep.
Lockstep systems are fault-tolerant computer systems that run the same set of operations at the same time in parallel.
• Global memory can be accessed by all processors of a parallel computer.
• Data in the global memory can be read and written by any of the processors.
A single sequence of data is transmitted to a set of processors.
Each processor executes a different instruction sequence.
This class has never been implemented.
A set of processors
Simultaneously execute different instruction sequences
On different sets of data
The MIMD class can be further divided:
◦ Shared memory (tightly coupled):
    Symmetric multiprocessor (SMP)
    Non-uniform memory access (NUMA)
◦ Distributed memory (loosely coupled) = clusters
    Each processor in a parallel computer has its own memory (local memory); no other processor can access this memory.
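To make the tightly/loosely coupled distinction concrete, here is a small illustrative sketch (not from the lecture; Python threads stand in for processors). Shared-memory workers update one common variable under a lock, while distributed-memory workers keep local state and exchange results only via messages:

```python
import threading
import queue

# Shared memory (tightly coupled): all workers read and write one
# common variable; a lock prevents races on the shared state.
counter = 0
lock = threading.Lock()

def shared_worker(n):
    global counter
    for _ in range(n):
        with lock:
            counter += 1

# Distributed memory (loosely coupled): each worker keeps local state,
# invisible to the others, and communicates only by sending messages.
def message_worker(n, channel):
    local = 0               # local memory of this "processor"
    for _ in range(n):
        local += 1
    channel.put(local)      # explicit communication of the result

threads = [threading.Thread(target=shared_worker, args=(1000,)) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()

chan = queue.Queue()
workers = [threading.Thread(target=message_worker, args=(1000, chan)) for _ in range(4)]
for t in workers: t.start()
for t in workers: t.join()
total = sum(chan.get() for _ in range(4))
```

Both variants compute the same total (4000), but the shared version needs synchronization on every update, while the message-passing version communicates once per worker, mirroring the SMP-versus-cluster trade-off.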
IS = Instruction Stream
DS = Data Stream
LM = Local Memory
Figure: Distributed Memory and Shared Memory organizations
Performance of Parallel Architectures
Important questions:
How fast does a parallel computer run at its maximal potential?
What execution speed can we expect from a parallel computer for a given application?
How do we measure the performance of a parallel computer and the performance improvement we get by using such a computer?
Peak rate: the maximal computation rate that can be theoretically achieved when all modules are fully utilized.
◦ The peak rate is of no practical significance for the user.
◦ It is mostly used by vendor companies for marketing their computers.
Speedup: measures the gain we get by using a certain parallel computer to run a given parallel program in order to solve a specific problem:
S = Ts / Tp
(Ts: execution time of the sequential computation; Tp: execution time on the parallel computer.)
Efficiency: this metric relates the speedup to the number of processors used; by this it provides a measure of how efficiently the processors are used:
E = S / P
S: speedup; P: number of processors.
For the ideal situation, in theory, S = P, which means E = 1.
Practically, the ideal efficiency of 1 cannot be achieved.
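Assuming the usual definitions (S = Ts/Tp and E = S/P), both metrics are one-line computations; the timings below are invented for illustration:

```python
def speedup(t_seq, t_par):
    # S = Ts / Tp: gain of the parallel run over the sequential one.
    return t_seq / t_par

def efficiency(s, p):
    # E = S / P: relates the speedup to the number of processors used.
    return s / p

s = speedup(100.0, 20.0)  # hypothetical: 100 s sequential, 20 s parallel -> S = 5
e = efficiency(s, 8)      # on 8 processors: 5 / 8 = 0.625, below the ideal 1.0
```

An efficiency well below 1, as here, means most of the machine's processors are idle or doing overhead work for much of the run.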
Consider f to be the ratio of computations that, according to the algorithm, have to be executed sequentially (0 ≤ f ≤ 1), and P the number of processors.
The speedup S is then bounded by Amdahl's law:
S = 1 / (f + (1 − f) / P)
Amdahl's law is used to find the maximum expected improvement to an overall system when only part of the system is improved. It is often used in parallel computing to predict the theoretical maximum speedup using multiple processors.
-
Amdahl's law says that even a small ratio of sequential computation imposes a limit on speedup:
◦ A speedup higher than 1/f cannot be achieved, regardless of the number of processors, since S → 1/f as P → ∞.
◦ To efficiently exploit a high number of processors, f must be small (the algorithm has to be highly parallel).
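Amdahl's bound can be checked numerically with a one-line helper (a sketch; the function and parameter names are mine):

```python
def amdahl_speedup(f, p):
    # S = 1 / (f + (1 - f) / P): f is the sequential fraction of the
    # algorithm, P the number of processors.
    return 1.0 / (f + (1.0 - f) / p)

# With f = 5% sequential work the speedup saturates near 1/f = 20,
# no matter how many processors are added.
few = amdahl_speedup(0.05, 10)       # about 6.9
many = amdahl_speedup(0.05, 100000)  # about 20.0, but never reaches 20
```

Going from 10 to 100,000 processors buys less than a 3x improvement here, which is exactly the limit the slide describes.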
Besides the intrinsic sequentiality of parts of an algorithm, there are also other factors that limit the achievable speedup:
◦ communication cost
◦ load balancing of processors
◦ costs of creating and scheduling processes
◦ I/O operations (mostly sequential in nature)
There are many algorithms with a high degree of parallelism:
◦ the value of f is very small and can be ignored;
◦ they are suited for massively parallel systems; and
◦ the other limiting factors, like the cost of communications, become critical.
Consider a highly parallel computation, so that f can be neglected. We define fc, the fractional communication overhead of a processor:
fc = Tcomm / Tcalc
◦ Tcalc: the time that a processor executes computations;
◦ Tcomm: the time that a processor is idle because of communication.
With algorithms having a high degree of parallelism, massively parallel computers, consisting of a large number of processors, can be used efficiently if fc is small.
◦ The time spent by a processor on communication has to be small compared to its time for computation.
In order to keep fc reasonably small, the size of processes cannot go below a certain limit.
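Taking fc = Tcomm/Tcalc, and assuming f is negligible so that each processor is busy for Tcalc out of every Tcalc + Tcomm time units, efficiency becomes E = 1/(1 + fc). A quick sketch (illustrative names and numbers, not from the lecture):

```python
def comm_overhead(t_calc, t_comm):
    # fc = Tcomm / Tcalc: fractional communication overhead.
    return t_comm / t_calc

def efficiency_from_overhead(fc):
    # Each processor computes for Tcalc out of Tcalc + Tcomm, so
    # E = Tcalc / (Tcalc + Tcomm) = 1 / (1 + fc).
    return 1.0 / (1.0 + fc)

fc = comm_overhead(t_calc=10.0, t_comm=1.0)  # fc = 0.1
e = efficiency_from_overhead(fc)             # about 0.91
```

This makes the slide's point quantitative: shrinking processes below a certain size drives Tcomm up relative to Tcalc, inflating fc and dragging efficiency down.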
Interconnection Network
The interconnection network (IN) is a key component of the architecture. It has a decisive influence on:
◦ the overall performance; and
◦ the total cost of the architecture.
The traffic in the IN consists of data transfers and transfers of commands and requests (control information).
The key parameters of the IN are:
◦ total bandwidth: transferred bits/second; and
◦ implementation cost.
Single bus networks are simple and cheap.
Only one communication is allowed at a time; the bandwidth is shared by all nodes.
Performance is relatively poor.
In order to maintain a certain level of performance, the number of nodes is limited (around 16 - 20).
◦ Multiple buses can be used instead, if needed.
Each node is connected to every other one.
Communications can be performed in parallel between any pair of nodes.
Both performance and cost are high.
Cost increases rapidly with the number of nodes.
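The rapid cost growth is easy to see by counting links (a small illustrative helper, not from the lecture):

```python
def links_fully_connected(n):
    # Every node is wired to every other node: n * (n - 1) / 2 links,
    # so link count (and cost) grows quadratically with node count.
    return n * (n - 1) // 2

small = links_fully_connected(16)  # 120 links
large = links_fully_connected(64)  # 2016 links: 4x the nodes, ~17x the links
```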
A dynamic network: the interconnection topology can be modified by configuring the switches.
It is completely connected: any node can be directly connected to any other.
Fewer interconnections are needed than for the static completely connected network; however, a large number of switches is needed.
A large number of communications can be performed in parallel (even though one node can receive or send only one data item at a time).
Cheaper than completely connected networks, while giving relatively good performance.
In order to transmit data between two nodes, routing through intermediate nodes is needed (maximum 2×(n−1) intermediates for an n×n mesh).
It is possible to provide wrap-around connections:
◦ Torus.
Three-dimensional meshes have also been implemented.
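Route lengths on a mesh follow directly from node coordinates. This sketch (dimension-order routing; names invented for illustration) also shows how the torus wrap-around shortens worst-case routes:

```python
def mesh_hops(src, dst):
    # On a 2-D mesh, the route length is the Manhattan distance
    # between the two nodes' (row, column) coordinates.
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])

def torus_hops(src, dst, n):
    # Wrap-around links let a message take the "short way" around
    # each of the two rings of an n x n torus.
    dr = abs(src[0] - dst[0])
    dc = abs(src[1] - dst[1])
    return min(dr, n - dr) + min(dc, n - dc)

# In an n x n mesh the worst case is corner to corner: 2 * (n - 1) hops,
# e.g. 6 hops from (0, 0) to (3, 3) in a 4 x 4 mesh, but only 2 on a torus.
```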
2^n nodes are arranged in an n-dimensional cube. Each node is connected to n neighbours.
In order to transmit data between two nodes, routing through intermediate nodes is needed (maximum n intermediates).
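Hypercube routing has a neat closed form (an illustrative sketch, not from the lecture): number the nodes 0 to 2^n − 1 so that neighbours differ in exactly one address bit, and the route length is the Hamming distance between the two addresses:

```python
def hypercube_hops(a, b):
    # Neighbours differ in exactly one bit of their binary address,
    # so the route length equals the Hamming distance of a and b.
    return bin(a ^ b).count("1")

# In a 4-dimensional hypercube (16 nodes) the worst case is 4 hops,
# e.g. from node 0 (0b0000) to node 15 (0b1111).
```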
The growing need for high performance cannot always be satisfied by computers with a single CPU.
With parallel computers, several CPUs are running concurrently in order to solve a given application.
Parallel programs have to be available in order to make use of parallel computers.
Computers can be classified based on the nature of the instruction flow and that of the data flow on which the instructions operate.
A key component of a parallel architecture is also the interconnection network.
The performance we can get with a parallel computer depends not only on the number of available processors but is limited by the characteristics of the executed programs.
End of the Lecture