
Performance Analysis of Parallel Processing

Dr. Subra Ganesan
Department of Computer Science and Engineering
School of Engineering and Computer Science
Oakland University

CSE664 Project Report
Winter 2000
4/29/2000

BY

Ahmad Milhim


Contents

1. Introduction
2. Definitions
   2.1 Performance analysis
   2.2 Performance analysis techniques
   2.3 Performance analysis metrics
3. Performance means
   3.1 Arithmetic mean performance
   3.2 Geometric mean performance
   3.3 Harmonic mean performance
4. Speedup performance laws
   4.1 Asymptotic speedup
   4.2 Harmonic speedup
   4.3 Amdahl's speedup law
5. Workloads
   5.1 Types of test workloads
   5.2 HINT

6. References


1. Introduction

The goal of all computer system users and designers is to get the highest

performance at the lowest cost. Basic knowledge of performance evaluation

terminology and technology is a must for computer professionals to evaluate

their systems. Performance analysis is required at every stage of the life cycle

of a computer system, starting from the design stage to manufacturing, sales,

and upgrade.

Designers use performance analysis when they want to compare a number of alternative designs and pick the best one; system administrators use it to choose a system from a set of candidate systems for a certain application; and users need to know how well their systems are performing and whether an upgrade is needed.

It is said that performance analysis is an art: every analysis requires an intimate knowledge of the system being evaluated. Computer applications are so numerous and varied that it is not possible to have a single standard measure of performance for all of them. Three techniques are used for performance analysis: analytical modeling, simulation, and measurement. Several considerations help analysts choose which technique to use. The key consideration is the stage the system is in: if the system is only a concept, analytical modeling and simulation are the only possible techniques, although they can be based on previous measurements of similar systems. Other, less important considerations are the time available for the analysis, the tools at hand, the required accuracy, and the cost.

Users of computer systems seek ways to increase the productivity of their systems, in terms of both hardware and programmers, and thereby reduce the cost of computing. Computer throughput was increased by making the operating system handle resource sharing. There has always been a desire to evaluate how well computer systems are performing and to find ways of improving their performance.

Measurement and analysis of parallel processing is a relatively new field. The analysis of parallel processing helps users and researchers answer key questions about the configuration of parallel systems. The questions that analysts usually try to answer are: What is the best way to configure the processing elements to solve a particular problem? How much overhead is generated by parallelism? What is the effect of varying the number of processors on performance? Is it better to use shared memory or local memory? And so on.

2. Definitions. This section introduces the most common terms and concepts relating to the performance analysis of systems in general and of parallel processing in particular, and shows how these terms are applied to computer systems and parallel processing. First the definition of performance analysis is given; then performance analysis techniques and metrics are discussed in more detail.

2.1 Performance analysis. The ideal performance of a computer system is achieved when a perfect match between hardware capability and software behavior is reached.


2.2 Performance analysis techniques. Three techniques are used for performance analysis: analytical modeling, simulation, and measurement. Which one to use for the analysis of a given system depends on certain considerations. The most important is the stage the system is in: measurement cannot be performed unless the system, or at least a similar one, already exists, and if the proposed design is a new idea then only analytical modeling or simulation can be used.

2.2.1 Analytical modeling. Used when the system is in the early design stages and the results are needed soon. It provides insight into the underlying system, but it may not be precise because of the simplifications made in the models and mathematical equations.

2.2.2 Simulation. A useful technique for analysis. Simulation models provide an easy way to predict the performance of computer systems before they exist, and simulation is also used to validate the results of analytical modeling and measurement. It provides snapshots of system behavior.

2.2.3 Measurement. It can be used only if the system is available for measurement (postprototype) or at least a similar system exists. Its cost is high compared with the other two techniques because hardware and software instrumentation is needed to perform the measurements, and it takes a long time to monitor the system efficiently. The most important considerations used to choose among the three techniques are summarized in Figure 1 below.


Figure 1. Considerations in order of importance

Criterion       Analytical modeling   Simulation            Measurement
Stage           Any                   Any                   Postprototype
Time required   Small                 Medium                Varies
Tools           Analysts              Computer languages    Instruments
Accuracy        Low                   Moderate              Varies
Cost            Small                 Medium                High

(The criteria are listed from most to least important.)

It is said that one should not trust the results of one technique until they have been validated by the other two techniques. This is to emphasize that relying on a single technique may be misleading, or at least may not give accurate results.

2.3 Performance metrics. Each performance study has its own metrics, defined and used to evaluate the specific system under study. Although these metrics vary from one system to another, common metrics exist and are used in most general computer system evaluations. The most common of these metrics are introduced here.


2.3.1 Response time. The interval between a user's request and the system's response. In general, the response time increases as the load increases. Related terms are turnaround time and reaction time: turnaround time is the time elapsed between submitting a job and the completion of its output, and reaction time is the time elapsed between submitting a request and the beginning of its execution by the system. The stretch factor is the ratio of the response time at a particular load to that at the minimum load.
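As a small illustration of these definitions, the following sketch computes the quantities from per-request timestamps; the function names and the timestamp values are hypothetical, chosen only to make the ratios concrete.

def response_time_metrics(submit, start_exec, respond, complete):
    """Timestamps (in seconds) for a single request; all values are made up."""
    reaction_time   = start_exec - submit    # submission -> execution begins
    response_time   = respond - submit       # submission -> system responds
    turnaround_time = complete - submit      # submission -> output completed
    return reaction_time, response_time, turnaround_time

def stretch_factor(response_at_load, response_at_min_load):
    # Ratio of the response time at a given load to that at the minimum load.
    return response_at_load / response_at_min_load

reaction, response, turnaround = response_time_metrics(0.0, 0.4, 2.5, 6.0)
print(reaction, response, turnaround)      # 0.4 2.5 6.0
print(stretch_factor(2.5, 0.5))            # 5.0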

2.3.2 Throughput. The rate (requests per unit time) at which requests can be serviced by the system. For CPUs the rate is measured in millions of instructions per second (MIPS) or millions of floating-point operations per second (Mflops); for networks it is measured in bits per second (bps) or packets per second; and for transaction systems it is measured in transactions per second (TPS). In general, the throughput of a system increases as the load initially increases; beyond a certain load the throughput stops increasing and in some cases starts decreasing.

The maximum achievable throughput under ideal workload conditions is called the nominal capacity of the system; in computer networks it is the bandwidth, measured in bits per second. The response time at the nominal capacity is often too high, which leads to the definition of a new term, the usable capacity: the maximum achievable throughput without exceeding a pre-specified response time. The optimal operating point is the point after which the response time increases rapidly as a function of the load while the gain in throughput is small; it is the knee capacity in Figure 2 below.


Figure 2. Throughput and response time as functions of the load, showing the nominal capacity, usable capacity, and knee capacity of the system.
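The sketch below shows one way these capacities could be estimated from measured points. The data are hypothetical, the response-time limit is an assumed parameter, and the knee is located with a common heuristic (maximizing throughput divided by response time) that the report itself does not prescribe.

# Hypothetical measurements: offered load (req/s), achieved throughput (req/s),
# and response time (s) at each load level.
loads          = [1, 2, 4, 8, 16, 32, 64]
throughputs    = [1.0, 2.0, 4.0, 7.5, 12.0, 13.0, 12.5]
response_times = [0.10, 0.11, 0.13, 0.20, 0.45, 1.50, 4.00]

# Nominal capacity: maximum achievable throughput.
nominal_capacity = max(throughputs)

# Usable capacity: largest throughput whose response time stays under a
# pre-specified bound (the bound below is an assumption).
LIMIT = 0.5
usable_capacity = max(x for x, r in zip(throughputs, response_times) if r <= LIMIT)

# Knee: heuristic pick of the point with the best throughput/response-time
# trade-off, before response time starts climbing rapidly.
knee_index = max(range(len(loads)), key=lambda i: throughputs[i] / response_times[i])

print(nominal_capacity, usable_capacity, loads[knee_index])   # 13.0 12.0 8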


2.3.3 System efficiency, E(n). The ratio of the maximum achievable throughput (usable capacity) to the maximum achievable capacity under ideal workload conditions. It indicates the actual degree of speedup achieved compared with the maximum possible for parallel processing; in other words, it is the ratio of the performance of an n-processor system to that of a single-processor system. The minimum efficiency corresponds to the case where the entire program is executed sequentially on a single processor, and the maximum efficiency to the case where all processors are fully utilized throughout the execution period, as illustrated in Figure 3.

E(n)=S(n)/n


Figure 3. Efficiency versus number of processors.


2.3.4 Speedup, S(n). An indication of the degree of speed gain in parallel computation; it is discussed in more detail later in this paper. There are three different speedup measures: asymptotic speedup, harmonic speedup, and Amdahl's speedup law. Simply stated, the speedup is the ratio of the time taken by one processor to execute a program to the time taken by n processors to execute the same program.

S(n) = T(1)/T(n)

2.3.5 Redundancy, R(n). The ratio of the total number of unit operations performed by an n-processor system, O(n), to the total number of unit operations performed by a single-processor system, O(1).

R(n) = O(n)/O(1)

2.3.6 Utilization, U(n). The fraction of time a resource is busy, that is, the ratio of busy time to total time over an observation period; idle time is the time during which a resource is not used. Balancing the load among resources produces more efficient systems when a parallel processing system is being designed. In terms of system redundancy and efficiency, the utilization is the redundancy times the efficiency; it indicates the fraction of the resources that were kept busy during the execution of a parallel program.

U(n) = R(n)*E(n)


2.3.7 Quality, Q(n). This metric combines the effects of speedup, efficiency, and redundancy to assess the relative merits of a parallel computation. It is directly proportional to the speedup and efficiency and inversely related to the redundancy.

Q(n) = S(n)*E(n)/R(n)

Since the efficiency is always a fraction and the redundancy is at least one, the quality of a parallel computation is upper bounded by the speedup.
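To make the relationships among these metrics concrete, here is a minimal worked sketch that evaluates the formulas above; the timings and operation counts are hypothetical.

n  = 4          # number of processors
T1 = 100.0      # execution time on 1 processor (s)
Tn = 40.0       # execution time on n processors (s)
O1 = 1.0e9      # unit operations performed by the 1-processor run
On = 1.2e9      # unit operations performed by the n-processor run (extra = overhead)

S = T1 / Tn       # speedup       S(n) = T(1)/T(n)       -> 2.5
E = S / n         # efficiency    E(n) = S(n)/n          -> 0.625
R = On / O1       # redundancy    R(n) = O(n)/O(1)       -> 1.2
U = R * E         # utilization   U(n) = R(n)*E(n)       -> 0.75
Q = S * E / R     # quality       Q(n) = S(n)*E(n)/R(n)  -> ~1.30

print(S, E, R, U, Q)
assert Q <= S     # the quality is upper bounded by the speedup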

2.3.8 Reliability. It is measured by the probability of errors or by the mean

time between errors.

2.3.9 Availability. The availability of a resource is the fraction of time the resource is available to service users' requests. Two other terms used in system analysis are downtime and uptime: downtime is the period during which the system is not available, and uptime is the time during which the system is available [1].

2.3.10 Cost/performance ratio. This metric is used to compare two or more systems in terms of their cost and performance. The cost should include the cost of hardware, software, and so on, and the performance is measured in terms of throughput for a pre-specified response time.


3. Performance means.

To understand the concepts of performance analysis and the significance of the results produced by the analysis techniques, one should understand the general concepts of mathematical performance laws and performance means. In this section performance means are discussed briefly.

3.1 Arithmetic mean performance. The arithmetic mean is the ratio of the sum of the execution rates of all programs to the total number of programs used. Symbolically, let Rj be the execution rate of program j, j = 1, 2, ..., m; then the arithmetic mean performance is

Ra = (1/m) Σ Rj,  j = 1, ..., m.

This is valid only if all programs have equal weight. If the programs are weighted, the weighted arithmetic mean is

Ra* = Σ fj Rj,  j = 1, ..., m,

where fj is the weight of program j and the weights sum to one.

3.2 Geometric mean performance. The geometric mean of m values is obtained by multiplying the values together and taking the m-th root of the product; it is appropriate only when the product of the values is a quantity of interest. The geometric mean of the execution rates Rj is

Rg = (Π Rj)^(1/m),  j = 1, ..., m,

where m is the number of programs. Again, this value is correct only if all the programs have equal weights. Neither the arithmetic mean nor the geometric mean represents the real performance of the execution rates of benchmark programs, which leads to the definition of a third kind of performance mean.


3.3 Harmonic mean performance. The harmonic mean of m values is the ratio of the number of values m to the sum of the reciprocals 1/xj, where xj is the j-th value in the set. In the context of performance analysis it gives the average performance across a large number of programs running in various execution modes; these modes correspond to scalar, vector, sequential, or parallel processing of different program parts. Symbolically, the harmonic mean performance is

Rh = m / Σ (1/Rj),  j = 1, ..., m,

where Rj is the execution rate of program j, m is the total number of programs, and Rh is the harmonic mean under equal weighting of all programs. The harmonic mean is the closest of the three means to the real performance.
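A short sketch computing the three means (and their weighted forms) for a set of execution rates follows; the rates and weights are hypothetical, chosen only for illustration.

from math import prod

rates   = [40.0, 10.0, 80.0, 20.0]   # Rj: execution rates (e.g. Mflops)
weights = [0.25, 0.25, 0.25, 0.25]   # fj: program weights, summing to 1
m = len(rates)

Ra  = sum(rates) / m                                    # arithmetic mean
Raw = sum(f * r for f, r in zip(weights, rates))        # weighted arithmetic mean
Rg  = prod(rates) ** (1.0 / m)                          # geometric mean
Rh  = m / sum(1.0 / r for r in rates)                   # harmonic mean (equal weights)
Rhw = 1.0 / sum(f / r for f, r in zip(weights, rates))  # weighted harmonic mean

# For execution rates the harmonic mean is the smallest and the most
# conservative of the three: here ~21.3 vs. ~28.3 (geometric) and 37.5 (arithmetic).
print(Ra, Raw, Rg, Rh, Rhw)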

4. Speedup performance laws.

4.1 Asymptotic speedup law. Consider a parallel system of n processors and denote by wi the amount of work done with a degree of parallelism (DOP) equal to i. The execution time of wi on a single processor is ti(1); similarly, the execution time of wi on k processors is ti(k). T(1) is the response time of a single-processor system executing the workload, and T(∞) is the response time for executing the same workload when an infinite number of processors is available. The asymptotic speedup S∞ is defined as the ratio of T(1) to T(∞):

S∞ = T(1) / T(∞)
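A minimal sketch of this definition is given below. It assumes the usual DOP-profile formulation (as in reference 2): every processor delivers Δ units of work per unit time, so ti(1) = wi/Δ and ti(∞) = wi/(iΔ); the work profile itself is hypothetical.

def asymptotic_speedup(work, delta=1.0):
    """work: dict mapping DOP i -> amount of work w_i executed at that DOP."""
    T1    = sum(w / delta for w in work.values())          # single processor
    T_inf = sum(w / (i * delta) for i, w in work.items())  # unlimited processors
    return T1 / T_inf

# Hypothetical profile: 20 units of serial work, 50 at DOP 4, 30 at DOP 16.
profile = {1: 20.0, 4: 50.0, 16: 30.0}
print(asymptotic_speedup(profile))   # 100 / (20 + 12.5 + 1.875) ~ 2.91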


4.2 Harmonic speedup law. Suppose a workload of multiple programs is to be executed on an n-processor system; the workload may use a different number of processors at different times during its execution. The program is executed in mode i if i processors are used, and Ri is the corresponding execution rate, reflecting the collective speed of the i processors. The weighted harmonic speedup S is defined as the ratio of the sequential execution time T1 to the weighted arithmetic mean execution time T* across the n execution modes.

S = T1 / T*
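The following sketch evaluates this definition for a hypothetical mode profile. It assumes T1 is the per-unit-work time in the purely sequential mode (1/R1) and that T* is the weighted arithmetic mean of the per-unit-work times 1/Ri, with the mode fractions fi summing to one.

fractions = {1: 0.2, 2: 0.3, 4: 0.5}     # f_i: fraction of work done in mode i
rates     = {1: 10.0, 2: 18.0, 4: 32.0}  # R_i: collective execution rate in mode i

T1     = 1.0 / rates[1]                                   # sequential time per unit work
T_star = sum(f / rates[i] for i, f in fractions.items())  # weighted arithmetic mean time
S      = T1 / T_star                                      # S = T1 / T*
print(round(S, 3))   # ~1.912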

4.3 Amdahl's speedup law. This law can be derived from the previous laws under the assumption that the system works in only two modes: fully sequential mode with probability α, or fully parallel mode with probability 1 - α. Amdahl's speedup Sn is then the ratio of the number of processors n to 1 + (n - 1)α:

Sn = n / (1 + (n - 1)α)

This implies that, under the above assumption, the best speedup we can get is upper bounded by 1/α regardless of how many processors the system actually has, because Sn → 1/α as n → ∞. In Figure 4, Sn is plotted as a function of n for different values of α. Note that the ideal speedup is achieved when α = 0, and that the speedup drops sharply as α increases.

Figure 4. Amdahl's speedup Sn versus the number of processors n (logarithmic axes), for α = 0, 0.01, 0.1, and 0.9.
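A short sketch of the law follows; it tabulates Sn for the four α values shown in Figure 4 and reports the 1/α bound (the processor counts are just sample points).

def amdahl_speedup(n, alpha):
    # Sn = n / (1 + (n - 1) * alpha)
    return n / (1.0 + (n - 1) * alpha)

for alpha in (0.0, 0.01, 0.1, 0.9):
    speedups = [round(amdahl_speedup(n, alpha), 2) for n in (1, 10, 100, 1000, 10000)]
    bound = "unbounded (Sn = n)" if alpha == 0 else 1.0 / alpha
    print(f"alpha = {alpha}: {speedups}  (limit as n -> infinity: {bound})")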

5. Workloads.


The term test workload denotes any workload used in a performance study. Two types of workload exist. A real workload is one observed in normal operation; it cannot be repeated, and therefore it is not suitable for use as a test workload. A synthetic workload is a load that can be applied repeatedly and that models the real workload.

5.1 Types of test workloads

5.1.1 Addition instruction. Most early computers were designed around arithmetic logic units that used the addition instruction to perform most of the computations needed at the time, so using the addition instruction to measure performance was good enough: the computer with the faster addition instruction was considered the better one.

5.1.2 Instruction mix. As the number of supported instructions increased, the addition instruction was no longer sufficient to measure the performance of computer systems. Instruction mix workloads, sets of instructions weighted by their frequency of use, were then introduced.

5.1.3 Kernels. Sets of instructions that constitute a higher-level function performed by the processor; kernels are used mainly to measure processor performance. A kernel measure does not reflect total system performance because most kernel workloads do not exercise I/O operations.

5.1.4 Synthetic programs. Simple programs written in a high-level language that make a specified number of I/O or service-call requests. This workload measures the CPU time and the time spent on the I/O requests. Synthetic programs do not make representative memory or secondary-storage references.

5.1.5 Application benchmarks. Application programs that represent the real workload. They make use of all available resources, including processors, networks, I/O devices, and databases. Sieve, Whetstone, Linpack, Dhrystone, and SPEC are some of the well-known benchmarks that have been used to evaluate the performance of computer systems. In the next section a newer benchmark, HINT, is discussed in more detail.

5.2 HINT.

HINT is a benchmark developed by John L. Gustafson and Quinn O. Snell. “It is a practical approach that provides mathematically sound comparison of computational performance even when the algorithms, computer, and precision are changed” [4]. It can be used to compare computing as slow as a human with a calculator to computing as fast as the best supercomputers. Most benchmarks are based on measuring the time various computers take to complete a fixed-size task; others, such as database benchmarks, fix the time and vary the job size. HINT, which stands for Hierarchical INTegration, is based on a different concept: it fixes neither the time nor the problem size.

HINT produces a speed measure called QUIPS, QUality Improvement Per Second. It reveals memory bandwidth; it is scalable, comparing computing as slow as hand calculation to computing as fast as the largest supercomputers; it is portable, porting to every sequential and parallel environment with very little effort; and it has a low cost, permitting low-cost comparison of any architecture. A computer with twice the QUIPS rating is twice as powerful, so it must have more arithmetic speed, precision, storage, or bandwidth.

The following HINT plots use a logarithmic scale for time; they illustrate the use of HINT as a performance analysis and evaluation benchmark for comparing different kinds of computer systems.

Figure 6. Comparison of different precisions

Figure 7. Comparison of different clock speeds

Figure 9. Comparison of different main memory sizes

Figure 10. Comparison of a scalable parallel computer

Figure 11. Comparison of several parallel systems

Figure 12. Comparison of various workstations

References

1. Jain, Raj (1991). The Art of Computer Systems Performance Analysis: Techniques for Experimental Design, Measurement, Simulation, and Modeling. John Wiley and Sons.

2. Hwang, Kai (1993). Advanced Computer Architecture: Parallelism, Scalability, Programmability. McGraw-Hill.

3. McKerrow, Phillip (1988). Performance Measurement of Computer Systems. Addison-Wesley.

4. Gustafson, John L. and Snell, Quinn O. HINT: A New Way To Measure Computer Performance. (http://www.scl.ameslab.gov/HINT)