causal performance models -...
TRANSCRIPT
Performance Analysis
Jan Lemeire
Parallel Systems lab
November 18th 2010
See Chapter 6 of Jan’s PhD
Parallel Systems Course: Chapter VI
Kumar Chapter 5
Performance Metrics
Other metrics we could try to optimize:
energy consumption, cost, …
Parallel Matrix Multiplication: Execution Profile
Speedup = 2.55, Efficiency = 85%
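The slide's numbers can be reproduced with a short sketch. The run times below are hypothetical values, chosen so that S = 2.55 and E = 85% come out on p = 3 processors:

```python
# Speedup and efficiency from measured run times.
# t_seq and t_par are assumed, illustrative values.
def speedup(t_seq: float, t_par: float) -> float:
    return t_seq / t_par

def efficiency(t_seq: float, t_par: float, p: int) -> float:
    return speedup(t_seq, t_par) / p

t_seq = 10.2   # sequential execution time (s), assumed
t_par = 4.0    # parallel execution time on p processors (s), assumed
p = 3

print(f"Speedup = {speedup(t_seq, t_par):.2f}")        # 2.55
print(f"Efficiency = {efficiency(t_seq, t_par, p):.0%}")  # 85%
```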
Speedup i.f.o. processors
1) Ideal, linear speedup
2) Increasing, sub-linear speedup
3) Speedup with an optimal number of processors
4) No speedup
5) Super-linear speedup
Super-linear speedup
The parallel execution works with data that fits in higher-level memory, while this is not the case for the sequential execution
The work done in parallel is less than that of the sequential program; this is called a parallel anomaly. See Chapter DOP.
PPP p. 92
Speedup i.f.o. problem size
1) Constant speedup
2) Increasing asymptotically towards a sub-linear speedup value (< p)
3) Increasing towards p
4) Increasing towards super-linear speedup
Performance Analysis
Goals:
Understanding of the computational process in terms of resource consumption
Identification of inefficient patterns
Performance prediction
Performance characterization of program and system
Overhead or Lost Cycles
Impact of Overhead on Speedup?
Speedup & Overhead Ratios
Example 1: Execution Profile of Parallel Matrix Multiplication
Speedup = 2.55, Efficiency = 85%
Parallel Matrix Multiplication
Parallel anomaly = 2.6%
Overhead Classification
Control of parallelism: extra functionality necessary for parallelization (like partitioning)
Communication: overhead time not overlapping with computation
Idling: processor has to wait for further information
Parallel anomaly: useful work differs between sequential and parallel execution
KUMAR p195
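The lost-cycles view can be made concrete: the total cost p·Tpar splits into useful (sequential) work plus the overhead categories above. A minimal sketch, with purely illustrative numbers:

```python
# Lost cycles: cost of the parallel run minus the useful (sequential) work.
def overhead(t_seq: float, t_par: float, p: int) -> float:
    return p * t_par - t_seq

t_seq, t_par, p = 10.2, 4.0, 3   # assumed timings
total_overhead = overhead(t_seq, t_par, p)   # 3*4.0 - 10.2 = 1.8 s

# A profile would attribute these lost seconds to the four categories
# (the split below is made up for illustration):
breakdown = {"control": 0.4, "communication": 0.8, "idling": 0.5, "anomaly": 0.1}
assert abs(sum(breakdown.values()) - total_overhead) < 1e-9
```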
Causes of Idling
Limitations of parallelism
Load imbalances
Waiting for communication
Bottlenecks
Theoretical Analysis of Matrix Multiplication
Computation time (slave): T_comp = (n³/p)·t_work
Communication time (slave): T_comm ≈ 2·t_s + (1 + 2/p)·n²·t_w
Explains parameter dependence of performance:
Overhead = O(n².p)
Computation = O(n³/p)
Overhead ratio = O(1/n) (for fixed p)
Parameter Dependence of Matrix Multiplication
n: work size, here: matrix size
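This parameter dependence can be sketched numerically. The model below assumes per-slave times T_comp = (n³/p)·t_work and T_comm ≈ 2·t_s + (1 + 2/p)·n²·t_w, with hypothetical machine parameters t_work, t_s and t_w:

```python
# Theoretical model of parallel matrix multiplication (parameter values
# are assumptions, chosen only to show the trend).
def t_comp(n, p, t_work=1e-9):
    # per-slave computation: n^3/p multiply-adds
    return (n ** 3 / p) * t_work

def t_comm(n, p, t_s=1e-6, t_w=1e-8):
    # startup cost plus sending/receiving O(n^2) matrix elements
    return 2 * t_s + (1 + 2 / p) * n ** 2 * t_w

def overhead_ratio(n, p):
    return t_comm(n, p) / t_comp(n, p)

# Doubling n roughly halves the overhead ratio: the O(1/n) dependence.
r1, r2 = overhead_ratio(1000, 4), overhead_ratio(2000, 4)
print(r1 / r2)   # close to 2
```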
Example 2: Parallel Quicksort
Execution Profile of Parallel Quicksort
Quicksort’s performance
Speedup growth is limited!
Reason?
[Figure: Tp, speedup S and efficiency E i.f.o. p (n = 1000), without considering load imbalances]
Amdahl’s Law
Limitations of inherent parallelism: a part s of the algorithm is not parallelizable
PPP p. 80
T_seq = s·T_seq + (1 − s)·T_seq

with s·T_seq the non-parallelizable part and (1 − s)·T_seq the parallelizable part. Assume no other overhead:

T_par = s·T_seq + (1 − s)·T_seq / p

Speedup = T_seq / T_par = 1 / (s + (1 − s)/p) = p / (1 + (p − 1)·s)

max Speedup = 1/s (for p → ∞)
Amdahl’s Law
Speedup = p / (1 + (p − 1)·s)

Efficiency = Speedup / p = 1 / (1 + (p − 1)·s)

If p is big enough: Speedup ≈ 1/s, so

max Speedup = 1/s

s      max Speedup
10%    10
25%    4
50%    2
75%    1.33
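Amdahl's bound is easy to check numerically; a minimal sketch:

```python
# Amdahl's law: speedup on p processors when a fraction s of the
# sequential run time is not parallelizable.
def amdahl_speedup(s: float, p: int) -> float:
    return p / (1 + (p - 1) * s)

def max_speedup(s: float) -> float:
    # limit of amdahl_speedup for p -> infinity
    return 1 / s

# Reproduces the table: even with only 10% sequential work,
# speedup can never exceed 10, no matter how many processors.
for s in (0.10, 0.25, 0.50, 0.75):
    print(f"s = {s:.0%}: max speedup = {max_speedup(s):.2f}")
```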
Amdahl example: video decoding
Thanks to Wladimir van der Laan, University of Groningen
Example 3: Job Farming
Set of jobs & cluster of computers
= Independent task parallelism
{job1, job2, job3, job4}
[Figure: time diagram of the master and slaves slave1 and slave2. Each job follows the protocol A: Request job, B: Send job, C: Return result; jobs 1–4 are farmed out to the slaves, giving the parallel time Tpar, compared with the sequential time Tseq of jobs 1–4 run one after the other]
Speedup = 1
Performance of Job Farming?
Overheads? Bottlenecks?
Bottleneck at master => idling of slaves
Granularity = computation/communication
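A toy model (all parameters illustrative) of why granularity matters here: the master serves one request at a time, so when per-job computation t_j is not much larger than per-job communication t_c, the master becomes the bottleneck and the slaves idle:

```python
# Toy model of job farming with a serial master.
def farming_time(n_jobs, n_slaves, t_j, t_c):
    # t_j: computation per job (on a slave), t_c: communication per job
    master_bound = n_jobs * t_c                      # master handles every transfer
    slave_bound = (n_jobs / n_slaves) * (t_j + t_c)  # perfectly balanced slaves
    return max(master_bound, slave_bound)

def farming_speedup(n_jobs, n_slaves, t_j, t_c):
    t_seq = n_jobs * t_j
    return t_seq / farming_time(n_jobs, n_slaves, t_j, t_c)

# Coarse granularity (t_j >> t_c): near-linear speedup on 4 slaves.
print(farming_speedup(100, 4, t_j=1.0, t_c=0.01))
# Fine granularity (t_j ~ t_c): the master bottleneck caps speedup at t_j/t_c.
print(farming_speedup(100, 4, t_j=0.02, t_c=0.01))
```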
Scalability
A program is scalable if it can keep efficiency at a fixed value by simultaneously increasing the number of processors and the size of the problem.
It reflects a parallel system’s ability to utilize increasing processing resources effectively.
PPP p. 95
KUMAR p. 211
Gustafson’s law
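Gustafson's law takes this scaled-problem view: keep the parallel run time fixed and let the problem grow with p. A minimal sketch, where s is the sequential fraction of the parallel run:

```python
# Gustafson's law (scaled speedup): with sequential fraction s of the
# parallel run time, the scaled speedup is p - (p - 1) * s.
def gustafson_speedup(s: float, p: int) -> float:
    return p - (p - 1) * s

# Unlike Amdahl's fixed-size bound of 1/s, scaled speedup keeps
# growing as processors (and the problem) are added.
for p in (4, 16, 64):
    print(p, gustafson_speedup(0.10, p))
```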
Approach to follow
1. Generate/draw execution profile
2. Identify lost cycles
3. Study impact on overhead
4. Determine causes of overhead
5. Plot performance in function of p and W
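Steps 1 and 2 of this approach can be sketched with made-up per-processor phase timings:

```python
# Step 1: an execution profile as per-processor phase timings (all values
# are made up for illustration). Step 2: everything that is not
# computation counts as lost cycles.
profile = {
    "P0": {"computation": 3.4, "communication": 0.3, "idling": 0.3},
    "P1": {"computation": 3.2, "communication": 0.4, "idling": 0.4},
    "P2": {"computation": 3.6, "communication": 0.2, "idling": 0.2},
}

useful = sum(proc["computation"] for proc in profile.values())
lost = sum(t for proc in profile.values()
             for phase, t in proc.items() if phase != "computation")
total = useful + lost
print(f"useful work: {useful:.1f}s, lost cycles: {lost:.1f}s "
      f"({lost / total:.0%} of total cost)")
```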