causal performance models -...
TRANSCRIPT
Performance Analysis
Jan Lemeire
Parallel Systems lab
November 18th 2010
See Chapter 6 of Jan’s PhD
Parallel Systems Course: Chapter VI
Kumar Chapter 5
Performance Metrics
Other metrics we could try to optimize:
energy consumption, cost, …
Parallel Matrix Multiplication: Execution Profile
Speedup = 2.55, Efficiency = 85%
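The slide's numbers can be reproduced with a short sketch. The run times below are hypothetical values, chosen so that S = 2.55 and E = 85% come out on p = 3 processors:

```python
# Speedup and efficiency from measured run times.
# t_seq and t_par are assumed, illustrative values.
def speedup(t_seq: float, t_par: float) -> float:
    return t_seq / t_par

def efficiency(t_seq: float, t_par: float, p: int) -> float:
    return speedup(t_seq, t_par) / p

t_seq = 10.2   # sequential execution time (s), assumed
t_par = 4.0    # parallel execution time on p processors (s), assumed
p = 3

print(f"Speedup = {speedup(t_seq, t_par):.2f}")        # 2.55
print(f"Efficiency = {efficiency(t_seq, t_par, p):.0%}")  # 85%
```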
Speedup i.f.o. processors
1) Ideal, linear speedup
2) Increasing, sub-linear speedup
3) Speedup with an optimal number of processors
4) No speedup
5) Super-linear speedup
Super-linear speedup
The parallel execution works with data that fits in higher-level memory, while this is not the case for the sequential execution
The work done in parallel is less than that of the sequential program; this is called a parallel anomaly. See Chapter DOP.
PPP p. 92
Speedup i.f.o. problem size
1) Constant speedup
2) Increasing asymptotically towards a sub-linear speedup value (< p)
3) Increasing towards p
4) Increasing towards super-linear speedup
Performance Analysis
Goals:
Understanding of the computational process in terms of resource consumption
Identification of inefficient patterns
Performance prediction
Performance characterization of program and system
Overhead or Lost Cycles
Impact of Overhead on Speedup?
Speedup & Overhead Ratios
Example 1: Execution Profile of Parallel Matrix Multiplication
Speedup = 2.55, Efficiency = 85%
Parallel Matrix Multiplication
Parallel anomaly = 2.6%
Overhead Classification
Control of parallelism: extra functionality necessary for parallelization (like partitioning)
Communication: overhead time not overlapping with computation
Idling: processor has to wait for further information
Parallel anomaly: useful work differs between sequential and parallel execution
KUMAR p195
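The lost-cycles view can be made concrete: the total cost p·Tpar splits into useful (sequential) work plus the overhead categories above. A minimal sketch, with purely illustrative numbers:

```python
# Lost cycles: cost of the parallel run minus the useful (sequential) work.
def overhead(t_seq: float, t_par: float, p: int) -> float:
    return p * t_par - t_seq

t_seq, t_par, p = 10.2, 4.0, 3   # assumed timings
total_overhead = overhead(t_seq, t_par, p)   # 3*4.0 - 10.2 = 1.8 s

# A profile would attribute these lost seconds to the four categories
# (the split below is made up for illustration):
breakdown = {"control": 0.4, "communication": 0.8, "idling": 0.5, "anomaly": 0.1}
assert abs(sum(breakdown.values()) - total_overhead) < 1e-9
```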
Causes of Idling
Limitations of parallelism
Load imbalances
Waiting for communication
Bottlenecks
Theoretical Analysis of Matrix Multiplication
Computation time (slave): T_comp = (n³/p)·t_work
Communication time (slave): T_comm ≈ 2·t_s + (1 + 2/p)·n²·t_w
Explains parameter dependence of performance:
Overhead = O(n².p)
Computation = O(n³/p)
Overhead ratio = O(1/n) (for fixed p)
Parameter Dependence of Matrix Multiplication
n: work size, here: matrix size
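This parameter dependence can be sketched numerically. The model below assumes per-slave times T_comp = (n³/p)·t_work and T_comm ≈ 2·t_s + (1 + 2/p)·n²·t_w, with hypothetical machine parameters t_work, t_s and t_w:

```python
# Theoretical model of parallel matrix multiplication (parameter values
# are assumptions, chosen only to show the trend).
def t_comp(n, p, t_work=1e-9):
    # per-slave computation: n^3/p multiply-adds
    return (n ** 3 / p) * t_work

def t_comm(n, p, t_s=1e-6, t_w=1e-8):
    # startup cost plus sending/receiving O(n^2) matrix elements
    return 2 * t_s + (1 + 2 / p) * n ** 2 * t_w

def overhead_ratio(n, p):
    return t_comm(n, p) / t_comp(n, p)

# Doubling n roughly halves the overhead ratio: the O(1/n) dependence.
r1, r2 = overhead_ratio(1000, 4), overhead_ratio(2000, 4)
print(r1 / r2)   # close to 2
```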
Example 2: Parallel Quicksort
Execution Profile of Parallel Quicksort
Quicksort’s performance
Speedup growth is limited!
Reason?
[Figure: Tp, speedup S and efficiency E i.f.o. p (n = 1000), without considering load imbalances]
Amdahl’s Law
Limitations of inherent parallelism: a part s of the algorithm is not parallelizable
PPP p. 80
T_seq = s·T_seq + (1 − s)·T_seq

with s·T_seq the non-parallelizable part and (1 − s)·T_seq the parallelizable part. Assume no other overhead:

T_par = s·T_seq + (1 − s)·T_seq / p

Speedup = T_seq / T_par = 1 / (s + (1 − s)/p) = p / (1 + (p − 1)·s)

max Speedup = 1/s (for p → ∞)
Amdahl’s Law
Speedup = p / (1 + (p − 1)·s)

Efficiency = Speedup / p = 1 / (1 + (p − 1)·s)

If p is big enough: Speedup ≈ 1/s, so

max Speedup = 1/s

s      max Speedup
10%    10
25%    4
50%    2
75%    1.33
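Amdahl's bound is easy to check numerically; a minimal sketch:

```python
# Amdahl's law: speedup on p processors when a fraction s of the
# sequential run time is not parallelizable.
def amdahl_speedup(s: float, p: int) -> float:
    return p / (1 + (p - 1) * s)

def max_speedup(s: float) -> float:
    # limit of amdahl_speedup for p -> infinity
    return 1 / s

# Reproduces the table: even with only 10% sequential work,
# speedup can never exceed 10, no matter how many processors.
for s in (0.10, 0.25, 0.50, 0.75):
    print(f"s = {s:.0%}: max speedup = {max_speedup(s):.2f}")
```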
Amdahl example: video decoding
Thanks to Wladimir van der Laan, University of Groningen
Example 3: Job Farming
Set of jobs & cluster of computers
= Independent task parallelism
{job1, job2, job3, job4}
[Figure: time diagram of the master and slaves slave1 and slave2. Each job follows the protocol A: Request job, B: Send job, C: Return result; jobs 1–4 are farmed out to the slaves, giving the parallel time Tpar, compared with the sequential time Tseq of jobs 1–4 run one after the other]
Speedup = 1
Performance of Job Farming?
Overheads? Bottlenecks?
Bottleneck at master => idling of slaves
Granularity = computation/communication
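A toy model (all parameters illustrative) of why granularity matters here: the master serves one request at a time, so when per-job computation t_j is not much larger than per-job communication t_c, the master becomes the bottleneck and the slaves idle:

```python
# Toy model of job farming with a serial master.
def farming_time(n_jobs, n_slaves, t_j, t_c):
    # t_j: computation per job (on a slave), t_c: communication per job
    master_bound = n_jobs * t_c                      # master handles every transfer
    slave_bound = (n_jobs / n_slaves) * (t_j + t_c)  # perfectly balanced slaves
    return max(master_bound, slave_bound)

def farming_speedup(n_jobs, n_slaves, t_j, t_c):
    t_seq = n_jobs * t_j
    return t_seq / farming_time(n_jobs, n_slaves, t_j, t_c)

# Coarse granularity (t_j >> t_c): near-linear speedup on 4 slaves.
print(farming_speedup(100, 4, t_j=1.0, t_c=0.01))
# Fine granularity (t_j ~ t_c): the master bottleneck caps speedup at t_j/t_c.
print(farming_speedup(100, 4, t_j=0.02, t_c=0.01))
```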
Scalability
A program is scalable if it can keep efficiency at a fixed value by simultaneously increasing the number of processors and the size of the problem.
It reflects a parallel system’s ability to utilize increasing processing resources effectively.
PPP p. 95
KUMAR p. 211
Gustafson’s law
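Gustafson's law takes this scaled-problem view: keep the parallel run time fixed and let the problem grow with p. A minimal sketch, where s is the sequential fraction of the parallel run:

```python
# Gustafson's law (scaled speedup): with sequential fraction s of the
# parallel run time, the scaled speedup is p - (p - 1) * s.
def gustafson_speedup(s: float, p: int) -> float:
    return p - (p - 1) * s

# Unlike Amdahl's fixed-size bound of 1/s, scaled speedup keeps
# growing as processors (and the problem) are added.
for p in (4, 16, 64):
    print(p, gustafson_speedup(0.10, p))
```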
Approach to follow
1. Generate/draw execution profile
2. Identify lost cycles
3. Study impact on overhead
4. Determine causes of overhead
5. Plot performance in function of p and W
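Steps 1 and 2 of this approach can be sketched with made-up per-processor phase timings:

```python
# Step 1: an execution profile as per-processor phase timings (all values
# are made up for illustration). Step 2: everything that is not
# computation counts as lost cycles.
profile = {
    "P0": {"computation": 3.4, "communication": 0.3, "idling": 0.3},
    "P1": {"computation": 3.2, "communication": 0.4, "idling": 0.4},
    "P2": {"computation": 3.6, "communication": 0.2, "idling": 0.2},
}

useful = sum(proc["computation"] for proc in profile.values())
lost = sum(t for proc in profile.values()
             for phase, t in proc.items() if phase != "computation")
total = useful + lost
print(f"useful work: {useful:.1f}s, lost cycles: {lost:.1f}s "
      f"({lost / total:.0%} of total cost)")
```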