causal performance models -...

Performance Analysis Jan Lemeire Parallel Systems lab November 18 th 2010 See Chapter 6 of Jan’s PhD Parallel Systems Course: Chapter VI Kumar Chapter 5


Causal Performance Models - parallel.vub.ac.be/education/parsys/notes2010/ParSys_PerformanceAnalysis.pdf

Performance Analysis

Jan Lemeire

Parallel Systems lab

November 18th 2010

See Chapter 6 of Jan’s PhD

Parallel Systems Course: Chapter VI

Kumar Chapter 5

Performance Metrics

Other metrics we could try to optimize:

energy consumption, cost, …

Parallel Matrix Multiplication: Execution Profile

Speedup=2.55 Efficiency = 85%
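
The two reported figures are linked by the usual definitions (as in Kumar, Chapter 5): speedup S = T_seq / T_par and efficiency E = S / p. A minimal sketch in Python follows. The timings are hypothetical values, chosen only so that they reproduce the reported 2.55 and 85%, which together imply p = 3 processors.

```python
def speedup(t_seq: float, t_par: float) -> float:
    """Speedup S = T_seq / T_par."""
    return t_seq / t_par


def efficiency(t_seq: float, t_par: float, p: int) -> float:
    """Efficiency E = S / p: fraction of processor time spent on useful work."""
    return speedup(t_seq, t_par) / p


# Hypothetical timings in seconds (not taken from the slides).
t_seq, t_par, p = 10.2, 4.0, 3
print(f"S = {speedup(t_seq, t_par):.2f}, E = {efficiency(t_seq, t_par, p):.0%}")
```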

Speedup as a function of the number of processors

1) Ideal, linear speedup

2) Increasing, sub-linear speedup

3) Speedup with an optimal number of processors

4) No speedup

5) Super-linear speedup

Super-linear speedup

The data of each parallel process fits in a higher level of the memory hierarchy, whereas the data of the sequential execution does not.

The parallel program performs less work than the sequential program; this is called a parallel anomaly. See Chapter DOP.

PPP p. 92

Speedup as a function of problem size

1) Constant speedup

2) Increasing asymptotically towards a sub-linear speedup value (< p)

3) Increasing towards p

4) Increasing towards super-linear speedup

Performance Analysis

Goals:

Understanding of the computational process in terms of resource consumption

Identification of inefficient patterns

Performance prediction

Performance characterization of program and system

Overhead or Lost Cycles

Impact of Overhead on Speedup?

Speedup & Overhead Ratios
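
One way to make the link between speedup and overhead explicit is sketched below. It assumes that overhead is defined as in Kumar, as all processor time not spent on the useful sequential work, and that the overhead ratio is taken relative to $T_{seq}$. The symbols $T_{ovh}$ and $Ovh$ are notation introduced here, not taken from the slide.

```latex
\begin{align*}
p \cdot T_{par} &= T_{seq} + T_{ovh} \\
Speedup = \frac{T_{seq}}{T_{par}} &= \frac{p \cdot T_{seq}}{T_{seq} + T_{ovh}} = \frac{p}{1 + Ovh},
  \qquad Ovh = \frac{T_{ovh}}{T_{seq}} \\
Efficiency = \frac{Speedup}{p} &= \frac{1}{1 + Ovh}
\end{align*}
```

For instance, an efficiency of 85% corresponds to a total overhead ratio of 1/0.85 - 1, or about 0.18.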

Example 1: Execution Profile of Parallel Matrix Multiplication

Speedup=2.55 Efficiency = 85%

Parallel Matrix Multiplication

Parallel anomaly = 2.6%

Overhead Classification

Control of parallelism: extra functionality necessary for parallelization (like partitioning)

Communication: overhead time not overlapping with computation

Idling: processor has to wait for further information

Parallel anomaly: useful work differs for sequential and parallel execution

KUMAR p. 195
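
The classification can be turned into numbers by expressing each category as a ratio of the useful sequential work. The sketch below uses the identity S = p / (1 + total overhead ratio), which follows from p · T_par = T_seq + T_overhead. The `Profile` class, its field names and all values are hypothetical, invented for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Profile:
    """Time summed over all p processors, split by what it was spent on."""
    useful_work: float       # taken here as the sequential run time T_seq
    control: float           # control of parallelism (e.g. partitioning)
    communication: float     # communication not overlapped with computation
    idling: float            # waiting for further information
    anomaly: float = 0.0     # extra useful work (negative if work is saved)


def overhead_ratios(prof: Profile) -> dict[str, float]:
    """Each overhead category relative to the useful sequential work."""
    return {
        name: getattr(prof, name) / prof.useful_work
        for name in ("control", "communication", "idling", "anomaly")
    }


def speedup(prof: Profile, p: int) -> float:
    """S = p / (1 + sum of overhead ratios)."""
    return p / (1.0 + sum(overhead_ratios(prof).values()))


# Hypothetical profile for p = 3 processors (values invented).
prof = Profile(useful_work=100.0, control=2.0, communication=8.0,
               idling=5.0, anomaly=2.5)
print(overhead_ratios(prof))
print(f"speedup = {speedup(prof, 3):.2f}")
```

Keeping the categories separate shows which overhead dominates, which is the starting point for deciding what to optimize.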

Causes of Idling

Limitations of parallelism

Load imbalances

Waiting for communication

Bottlenecks

Theoretical Analysis of Matrix Multiplication

Computation time

$$T^{i}_{work} = \frac{n^3}{p} \cdot t_{mm}$$

Communication time (slave)

$$T^{i}_{comm} = (p-1) \cdot t_s + \left(1 + \frac{2}{p}\right) \cdot n^2 \cdot t_w$$

Explains parameter dependence of performance:

Overhead = $O(n^2 \cdot p)$

Computation = $O(n^3/p)$

Overhead ratio = $O(1/n)$
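
To see the O(1/n) behaviour numerically, the per-slave model can be evaluated for growing n: computation (n^3 / p) · t_mm against communication (p - 1) · t_s + (1 + 2/p) · n^2 · t_w. This is a sketch only. The machine parameters are hypothetical, and reading t_mm as the time of one elementary multiply-add is an assumption.

```python
def t_work(n: int, p: int, t_mm: float) -> float:
    """Computation time per slave: (n^3 / p) * t_mm."""
    return n**3 / p * t_mm


def t_comm(n: int, p: int, t_s: float, t_w: float) -> float:
    """Communication time per slave: (p - 1) * t_s + (1 + 2/p) * n^2 * t_w."""
    return (p - 1) * t_s + (1 + 2 / p) * n**2 * t_w


def overhead_ratio(n: int, p: int, t_mm: float, t_s: float, t_w: float) -> float:
    """Communication time relative to computation time for one slave."""
    return t_comm(n, p, t_s, t_w) / t_work(n, p, t_mm)


# Hypothetical machine parameters in seconds (not measured values).
T_MM, T_S, T_W = 1e-8, 1e-4, 1e-7

for n in (250, 500, 1000, 2000):
    print(n, round(overhead_ratio(n, 4, T_MM, T_S, T_W), 4))
```

For large n, doubling the matrix size roughly halves the ratio, so larger problems run more efficiently on the same number of processors.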

Parameter Dependence of Matrix Multiplication

n: work size, here: matrix size

Example 2: Parallel Quicksort

Execution Profile of Parallel Quicksort

Quicksort’s performance

Speedup growth is limited!

Reason?

[Figure: Performance as a function of p (n = 1000). Curves: Tp, S, E. Horizontal axis: p from 0 to 40; vertical axis: 0 to 8.]

Without considering load imbalances

Amdahl’s Law

Limitations of inherent parallelism: a fraction s of the algorithm is not parallelizable

PPP p. 80

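$$T_{seq} = s \cdot T_{seq} + (1-s) \cdot T_{seq}$$

Here $s \cdot T_{seq}$ is the part that is not parallelizable and $(1-s) \cdot T_{seq}$ is the parallelizable part.

Assume no other overhead:

$$T_{par} = s \cdot T_{seq} + \frac{(1-s) \cdot T_{seq}}{p}$$

$$Speedup_{max} = \frac{T_{seq}}{T_{par}} = \frac{T_{seq}}{s \cdot T_{seq} + \frac{(1-s) \cdot T_{seq}}{p}} = \frac{p}{1 + (p-1) \cdot s}$$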

Amdahl’s Law

$$Speedup = \frac{p}{1 + (p-1) \cdot s}$$

$$Efficiency = \frac{1}{1 + (p-1) \cdot s}$$

If p is big enough:

$$Speedup \approx \frac{1}{s}$$

s       Speedup_max
10%     10
25%     4
50%     2
75%     1.33
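
The table can be reproduced directly from the formula Speedup = p / (1 + (p - 1) · s) and its limit 1/s. A short sketch follows; the function names are illustrative.

```python
def amdahl_speedup(p: int, s: float) -> float:
    """Speedup = p / (1 + (p - 1) * s) for a non-parallelizable fraction s."""
    return p / (1 + (p - 1) * s)


def amdahl_limit(s: float) -> float:
    """Upper bound 1 / s, approached as p grows without bound."""
    return 1 / s


for s in (0.10, 0.25, 0.50, 0.75):
    row = ", ".join(f"p={p}: {amdahl_speedup(p, s):.2f}" for p in (2, 8, 64, 1024))
    print(f"s={s:.0%}  {row}  limit={amdahl_limit(s):.2f}")
```

Even with 1024 processors the speedup stays below the limit, which shows how quickly the sequential fraction comes to dominate.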

Amdahl example: video decoding

Thanks to Wladimir van der Laan, University of Groningen

Example 3: Job Farming

Set of jobs & cluster of computers

= Independent task parallelism

{job1, job2, job3, job4}

[Figure: message and timing diagram of job farming with a master and two slaves (slave1, slave2). Labels: A, B, C. Message types: Request job, Send job, Return result. A time axis compares Tpar and Tseq for job1 to job4.]

Speedup = 1

Performance of Job Farming?

Overheads? Bottlenecks?

Bottleneck at master => idling of slaves

Granularity = computation/communication
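
The effect of granularity and of the master bottleneck can be explored with a small simulation. This is a sketch under strong simplifying assumptions, none of which come from the slides: the master serves one slave at a time, all communication for a job (request, send, return) is lumped into a single interval `t_comm` during which the master is busy, and the sequential run needs no communication. The function names are hypothetical.

```python
import heapq


def simulate_job_farm(job_times, n_slaves, t_comm):
    """Parallel run time when a single master hands out jobs one at a time."""
    master_free = 0.0
    # Heap of (time at which the slave becomes idle, slave id).
    slaves = [(0.0, i) for i in range(n_slaves)]
    heapq.heapify(slaves)
    finish = 0.0
    for t_job in job_times:
        idle_at, sid = heapq.heappop(slaves)
        start = max(idle_at, master_free)  # slave idles while the master is busy
        master_free = start + t_comm       # master occupied by this job's messages
        done = master_free + t_job
        finish = max(finish, done)
        heapq.heappush(slaves, (done, sid))
    return finish


def farm_speedup(job_times, n_slaves, t_comm):
    """Speedup relative to running all jobs in sequence without communication."""
    return sum(job_times) / simulate_job_farm(job_times, n_slaves, t_comm)


jobs = [10.0] * 4  # four equal jobs, hypothetical duration
for t_comm in (0.0, 1.0, 10.0):
    print(f"t_comm={t_comm}: speedup with 2 slaves = {farm_speedup(jobs, 2, t_comm):.2f}")
```

In this model a granularity of 1 (t_comm equal to the job time) already pushes the speedup below 1, because the slaves spend most of their time waiting for the master.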

Scalability

A program is scalable if its efficiency can be maintained at a fixed value by simultaneously increasing the number of processors and the size of the problem.

It reflects a parallel system’s ability to utilize increasing processing resources effectively.

PPP p. 95

KUMAR p. 211

Gustafson’s law
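
The slide only names Gustafson's law. In its commonly quoted form, if a fraction s_par of the parallel run time is spent in sequential code and the parallel part grows with p, the scaled speedup is p - (p - 1) · s_par. A small sketch follows; the symbol `s_par` and the function name are illustrative, not taken from the slides.

```python
def gustafson_speedup(p: int, s_par: float) -> float:
    """Scaled speedup p - (p - 1) * s_par.

    s_par is the sequential fraction measured on the parallel run.
    """
    return p - (p - 1) * s_par


for p in (2, 10, 100, 1000):
    print(f"p={p}: scaled speedup = {gustafson_speedup(p, 0.1):.1f}")
```

Unlike the fixed-size view of Amdahl's law, the scaled speedup keeps growing with p, which matches the idea of scaling the problem size together with the number of processors.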

Approach to follow

1. Generate/draw execution profile

2. Identify lost cycles

3. Study impact on overhead

4. Determine causes of overhead

5. Plot performance as a function of p and W