Improving Parallel Performance


Page 1: Improving Parallel Performance

INTEL CONFIDENTIAL

Improving Parallel Performance
Introduction to Parallel Programming – Part 11

Page 2: Improving Parallel Performance

Copyright © 2009, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. * Other brands and names are the property of their respective owners.


Review & Objectives

Previously:
Define speedup and efficiency
Use Amdahl’s Law to predict maximum speedup

At the end of this part you should be able to:
Explain why it can be difficult both to optimize load balancing and maximize locality
Use loop fusion, loop fission, and loop inversion to create or improve opportunities for parallel execution

Page 3: Improving Parallel Performance


General Rules of Thumb

Start with the best sequential algorithm
Maximize locality


Page 4: Improving Parallel Performance


Start with Best Sequential Algorithm

Don’t confuse “speedup” with “speed”
Speedup: ratio of a program’s execution time on 1 core to its execution time on p cores
What if you start with an inferior sequential algorithm?
Naïve, higher-complexity algorithms are easier to make parallel, but usually don’t lead to the fastest parallel algorithm


Page 5: Improving Parallel Performance



Maximize Locality

Temporal locality: If a processor accesses a memory location, there is a good chance it will revisit that memory location soon

Spatial locality: If a processor accesses a memory location, there is a good chance it will visit a nearby location soon

Programs tend to exhibit locality because they tend to have loops indexing through arrays

Principle of locality makes cache memory worthwhile
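As a concrete illustration (the matrix size and function names are assumptions), the two traversals below compute the same sum, but only the first visits memory in unit stride:

```c
/* C stores 2-D arrays row-major, so m[i][j] and m[i][j+1] are adjacent.
   Iterating j innermost walks memory sequentially (good spatial
   locality); iterating i innermost jumps DIM doubles per access and
   misses cache far more often. */
#define DIM 1024
static double m[DIM][DIM];

double sum_row_order(void) {      /* unit stride: cache friendly */
    double s = 0.0;
    for (int i = 0; i < DIM; i++)
        for (int j = 0; j < DIM; j++)
            s += m[i][j];
    return s;
}

double sum_col_order(void) {      /* stride DIM: cache hostile */
    double s = 0.0;
    for (int j = 0; j < DIM; j++)
        for (int i = 0; i < DIM; i++)
            s += m[i][j];
    return s;
}
```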

Page 6: Improving Parallel Performance


Parallel Processing and Locality

Multiple cores mean multiple caches
When a core writes a value, the system must ensure no core tries to reference an obsolete value (the cache coherence problem)

A write by one core can cause the invalidation of another core’s copy of cache line, leading to a cache miss

Rule of thumb: Better to have different cores manipulating totally different chunks of arrays

We say a parallel program has good locality if cores’ memory writes tend not to interfere with the work being done by other cores


Page 7: Improving Parallel Performance


Example: Array Initialization


for (i = 0; i < N; i++) a[i] = 0;

0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3

Terrible allocation of work to processors

0 0 0 0 0 0 1 1 1 1 1 1 2 2 2 2 2 2 3 3 3 3 3 3

Better allocation of work to processors...

unless sub-arrays map to same cache lines!
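In OpenMP, the block allocation of the second picture is what `schedule(static)` with no chunk size produces. A sketch (the function name is an assumption; the zero-fill workload matches the slide):

```c
/* schedule(static) with no chunk argument assigns each thread one
   contiguous block of about n/p iterations, so different cores write
   disjoint stretches of a[] instead of interleaving elements that
   share a cache line. */
void zero_array(double *a, long n) {
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < n; i++)
        a[i] = 0.0;
}
```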

Page 8: Improving Parallel Performance


Loop Transformations

Loop fission
Loop fusion
Loop inversion


Page 9: Improving Parallel Performance


Loop Fission

Begin with a single loop having a loop-carried dependence
Split the loop into two or more loops
The new loops can be executed in parallel

Page 10: Improving Parallel Performance


Before Loop Fission

float *a, *b;
int i;
for (i = 1; i < N; i++) {
    if (b[i] > 0.0) a[i] = 2.0 * b[i];
    else a[i] = 2.0 * fabs(b[i]);
    b[i] = a[i-1];
}


Perfectly parallel

Loop-carried dependence

Page 11: Improving Parallel Performance


After Loop Fission

#pragma omp parallel
{
    #pragma omp for
    for (i = 1; i < N; i++) {
        if (b[i] > 0.0) a[i] = 2.0 * b[i];
        else a[i] = 2.0 * fabs(b[i]);
    }
    #pragma omp for
    for (i = 1; i < N; i++)
        b[i] = a[i-1];
}


Page 12: Improving Parallel Performance


Loop Fission and Locality

Another use of loop fission is to increase data locality
Before fission, nested loops reference too many data values, leading to a poor cache hit rate
Break the nested loops into multiple nested loops
The new nested loops have a higher cache hit rate


Page 13: Improving Parallel Performance


Before Fission

for (i = 0; i < list_len; i++)
    for (j = prime[i]; j < N; j += prime[i])
        marked[j] = 1;

[Figure: the marked array, swept end to end once for each prime]

Page 14: Improving Parallel Performance


After Fission

for (k = 0; k < N; k += CHUNK_SIZE)
    for (i = 0; i < list_len; i++) {
        start = f(prime[i], k);
        end = g(prime[i], k);
        for (j = start; j < end; j += prime[i])
            marked[j] = 1;
    }

[Figure: the marked array, now processed one CHUNK_SIZE block at a time]
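The slide leaves f and g abstract. One plausible pair, written here as an assumption (renamed chunk_start and chunk_end to make clear they are not the slide’s definitions; the end helper also needs the array bound, passed explicitly): round the chunk start up to the next multiple of the prime, and cap the chunk end at the array length.

```c
#define CHUNK 4096   /* stand-in for the slide's CHUNK_SIZE */

/* First multiple of p that is >= k: where this prime's marking
   begins inside the chunk starting at k. */
long chunk_start(long p, long k) {
    return ((k + p - 1) / p) * p;
}

/* Exclusive end of the chunk starting at k, capped at n. */
long chunk_end(long k, long n) {
    long end = k + CHUNK;
    return end < n ? end : n;
}
```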

Page 15: Improving Parallel Performance


Loop Fusion

The opposite of loop fission
Combine loops to increase grain size


Page 16: Improving Parallel Performance


Before Loop Fusion

float *a, *b, x, y;
int i;
...
for (i = 0; i < N; i++)
    a[i] = foo(i);
x = a[N-1] - a[0];
for (i = 0; i < N; i++)
    b[i] = bar(a[i]);
y = x * b[0] / b[N-1];

Functions foo and bar are side-effect free.


Page 17: Improving Parallel Performance


After Loop Fusion

#pragma omp parallel for
for (i = 0; i < N; i++) {
    a[i] = foo(i);
    b[i] = bar(a[i]);
}
x = a[N-1] - a[0];
y = x * b[0] / b[N-1];

Now one barrier instead of two


Page 18: Improving Parallel Performance



Loop Coalescing Example

#define N 23
#define M 1000

. . .

for (k = 0; k < N; k++)
    for (j = 0; j < M; j++)
        w_new[k][j] = DoSomeWork(w[k][j], k, j);

With 23 outer-loop iterations (a prime number), the work can never be divided evenly among the threads

Parallelize the inner loop instead? Are there enough iterations per parallel region to overcome the threading overhead?

Page 19: Improving Parallel Performance



Loop Coalescing Example

#define N 23
#define M 1000

. . .

for (kj = 0; kj < N*M; kj++) {
    k = kj / M;
    j = kj % M;
    w_new[k][j] = DoSomeWork(w[k][j], k, j);
}

Larger number of iterations gives better opportunity for load balance and hiding overhead

DIV and MOD are overhead
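Since OpenMP 3.0, the collapse clause performs this coalescing inside the runtime, avoiding the explicit / and %. A sketch; the dimensions match the slide, while the arithmetic and the function name `update` stand in for the slide’s DoSomeWork:

```c
#define NN 23
#define MM 1000

/* collapse(2) merges the k and j loops into a single 23000-iteration
   space, so the scheduler can balance all iterations across threads;
   index recovery happens in the runtime instead of in user code. */
void update(double w_new[NN][MM], double w[NN][MM]) {
    #pragma omp parallel for collapse(2)
    for (int k = 0; k < NN; k++)
        for (int j = 0; j < MM; j++)
            w_new[k][j] = 2.0 * w[k][j] + k + j;  /* stand-in work */
}
```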

Page 20: Improving Parallel Performance



Loop Inversion

Nested for loops may have data dependences that prevent parallelization

Inverting the nesting of for loops may:
Expose a parallelizable loop
Increase grain size
Improve the parallel program’s locality

Page 21: Improving Parallel Performance


Loop Inversion Example

for (j = 1; j < n; j++)
    for (i = 0; i < m; i++)
        a[i][j] = 2 * a[i][j-1];

[Figure: a 4-row grid updated column by column, cells numbered 1–4 down the first column, 5–8 down the next, and so on; each cell depends on the cell to its left]

Page 22: Improving Parallel Performance


Before Loop Inversion

for (j = 1; j < n; j++)
    #pragma omp parallel for
    for (i = 0; i < m; i++)
        a[i][j] = 2 * a[i][j-1];


Can execute the inner loop in parallel, but the grain size is small



Page 27: Improving Parallel Performance


After Loop Inversion

#pragma omp parallel for
for (i = 0; i < m; i++)
    for (j = 1; j < n; j++)
        a[i][j] = 2 * a[i][j-1];


Can execute the outer loop in parallel


Page 28: Improving Parallel Performance


References

Rohit Chandra, Leonardo Dagum, Dave Kohr, Dror Maydan, Jeff McDonald, and Ramesh Menon, Parallel Programming in OpenMP, Morgan Kaufmann (2001).

Peter Denning, “The Locality Principle,” Naval Postgraduate School (2005).

Michael J. Quinn, Parallel Programming in C with MPI and OpenMP, McGraw-Hill (2004).
