Computational Mathematics
Hamid Sarbazi-Azad
Department of Computer Engineering, Sharif University of Technology
e-mail: [email protected]

OpenMP: Work-sharing
Instructor: PanteA Zardoshti
Department of Computer Engineering, Sharif University of Technology
e-mail: [email protected]
Work-sharing

A worksharing construct distributes the execution of the associated region among the members of the team that encounters it.

A worksharing region has no barrier on entry; however, an implied barrier exists at the end of the worksharing region.

#pragma omp parallel for
for (i = 0; i < 100; i++)
    A[i] = A[i] + B;
Constructs

The OpenMP API defines the following worksharing constructs, and these are described in the sections that follow:
• loop
• sections
• single

LOOP CONSTRUCT
Loop Construct

The loop construct specifies that the iterations of one or more associated loops will be executed in parallel by threads in the team, in the context of their implicit tasks.

The iterations are distributed across threads that already exist in the team executing the parallel region to which the loop region binds.

#pragma omp for [clause[[,] clause] ... ]
    for-loops
Clauses

where clause is one of the following:
• private(list)
• firstprivate(list)
• lastprivate(list)
• schedule(kind[, chunk_size])
• collapse(n)
• ordered
• nowait
• reduction(reduction-identifier: list)
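As an illustrative sketch (not from the slides), several of these clauses can be combined on a single loop directive; the variable names here are made up:

#include <stdio.h>

int main(void) {
    double total = 0.0;
    int last_i = 0;

    /* hypothetical combination of clauses: a static schedule with chunk 4,
       a sum reduction on total, and lastprivate on last_i */
    #pragma omp parallel for schedule(static, 4) reduction(+:total) lastprivate(last_i)
    for (int i = 0; i < 100; i++) {
        total += 0.5 * i;
        last_i = i;          /* after the loop, last_i holds the value from iteration 99 */
    }

    printf("total = %f, last_i = %d\n", total, last_i);
    return 0;
}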
Schedule Clause

How does OpenMP schedule iterations?

Although the OpenMP standard does not specify how a loop should be partitioned, most compilers split the loop into N/p chunks by default (N = number of iterations, p = number of threads).

This is called a static schedule (with chunk size N/p).

For example, suppose we have a loop with 1000 iterations and 4 OpenMP threads. The loop is partitioned as follows: iterations 1-250, 251-500, 501-750, and 751-1000, one block per thread.
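A small sketch (not from the slides) to observe this partition: each thread records the first and last iteration it executed. With a default static schedule and 4 threads, the output shows blocks of 250 iterations per thread:

#include <stdio.h>
#include <omp.h>

int main(void) {
    int first[64], last[64];               /* assumes at most 64 threads */

    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        first[tid] = -1;
        last[tid] = -1;

        #pragma omp for                    /* default schedule, typically static with chunk N/p */
        for (int i = 0; i < 1000; i++) {   /* the 1000 iterations from the slide */
            if (first[tid] < 0) first[tid] = i;
            last[tid] = i;
        }

        #pragma omp critical
        printf("thread %d ran iterations %d..%d\n", tid, first[tid], last[tid]);
    }
    return 0;
}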
Schedule Clause (cont'd)

schedule(static [,chunk])
• Static: blocks of iterations of size "chunk" are assigned to threads
• Round-robin distribution

schedule(dynamic [,chunk])
• Dynamic: threads grab "chunk" iterations at a time
• When a thread is done with its iterations, it requests the next set

schedule(guided [,chunk])
• Guided: a dynamic schedule starting with large blocks
• The block size shrinks, but never below "chunk"

schedule(runtime)
• Runtime: the schedule type and chunk size are specified by the environment variable OMP_SCHEDULE
• Example of run-time specified scheduling: OMP_SCHEDULE="dynamic,2"
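A brief sketch (not from the slides) of selecting the dynamic kind directly in code; work_on is a hypothetical helper whose cost varies with i, which is the situation where dynamic scheduling helps:

double work_on(int i);   /* hypothetical helper with unbalanced cost */

void process(double *out, int n) {
    /* dynamic schedule with chunk size 2, the same setting that
       OMP_SCHEDULE="dynamic,2" would select via schedule(runtime) */
    #pragma omp parallel for schedule(dynamic, 2)
    for (int i = 0; i < n; i++)
        out[i] = work_on(i);
}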
The Experiment

[Figure omitted from the transcript.]
Collapse Clause

Allows parallelization of perfectly nested loops without using nested parallelism.

The collapse clause on a for/do loop indicates how many loops should be collapsed.

The compiler forms a single loop and then parallelizes it.

#pragma omp for collapse(2)
for (k=1; k<=100; k++)
    for (j=1; j<=200; j++)
        ...
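A complete sketch of the same idea (not from the slides), assuming a hypothetical 100x200 array a:

/* the two loops below form one iteration space of
   100*200 = 20000 iterations that is shared among the threads */
double a[100][200];

void fill(void) {
    #pragma omp parallel for collapse(2)
    for (int k = 0; k < 100; k++)
        for (int j = 0; j < 200; j++)
            a[k][j] = (double)(k + j);
}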
Ordered Clause

The ordered region executes in sequential order.

Since do_lots_of_work takes a lot of time, most of the parallel benefit will still be realized.

ordered is helpful for debugging.

#pragma omp parallel for ordered
for (i = 0; i < nproc; i++) {
    do_lots_of_work(result[i]);
    #pragma omp ordered
    fprintf(fid, "%d %f\n", i, result[i]);
}
Nowait Clause

To minimize synchronization, some OpenMP pragmas support the optional nowait clause.

If present, threads do not synchronize/wait at the end of that particular construct.

#pragma omp for nowait
for (k=1; k<=100; k++)
    ...
Example

#pragma omp parallel shared(n,a,b,c,x,y,z) private(f,i,scale)
{                                         /* parallel region */
    f = 1.0;                              /* statement is executed by all threads */

    #pragma omp for nowait                /* parallel loop (work is distributed) */
    for (i=0; i<n; i++)
        z[i] = x[i] + y[i];

    #pragma omp for nowait                /* parallel loop (work is distributed) */
    for (i=0; i<n; i++)
        a[i] = b[i] + c[i];

    ....

    #pragma omp barrier                   /* synchronization */
    scale = sum(a,0,n) + sum(z,0,n) + f;  /* statement is executed by all threads */
} /*-- End of parallel region --*/
Barrier

[Figure: Thread 1, Thread 2, and Thread 3 arrive at a barrier at different times and all wait there. What do the idle threads do while they wait?]

Use OMP_WAIT_POLICY to control the behaviour of idle threads.
Example

Suppose we run each of these two loops in parallel over i:

for (i=0; i < N; i++)
    a[i] = b[i] + c[i];

for (i=0; i < N; i++)
    d[i] = a[i] + b[i];

This may give us a wrong answer. Why?
Example (cont'd)

We need to have updated all of a[] first, before using a[].

All threads wait at the barrier point and only continue when all threads have reached the barrier point.

for (i=0; i < N; i++)
    a[i] = b[i] + c[i];

/* wait! barrier */

for (i=0; i < N; i++)
    d[i] = a[i] + b[i];
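A minimal sketch (not from the slides) of the corresponding OpenMP code: placing both worksharing loops inside one parallel region and keeping the implicit barrier of the first loop (no nowait) guarantees that a[] is fully updated before it is used. N, a, b, c, and d are as in the example above:

#pragma omp parallel shared(a,b,c,d)
{
    #pragma omp for              /* no nowait: implicit barrier at the end of this loop */
    for (int i = 0; i < N; i++)
        a[i] = b[i] + c[i];

    /* every thread has finished updating a[] before any thread passes this point */

    #pragma omp for
    for (int i = 0; i < N; i++)
        d[i] = a[i] + b[i];
}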
SECTIONS CONSTRUCT
Sections Construct

Independent sections of code can execute concurrently.

#pragma omp parallel sections [clause[[,] clause] ...]
{
    #pragma omp section
    phase1();
    #pragma omp section
    phase2();
    #pragma omp section
    phase3();
}
Clauses

where clause is one of the following:
• private(list)
• firstprivate(list)
• lastprivate(list)
• nowait
• reduction(reduction-identifier: list)
Example

#pragma omp parallel default(none) shared(n,a,b,c,d) private(i)
{
    #pragma omp sections nowait
    {
        #pragma omp section
        for (i=0; i<n-1; i++)
            b[i] = (a[i] + a[i+1])/2;
        #pragma omp section
        for (i=0; i<n; i++)
            d[i] = 1.0/c[i];
    } /*-- End of sections --*/
} /*-- End of parallel region --*/

[Figure: timeline of the parallel region, with Section #1 and Section #2 executing concurrently.]
SINGLE CONSTRUCT
Single Construct

Denotes a block of code to be executed by only one thread.

The thread chosen is implementation dependent.

There is an implicit barrier at the end.

#pragma omp parallel
{
    DoManyThings();
    #pragma omp single
    {
        ExchangeBoundaries();
    }                        /* threads wait here for single */
    DoManyMoreThings();
}
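A small sketch (not from the slides): when the other threads do not depend on the single region's result, the nowait clause removes that implicit barrier. DoManyThings and DoManyMoreThings are the routines from the example above:

#include <stdio.h>

void DoManyThings(void);
void DoManyMoreThings(void);

void run(void) {
    #pragma omp parallel
    {
        DoManyThings();

        #pragma omp single nowait            /* one thread prints; the others skip it */
        printf("phase one finished\n");

        /* no barrier here: the remaining threads continue immediately */
        DoManyMoreThings();
    }
}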
CRITICAL SECTION
Critical Section

float dot_prod(float* a, float* b, int N)
{
    float sum = 0.0;
    #pragma omp parallel for shared(sum)
    for(int i=0; i<N; i++) {
        sum += a[i] * b[i];
    }
    return sum;
}

What is wrong?
Critical Construct

Defines a critical region on a structured block.

#pragma omp critical [(lock_name)]

Naming the critical constructs is optional, but may increase performance.

float dot_prod(float* a, float* b, int N)
{
    float sum = 0.0;
    #pragma omp parallel for shared(sum)
    for(int i=0; i<N; i++) {
        #pragma omp critical
        sum += a[i] * b[i];
    }
    return sum;
}
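A short sketch (not from the slides) of why naming can help: all unnamed critical regions share one global lock, whereas regions with different names do not block each other. The counters and names here are hypothetical:

/* hypothetical shared counters updated from within a parallel region */
int hits = 0, misses = 0;

void record(int was_hit) {
    if (was_hit) {
        #pragma omp critical(update_hits)    /* serializes only against other update_hits regions */
        hits++;
    } else {
        #pragma omp critical(update_misses)  /* independent of update_hits */
        misses++;
    }
}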
Reduction Clause

reduction (op : list)

The variables in "list" must be shared in the enclosing parallel region.

Inside a parallel or work-sharing construct:
• A private copy of each list variable is created and initialized depending on the "op"
• These copies are updated locally by the threads
• At the end of the construct, the local copies are combined through "op" into a single value, which is then combined with the value in the original shared variable
float dot_prod(float* a, float* b, int N)
{
    float sum = 0.0;
    #pragma omp parallel for reduction(+:sum)   /* local copy of sum for each thread */
    for(int i=0; i<N; i++) {
        sum += a[i] * b[i];
    }
    /* all local copies of sum are added together and stored in the "global" variable */
    return sum;
}
Reduction Clause (cont.)

Operators:
• +   Sum
• *   Product
• &   Bitwise and
• |   Bitwise or
• ^   Bitwise exclusive or
• &&  Logical and
• ||  Logical or
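As a brief sketch (not from the slides), the same pattern works with the other operators; here a logical-and reduction checks a predicate over an array:

/* returns 1 if every element of v is positive, 0 otherwise */
int all_positive(const double *v, int n) {
    int ok = 1;
    #pragma omp parallel for reduction(&&:ok)
    for (int i = 0; i < n; i++)
        ok = ok && (v[i] > 0.0);
    return ok;
}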
Atomic Construct

A special case of a critical section.

Applies only to a simple update of a memory location.

#pragma omp parallel for shared(x, y, index, n)
for (i = 0; i < n; i++) {
    #pragma omp atomic
    x[index[i]] += work1(i);
    y[i] += work2(i);
}
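A tiny sketch (not from the slides) of that restriction in practice: incrementing a shared counter is a simple memory update, so atomic is sufficient. The counter and flag array here are hypothetical:

long count = 0;                      /* hypothetical shared counter */

void count_matches(const int *flag, int n) {
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        if (flag[i]) {
            #pragma omp atomic       /* protects just this one memory update */
            count++;
        }
    }
}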
Lock Construct

Protect resources with locks.

void omp_init_lock(omp_lock_t *lock_p);
void omp_set_lock(omp_lock_t *lock_p);
void omp_unset_lock(omp_lock_t *lock_p);
void omp_destroy_lock(omp_lock_t *lock_p);
omp_lock_t lck;
omp_init_lock(&lck);

#pragma omp parallel for
for (i=0; i<=N; i++) {
    omp_set_lock(&lck);      /* wait here for your turn */
    result += w[i]*y[i];
    omp_unset_lock(&lck);    /* release the lock so the next thread gets a turn */
}

omp_destroy_lock(&lck);      /* free up storage when done */
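Beyond the routines shown above, the runtime also provides a non-blocking variant, omp_test_lock, which returns immediately instead of waiting. A small sketch (not from the slides):

#include <stdio.h>
#include <omp.h>

omp_lock_t lck;          /* assumed to be initialized elsewhere with omp_init_lock(&lck) */

void try_report(void) {
    if (omp_test_lock(&lck)) {                 /* non-blocking: returns nonzero if acquired */
        printf("thread %d got the lock\n", omp_get_thread_num());
        omp_unset_lock(&lck);
    } else {
        /* lock was busy: do other useful work instead of waiting */
    }
}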