Computational Mathematics
Hamid Sarbazi-Azad
Department of Computer Engineering, Sharif University of Technology
e-mail: [email protected]

OpenMP: Work-sharing
Instructor: PanteA Zardoshti
Department of Computer Engineering, Sharif University of Technology
e-mail: [email protected]
Work-sharing

A worksharing construct distributes the execution of the associated region among the members of the team that encounters it.

A worksharing region has no barrier on entry; however, an implied barrier exists at the end of the worksharing region.

#pragma omp parallel for
for (i = 0; i < 100; i++)
    A[i] = A[i] + B;
Constructs

The OpenMP API defines the following worksharing constructs, and these are described in the sections that follow:
• loop
• sections
• single

LOOP CONSTRUCT
Loop Construct

The loop construct specifies that the iterations of one or more associated loops will be executed in parallel by threads in the team, in the context of their implicit tasks.

The iterations are distributed across threads that already exist in the team executing the parallel region to which the loop region binds.

#pragma omp for [clause[[,] clause] ... ]
    for-loops
Clauses

where clause is one of the following:
• private(list)
• firstprivate(list)
• lastprivate(list)
• schedule(kind[, chunk_size])
• collapse(n)
• ordered
• nowait
• reduction(reduction-identifier: list)
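As an illustrative sketch (not from the slides), several of these clauses can be combined on a single loop directive; the variable names here are made up:

#include <stdio.h>

int main(void) {
    double total = 0.0;
    int last_i = 0;

    /* hypothetical combination of clauses: a static schedule with chunk 4,
       a sum reduction on total, and lastprivate on last_i */
    #pragma omp parallel for schedule(static, 4) reduction(+:total) lastprivate(last_i)
    for (int i = 0; i < 100; i++) {
        total += 0.5 * i;
        last_i = i;          /* after the loop, last_i holds the value from iteration 99 */
    }

    printf("total = %f, last_i = %d\n", total, last_i);
    return 0;
}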
Schedule Clause

How does OpenMP schedule iterations?

Although the OpenMP standard does not specify how a loop should be partitioned, most compilers split the loop into N/p chunks by default (N = number of iterations, p = number of threads).

This is called a static schedule (with chunk size N/p).

For example, suppose we have a loop with 1000 iterations and 4 OpenMP threads. The loop is partitioned as follows: iterations 1-250, 251-500, 501-750, and 751-1000, one block per thread.
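A small sketch (not from the slides) to observe this partition: each thread records the first and last iteration it executed. With a default static schedule and 4 threads, the output shows blocks of 250 iterations per thread:

#include <stdio.h>
#include <omp.h>

int main(void) {
    int first[64], last[64];               /* assumes at most 64 threads */

    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        first[tid] = -1;
        last[tid] = -1;

        #pragma omp for                    /* default schedule, typically static with chunk N/p */
        for (int i = 0; i < 1000; i++) {   /* the 1000 iterations from the slide */
            if (first[tid] < 0) first[tid] = i;
            last[tid] = i;
        }

        #pragma omp critical
        printf("thread %d ran iterations %d..%d\n", tid, first[tid], last[tid]);
    }
    return 0;
}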
Schedule Clause (cont'd)

schedule(static [,chunk])
• Static: blocks of iterations of size "chunk" are assigned to threads
• Round-robin distribution

schedule(dynamic [,chunk])
• Dynamic: threads grab "chunk" iterations at a time
• When a thread is done with its iterations, it requests the next set

schedule(guided [,chunk])
• Guided: a dynamic schedule starting with large blocks
• The block size shrinks, but never below "chunk"

schedule(runtime)
• Runtime: the schedule type and chunk size are specified by the environment variable OMP_SCHEDULE
• Example of run-time specified scheduling: OMP_SCHEDULE="dynamic,2"
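A brief sketch (not from the slides) of selecting the dynamic kind directly in code; work_on is a hypothetical helper whose cost varies with i, which is the situation where dynamic scheduling helps:

double work_on(int i);   /* hypothetical helper with unbalanced cost */

void process(double *out, int n) {
    /* dynamic schedule with chunk size 2, the same setting that
       OMP_SCHEDULE="dynamic,2" would select via schedule(runtime) */
    #pragma omp parallel for schedule(dynamic, 2)
    for (int i = 0; i < n; i++)
        out[i] = work_on(i);
}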
The Experiment

[Figure omitted from the transcript.]
Collapse Clause

Allows parallelization of perfectly nested loops without using nested parallelism.

The collapse clause on a for/do loop indicates how many loops should be collapsed.

The compiler forms a single loop and then parallelizes it.

#pragma omp for collapse(2)
for (k=1; k<=100; k++)
    for (j=1; j<=200; j++)
        ...
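A complete sketch of the same idea (not from the slides), assuming a hypothetical 100x200 array a:

/* the two loops below form one iteration space of
   100*200 = 20000 iterations that is shared among the threads */
double a[100][200];

void fill(void) {
    #pragma omp parallel for collapse(2)
    for (int k = 0; k < 100; k++)
        for (int j = 0; j < 200; j++)
            a[k][j] = (double)(k + j);
}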
Ordered Clause

The ordered region executes in sequential order.

Since do_lots_of_work takes a lot of time, most of the parallel benefit will still be realized.

ordered is helpful for debugging.

#pragma omp parallel for ordered
for (i = 0; i < nproc; i++) {
    do_lots_of_work(result[i]);
    #pragma omp ordered
    fprintf(fid, "%d %f\n", i, result[i]);
}
Nowait Clause

To minimize synchronization, some OpenMP pragmas support the optional nowait clause.

If present, threads do not synchronize/wait at the end of that particular construct.

#pragma omp for nowait
for (k=1; k<=100; k++)
    ...
Example

#pragma omp parallel shared(n,a,b,c,x,y,z) private(f,i,scale)
{                                         /* parallel region */
    f = 1.0;                              /* statement is executed by all threads */

    #pragma omp for nowait                /* parallel loop (work is distributed) */
    for (i=0; i<n; i++)
        z[i] = x[i] + y[i];

    #pragma omp for nowait                /* parallel loop (work is distributed) */
    for (i=0; i<n; i++)
        a[i] = b[i] + c[i];

    ....

    #pragma omp barrier                   /* synchronization */
    scale = sum(a,0,n) + sum(z,0,n) + f;  /* statement is executed by all threads */
} /*-- End of parallel region --*/
Barrier

[Figure: Thread 1, Thread 2, and Thread 3 arrive at a barrier at different times and all wait there. What do the idle threads do while they wait?]

Use OMP_WAIT_POLICY to control the behaviour of idle threads.
Example

Suppose we run each of these two loops in parallel over i:

for (i=0; i < N; i++)
    a[i] = b[i] + c[i];

for (i=0; i < N; i++)
    d[i] = a[i] + b[i];

This may give us a wrong answer. Why?
Example (cont'd)

We need to have updated all of a[] first, before using a[].

All threads wait at the barrier point and only continue when all threads have reached the barrier point.

for (i=0; i < N; i++)
    a[i] = b[i] + c[i];

/* wait! barrier */

for (i=0; i < N; i++)
    d[i] = a[i] + b[i];
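A minimal sketch (not from the slides) of the corresponding OpenMP code: placing both worksharing loops inside one parallel region and keeping the implicit barrier of the first loop (no nowait) guarantees that a[] is fully updated before it is used. N, a, b, c, and d are as in the example above:

#pragma omp parallel shared(a,b,c,d)
{
    #pragma omp for              /* no nowait: implicit barrier at the end of this loop */
    for (int i = 0; i < N; i++)
        a[i] = b[i] + c[i];

    /* every thread has finished updating a[] before any thread passes this point */

    #pragma omp for
    for (int i = 0; i < N; i++)
        d[i] = a[i] + b[i];
}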
SECTIONS CONSTRUCT
Sections Construct

Independent sections of code can execute concurrently.

#pragma omp parallel sections [clause[[,] clause] ...]
{
    #pragma omp section
    phase1();
    #pragma omp section
    phase2();
    #pragma omp section
    phase3();
}
Clauses

where clause is one of the following:
• private(list)
• firstprivate(list)
• lastprivate(list)
• nowait
• reduction(reduction-identifier: list)
Example

#pragma omp parallel default(none) shared(n,a,b,c,d) private(i)
{
    #pragma omp sections nowait
    {
        #pragma omp section
        for (i=0; i<n-1; i++)
            b[i] = (a[i] + a[i+1])/2;
        #pragma omp section
        for (i=0; i<n; i++)
            d[i] = 1.0/c[i];
    } /*-- End of sections --*/
} /*-- End of parallel region --*/

[Figure: timeline of the parallel region, with Section #1 and Section #2 executing concurrently.]
SINGLE CONSTRUCT
Single Construct

Denotes a block of code to be executed by only one thread.

The thread chosen is implementation dependent.

There is an implicit barrier at the end.

#pragma omp parallel
{
    DoManyThings();
    #pragma omp single
    {
        ExchangeBoundaries();
    }                        /* threads wait here for single */
    DoManyMoreThings();
}
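A small sketch (not from the slides): when the other threads do not depend on the single region's result, the nowait clause removes that implicit barrier. DoManyThings and DoManyMoreThings are the routines from the example above:

#include <stdio.h>

void DoManyThings(void);
void DoManyMoreThings(void);

void run(void) {
    #pragma omp parallel
    {
        DoManyThings();

        #pragma omp single nowait            /* one thread prints; the others skip it */
        printf("phase one finished\n");

        /* no barrier here: the remaining threads continue immediately */
        DoManyMoreThings();
    }
}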
CRITICAL SECTION
Critical Section

float dot_prod(float* a, float* b, int N)
{
    float sum = 0.0;
    #pragma omp parallel for shared(sum)
    for(int i=0; i<N; i++) {
        sum += a[i] * b[i];
    }
    return sum;
}

What is wrong?
Critical Construct

Defines a critical region on a structured block.

#pragma omp critical [(lock_name)]

Naming the critical constructs is optional, but may increase performance.

float dot_prod(float* a, float* b, int N)
{
    float sum = 0.0;
    #pragma omp parallel for shared(sum)
    for(int i=0; i<N; i++) {
        #pragma omp critical
        sum += a[i] * b[i];
    }
    return sum;
}
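A short sketch (not from the slides) of why naming can help: all unnamed critical regions share one global lock, whereas regions with different names do not block each other. The counters and names here are hypothetical:

/* hypothetical shared counters updated from within a parallel region */
int hits = 0, misses = 0;

void record(int was_hit) {
    if (was_hit) {
        #pragma omp critical(update_hits)    /* serializes only against other update_hits regions */
        hits++;
    } else {
        #pragma omp critical(update_misses)  /* independent of update_hits */
        misses++;
    }
}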
Reduction Clause

reduction (op : list)

The variables in "list" must be shared in the enclosing parallel region.

Inside a parallel or work-sharing construct:
• A private copy of each list variable is created and initialized depending on the "op"
• These copies are updated locally by the threads
• At the end of the construct, the local copies are combined through "op" into a single value, which is then combined with the value in the original shared variable
float dot_prod(float* a, float* b, int N)
{
    float sum = 0.0;
    #pragma omp parallel for reduction(+:sum)   /* local copy of sum for each thread */
    for(int i=0; i<N; i++) {
        sum += a[i] * b[i];
    }
    /* all local copies of sum are added together and stored in the "global" variable */
    return sum;
}
Reduction Clause (cont.)

Operators:
• +   Sum
• *   Product
• &   Bitwise and
• |   Bitwise or
• ^   Bitwise exclusive or
• &&  Logical and
• ||  Logical or
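As a brief sketch (not from the slides), the same pattern works with the other operators; here a logical-and reduction checks a predicate over an array:

/* returns 1 if every element of v is positive, 0 otherwise */
int all_positive(const double *v, int n) {
    int ok = 1;
    #pragma omp parallel for reduction(&&:ok)
    for (int i = 0; i < n; i++)
        ok = ok && (v[i] > 0.0);
    return ok;
}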
Atomic Construct

A special case of a critical section.

Applies only to a simple update of a memory location.

#pragma omp parallel for shared(x, y, index, n)
for (i = 0; i < n; i++) {
    #pragma omp atomic
    x[index[i]] += work1(i);
    y[i] += work2(i);
}
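A tiny sketch (not from the slides) of that restriction in practice: incrementing a shared counter is a simple memory update, so atomic is sufficient. The counter and flag array here are hypothetical:

long count = 0;                      /* hypothetical shared counter */

void count_matches(const int *flag, int n) {
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        if (flag[i]) {
            #pragma omp atomic       /* protects just this one memory update */
            count++;
        }
    }
}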
Lock Construct

Protect resources with locks.

void omp_init_lock(omp_lock_t *lock_p);
void omp_set_lock(omp_lock_t *lock_p);
void omp_unset_lock(omp_lock_t *lock_p);
void omp_destroy_lock(omp_lock_t *lock_p);
omp_lock_t lck;
omp_init_lock(&lck);

#pragma omp parallel for
for (i=0; i<=N; i++) {
    omp_set_lock(&lck);      /* wait here for your turn */
    result += w[i]*y[i];
    omp_unset_lock(&lck);    /* release the lock so the next thread gets a turn */
}

omp_destroy_lock(&lck);      /* free up storage when done */
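Beyond the routines shown above, the runtime also provides a non-blocking variant, omp_test_lock, which returns immediately instead of waiting. A small sketch (not from the slides):

#include <stdio.h>
#include <omp.h>

omp_lock_t lck;          /* assumed to be initialized elsewhere with omp_init_lock(&lck) */

void try_report(void) {
    if (omp_test_lock(&lck)) {                 /* non-blocking: returns nonzero if acquired */
        printf("thread %d got the lock\n", omp_get_thread_num());
        omp_unset_lock(&lck);
    } else {
        /* lock was busy: do other useful work instead of waiting */
    }
}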