
Page 1: Open Multiprocessing

Open Multiprocessing

Dr. Bo Yuan
E-mail: [email protected]

Page 2: Open Multiprocessing

2

Note on Parallel Programming

• An incorrect program may produce correct results.

– The order of execution of processes/threads is unpredictable.

– May depend on your luck!

• A program that always produces correct results may not make sense.

– The outputs of a program are just part of the story.

– Efficiency matters!

Page 3: Open Multiprocessing

3

OpenMP

• An API for shared-memory multiprocessing (parallel) programming in C, C++ and Fortran.
– Supports multiple platforms (processor architectures and operating systems).
– Higher-level implementation (you mark a block of code that should be executed in parallel).

• A method of parallelizing whereby a master thread forks a number of slave threads and a task is divided among them.

• Based on preprocessor directives (pragmas)
– Requires compiler support.
– Header: omp.h

• References
– http://openmp.org/
– https://computing.llnl.gov/tutorials/openMP/
– http://supercomputingblog.com/openmp/

Page 4: Open Multiprocessing

4

Hello, World!

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

void Hello(void);

int main(int argc, char* argv[]) {
   /* Get the number of threads from the command line */
   int thread_count = strtol(argv[1], NULL, 10);

#  pragma omp parallel num_threads(thread_count)
   Hello();

   return 0;
}

void Hello(void) {
   int my_rank = omp_get_thread_num();
   int thread_count = omp_get_num_threads();

   printf("Hello from thread %d of %d\n", my_rank, thread_count);
}
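OpenMP support must be enabled explicitly in most compilers. With gcc, for example, the program above can be built with the -fopenmp flag (the file and program names here are only illustrative), e.g. gcc -g -Wall -fopenmp -o omp_hello omp_hello.c, and then run as ./omp_hello 4 to request four threads.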

Page 5: Open Multiprocessing

5

Definitions

# pragma omp parallel [clauses]
{
   code_block
}

Error Checking

#ifdef _OPENMP
#  include <omp.h>
#endif

#ifdef _OPENMP
   int my_rank = omp_get_thread_num();
   int thread_count = omp_get_num_threads();
#else
   int my_rank = 0;
   int thread_count = 1;
#endif

• Clauses: text that modifies the directive.
• The parallel directive starts a thread team (master + slaves).
• There is an implicit barrier at the end of the parallel block.
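A minimal, self-contained sketch (the thread count and names are illustrative) showing how the _OPENMP guard lets the same source compile with or without OpenMP support:

#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Report this thread's rank and the team size; falls back to a
   single thread when the compiler does not define _OPENMP. */
void Hello(void) {
#ifdef _OPENMP
   int my_rank = omp_get_thread_num();
   int thread_count = omp_get_num_threads();
#else
   int my_rank = 0;
   int thread_count = 1;
#endif
   printf("Hello from thread %d of %d\n", my_rank, thread_count);
}

int main(void) {
#ifdef _OPENMP
#  pragma omp parallel num_threads(4)
#endif
   Hello();
   return 0;
}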

Page 6: Open Multiprocessing

6

The Trapezoidal Rule

/* Input: a, b, n */
h = (b-a)/n;
approx = (f(a)+f(b))/2.0;
for (i = 1; i <= n-1; i++) {
   x_i = a + i*h;
   approx += f(x_i);
}
approx = h*approx;

(Figure: the trapezoids are divided among the threads, e.g. Thread 0 through Thread 2 each handle a subinterval.)

# pragma omp critical
global_result += my_result;

Key issues: shared memory, shared variables, race conditions.

Page 7: Open Multiprocessing

7

The critical Directive

# pragma omp critical
y = f(x);
...
double f(double x) {
#  pragma omp critical
   z = g(x);
   ...
}

Cannot be executed simultaneously!

# pragma omp critical(one)
y = f(x);
...
double f(double x) {
#  pragma omp critical(two)
   z = g(x);
   ...
}

Deadlock

Page 8: Open Multiprocessing

8

The atomic Directive

# pragma omp atomic
x <op>= <expression>;

<op> can be one of the binary operators: +, *, -, /, &, ^, |, <<, >>

• Higher performance than the critical directive.
• Only a single C assignment statement is protected.
• Only the load and store of x are protected.
• <expression> must not reference x.

# pragma omp atomic
x += f(y);

# pragma omp critical
x = g(x);

These two can be executed simultaneously: atomic protects only the update of x in the atomic statement, and it does not exclude a critical section. (A minimal atomic-counter sketch follows below.)

Other forms accepted by atomic: x++, ++x, x--, --x
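A minimal sketch of the atomic directive (variable names and counts are illustrative): each thread repeatedly increments a shared counter, and only the single update of hits is protected.

#include <stdio.h>
#include <omp.h>

int main(void) {
   long hits = 0;

   /* Four threads each add to the shared counter one million times;
      atomic protects each individual update of hits. */
#  pragma omp parallel num_threads(4)
   {
      for (int i = 0; i < 1000000; i++) {
#        pragma omp atomic
         hits += 1;
      }
   }

   printf("hits = %ld\n", hits);   /* expected: 4000000 */
   return 0;
}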

Page 9: Open Multiprocessing

9

Locks

/* Executed by one thread */
Initialize the lock data structure;
...
/* Executed by multiple threads */
Attempt to lock or set the lock data structure;
Critical section;
Unlock or unset the lock data structure;
...
/* Executed by one thread */
Destroy the lock data structure;

void omp_init_lock(omp_lock_t* lock_p);
void omp_set_lock(omp_lock_t* lock_p);
void omp_unset_lock(omp_lock_t* lock_p);
void omp_destroy_lock(omp_lock_t* lock_p);
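A small usage sketch of the lock routines above (names and values are illustrative), protecting a shared sum:

#include <stdio.h>
#include <omp.h>

int main(void) {
   double sum = 0.0;
   omp_lock_t lock;

   omp_init_lock(&lock);           /* executed once, by one thread */

#  pragma omp parallel num_threads(4)
   {
      double my_part = omp_get_thread_num() + 1.0;
      omp_set_lock(&lock);         /* enter the critical section */
      sum += my_part;
      omp_unset_lock(&lock);       /* leave the critical section */
   }

   omp_destroy_lock(&lock);        /* executed once, after the parallel region */
   printf("sum = %f\n", sum);      /* expected: 1 + 2 + 3 + 4 = 10 */
   return 0;
}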

Page 10: Open Multiprocessing

10

Trapezoidal Rule in OpenMP

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

void Trap(double a, double b, int n, double* global_result_p);

int main(int argc, char* argv[]) {
   double global_result = 0.0;
   double a, b;
   int n, thread_count;

   thread_count = strtol(argv[1], NULL, 10);
   printf("Enter a, b, and n\n");
   scanf("%lf %lf %d", &a, &b, &n);

#  pragma omp parallel num_threads(thread_count)
   Trap(a, b, n, &global_result);

   printf("With n = %d trapezoids, our estimate\n", n);
   printf("of the integral from %f to %f = %.15e\n", a, b, global_result);
   return 0;
}

Page 11: Open Multiprocessing

11

Trapezoidal Rule in OpenMP

void Trap(double a, double b, int n, double* global_result_p) {
   double h, x, my_result;
   double local_a, local_b;
   int i, local_n;
   int my_rank = omp_get_thread_num();
   int thread_count = omp_get_num_threads();

   h = (b-a)/n;
   local_n = n/thread_count;
   local_a = a + my_rank*local_n*h;
   local_b = local_a + local_n*h;
   my_result = (f(local_a)+f(local_b))/2.0;
   for (i = 1; i <= local_n-1; i++) {
      x = local_a + i*h;
      my_result += f(x);
   }
   my_result = my_result*h;

#  pragma omp critical
   *global_result_p += my_result;
}

Page 12: Open Multiprocessing

12

Scope of Variables

Private scope
• Only accessible by a single thread.
• Declared in the code block.
• Examples: my_rank, my_result, global_result_p (the pointer itself)

Shared scope
• Accessible by all threads in a team.
• Declared before a parallel directive.
• Examples: a, b, n, global_result, thread_count, *global_result_p (the object the pointer refers to, i.e. global_result)

In serial programming:
• Function-wide scope
• File-wide scope

Page 13: Open Multiprocessing

13

Another Trap Function

double Local_trap(double a, double b, int n);

/* Version 1: the call itself is inside the critical section, so the
   Local_trap calls execute one at a time. */
global_result = 0.0;
#  pragma omp parallel num_threads(thread_count)
{
#  pragma omp critical
   global_result += Local_trap(a, b, n);
}

/* Version 2: each thread computes into a private variable; only the
   final update of the shared result is protected. */
global_result = 0.0;
#  pragma omp parallel num_threads(thread_count)
{
   double my_result = 0.0;   /* private */
   my_result = Local_trap(a, b, n);
#  pragma omp critical
   global_result += my_result;
}

Page 14: Open Multiprocessing

14

The Reduction Clause

• Reduction: A computation (binary operation) that repeatedly applies the same reduction operator (e.g., addition or multiplication) to a sequence of operands in order to get a single result.

• Note:
– The reduction variable itself is shared.
– A private variable is created for each thread in the team.
– The private variables are initialized to 0 for the addition operator.

global_result = 0.0;
#  pragma omp parallel num_threads(thread_count) \
      reduction(+: global_result)
   global_result = Local_trap(a, b, n);

reduction(<operator>: <variable list>)
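As a hedged sketch of the clause (the array, its size, and the thread count are illustrative), summing an array with a + reduction:

#include <stdio.h>
#include <omp.h>

#define N 1000

int main(void) {
   double a[N], sum = 0.0;
   int i;

   for (i = 0; i < N; i++)
      a[i] = 1.0;

   /* Each thread gets a private copy of sum initialized to 0;
      the private copies are added into the shared sum at the end. */
#  pragma omp parallel for num_threads(4) reduction(+: sum)
   for (i = 0; i < N; i++)
      sum += a[i];

   printf("sum = %f\n", sum);      /* expected: 1000.0 */
   return 0;
}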

Page 15: Open Multiprocessing

15

The parallel for Directive

h = (b-a)/n;
approx = (f(a)+f(b))/2.0;
for (i = 1; i <= n-1; i++) {
   approx += f(a+i*h);
}
approx = h*approx;

h = (b-a)/n;
approx = (f(a)+f(b))/2.0;
#  pragma omp parallel for num_threads(thread_count) \
      reduction(+: approx)
   for (i = 1; i <= n-1; i++) {
      approx += f(a+i*h);
   }
approx = h*approx;

• The code block must be a for loop.

• Iterations of the for loop are divided among threads.

• approx is a reduction variable.

• i is a private variable.

Page 16: Open Multiprocessing

16

The parallel for Directive

• Sounds like a truly wonderful approach to parallelizing serial programs.
• Does not work with while or do-while loops.

– How about converting them into for loops?

• The number of iterations must be determined in advance.

for (; ;) {
   ...
}

for (i = 0; i < n; i++) {
   if (...) break;
   ...
}

int x, y;
#  pragma omp parallel for num_threads(thread_count)
   for (x = 0; x < width; x++) {
      for (y = 0; y < height; y++) {
         finalImage[x][y] = f(x, y);
      }
   }

Only the outer loop variable x is made private automatically; the inner loop variable y must be declared private by adding private(y) to the directive (a corrected sketch follows).
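A self-contained, corrected sketch of the loop above with y made private (finalImage, width, height and f are illustrative stand-ins for the names on the slide):

#include <stdio.h>
#include <omp.h>

#define WIDTH  4
#define HEIGHT 3

double f(int x, int y) { return x*10.0 + y; }   /* placeholder computation */

int main(void) {
   static double finalImage[WIDTH][HEIGHT];
   int x, y;

   /* x is the parallelized loop variable, so it is private automatically;
      y must be declared private explicitly. */
#  pragma omp parallel for num_threads(4) private(y)
   for (x = 0; x < WIDTH; x++)
      for (y = 0; y < HEIGHT; y++)
         finalImage[x][y] = f(x, y);

   printf("finalImage[1][2] = %f\n", finalImage[1][2]);
   return 0;
}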

Page 17: Open Multiprocessing

17

Estimating π

\pi = 4\left(1 - \frac{1}{3} + \frac{1}{5} - \frac{1}{7} + \cdots\right) = 4\sum_{k=0}^{\infty} \frac{(-1)^k}{2k+1}

double factor = 1.0;
double sum = 0.0;
for (k = 0; k < n; k++) {
   sum += factor/(2*k+1);
   factor = -factor;
}
pi_approx = 4.0*sum;

double factor = 1.0;
double sum = 0.0;
#  pragma omp parallel for \
      num_threads(thread_count) \
      reduction(+: sum)
   for (k = 0; k < n; k++) {
      sum += factor/(2*k+1);
      factor = -factor;
   }
pi_approx = 4.0*sum;

Is this correct? No: the update of factor creates a loop-carried dependence.

Page 18: Open Multiprocessing

18

Estimating π

if (k % 2 == 0)
   factor = 1.0;
else
   factor = -1.0;
sum += factor/(2*k+1);

factor = (k % 2 == 0) ? 1.0 : -1.0;
sum += factor/(2*k+1);

double factor = 1.0;
double sum = 0.0;
#  pragma omp parallel for num_threads(thread_count) \
      reduction(+: sum) private(factor)
   for (k = 0; k < n; k++) {
      if (k % 2 == 0)
         factor = 1.0;
      else
         factor = -1.0;
      sum += factor/(2*k+1);
   }
pi_approx = 4.0*sum;

Page 19: Open Multiprocessing

19

Scope Matters

double factor = 1.0;
double sum = 0.0;
#  pragma omp parallel for num_threads(thread_count) \
      default(none) reduction(+: sum) private(k, factor) shared(n)
   for (k = 0; k < n; k++) {
      if (k % 2 == 0)
         factor = 1.0;
      else
         factor = -1.0;
      sum += factor/(2*k+1);
   }
pi_approx = 4.0*sum;

• With the default(none) clause, we must specify the scope of every variable that is used in the block but declared outside it.

• The value of a variable with private scope is unspecified at the beginning (and after completion) of a parallel or parallel for block.

• In the loop above, the initial value of the private factor is therefore unspecified, but factor is assigned before it is used in every iteration.

Page 20: Open Multiprocessing

20

Bubble Sort

for (len = n; len >= 2; len--)
   for (i = 0; i < len-1; i++)
      if (a[i] > a[i+1]) {
         tmp = a[i];
         a[i] = a[i+1];
         a[i+1] = tmp;
      }

• Can we make it faster?

• Can we parallelize the outer loop?

• Can we parallelize the inner loop?

Page 21: Open Multiprocessing

21

Odd-Even Sort

Phase      | Subscripts 0 1 2 3 (before -> after)
0 (even)   | 9 7 8 6  ->  7 9 6 8
1 (odd)    | 7 9 6 8  ->  7 6 9 8
2 (even)   | 7 6 9 8  ->  6 7 8 9
3 (odd)    | 6 7 8 9  ->  6 7 8 9

Any opportunities for parallelism?

Page 22: Open Multiprocessing

22

Odd-Even Sort

void Odd_even_sort(int a[], int n) {
   int phase, i, temp;
   for (phase = 0; phase < n; phase++)
      if (phase % 2 == 0) {   /* Even phase */
         for (i = 1; i < n; i += 2)
            if (a[i-1] > a[i]) {
               temp = a[i]; a[i] = a[i-1]; a[i-1] = temp;
            }
      } else {                /* Odd phase */
         for (i = 1; i < n-1; i += 2)
            if (a[i] > a[i+1]) {
               temp = a[i]; a[i] = a[i+1]; a[i+1] = temp;
            }
      }
}

Page 23: Open Multiprocessing

23

Odd-Even Sort in OpenMP

for (phase = 0; phase < n; phase++) {
   if (phase % 2 == 0) {   /* Even phase */
#     pragma omp parallel for num_threads(thread_count) \
         default(none) shared(a, n) private(i, temp)
      for (i = 1; i < n; i += 2)
         if (a[i-1] > a[i]) {
            temp = a[i]; a[i] = a[i-1]; a[i-1] = temp;
         }
   } else {                /* Odd phase */
#     pragma omp parallel for num_threads(thread_count) \
         default(none) shared(a, n) private(i, temp)
      for (i = 1; i < n-1; i += 2)
         if (a[i] > a[i+1]) {
            temp = a[i]; a[i] = a[i+1]; a[i+1] = temp;
         }
   }
}

Page 24: Open Multiprocessing

24

Odd-Even Sort in OpenMP

#  pragma omp parallel num_threads(thread_count) \
      default(none) shared(a, n) private(i, temp, phase)
   for (phase = 0; phase < n; phase++) {
      if (phase % 2 == 0) {   /* Even phase */
#        pragma omp for
         for (i = 1; i < n; i += 2)
            if (a[i-1] > a[i]) {
               temp = a[i]; a[i] = a[i-1]; a[i-1] = temp;
            }
      } else {                /* Odd phase */
#        pragma omp for
         for (i = 1; i < n-1; i += 2)
            if (a[i] > a[i+1]) {
               temp = a[i]; a[i] = a[i+1]; a[i+1] = temp;
            }
      }
   }

Page 25: Open Multiprocessing

25

Data Partitioning

Block partition (9 iterations, 3 threads):
Thread 0: iterations 0, 1, 2   Thread 1: iterations 3, 4, 5   Thread 2: iterations 6, 7, 8

Cyclic partition (9 iterations, 3 threads):
Thread 0: iterations 0, 3, 6   Thread 1: iterations 1, 4, 7   Thread 2: iterations 2, 5, 8

Page 26: Open Multiprocessing

26

Scheduling Loops

double Z[N][N];
...
sum = 0.0;
for (i = 0; i < N; i++)
   sum += f(i);

double f(int r) {
   int i;
   double val = 0.0;

   /* f(r) performs N - r - 1 sine evaluations, so the cost of the
      iterations in the calling loop is unequal. */
   for (i = r+1; i < N; i++) {
      val += sin(Z[r][i]);
   }
   return val;
}

Load Balancing

Page 27: Open Multiprocessing

27

The schedule clause

sum = 0.0;
#  pragma omp parallel for num_threads(thread_count) \
      reduction(+: sum) schedule(static, 1)
   for (i = 0; i < n; i++)
      sum += f(i);

n = 12, thread_count = 3

schedule(static, 1)
Thread 0: 0, 3, 6, 9
Thread 1: 1, 4, 7, 10
Thread 2: 2, 5, 8, 11

schedule(static, 2)
Thread 0: 0, 1, 6, 7
Thread 1: 2, 3, 8, 9
Thread 2: 4, 5, 10, 11

schedule(static, 4)
Thread 0: 0, 1, 2, 3
Thread 1: 4, 5, 6, 7
Thread 2: 8, 9, 10, 11

The second argument of schedule(static, chunksize) is the chunksize. On most systems the default schedule behaves roughly like schedule(static, total_iterations/thread_count).
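A small sketch (thread and iteration counts are illustrative) that prints which thread runs each iteration, which can be used to observe the static chunk assignments listed above:

#include <stdio.h>
#include <omp.h>

int main(void) {
   int i;

   /* 12 iterations, 3 threads, chunks of 2 consecutive iterations handed
      out round-robin: thread 0 gets 0,1,6,7; thread 1 gets 2,3,8,9; ... */
#  pragma omp parallel for num_threads(3) schedule(static, 2)
   for (i = 0; i < 12; i++)
      printf("Iteration %2d executed by thread %d\n", i, omp_get_thread_num());

   return 0;
}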

Page 28: Open Multiprocessing

28

The dynamic and guided Types

• In a dynamic schedule:
– Iterations are broken into chunks of chunksize consecutive iterations.
– Default chunksize value: 1
– Each thread executes a chunk.
– When a thread finishes a chunk, it requests another one.

• In a guided schedule:
– Each thread executes a chunk.
– When a thread finishes a chunk, it requests another one.
– As chunks are completed, the size of new chunks decreases: each new chunk is approximately the number of remaining iterations divided by the number of threads.
– Chunk sizes decrease down to chunksize, or 1 by default (see the sketch below).
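A hedged sketch of the dynamic and guided schedules (the cost function f, the problem size, and the chunksize are illustrative; compile with -lm for sin):

#include <stdio.h>
#include <math.h>
#include <omp.h>

#define N 1000

/* Iteration cost grows with i, so a plain static schedule would be unbalanced. */
double f(int i) {
   double val = 0.0;
   for (int j = 0; j < i; j++)
      val += sin((double) j);
   return val;
}

int main(void) {
   double sum = 0.0;
   int i;

   /* dynamic: chunks of 10 iterations are handed to threads on request.
      Replacing the clause with schedule(guided) would start with large
      chunks whose size shrinks as the loop progresses. */
#  pragma omp parallel for num_threads(4) reduction(+: sum) schedule(dynamic, 10)
   for (i = 0; i < N; i++)
      sum += f(i);

   printf("sum = %f\n", sum);
   return 0;
}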

Page 29: Open Multiprocessing

29

Example of guided Schedule (9,999 iterations, 2 threads)

Thread | Chunk     | Size of Chunk | Remaining Iterations
  0    | 1-5000    | 5000          | 4999
  1    | 5001-7500 | 2500          | 2499
  1    | 7501-8750 | 1250          | 1249
  1    | 8751-9375 | 625           | 624
  0    | 9376-9687 | 312           | 312
  1    | 9688-9843 | 156           | 156
  0    | 9844-9921 | 78            | 78
  1    | 9922-9960 | 39            | 39
  1    | 9961-9980 | 20            | 19
  1    | 9981-9990 | 10            | 9
  1    | 9991-9995 | 5             | 4
  0    | 9996-9997 | 2             | 2
  1    | 9998-9998 | 1             | 1
  0    | 9999-9999 | 1             | 0

Page 30: Open Multiprocessing

30

Which schedule?

• The optimal schedule depends on:– The type of problem– The number of iterations– The number of threads

• Overhead– guided>dynamic>static

– If you are getting satisfactory results (e.g., close to the theoretically maximum speedup) without a schedule clause, go no further.

• The Cost of Iterations
– If it is roughly the same for all iterations, use the default schedule.
– If it decreases or increases linearly as the loop executes, a static schedule with small chunksize values will be good.
– If it cannot be determined in advance, try to explore different options.

Page 31: Open Multiprocessing

31

Performance Issue

Matrix-vector multiplication, y = A x:

#  pragma omp parallel for num_threads(thread_count) \
      default(none) private(i, j) shared(A, x, y, m, n)
   for (i = 0; i < m; i++) {
      y[i] = 0.0;
      for (j = 0; j < n; j++)
         y[i] += A[i][j]*x[j];
   }

Page 32: Open Multiprocessing

32

Performance Issue

Run-times and efficiencies for three matrix dimensions:

Threads | 8,000,000 x 8       | 8,000 x 8,000       | 8 x 8,000,000
        | Time    Efficiency  | Time    Efficiency  | Time    Efficiency
   1    | 0.322   1.000       | 0.264   1.000       | 0.333   1.000
   2    | 0.219   0.735       | 0.189   0.698       | 0.300   0.555
   4    | 0.141   0.571       | 0.119   0.555       | 0.303   0.275

Likely culprits: cache misses and false sharing.

Page 33: Open Multiprocessing

33

Performance Issue

• 8,000,000-by-8
– y has 8,000,000 elements → potentially a large number of write misses.

• 8-by-8,000,000
– x has 8,000,000 elements → potentially a large number of read misses.
– y has 8 elements (8 doubles) → could be stored in a single cache line (64 bytes).
– Potentially serious false sharing among multiple processors.

• 8,000-by-8,000
– y has 8,000 elements (8,000 doubles).
– Thread 2: elements 4000 to 5999; Thread 3: elements 6000 to 7999.
– Only the boundary elements {y[5996], y[5997], y[5998], y[5999], y[6000], y[6001], y[6002], y[6003]} could share a cache line.
– The effect of false sharing is highly unlikely.

Page 34: Open Multiprocessing

34

Thread Safety

• How to generate random numbers in C?
– First, call srand() with an integer seed.
– Second, call rand() to create a sequence of random numbers.

• Pseudorandom Number Generator (PRNG)

• Is it thread safe?
– Can it be simultaneously executed by multiple threads without causing problems?

X_{n+1} = (a X_n + c) \bmod m

Shared state: the generator keeps X_n in a hidden internal variable that all threads would share.
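A hedged sketch of one common workaround, assuming a POSIX environment: rand_r() keeps the generator state in a caller-supplied variable, so each thread can own its own state (the seed values are illustrative).

#define _POSIX_C_SOURCE 200112L   /* make sure rand_r() is declared */
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(void) {
   /* Each thread uses its own seed, so no hidden state is shared. */
#  pragma omp parallel num_threads(4)
   {
      unsigned int seed = 1234u + omp_get_thread_num();   /* per-thread state */
      int r = rand_r(&seed);
      printf("Thread %d drew %d\n", omp_get_thread_num(), r);
   }
   return 0;
}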

Page 35: Open Multiprocessing

35

Foster’s Methodology

• Partitioning
– Divide the computation and the data into small tasks.
– Identify tasks that can be executed in parallel.

• Communication
– Determine what communication needs to be carried out.
– Local communication vs. global communication.

• Agglomeration
– Group tasks into larger tasks.
– Reduce communication.
– Task dependence.

• Mapping
– Assign the composite tasks to processes/threads.

Page 36: Open Multiprocessing

36

Foster’s Methodology

Page 37: Open Multiprocessing

37

The n-body Problem

• To predict the motion of a group of objects that interact with each other gravitationally over a period of time.

– Inputs: Mass, Position and Velocity

• Astrophysicist– The positions and velocities of a collection of stars

• Chemist– The positions and velocities of a collection of molecules

Page 38: Open Multiprocessing

38

Newton’s Law

Force on particle q due to particle k:

f_{qk}(t) = -\frac{G\, m_q m_k}{\left| s_q(t) - s_k(t) \right|^3} \left[ s_q(t) - s_k(t) \right]

Total force on particle q:

F_q(t) = \sum_{k=0,\; k \neq q}^{n-1} f_{qk}(t) = -G\, m_q \sum_{k=0,\; k \neq q}^{n-1} \frac{m_k}{\left| s_q(t) - s_k(t) \right|^3} \left[ s_q(t) - s_k(t) \right]

Acceleration of particle q:

a_q(t) = \frac{F_q(t)}{m_q} = -G \sum_{k=0,\; k \neq q}^{n-1} \frac{m_k}{\left| s_q(t) - s_k(t) \right|^3} \left[ s_q(t) - s_k(t) \right]

Page 39: Open Multiprocessing

39

The Basic Algorithm

Get input data;
for each timestep {
   if (timestep output)
      Print positions and velocities of particles;
   for each particle q
      Compute total force on q;
   for each particle q
      Compute position and velocity of q;
}

for each particle q {
   forces[q][0] = forces[q][1] = 0;
   for each particle k != q {
      x_diff = pos[q][0] - pos[k][0];
      y_diff = pos[q][1] - pos[k][1];
      dist = sqrt(x_diff*x_diff + y_diff*y_diff);
      dist_cubed = dist*dist*dist;
      forces[q][0] -= G*masses[q]*masses[k]/dist_cubed * x_diff;
      forces[q][1] -= G*masses[q]*masses[k]/dist_cubed * y_diff;
   }
}

Page 40: Open Multiprocessing

40

Newton’s 3rd Law of Motion

f_{kq} = -f_{qk}: once f_{qk} has been computed, the force on particle k due to particle q comes for free.

(Figure: example with n = 12 particles, q = 8, r = 3, showing f_{38} and f_{58} and the reaction forces f_{83} = -f_{38} and f_{85} = -f_{58}.)

Page 41: Open Multiprocessing

41

The Reduced Algorithm

for each particle q
   forces[q][0] = forces[q][1] = 0;

for each particle q {
   for each particle k > q {
      x_diff = pos[q][0] - pos[k][0];
      y_diff = pos[q][1] - pos[k][1];
      dist = sqrt(x_diff*x_diff + y_diff*y_diff);
      dist_cubed = dist*dist*dist;
      force_qk[0] = -G*masses[q]*masses[k]/dist_cubed * x_diff;
      force_qk[1] = -G*masses[q]*masses[k]/dist_cubed * y_diff;

      forces[q][0] += force_qk[0];
      forces[q][1] += force_qk[1];
      forces[k][0] -= force_qk[0];
      forces[k][1] -= force_qk[1];
   }
}

Page 42: Open Multiprocessing

42

Euler Method

Follow the tangent line at t_0 for a step of length \Delta t:

y(t_0 + \Delta t) \approx y(t_0) + \Delta t \, y'(t_0)

Page 43: Open Multiprocessing

43

Position and Velocity

s_q(t + \Delta t) \approx s_q(t) + \Delta t \, s_q'(t) = s_q(t) + \Delta t \, v_q(t)

v_q(t + \Delta t) \approx v_q(t) + \Delta t \, v_q'(t) = v_q(t) + \Delta t \, a_q(t) = v_q(t) + \Delta t \, \frac{F_q(t)}{m_q}

for each particle q {
   pos[q][0] += delta_t*vel[q][0];
   pos[q][1] += delta_t*vel[q][1];
   vel[q][0] += delta_t*forces[q][0]/masses[q];
   vel[q][1] += delta_t*forces[q][1]/masses[q];
}

Page 44: Open Multiprocessing

44

Communications: Basic

(Data-flow diagram: the positions and velocities s_q(t), v_q(t), s_r(t), v_r(t) feed the forces F_q(t) and F_r(t); these in turn give s_q(t + Δt), v_q(t + Δt), s_r(t + Δt), v_r(t + Δt), which feed F_q(t + Δt) and F_r(t + Δt), and so on.)

Page 45: Open Multiprocessing

45

Agglomeration: Basic

(Diagram: the per-particle tasks are agglomerated into (s_q, v_q, F_q) and (s_r, v_r, F_r); between timestep t and t + Δt the two tasks exchange the positions s_q and s_r.)

Page 46: Open Multiprocessing

46

Agglomeration: Reduced

(Diagram: in the reduced scheme, for q < r the task for particle q sends the partial force f_qr to the task for particle r, while r sends its position s_r to q, between timestep t and t + Δt.)

Page 47: Open Multiprocessing

47

Parallelizing the Basic Solver

#  pragma omp parallel
   for each timestep {
      if (timestep output) {
#        pragma omp single nowait
         Print positions and velocities of particles;
      }
#     pragma omp for
      for each particle q
         Compute total force on q;
#     pragma omp for
      for each particle q
         Compute position and velocity of q;
   }

Race Conditions?

Page 48: Open Multiprocessing

48

Parallelizing the Reduced Solver

#  pragma omp for
   for each particle q
      forces[q][0] = forces[q][1] = 0;

#  pragma omp for
   for each particle q {
      for each particle k > q {
         x_diff = pos[q][0] - pos[k][0];
         y_diff = pos[q][1] - pos[k][1];
         dist = sqrt(x_diff*x_diff + y_diff*y_diff);
         dist_cubed = dist*dist*dist;
         force_qk[0] = -G*masses[q]*masses[k]/dist_cubed * x_diff;
         force_qk[1] = -G*masses[q]*masses[k]/dist_cubed * y_diff;

         forces[q][0] += force_qk[0];
         forces[q][1] += force_qk[1];
         forces[k][0] -= force_qk[0];
         forces[k][1] -= force_qk[1];
      }
   }

Page 49: Open Multiprocessing

49

Does it work properly?

• Consider 2 threads and 4 particles.

• Thread 1 is assigned particle 0 and particle 1.

• Thread 2 is assigned particle 2 and particle 3.

• F3=-f03-f13-f23

• Who will calculate f03 and f13?

• Who will calculate f23?

• Any race conditions?

Page 50: Open Multiprocessing

50

Thread Contributions

3 threads, 6 particles, block partition

Thread | Particle | Thread 0 contributes         | Thread 1 contributes | Thread 2 contributes
  0    |    0     | f01 + f02 + f03 + f04 + f05  | 0                    | 0
  0    |    1     | -f01 + f12 + f13 + f14 + f15 | 0                    | 0
  1    |    2     | -f02 - f12                   | f23 + f24 + f25      | 0
  1    |    3     | -f03 - f13                   | -f23 + f34 + f35     | 0
  2    |    4     | -f04 - f14                   | -f24 - f34           | f45
  2    |    5     | -f05 - f15                   | -f25 - f35           | -f45

Page 51: Open Multiprocessing

51

Thread Contributions

3 threads, 6 particles, cyclic partition

Thread | Particle | Thread 0 contributes        | Thread 1 contributes  | Thread 2 contributes
  0    |    0     | f01 + f02 + f03 + f04 + f05 | 0                     | 0
  1    |    1     | -f01                        | f12 + f13 + f14 + f15 | 0
  2    |    2     | -f02                        | -f12                  | f23 + f24 + f25
  0    |    3     | -f03 + f34 + f35            | -f13                  | -f23
  1    |    4     | -f04 - f34                  | -f14 + f45            | -f24
  2    |    5     | -f05 - f35                  | -f15 - f45            | -f25

Page 52: Open Multiprocessing

52

First Phase

#  pragma omp for
   for each particle q {
      for each particle k > q {
         x_diff = pos[q][0] - pos[k][0];
         y_diff = pos[q][1] - pos[k][1];
         dist = sqrt(x_diff*x_diff + y_diff*y_diff);
         dist_cubed = dist*dist*dist;
         force_qk[0] = -G*masses[q]*masses[k]/dist_cubed * x_diff;
         force_qk[1] = -G*masses[q]*masses[k]/dist_cubed * y_diff;

         loc_forces[my_rank][q][0] += force_qk[0];
         loc_forces[my_rank][q][1] += force_qk[1];
         loc_forces[my_rank][k][0] -= force_qk[0];
         loc_forces[my_rank][k][1] -= force_qk[1];
      }
   }

Page 53: Open Multiprocessing

53

Second Phase

#  pragma omp for
   for (q = 0; q < n; q++) {
      forces[q][0] = forces[q][1] = 0;
      for (thread = 0; thread < thread_count; thread++) {
         forces[q][0] += loc_forces[thread][q][0];
         forces[q][1] += loc_forces[thread][q][1];
      }
   }

Race Conditions?

• In the first phase, each thread carries out the same calculations as before but the values are stored in its own array of forces (loc_forces).

• In the second phase, the thread that has been assigned particle q will add the contributions that have been computed by different threads.

Page 54: Open Multiprocessing

54

Evaluating the OpenMP Codes

• In the reduced code:
– Loop 1: Initialization of the loc_forces array
– Loop 2: The first phase of the computation of forces
– Loop 3: The second phase of the computation of forces
– Loop 4: The updating of positions and velocities

• Which schedule should be used?

Threads | Basic | Reduced, Default Schedule | Reduced, Forces Cyclic | Reduced, All Cyclic
   1    | 7.71  | 3.90                      | 3.90                   | 3.90
   2    | 3.87  | 2.94                      | 1.98                   | 2.01
   4    | 1.95  | 1.73                      | 1.01                   | 1.08
   8    | 0.99  | 0.95                      | 0.54                   | 0.61

Page 55: Open Multiprocessing

55

Review

• What are the major differences between MPI and OpenMP?

• What is the scope of a variable?

• What is a reduction variable?

• How to ensure mutual exclusion in a critical section?

• What are the common loop scheduling options?

• What is a thread safe function?

• What factors may affect the performance of an OpenMP program?