OpenMP to CUDA
Mapping OpenMP to the Stream Programming Model
Hu Ming, Zhang Fangzhou, Yue Kun
Objective
1. Study how the parallel mechanisms in OpenMP map to the stream programming model (CUDA).
2. Identify which parts are suitable for translation.
3. Analyze typical scientific applications.
Outline
OpenMP vs CUDA: Execution Model
OpenMP vs CUDA: Semantics
OpenMP vs CUDA: Performance
Analysis of Benchmarks
OpenMP vs CUDA Execution Model
OpenMP vs CUDA Semantics
Parallel construct: parallel
Worksharing constructs: loop, sections, single
Master and synchronization constructs: critical, barrier, taskwait, atomic, flush, ordered
Data environment: shared, private, firstprivate, lastprivate, reduction, copyin, copyprivate
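As an illustration of the data environment clauses listed above, here is a minimal sketch in C. The function name scaled_sum and its parameters are hypothetical and not taken from the slides. It assumes a compiler that either honors the OpenMP pragmas or ignores them, in which case the loop simply runs serially.

```c
#include <assert.h>

/* Hypothetical helper: returns the sum of a[i] * scale over n elements. */
long scaled_sum(const int *a, int n, int scale)
{
    assert(n >= 0);

    long total = 0; /* reduction: each thread accumulates a private copy, combined at the end */
    int tmp;        /* private: each thread gets its own uninitialized scratch variable */

    /* shared(a): all threads read the same array pointer.
       firstprivate(scale): each thread gets a copy initialized from the original. */
    #pragma omp parallel for shared(a) private(tmp) firstprivate(scale) reduction(+:total)
    for (int i = 0; i < n; i++) {
        tmp = a[i] * scale;
        total += tmp;
    }
    return total;
}
```

Calling scaled_sum on the array {1, 2, 3, 4} with scale 3 should give 30 regardless of the number of threads, since the reduction clause combines the per-thread partial sums.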
OpenMP vs CUDA Semantics
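#include <stdio.h>
#include <omp.h>

int main(void)
{
    int x = 0;

    /* x is shared: every thread in the team updates the same variable */
    #pragma omp parallel shared(x)
    {
        /* critical: only one thread at a time executes the update */
        #pragma omp critical
        x = x + 1;
    } /* end of parallel region */

    printf("x = %d\n", x); /* equals the number of threads in the team */
    return 0;
}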
OpenMP vs CUDA Semantics
#pragma omp for ordered [clauses...]
    (loop region)
    #pragma omp ordered
        structured_block
    (end of loop region)
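A concrete instance of the ordered syntax above might look like the following sketch. The function ordered_trace and its trace buffer are hypothetical names chosen for illustration. The point is that the work outside the ordered block can run in parallel, while the block itself executes in sequential iteration order, which is the behavior any translation would have to preserve.

```c
/* Hypothetical helper: fills trace[0..n-1] in the order the ordered region runs. */
void ordered_trace(int *trace, int n)
{
    int pos = 0; /* shared write cursor, only touched inside the ordered region */

    #pragma omp parallel for ordered schedule(static, 1) shared(trace, pos)
    for (int i = 0; i < n; i++) {
        int sq = i * i; /* this part may run concurrently across threads */

        #pragma omp ordered
        {
            /* executed one iteration at a time, in loop order */
            trace[pos++] = sq;
        }
    }
}
```

After the call, trace[i] should hold i * i for every i, because the ordered region serializes the writes in iteration order even when several threads execute the loop.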
OpenMP vs CUDA Semantics
Most directives and clauses can be mapped to stream programs.
OpenMP vs CUDA Performance
CUDA: lightweight hardware threads; data-centric processing model; simple control logic; inefficient at handling branches
OpenMP: OS-level threads; thread-centric parallel processing model; threads can be complicated
Map those constructs that have large parallelism and uniform processing among threads
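A loop with large parallelism and uniform per-iteration work could look like the sketch below. The SAXPY computation and the names saxpy and saxpy_element are assumptions made for illustration, not examples from the slides. The per-element function is written so that it depends only on the index i, which is roughly the shape a stream kernel body would take if one thread were assigned to each index.

```c
/* Per-element body: the work a single stream thread would do
   (assumption: one thread per index, no branches, no cross-iteration dependence). */
static void saxpy_element(int i, float alpha, const float *x, float *y)
{
    y[i] = alpha * x[i] + y[i];
}

/* OpenMP version: every iteration performs the same operation on different data. */
void saxpy(int n, float alpha, const float *x, float *y)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        saxpy_element(i, alpha, x, y);
}
```

Factoring the loop body into an index-only function is a design choice for readability here; it makes explicit that iterations are independent and identical in control flow.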
OpenMP vs CUDA Performance
Not suitable:
single, sections: small parallelism and different processing among threads
master: parallelism is 1
barrier, taskwait: require all threads to be grouped into one block
lastprivate: processing is not uniform among threads
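Two of the constructs listed as not suitable can be illustrated with a short sketch. The functions count_single and last_square are hypothetical names. In the first, only one thread does any work while the rest wait; in the second, the thread that runs the final iteration has an extra copy-out step that the other threads do not.

```c
#include <assert.h>

/* single: exactly one thread executes the block, so the parallelism is 1. */
int count_single(void)
{
    int hits = 0;
    #pragma omp parallel shared(hits)
    {
        #pragma omp single
        hits++; /* other threads skip this and wait at the implicit barrier */
    }
    return hits;
}

/* lastprivate: only the sequentially last iteration writes back to the original variable. */
int last_square(int n)
{
    assert(n >= 1); /* assumption: at least one iteration, so `last` is well defined */

    int last = -1;
    #pragma omp parallel for lastprivate(last)
    for (int i = 0; i < n; i++)
        last = i * i;
    return last;
}
```

Here count_single should return 1 for any team size, and last_square(4) should return 9, the value from iteration i = 3.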
OpenMP vs CUDA
To understand whether it is reasonable to translate an OpenMP program into a CUDA program, we should analyze the application’s pattern.
Conclusion
1. A majority of scientific applications are suitable to be mapped to the stream programming model.
2. Heterogeneous architectures using CPU and GPU will become more common.
Comments:
1. This paper’s work is mainly analysis.
2. We think more real applications should be considered, not just benchmarks.
3. Automatically translating OpenMP programs to CUDA programs may be possible.