OpenMP to CUDA

Mapping OpenMP to the Stream Programming Model Hu Ming Zhang Fangzhou Yue Kun

Upload: hu-ming

Post on 21-Apr-2015


Page 1: OpenMP to CUDA

Mapping OpenMP to the Stream Programming Model

Hu Ming Zhang Fangzhou Yue Kun

Page 2: OpenMP to CUDA

Objective
1. Study how the parallel mechanisms of OpenMP map to the stream programming model (CUDA).
2. Identify which parts are suitable for translation.
3. Analyze typical scientific applications.

Page 3: OpenMP to CUDA

Outline
OpenMP vs CUDA: Execution model
OpenMP vs CUDA: Semantics
OpenMP vs CUDA: Performance
Analysis of Benchmarks

Page 4: OpenMP to CUDA

OpenMP vs CUDA Execution Model

Page 5: OpenMP to CUDA

OpenMP vs CUDA Execution Model

Page 6: OpenMP to CUDA
Page 7: OpenMP to CUDA

OpenMP vs CUDA Semantics

Parallel Construct: parallel

Worksharing Constructs: loop, sections, single

Master and Synchronization Constructs: critical, barrier, taskwait, atomic, flush, ordered

Data Environment: shared, private, firstprivate, lastprivate, reduction, copyin, copyprivate

Page 8: OpenMP to CUDA

OpenMP vs CUDA Semantics

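#include <omp.h>

int main(void)
{
    int x;

    x = 0;

    /* x is shared by all threads in the team */
    #pragma omp parallel shared(x)
    {
        /* only one thread at a time may update x */
        #pragma omp critical
        x = x + 1;
    }
    /* end of parallel section */

    return 0;
}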
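The critical section in this example protects a single update of `x`, so the same effect can be had with `#pragma omp atomic`, which is closer in spirit to the atomic read-modify-write operations that CUDA provides (for example `atomicAdd`). The following is a minimal sketch in C, not taken from the slides; the helper name `count_iterations_atomic` is hypothetical.

```c
#include <assert.h>

/* Hypothetical helper: every iteration increments a shared counter.
 * The atomic directive keeps the increment safe when the loop runs in
 * parallel. Without OpenMP support the pragmas are ignored and the
 * loop runs sequentially with the same result. */
int count_iterations_atomic(int n)
{
    int x = 0;
    int i;

    #pragma omp parallel for shared(x)
    for (i = 0; i < n; i++) {
        #pragma omp atomic
        x += 1;
    }
    return x;
}
```

Calling `count_iterations_atomic(1000)` returns 1000 regardless of the number of threads, because no increment is lost.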

Page 9: OpenMP to CUDA

OpenMP vs CUDA Semantics

#pragma omp for ordered [clauses...]
(loop region)
#pragma omp ordered
structured_block
(end of loop region)
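To make the `ordered` syntax concrete, here is a small sketch in C, assuming a hypothetical helper `fill_squares_in_order`. The squaring may run in parallel, while the block under `#pragma omp ordered` executes in the sequential order of the loop iterations.

```c
#include <assert.h>

/* Hypothetical helper: squares 0..n-1 and appends the results to out
 * strictly in iteration order. Returns the number of values written. */
int fill_squares_in_order(int *out, int n)
{
    int next = 0;   /* next free slot in out; only touched inside ordered */
    int i;

    #pragma omp parallel for ordered shared(next)
    for (i = 0; i < n; i++) {
        int sq = i * i;          /* may run in parallel */

        #pragma omp ordered
        {
            out[next] = sq;      /* serialized, in loop order */
            next++;
        }
    }
    return next;
}
```

Because the ordered block serializes the writes, `out` ends up as 0, 1, 4, 9, ... in index order. CUDA gives no ordering guarantee between threads in different blocks, so a translator would probably have to handle this construct specially.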

Page 10: OpenMP to CUDA

OpenMP vs CUDA Semantics

Most of the directives and clauses can be mapped to stream programs.
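As an illustration of what such a mapping can look like for a worksharing loop, the sketch below emulates in plain C the index arithmetic a CUDA kernel would typically use: each (block, thread) pair handles the element at `block * block_size + thread`. This is an assumption about one possible translation, not the scheme from the slides, and all function names are hypothetical.

```c
#include <assert.h>

/* Reference version: the OpenMP loop being mapped. */
void scale_openmp(float *a, float s, int n)
{
    int i;

    #pragma omp parallel for
    for (i = 0; i < n; i++) {
        a[i] = s * a[i];
    }
}

/* Emulation of the kernel-style mapping. The two loops stand in for
 * the grid of blocks and the threads inside each block; on a GPU all
 * of these (block, thread) pairs would run concurrently. */
void scale_kernel_emulated(float *a, float s, int n, int block_size)
{
    int num_blocks = (n + block_size - 1) / block_size;  /* round up */
    int block, thread;

    for (block = 0; block < num_blocks; block++) {
        for (thread = 0; thread < block_size; thread++) {
            int i = block * block_size + thread;  /* global index */
            if (i < n) {                          /* guard the tail */
                a[i] = s * a[i];
            }
        }
    }
}
```

Both functions produce the same array. The `i < n` guard is needed because the last block may contain more threads than there are remaining elements.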

Page 11: OpenMP to CUDA

OpenMP vs CUDA Performance

CUDA: lightweight hardware threads, data-centric processing model, simple control logic, inefficient at handling branches

OpenMP: OS-level threads, thread-centric parallel processing model, per-thread logic can be complicated

Map those constructs that have large parallelism and uniform processing among threads.
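A sum reduction is one plausible candidate with large parallelism and uniform per-thread work. On a GPU a reduction is commonly organized as a tree: at each step, half of the active threads add a partner's value to their own. The sketch below emulates that pattern in plain C, assuming the array length is a power of two; the name `tree_reduce_sum` is hypothetical.

```c
#include <assert.h>

/* In-place tree reduction over buf[0..n-1]; n must be a power of two.
 * Each pass of the outer loop corresponds to one step that all threads
 * of a block would execute together, separated by a barrier. */
int tree_reduce_sum(int *buf, int n)
{
    int stride, t;

    for (stride = n / 2; stride > 0; stride /= 2) {
        /* "threads" 0..stride-1 do identical work on different data */
        for (t = 0; t < stride; t++) {
            buf[t] += buf[t + stride];
        }
    }
    return buf[0];
}
```

Every emulated thread executes the same statement, which is the kind of uniform processing the slide recommends mapping. The input buffer is overwritten with partial sums.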

Page 12: OpenMP to CUDA

OpenMP vs CUDA Performance

Not suitable:
single, sections ---- they have small parallelism and different processing among threads
master ---- parallelism is 1
barrier, taskwait ---- demand that all threads be grouped into one block
lastprivate ---- processing is not uniform among threads
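The `lastprivate` case can be seen in a short C sketch, assuming a hypothetical helper `last_square` and `n >= 1`. Every thread writes its own private copy of `last`, but only the thread that runs the sequentially last iteration copies its value back, so that one thread does extra work the others skip.

```c
#include <assert.h>

/* Hypothetical helper: returns the value written by the sequentially
 * last iteration (i == n - 1). Assumes n >= 1. */
int last_square(int n)
{
    int last = 0;
    int i;

    #pragma omp parallel for lastprivate(last)
    for (i = 0; i < n; i++) {
        last = i * i;   /* each thread updates its private copy */
    }
    /* here last holds the value from iteration n - 1 */
    return last;
}
```

For `n = 10` the function returns 81, the square of the final index, regardless of how iterations were distributed among threads.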

Page 13: OpenMP to CUDA

OpenMP vs CUDA

To judge whether it is reasonable to translate an OpenMP program into a CUDA program, we should analyze the application's pattern.

Page 14: OpenMP to CUDA
Page 15: OpenMP to CUDA
Page 16: OpenMP to CUDA

Conclusion
1. A majority of scientific applications are suitable for mapping to the stream programming model.
2. Heterogeneous architectures that combine CPU and GPU will become more common.

Page 17: OpenMP to CUDA

Comments:
1. This paper's work is mainly analysis.
2. We think more real applications should be considered, not just benchmarks.
3. Automatic translation of OpenMP programs to CUDA programs may be possible.