PT-4142: Porting and Optimizing OpenMP applications to APU using CAPS tools, by Jean-Charles Vasnier
PORTING AND OPTIMIZING OPENMP APPLICATIONS TO APU USING CAPS TOOLS
JEAN-CHARLES VASNIER, CAPS ENTREPRISE
2 | PRESENTATION TITLE | NOVEMBRE 19, 2013 | CONFIDENTIAL
AGENDA
• CAPS enterprise
• OpenACC
• CAPS Compilers
• CAPS OpenMP Compiler for AMD APUs
  ‒ Compiler analyses and code generation
  ‒ Interactive report
• Experiments with benchmark applications
  ‒ HydroC
• Future work
CAPS OpenMP Compiler - June 2013 2
CAPS enterprise
COMPANY PROFILE
• Founded in 2002
  ‒ Extensive expertise in processor micro-architecture and code generation
  ‒ Spin-off of the French INRIA research lab
  ‒ 30 employees
• Mission: to help its customers leverage the performance of multi/manycore machines
  ‒ Consulting & engineering services
  ‒ CAPS OpenACC Compiler & toolchain
  ‒ Training
• Expanding sales worldwide
  ‒ Resellers in US and APAC (Exxact, Absoft, JCC Gimmick Ltd, Nodasys, …)
www.caps-entreprise.com 4
CAPS ECOSYSTEM
Business Partners
European R&D Projects
Customers
OpenACC
OPENACC INITIATIVE
• A CAPS, Cray, NVIDIA and PGI initiative
• Open standard
• A directive-based approach for programming heterogeneous many-core hardware for C and Fortran applications
• http://www.openacc-standard.com
DIRECTIVE-BASED PROGRAMMING (1)
• Three ways of programming GPGPU applications:
  ‒ Libraries: ready-to-use acceleration
  ‒ Directives: quickly accelerate existing applications
  ‒ Programming languages: maximum performance
DIRECTIVE-BASED PROGRAMMING (2)
EXECUTION MODEL
• Most of the computation runs on the CPU; selected regions can be offloaded to hardware accelerators
  ‒ Parallel regions
  ‒ Kernels regions
• Host is responsible for:
  ‒ Allocating memory space on the accelerator
  ‒ Initiating data transfers
  ‒ Launching computations
  ‒ Waiting for completion
  ‒ Deallocating memory space
• Accelerators execute parallel regions:
  ‒ Use work-sharing directives
  ‒ Specify the level of parallelization
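The host responsibilities listed above map onto OpenACC directives roughly as follows. This is a minimal hand-written sketch in C, not output of the CAPS toolchain; the function name `scale_on_accelerator` is an illustrative assumption. A compiler without OpenACC support ignores the pragmas and simply runs the loop on the CPU.

```c
/* Hypothetical example: scale a vector on an accelerator.
 * The data construct asks the host to allocate device memory for x,
 * copy it in on entry, then copy it back and free it on exit. */
void scale_on_accelerator(float *x, int n, float factor)
{
    #pragma acc data copy(x[0:n])
    {
        /* The host launches this parallel region and waits for it. */
        #pragma acc parallel loop
        for (int i = 0; i < n; ++i) {
            x[i] *= factor;
        }
    }
}
```

The `copy` clause covers allocation, both transfers and deallocation in one place, which is why the host code contains no explicit memory management calls.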
OPENACC EXECUTION MODEL
• Host-controlled execution
• Based on three parallelism levels
  ‒ Gangs – coarse grain
  ‒ Workers – fine grain
  ‒ Vectors – finest grain
[Figure: a device containing several gangs; each gang contains workers, and each worker contains vectors]
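One way to picture the three levels is a loop nest in which each loop is assigned to one level. The sketch below is an assumption-laden illustration in C (the function name `add3d` and the array layout are hypothetical); the mapping a real compiler chooses may differ.

```c
/* Hypothetical example: element-wise sum of two n x m x p arrays
 * stored contiguously, with the last index varying fastest. */
void add3d(const float *a, const float *b, float *c, int n, int m, int p)
{
    /* Outer loop: coarse-grain parallelism, spread across gangs. */
    #pragma acc parallel loop gang copyin(a[0:n*m*p], b[0:n*m*p]) copyout(c[0:n*m*p])
    for (int i = 0; i < n; ++i) {
        /* Middle loop: fine-grain parallelism, spread across workers. */
        #pragma acc loop worker
        for (int j = 0; j < m; ++j) {
            /* Inner loop: finest grain, executed in vector lanes. */
            #pragma acc loop vector
            for (int k = 0; k < p; ++k) {
                int idx = (i * m + j) * p + k;
                c[idx] = a[idx] + b[idx];
            }
        }
    }
}
```

Putting the contiguous index `k` on the vector level keeps neighbouring lanes on neighbouring memory addresses.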
CAPS Compilers
OPENACC COMPILERS (1)
• CAPS Compilers
  ‒ Source-to-source compilers
  ‒ Support Intel Xeon Phi, NVIDIA GPUs, AMD GPUs and APUs
• PGI Accelerator
  ‒ Extension of the x86 PGI compiler
  ‒ Supports Intel Xeon Phi, NVIDIA GPUs, AMD GPUs and APUs
• Cray Compilers
  ‒ Provided with Cray systems only
CAPS COMPILERS (2)
CAPS Compilers are source-to-source compilers composed of three parts:
• The directives (OpenACC or OpenHMPP)
  ‒ Define the parts of the code to be accelerated
  ‒ Indicate resource allocation and communication
  ‒ Ensure portability
• The toolchain
  ‒ Helps build manycore applications
  ‒ Includes compilers and target code generators
  ‒ Isolates hardware-specific computations
  ‒ Uses the hardware vendor SDK
• The runtime
  ‒ Helps adapt to the platform configuration
  ‒ Manages hardware resource availability
CAPS COMPILERS (3)
• Take the original application as input and generate another application source code as output
  ‒ Automatically turn the OpenACC source code into accelerator-specific source code (CUDA, OpenCL)
• Compile the entire hybrid application
• Just prefix the original compilation line with capsmc to produce a hybrid application
    $ capsmc gcc myprogram.c
    $ capsmc gfortran myprogram.f90
• Compatible with:
  ‒ GNU
  ‒ Intel
  ‒ Open64
  ‒ Absoft
  ‒ …
CAPS COMPILERS (4)
• CAPS Compilers drive all compilation passes
• Host application compilation
  ‒ Calls traditional CPU compilers
  ‒ CAPS Runtime is linked to the host part of the application
• Device code production
  ‒ According to the specified target
  ‒ A dynamic library is built
[Figure: compilation flow. Labels: C++ Frontend, C Frontend, Fortran Frontend; Extraction module; Host code; codelets (Fun #1, Fun #2, Fun #3); Instrumentation module; CPU compiler (gcc, ifort, …); Executable (mybin.exe); CAPS Runtime; CUDA Code Generation; CUDA compilers; OpenCL Generation; OpenCL compilers; HWA Code (Dynamic library)]
From OpenMP To OpenACC
CAPS OPENMP COMPILER
• Automatically turns OpenMP codes into OpenACC
• Diagnoses compatibility issues and suggests code transformations
• Builds accelerated versions based on CUDA or OpenCL
• Works with all platforms
  ‒ AMD and NVIDIA GPUs
  ‒ AMD APUs
  ‒ Intel Xeon Phi
CAPS OPENMP COMPILER OVERVIEW
[Figure: three stages: Profiling, Analysis, Acceleration]
EXTENSION OF THE CAPS OPENACC COMPILER
• Converts OpenMP codes into OpenACC
  ‒ Examines OpenMP loop nests and checks their OpenACC compatibility
  ‒ Diagnoses incompatibilities and proposes advice
  ‒ Builds an APU version based on OpenCL
• Builds an interactive report
  ‒ Based on the compiler static and dynamic analyses
  ‒ OpenMP to OpenACC kernels view
      o Performance details of each region
  ‒ Regions' inputs/outputs and data dependencies between regions
  ‒ Gives the user control over pushing kernels onto the GPU and managing data transfers
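To make the OpenMP-to-OpenACC conversion concrete, the sketch below shows one OpenMP loop next to a hand-written OpenACC counterpart. It is an illustration in C only: the function names are hypothetical, and the OpenACC version is an assumption about what an equivalent region could look like, not actual output of the CAPS OpenMP Compiler.

```c
/* Original OpenMP version: iterations are shared among CPU threads. */
void saxpy_openmp(int n, float a, const float *x, float *y)
{
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}

/* Hand-written OpenACC counterpart: the same loop becomes an
 * offloadable region, and the data clauses state what must be
 * transferred (x is read only, y is read and written). */
void saxpy_openacc(int n, float a, const float *x, float *y)
{
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```

The loop body is unchanged; what the conversion has to add is the data movement information, which OpenMP on shared memory never needed.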
OPENMP-BASED OPTIMIZATION PROCESS
[Figure: process flow. Application with OpenMP directives → Instrumentation → Traceable application → Execution → Profiling report → Analysis → HTML interactive report → Generation → Accelerated executable]
INSTRUMENTATION AND PROFILING PHASES
• Code preprocessing and instrumentation
  ‒ Identify supported OpenMP regions
      o parallel, for and parallel for constructs
  ‒ Instrument the code to track data and measure kernel performance
• Instrumented application execution
  ‒ Based on the user data set
  ‒ Number of times an OpenMP region is executed
  ‒ Region's reads and writes
  ‒ Range of loop iterations
  ‒ Region performance
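As a rough picture of what instrumentation around an OpenMP region might record, here is a minimal C sketch. Everything in it is hypothetical (the `region_profile` record and the `profile_*` probes are invented for illustration and are not the CAPS instrumentation API), and it tracks only the call count, loop range and time, not the region's reads and writes.

```c
#include <limits.h>
#include <time.h>

/* Hypothetical per-region profile record; field names are illustrative. */
typedef struct {
    long    calls;     /* number of times the region was executed */
    int     min_lower; /* smallest loop lower bound seen */
    int     max_upper; /* largest loop upper bound seen */
    double  seconds;   /* accumulated processor time in the region */
    clock_t started;   /* timestamp taken on region entry */
} region_profile;

void profile_init(region_profile *p)
{
    p->calls = 0;
    p->min_lower = INT_MAX;
    p->max_upper = INT_MIN;
    p->seconds = 0.0;
    p->started = 0;
}

void profile_enter(region_profile *p, int lower, int upper)
{
    p->calls += 1;
    if (lower < p->min_lower) p->min_lower = lower;
    if (upper > p->max_upper) p->max_upper = upper;
    p->started = clock();
}

void profile_exit(region_profile *p)
{
    /* clock() reports processor time; a real tool would more likely
     * use a wall-clock timer. */
    p->seconds += (double)(clock() - p->started) / CLOCKS_PER_SEC;
}

/* An OpenMP region wrapped with the hypothetical probes. */
double instrumented_sum(region_profile *p, const double *x, int n)
{
    double sum = 0.0;
    profile_enter(p, 0, n);
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; ++i) {
        sum += x[i];
    }
    profile_exit(p);
    return sum;
}
```

Running the instrumented application on the user's data set fills one such record per region, which is the kind of raw material a later analysis step could consume.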
ANALYSIS PHASE
• Generates an interactive HTML report
  ‒ Based on the compiler static and dynamic analyses
  ‒ Metrics for each OpenMP region
      o OpenACC compliance check
      o Computation density
      o Coalescing of data accesses
      o Estimated speed-up
      o Memory usage
  ‒ Proposes GPU execution or native OpenMP execution
  ‒ Data usage and data dependency graph between regions
      o Determines when transfers are required between kernels
  ‒ Lets the user modify the CPU or GPU execution and the data transfer policy
HTML INTERACTIVE REPORT (1)
• Get a regions overview in a snap!
• Code view: from OpenMP to OpenACC directives
HTML INTERACTIVE REPORT (2)
• Performance details of each region
• Analysis conclusions and portability diagnosis
HTML INTERACTIVE REPORT (3)
• Regions' inputs/outputs and data dependencies map
HTML INTERACTIVE REPORT (4)
• Take control!
  ‒ Manually push kernels onto accelerators
  ‒ Manage data transfers
CODE GENERATION PHASE
• Same as the CAPS OpenACC Compiler
  ‒ Based on the analysis report
  ‒ Generates OpenCL kernels from OpenACC
  ‒ Automatic data updates to ensure memory coherency
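The slide says data updates are inserted automatically. To show what such updates look like when written by hand, here is a small C sketch using the standard OpenACC `update` directive; the function name and the scenario (a CPU step between two kernels) are illustrative assumptions.

```c
/* Hypothetical example: two accelerator kernels with a CPU step in
 * between. The update directives keep host and device copies of x
 * coherent without leaving the enclosing data region. */
void two_kernels_with_cpu_step(float *x, int n)
{
    #pragma acc data copy(x[0:n])
    {
        #pragma acc parallel loop
        for (int i = 0; i < n; ++i) {
            x[i] += 1.0f;
        }

        /* The CPU needs the fresh value of x[0]: refresh the host copy. */
        #pragma acc update host(x[0:1])
        x[0] = -x[0];
        /* The CPU changed x[0]: push the new value back to the device. */
        #pragma acc update device(x[0:1])

        #pragma acc parallel loop
        for (int i = 0; i < n; ++i) {
            x[i] *= 2.0f;
        }
    }
}
```

Only the single element touched by the CPU is transferred; the rest of the array stays on the device between the two kernels.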
FEATURES
• Diagnoses
  ‒ OpenACC compliance
  ‒ Computational density
  ‒ Coalescing of data accesses
  ‒ Memory usage
  ‒ Estimated speed-up
• Automatic porting to AMD, NVIDIA, or Intel accelerators
• Accelerates execution or keeps the native OpenMP one
• Gives users control for manual optimizations
Application Experiments
HARDWARE AND SOFTWARE ENVIRONMENT
• Linux system
  ‒ AMD SDK 2.8
  ‒ CAPS Compiler revision 50387
  ‒ GCC 4.6.1
  ‒ OpenMPI 1.6.4
• Hardware
  ‒ AMD A10-5800K APU with Radeon HD Graphics
APPLICATIONS STATUS
• Main objective is proof of concept, not performance
  ‒ Performance limitations of the current version of the APU
• HydroC
  ‒ Most convincing demo
  ‒ x1.3 speed-up by modifying the execution and transfer policy
HYDROC HTML REPORT
Future Work: C2PO
C2PO MISSION STATEMENT
• Combines various CAPS technologies in a modular tool chain
  ‒ Static and dynamic code analyzers
  ‒ OpenMP to OpenACC code transformers
  ‒ Kernel micro-bencher
  ‒ Plugs into third-party tools: VTune, CUDA profiler
  ‒ Uses the CAPS Compiler at the final stage to produce the manycore application
C2PO - Oct. 2013 35
Guides you through the whole process of porting and tuning applications onto manycore parallel systems
C2PO PHASES
1. Generates an OpenACC skeleton from OpenMP or sequential code
  ‒ Hotspot detection and dataflow analysis
2. Gives global and local advice on
  ‒ Data management/placement between kernels or regions
  ‒ Top ten tips on kernel performance
      o Data coalescing, parallelism, gridification, loop order
3. Lets you rapidly optimize the performance of kernels
  ‒ Extract functions, loops or annotated regions
  ‒ Tune kernel code following C2PO advice
  ‒ Replay standalone with application data and measure the performance gain
  ‒ Re-inject the optimized kernels into the application source code
4. Uses CAPS Compilers to build for Intel Xeon Phi, NVIDIA or AMD GPUs
[Figure: phase labels: Dataflow analysis; OpenACC skeleton generation; Extract loops, functions, regions; Fine tune kernels; User input]
C2PO TOOL CHAIN
[Figure: C2PO tool chain. Stages: Code skeleton generation, Global tuning, Local tuning. Inputs: Sequential Code, OpenMP Code. Components: OpenACC Generator, Data Movement Analyzer, µbencher, Kernels Performance Analyzer, CUDA profiler, VTune. Outputs: HTML Report, OpenACC Code, Interactive Report]
C2PO OPENACC GENERATION
• From sequential or OpenMP code to first parallelized code
  ‒ Instrument the application and detect hotspots
  ‒ Generate an OpenACC skeleton of kernels from loops
  ‒ Manage data transfers between kernels
• A report is generated containing
  ‒ Various performance metrics
      o Kernel execution
      o Memory reads and writes
      o Potential performance gain
  ‒ Data dependencies and usage between kernels
  ‒ OpenACC code view
C2PO GLOBAL TUNING
• Dynamic tracking of data in order to optimize its movement
  ‒ Dynamically trace uploads and downloads at execution time
  ‒ Detect potentially redundant data transfers

    #openacc data region
    // convergence loop
    for {
        Upload data()
        Kernels' calls()
        Download data()
    }
    …

It is difficult for the compiler to detect any CPU use of the data.
Possible advice: are the following parameters modified by the CPU between the downloads and uploads?
If yes, insert an OpenACC data region covering the unmodified parameters.
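The advice above amounts to hoisting a data region out of the convergence loop so that arrays the CPU never touches stay on the accelerator between iterations. A minimal C sketch under stated assumptions: the relaxation kernel, the function name `relax_until_converged` and its parameters are all hypothetical, chosen only to give the loop something to converge on.

```c
/* Hypothetical example: 1D relaxation with fixed end points.
 * Returns the number of iterations performed. */
int relax_until_converged(float *u, float *tmp, int n, float tol, int max_iter)
{
    int iter = 0;
    float change = tol + 1.0f;

    /* The data region encloses the whole convergence loop: u is uploaded
     * once and downloaded once, and tmp lives only on the device,
     * instead of both being transferred around every kernel call. */
    #pragma acc data copy(u[0:n]) create(tmp[0:n])
    {
        while (change > tol && iter < max_iter) {
            change = 0.0f;

            #pragma acc parallel loop reduction(max:change)
            for (int i = 1; i < n - 1; ++i) {
                tmp[i] = 0.5f * (u[i - 1] + u[i + 1]);
                float d = tmp[i] - u[i];
                if (d < 0.0f) d = -d;
                change = (d > change) ? d : change;
            }

            #pragma acc parallel loop
            for (int i = 1; i < n - 1; ++i) {
                u[i] = tmp[i];
            }
            ++iter;
        }
    }
    return iter;
}
```

The only value the CPU reads inside the loop is the scalar `change`, so the arrays themselves never need to travel back until the loop ends. This is exactly the property the tool asks the user to confirm, since the compiler cannot easily prove it.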
C2PO TUNING PHASE
• Microbenchmarking mechanism
  ‒ Loops, functions and user-annotated regions are extracted into kernels
  ‒ Apply optimizations
  ‒ Replay kernels with the original data set without running the whole application
  ‒ Once tuned, inject the kernels back into the application source code
• Apply performance analyzers from third-party tools (VTune, CUDA profiler)
  ‒ Synthesizes raw metrics (hardware counters) linked to the source code
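The replay idea can be sketched as a tiny harness that restores a saved input snapshot, runs one extracted kernel in isolation, and times it. This C sketch is purely illustrative: `kernel_fn`, `replay_kernel` and `square_kernel` are hypothetical names, not part of C2PO.

```c
#include <string.h>
#include <time.h>

/* Hypothetical signature for an extracted kernel. */
typedef void (*kernel_fn)(float *data, int n);

/* Replays a kernel several times on a fresh copy of the saved input
 * and returns the best processor time observed, in seconds. */
double replay_kernel(kernel_fn kernel, const float *snapshot,
                     float *work, int n, int repeats)
{
    double best = -1.0;
    for (int r = 0; r < repeats; ++r) {
        /* Restore the original data set before every run. */
        memcpy(work, snapshot, (size_t)n * sizeof(float));
        clock_t start = clock();
        kernel(work, n);
        double elapsed = (double)(clock() - start) / CLOCKS_PER_SEC;
        if (best < 0.0 || elapsed < best) {
            best = elapsed;
        }
    }
    return best;
}

/* Stand-in for a kernel extracted from the application. */
void square_kernel(float *data, int n)
{
    for (int i = 0; i < n; ++i) {
        data[i] = data[i] * data[i];
    }
}
```

Because the snapshot is never modified, each tuned variant of the kernel can be measured against identical inputs without rerunning the whole application.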
C2PO OBJECTIVES AND BENEFITS
• Keep one single OpenMP code for various parallel many-core systems (GPUs, APUs, MIC)
• Incrementally port and optimize codes in a modular way
• Use an interactive compiler: advice from dynamic and static analyses at source code level
THANK YOU FOR YOUR ATTENTION!
Jean-Charles Vasnier
Sales Engineer, CAPS entreprise
Phone: +1-865-227-6899
Email: jvasnier@caps-entreprise.com
GET PERFORMANCE IN NO TIME!
Measured on a dual Sandy Bridge E5-2687W with 32 GB RAM and a Kepler K20C driven by CUDA v5.0

Execution time (seconds)
                       Hydro     Nbody
Original (OpenMP)      45.698    63.42
Generated (auto)       27.539    12.71
Generated (tweaked)    23.417    12.55

x2 speed-up (after user's tuning)
x6 speed-up in 3 clicks (fully automatic)