TRANSCRIPT
This project and the research leading to these results has received funding from the European Community's Seventh Framework Programme [FP7/2007-2013] under grant agreement n° 288777.
http://www.montblanc-project.eu
Jean-François Méhaut
Outline: The New Killer Processors • Overview of the Mont-Blanc projects • BOAST DSL for computing kernels
Corse: Compiler Optimization and Run-time SystEms ∗
Fabrice Rastello
∗Inria Joint Project Team (proposal)
June 9, 2015
Fabrice Rastello (Inria) Corse June 9, 2015 1 / 26
Project-team composition / Institutional context
Joint Project-Team (Inria, Grenoble INP, UJF) in the LIG laboratory @ Giant/Minatec
Fabrice Rastello, Florent Bouchez Tichadou, François Broquedis, Frédéric Desprez, Yliès Falcone, Jean-François Méhaut
8 PhD students, 3 post-docs, 1 engineer
Permanent member curriculum vitae
Florent Bouchez Tichadou, MdC UJF (PhD Lyon 2009, 1Y Bangalore, 3Y Kalray, Nanosim): compiler optimization, compiler back-end
François Broquedis, MdC INP (PhD Bordeaux 2010, 1Y Mescal, 3Y Moais): runtime systems, OpenMP, memory management
Frédéric Desprez, DR1 Inria (Graal, Avalon): parallel algorithms, numerical libraries
Yliès Falcone, MdC UJF (PhD Grenoble 2009, 2Y Rennes, Vasco): validation, enforcement, debugging, runtime
Jean-François Méhaut, Pr UJF (Mescal, Nanosim): runtime, debugging, memory management, scientific applications
Fabrice Rastello, CR1 Inria (PhD Lyon 2000, 2Y STMicro, Compsys, GCG): compiler optimization, graph theory, compiler back-end, automatic parallelization
Overall Objectives
Domain: compiler optimization and runtime systems for performance and energy consumption (not reliability, nor WCET)
Issues: scalability and heterogeneity/complexity, i.e. a trade-off between specific optimizations and programmability/portability
Target architectures: VLIW / SIMD / embedded / many-cores / heterogeneous
Applications: dynamic systems / loop nests / graph algorithms / signal processing
Approach: combine static/dynamic and compiler/run-time techniques
First, vector processors dominated HPC
• 1st Top500 list (June 1993) dominated by DLP architectures • Cray vector, 41% • MasPar SIMD, 11% • Convex/HP vector, 5%
• Fujitsu Wind Tunnel is #1 1993-1996, with 170 GFLOPS
Then, commodity took over special purpose
• ASCI Red, Sandia • 1997, 1 TFLOPS • 9,298 cores @ 200 MHz • Intel Pentium Pro
• Upgraded to Pentium II Xeon, 1999, 3.1 TFLOPS
• ASCI White, LLNL • 2001, 7.3 TFLOPS • 8,192 proc. @ 375 MHz • IBM Power 3
Transition from Vector parallelism to Message-Passing Programming Models
Commodity components drive HPC
• RISC processors replaced vectors • x86 processors replaced RISC
• Vector processors survive as (widening) SIMD extensions
The killer microprocessors
• Microprocessors killed the Vector supercomputers • They were not faster ... • ... but they were significantly cheaper and greener
• Need 10 microprocessors to achieve the performance of 1 Vector CPU • SIMD vs. MIMD programming paradigms
[Figure: peak MFLOPS per year, 1974-1999, log scale from 10 to 10,000: vector machines (Cray-1, Cray C90, NEC SX4, SX5) overtaken by microprocessors (Alpha EV4, EV5, Intel Pentium, IBM P2SC, HP PA8200)]
The killer mobile processors™
• Microprocessors killed the vector supercomputers • They were not faster ... • ... but they were significantly cheaper and greener
• History may be about to repeat itself ... • Mobile processors are not faster ... • ... but they are significantly cheaper
[Figure: peak MFLOPS per year, 1990-2015, log scale from 100 to 1,000,000: Alpha, Intel, AMD processors followed by NVIDIA Tegra and Samsung Exynos (4-core ARMv8 @ 1.5 GHz)]
Mobile SoC vs. server processor
• Performance: 5.2 GFLOPS (mobile SoC, CPU only) vs. 153 GFLOPS (server), ~x30 gap
• Cost: $21 (1) vs. $1500 (2), ~x70 gap
• Counting the embedded GPU: 15.2 GFLOPS at $21 (?), only ~x10 slower and still ~x70 cheaper
1. Leaked Tegra 3 price from the Nexus 7 Bill of Materials
2. Non-discounted list price for the 8-core Intel E5 Sandy Bridge
SoC under study: CPU and Memory
NVIDIA Tegra 2: 2 x ARM Cortex-A9 @ 1 GHz, 1 x 32-bit DDR2-333 channel, 32 KB L1 + 1 MB L2
NVIDIA Tegra 3: 4 x ARM Cortex-A9 @ 1.3 GHz, 2 x 32-bit DDR3-750 channels, 32 KB L1 + 1 MB L2
Samsung Exynos 5 Dual: 2 x ARM Cortex-A15 @ 1.7 GHz, 2 x 32-bit DDR3-800 channels, 32 KB L1 + 1 MB L2
Intel Core i7-2760QM: 4 x Intel Sandy Bridge cores @ 2.4 GHz, 2 x 64-bit DDR3-800 channels, 32 KB L1 + 1 MB L2 + 6 MB L3
Evaluated kernels
Each kernel is implemented with pthreads, OpenMP, OmpSs, CUDA, and OpenCL.
vecop (Vector operation): common operation in numerical codes
dmmm (Dense matrix-matrix multiply): data reuse and compute performance
3dstc (3D volume stencil): strided memory accesses (7-point 3D stencil)
2dcon (2D convolution): spatial locality
fft (1D FFT transform): peak floating-point, variable stride accesses
red (Reduction operation): varying levels of parallelism
hist (Histogram calculation): local privatization and reduction stage
msort (Generic merge sort): barrier synchronization
nbody (N-body calculation): irregular memory accesses
amcd (Markov chain Monte-Carlo method): embarrassingly parallel
spvm (Sparse matrix-vector multiply): load imbalance
Single core performance and energy
• Tegra3 is 1.4x faster than Tegra2 • Higher clock frequency
• Exynos 5 is 1.7x faster than Tegra3 • Better frequency, memory bandwidth, and core microarchitecture
• Intel Core i7 is ~3x better than ARM Cortex-A15 at maximum frequency • ARM platforms more energy-efficient than Intel platform
Multicore performance and energy
• Tegra3 is as fast as Exynos 5, a bit more energy efficient • 4-core vs. 2-core
• ARM multicores as efficient as Intel at the same frequency • Intel still more energy efficient at highest performance
• ARM CPU is not the major power sink in the platform
Memory bandwidth (STREAM)
• Exynos 5 improves dramatically over Tegra (4.5x) • Dual-channel DDR3 • ARM Cortex-A15 sustains more in-flight cache misses
Tibidabo: The first ARM HPC multicore cluster
• Proof of concept • It is possible to deploy a cluster of smartphone processors
• Enable software stack development
Q7 carrier board: 2 x Cortex-A9, 2 GFLOPS, 1 GbE + 100 MbE, 7 W, 0.3 GFLOPS/W
Q7 Tegra 2 module: 2 x Cortex-A9 @ 1 GHz, 2 GFLOPS, 5 W (?), 0.4 GFLOPS/W
1U rackable blade: 8 nodes, 16 GFLOPS, 65 W, 0.25 GFLOPS/W
2 racks: 32 blade containers, 256 nodes, 512 cores, 9 x 48-port 1 GbE switches, 512 GFLOPS, 3.4 kW, 0.15 GFLOPS/W
HPC System software stack on ARM
[Diagram: HPC software stack. Source files (C, C++, FORTRAN, ...) go through the compilers (gcc, gfortran, OmpSs) to produce executables; the OmpSs runtime library (NANOS++) sits on top of CUDA, OpenCL, MPI and GASNet, running over Linux on CPU+GPU nodes; scientific libraries include FFTW, HDF5, ATLAS, ...]
• Open source system software stack
• Ubuntu Linux OS
• GNU compilers: gcc, g++, gfortran
• Scientific libraries: ATLAS, FFTW, HDF5, ...
• Slurm cluster management
• Runtime libraries: MPICH2, OpenMPI, OmpSs toolchain
• Performance analysis tools: Paraver, Scalasca
• Allinea DDT 3.1 debugger, ported to ARM
Parallel scalability
• HPC applications scale well on the Tegra2 cluster • Capable of exploiting enough nodes to compensate for lower node performance
SoC under study: Interconnection
NVIDIA Tegra 2: 1 GbE (PCIe), 100 Mbit (USB 2.0)
NVIDIA Tegra 3: 1 GbE (PCIe), 100 Mbit (USB 2.0)
Samsung Exynos 5 Dual: 1 GbE (USB 3.0), 100 Mbit (USB 2.0)
Intel Core i7-2760QM: 1 GbE (PCIe), QDR InfiniBand (PCIe)
Interconnection network: Latency
• TCP/IP adds a lot of CPU overhead • OpenMX driver interfaces directly to the Ethernet NIC • USB stack adds extra latency on top of network stack
Thanks to Gabor Dozsa and Chris Adeniyi-Jones for their OpenMX results
Interconnection network: Bandwidth
• TCP/IP overhead prevents Cortex-A9 CPU from achieving full bandwidth
• USB stack overheads prevent Exynos 5 from achieving full bandwidth, even on OpenMX
Thanks to Gabor Dozsa and Chris Adeniyi-Jones for their OpenMX results
Interconnect vs. Performance ratio
• Mobile SoCs have low-bandwidth interconnects ... • 1 GbE or USB 3.0 (6 Gb/s)
• ... but the ratio to performance is similar to high-end • 40 Gb/s InfiniBand

Peak interconnect bytes / FLOPS:
              1 Gb/s   6 Gb/s   40 Gb/s
Tegra2         0.06     0.38     2.50
Tegra3         0.02     0.14     0.96
Exynos 5250    0.02     0.11     0.74
Intel i7       0.00     0.01     0.07
Limitations of current mobile processors for HPC
• 32-bit memory controller • Even though ARM Cortex-A15 offers a 40-bit address space
• No ECC protection in memory • Limited scalability: errors will appear beyond a certain number of nodes
• No standard server I/O interfaces • Do NOT provide native Ethernet or PCI Express • Provide USB 3.0 and SATA (required for tablets)
• No network protocol off-load engine • TCP/IP, OpenMX, USB protocol stacks run on the CPU
• Thermal package not designed for sustained full-power operation
• All these are implementation decisions, not unsolvable problems • Only need a business case to justify the cost of including the new features ... such as the HPC and server markets
Server chips vs. mobile chips
Per-node figures, server chips vs. mobile chips:
Chip: Intel Sandy Bridge (E5-2670) | AppliedMicro X-Gene | Calxeda EnergyCore ("Midway") | TI Keystone II | Nvidia Tegra 4 | Samsung Exynos 5 Octa
#cores: 8 | 16-32 | 4 | 4 | 4 | 4+4
CPU: Sandy Bridge | custom ARMv8 | Cortex-A15 | Cortex-A15 | Cortex-A15 | Cortex-A15 + Cortex-A7
Technology: 32nm | 40nm | 28nm | 28nm | 28nm
Clock speed: 2.6GHz | 3GHz | 2GHz | 1.9GHz | 1.8GHz
Memory size: 750GB | ? | 4GB | 4GB | 4GB | 4GB
Memory bandwidth: 51.2GB/s | 80GB/s | 12.8GB/s | 12.8GB/s | 12.8GB/s
ECC in DRAM: Yes | Yes | Yes | Yes | No | No
I/O bandwidth: 80GB/s | ? | 4 x 10Gb/s | 10Gb/s | 6Gb/s* | 6Gb/s*
I/O interface: PCIe | integrated | integrated | integrated | USB 3.0 | USB 3.0
Protocol offload (in the NIC): Yes | Yes | Yes | No | No
Conclusions
• Mobile processors have qualities that make them interesting for HPC • FP64 capability • Performance increasing rapidly • Large market, many providers, competition, low cost • Embedded GPU accelerator
• Current limitations due to target market conditions
• Not real technical challenges
• A whole set of ARM server chips is coming • Solving most of the limitations identified
• Get ready for the change, before it happens …
Low-Power High Performance Computing
● Industrial collaboration with Kalray (http://www.kalray.eu)
● French fabless semiconductor and software company founded in 2008
● Located in France (Grenoble, Orsay), USA (California), Japan (Tokyo)
● Company developing and selling a new generation of manycore processors
● MPPA-256 ● Multi-Purpose Processor Array (MPPA) ● Manycore processor: 256 cores on a single chip ● Low power consumption [5 W - 11 W]
Kalray MPPA-256 architecture
● 256 cores (PEs) @ 400 MHz : 16 clusters, 16 PEs per cluster
● PEs share 2MB of memory
● Absence of cache coherence protocol inside the cluster
● Network-on-Chip (NoC) : communication between clusters
● 4 I/O subsystems : 2 connected to external memory
Seismic Wave Propagation (Ondes3D, BRGM)
● Simulation composed of time steps
● In each time step (3D simulation): ● The first triple nested loop computes the velocity components ● The second loop reuses the velocity results of the previous time step to update the stress field
● 4th-order stencil
Overview of Parallel Execution on MPPA-256
● Two-level tiling scheme to exploit the memory hierarchy of MPPA-256
Introduction Case Studies A Parametrized Generator Evaluation Conclusions and Future Work
The Mont-Blanc European Projects
Mont-Blanc 1 (2011-2015), Mont-Blanc 2 (2013-2016):
Develop prototypes of HPC clusters using low power commercially available embedded technology (ARM CPUs, low power GPUs, ...).
Design the next generation of HPC systems based on embedded technologies and experiments on the prototypes.
Develop a portfolio of existing applications to test these systems and optimize their efficiency, using BSC's OmpSs programming model (11 existing applications were selected for this portfolio).
Build the software stack (OS, runtime, performance tools, ...).
Prototype: based on the Exynos 5250, an ARM dual core Cortex-A15 with a Mali T604 GPU (OpenCL).
7 / 32BOAST
BigDFT: a Tool for Nanotechnologies
Ab initio simulation:
Simulates the properties of crystals and molecules,
Computes the electronic density, based on Daubechies wavelets.
This formalism was chosen because it is fit for HPC computations:
Each orbital can be treated independently most of the time,
Operators on orbitals are simple and straightforward.
Mainly developed in Europe:
CEA-DSM/INAC (Grenoble), Basel, Louvain-la-Neuve, ...
[Figure: electronic density around a methane molecule]
BigDFT as an HPC application
Implementation details :
200,000 lines of Fortran 90 and C
Supports MPI, OpenMP, CUDA and OpenCL
Uses BLAS
Scalability up to 16,000 cores of Curie and 288 GPUs
Operators can be expressed as 3D convolutions :
Wavelet Transform
Potential Energy
Kinetic Energy
These convolutions are separable and the filters are short (16 elements). They can take up to 90% of the computation time on some systems.
SPECFEM3D: a tool for wave propagation research
Wave propagation simulation:
Used for geophysics and material research,
Accurately simulates earthquakes,
Based on the spectral finite element method.
Developed all around the world:
France (CNRS Marseille),
Switzerland (ETH Zurich): CUDA,
United States (Princeton): networking,
Grenoble (LIG/CNRS): OpenCL.
[Figure: simulation of the Sichuan earthquake]
SPECFEM3D as an HPC application
Implementation details :
80,000 lines of Fortran 90
Supports MPI, CUDA, OpenCL and an OmpSs + MPI miniapp
Scalability up to 693,600 cores on Blue Waters
Case Study 1 : BigDFT’s MagicFilter
The simplest convolution found in BigDFT; it corresponds to the potential operator.
Characteristics:
Separable,
Filter length 16,
Transposition,
Periodic,
Only 32 operations per element.
Pseudo code

double filt[16] = {F0, F1, ..., F15};
void magicfilter(int n, int ndat,
                 double *in, double *out) {
  double temp;
  for (j = 0; j < ndat; j++) {
    for (i = 0; i < n; i++) {
      temp = 0;
      for (k = 0; k < 16; k++) {
        temp += in[((i - 7 + k) % n) + j * n]
                * filt[k];
      }
      out[j + i * ndat] = temp;
    }
  }
}
Case study 2: SPECFEM3D port to OpenCL

Existing CUDA code:
42 kernels and 15,000 lines of code
kernels with 80+ parameters
~7,500 lines of CUDA code
~7,500 lines of wrapper code

Objectives:
Factorize the existing code,
Single OpenCL and CUDA description for the kernels,
Validate without unit tests, by comparing native CUDA to generated CUDA executions,
Keep similar performance.
A Parametrized Generator
Classical Software Development Loop
[Diagram: Developer writes Source Code; Compilation produces a Binary; Performance Analysis yields Performance data; Optimization feeds back into the Source Code]
Kernel optimization workflow
Usually performed by a knowledgeable developer
Classical Software Development Loop
[Diagram: same loop; Compilation performed by Gcc, Mercurium, or OpenCL]
Compilers perform optimizations
Architecture specific or generic optimizations
Classical Software Development Loop
[Diagram: same loop; Performance Analysis via MAQAO, HW counters, proprietary tools]
Performance data hint at source transformations
Architecture specific or generic hints
Classical Software Development Loop
[Diagram: same loop; Optimization performed by the developer]
Multiplication of kernel versions, or loss of versions
Difficulty to benchmark versions against each other
BOAST Development Loop
[Diagram: the Developer writes Generative Source Code; a Transformation step produces Source Code, which goes through Compilation, Performance Analysis, and Optimization as before]
Meta-programming of optimizations in BOAST
High level object oriented language
BOAST Development Loop
[Diagram: same loop; the Transformation step is performed by BOAST]
Generate combinations of optimizations
C, OpenCL, FORTRAN and CUDA are supported
BOAST Development Loop
[Diagram: same loop; Compilation via Gcc, Mercurium, OpenCL; Performance Analysis via MAQAO, HW counters, proprietary tools]
Compilation and analysis are automated
Selection of the best version can also be automated
BOAST
[Diagram: BOAST workflow. (1) An application kernel (SPECFEM3D, BigDFT, ...) is written in the BOAST DSL; (2) BOAST code generation selects the target language and optimizations, producing a C, Fortran, OpenCL, CUDA, or C-with-vector-intrinsics kernel; (3) the kernel is compiled (gcc, OpenCL) with selected compiler and options into a binary kernel; (4) the BOAST runtime runs it on selected input data and gathers performance measurements against selected metrics; (5) the best performing version is kept. Optimization space pruners (ASK, Collective Mind) and binary analysis tools like MAQAO plug into the loop.]
Use Case Driven
Parameters arising in a convolution:
Filter: length, values, center.
Direction: forward or inverse convolution.
Boundary conditions: free or periodic.
Unroll factor: arbitrary.
How do those parameters constrain our tool?
Features required
Unroll factor :
Create and manipulate an unknown number of variables,
Create loops with variable steps.
Boundary conditions :
Manage arrays with parametrized size.
Filter and convolution direction :
Transform arrays.
And of course be able to describe convolutions and output them in different languages.
Proposed Generator
Idea: use a high level language with support for operator overloading to describe the structure of the code, rather than trying to transform a decorated tree.
Define several abstractions:
Variables: type (array, float, integer), size, ...
Operators: assignment, multiplication, ...
Procedures and functions: parameters, variables, ...
Constructs: for, while, ...
Sample Code : Variables and Parameters
1 #simple Variable2 i = Int "i"3 #simple constant4 lowfil = Int( "lowfil", :const => 1-center )5 #simple constant array6 fil = Real("fil", :const => arr , :dim => [ Dim(lowfil ,upfil) ])7 #simple parameter8 ndat = Int("ndat", :dir => :in)9 #multidimensional array , an output parameter
10 y = Real("y", :dir => :out , :dim => [ Dim(ndat), Dim(dim_out_min , dim_out_max) ] )
Variables and Parameters are objects with a name, a type, and aset of named properties.
Sample Code : Procedure Declaration
The following declaration :1 p = Procedure("magic_filter", [n,ndat ,x,y], [lowfil ,upfil])2 open p
Outputs Fortran :1 subroutine magicfilter(n, ndat , x, y)2 integer(kind=4), parameter :: lowfil = -83 integer(kind=4), parameter :: upfil = 74 integer(kind=4), intent(in) :: n5 integer(kind=4), intent(in) :: ndat6 real(kind=8), intent(in), dimension (0:n-1, ndat) :: x7 real(kind=8), intent(out), dimension(ndat , 0:n-1) :: y
Or C :1 void magicfilter(const int32_t n, const int32_t ndat , const double * x, double * y){2 const int32_t lowfil = -8;3 const int32_t upfil = 7;
Sample Code : Constructs and Arrays
The following declaration :1 unroll = 52 pr For(j,1,ndat -(unroll -1), unroll) {3 #.....4 pr tt2 === tt2 + x[k,j+1]* fil[l]5 #.....6 }
Outputs Fortran :1 do j=1, ndat -4, 52 !......3 tt2=tt2+x(k,j+1)* fil(l)4 !......5 enddo
Or C :1 for(j=1; j<=ndat -4; j+=5){2 /* ........... */3 tt2=tt2+x[k-0+(j+1 -1)*(n-1 -0+1)]* fil[l-lowfil ];4 /* ........... */5 }
Generator Evaluation
Back to the test cases :
The generator was used to unroll the MagicFilter and evaluate its performance on an ARM processor and an Intel processor.
The generator was used to describe SPECFEM3D kernels.
Performance Results
[Figures: MagicFilter performance results on Tegra 2 and Intel T7500]
BigDFT Synthesis Kernel
Improvement for BigDFT
Most of the convolutions have been ported to BOAST.
Results are encouraging: on the hardware BigDFT was hand-optimized for, convolutions gained on average between 30 and 40% in performance.
MagicFilter OpenCL versions tailored by BOAST for the problem size gain 10 to 20% in performance.
SPECFEM3D OpenCL port
Fully ported to OpenCL with comparable performance (using the global_s362ani_small test case):
On a 2*6-core (E5-2630) machine with 2 K40s, using 12 MPI processes:
OpenCL: 4m15s, CUDA: 3m10s
On a 2*4-core (E5620) machine with a K20, using 6 MPI processes:
OpenCL: 12m47s, CUDA: 11m23s
The difference comes from CUDA's ability to specify the minimum number of blocks to launch on a multiprocessor. Less than 4,000 lines of BOAST code (7,500 lines of CUDA originally).
Conclusions and Future Work
Conclusions
The generator has been used to test several loop unrolling strategies in BigDFT.
Highlights:
Several output languages.
All constraints have been met.
The automatic benchmarking framework allows us to test several optimization levels and compilers.
Automatic non-regression testing.
Several algorithmically different versions can be generated (changing the filter, boundary conditions, ...).
Future Work and Considerations

Future work:
Produce an autotuning convolution library.
Implement a parametric space explorer or use an existing one (ASK: Adaptive Sampling Kit, Collective Mind, ...).
Vector code is supported, but needs improvements.
Test the OpenCL version of SPECFEM3D on the Mont-Blanc prototype.

Questions raised:
Is this approach extensible enough?
Can we improve the language used further?