(costless) software abstractions for parallel architectures
DESCRIPTION
Performing large, intensive or non-trivial computing on array like data structures is one of the most common task in scientific computing, video game development and other fields. This matter of fact is backed up by the large number of tools, languages and libraries to perform such tasks. If we restrict ourselves to C++ based solutions, more than a dozen such libraries exists from BLAS/LAPACK C++ binding to template meta-programming based Blitz++ or Eigen. If all of these libraries provide good performance or good abstraction, none of them seems to fit the need of so many different user types. Moreover, as parallel system complexity grows, the need to maintain all those components quickly become unwieldy. This talk explores various software design techniques - like Generative Programming, MetaProgramming and Generic Programming - and their application to the implementation of a parallel computing librariy in such a way that: - abstraction and expressiveness are maximized - cost over efficiency is minimized We'll skim over various applications and see how they can benefit from such tools. We will conclude by discussing what lessons were learnt from this kind of implementation and how those lessons can translate into new directions for the language itself.TRANSCRIPT
(Costless) Software Abstractions for Parallel Architectures
numscaleUnlocked software performance
Joel Falcou
NumScale SAS – LRI - INRIA
CppCon – 01/07/2014
numscaleUnlocked software performance
Context
Decades of hardware improvements
� Scientic Computing now drives most hardware innovations� Current Solution: Parallel architectures� Machines become more and more complex
Example : A simple laptop
� CPU: Intel Core i5-2410M (2.3 GHz) : 4 logical cores, AVX� 4 logical cores� SIMD Extensions: SSE2-SSE4.2, AVX
� GPU: NVIDIA GeForce GT 520M (48 CUDA cores)
2 of 40
numscaleUnlocked software performance
Context
Decades of hardware improvements
� Scientic Computing now drives most hardware innovations� Current Solution: Parallel architectures� Machines become more and more complex
Example : A simple laptop
� CPU: Intel Core i5-2410M (2.3 GHz) : 4 logical cores, AVX� 4 logical cores� SIMD Extensions: SSE2-SSE4.2, AVX
� GPU: NVIDIA GeForce GT 520M (48 CUDA cores)
2 of 40
numscaleUnlocked software performance
The Real Challenge of HPC
Single Core Era
Performance
Expressiveness
C/Fort.
C++
Java
Multi-Core/SIMD Era
Performance
Expressiveness
Sequential
Threads
SIMD
Heterogenous Era
Performance
Expressiveness
Sequential
SIMD
Threads
GPUPhi
Distributed
3 of 40
numscaleUnlocked software performance
Designing tools for Scientic Computing
Challenges
1. Be non-disruptive
2. Domain driven optimizations
3. Provide intuitive API for the user
4. Support a wide architectural landscape
5. Be efficient
Our Approach
� Design tools as C++ libraries (1)� Design these libraries as Domain Specic Embedded Language (DSEL) (2+3)� Use Parallel Skeletons as parallel components (4)� Use Generative Programming to deliver performance (5)
4 of 40
numscaleUnlocked software performance
Designing tools for Scientic Computing
Challenges
1. Be non-disruptive
2. Domain driven optimizations
3. Provide intuitive API for the user
4. Support a wide architectural landscape
5. Be efficient
Our Approach
� Design tools as C++ libraries (1)� Design these libraries as Domain Specic Embedded Language (DSEL) (2+3)� Use Parallel Skeletons as parallel components (4)� Use Generative Programming to deliver performance (5)
4 of 40
numscaleUnlocked software performance
Designing tools for Scientic Computing
Challenges
1. Be non-disruptive
2. Domain driven optimizations
3. Provide intuitive API for the user
4. Support a wide architectural landscape
5. Be efficient
Our Approach
� Design tools as C++ libraries (1)� Design these libraries as Domain Specic Embedded Language (DSEL) (2+3)� Use Parallel Skeletons as parallel components (4)� Use Generative Programming to deliver performance (5)
4 of 40
numscaleUnlocked software performance
Designing tools for Scientic Computing
Challenges
1. Be non-disruptive
2. Domain driven optimizations
3. Provide intuitive API for the user
4. Support a wide architectural landscape
5. Be efficient
Our Approach
� Design tools as C++ libraries (1)� Design these libraries as Domain Specic Embedded Language (DSEL) (2+3)� Use Parallel Skeletons as parallel components (4)� Use Generative Programming to deliver performance (5)
4 of 40
numscaleUnlocked software performance
Designing tools for Scientic Computing
Challenges
1. Be non-disruptive
2. Domain driven optimizations
3. Provide intuitive API for the user
4. Support a wide architectural landscape
5. Be efficient
Our Approach
� Design tools as C++ libraries (1)� Design these libraries as Domain Specic Embedded Language (DSEL) (2+3)� Use Parallel Skeletons as parallel components (4)� Use Generative Programming to deliver performance (5)
4 of 40
numscaleUnlocked software performance
Designing tools for Scientic Computing
Challenges
1. Be non-disruptive
2. Domain driven optimizations
3. Provide intuitive API for the user
4. Support a wide architectural landscape
5. Be efficient
Our Approach
� Design tools as C++ libraries (1)� Design these libraries as Domain Specic Embedded Language (DSEL) (2+3)� Use Parallel Skeletons as parallel components (4)� Use Generative Programming to deliver performance (5)
4 of 40
numscaleUnlocked software performance
Designing tools for Scientic Computing
Challenges
1. Be non-disruptive
2. Domain driven optimizations
3. Provide intuitive API for the user
4. Support a wide architectural landscape
5. Be efficient
Our Approach
� Design tools as C++ libraries (1)� Design these libraries as Domain Specic Embedded Language (DSEL) (2+3)� Use Parallel Skeletons as parallel components (4)� Use Generative Programming to deliver performance (5)
4 of 40
numscaleUnlocked software performance
Designing tools for Scientic Computing
Challenges
1. Be non-disruptive
2. Domain driven optimizations
3. Provide intuitive API for the user
4. Support a wide architectural landscape
5. Be efficient
Our Approach
� Design tools as C++ libraries (1)
� Design these libraries as Domain Specic Embedded Language (DSEL) (2+3)� Use Parallel Skeletons as parallel components (4)� Use Generative Programming to deliver performance (5)
4 of 40
numscaleUnlocked software performance
Designing tools for Scientic Computing
Challenges
1. Be non-disruptive
2. Domain driven optimizations
3. Provide intuitive API for the user
4. Support a wide architectural landscape
5. Be efficient
Our Approach
� Design tools as C++ libraries (1)� Design these libraries as Domain Specic Embedded Language (DSEL) (2+3)
� Use Parallel Skeletons as parallel components (4)� Use Generative Programming to deliver performance (5)
4 of 40
numscaleUnlocked software performance
Designing tools for Scientic Computing
Challenges
1. Be non-disruptive
2. Domain driven optimizations
3. Provide intuitive API for the user
4. Support a wide architectural landscape
5. Be efficient
Our Approach
� Design tools as C++ libraries (1)� Design these libraries as Domain Specic Embedded Language (DSEL) (2+3)� Use Parallel Skeletons as parallel components (4)
� Use Generative Programming to deliver performance (5)
4 of 40
numscaleUnlocked software performance
Designing tools for Scientic Computing
Challenges
1. Be non-disruptive
2. Domain driven optimizations
3. Provide intuitive API for the user
4. Support a wide architectural landscape
5. Be efficient
Our Approach
� Design tools as C++ libraries (1)� Design these libraries as Domain Specic Embedded Language (DSEL) (2+3)� Use Parallel Skeletons as parallel components (4)� Use Generative Programming to deliver performance (5)
4 of 40
numscaleUnlocked software performance
Parallel Programming Ain’t Easy
5 of 40
numscaleUnlocked software performance
Spotting abstraction when you see one
Why Parallel Programming Models ?
� Unstructured parallelism is error-prone� Low level parallel tools are non-composable
Available Models
� Performance centric: P-RAM, LOG-P, BSP� Data centric: HTA, PGAS� Pattern centric: Actors, Skeletons
6 of 40
numscaleUnlocked software performance
Spotting abstraction when you see one
Why Parallel Programming Models ?
� Unstructured parallelism is error-prone� Low level parallel tools are non-composable
Available Models
� Performance centric: P-RAM, LOG-P, BSP� Data centric: HTA, PGAS� Pattern centric: Actors, Skeletons
6 of 40
numscaleUnlocked software performance
Spotting abstraction when you see one
Why Parallel Programming Models ?
� Unstructured parallelism is error-prone� Low level parallel tools are non-composable
Available Models
� Performance centric: P-RAM, LOG-P, BSP� Data centric: HTA, PGAS� Pattern centric: Actors, Skeletons
6 of 40
numscaleUnlocked software performance
Parallel Skeletons in a nutshell
Basic Principles [COLE 89]
� There are patterns in parallel applications� Those patterns can be generalized in Skeletons� Applications are assembled as combination of such patterns
Functionnal point of view
� Skeletons are Higher-Order Functions� Skeletons support a compositionnal semantic� Applications become composition of state-less functions
7 of 40
numscaleUnlocked software performance
Parallel Skeletons in a nutshell
Basic Principles [COLE 89]
� There are patterns in parallel applications� Those patterns can be generalized in Skeletons� Applications are assembled as combination of such patterns
Functionnal point of view
� Skeletons are Higher-Order Functions� Skeletons support a compositionnal semantic� Applications become composition of state-less functions
7 of 40
numscaleUnlocked software performance
Classic Parallel Skeletons
Data Parallel Skeletons
� map: Apply a n-ary function in SIMD mode over subset of data� fold: Perform n-ary reduction over subset of data� scan: Perform n-ary prex reduction over subset of data
Task Parallel Skeletons
� par: Independant task execution� pipe: Task dependency over time� farm: Load-balancing
8 of 40
numscaleUnlocked software performance
Classic Parallel Skeletons
Data Parallel Skeletons
� map: Apply a n-ary function in SIMD mode over subset of data� fold: Perform n-ary reduction over subset of data� scan: Perform n-ary prex reduction over subset of data
Task Parallel Skeletons
� par: Independant task execution� pipe: Task dependency over time� farm: Load-balancing
8 of 40
numscaleUnlocked software performance
Why using Parallel Skeletons
Software Abstraction
� Write without bothering with parallel details� Code is scalable and easy to maintain� Debuggable, Provable, Certiable
Hardware Abstraction
� Semantic is set, implementation is free� Composability ⇒ Hierarchical architecture
9 of 40
numscaleUnlocked software performance
Why using Parallel Skeletons
Software Abstraction
� Write without bothering with parallel details� Code is scalable and easy to maintain� Debuggable, Provable, Certiable
Hardware Abstraction
� Semantic is set, implementation is free� Composability ⇒ Hierarchical architecture
9 of 40
numscaleUnlocked software performance
Generative Programming
Domain SpecificApplication Description
Generative Component Concrete Application
Translator
Parametric Sub-components
10 of 40
numscaleUnlocked software performance
Generative Programming as a Tool
Available techniques
� Dedicated compilers� External pre-processing tools� Languages supporting meta-programming
Denition of Meta-programmingMeta-programming is the writing of computer programs that analyse, transform andgenerate other programs (or themselves) as their data.
C++ meta-programming
� Relies on the C++ sub-language� Handles types and integral constants at compile-time� Proved to be Turing-complete
11 of 40
numscaleUnlocked software performance
Generative Programming as a Tool
Available techniques
� Dedicated compilers� External pre-processing tools� Languages supporting meta-programming
Denition of Meta-programmingMeta-programming is the writing of computer programs that analyse, transform andgenerate other programs (or themselves) as their data.
C++ meta-programming
� Relies on the C++ sub-language� Handles types and integral constants at compile-time� Proved to be Turing-complete
11 of 40
numscaleUnlocked software performance
Generative Programming as a Tool
Available techniques
� Dedicated compilers� External pre-processing tools� Languages supporting meta-programming
Denition of Meta-programmingMeta-programming is the writing of computer programs that analyse, transform andgenerate other programs (or themselves) as their data.
C++ meta-programming
� Relies on the C++ sub-language� Handles types and integral constants at compile-time� Proved to be Turing-complete
11 of 40
numscaleUnlocked software performance
Generative Programming as a Tool
Available techniques
� Dedicated compilers� External pre-processing tools� Languages supporting meta-programming
Denition of Meta-programmingMeta-programming is the writing of computer programs that analyse, transform andgenerate other programs (or themselves) as their data.
C++ meta-programming
� Relies on the C++ sub-language� Handles types and integral constants at compile-time� Proved to be Turing-complete
11 of 40
numscaleUnlocked software performance
Domain Specic Embedded Languages
What’s an DSEL ?� DSL = Domain Specic Language� Declarative language, easy-to-use, tting the domain� DSEL = DSL within a general purpose language
EDSL in C++� Relies on operator overload abuse (Expression Templates)� Carry semantic information around code fragment� Generic implementation become self-aware of optimizations
Exploiting static AST
� At the expression level: code generation� At the function level: inter-procedural optimization
12 of 40
numscaleUnlocked software performance
Embedded Domain Specic Languages
EDSL in C++� Relies on operator overload abuse – see B.P� Carry semantic information around code fragment� Generic implementation become self-aware of optimizations
Advantages
� Allow introduction of DSLs without disrupting dev. chain� Semantic dened as type informations means compile-time resolution� Access to a large selection of runtime binding
13 of 40
numscaleUnlocked software performance
Expression Templates in A Nutshell
14 of 40
numscaleUnlocked software performance
Expression Templates
matrix x(h,w),a(h,w),b(h,w);
x = cos(a) + (b*a);
expr<assign ,expr<matrix&> ,expr<plus , expr<cos ,expr<matrix&> > , expr<multiplies ,expr<matrix&> ,expr<matrix&> > >(x,a,b);
+
*cos
a ab
=
x
#pragma omp parallel forfor(int j=0;j<h;++j){ for(int i=0;i<w;++i) { x(j,i) = cos(a(j,i)) + ( b(j,i) * a(j,i) ); }}
Arbitrary Transforms appliedon the meta-AST
15 of 40
numscaleUnlocked software performance
Architecture Aware Generative Programming
16 of 40
numscaleUnlocked software performance
Parallel DSEL in practice
Objectives
� Apply DSEL generation techniques for different kind of hardware� Demonstrate low cost of abstractions� Demonstrate applicability of skeletons
Our contribution
� BSP++ : Generic C++ BSP for shared/distributed memory� Quaff: DSEL for skeleton programming� B.SIMD: DSEL for portable SIMD programming� NT2: M like DSEL for scientic computing
17 of 40
numscaleUnlocked software performance
Parallel DSEL in practice
Objectives
� Apply DSEL generation techniques for different kind of hardware� Demonstrate low cost of abstractions� Demonstrate applicability of skeletons
Our contribution
� BSP++ : Generic C++ BSP for shared/distributed memory� Quaff: DSEL for skeleton programming� B.SIMD: DSEL for portable SIMD programming� NT2: M like DSEL for scientic computing
17 of 40
numscaleUnlocked software performance
Parallel DSEL in practice
Objectives
� Apply DSEL generation techniques for different kind of hardware� Demonstrate low cost of abstractions� Demonstrate applicability of skeletons
Our contribution
� BSP++ : Generic C++ BSP for shared/distributed memory� Quaff: DSEL for skeleton programming� B.SIMD: DSEL for portable SIMD programming� NT2: M like DSEL for scientic computing
17 of 40
numscaleUnlocked software performance
NT2
A Scientic Computing Library
� Provide a simple, M-like interface for users� Provide high-performance computing entities and primitives� Easily extendable
Components
� Use Boost.SIMD for in-core optimizations� Use recursive parallel skeletons� Code is made independant of architecture and runtime
18 of 40
numscaleUnlocked software performance
The Numerical Template ToolboxComparison to other libraries
Feature Armadillo Blaze Eigen MTL uBlas NT2
M-like API X − − − − XBLAS/LAPACK binding X X X X X XMAGMA binding − − − − − XSSE2+ support X X X − − XAVX support X X − − − XAVX2 support − − − − − XXeon Phi support − − − − − XAltivec support − − X − − XARM support − − X − − XThreading support − − − − − XCUDA support − − − − − X
19 of 40
numscaleUnlocked software performance
The Numerical Template Toolbox
Principles
� table<T,S> is a simple, multidimensional array object that exactly mimicsM array behavior and functionalities
� 500+ functions usable directly either on table or on any scalar values as in M
How does it works
� Take a .m le, copy to a .cpp le� Add #include <nt2/nt2.hpp> and do cosmetic changes� Compile the le and link with libnt2.a
� ??????� PROFIT!
20 of 40
numscaleUnlocked software performance
The Numerical Template Toolbox
Principles
� table<T,S> is a simple, multidimensional array object that exactly mimicsM array behavior and functionalities
� 500+ functions usable directly either on table or on any scalar values as in M
How does it works� Take a .m le, copy to a .cpp le
� Add #include <nt2/nt2.hpp> and do cosmetic changes� Compile the le and link with libnt2.a
� ??????� PROFIT!
20 of 40
numscaleUnlocked software performance
The Numerical Template Toolbox
Principles
� table<T,S> is a simple, multidimensional array object that exactly mimicsM array behavior and functionalities
� 500+ functions usable directly either on table or on any scalar values as in M
How does it works� Take a .m le, copy to a .cpp le� Add #include <nt2/nt2.hpp> and do cosmetic changes
� Compile the le and link with libnt2.a
� ??????� PROFIT!
20 of 40
numscaleUnlocked software performance
The Numerical Template Toolbox
Principles
� table<T,S> is a simple, multidimensional array object that exactly mimicsM array behavior and functionalities
� 500+ functions usable directly either on table or on any scalar values as in M
How does it works� Take a .m le, copy to a .cpp le� Add #include <nt2/nt2.hpp> and do cosmetic changes� Compile the le and link with libnt2.a
� ??????� PROFIT!
20 of 40
numscaleUnlocked software performance
The Numerical Template Toolbox
Principles
� table<T,S> is a simple, multidimensional array object that exactly mimicsM array behavior and functionalities
� 500+ functions usable directly either on table or on any scalar values as in M
How does it works� Take a .m le, copy to a .cpp le� Add #include <nt2/nt2.hpp> and do cosmetic changes� Compile the le and link with libnt2.a
� ??????� PROFIT!
20 of 40
numscaleUnlocked software performance
NT2 - From M ...
A1 = 1:1000;A2 = A1 + randn(size(A1));
X = lu(A1*A1’);
rms = sqrt( sum(sqr(A1(:) - A2(:))) / numel(A1) );
21 of 40
numscaleUnlocked software performance
NT2 - ... to C++
table <double > A1 = _(1. ,1000.);table <double > A2 = A1 + randn(size(A1));
table <double > X = lu( mtimes(A1 , trans(A1) );
double rms = sqrt( sum(sqr(A1(_) - A2(_))) / numel(A1) );
22 of 40
numscaleUnlocked software performance
Parallel Skeletons extraction process
A = B / sum(C+D);
=
A /
B sum
+
C D
fold
transform
23 of 40
numscaleUnlocked software performance
Parallel Skeletons extraction process
A = B / sum(C+D);
; ;
=
A /
B sum
+
C D
fold
transform
=
tmp sum
+
C D
fold
⇒=
A /
B tmp
transform
24 of 40
numscaleUnlocked software performance
From data to task parallelism
Limits of the fork-join model
� Synchronization cost due to implicit barriers� Under-exploitation of potential parallelism� Poor data locality and no inter-statement optimization
From Skeletons to Actors
� Upgrade NT2 to enable task parallelism� Adapt current skeletons for taskication� Use Futures ( or HPX)to automatically create pipelines� Derive a dependency graph between statements
25 of 40
numscaleUnlocked software performance
From data to task parallelism
Limits of the fork-join model
� Synchronization cost due to implicit barriers� Under-exploitation of potential parallelism� Poor data locality and no inter-statement optimization
From Skeletons to Actors
� Upgrade NT2 to enable task parallelism� Adapt current skeletons for taskication� Use Futures ( or HPX)to automatically create pipelines� Derive a dependency graph between statements
25 of 40
numscaleUnlocked software performance
Parallel Skeletons extraction process - Take 2
A = B / sum(C+D);
; ;
=
tmp sum
+
C D
fold
=
A /
B tmp
transform
26 of 40
numscaleUnlocked software performance
Parallel Skeletons extraction process - Take 2
A = B / sum(C+D);
fold
=
tmp(3) sum(3)
+
C(:, 3) D(:, 3)transform
=
A(:, 3) /
B(:, 3) tmp(3)
fold
=
tmp(2) sum(2)
+
C(:, 2) D(:, 2)transform
=
A(:, 2) /
B(:, 2) tmp(2)
workerfold,simd
=
tmp(1) sum
+
C(:, 1) D(:, 1)workertransform,simd
=
A(:, 1) /
B(:, 1) tmp(1)
spawnertransform,OpenMP
spawnertransform,OpenMP
;
27 of 40
numscaleUnlocked software performance
Sigma-Delta Motion DetectionContext
� Mono-modal algorithm based on background substraction� Use local gaussian model of lightness variation to detect motion� Target applications: robotic, video survey and analytics, defence� Challenge: Very low arithmetic density� Challenge: Integer-based implementation with small range
28 of 40
numscaleUnlocked software performance
Motion DetectionNT2 Code
table <char > sigma_delta( table <char >& background, table <char > const& frame, table <char >& variance)
{// Estimate Raw Movementbackground = selinc( background < frame
, seldec(background > frame , background));
table <char > diff = dist(background , frame);
// Compute Local Variancetable <char > sig3 = muls(diff ,3);
var = if_else( diff != 0, selinc( variance < sig3
, seldec( var > sig3 , variance))
, variance);
// Generate Movement Labelreturn if_zero_else_one( diff < variance );
}
29 of 40
numscaleUnlocked software performance
Motion DetectionPerformance
0
2
4
6
8
10
12
14
16
18
512x512 1024x1024
cycle
s/e
lem
ent
Image Size (N x N)
x6.8
x14.8
x16.5
x2.1
x3.6
x6.7
x15.3
x18
x2.3
x3.9
9
x10.8
x10.8
SCALARHALF COREFULL CORE
SIMDJRTIP2008
SIMD + HALF CORESIMD + FULL CORE
30 of 40
numscaleUnlocked software performance
Black and Scholes Option Pricing
NT2 Codetable <float > blackscholes( table <float > const& Sa, table <float > const& Xa
, table <float > const& Ta, table <float > const& ra, table <float > const& va)
{table <float > da = sqrt(Ta);table <float > d1 = log(Sa/Xa) + (sqr(va)*0.5f+ra)*Ta/(va*da);table <float > d2 = d1-va*da;
return Sa*normcdf(d1)- Xa*exp(-ra*Ta)*normcdf(d2);}
31 of 40
numscaleUnlocked software performance
Black and Scholes Option Pricing
NT2 Code with loop fusiontable <float > blackscholes( table <float > const& Sa, table <float > const& Xa
, table <float > const& Ta, table <float > const& ra, table <float > const& va)
{// Preallocate temporary tablestable <float > da(extent(Ta)), d1(extent(Ta)), d2(extent(Ta)), R(extent(Ta));
// tie merge loop nest and increase cache localitytie(da,d1 ,d2,R) = tie( sqrt(Ta)
, log(Sa/Xa) + (sqr(va)*0.5f+ra)*Ta/(va*da), d1-va*da, Sa*normcdf(d1)- Xa*exp(-ra*Ta)*normcdf(d2));
return R;}
31 of 40
numscaleUnlocked software performance
Black and Scholes Option PricingPerformance
1000000
0
50
100
150
x1.8
9
x2.9
1
x5.5
8
x6.3
0
Size
cycle/value
scalar
SSE2
AVX2
SSE2, 4 cores
AVX2, 4 cores
32 of 40
numscaleUnlocked software performance
Black and Scholes Option PricingPerformance with loop fusion
1000000
0
50
100
150
x2.2
7
x4.1
3
x8.0
5
x11.
12
Size
cycle/value
scalar
SSE2
AVX2
SSE2, 4 cores
AVX2, 4 cores
33 of 40
numscaleUnlocked software performance
LU DecompositionAlgorithm
A00
A01 A02A10
A20 A11
A21
A12
A22 A11
A12A21
A22
A22
step 1
step 2
step 3
step 4
step 5
step 6
step 7
DGETRF
DGESSM
DTSTRF
DSSSSM
34 of 40
numscaleUnlocked software performance
LU DecompositionPerformance
0 10 20 30 40 50
0
50
100
Number of cores
Med
ian
GFL
OPS
8000× 8000 LU decomposition
NT2Intel MKL
35 of 40
numscaleUnlocked software performance
What we learn
Parallel Computing for Scientist
� Software Libraries built as Generic and Generative components can solve a largechunk of parallelism related problems while being easy to use.
� Like regular language, EDSL needs informations about the hardware system� Integrating hardware descriptions as Generic components increases tools portability
and re-targetability
New Directions� Toward a global generic approach to parallelism� Turning hacks into language features
36 of 40
numscaleUnlocked software performance
What we learn
Parallel Computing for Scientist
� Software Libraries built as Generic and Generative components can solve a largechunk of parallelism related problems while being easy to use.
� Like regular language, EDSL needs informations about the hardware system� Integrating hardware descriptions as Generic components increases tools portability
and re-targetability
New Directions� Toward a global generic approach to parallelism� Turning hacks into language features
36 of 40
numscaleUnlocked software performance
Generic Parallelism
Parallel C++ Concepts
� Expand function hierarchization to Concepts� e.g : DataParallel, AssociativeOperations, etc.� Use C++1y Concept overloading to split skeletons
Impact
� Less work for the Skeleton users� Extendable through renement� Static assertion of function properties
37 of 40
numscaleUnlocked software performance
Generic Parallelism
Parallel C++ Concepts
� Expand function hierarchization to Concepts� e.g : DataParallel, AssociativeOperations, etc.� Use C++1y Concept overloading to split skeletons
Impact
� Less work for the Skeleton users� Extendable through renement� Static assertion of function properties
37 of 40
numscaleUnlocked software performance
New C++ Language Features
My C++ Christmas Land
� Build lazy evaluation into the language� Interactions with generic function is cumbersome� SIMD as part of the standard at type level
Current Work� Can sizeof inspires an ast_of operator� Proposal N4035 for auto customization� Proposal N3571 for standard SIMD computation
38 of 40
numscaleUnlocked software performance
New C++ Language Features
My C++ Christmas Land
� Build lazy evaluation into the language� Interactions with generic function is cumbersome� SIMD as part of the standard at type level
Current Work� Can sizeof inspires an ast_of operator� Proposal N4035 for auto customization� Proposal N3571 for standard SIMD computation
38 of 40
numscaleUnlocked software performance
Perspectives
At tools level
� Prototype of single source GPU support� Work on distributed systems� Applications to Big Data
At language level
� Formalize meta-programming� DSEL verication transferance over C++� Interaction with polyhedral model
39 of 40
numscaleUnlocked software performance
Perspectives
At tools level
� Prototype of single source GPU support� Work on distributed systems� Applications to Big Data
At language level
� Formalize meta-programming� DSEL verication transferance over C++� Interaction with polyhedral model
39 of 40
Thanks for your attention