(Costless) Software Abstractions for Parallel Architectures

Joel Falcou – NumScale SAS – LRI – INRIA – CppCon – 01/07/2014


DESCRIPTION

Performing large, intensive or non-trivial computations on array-like data structures is one of the most common tasks in scientific computing, video game development and other fields. This is backed up by the large number of tools, languages and libraries built for such tasks. If we restrict ourselves to C++ based solutions, more than a dozen such libraries exist, from BLAS/LAPACK C++ bindings to template meta-programming based Blitz++ or Eigen. While all of these libraries provide good performance or good abstraction, none of them seems to fit the needs of so many different user types. Moreover, as parallel system complexity grows, maintaining all those components quickly becomes unwieldy. This talk explores various software design techniques - like Generative Programming, Meta-Programming and Generic Programming - and their application to the implementation of a parallel computing library in such a way that: - abstraction and expressiveness are maximized - the cost over raw efficiency is minimized. We'll skim over various applications and see how they can benefit from such tools. We will conclude by discussing what lessons were learnt from this kind of implementation and how those lessons can translate into new directions for the language itself.

TRANSCRIPT

Page 1: (Costless) Software Abstractions for Parallel Architectures

(Costless) Software Abstractions for Parallel Architectures

NumScale – Unlocked software performance

Joel Falcou

NumScale SAS – LRI - INRIA

CppCon – 01/07/2014

Page 2: (Costless) Software Abstractions for Parallel Architectures

Context

Decades of hardware improvements

- Scientific Computing now drives most hardware innovations
- Current solution: parallel architectures
- Machines become more and more complex

Example: a simple laptop

- CPU: Intel Core i5-2410M (2.3 GHz), 4 logical cores
- SIMD extensions: SSE2-SSE4.2, AVX
- GPU: NVIDIA GeForce GT 520M (48 CUDA cores)

Page 4: (Costless) Software Abstractions for Parallel Architectures

The Real Challenge of HPC

Figure: three eras plotted on Performance vs Expressiveness axes.

- Single Core Era: C/Fortran, C++, Java
- Multi-Core/SIMD Era: Sequential, Threads, SIMD
- Heterogeneous Era: Sequential, SIMD, Threads, GPU, Phi, Distributed

Page 5: (Costless) Software Abstractions for Parallel Architectures

Designing tools for Scientific Computing

Challenges

1. Be non-disruptive
2. Domain-driven optimizations
3. Provide an intuitive API for the user
4. Support a wide architectural landscape
5. Be efficient

Our Approach

- Design tools as C++ libraries (1)
- Design these libraries as Domain Specific Embedded Languages (DSELs) (2+3)
- Use Parallel Skeletons as parallel components (4)
- Use Generative Programming to deliver performance (5)

Page 16: (Costless) Software Abstractions for Parallel Architectures

Parallel Programming Ain’t Easy

Page 17: (Costless) Software Abstractions for Parallel Architectures

Spotting abstraction when you see one

Why Parallel Programming Models?

- Unstructured parallelism is error-prone
- Low-level parallel tools are non-composable

Available Models

- Performance centric: P-RAM, LOG-P, BSP
- Data centric: HTA, PGAS
- Pattern centric: Actors, Skeletons

Page 20: (Costless) Software Abstractions for Parallel Architectures

Parallel Skeletons in a nutshell

Basic Principles [COLE 89]

- There are recurring patterns in parallel applications
- Those patterns can be generalized into Skeletons
- Applications are assembled as combinations of such patterns

Functional point of view

- Skeletons are Higher-Order Functions
- Skeletons support a compositional semantics
- Applications become compositions of stateless functions

Page 22: (Costless) Software Abstractions for Parallel Architectures

Classic Parallel Skeletons

Data Parallel Skeletons

- map: apply an n-ary function in SIMD mode over subsets of data
- fold: perform an n-ary reduction over subsets of data
- scan: perform an n-ary prefix reduction over subsets of data

Task Parallel Skeletons

- par: independent task execution
- pipe: task dependency over time
- farm: load balancing

A minimal sketch of the data-parallel skeletons as higher-order functions is shown below.
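The sketch expresses map and fold as higher-order functions over std::vector, falling back on standard algorithms; the skel namespace and these signatures are illustrative, not part of any of the libraries discussed here.

// Illustrative only: data-parallel skeletons expressed as higher-order
// functions over std::vector. A real implementation would dispatch to
// SIMD/threaded back-ends; here we fall back on standard algorithms.
#include <algorithm>
#include <numeric>
#include <vector>

namespace skel   // hypothetical namespace, not part of NT2
{
  // map: apply a (here unary) function element-wise
  template<typename T, typename F>
  std::vector<T> map(std::vector<T> const& in, F f)
  {
    std::vector<T> out(in.size());
    std::transform(in.begin(), in.end(), out.begin(), f);
    return out;
  }

  // fold: reduce a dataset with a binary operation and a neutral element
  template<typename T, typename F>
  T fold(std::vector<T> const& in, T neutral, F op)
  {
    return std::accumulate(in.begin(), in.end(), neutral, op);
  }
}

// Usage: composition of stateless functions
// auto squares = skel::map(v, [](double x) { return x * x; });
// auto total   = skel::fold(squares, 0.0, std::plus<>{});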

Page 24: (Costless) Software Abstractions for Parallel Architectures

Why use Parallel Skeletons?

Software Abstraction

- Write code without bothering with parallel details
- Code is scalable and easy to maintain
- Debuggable, Provable, Certifiable

Hardware Abstraction

- The semantics are fixed, the implementation is free
- Composability ⇒ hierarchical architectures

Page 26: (Costless) Software Abstractions for Parallel Architectures

Generative Programming

Figure: a Domain Specific Application Description and Parametric Sub-components are fed to a Translator (the generative component), which produces a Concrete Application.

Page 27: (Costless) Software Abstractions for Parallel Architectures

Generative Programming as a Tool

Available techniques

- Dedicated compilers
- External pre-processing tools
- Languages supporting meta-programming

Definition of Meta-programming

Meta-programming is the writing of computer programs that analyse, transform and generate other programs (or themselves) as their data.

C++ meta-programming

- Relies on the C++ template sub-language
- Handles types and integral constants at compile-time
- Proved to be Turing-complete

A tiny self-contained example of such compile-time computation is shown below.
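As a reminder of what compile-time computation over types and integral constants looks like, here is a small example in plain standard C++, unrelated to NT2:

// Minimal C++ meta-programming sketch: values and types manipulated
// entirely at compile time.
#include <type_traits>

// An integral constant computed by template recursion
template<unsigned N>
struct factorial { static constexpr unsigned value = N * factorial<N - 1>::value; };

template<>
struct factorial<0> { static constexpr unsigned value = 1; };

// A type-level function: strip const/reference to get the "raw" type
template<typename T>
using raw_t = std::remove_const_t<std::remove_reference_t<T>>;

static_assert(factorial<5>::value == 120, "computed by the compiler");
static_assert(std::is_same<raw_t<int const&>, int>::value, "type-level computation");

int main() {}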

Page 31: (Costless) Software Abstractions for Parallel Architectures

Domain Specific Embedded Languages

What's a DSEL?

- DSL = Domain Specific Language
- A declarative language, easy to use, fitting the domain
- DSEL = a DSL embedded within a general-purpose language

EDSL in C++

- Relies on operator overload abuse (Expression Templates)
- Carries semantic information around code fragments
- Generic implementations become self-aware of optimizations

Exploiting the static AST

- At the expression level: code generation
- At the function level: inter-procedural optimization

Page 32: (Costless) Software Abstractions for Parallel Architectures

Embedded Domain Specific Languages

EDSL in C++

- Relies on operator overload abuse – see B.P
- Carries semantic information around code fragments
- Generic implementations become self-aware of optimizations

Advantages

- Allow the introduction of DSLs without disrupting the development chain
- Semantics defined as type information means compile-time resolution
- Access to a large selection of runtime bindings

Page 33: (Costless) Software Abstractions for Parallel Architectures

Expression Templates in A Nutshell

Page 34: (Costless) Software Abstractions for Parallel Architectures

Expression Templates

matrix x(h,w), a(h,w), b(h,w);

x = cos(a) + (b*a);

// The statement above builds, at compile time, the expression type:
expr< assign
    , expr<matrix&>
    , expr< plus
          , expr<cos, expr<matrix&>>
          , expr<multiplies, expr<matrix&>, expr<matrix&>>
          >
    >(x, a, b);

Figure: the corresponding AST, assigning to x the sum of cos(a) and b*a.

// ... which is then evaluated as a single fused loop nest:
#pragma omp parallel for
for(int j=0; j<h; ++j)
{
  for(int i=0; i<w; ++i)
  {
    x(j,i) = cos(a(j,i)) + ( b(j,i) * a(j,i) );
  }
}

Arbitrary transforms can be applied on the meta-AST. A stripped-down expression-template sketch follows.
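For intuition, here is a deliberately small expression-template sketch, far simpler than what Boost.Proto or NT2 actually do: operator overloads build a lightweight AST of types (add_expr), and the loop is only emitted when the expression is assigned to a vec. The vec and add_expr types are illustrative, not NT2 components.

// Minimal expression-template sketch: capture x + y as a type, evaluate lazily.
#include <cstddef>
#include <type_traits>
#include <vector>

struct expr_tag {};   // every AST node and terminal derives from this

template<typename L, typename R>
struct add_expr : expr_tag       // node of the compile-time AST
{
  add_expr(L const& l_, R const& r_) : l(l_), r(r_) {}
  L const& l; R const& r;
  double operator[](std::size_t i) const { return l[i] + r[i]; }
  std::size_t size() const { return l.size(); }
};

struct vec : expr_tag
{
  std::vector<double> data;
  explicit vec(std::size_t n) : data(n) {}
  double  operator[](std::size_t i) const { return data[i]; }
  double& operator[](std::size_t i)       { return data[i]; }
  std::size_t size() const { return data.size(); }

  // Evaluation is triggered here: one single loop, no temporaries
  template<typename E, typename = std::enable_if_t<std::is_base_of<expr_tag, E>::value>>
  vec& operator=(E const& e)
  {
    for(std::size_t i = 0; i < size(); ++i) data[i] = e[i];
    return *this;
  }
};

// Only participates for AST nodes/terminals, so it cannot hijack double + double
template<typename L, typename R
        , typename = std::enable_if_t< std::is_base_of<expr_tag, L>::value
                                    && std::is_base_of<expr_tag, R>::value>>
add_expr<L, R> operator+(L const& l, R const& r) { return add_expr<L, R>(l, r); }

int main()
{
  vec a(100), b(100), c(100);
  c = a + b + a;   // builds add_expr<add_expr<vec,vec>, vec>, evaluated in one loop
}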

Page 35: (Costless) Software Abstractions for Parallel Architectures

Architecture Aware Generative Programming

Page 36: (Costless) Software Abstractions for Parallel Architectures

Parallel DSELs in practice

Objectives

- Apply DSEL generation techniques to different kinds of hardware
- Demonstrate the low cost of the abstractions
- Demonstrate the applicability of skeletons

Our contributions

- BSP++: generic C++ BSP for shared/distributed memory
- Quaff: DSEL for skeleton programming
- Boost.SIMD: DSEL for portable SIMD programming
- NT2: M-like DSEL for scientific computing

Page 39: (Costless) Software Abstractions for Parallel Architectures

NT2

A Scientific Computing Library

- Provides a simple, M-like interface for users
- Provides high-performance computing entities and primitives
- Easily extendable

Components

- Uses Boost.SIMD for in-core optimizations
- Uses recursive parallel skeletons
- Code is made independent of architecture and runtime

Page 40: (Costless) Software Abstractions for Parallel Architectures

The Numerical Template Toolbox
Comparison to other libraries

Feature                 Armadillo  Blaze  Eigen  MTL  uBlas  NT2
M-like API                  X        −      −     −     −     X
BLAS/LAPACK binding         X        X      X     X     X     X
MAGMA binding               −        −      −     −     −     X
SSE2+ support               X        X      X     −     −     X
AVX support                 X        X      −     −     −     X
AVX2 support                −        −      −     −     −     X
Xeon Phi support            −        −      −     −     −     X
Altivec support             −        −      X     −     −     X
ARM support                 −        −      X     −     −     X
Threading support           −        −      −     −     −     X
CUDA support                −        −      −     −     −     X

Page 41: (Costless) Software Abstractions for Parallel Architectures

The Numerical Template Toolbox

Principles

- table<T,S> is a simple, multidimensional array object that exactly mimics the M array behavior and functionality
- 500+ functions usable directly either on tables or on any scalar values, as in M

How does it work?

- Take a .m file, copy it to a .cpp file
- Add #include <nt2/nt2.hpp> and make cosmetic changes
- Compile the file and link with libnt2.a
- ??????
- PROFIT!

Page 46: (Costless) Software Abstractions for Parallel Architectures

NT2 - From M ...

A1 = 1:1000;
A2 = A1 + randn(size(A1));

X = lu(A1*A1');

rms = sqrt( sum(sqr(A1(:) - A2(:))) / numel(A1) );

Page 47: (Costless) Software Abstractions for Parallel Architectures

NT2 - ... to C++

table<double> A1 = _(1., 1000.);
table<double> A2 = A1 + randn(size(A1));

table<double> X = lu( mtimes(A1, trans(A1)) );

double rms = sqrt( sum(sqr(A1(_) - A2(_))) / numel(A1) );

Page 48: (Costless) Software Abstractions for Parallel Architectures

Parallel Skeletons extraction process

A = B / sum(C+D);

Figure: the expression AST. The sum(C+D) sub-tree is tagged as a fold skeleton; the surrounding element-wise assignment A = B / ... is tagged as a transform skeleton.

Page 49: (Costless) Software Abstractions for Parallel Architectures

Parallel Skeletons extraction process

A = B / sum(C+D);

Figure: the statement is split in two. First, tmp = sum(C+D) is evaluated by a fold skeleton; then A = B / tmp is evaluated by a transform skeleton. A sketch of this two-phase evaluation with standard algorithms follows.
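To make the extraction concrete, here is a hand-written equivalent of A = B / sum(C+D) using standard algorithms: a fold for the reduction, then a transform for the element-wise division. It is illustrative only; NT2 derives and parallelizes this structure automatically from the expression AST.

// Hand-written equivalent of A = B / sum(C+D): one fold, then one transform.
#include <algorithm>
#include <functional>
#include <numeric>
#include <vector>

void eval( std::vector<double>& A, std::vector<double> const& B
         , std::vector<double> const& C, std::vector<double> const& D )
{
  // fold skeleton: tmp = sum(C+D), the addition is fused inside the reduction
  double tmp = std::inner_product( C.begin(), C.end(), D.begin(), 0.0
                                 , std::plus<>{}       // combine partial sums
                                 , std::plus<>{} );    // element-wise C[i] + D[i]

  // transform skeleton: A = B / tmp, element-wise
  std::transform( B.begin(), B.end(), A.begin()
                , [tmp](double b) { return b / tmp; } );
}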

Page 50: (Costless) Software Abstractions for Parallel Architectures

From data to task parallelism

Limits of the fork-join model

- Synchronization costs due to implicit barriers
- Under-exploitation of the potential parallelism
- Poor data locality and no inter-statement optimization

From Skeletons to Actors

- Upgrade NT2 to enable task parallelism
- Adapt the current skeletons for taskification
- Use Futures (or HPX) to automatically create pipelines
- Derive a dependency graph between statements

A sketch of future-based pipelining between two statements is shown below.
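As an illustration of future-based pipelining, the sketch below chains two dependent statements with plain std::async (NT2 can also sit on top of HPX); the second stage waits only on the value it actually needs, not on a global barrier. This is a hand-written sketch, not NT2-generated code.

// Illustrative future-based pipeline: the second stage starts as soon as the
// value it depends on is ready, with no global fork-join barrier.
#include <cstddef>
#include <future>
#include <vector>

int main()
{
  std::vector<double> B(1000, 2.0), C(1000, 1.0), D(1000, 3.0), A(1000);

  // Stage 1: tmp = sum(C+D), launched asynchronously
  std::future<double> tmp = std::async(std::launch::async, [&]
  {
    double s = 0.0;
    for(std::size_t i = 0; i < C.size(); ++i) s += C[i] + D[i];
    return s;
  });

  // Stage 2: A = B / tmp, scheduled as a continuation of stage 1
  std::future<void> stage2 = std::async(std::launch::async,
    [&, t = std::move(tmp)]() mutable
  {
    double s = t.get();                       // waits only on its own dependency
    for(std::size_t i = 0; i < B.size(); ++i) A[i] = B[i] / s;
  });

  stage2.get();                               // join the end of the pipeline
}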

Page 52: (Costless) Software Abstractions for Parallel Architectures

Parallel Skeletons extraction process - Take 2

A = B / sum(C+D);

Figure: the same split as before, tmp = sum(C+D) handled by a fold and A = B / tmp handled by a transform, now treated as two tasks linked by their dependency on tmp.

Page 53: (Costless) Software Abstractions for Parallel Architectures

Parallel Skeletons extraction process - Take 2

A = B / sum(C+D);

Figure: the two statements are split column-wise into chunks (tmp(k) = sum(C(:,k)+D(:,k)), then A(:,k) = B(:,k) / tmp(k)). Each chunk is executed by a worker (fold + SIMD or transform + SIMD), and the workers are driven by OpenMP spawners, so independent chunks can proceed as a pipeline.

Page 54: (Costless) Software Abstractions for Parallel Architectures

Sigma-Delta Motion Detection
Context

- Mono-modal algorithm based on background subtraction
- Uses a local Gaussian model of lightness variation to detect motion
- Target applications: robotics, video surveillance and analytics, defence
- Challenge: very low arithmetic density
- Challenge: integer-based implementation with a small value range

Page 55: (Costless) Software Abstractions for Parallel Architectures

Motion Detection
NT2 Code

table<char> sigma_delta( table<char>& background
                       , table<char> const& frame
                       , table<char>& variance
                       )
{
  // Estimate raw movement
  background = selinc( background < frame
                     , seldec( background > frame, background )
                     );

  table<char> diff = dist(background, frame);

  // Compute local variance
  table<char> sig3 = muls(diff, 3);

  variance = if_else( diff != 0
                    , selinc( variance < sig3
                            , seldec( variance > sig3, variance )
                            )
                    , variance
                    );

  // Generate movement label
  return if_zero_else_one( diff < variance );
}

Page 56: (Costless) Software Abstractions for Parallel Architectures

Motion Detection
Performance

Figure: cycles per element for 512x512 and 1024x1024 images, comparing SCALAR, HALF CORE, FULL CORE, SIMD, SIMD + HALF CORE and SIMD + FULL CORE variants against the JRTIP2008 reference implementation. Measured speedups range from about x2.1 to x18.

Page 57: (Costless) Software Abstractions for Parallel Architectures

Black and Scholes Option Pricing

NT2 Code

table<float> blackscholes( table<float> const& Sa, table<float> const& Xa
                         , table<float> const& Ta, table<float> const& ra
                         , table<float> const& va
                         )
{
  table<float> da = sqrt(Ta);
  table<float> d1 = log(Sa/Xa) + (sqr(va)*0.5f + ra)*Ta/(va*da);
  table<float> d2 = d1 - va*da;

  return Sa*normcdf(d1) - Xa*exp(-ra*Ta)*normcdf(d2);
}

Page 58: (Costless) Software Abstractions for Parallel Architectures

Black and Scholes Option Pricing

NT2 Code with loop fusion

table<float> blackscholes( table<float> const& Sa, table<float> const& Xa
                         , table<float> const& Ta, table<float> const& ra
                         , table<float> const& va
                         )
{
  // Preallocate temporary tables
  table<float> da(extent(Ta)), d1(extent(Ta)), d2(extent(Ta)), R(extent(Ta));

  // tie merges the loop nests and increases cache locality
  tie(da, d1, d2, R) = tie( sqrt(Ta)
                          , log(Sa/Xa) + (sqr(va)*0.5f + ra)*Ta/(va*da)
                          , d1 - va*da
                          , Sa*normcdf(d1) - Xa*exp(-ra*Ta)*normcdf(d2)
                          );
  return R;
}
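For intuition, the fused evaluation conceptually amounts to a single loop of the following shape. This is a hand-written sketch over std::vector, not what NT2 literally generates, and normcdf is defined here via std::erfc only to keep the sketch self-contained; all vectors are assumed to have the same size.

// Conceptual view of the fused evaluation: one pass, one loop nest,
// all four results produced per element.
#include <cmath>
#include <cstddef>
#include <vector>

// Standard normal CDF via erfc (assumed equivalent to NT2's normcdf)
inline float normcdf(float x) { return 0.5f * std::erfc(-x / std::sqrt(2.0f)); }

void blackscholes_fused( std::vector<float>& R
                       , std::vector<float> const& Sa, std::vector<float> const& Xa
                       , std::vector<float> const& Ta, std::vector<float> const& ra
                       , std::vector<float> const& va
                       , std::vector<float>& da, std::vector<float>& d1
                       , std::vector<float>& d2 )
{
  for(std::size_t i = 0; i < Ta.size(); ++i)
  {
    da[i] = std::sqrt(Ta[i]);
    d1[i] = std::log(Sa[i]/Xa[i]) + (va[i]*va[i]*0.5f + ra[i])*Ta[i]/(va[i]*da[i]);
    d2[i] = d1[i] - va[i]*da[i];
    R[i]  = Sa[i]*normcdf(d1[i]) - Xa[i]*std::exp(-ra[i]*Ta[i])*normcdf(d2[i]);
  }
}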

Page 59: (Costless) Software Abstractions for Parallel Architectures

Black and Scholes Option Pricing
Performance

Figure: cycles per value for 1,000,000 options, for scalar, SSE2, AVX2, SSE2 on 4 cores and AVX2 on 4 cores. Speedups over scalar are x1.89 (SSE2), x2.91 (AVX2), x5.58 (SSE2, 4 cores) and x6.30 (AVX2, 4 cores).

Page 60: (Costless) Software Abstractions for Parallel Architectures

Black and Scholes Option Pricing
Performance with loop fusion

Figure: same experiment with loop fusion. Speedups over the scalar baseline rise to x2.27 (SSE2), x4.13 (AVX2), x8.05 (SSE2, 4 cores) and x11.12 (AVX2, 4 cores).

Page 61: (Costless) Software Abstractions for Parallel Architectures

LU Decomposition
Algorithm

Figure: tiled LU decomposition. The matrix blocks A00..A22 are processed over steps 1 to 7 by the kernels DGETRF, DGESSM, DTSTRF and DSSSSM, with the trailing sub-matrix shrinking at each step.

Page 62: (Costless) Software Abstractions for Parallel Architectures

LU Decomposition
Performance

Figure: median GFLOPS versus number of cores (up to 48) for an 8000 x 8000 LU decomposition, comparing NT2 with Intel MKL.

Page 63: (Costless) Software Abstractions for Parallel Architectures

What we learn

Parallel Computing for Scientists

- Software libraries built as Generic and Generative components can solve a large chunk of parallelism-related problems while remaining easy to use.
- Like regular languages, EDSLs need information about the hardware system
- Integrating hardware descriptions as Generic components increases the tools' portability and re-targetability

New Directions

- Toward a global generic approach to parallelism
- Turning hacks into language features

Page 65: (Costless) Software Abstractions for Parallel Architectures

Generic Parallelism

Parallel C++ Concepts

- Expand function hierarchization to Concepts
- e.g.: DataParallel, AssociativeOperations, etc.
- Use C++1y Concept overloading to split skeletons

Impact

- Less work for the Skeleton users
- Extendable through refinement
- Static assertion of function properties

A sketch of concept-based skeleton dispatch is given after this list.
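Standard Concepts were not available at the time of the talk; the sketch below re-expresses the idea with C++20 concepts. The trait is_associative, the concept AssociativeOperation and the two fold overloads are hypothetical illustrations, not an existing library API: the more constrained overload is selected whenever the operation is declared associative, which is the kind of skeleton splitting the slide refers to.

// Illustrative C++20 sketch of concept-based skeleton overloading.
#include <concepts>
#include <functional>
#include <numeric>
#include <type_traits>
#include <vector>

// Hypothetical trait: users would specialize it for their operations
template<typename Op>
struct is_associative : std::false_type {};

template<>
struct is_associative<std::plus<>> : std::true_type {};

template<typename Op, typename T>
concept AssociativeOperation = is_associative<Op>::value
  && requires(Op op, T a, T b) { { op(a, b) } -> std::convertible_to<T>; };

// Generic fallback: sequential left fold, no ordering assumption
template<typename T, typename Op>
T fold(std::vector<T> const& v, T init, Op op)
{
  return std::accumulate(v.begin(), v.end(), init, op);
}

// Refinement selected when the operation is known to be associative:
// partial results could be computed in parallel and combined in any order.
template<typename T, AssociativeOperation<T> Op>
T fold(std::vector<T> const& v, T init, Op op)
{
  T left = init, right = T{};            // stand-in for two parallel chunks
  auto mid = v.begin() + v.size() / 2;
  left  = std::accumulate(v.begin(), mid, left, op);
  right = std::accumulate(mid, v.end(), right, op);
  return op(left, right);
}

int main()
{
  std::vector<int> v{1, 2, 3, 4};
  int s = fold(v, 0, std::plus<>{});     // picks the associative refinement
  (void)s;
}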

Page 67: (Costless) Software Abstractions for Parallel Architectures

New C++ Language Features

My C++ Christmas Land

- Build lazy evaluation into the language
- Interaction with generic functions is currently cumbersome
- SIMD as part of the standard at the type level

Current Work

- Can sizeof inspire an ast_of operator?
- Proposal N4035 for auto customization
- Proposal N3571 for standard SIMD computation

Page 69: (Costless) Software Abstractions for Parallel Architectures

Perspectives

At the tools level

- Prototype of single-source GPU support
- Work on distributed systems
- Applications to Big Data

At the language level

- Formalize meta-programming
- Transference of DSEL verification to C++
- Interaction with the polyhedral model

Page 71: (Costless) Software Abstractions for Parallel Architectures

Thanks for your attention