Transcript
Page 1: (Costless) Software Abstractions for Parallel Architectures

(Costless) Software Abstractions for Parallel Architectures

NumScale – Unlocked software performance

Joel Falcou

NumScale SAS – LRI - INRIA

CppCon – 01/07/2014

Page 2: (Costless) Software Abstractions for Parallel Architectures


Context

Decades of hardware improvements

- Scientific Computing now drives most hardware innovations
- Current solution: parallel architectures
- Machines become more and more complex

Example: a simple laptop

- CPU: Intel Core i5-2410M (2.3 GHz), 4 logical cores
- SIMD extensions: SSE2–SSE4.2, AVX

- GPU: NVIDIA GeForce GT 520M (48 CUDA cores)



Page 4: (Costless) Software Abstractions for Parallel Architectures


The Real Challenge of HPC

[Figure: three performance-vs-expressiveness diagrams, one per era]
- Single Core Era: C/Fortran, C++, Java
- Multi-Core/SIMD Era: Sequential, Threads, SIMD
- Heterogeneous Era: Sequential, SIMD, Threads, GPU/Phi, Distributed


Page 5: (Costless) Software Abstractions for Parallel Architectures


Designing tools for Scientific Computing

Challenges

1. Be non-disruptive

2. Domain-driven optimizations

3. Provide an intuitive API for the user

4. Support a wide architectural landscape

5. Be efficient

Our Approach

- Design tools as C++ libraries (1)
- Design these libraries as Domain Specific Embedded Languages (DSELs) (2+3)
- Use Parallel Skeletons as parallel components (4)
- Use Generative Programming to deliver performance (5)



Page 16: (Costless) Software Abstractions for Parallel Architectures


Parallel Programming Ain’t Easy


Page 17: (Costless) Software Abstractions for Parallel Architectures


Spotting abstraction when you see one

Why Parallel Programming Models?

- Unstructured parallelism is error-prone
- Low-level parallel tools are non-composable

Available Models

- Performance-centric: PRAM, LogP, BSP
- Data-centric: HTA, PGAS
- Pattern-centric: Actors, Skeletons



Page 20: (Costless) Software Abstractions for Parallel Architectures


Parallel Skeletons in a nutshell

Basic Principles [COLE 89]

- There are recurring patterns in parallel applications
- Those patterns can be generalized as Skeletons
- Applications are assembled as combinations of such patterns

Functional point of view

- Skeletons are Higher-Order Functions
- Skeletons support a compositional semantics
- Applications become compositions of stateless functions



Page 22: (Costless) Software Abstractions for Parallel Architectures


Classic Parallel Skeletons

Data Parallel Skeletons

- map: apply an n-ary function in SIMD mode over subsets of data
- fold: perform an n-ary reduction over subsets of data
- scan: perform an n-ary prefix reduction over subsets of data

Task Parallel Skeletons

- par: independent task execution
- pipe: task dependency over time
- farm: load balancing
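To make the pattern concrete, here is a plain C++ sketch of the map and fold data-parallel skeletons (an illustration using standard algorithms, not the talk's implementation): the skeleton fixes the iteration structure, and the user only supplies the element-wise function.

#include <algorithm>
#include <functional>
#include <numeric>
#include <vector>

// Plain C++ sketch of the map and fold skeletons (illustration only, not the
// talk's library): the skeleton owns the iteration pattern, the user supplies
// only the per-element function.
template <class T, class F>
std::vector<T> map(std::vector<T> const& in, F f) {
  std::vector<T> out(in.size());
  std::transform(in.begin(), in.end(), out.begin(), f);   // element-wise apply
  return out;
}

template <class T, class F>
T fold(std::vector<T> const& in, T init, F f) {
  return std::accumulate(in.begin(), in.end(), init, f);  // reduction
}

int main() {
  std::vector<double> v(100, 2.0);
  // Skeletons compose like ordinary higher-order functions:
  double sum_of_squares = fold(map(v, [](double x) { return x * x; }), 0.0, std::plus<>{});
  (void)sum_of_squares;
}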



Page 24: (Costless) Software Abstractions for Parallel Architectures


Why use Parallel Skeletons?

Software Abstraction

- Write without bothering with parallel details
- Code is scalable and easy to maintain
- Debuggable, Provable, Certifiable

Hardware Abstraction

- The semantics is fixed, the implementation is free
- Composability ⇒ hierarchical architectures



Page 26: (Costless) Software Abstractions for Parallel Architectures


Generative Programming

[Figure: a Domain-Specific Application Description and Parametric Sub-components feed a Translator — the generative component — which produces the Concrete Application]


Page 27: (Costless) Software Abstractions for Parallel Architectures


Generative Programming as a Tool

Available techniques

- Dedicated compilers
- External pre-processing tools
- Languages supporting meta-programming

Definition of Meta-programming
Meta-programming is the writing of computer programs that analyse, transform and generate other programs (or themselves) as their data.

C++ meta-programming

- Relies on the C++ sub-language
- Handles types and integral constants at compile time
- Proved to be Turing-complete
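As a minimal illustration (not taken from the talk), a compile-time computation over integral constants:

#include <type_traits>

// A compile-time computation: the template system manipulates integral
// constants entirely at compile time, no runtime cost remains.
template <unsigned N>
struct factorial : std::integral_constant<unsigned, N * factorial<N - 1>::value> {};

template <>
struct factorial<0> : std::integral_constant<unsigned, 1> {};

static_assert(factorial<5>::value == 120, "evaluated by the compiler, not at runtime");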



Page 31: (Costless) Software Abstractions for Parallel Architectures


Domain Specific Embedded Languages

What's a DSEL?
- DSL = Domain Specific Language
- Declarative language, easy to use, fitting the domain
- DSEL = DSL within a general-purpose language

EDSL in C++
- Relies on operator overload abuse (Expression Templates)
- Carries semantic information around code fragments
- Generic implementations become self-aware of optimizations

Exploiting static AST

- At the expression level: code generation
- At the function level: inter-procedural optimization


Page 32: (Costless) Software Abstractions for Parallel Architectures


Embedded Domain Specific Languages

EDSL in C++
- Relies on operator overload abuse – see B.P
- Carries semantic information around code fragments
- Generic implementations become self-aware of optimizations

Advantages

- Allows introduction of DSLs without disrupting the dev. chain
- Semantics defined as type information means compile-time resolution
- Access to a large selection of runtime bindings


Page 33: (Costless) Software Abstractions for Parallel Architectures


Expression Templates in a Nutshell


Page 34: (Costless) Software Abstractions for Parallel Architectures


Expression Templates

matrix x(h,w),a(h,w),b(h,w);

x = cos(a) + (b*a);

expr< assign
    , expr<matrix&>
    , expr< plus
          , expr<cos, expr<matrix&>>
          , expr<multiplies, expr<matrix&>, expr<matrix&>>
          >
    >(x,a,b);

[Figure: the corresponding AST — an assignment node with x on the left and the + node combining cos(a) and b*a on the right]

#pragma omp parallel for
for(int j=0; j<h; ++j)
{
  for(int i=0; i<w; ++i)
  {
    x(j,i) = cos(a(j,i)) + ( b(j,i) * a(j,i) );
  }
}

Arbitrary transforms applied on the meta-AST
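To make the mechanism explicit, here is a minimal, self-contained expression-template sketch (illustration only; the type names are mine, not NT2's). Operator overloads build the static AST as nested types, and the assignment operator walks it in a single fused loop:

#include <cmath>
#include <cstddef>
#include <type_traits>
#include <vector>

// Minimal expression-template sketch (illustration only; not NT2's real types).
struct expr_tag {};                       // marks every node of the static AST

struct vec : expr_tag {
  std::vector<double> data;
  explicit vec(std::size_t n, double v = 0.0) : data(n, v) {}
  double operator[](std::size_t i) const { return data[i]; }

  // Assigning from any expression evaluates the whole AST in one fused loop.
  template <class Expr>
  vec& operator=(Expr const& e) {
    for (std::size_t i = 0; i < data.size(); ++i) data[i] = e[i];
    return *this;
  }
};

template <class L, class R>
struct add_expr : expr_tag {
  L const& l; R const& r;
  add_expr(L const& l_, R const& r_) : l(l_), r(r_) {}
  double operator[](std::size_t i) const { return l[i] + r[i]; }
};

template <class E>
struct cos_expr : expr_tag {
  E const& e;
  explicit cos_expr(E const& e_) : e(e_) {}
  double operator[](std::size_t i) const { return std::cos(e[i]); }
};

// Overloads only apply to AST nodes, so they never hijack other types.
template <class L, class R,
          class = typename std::enable_if<std::is_base_of<expr_tag, L>::value &&
                                          std::is_base_of<expr_tag, R>::value>::type>
add_expr<L, R> operator+(L const& l, R const& r) { return add_expr<L, R>(l, r); }

template <class E,
          class = typename std::enable_if<std::is_base_of<expr_tag, E>::value>::type>
cos_expr<E> cos_(E const& e) { return cos_expr<E>(e); }

int main() {
  vec a(4, 1.0), b(4, 2.0), x(4);
  x = cos_(a) + b;   // builds add_expr<cos_expr<vec>, vec>, then one loop in operator=
}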


Page 35: (Costless) Software Abstractions for Parallel Architectures


Architecture Aware Generative Programming


Page 36: (Costless) Software Abstractions for Parallel Architectures


Parallel DSEL in practice

Objectives

- Apply DSEL generation techniques to different kinds of hardware
- Demonstrate the low cost of abstractions
- Demonstrate the applicability of skeletons

Our contribution

- BSP++: generic C++ BSP for shared/distributed memory
- Quaff: DSEL for skeleton programming
- Boost.SIMD: DSEL for portable SIMD programming
- NT2: M-like DSEL for scientific computing



Page 39: (Costless) Software Abstractions for Parallel Architectures


NT2

A Scientific Computing Library

- Provides a simple, M-like interface for users
- Provides high-performance computing entities and primitives
- Easily extendable

Components

- Uses Boost.SIMD for in-core optimizations
- Uses recursive parallel skeletons
- Code is made independent of architecture and runtime


Page 40: (Costless) Software Abstractions for Parallel Architectures


The Numerical Template Toolbox – Comparison to other libraries

Feature               Armadillo  Blaze  Eigen  MTL  uBlas  NT2
M-like API                X        −      −     −     −     X
BLAS/LAPACK binding       X        X      X     X     X     X
MAGMA binding             −        −      −     −     −     X
SSE2+ support             X        X      X     −     −     X
AVX support               X        X      −     −     −     X
AVX2 support              −        −      −     −     −     X
Xeon Phi support          −        −      −     −     −     X
Altivec support           −        −      X     −     −     X
ARM support               −        −      X     −     −     X
Threading support         −        −      −     −     −     X
CUDA support              −        −      −     −     −     X


Page 41: (Costless) Software Abstractions for Parallel Architectures


The Numerical Template Toolbox

Principles

- table<T,S> is a simple, multidimensional array object that exactly mimics M array behavior and functionality

- 500+ functions usable directly on tables or on any scalar value, as in M

How does it work

- Take a .m file, copy it to a .cpp file
- Add #include <nt2/nt2.hpp> and do cosmetic changes
- Compile the file and link with libnt2.a

- ??????
- PROFIT!



Page 46: (Costless) Software Abstractions for Parallel Architectures


NT2 - From M ...

A1 = 1:1000;
A2 = A1 + randn(size(A1));

X = lu(A1*A1');

rms = sqrt( sum(sqr(A1(:) - A2(:))) / numel(A1) );


Page 47: (Costless) Software Abstractions for Parallel Architectures


NT2 - ... to C++

table<double> A1 = _(1., 1000.);
table<double> A2 = A1 + randn(size(A1));

table<double> X = lu( mtimes(A1, trans(A1)) );

double rms = sqrt( sum(sqr(A1(_) - A2(_))) / numel(A1) );


Page 48: (Costless) Software Abstractions for Parallel Architectures


Parallel Skeletons extraction process

A = B / sum(C+D);

[Figure: AST of the statement — an assignment of A from B / sum(C+D); the sum(C+D) subtree is marked as a fold skeleton, the surrounding element-wise division as a transform skeleton]


Page 49: (Costless) Software Abstractions for Parallel Architectures


Parallel Skeletons extraction process

A = B / sum(C+D);

[Figure: the statement is split in two — tmp = sum(C+D), extracted as a fold skeleton, then A = B / tmp, evaluated as a transform skeleton]
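Conceptually, the split corresponds to two loop nests; this is a hand-written sketch of the code such a split stands for (illustration only; NT2 generates the equivalent from the expression AST, as on the Expression Templates slide):

#include <vector>

// Sketch of the two loop nests the fold/transform split corresponds to.
void evaluate(std::vector<double>& A, std::vector<double> const& B,
              std::vector<double> const& C, std::vector<double> const& D) {
  int const n = static_cast<int>(A.size());
  double tmp = 0.0;

  #pragma omp parallel for reduction(+:tmp)   // fold skeleton: tmp = sum(C + D)
  for (int i = 0; i < n; ++i)
    tmp += C[i] + D[i];

  #pragma omp parallel for                    // transform skeleton: A = B / tmp
  for (int i = 0; i < n; ++i)
    A[i] = B[i] / tmp;
}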


Page 50: (Costless) Software Abstractions for Parallel Architectures


From data to task parallelism

Limits of the fork-join model

- Synchronization cost due to implicit barriers
- Under-exploitation of potential parallelism
- Poor data locality and no inter-statement optimization

From Skeletons to Actors

- Upgrade NT2 to enable task parallelism
- Adapt current skeletons for taskification
- Use Futures (or HPX) to automatically create pipelines
- Derive a dependency graph between statements
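A hand-written sketch of what such a future-based pipeline looks like for the running example A = B / sum(C+D) (illustration only; NT2/HPX derive this automatically, and the loop bodies here are naive):

#include <cstddef>
#include <future>
#include <vector>

int main() {
  std::vector<double> B(1000, 2.0), C(1000, 1.0), D(1000, 3.0);

  // tmp = sum(C + D) runs as an asynchronous fold task...
  std::future<double> tmp = std::async(std::launch::async, [&] {
    double s = 0.0;
    for (std::size_t i = 0; i < C.size(); ++i) s += C[i] + D[i];
    return s;
  });

  // ...and A = B / tmp waits only on the value it actually depends on,
  // instead of hitting a global barrier between the two statements.
  std::future<std::vector<double>> A = std::async(std::launch::async, [&] {
    double s = tmp.get();
    std::vector<double> a(B.size());
    for (std::size_t i = 0; i < B.size(); ++i) a[i] = B[i] / s;
    return a;
  });

  A.get();  // retrieve the final result
}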



Page 52: (Costless) Software Abstractions for Parallel Architectures


Parallel Skeletons extraction process - Take 2

A = B / sum(C+D);

[Figure: the same two-statement split — tmp = sum(C+D) as a fold, then A = B / tmp as a transform — now scheduled as tasks]


Page 53: (Costless) Software Abstractions for Parallel Architectures


Parallel Skeletons extraction process - Take 2

A = B / sum(C+D);

[Figure: each statement is tiled by column — tmp(k) = sum(C(:,k)+D(:,k)) and A(:,k) = B(:,k)/tmp(k) — with SIMD fold/transform workers driven by OpenMP transform spawners]


Page 54: (Costless) Software Abstractions for Parallel Architectures


Sigma-Delta Motion Detection – Context

- Mono-modal algorithm based on background subtraction
- Uses a local Gaussian model of lightness variation to detect motion
- Target applications: robotics, video surveillance and analytics, defence
- Challenge: very low arithmetic density
- Challenge: integer-based implementation with small value range


Page 55: (Costless) Software Abstractions for Parallel Architectures


Motion Detection – NT2 Code

table<char> sigma_delta( table<char>& background
                       , table<char> const& frame
                       , table<char>& variance
                       )
{
  // Estimate Raw Movement
  background = selinc( background < frame
                     , seldec(background > frame, background)
                     );

  table<char> diff = dist(background, frame);

  // Compute Local Variance
  table<char> sig3 = muls(diff, 3);

  variance = if_else( diff != 0
                    , selinc( variance < sig3
                            , seldec(variance > sig3, variance)
                            )
                    , variance
                    );

  // Generate Movement Label
  return if_zero_else_one( diff < variance );
}
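For readers unfamiliar with the NT2 primitives above, here is a scalar sketch of what they do element-wise (my reading of selinc/seldec/dist/if_else/if_zero_else_one, not part of the talk; note NT2's muls is a saturated multiply, which this sketch does not reproduce):

#include <cstddef>
#include <cstdlib>
#include <vector>

// Scalar sketch of the same algorithm; all images are assumed to have equal size.
std::vector<unsigned char> sigma_delta_scalar(std::vector<unsigned char>& background,
                                              std::vector<unsigned char> const& frame,
                                              std::vector<unsigned char>& variance) {
  std::vector<unsigned char> label(frame.size());
  for (std::size_t i = 0; i < frame.size(); ++i) {
    // Estimate raw movement: step the background one unit toward the frame
    if      (background[i] < frame[i]) ++background[i];
    else if (background[i] > frame[i]) --background[i];

    unsigned char diff = static_cast<unsigned char>(std::abs(background[i] - frame[i]));

    // Compute local variance: track 3 * diff (NT2 saturates, this just truncates)
    unsigned char sig3 = static_cast<unsigned char>(3 * diff);
    if (diff != 0) {
      if      (variance[i] < sig3) ++variance[i];
      else if (variance[i] > sig3) --variance[i];
    }

    // Generate movement label: 1 where diff >= variance
    label[i] = (diff < variance[i]) ? 0 : 1;
  }
  return label;
}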


Page 56: (Costless) Software Abstractions for Parallel Architectures


Motion Detection – Performance

[Figure: cycles per element for 512x512 and 1024x1024 images, comparing SCALAR, HALF CORE, FULL CORE, SIMD, JRTIP2008, SIMD + HALF CORE and SIMD + FULL CORE variants; reported speedups range from x2.1 to x18]


Page 57: (Costless) Software Abstractions for Parallel Architectures


Black and Scholes Option Pricing

NT2 Code

table<float> blackscholes( table<float> const& Sa, table<float> const& Xa
                         , table<float> const& Ta, table<float> const& ra
                         , table<float> const& va
                         )
{
  table<float> da = sqrt(Ta);
  table<float> d1 = log(Sa/Xa) + (sqr(va)*0.5f+ra)*Ta/(va*da);
  table<float> d2 = d1 - va*da;

  return Sa*normcdf(d1) - Xa*exp(-ra*Ta)*normcdf(d2);
}


Page 58: (Costless) Software Abstractions for Parallel Architectures


Black and Scholes Option Pricing

NT2 Code with loop fusion

table<float> blackscholes( table<float> const& Sa, table<float> const& Xa
                         , table<float> const& Ta, table<float> const& ra
                         , table<float> const& va
                         )
{
  // Preallocate temporary tables
  table<float> da(extent(Ta)), d1(extent(Ta)), d2(extent(Ta)), R(extent(Ta));

  // tie merges the loop nests and increases cache locality
  tie(da,d1,d2,R) = tie( sqrt(Ta)
                       , log(Sa/Xa) + (sqr(va)*0.5f+ra)*Ta/(va*da)
                       , d1 - va*da
                       , Sa*normcdf(d1) - Xa*exp(-ra*Ta)*normcdf(d2)
                       );
  return R;
}


Page 59: (Costless) Software Abstractions for Parallel Architectures


Black and Scholes Option Pricing – Performance

[Figure: cycles per value for 1,000,000 options — scalar, SSE2, AVX2, SSE2 on 4 cores, AVX2 on 4 cores; reported speedups over scalar are x1.89, x2.91, x5.58 and x6.30]


Page 60: (Costless) Software Abstractions for Parallel Architectures


Black and Scholes Option Pricing – Performance with loop fusion

[Figure: cycles per value for 1,000,000 options with loop fusion — same configurations; reported speedups over scalar are x2.27, x4.13, x8.05 and x11.12]


Page 61: (Costless) Software Abstractions for Parallel Architectures


LU Decomposition – Algorithm

[Figure: tiled LU decomposition — the matrix is partitioned into blocks A00..A22 and processed in 7 steps using the DGETRF, DGESSM, DTSTRF and DSSSSM kernels]


Page 62: (Costless) Software Abstractions for Parallel Architectures


LU Decomposition – Performance

[Figure: median GFLOPS vs. number of cores (up to about 50) for an 8000×8000 LU decomposition, comparing NT2 and Intel MKL]


Page 63: (Costless) Software Abstractions for Parallel Architectures


What we learn

Parallel Computing for Scientists

- Software libraries built as Generic and Generative components can solve a large chunk of parallelism-related problems while being easy to use.

- Like regular languages, EDSLs need information about the hardware system
- Integrating hardware descriptions as Generic components increases tool portability and re-targetability

New Directions
- Toward a global generic approach to parallelism
- Turning hacks into language features



Page 65: (Costless) Software Abstractions for Parallel Architectures


Generic Parallelism

Parallel C++ Concepts

- Expand function hierarchization to Concepts
- e.g.: DataParallel, AssociativeOperations, etc.
- Use C++1y Concept overloading to split skeletons (a sketch follows below)

Impact

- Less work for Skeleton users
- Extensible through refinement
- Static assertion of function properties
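The talk targets C++1y Concepts (Concepts Lite); as an approximation, here is a sketch in C++20 concepts syntax of how an operation's properties could select a skeleton implementation (names such as AssociativeOperation and fold are illustrative, not the talk's API):

#include <concepts>
#include <numeric>
#include <type_traits>
#include <vector>

// Associativity itself cannot be checked statically; the concept documents the
// requirement and at least enforces that the operation is invocable on T.
template <class Op, class T>
concept AssociativeOperation =
  std::regular_invocable<Op, T, T> &&
  std::convertible_to<std::invoke_result_t<Op, T, T>, T>;

// A fold skeleton constrained on the operation's properties: declaring the
// operation associative is what allows an implementation to split the
// reduction across threads or SIMD lanes.
template <class T, AssociativeOperation<T> Op>
T fold(std::vector<T> const& v, T init, Op op) {
  return std::accumulate(v.begin(), v.end(), init, op);  // sequential fallback
}

// Usage: fold(v, 0.0, std::plus<>{}) compiles; a non-invocable op is rejected
// at compile time with a constraint error rather than a template backtrace.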



Page 67: (Costless) Software Abstractions for Parallel Architectures


New C++ Language Features

My C++ Christmas Land

- Build lazy evaluation into the language
- Interactions with generic functions are cumbersome
- SIMD as part of the standard at the type level

Current Work
- Can sizeof inspire an ast_of operator?
- Proposal N4035 for auto customization
- Proposal N3571 for standard SIMD computation



Page 69: (Costless) Software Abstractions for Parallel Architectures


Perspectives

At tools level

- Prototype of single-source GPU support
- Work on distributed systems
- Applications to Big Data

At language level

- Formalize meta-programming
- DSEL verification transference over C++
- Interaction with the polyhedral model



Page 71: (Costless) Software Abstractions for Parallel Architectures

Thanks for your attention

