Parallel STL in today’s SYCL

Ruymán Reyes
Codeplay Research
15th November, 2016



Outline

1 Parallelism TS

2 The SYCL parallel STL

3 Heterogeneous Execution with Parallel STL

4 Conclusions and Future Work


The presenter

Ruyman Reyes, PhD
- Background in HPC, programming models and compilers
  → Worked on HPC scientific codes (ScaLAPACK, GROMACS, CP2K)
  → Created the first open-source OpenACC implementation
- Contributor to the SYCL specification
- Lead of ComputeCpp (Codeplay’s SYCL implementation)
- Coordinating the work on SYCL Parallel STL


Codeplay Software

We build software development tools for SoCs
- Software company based in Edinburgh
- 42 developers
- Different backgrounds and skill sets
  - Games industry, AI, compilers, HPC, robotics
  - Various levels of expertise (graduates to PhDs)
- Customers work in all areas of industry
  - Smartphones
  - Self-driving cars
  - Game consoles

Our technology is probably in your pocket!


Parallel STL: Democratizing Parallelism in C++

- Various libraries offered STL-like interfaces for parallel algorithms
  → Thrust, Bolt, libstdc++ Parallel Mode, AMP algorithms
- In 2012, two separate proposals to add parallelism to the C++ standard:
  → NVIDIA (N3408), based on Thrust (CUDA-based C++ library)
  → Microsoft and Intel (N3429), based on Intel TBB and PPL/C++ AMP
- SG1 suggested a joint proposal (N3554)
  → Many working drafts followed: N3554, N3850, N3960, N4071, N4409
- Final proposal P0024R2 accepted for C++17 at the Jacksonville meeting
- Latest status is in the C++ draft on GitHub


Existing implementations

Following the evolution of the document
- Microsoft: http://parallelstl.codeplex.com
- HPX: http://stellar-group.github.io/hpx/docs/html/hpx/manual/parallel.html
- HSA: http://www.hsafoundation.com/hsa-for-math-science
- Thibaut Lutz: http://github.com/t-lutz/ParallelSTL
- NVIDIA: http://github.com/n3554/n3554
- Codeplay: http://github.com/KhronosGroup/SyclParallelSTL


What is the Parallelism TS adding?

A set of execution policies and a collection of parallel algorithms

- The Execution Policies
- Paragraphs explaining the conditions for parallel algorithms
- New parallel algorithms
- The exception_list class


Sorting with the STL

A sequential sort

    std::vector<int> data = { 8, 9, 1, 4 };
    std::sort(std::begin(data), std::end(data));
    if (std::is_sorted(std::begin(data), std::end(data))) {
      std::cout << "Data is sorted!" << std::endl;
    }


A parallel sort

    std::vector<int> data = { 8, 9, 1, 4 };
    std::sort(std::execution::par, std::begin(data), std::end(data));
    if (std::is_sorted(std::begin(data), std::end(data))) {
      std::cout << "Data is sorted!" << std::endl;
    }

- par is an Execution Policy object
- The sort will be executed in parallel using an implementation-defined method


The Execution Policy

Standard policy classes
Defined in the execution namespace

- class sequenced_policy:
  → Never executes in parallel
- class parallel_policy:
  → Can use the caller thread, but may spawn others (e.g., std::thread)
  → Invocations do not interleave on a single thread
- class parallel_unsequenced_policy:
  → Can use the caller thread or others (e.g., std::thread)
  → Multiple invocations may be interleaved on a single thread

Global objects

    constexpr sequenced_policy sequenced;  // named seq in the published C++17 standard
    constexpr parallel_policy par;
    constexpr parallel_unsequenced_policy par_unseq;


The Execution Policy

Choosing different parallel implementations

    // May execute in parallel
    std::sort(std::execution::par, std::begin(data), std::end(data));
    // May be parallelized and vectorized
    std::sort(std::execution::par_unseq, std::begin(data), std::end(data));
    // Will not be parallelized
    std::sort(std::execution::sequenced, std::begin(data), std::end(data));

Propagating the policy to the end user

    template <typename Policy, typename Iterator>
    void library_function(Policy p, Iterator begin, Iterator end) {
      std::sort(p, begin, end);
      std::for_each(p, begin, end,
                    [&](typename std::iterator_traits<Iterator>::value_type& e) { e++; });
      std::for_each(std::execution::sequenced, begin, end, non_parallel_operation);
    }

Implementations can define their own Execution Policies
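
As a rough illustration of how a library can accept implementation-defined policies, the sketch below dispatches on a policy tag through ordinary overloading. The names plain_policy, chunked_policy, policy_sort and sorted_copy are hypothetical and belong to no standard or SYCL library; the "chunked" overload only stands in for a real parallel implementation and runs entirely on the calling thread.

```cpp
#include <algorithm>
#include <iterator>
#include <vector>

// Hypothetical policy tags, used only to select an overload.
struct plain_policy {};
struct chunked_policy {};

// Selected for plain_policy: a single std::sort over the whole range.
template <typename Iterator>
void policy_sort(plain_policy, Iterator first, Iterator last) {
  std::sort(first, last);
}

// Selected for chunked_policy: sort two halves independently, then merge.
// A real implementation could hand each half to a different thread or device.
template <typename Iterator>
void policy_sort(chunked_policy, Iterator first, Iterator last) {
  Iterator middle = std::next(first, std::distance(first, last) / 2);
  std::sort(first, middle);
  std::sort(middle, last);
  std::inplace_merge(first, middle, last);
}

// Library code forwards whatever policy the caller supplied.
template <typename Policy, typename Iterator>
void library_function(Policy p, Iterator first, Iterator last) {
  policy_sort(p, first, last);
}

// Convenience wrapper: sorts a copy of the input under the given policy.
template <typename Policy>
std::vector<int> sorted_copy(Policy p, std::vector<int> data) {
  library_function(p, data.begin(), data.end());
  return data;
}
```

Calling sorted_copy(chunked_policy{}, {8, 9, 1, 4}) and sorted_copy(plain_policy{}, {8, 9, 1, 4}) should produce the same sorted vector; only the strategy behind the call differs.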


Dealing with exceptions

- Different execution threads may abort with different exceptions
- Parallel STL algorithms may throw an exception_list
- Note that an uncaught exception under parallel_unsequenced_policy causes a call to std::terminate.

    class exception_list : public exception {
    public:
      typedef unspecified iterator;
      size_t size() const noexcept;
      iterator begin() const noexcept;
      iterator end() const noexcept;
      virtual const char* what() const noexcept;
    };
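
To make the idea of collecting several failures concrete, here is a minimal sketch of an aggregate exception type. simple_exception_list, run_all and count_failures are hypothetical names; this is not the TS exception_list, and the tasks run one after another on the calling thread rather than in parallel.

```cpp
#include <cstddef>
#include <exception>
#include <functional>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for an exception list: it stores exception_ptr values.
class simple_exception_list : public std::exception {
 public:
  using iterator = std::vector<std::exception_ptr>::const_iterator;

  void add(std::exception_ptr e) { errors_.push_back(e); }
  std::size_t size() const noexcept { return errors_.size(); }
  iterator begin() const noexcept { return errors_.begin(); }
  iterator end() const noexcept { return errors_.end(); }
  const char* what() const noexcept override { return "simple_exception_list"; }

 private:
  std::vector<std::exception_ptr> errors_;
};

// Runs every task, remembers each failure, and reports them together at the end.
inline void run_all(const std::vector<std::function<void()>>& tasks) {
  simple_exception_list errors;
  for (const auto& task : tasks) {
    try {
      task();
    } catch (...) {
      errors.add(std::current_exception());
    }
  }
  if (errors.size() != 0) throw errors;
}

// Returns how many tasks failed, by catching the aggregate exception.
inline std::size_t count_failures(const std::vector<std::function<void()>>& tasks) {
  try {
    run_all(tasks);
  } catch (const simple_exception_list& list) {
    return list.size();
  }
  return 0;
}
```

A caller that needs the individual errors can iterate from begin() to end() and pass each entry to std::rethrow_exception.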


Parallel Algorithms

- Overloads of STL algorithms taking an Execution Policy
- Not all STL algorithms are suitable for parallel execution!


Introducing new algorithms to the STL

For Each

    template <class ExecutionPolicy,
              class InputIterator, class Function>
    void for_each(ExecutionPolicy&& exec,
                  InputIterator first, InputIterator last,
                  Function f);

    template <class ExecutionPolicy,
              class InputIterator, class Size, class Function>
    InputIterator for_each_n(ExecutionPolicy&& exec,
                             InputIterator first, Size n,
                             Function f);

    template <class InputIterator, class Size, class Function>
    InputIterator for_each_n(InputIterator first, Size n,
                             Function f);

- for_each: Applies f to elements in [first, last).
- for_each_n: Applies f to elements in [first, first + n).
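
A small usage sketch of the policy-free for_each_n overload as it appears in C++17. The helper name double_first_n is made up for the example, and the caller is assumed to pass n no larger than the vector's size.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Doubles the first n elements and leaves the remaining elements untouched.
// Precondition (assumed, not checked): n <= values.size().
inline std::vector<int> double_first_n(std::vector<int> values, std::size_t n) {
  std::for_each_n(values.begin(), n, [](int& x) { x *= 2; });
  return values;
}
```

For instance, double_first_n({1, 2, 3, 4, 5}, 3) touches only the range [first, first + 3).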


Introducing new algorithms to the STL

Numerical parallel algorithms

    template <class InputIterator>
    typename iterator_traits<InputIterator>::value_type
    reduce(InputIterator first, InputIterator last);

    template <class InputIterator, class T>
    T reduce(InputIterator first, InputIterator last, T init);

    template <class InputIterator, class T, class BinaryOperation>
    T reduce(InputIterator first, InputIterator last, T init,
             BinaryOperation binary_op);

- As opposed to accumulate, binary_op is applied in an unspecified order.
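
The difference from accumulate can be sketched with the C++17 overloads that take no execution policy. The wrapper names are invented for the example. Integer addition and multiplication are associative and commutative, so reordering cannot change these results; an operation such as subtraction would not be safe to pass to reduce.

```cpp
#include <functional>
#include <numeric>
#include <vector>

// Strict left-to-right fold.
inline int sum_in_order(const std::vector<int>& v) {
  return std::accumulate(v.begin(), v.end(), 0);
}

// The library may group and reorder the additions.
inline int sum_any_order(const std::vector<int>& v) {
  return std::reduce(v.begin(), v.end(), 0);
}

// Same idea with a user-supplied binary operation.
inline int product_any_order(const std::vector<int>& v) {
  return std::reduce(v.begin(), v.end(), 1, std::multiplies<>());
}
```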


Introducing new algorithms to the STL

Other algorithms introduced
- Exclusive/Inclusive Scan (Prefix Sum)
- Transform Reduce
- Transform Exclusive/Inclusive Scan

Basic building blocks for constructing other algorithms and applications!
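
The following sketch exercises three of these building blocks through their C++17 overloads without an execution policy. The wrapper names are made up for the example, and dot_product assumes both vectors have the same length.

```cpp
#include <numeric>
#include <vector>

// out[i] = in[0] + ... + in[i]
inline std::vector<int> inclusive_prefix_sum(const std::vector<int>& in) {
  std::vector<int> out(in.size());
  std::inclusive_scan(in.begin(), in.end(), out.begin());
  return out;
}

// out[i] = in[0] + ... + in[i - 1], starting from the initial value 0
inline std::vector<int> exclusive_prefix_sum(const std::vector<int>& in) {
  std::vector<int> out(in.size());
  std::exclusive_scan(in.begin(), in.end(), out.begin(), 0);
  return out;
}

// Multiplies element pairs (transform) and sums the products (reduce).
inline int dot_product(const std::vector<int>& a, const std::vector<int>& b) {
  return std::transform_reduce(a.begin(), a.end(), b.begin(), 0);
}
```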


The SYCL Parallel STL

Spec & Examples
- Enable C++17 Parallel STL to run on any SYCL-supported device
  → Any OpenCL platform with SPIR support
- Improves productivity of C++ developers worried about performance
- Integrates nicely with existing SYCL codebases
- Completely Open Source
  → https://github.com/KhronosGroup/SyclParallelSTL

SYCL Parallel STL introduces two execution policies


Sorting with the STL

Sorting on the GPU!

    std::vector<int> data = { 8, 9, 1, 4 };
    std::sort(sycl_policy, std::begin(data), std::end(data));
    if (std::is_sorted(std::begin(data), std::end(data))) {
      std::cout << "Data is sorted!" << std::endl;
    }

- sycl_policy is an Execution Policy
- data is a standard std::vector
- Technically, the sort will use the device returned by default_selector


The SYCL Policy

    template <typename KernelName = DefaultKernelName>
    class sycl_execution_policy {
    public:
      using kernelName = KernelName;

      sycl_execution_policy() = default;
      sycl_execution_policy(cl::sycl::queue q);
      cl::sycl::queue get_queue() const;
    };

- Indicates that the algorithm will be executed on a SYCL device
- Can optionally take a queue
  → Re-use device selection
  → Asynchronous data copy-back
  → ...


Why the KernelName template?

How are algorithms implemented?

    auto f = [vectorSize, &bufI, &bufO, op](cl::sycl::handler& h) mutable {
      ...
      auto aI = bufI.template get_access<access::mode::read>(h);
      auto aO = bufO.template get_access<access::mode::write>(h);
      h.parallel_for</* The Kernel Name */>(r,
          [aI, aO, op](cl::sycl::id<1> id) {
            aO[id.get(0)] = UserFunctor(aI[id.get(0)]);
          });
    };

Two separate calls can generate different kernels!

    transform(par, v.begin(), v.end(), v.begin(), [=](int val) { return val + 1; });
    transform(par, v.begin(), v.end(), v.begin(), [=](int val) { return val - 1; });
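
The naming problem can be sketched in plain C++ without any SYCL headers. Two lambdas with identical signatures still have distinct closure types, and those types have no name the programmer can spell, which is why a user-chosen tag is passed alongside the lambda. Everything below (run_named, IncrementKernel, DecrementKernel, the helper functions) is hypothetical and only mimics the shape of a named kernel launch on the host.

```cpp
#include <type_traits>

// Reports whether two lambdas with the same signature share a type (they do not).
inline bool lambdas_share_a_type() {
  auto increment = [](int v) { return v + 1; };
  auto decrement = [](int v) { return v - 1; };
  return std::is_same<decltype(increment), decltype(decrement)>::value;
}

// Forward declarations are enough: the tags are only used as names.
class IncrementKernel;
class DecrementKernel;

// Hypothetical launcher: Name identifies the kernel, Function is its body.
template <typename Name, typename Function>
int run_named(Function f, int value) {
  return f(value);
}

// Two launches with different bodies, each under its own name.
inline int run_both(int value) {
  int up = run_named<IncrementKernel>([](int v) { return v + 1; }, value);
  return run_named<DecrementKernel>([](int v) { return v - 1; }, up);
}
```

In the policy shown on the slides, the KernelName template parameter plays the role of Name here, so each call that passes a different lambda can be given a different kernel name.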


Using named policies and queues

    using namespace cl::sycl;
    using namespace experimental::parallel::sycl;

    std::vector<int> v = ...;
    // Transform
    default_selector ds;
    {
      queue q(ds);
      sort(sycl_execution_policy<>(q), std::begin(v), std::end(v));
      sycl_execution_policy<class myName> sepn1(q);
      transform(sepn1, std::begin(v), std::end(v),
                std::begin(v), [=](int i) { return i + 1; });
    }

- A kernel name is only required for lambdas, not functors
- Device selection and queue are re-used
- Data is copied in/out in each call!


Avoiding data copies using buffers

    using namespace cl::sycl;
    using namespace experimental::parallel::sycl;

    std::vector<int> v = ...;
    default_selector h;
    {
      buffer<int> b(std::begin(v), std::end(v));
      b.set_final_data(v.data());
      {
        queue q(h);
        sort(sycl_execution_policy<>(q), begin(b), end(b));
        sycl_execution_policy<class transform1> sepn1(q);
        transform(sepn1, begin(b), end(b), begin(b),
                  [](int num) { return num + 1; });
      }
    }

- Buffer is constructed from STL containers
- Data will be copied back to the container when the buffer is destroyed
  → Note the additional copy from the vector to the buffer and vice versa


Using device-only data

    using namespace experimental::parallel::sycl;
    default_selector h;
    {
      buffer<int, 1> b(range<1>(size));
      b.set_final_data(v.data());
      {
        cl::sycl::queue q(h);
        {
          auto hostAcc = b.get_access<access::mode::read_write,
                                      access::target::host_buffer>();
          for (size_t i = 0; i < size; i++) {
            hostAcc[i] = read_data_from_file(...);
          }
        }
        sort(sycl_execution_policy<>(q), begin(b), end(b));
        transform(sycl_policy, begin(b), end(b), begin(b),
                  std::negate<int>());
      }
    }

- Data is initialized on the host using a host accessor
- Once the host accessor is released, the data can be used on the device


Heterogeneous Execution Policy

Distribute workload in Parallel STL algorithms
- Execution on two devices at the same time
- Designed for integrated CPU / GPU platforms
- User sets the percentage of work assigned to GPU/CPU
- Policy distributes workload accordingly

HiPEAC internship
- Research work funded via collaboration grant with HiPEAC
- Ph.D. student from University of Malaga (A. Vilches)


Heterogeneous Execution Example

    cl::sycl::queue q;
    amd_cpu_selector cpu_sel;
    cl::sycl::queue q2(cpu_sel);
    sycl::sycl_heterogeneous_execution_policy<class TransformAlgorithm1> snp(
        q, q2, ratio);

    auto mytransform = [&]() {
      float pi = 3.14f;
      std::experimental::parallel::transform(
          snp, std::begin(v1), std::end(v1), std::begin(v2), std::begin(res),
          [=](float a, float b) { return pi * a + b; });
    };


Manual distribution of the work


Heterogeneous Execution Policy in Use


Heterogeneous Execution Trivial Implementation

    ...
    {
      auto buf1_q1 =
          sycl::helpers::make_const_buffer(first1, first1 + crosspoint);
      auto buf2_q1 =
          sycl::helpers::make_const_buffer(first2, first2 + crosspoint);
      auto res_q1 = sycl::helpers::make_buffer(result, result + crosspoint);
      auto buf1_q2 =
          sycl::helpers::make_const_buffer(first1 + crosspoint, last1);
      auto buf2_q2 = sycl::helpers::make_const_buffer(first2 + crosspoint,
                                                      first2 + elements);
      auto res_q2 =
          sycl::helpers::make_buffer(result + crosspoint, result + elements);
      impl::transform(named_sep, q1, buf1_q1, buf2_q1, res_q1, binary_op);
      impl::transform(named_sep, q2, buf1_q2, buf2_q2, res_q2, binary_op);
    }
    ...
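
The splitting step can be sketched in standard C++ with no SYCL dependency. The functions crosspoint_for and split_transform are hypothetical, the way the crosspoint is derived from the ratio is an assumption (the slides do not show that computation), and both halves run on the host here, where the policy would submit each half to its own queue.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Number of leading elements given to the first worker, for a ratio clamped to [0, 1].
inline std::size_t crosspoint_for(std::size_t elements, float ratio) {
  const float clamped = std::clamp(ratio, 0.0f, 1.0f);
  return static_cast<std::size_t>(static_cast<float>(elements) * clamped);
}

// Applies binary_op to [0, crosspoint) and [crosspoint, size) as two separate calls.
// Assumes b has at least as many elements as a.
template <typename BinaryOp>
std::vector<float> split_transform(const std::vector<float>& a,
                                   const std::vector<float>& b,
                                   float ratio, BinaryOp binary_op) {
  std::vector<float> result(a.size());
  const auto crosspoint =
      static_cast<std::ptrdiff_t>(crosspoint_for(a.size(), ratio));
  // First part: would be submitted to the first queue.
  std::transform(a.begin(), a.begin() + crosspoint, b.begin(),
                 result.begin(), binary_op);
  // Second part: would be submitted to the second queue.
  std::transform(a.begin() + crosspoint, a.end(), b.begin() + crosspoint,
                 result.begin() + crosspoint, binary_op);
  return result;
}
```

Because the two sub-ranges do not overlap, the two calls are independent of each other, which is what allows them to be issued to different devices.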


Heterogeneous load balancing

Dynamic decision of heterogeneous balancing
- The offloading percentage is a runtime value
- Developers can create runtime evaluation functions
  → Depending on workload
  → Depending on platform
  → Depending on user input
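
A runtime evaluation function of this kind might look like the sketch below. The platform_info struct, the choose_gpu_ratio function, the 1024-element threshold and the 0.75 share are all invented for illustration; they are not measured values and not part of the SYCL Parallel STL. Treating the ratio as the GPU's share is also an assumption.

```cpp
#include <cstddef>

// Hypothetical platform description; a real one would be queried at runtime.
struct platform_info {
  bool has_gpu;
};

// Returns the fraction of elements to offload to the GPU.
inline float choose_gpu_ratio(std::size_t elements, const platform_info& platform) {
  if (!platform.has_gpu) return 0.0f;  // no GPU: keep all work on the CPU
  if (elements < 1024) return 0.0f;    // small input: offloading is assumed not to pay off
  return 0.75f;                        // large input: send most of the work to the GPU
}
```

The returned value could then be passed as the ratio argument when the heterogeneous policy is constructed.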


Conclusions

Parallel STL
- Enables developers to quickly exploit parallel architectures
- SYCL makes implementing these algorithms for heterogeneous platforms trivial
  → Just write single-source C++
  → Will work on any OpenCL + SPIR platform!

Heterogeneous Execution
- Plenty of heterogeneous platforms out there
- Complex to work with them!
- SYCL allows developers to focus on algorithms and distribute work
- Heterogeneous Policy can be optimized and customized per platform
  → Runtime can get platform information and use custom balancing
  → Users can extend the policy with specific balancing decisions


@codeplaysoft