Parallelism in the Standard C++: What to Expect in C++ 17 Artur Laksberg [email protected] Visual C++ Team, Microsoft September 17, 2014


Page 1: Parallelism in the Standard C++: What to Expect in C++ 17

Parallelism in the Standard C++: What to Expect in C++ 17

Artur Laksberg

[email protected]

Visual C++ Team, Microsoft

September 17, 2014

Page 2: Parallelism in the Standard C++: What to Expect in C++ 17

Agenda

Parallel Fundamentals
    Task regions

Parallel Algorithms
    Parallelization
    Vectorization

Page 3: Parallelism in the Standard C++: What to Expect in C++ 17

Part 1: The Fundamentals

Page 4: Parallelism in the Standard C++: What to Expect in C++ 17

OpenMP, TBB, PPL, MPI, OpenCL, OpenACC, CUDA, C++ AMP, Renderscript, Cilk Plus, GCD

Page 5: Parallelism in the Standard C++: What to Expect in C++ 17

Parallelism in C++11/14

Fundamentals: memory model, atomics

Basics: thread, mutex, condition_variable, async, future

Page 6: Parallelism in the Standard C++: What to Expect in C++ 17

Quicksort: Serial

void quicksort(int *v, int start, int end) {
    if (start < end) {
        int pivot = partition(v, start, end);
        quicksort(v, start, pivot - 1);
        quicksort(v, pivot + 1, end);
    }
}
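The slide calls a partition function without showing it. A minimal runnable sketch follows, assuming a Lomuto-style partition over the inclusive range [start, end]; the body of partition is an assumption made for illustration, not something taken from the talk.

```cpp
#include <utility>

// Assumed helper (not shown on the slide): Lomuto partition over the
// inclusive range [start, end]. Uses v[end] as the pivot value and
// returns the index where the pivot finally lands.
int partition(int* v, int start, int end) {
    int pivot_value = v[end];
    int i = start;
    for (int j = start; j < end; ++j) {
        if (v[j] < pivot_value) {
            std::swap(v[i], v[j]);
            ++i;
        }
    }
    std::swap(v[i], v[end]);
    return i;
}

// Serial quicksort, as on the slide; end is inclusive.
void quicksort(int* v, int start, int end) {
    if (start < end) {
        int pivot = partition(v, start, end);
        quicksort(v, start, pivot - 1);
        quicksort(v, pivot + 1, end);
    }
}
```

For an array of n elements the call would be quicksort(v, 0, n - 1).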

Page 7: Parallelism in the Standard C++: What to Expect in C++ 17

Quicksort: Use Threads

void quicksort(int *v, int start, int end) {
    if (start < end) {
        int pivot = partition(v, start, end);
        std::thread t1([&] { quicksort(v, start, pivot - 1); });
        std::thread t2([&] { quicksort(v, pivot + 1, end); });
        t1.join();
        t2.join();
    }
}

Problem 1: expensive

Problem 2: Fork-join not enforced

Problem 3: Exceptions??

Page 8: Parallelism in the Standard C++: What to Expect in C++ 17

Andrzej Krzemieński: “Do not use naked threads in the program: use RAII-like wrappers instead”
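One possible reading of that advice, as a sketch: wrap std::thread in a small class whose destructor joins. The names joining_thread and sum_halves are hypothetical and used only for illustration; this is not a standard class and not code from the talk.

```cpp
#include <thread>
#include <utility>

// Hypothetical RAII wrapper: owns a std::thread and joins it on destruction,
// so leaving the scope (normally or via an exception) cannot skip the join.
class joining_thread {
public:
    explicit joining_thread(std::thread t) : t_(std::move(t)) {}
    joining_thread(const joining_thread&) = delete;
    joining_thread& operator=(const joining_thread&) = delete;
    ~joining_thread() {
        if (t_.joinable()) t_.join();
    }
private:
    std::thread t_;
};

// Illustrative use: sum the two halves of an array on two threads.
int sum_halves(const int* v, int n) {
    int left = 0;
    int right = 0;
    {
        joining_thread t1(std::thread([&] {
            for (int i = 0; i < n / 2; ++i) left += v[i];
        }));
        joining_thread t2(std::thread([&] {
            for (int i = n / 2; i < n; ++i) right += v[i];
        }));
    } // both destructors run here, so both threads are joined
    return left + right;
}
```

This addresses the "fork-join not enforced" problem from the previous slide, but not the cost of creating a thread per task, and an exception thrown inside a thread still terminates the program.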

Page 9: Parallelism in the Standard C++: What to Expect in C++ 17

Quicksort: Fork-Join Parallelism

void quicksort(int *v, int start, int end) {
    if (start < end) {
        int pivot = partition(v, start, end);
        quicksort(v, start, pivot - 1);
        quicksort(v, pivot + 1, end);
    }
}

[Diagram labels on the code: parallel region, task, task]

Page 10: Parallelism in the Standard C++: What to Expect in C++ 17

Quicksort: Using Task Regions (N3832)

void quicksort(int *v, int start, int end) {
    if (start < end) {
        task_region([&] (auto& r) {
            int pivot = partition(v, start, end);
            r.run([&] { quicksort(v, start, pivot - 1); });
            r.run([&] { quicksort(v, pivot + 1, end); });
        });
    }
}

[Diagram labels on the code: task, task, parallel region]
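task_region is a proposal and may not be available in a given toolchain. As a rough stand-in, the same fork-join shape can be sketched with std::async from C++11: fork one half, run the other half on the current thread, then join. The names lomuto_partition and parallel_quicksort and the depth cutoff are assumptions for illustration; std::async creates threads rather than lightweight tasks, so this does not have the scheduling behavior of a real task region.

```cpp
#include <future>
#include <utility>

// Assumed helper: Lomuto partition over the inclusive range [start, end].
int lomuto_partition(int* v, int start, int end) {
    int pivot_value = v[end];
    int i = start;
    for (int j = start; j < end; ++j) {
        if (v[j] < pivot_value) {
            std::swap(v[i], v[j]);
            ++i;
        }
    }
    std::swap(v[i], v[end]);
    return i;
}

// Fork-join quicksort. 'depth' limits how many levels fork a new thread;
// below that, the recursion is purely serial.
void parallel_quicksort(int* v, int start, int end, int depth = 3) {
    if (start >= end) return;
    int pivot = lomuto_partition(v, start, end);
    if (depth <= 0) {
        parallel_quicksort(v, start, pivot - 1, 0);
        parallel_quicksort(v, pivot + 1, end, 0);
        return;
    }
    // Fork: the left half runs on another thread.
    std::future<void> left = std::async(std::launch::async, [=] {
        parallel_quicksort(v, start, pivot - 1, depth - 1);
    });
    // The right half runs on the current thread.
    parallel_quicksort(v, pivot + 1, end, depth - 1);
    // Join: get() waits for the forked half and rethrows its exception, if any.
    left.get();
}
```

Because a future returned by std::async blocks in its destructor, the join cannot be skipped by accident, which is the property the task region enforces by construction.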

Page 11: Parallelism in the Standard C++: What to Expect in C++ 17

Under The Hood…

Page 12: Parallelism in the Standard C++: What to Expect in C++ 17

Work Stealing Scheduling

[Figure: four processors, proc 1 to proc 4]

Page 13: Parallelism in the Standard C++: What to Expect in C++ 17

Work Stealing Scheduling

[Figure: processors proc 1 to proc 4, with a work queue whose ends are labeled "Old items" and "New items"]

Page 16: Parallelism in the Standard C++: What to Expect in C++ 17

Work Stealing Scheduling

[Figure: processors proc 1 to proc 4, with a work queue whose ends are labeled "Old items" and "New items"; another processor acts as the “Thief”]
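The queue in the figures can be sketched as a double-ended queue: the owning processor pushes and pops at the "new" end, and a thief takes from the "old" end. That end assignment is the usual work-stealing convention and is assumed here; the slides show only the labels. The class name work_queue is hypothetical, tasks are reduced to int identifiers, and a mutex replaces the lock-free deque that a real scheduler would typically use.

```cpp
#include <deque>
#include <mutex>

// Hypothetical per-processor work queue (mutex-based sketch).
class work_queue {
public:
    // Owner: add new work at the "new" end.
    void push(int task) {
        std::lock_guard<std::mutex> lock(m_);
        items_.push_back(task);
    }

    // Owner: take the newest item. Returns false if the queue is empty.
    bool pop(int& task) {
        std::lock_guard<std::mutex> lock(m_);
        if (items_.empty()) return false;
        task = items_.back();
        items_.pop_back();
        return true;
    }

    // Thief: take the oldest item from the opposite end.
    bool steal(int& task) {
        std::lock_guard<std::mutex> lock(m_);
        if (items_.empty()) return false;
        task = items_.front();
        items_.pop_front();
        return true;
    }

private:
    std::mutex m_;
    std::deque<int> items_;
};
```

In a fork-join workload the oldest item tends to be the largest remaining piece of work, so one successful steal keeps the thief busy for longer.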

Page 17: Parallelism in the Standard C++: What to Expect in C++ 17

Fork-Join Parallelism and Work Stealing

e();
task_region([] (auto& r) {
    r.run(f);
    g();
});
h();

[Diagram: e() forks into f() and g(), which join before h()]

Q1: What thread runs f?

Q2: What thread runs g?

Q3: What thread runs h?

Page 18: Parallelism in the Standard C++: What to Expect in C++ 17

Work Stealing Design Choices

What Thread Executes After a Spawn?
    Child Stealing
    Continuation (parent) Stealing

What Thread Executes After a Join?
    Stalling: initiating thread waits
    Greedy: the last thread to reach join continues

task_region([] (auto& r) {
    for (int i = 0; i < n; ++i)
        r.run(f);
});

Page 19: Parallelism in the Standard C++: What to Expect in C++ 17

Part 2: The Algorithms

Page 20: Parallelism in the Standard C++: What to Expect in C++ 17

Alex Stepanov: Start With The Algorithms

Page 21: Parallelism in the Standard C++: What to Expect in C++ 17

Inspiration

Performing Parallel Operations On Containers

Intel Threading Building Blocks

Microsoft Parallel Patterns Library, C++ AMP

Nvidia Thrust

Page 22: Parallelism in the Standard C++: What to Expect in C++ 17

Parallel STL

Just like STL, only parallel…

Can be faster
    If you know what you’re doing

Two Execution Policies: std::par, std::par_vec

Page 23: Parallelism in the Standard C++: What to Expect in C++ 17

Parallelization: What’s the Big Deal?

Why not already parallel?

std::sort(begin, end, [](int a, int b) { return a < b; });

User-provided closures must be thread safe:

int comparisons = 0;
std::sort(begin, end, [&](int a, int b) { comparisons++; return a < b; });

But also special-member functions, std::swap etc.
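One way to make the counting comparator safe to call from several threads is to count with std::atomic. The sketch below does not use a parallel algorithm; it sorts the two halves of a vector on two threads that share one comparator, which produces the same kind of concurrent calls. The function name sort_halves_counting is hypothetical.

```cpp
#include <algorithm>
#include <atomic>
#include <thread>
#include <vector>

// Sorts each half of v on its own thread using a shared counting comparator
// and returns the total number of comparisons made.
int sort_halves_counting(std::vector<int>& v) {
    std::atomic<int> comparisons{0};
    auto less_counted = [&](int a, int b) {
        // An atomic increment; a plain int++ here would be a data race.
        comparisons.fetch_add(1, std::memory_order_relaxed);
        return a < b;
    };
    auto mid = v.begin() + v.size() / 2;
    std::thread t([&] { std::sort(v.begin(), mid, less_counted); });
    std::sort(mid, v.end(), less_counted);
    t.join();
    return comparisons.load();
}
```

The atomic removes the race but every comparison now touches shared state, which is one reason a parallel sort with such a comparator may not be faster.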

Page 24: Parallelism in the Standard C++: What to Expect in C++ 17

It’s a Contract

What the user can do

What the implementer can do

Asymptotic Guarantees: std::sort: O(n*log(n)), std::stable_sort: O(n*log^2(n)), what about parallel sort?

What is a valid implementation? (see next slide)

Page 25: Parallelism in the Standard C++: What to Expect in C++ 17

Chaos Sort

template<typename Iterator, typename Compare>
void chaos_sort( Iterator first, Iterator last, Compare comp ) {
    auto n = last-first;
    std::vector<char> c(n);
    for(;;) {
        bool flag = false;
        for( size_t i=1; i<n; ++i ) {
            c[i] = comp(first[i],first[i-1]);
            flag |= c[i];
        }
        if( !flag ) break;
        for( size_t i=1; i<n; ++i )
            if( c[i] )
                std::swap( first[i-1], first[i] );
    }
}

Page 26: Parallelism in the Standard C++: What to Expect in C++ 17

Execution Policies

Built-in Execution Policies:

extern const sequential_execution_policy seq;
extern const parallel_execution_policy par;
extern const parallel_vector_execution_policy par_vec;

Dynamic Execution Policy:

class execution_policy
{
public:
    // ...
    const type_info& target_type() const;
    template<class T> T *target();
    template<class T> const T *target() const;
};

Page 27: Parallelism in the Standard C++: What to Expect in C++ 17

Using Execution Policy To Write Parallel Code

std::vector<int> vec = ...

// standard sequential sort
std::sort(vec.begin(), vec.end());

using namespace std::experimental::parallel;

// explicitly sequential sort
sort(seq, vec.begin(), vec.end());

// permitting parallel execution
sort(par, vec.begin(), vec.end());

// permitting vectorization as well
sort(par_vec, vec.begin(), vec.end());

Page 28: Parallelism in the Standard C++: What to Expect in C++ 17

Picking Execution Policy Dynamically

size_t threshold = ...

execution_policy exec = seq;

if(vec.size() > threshold)
{
    exec = par;
}

sort(exec, vec.begin(), vec.end());

Page 29: Parallelism in the Standard C++: What to Expect in C++ 17

Exception Handling

In C++ philosophy, no exception is silently ignored

Exception list: container of exception_ptr objects

try
{
    r = std::inner_product(std::par, a.begin(), a.end(), b.begin(), func1, func2, 0);
}
catch(const exception_list& list)
{
    for(auto& exptr : list)
    {
        // process exception pointer exptr
    }
}
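To show how a container of exception_ptr objects can come about, here is a sketch that runs several tasks on separate threads and collects whatever they throw. The names exception_ptr_list, run_all and count_runtime_errors are hypothetical; a plain std::vector stands in for the exception_list type of the proposal, and the list is returned rather than thrown.

```cpp
#include <exception>
#include <functional>
#include <mutex>
#include <stdexcept>
#include <thread>
#include <vector>

// Stand-in for exception_list: a plain vector of std::exception_ptr.
using exception_ptr_list = std::vector<std::exception_ptr>;

// Runs every task on its own thread; no exception is silently dropped.
exception_ptr_list run_all(const std::vector<std::function<void()>>& tasks) {
    exception_ptr_list errors;
    std::mutex m;
    std::vector<std::thread> threads;
    for (const auto& task : tasks) {
        threads.emplace_back([&errors, &m, &task] {
            try {
                task();
            } catch (...) {
                // Capture the in-flight exception so it can cross threads.
                std::lock_guard<std::mutex> lock(m);
                errors.push_back(std::current_exception());
            }
        });
    }
    for (auto& t : threads) t.join();
    return errors;
}

// Processing the pointers: rethrow each one to recover its dynamic type.
int count_runtime_errors(const exception_ptr_list& errors) {
    int n = 0;
    for (const auto& e : errors) {
        try {
            std::rethrow_exception(e);
        } catch (const std::runtime_error&) {
            ++n;
        } catch (...) {
            // some other exception type; not counted
        }
    }
    return n;
}
```

The order of the collected pointers depends on thread timing, so code that processes the list should not rely on it.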

Page 30: Parallelism in the Standard C++: What to Expect in C++ 17

Vectorization: What’s the Big Deal?

int a[n] = ...;
int b[n] = ...;
for(int i=0; i<n; ++i)
{
    a[i] = b[i] + c;
}

movdqu xmm1, XMMWORD PTR _b$[esp+eax+132]
movdqu xmm0, XMMWORD PTR _a$[esp+eax+132]
paddd  xmm1, xmm2
paddd  xmm1, xmm0
movdqu XMMWORD PTR _a$[esp+eax+132], xmm1

a[i:i+3] = b[i:i+3] + c;

movdqu: Move Unaligned Double Quadword

Page 31: Parallelism in the Standard C++: What to Expect in C++ 17

Vector Lane is not a Thread!

Taking locks
    Thread with thread_id x takes a lock…
    Then another “thread” with the same thread_id enters the lock…
    Deadlock!!!

Exceptions
    Can we unwind 1/4th of the stack?

Page 32: Parallelism in the Standard C++: What to Expect in C++ 17

Vectorization: Not So Easy Any More…

void f(int* a, int* b)
{
    for(int i=0; i<n; ++i)
    {
        a[i] = b[i] + c;
        func();
    }
}

mov  ecx, DWORD PTR _b$[esp+esi+140]
add  ecx, edi
add  DWORD PTR _a$[esp+esi+140], ecx
call func

Aliasing?

Side effects?

Dependence?

Exceptions?

Page 33: Parallelism in the Standard C++: What to Expect in C++ 17

How Do We Get This?

void f(float* a, float* b)
{
    for(int i=0; i<n; ++i)
    {
        a[i] = b[i] + c;
        func();
    }
}

for(int i=0; i<n; i+=4)
{
    a[i:i+3] = b[i:i+3] + c;
    for(int j=0; j<4; ++j)
        func();
}

Need a helping hand from the programmer!
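Why the compiler cannot do this on its own can be seen by letting func observe the array. In the sketch below, written in plain C++ with an inner loop standing in for the 4-wide vector operation, func records how many elements have been written so far. Both loops produce the same array, but func sees different states. The names scalar_loop and strip_mined_loop are hypothetical, and n is assumed to be a multiple of 4.

```cpp
#include <vector>

// Original loop: func() runs after every single element is written.
std::vector<int> scalar_loop(float* a, const float* b, float c, int n) {
    std::vector<int> seen;   // what each func() call observed
    int written = 0;
    auto func = [&] { seen.push_back(written); };
    for (int i = 0; i < n; ++i) {
        a[i] = b[i] + c;
        ++written;
        func();
    }
    return seen;
}

// Transformed loop: four elements are written, then func() runs four times.
// Assumes n is a multiple of 4.
std::vector<int> strip_mined_loop(float* a, const float* b, float c, int n) {
    std::vector<int> seen;
    int written = 0;
    auto func = [&] { seen.push_back(written); };
    for (int i = 0; i < n; i += 4) {
        for (int j = 0; j < 4; ++j) {   // stands in for a[i:i+3] = b[i:i+3] + c
            a[i + j] = b[i + j] + c;
            ++written;
        }
        for (int j = 0; j < 4; ++j)
            func();
    }
    return seen;
}
```

With n = 8 the first version records 1 through 8, while the second records 4 four times and then 8 four times. If func may have such side effects, only the programmer can say that the reordering is acceptable.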

Page 34: Parallelism in the Standard C++: What to Expect in C++ 17

Vectorization Hazard: Locks

for(int i=0; i<n; ++i)
{
    lock.enter();
    a[i] = b[i] + c;
    lock.release();
}

for(int i=0; i<n; i+=4)
{
    for(int j=0; j<4; ++j)
        lock.enter();
    a[i:i+3] = b[i:i+3] + c;
    for(int j=0; j<4; ++j)
        lock.release();
}

This transformation is not safe!

Consider: f takes a lock, g releases the lock:

?

Page 35: Parallelism in the Standard C++: What to Expect in C++ 17

But Wait, There Is One Little Problem…

Index-based algorithm:

void f(float* a, float* b)
{
    for(int i=0; i<n; ++i)
    {
        // OK:
        a[i] = b[i] + c;
        func();
    }
}

Element-based algorithm:

void f(float* a, float* b)
{
    std::for_each(a, b, [&](float f)
    {
        // Oops, no ‘i’:
        a[i] = b[i] + c;
        func();
    });
}

Page 36: Parallelism in the Standard C++: What to Expect in C++ 17

Vector Loop with Parallel STL

void f(float* a, float* b)
{
    integer_iterator begin {0}; // almost, see N3976
    integer_iterator end {b-a};

    std::for_each( std::par_vec, begin, end, [&](int i)
    {
        a[i] = b[i] + c;
        func();
    });
}
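The integer_iterator on the slide is only approximate. A minimal sketch of the idea, assuming a hand-written counting iterator and the ordinary sequential std::for_each (no execution policy), could look like this. The names counting_iterator and add_constant are hypothetical, and the element count n is passed explicitly.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>

// Hypothetical counting iterator: dereferencing yields the current index.
// It models only an input iterator, which is enough for sequential for_each.
struct counting_iterator {
    using iterator_category = std::input_iterator_tag;
    using value_type = int;
    using difference_type = std::ptrdiff_t;
    using pointer = const int*;
    using reference = int;

    int value;

    int operator*() const { return value; }
    counting_iterator& operator++() { ++value; return *this; }
    counting_iterator operator++(int) {
        counting_iterator old = *this;
        ++value;
        return old;
    }
    bool operator==(const counting_iterator& o) const { return value == o.value; }
    bool operator!=(const counting_iterator& o) const { return value != o.value; }
};

// Index-based loop body expressed through an element-based algorithm.
void add_constant(float* a, const float* b, float c, int n) {
    counting_iterator begin{0};
    counting_iterator end{n};
    std::for_each(begin, end, [&](int i) { a[i] = b[i] + c; });
}
```

The lambda receives the index rather than an element, so the same body can address a and b at position i, which is what the vector loop on the slide needs.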

Page 37: Parallelism in the Standard C++: What to Expect in C++ 17

Parallelization vs. Vectorization

Parallelization
    Threads
    Stack
    Good for divergent code
    Relatively heavy-weight

Vectorization
    Vector Lanes
    No stack
    Lock-step execution
    Very light-weight

Page 38: Parallelism in the Standard C++: What to Expect in C++ 17

When To Vectorize

std::par
    No race conditions
    No aliasing

std::par_vec
    Same as std::par, plus:
    No Exceptions
    No Locks
    No/Little Divergence

Page 39: Parallelism in the Standard C++: What to Expect in C++ 17

References

N3991: Task Region

N3872: A Primer on Scheduling Fork-Join Parallelism with Work Stealing

N3724: A Parallel Algorithms Library

N3989: Working Draft, Technical Specification for C++ Extensions for Parallelism

N3976: Multidimensional bounds, index and array_view

parallelstl.codeplex.com
