embedded multicore building blocks (emb²) · p. sun, s. chandrasekaran, s. zhu, and b. chapman....

30
Unrestricted © Siemens AG 2016. All rights reserved Embedded Multicore Building Blocks (EMB²) Dr. Tobias Schüle Siemens Corporate Technology

Upload: others

Post on 08-Aug-2020

5 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Embedded Multicore Building Blocks (EMB²) · P. Sun, S. Chandrasekaran, S. Zhu, and B. Chapman. Deploying OpenMP Task Parallelism on Multicore Embedded Systems with MCA Task APIs

Unrestricted © Siemens AG 2016. All rights reserved

Embedded Multicore Building Blocks (EMB²)Dr. Tobias Schüle

Siemens Corporate Technology

Page 2: Embedded Multicore Building Blocks (EMB²) · P. Sun, S. Chandrasekaran, S. Zhu, and B. Chapman. Deploying OpenMP Task Parallelism on Multicore Embedded Systems with MCA Task APIs

Page 2 Corporate Technology Unrestricted © Siemens AG 2016. All rights reserved

Introduction

Sequential programming is easy (sometimes) …

#define SIZE 1000

main() {

double a[SIZE], b[SIZE];

// Compute a and b ...

double sum = 0.0;

for(int i = 0; i < SIZE; i++)

sum += a[i] * b[i];

// Use sum ...

}

Dot product (sequential)

Page 3: Embedded Multicore Building Blocks (EMB²) · P. Sun, S. Chandrasekaran, S. Zhu, and B. Chapman. Deploying OpenMP Task Parallelism on Multicore Embedded Systems with MCA Task APIs

Page 3 Corporate Technology Unrestricted © Siemens AG 2016. All rights reserved

Introduction

… but multithreaded programming is tedious!

#include <iostream>#include <pthread.h>

#define THREADS 4#define SIZE 1000

using namespace std;

double a[SIZE], b[SIZE], sum;

pthread_mutex_t mutex_sum;

void *dotprod(void *arg) {int my_id = (int)arg;int my_first = my_id * SIZE/THREADS;int my_last = (my_id + 1) * SIZE/THREADS;

double partial_sum = 0;for(int i = my_first; i < my_last && i < SIZE; i++)

partial_sum += a[i] * b[i];

pthread_mutex_lock(&mutex_sum);sum += partial_sum;pthread_mutex_unlock(&mutex_sum);

pthread_exit((void*)0);}

int main(int argc, char *argv[]) {// Compute a and b ...

pthread_attr_t attr;pthread_t threads[THREADS];

pthread_mutex_init(&mutex_sum, NULL);pthread_attr_init(&attr);pthread_attr_setdetachstate(&attr,

PTHREAD_CREATE_JOINABLE);

sum = 0;for(int i = 0; i < THREADS; i++)

pthread_create(&threads[i], &attr, dotprod,(void*)i);

pthread_attr_destroy(&attr);

int status;for(int i = 0; i < THREADS; i++)

pthread_join(threads[i], (void**)&status);

// Use sum ...

pthread_mutex_destroy(&mutex_sum);pthread_exit(NULL);

}

Dot product (POSIX threads)

Barbara Chapman, Gabriele Jost, Ruud van der Pas. Using OpenMP: Portable Shared Memory Parallel Programming. MIT Press, 2007.

Page 4: Embedded Multicore Building Blocks (EMB²) · P. Sun, S. Chandrasekaran, S. Zhu, and B. Chapman. Deploying OpenMP Task Parallelism on Multicore Embedded Systems with MCA Task APIs

Unrestricted © Siemens AG 2016. All rights reservedPage 4 Corporate Technology

Introduction

Multicore processors are here to stay

High computing power

High energy

efficiency

Low material

costs

1995 2000 2005 2005 2010 2020...

Single-core

processor

f

Core 1

Core 2

f/2

Dual-core processor

Source: Vishwani D. Agrawal

Capacitance = C Voltage = V

Frequency = f Power = CV2f

Capacitance = 2.2C Voltage = 0.6V

Frequency = 0.5f Power 0.4CV2f

Page 5: Embedded Multicore Building Blocks (EMB²) · P. Sun, S. Chandrasekaran, S. Zhu, and B. Chapman. Deploying OpenMP Task Parallelism on Multicore Embedded Systems with MCA Task APIs

Page 5 Corporate Technology Unrestricted © Siemens AG 2016. All rights reserved

Introduction

“In 2022, multicore will be everywhere.”

H. Alkhatib, P. Faraboschi, E. Frachtenberg, H. Kasahara, D. Lange,

P. Laplante, A. Merchant, D. Milojicic, and K. Schwan. IEEE CS 2022

Report. IEEE Computer Society, 2014.

www.computer.org/cms/Computer.org/ComputingNow/2022Report.pdf

“Multicore has attracted wide attention from the

embedded systems community […].

However, to obtain good multicore performance,

software is key for decomposing an original

sequential program into parallel program parts

and assigning them to processor cores.

So far, such parallelization has been performed by

application programmers, but it is very difficult,

takes a long time, and has a high cost.”

1 2 4 8 16 32

Co

st

Cores

Conventional approach

Ideal case

Page 6: Embedded Multicore Building Blocks (EMB²) · P. Sun, S. Chandrasekaran, S. Zhu, and B. Chapman. Deploying OpenMP Task Parallelism on Multicore Embedded Systems with MCA Task APIs

Page 6 Corporate Technology Unrestricted © Siemens AG 2016. All rights reserved

Introduction

Frameworks and Libraries for Parallel Programming

• Low-power scalable homogeneous and

heterogeneous architectures

• Hard real-time architectures with local

memory and their programming

• …

Top challenges for multicore (IEEE CS 2022)

Parallel Patterns Library (PPL)

Threading Building Blocks (TBB)

Most frameworks for parallel programming are

intended for desktop/server applications and are

not suitable for embedded systems.

Apple’s

Grand Central Dispatch

Page 7: Embedded Multicore Building Blocks (EMB²) · P. Sun, S. Chandrasekaran, S. Zhu, and B. Chapman. Deploying OpenMP Task Parallelism on Multicore Embedded Systems with MCA Task APIs

Page 7 Corporate Technology Unrestricted © Siemens AG 2016. All rights reservedPage 7 Corporate Technology

Embedded Multicore Building Blocks

Overview

Key features:

• Easy parallelization of existing code

• Resource-awareness (memory consumption)

• Real-time capability

• Fine-grained control over core usage (priorities, affinities)

• Support for distributed / heterogeneous systems

• Independence of hardware architecture (x86, ARM, …)

Embedded Multicore Building Blocks (EMB2)

Domain-independent C/C++ library and runtime

platform for embedded multicore systems.

Page 8: Embedded Multicore Building Blocks (EMB²) · P. Sun, S. Chandrasekaran, S. Zhu, and B. Chapman. Deploying OpenMP Task Parallelism on Multicore Embedded Systems with MCA Task APIs

Unrestricted © Siemens AG 2016. All rights reservedPage 8 Corporate Technology

Embedded Multicore Building Blocks

Multicore Task Management API (MTAPI)

MTAPI in a nut shell

Standardized API for task-parallel

programming on a wide range of hardware

architectures

Developed and driven by practitioners of

market-leading companies

Part of Multicore-Association’s ecosystem

(MRAPI, MCAPI, …)

Contributing members:

Working group lead

Tasks

Tasks Queues Heterogeneous Systems

Shared memory

Distributed memory

Different instruction

set architectures

Page 9: Embedded Multicore Building Blocks (EMB²) · P. Sun, S. Chandrasekaran, S. Zhu, and B. Chapman. Deploying OpenMP Task Parallelism on Multicore Embedded Systems with MCA Task APIs

Page 9 Corporate Technology Unrestricted © Siemens AG 2016. All rights reserved

Embedded Multicore Building Blocks

Heterogeneous systems

TI OMAP5430

Xilinx Zynq UltraScale MPSoC

Page 10: Embedded Multicore Building Blocks (EMB²) · P. Sun, S. Chandrasekaran, S. Zhu, and B. Chapman. Deploying OpenMP Task Parallelism on Multicore Embedded Systems with MCA Task APIs

Unrestricted © Siemens AG 2016. All rights reservedPage 10 Corporate Technology

Embedded Multicore Building Blocks

MTAPI for Heterogeneous Systems (1)

Node Node Node Node

tasks

CPU

core

memory

GPU

memory

DSPCPU

core

CPU

core

CPU

core

memory

sched. / lib. OS 1 OS 2

MTAPI runtime system (optionally MCAPI / MRAPI)

MTAPI tasksMTAPI tasks

MTAPI application

MTAPI tasks

Domain

Page 11: Embedded Multicore Building Blocks (EMB²) · P. Sun, S. Chandrasekaran, S. Zhu, and B. Chapman. Deploying OpenMP Task Parallelism on Multicore Embedded Systems with MCA Task APIs

Unrestricted © Siemens AG 2016. All rights reservedPage 11 Corporate Technology

Embedded Multicore Building Blocks

MTAPI for Heterogeneous Systems (2)

Job: A piece of processing implemented by an action. Each job has a unique identifier.

Action: Implementation of a job, may be hardware or software-defined.

Task: Execution of a job resulting in the invocation of an action implementing the job associated

with some data to be processed.

Task Job Action 2

Action 1

Action n

accomplishes

implemented by

(e.g., CPU code)

(e.g., OpenCL code)

MTAPI

Exte

nsio

n A

PI

Plugin 1

Plugin 2

Plugin n

Example: FFT

Page 12: Embedded Multicore Building Blocks (EMB²) · P. Sun, S. Chandrasekaran, S. Zhu, and B. Chapman. Deploying OpenMP Task Parallelism on Multicore Embedded Systems with MCA Task APIs

Unrestricted © Siemens AG 2016. All rights reservedPage 12 Corporate Technology

Embedded Multicore Building Blocks

Components

Operating system / hypervisor

Dataflow

Application

Hardware

Containers

Task management (MTAPI)

Algorithms

EMB2

Base library (abstraction layer)

Page 13: Embedded Multicore Building Blocks (EMB²) · P. Sun, S. Chandrasekaran, S. Zhu, and B. Chapman. Deploying OpenMP Task Parallelism on Multicore Embedded Systems with MCA Task APIs

Unrestricted © Siemens AG 2016. All rights reservedPage 13 Corporate Technology

Embedded Multicore Building Blocks

Algorithms and Task Affinities / Priorities

std::vector<int> v;

// initialize v ...

embb::algorithms::ForEach(v.begin(), v.end(),

[] (int& x) {x *= 2;}

);

Parallel for-each loop

No need to care of

task creation and management

number of processor cores

load balancing and scheduling

// Create execution policy

ExecutionPolicy policy(true, 0);

// Remove worker thread 0 fromaffinity set

policy.RemoveWorker(0);

// Start high priority tasks in parallel on

// specified worker threads (cores)

Invoke([=](){HighPrioFun1();},

[=](){HighPrioFun2();},

policy);

Function invocation

1st argument: affinity set (true = all)

2nd argument: priority (0 = highest)

Example: worker thread (core) 0 is

reserved for special tasks

Pass policy as optional parameter

Page 14: Embedded Multicore Building Blocks (EMB²) · P. Sun, S. Chandrasekaran, S. Zhu, and B. Chapman. Deploying OpenMP Task Parallelism on Multicore Embedded Systems with MCA Task APIs

Unrestricted © Siemens AG 2016. All rights reservedPage 14 Corporate Technology

Embedded Multicore Building Blocks

Dot Product Revisited (1)

#define SIZE 1000

main() {

double a[SIZE], b[SIZE];

// Compute a and b ...

double sum = 0.0;

for(int i = 0; i < SIZE; i++)

sum += a[i] * b[i];

// Use sum ...

}

Dot product (sequential)

E = mb2

Page 15: Embedded Multicore Building Blocks (EMB²) · P. Sun, S. Chandrasekaran, S. Zhu, and B. Chapman. Deploying OpenMP Task Parallelism on Multicore Embedded Systems with MCA Task APIs

Unrestricted © Siemens AG 2016. All rights reservedPage 15 Corporate Technology

Embedded Multicore Building Blocks

Dot Product Revisited (2)

#define SIZE 1000

main() {

double a[SIZE], b[SIZE];

// Compute a and b ...

double sum =

);

// Use sum ...

}

Dot product (EMB2)

[] (const ZipPair<double&, double&>& p) {

return p.First() * p.Second();

}

Reduce( Recipe (parallel algorithm)

Zip(&a[0], &b[0]), Zip(&a[SIZE], &b[SIZE]), 1. Input sequence

0.0, 2. Neutral element

std::plus<double>(), 3. Reduction op.

4. Transformation fn.

Ingredients

No need to care of

task creation and management

number of processor cores

load balancing and scheduling

Page 16: Embedded Multicore Building Blocks (EMB²) · P. Sun, S. Chandrasekaran, S. Zhu, and B. Chapman. Deploying OpenMP Task Parallelism on Multicore Embedded Systems with MCA Task APIs

Unrestricted © Siemens AG 2016. All rights reservedPage 16 Corporate Technology

Embedded Multicore Building Blocks

Task Affinities and Priorities

// Create execution policy

ExecutionPolicy policy(true, 0);

// Remove worker thread (core) 0 from

// affinity set

policy.RemoveWorker(0);

// Start high priority tasks in parallel on

// specified worker threads (cores)

Invoke([=](){HighPrioFun1();},

[=](){HighPrioFun2();},

policy);

Function invocation

1st argument: affinity set (true = all)

2nd argument: priority (0 = highest)

Example: worker thread (core) 0 is

reserved for special tasks

Pass policy as optional parameter

Page 17: Embedded Multicore Building Blocks (EMB²) · P. Sun, S. Chandrasekaran, S. Zhu, and B. Chapman. Deploying OpenMP Task Parallelism on Multicore Embedded Systems with MCA Task APIs

Unrestricted © Siemens AG 2016. All rights reservedPage 17 Corporate Technology

Embedded Multicore Building Blocks

Components

Operating system

Dataflow

Application

Hardware

Containers

Task management (MTAPI)

Algorithms

EMB2

Base library (abstraction layer)

Page 18: Embedded Multicore Building Blocks (EMB²) · P. Sun, S. Chandrasekaran, S. Zhu, and B. Chapman. Deploying OpenMP Task Parallelism on Multicore Embedded Systems with MCA Task APIs

Unrestricted © Siemens AG 2016. All rights reservedPage 18 Corporate Technology

Embedded Multicore Building Blocks

Dataflow Framework

Stream processing

Embedded systems frequently process

continuous streams of data such as

sensor and actuator data,

network packets, …

medical images, …

Such applications can be modeled using

dataflow networks and executed in parallel

T F

T F

... ...

MTAPI Task Trees

Page 19: Embedded Multicore Building Blocks (EMB²) · P. Sun, S. Chandrasekaran, S. Zhu, and B. Chapman. Deploying OpenMP Task Parallelism on Multicore Embedded Systems with MCA Task APIs

Unrestricted © Siemens AG 2016. All rights reservedPage 19 Corporate Technology

Embedded Multicore Building Blocks

Components

Operating system

Dataflow

Application

Hardware

Containers

Task management (MTAPI)

Algorithms

EMB2

Base library (abstraction layer)

Page 20: Embedded Multicore Building Blocks (EMB²) · P. Sun, S. Chandrasekaran, S. Zhu, and B. Chapman. Deploying OpenMP Task Parallelism on Multicore Embedded Systems with MCA Task APIs

Unrestricted © Siemens AG 2016. All rights reservedPage 20 Corporate Technology

Embedded Multicore Building Blocks

Container Requirements

1. No race conditions in case of concurrent accesses Thread safety

2. No unpredictable delays in case of contention Progress guarantee

3. No dynamic memory allocation after startup Preallocated memory

P0

Pm

C0

Cn

Queue

ConsumersProducers

Implementation Thread

safety

Progress

guarantee

Preallocated

memory

std::queueQQueue (Qt) -

std::queueQQueue (Qt)

boost::lockfree::queuetbb::concurrent_queue / ? / ?embb::LockFreeMPMCQueueembb::WaitFreeSPSCQueue

+ Mutex

Page 21: Embedded Multicore Building Blocks (EMB²) · P. Sun, S. Chandrasekaran, S. Zhu, and B. Chapman. Deploying OpenMP Task Parallelism on Multicore Embedded Systems with MCA Task APIs

Unrestricted © Siemens AG 2016. All rights reservedPage 21 Corporate Technology

Embedded Multicore Building Blocks

Progress Guarantees

Starvation-free

Lock-free

Clash-free

Wait-free

Obstruction-free

Deadlock-free

At least one thread always

makes progress

All threads always

make progress

M. Herlihy and N. Shavit. “On the nature of progress”. International conference on Principles of Distributed Systems

(OPODIS'11), Springer, 2011.

Page 22: Embedded Multicore Building Blocks (EMB²) · P. Sun, S. Chandrasekaran, S. Zhu, and B. Chapman. Deploying OpenMP Task Parallelism on Multicore Embedded Systems with MCA Task APIs

Unrestricted © Siemens AG 2016. All rights reservedPage 22 Corporate Technology

Embedded Multicore Building Blocks

Lock-free / Wait-free Algorithms

Page 23: Embedded Multicore Building Blocks (EMB²) · P. Sun, S. Chandrasekaran, S. Zhu, and B. Chapman. Deploying OpenMP Task Parallelism on Multicore Embedded Systems with MCA Task APIs

Unrestricted © Siemens AG 2016. All rights reservedPage 23 Corporate TechnologyPage 23 Corporate Technology

Progress guarantees

With wait-freedom, the completion of an operation is guaranteed

to occur in a finite number of steps. Lock-freedom guarantees the

overall progress of a system.

Deadlock absence

Wait-free and lock-free data structures are immune to deadlock

conditions.

Signal safety

Coherency in the context of asynchronous interruptions is

guaranteed.

Termination safety

Linearizable operations may be aborted at any time without

sacrificing the overall availability of a system.

Priority inversion avoidance

Wait-free algorithms cannot prevent high priority threads from

making progress.

Embedded Multicore Building Blocks

Properties of Lock-free / Wait-free Algorithms

Page 24: Embedded Multicore Building Blocks (EMB²) · P. Sun, S. Chandrasekaran, S. Zhu, and B. Chapman. Deploying OpenMP Task Parallelism on Multicore Embedded Systems with MCA Task APIs

Unrestricted © Siemens AG 2016. All rights reservedPage 24 Corporate Technology

Embedded Multicore Building Blocks

Tracing/Profiling MTAPI Applications with Score-P

Open source community

Linux & Windows

Platform independent (x86,

ARM, and PPC)

Heterogeneous system

support (e.g., Intel Phi,

CUDA)

Open formats enabling

interoperability and custom

analysis types

Extremely scalable

www.score-p.org

Page 25: Embedded Multicore Building Blocks (EMB²) · P. Sun, S. Chandrasekaran, S. Zhu, and B. Chapman. Deploying OpenMP Task Parallelism on Multicore Embedded Systems with MCA Task APIs

Unrestricted © Siemens AG 2016. All rights reservedPage 25 Corporate Technology

Embedded Multicore Building Blocks

Visualization and Interpretation

MTAPI functions

Adaptation of well-established and widely used Score-P

framework to MTAPI Benefit from experience in HPC

Cooperation with Jülich Supercomputing Centre (JSC)

Page 26: Embedded Multicore Building Blocks (EMB²) · P. Sun, S. Chandrasekaran, S. Zhu, and B. Chapman. Deploying OpenMP Task Parallelism on Multicore Embedded Systems with MCA Task APIs

Page 26 Corporate Technology Unrestricted © Siemens AG 2016. All rights reserved

Embedded Multicore Building Blocks

Hello World!

https://github.com/siemens/embb/

BSD 2-clause license

Feedback and contributions

are very welcome!

Page 27: Embedded Multicore Building Blocks (EMB²) · P. Sun, S. Chandrasekaran, S. Zhu, and B. Chapman. Deploying OpenMP Task Parallelism on Multicore Embedded Systems with MCA Task APIs

Page 27 Corporate Technology Unrestricted © Siemens AG 2016. All rights reserved

Embedded Multicore Building Blocks

Code Quality

Formal verification (partially)

Static source code analysis

Rule checker (cppcheck)

Continuous integration

Unit tests (> 90% statement coverage)

Coding guidelines (Google’s cpplint)

Zero compiler warnings

Workflow-driven design/code reviews

Agile

development

process

Linearizability checker

Page 28: Embedded Multicore Building Blocks (EMB²) · P. Sun, S. Chandrasekaran, S. Zhu, and B. Chapman. Deploying OpenMP Task Parallelism on Multicore Embedded Systems with MCA Task APIs

Page 28 Corporate Technology Unrestricted © Siemens AG 2016. All rights reserved

Embedded Multicore Building Blocks

Performance Comparison

Measurements from University of Houston show efficiency of EMB² (green bars):

P. Sun, S. Chandrasekaran, S. Zhu, and B. Chapman. Deploying OpenMP Task Parallelism on Multicore Embedded Systems with MCA Task APIs.

International Conference on High Performance Computing and Communications (HPCC), IEEE, 2016.

Page 29: Embedded Multicore Building Blocks (EMB²) · P. Sun, S. Chandrasekaran, S. Zhu, and B. Chapman. Deploying OpenMP Task Parallelism on Multicore Embedded Systems with MCA Task APIs

Page 29 Corporate Technology Unrestricted © Siemens AG 2016. All rights reserved

Embedded Multicore Building Blocks

Summary

Efficient software development

High performance and scalability

Improved code quality(prevention of concurrency bugs)

Suitable for embedded systems(memory and real-time constraints)

Page 30: Embedded Multicore Building Blocks (EMB²) · P. Sun, S. Chandrasekaran, S. Zhu, and B. Chapman. Deploying OpenMP Task Parallelism on Multicore Embedded Systems with MCA Task APIs

Page 30 Corporate Technology Unrestricted © Siemens AG 2016. All rights reserved

Contact

Dr. Tobias Schüle

Siemens AG

Corporate Technology

CT RDA ITP SES-DE

Otto-Hahn-Ring 6

81739 München

Germany

Email: [email protected]