Iuliana Bacivarov
Computer Engineering and Networks Laboratory, ETH Zürich
1st International Workshop on Multicore Application Debugging
(MAD) 2013, 14-15 November 2013, München, Germany
How Model-Based Design Simplifies
the Debugging of Many-Core Systems
Acknowledgements
team
Devesh Chokshi, Wolfgang Haid, Kai Huang, Shin-Haeng Kang, Pratyush Kumar, Devendra Rai, Lars Schor, Hoeseok Yang, Prof. Lothar Thiele
projects
EU-SHAPES, EU-PREDATOR, EU-COMBEST, EU-ARTISTDESIGN, EU-PRO3D, EU-EURETILE, nano-tera Extreme, nano-tera UltrasoundToGo
Intel SCC (Single-chip Cloud Computer)
11/15/2013 Iuliana Bacivarov, Computer Engineering Group, ETH Zurich 2
Current Embedded Systems are Complex
Intel SCC (48 cores), Intel Xeon Phi (64 cores)
parallel applications
many-tile/many-core hardware
dynamic workloads
performance, real-time, power, and temperature
high-temperature fault
dynamic mapping
Debugging is Hard!
Debugging
“Debugging is a methodical process of finding and reducing the number of bugs, or defects, in a computer program or a piece of electronic hardware, thus making it behave as expected.”
---- Wikipedia
“Debugging tends to be harder when various subsystems are tightly coupled, as changes in one may cause bugs to emerge in another.”
---- Wikipedia
Problems with Parallel Programming
What it Feels Like to Use the synchronized Keyword in Java
Image “borrowed” from an Iomega advertisement for Y2K software and disk drives, Scientific American, September 1999.
Ed Lee, The Future of Embedded Software, 2006
http://ptolemy.eecs.berkeley.edu/presentations/06/
Problems with Parallel Programming
Ed Lee, The Future of Embedded Software, 2006
http://ptolemy.eecs.berkeley.edu/presentations/06/
Threads are wildly nondeterministic
The programmer’s job is to prune away the non-determinism by imposing constraints on execution order (e.g., mutexes)
Nontrivial software written with threads, semaphores, and mutexes is incomprehensible to humans
… and doesn’t deliver a rigorous, analyzable, and understandable model of concurrency.
“Humans are quickly overwhelmed by concurrency and find it much more difficult to reason about concurrent than sequential code. Even careful people miss possible interleavings among even simple collections of partially ordered operations.” H. Sutter and J. Larus. Software and the concurrency revolution. ACM Queue, 3(7), 2005.
Key Concepts in Model-Based Design
Models are composed to form designs.
Models evolve during design.
Specifications are executable models.
Deployed code is generated from models.
Modeling languages have formal semantics.
Modeling languages themselves are modeled.
For general-purpose software, this is about object-oriented design.
For embedded systems, this is about time and concurrency.
Ed Lee, The Future of Embedded Software, 2006
http://ptolemy.eecs.berkeley.edu/presentations/06/
The Good News
Model-Based Design enables a ‘correct by design’ execution
[Figure: process network p1, p2, p3 executed on a grid of tiles with routers (R) and memory controllers]
The Good News
Distributed Application Layer: model-based design & separation of concerns
[Figure: design flow with the stages application, architecture, design space exploration, analysis, mapping, software synthesis, functional simulation, and execution; process network p1, p2, p3 and a grid of tiles with routers (R) and memory controllers]
Application Specification: Kahn Process Network
Proposed by Kahn in 1974 as a general-purpose scheme for parallel programming
READ: destructive and blocking
WRITE: non-blocking
FIFO: infinite size
Unique attribute: determinate
Deterministic model of computation
Focus on causality, not order (implementation independent)
Functional behavior is independent of timing (execution time, communication time, scheduling)
Data-driven scheduling: processes run whenever they are ready
[Figure: process network p1, p2, p3]
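The channel semantics listed on this slide can be illustrated with a minimal sketch. It assumes one thread per process and a mutex-protected queue; the names `Channel` and `run_pipeline` are hypothetical and are not part of DAL. Because every read blocks until its token arrives, the result of the p1 -> p2 -> p3 pipeline does not depend on how the threads are scheduled.

```cpp
#include <cassert>
#include <condition_variable>
#include <deque>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical KPN channel (not DAL's API): unbounded FIFO with a
// non-blocking write and a blocking, destructive read.
template <typename T>
class Channel {
public:
    // WRITE never blocks because the FIFO is conceptually infinite.
    void write(T token) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            fifo_.push_back(std::move(token));
        }
        not_empty_.notify_one();
    }

    // READ blocks until a token is available, then removes it.
    T read() {
        std::unique_lock<std::mutex> lock(mutex_);
        not_empty_.wait(lock, [this] { return !fifo_.empty(); });
        assert(!fifo_.empty());
        T token = std::move(fifo_.front());
        fifo_.pop_front();
        return token;
    }

private:
    std::mutex mutex_;
    std::condition_variable not_empty_;
    std::deque<T> fifo_;
};

// Pipeline p1 -> p2 -> p3 with one thread per process.
// p1 produces 0..n-1, p2 doubles each token, p3 adds one and collects.
std::vector<int> run_pipeline(int n) {
    Channel<int> c12, c23;
    std::vector<int> out;
    std::thread p1([&] { for (int i = 0; i < n; ++i) c12.write(i); });
    std::thread p2([&] { for (int i = 0; i < n; ++i) c23.write(c12.read() * 2); });
    std::thread p3([&] { for (int i = 0; i < n; ++i) out.push_back(c23.read() + 1); });
    p1.join();
    p2.join();
    p3.join();
    return out;
}
```

Calling `run_pipeline(4)` yields 1, 3, 5, 7 on every run, which is the determinacy property the slide refers to. A real implementation has to bound the FIFOs; the unbounded queue here only mirrors the model.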
Application Specification: MPEG2 KPN
Kahn process network
Unique attribute: determinate
[Figure: MPEG2 process network with the processes TG, MERGE, DEMUX, IQ, ZZ, iDCT, and LIBU]
Execution Scenarios Specification
Application / run-time environment can request a scenario change
Scenarios and application states:
stand-by: R: -
music: R: MP3
video: R: MPEG-2, AAC
phone: R: phone
phone and music: R: phone, H: MP3
phone and video: R: phone, MPEG-2, H: AAC
Each application can: START, STOP, PAUSE, RESUME
[Figure: scenario state machine; example application process network with TG, MERGE, DEMUX, IQ, ZZ, iDCT, and LIBU]
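A scenario change can be turned into per-application commands by comparing the application states of the two scenarios. The sketch below is only an illustration: the types and function names are hypothetical, and it assumes that "R" marks a running application and "H" a paused one.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

enum class AppState { Stopped, Running, Paused };

// A scenario lists the applications that are not stopped (hypothetical type).
typedef std::map<std::string, AppState> Scenario;

AppState state_in(const Scenario& s, const std::string& app) {
    Scenario::const_iterator it = s.find(app);
    return it == s.end() ? AppState::Stopped : it->second;
}

// Derive the START/STOP/PAUSE/RESUME commands for a scenario change.
std::vector<std::string> commands_for(const Scenario& from, const Scenario& to) {
    std::set<std::string> apps;  // union of application names, sorted
    for (const auto& kv : from) apps.insert(kv.first);
    for (const auto& kv : to) apps.insert(kv.first);

    std::vector<std::string> cmds;
    for (const std::string& app : apps) {
        const AppState before = state_in(from, app);
        const AppState after = state_in(to, app);
        if (before == after) continue;
        if (after == AppState::Stopped) {
            cmds.push_back("STOP " + app);
        } else if (after == AppState::Paused) {
            // Sketch restriction: only a running application can be paused.
            assert(before == AppState::Running);
            cmds.push_back("PAUSE " + app);
        } else if (before == AppState::Paused) {
            cmds.push_back("RESUME " + app);
        } else {
            cmds.push_back("START " + app);
        }
    }
    return cmds;
}

// Two scenarios from the slide, reading "H" as paused (an assumption).
Scenario music() {
    Scenario s;
    s["MP3"] = AppState::Running;
    return s;
}

Scenario phone_and_music() {
    Scenario s;
    s["phone"] = AppState::Running;
    s["MP3"] = AppState::Paused;
    return s;
}
```

Going from `music()` to `phone_and_music()` produces "PAUSE MP3" and "START phone"; the reverse change produces "RESUME MP3" and "STOP phone".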
Architecture Specification
Hierarchical architecture, e.g., Intel SCC
[Figure: four rows of six tiles, each tile attached to a router (R), with four memory controllers]
Application-to-Architecture Mapping
[Figure: for scenario1 and scenario2, the processes of the running applications are mapped onto the cores c1, c2, c3, c4; each scenario has its own mapping]
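One way to picture a per-scenario mapping is a table from process name to core, stored once per scenario. The sketch below is an assumption about the data layout, not DAL's mapping format; the process names A, B, C and the core numbers are illustrative.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical types: a mapping binds each process to a core id,
// and every scenario carries its own mapping.
typedef std::map<std::string, int> Mapping;
typedef std::map<std::string, Mapping> ScenarioMappings;

// Group the processes of one scenario by the core they run on.
std::map<int, std::vector<std::string>> processes_per_core(const Mapping& m) {
    std::map<int, std::vector<std::string>> cores;
    for (const auto& kv : m) {
        assert(kv.second >= 1);  // core ids are 1-based in this sketch
        cores[kv.second].push_back(kv.first);
    }
    return cores;
}

// Illustrative data: the same three processes, mapped differently
// in two scenarios.
ScenarioMappings example_mappings() {
    ScenarioMappings s;
    s["scenario1"]["A"] = 1;
    s["scenario1"]["B"] = 2;
    s["scenario1"]["C"] = 2;
    s["scenario2"]["A"] = 1;
    s["scenario2"]["B"] = 3;
    s["scenario2"]["C"] = 4;
    return s;
}
```

The per-core grouping is the view a code generator would need when it emits one program per core.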
Hierarchical Mapping Optimization – via Problem Decomposition
state-based decomposition
architecture-based decomposition
[Figure: state machine over scenario1 to scenario4 with transitions e1 to e8; each scenario lists application states such as P1: running, P1: paused, P2: running, P3: running]
[ref] S. Kang, H. Yang, L. Schor, I. Bacivarov, S. Ha and L. Thiele, Multi-Objective Mapping Optimization via Problem
Decomposition for Many-Core Systems, ESTIMedia, Tampere, Finland, Oct. 2012
[ref] L. Schor, I. Bacivarov, D. Rai, H. Yang, S. Kang and L. Thiele, Scenario-Based Design Flow for Mapping Streaming
Applications onto On-Chip Many-Core Systems, CASES, Tampere, Finland, Oct. 2012
From Specification to Analysis and Simulations
automatic generation of different system ‘views’
analysis
functional simulation
cycle-/instruction-accurate simulation
execution on hardware
[Figure: from the system specification to an MPA analysis model, a functional simulation, and simulation/execution with processes v1, v3 on core 1 and v4, v2 on core 2; each core runs a Linux kernel with multi-processing, and the cores are connected by an interconnect]
[ref] K. Huang, W. Haid, I. Bacivarov, M. Keller, and L. Thiele. Embedding Formal Performance Analysis into the Design Cycle of MPSoCs for Real-time Multimedia Applications. ACM TECS, Vol. 11, No. 1, pages 8:1-8:23, March, 2012.
[ref] L. Schor, I. Bacivarov, D. Rai, H. Yang, S. Kang and L. Thiele, Scenario-Based Design Flow for Mapping Streaming Applications onto On-Chip Many-Core Systems, CASES, Tampere, Finland, Oct. 2012
Runtime System
provides an implementation of the programming interface
inter-process communication (distributed memory)
multi-processing mechanisms
services to manage processes and channels at runtime
[Figure: producer and consumer on core 1, worker A and worker B on core 2; each core runs a Linux kernel with multi-processing, and the cores are connected by the network-on-chip]
[ref] L. Schor, D. Rai, H. Yang, I. Bacivarov, and L. Thiele, Reliable and Efficient Execution of Multiple Streaming
Applications on Intel's SCC Processor. Runtime and Operating Systems for the Many-core Era (ROME) August 2013.
Inter-Process Communication
shared vs. distributed memory
on Intel SCC, RCKMPI library for inter-core communication
one listener thread per core for all incoming traffic
virtual buffer at sender to limit traffic
[Figure: producer and worker on core 1 (memory 1), LISTENER and consumer on core 2 (memory 2); the cores communicate over the network-on-chip via RCKMPI]
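The slide does not spell out how the virtual buffer works. One plausible reading is credit-based flow control: the sender keeps a local count of the slots occupied in the remote FIFO and sends only while a slot is free. The sketch below illustrates that reading in a single thread; `VirtualBuffer` and `simulate` are hypothetical names, and no RCKMPI call is used.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>

// Hypothetical sender-side view of a remote FIFO (not RCKMPI's API).
class VirtualBuffer {
public:
    explicit VirtualBuffer(std::size_t capacity)
        : capacity_(capacity), used_(0) {}

    // Called before sending a token; false means the sender must wait.
    bool try_reserve() {
        if (used_ == capacity_) return false;
        ++used_;
        return true;
    }

    // Called when the receiver acknowledges that a token was consumed.
    void release() {
        assert(used_ > 0);
        --used_;
    }

    std::size_t used() const { return used_; }

private:
    std::size_t capacity_;
    std::size_t used_;
};

// Single-threaded simulation: the sender pushes `tokens` tokens, and the
// consumer drains one token each time the sender is refused.
// Returns how often the sender was refused.
std::size_t simulate(std::size_t capacity, std::size_t tokens) {
    assert(capacity > 0);
    VirtualBuffer vb(capacity);
    std::deque<std::size_t> remote_fifo;
    std::size_t refused = 0;
    std::size_t next = 0;
    while (next < tokens) {
        if (vb.try_reserve()) {
            remote_fifo.push_back(next++);  // message goes over the network
        } else {
            ++refused;
            remote_fifo.pop_front();        // consumer reads one token ...
            vb.release();                   // ... and the ack frees one slot
        }
    }
    return refused;
}
```

With a capacity of 2 and 5 tokens, the sender is refused 3 times and the remote FIFO never holds more than 2 tokens, so the traffic into the receiving core stays bounded.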
Multi Processing
on top of Linux kernel – processes mapped onto POSIX threads
data-driven execution – no global scheduler required
[Figure: producer and consumer run as POSIX threads in the POSIX environment on the Linux kernel of core 1]
    void *producer_thread(void *arg) {
        Process *p = (Process *) arg;
        /* data-driven loop: fire the process until it is stopped */
        while (!p->stopped) {
            p->fire();
        }
        return NULL;  /* a pthread start routine must return a value */
    }
Runtime Manager
specified as a process network
one master process: manages dynamic execution
one slave process per core: manages processes and channels
[Figure: one master (M) and two slaves (S) on core 1, core 2, and core 3, connected by the network-on-chip; producer and consumer processes are deployed in three steps]
1. install processes
2. create FIFO(s)
3. start processes
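The three steps above can be sketched as a master that issues commands to one slave per core. All types here are hypothetical, the master calls the slaves directly instead of sending messages over the network-on-chip, and placing each FIFO on the consumer's core is an assumption.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical channel description.
struct Fifo {
    std::string name;
    std::string producer;
    std::string consumer;
};

// Hypothetical slave: manages the processes and channels of one core.
struct Slave {
    std::set<std::string> installed, fifos, running;

    void install(const std::string& p) { installed.insert(p); }
    void create_fifo(const std::string& f) { fifos.insert(f); }
    void start(const std::string& p) {
        assert(installed.count(p) == 1);  // step 1 must precede step 3
        running.insert(p);
    }
};

// Hypothetical master: drives the slaves through the three steps.
struct Master {
    std::map<int, Slave> slaves;  // core id -> slave

    void deploy(const std::map<std::string, int>& mapping,
                const std::vector<Fifo>& fifos) {
        // 1. install processes
        for (const auto& kv : mapping) slaves[kv.second].install(kv.first);
        // 2. create FIFO(s), here on the consumer's core (an assumption)
        for (const Fifo& f : fifos)
            slaves[mapping.at(f.consumer)].create_fifo(f.name);
        // 3. start processes
        for (const auto& kv : mapping) slaves[kv.second].start(kv.first);
    }
};
```

Starting processes only after all FIFOs exist means that no process can fire against a channel that has not been created yet.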
Synthesis Backend
target platforms
functional simulation on Linux
multi-cluster system: each Linux server forms one cluster with multiple cores; inter-cluster communication with MPI
Intel SCC (Single-chip Cloud Computer)
QUonG platform (INFN)
synthesis steps
mapping optimization (Process A --> core 1, Process B --> core 2, Process C --> core 2)
runtime-manager synthesis
process network synthesis (fire(){ read(...); ...})
main (for each core), Makefile, process wrappers
[Figure: process network A, B, C mapped onto cores 1, 2, 3; master (M) and slaves (S); target platforms Intel SCC and APEnet+ with DNP, RISC, DSP, and MEM blocks]
Deployment
DAL is available:
www.tik.ee.ethz.ch/~euretile/dal.php
KPN - deterministic MoC
predictability
safety, dynamism
safe execution
execution, scalability
optimality
complete design flow
easy debugging
[Figure: overview of the design flow with process network p1, p2, p3, the runtime system on core 1 (Linux kernel, multi-processing), the target platforms Intel SCC and APEnet+ (DNP, RISC, DSP, MEM), and a mapping-optimization plot of processor fitness against cluster fitness with coverage of A (A0, A1, A2) and coverage of B (B0, B1, B2)]